Head to head: grok-4.3 vs Llama-4-Maverick-17B-128E-Instruct-FP8

This matchup wasn’t close on execution: one model consistently did the job asked, while the other kept drifting into extra verbiage and looser instruction-following. The difference showed up not in flashy reasoning claims, but in whether the output was precise, disciplined, and actually usable.

By RuntimeWire · Published Jun 21, 2026, 9:04am CT

Comparison of two abstract AI model outputs: one precise, one verbose (Studio still life)

grok-4.3 wins this head-to-head cleanly, 37.0 to 24.0, by doing the unglamorous thing better: following directions exactly. Across all four cited tasks, it was the model that stayed inside the brief, preserved required structure, and avoided the kind of “helpful” sprawl that turns a usable answer into something an editor or engineer has to trim.

The clearest technical edge showed up in python-log-redactor. Both models were broadly correct, but grok-4.3 had the stronger pattern design: its token/api_key handling stops at any whitespace, while Llama-4-Maverick-17B-128E-Instruct-FP8 only stops at a literal space. That matters. In real logs, tabs and irregular whitespace happen, and B’s approach can overmatch. A was also more concise while better preserving the original line.

On instruction discipline, the gap widened. In meeting-notes-summary, grok-4.3 delivered exactly what was asked: a 2-sentence summary plus a JSON object with only the specified keys. B was “mostly correct,” but mostly correct is not the standard when the format is explicit; it added commentary and an alternative JSON structure with disallowed keys. The same pattern repeated in messy-csv-to-json, where A returned the converted JSON directly and correctly, while B padded the answer with explanation and code despite being told, in effect, to just output the data.

Even in softer business writing, grok-4.3 was sharper. In vendor-delay-update, it was calmer, tighter, and more action-oriented while still covering all required facts within the requested length. Llama’s version wasn’t missing the essentials, but it was wordier and slipped into a slightly motivational tone that didn’t fit the brief. That’s the story of this matchup: B often knew roughly what to do, but A knew how to deliver it in the form actually requested.

Final call: grok-4.3 is the better working model here, decisively. It won on robustness in the code task and on obedience to format in the structured-output tasks, while also writing cleaner operational prose. Llama-4-Maverick-17B-128E-Instruct-FP8 wasn’t disastrous; it was just consistently less disciplined, and that’s exactly why it lost.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 37.0 to Llama-4-Maverick-17B-128E-Instruct-FP8's 24.0.

1. python-log-redactor

Language: Python 3. Write a function redact_log(line: str) -> str that masks sensitive values in application log lines before they are shipped to a vendor. Replace any email address with [EMAIL], any IPv4 address with [IP], and any token value in patterns token=... or api_key=... with [SECRET] while preserving the rest of the line exactly. Matching should be case-insensitive for the keys token and api_key. Do not alter other fields. Example: 2026-02-14 user=maya.chen@brindle.io ip=172.16.4.22 token=sk-live-9fzA note=retry -> 2026-02-14 user=[EMAIL] ip=[IP] token=[SECRET] note=retry. Return code only.

Winner: grok-4.3 — A better preserves the original line while redacting secrets: its token/api_key pattern stops at any whitespace, whereas B only stops at a literal space and can overmatch across tabs or other whitespace. Both are generally correct, but A is slightly more robust and concise for the stated requirements.

2. vendor-delay-update

Write a Slack update to the operations team about a shipment delay. Facts to include: 280 custom badge reels for the HarborNorth conference were due Thursday; supplier Lattice Print called at 7:40 a.m. saying their laminator failed overnight; earliest ship date is now Monday 9:00 a.m.; we have enough old stock for only 90 attendees; Priya is checking a same-day local fallback with Mint Carton in Tacoma; no customer-facing announcement yet. Audience: internal ops team. Tone: calm, concise, action-oriented. Length: 90-130 words.

Winner: grok-4.3 — A is calmer, more concise, and more action-oriented while covering all required facts within the requested length. B includes all key details but is wordier, uses less concise phrasing, and adds a slightly motivational tone that is less aligned with the brief.

3. meeting-notes-summary

Read the meeting notes and produce: (1) a 2-sentence summary, and (2) a JSON object with keys decision, owner, deadline, and risks. Meeting notes: - Weekly checkout reliability sync, 11 Apr, attendees: Nia, Pavel, Jorge, Emi. - Incident recap: on Tuesday from 14:12-14:45 UTC, card payments from Banco Mistral were rejected because the gateway cert chain on pay-gw-2 was outdated. - Customer impact: 183 failed orders, 41 later recovered manually by support. - Decision: rotate certs on both gateways monthly instead of quarterly, and add a 7-day expiry alert in Grafana. - Pavel will implement the alerting this sprint. Jorge will update the runbook by 18 Apr. - Nia flagged one risk: the backup gateway still uses a hard-coded trust store in the legacy container, so monthly rotation may fail until that is patched. - Emi asked support to draft a macro for affected merchants.

Winner: grok-4.3 — A follows the requested format exactly with a 2-sentence summary and a JSON object using the specified keys only, while accurately capturing the main decisions, owners, deadline, and risk. B is mostly correct but adds extra commentary and an alternative JSON structure with disallowed keys, so it adheres less well to the instructions.

4. messy-csv-to-json

Convert the messy inventory data below into valid JSON as an array of objects with EXACTLY these keys in this order: sku (string), name (string), qty (integer), unit_price (number), discontinued (boolean). Rules: trim whitespace; qty may have leading zeros; parse prices as numbers without currency symbols; treat Y as true and N as false; skip the header row; preserve the original row order. Data: SKU | Item | Qty | Price | Disc QX-14 | Pocket torch | 0007 | $19.95 | N LM-2 | Field notebook | 12 | 4.5 | n ZZ-900|Cable ties (50) | 003 | $7.00 | Y AA-77 | Mini first-aid kit | 0 | $12 | N

Winner: grok-4.3 — A directly provides valid JSON in the requested format with the correct keys, order, values, and row order. B includes substantial extra explanatory text and code instead of only the converted JSON, so it does not adhere to the instruction despite containing a correct JSON example within it.

See every prompt and the full side-by-side outputs in the interactive Head-to-Head.

How they were tested

1. python-log-redactor

2. vendor-delay-update

3. meeting-notes-summary

4. messy-csv-to-json

Reader comments