Head to head: grok-4.3 vs Llama-4-Scout-17B-16E-Instruct

One model was consistently cleaner, sharper, and better at following instructions; the other kept leaving small but consequential mistakes on the table. Across coding, transformation, summarization, and business writing, this matchup wasn’t especially close.


A lone observer scrutinizing the juxtaposed outputs of two AI models (Oil painting in the manner of Edward Hopper)

grok-4.3 wins this one on discipline as much as capability. The 37.0 to 22.0 aggregate isn’t just a scoring gap; it reflects a pattern. grok-4.3 reliably did what was asked, while Llama-4-Scout-17B-16E-Instruct too often drifted into "close enough" territory. In head-to-head evals, that’s how you lose.

The clearest separation showed up in the structured-output and code tasks. On python-shift-merge, grok-4.3 handled the actual requirements: merge overlapping and back-to-back intervals per employee, return a new sorted list, and leave the input untouched. Llama-4-Scout-17B-16E-Instruct mutated the input, padded the answer with example/print code despite a code-only brief, and introduced a real logic bug: its timedelta handling merged intervals separated by a one-minute gap. On messy-orders-to-json, the story was similar. grok-4.3 returned valid JSON only, correctly normalized and sorted; Llama-4-Scout-17B-16E-Instruct wrapped the right answer in extra explanation and code, which is still a format failure.

The writing tasks weren’t a blowout, but grok-4.3 was still plainly better. In vendor-delay-email, grok-4.3 struck the right retailer-facing tone: candid without wobbling, specific about the QA issue and revised window, clear that pricing and allocation were preserved, and explicit in asking partners not to publish the old date. Llama-4-Scout-17B-16E-Instruct was competent, but its email read more like a template and softened into vaguer reassurance about business impact.

On meeting-notes-summary, grok-4.3 again showed better editorial judgment. It followed the requested format, captured the legal approval, invite cap, and explicit risks, and named a plausible owner based on the notes. Llama-4-Scout-17B-16E-Instruct was mostly serviceable, but adding commentary outside the requested output and failing to assign an owner when Ezra was the obvious candidate is exactly the kind of miss that makes summaries less useful in practice.

Final call: grok-4.3 is the more dependable model and the easy winner here. Llama-4-Scout-17B-16E-Instruct had moments of competence, but too many avoidable instruction-following and correctness errors turned this into a comfortable victory rather than a real contest.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 37.0 to Llama-4-Scout-17B-16E-Instruct's 22.0.

1. python-shift-merge

Practical coding — Python. Return code only. Write a function merge_shift_intervals(rows) that takes a list of dicts like {"employee":"Mina","start":"2025-04-12 22:30","end":"2025-04-13 01:15"} and returns a new list where overlapping or back-to-back intervals for the same employee are merged. Timestamps are in %Y-%m-%d %H:%M 24-hour format and already in the same timezone. Preserve one output dict per merged interval with the same keys. Sort output by employee, then start. Example: Mina 22:30-23:00 and Mina 23:00-01:15 should merge; Mina and Jorge should never merge with each other.

Winner: grok-4.3. It correctly merges overlapping and back-to-back intervals per employee, returns a new sorted list, and avoids mutating the input. Llama-4-Scout-17B-16E-Instruct mutates the input list, includes extra example/print code despite the code-only task, and incorrectly merges intervals with a 1-minute gap due to its timedelta logic.
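For readers who want to see what the brief is asking for, here is a minimal sketch of one way to meet it. This is our own illustration, not either model's submission. The constant name FMT and the sort-then-sweep approach are assumptions on our part; the function name and dict keys come from the prompt.

```python
from datetime import datetime

# Timestamp format given in the task prompt.
FMT = "%Y-%m-%d %H:%M"


def merge_shift_intervals(rows):
    """Merge overlapping or back-to-back intervals per employee.

    Returns a new list of new dicts; the input list and its dicts
    are only read, never modified.
    """
    # Parse into fresh tuples and sort by employee, then start, then end.
    parsed = sorted(
        (
            row["employee"],
            datetime.strptime(row["start"], FMT),
            datetime.strptime(row["end"], FMT),
        )
        for row in rows
    )

    merged = []  # each entry is [employee, start, end]
    for employee, start, end in parsed:
        last = merged[-1] if merged else None
        # "<=" covers both overlap and exact back-to-back (start == previous end).
        # A gap of even one minute fails this test, so it is not merged.
        if last is not None and last[0] == employee and start <= last[2]:
            last[2] = max(last[2], end)
        else:
            merged.append([employee, start, end])

    # Build new output dicts with the same keys as the input.
    return [
        {"employee": e, "start": s.strftime(FMT), "end": en.strftime(FMT)}
        for e, s, en in merged
    ]
```

Comparing datetimes directly with `start <= previous end`, with no added tolerance, is what keeps a one-minute gap from being merged. Because the rows are sorted by employee first, two different employees can only ever be adjacent at a boundary, where the name check stops the merge.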

2. vendor-delay-email

Professional writing — Draft a workplace email to retail partners. Situation: your company, Northline Audio, is delaying shipment of the new Ravelin Mini DAC because a capacitor batch failed thermal testing. Original ship date was 14 August; new estimated ship window is 26-30 August. Partners with confirmed POs will keep their pricing and allocation. Ask them not to publish the old date. Audience: small independent hi-fi retailers. Tone: candid, steady, professional. Length: 140-180 words.

Winner: grok-4.3. Its email is slightly stronger on tone and specificity: it is candid, steady, and clearly explains the QA issue, revised window, preserved pricing/allocation, and request not to publish the old date. Llama-4-Scout-17B-16E-Instruct is solid but more generic, adds a less precise reassurance about business impact, and feels a bit more templated for this retailer audience.

3. meeting-notes-summary

Summarization & extraction. Read these meeting notes and then provide: (1) a 2-sentence summary, and (2) a JSON object with keys decision, owner, launch_date, risks. Notes:
- Tuesday growth sync for the LatticeLearn referral experiment.
- Priya: paid social CAC rose from $41 to $57 after we broadened targeting; not sustainable.
- Owen: referral landing page converted 18.4% last week vs 11.2% for the generic signup page.
- Marta: legal approved the revised incentive copy as long as we cap rewards at 5 invites per user.
- Decision discussed: launch referral flow to 50% of new visitors on 2025-07-08, not 100%, because support macros are not ready.
- Ezra will finish support macros by Friday; if delayed, we keep the old post-signup help center link in place.
- Main risks: fraud from self-referrals, support queue spike, and analytics mismatch between web and iOS events.

Winner: grok-4.3. It better follows the requested format and captures more of the salient details, including legal approval, the invite cap, and the explicit risks, while providing a plausible owner from the notes. Llama-4-Scout-17B-16E-Instruct is mostly correct but adds unnecessary commentary outside the requested output and omits an owner even though Ezra is the clearest named responsible party in the notes.

4. messy-orders-to-json

Data wrangling / structured output. Convert the messy order lines below into valid JSON only. Output an array of objects sorted by order_id ascending. Schema per object: order_id (string), customer (string), items (array of strings), priority (boolean), ship_by (string in YYYY-MM-DD). Rules: trim spaces, title-case customer names, item names exactly as written after trimming, priority is true for yes/y/true and false for no/n/false, and dates must be normalized to YYYY-MM-DD. Messy data:
ORD-910 | customer= leah p. moreno | items= gasket set ; 8mm hex key ; thread seal tape | priority=Y | ship_by=7/9/2025
ORD-907 | customer=danil cho | items= panel clip|priority=no|ship_by=2025-07-06
ORD-913 | customer= R. banerjee | items= flow sensor ; calibration tag | priority= true | ship_by= 2025/07/11
ORD-908 | customer=mei-lin ortega | items= drip tray ; filter basket ; group brush | priority= n | ship_by=07-07-2025

Winner: grok-4.3. It follows the instruction exactly by returning valid JSON only, correctly normalized and sorted. Llama-4-Scout-17B-16E-Instruct includes explanatory text and code instead of JSON-only output, violating the required format despite showing a correct final JSON block.
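The normalization rules in this prompt are mechanical enough to sketch in a few lines. The following is our own illustration, not either model's output, and it rests on several assumptions: the helper names (normalize_date, parse_order, orders_to_json) are hypothetical, each order arrives as its own string, ambiguous dates such as 7/9/2025 are read month-first, and Python's str.title() is taken as an acceptable reading of "title-case".

```python
import json
from datetime import datetime

# Assumed date layouts, tried in order. Month-first for slash and dash
# dates is an assumption; the prompt does not say which comes first.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%m-%d-%Y")
TRUE_VALUES = {"yes", "y", "true"}
FALSE_VALUES = {"no", "n", "false"}


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or raise if no layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def parse_order(line):
    """Turn one 'ORD-... | key=value | ...' line into a schema-shaped dict."""
    order_id, *fields = [part.strip() for part in line.split("|")]
    raw = {}
    for field in fields:
        key, _, value = field.partition("=")
        raw[key.strip()] = value.strip()

    flag = raw["priority"].lower()
    if flag not in TRUE_VALUES | FALSE_VALUES:
        raise ValueError(f"unrecognized priority: {raw['priority']!r}")

    return {
        "order_id": order_id,
        "customer": raw["customer"].title(),
        # Item names are trimmed but otherwise kept exactly as written.
        "items": [item.strip() for item in raw["items"].split(";")],
        "priority": flag in TRUE_VALUES,
        "ship_by": normalize_date(raw["ship_by"]),
    }


def orders_to_json(lines):
    """Serialize all orders as a JSON array sorted by order_id ascending."""
    orders = sorted((parse_order(line) for line in lines), key=lambda o: o["order_id"])
    return json.dumps(orders, indent=2)
```

Sorting order_id as a plain string works here only because every ID has the same width; IDs of mixed length would need a numeric key. The format requirement the judge enforced sits outside this code entirely: the response had to be the JSON array and nothing else.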


See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
