grok-4.3 edges gpt-5.4-mini on execution

grok-4.3 wins this matchup 38.3 to 36.2 by being a little more disciplined where it counts. gpt-5.4-mini is competitive and even sharper on one summarization task, but it gives away points on instruction fidelity and tone.

This wasn’t a blowout, but the margin is real. grok-4.3 takes the aggregate, 38.3 to 36.2, because it was more reliable on the kinds of details that decide practical head-to-heads: exact formatting, tone control, and not adding avoidable friction.

The cleanest example is messy-orders-to-json, where both models parsed, normalized, and sorted correctly, but only grok-4.3 actually obeyed the requirement to return valid JSON directly. gpt-5.4-mini wrapped its answer in Markdown fences, a small-looking mistake that matters a great deal when the output is supposed to be machine-consumable.
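To make the fence problem concrete, here is a minimal sketch of how a downstream consumer might see it. The helper name is_bare_json and the sample payload are illustrative assumptions, not part of the test harness or either model's output.

```python
import json

# A Markdown fence marker, built programmatically so this listing stays readable.
FENCE = "`" * 3


def is_bare_json(text: str) -> bool:
    """Return True if the entire response parses as JSON with no wrapper."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical payloads: the same JSON, once bare and once wrapped in fences.
bare = '[{"order_id": "ORD-099", "paid": true}]'
fenced = f"{FENCE}json\n{bare}\n{FENCE}"

print(is_bare_json(bare))    # True
print(is_bare_json(fenced))  # False
```

A strict parser rejects the fenced version at the first character, so a pipeline reading the response directly would fail even though the JSON inside is correct.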

On ops-delay-email, grok-4.3 also had the better editorial instinct. Both drafts were concise and audience-appropriate, but grok-4.3's was calmer and more accountable: it explicitly took responsibility and gave a more concrete operational instruction. gpt-5.4-mini’s version was still good, just softer in the wrong place and a bit vaguer in the ask.

To gpt-5.4-mini’s credit, it won meeting-notes-summary. Its two-sentence summary was tighter, and its risks section was more complete because it captured both the timeout issue and the downstream tradeoff. And python-windowed-alerts was rightly a tie: both handled empty input, merged adjacent same-code events within 90 seconds, and returned the required tuples without a meaningful correctness gap.

Final call: grok-4.3 is the better model here because it made fewer consequential mistakes. gpt-5.4-mini is close and sometimes more polished, but grok-4.3 was the one that actually closed the job.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 38.3 to gpt-5.4-mini's 36.2.

1. python-windowed-alerts

Language: Python 3.11. Write a function compress_alerts(events: list[tuple[int,str]]) -> list[tuple[int,int,str]] that groups consecutive events with the same alert code when their timestamps are no more than 90 seconds apart. Each input tuple is (unix_ts, code), already sorted by timestamp ascending. Return tuples (start_ts, end_ts, code) for each merged run. Example: [(1700010000,'TEMP_HIGH'), (1700010045,'TEMP_HIGH'), (1700010200,'DOOR_OPEN')] -> [(1700010000,1700010045,'TEMP_HIGH'), (1700010200,1700010200,'DOOR_OPEN')]. Handle empty input. Return code only.

Winner: Tie. Both outputs correctly handle empty input, merge consecutive same-code events when adjacent timestamps are within 90 seconds, and return the required (start_ts, end_ts, code) tuples. grok-4.3 (Model A) is slightly more minimal, while gpt-5.4-mini (Model B) uses prev_ts naming that makes the chaining logic explicit, but neither has a correctness or instruction-following advantage.
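For reference, the behavior the judge describes can be sketched as follows. This is one possible reading of the prompt, not either model's actual submission: it assumes the 90-second limit applies between adjacent events, so a run can chain for longer than 90 seconds overall.

```python
def compress_alerts(events: list[tuple[int, str]]) -> list[tuple[int, int, str]]:
    """Merge consecutive same-code events whose adjacent timestamps are <= 90 s apart."""
    runs: list[tuple[int, int, str]] = []
    for ts, code in events:
        # Compare against the end of the current run, which is the previous event's timestamp.
        if runs and runs[-1][2] == code and ts - runs[-1][1] <= 90:
            start, _, _ = runs[-1]
            runs[-1] = (start, ts, code)  # extend the run: same start, new end
        else:
            runs.append((ts, ts, code))  # start a new single-event run
    return runs


# The example from the prompt.
sample = [(1700010000, "TEMP_HIGH"), (1700010045, "TEMP_HIGH"), (1700010200, "DOOR_OPEN")]
print(compress_alerts(sample))
```

Empty input falls out naturally, since the loop never runs and the empty list is returned.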

2. ops-delay-email

Draft an internal status email to warehouse managers about a shipment delay. Context: a label printer outage at the Reno fulfillment center delayed 184 orders placed between 14:10 and 16:35 PT today; replacement hardware is installed, backlog should clear by 21:30 PT, and customer support has a prepared script. Audience: non-technical operations managers. Tone: calm, accountable, and concise. Length: 130-170 words.

Winner: grok-4.3. Both are clear, concise, and fit the audience, but grok-4.3 (Model A) better matches the requested calm, accountable tone by explicitly taking responsibility and giving a slightly more concrete operational instruction. gpt-5.4-mini (Model B) is strong, but it is a bit less accountable and adds a vaguer call to action.

3. meeting-notes-summary

Read these meeting notes and produce: (1) a 2-sentence summary, and (2) a JSON object with keys decision, owner, deadline, and risks. Notes:
- Team: Maple Claims portal redesign
- Priya: search is still timing out for policies with more than 8 attachments; likely tied to the legacy OCR service.
- Mateo: vendor can raise OCR throughput next Tuesday, but only if we commit to a 3-month minimum.
- Jen: legal approved the new consent text with one edit: replace "biometric scan" with "identity check".
- Decision discussion: launch the new intake flow to 15% of users on May 6, keep old upload page for everyone else.
- Action: Priya to confirm by May 2 whether timeout fix is feasible without vendor change.
- Risk: if fix is not feasible, we either pay for the vendor minimum or delay rollout by one sprint.
- Nice-to-have items (not in scope): dark mode, SMS alerts.

Winner: gpt-5.4-mini. Both outputs follow the requested format and capture the key decision, owner, deadline, and risks accurately. gpt-5.4-mini (Model B) is slightly better because its 2-sentence summary is more concise and focused, and its risks field is more complete by including the underlying timeout issue as well as the downstream tradeoff.

4. messy-orders-to-json

Convert the messy order lines below into valid JSON: an array of objects sorted by order_id ascending. Schema per object: order_id (string), customer (string), items (array of strings), total_usd (number with 2 decimals), paid (boolean). Rules: trim spaces, normalize customer names to title case, split items on |, and treat PAID, yes, true, Y as true; no, false, N, unpaid as false. Data:
ORD-104 | nora velasquez | cable ties 100pk|label rolls | 48.5 | yes
ORD-099|ACME BIOLABS|centrifuge gasket| 129.00|PAID
ORD-117 | leon xu | thermal printer | spare battery | 214 | unpaid
ORD-103| irene okafor | shrink wrap | tape gun | 73.40 | Y

Winner: grok-4.3. Both outputs correctly parse, normalize, and sort the orders, but grok-4.3 (Model A) follows the instruction to return valid JSON directly. gpt-5.4-mini (Model B) wraps the JSON in Markdown code fences, so the overall output is not itself valid JSON.
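The parsing rules in this prompt can be sketched in a few lines of Python. This is an illustrative reading, not either model's answer. It assumes the first two pipe-separated fields are the order ID and customer, the last two are the total and the paid flag, and everything in between is an item. Matching the paid flag case-insensitively is also an assumption. Note that json.dumps would print 48.5 rather than 48.50, so the two-decimal rendering would need custom serialization that is not shown here.

```python
import json

TRUE_WORDS = {"paid", "yes", "true", "y"}
FALSE_WORDS = {"no", "false", "n", "unpaid"}


def parse_order(line: str) -> dict:
    """Turn one messy pipe-delimited order line into a normalized dict."""
    fields = [field.strip() for field in line.split("|")]
    # First two fields and last two fields are fixed; the middle is a variable-length item list.
    order_id, customer, *items, total, paid = fields
    flag = paid.lower()
    if flag not in TRUE_WORDS | FALSE_WORDS:
        raise ValueError(f"unrecognized paid flag: {paid!r}")
    return {
        "order_id": order_id,
        "customer": customer.title(),
        "items": items,
        "total_usd": round(float(total), 2),
        "paid": flag in TRUE_WORDS,
    }


lines = [
    "ORD-104 | nora velasquez | cable ties 100pk|label rolls | 48.5 | yes",
    "ORD-099|ACME BIOLABS|centrifuge gasket| 129.00|PAID",
    "ORD-117 | leon xu | thermal printer | spare battery | 214 | unpaid",
    "ORD-103| irene okafor | shrink wrap | tape gun | 73.40 | Y",
]

# Plain string sorting works here because every ID has the same prefix and digit count.
orders = sorted((parse_order(line) for line in lines), key=lambda o: o["order_id"])
print(json.dumps(orders, indent=2))
```

Printing the result with json.dumps and nothing else is what keeps the output directly parseable, which is the point the judge scored on.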


See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
