grok-4.3 edges gpt-5.4-nano on execution, not flash

This was close on aggregate, but grok-4.3 wins because it made the fewer costly mistakes in structured-output work. gpt-5.4-nano was sharper on tone and regex edge cases, yet it gave back those gains by breaking instructions where precision mattered more.

By · Published

Comparison of two competing AI models' performance and precision (1970s offset-print magazine illustration — halftone dots, slightly off-register inks, warm yellowed paper feel)

The score says nail-biter — 33.8 to 33.4 — but the split is revealing. gpt-5.4-nano took both writing-adjacent tasks: python-log-redaction-fix and release-delay-customer-email. In the log-redaction task, B was simply more careful: it preserved separators and existing quotes better, and it dealt more explicitly with quoted JSON-style values. In the customer email, B also had the stronger editorial instinct, matching the requested candid tone and laying out options more cleanly.

But grok-4.3 won the task that mattered most for trust: meeting-notes-summary-extract. A stuck to the requested format, delivered the required two-sentence summary, and produced a valid JSON array without getting cute. B, by contrast, added markdown/code fences and invented absolute dates for relative phrases like “today” and “before next Wednesday.” That is exactly the kind of unnecessary improvisation that turns a usable model into one you have to babysit.

The messy-orders-to-json tie reinforces the point. When the job was straightforward normalization, both models were competent: valid JSON, correct coercions, malformed quantity: "two" sensibly set to null. No daylight there.

So this verdict comes down to error profile. gpt-5.4-nano is the nicer stylist and the more polished fixer on narrow tasks, but grok-4.3 was more disciplined when the instructions were brittle and the output format actually mattered. Final call: grok-4.3 wins because it was the model less likely to create new problems while solving the prompt.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 33.8 to gpt-5.4-nano's 33.4.

1. python-log-redaction-fix

Language: Python 3.11. Fix the function below so it safely redacts sensitive values from app logs. It must replace the values for keys api_key, token, and ssn with "[REDACTED]" whether they appear as JSON-style pairs or query-string-style pairs, case-insensitively, without changing unrelated text. Preserve the original separators and quotes when possible. Return code only. Broken code: import re def redact_sensitive(line: str) -> str: pattern = r'(api_key|token|ssn)\s*[:=]\s*([^,\s}]+)' return re.sub(pattern, r'\1=[REDACTED]', line) # desired behavior examples: # 'POST /sync?token=abC9z&user=mina' -> 'POST /sync?token=[REDACTED]&user=mina' # '{"ssn": "903-44-1209", "name": "Ivo"}' -> '{"ssn": "[REDACTED]", "name": "Ivo"}' # 'api_key : ZX-77-PL, status=ok' -> 'api_key : [REDACTED], status=ok'

Winner: gpt-5.4-nano — B better preserves separators and existing quotes around values, and more explicitly handles quoted JSON-style values as required. A is concise and mostly works, but its single regex is less robust for JSON cases and can mishandle edge cases around quoting and value boundaries.

2. release-delay-customer-email

Write an email to customers about a delayed shipment of the EmberLane Studio Desk in walnut. Audience: customers who preordered from our small furniture company. Tone: candid, calm, and professional; do not sound legalistic. Length: 140-180 words. Facts to include: a varnish supplier in Brno delivered a batch that failed our finish test; we will not ship desks with that defect; the new estimated ship window is 18-24 July; customers can keep their order, switch to oak for immediate dispatch, or get a full refund; as an apology, include a 12% refund on the desk price for anyone who keeps the walnut order. Include a clear subject line.

Winner: gpt-5.4-nano — Both emails are clear, professional, and include all required facts within the length range. B is slightly better because it more directly mirrors the requested candid tone, states the defect issue more explicitly, and presents the customer options in a cleaner, easier-to-scan format.

3. meeting-notes-summary-extract

Read these meeting notes and then do two things: (1) write a 2-sentence summary, and (2) extract the action items as a JSON array where each item has owner, task, and due_date. Notes: - Monday ops check-in for the Northline clinic rollout. - Priya: kiosk tablets at Glenferry arrived Friday; 11 of 12 enrolled correctly, last one stuck on setup screen. - Mateo: traced the setup issue to an expired MDM token in the staging profile, not hardware. - Decision: postpone Glenferry patient self-check-in launch from 6 May to 8 May. - Juno will renew the MDM token and re-enroll the failed tablet by 3 May. - Priya will notify clinic manager Elise Tan about the 2-day delay today. - Mateo to update the deployment runbook with the corrected staging-profile step before next Wednesday. - Separate note: Southport site remains on track for 13 May; no blockers. - Risk raised: courier labels for spare chargers still show old helpdesk number. - Anika to send corrected label artwork to vendor LumaPack by 2 May.

Winner: grok-4.3 — A follows the requested format closely with a correct 2-sentence summary and a valid JSON array preserving the source due-date wording. B’s summary is fine, but it adds markdown/code fences and incorrectly invents absolute dates for relative due dates like “today” and “before next Wednesday,” reducing adherence and accuracy.

4. messy-orders-to-json

Convert the messy order lines below into valid JSON as an array of objects. Use exactly this schema for each object and no extra fields: {"order_id": string, "customer": string, "sku": string, "quantity": integer, "unit_price": number, "ship_country": string, "priority": boolean} Rules: trim spaces; order_id must stay a string; quantity is an integer; unit_price is a number without currency symbols; priority is true only for Y/yes/TRUE (case-insensitive), otherwise false; normalize country names to title case. Data: ORD-9081 | customer= Nabila Hart | sku: QN-44B | qty 3 | unit price $19.90 | country=canada | priority=Y ORD-9082|customer=R. Velasco|sku:LM-2|qty=1|unit price=7|country=UNITED STATES|priority=no ORD-9083 | customer = Teo Mirk | sku : AX-9Z | qty= 12 | unit price= €3.50 | country = germany | priority = TRUE ORD-9084 | customer= "Aya Song" | sku: PT-88 | qty two | unit price $14.00 | country=Japan | priority=N If a field is malformed and cannot be safely coerced, set it to null for that field only.

Winner: Tie — Both outputs are valid JSON arrays matching the exact schema, correctly normalize and coerce fields, and appropriately set the malformed quantity 'two' to null. There are no substantive differences in correctness or instruction adherence.


See every prompt and the full side-by-side outputs in the interactive Head-to-Head.

Reader comments

Conversation for this story loads after sign-in.