grok-4.3 edges gpt-5.4-mini on execution
grok-4.3 wins this matchup 38.3 to 36.2 by being a little more disciplined where it counts. gpt-5.4-mini is competitive and even sharper on one summarization task, but it gives away points on instruction fidelity and tone.
By RuntimeWire · Published

This wasn’t a blowout, but the margin is real. grok-4.3 takes the aggregate, 38.3 to 36.2, because it was more reliable on the kinds of details that decide practical head-to-heads: exact formatting, tone control, and not adding avoidable friction.
The cleanest example is messy-orders-to-json, where both models parsed, normalized, and sorted correctly, but only grok-4.3 obeyed the requirement to return valid JSON directly. gpt-5.4-mini wrapped its answer in Markdown fences, a small-looking mistake that matters a great deal when the output is supposed to be machine-consumable: a strict JSON parser rejects the fenced text outright.
On ops-delay-email, grok-4.3 also had the better editorial instinct. Both drafts were concise and audience-appropriate, but grok-4.3’s was calmer and more accountable: it explicitly took responsibility and gave a more concrete operational instruction. gpt-5.4-mini’s version was still good, just softer in the wrong place and a bit vaguer in the ask.
To gpt-5.4-mini’s credit, it won meeting-notes-summary. Its two-sentence summary was tighter, and its risks section was more complete because it captured both the timeout issue and the downstream tradeoff. And python-windowed-alerts was rightly a tie: both handled empty input, merged adjacent same-code events within 90 seconds, and returned the required tuples without a meaningful correctness gap.
Final call: grok-4.3 is the better model here because it made fewer consequential mistakes. gpt-5.4-mini is close and sometimes more polished, but grok-4.3 was the one that actually closed the job.
How they were tested
We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 38.3 to gpt-5.4-mini's 36.2.
1. python-windowed-alerts
Language: Python 3.11. Write a function compress_alerts(events: list[tuple[int,str]]) -> list[tuple[int,int,str]] that groups consecutive events with the same alert code when their timestamps are no more than 90 seconds apart. Each input tuple is (unix_ts, code), already sorted by timestamp ascending. Return tuples (start_ts, end_ts, code) for each merged run. Example: [(1700010000,'TEMP_HIGH'), (1700010045,'TEMP_HIGH'), (1700010200,'DOOR_OPEN')] -> [(1700010000,1700010045,'TEMP_HIGH'), (1700010200,1700010200,'DOOR_OPEN')]. Handle empty input. Return code only.
Winner: Tie — Both outputs correctly handle empty input, merge consecutive same-code events when adjacent timestamps are within 90 seconds, and return the required (start_ts, end_ts, code) tuples. Model A is slightly more minimal, while Model B uses prev_ts naming that makes the chaining logic explicit, but neither has a correctness or instruction-following advantage.
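For readers who want to see what a passing answer to this task can look like, here is a minimal sketch. It is not either model's actual output. It assumes, as the judge's note about chaining suggests, that the 90-second limit is measured between adjacent events rather than from the start of a run.

```python
def compress_alerts(events: list[tuple[int, str]]) -> list[tuple[int, int, str]]:
    """Merge consecutive same-code events whose adjacent timestamps are <= 90s apart."""
    merged: list[tuple[int, int, str]] = []
    for ts, code in events:
        if merged:
            start_ts, prev_ts, prev_code = merged[-1]
            # Assumption: the gap is measured against the previous event in the
            # run (chaining), not against the first event of the run.
            if code == prev_code and ts - prev_ts <= 90:
                merged[-1] = (start_ts, ts, code)
                continue
        # Either the first event, a different code, or a gap over 90 seconds:
        # start a new run that begins and ends at this timestamp.
        merged.append((ts, ts, code))
    return merged
```

Under the chaining reading, three same-code events spaced 90 seconds apart collapse into one run even though the first and last are 180 seconds apart. Empty input falls through the loop and returns an empty list.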
2. ops-delay-email
Draft an internal status email to warehouse managers about a shipment delay. Context: a label printer outage at the Reno fulfillment center delayed 184 orders placed between 14:10 and 16:35 PT today; replacement hardware is installed, backlog should clear by 21:30 PT, and customer support has a prepared script. Audience: non-technical operations managers. Tone: calm, accountable, and concise. Length: 130-170 words.
Winner: grok-4.3 — Both are clear, concise, and fit the audience, but A better matches the requested calm, accountable tone by explicitly taking responsibility and giving a slightly more concrete operational instruction. B is strong, but it is a bit less accountable and adds a vaguer call to action.
3. meeting-notes-summary
Read these meeting notes and produce: (1) a 2-sentence summary, and (2) a JSON object with keys decision, owner, deadline, and risks. Notes:
- Team: Maple Claims portal redesign
- Priya: search is still timing out for policies with more than 8 attachments; likely tied to the legacy OCR service.
- Mateo: vendor can raise OCR throughput next Tuesday, but only if we commit to a 3-month minimum.
- Jen: legal approved the new consent text with one edit: replace "biometric scan" with "identity check".
- Decision discussion: launch the new intake flow to 15% of users on May 6, keep old upload page for everyone else.
- Action: Priya to confirm by May 2 whether timeout fix is feasible without vendor change.
- Risk: if fix is not feasible, we either pay for the vendor minimum or delay rollout by one sprint.
- Nice-to-have items (not in scope): dark mode, SMS alerts.
Winner: gpt-5.4-mini — Both outputs follow the requested format and capture the key decision, owner, deadline, and risks accurately. B is slightly better because its 2-sentence summary is more concise and focused, and its risks field is more complete by including the underlying timeout issue as well as the downstream tradeoff.
4. messy-orders-to-json
Convert the messy order lines below into valid JSON: an array of objects sorted by order_id ascending. Schema per object: order_id (string), customer (string), items (array of strings), total_usd (number with 2 decimals), paid (boolean). Rules: trim spaces, normalize customer names to title case, split items on |, and treat PAID, yes, true, Y as true; no, false, N, unpaid as false. Data:
ORD-104 | nora velasquez | cable ties 100pk|label rolls | 48.5 | yes
ORD-099|ACME BIOLABS|centrifuge gasket| 129.00|PAID
ORD-117 | leon xu | thermal printer | spare battery | 214 | unpaid
ORD-103| irene okafor | shrink wrap | tape gun | 73.40 | Y
Winner: grok-4.3 — Both outputs correctly parse, normalize, and sort the orders, but Model A follows the instruction to return valid JSON directly. Model B wraps the JSON in Markdown code fences, so the overall output is not itself valid JSON.
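To make the fence problem concrete, the sketch below parses the four order lines and then checks whether the result survives a strict JSON parse with and without Markdown fences. It is an illustration, not either model's output. The helper names (parse_order, orders_to_json, is_valid_json) are hypothetical, and it assumes one order per line and case-insensitive matching of the paid flags.

```python
import json

RAW_ORDERS = """\
ORD-104 | nora velasquez | cable ties 100pk|label rolls | 48.5 | yes
ORD-099|ACME BIOLABS|centrifuge gasket| 129.00|PAID
ORD-117 | leon xu | thermal printer | spare battery | 214 | unpaid
ORD-103| irene okafor | shrink wrap | tape gun | 73.40 | Y
"""

# Assumption: flags are compared case-insensitively.
TRUE_FLAGS = {"paid", "yes", "true", "y"}
FALSE_FLAGS = {"no", "false", "n", "unpaid"}


def parse_order(line: str) -> dict:
    """Split one pipe-delimited line; everything between customer and total is an item."""
    fields = [part.strip() for part in line.split("|")]
    order_id, customer, *items, total, paid = fields
    flag = paid.lower()
    if flag not in TRUE_FLAGS | FALSE_FLAGS:
        raise ValueError(f"unrecognized paid flag: {paid!r}")
    return {
        "order_id": order_id,
        # str.title() is enough for these names; it mishandles apostrophes.
        "customer": customer.title(),
        "items": items,
        # json.dumps does not keep trailing zeros, so 48.5 is emitted as 48.5;
        # printing 48.50 literally would need custom serialization.
        "total_usd": round(float(total), 2),
        "paid": flag in TRUE_FLAGS,
    }


def orders_to_json(raw: str) -> str:
    orders = [parse_order(line) for line in raw.splitlines() if line.strip()]
    # Plain string sort works here because every ID has the same width.
    orders.sort(key=lambda order: order["order_id"])
    return json.dumps(orders, indent=2)


def is_valid_json(text: str) -> bool:
    """True only if the whole string parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


payload = orders_to_json(RAW_ORDERS)
fence = "`" * 3  # built this way to avoid a literal fence inside this block
fenced_payload = f"{fence}json\n{payload}\n{fence}"

print(is_valid_json(payload))         # bare JSON
print(is_valid_json(fenced_payload))  # same JSON wrapped in Markdown fences
```

The bare payload parses and the fenced one does not, because the parser hits a backtick before it ever reaches the opening bracket. A consumer could strip the fences first, but the task asked for output that needs no cleanup.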
See every prompt and the full side-by-side outputs in the interactive Head-to-Head.