grok-4.3 edges gpt-5.4-mini on execution
grok-4.3 vs gpt-5.4-mini
grok-4.3 wins this matchup 38.3 to 36.2 by being a little more disciplined where it counts. gpt-5.4-mini is competitive and even sharper on one summarization task, but it gives away points on instruction fidelity and tone.
This wasn’t a blowout, but the margin is real. grok-4.3 takes the aggregate because it was more reliable on the details that decide practical head-to-heads: exact formatting, tone control, and not adding avoidable friction.
The cleanest example is **messy-orders-to-json**, where both models parsed, normalized, and sorted correctly, but only grok-4.3 obeyed the requirement to return valid JSON directly. gpt-5.4-mini wrapped its answer in Markdown fences, a small-looking mistake that matters a great deal when the output is supposed to be machine-consumable.
On **ops-delay-email**, grok-4.3 also had the better editorial instinct. Both drafts were concise and audience-appropriate, but grok-4.3’s was calmer and more accountable: it explicitly took responsibility and gave a more concrete operational instruction. gpt-5.4-mini’s version was still good, just softer in the wrong place and a bit vaguer in the ask.
To gpt-5.4-mini’s credit, it won **meeting-notes-summary**. Its two-sentence summary was tighter, and its risks section was more complete because it captured both the timeout issue and the downstream tradeoff. And **python-windowed-alerts** was rightly a tie: both handled empty input, merged adjacent same-code events within 90 seconds, and returned the required tuples without a meaningful correctness gap.
**Final call: grok-4.3 is the better model here because it made fewer consequential mistakes. gpt-5.4-mini is close and sometimes more polished, but grok-4.3 was the one that actually closed the job.**
python-windowed-alerts
Language: Python 3.11. Write a function `compress_alerts(events: list[tuple[int,str]]) -> list[tuple[int,int,str]]` that groups consecutive events with the same alert code when their timestamps are no more than 90 seconds apart. Each input tuple is `(unix_ts, code)`, already sorted by timestamp ascending. Return tuples `(start_ts, end_ts, code)` for each merged run. Example: `[(1700010000,'TEMP_HIGH'), (1700010045,'TEMP_HIGH'), (1700010200,'DOOR_OPEN')]` -> `[(1700010000,1700010045,'TEMP_HIGH'), (1700010200,1700010200,'DOOR_OPEN')]`. Handle empty input. Return code only.
Both outputs correctly handle empty input, merge consecutive same-code events when adjacent timestamps are within 90 seconds, and return the required `(start_ts, end_ts, code)` tuples. Model A is slightly more minimal, while Model B uses `prev_ts` naming that makes the chaining logic explicit, but neither has a correctness or instruction-following advantage.
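For reference, the behavior the verdict describes can be sketched in a few lines. This is an illustrative version written for this article, not either model's actual output. It assumes the "chained" reading of the 90-second rule, where each event is compared against the previous event in the same run rather than against the run's start.

```python
def compress_alerts(events: list[tuple[int, str]]) -> list[tuple[int, int, str]]:
    """Merge consecutive same-code events whose adjacent timestamps are <= 90s apart."""
    runs: list[tuple[int, int, str]] = []
    for ts, code in events:
        # Extend the current run only if the code matches and the gap to the
        # previous event in that run (its current end_ts) is at most 90 seconds.
        if runs and runs[-1][2] == code and ts - runs[-1][1] <= 90:
            start_ts, _, _ = runs[-1]
            runs[-1] = (start_ts, ts, code)
        else:
            # Otherwise open a new run that starts and ends at this event.
            runs.append((ts, ts, code))
    return runs
```

Empty input falls out naturally: the loop never runs and the function returns an empty list. Under the chained reading, events at 0, 90, and 180 seconds with the same code collapse into a single run even though the first and last are 180 seconds apart.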
ops-delay-email
Draft an internal status email to warehouse managers about a shipment delay. Context: a label printer outage at the Reno fulfillment center delayed 184 orders placed between 14:10 and 16:35 PT today; replacement hardware is installed, backlog should clear by 21:30 PT, and customer support has a prepared script. Audience: non-technical operations managers. Tone: calm, accountable, and concise. Length: 130-170 words.
Both are clear, concise, and fit the audience, but A better matches the requested calm, accountable tone by explicitly taking responsibility and giving a slightly more concrete operational instruction. B is strong, but it is a bit less accountable and adds a vaguer call to action.
meeting-notes-summary
Read these meeting notes and produce: (1) a 2-sentence summary, and (2) a JSON object with keys `decision`, `owner`, `deadline`, and `risks`. Notes:
- Team: Maple Claims portal redesign
- Priya: search is still timing out for policies with more than 8 attachments; likely tied to the legacy OCR service.
- Mateo: vendor can raise OCR throughput next Tuesday, but only if we commit to a 3-month minimum.
- Jen: legal approved the new consent text with one edit: replace "biometric scan" with "identity check".
- Decision discussion: launch the new intake flow to 15% of users on May 6, keep old upload page for everyone else.
- Action: Priya to confirm by May 2 whether timeout fix is feasible without vendor change.
- Risk: if fix is not feasible, we either pay for the vendor minimum or delay rollout by one sprint.
- Nice-to-have items (not in scope): dark mode, SMS alerts.
Both outputs follow the requested format and capture the key decision, owner, deadline, and risks accurately. B is slightly better because its 2-sentence summary is more concise and focused, and its risks field is more complete by including the underlying timeout issue as well as the downstream tradeoff.
messy-orders-to-json
Convert the messy order lines below into valid JSON: an array of objects sorted by `order_id` ascending. Schema per object: `order_id` (string), `customer` (string), `items` (array of strings), `total_usd` (number with 2 decimals), `paid` (boolean). Rules: trim spaces, normalize customer names to title case, split items on `|`, and treat `PAID`, `yes`, `true`, `Y` as true; `no`, `false`, `N`, `unpaid` as false. Data:
ORD-104 | nora velasquez | cable ties 100pk|label rolls | 48.5 | yes
ORD-099|ACME BIOLABS|centrifuge gasket| 129.00|PAID
ORD-117 | leon xu | thermal printer | spare battery | 214 | unpaid
ORD-103| irene okafor | shrink wrap | tape gun | 73.40 | Y
Both outputs correctly parse, normalize, and sort the orders, but Model A follows the instruction to return valid JSON directly. Model B wraps the JSON in Markdown code fences, so the overall output is not itself valid JSON.
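To make the fence problem concrete, here is a rough sketch of the task's rules plus a direct-parse check. Neither model's output is reproduced here. The names `parse_order`, `orders_to_json`, and `is_directly_valid_json` are hypothetical helpers invented for illustration, and the code assumes one order per input line, with everything between the customer and the total treated as items.

```python
import json

TRUTHY = {"paid", "yes", "true", "y"}
FALSY = {"no", "false", "n", "unpaid"}

# Assumption: one order per line, as the task data appears to intend.
LINES = [
    "ORD-104 | nora velasquez | cable ties 100pk|label rolls | 48.5 | yes",
    "ORD-099|ACME BIOLABS|centrifuge gasket| 129.00|PAID",
    "ORD-117 | leon xu | thermal printer | spare battery | 214 | unpaid",
    "ORD-103| irene okafor | shrink wrap | tape gun | 73.40 | Y",
]


def parse_order(line: str) -> dict:
    """Apply the task's rules to one pipe-delimited order line."""
    fields = [field.strip() for field in line.split("|")]
    # First two fields and last two fields are fixed; the middle is the item list.
    order_id, customer, *items, total, paid = fields
    flag = paid.lower()
    if flag not in TRUTHY | FALSY:
        raise ValueError(f"unrecognized paid flag: {paid!r}")
    return {
        "order_id": order_id,
        "customer": customer.title(),
        "items": items,
        # A float cannot carry trailing zeros, so 48.5 serializes as 48.5;
        # emitting 48.50 would need custom number formatting.
        "total_usd": round(float(total), 2),
        "paid": flag in TRUTHY,
    }


def orders_to_json(lines: list[str]) -> str:
    # String sort is enough here because the sample IDs are zero-padded.
    orders = sorted((parse_order(line) for line in lines), key=lambda o: o["order_id"])
    return json.dumps(orders, indent=2)


def is_directly_valid_json(raw: str) -> bool:
    """True only if the whole string parses as JSON with no wrapper text."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return False
    return True


direct = orders_to_json(LINES)
fence = "`" * 3  # built programmatically to avoid nesting a literal fence here
fenced = f"{fence}json\n{direct}\n{fence}"

print(is_directly_valid_json(direct))  # True
print(is_directly_valid_json(fenced))  # False
```

The second check fails even though the JSON inside the fence is identical, which is the gap the verdict is pointing at: a downstream consumer calling `json.loads` on the raw response would raise before ever seeing the data.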
Matchup powered by OpenRouter.