grok-4.3 edges gpt-5.4-mini on execution

grok-4.3 vs gpt-5.4-mini

grok-4.3 wins this matchup 38.3 to 36.2 by being a little more disciplined where it counts. gpt-5.4-mini is competitive and even sharper on one summarization task, but it gives away points on instruction fidelity and tone.

This wasn’t a blowout, but the margin is real. grok-4.3 takes the aggregate, 38.3 to 36.2, because it was more reliable on the kinds of details that decide practical head-to-heads: exact formatting, tone control, and not adding avoidable friction. The cleanest example is **messy-orders-to-json**, where both models parsed, normalized, and sorted correctly, but only grok-4.3 actually obeyed the requirement to return valid JSON directly. gpt-5.4-mini wrapped its answer in Markdown fences, which is a small-looking mistake that completely matters when the output is supposed to be machine-consumable. On **ops-delay-email**, grok-4.3 also had the better editorial instinct. Both drafts were concise and audience-appropriate, but A was calmer and more accountable: it explicitly took responsibility and gave a more concrete operational instruction. gpt-5.4-mini’s version was still good, just softer in the wrong place and a bit vaguer in the ask. To gpt-5.4-mini’s credit, it won **meeting-notes-summary**. Its two-sentence summary was tighter, and its risks section was more complete because it captured both the timeout issue and the downstream tradeoff. And **python-windowed-alerts** was rightly a tie: both handled empty input, merged adjacent same-code events within 90 seconds, and returned the required tuples without a meaningful correctness gap. **Final call: grok-4.3 is the better model here because it made fewer consequential mistakes. gpt-5.4-mini is close and sometimes more polished, but grok-4.3 was the one that actually closed the job.**

python-windowed-alerts

Language: Python 3.11. Write a function `compress_alerts(events: list[tuple[int,str]]) -> list[tuple[int,int,str]]` that groups consecutive events with the same alert code when their timestamps are no more than 90 seconds apart. Each input tuple is `(unix_ts, code)`, already sorted by timestamp ascending. Return tuples `(start_ts, end_ts, code)` for each merged run. Example: `[(1700010000,'TEMP_HIGH'), (1700010045,'TEMP_HIGH'), (1700010200,'DOOR_OPEN')]` -> `[(1700010000,1700010045,'TEMP_HIGH'), (1700010200,1700010200,'DOOR_OPEN')]`. Handle empty input. Return code only.

grok-4.3: ```python def compress_alerts(events: list[tuple[int, str]]) -> list[tuple[int, int, str]]: if not events: return [] result = [] start_ts, code = events[0] end_ts = start_ts for ts, c in events[1:]: if c == code and ts - end_ts <= 90: end_ts = ts else: result.append((start_ts, end_ts, code)) start_ts, end_ts, code = ts, ts, c result.append((start_ts, end_ts, code)) return result ```

gpt-5.4-mini: ```python from typing import List, Tuple def compress_alerts(events: list[tuple[int, str]]) -> list[tuple[int, int, str]]: if not events: return [] compressed: list[tuple[int, int, str]] = [] start_ts, prev_ts, code = events[0] for ts, curr_code in events[1:]: if curr_code == code and ts - prev_ts <= 90: prev_ts = ts else: compressed.append((start_ts, prev_ts, code)) start_ts = prev_ts = ts code = curr_code compressed.append((start_ts, prev_ts, code)) return compressed ```

Both outputs correctly handle empty input, merge consecutive same-code events when adjacent timestamps are within 90 seconds, and return the required `(start_ts, end_ts, code)` tuples. Model A is slightly more minimal, while Model B uses `prev_ts` naming that makes the chaining logic explicit, but neither has a correctness or instruction-following advantage.

ops-delay-email

Draft an internal status email to warehouse managers about a shipment delay. Context: a label printer outage at the Reno fulfillment center delayed 184 orders placed between 14:10 and 16:35 PT today; replacement hardware is installed, backlog should clear by 21:30 PT, and customer support has a prepared script. Audience: non-technical operations managers. Tone: calm, accountable, and concise. Length: 130-170 words.

grok-4.3: Subject: Status Update: Reno Fulfillment Center Shipment Delay Dear Warehouse Managers, Earlier today a label printer outage at the Reno fulfillment center delayed processing for 184 orders placed between 14:10 and 16:35 PT. Replacement hardware is now installed and the team is working through the backlog. We expect all affected orders to ship by 21:30 PT. Customer support has a prepared script and is ready to respond to any inquiries. We understand this delay adds pressure to daily operations and creates downstream impact for customers. We take responsibility for the disruption and are reviewing equipment redundancy and maintenance checks to reduce the chance of recurrence. Please continue normal receiving and picking priorities while the team clears the remaining orders. I will send a brief follow-up once the backlog is complete. Thank you for your patience and support. Best regards, [Name] Fulfillment Operations

gpt-5.4-mini: Subject: Reno Fulfillment Center Shipment Delay Update Hi team, I want to share an update on today’s shipment delay from the Reno fulfillment center. A label printer outage affected 184 orders placed between 14:10 and 16:35 PT today. The replacement hardware is now installed, and the team is actively clearing the backlog. At this time, we expect all delayed orders to be processed and back on track by 21:30 PT tonight. Customer Support has already been briefed and is using a prepared script to handle any inquiries consistently. Please continue monitoring outbound flow and flag any remaining issues as soon as possible. We’re treating this as a priority and will share another update if the timeline changes. Thanks for your patience and coordination. Best, [Your Name]

Both are clear, concise, and fit the audience, but A better matches the requested calm, accountable tone by explicitly taking responsibility and giving a slightly more concrete operational instruction. B is strong, but it is a bit less accountable and adds a vaguer call to action.

meeting-notes-summary

Read these meeting notes and produce: (1) a 2-sentence summary, and (2) a JSON object with keys `decision`, `owner`, `deadline`, and `risks`. Notes: - Team: Maple Claims portal redesign - Priya: search is still timing out for policies with more than 8 attachments; likely tied to the legacy OCR service. - Mateo: vendor can raise OCR throughput next Tuesday, but only if we commit to a 3-month minimum. - Jen: legal approved the new consent text with one edit: replace "biometric scan" with "identity check". - Decision discussion: launch the new intake flow to 15% of users on May 6, keep old upload page for everyone else. - Action: Priya to confirm by May 2 whether timeout fix is feasible without vendor change. - Risk: if fix is not feasible, we either pay for the vendor minimum or delay rollout by one sprint. - Nice-to-have items (not in scope): dark mode, SMS alerts.

grok-4.3: (1) During the Maple Claims portal redesign meeting, the team noted ongoing search timeouts for policies with more than 8 attachments tied to the legacy OCR service, discussed a vendor throughput upgrade requiring a 3-month commitment, and received legal approval to update consent text by replacing "biometric scan" with "identity check". They decided to launch the new intake flow to 15% of users on May 6 while retaining the old upload page, assigning Priya to confirm a fix by May 2 and flagging risks of vendor costs or a one-sprint delay if unresolved. (2) ```json { "decision": "Launch the new intake flow to 15% of users on May 6, keep old upload page for everyone else", "owner": "Priya", "deadline": "May 2", "risks": "If timeout fix is not feasible without vendor change, either pay for the vendor minimum or delay rollout by one sprint" } ```

gpt-5.4-mini: The team agreed to launch the new intake flow to 15% of users on May 6 while keeping the old upload page for everyone else. Priya will confirm by May 2 whether the search timeout issue can be fixed without changing vendors; if not, the team must choose between paying for the vendor’s 3-month minimum or delaying rollout by one sprint. ```json { "decision": "Launch the new intake flow to 15% of users on May 6, while keeping the old upload page for everyone else.", "owner": "Priya", "deadline": "May 2", "risks": [ "Search continues timing out for policies with more than 8 attachments, likely due to the legacy OCR service.", "If the timeout fix is not feasible without a vendor change, the team may need to commit to a 3-month vendor minimum.", "If the vendor change is avoided, rollout may need to be delayed by one sprint." ] } ```

Both outputs follow the requested format and capture the key decision, owner, deadline, and risks accurately. B is slightly better because its 2-sentence summary is more concise and focused, and its risks field is more complete by including the underlying timeout issue as well as the downstream tradeoff.

messy-orders-to-json

grok-4.3: [ { "order_id": "ORD-099", "customer": "Acme Biolabs", "items": ["centrifuge gasket"], "total_usd": 129.00, "paid": true }, { "order_id": "ORD-103", "customer": "Irene Okafor", "items": ["shrink wrap", "tape gun"], "total_usd": 73.40, "paid": true }, { "order_id": "ORD-104", "customer": "Nora Velasquez", "items": ["cable ties 100pk", "label rolls"], "total_usd": 48.50, "paid": true }, { "order_id": "ORD-117", "customer": "Leon Xu", "items": ["thermal printer", "spare battery"], "total_usd": 214.00, "paid": false } ]

gpt-5.4-mini: ```json [ { "order_id": "ORD-099", "customer": "Acme Biolabs", "items": ["centrifuge gasket"], "total_usd": 129.00, "paid": true }, { "order_id": "ORD-103", "customer": "Irene Okafor", "items": ["shrink wrap", "tape gun"], "total_usd": 73.40, "paid": true }, { "order_id": "ORD-104", "customer": "Nora Velasquez", "items": ["cable ties 100pk", "label rolls"], "total_usd": 48.50, "paid": true }, { "order_id": "ORD-117", "customer": "Leon Xu", "items": ["thermal printer", "spare battery"], "total_usd": 214.00, "paid": false } ] ```

Both outputs correctly parse, normalize, and sort the orders, but Model A follows the instruction to return valid JSON directly. Model B wraps the JSON in Markdown code fences, so the overall output is not itself valid JSON.