Head to head: grok-4.3 vs Phi-4-mini-instruct

This matchup wasn’t especially close: grok-4.3 wins on execution, not style points. Across the non-code tasks, it was the model that actually followed instructions, kept facts straight, and avoided the avoidable mistakes that dragged Phi-4-mini-instruct down.

The scoreline says it plainly: **37.0 to 28.0 for grok-4.3**. And the split behind that result is even more telling. On the lone coding task that ended in a draw, both models made the same correct fix in `python-log-window-fix`: inclusive boundary checks, distinct-user counting via a set, the required signature preserved, and code-only output. There was no meaningful separation there.

Everywhere else, grok-4.3 was the more reliable model. In `customer-delay-update`, it did what a production-grade assistant should do: explain the delayed 80 pins clearly, state that there’s **no extra cost**, and keep the tone professional without inventing context. Phi-4-mini-instruct stumbled into exactly the kind of error that makes business users lose trust fast, incorrectly implying the sender was **Alder & Finch Studio** and even thanking the customer for choosing them.

The gap widened in `ops-meeting-summary`. grok-4.3 followed the requested format and captured the notes cleanly, including the actions and decisions that mattered. Phi-4-mini-instruct missed Evan’s action altogether, reassigned Noor’s alert action to Mei, and then ignored the instruction for plain output by adding extra formatting. That’s not a small miss; it’s a failure to track ownership accurately.

`messy-orders-to-json` sealed it. grok-4.3 correctly merged orders, preserved item order, and set priority properly, with only a minor blemish in carrying over the literal `note:` label for O-484. Phi-4-mini-instruct made the more damaging mistake: flagging **O-482 as priority true** without any HIGH or RUSH signal, while also dropping leftover note text for O-481. That’s the difference between a slightly untidy parse and a materially wrong one.

**Final call: grok-4.3 wins because it is simply the steadier, more trustworthy model. Phi-4-mini-instruct could keep up on a straightforward code fix, but on instruction-following and factual precision, grok-4.3 was the only one acting like a senior-level assistant.**

python-log-window-fix

Language: Python 3.11

Return code only. Fix this function so it returns the number of distinct user IDs that had at least one event in the last `window_minutes` before `now_ts` (inclusive). Input timestamps are Unix seconds. The current code is wrong around duplicate users and the time boundary.

```python
def active_users(events, now_ts, window_minutes):
    cutoff = now_ts - window_minutes * 60
    count = 0
    seen = []
    for ts, user_id in events:
        if ts > cutoff and ts < now_ts:
            if user_id not in seen:
                count += 1
                seen.append(user_id)
    return count
```

Requirements:
- Keep the same function signature.
- `events` is a list of `(timestamp, user_id)` tuples.
- Count each user once at most.
- Include events exactly at `cutoff` and exactly at `now_ts`.
- Must be efficient for large inputs.
- Do not add any explanation or tests; output just the corrected code.

grok-4.3:

```python
def active_users(events, now_ts, window_minutes):
    cutoff = now_ts - window_minutes * 60
    seen = set()
    for ts, user_id in events:
        if cutoff <= ts <= now_ts:
            seen.add(user_id)
    return len(seen)
```

Phi-4-mini-instruct:

```python
def active_users(events, now_ts, window_minutes):
    cutoff = now_ts - window_minutes * 60
    user_set = set()
    for ts, user_id in events:
        if cutoff <= ts <= now_ts:
            user_set.add(user_id)
    return len(user_set)
```

Both outputs correctly fix the boundary conditions by using inclusive comparisons, count distinct users efficiently with a set, preserve the required signature, and return code only. They are functionally equivalent aside from variable naming.
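
To make the boundary and deduplication behavior concrete, here is a minimal check of the shared fix. The sample events and the `strict_count` comparison are illustrative assumptions, not part of the original task or either model's output.

```python
def active_users(events, now_ts, window_minutes):
    """Shared fix from both models: inclusive window, set-based dedupe."""
    cutoff = now_ts - window_minutes * 60
    seen = set()
    for ts, user_id in events:
        if cutoff <= ts <= now_ts:
            seen.add(user_id)
    return len(seen)


# Hypothetical sample data: now_ts = 1_000 with a 5-minute window,
# so cutoff = 1_000 - 5 * 60 = 700.
events = [
    (700, "u1"),    # exactly at cutoff: must be counted
    (1_000, "u2"),  # exactly at now_ts: must be counted
    (850, "u1"),    # same user again inside the window: counted once
    (699, "u3"),    # one second too old: excluded
    (1_001, "u4"),  # after now_ts: excluded
]

result = active_users(events, now_ts=1_000, window_minutes=5)

# For contrast, strict comparisons (as in the original prompt's condition)
# drop both boundary events and see only the 850-second event.
strict_count = len({user_id for ts, user_id in events if 700 < ts < 1_000})

print(result, strict_count)  # 2 1
```

The set keeps membership checks constant-time on average, which is what the "efficient for large inputs" requirement asks for; the list-based `seen` in the prompt makes each lookup linear.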

customer-delay-update

Write an email to a customer named Priya Natarajan at Alder & Finch Studio. Situation: We promised delivery of 240 custom enamel pins for their June 28 gallery opening, but our plating vendor found a color-separation defect in batch 2. We can still ship 160 pins by June 24 and the remaining 80 on July 3 at no extra cost. We are also offering either a 12% refund on the full order or a free rush remake of the 80 pins in matte black metal. Audience/tone: external customer, honest, accountable, calm, professional. Length: 170-220 words. Goal: explain the issue clearly, apologize once, present the options, and ask for a decision by 3 p.m. tomorrow so production can proceed.

grok-4.3:

Subject: Update on Your Order of 240 Custom Enamel Pins

Dear Priya Natarajan,

Thank you for your order of 240 custom enamel pins for the June 28 gallery opening at Alder & Finch Studio. I am writing to inform you of a development that affects the original delivery timeline.

Our plating vendor identified a color-separation defect in batch 2 during routine quality inspection. We have isolated the affected pins and are taking immediate steps to correct the issue while protecting your schedule as much as possible.

We will ship the 160 unaffected pins by June 24 so they arrive ahead of the opening. The remaining 80 pins will be completed and delivered on July 3 at no additional cost.

To address the situation, we offer two options: a 12% refund on the full order or a free rush remake of the 80 pins in matte black metal.

I apologize for any inconvenience this may cause. Please let me know your preferred option by 3 p.m. tomorrow so we can proceed with production.

We value your partnership and will ensure the order is completed to your satisfaction.

Best regards,
[Your Name]
[Your Title]
[Company Name]

Phi-4-mini-instruct:

Subject: Update on Your Custom Enamel Pins Order for June 28

Dear Priya Natarajan,

I hope this email finds you well. I wanted to provide an update regarding your order for the custom enamel pins for Alder & Finch Studio's gallery opening on June 28.

After a thorough inspection, our plating vendor discovered a color-separation defect in batch 2 of the enamel pins. We understand this is critical to your preparation, and we sincerely apologize for any inconvenience this may cause.

We are happy to inform you that we can still ship 160 pins by June 24 and will deliver the remaining 80 pins on July 3. To ensure your expectations are met, we are offering two options: a 12% refund of the full order value or a free rush remake of the 80 pins in matte black metal.

We value your cooperation during this process and appreciate your understanding. Please let us know your preferred option by 3 p.m. tomorrow so we can proceed with the production.

Thank you for choosing Alder & Finch Studio, and we look forward to finalizing your order.

Best regards,
[Your Name]
[Your Position]
Alder & Finch Studio

grok-4.3 is clearer, more accountable, and more complete: it explicitly notes no extra cost for the delayed 80 pins, explains the issue professionally, and maintains the requested tone and structure. Phi-4-mini-instruct misidentifies the sender: it signs off as Alder & Finch Studio and thanks the customer for choosing Alder & Finch Studio, even though Alder & Finch Studio is the customer's own company, not the vendor.

ops-meeting-summary

Read the meeting notes below, then provide:
1) a 2-bullet summary of the key decisions
2) a JSON object with keys `owner_actions` and `risks`

Rules:
- `owner_actions` must be an array of objects with keys `owner`, `action`, and `due`
- `risks` must be an array of short strings
- Use only information from the notes

Meeting notes:
"""
AtlasOps daily sync — 09:10
- Noor: overnight API error rate dropped from 3.8% to 0.6% after rollback of `ledger-cache` to v1.4.19.
- Evan: root cause still unconfirmed; suspects TTL parsing on nodes with locale `tr_TR`. Will prepare a minimal repro by Thursday.
- Mei: enterprise customer Hartside Medical asked whether yesterday's duplicate invoice emails affected PHI. Answer: no PHI in those emails, only invoice number and billing contact name.
- Decision: keep the rollback in place through Friday; no re-enable before a canary passes on 5 pods for 6 hours.
- Luis: can own the customer-facing incident note draft; first version by today 15:30.
- Noor: add alert for invoice email volume > 3x 7-day hourly baseline, due Friday morning.
- Mei: support queue has 14 open tickets tied to this issue; 3 are marked urgent.
- Decision: sales gets a one-paragraph internal FAQ, owner Mei, by noon today.
- Risk: if finance reruns failed invoices before the dedupe patch lands, some customers may receive another duplicate.
"""

grok-4.3:

- Keep rollback through Friday; re-enable only after 5-pod/6-hour canary passes
- Sales receives one-paragraph internal FAQ by noon today (owner Mei)

{
  "owner_actions": [
    {"owner": "Evan", "action": "prepare a minimal repro", "due": "Thursday"},
    {"owner": "Luis", "action": "customer-facing incident note draft first version", "due": "today 15:30"},
    {"owner": "Noor", "action": "add alert for invoice email volume > 3x 7-day hourly baseline", "due": "Friday morning"},
    {"owner": "Mei", "action": "one-paragraph internal FAQ for sales", "due": "noon today"}
  ],
  "risks": [
    "if finance reruns failed invoices before the dedupe patch lands, some customers may receive another duplicate"
  ]
}

Phi-4-mini-instruct:

1) Summary of key decisions:
- Keep the `ledger-cache` rollback in place until after a canary test on 5 pods for 6 hours.
- Sales department to receive an internal FAQ related to this issue by noon today.

2) JSON object:
```json
{
  "owner_actions": [
    {
      "owner": "Luis",
      "action": "own the customer-facing incident note draft; first version by today 15:30",
      "due": "today 15:30"
    },
    {
      "owner": "Mei",
      "action": "add alert for invoice email volume > 3x 7-day hourly baseline, due Friday morning",
      "due": "Friday morning"
    },
    {
      "owner": "Mei",
      "action": "create internal FAQ for sales, due by noon today",
      "due": "noon today"
    }
  ],
  "risks": [
    "if finance reruns failed invoices before the dedupe patch lands, some customers may receive another duplicate"
  ]
}
```

grok-4.3 follows the requested format and captures the actions and decisions accurately from the notes. Phi-4-mini-instruct omits Evan's action, incorrectly assigns Noor's alert action to Mei, and adds extra formatting and text beyond the plain requested output.
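
The ownership errors can be checked mechanically. The sketch below is one possible audit, not the method the judge used: `EXPECTED` and `audit_owner_actions` are hypothetical names, the expected pairs are read off the meeting notes by hand, and the `action` strings are abbreviated because only `owner` and `due` are compared.

```python
# (owner, due) pairs taken from the meeting notes. Action wording is ignored
# because the two models phrase the same action differently.
EXPECTED = {
    ("Evan", "Thursday"),
    ("Luis", "today 15:30"),
    ("Noor", "Friday morning"),
    ("Mei", "noon today"),
}


def audit_owner_actions(owner_actions):
    """Return (missing, unexpected) lists of (owner, due) pairs, each sorted."""
    got = {(a["owner"], a["due"]) for a in owner_actions}
    return sorted(EXPECTED - got), sorted(got - EXPECTED)


# Action text shortened for the example; owners and due values are as output.
grok_actions = [
    {"owner": "Evan", "action": "minimal repro", "due": "Thursday"},
    {"owner": "Luis", "action": "incident note draft", "due": "today 15:30"},
    {"owner": "Noor", "action": "invoice email alert", "due": "Friday morning"},
    {"owner": "Mei", "action": "internal FAQ for sales", "due": "noon today"},
]
phi_actions = [
    {"owner": "Luis", "action": "incident note draft", "due": "today 15:30"},
    {"owner": "Mei", "action": "invoice email alert", "due": "Friday morning"},
    {"owner": "Mei", "action": "internal FAQ for sales", "due": "noon today"},
]

grok_missing, grok_unexpected = audit_owner_actions(grok_actions)
phi_missing, phi_unexpected = audit_owner_actions(phi_actions)

print(grok_missing, grok_unexpected)
# [] []
print(phi_missing, phi_unexpected)
# [('Evan', 'Thursday'), ('Noor', 'Friday morning')] [('Mei', 'Friday morning')]
```

Comparing on `(owner, due)` is a deliberate simplification: it catches a dropped or reassigned action but would not notice an action whose text was changed while owner and due date stayed the same.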

messy-orders-to-json

Transform the messy order lines below into valid JSON with this exact schema:

{"orders":[{"order_id":"string","customer":"string","ship_city":"string","items":[{"sku":"string","qty":number}],"priority":true,"notes":"string"}]}

Rules:
- One object per order_id
- Combine repeated order lines into one order with multiple items
- `priority` is true only if the line contains HIGH or RUSH
- Preserve item order as listed
- `notes` should contain any leftover text not captured elsewhere; use "" if none
- Trim whitespace; keep city text exactly as written

Messy data:
O-481 | customer=Juno Biolabs | city: Reno | sku AX-9 | qty 4 | HIGH | deliver dock C
O-482|customer=Maple Row Cafe|city:Santa Fe|sku QZ-2|qty 1| gift wrap
O-481 | customer=Juno Biolabs | city: Reno | sku LM-3 | qty 2
order O-483 / customer "Velvet Circuit" / ship_city=Boise / item=TT-7 / quantity=9 / RUSH / call on arrival
O-482 | customer=Maple Row Cafe | city:Santa Fe | sku NN-8 | qty 6
O-484 | customer=Northline Habitat | city: Tulsa | sku RP-1 | qty 12 | note: leave with front desk

grok-4.3:

{
  "orders": [
    {
      "order_id": "O-481",
      "customer": "Juno Biolabs",
      "ship_city": "Reno",
      "items": [
        {"sku": "AX-9", "qty": 4},
        {"sku": "LM-3", "qty": 2}
      ],
      "priority": true,
      "notes": "deliver dock C"
    },
    {
      "order_id": "O-482",
      "customer": "Maple Row Cafe",
      "ship_city": "Santa Fe",
      "items": [
        {"sku": "QZ-2", "qty": 1},
        {"sku": "NN-8", "qty": 6}
      ],
      "priority": false,
      "notes": "gift wrap"
    },
    {
      "order_id": "O-483",
      "customer": "Velvet Circuit",
      "ship_city": "Boise",
      "items": [
        {"sku": "TT-7", "qty": 9}
      ],
      "priority": true,
      "notes": "call on arrival"
    },
    {
      "order_id": "O-484",
      "customer": "Northline Habitat",
      "ship_city": "Tulsa",
      "items": [
        {"sku": "RP-1", "qty": 12}
      ],
      "priority": false,
      "notes": "note: leave with front desk"
    }
  ]
}

Phi-4-mini-instruct:

```json
{
  "orders": [
    {
      "order_id": "O-481",
      "customer": "Juno Biolabs",
      "ship_city": "Reno",
      "items": [
        {"sku": "AX-9", "qty": 4},
        {"sku": "LM-3", "qty": 2}
      ],
      "priority": true,
      "notes": ""
    },
    {
      "order_id": "O-482",
      "customer": "Maple Row Cafe",
      "ship_city": "Santa Fe",
      "items": [
        {"sku": "QZ-2", "qty": 1},
        {"sku": "NN-8", "qty": 6}
      ],
      "priority": true,
      "notes": "gift wrap"
    },
    {
      "order_id": "O-483",
      "customer": "Velvet Circuit",
      "ship_city": "Boise",
      "items": [
        {"sku": "TT-7", "qty": 9}
      ],
      "priority": true,
      "notes": "call on arrival"
    },
    {
      "order_id": "O-484",
      "customer": "Northline Habitat",
      "ship_city": "Tulsa",
      "items": [
        {"sku": "RP-1", "qty": 12}
      ],
      "priority": false,
      "notes": "leave with front desk"
    }
  ]
}
```

grok-4.3 correctly combines orders, preserves item order, and sets priority accurately; its only minor issue is including the `note:` label in O-484's notes. Phi-4-mini-instruct makes a more serious error by marking O-482 as priority true despite no HIGH or RUSH on either of its lines, and it also drops O-481's leftover note text ("deliver dock C").
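
The priority rule is simple enough to verify directly against the raw lines. The following is a rough sketch rather than a full parser: `priority_by_order` is a hypothetical helper, and it assumes an order counts as priority when any of its lines contains HIGH or RUSH as a whole word, which is one reasonable reading of the prompt's per-line wording.

```python
import re

# The six raw lines from the prompt, copied as written.
RAW_LINES = [
    "O-481 | customer=Juno Biolabs | city: Reno | sku AX-9 | qty 4 | HIGH | deliver dock C",
    "O-482|customer=Maple Row Cafe|city:Santa Fe|sku QZ-2|qty 1| gift wrap",
    "O-481 | customer=Juno Biolabs | city: Reno | sku LM-3 | qty 2",
    'order O-483 / customer "Velvet Circuit" / ship_city=Boise / item=TT-7 / quantity=9 / RUSH / call on arrival',
    "O-482 | customer=Maple Row Cafe | city:Santa Fe | sku NN-8 | qty 6",
    "O-484 | customer=Northline Habitat | city: Tulsa | sku RP-1 | qty 12 | note: leave with front desk",
]


def priority_by_order(lines):
    """Map order_id -> True if any line for that order contains HIGH or RUSH."""
    flags = {}
    for line in lines:
        order_id = re.search(r"O-\d+", line).group()
        has_flag = re.search(r"\b(HIGH|RUSH)\b", line) is not None
        # Assumption: one flagged line is enough to make the merged order priority.
        flags[order_id] = flags.get(order_id, False) or has_flag
    return flags


flags = priority_by_order(RAW_LINES)
print(flags)
# {'O-481': True, 'O-482': False, 'O-483': True, 'O-484': False}
```

Under this reading, O-482 has no flagged line, so `priority` should be false for it, which matches grok-4.3's output and contradicts Phi-4-mini-instruct's.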

Matchup powered by OpenRouter.