Head to head: grok-4.3 vs Phi-4-mini-instruct
This matchup wasn’t especially close: grok-4.3 wins on execution, not style points. Across the non-code tasks, it was the model that actually followed instructions, kept facts straight, and avoided the avoidable mistakes that dragged Phi-4-mini-instruct down.
The scoreline says it plainly: **37.0 to 28.0 for grok-4.3**. And the split behind that result is even more telling.

On the lone coding task that ended in a draw, both models made the same correct fix in `python-log-window-fix`: inclusive boundary checks, distinct-user counting via a set, the required signature preserved, and code-only output. There was no meaningful separation there.

Everywhere else, grok-4.3 was the more reliable model. In `customer-delay-update`, it did what a production-grade assistant should do: explain the delayed 80 pins clearly, state that there’s **no extra cost**, and keep the tone professional without inventing context. Phi-4-mini-instruct stumbled into exactly the kind of error that makes business users lose trust fast, incorrectly implying the sender was **Alder & Finch Studio** and even thanking the customer for choosing them.

The gap widened in `ops-meeting-summary`. grok-4.3 followed the requested format and captured the notes cleanly, including the actions and decisions that mattered. Phi-4-mini-instruct missed Evan’s action altogether, reassigned Noor’s alert action to Mei, and then ignored the instruction for plain output by adding extra formatting. That’s not a small miss; it’s a failure to track ownership accurately.

`messy-orders-to-json` sealed it. grok-4.3 correctly merged orders, preserved item order, and set priority properly, with only a minor blemish in carrying over the literal `note:` label for O-484. Phi-4-mini-instruct made the more damaging mistake: flagging **O-482 as priority true** without any HIGH or RUSH signal, while also dropping leftover note text for O-481. That’s the difference between a slightly untidy parse and a materially wrong one.

**Final call: grok-4.3 wins because it is simply the steadier, more trustworthy model. Phi-4-mini-instruct could keep up on a straightforward code fix, but on instruction-following and factual precision, grok-4.3 was the only one acting like a senior-level assistant.**
python-log-window-fix
Language: Python 3.11

Return code only. Fix this function so it returns the number of distinct user IDs that had at least one event in the last `window_minutes` before `now_ts` (inclusive). Input timestamps are Unix seconds. The current code is wrong around duplicate users and the time boundary.

```python
def active_users(events, now_ts, window_minutes):
    cutoff = now_ts - window_minutes * 60
    count = 0
    seen = []
    for ts, user_id in events:
        if ts > cutoff and ts < now_ts:
            if user_id not in seen:
                count += 1
                seen.append(user_id)
    return count
```

Requirements:
- Keep the same function signature.
- `events` is a list of `(timestamp, user_id)` tuples.
- Count each user once at most.
- Include events exactly at `cutoff` and exactly at `now_ts`.
- Must be efficient for large inputs.
- Do not add any explanation or tests; output just the corrected code.
Both outputs correctly fix the boundary conditions by using inclusive comparisons, count distinct users efficiently with a set, preserve the required signature, and return code only. They are functionally equivalent aside from variable naming.
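For reference, the fix described in this verdict could look roughly like the sketch below. It is an illustration written from the task requirements, not a verbatim copy of either model's output.

```python
def active_users(events, now_ts, window_minutes):
    # Start of the window in Unix seconds; both ends are inclusive.
    cutoff = now_ts - window_minutes * 60
    # A set deduplicates users and gives O(1) average-time inserts,
    # unlike the list membership test in the original code.
    seen = set()
    for ts, user_id in events:
        if cutoff <= ts <= now_ts:
            seen.add(user_id)
    return len(seen)
```

The chained comparison `cutoff <= ts <= now_ts` covers both boundary requirements in a single expression, and returning `len(seen)` removes the separate counter.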
customer-delay-update
Write an email to a customer named Priya Natarajan at Alder & Finch Studio.

Situation: We promised delivery of 240 custom enamel pins for their June 28 gallery opening, but our plating vendor found a color-separation defect in batch 2. We can still ship 160 pins by June 24 and the remaining 80 on July 3 at no extra cost. We are also offering either a 12% refund on the full order or a free rush remake of the 80 pins in matte black metal.

Audience/tone: external customer, honest, accountable, calm, professional.

Length: 170-220 words.

Goal: explain the issue clearly, apologize once, present the options, and ask for a decision by 3 p.m. tomorrow so production can proceed.
Model A (grok-4.3) is clearer, more accountable, and more complete: it explicitly notes no extra cost for the delayed 80 pins, explains the issue professionally, and maintains the requested tone and structure. Model B (Phi-4-mini-instruct) gets the sender wrong: its sign-off implies the email comes from Alder & Finch Studio, and it thanks the customer for choosing Alder & Finch Studio, which is the customer's own company.
ops-meeting-summary
Read the meeting notes below, then provide:
1) a 2-bullet summary of the key decisions
2) a JSON object with keys `owner_actions` and `risks`

Rules:
- `owner_actions` must be an array of objects with keys `owner`, `action`, and `due`
- `risks` must be an array of short strings
- Use only information from the notes

Meeting notes:
"""
AtlasOps daily sync — 09:10
- Noor: overnight API error rate dropped from 3.8% to 0.6% after rollback of `ledger-cache` to v1.4.19.
- Evan: root cause still unconfirmed; suspects TTL parsing on nodes with locale `tr_TR`. Will prepare a minimal repro by Thursday.
- Mei: enterprise customer Hartside Medical asked whether yesterday's duplicate invoice emails affected PHI. Answer: no PHI in those emails, only invoice number and billing contact name.
- Decision: keep the rollback in place through Friday; no re-enable before a canary passes on 5 pods for 6 hours.
- Luis: can own the customer-facing incident note draft; first version by today 15:30.
- Noor: add alert for invoice email volume > 3x 7-day hourly baseline, due Friday morning.
- Mei: support queue has 14 open tickets tied to this issue; 3 are marked urgent.
- Decision: sales gets a one-paragraph internal FAQ, owner Mei, by noon today.
- Risk: if finance reruns failed invoices before the dedupe patch lands, some customers may receive another duplicate.
"""
Model A (grok-4.3) follows the requested format and captures the actions and decisions accurately from the notes. Model B (Phi-4-mini-instruct) omits Evan's action, incorrectly assigns Noor's alert action to Mei, and adds extra formatting and text beyond the plain output that was requested.
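To make the shape and ownership requirements concrete, here is a minimal Python sketch of how such an output could be checked. The helper names `check_summary` and `owner_of` are hypothetical, and the `example` payload is one reading of the meeting notes, not the graded reference answer or either model's output.

```python
REQUIRED_ACTION_KEYS = {"owner", "action", "due"}

def check_summary(payload):
    """Return a list of shape problems; an empty list means the shape is valid."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]
    problems = []
    if set(payload) != {"owner_actions", "risks"}:
        problems.append("top-level keys must be owner_actions and risks")
    for i, item in enumerate(payload.get("owner_actions", [])):
        if not isinstance(item, dict) or set(item) != REQUIRED_ACTION_KEYS:
            problems.append(f"owner_actions[{i}] must have exactly owner, action, due")
    for i, risk in enumerate(payload.get("risks", [])):
        if not isinstance(risk, str):
            problems.append(f"risks[{i}] must be a string")
    return problems

def owner_of(payload, keyword):
    """Return the owner of the first action whose text mentions keyword."""
    for item in payload["owner_actions"]:
        if keyword.lower() in item["action"].lower():
            return item["owner"]
    return None

# Assumption: one plausible extraction from the notes, for illustration only.
example = {
    "owner_actions": [
        {"owner": "Evan", "action": "Prepare a minimal repro of the suspected TTL parsing issue", "due": "Thursday"},
        {"owner": "Luis", "action": "Draft the customer-facing incident note", "due": "today 15:30"},
        {"owner": "Noor", "action": "Add alert for invoice email volume > 3x 7-day hourly baseline", "due": "Friday morning"},
        {"owner": "Mei", "action": "Write a one-paragraph internal FAQ for sales", "due": "noon today"},
    ],
    "risks": [
        "Finance rerunning failed invoices before the dedupe patch lands may send another duplicate",
    ],
}
```

With this payload, `check_summary(example)` returns an empty list, and `owner_of(example, "alert")` returns `"Noor"`. A shape check alone would not catch the ownership errors described above, which is why the lookup by keyword is a separate step.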
messy-orders-to-json
Transform the messy order lines below into valid JSON with this exact schema:

{"orders":[{"order_id":"string","customer":"string","ship_city":"string","items":[{"sku":"string","qty":number}],"priority":true,"notes":"string"}]}

Rules:
- One object per order_id
- Combine repeated order lines into one order with multiple items
- `priority` is true only if the line contains HIGH or RUSH
- Preserve item order as listed
- `notes` should contain any leftover text not captured elsewhere; use "" if none
- Trim whitespace; keep city text exactly as written

Messy data:
O-481 | customer=Juno Biolabs | city: Reno | sku AX-9 | qty 4 | HIGH | deliver dock C
O-482|customer=Maple Row Cafe|city:Santa Fe|sku QZ-2|qty 1| gift wrap
O-481 | customer=Juno Biolabs | city: Reno | sku LM-3 | qty 2
order O-483 / customer "Velvet Circuit" / ship_city=Boise / item=TT-7 / quantity=9 / RUSH / call on arrival
O-482 | customer=Maple Row Cafe | city:Santa Fe | sku NN-8 | qty 6
O-484 | customer=Northline Habitat | city: Tulsa | sku RP-1 | qty 12 | note: leave with front desk
Model A (grok-4.3) correctly combines orders, preserves item order, and sets priority accurately; its only minor issue is including the `note:` label in the O-484 notes. Model B (Phi-4-mini-instruct) makes a more serious error by marking O-482 as priority true despite no HIGH or RUSH, and it also drops O-481's leftover note text.
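The rules in this task are mechanical enough to express as a small parser, which makes the two errors above easy to see. The following is a minimal sketch, not a reference solution. It rests on several assumptions not stated in the prompt: an order is priority if any of its lines carries HIGH or RUSH, leftover fragments from multiple lines are joined with `"; "`, and a leading `note:` label is stripped. The name `parse_orders` is hypothetical.

```python
import json
import re

RAW = """\
O-481 | customer=Juno Biolabs | city: Reno | sku AX-9 | qty 4 | HIGH | deliver dock C
O-482|customer=Maple Row Cafe|city:Santa Fe|sku QZ-2|qty 1| gift wrap
O-481 | customer=Juno Biolabs | city: Reno | sku LM-3 | qty 2
order O-483 / customer "Velvet Circuit" / ship_city=Boise / item=TT-7 / quantity=9 / RUSH / call on arrival
O-482 | customer=Maple Row Cafe | city:Santa Fe | sku NN-8 | qty 6
O-484 | customer=Northline Habitat | city: Tulsa | sku RP-1 | qty 12 | note: leave with front desk
"""

# Map each label seen in the messy data to its schema key.
KEYS = {"customer": "customer", "city": "ship_city", "ship_city": "ship_city",
        "sku": "sku", "item": "sku", "qty": "qty", "quantity": "qty"}
FIELD = re.compile(
    r"^(customer|ship_city|city|sku|item|quantity|qty)(?:\s*[=:]\s*|\s+)(.+)$")

def parse_orders(raw):
    orders = {}  # dicts keep insertion order, so orders stay in first-seen order
    for line in raw.strip().splitlines():
        # Fields are separated by "|" or by a slash surrounded by spaces.
        fields = [f.strip() for f in re.split(r"\s*\|\s*|\s+/\s+", line.strip())]
        order_id = re.search(r"O-\d+", fields[0]).group()
        order = orders.setdefault(order_id, {
            "order_id": order_id, "customer": "", "ship_city": "",
            "items": [], "priority": False, "notes": []})
        item = {}
        for field in fields[1:]:
            match = FIELD.match(field)
            if field in ("HIGH", "RUSH"):
                order["priority"] = True
            elif match:
                key = KEYS[match.group(1)]
                value = match.group(2).strip().strip('"')
                if key == "sku":
                    item["sku"] = value
                elif key == "qty":
                    item["qty"] = int(value)
                else:
                    order[key] = value
            elif field:
                # Anything unrecognized is leftover text; drop a "note:" label.
                order["notes"].append(re.sub(r"^note\s*:\s*", "", field))
        order["items"].append(item)
    for order in orders.values():
        order["notes"] = "; ".join(order["notes"])
    return {"orders": list(orders.values())}

if __name__ == "__main__":
    print(json.dumps(parse_orders(RAW), indent=2))
```

Under these assumptions, O-482 comes out with `priority` false because neither of its lines contains HIGH or RUSH, and O-481 keeps `deliver dock C` in `notes`, which are exactly the two points where the weaker output went wrong.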
Matchup powered by OpenRouter.