grok-4.3 Beats Phi-4 by Doing the Actual Job
Phi-4 vs grok-4.3
grok-4.3 wins this head-to-head 37.9 to 28.1 because it is consistently more obedient, more exact, and less prone to self-sabotage on basic formatting rules. Phi-4 is competent, but in this matchup it repeatedly turns acceptable work into a loss by adding what wasn’t asked for and missing critical specifics.
The scoreline isn’t subtle: **grok-4.3 takes it 37.9 to 28.1**. More importantly, the pattern across tasks is clear. grok-4.3 doesn’t just sound polished; it executes the prompt more faithfully, with fewer avoidable mistakes and better attention to the details that actually decide these evaluations.

The cleanest example is **python-log-redactor**. grok-4.3 follows the instruction to return **code only** and uses a stronger IPv4 regex that constrains octets to **0–255**. Phi-4 fumbles both parts of the assignment: it adds explanatory text, example usage, and print output, and its regex overmatches invalid IP addresses. That’s not a stylistic nitpick; it’s a direct miss on both format and correctness.

The same discipline shows up in **messy-orders-to-json**. Both models parsed and normalized the orders correctly, but Phi-4 still lost because it wrapped the answer in a Markdown code fence instead of returning **raw JSON only**. grok-4.3 gave valid JSON and moved on. In production-style tasks, that difference matters: one output drops into a pipeline, the other creates cleanup work.

On writing tasks, grok-4.3 was simply sharper. In **vendor-delay-email**, it was more accountable and precise, explicitly saying the issue was caught **before packing** and avoiding Phi-4’s less accurate phrasing about a "minor leaking" problem. In **meeting-notes-summary**, grok-4.3 packed more of the actual notes into two sentences, including the likely cause, affected area, owner, deadline, and risks; Phi-4 was decent but noticeably less specific and left out details like the likely cause and the receipt scan screen.

**Final call: grok-4.3 wins because it combines better instruction-following with better factual precision. Phi-4 is good enough to stay in the conversation, but here it repeatedly loses on the unforced errors a stronger model avoids.**
python-log-redactor
Language: Python 3. Write a function `redact_log(line: str) -> str` for an API gateway. Replace any IPv4 address with `[IP]` and any bearer token with `[TOKEN]`. A bearer token is the word `Bearer` followed by one or more spaces and then a token made of letters, digits, `_`, `-`, or `.`. Preserve all other text exactly. Example: `2026-04-11 client=198.51.100.23 auth="Bearer sk_live-9A.bc_77"` becomes `2026-04-11 client=[IP] auth="[TOKEN]"`. Return code only.
grok-4.3 (Model B) follows the instruction to return code only and provides a more correct IPv4 regex that limits octets to 0-255. Phi-4 (Model A) includes explanatory text, example usage, and print output, violating the format requirement, and its IPv4 pattern overmatches invalid addresses.
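To make the octet constraint concrete, here is a minimal sketch of what a 0-255-bounded redactor could look like. It is not either model's actual output; the helper names `_OCTET`, `_IPV4`, and `_BEARER` and the boundary lookarounds are assumptions made for illustration.

```python
import re

# One IPv4 octet, restricted to 0-255 (ASCII digits only).
_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Four octets joined by dots. The lookarounds (an illustrative choice) stop the
# pattern from matching inside a longer run such as 1.2.3.4.5 or 1234.1.1.1.
_IPV4 = re.compile(
    r"(?<![0-9])(?<![0-9]\.)"
    + _OCTET
    + r"(?:\." + _OCTET + r"){3}"
    + r"(?!\.?[0-9])"
)

# "Bearer", one or more spaces, then letters, digits, "_", "-", or ".".
_BEARER = re.compile(r"\bBearer +[A-Za-z0-9_.\-]+")


def redact_log(line: str) -> str:
    # Redact tokens first so dotted digits inside a token are never
    # mistaken for an IP address.
    line = _BEARER.sub("[TOKEN]", line)
    return _IPV4.sub("[IP]", line)
```

A looser pattern such as `\d{1,3}(\.\d{1,3}){3}` would also redact strings like `999.1.1.1`, which is the kind of overmatching the verdict penalizes.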
vendor-delay-email
Draft an email to a retail buyer at Northline Outfitters. Context: we promised 420 units of the HarborMist insulated bottle by Friday, but a cap-molding defect means only 260 units will ship Friday and the remaining 160 will ship next Wednesday. QA found the issue before packing; no safety risk, just leaking under inversion. Offer either a split shipment at our cost or a full delay with a 7% invoice credit. Tone: professional, accountable, calm. Length: 140-190 words.
grok-4.3 (Model B) is more precise and accountable, explicitly noting the issue was caught before packing and keeping the tone calm and professional while staying concise. Phi-4 (Model A) is solid but slightly less exact, adds a less accurate phrase ('minor leaking'), and is a bit more generic and padded.
meeting-notes-summary
Read these meeting notes and provide: (1) a 2-sentence summary, and (2) a JSON object with keys `decision`, `owner`, `deadline`, and `risks`. Meeting notes:
- Team: PulseDesk mobile app
- Nora: crash rate on Android 14 spiked after 5.18.0; mostly on the receipt scan screen
- Imran: traced likely cause to the new image cropper library; rollback patch is ready
- Decision discussed: ship 5.18.1 rollback today, postpone smart-crop feature to next sprint
- Ava: App Store listing text already mentions smart-crop; marketing needs updated copy by 3 PM
- Nora owns release coordination and store metadata changes
- Deadline: submit build by 1 PM today so review can clear before evening campaign
- Risk: if review stalls, pause the 6 PM ad spend; also support needs a macro for affected users
grok-4.3 (Model B) is more complete and faithful to the notes: its 2-sentence summary includes the cause, affected area, owner, deadline, and risks. Phi-4 (Model A) is solid, but its summary is less specific and omits key details like the likely cause and receipt scan screen.
messy-orders-to-json
Convert the messy order lines below into valid JSON: an array of objects sorted by `order_id` ascending. Schema for each object: `order_id` (string), `customer` (string), `sku` (string), `qty` (integer), `ship_method` (one of `ground`,`air`,`pickup`), `priority` (boolean). Normalize customer names to title case and SKUs to uppercase. Interpret `Y/yes/true` as true and `N/no/false` as false. Output JSON only.

order 1048 | cust=mina velasquez | sku qn-44 | qty 3 | ship: Air | priority yes
#1046; customer "DEON PARK"; item=lb-2; quantity=1; method=ground; priority=N
ID 1049 / customer: r. patel / sku: tx9 / qty: 12 / ship method: PICKUP / priority: true
order_id=1047, cust=alix romero, sku=MN-88, qty=2, ship=Ground, priority=Y
Both outputs correctly parse, normalize, and sort the orders, but Phi-4 (Model A) violates the instruction to output JSON only by wrapping the array in a Markdown code fence. grok-4.3 (Model B) returns valid raw JSON and fully adheres to the prompt.
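The practical cost of a code fence can be shown with a short sketch: a strict consumer that feeds the response straight to a JSON parser rejects fenced output until something strips the fence. The functions `is_raw_json` and `strip_code_fence` are hypothetical helpers written for this illustration, not part of the evaluation harness.

```python
import json
import re

# Matches a response that is entirely one Markdown code fence, with an
# optional language tag after the opening backticks.
_FENCE = re.compile(r"^\s*`{3}[A-Za-z0-9_-]*\s*\n(.*?)\n?\s*`{3}\s*$", re.DOTALL)


def is_raw_json(text: str) -> bool:
    """Return True if the whole response parses as JSON with no cleanup."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def strip_code_fence(text: str) -> str:
    """Return the fence's contents, or the text unchanged if it is not fenced."""
    match = _FENCE.match(text)
    return match.group(1) if match else text
```

Under these assumptions, a raw array passes `is_raw_json` directly, while a fenced one passes only after `strip_code_fence`, which is the extra cleanup step a downstream pipeline would have to own.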
Matchup powered by OpenRouter.