grok-4.3 Beats Phi-4 by Doing the Actual Job
grok-4.3 wins this head-to-head 37.9 to 28.1 because it is consistently more obedient, more exact, and less prone to self-sabotage on basic formatting rules. Phi-4 is competent, but in this matchup it repeatedly turns acceptable work into a loss by adding what wasn’t asked for and missing critical specifics.
By RuntimeWire

The scoreline isn’t subtle: grok-4.3 takes it 37.9 to 28.1. More importantly, the pattern across tasks is clear. grok-4.3 doesn’t just sound polished; it executes the prompt more faithfully, with fewer avoidable mistakes and better attention to the details that actually decide these evaluations.
The cleanest example is python-log-redactor. grok-4.3 follows the instruction to return code only and uses a stronger IPv4 regex that constrains octets to 0–255. Phi-4 fumbles both parts of the assignment: it adds explanatory text, example usage, and print output, and its regex overmatches invalid IP addresses. That’s not a stylistic nitpick; it’s a direct miss on both format and correctness.
The same discipline shows up in messy-orders-to-json. Both models parsed and normalized the orders correctly, but Phi-4 still lost because it wrapped the answer in a Markdown code fence instead of returning raw JSON only. grok-4.3 gave valid JSON and moved on. In production-style tasks, that difference matters: one output drops into a pipeline, the other creates cleanup work.
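To make the pipeline point concrete, here is a minimal sketch of what a downstream consumer would see. It assumes the consumer passes the model output straight to Python's standard json parser; the helper name parses_as_json and the sample payload are hypothetical, not taken from either model's output.

```python
import json

# A Markdown fence marker, built from single backticks so this example stays readable.
FENCE = "`" * 3


def parses_as_json(text: str) -> bool:
    """Return True if the text can be parsed as JSON with no cleanup."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical sample payload, used only for illustration.
raw_output = '[{"order_id": "1046", "qty": 1}]'
fenced_output = f"{FENCE}json\n{raw_output}\n{FENCE}"

print(parses_as_json(raw_output))     # raw JSON parses directly
print(parses_as_json(fenced_output))  # the fenced version needs stripping first
```

The payload is identical in both cases. Only the wrapper differs, and that alone is enough to make a strict parser reject the fenced version.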
On writing tasks, grok-4.3 was simply sharper. In vendor-delay-email, it was more accountable and precise, explicitly saying the issue was caught before packing and avoiding Phi-4’s less accurate phrasing about a "minor leaking" problem. In meeting-notes-summary, grok-4.3 packed more of the actual notes into two sentences, including the likely cause, affected area, owner, deadline, and risks; Phi-4 was decent but noticeably less specific and left out details like the receipt scan screen.
Final call: grok-4.3 wins because it combines better instruction-following with better factual precision. Phi-4 is good enough to stay in the conversation, but here it repeatedly loses on the unforced errors a stronger model avoids.
How they were tested
We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. Phi-4 scored 28.1 to grok-4.3's 37.9.
1. python-log-redactor
Language: Python 3. Write a function redact_log(line: str) -> str for an API gateway. Replace any IPv4 address with [IP] and any bearer token with [TOKEN]. A bearer token is the word Bearer followed by one or more spaces and then a token made of letters, digits, "_", "-", or ".". Preserve all other text exactly. Example: 2026-04-11 client=198.51.100.23 auth="Bearer sk_live-9A.bc_77" becomes 2026-04-11 client=[IP] auth="[TOKEN]". Return code only.
Winner: grok-4.3. It follows the instruction to return code only and provides a more correct IPv4 regex that limits octets to 0-255. Phi-4 includes explanatory text, example usage, and print output, violating the format requirement, and its IPv4 pattern overmatches invalid addresses.
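For readers who want to see what the octet constraint looks like in practice, the following is one possible sketch of the task, not either model's actual submission. The pattern names are hypothetical, and LOOSE_IPV4 is only an assumed example of the kind of pattern that overmatches; it is not Phi-4's regex.

```python
import re

# Each octet is limited to 0-255. [0-9] is used instead of \d to stay ASCII-only.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
STRICT_IPV4 = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")

# Assumed example of an overmatching pattern: any 1-3 digits per octet.
LOOSE_IPV4 = re.compile(r"\b[0-9]{1,3}(?:\.[0-9]{1,3}){3}\b")

# The word Bearer, one or more spaces, then letters, digits, "_", "-", or ".".
BEARER_TOKEN = re.compile(r"\bBearer +[A-Za-z0-9_.-]+")


def redact_log(line: str) -> str:
    """Replace bearer tokens with [TOKEN] and IPv4 addresses with [IP]."""
    # Tokens go first so a token that happens to look like an IP is not split.
    line = BEARER_TOKEN.sub("[TOKEN]", line)
    return STRICT_IPV4.sub("[IP]", line)


example = '2026-04-11 client=198.51.100.23 auth="Bearer sk_live-9A.bc_77"'
print(redact_log(example))
```

With the strict pattern, a string such as 999.300.1.2 is left untouched, while the loose pattern would treat it as an address and redact it.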
2. vendor-delay-email
Draft an email to a retail buyer at Northline Outfitters. Context: we promised 420 units of the HarborMist insulated bottle by Friday, but a cap-molding defect means only 260 units will ship Friday and the remaining 160 will ship next Wednesday. QA found the issue before packing; no safety risk, just leaking under inversion. Offer either a split shipment at our cost or a full delay with a 7% invoice credit. Tone: professional, accountable, calm. Length: 140-190 words.
Winner: grok-4.3. It is more precise and accountable, explicitly noting that the issue was caught before packing and keeping the tone calm and professional while staying concise. Phi-4 is solid but slightly less exact, adds a less accurate phrase ('minor leaking'), and is a bit more generic and padded.
3. meeting-notes-summary
Read these meeting notes and provide: (1) a 2-sentence summary, and (2) a JSON object with keys decision, owner, deadline, and risks.
Meeting notes:
- Team: PulseDesk mobile app
- Nora: crash rate on Android 14 spiked after 5.18.0; mostly on the receipt scan screen
- Imran: traced likely cause to the new image cropper library; rollback patch is ready
- Decision discussed: ship 5.18.1 rollback today, postpone smart-crop feature to next sprint
- Ava: App Store listing text already mentions smart-crop; marketing needs updated copy by 3 PM
- Nora owns release coordination and store metadata changes
- Deadline: submit build by 1 PM today so review can clear before evening campaign
- Risk: if review stalls, pause the 6 PM ad spend; also support needs a macro for affected users
Winner: grok-4.3. It is more complete and faithful to the notes: its 2-sentence summary includes the cause, affected area, owner, deadline, and risks. Phi-4 is solid, but its summary is less specific and omits key details like the likely cause and receipt scan screen.
4. messy-orders-to-json
Convert the messy order lines below into valid JSON: an array of objects sorted by order_id ascending. Schema for each object: order_id (string), customer (string), sku (string), qty (integer), ship_method (one of ground, air, pickup), priority (boolean). Normalize customer names to title case and SKUs to uppercase. Interpret Y/yes/true as true and N/no/false as false. Output JSON only.
order 1048 | cust=mina velasquez | sku qn-44 | qty 3 | ship: Air | priority yes
#1046; customer "DEON PARK"; item=lb-2; quantity=1; method=ground; priority=N
ID 1049 / customer: r. patel / sku: tx9 / qty: 12 / ship method: PICKUP / priority: true
order_id=1047, cust=alix romero, sku=MN-88, qty=2, ship=Ground, priority=Y
Winner: grok-4.3. Both outputs correctly parse, normalize, and sort the orders, but Phi-4 violates the instruction to output JSON only by wrapping the array in a Markdown code fence. grok-4.3 returns valid raw JSON and fully adheres to the prompt.
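As a rough illustration of the normalization this task asks for, here is one way to do it in code. This is a sketch under assumptions: the field patterns are tailored to the four lines in this prompt, the names FIELD_PATTERNS and parse_order are hypothetical, and neither model was asked to write a parser.

```python
import json
import re

# One pattern per schema field, covering the key spellings seen in the prompt.
FIELD_PATTERNS = {
    "order_id": r"(?:order_id|order|id|#)\s*[=:]?\s*(\d+)",
    "customer": r'(?:customer|cust)\s*[=:]?\s*"?([^|;/,"]+)',
    "sku": r"(?:sku|item)\s*[=:]?\s*([A-Za-z0-9-]+)",
    "qty": r"(?:quantity|qty)\s*[=:]?\s*(\d+)",
    "ship_method": r"(?:ship method|ship|method)\s*[=:]?\s*(ground|air|pickup)",
    "priority": r"priority\s*[=:]?\s*(\w+)",
}
TRUE_WORDS = {"y", "yes", "true"}
FALSE_WORDS = {"n", "no", "false"}

MESSY_LINES = [
    "order 1048 | cust=mina velasquez | sku qn-44 | qty 3 | ship: Air | priority yes",
    '#1046; customer "DEON PARK"; item=lb-2; quantity=1; method=ground; priority=N',
    "ID 1049 / customer: r. patel / sku: tx9 / qty: 12 / ship method: PICKUP / priority: true",
    "order_id=1047, cust=alix romero, sku=MN-88, qty=2, ship=Ground, priority=Y",
]


def parse_order(line: str) -> dict:
    """Extract and normalize one order; raise ValueError on a missing field."""
    raw = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, line, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"missing {name} in: {line!r}")
        raw[name] = match.group(1).strip()
    flag = raw["priority"].lower()
    if flag not in TRUE_WORDS | FALSE_WORDS:
        raise ValueError(f"unrecognized priority value: {flag!r}")
    return {
        "order_id": raw["order_id"],           # kept as a string per the schema
        "customer": raw["customer"].title(),   # title case
        "sku": raw["sku"].upper(),             # uppercase
        "qty": int(raw["qty"]),
        "ship_method": raw["ship_method"].lower(),
        "priority": flag in TRUE_WORDS,
    }


# Sort numerically on order_id, then emit raw JSON with no wrapper text.
orders = sorted((parse_order(line) for line in MESSY_LINES),
                key=lambda order: int(order["order_id"]))
output = json.dumps(orders, indent=2)
print(output)
```

The last step is the one that decided this round: the result is printed as bare JSON, with nothing before or after the array.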
See every prompt and the full side-by-side outputs in the interactive Head-to-Head.