grok-4.3 Beats Phi-4 by Doing the Actual Job

grok-4.3 wins this head-to-head 37.9 to 28.1 because it is consistently more obedient, more exact, and less prone to self-sabotage on basic formatting rules. Phi-4 is competent, but in this matchup it repeatedly turns acceptable work into a loss by adding what wasn’t asked for and missing critical specifics.

By RuntimeWire · Published Jun 6, 2026, 10:21pm CT

Illustration: a visual comparison of two AI models, emphasizing grok-4.3's precision and adherence to rules versus Phi-4's errors and deviations.

The scoreline isn’t subtle: grok-4.3 takes it 37.9 to 28.1. More importantly, the pattern across tasks is clear. grok-4.3 doesn’t just sound polished; it executes the prompt more faithfully, with fewer avoidable mistakes and closer attention to the details that actually decide these evaluations.

The cleanest example is python-log-redactor. grok-4.3 follows the instruction to return code only and uses a stronger IPv4 regex that constrains octets to 0–255. Phi-4 fumbles both parts of the assignment: it adds explanatory text, example usage, and print output, and its regex overmatches invalid IP addresses. That’s not a stylistic nitpick; it’s a direct miss on both format and correctness.
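To make the regex difference concrete, here is a minimal sketch of the two styles of IPv4 pattern. It is not either model's actual output: the names LOOSE_IPV4, STRICT_IPV4, OCTET, and BEARER are illustrative assumptions, and the bearer-token pattern simply follows the task wording.

```python
import re

# Loose style (assumed for illustration): any 1-3 digits per octet,
# so an invalid address such as 999.999.999.999 still matches.
LOOSE_IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

# Strict style: each octet is limited to 0-255.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
STRICT_IPV4 = re.compile(rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?![\d.])")

# "Bearer", one or more spaces, then letters, digits, _, -, or .
BEARER = re.compile(r"Bearer +[A-Za-z0-9_.\-]+")


def redact_log(line: str) -> str:
    """Replace bearer tokens with [TOKEN] and valid IPv4 addresses with [IP]."""
    # Tokens go first so digits inside a token are never treated as an IP.
    line = BEARER.sub("[TOKEN]", line)
    return STRICT_IPV4.sub("[IP]", line)


if __name__ == "__main__":
    sample = '2026-04-11 client=198.51.100.23 auth="Bearer sk_live-9A.bc_77"'
    print(redact_log(sample))
```

Run against the example line from the prompt, this sketch produces the expected redacted form, while the strict pattern leaves an out-of-range string such as 999.999.999.999 untouched.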

The same discipline shows up in messy-orders-to-json. Both models parsed and normalized the orders correctly, but Phi-4 still lost because it wrapped the answer in a Markdown code fence instead of returning raw JSON only. grok-4.3 gave valid JSON and moved on. In production-style tasks, that difference matters: one output drops into a pipeline, the other creates cleanup work.
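A short sketch shows why the fence matters to a downstream consumer. The payload here is a made-up fragment, not either model's output, and parses_as_json is a hypothetical helper.

```python
import json

# Markdown fence marker, built from single backticks for readability.
FENCE = "`" * 3


def parses_as_json(text: str) -> bool:
    """Return True if the text is accepted by a strict JSON parser as-is."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical payloads: the same array, once raw and once fenced.
raw_output = '[{"order_id": "1046", "priority": false}]'
fenced_output = f"{FENCE}json\n{raw_output}\n{FENCE}"

if __name__ == "__main__":
    print("raw parses:", parses_as_json(raw_output))
    print("fenced parses:", parses_as_json(fenced_output))
```

The raw string loads directly; the fenced one is rejected until something strips the wrapper, which is the cleanup work described above.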

On writing tasks, grok-4.3 was simply sharper. In vendor-delay-email, it was more accountable and precise, explicitly saying the issue was caught before packing and avoiding Phi-4’s less accurate phrasing about a "minor leaking" problem. In meeting-notes-summary, grok-4.3 packed more of the actual notes into two sentences, including the likely cause, affected area, owner, deadline, and risks; Phi-4 was decent but noticeably less specific and left out details like the likely cause and the receipt scan screen.

Final call: grok-4.3 wins because it combines better instruction-following with better factual precision. Phi-4 is good enough to stay in the conversation, but here it repeatedly loses on the unforced errors a stronger model avoids.

How they were tested

We ran four fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 37.9 to Phi-4's 28.1.

1. python-log-redactor

Language: Python 3. Write a function redact_log(line: str) -> str for an API gateway. Replace any IPv4 address with [IP] and any bearer token with [TOKEN]. A bearer token is the word Bearer followed by one or more spaces and then a token made of letters, digits, _, -, or .. Preserve all other text exactly. Example: 2026-04-11 client=198.51.100.23 auth="Bearer sk_live-9A.bc_77" becomes 2026-04-11 client=[IP] auth="[TOKEN]". Return code only.

Winner: grok-4.3. It follows the instruction to return code only and provides a more correct IPv4 regex that limits octets to 0-255. Phi-4 includes explanatory text, example usage, and print output, violating the format requirement, and its IPv4 pattern overmatches invalid addresses.

2. vendor-delay-email

Draft an email to a retail buyer at Northline Outfitters. Context: we promised 420 units of the HarborMist insulated bottle by Friday, but a cap-molding defect means only 260 units will ship Friday and the remaining 160 will ship next Wednesday. QA found the issue before packing; no safety risk, just leaking under inversion. Offer either a split shipment at our cost or a full delay with a 7% invoice credit. Tone: professional, accountable, calm. Length: 140-190 words.

Winner: grok-4.3. It is more precise and accountable, explicitly noting the issue was caught before packing and keeping the tone calm and professional while staying concise. Phi-4 is solid but slightly less exact, adds a less accurate phrase ('minor leaking'), and is a bit more generic and padded.

3. meeting-notes-summary

Read these meeting notes and provide: (1) a 2-sentence summary, and (2) a JSON object with keys decision, owner, deadline, and risks.

Meeting notes:
- Team: PulseDesk mobile app
- Nora: crash rate on Android 14 spiked after 5.18.0; mostly on the receipt scan screen
- Imran: traced likely cause to the new image cropper library; rollback patch is ready
- Decision discussed: ship 5.18.1 rollback today, postpone smart-crop feature to next sprint
- Ava: App Store listing text already mentions smart-crop; marketing needs updated copy by 3 PM
- Nora owns release coordination and store metadata changes
- Deadline: submit build by 1 PM today so review can clear before evening campaign
- Risk: if review stalls, pause the 6 PM ad spend; also support needs a macro for affected users

Winner: grok-4.3. It is more complete and faithful to the notes: its 2-sentence summary includes the cause, affected area, owner, deadline, and risks. Phi-4 is solid, but its summary is less specific and omits key details like the likely cause and receipt scan screen.

4. messy-orders-to-json

Convert the messy order lines below into valid JSON: an array of objects sorted by order_id ascending. Schema for each object: order_id (string), customer (string), sku (string), qty (integer), ship_method (one of ground,air,pickup), priority (boolean). Normalize customer names to title case and SKUs to uppercase. Interpret Y/yes/true as true and N/no/false as false. Output JSON only.

order 1048 | cust=mina velasquez | sku qn-44 | qty 3 | ship: Air | priority yes
#1046; customer "DEON PARK"; item=lb-2; quantity=1; method=ground; priority=N
ID 1049 / customer: r. patel / sku: tx9 / qty: 12 / ship method: PICKUP / priority: true
order_id=1047, cust=alix romero, sku=MN-88, qty=2, ship=Ground, priority=Y

Winner: grok-4.3. Both outputs correctly parse, normalize, and sort the orders, but Phi-4 violates the instruction to output JSON only by wrapping the array in a Markdown code fence. grok-4.3 returns valid raw JSON and fully adheres to the prompt.
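For readers who want to see the normalization rules from the prompt in code, here is a rough sketch. It assumes the fields have already been pulled out of each messy line into a dict of strings (the extraction step is omitted), uses only two of the four orders, and is not either model's output; normalize_order and the extracted list are hypothetical names.

```python
import json

TRUE_WORDS = {"y", "yes", "true"}
FALSE_WORDS = {"n", "no", "false"}
SHIP_METHODS = {"ground", "air", "pickup"}


def normalize_order(fields: dict) -> dict:
    """Apply the prompt's schema and normalization rules to one extracted order."""
    flag = fields["priority"].strip().lower()
    if flag not in TRUE_WORDS | FALSE_WORDS:
        raise ValueError(f"unrecognized priority value: {fields['priority']!r}")
    ship = fields["ship_method"].strip().lower()
    if ship not in SHIP_METHODS:
        raise ValueError(f"unrecognized ship method: {fields['ship_method']!r}")
    return {
        "order_id": str(fields["order_id"]).strip(),
        "customer": fields["customer"].strip().title(),
        "sku": fields["sku"].strip().upper(),
        "qty": int(fields["qty"]),
        "ship_method": ship,
        "priority": flag in TRUE_WORDS,
    }


# Assumed result of the (omitted) extraction step for two of the order lines.
extracted = [
    {"order_id": "1048", "customer": "mina velasquez", "sku": "qn-44",
     "qty": "3", "ship_method": "Air", "priority": "yes"},
    {"order_id": "1046", "customer": "DEON PARK", "sku": "lb-2",
     "qty": "1", "ship_method": "ground", "priority": "N"},
]

# Sort numerically on order_id, assuming every id is an integer string.
orders = sorted((normalize_order(f) for f in extracted),
                key=lambda o: int(o["order_id"]))

if __name__ == "__main__":
    # Printing only the serialized array keeps the output as raw JSON.
    print(json.dumps(orders, indent=2))
```

The final print emits nothing but the serialized array, which is the "JSON only" behavior the task rewards.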

See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
