Head to head: grok-4.3 vs Phi-4-reasoning

This one wasn’t competitive. grok-4.3 repeatedly did the basic but crucial thing Phi-4-reasoning did not: answer the prompt in the format requested, with usable output instead of meta-commentary.

Illustration: comparative performance and output quality of two AI models.

grok-4.3 wins this matchup in a rout, 38.0 to 4.0, and the reason is almost embarrassingly simple: it completed the assignments. Across all four tasks, grok-4.3 delivered the requested artifact in the requested format; Phi-4-reasoning repeatedly drifted into explanation, reasoning, and prose where the prompt explicitly asked for code, JSON, or a polished message.

The clearest failure came in python-log-redactor. grok-4.3 returned code only, as instructed, and did so with a concise regex-based implementation that preserved surrounding punctuation. Phi-4-reasoning didn’t really attempt the deliverable; it produced explanatory text instead of the function, which is a hard fail on a task where format compliance is the job.

The same pattern held in status-update-delay and meeting-notes-summary. grok-4.3 wrote an executive-ready delay update with the right tone, all required facts, and a single clean ask. It also produced a proper two-sentence summary plus valid JSON with the specified keys for the notes task. Phi-4-reasoning, by contrast, kept lapsing into chain-of-thought-style meta output and disclaimers, exactly the kind of behavior that makes a model unusable in real workflows even when some underlying facts are present.

In messy-orders-to-json, grok-4.3 again did the unglamorous work correctly: valid JSON only, correct schema, normalized values, sorted by order_id ascending. Phi-4-reasoning again missed the core requirement by wrapping the answer in analysis text. That is not a near miss; it is the difference between something a system can consume and something a human has to repair.

Final call: grok-4.3, easily. This wasn’t a nuanced stylistic win; it was a decisive demonstration that instruction-following and output discipline matter more than performative reasoning. Phi-4-reasoning lost because it kept talking about the task instead of doing it.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 38.0 to Phi-4-reasoning's 4.0.

1. python-log-redactor

Practical coding: Python. Return code only. Write a function redact_log(line: str) -> str that prepares app log lines for sharing with vendors. Replace any IPv4 address with [IP] and any email address with [EMAIL], but leave everything else unchanged. Treat IPv4 as four 1–3 digit parts separated by dots; you do not need to validate 0–255 ranges. Preserve punctuation around matches.

Examples:
- db timeout from 10.14.9.3 for maya@northpass.io -> db timeout from [IP] for [EMAIL]
- alert: user=sam+ops@acme.tools, src=172.16.0.12:443 -> alert: user=[EMAIL], src=[IP]:443

Implement just the function, no tests or explanation.

Winner: grok-4.3. It follows the instruction to return code only and provides a concise, correct regex-based implementation that preserves surrounding punctuation. Phi-4-reasoning does not provide the requested function at all and instead outputs explanatory prose, violating both the format and the task requirements.
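For readers who want a concrete picture of the deliverable, here is a minimal sketch of a regex-based redact_log that satisfies the prompt's two examples. It is not either model's actual output, which this write-up does not reproduce. The email pattern is a deliberately loose assumption rather than a full address parser, and the IPv4 pattern follows the prompt's rule of four 1–3 digit parts with no range validation.

```python
import re

# Assumed, deliberately loose email pattern: local part, "@", dotted domain.
# A trailing dot or comma is never consumed, so surrounding punctuation survives.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# IPv4 as the prompt defines it: four 1-3 digit parts separated by dots.
# The lookarounds stop the match from starting or ending inside a longer digit run.
_IPV4 = re.compile(r"(?<!\d)\d{1,3}(?:\.\d{1,3}){3}(?!\d)")


def redact_log(line: str) -> str:
    """Replace email addresses with [EMAIL] and IPv4 addresses with [IP]."""
    line = _EMAIL.sub("[EMAIL]", line)
    return _IPV4.sub("[IP]", line)
```

Because only the matched address is replaced, a port suffix such as :443 or a trailing comma is left exactly where it was.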

2. status-update-delay

Professional writing: Draft a workplace status update for the VP of Operations. Audience: busy executive. Tone: calm, accountable, no blame. Length: 120–150 words. Situation: the warehouse scanner rollout at Ridgeway Fulfillment is slipping by 9 days because 37 of 220 devices arrived with bad batteries and the replacement shipment from Soltera is due Tuesday. Include: what happened, customer impact (none to current orders), mitigation (reassigning 14 spare units from the Phoenix site, extending evening staging shifts through Friday), and the new target go-live date of May 21. End with one clear ask: approval for up to $6,800 in expedited freight if the replacement shipment misses Tuesday.

Winner: grok-4.3. It cleanly follows the prompt with the right executive tone, includes all required facts, ends with a single clear ask, and stays concise. Phi-4-reasoning's response is largely meta-reasoning instead of the requested status update; it violates the format, ignores the audience's needs, and does not deliver an executive-ready draft.

3. meeting-notes-summary

Summarization & extraction: Read the meeting notes below, then provide:
1) a 2-sentence summary
2) a JSON object with keys decision, owner, deadline, risks (array)

Meeting notes:
- AtlasCare mobile app triage, Tues 09:00
- Crash reports spiked after v3.18.2, mostly on Android 12 when opening lab results from push notifications.
- Priya reproduced it on a Pixel 5; stack trace points to a null patientId in the deep-link handler.
- Mateo can patch today, but QA says full regression before Thursday is unrealistic.
- Lena: legal already approved a limited rollback if we keep appointment booking intact.
- Agreed plan: ship a server-side flag by 3pm to disable lab-result push opens on Android only; hotfix app build by Thursday 6pm; rollback only if crash-free rate is still under 99.2% by Friday noon.
- Priya owns the flag change. Mateo owns the hotfix build. Jordan to post a support macro for affected users.
- Risk: analytics dashboard is delayed ~4 hours, so Friday assessment may rely on Play Console plus Zendesk tickets.

Winner: grok-4.3. It directly follows the requested format with a clear 2-sentence summary and a valid JSON object using the specified keys. Phi-4-reasoning's response is mostly chain-of-thought and meta commentary, adds unnecessary disclaimers, and does not cleanly adhere to the prompt despite containing some relevant extracted details.
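The gap between "valid JSON with the specified keys" and "JSON buried in commentary" is easy to check mechanically. The sketch below is a hypothetical shape check, not the method gpt-5.4 used to score the task, which is not described here. The function name check_notes_json is made up for illustration, and it is meant to be run on the JSON portion of a response.

```python
import json

# Keys the prompt asks for in the JSON object.
REQUIRED_KEYS = {"decision", "owner", "deadline", "risks"}


def check_notes_json(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the shape matches the prompt."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Prose wrapped around the JSON ends up here: the text as a whole does not parse.
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top level is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "risks" in obj and not isinstance(obj["risks"], list):
        problems.append("risks is not an array")
    return problems
```

A response that opens with a sentence of reasoning before the object fails at the first step, which is the practical difference between output a pipeline can consume and output a person has to clean up.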

4. messy-orders-to-json

Data wrangling / structured output: Convert the messy order notes below into valid JSON only. Output an object with one key, orders, whose value is an array of objects sorted by order_id ascending. Each object must have exactly these keys: order_id (string), customer (string), sku (string), qty (integer), rush (boolean), ship_by (string in YYYY-MM-DD), notes (string). Rules: trim spaces, normalize SKU to uppercase, interpret rush: yes/y/true as true and no/n/false as false, and use an empty string for missing notes.

Messy data:
#A-104 | cust=Blue Harbor Cafe | sku: tm-44 | qty 6 | rush yes | ship_by 2026/02/07 | notes: leave at rear door
A-102; customer = Nori & Pine ; SKU = qz-9 ; quantity=12 ; rush = n ; ship-by=2026-02-05 ;
order A-111 / customer: Helio Labs / sku kk-210 / qty: 3 / rush: TRUE / ship_by: 2026-02-09 / notes: Attn Mira
ID=A-107, cust=Juniper School, sku=bx-7, qty=25, rush=no, ship_by=2026-02-08, notes=PO 8831

Winner: grok-4.3. It follows the instruction to output valid JSON only, uses the correct schema, normalizes values properly, and sorts by order_id ascending. Phi-4-reasoning does not provide JSON-only output and instead includes extraneous analysis text, so it fails the core formatting requirement.
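To make the normalization rules concrete, here is a rough sketch of how the four records could be converted programmatically. It is an illustration of the task, not a reproduction of grok-4.3's answer. It assumes the records split where the separator style changes, that every order ID looks like a capital letter, a hyphen, and digits, and that a plain string sort is enough because all four IDs have the same width. All helper names are hypothetical.

```python
import json
import re

# The prompt's four records, split where the separator style changes (an assumption).
RAW_RECORDS = [
    "#A-104 | cust=Blue Harbor Cafe | sku: tm-44 | qty 6 | rush yes | ship_by 2026/02/07 | notes: leave at rear door",
    "A-102; customer = Nori & Pine ; SKU = qz-9 ; quantity=12 ; rush = n ; ship-by=2026-02-05 ;",
    "order A-111 / customer: Helio Labs / sku kk-210 / qty: 3 / rush: TRUE / ship_by: 2026-02-09 / notes: Attn Mira",
    "ID=A-107, cust=Juniper School, sku=bx-7, qty=25, rush=no, ship_by=2026-02-08, notes=PO 8831",
]

# Map the key spellings seen in the notes onto the required schema keys.
ALIASES = {
    "cust": "customer", "customer": "customer", "sku": "sku",
    "qty": "qty", "quantity": "qty", "rush": "rush",
    "ship_by": "ship_by", "ship-by": "ship_by", "notes": "notes",
}
KEYS = ("order_id", "customer", "sku", "qty", "rush", "ship_by", "notes")
TRUE_WORDS, FALSE_WORDS = {"yes", "y", "true"}, {"no", "n", "false"}

# Fields are separated by |, ; or , and by / only when it has spaces around it,
# so the slashes inside 2026/02/07 are left alone.
FIELD_SPLIT = re.compile(r"\s*[|;,]\s*|\s+/\s+")
# A field is a key, then "=", ":" or whitespace, then the value.
KEY_VALUE = re.compile(r"([A-Za-z][A-Za-z_-]*)\s*(?:[:=]|\s)\s*(.*)")


def parse_rush(value: str) -> bool:
    word = value.lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"unrecognized rush value: {value!r}")


def parse_record(raw: str) -> dict:
    fields = [f.strip() for f in FIELD_SPLIT.split(raw) if f.strip()]
    # The first field carries the order ID in some form (#A-104, order A-111, ID=A-107).
    order = {"order_id": re.search(r"[A-Z]+-\d+", fields[0]).group(), "notes": ""}
    for field in fields[1:]:
        match = KEY_VALUE.fullmatch(field)
        key = ALIASES[match.group(1).lower()]
        value = match.group(2).strip()
        if key == "sku":
            order[key] = value.upper()
        elif key == "qty":
            order[key] = int(value)
        elif key == "rush":
            order[key] = parse_rush(value)
        elif key == "ship_by":
            order[key] = value.replace("/", "-")  # 2026/02/07 becomes 2026-02-07
        else:
            order[key] = value
    # Emit exactly the required keys, in schema order; a missing key raises KeyError.
    return {key: order[key] for key in KEYS}


def convert_orders(records: list[str]) -> dict:
    orders = sorted((parse_record(r) for r in records), key=lambda o: o["order_id"])
    return {"orders": orders}


if __name__ == "__main__":
    print(json.dumps(convert_orders(RAW_RECORDS), indent=2))
```

The sketch is intentionally narrow: a note containing a comma or a spaced slash would be split incorrectly, which is part of why this kind of task is a reasonable test of a model's judgment rather than a pure parsing exercise.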


See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
