Head to head: grok-4.3 vs Phi-4-multimodal-instruct

One model handled practical editing-room work cleanly; the other kept tripping over instructions, structure, and basic factual precision. This wasn’t a close stylistic split—it was a decisive test of who can execute under constraints.

By RuntimeWire · Published Jun 28, 2026, 9:07am CT


grok-4.3 wins this matchup outright, 38.0 to 12.0, and the gap is earned on the boring, important stuff: following instructions, preserving facts, and not inventing behavior. Phi-4-multimodal-instruct doesn’t just lose on polish; it loses on reliability.

The clearest miss is the Python rate-limit fix. grok-4.3 correctly repairs both bugs: it counts timestamps >= now - window_seconds, allows a request only when the recent count is strictly below the limit, and records only allowed requests. Phi-4-multimodal-instruct goes off the rails immediately: its answer is not code-only, it invents future timestamps, it uses the wrong cutoff comparison, and it fails the stated requirements. That’s not a near miss; that’s a model you can’t safely drop into a real debugging workflow.

On writing tasks, grok-4.3 is simply the sharper editor. In the vendor outage update, it lands the requested tone: calm, accountable, plain English, with the key facts intact. Phi-4-multimodal-instruct is wordier, awkwardly phrased, and slips in an unsupported explanation about duplicate prevention while drifting toward vendor blame-shifting. The meeting-notes summary tells the same story: grok-4.3 captures the actual decision to ship the simplified intake form, the conditional deferral of the printer fix, the correct owner, blocker, and budget change. Phi-4-multimodal-instruct gets vague, assigns the wrong owner, and muddies decision versus status.

The JSON extraction task seals it. grok-4.3 follows the schema, merges lines by order_id, preserves item order, and normalizes dates and numeric fields. Phi-4-multimodal-instruct produces output that is largely invalid, polluted with extra text, breaks the top-level schema, alters order IDs, and even misformats a date. If your workload depends on structured output being actually usable, that’s disqualifying.

Final call: grok-4.3 is the only model in this pairing that looks dependable across code, summarization, operational writing, and schema-bound extraction. Phi-4-multimodal-instruct wasn’t merely second-best here; it was error-prone in ways that make it hard to trust.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 38.0 to Phi-4-multimodal-instruct's 12.0.

1. python-rate-limit-fix

Language: Python 3.11
Return code only.

Fix the function below so it enforces a per-user sliding-window rate limit correctly. A request should be allowed only if the user has made fewer than limit requests in the last window_seconds seconds, counting only timestamps >= now - window_seconds. The function should record allowed requests and reject blocked ones without recording them. Keep the same signature. events maps user_id to a list of prior request timestamps in ascending order.

    def allow_request(events: dict[str, list[int]], user_id: str, now: int, limit: int, window_seconds: int) -> bool:
        history = events.get(user_id, [])
        cutoff = now - window_seconds
        recent = [t for t in history if t > cutoff]
        if len(recent) <= limit:
            recent.append(now)
            events[user_id] = recent
            return True
        events[user_id] = recent
        return False

Example expectations:
- If limit=3, window_seconds=60, history=[100,130,160], now=160 => reject
- If history=[100,130,160], now=161 => reject
- If history=[100,130,160], now=221 => allow
- If a request is rejected, its timestamp must not be added

Winner: grok-4.3. It correctly fixes both bugs: it counts timestamps >= now - window_seconds, allows a request only when the recent count is strictly less than limit, and records only allowed requests. Phi-4-multimodal-instruct's answer is not code-only, changes behavior incorrectly by inventing future timestamps, uses the wrong cutoff comparison, and fails the stated requirements.
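Neither model's output is reproduced in this article, so the following is only a minimal sketch of a fix consistent with the verdict above, not grok-4.3's actual answer. It keeps the prompt's signature and changes the two comparisons the verdict names: the cutoff becomes inclusive (>=) and the limit check becomes strict (<).

```python
def allow_request(
    events: dict[str, list[int]],
    user_id: str,
    now: int,
    limit: int,
    window_seconds: int,
) -> bool:
    history = events.get(user_id, [])
    cutoff = now - window_seconds
    # Inclusive comparison: a timestamp exactly at the cutoff still counts.
    recent = [t for t in history if t >= cutoff]
    # Store the pruned history whether or not the request is allowed.
    events[user_id] = recent
    # Strict comparison: allow only while the user is below the limit.
    if len(recent) < limit:
        recent.append(now)  # record allowed requests only
        return True
    return False


events = {"u1": [100, 130, 160]}
print(allow_request(events, "u1", 160, 3, 60))  # three timestamps in window
print(allow_request(events, "u1", 221, 3, 60))  # earlier timestamps have expired
```

With the prompt's first example (now=160) all three timestamps fall inside the window and the request is rejected without being recorded; at now=221 the window is empty and the request is allowed. Under the rule as literally stated, the now=161 example has only two timestamps in the window (130 and 160), so this sketch allows it, although the prompt lists that case as a reject.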

2. vendor-outage-update

Write a workplace status update for our customers.

Context: We run the invoice delivery service for BrightLedger. From 09:12 to 10:03 ET today, PDF invoices generated from the "June settlement" batch were delayed because a storage node in us-east-2 stopped accepting writes after a failed firmware rollout by our vendor, Norell Systems. No data was lost. The backlog cleared by 10:41 ET. 1,842 invoices were delayed; 117 customers may have received duplicate "processing" notifications, but no duplicate invoices were sent. We have paused the vendor rollout, added an extra queue drain alarm, and will publish a full RCA by tomorrow 3 PM ET.

Audience: affected customers
Tone: calm, accountable, plain English, no blame-shifting
Length: 140-180 words
Include: what happened, customer impact, current status, and next steps

Winner: grok-4.3. Its update is clearer, more concise, and better matches the requested calm, accountable, plain-English tone while covering the key facts accurately. Phi-4-multimodal-instruct's update is wordier, includes awkward or incorrect phrasing and an unsupported explanation about duplicate prevention, and leans more toward vendor blame-shifting and generic reassurance.

3. meeting-notes-summary

Read the meeting notes below. Then provide:
1) a 2-sentence summary
2) a JSON object with keys: decision, owner, due_date, blockers, budget_change

Source notes:
"Ops + Product sync - 14 May, 11:00. Mina said the pilot for the Harborview clinic tablets went better than expected: 26 devices enrolled, 24 active daily. However, check-in times still spike when the label printer reconnects. Devon wants to move the printer fix out of the pilot scope unless support tickets exceed 15 next week. Team agreed to ship the simplified intake form to all three pilot sites on Monday, 19 May. Rhea will send the rollout checklist by Thursday. Finance approved an extra $3,200 for rugged cases after two cracked screens at Harborview. Open issue: legal has not yet approved the revised patient consent text; without that, Westgrove cannot go live."

Winner: grok-4.3. Its answer is more accurate and complete: it captures the key decision to ship the simplified intake form, the conditional deferral of the printer fix, the correct owner, the blocker, and the numeric budget change. Phi-4-multimodal-instruct is vaguer, assigns an incorrect owner, folds status updates into the decision, and is less precise in both the summary and the JSON fields.

4. messy-orders-to-json

Transform the messy order lines below into valid JSON only.

Output schema:

    {
      "orders": [
        {
          "order_id": "string",
          "customer": "string",
          "items": [{"sku":"string","qty":number,"unit_price":number}],
          "priority": true,
          "ship_by": "YYYY-MM-DD or null"
        }
      ]
    }

Rules:
- One object per order_id
- Merge repeated order lines into the same order
- qty must be numbers, unit_price must be numbers without currency symbols
- priority is true only if marked rush/urgent/priority=yes; otherwise false
- Convert dates to YYYY-MM-DD; if blank or TBD, use null
- Preserve item order within each order

Messy data:

    Order A-904 | Cust: Luma Dental | SKU=XR-11 qty 2 price $48.50 | ship_by 6/03/2026 | priority=yes
    A-904 ; SKU CLN-2 ; qty=1 ; unit price=12 ; note add-on
    Order# B-118 / customer "North Basin Café" / item MUG-8 / q: 12 / $4.25 each / ship TBD / rush
    B-118 / item LID-8 / q:12 / 0.60
    C-771, customer=Ilex School, sku=KIT-RED, qty 3, price 19.99, ship_by 2026-06-01
    C-771, sku=PATCH-1, qty 6, price $1.5, urgent: no
    D-552 | Cust: Pavo Labs | SKU=GLV-9 qty 100 price $0.18 | ship_by | standard

Winner: grok-4.3. Its output fully follows the requested JSON schema, correctly merges lines by order_id, preserves item order, and normalizes dates and numeric fields. Phi-4-multimodal-instruct's output is largely invalid and contaminated with extraneous text, breaks the required top-level schema, alters order IDs, and misformats at least one date.

See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
