Head to head: grok-4.3 vs Phi-4-multimodal-instruct


One model handled practical editing-room work cleanly; the other kept tripping over instructions, structure, and basic factual precision. This wasn’t a close stylistic split—it was a decisive test of who can execute under constraints.

grok-4.3 wins this matchup outright, 38.0 to 12.0, and the gap is earned on the boring, important stuff: following instructions, preserving facts, and not inventing behavior. Phi-4-multimodal-instruct doesn’t just lose on polish; it loses on reliability.

The clearest miss is the Python rate-limit fix. grok-4.3 correctly repairs both bugs: it counts timestamps `>= now - window_seconds`, only allows requests when the recent count is strictly below the limit, and records only allowed requests. Phi-4-multimodal-instruct goes off the rails immediately: it's not code-only, it invents future timestamps, uses the wrong cutoff comparison, and fails the stated requirements. That’s not a near miss; that’s a model you can’t safely drop into a real debugging workflow.

On writing tasks, grok-4.3 is simply the sharper editor. In the vendor outage update, it lands the requested tone: calm, accountable, plain English, with the key facts intact. Phi-4-multimodal-instruct is wordier, awkwardly phrased, and slips in an unsupported explanation about duplicate prevention while drifting toward vendor blame-shifting. The meeting-notes summary tells the same story: grok-4.3 captures the actual decision to ship the simplified intake form, the conditional deferral of the printer fix, the correct owner, blocker, and budget change. Phi-4-multimodal-instruct gets vague, assigns the wrong owner, and muddies decision versus status.

The JSON extraction task seals it. grok-4.3 follows the schema, merges lines by `order_id`, preserves item order, and normalizes dates and numeric fields. Phi-4-multimodal-instruct produces output that is largely invalid, polluted with extra text, breaks the top-level schema, alters order IDs, and even misformats a date. If your workload depends on structured output being actually usable, that’s disqualifying.

**Final call: grok-4.3 is the only model in this pairing that looks dependable across code, summarization, operational writing, and schema-bound extraction. Phi-4-multimodal-instruct wasn’t merely second-best here; it was error-prone in ways that make it hard to trust.**

python-rate-limit-fix

Language: Python 3.11

Return code only.

Fix the function below so it enforces a per-user sliding-window rate limit correctly. A request should be allowed only if the user has made fewer than `limit` requests in the last `window_seconds` seconds, counting only timestamps `>= now - window_seconds`. The function should record allowed requests and reject blocked ones without recording them. Keep the same signature. `events` maps user_id to a list of prior request timestamps in ascending order.

```python
def allow_request(events: dict[str, list[int]], user_id: str, now: int, limit: int, window_seconds: int) -> bool:
    history = events.get(user_id, [])
    cutoff = now - window_seconds
    recent = [t for t in history if t > cutoff]
    if len(recent) <= limit:
        recent.append(now)
        events[user_id] = recent
        return True
    events[user_id] = recent
    return False
```

Example expectations:
- If limit=3, window_seconds=60, history=[100,130,160], now=160 => reject
- If history=[100,130,160], now=161 => reject
- If history=[100,130,160], now=221 => allow
- If a request is rejected, its timestamp must not be added

grok-4.3:

```python
def allow_request(events: dict[str, list[int]], user_id: str, now: int, limit: int, window_seconds: int) -> bool:
    history = events.get(user_id, [])
    cutoff = now - window_seconds
    recent = [t for t in history if t >= cutoff]
    if len(recent) < limit:
        recent.append(now)
        events[user_id] = recent
        return True
    events[user_id] = recent
    return False
```

Phi-4-multimodal-instruct:

To fix the given function to correctly enforce the per-user sliding-window rate limit, you need to change two things.

1. When a request is rejected, the elapsed time since the latest request needs to be computed.
2. The rejected request time must not be added to the history.

```python
def allow_request(events: dict[str, list[int]], user_id: str, now: int, limit: int, window_seconds: int) -> bool:
    history = events.get(user_id, [])
    chronological_list = sorted(history)
    # Remove timestamps older than window_seconds
    cutoff = now - window_seconds
    history = [t for t in chronological_list if t > cutoff]
    if len(history) < limit:
        # We may or may not have blocked this very request
        # If this is the only request, we just add now to the history
        # Otherwise, we calculate the timestamp of the next request and add both to history
        if not history:
            recent = [now]
        else:
            next_request_time = history[-1] + window_seconds
            recent = history + [next_request_time]
        events[user_id] = history + [next_request_time] if len(recent) > 1 else history + [now]
        return True
    else:
        # We must send back a 429 within the sliding window
        # We no longer need the event
        return False
```

In your second point, the snippet should be reviewed:

```python
# ... internal logic up to:
if len(history) < limit:
    # We may or may not have blocked this very request
    # If this is the only request, we just add now to the history
    # Otherwise, we calculate the timestamp of the next request and add both to history
    if not history:
        recent = [now]
    else:
        next_request_time = history[-1] + window_seconds
        recent = history + [next_request_time]
# ... internal logic continues.
```

This internal logic was incorrectly touched now when a request is accepted for some strange reason. If this is the case then since the limit is followed (this will contain only two or less items) it shouldn't be rejected, the current implementation will lead to such conception. However, I saw this comment in the code and I thought this is mistakenly placed. So, I commented it. Now it contains two recent timestamp, explained below the comment. This will store previous requests as well as the next request to ensure the constraints are upheld, even though currently, it would not cause you any issues.

grok-4.3 correctly fixes both bugs: it counts timestamps `>= now - window_seconds`, allows a request only when the recent count is strictly less than `limit`, and records only allowed requests. Phi-4-multimodal-instruct's answer is not code-only, changes behavior incorrectly by inventing future timestamps, uses the wrong cutoff comparison, and fails the stated requirements.
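The prompt's expectations can be checked mechanically. The sketch below is a minimal harness, assuming the function body from grok-4.3's answer reproduced as given; the helper `check_expectations` and the user id `"u1"` are hypothetical names introduced only for illustration. It covers the first and third example expectations plus the rule that a rejected request is not recorded. The second example (now=161) is deliberately left out: under the stated `>= now - window_seconds` rule the cutoff is 101, so only 130 and 160 fall inside the window, and that example appears to disagree with the rule as written.

```python
def allow_request(events: dict[str, list[int]], user_id: str, now: int,
                  limit: int, window_seconds: int) -> bool:
    # Body as given in grok-4.3's answer.
    history = events.get(user_id, [])
    cutoff = now - window_seconds
    recent = [t for t in history if t >= cutoff]
    if len(recent) < limit:
        recent.append(now)
        events[user_id] = recent
        return True
    events[user_id] = recent
    return False


def check_expectations() -> list[tuple[int, bool, list[int]]]:
    """Hypothetical helper: try each `now` against a fresh copy of the history."""
    results = []
    for now in (160, 221):
        events = {"u1": [100, 130, 160]}
        allowed = allow_request(events, "u1", now, limit=3, window_seconds=60)
        # Keep the stored history so a rejected request can be shown to leave no trace.
        results.append((now, allowed, events["u1"]))
    return results


if __name__ == "__main__":
    for now, allowed, history in check_expectations():
        print(now, allowed, history)
```

At now=160 all three prior timestamps are still in the window, so the request is rejected and the history is unchanged; at now=221 every prior timestamp has aged out, so the request is allowed and recorded.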

vendor-outage-update

Write a workplace status update for our customers.

Context: We run the invoice delivery service for BrightLedger. From 09:12 to 10:03 ET today, PDF invoices generated from the "June settlement" batch were delayed because a storage node in us-east-2 stopped accepting writes after a failed firmware rollout by our vendor, Norell Systems. No data was lost. The backlog cleared by 10:41 ET. 1,842 invoices were delayed; 117 customers may have received duplicate "processing" notifications, but no duplicate invoices were sent. We have paused the vendor rollout, added an extra queue drain alarm, and will publish a full RCA by tomorrow 3 PM ET.

Audience: affected customers
Tone: calm, accountable, plain English, no blame-shifting
Length: 140-180 words
Include: what happened, customer impact, current status, and next steps

grok-4.3: Today we experienced a delay delivering PDF invoices from the June settlement batch. Between 9:12 and 10:03 ET, a storage node in us-east-2 stopped accepting writes following a firmware update. This caused 1,842 invoices to be held in queue. No data was lost. The backlog was cleared by 10:41 ET, and every invoice has now been delivered successfully. A total of 117 customers received duplicate “processing” notifications; however, no duplicate invoices were sent. We have paused the firmware rollout and added an extra queue-drain alarm to catch similar issues earlier. A full root-cause analysis will be published by 3 PM ET tomorrow. We apologize for the inconvenience and appreciate your patience while we restored normal service.
Phi-4-multimodal-instruct: Dear Valued Customers, We wanted to provide you with an update on today's invoice delivery service at BrightLedger, as you may have received some delay due to technical issues. Initially, we experienced a the problem around 9:12 AM with our servers in the us-east-2 region. A vendor-led firmware update did not go as planned, causing a temporary halt to our storage node writes. To ensure the integrity of your invoices, no data was lost. We successfully cleared the backlog by around 10:41 AM, affecting 1,842 invoices in the process. Some customers received a duplicate "processing" notification. We assure you no duplicates were sent, as we use a unique identifier linked to each email address to prevent this process. To avoid future disruptions, we have paused the vendor's rollout and added extra monitoring to our systems. An in-depth root cause analysis (RCA) will be shared by tomorrow at 3 PM ET. Rest assured, your invoices are being handled with the utmost diligence and efficiency. Kind regards, The BrightLedger Team

grok-4.3 is clearer, more concise, and better matches the requested calm, accountable, plain-English tone while covering the key facts accurately. Phi-4-multimodal-instruct is wordier, includes awkward or incorrect phrasing and an unsupported explanation about duplicate prevention, and leans more toward vendor blame-shifting and generic reassurance.

meeting-notes-summary

Read the meeting notes below. Then provide:
1) a 2-sentence summary
2) a JSON object with keys: decision, owner, due_date, blockers, budget_change

Source notes:
"Ops + Product sync — 14 May, 11:00. Mina said the pilot for the Harborview clinic tablets went better than expected: 26 devices enrolled, 24 active daily. However, check-in times still spike when the label printer reconnects. Devon wants to move the printer fix out of the pilot scope unless support tickets exceed 15 next week. Team agreed to ship the simplified intake form to all three pilot sites on Monday, 19 May. Rhea will send the rollout checklist by Thursday. Finance approved an extra $3,200 for rugged cases after two cracked screens at Harborview. Open issue: legal has not yet approved the revised patient consent text; without that, Westgrove cannot go live."

grok-4.3:

1) The Ops + Product sync on May 14 reviewed the Harborview tablet pilot's strong results (26 enrolled, 24 daily active) alongside ongoing printer reconnection issues, leading Devon to propose moving the printer fix out of scope unless support tickets exceed 15.
2) The team agreed to ship the simplified intake form to all three pilot sites on May 19, with Rhea sending the rollout checklist by Thursday, finance approving an extra $3,200 for cases, while legal's pending consent text approval blocks the Westgrove launch.

```json
{
  "decision": "Ship simplified intake form to all three pilot sites on May 19 and defer printer fix unless tickets exceed 15",
  "owner": "Rhea",
  "due_date": "19 May",
  "blockers": "Legal approval for revised patient consent text",
  "budget_change": 3200
}
```

Phi-4-multimodal-instruct:

1) During the meeting on 14 May, the Harborview clinic's tablet pilot was successful with 26 devices enrolled, meetings agreed to ship simplified intake forms, and an additional $3,200 approved for cases after damaged screens, with pending legal approvals affecting the pilot's full implementation.
2)

```json
{
  "decision": "Mina reports successful tablet pilot, team agrees to ship forms, and extra budget approved for tablet cases.",
  "owner": "Ops + Product sync team",
  "due_date": "19 May for form shipment",
  "blockers": "Legal approval of patient consent text missing",
  "budget_change": "Approved extra $3,200 for rugged tablet cases"
}
```

grok-4.3 is more accurate and complete: it captures the key decision about shipping the simplified intake form, the conditional deferral of the printer fix, the correct owner, blocker, and numeric budget change. Phi-4-multimodal-instruct is vaguer, assigns an incorrect owner, folds status updates into the decision, and is less precise in both the summary and JSON fields.
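Part of this comparison can be checked without judgment calls. The sketch below is a minimal illustration, assuming the two JSON objects exactly as the models returned them; `review_summary` and `REQUIRED_KEYS` are hypothetical names. It reports missing or extra keys and whether `budget_change` is a plain number. Treating a numeric `budget_change` as the preferred form is an assumption drawn from the verdict, not a requirement stated in the prompt, and questions like the correct owner still need a human reader.

```python
import json

# The five keys requested in the prompt.
REQUIRED_KEYS = {"decision", "owner", "due_date", "blockers", "budget_change"}

GROK_JSON = (
    '{"decision": "Ship simplified intake form to all three pilot sites on May 19 '
    'and defer printer fix unless tickets exceed 15", "owner": "Rhea", '
    '"due_date": "19 May", "blockers": "Legal approval for revised patient consent text", '
    '"budget_change": 3200}'
)
PHI_JSON = (
    '{"decision": "Mina reports successful tablet pilot, team agrees to ship forms, '
    'and extra budget approved for tablet cases.", "owner": "Ops + Product sync team", '
    '"due_date": "19 May for form shipment", '
    '"blockers": "Legal approval of patient consent text missing", '
    '"budget_change": "Approved extra $3,200 for rugged tablet cases"}'
)


def review_summary(raw: str) -> dict[str, object]:
    """Hypothetical helper: report key coverage and the type of budget_change."""
    obj = json.loads(raw)
    keys = set(obj)
    budget = obj.get("budget_change")
    return {
        "missing": sorted(REQUIRED_KEYS - keys),
        "extra": sorted(keys - REQUIRED_KEYS),
        # bool is a subclass of int, so exclude it explicitly.
        "numeric_budget": isinstance(budget, (int, float)) and not isinstance(budget, bool),
    }


if __name__ == "__main__":
    print("grok-4.3:", review_summary(GROK_JSON))
    print("Phi-4-multimodal-instruct:", review_summary(PHI_JSON))
```

Both objects carry all five keys, so the difference this check surfaces is the budget field: a number in one answer, a sentence in the other.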

messy-orders-to-json

Transform the messy order lines below into valid JSON only.

Output schema:
{
  "orders": [
    {
      "order_id": "string",
      "customer": "string",
      "items": [{"sku":"string","qty":number,"unit_price":number}],
      "priority": true,
      "ship_by": "YYYY-MM-DD or null"
    }
  ]
}

Rules:
- One object per order_id
- Merge repeated order lines into the same order
- qty must be numbers, unit_price must be numbers without currency symbols
- priority is true only if marked rush/urgent/priority=yes; otherwise false
- Convert dates to YYYY-MM-DD; if blank or TBD, use null
- Preserve item order within each order

Messy data:
Order A-904 | Cust: Luma Dental | SKU=XR-11 qty 2 price $48.50 | ship_by 6/03/2026 | priority=yes
A-904 ; SKU CLN-2 ; qty=1 ; unit price=12 ; note add-on
Order# B-118 / customer "North Basin Café" / item MUG-8 / q: 12 / $4.25 each / ship TBD / rush
B-118 / item LID-8 / q:12 / 0.60
C-771, customer=Ilex School, sku=KIT-RED, qty 3, price 19.99, ship_by 2026-06-01
C-771, sku=PATCH-1, qty 6, price $1.5, urgent: no
D-552 | Cust: Pavo Labs | SKU=GLV-9 qty 100 price $0.18 | ship_by | standard

grok-4.3:

{
  "orders": [
    {
      "order_id": "A-904",
      "customer": "Luma Dental",
      "items": [
        {"sku": "XR-11", "qty": 2, "unit_price": 48.5},
        {"sku": "CLN-2", "qty": 1, "unit_price": 12}
      ],
      "priority": true,
      "ship_by": "2026-06-03"
    },
    {
      "order_id": "B-118",
      "customer": "North Basin Café",
      "items": [
        {"sku": "MUG-8", "qty": 12, "unit_price": 4.25},
        {"sku": "LID-8", "qty": 12, "unit_price": 0.6}
      ],
      "priority": true,
      "ship_by": null
    },
    {
      "order_id": "C-771",
      "customer": "Ilex School",
      "items": [
        {"sku": "KIT-RED", "qty": 3, "unit_price": 19.99},
        {"sku": "PATCH-1", "qty": 6, "unit_price": 1.5}
      ],
      "priority": false,
      "ship_by": "2026-06-01"
    },
    {
      "order_id": "D-552",
      "customer": "Pavo Labs",
      "items": [
        {"sku": "GLV-9", "qty": 100, "unit_price": 0.18}
      ],
      "priority": false,
      "ship_by": null
    }
  ]
}

Phi-4-multimodal-instruct: ```2:2 -3) - *, remove -3 - Use your name "24/ repeat easily Option A-L96: no "24-9/10 -1 -2 -9, (if|12:09,priority -11 -ii -5. -12 { 24-14 if priority = $12 likely -13 -8 { 11 -12, 9-13 -14 -26 - 25 -26 -19 -29 -OR -30 { 17 -28 -10 -11 -30 -31 { 32 -24 -01 -33 -32 34 { 45 -35 15 37 - 40 35: 38 -44 36 -29 41 -02 43 45 -03 46 45 48 -39 { 50 -24 51 { -52 53 -54 56 -77 { 59 { 60 ``` 61 67-88 808 70 89 71 75 79, 72 80 83 84 85 86 89 90 92 95 99 95 100 97 96 98 101 104 107 110,111 212, 213 113 114 215 216 116 126 118 127 129 129 130 130 131 134 134 138 137,140 143,144 145 149 150 146 150,147 151 158 155,158 166 { 169, 168 170,174 176 178 179, # -19,0 183 186 188 189 191,194 193,215 198 200 201, 195 202,206 205 207,208 210 212 213 214 215 216 ; 23,224 226 025224 229 200 231, 234,215, Probem 236, 1 238 232 238 243 250 0 253 254-255 1 2,22,3 4 12 43 275 1234 10, 3 235 245 246 246 247 251 248 =>262 Order 263 { 254 264 266 1,287,2,258 3,4 5, 6 7, 9 10 11 12,13 14 15 24,25 16,17 18 221, 20, 19 223,20 225, 22, 24 223, 33, 34 12,25, motion 35 36,41 38,43 40 44,48 49,100 391,51 43 and 52, 68 } ``` { "order_id": "A904", "customer": "Luma Dental", "items": [ { "sku": "XR-11", "qty": 2, "unit_price": 48.5 } ], "priority": true, "ship_by": "2026-03-06" } { "order_id": "B118", "customer": "North Basin Café", "items": [ { "sku": "MUG-8", "qty": 12, "unit_price": 4.25 }, { "sku": "LID-8", "qty": 12, "unit_price": 0.6 } ], "priority": true, "ship_by": null } { "order_id": "C771", "customer": "Ilex School", "items": [ { "sku": "KIT-RED", "qty": 3, "unit_price": 19.99 }, { "sku": "PATCH-1", "qty": 6, "unit_price": 1.5 } ], "priority": false, "ship_by": "2026-06-01" } { "order_id": "D552", "customer": "Pavo Labs", "items": [ { "sku": "GLV-9", "qty": 100, "unit_price": 0.18 } ], "priority": false, "ship_by": null }

grok-4.3 fully follows the requested JSON schema, correctly merges lines by `order_id`, preserves item order, and normalizes dates and numeric fields. Phi-4-multimodal-instruct's output is largely invalid and contaminated with extraneous text, breaks the required top-level schema, and also alters order IDs and misformats at least one date.
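Several of these failures are the kind a short script can catch before a human reads the output. The following is a rough sketch, not the evaluation actually used for this matchup: `check_orders`, `is_number`, `EXPECTED_IDS`, and `DATE_RE` are hypothetical names, and the expected ids are simply copied from the messy input. It checks that the response parses as a single JSON document, that the top level is an object with an `orders` list, that order ids are unchanged, and that `ship_by`, `priority`, `qty`, and `unit_price` have the requested shapes. A format-only date check like this one cannot tell whether month and day were swapped, so a value such as "2026-03-06" would still pass it.

```python
import json
import re

# Order ids as they appear in the messy input (already in sorted order).
EXPECTED_IDS = ["A-904", "B-118", "C-771", "D-552"]
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_number(value: object) -> bool:
    # bool is a subclass of int, so reject it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_orders(raw: str) -> list[str]:
    """Hypothetical helper: return a list of problems; an empty list means all checks passed."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Extra prose or several concatenated objects end up here.
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict) or not isinstance(doc.get("orders"), list):
        return ['top level must be an object with an "orders" list']

    problems = []
    orders = [o for o in doc["orders"] if isinstance(o, dict)]
    if len(orders) != len(doc["orders"]):
        problems.append("every entry in orders must be an object")
    ids = sorted(str(o.get("order_id")) for o in orders)
    if ids != EXPECTED_IDS:
        problems.append(f"unexpected order ids: {ids}")
    for order in orders:
        oid = order.get("order_id")
        ship_by = order.get("ship_by")
        if ship_by is not None and not (isinstance(ship_by, str) and DATE_RE.fullmatch(ship_by)):
            problems.append(f"{oid}: ship_by must be YYYY-MM-DD or null")
        if not isinstance(order.get("priority"), bool):
            problems.append(f"{oid}: priority must be a boolean")
        items = order.get("items")
        if not isinstance(items, list):
            problems.append(f"{oid}: items must be a list")
            continue
        for item in items:
            if not isinstance(item, dict) or not all(
                is_number(item.get(key)) for key in ("qty", "unit_price")
            ):
                problems.append(f"{oid}: qty and unit_price must be numbers")
    return problems
```

Calling `check_orders` on a response made of several back-to-back objects returns a single "not valid JSON" problem, while a well-formed response whose ids lost their hyphens is flagged under "unexpected order ids".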

Matchup powered by OpenRouter.