One model handled practical editing-room work cleanly; the other kept tripping over instructions, structure, and basic factual precision. This wasn’t a close stylistic split; it was a decisive test of which model can execute under constraints.
grok-4.3 wins this matchup outright, 38.0 to 12.0, and the gap is earned on the boring, important stuff: following instructions, preserving facts, and not inventing behavior. Phi-4-multimodal-instruct doesn’t just lose on polish; it loses on reliability.
The clearest miss is the Python rate-limit fix. grok-4.3 correctly repairs both bugs: it counts timestamps `>= now - window_seconds`, only allows requests when the recent count is strictly below the limit, and records only allowed requests. Phi-4-multimodal-instruct goes off the rails immediately: it ignores the "Return code only" instruction, invents future timestamps, keeps the wrong cutoff comparison (`t > cutoff`), and fails the stated requirements. That’s not a near miss; that’s a model you can’t safely drop into a real debugging workflow.
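To make the two bugs concrete, here is a minimal sketch of a check harness, not the harness used for this matchup. The names `check_candidate`, `fixed_allow_request`, and `buggy_allow_request` are hypothetical; the two functions simply mirror the repaired logic described above and the original function from the prompt. It covers the first, third, and fourth listed expectations. The second one (`now=161`) is left out because, read literally, the `>= now - window_seconds` rule puts the timestamp 100 outside the window, so the count would be 2 rather than 3.

```python
from collections.abc import Callable

# Signature shared by every candidate: (events, user_id, now, limit, window_seconds) -> bool
AllowFn = Callable[[dict[str, list[int]], str, int, int, int], bool]


def fixed_allow_request(events, user_id, now, limit, window_seconds) -> bool:
    """Repaired logic: inclusive cutoff, strict comparison against the limit."""
    history = events.get(user_id, [])
    cutoff = now - window_seconds
    recent = [t for t in history if t >= cutoff]
    if len(recent) < limit:
        recent.append(now)
        events[user_id] = recent
        return True
    events[user_id] = recent
    return False


def buggy_allow_request(events, user_id, now, limit, window_seconds) -> bool:
    """Original logic from the prompt: exclusive cutoff and an off-by-one limit check."""
    history = events.get(user_id, [])
    cutoff = now - window_seconds
    recent = [t for t in history if t > cutoff]
    if len(recent) <= limit:
        recent.append(now)
        events[user_id] = recent
        return True
    events[user_id] = recent
    return False


def check_candidate(fn: AllowFn) -> list[str]:
    """Return a list of failed expectations (empty means all checks passed)."""
    failures: list[str] = []

    # Expectation 1: three requests already sit in [100, 160], so now=160 is rejected.
    events = {"u": [100, 130, 160]}
    if fn(events, "u", 160, 3, 60):
        failures.append("now=160 should be rejected")
    # Expectation 4: a rejected request must not be recorded.
    if events["u"].count(160) != 1:
        failures.append("rejected timestamp was recorded")

    # Expectation 3: by now=221 every old timestamp is below the cutoff of 161.
    events = {"u": [100, 130, 160]}
    if not fn(events, "u", 221, 3, 60):
        failures.append("now=221 should be allowed")
    elif events["u"][-1:] != [221]:
        failures.append("allowed timestamp was not recorded")

    return failures
```

Calling `check_candidate(fixed_allow_request)` should return an empty list, while `check_candidate(buggy_allow_request)` should report both the wrongly allowed request and the recorded timestamp.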
On writing tasks, grok-4.3 is simply the sharper editor. In the vendor outage update, it lands the requested tone: calm, accountable, plain English, with the key facts intact. Phi-4-multimodal-instruct is wordier, awkwardly phrased, and slips in an unsupported explanation about duplicate prevention while drifting toward vendor blame-shifting. The meeting-notes summary tells the same story: grok-4.3 captures the actual decision to ship the simplified intake form and the conditional deferral of the printer fix, along with the correct owner, blocker, and budget change. Phi-4-multimodal-instruct gets vague, assigns the wrong owner, and muddies decision versus status.
The JSON extraction task seals it. grok-4.3 follows the schema, merges lines by `order_id`, preserves item order, and normalizes dates and numeric fields. Phi-4-multimodal-instruct produces output that is largely invalid and polluted with extra text. It also breaks the top-level schema by omitting the `orders` wrapper, alters order IDs (`A904` instead of `A-904`), and misformats a date (`2026-03-06` instead of `2026-06-03`). If your workload depends on structured output being actually usable, that’s disqualifying.
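A structural check along these lines is cheap to automate. The sketch below is an illustration under stated assumptions, not the scoring code behind this matchup: `validate_orders`, `is_number`, and `DATE_RE` are hypothetical names, and the checks are one reading of the schema in the prompt (top-level `orders` list, string IDs, numeric `qty` and `unit_price`, boolean `priority`, and `ship_by` as `YYYY-MM-DD` or null).

```python
import json
import re

# Shape check only: four digits, two digits, two digits.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_number(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_orders(text: str) -> list[str]:
    """Return a list of schema problems found in a model's raw output."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict) or not isinstance(doc.get("orders"), list):
        return ['top level must be an object with an "orders" list']

    problems: list[str] = []
    seen_ids: set[str] = set()
    for i, order in enumerate(doc["orders"]):
        where = f"orders[{i}]"
        if not isinstance(order, dict):
            problems.append(f"{where} is not an object")
            continue
        order_id = order.get("order_id")
        if not isinstance(order_id, str):
            problems.append(f"{where}.order_id must be a string")
        elif order_id in seen_ids:
            # The rules ask for one object per order_id.
            problems.append(f"{where} repeats order_id {order_id}")
        else:
            seen_ids.add(order_id)
        if not isinstance(order.get("customer"), str):
            problems.append(f"{where}.customer must be a string")
        if not isinstance(order.get("priority"), bool):
            problems.append(f"{where}.priority must be true or false")
        ship_by = order.get("ship_by")
        if ship_by is not None and not (
            isinstance(ship_by, str) and DATE_RE.fullmatch(ship_by)
        ):
            problems.append(f"{where}.ship_by must be YYYY-MM-DD or null")
        items = order.get("items")
        if not isinstance(items, list):
            problems.append(f"{where}.items must be a list")
            continue
        for j, item in enumerate(items):
            ok = (
                isinstance(item, dict)
                and isinstance(item.get("sku"), str)
                and is_number(item.get("qty"))
                and is_number(item.get("unit_price"))
            )
            if not ok:
                problems.append(
                    f"{where}.items[{j}] needs sku, numeric qty, numeric unit_price"
                )
    return problems
```

A check like this only covers structure. It would flag stray text or a missing `orders` wrapper, but it cannot tell that `A904` should have been `A-904` or that a month and day were swapped; catching those still requires comparing the output against the source lines.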
**Final call: grok-4.3 is the only model in this pairing that looks dependable across code, summarization, operational writing, and schema-bound extraction. Phi-4-multimodal-instruct wasn’t merely second-best here; it was error-prone in ways that make it hard to trust.**
python-rate-limit-fix
Language: Python 3.11
Return code only.
Fix the function below so it enforces a per-user sliding-window rate limit correctly. A request should be allowed only if the user has made fewer than `limit` requests in the last `window_seconds` seconds, counting only timestamps `>= now - window_seconds`. The function should record allowed requests and reject blocked ones without recording them. Keep the same signature. `events` maps user_id to a list of prior request timestamps in ascending order.
```python
def allow_request(events: dict[str, list[int]], user_id: str, now: int, limit: int, window_seconds: int) -> bool:
    history = events.get(user_id, [])
    cutoff = now - window_seconds
    recent = [t for t in history if t > cutoff]
    if len(recent) <= limit:
        recent.append(now)
        events[user_id] = recent
        return True
    events[user_id] = recent
    return False
```
Example expectations:
- If limit=3, window_seconds=60, history=[100,130,160], now=160 => reject
- If history=[100,130,160], now=161 => reject
- If history=[100,130,160], now=221 => allow
- If a request is rejected, its timestamp must not be added
grok-4.3:
```python
def allow_request(events: dict[str, list[int]], user_id: str, now: int, limit: int, window_seconds: int) -> bool:
    history = events.get(user_id, [])
    cutoff = now - window_seconds
    recent = [t for t in history if t >= cutoff]
    if len(recent) < limit:
        recent.append(now)
        events[user_id] = recent
        return True
    events[user_id] = recent
    return False
```
Phi-4-multimodal-instruct: To fix the given function to correctly enforce the per-user sliding-window rate limit, you need to change two things.
1. When a request is rejected, the elapsed time since the latest request needs to be computed.
2. The rejected request time must not be added to the history.
```python
def allow_request(events: dict[str, list[int]], user_id: str, now: int, limit: int, window_seconds: int) -> bool:
    history = events.get(user_id, [])
    chronological_list = sorted(history)
    # Remove timestamps older than window_seconds
    cutoff = now - window_seconds
    history = [t for t in chronological_list if t > cutoff]
    if len(history) < limit:
        # We may or may not have blocked this very request
        # If this is the only request, we just add now to the history
        # Otherwise, we calculate the timestamp of the next request and add both to history
        if not history:
            recent = [now]
        else:
            next_request_time = history[-1] + window_seconds
            recent = history + [next_request_time]
        events[user_id] = history + [next_request_time] if len(recent) > 1 else history + [now]
        return True
    else:
        # We must send back a 429 within the sliding window
        # We no longer need the event
        return False
```
In your second point, the snippet should be reviewed:
```python
    # ... internal logic up to:
    if len(history) < limit:
        # We may or may not have blocked this very request
        # If this is the only request, we just add now to the history
        # Otherwise, we calculate the timestamp of the next request and add both to history
        if not history:
            recent = [now]
        else:
            next_request_time = history[-1] + window_seconds
            recent = history + [next_request_time]
    # ... internal logic continues.
```
This internal logic was incorrectly touched now when a request is accepted for some strange reason. If this is the case then since the limit is followed (this will contain only two or less items) it shouldn't be rejected, the current implementation will lead to such conception. However, I saw this comment in the code and I thought this is mistakenly placed. So, I commented it. Now it contains two recent timestamp, explained below the comment.
This will store previous requests as well as the next request to ensure the constraints are upheld, even though currently, it would not cause you any issues.
A correctly fixes both bugs: it counts timestamps >= now - window_seconds, allows only when the recent count is strictly less than limit, and records only allowed requests. B is not code-only, changes behavior incorrectly by inventing future timestamps, uses the wrong cutoff comparison, and fails the stated requirements.
vendor-outage-update
Write a workplace status update for our customers.
Context: We run the invoice delivery service for BrightLedger. From 09:12 to 10:03 ET today, PDF invoices generated from the "June settlement" batch were delayed because a storage node in us-east-2 stopped accepting writes after a failed firmware rollout by our vendor, Norell Systems. No data was lost. The backlog cleared by 10:41 ET. 1,842 invoices were delayed; 117 customers may have received duplicate "processing" notifications, but no duplicate invoices were sent. We have paused the vendor rollout, added an extra queue drain alarm, and will publish a full RCA by tomorrow 3 PM ET.
Audience: affected customers
Tone: calm, accountable, plain English, no blame-shifting
Length: 140-180 words
Include: what happened, customer impact, current status, and next steps
grok-4.3: Today we experienced a delay delivering PDF invoices from the June settlement batch. Between 9:12 and 10:03 ET, a storage node in us-east-2 stopped accepting writes following a firmware update. This caused 1,842 invoices to be held in queue.
No data was lost. The backlog was cleared by 10:41 ET, and every invoice has now been delivered successfully. A total of 117 customers received duplicate “processing” notifications; however, no duplicate invoices were sent.
We have paused the firmware rollout and added an extra queue-drain alarm to catch similar issues earlier. A full root-cause analysis will be published by 3 PM ET tomorrow.
We apologize for the inconvenience and appreciate your patience while we restored normal service.
Phi-4-multimodal-instruct: Dear Valued Customers,
We wanted to provide you with an update on today's invoice delivery service at BrightLedger, as you may have received some delay due to technical issues.
Initially, we experienced a the problem around 9:12 AM with our servers in the us-east-2 region. A vendor-led firmware update did not go as planned, causing a temporary halt to our storage node writes. To ensure the integrity of your invoices, no data was lost. We successfully cleared the backlog by around 10:41 AM, affecting 1,842 invoices in the process.
Some customers received a duplicate "processing" notification. We assure you no duplicates were sent, as we use a unique identifier linked to each email address to prevent this process.
To avoid future disruptions, we have paused the vendor's rollout and added extra monitoring to our systems. An in-depth root cause analysis (RCA) will be shared by tomorrow at 3 PM ET.
Rest assured, your invoices are being handled with the utmost diligence and efficiency.
Kind regards,
The BrightLedger Team
A is clearer, more concise, and better matches the requested calm, accountable, plain-English tone while covering the key facts accurately. B is wordier, includes awkward/incorrect phrasing and an unsupported explanation about duplicate prevention, and leans more toward vendor blame-shifting and generic reassurance.
meeting-notes-summary
Read the meeting notes below. Then provide:
1) a 2-sentence summary
2) a JSON object with keys: decision, owner, due_date, blockers, budget_change
Source notes:
"Ops + Product sync — 14 May, 11:00. Mina said the pilot for the Harborview clinic tablets went better than expected: 26 devices enrolled, 24 active daily. However, check-in times still spike when the label printer reconnects. Devon wants to move the printer fix out of the pilot scope unless support tickets exceed 15 next week. Team agreed to ship the simplified intake form to all three pilot sites on Monday, 19 May. Rhea will send the rollout checklist by Thursday. Finance approved an extra $3,200 for rugged cases after two cracked screens at Harborview. Open issue: legal has not yet approved the revised patient consent text; without that, Westgrove cannot go live."
grok-4.3: 1) The Ops + Product sync on May 14 reviewed the Harborview tablet pilot's strong results (26 enrolled, 24 daily active) alongside ongoing printer reconnection issues, leading Devon to propose moving the printer fix out of scope unless support tickets exceed 15. 2) The team agreed to ship the simplified intake form to all three pilot sites on May 19, with Rhea sending the rollout checklist by Thursday, finance approving an extra $3,200 for cases, while legal's pending consent text approval blocks the Westgrove launch.
```json
{"decision": "Ship simplified intake form to all three pilot sites on May 19 and defer printer fix unless tickets exceed 15", "owner": "Rhea", "due_date": "19 May", "blockers": "Legal approval for revised patient consent text", "budget_change": 3200}
```
Phi-4-multimodal-instruct: 1) During the meeting on 14 May, the Harborview clinic's tablet pilot was successful with 26 devices enrolled, meetings agreed to ship simplified intake forms, and an additional $3,200 approved for cases after damaged screens, with pending legal approvals affecting the pilot's full implementation.
2)
```json
{
"decision": "Mina reports successful tablet pilot, team agrees to ship forms, and extra budget approved for tablet cases.",
"owner": "Ops + Product sync team",
"due_date": "19 May for form shipment",
"blockers": "Legal approval of patient consent text missing",
"budget_change": "Approved extra $3,200 for rugged tablet cases"
}
```
Model A is more accurate and complete: it captures the key decision about shipping the simplified intake form, the conditional deferral of the printer fix, the correct owner, blocker, and numeric budget change. Model B is vaguer, assigns an incorrect owner, folds status updates into the decision, and is less precise in both the summary and JSON fields.
messy-orders-to-json
Transform the messy order lines below into valid JSON only.
Output schema:
{
"orders": [
{
"order_id": "string",
"customer": "string",
"items": [{"sku":"string","qty":number,"unit_price":number}],
"priority": true,
"ship_by": "YYYY-MM-DD or null"
}
]
}
Rules:
- One object per order_id
- Merge repeated order lines into the same order
- qty must be numbers, unit_price must be numbers without currency symbols
- priority is true only if marked rush/urgent/priority=yes; otherwise false
- Convert dates to YYYY-MM-DD; if blank or TBD, use null
- Preserve item order within each order
Messy data:
Order A-904 | Cust: Luma Dental | SKU=XR-11 qty 2 price $48.50 | ship_by 6/03/2026 | priority=yes
A-904 ; SKU CLN-2 ; qty=1 ; unit price=12 ; note add-on
Order# B-118 / customer "North Basin Café" / item MUG-8 / q: 12 / $4.25 each / ship TBD / rush
B-118 / item LID-8 / q:12 / 0.60
C-771, customer=Ilex School, sku=KIT-RED, qty 3, price 19.99, ship_by 2026-06-01
C-771, sku=PATCH-1, qty 6, price $1.5, urgent: no
D-552 | Cust: Pavo Labs | SKU=GLV-9 qty 100 price $0.18 | ship_by | standard
grok-4.3: {
"orders": [
{
"order_id": "A-904",
"customer": "Luma Dental",
"items": [
{"sku": "XR-11", "qty": 2, "unit_price": 48.5},
{"sku": "CLN-2", "qty": 1, "unit_price": 12}
],
"priority": true,
"ship_by": "2026-06-03"
},
{
"order_id": "B-118",
"customer": "North Basin Café",
"items": [
{"sku": "MUG-8", "qty": 12, "unit_price": 4.25},
{"sku": "LID-8", "qty": 12, "unit_price": 0.6}
],
"priority": true,
"ship_by": null
},
{
"order_id": "C-771",
"customer": "Ilex School",
"items": [
{"sku": "KIT-RED", "qty": 3, "unit_price": 19.99},
{"sku": "PATCH-1", "qty": 6, "unit_price": 1.5}
],
"priority": false,
"ship_by": "2026-06-01"
},
{
"order_id": "D-552",
"customer": "Pavo Labs",
"items": [
{"sku": "GLV-9", "qty": 100, "unit_price": 0.18}
],
"priority": false,
"ship_by": null
}
]
}
Phi-4-multimodal-instruct: ```2:2
-3)
- *, remove
-3 - Use your name "24/ repeat easily
Option A-L96: no "24-9/10
-1
-2
-9, (if|12:09,priority
-11
-ii
-5. -12
{
24-14 if priority = $12 likely
-13
-8
{
11
-12, 9-13
-14
-26
- 25
-26
-19
-29
-OR
-30
{
17
-28
-10
-11
-30
-31
{
32
-24
-01
-33
-32
34
{
45
-35
15
37 - 40
35: 38
-44
36
-29
41
-02
43
45
-03
46
45
48
-39
{
50
-24
51
{
-52
53
-54
56
-77
{
59
{
60
```
61
67-88
808
70
89
71
75
79, 72
80
83
84
85
86
89
90
92
95
99
95
100
97
96
98
101
104
107
110,111
212, 213
113
114
215
216
116
126
118
127
129
129
130
130
131
134
134
138
137,140 143,144
145
149
150
146
150,147
151
158
155,158
166
{
169, 168
170,174
176
178
179, # -19,0
183
186
188
189
191,194
193,215
198
200
201, 195
202,206
205
207,208
210
212
213
214
215
216
; 23,224
226
025224
229
200
231,
234,215, Probem 236, 1
238 232
238
243
250
0
253
254-255
1
2,22,3
4
12
43
275
1234
10, 3
235
245
246
246
247
251
248
=>262
Order 263
{
254
264
266
1,287,2,258
3,4
5, 6
7, 9
10
11
12,13
14
15
24,25
16,17
18
221, 20, 19
223,20
225, 22, 24
223, 33, 34
12,25, motion 35
36,41
38,43
40
44,48
49,100
391,51
43
and 52, 68
}
```
{
"order_id": "A904",
"customer": "Luma Dental",
"items": [
{
"sku": "XR-11",
"qty": 2,
"unit_price": 48.5
}
],
"priority": true,
"ship_by": "2026-03-06"
}
{
"order_id": "B118",
"customer": "North Basin Café",
"items": [
{
"sku": "MUG-8",
"qty": 12,
"unit_price": 4.25
},
{
"sku": "LID-8",
"qty": 12,
"unit_price": 0.6
}
],
"priority": true,
"ship_by": null
}
{
"order_id": "C771",
"customer": "Ilex School",
"items": [
{
"sku": "KIT-RED",
"qty": 3,
"unit_price": 19.99
},
{
"sku": "PATCH-1",
"qty": 6,
"unit_price": 1.5
}
],
"priority": false,
"ship_by": "2026-06-01"
}
{
"order_id": "D552",
"customer": "Pavo Labs",
"items": [
{
"sku": "GLV-9",
"qty": 100,
"unit_price": 0.18
}
],
"priority": false,
"ship_by": null
}
Model A fully follows the requested JSON schema, correctly merges lines by order_id, preserves item order, and normalizes dates and numeric fields. Model B is largely invalid and contaminated with extraneous text, breaks the required top-level schema, and also alters order IDs and misformats at least one date.
Matchup powered by OpenRouter.