One model was consistently cleaner, sharper, and better at following instructions; the other kept making small but consequential mistakes. Across coding, transformation, summarization, and business writing, this matchup wasn’t especially close.
grok-4.3 wins this one on discipline as much as capability. The 37.0 to 22.0 aggregate isn’t just a scoring gap; it reflects a pattern. grok-4.3 reliably did the asked-for thing, while Llama-4-Scout-17B-16E-Instruct too often drifted into "close enough" territory. In head-to-head evals, that’s how you lose.
The clearest separation showed up in the structured-output and code tasks. On **python-shift-merge**, grok-4.3 handled the actual requirements: merge overlapping and back-to-back intervals per employee, return a new sorted list, and leave the input untouched. Llama-4-Scout-17B-16E-Instruct mutated the input, padded the answer with example/print code despite a code-only brief, and introduced a real logic bug: its `timedelta` handling merges intervals separated by a one-minute gap. On **messy-orders-to-json**, the story was similar: grok-4.3 returned valid JSON only, correctly normalized and sorted; Llama-4-Scout-17B-16E-Instruct wrapped the right answer in extra explanation and code, which is still a format failure.
The writing tasks weren’t a blowout, but grok-4.3 was still plainly better. In **vendor-delay-email**, grok-4.3 struck the right retailer-facing tone: candid without wobbling, specific about the QA issue and revised window, clear that pricing and allocation were preserved, and explicit in asking partners not to publish the old date. Llama-4-Scout-17B-16E-Instruct was competent, but it read more like a template and softened into vaguer reassurance about business impact.
On **meeting-notes-summary**, grok-4.3 again showed better editorial judgment. It followed the requested format, captured the legal approval, invite cap, and explicit risks, and named a plausible owner based on the notes. Llama-4-Scout-17B-16E-Instruct was mostly serviceable, but adding commentary outside the requested output and failing to assign an owner when Ezra was the obvious candidate is exactly the kind of miss that makes summaries less useful in practice.
**Final call: grok-4.3 is the more dependable model and the easy winner here. Llama-4-Scout-17B-16E-Instruct had moments of competence, but too many avoidable instruction-following and correctness errors turned this into a comfortable victory rather than a real contest.**
python-shift-merge
Practical coding — Python. Return code only. Write a function `merge_shift_intervals(rows)` that takes a list of dicts like `{"employee":"Mina","start":"2025-04-12 22:30","end":"2025-04-13 01:15"}` and returns a new list where overlapping or back-to-back intervals for the same employee are merged. Timestamps are in `%Y-%m-%d %H:%M` 24-hour format and already in the same timezone. Preserve one output dict per merged interval with the same keys. Sort output by employee, then start. Example: Mina 22:30-23:00 and Mina 23:00-01:15 should merge; Mina and Jorge should never merge with each other.
grok-4.3:
```python
from datetime import datetime
from collections import defaultdict

def merge_shift_intervals(rows):
    employee_intervals = defaultdict(list)
    for row in rows:
        emp = row["employee"]
        start = datetime.strptime(row["start"], "%Y-%m-%d %H:%M")
        end = datetime.strptime(row["end"], "%Y-%m-%d %H:%M")
        employee_intervals[emp].append([start, end])
    merged = []
    for emp in sorted(employee_intervals.keys()):
        intervals = sorted(employee_intervals[emp], key=lambda x: x[0])
        if not intervals:
            continue
        current_start, current_end = intervals[0]
        for i in range(1, len(intervals)):
            start, end = intervals[i]
            if start <= current_end:
                if end > current_end:
                    current_end = end
            else:
                merged.append({
                    "employee": emp,
                    "start": current_start.strftime("%Y-%m-%d %H:%M"),
                    "end": current_end.strftime("%Y-%m-%d %H:%M")
                })
                current_start, current_end = start, end
        merged.append({
            "employee": emp,
            "start": current_start.strftime("%Y-%m-%d %H:%M"),
            "end": current_end.strftime("%Y-%m-%d %H:%M")
        })
    return merged
```
Llama-4-Scout-17B-16E-Instruct:
```python
from datetime import datetime, timedelta

def merge_shift_intervals(rows):
    if not rows:
        return []
    # Sort by employee and start time
    rows.sort(key=lambda x: (x['employee'], x['start']))
    merged_rows = [rows[0]]
    for row in rows[1:]:
        last_merged_row = merged_rows[-1]
        if row['employee'] == last_merged_row['employee']:
            last_end = datetime.strptime(last_merged_row['end'], '%Y-%m-%d %H:%M')
            current_start = datetime.strptime(row['start'], '%Y-%m-%d %H:%M')
            if last_end >= current_start - timedelta(minutes=1):
                last_merged_row['end'] = max(last_merged_row['end'], row['end'], key=lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M'))
            else:
                merged_rows.append(row)
        else:
            merged_rows.append(row)
    return merged_rows

# Example usage:
rows = [
    {"employee":"Mina","start":"2025-04-12 22:30","end":"2025-04-13 00:00"},
    {"employee":"Mina","start":"2025-04-13 00:00","end":"2025-04-13 01:15"},
    {"employee":"Jorge","start":"2025-04-12 23:00","end":"2025-04-13 02:00"},
    {"employee":"Mina","start":"2025-04-13 03:00","end":"2025-04-13 04:00"},
]
print(merge_shift_intervals(rows))
```
grok-4.3 correctly merges overlapping and back-to-back intervals per employee, returns a new sorted list, and avoids mutating the input. Llama-4-Scout-17B-16E-Instruct mutates the input list, includes extra example/print code despite the code-only task, and incorrectly merges intervals with a 1-minute gap due to its `timedelta` logic.
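The one-minute-gap issue can be reproduced in isolation. The sketch below is not taken from either response: `touches_or_overlaps` and `merges_with_one_minute_slack` are hypothetical helper names that restate the two merge conditions, so they can be compared on a back-to-back pair and on a pair separated by one minute.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

def touches_or_overlaps(prev_end: str, next_start: str) -> bool:
    """Condition matching the prompt: merge on overlap or exact back-to-back."""
    return datetime.strptime(next_start, FMT) <= datetime.strptime(prev_end, FMT)

def merges_with_one_minute_slack(prev_end: str, next_start: str) -> bool:
    """Condition of the form `last_end >= current_start - timedelta(minutes=1)`."""
    slack_start = datetime.strptime(next_start, FMT) - timedelta(minutes=1)
    return datetime.strptime(prev_end, FMT) >= slack_start

# (end of the earlier shift, start of the later shift)
back_to_back = ("2025-04-12 23:00", "2025-04-12 23:00")
one_minute_gap = ("2025-04-12 23:00", "2025-04-12 23:01")

print(touches_or_overlaps(*back_to_back))             # True: should merge
print(touches_or_overlaps(*one_minute_gap))           # False: a real gap
print(merges_with_one_minute_slack(*one_minute_gap))  # True: merged anyway
```

Under the prompt's wording, only the first pair should merge; the slack version also merges the second, which is the behavior the verdict penalizes.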
vendor-delay-email
Professional writing — Draft a workplace email to retail partners. Situation: your company, Northline Audio, is delaying shipment of the new Ravelin Mini DAC because a capacitor batch failed thermal testing. Original ship date was 14 August; new estimated ship window is 26-30 August. Partners with confirmed POs will keep their pricing and allocation. Ask them not to publish the old date. Audience: small independent hi-fi retailers. Tone: candid, steady, professional. Length: 140-180 words.
grok-4.3: Subject: Ravelin Mini DAC Shipment Delay Notification
Dear Valued Retail Partners,
I'm reaching out to share an update regarding the upcoming Ravelin Mini DAC. During our pre-shipment quality assurance process, one batch of capacitors failed thermal testing. Rather than risk any compromise in long-term performance, we have sourced a new batch and initiated re-testing of the assemblies.
Consequently, we must adjust the original ship date of 14 August. Our current estimate for initial shipments is now between 26 and 30 August.
Partners with confirmed purchase orders can rest assured that your pricing and product allocations remain unchanged. We are committed to honoring every confirmed order without alteration.
To prevent any misinformation in the market, please refrain from publishing or advertising the original 14 August date. We will provide a further update once units are en route.
Should you have questions about your order status or this timeline, please contact your account representative directly.
Thank you for your continued partnership.
Best regards,
Northline Audio
Llama-4-Scout-17B-16E-Instruct: Subject: Update on Ravelin Mini DAC Shipment
Dear valued retail partners,
We are writing to inform you that, due to an unexpected setback, we are delaying the shipment of the new Ravelin Mini DAC. Unfortunately, a batch of capacitors supplied for the product failed thermal testing, and we need to source an alternative. As a result, we are adjusting our shipping schedule.
The original ship date of 14 August will no longer be met. We now estimate that shipments will commence between 26-30 August. We apologize for any inconvenience this may cause and appreciate your understanding in this matter.
For those partners who have confirmed purchase orders, please be assured that your pricing and allocation will be maintained. We value your commitment to Northline Audio and want to ensure that your business is not impacted.
In the meantime, we kindly request that you refrain from publishing the original ship date to avoid any potential miscommunication. We will provide a formal update with the new shipping schedule and product details as soon as possible.
Thank you for your continued support, and we look forward to getting the Ravelin Mini DAC to you soon.
Best regards,
[Your Name]
Northline Audio Team
grok-4.3 is slightly stronger on tone and specificity: it is candid, steady, and clearly explains the QA issue, revised window, preserved pricing/allocation, and request not to publish the old date. Llama-4-Scout-17B-16E-Instruct is solid but more generic, adds a less precise reassurance about business impact, and feels a bit more templated for this retailer audience.
meeting-notes-summary
Summarization & extraction — Read these meeting notes and then provide: (1) a 2-sentence summary, and (2) a JSON object with keys `decision`, `owner`, `launch_date`, `risks`.
Notes:
- Tuesday growth sync for the LatticeLearn referral experiment.
- Priya: paid social CAC rose from $41 to $57 after we broadened targeting; not sustainable.
- Owen: referral landing page converted 18.4% last week vs 11.2% for the generic signup page.
- Marta: legal approved the revised incentive copy as long as we cap rewards at 5 invites per user.
- Decision discussed: launch referral flow to 50% of new visitors on 2025-07-08, not 100%, because support macros are not ready.
- Ezra will finish support macros by Friday; if delayed, we keep the old post-signup help center link in place.
- Main risks: fraud from self-referrals, support queue spike, and analytics mismatch between web and iOS events.
grok-4.3: (1) During the Tuesday growth sync for the LatticeLearn referral experiment, the team observed a higher conversion rate of 18.4% on the referral landing page versus 11.2% on the generic signup page, received legal approval for incentive copy with a cap of 5 invites, and noted unsustainable increases in paid social CAC. The group decided to launch the referral flow to only 50% of new visitors on 2025-07-08 due to unprepared support macros, which Ezra will complete by Friday, while identifying key risks of fraud from self-referrals, support queue spikes, and web-iOS analytics mismatches.
(2)
{
  "decision": "launch referral flow to 50% of new visitors (not 100%)",
  "owner": "Ezra",
  "launch_date": "2025-07-08",
  "risks": ["fraud from self-referrals", "support queue spike", "analytics mismatch between web and iOS events"]
}
Llama-4-Scout-17B-16E-Instruct: Here is a 2-sentence summary and a JSON object with the requested information:
**Summary:** The team discussed the LatticeLearn referral experiment, deciding to launch the referral flow to 50% of new visitors on July 8, 2025, due to incomplete support macros. The team also reviewed metrics, including a rise in paid social CAC and a higher conversion rate for the referral landing page.
**JSON Object:**
```json
{
    "decision": "Launch referral flow to 50% of new visitors",
    "owner": null,
    "launch_date": "2025-07-08",
    "risks": [
        "fraud from self-referrals",
        "support queue spike",
        "analytics mismatch between web and iOS events"
    ]
}
```
Note that I didn't specify an owner, as the notes don't mention a specific person being assigned to own the decision or launch. If you'd like, I can try to infer an owner based on the context, but it seems that the decision was a collective one. Let me know if you'd like me to revise anything!
grok-4.3 follows the requested format more closely and captures more of the salient details, including legal approval, the invite cap, and the explicit risks, while providing a plausible owner from the notes. Llama-4-Scout-17B-16E-Instruct is mostly correct but adds unnecessary commentary outside the requested output and omits an owner even though Ezra is the clearest named responsible party in the notes.
messy-orders-to-json
Data wrangling / structured output — Convert the messy order lines below into valid JSON only. Output an array of objects sorted by `order_id` ascending. Schema per object: `order_id` (string), `customer` (string), `items` (array of strings), `priority` (boolean), `ship_by` (string in `YYYY-MM-DD`). Rules: trim spaces, title-case customer names, item names exactly as written after trimming, `priority` is true for yes/y/true and false for no/n/false, and dates must be normalized to `YYYY-MM-DD`.
Messy data:
ORD-910 | customer= leah p. moreno | items= gasket set ; 8mm hex key ; thread seal tape | priority=Y | ship_by=7/9/2025
ORD-907 | customer=danil cho | items= panel clip|priority=no|ship_by=2025-07-06
ORD-913 | customer= R. banerjee | items= flow sensor ; calibration tag | priority= true | ship_by= 2025/07/11
ORD-908 | customer=mei-lin ortega | items= drip tray ; filter basket ; group brush | priority= n | ship_by=07-07-2025
grok-4.3:
[
  {
    "order_id": "ORD-907",
    "customer": "Danil Cho",
    "items": ["panel clip"],
    "priority": false,
    "ship_by": "2025-07-06"
  },
  {
    "order_id": "ORD-908",
    "customer": "Mei-Lin Ortega",
    "items": ["drip tray", "filter basket", "group brush"],
    "priority": false,
    "ship_by": "2025-07-07"
  },
  {
    "order_id": "ORD-910",
    "customer": "Leah P. Moreno",
    "items": ["gasket set", "8mm hex key", "thread seal tape"],
    "priority": true,
    "ship_by": "2025-07-09"
  },
  {
    "order_id": "ORD-913",
    "customer": "R. Banerjee",
    "items": ["flow sensor", "calibration tag"],
    "priority": true,
    "ship_by": "2025-07-11"
  }
]
Llama-4-Scout-17B-16E-Instruct:
### Data Wrangling and JSON Output
Below is a Python script that takes the provided messy order lines, cleans and structures the data, and outputs a JSON array of objects sorted by `order_id` ascending.
```python
import json
import re
from datetime import datetime

# Messy data
order_lines = [
    "ORD-910 | customer= leah p. moreno | items= gasket set ; 8mm hex key ; thread seal tape | priority=Y | ship_by=7/9/2025",
    "ORD-907 | customer=danil cho | items= panel clip|priority=no|ship_by=2025-07-06",
    "ORD-913 | customer= R. banerjee | items= flow sensor ; calibration tag | priority= true | ship_by= 2025/07/11",
    "ORD-908 | customer=mei-lin ortega | items= drip tray ; filter basket ; group brush | priority= n | ship_by=07-07-2025"
]

# Cleaning and structuring data
data = []
for line in order_lines:
    # Split line into parts
    parts = re.split(r'\|', line)
    # Extract and clean data
    order_id = parts[0].strip()
    customer = parts[1].split('=')[1].strip().title()
    items = [item.strip() for item in parts[2].split('=')[1].split(';')]
    priority = parts[3].split('=')[1].strip().lower() in ['yes', 'y', 'true']
    ship_by = datetime.strptime(parts[4].split('=')[1].strip(), '%m/%d/%Y').strftime('%Y-%m-%d') if '/' in parts[4].split('=')[1].strip() else datetime.strptime(parts[4].split('=')[1].strip(), '%Y-%m-%d').strftime('%Y-%m-%d')
    # Create object
    obj = {
        "order_id": order_id,
        "customer": customer,
        "items": items,
        "priority": priority,
        "ship_by": ship_by
    }
    # Add to data
    data.append(obj)

# Sort data by order_id
data.sort(key=lambda x: x['order_id'])

# Output JSON
print(json.dumps(data, indent=4))
```
### Output
```json
[
    {
        "order_id": "ORD-907",
        "customer": "Danil Cho",
        "items": [
            "panel clip"
        ],
        "priority": false,
        "ship_by": "2025-07-06"
    },
    {
        "order_id": "ORD-908",
        "customer": "Mei-Lin Ortega",
        "items": [
            "drip tray",
            "filter basket",
            "group brush"
        ],
        "priority": false,
        "ship_by": "2025-07-07"
    },
    {
        "order_id": "ORD-910",
        "customer": "Leah P. Moreno",
        "items": [
            "gasket set",
            "8mm hex key",
            "thread seal tape"
        ],
        "priority": true,
        "ship_by": "2025-07-09"
    },
    {
        "order_id": "ORD-913",
        "customer": "R. Banerjee",
        "items": [
            "flow sensor",
            "calibration tag"
        ],
        "priority": true,
        "ship_by": "2025-07-11"
    }
]
```
grok-4.3 follows the instruction exactly by returning valid JSON only, correctly normalized and sorted. Llama-4-Scout-17B-16E-Instruct includes explanatory text and code instead of JSON-only output, violating the required format despite showing a correct final JSON block.
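A "valid JSON only" requirement is easy to check mechanically. The following is a minimal sketch of such a check, not the harness used for this matchup: `check_orders_response` is a hypothetical function, and the rules it enforces (whole response parses as a JSON array, exact schema keys, boolean `priority`, `YYYY-MM-DD` dates, ascending `order_id`) are read off the task prompt.

```python
import json
from datetime import datetime

REQUIRED_KEYS = {"order_id", "customer", "items", "priority", "ship_by"}

def _is_iso_date(value) -> bool:
    """True only for zero-padded YYYY-MM-DD strings that are real dates."""
    if not isinstance(value, str):
        return False
    try:
        return datetime.strptime(value, "%Y-%m-%d").strftime("%Y-%m-%d") == value
    except ValueError:
        return False

def check_orders_response(response: str) -> list[str]:
    """Return a list of format problems; an empty list means the response passes."""
    try:
        orders = json.loads(response.strip())
    except json.JSONDecodeError:
        # Any prose, headings, or code fences around the JSON land here.
        return ["response is not JSON only"]
    if not isinstance(orders, list):
        return ["top-level value is not an array"]
    problems, ids = [], []
    for order in orders:
        if not isinstance(order, dict) or set(order) != REQUIRED_KEYS:
            problems.append("object does not match the schema keys")
            continue
        if isinstance(order["order_id"], str):
            ids.append(order["order_id"])
        if not isinstance(order["priority"], bool):
            problems.append("priority is not a boolean")
        if not _is_iso_date(order["ship_by"]):
            problems.append("ship_by is not YYYY-MM-DD")
    if ids != sorted(ids):
        problems.append("array is not sorted by order_id")
    return problems

row = {"order_id": "ORD-907", "customer": "Danil Cho", "items": ["panel clip"],
       "priority": False, "ship_by": "2025-07-06"}
bare = json.dumps([row])
wrapped = "Here is the output:\n" + bare

print(check_orders_response(bare))     # []
print(check_orders_response(wrapped))  # ['response is not JSON only']
```

A check like this treats a correct JSON block buried in commentary as a failure, which matches how the verdict above scores the second response.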
Matchup powered by OpenRouter.