Head to head: grok-4.3 vs mistral-small-2503

This matchup wasn’t especially close: one model consistently did the unglamorous, high-value work of following instructions and getting edge cases right, while the other kept leaking avoidable mistakes. Across coding, writing, extraction, and formatting discipline, the gap showed up in concrete ways.

By RuntimeWire · Published Jul 4, 2026, 9:06am CT

abstract symbolic representation of the story's core idea (editorial illustration in the spirit of New Yorker or The Atlantic)

grok-4.3 wins because it was reliably correct where these evaluations actually matter: edge cases, audience fit, and output discipline. The aggregate score says 38.0 to 22.0, but the more important story is that grok-4.3 kept clearing the practical bar while mistral-small-2503 repeatedly stumbled on details that should not be optional.

The clearest separation came in the Python billing task. grok-4.3 handled month rollover, preserved end-of-month behavior, clamped correctly for shorter months, and included the required asserts. mistral-small-2503 reached for a timedelta-style monthly fix that is simply the wrong tool for calendar billing, failed to preserve the required day semantics, and then ignored the format instruction by adding Markdown fences to a code-only prompt. That is not a near miss; that is a model missing both the logic and the contract.

The writing and summarization results tell a similar story. In the vendor incident update, grok-4.3 wrote for technical customer contacts the way a competent operator should: candid, concise, and specific about impact, cause, remediation, next steps, and support. mistral-small-2503 padded the message with generic customer-comms fluff, placeholder signature fields, and less precise phrasing. In the meeting-notes task, grok-4.3 stayed faithful to the source and extracted the requested fields cleanly, while mistral-small-2503 invented a year for the launch date and misfiled action items and dependencies as blocked items. Those are accuracy errors, not stylistic preferences.

Even in the data-wrangling round, where both models normalized the orders correctly, grok-4.3 still separated itself by obeying the instruction to return JSON only. mistral-small-2503 again wrapped valid content in Markdown fences and turned a straightforward formatting requirement into a failure mode. That kind of sloppiness is exactly what breaks downstream pipelines.

Final call: grok-4.3 is the clearly better text model here. It was more accurate, more disciplined, and more usable in real workflows; mistral-small-2503 lost on the details that professionals cannot afford to excuse.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 38.0 to mistral-small-2503's 22.0.

1. Practical coding — Python invoice due date fix

Language: Python 3.11. Return code only, no explanation. Fix this function so it calculates the next invoice due date correctly for a monthly billing system. Requirements: - Input start_date is YYYY-MM-DD. - months_ahead is a nonnegative integer. - Preserve end-of-month behavior: if the start date is the last day of its month, the result must be the last day of the target month. - Otherwise, keep the same day-of-month when possible; if the target month is shorter, clamp to that month's last day. - Raise ValueError for invalid dates. Current buggy code: python from datetime import datetime def next_due_date(start_date: str, months_ahead: int) -> str: dt = datetime.strptime(start_date, "%Y-%m-%d") month = dt.month + months_ahead year = dt.year + month // 12 month = month % 12 return dt.replace(year=year, month=month).strftime("%Y-%m-%d") Also include these asserts in your answer: python assert next_due_date("2024-01-31", 1) == "2024-02-29" assert next_due_date("2023-01-31", 1) == "2023-02-28" assert next_due_date("2024-08-31", 6) == "2025-02-28" assert next_due_date("2024-05-30", 1) == "2024-06-30" assert next_due_date("2024-05-15", 10) == "2025-03-15"

Winner: grok-4.3 — A correctly handles month rollover, end-of-month preservation, clamping for shorter months, and includes the required asserts. B uses an incorrect timedelta-based approach for monthly billing, does not preserve the required day semantics, and includes markdown fences despite the prompt requiring code only.

2. Professional writing — vendor incident status update

Write a status update email to enterprise customers about a resolved service incident. Context: - Company: Northline Metrics - Product: RouteCast API - Incident window: 08:42–10:17 UTC on 14 May - Impact: about 18% of requests to /v2/forecast returned HTTP 503 - Cause: bad cache invalidation rule deployed during a rollout in eu-west-1 - Resolution: rollback completed, cache warmed, extra alert added - No data loss or security issue - Next step: publish full RCA by 17 May Audience: technical customer contacts at logistics companies. Tone: candid, calm, professional; no marketing fluff. Length: 140–190 words. Include: apology, customer impact, plain-English cause, what was done, what happens next, and a support contact line.

Winner: grok-4.3 — A is more appropriately tailored to technical customer contacts: it is candid, concise, and avoids marketing language while clearly covering impact, cause, remediation, next steps, and support contact. B includes fluff ('Dear Valued Customers,' 'continued trust'), placeholder signature fields, and slightly less precise phrasing for this audience.

3. Summarization & extraction — meeting notes to bullets + facts

Read the meeting notes below. Then produce: 1) a 3-bullet summary for an exec who missed the meeting 2) a JSON object with keys launch_date, owner, blocked_items, and numeric_targets Meeting notes: """ Atlas mobile onboarding sync — 3 Sept, 09:30 - Priya said Android crash-free sessions improved from 96.8% to 98.1% after the image-loader patch. - iOS is still blocked by the App Store review; Mateo expects a decision by Friday. - Marketing needs the final screenshot set by 16:00 Thursday or the paid social campaign slips a week. - We agreed not to launch with Apple SSO yet; too many edge cases in school-managed accounts. - Revised target launch date: 18 Sept, assuming iOS approval lands this week. - Juno will own the go/no-go checklist and send it for signoff. - Support asked for a macros doc covering invite-code failures, duplicate profiles, and timezone-related reminder confusion. - KPI target for week 1 remains: onboarding completion 72% and day-7 retention 31%. """ Return valid JSON for part 2.

Winner: grok-4.3 — A is more faithful and concise: its summary captures the key decisions, risks, owner, and KPI targets, and its JSON cleanly extracts the requested fields without overreaching. B adds an unsupported year to the launch date and misclassifies action items/dependencies as blocked_items, making its extraction less accurate.

4. Data wrangling — messy orders into clean JSON

Transform the messy order lines below into a valid JSON array. Output JSON only. Schema for each object, in this exact key order: order_id (string), customer (string), sku (string), qty (integer), unit_price (number), currency (string), ship_date (string YYYY-MM-DD or null) Rules: - Trim spaces. - Normalize currency to 3-letter uppercase codes. - Parse quantity as integer. - Parse unit_price as a number without currency symbols. - Normalize dates to YYYY-MM-DD. - If ship date is missing, use null. - Keep input order. Messy data: A-9041 | Lumen Dockworks | WX-14B | qty 3 | $19.50 | usd | ships 2025/04/09 A-9042|Pine & Circuit Co.|QZ-8|2 units|EUR 7.2|eur|ship: 09-04-2025 A-9043 | Nara Studio | MESH-2 | 11 | GBP 0.85 | gbp | pending A-9044| Helio Foods|COLD-7|qty:1| CAD$104.00 | cad | 2025-04-12 A-9045 | Orbit Kiosk | TAG-44 | Qty 5 | 12.00 usd | USD | ship 2025.04.15

Winner: grok-4.3 — Both outputs correctly normalize the data, but A follows the instruction to output JSON only. B wraps the JSON in Markdown code fences, which violates the format requirement.

See every prompt and the full side-by-side outputs in the interactive Head-to-Head.

How they were tested

1. Practical coding — Python invoice due date fix

2. Professional writing — vendor incident status update

3. Summarization & extraction — meeting notes to bullets + facts

4. Data wrangling — messy orders into clean JSON

Reader comments