Head to head: grok-4.3 vs mistral-small-2503

grok-4.3 vs mistral-small-2503

This matchup wasn’t especially close: one model consistently did the unglamorous, high-value work of following instructions and getting edge cases right, while the other kept leaking avoidable mistakes. Across coding, writing, extraction, and formatting discipline, the gap showed up in concrete ways.

grok-4.3 wins because it was reliably correct where these evaluations actually matter: edge cases, audience fit, and output discipline. The aggregate score says 38.0 to 22.0, but the more important story is that grok-4.3 kept clearing the practical bar while mistral-small-2503 repeatedly stumbled on details that should not be optional. The clearest separation came in the Python billing task. grok-4.3 handled month rollover, preserved end-of-month behavior, clamped correctly for shorter months, and included the required asserts. mistral-small-2503 reached for a `timedelta`-style monthly fix that is simply the wrong tool for calendar billing, failed to preserve the required day semantics, and then ignored the format instruction by adding Markdown fences to a code-only prompt. That is not a near miss; that is a model missing both the logic and the contract. The writing and summarization results tell a similar story. In the vendor incident update, grok-4.3 wrote for technical customer contacts the way a competent operator should: candid, concise, and specific about impact, cause, remediation, next steps, and support. mistral-small-2503 padded the message with generic customer-comms fluff, placeholder signature fields, and less precise phrasing. In the meeting-notes task, grok-4.3 stayed faithful to the source and extracted the requested fields cleanly, while mistral-small-2503 invented a year for the launch date and misfiled action items and dependencies as blocked items. Those are accuracy errors, not stylistic preferences. Even in the data-wrangling round, where both models normalized the orders correctly, grok-4.3 still separated itself by obeying the instruction to return JSON only. mistral-small-2503 again wrapped valid content in Markdown fences and turned a straightforward formatting requirement into a failure mode. That kind of sloppiness is exactly what breaks downstream pipelines. **Final call: grok-4.3 is the clearly better text model here. It was more accurate, more disciplined, and more usable in real workflows; mistral-small-2503 lost on the details that professionals cannot afford to excuse.**

Practical coding — Python invoice due date fix

Language: Python 3.11. Return code only, no explanation. Fix this function so it calculates the next invoice due date correctly for a monthly billing system. Requirements: - Input `start_date` is `YYYY-MM-DD`. - `months_ahead` is a nonnegative integer. - Preserve end-of-month behavior: if the start date is the last day of its month, the result must be the last day of the target month. - Otherwise, keep the same day-of-month when possible; if the target month is shorter, clamp to that month's last day. - Raise `ValueError` for invalid dates. Current buggy code: ```python from datetime import datetime def next_due_date(start_date: str, months_ahead: int) -> str: dt = datetime.strptime(start_date, "%Y-%m-%d") month = dt.month + months_ahead year = dt.year + month // 12 month = month % 12 return dt.replace(year=year, month=month).strftime("%Y-%m-%d") ``` Also include these asserts in your answer: ```python assert next_due_date("2024-01-31", 1) == "2024-02-29" assert next_due_date("2023-01-31", 1) == "2023-02-28" assert next_due_date("2024-08-31", 6) == "2025-02-28" assert next_due_date("2024-05-30", 1) == "2024-06-30" assert next_due_date("2024-05-15", 10) == "2025-03-15" ```

grok-4.3: import calendar from datetime import datetime def next_due_date(start_date: str, months_ahead: int) -> str: dt = datetime.strptime(start_date, "%Y-%m-%d") last_day = calendar.monthrange(dt.year, dt.month)[1] is_eom = dt.day == last_day total_month = dt.month + months_ahead new_year = dt.year + (total_month - 1) // 12 new_month = (total_month - 1) % 12 + 1 if is_eom: new_day = calendar.monthrange(new_year, new_month)[1] else: max_day = calendar.monthrange(new_year, new_month)[1] new_day = min(dt.day, max_day) new_dt = dt.replace(year=new_year, month=new_month, day=new_day) return new_dt.strftime("%Y-%m-%d") assert next_due_date("2024-01-31", 1) == "2024-02-29" assert next_due_date("2023-01-31", 1) == "2023-02-28" assert next_due_date("2024-08-31", 6) == "2025-02-28" assert next_due_date("2024-05-30", 1) == "2024-06-30" assert next_due_date("2024-05-15", 10) == "2025-03-15"

mistral-small-2503: ```python from datetime import datetime, timedelta def next_due_date(start_date: str, months_ahead: int) -> str: dt = datetime.strptime(start_date, "%Y-%m-%d") if dt.day > 28: next = dt.replace(day=28) + timedelta(days=32) else: next = dt.replace(day=1) + timedelta(days=32) next = next.replace(year=dt.year, month=dt.month) + timedelta(days=32 * months_ahead) return next.strftime("%Y-%m-%d") assert next_due_date("2024-01-31", 1) == "2024-02-29" assert next_due_date("2023-01-31", 1) == "2023-02-28" assert next_due_date("2024-08-31", 6) == "2025-02-28" assert next_due_date("2024-05-30", 1) == "2024-06-30" assert next_due_date("2024-05-15", 10) == "2025-03-15" ```

A correctly handles month rollover, end-of-month preservation, clamping for shorter months, and includes the required asserts. B uses an incorrect timedelta-based approach for monthly billing, does not preserve the required day semantics, and includes markdown fences despite the prompt requiring code only.

Professional writing — vendor incident status update

Write a status update email to enterprise customers about a resolved service incident. Context: - Company: Northline Metrics - Product: RouteCast API - Incident window: 08:42–10:17 UTC on 14 May - Impact: about 18% of requests to `/v2/forecast` returned HTTP 503 - Cause: bad cache invalidation rule deployed during a rollout in eu-west-1 - Resolution: rollback completed, cache warmed, extra alert added - No data loss or security issue - Next step: publish full RCA by 17 May Audience: technical customer contacts at logistics companies. Tone: candid, calm, professional; no marketing fluff. Length: 140–190 words. Include: apology, customer impact, plain-English cause, what was done, what happens next, and a support contact line.

grok-4.3: Subject: RouteCast API Incident Resolved – 14 May Dear Technical Contacts, We apologize for the disruption to the RouteCast API on 14 May. Between 08:42 and 10:17 UTC, approximately 18 % of requests to the /v2/forecast endpoint returned HTTP 503 responses. Customers querying real-time forecast data experienced intermittent failures during this window. The cause was a misconfigured cache invalidation rule deployed as part of a routine rollout in eu-west-1. The rule caused the service to reject valid cache entries, producing the observed errors. We identified the change within minutes, completed a rollback, warmed the cache, and added an alert to detect similar misconfigurations earlier. Service returned to normal by 10:17 UTC. There was no data loss or security impact. A full root cause analysis will be published by 17 May. If you have questions about the incident or its effect on your integrations, please contact support@northlinemetrics.com. Northline Metrics Operations

mistral-small-2503: Subject: RouteCast API Service Incident Update - 14 May Dear Valued Customers, We want to inform you that the service incident affecting the RouteCast API on 14 May has been fully resolved. Between 08:42 and 10:17 UTC, approximately 18% of requests to the `/v2/forecast` endpoint returned HTTP 503 errors. We apologize for any disruption this may have caused to your operations. The issue was caused by a bad cache invalidation rule that was inadvertently deployed during a routine rollout in the eu-west-1 region. Our engineering team quickly identified the problem, rolled back the changes, and warmed the cache to restore full functionality. No data loss or security issues occurred as a result of this incident. To prevent similar issues in the future, we have added an extra alert to our monitoring system. We are committed to maintaining the reliability of our services and appreciate your understanding. We will publish a full Root Cause Analysis (RCA) by 17 May. If you have any questions or need further assistance, please do not hesitate to contact our support team at [support@northlinemetrics.com](mailto:support@northlinemetrics.com). Thank you for your continued trust in Northline Metrics. Best regards, [Your Name] [Your Position] Northline Metrics

A is more appropriately tailored to technical customer contacts: it is candid, concise, and avoids marketing language while clearly covering impact, cause, remediation, next steps, and support contact. B includes fluff ('Dear Valued Customers,' 'continued trust'), placeholder signature fields, and slightly less precise phrasing for this audience.

Summarization & extraction — meeting notes to bullets + facts

Read the meeting notes below. Then produce: 1) a 3-bullet summary for an exec who missed the meeting 2) a JSON object with keys `launch_date`, `owner`, `blocked_items`, and `numeric_targets` Meeting notes: """ Atlas mobile onboarding sync — 3 Sept, 09:30 - Priya said Android crash-free sessions improved from 96.8% to 98.1% after the image-loader patch. - iOS is still blocked by the App Store review; Mateo expects a decision by Friday. - Marketing needs the final screenshot set by 16:00 Thursday or the paid social campaign slips a week. - We agreed not to launch with Apple SSO yet; too many edge cases in school-managed accounts. - Revised target launch date: 18 Sept, assuming iOS approval lands this week. - Juno will own the go/no-go checklist and send it for signoff. - Support asked for a macros doc covering invite-code failures, duplicate profiles, and timezone-related reminder confusion. - KPI target for week 1 remains: onboarding completion 72% and day-7 retention 31%. """ Return valid JSON for part 2.

grok-4.3: **3-bullet exec summary:** - Android crash-free sessions improved to 98.1% after the image-loader fix; iOS launch remains blocked by App Store review with a decision expected Friday. - Revised target launch date is 18 Sept (pending iOS approval this week); marketing needs final screenshots by Thursday 16:00 or the paid campaign slips one week. - Juno owns the go/no-go checklist; week-1 KPI targets are 72% onboarding completion and 31% day-7 retention. ```json { "launch_date": "18 Sept", "owner": "Juno", "blocked_items": ["iOS App Store review"], "numeric_targets": { "onboarding_completion_pct": 72, "day7_retention_pct": 31 } } ```

mistral-small-2503: ### 1) 3-Bullet Summary for an Exec Who Missed the Meeting - **Android Performance Improvement**: The Android app's crash-free sessions improved to 98.1% after the image-loader patch. - **iOS Blocked**: The iOS launch is pending App Store review, with a decision expected by Friday. - **Launch Date and Ownership**: The revised launch date is September 18th, assuming iOS approval. Juno will own the go/no-go checklist. ### 2) JSON Object ```json { "launch_date": "2023-09-18", "owner": "Juno", "blocked_items": [ "iOS App Store review", "Marketing needs final screenshot set by 16:00 Thursday", "Support needs macros doc for invite-code failures, duplicate profiles, and timezone-related reminder confusion" ], "numeric_targets": { "onboarding_completion_week_1": 72, "day_7_retention": 31 } } ```

A is more faithful and concise: its summary captures the key decisions, risks, owner, and KPI targets, and its JSON cleanly extracts the requested fields without overreaching. B adds an unsupported year to the launch date and misclassifies action items/dependencies as blocked_items, making its extraction less accurate.

Data wrangling — messy orders into clean JSON

Transform the messy order lines below into a valid JSON array. Output JSON only. Schema for each object, in this exact key order: `order_id` (string), `customer` (string), `sku` (string), `qty` (integer), `unit_price` (number), `currency` (string), `ship_date` (string YYYY-MM-DD or null) Rules: - Trim spaces. - Normalize currency to 3-letter uppercase codes. - Parse quantity as integer. - Parse unit_price as a number without currency symbols. - Normalize dates to YYYY-MM-DD. - If ship date is missing, use null. - Keep input order. Messy data: `A-9041 | Lumen Dockworks | WX-14B | qty 3 | $19.50 | usd | ships 2025/04/09` `A-9042|Pine & Circuit Co.|QZ-8|2 units|EUR 7.2|eur|ship: 09-04-2025` `A-9043 | Nara Studio | MESH-2 | 11 | GBP 0.85 | gbp | pending` `A-9044| Helio Foods|COLD-7|qty:1| CAD$104.00 | cad | 2025-04-12` `A-9045 | Orbit Kiosk | TAG-44 | Qty 5 | 12.00 usd | USD | ship 2025.04.15`

grok-4.3: [ {"order_id": "A-9041", "customer": "Lumen Dockworks", "sku": "WX-14B", "qty": 3, "unit_price": 19.5, "currency": "USD", "ship_date": "2025-04-09"}, {"order_id": "A-9042", "customer": "Pine & Circuit Co.", "sku": "QZ-8", "qty": 2, "unit_price": 7.2, "currency": "EUR", "ship_date": "2025-04-09"}, {"order_id": "A-9043", "customer": "Nara Studio", "sku": "MESH-2", "qty": 11, "unit_price": 0.85, "currency": "GBP", "ship_date": null}, {"order_id": "A-9044", "customer": "Helio Foods", "sku": "COLD-7", "qty": 1, "unit_price": 104, "currency": "CAD", "ship_date": "2025-04-12"}, {"order_id": "A-9045", "customer": "Orbit Kiosk", "sku": "TAG-44", "qty": 5, "unit_price": 12, "currency": "USD", "ship_date": "2025-04-15"} ]

mistral-small-2503: ```json [ { "order_id": "A-9041", "customer": "Lumen Dockworks", "sku": "WX-14B", "qty": 3, "unit_price": 19.50, "currency": "USD", "ship_date": "2025-04-09" }, { "order_id": "A-9042", "customer": "Pine & Circuit Co.", "sku": "QZ-8", "qty": 2, "unit_price": 7.2, "currency": "EUR", "ship_date": "2025-04-09" }, { "order_id": "A-9043", "customer": "Nara Studio", "sku": "MESH-2", "qty": 11, "unit_price": 0.85, "currency": "GBP", "ship_date": null }, { "order_id": "A-9044", "customer": "Helio Foods", "sku": "COLD-7", "qty": 1, "unit_price": 104.00, "currency": "CAD", "ship_date": "2025-04-12" }, { "order_id": "A-9045", "customer": "Orbit Kiosk", "sku": "TAG-44", "qty": 5, "unit_price": 12.00, "currency": "USD", "ship_date": "2025-04-15" } ] ```

Both outputs correctly normalize the data, but A follows the instruction to output JSON only. B wraps the JSON in Markdown code fences, which violates the format requirement.