Head to head: grok-4.3 vs Ministral-3B

This wasn’t a close stylistic split; it was a clean execution gap. grok-4.3 won every task by being more disciplined about instructions, format, and the small details that make outputs usable in the real world.

By · Published

Two distinct, minimalist sculptural forms representing algorithmic entities, one perfectly defined and sharp-edged, the other slightly less precise or diffused, positioned side-by-side to emphasize direct comparison and outcome. (Studio sti

grok-4.3 takes this matchup 38.0 to 25.0, and the scoreline flatters Ministral-3B. Across all four tasks, the pattern is the same: grok-4.3 actually does the job asked, while Ministral-3B keeps introducing avoidable problems—mutating inputs, drifting from requested tone, or breaking output constraints.

The coding task is the clearest example. In python-merge-interval-bugfix, grok-4.3 gets the whole brief right: it sorts without mutating the input, merges overlapping and touching intervals, and keeps the required function name intact. Ministral-3B repairs the merge logic, but then blows the non-mutation requirement by calling windows.sort(). That’s not a cosmetic miss; it’s the kind of bug that causes downstream surprises.

The writing and summarization tasks tell the same story. In vendor-delay-status-update, grok-4.3 matches the requested calm, direct, professional internal Slack tone, while Ministral-3B leans on emojis and alert-style formatting that make a routine ops update sound theatrical. In meeting-notes-summary-actions, grok-4.3 follows the requested structure with a real two-sentence summary and a clean JSON object, and it captures the important 8-store pilot scope decision. Ministral-3B misses that owner decision, uses bullets instead of the requested summary format, and freelances on the JSON schema.

Then there’s messy-orders-to-json, where grok-4.3 simply behaves like a model you can trust in production: valid JSON only, correct preservation of the customer name as "Noah Bennet," no extra decoration. Ministral-3B adds Markdown fences—despite being told not to—and silently changes the name to "Noah Bennett." That’s exactly the kind of seemingly small alteration that breaks pipelines and corrupts records.

Final call: grok-4.3 wins decisively because it is more exact, more compliant, and more reliable on the details that matter. Ministral-3B isn’t losing on brilliance; it’s losing on discipline.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 38.0 to Ministral-3B's 25.0.

1. python-merge-interval-bugfix

Practical coding — Python. Return code only. Fix the bug in this function so it merges overlapping maintenance windows for the same server. Input: a list of tuples (start_minute, end_minute) where end_minute is exclusive. The function should sort if needed, merge any intervals that overlap OR touch, and return a new list. Example: [(30,45),(10,20),(20,30),(50,55)] -> [(10,45),(50,55)]. Buggy code: python def merge_windows(windows): if not windows: return [] merged = [windows[0]] for start, end in windows[1:]: last_start, last_end = merged[-1] if start < last_end: merged[-1] = (last_start, max(last_end, end)) else: merged.append((start, end)) return merged Requirements: - Keep the function name merge_windows - Do not mutate the input list - Handle unsorted input - Merge touching intervals, e.g. (100,120) and (120,125) - Assume all tuples are valid with start_minute < end_minute

Winner: grok-4.3 — A correctly sorts without mutating the input, merges overlapping and touching intervals, and preserves the required function name. B fixes the merge logic but violates the requirement not to mutate the input list by calling windows.sort().

2. vendor-delay-status-update

Professional writing. Write a status update for our internal operations Slack channel about a vendor delay. Audience: warehouse supervisors and customer support leads. Tone: calm, direct, no blame. Length: 110-140 words. Facts to include: - Vendor: North Spoke Plastics - Delayed item: 28 mm tamper seals for the Haviland cold-brew bottles - PO: 44187-K - Original dock date: Tuesday, 14 May - New ETA: Friday, 17 May by 3:00 p.m. - Current stock on hand covers packing through Wednesday morning only - We are prioritizing wholesale orders for Juniper Market and Vale & Pine - Customer support should avoid promising same-day shipment for cold-brew bottle orders until replenishment lands - Next update will be posted by 4:30 p.m. today

Winner: grok-4.3 — A better matches the requested calm, direct, professional Slack update and includes all required facts in a clear internal-ops style. B is accurate but less aligned in tone due to emojis, alert-style formatting, and the unnecessary closing line, which make it feel less professional and more dramatic.

3. meeting-notes-summary-actions

Summarization & extraction. Read these meeting notes and produce: 1) a 2-sentence summary 2) a JSON object with keys launch_date, owner_decisions, risks, and action_items. Meeting notes: """ OrbitLane weekly launch sync — 6 Feb - Priya said the Android build 2.8.1 fixed the duplicate receipt bug, but the offline refund flow still fails if the cashier added more than 12 line items. - Mateo wants the pilot to start on 19 Feb, limited to 8 stores, not the original 20. - Lena confirmed legal approved the revised loyalty disclaimer text this morning. - Risk: training guides are still in review with Retail Ops; if not signed off by 12 Feb, store managers won't have time to brief weekend staff. - Decision: keep iOS in the pilot; remove kiosk mode from scope entirely. - Action: Priya to deliver a hotfix candidate by 9 Feb. - Action: Jonah to send the final store list by end of day Thursday. - Action: Lena to publish the approved disclaimer in the help center. - Open question: whether gift-card top-ups should be enabled during pilot; no decision yet. """

Winner: grok-4.3 — A follows the requested format more closely with a true 2-sentence summary and a JSON object whose fields cleanly capture the meeting decisions, including the 8-store pilot scope change. B is mostly correct but misses that owner decision, uses bullet points instead of a plain 2-sentence summary, and changes the action_items structure beyond what was asked.

4. messy-orders-to-json

Data wrangling / structured output. Transform the messy order notes below into valid JSON only. Output an object with one key, orders, whose value is an array of order objects sorted by order_id ascending. Each order object must have exactly these keys: - order_id (string) - customer (string, title case) - items (array of objects with keys sku string and qty integer) - ship_method (one of ground, air, pickup) - priority (boolean) Rules: - Normalize customer names to title case - If an item appears twice within the same order, combine quantities - Ignore comments in parentheses - priority=yes/true => true, priority=no/false => false Messy notes: """ #A-104 | customer: marta iverson | items: QX-4 x2, LM-9 x1, QX-4 x3 | ship=ground | priority=yes A-102 / customer=ELI TRAN / ship:pickup / items:[RN-11; TK-72] / priority=false order A-103 ; items ZP-8=4, RN-1=2 ; customer "noah bennet" ; PRIORITY: TRUE ; ship method = air A-101, customer: reem al-hadi, items: LM-9 x 2, BC-3 x1, ship=ground, priority=no (hold gift wrap) """

Winner: grok-4.3 — A fully follows the instruction to output valid JSON only and correctly preserves the source customer name as "Noah Bennet". B adds Markdown code fences, violating the output-format requirement, and also changes the customer name to "Noah Bennett," which is not supported by the input.


See every prompt and the full side-by-side outputs in the interactive Head-to-Head.

Reader comments

Conversation for this story loads after sign-in.