Claude Sonnet 4.6 beats DeepSeek V4 Flash on rigor

Claude Sonnet 4.6 wins 35.0 to 26.5 by being more reliable where correctness actually bites. DeepSeek V4 Flash had the cleaner customer email, but it fell down on harder structured and coding work.

By RuntimeWire · Published Jun 3, 2026, 4:51pm CT · Updated Jun 3, 2026, 6:41pm CT

Comparison of AI model performance and rigor (isometric 3D render with paper-cut materials)

Claude Sonnet 4.6 takes this head-to-head because its wins came on the tasks with the highest penalty for being almost right. In the Python cost-allocation test, both models understood the shape of the solution, but DeepSeek used floating-point arithmetic; that is a real robustness flaw for large integer inputs. Claude’s exact integer handling makes it the safer implementation.

The meeting-summary task was the clearest separation. Claude delivered the requested two-sentence summary plus a complete action-item table with owners, tasks, timing, and risks. DeepSeek’s response was materially incomplete: one sentence, and the action table was essentially absent.

DeepSeek’s best showing was the customer-delay email, where it was more direct about the cancellation option and gave the ship date exactly as May 18. That was a legitimate win: clearer customer communication, fewer implications, better compliance with the prompt’s specifics.

The vendor-records JSON task was a wash. Both models extracted the data correctly and normalized it properly, but both also wrapped the JSON in Markdown fences after being told to output valid JSON only. That shared formatting miss matters, but it does not change the result.

Final call: Claude Sonnet 4.6 wins decisively. DeepSeek V4 Flash can write a cleaner customer note, but Claude is the more dependable model for code, structured reasoning, and task completion.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had OpenAI: GPT-5.5 score each one. DeepSeek: DeepSeek V4 Flash scored 26.5 to Anthropic: Claude Sonnet 4.6's 35.0.

1. Python cost allocation

Python 3.11: Implement a self-contained function allocate_fee(total_cents: int, weights: list[int]) -> list[int] for a billing system. It must split total_cents across line items proportional to nonnegative integer weights, return integer cents, and always sum exactly to total_cents. Use largest-remainder rounding; if remainders tie, earlier index gets the extra cent. If all weights are zero, split as evenly as possible by index. Raise ValueError for negative totals, negative weights, or an empty weights list. Return code only.

Winner: Anthropic: Claude Sonnet 4.6 — Both implement the required validation, summing, zero-weight behavior, and tie-breaking, but Model A uses floating-point arithmetic, which can misallocate for large integer inputs due to precision/overflow. Model B uses exact integer arithmetic for floors and remainders, making it the more correct and robust implementation, though both include markdown fences despite the 'code only' instruction.

2. Customer delay email

Draft a customer email for Bloom Harbor Furnishings to Ari Patel, whose custom walnut desk order BH-4472 is delayed because the supplier shipped the wrong drawer slides. Audience: frustrated but loyal customer. Tone: accountable, warm, not overly formal. Include: apology, new estimated ship date of May 18, free white-glove delivery upgrade, option to cancel for full refund, and a direct reply invitation. Length: 130-160 words.

Winner: DeepSeek: DeepSeek V4 Flash — Both emails meet the requested tone, audience, length, and required elements. Model A is slightly better because it more explicitly states the option to cancel for a full refund and gives the ship date exactly as May 18, while Model B says “by May 18th” and implies rather than directly states cancellation.

3. Meeting summary and actions

Source meeting notes: "Tue 9:00, Pilot launch sync for Northstar Clinic portal. Mara said security review passed except MFA copy needs plain-language edits. Devon reported SSO bug affects about 7% of test logins on Safari only; patch expected Friday noon. Priya needs final billing CSV template by Thursday to train clinic admins. Launch date stays June 24 if training deck is approved by June 17. Omar will ask legal whether the Spanish consent text can be used before certified translation arrives. Team agreed no new feature requests until after launch." Provide: (1) a 2-sentence summary, and (2) a Markdown table of action items with columns Owner, Task, Due/Timing, Dependency/Risk.

Winner: Anthropic: Claude Sonnet 4.6 — Model B provides a complete two-sentence summary and a well-formed action-item table covering all owners, tasks, timings, and risks from the notes. Model A is incomplete: it provides only one sentence and the action-item table is essentially missing.

4. Messy vendor records to JSON

Convert the messy records below into valid JSON only. Schema: an array of objects with keys vendor (string), invoice_id (string), amount_usd (number), due_date (YYYY-MM-DD string), approved (boolean), notes (string or null). Preserve record order. Messy data: Larch & Co | inv: LC-0817 | $1,204.50 | due 4/9/2026 | OK by Nina | notes: rush packaging; Blue Kite Labs, invoice BKL-44A, amount USD 875, due date Apr 12 2026, approved=no, note none; vendor=Omni Sprout; id OS-3009; total $62.07; due 2026-04-08; approved yes; notes "tax corrected".

Winner: Tie — The outputs are identical and correctly extract all fields, preserve order, use proper types, and normalize dates. However, both wrap the JSON in Markdown code fences, violating the instruction to output valid JSON only.

See every prompt and the full side-by-side outputs in the interactive Head-to-Head.

How they were tested

1. Python cost allocation

2. Customer delay email

3. Meeting summary and actions

4. Messy vendor records to JSON

Reader comments