Claude Sonnet 4.6 beats DeepSeek V4 Flash on rigor

DeepSeek: DeepSeek V4 Flash vs Anthropic: Claude Sonnet 4.6

Claude Sonnet 4.6 wins 35.0 to 26.5 by being more reliable where correctness actually bites. DeepSeek V4 Flash had the cleaner customer email, but it fell down on harder structured and coding work.

Claude Sonnet 4.6 takes this head-to-head because its wins came on the tasks with the highest penalty for being almost right. In the Python cost-allocation test, both models understood the shape of the solution, but DeepSeek used floating-point arithmetic; that is a real robustness flaw for large integer inputs. Claude’s exact integer handling makes it the safer implementation.

The meeting-summary task was the clearest separation. Claude delivered the requested two-sentence summary plus a complete action-item table with owners, tasks, timing, and risks. DeepSeek’s response was materially incomplete: one sentence, and the action table was essentially absent.

DeepSeek’s best showing was the customer-delay email, where it was more direct about the cancellation option and gave the ship date exactly as May 18. That was a legitimate win: clearer customer communication, fewer implications, better compliance with the prompt’s specifics.

The vendor-records JSON task was a wash. Both models extracted the data correctly and normalized it properly, but both also wrapped the JSON in Markdown fences after being told to output valid JSON only. That shared formatting miss matters, but it does not change the result.

**Final call: Claude Sonnet 4.6 wins decisively. DeepSeek V4 Flash can write a cleaner customer note, but Claude is the more dependable model for code, structured reasoning, and task completion.**

Python cost allocation

Python 3.11: Implement a self-contained function `allocate_fee(total_cents: int, weights: list[int]) -> list[int]` for a billing system. It must split `total_cents` across line items proportional to nonnegative integer `weights`, return integer cents, and always sum exactly to `total_cents`. Use largest-remainder rounding; if remainders tie, earlier index gets the extra cent. If all weights are zero, split as evenly as possible by index. Raise `ValueError` for negative totals, negative weights, or an empty weights list. Return code only.

DeepSeek: DeepSeek V4 Flash:

```python
def allocate_fee(total_cents: int, weights: list[int]) -> list[int]:
    if total_cents < 0:
        raise ValueError("total_cents must be non-negative")
    if not weights:
        raise ValueError("weights list must not be empty")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")

    n = len(weights)
    if all(w == 0 for w in weights):
        base = total_cents // n
        remainder = total_cents % n
        result = [base] * n
        for i in range(remainder):
            result[i] += 1
        return result

    total_weight = sum(weights)
    result = [0] * n
    remainders = []
    for i, w in enumerate(weights):
        exact = total_cents * w / total_weight
        floor_val = int(exact)
        result[i] = floor_val
        remainders.append((exact - floor_val, i))

    allocated = sum(result)
    remaining = total_cents - allocated
    remainders.sort(key=lambda x: (-x[0], x[1]))
    for i in range(remaining):
        idx = remainders[i][1]
        result[idx] += 1
    return result
```

Anthropic: Claude Sonnet 4.6:

```python
def allocate_fee(total_cents: int, weights: list[int]) -> list[int]:
    if total_cents < 0:
        raise ValueError("total_cents must be nonnegative")
    if not weights:
        raise ValueError("weights list must not be empty")
    if any(w < 0 for w in weights):
        raise ValueError("all weights must be nonnegative")

    n = len(weights)
    total_weight = sum(weights)

    if total_weight == 0:
        # Split as evenly as possible by index
        base, remainder = divmod(total_cents, n)
        return [base + (1 if i < remainder else 0) for i in range(n)]

    # Compute exact shares as fractions, then use largest-remainder method
    # exact[i] = total_cents * weights[i] / total_weight
    # We need integer allocation summing to total_cents

    # Compute floor allocations and remainders
    # Using integer arithmetic to avoid floating point issues
    # exact[i] = total_cents * weights[i] / total_weight
    # floor[i] = (total_cents * weights[i]) // total_weight
    # remainder[i] = (total_cents * weights[i]) % total_weight (numerator of fractional part)
    floors = []
    remainders = []
    for w in weights:
        numerator = total_cents * w
        f, r = divmod(numerator, total_weight)
        floors.append(f)
        remainders.append(r)

    allocated = sum(floors)
    leftover = total_cents - allocated

    # leftover cents need to be distributed to indices with largest remainders
    # In case of ties (equal remainders), earlier index gets priority
    # Sort indices by remainder descending, ties broken by index ascending
    indices_by_remainder = sorted(range(n), key=lambda i: (-remainders[i], i))

    result = floors[:]
    for k in range(leftover):
        result[indices_by_remainder[k]] += 1
    return result
```

Both implement the required validation, summing, zero-weight behavior, and tie-breaking, but DeepSeek V4 Flash uses floating-point arithmetic, which can misallocate for large integer inputs because of precision loss or overflow. Claude Sonnet 4.6 uses exact integer arithmetic for floors and remainders, making it the more correct and robust implementation, though both include Markdown fences despite the "code only" instruction.
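The float-versus-integer difference can be checked directly. The sketch below is not either model's code: `allocate_float` and `allocate_int` are hypothetical, trimmed-down versions of the two approaches (validation and the all-zero branch omitted, positive total weight assumed), written only to isolate the arithmetic. It relies on CPython's arbitrary-precision `int` and IEEE 754 `float`.

```python
def allocate_float(total_cents: int, weights: list[int]) -> list[int]:
    """Largest remainder using true division (float shares and float remainders)."""
    total_weight = sum(weights)  # assumed > 0 in this sketch
    shares = [total_cents * w / total_weight for w in weights]
    result = [int(s) for s in shares]
    # Remainders are float differences, so exact ties may compare unequal.
    order = sorted(range(len(weights)), key=lambda i: (-(shares[i] - result[i]), i))
    for k in range(total_cents - sum(result)):  # empty range if the floors overshoot
        result[order[k]] += 1
    return result


def allocate_int(total_cents: int, weights: list[int]) -> list[int]:
    """Largest remainder using divmod, so floors and remainders stay exact integers."""
    total_weight = sum(weights)  # assumed > 0 in this sketch
    pairs = [divmod(total_cents * w, total_weight) for w in weights]
    result = [q for q, _ in pairs]
    order = sorted(range(len(weights)), key=lambda i: (-pairs[i][1], i))
    for k in range(total_cents - sum(result)):
        result[order[k]] += 1
    return result


# Case 1: a total too large for a double to hold exactly.
big_total = 2**54 + 3
print(sum(allocate_float(big_total, [1])) == big_total)
print(sum(allocate_int(big_total, [1])) == big_total)

# Case 2: exact shares are 0.4, 10.4 and 9.2, so indices 0 and 1 tie on remainder.
print(allocate_float(20, [2, 52, 46]))
print(allocate_int(20, [2, 52, 46]))
```

In the first case the float share rounds up past the true total, the leftover goes negative, and the float version's sum no longer matches `total_cents`. The second case suggests the weakness is not limited to huge inputs: `10.4 - 10` and `0.4` are different doubles, so the float version should give the extra cent to index 1 (`[0, 11, 9]`) where the earlier-index rule calls for index 0 (`[1, 10, 9]`).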

Customer delay email

Draft a customer email for Bloom Harbor Furnishings to Ari Patel, whose custom walnut desk order BH-4472 is delayed because the supplier shipped the wrong drawer slides. Audience: frustrated but loyal customer. Tone: accountable, warm, not overly formal. Include: apology, new estimated ship date of May 18, free white-glove delivery upgrade, option to cancel for full refund, and a direct reply invitation. Length: 130-160 words.

DeepSeek: DeepSeek V4 Flash:

**Subject:** An update on your custom walnut desk (Order BH-4472)

Hi Ari,

I’m writing to personally let you know that your desk order BH-4472 is facing an unexpected delay. Our supplier shipped the wrong drawer slides, and we’ve caught the error before it reached you. We’re so sorry for the frustration this causes, especially when you’ve been waiting on this custom piece.

Your desk is now estimated to ship on **May 18**, and we’d like to make things right. As a gesture, we’re upgrading you to **free white-glove delivery**—so your desk will be brought inside, assembled, and set up just the way you want it.

If this delay doesn’t work for your timeline, we completely understand. You can cancel for a **full refund**, no questions asked. Please just reply to this email to let me know your preference. I’ll personally ensure your order is handled with care.

Warmly,
[Your Name]
Bloom Harbor Furnishings

Anthropic: Claude Sonnet 4.6:

**Subject: Update on Your Custom Desk Order #BH-4472**

Hi Ari,

I want to reach out personally about your walnut desk order — and I want to start with a straightforward apology. Our supplier shipped the wrong drawer slides, which has pushed your timeline back further than anyone wanted. That's on us, and I'm truly sorry for the frustration this has caused.

Here's where things stand: your desk is now estimated to ship by **May 18th**. To make this right, we're upgrading your delivery to **complimentary white-glove service** — full setup, packaging removed, everything handled.

I also want to be upfront: if this delay doesn't work for you, we'll process a **full refund** with no hassle and no hard feelings. You've been patient, and you deserve that option.

Just hit reply and I'll take care of whatever you need personally.

Warm regards,
**[Your Name]**
Bloom Harbor Furnishings

Both emails meet the requested tone, audience, length, and required elements. DeepSeek V4 Flash is slightly better because it states the option to cancel for a full refund more explicitly and gives the ship date exactly as May 18, while Claude Sonnet 4.6 says “by May 18th” and implies rather than directly states cancellation.

Meeting summary and actions

Source meeting notes: "Tue 9:00, Pilot launch sync for Northstar Clinic portal. Mara said security review passed except MFA copy needs plain-language edits. Devon reported SSO bug affects about 7% of test logins on Safari only; patch expected Friday noon. Priya needs final billing CSV template by Thursday to train clinic admins. Launch date stays June 24 if training deck is approved by June 17. Omar will ask legal whether the Spanish consent text can be used before certified translation arrives. Team agreed no new feature requests until after launch." Provide: (1) a 2-sentence summary, and (2) a Markdown table of action items with columns Owner, Task, Due/Timing, Dependency/Risk.

DeepSeek: DeepSeek V4 Flash:

### (1) Two-Sentence Summary

The team confirmed the Northstar Clinic portal launch remains on track for June 24, pending training deck approval by June 17; key updates include a Safari-only SSO bug to be patched by Friday and a requirement for plain-language edits to MFA copy, while no new feature requests will be accepted until after launch.

### (2) Action Items Table

| Owner | Task | Due/Timing | Dependency/Risk |
|

Anthropic: Claude Sonnet 4.6:

## Summary

The Northstar Clinic portal pilot launch remains on track for June 24, contingent on resolving a Safari SSO bug, completing admin training materials, and obtaining legal clarity on the Spanish consent text. Key immediate priorities include a Friday patch for the login issue, Thursday delivery of the billing CSV template, and training deck approval by June 17.

---

## Action Items

| Owner | Task | Due / Timing | Dependency / Risk |
|-------|------|--------------|-------------------|
| **Mara** | Revise MFA copy to plain language | Before launch (June 24) | Security review already passed; copy edits are the remaining blocker |
| **Devon** | Deploy patch for Safari SSO bug (~7% of test logins affected) | Friday, noon | If patch slips, Safari users may be unable to log in at launch |
| **Priya** | Deliver final billing CSV template to clinic admins | Thursday | Needed to begin admin training; delay risks training timeline |
| **[Training deck owner — unspecified]** | Submit training deck for approval | By June 17 | Approval required to hold June 24 launch date |
| **Omar** | Confirm with legal whether Spanish consent text can be used pre-certified translation | ASAP / before launch | Legal sign-off needed; using uncleared text is a compliance risk |
| **Entire team** | Freeze new feature requests until post-launch | Ongoing through June 24 | Scope creep could destabilize launch timeline |

> **Note:** The training deck owner was not explicitly named in the meeting notes — recommend confirming ownership to avoid a gap.

Claude Sonnet 4.6 provides a complete two-sentence summary and a well-formed action-item table covering all owners, tasks, timings, and risks from the notes. DeepSeek V4 Flash is incomplete: it provides only one sentence and the action-item table is essentially missing.

Messy vendor records to JSON

Convert the messy records below into valid JSON only. Schema: an array of objects with keys `vendor` (string), `invoice_id` (string), `amount_usd` (number), `due_date` (YYYY-MM-DD string), `approved` (boolean), `notes` (string or null). Preserve record order. Messy data: `Larch & Co | inv: LC-0817 | $1,204.50 | due 4/9/2026 | OK by Nina | notes: rush packaging`; `Blue Kite Labs, invoice BKL-44A, amount USD 875, due date Apr 12 2026, approved=no, note none`; `vendor=Omni Sprout; id OS-3009; total $62.07; due 2026-04-08; approved yes; notes "tax corrected"`.

DeepSeek: DeepSeek V4 Flash:

```json
[
  {
    "vendor": "Larch & Co",
    "invoice_id": "LC-0817",
    "amount_usd": 1204.50,
    "due_date": "2026-04-09",
    "approved": true,
    "notes": "rush packaging"
  },
  {
    "vendor": "Blue Kite Labs",
    "invoice_id": "BKL-44A",
    "amount_usd": 875,
    "due_date": "2026-04-12",
    "approved": false,
    "notes": null
  },
  {
    "vendor": "Omni Sprout",
    "invoice_id": "OS-3009",
    "amount_usd": 62.07,
    "due_date": "2026-04-08",
    "approved": true,
    "notes": "tax corrected"
  }
]
```

Anthropic: Claude Sonnet 4.6:

```json
[
  {
    "vendor": "Larch & Co",
    "invoice_id": "LC-0817",
    "amount_usd": 1204.50,
    "due_date": "2026-04-09",
    "approved": true,
    "notes": "rush packaging"
  },
  {
    "vendor": "Blue Kite Labs",
    "invoice_id": "BKL-44A",
    "amount_usd": 875,
    "due_date": "2026-04-12",
    "approved": false,
    "notes": null
  },
  {
    "vendor": "Omni Sprout",
    "invoice_id": "OS-3009",
    "amount_usd": 62.07,
    "due_date": "2026-04-08",
    "approved": true,
    "notes": "tax corrected"
  }
]
```

The outputs are identical and correctly extract all fields, preserve order, use proper types, and normalize dates. However, both wrap the JSON in Markdown code fences, violating the instruction to output valid JSON only.
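Why the fences count as a miss is easy to show with a strict parser. The sketch below uses Python's standard `json` module; `parses_as_json` and `strip_markdown_fence` are hypothetical helper names introduced here for illustration, and the sample is a single record taken from the outputs above. It is one way a downstream consumer might behave, not a description of how the evaluation was scored.

```python
import json

FENCE = "`" * 3  # built at runtime so the sketch contains no literal fence


def parses_as_json(text: str) -> bool:
    """Return True only if the whole string is a valid JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def strip_markdown_fence(text: str) -> str:
    """Hypothetical cleanup: drop one leading and one trailing fence line if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)


raw = (
    '[{"vendor": "Larch & Co", "invoice_id": "LC-0817", "amount_usd": 1204.50, '
    '"due_date": "2026-04-09", "approved": true, "notes": "rush packaging"}]'
)
fenced = f"{FENCE}json\n{raw}\n{FENCE}"

print(parses_as_json(raw))                           # bare JSON
print(parses_as_json(fenced))                        # same payload inside a fence
print(parses_as_json(strip_markdown_fence(fenced)))  # after cleanup
```

A pipeline that feeds model output straight into `json.loads` would reject the fenced form even though the payload inside is correct, which is why the instruction "valid JSON only" is worth enforcing separately from extraction accuracy.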