DeepSeek-R1 beats Codestral-2501 where it counts

DeepSeek-R1 takes this matchup 36.8 to 33.1 by being more exact, more complete, and more usable in real-world writing tasks. Codestral-2501 is competent, but it repeatedly settles for "good enough" where DeepSeek-R1 closes the gap to actually correct.

DeepSeek-R1 wins this head-to-head because it was sharper on the details that decide whether output is merely plausible or actually reliable. The aggregate score gap, 36.8 to 33.1, is not a fluke of style points; it comes from DeepSeek-R1 winning three of the four tasks and tying the fourth.

The clearest technical separation showed up in python-log-redactor. Codestral-2501 would redact invalid dotted quads like 999.999.999.999, which is exactly the kind of sloppy pattern matching that looks fine until it quietly mangles data. DeepSeek-R1 did the stricter, correct thing by limiting IPv4 octets to 0–255, and it also preserved the required token and api_key key names. That is the difference between a regex that demos well and one you can trust.

DeepSeek-R1 also had the stronger hand on the writing-and-extraction tasks. In customer-delay-email, it was more candid, gave the revised ETA plainly, and included a concrete next step (confirming after testing) instead of Codestral-2501’s vaguer "we’re prioritizing this." In meeting-notes-summary, DeepSeek-R1 again did the fuller job: it captured the pending QA sign-off and the release-note dependency, and it surfaced QA as an open risk in the JSON. Codestral-2501 was accurate, but it left out the very risk that could still block the release.

The only non-loss for Codestral-2501 was messy-orders-to-json, and that was a tie. Both models produced valid JSON, preserved order, parsed quantities correctly, split items cleanly, and handled rush= yes as true after trimming. But a tie on straightforward structure conversion does not erase the pattern everywhere else: when precision and completeness mattered, DeepSeek-R1 was the model doing the grown-up work.

Final call: DeepSeek-R1 is the better text model here. Codestral-2501 is serviceable, but DeepSeek-R1 was more exact in code, more accountable in customer communication, and more thorough in structured summarization.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. Codestral-2501 scored 33.1 to DeepSeek-R1's 36.8.

1. python-log-redactor

Language: Python 3. Write a function redact_log(line: str) -> str that masks sensitive values in application logs. Replace any email address with [EMAIL], any IPv4 address with [IP], and any token value in patterns token=... or api_key=... with token=[SECRET] / api_key=[SECRET]. Preserve all other text exactly. Example: user=mina.cho@northbeam.io ip=10.44.3.9 token=sk_live_9Qx -> user=[EMAIL] ip=[IP] token=[SECRET]. Return code only.

Winner: DeepSeek-R1. It is more correct on IPv4 matching because it restricts octets to 0–255, while Codestral-2501 would redact invalid dotted quads like 999.999.999.999. Both satisfy the prompt reasonably well, but DeepSeek-R1 adheres better to the intended definition of an IPv4 address and preserves the required token/api_key key names.
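To make the octet-range point concrete, here is a minimal sketch of what a stricter redactor could look like. It is not either model's actual output. The pattern names (OCTET, STRICT_IPV4, LOOSE_IPV4, EMAIL, SECRET) and the simplified email pattern are assumptions made for illustration.

```python
import re

# One IPv4 octet, restricted to 0-255 (illustrative pattern, not model output).
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets. The lookarounds keep a match from starting or ending
# inside a longer run of digits and dots.
STRICT_IPV4 = re.compile(rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?![\d.])")

# A looser pattern of the kind criticized above: any 1-3 digits per octet.
LOOSE_IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

# Deliberately simple email pattern; real-world address syntax is broader.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Capture the key name so it can be written back unchanged.
SECRET = re.compile(r"\b(token|api_key)=\S+")


def redact_log(line: str) -> str:
    """Mask secrets, emails, and valid IPv4 addresses; leave other text as-is."""
    line = SECRET.sub(r"\1=[SECRET]", line)  # keeps token= / api_key=
    line = EMAIL.sub("[EMAIL]", line)
    return STRICT_IPV4.sub("[IP]", line)
```

With the strict pattern, a string like 999.999.999.999 is left untouched, whereas LOOSE_IPV4 would match it. Secrets are redacted first so that a token value that happens to contain digits and dots is not partially rewritten by the later passes.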

2. customer-delay-email

Draft an email to a customer named Armand Vale at Bristle & Pine Studio. Context: we promised a custom analytics export by Tuesday, but a regression found this morning means delivery will slip to Friday. We have already reproduced the bug, a fix is in review, and no customer data was lost or exposed. Audience: external customer. Tone: candid, calm, accountable. Length: 140-180 words. Include a brief apology, the new ETA, and one concrete next step we are taking.

Winner: DeepSeek-R1. It is more candid and accountable, clearly states the revised ETA, includes the required concrete next step, and better matches an external customer-facing tone. Codestral-2501 is solid and accurate but more generic, and its next step ('prioritizing this task') is less specific than DeepSeek-R1’s commitment to confirm after testing.

3. meeting-notes-summary

Summarize the meeting notes below in 3 bullet points, then extract: launch_date, owner_for_pricing_page, and open_risks as a JSON object. Meeting notes:
- Nimbus Note 2.3 rollout discussed by Priya, Evan, Luc.
- Mobile crash on Android 12 seems fixed in build 2841; QA wants one more pass tomorrow morning.
- Pricing page still shows last quarter's "Starter $19" copy in two places. Luc will update web copy today.
- Marketing asked whether launch can move earlier than May 23; team said no because localization for Swedish and Czech is still pending.
- Priya confirmed the in-app tutorial is translated and ready.
- One enterprise customer, Alder Peak Clinics, is waiting on SSO docs; Evan will send them by Wednesday.
- If QA sign-off lands tomorrow, release notes get finalized immediately after.

Winner: DeepSeek-R1. It is more complete: it captures the pending QA sign-off and release-note dependency in the summary and includes QA as an open risk in the JSON. Codestral-2501 is accurate but omits that key pending risk, making its extraction slightly less thorough.
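The completeness gap described here can be checked mechanically. The sketch below is illustrative only: the extraction dict is one plausible reading of the meeting notes done by hand, not either model's output, and missing_risks is a hypothetical helper. Which items count as open risks is a judgment call.

```python
# One plausible extraction read off the notes by hand (an assumption,
# not either model's actual output).
extraction = {
    "launch_date": "May 23",
    "owner_for_pricing_page": "Luc",
    "open_risks": [
        "QA sign-off pending one more pass on build 2841",
        "Swedish and Czech localization still pending",
        "SSO docs for Alder Peak Clinics not yet sent",
    ],
}

REQUIRED_KEYS = {"launch_date", "owner_for_pricing_page", "open_risks"}


def missing_risks(result: dict, keywords: list[str]) -> list[str]:
    """Return the keywords that no open_risks entry mentions (case-insensitive)."""
    risks = [risk.lower() for risk in result.get("open_risks", [])]
    return [k for k in keywords if not any(k.lower() in r for r in risks)]
```

An extraction that lists localization but not the QA pass would come back with "QA" flagged, which is the omission the verdict describes.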

4. messy-orders-to-json

Convert the messy order lines below into valid JSON with this exact schema: {"orders":[{"order_id":string,"customer":string,"items":[{"sku":string,"qty":number}],"rush":boolean}]}. Rules: trim spaces, preserve order, split multiple items on ;, parse quantities as numbers, and set rush true only for rush=yes. Data:
ORD-9102 | customer: Kepler Bakery | items: FLR-00A x 4; YST-19 x 1 | rush=yes
ORD-9103|customer: Lumen Transit|items: MAP-7 x2; PIN-44 x 12|rush=no
ORD-9104 | customer: Harbor Thread Co. | items: DYE-8 x 3 | rush= yes

Winner: Tie. Both outputs produce correct, valid JSON matching the required schema, preserve order, trim spaces, split items correctly, parse quantities as numbers, and set rush properly, including treating rush= yes as true after trimming. DeepSeek-R1 is more compact but adds extra explanation, while Codestral-2501 wraps the JSON in prose and a code block; neither issue affects the correctness of the converted JSON itself.
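For readers who want to see the rules applied in code rather than by a model, the following is a rough sketch of a parser for this input. It assumes one order per line and exactly four pipe-separated fields. The names DATA, ITEM, parse_order, and convert are made up for this example.

```python
import json
import re

# Sample lines from the prompt, assumed here to be one order per line.
DATA = """\
ORD-9102 | customer: Kepler Bakery | items: FLR-00A x 4; YST-19 x 1 | rush=yes
ORD-9103|customer: Lumen Transit|items: MAP-7 x2; PIN-44 x 12|rush=no
ORD-9104 | customer: Harbor Thread Co. | items: DYE-8 x 3 | rush= yes
"""

# "SKU x QTY" with optional spaces around the x (covers "MAP-7 x2").
ITEM = re.compile(r"(?P<sku>\S+)\s*x\s*(?P<qty>\d+)")


def parse_order(line: str) -> dict:
    """Parse one pipe-delimited order line into the target schema."""
    order_id, customer, items, rush = (part.strip() for part in line.split("|"))
    parsed_items = []
    for raw in items.split(":", 1)[1].split(";"):
        match = ITEM.fullmatch(raw.strip())
        if match is None:
            raise ValueError(f"unrecognized item: {raw!r}")
        parsed_items.append({"sku": match["sku"], "qty": int(match["qty"])})
    return {
        "order_id": order_id,
        "customer": customer.split(":", 1)[1].strip(),
        "items": parsed_items,
        # Trim before comparing so "rush= yes" counts as yes.
        "rush": rush.split("=", 1)[1].strip() == "yes",
    }


def convert(text: str) -> str:
    """Convert all non-empty lines, preserving their order."""
    orders = [parse_order(line) for line in text.splitlines() if line.strip()]
    return json.dumps({"orders": orders}, indent=2)
```

Calling convert(DATA) yields a JSON document with three orders in input order. The trim-then-compare step on the rush field is the detail both models got right.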


See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
