DeepSeek-R1 beats Codestral-2501 where it counts

Codestral-2501 vs DeepSeek-R1

DeepSeek-R1 takes this matchup 36.8 to 33.1 by being more exact, more complete, and more usable in real-world writing tasks. Codestral-2501 is competent, but it repeatedly settles for "good enough" where DeepSeek-R1 closes the gap to actually correct.

DeepSeek-R1 wins this head-to-head because it was sharper on the details that decide whether output is merely plausible or actually reliable. The aggregate score gap, 36.8 to 33.1, is not a fluke of style points; it comes from DeepSeek-R1 being better on three of the four tasks and tying the fourth.

The clearest technical separation showed up in **python-log-redactor**. Codestral-2501 would redact invalid dotted quads like `999.999.999.999`, which is exactly the kind of sloppy pattern matching that looks fine until it quietly mangles data. DeepSeek-R1 did the stricter, correct thing by limiting IPv4 octets to `0–255`. Both models preserved the required `token` and `api_key` key names, so the octet check is what separates them. That is the difference between a regex that demos well and one you can trust.

DeepSeek-R1 also had the stronger hand on the writing-and-extraction tasks. In **customer-delay-email**, DeepSeek-R1 was more candid, gave the revised ETA plainly, and included a concrete next step (confirming after testing) instead of Codestral-2501’s vaguer "we’re prioritizing this." In **meeting-notes-summary**, DeepSeek-R1 again did the fuller job: it captured the pending QA sign-off and the release-note dependency, and it surfaced QA as an open risk in the JSON. Codestral-2501 was accurate, but it left out the very risk that could still block the release.

The only non-loss for Codestral-2501 was **messy-orders-to-json**, and that was a tie. Both models produced valid JSON, preserved order, parsed quantities correctly, split items cleanly, and handled `rush= yes` as `true` after trimming. But a tie on straightforward structure conversion does not erase the pattern everywhere else: when precision and completeness mattered, DeepSeek-R1 was the model doing the grown-up work.

**Final call: DeepSeek-R1 is the better text model here. Codestral-2501 is serviceable, but DeepSeek-R1 was more exact in code, more accountable in customer communication, and more thorough in structured summarization.**

python-log-redactor

Language: Python 3. Write a function `redact_log(line: str) -> str` that masks sensitive values in application logs. Replace any email address with `[EMAIL]`, any IPv4 address with `[IP]`, and any token value in patterns `token=...` or `api_key=...` with `token=[SECRET]` / `api_key=[SECRET]`. Preserve all other text exactly. Example: `user=mina.cho@northbeam.io ip=10.44.3.9 token=sk_live_9Qx` -> `user=[EMAIL] ip=[IP] token=[SECRET]`. Return code only.

Codestral-2501:

```python
import re

def redact_log(line: str) -> str:
    # Regular expressions for email, IPv4, and token patterns
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
    ipv4_pattern = r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'
    token_pattern = r'\b(token|api_key)=[^\s]+'

    # Replace email addresses
    line = re.sub(email_pattern, '[EMAIL]', line)

    # Replace IPv4 addresses
    line = re.sub(ipv4_pattern, '[IP]', line)

    # Replace token values
    line = re.sub(token_pattern, lambda match: f'{match.group(1)}=[SECRET]', line)

    return line
```

DeepSeek-R1:

```python
import re

def redact_log(line: str) -> str:
    line = re.sub(r'(token=)\S+', r'\1[SECRET]', line)
    line = re.sub(r'(api_key=)\S+', r'\1[SECRET]', line)
    line = re.sub(r'[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,}', '[EMAIL]', line)
    ip_pattern = r'\b((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b'
    return re.sub(ip_pattern, '[IP]', line)
```

DeepSeek-R1 is more correct on IPv4 matching because it restricts octets to 0–255, while Codestral-2501 would redact invalid dotted quads like 999.999.999.999. Both satisfy the prompt reasonably well and both preserve the required token/api_key key names, but DeepSeek-R1 adheres better to the intended definition of an IPv4 address.
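To make the octet difference concrete, here is a minimal sketch that runs the two IPv4 patterns quoted above against a valid and an invalid address. The helper name `redact_ips` and the sample strings are illustrative assumptions and are not part of either model's answer.

```python
import re

# IPv4 patterns copied verbatim from the two answers above.
LOOSE_IPV4 = r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'
STRICT_IPV4 = (
    r'\b((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}'
    r'(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b'
)

def redact_ips(line: str, pattern: str) -> str:
    """Hypothetical helper: replace every match of `pattern` with [IP]."""
    return re.sub(pattern, '[IP]', line)

valid = 'ip=10.44.3.9'
invalid = 'ip=999.999.999.999'

# Both patterns redact a real address.
print(redact_ips(valid, LOOSE_IPV4))
print(redact_ips(valid, STRICT_IPV4))

# Only the loose pattern rewrites the out-of-range dotted quad.
print(redact_ips(invalid, LOOSE_IPV4))
print(redact_ips(invalid, STRICT_IPV4))
```

Whether an out-of-range dotted quad should be left alone depends on how strictly the prompt's "IPv4 address" is read; the comparison above takes the strict reading.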

customer-delay-email

Draft an email to a customer named Armand Vale at Bristle & Pine Studio. Context: we promised a custom analytics export by Tuesday, but a regression found this morning means delivery will slip to Friday. We have already reproduced the bug, a fix is in review, and no customer data was lost or exposed. Audience: external customer. Tone: candid, calm, accountable. Length: 140-180 words. Include a brief apology, the new ETA, and one concrete next step we are taking.

Codestral-2501: Subject: Update on Your Custom Analytics Export - New Delivery Date Dear Armand, I hope this message finds you well. I am writing to inform you that we have encountered an unexpected issue with your custom analytics export, which has caused a slight delay in delivery. We discovered a regression this morning, but rest assured, no customer data was lost or exposed. We have already reproduced the bug and are in the process of reviewing a fix. Given this, we will not be able to deliver the export by Tuesday as previously promised. Instead, we expect to have it ready for you by Friday. We understand the importance of this delivery and apologize for any inconvenience this may cause. To address this promptly, we will be prioritizing this task and will keep you updated on any further developments. Thank you for your understanding and patience. Best regards, [Your Name] Bristle & Pine Studio

DeepSeek-R1: **Subject:** Update on Your Custom Analytics Export – Revised Delivery Timeline Dear Armand, I’m writing to sincerely apologize for the delay in delivering your custom analytics export, which we had committed to providing by Tuesday. During final testing this morning, we identified a regression that requires immediate attention. While we’re confident no customer data was compromised or exposed, we’ve made the difficult decision to postpone delivery to **Friday, [Date]** to ensure the solution meets our quality standards. Here’s what we’re doing to resolve this swiftly: - **The fix is already underway.** Our team has reproduced the issue and submitted a resolution for code review. - **Rigorous testing will follow approval** to prevent further disruptions. We understand the frustration this delay may cause and take full accountability. To keep you informed, I’ll send a confirmation once testing concludes and the export is ready for delivery. If you have urgent questions or need interim data, please reply directly—I’m happy to assist. We deeply value your partnership and are committed to ensuring this solution exceeds expectations. Thank you for your understanding, [Your Full Name] [Your Title] [Company Name] [Contact Information] *Word count: 160*

DeepSeek-R1 is more candid and accountable, clearly states the revised ETA, includes the required concrete next step, and better matches an external customer-facing tone. Codestral-2501 is solid and accurate but more generic, and its next step ('prioritizing this task') is less specific than DeepSeek-R1’s commitment to confirm after testing.

meeting-notes-summary

Summarize the meeting notes below in 3 bullet points, then extract: `launch_date`, `owner_for_pricing_page`, and `open_risks` as a JSON object.

Meeting notes:
- Nimbus Note 2.3 rollout discussed by Priya, Evan, Luc.
- Mobile crash on Android 12 seems fixed in build 2841; QA wants one more pass tomorrow morning.
- Pricing page still shows last quarter's "Starter $19" copy in two places. Luc will update web copy today.
- Marketing asked whether launch can move earlier than May 23; team said no because localization for Swedish and Czech is still pending.
- Priya confirmed the in-app tutorial is translated and ready.
- One enterprise customer, Alder Peak Clinics, is waiting on SSO docs; Evan will send them by Wednesday.
- If QA sign-off lands tomorrow, release notes get finalized immediately after.

Codestral-2501:

### Summary:
- The team discussed the rollout of Nimbus Note 2.3 and addressed issues with the Android 12 mobile crash.
- The pricing page still shows outdated information, and Luc will update the web copy today.
- The launch date cannot be moved earlier than May 23 due to pending localization for Swedish and Czech.

### Extracted Information:
```json
{
  "launch_date": "May 23",
  "owner_for_pricing_page": "Luc",
  "open_risks": [
    "Pending localization for Swedish and Czech",
    "Enterprise customer Alder Peak Clinics waiting on SSO docs"
  ]
}
```

DeepSeek-R1:

**Summary:**
- Nimbus Note 2.3 rollout progress reviewed, with QA requiring a final Android 12 test before sign-off.
- Pricing page updates assigned to Luc, while Marketing’s request to move the launch earlier was denied due to pending Swedish/Czech localization.
- SSO documentation for Alder Peak Clinics will be sent by Evan by Wednesday, and release notes await QA approval.

**JSON Object:**
```json
{
  "launch_date": "May 23",
  "owner_for_pricing_page": "Luc",
  "open_risks": [
    "Pending localization for Swedish and Czech",
    "QA sign-off for Android 12 crash fix pending final test",
    "SSO documentation delay for Alder Peak Clinics"
  ]
}
```

DeepSeek-R1 is more complete: it captures the pending QA sign-off and release-note dependency in the summary and includes QA as an open risk in the JSON. Codestral-2501 is accurate but omits that key pending risk, making its extraction slightly less thorough.
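One rough way to check this completeness gap mechanically is to load both JSON objects and look for a QA-related entry in `open_risks`. This is only a sketch: the `mentions_qa` helper and its keyword test are assumptions made for illustration, and a real grader would need a less brittle comparison than matching a single word.

```python
import json
import re

# JSON objects copied from the two model outputs above.
CODESTRAL = json.loads("""
{
  "launch_date": "May 23",
  "owner_for_pricing_page": "Luc",
  "open_risks": [
    "Pending localization for Swedish and Czech",
    "Enterprise customer Alder Peak Clinics waiting on SSO docs"
  ]
}
""")

DEEPSEEK = json.loads("""
{
  "launch_date": "May 23",
  "owner_for_pricing_page": "Luc",
  "open_risks": [
    "Pending localization for Swedish and Czech",
    "QA sign-off for Android 12 crash fix pending final test",
    "SSO documentation delay for Alder Peak Clinics"
  ]
}
""")

def mentions_qa(extraction: dict) -> bool:
    """Hypothetical helper: True if any open risk contains the word 'QA'."""
    return any(
        re.search(r"\bQA\b", risk, re.IGNORECASE)
        for risk in extraction["open_risks"]
    )

# The scalar fields agree; only the risk lists differ.
for name, data in (("Codestral-2501", CODESTRAL), ("DeepSeek-R1", DEEPSEEK)):
    print(name, data["launch_date"], data["owner_for_pricing_page"], mentions_qa(data))
```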

messy-orders-to-json

Convert the messy order lines below into valid JSON with this exact schema: `{"orders":[{"order_id":string,"customer":string,"items":[{"sku":string,"qty":number}],"rush":boolean}]}`. Rules: trim spaces, preserve order, split multiple items on `;`, parse quantities as numbers, and set `rush` true only for `rush=yes`.

Data:
ORD-9102 | customer: Kepler Bakery | items: FLR-00A x 4; YST-19 x 1 | rush=yes
ORD-9103|customer: Lumen Transit|items: MAP-7 x2; PIN-44 x 12|rush=no
ORD-9104 | customer: Harbor Thread Co. | items: DYE-8 x 3 | rush= yes

Codestral-2501:

Here is the converted JSON based on the provided schema and rules:

```json
{
  "orders": [
    {
      "order_id": "ORD-9102",
      "customer": "Kepler Bakery",
      "items": [
        { "sku": "FLR-00A", "qty": 4 },
        { "sku": "YST-19", "qty": 1 }
      ],
      "rush": true
    },
    {
      "order_id": "ORD-9103",
      "customer": "Lumen Transit",
      "items": [
        { "sku": "MAP-7", "qty": 2 },
        { "sku": "PIN-44", "qty": 12 }
      ],
      "rush": false
    },
    {
      "order_id": "ORD-9104",
      "customer": "Harbor Thread Co.",
      "items": [
        { "sku": "DYE-8", "qty": 3 }
      ],
      "rush": true
    }
  ]
}
```

DeepSeek-R1:

Here is the valid JSON output adhering to the specified schema and rules:

```json
{"orders":[{"order_id":"ORD-9102","customer":"Kepler Bakery","items":[{"sku":"FLR-00A","qty":4},{"sku":"YST-19","qty":1}],"rush":true},{"order_id":"ORD-9103","customer":"Lumen Transit","items":[{"sku":"MAP-7","qty":2},{"sku":"PIN-44","qty":12}],"rush":false},{"order_id":"ORD-9104","customer":"Harbor Thread Co.","items":[{"sku":"DYE-8","qty":3}],"rush":true}]}
```

### Explanation of Processing:
1. **Order ID**
   Extracted from the first segment, trimmed (e.g., `ORD-9102`).
2. **Customer**
   Prefix `customer:` removed, value trimmed (e.g., `Kepler Bakery`).
3. **Items**
   - Prefix `items:` removed, then split into individual items using `;`
   - Each item split by `x` to separate SKU and quantity
   - SKU and quantity trimmed, quantity converted to a number

   *Example:* `"FLR-00A x 4"` → SKU: `"FLR-00A"`, Quantity: `4`
4. **Rush**
   `rush=yes` → `true` (only when value is `"yes"` after trimming). `rush=no` or `rush= yes` (with space) → `false`/`true` based on trimmed value.
5. **Whitespace Handling**
   All fields trimmed, and inconsistent spacing in input normalized (e.g., `rush= yes` → `"yes"` → `true`).
6. **Order Preservation**
   Orders maintain their original sequence in the output JSON array.

Both outputs produce correct, valid JSON matching the required schema, preserve order, trim spaces, split items correctly, parse quantities as numbers, and set rush properly, including treating `rush= yes` as true after trimming. DeepSeek-R1's JSON is more compact but comes with an extra explanation, while Codestral-2501 wraps its JSON in a line of prose and a code block; neither issue affects the correctness of the converted JSON itself.
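For reference, the conversion rules in the prompt can be sketched as a small parser. This is one possible reading of the rules, not the method either model used; `parse_order`, `parse_orders`, and the regular expression for items are hypothetical names and choices introduced here.

```python
import json
import re

# Order lines copied from the prompt above.
RAW = """\
ORD-9102 | customer: Kepler Bakery | items: FLR-00A x 4; YST-19 x 1 | rush=yes
ORD-9103|customer: Lumen Transit|items: MAP-7 x2; PIN-44 x 12|rush=no
ORD-9104 | customer: Harbor Thread Co. | items: DYE-8 x 3 | rush= yes
"""

def parse_order(line: str) -> dict:
    """Parse one pipe-delimited order line into the target schema."""
    # Each line has exactly four fields; trim the padding around every field.
    order_id, customer, items, rush = (part.strip() for part in line.split("|"))

    parsed_items = []
    for item in items.partition(":")[2].split(";"):
        # "SKU x QTY", where the space after the "x" is optional in the input.
        sku, qty = re.fullmatch(r"\s*(\S+)\s+x\s*(\d+)\s*", item).groups()
        parsed_items.append({"sku": sku, "qty": int(qty)})

    return {
        "order_id": order_id,
        "customer": customer.partition(":")[2].strip(),
        "items": parsed_items,
        # Rush is true only when the trimmed value is exactly "yes".
        "rush": rush.partition("=")[2].strip() == "yes",
    }

def parse_orders(raw: str) -> dict:
    """Parse all non-empty lines, keeping them in their original order."""
    return {"orders": [parse_order(line) for line in raw.splitlines() if line.strip()]}

print(json.dumps(parse_orders(RAW), indent=2))
```

Under this reading, `rush= yes` becomes `true` because the value is trimmed before the comparison, which matches how both models handled the third order.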