Head to head: grok-4.3 vs Codestral-2501


One model handled the basics cleanly; the other kept tripping over instructions that weren’t optional. This matchup wasn’t close once the outputs were judged on correctness, format discipline, and tone control.

grok-4.3 wins because it was reliably *usable* across the set, while Codestral-2501 too often made the kind of mistakes that kill trust in production. The aggregate score says 35.0 to 23.0, but the more important story is where those points came from: grok-4.3 kept following the brief, and Codestral-2501 kept slipping on details that were the whole job.

The clearest miss was `python-log-redactor`. grok-4.3 actually redacted the **values** of the specified query parameters case-insensitively, while preserving the rest of the URL, and it also masked emails and IPv4 addresses. Codestral-2501 botched the core requirement by replacing bare parameter names instead of their values, yielding malformed output like `token=[REDACTED]=abc`. That’s not a stylistic disagreement; that’s a failure on the primary task.

The writing tasks told a similar story. On `status-update-delay`, grok-4.3 was calmer, tighter, and better tuned to an internal update: direct, accountable, and within the requested length. Codestral-2501 was factually fine but padded and overly formal, with list formatting and soft corporate filler that made it feel less like a real status note. On `meeting-notes-summary`, grok-4.3 again showed better instruction discipline: a true two-sentence summary plus a more complete JSON payload, including the conditional rollback decision and the pricing-copy change. Codestral-2501 was close, but “close” included omitting decision detail and misreading the indexing lag threshold as a current blocker rather than a peak condition to monitor.

Even where Codestral-2501 had a defensible instinct, it still lost on compliance. In `messy-orders-to-json`, its choice to merge duplicate `AB-9` quantities was arguably reasonable, but it wrapped the answer in Markdown code fences, violating the explicit JSON-only requirement. grok-4.3’s output was valid JSON-only, correctly extracted and sorted, and its only arguable flaw was leaving duplicate line items separate.

That’s the difference in this matchup: grok-4.3’s mistakes were debatable implementation choices; Codestral-2501’s were instruction-breaking errors.

**Final call: grok-4.3, comfortably. It was more accurate on the hard requirements, more disciplined about output format, and better at matching tone without unnecessary ceremony. Codestral-2501 had moments, but too many of them came with avoidable own-goals.**

python-log-redactor

Language: Python 3. Write a function `redact_log(line: str) -> str` that masks sensitive values in application log lines before they are shipped to vendors.

Rules:
- Replace any email address with `[EMAIL]`
- Replace IPv4 addresses with `[IP]`
- Replace values of query params named `token`, `api_key`, or `sessionId` with `[REDACTED]` while leaving the rest of the URL intact
- Matching for parameter names should be case-insensitive
- Do not modify other text

Examples:
- `GET /v1/export?user=ana@corlix.dev&token=abC123 from 192.168.4.9` -> `GET /v1/export?user=[EMAIL]&token=[REDACTED] from [IP]`
- `callback url=/done?API_KEY=z9z9&ok=true contact=ops@ferron.io` -> `callback url=/done?API_KEY=[REDACTED]&ok=true contact=[EMAIL]`

Return code only. No explanation.

grok-4.3:
```python
import re

def redact_log(line: str) -> str:
    def redact_param(m):
        return f"{m.group(1)}=[REDACTED]"
    line = re.sub(r'(?i)(token|api_key|sessionid)=([^&\s]*)', redact_param, line)
    line = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '[EMAIL]', line)
    line = re.sub(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b', '[IP]', line)
    return line
```

Codestral-2501:
```python
import re

def redact_log(line: str) -> str:
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
    ip_pattern = r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'
    query_param_pattern = re.compile(r'\b(token|api_key|sessionId)\b', re.IGNORECASE)

    def redact_email(match):
        return '[EMAIL]'

    def redact_ip(match):
        return '[IP]'

    def redact_query_param(match):
        return match.group(0) + '=[REDACTED]'

    line = re.sub(email_pattern, redact_email, line)
    line = re.sub(ip_pattern, redact_ip, line)
    line = query_param_pattern.sub(redact_query_param, line)
    return line
```

grok-4.3 correctly redacts the specified query parameter values case-insensitively while preserving the rest of the URL, and it also masks emails and IPv4 addresses. Codestral-2501 matches only the bare parameter names and appends `=[REDACTED]` to them instead of replacing the values, producing malformed output like `token=[REDACTED]=abc` and failing the core requirement.
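To make the failure mode concrete, here is a minimal sketch that isolates the query-parameter step of each approach. The helper names `redact_names_only` and `redact_values` and the sample line are hypothetical, introduced only for illustration; the patterns follow the two submissions above, with a leading `\b` added to the value-matching one as an assumption so that names such as `my_token` are left alone.

```python
import re

# Matches the parameter NAME only (the approach that produced the malformed output).
NAME_ONLY = re.compile(r'\b(token|api_key|sessionId)\b', re.IGNORECASE)

# Matches NAME=VALUE, capturing the name so only the value is replaced.
NAME_AND_VALUE = re.compile(r'(?i)\b(token|api_key|sessionid)=([^&\s]*)')


def redact_names_only(line: str) -> str:
    # Appends "=[REDACTED]" after the name and leaves the original value in place.
    return NAME_ONLY.sub(lambda m: m.group(0) + '=[REDACTED]', line)


def redact_values(line: str) -> str:
    # Keeps the name as written (preserving its case) and swaps out the value.
    return NAME_AND_VALUE.sub(lambda m: m.group(1) + '=[REDACTED]', line)


sample = "GET /v1/export?token=abC123&ok=true"
print(redact_names_only(sample))  # GET /v1/export?token=[REDACTED]=abC123&ok=true
print(redact_values(sample))      # GET /v1/export?token=[REDACTED]&ok=true
```

The name-only version never consumes the value, so the secret survives in the output. That is why this counts as a functional failure and not a formatting quirk.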

status-update-delay

Draft a workplace status update for the VP of Operations and the warehouse leads. Tone: calm, direct, accountable. Length: 140-180 words. Context: The Riverside fulfillment center was supposed to start wave-picking with the new scanner workflow this Monday. During pilot runs, scan latency on aisle routers averaged 2.8 seconds instead of the target 0.5 seconds, causing duplicate picks. We are delaying rollout by one week. Temporary workaround: keep the old workflow for all ambient zones, but continue limited testing in cold storage with 6 trained associates per shift. Net impact: no customer SLA risk this week, but the planned labor-efficiency gain of 7% will not land this month. Mention next steps: router firmware patch tonight, retest Wednesday morning, go/no-go decision by Thursday 3 p.m., and a revised training calendar by Friday.

grok-4.3: Team, The Riverside fulfillment center was set to launch wave-picking with the new scanner workflow this Monday. During pilot runs, however, the aisle routers averaged 2.8 seconds of scan latency—well above the 0.5-second target—which led to duplicate picks. As a result, we are postponing the rollout by one week to resolve the issue. For the interim, the previous workflow will remain in place for all ambient zones. We will continue limited testing in cold storage with six trained associates per shift. This plan carries no risk to customer SLAs this week, though the expected 7% improvement in labor efficiency will not occur this month. Next, the router firmware will be patched tonight, with retesting scheduled for Wednesday morning. A go or no-go decision will be made by Thursday at 3 p.m., and I will share a revised training calendar by Friday. I will oversee these actions and keep you informed of any developments.

Codestral-2501:
Subject: Status Update on Riverside Fulfillment Center Wave-Picking Rollout

Dear VP of Operations and Warehouse Leads,

I am writing to provide an update on the Riverside fulfillment center's wave-picking rollout with the new scanner workflow. As planned, we were set to commence this process this Monday. However, during pilot runs, we encountered scan latency issues on aisle routers, averaging 2.8 seconds instead of the target 0.5 seconds. This latency resulted in duplicate picks, necessitating a delay in the rollout by one week.

To mitigate immediate customer SLA risks, we have implemented a temporary workaround. We will maintain the old workflow for all ambient zones but will continue limited testing in cold storage with 6 trained associates per shift. This approach ensures no customer SLA risk this week but means we will not achieve the planned 7% labor-efficiency gain this month.

Next steps include:
1. Applying a router firmware patch tonight.
2. Retesting the workflow Wednesday morning.
3. Making a go/no-go decision by Thursday 3 p.m.
4. Developing a revised training calendar by Friday.

I am committed to ensuring a smooth transition and will keep you both informed of any further developments. Thank you for your understanding and support.

Best,
[Your Name]

grok-4.3 is more concise and better matches the requested calm, direct, accountable tone while covering all required facts within the target length. Codestral-2501 is accurate but more formal and padded, with list formatting and phrases like 'thank you for your understanding and support' that feel less direct for an internal status update.

meeting-notes-summary

Read the meeting notes below, then provide:
1) a 2-sentence summary
2) a JSON object with keys `launch_date`, `blocking_issues`, `decisions`, and `owners`

Meeting notes:
- Product sync for Nimbus Note 2.4 on 12 May.
- Marta said the desktop editor crash is fixed in build 4182, but QA has only verified on Windows; macOS verification is scheduled for Thursday.
- DevOps reported EU region indexing lag peaked at 19 minutes after the search schema change; Arjun will roll back the analyzer tweak if lag exceeds 10 minutes again.
- Decision: keep the AI outline feature behind the team-admin flag for launch.
- Pricing page copy still mentions "unlimited guest workspaces"; legal wants it changed to "up to 3 guest workspaces on Growth" before release.
- Sam confirmed the release candidate can be cut on 16 May if macOS QA passes and pricing copy is updated.
- Owner list: Marta - desktop stability verification; Arjun - search indexing monitoring; Lena - pricing page copy update.

grok-4.3:
1) The Nimbus Note 2.4 sync on 12 May covered a desktop crash fix verified only on Windows so far, ongoing EU search-indexing lag after a schema change, and the need to revise outdated pricing-page copy before release. The team decided to gate the AI outline feature, set a release-candidate target of 16 May once macOS QA and copy updates are complete, and assigned owners for the remaining work.
2) {"launch_date": "16 May", "blocking_issues": ["macOS QA verification of desktop crash fix still pending", "pricing page copy still references unlimited guest workspaces", "EU indexing lag reached 19 min after schema change"], "decisions": ["keep AI outline feature behind team-admin flag", "update pricing copy to state 'up to 3 guest workspaces on Growth'", "roll back analyzer tweak if lag exceeds 10 minutes"], "owners": {"Marta": "desktop stability verification", "Arjun": "search indexing monitoring", "Lena": "pricing page copy update"}}

Codestral-2501:
### Summary
The meeting discussed the progress and issues related to the Nimbus Note 2.4 release, focusing on desktop editor stability, search indexing performance, and pricing page updates. Key decisions include keeping the AI outline feature behind a flag and setting a release date of 16 May if certain conditions are met.

### JSON Object
```json
{
  "launch_date": "16 May",
  "blocking_issues": [
    "macOS verification for desktop editor crash",
    "EU region indexing lag exceeding 10 minutes",
    "Pricing page copy update"
  ],
  "decisions": [
    "Keep the AI outline feature behind the team-admin flag for launch"
  ],
  "owners": {
    "Marta": "desktop stability verification",
    "Arjun": "search indexing monitoring",
    "Lena": "pricing page copy update"
  }
}
```

grok-4.3 better follows the requested format with a true 2-sentence summary and a more complete JSON, including the conditional rollback decision and the specific pricing-copy change. Codestral-2501 is mostly correct but adds extra formatting (Markdown headings and a code fence), omits important decision details, and treats the indexing lag threshold as a current blocking issue rather than the noted peak/monitoring condition.

messy-orders-to-json

Convert the messy order notes below into valid JSON only. Output an object with one key, `orders`, whose value is an array of objects sorted by `order_id` ascending.

Schema for each order object:
- `order_id`: string
- `customer`: string
- `ship_method`: one of `ground`, `air`, `pickup`
- `priority`: boolean
- `items`: array of objects with keys `sku` (string) and `qty` (integer)
- `notes`: string or null

Normalization rules:
- `priority=yes` or `rush` => true; otherwise false
- `pickup`, `pick-up`, and `will collect` => `pickup`
- `2x AB-9` means sku `AB-9`, qty 2
- If notes are missing, use null

Messy data:
1) Order R-208 | cust: Helio Farm Supply | ship=ground | items: 2x AB-9, 1x TT-41 | priority=no
2) order_id=R-205 ; customer "Mira Leto" ; method: pick-up ; items=1x QL-2 ; rush ; note=bring invoice copy
3) R-211 / customer=Northglass Clinic / ship method air / items 4x MN-88, 2x QL-2 / notes: leave at receiving bay 3
4) ID R-206, customer: Oren Vale, ship: will collect, items: 3x AB-9, 3x AB-9, 1x LM-7, priority=yes

grok-4.3:
{
  "orders": [
    {
      "order_id": "R-205",
      "customer": "Mira Leto",
      "ship_method": "pickup",
      "priority": true,
      "items": [
        { "sku": "QL-2", "qty": 1 }
      ],
      "notes": "bring invoice copy"
    },
    {
      "order_id": "R-206",
      "customer": "Oren Vale",
      "ship_method": "pickup",
      "priority": true,
      "items": [
        { "sku": "AB-9", "qty": 3 },
        { "sku": "AB-9", "qty": 3 },
        { "sku": "LM-7", "qty": 1 }
      ],
      "notes": null
    },
    {
      "order_id": "R-208",
      "customer": "Helio Farm Supply",
      "ship_method": "ground",
      "priority": false,
      "items": [
        { "sku": "AB-9", "qty": 2 },
        { "sku": "TT-41", "qty": 1 }
      ],
      "notes": null
    },
    {
      "order_id": "R-211",
      "customer": "Northglass Clinic",
      "ship_method": "air",
      "priority": false,
      "items": [
        { "sku": "MN-88", "qty": 4 },
        { "sku": "QL-2", "qty": 2 }
      ],
      "notes": "leave at receiving bay 3"
    }
  ]
}

Codestral-2501:
```json
{
  "orders": [
    {
      "order_id": "R-205",
      "customer": "Mira Leto",
      "ship_method": "pickup",
      "priority": true,
      "items": [
        { "sku": "QL-2", "qty": 1 }
      ],
      "notes": "bring invoice copy"
    },
    {
      "order_id": "R-206",
      "customer": "Oren Vale",
      "ship_method": "pickup",
      "priority": true,
      "items": [
        { "sku": "AB-9", "qty": 6 },
        { "sku": "LM-7", "qty": 1 }
      ],
      "notes": null
    },
    {
      "order_id": "R-208",
      "customer": "Helio Farm Supply",
      "ship_method": "ground",
      "priority": false,
      "items": [
        { "sku": "AB-9", "qty": 2 },
        { "sku": "TT-41", "qty": 1 }
      ],
      "notes": null
    },
    {
      "order_id": "R-211",
      "customer": "Northglass Clinic",
      "ship_method": "air",
      "priority": false,
      "items": [
        { "sku": "MN-88", "qty": 4 },
        { "sku": "QL-2", "qty": 2 }
      ],
      "notes": "leave at receiving bay 3"
    }
  ]
}
```

grok-4.3 is valid JSON-only and correctly extracts and sorts all orders; its only debatable issue is leaving duplicate AB-9 line items separate. Codestral-2501 incorrectly wraps the output in Markdown code fences, violating the JSON-only instruction, even though its merged quantity for duplicate items may be reasonable.
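A rough sketch of how the two points in this verdict could be checked mechanically, assuming a consumer that passes the raw model output straight to a JSON parser. The names `is_json_only` and `merge_duplicate_skus` are hypothetical helpers written for this illustration and are not part of either submission or of the evaluation harness.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built at runtime for readability


def is_json_only(raw: str) -> bool:
    """Return True if the entire string parses as one JSON document."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return False
    return True


def merge_duplicate_skus(items):
    """Sum quantities for repeated SKUs, keeping first-seen order."""
    totals = {}
    for item in items:
        totals[item["sku"]] = totals.get(item["sku"], 0) + item["qty"]
    return [{"sku": sku, "qty": qty} for sku, qty in totals.items()]


bare = '{"orders": []}'
fenced = f"{FENCE}json\n{bare}\n{FENCE}"

print(is_json_only(bare))    # True
print(is_json_only(fenced))  # False: the fence is not part of JSON syntax

# Line items for order R-206 as they appear in the messy input.
r206_items = [
    {"sku": "AB-9", "qty": 3},
    {"sku": "AB-9", "qty": 3},
    {"sku": "LM-7", "qty": 1},
]
print(merge_duplicate_skus(r206_items))
# [{'sku': 'AB-9', 'qty': 6}, {'sku': 'LM-7', 'qty': 1}]
```

Under that assumption the fenced output fails before its contents are ever inspected, while merging duplicates is a post-processing choice the prompt leaves open.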