Head to head: grok-4.3 vs Codestral-2501
One model handled the basics cleanly; the other kept tripping over instructions that weren’t optional. This matchup wasn’t close once the outputs were judged on correctness, format discipline, and tone control.
grok-4.3 wins because it was reliably *usable* across the set, while Codestral-2501 too often made the kind of mistakes that kill trust in production. The aggregate score says 35.0 to 23.0, but the more important story is where those points came from: grok-4.3 kept following the brief, and Codestral-2501 kept slipping on details that were the whole job.

The clearest miss was `python-log-redactor`. grok-4.3 actually redacted the **values** of the specified query parameters case-insensitively, while preserving the rest of the URL, and it also masked emails and IPv4 addresses. Codestral-2501 botched the core requirement by replacing bare parameter names instead of their values, yielding malformed output like `token=[REDACTED]=abc`. That’s not a stylistic disagreement; that’s a failure on the primary task.

The writing tasks told a similar story. On `status-update-delay`, grok-4.3 was calmer, tighter, and better tuned to an internal update: direct, accountable, and within the requested length. Codestral-2501 was factually fine but padded and overly formal, with list formatting and soft corporate filler that made it feel less like a real status note. On `meeting-notes-summary`, grok-4.3 again showed better instruction discipline: a true two-sentence summary plus a more complete JSON payload, including the conditional rollback decision and the pricing-copy change. Codestral-2501 was close, but “close” included omitting decision detail and misreading the indexing lag threshold as a current blocker rather than a peak condition to monitor.

Even where Codestral-2501 had a defensible instinct, it still lost on compliance. In `messy-orders-to-json`, its choice to merge duplicate `AB-9` quantities was arguably reasonable, but it wrapped the answer in Markdown code fences, violating the explicit JSON-only requirement. grok-4.3’s output was valid JSON-only, correctly extracted and sorted, and its only arguable flaw was leaving duplicate line items separate.

That’s the difference in this matchup: grok-4.3’s mistakes were debatable implementation choices; Codestral-2501’s were instruction-breaking errors.

**Final call: grok-4.3, comfortably. It was more accurate on the hard requirements, more disciplined about output format, and better at matching tone without unnecessary ceremony. Codestral-2501 had moments, but too many of them came with avoidable own-goals.**
python-log-redactor
Language: Python 3. Write a function `redact_log(line: str) -> str` that masks sensitive values in application log lines before they are shipped to vendors.

Rules:
- Replace any email address with `[EMAIL]`
- Replace IPv4 addresses with `[IP]`
- Replace values of query params named `token`, `api_key`, or `sessionId` with `[REDACTED]` while leaving the rest of the URL intact
- Matching for parameter names should be case-insensitive
- Do not modify other text

Examples:
- `GET /v1/export?user=ana@corlix.dev&token=abC123 from 192.168.4.9` -> `GET /v1/export?user=[EMAIL]&token=[REDACTED] from [IP]`
- `callback url=/done?API_KEY=z9z9&ok=true contact=ops@ferron.io` -> `callback url=/done?API_KEY=[REDACTED]&ok=true contact=[EMAIL]`

Return code only. No explanation.
A (grok-4.3) correctly redacts the specified query parameter values case-insensitively while preserving the rest of the URL, and it also masks emails and IPv4 addresses. B (Codestral-2501) incorrectly replaces bare parameter names rather than their values, producing malformed output like `token=[REDACTED]=abc` and failing the core requirement.
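For context, one way to meet this brief is to match each parameter name together with its `=` and replace only what follows it. The sketch below is a minimal illustration written for this writeup, not either model's submission. The three regular expressions are assumptions chosen to cover the two examples in the prompt: the email pattern is deliberately simple, and the IPv4 pattern does not check that each octet is in the 0-255 range.

```python
import re

# Assumed patterns, kept intentionally simple for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # no 0-255 validation

# Group 1 captures the delimiter, the parameter name, and "=", so the name is
# kept and only the value (everything up to the next "&", "#", or whitespace)
# is replaced. (?i) makes the name match case-insensitive.
PARAM_RE = re.compile(r"(?i)([?&](?:token|api_key|sessionId)=)[^&\s#]*")


def redact_log(line: str) -> str:
    # Redact parameter values first so a secret that happens to look like an
    # email or IP is still reported as [REDACTED].
    line = PARAM_RE.sub(r"\g<1>[REDACTED]", line)
    line = EMAIL_RE.sub("[EMAIL]", line)
    line = IPV4_RE.sub("[IP]", line)
    return line


print(redact_log("GET /v1/export?user=ana@corlix.dev&token=abC123 from 192.168.4.9"))
# GET /v1/export?user=[EMAIL]&token=[REDACTED] from [IP]
print(redact_log("callback url=/done?API_KEY=z9z9&ok=true contact=ops@ferron.io"))
# callback url=/done?API_KEY=[REDACTED]&ok=true contact=[EMAIL]
```

Because the value is consumed by the match, this approach cannot produce the `token=[REDACTED]=abc` shape: that output arises when only the name is matched and the original `=value` is left behind.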
status-update-delay
Draft a workplace status update for the VP of Operations and the warehouse leads. Tone: calm, direct, accountable. Length: 140-180 words. Context: The Riverside fulfillment center was supposed to start wave-picking with the new scanner workflow this Monday. During pilot runs, scan latency on aisle routers averaged 2.8 seconds instead of the target 0.5 seconds, causing duplicate picks. We are delaying rollout by one week. Temporary workaround: keep the old workflow for all ambient zones, but continue limited testing in cold storage with 6 trained associates per shift. Net impact: no customer SLA risk this week, but the planned labor-efficiency gain of 7% will not land this month. Mention next steps: router firmware patch tonight, retest Wednesday morning, go/no-go decision by Thursday 3 p.m., and a revised training calendar by Friday.
A (grok-4.3) is more concise and better matches the requested calm, direct, accountable tone while covering all required facts within the target length. B (Codestral-2501) is accurate but more formal and padded, with list formatting and phrases like 'thank you for your understanding and support' that feel less direct for an internal status update.
meeting-notes-summary
Read the meeting notes below, then provide:
1) a 2-sentence summary
2) a JSON object with keys `launch_date`, `blocking_issues`, `decisions`, and `owners`

Meeting notes:
- Product sync for Nimbus Note 2.4 on 12 May.
- Marta said the desktop editor crash is fixed in build 4182, but QA has only verified on Windows; macOS verification is scheduled for Thursday.
- DevOps reported EU region indexing lag peaked at 19 minutes after the search schema change; Arjun will roll back the analyzer tweak if lag exceeds 10 minutes again.
- Decision: keep the AI outline feature behind the team-admin flag for launch.
- Pricing page copy still mentions "unlimited guest workspaces"; legal wants it changed to "up to 3 guest workspaces on Growth" before release.
- Sam confirmed the release candidate can be cut on 16 May if macOS QA passes and pricing copy is updated.
- Owner list: Marta - desktop stability verification; Arjun - search indexing monitoring; Lena - pricing page copy update.
A (grok-4.3) better follows the requested format with a true 2-sentence summary and a more complete JSON, including the conditional rollback decision and the specific pricing-copy change. B (Codestral-2501) is mostly correct but adds extra formatting, omits important decision details, and treats the indexing lag threshold as a current blocking issue rather than the noted peak/monitoring condition.
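The JSON half of this task names a fixed set of keys, so that part of the format check can be done mechanically. The sketch below is a hypothetical helper, not part of the evaluation described here; `key_diff` and the placeholder payload are invented for illustration and do not reproduce either model's answer.

```python
# Keys required by the prompt for the JSON object.
REQUIRED_KEYS = {"launch_date", "blocking_issues", "decisions", "owners"}


def key_diff(payload: dict):
    """Return (missing, extra) key names relative to the required set."""
    present = set(payload)
    return sorted(REQUIRED_KEYS - present), sorted(present - REQUIRED_KEYS)


# Placeholder payload with empty values, used only to exercise the check.
example = {"launch_date": None, "blocking_issues": [], "decisions": [], "summary": ""}
missing, extra = key_diff(example)
print(missing)  # ['owners']
print(extra)    # ['summary']
```

A check like this only covers structure. Whether `decisions` captures the conditional rollback, or whether the lag threshold belongs under `blocking_issues`, still needs a reader's judgment.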
messy-orders-to-json
Convert the messy order notes below into valid JSON only. Output an object with one key, `orders`, whose value is an array of objects sorted by `order_id` ascending.

Schema for each order object:
- `order_id`: string
- `customer`: string
- `ship_method`: one of `ground`, `air`, `pickup`
- `priority`: boolean
- `items`: array of objects with keys `sku` (string) and `qty` (integer)
- `notes`: string or null

Normalization rules:
- `priority=yes` or `rush` => true; otherwise false
- `pickup`, `pick-up`, and `will collect` => `pickup`
- `2x AB-9` means sku `AB-9`, qty 2
- If notes are missing, use null

Messy data:
1) Order R-208 | cust: Helio Farm Supply | ship=ground | items: 2x AB-9, 1x TT-41 | priority=no
2) order_id=R-205 ; customer "Mira Leto" ; method: pick-up ; items=1x QL-2 ; rush ; note=bring invoice copy
3) R-211 / customer=Northglass Clinic / ship method air / items 4x MN-88, 2x QL-2 / notes: leave at receiving bay 3
4) ID R-206, customer: Oren Vale, ship: will collect, items: 3x AB-9, 3x AB-9, 1x LM-7, priority=yes
A (grok-4.3) is valid JSON-only and correctly extracts and sorts all orders; its only debatable issue is leaving duplicate AB-9 line items separate. B (Codestral-2501) incorrectly wraps the output in Markdown code fences, violating the JSON-only instruction, even though its merged quantity for duplicate items may be reasonable.
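Both points in this verdict are easy to illustrate. The sketch below is a minimal, hypothetical pair of helpers (`is_json_only` and `merge_duplicate_items` are names invented here, not part of the evaluation): the first shows why a fenced answer fails a strict JSON-only check, and the second shows the merge behavior that the prompt neither requires nor forbids.

```python
import json

# Markdown code fence marker, built this way so this example stays fence-safe.
FENCE = "`" * 3


def is_json_only(text: str) -> bool:
    """True if the entire response parses as JSON with nothing wrapped around it."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def merge_duplicate_items(items):
    """Sum quantities for repeated SKUs, keeping first-seen order."""
    totals = {}
    for item in items:
        totals[item["sku"]] = totals.get(item["sku"], 0) + item["qty"]
    return [{"sku": sku, "qty": qty} for sku, qty in totals.items()]


bare = '{"orders": []}'
fenced = f"{FENCE}json\n{bare}\n{FENCE}"
print(is_json_only(bare))    # True
print(is_json_only(fenced))  # False

# Line items for order R-206 as written in the messy data.
r206_items = [
    {"sku": "AB-9", "qty": 3},
    {"sku": "AB-9", "qty": 3},
    {"sku": "LM-7", "qty": 1},
]
print(merge_duplicate_items(r206_items))
# [{'sku': 'AB-9', 'qty': 6}, {'sku': 'LM-7', 'qty': 1}]
```

Note that `is_json_only` accepts any JSON value, not just an object with an `orders` key, so a real check would also validate the schema. Whether to merge or keep the duplicate `AB-9` lines is the judgment call the verdict treats as debatable.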
Matchup powered by OpenRouter.