Head to head: Anthropic: Claude Opus 4.8 vs Google: Gemini 3.5 Flash

This one is close on the aggregate, but the split tells a clear story: Gemini 3.5 Flash wins by being more disciplined about format and slightly sharper on practical instruction-following. Claude Opus 4.8 lands the strongest single extraction/summarization performance, yet gives away too much on avoidable execution details.

The final margin is narrow, 35.4 to 34.8 for **Google: Gemini 3.5 Flash**, but the reasons are not mysterious. Gemini takes three of four tasks, and in each of those wins the edge comes from cleaner compliance with the brief rather than flashy ambition. That matters in real production use, where small formatting misses and slightly loose pattern matching are exactly how good outputs become annoying ones.

The most convincing Gemini win is probably `messy-orders-to-json`. Both models normalized and sorted the data correctly, but Claude wrapped the answer in a Markdown code fence despite an explicit JSON-only instruction. That is a straight instruction-following error, and Gemini didn’t make it. On `release-delay-note`, the gap is subtler but still real: both were accurate, yet Gemini better matched the requested calm, professional Slack tone by staying concise, accountable, and free of unnecessary stylistic garnish.

The `python-log-redactor` result reinforces the same pattern. Both solutions were solid, standard-library-only, and correctly avoided clobbering version numbers like `v2.14.7`. Gemini’s advantage came from a better Bearer-token regex: using word boundaries lowers the risk of partially redacting a longer alphanumeric string. Claude’s solution works, but it is a little sloppier at the edges, and edge cases are where redaction tools live or die.

Claude’s best showing came in `meeting-notes-summary`, and it was a legitimate win. It captured both key risks, included the extra owner/deadline detail, and produced a richer `numbers_mentioned` field. Gemini was mostly right, but it missed the Safari/Klarna risk in the JSON and lost some nuance around draft versus final timing. That result shows Claude still has the stronger instinct for fuller synthesis when the task rewards completeness over strict brevity.

**Final call: Google: Gemini 3.5 Flash wins because it was more reliable where users actually feel reliability: format compliance, tone control, and tighter handling of implementation details. Claude Opus 4.8 produced the best single summary, but Gemini was the better editor’s choice across the full slate.**

python-log-redactor

Practical coding: Python. Return code only.

Write a function `redact_log(line: str) -> str` that sanitizes a single application log line before it is shipped to a vendor.

Replace:
- any email address with `[EMAIL]`
- any IPv4 address with `[IP]`
- any bearer token of the form `Bearer <token>` with `Bearer [TOKEN]` where `<token>` is 20+ characters of letters, digits, `_` or `-`

Requirements:
- preserve all other text exactly
- do not redact version numbers like `v2.14.7`
- do not redact invalid IP-like text such as `999.12.3.4`
- use only the Python standard library

Example input line: `2026-04-09T07:14:55Z WARN user=mina.cho@northlane.io ip=10.44.18.201 auth='Bearer sk_live_9AbCdEfGhIjKlMnOpQrStUv' path=/api/v2.14.7/status`

Expected output for that example: `2026-04-09T07:14:55Z WARN user=[EMAIL] ip=[IP] auth='Bearer [TOKEN]' path=/api/v2.14.7/status`

Anthropic: Claude Opus 4.8:

```python
import re


def redact_log(line: str) -> str:
    line = re.sub(
        r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}',
        '[EMAIL]',
        line,
    )
    line = re.sub(
        r'Bearer\s+[A-Za-z0-9_-]{20,}',
        'Bearer [TOKEN]',
        line,
    )
    line = re.sub(
        r'(?<!\d)(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?!\d)',
        '[IP]',
        line,
    )
    return line
```

Google: Gemini 3.5 Flash:

```python
import re


def redact_log(line: str) -> str:
    # 1. Redact Bearer tokens (Bearer followed by 20+ chars of letters, digits, _, or -)
    line = re.sub(r'\bBearer\s+[a-zA-Z0-9_-]{20,}\b', 'Bearer [TOKEN]', line)

    # 2. Redact Emails
    line = re.sub(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', '[EMAIL]', line)

    # 3. Redact IPv4 Addresses (valid octets 0-255)
    octet = r'(?:25[0-5]|2[0-4]\d|[01]?\d?\d)'
    ip_pattern = fr'\b{octet}\.{octet}\.{octet}\.{octet}\b'
    line = re.sub(ip_pattern, '[IP]', line)

    return line
```

Both solutions are standard-library-only and correctly handle valid IPv4s without redacting version numbers like `v2.14.7`. Model B (Gemini 3.5 Flash) is slightly better because its Bearer-token regex uses word boundaries, reducing the chance of partially redacting a longer alphanumeric string, while Model A (Claude Opus 4.8) could match just the first 20+ valid characters of a longer token-like sequence.
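For readers who want to reproduce the baseline, here is a minimal sketch that runs both submissions against the example line and the two negative cases named in the task. The wrapper names `redact_claude` and `redact_gemini`, the `check` helper, and the two `UNCHANGED` sample lines are assumptions made for this harness; only the regex patterns are copied from the answers. It covers just the cases the prompt spells out and says nothing about edge cases beyond them.

```python
import re


# Hypothetical wrapper names; the patterns are copied from the two submissions.
def redact_claude(line: str) -> str:
    line = re.sub(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', '[EMAIL]', line)
    line = re.sub(r'Bearer\s+[A-Za-z0-9_-]{20,}', 'Bearer [TOKEN]', line)
    line = re.sub(
        r'(?<!\d)(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}'
        r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?!\d)',
        '[IP]',
        line,
    )
    return line


def redact_gemini(line: str) -> str:
    line = re.sub(r'\bBearer\s+[a-zA-Z0-9_-]{20,}\b', 'Bearer [TOKEN]', line)
    line = re.sub(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', '[EMAIL]', line)
    octet = r'(?:25[0-5]|2[0-4]\d|[01]?\d?\d)'
    line = re.sub(fr'\b{octet}\.{octet}\.{octet}\.{octet}\b', '[IP]', line)
    return line


# Example input and expected output, taken from the task prompt.
EXAMPLE_IN = (
    "2026-04-09T07:14:55Z WARN user=mina.cho@northlane.io ip=10.44.18.201 "
    "auth='Bearer sk_live_9AbCdEfGhIjKlMnOpQrStUv' path=/api/v2.14.7/status"
)
EXAMPLE_OUT = (
    "2026-04-09T07:14:55Z WARN user=[EMAIL] ip=[IP] "
    "auth='Bearer [TOKEN]' path=/api/v2.14.7/status"
)

# Assumed sample lines that must pass through unchanged per the requirements.
UNCHANGED = ["build v2.14.7 deployed", "rejected ip=999.12.3.4"]


def check(redact) -> bool:
    """True if `redact` reproduces the example and leaves the negatives alone."""
    if redact(EXAMPLE_IN) != EXAMPLE_OUT:
        return False
    return all(redact(sample) == sample for sample in UNCHANGED)


if __name__ == "__main__":
    for name, fn in [("claude", redact_claude), ("gemini", redact_gemini)]:
        print(name, check(fn))
```

Both submissions pass these checks, which is consistent with the verdict: the difference between them lies in boundary handling, not in the stated requirements.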

release-delay-note

Professional writing. Draft a Slack update to the company-wide `#ops-and-sales` channel. Audience: account managers and support leads. Tone: calm, accountable, no blame. Length: 110-140 words. Situation: This morning's rollout of the Atlas invoicing export is delayed. During final checks, the team found duplicate rows appearing only for customers with archived cost centers. No customer data was lost or exposed. We have paused the rollout, disabled the new export toggle for all workspaces, and expect another update by 3:30 PM Eastern. Ask customer-facing teams not to promise a launch time yet, and route urgent client asks to Priya Nadeem in Revenue Systems.

Anthropic: Claude Opus 4.8: **📋 Atlas Invoicing Export — Rollout Paused** Hi all — quick update on this morning's Atlas invoicing export. During final checks, we found duplicate rows appearing only for customers with archived cost centers. No customer data was lost or exposed. As a precaution, we've paused the rollout and disabled the new export toggle across all workspaces while we resolve it. We expect another update by **3:30 PM Eastern**. In the meantime, two asks for customer-facing teams: 1. Please **don't promise a launch time** yet — we'll confirm one once we're confident in the fix. 2. Route any **urgent client questions to Priya Nadeem** in Revenue Systems. Thanks for your patience and for keeping messaging consistent while we work through this. More soon. 🙏
Google: Gemini 3.5 Flash: Hi team, We have paused this morning’s scheduled rollout of the Atlas invoicing export. During our final checks, we identified an issue causing duplicate rows to appear exclusively for customers with archived cost centers. To be absolutely clear: no customer data was lost or exposed. As we work on a solution, we have disabled the new export toggle across all customer workspaces. We expect to share our next status update by **3:30 PM Eastern** today. For now, please avoid promising a specific launch window to your clients. If you receive any urgent inquiries, please route them directly to **Priya Nadeem** in Revenue Systems. Thank you for your patience and partnership as we resolve this.

Both are strong and accurate, but B (Gemini 3.5 Flash) is slightly better aligned to the requested calm, professional Slack update: it is more concise, avoids extra formatting and emojis, and maintains an accountable, no-blame tone while covering every required detail. A (Claude Opus 4.8) is also good, but the headline, numbered list, and emoji make it feel a bit more stylized than necessary.

meeting-notes-summary

Summarization & extraction. Read the meeting notes below, then produce:
1) a 2-sentence summary
2) a JSON object with keys `decision`, `owner`, `deadline`, `risks`, and `numbers_mentioned`

Source notes:

"Checkout reliability sync - 11 May
- Jae: error rate fell from 3.8% to 1.1% after disabling the coupon recompute step.
- Marta: mobile Safari still times out on the payment page for some Klarna users; 17 complaints since Friday.
- Decision: keep the recompute step off through the Cedar & Finch campaign.
- Owner for root-cause writeup: Marta.
- Deadline: draft by Tuesday 14:00 UTC, final by Wednesday noon.
- Risk: finance team says coupon totals may be off by up to $0.27 on roughly 2% of orders.
- Jae will add a banner to the internal status page by 09:30 UTC.
- No evidence of fraud or duplicate charges."

Return only the summary and the JSON.

Anthropic: Claude Opus 4.8:

**Summary:** Disabling the coupon recompute step reduced the checkout error rate from 3.8% to 1.1%, so the team decided to keep it off through the Cedar & Finch campaign. Marta will lead the root-cause writeup despite an ongoing mobile Safari/Klarna timeout issue and a minor coupon-total discrepancy flagged by finance.

```json
{
  "decision": "Keep the coupon recompute step disabled through the Cedar & Finch campaign.",
  "owner": "Marta (root-cause writeup); Jae (internal status page banner)",
  "deadline": "Root-cause draft by Tuesday 14:00 UTC, final by Wednesday noon; status page banner by 09:30 UTC",
  "risks": [
    "Coupon totals may be off by up to $0.27 on roughly 2% of orders (per finance)",
    "Mobile Safari payment page times out for some Klarna users (17 complaints since Friday)"
  ],
  "numbers_mentioned": [
    "Error rate fell from 3.8% to 1.1%",
    "17 complaints since Friday",
    "Coupon totals off by up to $0.27",
    "Affects roughly 2% of orders",
    "Draft deadline Tuesday 14:00 UTC",
    "Banner by 09:30 UTC"
  ]
}
```

Google: Gemini 3.5 Flash:

During the checkout reliability sync, the team decided to keep the coupon recompute step disabled through the Cedar & Finch campaign to maintain a lower error rate of 1.1%. Marta was tasked with investigating ongoing mobile Safari timeouts for Klarna users and delivering a root-cause writeup by Wednesday noon.

```json
{
  "decision": "Keep the recompute step off through the Cedar & Finch campaign.",
  "owner": "Marta",
  "deadline": "Draft by Tuesday 14:00 UTC, final by Wednesday noon",
  "risks": "Coupon totals may be off by up to $0.27 on roughly 2% of orders.",
  "numbers_mentioned": ["11", "3.8%", "1.1%", "17", "14:00", "$0.27", "2%", "09:30"]
}
```

A (Claude Opus 4.8) is more complete and accurate: it captures both key risks, includes the additional owner/deadline detail from the notes, and provides a richer `numbers_mentioned` field. B (Gemini 3.5 Flash) is mostly correct but omits the Safari/Klarna risk from the JSON, misses the draft/final nuance in the summary, and its numbers list is less informative.
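The completeness gap in the `risks` field can be checked mechanically. The sketch below is an illustration, not the scoring method used for this matchup: the `audit` helper and the keyword list `EXPECTED_RISK_TERMS` are assumptions, with one keyword standing in for each risk raised in the source notes. The two dictionaries are copied from the JSON in the answers.

```python
REQUIRED_KEYS = {"decision", "owner", "deadline", "risks", "numbers_mentioned"}

# Assumption: one lowercase keyword per risk in the notes is a good-enough proxy.
EXPECTED_RISK_TERMS = ["coupon", "safari"]


def audit(obj: dict) -> tuple:
    """Return (missing required keys, risk keywords absent from `risks`)."""
    missing_keys = sorted(REQUIRED_KEYS - obj.keys())
    risks = obj.get("risks", [])
    # `risks` may be a list of strings or a single string; flatten to one text.
    text = " ".join(risks) if isinstance(risks, list) else str(risks)
    text = text.lower()
    missing_risks = [term for term in EXPECTED_RISK_TERMS if term not in text]
    return missing_keys, missing_risks


answer_claude = {
    "decision": "Keep the coupon recompute step disabled through the Cedar & Finch campaign.",
    "owner": "Marta (root-cause writeup); Jae (internal status page banner)",
    "deadline": "Root-cause draft by Tuesday 14:00 UTC, final by Wednesday noon; "
                "status page banner by 09:30 UTC",
    "risks": [
        "Coupon totals may be off by up to $0.27 on roughly 2% of orders (per finance)",
        "Mobile Safari payment page times out for some Klarna users (17 complaints since Friday)",
    ],
    "numbers_mentioned": [
        "Error rate fell from 3.8% to 1.1%",
        "17 complaints since Friday",
        "Coupon totals off by up to $0.27",
        "Affects roughly 2% of orders",
        "Draft deadline Tuesday 14:00 UTC",
        "Banner by 09:30 UTC",
    ],
}

answer_gemini = {
    "decision": "Keep the recompute step off through the Cedar & Finch campaign.",
    "owner": "Marta",
    "deadline": "Draft by Tuesday 14:00 UTC, final by Wednesday noon",
    "risks": "Coupon totals may be off by up to $0.27 on roughly 2% of orders.",
    "numbers_mentioned": ["11", "3.8%", "1.1%", "17", "14:00", "$0.27", "2%", "09:30"],
}

if __name__ == "__main__":
    print("claude:", audit(answer_claude))
    print("gemini:", audit(answer_gemini))
```

Under these assumptions both answers carry all five required keys, and only the Gemini answer is flagged for the missing `safari` term. A keyword match is a crude proxy, so it supports the verdict but does not replace a human read.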

messy-orders-to-json

Data wrangling / structured output. Convert the messy order lines below into valid JSON: an array of objects sorted by `order_id` ascending. Each object must have exactly these fields: `order_id` (string), `customer` (string in title case), `sku` (uppercase string), `qty` (integer), `unit_price` (number), `rush` (boolean), `ship_date` (string YYYY-MM-DD or null).

Rules:
- trim extra spaces
- title case customer names
- uppercase SKU
- parse `rush` yes/no/Y/N into true/false
- if ship date is `TBD` or blank, use null
- preserve numeric values exactly as decimals, not strings

Messy data:

`ord-204 | nora feld | qx-9 | 3 | 19.95 | yes | 2026/07/03`
`ord-198|ELI TRAN|ab-11|1|250|N|TBD`
`ord-203 |mara iTO| lk-2 | 12 | 4.5 | y | 2026-07-01`
`ord-201| jules park |zx-77|2| 88.00|no|`

Return JSON only.

Anthropic: Claude Opus 4.8:

```json
[
  {
    "order_id": "ord-198",
    "customer": "Eli Tran",
    "sku": "AB-11",
    "qty": 1,
    "unit_price": 250,
    "rush": false,
    "ship_date": null
  },
  {
    "order_id": "ord-201",
    "customer": "Jules Park",
    "sku": "ZX-77",
    "qty": 2,
    "unit_price": 88.00,
    "rush": false,
    "ship_date": null
  },
  {
    "order_id": "ord-203",
    "customer": "Mara Ito",
    "sku": "LK-2",
    "qty": 12,
    "unit_price": 4.5,
    "rush": true,
    "ship_date": "2026-07-01"
  },
  {
    "order_id": "ord-204",
    "customer": "Nora Feld",
    "sku": "QX-9",
    "qty": 3,
    "unit_price": 19.95,
    "rush": true,
    "ship_date": "2026-07-03"
  }
]
```
Google: Gemini 3.5 Flash: [ { "order_id": "ord-198", "customer": "Eli Tran", "sku": "AB-11", "qty": 1, "unit_price": 250, "rush": false, "ship_date": null }, { "order_id": "ord-201", "customer": "Jules Park", "sku": "ZX-77", "qty": 2, "unit_price": 88.00, "rush": false, "ship_date": null }, { "order_id": "ord-203", "customer": "Mara Ito", "sku": "LK-2", "qty": 12, "unit_price": 4.5, "rush": true, "ship_date": "2026-07-01" }, { "order_id": "ord-204", "customer": "Nora Feld", "sku": "QX-9", "qty": 3, "unit_price": 19.95, "rush": true, "ship_date": "2026-07-03" } ]

Both outputs correctly normalize and sort the data, but Model A (Claude Opus 4.8) violates the instruction to return JSON only by wrapping the array in a Markdown code fence. Model B (Gemini 3.5 Flash) provides valid raw JSON and is therefore better.
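The practical cost of the code fence is easy to show: a consumer that passes the response straight to a JSON parser gets an error. The sketch below assumes a hypothetical helper, `is_raw_json`, and uses a shortened one-object payload instead of the full four-order array.

```python
import json


def is_raw_json(text: str) -> bool:
    """True if the entire response parses as JSON with no wrapper text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Shortened stand-in payload (assumption), not the full order list.
raw = '[{"order_id": "ord-198", "qty": 1, "ship_date": null}]'

# Build the Markdown fence programmatically to keep this example readable.
FENCE = "`" * 3
fenced = f"{FENCE}json\n{raw}\n{FENCE}"

if __name__ == "__main__":
    print("raw:", is_raw_json(raw))
    print("fenced:", is_raw_json(fenced))
```

The raw payload parses and the fenced one does not, because the parser hits the backticks before any JSON value. A pipeline could strip fences defensively, but under a JSON-only brief the response should not need it.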

Matchup powered by OpenRouter.