Head to head: Anthropic: Claude Opus 4.8 vs Google: Gemini 3.5 Flash

This one is close on the aggregate, but the split tells a clear story: Gemini 3.5 Flash wins by being more disciplined about format and slightly sharper on practical instruction-following. Claude Opus 4.8 lands the strongest single extraction/summarization performance, yet gives away too much on avoidable execution det

By RuntimeWire · Published Jun 24, 2026, 1:11pm CT

Visual comparison of two AI models' performance (Mixed-media paper collage — torn newsprint, photographic cutouts of abstract data, tape and staples, slight scanner shadow)

The final margin is narrow — 35.4 to 34.8 for Google: Gemini 3.5 Flash — but the reasons are not mysterious. Gemini takes three of four tasks, and in each of those wins the edge comes from cleaner compliance with the brief rather than flashy ambition. That matters in real production use, where small formatting misses and slightly loose pattern matching are exactly how good outputs become annoying ones.

The most convincing Gemini win is probably messy-orders-to-json. Both models normalized and sorted the data correctly, but Claude wrapped the answer in a Markdown code fence despite an explicit JSON-only instruction. That is a straight instruction-following error, and Gemini didn’t make it. On release-delay-note, the gap is subtler but still real: both were accurate, yet Gemini better matched the requested calm, professional Slack tone by staying concise, accountable, and free of unnecessary stylistic garnish.

The python-log-redactor result reinforces the same pattern. Both solutions were solid, standard-library-only, and correctly avoided clobbering version numbers like v2.14.7. Gemini’s advantage came from a better Bearer-token regex: using word boundaries lowers the risk of partially redacting a longer alphanumeric string. Claude’s solution works, but it is a little sloppier at the edges, and edge cases are where redaction tools live or die.

Claude’s best showing came in meeting-notes-summary, and it was a legitimate win. It captured both key risks, included the extra owner/deadline detail, and produced a richer numbers_mentioned field. Gemini was mostly right, but it missed the Safari/Klarna risk in the JSON and lost some nuance around draft versus final timing. That result shows Claude still has the stronger instinct for fuller synthesis when the task rewards completeness over strict brevity.

Final call: Google: Gemini 3.5 Flash wins because it was more reliable where users actually feel reliability — format compliance, tone control, and tighter handling of implementation details. Claude Opus 4.8 produced the best single summary, but Gemini was the better editor’s choice across the full slate.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. Anthropic: Claude Opus 4.8 scored 34.8 to Google: Gemini 3.5 Flash's 35.4.

1. python-log-redactor

Practical coding — Python. Return code only. Write a function redact_log(line: str) -> str that sanitizes a single application log line before it is shipped to a vendor. Replace: - any email address with [EMAIL] - any IPv4 address with [IP] - any bearer token of the form Bearer <token> with Bearer [TOKEN] where <token> is 20+ characters of letters, digits, _ or - Requirements: - preserve all other text exactly - do not redact version numbers like v2.14.7 - do not redact invalid IP-like text such as 999.12.3.4 - use only the Python standard library Example input line: 2026-04-09T07:14:55Z WARN user=mina.cho@northlane.io ip=10.44.18.201 auth='Bearer sk_live_9AbCdEfGhIjKlMnOpQrStUv' path=/api/v2.14.7/status Expected output for that example: 2026-04-09T07:14:55Z WARN user=[EMAIL] ip=[IP] auth='Bearer [TOKEN]' path=/api/v2.14.7/status

Winner: Google: Gemini 3.5 Flash — Both solutions are standard-library-only and correctly handle valid IPv4s without redacting version numbers like v2.14.7. Model B is slightly better because its Bearer-token regex uses word boundaries, reducing the chance of partially redacting a longer alphanumeric string, while Model A could match just the first 20+ valid characters of a longer token-like sequence.

2. release-delay-note

Professional writing. Draft a Slack update to the company-wide #ops-and-sales channel. Audience: account managers and support leads. Tone: calm, accountable, no blame. Length: 110-140 words. Situation: This morning's rollout of the Atlas invoicing export is delayed. During final checks, the team found duplicate rows appearing only for customers with archived cost centers. No customer data was lost or exposed. We have paused the rollout, disabled the new export toggle for all workspaces, and expect another update by 3:30 PM Eastern. Ask customer-facing teams not to promise a launch time yet, and route urgent client asks to Priya Nadeem in Revenue Systems.

Winner: Google: Gemini 3.5 Flash — Both are strong and accurate, but B is slightly better aligned to the requested calm, professional Slack update: it is more concise, avoids extra formatting/emojis, and maintains an accountable, no-blame tone while covering every required detail. A is also good, but the headline, numbered list, and emoji make it feel a bit more stylized than necessary.

3. meeting-notes-summary

Summarization & extraction. Read the meeting notes below, then produce: 1) a 2-sentence summary 2) a JSON object with keys decision, owner, deadline, risks, and numbers_mentioned Source notes: "Checkout reliability sync — 11 May - Jae: error rate fell from 3.8% to 1.1% after disabling the coupon recompute step. - Marta: mobile Safari still times out on the payment page for some Klarna users; 17 complaints since Friday. - Decision: keep the recompute step off through the Cedar & Finch campaign. - Owner for root-cause writeup: Marta. - Deadline: draft by Tuesday 14:00 UTC, final by Wednesday noon. - Risk: finance team says coupon totals may be off by up to $0.27 on roughly 2% of orders. - Jae will add a banner to the internal status page by 09:30 UTC. - No evidence of fraud or duplicate charges." Return only the summary and the JSON.

Winner: Anthropic: Claude Opus 4.8 — A is more complete and accurate: it captures both key risks, includes the additional owner/deadline detail from the notes, and provides a richer numbers_mentioned field. B is mostly correct but omits the Safari/Klarna risk from the JSON, misses the draft/final nuance in the summary, and its numbers list is less informative.

4. messy-orders-to-json

Data wrangling / structured output. Convert the messy order lines below into valid JSON: an array of objects sorted by order_id ascending. Each object must have exactly these fields: order_id (string), customer (string in title case), sku (uppercase string), qty (integer), unit_price (number), rush (boolean), ship_date (string YYYY-MM-DD or null). Rules: - trim extra spaces - title case customer names - uppercase SKU - parse rush yes/no/Y/N into true/false - if ship date is TBD or blank, use null - preserve numeric values exactly as decimals, not strings Messy data: ord-204 | nora feld | qx-9 | 3 | 19.95 | yes | 2026/07/03 ord-198|ELI TRAN|ab-11|1|250|N|TBD ord-203 |mara iTO| lk-2 | 12 | 4.5 | y | 2026-07-01 ord-201| jules park |zx-77|2| 88.00|no| Return JSON only.

Winner: Google: Gemini 3.5 Flash — Both outputs correctly normalize and sort the data, but Model A violates the instruction to return JSON only by wrapping the array in a Markdown code fence. Model B provides valid raw JSON and is therefore better.

See every prompt and the full side-by-side outputs in the interactive Head-to-Head.

How they were tested

1. python-log-redactor

2. release-delay-note

3. meeting-notes-summary

4. messy-orders-to-json

Reader comments