Head to head: grok-4.3 vs gpt-oss-120b

This matchup turns on a familiar distinction: both models are competent, but one is more reliable when the prompt punishes invention. grok-4.3 wins by being the steadier finisher across extraction and code tasks, while gpt-oss-120b’s best showing comes in polished business writing.

The aggregate score says it plainly: grok-4.3 takes this one, 37.8 to 34.4, and the task breakdown backs that up. The two coding-style prompts ended in ties, so neither model found easy points there, and the deciding factor was who made fewer unforced errors when the instructions got specific.

On **python-log-window-bugfix**, there was essentially no daylight. Both models fixed the boundary condition the right way: a lower-bound binary search for the first timestamp `>= now - window_seconds`, an exclusive upper bound, and correct handling of empty input and duplicates. gpt-oss-120b added more documentation, but functionally this was a dead-even result. **messy-orders-to-json** was the same story: both cleaned and normalized the rows correctly, skipped invalid records, and produced valid output. Formatting differences don’t count as substance.

The split came in the language tasks. **gpt-oss-120b** won **vendor-delay-status-email** because it wrote the tighter executive email: concise, polished, and grounded in the supplied facts. grok-4.3’s version was still good, but it inserted a follow-up timing that wasn’t in the prompt. That’s a small miss, but in this kind of task, unnecessary specificity is still drift.

Where grok-4.3 earned the overall verdict was **meeting-notes-summary-extract**. It stuck to the requested structure and kept the action items faithful to the source notes. gpt-oss-120b blinked here by adding unsupported due dates and muddying the causality around the retry setting. That kind of error matters more than stylistic polish, because it turns summarization into light fabrication.

**Final call: grok-4.3 wins because it was more trustworthy where precision mattered most.** gpt-oss-120b is the nicer business writer in this pairing, but grok-4.3 was the better editor of facts, and that’s what decided the match.

python-log-window-bugfix

Language: Python 3. Fix this function so it returns the count of events that occurred within the last `window_seconds` of `now`. The input `timestamps` is a list of integer Unix seconds in ascending order. It should run in O(log n), handle duplicates, and count events where `now - ts <= window_seconds`. Return code only.

```python
def count_recent_events(timestamps, now, window_seconds):
    # timestamps sorted ascending
    lo, hi = 0, len(timestamps) - 1
    cutoff = now - window_seconds
    while lo < hi:
        mid = (lo + hi) // 2
        if timestamps[mid] <= cutoff:
            lo = mid + 1
        else:
            hi = mid
    return len(timestamps) - lo
```

grok-4.3:

```python
def count_recent_events(timestamps, now, window_seconds):
    lo, hi = 0, len(timestamps)
    cutoff = now - window_seconds
    while lo < hi:
        mid = (lo + hi) // 2
        if timestamps[mid] < cutoff:
            lo = mid + 1
        else:
            hi = mid
    return len(timestamps) - lo
```

gpt-oss-120b:

```python
def count_recent_events(timestamps, now, window_seconds):
    """
    Count timestamps within the last `window_seconds` seconds of `now`.

    Parameters
    ----------
    timestamps : List[int]
        Sorted list of Unix epoch seconds (ascending).
    now : int
        Current Unix epoch time.
    window_seconds : int
        Length of the time window.

    Returns
    -------
    int
        Number of timestamps `ts` such that now - ts <= window_seconds
        (i.e., ts >= now - window_seconds).
    """
    # lower bound index of the first timestamp >= cutoff
    cutoff = now - window_seconds
    lo, hi = 0, len(timestamps)  # hi is exclusive
    while lo < hi:
        mid = (lo + hi) // 2
        if timestamps[mid] < cutoff:  # strictly less → not recent
            lo = mid + 1
        else:
            hi = mid
    # lo is now the first index with timestamp >= cutoff (or len(timestamps))
    return len(timestamps) - lo
```

Both outputs correctly fix the boundary bug by finding the first timestamp `>= now - window_seconds` using an O(log n) lower-bound binary search with an exclusive upper bound, and both handle empty lists and duplicates correctly. gpt-oss-120b adds documentation, but functionally the two are equivalent and both satisfy the prompt.
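The lower-bound search that both models wrote by hand matches what the standard library's `bisect.bisect_left` does. The sketch below is an illustration rather than either model's answer: it assumes replacing the manual loop with that call is acceptable, and the names `count_recent_events_bisect` and `count_recent_events_linear` are hypothetical. The linear version exists only to cross-check the binary search.

```python
from bisect import bisect_left


def count_recent_events_bisect(timestamps, now, window_seconds):
    """Count events with now - ts <= window_seconds in O(log n).

    Assumes `timestamps` is sorted ascending, as the prompt states.
    """
    cutoff = now - window_seconds
    # bisect_left returns the first index whose value is >= cutoff,
    # so every duplicate equal to the cutoff is counted.
    first_recent = bisect_left(timestamps, cutoff)
    return len(timestamps) - first_recent


def count_recent_events_linear(timestamps, now, window_seconds):
    """O(n) reference, used only to cross-check the binary search."""
    return sum(1 for ts in timestamps if now - ts <= window_seconds)
```

A case such as `[90, 95]` with `now=100` and `window_seconds=10` exposes the original bug: the prompt's code returns 1 because it treats a timestamp equal to the cutoff as too old, whereas both events satisfy `now - ts <= window_seconds`.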

vendor-delay-status-email

Write a workplace status email to our COO, Marta Ilyan, updating her on a vendor delay. Context: our badge printer supplier, Northline ID, just moved shipment of 420 visitor badges from May 14 to May 21 because a laminate roll failed QC. We have enough stock for the June 3 client summit but not for the June 18 hiring fair unless the new date holds. We already asked Facilities to ration badge use and requested expedited freight at no charge. Tone: calm, concise, accountable. Length: 140-190 words.

grok-4.3:

Subject: Update on Northline ID Badge Order Delay

Hi Marta,

I'm writing to update you on a recent development with our badge printer supplier, Northline ID. They notified us today that the shipment of 420 visitor badges, originally scheduled for May 14, has been rescheduled to May 21. This change stems from a laminate roll that did not pass their quality control checks.

Fortunately, our current inventory should be adequate for the June 3 client summit. That said, we won't have enough badges on hand for the June 18 hiring fair unless this new delivery date is met without further issues.

In response, I've coordinated with Facilities to implement a badge rationing protocol to extend our existing supply. We've also contacted Northline to request expedited shipping at no cost to us, which they are reviewing.

I'll keep a close eye on the production timeline and provide another update by the end of the week. In the meantime, I'm happy to discuss contingency plans if needed.

Thanks,
[Your Name]

gpt-oss-120b:

**Subject:** Update on Visitor Badge Supply – Northline ID Delay

Hi Marta,

I wanted to brief you on the recent delay from our badge‑printer supplier, Northline ID. The original shipment of 420 visitor badges, scheduled for delivery on May 14, has been pushed to May 21 after a laminate roll failed quality control.

We have sufficient inventory to cover the June 3 client summit, but the revised arrival date leaves a shortfall for the June 18 hiring fair if any further delays occur.

We have taken two immediate actions: Facilities has been instructed to ration badge usage for non‑critical events, and we have requested Northline ID to provide expedited freight at no additional charge to meet the May 21 deadline.

I am monitoring the shipment closely and will keep you posted on any changes. Please let me know if you need any additional information or would like to discuss alternate contingency plans.

Thank you,
[Your Name]
[Title]

Both are strong and within the requested tone and length, but gpt-oss-120b is slightly more concise and polished while clearly covering all required facts and actions. grok-4.3 adds a specific follow-up timing not provided in the prompt, which is reasonable but slightly less tightly grounded in the source context.

meeting-notes-summary-extract

Summarize the meeting notes below in 3 bullet points, then extract the action items as JSON with this schema: {"actions":[{"owner":"...","task":"...","due":"..."}]}

Notes:
- Checkout latency spiked after Tuesday's deploy; p95 went from 780ms to 1.9s between 14:10 and 14:42.
- Priya traced most of it to the new tax-service retry setting: 4 retries with a 600ms timeout.
- Mateo rolled traffic back at 14:37; latency normalized by 14:42.
- Eight orders were duplicated when customers refreshed during the slowdown. Finance estimates exposure at $612.44 before refunds.
- Lena will contact affected customers with an apology and confirm refund timing.
- Priya to propose safer retry limits and a circuit breaker.
- Mateo wants an alert if checkout p95 exceeds 1.2s for 5 minutes.
- Next check-in: Friday 09:30.

grok-4.3:

- Checkout latency spiked from 780 ms to 1.9 s p95 after Tuesday’s deploy because of the tax-service retry settings (4 retries / 600 ms); Mateo rolled traffic back at 14:37 and latency recovered by 14:42.
- Eight orders were duplicated during the slowdown, creating an estimated $612.44 exposure before refunds.
- Lena will contact affected customers; Priya will propose safer retry limits plus a circuit breaker; Mateo requested a p95 > 1.2 s alert; next check-in is Friday 09:30.

```json
{"actions":[{"owner":"Lena","task":"contact affected customers with an apology and confirm refund timing","due":""},{"owner":"Priya","task":"propose safer retry limits and a circuit breaker","due":""},{"owner":"Mateo","task":"set up alert if checkout p95 exceeds 1.2s for 5 minutes","due":""}]}
```

gpt-oss-120b:

**Summary (3 bullet points)**

- A post‑deploy issue on Tuesday caused checkout p95 latency to jump from 780 ms to 1.9 s, which was corrected after Mateo rolled back traffic at 14:37 and latency normalized by 14:42.
- The spike triggered four retries with a 600 ms timeout in the new tax‑service, leading to eight duplicated orders and an estimated $612.44 exposure before refunds.
- Follow‑up actions include proposing safer retry limits/circuit breaker, setting up a latency alert, and contacting affected customers; the next check‑in is Friday at 09:30.

**Action items (JSON)**

```json
{
  "actions": [
    {
      "owner": "Lena",
      "task": "Contact affected customers with an apology and confirm refund timing",
      "due": "2026-06-21"
    },
    {
      "owner": "Priya",
      "task": "Propose safer retry limits and implement a circuit breaker for the tax service",
      "due": "2026-06-21"
    },
    {
      "owner": "Mateo",
      "task": "Create an alert that triggers if checkout p95 exceeds 1.2 s for 5 consecutive minutes",
      "due": "2026-06-21"
    }
  ]
}
```

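grok-4.3 follows the requested format closely and keeps the extracted actions faithful to the notes without inventing details. gpt-oss-120b adds due dates that the notes do not support and slightly misstates causality around the retry setting (the notes say the retry setting caused the spike, not that the spike triggered the retries), which hurts correctness and instruction adherence.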
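The "unsupported due dates" problem can be checked mechanically. The sketch below is a crude heuristic, not the method used to score this matchup: it assumes a `due` value counts as grounded only if its literal text appears in the source notes, and the function name `unsupported_due_values` is hypothetical.

```python
import json


def unsupported_due_values(actions_json, source_notes):
    """Return (owner, due) pairs whose non-empty `due` is absent from the notes.

    Assumption: a literal substring match is a good-enough grounding test.
    An empty `due` is treated as "not stated" and is never flagged.
    """
    actions = json.loads(actions_json)["actions"]
    return [
        (action["owner"], action["due"])
        for action in actions
        if action["due"] and action["due"] not in source_notes
    ]
```

Run against the outputs above, a check like this would leave grok-4.3's empty `due` fields alone and flag each `2026-06-21` value, since that date appears nowhere in the notes. It would not catch a value such as `Friday 09:30`, which is in the notes but as the check-in time rather than a deadline, so it narrows human review without replacing it.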

messy-orders-to-json

Convert the messy order lines below into valid JSON as an array of objects using exactly this schema and key order: [{"order_id":"string","customer":"string","sku":"string","qty":number,"unit_price":number,"rush":true|false}]. Rules: trim spaces, uppercase SKU, parse qty as integer, parse unit_price as a number with 2 decimals and no currency symbol, set rush=true only for y/yes/true (case-insensitive), and skip rows missing order_id or sku.

Data:
ord-901 | Cora Ng | ab-44 | 3 | $19.9 | Y
ord-902|M. Petrov| qx9 | 01 | 7.00 | no
| J. Alvarez | lm-2 | 5 | $12.00 | yes
ord-904 | T Weiss | zz-81 | two | $4.50 | n
ord-905 | Rina Holt | ak-7 | 8 | 11 | TRUE
ord-906 | Bao Tran | | 2 | $3.25 | yes

grok-4.3:

[
{"order_id":"ord-901","customer":"Cora Ng","sku":"AB-44","qty":3,"unit_price":19.90,"rush":true},
{"order_id":"ord-902","customer":"M. Petrov","sku":"QX9","qty":1,"unit_price":7.00,"rush":false},
{"order_id":"ord-905","customer":"Rina Holt","sku":"AK-7","qty":8,"unit_price":11.00,"rush":true}
]

gpt-oss-120b:

[
  {
    "order_id": "ord-901",
    "customer": "Cora Ng",
    "sku": "AB-44",
    "qty": 3,
    "unit_price": 19.90,
    "rush": true
  },
  {
    "order_id": "ord-902",
    "customer": "M. Petrov",
    "sku": "QX9",
    "qty": 1,
    "unit_price": 7.00,
    "rush": false
  },
  {
    "order_id": "ord-905",
    "customer": "Rina Holt",
    "sku": "AK-7",
    "qty": 8,
    "unit_price": 11.00,
    "rush": true
  }
]

Both outputs correctly trim fields, uppercase SKUs, parse quantities and prices properly, set rush flags correctly, and skip rows missing order_id or sku; they also omit the invalid qty row. The only difference is formatting/whitespace, which does not affect validity or adherence.
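The cleaning rules can also be written as code, which makes it easy to confirm which rows should survive. The sketch below is one possible reading of the prompt, not a reference implementation: the names `parse_orders` and `RAW_LINES` are hypothetical, and dropping rows with a non-numeric `qty` is an assumption that mirrors what both models did, since the prompt only requires skipping rows that lack `order_id` or `sku`.

```python
import re

RUSH_TRUE = {"y", "yes", "true"}

RAW_LINES = [
    "ord-901 | Cora Ng | ab-44 | 3 | $19.9 | Y",
    "ord-902|M. Petrov| qx9 | 01 | 7.00 | no",
    "| J. Alvarez | lm-2 | 5 | $12.00 | yes",
    "ord-904 | T Weiss | zz-81 | two | $4.50 | n",
    "ord-905 | Rina Holt | ak-7 | 8 | 11 | TRUE",
    "ord-906 | Bao Tran | | 2 | $3.25 | yes",
]


def parse_orders(lines):
    """Parse pipe-delimited order lines into dicts in the schema's key order."""
    orders = []
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 6:
            continue  # assumption: rows with the wrong field count are skipped
        order_id, customer, sku, qty, price, rush = fields
        if not order_id or not sku:
            continue  # prompt rule: skip rows missing order_id or sku
        if not re.fullmatch(r"\d+", qty):
            continue  # assumption: non-numeric qty such as "two" is skipped
        orders.append({
            "order_id": order_id,
            "customer": customer,
            "sku": sku.upper(),
            "qty": int(qty),
            # assumption: price is numeric once a leading "$" is removed
            "unit_price": round(float(price.lstrip("$")), 2),
            "rush": rush.lower() in RUSH_TRUE,
        })
    return orders


orders = parse_orders(RAW_LINES)
```

With the six input rows, this keeps `ord-901`, `ord-902`, and `ord-905`, the same three records both models returned. Serializing with Python's `json.dumps` would print `19.9` rather than `19.90`, so reproducing the two-decimal rendering would need custom formatting.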