DeepSeek V4 Pro beats GPT-5.5 Pro on precision
DeepSeek: DeepSeek V4 Pro vs OpenAI: GPT-5.5 Pro
DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.
DeepSeek V4 Pro takes this matchup 38.0 to 33.0, and the margin feels earned. Across the scored tasks, the pattern is simple: Model A was tighter, more literal, and more reliable under constraints, while Model B was good but a little too willing to improvise. The clearest technical win came in **python-log-redactor**. DeepSeek handled overlapping patterns the right way: one regex, one replacer, correct priority, no dropped matches. GPT-5.5 Pro split the work across separate regexes, which opens the door to ordering bugs, and its email pattern had small but real flaws around boundaries and over-matching. That is the difference between code that merely looks plausible and code you would actually trust. DeepSeek also won the instruction-following tasks by not getting cute. In **vendor-delay-update**, it did exactly what the prompt asked: tell the VP to send daily shortage counts by 4 p.m. local time, in a calm and accountable tone, without bolting on extra process. GPT-5.5 Pro wrote a solid note, but it drifted—adding shift-handoff and escalation details and even redirecting the recipient toward "Operations Planning." In **meeting-notes-summary**, the gap was even cleaner: DeepSeek matched the schema exactly, while GPT-5.5 Pro broke it with conditional text in `launch_date` and an array for `blocked_by` where a single value was required. The only draw was **messy-orders-to-json**, where both models did the unglamorous work correctly: valid JSON, preserved order, correct schema, normalized fields. But a tie on the easy cleanup task does not erase the misses on precision work. **Final call: DeepSeek V4 Pro is the better model here.** It was more disciplined, more exact, and more dependable on the tasks where small deviations turn into real failures.
python-log-redactor
Language: Python 3. Write code only. Implement a function `redact_log(line: str) -> str` for an internal support tool. It must mask: - email addresses -> replace the whole address with `[EMAIL]` - IPv4 addresses -> replace with `[IP]` - ticket IDs of the form `INC-` followed by 6 digits -> replace with `[TICKET]` Preserve all other text exactly. Do not mask invalid IPs like `999.1.2.3`. Assume no multiline input. Include any imports needed and nothing else besides the code.
Model A correctly handles overlapping patterns with a single regex and replacer function, ensuring proper replacement priority and no missed matches. Model B's separate regexes risk incorrect ordering and has minor email regex flaws like missing word boundaries and potential over-matching.
vendor-delay-update
Draft a workplace status update for the VP of Operations to send to regional warehouse managers. Situation: our barcode scanner vendor, North Quay Devices, delayed shipment of 420 replacement units from May 12 to May 19 because of a failed battery certification batch. We have enough spare scanners to cover only the Memphis and Reno sites; Tulsa and Allentown will need to share devices for one week. Ask managers to pause nonessential inventory recounts, prioritize outbound picking, and send daily shortage counts by 4 p.m. local time. Tone: calm, accountable, and practical. Length: 140–180 words.
Model A better adheres to the prompt by directly specifying 'send daily shortage counts by 4 p.m. local time' to the VP without adding unprompted details like shift handoffs or escalation instructions, while maintaining a perfectly calm, accountable, and practical tone. Model B introduces minor extras and shifts the recipient to 'Operations Planning,' slightly deviating from instructions, though both are high-quality and within word limits.
meeting-notes-summary
Read the meeting notes below, then provide: 1) a 2-sentence summary 2) a JSON object with keys `launch_date`, `owner`, `blocked_by`, `open_questions` (array), and `decisions` (array) Meeting notes: - Project: Cedar Lane tenant portal refresh - Maya said legal approved the new lease-upload wording after changing “instant approval” to “faster review.” - Andre confirmed the frontend is done except for the maintenance banner behavior on iPad Mini. - Priya wants launch on 2026-03-18, but only if payment autofill passes final QA by the 14th. - Blocker: finance sandbox is still returning duplicate receipt IDs for ACH retries. - Decision: remove dark mode from this release and revisit in Q3. - Decision: keep SMS login, but make email login the default option. - Open question: should users be able to delete stored bank accounts without calling support? - Open question: do we localize the late-fee explainer for Quebec French now or after launch? - Owner for launch checklist: Priya.
A follows the requested schema exactly and provides a clear 2-sentence summary plus correctly typed JSON fields. B’s summary is good, but its JSON does not adhere to the specified structure: `launch_date` includes extra conditional text and `blocked_by` is an array instead of a single value.
messy-orders-to-json
Convert the messy order lines below into valid JSON as an array of objects. Use exactly this schema for each object and preserve input order: `{"order_id": string, "customer": string, "items": [{"sku": string, "qty": integer}], "priority": boolean, "ship_by": string|null}` Rules: - Normalize `priority` to true/false. - Normalize missing ship date words like `none`, `tbd`, `-` to null. - Trim spaces around values. - `items` are separated by `;` and each item is `SKU xQTY`. Data: Order=QX-1042 | customer: Larkspur Clinic | items: AB-9 x2; TT-41 x1 | priority=YES | ship_by=2026-07-02 Order=QX-1043|customer: Mira & Son Catering|items: P-88 x12 |priority=no|ship_by=none Order = QX-1044 | customer: Oak Route Studio | items: ZK-2 x3; MN-7 x5; MN-7 x1 | priority = true | ship_by = TBD Order=QX-1045 | customer: Heliotrope Labs | items: R1 x1 | priority = false | ship_by = 2026-07-05
Both outputs are valid JSON, preserve input order, match the required schema exactly, and correctly normalize priority and ship_by values. There are no substantive differences in quality or correctness between them.
Matchup powered by OpenRouter.