grok-4.3 edges gpt-5.4 in a narrow, format-first fight

grok-4.3 vs gpt-5.4

grok-4.3 takes the head-to-head by a hair, but only because it was more disciplined on the tasks that punished sloppiness. gpt-5.4 won the hardest parsing task, yet it gave back too much on instruction-following and formatting.

The scoreline is close: **grok-4.3 wins 36.0 to 34.0**. This was a precision contest rather than a blowout, and grok-4.3 made fewer avoidable mistakes where the prompt’s constraints mattered most.

The split is clean. **gpt-5.4** took `python-redact-logs` by being more robust on regex boundaries and invalid IPv4 handling, which is the better engineering answer. **grok-4.3** answered on `status-update-delay` and `meeting-notes-summary`, and those wins were about compliance, not style points: calmer workplace tone, tighter formatting, and exact adherence to the requested output structure.

The `messy-orders-to-json` task was a tie in both correctness and quality. That matters because it rules out the idea that one model was consistently cleaner across structured-output work; on that task the two outputs were identical.

**Final call: grok-4.3 wins this matchup, but only because it was more reliable on instruction discipline. gpt-5.4 had the stronger technical edge on log redaction, yet it lost the head-to-head by being a little looser where format and tone were non-negotiable.**

python-redact-logs

Write a Python 3 function `redact_log(line: str) -> str` for an internal support tool. It must replace any email address with `[EMAIL]` and any IPv4 address with `[IP]`, while leaving timestamps, ports, and version numbers unchanged. Examples: `"user meera.cho@northbay.dev failed from 10.24.8.19 at 2025-04-12 09:14"` -> `"user [EMAIL] failed from [IP] at 2025-04-12 09:14"`; `"proxy 10.24.8.19:443 upgraded to v2.10.3"` -> `"proxy [IP]:443 upgraded to v2.10.3"`. Return code only.

grok-4.3:

```python
import re

def redact_log(line: str) -> str:
    line = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', line)
    line = re.sub(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b', '[IP]', line)
    return line
```

gpt-5.4:

```python
import re

_EMAIL_RE = re.compile(
    r'(?<![\w.+-])'
    r'[A-Za-z0-9._%+-]+'
    r'@'
    r'[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
    r'(?![\w.-])'
)

_IPV4_RE = re.compile(
    r'(?<![\d.])'
    r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.'
    r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.'
    r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.'
    r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
    r'(?![\d.])'
)

def redact_log(line: str) -> str:
    line = _EMAIL_RE.sub('[EMAIL]', line)
    line = _IPV4_RE.sub('[IP]', line)
    return line
```

Both solutions satisfy the core requirement, but gpt-5.4's is more robust: it uses a stricter IPv4 pattern that rejects out-of-range octets, and it guards both emails and IPs with lookaround-based boundaries. grok-4.3's is simpler and handles the given examples, but it also matches invalid IP-like strings such as `999.300.1.1`.
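To make the difference concrete, here is a minimal sketch that runs both IPv4 patterns over a few inputs. The names `LOOSE_IPV4`, `STRICT_IPV4`, and `redact_ip` are introduced here for illustration only, and the strict pattern is a condensed but equivalent form of the gpt-5.4 regex (the repeated octet group is factored out). The edge-case inputs are not from the benchmark; they are assumptions chosen to show where the two patterns diverge.

```python
import re

# IPv4 pattern as written in the grok-4.3 answer: any 1-3 digit groups.
LOOSE_IPV4 = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')

# Condensed equivalent of the gpt-5.4 pattern: each octet limited to 0-255,
# and no digit or dot allowed immediately before or after the match.
_OCTET = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
STRICT_IPV4 = re.compile(
    r'(?<![\d.])' + _OCTET + r'(?:\.' + _OCTET + r'){3}(?![\d.])'
)

def redact_ip(pattern: re.Pattern, text: str) -> str:
    """Replace every match of the given pattern with the [IP] placeholder."""
    return pattern.sub('[IP]', text)

# Hypothetical inputs: one from the prompt's examples, three edge cases.
SAMPLES = ["10.24.8.19:443", "v2.10.3", "999.300.1.1", "10.24.8.19.5"]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample, "|", redact_ip(LOOSE_IPV4, sample), "|", redact_ip(STRICT_IPV4, sample))
```

Both patterns agree on the prompt's own examples. They diverge on out-of-range octets and on five-part dotted strings, which the loose pattern redacts (fully or partially) and the strict pattern leaves alone. Whether such strings should be redacted in a support log is a policy question the prompt does not settle.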

status-update-delay

Draft a workplace status update for the VP of Operations. Context: the Phoenix Yard scanner rollout is slipping by 6 days because 14 of 60 devices arrived with dead batteries and the vendor cannot replace them before Tuesday. Field training in Mesa already happened, so we want to preserve confidence and propose a revised plan: start with the 46 working units on Wednesday, finish the remaining devices the following Monday, and keep the June 28 inventory freeze unchanged. Tone: calm, accountable, no blame. Length: 120-150 words.

grok-4.3: **Subject: Phoenix Yard Scanner Rollout – Revised Timeline** The Phoenix Yard scanner rollout is now expected to finish six days later than planned. Fourteen of the sixty devices arrived with dead batteries, and the vendor confirmed replacements will not arrive until Tuesday. Field training in Mesa was completed on schedule last week, which positions us well to begin work with the equipment already on site. We will install the 46 working units starting Wednesday and complete the remaining fourteen the following Monday after the replacements arrive. This phased approach keeps the June 28 inventory freeze date unchanged. I will provide daily progress updates and remain available to address any questions as we execute the revised plan.
gpt-5.4: Here’s a draft you can send to the VP of Operations: I wanted to provide a brief update on the Phoenix Yard scanner rollout. We’ve identified a six-day slip in the original deployment schedule after receiving 14 of the 60 devices with nonfunctional batteries. Replacement units will not be available from the vendor before Tuesday. To keep the rollout moving and preserve momentum from the Mesa field training, we recommend proceeding with the 46 fully operational devices on Wednesday as planned. We would then complete deployment of the remaining 14 units the following Monday once the replacement batteries arrive. This approach allows us to maintain field confidence, continue adoption with minimal interruption, and keep the June 28 inventory freeze unchanged. We are tracking the adjustment closely and will provide confirmation once the remaining devices are in hand and the rollout is fully completed.

grok-4.3 is slightly better aligned with the requested workplace status-update tone: calm, accountable, and concise, with no extra framing. gpt-5.4 is solid but opens with a conversational lead-in ("Here’s a draft you can send...") and uses somewhat more promotional language, making it slightly less direct.

meeting-notes-summary

Read these meeting notes and produce: (1) a 2-sentence summary, and (2) a JSON object with keys `decision`, `owner`, `due_date`, and `risks`. Notes:

- Checkout latency spiked after the Cart service 3.14 deploy on Thursday.
- Nia found the new coupon lookup calls `promo-cache` twice for guest users.
- Reverting only that flag dropped p95 from 1.9s to 1.1s.
- Marco wants a full rollback, but Priya said the release also contains the tax rounding fix requested by Finance.
- Decision: keep 3.14 in place, disable guest coupon lookup, and patch forward next sprint.
- Owner: Nia to open a hotfix PR today; Omar to monitor tonight's traffic.
- Risk: guest checkout won't show promotional savings until patch lands.
- Target date for patch: 2025-07-09.

grok-4.3: (1) Checkout latency spiked after the Cart service 3.14 deploy because the new coupon lookup began calling promo-cache twice for guest users. The team chose to retain the release with guest coupon lookup disabled, tasking Nia with a hotfix PR and Omar with traffic monitoring, targeting a patch on 2025-07-09 while accepting the risk that guests will not see promotional savings until then. (2)

```json
{
  "decision": "keep 3.14 in place, disable guest coupon lookup, and patch forward next sprint",
  "owner": "Nia to open a hotfix PR today; Omar to monitor tonight's traffic",
  "due_date": "2025-07-09",
  "risks": ["guest checkout won't show promotional savings until patch lands"]
}
```

gpt-5.4: **Summary:** Checkout latency increased after the Cart service 3.14 deploy because guest coupon lookup was calling `promo-cache` twice. The team decided to keep version 3.14, disable guest coupon lookup as a hotfix, and patch the issue forward by 2025-07-09 while monitoring traffic.

```json
{
  "decision": "Keep Cart service 3.14 in place, disable guest coupon lookup, and patch forward next sprint.",
  "owner": {
    "hotfix_pr": "Nia",
    "traffic_monitoring": "Omar"
  },
  "due_date": "2025-07-09",
  "risks": [
    "Guest checkout will not show promotional savings until the patch lands."
  ]
}
```

grok-4.3 follows the requested format more closely by providing exactly a 2-sentence summary and a JSON object with the required keys. gpt-5.4 is clear, but its `owner` field is structured as a nested object rather than the single value implied by the prompt, making it less compliant.
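The compliance point above can be expressed as a small check. This is only a sketch: the function name `check_summary_json` is hypothetical, and treating `owner` as a plain string is an assumption drawn from the verdict, since the prompt itself names the keys but does not state their types.

```python
import json
from typing import List

REQUIRED_KEYS = {"decision", "owner", "due_date", "risks"}

def check_summary_json(raw: str) -> List[str]:
    """Return a list of compliance problems; an empty list means the object passes."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    if set(obj) != REQUIRED_KEYS:
        problems.append("keys differ from the four requested keys")
    # Assumption: `owner` is expected to be a single string value.
    if not isinstance(obj.get("owner"), str):
        problems.append("owner is not a single string value")
    return problems

# JSON objects copied from the two answers above.
GROK_JSON = """{
  "decision": "keep 3.14 in place, disable guest coupon lookup, and patch forward next sprint",
  "owner": "Nia to open a hotfix PR today; Omar to monitor tonight's traffic",
  "due_date": "2025-07-09",
  "risks": ["guest checkout won't show promotional savings until patch lands"]
}"""

GPT_JSON = """{
  "decision": "Keep Cart service 3.14 in place, disable guest coupon lookup, and patch forward next sprint.",
  "owner": {"hotfix_pr": "Nia", "traffic_monitoring": "Omar"},
  "due_date": "2025-07-09",
  "risks": ["Guest checkout will not show promotional savings until the patch lands."]
}"""

if __name__ == "__main__":
    print("grok-4.3:", check_summary_json(GROK_JSON))
    print("gpt-5.4:", check_summary_json(GPT_JSON))
```

Under that assumption the grok-4.3 object passes and the gpt-5.4 object is flagged only for the `owner` field. A reader who considers a nested owner object acceptable would drop that check, and the two objects would then be equally compliant on keys.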

messy-orders-to-json

Convert the messy order lines below into valid JSON as an array of objects sorted by `order_id` ascending. Schema for each object: `order_id` (string), `customer` (string), `items` (array of strings), `priority` ("low"|"normal"|"high"), `rush` (boolean), `total_usd` (number with 2 decimals). Rules: trim spaces, normalize customer names to title case, split items on `|`, and interpret `rush` values `Y/yes/true` as true and `N/no/false` as false. Data:

ORD-9081 ; customer= lENA ortiz ; items=label rolls|thermal head ; priority=high ; rush=Y ; total=$184.5

ORD-9077;customer=omar dunn;items= usb-c dock | hdmi cable |ethernet adapter;priority=normal;rush=no;total=249

ORD-9080 ; customer = PRIYA sen ; items = nitrile gloves | face shields ; priority = low ; rush = true ; total = $73.00

grok-4.3: [ { "order_id": "ORD-9077", "customer": "Omar Dunn", "items": ["usb-c dock", "hdmi cable", "ethernet adapter"], "priority": "normal", "rush": false, "total_usd": 249.00 }, { "order_id": "ORD-9080", "customer": "Priya Sen", "items": ["nitrile gloves", "face shields"], "priority": "low", "rush": true, "total_usd": 73.00 }, { "order_id": "ORD-9081", "customer": "Lena Ortiz", "items": ["label rolls", "thermal head"], "priority": "high", "rush": true, "total_usd": 184.50 } ]
gpt-5.4: [ { "order_id": "ORD-9077", "customer": "Omar Dunn", "items": ["usb-c dock", "hdmi cable", "ethernet adapter"], "priority": "normal", "rush": false, "total_usd": 249.00 }, { "order_id": "ORD-9080", "customer": "Priya Sen", "items": ["nitrile gloves", "face shields"], "priority": "low", "rush": true, "total_usd": 73.00 }, { "order_id": "ORD-9081", "customer": "Lena Ortiz", "items": ["label rolls", "thermal head"], "priority": "high", "rush": true, "total_usd": 184.50 } ]

Both outputs are valid JSON, correctly sorted by order_id, and accurately apply the normalization and parsing rules. They are identical in content and quality.
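For reference, the normalization rules in the prompt can be sketched in a few lines of Python. This is an illustrative reading of the rules, not how either model produced its answer; `parse_order` and `convert_orders` are hypothetical names. One caveat: Python floats do not keep trailing zeros, so `json.dumps` would print `184.5` where both models wrote `184.50`; the values are numerically equal.

```python
import json

TRUE_VALUES = {"y", "yes", "true"}
FALSE_VALUES = {"n", "no", "false"}

def parse_order(line: str) -> dict:
    """Parse one 'ORD-x ; key=value ; ...' line into the target schema."""
    fields = [part.strip() for part in line.split(";")]
    pairs = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        pairs[key.strip()] = value.strip()
    rush = pairs["rush"].lower()
    if rush not in TRUE_VALUES | FALSE_VALUES:
        raise ValueError(f"unrecognized rush value: {pairs['rush']!r}")
    return {
        "order_id": fields[0],
        "customer": pairs["customer"].title(),
        "items": [item.strip() for item in pairs["items"].split("|")],
        "priority": pairs["priority"].lower(),
        "rush": rush in TRUE_VALUES,
        "total_usd": round(float(pairs["total"].lstrip("$")), 2),
    }

def convert_orders(lines) -> list:
    """Parse all non-empty lines and sort the result by order_id ascending."""
    orders = [parse_order(line) for line in lines if line.strip()]
    return sorted(orders, key=lambda order: order["order_id"])

# The three order lines from the prompt.
RAW_ORDERS = [
    "ORD-9081 ; customer= lENA ortiz ; items=label rolls|thermal head ; priority=high ; rush=Y ; total=$184.5",
    "ORD-9077;customer=omar dunn;items= usb-c dock | hdmi cable |ethernet adapter;priority=normal;rush=no;total=249",
    "ORD-9080 ; customer = PRIYA sen ; items = nitrile gloves | face shields ; priority = low ; rush = true ; total = $73.00",
]

if __name__ == "__main__":
    print(json.dumps(convert_orders(RAW_ORDERS), indent=2))
```

Sorting compares `order_id` as a string, which works here because all three IDs share the same prefix and digit count.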

Matchup powered by OpenRouter.