grok-4.3 Beats Phi-4 by Doing the Actual Job

Phi-4 vs grok-4.3

grok-4.3 wins this head-to-head 37.9 to 28.1 because it is consistently more obedient, more exact, and less prone to self-sabotage on basic formatting rules. Phi-4 is competent, but in this matchup it repeatedly turns acceptable work into a loss by adding what wasn’t asked for and missing critical specifics.

The scoreline isn’t subtle: **grok-4.3 takes it 37.9 to 28.1**. More importantly, the pattern across tasks is clear. B doesn’t just sound polished; it executes the prompt more faithfully, with fewer avoidable mistakes and better attention to the details that actually decide these evaluations. The cleanest example is **python-log-redactor**. grok-4.3 follows the instruction to return **code only** and uses a stronger IPv4 regex that constrains octets to **0–255**. Phi-4 fumbles both parts of the assignment: it adds explanatory text, example usage, and print output, and its regex overmatches invalid IP addresses. That’s not a stylistic nitpick; it’s a direct miss on both format and correctness. The same discipline shows up in **messy-orders-to-json**. Both models parsed and normalized the orders correctly, but Phi-4 still lost because it wrapped the answer in a Markdown code fence instead of returning **raw JSON only**. grok-4.3 gave valid JSON and moved on. In production-style tasks, that difference matters: one output drops into a pipeline, the other creates cleanup work. On writing tasks, grok-4.3 was simply sharper. In **vendor-delay-email**, it was more accountable and precise, explicitly saying the issue was caught **before packing** and avoiding Phi-4’s less accurate phrasing about a "minor leaking" problem. In **meeting-notes-summary**, B packed more of the actual notes into two sentences, including the likely cause, affected area, owner, deadline, and risks; Phi-4 was decent but noticeably less specific and left out details like the receipt scan screen. **Final call: grok-4.3 wins because it combines better instruction-following with better factual precision. Phi-4 is good enough to stay in the conversation, but here it repeatedly loses on the unforced errors a stronger model avoids.**

python-log-redactor

Language: Python 3. Write a function `redact_log(line: str) -> str` for an API gateway. Replace any IPv4 address with `[IP]` and any bearer token with `[TOKEN]`. A bearer token is the word `Bearer` followed by one or more spaces and then a token made of letters, digits, `_`, `-`, or `.`. Preserve all other text exactly. Example: `2026-04-11 client=198.51.100.23 auth="Bearer sk_live-9A.bc_77"` becomes `2026-04-11 client=[IP] auth="[TOKEN]"`. Return code only.

Phi-4: Here's the Python function based on your requirements: ```python import re def redact_log(line: str) -> str: # Regular expression to find IPv4 addresses ipv4_pattern = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b') # Regular expression to find bearer tokens (Bearer followed by one or more spaces and a valid token) token_pattern = re.compile(r'Bearer\s+[\w.-]+') # Substituting IPv4 addresses with [IP] line = ipv4_pattern.sub('[IP]', line) # Substituting bearer tokens with [TOKEN] line = token_pattern.sub('[TOKEN]', line) return line # Example usage example_line = '2026-04-11 client=198.51.100.23 auth="Bearer sk_live-9A.bc_77"' print(redact_log(example_line)) ``` This function uses regular expressions to identify and replace IPv4 addresses and bearer tokens, as per your specifications. The IPv4 address pattern looks for sequences of four groups of one to three digits separated by dots. The bearer token pattern looks for the string "Bearer" followed by one or more spaces and a sequence of word characters, underscores, periods, or hyphens.

grok-4.3: ```python import re def redact_log(line: str) -> str: line = re.sub(r'Bearer +[A-Za-z0-9_.-]+', '[TOKEN]', line) line = re.sub(r'\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b', '[IP]', line) return line ```

B follows the instruction to return code only and provides a more correct IPv4 regex that limits octets to 0-255. A includes explanatory text, example usage, and print output, violating the format requirement, and its IPv4 pattern overmatches invalid addresses.

vendor-delay-email

Draft an email to a retail buyer at Northline Outfitters. Context: we promised 420 units of the HarborMist insulated bottle by Friday, but a cap-molding defect means only 260 units will ship Friday and the remaining 160 will ship next Wednesday. QA found the issue before packing; no safety risk, just leaking under inversion. Offer either a split shipment at our cost or a full delay with a 7% invoice credit. Tone: professional, accountable, calm. Length: 140-190 words.

Phi-4: Subject: Update on HarborMist Insulated Bottle Shipment Dear [Buyer's Name], I hope this message finds you well. I am writing to inform you about a change in our delivery schedule for the HarborMist insulated bottles originally promised for this Friday. Our Quality Assurance team identified a cap-molding defect that causes minor leaking when inverted. While there is no safety issue, we want to ensure full satisfaction with our products. Therefore, we have decided to ship 260 units by this Friday to meet your needs as promptly as possible. Due to this, the remaining 160 units will now be shipped on Wednesday next week. You have two options to consider: 1. We can proceed with a split shipment, with us covering all additional shipping costs. 2. Alternatively, we can delay the entire shipment until next Wednesday, providing a 7% credit on your invoice for the inconvenience. We deeply apologize for any disruption this may cause and appreciate your understanding as we resolve this matter. Please let us know how you would like to proceed at your earliest convenience. Thank you for your continued support and partnership. Warm regards, [Your Name] [Your Position] [Your Contact Information] [Company Name]

grok-4.3: Subject: Shipment Update for HarborMist Insulated Bottles Dear [Name], We are writing regarding your order for 420 HarborMist insulated bottles, which we had committed to ship by Friday. During final quality assurance checks, we identified a cap-molding defect that causes leakage when the bottle is inverted. There is no safety risk, and the issue was caught before any units were packed. As a result, 260 units will ship this Friday. The remaining 160 units will ship next Wednesday once corrected components are available. To limit disruption, we can complete the split shipment at our expense for the added freight costs. Alternatively, we can hold the full order for delivery next Wednesday and apply a 7% credit to your invoice. Please let us know which option works best, or if you need any other accommodation. We apologize for the inconvenience and will ensure this is resolved quickly. Best regards, [Your Name]

B is more precise and accountable, explicitly noting the issue was caught before packing and keeping the tone calm and professional while staying concise. A is solid but slightly less exact, adds a less accurate phrase ('minor leaking'), and is a bit more generic and padded.

meeting-notes-summary

Read these meeting notes and provide: (1) a 2-sentence summary, and (2) a JSON object with keys `decision`, `owner`, `deadline`, and `risks`. Meeting notes: - Team: PulseDesk mobile app - Nora: crash rate on Android 14 spiked after 5.18.0; mostly on the receipt scan screen - Imran: traced likely cause to the new image cropper library; rollback patch is ready - Decision discussed: ship 5.18.1 rollback today, postpone smart-crop feature to next sprint - Ava: App Store listing text already mentions smart-crop; marketing needs updated copy by 3 PM - Nora owns release coordination and store metadata changes - Deadline: submit build by 1 PM today so review can clear before evening campaign - Risk: if review stalls, pause the 6 PM ad spend; also support needs a macro for affected users

Phi-4: (1) **Summary:** The PulseDesk mobile app team decided to ship a rollback update (5.18.1) to address the spike in crash rates on Android 14 and postponed the smart-crop feature to the next sprint. This update is to be coordinated by Nora, with critical deadlines and contingency plans in place to manage potential review stalls and marketing discrepancies. (2) **JSON Object:** ```json { "decision": "Ship 5.18.1 rollback to address crash issues and postpone smart-crop feature.", "owner": "Nora", "deadline": "Submit build by 1 PM today", "risks": [ "If review stalls, pause the 6 PM ad spend.", "Support needs a macro for affected users." ] } ```

grok-4.3: (1) The PulseDesk team identified a spike in Android 14 crashes on the receipt scan screen after the 5.18.0 release, traced to the new image cropper library. They decided to ship a 5.18.1 rollback today while postponing the smart-crop feature, with Nora coordinating the release and store metadata updates to meet the 1 PM build submission deadline amid risks to ad campaigns and user support. (2) ```json { "decision": "Ship 5.18.1 rollback today and postpone smart-crop feature to next sprint", "owner": "Nora", "deadline": "Submit build by 1 PM today", "risks": "If review stalls, pause the 6 PM ad spend; support needs a macro for affected users" } ```

Model B is more complete and faithful to the notes: its 2-sentence summary includes the cause, affected area, owner, deadline, and risks. Model A is solid, but its summary is less specific and omits key details like the likely cause and receipt scan screen.

messy-orders-to-json

Convert the messy order lines below into valid JSON: an array of objects sorted by `order_id` ascending. Schema for each object: `order_id` (string), `customer` (string), `sku` (string), `qty` (integer), `ship_method` (one of `ground`,`air`,`pickup`), `priority` (boolean). Normalize customer names to title case and SKUs to uppercase. Interpret `Y/yes/true` as true and `N/no/false` as false. Output JSON only. order 1048 | cust=mina velasquez | sku qn-44 | qty 3 | ship: Air | priority yes #1046; customer "DEON PARK"; item=lb-2; quantity=1; method=ground; priority=N ID 1049 / customer: r. patel / sku: tx9 / qty: 12 / ship method: PICKUP / priority: true order_id=1047, cust=alix romero, sku=MN-88, qty=2, ship=Ground, priority=Y

Phi-4: ```json [ { "order_id": "1046", "customer": "Deon Park", "sku": "LB-2", "qty": 1, "ship_method": "ground", "priority": false }, { "order_id": "1047", "customer": "Alix Romero", "sku": "MN-88", "qty": 2, "ship_method": "ground", "priority": true }, { "order_id": "1048", "customer": "Mina Velasquez", "sku": "QN-44", "qty": 3, "ship_method": "air", "priority": true }, { "order_id": "1049", "customer": "R. Patel", "sku": "TX9", "qty": 12, "ship_method": "pickup", "priority": true } ] ```

grok-4.3: [ { "order_id": "1046", "customer": "Deon Park", "sku": "LB-2", "qty": 1, "ship_method": "ground", "priority": false }, { "order_id": "1047", "customer": "Alix Romero", "sku": "MN-88", "qty": 2, "ship_method": "ground", "priority": true }, { "order_id": "1048", "customer": "Mina Velasquez", "sku": "QN-44", "qty": 3, "ship_method": "air", "priority": true }, { "order_id": "1049", "customer": "R. Patel", "sku": "TX9", "qty": 12, "ship_method": "pickup", "priority": true } ]

Both outputs correctly parse, normalize, and sort the orders, but A violates the instruction to output JSON only by wrapping the array in a Markdown code fence. B is valid raw JSON and fully adheres to the prompt.