Head to head: grok-4.3 vs Llama-3.3-70B-Instruct
This matchup wasn’t close: grok-4.3 won on both instruction discipline and editorial reliability. Across code, JSON formatting, and business writing, it consistently did the exact job asked, while Llama-3.3-70B-Instruct kept slipping on output constraints and precision.
grok-4.3 takes this comfortably, 38.0 to 26.0, because it does the unglamorous thing that actually matters in production settings: it follows directions exactly. That was the pattern across the set. Where the prompt asked for code only or JSON only, grok-4.3 delivered the required format without decoration; Llama-3.3-70B-Instruct repeatedly added Markdown fences or extra material and turned otherwise workable answers into instruction misses.
The clearest example is `python-log-redactor`. grok-4.3 returned code only, included the necessary import, and correctly redacted values up to a space, comma, semicolon, or end of line while preserving delimiters. Llama-3.3-70B-Instruct had broadly similar redaction logic, but it wrapped the answer in Markdown fences and added example usage and printing. That is not a minor stylistic quirk; it is a direct failure on the prompt’s core constraint.
The same problem showed up again in `messy-orders-to-json`. Both models parsed the orders correctly and sorted them into the right schema, but grok-4.3 alone returned valid JSON and nothing else. Llama-3.3-70B-Instruct once again fenced the output, which is exactly the kind of formatting error that breaks downstream use. In structured-output tasks, “almost right” is just wrong.
On writing tasks, grok-4.3 was also the steadier editor. In `status-update-delay`, it was clearer and more professional, and it included the concrete hotfix timing of **15:10 UTC** instead of vaguely circling the plan. In `meeting-notes-summary`, it stayed closer to the source notes and kept the JSON cleaner, especially by not inflating status updates and action items into formal decisions. Llama-3.3-70B-Instruct was serviceable, but grok-4.3 was sharper, calmer, and more faithful.
**Final call: grok-4.3 is the better model here, decisively. It wins not with flash, but with the far more valuable habit of being precise, compliant, and trustworthy under instruction.**
python-log-redactor
Practical coding — Python. Return code only. Write a function `redact_log(line: str) -> str` that masks sensitive values in application log lines before they are shipped to a vendor. Replace the value after `email=` with `[REDACTED_EMAIL]`, after `token=` with `[REDACTED_TOKEN]`, and after `ip=` with `[REDACTED_IP]`. Values end at the next space, comma, semicolon, or end of line. Preserve everything else exactly. Example input: `ts=2026-03-14T09:22:11Z level=INFO email=mira.chen@northbay.io action=login ip=203.44.18.9 token=sk_live_7Hk92aX note=ok` should become `ts=2026-03-14T09:22:11Z level=INFO email=[REDACTED_EMAIL] action=login ip=[REDACTED_IP] token=[REDACTED_TOKEN] note=ok`. Include any imports needed, but no explanation.
Model A (grok-4.3) cleanly satisfies the prompt with code only, includes the needed import, and correctly redacts values up to a space, comma, semicolon, or end of line while preserving delimiters. Model B (Llama-3.3-70B-Instruct) violates the "code only" constraint by wrapping its answer in Markdown fences and adding example usage and printing, so it does not adhere to the instructions despite having broadly similar redaction logic.
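For reference, here is a minimal sketch of a function that would satisfy the prompt as written. It is not either model's actual output, and the word-boundary rule (so that a key such as `zip=` is left alone) is an assumption the prompt does not spell out.

```python
import re

# Match email=, token=, or ip= followed by a value that runs until the next
# space, comma, semicolon, or line break. The \b is an assumption: it keeps
# keys that merely end in "ip" (for example zip=) from being redacted.
_SENSITIVE = re.compile(r"\b(email|token|ip)=[^ ,;\r\n]*")


def redact_log(line: str) -> str:
    """Mask the values of email=, token=, and ip= and keep everything else."""
    return _SENSITIVE.sub(
        lambda m: f"{m.group(1)}=[REDACTED_{m.group(1).upper()}]", line
    )
```

Because the delimiter itself is never consumed by the pattern, whatever follows the value (space, comma, or semicolon) is preserved exactly.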
status-update-delay
Professional writing — Write a Slack status update to your product and support teams. Audience: internal coworkers. Situation: the "Mercury Lane" billing export scheduled for 14:00 UTC is delayed because a schema change in the `invoice_adjustments` table broke the job at 13:52 UTC. Impact: CSV exports for 18 enterprise customers are delayed; no data loss; API and dashboard are unaffected. Current plan: hotfix by 15:10 UTC, rerun exports by 15:25 UTC, post next update at 15:00 UTC. Tone: calm, accountable, concise. Length: 90–130 words.
Model A is clearer, more professional, and fully includes the requested plan details, including the hotfix timing of 15:10 UTC. Model B is acceptable but slightly less concise and calm in tone, and it omits the explicit hotfix-by-15:10 UTC commitment.
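The mechanical parts of this prompt (word count and the three plan times) can be checked automatically. The sketch below is a hypothetical helper, not part of the matchup's grading; `check_status_update` and `REQUIRED_TIMES` are names invented for illustration, and tone still needs a human or model judge.

```python
# The three commitments named in the prompt's "Current plan".
REQUIRED_TIMES = ("15:10 UTC", "15:25 UTC", "15:00 UTC")


def check_status_update(text: str, min_words: int = 90, max_words: int = 130) -> list[str]:
    """Return a list of problems; an empty list means the draft passes."""
    problems = []
    word_count = len(text.split())  # whitespace-separated words
    if not min_words <= word_count <= max_words:
        problems.append(
            f"length is {word_count} words, expected {min_words}-{max_words}"
        )
    for stamp in REQUIRED_TIMES:
        if stamp not in text:
            problems.append(f"missing plan time: {stamp}")
    return problems
```

A draft that leaves out the hotfix deadline would be flagged with `missing plan time: 15:10 UTC`, which is the omission the verdict above attributes to Model B.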
meeting-notes-summary
Summarization & extraction — Read these meeting notes and then provide: (1) a 2-sentence summary, and (2) a JSON object with keys `launch_date`, `owner`, `blocked_by`, `budget_change_usd`, and `decisions` (array of short strings). Notes: - NimbusOak mobile app launch review, Tue 7 Jan. - Priya said App Store screenshots are done, but Android screenshots still need legal review. - Marco moved the target launch from Feb 3 to Feb 10 after QA found a crash on password reset for Android 12 only. - Fix is assigned to Lena; patch build expected by Friday morning. - Finance approved an extra $6,500 for paid acquisition testing in week one. - Team agreed not to change onboarding copy before launch. - Biggest blocker: waiting on legal sign-off for the Android screenshots. - Next checkpoint Thursday 16:30.
Model A is more faithful to the notes and cleaner: its summary captures the key facts accurately without adding framing text, and its JSON avoids treating non-decision items as decisions. Model B is still mostly correct, but its `decisions` array includes status/actions like assigning a fix and approving budget, which are less clearly decisions than the explicit onboarding-copy agreement.
messy-orders-to-json
Data wrangling / structured output — Convert the messy order notes below into valid JSON only. Output must be an object with one key, `orders`, whose value is an array of objects sorted by `order_id` ascending. Each order object must have exactly these keys: `order_id` (string), `customer` (string), `items` (array of strings), `priority` ("low"|"normal"|"high"), `ship_by` (YYYY-MM-DD), and `gift` (boolean). Normalize priorities like rush/urgent => high, std => normal. If gift is missing, use false. Messy notes: # A-104 | customer: Velora Studio | items: "brass lamp"; "linen shade" | ship by 2026/04/09 | priority=std Order A-102, customer=Kite & Hollow, items=[walnut tray, candle set], rush, ship_by: 2026-04-05, gift=yes A-103 ; customer : Dr. Imani Sethi ; items : engraved pen ; ship by : 2026-04-07 ; priority : low order_id=A-101 customer="Parker Reef" items=ceramic mug|tea tin urgent ship-by=2026-04-04
Both outputs parse the orders correctly and match the required schema and sorting, but Model A follows the instruction to output valid JSON only. Model B wraps the JSON in Markdown code fences, which violates the format requirement.
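To see why a fenced answer fails "valid JSON only", it helps to feed the raw model output straight to a JSON parser, as a downstream consumer would. The following is a minimal sketch under that assumption; `check_orders_output` is a hypothetical validator written for this page, and its schema checks are derived from the prompt above rather than from the matchup's actual grader.

```python
import json
import re

REQUIRED_KEYS = {"order_id", "customer", "items", "priority", "ship_by", "gift"}
PRIORITIES = ("low", "normal", "high")
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_orders_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)  # rejects Markdown fences and surrounding prose
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict) or set(data) != {"orders"}:
        return ["top level must be an object with exactly one key, 'orders'"]
    orders = data["orders"]
    if not isinstance(orders, list):
        return ["'orders' must be an array"]

    problems, ids = [], []
    for i, order in enumerate(orders):
        if not isinstance(order, dict) or set(order) != REQUIRED_KEYS:
            problems.append(f"order {i}: keys do not match the schema")
            continue
        if isinstance(order["order_id"], str):
            ids.append(order["order_id"])
        else:
            problems.append(f"order {i}: order_id must be a string")
        if not isinstance(order["customer"], str):
            problems.append(f"order {i}: customer must be a string")
        items = order["items"]
        if not (isinstance(items, list) and all(isinstance(x, str) for x in items)):
            problems.append(f"order {i}: items must be an array of strings")
        if order["priority"] not in PRIORITIES:
            problems.append(f"order {i}: priority must be low, normal, or high")
        ship_by = order["ship_by"]
        if not (isinstance(ship_by, str) and DATE_RE.fullmatch(ship_by)):
            problems.append(f"order {i}: ship_by must be YYYY-MM-DD")
        if not isinstance(order["gift"], bool):
            problems.append(f"order {i}: gift must be a boolean")
    if ids != sorted(ids):
        problems.append("orders are not sorted by order_id ascending")
    return problems
```

With this check, the same payload passes when returned bare and fails at the first step when wrapped in a Markdown code fence, regardless of how well the orders inside were parsed.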
Matchup powered by OpenRouter.