grok-4.3 vs Kimi-K2.6: Precision Beats Polish

grok-4.3 takes this matchup 38.0 to 35.5 by being more obedient where it counts: format, extraction, and output discipline. Kimi-K2.6 writes a slightly better stakeholder update, but grok-4.3 wins the tasks that punish sloppiness.



This wasn’t a blowout, but it was decisive. grok-4.3 wins the aggregate 38.0 to 35.5 because it was tighter on instruction-following in the tasks where small deviations matter. Kimi-K2.6 showed better audience instincts in one business-writing prompt, but grok-4.3 was the more reliable model across the set.

The coding task, python-log-window-bugfix, correctly ended in a tie. Both models produced the required O(n) sliding-window fix, used the inclusive <= window_secs boundary, returned the earliest qualifying timestamp, and naturally handled empty input. grok-4.3 added an explicit empty-input check, but that’s a style edge, not a substantive win.

Kimi-K2.6 earned its one clear victory on vendor-delay-status-update, and fairly so. Its note did a better job signaling the next decision point, mentioning internal review, and sounding calibrated for product, support, and leadership. grok-4.3 was solid, but less complete on update timing and decision framing.

But grok-4.3 took the two tasks that separate a merely good model from a dependable one. On meeting-notes-summary-extract, it followed the brief more cleanly: a plain two-sentence summary and a concise JSON object with properly typed fields, including a numeric budget_delta. Kimi-K2.6 stumbled by turning that field into a verbose string. On messy-orders-to-json, the difference was even simpler and more damning: both normalized the data correctly, but grok-4.3 returned JSON only, while Kimi-K2.6 wrapped its answer in Markdown fences and failed the output contract.

Final call: grok-4.3 wins because it is stricter, cleaner, and more trustworthy under exacting instructions. Kimi-K2.6 is the slightly better communicator; grok-4.3 is the better model.

How they were tested

We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. grok-4.3 scored 38.0 to Kimi-K2.6's 35.5.

1. python-log-window-bugfix

Language: Python 3.11

Fix the function below. It should return the earliest timestamp such that the total bytes of events within the next window_secs seconds is at least threshold. Events are (timestamp, bytes) tuples, sorted by timestamp ascending. If no window qualifies, return None. The current code is wrong on edge cases and is too slow for large inputs. Return code only.

    def first_window_hit(events, window_secs, threshold):
        best = None
        for i in range(len(events)):
            total = 0
            for j in range(i, len(events)):
                if events[j][0] - events[i][0] < window_secs:
                    total += events[j][1]
                else:
                    break
            if total > threshold:
                best = events[i][0]
                break
        return best

Example: events = [(100, 120), (102, 90), (105, 40), (109, 300)], window_secs = 5, threshold = 250 -> return 100 because timestamps 100..105 inclusive contain 250 bytes.

Requirements:
- Treat the window as inclusive: event_ts - start_ts <= window_secs
- Return the earliest qualifying start timestamp
- Handle empty input
- Target O(n) time

Winner: Tie. Both outputs correctly implement an O(n) sliding-window solution, use the required inclusive boundary (<= window_secs), return the earliest qualifying timestamp, and handle empty input by returning None. grok-4.3 adds an explicit empty-input check, but Kimi-K2.6 is equally correct and efficient without it.
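For readers who want to see what a passing answer looks like, here is a minimal two-pointer sketch that satisfies the four stated requirements. It is our own illustration, not either model's actual output, and it assumes window_secs is non-negative and that byte counts are non-negative.

```python
from typing import Optional, Sequence, Tuple

Event = Tuple[int, int]  # (timestamp, bytes)


def first_window_hit(
    events: Sequence[Event], window_secs: int, threshold: int
) -> Optional[int]:
    """Return the earliest start timestamp whose inclusive window reaches threshold.

    Assumes events are sorted by timestamp and window_secs >= 0.
    """
    total = 0   # bytes currently inside the window [left, right)
    right = 0   # index of the first event not yet added to the window
    n = len(events)

    for left in range(n):
        start_ts = events[left][0]
        # Grow the window while the next event is within the inclusive bound.
        while right < n and events[right][0] - start_ts <= window_secs:
            total += events[right][1]
            right += 1
        # "At least threshold" means >=, not >.
        if total >= threshold:
            return start_ts
        # Slide the start forward: drop the leftmost event's bytes.
        total -= events[left][1]

    # Covers both "no window qualifies" and empty input.
    return None


example = [(100, 120), (102, 90), (105, 40), (109, 300)]
result = first_window_hit(example, window_secs=5, threshold=250)
```

Each event is added once and removed once, which is where the O(n) bound comes from. The empty-input case needs no special branch because the loop simply never runs.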

2. vendor-delay-status-update

Write a status update for our internal engineering Slack channel.

Context: We planned to enable the new invoice export on Monday, but our vendor LatticeFox found a rounding bug in tax totals for multi-currency accounts. They expect a fix by Thursday 14:00 UTC. We have paused rollout for all customers, including the 18 accounts in the pilot. No data loss occurred; exports generated before the pause remain valid. Engineering is adding an extra validation check before we resume. Next decision point is Thursday after vendor confirmation.

Audience: product, support, and leadership
Tone: calm, accountable, concise
Length: 90-130 words
Include: what happened, customer impact, what we’re doing next, and when we’ll update again.

Winner: Kimi-K2.6. Both are clear, accurate, and within the requested length, but Kimi-K2.6 better matches the audience and brief by explicitly framing the next decision point, noting internal review, and sounding slightly more polished for product, support, and leadership. grok-4.3 is strong but a bit less complete on the decision and update-timing details.

3. meeting-notes-summary-extract

Read the meeting notes below, then produce:
1) a 2-sentence summary
2) a JSON object with keys decision, owner, deadline, risks, and budget_delta

Meeting notes:
- Project: replacing the old badge printers at Northline Health’s 12 clinics.
- Mara said the original plan was 30 Zebra QL units, but 6 clinics have lower print volume than expected.
- Devon proposed buying 24 QL units now and keeping 4 spare Brother TD devices from storage instead of ordering all 30.
- Priya warned the Brother drivers fail on kiosk image v5.2 unless IT applies patch KB-441.
- Team agreed to Devon’s proposal if IT confirms the patch by 2026-07-12.
- Finance note: this change reduces immediate spend from $18,900 to $15,360.
- If patch confirmation slips, fallback is the original 30-unit Zebra order.
- Mara will confirm with IT and send final go/no-go by end of day 2026-07-12.

Winner: grok-4.3. Both capture the key facts accurately, but grok-4.3 adheres more cleanly to the requested format by providing a plain 2-sentence summary and a concise JSON object with appropriately typed fields, especially a numeric budget_delta. Kimi-K2.6 is also strong, but its budget_delta is a verbose string rather than a clean extracted value.
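To make the typing point concrete, the sketch below builds one plausible extraction from the meeting notes and serializes it with a numeric budget_delta. The field wording and the sign convention (negative means spend went down) are our assumptions; the prompt does not specify either, and this is not a reproduction of either model's answer.

```python
import json

# Figures taken from the finance note in the meeting notes.
original_spend = 18_900
revised_spend = 15_360

# Assumed convention: delta = revised - original, so a saving is negative.
budget_delta = revised_spend - original_spend

extraction = {
    "decision": (
        "Buy 24 QL units now and reuse 4 spare Brother TD devices, "
        "conditional on IT confirming patch KB-441"
    ),
    "owner": "Mara",
    "deadline": "2026-07-12",
    "risks": [
        "Brother drivers fail on kiosk image v5.2 without patch KB-441",
        "If patch confirmation slips, fallback is the original 30-unit Zebra order",
    ],
    "budget_delta": budget_delta,  # a number, not a sentence about the number
}

payload = json.dumps(extraction, indent=2)
parsed = json.loads(payload)
```

A downstream consumer can do arithmetic on a numeric budget_delta directly. A string such as a sentence describing the reduction would have to be parsed again before it is usable.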

4. messy-orders-to-json

Convert the messy order lines below into valid JSON as an array of objects. Each object must have exactly these keys: order_id (string), customer (string), sku (string), qty (integer), unit_price (number), rush (boolean), ship_date (string in YYYY-MM-DD).

Rules:
- Trim spaces
- Normalize booleans: yes/y/true -> true; no/n/false -> false
- Normalize dates to YYYY-MM-DD
- qty is an integer
- unit_price is numeric with no currency symbol
- Preserve row order
- Output JSON only

Data:
1. order=AX-104 | customer=Rillridge Labs | sku = MX-88 | qty: 3 | price USD 19.50 | rush? yes | ship 7/18/2026
2. order=AX-105|customer= Solena Studio|sku=Q-2A|qty:12|price=$4|rush? n|ship=2026-07-19
3. order : AX-106 ; customer : "Harbor & Pine" ; sku : VT-900 ; qty : 1 ; price : 249.99 USD ; rush : TRUE ; ship : Jul 21 2026
4. order=AX-107 | customer=Maploop Transit | sku=LR-44 | qty= 25 | price=USD 0.85 | rush=no | ship=21-07-2026
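The normalization rules in this prompt are simple to state but easy to get subtly wrong, so here is a rough sketch of the three value-level helpers a solution would need. The function names and the list of accepted date formats are our own assumptions, chosen to cover the four sample rows; slash-separated dates are read month-first and dash-separated dates with a trailing year are read day-first, which a truly ambiguous input would not let us verify. Month-name parsing also assumes an English locale.

```python
from datetime import datetime

# Formats assumed from the four sample rows, tried in this order.
_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%b %d %Y", "%d-%m-%Y")


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no format fits."""
    text = raw.strip()
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def normalize_bool(raw: str) -> bool:
    """Map yes/y/true to True and no/n/false to False, case-insensitively."""
    token = raw.strip().lower()
    if token in {"yes", "y", "true"}:
        return True
    if token in {"no", "n", "false"}:
        return False
    raise ValueError(f"unrecognized boolean: {raw!r}")


def normalize_price(raw: str) -> float:
    """Strip the currency markers seen in the sample data and return a number."""
    cleaned = raw.replace("USD", "").replace("$", "").strip()
    return float(cleaned)


dates = [normalize_date(d) for d in ("7/18/2026", "2026-07-19", "Jul 21 2026", "21-07-2026")]
```

Raising on unrecognized input, rather than guessing, keeps bad rows from slipping silently into the output.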

Winner: grok-4.3. Both outputs correctly normalize the fields and preserve row order, but grok-4.3 fully follows the instruction to output JSON only. Kimi-K2.6 wraps the JSON in Markdown code fences, so its output is not pure JSON.
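Why the fences matter is easiest to see from the consumer's side: a strict JSON parser rejects fenced output outright. The check below is a minimal sketch of such an output contract, not the judge's actual scoring code; the helper name is_json_only is hypothetical, and the sample row is our own normalization of the second order line.

```python
import json

REQUIRED_KEYS = {
    "order_id", "customer", "sku", "qty", "unit_price", "rush", "ship_date",
}


def is_json_only(output: str) -> bool:
    """True if the whole output parses as a JSON array of objects with exactly the required keys."""
    try:
        rows = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(rows, list) and all(
        isinstance(row, dict) and set(row) == REQUIRED_KEYS for row in rows
    )


bare = json.dumps([
    {
        "order_id": "AX-105",
        "customer": "Solena Studio",
        "sku": "Q-2A",
        "qty": 12,
        "unit_price": 4,
        "rush": False,
        "ship_date": "2026-07-19",
    }
])

# Build the Markdown fence programmatically to keep this example readable.
fence = "`" * 3
fenced = f"{fence}json\n{bare}\n{fence}"

bare_ok = is_json_only(bare)
fenced_ok = is_json_only(fenced)
```

Here bare_ok is True and fenced_ok is False: the data inside the fences is identical, but a pipeline that feeds the response straight into a parser fails on the first backtick.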


See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
