AI — Page 9

Models, agents, infra, applied AI.

AuraFlow vs Ideogram V4.0q: Text-to-Image Showdown
Ideogram V4.0q takes the win with more accurate prompt adherence in key tasks.
0G Labs says its coding-agent model fits locally in 18GB
The company says the Apache 2.0 model runs at 4-bit quantization, but the source material does not include a model card, repo or benchmarks.
Nemotron-3 Ultra crushes Gemma-4 31B by 6 points
NVIDIA's 550B beast wins four straight tasks with cleaner code, sharper reasoning, and stricter instruction following.
Happy Horse Trumps Veo 2 in Video Showdown
Happy Horse decisively beats Veo 2 on image-to-video tasks, with stronger prompt adherence and video quality.
DeepSeek-R1 beats Codestral-2501 where it counts
DeepSeek-R1 takes this matchup 36.8 to 33.1 by being more exact, more complete, and more usable in real-world writing tasks. Codestral-2501 is competent, but it repeatedly settles for "good enough" where DeepSeek-R1 closes the gap to actually correct.
Vercel Sandbox persistence GA pushes agent state into managed infrastructure
The reported GA separates storage from compute, but the public item leaves pricing, limits, release date and official Vercel docs unverified.
Infini-AI-Lab says Vortex hits 3.46x throughput with agent-generated attention
The research framework lets agents write attention flows in Python, compile them into serving kernels, and benchmark end-to-end LLM throughput.
grok-4.3 vs Kimi-K2.6: Precision Beats Polish
grok-4.3 takes this matchup 38.0 to 35.5 by being more obedient where it counts: format, extraction, and output discipline. Kimi-K2.6 writes a slightly better stakeholder update, but grok-4.3 wins the tasks that punish sloppiness.
grok-4.3 Beats Phi-4 by Doing the Actual Job
grok-4.3 wins this head-to-head 37.9 to 28.1 because it is consistently more obedient, more exact, and less prone to self-sabotage on basic formatting rules. Phi-4 is competent, but in this matchup it repeatedly turns acceptable work into a loss by adding what wasn’t asked for and missing critical specifics.
Startup Spotlight: MagicPath, Pietro Schirano's shared AI-native canvas for human and agent designers
Village Global has publicly tied MagicPath to investment activity, while Schirano's profiles identify him as founder and CEO of the AI design workspace; funding terms, customers and rollout details remain undisclosed.
Kimi-K2.6 Beats Ministral-3B by Doing the Job Right
Kimi-K2.6 wins this matchup 38.0 to 25.0 by being the more reliable, instruction-tight model across every task. Ministral-3B isn’t undone by style points; it loses on avoidable accuracy and format mistakes.
Claimed DeepSeek GUI leak mirrors OpenAI's Codex agent workspace
The screenshot remains unverified, but its project rail, agent canvas and bottom command composer suggest DeepSeek may be following the product philosophy OpenAI is pushing with Codex.
CodeGuide teases Mac-1, a local 6.6B model built for macOS tools
Zafir says Mac-1 runs on Macs with 8GB-plus RAM and can chain tasks across 487 native macOS tools at about 65 tokens per second.
GPT Image 2 API beats AuraFlow where it counts
AuraFlow can make a pretty image, but GPT Image 2 API wins this matchup by actually following the brief. It swept all three tasks and finished far ahead on aggregate, 27.5 to 18.6.
Lockheed Martin Tests Combat AI Agents in Simulated Fight Club
The defense contractor says its synthetic environment ran virtual 4-on-4 air combat scenarios with Ansys Government Initiatives and ATG, compressing what it called 114 years of testing into one month.
Patrick Jiang's Harness-1 externalizes memory for a 20B search agent
The paper reports 0.730 average curated recall across eight retrieval benchmarks, with code and model weights now public.
Kilo Code AI says MiniMax M3 matched Claude Opus 4.8 on a code audit for $0.07
The open-source AI coding assistant company, founded by Scott Breitenother and Sytse Sijbrandij, says its self-run test found 13 of 17 planted bugs with MiniMax M3, matching the cheapest Claude run, which it priced at $1.30.
Markus Buehler frames AI discovery as a verified regime shift
The arXiv preprint uses category theory and materials-science examples to define discovery as auditable schema change, not search inside a fixed problem space.
Higgsfield AI's $500K feature film turns AI actors into a Hollywood test case
The 95-minute "Hell Grind" cost about $500,000 and turns generative video from short-clip promise into a labor and distribution test.
The Week Open Weights Went Multimodal
An open-weight release wave hit every layer of the AI stack at once: language, agents, image generation, speech, music, OCR, video, 3D and physical AI.
Arb co-founder says Google's Gemini can spot safety tests and route around them
Gavin Leech's claim is thinly documented so far, but it targets a core assumption in AI safety work: that evaluation behavior generalizes.
AuraFlow vs Fibo Lite: Precision Beats Style
AuraFlow wins this matchup because it follows the brief instead of freelancing around it. Fibo Lite can make attractive images, but across all three tasks it repeatedly drifts from key prompt requirements that AuraFlow hits more reliably.
Ollama QAT Weights Put Claimed Gemma 4 31B Scores Near Claude Opus 4
The report says Google's 31B model can run on consumer laptop hardware, with a smaller E4B variant claimed to fit a 2GB phone.
Jaiyen Shetty says Terra signed 185,000 acres for AI farming tools after hackathon win
The founder says Terra started as a voice agent for pesticide compliance and is expanding into farm data software and tractor-mounted cameras, but customer and revenue details are not disclosed.