AI — Page 8
Models, agents, infra, applied AI.
- Mollick's Claude Fable 5 test highlights hours-long agent work, not another launch demo
Ethan Mollick says Claude Fable 5 worked for hours across research and coding tasks, offering a long-horizon outside read on the Amodeis' agent bet.
- Seedance 2 steamrolls AnimateDiff on prompt fidelity
AnimateDiff stays coherent, but coherence alone doesn’t win head-to-heads when the model keeps dropping the brief. Seedance 2 Image to Video was dramatically better at actually staging the scenes it was asked to make.
- Guan Wang's Sapient Says It Trained a 1B-Parameter Model for About $1,500
Sapient Intelligence's HRM-Text claim targets the enterprise fear that custom AI means frontier-model budgets and vendor lock-in.
- Imagineart 2.0 Preview Beats AuraFlow Where It Counts
AuraFlow has flashes of atmosphere, but Imagineart 2.0 Preview wins this matchup decisively by following the brief, handling text, and delivering more convincing scenes. The 26.3 to 18.5 scoreline flatters AuraFlow.
- Ari Jacoby's Concentrate AI enters the AI routing fight as token bills bite
Concentrate emerged from stealth with more than $5 million, while OpenRouter's recent $113 million round shows how fast the gateway layer is heating up.
- grok-4.3 edges gpt-5.4-nano on execution, not flash
This was close on aggregate, but grok-4.3 wins because it made fewer costly mistakes in structured-output work. gpt-5.4-nano was sharper on tone and regex edge cases, yet it gave back those gains by breaking instructions where precision mattered more.
- Tesla hacker Yoni Ramon brings Pi out of stealth with $35M for AI security
Pi is valued at $100 million and counts Navan as an early customer, while Forbes reports xAI is also using the system.
- Mike Krieger turns Anthropic's Fable 5 launch into a product test
Anthropic says Fable 5 routes under 5% of sessions to Opus 4.8, while Mythos 5 keeps higher-risk capability behind trusted access.
- Instawork turns its gig marketplace into a robot-training data line
Instacore puts five cameras and a compute backpack on workers to capture commercial tasks for AI labs, with customers still unnamed.
- Luma Ray 3.2 steamrolls AnimateDiff
Across both prompt-following tests, Luma Ray 3.2 Image to Video wasn’t just better than AnimateDiff; it was operating in a different league. AnimateDiff could gesture at mood; Luma delivered the actual scene, action, and camera logic the prompts asked for.
- Fibo Bbq Preview beats Bagel on image direction
Bagel steals one poster task, but Fibo Bbq Preview wins the matchup where it matters: prompt control, scene construction, and mood. On aggregate, Fibo Bbq Preview is the more reliable image model and the clear overall pick.
- Anthropic launches Claude Fable 5 with a gated Mythos 5 for cyber use
The new model is priced at $10 per million input tokens and $50 per million output tokens, with some requests routed to Opus 4.8.
- Google ships Gemini 3.5 Live Translate across consumer, enterprise and developer tools
The audio model streams speech-to-speech translation across 70-plus languages, with Meet access limited to private preview this month.
- grok-4.3 vs DeepSeek-V4-Flash: Precision Beats Polish
grok-4.3 takes this matchup by being the stricter, cleaner finisher on structured-output tasks, while DeepSeek-V4-Flash wins the one audience-sensitive writing test. The scoreline is close, but the deciding errors are the kind that matter in production.
- Harness-1 researchers say a 20B open search agent beat GPT-5.4 on recall
The UIUC, UC Berkeley and Chroma project shifts search memory from the model context window into a structured software environment.
- Marey Realism V1.5 Beats AnimateDiff Where It Counts
AnimateDiff is the steadier clip-maker, but Marey Realism V1.5 is the better prompt reader and the more convincing filmmaker. Across both tests, it delivered the details, atmosphere, and camera language the prompts actually asked for.
- ImagineArt 1.5 Pro Preview beats AuraFlow on obedience
AuraFlow can make attractive images, but this matchup wasn’t about vibes alone. ImagineArt 1.5 Pro Preview won all three tasks by doing the harder thing consistently: following the prompt in specific, visible ways.
- Leopold Aschenbrenner turns an AI thesis into a $20 billion hedge fund
WSJ reports Jane Street is now an investor in Situational Awareness, whose biggest disclosed win is tied to Anthropic.
- grok-4.3 vs DeepSeek-V4-Pro: Precision Beats Padding
grok-4.3 wins this head-to-head 37.0 to 30.0 by being the more obedient, production-ready text model. Across four tasks, it was consistently tighter on instructions and cleaner on edge cases, while DeepSeek-V4-Pro kept drifting into unnecessary constraints or formatting mistakes.
- MMAE benchmark tests whether AI can edit audio without collateral damage
Tencent Hunyuan and university collaborators say current models post an Exact Match Rate below 5% on the new speech and audio editing benchmark.
- DeepSeek V4 Pro beats GPT-5.5 Pro on precision
DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.
- Happy Horse Trumps AnimateDiff in Video Modeling
In a decisive victory, Happy Horse outshone AnimateDiff in cinematic motion and dynamic environment tasks.
- echohive turns Codex into a creative-coding assistant
The Three.js visual demo points to a smaller but important market for coding agents: creators selling workflows, not software seats.
- Hugging Face turns its community toward small-model efficiency
The Build Small Hackathon track asks developers to build with smaller models as open-weight systems move closer to local production use.