AI

Models, agents, infra, applied AI.

Aaron Levie says agents will use software 100x more than people and force new SaaS guardrails
The Box co-founder argues agents will query CRM, documents, analytics and corporate knowledge far more than employees do.
Head to head: Bernini-R Edit Video vs Wan v2.6 Image to Video
One model consistently follows the brief; the other keeps wandering off it. Across both tests, Bernini-R Edit Video wins by being the only system that reliably preserves scene logic, camera intent, and the requested visual changes over time.
Andrew Curran says a stronger Anthropic Mythos model has emerged from training
Andrew Curran says a stronger Mythos-class model has finished training, days after US export controls forced Fable 5 and Mythos 5 offline.
Apertus Mini pushes Switzerland's open AI bet onto smaller devices
The Swiss AI Initiative is using distillation and quantization to turn its public foundation model into deployable infrastructure.
Renji John gets a second shot at greenhouse robots with Eternal.ag
The Cologne startup recently raised about $10 million to scale autonomous tomato harvesting, a hard robotics problem its CEO has already seen fail once.
Head to head: Bagel vs Juggernaut Flux Base LoRA
These two finish dead even on aggregate, but they get there in very different ways. Juggernaut Flux Base LoRA wins on scene fidelity and commercial composition, while Bagel’s edge is stricter spatial obedience when the prompt turns into a placement test.
Shift's free cleaning bet just got its first apartment-level stress test
Business Insider let Shift workers film a New York apartment, showing both the appeal and the cost of trading privacy for robot data.
Head to head: grok-4.3 vs Llama-4-Maverick-17B-128E-Instruct-FP8
This matchup wasn’t close on execution: one model consistently did the job asked, while the other kept drifting into extra verbiage and looser instruction-following. The difference showed up not in flashy reasoning claims, but in whether the output was precise, disciplined, and actually usable.
Ian Barber's warning: LLMs have entered the recsys phase
Barber argues model research now depends on composable kernels, not just cleaner agents.
Z.ai's GLM-5.2 vs Gemini on Agent Arena: the viral claim needs context
A post said GLM-5.2 ranked #3 and topped Gemini 3.5 Flash. Agent Arena is a live, multi-signal leaderboard, so any rank needs a named signal and timeframe.
Head to head: Bernini-R Edit Video vs Seedance 2 Image to Video
This matchup turns on prompt discipline, not vibes. Bernini-R Edit Video can produce attractive imagery, but Seedance 2 Image to Video is the model that actually lands the shot the prompt asked for, twice, and wins comfortably on aggregate.
NVIDIA reportedly acquihires Essential AI team including Ashish Vaswani
The reported move would put one of the Transformer authors inside NVIDIA's Nemotron model group, but deal terms and timing remain unclear.
Head to head: Bagel vs ImagineArt 2.0 Preview
This one isn’t close. Across all three prompts, ImagineArt 2.0 Preview is the model that actually reads the brief and delivers the right objects, materials, and palette discipline, while Bagel repeatedly slides into attractive-but-wrong interpretation.
Prem AI brings multi-GPU confidential inference into Fluso
Simone Giacomelli is moving Prem AI's private AI pitch from infrastructure into a production workspace for regulated teams.
Head to head: grok-4.3 vs Phi-4-reasoning
This one wasn’t competitive. grok-4.3 repeatedly did the basic but crucial thing Phi-4-reasoning did not: answer the prompt in the format requested, with usable output instead of meta-commentary.
Elon Musk takes Grok into Databricks as xAI chases enterprise distribution
Grok is now a native option in Agent Bricks, giving Databricks customers another model choice for governed AI agents.
Elon Musk puts xAI's video bet on a 2026 movie clock
xAI posted Grok Imagine Video 1.5 this week, but Musk's full-movie prediction still runs ahead of what the public docs describe.
Aikido brings pentest-style reasoning into static code review
Code Audit analyzes source code for multi-step vulnerabilities that rule-based scanners and live pentests can miss before release.
Head to head: Bernini-R Edit Video vs Marey Realism V1.5
One model understood the assignment; the other mostly delivered good-looking detours. Across both tests, Bernini-R Edit Video was the clearer, more disciplined editor, winning on prompt fidelity, occlusion logic, and shot continuity.
Langflow attacks show AI agent frameworks have become production infrastructure before security caught up
VentureBeat tied active Langflow exploitation to fresh LangGraph and LangChain-core flaws that turn old AppSec bugs into AI infrastructure risk.
Head to head: Bagel vs ImagineArt 1.5 Pro Preview
Bagel brings atmosphere, but this matchup turned on prompt discipline and compositional authority. Across architecture, landscape storytelling, and graphic design, ImagineArt 1.5 Pro Preview was the model that actually delivered the brief.
Subquadratic's LLM efficiency claim moves from launch hype to benchmark fight
Justin Dangel and Alex Whedon say SubQ can make long context cheap. MIT's latest coverage shows the burden is now proof, not pitch.
Jack Dorsey's Block says Builderbot now accounts for 15% of its production code changes
The internal Slack agent merges about 1,500 PRs a week, but Block has not said whether Builderbot will become a product.
Head to head: grok-4.3 vs gpt-oss-120b
This matchup turns on a familiar distinction: both models are competent, but one is more reliable when the prompt punishes invention. grok-4.3 wins by being the steadier finisher across extraction and code tasks, while gpt-oss-120b’s best showing comes in polished business writing.