AI

Models, agents, infra, applied AI.

OpenAI's Dreaming paper puts ChatGPT memory back at the center of the agent race
The June 4 research post frames memory as an architecture problem, not just a settings toggle, as OpenAI pushes ChatGPT toward longer-running work.
Elon Musk's SpaceX turns Google into a $920 million-a-month compute customer
The deal, disclosed a week before SpaceX's expected Nasdaq debut, would run through June 2029 unless either side exits early.
YC Startup Trata Releases Hedge-Bench, a New Benchmark Built From Hedge Fund Analyst Reasoning
The YC W25 company says the benchmark uses 102 tasks drawn from hedge fund analyst reasoning traces.
Sakana AI Forms RSI Lab To Chase Self-Improving AI From Tokyo
The Tokyo AI company says the internal research group will focus on models that can help write, test, and improve AI systems.
Mira Murati reemerges with Thinking Machines Lab's interface bet
The former OpenAI CTO previewed models for continuous audio, text and video, but left release timing and customer traction unstated.
ScaleDown targets AI inference costs with task-specific small models
Patel says ScaleDown's small language models beat GPT-5.4 Mini on cost, speed and accuracy, but the benchmarks are self-reported.
Mercor CEO Brendan Foody puts a number-shaped hole in the AI agent story
A 20VC interview frames Mercor as spending more on AI agent tokens than on salaries, but the exact cost comparison remains unclear.
Screenshot claims DeepSeek V4 changed code over Tiananmen and Taiwan references
Jane Manchun Wong said the original prompt was simply "Improve ./core.rs" after a screenshot showed DeepSeek V4 changing Rust functions about Tiananmen Square and Taiwan.
Arena.ai launches Agent Arena to rank AI agents on live user tasks
The benchmark uses Arena.ai's own session data, including 160,000 tasks and 2.06 million tool calls over one week.
Google Gemma introduces Magenta RealTime 2 for live AI music on MacBooks
The open model is pitched as a playable instrument that takes MIDI, text and audio inputs, but Google has not detailed its specs in the thread.
Poke brings its AI agent to Apple Messages for Business
TechCrunch and Poke describe the approval as Apple's first for an AI agent on the business messaging platform.
DeepSeek V4 Flash routs Xiaomi MiMo-V2.5
DeepSeek V4 Flash wins 34.0 to 17.0 by being usable, complete, and more faithful to the prompts. MiMo-V2.5 repeatedly looked polished while dropping facts, inventing details, or failing outright.
microagi's Shift offers free apartment cleaning if AI can watch
Founded by two former Formula One engineers and an AI researcher, microagi is testing whether chores can become robotics training data.
NVIDIA's new Nemotron model takes the top US open-weight slot, Artificial Analysis says
The benchmark group says the 550B-parameter model scores 47.7 on its Intelligence Index, with weights posted on Hugging Face.
Appendix Lets Patients Get Prescriptions from Claude Without Ever Speaking to a Doctor
Appendix lets users have an AI agent draft a medical encounter, then routes it to a human physician for review and a prescription if warranted.
Anton Osika's Lovable deepens Google Cloud bet as AI coding rivals pick sides
TechCrunch reports the multi-year deal expands Lovable's Google Cloud footprint fivefold and gives it broader access to Claude and Gemini.
xAI puts Grok Imagine 1.5 Preview into its API
The preview is pitched as a video upgrade, with xAI claiming better motion, scene coherence, native audio and longer clips.
Claude Sonnet 4.6 beats DeepSeek V4 Flash on rigor
Claude Sonnet 4.6 wins 35.0 to 26.5 by being more reliable where correctness actually bites. DeepSeek V4 Flash had the cleaner customer email, but it fell down on harder structured and coding work.
Reve details image API for create, edit and remix after 2.0 launch
The image-generation startup's docs list endpoints for text-to-image, image editing and reference-image remixing, with direct image responses, postprocessing and credit headers.
We Put Ideogram 4 Head-to-Head Against OpenAI, Google, and Microsoft in Four Image Stress Tests
The comparison found different strengths across storytelling, product design, brand systems, and photorealistic physics.
Ideogram releases its first open-weight image model
The 9.3B-parameter Ideogram 4 model was trained from scratch and adds a structured JSON prompting interface for text, layout, color and 2K image control.
Sanders wants the public to own half of OpenAI, Anthropic and xAI
The planned bill would tax large AI companies in stock, turning a redistribution idea into a fight over startup control.
Nvidia says Cosmos 3 tops seven physical AI leaderboards
The claim spans world generation, robot action policy, and industrial vision understanding, but the post did not include scores or test details.
ViBench aims to rank AI models by app-building, not just coding tests
The public site and ACM paper frame ViBench as an end-to-end test of whether coding agents can deliver usable apps, not just pass SWE-style tasks.