Tripo's VAST previews Project Eden, a bid for persistent AI worlds
Instead of treating world models as video generators, Eden separates durable state from rendering so objects, actions and multiplayer views can persist over time, VAST says.
By Ryan Merket
Why it matters
Even with few public details, the preview page shows Tripo continuing to wrap AI 3D generation into a commercial toolchain for studios, game developers and creators, where workflow fit may matter as much as model quality.

Tripo AI has put a research preview for Project Eden online and pointed users to it in a post on X, framing the work as a step beyond today's action-conditioned video generators and static 3D scene tools.
https://x.com/tripoai/status/2061307584817385960
VAST AI Research's argument is that predicting the next sequence of pixels is not the same as simulating a world. A video model can estimate how an image should change. A world model, in VAST's framing, needs to track what the pixels represent: objects, spaces, events, actions, memory and physical consequences that carry forward over time.
The preview casts current research as split between two incomplete paths. Action-conditioned video systems capture time and motion, but their understanding is often compressed into a short window of recent frames. If an object leaves the camera view, there is no independent state preserving it, so the model has to infer it again when the camera returns. Static 3D scene generation has the opposite problem: it can provide navigable spatial structure, but often treats the scene as a fixed asset rather than a world that changes.
Project Eden is VAST's attempt to combine both pieces. The core design choice is to decouple the underlying world state from visual rendering. In Eden, the world is meant to exist before any single camera observes it. A wall should remain when a player looks away. A fire that has been extinguished should stay out. Two players should be able to act inside one synchronized environment from different viewpoints.
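To make the decoupling concrete, here is a minimal Python sketch of the idea, not VAST's implementation. The `World` and `Camera` classes, the one-dimensional positions and the `extinguish` action are all hypothetical stand-ins chosen for illustration. The point is only that actions mutate a shared state, while each camera reads from it.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy persistent state (hypothetical): the source of truth, independent of any camera."""
    # object name -> (x position, status)
    objects: dict = field(default_factory=dict)

    def apply(self, action: str, target: str) -> None:
        """Mutate the state; the consequence persists whether or not anyone is looking."""
        if action == "extinguish" and target in self.objects:
            x, _ = self.objects[target]
            self.objects[target] = (x, "out")


@dataclass
class Camera:
    """A viewpoint that sees only objects within `half_width` of its own position."""
    x: float
    half_width: float = 5.0

    def observe(self, world: World) -> dict:
        return {
            name: status
            for name, (ox, status) in world.objects.items()
            if abs(ox - self.x) <= self.half_width
        }


world = World({"wall": (0.0, "standing"), "fire": (3.0, "burning")})
player_a = Camera(x=0.0)
player_b = Camera(x=6.0)

world.apply("extinguish", "fire")     # player A puts out the fire
player_a.x = 100.0                    # then looks away
away_view = player_a.observe(world)   # nothing is in view, but the state is untouched
player_a.x = 0.0                      # then looks back
back_view = player_a.observe(world)   # the wall is still there, the fire is still out
other_view = player_b.observe(world)  # a second viewpoint reads the same state
```

In this toy version, looking away changes only what the camera returns, never what the world contains, and both players see the fire as out because they read from one state rather than from their own frame histories.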
VAST describes Eden as a three-layer system. The first layer is an evolving structured state: a compact implicit or structured representation carrying content, coarse geometry, object semantics and the consequences of user actions. The second is a state-to-observation interface that converts that state into camera-conditioned constraints such as local semantics, geometry cues and event changes. The third is a generative neural renderer that turns those constraints into visual output, including texture, lighting, materials, motion, smoke, fire and water.
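The three layers can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not Eden's code: the names `Entity`, `WorldState`, `state_to_observation` and `render` are invented here, positions are two-dimensional, and the "renderer" is a text stub standing in for a generative neural network.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Entity:
    label: str                     # object semantics
    position: Tuple[float, float]  # coarse geometry, 2D for brevity
    status: str                    # result of past actions


class WorldState:
    """Layer 1 (hypothetical): evolving structured state."""

    def __init__(self, entities: Dict[str, Entity]) -> None:
        self.entities = entities

    def step(self, action: str, target: str, amount: float = 0.0) -> None:
        entity = self.entities.get(target)
        if entity is not None and action == "advance":
            x, y = entity.position
            entity.position = (x + amount, y)


def state_to_observation(
    state: WorldState, camera: Tuple[float, float], radius: float = 10.0
) -> List[dict]:
    """Layer 2 (hypothetical): turn state into camera-conditioned constraints."""
    constraints = []
    for name, entity in state.entities.items():
        dx = entity.position[0] - camera[0]
        dy = entity.position[1] - camera[1]
        if dx * dx + dy * dy <= radius * radius:
            constraints.append(
                {"id": name, "label": entity.label,
                 "offset": (dx, dy), "status": entity.status}
            )
    return constraints


def render(constraints: List[dict]) -> str:
    """Layer 3 stand-in: a real system would run a generative renderer here."""
    if not constraints:
        return "empty view"
    return ", ".join(
        f"{c['id']} [{c['label']}, {c['status']}] at {c['offset']}"
        for c in constraints
    )


state = WorldState({
    "car_a": Entity("car", (0.0, 0.0), "racing"),
    "car_b": Entity("car", (3.0, 0.0), "racing"),
})
before = render(state_to_observation(state, camera=(0.0, 0.0)))
state.step("advance", "car_b", amount=20.0)
trackside = render(state_to_observation(state, camera=(0.0, 0.0)))
chase = render(state_to_observation(state, camera=(20.0, 0.0)))
```

The renderer in this sketch never touches `WorldState` directly. It only sees the constraints produced for one camera, which is the separation the three-layer description implies.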
That architecture is meant to shift memory out of the image stream. The renderer does not have to infer the entire scene from recent pixels alone, because it receives constraints derived from a persistent state. The image becomes a view into the world, rather than the place where the whole world is stored.
The data strategy follows the same split. VAST says Eden is trained around alignment between an underlying simulation state and rendered observations. Internet video supplies visual scale and diversity, while Tripo's 3D foundation model capabilities are used to recover structural signals such as depth, camera pose and geometric trajectories from unlabeled video. Game-engine data supplies the more explicit side: internal state, 3D annotations, action instructions, camera poses, object identities and environmental changes.
The promised capabilities are the ones that pure video generation and static 3D generation struggle to offer together: long-horizon object persistence, viewpoint consistency, editable worlds, shared multiplayer spaces and environments suitable for agent training. VAST points to demos including a fire-extinguishing interaction, a racing scene with two cars on the same track and a shooting-range scene where different players take different actions inside the same environment.
The preview is still positioned as research, not a finished general-purpose world model. VAST says it is working on richer physical dynamics, more complex scene evolution, broader free-viewpoint exploration, larger environments, finer-grained object interaction, real-time efficiency and stronger state transition modeling. The company also says evaluation has to move beyond visual quality to test persistence, object identity, causal consistency, rule-following, cross-view consistency, action consequences and multi-agent synchronization.
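One way to picture a persistence test is a re-entry check: when an object comes back into view after being absent, does its observed status match the ground-truth state? The function below is a hypothetical toy metric written for this article, not a benchmark VAST has described, and the frame and state formats are assumptions.

```python
from typing import Dict, List, Optional

Frame = Dict[str, str]  # object id -> observed status; objects out of view are absent


def reentry_consistency(frames: List[Frame], truth: List[Dict[str, str]]) -> Optional[float]:
    """Fraction of re-appearances whose observed status matches ground truth.

    Only observations that follow an absence are scored, since that is where
    a model without persistent state has to guess. Returns None if no object
    ever re-enters the view.
    """
    checked = 0
    correct = 0
    seen_before = set()
    previously_visible = set()
    for frame, state in zip(frames, truth):
        for obj, status in frame.items():
            if obj in seen_before and obj not in previously_visible:
                checked += 1
                correct += int(status == state.get(obj))
        seen_before.update(frame)
        previously_visible = set(frame)
    return correct / checked if checked else None


# Ground truth: the fire is extinguished while the camera is looking elsewhere.
truth = [{"fire": "burning"}, {"fire": "out"}, {"fire": "out"}]

persistent_model = [{"fire": "burning"}, {}, {"fire": "out"}]
forgetful_model = [{"fire": "burning"}, {}, {"fire": "burning"}]

good_score = reentry_consistency(persistent_model, truth)
bad_score = reentry_consistency(forgetful_model, truth)
```

A score like this ignores how good the frames look, which is the distinction the company is drawing: a visually convincing fire that reignites when the camera returns would still fail.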
For creators and game teams, Eden's significance will depend on whether VAST can turn that state-first architecture into reliable production tools. For AI researchers, the sharper claim is that world models should not be treated as a subproblem of video generation. VAST is betting that the important shift is from predicting the next pixel to simulating the next state.