World Models & the Bet on Causal Reality
The term 'world model' is applied broadly across the AI literature, but in this article we define it as follows: a system capable not merely of recognizing its environment, but of predicting what will happen next and understanding why. A true world model enables an agent to mentally simulate the consequences of actions before executing them, replacing costly trial-and-error in the real world with forward planning over an internalized model of reality. For physical AI applications such as robots, autonomous vehicles, and embodied agents, this distinction is concretely consequential: it is the difference between requiring millions of real-world interaction examples and needing only a fraction of that data to achieve the same objective.
The state of the art (as of early 2026) is evolving rapidly and, as we argue later, is marked by tension between competing architectural designs. Google DeepMind's Genie 3 generates navigable 3D environments from text prompts, running in real time at 24 frames per second and remaining consistent for several minutes at a time. NVIDIA's Cosmos was trained on 20 million hours of video and has been downloaded over two million times as a synthetic data engine for physical AI. World Labs, founded by Stanford's Fei-Fei Li, released Marble, a 3D generative system that constructs full scenes from images or text, optimized for creative and spatial generation.
This past weekend AGI House hosted an agent skills and world model build day at our Hillsborough house, where we interviewed experts from the space including Google DeepMind, Inworld, and Roblox. This memorandum consolidates information gathered from those talks, along with an interview with Sharon Lee, Co-Founder of Moonlake AI, a San Francisco-based AI research lab that raised a $28M seed round and is making a different bet with its agentic multimodal world models. That architecture is at odds with much of what is broadly seen in the space, and it forms the basis for our thesis and investment recommendations on world models.
The Moonlake Bet: Causality Over Pixels
The central technical tension in world modeling today is a question of what level of abstraction a model should operate at. Most leading systems, including Genie 3 and Sora, are fundamentally next-frame prediction engines. They use video tokenizers to compress each frame into a compact representation, then train an autoregressive model to predict the next token sequence. The outputs are visually impressive and immediately interactive. But the underlying causal structure of the world is approximated, not encoded: there is no variable storing door.state = 'open'.
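The contrast can be sketched in a few lines of Python (all names here are illustrative, not taken from any of these systems): an autoregressive model sees only a rolling window of frame tokens, while an explicit state store keeps facts like `door.state = 'open'` until an event changes them.

```python
CONTEXT_LEN = 8  # frames of context the autoregressive model attends to

def predict_next_frame(context_tokens, model):
    # Paradigm 1: next-frame prediction. The world exists only as the most
    # recent frame tokens; nothing enforces consistency beyond the window.
    return model(context_tokens[-CONTEXT_LEN:])

def apply_event(state, event):
    # Paradigm 2: explicit state. A fact persists until an event overwrites
    # it, no matter how many steps elapse in between.
    new_state = dict(state)
    new_state.update(event)
    return new_state

world = {"door.state": "open", "cube.friction": 0.6}
world = apply_event(world, {"door.state": "closed"})
# cube.friction survives unchanged; the door update is exact, not plausible.
```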
Moonlake's thesis is that this architecture has a hard ceiling. 'No matter how much you train on persistent data, it's always going to index on the last frame most,' Sharon Lee said in our interview. 'There's no mathematical source of truth. If smoke is going upwards, is that really true, or does it just look plausible?' The transformer architecture, she argues, is fundamentally a recency-biased pattern matcher—over thousands of frames, coherence degrades, and physics becomes probabilistic rather than enforced. Their published work makes the same point with an example: "Consider a bowling pin. It is simultaneously a textured object in space, a rigid body with mass and inertia, an object that can be knocked down, a symbolic contributor to a score, and a source of sound upon impact."
Moonlake's approach is to represent the world as a structured state space—more like a physics engine with a database than a video generator. Physical properties like friction, mass, and geometry are stored explicitly. This persistence is the key insight: a cube in the scene will never disappear unless you remove it. A friction coefficient, once set, remains until an event changes it. On top of this structured simulation layer, Moonlake trains a diffusion model called Reverie to handle the 'reskinning' from sparse engine outputs to photorealistic rendering. The promise is that once Reverie can match, or come close to matching, the visual realism of models like Genie, it will carry a distinct architectural advantage for training agents.
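A minimal sketch of what 'a physics engine with a database' might look like, assuming a simple object store; the field names are hypothetical and not Moonlake's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Body:
    name: str
    mass: float       # kilograms
    friction: float   # dimensionless coefficient
    position: tuple   # (x, y, z) in metres

@dataclass
class World:
    bodies: dict = field(default_factory=dict)

    def add(self, body):
        self.bodies[body.name] = body

    def remove(self, name):
        # The only way an object disappears: an explicit removal event.
        del self.bodies[name]

world = World()
world.add(Body("cube", mass=1.0, friction=0.6, position=(0.0, 0.0, 0.0)))
# A downstream diffusion model (Reverie, in Moonlake's case) would consume
# sparse engine output like world.bodies and render photorealistic frames.
```

The point of the sketch is that persistence is structural: the cube cannot drift or vanish between frames, because there are no frames, only state and events.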
This hybrid architecture also allows what Moonlake calls multimodal causal propagation—a single physical event, like a bowling pin falling, propagates simultaneously across visual appearance, physics state, audio, and symbolic score representation. This is what makes it purpose-built for agent training: agents operating in this environment are forced to develop causally grounded representations rather than visual correlations.
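Multimodal causal propagation can be illustrated with a toy publish/subscribe design (our own sketch, not Moonlake's implementation): a single physical event fans out to a handler for each modality, so the modalities can never disagree about what happened.

```python
class EventBus:
    # One physical event propagates to every modality that represents it.
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        return [handler(event) for handler in self.subscribers]

bus = EventBus()
bus.subscribe(lambda e: f"render: {e['object']} tipping over")         # visual
bus.subscribe(lambda e: f"physics: {e['object']} rigid-body update")   # dynamics
bus.subscribe(lambda e: f"audio: impact sound for {e['object']}")      # sound
bus.subscribe(lambda e: f"score: pin count updated for {e['object']}") # symbolic

effects = bus.publish({"object": "bowling_pin", "event": "knocked_down"})
# All four representations update from the same causal event.
```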
RL Is Powerful Enough & The Leap of Faith
The most important strategic claim we've gathered from our interviews is that reinforcement learning, given sufficient high-quality simulated environments, will figure out complex physical skills on its own. In other words, the implementation layer is solved, or at least will be, with no inherent first-principles barriers; hence the current heavy investment in simulation and synthetic data.
A kitchen task (also echoed by the work at Sunday Robotics) is a good example of this idea: fine-grained manipulation in a complex environment with many interacting objects. As one of our interviewees put it: 'It's not something you can just reason through language. You want maybe thousands, millions of kitchens for it to learn a task, depending on how complex it is.'
The sim-to-real gap is a persistent obstacle, but the alternative, collecting real-world robot trajectories, is prohibitively expensive: Moonlake put the cost at thousands of dollars per minute of data. In addition, motor interaction datasets are specific to particular robot platforms, require physical execution, and don't transfer easily across embodiments. Text and image data are abundant, while action-labeled physical interaction data is scarce.
The synthetic data answer is to generate unlimited, varied, physically consistent training environments and let RL explore them. The vision is targeted curriculum generation: if a robot fails because of an obstacle and friction level, generate a thousand more environments with exactly that combination of challenges. Once you've seen a million variations, you generalize to any new configuration.
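A hedged sketch of targeted curriculum generation, assuming environments are described by a flat parameter dictionary (the parameter names are hypothetical):

```python
import random

def generate_curriculum(failure_params, n=1000, jitter=0.1, seed=0):
    # Sample n environment configurations in a neighbourhood of the failed
    # one, so RL can practice exactly that failure mode with variation.
    rng = random.Random(seed)
    return [
        {key: value * (1 + rng.uniform(-jitter, jitter))
         for key, value in failure_params.items()}
        for _ in range(n)
    ]

# Suppose the robot failed on a low-friction floor with a 0.3 m obstacle:
batch = generate_curriculum({"floor_friction": 0.2, "obstacle_height": 0.3})
```

Each generated environment keeps the failing combination of challenges while varying it slightly, which is the 'thousand more environments with exactly that combination' idea in miniature.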
When asked about the sim-to-real gap, Sharon's view was pragmatic. She emphasized that the gap is closing quickly, with substantial capital flowing into physics engine fidelity, and that Moonlake is positioned to leverage that improvement without needing to solve it themselves. This is a reasonable bet given the trajectory of tools like NVIDIA Isaac Sim and Genesis.
This Isn't a New Idea: JEPA
The idea that the right representation for world modeling is not pixels but causal structure has a deep intellectual lineage—and a very recent, powerful instantiation. Meta's Joint Embedding Predictive Architecture, or JEPA, represents the most technically rigorous formulation of this principle currently in production.
The key insight of JEPA is about what the model predicts. Standard generative models, whether diffusion-based or autoregressive transformers, are trained to reconstruct raw data. JEPA predicts in abstract representation space: given a partially observed video, what is the semantic meaning of what's hidden? This skips surface noise entirely and forces the model to learn underlying dynamics.
The physical world is full of chaotic, unpredictable details, be it rustling leaves, rippling water, or the exact trajectory of a bouncing ball. Forcing a model to predict these details wastes its capacity on noise rather than on the underlying principles of motion and interaction. JEPA's bet is that models trained to predict in latent space will develop richer physical intuitions, because they're never distracted by irrelevant visual complexity.
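The difference between the two objectives can be made concrete with a toy sketch (ours, not Meta's implementation): the generative loss compares raw pixels, while the JEPA-style loss compares predicted and target embeddings, so detail the encoder discards never enters the loss.

```python
def l2(a, b):
    # Squared L2 distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pixel_reconstruction_loss(predicted_pixels, target_pixels):
    # Generative objective: the model pays for every chaotic surface detail
    # (rustling leaves, rippling water), whether or not it matters.
    return l2(predicted_pixels, target_pixels)

def jepa_loss(context_embedding, target_embedding, predictor):
    # JEPA-style objective: predict the encoder's abstract summary of the
    # masked region; noise the encoder discards never enters the loss.
    return l2(predictor(context_embedding), target_embedding)
```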
The results? Meta's V-JEPA 2 achieved state-of-the-art on physical world modeling benchmarks—outperforming GPT-4o and Gemini 2.0 at only 1.6 billion parameters on tasks requiring causal physical reasoning. More importantly for the embodied AI question: trained on fewer than 62 hours of unlabeled robot video, V-JEPA 2 achieved 65–80% success rates on zero-shot pick-and-place tasks with new objects in new lab environments, without any task-specific training. Head-to-head against NVIDIA's Cosmos, a generative diffusion world model, V-JEPA 2 achieved 60–80% manipulation success to Cosmos's 0–20%, while running 16 seconds per action versus Cosmos's four minutes.
Yann LeCun, the primary intellectual architect of JEPA, left Meta in late 2025 to found AMI Labs, with the explicit goal of building AI systems that understand physics, maintain persistent memory, and plan complex actions. His philosophical critique of the dominant generative approach maps almost exactly onto Moonlake's: both argue that plausible-looking outputs are insufficient, and that what actually matters is causal grounding.
Why Gaming Is the Trojan Horse
Gaming may represent the single most structurally advantageous training ground for physical AI, and the reason lies not in the medium itself but in what game environments record by design. Unlike passive observational data, every interaction within a game is timestamped and causally annotated: a given input, issued at a specific world state, produces a documented outcome that propagates through the environment in traceable, reproducible ways. The full trajectory—action, consequence, and the chain of downstream effects—is preserved as a matter of how the system works, not as a byproduct of deliberate data collection effort.
This stands in direct contrast to the internet video that powers most of today's generative world models. Video from platforms like YouTube is abundant but epistemically thin: one observes a world unfolding without access to the decisions that caused it. A cooking video conveys appearances and sequences, but reveals nothing about the alternatives the cook considered, the micro-adjustments made in response to feedback, or what would have happened had a different choice been made. A game environment, by contrast, encodes all of this. Roblox alone processes ~6 billion chat messages per day and has accumulated billions of hours of action-labeled 3D interaction data—a dataset whose scale and structural richness no research laboratory can replicate from scratch, and whose value to physical AI training is only beginning to be appreciated.
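The structural difference between the two data sources can be shown as a pair of toy records (all field names hypothetical): the passive sample carries observations only, while the game step carries the action, the world state it was issued in, and its traced consequences.

```python
# Passive internet video: observations only; the causes are invisible.
passive_sample = {
    "frames": ["frame_0", "frame_1", "frame_2"],
}

# A game trajectory step: action, the state it was issued in, the documented
# outcome, and the traced downstream effects are all recorded by design.
game_step = {
    "timestamp": 1712.04,
    "world_state": {"door.state": "closed", "player.pos": (3, 0, 7)},
    "action": {"type": "open_door", "target": "door"},
    "outcome": {"door.state": "open"},
    "downstream_effects": ["npc_reacts", "light_enters_room"],
}
```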
Roblox has understood this positioning clearly. In February 2026, the company revealed a 'real-time, action-conditioned world model'—a generative AI tool allowing creators to walk through a space, issue prompts, and paint playable 3D worlds that multiple players can enter simultaneously. CEO David Baszucki has publicly emphasized AI world models as central to Roblox's platform strategy, framing the platform as an agent operating system as much as a game engine. Their Cube 3D foundation model, open-sourced in early 2025, underscores the ambition.
The deeper bet here is about agents at scale inside games alongside human players. This enables three interlocking developments that compound on each other. First: evolving NPCs—characters that aren't running scripted dialogue trees but maintaining persistent memory, developing relationships, adapting strategies, and pursuing long-term goals across sessions. Companies like Inworld AI and Artificial Agency are already deploying this commercially. Second: procedurally generated quests and content—agent skills that generate narratives, environments, and challenges in response to individual player behavior, compressing development cycles dramatically while expanding replayability. All three major engines (Unity, Unreal, and Roblox Studio) are shipping or actively developing native agent skill interpreters. Third: in-game agent economies—agents are already trading items across servers, forming guilds, and creating content that other agents and players consume. This fundamentally shifts the gaming revenue model from human player purchases toward agent operator infrastructure fees and agent-to-agent marketplace transaction cuts.
The long-run implication is significant. If agents can autonomously create, populate, and maintain game experiences, a platform like Roblox becomes something qualitatively different: no longer a game platform, but an agent operating system with a deeply entrenched moat in action-labeled 3D data. The combination of world models for environment generation, agent skills for NPC intelligence, and multiplayer interaction as the data flywheel is a compounding loop that is difficult to replicate from elsewhere, and unlike other flywheel approaches, already has a working business model that generates revenue.
Investment Hypotheses: Betting on the Infrastructure
The architecture debate is genuinely unsettled, and may resolve differently for different applications. But certain investment theses hold regardless of which approach ultimately dominates.
Hypothesis 1: The Physics Engine Layer Gets Acquired or Commoditized
Sim-to-real fidelity is the shared bottleneck across all approaches. Every player in this space—Moonlake for robotics training, NVIDIA for physical AI, the gaming studios for NPC realism—depends on physics engines that accurately simulate friction, contact, deformation, and fluid dynamics. The recent generation of engines (Isaac Sim, Genesis, MuJoCo's successors) represents a step-change in fidelity, and the capital flowing into this layer is substantial. The bet: high-fidelity physics simulation is infrastructure, and infrastructure tends to concentrate. Either a dominant player (NVIDIA being the obvious candidate) consolidates this layer, or an open-source standard emerges that everyone builds on top of.
Hypothesis 2: Character AI Infrastructure Becomes the Picks-and-Shovels Play
Regardless of whether world models win in robotics, gaming, or autonomous driving, the demand for intelligent, persistent, memory-bearing agents inside those environments is universal. The NPC AI market is already a billion-dollar market (Yahoo Finance) and is growing at an accelerating rate. Inworld AI has established the early lead in giving game characters voice, personality, and memory. Artificial Agency is building behavior engines that turn game developers into stage managers rather than scriptwriters. These infrastructure layers serve the top of the stack independent of what generates the underlying world.
Hypothesis 3: Action-Labeled Data Becomes the Strategic Asset
OpenAI's reported $500M bid for Medal, a platform with two billion gaming clips, failed—but the signal was unambiguous. The most valuable training data for physical AI is not passive video but action-conditioned trajectories. Whoever controls large repositories of action-labeled 3D interaction data has a durable moat. Roblox's position here is underappreciated by non-gaming investors. Game studios with large active multiplayer communities are sitting on training data assets whose strategic value has not been priced in. Thus, investing in companies with access to other large multiplayer games (and the ability to consolidate them) will be important in challenging Roblox’s moat.
Hypothesis 4: The Simulation-to-Real Stack Has a Missing Middle Layer
The field has strong capabilities at both ends of the pipeline: high-fidelity physics simulation and real-world robotics hardware. The weaker layer is the automated curriculum design system that bridges them: the infrastructure that diagnoses why a policy failed, designs harder variants of that scenario, generates millions of targeted training environments, and iterates automatically. Moonlake's thesis is precisely that this layer is what's missing. It is also what would make the synthetic data bet work at scale. Startups that nail automated environment generation and curriculum design will be infrastructure providers to the entire physical AI space, independent of the underlying model architecture. This is also tied to the idea of making benchmarking standards along a trajectory instead of just an episodic result, and making this infrastructure standardized and streamlined across the emerging gaming/AI ecosystem.
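The loop this hypothesis describes can be sketched abstractly; every component below is a placeholder stub standing in for a real evaluation, diagnosis, environment-generation, or training system:

```python
def curriculum_loop(policy, evaluate, diagnose, generate_envs, train, rounds=3):
    # Diagnose why a policy failed, generate targeted harder variants of that
    # scenario, retrain, and iterate until the policy clears a threshold.
    history = []
    for _ in range(rounds):
        result = evaluate(policy)
        history.append(result["success_rate"])
        if result["success_rate"] >= 0.95:
            break
        failure_mode = diagnose(result)        # e.g. 'low-friction turns'
        envs = generate_envs(failure_mode)     # targeted scenario variants
        policy = train(policy, envs)           # RL on the hard cases
    return policy, history

# Toy demonstration with stub components (all hypothetical):
policy = {"skill": 0.5}
evaluate = lambda p: {"success_rate": p["skill"], "worst_env": "low_friction"}
diagnose = lambda r: r["worst_env"]
generate_envs = lambda mode: [mode] * 10
train = lambda p, envs: {"skill": min(1.0, p["skill"] + 0.25)}

final, history = curriculum_loop(policy, evaluate, diagnose, generate_envs, train)
```

The design choice worth noting is that the loop is benchmarked along a trajectory (the `history` of success rates) rather than a single episodic result, which is the standardization point the hypothesis raises.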