AGI Dinner Series | May 20, 2026

At this week’s AGI Dinner Series at AGI House, researchers, founders, and engineers from across the frontier AI ecosystem gathered to discuss one of the most important emerging paradigms in AI: world models.

The conversation spanned reinforcement learning, robotics, simulation, gaming, multimodal generation, social intelligence, and the future of embodied AGI. Participants included researchers and leaders connected to organizations like Google DeepMind, World Labs, Waymo, and NVIDIA, alongside founders building next-generation AI systems.

The central question of the evening:

What happens when AI systems can simulate, predict, and reason about the world?

Jack Parker-Holder (Director @ Google DeepMind; Co-lead Genie 3)
Chris Manning (Professor @ Stanford, Distinguished MTS, Moonlake AI)
Fan-Yun Sun (CEO @ Moonlake AI)
Sharon Lee (Cofounder @ Moonlake AI)
Ben Mildenhall (Co-Founder @ World Labs)
Jim Fan (Director of AI @ NVIDIA; Co-Lead GEAR Lab)
Ankur Handa (Co-Lead, Dex Team @ NVIDIA)
Pantelis Kalogiros (GP @ Exponential, Co-Founder, Fyusion)
Hao Zhu (Researcher @ Stanford)
Boyang Deng (Researcher @ Stanford)
Vaish Srivathsan (Product @ World Labs)
Diego Rivas (Product @ Google DeepMind)
Ashka Stephen (Product @ AGI House)
Jesus Lares (CTO @ AGI House)
Rocky Yu (Founder @ AGI House, Host)

From Tokens to Worlds

Despite the hype surrounding “world models,” the room converged surprisingly quickly on a shared definition:

A world model fundamentally learns:

state + action → next state

In reinforcement learning terms, this is simply modeling transitions in a Markov Decision Process. But the implications extend far beyond gaming or robotics.
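To make the "state + action → next state" framing concrete, here is a minimal sketch of the simplest possible world model: a table that counts observed transitions and predicts the most frequent outcome. The class name `TabularWorldModel`, its methods, and the door example are illustrative assumptions, not anything presented at the dinner; real systems learn this mapping with neural networks over far richer states.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Optional

State = Hashable
Action = Hashable


class TabularWorldModel:
    """Hypothetical toy model: counts (state, action) -> next_state transitions."""

    def __init__(self) -> None:
        self._counts = defaultdict(Counter)

    def update(self, state: State, action: Action, next_state: State) -> None:
        # Record one observed transition.
        self._counts[(state, action)][next_state] += 1

    def transition_probs(self, state: State, action: Action) -> Dict[State, float]:
        # Empirical estimate of P(next_state | state, action).
        counts = self._counts.get((state, action))
        if not counts:
            return {}
        total = sum(counts.values())
        return {s: c / total for s, c in counts.items()}

    def predict(self, state: State, action: Action) -> Optional[State]:
        # Most likely next state, or None if this pair was never observed.
        probs = self.transition_probs(state, action)
        if not probs:
            return None
        return max(probs, key=probs.get)


# Made-up experience: pushing a closed door usually opens it.
model = TabularWorldModel()
for _ in range(3):
    model.update("door_closed", "push", "door_open")
model.update("door_closed", "push", "door_closed")

probs = model.transition_probs("door_closed", "push")
prediction = model.predict("door_closed", "push")
```

Nothing in the sketch is specific to doors or 3D scenes: the states could just as well be steps in an enterprise workflow, which is the generality participants were pointing at.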

Participants argued that world models are not limited to visual 3D environments. Any system involving evolving state transitions could potentially benefit:

Enterprise workflows, financial markets, social systems, manufacturing processes, human coordination, and urban environments.

One researcher from a leading AI lab summarized it succinctly:

“Anyone can use world models — it’s just reinforcement learning applied to different domains.”

But others emphasized that realism alone is not intelligence. One AI researcher argued that the real breakthrough will come from learning abstracted, semantically meaningful states rather than simply generating visually impressive simulations.

Beautiful graphics are not enough. The key challenge is learning the right abstractions.

State of World Models

No one has trained world models at LLM scale yet; imagine what that could unlock.
World models have not reached their "ChatGPT moment"; the field is still early in finding product-market fit.

Genie 3 and the Rise of Interactive Simulation

Much of the discussion centered around recent advances from Google DeepMind, especially the release of Omni Flash and the evolution of the Genie world-modeling systems.

The new systems support temporally extended actions, native multimodal interaction, video editing and generation, and interactive environmental simulation.

Participants also noted Google's announcement of "20 years of streetview data integrated."

Yet researchers acknowledged the limitations. Simulations are still not physically accurate enough for broad deployment. Real-time performance remains difficult. Persistence and consistency across long horizons remain unresolved.

Still, the applications are already becoming practical.

Waymo is reportedly using Genie-style systems to generate rare long-tail driving scenarios for autonomous vehicle training, including edge cases like fires on the Golden Gate Bridge. Researchers noted that the models can now generalize to novel situations rather than simply replaying memorized trajectories.

Another important milestone discussed was the successful incorporation of LiDAR directly into generative world models through fine-tuning.

Gaming May Not Be the Endgame

While gaming is often viewed as the natural home for world models, several participants challenged that assumption.

One provocative take from the evening:

“Gaming data is much less important than people think.”

The argument was simple: if the goal is to create entirely new interactive experiences, reproducing existing games may not matter much.

“If you want to play GTA, just buy the game.”

Instead, the real opportunity may lie in dramatically accelerating creative iteration.

Founders described workflows where creators could prototype dozens of interactive experiences in a single morning, then fully develop the best concepts after lunch. World models become creative engines rather than game emulators.

At the same time, participants noted substantial resistance from parts of the gaming community toward generative AI systems, citing backlash against recent AI-generated gaming projects.

Others compared the debate to early reactions toward LLMs:

“Saying world models will replace games is like saying LLMs will replace books. It depends on the type.”

Robotics Needs World Models

If there was one area where consensus was strongest, it was robotics.

One robotics researcher made perhaps the boldest claim of the night:

“World models are all we need for robotics.”

The reasoning is straightforward: general-purpose robots cannot rely purely on imitation learning or narrow task-specific policies. They need predictive understanding of physical environments.

But interestingly, many argued that robotics world models do not require perfect physical simulation.

Instead, they require what several participants called intuitive physics.

A robot does not need atomically accurate fluid dynamics to pour wine successfully. It needs enough predictive structure to reason about outcomes in real time.

This led to a recurring theme throughout the evening:

Speed matters more than perfect realism.

Real-time inference and adaptability may be far more important than photorealistic fidelity.
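One way to picture the trade-off is a planner that uses a deliberately coarse model and simply stops when its time budget runs out. The sketch below uses random-shooting model-predictive control, chosen here for illustration rather than described by any participant. The 1-D state, `coarse_model`, `cost`, and all parameter values are assumptions.

```python
import random
import time
from typing import Callable, Optional


def plan_first_action(
    state: float,
    model: Callable[[float, float], float],
    cost: Callable[[float], float],
    horizon: int = 5,
    budget_s: float = 0.01,
    min_candidates: int = 16,
    rng: Optional[random.Random] = None,
) -> float:
    """Sample random action sequences until the time budget is spent,
    then return the first action of the cheapest imagined rollout."""
    rng = rng or random.Random(0)
    deadline = time.perf_counter() + budget_s
    best_action, best_cost = 0.0, float("inf")
    tried = 0
    while tried < min_candidates or time.perf_counter() < deadline:
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = model(s, a)   # imagined next state from the coarse model
            total += cost(s)  # accumulate cost along the imagined rollout
        if total < best_cost:
            best_action, best_cost = actions[0], total
        tried += 1
    return best_action


TARGET = 1.0  # hypothetical goal, e.g. a desired tilt angle


def coarse_model(state: float, action: float) -> float:
    # Crude "intuitive physics": the state moves in proportion to the action.
    return state + 0.5 * action


def cost(state: float) -> float:
    return (state - TARGET) ** 2


action = plan_first_action(0.0, coarse_model, cost)
print(action)
```

A cheaper model lets the planner evaluate more candidate futures inside the same budget, which is the sense in which speed can matter more than fidelity.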

The Simulation-First Future

Another major thread focused on scaling robotic learning through simulation.

Participants argued that teleoperation-based data collection faces fundamental bottlenecks: hardware maintenance costs, human operator fatigue, slow iteration cycles, and limited environmental diversity.

One founder proposed that robotics may ultimately require training on:

“one trillion simulation trajectories”

to sufficiently close the sim-to-real gap.

Others discussed alternative data pipelines, including egocentric human video, wearable sensors, passive observation systems, and approaches that remove robots entirely from the data collection loop.

The implication was clear: the future of robotics training may look much more like internet-scale data collection than traditional robotics engineering.

Physics Is Easier Than Society

One of the most fascinating debates centered on what aspects of reality are actually hardest to model.

Counterintuitively, several researchers argued that physics itself may not be the main bottleneck.

As one participant put it:

“Physics is boring and consistent.”

Contact dynamics remain challenging, but many believe physical simulation problems will largely be solved over the next several years.

Social intelligence, however, appears vastly harder.

Humans coordinate through trust, norms, implicit communication, shared context, and cultural expectations. These dynamics are poorly represented in today’s systems.

Participants noted that current AI agents still struggle with real-time collaboration, group coordination, long-term social consistency, and understanding human expectations.

One AI researcher emphasized that human intelligence is fundamentally distributed across groups rather than isolated individuals.

This may explain why multi-agent reinforcement learning is becoming increasingly important: cooperation itself may be a core ingredient of intelligence.

Why Embodied AGI May Arrive Later Than Digital Superintelligence

The AGI timeline discussion revealed an increasingly common viewpoint among frontier researchers:

Digital superintelligence may arrive before robust embodied intelligence.

Several attendees referenced predictions from leading AI executives that conversational AGI could emerge within the next few years.

But robotics timelines remain constrained by physics, hardware, and real-world deployment realities.

As one robotics researcher pointed out, human bodies self-heal, whereas robots break constantly, physical iteration cycles are slow, and real-world deployment is expensive.

One participant summarized the challenge bluntly:

“A robot may need six hours of repair for four hours of work.”

The “last mile” of embodied intelligence may therefore be less about reasoning and more about integration into human society.

Robots must learn social timing, norms, safety expectations, human comfort, and contextual behavior.

In other words: robots cannot simply be intelligent — they must avoid being socially awkward.

Beyond Gaming: The Emerging Applications

While entertainment was viewed as the most likely early large-scale deployment, participants explored many emerging application areas:

Education, enterprise process simulation, manufacturing optimization, financial modeling, AI avatars and dialogue systems, architecture visualization, and language learning.

One recurring theme was controllability.

Researchers argued that practical systems may require separating world representation, rendering, and authoring controls rather than relying purely on end-to-end neural generation.

This separation could allow humans to maintain creative control while leveraging generative flexibility.
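A rough sketch of what that separation could look like in code, assuming a toy 2-D world: explicit state, an authoring layer that is the only thing allowed to edit it, and a renderer that only reads it. All names here (`WorldState`, `AuthoringControls`, `render_text`) are hypothetical, and the text renderer is a stand-in for where a neural generator might sit.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Entity:
    kind: str
    position: Tuple[float, float]


@dataclass
class WorldState:
    """World representation: plain data, independent of how it is drawn."""
    entities: Dict[str, Entity] = field(default_factory=dict)


class AuthoringControls:
    """Authoring layer: explicit, human-driven edits to the world state."""

    def __init__(self, world: WorldState) -> None:
        self.world = world

    def place(self, name: str, kind: str, position: Tuple[float, float]) -> None:
        self.world.entities[name] = Entity(kind, position)

    def move(self, name: str, position: Tuple[float, float]) -> None:
        self.world.entities[name] = replace(self.world.entities[name], position=position)


def render_text(world: WorldState) -> str:
    """Rendering layer: reads the state and never mutates it."""
    lines = [
        f"{name}: {entity.kind} at {entity.position}"
        for name, entity in sorted(world.entities.items())
    ]
    return "\n".join(lines)


world = WorldState()
controls = AuthoringControls(world)
controls.place("lamp", "light", (1.0, 2.0))
controls.move("lamp", (3.0, 2.0))
rendered = render_text(world)
```

Because the renderer only consumes state, it could in principle be swapped for a generative model without changing what the creator is able to edit.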

The Deeper Question: What Kind of Physics Do World Models Learn?

The evening closed with a more philosophical discussion.

One attendee observed that current world models feel fundamentally “Newtonian” rather than “Einsteinian.”

That sparked a broader debate:

What reference frame do world models actually operate in?

Humans reason across multiple layers of abstraction simultaneously: molecular physics, object dynamics, fluid intuition, social reasoning, and semantic understanding.

Future systems may need entirely different abstraction layers depending on the task.

A robot folding laundry does not need particle simulation. A social agent negotiating trust does not need rigid body dynamics.

The future of world models may therefore be less about recreating reality exactly and more about learning the right abstractions for action.

Closing Thoughts

The discussion revealed a notable shift in frontier AI thinking.

The industry is moving from static prediction to interactive simulation, from text generation to environment modeling, and from passive intelligence to agentic intelligence.

World models increasingly appear to be the connective tissue between large language models, robotics, multimodal systems, and autonomous agents.

But perhaps the biggest insight from the evening was this:

The hardest part of intelligence may not be modeling physics. It may be modeling humans.

AGI Dinner Series: World Models and the Path to Embodied Intelligence