On Valentine's Day, AGI House and Radical Ventures hosted a small-group paper discussion on Vision-Language-Action models (VLAs) — where the field started, and where it's going. The discussion was led by Vincent Vanhoucke, co-author of RT-2, the paper that helped define the VLA paradigm, and one of the most influential researchers in modern robot learning. The session drew on two papers: RT-2, and DreamZero, NVIDIA's release from just days prior.

What followed was two hours of substantive discussion on the present and future of embodied intelligence.

1. World modeling and action prediction must be co-trained. This is not a design choice.

Models trained purely on action data embed all their physics knowledge in that narrow signal. They leave enormous leverage on the table. Models trained purely on video learn loose, hacky physics — when they don't understand a collision, they blend the objects. Objectness is not a first-class concept.

Co-training on future world state and actions forces a richer representation of what actually matters in the scene. The empirical evidence — including results corroborated by multiple people in the room — shows it substantially improves downstream performance.
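As a concrete illustration of the co-training idea, the objective can be written as an action-imitation loss plus a weighted world-modeling (future-state prediction) loss. This is a minimal sketch, not the loss from RT-2 or DreamZero; the `lam` weighting and the use of plain MSE for both terms are assumptions.

```python
import numpy as np

def cotraining_loss(pred_actions, true_actions,
                    pred_next_state, true_next_state, lam=0.5):
    """Joint objective: imitate demonstrated actions while also
    predicting the future world state. `lam` trades off the two
    signals; its value here is an arbitrary illustration."""
    action_loss = float(np.mean((pred_actions - true_actions) ** 2))
    world_loss = float(np.mean((pred_next_state - true_next_state) ** 2))
    return action_loss + lam * world_loss
```

The point of the world-modeling term is gradient pressure: even when the action labels are narrow or uninformative, the state-prediction error still forces the backbone to represent object boundaries and contact outcomes.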

DreamZero gestures at this compellingly. It just doesn't prove it, because it omits the obvious ablation: same backbone, same data, no world modeling loss. The field should run that experiment. Loudly.

2. Baking in physics as a constraint degrades performance at scale. The bitter lesson still applies.

This has been tried. It has not worked. The bitter lesson applies to robotics the same way it applied to linguistic priors in speech recognition: the more precisely you encode a prior, the more you've assumed about quantities you cannot observe.

The law of friction is universal. The coefficient of friction of this surface, right now, is unknown. Adding rigid physical constraints bounds generalization in ways that hurt at scale. Let the model learn physical intuitions from data, and study carefully where those intuitions break — that's the gap analysis worth doing.

3. Real-world data is the only ground truth. Simulation fills gaps; it doesn't replace real data.

For navigation and collision avoidance, synthetic data quality is rapidly improving and will close many open problems. For contact-rich manipulation, the domain gap is harder to close than it looks, and maintaining a simulation that accurately models the right problem — not just a tractable proxy — is chronically underestimated work.

The deeper issue: the moment you want to evaluate your system, you want to evaluate it in the real world. Everything else is approximate. The field has celebrated too many results that held only under simulation conditions that don't transfer.

4. Robotics benchmarking is broken. A/B testing must become the standard.

Even in a single lab, with a controlled physical setup, running the same model against itself across days produces wildly variable results. The variability isn't noise that averages out — it's systematic variance from hardware wear, operator conditions, and environmental factors that papers don't model and reviewers don't check.

Most comparative claims in robotics papers are statistically unsound. The fix: adopt A/B testing as standard practice, blind the evaluator to which model is running, and do proper statistical analysis across episodes. The longer-term answer is a Robot Arena — portable benchmarks, any lab running any model, results aggregated across environments. Zero-shot evaluation should become the default: put the model in front of a robot it's never seen, hand it a task, no tuning.
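A minimum-rigor version of "proper statistical analysis across episodes" is a two-proportion z-test over interleaved A/B episodes. The function below is the generic textbook test, not a protocol endorsed in the session:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for the difference in success rates between two
    policies evaluated in interleaved A/B episodes. |z| > 1.96 is
    the usual threshold for significance at the 95% level."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

At 100 episodes per arm, 60% vs. 50% success gives z ≈ 1.42, below the 1.96 threshold: exactly the kind of gap papers routinely report as a win.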

5. Scale will settle arguments that experiments never could.

Most of the skepticism around cross-environment transfer, multi-task generalization, and general-purpose robot policies came from negative results at small scale. Those results were real — and misleading.

At sufficient model size, what looks like negative interference becomes constructive transfer. The same dynamic that reshaped NLP is playing out in embodied AI. If your research conclusions were drawn from small models, treat them as hypotheses, not findings.

6. "End-to-end vs. modular" is the wrong frame. Stop using it.

Modularity is an architectural property. End-to-end is an optimization property. They are independent axes, and conflating them has confused the field for years.

The best systems today are both: implicitly modular in architecture, jointly optimized across components. The real risk of pure end-to-end systems isn't the architecture — it's outputting a naked action with no interpretable intermediate state. If your system can't tell you why it took an action, you can't apply safety constraints, audit failures, or reason about edge cases. Capability and interpretability are not in tension here. Build systems that predict enough about the world alongside their actions, and you get both.

7. Diffusion action heads win in high-dimensional settings. Bet accordingly.

In low-dimensional action spaces — steer, accelerate — diffusion and simpler regression are roughly comparable. In high-dimensional, contact-rich manipulation, diffusion's ability to capture multimodal action distributions becomes decisive.
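The multimodality argument fits in a toy example. Suppose demonstrations split evenly between steering left and steering right around an obstacle: an MSE-trained regressor converges to the mean action, straight into the obstacle, while a generative head (diffusion being one instance) samples a coherent mode. The numbers are purely illustrative:

```python
import random

# Toy bimodal demonstrations: half steer left (-1.0), half steer
# right (+1.0) around an obstacle sitting at 0.0.
demos = [-1.0] * 50 + [1.0] * 50

# A unimodal MSE regressor converges to the mean of its targets:
# 0.0, i.e. straight into the obstacle.
mse_prediction = sum(demos) / len(demos)

def sample_action(rng):
    """Stand-in for a generative action head: sample from the
    demonstrated distribution instead of averaging it."""
    return rng.choice(demos)
```

In low dimensions the two modes are often close enough that averaging is harmless; as action dimensionality grows, the average of valid actions is increasingly likely to be invalid.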

As robotics moves toward full-body dexterity, the gap will widen. Hierarchical decompositions and hard architectural priors will lose to soft attention-based representations that can learn whatever structure the task actually requires. Attention is already doing implicit hierarchy better than explicit designs.

8. Deployment will teach us things that research cannot.

We don't yet have a clear picture of the right data mix, the right sensor suite, the right task scope for general-purpose manipulation. The answers will come from production robots in production environments — not from benchmarks.

The business model for robotics deployment is still being worked out. Until we figure it out and get systems in the field, the data flywheel doesn't start spinning. Most of the important data in this space is data that simply doesn't exist unless robots are already deployed.

9. Value functions remain harder than they look outside game environments.

In chess, Go, and StarCraft, value functions work because the reward signal is clean, the state space is tractable, and the rules are fixed. In unstructured manipulation tasks, defining a good value function is roughly as hard as solving the original problem. The leverage is less than you'd hope.

Failure demonstrations, counterfactual training, and post-training RL all have real value — but the dominant pre-training signal will remain future prediction and imitation. At least until we figure out a better way to define what "good" means in open-ended physical tasks.

10. Compute efficiency is a solved problem waiting to happen. Don't optimize for it early.

RT-2 runs at 55 billion parameters. DreamZero at 14 billion. Neither is deployable on edge hardware today. This is a reason to do engineering, not a reason to do research on small models.

Researching small-model efficiency is a bet that compute costs won't decrease faster than you can publish — historically, a bad bet. The right posture: push capability, trust that economic incentives will drive efficiency downstream. The history of AI is littered with systems that were "a thousand times too expensive to deploy" and were productionized within months once someone cared enough to try.

What We'd Read

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Brohan et al., Google DeepMind, 2023

RT-2 is the paper that established the VLA paradigm. Its core insight: if you discretize robot actions into tokens and express them as text strings, you can co-fine-tune a vision-language model on robot trajectory data and internet-scale vision-language tasks simultaneously — no separate action head, no frozen backbone. The result is a robot policy that inherits semantic understanding from web-scale pretraining: RT-2 can follow instructions it was never trained on, reason about novel objects, and perform rudimentary multi-step planning through chain-of-thought. It demonstrated, for the first time at scale, that the same dynamics driving progress in language and vision could be redirected into physical control. The open question it left was physical generalization: RT-2 transfers semantics well, but has no model of how the world evolves. It sees an image, receives a command, predicts the next action. What happens next in the world is left entirely unmodeled.
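The discretize-and-tokenize mechanism is easy to sketch. RT-2 uses 256 bins per action dimension; the value range and string format below are illustrative assumptions:

```python
def discretize_action(value, low=-1.0, high=1.0, n_bins=256):
    """Map one continuous action dimension to an integer token id by
    clipping to [low, high] and uniform binning. The 256-bin count
    matches RT-2; the range is an assumed normalization."""
    value = max(low, min(high, value))
    frac = (value - low) / (high - low)
    return min(int(frac * n_bins), n_bins - 1)

def action_to_string(action):
    """Express a full action vector as a space-separated token string,
    so a language model can emit it like any other text."""
    return " ".join(str(discretize_action(v)) for v in action)
```

Because actions are just strings, the same output head serves web-scale vision-language tasks and robot control, which is what makes the co-fine-tuning possible.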

DreamZero: World Action Models are Zero-shot Policies
Ye, Ge, Zheng et al., NVIDIA, 2026

DreamZero is a direct answer to RT-2's open question. Built on a 14B autoregressive video diffusion backbone, it jointly predicts future video frames and robot actions — treating dense video prediction as a supervisory signal for learning physical dynamics, not just a perceptual input. The bet is that modeling how the world looks after an action forces the model to internalize causal structure: object permanence, contact dynamics, physical consequence. The results are the strongest zero-shot generalization numbers in the literature — over 2x improvement on novel tasks and environments versus state-of-the-art VLA baselines — with cross-embodiment transfer requiring only 30 minutes of play data on a new robot. The paper landed days before the session, which made it an unusually live target. The central research question it opens is whether video-based world modeling and language-space action prediction are fundamentally different paradigms, or whether they converge at scale.

The Reading Room

The session drew roughly twenty people to a brunch in San Francisco on Valentine's Day (plus-ones were welcome). The mix was deliberately high-signal: founders building physical AI companies, PhDs from Stanford and Berkeley, ML engineers from Waymo, NVIDIA, Cruise, and Physical Intelligence, researchers at Honda and ASML, and engineers who've shipped systems at Google, Meta, SpaceX, and Amazon. Several attendees are working on autonomous systems across humanoid robotics, drone autonomy, and L2+/L3 driving.

The Reading Room is a small, recurring paper discussion series co-organized by AGI House Research and Radical Ventures, continuing the tradition of graduate reading groups for researchers and engineers working at the frontier. The format is always the same: a paper assigned in advance, a shared meal, and a real conversation — kept deliberately small so the discussion stays substantive. If you're interested in future sessions, follow AGI House Research and Radical Ventures for announcements.