Embodied AI's Real Race: From Hardware Parity to Data Monopoly
Embodied AI represents the next frontier on the path toward AGI. While large language models have largely saturated the finite corpus of internet text, physical robots can learn from an effectively unlimited, continuous stream of real-world interactions. The capability of AI to effect direct and consequential change in everyday life, whether replacing humans in high-risk jobs such as nuclear plant work or performing daily house chores, therefore has no obvious ceiling.
The market is responding accordingly. Autonomous vehicles, the most established front of physical AI, are already a massive industry: Waymo delivers over 250,000 rides weekly across U.S. cities, and projections put the self-driving vehicle market at $200 billion by 2030 as adoption in developed markets jumps from 8% to 28%. But the adoption of AI in robotics (beyond preplanned automation) now extends far beyond the road. Over 4.2 million industrial robots operate in factories worldwide, and the humanoid robotics sector is accelerating faster than expected. Goldman Sachs recently revised its humanoid market forecast sixfold, from $6 billion to $38 billion by 2035, citing a 40% drop in manufacturing costs and AI breakthroughs such as generalized task completion. Figure AI raised a $1 billion Series C at a $39 billion valuation in September 2025 with backing from OpenAI and NVIDIA. Physical Intelligence secured $600 million at a $5.6 billion valuation in November 2025. Chinese companies are competing strongly, if not leading, in areas such as hardware development and sensor intelligence, with Fourier raising $100 million in early 2025 and Unitree valued at nearly $7 billion.

The Data Scarcity Problem
Hardware that combines high force and dexterity with a price low enough to deploy en masse has long been the bottleneck to a sustainable market for embodied AI. Recently, however, manufacturing costs for humanoid robots have plummeted nearly 40% year over year, from a range of $50,000-$250,000 per unit in 2024 to $30,000-$150,000 in 2025, driven primarily by optimization of and investment in Chinese supply chains. Notably, high torque-density motors, critical for human-like movement in intricate assemblies such as robotic hands, are now widely commercialized.
The new bottleneck, then, is data. Unlike LLMs, which can train on the entire internet, robotics has no pre-existing digital ocean of labeled "action tokens." The largest robotics datasets, such as Open X-Embodiment with roughly one million episodes, pale against the trillions of tokens used to train GPT-4.
Fundamentally, this is because Vision-Language-Action (VLA) models, the LLM analog for robotics, must unify visual perception, natural language understanding, and physical action planning. This requires recorded sensor data with both breadth (different sensor types such as LiDAR or RGB-D cameras) and depth (enough instances per task), diverse enough to generalize across environments and tasks; such a dataset simply does not yet exist.
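To make the VLA framing concrete, here is a minimal sketch in Python of how such a model might fuse the three modalities into a single action prediction. The module names, dimensions, and the simple mean-pooled fusion are illustrative assumptions, not the architecture of any production VLA.

```python
# Minimal sketch of a Vision-Language-Action (VLA) model, assuming
# pretrained vision and language encoders supply the input tokens.
# All names and sizes are illustrative, not any vendor's actual API.
import torch
import torch.nn as nn


class ToyVLA(nn.Module):
    def __init__(self, vision_dim=512, text_dim=512, hidden_dim=512, action_dim=7):
        super().__init__()
        # Project each modality into a shared token space.
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # A small transformer fuses visual and language tokens.
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # Action head regresses a continuous command, e.g. a 6-DoF
        # end-effector delta plus a gripper value.
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, vision_tokens, text_tokens):
        # vision_tokens: (batch, num_patches, vision_dim) from e.g. a ViT
        # text_tokens:   (batch, num_words, text_dim) from a language model
        tokens = torch.cat(
            [self.vision_proj(vision_tokens), self.text_proj(text_tokens)], dim=1
        )
        fused = self.fusion(tokens)
        return self.action_head(fused.mean(dim=1))


model = ToyVLA()
action = model(torch.randn(1, 196, 512), torch.randn(1, 12, 512))
print(action.shape)  # torch.Size([1, 7])
```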
The best path to creating this dataset, however, is not as straightforward as it currently is for autonomous vehicles. Tesla's FSD scaling strategy benefits from continuous data collection: millions of cars driving billions of miles passively capture sensor data along with labels of human input. Robotics has no such luxury; paying one human operator to teleoperate one robot per task in niche industries is fundamentally unscalable.
Thus, to scale VLA models, the industry is converging on a three-pronged strategy to synthesize this missing data layer. We argue that the winning strategy will be to invest in all of the following approaches, as monopolization of any single approach will be insufficient to reach the critical mass of data.
Approach I: Simulation as the Economic Multiplier
The first layer of the solution is to synthesize experience computationally. Simulation is not an alternative to real-world experience, but rather the foundational layer that allows physical AI to scale beyond the constraints of linear time and physical risk.
NVIDIA's Isaac Sim represents the industry's bet that synthetic data generation at scale is as critical as physical data. The core value proposition is that Isaac Lab, the robot-learning framework built on top of Isaac Sim, enables 10,000+ concurrent environments on a single GPU, generating training data roughly 1,000x faster than real-world collection. This parallelization is what makes reinforcement learning tractable for robotics, effectively turning a data scarcity problem into a compute problem.
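The sketch below illustrates why that parallelization matters: when thousands of environments are stepped as one batched array operation, every wall-clock tick yields thousands of transitions. The toy point-mass physics and function names are stand-ins under that assumption, not Isaac Lab's actual API.

```python
# Toy vectorized simulation: all environments advance in one batched
# update, so data throughput scales with NUM_ENVS per wall-clock step.
import numpy as np

NUM_ENVS = 10_000          # concurrent environments on one device
DT = 1.0 / 60.0            # simulation timestep in seconds

# State of every environment: position and velocity of a point mass.
pos = np.zeros((NUM_ENVS, 3))
vel = np.zeros((NUM_ENVS, 3))

def step(actions):
    """Advance all environments one tick with a single vectorized update."""
    global pos, vel
    vel += actions * DT                      # actions act as accelerations
    pos += vel * DT
    reward = -np.linalg.norm(pos, axis=1)    # toy objective: reach the origin
    return pos.copy(), reward

# One call produces NUM_ENVS transitions of training data.
obs, reward = step(np.random.uniform(-1, 1, size=(NUM_ENVS, 3)))
print(obs.shape, reward.shape)  # (10000, 3) (10000,)
```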
Modern simulation has evolved far beyond simple rigid-body physics engines. Isaac Sim is built on GPU-accelerated PhysX, which models contact dynamics, friction, and deformable objects with high fidelity. To prevent policies from overfitting to one exact simulated world, developers employ domain randomization, systematically varying parameters such as lighting, textures, and object mass during training.
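A minimal sketch of domain randomization follows; the parameter names, ranges, and the commented reset call are illustrative assumptions rather than values or APIs from any specific pipeline.

```python
# Domain randomization: resample physical and visual parameters per
# episode so the policy cannot overfit to one exact simulated world.
import random

def sample_randomized_params():
    return {
        "object_mass_kg": random.uniform(0.2, 2.0),      # vary payload mass
        "friction_coeff": random.uniform(0.4, 1.2),      # vary surface friction
        "light_intensity": random.uniform(0.3, 1.5),     # vary scene lighting
        "camera_jitter_deg": random.uniform(-3.0, 3.0),  # perturb camera pose
        "texture_id": random.randrange(1000),            # swap object textures
    }

# Each training episode sees a slightly different world.
for episode in range(3):
    params = sample_randomized_params()
    # env.reset(**params)  # hypothetical call into the simulator
    print(episode, params)
```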
While Isaac Sim relies on solving mathematical equations to render outcomes, World Models learn physics directly from visual observation. Spearheaded by architectures like OpenAI's Sora and Google DeepMind's Genie, these systems use diffusion transformers to treat physical simulation as a generative modeling problem. Rather than defining rigid laws for every object, they ingest vast quantities of video to internalize spatiotemporal patterns, predicting future states based solely on learned distributions. This allows the AI to "dream" scenarios in a learned latent space, enabling it to model complex, non-rigid interactions, such as fluid dynamics or cloth deformation, that traditional physics engines struggle to capture accurately.
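To ground the idea, here is a toy latent dynamics model that learns to predict the next observation from data alone. It is a deliberately small autoencoder-plus-dynamics sketch standing in for diffusion-transformer world models like Sora or Genie; the architecture and dimensions are assumptions for illustration only.

```python
# Toy world model: predict the next frame from the current frame and
# action with no hand-coded physics. Regularities are learned from data.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyWorldModel(nn.Module):
    def __init__(self, frame_dim=3 * 64 * 64, latent_dim=128, action_dim=7):
        super().__init__()
        # Encoder compresses a flattened frame into a latent state.
        self.encode = nn.Sequential(
            nn.Linear(frame_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim)
        )
        # Dynamics network "dreams" the next latent state.
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        # Decoder renders the predicted latent back into pixel space.
        self.decode = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, frame_dim)
        )

    def forward(self, frame, action):
        z = self.encode(frame)
        z_next = self.dynamics(torch.cat([z, action], dim=-1))
        return self.decode(z_next)


model = TinyWorldModel()
pred = model(torch.rand(1, 3 * 64 * 64), torch.rand(1, 7))
# Training minimizes error against the real next frame, so physical
# regularities are learned from video rather than from equations.
loss = F.mse_loss(pred, torch.rand(1, 3 * 64 * 64))
```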
However, synthetic data often falls short due to the so-called "sim-to-real" gap. Physics engines struggle to capture the messy, stochastic nature of the real world, while World Models frequently hallucinate—producing visually plausible but physically impossible outcomes. Thus, it is difficult to rigorously prove that any subsequently trained models will be reliable upon deployment. Simulation is ultimately a multiplier, not a replacement. It accelerates learning, but only real-world data can validate it.
Approach II: The "Human Video" Breakthrough
If simulation provides volume, human video provides diversity. The industry's most significant pivot in late 2025 was the realization that we do not need to wait for robots to generate data if we can translate human data instead.
This relies on Cross-Embodiment Learning: training models to bridge the gap between a human hand and a robot gripper. As demonstrated by Physical Intelligence in their paper Emergence of Human to Robot Transfer, advanced VLAs now utilize latent action alignment. Instead of trying to map the exact kinematics of a human arm onto a robot arm (which often fails because of differing joint constraints), the model maps the visual intent of the human (e.g., "pick up the cup") to a shared latent space. The robot then decodes this intent into its own specific motor controls. In effect, the model treats human video as just another embodiment.
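A hedged sketch of latent action alignment is below: paired human and robot features are embedded into a shared intent space, and only a robot-specific head decodes that intent into motor commands. The encoders, losses, and dimensions are illustrative, not Physical Intelligence's published architecture.

```python
# Cross-embodiment alignment sketch: human video and robot data share
# one latent "intent" space; only the robot decoder is embodiment-specific.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 64
human_encoder = nn.Linear(512, latent_dim)   # from human video features
robot_encoder = nn.Linear(256, latent_dim)   # from robot video/proprioception
robot_decoder = nn.Linear(latent_dim, 7)     # robot-specific joint commands

# Paired clips of the same task performed by a human and by the robot.
human_feat = torch.randn(32, 512)
robot_feat = torch.randn(32, 256)
robot_actions = torch.randn(32, 7)

z_human = F.normalize(human_encoder(human_feat), dim=-1)
z_robot = F.normalize(robot_encoder(robot_feat), dim=-1)

# Alignment loss pulls matching human/robot clips to the same intent.
align_loss = 1 - (z_human * z_robot).sum(dim=-1).mean()
# Decoding loss grounds the shared intent in executable robot actions.
decode_loss = F.mse_loss(robot_decoder(z_robot), robot_actions)
loss = align_loss + decode_loss
# After training, z_human can be decoded by robot_decoder directly,
# letting human video supervise robot behavior without teleoperation.
```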
Figure AI is pursuing this strategy with its Project Go-Big initiative. By partnering with Brookfield Properties to record human behaviors across thousands of residential units, Figure is essentially building a "YouTube for robotics": a massive, diverse corpus of human navigation and manipulation data to train its Helix model.
Approach III: Teleoperation Ground Truth
While Approach II teaches the robot what to do, only teleoperation teaches it how to do it safely. This is the "last mile" of reliability that simulation and video cannot solve because they lack proprioception—the sensory feedback of weight, resistance, and slippage.
The winning strategy here is not 1:1 remote control, but Shared Autonomy and Action Chunking. In this model, the robot autonomously executes the high-level plan learned from human video. A human teleoperator intervenes only when the model's confidence drops, piloting the robot through that specific bottleneck. The system records this intervention as a high-quality "Action Chunk," a precise sequence of joint movements, which is fed back into the model to close the failure mode.
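A minimal control-loop sketch of shared autonomy with action chunking follows. The policy, env, and human_teleop callables, the confidence threshold, and the chunk length are hypothetical placeholders for a real deployment stack, shown only to illustrate where the high-value correction data comes from.

```python
# Shared autonomy loop: act autonomously while confident, hand control
# to a human below a threshold, and log the correction as a chunk.
CONFIDENCE_THRESHOLD = 0.8   # assumed gating value
CHUNK_LEN = 16               # assumed action-chunk length

def run_episode(env, policy, human_teleop):
    correction_chunks = []   # high-value failure-and-recovery data
    obs = env.reset()
    done = False
    while not done:
        action, confidence = policy(obs)
        if confidence >= CONFIDENCE_THRESHOLD:
            obs, done = env.step(action)     # autonomous execution
            continue
        # Confidence dropped: the human pilots the robot through the
        # bottleneck, and the exact trajectory is recorded as a chunk.
        chunk = []
        for _ in range(CHUNK_LEN):
            human_action = human_teleop(obs)
            chunk.append((obs, human_action))
            obs, done = env.step(human_action)
            if done:
                break
        correction_chunks.append(chunk)
    return correction_chunks   # later fed back into policy fine-tuning
```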
This turns deployment into a flywheel: the more often robots fail, the more high-value "correction data" is generated. This is why integrated players like Tesla and Figure have an advantage: they own the fleet that generates the edge cases, creating a proprietary dataset of failure-and-recovery that no amount of web scraping can replicate.
Conclusion
The race to embodied AI will be won by the synthesis of these three data streams, not the dominance of one.
We view the data stack as a pyramid: simulation provides the infinite base of low-fidelity physical intuition; human video provides the massive middle layer of semantic understanding and task diversity; and teleoperation provides the razor-thin, high-value capstone of ground-truth motor control. All three are required to create embodied AI that is truly general, safe, and useful. The winners will ultimately be the companies with enough consolidated resources to vertically integrate this stack, and the willingness to compete in domains in which they are not currently invested.


