Embodied AI: Pre-Event Guide

This Saturday, AGI House is hosting a one-day robotics hackathon at our Hillsborough house. We're bringing together robotics founders, researchers, and engineers to build toward FSD-level autonomy for physical robots—working hands-on with real hardware, simulation environments, and deployment-ready problems.

Unlike typical robotics demos, this event is structured around the actual constraints of real-world autonomy: perception under uncertainty, multi-step task planning, sim-to-real transfer, and generalist skill development. Teams will have access to Unitree G1 humanoids, SO-101 manipulator arms, simulation credits, and technical support from robotics companies actively deploying autonomous systems. The goal isn't to prototype speculative concepts—it's to push the boundaries of what's actually deployable today.

Foundation Models Meet Physical Intelligence

The embodied AI landscape has undergone a critical transition in the past 18 months. Vision-Language-Action (VLA) models—foundation models that directly map visual observations and language instructions into robot actions—are now achieving genuine cross-task generalization. Physical Intelligence's π₀.5 (CoRL 2025) successfully deployed in entirely new homes and kitchens without any environment-specific tuning, representing a fundamental shift away from brittle, specialized policies.

These models are not only more general, but also much faster. Flow matching architectures now enable inference at 50-200 Hz (π₀ and GR00T N1's System 1), allowing VLAs to run directly inside control loops rather than just providing high-level plans. Previous autoregressive methods generated actions sequentially and discretized continuous motions (~10-20 Hz), but flow matching predicts smooth action trajectories in parallel, slashing inference time and enabling the fluid, responsive movements needed for dexterous tasks like folding laundry or assembling objects.

This progress is now compounding rapidly thanks to advances in simulation and synthetic data. NVIDIA's GR00T Blueprint synthesizes 780,000 robot trajectories (equivalent to 6,500 hours of human demonstration) in just 11 hours, while self-improvement mechanisms like Pelican-VL 1.0's Direct Policy Preference Optimization (DPPO) enable models to autonomously refine their policies on tactile manipulation and multi-agent coordination. Perhaps most surprisingly, world models trained predominantly on video have begun demonstrating zero-shot physical reasoning—Meta's V-JEPA 2 achieved robot planning with only 62 hours of robot footage, suggesting that internet-scale video pretraining may unlock physical common sense without requiring massive robot-specific datasets.

The industry and markets have responded decisively in light of these recent breakthroughs. Robotics startups raised $13.8 billion in 2025—a 77% increase from 2024's $7.8 billion and surpassing the 2021 peak of $13.1 billion. Humanoid manufacturers are now accepting broad consumer preorders. Foundation model laboratories—Anthropic, OpenAI, Google—are now integrating embodied AI as part of their long-term intelligence strategies. Open-source ecosystems have matured rapidly: LeRobot provides production-grade implementations of ACT and diffusion policies, OpenVLA (CoRL 2024), trained on 970K demonstrations across 22 morphologies, offers an open alternative to proprietary VLAs, and WALL-OSS builds on VLMs by adding specialized spatial reasoning capabilities for complex, multi-step manipulation tasks. Building and innovating in this space is now more accessible than ever.

This hackathon is your opportunity to build at this frontier. You will train policies using techniques published in just the past few months, deploy them on Unitree G1 humanoids and SO-101 manipulators, and confront the precise failure modes—sim-to-real transfer, object generalization, and edge computing capability—that currently limit commercial deployment.

Foundational Framework

The embodied AI problem decomposes into three coupled subsystems operating under real-time constraints: (1) perception—extracting actionable representations from high-dimensional sensory input (vision, proprioception, tactile feedback), (2) reasoning—inferring task structure, goals, and action sequences from partial observability and ambiguous instructions, and (3) control—generating motor commands that achieve desired physical outcomes despite model uncertainty and actuation constraints.
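To make this decomposition concrete, below is a minimal sketch of how the three subsystems are typically wired together: a slow reasoning layer that replans a few times per second and a fast control loop that runs at hundreds of hertz. The module names, rates, and shapes are illustrative assumptions, not a prescribed architecture.

```python
"""Minimal sketch of the perception / reasoning / control decomposition.

All module names, rates, and shapes are illustrative assumptions,
not a reference implementation of any specific system.
"""
import numpy as np

CONTROL_HZ = 200      # fast inner loop: motor commands
REASONING_HZ = 5      # slow outer loop: replanning from language + vision

class Perception:
    def observe(self, camera_rgb, joint_positions):
        # In practice: run an encoder (e.g. a ViT) over pixels and fuse
        # with proprioception. Here we just use placeholders.
        return {"image_feat": camera_rgb.mean(), "proprio": joint_positions}

class Reasoner:
    def plan(self, obs, instruction):
        # In practice: a VLM / VLA "System 2" that emits a subgoal.
        return {"subgoal": f"{instruction} (next step)", "target": np.zeros(7)}

class Controller:
    def act(self, obs, plan):
        # In practice: a learned policy or impedance controller tracking the subgoal.
        error = plan["target"] - obs["proprio"]
        return 0.5 * error  # proportional command toward the target joint config

perception, reasoner, controller = Perception(), Reasoner(), Controller()
joint_positions = np.zeros(7)
plan = None

for step in range(1000):                           # 5 seconds of control at 200 Hz
    obs = perception.observe(camera_rgb=np.random.rand(224, 224, 3),
                             joint_positions=joint_positions)
    if step % (CONTROL_HZ // REASONING_HZ) == 0:   # replan at 5 Hz
        plan = reasoner.plan(obs, instruction="clean the desk")
    action = controller.act(obs, plan)             # act at 200 Hz
    joint_positions = joint_positions + action / CONTROL_HZ
```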

The following sections formalize the core technical concepts, empirical results from recent literature, and practical considerations for implementation. For ML researchers entering robotics: expect familiar architectures (transformers, diffusion models, RL objectives) applied to unfamiliar constraints (real-time control, sim-to-real transfer, contact dynamics). For roboticists exploring foundation models: expect rapid iteration cycles borrowed from deep learning, with evaluation metrics that emphasize physical robustness over benchmark accuracy. These models are sample-efficient compared to classical RL but can fail in ways that violate physical constraints, such as generating kinematically infeasible trajectories or ignoring collision geometry. Success depends less on tuning gains and more on curating diverse, accurate training data.

The Core Concepts

Vision-Language-Action Models (VLAs): These are the "GPT-4 for robots"—foundation models that take images and language instructions as input and output robot actions directly. Seminal examples include RT-2 (Google DeepMind), OpenVLA (Stanford/Berkeley), π₀.5 (Physical Intelligence), and GR00T N1 (NVIDIA). The key insight: by co-training on web-scale vision-language data alongside robot demonstrations, these models inherit common-sense knowledge about the world. They know what a "red cup" looks like without being shown one in robot training data. Physical Intelligence's π₀ was pre-trained on 7 robot platforms and 68 tasks, demonstrating laundry folding, box assembly, and table bussing—the first credible evidence that scaling VLAs yields cross-task generalization. π₀.5 extended this to mobile manipulators cleaning entirely novel kitchens and bedrooms, representing genuine open-world generalization rather than memorization.

Flow Matching: While the field often uses "diffusion policy" as shorthand, state-of-the-art models like π₀ actually use flow matching—a continuous normalizing flow variant that handles high-frequency action chunks up to 50 Hz. This distinction matters because flow matching enables the faster inference speeds that make real-time dexterous control possible. GR00T N1 pushes this further with 200Hz "System 1" action generation. When you hear "diffusion policy," understand it as part of a family of generative approaches for smooth action trajectory prediction.
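A minimal sketch of the idea behind flow-matching action generation: a network is trained to predict the velocity that moves a noisy action chunk toward the demonstrated chunk, and at inference time the chunk is produced by integrating that velocity field over a handful of Euler steps. The toy network, shapes, and random "obs" conditioning below are assumptions for illustration, not the π₀ or GR00T architecture.

```python
"""Toy flow-matching policy for action chunks (illustrative sketch only).

Shapes, conditioning, and the tiny MLP are assumptions for clarity; real
VLAs condition on vision-language features instead of a random vector.
"""
import torch
import torch.nn as nn

CHUNK, ACT_DIM, OBS_DIM = 16, 7, 32   # 16-step action chunk for a 7-DOF arm

class VelocityNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CHUNK * ACT_DIM + OBS_DIM + 1, 256), nn.ReLU(),
            nn.Linear(256, CHUNK * ACT_DIM))
    def forward(self, noisy_chunk, obs, t):
        x = torch.cat([noisy_chunk.flatten(1), obs, t], dim=-1)
        return self.net(x).view(-1, CHUNK, ACT_DIM)

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training step: regress the straight-line velocity from noise to data.
obs = torch.randn(64, OBS_DIM)                  # stand-in for encoded observation
expert_chunk = torch.randn(64, CHUNK, ACT_DIM)  # stand-in for demonstrated actions
noise = torch.randn_like(expert_chunk)
t = torch.rand(64, 1)
x_t = (1 - t[:, :, None]) * noise + t[:, :, None] * expert_chunk
target_v = expert_chunk - noise
loss = ((model(x_t, obs, t) - target_v) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()

# Inference: a few Euler steps turn noise into a smooth action chunk.
with torch.no_grad():
    chunk = torch.randn(1, CHUNK, ACT_DIM)
    obs1 = torch.randn(1, OBS_DIM)
    steps = 10
    for i in range(steps):
        t_i = torch.full((1, 1), i / steps)
        chunk = chunk + model(chunk, obs1, t_i) / steps
print(chunk.shape)  # (1, 16, 7): one chunk of 16 future actions
```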

Imitation Learning/Behavioral Cloning: The simplest approach—collect human demonstrations of a task, then train a policy to mimic them. ACT (Action Chunking with Transformers) remains the gold standard, achieving strong success rates on fine manipulation with relatively small demonstration datasets. HumanPlus (Stanford) provides a full-stack system enabling humanoids to learn from human motion via single RGB cameras—making the case that humanoid morphology matters precisely because it unlocks human demonstration data. The magic is in predicting sequences of future actions (chunks) rather than single timesteps, which smooths out compounding errors.
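To show how little machinery behavioral cloning with chunking actually needs, here is a simplified ACT-style sketch. The dataset, the stand-in network, and the horizon are illustrative assumptions; ACT itself uses a CVAE transformer over camera images, which is omitted here.

```python
"""Behavioral cloning with action chunking (simplified ACT-style sketch).

The network and dimensions are placeholder assumptions; the real system
encodes camera images and proprioception with a transformer.
"""
import torch
import torch.nn as nn

HORIZON, OBS_DIM, ACT_DIM = 20, 64, 6   # predict 20 future actions per step

policy = nn.Sequential(                  # stand-in for the ACT transformer
    nn.Linear(OBS_DIM, 256), nn.ReLU(),
    nn.Linear(256, HORIZON * ACT_DIM))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def training_step(obs, action_chunk):
    """obs: (B, OBS_DIM); action_chunk: (B, HORIZON, ACT_DIM) from demos."""
    pred = policy(obs).view(-1, HORIZON, ACT_DIM)
    loss = (pred - action_chunk).abs().mean()   # L1 loss, as in ACT
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Fake demonstration batch; replace with real teleoperation data.
loss = training_step(torch.randn(32, OBS_DIM), torch.randn(32, HORIZON, ACT_DIM))

# Execution: predict a chunk, run only the first few actions, then re-predict.
# This receding-horizon pattern is what keeps compounding errors in check.
obs = torch.randn(1, OBS_DIM)
chunk = policy(obs).view(HORIZON, ACT_DIM)
actions_to_execute = chunk[:5]           # execute 5 steps, then re-plan
```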

World Models: An emerging paradigm where a model learns to predict how the world will evolve given actions, enabling planning—imagine consequences before executing. Unlike traditional physics simulators that encode laws (e.g. gravity), world models learn physics implicitly from data by watching videos of the world and learning the statistical patterns of how things move, deform, and interact. Meta's V-JEPA 2, NVIDIA's Cosmos, DeepMind's Genie 3, 1X's World Model, and Tesla's neural world simulators represent the frontier. Meta's V-JEPA 2 achieved zero-shot robot planning with only 62 hours of robot video—a potential paradigm shift we'll explore below.
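Here is a minimal sketch of planning with a learned world model: a dynamics model predicts the next state from the current state and a candidate action, and a simple random-shooting planner picks the action whose imagined rollout scores best. The dynamics function, cost, and dimensions are placeholders; systems like V-JEPA 2 plan in a learned latent space rather than the toy state used here.

```python
"""Planning with a learned world model via random shooting (toy sketch).

The dynamics model, cost function, and dimensions are placeholder
assumptions; real systems learn dynamics from video and robot data.
"""
import numpy as np

STATE_DIM, ACT_DIM, HORIZON, N_CANDIDATES = 8, 2, 10, 256
goal = np.ones(STATE_DIM)

def learned_dynamics(state, action):
    # Stand-in for a trained next-state predictor f(s, a) -> s'.
    return state + 0.1 * np.tanh(action).repeat(STATE_DIM // ACT_DIM)

def cost(state):
    return np.sum((state - goal) ** 2)   # distance to goal in state space

def plan(state):
    """Sample random action sequences, roll them out in imagination,
    and return the first action of the cheapest imagined trajectory."""
    candidates = np.random.uniform(-1, 1, (N_CANDIDATES, HORIZON, ACT_DIM))
    costs = []
    for seq in candidates:
        s, total = state.copy(), 0.0
        for a in seq:
            s = learned_dynamics(s, a)
            total += cost(s)
        costs.append(total)
    return candidates[int(np.argmin(costs))][0]

state = np.zeros(STATE_DIM)
for _ in range(20):                          # model-predictive control loop
    action = plan(state)
    state = learned_dynamics(state, action)  # in reality: step the robot
print("final cost:", cost(state))
```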

Self-Evolving Agents: Cutting-edge research like SEEA-R1 introduces tree-structured reinforcement learning (Tree-GRPO) enabling agents to autonomously improve on sparse-reward long-horizon tasks without human supervision, surpassing GPT-4o on ALFWorld benchmarks. Pelican-VL 1.0 takes this further with Direct Policy Preference Optimization (DPPO)—a training loop allowing 7B-72B parameter vision-language "brain models" to self-improve on tactile grasping and multi-robot coordination tasks, with open-source code and checkpoints.

Sim-to-Real Transfer: Training in simulation is fast, cheap, and safe. But simulators aren't reality. Domain randomization—varying physics parameters, textures, lighting during training—forces policies to be robust to the inevitable mismatch. Important nuance: "zero-shot transfer" in robotics doesn't mean "no preparation needed"—it means the policy works on a real robot without additional real-world fine-tuning, but still requires extensive domain randomization during simulation training.
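In practice, domain randomization often amounts to perturbing the simulator's physical parameters between episodes. Below is a minimal sketch using the MuJoCo Python bindings; the toy scene and the randomization ranges are assumptions for illustration, and real setups randomize far more (textures, lighting, latency, sensor noise) while running the policy inside this loop.

```python
"""Domain randomization sketch with the MuJoCo Python bindings.

The scene XML and ranges are illustrative assumptions, not a recipe.
"""
import mujoco
import numpy as np

XML = """
<mujoco>
  <worldbody>
    <geom name="floor" type="plane" size="1 1 0.1"/>
    <body name="box" pos="0 0 0.1">
      <joint type="free"/>
      <geom name="box_geom" type="box" size="0.05 0.05 0.05" mass="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)
nominal_mass = model.body_mass.copy()
nominal_friction = model.geom_friction.copy()
rng = np.random.default_rng(0)

for episode in range(5):
    # Resample physics each episode so the policy cannot overfit to one
    # specific (and inevitably wrong) simulator configuration.
    model.body_mass[:] = nominal_mass * rng.uniform(0.7, 1.3, nominal_mass.shape)
    model.geom_friction[:, 0] = nominal_friction[:, 0] * rng.uniform(0.5, 1.5, model.ngeom)
    mujoco.mj_resetData(model, data)
    for step in range(500):
        # A trained policy would write motor commands into data.ctrl here.
        mujoco.mj_step(model, data)
    print(f"episode {episode}: box height = {data.qpos[2]:.3f}")
```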

The Data Problem

You'll hear that robots face a "data shortage" compared to LLMs. This is true but incomplete. The data gap between existing robot foundation models and mature LLMs could be up to 120,000x (Goldberg at GTC 2025). But the challenge is more nuanced than raw volume. Robotics data is fundamentally different because:

  • High-dimensional continuous spaces: Robots learn through physical interaction in high-dimensional, continuous state-action spaces. Every robot action might involve dozens of joint positions, varying forces, and complex contact dynamics—multiplied across a nearly infinite range of objects and environments.
  • The embodiment gap: Unlike text, video data for robotics suffers from missing signals—videos don't show underlying forces, torques, or tactile feedback. A human hand, a 7-DOF industrial arm, and a quadruped all have vastly different morphologies. Mapping a human's "grasp" to a robot's "actuation" is a massive translation problem.
  • The deployment-learning loop: Commercialization requires a cycle where real-world scenarios generate data, data improves models, and better models unlock broader deployment. If any link in that cycle breaks, reality-grounded data generation stalls, and so does progress.

The Tech Stack

Below is an overview of a standard technology stack that you might work with.

Simulation Platforms

  • MuJoCo: Open-source physics engine (DeepMind, 2022). Extremely fast and accurate for articulated bodies and contact dynamics. When to use: quick iteration and physics fidelity on manipulation tasks. Start here for most projects.
  • NVIDIA Isaac Sim: GPU-accelerated simulation (PhysX) with photorealistic rendering (Omniverse). Enables massive parallelization: thousands of robots training simultaneously. GR00T Blueprint is built on it. When to use: when you need photorealism, massive scale, or vision-based policies that benefit from realistic rendering. Higher learning curve.
  • Newton: New open-source physics engine (NVIDIA, DeepMind, Disney Research), purpose-built for robot learning. When to use: emerging tool, worth watching but not yet production-ready.

Learning Frameworks

  • LeRobot (Hugging Face): The "Transformers library for robotics." Implementations of ACT, Diffusion Policy, and VLA models; standardized dataset formats; hundreds of pre-trained models; π₀ and π₀-FAST available. When to use: excellent starting point for newcomers. You can have something working in hours.
  • OpenVLA: Open-source 7B-parameter VLA trained on 970K robot demonstrations across 22 embodiments. Fine-tunable via LoRA on consumer GPUs. When to use: good baseline, but note that foundation model performance in novel environments remains challenging.
  • π₀ (Physical Intelligence, openpi) / SmolVLA (Hugging Face): Flow-matching VLAs outputting continuous actions at up to 50 Hz. PyTorch support. 1-20 hours of data is sufficient for task fine-tuning. When to use: state-of-the-art approach for dexterous manipulation with limited task-specific data.
  • GR00T N1 (NVIDIA): First open foundation model for humanoid robots. Dual-system architecture: "System 2" (vision-language reasoning, 7-9 Hz) plus "System 1" (fast action generation, 200 Hz). When to use: humanoid-specific tasks. Used by 1X's NEO Gamma for domestic tasks.
  • Pelican-VL 1.0: Open-source vision-language "brain models" (7B-72B parameters) with a DPPO training loop for self-improvement; focuses on tactile grasping and multi-robot coordination. When to use: cutting-edge self-evolving agent research. Code and checkpoints available for experimentation.
  • WALL-OSS: Embodied foundation model extending VLMs with mixture-of-experts spatial reasoning; two-stage training (inspiration + integration). When to use: long-horizon manipulation tasks requiring enhanced spatial understanding.

Hardware Available

  • Unitree G1 Humanoid: 127 cm tall, 35 kg, 23-43 DOF. The EDU version has NVIDIA Jetson Orin onboard compute and force-controlled dexterous hands. Does backflips. When to use: a real humanoid robot platform. Expensive; handle with care.
  • Booster K1 Humanoid: Alternative humanoid platform from the UFB roster. When to use: secondary humanoid option.
  • SO-101 Arms (LeRobot compatible): Low-cost (~$100) robotic arms designed by Hugging Face for accessible robot learning. Extensive documentation and pre-trained policies available. When to use: perfect for manipulation experiments. Best starting point.

Platforms like NomadicML (sponsoring this event) are tools to address an infrastructural barrier in robot learning: transforming raw video footage into structured, labeled datasets suitable for training. These tools provide workflows for ingesting motion data, generating candidate annotations (such as temporal segments or action boundaries), and refining labels through human-in-the-loop pipelines. In the context of this hackathon, NomadicML enables participants in Track 4 to curate boxing action datasets from UFB fight footage, train motion segmentation models, and evaluate predictions against human-annotated ground truth—demonstrating how data curation infrastructure accelerates the research cycle from raw observations to trained policies.
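The core evaluation in this kind of data-curation work is comparing predicted temporal segments against human-annotated ground truth. A common, simple metric is temporal IoU with a match threshold; the sketch below is a generic implementation with made-up segment times and does not depend on any particular annotation tool.

```python
"""Evaluating predicted action segments against human annotations.

Generic temporal-IoU matching; segment times are in seconds. This is a
standalone sketch and assumes nothing about any specific labeling tool.
"""

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_segments(predicted, ground_truth, iou_threshold=0.5):
    """Greedily match predictions to ground truth; return precision and recall."""
    unmatched_gt = list(ground_truth)
    true_positives = 0
    for pred in predicted:
        best = max(unmatched_gt, key=lambda gt: temporal_iou(pred, gt), default=None)
        if best is not None and temporal_iou(pred, best) >= iou_threshold:
            true_positives += 1
            unmatched_gt.remove(best)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Example: jab / cross segments (seconds) from a hypothetical fight clip.
gt = [(1.0, 1.6), (3.2, 3.9), (7.0, 7.4)]
pred = [(1.1, 1.7), (3.0, 3.8), (5.0, 5.5)]
print(match_segments(pred, gt))  # 2 of 3 matched in each direction
```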

Track Deep Dives

Track 1: Robot Task Autonomy (Scene Reasoning & Task Planning)

The Challenge: Build the cognitive layer that sits above motor control—the system that looks at a scene, understands what's in it, reasons about goals, and decides what to do next.

This is fundamentally a reasoning problem. How does a robot know that to "clean the desk" it needs to first identify objects, determine which are trash vs. which belong, plan a sequence of pick-and-place operations, and adapt when something unexpected happens?

State of the Art:

  • VLMs like GPT-4V and Gemini can reason about images and generate plausible plans
  • VLAs like RT-2 show chain-of-thought reasoning for robotics—outputting natural language reasoning steps before actions
  • Hierarchical approaches: a "slow" language model for high-level planning (System 2), a "fast" policy for execution (System 1)—this is how GR00T N1 and Figure's Helix are architected
  • WALL-OSS demonstrates mixture-of-experts spatial reasoning for complex long-horizon manipulation planning

Where It's Hard:

  • Real-time operation: language models are slow; robots need to react in milliseconds
  • Grounding: VLMs can describe what to do, but mapping "pick up the red cup" to specific motor commands requires spatial understanding
  • Recovery: plans fail, and demonstration datasets rarely contain examples of recovering from a failed manipulation. Adaptive replanning remains largely unsolved

Industry Context: Figure AI initially partnered with OpenAI for language models but has since built its own in-house neural network (Helix), emphasizing the need to "vertically integrate robot AI" because translating software commands to smooth physical actions has unique challenges.

Promising Approaches for This Hackathon:

  • Use a VLM to interpret scenes and generate high-level task plans, then execute via a pre-trained manipulation policy
  • Build an "interrupt" mechanism that detects when execution has failed and triggers replanning (see the sketch after this list)
  • Explore structured representations (scene graphs, object-centric embeddings) that help VLMs reason about spatial relationships
  • Natural language teleoperation: human gives spoken commands, robot interprets and executes
  • Experiment with Pelican-VL's DPPO for self-improving task planning
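As a concrete shape for the interrupt-and-replan idea above, here is a hedged sketch of a hierarchical loop: a slow planner (stand in any VLM you like) produces subtasks, a fast skill policy executes them, and a simple monitor triggers replanning when execution stalls. The planner, skill, and monitor functions are all hypothetical placeholders.

```python
"""Hierarchical plan-execute-monitor loop (illustrative sketch).

`call_vlm_planner`, `execute_skill`, and `check_progress` are hypothetical
placeholders; swap in a real VLM call, a pre-trained manipulation policy,
and a perception-based success detector.
"""
import random

def call_vlm_planner(scene_description, goal, failure_context=None):
    """Stand-in for a VLM: returns an ordered list of subtasks."""
    base = ["locate objects on the desk", "pick up trash item", "place item in bin"]
    if failure_context:
        return ["re-scan the scene"] + base   # naive recovery: replan from scratch
    return base

def execute_skill(subtask):
    """Stand-in for a low-level policy; returns True if the skill reports success."""
    return random.random() > 0.3

def check_progress(subtask, success_flag):
    """Interrupt condition: here just the skill's own report; in practice,
    compare expected vs. observed scene state (e.g. object still on desk)."""
    return success_flag

goal = "clean the desk"
plan = call_vlm_planner("desk with cup, wrapper, laptop", goal)
max_replans, replans = 3, 0

while plan and replans <= max_replans:
    subtask = plan.pop(0)
    ok = execute_skill(subtask)
    if not check_progress(subtask, ok):
        replans += 1
        print(f"'{subtask}' failed; replanning ({replans}/{max_replans})")
        plan = call_vlm_planner("updated scene", goal, failure_context=subtask)
    else:
        print(f"completed: {subtask}")
```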

What Would Impress Us:

  • A system that handles task ambiguity gracefully ("clean up" → reasonable interpretation)
  • Adaptation to unexpected obstacles or failures mid-task
  • Novel integration of existing models rather than training from scratch
  • Clear demonstration of where reasoning adds value over end-to-end policies

Tactical Note: This track is primarily simulation-focused. You won't be judged on hardware demos. Focus on the reasoning architecture.

Track 2: Generalist Skills & Task Transfer

The Challenge: Train a manipulation skill once, then transfer it to new objects, new arrangements, new environments. The holy grail: skills that aren't brittle.

Current robot policies are specialists—they work great on the exact objects and configurations they were trained on, and fail miserably on anything else. A policy trained to pick up a specific red cube struggles with a blue one. This is the generalization problem.

State of the Art:

  • ACT + LoRA fine-tuning: Train a base policy, adapt with a few demos of new objects
  • OpenVLA: Pre-trained on 970K demonstrations across 22 embodiments; shows transfer to new robots and objects
  • π₀.5 (Physical Intelligence, CoRL 2025 Best Paper Finalist): The first VLA claiming "meaningful generalization to entirely new environments"—though it "does not always succeed on the first try"
  • 1X World Model: Claims robots can "learn from internet-scale video and apply that knowledge directly to the physical world"—though actual learned tasks remain limited to basics like removing air fryer baskets
  • HumanPlus (Stanford, CoRL 2024 Outstanding Paper Finalist): Demonstrates full-stack humanoid learning from human motion data via single RGB camera

The Real Bottleneck: For home robots, the bottleneck is dexterous manipulation under uncertainty. Real homes involve deformable fabrics, random clutter, reflective packaging, and cramped storage spaces. Industrial settings are more tractable because environments are semi-structured.

Key Research Questions:

  • How much does shape vs. texture vs. weight matter for transfer?
  • Can policies trained on simulation objects transfer to real objects? (Expect 24-30% performance drop)
  • Does more diverse training data help, or does it hurt specialization?

Promising Approaches for This Hackathon:

  • Fine-tune OpenVLA or SmolVLA on a narrow task, then test generalization
  • Train on a diverse set of procedurally-generated objects in simulation, test on real objects
  • Use depth/point cloud inputs instead of RGB—more invariant to visual appearance
  • Explicit object segmentation as a preprocessing step
  • Test transfer across different grippers/end-effectors
  • Explore WALL-OSS for enhanced spatial reasoning in transfer scenarios

What Would Impress Us:

  • Clear experimental design: train on X, test on Y, measure success rate (see the sketch after this list)
  • Honest characterization of what transfers and what doesn't—this is more valuable than inflated success rates
  • Novel objects that weren't cherry-picked for success
  • Hardware demos carry extra weight in this track
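For the experimental design called out above, report success rates with confidence intervals rather than a single number; with only 10-20 trials per condition, the uncertainty is large. Below is a minimal sketch using a Wilson interval; the trial counts are invented for illustration.

```python
"""Reporting generalization results: success rate + Wilson 95% interval.

Trial counts below are made up; the point is that with 15 rollouts per
condition the error bars are wide, so show them.
"""
import math

def wilson_interval(successes, trials, z=1.96):
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

conditions = {
    "trained objects":           (14, 15),   # (successes, trials)
    "novel objects, same class":  (9, 15),
    "novel object classes":       (4, 15),
}
for name, (s, n) in conditions.items():
    lo, hi = wilson_interval(s, n)
    print(f"{name:28s} {s}/{n} = {s/n:.0%}  (95% CI {lo:.0%}-{hi:.0%})")
```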

Tactical Note: Start with the LeRobot SO-101 setup—it's the most documented and has pre-trained checkpoints. You can have something working in hours rather than days.

Track 3: Sim Training & World Models

The Challenge: Train robot behaviors entirely in simulation—from athletic movements to dance to novel skills—and make them work in the real world.

The sim-to-real gap is real and substantial: simulators model physics imperfectly, render scenes unrealistically, and can't capture every real-world perturbation. Policies that work perfectly in simulation often fail catastrophically on hardware. But simulation is so much cheaper than real-world data collection that solving this transfer problem is worth enormous effort.

The Reality Gap consists of multiple sub-gaps:

  • Dynamics: Mismatches in contact physics, friction, mass distributions
  • Perception: Lighting, textures, shadows, reflections
  • Actuation: Motor behavior, joint dynamics, control system latencies
  • System: Communication delays, safety mechanisms, software stack inconsistencies

The Pipeline You'll Use:

  1. Video2Robot: Feed in video of a human performing a motion → outputs skeleton motion files for the G1
  2. MJLab: View and refine motions in MuJoCo, train reinforcement learning policies that can execute them
  3. UFB Submission: Test your trained policies on the real Unitree G1

State of the Art:

  • Domain randomization: Vary physics parameters (friction, mass, damping) during training so the policy is robust to any plausible configuration
  • Teacher-student training: Train a privileged policy with full state, then distill to a policy that only uses observations (see the sketch after this list)
  • Neural world simulators: Tesla's approach—train video generation models as "physics engines" to generate synthetic training scenarios
  • World models: Train a model to predict future states, use it to generate synthetic training data or for planning. Meta's V-JEPA 2 achieved zero-shot robot planning with only 62 hours of robot video.
  • SEEA-R1: Tree-structured RL (Tree-GRPO) for self-evolving agents that autonomously improve on sparse-reward tasks
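Here is a minimal sketch of the teacher-student pattern mentioned above: a teacher policy trained with privileged simulator state labels a batch of rollouts, and a student that only sees deployable observations is regressed onto those labels. The networks, dimensions, and random data are placeholder assumptions.

```python
"""Teacher-student distillation sketch (privileged state -> observations).

Networks and dimensions are placeholders; in practice the teacher is
trained with RL in simulation and the student consumes camera images or
noisy proprioception history.
"""
import torch
import torch.nn as nn

PRIV_DIM, OBS_DIM, ACT_DIM = 48, 96, 12   # privileged state vs. deployable obs

teacher = nn.Sequential(nn.Linear(PRIV_DIM, 128), nn.ReLU(), nn.Linear(128, ACT_DIM))
student = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, ACT_DIM))
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

# Assume the teacher is already trained (e.g. with RL on privileged state).
for step in range(100):
    # One rollout batch from the simulator: privileged state + paired observation.
    priv_state = torch.randn(256, PRIV_DIM)
    observation = torch.randn(256, OBS_DIM)
    with torch.no_grad():
        target_action = teacher(priv_state)       # teacher labels the batch
    loss = ((student(observation) - target_action) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# At deployment, only the student runs: it never needs privileged state.
action = student(torch.randn(1, OBS_DIM))
```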

Synthetic Data Reality Check:

NVIDIA generated 780,000 synthetic trajectories in 11 hours, yet even combining this data with real demonstrations improved performance by only about 40% over real data alone. Synthetic data alone hits a ceiling because:

  • Simulated contact dynamics differ from real contact
  • Sensor noise patterns don't match reality
  • Edge cases in the real world weren't imagined by the simulator
  • Policies can overfit to simulator-specific artifacts

What Makes a Good Sim-to-Real Result:

  • Zero-shot transfer: Policy works on real robot without any real-world fine-tuning (but remember: this still requires extensive domain randomization during training)
  • Robustness: Works across perturbations (push the robot, change the floor surface)
  • Novel behaviors: Something the simulator wasn't specifically designed for

Promising Approaches for This Hackathon:

  • Focus on a single compelling motion (a specific dance move, a recovery from a fall, an athletic skill)
  • Go deep on domain randomization parameters—the details matter enormously
  • Experiment with different reward functions for RL training (see the sketch after this list)
  • Use the world model paradigm: train a video prediction model on robot data, use it to evaluate policies
  • Explore SEEA-R1's Tree-GRPO for sparse-reward long-horizon task learning
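For the reward-function experiments mentioned above, motion-imitation rewards are usually a weighted combination of exponentiated tracking errors plus regularizers. The terms and weights below are illustrative assumptions, not any specific framework's or paper's exact formulation; tune them and watch how the learned behavior changes.

```python
"""Illustrative reward for tracking a reference motion on a humanoid.

Terms and weights are assumptions to experiment with, not a recipe.
"""
import numpy as np

def tracking_reward(qpos, qvel, action, prev_action, ref_qpos, ref_qvel):
    """All inputs are joint-space vectors of equal length."""
    pose_err = np.sum((qpos - ref_qpos) ** 2)
    vel_err = np.sum((qvel - ref_qvel) ** 2)
    r_pose = np.exp(-2.0 * pose_err)            # track reference joint angles
    r_vel = np.exp(-0.1 * vel_err)              # track reference joint velocities
    r_smooth = -0.01 * np.sum((action - prev_action) ** 2)  # discourage jitter
    r_effort = -0.001 * np.sum(action ** 2)                 # discourage high torque
    return 0.6 * r_pose + 0.3 * r_vel + r_smooth + r_effort

# Example call with random placeholder values for a 23-DOF robot.
n = 23
print(tracking_reward(*(np.random.randn(n) * 0.1 for _ in range(6))))
```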

What Would Impress Us:

  • A behavior that clearly came from video (show us the source!)
  • Quantitative comparison of transfer quality with/without your techniques
  • Novel motions, not just walking
  • Clear documentation of what parameters you randomized and why
  • Honest analysis of failure modes

Tactical Note: You need a CUDA GPU for training. Use the Nebius credits provided. Test in simulation first—don't waste time debugging hardware issues until you have something working in sim.

Resources

Papers Worth Reading

  • "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" (ACT paper)
  • "OpenVLA: An Open-Source Vision-Language-Action Model"
  • "Robot Learning From Randomized Simulations: A Review"

The most interesting robotics companies of the next decade will be started by people who are building today. Not reading about building. Not planning to build. Building.

You have access to hardware that cost $100,000 five years ago. You have access to open-source models that didn't exist eighteen months ago. You have a weekend.

Make something!

Questions? Find us on Discord: [2-7-autonomous-robot-build-day channel]

— The AGI House Robotics Team