This Saturday, AGI House is hosting a one-day agent skills, OpenClaw, and world models hackathon at our Hillsborough house. This article is your technical primer for the ideas you’ll be putting into practice alongside industry leaders, researchers, and domain experts this weekend.
Part I — Landscape
The Shift to Agent-Native Infrastructure
The past year has produced a quiet but consequential reorganization in how AI capability gets packaged and deployed. For most of the foundation model era, the dominant pattern was model-centric: to make an AI system better at a task, you either prompted it more carefully, fine-tuned it on domain data, or built a custom harness around it. But this approach has a ceiling: custom harnesses are brittle, and fine-tuning on domain-specific data typically trades away generality.
Agent skills — self-contained, reusable capability packages that any autonomous agent can discover, load, and execute at inference time — are an architectural answer to that ceiling. The SKILL.md pattern, popularized by the OpenClaw ecosystem, decouples what an agent can do from which model powers it, what hardware it runs on, and what environment it operates in. A domain expert writes a skill once, and any sufficiently capable model can run it by reading it into context. This makes skills the new unit of agent capability.
In just the past three months, we’ve seen extremely rapid adoption of runtimes like OpenClaw and of the skills they run. Skills.sh (launched by Vercel) now hosts over 80,000 skills optimized for cross-agent compatibility across Claude Code, Cursor, Codex, Copilot, and Windsurf. Skills MP has indexed over 130,000, functioning effectively as a search engine for agent capability. ClawHub, OpenClaw's native registry, hosts nearly 14,000 community-built skills with zero curation — which is itself a problem we will return to. The bottleneck has already shifted from skill supply to skill routing and quality.
Agent Skills: The Ecosystem in Detail
How skills work
A skill is a structured text file that encodes procedural knowledge for a specific domain or task. When an agent encounters a problem that matches a skill's domain, it loads the skill into its context window, giving it a dense, expert-authored instruction set. The key insight is that context management is now the primary systems problem in agent deployment. What an agent loads into its limited context window is the biggest lever on performance, independent of the underlying model.
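The load-into-context pattern described above can be sketched in a few lines. This is an illustrative sketch only: the metadata layout (a `name:`/`description:` header followed by free-form instructions) is a hypothetical simplification, not the actual SKILL.md specification, and the character budget stands in for real token-level context management.

```python
# Minimal sketch of loading a skill into an agent's context window.
# The header format here is a hypothetical simplification of SKILL.md.

def parse_skill(text: str) -> dict:
    """Split a skill file into a metadata header and an instruction body."""
    meta, body = {}, []
    for line in text.splitlines():
        if ":" in line and not body and not line.startswith("#"):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
        else:
            body.append(line)
    return {"meta": meta, "instructions": "\n".join(body).strip()}

def build_context(task: str, skill: dict, budget_chars: int = 4000) -> str:
    """Inject the skill into the prompt, truncated to a context budget."""
    instructions = skill["instructions"][:budget_chars]
    name = skill["meta"].get("name", "unnamed")
    return f"## Skill: {name}\n{instructions}\n\n## Task\n{task}"

skill = parse_skill(
    "name: pdf-extraction\n"
    "description: pull tables from PDFs\n\n"
    "Always check page rotation first."
)
prompt = build_context("Extract table 2 from report.pdf", skill)
```

The point of the sketch is the shape of the problem: what survives truncation into the budget is exactly the lever on performance the paragraph above describes.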
Routing, composition, and the hard problems
With ClawHub alone hosting thousands of overlapping skills, the hard problem is no longer writing good skills. Instead, it is intelligent selection and composition. An agent must detect when a conflict or ambiguity exists between available skills, select the appropriate skill given task context, and recover when skill execution fails. This routing and arbitration layer is currently the most significant infrastructural gap in the stack.
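To make the routing and arbitration problem concrete, here is a deliberately toy router: it scores candidate skills by keyword overlap with the task and flags ambiguity when the top two candidates are too close to call. Real routers use embeddings and execution history rather than word overlap; the skill names and tags below are invented for illustration.

```python
# Toy skill router: keyword-overlap scoring plus an ambiguity flag when
# the top two candidates are within a margin of each other.

def score(task: str, skill: dict) -> float:
    """Fraction of the skill's tags that appear in the task text."""
    task_words = set(task.lower().split())
    tags = set(skill["tags"])
    return len(task_words & tags) / max(len(tags), 1)

def route(task: str, skills: list, margin: float = 0.2):
    """Pick the best-scoring skill; flag near-ties as ambiguous."""
    ranked = sorted(skills, key=lambda s: score(task, s), reverse=True)
    best = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else None
    ambiguous = (runner_up is not None and
                 score(task, best) - score(task, runner_up) < margin)
    return best["name"], ambiguous

skills = [
    {"name": "csv-cleaner", "tags": ["csv", "clean", "dedupe"]},
    {"name": "pdf-tables", "tags": ["pdf", "table", "extract"]},
]
name, ambiguous = route("extract the table from this pdf", skills)
```

Even this toy exposes the design question: what should the agent do when `ambiguous` is true — ask the user, try both, or defer to registry quality signals that mostly do not exist yet?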
Beyond routing, several structural challenges persist. Security is underbuilt: nothing structurally prevents a malicious skill from being uploaded to a public registry, and OpenClaw's VirusTotal partnership only catches shallow threats. Execution unpredictability remains real — in complex long-horizon tasks, agents can misinterpret prompts, hallucinate command flags, or make irreversible modifications to privileged data. Pre-execution validation and rollback mechanisms are largely absent from current production deployments, though both are active areas of research closely tied to benchmarking.
OpenClaw: The Personal Agent Runtime
OpenClaw is a free, self-hosted autonomous AI agent runtime. Technically, it is a persistent Node.js process — the Gateway — that binds to a local port, ingests events from connected messaging channels (WhatsApp, Telegram), manages session state across turns, and runs an agent loop: call a configurable LLM, invoke tools or skills, route results back through the originating channel. It is not itself an agent; it is the runtime that agents run on.
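The agent loop described above can be sketched schematically. This is our own Python schematic of the loop's shape, not OpenClaw's actual Node.js internals; the `stub_llm` and `calc` tool are stand-ins for a real model and real tool bindings.

```python
# Schematic agent loop: model decides, tools run, results feed back,
# and the final answer is routed to the originating channel.
# All components here are stubs for illustration.

def agent_loop(event: dict, llm, tools: dict, session: dict) -> str:
    """One turn: call the model, run requested tools, return the reply."""
    session.setdefault("history", []).append(event["text"])
    action = llm(session["history"])          # model: answer, or use a tool?
    while action["type"] == "tool":
        result = tools[action["name"]](action["args"])
        session["history"].append(f"tool:{action['name']} -> {result}")
        action = llm(session["history"])      # loop until a final answer
    session["history"].append(action["text"])
    return action["text"]

def stub_llm(history):
    """Stub model: uses the calculator once, then answers."""
    if not any(h.startswith("tool:") for h in history):
        return {"type": "tool", "name": "calc", "args": "2+2"}
    return {"type": "final", "text": "The answer is 4."}

session = {}
reply = agent_loop({"text": "what is 2+2?"}, stub_llm,
                   {"calc": lambda expr: eval(expr)}, session)
```

Note that `session` outlives the call: persisting it to disk between turns is exactly the memory-across-sessions property the next paragraph describes.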
Unlike cloud-based chatbots, OpenClaw maintains memory and state across sessions on local disk, enabling persistent, adaptive behavior without a cloud dependency. The system is model-agnostic: it supports Claude, Gemini, DeepSeek, and any local model served via a compatible endpoint such as Ollama. With a Mac Mini M4 or similar, you can run an open-source model plus OpenClaw for a private, low-latency personal agent with no ongoing subscription cost.
Traction and current dynamics
OpenClaw has accumulated over 300,000 GitHub stars and is governed independently with OpenAI sponsorship. DigitalOcean's 1-Click Deploy is an early signal of where accessibility is heading: toward general public adoption. Tencent launched a full OpenClaw-compatible AI product suite on WeChat — the largest single distribution expansion of any platform partner.
Two inflection points from early 2026 are worth understanding. In January, over 21,000 OpenClaw instances were found exposed directly on the public internet, leaking API keys and private chat history. The attack surface here is unusually large: agents run with system-level access and hold credentials for email, calendars, and messaging platforms. In March, the Chinese government responded by restricting state-run enterprises and government agencies from running OpenClaw on office computers.
The other inflection point concerns model economics. Google mass-suspended OpenClaw users in mid-February 2026 for ToS violations tied to its Antigravity IDE, on the grounds that OpenClaw users were running agents that drew immense automated usage against fixed-rate plans. The implication: if the highest-quality models remain hostile or hard to access for autonomous agents, the open-weight ecosystem gains traction, potentially pushing the industry toward model commoditization faster than expected.
World Models: Gaming and Physical AI
Competing architectures
A world model learns to predict how the world will evolve given actions — enabling planning by imagining consequences before executing them. Unlike traditional physics simulators that encode laws explicitly, world models learn physics implicitly from data. The field is currently organized around two distinct bets about what level of abstraction a world model should operate at.
Google DeepMind's Genie 3 operates at the pixel level: it uses a video tokenizer to compress each frame into tokens and an autoregressive dynamics model to predict the next token sequence, producing visually impressive, immediately interactive environments. The causal structure is approximated rather than enforced — there is no variable storing door.state = "open". This produces rich, responsive worlds from video input, but physical causality is emergent, not guaranteed.
Moonlake AI's bet is that a world model built for decision-making should represent the world as a multimodal state space where a single causal event propagates consistently across all modalities at once. Their argument: “Consider a bowling pin. It is simultaneously a textured object in space, a rigid body with mass and inertia, an object that can be knocked down, a symbolic contributor to a score, and a source of sound upon impact.”
Essentially, a pixel model approximates this, but a symbolic causal model enforces it. There is also a practical argument for the symbolic approach, due to data efficiency: internet video is abundant but almost entirely action-free, while symbolic representations like code and language are compact, action-conditional, and naturally produced through human-computer interaction.
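The bowling-pin argument can be made concrete in a few lines: in a symbolic state space, one causal event (a strike) updates every modality of the pin's state at once, rather than each modality being re-inferred per frame. This is our own illustrative sketch of the idea; the field names and thresholds are invented, not Moonlake AI's actual representation.

```python
# Sketch of symbolic, multimodal state: one causal event propagates
# consistently to geometry, scoring, and audio. Fields are illustrative.

from dataclasses import dataclass, field

@dataclass
class Pin:
    upright: bool = True                         # rigid-body / geometry
    score_value: int = 0                         # symbolic scoring
    sounds: list = field(default_factory=list)   # audio modality

def apply_strike(pin: Pin, impulse: float) -> Pin:
    """A single event updates all modalities at once, or none."""
    if impulse > 1.0 and pin.upright:
        pin.upright = False
        pin.score_value = 1
        pin.sounds.append("clatter")
    return pin

pin = apply_strike(Pin(), impulse=2.5)       # knocked down
grazed = apply_strike(Pin(), impulse=0.5)    # stays up, silent, no score
```

A pixel model could render a fallen pin whose score never updated; in this representation that inconsistency is unrepresentable, which is what "enforced" rather than "approximated" causality means.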
Which approach is ideal, or even sufficient, for physical AI training has yet to be determined.
Gaming as the deployment domain
Games are not just training sandboxes — they are a rich, high-value deployment domain in their own right. Games inherently store actions alongside trajectories of consequences, creating natural evaluation environments for agents and their skills. This is why Roblox is particularly interesting: if agents can autonomously create, populate, and maintain Roblox experiences, Roblox effectively becomes an agent operating system. The platform also has a large moat as it sits on billions of hours of action-labeled 3D interaction data that no lab can replicate, and is explicitly multiplayer.
The character AI layer is advancing in parallel. Inworld has established itself as the NPC character AI leader, giving game characters voice, personality, and basic memory. Skills extend these characters with real tools: inventory management, trading, external API calls, long-term goal pursuit, and cross-session learning. The result is NPCs that remember individual players, develop relationships, and participate meaningfully in game economies. EA's expanded partnership with Stability AI in 2026 is the first major publisher adopting a generative asset layer at scale.
Emerging agent economies in games — agents trading items across servers, forming guilds, and creating content for other agents — point toward a fundamental shift in revenue models. In the near future, we could see agent operators paying platforms for compute and infrastructure, alongside agent-to-agent marketplace transaction fees, rather than the traditional human-player-to-platform subscription model.
Part II — The Tracks
All tracks share equal judging weight across three axes: technical quality, execution, and composability or reusability. Below are some high-level thoughts for each track, but please refer to the detailed track outline for precise guidance.
Track 1: New Agent Skills
This track is about expanding the frontier of what agents can reliably do: durable capability packages that can be invoked, measured, reused, and composed into larger systems.
The skills paradigm is still early, and the field is learning what makes a skill actually good. But research is unambiguous on one point: domain-expert-authored skills significantly outperform both unaided models and models naively generating their own skills. That means the most valuable contributions to this track will come from teams that bring genuine domain knowledge, encode it as transferable procedural knowledge, and can quantify the gain against a baseline.
The most interesting design questions here are at the interface layer. How does a skill signal what context it needs? How does an agent fail gracefully when that context is missing? How do two skills that overlap in capability avoid conflicting when loaded together? These are not glamorous questions, but they are the ones separating production-quality skills from hackathon demos.
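One way to make these interface questions concrete: a skill declares the context keys it requires, and invocation fails gracefully, reporting what was missing, instead of running on guesses. The schema below is a hypothetical convention of our own, not part of any registry's specification.

```python
# Sketch of an explicit skill interface: declared context requirements
# plus graceful failure. The "requires"/"run" schema is invented here.

def invoke(skill: dict, context: dict) -> dict:
    """Run a skill only if its declared context requirements are met."""
    missing = [k for k in skill["requires"] if k not in context]
    if missing:
        # Graceful degradation: report what was missing, don't guess.
        return {"ok": False, "missing": missing}
    return {"ok": True, "result": skill["run"](context)}

review_skill = {
    "name": "code-review",
    "requires": ["diff", "style_guide"],
    "run": lambda ctx: f"reviewed {len(ctx['diff'].splitlines())} changed lines",
}

partial = invoke(review_skill, {"diff": "+a\n-b"})   # style_guide missing
full = invoke(review_skill, {"diff": "+a\n-b", "style_guide": "pep8"})
```

Declared requirements also answer the overlap question indirectly: two skills whose requirement sets collide on the same keys are candidates for a conflict check before both are loaded.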
Some potentially interesting directions: a skill that handles a genuinely complex multi-step workflow in a specific domain (legal research, financial analysis, code review) and can demonstrate transfer across at least two different contexts. Think about how you might define how agents hand off tasks and share state, or create error recovery mechanisms that detect execution failure and trigger a fallback without human intervention.
Track 2: Skill Evaluation and Benchmarking
This track is building the truth layer that the skills ecosystem currently lacks. ClawHub has 14,000 skills and essentially no quality signal. The field needs rigorous, automated ways to know whether a skill works — and to characterize exactly where and how it fails, even in open-ended tasks.
This is harder than it sounds. A skill that succeeds 90% of the time in a controlled setting and 40% under realistic task variation is not a good skill — it is a liability that looks like an asset. Good evaluation means characterizing the failure distribution, not just the mean success rate. It means testing under the conditions agents actually encounter: shifting instructions, incomplete context, adversarial inputs, long-horizon tasks where early errors compound.
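The "mean hides the liability" point is easy to demonstrate. The sketch below evaluates a skill per-condition rather than reporting a single aggregate; the conditions and trial outcomes are synthetic, purely for illustration.

```python
# Sketch: characterize a skill's failure distribution per condition
# instead of reporting one mean success rate. Trial data is synthetic.

from collections import defaultdict

def failure_profile(trials):
    """trials: (condition, succeeded) pairs -> per-condition success rate."""
    buckets = defaultdict(list)
    for condition, succeeded in trials:
        buckets[condition].append(succeeded)
    return {cond: sum(runs) / len(runs) for cond, runs in buckets.items()}

trials = (
    [("controlled", True)] * 9 + [("controlled", False)]
    + [("shifted_instructions", True)] * 4
    + [("shifted_instructions", False)] * 6
)

profile = failure_profile(trials)
mean = sum(ok for _, ok in trials) / len(trials)
# The aggregate mean looks acceptable; the per-condition profile shows
# the skill collapsing under realistic instruction shift.
```

An evaluation harness for this track would generate the `shifted_instructions` bucket automatically: paraphrased prompts, dropped context, adversarial inputs, longer horizons.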
The deeper challenge is that many agent tasks do not have a clean ground truth signal. On the open web, knowing whether an agent succeeded is often itself an open problem. Track 2 teams that take this seriously — building evaluation frameworks that reason explicitly about ambiguous success criteria — are working on one of the most important unsolved problems in the space.
Interesting directions include: adversarial harnesses that probe for brittleness in skills that otherwise look robust, latency and cost profiling systems that surface the economic reality of skill deployment, or approaches that are human-in-the-loop and more collaborative in nature, but still scalable.
Track 3: Self-Improving Agents
This track is attempting something genuinely hard: closing the loop between execution data and skill improvement, so that an agent gets measurably better over time without requiring human intervention.
An example state-of-the-art approach applies reinforcement learning to evaluate agent trajectories across task chains and iteratively refines skill text as a function of outcome. Benchmarking infrastructure is what makes this loop honest — without a rigorous evaluation layer (see Track 2), self-improvement just means self-reinforcement of whatever the agent was already doing.
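A greatly simplified version of that loop is sketched below: propose variants of the skill text, score each against an evaluation harness, and keep only measured improvements. This is greedy variant selection, not reinforcement learning proper, and both the proposer and evaluator are stubs standing in for an LLM editor and a Track-2-style benchmark.

```python
# Simplified outcome-driven skill refinement: propose, evaluate, keep
# the best. Proposer and evaluator are illustrative stubs.

import random

def improve(skill_text: str, propose, evaluate, steps: int = 5, seed: int = 0):
    """Greedy loop: only keep candidate skill texts that score higher."""
    rng = random.Random(seed)
    best, best_score = skill_text, evaluate(skill_text)
    for _ in range(steps):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:          # keep only measured improvements
            best, best_score = candidate, score
    return best, best_score

# Stub proposer: append one of a few candidate instructions.
EDITS = ["Check inputs first.", "Retry once on failure.", "Log each step."]
propose = lambda text, rng: text + " " + rng.choice(EDITS)

# Stub evaluator: rewards retry logic (standing in for a real benchmark).
evaluate = lambda text: text.count("Retry")

best, score = improve("Summarize the ticket.", propose, evaluate)
```

The `if score > best_score` guard is where honesty lives: if `evaluate` is weak or gameable, the loop optimizes toward the evaluator's blind spots, which is the self-reinforcement failure mode described above.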
The hardest part of this problem is generalization. It is relatively straightforward to build a system that improves on a specific task with a fixed evaluation metric. It is very hard to build a system that improves in a way that transfers to related tasks. The difference between these two things is the difference between overfitting and learning. Demonstrating genuine generalization — even in a narrow domain, which is a good way to limit your scope — would be a significant result.
Interesting directions to take this: memory systems that accumulate structured knowledge from past execution and influence future behavior in explainable ways — for example, an “experience packet” in which the approach and trajectory of a previous agent are retained alongside the procedural knowledge in the skill.
Track 4: Game Development Pipelines
This track treats gaming as a serious production environment for AI tooling. The focus is orchestration: how do modular skills coordinate into coherent pipelines that accelerate or meaningfully improve real developer workflows?
The interesting design constraint here is that game development workflows are unusually multi-modal and multi-stage. An asset needs to be generated, critiqued, iterated, QA'd, and integrated — and each of those steps has different inputs, outputs, and quality criteria. Skills that work well in isolation often fail when chained together because their output formats don't compose cleanly or their failure modes aren't handled. Building pipelines that are actually coherent end-to-end is the challenge.
The NPC layer is also worth taking seriously. Inworld-style character AI combined with agent skills is already producing NPCs that remember players across sessions, develop persistent relationships, and participate in in-game economies. A dialogue system that generates responses in isolation is a much smaller contribution than one that maintains narrative consistency across sessions, enforces character-specific constraints, and integrates with inventory and quest state.
A common pitfall: building a pipeline where each stage works independently but the connections between stages are fragile. Integration points — how one skill's output becomes the next skill's input — are where real pipelines fail. Design those interfaces explicitly, and don’t get lost in scale: a few assets or characters with robust connections beat many with fragile ones.
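One lightweight way to design those integration points explicitly: each stage declares the keys it needs and contributes its outputs to a shared payload, and the pipeline refuses to chain stages whose interfaces do not line up. The stage names, keys, and toy critique logic below are invented for illustration.

```python
# Sketch of explicit interface design between pipeline stages: each
# stage declares its required inputs; mismatches fail loudly and early.

def run_pipeline(stages, payload: dict) -> dict:
    """Run stages in order, checking each stage's declared inputs."""
    for stage in stages:
        missing = [k for k in stage["needs"] if k not in payload]
        if missing:
            raise ValueError(f"{stage['name']} missing inputs: {missing}")
        payload = {**payload, **stage["fn"](payload)}
    return payload

stages = [
    {"name": "generate", "needs": ["prompt"],
     "fn": lambda p: {"mesh": f"mesh-for-{p['prompt']}"}},
    {"name": "critique", "needs": ["mesh"],
     "fn": lambda p: {"issues": [] if "tree" in p["mesh"] else ["off-brief"]}},
    {"name": "integrate", "needs": ["mesh", "issues"],
     "fn": lambda p: {"integrated": not p["issues"]}},
]

result = run_pipeline(stages, {"prompt": "low-poly tree"})
```

The declared `needs` list is the interface contract: reordering or swapping a stage surfaces an immediate, named error instead of a silently malformed asset three stages downstream.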
Track 5: OpenClaw Orchestration
This track goes deep on what makes OpenClaw genuinely powerful: not single-skill invocations, but multi-skill pipelines where it coordinates tools, manages state across turns, routes tasks to the right capabilities, and handles failure gracefully. The emphasis is on coordination architecture.
The core problem in multi-skill orchestration is state management. OpenClaw maintains memory across sessions, but that memory is unstructured by default. When a pipeline spans multiple skills — research, then synthesis, then formatting, then delivery — the state that needs to persist between steps is not always obvious, and the failure modes that arise from missing or stale state are hard to debug.
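One way to impose structure on that unstructured memory: namespace every write by the step that produced it, so missing or stale state is debuggable instead of silent. This layering is our own convention for the sketch, not an OpenClaw API.

```python
# Sketch of structured session state for a multi-skill pipeline:
# every write records which step produced it, and reads of absent
# keys fail with the name of the step that needed them.

class SessionState:
    def __init__(self):
        self._data = {}
        self._provenance = {}

    def write(self, step: str, key: str, value):
        self._data[key] = value
        self._provenance[key] = step          # remember who produced it

    def read(self, step: str, key: str):
        if key not in self._data:
            raise KeyError(f"{step} needs '{key}', which no step wrote")
        return self._data[key]

    def provenance(self, key: str) -> str:
        """Which step produced this key — the first question when debugging."""
        return self._provenance[key]

state = SessionState()
state.write("research", "sources", ["doc-a", "doc-b"])
summary = " + ".join(state.read("synthesis", "sources"))
state.write("synthesis", "summary", summary)
```

When a research-then-synthesis-then-formatting pipeline breaks, `provenance` answers the first debugging question directly: which step was supposed to have produced the state that is missing or stale.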
The model economics problem is relevant to how you design. OpenClaw's bursty, unpredictable inference demand has generated surprise API bills and prompted platform ToS responses. Orchestration layers that are aware of model costs — that route cheaper tasks to cheaper models, batch operations where possible, and avoid unnecessary inference — are what makes personal agents viable for non-developers.
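The cost-aware routing idea reduces to a small decision rule: estimate the capability a task requires and send it to the cheapest model tier that clears the bar. The tier names, prices, and capability scores below are invented; a real router would use measured quality per task class rather than a static table.

```python
# Sketch of cost-aware model routing: cheapest tier that meets the
# task's capability requirement. All tiers and prices are made up.

MODELS = [  # (name, cost per 1K tokens in USD, capability score)
    ("local-small", 0.0, 1),
    ("mid-tier",    0.4, 2),
    ("frontier",    3.0, 3),
]

def route_by_cost(required_capability: int) -> str:
    """Return the cheapest model whose capability meets the requirement."""
    for name, cost, capability in sorted(MODELS, key=lambda m: m[1]):
        if capability >= required_capability:
            return name
    raise ValueError("no available model meets the requirement")
```

In practice the hard part is the input: estimating `required_capability` per task is itself a routing problem, and getting it wrong in either direction costs money or quality.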
Interesting directions to take this: planning systems that decompose complex objectives into structured execution steps with explicit state handoffs, or monitoring layers that observe multi-skill pipelines and surface failure modes in real time.