Build Day Pre-Event Memo: Agent Harnesses

This Saturday, AGI House is hosting a one-day hackathon centered on agent harnesses at our Hillsborough house. Agent harnesses are one of the most rapidly evolving and promising axes of agentic AI development, and we hope you'll join us and dozens of founders, builders, and enthusiasts in pushing the frontier this weekend.

What Is an Agent Harness, and Why Now?

A foundation model on its own is a stateless inference engine — it generates text, then forgets everything. An agent harness is the infrastructure that turns that engine into something that can actually do work: the runtime that manages tool calls, the sandbox where code executes, the memory system that persists context, the verification layer that catches errors, and the governance framework that keeps the whole thing safe.

When Claude Code edits your codebase, the model decides what to write. But it's the harness that selects which files to read for context, executes the edit on disk, runs the test suite, and decides whether to surface the result or retry. The model provides the intelligence; the harness provides the environment, the tools, and the guardrails. The harness is the difference between "a model that can write code" and "a system that can ship code."
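The division of labor above can be sketched as a minimal harness loop. Everything here is illustrative: `call_model` is a hard-coded stub standing in for a foundation-model API, and the tool set and message format are assumptions, not any product's internals. The shape is the point: the model decides what to do next, while the harness executes tools and owns the stopping rule.

```python
# A minimal harness loop. Illustrative only: call_model is a stub for a
# foundation-model API; the tool set and message format are assumptions.

def call_model(messages):
    # Stand-in for the model call, with a hard-coded policy so the
    # sketch runs offline: ask for a file once, then finish.
    if "TOOL_RESULT" in messages[-1]["content"]:
        return {"type": "final", "content": "done"}
    return {"type": "tool_call", "name": "read_file", "args": {"path": "README.md"}}

# Tools live in the harness; the model only names them.
TOOLS = {"read_file": lambda args: f"<contents of {args['path']}>"}

def run_agent(task, max_steps=10):
    """The harness owns the loop, executes tools, and enforces the step
    budget. The model only decides what to do next."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_model(messages)
        if decision["type"] == "final":
            return decision["content"]
        result = TOOLS[decision["name"]](decision["args"])  # harness touches the world
        messages.append({"role": "tool", "content": f"TOOL_RESULT: {result}"})
    return None  # budget exhausted: surface or retry is also a harness choice

result = run_agent("summarize the README")
```

Note that everything the text attributes to the harness (file access, execution, the retry/stop decision) sits outside the `call_model` boundary.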

Agent harnesses have become a central topic in AI research and deployment over the past year because the gap between model capability and production-grade agent deployment is almost entirely a harness problem. Models can reason, call tools, and generate plans. Making them do so reliably, safely, and at scale is engineering work at the harness layer — sandboxing, verification, memory management, orchestration, governance.

This is also one of the most dynamic parts of the stack right now. Multiple independent teams — Anthropic (Claude Code), Vercel (v0), Manus, and infrastructure practitioners like the RunLoop team — are converging on a counterintuitive finding: as models improve, the best harnesses are getting simpler, not more complex. Stripping out orchestration layers that were built to compensate for weaker models often improves performance with stronger ones. The question of what the harness should do — and what it should stop doing — is changing in real time, and that's what makes this a particularly interesting moment to build.

Where the Frontier Is: How Agent Harnesses Got Here

The early foundations (2022–2023)

The modern agent harness traces back to a handful of research contributions that defined the basic architecture most systems still use. ReAct (Yao et al., 2022) established the canonical agent loop — interleave reasoning with actions and observations — and nearly every production agent today runs some variant of this pattern. Around the same time, MRKL Systems (AI21 Labs, 2022) formalized the idea of using an LLM as an orchestrator that routes to external tools, and Toolformer (Meta, 2023) showed that models could learn when and how to call tools through training rather than relying on the harness to dictate every interaction.

On the memory side, two papers from 2023 set the terms of the current conversation. Generative Agents (Park et al., Stanford) introduced a memory stream architecture — agents store observations as natural language, scored by recency, importance, and relevance, with a reflection mechanism that synthesizes higher-level insights. This remains the most cited reference for agent memory design. Reflexion (Shinn et al.) demonstrated that agents storing verbal self-reflections on past failures dramatically outperform amnesic agents, establishing experiential memory as a real capability rather than a theoretical curiosity.
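The memory stream's retrieval rule can be sketched as a weighted sum of the three scores. The weights, decay rate, and 1-10 importance scale below are illustrative choices, not the paper's exact constants, and relevance is stubbed where a real system would use embedding similarity.

```python
import time

# Generative Agents-style retrieval scoring: recency + importance +
# relevance. Weights and decay are illustrative, not the paper's values.

def recency_score(last_access, now, decay=0.995):
    hours_since = (now - last_access) / 3600
    return decay ** hours_since  # exponential decay per hour since last access

def retrieval_score(memory, relevance, now, weights=(1.0, 1.0, 1.0)):
    w_rec, w_imp, w_rel = weights
    return (w_rec * recency_score(memory["last_access"], now)
            + w_imp * memory["importance"] / 10  # importance rated 1-10
            + w_rel * relevance)                 # e.g. cosine similarity to query

now = time.time()
memories = [
    {"text": "saw a teammate at the cafe", "importance": 2, "last_access": now - 3600},
    {"text": "team decided to migrate to Rust", "importance": 9, "last_access": now - 86400},
]
# Relevance to a query like "what stack are we using?" would come from
# embeddings; stubbed here for the sketch.
relevance = {"saw a teammate at the cafe": 0.1, "team decided to migrate to Rust": 0.8}

ranked = sorted(memories, key=lambda m: retrieval_score(m, relevance[m["text"]], now),
                reverse=True)
```

The interesting design surface is the weighting: an important, relevant memory from yesterday should outrank a trivial one from an hour ago, which is exactly what the combined score produces here.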

This period also saw the first wave of agent hype. AutoGPT went viral in March 2023 as the first autonomous agent to demonstrate the full goal-plan-execute loop. It was brittle and impractical, but it demonstrated both the promise and the harness challenges of autonomy — and it catalyzed the entire ecosystem. OpenAI's release of function calling (June 2023) then marked the first major shift of tool use from a harness-engineering problem to a model capability, triggering an early round of harness simplification.

The framework era and its limits (2023–2024)

The perceived complexity of building agents spawned a wave of orchestration frameworks. LangChain provided chain-based primitives and tool wrappers. LangGraph added stateful, graph-based orchestration with conditional branching. AutoGen (Microsoft) emphasized multi-agent conversation. CrewAI offered role-based agent teams. MetaGPT encoded standard operating procedures into multi-agent collaboration with specialized roles. MemGPT (Packer et al., UC Berkeley) introduced the idea of treating the context window as virtual memory — with "main context" as RAM and external storage as disk, managed by the agent itself through explicit read/write operations.

Alongside frameworks, benchmarks started revealing how hard real-world agent tasks actually are. SWE-bench (Princeton, 2023) became the standard for coding agents — real GitHub issues that agents must diagnose and fix — and demonstrated that harness design (sandbox quality, file access, test execution) matters as much as model capability. WebArena and OSWorld established benchmarks for web and desktop agents, showing enormous gaps between what models can do in theory and what agents accomplish in practice.

The current moment: simplification, protocols, and the reasoning shift (2024–2025)

The most important recent developments aren't new frameworks — they're structural shifts in what the harness needs to do and how the ecosystem connects.

Reasoning models changed the harness equation. The release of OpenAI's o1 (September 2024) and its successors (o3, o4-mini), alongside DeepSeek R1, Claude's extended thinking, and Gemini's Deep Think mode, moved multi-step planning and reasoning inside the model. This had a direct and observable impact on harness design: planning scaffolding that was built to compensate for models that couldn't plan well became unnecessary, and in some cases actively degraded performance. The harness's role shifted from directing the model's reasoning to supporting it with the right tools and constraints.

MCP and A2A are standardizing the tool layer. Anthropic's Model Context Protocol (November 2024) proposed an open standard for connecting agents to tools and data sources, and it has been adopted remarkably fast across the ecosystem, including by Cursor, Windsurf, Sourcegraph, OpenAI, and Google's ADK. Google's Agent2Agent (A2A) protocol (April 2025) complements MCP by standardizing how agents communicate with each other. Together, these protocols are doing for agent infrastructure what HTTP did for the web: creating a shared interface layer that makes tools and agents interoperable across platforms. This is still early — authentication, discovery, and trust frameworks are unresolved — but the direction is clear.

The simplification trend is the defining dynamic of the current moment. The Claude Code system prompt leak was one of the most revealing artifacts we have about how a frontier lab actually builds an agent harness. What it showed was striking: a remarkably thin harness consisting of behavioral contracts and tool definitions, with no elaborate planning scaffolding, no multi-agent routing, no complex output parsing. Anthropic had stripped out planning mechanisms as Claude's native reasoning improved. What remained was almost entirely governance — behavioral rules, permission boundaries, context management guidance — and infrastructure — tool access, execution environment.

This wasn't an isolated finding. Manus (the agent system from Monica.im) has publicly discussed re-architecting its harness five times since early 2024, each time because a model improvement made part of its orchestration unnecessary. Vercel removed output validation layers from v0 when model upgrades rendered them redundant. RunLoop's team described the market as returning to a "harness-light" pattern: a powerful model, a while loop, tools, and well-curated context. DSPy (Stanford) and smolagents (Hugging Face) showed that compiler-style prompt optimization and code-as-action approaches can replace much of the structural complexity that frameworks introduced.

The emerging picture is that the harness's job is undergoing a phase transition. Early harnesses compensated for model weakness — they parsed flaky outputs, decomposed tasks the model couldn't handle, retried format failures. Current harnesses, built for stronger models, are shifting toward governing model strength — sandboxing execution, enforcing permissions, verifying outputs, persisting state, and maintaining audit trails. The intelligence is increasingly in the model; the value of the harness is in the environment, the verification, and the governance.

The Three Tracks

Track 1: Finance and the Agentic Stack

Finance is one of the most natural verticals for agent deployment and one of the hardest from a harness perspective. The tasks are high-value — analysis, reconciliation, compliance, procurement — but the environment is unforgiving. Errors are costly, outputs must be explainable and auditable, data sources are messy and real-time, and regulatory requirements impose hard constraints on what an agent can and cannot do.

This creates harness requirements that go well beyond what any general-purpose framework provides. A financial agent needs live data access with structured extraction, verification layers that catch numerical and logical errors before they propagate, audit trails that record every action and decision, compliance checks against regulatory constraints, and approval workflows for high-stakes actions.

A key insight from the broader agent harness research applies particularly well here: verification is the mechanism that converts probabilistic model output into production-grade reliability. Coding agents lead other verticals precisely because code has rich automated verification infrastructure — test suites, type checkers, compilers. Finance has analogous verification opportunities — numerical consistency checks, regulatory format validation, cross-source data reconciliation — but the verification tooling hasn't been built yet. Building this layer for financial workflows is one of the highest-impact things a team could take on.
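To make the verification idea concrete, here is a hedged sketch of two such checks: cross-source reconciliation of claimed numbers, and an internal accounting invariant. The claim schema, metric names, and 0.5% tolerance are assumptions for illustration; the structural point is that these checks run in the harness before output is surfaced.

```python
# Harness-level verification for financial claims. Claim schema, metric
# names, and the 0.5% tolerance are illustrative assumptions.

def verify_claims(claims, source, rel_tol=0.005):
    """Cross-check each claimed number against source data."""
    failures = []
    for claim in claims:
        truth = source.get(claim["metric"])
        if truth is None:
            failures.append((claim["metric"], "no source value to check"))
        elif abs(claim["value"] - truth) > rel_tol * abs(truth):
            failures.append((claim["metric"],
                             f"claimed {claim['value']}, source says {truth}"))
    return failures

def gross_profit_consistent(m, rel_tol=0.005):
    """Example internal invariant: gross profit = revenue - COGS."""
    expected = m["revenue"] - m["cogs"]
    return abs(expected - m["gross_profit"]) <= rel_tol * abs(expected)

source = {"revenue": 81462, "gross_profit": 35623}
claims = [
    {"metric": "revenue", "value": 81462},       # matches source
    {"metric": "gross_profit", "value": 38000},  # plausible but wrong
]
problems = verify_claims(claims, source)
```

The second claim is exactly the failure mode the text describes: a plausible-sounding number that no purely generative pipeline would catch.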

The "agent-on-box" pattern is also worth noting for this track. Rather than having agents send sensitive financial data to external APIs for processing, the agent operates within a controlled sandbox environment where data never leaves the security boundary — a particularly important architectural choice for regulated industries.

Project Ideas

Project 1: Financial Analysis Agent with Live Data and Verification

Difficulty: Intermediate

Build an agent that pulls live financial data (earnings, filings, market data), synthesizes analysis, and produces structured output — with a verification layer that cross-checks numerical claims against source data before surfacing results. The verification layer is the core contribution: most financial AI demos generate plausible-sounding analysis with no check on whether the numbers are correct. Building harness-level verification — does this claim match the source? are the calculations internally consistent? — is what separates a demo from something deployable.

Stack: Foundation model API, Exa API for search and data extraction (free credits provided), Python for verification logic, simple web interface for output.

Project 2: Multi-Agent ERP Harness

Difficulty: Advanced

Build a multi-agent system where specialized agents handle distinct ERP domains — finance/accounting, inventory, procurement, compliance — with a coordination layer that manages handoffs, shared state, and a full audit trail. The harness here is the product: the orchestration, conflict resolution, and audit logging across agents. The challenge isn't making each agent work individually — it's making them coordinate without conflicts, maintain consistency, and handle failures gracefully.

Stack: Foundation model API, shared state store (structured JSON or SQLite), orchestration layer (LangGraph, OpenAI Agents SDK, or custom while-loop orchestration).

Project 3: Compliance and Audit Trail Agent

Difficulty: Beginner–Intermediate

Build an agent that processes financial documents (invoices, contracts, transaction records) and checks them against a set of compliance rules, flagging violations and generating an audit report. The emphasis is on the harness: every agent decision, tool call, and data access should be logged in a structured, queryable format. Audit and explainability are table-stakes requirements for financial agent deployment that most demos ignore entirely. Building the logging and governance harness around an agent — not just the agent itself — is the differentiating skill.

Stack: Foundation model API, PDF/document parsing, rule engine (can be a simple set of Python functions), structured logging framework.
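One way to make "every tool call is logged" concrete is a wrapper in the harness layer that turns each invocation into a structured record. The field names and the toy invoice rule below are illustrative assumptions, not a prescribed schema.

```python
import time
import uuid

# Audit-trail sketch for a compliance agent: every tool call becomes a
# structured, queryable record. Field names and the invoice rule are
# illustrative assumptions.

AUDIT_LOG = []

def audited(tool_name, fn):
    """Wrap a tool so every invocation appends an audit record."""
    def wrapper(**kwargs):
        record = {"id": str(uuid.uuid4()), "ts": time.time(),
                  "tool": tool_name, "args": kwargs}
        try:
            record["result"] = fn(**kwargs)
            record["status"] = "ok"
            return record["result"]
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            AUDIT_LOG.append(record)  # logged whether the call succeeded or not
    return wrapper

# A toy compliance rule: invoices over the limit are flagged.
within_limit = audited("within_limit", lambda amount, limit=10000: amount <= limit)

within_limit(amount=2500)
within_limit(amount=50000)

# Because the log is plain data, queries are trivial (or export as JSON lines).
flagged = [r for r in AUDIT_LOG if r["status"] == "ok" and r["result"] is False]
```

The design choice worth copying is that logging happens in the wrapper, not inside each tool, so no agent action can bypass the trail.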

Project 4: Data Reconciliation Agent

Difficulty: Intermediate

Build an agent that takes data from multiple financial sources (spreadsheets, APIs, databases), identifies discrepancies, and produces a reconciliation report with explanations for each mismatch. The harness challenge is managing multiple data connections, handling format inconsistencies, and verifying that the reconciliation logic is correct. Reconciliation is a massive, tedious, high-value financial task that's well-suited for agents if the harness can ensure accuracy.

Stack: Foundation model API, Pandas or similar data processing, multiple data source connectors (CSV, API, database), verification logic.

Track 2: Agent Memory Architecture

The most capable agents today are also the most amnesic. They lose context between sessions, can't learn from past mistakes, and have no way to query their own history. Memory is the harness layer with the widest gap between what's needed and what exists, and it's among the layers least likely to be absorbed by model providers, making it a particularly open area for new work.

It helps to distinguish four types of agent memory, each at a different level of maturity.

Conversational context — what happened earlier in this session — is largely solved by expanding context windows. With models approaching a million tokens, in-session memory is a non-problem for most use cases. The remaining challenge is curation: more context isn't always better, and deciding what's relevant enough to include still matters for quality.

Cross-session persistence — remembering things across conversations — is where frontier labs are actively building. ChatGPT Memory and Claude Memory both extract and persist facts across sessions. But what they provide is essentially a flat key-value store with no structure, no temporal reasoning, and no decay. They remember that you prefer Python but can't represent that your team migrated to Rust six months ago. For production agent systems, this is insufficient.

Episodic and experiential memory — learning from past actions — is the least explored and potentially most valuable type. When an agent fails, adjusts its approach, and succeeds, that experience typically disappears. Reflexion and Voyager (Wang et al., 2023, which accumulated a reusable skill library while playing Minecraft) showed that agents with experiential memory dramatically outperform amnesic ones, but nobody has productized this at scale. The hard problems are unsolved: when to create a memory, how to index it for retrieval, how to consolidate across experiences, and how to handle contradictions and decay.

Shared and organizational memory — knowledge spanning multiple agents and users — is fundamentally a distributed systems problem. Multiple agents collaborating on a project need shared state with consistency guarantees, access controls, and conflict resolution. This is entirely a harness and infrastructure concern, with no role for model providers since the requirements are determined by the deployment context.

The architectural insight from MemGPT/Letta is that memory management should be a capability of the agent itself, not a passive store managed externally by the harness. The agent decides what to commit to long-term memory, what to retrieve, what to consolidate, and what to forget — using explicit memory operations as tools. This works because the model understands relevance better than any fixed heuristic; the harness provides the storage and retrieval infrastructure, while the agent provides the curation intelligence.
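The pattern can be sketched as a store owned by the harness whose operations are exposed to the model as tools. The tool names and schemas below are illustrative, not the actual Letta/MemGPT interface.

```python
# MemGPT/Letta-style agent-managed memory: the harness provides storage;
# the agent curates it through explicit tool calls. Names and schemas
# here are illustrative assumptions, not Letta's actual interface.

class MemoryStore:
    def __init__(self):
        self.records = {}
        self._next_id = 0

    def write(self, text, tags=()):
        self._next_id += 1
        self.records[self._next_id] = {"text": text, "tags": set(tags)}
        return self._next_id

    def search(self, tag):
        return [r["text"] for r in self.records.values() if tag in r["tags"]]

    def forget(self, memory_id):
        return self.records.pop(memory_id, None) is not None

store = MemoryStore()

# What the harness exposes to the model: the model decides *when* to
# call these; the harness only executes.
MEMORY_TOOLS = {
    "memory_write": store.write,
    "memory_search": store.search,
    "memory_forget": store.forget,
}

stale_id = MEMORY_TOOLS["memory_write"]("Team prefers Python", tags=["stack"])
MEMORY_TOOLS["memory_write"]("Team migrated to Rust", tags=["stack"])
MEMORY_TOOLS["memory_forget"](stale_id)  # the agent, not a heuristic, decides to forget
```

The final line is the crux: deciding that an old fact is now stale is a relevance judgment, which is exactly the part the model does better than any fixed policy.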

Project Ideas

Project 5: Episodic Memory System with Reflection

Difficulty: Intermediate–Advanced

Build a memory architecture where an agent stores structured records of its task attempts — what it tried, what worked, what failed, and a natural-language reflection on why. On future tasks, the agent retrieves relevant past episodes and uses them to inform its approach. Evaluate whether this improves performance on a repeated task set. The memory schema design and retrieval strategy are the core contributions — this is one of the most open problems in agent research.

Stack: Foundation model API, vector database or embedding store (Chroma, Qdrant, or in-memory FAISS), structured memory schema, evaluation harness to measure improvement over time.
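One possible episode schema for this project is sketched below. The field set is an assumption on our part; deciding what to record, and how to retrieve it, is the actual design problem the project asks you to solve.

```python
from dataclasses import dataclass, field

# A candidate episode schema for experiential memory. The fields are an
# illustrative assumption; the schema itself is the design problem.

@dataclass
class Episode:
    task: str          # what the agent was asked to do
    approach: str      # what it tried
    outcome: str       # "success" or "failure"
    reflection: str    # natural-language lesson on why
    tags: list = field(default_factory=list)

EPISODES = []

def recall(keywords):
    """Naive keyword match; a real system would retrieve by embedding."""
    return [e for e in EPISODES
            if any(k.lower() in e.task.lower() for k in keywords)]

EPISODES.append(Episode(
    task="Scrape quarterly filings from EDGAR",
    approach="Fetched bulk pages with no rate limiting",
    outcome="failure",
    reflection="Got throttled; back off and cache responses next time",
    tags=["scraping"],
))

# Before a similar future task, surface lessons from past failures and
# prepend them to the agent's context.
lessons = [e.reflection for e in recall(["filings"]) if e.outcome == "failure"]
```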

Project 6: Cross-Agent Shared Memory Store

Difficulty: Advanced

Build a shared memory system for multiple agents working on related tasks. Agents read from and write to a common store, with conflict resolution for contradictory writes and access controls for sensitive information. Test with a multi-agent workflow where agents in different roles (researcher, analyst, writer) need to share findings without passing increasingly long message chains back and forth.

Stack: Foundation model API, shared state store (Redis, SQLite, or custom), multi-agent framework or custom orchestration, conflict resolution logic.

Project 7: Memory Compression and Retrieval Benchmark

Difficulty: Intermediate

Build and evaluate different strategies for compressing long agent histories into retrievable memory. Compare approaches: raw storage with embedding-based retrieval, progressive summarization (summarize every N steps), hierarchical memory (detailed recent, compressed older), and agent-managed memory (the agent decides what to keep). Measure retrieval quality and downstream task performance. The community needs empirical comparisons — not just "does it work?" but "under what conditions does each approach win?"

Stack: Foundation model API, embedding model, vector store, evaluation framework, a task suite that requires recalling past context.
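To ground one arm of the comparison, here is a sketch of progressive summarization. `summarize` is stubbed where a real implementation would make a model call, and the window size is an assumption to tune experimentally.

```python
# Progressive summarization, one of the strategies under comparison.
# summarize() is a stub for an LLM call; window size is an assumption.

def summarize(steps):
    return f"[summary of {len(steps)} steps]"  # stub for a model-written summary

def compress_history(history, window=4):
    """Keep the last `window` steps verbatim; fold everything older
    into a single running summary entry."""
    if len(history) <= window:
        return list(history)
    older, recent = history[:-window], history[-window:]
    return [summarize(older)] + recent

history = [f"step {i}" for i in range(10)]
compressed = compress_history(history)
```

The benchmark's job is then to measure what this compression costs: which downstream tasks still succeed when "step 3" survives only inside the summary?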

Project 8: Self-Correcting Agent via Memory-Based Drift Detection

Difficulty: Advanced

Build an agent that monitors its own behavior over time by comparing current actions to patterns in its memory of past successful executions. When it detects divergence from established successful patterns — drift — it flags the issue and self-corrects. This is a critical unsolved problem for autonomous agents: systems that run for extended periods need a mechanism for noticing when they've gone off track. Building this as a memory-powered capability is a novel approach.

Stack: Foundation model API, memory store with action logs, anomaly detection logic (statistical or LLM-based), task environment for testing.

Track 3: Harness Optimization

The conventional wisdom has been that more capable agents need more harness — more tools, more middleware, more orchestration layers. Recent evidence challenges this directly. As discussed in the frontier overview, multiple teams have independently found that stripping back the harness and simplifying the architecture around a strong model produces measurably better results.

But "simpler is better" is an oversimplification of the real finding. The more precise question is: which harness components add value, which add friction, and how do you tell the difference? Some complexity is genuinely necessary — verification loops, security sandboxing, durable state management. Other complexity was built to compensate for model limitations that no longer exist. The challenge is that these two categories aren't always obvious, and they change every time the underlying model improves.

This track is research-oriented. The goal is not just to build a working agent but to produce evidence — a controlled comparison, a benchmark, an ablation study — that helps the community understand what matters in harness design and what doesn't. The strongest outputs here will be findings about harness architecture, not just demos of agent capability.

Project Ideas

Project 9: Thin vs. Thick Harness Benchmark

Difficulty: Intermediate

Take a well-defined task suite and compare agent performance across harness configurations: a minimal harness (model + tools + simple loop), a framework-based harness (LangGraph or similar with planning and routing), and a multi-agent harness (specialized agents with coordination). Measure not just final output quality but also latency, token cost, and failure modes. The "simpler is better" claim is widely repeated but rarely rigorously tested — a controlled comparison with proper measurement is a genuine contribution.

Stack: Foundation model API, at least two harness configurations, task suite with evaluation criteria, measurement instrumentation.

Project 10: Context Curation Experiment

Difficulty: Beginner–Intermediate

Test how context curation strategy affects agent performance. For a given task type, compare: stuffing the full context window with all available information, using embedding-based retrieval to select relevant context, having the model itself request what it needs (agent-managed context), and a hand-curated baseline. The "just use the full context window" assumption is the default as windows expand, but research has shown that models degrade when relevant information is buried in irrelevant context. Understanding when curation matters — and when it doesn't — is directly actionable.

Stack: Foundation model API, embedding model, document set, evaluation framework.

Project 11: Harness Component Ablation Study

Difficulty: Intermediate–Advanced

Start with a full-featured agent harness (planning layer, tool routing, output validation, retry logic, memory retrieval) and systematically remove components one at a time, measuring performance after each removal. Identify which components are load-bearing (performance drops when removed) and which are dead weight (performance stays the same or improves). Ablation studies are standard practice in ML research but almost never done for agent harnesses — a ranked list of "this component matters for this task type, this one doesn't" is exactly the kind of empirical evidence the field needs.

Stack: Foundation model API, modular harness implementation where components can be toggled, task suite, evaluation metrics.
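The "modular harness" requirement can be as simple as treating each component as a flag, so every ablation is just a config. The component names below are illustrative, and `run_with` is a stub for "run the task suite with these components and score it."

```python
# Toggleable harness for an ablation study: components are flags, so
# each ablation is a config. Names are illustrative; run_with is a stub.

FULL = {"planning", "validation", "retry", "memory"}

def ablations(full=FULL):
    """Yield the full harness plus every leave-one-out configuration."""
    yield frozenset(full)
    for removed in sorted(full):
        yield frozenset(full - {removed})

def run_with(components, task):
    # Stub: a real study would execute the agent with only these
    # components enabled and return quality, latency, and cost metrics.
    return {"task": task, "components": sorted(components)}

configs = list(ablations())
results = [run_with(c, "demo task") for c in configs]
```

Running every configuration over the same task suite, with the same model and seed where possible, is what turns "we removed the planner and it felt fine" into the ranked, load-bearing-vs-dead-weight list the track asks for.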

Project 12: Prompt Engineering vs. Structural Complexity

Difficulty: Beginner–Intermediate

Take a task that's typically solved with structural harness complexity — for example, a multi-step research task with a planning agent, researcher agent, and writer agent — and try to match or exceed its performance using a single model call with careful prompt engineering and context design. This directly tests the single-agent vs. multi-agent question. The most useful finding would be a clear characterization of where the multi-agent approach genuinely helps versus where it just adds overhead.

Stack: Foundation model API, two implementations (multi-agent and single-prompt), evaluation criteria.

Resources and Setup Notes

  • Exa API: Free credits will be available for web search and data extraction — particularly useful for Track 1 (financial data) and any project requiring live information retrieval.
  • Evaluation and observability: Consider using Langfuse (open-source) or simple custom logging for tracing agent behavior. For any project, instrumenting what the agent does — not just what it outputs — will make your results more compelling and debuggable.

The strongest projects won't just build an agent — they'll build or improve the harness, and show why the harness design choices matter. We're looking forward to seeing what you build.