Executive Summary

The coding agent ecosystem is rapidly evolving, with products differentiating across multiple dimensions: the scope of changes they produce, the frequency of human review they require, their interaction surfaces, and their target use cases. This memo presents a framework for understanding the landscape through two dimensions of autonomy—scope (from incremental edits to greenfield applications) and review frequency (from continuous approval to asynchronous output review)—while examining how different products serve distinct user segments ranging from "vibe coders" building quick prototypes to enterprise teams maintaining mission-critical systems.

This analysis draws on insights from Gemini Build Day, a one-day event we hosted on December 13th in Hillsborough in partnership with Google and Graphon AI to mark the launch of Gemini 3. Over 200 builders—including AI/ML founders, engineers, and ML/AI PhDs from top universities—spent eighteen hours exploring what becomes possible when models can truly reason across text, images, audio, and video simultaneously. Projects spanned AI-driven personal assistants, real estate intelligence, educational platforms, content creation tools, security infrastructure, healthcare applications, sales automation, and robotics systems. The event crystallized many of the themes in this memo: the demand for greenfield scope with multimodal verification, the tension between autonomy and review frequency, and the infrastructure requirements that differ sharply between rapid prototyping and enterprise deployment.

We argue that while multi-agent systems face coordination challenges today, they remain the path forward—the solution lies not in retreating to single-agent architectures but in advancing orchestration, memory, and tooling capabilities. The market is consolidating around platform plays, with context engineering emerging as a key differentiator as models themselves commoditize.

Part I: The Spectrum of Human-Agent Interaction

Understanding User Archetypes

Developers approach coding agents with fundamentally different expectations based on their context, experience level, and objectives. Three primary archetypes have emerged:

Vibe Coders represent a growing segment of users—often indie developers, solo founders, designers, or product managers—who prioritize speed over precision. They want to describe an application in natural language and see it materialize quickly. Their time horizons are short: build an MVP, test an idea, create a demo. They are comfortable with imperfect code that works, and they'd rather iterate through conversation than write code themselves. For these users, the agent is less a pair programmer and more an on-demand development team.
Professional Developers in Large Codebases have the opposite set of concerns. They work within established systems with years of accumulated complexity—internal libraries, legacy dependencies, proprietary patterns, and institutional knowledge embedded in code structure. For them, an agent must understand context deeply and make incremental, surgical changes that don't introduce regressions. They value reliability over speed, preferring an agent that makes five confident edits over one that attempts fifty speculative changes. The agent here acts as an intelligent assistant that augments human judgment rather than replacing it.
Broad Development Users fall somewhere between these poles. They might be building new features within existing projects, exploring unfamiliar parts of a codebase, or learning a new framework. They want more autonomy than traditional autocomplete provides but aren't ready to hand over entire projects. They benefit from agents that can plan, execute multi-step tasks, and explain their reasoning—but they still want to review significant changes before they land.

These archetypes aren't fixed; the same developer might vibe-code a prototype on Monday and carefully review agent-suggested refactors in a production system on Friday. The best tools recognize this fluidity.

Part II: The Two-Axis Framework

Coding agents vary along two distinct dimensions of autonomy. Both axes describe how much independence the agent exercises, but they capture different aspects of that independence: the scope of changes the agent makes, and the frequency of human review during the process.

Axis 1: Scope of Changes (Incremental to Greenfield)

The first axis measures the scale of changes an agent is designed to produce—from small, surgical edits to entire applications built from scratch.

Incremental Scope tools focus on bounded modifications within existing systems. They might complete a function, fix a bug, refactor a method, or add error handling to existing code. GitHub Copilot's inline suggestions exemplify this: each suggestion is a small, contained change that the developer evaluates in isolation. The agent works within the structure the human has established, making improvements at the margins rather than reshaping the architecture.
Multi-File Scope agents can coordinate changes across a project. Given an instruction like "rename this API endpoint and update all call sites," they navigate multiple files, understand dependencies, and produce coherent modifications that span the codebase. Claude Code and Cursor's Commander mode operate here—capable of planning and executing changes that touch many files while maintaining consistency.
Greenfield Scope agents can build entire applications or major features from specifications. You describe what you want; the agent architects, implements, and delivers. Google's Jules can take a GitHub issue and produce a complete implementation across multiple files. Vibe coding platforms like Lovable and Bolt.new generate entire applications from natural language descriptions. The agent isn't editing existing work—it's creating from scratch.

Axis 2: Review Frequency (Continuous to Asynchronous)

The second axis captures how often human review interrupts the agent's work—from constant approval requirements to extended autonomous operation with review only at completion.

High-Frequency Review means the agent pauses regularly for human confirmation. Claude Code's incremental permission system exemplifies this: the agent might perform a handful of actions, then surface its reasoning and request approval before proceeding. Each checkpoint gives the human an opportunity to course-correct. The workflow feels collaborative—agent and human in constant dialogue.
Medium-Frequency Review involves checkpoints at significant decision points rather than after every action. An agent might formulate a plan, request approval, execute the plan, and then present results for review. Cursor's Plan mode works this way: the agent proposes a multi-step approach, the human approves, and the agent executes. Review happens at phase boundaries, not continuously.
Low-Frequency Review characterizes agents that work for extended periods before surfacing results. Jules operates asynchronously—it clones a repository, works through implementation, and eventually opens a pull request. The human reviews the output, not the process. Devin, in its original form, could work autonomously for hours or days before presenting results.
Multimodal Artifacts represent an interesting position on this axis. Google Antigravity's visual verification—screenshots, recordings, browser tests—provides review without requiring line-by-line code inspection. The human reviews outputs (what the application looks like, how it behaves) rather than implementation (the code itself). This can be faster than traditional code review, particularly for UI-heavy work, but it's less granular. It's low-frequency in terms of interrupting agent work, but the artifacts themselves enable rapid assessment.

Mapping Products to the Framework

The two axes create a space where products cluster according to their design philosophy:

Product	Scope of Changes	Review Frequency	Notes
GitHub Copilot	Incremental	High	Inline suggestions, constant human control
Cursor (standard mode)	Incremental-to-Multi-File	Medium-High	IDE-integrated with visual feedback
Cursor (Commander)	Multi-File	Medium	Executes multi-file plans after approval
Claude Code	Multi-File	High	Permission system, transparent reasoning, regular checkpoints
Windsurf	Multi-File	Medium	Self-verification loops, judge agents reduce need for human review
Jules	Greenfield	Low (PR-based)	Autonomous execution, review at PR stage
Antigravity	Greenfield	Low (Multimodal)	Visual artifacts enable rapid output review
Devin	Greenfield	Low	Extended autonomous operation before review
Lovable / Bolt.new	Greenfield	Low (Conversational)	Full apps from prompts, review via iteration

Use Case Alignment

The framework maps naturally to user archetypes:

Professional developers in large codebases typically want incremental-to-multi-file scope with high review frequency. They're modifying complex existing systems where errors compound quickly. They want the agent to help, but they need to understand and approve changes before they land. Claude Code and Cursor's standard modes fit this profile.
Vibe coders and prototypers want greenfield scope with low review frequency. They're building new things quickly; they'd rather iterate on a working (if imperfect) application than review code line by line. Lovable, Bolt.new, and Antigravity serve this segment. Multimodal artifacts are particularly valuable here—a screenshot confirms correct layout faster than reading CSS.
Teams with established workflows may want high scope but also structured review—greenfield capability that integrates with PR-based processes. Jules occupies this niche: it can build substantial features autonomously, but outputs flow through standard code review before merge.

Part III: Interaction Surfaces

Where Users Meet Agents

The interface through which developers interact with coding agents profoundly shapes the experience. Each surface carries assumptions about workflow, context, and control.

Command Line Interfaces (CLI) appeal to power users who live in terminals. Claude Code and Aider exemplify this approach—conversational agents invoked from the shell that can read and modify local files, run commands, and integrate into scripted workflows. The CLI surface provides flexibility and composability; an agent can be part of a larger toolchain. However, it offers less visual feedback than richer interfaces, and it assumes comfort with terminal-based workflows. CLI agents excel for developers who want the agent as a utility they invoke, not an environment they inhabit.
Integrated Development Environments (IDE) embed agents directly into the coding environment. Cursor, Windsurf, and VS Code with Copilot represent this model. The agent sees what the developer sees—open files, project structure, editor state. It can offer suggestions in context, highlight relevant code, and apply changes with a keystroke. IDE integration suits day-to-day development work, where the agent augments a workflow the developer already controls. The majority of developers report preferring IDE-based agents for routine coding tasks.
Browser Workspaces serve users who want to minimize local setup. Replit, Vercel's v0, Google Antigravity, and Bolt.new provide cloud-based environments where the agent and the execution environment are co-located. This surface is particularly friendly to vibe coders, designers, and non-engineers who want to describe an application and see it run without configuring a local development environment. The browser workspace also enables deployment integration—code generated in the workspace can go live with a button click.
Pull Request and Issue Surfaces embed agents in collaboration workflows. Sweep AI monitors GitHub issues and opens pull requests to address them. Graphite's AI reviewer (now part of Cursor) operates within the PR review context. These surfaces position the agent as a participant in team workflows rather than a tool an individual wields. They're natural fits for maintenance tasks, routine fixes, and the kind of work that often sits in a backlog because no human has time for it.
Chat and Messaging Platforms (Slack, Teams, etc.) represent an emerging surface. An agent that lives in Slack can be assigned tasks asynchronously, report progress, and integrate with team communication. This model suits organizations that want coding agents to be accessible to non-developers or to operate on timelines decoupled from any individual's work session.

Part IV: Verification and Trust

How Users Confirm Agent Work

Trust in coding agents depends critically on how their work is verified. Different verification mechanisms correspond to different positions on the review frequency axis—and the appropriate mechanism depends on both the scope of changes and the use case.

Code Diffs and Pull Requests remain the gold standard for professional development, particularly when review frequency is medium to high. A diff shows exactly what changed, line by line. It integrates with existing review workflows, allows comments and discussion, and creates an audit trail. Nearly all serious coding agents now output diffs or PRs as their primary artifact. Jules creates PRs for human review. Sweep opens PRs linked to the issues they address. Even agents that operate with low review frequency ultimately produce changes that flow through version control—the review just happens at the end rather than throughout.
Test Results and Runtime Feedback provide automated verification that can reduce the need for human review frequency. An agent that runs the test suite after making changes and reports results gives developers confidence without requiring line-by-line inspection. Windsurf and Antigravity support self-verification loops where the agent detects test failures and attempts fixes iteratively. This approach is particularly valuable for greenfield scope work and QA-focused tasks, where ensuring correct behavior matters more than scrutinizing implementation details.
Multimodal Artifacts represent a verification modality that enables low review frequency without sacrificing confidence. Antigravity's agents can spawn headless browsers to test UI changes, capturing screenshots or recordings as evidence. An agent might present a video demonstrating that a feature works as expected, allowing visual verification faster than code review. This approach is especially valuable for vibe coders building UI-heavy applications—a glance at a screenshot can confirm correct layout more efficiently than parsing CSS. Multimodal artifacts shift review from "examine the implementation" to "verify the output," which is often more efficient for greenfield work.
Ephemeral Verification Agents are an emerging pattern that addresses review frequency differently. Rather than relying on a single agent to both write and verify code, some systems spawn separate "judge" agents to critique changes. Windsurf's architecture includes judge agents that evaluate the primary agent's work, catching errors before they surface to the user. This automated review can substitute for some human review, enabling lower human review frequency while maintaining quality.

Matching Verification to Use Case

The appropriate review mechanism depends on where the use case falls on both axes:

Vibe coding and prototyping (greenfield scope, low review frequency): Multimodal artifacts and rapid visual feedback. Speed matters more than exhaustive review. The goal is confirming the output works, not scrutinizing the implementation.
Professional development in large codebases (incremental-to-multi-file scope, high review frequency): Code diffs, PR-based review, and test results. Reliability and auditability matter most. Every change needs to be understood before it lands.
QA and testing workflows (variable scope, medium review frequency): Test results, runtime feedback, and screenshots. The focus is behavior verification, which automated testing can partially address.
Maintenance and routine fixes (incremental scope, medium review frequency): Automated tests with human spot-checking. High volume necessitates efficiency, but changes are bounded enough that risk is manageable.

Part V: Product Landscape

AI Coding Agents Product Landscape

Product / Company	Description & Characteristics	Scope & Review Frequency	Interaction Surface
Google Antigravity	Cloud-based IDE functioning as agent workspace with multimodal artifact generation (screenshots, recordings, headless browser testing). Enables confident review without line-by-line code inspection. Strong cloud integration with GCP.	Greenfield scope with low review frequency (multimodal artifacts enable review of outputs rather than process)	Browser workspace (cloud-based IDE)
Google Jules	Asynchronous GitHub issue handler. Clones repositories into cloud VMs, formulates plans, implements complete features, opens PRs. Operates autonomously during execution with review checkpoint at completion.	Greenfield scope with low review frequency (works asynchronously, review happens at PR stage)	Pull request/issue surface (GitHub integration)
Microsoft/GitHub Copilot	Conservative approach emphasizing inline autocomplete, chat, PR summaries, and security scanning. Always within VS Code/GitHub context with human reviewing each suggestion. Includes security filtering and compliance policy integration for enterprise.	Incremental scope with high review frequency (developer decides on each suggestion)	IDE (VS Code integration)
Microsoft Copilot Workspace	Expansion of Copilot toward multi-file capabilities while maintaining human control. Points toward larger scope but emphasizes predictability for enterprise customers.	Multi-file scope with medium-to-high review frequency	IDE (VS Code/GitHub)
Anthropic Claude Code	Terminal tool with strong reasoning capabilities, filesystem integration via Model Context Protocol (MCP). Incremental permission system maintains high review frequency even with multi-file scope. Agent performs bounded actions, surfaces reasoning, requests approval before proceeding.	Multi-file scope with high review frequency (incremental permission system creates collaborative workflow)	CLI (terminal-based)
OpenAI Codex/GPT-4/5.x	Functions as model layer for other products (Replit, Cursor, etc.). Focus on model performance and cost optimization rather than dedicated coding tools. Competitive on benchmarks with lower latency and cost at scale.	Varies (depends on implementation)	API (powers other tools)
AWS Frontier Agents - DevOps	Targets DevOps automation, integrates with monitoring tools (e.g., Dynatrace), takes remediation actions without human intervention. Runs within customer AWS environments.	Greenfield scope within domain with very low review frequency (minimal human review during execution)	Cloud platform (AWS environment)
AWS Frontier Agents - Security	Continuous code scanning, opens PRs for vulnerabilities. Large scope within security domain, bounded by specialization.	Greenfield scope within domain with very low review frequency	Cloud platform (AWS environment) + Pull request surface
Cursor - Standard Mode	Inline suggestions with constant human control in AI-native IDE built from scratch.	Incremental scope with high review frequency	IDE (Cursor)
Cursor - Plan Mode	Multi-file reasoning with user approval before execution. Proposes approaches for approval at plan stage.	Multi-file scope with medium review frequency (approval at plan stage, not per-change)	IDE (Cursor)
Cursor - Commander Mode	Substantial multi-file operations with approval at plan stage. Includes Graphite acquisition for AI-powered code review.	Multi-file scope with medium review frequency	IDE (Cursor)
Windsurf (Cascade Agent)	Emphasizes planning ("thinking ten steps ahead") with self-verification loops. Runs tests after changes and attempts fixes. Built own model suite (SWE-1) for agentic coding. Includes "judge agents" for automated review.	Multi-file scope with low review frequency (self-verification substitutes for some human checkpoints)	IDE
Zed	High-performance editor with AI chat and completion features. Primary pitch is speed and collaborative editing. AI as enhancement rather than core feature.	Incremental scope with high review frequency	IDE
Cognition Labs (Devin)	First major attempt at agent working hours/days with review only at completion. Testing revealed failures on complex tasks due to compounded errors without course-correction checkpoints.	Greenfield scope with very low review frequency (demonstrated need for better automated verification)	Standalone agent workspace
Sweep AI	Focuses on issue-to-PR workflow. Monitors GitHub issues, formulates plans, implements changes, opens PRs. Scope bounded by individual issues. Good for minor issues, bugs, dependency updates.	Incremental-to-multi-file scope with medium review frequency (review at PR stage)	Pull request/issue surface (GitHub)
Replit (Ghostwriter & Agents)	Browser-based IDE evolved into prompt-to-product platform. Generates code from descriptions, runs immediately in Replit infrastructure. Integration of development and hosting. Educational angle with explanation capabilities.	Greenfield scope with low review frequency (review through using application and conversational iteration)	Browser workspace
Vercel v0	Specializes in frontend/UI development. Generates React/Next.js code from UI descriptions, deploys to Vercel with one click. Exceptionally fast feedback loops.	Greenfield scope with low review frequency (visual review: "does it look right?")	Browser workspace
Bolt.new	Full-stack generation through conversation. Users describe applications, refine through dialogue, receive deployable code. Includes prompt enhancement features.	Greenfield scope with low review frequency (conversational iteration)	Browser workspace
Lovable AI	Targets "no-code" market with complete applications from descriptions. Built-in integrations for authentication, payments, databases. No infrastructure knowledge required.	Greenfield scope with low review frequency (conversational: describe the fix you want)	Browser workspace
Emergent	Rapid growth to millions of users. Streamlined creation-to-deployment experience. Backed by Google, likely leverages Google model technology.	Greenfield scope with low review frequency	Browser workspace (implied)
Aider	Open-source CLI agent with local control and model flexibility. Runs in terminal, modifies local files, supports cloud APIs and local models. For developers concerned about proprietary code or wanting model experimentation.	Multi-file scope with medium-to-high review frequency (customizable)	CLI (terminal-based)
Mistral Devstral	Open-weight model for self-hosting with large context windows. Trained for agentic coding tasks (tool use, multi-file changes, repository reasoning).	Varies (model layer)	API (self-hosted)
Mistral Vibe CLI	Terminal-based agent using Devstral. Open alternative to Claude Code for enterprises with data sensitivity concerns.	Multi-file scope with medium-to-high review frequency	CLI (terminal-based)
Kilo Code	Founded by GitLab's former CTO. Emphasizes speed and minimizing latency. Open-source, can integrate with various models and be customized for workflows.	Varies (customizable)	Varies (open-source, customizable)
Sourcegraph Cody	AI assistance grounded in enterprise-grade code search. Retrieves relevant context from massive codebases on the fly. Enterprise deployment options (on-prem, VPC) with compliance features.	Multi-file scope with medium-to-high review frequency (RAG-enhanced assistance)	IDE integration + code search
Greptile	Code search layer/API that other agents can use. Fast semantic search for making agents codebase-aware. Infrastructure for other tools rather than end-user product.	N/A (infrastructure layer)	API (for other agents)

Part VI: Codebase Interaction

The Memory Problem

The fundamental challenge for any coding agent working with real software is context: codebases are large, often exceeding any model's context window. A production repository might contain hundreds of thousands of lines across thousands of files. An agent that can only "see" a few thousand lines at once will struggle to understand architectural patterns, follow dependencies, or make consistent changes across a system.

Several strategies have emerged to address this:

Massive Context Windows offer a brute-force approach. Gemini's million-token context and Mistral's 512k-token capacity allow pasting substantial portions of a codebase directly into the model. This simplifies architecture—no need for retrieval systems if everything fits—but comes with costs: latency increases, expense rises, and models may suffer "lost in the middle" attention degradation where information in the middle of long contexts receives less weight.
Retrieval-Augmented Generation (RAG) gives agents on-demand access to relevant code. When addressing a task, the agent queries a vector index of the codebase, retrieves pertinent files or snippets, and incorporates them into its context. This allows operation on codebases far larger than any context window by paging in relevant pieces. Sourcegraph Cody and Greptile exemplify this approach—fast, accurate retrieval that lets agents focus on what matters.
Graph-Based Memory structures code understanding more richly. Graphon AI and similar efforts represent codebases as graphs—nodes for functions, classes, and modules; edges for calls, dependencies, and relationships. An agent can traverse this graph to understand how components relate, rather than relying on text similarity. Graph memory can persist across sessions, allowing agents to accumulate understanding over time.
Embedded Knowledge via Documentation provides cheap context through structured files. Anthropic recommends adding a CLAUDE.md file to repositories with architecture notes, style guides, and key commands. The agent loads this "briefing document" every session, gaining high-level understanding without scanning every file.
File Selection mimics how human developers work. Tools like Aider and Cursor can provide the agent with a repository "map"—perhaps an AST summary or directory structure—and let the agent decide which files to load for a given task. This focuses context on what's relevant while maintaining awareness of overall structure.

Tooling Beyond Memory

Effective coding agents need more than memory; they need tools to act on the world:

Model Context Protocol (MCP) is emerging as a standard for connecting agents to external tools and data sources. Claude Code's MCP integration lets it invoke file operations, run shell commands, and interact with version control—structured interfaces that extend what the agent can do.
Context Compression techniques summarize or distill information to fit more into limited windows. Meta-RAG approaches condense repository structure into summaries; hierarchical representations provide high-level views with the ability to zoom in on demand.
Structured Task Execution frameworks provide agents with safe interfaces to external services—creating issues, querying databases, sending notifications. Rather than generating arbitrary shell commands, agents call well-defined APIs with appropriate validation.

Orchestration: Single Agent vs. Multi-Agent

A persistent question in agent architecture is whether complex tasks—particularly those with greenfield scope—are best handled by a single capable agent or by multiple specialized agents coordinating.

The Multi-Agent Case rests on decomposition. A complex coding task might involve planning (architectural decisions), implementation (writing code), testing (verification), and review (quality checks). Different agents could specialize in each, potentially using different models optimized for different capabilities. MetaGPT simulates teams with PM, engineer, and QA roles. AutoGen enables asynchronous multi-agent workflows.

Multi-agent systems also offer a path to lower review frequency without sacrificing quality: if a "judge" agent reviews the "coder" agent's work, the human can review less frequently while maintaining confidence. Windsurf's architecture demonstrates this pattern.

The Single Agent Case emphasizes simplicity. Every additional agent introduces communication overhead, potential for miscommunication, and points of failure. A sufficiently capable model with good tools might handle the entire workflow sequentially, maintaining coherent context without the complexity of inter-agent coordination. One "mind" is easier to debug than five.

The Coordination Tax is real. Early multi-agent experiments (AutoGPT-style loops) often produced agents that got stuck, repeated work, or amplified each other's errors. Cognition's Devin, while technically a single agent, pursued multi-step tasks autonomously—and famously failed on most attempts, sometimes spending extended periods on impossible approaches. The failures often stemmed from poor self-verification rather than the multi-step nature of the work.

Our View: The challenges with multi-agent systems today reflect immature orchestration and tooling, not fundamental flaws in the paradigm. The solution is not to abandon multi-agent approaches but to improve them:

Better frameworks for inter-agent communication and shared state (LangGraph, CrewAI)
Clearer role definitions and handoff protocols
Supervisor agents that can recognize and halt unproductive efforts
Shared memory systems that maintain context across agent boundaries

Multi-agent architectures are particularly valuable for enabling greenfield scope with manageable review frequency. Rather than requiring humans to review every step of a large task, specialized review agents can handle intermediate verification, with human review at key milestones. This is the path to agents that can handle ambitious scope without either overwhelming human reviewers or operating blindly.

Hybrid approaches are already emerging. Windsurf uses ephemeral "judge" agents to critique its primary agent's work—multi-agent verification without full multi-agent complexity. The industry will likely converge on architectures where multiple agents collaborate within structured frameworks, with robust tooling managing coordination.

Part VII: Infrastructure and Deployment

The Bifurcation

Infrastructure requirements differ dramatically between vibe coding and enterprise deployment, creating distinct market segments with different needs.

Vibe Coding: Easy Shipping

For vibe coders, the critical infrastructure requirement is frictionless deployment. The value proposition of prompt-to-product breaks down if generated code requires complex deployment pipelines. The ideal experience: describe what you want, see it built, share a live URL.

Integrated Platforms handle this end-to-end:

Replit provides development environment and hosting in one. Code runs on Replit's infrastructure immediately.
Vercel v0 generates code and deploys to Vercel's edge network with a click.
Bolt.new and Lovable AI maintain their own hosting, so generated applications go live without the user touching infrastructure.
Emergent has scaled to millions of deployed applications by making generation and hosting seamless.

Supporting Infrastructure makes generated code functional:

Supabase and Neon provide instant databases. An AI-generated application can have persistent storage without the user configuring anything.
Zeabur and similar PaaS offerings handle containerization and deployment for frameworks the AI might generate.

The pattern is clear: vibe coding infrastructure abstracts away everything between "code exists" and "code runs somewhere accessible." This abstraction is a feature, not a limitation—users in this segment don't want to understand Docker or Kubernetes.

Scaling Agent-Built Applications

An application that starts as a vibe-coded prototype may need to scale. Infrastructure services that bridge the gap are valuable:

Supabase grows from simple key-value storage to full Postgres capabilities
PlanetScale and Neon provide serverless database scaling
Cloud providers' serverless offerings (Vercel, Cloudflare Workers, AWS Lambda) handle traffic growth automatically

The question of when to "graduate" from vibe-coded simplicity to proper engineering becomes pressing as applications gain users.

Enterprise Deployment

Enterprise requirements are categorically different. Here, the challenge is not ease but governance.

Environment Setup and Reproducibility: Enterprise codebases have complex build systems, internal dependencies, and specific runtime requirements. An agent that works on a developer's laptop may fail in CI. Solutions include containerized agent environments (Google's Jules runs in cloud VMs that can be configured to match production) and explicit environment profiles.

Ephemeral Test Environments: Enterprise workflows demand that agent-generated changes be tested before affecting production. Platforms that can spin up temporary environments, run full test suites, and report results fit enterprise needs. AWS's Frontier agents integrate with existing CI/CD pipelines.

Compliance and Audit Requirements: Regulated industries require audit trails. What changes did the AI suggest? Who approved them? When? Solutions like Sourcegraph Cody and Claude Enterprise offer audit logging, SSO integration, and admin controls. Zencoder emphasizes its integration with Jira, GitHub, and compliance tooling.

Security and Data Handling: Sending proprietary code to external APIs raises concerns. VPC deployments (Sourcegraph, Runloop.ai), on-premises options, and self-hosted open-source models (Mistral Devstral) address data sovereignty needs.

Key Players in Enterprise:

Claude Enterprise: Anthropic's offering with enhanced privacy, admin controls, and audit capabilities
Zencoder: IDE plugin plus CI/CD integration, focused on enterprise workflow integration
Sourcegraph Cody: Enterprise-grade search and AI, deployable on-prem
AWS Frontier Agents: Run within customer AWS environments, addressing data residency concerns

Part VIII: The Self-Evolving Flywheel

The Vision

Frontier AI labs are investing heavily in systems that improve themselves—and coding agents are central to this vision. A sufficiently capable coding agent could potentially improve its own code, fix its own bugs, and enhance its own capabilities. The flywheel: agents write code, code improves agents, improved agents write better code.

NVIDIA's internal implementation demonstrates the concept in a more modest form: their MAPE control loop (Monitor-Analyze-Plan-Execute) collects feedback from agent interactions, identifies failure patterns, fine-tunes models to address weaknesses, and deploys improved versions. The result: a smaller, faster, more accurate model replacing a larger generic one.

Current Gaps

Several obstacles prevent closing this loop effectively:

Feedback Scarcity: Most systems receive limited explicit feedback. NVIDIA collected only hundreds of negative feedback samples over months—not enough for robust learning. Agents need richer signals about what worked and what didn't.
Self-Diagnosing Failure: Humans learn from debugging; agents rarely get that opportunity. An agent that could recognize when its approach is failing—and articulate why—could learn more effectively. Current agents often persist in unproductive directions without recognizing the problem.
Context Decay: Performance degrades over long sessions as context fills with outdated information. Agents need mechanisms to maintain coherence over extended interactions—forgetting what's no longer relevant while retaining what matters.
Implementing Changes Safely: If agents could modify their own prompts or fine-tune their weights, improvement loops could accelerate—but so could runaway failures. Robust guardrails are essential for any self-modification capability.
Benchmark Limitations: Existing benchmarks like SWE-bench test specific capabilities but don't capture the full complexity of real-world coding. SWE-bench Pro revealed a dramatic performance drop on novel codebases, suggesting that benchmark performance doesn't translate directly to practical capability.

Emerging Approaches

Several techniques address these gaps:

Human-in-the-Loop Feedback: Collect instances where agent suggestions required significant revision, have experts classify failures, and use that data to improve prompts or fine-tune models. This is labor-intensive but produces high-quality signal.
Agent-in-the-Loop Evaluation: Use AI judges to evaluate agent outputs, generating synthetic feedback at scale. Weak supervision from LLM judges can provide training signal when human labels are scarce.
Self-Correction Paradigms: Systems like CorrectNav generate correction data from error trajectories—when an agent goes wrong, capture the sequence and train on the correction.
Reinforcement from Outcomes: If a PR is merged without changes, that's a strong positive signal. If it requires substantial revision, that's negative. Connecting agent outputs to downstream outcomes creates a learning signal, though the feedback loop may be slow.

Part IX: Strategic Implications

For Builders and Founders

The coding agent landscape suggests several strategic opportunities:

Context Engineering as Moat: As models commoditize, the ability to manage massive codebases—through smart RAG, memory systems, and retrieval—becomes the key differentiator. Founders building in this space should invest heavily in context infrastructure. The agent that understands your codebase best will be the most useful, regardless of where it sits on the scope or review frequency axes.
Enabling Greenfield Scope with Confident Low-Frequency Review: The market opportunity lies in agents that can handle large scope (multi-file, greenfield) without requiring constant human review. This requires investment in verification: automated testing, judge agents, multimodal artifacts, self-correction loops. The company that solves verification for greenfield agents enables a new category of use cases.
Interface Pluralism: The market won't converge on a single interface. CLI, IDE, browser, and PR surfaces serve different workflows. Rather than betting on one surface, successful products might offer multiple interaction modes or focus on deep integration with specific workflows. Interoperability may prove more valuable than lock-in.
Enterprise Trust Architecture: SOC 2, ISO 42001, SSO, audit logging—these form the new baseline for enterprise adoption. Tools that built compliance infrastructure as an afterthought face extended procurement cycles. Building this early creates a significant barrier to competition—particularly for enterprises that want higher scope with audit trails for every change.
Orchestration and Tooling: Multi-agent systems face coordination challenges, but the solution is better orchestration, not abandoning the paradigm. Frameworks that manage agent collaboration, shared memory, and error handling will enable more sophisticated agent architectures—particularly architectures that use specialized review agents to enable lower human review frequency. This infrastructure layer is underbuilt.
Vertical Specialization: Horizontal coding agents face competition from deep-pocketed incumbents. Vertical focus—on specific languages, frameworks, domains, or tasks—may provide defensible positions. An agent optimized for iOS development or infrastructure-as-code might outperform general-purpose tools in its niche, particularly if it can handle larger scope within that domain.

Closing Thoughts

The coding agent landscape is evolving rapidly, but the fundamental tensions are clear: incremental versus greenfield scope, high versus low review frequency, individual productivity versus team workflows. Products that navigate these tensions thoughtfully—providing appropriate scope for context, verification mechanisms that enable confident lower-frequency review, and seamless integration with existing workflows—will define the next generation of developer tools.

The future is not a single agent that does everything, nor a fragmented collection of single-purpose tools. It is an ecosystem of specialized agents, coordinated by sophisticated orchestration, grounded in deep understanding of codebases, and integrated into the workflows developers already use. The companies building this infrastructure today will shape how software is built for years to come.

Acknowledgments

We're grateful to Google and Graphon AI for sponsoring the Gemini 3 Build Day and making this research possible. Special thanks to our speakers—Paige Bailey (AI Developer Relations Lead @ Google DeepMind), Bonnie Li (Research Scientist @ Google DeepMind), Cooper Price (Software Engineer @ Google Antigravity), Suyash Kumar (Senior Software Engineer @ Google), De Kai (Creator of Google Translate / Professor of Computer Science @ HKUST), Arbaaz Khan (CEO @ Graphon), and Div Garg (CEO @ AGI, Inc)—whose insights on infrastructure optimization, model safety, and architectural design informed both the event and this analysis. Our judges—Clark Zhang (CTO @ Graphon), Vaibhav Tulsyan (Research Engineer @ Google DeepMind), Audrey Choy (Senior Software Engineer @ Airbnb), and Yan Wu (Software Engineer @ Google Antigravity)—brought deep technical expertise across the AI stack, elevating project quality and identifying the most promising architectural approaches.

We extend our deepest gratitude to the entire AGI House community for creating a space where ambitious ideas meet execution. This hackathon brought together builders, researchers, and engineers who dedicated their day to pushing the boundaries of what multimodal AI can do—from clinical documentation systems to embodied AI agents to privacy-preserving data tools. Events like this don't just test technologies; they forge the communities and collaborations that will define the next era of intelligent systems.

And finally, thank you to Google's Nano Banana Pro for the graphics on this memo.

‍

From Greenfield Devs to PR Wizzes: The Newfound Autonomy of Coding Agents