Between October 2025 and April 2026, every major cloud platform shipped a production-grade agent sandbox.

AWS reached global availability with Bedrock AgentCore — a full agent stack comprising Code Interpreter, Browser, Runtime (microVM-per-session), Memory, and Gateway.
Google launched Vertex AI Agent Engine Code Execution, now folded into the rebranded Gemini Enterprise Agent Platform, and shipped a second product — GKE Agent Sandbox, with gVisor-based kernel isolation and sub-second provisioning.
Microsoft rebuilt Foundry Hosted Agents in April with per-session hypervisor isolation and scale-to-zero pricing.
Vercel and Cloudflare reached general availability in January and April respectively, with Active-CPU pricing models that pressured the unit economics of the category.

Five major platforms, six months. The substrate layer for AI agents transitioned from a specialist startup category to a default cloud feature in roughly the time it takes a typical enterprise procurement cycle to close.

The pure-play sandbox companies that defined the category — Daytona, E2B, Runloop, Modal — have not lost share. But the hyperscaler entry forces a sharper question: what are the pure-plays doing that the hyperscalers have not yet replicated? The answer reveals what the category is actually competing on, and it is not cold-start latency.

The Reframe: From Sandboxes to Computers

The language used in these announcements has shifted noticeably. Cloudflare's GA post is titled "Agents have their own computers," with a closing section headed "This is what a computer looks like." Daytona's February Series A press release introduced "composable computers for AI agents" as the company's official positioning, with founder Ivan Burazin explicitly rejecting the "sandbox provider" framing as too narrow. Microsoft's Foundry Hosted Agents documentation describes per-session VM-isolated sandboxes with persistent filesystems. AWS's AgentCore Runtime gives each session its own microVM with filesystem and shell access.

The unit of agent compute has shifted from a fenced-off function call to something closer to a personal computer: persistent filesystem, snapshots, terminals, preview URLs, background processes, addressable by name, woken from sleep, forkable. When the substrate was "a sandbox for code," the relevant metrics were isolation strength and execution latency. When the substrate is "a computer for an agent," the relevant metrics expand to include persistence, state restoration, dynamic resizing, and the developer experience of working with a machine. The category competes on a wider surface than it did two years ago.

The Two Camps

The five hyperscaler products and the four major pure-plays cluster into two architectural camps. The technology stack each provider chooses reveals what it is optimizing for.

Camp 1: "Computers for Agents"

This camp leads with persistence, full-environment capability, and long sessions.

Daytona offers four isolation layers (Docker, Firecracker, Cloud Hypervisor, QEMU) selected automatically based on the workload specification. Persistence is the default; ephemeral was added as a feature later. The architectural bet is local-disk snapshots distributed across nodes — substantially faster than the network-disk snapshots most competitors use.
Cloudflare Sandboxes runs on Containers backed by Durable Objects, with snapshots stored in R2 for global tiered caching. The platform ships persistent code interpreters, PTY terminals, background processes with live preview URLs, real-time filesystem watching, and credential injection via programmable egress proxies. Idle sandboxes sleep automatically and wake on request, addressable globally by name.
Microsoft Foundry Hosted Agents runs hypervisor-isolated per-session containers with persistent $HOME and /files directories, scale-to-zero with stateful resume, and dedicated Entra agent identities. The April refresh was positioned explicitly for enterprise governance.

Figure: Daytona architecture (source: "Architecture | Daytona").

What unites this camp: persistence as default, OS-level capabilities (install packages, run development servers, expose ports), long or unlimited sessions, snapshot-based state. The customer question this camp answers is whether an agent can perform substantive work.

Camp 2: "Safe Execution of Untrusted Code"

This camp leads with isolation strength and predictable, often-ephemeral lifetimes.

E2B commits to Firecracker microVMs only, with custom AI workload modifications. Their argument is explicit: containers share a host kernel, which has known escape vectors; Firecracker offers hardware-level isolation. Sessions cap at 24 hours, BYOC is available on Enterprise, and self-hosting is supported.
Vercel Sandbox also chose Firecracker microVMs, running on the same infrastructure that powers Vercel's deployment platform. Runtimes are narrow by design — Node.js and Python only. Sessions are time-capped at 45 minutes on Hobby and 5 hours on Pro. The product copy is direct: "The safest way to run code you didn't write."
AWS AgentCore ships Code Interpreter as a containerized sandbox with IAM-controlled isolation, plus AgentCore Runtime, which gives each session its own microVM. Code Interpreter supports Python, JavaScript, and TypeScript, with CloudTrail integration for audit.
Google's two products sit in this camp differently: Vertex AI Code Execution offers managed sandboxes with hardened Python and JavaScript runtimes; GKE Agent Sandbox uses gVisor — a kernel-isolation layer between containers and microVMs — with sub-second provisioning.

See "How AWS's Firecracker virtual machines work" (Amazon Science) for the original Firecracker paper, published in 2020.

What unites this camp: isolation as the load-bearing value proposition, narrower runtime support, credentials and egress as first-class concerns. The customer question this camp answers is whether an agent can be trusted with code.

Why the Stack Reveals the Camp

The architectural choices are not aesthetic. Camp 2 converges on microVMs and kernel-isolation layers because the value proposition is isolation strength. When the code being run may have been influenced by an attacker, a shared host kernel is a risk that cannot be engineered around. Firecracker (battle-tested by AWS Lambda) and gVisor (battle-tested by Google) provide hardware-level or kernel-level boundaries that containers do not.

Camp 1 mostly uses containers, often with hypervisor or microVM isolation layered on top for enterprise compliance. The choice is not security indifference. Containers natively support the OS-level capabilities the camp's value proposition requires: full filesystem access, package installation, long-running processes, exposed ports. Cloudflare's approach is illustrative: they have V8 isolates as a separate, lighter product (Dynamic Workers) for ephemeral code, but for Sandboxes specifically they chose Containers because the "agent as computer" framing demanded it.

The cross-camp comparisons that look natural on the surface — Daytona versus Vercel, E2B versus Cloudflare — are mostly category errors. Daytona competes with Cloudflare; both are in Camp 1. E2B competes with Vercel; both are in Camp 2. The fact that all four are called "sandboxes" obscures structurally different products.

The Technical Depth Within Each Camp

Within each camp, four layers determine where differentiation lives.

Isolation strategy sets the floor. Most providers commit to one or two isolation technologies; Daytona's four-layer setup is the most flexible. Each layer involves real trade-offs:

QEMU can run essentially anything — Linux, Windows, CPU, GPU. But pause and resume operations are slow because the entire memory state must be loaded from disk into RAM before the sandbox can serve traffic. Burazin estimates two to three seconds per resume on Daytona's stack.
Firecracker is Linux and CPU only, but supports lazy memory loading: the few megabytes required to restart are pulled into RAM first, the customer begins receiving responses, and the rest of memory loads in the background. Resume becomes effectively instant from the user's perspective.
Cloud Hypervisor sits between the two, offering more hardware support than Firecracker with similar performance characteristics.
Containers offer the fastest startup and the highest density per host, at the cost of shared kernel isolation. For workloads where security is not the primary concern — or where additional isolation can be layered above — containers are often the right answer.

Camp 2 providers commit to one approach and harden it. Camp 1 providers tend to offer multiple options abstracted from the user: a builder requests a specification (CPU or GPU, operating system, persistence needs), and the platform selects the appropriate isolation layer underneath. The user does not have to know whether their workload landed on Firecracker or QEMU. This abstraction matters more as the user base diversifies — a research team running RL benchmarks and an enterprise customer running compliance-bound workloads have different needs that the same platform can serve only if the underlying flexibility exists.
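The spec-to-layer dispatch described above can be sketched in a few lines. This is a hypothetical rule set built only from the trade-offs listed in this section, not Daytona's or any other vendor's actual selection logic. The `WorkloadSpec` fields, the `select_isolation` function, and the ordering of the rules are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSpec:
    """Hypothetical request shape; field names are illustrative only."""
    os: str = "linux"             # "linux" or "windows"
    gpu: bool = False             # does the workload need a GPU?
    untrusted_code: bool = True   # will it run code the operator did not write?
    extra_devices: bool = False   # needs hardware beyond a minimal microVM


def select_isolation(spec: WorkloadSpec) -> str:
    """Pick an isolation layer from a spec (assumed rules, not a vendor's)."""
    # QEMU is the catch-all: it can run Windows and GPU workloads,
    # at the price of slower pause and resume.
    if spec.os != "linux" or spec.gpu:
        return "qemu"
    # Where security is not the primary concern, containers give the
    # fastest startup and highest density.
    if not spec.untrusted_code:
        return "container"
    # Cloud Hypervisor offers more hardware support than Firecracker.
    if spec.extra_devices:
        return "cloud-hypervisor"
    # Firecracker: Linux and CPU only, with lazy memory loading on resume.
    return "firecracker"


if __name__ == "__main__":
    for spec in (WorkloadSpec(), WorkloadSpec(gpu=True),
                 WorkloadSpec(untrusted_code=False)):
        print(spec, "->", select_isolation(spec))
```

The point of the sketch is the shape of the abstraction: the caller describes the workload, and the layer name never appears in the request.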

State and snapshot architecture is where Camp 1's advantage compounds. The default in most of the category is to store snapshots on network-attached storage and load them over the wire at resume — simple to implement but with materially worse IOPS than local disk. Daytona's bet is to distribute snapshots across local disks on nodes, with hot and cold tiering, and delta compression between customer snapshots. The result is benchmark-leading snapshot restore times. Cloudflare uses R2 with tiered caching for similar reasons; their internal benchmark shows a 30-second sandbox-creation-plus-repo-clone-plus-install sequence shrinking to 2 seconds when restored from backup.

Scheduling is the layer most builders never see. Custom schedulers — Daytona's bare-metal allocator, E2B's Firecracker pool manager — handle the second-by-second question of which sandbox lands on which machine. Kubernetes shows up at the API layer in many of these stacks but rarely manages individual sandbox allocation. Agent sandboxes are far more bursty than typical web services, and the OOM behavior that Kubernetes handles by killing pods is fatal for RL workloads. Daytona's dynamic sandbox resizing is one practical response: rather than over-provision RAM to avoid OOMs, the platform expands the sandbox as it needs more memory.
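A rough way to see why resizing matters is to compare reserved memory over a run under the two policies. The sketch below uses a made-up memory trace and an assumed 25% headroom factor; neither comes from any provider, and the function names are hypothetical. It only illustrates the arithmetic of reserving the peak for the whole run versus tracking actual usage.

```python
def gib_hours_fixed(trace_gib, step_hours):
    """Fixed sizing: reserve the peak for the entire run to avoid an OOM kill."""
    return max(trace_gib) * len(trace_gib) * step_hours


def gib_hours_dynamic(trace_gib, step_hours, headroom=1.25):
    """Dynamic sizing: reserve current usage plus headroom at each step.

    The headroom factor is an assumption; a real platform would pick its own.
    """
    return sum(m * headroom for m in trace_gib) * step_hours


if __name__ == "__main__":
    # Illustrative trace: memory in GiB sampled every half hour,
    # with one short spike in the middle of the run.
    trace = [2, 2, 4, 16, 4, 2, 2, 2]
    fixed = gib_hours_fixed(trace, step_hours=0.5)
    dynamic = gib_hours_dynamic(trace, step_hours=0.5)
    print(f"fixed: {fixed} GiB-hours, dynamic: {dynamic} GiB-hours")
```

The gap depends entirely on how spiky the trace is. A flat trace shows no saving, and with headroom included the dynamic policy would reserve slightly more.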

Network, secrets, and egress controls have become first-class concerns in the last six months. Cloudflare ships programmable egress proxies that inject credentials at the network layer; the agent never sees raw tokens. Vercel performs credential brokering at a similar layer. E2B shipped a secrets vault using man-in-the-middle proxying. As agents act with more authority over external systems, the credential and network-control surface becomes load-bearing. Builders who treated this as a "later" concern are now retrofitting it.
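The credential-injection pattern is easier to see in miniature. The sketch below is a toy in-process model, not Cloudflare's, Vercel's, or E2B's implementation. `OutboundRequest`, `EgressProxy`, and the deny-by-default policy are assumptions for illustration. The idea it shows is that the sandbox emits a request with no token, and the proxy outside the sandbox attaches one only for hosts it knows.

```python
from dataclasses import dataclass, field


@dataclass
class OutboundRequest:
    """Hypothetical representation of a request leaving the sandbox."""
    host: str
    path: str
    headers: dict = field(default_factory=dict)


class EgressProxy:
    """Toy egress proxy: secrets live here, outside the sandbox."""

    def __init__(self, secrets_by_host):
        self._secrets_by_host = dict(secrets_by_host)

    def forward(self, request: OutboundRequest) -> OutboundRequest:
        token = self._secrets_by_host.get(request.host)
        if token is None:
            # Deny by default (an assumed policy): unknown hosts are blocked.
            raise PermissionError(f"egress to {request.host} is not allowed")
        # Copy the headers so the sandbox-side request never holds the token.
        headers = dict(request.headers)
        headers["Authorization"] = f"Bearer {token}"
        return OutboundRequest(request.host, request.path, headers)


if __name__ == "__main__":
    proxy = EgressProxy({"api.example.com": "placeholder-token"})
    agent_request = OutboundRequest("api.example.com", "/v1/items")
    upstream_request = proxy.forward(agent_request)
    print("agent sees:", agent_request.headers)
    print("upstream sees:", sorted(upstream_request.headers))
```

A real proxy would also need TLS interception or a trusted tunnel to rewrite headers, which is why the vault approach mentioned above is described as man-in-the-middle proxying.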

The Shapes of Demand

The hyperscaler entry does not make sandbox selection simpler. It makes selection more workload-dependent. Three demand shapes have emerged, each stressing different parts of the stack.

Background agents (coding assistants, customer-support agents, workflow automation) follow predictable usage curves: higher during business hours, lower at night and on weekends, with aggregate growth tracking customer growth. Hyperscaler sandboxes serve these workloads well. The provisioning lag and burst handling that constrain other workload shapes do not bite as hard here.

RL and eval researchers look entirely different. Usage is flat at zero, then spikes to the full capacity allocated, runs at saturation for hours, then dies. The agent may return in an hour, two weeks, or three months — there is no follow-the-sun curve. Researchers fire off runs at midnight with the expectation that results will be ready in the morning. This shape is where Camp 1 pure-plays hold a meaningful advantage: Daytona's excess bare-metal capacity absorbs the spikes, and dynamic resizing prevents the OOMs that constrain fixed-size environments. The most-quoted illustration in this space — though we have been unable to independently verify the specific figure — is Daytona's claim that Anthropic's Opus 4.6 Terminal Bench evaluation ran on roughly half the compute it would have required on Kubernetes, because dynamic resizing prevented the over-provisioning otherwise needed to avoid OOM-driven failures.

Computer-use and browser-driven agents are the third and fastest-growing shape. These workloads need display buffers, PTY interactivity, and longer sessions than text-only agents. AWS shipped an AgentCore Browser tool to serve this. Daytona's three publicly stated use cases (code execution, computer use, RL) include it explicitly. Vercel's customer testimonials now include Cua AI running computer-use agents on Vercel Sandbox for RL workflows, which suggests the workload-shape boundaries between hyperscalers and pure-plays are softer than the camp framework implies.

Provider selection should follow workload shape. The same provider can be the right answer for one shape and the wrong one for another.

What Hyperscalers Are Still Missing

Even with five major platforms shipping production-grade products, the pure-plays retain meaningful differentiation. The pattern is consistent: hyperscalers commoditize the floor; pure-plays own the workload-specific ceiling.

Unlimited or very long sessions. AWS Code Interpreter caps at eight hours; Microsoft Foundry sessions run roughly an hour; Vercel maxes out at five hours on Pro; Cloudflare's limits are plan-dependent. Daytona's sessions are unlimited; E2B's are capped at 24 hours; Runloop's Devboxes persist indefinitely; Modal and Northflank both offer unlimited sessions.
GPU sandboxes. Cloudflare and Vercel do not offer GPU. AWS AgentCore does not provide first-class GPU support. Microsoft Foundry Hosted Agents top out at CPU-only configurations. Google's GKE Agent Sandbox supports GPUs but requires GKE expertise. Modal, Beam Cloud, Daytona, and Northflank all offer GPU sandboxes natively.
First-class forking. Cloudflare ships fork-from-snapshot. Most other hyperscaler products do not have an analogous primitive. Daytona's copy-on-write forking is technically inexpensive — splitting into two does not double the compute, since only diverging memory and filesystem operations consume new resources. The use case is still mostly checkpoint-style (save this state, branch from here) rather than the multiverse exploration RL theoretically wants.
Workload-specific optimization. Daytona's dynamic resizing for RL is the canonical example. The underlying architectural choice, substantial bare-metal capacity sitting idle to absorb bursts, is one that hyperscaler economics do not support. Pure-plays can make this choice; hyperscalers cannot price it.
BYOC and self-hosting. Hyperscalers are locked to their own cloud by definition. E2B offers BYOC on AWS VPC with GCP and Azure planned, plus self-hosting on Enterprise. Northflank offers BYOC. Daytona's Kubernetes deployment option lets customers run the platform on their own infrastructure. For regulated industries and multi-cloud strategies, this is non-negotiable.
Arbitrary Docker images and runtimes. Vercel restricts to Node.js and Python. AWS Code Interpreter supports Python, JavaScript, and TypeScript. Microsoft Foundry's Code Interpreter is Python-only. The pure-plays accept any Docker image and support custom templates. Anything that does not fit the supported runtime — R workloads, custom binaries, niche language stacks — pushes builders toward Camp 1 pure-plays.
Transparent pricing. Hyperscaler billing is multi-dimensional: Cloudflare layers Workers requests, Durable Objects, R2 storage, and a base plan on top of CPU; AWS AgentCore bills active CPU, memory, storage, and downstream tool costs separately; Microsoft adds Entra identity costs and Toolbox tools. Pure-plays mostly bill a single per-second compute rate. The cheapest hourly rate is not the lowest total cost.
Fan-out economics for multi-tenant agent platforms. Agent platforms that need to spin up a separate environment per end user face very different cost structures across providers. The architecture pure-plays were built around (virtual clusters, shared resource pools, dynamic allocation) was designed for this fan-out pattern. The Manus AI case study, in which the platform reached over a million isolated database clusters, one per user account, by moving off AWS RDS, illustrates how much difference architectural choice can make at this shape.
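The copy-on-write forking mentioned in the list above is cheap for a structural reason that a toy model makes concrete. The sketch below models only a filesystem as a chain of layers. It is not Daytona's or Cloudflare's implementation, it ignores memory pages and file deletion, and the `CowFilesystem` class and its methods are hypothetical names.

```python
class CowFilesystem:
    """Toy copy-on-write filesystem: a private delta over a shared parent."""

    def __init__(self, base=None):
        self._base = base   # shared parent layer, treated as read-only
        self._delta = {}    # paths written by this sandbox only

    def read(self, path):
        if path in self._delta:
            return self._delta[path]
        if self._base is not None:
            return self._base.read(path)
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes never touch the shared parent; they land in the private delta.
        self._delta[path] = data

    def fork(self):
        # Freeze the current state into a shared layer, then give both the
        # original and the fork an empty delta on top of it.
        frozen = CowFilesystem(self._base)
        frozen._delta = self._delta
        self._base, self._delta = frozen, {}
        return CowFilesystem(frozen)

    def private_bytes(self):
        """Storage consumed by this sandbox alone, excluding shared layers."""
        return sum(len(data) for data in self._delta.values())


if __name__ == "__main__":
    root = CowFilesystem()
    root.write("/repo/main.py", b"print('v1')")
    branch = root.fork()
    print("fork cost at creation:", branch.private_bytes(), "bytes")
    branch.write("/repo/main.py", b"print('v2')")
    print("root still reads:", root.read("/repo/main.py"))
    print("branch now reads:", branch.read("/repo/main.py"))
```

A fork starts with an empty delta, so its cost grows only as the two sandboxes diverge, which matches the checkpoint-and-branch usage described above.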

The category is not being absorbed. It is being segmented.

The Compute Economics Thread

The most underdiscussed structural change in this six-month window is Active-CPU pricing.

Cloudflare prices Sandboxes at $0.00002 per vCPU-second of active CPU, billing only the cycles the sandbox actually uses rather than the wall-clock time it sits waiting for an LLM response. Vercel's Fluid compute model claims up to 95% lower cost for bursty or I/O-bound workloads on the same principle. Microsoft Foundry Hosted Agents bills $0.0994 per vCPU-hour and scales to zero when idle. AWS AgentCore Runtime's pricing example uses an "I/O wait is free" framing. Agent workloads spend most of their time waiting for model responses rather than actively executing, so for them this is a meaningful economic shift.

The pure-plays' counter is structural. E2B and Daytona both bill closer to wall-clock CPU, but with simpler pricing models and without stacked billing dimensions. Daytona's broader argument is different: pure-plays own physical compute that hyperscalers cannot always provision on demand. The company's February Series A press release stated openly that the company is "hardware-constrained" and will use the funding to expand bare-metal capacity. Daytona's go-to-market — meetups, hackathons, sponsorships, conference circuits — funds the capex that maintains the surplus capacity needed to absorb bursty workloads.

Pricing comparison in this category requires modeling realistic workload patterns end-to-end. The headline rate is rarely the operative number, and the comparison flips depending on session length, I/O wait ratios, memory provisioning needs, and burst patterns.
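A minimal version of that modeling exercise is sketched below. The active-CPU rate is the Cloudflare figure quoted above; the wall-clock rate is a hypothetical number chosen only to show the crossover, not any provider's price. The model also ignores the stacked billing dimensions (requests, storage, memory) discussed earlier, so it is a starting point rather than a cost estimate.

```python
ACTIVE_RATE = 0.00002  # USD per vCPU-second of active CPU, as quoted above


def active_cpu_cost(wall_seconds, active_fraction, vcpus, rate):
    """Active-CPU billing: pay only for the share of time the CPU is busy."""
    return wall_seconds * active_fraction * vcpus * rate


def wall_clock_cost(wall_seconds, vcpus, rate):
    """Wall-clock billing: pay for every second the sandbox exists."""
    return wall_seconds * vcpus * rate


def break_even_active_fraction(active_rate, wall_rate):
    """CPU-busy fraction at which the two billing models cost the same."""
    return wall_rate / active_rate


if __name__ == "__main__":
    session_seconds = 3600
    hypothetical_wall_rate = 0.000005  # assumption for illustration only
    for busy in (0.05, 0.25, 0.75):
        active = active_cpu_cost(session_seconds, busy, 1, ACTIVE_RATE)
        wall = wall_clock_cost(session_seconds, 1, hypothetical_wall_rate)
        print(f"{busy:.0%} busy: active-CPU ${active:.4f}, "
              f"wall-clock ${wall:.4f}")
```

Which model wins depends on the I/O wait ratio: below the break-even busy fraction the active-CPU model is cheaper even at a higher headline rate, and above it the comparison flips.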

Future Directions and Open Challenges

The category just transitioned from "is this a real layer" to "how does segmentation shake out." Several open questions will shape the next twelve months.

Will hyperscalers add GPU and computer-use capabilities at parity with pure-plays? This is the most direct path to closing the differentiation gap. AWS's AgentCore Browser tool is a first step; GKE Agent Sandbox supports GPU at the cost of GKE complexity. Whether these mature into competitive substitutes for Daytona, Modal, and Beam will determine how durable the pure-play advantage is in workload-specialized segments.
Does Active-CPU pricing become the category standard? If yes, the pure-plays will need to respond either by re-architecting their billing or by competing on dimensions where wall-clock pricing is still simpler to model. If no, hyperscaler entries may not have the cost advantage their marketing implies at scale.
How does the persistence and forking story mature? Cloudflare ships fork-from-snapshot. AWS has filesystem persistence in preview. Microsoft offers scale-to-zero with stateful resume. The pure-plays still lead on the depth of these primitives, but the gap is narrowing. The interesting test is whether copy-on-write forking becomes a default expectation rather than a specialty feature.
How do evaluation metrics for this category mature? Cold-start latency is a vanity metric past a low threshold. What replaces it as the meaningful evaluation surface? Snapshot IOPS, dynamic resizing behavior, secrets and egress story, and total cost under realistic workload patterns all matter more — but none has become a community-standard benchmark.
Will the two architectural camps stay separate, or converge? Cloudflare's Sandboxes-plus-Workers tier system suggests one path: keep both camps separate as different product tiers under one platform. Daytona's four-layer isolation suggests another: a single platform that dispatches across both camps based on workload spec. Whether other providers follow either path will shape the next round of feature releases.
Do secrets, egress, and credential-brokering primitives converge on a common standard across providers? Each platform has invented its own primitives. The Model Context Protocol established a partial standard for tool integration; nothing equivalent exists for sandbox security primitives. The category would benefit from convergence here, but individual providers have real incentives to defect from a common standard.

The interesting questions for the category are no longer about whether sandboxes matter. They are about how a maturing substrate layer divides into segments, how those segments price themselves, and how the providers on either side of the segmentation invest to defend their positions.

Acknowledgments

This research note draws on a series of in-depth interviews conducted at the AGI House Agent Harness Build Day on April 18, 2026, where founders and engineers building the substrate layer for AI agents shared their perspectives on technical architecture, market dynamics, and the road ahead. Special thanks to Ivan Burazin at Daytona, Matt Brockman at E2B, Ross at Runloop, and Xin Shi and Lia Du at PingCAP for their time, their candor, and the insights that grounded much of this analysis. Their willingness to discuss not only what they have built, but where they think the category is heading, made this memo possible.

AGI House hosts hackathons, build days, dinners, and research sessions in San Francisco focused on the frontier of AI infrastructure and applications. We continue to publish observations from the field as the substrate layer for agents matures.

The Agent Sandbox Layer Just Stopped Being a Startup Category