What Is a Proactive Agent?

The stack that defined 2023/24—where a user opens an app, types a prompt, reads the response, and closes the tab—has largely been solved and commoditized. These “wrappers” around chat models are a competitive category in the sense that a thousand of them exist, but they are no longer a defensible one. So what’s next in the direct, mass-consumer-facing market for AI? The signals we’re seeing point one layer down: systems that observe what a user is doing and what is happening around them, infer intent without being asked, and act through an interface that stays with the user between explicit sessions.

This seamless integration with daily life is the ultimate objective for wearables, and it is also one of the hardest technical challenges facing the sector. There is no "open app" moment on a pair of glasses—either the agent decides when to speak, or it is simply dead weight on the user's face. When the agent sees a text from your roommate saying "we're out of soap," it should add soap to the shopping list without being asked. On glasses, that same assistant also has to decide whether to surface the suggestion in your field of view now, later, or not at all. The decision of when to act, rather than the action itself, is the hard part, and it is where the next generation of agent products will be won and lost.

The Core Dynamic

Every function in a proactive wearable system sits on a spectrum between two poles. Intervention means the agent acts on its own: drafting, scheduling, reminding, completing. It is high-leverage when correct and high-cost when wrong. Ambience means the agent stays silent and observant, surfacing information only when confidence crosses a threshold. It is low cost per decision, but the value is entirely in when the agent breaks silence.

The prevailing wisdom from 2024–2025 was that intervention-biased products lose. The canonical example is Humane's AI Pin, which was an intervention-heavy proactive agent strapped to the user's chest.

Founded by ex-Apple designers Imran Chaudhri and Bethany Bongiorno, and backed by $230M from Sam Altman and others, the Pin shipped in April 2024 at $699 as a screen-free smartphone alternative. Reviews at launch were catastrophic:

  • The agent couldn't touch the user's actual tools. The Pin couldn't set an alarm, start a timer, read your calendar, or write to it. When The Verge tried to add an item to a shopping list after the fact, it “almost always” failed. A proactive agent that cannot take basic actions on the user's existing stack is just an expensive microphone.
  • Every interaction round-tripped to the cloud. Every query had to process through Humane's servers, typically taking around 10 seconds and sometimes failing entirely. An ambient agent whose response arrives 10 seconds after the relevant moment has already passed is architecturally broken, regardless of model quality. Half of outbound calls reportedly didn't connect, while half of inbound calls went straight to voicemail.
  • The device physically throttled its own always-on premise. Back-to-back queries or sustained projector use caused the Pin to overheat and shut off until it cooled, which made continuous ambient assistance incompatible with the hardware running it. Humane later issued a fire-risk warning on the charging case itself.
  • It offered no 10x use case over the phone already in the user's pocket. Every task the Pin attempted was something a phone did faster, more reliably, with better fallback. Voice-first as a modality is genuinely worse than a glance at a screen for most quick interactions, and the projected-on-palm display didn't close the gap.

By summer 2024, return rates exceeded new sales. A price cut to $499 in October didn't move inventory. In February 2025, HP acquired Humane's team, patents, and CosmOS operating system for $116M, roughly an eighth of the high-end valuation, and bricked the Pin on February 28, cutting off servers for the ~10,000 units already shipped.

The lesson the industry took from this, alongside softer echoes from Rabbit R1 and Limitless, was that users punish agents that act too often, too slowly, or with insufficient access to the user's real tools, much more than they punish agents that stay quiet. Restraint, latency, and deep app access were priced into the thesis.

Pare-Bench, introduced in April 2026 by researchers at UCSB/Apple, made the restraint half of this equation quantitative. The benchmark is a simulation environment: 143 scheduled-event scenarios across communication, productivity, scheduling, and lifestyle apps, with an LLM-driven "active user" who can accept, reject, or ignore agent proposals in real time. It measures four things: context observation, goal inference, intervention timing, and multi-app orchestration. The headline metrics:

  • Task success rate. Did the agent actually complete the scenario's underlying goal? Claude 4.5 Sonnet led at 42%. Even frontier models failed 58% of the time.
  • Proposal rate. How often the agent broke silence to suggest doing something. Claude: 12.8%. GPT-5: 28.1%.
  • Acceptance rate. Of the proposals the agent made, how many the simulated user accepted. Claude: 78.2%. GPT-5: lower.

The model with the highest success rate was also the one that proposed least often and was accepted most often, which aligns with the signal from the early failures of wearable AI: the agent's value is determined less by how much it can do, and more by how well it calibrates when to do anything at all. An agent that proposes constantly trains its user to ignore it, while an agent that stays quiet and proposes only with high confidence trains its user to act when it speaks.

Three other dynamics have been shaping the space since the start of 2026:

First, on-device models just made a generational jump, which changes where agents can realistically run. Humane's Pin failed in part because every query round-tripped to the cloud, and that architecture is no longer forced. Google's Gemma 4 family, which underpins Gemini Nano 4 on Android flagships later this year, claims roughly 4x faster inference and 60% lower battery draw than the previous generation, with native multimodal support across 140+ languages. The same trajectory is visible across Qualcomm's latest AR chipsets, Meta's in-house work on its EvenLLM-scale stack, and Apple's rumored 2027 entry. Whatever the specific vendor, the direction of travel is consistent: 2B to 4B parameter models running locally with usable latency on phone-class hardware. The frontier-vs-local gap that Pare-Bench measured (Claude at 42%, Qwen 3 4B at 18.5%, Gemma 3 4B at 3%) is closing faster than most strategy decks assume.

Second, the camera-heavy incumbents are absorbing proactive features that were supposed to be startup territory. The previous consensus treated Meta as the intervention-heavy player who would overreach and lose users, but that read is getting out of date. Meta's January 2026 v21 update added Conversation Focus (AI-amplified voice of the person you're talking to) and expanded live captioning across Ray-Ban, Oakley, and Ray-Ban Display. The rumored Ray-Ban Gen 3 adds "super sensing" for real-time object, location, and person recognition. Samsung confirmed at MWC 2026 that its 2026 glasses will be "agentic," explicitly framed as AI that takes actions autonomously based on what the wearer sees and says. Google's Android XR pairing with Warby Parker, Samsung, and Gentle Monster is doing the same thing with Gemini. Meta is now racing to be ambience-competent, with more distribution than anyone else in the category.

Third, privacy has become a live regulatory front. Kenya's Office of the Data Protection Commissioner opened a formal probe into Meta Ray-Ban glasses in March 2026 after contractors were found reviewing intimate footage captured by the devices. Italy's Garante has issued formal GDPR inquiries. A coalition of 75 organizations including the ACLU and EFF sent an open letter to Zuckerberg in April demanding Meta abandon its "Name Tag" facial recognition feature. Meta delayed its Ray-Ban Display international launch indefinitely, citing "demand" but landing in exactly the regulatory posture that a delay would produce. The first major privacy incident in wearable AI is no longer a forecast. It is a rolling present tense, and the regulatory infrastructure responding to it is being built now.

Taken together, the space is being shaped by several forces at once: model capability commoditizing downward toward the device, incumbent distribution absorbing proactive features at platform scale, privacy regulation hardening, and memory infrastructure maturing into a real layer (see below). The product that wins is the one positioned correctly on all of these vectors at once.

Three Theses

Thesis 1: The Agent Social Contract

Goal inference scales with model capability, but the user's tolerance for an agent that breaks silence at the wrong moment does not. That tolerance is the product, and it is not something a model provider is positioned to deliver. OpenAI, Anthropic, and Google can all train a model that infers "this person would probably want soap added to their shopping list." None of them can tell you whether to surface that suggestion now, in ten minutes, at the next natural pause in conversation, or not at all. That judgment depends on the hardware affordances of the device (a heads-up display can show text silently; a phone notification cannot), the specific deployment context (a doctor's visit, a meeting, a walk home), the user's established preferences, and the accumulated social contract of the last thousand interactions between that user and that agent. A model provider shipping a generic SDK cannot calibrate against any of this. The calibration is deployment-specific, empirically tuned, and owned by whoever builds the interface that sits between the model and the user.

This has a direct architectural consequence. Proactive agents work best when observation and execution are separated into two roles with an explicit confirmation gate between them. An observer watches continuously at low cost, deciding only when something might warrant action. An executor, invoked only when the observer crosses a confidence threshold, does the actual work, proposes the action, and waits for the user's consent. This pattern maps almost exactly onto the glasses-plus-phone architecture: the glasses observe (microphone, optional camera, display, IMU), and the phone or cloud executes. The structural value of wearable hardware is that it compresses the notice-to-surface latency from the seconds of a notification round-trip to the milliseconds of a heads-up display, letting the agent act during the relevant moment rather than after it has passed.
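The two-role pattern can be sketched as a loop with an explicit confirmation gate between observation and execution. This is a minimal illustration, not any shipping SDK; the class names, the 0.8 threshold, and the shopping-list example are all assumptions for the sketch:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Proposal:
    summary: str                # text surfaced on the heads-up display
    action: Callable[[], str]   # deferred work; runs only after consent

class Observer:
    """Cheap, continuous role: scores each event, fires only above a threshold."""
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold

    def observe(self, event: str, confidence: float) -> Optional[Proposal]:
        if confidence < self.threshold:
            return None  # stay ambient: no proposal, no cost to the user
        return Proposal(
            summary=f"Add to shopping list: {event}?",
            action=lambda: f"added '{event}'",
        )

def run_turn(observer: Observer, event: str, confidence: float,
             user_accepts: Callable[[Proposal], bool]) -> Optional[str]:
    """One pass through the observe -> propose -> confirm -> execute gate."""
    proposal = observer.observe(event, confidence)
    if proposal is None:
        return None              # agent stayed silent
    if not user_accepts(proposal):
        return None              # user rejected; the executor never runs
    return proposal.action()     # execution happens only after explicit consent
```

The structural point is that the executor is unreachable except through the gate: a low-confidence event produces silence, and even a high-confidence proposal does no work until the user accepts.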

The design implication is that the interface layer is the new OS. When iOS and Android defined how mobile apps would ask for permissions, request attention, and present information, every downstream app inherited those conventions. Proactive wearables are at the same moment now. What does a "proposal" look like in the glass? How does the user accept or reject it without breaking stride? What does it mean for the agent to "gather more context" when the user hasn't explicitly asked for anything? The team that answers these questions at scale will set the interaction grammar that every ambient agent built on top inherits.

Thesis 2: Memory is the Feature That Compounds

Ten-million-token context windows do not solve proactive memory. A wearable agent needs to know that a coworker mentioned a deadline three weeks ago at lunch, that the user typically cancels Friday evening meetings, and that "Alex" refers to a roommate in some contexts and a client in others. Those facts have to be stored, indexed, and retrieved rather than streamed into context at every inference. The reason is the same one that agent-harness builders in the coding world have been running into for the past year, and the comparison is useful.

Consider Claude Code, which runs agentic sessions against a 1M-token window. In principle, a million tokens should be enough to hold a whole engineering session. In practice, Anthropic's own engineering blog documents two failure modes that show up well before the window fills. The first is context rot: an early file read is still technically in the window, but by the end of a long run it sits buried under everything read since, and the model's ability to recall its details degrades. The second is prefill latency: every turn pays to process the full context, whether the current step needs it or not. The harness's response has been to treat context as a resource to actively manage rather than a bag to keep filling. Claude Code uses three primitives for this: compaction (older conversation replaced by a model-generated summary when usage crosses a threshold), a memory tool (the agent writes facts to a filesystem that survives context resets and future sessions), and subagents (work isolated in a fresh context and a compact result returned to the parent). The general pattern is that bigger windows are not a substitute for deciding what deserves to be stateful.
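The compaction primitive can be sketched in a few lines. This is an illustrative toy, not Anthropic's implementation: it assumes a crude 4-characters-per-token estimate and uses a stand-in function where a real harness would call a model to generate the summary:

```python
def rough_tokens(text: str) -> int:
    # crude estimate: ~4 characters per token
    return max(1, len(text) // 4)

class Context:
    """Token-budgeted context that replaces older turns with a summary
    once usage crosses a threshold fraction of the budget."""
    def __init__(self, budget: int, compact_at: float = 0.8,
                 summarize=lambda turns: "[summary of %d turns]" % len(turns)):
        self.budget = budget
        self.compact_at = compact_at
        self.summarize = summarize      # stand-in for a model-generated summary
        self.turns: list[str] = []

    def used(self) -> int:
        return sum(rough_tokens(t) for t in self.turns)

    def append(self, turn: str) -> None:
        self.turns.append(turn)
        if self.used() > self.budget * self.compact_at:
            keep = self.turns[-2:]      # most recent turns stay verbatim
            older = self.turns[:-2]
            self.turns = [self.summarize(older)] + keep
```

The design choice worth noticing is that compaction is lossy on purpose: anything the agent will need after the summary replaces the raw turns has to have been written somewhere stateful first, which is exactly the job of the memory tool.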

The wearable version of this problem is the same in structure and harder in every constraint. The input stream is continuous rather than task-bounded, because the sensor does not shut off. The token budget is smaller, because the model is running on a 2B to 4B parameter local model rather than a frontier cloud endpoint. Compaction cannot run on every inference because it is expensive in both battery and latency. The raw tier has to stay on-device for privacy reasons, so whatever the agent keeps in cloud memory has to already be abstracted. And entity resolution is a problem that barely shows up in Claude Code but dominates wearable deployment: which "Alex"? Was that "meeting" a calendar event or a chance hallway conversation? Those are questions a coding agent never has to answer.

The last twelve months have produced a real memory-infrastructure category aimed at exactly these problems. Mem0 runs an extraction pipeline that decides ADD, UPDATE, DELETE, or NOOP for each incoming fact against existing memories. Zep and Graphiti build a temporal knowledge graph with fact-validity windows, so "the user moved from Mumbai to Bangalore" actually retires the old fact rather than coexisting with it. Letta takes the OS-inspired tiered approach, with core memory always in context, recall memory searchable outside it, and archival memory queried on demand. LongMemEval and LoCoMo are the benchmarks the field iterates against, and architectures have diverged enough that choosing one is now an engineering decision rather than a default. The question for a wearable team is typically which trade-off matches the device's latency budget and privacy posture.
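The shape of the extraction decision Mem0 runs can be illustrated with a toy key-value version. This is not Mem0's actual API; it only shows the ADD/UPDATE/DELETE/NOOP choice against an existing store, with `fact=None` standing in for a retraction:

```python
from enum import Enum

class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"

def decide(store: dict, subject: str, fact):
    """Compare an incoming (subject, fact) pair against existing memories.
    fact=None means the incoming extraction retracts the subject."""
    existing = store.get(subject)
    if fact is None:
        return Op.DELETE if existing is not None else Op.NOOP
    if existing is None:
        return Op.ADD
    return Op.UPDATE if existing != fact else Op.NOOP

def apply_fact(store: dict, subject: str, fact):
    op = decide(store, subject, fact)
    if op in (Op.ADD, Op.UPDATE):
        store[subject] = fact        # the new fact retires the old one
    elif op is Op.DELETE:
        del store[subject]
    return op
```

The point of the UPDATE branch is the one the Zep/Graphiti example makes: "moved from Mumbai to Bangalore" must retire the old fact rather than coexist with it.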

What the frontier looks like specifically for wearables is a hierarchical store where each layer serves a different query pattern. Raw transcript and sensor data sit at the bottom, compressed and on-device. Above that, entity-resolved events ("coworker mentioned Q3 deadline at lunch Tuesday"). Above that, a relationship graph that links entities and tracks how facts change over time. At the top, a small set of stable user preferences that travels in the context window on every turn, in the same role that CLAUDE.md plays in Claude Code. The store has to forget selectively, because one that only accretes drowns within a week. It also has to expose an explicit management surface, because letting the user see and correct what the agent remembers is the single highest-leverage trust-building feature available.
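The four-layer store described above can be sketched as a data structure with per-tier retention. The tier names, the one-day raw TTL, and the string-based context block are illustrative assumptions, not a real product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class TieredMemory:
    # Top tier: small, stable, injected into context on every turn
    preferences: dict = field(default_factory=dict)
    # Middle tiers: entity-resolved events and a relationship graph
    events: list = field(default_factory=list)   # (ts, entity, text)
    graph: dict = field(default_factory=dict)    # entity -> linked entities
    # Bottom tier: raw transcript/sensor data, on-device only, expiring
    raw: list = field(default_factory=list)      # (ts, blob)
    raw_ttl: float = 86_400.0                    # assumed: raw kept one day

    def remember_raw(self, ts: float, blob: str) -> None:
        self.raw.append((ts, blob))

    def forget(self, now: float) -> None:
        """Selective forgetting: the raw tier expires; abstractions persist."""
        self.raw = [(ts, b) for ts, b in self.raw if now - ts < self.raw_ttl]

    def context_block(self) -> str:
        """Only the top tier travels in the prompt, in the CLAUDE.md role."""
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.preferences.items()))
```

The asymmetry is the point: `forget` touches only the raw tier, so what survives a week is the abstracted record, while the privacy-sensitive stream ages out on-device.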

This is also where verification is hardest and most valuable. Pare-Bench works because apps have discrete state machines and tasks have binary success criteria. Real ambient deployment has none of that. Did the agent surface the right context at the right moment? Did it correctly infer that the user was in a meeting rather than a social setting? These have no automated verifier yet. The highest-leverage research opportunity in proactive wearables is the test suite for ambient context: labeled datasets, user-study harnesses, behavioral metrics for timing and relevance. Whoever builds that owns an inherent advantage in the iteration loop for every serious deployment downstream.

Thesis 3: The Privacy Landscape is Shifting Fast

The standard framing for privacy-first architecture used to be that it was a concession: no camera meant missing out on features. That framing is now obsolete. Between Kenya's ODPC probe, Italy's Garante inquiry, the EFF/ACLU coalition against Name Tag, and Meta's indefinite postponement of Ray-Ban Display's international rollout, privacy has graduated from a differentiator to a structural constraint that shapes what products can be sold where. The EU AI Act's August effective date brings risk-based assessments for biometric processing and real-time monitoring, exactly the feature surface of camera-equipped display glasses.

Even Realities is the salient case study for this thesis. The G2 (launched November 2025, ~$599, estimated 10,000–25,000 units shipped) is built around a deliberate set of constraints: no camera, no external speaker, monochrome green heads-up display at 640x350, on-device processing where possible, and a proprietary EvenLLM. The company calls this "quiet tech." In practice, it is an architectural commitment that makes the glasses unsellable in some use cases (anything needing computer vision) and structurally advantaged in others: regulated enterprise (healthcare, legal, finance), European markets under GDPR, and the substantial user base that actively rejects face-mounted cameras. The USA Deaf Swimming Team already uses them for live translation and transcription, which is not a hypothetical deployment; it is the use case the existing user base is already validating. Even's vertical optical integration compounds this advantage: the company owns its lens factory and supports prescription lenses from −12 to +12 diopters, the widest range in the category. Roughly 60% of adults need corrective lenses; Ray-Ban Meta offloads prescription to EssilorLuxottica, and most Chinese competitors don't support prescription at all.

The challenge to the thesis is the direct competitive attack on exactly this positioning. Halliday raised $3.3M on Kickstarter in early 2026 for "the world's first proactive AI glasses with invisible display," at 28.5g with no camera, included prescription, ring control, and explicit "proactive AI" framing. Nimo is at $363 with comparable hardware. TCL RayNeo, Xiaomi, and Huawei are all integrating domestic LLMs and moving fast. The privacy-first, prescription-first, display-equipped niche is now contested rather than open. But the contest itself is evidence for the thesis: the market has decided this architectural commitment is worth building against, not around.

Event Context and the Three Tracks

The event on April 26th 2026, co-hosted by AGI House and Even Realities, is structured around a single constraint: what does agent design look like when the interface is always on? Every attendee gets hands-on time with the G2. The three tracks (Ambient Agents, Agents with Memory, Agents for Good) are each a different slice of the dynamic above.

Track 1: Ambient Agents

The Pare-Bench result framing this track: even frontier models succeed only 42% of the time at proactive tasks when evaluated through a simulated active user. The bottleneck is timing: when to propose, when to wait, when to gather more context before proposing.

The current technical blockers:

  • Trigger design. No one has a clean abstraction for semantic triggers on continuous ambient signal, and running an LLM on every audio chunk is a non-starter for battery.
  • Proposal/wait calibration. This is model-specific and empirically tuned, and a bad proposal on a glasses HUD is higher-cost than a bad notification on a phone because the user can't easily dismiss it while walking.
  • On-device compute. The G2 is a thin client with all compute on the phone, which means the proactive loop is constrained by phone battery and phone-to-cloud latency.
  • Signal-from-noise robustness. Pare-Bench showed Claude staying stable at 2/4/6 events per minute while Gemini Flash and GPT-5 degrade, and real environments are much noisier than Pare-Bench simulates.
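One common shape for the trigger problem is a two-stage design: a near-free gate that runs on every chunk, bounding how often the expensive model is invoked. The keyword gate below is a deliberately crude stand-in for whatever lightweight classifier a real system would use, and the 0.8 threshold is an assumption:

```python
def cheap_gate(chunk: str,
               keywords=("remind", "deadline", "buy", "out of")) -> bool:
    """Stage 1: runs on every transcript chunk at near-zero cost."""
    text = chunk.lower()
    return any(k in text for k in keywords)

def make_trigger(llm_score, threshold: float = 0.8):
    """Stage 2: the expensive model scores only chunks that pass the gate,
    so LLM invocations (and battery draw) scale with signal, not with time."""
    def trigger(chunk: str) -> bool:
        if not cheap_gate(chunk):
            return False
        return llm_score(chunk) >= threshold
    return trigger
```

The battery argument lives entirely in stage 1: on a quiet afternoon the model is never invoked at all, which is the property a per-chunk LLM call cannot have.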

What would be impressive to see: a working proposal/confirmation loop running on the glasses, with a suggestion in the display, accept or reject via ring or voice, and clean execution. A calibrated silence threshold, meaning a demo where the agent deliberately doesn't intervene during ambiguous scenarios, with a clean explanation of why. Multi-app orchestration from a single ambient trigger. A local-first architecture doing serious work without continuously round-tripping audio or screen context.

What is hard: every knob is tuned against user perception, not ground truth. The failure mode isn't wrong answers; it's correct but poorly-timed ones, and users who stop wearing the glasses.

Track 2: Agents with Memory

The memory category is now real (LoCoMo, LongMemEval, Mem0, Zep, Letta, OMEGA), but wearable-specific deployment has constraints those frameworks weren't purpose-built for: entity resolution at the edge (which "Alex"?), selective forgetting, tiered privacy boundaries (recent/personal on-device, aggregate/anonymized in cloud), and retrieval latency inside the proactive loop (vector search against personal memory at acceptable battery cost on phone-class hardware is still open territory).

What would be impressive: a system that demonstrably recalls a detail from a conversation earlier in the day and uses it in an unrelated context later. Explicit memory management UI, where the user sees what the agent remembers, corrects it, deletes it, freezes it. Graph-structured memory that answers "what did we decide about the apartment search?" not just "what did we talk about yesterday?" A demonstration that the system works when the user isn't wearing the glasses, with memory persisting and updating from other sources.

What is hard: memory that improves user experience compounds over weeks. A one-day hackathon can only produce architecture and scripted demos, not demonstrated long-horizon value. The teams that will impress are the ones that pick a narrow slice (meeting context recall, person-memory, a specific productivity workflow) and nail it end-to-end.

Track 3: Agents for Good

Accessibility is the track where the G2 has the clearest, most immediate product-market fit. Hearing accessibility doesn't need computer vision; it needs audio-to-display with low latency. The USA Deaf Swimming Team is already using Even Realities for real-time translation and transcription. Xander Glasses, Hearview, XRAI Glass, Ava, and the Longitude Prize-winning CrossSense project are all in this space. Visual accessibility is harder on a no-camera device, because OCR and scene description are structurally unavailable, but audio-first assistance (environmental description, spatial audio cues, conversational scaffolding) has real surface area. Cognitive and mental-wellbeing applications (memory aids for early-stage cognitive decline, attention support for ADHD, social-cue scaffolding for neurodivergent users) are domains where ambient and private matter more than capable, and where Even's design philosophy aligns naturally.

Technical blockers: sub-second end-to-end captioning (cloud round-trips usually break this; on-device ASR is close but not uniformly good); speaker diarization on phone-class compute with glasses-frame microphone geometry; context-sensitive communication support that matches register and preserves the user's voice; and regulatory/medical-device boundaries (anything framed as "medical" triggers FDA and equivalents).

What would be impressive: a captioning system running on the G2 with sub-second latency, speaker labels, acceptable accuracy, and a user-test demonstration. A communication-assistance agent that navigates a specific high-friction scenario (doctor's appointment, cross-language business meeting, unfamiliar city). A cognitive scaffold that reminds a user of commitments or names in a socially graceful way. An explicit safety and consent model for any audio-capture demo.

What is hard: accessibility users are not a forgiving audience. They have seen a decade of well-meaning prototypes that didn't survive contact with real deployment. Any team building credibly in this space should find an actual user to test with before the demo, not after. The prize for doing this is that accessibility is where the proactive wearable agent has its clearest value proposition today: not as a productivity nudge for knowledge workers, but as a genuine capability extension for users whose interaction with the world has measurable friction the agent can reduce.

What We're Watching

The memo's core argument is that the proactive wearable agent space is not a model-capability race. Goal inference scales with frontier models, but the decisions that actually determine product success — when the agent speaks, what it remembers, and what the sensor is allowed to capture — are deployment-specific, interface-specific, and increasingly regulation-specific. Restraint, memory, and privacy architecture are the three axes where defensible positioning accrues. Here are the signals we would watch to confirm or falsify them.

Apple's 2027 entry and its choice of eyewear partner. If Apple ships a privacy-first, architecturally constrained device through a traditional eyewear house, Thesis 3 holds and the moat concentrates in fewer hands. If Apple ships something camera-heavy that users accept because it's Apple, the regulatory pressure we're tracking is softer than we think.

Whether Meta's next Ray-Ban generation ships with Name Tag. This is the cleanest referendum on whether the regulatory front actually binds. If Name Tag ships in any form despite the ACLU/EFF coalition and active Kenya and Italy probes, privacy-as-moat is theater and Meta's distribution wins; if it's quietly dropped or hard-gated by jurisdiction, the floor is real and privacy-architected companies have been pricing correctly.

Whether a model provider ships a proactive-agent SDK that gets real traction. Thesis 1 rests on the claim that the social contract of agent speech cannot be shipped by a model provider. Heavy adoption of an OpenAI, Anthropic, or Google proactive SDK would suggest the interface layer is more commoditizable than the memo claims; light adoption with deployments rewriting on top confirms the timing decision travels with the device and the user, not the model.

The trajectory of on-device proactive-task performance. The Pare-Bench gap between frontier cloud models and small local models is the load-bearing number for whether wearables remain cloud-dependent. If a 2B-to-4B local model gets within ten points of frontier on proactive tasks in the next twelve months, the observer-on-glass / executor-in-cloud architecture starts giving way to fully local agents.

Whether a real test suite for ambient context gets built. Thesis 2 argues the highest-leverage research opportunity is a Pare-Bench equivalent for continuous multimodal input. If one gets built and adopted, the iteration loop on wearable agent quality tightens dramatically; if not, vertical deployments with their own ground truth (accessibility, translation) continue to have a structural head start over general-purpose ambient assistants.

---

See you Sunday!