The Multi-Agent Part Is Easy to Agree On
By now, most teams building with LLMs have learned the same lesson: a single agent can't do everything. Give it a 200-page document, a set of regulatory rules, and ask for a structured report, and it will hallucinate, lose context, and take forever. The answer is well understood. Decompose the work across specialized agents: one reads, another extracts, a third validates, and an orchestrator coordinates the whole thing.
The hard part isn't deciding to go multi-agent. It's building a multi-agent runtime that actually works. This post walks through how we built ours, covering the problems we hit in the order we hit them and what we ended up building.
From Text Messages to Typed Contracts
Our first version of agent-to-agent communication was embarrassingly simple: send a text string. "Please extract the tables from this filing." The child would produce text output, the parent would try to parse it, and things would break in creative ways. No schema, no validation, no guarantees about what the result would look like.
We replaced this with a first-class task primitive that bundles a typed input and an expected output type into a single delegation unit. The parent agent creates a task with a structured model as input and declares what shape the result should have. The child agent receives the task, does the work, and submits its result. At submit time, the result is validated against the declared output type, and if it doesn't match, it fails loudly. No silent schema drift.
The parent receives a deserialized, strongly-typed object rather than a string to be re-parsed.
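To make the idea concrete, here is a minimal sketch of what such a task primitive could look like. The names `Task`, `ExtractTablesInput`, `ExtractedTables`, and `ResultValidationError` are hypothetical and chosen for illustration; they are not the runtime's actual API, and a real system would likely validate field by field with a schema library rather than with a plain `isinstance` check.

```python
from dataclasses import dataclass
from typing import Any, Generic, Type, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


class ResultValidationError(TypeError):
    """Raised when a submitted result does not match the declared output type."""


@dataclass(frozen=True)
class ExtractTablesInput:  # hypothetical input model
    filing_id: str
    first_page: int
    last_page: int


@dataclass(frozen=True)
class ExtractedTables:  # hypothetical output model
    filing_id: str
    table_count: int


@dataclass
class Task(Generic[In, Out]):
    """One delegation unit: a typed input plus the declared result type."""

    input: In
    output_type: Type[Out]

    def submit(self, result: Any) -> Out:
        # Validate at submit time so a mismatch fails loudly on the child's
        # side instead of surfacing later as a parsing problem in the parent.
        if not isinstance(result, self.output_type):
            raise ResultValidationError(
                f"expected {self.output_type.__name__}, got {type(result).__name__}"
            )
        return result


if __name__ == "__main__":
    task = Task(
        input=ExtractTablesInput("filing-1", 1, 10),
        output_type=ExtractedTables,
    )
    print(task.submit(ExtractedTables("filing-1", 3)))
```

Submitting a plain string such as `"three tables"` to the same task raises `ResultValidationError`, which is the loud failure described above.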
Teaching Agents to Wait
With typed contracts solved, the next problem was immediate: how does the parent actually wait for the child to finish?
Our agents are LLM-powered loops: the model generates a tool call, we execute it, feed the result back, repeat. When an agent spawned a subagent, it had no native way to "wait," so it would poll, checking the child's session status in a loop and burning tokens on every iteration.
Polling wasn't even the worst of it. LLMs are eager. They want to finish the task. If a subagent took more than a few seconds, the parent would often get impatient and start doing the subtask itself, duplicating work, producing conflicting results, or hallucinating an answer rather than waiting for the real one. We needed a way to force the agent to stop and wait.
LLM agents can't async/await, and they're not coroutines, but they can stop.
When an agent decides to wait for a subagent, it signals the runtime to halt the agent loop. No more LLM requests, but the execution environment stays alive with all variables and state fully intact. The agent is idle but not dead.
This solves three problems at once. First, token savings: a polling agent burns LLM tokens on every "are you done yet?" check, and those checks add up fast when subagents take minutes or hours. A yielded agent uses zero tokens while waiting. Second, compute savings: pausing is not just a logical wait. Once the agent loop has yielded, the cloud E2B sandbox that hosts its execution can be paused too. Its filesystem and state are retained, but the sandbox is not consuming active CPU while it waits, freeing infrastructure for other tasks. Third, and most importantly, preventing premature work: because the agent loop is forcefully stopped, the model has no turn in which to decide to do the subtask itself. Without this, eager models would routinely start working on the delegated problem after a few seconds of silence, producing conflicting or hallucinated results.
The wake-up mechanism is event-driven. When a subagent finishes its work and submits a result, the platform fires a completion event. This event is routed through a persistent delivery system back to the sleeping parent's execution environment. If the E2B sandbox was paused, the runtime resumes it first. Then the event resolves the parent's Promise (more on how we built our Promise system in the next section), which triggers an inbox notification (a message injected into the agent's conversation that wakes it up on its next turn). The parent resumes exactly where it left off, with the typed result available immediately. No polling, no status checks. The platform pushes the result to the agent when it's ready.
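The yield-and-wake cycle described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual runtime: `AgentSession`, `run_loop`, `on_completion_event`, and the string-based action format are all hypothetical, the model is a scripted stand-in, and sandbox pause/resume and durable delivery are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AgentSession:
    """State that stays intact while the loop is halted (hypothetical shape)."""

    messages: List[str] = field(default_factory=list)
    waiting_on: Optional[str] = None  # id of the pending Promise, if any
    llm_calls: int = 0


def run_loop(session: AgentSession, model: Callable[[List[str]], str]) -> str:
    """Run turns until the model either finishes or asks to wait."""
    while True:
        session.llm_calls += 1
        action = model(session.messages)  # one LLM request per turn
        if action.startswith("wait:"):
            # Halt the loop: no further LLM requests, state is kept as-is.
            session.waiting_on = action[len("wait:"):]
            return "yielded"
        if action == "done":
            return "done"
        session.messages.append(f"tool_result:{action}")


def on_completion_event(session, promise_id, result, model) -> str:
    """Resolve the pending wait, inject an inbox message, and resume the loop."""
    if session.waiting_on != promise_id:
        return "ignored"
    session.waiting_on = None
    session.messages.append(f"inbox:{promise_id} resolved with {result}")
    return run_loop(session, model)


def scripted_model(messages: List[str]) -> str:
    """Stand-in for the LLM: spawn a child, wait on it, then finish."""
    if not messages:
        return "spawn_subagent"
    if messages[-1].startswith("tool_result:"):
        return "wait:p1"
    return "done"


if __name__ == "__main__":
    session = AgentSession()
    print(run_loop(session, scripted_model))
    print(on_completion_event(session, "p1", "3 tables", scripted_model))
    print(session.llm_calls)
```

The point of the sketch is the call count: between the yield and the completion event, `llm_calls` does not change, because nothing is asking the model for a turn.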
The same Promise mechanism can work with external systems the agent needs to wait on, not just subagents, as long as they can report completion through an integration, webhook, or event source the runtime can observe. Examples include background jobs like document fingerprinting or data pipelines, CI checks that take minutes to complete, and events from third-party systems. An agent can kick off a build, yield, and wake up minutes later when the build passes or fails, with the full result payload ready to act on.
Coordinating Deep Hierarchies
Once agents could delegate typed work and yield while waiting, the next challenge was composition. Real workflows have depth. An orchestrator spawns a filing analyzer, which spawns a table extractor, which spawns a cell parser. Three levels deep, each waiting on its children, each needing results to flow back up the chain. Sometimes the orchestrator also needs to fan out, spawning 10 analyzers in parallel, waiting for all of them, then merging the results.
Without a coordination layer, this turns into spaghetti: manual callbacks, ad-hoc status tracking, results getting lost between layers.
We built a Promise system (inspired by JavaScript Promises) as the universal coordination primitive. A Promise represents "a value that will be available in the future" and composes naturally:
Sequential chains pipe the output of one agent into the next. Parallel-all waits for every child to complete before continuing. Parallel-any resolves as soon as the first child succeeds, which is useful for racing a primary strategy against a fallback. Built-in error handling, cleanup hooks, and configurable timeouts round out the system.
Each level in the hierarchy uses the same primitives. A subagent three levels deep can itself spawn children, coordinate their results, and submit its own result upward. The same mechanism works at every layer.
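The three composition modes can be approximated with Python's `asyncio` as a stand-in for the Promise system. This is a sketch under that assumption only: `run_agent`, `chain`, `parallel_all`, and `parallel_any` are hypothetical names, and real agent Promises survive a halted loop and a paused sandbox, which in-process coroutines do not.

```python
import asyncio


async def run_agent(name, payload, delay=0.0, fail=False):
    """Hypothetical stand-in for spawning a subagent and awaiting its result."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name}({payload})"


async def chain(payload, *stages):
    """Sequential chain: each stage receives the previous stage's output."""
    for stage in stages:
        payload = await stage(payload)
    return payload


async def parallel_all(*children, timeout=None):
    """Wait for every child; results keep the order the children were passed in."""
    return await asyncio.wait_for(asyncio.gather(*children), timeout)


async def parallel_any(*children):
    """Resolve with the first child that succeeds; fail only if all of them fail."""
    tasks = [asyncio.ensure_future(child) for child in children]
    errors = []
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception as exc:  # a failed child does not end the race
                errors.append(exc)
        raise RuntimeError(f"all {len(errors)} children failed")
    finally:
        for task in tasks:
            task.cancel()  # cleanup: stop children that lost the race


async def demo():
    piped = await chain(
        "filing.pdf",
        lambda p: run_agent("read", p),
        lambda p: run_agent("extract", p),
    )
    fanned = await parallel_all(run_agent("a", 1, 0.02), run_agent("b", 2, 0.01))
    raced = await parallel_any(
        run_agent("primary", "x", 0.01, fail=True),
        run_agent("fallback", "x", 0.02),
    )
    return piped, fanned, raced


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that `parallel_any` here follows "first success" semantics, so a primary strategy that fails quickly does not prevent the fallback from winning.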
Operating a Multi-Agent Runtime at Scale
The primitives above work for 2–3 agents. But real workflows decompose into dozens of parallel subtasks: analyze 50 filings, extract tables from 30 documents. Two problems surface at scale that don't exist at small numbers: infrastructure gets saturated, and the orchestrator drowns in noise.
Bounded concurrency
Spawning 50 agents at once would saturate sandbox provisioning and hit API rate limits. We built a pool dispatcher that manages backpressure automatically. You give it a list of work items, a way to spawn an agent per item, and a concurrency limit. It launches up to the limit, then backfills as each agent completes, following the classic semaphore-gated fan-out pattern but purpose-built for agent Promises.
The pool tracks everything: successes, failures, ordered results, errors with index, and total duration. You can cancel mid-flight to stop dispatching new items while letting in-progress work finish.
For the common case, we also built a one-liner that spawns one subagent per item, waits for all of them, and delivers the collected results as a single message when everything's done.
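The semaphore-gated fan-out pattern mentioned above looks roughly like this in `asyncio`. It is a sketch of the general pattern rather than the pool dispatcher itself: `run_pool`, `PoolReport`, and the `spawn` callback are hypothetical names, and mid-flight cancellation is omitted for brevity.

```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class PoolReport:
    results: list  # ordered by input index; None where the item failed
    errors: list = field(default_factory=list)  # (index, exception) pairs
    duration_s: float = 0.0


async def run_pool(items, spawn, limit):
    """Run spawn(item) for every item with at most `limit` in flight at once."""
    gate = asyncio.Semaphore(limit)
    results = [None] * len(items)
    errors = []
    started = time.monotonic()

    async def worker(index, item):
        # Each worker waits for a slot; a slot frees up when another finishes,
        # which is what backfills the pool up to the limit.
        async with gate:
            try:
                results[index] = await spawn(item)
            except Exception as exc:
                errors.append((index, exc))

    await asyncio.gather(*(worker(i, item) for i, item in enumerate(items)))
    return PoolReport(results, errors, time.monotonic() - started)


async def demo():
    in_flight = 0
    peak = 0

    async def spawn(item):  # stand-in for spawning one agent per work item
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        if item == 3:
            raise ValueError("bad filing")
        return item * 10

    report = await run_pool(list(range(6)), spawn, limit=2)
    return report, peak


if __name__ == "__main__":
    report, peak = asyncio.run(demo())
    print(report.results, report.errors, peak)
```

Results are written by index rather than appended, so the report stays in input order no matter which agent finishes first.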
Notification coalescing
The other problem at scale is noise. If 50 subagents each try to wake the parent when they complete, the parent gets 50 separate inbox notifications, processes each one as its own turn, and burns tokens on 50 "oh, another result arrived" reactions instead of seeing the full picture at once.
Promises in a chain or a pool share a notification group. The group tracks whether any notification has already been sent, and once one fires, the rest are silenced. This is implemented as a union-find data structure: when Promises are composed through chaining, parallel-all, or parallel-any, their notification groups are merged so they share a single notification flag.
For agent pools, this is even more intentional. Each individual child Promise explicitly opts out of notifications. Only the pool's terminal Promise (the one that resolves when all children are done) is wired to notify the parent. So 50 agents complete, results are collected into an ordered summary, and the orchestrator wakes up exactly once with the full picture: 48 succeeded, 2 failed, here are all the results in order.
This might sound like a small detail, but it's the difference between an orchestrator that scales and one that drowns in noise.
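A notification group built on union-find can be sketched as follows. The class and method names (`NotificationGroups`, `merge`, `should_notify`) are hypothetical; the sketch only shows how merged groups end up sharing a single "already notified" flag, not how the runtime wires it into Promise composition.

```python
class NotificationGroups:
    """Union-find over Promise ids; each group root carries one notified flag."""

    def __init__(self):
        self._parent = {}
        self._notified = {}

    def find(self, pid):
        """Return the group root for pid, registering pid on first sight."""
        self._parent.setdefault(pid, pid)
        self._notified.setdefault(pid, False)
        root = pid
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the walked path at the root.
        while self._parent[pid] != root:
            self._parent[pid], pid = root, self._parent[pid]
        return root

    def merge(self, a, b):
        """Called when two Promises are composed (chain, all, any)."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a
            # If either side already notified, the merged group stays silenced.
            self._notified[root_a] = self._notified[root_a] or self._notified[root_b]

    def should_notify(self, pid):
        """True for the first notification in a group, False for all later ones."""
        root = self.find(pid)
        if self._notified[root]:
            return False
        self._notified[root] = True
        return True


if __name__ == "__main__":
    groups = NotificationGroups()
    for child in ["c1", "c2", "c3"]:
        groups.merge("race", child)  # e.g. children of one parallel-any
    print([groups.should_notify(p) for p in ["c2", "c1", "c3", "solo"]])
```

In this sketch the pool behavior from the previous paragraph needs no extra machinery: child Promises that opt out simply never call `should_notify`, and only the pool's terminal Promise does.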
Shared and Isolated Compute
Once you have deep agent hierarchies, the next question is: where does each agent run? Some subagents need filesystem isolation because they're modifying files and can't step on each other. Others are doing lightweight computation and sharing an environment is fine (and way cheaper). Forcing every agent into a single placement model is wasteful either way: isolating everything pays sandbox provisioning time for agents that don't need it, while sharing everything lets file-modifying agents step on each other.
Every agent spawn now specifies its resource placement:
- Shared compute: The child runs in the parent's E2B sandbox with the same filesystem, installed packages, and working directory. Near-zero startup overhead. When the child finishes and submits its result, the platform automatically cleans up its execution environment so it doesn't leak resources. This is ideal for batch-spawning many lightweight agents that each do a focused piece of work and exit.
- Isolated compute: The child gets its own isolated E2B sandbox with a completely separate filesystem. This is essential when agents modify files, because two agents rewriting the same config file in a shared sandbox would corrupt each other's work. Isolated spawns are concurrency-capped to prevent resource exhaustion.
Shared compute is where things get interesting. Multiple agents can collaborate on the same codebase, working on the same machine on different parts of the task: one agent writes a file, another reads and validates it, a third runs tests against it, all in the same sandbox and coordinated through the Promise system.
This maps to a clean 2×2: one-shot tasks vs. long-lived sessions, crossed with shared vs. isolated compute. The orchestrator picks the right quadrant for each subtask based on what the work actually needs.
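As an illustration of that 2×2, a per-spawn placement declaration might look like the following. Everything here is an assumption made for the example: `Compute`, `Lifetime`, `SpawnSpec`, and the toy `choose_placement` policy are hypothetical names, and the real orchestrator's decision logic is not described in this post.

```python
from dataclasses import dataclass
from enum import Enum


class Compute(Enum):
    SHARED = "shared"      # run in the parent's sandbox
    ISOLATED = "isolated"  # run in a separate sandbox with its own filesystem


class Lifetime(Enum):
    ONE_SHOT = "one_shot"  # do one task, submit, exit
    SESSION = "session"    # long-lived, handles multiple interactions


@dataclass(frozen=True)
class SpawnSpec:
    """Placement declared per spawn (hypothetical shape)."""

    agent: str
    compute: Compute
    lifetime: Lifetime = Lifetime.ONE_SHOT


def choose_placement(agent, may_conflict_on_files, long_lived=False):
    """Toy policy: isolate only agents whose file writes could collide."""
    compute = Compute.ISOLATED if may_conflict_on_files else Compute.SHARED
    lifetime = Lifetime.SESSION if long_lived else Lifetime.ONE_SHOT
    return SpawnSpec(agent, compute, lifetime)


if __name__ == "__main__":
    print(choose_placement("cell-parser", may_conflict_on_files=False))
    print(choose_placement("config-rewriter", may_conflict_on_files=True))
```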
Production Hardening
Getting the primitives right was only half the battle. Making them reliable in production required solving two more problems that aren't obvious until you hit them at scale.
First, portable serialization. When a parent agent sends a typed input to a child running in an isolated E2B sandbox, the two environments might not have identical code deployed. Standard serialization stores class definitions as module references (just a dotted path like "mymodule.MyClass"), which fails if the receiver doesn't have that exact module installed. We built a serialization layer that walks the full type graph of the payload and can embed application-level class definitions directly into the serialized bytes. The result is a more portable payload that can move between E2B sandboxes without requiring every application class to be pre-installed on both sides, while still relying on a compatible runtime and trusted serialization boundary.
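The core idea, shipping the class definition along with the data, can be shown with a deliberately small sketch. It is an assumption-heavy simplification: it handles only flat dataclasses with primitive fields, uses JSON rather than a binary format, and does not walk a type graph the way the layer described above does. `FilingRequest`, `dumps_portable`, and `loads_portable` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass, fields, make_dataclass

# Only these primitive field types are supported in this sketch.
_PRIMITIVES = {"str": str, "int": int, "float": float, "bool": bool}


@dataclass
class FilingRequest:  # hypothetical application-level class on the sender side
    filing_id: str
    max_pages: int


def _type_name(annotation):
    # Annotations may be real types or strings, depending on how they were written.
    return annotation if isinstance(annotation, str) else annotation.__name__


def dumps_portable(obj) -> bytes:
    """Serialize a flat dataclass together with its own field schema."""
    schema = [(f.name, _type_name(f.type)) for f in fields(obj)]
    envelope = {"class": type(obj).__name__, "schema": schema, "data": asdict(obj)}
    return json.dumps(envelope).encode("utf-8")


def loads_portable(blob: bytes):
    """Rebuild the class from the embedded schema, without importing the sender's module."""
    envelope = json.loads(blob.decode("utf-8"))
    spec = [(name, _PRIMITIVES[type_name]) for name, type_name in envelope["schema"]]
    cls = make_dataclass(envelope["class"], spec)
    return cls(**envelope["data"])


if __name__ == "__main__":
    blob = dumps_portable(FilingRequest("filing-1", 40))
    clone = loads_portable(blob)
    print(clone, type(clone) is FilingRequest)
```

The reconstructed object has the same name and fields as the original but is an instance of a freshly built class, which is what lets the receiver work without the sender's module installed. Embedding executable definitions instead of a declarative schema is more general, and it is also why the trusted serialization boundary mentioned above matters.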
Second, durable event delivery. Completion events (the "child is done" signal that wakes a sleeping parent) can't be fire-and-forget. If the event is lost, the parent sleeps forever. We built a durable delivery pipeline: completion events are persisted to a local store before delivery is attempted. A background worker claims events atomically and delivers them to the correct parent agent. If delivery fails, it retries with exponential backoff. That keeps events from being lost, even if the receiving agent's environment is temporarily busy.
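A persist-then-deliver pipeline of this kind can be sketched with SQLite as the local store. The table layout, function names (`open_store`, `persist`, `claim`, `deliver_once`), and the backoff base are assumptions made for the example; the post does not say which store or schedule the runtime uses. Time is passed in explicitly so the retry logic is easy to follow.

```python
import sqlite3


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY,
               parent_id TEXT NOT NULL,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               attempts INTEGER NOT NULL DEFAULT 0,
               next_attempt_at REAL NOT NULL DEFAULT 0)"""
    )
    return db


def persist(db, parent_id, payload):
    """Write the completion event to the store before any delivery attempt."""
    with db:
        db.execute(
            "INSERT INTO events (parent_id, payload) VALUES (?, ?)",
            (parent_id, payload),
        )


def claim(db, now):
    """Claim one due event; the guarded UPDATE lets only one worker win it."""
    row = db.execute(
        "SELECT id, parent_id, payload, attempts FROM events "
        "WHERE status = 'pending' AND next_attempt_at <= ? ORDER BY id LIMIT 1",
        (now,),
    ).fetchone()
    if row is None:
        return None
    with db:
        won = db.execute(
            "UPDATE events SET status = 'claimed' WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
    return row if won.rowcount == 1 else None


def deliver_once(db, send, now, base_delay=1.0):
    """Try to deliver one event; on failure, reschedule with exponential backoff."""
    event = claim(db, now)
    if event is None:
        return "idle"
    event_id, parent_id, payload, attempts = event
    try:
        send(parent_id, payload)
    except Exception:
        delay = base_delay * (2 ** attempts)  # 1x, 2x, 4x, ... the base delay
        with db:
            db.execute(
                "UPDATE events SET status = 'pending', attempts = attempts + 1, "
                "next_attempt_at = ? WHERE id = ?",
                (now + delay, event_id),
            )
        return "retry"
    with db:
        db.execute("UPDATE events SET status = 'delivered' WHERE id = ?", (event_id,))
    return "delivered"
```

A worker would call `deliver_once` in a loop with the current time. Because the event row exists before the first attempt, a failed or interrupted delivery leaves something to retry rather than a parent that sleeps forever.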
The Full Picture
Here's what the runtime looks like today:
- Typed delegation. Agents exchange structured models with declared output types, validated at submit time.
- Cooperative yielding. Agents stop their loop, preserve their state, and wait without active polling or active CPU work until a completion event wakes them. No premature work, no wasted tokens.
- Composable orchestration. Sequential chains, parallel-all, and parallel-any work at every level of the agent tree, with error handling and timeouts built in.
- Explicit resource placement. Shared or isolated compute, declared per-spawn, with automatic cleanup.
- Bounded concurrency with coalesced notifications. Agent pools manage backpressure and deliver one summary instead of fifty wake-ups.
The core insight is that an LLM agent loop is not a coroutine, but it can cooperatively yield. Once you have that primitive (stop the loop, keep the state, wake me on completion), everything else composes on top of it.
We're running this in production today, where a single user request routinely spawns 10–20 specialized agents working in parallel. The architecture handles it without polling, without token waste, and without the orchestrator losing track of what it asked for.
The same runtime powers deep research across decades of insurance filings, end-to-end rater construction, and structured workflows where agents and your team do the work together.