Claude Code Swarm (Part 2): Leader/Worker State Machines
A practical blueprint for multi-agent swarms: state transitions, idempotency, retries, and shutdown that doesn’t break under real delivery.
Most multi-agent systems fail for one reason: they treat coordination like chat.
Chat is not reliable under retries. Chat has no replay semantics. Chat is not idempotent.
If you want a swarm you can ship, you need explicit state machines on both sides:
- the leader (the only place where decisions become policy)
- the worker (an execution engine that never outruns its approvals)
This post lays out the minimum state machines and rules you need to make swarms behave predictably.
Part 2 of 3. Prev: /swarm-mailbox-protocol/. Next: /swarm-mailbox-type-catalog/.
What You’ll Learn
- The minimum leader/worker states you need for correctness.
- A mailbox-driven event loop that is safe under duplicates and restarts.
- How plan approval, permission sync, sandbox approvals, and shutdown fit together.
- Invariants you can test.
The One Rule That Prevents Chaos
Treat every coordination message as an event in a state machine, not as text to be “interpreted”.
That means:
- every request has an id
- every response is correlated
- every state transition is monotonic (a request that has been decided is never reopened)
- retries are expected
- duplicates are tolerated
If you don’t do this, your swarm will work until it doesn’t, and then you won’t be able to debug it.
The Transport Assumption
Assume you have a “mailbox” transport that can deliver:
- out of order
- duplicated
- late (after the sender believes it is done)
Assume the transport can also be replayed after a crash.
These are not pessimistic assumptions. They are what happens when you build on files, background loops, user interaction, and long-running sessions.
So the protocol has to be correct even when delivery isn’t.
Message Model
Use an envelope plus a typed payload.
Envelope (transport-level, generic):
- from
- timestamp
- text (either plain text, or a JSON string)
- read (delivery bookkeeping)
Payload (protocol-level, typed):
- type (required)
- requestId (required for requests and most responses)
- timestamp (required)
- body (type-specific)
The key design choice: typed payloads drive side effects; plain text never does.
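As a rough illustration of the envelope/payload split, here is a minimal Python sketch. The class names, the `sender` attribute (standing in for `from`, which is a Python keyword), and the message type used in the usage note are assumptions made for this example, not part of any real API. The point it shows: text that does not parse into a typed payload is returned as `None` and can only ever be treated as chat.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Envelope:
    sender: str          # the envelope's "from" field
    timestamp: str
    text: str            # plain text, or a JSON string carrying a payload
    read: bool = False   # delivery bookkeeping


@dataclass
class Payload:
    type: str
    timestamp: Any
    request_id: Optional[str] = None
    body: Any = None


def parse_payload(envelope: Envelope) -> Optional[Payload]:
    """Return a typed payload, or None if the text is ordinary chat."""
    try:
        data = json.loads(envelope.text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # "type" and "timestamp" are required; anything else is not a payload.
    if not isinstance(data.get("type"), str) or "timestamp" not in data:
        return None
    return Payload(
        type=data["type"],
        timestamp=data["timestamp"],
        request_id=data.get("requestId"),
        body=data.get("body"),
    )
```

With this shape, a message such as "please approve my plan" yields `None`, so it can never trigger an approval by accident.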
Worker State Machine
Here is a minimal worker lifecycle that stays correct under concurrency.
States:
- booting
- waiting_for_mode (optional, if leader can push “mode” or policy)
- idle
- plan_proposed (waiting for approval)
- executing
- blocked_on_permission
- blocked_on_sandbox
- shutting_down
- terminated
Transitions (high level):
- booting -> idle
- idle -> plan_proposed when plan-mode is required and a task starts
- plan_proposed -> executing only after plan approval
- executing -> blocked_on_permission when a tool needs approval
- executing -> blocked_on_sandbox when network access needs approval
- executing -> idle when work completes (or when task is cancelled)
- any state -> shutting_down on shutdown request
- shutting_down -> terminated after acknowledgment and cleanup
Worker invariants (the ones worth testing):
- A worker never executes a task action unless it is in executing.
- A worker never enters executing from idle if plan-mode is required.
- A worker never applies “permission granted” unless the requestId matches an outstanding request.
- A worker always acknowledges shutdown, even if mid-execution.
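One way to make these invariants testable is an explicit transition table. The sketch below is illustrative only. The transitions marked "assumption" (leaving waiting_for_mode, resuming from a blocked state, and idle -> executing when plan-mode is off) are not listed above and are filled in here so the example runs. Treating a repeated shutdown request as a no-op is also an assumption.

```python
from enum import Enum


class WorkerState(Enum):
    BOOTING = "booting"
    WAITING_FOR_MODE = "waiting_for_mode"
    IDLE = "idle"
    PLAN_PROPOSED = "plan_proposed"
    EXECUTING = "executing"
    BLOCKED_ON_PERMISSION = "blocked_on_permission"
    BLOCKED_ON_SANDBOX = "blocked_on_sandbox"
    SHUTTING_DOWN = "shutting_down"
    TERMINATED = "terminated"


S = WorkerState

ALLOWED = {
    (S.BOOTING, S.IDLE),
    (S.BOOTING, S.WAITING_FOR_MODE),         # assumption
    (S.WAITING_FOR_MODE, S.IDLE),            # assumption
    (S.IDLE, S.PLAN_PROPOSED),
    (S.IDLE, S.EXECUTING),                   # assumption: only if plan-mode is off
    (S.PLAN_PROPOSED, S.EXECUTING),
    (S.EXECUTING, S.BLOCKED_ON_PERMISSION),
    (S.EXECUTING, S.BLOCKED_ON_SANDBOX),
    (S.BLOCKED_ON_PERMISSION, S.EXECUTING),  # assumption
    (S.BLOCKED_ON_SANDBOX, S.EXECUTING),     # assumption
    (S.EXECUTING, S.IDLE),
    (S.SHUTTING_DOWN, S.TERMINATED),
}


class ProtocolError(RuntimeError):
    pass


class Worker:
    def __init__(self, plan_mode_required: bool):
        self.plan_mode_required = plan_mode_required
        self.state = S.BOOTING

    def transition(self, target: WorkerState) -> None:
        # Shutdown is reachable from any state; repeats are a no-op.
        if target is S.SHUTTING_DOWN:
            if self.state not in (S.SHUTTING_DOWN, S.TERMINATED):
                self.state = target
            return
        if (self.state, target) not in ALLOWED:
            raise ProtocolError(f"{self.state.value} -> {target.value}")
        # Invariant: no idle -> executing shortcut when plan-mode is required.
        if self.plan_mode_required and (self.state, target) == (S.IDLE, S.EXECUTING):
            raise ProtocolError("plan approval required before executing")
        self.state = target

    def run_action(self, action):
        # Invariant: task actions only run in the executing state.
        if self.state is not S.EXECUTING:
            raise ProtocolError(f"cannot act while {self.state.value}")
        return action()
```

Keeping the table as data means a test can enumerate every (state, target) pair and assert that anything outside the table is rejected.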
Leader State Machine
The leader is not just a “router”. It is the policy authority.
States (per worker):
- starting
- ready
- awaiting_plan
- awaiting_permission_decision
- awaiting_sandbox_decision
- awaiting_shutdown_ack
- stopped
Transitions:
- starting -> ready when the leader can deliver messages and receive a heartbeat or first contact
- ready -> awaiting_plan when leader assigns a task that requires a plan
- awaiting_plan -> ready after approval and the worker is unblocked
- ready -> awaiting_permission_decision when a permission request arrives
- ready -> awaiting_sandbox_decision when a sandbox request arrives
- any -> awaiting_shutdown_ack when leader initiates shutdown
- awaiting_shutdown_ack -> stopped after ack and cleanup
Leader invariants:
- For a given worker, only the leader can change the worker’s effective policy.
- For a given requestId, the leader emits at most one terminal decision (approve or deny).
- The leader can safely retry sends; workers must be idempotent.
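The "at most one terminal decision" invariant can be enforced with a small write-once log on the leader. This is a sketch under assumptions: `DecisionLog` and its method name are hypothetical, and persistence across leader restarts is left out.

```python
from typing import Dict


class DecisionLog:
    """Write-once record of terminal decisions, keyed by requestId."""

    def __init__(self) -> None:
        self._decisions: Dict[str, str] = {}

    def decide(self, request_id: str, decision: str) -> str:
        if decision not in ("approve", "deny"):
            raise ValueError(f"not a terminal decision: {decision!r}")
        # The first decision wins. Later calls return the recorded answer,
        # so a retried send always repeats the same decision.
        return self._decisions.setdefault(request_id, decision)
```

Because `decide` returns whatever was recorded first, the leader's retry path can call it freely and re-send the result without risking an approve followed by a deny for the same request.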
Idempotency and Dedup: The Non-Optional Part
If your protocol doesn’t specify what happens on duplicates, you don’t have a protocol.
Implement:
- a per-worker seenRequestIds set (bounded by time or count)
- a per-worker pending map for outstanding requests by requestId
- “last-write wins” rules for policy updates, keyed by timestamp
Processing rule:
- Parse typed payload.
- If it has a requestId and it was already handled, ignore it.
- Otherwise, apply the state transition and record it as handled.
This makes crash recovery and “poller restarts” safe.
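The processing rule can be sketched in a few lines. The names below (`SeenRequestIds`, `process`, `apply_policy_update`) are illustrative, the set is bounded by count rather than time, and timestamps are assumed to be directly comparable values such as epoch milliseconds.

```python
from collections import OrderedDict


class SeenRequestIds:
    """Count-bounded set of handled request ids; oldest entries are evicted."""

    def __init__(self, max_size: int = 1024) -> None:
        self._ids: "OrderedDict[str, bool]" = OrderedDict()
        self._max = max_size

    def __contains__(self, request_id: str) -> bool:
        return request_id in self._ids

    def add(self, request_id: str) -> None:
        self._ids[request_id] = True
        self._ids.move_to_end(request_id)
        while len(self._ids) > self._max:
            self._ids.popitem(last=False)  # drop the oldest id


def process(payload: dict, seen: SeenRequestIds, apply_transition) -> bool:
    """Apply a payload at most once. Returns True if it was applied."""
    request_id = payload.get("requestId")
    if request_id is not None and request_id in seen:
        return False  # duplicate or replay: ignore
    apply_transition(payload)
    if request_id is not None:
        seen.add(request_id)
    return True


def apply_policy_update(current: dict, update: dict) -> dict:
    """Last-write wins: keep whichever policy has the newer timestamp."""
    if update["timestamp"] > current["timestamp"]:
        return update
    return current
```

A bound that is too small reopens the door to old duplicates, so the size (or time window) should comfortably exceed how far back the transport can replay.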
The Four Flows You Must Get Right
1. Plan approval
Goal: prevent the worker from taking irreversible action before a human (or leader policy) reviews intent.
Rules:
- plan approval gates entry into executing
- approval response is tied to a requestId
- approval may carry policy (example: permission mode) that must be applied before delivery
2. Permission sync
Goal: make tool execution safe under concurrency.
Rules:
- permission requests are events, not “interruptions”
- the worker blocks tool execution while in blocked_on_permission
- the leader’s decision is terminal for the request id
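On the worker, these rules reduce to a pending map plus a guard on tool execution. The sketch below uses hypothetical names and assumes that after a denial the worker resumes executing and simply skips the denied tool call.

```python
from typing import Dict


class PermissionGate:
    def __init__(self) -> None:
        self.state = "executing"
        self.pending: Dict[str, str] = {}  # requestId -> tool name

    def request(self, request_id: str, tool: str) -> None:
        self.pending[request_id] = tool
        self.state = "blocked_on_permission"

    def can_run_tool(self) -> bool:
        return self.state == "executing"

    def on_decision(self, request_id: str, approved: bool) -> str:
        """Returns "approved", "denied", or "ignored"."""
        # Only a decision matching an outstanding request is applied.
        if self.pending.pop(request_id, None) is None:
            return "ignored"
        # Unblock once nothing else is outstanding.
        if not self.pending:
            self.state = "executing"
        return "approved" if approved else "denied"
```

Because the request is removed from `pending` when the decision is applied, a second decision for the same id (a retry or a replay) is ignored, which keeps the first decision terminal.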
3. Sandbox approvals (network)
Goal: make network access explicit, auditable, and reversible.
Rules:
- sandbox decisions are treated like permissions, but scoped to capability (network) not tool name
- approvals must be applied before the worker resumes
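A sandbox decision can be modelled the same way, with the grant scoped to a network capability instead of a tool. In this sketch the per-host granularity, the audit log shape, and the `revoke` method are assumptions chosen to illustrate "explicit, auditable, and reversible".

```python
from typing import Dict, List, Set, Tuple


class SandboxGate:
    def __init__(self) -> None:
        self.state = "executing"
        self.pending: Dict[str, str] = {}  # requestId -> host
        self.allowed_hosts: Set[str] = set()
        self.audit_log: List[Tuple[str, str, str]] = []

    def request_network(self, request_id: str, host: str) -> None:
        self.pending[request_id] = host
        self.state = "blocked_on_sandbox"

    def on_decision(self, request_id: str, approved: bool) -> bool:
        host = self.pending.pop(request_id, None)
        if host is None:
            return False  # no outstanding request: ignore
        self.audit_log.append((request_id, host, "approve" if approved else "deny"))
        # Apply the grant first, then let the worker resume.
        if approved:
            self.allowed_hosts.add(host)
        if not self.pending:
            self.state = "executing"
        return True

    def revoke(self, host: str) -> None:
        self.allowed_hosts.discard(host)
        self.audit_log.append(("-", host, "revoke"))

    def may_connect(self, host: str) -> bool:
        return self.state == "executing" and host in self.allowed_hosts
```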
4. Shutdown
Goal: make shutdown graceful and correct, even while busy.
Rules:
- shutdown request is always handled (no “I missed it”)
- the worker acknowledges exactly once per request id
- the leader considers the worker live until ack is observed
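Both halves of the shutdown handshake fit in a short sketch. Class names and the `shutdown_ack` message type are hypothetical, and cleanup before termination is omitted.

```python
from typing import Callable, Optional, Set


class ShutdownWorker:
    def __init__(self, send: Callable[[dict], None]) -> None:
        self.send = send
        self.state = "executing"
        self.acked: Set[str] = set()

    def on_shutdown_request(self, request_id: str) -> None:
        # Always handled, whatever the current state; acked once per id.
        if request_id in self.acked:
            return
        self.state = "shutting_down"
        self.acked.add(request_id)
        self.send({"type": "shutdown_ack", "requestId": request_id})


class LeaderView:
    """The leader's view of one worker during shutdown."""

    def __init__(self) -> None:
        self.state = "ready"
        self.shutdown_id: Optional[str] = None

    def initiate_shutdown(self, request_id: str) -> None:
        self.state = "awaiting_shutdown_ack"
        self.shutdown_id = request_id

    def on_ack(self, request_id: str) -> None:
        # Only the ack for the outstanding request stops the worker.
        if self.state == "awaiting_shutdown_ack" and request_id == self.shutdown_id:
            self.state = "stopped"

    def is_live(self) -> bool:
        return self.state != "stopped"
```

The leader keeps treating the worker as live until the correlated ack arrives, so a lost or late ack leaves it in awaiting_shutdown_ack instead of silently assuming the worker is gone.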
Implementation Skeleton: Mailbox-Driven Event Loop
This is the control-plane loop that makes everything work:
- poll mailbox for unread envelopes
- parse typed payloads
- apply side effects (policy updates, unblock decisions) before delivery
- deliver remaining messages as chat output
- periodically emit health and progress events
If you implement this loop and the two state machines above, you can ship a swarm that survives real-world failure modes.
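A single pass of that loop might look like the following sketch. The mailbox is modelled as an in-memory list of dicts, `handlers` maps a payload type to a callable, and typed payloads with no registered handler are delivered as chat. All three are assumptions for the example, and health and progress events are left out.

```python
import json
from typing import Callable, Dict, List, Optional


def try_parse(text: str) -> Optional[dict]:
    """Return the typed payload in text, or None for plain chat."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and isinstance(data.get("type"), str):
        return data
    return None


def poll_once(
    mailbox: List[dict],
    handlers: Dict[str, Callable[[dict], None]],
    deliver_chat: Callable[[str], None],
) -> None:
    unread = [env for env in mailbox if not env["read"]]
    control, chat = [], []
    for env in unread:
        payload = try_parse(env["text"])
        if payload is not None and payload["type"] in handlers:
            control.append(payload)
        else:
            chat.append(env)
    # Side effects (policy updates, unblock decisions) come first...
    for payload in control:
        handlers[payload["type"]](payload)
    # ...then the remaining messages are delivered as chat output.
    for env in chat:
        deliver_chat(env["text"])
    # Marked read last: a crash before this point replays the batch,
    # which is why handlers need to be idempotent.
    for env in unread:
        env["read"] = True
```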
Test Checklist (Make It Repro-Grade)
Write tests that assert invariants under adversarial delivery:
- duplicate plan approvals
- a permission decision arrives out of order, before the request is recorded
- poller restarts mid-flight (replay unread or re-parse text)
- shutdown request arrives during a tool call
- worker crashes after receiving approval but before applying policy
If these pass, the rest is details.
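As one example of what such a test can look like, the sketch below replays a small message set in every possible order and checks the invariant after each run. `TinyWorker` is a stand-in written for this example, not a real worker, and the message type name is hypothetical.

```python
import itertools
from typing import List


class TinyWorker:
    """Minimal stand-in that only models plan approval."""

    def __init__(self) -> None:
        self.state = "plan_proposed"
        self.pending = {"plan-1"}
        self.entered_executing = 0

    def handle(self, msg: dict) -> None:
        if msg["type"] == "plan_approval" and msg["requestId"] in self.pending:
            self.pending.discard(msg["requestId"])
            self.state = "executing"
            self.entered_executing += 1
        # Anything else (duplicates, unknown ids) is ignored.


def check_all_orders(messages: List[dict]) -> int:
    """Replay every delivery order; return how many orders were checked."""
    checked = 0
    for order in itertools.permutations(messages):
        worker = TinyWorker()
        for msg in order:
            worker.handle(msg)
        # Invariant: exactly one entry into executing, whatever the order.
        assert worker.entered_executing == 1
        assert worker.state == "executing"
        checked += 1
    return checked


approval = {"type": "plan_approval", "requestId": "plan-1"}
stale = {"type": "plan_approval", "requestId": "plan-9"}
orders_checked = check_all_orders([approval, dict(approval), stale])
```

Exhaustive permutation only scales to a handful of messages; for larger sets, a seeded random shuffle gives repeatable coverage.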
Related
- /swarm-mailbox-protocol/
- /permission-control-plane/
- /tool-registry-and-execution/