Claude Code Swarm (Part 2): Leader/Worker State Machines

A practical blueprint for multi-agent swarms: state transitions, idempotency, retries, and shutdown that doesn’t break under real delivery.

multi-agent, coordination, state-machine, protocol, coding-agents, claude-code

Most multi-agent systems fail for one reason: they treat coordination like chat.

Chat is not reliable under retries. Chat has no replay semantics. Chat is not idempotent.

If you want a swarm you can ship, you need explicit state machines on both sides: one for the leader and one for each worker.

This post lays out the minimum state machines and rules you need to make swarms behave predictably.

Part 2 of 3. Prev: /swarm-mailbox-protocol/. Next: /swarm-mailbox-type-catalog/.

What You’ll Learn


The One Rule That Prevents Chaos

Treat every coordination message as an event in a state machine, not as text to be “interpreted”.

That means:

If you don’t do this, your swarm will work until it doesn’t, and then you won’t be able to debug it.


The Transport Assumption

Assume you have a “mailbox” transport that can deliver:

Assume the transport can also be replayed after a crash.

These are not pessimistic assumptions. They are what happens when you build on files, background loops, user interaction, and long-running sessions.

So the protocol has to be correct even when delivery isn’t.


Message Model

Use an envelope plus a typed payload.

Envelope (transport-level, generic):

Payload (protocol-level, typed):

The key design choice: typed payloads drive side effects; plain text never does.
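As a rough illustration of that split, here is a minimal Python sketch. The envelope fields (`sender`, `recipient`, `body`) and the JSON encoding are assumptions made for the example, not the actual wire format; only the `requestId` key is taken from this post. The point it shows is that anything that fails to parse as a typed payload is treated as plain text and can never trigger a side effect.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    # Transport-level fields. All three names are hypothetical.
    sender: str
    recipient: str
    body: str  # raw text; may or may not encode a typed payload


@dataclass(frozen=True)
class TypedPayload:
    type: str                  # e.g. "plan_approval" (hypothetical type name)
    request_id: Optional[str]  # used later for dedup
    data: dict                 # the full decoded object


def parse_payload(envelope: Envelope) -> Optional[TypedPayload]:
    """Return a typed payload, or None when the body is plain text."""
    try:
        raw = json.loads(envelope.body)
    except json.JSONDecodeError:
        return None  # not JSON at all: plain chat text
    if not isinstance(raw, dict) or not isinstance(raw.get("type"), str):
        return None  # JSON, but not shaped like a typed payload
    request_id = raw.get("requestId")
    if request_id is not None and not isinstance(request_id, str):
        return None  # malformed id: refuse to act on it
    return TypedPayload(type=raw["type"], request_id=request_id, data=raw)
```

A caller would branch on the result: `None` goes to chat output, anything else goes to the state machine.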


Worker State Machine

Here is a minimal worker lifecycle that stays correct under concurrency.

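States (those named in the transitions below): booting, idle, plan_proposed, executing, blocked_on_permission, blocked_on_sandbox, shutting_down, terminated.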

Transitions (high level):

  1. booting -> idle
  2. idle -> plan_proposed when plan-mode is required and a task starts
  3. plan_proposed -> executing only after plan approval
  4. executing -> blocked_on_permission when a tool needs approval
  5. executing -> blocked_on_sandbox when network access needs approval
  6. executing -> idle when work completes (or when task is cancelled)
  7. any state -> shutting_down on shutdown request
  8. shutting_down -> terminated after acknowledgment and cleanup
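The transitions above can be sketched as a lookup table. This is one possible encoding, not a reference implementation: the event names are invented for the example, and the two "resume" edges out of the blocked states are an assumption, since the list above does not say how a blocked worker returns to executing. Treating terminated as final is also an assumption.

```python
from enum import Enum


class WorkerState(Enum):
    BOOTING = "booting"
    IDLE = "idle"
    PLAN_PROPOSED = "plan_proposed"
    EXECUTING = "executing"
    BLOCKED_ON_PERMISSION = "blocked_on_permission"
    BLOCKED_ON_SANDBOX = "blocked_on_sandbox"
    SHUTTING_DOWN = "shutting_down"
    TERMINATED = "terminated"


S = WorkerState

# (current state, event) -> next state. Event names are hypothetical.
WORKER_TRANSITIONS = {
    (S.BOOTING, "booted"): S.IDLE,
    (S.IDLE, "task_started_plan_required"): S.PLAN_PROPOSED,
    (S.PLAN_PROPOSED, "plan_approved"): S.EXECUTING,
    (S.EXECUTING, "permission_needed"): S.BLOCKED_ON_PERMISSION,
    (S.EXECUTING, "sandbox_needed"): S.BLOCKED_ON_SANDBOX,
    (S.EXECUTING, "work_completed"): S.IDLE,
    (S.EXECUTING, "task_cancelled"): S.IDLE,
    (S.SHUTTING_DOWN, "shutdown_acked_and_cleaned"): S.TERMINATED,
    # Assumed resume edges, not part of the list above.
    (S.BLOCKED_ON_PERMISSION, "permission_decided"): S.EXECUTING,
    (S.BLOCKED_ON_SANDBOX, "sandbox_decided"): S.EXECUTING,
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def next_worker_state(state: WorkerState, event: str) -> WorkerState:
    if event == "shutdown_requested":
        # Transition 7: any state -> shutting_down.
        # Assumption: terminated is final and stays terminated.
        return S.TERMINATED if state is S.TERMINATED else S.SHUTTING_DOWN
    try:
        return WORKER_TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(
            f"{event!r} is not valid in state {state.value}"
        ) from None
```

Keeping the table as data makes the invariants easy to test: every pair that is not in the table is rejected, so a worker cannot reach executing from plan_proposed without an approval event.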

Worker invariants (the ones worth testing):


Leader State Machine

The leader is not just a “router”. It is the policy authority.

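States (per worker, as named in the transitions below): starting, ready, awaiting_plan, awaiting_permission_decision, awaiting_sandbox_decision, awaiting_shutdown_ack, stopped.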

Transitions:

  1. starting -> ready when the leader can deliver messages and receive a heartbeat or first contact
  2. ready -> awaiting_plan when leader assigns a task that requires a plan
  3. awaiting_plan -> ready after approval and the worker is unblocked
  4. ready -> awaiting_permission_decision when a permission request arrives
  5. ready -> awaiting_sandbox_decision when a sandbox request arrives
  6. any -> awaiting_shutdown_ack when leader initiates shutdown
  7. awaiting_shutdown_ack -> stopped after ack and cleanup
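Because the leader tracks one state per worker, a sketch of its side looks slightly different: a small class holding a map from worker id to state. Event names and the `Leader` class are hypothetical, the two edges that return from a decision state to ready are assumptions (the list above does not spell them out), and silently ignoring events that do not fit the current state is one possible policy for stale or reordered messages, not the only one.

```python
from enum import Enum


class LeaderState(Enum):
    STARTING = "starting"
    READY = "ready"
    AWAITING_PLAN = "awaiting_plan"
    AWAITING_PERMISSION_DECISION = "awaiting_permission_decision"
    AWAITING_SANDBOX_DECISION = "awaiting_sandbox_decision"
    AWAITING_SHUTDOWN_ACK = "awaiting_shutdown_ack"
    STOPPED = "stopped"


LS = LeaderState

# (current state, event) -> next state. Event names are hypothetical.
LEADER_TRANSITIONS = {
    (LS.STARTING, "first_contact"): LS.READY,
    (LS.READY, "task_assigned_plan_required"): LS.AWAITING_PLAN,
    (LS.AWAITING_PLAN, "plan_approved"): LS.READY,
    (LS.READY, "permission_request"): LS.AWAITING_PERMISSION_DECISION,
    (LS.READY, "sandbox_request"): LS.AWAITING_SANDBOX_DECISION,
    (LS.AWAITING_SHUTDOWN_ACK, "shutdown_ack"): LS.STOPPED,
    # Assumed return edges, not part of the list above.
    (LS.AWAITING_PERMISSION_DECISION, "permission_decided"): LS.READY,
    (LS.AWAITING_SANDBOX_DECISION, "sandbox_decided"): LS.READY,
}


class Leader:
    """Tracks one state per worker id."""

    def __init__(self) -> None:
        self.workers: dict[str, LeaderState] = {}

    def register(self, worker_id: str) -> None:
        self.workers[worker_id] = LS.STARTING

    def on_event(self, worker_id: str, event: str) -> LeaderState:
        state = self.workers[worker_id]
        if event == "shutdown_initiated" and state is not LS.STOPPED:
            # Transition 6: any -> awaiting_shutdown_ack.
            new_state = LS.AWAITING_SHUTDOWN_ACK
        else:
            # Events that do not fit the current state leave it unchanged.
            new_state = LEADER_TRANSITIONS.get((state, event), state)
        self.workers[worker_id] = new_state
        return new_state
```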

Leader invariants:


Idempotency and Dedup: The Non-Optional Part

If your protocol doesn’t specify what happens on duplicates, you don’t have a protocol.

Implement:

Processing rule:

  1. Parse typed payload.
  2. If it has a requestId and it was already handled, ignore it.
  3. Otherwise, apply the state transition and record it as handled.

This makes crash recovery and “poller restarts” safe.
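A minimal sketch of that processing rule, assuming payloads are already parsed into dicts and carry an optional `requestId` key. The `DedupingHandler` name and the callback shape are hypothetical.

```python
from typing import Callable


class DedupingHandler:
    """Applies each requestId at most once; payloads without one always apply."""

    def __init__(self, apply_transition: Callable[[dict], None]) -> None:
        self._apply = apply_transition
        self._handled: set[str] = set()

    def process(self, payload: dict) -> bool:
        """Return True if the payload was applied, False if it was a duplicate."""
        request_id = payload.get("requestId")
        if request_id is not None and request_id in self._handled:
            return False  # step 2: already handled, ignore
        self._apply(payload)  # step 3: apply the state transition
        if request_id is not None:
            self._handled.add(request_id)  # ...and record it as handled
        return True
```

The handled set here lives in memory only. For the crash-recovery case it would need to be persisted alongside the state it protects; how to do that is outside this sketch.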


The Four Flows You Must Get Right

1. Plan approval

Goal: prevent the worker from taking irreversible action before a human (or leader policy) reviews intent.

Rules:

2. Permission sync

Goal: make tool execution safe under concurrency.

Rules:

3. Sandbox approvals (network)

Goal: make network access explicit, auditable, and reversible.

Rules:

4. Shutdown

Goal: make shutdown graceful and correct, even while busy.

Rules:


Implementation Skeleton: Mailbox-Driven Event Loop

This is the control-plane loop that makes everything work:

  1. poll mailbox for unread envelopes
  2. parse typed payloads
  3. apply side effects (policy updates, unblock decisions) before delivery
  4. deliver remaining messages as chat output
  5. periodically emit health and progress events

If you implement this loop and the two state machines above, you can ship a swarm that survives real-world failure modes.
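One pass of that loop might look like the sketch below. It assumes message bodies are strings, that typed payloads are JSON objects with a `type` key, and that the caller supplies the `apply_side_effect` and `deliver_chat` callbacks; all of these names are hypothetical. Step 5 (health and progress events) is left out to keep the example short.

```python
import json
from typing import Callable, Iterable, Optional


def parse_typed(body: str) -> Optional[dict]:
    """Return the decoded payload if the body is a typed message, else None."""
    try:
        raw = json.loads(body)
    except json.JSONDecodeError:
        return None
    if isinstance(raw, dict) and isinstance(raw.get("type"), str):
        return raw
    return None


def poll_once(
    unread: Iterable[str],
    handled_ids: set,
    apply_side_effect: Callable[[dict], None],
    deliver_chat: Callable[[str], None],
) -> None:
    remaining = []
    # Steps 1-3: parse everything and apply side effects first.
    for body in unread:
        payload = parse_typed(body)
        if payload is None:
            remaining.append(body)  # plain text: hold for step 4
            continue
        request_id = payload.get("requestId")
        if request_id is not None and request_id in handled_ids:
            continue  # duplicate delivery: ignore
        apply_side_effect(payload)
        if request_id is not None:
            handled_ids.add(request_id)
    # Step 4: only now deliver what is left as chat output.
    for body in remaining:
        deliver_chat(body)
```

The two-phase shape is the point: a policy update that arrives in the same batch as a chat message takes effect before that message is shown, regardless of their order in the mailbox.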


Test Checklist (Make It Repro-Grade)

Write tests that assert invariants under adversarial delivery:

If these pass, the rest is details.
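As one example of what an adversarial-delivery test can look like, the sketch below duplicates and shuffles a batch of messages with a seeded random generator, then asserts that each requestId is applied exactly once. The helper names and the message type are made up for the example, and the inline dedup loop stands in for whatever handler you actually ship.

```python
import random


def deliver_adversarially(messages: list, seed: int) -> list:
    """Return the messages with random duplicates added, in shuffled order."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    duplicates = [rng.choice(messages) for _ in range(len(messages))]
    delivered = list(messages) + duplicates
    rng.shuffle(delivered)
    return delivered


def apply_with_dedup(delivered: list) -> list:
    """Stand-in handler: returns the requestIds that were actually applied."""
    handled, applied = set(), []
    for message in delivered:
        request_id = message["requestId"]
        if request_id in handled:
            continue
        handled.add(request_id)
        applied.append(request_id)
    return applied


def test_each_request_applied_exactly_once() -> None:
    messages = [
        {"type": "permission_response", "requestId": f"r{i}"} for i in range(5)
    ]
    expected = sorted(m["requestId"] for m in messages)
    for seed in range(100):
        applied = apply_with_dedup(deliver_adversarially(messages, seed))
        assert sorted(applied) == expected, f"failed with seed {seed}"
```

Printing the seed on failure is what makes the test repro-grade: the exact delivery order that broke the invariant can be replayed.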
