Control plane¶
The control plane is what flows between the shell host and each agent host. There are three channels and three TCP ports.
Summary¶
| Direction | Port | Transport | Subsystem | Purpose |
|---|---|---|---|---|
| Shell → Agent | tcp/222 | SSH (ed25519) | convocate-agent-rpc | JSON-RPC: CRUD ops on sessions |
| Shell → Agent | tcp/222 | SSH (ed25519) | convocate-agent-attach | Raw byte relay to tmux PTY inside session container |
| Agent → Shell | tcp/223 | SSH (ed25519) | convocate-status | Newline-delimited JSON event stream |
| Agent → Shell | tcp/514 | rsyslog over TLS (ECDSA P-256) | n/a | Container log forwarding |
Each SSH direction uses a persistent connection: the shell holds one
SSH session open to each agent on :222, and each agent holds one SSH
session open back to the shell on :223. On disconnect, each side
reconnects with exponential backoff (capped at 30s by default).
Why two ports for SSH?¶
The agent's role is "let the shell drive me," so it listens on :222.
The shell's role is "let agents push me events," so it listens on a
separate port :223 to keep the listeners distinct on a host that
runs both roles (single-box deployments are common).
A single host running both shell and agent has two SSH listeners
running — one on :222, one on :223 — plus the host's regular SSH
on :22 used by humans. They don't collide.
Authentication¶
Every channel is mutually authenticated with credentials that
convocate-host init-agent mints at provisioning time (ed25519 keypairs
for SSH, a client TLS certificate for rsyslog):
shell-side keypair     ──── used for shell → agent SSH (tcp/222)
agent-side keypair     ──── used for agent → shell SSH (tcp/223)
agent client TLS cert  ──── used for agent → shell rsyslog (tcp/514),
                            signed by the shell-side rsyslog CA
There is no shared key across the cluster. If one agent is compromised, its keys don't open up any other agent.
The shell side stores agent metadata under
/etc/convocate/agent-keys/<agent-id>/. Each subdirectory holds:
- agent-host: the agent's SSH host (a string)
- agent_to_shell_ed25519_key.pub: the agent's public key, used to authenticate the agent when it dials our tcp/223 listener
- shell_to_agent_ed25519_key (private) and .pub: the shell's side of the shell→agent SSH channel
- host_key.pub: the agent's SSH host key, pinned
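As a rough illustration of that layout, the sketch below checks a single agent subdirectory for the files named above. The helper `missing_key_files` is hypothetical and not part of convocate. Only the file names come from this page.

```python
from pathlib import Path

# File names are taken from the list above; the helper itself is hypothetical.
EXPECTED_FILES = (
    "agent-host",
    "agent_to_shell_ed25519_key.pub",
    "shell_to_agent_ed25519_key",
    "shell_to_agent_ed25519_key.pub",
    "host_key.pub",
)


def missing_key_files(agent_dir):
    """Return the expected file names absent from agent_dir, in list order."""
    base = Path(agent_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]
```

Calling `missing_key_files("/etc/convocate/agent-keys/<agent-id>")` with a real agent ID substituted should return an empty list for a fully provisioned agent, assuming the layout described here.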
SSH server invariants¶
The agent's SSH listener (internal/agentserver) is locked down
hard:
- Only ed25519 keys for both client and host
- Only the two named subsystems (convocate-agent-rpc, convocate-agent-attach) are accepted
- No shell, no exec, no env, no port forwarding, no X11: every other channel type is refused with an SSH protocol-level rejection
- No password auth, ever — pubkey only
The same posture applies to the shell's tcp/223 listener
(internal/shellserver): only the convocate-status subsystem is
accepted, no other channel types.
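The allow-list posture can be pictured as a small policy function. This is a simplified sketch and not the code in internal/agentserver or internal/shellserver. The listener labels "agent" and "shell" are assumptions made for illustration, and the request-type strings follow the SSH connection protocol's names ("subsystem", "exec", "shell", and so on).

```python
# Listener labels are illustrative assumptions; the subsystem names come
# from the text above.
ALLOWED_SUBSYSTEMS = {
    "agent": {"convocate-agent-rpc", "convocate-agent-attach"},  # tcp/222
    "shell": {"convocate-status"},                               # tcp/223
}


def is_request_allowed(listener, request_type, subsystem=None):
    """Accept only 'subsystem' requests naming a subsystem allowed on this listener.

    This sketch covers requests that start something on a session. The
    window-change events forwarded on attach channels are out of scope here.
    """
    if request_type != "subsystem":
        return False  # shell, exec, env, x11-req, ... are all refused
    return subsystem in ALLOWED_SUBSYSTEMS.get(listener, set())
```

For example, `is_request_allowed("agent", "exec")` is false, while `is_request_allowed("shell", "subsystem", "convocate-status")` is true.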
Subsystem framing¶
convocate-agent-rpc (JSON-RPC)¶
Each request is a single JSON object on one line, and each response is a single JSON object on one line. The subsystem channel is closed after one request/response pair (the SSH connection itself is persistent, but each RPC opens a fresh subsystem channel).
// request
{"op":"list","params":null}
// response (success)
{"ok":true,"result":[{ ...session metadata... }]}
// response (error)
{"ok":false,"error":"agent op \"create\": port 53/udp already in use"}
Op names are listed in RPC ops.
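A minimal sketch of this line framing, assuming only the request and response shapes shown above. The helper names `encode_request`, `decode_response` and `RpcError` are hypothetical, and the SSH transport is left out entirely.

```python
import json


class RpcError(Exception):
    """Raised when a response carries ok=false."""


def encode_request(op, params=None):
    """Serialize one request as a single JSON line, newline included."""
    return json.dumps({"op": op, "params": params}, separators=(",", ":")) + "\n"


def decode_response(line):
    """Parse one response line: return the result, or raise RpcError."""
    msg = json.loads(line)
    if msg.get("ok"):
        return msg.get("result")
    raise RpcError(msg.get("error", "unknown error"))
```

A caller would write the output of `encode_request("list")` to a freshly opened subsystem channel, read one line back, and pass it to `decode_response`.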
convocate-agent-attach (raw PTY)¶
After the subsystem is requested, the client writes a single JSON header line:
…then the channel becomes a raw byte pipe between the SSH session and
docker exec -it <container> sudo -u claude -- tmux attach-session -t claude.
Window-change events on the SSH channel are forwarded to the PTY.
convocate-status (newline-JSON event stream)¶
Agent → Shell. After the subsystem is opened, the agent writes one JSON event per line. No reply expected; the shell's listener drains the stream and dispatches each event to its handler.
Event types are listed in Status events.
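The drain-and-dispatch loop on the shell side might look roughly like the sketch below. The discriminator key `"type"` and the function `drain_events` are assumptions made for illustration. The real event shapes are the ones documented in Status events.

```python
import json


def drain_events(stream, handlers, type_field="type"):
    """Read one JSON event per line and dispatch it by event type.

    `type_field` is an assumed discriminator key, not a documented one.
    Returns the number of events that had a registered handler.
    """
    handled = 0
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        handler = handlers.get(event.get(type_field))
        if handler is not None:
            handler(event)
            handled += 1
    return handled
```

Here `stream` is any iterable of text lines, such as a file-like wrapper around the subsystem channel, and `handlers` maps event type names to callables. No reply is written back, matching the one-way stream described above.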
Reconnect behavior¶
Both directions implement the same pattern:
- Dial. If success, run.
- On any error (read EOF, SSH disconnect, write fail) → close, sleep backoff, redial.
- backoff doubles on each failure (capped at the configured max, default 30s) and resets to the initial value (default 1s) on any successful connection.
This means a transient network blip shows up as a brief gap in events + failed CRUD calls, and recovers automatically. A durable break (e.g. wrong key, wrong port, firewall) just keeps logging "redial failed" until you fix the underlying problem.
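The reconnect pattern can be sketched as follows, using the defaults quoted above (1s initial, 30s cap). The names `backoff_delays` and `run_with_reconnect` are hypothetical, as are the `dial` and `run` callables. The `max_attempts` and `sleep` parameters exist only to make the sketch easy to exercise and are not a claim about the real implementation.

```python
import time


def backoff_delays(initial=1.0, maximum=30.0):
    """Yield sleep intervals: initial first, then doubling, capped at maximum."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, maximum)


def run_with_reconnect(dial, run, initial=1.0, maximum=30.0,
                       sleep=time.sleep, max_attempts=None):
    """Dial, run until an error, then sleep and redial.

    The backoff sequence restarts from `initial` after every successful dial.
    """
    delays = backoff_delays(initial, maximum)
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            conn = dial()
            delays = backoff_delays(initial, maximum)  # success: reset backoff
            run(conn)  # blocks until the connection ends or fails
        except (OSError, EOFError):
            pass  # a real loop would log "redial failed" or similar here
        sleep(next(delays))
```

With a `dial` that always fails, the sleeps follow 1, 2, 4, 8, 16, 30, 30, ... seconds, which matches the durable-break case: the loop keeps retrying at the cap until the underlying problem is fixed.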