Orchestrator CLAUDE.md now instructs Claire, per turn, to call resolve_host
with the signals it extracts (explicit_host / capability_needs /
session_uuid|task_id); when the decision is NOT this node, surface that the
work belongs on that host's Claire and hand it off. Decision layer of
location-transparent Claire (13764f2f) is now live in orchestrator behavior;
cross-host execution/proxy remains the follow-up. resolve_host added to the
Plan tools list.
(manual commit via ALLOW_COMMIT — autocommit LLM still down on claire)
deploy-agent used bare `ssh <host>` / `rsync <host>:` (→ <host>.lan), which
fails off-LAN or when the direct plum→host WG relay drops — blocking deploys
even though the host is reachable via the black jump host. Now it probes
<host> → <host>-wg → <host>-j and uses the first that answers for the ssh/rsync
legs (remote-run keeps its own routing), keeping the claire host LABEL as
<host>. Override with CLAIRE_SSH_ALIAS. Verified: apricot deployed via apricot-j
while .lan + -wg were timing out.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
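The probe order described above (<host>, then <host>-wg, then <host>-j, first responder wins) can be sketched in Python as follows. This is an illustration only, not the code in deploy-agent.sh: `candidate_aliases`, `pick_ssh_alias`, and the injected `probe` callable are hypothetical names, and treating the override as "skip probing entirely" is an assumption about how CLAIRE_SSH_ALIAS behaves.

```python
from typing import Callable, Optional


def candidate_aliases(host: str) -> list[str]:
    """Aliases to try, in the order the commit describes."""
    return [host, f"{host}-wg", f"{host}-j"]


def pick_ssh_alias(
    host: str,
    probe: Callable[[str], bool],
    override: Optional[str] = None,
) -> Optional[str]:
    """Return the first alias that answers, or None if none do.

    `probe` is a hypothetical reachability check (for example a short
    ssh connection attempt). `override` stands in for CLAIRE_SSH_ALIAS;
    bypassing the probes when it is set is an assumption.
    """
    if override:
        return override
    for alias in candidate_aliases(host):
        if probe(alias):
            return alias
    return None
```

The host label stays `host` regardless of which alias is returned; only the ssh/rsync target changes.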
Exposes routing.route() to the orchestrator as `resolve_host(explicit_host,
capability_needs, session_uuid, task_id)`. Claire (LLM) extracts the signals
from a turn and calls it; gets back {host, reason, detail, candidates} via the
deterministic cascade (explicit>capability>sticky>default-local), with
live-session counts feeding the capability tiebreak. The decision layer of
location-transparent Claire is now callable from the orchestrator.
Part of task 13764f2f. Smoke-verified: explicit→named, media→black (seeded
capability), no-signal→local. 371 tests green.
(manual commit via ALLOW_COMMIT — autocommit LLM still down on claire)
route(signals, fleet) -> RouteDecision via a deterministic cascade:
explicit host > capability-pin (uses hosts_with_capability) > sticky
(subject's session/task already runs on a host, via sessions+assignments)
> default-local. Pure + auditable (reason+candidates surfaced); the LLM
classify step and cross-host execution are separate layers. 13 tests.
Part of task 13764f2f.
(manual commit via ALLOW_COMMIT — autocommit LLM still down on claire)
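A minimal sketch of the cascade above (explicit > capability > sticky > default-local), under stated assumptions. The `Signals` and `Fleet` shapes are hypothetical, the real RouteDecision also carries a `detail` field, and the capability tiebreak here is alphabetical rather than the live-session count the orchestrator wiring uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signals:
    explicit_host: Optional[str] = None
    capability_needs: tuple[str, ...] = ()
    session_uuid: Optional[str] = None
    task_id: Optional[str] = None


@dataclass
class Fleet:
    local: str
    capabilities: dict[str, frozenset[str]]  # host -> capability tags
    placements: dict[str, str]  # session uuid or task id -> host


@dataclass(frozen=True)
class RouteDecision:
    host: str
    reason: str
    candidates: tuple[str, ...] = ()


def route(signals: Signals, fleet: Fleet) -> RouteDecision:
    # 1. An explicitly named host always wins.
    if signals.explicit_host:
        return RouteDecision(signals.explicit_host, "explicit", (signals.explicit_host,))
    # 2. Capability pin: hosts that carry every requested tag.
    if signals.capability_needs:
        needs = set(signals.capability_needs)
        capable = tuple(sorted(h for h, tags in fleet.capabilities.items() if needs <= tags))
        if capable:
            # Assumption: alphabetical tiebreak instead of live-session counts.
            return RouteDecision(capable[0], "capability", capable)
    # 3. Sticky: the session or task already runs somewhere.
    for key in (signals.session_uuid, signals.task_id):
        if key and key in fleet.placements:
            host = fleet.placements[key]
            return RouteDecision(host, "sticky", (host,))
    # 4. Otherwise stay on this node.
    return RouteDecision(fleet.local, "default-local", (fleet.local,))
```

Because the function is pure, each rung of the cascade can be tested by constructing a `Signals` value and checking the returned `reason`.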
When a local worker pane dies (crash, OOM, host power-cycle), its JSONL persists
and is resumable. The agent supervisor now detects dead-but-recent local
sessions and `claude --resume <uuid>`s them, then sends a re-orient kick so the
session re-determines its OWN state (done vs pending vs finished) before acting
— mirrors the orchestrator's rehydrate-on-startup.
- rclaude.Rclaude.resume(): spawn `claude --resume <uuid>` via RCLAUDE_RESUME_ID
(verified empirically against a real dead session on apricot).
- supervisor.select_resume_candidates(): pure, guarded selection — recency
window, supersession (skip if a LIVE session shares the cwd), orchestrator-
workspace exclusion, per-session retry cap, per-tick global ceiling (the
first-wake token-storm guard). 7 unit tests.
- AgentConfig.auto_resume off|dry-run|on (default off) + max/per_tick/window.
Ships off; roll out via dry-run, then on — same pattern as auto_continue.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
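The guards listed for select_resume_candidates() can be illustrated with the sketch below. It is not the supervisor's actual code: the `DeadSession` record, the parameter names, and the newest-first ordering are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadSession:
    uuid: str
    cwd: str
    died_at: float  # epoch seconds of last activity
    attempts: int = 0  # resume attempts already made


def select_resume_candidates(
    dead: list[DeadSession],
    live_cwds: set[str],
    now: float,
    *,
    window_s: float,
    max_attempts: int,
    per_tick: int,
    orchestrator_cwd: str,
) -> list[DeadSession]:
    """Pure selection: no I/O, so every guard is unit-testable."""
    picked: list[DeadSession] = []
    # Assumption: prefer the most recently dead sessions.
    for s in sorted(dead, key=lambda s: s.died_at, reverse=True):
        if now - s.died_at > window_s:
            continue  # outside the recency window
        if s.cwd in live_cwds:
            continue  # superseded by a live session in the same cwd
        if s.cwd == orchestrator_cwd:
            continue  # never resume the orchestrator workspace
        if s.attempts >= max_attempts:
            continue  # per-session retry cap reached
        picked.append(s)
        if len(picked) >= per_tick:
            break  # per-tick global ceiling
    return picked
```

The per-tick ceiling is what bounds the number of simultaneous resumes on the first wake after an outage.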
known_hosts gains a `capabilities` tag list (e.g. media, transmission,
cores:64, gpu) + ClaireConfig.hosts_with_capability(tag) (exact or key:
prefix match) and capabilities_for(host) (alias-resolved). Lets routing
(location-transparent Claire, task 13764f2f) and dispatch pick a host by
what it CAN do, not just load. Seeded black={media,transmission}.
Prereq task a5453fb8. 351 tests green.
(manual commit via ALLOW_COMMIT — autocommit LLM still timing out on claire)
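One plausible reading of the "exact or key: prefix" match is sketched below: a query of `cores` matches the tag `cores:64`, while `core` does not. This is an assumption about the matching rule, not the ClaireConfig implementation; the standalone function signature and the tags given to apricot are hypothetical.

```python
def tag_matches(tag: str, query: str) -> bool:
    """True on an exact match, or when `query` is the key of a key:value tag."""
    return tag == query or tag.startswith(query + ":")


def hosts_with_capability(known_hosts: dict[str, list[str]], query: str) -> list[str]:
    """Hosts whose capability list contains a tag matching `query`."""
    return sorted(
        host
        for host, tags in known_hosts.items()
        if any(tag_matches(t, query) for t in tags)
    )


# black's tags come from the commit; apricot's are made up for the example.
known_hosts = {
    "black": ["media", "transmission"],
    "apricot": ["cores:64", "gpu"],
}
```

Requiring the colon after the key keeps partial words such as `core` from matching `cores:64`.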
With the default KillMode=control-group, restarting claire-agent.service (every
deploy) SIGTERMed the whole cgroup, silently killing the live worker claude/tmux
sessions the agent had spawned (next-tour-planner lost on 2026-06-02).
KillMode=process signals only `claire agent run`; panes survive and the fresh
agent re-discovers them via pull. Note: takes effect from the NEXT restart (a
unit's stop phase uses the KillMode it was started under).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the rounds timer to a pure-Python skip gate so claire-serve only wakes
the orchestrator model when worker fleet state changed (not every tick):
- web/rounds.py: fleet_fingerprint() over worker sessions (minus the
orchestrator's own) + open tasks; should_skip_round() with heartbeat floor.
- web/app.py: _rounds_loop tracks last fingerprint + consecutive skips.
- excludes the orchestrator's own session/chat so a round's self-side-effects
can't defeat the gate.
Add scripts/release-fleet.sh (test -> deploy apricot+black -> restart plum
services) and harden deploy-agent.sh's cosmetic status check against a SIGPIPE
false-abort. 3 new discriminating tests; 349 pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
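The skip gate above could look roughly like this. It is a sketch, not the contents of web/rounds.py: the session and task dictionary keys, the function signatures, and the reading of "heartbeat floor" as "force a round after `max_skips` consecutive skips" are all assumptions.

```python
import hashlib
import json
from typing import Optional


def fleet_fingerprint(
    sessions: list[dict],
    open_tasks: list[dict],
    orchestrator_session_id: str,
) -> str:
    """Stable hash over worker sessions and open tasks.

    The orchestrator's own session is excluded so that side effects of a
    round cannot change the fingerprint and defeat the gate.
    """
    workers = sorted(
        (s["id"], s["state"]) for s in sessions if s["id"] != orchestrator_session_id
    )
    tasks = sorted((t["id"], t["status"]) for t in open_tasks)
    payload = json.dumps({"workers": workers, "tasks": tasks}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_skip_round(
    current: str,
    last: Optional[str],
    consecutive_skips: int,
    max_skips: int,
) -> bool:
    """Skip only when nothing changed and the heartbeat floor is not due."""
    if last is None:
        return False  # first tick: always run
    if consecutive_skips >= max_skips:
        return False  # heartbeat floor: run even with no change
    return current == last
```

Sorting both lists before hashing makes the fingerprint independent of the order in which sessions and tasks are listed.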
Peer nodes can run a local orchestrator registered with claude.ai/code as
[<host>] claire, installed uniformly via deploy-agent.sh (not hand-wired).
- agent.orchestrator_enable + orchestrator.mcp_url config (round-trip safe)
- bootstrap points orchestrator MCP at central endpoint when set
- peer lifespan bootstraps + heartbeats the orchestrator (NO rounds loop)
- claire agent enable-orchestrator CLI + deploy-agent.sh wiring
(manual commit via ALLOW_COMMIT=1 — autocommit LLM was timing out on claire)
discover_session polled `rclaude list sessions` for the freshly spawned
session but filtered rows with `r.host == host` where host is the
canonical name (e.g. "plum"), while rclaude labels the calling machine's
own sessions "local". "local" == "plum" is always False, so discovery
matched nothing and timed out even though the session's JSONL was already
on disk (observed: 18s after spawn, inside the 30s window). dispatch then
falsely returned "spawned but not discovered", orphaning the live session
until a manual pull.
Root cause is a missing host-label normalization the pull loop already
does. Fix discover_session to canonicalize both sides via
resolve_host_label, and key local-path symlink resolution on the ROW's
raw label. Apply the same normalization to dispatch_task's pre_uuids
filter (identical mismatch left it empty, risking a stale-sibling match
at a shared cwd). 2 regression tests reproduce rclaude's "local" labeling
(the old fake echoed the dispatch host, masking the bug). 310 tests pass.
Committed manually with ALLOW_COMMIT=1 per user authorization: the
auto-commit service's message LLM was timing out on this repo.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
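The label mismatch and its fix can be reproduced in a few lines. This sketch assumes session rows are dictionaries with a `host` key and uses a hypothetical `canonical_label` helper in place of resolve_host_label, whose real signature is not shown here.

```python
def canonical_label(label: str, local_host: str) -> str:
    """Map rclaude's "local" label to the calling machine's canonical name."""
    return local_host if label == "local" else label


def rows_for_host_buggy(rows: list[dict], host: str) -> list[dict]:
    # Old behaviour: raw comparison, so "local" never equals "plum".
    return [r for r in rows if r["host"] == host]


def rows_for_host(rows: list[dict], host: str, local_host: str) -> list[dict]:
    # Fixed behaviour: canonicalize both sides before comparing.
    want = canonical_label(host, local_host)
    return [r for r in rows if canonical_label(r["host"], local_host) == want]


rows = [
    {"host": "local", "uuid": "aaaa"},  # session spawned on this machine
    {"host": "black", "uuid": "bbbb"},
]
```

Running on plum, the buggy filter returns nothing for host "plum", while the normalized filter returns the "local" row, which is the discovery the dispatch path was waiting for.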