Compare commits

2 Commits

Author SHA1 Message Date
Teknium 3b6347af15 feat(kanban): default_assignee fallback + per-profile concurrency cap (#27145, #21582) (#34244)
Two related dispatcher behaviors that have been missing for a while.

## kanban.default_assignee (#27145)

Reporter (@agarzon): dashboard creates a task without an assignee, task
parks in 'ready' forever even though the operator's intent ('default')
is perfectly clear. The dispatcher already had a 'skipped_unassigned'
bucket but no fallback routing — users had to manually type 'default'
in the assignee field every time.

Behavior: when 'kanban.default_assignee' is set in config.yaml, the
dispatcher applies that assignee to any unassigned ready task before
deciding whether to spawn. The row is mutated (assignee column + an
'assigned' event with source='kanban.default_assignee' for the audit
trail). Empty/whitespace config value = no fallback, preserving the
existing skipped_unassigned behavior.
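
A minimal sketch of the fallback rule described above, not the dispatcher's actual code. The helper names `normalize_default_assignee` and `resolve_assignee` are hypothetical. The only behavior taken from this commit is that an empty or whitespace-only config value means "no fallback" and that explicitly assigned tasks are left untouched.

```python
from typing import Optional, Tuple


def normalize_default_assignee(raw: object) -> Optional[str]:
    """Return the configured fallback profile, or None when unset.

    Empty and whitespace-only strings count as "no fallback". Treating
    non-string values as unset is an assumption of this sketch.
    """
    if not isinstance(raw, str):
        return None
    return raw.strip() or None


def resolve_assignee(
    task_assignee: Optional[str], default_assignee: Optional[str]
) -> Tuple[Optional[str], bool]:
    """Return (effective_assignee, auto_assigned) for one ready task."""
    if task_assignee:
        # Explicit assignments always win over the fallback.
        return task_assignee, False
    if default_assignee:
        # Unassigned task plus a configured fallback: route to it.
        return default_assignee, True
    # No assignee and no fallback: the task stays in skipped_unassigned.
    return None, False
```

With `default_assignee` left empty, `resolve_assignee` returns `(None, False)` for an unassigned task, which corresponds to the pre-existing skip behavior.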

Dry-run mode reports what WOULD happen via the new
'auto_assigned_default' bucket on DispatchResult, but does NOT mutate
the DB — operators using 'hermes kanban dispatch --dry-run' see the
routing decision before committing.
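
The dry-run contract can be illustrated with an in-memory SQLite database. This is a sketch under stated assumptions: the `tasks` table and `assignee` column appear in the diff, but the `events` table layout and the helper names `apply_default_assignee` and `make_demo_db` are hypothetical stand-ins for the real schema and event helper.

```python
import json
import sqlite3


def apply_default_assignee(
    conn: sqlite3.Connection, task_id: str, assignee: str, dry_run: bool
) -> str:
    """Report the auto-assignment; persist it only on a real run."""
    if not dry_run:
        with conn:  # one transaction: row update plus audit event
            conn.execute(
                "UPDATE tasks SET assignee = ? WHERE id = ? "
                "AND (assignee IS NULL OR assignee = '')",
                (assignee, task_id),
            )
            # Hypothetical events table standing in for the audit trail.
            conn.execute(
                "INSERT INTO events (task_id, kind, payload) VALUES (?, ?, ?)",
                (
                    task_id,
                    "assigned",
                    json.dumps(
                        {"assignee": assignee, "source": "kanban.default_assignee"}
                    ),
                ),
            )
    # In both modes the task id is reported as auto-assigned.
    return task_id


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway database with one unassigned task."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, assignee TEXT)")
    conn.execute("CREATE TABLE events (task_id TEXT, kind TEXT, payload TEXT)")
    conn.execute("INSERT INTO tasks (id, assignee) VALUES ('t1', NULL)")
    conn.commit()
    return conn
```

Calling the helper with `dry_run=True` leaves both tables unchanged, while `dry_run=False` writes the assignee and one `assigned` event.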

## kanban.max_in_progress_per_profile (#21582)

Reporters (@edwardchenchen, @simlu, 4 reactions): fan-out workloads
saturate one profile's local model / API quota / browser pool while
other profiles sit idle. The existing global 'max_in_progress' caps
total workers but doesn't balance across profiles.

Behavior: when 'kanban.max_in_progress_per_profile' is set to a
positive int, the dispatcher tracks per-assignee running counts (one
query at tick start) and refuses to spawn for any assignee already at
the cap. Tasks blocked this way go to a new
'skipped_per_profile_capped' bucket on DispatchResult as
(task_id, assignee, current_running_count) tuples — NOT an
operator-actionable failure, just 'try again next tick when the
profile has capacity'.
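
The "one query at tick start" step might look like the sketch below. The SQL mirrors the query in the diff. The demo schema and the helper names `running_counts`, `at_cap`, and `make_cap_demo_db` are assumptions made for illustration.

```python
import sqlite3
from typing import Dict


def running_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Count running tasks per assignee with a single grouped query."""
    rows = conn.execute(
        "SELECT assignee, COUNT(*) FROM tasks "
        "WHERE status = 'running' AND assignee IS NOT NULL "
        "GROUP BY assignee"
    )
    return {assignee: int(n) for assignee, n in rows}


def at_cap(counts: Dict[str, int], assignee: str, cap: int) -> bool:
    """True when the assignee already has `cap` or more workers running."""
    return counts.get(assignee, 0) >= cap


def make_cap_demo_db() -> sqlite3.Connection:
    """Throwaway board: two running tasks for 'research', one for 'writer'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, assignee TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (id, assignee, status) VALUES (?, ?, ?)",
        [
            ("t1", "research", "running"),
            ("t2", "research", "running"),
            ("t3", "writer", "running"),
            ("t4", "research", "ready"),
            ("t5", None, "ready"),
        ],
    )
    conn.commit()
    return conn
```

Profiles with no running tasks are simply absent from the mapping, so the lookup defaults to zero.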

Pre-existing 'running' tasks count against the cap (verified via
regression test). The cap respects dry_run mode by incrementing
its in-memory counter on each would-be spawn so dry_run reports
the same balanced subset that a real tick would.
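
The in-memory counter can be sketched as a pure function over the ready queue. `plan_tick` is a hypothetical name, and the real dispatcher applies further checks (global caps, respawn guard, claims) that are omitted here.

```python
from typing import Dict, List, Tuple


def plan_tick(
    ready: List[Tuple[str, str]],
    running: Dict[str, int],
    cap: int,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, int]]]:
    """Split ready (task_id, assignee) pairs into spawned and capped.

    `running` is the per-assignee count taken at tick start. It is copied
    and incremented on every would-be spawn, so a dry run and a real tick
    select the same subset.
    """
    counts = dict(running)  # copy so the caller's snapshot is untouched
    spawned: List[Tuple[str, str]] = []
    capped: List[Tuple[str, str, int]] = []
    for task_id, assignee in ready:
        current = counts.get(assignee, 0)
        if current >= cap:
            # Deferred, not failed: retried on a later tick.
            capped.append((task_id, assignee, current))
            continue
        spawned.append((task_id, assignee))
        counts[assignee] = current + 1
    return spawned, capped
```

For example, with one task already running for profile `a` and a cap of 2, only the first ready task for `a` is selected and the rest are deferred, while tasks for other profiles still go through.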

Invalid cap values (0, negative, non-int, None) are treated as 'no
cap', preserving the existing behavior. Backward-compatible for
installs that don't set the config.
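
A minimal sketch of the cap normalization, modeled on the config-reading code in this diff. The function name `parse_per_profile_cap` is hypothetical. Note that `int()` coercion means a numeric string such as `"2"` is accepted at this layer.

```python
from typing import Optional


def parse_per_profile_cap(raw: object) -> Optional[int]:
    """Return a positive int cap, or None for "no cap"."""
    if raw is None:
        return None  # unset: backward-compatible default
    try:
        value = int(raw)  # type: ignore[call-overload]
    except (TypeError, ValueError):
        return None  # e.g. 'abc' or a list
    return value if value >= 1 else None  # 0 and negatives mean no cap
```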

## Surfaces

- 'hermes kanban dispatch' CLI now prints 'Auto-assigned to
  kanban.default_assignee=X: ...' and 'Deferred (X at per-profile cap,
  N running): ...' lines, plus matching JSON keys in --json output.
- Gateway dispatcher logs the configured values at startup
  ('default_assignee=X', 'max_in_progress_per_profile=N').
- 'kanban.max_in_progress_per_profile' added to DEFAULT_CONFIG with
  inline docs.
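
The two new `--json` keys can be sketched as follows. The key names and the per-entry fields (`task_id`, `assignee`, `current`) come from the diff; the helper name `dispatch_json_fragment` and the sample task ids are made up for illustration, and the real output contains additional keys.

```python
import json
from typing import Dict, List, Tuple


def dispatch_json_fragment(
    auto_assigned_default: List[str],
    skipped_per_profile_capped: List[Tuple[str, str, int]],
) -> Dict[str, object]:
    """Build only the two keys added by this change."""
    return {
        "skipped_per_profile_capped": [
            {"task_id": tid, "assignee": who, "current": current}
            for (tid, who, current) in skipped_per_profile_capped
        ],
        "auto_assigned_default": list(auto_assigned_default),
    }


if __name__ == "__main__":
    # Hypothetical sample data, not real dispatcher output.
    fragment = dispatch_json_fragment(["t7"], [("t2", "research", 2)])
    print(json.dumps(fragment, indent=2))
```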

## Validation

- tests/hermes_cli/test_kanban_default_assignee.py (6 cases): no-fallback
  baseline, auto-assign + DB mutation, dry-run reports without
  mutating, whitespace treated as None, explicit assignees untouched,
  DispatchResult field schema.
- tests/hermes_cli/test_kanban_per_profile_cap.py (9 cases including
  4 parametrized): no-cap baseline, balanced 2-profile fan-out,
  pre-existing running counts against cap, invalid cap values
  (0/-1/'abc'/None), capped tasks dispatched on next tick after
  running task completes, DispatchResult field schema.
- Broader kanban suite: 464/464 pass (was 449 baseline; +15 new
  regression tests across both features).

## Credit

#27145 — Jimmy Johansson reported the dispatcher skipped-unassigned
gap; @agarzon scoped the simpler 'honor kanban.default_assignee' fix
that matches the existing config knob.
#21582 — @edwardchenchen filed the per-profile cap ask after hitting
model 429s on fan-out research projects; @simlu confirmed the same
pain on local-model setups.
2026-05-28 19:02:55 -07:00
Ben 42612aa350 docs(docker): refresh user-guide page for s6-overlay reality
The page was last meaningfully rewritten in the pre-s6 (tini) era and had
drifted on five points that no longer matched the image:

1. "Running the dashboard" claimed the entrypoint backgrounds
   `hermes dashboard` and prefixes its output with `[dashboard]`. That
   was the pre-s6 entrypoint.sh path; under s6 the dashboard is a
   supervised s6-rc service (`docker/s6-rc.d/dashboard/run`) with no
   sed-prefix pipeline. Rewrote the section accordingly.

2. The default for `HERMES_DASHBOARD_HOST` was documented as
   `127.0.0.1`. The s6 run script defaults it to `0.0.0.0`
   (`dash_host="${HERMES_DASHBOARD_HOST:-0.0.0.0}"`). Fixed the table
   and the surrounding prose.

3. Multi-profile was documented as "not recommended in Docker — run
   one container per profile." That advice was load-bearing when
   there was no in-container supervisor, but the s6 architecture
   explicitly adds per-profile gateway supervision: each profile
   created via `hermes profile create <name>` gets a slot under
   `/run/service/gateway-<name>/`, the `02-reconcile-profiles`
   cont-init script restores them across `docker restart` from
   `gateway_state.json`, and `hermes gateway start/stop/restart` is
   intercepted by `_dispatch_via_service_manager_if_s6` to route
   through `s6-svc`. Pivoted the section to "one container, many
   supervised profile gateways" as the default, with a comparison
   table and a "When you DO want a separate container" escape
   hatch for the genuine resource-isolation / network-segmentation
   cases.

4. The Compose example trailer also claimed `[dashboard]` log
   prefixing. Replaced with the actual log routing.

5. Added a new "Where the logs go" section covering all four log
   surfaces: per-profile gateways (tee'd to `docker logs` AND
   `${HERMES_HOME}/logs/gateways/<profile>/current` since PR
   b34532319), dashboard (`docker logs`, no prefix), boot reconciler
   (`container-boot.log`), and `hermes logs`. The gateway-mode and
   Compose sections cross-reference this rather than each carrying
   their own routing prose.

Added a new "docker exec automatically drops to the hermes user"
subsection under "What the Dockerfile does", next to the existing
Privilege model warning. Documents the `/opt/hermes/bin/hermes` shim
(landed via the docker-exec privilege-drop work) — operators don't
need to remember `--user hermes` for `docker exec hermes login`,
`docker exec hermes profile create …`, etc. The historical footgun
(`auth.json` written as `root:root`, supervised gateway then can't
read its own auth file) is mentioned only as context for what the
fail-loud `exit 126` is protecting against, not as a problem the
reader needs to solve. The `HERMES_DOCKER_EXEC_AS_ROOT=1` opt-out is
documented for diagnostic sessions.

The "Permission denied" troubleshooting subsection now carries a
single-line pointer to the new section instead of duplicating it.

The `--insecure` framing reflects PR #fb5125362 (opt-in via
`HERMES_DASHBOARD_INSECURE`, not derived from bind host): the OAuth
gate is the authority, the bind host alone never implies
`--insecure`, and opting out is an explicit security trade-off.
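
The host default and the opt-in rule can be sketched together. This is an illustration only: the function name `dashboard_settings` is hypothetical, and the source does not say which values of `HERMES_DASHBOARD_INSECURE` count as enabled, so the sketch assumes the literal string `1`.

```python
from typing import Mapping, Tuple


def dashboard_settings(env: Mapping[str, str]) -> Tuple[str, bool]:
    """Return (bind_host, insecure) from an environment mapping."""
    # Mirrors the shell default ${HERMES_DASHBOARD_HOST:-0.0.0.0}:
    # unset and empty both fall back to 0.0.0.0.
    host = env.get("HERMES_DASHBOARD_HOST") or "0.0.0.0"
    # Assumption: only the literal "1" opts in. The bind host is
    # deliberately not consulted when deciding this flag.
    insecure = env.get("HERMES_DASHBOARD_INSECURE", "") == "1"
    return host, insecure
```

Binding to `127.0.0.1` in this sketch does not switch the flag on; only the explicit variable does.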

Anchors verified to resolve. i18n zh-Hans mirror left for the
translation flow to catch up.
2026-05-29 11:55:01 +10:00
14 changed files with 679 additions and 265 deletions
+45
@@ -5420,6 +5420,49 @@ class GatewayRunner:
)
stale_timeout_seconds = 0
# Read kanban.default_assignee — fallback profile for tasks
# created without an explicit assignee (e.g. via the dashboard).
# When set, the dispatcher applies it to unassigned ready tasks
# instead of skipping them indefinitely (#27145). Empty string
# (the schema default) means "no fallback, keep skipping" —
# backward-compatible with existing installs.
default_assignee = (kanban_cfg.get("default_assignee") or "").strip() or None
if default_assignee:
logger.info(
"kanban dispatcher: default_assignee=%r (unassigned ready tasks "
"will route to this profile)",
default_assignee,
)
# Read kanban.max_in_progress_per_profile — per-profile concurrency
# cap (#21582). When set, no single profile gets more than N
# workers running at once, even if the global max_in_progress
# would allow it. Prevents one profile's local model / API quota
# / browser pool from being overwhelmed by a fan-out.
raw_per_profile = kanban_cfg.get("max_in_progress_per_profile", None)
max_in_progress_per_profile = None
if raw_per_profile is not None:
try:
max_in_progress_per_profile = int(raw_per_profile)
except (TypeError, ValueError):
logger.warning(
"kanban dispatcher: invalid kanban.max_in_progress_per_profile=%r; ignoring",
raw_per_profile,
)
max_in_progress_per_profile = None
else:
if max_in_progress_per_profile < 1:
logger.warning(
"kanban dispatcher: kanban.max_in_progress_per_profile=%r is below 1; ignoring",
raw_per_profile,
)
max_in_progress_per_profile = None
else:
logger.info(
"kanban dispatcher: max_in_progress_per_profile=%d",
max_in_progress_per_profile,
)
# Initial delay so the gateway finishes wiring adapters before the
# dispatcher spawns workers (those workers may hit gateway notify
# subscriptions etc.). Matches the notifier watcher's delay.
@@ -5511,6 +5554,8 @@ class GatewayRunner:
max_in_progress=max_in_progress,
failure_limit=failure_limit,
stale_timeout_seconds=stale_timeout_seconds,
default_assignee=default_assignee,
max_in_progress_per_profile=max_in_progress_per_profile,
)
except sqlite3.DatabaseError as exc:
if _is_corrupt_board_db_error(exc):
+9
@@ -1726,6 +1726,15 @@ DEFAULT_CONFIG = {
# assignee to any installed profile. When unset, falls back to the
# default profile. A task never ends up with assignee=None.
"default_assignee": "",
# Per-profile concurrency cap (#21582). When set to a positive int,
# no single profile can have more than N workers running at once,
# even if the global max_in_progress / max_spawn caps would allow
# it. Tasks blocked this way defer to the next dispatcher tick.
# Unset (None) means "no per-profile cap" — backward-compatible
# with existing installs. Useful for fan-out workflows that would
# otherwise saturate one profile's local model / API quota /
# browser pool while leaving other profiles idle.
"max_in_progress_per_profile": None,
# When true, the kanban dispatcher auto-runs the decomposer on
# tasks that land in Triage (every dispatcher tick). When false,
# decomposition is manual via `hermes kanban decompose <id>` or
+1 -20
@@ -26,15 +26,10 @@ from hermes_cli.dashboard_auth import list_providers
from hermes_cli.dashboard_auth.audit import AuditEvent, audit_log
from hermes_cli.dashboard_auth.base import ProviderError
from hermes_cli.dashboard_auth.cookies import read_session_cookies
from hermes_cli.dashboard_auth.public_paths import PUBLIC_API_PATHS
_log = logging.getLogger(__name__)
# Prefixes that bypass the auth gate. Match via ``path == prefix`` or
# ``path.startswith(prefix)`` — so ``/assets/`` (with trailing slash)
# matches ``/assets/foo.css`` but not ``/assetsleak``. Auth-bootstrap
# (login page, OAuth round trip, provider listing) and static asset
# mounts go here.
# Paths that bypass the auth gate. Order matters: prefix match.
_GATE_PUBLIC_PREFIXES: tuple[str, ...] = (
"/auth/login",
"/auth/callback",
@@ -50,20 +45,6 @@ _GATE_PUBLIC_PREFIXES: tuple[str, ...] = (
def _path_is_public(path: str) -> bool:
"""True if ``path`` bypasses the OAuth auth gate.
Two sources of public-ness:
* :data:`PUBLIC_API_PATHS` — the shared ``/api/*`` allowlist that
the legacy ``_SESSION_TOKEN`` middleware also honours. Matched
exactly (no prefix expansion) so adding ``/api/status`` doesn't
accidentally expose ``/api/status/secret-extension``.
* :data:`_GATE_PUBLIC_PREFIXES` — auth-bootstrap routes and static
mounts. Prefix-matched so ``/assets/foo.css`` lights up via
``/assets/``.
"""
if path in PUBLIC_API_PATHS:
return True
return any(
path == prefix or path.startswith(prefix)
for prefix in _GATE_PUBLIC_PREFIXES
-49
@@ -1,49 +0,0 @@
"""Shared allowlist of ``/api/*`` paths that bypass dashboard auth.
Two middlewares enforce dashboard auth and previously kept independent
copies of this list:
* ``hermes_cli.web_server.auth_middleware`` — loopback / ``--insecure``
mode, gates on the ephemeral ``_SESSION_TOKEN``.
* ``hermes_cli.dashboard_auth.middleware.gated_auth_middleware`` —
non-loopback mode, gates on the OAuth session cookie.
When the lists drifted, ``/api/status`` ended up public under the legacy
gate but 401'd under the OAuth gate. That broke the portal's wildcard
liveness probe (``nous-account-service`` ``fly-provider.ts``
``getInstanceRuntimeStatus``), which fetches ``/api/status`` without a
cookie as its sole signal of "agent dashboard is alive": every healthy
wildcard-subdomain agent surfaced as STARTING/down in the portal UI even
though the dashboard was serving correctly.
Centralising the allowlist here so both middlewares import the same
frozenset prevents the next drift. Keep this list minimal — only truly
non-sensitive, read-only endpoints belong here. As a sanity check, every
entry should be safe to expose to:
* external uptime probes (Pingdom, Better Stack, NAS),
* the dashboard SPA before the user has logged in,
* anyone who happens to ``curl`` the hostname.
If a new endpoint doesn't pass all three tests, it should be gated and
the SPA should bootstrap it after login instead.
"""
from __future__ import annotations
PUBLIC_API_PATHS: frozenset[str] = frozenset({
# Liveness probe target. Returns version, gateway state, active
# session count, and the dashboard auth-gate shape. No bodies, no
# session content, no secrets. Documented as the portal's wildcard
# liveness probe in
# ``docs/agent-dashboard-public-url-contract.md`` (NAS side).
"/api/status",
# Read-only config-defaults / schema feeds for the SPA's Config page.
"/api/config/defaults",
"/api/config/schema",
# Read-only model metadata (context windows, etc.) — same shape as
# provider catalogs already exposed on the public internet.
"/api/model/info",
# Read-only theme + plugin manifests for the dashboard skin engine.
"/api/dashboard/themes",
"/api/dashboard/plugins",
})
+38
@@ -2087,12 +2087,35 @@ def _cmd_tail(args: argparse.Namespace) -> int:
def _cmd_dispatch(args: argparse.Namespace) -> int:
# Honour kanban.default_assignee as the fallback for unassigned ready
# tasks (#27145) and kanban.max_in_progress_per_profile as the
# per-profile concurrency cap (#21582). Same semantics as the
# gateway dispatch path.
try:
from hermes_cli.config import load_config
_cfg = load_config()
_kanban_cfg = _cfg.get("kanban", {}) if isinstance(_cfg, dict) else {}
default_assignee = (_kanban_cfg.get("default_assignee") or "").strip() or None
_raw_per_profile = _kanban_cfg.get("max_in_progress_per_profile", None)
try:
max_in_progress_per_profile = (
int(_raw_per_profile) if _raw_per_profile is not None else None
)
if max_in_progress_per_profile is not None and max_in_progress_per_profile < 1:
max_in_progress_per_profile = None
except (TypeError, ValueError):
max_in_progress_per_profile = None
except Exception:
default_assignee = None
max_in_progress_per_profile = None
with kb.connect_closing() as conn:
res = kb.dispatch_once(
conn,
dry_run=args.dry_run,
max_spawn=args.max,
failure_limit=getattr(args, "failure_limit", kb.DEFAULT_SPAWN_FAILURE_LIMIT),
default_assignee=default_assignee,
max_in_progress_per_profile=max_in_progress_per_profile,
)
if getattr(args, "json", False):
print(json.dumps({
@@ -2108,6 +2131,11 @@ def _cmd_dispatch(args: argparse.Namespace) -> int:
],
"skipped_unassigned": res.skipped_unassigned,
"skipped_nonspawnable": res.skipped_nonspawnable,
"skipped_per_profile_capped": [
{"task_id": tid, "assignee": who, "current": current}
for (tid, who, current) in res.skipped_per_profile_capped
],
"auto_assigned_default": res.auto_assigned_default,
}, indent=2))
return 0
print(f"Reclaimed: {res.reclaimed}")
@@ -2128,8 +2156,18 @@ def _cmd_dispatch(args: argparse.Namespace) -> int:
for tid, who, ws in res.spawned:
tag = " (dry)" if args.dry_run else ""
print(f" - {tid} -> {who} @ {ws or '-'}{tag}")
if res.auto_assigned_default:
print(
f"Auto-assigned to kanban.default_assignee={default_assignee!r}: "
f"{', '.join(res.auto_assigned_default)}"
)
if res.skipped_unassigned:
print(f"Skipped (unassigned): {', '.join(res.skipped_unassigned)}")
if res.skipped_per_profile_capped:
for tid, who, current in res.skipped_per_profile_capped:
print(
f"Deferred ({who} at per-profile cap, {current} running): {tid}"
)
if res.skipped_nonspawnable:
print(
f"Skipped (non-spawnable assignee — terminal lane, OK): "
+126 -5
@@ -4289,6 +4289,12 @@ class DispatchResult:
skipped_unassigned: list[str] = field(default_factory=list)
"""Ready task ids skipped because they have no assignee at all.
Operator-actionable: usually a misfiled task waiting for routing."""
auto_assigned_default: list[str] = field(default_factory=list)
"""Task ids that were unassigned in the DB and had
``kanban.default_assignee`` applied this tick before spawning (#27145).
Surfaces the auto-assignment to telemetry / CLI / dashboard so the
operator can see when the dispatcher is acting on the fallback rule
rather than on explicit per-task assignments."""
skipped_nonspawnable: list[str] = field(default_factory=list)
"""Ready task ids skipped because their assignee names a control-plane
lane (a Claude Code terminal like ``orion-cc``) rather than a Hermes
@@ -4296,6 +4302,14 @@ class DispatchResult:
operator-actionable failure. Tracked separately so health telemetry
can distinguish "real stuck" (nothing spawned but spawnable work
available) from "correctly idle" (nothing spawnable in the queue)."""
skipped_per_profile_capped: list[tuple[str, str, int]] = field(default_factory=list)
"""Tasks deferred this tick because their assignee is already at
``kanban.max_in_progress_per_profile`` (#21582). Each entry is
``(task_id, assignee, current_running_count)``. NOT an
operator-actionable failure: the task will be picked up on a
subsequent tick when the assignee has capacity. Separate bucket so
telemetry / dashboards can show "this profile is busy" vs
"task is genuinely stuck"."""
crashed: list[str] = field(default_factory=list)
"""Task ids reclaimed because their worker PID disappeared."""
auto_blocked: list[str] = field(default_factory=list)
@@ -5342,6 +5356,8 @@ def dispatch_once(
failure_limit: int = DEFAULT_SPAWN_FAILURE_LIMIT,
stale_timeout_seconds: int = 0,
board: Optional[str] = None,
default_assignee: Optional[str] = None,
max_in_progress_per_profile: Optional[int] = None,
) -> DispatchResult:
"""Run one dispatcher tick.
@@ -5427,12 +5443,89 @@ def dispatch_once(
if max_spawn is None or max_spawn > remaining:
max_spawn = remaining
spawned = 0
# Per-profile concurrency cap (#21582): when set, track how many
# workers each assignee already has in flight, and refuse to spawn
# when this would push that assignee past the cap. Prevents
# fan-out workloads from melting a single profile's local model /
# API quota / browser pool while leaving other profiles idle.
# Tasks blocked this way go to skipped_per_profile_capped (not
# skipped_unassigned — the operator-actionable signal is different:
# "this profile is busy, try again later" not "this needs routing").
_per_profile_cap = max_in_progress_per_profile if (
isinstance(max_in_progress_per_profile, int)
and max_in_progress_per_profile > 0
) else None
_per_profile_running: dict[str, int] = {}
if _per_profile_cap is not None:
for prow in conn.execute(
"SELECT assignee, COUNT(*) AS n FROM tasks "
"WHERE status = 'running' AND assignee IS NOT NULL "
"GROUP BY assignee"
):
_per_profile_running[prow["assignee"]] = int(prow["n"])
# Normalize default_assignee once: empty/whitespace string → None so the
# rest of the loop can use ``if default_assignee:`` as a single check.
# We also resolve profile_exists once here for the same reason.
_default_assignee = (default_assignee or "").strip() or None
_default_assignee_resolved = False
if _default_assignee:
try:
from hermes_cli.profiles import profile_exists as _pe
_default_assignee_resolved = bool(_pe(_default_assignee))
except Exception:
# Profiles module not importable (test stubs, exotic envs).
# Trust the operator's config and try the assignment; the
# downstream profile_exists check on the assigned row will
# bucket it as nonspawnable if the profile genuinely isn't
# there, with the existing diagnostic.
_default_assignee_resolved = True
for row in ready_rows:
if max_spawn is not None and running_count + spawned >= max_spawn:
break
if not row["assignee"]:
result.skipped_unassigned.append(row["id"])
continue
row_assignee = row["assignee"]
if not row_assignee:
# Honour kanban.default_assignee: when the dispatcher hits an
# unassigned ready task and an operator-configured fallback
# exists, persist the assignment and proceed. This removes the
# dashboard footgun where a task created without an assignee
# parks in 'ready' forever even though the operator's intent
# ("default") was perfectly clear (#27145). Mutating the row
# (not just the in-memory view) keeps diagnostics and the
# board state consistent: the task is now legitimately owned
# by ``kanban.default_assignee``, not "unassigned but secretly
# routed".
if _default_assignee and _default_assignee_resolved:
# Dry-run: show what WOULD happen (auto-assign + spawn) without
# mutating the DB. Real run: mutate the row + emit the
# 'assigned' event so the board state matches what just happened.
if not dry_run:
try:
with write_txn(conn):
conn.execute(
"UPDATE tasks SET assignee = ? WHERE id = ? "
"AND (assignee IS NULL OR assignee = '')",
(_default_assignee, row["id"]),
)
_append_event(
conn, row["id"], "assigned",
{
"assignee": _default_assignee,
"source": "kanban.default_assignee",
},
)
except Exception:
_log.debug(
"kanban dispatch: failed to apply default_assignee=%r "
"to task %s",
_default_assignee, row["id"], exc_info=True,
)
result.skipped_unassigned.append(row["id"])
continue
row_assignee = _default_assignee
result.auto_assigned_default.append(row["id"])
else:
result.skipped_unassigned.append(row["id"])
continue
# Skip ready tasks whose assignee is not a real Hermes profile.
# `_default_spawn` invokes ``hermes -p <assignee>`` which fails
# with "Profile 'X' does not exist" when the assignee names a
@@ -5447,7 +5540,7 @@ def dispatch_once(
from hermes_cli.profiles import profile_exists # local import: avoids cycle
except Exception:
profile_exists = None # type: ignore[assignment]
if profile_exists is not None and not profile_exists(row["assignee"]):
if profile_exists is not None and not profile_exists(row_assignee):
# Bucket separately from skipped_unassigned: the operator
# cannot fix this by assigning a profile (the assignee IS the
# intended owner — a terminal lane). Health telemetry uses
@@ -5456,6 +5549,19 @@ def dispatch_once(
# of human-pulled work.
result.skipped_nonspawnable.append(row["id"])
continue
# Per-profile concurrency cap (#21582): even if there's global
# headroom, refuse to spawn for an assignee that's already at
# its in-flight cap. Prevents one profile's local model / API
# quota / browser pool from being overwhelmed by a fan-out
# while the global max_in_progress / max_spawn caps still allow
# work on OTHER profiles.
if _per_profile_cap is not None:
current = _per_profile_running.get(row_assignee, 0)
if current >= _per_profile_cap:
result.skipped_per_profile_capped.append(
(row["id"], row_assignee, current)
)
continue
# Respawn guard: refuse to re-spawn when useful work is already
# in-flight/recent, or when the last failure is a deterministic
# blocker (quota / auth). The guard defers the spawn this tick so
@@ -5478,7 +5584,15 @@ def dispatch_once(
)
continue
if dry_run:
result.spawned.append((row["id"], row["assignee"], ""))
result.spawned.append((row["id"], row_assignee, ""))
# Increment per-profile counter even in dry_run so the cap
# check sees the would-be spawn on subsequent iterations.
# Without this, dry_run reports every task as spawnable and
# under-reports the capped subset (#21582).
if _per_profile_cap is not None and row_assignee:
_per_profile_running[row_assignee] = (
_per_profile_running.get(row_assignee, 0) + 1
)
continue
claimed = claim_task(conn, row["id"], ttl_seconds=ttl_seconds)
if claimed is None:
@@ -5521,6 +5635,13 @@ def dispatch_once(
# complete_task).
result.spawned.append((claimed.id, claimed.assignee or "", str(workspace)))
spawned += 1
# Track the new in-flight count for this profile so later
# iterations in this same tick respect the per-profile cap
# (#21582). Subsequent ticks re-query from the DB.
if _per_profile_cap is not None and claimed.assignee:
_per_profile_running[claimed.assignee] = (
_per_profile_running.get(claimed.assignee, 0) + 1
)
except Exception as exc:
auto = _record_spawn_failure(
conn, claimed.id, str(exc),
+10 -13
@@ -110,20 +110,17 @@ app.add_middleware(
# ---------------------------------------------------------------------------
# Endpoints that do NOT require the session token. Everything else under
# /api/ is gated by the auth middleware below.
#
# This list is defined in ``hermes_cli.dashboard_auth.public_paths`` so the
# OAuth gate middleware can honour the same allowlist — keeping the two
# gates in lockstep avoids drift like the wildcard-subdomain regression
# where ``/api/status`` was public under the legacy gate but 401'd under
# the OAuth gate (breaking the portal's liveness probe).
#
# Keep the upstream list minimal — only truly non-sensitive, read-only
# endpoints belong there.
# /api/ is gated by the auth middleware below. Keep this list minimal —
# only truly non-sensitive, read-only endpoints belong here.
# ---------------------------------------------------------------------------
from hermes_cli.dashboard_auth.public_paths import (
PUBLIC_API_PATHS as _PUBLIC_API_PATHS,
)
_PUBLIC_API_PATHS: frozenset = frozenset({
"/api/status",
"/api/config/defaults",
"/api/config/schema",
"/api/model/info",
"/api/dashboard/themes",
"/api/dashboard/plugins",
})
def _has_valid_session_token(request: Request) -> bool:
+9 -31
@@ -324,14 +324,10 @@ def test_dashboard_oauth_gate_engages_on_non_loopback_bind(
1. ``/api/auth/providers`` (publicly reachable through the gate so
the login page can bootstrap) returns 200 with ``nous`` in the
provider list — proves the bundled provider registered.
2. ``/api/sessions`` (a gated route under both the legacy
``_SESSION_TOKEN`` middleware and the OAuth gate) returns 401
to an unauthenticated caller — proves the OAuth gate is actively
intercepting browser traffic. We deliberately probe a gated route
here rather than ``/api/status``: status sits in the shared
``PUBLIC_API_PATHS`` allowlist (portal liveness probe target) and
responds 200 without a cookie under both gates, so it cannot
distinguish "gate on" from "gate off".
2. ``/api/status`` (a public endpoint under the legacy
``_SESSION_TOKEN`` middleware) returns 401 — proves the OAuth gate
runs upstream of the legacy public list and is actively
intercepting unauthenticated callers.
"""
subprocess.run(
["docker", "run", "-d", "--name", container_name,
@@ -355,32 +351,14 @@ def test_dashboard_oauth_gate_engages_on_non_loopback_bind(
f"HERMES_DASHBOARD_OAUTH_CLIENT_ID is set. Got: {payload!r}"
)
# (2) A gated route (``/api/sessions``) returns 401 to an
# unauthenticated caller — the OAuth gate is intercepting.
status_code, body = _http_probe(container_name, "/api/sessions")
assert status_code == 401, (
"OAuth gate must intercept gated /api/* routes on 0.0.0.0 bind "
"when a provider is registered and HERMES_DASHBOARD_INSECURE "
f"is unset. Got: status={status_code} body={body!r}"
)
# (3) ``/api/status`` remains 200 under the gate — it's in the shared
# ``PUBLIC_API_PATHS`` allowlist so NAS's wildcard-subdomain
# liveness probe (``fly-provider.ts`` ``getInstanceRuntimeStatus``)
# can reach it without a cookie. Regression guard: this allowlist
# drifted once already and surfaced every healthy agent as
# STARTING/down in the portal UI.
# (2) /api/status is gated by the OAuth middleware → unauthenticated
# callers get 401, not the legacy public 200 JSON.
status_code, body = _http_probe(container_name, "/api/status")
assert status_code == 200, (
"/api/status must remain publicly reachable under the OAuth gate "
"— the portal uses it as the wildcard-subdomain liveness probe. "
assert status_code == 401, (
"OAuth gate must intercept /api/status on 0.0.0.0 bind when a "
"provider is registered and HERMES_DASHBOARD_INSECURE is unset. "
f"Got: status={status_code} body={body!r}"
)
status = json.loads(body)
assert status.get("auth_required") is True, (
"/api/status must report auth_required=True when the OAuth gate "
f"is engaged so the SPA/portal can distinguish modes. Got: {status!r}"
)
def test_dashboard_insecure_env_var_opts_out_of_gate(
@@ -131,13 +131,8 @@ class TestRefreshTokenCookieDeprecation:
class TestApi401Envelope:
# NOTE: probe a gated route (``/api/sessions``) here rather than
# ``/api/status`` — status is in the shared ``PUBLIC_API_PATHS``
# allowlist (portal liveness probe) so it would 200 even without a
# cookie and never exercise the 401-envelope code path.
def test_no_cookie_returns_unauthenticated_envelope(self, gated_app):
r = gated_app.get("/api/sessions")
r = gated_app.get("/api/status")
assert r.status_code == 401
body = r.json()
assert body["error"] == "unauthenticated"
@@ -146,7 +141,7 @@ class TestApi401Envelope:
def test_invalid_cookie_returns_session_expired_envelope(self, gated_app):
gated_app.cookies.set(SESSION_AT_COOKIE, "garbage")
r = gated_app.get("/api/sessions")
r = gated_app.get("/api/status")
assert r.status_code == 401
body = r.json()
assert body["error"] == "session_expired"
@@ -156,7 +151,7 @@ class TestApi401Envelope:
"""Dead-cookie cleanup — Phase 6 requirement so the browser
doesn't keep replaying the stale token on every request."""
gated_app.cookies.set(SESSION_AT_COOKIE, "garbage")
r = gated_app.get("/api/sessions")
r = gated_app.get("/api/status")
set_cookies = r.headers.get_list("set-cookie")
assert any(
c.startswith(f"{SESSION_AT_COOKIE}=") and "Max-Age=0" in c
@@ -56,61 +56,10 @@ def gated_app():
# ---------------------------------------------------------------------------
def test_gated_status_is_public(gated_app):
"""``/api/status`` MUST be public under the OAuth gate.
Regression guard for the wildcard-subdomain rollout: NAS
(``fly-provider.ts`` ``getInstanceRuntimeStatus``) hits
``/api/status`` without a cookie as its sole liveness probe. A 401
here surfaces every healthy agent as STARTING/down in the portal
UI. The endpoint returns only version + gateway/auth-gate metadata
(no user data, no session content), so it stays in the shared
``PUBLIC_API_PATHS`` allowlist under both the legacy ``_SESSION_TOKEN``
gate and the OAuth gate.
The body also reports the gate's shape (``auth_required``,
``auth_providers``) so the SPA's StatusPage and external monitors
can distinguish loopback / gated / no-providers without a separate
round trip.
"""
def test_gated_status_now_requires_auth(gated_app):
"""When gate is on, /api/status is NOT public — login bootstrap uses /api/auth/providers."""
r = gated_app.get("/api/status")
assert r.status_code == 200, (
f"Expected 200, got {r.status_code}: {r.text}"
)
body = r.json()
assert body["auth_required"] is True
assert "version" in body
assert "gateway_state" in body
@pytest.mark.parametrize("path", [
"/api/config/defaults",
"/api/config/schema",
"/api/model/info",
"/api/dashboard/themes",
"/api/dashboard/plugins",
])
def test_other_public_api_paths_are_public_under_gate(gated_app, path):
"""The remaining ``PUBLIC_API_PATHS`` entries must also bypass the
gate. They're documented as non-sensitive read-only endpoints that
the SPA pre-loads before login (themes, config schema, model
metadata). A 401 / 302-to-login here would block the dashboard
shell from rendering pre-auth.
Accept any non-auth-failure status: 200 when the route succeeds,
or any route-specific error (e.g. 400 / 404 / 500 from a missing
dependency) but NEVER 401, and NEVER a 302 to ``/login``.
"""
r = gated_app.get(path, follow_redirects=False)
assert r.status_code != 401, (
f"{path} returned 401 under the OAuth gate — should be public"
)
if r.status_code == 302:
location = r.headers.get("location", "")
assert "/login" not in location, (
f"{path} redirected to {location} — should be public, "
"not bounced to /login"
)
assert r.status_code == 401
def test_gated_html_redirects_to_login(gated_app):
@@ -149,7 +98,7 @@ def test_gated_static_asset_path_is_public(gated_app):
# ---------------------------------------------------------------------------
def test_full_login_round_trip_unlocks_gated_api(gated_app):
def test_full_login_round_trip_unlocks_api_status(gated_app):
# 1) Click "Sign in with Stub IdP" — /auth/login redirects to the stub
# with a PKCE cookie on the response.
r1 = gated_app.get("/auth/login?provider=stub", follow_redirects=False)
@@ -179,16 +128,11 @@ def test_full_login_round_trip_unlocks_gated_api(gated_app):
assert any("hermes_session_at" in c for c in set_cookies)
assert any("hermes_session_rt" in c for c in set_cookies)
# 3) A gated API route (``/api/sessions``) now succeeds because we
# have a valid session cookie. (We deliberately don't probe
# ``/api/status`` here — it's in the shared PUBLIC_API_PATHS
# allowlist and would 200 even without a login, so it can't
# distinguish "logged in" from "gate accidentally disabled".)
r3 = gated_app.get("/api/sessions")
assert r3.status_code == 200, (
f"Expected 200 for /api/sessions post-login, got {r3.status_code}: "
f"{r3.text}"
)
# 3) /api/status now succeeds because we're authenticated.
r3 = gated_app.get("/api/status")
assert r3.status_code == 200
body = r3.json()
assert "version" in body
def test_login_unknown_provider_returns_404(gated_app):
@@ -59,11 +59,19 @@ def loopback_client():
web_server.app.state.auth_required = prev_required
def _login(client: TestClient) -> None:
"""Drive the stub OAuth round trip so the gated client is authed."""
r1 = client.get("/auth/login?provider=stub", follow_redirects=False)
assert r1.status_code == 302
state = r1.headers["location"].split("state=")[1]
r2 = client.get(
f"/auth/callback?code=stub_code&state={state}", follow_redirects=False
)
assert r2.status_code == 302
def test_status_reports_auth_required_in_gated_mode(gated_client):
# No ``_login()`` call — ``/api/status`` is in the shared
# ``PUBLIC_API_PATHS`` allowlist precisely so external probes (and
# the SPA's pre-login bootstrap) can read the gate's shape without
# a cookie. Hit it cold.
_login(gated_client)
r = gated_client.get("/api/status")
assert r.status_code == 200
body = r.json()
@@ -0,0 +1,154 @@
"""Regression tests for #27145 — kanban.default_assignee for unassigned ready tasks.
When the dispatcher hits an unassigned ready task and ``kanban.default_assignee``
is set, the dispatcher applies the assignment and spawns. Without the config,
the task is skipped (existing behavior preserved).
"""
from __future__ import annotations
import json
import os
import sys
import tempfile
import pytest
@pytest.fixture()
def isolated_kanban_home(monkeypatch):
"""Spin up a fresh HERMES_HOME with a clean kanban DB."""
test_home = tempfile.mkdtemp(prefix="kanban_default_assignee_test_")
monkeypatch.setenv("HERMES_HOME", test_home)
# Force-reimport so the fresh HERMES_HOME is picked up.
for mod in list(sys.modules.keys()):
if mod.startswith("hermes_cli") or mod.startswith("hermes_state") or mod == "hermes_constants":
del sys.modules[mod]
from hermes_cli import kanban_db
yield kanban_db, test_home
# No explicit cleanup: the temp dir is left on disk, but each test gets
# its own monkeypatched HERMES_HOME, so there is no cross-test
# contamination.
def _fake_spawn(*args, **kwargs):
"""Stand-in for the real worker spawn — returns a fake PID."""
return 12345
def test_unassigned_task_skipped_without_default_assignee(isolated_kanban_home):
"""Baseline: with no default_assignee, an unassigned ready task is
skipped via the existing `skipped_unassigned` bucket and the DB row
is untouched."""
kb, _home = isolated_kanban_home
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
task_id = kb.create_task(conn, title="t1", assignee=None)
with kb.connect_closing() as conn:
res = kb.dispatch_once(conn, spawn_fn=_fake_spawn, dry_run=False)
assert res.skipped_unassigned == [task_id]
assert not res.auto_assigned_default
assert not res.spawned
with kb.connect_closing() as conn:
row = conn.execute("SELECT assignee FROM tasks WHERE id = ?", (task_id,)).fetchone()
assert row["assignee"] is None
def test_unassigned_task_auto_assigned_with_default_assignee(isolated_kanban_home):
"""Core #27145 contract: with default_assignee set, an unassigned ready
task gets the assignment applied and dispatched on the same tick. The
DB row is mutated (assignee column + an 'assigned' event)."""
kb, _home = isolated_kanban_home
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
task_id = kb.create_task(conn, title="t1", assignee=None)
with kb.connect_closing() as conn:
res = kb.dispatch_once(
conn, spawn_fn=_fake_spawn, dry_run=False,
default_assignee="default",
)
assert res.auto_assigned_default == [task_id]
assert not res.skipped_unassigned
assert len(res.spawned) == 1
assert res.spawned[0][0] == task_id
assert res.spawned[0][1] == "default"
with kb.connect_closing() as conn:
row = conn.execute("SELECT assignee FROM tasks WHERE id = ?", (task_id,)).fetchone()
assert row["assignee"] == "default"
# 'assigned' event emitted for the audit trail
with kb.connect_closing() as conn:
evs = list(conn.execute(
"SELECT kind, payload FROM task_events WHERE task_id = ? AND kind = 'assigned'",
(task_id,),
))
assert len(evs) == 1
payload = json.loads(evs[0][1])
assert payload["assignee"] == "default"
assert payload["source"] == "kanban.default_assignee"
def test_dry_run_with_default_assignee_reports_without_mutating(isolated_kanban_home):
"""Dry-run mode: reports what WOULD happen (task in auto_assigned_default,
spawn entry) but does NOT mutate the DB. Operators using
`hermes kanban dispatch --dry-run` see the routing decision before
committing."""
kb, _home = isolated_kanban_home
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
task_id = kb.create_task(conn, title="t1", assignee=None)
with kb.connect_closing() as conn:
res = kb.dispatch_once(
conn, spawn_fn=_fake_spawn, dry_run=True,
default_assignee="default",
)
assert res.auto_assigned_default == [task_id]
assert len(res.spawned) == 1
with kb.connect_closing() as conn:
row = conn.execute("SELECT assignee FROM tasks WHERE id = ?", (task_id,)).fetchone()
# DB unchanged — dry_run did not commit the assignment.
assert row["assignee"] is None
def test_whitespace_default_assignee_treated_as_none(isolated_kanban_home):
"""Empty / whitespace-only default_assignee values must be treated as
'no fallback set' so a misconfigured kanban.default_assignee=' '
doesn't surprise operators by silently routing unassigned tasks."""
kb, _home = isolated_kanban_home
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
task_id = kb.create_task(conn, title="t1", assignee=None)
with kb.connect_closing() as conn:
res = kb.dispatch_once(
conn, spawn_fn=_fake_spawn, dry_run=False,
default_assignee=" ",
)
assert task_id in res.skipped_unassigned
assert not res.auto_assigned_default
def test_explicitly_assigned_task_untouched_by_default_assignee(isolated_kanban_home):
"""A task with an explicit assignee must NOT be touched by the
default_assignee logic; that fallback only applies to genuinely
unassigned rows."""
kb, _home = isolated_kanban_home
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
task_id = kb.create_task(conn, title="t1", assignee="default")
with kb.connect_closing() as conn:
res = kb.dispatch_once(
conn, spawn_fn=_fake_spawn, dry_run=False,
default_assignee="someother",
)
assert task_id not in res.auto_assigned_default
assert any(s[0] == task_id and s[1] == "default" for s in res.spawned)
def test_dispatch_result_has_auto_assigned_default_field():
"""Schema-level invariant: DispatchResult exposes the
auto_assigned_default field so CLI / dashboard / gateway can surface
the new routing decisions."""
from hermes_cli.kanban_db import DispatchResult
r = DispatchResult()
assert hasattr(r, "auto_assigned_default")
assert r.auto_assigned_default == []
@@ -0,0 +1,167 @@
"""Regression tests for #21582 — per-profile concurrency cap in dispatcher.
When ``kanban.max_in_progress_per_profile`` is set, no single profile
gets more than N workers running at once even if the global
``max_in_progress`` cap would allow it. Prevents one profile's local
model / API quota / browser pool from being overwhelmed by a fan-out.
"""
from __future__ import annotations
import os
import sys
import tempfile
import pytest
@pytest.fixture()
def isolated_kanban_home_with_profiles(monkeypatch):
"""Spin up a fresh HERMES_HOME with kanban DB + alpha/beta profiles."""
test_home = tempfile.mkdtemp(prefix="kanban_per_profile_cap_test_")
for prof in ("alpha", "beta", "default"):
os.makedirs(os.path.join(test_home, "profiles", prof), exist_ok=True)
monkeypatch.setenv("HERMES_HOME", test_home)
for mod in list(sys.modules.keys()):
if mod.startswith("hermes_cli") or mod.startswith("hermes_state") or mod == "hermes_constants":
del sys.modules[mod]
from hermes_cli import kanban_db
yield kanban_db
def _fake_spawn(*args, **kwargs):
return 12345
def test_no_cap_all_tasks_dispatched(isolated_kanban_home_with_profiles):
"""Baseline: with no per-profile cap, all ready tasks dispatch."""
kb = isolated_kanban_home_with_profiles
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
for i in range(5):
kb.create_task(conn, title=f"a{i}", assignee="alpha")
for i in range(3):
kb.create_task(conn, title=f"b{i}", assignee="beta")
with kb.connect_closing() as conn:
res = kb.dispatch_once(conn, spawn_fn=_fake_spawn, dry_run=True)
assert len(res.spawned) == 8
assert not res.skipped_per_profile_capped
def test_cap_2_balances_two_profiles(isolated_kanban_home_with_profiles):
"""With cap=2: 2 alpha + 2 beta dispatched; remaining 3 alpha + 1 beta
deferred to skipped_per_profile_capped."""
kb = isolated_kanban_home_with_profiles
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
for i in range(5):
kb.create_task(conn, title=f"a{i}", assignee="alpha")
for i in range(3):
kb.create_task(conn, title=f"b{i}", assignee="beta")
with kb.connect_closing() as conn:
res = kb.dispatch_once(
conn, spawn_fn=_fake_spawn, dry_run=True,
max_in_progress_per_profile=2,
)
spawn_assignees = [s[1] for s in res.spawned]
capped_assignees = [c[1] for c in res.skipped_per_profile_capped]
assert spawn_assignees.count("alpha") == 2
assert spawn_assignees.count("beta") == 2
assert capped_assignees.count("alpha") == 3
assert capped_assignees.count("beta") == 1
def test_pre_existing_running_counts_against_cap(isolated_kanban_home_with_profiles):
"""A task already in 'running' status when dispatch_once starts counts
toward the per-profile cap. With 1 alpha pre-running and cap=1, NO new
alpha tasks should spawn; beta is independent so 1 beta spawns."""
kb = isolated_kanban_home_with_profiles
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
running_alpha = kb.create_task(conn, title="running alpha", assignee="alpha")
with kb.write_txn(conn):
conn.execute(
"UPDATE tasks SET status = 'running', claim_lock = 'test:1' WHERE id = ?",
(running_alpha,),
)
for i in range(2):
kb.create_task(conn, title=f"a{i}", assignee="alpha")
for i in range(2):
kb.create_task(conn, title=f"b{i}", assignee="beta")
with kb.connect_closing() as conn:
res = kb.dispatch_once(
conn, spawn_fn=_fake_spawn, dry_run=True,
max_in_progress_per_profile=1,
)
spawn_assignees = [s[1] for s in res.spawned]
capped_assignees = [c[1] for c in res.skipped_per_profile_capped]
assert spawn_assignees.count("alpha") == 0
assert spawn_assignees.count("beta") == 1
assert capped_assignees.count("alpha") == 2
assert capped_assignees.count("beta") == 1
@pytest.mark.parametrize("cap", [0, -1, "abc", None])
def test_invalid_cap_treated_as_no_cap(isolated_kanban_home_with_profiles, cap):
"""Cap values that don't represent a positive int should be treated as
'no cap', silently falling through rather than crashing the dispatcher."""
kb = isolated_kanban_home_with_profiles
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
for i in range(3):
kb.create_task(conn, title=f"a{i}", assignee="alpha")
with kb.connect_closing() as conn:
res = kb.dispatch_once(
conn, spawn_fn=_fake_spawn, dry_run=True,
max_in_progress_per_profile=cap,
)
assert not res.skipped_per_profile_capped
assert len(res.spawned) == 3
def test_capped_tasks_dispatched_on_subsequent_tick(isolated_kanban_home_with_profiles):
"""A task deferred this tick because its profile was at cap should be
eligible for dispatch on the next tick (after running tasks complete).
This verifies the cap is per-tick state, not a permanent block."""
kb = isolated_kanban_home_with_profiles
with kb.connect_closing() as conn:
kb.create_board(slug="default", name="Test")
ids = [kb.create_task(conn, title=f"a{i}", assignee="alpha") for i in range(3)]
# First tick: cap=1, only 1 alpha dispatched
with kb.connect_closing() as conn:
res1 = kb.dispatch_once(
conn, spawn_fn=_fake_spawn, dry_run=False,
max_in_progress_per_profile=1,
)
assert len(res1.spawned) == 1
assert len(res1.skipped_per_profile_capped) == 2
# Simulate the running task completing: mark it done so the
# 'running' count drops.
spawned_id = res1.spawned[0][0]
with kb.connect_closing() as conn:
with kb.write_txn(conn):
conn.execute(
"UPDATE tasks SET status = 'done', claim_lock = NULL WHERE id = ?",
(spawned_id,),
)
# Second tick: 1 more alpha should now dispatch
with kb.connect_closing() as conn:
res2 = kb.dispatch_once(
conn, spawn_fn=_fake_spawn, dry_run=False,
max_in_progress_per_profile=1,
)
assert len(res2.spawned) == 1
assert len(res2.skipped_per_profile_capped) == 1
assert res2.spawned[0][0] != spawned_id # different task this time
def test_dispatch_result_has_skipped_per_profile_capped_field():
"""Schema-level invariant: DispatchResult exposes the
skipped_per_profile_capped field as a list of
(task_id, assignee, current_running) tuples."""
from hermes_cli.kanban_db import DispatchResult
r = DispatchResult()
assert hasattr(r, "skipped_per_profile_capped")
assert r.skipped_per_profile_capped == []
+96 -70
@@ -54,12 +54,7 @@ This behavior applies to the s6-based image only. Earlier (tini-based) images st
:::
:::note Where gateway logs go
Inside the s6 image, the supervised gateway's output is tee'd to two destinations:
- **`docker logs <container>`** — every line in real time (raw, no extra prefix). This is the same stream you'd get from a foreground gateway, so existing `docker logs --follow` / `--timestamps` / log-shipper integrations work unchanged.
- **`${HERMES_HOME}/logs/gateways/<profile>/current`** (mapped to `~/.hermes/logs/gateways/<profile>/current` on the host via the volume mount) — rotated, with an ISO 8601 timestamp prepended per line. Rotation is 10 archives × 1 MB each, so it can't fill the disk. This is what `hermes logs` reads and what survives container restarts.
The per-profile reconciler keeps a separate audit log at `${HERMES_HOME}/logs/container-boot.log` — one line per profile per container boot, recording whether each gateway was restored to its prior state.
See the [Where the logs go](#where-the-logs-go) section below for the full routing map (per-profile gateways, dashboard, boot reconciler, container-wide `docker logs`).
:::
Note: the API server is gated on `API_SERVER_ENABLED=true`. To expose it beyond `127.0.0.1` inside the container, also set `API_SERVER_HOST=0.0.0.0` and an `API_SERVER_KEY` (minimum 8 characters — generate one with `openssl rand -hex 32`). Example:
@@ -81,7 +76,7 @@ Opening any port on an internet facing machine is a security risk. You should no
## Running the dashboard
The built-in web dashboard runs as an optional side-process inside the same container as the gateway. Set `HERMES_DASHBOARD=1` to run the dashboard on container loopback (`127.0.0.1`) by default:
The built-in web dashboard runs as a supervised s6-rc service alongside the gateway in the same container. Set `HERMES_DASHBOARD=1` to bring it up:
```sh
docker run -d \
@@ -89,54 +84,38 @@ docker run -d \
--restart unless-stopped \
-v ~/.hermes:/opt/data \
-p 8642:8642 \
-p 9119:9119 \
-e HERMES_DASHBOARD=1 \
nousresearch/hermes-agent gateway run
```
The entrypoint starts `hermes dashboard` in the background (running as the non-root `hermes` user) before `exec`-ing the main command. Dashboard output is prefixed with `[dashboard]` in `docker logs` so it's easy to separate from gateway logs.
The dashboard is supervised by s6 — if it crashes, `s6-supervise` restarts it automatically after a short backoff. Dashboard stdout/stderr is forwarded to `docker logs <container>` (no prefix; the gateway's own output now lives in a per-profile s6-log file — see [Where the logs go](#where-the-logs-go) below — so the two streams don't clash).
| Environment variable | Description | Default |
|---------------------|-------------|---------|
| `HERMES_DASHBOARD` | Set to `1` (or `true` / `yes`) to launch the dashboard alongside the main command | *(unset — dashboard not started)* |
| `HERMES_DASHBOARD_HOST` | Bind address for the dashboard HTTP server | `127.0.0.1` |
| `HERMES_DASHBOARD` | Set to `1` (or `true` / `yes`) to enable the supervised dashboard service | *(unset — service is registered but stays down)* |
| `HERMES_DASHBOARD_HOST` | Bind address for the dashboard HTTP server | `0.0.0.0` |
| `HERMES_DASHBOARD_PORT` | Port for the dashboard HTTP server | `9119` |
| `HERMES_DASHBOARD_TUI` | Set to `1` to expose the in-browser Chat tab (embedded `hermes --tui` via PTY/WebSocket) | *(unset)* |
| `HERMES_DASHBOARD_INSECURE` | Set to `1` (or `true` / `yes`) to bind without the OAuth auth gate. Only use on trusted networks behind a reverse proxy without the OAuth contract — the dashboard exposes API keys and session data | *(unset — gate enforced when a `DashboardAuthProvider` is registered)* |
By default, the dashboard stays on loopback (`127.0.0.1`) to avoid exposing
the web surface over the network. To publish it intentionally, set
`HERMES_DASHBOARD_HOST=0.0.0.0`. The dashboard's OAuth auth gate engages
automatically whenever:
The dashboard inside the container defaults to binding `0.0.0.0` — without it, the published `-p 9119:9119` port would not be reachable from the host. To restrict the bind to container loopback (for sidecar / reverse-proxy setups), set `HERMES_DASHBOARD_HOST=127.0.0.1`.
1. The bind host is non-loopback, **and**
The dashboard's OAuth auth gate engages automatically when both of the following are true:
1. The bind host is non-loopback (e.g. the default `0.0.0.0` inside the container), **and**
2. A `DashboardAuthProvider` plugin is registered.
The bundled `dashboard_auth/nous` provider activates whenever
`HERMES_DASHBOARD_OAUTH_CLIENT_ID` is set (see
[Web Dashboard → Authentication](features/web-dashboard.md)). With the
gate engaged, browser callers are redirected to the configured portal's
OAuth flow before they can reach any protected route.
The bundled `dashboard_auth/nous` provider activates whenever `HERMES_DASHBOARD_OAUTH_CLIENT_ID` is set (see [Web Dashboard → Authentication](features/web-dashboard.md)). With the gate engaged, browser callers are redirected to the configured portal's OAuth flow before they can reach any protected route.
If no provider is registered and the bind is non-loopback, the dashboard
**fails closed at startup** with a specific error pointing at the
missing env var. To opt out of the gate explicitly — for a trusted-LAN
deployment behind your own reverse proxy without the OAuth contract —
set `HERMES_DASHBOARD_INSECURE=1`. This re-enables the legacy "no auth,
loud warning" mode and is the only path that disables the gate; the bind
host does not implicitly determine `--insecure` anymore.
If no provider is registered and the bind is non-loopback, the dashboard **fails closed at startup** with a specific error pointing at the missing env var. To opt out of the gate explicitly — for a trusted-LAN deployment behind your own reverse proxy without the OAuth contract — set `HERMES_DASHBOARD_INSECURE=1`. This is the **only** path that disables the gate; the bind host alone never implies `--insecure` (it used to, but that predated the OAuth gate and silently disabled it on every container-deployed dashboard).
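The rules above can be read as a small decision table. The following is only a sketch of that table, not the dashboard's actual code: `gate_mode`, `is_loopback`, and the returned mode strings are hypothetical names chosen for illustration, and treating `localhost` as loopback is an assumption.

```python
import ipaddress


def is_loopback(host: str) -> bool:
    """Return True when the bind host only accepts local connections."""
    if host == "localhost":  # assumption: treat the conventional name as loopback
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Unparseable hostnames are treated as non-loopback (the safer guess).
        return False


def gate_mode(host: str, provider_registered: bool, insecure: bool) -> str:
    """Hypothetical decision table; names and return values are illustrative."""
    if insecure:
        return "open"         # explicit opt-out, the only path that disables the gate
    if is_loopback(host):
        return "loopback"     # not reachable over the network, gate does not engage
    if provider_registered:
        return "gated"        # OAuth flow required before protected routes
    return "fail_closed"      # non-loopback bind with no provider: refuse to start
```

With the container default of `0.0.0.0`, this sketch returns `"gated"` when a provider is registered and `"fail_closed"` when none is, matching the behavior described above.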
:::note
The dashboard runs as a supervised s6 service inside the container. If
the dashboard process crashes, s6-overlay restarts it automatically
after a short backoff — you'll see a new PID without needing to
restart the container. Logs and crash output are visible via
`docker logs <container>` (s6 forwards service stdout/stderr there).
Running the dashboard as a separate container is not supported: its
gateway-liveness detection requires a shared PID namespace with the
gateway process.
:::warning `HERMES_DASHBOARD_INSECURE=1` exposes API keys
Opting out of the OAuth gate serves the dashboard's API surface (including model keys and session data) to anyone who can reach the published port. Only enable it when you have your own auth layer in front, or on a trusted LAN you fully control.
:::
Running the dashboard as a separate container is not supported: its gateway-liveness detection requires a shared PID namespace with the gateway process.
## Running interactively (CLI chat)
To open an interactive chat session against a running data directory:
@@ -179,37 +158,60 @@ Never run two Hermes **gateway** containers against the same data directory simu
## Multi-profile support
Hermes supports [multiple profiles](../reference/profile-commands.md) — separate `~/.hermes/` directories that let you run independent agents (different SOUL, skills, memory, sessions, credentials) from a single installation. **When running under Docker, using Hermes' built-in multi-profile feature is not recommended.**
Hermes supports [multiple profiles](../reference/profile-commands.md) — separate `~/.hermes/` subdirectories that let you run independent agents (different SOUL, skills, memory, sessions, credentials) from a single installation. **Inside the official Docker image, the s6 supervision tree treats each profile as a first-class supervised service**, so the recommended deployment is **one container hosting all profiles**.
Instead, the recommended pattern is **one container per profile**, with each container bind-mounting its own host directory as `/opt/data`:
Each profile created with `hermes profile create <name>` gets:
- A dedicated s6 service slot at `/run/service/gateway-<name>/`, registered dynamically by the runtime — no container rebuild required.
- Auto-restart on crash, backoff-managed by `s6-supervise`.
- Per-profile rotated logs at `${HERMES_HOME}/logs/gateways/<name>/current` (10 archives × 1 MB each).
- State persistence across container restarts: the boot-time reconciler reads `gateway_state.json` from each profile directory and brings the slot back up only for profiles whose last recorded state was `running`. Stopped profiles stay stopped.
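To make the state-persistence rule concrete, here is a minimal sketch of the selection step a boot-time reconciler might perform. It is not the image's actual reconciler: the function name `profiles_to_restore` is hypothetical, and the assumption that `gateway_state.json` holds a top-level `"state"` key is made purely for illustration, since the file's schema is not shown here.

```python
import json
from pathlib import Path


def profiles_to_restore(hermes_home: str) -> list:
    """Return profile names whose last recorded gateway state was 'running'.

    Assumes (hypothetically) that each profile directory contains a
    gateway_state.json with a top-level "state" key.
    """
    root = Path(hermes_home) / "profiles"
    restore = []
    if not root.is_dir():
        return restore
    for profile_dir in sorted(root.iterdir()):
        state_file = profile_dir / "gateway_state.json"
        try:
            data = json.loads(state_file.read_text())
        except (OSError, ValueError):
            # Missing or unreadable state: leave the profile stopped.
            continue
        if isinstance(data, dict) and data.get("state") == "running":
            restore.append(profile_dir.name)
    return restore
```

Profiles with a missing or malformed state file are skipped in this sketch, which keeps them stopped rather than guessing.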
The lifecycle commands you'd run on the host work the same way from inside the container:
```sh
# Work profile
docker run -d \
--name hermes-work \
--restart unless-stopped \
-v ~/.hermes-work:/opt/data \
-p 8642:8642 \
nousresearch/hermes-agent gateway run
# Create a profile — registers the gateway-<name> s6 slot.
docker exec hermes hermes profile create coder
# Personal profile
docker run -d \
--name hermes-personal \
--restart unless-stopped \
-v ~/.hermes-personal:/opt/data \
-p 8643:8642 \
nousresearch/hermes-agent gateway run
# Start / stop / restart — dispatches s6-svc; the gateway lifecycle survives docker restart.
docker exec hermes hermes -p coder gateway start
docker exec hermes hermes -p coder gateway stop
docker exec hermes hermes -p coder gateway restart
# Status — reports `Manager: s6 (container supervisor)` inside the container.
docker exec hermes hermes -p coder gateway status
# Remove a profile — tears down the s6 slot too.
docker exec hermes hermes profile delete coder
```
Why separate containers over profiles in Docker:
Under the hood, `hermes gateway start/stop/restart` inside the container is intercepted and routed to `s6-svc` against the right service directory; you don't need to learn the s6 commands directly. For raw supervisor state, use `/command/s6-svstat /run/service/gateway-<name>` (note `/command/` is on PATH only for processes spawned by the supervision tree — when calling from `docker exec`, pass the absolute path).
- **Isolation** — each container has its own filesystem, process table, and resource limits. A crash, dependency change, or runaway session in one profile can't affect another.
- **Independent lifecycle** — upgrade, restart, pause, or roll back each agent separately (`docker restart hermes-work` leaves `hermes-personal` untouched).
- **Clean port and network separation** — each gateway binds its own host port; there's no risk of cross-talk between chat platforms or API servers.
- **Simpler mental model** — the container *is* the profile. Backups, migrations, and permissions all follow the bind-mounted directory, with no extra `--profile` flags to remember.
- **Avoids concurrent-write risk** — the warning above about never running two gateways against the same data directory still applies to profiles within a single container.
### Why one container with many profiles, not many containers
In Docker Compose, this just means declaring one service per profile with distinct `container_name`, `volumes`, and `ports`:
Before the s6 migration, "one container per profile" was the recommended pattern because there was no in-container supervisor to manage multiple gateways. With s6 as PID 1, that's no longer necessary, and the single-container layout is simpler in almost every dimension:
| | One container, many profiles | One container per profile |
|---|---|---|
| Disk overhead | One image, one bundled venv, one Playwright cache | N images / N caches |
| Memory overhead | Shared Python interpreter cache, shared node_modules | Duplicated per container |
| Profile creation | `docker exec ... hermes profile create <name>` (seconds) | New `docker run` invocation + port allocation + bind-mount config |
| Per-profile crash recovery | `s6-supervise` auto-restart | Docker's `--restart unless-stopped` (slower, kills sibling work) |
| Logs | Per-profile rotated file via `s6-log`, plus container-boot audit log | `docker logs <name>` per container — no built-in rotation |
| Backup | One `~/.hermes` directory | N directories to coordinate |
The default profile (`default`) is always registered on first boot, so a fresh container ships with one supervised gateway out of the box. Additional profiles are pure runtime adds.
### When you DO want a separate container
Profile-in-container is the default. Run a separate container per profile only when you have a specific reason:
- **Resource isolation per workload** — e.g. a runaway browser-tool session in profile A shouldn't be able to OOM profile B. Containers give you `--memory` / `--cpus` per profile.
- **Independent image pinning** — different upstream image tags per workload.
- **Network segmentation** — distinct Docker networks per profile (e.g. one customer-facing, one internal).
- **Compliance / blast radius** — distinct credentials never share an OS-level process tree.
In those cases, declare one service per profile with distinct `container_name`, `volumes`, and `ports`:
```yaml
services:
@@ -234,6 +236,24 @@ services:
- ~/.hermes-personal:/opt/data
```
The warning from [Persistent volumes](#persistent-volumes) still applies: never point two containers at the same `~/.hermes` directory simultaneously. The s6 supervisor inside each container manages its own profile set; cross-container sharing of a data volume corrupts session files and memory stores.
## Where the logs go
The s6 container has four distinct log surfaces, and "why isn't my gateway showing anything in `docker logs`" is a common surprise. Cheatsheet:
| Source | Where it lands | How to read it |
|---|---|---|
| **Per-profile gateway** (`hermes gateway run` and per-profile gateways under s6) | Tee'd to two places: `docker logs <container>` (real time, no extra prefix) **and** `${HERMES_HOME}/logs/gateways/<profile>/current` (rotated, ISO-8601 timestamped, 10 archives × 1 MB each) | `docker logs -f hermes` or `tail -F ~/.hermes/logs/gateways/default/current` on the host |
| **Dashboard** (when `HERMES_DASHBOARD=1`) | `docker logs <container>` (no prefix) | `docker logs -f hermes` — interleaved with gateway lines |
| **Boot reconciler** (records which profile gateways were restored on each container start) | `${HERMES_HOME}/logs/container-boot.log` (append-only audit log) | `tail -F ~/.hermes/logs/container-boot.log` |
| **Generic Hermes logs** (`agent.log`, `errors.log`) | `${HERMES_HOME}/logs/` (profile-aware) | `docker exec hermes hermes logs --follow [--level WARNING] [--session <id>]` |
Two practical consequences worth knowing:
- The file copy at `logs/gateways/<profile>/current` is what survives container restarts. `docker logs` only retains output from the current container's lifetime (and is wiped on `docker rm`); the rotated files persist on the bind-mounted volume.
- The boot reconciler's audit line shape is `<iso-timestamp> profile=<name> prior_state=<state> action=<registered|started>`, so a quick `grep profile=coder ~/.hermes/logs/container-boot.log` reveals when a given profile was last restored and whether s6 auto-started it.
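For anything beyond a quick `grep`, the audit line shape above is simple enough to parse directly. The helper below is a sketch under the assumption that fields are space-separated `key=value` pairs following the timestamp; `parse_boot_line` is a hypothetical name, and the sample line in the usage note is made up to match the documented shape.

```python
def parse_boot_line(line: str) -> dict:
    """Split one audit line into a dict of its key=value fields.

    Assumes the shape '<iso-timestamp> key=value key=value ...'.
    """
    timestamp, _, rest = line.strip().partition(" ")
    fields = dict(pair.split("=", 1) for pair in rest.split() if "=" in pair)
    fields["timestamp"] = timestamp
    return fields
```

For example, `parse_boot_line("2025-01-01T00:00:00Z profile=coder prior_state=running action=started")` yields a dict whose `profile` is `coder` and whose `action` is `started`.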
## Environment variable forwarding
API keys are read from `/opt/data/.env` inside the container. You can also pass environment variables directly:
@@ -281,7 +301,7 @@ services:
cpus: "2.0"
```
Start with `docker compose up -d` and view logs with `docker compose logs -f`. Dashboard output is prefixed with `[dashboard]` so it's easy to filter from gateway logs.
Start with `docker compose up -d` and view logs with `docker compose logs -f`. The supervised gateway's stdout is also tee'd to `${HERMES_HOME}/logs/gateways/<profile>/current` on the volume — see [Where the logs go](#where-the-logs-go) for the full routing map.
## Optional: Linux desktop audio bridge
@@ -415,24 +435,28 @@ The container ENTRYPOINT is now `/init` (s6-overlay), not `/usr/bin/tini`. All f
Do not override the image entrypoint unless you keep `/init` (or, equivalently, the legacy `docker/entrypoint.sh` shim that forwards to the stage2 hook) in the command chain. s6-overlay's `/init` runs as root so it can chown the volume on first boot, then drops to the `hermes` user via `s6-setuidgid` for every supervised service AND for the main program. Starting `hermes gateway run` as root inside the official image is refused by default because it can leave root-owned files in `/opt/data` and break later dashboard or gateway starts. Set `HERMES_ALLOW_ROOT_GATEWAY=1` only when you intentionally accept that risk.
:::
### Per-profile gateway supervision
### `docker exec` automatically drops to the `hermes` user
Inside the container, each profile created with `hermes profile create <name>` automatically gets an s6-supervised gateway service registered at `/run/service/gateway-<name>/`. The lifecycle commands you'd run on the host work the same way:
`docker exec hermes <cmd>` defaults to running as root inside the container, but the image ships a thin shim at `/opt/hermes/bin/hermes` (earliest on PATH) that detects root callers and transparently re-execs through `s6-setuidgid hermes`. So `docker exec hermes login`, `docker exec hermes profile create …`, `docker exec hermes setup`, etc. all write files owned by UID 10000 — i.e. readable by the supervised gateway — with no extra `--user` flag needed. Non-root callers (the supervised processes themselves, `docker exec --user hermes`, kanban subagents inside the container) hit a short-circuit that exec's the venv binary directly, so there's no overhead on the hot paths.
If you specifically need a `docker exec` that retains root semantics (diagnostic sessions, inspecting root-only state, files outside `/opt/data` that root happens to own), opt out per invocation:
```sh
hermes profile create coder # registers gateway-coder s6 slot
hermes -p coder gateway start # s6-svc -u → supervised gateway
hermes -p coder gateway stop # s6-svc -d → service down
hermes -p coder gateway restart # s6-svc -t → SIGTERM the supervisor
hermes profile delete coder # tears down the s6 slot
docker exec -e HERMES_DOCKER_EXEC_AS_ROOT=1 hermes <cmd>
```
The shim accepts `1` / `true` / `yes` (case-insensitive). Anything else, including typos like `=0`, falls through to the drop, so silent opt-outs aren't possible. If `s6-setuidgid` isn't available (custom builds that stripped s6-overlay), the shim refuses to run as root and exits 126, surfacing the broken privilege model loudly. That avoids regressing to the historical footgun where `docker exec hermes login` would write `auth.json` as `root:root` and break the supervised gateway's auth on every chat platform message.
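The opt-out parsing described above can be sketched roughly as follows. This is an illustration only, not the shim's actual source: the function name `wants_root` is hypothetical, and the only behavior taken from the text is that `1`, `true`, and `yes` (in any letter case) keep root while every other value falls through to the privilege drop.

```sh
#!/bin/sh
# Hypothetical helper, not the real shim: succeeds (exit 0) only when the
# value is 1, true, or yes, compared case-insensitively.
wants_root() {
    # Lowercase the argument so TRUE, Yes, yEs all match.
    value=$(printf '%s' "${1:-}" | tr '[:upper:]' '[:lower:]')
    case "$value" in
        1|true|yes) return 0 ;;
        *)          return 1 ;;   # empty, 0, typos: fall through to the drop
    esac
}

# Walk a few sample values through the check.
for candidate in 1 TRUE Yes 0 "" ture; do
    if wants_root "$candidate"; then
        echo "'$candidate': keep root"
    else
        echo "'$candidate': drop to hermes"
    fi
done
```

Defaulting to the drop for anything unrecognized means a misspelled value can only ever produce the safer outcome.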
### Per-profile gateway supervision
Each profile created with `hermes profile create <name>` automatically gets an s6-supervised gateway service registered at `/run/service/gateway-<name>/`, with state-persistent auto-restart across container restarts. See [Multi-profile support](#multi-profile-support) above for the user-facing workflow and the lifecycle commands.
**Supervision benefits over the pre-s6 image:**
- Gateway crashes are auto-restarted by `s6-supervise` after a ~1s backoff.
- Dashboard crashes are auto-restarted (set `HERMES_DASHBOARD=1` to start it).
- Dashboard, when enabled with `HERMES_DASHBOARD=1`, is supervised on the same supervision tree and gets the same auto-restart treatment.
- `docker restart` preserves running gateways: the cont-init reconciler reads `$HERMES_HOME/profiles/<name>/gateway_state.json` and brings the slot back up if the last recorded state was `running`. Stopped gateways stay stopped.
- Per-profile gateway logs persist under `$HERMES_HOME/logs/gateways/<profile>/current` (rotated by `s6-log`), and the reconciler's actions are appended to `$HERMES_HOME/logs/container-boot.log` per boot.
- Per-profile gateway logs persist under `$HERMES_HOME/logs/gateways/<profile>/current` (rotated by `s6-log`), and the reconciler's actions are appended to `$HERMES_HOME/logs/container-boot.log` per boot. See [Where the logs go](#where-the-logs-go) for the full routing map.
`hermes status` inside the container reports `Manager: s6 (container supervisor)`. Use `/command/s6-svstat /run/service/gateway-<name>` for the raw supervisor view (note `/command/` is on PATH for supervision-tree processes only; pass the absolute path when calling from `docker exec`).
@@ -692,6 +716,8 @@ The container's stage2 hook drops privileges to the non-root `hermes` user (UID
chmod -R 755 ~/.hermes
```
`docker exec hermes <cmd>` automatically drops to UID 10000 too — see [`docker exec` automatically drops to the `hermes` user](#docker-exec-automatically-drops-to-the-hermes-user) for details and the per-invocation opt-out.
### Browser tools not working
Playwright needs shared memory. Add `--shm-size=1g` to your Docker run command: