hermes-agent

Author	SHA1	Message	Date
Ben	fb51253620	docker: opt in to dashboard --insecure via env var, never derive from bind host The s6 dashboard run script flipped `--insecure` on whenever `HERMES_DASHBOARD_HOST` was anything other than 127.0.0.1 / localhost. The script's justifying comment ("the dashboard refuses otherwise") predates the OAuth auth gate: back when it was written, `start_server` would SystemExit on any non-loopback bind, so the run script's `--insecure` was the only way to make in-container deployments work at all. The gate has since been replaced by `should_require_auth(host, allow_public)`, which engages the OAuth flow when a `DashboardAuthProvider` is registered (the bundled `dashboard_auth/nous` provider auto-registers on `HERMES_DASHBOARD_OAUTH_CLIENT_ID`) and fails closed with a specific operator-facing error when none is. The host-derived `--insecure` ran upstream of all that and silently disabled the gate on every container-deployed dashboard. This was most visible under the portal's wildcard-subdomain rollout: every Fly machine binds 0.0.0.0 so the edge can reach Flycast, every machine boots with the correct `HERMES_DASHBOARD_OAUTH_CLIENT_ID`, and the nous provider registers, yet `/api/status` still returns `{"auth_required": false, "auth_providers": ["nous"]}` because the run script disabled the gate before `start_server` ever saw the request. The dashboard SPA was served to anyone, with no `/login` redirect and no OAuth challenge. Fix: derive `--insecure` from an explicit opt-in env var, `HERMES_DASHBOARD_INSECURE` (truthy values match the rest of the s6 boolean envs: 1, true, TRUE, True, yes, YES, Yes). Operators on trusted LANs behind a reverse proxy without the OAuth contract (the existing `docker-compose.windows.yml` use case) opt in explicitly; portal-managed agent deployments leave it unset and let the gate engage. `docker-compose.windows.yml` already passes `--insecure` directly in its `command:` array (line 38), so it doesn't depend on the s6 auto-injection. No compose-file change is required. 
Tests:
* `tests/test_docker_home_override_scripts.py`: extends the existing static-text guard with a regression assertion that the legacy host-derived case-statement is gone and the new env-var opt-in is present (locks against accidental revert).
* `tests/docker/test_dashboard.py`: adds two Docker-in-Docker tests exercising the actual `/api/status` round-trip:
  - 0.0.0.0 bind + `HERMES_DASHBOARD_OAUTH_CLIENT_ID` → gate engaged
  - 0.0.0.0 bind + `HERMES_DASHBOARD_INSECURE=1` → gate disabled
Docs:
* `website/docs/user-guide/docker.md` + zh-Hans i18n: adds the new env var to the table and replaces the stale prose ("the entrypoint no longer auto-enables insecure mode", which until this PR was flat-out wrong) with an accurate description of the gate's trigger conditions and the explicit opt-out.
shellcheck clean. Python static-text test passes locally. Behavioural test will run against any future image build (CI's Docker harness).	2026-05-29 09:56:40 +10:00
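A minimal Python sketch of the decision change, for illustration only: the real logic lives in the s6 dashboard run script (shell), and the function names `legacy_wants_insecure` and `wants_insecure` are hypothetical. The accepted spellings are the seven values listed above, matched exactly, so a value such as `tRuE` is rejected.

```python
from typing import Mapping

# Exact spellings listed in the commit message for the s6 boolean envs.
TRUTHY_VALUES = frozenset({"1", "true", "TRUE", "True", "yes", "YES", "Yes"})


def legacy_wants_insecure(host: str) -> bool:
    """Old behaviour (removed): any non-loopback bind host implied --insecure."""
    return host not in ("127.0.0.1", "localhost")


def wants_insecure(env: Mapping[str, str]) -> bool:
    """New behaviour: only an explicit HERMES_DASHBOARD_INSECURE opt-in adds --insecure.

    The bind host is deliberately not an input, so binding 0.0.0.0 leaves the
    OAuth gate in charge.
    """
    return env.get("HERMES_DASHBOARD_INSECURE", "") in TRUTHY_VALUES


if __name__ == "__main__":
    fly_machine_env = {"HERMES_DASHBOARD_OAUTH_CLIENT_ID": "example-client-id"}
    print(legacy_wants_insecure("0.0.0.0"))  # True: gate silently disabled
    print(wants_insecure(fly_machine_env))    # False: gate engages
```

The point of the new shape is that the function has no `host` parameter at all, so no bind address can ever flip the flag.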
Dusk	c341a2d107	fix(docker): align HOME for dashboard and s6 gateway services (#33481)	2026-05-28 13:42:27 +10:00
Ben Barclay	aeb992d343	fix(docker): drop `docker exec` to hermes uid before invoking the CLI When operators ran `docker exec <c> hermes login` (or anything else that wrote under $HERMES_HOME), the command ran as root by default, leaving /opt/data/auth.json root:root with mode 0600. The supervised gateway (UID 10000) then couldn't read its own credentials and returned "Provider authentication failed: Hermes is not logged into Nous Portal" on every Telegram/Discord/etc. message, even though `docker exec <c> hermes chat -q ping` (also root) succeeded because root could read its own root-owned file. _load_auth_store swallowed the PermissionError as a parse failure and copied the file aside as auth.json.corrupt, making the diagnostic even more misleading. Fix: install a privilege-drop shim at /opt/hermes/bin/hermes, prepended ahead of the venv on PATH. When invoked as root, the shim execs the real venv binary via `s6-setuidgid hermes`, so any file the docker-exec session writes is uid-aligned with the supervised processes. Non-root callers (the supervised processes themselves, `docker exec --user hermes`, kanban subagents, anything inside the container that doesn't come through docker exec) hit a single exec to the absolute venv path with no privilege change. Recursion is impossible: the shim execs the venv binary by absolute path (/opt/hermes/.venv/bin/hermes), so the second hop cannot re-enter the shim regardless of PATH state. No sentinel env var is needed (unlike #33583's gateway-run redirect, which DOES need HERMES_S6_SUPERVISED_CHILD because there's no absolute-path equivalent for the s6 dispatch). Opt-out: `docker exec -e HERMES_DOCKER_EXEC_AS_ROOT=1 …` for diagnostic sessions where the operator deliberately wants root. Truthiness is strict (1/true/yes, case-insensitive), so any other value, such as `=0` or a typo, does not opt out, mirroring HERMES_GATEWAY_NO_SUPERVISE in #33583. 
If `s6-setuidgid` is missing (someone stripped s6-overlay in a downstream fork), the shim exits 126 with a remediation message pointing at `--user hermes` and the opt-out; it never silently runs as root. Test plan:
- tests/docker/test_docker_exec_privilege_drop.py: 11 tests
  - shim drops root to hermes uid (file ownership check)
  - shim short-circuits for non-root docker exec
  - HERMES_DOCKER_EXEC_AS_ROOT=1 keeps root
  - strict-truthiness parametrization (5 falsy values reject)
  - main CMD path unaffected (recursion guard)
  - E2E: every file written by docker-exec is readable by uid 10000
- Full tests/docker/ harness: 32/32 pass against fresh image build
- shellcheck --severity=error: clean
- hadolint: clean
- Manual: reproduced the original symptom (root-owned auth.json) by bypassing the shim; confirmed default docker-exec produces hermes-owned files; confirmed the opt-out env keeps root semantics.
Known follow-up: this prevents NEW instances of the bug. Volumes that already have a root:root /opt/data/auth.json from a pre-shim image need a one-time `chown hermes:hermes` before rebooting onto the new image. A stage2-hook chown sweep could self-heal that, but is deferred per scope decision.	2026-05-28 13:30:36 +10:00
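The shim's decision table, restated as a Python sketch for illustration. The real shim is a shell script at /opt/hermes/bin/hermes; `shim_argv` and `strictly_truthy` are hypothetical names, and the remediation text is paraphrased rather than copied.

```python
import sys
from typing import Mapping, Optional, Sequence

# Absolute venv path from the commit message; exec'ing it directly is what
# makes recursion back into the shim impossible.
VENV_HERMES = "/opt/hermes/.venv/bin/hermes"


def strictly_truthy(value: Optional[str]) -> bool:
    """1/true/yes, case-insensitive; "0", "no", "" and typos are all False."""
    return value is not None and value.lower() in {"1", "true", "yes"}


def shim_argv(
    euid: int,
    env: Mapping[str, str],
    setuidgid: Optional[str],
    args: Sequence[str],
) -> list[str]:
    """Return the argv the shim would exec (hypothetical helper, not the real shim)."""
    if euid != 0:
        # Supervised processes, `docker exec --user hermes`, subagents: no privilege change.
        return [VENV_HERMES, *args]
    if strictly_truthy(env.get("HERMES_DOCKER_EXEC_AS_ROOT")):
        # Deliberate root diagnostic session.
        return [VENV_HERMES, *args]
    if setuidgid is None:
        # Fail closed rather than silently running as root.
        print(
            "s6-setuidgid not found; re-run with `docker exec --user hermes` "
            "or set HERMES_DOCKER_EXEC_AS_ROOT=1",
            file=sys.stderr,
        )
        raise SystemExit(126)
    return [setuidgid, "hermes", VENV_HERMES, *args]
```

A real caller would hand the result to `os.execv(argv[0], argv)`, with `argv = shim_argv(os.geteuid(), os.environ, shutil.which("s6-setuidgid"), sys.argv[1:])`.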
Ben	3e33e14335	fix(docker): discover agent-browser Chromium binary at boot The image's Dockerfile runs npx playwright install chromium, which populates $PLAYWRIGHT_BROWSERS_PATH (=/opt/hermes/.playwright) with a `chromium_headless_shell-<build>/chrome-headless-shell-linux64/` tree. agent-browser (the runtime CLI Hermes spawns for the browser tool) doesn't recognise this layout in its own cache scan and fails with `Auto-launch failed: Chrome not found`, even though the binary is right there. Reproduction on current main:
  $ docker run --rm <image> sh -c 'npx -y agent-browser snapshot --url about:blank'
  ✗ Auto-launch failed: Chrome not found. Checked:
    - agent-browser cache: /tmp/.../.agent-browser/browsers
    - System Chrome installations
    - Puppeteer browser cache
    - Playwright browser cache
  Run `agent-browser install` to download Chrome, or use --executable-path.
Fix: at boot, locate the binary under $PLAYWRIGHT_BROWSERS_PATH and export AGENT_BROWSER_EXECUTABLE_PATH via /run/s6/container_environment so the with-contenv shebang on main-wrapper.sh propagates it into the supervised `hermes` process and from there to agent-browser subprocesses. The match is on filename (chrome / chromium / chrome-headless-shell / chromium-browser), not on path: the chromium dir contains many shared libraries (libGLESv2.so, libEGL.so, ...) which inherit the executable bit from Playwright's tarball but are NOT browser binaries. Compare PR #18635's earlier `find | grep -Ei 'chrome|chromium'`, which would match the path .../chrome-headless-shell-linux64/libGLESv2.so and pick a .so as the browser binary. User overrides (e.g. `-e AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/...`) are respected: the discovery block is skipped when the env var is already set. Discovery is also quietly skipped when $PLAYWRIGHT_BROWSERS_PATH doesn't exist (e.g. custom builds that strip Playwright). 
This salvages PR #18635 by @jackey8616, who identified the bug and proposed the same env-var approach but in the now-deprecated docker/entrypoint.sh shim and with a path-match find command that selected .so files instead of the chrome binary. The fix retargets docker/stage2-hook.sh (the s6-overlay cont-init script where boot-time env setup belongs) with a corrected filename-match query. Fixes #15697 Closes #18635 Co-authored-by: Clooooode <12930377+jackey8616@users.noreply.github.com>	2026-05-27 20:43:27 +10:00
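A Python sketch of the filename-matched discovery described above. The real logic is shell in docker/stage2-hook.sh; `find_browser_binary` and `resolve_executable_path` are hypothetical names, and walking in sorted order is an assumption added to make the choice deterministic.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Match on the file *name*, never on the path: libGLESv2.so lives under
# .../chrome-headless-shell-linux64/ and is executable, but is not a browser.
BROWSER_NAMES = frozenset(
    {"chrome", "chromium", "chrome-headless-shell", "chromium-browser"}
)


def find_browser_binary(browsers_root: str) -> Optional[str]:
    """Return the first executable whose basename is a known browser name."""
    root = Path(browsers_root)
    if not root.is_dir():
        return None  # e.g. a custom build that strips Playwright
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            candidate = Path(dirpath) / name
            if name in BROWSER_NAMES and os.access(candidate, os.X_OK):
                return str(candidate)
    return None


def resolve_executable_path(env: Mapping[str, str]) -> Optional[str]:
    """Respect a user override; otherwise fall back to discovery."""
    override = env.get("AGENT_BROWSER_EXECUTABLE_PATH")
    if override:
        return override
    browsers = env.get("PLAYWRIGHT_BROWSERS_PATH")
    return find_browser_binary(browsers) if browsers else None
```

Because the `.so` check is by exact basename, `libGLESv2.so` can never satisfy `name in BROWSER_NAMES`, whatever directory it sits in.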
Ben	fb298a958c	fix(docker): mkdir HERMES_HOME as root in stage2 before chown / privilege drop (#18488) When HERMES_HOME points at a custom path whose parent directories only root can create (e.g. HERMES_HOME=/home/hermes/.hermes in a Compose file, or any path under / that the image doesn't pre-create), stage2-hook.sh fails on first boot:
  [stage2] Warning: chown failed (rootless container?) - continuing
  mkdir: cannot create directory '/custom': Permission denied
  mkdir: cannot create directory '/custom': Permission denied
  ... (one per s6-setuidgid hermes mkdir invocation)
  cont-init: info: /etc/cont-init.d/01-hermes-setup exited 1
The mkdirs fail because s6-setuidgid drops to hermes (UID 10000) before invoking mkdir -p, and the runtime user has no permission to create root-owned ancestor directories. 02-reconcile-profiles then crashes with FileNotFoundError, .install_method never lands, and the container limps on in a half-initialized state. Bootstrap HERMES_HOME with mkdir -p while still root, before the ownership normalization. This is idempotent on the default /opt/data path (the directory already exists from the Dockerfile's RUN mkdir -p) and on any subsequent restart. (#18482) Retargeted from the original PR's docker/entrypoint.sh (now a deprecated shim) to docker/stage2-hook.sh, where the related chown logic moved during the s6-overlay rework. Co-authored-by: wpengpeng168 <133926080+wpengpeng168@users.noreply.github.com>	2026-05-27 17:16:40 +10:00
Ben	c3bdb2af37	ci(docker): add shellcheck shell=sh directive to main-wrapper.sh shellcheck doesn't recognize the s6-overlay `#!/command/with-contenv sh` shebang and aborts with SC1008 ("This shebang was unrecognized. ShellCheck only supports sh/bash/dash/ksh/'busybox sh'. Add a 'shell' directive to specify."). The error fires at --severity=error too, so it fails the "Docker / shell lint" CI job on every PR that touches docker/. Add the canonical `# shellcheck shell=sh` directive, the same fix already applied to the sibling cont-init.d scripts (`02-reconcile-profiles` and `015-supervise-perms`) when they adopted the with-contenv shebang. The shebang was changed from `#!/bin/sh` → `#!/command/with-contenv sh` in PR #32412 (commit `29c71e9`) to fix env-propagation through s6's PID 1. The shellcheck-directive line was missed in that PR; this patches it. Reproduces locally:
  docker run --rm -v "$PWD:/mnt" -w /mnt koalaman/shellcheck:stable \
    --severity=error --format=gcc docker/main-wrapper.sh
  Before: docker/main-wrapper.sh:1:1: error: [SC1008] (rc=1)
  After:  (no output) (rc=0)
Script behavior is unchanged: the directive is a comment, and `sh -n` / `bash -n` parse the file cleanly either way.	2026-05-27 16:32:35 +10:00
Ben	22eb4d13f7	fix(docker): chown ui-tui and node_modules on UID remap so TUI esbuild works (#28851) When HERMES_UID remaps the hermes user from 10000 to another UID (e.g. matching the host user's UID for bind-mount ergonomics), the TUI launcher's esbuild step fails:
  ✘ [ERROR] Failed to write to output file: open /opt/hermes/ui-tui/dist/entry.js: permission denied
  TUI build failed.
This is because the Dockerfile's build-time `chown -R hermes:hermes` on `/opt/hermes/{.venv,ui-tui,node_modules}` (line 154) wrote UID 10000, and stage2-hook.sh only re-chowned `.venv` on UID remap, leaving the TUI build trees owned by the old UID. Extend the stage2 re-chown to the same set as the build-time chown: `.venv`, `ui-tui`, `node_modules`. These are the runtime-writable trees under $INSTALL_DIR; everything else under /opt/hermes is read-only at runtime, so keeping it root-owned is fine. The original fix targeted docker/entrypoint.sh, which is now a deprecated shim; it was retargeted to docker/stage2-hook.sh, where the .venv chown moved during the s6-overlay rework. Co-authored-by: Andreas Steffan <623481+deas@users.noreply.github.com>	2026-05-27 15:41:48 +10:00
Ben	9eadb6805c	fix(docker): targeted chown to preserve host file ownership in HERMES_HOME (#19795) Replaces the recursive chown of $HERMES_HOME in stage2-hook.sh with a targeted approach: chown the top-level dir (so hermes can create new subdirs) plus the specific hermes-owned subdirectories (cron/, sessions/, logs/, hooks/, memories/, skills/, skins/, plans/, workspace/, home/, profiles/), the same canonical list seeded by the s6-setuidgid mkdir -p block below. This avoids clobbering host-side file ownership when $HERMES_HOME is a bind mount that contains user-owned files not managed by hermes (issue #19788). The original fix targeted docker/entrypoint.sh, which is now a deprecated shim; it was retargeted to docker/stage2-hook.sh, where the recursive chown moved during the s6-overlay rework. Co-authored-by: Ptichalouf <1809721+ptichalouf@users.noreply.github.com>	2026-05-27 15:08:41 +10:00
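The targeted ownership pass from the two commits above, sketched in Python. The real code is shell in stage2-hook.sh; `chown_plan` and `apply_chown` are hypothetical names. The commit does not say whether the canonical subdirectories are chowned recursively, so this sketch assumes they are; the top-level $HERMES_HOME itself is deliberately non-recursive, which is what leaves host-owned files alone. The install trees are included only on UID remap, as the 22eb4d13f7 commit describes.

```python
import os
from pathlib import Path
from typing import List, Tuple

# Canonical hermes-owned subdirectories listed in the commit message.
HERMES_SUBDIRS = (
    "cron", "sessions", "logs", "hooks", "memories", "skills",
    "skins", "plans", "workspace", "home", "profiles",
)
# Runtime-writable trees under the install dir that need re-chowning on UID remap.
INSTALL_WRITABLE = (".venv", "ui-tui", "node_modules")


def chown_plan(hermes_home, install_dir, uid_remapped: bool) -> List[Tuple[Path, bool]]:
    """Return (path, recursive) pairs; the top-level home is never recursive."""
    home = Path(hermes_home)
    plan = [(home, False)]
    plan += [(home / sub, True) for sub in HERMES_SUBDIRS]  # assumption: recursive
    if uid_remapped:
        plan += [(Path(install_dir) / tree, True) for tree in INSTALL_WRITABLE]
    return plan


def apply_chown(plan, uid: int, gid: int) -> List[Path]:
    """chown each planned path (skipping missing ones); return what was touched."""
    touched: List[Path] = []
    for path, recursive in plan:
        if not path.exists():
            continue
        os.chown(path, uid, gid)
        touched.append(path)
        if recursive and path.is_dir():
            for dirpath, dirnames, filenames in os.walk(path):
                for name in dirnames + filenames:
                    child = Path(dirpath) / name
                    os.chown(child, uid, gid, follow_symlinks=False)
                    touched.append(child)
    return touched
```

A host file sitting directly in $HERMES_HOME (for example a user's `notes.txt` on a bind mount) is never in the plan, so its ownership survives every restart.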
John Paul Soliva	29c71e972a	fix(docker): propagate container env through s6 to cont-init and main CMD s6-overlay's /init scrubs the environment before invoking both /etc/cont-init.d/* scripts and the container's CMD wrapper. As a result, ENV directives from the Dockerfile (HERMES_HOME=/opt/data, HERMES_WEB_DIST, …) and compose-time `environment:` entries (HERMES_UID, HERMES_GID) never reached the scripts that actually use them. Three concrete failures observed on macOS Docker Desktop with `~/.hermes:/opt/data`:
* stage2-hook.sh ran with HERMES_UID unset → no UID remap; the hermes user stayed at UID 10000 instead of the host user's UID.
* skills_sync.py (invoked from stage2-hook) ran with HERMES_HOME unset → get_hermes_home() fell back to Path.home()/.hermes, populating a shadow $HERMES_HOME/.hermes/skills tree on the mounted volume (visible on the host as ~/.hermes/.hermes/skills).
* The main `hermes gateway run` process inherited HOME=/root from the /init context (s6-setuidgid doesn't update HOME), so libraries resolving XDG_STATE_HOME via $HOME tried to write to /root/.local/state/hermes/gateway-locks/ and failed with EACCES, preventing the Discord adapter from acquiring its bot-token lock.
Three surgical changes restore correct env flow:
1. The auto-generated /etc/cont-init.d/01-hermes-setup wrapper now uses `#!/command/with-contenv sh`, matching the pattern already used by docker/cont-init.d/02-reconcile-profiles. The container env (Dockerfile ENV + compose `environment:`) now reaches stage2-hook.sh and the skills_sync.py subprocess it spawns.
2. docker/main-wrapper.sh also switches to `#!/command/with-contenv sh`. The container CMD (`gateway run`, `chat`, `setup`, …) now sees HERMES_HOME and the other container-level env vars.
3. docker/main-wrapper.sh exports HOME=/opt/data before `s6-setuidgid hermes`. with-contenv populates HOME from the /init context (/root); s6-setuidgid drops privileges but does not update HOME. 
The hermes user's home per /etc/passwd is /opt/data, so the explicit override matches passwd. No behavior change for the non-buggy paths: the s6-supervised services already used with-contenv, and HOME=/opt/data only affects processes that resolved $HOME-based paths to /root (silently broken).	2026-05-26 13:41:21 +09:00
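Why the HOME override matters, as a short Python sketch of the XDG Base Directory fallback rule ($XDG_STATE_HOME if set to an absolute path, otherwise $HOME/.local/state). `gateway_lock_dir` is a hypothetical helper; the hermes/gateway-locks suffix is taken from the error path quoted above.

```python
import os
from pathlib import Path
from typing import Mapping


def xdg_state_home(env: Mapping[str, str]) -> Path:
    """XDG Base Directory rule: an unset, empty or relative value falls back to $HOME."""
    value = env.get("XDG_STATE_HOME", "")
    if value and os.path.isabs(value):
        return Path(value)
    return Path(env["HOME"]) / ".local" / "state"


def gateway_lock_dir(env: Mapping[str, str]) -> Path:
    return xdg_state_home(env) / "hermes" / "gateway-locks"


if __name__ == "__main__":
    # HOME inherited from /init vs. the explicit override in main-wrapper.sh.
    print(gateway_lock_dir({"HOME": "/root"}))
    print(gateway_lock_dir({"HOME": "/opt/data"}))
```

With no XDG_STATE_HOME set, the lock directory is derived purely from HOME, which is why an inherited HOME=/root sends the hermes user at a directory it cannot write.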
dusterbloom	79fc92e9cb	fix(security): tighten .env file permissions to 0600 at all creation sites .env holds API keys and secrets. Multiple creation sites used `cp` / `touch` / `shutil.copy2`, which obey the process umask (commonly 0o022, leaving the file at 0o644, world-readable). Apply chmod 0o600 explicitly at every site that creates or copies .env. Sites covered:
- docker/stage2-hook.sh: after the seed_one '.env' call, applied unconditionally (not just on first seed) so a host-mounted .env with loose perms gets tightened on every container restart
- hermes_cli/doctor.py: 'hermes doctor --fix' touches an empty .env when missing
- hermes_cli/profiles.py: 'hermes profile create --clone' copies .env from the source profile; shutil.copy2 preserves the source mode, so a source .env at 0o644 was being cloned as 0o644
- setup-hermes.sh: the in-tree setup script's cp .env.example .env path, plus the already-exists branch (mirroring install.sh, which already chmods 600 unconditionally on line 1442)
scripts/install.sh was NOT changed: it already chmods the .env to 600 unconditionally after the create/already-exists branches (line 1442). Salvaged from PR #25726 by @dusterbloom. The docker/entrypoint.sh portion of the original PR was dropped because main switched to an s6-overlay shim; the .env creation logic moved to stage2-hook.sh, which is where the chmod now lives. Closes #25497 (subset: install.sh + setup-hermes.sh) and #8448 (subset: install.sh only) as superseded. Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>	2026-05-25 03:40:47 -07:00
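A minimal Python sketch of the two patterns fixed here: `shutil.copy2` copies the source's mode bits, and a freshly touched file's mode depends on the umask, so both need an explicit `os.chmod(..., 0o600)` afterwards. `clone_env_file` and `touch_env_file` are hypothetical names, not the helpers in hermes_cli.

```python
import os
import shutil
from pathlib import Path

ENV_MODE = 0o600  # owner read/write only: .env holds API keys


def clone_env_file(src: Path, dst: Path) -> Path:
    """Copy a profile's .env, then tighten perms (copy2 preserves e.g. 0o644)."""
    shutil.copy2(src, dst)
    os.chmod(dst, ENV_MODE)
    return dst


def touch_env_file(path: Path) -> Path:
    """Create an empty .env if missing; always re-apply 0600, like the stage2 hook."""
    path.touch(exist_ok=True)  # creation mode is 0o666 minus the umask
    os.chmod(path, ENV_MODE)
    return path
```

Applying the chmod unconditionally (not only when the file is new) is what also repairs a pre-existing, loosely permissioned file.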
Ben	4f416fc40c	fix(docker): make s6 lifecycle work for the unprivileged hermes user Resolves the explicit "Known follow-up" left by commit `2f8ceeab9` and the resulting CI failures in tests/docker/test_dashboard.py and tests/docker/test_s6_profile_gateway_integration.py.
The product gap
---------------
Every hermes runtime operation inside the container runs as the hermes user (UID 10000) via s6-setuidgid. But s6-supervise, spawned by s6-svscan running as PID 1, creates each service's supervise/ and top-level event/ directories with mode 0700 owned by its effective UID (root). That left every s6-svc / s6-svstat / s6-svwait call from hermes hitting EACCES on the supervise/control FIFO and supervise/status, so the entire S6ServiceManager lifecycle (register, start, stop, unregister) was inert in production. The `2f8ceeab9` commit message called this out and deferred the fix. The audit changes that landed alongside it (defaulting docker_exec to -u hermes) made the integration tests reproduce the bug deterministically; the fix below resolves it.
The fix: pre-create the supervise/ skeleton hermes-owned
----------------------------------------------------------
Reading s6's source (src/supervision/s6-supervise.c::trymkdir + control_init), the mkdir and mkfifo calls that build the supervise tree are EEXIST-safe: if the directory or FIFO is already present, s6-supervise reuses it and skips the chown/chmod fix-up that would normally make event/ 03730 root:root. So if we lay the skeleton down with hermes ownership before triggering s6-svscanctl -a, s6-supervise inherits our layout and never touches it. The death_tally / lock / status regular files written later by s6-supervise (still as root) land with mode 0644 (world-readable), which is all s6-svstat needs. 
New module-level helper _seed_supervise_skeleton(svc_dir) in hermes_cli/service_manager.py lays down:
  svc_dir/event/                   hermes:hermes  03730
  svc_dir/supervise/               hermes:hermes  0755
  svc_dir/supervise/event/         hermes:hermes  03730
  svc_dir/supervise/control        hermes:hermes  0660 (FIFO)
  svc_dir/log/event/               hermes:hermes  03730 (if log/ present)
  svc_dir/log/supervise/           hermes:hermes  0755
  svc_dir/log/supervise/event/     hermes:hermes  03730
  svc_dir/log/supervise/control    hermes:hermes  0660 (FIFO)
The log/ branch matters because the logger is a second s6-supervise instance; without it, unregister's rmtree races on the logger's root-owned supervise dir even after the parent slot's supervise/ is hermes-owned. The helper is idempotent and swallows PermissionError on chown, so it works equally well when called from root (cont-init.d) or hermes (runtime register).
Wiring
------
1. S6ServiceManager.register_profile_gateway calls _seed_supervise_skeleton(tmp_dir) just before publishing the slot via Path.replace. Runtime-registered profile gateways are set up by hermes.
2. container_boot._register_service does the same in the cont-init.d reconciliation path so boot-time-restored profile slots inherit the same layout.
3. The new cont-init.d/015-supervise-perms script chowns the supervise/ and event/ trees for STATIC s6-rc services (dashboard, main-hermes). These are spawned by s6-rc before cont-init.d gets to run, so the EEXIST trick doesn't apply; we chown the already-existing tree instead. s6-supervise keeps using the same files; it never re-asserts ownership on a running service. The script skips s6-overlay internal services (s6rc-, s6-linux-) so the supervision tree itself stays root-only. The 015- slot is intentional: it lex-sorts between 01-hermes-setup and 02-reconcile-profiles in the container's C locale, so the chown finishes before the reconciler walks the scandir. 
Unregister teardown reordering
------------------------------
S6ServiceManager.unregister_profile_gateway now fires s6-svscanctl -an BEFORE rmtree (with a 200ms grace), so s6-svscan reaps the supervise child and releases its file handles on supervise/lock + supervise/status before we try to remove the directory. Previously rmtree raced s6-supervise on a set of files inside the supervise dir, and even with the parent supervise/ now hermes-owned, the contained files (death_tally, lock, status, written by root) could still be in use.
Dashboard down-state redesign
-----------------------------
The original PR #30136 review fix wrote a 'down' marker file into /run/service/dashboard/ via cont-init.d/03-dashboard-toggle. That approach was broken in two ways:
(a) /run/service/dashboard is a symlink to a TRANSIENT /run/s6-rc:s6-rc-init:<tmpdir>/ directory while s6-rc is mid-transaction; the touch landed in a soon-to-be-discarded tmp.
(b) Even when written to the final /run/s6-rc/servicedirs/ location, the 'down' file is only consulted by s6-supervise at slot startup. s6-rc's user-bundle explicitly transitions 'dashboard' to 'up' on every boot, overriding any down marker.
The right fix is the canonical s6 pattern: when HERMES_DASHBOARD is unset, the dashboard run script exits 0 and a companion finish script exits 125. Per s6-supervise(8), exit code 125 from the finish script is the 'permanent failure, do not restart' marker, equivalent to s6-svc -O. The slot reports as 'down' to s6-svstat, matching the reality that no dashboard process is running. When HERMES_DASHBOARD IS truthy, finish exits 0 and restart-on-crash semantics apply. 03-dashboard-toggle is removed (its function is now subsumed by the run/finish pair).
Tests
-----
Adds four unit tests for _seed_supervise_skeleton covering the produced layout, the log/ subservice case, the skip-when-no-log case, and idempotency. 
The live-container verification continues to live in tests/docker/test_s6_profile_gateway_integration.py and tests/docker/test_dashboard.py; both now pass against the rebuilt image.
References
----------
* Skarnet skaware mailing list 2020-02-02 (Laurent Bercot + Guillermo Diaz Hartusch) on unprivileged s6 tool semantics: http://skarnet.org/lists/skaware/1424.html
* just-containers/s6-overlay#130: same EEXIST-preseed pattern, community-validated 2016 onward
* https://skarnet.org/software/s6/servicedir.html: exit-code 125 semantics in finish scripts
(cherry picked from commit `c41f908ad4`)	2026-05-25 12:23:23 +10:00
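A Python sketch of the EEXIST-preseed idea behind `_seed_supervise_skeleton`, for illustration only: the layout and modes come from the table above, but `seed_supervise_skeleton`, its `uid`/`gid` parameters and the chown-then-chmod ordering are assumptions, not the actual hermes_cli/service_manager.py code. Every step tolerates an existing entry, so running it twice is harmless, and a PermissionError from chown (a non-root caller) is swallowed as the commit describes.

```python
import os
from pathlib import Path

EVENT_MODE = 0o3730      # setgid + sticky + rwx-wx---, per the layout table
SUPERVISE_MODE = 0o755
CONTROL_MODE = 0o660     # FIFO: owner and group can write commands


def _own(path: Path, uid: int, gid: int) -> None:
    try:
        os.chown(path, uid, gid)
    except PermissionError:
        pass  # non-root caller: keep the ownership we already have


def _ensure_dir(path: Path, mode: int, uid: int, gid: int) -> None:
    try:
        path.mkdir()
    except FileExistsError:
        pass  # EEXIST-safe, mirroring how s6-supervise reuses existing entries
    _own(path, uid, gid)
    os.chmod(path, mode)  # explicit chmod: mkdir's mode would be filtered by the umask


def _ensure_fifo(path: Path, mode: int, uid: int, gid: int) -> None:
    try:
        os.mkfifo(path)
    except FileExistsError:
        pass
    _own(path, uid, gid)
    os.chmod(path, mode)


def _seed_one(base: Path, uid: int, gid: int) -> None:
    _ensure_dir(base / "event", EVENT_MODE, uid, gid)
    _ensure_dir(base / "supervise", SUPERVISE_MODE, uid, gid)
    _ensure_dir(base / "supervise" / "event", EVENT_MODE, uid, gid)
    _ensure_fifo(base / "supervise" / "control", CONTROL_MODE, uid, gid)


def seed_supervise_skeleton(svc_dir, uid: int, gid: int) -> None:
    """Pre-create the supervise skeleton for a service dir and, if present, its logger."""
    svc = Path(svc_dir)
    _seed_one(svc, uid, gid)
    if (svc / "log").is_dir():
        _seed_one(svc / "log", uid, gid)  # the logger is a second s6-supervise instance
```

On Linux, an unprivileged chmod may silently drop the setgid bit if the directory's group is not one of the caller's groups; in the container the helper either runs as root or as the hermes owner, so this sketch does not try to handle that case.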
Ben	04bdbce906	docs(docker): deprecation warning in entrypoint.sh shim PR #30136 review item O5: docker/entrypoint.sh is now a thin shim that forwards to stage2-hook.sh — the real ENTRYPOINT is /init plus main-wrapper.sh. External scripts that hard-coded entrypoint.sh as the container's ENTRYPOINT will see the cont-init bootstrap happen but the CMD will not be exec'd (because stage2-hook only handles bootstrap; main-wrapper.sh handles the CMD passthrough). Add a stderr warning explaining the new contract and pointing callers at the migration path (drop the --entrypoint override). The shim itself stays in place for one release cycle so the deprecation isn't a hard break — anyone still invoking it sees the warning in their logs and has time to migrate.	2026-05-24 18:05:33 -07:00
Ben	9914bfc594	docker: drop sh -c wrappers from stage2-hook.sh PR #30136 review caught three `s6-setuidgid hermes sh -c "..."` invocations in stage2-hook.sh that interpolated $HERMES_HOME into a nested shell context. This is practically low-risk (a malicious HERMES_HOME already requires container-launch privileges), but the cleaner pattern is to invoke commands directly so the shell isn't a second interpreter.
* `mkdir -p` of the data subdirs now runs directly via s6-setuidgid, one path per arg.
* The .install_method stamp is written via `printf | tee`, also with no shell wrapper.
* The skills_sync invocation uses the venv's python by absolute path instead of sourcing activate inside a shell. skills_sync.py doesn't need anything from activate beyond sys.path, which the bin-stub python already provides.
No behavior change, just a smaller attack surface and a script that's easier to read.	2026-05-24 18:05:33 -07:00
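The difference between the removed `sh -c` form and the direct-argv form, sketched in Python with a deliberately hostile HERMES_HOME. `legacy_command` and `mkdir_argv` are hypothetical helpers that only build command lines; nothing is executed. `shlex.split` is used here merely to show that the interpolated string gets re-tokenized, whereas the argv form keeps each path as one element.

```python
import shlex
from typing import Iterable, List


def legacy_command(hermes_home: str, subdirs: Iterable[str]) -> str:
    """Old pattern: one string handed to `sh -c`, so the shell re-parses HERMES_HOME."""
    return "mkdir -p " + " ".join(f"{hermes_home}/{sub}" for sub in subdirs)


def mkdir_argv(hermes_home: str, subdirs: Iterable[str]) -> List[str]:
    """New pattern: one path per argv element, no second interpreter."""
    return ["s6-setuidgid", "hermes", "mkdir", "-p",
            *(f"{hermes_home}/{sub}" for sub in subdirs)]


if __name__ == "__main__":
    hostile = "/opt/data; touch /tmp/pwned"
    print(shlex.split(legacy_command(hostile, ["cron"])))
    print(mkdir_argv(hostile, ["cron"]))
```

A caller could pass the list straight to `subprocess.run(argv, check=True)`, which does not involve a shell unless `shell=True` is set.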
Ben	1dfabe47b3	fix(docker): dashboard slot stays 'down' when HERMES_DASHBOARD unset PR #30136 review caught a false positive: when HERMES_DASHBOARD was unset, the dashboard run script did `exec sleep infinity`, so `s6-svstat /run/service/dashboard` reported the slot as 'up'. `hermes doctor` and any other s6-svstat-based health check saw the dashboard as supervised-running even though no dashboard process existed. Add cont-init.d/03-dashboard-toggle: writes a `down` marker file into `/run/service/dashboard/` when HERMES_DASHBOARD is falsy, removes any leftover marker when it's truthy. s6-supervise honors `down` by not starting the service, so s6-svstat reports 'down' — matching reality. The run script's HERMES_DASHBOARD case-statement stays in place as a belt-and-suspenders guard, so the two layers can never disagree. Two new integration tests lock the behavior: slot reports down when unset; slot reports up when set to 1.	2026-05-24 18:05:33 -07:00
Ben	fc39296e1f	fix(service_manager): s6 detection works for unprivileged hermes user PR #30136 review surfaced two issues, both rooted in the same audit gap: docker integration tests were running as root, not the unprivileged `hermes` user (UID 10000) that the runtime actually uses via `s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to the s6 control surface worked as root in the tests but was inert in production. Fixes:
1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`, which is root-only readable. For UID 10000 the symlink yields PermissionError, `resolve()` silently returns the unresolved path, and `exe.name == "exe"`, so detection always returned False, the service-manager runtime-registration path was inert, and every `hermes profile create` / `hermes -p X gateway start` silently skipped the s6 hook. Replace with `/proc/1/comm` (world-readable) + `/run/s6/basedir` (s6-overlay-specific); both are required, and detection fails closed.
2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/` {control,lock} to hermes so `s6-svscanctl -a/-an` works without root. Previously the directory chown stopped at `/run/service` and the FIFO inside stayed root-owned, so `register_profile_gateway` from hermes failed at the rescan-trigger step with EACCES; the wrapper in profiles.py caught the exception and printed a swallowed warning, so profile creation appeared to succeed while the slot was rolled back.
Audit changes to flush out this class of bug next time:
- Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py` that default to `-u hermes`. The module docstring explains why and flags `user="root"` as opt-in only for tests that explicitly need root (none currently do).
- Refactor every `docker exec` call in tests/docker/ through the new helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py, test_container_restart.py, test_s6_profile_gateway_integration.py). 
- Add 5 unit tests covering `_s6_running` under various probe states (both signals present; comm wrong; basedir missing; PermissionError on /proc/1/comm; missing /proc, i.e. non-Linux). The PermissionError test is the explicit regression guard for the original bug.
Known follow-up: the per-service `supervise/control` FIFO inside each `/run/service/gateway-<profile>/supervise/` is created root-owned by s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc -u/-d/-t` from the hermes user will get EACCES on those. The audit under `-u hermes` will reveal this in lifecycle tests, surfacing the issue cleanly so it can be fixed in a focused follow-up (likely via a small SUID helper or a polling chown loop in cont-init.d). The detection + svscanctl fixes here are independent and complete on their own.	2026-05-24 18:05:33 -07:00
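A Python sketch of the fail-closed probe described in fix 1. The two paths come from the commit message; the expected PID-1 command name `s6-svscan` and the injectable-path signature are assumptions made for illustration and testability, not the real `_s6_running()`.

```python
from pathlib import Path


def s6_running(
    comm_path: Path = Path("/proc/1/comm"),
    basedir: Path = Path("/run/s6/basedir"),
    expected_comm: str = "s6-svscan",  # assumption: PID 1's comm under s6-overlay
) -> bool:
    """Both signals must be present; any read error counts as 'not s6'."""
    try:
        comm = comm_path.read_text().strip()
    except OSError:  # PermissionError, FileNotFoundError (e.g. no /proc), ...
        return False
    return comm == expected_comm and basedir.exists()
```

Unlike resolving /proc/1/exe, reading /proc/1/comm needs no privileges over PID 1, so the same answer comes back for root and for UID 10000.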
Ben	4b4c36cb61	feat(docker): remove gosu from bundled image; s6-setuidgid handles privilege drop The s6-overlay migration replaced every runtime use of gosu with s6-setuidgid (in stage2-hook.sh, main-wrapper.sh, per-service run scripts, and cont-init.d hooks), but the gosu binary itself was still being copied into the image from tianon/gosu, and several comments across the repo still pointed to it. Image changes:
- Drop the FROM tianon/gosu:1.19-trixie AS gosu_source stage
- Drop the COPY --from=gosu_source /gosu /usr/local/bin/ layer
- Net: one fewer base-image pull, ~12-15 MB layer eliminated
Documentation/comment refresh (no behavior change):
- Dockerfile: update root-user rationale comment + cont-init.d comment
- docker/main-wrapper.sh: drop "pre-s6 contract (gosu drop)" reference
- docker-compose.yml: update UID/GID remap comment
- .hadolint.yaml: update DL3002 ignore rationale
- website/docs/user-guide/docker.md: privilege-drop helper is s6-setuidgid now
- hermes_cli/config.py: docker_run_as_host_user docstring
tools/environments/docker.py runs arbitrary user images via the terminal backend, not the bundled Hermes image. It still needs SETUID/SETGID caps so user images that use gosu/su/s6-setuidgid all work. Renamed the cap-list constant _GOSU_CAP_ARGS → _PRIVDROP_CAP_ARGS and updated comments to list s6-setuidgid alongside the others as examples. The matching test (test_security_args_include_setuid_setgid_for_gosu_drop → test_security_args_include_setuid_setgid_for_privdrop) was renamed and its docstring updated; behavior is unchanged. 
Verification:
- hadolint clean against .hadolint.yaml
- shellcheck clean against all docker/ shell scripts
- Image rebuilt successfully (sha 1a090924ccea)
- Docker harness: 19 passed in 41.87s (every Phase 0 test + Phase 4 per-profile-gateway lifecycle + container-restart reconciliation)
- tests/tools/test_docker_environment.py: 23 passed (rename did not break test discovery; pre-existing unrelated mock warning)
The plan document (docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md) intentionally retains its historical references to gosu; it describes the pre-s6 entrypoint as background for understanding the migration.	2026-05-24 18:05:33 -07:00
Ben	2afefc501c	feat(docker): per-profile s6 supervision + container-restart reconciliation Phase 4 of the s6-overlay supervision plan. Activates the Phase 3 S6ServiceManager by hooking it into the profile lifecycle and the `hermes gateway start/stop/restart` dispatcher, and adds a cont-init.d-time reconciliation pass that survives `docker restart`.
Task 4.0 (container-boot reconciliation): /run/service/ is tmpfs, so every `docker restart` wipes every per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles invokes hermes_cli.container_boot.reconcile_profile_gateways() on every boot, which walks $HERMES_HOME/profiles/<name>/, reads each gateway_state.json, recreates the s6 service slot, and auto-starts only those whose last state was 'running'. Other states (stopped, starting, startup_failed, missing) register the slot in the down state, avoiding crash-loops across restarts for a gateway that was broken last boot. Per-profile outcome is recorded to $HERMES_HOME/logs/container-boot.log. Implementation: hermes_cli/container_boot.py + 12 unit tests. The profile marker is SOUL.md, not config.yaml, because `hermes profile create` only seeds SOUL.md by default (config.yaml comes from `hermes setup`).
Task 4.1 / 4.2 (profile create/delete hooks): hermes_cli/profiles.py::create_profile now calls _maybe_register_gateway_service(<canon>) at the end, which routes through ServiceManager.register_profile_gateway when running on s6 and no-ops on host backends. delete_profile mirrors this with _maybe_unregister_gateway_service. _allocate_gateway_port produces a deterministic SHA-256-derived port in [9200, 9800).
Task 4.3 (gateway dispatch + remove rejection arms): _dispatch_via_service_manager_if_s6(action) intercepts start/stop/restart at the top of each subcommand and routes them through S6ServiceManager.{start,stop,restart}. 
The pre-Phase-4 `elif is_container():` rejection arms are kept as fallback for pre-s6 containers / unsupported runtimes, but only ever fire when detect_service_manager() != 's6'. install/uninstall under s6 print informational guidance pointing users at profile create/delete. Removed the two xfail(strict=True) markers from tests/docker/test_profile_gateway.py — both tests now pass strictly. Task 4.4 — status reporting: get_gateway_runtime_snapshot() reports Manager: 's6 (container supervisor)' inside an s6 container instead of 'docker (foreground)'. Plan-vs-reality drift fixed in this commit: - Plan's S6ServiceManager._render_run_script used `gateway start --foreground --port {port}` — invented args; the real CLI is `gateway run`. Switched accordingly. port arg retained for API parity but now documented as 'currently ignored'. - Plan's reconciler keyed on config.yaml; switched to SOUL.md (config.yaml is created by hermes setup, not by hermes profile create, so the original gate caught nothing). - The plan's _dispatch helper used _profile_arg() which returns '--profile <name>' (i.e. with the flag prefix). Switched to _profile_suffix() which returns the bare name. - Architecture B's docker exec doesn't get /command on PATH or the venv on PATH; Dockerfile's runtime PATH now includes /opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works without sourcing the venv. - stage2-hook now chowns $HERMES_HOME/profiles to hermes on every boot, not just on the UID-remap path. Without this, files created by docker-exec-as-root accumulate and the next reconciler run fails with PermissionError reading SOUL.md. Test harness: 19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to passing). 78 unit tests across service_manager + container_boot + profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck pass cleanly. Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:33 -07:00
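Two pieces of the Phase 4 logic are small enough to sketch in isolation: the boot-time decision of which profile slots come up, and a deterministic port in [9200, 9800). This is a hypothetical reimplementation, not the code in `hermes_cli/container_boot.py` or `hermes_cli/profiles.py`. It assumes `gateway_state.json` stores the last state under a `"state"` key, and it reduces the first 8 bytes of the SHA-256 digest modulo the 600-port window; the commit names neither detail, nor how port collisions are handled (the sketch ignores them).

```python
import hashlib
import json
from pathlib import Path

# Port window named in the commit message: [9200, 9800).
PORT_BASE = 9200
PORT_LIMIT = 9800


def allocate_gateway_port(profile: str) -> int:
    """Deterministic SHA-256-derived port (reduction method is an assumption)."""
    digest = hashlib.sha256(profile.encode("utf-8")).digest()
    span = PORT_LIMIT - PORT_BASE
    return PORT_BASE + int.from_bytes(digest[:8], "big") % span


def read_last_state(profile_dir: Path) -> str:
    """Return the recorded gateway state, or 'missing' if it cannot be read.

    Assumption: gateway_state.json stores the state under a "state" key.
    """
    state_file = profile_dir / "gateway_state.json"
    try:
        data = json.loads(state_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # absent file or malformed JSON
        return "missing"
    state = data.get("state") if isinstance(data, dict) else None
    return state if isinstance(state, str) else "missing"


def plan_reconciliation(hermes_home: Path) -> dict[str, bool]:
    """Map each profile to True (start its slot) or False (register it down).

    A directory counts as a profile only if it contains SOUL.md, mirroring
    the marker choice described in the commit.
    """
    plan: dict[str, bool] = {}
    profiles_root = hermes_home / "profiles"
    if not profiles_root.is_dir():
        return plan
    for profile_dir in sorted(profiles_root.iterdir()):
        if not (profile_dir / "SOUL.md").is_file():
            continue
        # Only a last state of 'running' auto-starts; stopped, starting,
        # startup_failed and missing all leave the slot down.
        plan[profile_dir.name] = read_last_state(profile_dir) == "running"
    return plan
```

Registering every non-running profile in the down state, instead of skipping it, keeps the slot available for a later `hermes gateway start` while avoiding a crash-loop on boot.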
Ben	e0e9c895d3	feat(docker)!: replace tini with s6-overlay as PID 1 BREAKING CHANGE: the container ENTRYPOINT is now /init (s6-overlay) instead of /usr/bin/tini. Main hermes runs as the container CMD with TTY inherited (preserving --tui), dashboard runs as a supervised s6-rc service (HERMES_DASHBOARD=1 starts it; crashes auto-restart), and the ground is laid for per-profile gateway supervision (Phase 3+4). All five pre-s6 docker run invocation patterns continue to work identically — verified by the Phase 0 docker harness: `docker run <image>` → `hermes` with no args; `docker run <image> chat -q "..."` → `hermes chat -q ...` passthrough; `docker run <image> sleep infinity` → `sleep infinity` direct; `docker run <image> bash` → interactive bash; `docker run -it <image> --tui` → interactive Ink TUI. Phase 2 harness result: 12 passed, 2 xfailed (Phase 4 target). Hadolint + shellcheck pass cleanly. Architecture pivot from plan v3 (documented in main-hermes/run header): the plan called for main hermes to be an s6-supervised service, but two real s6-overlay v3 mechanics blocked that — cont-init.d scripts receive no arguments (CMD args are not visible to stage2-hook), and `/run/s6/basedir/bin/halt` after writing the exit code did not propagate the desired exit code (container exits 143). We use the s6-overlay-native CMD pattern instead: main-wrapper.sh is the container's main program (ENTRYPOINT prepends it so leading-dash args like --version aren't intercepted by /init), execs the final program with stdin/stdout/stderr inherited, and the program's exit code becomes the container exit code. main-hermes is now a no-op `sleep infinity` slot kept for future supervised-gateway-container modes. This trades "supervised restart of main hermes" for arg-parity with the pre-s6 contract — main hermes was already unsupervised under tini, so we lose nothing functional. Dashboard supervision is the only new guarantee added by this phase. 
Files added: docker/main-wrapper.sh # arg routing + s6-setuidgid drop docker/stage2-hook.sh # gosu-equivalent + chown + seed docker/s6-rc.d/main-hermes/{type,run,dependencies.d/base} docker/s6-rc.d/dashboard/{type,run,dependencies.d/base} docker/s6-rc.d/user/contents.d/{main-hermes,dashboard} Files changed: Dockerfile: tini → s6-overlay install + ENTRYPOINT flip + service wiring docker/entrypoint.sh: thin shim to stage2-hook.sh for back-compat tests/docker/test_dashboard.py: add test_dashboard_restarts_after_crash Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:33 -07:00
Teknium	2df2f9190b	fix(docker): keep dashboard side-process loopback by default (#30740)	2026-05-24 04:25:28 -07:00
Siddharth Balyan	6f5ec929a1	feat(config): add install-method stamping + Docker detection (#27843) * feat(config): add install-method stamping + Docker detection Dockerfile stamps "docker", install.sh stamps "git", and cmd_postinstall stamps "pip" into ~/.hermes/.install_method. detect_install_method() reads the stamp first, then falls back to managed-system / container / .git heuristics. Adds Docker upgrade guidance. Tracking: #27826 * fix(stamp): move Docker stamp to entrypoint, install.sh stamp after print_success The Dockerfile stamp was overwritten by the VOLUME overlay at container start. Moving it to entrypoint.sh ensures it persists. The install.sh stamp now writes after print_success so it only lands on full success.	2026-05-18 16:34:10 +05:30
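The stamp-first lookup described above can be sketched as follows. This is a hypothetical sketch, not the real `detect_install_method()`: the parameters, the `"unknown"` fallback and the ordering of the fallback checks are assumptions, container detection is passed in rather than guessed at, and the managed-system heuristic is omitted because the commit does not describe it.

```python
from pathlib import Path

KNOWN_METHODS = {"docker", "git", "pip"}


def detect_install_method(hermes_home: Path, repo_root: Path,
                          in_container: bool) -> str:
    """Return the install method, preferring the .install_method stamp.

    hermes_home: directory holding .install_method (~/.hermes by default).
    repo_root: checkout probed for a .git directory (assumed parameter).
    in_container: result of some container check, supplied by the caller.
    """
    stamp = hermes_home / ".install_method"
    try:
        value = stamp.read_text(encoding="utf-8").strip()
    except OSError:  # no stamp written yet
        value = ""
    if value in KNOWN_METHODS:
        return value
    # Fallback heuristics (managed-system check omitted in this sketch).
    if in_container:
        return "docker"
    if (repo_root / ".git").exists():
        return "git"
    return "unknown"
```

Because the stamp lives in the data directory, writing it from the entrypoint (rather than at image build time) is what keeps the VOLUME mount from hiding it.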
Siddharth Balyan	942adf6179	fix(docker): chown .venv to hermes so lazy_deps can install platform packages (#24841) The Dockerfile permissions section made /opt/hermes/.venv readable but not writable by the hermes runtime user. Since the 2026-05-12 policy change moved messaging packages (discord.py, telegram, slack, etc.) out of [all] and into lazy_deps.py, the Docker image no longer ships with them pre-installed. At first gateway boot, lazy_deps.ensure() tries to `uv pip install` them into the venv but fails with EACCES because site-packages is root-owned. The result: every messaging platform adapter silently fails to load inside Docker containers, producing only a cryptic "discord.py not installed" warning despite the gateway being correctly configured. Two-part fix: 1. Dockerfile: add /opt/hermes/.venv to the existing chown -R hermes:hermes line so the default (UID 10000) case works out of the box. 2. docker/entrypoint.sh: extend the needs_chown block to also re-chown the .venv when HERMES_UID is remapped. Without this, the build-time chown becomes stale when someone uses the documented HERMES_UID override in docker-compose.yml. Fixes #21536 Related: #17674, #21543, #21755	2026-05-13 11:55:07 +05:30
pefontana	5643c29790	feat(docker): bootstrap auth.json from env on first boot Lets orchestrators (e.g. an account-management service provisioning a Hermes VPS) seed an OAuth refresh credential non-interactively instead of walking the user through `hermes setup` + the device-flow login dance. Matches the existing first-boot-only pattern used for .env, config.yaml, and SOUL.md. If HERMES_AUTH_JSON_BOOTSTRAP is set and $HERMES_HOME/auth.json doesn't already exist, write the env var's contents to auth.json with mode 600. The `[ ! -f ... ]` guard is critical: it ensures that on container restart the rotated refresh token Hermes wrote back to the persistent volume is never clobbered by the now-stale value the orchestrator originally seeded. Generic name (not Nous-specific) so the feature is reusable by any future orchestrator.	2026-05-08 06:28:44 -07:00
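The first-boot-only rule can be expressed in Python as below. The entrypoint itself does this in shell with a `[ ! -f ... ]` guard; this sketch swaps in `O_CREAT | O_EXCL`, which makes the existence check and the create a single atomic step. The function name and the `env` mapping parameter are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping


def bootstrap_auth_json(hermes_home: Path, env: Mapping[str, str]) -> bool:
    """Seed auth.json from HERMES_AUTH_JSON_BOOTSTRAP on first boot only.

    Returns True if the file was written. An existing auth.json is never
    touched, so a refresh token rotated by Hermes survives restarts.
    """
    payload = env.get("HERMES_AUTH_JSON_BOOTSTRAP")
    if not payload:
        return False
    target = hermes_home / "auth.json"
    try:
        # O_EXCL fails if the file already exists; the umask can only clear
        # permission bits, so the file is never wider than 0o600.
        fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(payload)
    return True
```

On the second boot the call returns False and leaves the rotated credential in place, which is the property the commit calls critical.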
Ben	5671059f62	feat(docker): launch dashboard as side-process via HERMES_DASHBOARD=1 Adds an optional dashboard side-process to the container entrypoint, toggled by `HERMES_DASHBOARD=1` (also accepts `true` / `yes`). When set, the entrypoint backgrounds `hermes dashboard` before `exec`-ing the main command so the user's chosen foreground process (gateway, chat, `sleep infinity`, …) remains the PID of interest for the container runtime. `docker run -d -v ~/.hermes:/opt/data -p 8642:8642 -p 9119:9119 -e HERMES_DASHBOARD=1 nousresearch/hermes-agent gateway run` Defaults chosen for the container case: - Host: 0.0.0.0 (reachable through published port; can override to 127.0.0.1 via HERMES_DASHBOARD_HOST for sidecar/reverse-proxy setups) - Port: 9119 (matches `hermes dashboard`) - Auto-adds `--insecure` when binding to non-localhost, matching the dashboard's own safety gate for exposing API keys - HERMES_DASHBOARD_TUI is read by `hermes dashboard` directly — no entrypoint plumbing needed Dashboard output is prefixed with `[dashboard]` via `stdbuf`+`sed -u` so it's easy to separate from gateway logs in `docker logs`. No supervision: if the dashboard crashes it stays down until the container restarts (documented in the `:::note` panel). Other changes bundled in: - Deprecate GATEWAY_HEALTH_URL / GATEWAY_HEALTH_TIMEOUT env vars in hermes_cli/web_server.py with a DEPRECATED block comment and a `.. deprecated::` note on _probe_gateway_health. The feature still works for this release; it'll be removed alongside the move to a first-class dashboard config key. - Rewrite the "Running the dashboard" doc section around the new single-container pattern. Drops the previously-documented dashboard-as-its-own-container setup — that pattern relied on the deprecated env vars for cross-container gateway-liveness detection, and without them the dashboard would permanently report the gateway as "not running". 
- Collapse the two-service Compose example (gateway + dashboard container) into a single service with HERMES_DASHBOARD=1. Removes the now-unnecessary bridge network and `depends_on`. - Drop the ":::warning" caveat about "Running a dashboard container alongside the gateway is safe" — that case no longer exists.	2026-05-04 15:37:27 +10:00
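The env handling introduced by this commit is summarized below next to the rule that replaced it. This is an illustrative sketch, not the entrypoint's shell code, and the function names are hypothetical. `legacy_insecure` models the host-derived `--insecure` described here, which commit fb51253620 later identified as silently disabling the OAuth gate; `explicit_insecure` models that commit's `HERMES_DASHBOARD_INSECURE` opt-in. Exact string matching for `HERMES_DASHBOARD` is an assumption, and the 0.0.0.0 default reflects this commit only (2df2f9190b later kept the side-process on loopback by default).

```python
from typing import Mapping

# Values this commit says HERMES_DASHBOARD accepts (exact match assumed).
DASHBOARD_ON = {"1", "true", "yes"}
# Truthy set adopted for HERMES_DASHBOARD_INSECURE in fb51253620.
INSECURE_ON = {"1", "true", "TRUE", "True", "yes", "YES", "Yes"}
LOOPBACK = {"127.0.0.1", "localhost"}


def dashboard_enabled(env: Mapping[str, str]) -> bool:
    return env.get("HERMES_DASHBOARD", "") in DASHBOARD_ON


def dashboard_host(env: Mapping[str, str]) -> str:
    # Container default chosen by this commit; override via HERMES_DASHBOARD_HOST.
    return env.get("HERMES_DASHBOARD_HOST") or "0.0.0.0"


def legacy_insecure(env: Mapping[str, str]) -> bool:
    # This commit's rule: any non-loopback bind adds --insecure.
    return dashboard_host(env) not in LOOPBACK


def explicit_insecure(env: Mapping[str, str]) -> bool:
    # Later rule: only an explicit opt-in adds --insecure.
    return env.get("HERMES_DASHBOARD_INSECURE", "") in INSECURE_ON
```

With `HERMES_DASHBOARD=1` and no host override, `legacy_insecure` returns True even when an OAuth client ID is configured, while `explicit_insecure` returns False and leaves the auth gate in charge.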
Teknium	9ef1ae138a	fix(docker): don't chown config.yaml after gosu drop (#15865) (#16096) The chown/chmod block on config.yaml was added in `b24d239ce` to keep the file readable by the hermes runtime user, but it sat in the post-gosu 'running as hermes' section of the entrypoint. That meant: 1. Default `docker run <image>` — container starts as root, entrypoint drops to hermes via gosu, then non-root hermes tries to chown the file to hermes. Works by coincidence because the file was just created by root during volume setup and gosu target == target owner. 2. `docker run -u $(id -u):$(id -g) <image>` (#15865) — container starts as the caller's UID. The root block is skipped entirely, we land in the hermes section as some arbitrary non-root user, and chown to 'hermes' fails with 'Operation not permitted'. Script aborts under `set -e`. Move the chown/chmod into the root block (before the gosu exec) where it actually has privilege, and guard with `2>/dev/null || true` so rootless Podman (where even in-container root lacks host-side chown rights) doesn't abort either. Closes #15865	2026-04-26 08:27:39 -07:00
Jiecheng Wu	14c9f7272c	fix(docker): fix HERMES_UID permission handling and add docker-compose.yml - Remove 'USER hermes' from Dockerfile so entrypoint runs as root and can usermod/groupmod before gosu drop. Add chmod -R a+rX /opt/hermes so any remapped UID can read the install directory. - Fix entrypoint chown logic: always chown -R when HERMES_UID is remapped from default 10000, not just when top-level dir ownership mismatches. - Add docker-compose.yml with gateway + dashboard services. - Add .hermes to .gitignore.	2026-04-24 04:52:11 -07:00
GuyCui	b24d239ce1	Update permissions for config.yaml Fix config.yaml permission drift on startup	2026-04-23 03:10:04 -07:00
Ubuntu	d70f0f1dc0	fix(docker): allow entrypoint to pass-through non-hermes commands Commit `8254b820` ("--init for zombie reaping + sleep infinity for idle-based lifetime") made the Docker terminal backend launch sandbox containers with `sleep infinity` as the command, so the lifetime is controlled by an external idle reaper instead of a fixed timeout. But `docker/entrypoint.sh` unconditionally wraps its args with `hermes`: `exec hermes "$@"`. Result: `hermes sleep infinity` → argparse rejects `sleep` as a subcommand and the container exits immediately with code 2: `hermes: error: argument command: invalid choice: 'sleep' (choose from chat, model, gateway, setup, ...)`. Every sandbox container launched by the docker backend dies at startup, breaking terminal/file tool execution end-to-end. Fix: dispatch at the tail of the entrypoint. If the first arg is an executable on PATH (sleep, bash, sh, etc.) run it raw; otherwise preserve the legacy `hermes <subcommand>` wrapping behavior. Both invocation styles keep working: `docker run <image>` -> hermes (interactive); `docker run <image> chat -q "hi"` -> `hermes chat -q "hi"`; `docker run <image> sleep infinity` -> `sleep infinity`; `docker run <image> bash` -> bash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:13:14 -07:00
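The dispatch rule fits in a few lines of Python. This is a sketch of the rule as stated in the commit, not the entrypoint's shell code; `route_argv` is a hypothetical name, and the `which` parameter exists only so the PATH lookup can be swapped out in tests.

```python
import shutil
from typing import Callable, Optional, Sequence


def route_argv(args: Sequence[str],
               which: Callable[[str], Optional[str]] = shutil.which) -> list[str]:
    """Return the argv the entrypoint would exec for the given container args.

    If the first argument resolves to an executable on PATH (sleep, bash,
    sh, ...), it runs as-is; otherwise the args go to hermes, preserving
    the legacy `hermes <subcommand>` wrapping.
    """
    if args and which(args[0]) is not None:
        return list(args)
    return ["hermes", *args]
```

One consequence of the rule: a hermes subcommand whose name collides with an executable on PATH would run that executable instead, so the image's PATH effectively reserves those names.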
Teknium	8548893d14	feat: entry-level Podman support — find_docker() + rootless entrypoint (#10066) - find_docker() now checks HERMES_DOCKER_BINARY env var first, then docker on PATH, then podman on PATH, then macOS known locations - Entrypoint respects HERMES_HOME env var (was hardcoded to /opt/data) - Entrypoint uses groupmod -o to tolerate non-unique GIDs (fixes macOS GID 20 conflict with Debian's dialout group) - Entrypoint makes chown best-effort so rootless Podman continues instead of failing with 'Operation not permitted' - 5 new tests covering env var override, podman fallback, precedence Based on work by alanjds (PR #3996) and malaiwah (PR #8115). Closes #4084.	2026-04-14 21:20:37 -07:00
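The precedence order listed for `find_docker()` can be sketched as below. This is not the project's implementation: the macOS paths are hypothetical examples (the commit only says "macOS known locations"), the override is returned without validation, and the injectable `env`, `which` and `is_executable` parameters are an assumption added to make the order testable without Docker installed.

```python
import os
import shutil
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical macOS fallbacks; the commit does not list the real ones.
MACOS_CANDIDATES = (
    "/usr/local/bin/docker",
    "/opt/homebrew/bin/docker",
    "/Applications/Docker.app/Contents/Resources/bin/docker",
)


def _is_executable_file(path: str) -> bool:
    return os.path.isfile(path) and os.access(path, os.X_OK)


def find_docker(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
    is_executable: Callable[[str], bool] = _is_executable_file,
    candidates: Sequence[str] = MACOS_CANDIDATES,
) -> Optional[str]:
    """Resolve a container CLI in the precedence order the commit lists."""
    # 1. An explicit override wins outright (returned as-is in this sketch).
    override = env.get("HERMES_DOCKER_BINARY")
    if override:
        return override
    # 2-3. docker on PATH, then podman on PATH.
    for name in ("docker", "podman"):
        found = which(name)
        if found:
            return found
    # 4. Known install locations that may not be on PATH.
    for path in candidates:
        if is_executable(path):
            return path
    return None
```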
m0n5t3r	fee0e0d35e	fix(docker): run as non-root user, use virtualenv (salvage #5811) - Add gosu for runtime privilege dropping from root to hermes user - Support HERMES_UID/HERMES_GID env vars for host mount permission matching - Switch to debian:13.4-slim base image - Use uv venv instead of pip install --break-system-packages - Pin uv and gosu multi-stage images with SHA256 digests - Set PLAYWRIGHT_BROWSERS_PATH to /opt/hermes/.playwright so build-time chromium install survives the /opt/data volume mount - Keep procps for container debugging Based on work by m0n5t3r in PR #5811. Stripped to hardening-only changes (non-root, virtualenv, slim base); matrix deps, fonts, xvfb, and entrypoint playwright download deferred to follow-up.	2026-04-12 00:53:16 -07:00
Teknium	e8f16f7432	fix(docker): add missing skins/plans/workspace dirs to entrypoint The profile system expects these directories but they weren't being created on container startup. Adds them to the mkdir list alongside the existing dirs. Co-authored-by: Tranquil-Flow <tranquil_flow@protonmail.com>	2026-04-10 15:42:30 -07:00
Teknium	4fb42d0193	fix: per-profile subprocess HOME isolation (#4426) (#7357) Isolate system tool configs (git, ssh, gh, npm) per profile by injecting a per-profile HOME into subprocess environments only. The Python process's own os.environ['HOME'] and Path.home() are never modified, preserving all existing profile infrastructure. Activation is directory-based: when {HERMES_HOME}/home/ exists on disk, subprocesses see it as HOME. The directory is created automatically for: - Docker: entrypoint.sh bootstraps it inside the persistent volume - Named profiles: added to _PROFILE_DIRS in profiles.py Injection points (all three subprocess env builders): - tools/environments/local.py _make_run_env() — foreground terminal - tools/environments/local.py _sanitize_subprocess_env() — background procs - tools/code_execution_tool.py child_env — execute_code sandbox Single source of truth: hermes_constants.get_subprocess_home() Closes #4426	2026-04-10 13:37:45 -07:00
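The directory-based activation can be sketched as a pure function over a copied environment. The real `hermes_constants.get_subprocess_home()` takes no arguments as far as this log shows; the `hermes_home` parameter here and the `build_subprocess_env` helper are hypothetical, added so the sketch is self-contained.

```python
from pathlib import Path
from typing import Mapping, Optional


def get_subprocess_home(hermes_home: Path) -> Optional[str]:
    """Return {HERMES_HOME}/home if that directory exists, else None."""
    candidate = hermes_home / "home"
    return str(candidate) if candidate.is_dir() else None


def build_subprocess_env(hermes_home: Path,
                         base_env: Mapping[str, str]) -> dict[str, str]:
    """Copy base_env and point HOME at the per-profile home when active.

    The caller's mapping (for example os.environ) is never mutated, which
    matches the rule that the Python process keeps its own HOME.
    """
    env = dict(base_env)
    home = get_subprocess_home(hermes_home)
    if home is not None:
        env["HOME"] = home
    return env
```

Routing every subprocess env builder through one helper is what lets the three injection points listed above stay consistent.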
Teknium	dcbdfdbb2b	feat(docker): add Docker container for the agent (salvage #1841) (#3668) Adds a complete Docker packaging for Hermes Agent: - Dockerfile based on debian:13.4 with all deps - Entrypoint that bootstraps .env, config.yaml, SOUL.md on first run - CI workflow to build, test, and push to DockerHub - Documentation for interactive, gateway, and upgrade workflows Closes #850, #913. Changes vs original PR: - Removed pre-created legacy cache/platform dirs from entrypoint (image_cache, audio_cache, pairing, whatsapp/session) — these are now created on demand by the application using the consolidated layout from get_hermes_dir() - Moved docs from docs/docker.md to website/docs/user-guide/docker.md and added to Docusaurus sidebar Co-authored-by: benbarclay <benbarclay@users.noreply.github.com>	2026-03-28 22:21:48 -07:00

32 Commits