Compare commits


153 Commits

Author SHA1 Message Date
Ben 4c481860ce ci(docker): run tests/docker/ in build-amd64 against the freshly-built image
The new tests/docker/ suite (added by this PR) was being picked up by the
sharded pytest matrix in tests.yml, where its session-scoped `built_image`
fixture issued a 3-7min `docker build` under tests/docker/conftest.py's
180s pytest-timeout cap. Every test in the directory failed in fixture
setup across all 6 shards.

Fix the suite so it actually runs (not skips):

1. Wire the docker tests into docker-publish.yml's build-amd64 job, right
   after the existing smoke test. The image is already loaded into the
   local daemon as `nousresearch/hermes-agent:test`; set
   HERMES_TEST_IMAGE to that and the fixture's pre-built-image branch
   short-circuits the rebuild. The 21 tests run in ~90s locally against a
   prebuilt image, with no rebuild cost on top of the existing build step.

2. Exclude tests/docker/ from scripts/run_tests_parallel.py's default
   discovery so the sharded matrix in tests.yml stops trying to build
   the image. Explicit positional paths (`pytest tests/docker/` or
   `scripts/run_tests.sh tests/docker/`) still pick the suite up — the
   skip rule honors directory-level user intent, matching the existing
   per-file override pattern.

The dedicated docker-tests step runs on every PR that touches docker
code (the existing path filters on docker-publish.yml already cover
`tests/docker/**` via `**/*.py`), so the suite gates real changes.
2026-05-25 11:55:03 +10:00
Ben 1150639fa9 chore(ty): suppress unresolved-import inside tests/ to keep lint-diff PR comment useful
The lint-diff CI job runs ty as a bare uv tool without installing the
project's venv, so test files trip ty with unresolved-import on
pytest itself and on local test-only deps. The PR #30136 github-
actions lint-summary bot reported 7 new such warnings, even though
ty itself flags them as non-blocking and the imports demonstrably
work at runtime (the full pytest suite in a sibling CI job exercises
them).

Installing the full venv just to please ty would balloon the lint
job runtime; the override below tells ty to ignore unresolved-import
strictly inside tests/. The diagnostic class continues to be active
for hermes_cli/, agent/, plugins/, etc. — anywhere those imports
might really break.
2026-05-25 11:22:06 +10:00
Ben 02c933aedc test(docker): fix svstat 'want up' assertion in profile-gateway lifecycle test
After the supervise-perms fix lands, the s6 lifecycle actually works
for the hermes user — hermes -p <profile> gateway start now genuinely
brings the supervised gateway up rather than silently no-op'ing on
EACCES. That exposes a latent bug in this test's assertion: it
expected 'want up' to appear literally in s6-svstat output, but
s6-svstat elides redundancies — when the slot is currently up AND
s6 wants it up, the output is just 'up (pid N pgid N) X seconds';
the explicit 'want up' token only appears when current ≠ wanted
(e.g. 'down (exitcode 1) … , want up' on a crash-loop).

Add a small helper _svstat_wants_up() that reads the want-state
correctly across both spellings:
  * 'up …'                       → wanted up (unless explicit 'want down')
  * 'down …, want up'            → wanted up explicitly
  * 'down …'                     → wanted down

Both stop and start assertions now use the helper. Also rewords
the module docstring to acknowledge that the supervised process
may succeed OR crash-loop depending on environment, but the want-
state contract holds either way.
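
As an illustration only (not the code in this commit), the three spellings listed above could be read roughly like this. The function name and the sample strings are simplified assumptions; real s6-svstat output carries more fields.

```python
def svstat_wants_up(svstat_output: str) -> bool:
    """Return True when s6 wants the service up, per the three cases above."""
    line = svstat_output.strip()
    # An explicit token only appears when current state differs from wanted state.
    if "want up" in line:
        return True
    if "want down" in line:
        return False
    # No explicit token: the wanted state equals the current state.
    return line.startswith("up")
```

With this shape, a crash-looping slot ('down ..., want up') and a healthy slot ('up ...') both satisfy the same assertion.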
2026-05-25 11:21:47 +10:00
Ben c41f908ad4 fix(docker): make s6 lifecycle work for the unprivileged hermes user
Resolves the explicit "Known follow-up" left by commit 2f8ceeab9 and
the resulting CI failures in tests/docker/test_dashboard.py and
tests/docker/test_s6_profile_gateway_integration.py.

The product gap
---------------
Every hermes runtime operation inside the container runs as the
hermes user (UID 10000) via s6-setuidgid. But s6-supervise — spawned
by s6-svscan running as PID 1 — creates each service's supervise/
and top-level event/ directories with mode 0700 owned by its
effective UID (root). That left every s6-svc / s6-svstat / s6-svwait
call from hermes hitting EACCES on the supervise/control FIFO and
supervise/status — i.e. the entire S6ServiceManager lifecycle
(register, start, stop, unregister) was inert in production.

The 2f8ceeab9 commit message called this out and deferred the fix.
The audit changes that landed alongside it (defaulting docker_exec
to -u hermes) made the integration tests reproduce the bug
deterministically; the fix below resolves it.

The fix: pre-create the supervise/ skeleton hermes-owned
----------------------------------------------------------
Reading s6's source (src/supervision/s6-supervise.c::trymkdir +
control_init), the mkdir and mkfifo calls that build the supervise
tree are EEXIST-safe: if the directory or FIFO is already present,
s6-supervise reuses it and skips the chown/chmod fix-up that would
normally make event/ 03730 root:root. So if we lay the skeleton
down with hermes ownership before triggering s6-svscanctl -a,
s6-supervise inherits our layout and never touches it. The
death_tally / lock / status regular files written later by
s6-supervise (still as root) land mode 0644 — world-readable —
which is all s6-svstat needs.

New module-level helper _seed_supervise_skeleton(svc_dir) in
hermes_cli/service_manager.py lays down:
  svc_dir/event/                       hermes:hermes 03730
  svc_dir/supervise/                   hermes:hermes 0755
  svc_dir/supervise/event/             hermes:hermes 03730
  svc_dir/supervise/control            hermes:hermes 0660 (FIFO)
  svc_dir/log/event/                   hermes:hermes 03730  (if log/ present)
  svc_dir/log/supervise/               hermes:hermes 0755
  svc_dir/log/supervise/event/         hermes:hermes 03730
  svc_dir/log/supervise/control        hermes:hermes 0660 (FIFO)

The log/ branch matters because the logger is a second
s6-supervise instance — without it, unregister rmtree races on
the logger's root-owned supervise dir even after the parent
slot's supervise/ is hermes-owned. The helper is idempotent and
swallows PermissionError on chown so it works equally well when
called from root (cont-init.d) or hermes (runtime register).
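
A minimal sketch of the layout described above, assuming the caller passes the target uid/gid explicitly. The function and helper names are hypothetical stand-ins, not the helper in hermes_cli/service_manager.py, and the sketch does not handle chmod failing on a tree owned by someone else.

```python
import os
from pathlib import Path

EVENT_MODE = 0o3730      # mode the commit message lists for event/ dirs
SUPERVISE_MODE = 0o755   # supervise/ dirs
CONTROL_MODE = 0o660     # supervise/control FIFO


def _own(path: Path, uid: int, gid: int, mode: int) -> None:
    """chown then chmod; tolerate chown failure when not running as root."""
    try:
        os.chown(path, uid, gid)
    except PermissionError:
        pass  # unprivileged caller: keep the ownership we already have
    os.chmod(path, mode)  # explicit chmod, since mkdir/mkfifo modes are umask-filtered


def seed_supervise_skeleton(svc_dir: Path, uid: int, gid: int) -> None:
    """Pre-create event/, supervise/, supervise/event/ and the control FIFO."""
    bases = [svc_dir]
    if (svc_dir / "log").is_dir():
        bases.append(svc_dir / "log")  # the logger is a second supervised service
    for base in bases:
        for rel, mode in (("event", EVENT_MODE),
                          ("supervise", SUPERVISE_MODE),
                          ("supervise/event", EVENT_MODE)):
            target = base / rel
            target.mkdir(exist_ok=True)  # idempotent on repeat calls
            _own(target, uid, gid, mode)
        control = base / "supervise" / "control"
        if not control.exists():
            os.mkfifo(control)
        _own(control, uid, gid, CONTROL_MODE)
```

Calling it twice on the same directory is a no-op the second time, which is the property the idempotency test relies on.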

Wiring
------
1. S6ServiceManager.register_profile_gateway calls
   _seed_supervise_skeleton(tmp_dir) just before publishing the
   slot via Path.replace. Runtime-registered profile gateways are
   set up by hermes.

2. container_boot._register_service does the same in the cont-init.d
   reconciliation path so boot-time-restored profile slots inherit
   the same layout.

3. New cont-init.d/015-supervise-perms script chowns the supervise/
   and event/ trees for STATIC s6-rc services (dashboard,
   main-hermes). These are spawned by s6-rc before cont-init.d
   gets to run, so the EEXIST-trick doesn't apply; we chown the
   already-existing tree instead. s6-supervise keeps using the
   same files; it never re-asserts ownership on a running service.
   The script skips s6-overlay internal services (s6rc-*,
   s6-linux-*) so the supervision tree itself stays root-only.
   The 015- slot is intentional: it lex-sorts between 01-hermes-setup
   and 02-reconcile-profiles in the container's C locale, so
   the chown finishes before the reconciler walks the scandir.

Unregister teardown reordering
------------------------------
S6ServiceManager.unregister_profile_gateway now fires
s6-svscanctl -an BEFORE rmtree (with a 200ms grace), so
s6-svscan reaps the supervise child and releases its file
handles on supervise/lock + supervise/status before we try to
remove the directory. Previously rmtree raced s6-supervise on a
set of files inside the supervise dir, and even with the parent
supervise/ now hermes-owned, the contained files (death_tally,
lock, status, written by root) could still be in use.

Dashboard down-state redesign
-----------------------------
The original PR #30136 review fix wrote a 'down' marker file
into /run/service/dashboard/ via cont-init.d/03-dashboard-toggle.
That approach was broken in two ways:

  (a) /run/service/dashboard is a symlink to a TRANSIENT
      /run/s6-rc:s6-rc-init:<tmpdir>/ directory while s6-rc is
      mid-transaction; the touch landed in a soon-to-be-discarded
      tmp.

  (b) Even when written to the final /run/s6-rc/servicedirs/
      location, the 'down' file is only consulted by s6-supervise
      at slot startup. s6-rc's user-bundle explicitly transitions
      'dashboard' to 'up' on every boot, overriding any down
      marker.

The right fix is the canonical s6 pattern: when HERMES_DASHBOARD
is unset, the dashboard run script exits 0 and a companion
finish script exits 125. Per s6-supervise(8), exit code 125 from
the finish script is the 'permanent failure, do not restart'
marker — equivalent to s6-svc -O. The slot reports as 'down' to
s6-svstat, matching the reality that no dashboard process is
running. When HERMES_DASHBOARD IS truthy, finish exits 0 and
restart-on-crash semantics apply.

03-dashboard-toggle is removed (its function is now subsumed by
the run/finish pair).

Tests
-----
Adds four unit tests for _seed_supervise_skeleton covering the
produced layout, the log/ subservice case, the skip-when-no-log
case, and idempotency. The live-container verification continues
to live in tests/docker/test_s6_profile_gateway_integration.py and
tests/docker/test_dashboard.py — both now pass against the
rebuilt image.

References
----------
* Skarnet skaware mailing list 2020-02-02 (Laurent Bercot
  + Guillermo Diaz Hartusch) on unprivileged s6 tool semantics:
  http://skarnet.org/lists/skaware/1424.html
* just-containers/s6-overlay#130 — same EEXIST-preseed pattern,
  community-validated 2016 onward
* https://skarnet.org/software/s6/servicedir.html — exit-code 125
  semantics in finish scripts
2026-05-25 11:21:31 +10:00
Ben ffc1bb6393 test(dockerfile): recognize s6-overlay/init as a valid PID-1; harden against historical-comment masquerade
PR #30136 CI: test_dockerfile_entrypoint_routes_through_the_init failed
because the test hardcoded known_inits = ('tini', 'dumb-init',
'catatonit'). The PR replaced tini with s6-overlay's /init (which execs
s6-svscan as PID 1) — same SIGCHLD-reaping contract, different name,
so the substring scan against ENTRYPOINT missed it.

Two-part fix:

1. Extend the accepted token list to include 's6-overlay', 's6-svscan',
   and '/init'. The contract these tests enforce is behavioural ('some
   PID-1 init reaps SIGCHLD'), so the names list is purely a recognition
   table and any reaper-capable family should qualify.

2. Harden test_dockerfile_installs_an_init_for_zombie_reaping (the
   sibling check) against comment-only matches. It was scanning the full
   Dockerfile text and only passed because the word 'tini' is still in
   a historical comment explaining why we used to use it. The next
   person to clean up that comment would have silently broken the test.
   New _instruction_text() helper joins only the parsed, non-comment
   Dockerfile instructions so stale comments can't satisfy the check.
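
One way such a helper could look, as a rough sketch: the name is hypothetical, and the parsing is deliberately naive (it drops whole-line comments and blank lines only, and ignores heredocs and parser directives).

```python
def instruction_text(dockerfile: str) -> str:
    """Join the non-comment, non-blank lines of a Dockerfile."""
    kept = []
    for raw in dockerfile.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        kept.append(line)
    return "\n".join(kept)
```

Scanning this text instead of the raw file means a historical comment mentioning 'tini' can no longer satisfy the check on its own.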
2026-05-25 10:32:51 +10:00
Ben 472be1247d fix(service_manager): pass encoding to Path.read_text in _s6_running
PR #30136 CI: ruff PLW1514 (preview rule unspecified-encoding) failed on
`Path('/proc/1/comm').read_text().strip()` introduced by commit
2f8ceeab9 (the daimon-nous critical-bug fix that switched s6 detection
off /proc/1/exe to /proc/1/comm so it works for the unprivileged hermes
user).

Add explicit encoding='utf-8'. /proc/1/comm is always plain ASCII (the
kernel's PR_GET_NAME / TASK_COMM_LEN buffer), so utf-8 is correct and
locale-independent.
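
For illustration, a small sketch of the corrected read. The function name and the OSError fallback are assumptions, not the body of _s6_running.

```python
from pathlib import Path


def pid1_comm(comm_path: str = "/proc/1/comm") -> str:
    """Read PID 1's command name with an explicit encoding."""
    try:
        # An explicit encoding keeps the read independent of the process locale.
        return Path(comm_path).read_text(encoding="utf-8").strip()
    except OSError:
        return ""  # assumption: treat an unreadable file as "unknown"
```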
2026-05-25 10:32:36 +10:00
Ben Barclay 59da190512 Merge branch 'main' into docker_s6 2026-05-25 09:39:27 +10:00
Teknium 9c08070703 test(cli): update resume usage-hint assertion for numbered selection
PR #9020's salvage changed the /resume list footer from
'Use /resume <session id or title> to continue.' to
'Use /resume <number>, /resume <session id>, or /resume <session title> to continue.\n  Example: /resume 2'.

test_resume_without_target_lists_recent_sessions still pinned the old
string verbatim and failed in CI. Relax to substring assertions that
allow both the new numbered footer and any future tweaks while still
verifying the hint is shown.
2026-05-24 16:22:48 -07:00
Teknium c043c86bd7 i18n+tests: add list_item_numbered, list_footer_numbered, out_of_range for 15 locales
The numbered /resume feature added new i18n keys to en.yaml; the catalog parity
tests require every locale to carry matching keys and placeholders, so add
translations to all 15 supported locales.

Also unblock tests/cli/test_cli_resume_command.py:
- _make_cli stub now sets self.resume_display = 'minimal' since
  _handle_resume_command (post-#31695) calls _display_resumed_history.
- mock_db.resolve_resume_session_id returns the input id (no compression
  chain) so HERMES_SESSION_ID is set to a real string, not a MagicMock.
2026-05-24 16:22:48 -07:00
Teknium 87580076fd chore(release): map 490408354@qq.com to daizhonggeng (PR #9020) 2026-05-24 16:22:48 -07:00
daizhonggeng fef733d56b feat: support numbered resume selection in cli and gateway 2026-05-24 16:22:48 -07:00
AhmetArif0 4f4e337c47 fix(file-safety): write-deny pairing/ directory to prevent approved-list injection
The gateway pairing directory (~/.hermes/pairing/) stores per-platform
access-control files (telegram-approved.json, discord-approved.json, etc.).
A prompt-injected agent using write_file could add arbitrary user IDs to an
approved file, granting persistent gateway access without going through the
pairing code flow — the same threat class that motivated protecting
webhook_subscriptions.json (#14157).

The pairing directory was not included in the original control-plane protection
because it postdates PR #14157. PR #30383 introduced the hashed-pending schema
and made the approved files the sole source of truth for gateway access, raising
the security sensitivity of the directory.

Apply the same mcp-tokens pattern: block writes to pairing/ and any path within
it, under both the active hermes_home and the root path (for profile-mode parity
with the fix in #30382).

Regression tests verify denial for pairing/telegram-approved.json,
pairing/discord-pending.json, and the directory itself, in both normal and
profile-mode layouts.
2026-05-24 16:15:33 -07:00
LeonSGP43 6c44d537cc fix(cli): show full session titles in /resume list 2026-05-24 16:13:23 -07:00
Teknium 8e68426981 fix(cli): add inline --yes/now skip for destructive slash commands (#30768)
Issue #30768 reports that on native Windows PowerShell the destructive-slash
confirmation modal renders but never registers keypresses, leaving the user
unable to confirm or cancel /reset, /new, /clear, or /undo. The modal works
on macOS, Linux, and WSL; PR #23907 (merged May 11) replaced the
daemon-thread input() pattern with a prompt_toolkit-native keybinding modal
but the win32 input pipeline apparently doesn't dispatch keys to the
filter-conditioned handlers. The modal investigation is ongoing.

This change ships the immediate escape hatch: append `now`, `--yes`, or `-y`
to any destructive slash command to bypass the modal and run the action
immediately. Works on every platform without touching the broken Windows
code path.

  /reset now            -> reset, no modal
  /new --yes my-session -> new session titled "my-session", no modal
  /clear -y             -> clear, no modal
  /undo -y              -> undo, no modal

The default behavior (modal prompts when approvals.destructive_slash_confirm
is True) is unchanged for users who don't pass a skip token.

Implementation:

- New classmethod HermesCLI._split_destructive_skip(text) -> (remainder, skip)
  parses a destructive-slash command string, strips the leading "/cmd" word
  and any recognized skip tokens (case-insensitive exact match, not substring),
  and reports whether a skip was requested.
- HermesCLI._confirm_destructive_slash gains an optional cmd_original= arg.
  When the arg contains a skip token, it returns "once" immediately —
  before the gate check and before any modal rendering.
- The /clear, /new, /undo handlers in process_command pass cmd_original
  through. /new additionally uses _split_destructive_skip to strip skip
  tokens from the remaining text before deriving the session title, so
  "/new now My Session" yields title="My Session" (not "now My Session").

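The parsing rule described above can be sketched as follows. This is an assumption-laden illustration, not the classmethod itself: it is written as a free function, and it strips a skip token wherever it appears after the command word.

```python
SKIP_TOKENS = {"now", "--yes", "-y"}


def split_destructive_skip(text):
    """Return (remainder, skip) for a destructive slash command string."""
    words = (text or "").split()
    if words and words[0].startswith("/"):
        words = words[1:]  # drop the leading "/cmd" word
    # Case-insensitive exact match on whole words, never on substrings.
    kept = [w for w in words if w.lower() not in SKIP_TOKENS]
    return " ".join(kept), len(kept) != len(words)
```

Matching whole words is what keeps a title such as "nowhere" from being mistaken for the `now` token.
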
Tests:

- 7 new unit tests in tests/cli/test_destructive_slash_confirm.py covering
  the helper (recognized tokens, command-word stripping, case-insensitive
  exact match, None/empty input) and the modal bypass (now and --yes both
  skip; no-skip-token still consults the modal).
- 3 new integration tests in tests/cli/test_destructive_slash_inline_skip_e2e.py
  driving HermesCLI.process_command end-to-end and asserting (a) new_session
  is invoked, (b) the modal is never reached, (c) the skip token does not
  leak into the session title, and (d) the no-skip-token path still reaches
  the modal as a sanity check that we haven't accidentally short-circuited
  the normal flow.

All 31 tests across the destructive-slash test surface pass.

Docs:

- website/docs/reference/slash-commands.md documents the new flags both in
  the destructive-commands table and the dedicated approval section, with a
  link back to issue #30768 explaining why the escape hatch exists.
2026-05-24 16:13:03 -07:00
teknium1 99a7ecc335 chore(release): map leeseoki0 for PR #31315 salvage 2026-05-24 15:48:58 -07:00
leeseoki0 ce529d6072 fix(kanban): scratch tasks must not inherit board.default_workdir (#28818)
Board defaults represent persistent project checkouts. Scratch workspaces
are auto-deleted on completion and must stay under the per-board scratch
root that resolve_workspace() creates. Inheriting default_workdir for a
scratch task pointed the cleanup path at the user's source tree — the
data-loss vector documented in #28818.

The containment guard in _cleanup_workspace (just added) is the safety
rail. This commit prevents the bad state from being created in the first
place: only persistent kinds (dir/worktree) inherit board defaults.

Tests updated to cover the new semantics: scratch with default_workdir
set keeps workspace_path=None; dir/worktree still inherits the board
default.

Salvaged from PR #31315 by @leeseoki0 — prevention layer on top of the
#28819 containment fix by @briandevans.

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
2026-05-24 15:48:58 -07:00
briandevans 23115b5c0f fix(kanban): restrict managed-scratch roots to workspaces/ dirs only
Copilot review on PR #28819 flagged that `_is_managed_scratch_path` accepted
the entire `<kanban_home>/kanban` subtree as managed scratch storage. With
that, a task whose `workspace_kind='scratch'` and `workspace_path` was
mis-set to `<kanban_home>/kanban`, `.../kanban/logs`, or a board's
metadata directory (e.g. `.../kanban/boards/<slug>` without the
`workspaces/` child) would pass the containment guard and let task
completion `shutil.rmtree` Hermes' own DB, metadata, and log subtrees.

Tighten the guard:

* Allowed roots are now exclusively `workspaces/` directories — the
  `HERMES_KANBAN_WORKSPACES_ROOT` override, `<kanban_home>/kanban/workspaces`,
  and each `<kanban_home>/kanban/boards/<slug>/workspaces` discovered on
  disk.
* Require strict descendancy: a path equal to a root itself is rejected
  too, because deleting a workspaces root would wipe every task's scratch
  dir at once.

Add a regression test covering the three Copilot-named attack paths
(kanban root, kanban/logs, board root without `workspaces/`) plus the
workspaces-root-itself case, and confirm the inner task-id dir still
matches.
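
The strict-descendancy rule can be sketched like this, assuming the allowed roots have already been collected. The function name is hypothetical, and root discovery on disk is left out.

```python
from pathlib import Path


def is_strict_descendant(path, roots) -> bool:
    """True only when path lies strictly below one of the allowed roots."""
    resolved = Path(path).resolve()
    for root in roots:
        # Path.parents never contains the path itself, so a path equal
        # to a root is rejected along with everything outside it.
        if Path(root).resolve() in resolved.parents:
            return True
    return False
```

Resolving both sides first means a path that escapes through `..` is compared by where it actually lands.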
2026-05-24 15:48:58 -07:00
briandevans 80ad1609c8 fix(kanban): refuse to rmtree workspace_path outside managed scratch root (#28818)
A board's ``default_workdir`` (e.g. ``hermes kanban boards
set-default-workdir my-board /path/to/real/source``) is copied into
``tasks.workspace_path`` for tasks created without an explicit
``workspace_kind``. Those tasks default to ``workspace_kind='scratch'``,
so completion calls ``_cleanup_workspace`` and unconditionally runs
``shutil.rmtree(wp, ignore_errors=True)`` — deleting the user's real
source tree as if it were disposable scratch storage.

Add ``_is_managed_scratch_path()`` and gate ``_cleanup_workspace`` on
it: only delete paths under ``HERMES_KANBAN_WORKSPACES_ROOT`` (the
worker-side override the dispatcher injects) or under the active kanban
home's ``kanban/`` subtree (covering both the legacy default-board root
and per-board ``kanban/boards/<slug>/workspaces`` roots). Anything else
gets a warning log and is left alone, so a misconfigured
``default_workdir`` can no longer destroy user data on task completion.
2026-05-24 15:48:58 -07:00
Teknium 396ee69032 fix(gateway): seed plugin extras before is_connected gate (#31703)
Follow-up to 54e61f933. The plugin enablement gate calls
``entry.is_connected(probe_cfg)`` BEFORE ``env_enablement_fn`` runs,
and the probe is built as ``existing_cfg or PlatformConfig()`` — empty
extras, ``enabled=False``.

For plugins whose ``is_connected`` reads ``config.extra`` instead
of env vars directly, that probe is a misrepresentation of what the
platform will look like after enablement. Google Chat's
``_is_connected`` short-circuits on ``config.enabled`` and inspects
``config.extra["project_id"]`` / ``config.extra["subscription_name"]``
— both False on the default probe even when the user has set
``GOOGLE_CHAT_PROJECT_ID`` and ``GOOGLE_CHAT_SUBSCRIPTION_NAME``. Result:
Google Chat silently fails the gate on every env-var-only setup.

Build a candidate probe that mirrors what the platform will look like
post-enablement:
- pre-call ``env_enablement_fn`` and layer its result into the probe's
  ``extra`` (without mutating any existing platform config)
- pass ``enabled=True`` on the probe — we're asking "would this BE
  configured if we let it in?" not "is it currently enabled?"
- reuse the same seeded extras when we commit the platform to
  ``config.platforms`` (avoids calling ``env_enablement_fn`` twice)
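
A rough sketch of the candidate-probe idea. The `PlatformConfig` below is a stand-in dataclass, not the gateway's real class, and the choice to let seeded values override existing extras on conflict is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlatformConfig:
    """Stand-in for the gateway's config object (assumed shape)."""
    enabled: bool = False
    extra: dict = field(default_factory=dict)


def build_candidate_probe(existing_cfg: Optional[PlatformConfig],
                          seeded_extra: dict) -> PlatformConfig:
    """Build a probe that looks like the platform after enablement."""
    # Copy first so the existing platform config is never mutated.
    extra = dict(existing_cfg.extra) if existing_cfg else {}
    extra.update(seeded_extra)
    return PlatformConfig(enabled=True, extra=extra)
```

The same `seeded_extra` dict can then be reused when the platform is committed, so the enablement function runs once.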

Discord/IRC/Teams/LINE/ntfy/Simplex ``_is_connected`` hooks read env
vars directly, so they are unaffected. This change only restores
Google Chat on env-var-only setups while keeping the original #31116
Discord-no-token block intact.

All 6 shipped ``env_enablement_fn`` implementations were audited and
are pure reads (no ``os.environ`` writes), so running them earlier in
the loop has no observable side effects.

Tests: 2 new in tests/gateway/test_platform_registry.py covering
extras-seeded-before-is_connected and don't-leak-extras-on-gate-fail.
693 tests across 11 adjacent suites pass (platform_registry, config,
google_chat, matrix, discord_connect, ntfy_plugin, simplex_plugin,
line_plugin, irc_adapter, teams, gateway_platform_gating).

Refs #31116.
2026-05-24 15:44:26 -07:00
helix4u 514f5020c7 fix(debug): redact BlueBubbles webhook secrets 2026-05-24 15:43:48 -07:00
Teknium 13b85bc646 feat(config): document resume-recap tuning keys in DEFAULT_CONFIG
The hardcoded constants in _display_resumed_history were exposed as
config in PR #4434; declare them in DEFAULT_CONFIG and the CLI fallback
dict so they show up in 'hermes config' diagnostics and the schema
validator.
2026-05-24 15:36:37 -07:00
Teknium 5dc10ec3ba test(cli): reconcile resume-recap tests with skip-tool-only default and compression-chain helper
- test_tool_calls_shown_as_summary: explicitly disable resume_skip_tool_only
  (#4434 made True the default; the legacy assertion relied on tool-only
  entries being rendered as a summary).
- test_tool_only_message_skipped_by_default: add coverage for the new
  default skip behavior.
- test_resume_command_*: mock_db.resolve_resume_session_id now returns the
  same id (no compression chain) so the post-#15000 redirect block doesn't
  shove a MagicMock into HERMES_SESSION_ID.
2026-05-24 15:36:37 -07:00
Teknium 27c4ba98c3 chore(release): map zhangsamuel12@gmail.com to SamuelZ12 (PR #7480) 2026-05-24 15:36:37 -07:00
ygd58 cdf4876bfe fix(cli): skip tool-call-only entries in resume recap, expose limits as config options 2026-05-24 15:36:37 -07:00
Samuel Zhang 961e34a1d3 fix: show recap after in-session resume 2026-05-24 15:36:37 -07:00
Teknium 16eed4f91b test(telegram): add brand-new-topic regression for #31086
The cherry-picked fix from #28605 inverts an existing test (an unknown
non-lobby thread_id no longer rewrites to the most-recent binding), but
that test only seeds two bindings and queries a third thread_id. Add a
second regression test that more closely mirrors the live failure mode:
seed exactly one prior binding, then query a brand-new thread_id and
assert recovery returns None — so the new topic is allowed to get its
own session row instead of being silently merged into the previous
topic's session.

Co-authored-by: Fábio Siqueira <fabioxxx@gmail.com>
Co-authored-by: dillweed <dillweed@users.noreply.github.com>
2026-05-24 15:28:40 -07:00
Maxim Esipov bdc9b0eff5 fix(telegram): preserve new DM topic lanes 2026-05-24 15:28:40 -07:00
Teknium eea9553a9c fix(anthropic): skip mcp_ prefix on outgoing tool schemas when already prefixed
Companion to the GH-25255 incoming-strip fix from @hayka-pacha. Without
this, build_anthropic_kwargs unconditionally added 'mcp_' to every tool
name in step 3, so a native MCP server tool registered as
'mcp_composio_X' was sent as 'mcp_mcp_composio_X' on the wire. The
incoming strip only removes ONE prefix, which still worked on first
call, but on subsequent calls the model pattern-matched the
single-prefixed form from message history and produced names that
stripped to 'composio_X' — registry miss, dispatch fail.

The history-rewrite block (#4) already has this guard. Apply the same
guard to the schema-rewrite block (#3) so round-trip is symmetric.
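
The guard amounts to something like the following sketch. The function name is hypothetical; the real logic lives inside build_anthropic_kwargs.

```python
MCP_PREFIX = "mcp_"


def outgoing_tool_name(name: str) -> str:
    """Add the mcp_ prefix only when the name does not already carry it."""
    return name if name.startswith(MCP_PREFIX) else MCP_PREFIX + name
```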

Added 4 outgoing-side tests. Existing 7 incoming-side tests still pass.

Author map: hayka-pacha added for PR #25270 salvage attribution.

Refs GH-25255.
2026-05-24 15:27:45 -07:00
HKPA 2f91a8406c fix(agent): only strip mcp_ prefix for OAuth-injected tools (GH-25255)
When strip_tool_prefix=True (Anthropic OAuth path), normalize_response
unconditionally stripped the mcp_ prefix from ALL tool names starting
with mcp_. This broke Hermes-native MCP server tools (registered under
their full mcp_<server>_<tool> name in the registry) because the stripped
name doesn't match any registry entry.

Fix: check the tool registry before stripping. Only strip when:
- The stripped name EXISTS in the registry (OAuth-injected tool)
- The full name does NOT exist in the registry

This preserves backward compatibility for OAuth-injected tools while
protecting native MCP server tools from incorrect prefix removal.
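
The two conditions can be sketched as below, assuming the registry behaves like a set of registered tool names. The function name and signature are illustrative, not those of normalize_response.

```python
MCP_PREFIX = "mcp_"


def normalize_tool_name(name: str, registry, strip_tool_prefix: bool = True) -> str:
    """Strip mcp_ only for tools registered under the unprefixed name."""
    if not strip_tool_prefix or not name.startswith(MCP_PREFIX):
        return name
    stripped = name[len(MCP_PREFIX):]
    # Strip only when the stripped name is registered AND the full name is not.
    if stripped in registry and name not in registry:
        return stripped
    return name
```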

7 new tests covering: OAuth strip, native preserve, no-flag, non-mcp,
unknown tools, mixed responses, and dual-registration edge case.

Signed-off-by: HKPA <hayka-pacha@users.noreply.github.com>
2026-05-24 15:27:45 -07:00
Yuan Li 476c897439 fix(telegram): gate send() on send-path health after reconnect storms (#31165)
After sustained Bad Gateway / TimedOut reconnect cycles, the PTB httpx
client can enter a state where bot.send_message() returns a valid
Message (real message_id) but the message never reaches the recipient.
TelegramAdapter.send returns SendResult(success=True) and cron's
live-adapter branch marks the run delivered while the message is
silently dropped.

Add a _send_path_degraded flag. _handle_polling_network_error sets it
on reconnect storms; the existing _verify_polling_after_reconnect
heartbeat probe clears it once getMe() confirms the Bot client is
healthy. While the flag is set, send() short-circuits with
SendResult(success=False, retryable=True) so cron falls through to
the standalone delivery path (fresh HTTP session).
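
A simplified sketch of the gating behavior. `SendGate` and its method names are hypothetical, `SendResult` is a stand-in dataclass, and the real adapter is async and does considerably more.

```python
from dataclasses import dataclass


@dataclass
class SendResult:
    """Stand-in result type (assumed shape)."""
    success: bool
    retryable: bool = False


class SendGate:
    """Short-circuit sends while the send path is marked degraded."""

    def __init__(self) -> None:
        self.degraded = False

    def mark_degraded(self) -> None:
        self.degraded = True   # set when a reconnect storm is detected

    def mark_healthy(self) -> None:
        self.degraded = False  # cleared once a heartbeat probe succeeds

    def send(self, deliver) -> SendResult:
        if self.degraded:
            # Fail retryably so the caller can fall back to another path.
            return SendResult(success=False, retryable=True)
        deliver()
        return SendResult(success=True)
```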

Closes #31165.

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
2026-05-24 15:27:41 -07:00
Teknium 54e61f9331 fix(matrix,gateway): Matrix E2EE installs full dep set; plugins respect is_connected
Fixes #31116 — two distinct bugs in fresh-install Matrix gateway:

1. Matrix E2EE setup installed only mautrix[encryption], leaving asyncpg
   / aiosqlite / Markdown / aiohttp-socks uninstalled. The first encrypted
   connect failed with 'No module named asyncpg' deep inside
   MatrixAdapter.connect(). Root cause: the setup wizard hand-rolled a
   pip install of one package instead of using lazy_deps.ensure(
   'platform.matrix'), and check_matrix_requirements() short-circuited the
   runtime installer on 'import mautrix' alone — so the other 4 packages
   were never pulled in.

2. Discord auto-enabled itself on every gateway start, even when the user
   never selected Discord and had no DISCORD_BOT_TOKEN. Root cause:
   gateway/config.py plugin-enablement loop gated enablement on
   entry.check_fn() (just 'is the SDK importable?') and ignored
   entry.is_connected (the 'did the user configure credentials?' probe).
   Same bug class as commit 7849a3d73 fixed for _platform_status in the
   setup wizard; this is the runtime counterpart. Affects Discord, Teams,
   and Google Chat.

Changes:
- hermes_cli/setup.py::_setup_matrix — install via
  lazy_deps.ensure('platform.matrix') to pull the full feature group.
- gateway/platforms/matrix.py::_check_e2ee_deps — verify asyncpg +
  aiosqlite + PgCryptoStore in addition to OlmMachine, so E2EE failures
  surface at startup instead of at first encrypted-room connect.
- gateway/platforms/matrix.py::check_matrix_requirements — use
  feature_missing('platform.matrix') as the install gate instead of a
  single 'import mautrix' check, so partial installs trigger the lazy
  installer correctly.
- gateway/config.py plugin-enablement loop — consult entry.is_connected
  before flipping enabled=True. Explicit YAML enabled=true still wins.

Tests: 3 new in tests/gateway/test_matrix.py (asyncpg-required,
aiosqlite-required, partial-install lazy-runs), 5 new in
tests/gateway/test_platform_registry.py (is_connected=False blocks,
is_connected=True enables, is_connected=None falls back to check_fn,
raising probe doesn't enable, explicit YAML wins).

Validation: 310 tests across affected test modules pass.
2026-05-24 15:16:03 -07:00
teknium1 88834baf50 chore: map soju06@users.noreply.github.com for PR #26054 salvage 2026-05-24 15:15:37 -07:00
Soju 6212e9ade8 fix(error-classifier): treat 5xx request-validation errors as non-retryable
Standard OpenAI returns request-validation failures (unknown/
unsupported parameter, malformed request) as 4xx. Some
OpenAI-compatible gateways return them as 5xx instead — codex.nekos.me
returns 502 for an unknown parameter.

The generic '5xx -> retryable server_error' rule then misfires: the
error is deterministic (every retry gets the identical rejection), so
the retry loop burns all 3 attempts, the transport-recovery path
resets the counter and burns 3 more, and the result is a request
flood against a request that can never succeed.

Fix: when a 500/502 body carries an unambiguous request-validation
signal — 'unknown parameter' / 'unsupported parameter' /
'invalid_request_error' in the message text, or invalid_request_error
/ unknown_parameter / unsupported_parameter as the structured error
code — classify as a non-retryable format_error so the loop fails
fast and falls back. Genuine 502 Bad Gateway with no such signal
stays retryable as before.
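
The rule can be illustrated with a small sketch. The function name and the (label, retryable) return shape are assumptions; the signal strings are the ones listed above.

```python
VALIDATION_PHRASES = ("unknown parameter", "unsupported parameter",
                      "invalid_request_error")
VALIDATION_CODES = {"invalid_request_error", "unknown_parameter",
                    "unsupported_parameter"}


def classify_5xx(status: int, message: str = "", code=None):
    """Return (label, retryable) for a 5xx response."""
    if status in (500, 502):
        text = message.lower()
        if code in VALIDATION_CODES or any(p in text for p in VALIDATION_PHRASES):
            return ("format_error", False)  # deterministic rejection: fail fast
    return ("server_error", True)  # generic 5xx stays retryable
```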

Origin: local-author
Upstream-PR: none
Patch-State: local-only
2026-05-24 15:15:37 -07:00
Soju 775a17284f fix(transport): strip Hermes-internal scaffolding keys before chat.completions
The empty-response recovery path in run_agent.py appends synthetic
messages tagged with _empty_recovery_synthetic (and the agent loop uses
_thinking_prefill / _empty_terminal_sentinel similarly). These are
internal bookkeeping markers — they must never reach the wire.

chat_completions' convert_messages only stripped Codex Responses leak
fields (codex_reasoning_items, call_id, etc.), not these _-prefixed
markers. Permissive providers (real OpenAI, Anthropic) silently ignore
unknown message keys so the bug stayed hidden, but strict
OpenAI-compatible gateways reject them outright. Observed against
codex.nekos.me:

  502: [ObjectParam] [input[617]._empty_recovery_synthetic]
       [unknown_parameter] Unknown parameter:
       '_empty_recovery_synthetic'

Because the synthetic messages persist in the session, every
subsequent request in that session carries the poisoned key and
fails identically — a deterministic 502 the retry loop mistakes for
a transient server error.

Fix: convert_messages now drops any top-level message key starting
with '_'. OpenAI's message schema has no '_'-prefixed fields, so this
is safe and future-proofs against new internal markers.
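
As a rough illustration of the filter (the helper name is hypothetical; the real logic lives inside convert_messages), dropping '_'-prefixed top-level keys might look like:

```python
def strip_internal_keys(messages):
    """Return copies of the messages without top-level keys starting with '_'.

    Hypothetical helper for illustration. Nested structures are left untouched,
    and the input list is not mutated.
    """
    return [
        {key: value for key, value in message.items() if not key.startswith("_")}
        for message in messages
    ]
```

Building new dicts rather than deleting keys in place keeps the session's own copy of the message intact, so the internal markers remain available to the agent loop.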

Origin: local-author
Upstream-PR: none
Patch-State: local-only
2026-05-24 15:15:37 -07:00
Teknium 7ab1677362 feat(security): on-demand supply-chain audit via OSV.dev (#31460)
Adds 'hermes security audit' — a one-shot vulnerability scan against
OSV.dev covering three surfaces a Hermes user actually controls:

  1. The running Python's installed PyPI dists (importlib.metadata)
  2. Plugin requirements.txt / pyproject.toml pins under ~/.hermes/plugins/
  3. Pinned npx/uvx MCP servers in config.yaml

Zero new dependencies (stdlib urllib + importlib.metadata + tomllib +
concurrent.futures). No auth required for OSV's public batch API.
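
A stdlib-only sketch of how a batch query for installed PyPI dists could be assembled. The helper names are hypothetical; the endpoint and payload shape follow OSV's public querybatch API, and the request is built here but never sent:

```python
import json
import urllib.request
from importlib import metadata

OSV_BATCH_URL = "https://api.osv.dev/v1/querybatch"


def installed_pins():
    """Collect (name, version) pairs for dists visible to the running Python."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:
            pins.add((name, version))
    return sorted(pins)


def build_osv_batch(pins):
    """Turn (name, version) pairs into an OSV querybatch payload."""
    return {
        "queries": [
            {"package": {"name": name, "ecosystem": "PyPI"}, "version": version}
            for name, version in pins
        ]
    }


def build_request(payload):
    """Prepare (but do not send) the POST request; no auth header is needed."""
    return urllib.request.Request(
        OSV_BATCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```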

Flags: --json, --fail-on {low,moderate,high,critical} (default: critical),
       --skip-venv, --skip-plugins, --skip-mcp

Output groups findings by source, sorts by severity descending, surfaces
fixed-versions inline. Exit 1 when any finding meets the --fail-on tier.
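
The --fail-on threshold can be pictured as an index comparison over the four tiers. This is a sketch only; the finding dict shape and function names are assumptions:

```python
SEVERITY_TIERS = ("low", "moderate", "high", "critical")


def sort_findings(findings):
    """Order findings by severity, most severe first."""
    return sorted(
        findings,
        key=lambda finding: SEVERITY_TIERS.index(finding["severity"]),
        reverse=True,
    )


def exit_code(findings, fail_on="critical"):
    """Return 1 when any finding is at or above the fail_on tier, else 0."""
    threshold = SEVERITY_TIERS.index(fail_on)
    hit = any(SEVERITY_TIERS.index(f["severity"]) >= threshold for f in findings)
    return 1 if hit else 0
```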

Deliberately out of scope: globally-installed pip/npm, editor/browser
extensions, daily background scans, auto-blocking of installs. The audit
is on-demand by design — daily scans become noise the user trains
themselves to ignore.
2026-05-24 15:15:16 -07:00
Teknium 8065e70274 fix(agent): abort on HTTP 402 after pool rotation and fallback fail (#31443)
Closes #31273.

HTTP 402 (insufficient credits) was retried up to agent.api_max_retries
times (default 3), burning paid requests against an exhausted balance.
Real-world impact: ~$40 in 48h on a 24/7 Telegram+Discord gateway.

Root cause: FailoverReason.billing was in the is_client_error
exclusion set in agent/conversation_loop.py, which prevents the
non-retryable-abort branch from firing.

By the time control reaches that predicate:
  * credential-pool rotation has already run for billing and either
    continued the loop or returned False (pool exhausted/absent)
  * the eager-fallback branch has also fired on billing and either
    continued the loop or fell through (no fallback configured)

Falling through to the backoff retry from here has no recovery
mechanism left — it just burns more paid requests.  Removing billing
from the exclusion set makes 402 abort cleanly once pool+fallback
recovery has failed, mirroring how 401/403 (also should_fallback=True)
already behave.
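
A simplified sketch of the predicate shape. The enum members and the contents of the exclusion set are assumptions; only the removal of billing from that set is taken from the description above:

```python
from enum import Enum


class FailoverReason(Enum):
    # Hypothetical subset of reasons, for illustration only.
    billing = "billing"
    auth = "auth"
    rate_limit = "rate_limit"


# Reasons that keep retrying with backoff instead of aborting.
# Before the fix, FailoverReason.billing was listed here as well.
RETRY_EXCLUSIONS = {FailoverReason.rate_limit}


def is_client_error(status: int, reason: FailoverReason) -> bool:
    """True when a 4xx should abort the retry loop rather than back off."""
    return 400 <= status < 500 and reason not in RETRY_EXCLUSIONS
```

With billing out of the exclusion set, a 402 that survives pool rotation and fallback now takes the abort branch.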

Added tests/run_agent/test_31273_402_not_retried.py which mirrors the
is_client_error predicate shape from the source and asserts the
invariant (plus a source-inspection guard against accidental
re-introduction).
2026-05-24 15:14:13 -07:00
teknium1 5b52e26d18 fix(gateway): swallow transient Telegram TimedOut at loop level
Closes #31066. Closes #31110.

An unhandled `telegram.error.TimedOut` (or peer `NetworkError` /
`httpx` connection error) propagating to the asyncio event loop killed
the entire gateway process, taking down every profile attached to the
same runner. systemd restarted the service after ~5s but the active
conversation turn was lost.

Public adapter methods (`adapter.send`, `adapter.edit_message`,
`adapter.send_voice`, …) are individually try/except-wrapped on
current main, but at least one async path was reaching the loop with
TimedOut unhandled — the report's traceback ends at the deepest httpx
frame and doesn't pinpoint the caller.

Rather than audit 30+ call sites blind, install a loop-level safety net:
`_gateway_loop_exception_handler` is set as the loop's exception handler
in `start_gateway()` after `asyncio.get_running_loop()`. It classifies
the exception via `_is_transient_network_error()` (walks the
__cause__/__context__ chain, matches on class name so the test suite
doesn't need the real telegram/httpx packages installed). Transient
errors are logged at WARNING with full traceback so the originating
call site stays diagnosable; everything else forwards to
`loop.default_exception_handler` so real bugs still surface.
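
A condensed sketch of the classifier and handler, without leading underscores. The set of class names is an assumption; matching on names rather than imported classes is the approach the message describes:

```python
import logging

logger = logging.getLogger("gateway")

# Assumed set of transient class names, for illustration.
_TRANSIENT_NAMES = {"TimedOut", "NetworkError", "ConnectError", "ReadTimeout"}


def is_transient_network_error(exc):
    """Walk the __cause__/__context__ chain, matching on class name only."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guards against cyclic cause chains
        if type(exc).__name__ in _TRANSIENT_NAMES:
            return True
        exc = exc.__cause__ or exc.__context__
    return False


def gateway_loop_exception_handler(loop, context):
    """Swallow transient network errors; forward everything else."""
    exc = context.get("exception")
    if exc is not None and is_transient_network_error(exc):
        # Keep the traceback so the originating call site stays diagnosable.
        logger.warning("Swallowed transient network error: %r", exc, exc_info=exc)
        return
    loop.default_exception_handler(context)
```

Installation would be a single call from inside the running loop: `asyncio.get_running_loop().set_exception_handler(gateway_loop_exception_handler)`.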

Tests cover the classifier (known transients accepted, real bugs
rejected, cause/context chain unwrap, cyclic-cause termination) and the
handler (swallow + log warning, forward unknowns, missing-exception
context). One end-to-end test schedules an orphan task raising TimedOut
and asserts `asyncio.run` returns cleanly.
2026-05-24 15:03:27 -07:00
Teknium 3d66787a04 fix(vision): route auxiliary.vision.provider=openai to api.openai.com, skip text-only main (#31452)
* fix(vision): route auxiliary.vision.provider=openai to api.openai.com, skip text-only main for vision

Fixes #31179. Three coupled fixes so a configured aux vision backend
actually serves vision tasks instead of silently routing images to the
user's main provider:

1. agent/auxiliary_client.py: `auxiliary.<task>.provider: openai` resolves
   to `custom` + `https://api.openai.com/v1`. "openai" was not in
   PROVIDER_REGISTRY (we have `openai-codex` for OAuth and `custom` for
   manual base_url), so the obvious config name silently failed to build a
   client. User-supplied base_url is still preserved; only the provider
   name normalises to `custom` so resolution doesn't hit the
   PROVIDER_REGISTRY-only path.

2. agent/auxiliary_client.py: the vision auto-detect chain now skips the
   user's main provider when models.dev reports `supports_vision=False`.
   Without this guard, a misconfigured aux provider would fall back to
   `auto`, which happily returned the main-provider client. The caller
   would then send image content to e.g. api.deepseek.com with model
   `gpt-4o-mini` and get a cryptic `unknown variant 'image_url',
   expected 'text'` from the provider's parser.

3. tools/vision_tools.py + tools/browser_tool.py: `check_vision_requirements`
   now mirrors the runtime fallback chain (explicit provider, then auto),
   so `vision_analyze` shows up whenever vision is actually serviceable.
   `browser_vision` gets a new `check_browser_vision_requirements` check_fn
   that AND-gates browser + vision availability, so it doesn't get
   advertised to the model when the call would fail at runtime.
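
Fixes 1 and 2 can be sketched as two small helpers. Both names are hypothetical, and the real resolution code in agent/auxiliary_client.py is certainly more involved:

```python
OPENAI_BASE_URL = "https://api.openai.com/v1"


def normalize_aux_provider(provider, base_url=None):
    """Map the 'openai' provider name to ('custom', base_url).

    A user-supplied base_url is preserved; only the provider name changes.
    """
    if provider.strip().lower() == "openai":
        return "custom", base_url or OPENAI_BASE_URL
    return provider, base_url


def skip_main_for_vision(supports_vision):
    """Skip the main provider only when it is known to be text-only.

    None means the capability is unknown, which stays permissive.
    """
    return supports_vision is False
```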

Reproduction (config from the bug report):
  model.provider: deepseek
  model.default: deepseek-v4-pro
  auxiliary.vision.provider: openai
  auxiliary.vision.model: gpt-4o-mini

Before: resolve_vision_provider_client() returns None for the explicit
provider, fallback auto returns the deepseek client with model='gpt-4o-mini',
image hits api.deepseek.com → 'unknown variant image_url'. vision_analyze
hidden from tool list; browser_vision exposed but fails at call time.

After: resolves to custom + api.openai.com/v1 with model gpt-4o-mini.
vision_analyze and browser_vision both gate correctly on capability.

Tests: tests/agent/test_vision_routing_31179.py covers all three fixes
(12 cases including the user's exact scenario, base_url preservation,
text-only-main skip, capability-unknown permissive fallback, and tool
gating parity). Existing 382 tests across auxiliary/vision/image_routing
suites still pass.

* test(vision): use exact hostname check to silence CodeQL substring-sanitization alert

* fix(auxiliary): drop model name from vision-skip debug log to silence CodeQL

The new `logger.debug(...)` added in the previous commit interpolated
both `main_provider` and `vision_model` (a public model slug, not
sensitive). CodeQL's `py/clear-text-logging-sensitive-data` heuristic
re-flagged it twice because the rule mis-detects multi-value
interpolations near tainted-via-config provider strings.

Drop the model from the log args (provider alone is enough to diagnose
the skip; the same sibling branch a few lines up already logs provider
only). Behavior unchanged; CodeQL false positive cleared.
2026-05-24 15:01:28 -07:00
Hinotoi Agent d9ec90585c test(dashboard): send loopback headers for WebSocket sidecar test 2026-05-24 15:00:44 -07:00
hinotoi-agent 2e66eefbc3 fix(dashboard): validate WebSocket Host and Origin 2026-05-24 15:00:44 -07:00
Teknium 186bf25cb1 test(guardrail): assert halt message reaches stream_delta_callback
Regression guard for #30770 — verifies the guardrail-halt branch in
agent/conversation_loop.py pushes the synthesized halt message through
stream_delta_callback before breaking out of the loop.  Without the
emit, chat-completions SSE writers drain an empty queue and clients
(Open WebUI, etc.) see a finish chunk with zero content delta —
indistinguishable from a crash.

Verified: the test fails when the production fix is reverted.
2026-05-24 07:38:24 -07:00
annguyenNous 38b8d0da85 fix: emit guardrail halt message to client before closing stream
When the tool loop guardrail fires (max_tool_failures, etc.), the
turn exits with guardrail_halt but no final assistant message was
emitted to the client. The SSE stream closed silently —
indistinguishable from a crash.

The stream_delta_callback(None) before tool execution is a display
flush, not a hard close. After generating the halt response, emit
it through both _safe_print (CLI) and stream_delta_callback (SSE)
so clients see the explanation.

Fixes #30770
2026-05-24 07:38:24 -07:00
Teknium 889903f0fa fix(tests): align CI tests with recent security hardening (#31470)
Four recent security PRs landed on main with stale/missing test updates,
breaking 4 test shards on every subsequent PR's CI run:

- test_discord_bot_auth_bypass.py (PR #30742 c3caca658):
  DISCORD_ALLOWED_ROLES no longer bypasses _is_user_authorized.
  Inverted 3 tests to assert the new (correct) behavior: role config
  alone does NOT authorize at the gateway layer.

- test_msgraph_webhook.py (PR #30169 4ca77f105):
  adapter.is_connected is a @property, not a method. Test was calling
  it with () after the connect() change; TypeError: 'bool' is not
  callable. Removed the parens.

- test_feishu_approval_buttons.py (PR #30744 bdb97b857):
  Card-action callbacks now go through _allow_group_message
  authorization. 3 tests in TestCardActionCallbackResponse didn't
  populate adapter._allowed_group_users so the operator's open_id got
  rejected. Added the allowlist setup to each test, matching the
  existing pattern in test_returns_card_for_approve_action.

Also raise tolerance on test_wait_for_process_kills_subprocess_on_keyboardinterrupt:
the SIGTERM → 3s TimeoutStopSec → SIGKILL → reap chain can exceed 10s
under loaded xdist (40 workers). Bumped _wait_for_pgid_exit timeout
10→30s and worker join timeout 5→15s. Passes 100% in isolation
already; this just makes it tolerant of CI-host load.

Validation: 270/270 tests pass across the 5 affected files.
2026-05-24 06:54:16 -07:00
Hinotoi-agent 3bace071bf fix(state): restrict sensitive store file permissions
response_store.db (api server) holds conversation history including tool
payloads, prompts, and results. webhook_subscriptions.json holds per-route
HMAC secrets. Under a permissive umask (e.g. 0o022, default on most
distros) both files were created mode 0o644 — readable by other local
users on shared boxes.

- gateway/platforms/api_server.py: ResponseStore tightens itself + WAL/SHM
  sidecars to 0o600 after __init__, then trusts the inode. (Original
  contributor patch chmod'd after every _commit() — wasteful on a hot
  api_server path; chmod-on-create is sufficient since SQLite preserves
  mode bits across writes.)

- hermes_cli/webhook.py: _save_subscriptions writes via tempfile.mkstemp
  (which itself creates the file with 0o600), chmods the temp before the
  atomic rename, and re-asserts 0o600 on the destination so an existing
  permissive file from before this fix gets narrowed.
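
A sketch of the write path described for webhook_subscriptions.json, assuming a POSIX filesystem. The function name is hypothetical:

```python
import json
import os
import tempfile


def save_private_json(path, data):
    """Atomically write JSON so the destination ends up with mode 0o600."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the temp file readable and writable by the owner only.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
        os.chmod(tmp_path, 0o600)
        os.replace(tmp_path, path)  # atomic rename within the same directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Re-assert the mode on the destination in case anything widened it.
    os.chmod(path, 0o600)
```

Creating the temp file in the destination's own directory keeps the rename on one filesystem, which is what makes os.replace atomic.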

Tests cover (a) creation under permissive umask leaves 0o600 and (b) an
existing 0o644 webhook_subscriptions.json gets narrowed on next save.
Tests guarded with skipif os.name=='nt' since POSIX mode bits don't apply
on Windows.

Salvaged from PR #30917 by @Hinotoi-agent. Reworked the api_server.py
side from chmod-on-every-commit to chmod-on-create.

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
2026-05-24 04:55:18 -07:00
m0n3r0 f378f00bfb fix(feishu): validate verification token before reflecting url_verification challenge
When FEISHU_VERIFICATION_TOKEN is configured, an unauthenticated remote
could previously prove endpoint control by sending a url_verification
payload with any attacker-controlled challenge string — the handler
reflected the challenge BEFORE running the token check.

Move the verification_token check ahead of the url_verification echo so
the challenge response is gated on a valid token. Add a regression test
covering the wrong-token case. Also fix the stale
test_connect_webhook_mode_starts_local_server fixture to set
FEISHU_VERIFICATION_TOKEN (post #30746 webhook mode requires a secret).
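
The reordering amounts to checking the token before any branch that echoes attacker-controlled data. A simplified handler, with hypothetical names and status codes, might read:

```python
import hmac


def handle_event(payload, expected_token):
    """Return (status, body) for an incoming webhook payload.

    Hypothetical sketch: the token check runs first, so a wrong token
    never gets its challenge reflected.
    """
    if expected_token:
        supplied = str(payload.get("token", ""))
        if not hmac.compare_digest(supplied.encode(), expected_token.encode()):
            return 401, {"error": "invalid verification token"}
    if payload.get("type") == "url_verification":
        return 200, {"challenge": payload.get("challenge", "")}
    return 200, {"ok": True}
```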

Salvaged from PR #29663 by @m0n3r0 — kept the url_verification reorder
and its regression test; dropped the host-conditional weakening of the
#30746 secret guard (we want webhook secrets required regardless of
bind host, not only on 0.0.0.0/::).

Docs updated to call out the gating.

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
2026-05-24 04:51:19 -07:00
teknium1 5e6749fbf3 chore(release): map m0n3r0 for PR #29629 salvage 2026-05-24 04:47:45 -07:00
teknium1 15aa6884a2 fix(webhook): use 403 not 500 for missing-secret rejection
Operator misconfiguration is a client/setup error, not an internal server
exception. 403 "forbidden" more accurately reflects "this route refuses
to authenticate" than 500 "internal server error" — the latter triggers
incident alerting on operator monitoring and conflates real bugs with
config drift.

Follow-up tweak to PR #29629 by @m0n3r0.
2026-05-24 04:47:45 -07:00
m0n3r0 dbf73e90fa fix: fail closed for webhook routes without secrets
Reject unsigned webhook requests when a route has no effective HMAC secret, even if the request handler is reached without the normal connect-time validation. Add regression coverage for the direct-handler path.
2026-05-24 04:47:45 -07:00
BaxBit bbf02c3224 fix(gateway): validate Svix webhook signatures (#30200) 2026-05-24 04:45:13 -07:00
Jiaming Guo ee002e7fc5 fix(dashboard): require auth for plugin rescan (#27340) 2026-05-24 04:45:07 -07:00
Teknium 5acaeba2bb fix(mcp): raise ImportError instead of NameError when stdio SDK missing (#31450)
When the 'mcp' Python SDK isn't installed, _run_stdio leaked a bare
'NameError: name StdioServerParameters is not defined' because the
top-level 'from mcp import ...' fails inside try/except ImportError,
leaving the names unbound at module scope.

Mirror the _MCP_HTTP_AVAILABLE gate that _run_http already had: raise
a clear ImportError with install instructions instead.
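
The gate pattern looks roughly like this. The flag and function names are illustrative, and the install hint assumes the SDK is published on PyPI as `mcp`:

```python
try:
    from mcp import StdioServerParameters  # noqa: F401
    _MCP_STDIO_AVAILABLE = True
except ImportError:
    # Without the flag, later references to StdioServerParameters would
    # surface as a bare NameError instead of an actionable message.
    _MCP_STDIO_AVAILABLE = False


def require_stdio_sdk(available=None):
    """Raise a descriptive ImportError when the stdio SDK is missing."""
    if available is None:
        available = _MCP_STDIO_AVAILABLE
    if not available:
        raise ImportError(
            "MCP stdio transport requires the 'mcp' Python SDK. "
            "Install it with: pip install mcp"
        )
```

Calling the check at the top of the stdio entry point turns a confusing failure deep in the call into an immediate, readable one.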

Fixes #30904
2026-05-24 04:44:59 -07:00
xxxigm 6cafcf9c77 test(streaming): pin partial-stream-stub finish_reason + continuation contract
Three test classes lock in the #30963 fix:

1. TestPartialStreamStubFinishReason — drives _interruptible_streaming_api_call
   through the two recovery branches and asserts:
     - text-only partial → finish_reason="length" (the new behaviour),
     - mid-tool-call partial → finish_reason="stop" (unchanged on purpose).

2. TestLengthContinuationPromptBranching — pure-Python check on the branch
   that picks the continuation prompt by response.id. Locks the network
   error wording for partial-stream-stub vs. the output-length wording
   for everything else.

3. TestConversationLoopPartialStreamContinuation — feeds a stub +
   continuation pair into run_conversation, verifies the loop makes a
   second API call (instead of exiting with text_response(stop)),
   confirms the network-error continuation prompt actually reaches the
   model on call #2, and that final_response stitches both halves.

Refs: NousResearch/hermes-agent#30963
2026-05-24 04:35:15 -07:00
xxxigm 20b3703a42 fix(conversation-loop): tailor length-continuation prompt for partial stream
The length-continue path's user-facing vprint and continuation prompt
both told the model "your response was truncated by the output length
limit." That's a lie when the stub came from a partial-stream network
error (issue #30963) — and a lie the model can detect, leading to "I
wasn't truncated, I'm done" no-op responses that defeat the
continuation entirely.

Detect the partial-stream-stub via response.id and swap in:

- vprint:   "Stream interrupted by network error
             (finish_reason='length' on partial-stream-stub)"
- prompt:   "[System: The previous response was cut off by a network
             error mid-stream. Continue exactly where you left off.
             Do not restart or repeat prior text. Finish the answer
             directly.]"

Real length truncations still see the original "truncated by output
length limit" prompt — the model needs to know which class of failure
it's recovering from. Same length_continue_retries=3 budget,
truncated_response_parts merging, and final-response stitching
infrastructure on both branches.
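
A sketch of the branch, assuming the stub is recognisable by a fixed response id. The sentinel value and the length-prompt wording are assumptions; the network-error prompt is quoted from above:

```python
# Assumed sentinel id; the real stub's id value is not shown in this log.
PARTIAL_STREAM_STUB_ID = "partial-stream-stub"

NETWORK_ERROR_PROMPT = (
    "[System: The previous response was cut off by a network error mid-stream. "
    "Continue exactly where you left off. Do not restart or repeat prior text. "
    "Finish the answer directly.]"
)

# Paraphrased wording for the pre-existing output-length branch.
LENGTH_LIMIT_PROMPT = (
    "[System: Your response was truncated by the output length limit. "
    "Continue exactly where you left off.]"
)


def continuation_prompt(response_id):
    """Pick the continuation prompt that matches the actual failure class."""
    if response_id == PARTIAL_STREAM_STUB_ID:
        return NETWORK_ERROR_PROMPT
    return LENGTH_LIMIT_PROMPT
```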

Refs: NousResearch/hermes-agent#30963
2026-05-24 04:35:15 -07:00
xxxigm 9140be7c22 fix(streaming): emit finish_reason=length on text-only partial-stream stub
When the API connection drops mid-stream after text deltas have already
been delivered, chat_completion_helpers returned a stub response with
finish_reason=stop. The conversation loop then classified the stub as a
clean text completion (text_response(finish_reason=stop)) and exited
with iteration budget remaining — even when the goal-judge verdict
came back as "continue" milliseconds later (issue #30963).

Switch the text-only partial-stream stub to finish_reason=length. The
existing length-continuation path (length_continue_retries up to 3,
"continue exactly where you left off" prompt, partial parts merged
into final_response) then fires automatically: the partial assistant
content is persisted, the model is asked to continue from the cut
point, and the loop keeps making progress against the goal.

The mid-tool-call branch keeps finish_reason=stop on purpose — its
user-facing warning ("Ask me to retry if you want to continue") asks
the user to drive the retry rather than auto-replaying a tool call
with possible side effects.

#5544's "no duplicate message" contract is preserved verbatim: the
partial content is reused, never re-emitted as a fresh API call, so
the user never sees two copies of the same delta.

Refs: NousResearch/hermes-agent#30963
2026-05-24 04:35:15 -07:00
teknium1 60d20a37c9 fix(acp): only deliver final_response after streaming when transformed
PR #29119 dropped the 'not streamed_message' guard unconditionally so
that plugin-transformed responses (transform_llm_output hook) would
reach ACP clients. That regressed test_prompt_does_not_duplicate_streamed_final_message:
when no transform happened, the streamed text was re-sent as a duplicate
final delivery.

Tighten the condition to mirror the gateway side: deliver after streaming
only when response_transformed=True. Otherwise keep the old guard.
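
Expressed as a predicate (the function name is hypothetical), the tightened condition is:

```python
def should_deliver_final(final_response, streamed_message, response_transformed):
    """Decide whether to send final_response as a separate delivery.

    Not streamed: always deliver. Streamed: deliver only when a
    transform_llm_output hook changed the text, to avoid a duplicate.
    """
    if not final_response:
        return False
    return (not streamed_message) or response_transformed
```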

Adds test_prompt_delivers_transformed_response_after_streaming so the
transformed path stays covered.
2026-05-24 04:31:13 -07:00
teknium1 26088ca669 chore: map kenyon1977@gmail.com for PR #29119 salvage 2026-05-24 04:31:13 -07:00
teknium1 b9f533af0a test(gateway): regression for plugin-transformed response after streaming
Adds a test that fails without the gateway fix, exercising the
response_transformed=True branch in _finalize_response: a streamed
response whose final text was modified by a transform_llm_output
plugin hook must be edit_message'd in place (not duplicate-sent),
with already_sent=True so the normal final-send is skipped.

Also drops two minor leftovers from the salvaged PR #29119:

  * accumulated_text property on GatewayStreamConsumer (unused)
  * duplicate _response_transformed=False inside the hook try block
2026-05-24 04:31:13 -07:00
kenyonxu 5cb21e3fb5 fix(gateway): edit streamed message instead of sending duplicate when response_transformed
When a transform_llm_output hook appends content after streaming, the previous
fix skipped the final-send suppression which caused the full response to be
sent as a NEW message (duplicate). Instead, edit the existing streamed message
in-place to append the transformed content, then set already_sent=True.

Added stream_consumer.message_id and .accumulated_text public properties.
2026-05-24 04:31:13 -07:00
kenyonxu a4ceead796 fix(gateway): propagate response_transformed flag through run_sync return dict
run_sync() cherry-picks fields from the run_conversation result dict into
a new response dict for the gateway. response_transformed was missing from
the cherry-pick list, so the gateway always saw it as False and suppressed
the final send even though a transform_llm_output hook had modified the content.
2026-05-24 04:31:13 -07:00
kenyonxu 8edeebe6d7 fix: propagate response_transformed flag — plugin hook output survives streaming suppression
When a transform_llm_output hook modifies final_response after streaming,
the gateway was silently discarding the transformed content because
streamed=True / content_delivered=True triggered the final-send
suppression. Three changes:

1. conversation_loop: set `_response_transformed=True` when a
   transform_llm_output hook returns a non-empty string, and expose it
   as `response_transformed` in the result dict.

2. gateway/run: skip the final-send suppression when
   `response_transformed` is True — the transformed response must
   reach the client even if streaming already sent the original text.

3. acp_adapter/server: remove `not streamed_message` guard so
   final_response is always delivered (ACP path fixed separately).
2026-05-24 04:31:13 -07:00
kenyonxu 7eb6c7f489 fix(acp): deliver final_response after streaming — transform_llm_output hook now visible
When streaming is active, streamed_message=True skipped the final_response
update, causing plugin hooks like transform_llm_output to be silently
invisible. Remove the `not streamed_message` guard so the final response
(possibly transformed by plugins) is always delivered to the ACP client.
2026-05-24 04:31:13 -07:00
Teknium 197f63f454 fix(feishu): require webhook auth secret and honor config extras (#30746) 2026-05-24 04:27:28 -07:00
Teknium bdb97b8573 fix(feishu): enforce auth and chat binding for approval buttons (#30744) 2026-05-24 04:27:17 -07:00
Teknium 485292ac7d fix(feishu): authorize interactive exec approval callbacks (#30739) 2026-05-24 04:26:57 -07:00
Teknium be27bfed01 security: harden API server key placeholder handling (#30738) 2026-05-24 04:25:32 -07:00
Teknium 2df2f9190b fix(docker): keep dashboard side-process loopback by default (#30740) 2026-05-24 04:25:28 -07:00
Teknium 4ca77f1059 Harden msgraph webhook auth requirements (#30169) 2026-05-24 04:25:20 -07:00
Teknium 3e78e353d7 fix(qqbot): authorize approval button interactions by session owner (#30737) 2026-05-24 04:25:12 -07:00
Teknium e4a1220f83 security: restrict default webhook toolset capabilities (#30745) 2026-05-24 04:24:54 -07:00
Teknium c3caca6584 fix(gateway): remove discord role allowlist auth bypass (#30742) 2026-05-24 04:24:49 -07:00
Teknium 1f897b0dc9 fix(gateway): stop enabling dingtalk allow-all during setup (#30743) 2026-05-24 04:24:44 -07:00
Teknium 9732559864 fix(security): restrict dashboard websockets to loopback clients (#30741) 2026-05-24 04:24:40 -07:00
Teknium bc3f1f4f34 feat(secrets/bitwarden): EU Cloud + self-hosted server URL support (#31378)
Closes #31370.

bws defaults to the US identity endpoint, so EU Cloud and self-hosted
machine-account tokens fail with [400 Bad Request] {"error":"invalid_client"}
during 'hermes secrets bitwarden setup'. The token is valid — it's just
being checked against the wrong region.

Add a Bitwarden region step to the wizard between the access-token and
project-list steps:

  Step 1  Install bws
  Step 2  Provide access token
  Step 3  Pick region   <-- new (US / EU / self-hosted-custom-URL)
  Step 4  Pick project  (now talks to the right endpoint)
  Step 5  Test fetch

Region is stored in config.yaml as secrets.bitwarden.server_url and
plumbed into every bws subprocess as BWS_SERVER_URL (project list,
secret list, test fetch, and the env_loader startup pull).

Also:
- Non-interactive: 'hermes secrets bitwarden setup --server-url ...'
- Pre-existing BWS_SERVER_URL in the shell is detected and reused
- Cache key includes server_url so EU/US fetches don't collide
- 'hermes secrets bitwarden status' shows the configured region
- 'invalid_client' / '400 Bad Request' from bws now triggers a hint
  pointing at the region setting instead of looking like a bad token
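
A small sketch of the plumbing, with hypothetical helper names. BWS_SERVER_URL is the variable named above; the cache-key shape is an assumption:

```python
import os


def bws_env(server_url, base_env=None):
    """Build the environment for a bws subprocess.

    A configured server_url is injected as BWS_SERVER_URL; otherwise any
    value already present in the environment is left alone.
    """
    env = dict(os.environ if base_env is None else base_env)
    if server_url:
        env["BWS_SERVER_URL"] = server_url
    return env


def cache_key(project_id, server_url):
    """Include the server URL so fetches from different regions never collide."""
    return (server_url or "default", project_id)
```

The resulting dict would be passed as `env=` to `subprocess.run` for each bws invocation.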
2026-05-24 02:19:57 -07:00
Teknium c9b3eeabdc fix(cli): decouple tool_progress=verbose from global DEBUG logging (#31379)
Commit 6a1aa420e coupled `display.tool_progress: verbose` (a per-tool display
toggle for full args / results / think blocks) to `self.verbose` — which
controls root-logger DEBUG level. Result: setting tool_progress: verbose
in config silently flipped every module in the process to DEBUG and
flooded the terminal with internal logging, far beyond just full tool
calls.

The two concepts are separate:
- `tool_progress_mode == 'verbose'` → display behavior (tool rendering)
- `self.verbose` → logging behavior (root logger → DEBUG, line 9795)

This change keeps commit 6a1aa420e's argparse.SUPPRESS / config-fallback
plumbing but severs the verbose-display → debug-logging link.

Changes:
- cli.py:2868 — `self.verbose` only follows explicit `verbose=` arg; no
  longer auto-True when tool_progress_mode == 'verbose'.
- cli.py:_toggle_verbose — slash-cycle through tool progress modes no
  longer flips `self.verbose` / `agent.verbose_logging` / `agent.quiet_mode`.
- cli.py:9355 — fix misleading label (drop 'and debug logs').
- tui_gateway/server.py:_make_agent — same decoupling on the TUI side
  (verbose_logging no longer derived from tool_progress_mode).
- tests/cli/test_tool_progress_scrollback.py — invert the test that
  asserted the broken coupling; add coverage for explicit `--verbose`
  still enabling DEBUG independent of tool_progress.

Live verified:
- tool_progress: verbose, no --verbose flag → 0 DEBUG/INFO log lines
- --verbose flag explicit → 32 DEBUG/INFO log lines (as expected)
2026-05-24 02:19:20 -07:00
AhmetArif0 5848174374 fix(wecom): guard flush task against cancel-delivery race to prevent message loss
When asyncio.sleep() fires just before Task.cancel() is called, CPython
sets _must_cancel=True but cannot cancel the already-completed sleep
future, so CancelledError is delivered at the next await (handle_message)
rather than at the sleep.  By that point the superseded task has already
popped the merged event from _pending_text_batches, so the superseding
task sees an empty batch and silently drops the message.

Fix: add a synchronous task-registry check between the sleep and the pop.
No await between the check and the pop means no other coroutine can
interleave, so the guard is race-free.
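
A self-contained sketch of the guard. The class and its API are hypothetical; only the idea of a synchronous registry check placed between the sleep and the pop comes from the description above:

```python
import asyncio


class TextBatcher:
    """Debounce text per key; a newer message supersedes the pending flush."""

    def __init__(self, delay, deliver):
        self._delay = delay
        self._deliver = deliver  # async callable(key, text)
        self._pending_text_batches = {}
        self._flush_tasks = {}

    def add(self, key, text):
        self._pending_text_batches.setdefault(key, []).append(text)
        old = self._flush_tasks.get(key)
        if old is not None and not old.done():
            old.cancel()
        loop = asyncio.get_running_loop()
        self._flush_tasks[key] = loop.create_task(self._flush(key))

    async def _flush(self, key):
        await asyncio.sleep(self._delay)
        # Synchronous guard: if a newer task owns this key, leave the batch
        # alone. There is no await between this check and the pop below.
        if self._flush_tasks.get(key) is not asyncio.current_task():
            return
        batch = self._pending_text_batches.pop(key, None)
        self._flush_tasks.pop(key, None)
        if batch:
            await self._deliver(key, " ".join(batch))
```

Because asyncio only switches coroutines at an await, the check and the pop run as one uninterrupted step.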
2026-05-24 01:33:40 -07:00
Teknium 1bed4e8eed fix(gateway): drop text snippet from debounce debug log (CodeQL)
CodeQL py/clear-text-logging-sensitive-data flagged the candidate-accept
debug log including event.text[:60]. Log text_len instead — sufficient for
debugging burst behavior without surfacing message contents.

Co-authored-by: Paulo Nascimento <pnascimento9596@gmail.com>
2026-05-24 01:31:45 -07:00
Teknium 51bb8c0a9e chore: map pnascimento9596@gmail.com for PR #31235 salvage 2026-05-24 01:31:45 -07:00
Paulo Nascimento 7abd62719b gateway: debounce queued text follow-ups 2026-05-24 01:31:45 -07:00
AhmetArif0 21db250034 fix(wecom-callback): retry send with fresh token on errcode 40001/42001
When WeCom returns errcode=40001 (invalid credential) or 42001 (token
expired), send() was returning a failure without evicting the bad token
from _access_tokens. All subsequent sends then kept using the same
invalid cached token until its TTL naturally expired (~7200s).

Fix: on the first token-rejection errcode, evict the cache entry and
retry once with a freshly fetched token. Non-token errcodes fail
immediately as before. If the refreshed token also fails, the error
is returned without looping further.
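
The evict-and-retry-once flow, sketched synchronously for brevity (the real send() is presumably async, and every name here is hypothetical):

```python
TOKEN_REJECTED = {40001, 42001}  # invalid credential, token expired


def send_with_token_retry(send_once, fetch_token, cache, corp_id):
    """Send once; on a token-rejection errcode, evict the token and retry once."""
    result = {}
    for attempt in range(2):
        token = cache.get(corp_id)
        if token is None:
            token = fetch_token()
            cache[corp_id] = token
        result = send_once(token)
        errcode = result.get("errcode", 0)
        if errcode in TOKEN_REJECTED and attempt == 0:
            cache.pop(corp_id, None)  # evict the bad token before retrying
            continue
        break
    return result
```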

Adds four regression tests covering: successful retry on 40001,
successful retry on 42001, no retry on unrelated errcode, and clean
failure when the refresh does not help.
2026-05-24 01:30:47 -07:00
Teknium d3c167b644 fix(profiles): cross-profile soft guard on file-write tools + system-prompt hint (#31290)
* fix(profiles): cross-profile soft guard on file-write tools + system-prompt hint

Adds a soft guard so an agent running under one Hermes profile cannot
silently edit a different profile's skills/plugins/cron/memories.
Three layers:

A. agent/file_safety.classify_cross_profile_target
   Classifies a write target against the active HERMES_HOME. Returns
   a {active_profile, target_profile, area, target_path} dict when the
   path lands in another profile's scoped area. PROFILE_SCOPED_AREAS =
   (skills, plugins, cron, memories). get_cross_profile_warning()
   wraps it into a model-facing error string that names both profiles,
   names the area, and points at the cross_profile=True bypass.

   Defense-in-depth, NOT a security boundary — the terminal tool runs
   as the same OS user and can write any of these paths directly. The
   guard exists to prevent confused-agent corruption, not to stop a
   determined attacker. SECURITY.md §3.2 (terminal-bypass posture)
   still applies.

   Wired into tools/file_tools.write_file_tool and patch_tool with a
   cross_profile=False kwarg. WRITE_FILE_SCHEMA and PATCH_SCHEMA both
   advertise cross_profile so the model can pass it after explicit
   user direction. patch_tool extracts target paths from V4A patch
   bodies before checking (same shape as the existing sensitive-path
   check).

   skill_manage is already scoped to the active profile's SKILLS_DIR
   by construction, so no extra guard wiring is needed there. The
   D-side error message (below) still names other profiles when the
   skill exists elsewhere.

B. agent/system_prompt
   One deterministic line near the environment-hints block names the
   active profile and tells the model not to modify another profile's
   skills/plugins/cron/memories without explicit direction. Profile
   name is stable for the lifetime of the AIAgent, so the line is
   prompt-cache-safe.

D. tools/skill_manager_tool._skill_not_found_error
   Replaces the bare "Skill 'X' not found." with a message that:
     - names the active profile,
     - searches OTHER profiles' skills dirs for the same name,
     - names the profile(s) where the skill exists and the path,
     - suggests `hermes -p <name>` to switch profiles, or
       cross_profile=True for an explicit edit.

   All 5 "not found" sites in skill_manager_tool (edit, patch, delete,
   write_file, remove_file) now go through the helper.
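
To make layer A concrete, here is a minimal path classifier. The signature is hypothetical (the real function resolves the active profile from HERMES_HOME itself), and the layout assumed is ~/.hermes/<area>/ for the default profile and ~/.hermes/profiles/<name>/<area>/ for named ones:

```python
from pathlib import Path

PROFILE_SCOPED_AREAS = ("skills", "plugins", "cron", "memories")


def classify_cross_profile_target(target, hermes_root, active_profile):
    """Return a description dict when target lies in another profile's scoped area."""
    root = Path(hermes_root).resolve()
    path = Path(target).resolve()
    try:
        parts = path.relative_to(root).parts
    except ValueError:
        return None  # not under the Hermes root at all
    if len(parts) >= 2 and parts[0] == "profiles":
        target_profile, rest = parts[1], parts[2:]
    else:
        target_profile, rest = "default", parts
    if not rest or rest[0] not in PROFILE_SCOPED_AREAS:
        return None  # e.g. root-level config files
    if target_profile == active_profile:
        return None  # in-profile write
    return {
        "active_profile": active_profile,
        "target_profile": target_profile,
        "area": rest[0],
        "target_path": str(path),
    }
```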

Reference incident (May 2026): a hermes-security profile session
edited skills under both ~/.hermes/profiles/hermes-security/skills/
AND ~/.hermes/skills/ (the default profile's skills) without
realizing the second path belonged to a different profile. Three of
the four skill files needed manual restoration afterward.

What this PR does NOT do:

  * No hard block. The terminal tool can still touch any of these
    paths with no guard — same posture as the dangerous-command
    approval flow. SECURITY.md §3.2 applies.
  * No regex sweep on terminal commands for cross-profile paths.
    That direction is a Skills-Guard-style arms race (cd + relative
    paths, base64, etc.) and would false-positive on legitimate
    cross-profile reads. Filed as a follow-up.
  * No on-disk path migration. ~/.hermes/skills/ remains the
    default profile's skills dir; this PR is about telling the
    agent about that boundary, not changing the layout.

Tests:
  tests/agent/test_file_safety_cross_profile.py (16 tests)
    - _resolve_active_profile_name covers default/named/failure paths
    - classify_cross_profile_target covers all four scoped areas,
      all three directions (default → named, named → default, named → named),
      non-Hermes paths, and root-level config files
    - get_cross_profile_warning covers in-profile no-op, cross-profile
      message shape, and the defense-in-depth self-documentation

  tests/tools/test_cross_profile_guard.py (12 tests)
    - write_file: in-profile allow, cross-profile block, cross_profile=True
      bypass, non-Hermes pass-through
    - patch: replace-mode block, cross_profile=True bypass, V4A patch
      path extraction
    - skill_manage: error names the other profile (single + multiple),
      missing-everywhere falls back to skills_list hint
    - system prompt: contract-level checks (both branches present,
      cross_profile=True mentioned, ~/.hermes/profiles/ referenced)

All 207 existing tests in file_safety/file_operations/skill_manager
still pass. 10 system-prompt tests still pass.

E2E verified: the exact incident scenario (security profile editing
default's hermes-agent-dev skill) is now blocked with the warning
message; cross_profile=True unblocks.
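
The ownership check described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the PR's
classify_cross_profile_target: `owning_profile` and `is_cross_profile`
are hypothetical names, the layout is inferred from the two paths quoted
in the incident, and root-level config files are simply treated as
unscoped here.

```python
from pathlib import Path
from typing import Optional

# Assumed layout, inferred from the paths quoted in this message:
#   <hermes_home>/<area>/...                  -> default profile
#   <hermes_home>/profiles/<name>/<area>/...  -> named profile
SCOPED_AREAS = {"skills", "plugins", "cron", "memories"}


def owning_profile(target: Path, hermes_home: Path) -> Optional[str]:
    """Return the profile that owns `target`, or None if it is unscoped."""
    try:
        parts = target.resolve().relative_to(hermes_home.resolve()).parts
    except ValueError:
        return None  # not under the Hermes home at all
    if len(parts) >= 3 and parts[0] == "profiles" and parts[2] in SCOPED_AREAS:
        return parts[1]
    if parts and parts[0] in SCOPED_AREAS:
        return "default"
    return None


def is_cross_profile(target: Path, hermes_home: Path, active: str) -> bool:
    """True when `target` belongs to a profile other than `active`."""
    owner = owning_profile(target, hermes_home)
    return owner is not None and owner != active
```

In the incident scenario, a write under the default skills dir from the
hermes-security profile would come back as cross-profile, which is the
point where the warning (or the cross_profile=True bypass) applies.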

* fix(code_execution): add cross_profile to write_file/patch stubs

The cross_profile kwarg added to write_file_tool/patch_tool needs to
flow through the execute_code sandbox stubs in _TOOL_STUBS so the
test_stubs_cover_all_schema_params drift test passes. Without this,
scripts running inside execute_code couldn't pass cross_profile=True
through hermes_tools.write_file().

Caught by CI on PR #31290.
2026-05-24 00:38:17 -07:00
Teknium b207dc28b3 feat(kanban): --ids bulk promote + AUTHOR_MAP entry for #29464
Adds an --ids flag to 'hermes kanban promote' mirroring the existing
block/schedule convention, so the marquee use case from issue #28822
(promote all children of a closed organizational parent in one shot)
doesn't require a shell loop. Single-id JSON output stays a flat
object for back-compat; bulk emits a list. Dedupes positional + --ids
so the same id can't be promoted twice in one call. 5 new CLI-level
tests cover bulk happy path, partial-failure exit code, JSON shapes,
and dedup.
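
A minimal sketch of the dedup step, assuming --ids takes a
comma-separated string (the message does not spell out the format).
`merge_task_ids` is a hypothetical helper name, not the one in the CLI.

```python
from typing import Iterable, List


def merge_task_ids(positional: Iterable[str], ids_flag: str = "") -> List[str]:
    """Combine positional ids with an --ids value, dropping repeats."""
    # Assumption: --ids is comma-separated; blank entries are ignored.
    from_flag = [part.strip() for part in ids_flag.split(",") if part.strip()]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys([*positional, *from_flag]))
```

Keeping first-seen order means the output list lines up with the order
the operator typed the ids in.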

Also adds the thedavidmurray noreply-email -> github-login mapping in
scripts/release.py so the salvage cherry-pick passes the AUTHOR_MAP
contributor-credit check.
2026-05-23 23:10:36 -07:00
David Murray d46adad22f feat(cli): kanban promote verb for manual todo->ready recovery
Adds `hermes kanban promote <task_id>` for manual lifecycle recovery
when an auto-promote daemon misses the parent-done transition (issue
#28822). Refuses promotion unless every parent dep is done/archived
(override with --force). Emits a `promoted_manual` audit event distinct
from the automatic `promoted` kind, so audit consumers can filter
human-driven from system-driven promotions. Supports --dry-run and
--json for orchestration. Does not mutate assignee/claim state — the
dispatcher picks the card up via its normal ready polling path.
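
The parent-dependency gate can be pictured as the sketch below. The
status strings and the `can_promote` name are assumptions for
illustration; only the done/archived rule and the --force override come
from the description above.

```python
from typing import Iterable

# Parent states that allow a child to be promoted (from the message above).
TERMINAL_PARENT_STATES = frozenset({"done", "archived"})


def can_promote(parent_statuses: Iterable[str], force: bool = False) -> bool:
    """Allow promotion when every parent is done/archived, or when forced."""
    return force or all(s in TERMINAL_PARENT_STATES for s in parent_statuses)
```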

Closes #28822.
2026-05-23 23:10:36 -07:00
novax635 421ab81052 fix(cli): reuse canonical root model key normalization in load_cli_config 2026-05-23 23:08:05 -07:00
Teknium 2442a0c281 fix(background-review): allow pinned skills to be improved
The post-turn background reviewer prompt listed pinned skills under
'Protected skills (DO NOT edit these)' alongside bundled and
hub-installed skills, with the instruction to say 'Nothing to save.'
if only protected skills needed updating. This meant the reviewer
would refuse to patch a pinned skill even when the user explicitly
wanted that skill improved.

The underlying tool layer already gets this right: skill_manage's
_pinned_guard only fires on delete; patch/edit/write_file go through
on pinned skills. Curator archive/consolidation still skips pinned
at the data layer (agent/curator.py), which is the correct place for
that protection — pin's job is anti-deletion, not anti-improvement.

Both _SKILL_REVIEW_PROMPT and _COMBINED_REVIEW_PROMPT now explicitly
tell the reviewer that pinned skills can be patched, with rationale,
so it doesn't bail out of an improvement just because the target is
pinned.
2026-05-23 22:57:42 -07:00
brooklyn! a627981a65 fix(tui): stop slash dropdown from chopping last char of /goal (#31311)
Two independent bugs caused the slash-command autocomplete to render
`/goal` as `/goa` (and `/gquota` as `/gquot` for that matter) in the TUI:

1. `tui_gateway/server.py` was forwarding `c.display` from
   prompt_toolkit's `Completion` straight into the JSON-RPC payload.
   prompt_toolkit normalizes `display=` into `FormattedText` (a `list`
   subclass), so the wire format became `[["", "/goal"]]` instead of
   the `string` that `CompletionItem.display` in the TUI declares.
   `meta` already went through `to_plain_text` — `display` did not.

2. The dropdown row in `appOverlays.tsx` used `flexDirection="row"`
   with the display `<Text>` and the (very long) meta `<Text>` as
   siblings. When the meta overflows the row width, Ink/Yoga shrinks
   the *first* column by one cell, lopping the trailing character off
   the command name. `/goal` triggers it reliably because its meta
   string is the longest of any built-in command (description +
   embedded `[text | pause | resume | clear | status]` usage hint).
   Wrapping the display column in `<Box flexShrink={0}>` keeps it at
   its natural width and lets the meta wrap or truncate instead.
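
For the first bug, the flattening step looks roughly like this. It is a
stdlib-only sketch of what prompt_toolkit's `to_plain_text` does for
simple fragments; `display_to_str` is a hypothetical name, and the real
fix may call the library helper directly.

```python
from typing import Any, Iterable, Sequence, Union

# A prompt_toolkit fragment is (style, text) or (style, text, handler).
Fragment = Sequence[Any]


def display_to_str(display: Union[str, Iterable[Fragment]]) -> str:
    """Flatten formatted-text fragments into the plain string the TUI expects."""
    if isinstance(display, str):
        return display
    # Keep only the text part of each fragment and drop the style.
    return "".join(str(fragment[1]) for fragment in display)
```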
2026-05-23 22:12:55 -07:00
Teknium 2666009ccc docs: dedicated Nous Portal integration page and setup guide (#31296)
If Nous Portal is the recommended way to run Hermes Agent, it deserves
more than a sub-section buried under `## Inference Providers`. Add two
new pages and shrink the existing providers.md section to a stub that
points at them.

New pages:
- `website/docs/integrations/nous-portal.md` — landing page. What's in
  the subscription (300+ model catalog table, Tool Gateway breakdown,
  Nous Chat, cross-platform parity, no-dotfile-credentials). Hermes 4
  recommendation note. Setup paths (fresh install, existing install,
  headless / SSH, profiles). Day-to-day usage (portal status / portal
  tools / portal open, switching models, mixing gateway with own
  backends, subscription management). Configuration reference. Token
  handling. Troubleshooting. Cross-links. Sidebar-position 1 — first
  entry under Integrations.

- `website/docs/guides/run-hermes-with-nous-portal.md` — task script.
  Eight numbered steps: subscribe → setup --portal → verify with
  portal status → first chat → switch models → customize gateway
  routing → voice mode → cron/always-on. Per-step troubleshooting.
  'What this gets you in plain numbers' comparison table. Sidebar
  position 1 — first entry under Guides & Tutorials.

Existing providers.md:
- Replace the 80-line `### Nous Portal` deep-dive with a 13-line stub
  that summarizes the value prop, lists the three CLI commands, and
  links to the new pages. Saves ~6KB. Other provider sections and
  callouts (Codex Note, Two Commands, Tool Gateway tip) preserved.

Sidebar:
- `integrations/nous-portal` inserted right after `integrations/index`,
  before `integrations/providers`.
- `guides/run-hermes-with-nous-portal` inserted first in Guides &
  Tutorials.
2026-05-23 21:07:58 -07:00
Teknium 2b10024ee8 test(display): cover failure-suffix rendering + update scrollback test
The original PR #17194 description claimed test_display_tool_preview.py
but only ever shipped test_display_todo_progress.py. Add the missing
coverage for the failure-suffix path:

- _trim_error: whitespace strip, length cap, File-not-found path collapse
- _detect_tool_failure: terminal exit codes, memory full, structured
  {error}/{message} extraction, malformed JSON, None result
- get_cute_tool_message E2E: read_file failure, terminal exit-only,
  terminal stderr message, memory full, success path, no-result path

Also update test_tool_progress_scrollback.test_error_suffix_on_failed_tool
to reflect the new behavior: the generic '[error]' fallback in cli.py
has been removed; failure suffixes now come from the result-aware
_detect_tool_failure (e.g. '[exit 1]', '[File not found: x]').
2026-05-23 21:03:51 -07:00
Albert.Zhou ffde8b7b09 feat(cli): show todo progress as done/total fraction
Parse the todo_tool result summary to display completion progress in
CLI tool preview lines:

  Read:    ┊ 📋 plan      3/4 task(s)  0.5s
  Update:  ┊ 📋 plan      update 3/4 ✓  0.5s
  Create:  falls back to plain count when no completed tasks

Falls back gracefully to the existing 'N task(s)' format when the
result is missing, malformed, or has no completed items.

Originally proposed in PR #17194 by Albert.Zhou; salvaged onto current
main.

Co-authored-by: Albert.Zhou <albert748@gmail.com>
2026-05-23 21:03:51 -07:00
Albert.Zhou 094d732378 fix(cli): surface tool failures with specific error messages
Improves the failure suffix on tool completion lines. Instead of always
showing '[error]' for non-terminal failures, parse the tool's JSON result
and surface the actual message:

  Before:  ┊ 📖 read      foo.py  0.1s [error]
  After:   ┊ 📖 read      foo.py  0.1s [File not found: foo.py]

  Before:  ┊ 💻 $         ls bad  0.1s [exit 127]
  After:   ┊ 💻 $         ls bad  0.1s [ls: cannot access 'bad'...]

Adds a _trim_error helper that strips long absolute paths down to the
filename and caps the suffix at 48 chars so it stays readable on narrow
terminals.
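
A rough sketch of that trimming behavior, assuming "absolute path" means
a /-rooted run of at least two segments. The regex and the `trim_error`
name are illustrative and are not the shipped _trim_error; only the
48-character cap is taken from the text above.

```python
import re

MAX_SUFFIX = 48  # cap quoted in the commit message

# Assumption: an absolute path is "/" followed by one or more directory
# segments and a final filename segment.
_ABS_PATH = re.compile(r"/(?:[^\s/:'\"]+/)+([^\s/:'\"]+)")


def trim_error(message: str, limit: int = MAX_SUFFIX) -> str:
    """Collapse absolute paths to their filename and cap the length."""
    text = _ABS_PATH.sub(r"\1", message.strip())
    if len(text) > limit:
        text = text[: limit - 3] + "..."
    return text
```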

Threads the tool result through the tool.completed progress callback so
agent/display.get_cute_tool_message can inspect it. The cli.py [error]
post-suffix is removed in favor of the richer suffix _detect_tool_failure
now produces directly.

Originally proposed in PR #17194 by Albert.Zhou; salvaged onto current
main with the dead-code preview-length bumps dropped (tool_preview_length
config already strictly caps previews, so the per-tool n= defaults are
unreachable).

Co-authored-by: Albert.Zhou <albert748@gmail.com>
2026-05-23 21:03:51 -07:00
honor2030 6a1aa420e7 Fix CLI verbose tool progress config fallback 2026-05-23 21:03:51 -07:00
Teknium d97c324473 fix(terminal): warn at call time when background=true runs silently (#31289)
`terminal(background=true)` without `notify_on_complete=true` or
`watch_patterns` runs the process SILENTLY — the agent has no way
to learn it finished short of calling `process(action='poll')`
explicitly. That's correct for genuine long-lived processes (servers,
watchers, daemons) but is a footgun for every bounded task (tests,
builds, deploys, CI pollers, batch jobs), which is the vast majority
of background uses.

Hit on May 23, 2026 (PR #31231 incident): agent launched a CI-watch
loop with `background=true` only. The poller ran fine, exited green
6 minutes later, agent never noticed. User had to surface 'we are
green CI, you can merge.' Memory and skill docs said *what* to do
(poll in background) but not *how* to receive the result. The
`notify_on_complete=true` flag exists and works, but is easy to
forget when bg seems sufficient on its own.

Two changes here, mutually reinforcing:

1. Runtime nudge: tool result for `background=true` w/o notify or
   watch_patterns now includes a `hint` field explaining the silent-
   process failure mode and pointing at the corrective flag. Agent
   sees it on the same turn and self-corrects without needing the
   user to surface anything. Cost for legitimate server cases is one
   ignored read (~50 tokens); cost for forgot-notify cases is
   prevented blindness (potentially many turns, or a user nudge).
   False positives << false negatives.

2. Schema/description rewrite: top-level TERMINAL_TOOL_DESCRIPTION
   and the `background` field description now lead with 'Almost
   always pair with notify_on_complete=true' instead of presenting
   it as one of two equally-likely patterns. The two legitimate
   non-notify shapes (long-lived servers; watch_patterns mid-process
   signals) are still documented, but as the minority case.

Tests cover all four shapes: bg-only emits hint, bg+notify doesn't,
bg+watch_patterns doesn't, foreground doesn't. 4 new tests; full
suite of background/process tests stays green (160/160 across the
relevant 6 test files).
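
The four shapes covered by the tests reduce to one condition, sketched
here. The hint wording and the `annotate_background_result` name are
hypothetical; the commit only states that a `hint` field is added.

```python
from typing import Any, Dict, Optional, Sequence

# Hypothetical wording; the real hint text is not quoted in the message.
SILENT_BG_HINT = (
    "background=true without notify_on_complete or watch_patterns runs "
    "silently; set notify_on_complete=true to be told when it exits."
)


def annotate_background_result(
    result: Dict[str, Any],
    *,
    background: bool,
    notify_on_complete: bool = False,
    watch_patterns: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Attach a hint only when a background process would run silently."""
    silent = background and not notify_on_complete and not watch_patterns
    if not silent:
        return result
    return {**result, "hint": SILENT_BG_HINT}
```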
2026-05-23 21:02:14 -07:00
AhmetArif0 39b8d1d313 fix(dingtalk): finalize open streaming cards before disconnect
AI Card "tool progress" cards created with finalize=False were left in
streaming state on DingTalk's UI after a gateway restart because
disconnect() called _streaming_cards.clear() without first closing
them via _close_streaming_siblings.

Move the finalization loop before self._http_client.aclose() so the
HTTP client is still available when the finalize requests are sent.
Adds a regression test that asserts the HTTP client is alive during
finalization.
2026-05-23 20:48:56 -07:00
Teknium a7b622effc docs(providers): move Nous Portal first, Google Gemini OAuth last (#31287)
Reorder the per-provider subsections under '## Inference Providers'
so Nous Portal — the recommended setup — leads the list, and Google
Gemini via OAuth (which carries a policy-risk warning) drops to last
position right before the '## Custom & Self-Hosted LLM Providers'
section. All other provider sections keep their relative order. Pure
section move; no content changes.
2026-05-23 20:46:17 -07:00
Fewmanism 83f6a83b24 fix(tui): handle images with codex app-server 2026-05-23 20:40:09 -07:00
teknium1 7ce6b504a2 fix(process_registry): use taskkill /T /F for tree-kill on Windows
The Windows branch of `_terminate_host_pid` early-returned after
`os.kill(pid, SIGTERM)` (which Python maps to `TerminateProcess` for
the target handle only), leaving descendant processes — e.g. Chromium
renderer/GPU/network helpers spawned by an `agent-browser` daemon —
running on Windows even after the preceding commit fixed POSIX.

The right Windows primitive is `taskkill /PID <pid> /T /F`:
`/T` walks the tree, `/F` force-terminates. Same approach
`gateway.status.terminate_pid(force=True)` already uses for the
gateway's own shutdown path; reuse the same shape here.

Why NOT extend the POSIX psutil tree-walk to Windows:

  1. Windows doesn't maintain a Unix-style process tree.
     `psutil.Process.children(recursive=True)` walks PPID links that go
     stale when intermediate processes exit, so enumeration is
     best-effort and silently misses orphaned descendants. The whole
     bug we're fixing is orphaned descendants.

  2. `psutil.Process.terminate()` on Windows is `TerminateProcess()`
     for one handle, the same single-PID scope as the existing
     `os.kill`. The existing comment in
     `gateway/status.py::terminate_pid` warns this explicitly:
     'os.kill SIGTERM is not equivalent to a tree-killing hard stop'
     on Windows.

  3. Headless Chromium has no GUI window, so the softer
     `taskkill /T` without `/F` (which sends WM_CLOSE) won't reach
     it either. `/F` is required.

POSIX path is unchanged. The taskkill subprocess uses the same
`creationflags=windows_hide_flags()` pattern other Windows shellouts
in this codebase use. `FileNotFoundError` / `TimeoutExpired` /
`OSError` fall back to bare `os.kill(SIGTERM)` as cheap insurance.
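
The Windows branch can be sketched as below. This is not the code in
process_registry: the function names, the injectable `run`/`kill`
parameters, and the 10-second timeout are assumptions made so the sketch
can be exercised on any platform, and the `creationflags` argument is
left out.

```python
import os
import signal
import subprocess
from typing import Callable, List


def taskkill_command(pid: int) -> List[str]:
    # /T terminates the whole child tree, /F forces termination.
    return ["taskkill", "/PID", str(pid), "/T", "/F"]


def terminate_tree_windows(
    pid: int,
    run: Callable[..., object] = subprocess.run,
    kill: Callable[[int, int], None] = os.kill,
) -> str:
    """Try taskkill first; fall back to a single-PID SIGTERM on failure."""
    try:
        # Timeout value is an assumption, not taken from the commit.
        run(taskkill_command(pid), capture_output=True, timeout=10, check=False)
        return "taskkill"
    except (FileNotFoundError, subprocess.TimeoutExpired, OSError):
        kill(pid, signal.SIGTERM)
        return "os.kill"
```

Injecting `run` and `kill` mirrors the idea of testing the Windows path
from Linux without shelling out.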

Tests cover the Windows branch via the codebase's standard
`monkeypatch _IS_WINDOWS` pattern
(`references/windows-native-support.md`), plus POSIX tree-walk order,
NoSuchProcess swallow, and the OSError fallback path. 7 new tests, all
green on Linux CI.
2026-05-23 20:30:29 -07:00
Yuan Li 22f3f5a75a fix(browser): use process-tree termination for daemon cleanup
os.kill(pid, SIGTERM) only signals the parent, leaving Chromium child
processes (renderer, GPU, etc.) orphaned. Reuse the existing
ProcessRegistry._terminate_host_pid() helper, which walks the process
tree leaf-up via psutil, terminating children before the parent.
2026-05-23 20:30:29 -07:00
Teknium 72ff3e909c docs(providers): rewrite Nous Portal section as primary recommended path (#31230)
The old section sold Nous Portal as access to Hermes-4 models, which is
backwards — Hermes 4 is a chat/reasoning family that's NOT recommended
for Hermes Agent (per portal.nousresearch.com/info itself). The actual
value prop is the 300+ frontier agentic models (Claude, GPT, Gemini,
DeepSeek, etc.) plus the Tool Gateway plus Nous Chat under one
subscription.

Rewrite to lead with that, position the portal as the recommended way
to run Hermes Agent, demote Hermes 4 to a 'note' explaining why it's
not the right pick for agent workloads, and link to the
manage-subscription page from setup.
2026-05-23 18:19:17 -07:00
Teknium e42fcc5625 fix(provider): make config.yaml model.provider the single source of truth (#31222)
Policy: if it ain't a secret it goes in config.yaml. HERMES_INFERENCE_PROVIDER
was leaking behavioral config into the .env surface, including from the gateway,
which bypassed config.yaml entirely.

Behavior:
- gateway/run.py: drop HERMES_INFERENCE_PROVIDER read in _resolve_runtime_agent_kwargs.
  Gateway now flows through resolve_runtime_provider() with no `requested` override,
  which reads model.provider from config.yaml first.

Docs/UX (strip env var from user-facing surface):
- --provider help text no longer mentions the env var
- cli-config.yaml.example same
- reference/environment-variables.md: remove HERMES_INFERENCE_PROVIDER row and
  the cross-reference from HERMES_INFERENCE_MODEL
- reference/cli-commands.md: blank the env-var column for --provider
- guides/xai-grok-oauth.md, guides/minimax-oauth.md: replace
  HERMES_INFERENCE_PROVIDER=x hermes invocations with config.yaml / --provider
- developer-guide/adding-providers.md, model-provider-plugin.md: reframe

Internal mechanism (kept as-is):
- hermes_cli/main.py writes HERMES_INFERENCE_PROVIDER into the TUI subprocess env
- tui_gateway/server.py reads it on TUI startup
- resolve_requested_provider() / oneshot.py / cli.py still fall through to the
  env var as a last-resort behind config.yaml, which is what makes the TUI
  parent->child handoff work
This stays. We just stop documenting it as a user knob.

Tests: tests/gateway/test_auth_fallback.py — simplify mock to fail on first
call, succeed on second; drop monkeypatch.setenv lines that no longer matter.

Supersedes #31064 (closed with credit to @novax635 who surfaced the underlying
issue but proposed aligning gateway *to* the env var rather than removing it).
2026-05-23 18:18:41 -07:00
Teknium 7a4dc8e8d6 chore: map edison@mcclean.codes for PR #29817 salvage 2026-05-23 17:49:47 -07:00
Edison e752c9454e feat(plugins): add register_auxiliary_task() to PluginContext API
Auxiliary LLM tasks (vision, compression, web_extract, etc.) currently
require modifications to core files for any plugin that needs its own
task slot — specifically the _AUX_TASKS list in hermes_cli/main.py and
the hardcoded env-var bridging dict in gateway/run.py. This violates
the 'plugins must not modify core files' rule and forces every memory
or context plugin that wants its own auxiliary task to either fork
core or open a coupled core+plugin PR.

This change adds a generic plugin surface for auxiliary task
registration:

    ctx.register_auxiliary_task(
        key='memory_retain_filter',
        display_name='Memory retain filter',
        description='hindsight pre-retain dedup/extract',
        defaults={'timeout': 30, 'extra_body': {'reasoning_effort': 'low'}},
    )

After registration, the task automatically:

  - Appears in 'hermes model → Configure auxiliary models' picker via
    a new _all_aux_tasks() merge of built-in + plugin tasks
  - Has its provider/model/base_url/api_key bridged from config.yaml
    to AUXILIARY_<KEY_UPPER>_* env vars at gateway startup
    (gateway/run.py now uses a dynamic bridged-keys set instead of
    a hardcoded per-task dict)
  - Gets plugin-declared defaults (timeout, extra_body, etc.) layered
    underneath user config so unconfigured plugin tasks still work
    (agent/auxiliary_client._get_auxiliary_task_config)
  - Resets to auto via 'Reset all to auto' alongside built-ins

Validation:

  - Rejects shadowing of built-in keys (vision, compression, etc.)
  - Rejects invalid key shapes (must match [A-Za-z0-9_]+)
  - Rejects cross-plugin collisions (clear error)
  - Allows same-plugin re-registration (idempotent updates)
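
The four validation rules can be sketched as a small registry.
`AuxTaskRegistry` and its method are hypothetical names, and the
built-in set lists only the keys named in this message; the real
PluginContext implementation may differ.

```python
import re
from typing import Dict

# Only the built-in keys named in the message; the real list is longer.
BUILTIN_KEYS = {"vision", "compression", "web_extract"}
_KEY_RE = re.compile(r"[A-Za-z0-9_]+")


class AuxTaskRegistry:
    """Tracks which plugin owns each auxiliary task key."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def register(self, plugin: str, key: str) -> None:
        if not _KEY_RE.fullmatch(key):
            raise ValueError(f"invalid auxiliary task key: {key!r}")
        if key in BUILTIN_KEYS:
            raise ValueError(f"{key!r} shadows a built-in auxiliary task")
        owner = self._owners.get(key)
        if owner is not None and owner != plugin:
            raise ValueError(f"{key!r} is already registered by {owner!r}")
        # Same-plugin re-registration falls through and is idempotent.
        self._owners[key] = plugin
```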

Plugin discovery failures (rare) fall back gracefully — the aux
config UI still shows built-in tasks if get_plugin_auxiliary_tasks()
raises, and gateway env-var bridging keeps working for built-ins.

Built-in tasks remain hardcoded in _AUX_TASKS for stability — they're
the baseline UX, and DEFAULT_CONFIG already ships their defaults.
Plugin tasks layer on top.

Tests: 15 new tests in test_plugin_auxiliary_tasks.py covering API
validation, manager state lifecycle, helper sort order, _all_aux_tasks
merge semantics, _reset_aux_to_auto inclusion of plugin tasks, and
default-layering in auxiliary_client.

Updates the gateway-bridge code-parity test (test_auxiliary_config_bridge)
to assert the new dynamic shape rather than the hardcoded literal env
var names which no longer appear post-refactor.

Motivation: this unblocks PR #20262 (hindsight smart retain pipeline)
and similar plugins that need a dedicated aux task slot. The change
is non-breaking — built-in env vars (AUXILIARY_VISION_PROVIDER, etc.)
keep working since they're produced by the same f-string template
that built the hardcoded names.
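
A sketch of that naming template, assuming the four bridged fields map
to upper-cased suffixes. `aux_env_vars` is a hypothetical helper, and
only AUXILIARY_VISION_PROVIDER is a name confirmed by the text above.

```python
from typing import Dict, Mapping

# Fields the message says are bridged from config.yaml.
BRIDGED_FIELDS = ("provider", "model", "base_url", "api_key")


def aux_env_vars(task_key: str, task_config: Mapping[str, str]) -> Dict[str, str]:
    """Build AUXILIARY_<KEY_UPPER>_<FIELD_UPPER> entries for set fields."""
    return {
        f"AUXILIARY_{task_key.upper()}_{field.upper()}": str(task_config[field])
        for field in BRIDGED_FIELDS
        if task_config.get(field)  # skip unset or empty values
    }
```

Because built-in and plugin keys go through the same template, a
plugin task such as memory_retain_filter gets its variables without a
per-task entry.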
2026-05-23 17:49:47 -07:00
soynchux e8fa415a9e fix(cli): validate runtime token refresh capability in Qwen auth status 2026-05-23 17:47:36 -07:00
teknium1 4254f7dd17 refactor(skills): slim AST diagnostic to single entry point
Trim ~600 LOC off the original contribution while keeping the same
operator-facing surface and detection coverage.

- Collapse three entry points (file / dir / bundle) into one
  ast_scan_path(path) that handles both files and directories.
- Drop AstFinding dataclass + severity field — replaced with plain
  (file, line, pattern_id, description) tuples. Severity ordering was
  display-only for a diagnostic that explicitly disclaims security
  verdicts, so the field added bookkeeping without earning its place.
- Replace Rich-markup formatter with plain text grouped by file.
- Drop the 'inspect --ast-deep' surface — same scanner, same output as
  'audit --deep', single CLI entry is enough. Operators audit after
  install; pre-install inspection signal isn't worth the second surface.
- Trim test file to the cases that earn their place: bypass payload,
  syntax error survival, RecursionError survival, false-positive guard
  (importer lookalike), literal-arg false-positive guard, non-.py
  ignored, directory recursion + cache-dir skipping, missing-path,
  getattr/__dict__ detection, formatter empty + populated.

Net: tools/skills_ast_audit.py 353 -> 133 LOC,
tests/tools/test_skills_ast_audit.py 299 -> 103 LOC, full diff
+704/-12 -> +264/-6. No change to tools/skills_guard.py — Skills Guard
verdicts remain untouched per SECURITY.md §2.4.
2026-05-23 17:47:26 -07:00
Tranquil-Flow 7255050c99 feat(skills): add opt-in AST deep diagnostics
Add opt-in AST diagnostics for skill review without making Skills Guard stricter by default.

- Add hermes skills inspect --ast-deep to scan fetched skill bundles before installation
- Add hermes skills audit --deep to scan already-installed hub skills
- Keep AST analysis in tools/skills_ast_audit.py, separate from tools/skills_guard.py
- Label output as diagnostic hints, not security verdicts
- Cover dynamic import/access patterns: importlib, __import__(computed), getattr(computed), and __dict__[computed]

This follows the maintainer guidance from closed PR #7436: useful AST-level analysis belongs in an opt-in diagnostic path, not in Skills Guard's default heuristic scan.
2026-05-23 17:47:26 -07:00
novax635 86871ee25a fix(cli): synchronize HERMES_SESSION_ID across environment and contextvar during session switches 2026-05-23 17:46:55 -07:00
brooklyn! f63ef74eaf fix(tui): refresh virtual transcript on viewport resize (#31077)
* fix(tui): refresh virtual transcript on viewport resize

Notify scroll subscribers when ScrollBox viewport bounds change and key virtual-history updates on viewport height so resize/keyboard changes remount the tail rows instead of leaving stale spacers visible.

* test(tui): isolate viewport-height remount regression

Keep the resize delta below the virtual history scroll quantum so the regression test specifically depends on viewport height entering the snapshot key.

* test(tui): clarify virtual history resize snapshot

Update the resize regression and comments so the test specifically guards viewport-height changes in the virtual-history snapshot key.

* docs(tui): clarify scrollbox subscription signals

Document that ScrollBox subscribers are notified for renderer-computed viewport and content bound changes, not only imperative scrolls.

* fix(tui): recompute virtual tail after width resize

Avoid preserving a frozen virtual transcript range when wrapped rows shrink enough that the old tail window no longer covers the viewport.

* fix(tui): preserve transcript tail across resizes

Wraps + heights are column-dependent, so a width change must remeasure
every row and the renderer must repaint the full viewport.

- Key virtualRows on cols so React remounts wrapped rows on resize.
- Snap back to bottom after sticky-mode resize once React rerenders.
- Reserve a scrollbar + gap column in transcriptBodyWidth (non-termux).
- Full repaint on any viewport height change (was: shrink-only).
- ScrollBox scrollHeight uses deepest child bottom so sticky-bottom
  math can reach the real final rendered row after reflow.
- DECSTBM fast-path now requires full container rect match.

* feat(tui): responsive banner tiers

Terminals can't scale glyphs, so the banner now picks a layout per
column width instead of always rendering the full 101-col logo:

- Wide (>= logo width): full ASCII logo + tagline.
- Mid (>= 58 cols): centered rule banner that expands with viewport.
- Narrow (>= 34 cols): brand line + tagline, both width-aware.
- < 34 cols: hidden.

SessionPanel surfaces model/cwd/sid inline when the hero column is
hidden, so narrow layouts don't lose that info. Logo width constants
derive from the art itself.

* fix(tui): re-check sticky inside resize debounce + document remount

Addresses Copilot review on PR #31077:

- onResize now re-checks isSticky() inside the 100ms timer so manual
  scrolls during the debounce window don't get snapped back to tail.
- Comment on the virtualRows cols-keying calls out the deliberate
  trade-off: per-row local state (e.g. systemOpen) resets on resize so
  yoga can remeasure off live geometry. The hook's scale-by-ratio path
  is too approximate for mixed markdown widths.
2026-05-23 19:39:53 -05:00
0z1-ghb dcbcdd6526 fix(compressor): propagate api_mode and fix root logger calls
- Add api_mode to 4 update_model() call sites:
  - conversation_loop.py: long_context failover and probe stepping
  - agent_runtime_helpers.py: rollback restore (also saves compressor_api_mode)
  - chat_completion_helpers.py: fallback activation
- Fix 31 root-logger calls across 5 files (logging.warning/error/info
  -> logger.warning/error/info) to respect module-level log filtering
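
Why the root-logger calls matter, as a small stdlib sketch (the function
names are illustrative only): a level set on the module logger filters
`logger.warning(...)` but has no effect on `logging.warning(...)`, which
always goes through the root logger.

```python
import logging

# Module-level logger, the pattern the fix switches to.
logger = logging.getLogger(__name__)


def warn_via_root(message: str) -> None:
    # Old style: bypasses any level configured for this module's logger.
    logging.warning(message)


def warn_via_module(message: str) -> None:
    # New style: respects per-module level and filters.
    logger.warning(message)
```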
2026-05-23 17:38:19 -07:00
0z1-ghb 8b2adead78 fix(compressor): ABC compliance — total_tokens, api_mode, logger consistency 2026-05-23 17:38:19 -07:00
Yuan Li 75643a6154 fix(env): strip null bytes from .env before python-dotenv loads
Null bytes in API key values (introduced by copy-paste) crash
os.environ[k] = v with "ValueError: embedded null byte", preventing
hermes from starting at all.
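
The sanitizing step amounts to something like the sketch below.
`strip_null_bytes` is a hypothetical name, and how the cleaned text is
handed to python-dotenv is not shown in the message.

```python
def strip_null_bytes(env_text: str) -> str:
    """Remove NUL characters so later os.environ assignments cannot fail."""
    return env_text.replace("\x00", "")
```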
2026-05-23 17:17:05 -07:00
Brian D. Evans 514a4eff36 docs(simplex): remove broken Docker install command (#26974) (#26975)
* docs(simplex): remove broken Docker install command (#26974)

The "Or Docker" snippet pointed at `simplexchat/simplex-chat`, which is
not a published Docker Hub image. Users following the docs hit:

  docker: Error response from daemon: pull access denied for
  simplexchat/simplex-chat, repository does not exist or may require
  'docker login'.

The SimpleX Chat project only publishes Docker images for its server
components (smp-server, xftp-server) — the chat CLI is distributed as a
binary release. Drop the broken `docker run` line and keep the verified
binary-download path, with a note pointing users to the upstream
Dockerfile if they want to build a container themselves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(simplex): drop misleading "Dockerfile" link text

Copilot review flagged that the link text claimed "Dockerfile in the
upstream repo" but the URL pointed at the repository root, not a
specific Dockerfile path. Reword to "build from source from the
simplex-chat repository" so the link text and target match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: briandevans <252620095+briandevans@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 16:32:20 -07:00
teknium1 87111c7bfe chore: map Glucksberg noreply email in AUTHOR_MAP 2026-05-23 16:26:28 -07:00
Glucksberg 9451087aab fix(telegram): preserve observed group slash commands 2026-05-23 16:26:28 -07:00
chyuwei 7de8cd4c5f fix(tui): clear TTS env var on voice off, and add TTS indicator to status bar
Bug 1: /voice off in TUI mode did not clear HERMES_VOICE_TTS,
leaving TTS stuck ON with no way to disable it (the voice.toggle
tts handler requires voice mode to be ON).

Bug 2: TUI status bar only showed 'voice on/off' without any
indication of whether TTS speech output is active, because the
frontend never tracked voiceTts state.

- tui_gateway/server.py: clear HERMES_VOICE_TTS when voice is turned off
- ui-tui/src/app/useMainApp.ts: add voiceTts state, thread setVoiceTts
  through voice contexts, display [tts] in status bar
- ui-tui/src/app/slash/commands/session.ts: sync tts from voice.toggle response
- ui-tui/src/app/interfaces.ts: add setVoiceTts to all voice context interfaces
2026-05-23 16:21:29 -07:00
0xchainer 2c34a7da87 fix(cli): prevent temp directory leak on ZIP update failure
Move shutil.rmtree into a finally block so the temp directory is always
cleaned up, even when an exception occurs during download, extraction,
or file copying.
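
The cleanup pattern, sketched with a hypothetical `with_temp_dir`
wrapper (the prefix and the callback shape are assumptions; the real
update code inlines the steps):

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def with_temp_dir(work: Callable[[Path], None]) -> Path:
    """Run `work` in a fresh temp dir and always remove the dir afterward."""
    tmp = Path(tempfile.mkdtemp(prefix="hermes-update-"))
    try:
        work(tmp)  # download, extract, copy: any of these may raise
    finally:
        # Runs on success and on failure, so nothing is left behind.
        shutil.rmtree(tmp, ignore_errors=True)
    return tmp
```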
2026-05-23 16:16:35 -07:00
Teknium 3b096d6f6d ntfy: tighten robustness, dedupe auth/truncation, add docs
Robustness:
- Surface 401/404 stream failures via _set_fatal_error() so the gateway's
  runtime status reflects 'fatal: ntfy_unauthorized' / 'ntfy_topic_not_found'
  instead of staying 'connected' when the reconnect loop halts. Matches
  the pattern in whatsapp / telegram / sms adapters.
- Strip whitespace from auth tokens so pasted tokens with trailing
  newlines don't produce malformed Authorization headers.

Simplicity:
- Extract _build_auth_header() and _truncate_body() to module-level
  helpers, used by both NtfyAdapter and _standalone_send. Removes the
  duplicated auth/truncation logic between the two paths.
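
A minimal sketch of the token hygiene, assuming bearer-token auth only.
`build_auth_header` here is illustrative; the shipped
_build_auth_header may handle other auth modes as well.

```python
from typing import Dict, Optional


def build_auth_header(token: Optional[str]) -> Dict[str, str]:
    """Return an Authorization header, or {} when no usable token is set."""
    # Strip first: a pasted token with a trailing newline would otherwise
    # end up inside the header value.
    cleaned = (token or "").strip()
    if not cleaned:
        return {}
    return {"Authorization": f"Bearer {cleaned}"}
```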

Docs:
- website/docs/user-guide/messaging/ntfy.md — full setup guide,
  identity-model warning, self-hosting, cron usage, troubleshooting.
- website/docs/reference/environment-variables.md — all 9 NTFY_* vars.
- website/docs/user-guide/messaging/index.md — platform comparison row.
- website/sidebars.ts — sidebar entry between simplex and open-webui.

Tests: 78/78 (+ 10 new robustness tests covering token hygiene, fatal
error propagation for 401/404, and the _truncate_body helper).
2026-05-23 16:13:01 -07:00
Teknium 6a8e131a0a refactor(ntfy): convert built-in adapter to platform plugin
ntfy now ships as a self-contained plugin under plugins/platforms/ntfy/
instead of editing 8 core files (gateway/config.py Platform enum,
gateway/run.py factory + auth maps, cron/scheduler.py, toolsets.py,
hermes_cli/status.py, agent/prompt_builder.py, gateway/channel_directory.py,
tools/send_message_tool.py).

All routing goes through gateway/platform_registry via register_platform():
- adapter_factory, check_fn, validate_config, is_connected
- env_enablement_fn seeds PlatformConfig.extra from NTFY_* env vars so
  gateway status reflects env-only setups without instantiating httpx
- standalone_sender_fn handles deliver=ntfy cron jobs when cron runs
  out-of-process from the gateway
- allowed_users_env / allow_all_env hook into _is_user_authorized
- cron_deliver_env_var=NTFY_HOME_CHANNEL for cron home routing
- platform_hint surfaces in the system prompt
- pii_safe=True (topic names are the only identifier; no PII to redact)

Tests moved to tests/gateway/test_ntfy_plugin.py using _plugin_adapter_loader
so the module lives under plugin_adapter_ntfy in sys.modules and cannot
collide with sibling plugin-adapter tests on the same xdist worker. The
core-file grep tests (Platform.NTFY in source, hermes-ntfy in toolsets,
etc.) are replaced with plugin-shape tests covering register() metadata,
env_enablement_fn output, and standalone_sender_fn behavior.

68 tests pass under scripts/run_tests.sh.
2026-05-23 16:13:01 -07:00
sprmn24 b10f17bf1e feat(ntfy): add ntfy platform adapter with atomic reconnect, identity fix, and 81 tests
2026-05-23 16:13:01 -07:00
Brooklyn Nicholson 511b8e2325 fix(tui): re-check sticky inside resize debounce + document remount
Addresses Copilot review on PR #31077:

- onResize now re-checks isSticky() inside the 100ms timer so manual
  scrolls during the debounce window don't get snapped back to tail.
- Comment on the virtualRows cols-keying calls out the deliberate
  trade-off: per-row local state (e.g. systemOpen) resets on resize so
  yoga can remeasure off live geometry. The hook's scale-by-ratio path
  is too approximate for mixed markdown widths.
2026-05-23 17:48:28 -05:00
Brooklyn Nicholson 35fdf11145 feat(tui): responsive banner tiers
Terminals can't scale glyphs, so the banner now picks a layout per
column width instead of always rendering the full 101-col logo:

- Wide (>= logo width): full ASCII logo + tagline.
- Mid (>= 58 cols): centered rule banner that expands with viewport.
- Narrow (>= 34 cols): brand line + tagline, both width-aware.
- < 34 cols: hidden.

SessionPanel surfaces model/cwd/sid inline when the hero column is
hidden, so narrow layouts don't lose that info. Logo width constants
derive from the art itself.
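
The tier selection can be sketched as a threshold ladder. The TUI itself is not written in Python, so this is only an illustration of the logic; `banner_tier` and the constant names are hypothetical, and the thresholds are the ones quoted in the list above.

```python
LOGO_WIDTH = 101       # the commit describes the full logo as 101 columns wide
MID_MIN_COLS = 58
NARROW_MIN_COLS = 34


def banner_tier(cols: int) -> str:
    """Pick a banner layout name for a terminal that is `cols` columns wide."""
    if cols >= LOGO_WIDTH:
        return "wide"      # full ASCII logo + tagline
    if cols >= MID_MIN_COLS:
        return "mid"       # centered rule banner
    if cols >= NARROW_MIN_COLS:
        return "narrow"    # brand line + tagline
    return "hidden"
```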
2026-05-23 17:37:51 -05:00
Brooklyn Nicholson 0277194e3b fix(tui): preserve transcript tail across resizes
Wraps + heights are column-dependent, so a width change must remeasure
every row and the renderer must repaint the full viewport.

- Key virtualRows on cols so React remounts wrapped rows on resize.
- Snap back to bottom after sticky-mode resize once React rerenders.
- Reserve a scrollbar + gap column in transcriptBodyWidth (non-termux).
- Full repaint on any viewport height change (was: shrink-only).
- ScrollBox scrollHeight uses deepest child bottom so sticky-bottom
  math can reach the real final rendered row after reflow.
- DECSTBM fast-path now requires full container rect match.
2026-05-23 17:37:42 -05:00
Brooklyn Nicholson 2a75bec607 fix(tui): recompute virtual tail after width resize
Avoid preserving a frozen virtual transcript range when wrapped rows shrink enough that the old tail window no longer covers the viewport.
2026-05-23 14:49:26 -05:00
Brooklyn Nicholson 521c870a05 docs(tui): clarify scrollbox subscription signals
Document that ScrollBox subscribers are notified for renderer-computed viewport and content bound changes, not only imperative scrolls.
2026-05-23 14:18:56 -05:00
Brooklyn Nicholson d1ad919a44 test(tui): clarify virtual history resize snapshot
Update the resize regression and comments so the test specifically guards viewport-height changes in the virtual-history snapshot key.
2026-05-23 14:11:44 -05:00
Brooklyn Nicholson cc61e3be49 test(tui): isolate viewport-height remount regression
Keep the resize delta below the virtual history scroll quantum so the regression test specifically depends on viewport height entering the snapshot key.
2026-05-23 14:06:08 -05:00
Brooklyn Nicholson 4fea02cc16 fix(tui): refresh virtual transcript on viewport resize
Notify scroll subscribers when ScrollBox viewport bounds change and key virtual-history updates on viewport height so resize/keyboard changes remount the tail rows instead of leaving stale spacers visible.
2026-05-23 13:41:46 -05:00
Ben 0988ab83b7 docs(plans): trim s6-overlay plan to a post-implementation reference
PR #30136 review item O7: the plan doc was 3,191 lines — 5x the
size of any other plan in docs/plans/ and the largest reference
document in the repo. With the implementation shipped, most of
that content is either:

* The phase-by-phase TDD walkthrough (~2,800 lines): now canonical
  in the PR commit log (`git log a957ef083..a6f7171a5`).
* The v2/v3 re-validation preambles: artifacts of the planning
  process, no longer load-bearing.
* The full Open Questions deliberations with options A/B/C laid
  out: collapsed into the Decision Log.
* The Rollout Plan and Estimated Timeline: history.

Trim to ~430 lines covering what readers actually need going
forward: the goal, architecture, scope, key design decisions
(D1–D9), risk register (now including the three risks surfaced
in PR review — `_s6_running` detection, svscanctl FIFO perms,
supervise control FIFO perms), the decision log including the
post-merge additions, and the verification checklist (now all
boxes ticked).

Header now reads 'Status: shipped' and points at the PR. The git
history preserves the full v3 plan for anyone who needs it.
2026-05-23 16:24:33 +10:00
Ben 3b69bdb74e test(docker): poll for boot-log signal instead of fixed sleeps
PR #30136 review item O6: test_container_restart.py used fixed
`time.sleep(8)` calls after `docker restart` to wait for the
cont-init reconciler to finish. Fixed sleeps waste time when the
event happens quickly and cause false failures when it happens slowly.

Replace with two polling helpers:

* `_wait_for_path(container, path, kind='f' | 'd', deadline_s=...)`
  — generic `test -f/-d` poller. Returns True on success, False on
  timeout; callers assert with a clear message.
* `_wait_for_reconcile_log_mention(container, profile, ...)` — the
  reconciler's per-profile log line is the canonical signal that
  the cont-init reconcile has finished for that profile. Poll on
  it instead of a sleep that hopes 8 seconds is enough.

The fixture-level setup wait is similarly migrated: it now polls
for `profile=default` in the boot log (every container always
gets a default-slot entry per item I1) and raises a clear timeout
error from the fixture if the container never finishes cont-init —
much better diagnostics than a mid-test KeyError.

The remaining `time.sleep()` calls are all `interval_s` pauses
between probe attempts; no fixed wait points remain.
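
A generic version of the polling pattern, as a sketch. `wait_until` is a hypothetical name; the real helpers wrap `docker exec` probes, which are replaced here by a caller-supplied `probe` callable. The `deadline_s` and `interval_s` parameter names follow the commit message.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool], deadline_s: float = 30.0,
               interval_s: float = 0.25) -> bool:
    """Poll `probe` until it returns True or `deadline_s` elapses.

    Returns True on success and False on timeout, so the caller can
    assert with its own message.
    """
    deadline = time.monotonic() + deadline_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A test would then write something like `assert wait_until(check, deadline_s=30), "reconcile never finished"`, which returns as soon as the signal appears instead of always paying the full sleep.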
2026-05-23 16:21:00 +10:00
Ben e3050657aa docs(docker): deprecation warning in entrypoint.sh shim
PR #30136 review item O5: docker/entrypoint.sh is now a thin shim
that forwards to stage2-hook.sh — the real ENTRYPOINT is /init plus
main-wrapper.sh. External scripts that hard-coded entrypoint.sh as
the container's ENTRYPOINT will see the cont-init bootstrap happen
but the CMD will not be exec'd (because stage2-hook only handles
bootstrap; main-wrapper.sh handles the CMD passthrough).

Add a stderr warning explaining the new contract and pointing
callers at the migration path (drop the --entrypoint override).
The shim itself stays in place for one release cycle so the
deprecation isn't a hard break — anyone still invoking it sees
the warning in their logs and has time to migrate.
2026-05-23 16:18:59 +10:00
Ben 541b40532a fix(container_boot): publish reconciled service dirs atomically
PR #30136 review noted the asymmetry: `register_profile_gateway`
used tmp_dir + rename to publish a new service slot atomically,
but the boot-time reconciler wrote files into the slot directly.
Same underlying concern (a concurrent s6-svscan rescan could
observe a half-populated directory), different code path.

Rewrite `container_boot._register_service` to mirror the manager:
build everything in `<scandir>/gateway-<profile>.tmp/`, then
`Path.replace` into place. If a previous interrupted run left a
`.tmp` sibling, it's cleaned up before the new build starts. If
the target already exists, it's removed before the rename so
`Path.replace` doesn't fail on a non-empty target directory (on Linux,
`rename` replaces an existing directory only if it is empty).

Three new tests: atomic publication leaves no .tmp leftovers,
overwriting an existing slot still leaves no .tmp leftovers, and
a stale .tmp from an interrupted run is cleaned up automatically.
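
The build-then-rename sequence can be sketched as follows. `publish_service_dir` and its `files` mapping are hypothetical; the real `_register_service` writes its own set of files.

```python
import shutil
from pathlib import Path
from typing import Dict


def publish_service_dir(scandir: Path, name: str, files: Dict[str, str]) -> Path:
    """Build a service directory in a .tmp sibling, then rename it into place."""
    target = scandir / name
    tmp = scandir / f"{name}.tmp"

    # A previous interrupted run may have left a stale .tmp sibling behind.
    if tmp.exists():
        shutil.rmtree(tmp)
    tmp.mkdir(parents=True)

    # Populate the directory while it is still under its temporary name.
    for file_name, content in files.items():
        (tmp / file_name).write_text(content)

    # rename(2) refuses to replace a non-empty directory, so clear it first.
    if target.exists():
        shutil.rmtree(target)
    tmp.replace(target)
    return target
```

A scanner never sees a half-populated directory under the final name. In this sketch there is a short window between removing the old target and the rename in which the slot is absent.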
2026-05-23 15:34:51 +10:00
Ben 5b1fcdd16b fix(container_boot): rotate container-boot.log when it exceeds 256 KiB
PR #30136 review noted: container-boot.log was append-only with no
rotation. On a long-lived container with frequent restarts and
many profiles it would grow unboundedly (~80 B per profile per
reconcile pass).

Add a soft cap: when the file size hits 256 KiB (`_LOG_ROTATE_BYTES`,
≈3000 reconcile lines, ≈1 year of daily reboots × 5 profiles), the
current file is renamed to `container-boot.log.1` (replacing any
existing one) before new entries are appended. Worst case is two
files at ~512 KiB — well within visibility limits for grep/cat.

Rotation is intentionally simple (no logrotate or s6-log machinery
for one append-only file). Failures during rotation are logged via
the module logger and treated as non-fatal — we keep appending to
the existing file rather than dropping the reconcile entry. Three
new unit tests cover above-threshold rotation, below-threshold
non-rotation, and overwrite of an existing .1 file.
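
A sketch of the soft-cap rotation, under the assumption that the check runs just before each append. `append_with_rotation` and `LOG_ROTATE_BYTES` are hypothetical names (the commit's constant is `_LOG_ROTATE_BYTES`).

```python
import logging
import os
from pathlib import Path

logger = logging.getLogger(__name__)

LOG_ROTATE_BYTES = 256 * 1024  # 256 KiB soft cap, as in the commit message


def append_with_rotation(log_path: Path, line: str,
                         max_bytes: int = LOG_ROTATE_BYTES) -> None:
    """Append one line, first rotating the file to `<name>.1` if it is too big."""
    try:
        if log_path.exists() and log_path.stat().st_size >= max_bytes:
            # os.replace overwrites any existing .1 file.
            os.replace(log_path, log_path.with_name(log_path.name + ".1"))
    except OSError as exc:
        # Non-fatal: keep appending to the current file instead of losing the entry.
        logger.warning("log rotation failed for %s: %s", log_path, exc)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line.rstrip("\n") + "\n")
```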
2026-05-23 15:33:11 +10:00
Ben f83b9b96d1 docker: drop sh -c wrappers from stage2-hook.sh
PR #30136 review caught: three `s6-setuidgid hermes sh -c "..."`
invocations in stage2-hook.sh interpolated $HERMES_HOME into a
nested shell context. Practically low-risk (a malicious HERMES_HOME
already requires container-launch privileges) but the cleaner
pattern is to invoke commands directly so the shell isn't a second
interpreter.

* `mkdir -p` of the data subdirs now runs directly via s6-setuidgid,
  one path per arg.
* The .install_method stamp is written via `printf | tee` — also no
  shell wrapper.
* The skills_sync invocation uses the venv's python by absolute path
  instead of sourcing activate inside a shell. skills_sync.py doesn't
  need anything from activate beyond sys.path, which the bin-stub
  python already provides.

No behavior change. Just a smaller attack surface and a script
that's easier to read.
2026-05-23 15:31:46 +10:00
Ben 8b6733ebe2 fix(service_manager): rip out dead port parameter
PR #30136 review caught: `_allocate_gateway_port()` in profiles.py
computed a SHA-256-derived port that was threaded through
`register_profile_gateway(profile, port=N)` →
`_render_run_script(profile, port, extra_env)` → and then **ignored**.
The rendered run script picked the bind port from the profile's
config.yaml (`[gateway] port = …`), never from the allocator. So
the entire allocator + parameter chain was dead code.

Remove:

* `hermes_cli.profiles._allocate_gateway_port` (deterministic
  SHA-256 → [9200, 9800) — never used).
* `port` kwarg from `ServiceManager.register_profile_gateway`
  (Protocol + Mixin + S6 implementation).
* `port` positional arg from `_render_run_script(profile, port,
  extra_env)` — now `_render_run_script(profile, extra_env)`.
* The pass-through call in `profiles._maybe_register_gateway_service`.

config.yaml is now the single source of truth for gateway port
selection — matches reality and reduces the API surface. Three
explanatory comments in service_manager.py / profiles.py document
the retirement so future readers don't reach for the allocator and
find a ghost.

Tests: drop the three `_allocate_gateway_port` tests; update
fakes' signatures throughout test_service_manager.py and
test_profiles_s6_hooks.py to match the new no-port API.
2026-05-23 15:30:15 +10:00
Ben 7b16e4448a docs(compose): update entrypoint comment for s6-overlay
PR #30136 review caught: docker-compose.yml still said "If you
override entrypoint, keep /opt/hermes/docker/entrypoint.sh in the
command chain." That was true under tini; under s6-overlay the
entrypoint is /init plus main-wrapper.sh, and entrypoint.sh is now
only a backward-compat shim.

Replace with an accurate description: /init must remain first in the
chain because it's PID 1 and runs the cont-init.d scripts (chown,
profile reconcile, dashboard toggle) before any service starts.
2026-05-23 15:24:46 +10:00
Ben 9ba349b6e9 fix(docker): dashboard slot stays 'down' when HERMES_DASHBOARD unset
PR #30136 review caught a false positive: when HERMES_DASHBOARD was
unset, the dashboard run script did `exec sleep infinity`, so
`s6-svstat /run/service/dashboard` reported the slot as 'up'.
`hermes doctor` and any other s6-svstat-based health check saw the
dashboard as supervised-running even though no dashboard process
existed.

Add cont-init.d/03-dashboard-toggle: writes a `down` marker file
into `/run/service/dashboard/` when HERMES_DASHBOARD is falsy,
removes any leftover marker when it's truthy. s6-supervise honors
`down` by not starting the service, so s6-svstat reports 'down' —
matching reality.

The run script's HERMES_DASHBOARD case-statement stays in place as
a belt-and-suspenders guard, so the two layers can never disagree.

Two new integration tests lock the behavior: slot reports down
when unset; slot reports up when set to 1.
2026-05-23 15:24:17 +10:00
Ben 1759c0f090 fix(service_manager): friendly errors for missing slots and s6-svc failures
PR #30136 review caught: `S6ServiceManager.start/stop/restart` called
`subprocess.run(check=True)` on `s6-svc`, so any failure surfaced as
a raw `CalledProcessError` traceback. The two cases operators
actually hit are:

  1. The service slot doesn't exist — most commonly because the user
     typed a profile name wrong (`hermes -p typo gateway start`).
  2. s6-svc itself fails — most commonly EACCES on the supervise
     control FIFO when running unprivileged.

Both deserve named errors with actionable messages, not stacktraces.

Changes:

* Add `S6Error` base + two concrete errors in `hermes_cli.service_manager`:
    - `GatewayNotRegisteredError(profile)` — carries the unprefixed
      profile name; message: `no such gateway 'typo': register it
      with `hermes profile create typo` first, or pass an existing
      profile name via `-p <name>``.
    - `S6CommandError(service, action, returncode, stderr)` — carries
      the s6-svc rc and stderr; message: `s6-svc start on
      'gateway-coder' failed (rc=111): <stderr>`.

* Factor lifecycle dispatch through `_run_svc(flag, label, name)`:
  pre-checks that the service directory exists (raises
  GatewayNotRegisteredError before invoking s6-svc), then runs
  s6-svc and translates any CalledProcessError into S6CommandError.

* `_dispatch_via_service_manager_if_s6` in `hermes_cli.gateway`
  catches both errors and prints `✗ <message>` + `sys.exit(1)`
  instead of letting the exception bubble. The dispatch path that
  used to dump a traceback at the user now gives an actionable
  one-liner.

Tests: 6 new tests for the error types and their CLI rendering;
existing lifecycle test pre-seeds the slot directory before calling
`mgr.start` etc.
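
The error types and the pre-check could be sketched like this. The message formats follow the commit text; the `service_root` and `svc_argv` parameters are assumptions added so the sketch can be exercised without s6, and the real `_run_svc` is a method rather than a free function.

```python
import subprocess
from pathlib import Path
from typing import Sequence


class S6Error(Exception):
    """Base class for s6 lifecycle errors."""


class GatewayNotRegisteredError(S6Error):
    def __init__(self, profile: str) -> None:
        self.profile = profile
        super().__init__(
            f"no such gateway '{profile}': register it with "
            f"`hermes profile create {profile}` first, or pass an existing "
            "profile name via `-p <name>`"
        )


class S6CommandError(S6Error):
    def __init__(self, service: str, action: str, returncode: int, stderr: str) -> None:
        self.service, self.action = service, action
        self.returncode, self.stderr = returncode, stderr
        super().__init__(
            f"s6-svc {action} on '{service}' failed (rc={returncode}): {stderr.strip()}"
        )


def run_svc(flag: str, label: str, name: str,
            service_root: Path = Path("/run/service"),
            svc_argv: Sequence[str] = ("s6-svc",)) -> None:
    """Run `s6-svc <flag> <service dir>`, translating failures to named errors."""
    service_dir = service_root / name
    if not service_dir.is_dir():
        # Report the bare profile name, without the service prefix.
        raise GatewayNotRegisteredError(name.removeprefix("gateway-"))
    try:
        subprocess.run([*svc_argv, flag, str(service_dir)],
                       check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        raise S6CommandError(name, label, exc.returncode, exc.stderr or "") from exc
```

A CLI layer can then catch `S6Error`, print the message on one line, and exit with status 1.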
2026-05-23 15:20:41 +10:00
Ben 367c15b1dc fix(container_boot): always register gateway-default slot
PR #30136 review caught: `hermes gateway start` (no `-p`) inside
the container resolves `_profile_suffix() == ""` → service name
`gateway-default`, but no such slot was ever registered. The Phase 4
profile-create hook only fired on `hermes profile create <name>`,
and the root profile (which lives at the top of $HERMES_HOME, not
under `profiles/`) was never one of those. So bare `hermes gateway
start` landed on `s6-svc -u /run/service/gateway-default` →
uncaught `CalledProcessError` → traceback to the user.

Changes:

1. `reconcile_profile_gateways` now always registers a
   `gateway-default` slot before iterating named profiles. Its
   prior state is read from `$HERMES_HOME/gateway_state.json`
   (sibling to the profile root, not under `profiles/`); stale
   runtime files there are swept the same way. Auto-up only if the
   prior state was `running` — same rule as named profiles.

2. `S6ServiceManager._render_run_script` special-cases
   `profile == "default"` to emit `hermes gateway run` with NO
   `-p` flag. Passing `-p default` would resolve to
   `$HERMES_HOME/profiles/default/` — a different profile that
   almost certainly doesn't exist. The empty profile-suffix
   convention is the dispatcher's contract and the run script has
   to match.

3. A user-created `profiles/default/` collides with the reserved
   root-profile slot; the reconciler now skips it with a warning
   rather than producing two registrations of the same service name.

Action-list ordering is stable: `default` first, then named
profiles in directory order. Boot-log readers can rely on this.

Tests: 8 new dedicated default-slot tests plus updates to every
existing test that asserted against the action list (via the new
`_named_actions` helper that drops the always-present default
entry).
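
The default-slot special case and the ordering rule can be illustrated with two small functions. Both names are hypothetical, and the `hermes -p <profile> gateway run` form for named profiles is an assumption inferred from the CLI examples elsewhere in this log.

```python
from typing import List


def gateway_run_argv(profile: str) -> List[str]:
    """Build the command a per-profile run script would exec."""
    if profile == "default":
        # Root profile: no -p flag, because `-p default` would point at
        # $HERMES_HOME/profiles/default/ instead of the root of $HERMES_HOME.
        return ["hermes", "gateway", "run"]
    return ["hermes", "-p", profile, "gateway", "run"]


def ordered_slots(named_profiles: List[str]) -> List[str]:
    """Default slot first, then named profiles; a user 'default' dir is skipped."""
    return ["default"] + [p for p in named_profiles if p != "default"]
```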
2026-05-23 15:16:35 +10:00
Ben 04d1894f36 docs(docker): dashboard IS supervised — update note that contradicted the PR
PR #30136 review caught that website/docs/user-guide/docker.md still
said "The dashboard side-process is **not supervised** — if it
crashes, it stays down until the container restarts." That was true
under tini but is the opposite of the s6 behavior that this PR ships and
that `test_dashboard_restarts_after_crash` verifies.

Replace with a description of what users actually see now: automatic
restart by s6-overlay, new PID after a short backoff, logs via
`docker logs`. The standalone-container caveat carries forward
unchanged.
2026-05-23 15:08:48 +10:00
Ben efd3569739 fix(gateway): route --all stop/restart through s6 under container
PR #30136 review caught that `hermes gateway stop --all` and
`... restart --all` were broken under s6. The Phase 4 dispatcher was
gated on `not stop_all` (and the symmetric restart_all), so `--all`
fell through to `kill_gateway_processes(all_profiles=True)`. pkill
SIGTERMed every gateway, s6-supervise observed the crashes, and
restarted every gateway ~1s later — net effect: `--all` *kicked*
gateways instead of *stopping* them.

Add `_dispatch_all_via_service_manager_if_s6(action)` that iterates
`mgr.list_profile_gateways()` and routes stop/restart through each
service slot. s6's `want up`/`want down` flips correctly, so a
stop persists. Partial failures are surfaced per-profile with a
running success count; the host pkill path is only reached when s6
isn't in play.

`start --all` isn't a CLI surface — the helper rejects it and
returns False (host code path can take over).
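
A sketch of the per-profile fan-out with failure collection. `dispatch_all` and `run_action` are hypothetical; the real helper prints results and returns a bool, while this version returns the counts so the behavior is easy to inspect.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def dispatch_all(action: str, profiles: Iterable[str],
                 run_action: Callable[[str, str], None]
                 ) -> Optional[Tuple[int, List[Tuple[str, str]]]]:
    """Apply stop/restart to every profile, collecting per-profile failures.

    Returns None for unsupported actions so the caller can fall back to
    the host code path.
    """
    if action not in ("stop", "restart"):
        return None
    succeeded = 0
    failures: List[Tuple[str, str]] = []
    for profile in profiles:
        try:
            run_action(action, profile)
            succeeded += 1
        except Exception as exc:  # one bad slot must not abort the rest
            failures.append((profile, str(exc)))
    return succeeded, failures
```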
2026-05-23 15:08:17 +10:00
Ben 8ae959adb6 fix(ci): drop --entrypoint override in hermes-smoke-test action
PR #30136 review caught a silent regression: the smoke-test action
overrode ENTRYPOINT to `/opt/hermes/docker/entrypoint.sh`, which the
s6-overlay migration reduced to a shim that just `exec`s the stage2
hook. stage2-hook ignores its CMD args, prints "Setup complete", and
exits 0 — so `hermes --help` and `hermes dashboard --help` never
ran. The #9153 regression guard was a green-always no-op.

Drop the override so the smoke test uses the image's real ENTRYPOINT
chain (`/init` + `main-wrapper.sh`), which is the actual production
startup path. `hermes --help` and `hermes dashboard --help` now run
through the full supervision tree and exercise the real argv routing.
2026-05-23 15:00:43 +10:00
Ben eb59d6f774 fix(docker): SHA256-verify s6-overlay tarballs
PR #30136 review flagged the s6-overlay install as a supply-chain
regression vs the gosu source it replaced — `tianon/gosu` was
digest-pinned via `FROM ...@sha256:...`, but the three new
ADD/curl downloads had no integrity check at all.

Pin all three tarballs (noarch, symlinks-noarch, per-arch) to
upstream-published SHA256s via ARGs. Verification happens via
`sha256sum -c` against a single checksum file (avoids a piped-shell
hadolint DL4006 warning under dash). To bump S6_OVERLAY_VERSION,
fetch the four `.sha256` files from the new release and update
the ARGs — documented inline.

If upstream artifacts are tampered with mid-build, the build now
fails loudly at the verification step instead of silently
producing a tainted image.
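
The Dockerfile performs this check with `sha256sum -c`. For readers who want the same idea outside a Dockerfile, here is an equivalent sketch using Python's `hashlib`; `verify_sha256` is a hypothetical helper and is not part of the build.

```python
import hashlib
from pathlib import Path


def verify_sha256(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> None:
    """Raise ValueError unless the file's SHA256 digest matches `expected_hex`."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_hex.strip().lower():
        raise ValueError(
            f"SHA256 mismatch for {path.name}: expected {expected_hex}, got {actual}"
        )
```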
2026-05-23 14:59:42 +10:00
Ben 928e52e574 fix(docker): support multi-arch s6-overlay install (amd64 + arm64)
The Dockerfile only ADD'd `s6-overlay-x86_64.tar.xz`, so the
`build-arm64` job in docker-publish.yml — which runs on
`ubuntu-24.04-arm` and publishes by digest — produced an image whose
`/init` couldn't exec on actual arm64 hosts. Apple Silicon and ARM
server users were getting a broken container.

Map BuildKit's `TARGETARCH` (`amd64` / `arm64`) to s6's kernel-arch
naming (`x86_64` / `aarch64`) inside the RUN step and fetch the
correct tarball via `curl` (`ADD`'s URL is evaluated at parse time,
before TARGETARCH substitution, so dynamic arch selection requires
RUN). The noarch + symlinks tarballs are architecture-independent
and stay as ADDs.

The audit case is now explicit: unsupported architectures fail loudly
at build time rather than producing a silently-broken image.
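
The mapping itself lives in a shell `RUN` step; the Python sketch below only illustrates the lookup and the fail-loudly rule. The function and constant names are hypothetical.

```python
# BuildKit TARGETARCH value -> s6-overlay tarball architecture name.
S6_ARCH_BY_TARGETARCH = {"amd64": "x86_64", "arm64": "aarch64"}


def s6_arch(targetarch: str) -> str:
    """Translate TARGETARCH, raising on anything unsupported."""
    try:
        return S6_ARCH_BY_TARGETARCH[targetarch]
    except KeyError:
        raise ValueError(f"unsupported TARGETARCH: {targetarch!r}") from None


def s6_tarball_name(targetarch: str) -> str:
    """Name of the per-arch tarball, e.g. s6-overlay-x86_64.tar.xz."""
    return f"s6-overlay-{s6_arch(targetarch)}.tar.xz"
```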
2026-05-23 14:58:06 +10:00
Ben 2f8ceeab9a fix(service_manager): s6 detection works for unprivileged hermes user
PR #30136 review surfaced two issues, both rooted in the same audit gap:
docker integration tests were running as root, not the unprivileged
`hermes` user (UID 10000) that the runtime actually uses via
`s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to
the s6 control surface worked as root in the tests but was inert in
production.

Fixes:

1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`,
   which is root-only readable. For UID 10000 the symlink yields
   PermissionError, `resolve()` silently returns the unresolved path,
   and `exe.name == "exe"` — so detection always returned False, the
   service-manager runtime-registration path was inert, and every
   `hermes profile create` / `hermes -p X gateway start` silently
   skipped the s6 hook. Replace with `/proc/1/comm` (world-readable)
   + `/run/s6/basedir` (s6-overlay-specific) — both required, fail
   closed.

2. `02-reconcile-profiles` now also chowns
   `/run/service/.s6-svscan/{control,lock}` to hermes so
   `s6-svscanctl -a/-an` works without
   root. Previously the directory chown stopped at `/run/service`
   and the FIFO inside stayed root-owned, so `register_profile_gateway`
   from hermes failed at the rescan-trigger step with EACCES — the
   wrapper in profiles.py caught the exception and printed a swallowed
   warning, so profile creation appeared to succeed while the slot
   was rolled back.

Audit changes to flush this class of bug next time:

- Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py`
  that default to `-u hermes`. The module docstring explains why and
  flags `user="root"` as opt-in only for tests that explicitly need
  root (none currently do).
- Refactor every `docker exec` call in tests/docker/ through the new
  helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py,
  test_container_restart.py, test_s6_profile_gateway_integration.py).
- Add 5 unit tests covering `_s6_running` under various probe states
  (both signals present; comm wrong; basedir missing; PermissionError
  on /proc/1/comm; missing /proc — non-Linux). The PermissionError
  test is the explicit regression guard for the original bug.

Known follow-up: the per-service `supervise/control` FIFO inside each
`/run/service/gateway-<profile>/supervise/` is created root-owned by
s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc
-u/-d/-t` from the hermes user will get EACCES on those. The audit
under `-u hermes` will reveal this in lifecycle tests — surfacing the
issue cleanly so it can be fixed in a focused follow-up (likely via a
small SUID helper or a polling chown loop in cont-init.d). The
detection + svscanctl fixes here are independent and complete on
their own.
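
The two-signal, fail-closed probe from fix 1 can be sketched as below. The injectable paths exist only to make the sketch testable, and the expected `comm` value of `s6-svscan` is an assumption based on the note above that s6-svscan is PID 1.

```python
from pathlib import Path


def s6_running(comm_path: Path = Path("/proc/1/comm"),
               basedir: Path = Path("/run/s6/basedir"),
               expected_comm: str = "s6-svscan") -> bool:
    """Return True only if both s6 signals are present; any error means False."""
    try:
        pid1_name = comm_path.read_text().strip()
        has_basedir = basedir.exists()
    except OSError:
        # Covers PermissionError and a missing /proc (non-Linux hosts).
        return False
    return pid1_name == expected_comm and has_basedir
```

Requiring both signals keeps a host that merely has a process named like the supervisor, or a stray `/run/s6` directory, from being treated as the s6 container.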
2026-05-23 14:56:39 +10:00
Ben a6f7171a5e feat(docker): remove gosu from bundled image; s6-setuidgid handles privilege drop
The s6-overlay migration replaced every runtime use of gosu with
s6-setuidgid (in stage2-hook.sh, main-wrapper.sh, per-service run
scripts, and cont-init.d hooks), but the gosu binary itself was still
being copied into the image from tianon/gosu, and several comments
across the repo still pointed to it.

Image changes:
- Drop the FROM tianon/gosu:1.19-trixie AS gosu_source stage
- Drop the COPY --from=gosu_source /gosu /usr/local/bin/ layer
- Net: one fewer base-image pull, ~12-15 MB layer eliminated

Documentation/comment refresh (no behavior change):
- Dockerfile: update root-user rationale comment + cont-init.d comment
- docker/main-wrapper.sh: drop "pre-s6 contract (gosu drop)" reference
- docker-compose.yml: update UID/GID remap comment
- .hadolint.yaml: update DL3002 ignore rationale
- website/docs/user-guide/docker.md: privilege-drop helper is s6-setuidgid now
- hermes_cli/config.py: docker_run_as_host_user docstring

tools/environments/docker.py runs *arbitrary user images* via the
terminal backend, not the bundled Hermes image. It still needs SETUID/
SETGID caps so user images that use gosu/su/s6-setuidgid all work.
Renamed the cap-list constant _GOSU_CAP_ARGS → _PRIVDROP_CAP_ARGS and
updated comments to list s6-setuidgid alongside the others as examples.
The matching test (test_security_args_include_setuid_setgid_for_gosu_drop
→ test_security_args_include_setuid_setgid_for_privdrop) was renamed
and its docstring updated; behavior is unchanged.

Verification:
- hadolint clean against .hadolint.yaml
- shellcheck clean against all docker/ shell scripts
- Image rebuilt successfully (sha 1a090924ccea)
- Docker harness: 19 passed in 41.87s (every Phase 0 test + Phase 4
  per-profile-gateway lifecycle + container-restart reconciliation)
- tests/tools/test_docker_environment.py: 23 passed (rename did not
  break test discovery; pre-existing unrelated mock warning)

The plan document (docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md)
intentionally retains its historical references to gosu — it describes
the pre-s6 entrypoint as background for understanding the migration.
2026-05-22 11:47:42 +10:00
Ben 7d07dd60a8 docs(s6): document container supervision; doctor + skill + user-guide updates
Phase 5 of the s6-overlay supervision plan. Documentation + small
diagnostic cleanups; no behavior changes.

website/docs/user-guide/docker.md:
  - Replace the old 'entrypoint script does the bootstrap' section
    with the s6-overlay boot flow (cont-init.d/01-hermes-setup,
    cont-init.d/02-reconcile-profiles, static main-hermes + dashboard
    services, ENTRYPOINT-as-main-program pattern).
  - Add a 'Per-profile gateway supervision' subsection covering the
    new lifecycle commands, restart semantics, log persistence, and
    'Manager: s6 (container supervisor)' status reporting.
  - Add 'Breaking change vs. pre-s6 images' callout naming the
    /init ENTRYPOINT and pointing affected wrappers at the pin
    workaround.

website/docs/user-guide/profiles.md:
  - Add a note under 'Persistent services' pointing container users
    at the docker.md section explaining s6 supervision inside the
    image. Host-side systemd/launchd documentation is unchanged.

skills/software-development/hermes-s6-container-supervision/SKILL.md:
  - New maintainer skill covering the supervision-tree map, file
    layout, the Architecture B rationale (cont-init.d args + halt
    exit-code propagation), quick recipes, and the 8 pitfalls we hit
    while implementing the plan (PATH-without-/command, root-owned
    profile dirs, SOUL.md as marker, the '143' anti-pattern, etc.).

hermes_cli/doctor.py:
  - _check_gateway_service_linger skips on s6 (the linger concept
    doesn't apply inside the container).
  - New _check_s6_supervision section reports main-hermes/dashboard
    state and per-profile-gateway count (registered vs supervised
    up), only inside the s6 container. Host doctor output unchanged.
  - External Tools / Docker check no longer emits a 'docker not
    found' warning inside the container; prints an explanatory
    info line instead. Still respects an explicit TERMINAL_ENV=docker
    (in case the user mounted /var/run/docker.sock).

hermes_cli/gateway.py:
  - Document _container_systemd_operational more precisely: it's
    NOT for our Hermes Docker image (s6-overlay handles that via
    detect_service_manager() == 's6'). It still covers
    systemd-nspawn / k8s-with-systemd-init cases, so leaving it in
    place is correct; the docstring just makes that explicit.

Test harness (verification, no test changes in this commit):
  19 passed, 0 xfailed. 66 service-manager / container-boot /
  profiles-s6-hooks / gateway-s6-dispatch unit tests still green.
  61 doctor tests still green. Hadolint + shellcheck clean.

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-22 11:47:42 +10:00
Ben 57c6e29666 feat(docker): per-profile s6 supervision + container-restart reconciliation
Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a
cont-init.d-time reconciliation pass that survives `docker restart`.

Task 4.0 — container-boot reconciliation:
  /run/service/ is tmpfs, so every `docker restart` wipes every
  per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
  invokes hermes_cli.container_boot.reconcile_profile_gateways() on
  every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
  gateway_state.json, recreates the s6 service slot, and auto-starts
  only those whose last state was 'running'. Other states
  (stopped, starting, startup_failed, missing) register the slot
  in the down state — avoiding crash-loops across restarts for a
  gateway that was broken last boot. Per-profile outcome is recorded
  to $HERMES_HOME/logs/container-boot.log.

  Implementation: hermes_cli/container_boot.py + 12 unit tests.
  Profile-marker is SOUL.md, not config.yaml, because `hermes profile
  create` only seeds SOUL.md by default (config.yaml comes from
  `hermes setup`).
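
The decision rule in Task 4.0 can be sketched as a pure planning step. `plan_profile_slots` and `last_state` are hypothetical names, the `"state"` JSON key is an assumed field name, and sorting the directory listing is a choice made for this sketch only.

```python
import json
from pathlib import Path
from typing import List, Tuple


def last_state(profile_dir: Path) -> str:
    """Read the previous gateway state, or 'missing' if it cannot be read."""
    try:
        data = json.loads((profile_dir / "gateway_state.json").read_text())
    except (OSError, ValueError):
        return "missing"
    if not isinstance(data, dict):
        return "missing"
    # The "state" key is an assumed field name for this sketch.
    return str(data.get("state", "missing"))


def plan_profile_slots(hermes_home: Path) -> List[Tuple[str, bool]]:
    """Return (profile, start_now) pairs for every profile under profiles/."""
    plans: List[Tuple[str, bool]] = []
    profiles_dir = hermes_home / "profiles"
    if not profiles_dir.is_dir():
        return plans
    for profile_dir in sorted(profiles_dir.iterdir()):
        # SOUL.md marks a real profile; config.yaml may not exist yet.
        if not (profile_dir / "SOUL.md").is_file():
            continue
        # Only a gateway that was running last boot is started again.
        plans.append((profile_dir.name, last_state(profile_dir) == "running"))
    return plans
```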

Task 4.1 / 4.2 — profile create/delete hooks:
  hermes_cli/profiles.py::create_profile now calls
  _maybe_register_gateway_service(<canon>) at the end, which routes
  through ServiceManager.register_profile_gateway when running on s6
  and no-ops on host backends. delete_profile mirrors with
  _maybe_unregister_gateway_service. _allocate_gateway_port produces
  a deterministic SHA-256-derived port in [9200, 9800).

Task 4.3 — gateway dispatch + remove rejection arms:
  _dispatch_via_service_manager_if_s6(action) intercepts
  start/stop/restart at the top of each subcommand and routes them
  through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
  `elif is_container():` rejection arms are kept as fallback for
  pre-s6 containers / unsupported runtimes, but only ever fire when
  detect_service_manager() != 's6'. install/uninstall under s6
  print informational guidance pointing users at profile create/delete.

  Removed the two xfail(strict=True) markers from
  tests/docker/test_profile_gateway.py — both tests now pass strictly.

Task 4.4 — status reporting:
  get_gateway_runtime_snapshot() reports
  Manager: 's6 (container supervisor)' inside an s6 container instead
  of 'docker (foreground)'.

Plan-vs-reality drift fixed in this commit:
  - Plan's S6ServiceManager._render_run_script used
    `gateway start --foreground --port {port}` — invented args; the
    real CLI is `gateway run`. Switched accordingly. port arg
    retained for API parity but now documented as 'currently ignored'.
  - Plan's reconciler keyed on config.yaml; switched to SOUL.md
    (config.yaml is created by hermes setup, not by hermes profile
    create, so the original gate caught nothing).
  - The plan's _dispatch helper used _profile_arg() which returns
    '--profile <name>' (i.e. with the flag prefix). Switched to
    _profile_suffix() which returns the bare name.
  - Architecture B's docker exec doesn't get /command on PATH or
    the venv on PATH; Dockerfile's runtime PATH now includes
    /opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
    without sourcing the venv.
  - stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
    boot, not just on the UID-remap path. Without this, files created
    by docker-exec-as-root accumulate and the next reconciler run
    fails with PermissionError reading SOUL.md.

Test harness:
  19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
  passing). 78 unit tests across service_manager + container_boot +
  profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
  pass cleanly.

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-22 11:47:42 +10:00
Ben ad5fdab092 feat(service_manager): add S6ServiceManager for runtime gateway supervision
Phase 3 of the s6-overlay supervision plan. Implements the runtime-
registration surface from D4 — only the s6 backend supports
register_profile_gateway / unregister_profile_gateway /
list_profile_gateways; host backends continue to raise
NotImplementedError. No caller yet (Phase 4 wires in the profile
create/delete hooks).

Key implementation notes:

  - Service directory shape: /run/service/gateway-<profile>/{type,run,log/run}.
    Atomic register: write to gateway-<profile>.tmp, then move into
    place via os.rename. Clean up on rescan failure.

  - Run script uses #!/command/with-contenv sh so HERMES_HOME and any
    extra_env arrive at exec time. The hermes -p <profile> gateway
    start --foreground --port <port> command is wrapped in
    s6-setuidgid hermes for the per-service privilege drop (OQ2-A).

  - Log script (OQ8-C): persists via s6-log to
    ${HERMES_HOME}/logs/gateways/<profile>/. CRITICAL — HERMES_HOME is
    a runtime env-var expansion in the rendered script, NOT a Python
    f-string substitution. Negative-asserted in
    test_s6_register_creates_service_dir_and_triggers_scan so
    regressions are caught.

  - PATH gotcha: /command/ is only on PATH for processes spawned by
    the supervision tree (services, cont-init.d). `docker exec` and
    profile-create hooks don't get it. S6ServiceManager calls all
    s6-* binaries via absolute path through the new _S6_BIN_DIR
    constant so callers don't have to fix up env vars.

  - validate_profile_name rejects path-traversal, leading-dash (s6
    would parse as a flag), uppercase, whitespace, and names >251
    chars (s6-svscan default name_max).
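
The notes above can be sketched as follows. This is an illustration under stated assumptions, not the S6ServiceManager implementation: the `type` file content ("longrun"), the s6-log rotation directives, the @PROFILE@ placeholder, and the function name register_service_dir are all hypothetical. The validation rules, the .tmp-then-rename step, and the literal ${HERMES_HOME} in the log script follow the commit message.

```python
import os
import shutil
from pathlib import Path

NAME_MAX = 251  # limit quoted in the commit message

# Plain string, not an f-string: ${HERMES_HOME} must reach the shell as-is.
LOG_RUN_TEMPLATE = (
    "#!/command/with-contenv sh\n"
    'exec s6-log n10 s1000000 T "${HERMES_HOME}/logs/gateways/@PROFILE@"\n'
)


def validate_profile_name(name: str) -> str:
    if not name or len(name) > NAME_MAX:
        raise ValueError("profile name must be 1-251 characters")
    if name.startswith("-"):
        raise ValueError("a leading dash would be parsed as a flag")
    if "/" in name or "\\" in name or name == "." or ".." in name:
        raise ValueError("path traversal is not allowed")
    if any(c.isupper() or c.isspace() for c in name):
        raise ValueError("uppercase and whitespace are not allowed")
    return name


def register_service_dir(scandir: Path, profile: str, run_script: str) -> Path:
    """Build the service dir under a .tmp name, then rename it into place."""
    validate_profile_name(profile)
    final = scandir / f"gateway-{profile}"
    staging = scandir / f"gateway-{profile}.tmp"
    if final.exists():
        raise FileExistsError(final)
    shutil.rmtree(staging, ignore_errors=True)  # clear a stale attempt
    try:
        (staging / "log").mkdir(parents=True)
        (staging / "type").write_text("longrun\n")  # assumed content
        scripts = {
            staging / "run": run_script,
            staging / "log" / "run": LOG_RUN_TEMPLATE.replace("@PROFILE@", profile),
        }
        for path, body in scripts.items():
            path.write_text(body)
            path.chmod(0o755)
        os.rename(staging, final)  # atomic on the same filesystem
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final
```

Taking run_script as a parameter keeps the sketch neutral about the exact gateway command line, which changed between Phase 3 and Phase 4.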

Test coverage:
  - 13 new unit tests in tests/hermes_cli/test_service_manager.py
    (kind detection, run-script content, env quoting, register
    rollback on rescan failure, unregister idempotence, list filter,
    lifecycle dispatch, svstat parsing). Total: 36 passing.
  - 2 new in-container integration tests in
    tests/docker/test_s6_profile_gateway_integration.py validating
    end-to-end registration against a real s6 supervision tree.

Docker harness: 14 passed, 2 xfailed (Phase 4 target unchanged).

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-22 11:47:41 +10:00
Ben 4826ea7b41 feat(docker)!: replace tini with s6-overlay as PID 1
BREAKING CHANGE: the container ENTRYPOINT is now /init (s6-overlay)
instead of /usr/bin/tini. Main hermes runs as the container CMD with
TTY inherited (preserving --tui), dashboard runs as a supervised s6-rc
service (HERMES_DASHBOARD=1 starts it; crashes auto-restart), and the
ground is laid for per-profile gateway supervision (Phase 3+4).

All five pre-s6 docker run invocation patterns continue to work
identically — verified by the Phase 0 docker harness:

  docker run <image>                  → `hermes` with no args
  docker run <image> chat -q "..."    → `hermes chat -q ...` passthrough
  docker run <image> sleep infinity   → `sleep infinity` direct
  docker run <image> bash             → interactive bash
  docker run -it <image> --tui        → interactive Ink TUI

Phase 2 harness result: 12 passed, 2 xfailed (Phase 4 target). Hadolint
+ shellcheck pass cleanly.

Architecture pivot from plan v3 (documented in main-hermes/run header):
the plan called for main hermes to be an s6-supervised service, but
two real s6-overlay v3 mechanics blocked that — cont-init.d scripts
receive no arguments (CMD args are not visible to stage2-hook), and
`/run/s6/basedir/bin/halt` after writing the exit code did not
propagate the desired exit code (container exits 143). We use the
s6-overlay-native CMD pattern instead: main-wrapper.sh is the
container's main program (ENTRYPOINT prepends it so leading-dash
args like --version aren't intercepted by /init), exec's the final
program with stdin/stdout/stderr inherited, and the program's exit
code becomes the container exit code. main-hermes is now a no-op
`sleep infinity` slot kept for future supervised-gateway-container
modes. This trades "supervised restart of main hermes" for arg-
parity with the pre-s6 contract — main hermes was already unsupervised
under tini, so we lose nothing functional. Dashboard supervision is
the only new guarantee added by this phase.
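
The arg routing that main-wrapper.sh performs (no args, hermes subcommand, bare executable) can be approximated in Python as below. The real wrapper is a shell script whose exact test is not shown here; using "first arg resolves on PATH" as the bare-exec check, and the names route_args and is_executable, are assumptions for illustration.

```python
import shutil
from typing import Callable, Sequence


def route_args(
    args: Sequence[str],
    is_executable: Callable[[str], bool] = lambda a: shutil.which(a) is not None,
) -> list[str]:
    """Return the argv the wrapper would exec for the given CMD args."""
    if not args:
        return ["hermes"]  # docker run <image>
    first = args[0]
    # Leading-dash args (--tui, --version) always belong to hermes.
    if not first.startswith("-") and is_executable(first):
        return list(args)  # docker run <image> sleep infinity
    return ["hermes", *args]  # docker run <image> chat -q "..."
```

The predicate is injectable so the decision can be exercised without depending on what happens to be installed on the host.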

Files added:
  docker/main-wrapper.sh           # arg routing + s6-setuidgid drop
  docker/stage2-hook.sh            # gosu-equivalent + chown + seed
  docker/s6-rc.d/main-hermes/{type,run,dependencies.d/base}
  docker/s6-rc.d/dashboard/{type,run,dependencies.d/base}
  docker/s6-rc.d/user/contents.d/{main-hermes,dashboard}

Files changed:
  Dockerfile: tini → s6-overlay install + ENTRYPOINT flip + service wiring
  docker/entrypoint.sh: thin shim to stage2-hook.sh for back-compat
  tests/docker/test_dashboard.py: add test_dashboard_restarts_after_crash

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-22 11:47:41 +10:00
Ben cf6133495c feat(service_manager): add ServiceManager protocol + host wrappers
Phase 1 of the s6-overlay supervision plan. Pure-refactor addition:
introduces the abstract interface (with runtime_checkable Protocol),
detect_service_manager(), validate_profile_name(), and thin
SystemdServiceManager / LaunchdServiceManager / WindowsServiceManager
wrappers around the existing systemd_* / launchd_* / gateway_windows.*
module-level functions. No host call site was modified — host code
continues to use the existing functions directly; the protocol is for
new backend-agnostic code (Phase 4 profile create/delete hooks and the
Phase 4 s6 dispatch path in 'hermes gateway start/stop/restart').

WindowsServiceManager.install() forwards the v3 kwargs (start_now,
start_on_login, elevated_handoff) added in PRs #28169-adjacent so
non-Windows callers — there aren't any today — can opt in.

The s6 backend lands in Phase 3; until then get_service_manager()
raises a clear error if invoked on a host that detects as 's6'.
2026-05-22 11:47:41 +10:00
Ben c6febe3765 ci(docker): add hadolint + shellcheck for container build inputs
Phase 0.5 of the s6-overlay supervision plan. Catches Dockerfile and
shell-script regressions that the behavioral docker-publish smoke test
can't surface — unquoted variable expansions, silently-failing RUN
commands, missing apt-get clean, etc.

Both linters pass cleanly against the current (tini) Dockerfile +
entrypoint.sh at the configured thresholds (hadolint: warning,
shellcheck: error).
Each ignore in .hadolint.yaml carries a one-line justification; the
shellcheck severity floor is documented in the workflow file.

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-22 11:47:41 +10:00
Ben a957ef0834 test(docker): stabilize Phase 0 baseline harness
Two pre-existing baseline issues found while running the Phase 0 harness
against the tini image that need fixing before later phases can use the
harness as a behavior-parity oracle:

1. The autouse `_enforce_test_timeout` fixture in tests/conftest.py
   hard-coded a 30s SIGALRM, which preempted any `pytest.mark.timeout`
   marker (already honored by pytest-timeout). Honor the marker if
   present; fall back to 30s otherwise. Docker harness tests carry a
   180s marker applied at collection time in tests/docker/conftest.py.

2. test_dashboard_port_override polled via `ss -tlnp` / `netstat -tln`
   — neither is installed in the Hermes image, so the probe trivially
   failed even when the dashboard was bound. The dashboard also takes
   8-15s to bind on cold image; the 5s sleep was insufficient. Replace
   with a poll loop reading /proc/net/tcp directly (port 9120 = 0x23A0,
   state 0A = LISTEN). Bump probe deadline to 60s and switch
   test_dashboard_opt_in_starts to a similar poll for pgrep so we don't
   regress to the same race.
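
The /proc/net/tcp probe from item 2 can be sketched as a small parser. This is an illustration, not the test's actual helper; port_is_listening is a hypothetical name. It relies only on the documented table layout: the second column is the local address as HEXIP:HEXPORT and the fourth is the socket state, where 0A means LISTEN.

```python
def port_is_listening(proc_net_tcp: str, port: int) -> bool:
    """Return True if the table text shows a LISTEN socket on the port."""
    want = f"{port:04X}"  # 9120 -> "23A0"
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_port = fields[1].rpartition(":")[2]
        if local_port.upper() == want and fields[3].upper() == "0A":
            return True
    return False
```

/proc/net/tcp lists IPv4 sockets only; a dashboard bound on an IPv6 socket would appear in /proc/net/tcp6 instead, so a poll loop may need to read both.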

Result: 11 passed, 2 xfailed (Phase 4 target) on tini image. Harness
now ready to serve as Phase 2's behavior-parity oracle.
2026-05-22 11:47:41 +10:00
Ben 60d8e07ded test(docker): apply 180s timeout to docker harness tests
The agent-test suite default is 30s; docker harness tests (test_no_args,
the dashboard spin-up, the container restart) routinely take 60-90s.
Without this they intermittently fail in CI with TimeoutError.
2026-05-22 11:46:52 +10:00
Ben 244d62ded3 test(docker): lock baseline behavior for Phase 0 harness
Tasks 0.2-0.6 of the s6-overlay supervision plan. Locks the
user-visible behavior we must preserve through the Phase 2 init-
system swap:

- test_main_invocation.py (Task 0.2): docker run <image> with no
  args, chat subcommand passthrough, bare executable passthrough,
  bash pattern, exit-code propagation
- test_tui_passthrough.py (Task 0.3): TTY allocation via docker -t
  using the host's script(1) for a PTY
- test_dashboard.py (Task 0.4): HERMES_DASHBOARD=1 opt-in,
  HERMES_DASHBOARD_PORT override
- test_profile_gateway.py (Task 0.5): per-profile gateway
  start/stop and profile-delete-stops-gateway. Both marked
  xfail(strict=True) because the current tini image refuses
  gateway lifecycle commands inside the container; Phase 4
  Task 4.3 flips them to passing.
- test_zombie_reaping.py (Task 0.6): PID 1 reaps orphaned
  zombies. tini does this today; s6-overlay's /init must
  continue to.

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-22 11:46:52 +10:00
Ben 705256aaa6 test(docker): add conftest fixtures for docker harness
Task 0.1 of the s6-overlay supervision plan. Establishes the test
infrastructure for tests/docker/: skip-on-missing-Docker collection
hook, session-scoped image-build fixture (overridable via the
HERMES_TEST_IMAGE env var for faster local iteration), and a
container_name fixture that ensures cleanup on test exit.
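
The HERMES_TEST_IMAGE override can be sketched as a plain function, separate from any pytest fixture wiring. resolve_test_image and the build callback are hypothetical names; only the env var and the "use the pre-built image instead of rebuilding" behavior come from the commit message.

```python
import os
from typing import Callable, Mapping


def resolve_test_image(
    build: Callable[[], str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the image tag to test, building only when no override is set."""
    prebuilt = env.get("HERMES_TEST_IMAGE", "").strip()
    if prebuilt:
        return prebuilt  # skip the slow docker build
    return build()
```

A session-scoped fixture could call this once and hand the resulting tag to every test in the session.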

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-22 11:46:52 +10:00
Ben ef536880a3 docs(plans): add s6-overlay supervision plan (v3)
Replace tini with s6-overlay as PID 1 in the Hermes Docker image so that
main hermes, the dashboard, and dynamically-created per-profile gateways
all run as supervised services. Includes container-boot reconciliation
(Task 4.0) so per-profile gateways survive docker restart.

Plan history:
- v1: 2026-05-07 — original design (subagent gateways scope)
- v2: 2026-05-18 — re-validated, scope narrowed to per-profile gateways,
  WindowsServiceManager added to protocol
- v3: 2026-05-21 — re-validated in docker_s6 worktree, install-method
  stamp preservation noted in Task 2.3, Task 4.0 added for container
  restart survival

12.5 engineering days estimated across 7 phases.
2026-05-22 11:46:52 +10:00
242 changed files with 19091 additions and 1022 deletions
+5 -2
@@ -29,9 +29,13 @@ runs:
- name: hermes --help
shell: bash
run: |
# Use the image's real ENTRYPOINT (/init + main-wrapper.sh) so
# this exercises the actual production startup path. PR #30136
# review caught that an --entrypoint override here had been
# silently neutered by the s6-overlay migration — stage2-hook
# ignores its CMD args, so the smoke test was a no-op.
docker run --rm \
-v /tmp/hermes-test:/opt/data \
--entrypoint /opt/hermes/docker/entrypoint.sh \
"${{ inputs.image }}" --help
- name: hermes dashboard --help
@@ -43,5 +47,4 @@ runs:
# installed package.
docker run --rm \
-v /tmp/hermes-test:/opt/data \
--entrypoint /opt/hermes/docker/entrypoint.sh \
"${{ inputs.image }}" dashboard --help
+68
@@ -0,0 +1,68 @@
name: Docker / shell lint
# Lints the container build inputs: Dockerfile (via hadolint) and any shell
# scripts under docker/ (via shellcheck). These catch the class of regression
# the behavioral docker-publish smoke test can't — unquoted variable
# expansions, silently-failing RUN commands, etc.
#
# Rules and ignores are documented in .hadolint.yaml at the repo root.
# shellcheck severity is pinned to `error` so SC1091-style "can't follow
# sourced script" info-level warnings don't fail the job — the .venv
# activate script doesn't exist at lint time.
on:
push:
branches: [main]
paths:
- Dockerfile
- docker/**
- .hadolint.yaml
- .github/workflows/docker-lint.yml
pull_request:
branches: [main]
paths:
- Dockerfile
- docker/**
- .hadolint.yaml
- .github/workflows/docker-lint.yml
permissions:
contents: read
concurrency:
group: docker-lint-${{ github.ref }}
cancel-in-progress: true
jobs:
hadolint:
name: Lint Dockerfile (hadolint)
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- name: Checkout code
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: hadolint
uses: hadolint/hadolint-action@54c9adbab1582c2ef04b2016b760714a4bfde3cf # v3.1.0
with:
dockerfile: Dockerfile
config: .hadolint.yaml
failure-threshold: warning
shellcheck:
name: Lint docker/ shell scripts (shellcheck)
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- name: Checkout code
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: shellcheck
uses: ludeeus/action-shellcheck@00cae500b08a931fb5698e11e79bfbd38e612a38 # v2.0.0
env:
# Severity = error: SC1091 (can't follow sourced script) is info-
# level and would otherwise fail when the venv activate script
# doesn't exist at lint time.
SHELLCHECK_OPTS: --severity=error
with:
scandir: ./docker
+50
@@ -80,6 +80,56 @@ jobs:
with:
image: ${{ env.IMAGE_NAME }}:test
# ---------------------------------------------------------------------
# Run the docker-integration test suite against the freshly-built
# image already loaded into the local daemon (`:test`). These tests
# are excluded from the sharded `tests.yml :: test` matrix on purpose
# (see `_SKIP_PARTS` in scripts/run_tests_parallel.py) because each
# shard would otherwise reach the session-scoped ``built_image``
# fixture in ``tests/docker/conftest.py`` and start a 3-7min
# ``docker build`` under a 180s pytest-timeout cap — guaranteed to
# die in fixture setup.
#
# Piggybacking here avoids a second image build: the smoke test
# already proved the image loads + runs, so the daemon has it under
# `${IMAGE_NAME}:test` and we just point ``HERMES_TEST_IMAGE`` at
# that. The fixture's ``HERMES_TEST_IMAGE`` branch (see
# tests/docker/conftest.py:62-63) short-circuits the rebuild.
#
# Why this job and not a standalone one: the image is 5GB+; passing
# it between jobs via ``docker save``/``upload-artifact`` is slower
# than the build itself. Reusing the existing daemon state is the
# cheapest path to coverage on every PR that touches docker code.
# ---------------------------------------------------------------------
- name: Install uv (for docker tests)
uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86 # v5
- name: Set up Python 3.11 (for docker tests)
run: uv python install 3.11
- name: Install Python dependencies (for docker tests)
run: |
uv venv .venv --python 3.11
source .venv/bin/activate
# ``dev`` extra pulls in pytest, pytest-asyncio, pytest-timeout —
# everything tests/docker/ needs. We deliberately avoid ``all``
# here because the docker tests only drive the container via
# subprocess and don't import hermes_agent's optional deps.
uv pip install -e ".[dev]"
- name: Run docker integration tests
env:
# Skip rebuild; use the image already loaded by the build step.
HERMES_TEST_IMAGE: ${{ env.IMAGE_NAME }}:test
# Match the policy in tests.yml :: test job — no accidental
# real-API calls from inside the harness.
OPENROUTER_API_KEY: ""
OPENAI_API_KEY: ""
NOUS_API_KEY: ""
run: |
source .venv/bin/activate
python -m pytest tests/docker/ -v --tb=short
- name: Log in to Docker Hub
if: github.event_name == 'push' && github.ref == 'refs/heads/main' || github.event_name == 'release'
uses: docker/login-action@4907a6ddec9925e35a0a9e82d7399ccc52663121 # v4.1.0
+36
@@ -0,0 +1,36 @@
# hadolint configuration for the Hermes Agent Dockerfile.
# See https://github.com/hadolint/hadolint#configure for rules.
#
# We want hadolint to surface NEW Dockerfile lint regressions, but we
# don't want to rewrite the existing image to silence rules that are
# either intentional or pragmatic tradeoffs for this project. Each
# ignore below has a one-line justification.
failure-threshold: warning
ignored:
# Pin versions in apt get install. We intentionally don't pin common
# tools (curl, git, openssh-client, etc.) — security updates flow in
# via the periodic base-image rebuild, and pinning would lock us to
# superseded patch releases. Same rationale as nearly every distro-
# base official image (python, node, debian).
- DL3008
# Use WORKDIR to switch to a directory. The image uses `(cd web && …)`
# / `(cd ../ui-tui && …)` inline subshells for one-off build steps
# because they don't affect later RUN commands; promoting them to
# full WORKDIR switches with restores would obscure intent.
- DL3003
# Multiple consecutive RUN instructions. The `touch README.md` + `uv
# sync` split is intentional — `touch` is cheap, `uv sync` is the
# expensive layer-cached step we want isolated, and merging them
# would invalidate the cache for trivial changes.
- DL3059
# Last USER should not be root. /init (s6-overlay) runs as root so the
# stage2 hook can usermod/groupmod and chown the data volume per
# HERMES_UID at runtime; each supervised service then drops to the
# hermes user via `s6-setuidgid`.
- DL3002
# Only allow base images pulled from these registries (hadolint DL3026).
trustedRegistries:
- docker.io
- ghcr.io
+114 -10
@@ -1,5 +1,4 @@
FROM ghcr.io/astral-sh/uv:0.11.6-python3.13-trixie@sha256:b3c543b6c4f23a5f2df22866bd7857e5d304b67a564f4feab6ac22044dde719b AS uv_source
FROM tianon/gosu:1.19-trixie@sha256:3b176695959c71e123eb390d427efc665eeb561b1540e82679c15e992006b8b9 AS gosu_source
FROM debian:13.4
# Disable Python stdout buffering to ensure logs are printed immediately
@@ -9,18 +8,68 @@ ENV PYTHONUNBUFFERED=1
# install survives the /opt/data volume overlay at runtime.
ENV PLAYWRIGHT_BROWSERS_PATH=/opt/hermes/.playwright
# Install system dependencies in one layer, clear APT cache
# tini reaps orphaned zombie processes (MCP stdio subprocesses, git, bun, etc.)
# that would otherwise accumulate when hermes runs as PID 1. See #15012.
# Install system dependencies in one layer, clear APT cache.
# tini was previously PID 1 to reap orphaned zombie processes (MCP stdio
# subprocesses, git, bun, etc.) that would otherwise accumulate when hermes
# ran as PID 1. See #15012. Phase 2 of the s6-overlay supervision plan
# replaces tini with s6-overlay's /init (PID 1 = s6-svscan), which reaps
# zombies non-blockingly on SIGCHLD and additionally supervises the main
# hermes process, the dashboard, and per-profile gateways.
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential curl nodejs npm python3 ripgrep ffmpeg gcc python3-dev libffi-dev procps git openssh-client docker-cli tini && \
build-essential curl nodejs npm python3 ripgrep ffmpeg gcc python3-dev libffi-dev procps git openssh-client docker-cli xz-utils && \
rm -rf /var/lib/apt/lists/*
# ---------- s6-overlay install ----------
# s6-overlay provides supervision for the main hermes process, the dashboard,
# and per-profile gateways. /init becomes PID 1 below — see ENTRYPOINT.
#
# Multi-arch: BuildKit auto-populates TARGETARCH (amd64 / arm64). s6-overlay
# uses tarball names keyed on the kernel arch string (x86_64 / aarch64), so
# we map between them inline. The noarch + symlinks tarballs are
# architecture-independent and reused as-is.
#
# We use `curl` instead of `ADD` for the per-arch tarball because `ADD`
# evaluates its URL at parse time, before any ARG / TARGETARCH substitution
# — splitting one URL per arch into two ADDs would download both on every
# build and leave dead bytes in the cache. A single curl + arch-keyed URL
# is simpler and cache-friendlier.
#
# Supply-chain integrity: every tarball is checksum-verified against the
# upstream-published SHA256. To bump S6_OVERLAY_VERSION, fetch the four
# `.sha256` files from the corresponding release and update the ARGs. The
# checksum lookup happens during build, so a compromised release artifact
# fails the build loudly instead of silently producing a tampered image.
ARG TARGETARCH
ARG S6_OVERLAY_VERSION=3.2.3.0
ARG S6_OVERLAY_NOARCH_SHA256=b720f9d9340efc8bb07528b9743813c836e4b02f8693d90241f047998b4c53cf
ARG S6_OVERLAY_X86_64_SHA256=a93f02882c6ed46b21e7adb5c0add86154f01236c93cd82c7d682722e8840563
ARG S6_OVERLAY_AARCH64_SHA256=0952056ff913482163cc30e35b2e944b507ba1025d78f5becbb89367bf344581
ARG S6_OVERLAY_SYMLINKS_SHA256=a60dc5235de3ecbcf874b9c1f18d73263ab99b289b9329aa950e8729c4789f0e
ADD https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-noarch.tar.xz /tmp/
ADD https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-symlinks-noarch.tar.xz /tmp/
RUN set -eu; \
case "${TARGETARCH:-amd64}" in \
amd64) s6_arch="x86_64"; s6_arch_sha="${S6_OVERLAY_X86_64_SHA256}" ;; \
arm64) s6_arch="aarch64"; s6_arch_sha="${S6_OVERLAY_AARCH64_SHA256}" ;; \
*) echo "Unsupported TARGETARCH=${TARGETARCH} for s6-overlay" >&2; exit 1 ;; \
esac; \
curl -fsSL --retry 3 -o /tmp/s6-overlay-arch.tar.xz \
"https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-${s6_arch}.tar.xz"; \
{ \
printf '%s %s\n' "${S6_OVERLAY_NOARCH_SHA256}" /tmp/s6-overlay-noarch.tar.xz; \
printf '%s %s\n' "${s6_arch_sha}" /tmp/s6-overlay-arch.tar.xz; \
printf '%s %s\n' "${S6_OVERLAY_SYMLINKS_SHA256}" /tmp/s6-overlay-symlinks-noarch.tar.xz; \
} > /tmp/s6-overlay.sha256; \
sha256sum -c /tmp/s6-overlay.sha256; \
tar -C / -Jxpf /tmp/s6-overlay-noarch.tar.xz; \
tar -C / -Jxpf /tmp/s6-overlay-arch.tar.xz; \
tar -C / -Jxpf /tmp/s6-overlay-symlinks-noarch.tar.xz; \
rm /tmp/s6-overlay-*.tar.xz /tmp/s6-overlay.sha256
# Non-root user for runtime; UID can be overridden via HERMES_UID at runtime
RUN useradd -u 10000 -m -d /opt/data hermes
COPY --chmod=0755 --from=gosu_source /gosu /usr/local/bin/
COPY --chmod=0755 --from=uv_source /usr/local/bin/uv /usr/local/bin/uvx /usr/local/bin/
WORKDIR /opt/hermes
@@ -103,18 +152,73 @@ RUN cd web && npm run build && \
USER root
RUN chmod -R a+rX /opt/hermes && \
chown -R hermes:hermes /opt/hermes/.venv /opt/hermes/ui-tui /opt/hermes/node_modules
# Start as root so the entrypoint can usermod/groupmod + gosu.
# If HERMES_UID is unset, the entrypoint drops to the default hermes user (10000).
# Start as root so the s6-overlay stage2 hook can usermod/groupmod and chown
# the data volume. Each supervised service then drops to the hermes user via
# `s6-setuidgid hermes` in its run script. If HERMES_UID is unset, services
# run as the default hermes user (UID 10000).
# ---------- Link hermes-agent itself (editable) ----------
# Deps are already installed in the cached layer above; `--no-deps` makes
# this a fast (~1s) egg-link creation with no resolution or downloads.
RUN uv pip install --no-cache-dir --no-deps -e "."
# ---------- s6-overlay service wiring ----------
# Static services declared at build time: main-hermes + dashboard.
# Per-profile gateway services are registered dynamically at runtime by
# the profile create/delete hooks (Phase 4); they live under
# /run/service/ (tmpfs) and are reconciled on container restart by
# /etc/cont-init.d/02-reconcile-profiles (Phase 4 Task 4.0).
COPY docker/s6-rc.d/ /etc/s6-overlay/s6-rc.d/
# stage2-hook handles UID/GID remap, volume chown, config seeding,
# skills sync — all the work the old entrypoint.sh did before
# `exec hermes`. Wired in as cont-init.d/01- so it
# runs before user services start.
#
# 02-reconcile-profiles re-creates per-profile gateway s6 service
# slots from $HERMES_HOME/profiles/<name>/ after a container restart
# (the /run/service/ scandir is tmpfs and wiped on restart). Phase 4.
RUN mkdir -p /etc/cont-init.d && \
printf '#!/bin/sh\nexec /opt/hermes/docker/stage2-hook.sh\n' \
> /etc/cont-init.d/01-hermes-setup && \
chmod +x /etc/cont-init.d/01-hermes-setup
COPY --chmod=0755 docker/cont-init.d/015-supervise-perms /etc/cont-init.d/015-supervise-perms
COPY --chmod=0755 docker/cont-init.d/02-reconcile-profiles /etc/cont-init.d/02-reconcile-profiles
# ---------- Runtime ----------
ENV HERMES_WEB_DIST=/opt/hermes/hermes_cli/web_dist
ENV HERMES_HOME=/opt/data
ENV PATH="/opt/data/.local/bin:${PATH}"
# Pre-s6 entrypoint.sh did `source .venv/bin/activate` which exported
# the venv bin onto PATH; Architecture B's main-wrapper.sh does the
# same for the container's main process, but `docker exec` and our
# cont-init.d scripts don't pass through the wrapper. Expose the venv
# bin globally so `docker exec <container> hermes ...` and any
# subprocess that doesn't activate the venv first still find hermes.
ENV PATH="/opt/hermes/.venv/bin:/opt/data/.local/bin:${PATH}"
RUN mkdir -p /opt/data
VOLUME [ "/opt/data" ]
ENTRYPOINT [ "/usr/bin/tini", "-g", "--", "/opt/hermes/docker/entrypoint.sh" ]
# s6-overlay's /init is PID 1. It sets up the supervision tree, runs
# /etc/cont-init.d/* (our stage2 hook), starts s6-rc services
# declared in /etc/s6-overlay/s6-rc.d/, then exec's its remaining
# argv as the container's "main program" with stdin/stdout/stderr
# inherited (this is what makes interactive --tui work). When the
# main program exits, /init begins stage 3 shutdown and the container
# exits with the program's exit code. Replaces tini — see Phase 2 of
# docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md.
#
# We use the ENTRYPOINT+CMD split rather than CMD alone so the
# wrapper is prepended to user-supplied args automatically:
#
# docker run <image> → /init main-wrapper.sh (CMD default)
# docker run <image> chat -q "hi" → /init main-wrapper.sh chat -q hi
# docker run <image> sleep infinity → /init main-wrapper.sh sleep infinity
# docker run <image> --tui → /init main-wrapper.sh --tui
#
# main-wrapper.sh handles arg routing (bare-exec vs. hermes
# subcommand vs. no-args), drops to the hermes user via s6-setuidgid,
# and exec's the final program so its exit code becomes the container
# exit code. Without the wrapper-as-ENTRYPOINT, leading-dash args
# like `--version` would be intercepted by /init's POSIX shell.
ENTRYPOINT [ "/init", "/opt/hermes/docker/main-wrapper.sh" ]
CMD [ ]
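
For readers less familiar with the shell in the s6-overlay install step of this Dockerfile, the same two ideas (TARGETARCH to tarball-arch mapping, and failing loudly on a checksum mismatch) look like this in Python. It is a restatement for illustration, not part of the build; the function names are hypothetical.

```python
import hashlib
from pathlib import Path

# BuildKit TARGETARCH -> s6-overlay tarball arch, as in the Dockerfile case.
S6_ARCH = {"amd64": "x86_64", "arm64": "aarch64"}


def s6_arch_for(targetarch: str) -> str:
    try:
        return S6_ARCH[targetarch]
    except KeyError:
        raise ValueError(f"Unsupported TARGETARCH={targetarch} for s6-overlay") from None


def verify_sha256(path: Path, expected_hex: str) -> None:
    """Raise if the file's SHA-256 does not match the pinned digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_hex.lower():
        raise ValueError(f"checksum mismatch for {path}")
```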
+5 -1
@@ -1534,7 +1534,11 @@ class HermesACPAgent(acp.Agent):
)
except Exception:
logger.debug("Failed to auto-title ACP session %s", session_id, exc_info=True)
if final_response and conn and not streamed_message:
if final_response and conn and (not streamed_message or result.get("response_transformed")):
# Deliver the final response when streaming did not already send it,
# or when a plugin hook transformed the response after streaming
# finished (e.g. transform_llm_output) — otherwise the appended /
# rewritten text never reaches the client.
update = acp.update_agent_message_text(final_response)
await conn.session_update(session_id, update)
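
The delivery condition in this hunk reduces to a small predicate. The sketch below restates it with plain booleans so the cases are easy to check; should_deliver_final is a hypothetical name and does not exist in the ACP agent.

```python
def should_deliver_final(
    final_response: str,
    has_conn: bool,
    streamed_message: bool,
    response_transformed: bool,
) -> bool:
    """True when the final text still needs to be sent to the client."""
    if not final_response or not has_conn:
        return False
    # Send if nothing was streamed, or if a hook rewrote the text afterwards.
    return (not streamed_message) or response_transformed
```

The new case is the third one: streaming already happened, but a hook changed the response afterwards, so the rewritten text is sent as well.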
+7 -8
@@ -976,16 +976,14 @@ def init_agent(
# Expose session ID to tools (terminal, execute_code) so agents can
# reference their own session for --resume commands, cross-session
# coordination, and logging. Uses the ContextVar system from
# session_context.py for concurrency safety (gateway runs multiple
# sessions in one process). Also writes os.environ as fallback for
# CLI mode where ContextVars aren't used.
os.environ["HERMES_SESSION_ID"] = agent.session_id
# coordination, and logging. Keep the ContextVar and os.environ
# fallback synchronized because different tool paths still read both.
try:
from gateway.session_context import _SESSION_ID
_SESSION_ID.set(agent.session_id)
from gateway.session_context import set_current_session_id
set_current_session_id(agent.session_id)
except Exception:
pass # CLI/test mode — ContextVar not needed
os.environ["HERMES_SESSION_ID"] = agent.session_id
# Session logs go into ~/.hermes/sessions/ alongside gateway sessions
hermes_home = get_hermes_home()
@@ -1429,6 +1427,7 @@ def init_agent(
base_url=agent.base_url,
api_key=getattr(agent, "api_key", ""),
provider=agent.provider,
api_mode=agent.api_mode,
)
if not agent.quiet_mode:
_ra().logger.info("Using context engine: %s", _selected_engine.name)
+8 -6
@@ -132,7 +132,7 @@ def convert_to_trajectory_format(agent, messages: List[Dict[str, Any]], user_que
except json.JSONDecodeError:
# This shouldn't happen since we validate and retry during conversation,
# but if it does, log warning and use empty dict
logging.warning(f"Unexpected invalid JSON in trajectory conversion: {tool_call['function']['arguments'][:100]}")
logger.warning(f"Unexpected invalid JSON in trajectory conversion: {tool_call['function']['arguments'][:100]}")
arguments = {}
tool_call_json = {
@@ -747,7 +747,7 @@ def try_recover_primary_transport(
time.sleep(wait_time)
return True
except Exception as e:
logging.warning("Primary transport recovery failed: %s", e)
logger.warning("Primary transport recovery failed: %s", e)
return False
# ── End provider fallback ──────────────────────────────────────────────
@@ -910,19 +910,20 @@ def restore_primary_runtime(agent) -> bool:
base_url=rt["compressor_base_url"],
api_key=rt["compressor_api_key"],
provider=rt["compressor_provider"],
api_mode=rt.get("compressor_api_mode", ""),
)
# ── Reset fallback chain for the new turn ──
agent._fallback_activated = False
agent._fallback_index = 0
logging.info(
logger.info(
"Primary runtime restored for new turn: %s (%s)",
agent.model, agent.provider,
)
return True
except Exception as e:
logging.warning("Failed to restore primary runtime: %s", e)
logger.warning("Failed to restore primary runtime: %s", e)
return False
# Which error types indicate a transient transport failure worth
@@ -1093,7 +1094,7 @@ def dump_api_request_debug(
return dump_file
except Exception as dump_error:
if agent.verbose_logging:
logging.warning(f"Failed to dump API request debug payload: {dump_error}")
logger.warning(f"Failed to dump API request debug payload: {dump_error}")
return None
@@ -1478,6 +1479,7 @@ def switch_model(agent, new_model, new_provider, api_key='', base_url='', api_mo
"compressor_api_key": getattr(_cc, "api_key", "") if _cc else "",
"compressor_provider": getattr(_cc, "provider", agent.provider) if _cc else agent.provider,
"compressor_context_length": _cc.context_length if _cc else 0,
"compressor_api_mode": getattr(_cc, "api_mode", agent.api_mode) if _cc else agent.api_mode,
"compressor_threshold_tokens": _cc.threshold_tokens if _cc else 0,
}
if api_mode == "anthropic_messages":
@@ -1509,7 +1511,7 @@ def switch_model(agent, new_model, new_provider, api_key='', base_url='', api_mo
agent._fallback_chain = fallback_chain
agent._fallback_model = fallback_chain[0] if fallback_chain else None
logging.info(
logger.info(
"Model switched in-place: %s (%s) -> %s (%s)",
old_model, old_provider, new_model, new_provider,
)
+5 -1
@@ -2122,9 +2122,13 @@ def build_anthropic_kwargs(
block["text"] = text
# 3. Prefix tool names with mcp_ (Claude Code convention)
# Skip names that already begin with the marker — native MCP server
# tools (from mcp_servers: in config.yaml) are registered under their
# full mcp_<server>_<tool> name and would double-prefix otherwise,
# breaking round-trip registry lookup in normalize_response. GH-25255.
if anthropic_tools:
for tool in anthropic_tools:
if "name" in tool:
if "name" in tool and not tool["name"].startswith(_MCP_TOOL_PREFIX):
tool["name"] = _MCP_TOOL_PREFIX + tool["name"]
# 4. Prefix tool names in message history (tool_use and tool_result blocks)
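The guard added in step 3 makes the prefixing idempotent. A minimal standalone sketch, assuming `_MCP_TOOL_PREFIX` is the string `"mcp_"` (the diff does not show its value) and using a hypothetical helper name `prefix_tool_names`:

```python
# Assumption: the real _MCP_TOOL_PREFIX is "mcp_"; its value is not shown in the diff.
MCP_TOOL_PREFIX = "mcp_"


def prefix_tool_names(tools):
    """Prefix each tool name once; names that already carry the prefix are left alone."""
    for tool in tools:
        name = tool.get("name")
        if isinstance(name, str) and not name.startswith(MCP_TOOL_PREFIX):
            tool["name"] = MCP_TOOL_PREFIX + name
    return tools


tools = [
    {"name": "read_file"},           # plain tool: gets the prefix
    {"name": "mcp_github_search"},   # native MCP tool: already prefixed
    {"description": "no name key"},  # skipped entirely
]
prefix_tool_names(tools)
prefix_tool_names(tools)  # a second pass changes nothing
print([t.get("name") for t in tools])
```

Because the check is on the prefix itself, a tool registered as `mcp_<server>_<tool>` keeps its registry name and can still be looked up on the way back.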
+116 -2
@@ -3730,6 +3730,37 @@ _VISION_AUTO_PROVIDER_ORDER = (
)
def _main_model_supports_vision(provider: str, model: Optional[str]) -> bool:
"""Return True when ``provider``/``model`` is known to accept image input.
Used by the vision auto-detect chain to skip the user's main provider
when it's known to be text-only (e.g. DeepSeek, gpt-oss without vision).
Without this guard, ``resolve_vision_provider_client(provider="auto")``
would happily return the main-provider client and any subsequent image
payload would surface as a cryptic provider-side error
(``unknown variant `image_url`, expected `text```, #31179).
Returns True when capability lookup is unknown — preserves the historical
behaviour of attempting the call, so providers we haven't catalogued yet
don't silently regress to text-only.
"""
try:
from agent.image_routing import _lookup_supports_vision
from hermes_cli.config import load_config
except ImportError:
return True
try:
supports = _lookup_supports_vision(provider, model, load_config())
except Exception: # pragma: no cover - defensive
return True
if supports is None:
# No capability data — keep current behaviour and let the call attempt
# happen rather than silently skipping. This avoids false-positive
# skips for new/custom providers.
return True
return bool(supports)
def _normalize_vision_provider(provider: Optional[str]) -> str:
return _normalize_aux_provider(provider)
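The three-way outcome of `_main_model_supports_vision` (known yes, known no, unknown treated as yes) can be sketched in isolation. This is an illustration only: `supports_vision`, `fake_lookup`, and the capability table are hypothetical stand-ins for `_lookup_supports_vision` and the loaded config.

```python
from typing import Callable, Optional

# Hypothetical capability table; the real data comes from agent.image_routing.
_CAPS = {
    ("text-only-provider", "model-a"): False,
    ("vision-provider", "model-b"): True,
}


def fake_lookup(provider: str, model: Optional[str]) -> Optional[bool]:
    """Return True/False when the capability is catalogued, None when unknown."""
    return _CAPS.get((provider, model))


def supports_vision(
    lookup: Callable[[str, Optional[str]], Optional[bool]],
    provider: str,
    model: Optional[str],
) -> bool:
    """Only a definite False skips the provider; errors and unknowns attempt the call."""
    try:
        supports = lookup(provider, model)
    except Exception:
        return True
    if supports is None:
        return True
    return bool(supports)


print(supports_vision(fake_lookup, "text-only-provider", "model-a"))
print(supports_vision(fake_lookup, "vision-provider", "model-b"))
print(supports_vision(fake_lookup, "uncatalogued", "model-c"))
```

Defaulting to True on missing data keeps new or custom providers on the old code path, at the cost of still seeing the provider-side error when such a provider is in fact text-only.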
@@ -3870,6 +3901,23 @@ def resolve_vision_provider_client(
"vision support) — falling through to aggregator chain",
main_provider,
)
elif not _main_model_supports_vision(main_provider, vision_model):
# The main model is known to be text-only (e.g. DeepSeek V4,
# gpt-oss-120b without vision). Building a client and sending
# an image would produce a cryptic provider-side error like
# ``unknown variant `image_url`, expected `text``` (#31179).
# Fall through to the aggregator chain instead.
#
# Only log the provider name (not the model) — mirrors the
# sibling _PROVIDERS_WITHOUT_VISION branch above, and avoids
# CodeQL py/clear-text-logging-sensitive-data heuristic false
# positives on multi-value interpolations.
logger.debug(
"Vision auto-detect: skipping main provider %s "
"(reports no vision capability) — falling through to "
"aggregator chain",
main_provider,
)
else:
rpc_client, rpc_model = resolve_provider_client(
main_provider, vision_model,
@@ -4281,6 +4329,23 @@ def _get_cached_client(
return client, model or default_model
# Aliases that target direct REST APIs not modeled as first-class providers
# in PROVIDER_REGISTRY. Used for ``auxiliary.<task>.provider`` so users can
# write the obvious name and have it resolve to a working ``custom`` endpoint
# without needing to know our internal provider IDs.
#
# Why these specifically: PROVIDER_REGISTRY has ``openai-codex`` (OAuth) and
# ``custom`` (manual base_url + OPENAI_API_KEY) but no plain ``openai`` for
# direct API-key access. Users predictably type ``provider: openai`` and
# expect it to use OPENAI_API_KEY against api.openai.com. Previously this
# silently fell back to the user's main provider, sending OpenAI model names
# to e.g. DeepSeek and producing cryptic ``unknown variant 'image_url'``
# errors (issue #31179).
_AUX_DIRECT_API_BASE_URLS: Dict[str, str] = {
"openai": "https://api.openai.com/v1",
}
def _resolve_task_provider_model(
task: str = None,
provider: str = None,
@@ -4317,6 +4382,25 @@ def _resolve_task_provider_model(
resolved_model = model or cfg_model
resolved_api_mode = cfg_api_mode
# Convenience aliases for direct API-key endpoints that aren't first-class
# providers (e.g. ``provider: openai`` → custom + api.openai.com/v1).
# Applied to both explicit args and config-derived values. When the user
# has already supplied a base_url we keep their endpoint but still rewrite
# the provider to ``custom`` so resolution doesn't hit the
# PROVIDER_REGISTRY-only path (which has no ``openai`` entry).
def _expand_direct_api_alias(prov: Optional[str], existing_base: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
if not prov:
return prov, existing_base
target_base = _AUX_DIRECT_API_BASE_URLS.get(prov.strip().lower())
if target_base is None:
return prov, existing_base
return "custom", existing_base or target_base
if provider:
provider, base_url = _expand_direct_api_alias(provider, base_url)
if cfg_provider:
cfg_provider, cfg_base_url = _expand_direct_api_alias(cfg_provider, cfg_base_url)
if base_url:
return "custom", resolved_model, base_url, api_key, resolved_api_mode
if provider:
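The alias expansion above can be exercised on its own. The sketch below copies the logic of `_expand_direct_api_alias` into a module-level function (the name `expand_direct_api_alias` and the localhost URL are illustrative, not part of the change):

```python
from typing import Dict, Optional, Tuple

# Same single entry as _AUX_DIRECT_API_BASE_URLS in the diff.
AUX_DIRECT_API_BASE_URLS: Dict[str, str] = {
    "openai": "https://api.openai.com/v1",
}


def expand_direct_api_alias(
    prov: Optional[str], existing_base: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Rewrite a known alias to ("custom", base_url); pass everything else through."""
    if not prov:
        return prov, existing_base
    target_base = AUX_DIRECT_API_BASE_URLS.get(prov.strip().lower())
    if target_base is None:
        return prov, existing_base
    # A user-supplied base_url wins over the alias default.
    return "custom", existing_base or target_base


print(expand_direct_api_alias("openai", None))
print(expand_direct_api_alias(" OpenAI ", "http://localhost:8080/v1"))
print(expand_direct_api_alias("anthropic", None))
```

Non-alias providers come back unchanged, so the helper is safe to apply to both the explicit argument and the config-derived value.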
@@ -4344,7 +4428,17 @@ _DEFAULT_AUX_TIMEOUT = 30.0
def _get_auxiliary_task_config(task: str) -> Dict[str, Any]:
"""Return the config dict for auxiliary.<task>, or {} when unavailable."""
"""Return the config dict for auxiliary.<task>, or {} when unavailable.
For plugin-registered auxiliary tasks (see
:meth:`hermes_cli.plugins.PluginContext.register_auxiliary_task`) the
plugin's declared *defaults* are layered underneath the user's config
so an unconfigured plugin task still works:
plugin defaults ← config.yaml auxiliary.<task> (user wins)
Built-in tasks ignore this path (their defaults live in DEFAULT_CONFIG).
"""
if not task:
return {}
try:
@@ -4354,7 +4448,27 @@ def _get_auxiliary_task_config(task: str) -> Dict[str, Any]:
return {}
aux = config.get("auxiliary", {}) if isinstance(config, dict) else {}
task_config = aux.get(task, {}) if isinstance(aux, dict) else {}
return task_config if isinstance(task_config, dict) else {}
if not isinstance(task_config, dict):
task_config = {}
# Layer plugin-declared defaults underneath user config so
# ctx.register_auxiliary_task(defaults={...}) takes effect without
# forcing the user to write config.yaml entries.
try:
from hermes_cli.plugins import get_plugin_auxiliary_tasks
for _entry in get_plugin_auxiliary_tasks():
if _entry.get("key") == task:
_defaults = _entry.get("defaults") or {}
if isinstance(_defaults, dict):
merged = dict(_defaults)
merged.update(task_config)
return merged
break
except Exception:
# Plugin discovery failure must not break aux task config reads.
pass
return task_config
def _get_task_timeout(task: str, default: float = _DEFAULT_AUX_TIMEOUT) -> float:
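The layering rule (plugin defaults underneath, user config on top) reduces to a dict copy plus `update`. A rough sketch, assuming plugin entries are dicts with `key` and `defaults` fields as in the diff; `layer_task_config` and the sample task names are hypothetical:

```python
from typing import Any, Dict, List


def layer_task_config(
    task: str,
    aux_config: Dict[str, Any],
    plugin_entries: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the user's auxiliary.<task> config layered over plugin defaults."""
    task_config = aux_config.get(task, {})
    if not isinstance(task_config, dict):
        task_config = {}
    for entry in plugin_entries:
        if entry.get("key") == task:
            defaults = entry.get("defaults") or {}
            if isinstance(defaults, dict):
                merged = dict(defaults)     # start from the plugin defaults
                merged.update(task_config)  # user-configured keys win
                return merged
            break
    return task_config


plugins = [{"key": "summarize", "defaults": {"provider": "auto", "timeout": 30}}]
user_aux = {"summarize": {"timeout": 90}}
print(layer_task_config("summarize", user_aux, plugins))
print(layer_task_config("summarize", {}, plugins))
print(layer_task_config("vision", {"vision": {"provider": "custom"}}, plugins))
```

The merge is shallow: a nested dict in the user config replaces the plugin's nested dict wholesale rather than being merged key by key.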
+8 -2
@@ -115,7 +115,10 @@ _SKILL_REVIEW_PROMPT = (
"Protected skills (DO NOT edit these):\n"
" • Bundled skills (shipped with Hermes, e.g. 'hermes-agent').\n"
" • Hub-installed skills (installed via 'hermes skills install').\n"
"Pinned skills (marked via 'hermes curator pin').\n"
"Pinned skills (marked via 'hermes curator pin') CAN be improved — "
"pin only blocks deletion/archive/consolidation by the curator, not "
"content updates. Patch them when a pitfall or missing step turns up, "
"same as any other agent-created skill.\n"
"If the only skills that need updating are protected, say\n"
"'Nothing to save.' and stop.\n\n"
"Do NOT capture (these become persistent self-imposed constraints "
@@ -198,7 +201,10 @@ _COMBINED_REVIEW_PROMPT = (
"Protected skills (DO NOT edit these):\n"
" • Bundled skills (shipped with Hermes, e.g. 'hermes-agent').\n"
" • Hub-installed skills (installed via 'hermes skills install').\n"
"Pinned skills (marked via 'hermes curator pin').\n"
"Pinned skills (marked via 'hermes curator pin') CAN be improved — "
"pin only blocks deletion/archive/consolidation by the curator, not "
"content updates. Patch them when a pitfall or missing step turns up, "
"same as any other agent-created skill.\n"
"If the only skills that need updating are protected, say\n"
"'Nothing to save.' and stop.\n\n"
"Do NOT capture as skills (these become persistent self-imposed "
+31 -14
@@ -757,7 +757,7 @@ def try_activate_fallback(agent, reason: "FailoverReason | None" = None) -> bool
current_base_url = str(getattr(agent, "base_url", "") or "").rstrip("/").lower()
fb_base_url_for_dedup = (fb.get("base_url") or "").strip().rstrip("/").lower()
if fb_provider == current_provider and fb_model == current_model:
logging.warning(
logger.warning(
"Fallback skip: chain entry %s/%s matches current provider/model",
fb_provider, fb_model,
)
@@ -768,7 +768,7 @@ def try_activate_fallback(agent, reason: "FailoverReason | None" = None) -> bool
and fb_base_url_for_dedup == current_base_url
and fb_model == current_model
):
logging.warning(
logger.warning(
"Fallback skip: chain entry base_url %s matches current backend",
fb_base_url_for_dedup,
)
@@ -800,7 +800,7 @@ def try_activate_fallback(agent, reason: "FailoverReason | None" = None) -> bool
explicit_base_url=fb_base_url_hint,
explicit_api_key=fb_api_key_hint)
if fb_client is None:
logging.warning(
logger.warning(
"Fallback to %s failed: provider not configured",
fb_provider)
return agent._try_activate_fallback() # try next in chain
@@ -940,19 +940,20 @@ def try_activate_fallback(agent, reason: "FailoverReason | None" = None) -> bool
base_url=agent.base_url,
api_key=getattr(agent, "api_key", ""), # callable preserved → call_llm
provider=agent.provider,
api_mode=agent.api_mode,
)
agent._emit_status(
f"🔄 Primary model failed — switching to fallback: "
f"{fb_model} via {fb_provider}"
)
logging.info(
logger.info(
"Fallback activated: %s -> %s (%s)",
old_model, fb_model, fb_provider,
)
return True
except Exception as e:
logging.error("Failed to activate fallback %s: %s", fb_model, e)
logger.error("Failed to activate fallback %s: %s", fb_model, e)
return agent._try_activate_fallback() # try next in chain
@@ -1168,7 +1169,7 @@ def handle_max_iterations(agent, messages: list, api_call_count: int) -> str:
final_response = "I reached the iteration limit and couldn't generate a summary."
except Exception as e:
logging.warning(f"Failed to get summary response: {e}")
logger.warning(f"Failed to get summary response: {e}")
final_response = f"I reached the maximum iterations ({agent.max_iterations}) but couldn't summarize. Error: {str(e)}"
return final_response
@@ -1197,12 +1198,12 @@ def cleanup_task_resources(agent, task_id: str) -> None:
_ra().cleanup_vm(task_id)
except Exception as e:
if agent.verbose_logging:
logging.warning(f"Failed to cleanup VM for task {task_id}: {e}")
logger.warning(f"Failed to cleanup VM for task {task_id}: {e}")
try:
_ra().cleanup_browser(task_id)
except Exception as e:
if agent.verbose_logging:
logging.warning(f"Failed to cleanup browser for task {task_id}: {e}")
logger.warning(f"Failed to cleanup browser for task {task_id}: {e}")
@@ -2076,8 +2077,21 @@ def interruptible_streaming_api_call(agent, api_kwargs: dict, *, on_first_delta=
# Streaming failed AFTER some tokens were already delivered to
# the platform. Re-raising would let the outer retry loop make
# a new API call, creating a duplicate message. Return a
# partial "stop" response instead so the outer loop treats this
# turn as complete (no retry, no fallback).
# partial response stub instead and let the outer loop decide:
#
# - text-only partials → finish_reason="length" so the
# conversation loop persists the partial assistant content
# and asks the model to continue from where the stream
# died (issue #30963: partial stop misclassified as a
# clean completion was exiting the loop with budget
# remaining and an unfinished goal).
#
# - partial mid-tool-call → finish_reason="stop" stays.
# The user-visible warning we append says "Ask me to
# retry if you want to continue", so the agent should
# hand control back rather than auto-retry a tool call
# that may have side-effects.
#
# Recover whatever content was already streamed to the user.
# _current_streamed_assistant_text accumulates text fired
# through _fire_stream_delta, so it has exactly what the
@@ -2115,14 +2129,17 @@ def interruptible_streaming_api_call(agent, api_kwargs: dict, *, on_first_delta=
"of text; surfaced warning to user: %s",
_partial_names, len(_partial_text or ""), result["error"],
)
_stub_finish_reason = "stop"
else:
logger.warning(
"Partial stream delivered before error; returning stub "
"response with %s chars of recovered content to prevent "
"duplicate messages: %s",
"Partial stream delivered before error; returning "
"length-truncated stub with %s chars of recovered "
"content so the loop can continue from where the "
"stream died: %s",
len(_partial_text or ""),
result["error"],
)
_stub_finish_reason = "length"
_stub_msg = SimpleNamespace(
role="assistant", content=_partial_text, tool_calls=None,
reasoning_content=None,
@@ -2131,7 +2148,7 @@ def interruptible_streaming_api_call(agent, api_kwargs: dict, *, on_first_delta=
id="partial-stream-stub",
model=getattr(agent, "model", "unknown"),
choices=[SimpleNamespace(
index=0, message=_stub_msg, finish_reason="stop",
index=0, message=_stub_msg, finish_reason=_stub_finish_reason,
)],
usage=None,
)
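The stub construction after this change can be sketched as a small function. This is a simplified illustration: `build_partial_stream_stub` is a hypothetical name, and the assumption is that a non-empty list of partial tool-call names is what selects the "stop" branch, which the hunk only shows in part.

```python
from types import SimpleNamespace
from typing import List, Optional


def build_partial_stream_stub(
    partial_text: Optional[str],
    partial_tool_names: List[str],
    model: str = "unknown",
) -> SimpleNamespace:
    """Build a response-shaped stub for a stream that died after delivering output."""
    # Mid-tool-call partials hand control back to the user ("stop");
    # text-only partials ask the loop to continue ("length").
    finish_reason = "stop" if partial_tool_names else "length"
    message = SimpleNamespace(
        role="assistant",
        content=partial_text,
        tool_calls=None,
        reasoning_content=None,
    )
    return SimpleNamespace(
        id="partial-stream-stub",
        model=model,
        choices=[SimpleNamespace(index=0, message=message, finish_reason=finish_reason)],
        usage=None,
    )


text_only = build_partial_stream_stub("The first half of an answer", [])
mid_tool = build_partial_stream_stub("Running the command", ["terminal"])
print(text_only.choices[0].finish_reason, mid_tool.choices[0].finish_reason)
```

The fixed `id` is what lets the conversation loop tell a network-interrupted stream apart from a real output-token truncation when both report `finish_reason="length"`.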
+4 -3
@@ -609,6 +609,7 @@ class ContextCompressor(ContextEngine):
"""Update tracked token usage from API response."""
self.last_prompt_tokens = usage.get("prompt_tokens", 0)
self.last_completion_tokens = usage.get("completion_tokens", 0)
self.last_total_tokens = usage.get("total_tokens", self.last_prompt_tokens + self.last_completion_tokens)
def should_compress(self, prompt_tokens: int = None) -> bool:
"""Check if context exceeds the compression threshold.
@@ -897,7 +898,7 @@ class ContextCompressor(ContextEngine):
into the warning log.
"""
self._summary_model_fallen_back = True
logging.warning(
logger.warning(
"Summary model '%s' %s (%s). "
"Falling back to main model '%s' for compression.",
self.summary_model, reason, e, self.model,
@@ -1086,7 +1087,7 @@ The user has requested that this compaction PRIORITISE preserving all informatio
# No provider configured — long cooldown, unlikely to self-resolve
self._summary_failure_cooldown_until = time.monotonic() + _SUMMARY_FAILURE_COOLDOWN_SECONDS
self._last_summary_error = "no auxiliary LLM provider configured"
logging.warning("Context compression: no provider available for "
logger.warning("Context compression: no provider available for "
"summary. Middle turns will be dropped without summary "
"for %d seconds.",
_SUMMARY_FAILURE_COOLDOWN_SECONDS)
@@ -1182,7 +1183,7 @@ The user has requested that this compaction PRIORITISE preserving all informatio
if len(err_text) > 220:
err_text = err_text[:217].rstrip() + "..."
self._last_summary_error = err_text
logging.warning(
logger.warning(
"Failed to generate context summary: %s. "
"Further summary attempts paused for %d seconds.",
e,
+1
@@ -200,6 +200,7 @@ class ContextEngine(ABC):
base_url: str = "",
api_key: str = "",
provider: str = "",
api_mode: str = "",
) -> None:
"""Called when the user switches models or on fallback activation.
+4 -4
@@ -381,12 +381,12 @@ def compress_context(
agent._session_db.end_session(agent.session_id, "compression")
old_session_id = agent.session_id
agent.session_id = f"{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:6]}"
os.environ["HERMES_SESSION_ID"] = agent.session_id
try:
from gateway.session_context import _SESSION_ID
_SESSION_ID.set(agent.session_id)
from gateway.session_context import set_current_session_id
set_current_session_id(agent.session_id)
except Exception:
pass
os.environ["HERMES_SESSION_ID"] = agent.session_id
agent._session_db_created = False
agent._session_db.create_session(
session_id=agent.session_id,
+91 -24
@@ -1183,7 +1183,7 @@ def run_conversation(
else str(_codex_error_obj) if _codex_error_obj
else f"Responses API returned status '{_codex_resp_status}'"
)
logging.warning(
logger.warning(
"Codex response status='%s' (error=%s). Routing to fallback. %s",
_codex_resp_status, _codex_error_msg,
agent._client_log_context(),
@@ -1335,7 +1335,7 @@ def run_conversation(
primary_recovery_attempted = False
continue
agent._emit_status(f"❌ Max retries ({max_retries}) exceeded for invalid responses. Giving up.")
logging.error(f"{agent.log_prefix}Invalid API response after {max_retries} retries.")
logger.error(f"{agent.log_prefix}Invalid API response after {max_retries} retries.")
agent._persist_session(messages, conversation_history)
return {
"messages": messages,
@@ -1348,7 +1348,7 @@ def run_conversation(
# Backoff before retry — jittered exponential: 5s base, 120s cap
wait_time = jittered_backoff(retry_count, base_delay=5.0, max_delay=120.0)
agent._vprint(f"{agent.log_prefix}⏳ Retrying in {wait_time:.1f}s ({_failure_hint})...", force=True)
logging.warning(f"Invalid API response (retry {retry_count}/{max_retries}): {', '.join(error_details)} | Provider: {provider_name}")
logger.warning(f"Invalid API response (retry {retry_count}/{max_retries}): {', '.join(error_details)} | Provider: {provider_name}")
# Sleep in small increments to stay responsive to interrupts
sleep_end = time.time() + wait_time
@@ -1414,7 +1414,18 @@ def run_conversation(
finish_reason = "length"
if finish_reason == "length":
agent._vprint(f"{agent.log_prefix}⚠️ Response truncated (finish_reason='length') - model hit max output tokens", force=True)
if getattr(response, "id", "") == "partial-stream-stub":
agent._vprint(
f"{agent.log_prefix}⚠️ Stream interrupted by network error "
f"(finish_reason='length' on partial-stream-stub)",
force=True,
)
else:
agent._vprint(
f"{agent.log_prefix}⚠️ Response truncated "
f"(finish_reason='length') - model hit max output tokens",
force=True,
)
# Normalize the truncated response to a single OpenAI-style
# message shape so text-continuation and tool-call retry
@@ -1507,17 +1518,40 @@ def run_conversation(
truncated_response_parts.append(assistant_message.content)
if length_continue_retries < 3:
agent._vprint(
f"{agent.log_prefix}↻ Requesting continuation "
f"({length_continue_retries}/3)..."
# Distinguish a real output-token truncation
# from a partial-stream-stub network error
# (#30963). Same continuation machinery,
# but the prompt has to tell the truth or
# the model goes off rails ("I wasn't
# truncated, I'm done").
_is_partial_stream_stub = (
getattr(response, "id", "") == "partial-stream-stub"
)
continue_msg = {
"role": "user",
"content": (
if _is_partial_stream_stub:
agent._vprint(
f"{agent.log_prefix}↻ Stream interrupted — "
f"requesting continuation "
f"({length_continue_retries}/3)..."
)
_continue_content = (
"[System: The previous response was cut off by a "
"network error mid-stream. Continue exactly where "
"you left off. Do not restart or repeat prior text. "
"Finish the answer directly.]"
)
else:
agent._vprint(
f"{agent.log_prefix}↻ Requesting continuation "
f"({length_continue_retries}/3)..."
)
_continue_content = (
"[System: Your previous response was truncated by the output "
"length limit. Continue exactly where you left off. Do not "
"restart or repeat prior text. Finish the answer directly.]"
),
)
continue_msg = {
"role": "user",
"content": _continue_content,
}
messages.append(continue_msg)
agent._session_messages = messages
@@ -2225,7 +2259,7 @@ def run_conversation(
f"stripped all thinking blocks, retrying...",
force=True,
)
logging.warning(
logger.warning(
"%sThinking block signature recovery: stripped "
"reasoning_details from %d messages",
agent.log_prefix, len(messages),
@@ -2250,7 +2284,7 @@ def run_conversation(
from tools.schema_sanitizer import strip_pattern_and_format
_, _stripped = strip_pattern_and_format(agent.tools)
except Exception as _strip_exc: # pragma: no cover — defensive
logging.warning(
logger.warning(
"%sllama.cpp grammar recovery: strip helper failed: %s",
agent.log_prefix, _strip_exc,
)
@@ -2261,7 +2295,7 @@ def run_conversation(
f"stripped {_stripped} pattern/format keyword(s), retrying...",
force=True,
)
logging.warning(
logger.warning(
"%sllama.cpp grammar recovery: stripped %d "
"pattern/format keyword(s) from tool schemas",
agent.log_prefix, _stripped,
@@ -2269,7 +2303,7 @@ def run_conversation(
continue
# No keywords found to strip — fall through to normal
# retry path rather than loop forever on the same error.
logging.warning(
logger.warning(
"%sllama.cpp grammar error but no pattern/format "
"keywords to strip — falling through to normal retry",
agent.log_prefix,
@@ -2370,6 +2404,7 @@ def run_conversation(
base_url=agent.base_url,
api_key=getattr(agent, "api_key", ""),
provider=agent.provider,
api_mode=agent.api_mode,
)
# Context probing flags — only set on built-in
# compressor (plugin engines manage their own).
@@ -2483,7 +2518,7 @@ def run_conversation(
error_context=error_context,
)
else:
logging.info(
logger.info(
"Nous 429 looks like upstream capacity "
"(no exhausted bucket in headers or "
"last-known state) -- not tripping "
@@ -2543,7 +2578,7 @@ def run_conversation(
if compression_attempts > max_compression_attempts:
agent._vprint(f"{agent.log_prefix}❌ Max compression attempts ({max_compression_attempts}) reached for payload-too-large error.", force=True)
agent._vprint(f"{agent.log_prefix} 💡 Try /new to start a fresh conversation, or /compress to retry compression.", force=True)
logging.error(f"{agent.log_prefix}413 compression failed after {max_compression_attempts} attempts.")
logger.error(f"{agent.log_prefix}413 compression failed after {max_compression_attempts} attempts.")
agent._persist_session(messages, conversation_history)
return {
"messages": messages,
@@ -2574,7 +2609,7 @@ def run_conversation(
else:
agent._vprint(f"{agent.log_prefix}❌ Payload too large and cannot compress further.", force=True)
agent._vprint(f"{agent.log_prefix} 💡 Try /new to start a fresh conversation, or /compress to retry compression.", force=True)
logging.error(f"{agent.log_prefix}413 payload too large. Cannot compress further.")
logger.error(f"{agent.log_prefix}413 payload too large. Cannot compress further.")
agent._persist_session(messages, conversation_history)
return {
"messages": messages,
@@ -2627,7 +2662,7 @@ def run_conversation(
if compression_attempts > max_compression_attempts:
agent._vprint(f"{agent.log_prefix}❌ Max compression attempts ({max_compression_attempts}) reached.", force=True)
agent._vprint(f"{agent.log_prefix} 💡 Try /new to start a fresh conversation, or /compress to retry compression.", force=True)
logging.error(f"{agent.log_prefix}Context compression failed after {max_compression_attempts} attempts.")
logger.error(f"{agent.log_prefix}Context compression failed after {max_compression_attempts} attempts.")
agent._persist_session(messages, conversation_history)
return {
"messages": messages,
@@ -2679,6 +2714,7 @@ def run_conversation(
base_url=agent.base_url,
api_key=getattr(agent, "api_key", ""),
provider=agent.provider,
api_mode=agent.api_mode,
)
# Context probing flags — only set on built-in
# compressor (plugin engines manage their own).
@@ -2700,7 +2736,7 @@ def run_conversation(
if compression_attempts > max_compression_attempts:
agent._vprint(f"{agent.log_prefix}❌ Max compression attempts ({max_compression_attempts}) reached.", force=True)
agent._vprint(f"{agent.log_prefix} 💡 Try /new to start a fresh conversation, or /compress to retry compression.", force=True)
logging.error(f"{agent.log_prefix}Context compression failed after {max_compression_attempts} attempts.")
logger.error(f"{agent.log_prefix}Context compression failed after {max_compression_attempts} attempts.")
agent._persist_session(messages, conversation_history)
return {
"messages": messages,
@@ -2733,7 +2769,7 @@ def run_conversation(
# Can't compress further and already at minimum tier
agent._vprint(f"{agent.log_prefix}❌ Context length exceeded and cannot compress further.", force=True)
agent._vprint(f"{agent.log_prefix} 💡 The conversation has accumulated too much content. Try /new to start fresh, or /compress to manually trigger compression.", force=True)
logging.error(f"{agent.log_prefix}Context length exceeded: {approx_tokens:,} tokens. Cannot compress further.")
logger.error(f"{agent.log_prefix}Context length exceeded: {approx_tokens:,} tokens. Cannot compress further.")
agent._persist_session(messages, conversation_history)
return {
"messages": messages,
@@ -2770,6 +2806,21 @@ def run_conversation(
# retryable=True mapping takes effect instead.
and not isinstance(api_error, ssl.SSLError)
)
# ``FailoverReason.billing`` (HTTP 402) is NOT in this
# exclusion set. By the time we reach this block:
# • credential-pool rotation (line ~2031) has already
# fired for billing and either ``continue``d or
# returned (False, ...) — pool is exhausted or absent.
# • the eager-fallback branch above (line ~2422) also
# fires on billing and ``continue``s if a fallback
# provider is configured.
# Falling through to here means BOTH recovery paths
# gave up. Treating 402 as retryable from this point
# just burns more paid requests against a depleted
# balance with no recovery mechanism left — see #31273
# (real-world: ~$40 in 48h on a 24/7 gateway). Aborting
# mirrors how 401/403 (also ``should_fallback=True``)
# already behave once their recovery paths have failed.
is_client_error = (
is_local_validation_error
or (
@@ -2777,7 +2828,6 @@ def run_conversation(
and not classified.should_compress
and classified.reason not in {
FailoverReason.rate_limit,
FailoverReason.billing,
FailoverReason.overloaded,
FailoverReason.context_overflow,
FailoverReason.payload_too_large,
@@ -2826,7 +2876,7 @@ def run_conversation(
agent._vprint(f"{agent.log_prefix} • Check credits: https://openrouter.ai/settings/credits", force=True)
else:
agent._vprint(f"{agent.log_prefix} 💡 This type of error won't be fixed by retrying.", force=True)
logging.error(f"{agent.log_prefix}Non-retryable client error: {api_error}")
logger.error(f"{agent.log_prefix}Non-retryable client error: {api_error}")
# Skip session persistence when the error is likely
# context-overflow related (status 400 + large session).
# Persisting the failed user message would make the
@@ -2903,7 +2953,7 @@ def run_conversation(
force=True,
)
logging.error(
logger.error(
"%sAPI call failed after %s retries. %s | provider=%s model=%s msgs=%s tokens=~%s",
agent.log_prefix, max_retries, _final_summary,
_provider, _model, len(api_messages), f"{approx_tokens:,}",
@@ -3434,6 +3484,19 @@ def run_conversation(
f"⚠️ Tool guardrail halted {decision.tool_name}: {decision.code}"
)
messages.append({"role": "assistant", "content": final_response})
# Emit the halt message to the client so it's not
# indistinguishable from a crash. The stream display
# was flushed (callback(None)) before tool execution,
# but the callback is still alive — fire the text
# through it so SSE/TUI clients see the explanation.
if final_response:
agent._safe_print(f"\n{final_response}\n")
if agent.stream_delta_callback:
try:
agent.stream_delta_callback(final_response)
agent.stream_delta_callback(None)
except Exception:
pass
break
# Reset per-turn retry counters after successful tool
@@ -4029,6 +4092,8 @@ def run_conversation(
except Exception as _ver_err:
logger.debug("file-mutation verifier footer failed: %s", _ver_err)
_response_transformed = False
# Plugin hook: transform_llm_output
# Fired once per turn after the tool-calling loop completes.
# Plugins can transform the LLM's output text before it's returned.
@@ -4046,6 +4111,7 @@ def run_conversation(
for _hook_result in _transform_results:
if isinstance(_hook_result, str) and _hook_result:
final_response = _hook_result
_response_transformed = True
break # First non-empty string wins
except Exception as exc:
logger.warning("transform_llm_output hook failed: %s", exc)
@@ -4097,6 +4163,7 @@ def run_conversation(
"failed": failed,
"partial": False, # True only when stopped due to invalid tool calls
"interrupted": interrupted,
"response_transformed": _response_transformed,
"response_previewed": getattr(agent, "_response_was_previewed", False),
"model": agent.model,
"provider": agent.provider,
+56 -6
@@ -787,33 +787,65 @@ class KawaiiSpinner:
# Cute tool message (completion line that replaces the spinner)
# =========================================================================
_ERROR_SUFFIX_MAX_LEN = 48
def _trim_error(msg: str) -> str:
"""Shrink an error message for inline display in a tool status line.
Strips overly long absolute paths down to just the filename so the
suffix stays readable on narrow terminals.
"""
msg = msg.strip()
# Common case: "File not found: /very/long/absolute/path/foo.py"
if "File not found:" in msg:
_, _, tail = msg.partition("File not found:")
tail = tail.strip()
if "/" in tail:
msg = f"File not found: {tail.rsplit('/', 1)[-1]}"
if len(msg) > _ERROR_SUFFIX_MAX_LEN:
msg = msg[: _ERROR_SUFFIX_MAX_LEN - 3] + "..."
return msg
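A quick way to see what `_trim_error` produces is to run a standalone copy against the two cases it handles. The function body below mirrors the diff; the sample messages are made up for illustration.

```python
ERROR_SUFFIX_MAX_LEN = 48  # same limit as _ERROR_SUFFIX_MAX_LEN in the diff


def trim_error(msg: str) -> str:
    """Shorten an error message for inline display in a tool status line."""
    msg = msg.strip()
    # Reduce an absolute path after "File not found:" to its filename.
    if "File not found:" in msg:
        _, _, tail = msg.partition("File not found:")
        tail = tail.strip()
        if "/" in tail:
            msg = f"File not found: {tail.rsplit('/', 1)[-1]}"
    # Hard cap: keep the first 45 characters and mark the cut with "...".
    if len(msg) > ERROR_SUFFIX_MAX_LEN:
        msg = msg[: ERROR_SUFFIX_MAX_LEN - 3] + "..."
    return msg


print(trim_error("File not found: /very/long/absolute/path/foo.py"))
print(trim_error("x" * 60))
```

Note that the path shortening splits on `/` only, so a Windows-style path with backslashes would fall through to the length cap instead.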
def _detect_tool_failure(tool_name: str, result: str | None) -> tuple[bool, str]:
"""Inspect a tool result string for signs of failure.
Returns ``(is_failure, suffix)`` where *suffix* is an informational tag
like ``" [exit 1]"`` for terminal failures, or ``" [error]"`` for generic
failures. On success, returns ``(False, "")``.
Returns ``(is_failure, suffix)`` where *suffix* is a short informational
tag like ``" [exit 1]"`` for terminal failures, ``" [full]"`` for memory
overflow, or a trimmed error message (``" [File not found: foo.py]"``).
On success returns ``(False, "")``.
"""
if result is None:
return False, ""
if file_mutation_result_landed(tool_name, result):
return False, ""
data = safe_json_loads(result)
# Terminal: non-zero exit code is the canonical failure signal.
if tool_name == "terminal":
data = safe_json_loads(result)
if isinstance(data, dict):
exit_code = data.get("exit_code")
if exit_code is not None and exit_code != 0:
err_msg = data.get("error")
if err_msg:
return True, f" [{_trim_error(str(err_msg))}]"
return True, f" [exit {exit_code}]"
return False, ""
# Memory-specific: distinguish "full" from real errors
# Memory: distinguish "store full" from real errors.
if tool_name == "memory":
data = safe_json_loads(result)
if isinstance(data, dict):
if data.get("success") is False and "exceed the limit" in data.get("error", ""):
return True, " [full]"
# Structured error in JSON result (any tool that surfaces {"error": ...}).
if isinstance(data, dict):
err = data.get("error") or data.get("message")
if err and (data.get("success") is False or "error" in data):
return True, f" [{_trim_error(str(err))}]"
# Generic heuristic for non-terminal tools
# Multimodal tool results (dicts with _multimodal=True) are not strings —
# treat them as successes since failures would be JSON-encoded strings.
@@ -921,11 +953,29 @@ def get_cute_tool_message(
if tool_name == "todo":
todos_arg = args.get("todos")
merge = args.get("merge", False)
# Parse result for completion progress
total = 0
done = 0
if result:
try:
data = safe_json_loads(result)
if data:
s = data.get("summary", {})
total = s.get("total", 0)
done = s.get("completed", 0)
except Exception:
pass
if todos_arg is None:
if total > 0:
return _wrap(f"┊ 📋 plan {done}/{total} task(s) {dur}")
return _wrap(f"┊ 📋 plan reading tasks {dur}")
elif merge:
if total > 0 and done > 0:
return _wrap(f"┊ 📋 plan update {done}/{total}{dur}")
return _wrap(f"┊ 📋 plan update {len(todos_arg)} task(s) {dur}")
else:
if total > 0 and done > 0:
return _wrap(f"┊ 📋 plan {done}/{total} task(s) {dur}")
return _wrap(f"┊ 📋 plan {len(todos_arg)} task(s) {dur}")
if tool_name == "session_search":
return _wrap(f"┊ 🔍 recall \"{_trunc(args.get('query', ''), 35)}\" {dur}")
+35
@@ -240,6 +240,24 @@ _MODEL_NOT_FOUND_PATTERNS = [
"unsupported model",
]
# Request-validation patterns — the request is malformed and will fail
# identically on every retry. Some OpenAI-compatible gateways (notably
# codex.nekos.me) return these as 5xx instead of the standard 4xx, which
# makes the generic "5xx → retryable server_error" rule misfire: the retry
# loop hammers the same deterministic rejection 3+ times, then the
# transport-recovery path resets the counter and does it again, producing
# a request flood. When a 5xx body carries one of these unambiguous
# request-validation signals, classify as a non-retryable format_error so
# the loop fails fast and falls back instead of looping.
_REQUEST_VALIDATION_PATTERNS = [
"unknown parameter",
"unsupported parameter",
"unrecognized request argument",
"invalid_request_error",
"unknown_parameter",
"unsupported_parameter",
]
# OpenRouter aggregator policy-block patterns.
#
# When a user's OpenRouter account privacy setting (or a per-request
@@ -745,6 +763,23 @@ def _classify_by_status(
)
if status_code in {500, 502}:
# Some OpenAI-compatible gateways return request-validation errors
# with a 5xx status (codex.nekos.me returns 502 for unknown/
# unsupported parameters). These are deterministic — every retry
# gets the identical rejection — so the generic "5xx → retryable
# server_error" rule turns one bad request into a retry flood.
# Detect the unambiguous request-validation signals (in either the
# message text or the structured error code) and fail fast.
if (
any(p in error_msg for p in _REQUEST_VALIDATION_PATTERNS)
or error_code.lower() in {"invalid_request_error", "unknown_parameter",
"unsupported_parameter"}
):
return result_fn(
FailoverReason.format_error,
retryable=False,
should_fallback=True,
)
return result_fn(FailoverReason.server_error, retryable=True)
if status_code in {503, 529}:
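The 500/502 branch above can be sketched in isolation. This is a minimal illustration rather than the project's classifier: `Verdict` and `classify_5xx` are hypothetical names, the pattern list is copied from `_REQUEST_VALIDATION_PATTERNS`, and the sketch lowercases the message itself because the hunk does not show how the real `error_msg` is normalized.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the real result type; illustrative only.
@dataclass(frozen=True)
class Verdict:
    reason: str
    retryable: bool
    should_fallback: bool = False

# Copied from _REQUEST_VALIDATION_PATTERNS in the hunk above.
VALIDATION_PATTERNS = (
    "unknown parameter",
    "unsupported parameter",
    "unrecognized request argument",
    "invalid_request_error",
    "unknown_parameter",
    "unsupported_parameter",
)
VALIDATION_CODES = {"invalid_request_error", "unknown_parameter", "unsupported_parameter"}


def classify_5xx(status_code: int, error_msg: str, error_code: str = "") -> Verdict:
    """Classify a 500/502 response, failing fast on deterministic rejections."""
    msg = error_msg.lower()            # assumption: normalize here
    code = (error_code or "").lower()  # tolerate a missing error code
    if status_code in {500, 502}:
        if any(p in msg for p in VALIDATION_PATTERNS) or code in VALIDATION_CODES:
            # Every retry would get the identical rejection, so do not retry.
            return Verdict("format_error", retryable=False, should_fallback=True)
        return Verdict("server_error", retryable=True)
    # Other status codes are outside the scope of this sketch.
    return Verdict("unknown", retryable=False)
```

The design point is ordering: the validation check has to run before the generic "5xx is retryable" rule, otherwise a gateway that reports bad requests as 502 never reaches the fail-fast path.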
+151
@@ -127,6 +127,12 @@ def is_write_denied(path: str) -> bool:
return True
except Exception:
pass
try:
pairing_real = os.path.realpath(os.path.join(base_real, "pairing"))
if resolved == pairing_real or resolved.startswith(pairing_real + os.sep):
return True
except Exception:
pass
safe_root = get_safe_write_root()
if safe_root and not (resolved == safe_root or resolved.startswith(safe_root + os.sep)):
@@ -254,3 +260,148 @@ def get_read_block_error(path: str) -> Optional[str]:
)
return None
# ---------------------------------------------------------------------------
# Cross-profile write guard (#TBD)
#
# Hermes profiles are separate HERMES_HOME dirs under
# ``<root>/profiles/<name>/``. Each profile has its own skills/, plugins/,
# cron/, memories/. When an agent runs under one profile, writing into
# ANOTHER profile's directories is almost always wrong — those skills /
# plugins / cron jobs / memories affect a different session the user runs
# from a different shell.
#
# Soft guard, NOT a security boundary: the agent runs as the same OS user
# and has unrestricted terminal access, so this returns a warning the model
# can choose to honor or override with ``cross_profile=True``. Same shape
# as the dangerous-command approval flow — the agent is told the boundary
# exists, and explicit user direction is required to cross it.
#
# Reference: May 2026 incident where a hermes-security profile session
# edited skills under both ``~/.hermes/profiles/hermes-security/skills/``
# AND ``~/.hermes/skills/`` (the default profile's skills) without realizing
# the second path belonged to a different profile.
# ---------------------------------------------------------------------------
# Profile-scoped directories under HERMES_HOME / <root> / <root>/profiles/<X>/
# that should be guarded. Adding a new area here extends the guard with no
# other code change.
PROFILE_SCOPED_AREAS = ("skills", "plugins", "cron", "memories")
def _resolve_active_profile_name() -> str:
"""Return the active profile name derived from HERMES_HOME.
``~/.hermes`` -> ``"default"``
``~/.hermes/profiles/X`` -> ``"X"``
Falls back to ``"default"`` on any resolution failure so the guard
never raises into the tool path.
"""
try:
home_real = _hermes_home_path().resolve()
root_real = _hermes_root_path().resolve()
except (OSError, RuntimeError):
return "default"
profiles_dir = root_real / "profiles"
try:
rel = home_real.relative_to(profiles_dir)
parts = rel.parts
if len(parts) >= 1:
return parts[0]
except ValueError:
pass
return "default"
def classify_cross_profile_target(path: str) -> Optional[dict]:
"""Classify a write target as cross-profile if it lands in another
profile's scoped area (skills/plugins/cron/memories).
Returns ``None`` when the target is outside Hermes scope, or is inside
the ACTIVE profile, or doesn't hit a profile-scoped area. Otherwise
returns a dict with:
* ``active_profile``: name of the profile the agent is running as
* ``target_profile``: name of the profile the path belongs to
* ``area``: which scoped area (``"skills"``, ``"plugins"``, etc.)
* ``target_path``: the resolved path string
The caller decides what to do with the result: surface a warning to
the model, prompt the user, or (with explicit consent /
``cross_profile=True``) proceed anyway.
"""
try:
target = Path(os.path.expanduser(str(path))).resolve()
root_real = _hermes_root_path().resolve()
except (OSError, RuntimeError):
return None
target_profile: Optional[str] = None
area: Optional[str] = None
try:
rel = target.relative_to(root_real)
except ValueError:
return None
parts = rel.parts
if not parts:
return None
if parts[0] in PROFILE_SCOPED_AREAS:
# ``<root>/<area>/...`` → default profile.
target_profile = "default"
area = parts[0]
elif (
parts[0] == "profiles"
and len(parts) >= 3
and parts[2] in PROFILE_SCOPED_AREAS
):
# ``<root>/profiles/<name>/<area>/...`` → named profile.
target_profile = parts[1]
area = parts[2]
else:
return None
active_profile = _resolve_active_profile_name()
if target_profile == active_profile:
# In-profile write — not a cross-profile event.
return None
return {
"active_profile": active_profile,
"target_profile": target_profile,
"area": area,
"target_path": str(target),
}
def get_cross_profile_warning(path: str) -> Optional[str]:
"""Return a model-facing warning string when ``path`` is cross-profile.
Returns ``None`` when the write is in-scope (same profile) or outside
Hermes entirely. Caller is expected to surface the warning to the
agent as a tool-result error, NOT to silently allow the write; the
agent must either get explicit user direction to proceed, or pass
``cross_profile=True`` to its write tool.
This is defense-in-depth: the terminal tool runs as the same OS user
and can write any of these paths without going through this guard.
Treat the guard as a confusion-reducer, not a security boundary.
"""
info = classify_cross_profile_target(path)
if info is None:
return None
return (
f"Cross-profile write blocked by soft guard: {info['target_path']} "
f"belongs to Hermes profile {info['target_profile']!r}, but the "
f"agent is running under profile {info['active_profile']!r}. "
f"Editing another profile's {info['area']}/ will affect that "
f"profile's future sessions, not the one you are currently in. "
f"Confirm with the user before proceeding. To bypass this guard "
f"after explicit user direction, retry the call with "
f"``cross_profile=True``. (Defense-in-depth — not a security "
f"boundary; the terminal tool can still bypass.)"
)
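The path classification in `classify_cross_profile_target` can be illustrated with pure path arithmetic. This is a simplified sketch under stated assumptions: `profile_of` and `is_cross_profile` are hypothetical helpers, the Hermes root and the active profile are passed in explicitly instead of being derived from `HERMES_HOME`, and no symlink resolution is performed (the real code calls `.resolve()` first).

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Same areas as PROFILE_SCOPED_AREAS above.
SCOPED_AREAS = ("skills", "plugins", "cron", "memories")


def profile_of(path: str, root: str) -> Optional[Tuple[str, str]]:
    """Return (profile_name, area) if path is in a profile-scoped area, else None."""
    try:
        parts = PurePosixPath(path).relative_to(PurePosixPath(root)).parts
    except ValueError:
        return None  # path is outside the Hermes root
    if parts and parts[0] in SCOPED_AREAS:
        # <root>/<area>/... belongs to the default profile.
        return "default", parts[0]
    if len(parts) >= 3 and parts[0] == "profiles" and parts[2] in SCOPED_AREAS:
        # <root>/profiles/<name>/<area>/... belongs to a named profile.
        return parts[1], parts[2]
    return None


def is_cross_profile(path: str, root: str, active_profile: str) -> bool:
    """True when the write target belongs to a profile other than the active one."""
    hit = profile_of(path, root)
    return hit is not None and hit[0] != active_profile
```

For example, with root `/home/u/.hermes` and active profile `sec`, a write to `/home/u/.hermes/skills/a.md` is flagged because that path belongs to the default profile, while a write under `/home/u/.hermes/profiles/sec/skills/` is not.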
+1 -1
@@ -641,7 +641,7 @@ def fetch_model_metadata(force_refresh: bool = False) -> Dict[str, Dict[str, Any
return cache
except Exception as e:
logging.warning(f"Failed to fetch model metadata from OpenRouter: {e}")
logger.warning(f"Failed to fetch model metadata from OpenRouter: {e}")
return _model_metadata_cache or {}
+42
@@ -176,6 +176,15 @@ _URL_USERINFO_RE = re.compile(
r"(https?|wss?|ftp)://([^/\s:@]+):([^/\s@]+)@",
)
# HTTP access logs often use a relative request target rather than a full URL:
# `"POST /webhook?password=... HTTP/1.1"`. The full-URL redactor above only
# sees strings containing `://`, so handle request-target query strings too.
_HTTP_REQUEST_TARGET_QUERY_RE = re.compile(
r"\b((?:GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS|TRACE|CONNECT)\s+[^ \t\r\n\"']*?)"
r"\?([^ \t\r\n\"']+)",
re.IGNORECASE,
)
# Form-urlencoded body detection: conservative — only applies when the entire
# text looks like a query string (k=v&k=v pattern with no newlines).
_FORM_BODY_RE = re.compile(
@@ -293,6 +302,15 @@ def _redact_url_userinfo(text: str) -> str:
)
def _redact_http_request_target_query_params(text: str) -> str:
"""Redact sensitive query params in HTTP access-log request targets."""
def _sub(m: re.Match) -> str:
prefix = m.group(1)
query = _redact_query_string(m.group(2))
return f"{prefix}?{query}"
return _HTTP_REQUEST_TARGET_QUERY_RE.sub(_sub, text)
def _redact_form_body(text: str) -> str:
"""Redact sensitive values in a form-urlencoded body.
@@ -397,6 +415,11 @@ def redact_sensitive_text(text: str, *, force: bool = False, code_file: bool = F
if "?" in text:
text = _redact_url_query_params(text)
# HTTP access logs can contain relative request targets with query params
# and no URL scheme, e.g. `"POST /hook?password=... HTTP/1.1"`.
if "?" in text and "=" in text and _has_http_method_substring(text):
text = _redact_http_request_target_query_params(text)
# Form-urlencoded bodies (only triggers on clean k=v&k=v inputs).
if "&" in text and "=" in text:
text = _redact_form_body(text)
@@ -456,6 +479,25 @@ def _has_known_prefix_substring(text: str) -> bool:
return any(p in text for p in _PREFIX_SUBSTRINGS)
_HTTP_METHOD_SUBSTRINGS = (
"GET ",
"POST ",
"PUT ",
"PATCH ",
"DELETE ",
"HEAD ",
"OPTIONS ",
"TRACE ",
"CONNECT ",
)
def _has_http_method_substring(text: str) -> bool:
"""Cheap pre-check before scanning for access-log request targets."""
upper = text.upper()
return any(method in upper for method in _HTTP_METHOD_SUBSTRINGS)
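To see what the request-target redaction does to an access-log line, here is a self-contained sketch. The regular expression is the one added in this diff; `SENSITIVE_KEYS` and `redact_query_string` are hypothetical stand-ins, since the real `_redact_query_string` and its key list are not shown in these hunks.

```python
import re

# Hypothetical key list; the project's actual list is not shown in this diff.
SENSITIVE_KEYS = {"password", "token", "api_key", "secret"}

# Same pattern as _HTTP_REQUEST_TARGET_QUERY_RE above.
REQUEST_TARGET_QUERY_RE = re.compile(
    r"\b((?:GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS|TRACE|CONNECT)\s+[^ \t\r\n\"']*?)"
    r"\?([^ \t\r\n\"']+)",
    re.IGNORECASE,
)


def redact_query_string(query: str) -> str:
    """Mask the values of sensitive keys in a k=v&k=v query string."""
    out = []
    for pair in query.split("&"):
        key, sep, _value = pair.partition("=")
        if sep and key.lower() in SENSITIVE_KEYS:
            out.append(f"{key}=***")
        else:
            out.append(pair)
    return "&".join(out)


def redact_request_target(line: str) -> str:
    """Redact query params in 'METHOD /path?query' fragments of a log line."""
    return REQUEST_TARGET_QUERY_RE.sub(
        lambda m: f"{m.group(1)}?{redact_query_string(m.group(2))}", line
    )
```

The first group captures the method and path lazily up to the first `?`; the second captures the query up to the next whitespace or quote, so the trailing `HTTP/1.1` token is left untouched.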
class RedactingFormatter(logging.Formatter):
"""Log formatter that redacts secrets from all log messages."""
+24 -4
@@ -70,7 +70,7 @@ _BWS_RUN_TIMEOUT = 30
# In-process cache so repeated load_hermes_dotenv() calls (CLI startup,
# gateway hot-reload, test suites) don't re-fetch from BSM.
_CacheKey = Tuple[str, str] # (access_token_fingerprint, project_id)
_CacheKey = Tuple[str, str, str] # (access_token_fingerprint, project_id, server_url)
_CACHE: Dict[_CacheKey, "_CachedFetch"] = {}
@@ -317,11 +317,18 @@ def fetch_bitwarden_secrets(
binary: Optional[Path] = None,
cache_ttl_seconds: float = 300,
use_cache: bool = True,
server_url: str = "",
) -> Tuple[Dict[str, str], List[str]]:
"""Pull the secrets for ``project_id`` from Bitwarden Secrets Manager.
Returns ``(secrets_dict, warnings_list)``.
Set ``server_url`` to point at a non-default Bitwarden region or a
self-hosted instance, e.g. ``https://vault.bitwarden.eu`` for EU
Cloud accounts. When empty, ``bws`` uses its built-in default
(``https://vault.bitwarden.com``, US Cloud). This is plumbed into
the subprocess as ``BWS_SERVER_URL``.
Raises :class:`RuntimeError` for fatal conditions (missing binary,
auth failure, unparseable output). Callers in the env_loader path
catch this and emit a single warning; callers in the user-facing
@@ -332,7 +339,7 @@ def fetch_bitwarden_secrets(
if not project_id:
raise RuntimeError("Bitwarden project_id is empty")
cache_key = (_token_fingerprint(access_token), project_id)
cache_key = (_token_fingerprint(access_token), project_id, server_url or "")
if use_cache:
cached = _CACHE.get(cache_key)
if cached and cached.is_fresh(cache_ttl_seconds):
@@ -347,19 +354,26 @@ def fetch_bitwarden_secrets(
"`hermes secrets bitwarden setup`."
)
secrets, warnings = _run_bws_list(bws, access_token, project_id)
secrets, warnings = _run_bws_list(bws, access_token, project_id, server_url)
_CACHE[cache_key] = _CachedFetch(secrets=secrets, fetched_at=time.time())
return secrets, warnings
def _run_bws_list(
bws: Path, access_token: str, project_id: str
bws: Path, access_token: str, project_id: str, server_url: str = ""
) -> Tuple[Dict[str, str], List[str]]:
cmd = [str(bws), "secret", "list", project_id, "--output", "json"]
env = os.environ.copy()
env["BWS_ACCESS_TOKEN"] = access_token
# Make sure we're not echoing telemetry / colour codes into json.
env.setdefault("NO_COLOR", "1")
# Region / self-hosted support. bws defaults to https://vault.bitwarden.com
# (US Cloud); EU Cloud users need https://vault.bitwarden.eu, and
# self-hosted users need their own URL. When unset, fall back to whatever
# BWS_SERVER_URL the caller already had in their shell env (preserved by
# the copy above) so manual overrides keep working too.
if server_url:
env["BWS_SERVER_URL"] = server_url
try:
proc = subprocess.run( # noqa: S603 — bws path is trusted
@@ -437,6 +451,7 @@ def apply_bitwarden_secrets(
override_existing: bool = False,
cache_ttl_seconds: float = 300,
auto_install: bool = True,
server_url: str = "",
) -> FetchResult:
"""Pull secrets from BSM and set them on ``os.environ``.
@@ -444,6 +459,10 @@ def apply_bitwarden_secrets(
files have loaded. It is intentionally defensive: any failure
returns a :class:`FetchResult` with ``error`` set; it never raises.
``server_url`` selects the Bitwarden region or self-hosted endpoint
(e.g. ``https://vault.bitwarden.eu`` for EU Cloud). Empty string
means use ``bws``'s default (US Cloud).
Parameters mirror the ``secrets.bitwarden.*`` config keys so the
caller can just splat the dict in.
"""
@@ -482,6 +501,7 @@ def apply_bitwarden_secrets(
project_id=project_id,
binary=binary,
cache_ttl_seconds=cache_ttl_seconds,
server_url=server_url,
)
except RuntimeError as exc:
result.error = str(exc)
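The two `server_url` changes (a wider cache key and the `BWS_SERVER_URL` override) can be sketched as pure functions. This is an illustration only: `token_fingerprint`, `make_cache_key`, and `build_bws_env` are hypothetical names, and the SHA-256 fingerprint is an assumption because the real `_token_fingerprint` is not shown in this diff.

```python
import hashlib
import os
from typing import Dict, Mapping, Optional, Tuple


def token_fingerprint(token: str) -> str:
    """Assumed fingerprint: a short SHA-256 prefix, so the raw token is not a key."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:16]


def make_cache_key(access_token: str, project_id: str, server_url: str = "") -> Tuple[str, str, str]:
    """Include the server URL so US, EU, and self-hosted fetches never share an entry."""
    return (token_fingerprint(access_token), project_id, server_url or "")


def build_bws_env(
    access_token: str,
    server_url: str = "",
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build the subprocess environment for a bws call."""
    env = dict(os.environ if base_env is None else base_env)
    env["BWS_ACCESS_TOKEN"] = access_token
    env.setdefault("NO_COLOR", "1")
    if server_url:
        # An explicit setting wins; otherwise any inherited BWS_SERVER_URL is kept.
        env["BWS_SERVER_URL"] = server_url
    return env
```

Keying the cache on the server URL matters because the same token and project ID against a different endpoint is a different fetch; without it, switching regions within the TTL would return the previous region's cached result.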
+34
@@ -205,6 +205,40 @@ def build_system_prompt_parts(agent: Any, system_message: Optional[str] = None)
if _env_hints:
stable_parts.append(_env_hints)
# Active-profile hint — names the Hermes profile the agent is running
# under so it doesn't conflate ~/.hermes/skills/ (default profile) with
# ~/.hermes/profiles/<active>/skills/ (this profile's). Deterministic
# for the lifetime of the agent — profile name doesn't change
# mid-session, so this doesn't break the prompt cache.
# See file_safety._resolve_active_profile_name + classify_cross_profile_target
# for the matching tool-side guard.
try:
from agent.file_safety import _resolve_active_profile_name
active_profile = _resolve_active_profile_name()
except Exception:
active_profile = "default"
if active_profile == "default":
stable_parts.append(
"Active Hermes profile: default. Other profiles (if any) live "
"under ~/.hermes/profiles/<name>/. Each profile has its own "
"skills/, plugins/, cron/, and memories/ that affect a different "
"session than this one. Do not modify another profile's "
"skills/plugins/cron/memories unless the user explicitly directs "
"you to."
)
else:
stable_parts.append(
f"Active Hermes profile: {active_profile}. This session reads "
f"and writes ~/.hermes/profiles/{active_profile}/. The default "
f"profile's data lives at ~/.hermes/skills/, ~/.hermes/plugins/, "
f"~/.hermes/cron/, ~/.hermes/memories/ — those belong to a "
f"different session run from a different shell. Do NOT modify "
f"another profile's skills/plugins/cron/memories unless the user "
f"explicitly directs you to. The cross-profile write guard will "
f"refuse such writes by default; pass cross_profile=True only "
f"after explicit direction."
)
platform_key = (agent.platform or "").lower().strip()
if platform_key in PLATFORM_HINTS:
stable_parts.append(PLATFORM_HINTS[platform_key])
+3 -1
@@ -388,6 +388,7 @@ def execute_tool_calls_concurrent(agent, assistant_message, messages: list, effe
agent.tool_progress_callback(
"tool.completed", function_name, None, None,
duration=tool_duration, is_error=is_error,
result=function_result,
)
except Exception as cb_err:
logging.debug(f"Tool progress callback error: {cb_err}")
@@ -491,7 +492,7 @@ def execute_tool_calls_sequential(agent, assistant_message, messages: list, effe
try:
function_args = json.loads(tool_call.function.arguments)
except json.JSONDecodeError as e:
logging.warning(f"Unexpected JSON error after validation: {e}")
logger.warning(f"Unexpected JSON error after validation: {e}")
function_args = {}
if not isinstance(function_args, dict):
function_args = {}
@@ -822,6 +823,7 @@ def execute_tool_calls_sequential(agent, assistant_message, messages: list, effe
agent.tool_progress_callback(
"tool.completed", function_name, None, None,
duration=tool_duration, is_error=_is_error_result,
result=function_result,
)
except Exception as cb_err:
logging.debug(f"Tool progress callback error: {cb_err}")
+11 -1
@@ -106,7 +106,17 @@ class AnthropicTransport(ProviderTransport):
elif block.type == "tool_use":
name = block.name
if strip_tool_prefix and name.startswith(_MCP_PREFIX):
name = name[len(_MCP_PREFIX):]
stripped = name[len(_MCP_PREFIX):]
# Only strip the mcp_ prefix for OAuth-injected tools
# (where Hermes adds the prefix when sending to Anthropic
# and must remove it on the way back). Native MCP server
# tools (from mcp_servers: in config.yaml) are registered
# in the tool registry under their FULL mcp_<server>_<tool>
# name and must NOT be stripped. GH-25255.
from tools.registry import registry as _tool_registry
if (_tool_registry.get_entry(stripped)
and not _tool_registry.get_entry(name)):
name = stripped
tool_calls.append(
ToolCall(
id=block.id,
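The prefix-stripping rule in this hunk reduces to a small decision function. The sketch below is an assumption-laden illustration: `resolve_tool_name` is a hypothetical helper, the registry is modelled as a plain set of names instead of `tools.registry`, and `MCP_PREFIX = "mcp_"` is inferred from the comment since the value of `_MCP_PREFIX` is not shown.

```python
from typing import Set

MCP_PREFIX = "mcp_"  # assumed value of _MCP_PREFIX


def resolve_tool_name(name: str, registered: Set[str]) -> str:
    """Strip the prefix only when that is the sole way to find a registered tool."""
    if not name.startswith(MCP_PREFIX):
        return name
    stripped = name[len(MCP_PREFIX):]
    # Strip for tools registered under the bare name (prefix added on send);
    # keep the full name for native MCP tools registered as mcp_<server>_<tool>.
    if stripped in registered and name not in registered:
        return stripped
    return name
```

When both the stripped and the full name are registered, or neither is, the full name is kept, which matches the condition in the diff.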
+20 -3
@@ -113,9 +113,8 @@ class ChatCompletionsTransport(ProviderTransport):
self, messages: list[dict[str, Any]], **kwargs
) -> list[dict[str, Any]]:
"""Messages are already in OpenAI format — strip internal fields
that strict chat-completions providers reject with HTTP 400/422.
Strips:
that strict chat-completions providers reject with HTTP 400/422
(or, in the case of some OpenAI-compatible gateways, 5xx):
- Codex Responses API fields: ``codex_reasoning_items`` /
``codex_message_items`` on the message, ``call_id`` /
@@ -127,6 +126,16 @@ class ChatCompletionsTransport(ProviderTransport):
``Extra inputs are not permitted, field: 'messages[N].tool_name'``.
Permissive providers (OpenRouter, MiniMax) silently ignore the
field, which masked the bug for months.
- Hermes-internal scaffolding markers: any top-level message key
starting with ``_`` (e.g. ``_empty_recovery_synthetic``,
``_empty_terminal_sentinel``, ``_thinking_prefill``). These are
bookkeeping flags the agent loop attaches to messages so the
persistence layer can later strip its own scaffolding; they must
never reach the wire. Permissive providers (real OpenAI,
Anthropic) silently drop unknown message keys, but strict
gateways (e.g. opencode-go, codex.nekos.me) reject with
``Extra inputs are not permitted, field: 'messages[N]._empty_recovery_synthetic'``,
which then poisons every subsequent request in the session.
"""
needs_sanitize = False
for msg in messages:
@@ -139,6 +148,9 @@ class ChatCompletionsTransport(ProviderTransport):
):
needs_sanitize = True
break
if any(isinstance(k, str) and k.startswith("_") for k in msg):
needs_sanitize = True
break
tool_calls = msg.get("tool_calls")
if isinstance(tool_calls, list):
for tc in tool_calls:
@@ -160,6 +172,11 @@ class ChatCompletionsTransport(ProviderTransport):
msg.pop("codex_reasoning_items", None)
msg.pop("codex_message_items", None)
msg.pop("tool_name", None)
# Drop all Hermes-internal scaffolding markers (``_``-prefixed).
# OpenAI's message schema has no ``_``-prefixed fields, so this
# is safe and future-proofs against new markers being added.
for key in [k for k in msg if isinstance(k, str) and k.startswith("_")]:
msg.pop(key, None)
tool_calls = msg.get("tool_calls")
if isinstance(tool_calls, list):
for tc in tool_calls:
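The `_`-prefixed key stripping can be shown on its own. This is a minimal sketch, not the transport's method: `strip_internal_keys` is a hypothetical name, it handles only top-level message keys, and it builds new dicts instead of popping keys from a copy as the real code appears to do.

```python
from typing import Any, Dict, List


def _is_internal(key: Any) -> bool:
    """Hermes-internal markers are string keys that start with an underscore."""
    return isinstance(key, str) and key.startswith("_")


def strip_internal_keys(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return messages without '_'-prefixed keys, leaving the input untouched."""
    # Fast path: nothing to sanitize, so hand back the original list.
    if not any(_is_internal(k) for msg in messages for k in msg):
        return messages
    return [{k: v for k, v in msg.items() if not _is_internal(k)} for msg in messages]
```

Returning the original list on the fast path mirrors the `needs_sanitize` check in the diff, which avoids copying every message on the common case where no marker is present.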
+37 -2
@@ -87,6 +87,39 @@ class TurnResult:
_TURN_ABORTED_MARKERS = ("<turn_aborted>", "<turn_aborted/>")
def _coerce_turn_input_text(user_input: Any) -> str:
"""Collapse Hermes/OpenAI rich content into app-server text input.
The current `turn/start` path sends text items only. TUI image attachment
can hand us OpenAI-style content parts, so keep the text/path hints and
replace opaque image payloads with a small marker instead of putting a
Python list into the `text` field.
"""
if isinstance(user_input, str):
return user_input
if isinstance(user_input, list):
parts: list[str] = []
for item in user_input:
if isinstance(item, str):
if item.strip():
parts.append(item)
continue
if not isinstance(item, dict):
if item is not None:
parts.append(str(item))
continue
item_type = item.get("type")
if item_type in {"text", "input_text"}:
text = item.get("text") or item.get("content") or ""
if text:
parts.append(str(text))
elif item_type in {"image", "image_url", "input_image"}:
parts.append("[image attached]")
text = "\n\n".join(p for p in parts if p).strip()
return text or "What do you see in this image?"
return "" if user_input is None else str(user_input)
# Substrings in codex stderr / JSON-RPC error messages that signal the
# subprocess died because its OAuth credentials are no longer valid.
# Kept conservative: we only redirect users to `codex login` when we're
@@ -327,7 +360,7 @@ class CodexAppServerSession:
def run_turn(
self,
user_input: str,
user_input: Any,
*,
turn_timeout: float = 600.0,
notification_poll_timeout: float = 0.25,
@@ -365,6 +398,8 @@ class CodexAppServerSession:
self._interrupt_event.clear()
projector = CodexEventProjector()
user_input_text = _coerce_turn_input_text(user_input)
# Send turn/start with the user input. Text-only for now (codex
# supports rich content but Hermes' text path is the common case).
try:
@@ -372,7 +407,7 @@ class CodexAppServerSession:
"turn/start",
{
"threadId": self._thread_id,
"input": [{"type": "text", "text": user_input}],
"input": [{"type": "text", "text": user_input_text}],
},
timeout=10,
)
+1 -1
@@ -39,7 +39,7 @@ model:
# LM Studio is first-class and uses provider: "lmstudio".
# It works with both no-auth and auth-enabled server modes.
#
# Can also be overridden with --provider flag or HERMES_INFERENCE_PROVIDER env var.
# Can also be overridden for a single invocation with the --provider flag.
provider: "auto"
# API configuration (falls back to OPENROUTER_API_KEY env var)
+140 -47
@@ -415,6 +415,12 @@ def load_cli_config() -> Dict[str, Any]:
"display": {
"compact": False,
"resume_display": "full",
# Recap tuning for /resume — see hermes_cli/config.py DEFAULT_CONFIG.
"resume_exchanges": 10,
"resume_max_user_chars": 300,
"resume_max_assistant_chars": 200,
"resume_max_assistant_lines": 3,
"resume_skip_tool_only": True,
"show_reasoning": False,
"streaming": True,
"busy_input_mode": "interrupt",
@@ -468,7 +474,9 @@ def load_cli_config() -> Dict[str, Any]:
if config_path.exists():
try:
with open(config_path, "r", encoding="utf-8") as f:
file_config = yaml.safe_load(f) or {}
from hermes_cli.config import _normalize_root_model_keys
file_config = _normalize_root_model_keys(yaml.safe_load(f) or {})
_file_has_terminal_config = "terminal" in file_config
@@ -489,21 +497,6 @@ def load_cli_config() -> Dict[str, Any]:
if "model" in file_config["model"] and "default" not in file_config["model"]:
defaults["model"]["default"] = file_config["model"]["model"]
# Legacy root-level provider/base_url fallback.
# Some users (or old code) put provider: / base_url: at the
# config root instead of inside the model: section. These are
# only used as a FALLBACK when model.provider / model.base_url
# is not already set — never as an override. The canonical
# location is model.provider (written by `hermes model`).
if not defaults["model"].get("provider"):
root_provider = file_config.get("provider")
if root_provider:
defaults["model"]["provider"] = root_provider
if not defaults["model"].get("base_url"):
root_base_url = file_config.get("base_url")
if root_base_url:
defaults["model"]["base_url"] = root_base_url
# Deep merge file_config into defaults.
# First: merge keys that exist in both (deep-merge dicts, overwrite scalars)
for key in defaults:
@@ -775,8 +768,6 @@ from rich.markup import escape as _escape
from rich.panel import Panel
from rich.text import Text as _RichText
import fire
# Import agent and tool systems lazily. Bare interactive startup only needs the
# prompt; the full agent/tool registry is initialized on first use.
def AIAgent(*args, **kwargs):
@@ -818,6 +809,13 @@ def validate_toolset(*args, **kwargs):
return _validate_toolset(*args, **kwargs)
def _sync_process_session_id(session_id: str) -> None:
"""Keep process-local session-id consumers aligned after CLI switches."""
from gateway.session_context import set_current_session_id
set_current_session_id(session_id)
# Cron job system for scheduled tasks (execution is handled by the gateway)
def get_job(*args, **kwargs):
from cron import get_job as _get_job
@@ -2814,7 +2812,7 @@ class HermesCLI:
api_key: str = None,
base_url: str = None,
max_turns: int = None,
verbose: bool = False,
verbose: Optional[bool] = None,
compact: bool = False,
resume: str = None,
checkpoints: bool = False,
@@ -2865,7 +2863,12 @@ class HermesCLI:
else:
self.busy_input_mode = "interrupt"
self.verbose = verbose if verbose is not None else (self.tool_progress_mode == "verbose")
# self.verbose ONLY controls global DEBUG logging (root logger level).
# display.tool_progress="verbose" controls tool-call rendering (full args,
# results, think blocks) and is independent — see _apply_logging_levels.
# Coupling the two (PR #6a1aa420e) caused all module DEBUG logs to spew
# to console whenever a user set tool_progress: verbose in config.
self.verbose = bool(verbose) if verbose is not None else False
# streaming: stream tokens to the terminal as they arrive (display.streaming in config.yaml)
self.streaming_enabled = CLI_CONFIG["display"].get("streaming", False)
@@ -5091,10 +5094,13 @@ class HermesCLI:
if self.resume_display == "minimal":
return
MAX_DISPLAY_EXCHANGES = 10 # max user+assistant pairs to show
MAX_USER_LEN = 300 # truncate user messages
MAX_ASST_LEN = 200 # truncate assistant text
MAX_ASST_LINES = 3 # max lines of assistant text
# Read limits from config (with hardcoded defaults)
_disp = CLI_CONFIG.get("display", {})
MAX_DISPLAY_EXCHANGES = int(_disp.get("resume_exchanges", 10))
MAX_USER_LEN = int(_disp.get("resume_max_user_chars", 300))
MAX_ASST_LEN = int(_disp.get("resume_max_assistant_chars", 200))
MAX_ASST_LINES = int(_disp.get("resume_max_assistant_lines", 3))
SKIP_TOOL_ONLY = _disp.get("resume_skip_tool_only", True)
# Collect displayable entries (skip system, tool-result messages)
entries = [] # list of (role, display_text)
@@ -5157,6 +5163,10 @@ class HermesCLI:
if not parts:
# Skip pure-reasoning messages that have no visible output
continue
# Skip tool-call-only entries when SKIP_TOOL_ONLY is enabled
has_text = bool(text)
if SKIP_TOOL_ONLY and not has_text and tool_calls:
continue
entries.append(("assistant", " ".join(parts)))
_last_asst_idx = len(entries) - 1
_last_asst_full = " ".join(full_parts)
@@ -6162,15 +6172,16 @@ class HermesCLI:
else:
print(" Recent sessions:")
print()
print(f" {'Title':<32} {'Preview':<40} {'Last Active':<13} {'ID'}")
print(f" {'─' * 32} {'─' * 40} {'─' * 13} {'─' * 24}")
for session in sessions:
title = (session.get("title") or "")[:30]
print(f" {'#':<3} {'Title':<32} {'Preview':<40} {'Last Active':<13} {'ID'}")
print(f" {'─' * 3} {'─' * 32} {'─' * 40} {'─' * 13} {'─' * 24}")
for idx, session in enumerate(sessions, start=1):
title = session.get("title") or ""
preview = (session.get("preview") or "")[:38]
last_active = _relative_time(session.get("last_active"))
print(f" {title:<32} {preview:<40} {last_active:<13} {session['id']}")
print(f" {idx:<3} {title:<32} {preview:<40} {last_active:<13} {session['id']}")
print()
print(" Use /resume <session id or title> to continue where you left off.")
print(" Use /resume <number>, /resume <session id>, or /resume <session title> to continue.")
print(" Example: /resume 2")
print()
return True
@@ -6281,6 +6292,7 @@ class HermesCLI:
self.conversation_history = []
self._pending_title = None
self._resumed = False
_sync_process_session_id(self.session_id)
if self.agent:
self.agent.session_id = self.session_id
@@ -6514,7 +6526,7 @@ class HermesCLI:
target = parts[1].strip() if len(parts) > 1 else ""
if not target:
_cprint(" Usage: /resume <session_id_or_title>")
_cprint(" Usage: /resume <number|session_id_or_title>")
if self._show_recent_sessions(reason="resume"):
return
_cprint(" Tip: Use /history or `hermes sessions list` to find sessions.")
@@ -6525,10 +6537,20 @@ class HermesCLI:
_cprint(f" {format_session_db_unavailable()}")
return
# Resolve title or ID
from hermes_cli.main import _resolve_session_by_name_or_id
resolved = _resolve_session_by_name_or_id(target)
target_id = resolved or target
# Resolve numbered selection, title, or ID
if target.isdigit():
sessions = self._list_recent_sessions(limit=10)
index = int(target)
if index < 1 or index > len(sessions):
_cprint(f" Resume index {index} is out of range.")
_cprint(" Use /resume with no arguments to see available sessions.")
return
selected = sessions[index - 1]
target_id = selected["id"]
else:
from hermes_cli.main import _resolve_session_by_name_or_id
resolved = _resolve_session_by_name_or_id(target)
target_id = resolved or target
session_meta = self._session_db.get_session(target_id)
if not session_meta:
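The new `/resume` target resolution (number, then title or ID) can be sketched as a pure function. This is an illustration under assumptions: `resolve_resume_target` is a hypothetical helper, the recent-session list is passed in as a list of IDs, and `resolve_by_name` stands in for `_resolve_session_by_name_or_id`.

```python
from typing import Callable, List, Optional


def resolve_resume_target(
    target: str,
    recent_ids: List[str],
    resolve_by_name: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Map a /resume argument to a session ID, or None if the index is out of range."""
    if target.isdigit():
        index = int(target)  # 1-based, matching the '#' column in the listing
        if index < 1 or index > len(recent_ids):
            return None
        return recent_ids[index - 1]
    # Not a number: try a title lookup, then fall back to treating it as an ID.
    return resolve_by_name(target) or target
```

One consequence of checking `isdigit()` first is that a session whose title or ID is purely numeric can no longer be reached by name through this path; the number is always read as a list index.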
@@ -6567,6 +6589,7 @@ class HermesCLI:
self.session_id = target_id
self._resumed = True
self._pending_title = None
_sync_process_session_id(target_id)
# Load conversation history (strip transcript-only metadata entries)
restored = self._session_db.get_messages_as_conversation(target_id)
@@ -6618,6 +6641,7 @@ class HermesCLI:
f" ({msg_count} user message{'s' if msg_count != 1 else ''},"
f" {len(self.conversation_history)} total)"
)
self._display_resumed_history()
else:
_cprint(f" ↻ Resumed session {target_id}{title_part} — no messages, starting fresh.")
@@ -6740,6 +6764,7 @@ class HermesCLI:
self.session_start = now
self._pending_title = None
self._resumed = True # Prevents auto-title generation
_sync_process_session_id(new_session_id)
# Sync the agent
if self.agent:
@@ -8101,6 +8126,7 @@ class HermesCLI:
"clear",
"This clears the screen and starts a new session.\n"
"The current conversation history will be discarded.",
cmd_original=cmd_original,
) is None:
return
self.new_session(silent=True)
@@ -8225,12 +8251,16 @@ class HermesCLI:
if not self._handle_handoff_command(cmd_original):
return False
elif canonical == "new":
parts = cmd_original.split(maxsplit=1)
title = parts[1].strip() if len(parts) > 1 else None
# Strip inline-skip tokens (now/--yes/-y) before deriving the title
# so "/new now My Session" yields title="My Session" instead of
# title="now My Session". See _split_destructive_skip.
_new_args, _ = self._split_destructive_skip(cmd_original)
title = _new_args.strip() or None
if self._confirm_destructive_slash(
"new",
"This starts a fresh session.\n"
"The current conversation history will be discarded.",
cmd_original=cmd_original,
) is None:
return
self.new_session(title=title)
@@ -8257,6 +8287,7 @@ class HermesCLI:
if self._confirm_destructive_slash(
"undo",
"This removes the last user/assistant exchange from history.",
cmd_original=cmd_original,
) is None:
return
self.undo_last()
@@ -9334,18 +9365,23 @@ class HermesCLI:
_cprint(" Failed to save runtime_footer setting to config.yaml")
def _toggle_verbose(self):
"""Cycle tool progress mode: off → new → all → verbose → off."""
"""Cycle tool progress mode: off → new → all → verbose → off.
Tool-progress display (full args / results / think blocks at the
``verbose`` step) is INDEPENDENT of global DEBUG logging. Cycling
through here does not change ``self.verbose`` or the agent's
``verbose_logging`` / ``quiet_mode``; those remain under the
explicit ``-v``/``--verbose`` flag and the ``/verbose-logging``
toggle. See PR #6a1aa420e for the history that decoupled them.
"""
cycle = ["off", "new", "all", "verbose"]
try:
idx = cycle.index(self.tool_progress_mode)
except ValueError:
idx = 2 # default to "all"
self.tool_progress_mode = cycle[(idx + 1) % len(cycle)]
self.verbose = self.tool_progress_mode == "verbose"
if self.agent:
self.agent.verbose_logging = self.verbose
self.agent.quiet_mode = not self.verbose
self.agent.reasoning_callback = self._current_reasoning_callback()
# Use raw ANSI codes via _cprint so the output is routed through
@@ -9357,7 +9393,7 @@ class HermesCLI:
"off": f"{_Colors.DIM}Tool progress: OFF{_Colors.RESET} — silent mode, just the final response.",
"new": f"{_Colors.YELLOW}Tool progress: NEW{_Colors.RESET} — show each new tool (skip repeats).",
"all": f"{_Colors.GREEN}Tool progress: ALL{_Colors.RESET} — show every tool call.",
"verbose": f"{_Colors.BOLD}{_Colors.GREEN}Tool progress: VERBOSE{_Colors.RESET} — full args, results, think blocks, and debug logs.",
"verbose": f"{_Colors.BOLD}{_Colors.GREEN}Tool progress: VERBOSE{_Colors.RESET} — full args, results, and think blocks.",
}
_cprint(labels.get(self.tool_progress_mode, ""))
@@ -9903,7 +9939,49 @@ class HermesCLI:
if _reload_thread.is_alive():
print(" ⚠️ MCP reload timed out (30s). Some servers may not have reconnected.")
def _confirm_destructive_slash(self, command: str, detail: str) -> Optional[str]:
# Inline-skip tokens that bypass the destructive-slash confirmation modal.
# Matches the escape-hatch pattern users on broken modal platforms
# (currently native Windows PowerShell — issue #30768) need to self-serve
# without having to flip approvals.destructive_slash_confirm in config.
_DESTRUCTIVE_SKIP_TOKENS = frozenset({"now", "--yes", "-y"})
@classmethod
def _split_destructive_skip(cls, cmd_text: Optional[str]) -> tuple[str, bool]:
"""Split inline-skip tokens out of a destructive slash command.
Returns ``(remainder, skip)`` where ``remainder`` is the original
text with the command word and any recognized skip tokens removed,
and ``skip`` is True iff at least one skip token was found.
Examples:
"/reset now" -> ("", True)
"/reset --yes My title" -> ("My title", True)
"/new My title" -> ("My title", False)
"/clear" -> ("", False)
"""
if not cmd_text:
return "", False
tokens = cmd_text.strip().split()
if not tokens:
return "", False
# Drop leading "/cmd" word — callers pass the full command text.
if tokens[0].startswith("/"):
tokens = tokens[1:]
skip = False
kept: list[str] = []
for tok in tokens:
if tok.lower() in cls._DESTRUCTIVE_SKIP_TOKENS:
skip = True
continue
kept.append(tok)
return " ".join(kept), skip
def _confirm_destructive_slash(
self,
command: str,
detail: str,
cmd_original: Optional[str] = None,
) -> Optional[str]:
"""Prompt the user to confirm a destructive session slash command.
Used by ``/clear``, ``/new``/``/reset``, and ``/undo`` before they
@@ -9919,9 +9997,24 @@ class HermesCLI:
gate is off the function returns ``"once"`` immediately without
prompting.
Inline-skip: if ``cmd_original`` contains ``now``, ``--yes``, or
``-y`` as an argument (e.g. ``/reset now``, ``/new --yes My title``),
the modal is bypassed and ``"once"`` is returned immediately. This is
an escape hatch for platforms where the prompt_toolkit modal hangs
(issue #30768 — native Windows PowerShell). Callers are responsible
for stripping the skip tokens from any remaining argument parsing
(see :meth:`_split_destructive_skip`).
Returns ``"once"``, ``"always"``, or ``None`` (cancelled). Callers
proceed with the destructive action when the result is non-None.
"""
# Inline-skip escape hatch — works regardless of platform/modal state.
# See class-level _DESTRUCTIVE_SKIP_TOKENS for the accepted tokens.
if cmd_original:
_, _skip = self._split_destructive_skip(cmd_original)
if _skip:
return "once"
# Gate check — respects prior "Always Approve" clicks.
try:
cfg = load_cli_config()
@@ -10256,9 +10349,7 @@ class HermesCLI:
self._last_scrollback_tool = function_name
try:
from agent.display import get_cute_tool_message
line = get_cute_tool_message(function_name, stored_args, duration)
if is_error:
line = f"{line} [error]"
line = get_cute_tool_message(function_name, stored_args, duration, result=kwargs.get("result"))
_cprint(f" {line}")
except Exception:
pass
@@ -14431,7 +14522,7 @@ def main(
api_key: str = None,
base_url: str = None,
max_turns: int = None,
verbose: bool = False,
verbose: Optional[bool] = None,
quiet: bool = False,
compact: bool = False,
list_tools: bool = False,
@@ -14777,4 +14868,6 @@ def main(
if __name__ == "__main__":
import fire
fire.Fire(main)
+10 -5
@@ -6,17 +6,22 @@
#
# Set HERMES_UID / HERMES_GID to the host user that owns ~/.hermes so
# files created inside the container stay readable/writable on the host.
# The entrypoint remaps the internal `hermes` user to these values via
# usermod/groupmod + gosu.
# The s6-overlay stage2 hook remaps the internal `hermes` user to these
# values via usermod/groupmod; each supervised service then drops to that
# user via `s6-setuidgid`.
#
# Security notes:
# - The dashboard service binds to 127.0.0.1 by default. It stores API
# keys; exposing it on LAN without auth is unsafe. If you want remote
# access, use an SSH tunnel or put it behind a reverse proxy that
# adds authentication — do NOT pass --insecure --host 0.0.0.0.
# - If you override entrypoint, keep /opt/hermes/docker/entrypoint.sh in
# the command chain. It drops root to the hermes user before gateway
# files such as gateway.lock are created.
# - If you override entrypoint, keep `/init` as the first command in
# the chain (or let docker use the image's default ENTRYPOINT,
# which is `["/init", "/opt/hermes/docker/main-wrapper.sh"]`).
# `/init` is s6-overlay's PID 1 — it runs the cont-init.d scripts
# (chown, profile reconcile, dashboard toggle) and sets up the
# supervision tree before any service starts. Bypassing it skips
# all of that setup and the gateway will not work correctly.
# - The gateway's API server is off unless you uncomment API_SERVER_KEY
# and API_SERVER_HOST. See docs/user-guide/api-server.md before doing
# this on an internet-facing host.
+90
@@ -0,0 +1,90 @@
#!/command/with-contenv sh
# shellcheck shell=sh
# Make supervise/ trees for ALL declared s6 services queryable and
# controllable by the unprivileged hermes user (UID 10000).
#
# Background (PR #30136 review item I4): the entire s6 lifecycle
# (s6-svc, s6-svstat, s6-svwait) is dispatched as the hermes user
# inside the container (every Hermes runtime path runs under
# ``s6-setuidgid hermes``). But s6-supervise creates each service's
# ``supervise/`` and top-level ``event/`` directory with mode 0700
# owned by its effective UID — which is root, because s6-supervise
# is spawned by s6-svscan running as PID 1. So unprivileged clients
# get EACCES on every probe / control call against the slot.
#
# Two fixes, one in each registration path:
#
# 1. For RUNTIME-registered profile gateways (created via the s6
# runtime register hooks in profiles.py): the Python helper
# ``_seed_supervise_skeleton`` pre-creates supervise/ + event/ +
# supervise/control owned by hermes BEFORE s6-svscanctl -a fires.
# s6-supervise's mkdir/mkfifo are EEXIST-safe, so it inherits our
# ownership and never tries to chown back to root.
#
# 2. For STATIC s6-rc services (dashboard, main-hermes) declared at
# image-build time under /etc/s6-overlay/s6-rc.d/*: these are
# compiled by s6-rc at boot, and s6-supervise spawns BEFORE
# cont-init.d gets to run — so by the time we're here, the
# supervise/ tree is already there as root:root 0700. We chown
# it here. s6-supervise will keep using the same files; it never
# re-asserts ownership on a running service.
#
# This script runs as root after 01-hermes-setup but before
# 02-reconcile-profiles, so the chowns are settled before the
# Python reconciler walks the scandir. Lexicographic ordering
# guarantees this — the suffix is unusual because we want to slot
# in between 01 and the existing 02-reconcile-profiles without
# renumbering both (which would be a churn-noise patch on its own).
set -eu
# /run/s6-rc/servicedirs holds the live, compiled service directories
# for every static (s6-rc) service. Symlinks under /run/service/*
# point here. Per-service supervise/ + event/ both need hermes
# ownership for s6-svstat etc. to work as hermes.
SVC_ROOT=/run/s6-rc/servicedirs
if [ ! -d "$SVC_ROOT" ]; then
echo "[supervise-perms] $SVC_ROOT not present; skipping"
exit 0
fi
for svc in "$SVC_ROOT"/*; do
[ -d "$svc" ] || continue
name=$(basename "$svc")
# Skip s6-overlay-internal services (they need to stay root-only;
# the s6rc-* helpers manage the supervision tree itself).
case "$name" in
s6rc-*|s6-linux-*)
continue
;;
esac
# supervise/ tree — needed by s6-svstat / s6-svc.
if [ -d "$svc/supervise" ]; then
chown -R hermes:hermes "$svc/supervise" 2>/dev/null || \
echo "[supervise-perms] could not chown $svc/supervise"
# 0710 = group searchable. ``s6-svstat`` only needs to openat
# status, not list the dir, but giving the hermes group +x is
# the minimum that lets group members access the contents.
chmod 0710 "$svc/supervise" 2>/dev/null || true
# supervise/control is a FIFO that s6-svc writes commands
# into; the hermes user needs +w. Owner is already hermes
# after the recursive chown above; widen perms to 0660 so
# ``s6-svc`` works for any member of the hermes group too.
if [ -p "$svc/supervise/control" ]; then
chmod 0660 "$svc/supervise/control" 2>/dev/null || true
fi
fi
# Top-level event/ dir — s6-svlisten1 / s6-svwait subscribe here.
if [ -d "$svc/event" ]; then
chown hermes:hermes "$svc/event" 2>/dev/null || \
echo "[supervise-perms] could not chown $svc/event"
# Preserve s6's 03730 mode (setgid + g+rwx + sticky).
chmod 03730 "$svc/event" 2>/dev/null || true
fi
done
echo "[supervise-perms] chowned supervise/ trees for static s6-rc services"
+46
@@ -0,0 +1,46 @@
#!/command/with-contenv sh
# shellcheck shell=sh
# Container-boot reconciliation of per-profile gateway s6 services.
#
# Runs as root after 01-hermes-setup (the stage2 hook) has chowned
# the volume and seeded $HERMES_HOME, but before s6-rc starts user
# services. /etc/cont-init.d/* scripts run in lexicographic order,
# so the `02-` prefix guarantees ordering.
#
# Service directories under /run/service/ live on tmpfs and are
# wiped on every container restart. Profile directories under
# $HERMES_HOME/profiles/ live on the persistent VOLUME. This script
# walks the persistent profiles, recreates the s6 service slots,
# and auto-starts only those whose last recorded state was
# `running` — see hermes_cli/container_boot.py.
#
# Phase 4 also needs hermes-user writes to /run/service/ (so the
# profile create/delete hooks can register/unregister at runtime),
# so we chown the scandir before invoking the reconciler. We
# additionally chown the s6-svscan control FIFO so the hermes user
# can send rescan signals via ``s6-svscanctl -a``; without this the
# entire runtime-registration path is inert under UID 10000 (the
# Python wrapper catches the resulting EACCES, prints a warning,
# and swallows the failure).
set -e
# Make the dynamic scandir hermes-writable. The directory itself
# starts root-owned by s6-overlay.
chown hermes:hermes /run/service 2>/dev/null || true
# Make the svscan control FIFO hermes-writable so s6-svscanctl -a
# / -an work for the hermes user. The FIFO is created by s6-svscan
# at PID-1 startup, so by the time this cont-init.d script runs it
# already exists. Both ``control`` and ``lock`` need to be writable
# for the various svscanctl operations; the directory itself stays
# root-owned (we only need to touch the two FIFOs/locks inside).
if [ -d /run/service/.s6-svscan ]; then
for entry in control lock; do
if [ -e "/run/service/.s6-svscan/$entry" ]; then
chown hermes:hermes "/run/service/.s6-svscan/$entry" 2>/dev/null || true
fi
done
fi
exec s6-setuidgid hermes /opt/hermes/.venv/bin/python -m hermes_cli.container_boot
+25 -158
@@ -1,160 +1,27 @@
#!/bin/bash
# Docker/Podman entrypoint: bootstrap config files into the mounted volume, then run hermes.
set -e
HERMES_HOME="${HERMES_HOME:-/opt/data}"
INSTALL_DIR="/opt/hermes"
# --- Privilege dropping via gosu ---
# When started as root (the default for Docker, or fakeroot in rootless Podman),
# optionally remap the hermes user/group to match host-side ownership, fix volume
# permissions, then re-exec as hermes.
if [ "$(id -u)" = "0" ]; then
if [ -n "$HERMES_UID" ] && [ "$HERMES_UID" != "$(id -u hermes)" ]; then
echo "Changing hermes UID to $HERMES_UID"
usermod -u "$HERMES_UID" hermes
fi
if [ -n "$HERMES_GID" ] && [ "$HERMES_GID" != "$(id -g hermes)" ]; then
echo "Changing hermes GID to $HERMES_GID"
# -o allows non-unique GID (e.g. macOS GID 20 "staff" may already exist
# as "dialout" in the Debian-based container image)
groupmod -o -g "$HERMES_GID" hermes 2>/dev/null || true
fi
# Fix ownership of the data volume. When HERMES_UID remaps the hermes user,
# files created by previous runs (under the old UID) become inaccessible.
# Always chown -R when UID was remapped; otherwise only if top-level is wrong.
actual_hermes_uid=$(id -u hermes)
needs_chown=false
if [ -n "$HERMES_UID" ] && [ "$HERMES_UID" != "10000" ]; then
needs_chown=true
elif [ "$(stat -c %u "$HERMES_HOME" 2>/dev/null)" != "$actual_hermes_uid" ]; then
needs_chown=true
fi
if [ "$needs_chown" = true ]; then
echo "Fixing ownership of $HERMES_HOME to hermes ($actual_hermes_uid)"
# In rootless Podman the container's "root" is mapped to an unprivileged
# host UID — chown will fail. That's fine: the volume is already owned
# by the mapped user on the host side.
chown -R hermes:hermes "$HERMES_HOME" 2>/dev/null || \
echo "Warning: chown failed (rootless container?) — continuing anyway"
# The .venv must also be re-chowned when UID is remapped, otherwise
# lazy_deps.py cannot install platform packages (discord.py, etc.).
chown -R hermes:hermes "$INSTALL_DIR/.venv" 2>/dev/null || \
echo "Warning: chown .venv failed (rootless container?) — continuing anyway"
fi
# Ensure config.yaml is readable by the hermes runtime user even if it was
# edited on the host after initial ownership setup. Must run here (as root)
# rather than after the gosu drop, otherwise a non-root caller like
# `docker run -u $(id -u):$(id -g)` hits "Operation not permitted" (#15865).
if [ -f "$HERMES_HOME/config.yaml" ]; then
chown hermes:hermes "$HERMES_HOME/config.yaml" 2>/dev/null || true
chmod 640 "$HERMES_HOME/config.yaml" 2>/dev/null || true
fi
echo "Dropping root privileges"
exec gosu hermes "$0" "$@"
fi
# --- Running as hermes from here ---
source "${INSTALL_DIR}/.venv/bin/activate"
# Stamp install method for detect_install_method()
echo "docker" > "${HERMES_HOME:=/opt/data}/.install_method" 2>/dev/null || true
# Create essential directory structure. Cache and platform directories
# (cache/images, cache/audio, platforms/whatsapp, etc.) are created on
# demand by the application — don't pre-create them here so new installs
# get the consolidated layout from get_hermes_dir().
# The "home/" subdirectory is a per-profile HOME for subprocesses (git,
# ssh, gh, npm …). Without it those tools write to /root which is
# ephemeral and shared across profiles. See issue #4426.
mkdir -p "$HERMES_HOME"/{cron,sessions,logs,hooks,memories,skills,skins,plans,workspace,home}
# .env
if [ ! -f "$HERMES_HOME/.env" ]; then
cp "$INSTALL_DIR/.env.example" "$HERMES_HOME/.env"
fi
# config.yaml
if [ ! -f "$HERMES_HOME/config.yaml" ]; then
cp "$INSTALL_DIR/cli-config.yaml.example" "$HERMES_HOME/config.yaml"
fi
# SOUL.md
if [ ! -f "$HERMES_HOME/SOUL.md" ]; then
cp "$INSTALL_DIR/docker/SOUL.md" "$HERMES_HOME/SOUL.md"
fi
# auth.json: bootstrap from env on first boot only. Used by orchestrators
# (e.g. provisioning a Hermes VPS from an account-management service) that
# need to seed the OAuth refresh credential non-interactively, instead of
# walking the user through `hermes setup` + the device-flow login dance.
# Subsequent token rotations write back to the same file, which lives on a
# persistent volume — so this env var is consumed exactly once at first
# boot. The `[ ! -f ... ]` guard is critical: without it, a container
# restart would clobber a rotated refresh token with the now-stale value
# the orchestrator originally seeded.
if [ ! -f "$HERMES_HOME/auth.json" ] && [ -n "$HERMES_AUTH_JSON_BOOTSTRAP" ]; then
printf '%s' "$HERMES_AUTH_JSON_BOOTSTRAP" > "$HERMES_HOME/auth.json"
chmod 600 "$HERMES_HOME/auth.json"
fi
# Sync bundled skills (manifest-based so user edits are preserved)
if [ -d "$INSTALL_DIR/skills" ]; then
python3 "$INSTALL_DIR/tools/skills_sync.py"
fi
# Optionally start `hermes dashboard` as a side-process.
#!/bin/sh
# s6-overlay shim. The real logic lives in docker/stage2-hook.sh, invoked
# by /etc/cont-init.d/01-hermes-setup (installed by the Dockerfile). This
# file exists so external references to docker/entrypoint.sh still work,
# but it's no longer the ENTRYPOINT — /init is.
#
# Toggled by HERMES_DASHBOARD=1 (also accepts "true"/"yes", case-insensitive).
# Host/port/TUI can be overridden via:
# HERMES_DASHBOARD_HOST (default 0.0.0.0 — exposed outside the container)
# HERMES_DASHBOARD_PORT (default 9119, matches `hermes dashboard` default)
# HERMES_DASHBOARD_TUI (already honored by `hermes dashboard` itself)
# When called directly (e.g. by an old wrapper script that hard-coded
# docker/entrypoint.sh as the container ENTRYPOINT, or by an external
# orchestration script that invokes it inside the container), forward to
# the stage2 hook for parity with the pre-s6 entrypoint behavior. The
# stage2 hook only handles cont-init bootstrap (UID remap, chown, config
# seed, skills sync); it does NOT exec the CMD. Callers that depended
# on the pre-s6 contract "entrypoint.sh sets up state then execs hermes"
# will see the bootstrap happen but the CMD will not run from this shim.
#
# The dashboard is a long-lived server. We background it *before* the final
# `exec hermes "$@"` so the user's chosen foreground command (chat, gateway,
# sleep infinity, …) remains PID-of-interest for the container runtime. When
# the container stops the whole process tree is torn down, so no explicit
# cleanup is needed.
case "${HERMES_DASHBOARD:-}" in
1|true|TRUE|True|yes|YES|Yes)
dash_host="${HERMES_DASHBOARD_HOST:-0.0.0.0}"
dash_port="${HERMES_DASHBOARD_PORT:-9119}"
dash_args=(--host "$dash_host" --port "$dash_port" --no-open)
# Binding to anything other than localhost requires --insecure — the
# dashboard refuses otherwise because it exposes API keys. Inside a
# container this is the expected deployment (host reaches it via
# published port), so opt in automatically.
if [ "$dash_host" != "127.0.0.1" ] && [ "$dash_host" != "localhost" ]; then
dash_args+=(--insecure)
fi
echo "Starting hermes dashboard on ${dash_host}:${dash_port} (background)"
# Prefix dashboard output so it's distinguishable from the main
# process in `docker logs`. stdbuf keeps the pipe line-buffered.
(
stdbuf -oL -eL hermes dashboard "${dash_args[@]}" 2>&1 \
| sed -u 's/^/[dashboard] /'
) &
;;
esac
# Final exec: two supported invocation patterns.
#
# docker run <image> -> exec `hermes` with no args (legacy default)
# docker run <image> chat -q "..." -> exec `hermes chat -q "..."` (legacy wrap)
# docker run <image> sleep infinity -> exec `sleep infinity` directly
# docker run <image> bash -> exec `bash` directly
#
# If the first positional arg resolves to an executable on PATH, we assume the
# caller wants to run it directly (needed by the launcher which runs long-lived
# `sleep infinity` sandbox containers — see tools/environments/docker.py).
# Otherwise we treat the args as a hermes subcommand and wrap with `hermes`,
# preserving the documented `docker run <image> <subcommand>` behavior.
if [ $# -gt 0 ] && command -v "$1" >/dev/null 2>&1; then
exec "$@"
fi
exec hermes "$@"
# Deprecation: this shim is preserved for one release cycle to give
# downstream users time to migrate their wrappers to the image's real
# ENTRYPOINT (`/init`). It will be removed in a future major release.
# Surface a warning to stderr so anyone still invoking this path
# sees the migration notice in their logs.
echo "[hermes] WARNING: docker/entrypoint.sh is a deprecated shim under " \
"s6-overlay. The container's real ENTRYPOINT is /init + " \
"main-wrapper.sh; this script only runs the stage2 cont-init hook " \
"and does NOT exec the CMD. If you hard-coded docker/entrypoint.sh " \
"as your ENTRYPOINT, drop the override — docker will use the image's " \
"default ENTRYPOINT (/init), which handles bootstrap AND CMD." >&2
exec /opt/hermes/docker/stage2-hook.sh "$@"
+30
@@ -0,0 +1,30 @@
#!/bin/sh
# /opt/hermes/docker/main-wrapper.sh — wraps the container's CMD with
# the same argument-routing logic the pre-s6 entrypoint.sh used. Runs
# as /init's "main program" (Docker CMD) so it inherits stdin/stdout/
# stderr from the container.
#
# Routing:
# no args → exec `hermes` (the default)
# first arg is an executable → exec it directly (sleep, bash, sh, …)
# first arg is anything else → exec `hermes <args>` (subcommand passthrough)
#
# We drop to the hermes user via `s6-setuidgid` so the supervised
# workload runs unprivileged (UID 10000 by default).
set -e
cd /opt/data
# shellcheck disable=SC1091
. /opt/hermes/.venv/bin/activate
if [ $# -eq 0 ]; then
exec s6-setuidgid hermes hermes
fi
if command -v "$1" >/dev/null 2>&1; then
# Bare executable — pass through directly.
exec s6-setuidgid hermes "$@"
fi
# Hermes subcommand pass-through.
exec s6-setuidgid hermes hermes "$@"
+30
@@ -0,0 +1,30 @@
#!/command/with-contenv sh
# shellcheck shell=sh
# Dashboard finish script. Companion to ./run.
#
# When HERMES_DASHBOARD is unset (or falsy), ./run exits 0 immediately.
# Without this finish script, s6-supervise would just restart the run
# script in a tight loop. By exiting 125 here, we tell s6-supervise
# "this service has permanently failed; do not restart" — equivalent
# to `s6-svc -O`. The supervise slot reports as down, matching reality
# (no dashboard process is running).
#
# When HERMES_DASHBOARD IS enabled and the run script later exits or
# is killed, we want s6-supervise to restart it (the whole point of
# supervised lifecycle). So we exit non-125 in that case.
# Arguments passed to a finish script: $1=run-exit-code, $2=signal-num,
# $3=service-dir-name, $4=run-pgid. See servicedir(7).
case "${HERMES_DASHBOARD:-}" in
1|true|TRUE|True|yes|YES|Yes)
# Dashboard was enabled — let s6-supervise restart on crash by
# exiting non-125. (Pass-through any sensible default.)
exit 0
;;
*)
# Dashboard disabled — permanent-failure marker so s6-supervise
# leaves the slot in 'down' state and s6-svstat reflects that.
exit 125
;;
esac
+40
@@ -0,0 +1,40 @@
#!/command/with-contenv sh
# shellcheck shell=sh
# Dashboard service. Always declared so s6 has a supervised slot; if
# HERMES_DASHBOARD isn't truthy the run script exits cleanly and the
# companion finish script returns 125 (s6's "permanent failure, do
# not restart" marker), so s6-svstat reports the slot as down. See
# also docker/s6-rc.d/dashboard/finish.
case "${HERMES_DASHBOARD:-}" in
1|true|TRUE|True|yes|YES|Yes) ;;
*)
# Exit 0; the finish script will exit 125 → s6-supervise won't
# restart us and the slot reports down. Using a clean exit
# (rather than `exec sleep infinity`) means s6-svstat reflects
# reality: when HERMES_DASHBOARD is unset, the service is NOT
# running, just supervised-with-permanent-failure. See PR
# #30136 review item I3.
exit 0
;;
esac
cd /opt/data
# shellcheck disable=SC1091
. /opt/hermes/.venv/bin/activate
dash_host="${HERMES_DASHBOARD_HOST:-0.0.0.0}"
dash_port="${HERMES_DASHBOARD_PORT:-9119}"
# Binding to anything other than localhost requires --insecure — the
# dashboard refuses otherwise because it exposes API keys. Inside a
# container this is the expected deployment.
insecure=""
case "$dash_host" in
127.0.0.1|localhost) ;;
*) insecure="--insecure" ;;
esac
# shellcheck disable=SC2086 # word-splitting of $insecure is intentional
exec s6-setuidgid hermes hermes dashboard \
--host "$dash_host" --port "$dash_port" --no-open $insecure
+1
@@ -0,0 +1 @@
longrun
+27
@@ -0,0 +1,27 @@
#!/command/with-contenv sh
# shellcheck shell=sh
# Main hermes service.
#
# IMPORTANT — this is NOT how the user's CMD runs.
#
# We chose Architecture B from the plan: the container's CMD (the bare
# command the user passes to `docker run <image> …`) runs as /init's
# "main program" via Docker's CMD mechanism, NOT as an s6-supervised
# service. This is the canonical s6-overlay pattern for "container
# exits when the program exits" semantics, and it lets us preserve
# every pre-s6 invocation contract (chat passthrough, sleep infinity,
# bash, --tui) without re-implementing argument routing through
# /run/s6/container_environment.
#
# So why does this service exist at all? Two reasons:
# 1. s6-rc requires at least one user service for the "user" bundle
# to be valid. We can't ship an empty bundle.
# 2. Future work may want to supervise a long-lived hermes process
# (e.g. for gateway-server containers); having the slot already
# wired in keeps that change small.
#
# For now this service is a no-op: it sleeps forever, doing nothing.
# The dashboard runs as a real s6 service alongside it (see
# ../dashboard/run) and per-profile gateways register dynamically via
# /run/service/ at runtime (Phase 4).
exec sleep infinity
+1
@@ -0,0 +1 @@
longrun
+134
@@ -0,0 +1,134 @@
#!/bin/sh
# s6-overlay stage2 hook — runs as root after the supervision tree is
# up but before user services start. Handles UID/GID remap, volume
# chown, config seeding, and skills sync.
#
# Per-service privilege drop happens inside each service's `run` script
# (and in main-wrapper.sh) via s6-setuidgid, not here.
#
# Wired into the image as /etc/cont-init.d/01-hermes-setup by the
# Dockerfile. The shim at docker/entrypoint.sh forwards to this script
# so external references to docker/entrypoint.sh still work.
#
# NB: cont-init.d scripts run with no arguments — the user's CMD args
# are NOT visible here. That's fine: we use Architecture B (s6-overlay
# main-program model), so main-wrapper.sh runs the CMD with full
# stdin/stdout/stderr access and handles arg parsing there.
set -eu
HERMES_HOME="${HERMES_HOME:-/opt/data}"
INSTALL_DIR="/opt/hermes"
# --- UID/GID remap ---
if [ -n "${HERMES_UID:-}" ] && [ "$HERMES_UID" != "$(id -u hermes)" ]; then
echo "[stage2] Changing hermes UID to $HERMES_UID"
usermod -u "$HERMES_UID" hermes
fi
if [ -n "${HERMES_GID:-}" ] && [ "$HERMES_GID" != "$(id -g hermes)" ]; then
echo "[stage2] Changing hermes GID to $HERMES_GID"
# -o allows non-unique GID (e.g. macOS GID 20 "staff" may already
# exist as "dialout" in the Debian-based container image).
groupmod -o -g "$HERMES_GID" hermes 2>/dev/null || true
fi
# --- Fix ownership of data volume ---
actual_hermes_uid=$(id -u hermes)
needs_chown=false
if [ -n "${HERMES_UID:-}" ] && [ "$HERMES_UID" != "10000" ]; then
needs_chown=true
elif [ "$(stat -c %u "$HERMES_HOME" 2>/dev/null)" != "$actual_hermes_uid" ]; then
needs_chown=true
fi
if [ "$needs_chown" = true ]; then
echo "[stage2] Fixing ownership of $HERMES_HOME to hermes ($actual_hermes_uid)"
# In rootless Podman the container's "root" is mapped to an
# unprivileged host UID — chown will fail. That's fine: the volume
# is already owned by the mapped user on the host side.
chown -R hermes:hermes "$HERMES_HOME" 2>/dev/null || \
echo "[stage2] Warning: chown failed (rootless container?) — continuing"
# The .venv must also be re-chowned when UID is remapped, otherwise
# lazy_deps.py cannot install platform packages (discord.py, etc.).
chown -R hermes:hermes "$INSTALL_DIR/.venv" 2>/dev/null || \
echo "[stage2] Warning: chown .venv failed (rootless container?) — continuing"
fi
# Always reset ownership of $HERMES_HOME/profiles to hermes on every
# boot. Profile dirs and files can land owned by root when commands
# are invoked via `docker exec <container> hermes …` (which defaults
# to root unless `-u` is passed), and that breaks the cont-init
# reconciler (02-reconcile-profiles) which runs as hermes and walks
# the profiles dir. Idempotent; skipped on rootless containers where
# chown would fail.
if [ -d "$HERMES_HOME/profiles" ]; then
chown -R hermes:hermes "$HERMES_HOME/profiles" 2>/dev/null || true
fi
# --- config.yaml permissions ---
# Ensure config.yaml is readable by the hermes runtime user even if it
# was edited on the host after initial ownership setup.
if [ -f "$HERMES_HOME/config.yaml" ]; then
chown hermes:hermes "$HERMES_HOME/config.yaml" 2>/dev/null || true
chmod 640 "$HERMES_HOME/config.yaml" 2>/dev/null || true
fi
# --- Seed directory structure as hermes user ---
# Run as hermes via s6-setuidgid so dirs end up owned correctly (matters
# under rootless Podman where chown back to root would fail).
#
# Use direct `mkdir -p` invocation (no `sh -c "..."` wrapper) so the
# shell isn't a second interpreter — defends against $HERMES_HOME values
# containing shell metacharacters. PR #30136 review item O2.
s6-setuidgid hermes mkdir -p \
"$HERMES_HOME/cron" \
"$HERMES_HOME/sessions" \
"$HERMES_HOME/logs" \
"$HERMES_HOME/hooks" \
"$HERMES_HOME/memories" \
"$HERMES_HOME/skills" \
"$HERMES_HOME/skins" \
"$HERMES_HOME/plans" \
"$HERMES_HOME/workspace" \
"$HERMES_HOME/home"
# --- Install-method stamp (read by detect_install_method() in hermes status) ---
# Preserved from the tini-era entrypoint (PR #27843). Must be written as
# the hermes user so ownership matches the file's documented owner.
# tee is invoked directly via s6-setuidgid (no `sh -c` wrapper) for the
# same shell-metacharacter safety described above.
printf 'docker\n' | s6-setuidgid hermes tee "$HERMES_HOME/.install_method" >/dev/null \
|| true
# --- Seed config files (only on first boot) ---
seed_one() {
dest=$1
src=$2
if [ ! -f "$HERMES_HOME/$dest" ] && [ -f "$INSTALL_DIR/$src" ]; then
s6-setuidgid hermes cp "$INSTALL_DIR/$src" "$HERMES_HOME/$dest"
fi
}
seed_one ".env" ".env.example"
seed_one "config.yaml" "cli-config.yaml.example"
seed_one "SOUL.md" "docker/SOUL.md"
# auth.json: bootstrap from env on first boot only. Same semantics as the
# pre-s6 entrypoint — the [ ! -f ] guard is critical to avoid clobbering
# rotated refresh tokens on container restart.
if [ ! -f "$HERMES_HOME/auth.json" ] && [ -n "${HERMES_AUTH_JSON_BOOTSTRAP:-}" ]; then
printf '%s' "$HERMES_AUTH_JSON_BOOTSTRAP" > "$HERMES_HOME/auth.json"
chown hermes:hermes "$HERMES_HOME/auth.json" 2>/dev/null || true
chmod 600 "$HERMES_HOME/auth.json"
fi
# --- Sync bundled skills ---
# Invoke the venv's python by absolute path so we don't need a `sh -c`
# wrapper to source the activate script. This is safe because
# skills_sync.py doesn't depend on any environment exports beyond what
# the python binary's own bin-stub already sets up (sys.path is rooted
# at the venv's site-packages by virtue of running .venv/bin/python).
if [ -d "$INSTALL_DIR/skills" ]; then
s6-setuidgid hermes "$INSTALL_DIR/.venv/bin/python" "$INSTALL_DIR/tools/skills_sync.py" \
|| echo "[stage2] Warning: skills_sync.py failed; continuing"
fi
echo "[stage2] Setup complete; starting user services"
@@ -0,0 +1,434 @@
# s6-overlay Supervision for Per-Profile Gateways in Docker — Implementation Plan
> **Status: shipped.** Phases 0-5 landed via PR
> [NousResearch/hermes-agent#30136](https://github.com/NousResearch/hermes-agent/pull/30136)
> in May 2026. This document is preserved as a post-implementation reference
> for the architecture and the resolved design questions. The phase-by-phase
> TDD walkthrough (≈2,800 lines) and the v2/v3 re-validation preambles have
> been removed — the canonical implementation history is the PR commit log
> (`git log --oneline a957ef083..a6f7171a5 -- 'docker/*' 'hermes_cli/service_manager.py' …`).
> Open Questions are collapsed into a single Decision Log table; full
> deliberations live in PR review comments.
**Goal:** Replace `tini` with s6-overlay as PID 1 in the Hermes Docker image so
that the main hermes process, the dashboard, and dynamically-created
per-profile gateways all run as supervised services (auto-restart on crash,
clean shutdown, signal forwarding, zombie reaping). Preserve every existing
`docker run …` invocation pattern — including interactive TUI.
**Architecture:** s6-overlay's `/init` is the container ENTRYPOINT, running
s6-svscan as PID 1. Main hermes and the dashboard are declared as static
s6-rc services at image build time. Per-profile gateways — which users create
*after* the image is built (`hermes profile create coder` →
`coder gateway start`) — are registered dynamically by writing service
directories under a scandir watched by s6-svscan. A `ServiceManager` protocol
abstracts the install/start/stop/restart surface across the init systems we
care about (systemd on Linux host, launchd on macOS host, Scheduled Tasks on
native Windows host, s6 inside container) and adds a second tier for runtime
service registration that only s6 implements.
**Tech Stack:**
- [s6-overlay](https://github.com/just-containers/s6-overlay) v3.2.3.0
(noarch + per-arch tarballs ~15 MB). SHA256-pinned via build ARGs;
multi-arch via `TARGETARCH` (amd64 → `x86_64`, arm64 → `aarch64`).
- Debian 13.4 base image (unchanged).
- [hadolint](https://github.com/hadolint/hadolint) for the Dockerfile +
[shellcheck](https://github.com/koalaman/shellcheck) for entrypoint scripts.
- Python subprocess wrappers for `s6-svc`, `s6-svstat`, `s6-svscanctl`.
- Existing systemd/launchd/windows surface in `hermes_cli/gateway.py` and
`hermes_cli/gateway_windows.py`.
**Scope:**
- Container-only (host-side systemd/launchd/windows behavior is preserved,
not modified).
- s6-overlay only (no pure-Python fallback).
- Architecture A (s6 owns PID 1; tini is removed).
- Interactive TUI must keep working:
`docker run -it --rm nousresearch/hermes-agent:latest --tui`.
- Dynamic registration is limited to per-profile gateways — one service per
profile, created when a profile is created, torn down when deleted. A
`gateway-default` slot is always registered for the root HERMES_HOME
profile so `hermes gateway start` (no `-p`) has somewhere to land.
**Out of scope:**
- Host-side dynamic supervision (systemd-run / launchd transient plists) —
not needed.
- Pure-Python supervisor fallback — not needed.
- Arbitrary user-defined supervised processes inside the container — only
profile gateways.
- Migration of existing per-profile systemd unit generation to s6 on the
host side.
- Non-Docker container runtimes (Podman rootless validated reactively).
- UX polish around in-container profile lifecycle (e.g. a nice status view
of all supervised profile gateways) — deferred to follow-up.
---
## Background From The Codebase
> **Note on line numbers:** This section refers to functions and structures
> by name only. Use `grep -n 'def <name>' <file>` to locate anything below
> if you need the current line.
### Pre-s6 container init (what we replaced)
The original `Dockerfile` declared
`ENTRYPOINT [ "/usr/bin/tini", "-g", "--", "/opt/hermes/docker/entrypoint.sh" ]`.
tini was PID 1, reaped zombies, forwarded SIGTERM to the process group. The
old `docker/entrypoint.sh`:
1. `gosu` privilege drop from root → `hermes` UID.
2. Copied `.env.example`, `cli-config.yaml.example`, `SOUL.md` into
`$HERMES_HOME` if missing.
3. Synced bundled skills via `tools/skills_sync.py`.
4. Optionally backgrounded `hermes dashboard` in a subshell when
`HERMES_DASHBOARD=1` (**not supervised**, no restart).
5. `exec hermes "$@"` — tini's sole direct child.
Known limitations: dashboard crash → stays dead; dashboard fails at startup →
silent; gateway crash → dashboard dies too. The May 4, 2026 decision was
"leave as is" because nothing in the container needed supervision then.
Adding per-profile gateway supervision changed that.
### ServiceManager surface (what we wrapped, not refactored)
All init-system logic lives in **`hermes_cli/gateway.py`** (~5,400 LOC at
re-validation). The systemd/launchd code is ~1,500 lines of that, plus a
separate **`hermes_cli/gateway_windows.py`** (~690 LOC) for Windows
Scheduled Tasks.
| Layer | Systemd functions | Launchd functions | Windows functions |
|---|---|---|---|
| **Detection** | `supports_systemd_services()`, `_systemd_operational()`, `_wsl_systemd_operational()`, `_container_systemd_operational()` | `is_macos()` | `is_windows()`, `gateway_windows.is_installed()` |
| **Paths** | `get_systemd_unit_path(system)`, `get_service_name()` | `get_launchd_plist_path()`, `get_launchd_label()` | `gateway_windows.get_task_name()`, `get_task_script_path()`, `get_startup_entry_path()` |
| **Install/lifecycle** | `systemd_install(force, system, run_as_user)`, `systemd_uninstall(system)`, `systemd_start/stop/restart(system)` | `launchd_install(force)`, `launchd_uninstall/start/stop/restart` | `gateway_windows.install/uninstall/start/stop/restart` |
| **Probes** | `_probe_systemd_service_running(system)`, `_read_systemd_unit_properties(system)`, `_wait_for_systemd_service_restart`, `_recover_pending_systemd_restart` | `_probe_launchd_service_running()` | `gateway_windows.is_task_registered()`, `_pid_exists` helper |
| **D-Bus plumbing** | `_ensure_user_systemd_env`, `_user_systemd_socket_ready`, `_user_systemd_private_socket_path`, `get_systemd_linger_status` | — | — |
| **Unit/plist generation** | `generate_systemd_unit(system, run_as_user)`, `systemd_unit_is_current`, `refresh_systemd_unit_if_needed` | plist templating in `launchd_install` | `_build_gateway_cmd_script`, `_build_startup_launcher`, `_write_task_script` |
Container-relevant callers outside `gateway.py`:
- `hermes_cli/status.py` — gained an `s6` branch for in-container runs.
- `hermes_cli/profiles.py`: `create_profile` / `delete_profile` register and
unregister with s6 inside the container (no-op on host).
- `hermes_cli/doctor.py`: `_check_gateway_service_linger` skips on s6, and a
new "Service Supervisor" section reports main-hermes / dashboard /
profile-gateway counts via the ServiceManager.
- `hermes_cli/gateway.py::gateway_command` — the
`elif is_container():` rejection arms that refused gateway lifecycle
operations were removed; the `_dispatch_via_service_manager_if_s6` helper
intercepts start/stop/restart and routes them through s6.
### Per-profile gateway spawning
`hermes gateway start`, `coder gateway start` (profile alias), and
`hermes -p <profile> gateway start` all spawn a gateway process scoped to a
given profile. See
[Profiles: Running Gateways](https://hermes-agent.nousresearch.com/docs/user-guide/profiles#running-gateways).
On host, lifecycle is managed via per-profile systemd units
(`hermes-gateway-<profile>.service`); inside the container, an s6 service at
`/run/service/gateway-<name>/` is registered when the profile is created and
torn down when it's deleted.
**Persistence across container restart:** `/run/service/` is tmpfs —
service registrations are wiped when the container restarts. Profile
directories at `/opt/data/profiles/<name>/` live on the persistent VOLUME,
and each one records its gateway's last state in `gateway_state.json`.
`/etc/cont-init.d/02-reconcile-profiles` walks the persistent profiles on
every container boot, recreates the s6 service slots via
`hermes_cli/container_boot.py`, and auto-starts those whose last recorded
state was `running`. Profiles whose last state was `stopped`,
`startup_failed`, `starting`, or absent get their slot recreated in the
`down` state and wait for explicit user action. `docker restart` is therefore
invisible to a user with running profile gateways: they come back up;
stopped ones stay stopped.
### s6-overlay constraints
- **Root/non-root model:** `/init` runs as root to set up the supervision
tree, install signal handlers, and run the stage2 hook that does
`usermod`/`chown`. Each supervised service drops to UID 10000 via
`s6-setuidgid hermes` in its `run` script. The per-service `s6-supervise`
monitor stays root so it can signal its child regardless of UID. Net
effect: hermes and all its subprocesses run as UID 10000 exactly as
before; only the supervision tree itself runs as root.
- v3.2.3.0 has limited non-root support for running `/init` itself as
non-root — some tools (`fix-attrs`, `logutil-service`) assume root. We
don't hit this because `/init` runs as root.
- Scandir hard cap: `services_max` default 1000, configurable to 160,000.
- `/command/with-contenv` sources `/run/s6/container_environment/*` into
service env — convenient for passing `HERMES_HOME` etc.
- s6 signal semantics: service crash triggers `s6-supervise` restart after
1s; override with a `finish` script.
- Zombie reaping: PID 1 (s6-svscan) reaps all zombies non-blockingly on
SIGCHLD. Any subagent subprocess spawned by the main hermes process is
reaped automatically.
---
## Key Design Decisions
### D1. s6-overlay replaces tini entirely
Container ENTRYPOINT is `/init`, PID 1 is s6-svscan. The main hermes
process, the dashboard, and every per-profile gateway run as supervised
services. This is a single breaking change to the container contract.
### D2. Main hermes is an s6 service with container-exit semantics
The contract "container exits when `hermes` exits" is preserved via a
service `finish` script that writes to
`/run/s6-linux-init-container-results/exitcode` and calls
`/run/s6/basedir/bin/halt`. All five supported invocations work:
| `docker run <image> …` | Behavior |
|---|---|
| (no args) | `hermes` with no args, container exits when hermes exits |
| `chat -q "..."` | `hermes chat -q "..."`, container exits with hermes exit code |
| `sleep infinity` | `sleep infinity` directly (long-lived sandbox mode) |
| `bash` | interactive `bash` directly |
| `docker run -it … --tui` | interactive Ink TUI with real TTY — see D9 |
`docker/main-wrapper.sh` detects whether `$1` is an executable on PATH and
routes either to "run this as a one-shot main service" or "wrap with
hermes".
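
The routing rule can be illustrated in Python. The real logic lives in the shell script `docker/main-wrapper.sh`; the function name `route_main_command` and the injectable `which` parameter below are hypothetical, chosen so the rule can be exercised without touching the real PATH.

```python
import shutil
from typing import Callable, Optional


def route_main_command(
    argv: list[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Decide what the main service should exec.

    If the first argument resolves to an executable on PATH, run it as-is
    (e.g. `bash`, `sleep infinity`). Otherwise treat the arguments as
    hermes arguments and prefix them with `hermes`.
    """
    if argv and which(argv[0]) is not None:
        return list(argv)
    return ["hermes", *argv]
```

With no arguments the result is plain `hermes`, which matches the first row of the table above.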
### D3. Static services at build time; dynamic (per-profile) services at runtime
s6 offers two mechanisms:
- **s6-rc** (declarative, compile-then-swap): used for main hermes and the
dashboard — they're known at image build time.
- **scandir** (drop a directory + `s6-svscanctl -a`): used for per-profile
gateways — profiles are user-created after the image is built.
Per-profile gateway service dirs live at `/run/service/gateway-<profile>/`
(tmpfs, hermes-writable). s6-svscan picks them up on rescan.
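
A minimal sketch of the scandir mechanism, assuming a tmp+rename publication so s6-svscan never observes a half-written service directory. The helper names are hypothetical. The staging directory name starts with a dot on the assumption that s6-svscan skips dot-prefixed entries, and the `down` marker keeps the new slot stopped until it is started explicitly.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def publish_service_dir(scandir: Path, name: str, run_script: str) -> Path:
    """Create <scandir>/<name>/ atomically with an executable `run` script."""
    final = scandir / name
    if final.exists():
        raise FileExistsError(final)
    # Stage inside the scandir so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=f".{name}.", dir=scandir))
    staging.chmod(0o755)
    run_path = staging / "run"
    run_path.write_text(run_script, encoding="utf-8")
    run_path.chmod(0o755)
    (staging / "down").touch()  # slot is registered but not started
    os.rename(staging, final)
    return final


def request_rescan(scandir: Path) -> None:
    """Ask s6-svscan to rescan now (`s6-svscanctl -a <scandir>`)."""
    subprocess.run(["s6-svscanctl", "-a", str(scandir)], check=True)
```

Callers would invoke `request_rescan` after `publish_service_dir` returns; it needs write access to the scandir's control FIFO, so it only works inside the running container.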
### D4. ServiceManager protocol with two methods for runtime registration
Host paths (systemd, launchd, Windows Scheduled Tasks) need only
install/start/stop/restart of pre-declared services. Inside the container,
we additionally need to register services at runtime when a profile is
created. The protocol exposes this directly:
```python
class ServiceManager(Protocol):
kind: ServiceManagerKind # "systemd" | "launchd" | "windows" | "s6" | "none"
# Lifecycle of an already-declared service
def start(self, name: str) -> None: ...
def stop(self, name: str) -> None: ...
def restart(self, name: str) -> None: ...
def is_running(self, name: str) -> bool: ...
# Runtime registration (container-only; hosts raise NotImplementedError)
def supports_runtime_registration(self) -> bool: ...
def register_profile_gateway(
self, profile: str, *,
extra_env: dict[str, str] | None = None,
) -> None: ...
def unregister_profile_gateway(self, profile: str) -> None: ...
def list_profile_gateways(self) -> list[str]: ...
```
Systemd, launchd, and Windows backends raise `NotImplementedError` on the
registration methods. Only the s6 backend implements them. Callers check
`supports_runtime_registration()` before calling.
The scope is intentionally narrow: it's specifically "register/unregister a
profile gateway," not a general-purpose process-management API.
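
The calling convention can be sketched with two stand-in managers. `FakeS6Manager`, `FakeHostManager`, and `on_profile_created` are hypothetical names for illustration only; they mimic the protocol's shape and are not the shipped backends.

```python
class FakeS6Manager:
    """Stand-in for the s6 backend: supports runtime registration."""

    kind = "s6"

    def __init__(self) -> None:
        self._profiles: list[str] = []

    def supports_runtime_registration(self) -> bool:
        return True

    def register_profile_gateway(self, profile, *, extra_env=None) -> None:
        if profile not in self._profiles:
            self._profiles.append(profile)

    def unregister_profile_gateway(self, profile) -> None:
        if profile in self._profiles:
            self._profiles.remove(profile)

    def list_profile_gateways(self) -> list[str]:
        return list(self._profiles)


class FakeHostManager:
    """Stand-in for a host backend: registration is not implemented."""

    kind = "systemd"

    def supports_runtime_registration(self) -> bool:
        return False

    def register_profile_gateway(self, profile, *, extra_env=None) -> None:
        raise NotImplementedError("runtime registration is container-only")


def on_profile_created(mgr, profile: str) -> bool:
    """Register the gateway slot only when the backend supports it."""
    if not mgr.supports_runtime_registration():
        return False  # host path: nothing to do
    mgr.register_profile_gateway(profile)
    return True
```

Checking the capability first means host code never has to catch `NotImplementedError` on the normal path.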
### D5. Per-profile gateway service spec is fixed, not user-provided
Every profile gateway has the same command shape
(`hermes -p <profile> gateway run`, or `hermes gateway run` for the default
profile). The s6 backend generates the `run` script from a fixed template
given the profile name — no arbitrary command list. This keeps the API
surface tight and prevents callers from accidentally registering
non-gateway services.
Port selection is governed by the profile's `config.yaml`
(`[gateway] port = …`) — the single source of truth. (The original plan
proposed a Python-side SHA-256 port allocator with a 600-port range; it was
retired during PR review because it was dead code through the entire stack.)
### D6. Add detect_service_manager() alongside supports_systemd_services()
`supports_systemd_services()` stays as-is (host code paths unchanged). A new
`detect_service_manager() -> Literal["systemd", "launchd", "windows", "s6", "none"]`
composes existing detection functions (`is_macos()`, `is_windows()`,
`supports_systemd_services()`, `is_container()` + `_s6_running()`) and adds
an s6 branch for container detection. Host call sites continue to use the
existing functions; container-only code (the profile hooks) uses the new one.
`_s6_running()` probes `/proc/1/comm` (world-readable) and
`/run/s6/basedir`. The earlier `/proc/1/exe` probe was root-only readable
and silently failed for the unprivileged hermes user (UID 10000), making
the entire runtime-registration path inert in production — caught in PR
review.
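
A minimal sketch of the probe, under two assumptions not confirmed by this document: that PID 1's `comm` reads `s6-svscan`, and that both probes must succeed. The function name and the injectable paths are for illustration and testing only.

```python
from pathlib import Path


def s6_running(
    proc_comm: Path = Path("/proc/1/comm"),
    basedir: Path = Path("/run/s6/basedir"),
) -> bool:
    """Return True when PID 1 looks like s6-svscan and s6's basedir exists.

    /proc/1/comm is world-readable, so this works for an unprivileged
    user, unlike /proc/1/exe, which needs root to resolve.
    """
    try:
        comm = proc_comm.read_text(encoding="utf-8").strip()
    except OSError:
        return False
    return comm == "s6-svscan" and basedir.exists()
```

Any read failure is treated as "not s6" rather than raised, so the worst case on a host is a clean fall-through to the existing detection functions.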
### D7. Wrap existing systemd/launchd/windows functions, don't rewrite them
`SystemdServiceManager` / `LaunchdServiceManager` / `WindowsServiceManager`
are thin adapters over the existing `systemd_*` / `launchd_*` module-level
functions in `hermes_cli/gateway.py` and the
`gateway_windows.install/uninstall/start/stop/restart/is_installed`
functions in `hermes_cli/gateway_windows.py`. We get the abstraction
without rewriting ~2,200 LOC of working code.
### D8. Profile create/delete hooks register/unregister the s6 service
When `hermes profile create <name>` runs inside the container, the
profile-creation code path calls
`ServiceManager.register_profile_gateway(<name>)` if
`supports_runtime_registration()` is True. When `hermes profile delete
<name>` runs, it calls `unregister_profile_gateway(<name>)`. On host, both
calls are no-ops (registration not supported; existing systemd unit
generation continues to handle install/uninstall).
Existing per-profile `hermes -p <profile> gateway start/stop/restart` CLI
commands continue to work — in the container they dispatch to
`ServiceManager.start/stop/restart("gateway-<profile>")`, which translates
to `s6-svc -u`/`-d`/`-t` on the service dir.
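
The translation can be sketched as a small lookup. The flag mapping follows the text above; the function names and the default scandir argument are hypothetical. `PurePosixPath` is used so the argv is built the same way regardless of the host running the example.

```python
import subprocess
from pathlib import PurePosixPath

# start -> bring up, stop -> bring down, restart -> send SIGTERM and let
# s6-supervise bring the service back up.
_S6_SVC_FLAGS = {"start": "-u", "stop": "-d", "restart": "-t"}


def s6_svc_argv(
    action: str,
    service: str,
    scandir: PurePosixPath = PurePosixPath("/run/service"),
) -> list[str]:
    """Build the s6-svc command line for a lifecycle action."""
    try:
        flag = _S6_SVC_FLAGS[action]
    except KeyError:
        raise ValueError(f"unsupported action: {action!r}") from None
    return ["s6-svc", flag, str(scandir / service)]


def run_s6_svc(action: str, service: str) -> None:
    """Execute the command; raises CalledProcessError on failure."""
    subprocess.run(
        s6_svc_argv(action, service),
        check=True,
        capture_output=True,
        text=True,
    )
```

Separating argv construction from execution keeps the mapping unit-testable outside a container.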
`hermes gateway start` (no `-p`) targets a special `gateway-default` slot
that's always registered by the cont-init reconciler. Its run script omits
the `-p` flag and runs against the root `$HERMES_HOME` profile.
`--all` lifecycle (`hermes gateway stop --all`, `... restart --all`)
iterates `mgr.list_profile_gateways()` through s6 so s6's `want up`/`want
down` flips correctly. Without this, `--all` fell through to `pkill`
followed by s6-supervise auto-restart — net effect: kick instead of stop.
### D9. Interactive TUI bypasses s6 service-mode and runs as CMD for TTY passthrough
`docker run -it --rm <image> --tui` needs a real TTY connected to container
stdin/stdout for Ink raw-mode keyboard input, cursor control, and SIGWINCH.
Running the TUI as a normal s6 service fails because s6-supervise
disconnects service stdio from the container TTY (documented:
[s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230)).
**The pattern:** s6-overlay's `/init` execs a CMD as the container's "main
program" after the supervision tree is up. The CMD inherits
stdin/stdout/stderr from `/init` — which in `-it` mode is the container
TTY. The stage2 hook detects the TUI case and short-circuits the
main-hermes service so the hermes CMD becomes that main program.
```sh
# In docker/stage2-hook.sh
_is_tui_invocation() {
for arg in "$@"; do
case "$arg" in --tui|-T) return 0 ;; esac
done
case "${HERMES_TUI:-}" in 1|true|TRUE|yes) return 0 ;; esac
if [ -t 0 ] && [ $# -eq 0 ]; then return 0; fi
return 1
}
```
And in `docker/s6-rc.d/main-hermes/run`:
```sh
if [ -f /var/run/s6/container_environment/HERMES_TUI_MODE ]; then
exec sleep infinity # s6-overlay will exec CMD as the TTY-connected main
fi
exec s6-setuidgid hermes hermes ${HERMES_ARGS:-}
```
In TUI mode main hermes is effectively unsupervised (same as the pre-s6
behavior with tini — acceptable because the user is interactively
present). Dashboard and profile gateways still get full s6 supervision via
their separate services.
The integration test `test_tty_passthrough_to_container` uses `tput cols`
and `COLUMNS=123` as the probe.
---
## Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Phase 2 breaks a downstream user's Dockerfile that `FROM`s ours | Medium | Medium | Release notes call out ENTRYPOINT change; the test harness (`tests/docker/`) gives high confidence in behavior parity |
| TUI TTY passthrough fails on some Docker versions | Low | High | Harness includes `test_tty_passthrough_to_container` as a hard gate; fallback plan = s6-fdholder ([s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230) Solution 2) |
| s6-overlay non-root quirks (logutil-service, fix-attrs) bite us | Low | Low | Supervisor runs as root, services drop — sidesteps these issues |
| Podman rootless UID mapping confuses s6 | Medium | Low | Documented as supported, fix reactively; a Podman + Docker environment is stood up for validation |
| Test harness is flaky (docker daemon issues, timing) | Medium | Low | Generous timeouts; skip when docker unavailable; polling helpers replace fixed sleeps in `test_container_restart.py` |
| Profile gateway crash loop masks a real config error | Low | Medium | s6 `finish` script `max_restarts` cap (planned follow-up); operators see crash-looping logs in `$HERMES_HOME/logs/gateways/<profile>/` |
| Dockerfile+entrypoint drift from linter (hadolint/shellcheck) reveals latent bugs | Low | Low | CI lint jobs catch them; fix or document ignore with rationale |
| Stale `gateway.pid` from a dead container collides with an unrelated live PID in the restarted container | Low | Medium | Cont-init reconciliation removes `gateway.pid` and `processes.json` from every profile dir on boot, before any new gateway starts |
| `docker restart` silently loses per-profile gateway registrations (tmpfs scandir wiped) | High (without mitigation) | High | Cont-init reconciliation re-registers from persistent `$HERMES_HOME/profiles/` and auto-starts those last seen `running`; outcome recorded to `$HERMES_HOME/logs/container-boot.log` (size-bounded, rotates to `.1` at 256 KiB) |
| A `running` gateway that's actually broken auto-restarts into a crash loop after every container restart | Low | Medium | s6 `finish` script `max_restarts` cap (planned); follow-up: `hermes doctor` alerts when N consecutive container restarts ended in `startup_failed` |
| `_s6_running()` detection works as root but silently fails for unprivileged hermes user, making runtime-registration path inert | High (without mitigation) | High | **Caught in PR review.** Detection now probes `/proc/1/comm` (world-readable) + `/run/s6/basedir`. Docker integration tests refactored to `docker exec -u hermes` so the realistic runtime user is exercised |
| `s6-svscanctl` from hermes hits EACCES on the root-owned control FIFO | Medium | Medium | `02-reconcile-profiles` chowns `/run/service/.s6-svscan/{control,lock}` to hermes after stage1 creates them |
| Per-service `supervise/control` FIFO is root-owned by s6-supervise, blocking `s6-svc` from hermes | Known | Medium | Surfaced cleanly as `S6CommandError` (with rc + stderr) instead of raw `CalledProcessError`. Permission fix tracked as a follow-up (small SUID helper, polling chown loop in cont-init.d, or replace `s6-svc` with `down`-marker manipulation) |
---
## Decision Log
| # | Question | Decision |
|---|---|---|
| OQ1 | Gate Phase 2 behind env var? | Ship directly (Hermes is pre-1.0; users can pin the previous image) |
| OQ2 | s6 root model | Root `/init`, drop per-service via `s6-setuidgid hermes` |
| OQ3 | Dashboard opt-in mechanism | Always declared as an s6 service; `03-dashboard-toggle` cont-init script writes a `down` marker when `HERMES_DASHBOARD` is unset so `s6-svstat` reports the slot's real state |
| OQ4 | Podman rootless | Supported, fix reactively |
| OQ5 | Service naming | `gateway-<profile>` (matches pre-existing `hermes-gateway-<profile>.service` systemd convention) |
| OQ6 | — (retired; no subagent gateways in scope) | — |
| OQ7 | Resource limits per profile gateway | Defer (no per-cgroup limits; rely on the container's overall limit) |
| OQ8 | Log persistence | `$HERMES_HOME/logs/gateways/<profile>/`. The log path is sourced from runtime `$HERMES_HOME` via `with-contenv`, NOT Python-substituted at registration time |
| OQ9 | TUI passthrough | Trust the documented [s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230) Solution 1; harness includes a TTY passthrough hard-gate test |
**Post-merge additions from PR #30136 review:**
- **Multi-arch tarballs:** `TARGETARCH` mapped to `x86_64` / `aarch64`;
per-arch tarball fetched via `curl` because `ADD` doesn't honor BuildKit
args.
- **SHA256 verification:** all three tarballs (noarch, symlinks, per-arch)
pinned via build ARGs and verified with `sha256sum -c` against a single
checksum file (avoids hadolint DL4006 piped-shell warning).
- **`gateway-default` slot:** always registered by the reconciler so
`hermes gateway start` (no `-p`) has somewhere to land.
- **Friendly lifecycle errors:** `GatewayNotRegisteredError` and
`S6CommandError` translate `CalledProcessError` into actionable CLI
messages.
- **Atomic publication in the reconciler:** mirrors
`register_profile_gateway`'s tmp+rename pattern.
- **`container-boot.log` rotation:** 256 KiB soft cap, rotated to `.1`.
- **`port` parameter retired:** allocator + kwarg were dead code through
the entire stack; `config.yaml` is the single source of truth.
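
The `container-boot.log` soft cap can be sketched as follows. This is an illustration of the "rotate to `.1` at 256 KiB" rule only; the function name `append_boot_log` and the check-before-append ordering are assumptions, not the reconciler's actual code.

```python
import os
from pathlib import Path

MAX_BYTES = 256 * 1024  # 256 KiB soft cap


def append_boot_log(
    log_path: Path, line: str, max_bytes: int = MAX_BYTES
) -> None:
    """Append one line, rotating to `<name>.1` once the cap is reached.

    The cap is soft: the size is checked before the append, so the file
    may exceed max_bytes by up to one line. A previous `.1` is overwritten.
    """
    try:
        if log_path.stat().st_size >= max_bytes:
            os.replace(log_path, log_path.with_name(log_path.name + ".1"))
    except FileNotFoundError:
        pass  # first write: nothing to rotate
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line.rstrip("\n") + "\n")
```

Keeping a single `.1` generation bounds total disk use at roughly twice the cap.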
---
## Verification Checklist
- [x] Test harness (`tests/docker/`) passes against the s6 image
- [x] hadolint + shellcheck run green in CI
- [x] `docker run -it --rm hermes-agent --tui` starts the Ink TUI with
working keyboard input, cursor control, and resize (SIGWINCH)
- [x] Dashboard crashes are recovered by s6 within ~2s
- [x] `hermes profile create test` inside a container creates
`/run/service/gateway-test/`
- [x] `hermes -p test gateway start` inside a container dispatches through s6
- [x] `hermes -p test gateway stop` inside a container cleanly stops via s6
- [x] `hermes profile delete test` inside a container removes
`/run/service/gateway-test/`
- [x] Profile gateway logs persist at
`$HERMES_HOME/logs/gateways/test/current`
- [x] `hermes status` inside the container shows `Manager: s6`
- [x] `hermes gateway start` (no `-p`) inside a container targets
`gateway-default` and runs against the root profile
- [x] `hermes gateway stop --all` / `... restart --all` iterate every
profile gateway under s6 instead of pkill-then-supervise-restart
- [x] `docker restart` survives per-profile gateway registrations via the
cont-init reconciler; running gateways come back up, stopped ones
stay down
- [x] Multi-arch image builds for both `linux/amd64` and `linux/arm64`
- [x] s6-overlay tarballs are SHA256-verified at build time
- [x] No systemd/launchd host-side functions were modified (only wrapped)
- [x] `hermes gateway install/start/stop` on Linux host and macOS host
behave identically to pre-change
+101 -23
@@ -424,7 +424,9 @@ _PLATFORM_CONNECTED_CHECKERS: dict[Platform, Callable[[PlatformConfig], bool]] =
Platform.SMS: lambda cfg: bool(os.getenv("TWILIO_ACCOUNT_SID")),
Platform.API_SERVER: lambda cfg: True,
Platform.WEBHOOK: lambda cfg: True,
Platform.MSGRAPH_WEBHOOK: lambda cfg: True,
Platform.MSGRAPH_WEBHOOK: lambda cfg: bool(
str(cfg.extra.get("client_state") or "").strip()
),
Platform.FEISHU: lambda cfg: bool(cfg.extra.get("app_id")),
Platform.WECOM: lambda cfg: bool(cfg.extra.get("bot_id")),
Platform.WECOM_CALLBACK: lambda cfg: bool(
@@ -1811,6 +1813,17 @@ def _apply_env_overrides(config: GatewayConfig) -> None:
# need to seed ``PlatformConfig.extra`` from env vars (e.g. Google Chat's
# project_id / subscription_name) can supply ``env_enablement_fn`` on
# their PlatformEntry — called here BEFORE adapter construction.
#
# Enablement gate (#31116): when a plugin registers ``is_connected``
# (the "has the user actually configured credentials for this?" check),
# we MUST consult it before flipping ``enabled = True``. Otherwise
# ``check_fn`` alone — which for adapter plugins typically just
# verifies the SDK is importable / lazy-installs it — silently enables
# platforms the user never opted into, and the gateway then tries to
# connect to Discord / Teams / Google Chat with no token and emits
# noisy retry-forever errors. ``_platform_status`` was already fixed
# for the same bug class in commit 7849a3d73; this is the runtime
# counterpart.
try:
from hermes_cli.plugins import discover_plugins
discover_plugins() # idempotent
@@ -1823,34 +1836,99 @@ def _apply_env_overrides(config: GatewayConfig) -> None:
logger.debug("check_fn for %s raised: %s", entry.name, e)
continue
platform = Platform(entry.name)
if platform not in config.platforms:
config.platforms[platform] = PlatformConfig()
config.platforms[platform].enabled = True
# Seed extras from env if the plugin opted in.
existing_cfg = config.platforms.get(platform)
# Seed candidate extras from ``env_enablement_fn`` so plugins
# whose ``is_connected`` reads ``config.extra`` (e.g. Google
# Chat's ``_is_connected`` checks ``config.extra["project_id"]``)
# see the same state they will after enablement. Without this,
# Google-Chat-on-env-vars-only setups silently fail the gate
# below even though the user is configured. Plugins whose
# ``is_connected`` reads env vars directly (Discord, IRC,
# Teams, LINE, ntfy, Simplex) are unaffected; this only
# restores Google Chat.
seed_for_probe = None
if entry.env_enablement_fn is not None:
try:
seed = entry.env_enablement_fn()
seed_for_probe = entry.env_enablement_fn()
except Exception as e:
logger.debug(
"env_enablement_fn for %s raised: %s", entry.name, e
)
seed = None
if isinstance(seed, dict) and seed:
# Extract the home_channel dict (if provided) so we wire it
# up as a proper HomeChannel dataclass. Everything else is
# merged into ``extra``.
home = seed.pop("home_channel", None)
config.platforms[platform].extra.update(seed)
if isinstance(home, dict) and home.get("chat_id"):
config.platforms[platform].home_channel = HomeChannel(
platform=platform,
chat_id=str(home["chat_id"]),
name=str(home.get("name") or "Home"),
thread_id=(
str(home["thread_id"])
if home.get("thread_id")
else None
),
seed_for_probe = None
# Only consult is_connected for platforms that are NOT already
# explicitly configured in YAML / env (existing_cfg with
# enabled=True means the user wrote it themselves or another
# env-var bridge enabled it — keep that decision).
if existing_cfg is None or not existing_cfg.enabled:
if entry.is_connected is not None:
try:
# Probe with ``enabled=True`` since we're asking
# "would this plugin BE configured if we enabled
# it?" not "is it currently enabled?". Google
# Chat's ``_is_connected`` short-circuits on
# ``config.enabled`` being False, which on the
# default ``PlatformConfig()`` would fail the
# gate even with proper env vars set.
if existing_cfg is not None:
probe_cfg = existing_cfg
if not probe_cfg.enabled:
probe_cfg = PlatformConfig(
enabled=True,
extra=dict(probe_cfg.extra or {}),
)
else:
probe_cfg = PlatformConfig(enabled=True)
if isinstance(seed_for_probe, dict) and seed_for_probe:
# Don't mutate ``existing_cfg``; the probe gets
# a transient view with env-seeded extras layered
# on top of whatever's already there.
probe_extra = dict(getattr(probe_cfg, "extra", {}) or {})
for k, v in seed_for_probe.items():
if k == "home_channel":
continue
probe_extra.setdefault(k, v)
probe_cfg = PlatformConfig(
enabled=True,
extra=probe_extra,
)
configured = bool(entry.is_connected(probe_cfg))
except Exception as exc:
logger.debug(
"is_connected for %s raised: %s — skipping enablement",
entry.name, exc,
)
configured = False
if not configured:
logger.debug(
"Plugin platform '%s' available but not configured "
"(is_connected returned False) — skipping enable",
entry.name,
)
continue
if platform not in config.platforms:
config.platforms[platform] = PlatformConfig()
config.platforms[platform].enabled = True
# Commit env-seeded extras onto the now-enabled platform.
# We've already called ``env_enablement_fn`` above (for the
# probe); reuse that result instead of calling it twice.
if isinstance(seed_for_probe, dict) and seed_for_probe:
seed = dict(seed_for_probe)
# Extract the home_channel dict (if provided) so we wire it
# up as a proper HomeChannel dataclass. Everything else is
# merged into ``extra``.
home = seed.pop("home_channel", None)
config.platforms[platform].extra.update(seed)
if isinstance(home, dict) and home.get("chat_id"):
config.platforms[platform].home_channel = HomeChannel(
platform=platform,
chat_id=str(home["chat_id"]),
name=str(home.get("name") or "Home"),
thread_id=(
str(home["thread_id"])
if home.get("thread_id")
else None
),
)
except Exception as e:
logger.debug("Plugin platform enable pass failed: %s", e)
+28
@@ -35,6 +35,7 @@ import re
import sqlite3
import time
import uuid
from pathlib import Path
from typing import Any, Dict, List, Optional
try:
@@ -337,10 +338,12 @@ class ResponseStore:
db_path = str(get_hermes_home() / "response_store.db")
except Exception:
db_path = ":memory:"
self._db_path: Optional[str] = db_path if db_path != ":memory:" else None
try:
self._conn = sqlite3.connect(db_path, check_same_thread=False)
except Exception:
self._conn = sqlite3.connect(":memory:", check_same_thread=False)
self._db_path = None
# Use shared WAL-fallback helper so response_store.db degrades
# gracefully on NFS/SMB/FUSE-mounted HERMES_HOME (same filesystem
# issue addressed for state.db/kanban.db — see
@@ -361,6 +364,31 @@ class ResponseStore:
)"""
)
self._conn.commit()
# response_store.db contains conversation history (tool payloads,
# prompts, results). Tighten to owner-only after creation so other
# local users on a shared box can't read it. Run once at __init__
# rather than after every commit — chmod-on-every-write is wasted
# syscalls on a hot path.
self._tighten_file_permissions()
def _tighten_file_permissions(self) -> None:
"""Force owner-only permissions on the DB and SQLite sidecars."""
if not self._db_path:
return
for candidate in (
Path(self._db_path),
Path(f"{self._db_path}-wal"),
Path(f"{self._db_path}-shm"),
):
try:
if candidate.exists():
candidate.chmod(0o600)
except OSError:
logger.debug(
"Failed to restrict response store permissions for %s",
candidate,
exc_info=True,
)
def get(self, response_id: str) -> Optional[Dict[str, Any]]:
"""Retrieve a stored response by ID (updates access time for LRU)."""
+228 -25
@@ -15,6 +15,7 @@ import re
import socket as _socket
import subprocess
import sys
import time
import uuid
from abc import ABC, abstractmethod
from urllib.parse import urlsplit
@@ -40,6 +41,16 @@ def _platform_name(platform) -> str:
return str(value or "").lower()
def _float_env(name: str, default: float) -> float:
raw = os.environ.get(name, "").strip()
if not raw:
return default
try:
return float(raw)
except (TypeError, ValueError):
return default
def _thread_metadata_for_source(source, reply_to_message_id: str | None = None) -> dict | None:
"""Build platform-aware thread metadata for adapter sends.
@@ -1103,6 +1114,14 @@ class MessageEvent:
return args
@dataclass
class TextDebounceState:
event: MessageEvent
task: asyncio.Task | None
first_ts: float
last_ts: float
_PLAINTEXT_GATEWAY_RESTART_PATTERNS: tuple[re.Pattern[str], ...] = (
re.compile(r"^(?:please\s+)?restart\s+(?:the\s+)?gateway[.!?\s]*$", re.IGNORECASE),
re.compile(r"^(?:please\s+)?restart\s+(?:the\s+)?hermes\s+gateway[.!?\s]*$", re.IGNORECASE),
@@ -1398,6 +1417,17 @@ class BasePlatformAdapter(ABC):
self._active_sessions: Dict[str, asyncio.Event] = {}
self._pending_messages: Dict[str, MessageEvent] = {}
self._session_tasks: Dict[str, asyncio.Task] = {}
self._busy_text_mode: str = (
os.environ.get("HERMES_GATEWAY_BUSY_TEXT_MODE", "queue").strip().lower()
or "queue"
)
self._busy_text_debounce_seconds: float = _float_env(
"HERMES_GATEWAY_BUSY_TEXT_DEBOUNCE_SECONDS", 0.35
)
self._busy_text_hard_cap_seconds: float = _float_env(
"HERMES_GATEWAY_BUSY_TEXT_HARD_CAP_SECONDS", 1.0
)
self._text_debounce: dict[str, TextDebounceState] = {}
# Background message-processing tasks spawned by handle_message().
# Gateway shutdown cancels these so an old gateway instance doesn't keep
# working on a task after --replace or manual restarts.
@@ -2725,6 +2755,161 @@ class BasePlatformAdapter(ABC):
return f"{existing_text}\n\n{new_text}".strip()
return existing_text
def _text_debounce_store(self) -> dict[str, TextDebounceState]:
store = getattr(self, "_text_debounce", None)
if store is None:
store = {}
self._text_debounce = store
return store
def _is_queue_text_debounce_candidate(self, event: MessageEvent) -> bool:
"""Return True for normal text eligible for queue-mode debounce."""
result = (
getattr(self, "_busy_text_mode", "queue") == "queue"
and event.message_type == MessageType.TEXT
and not getattr(event, "internal", False)
and not event.is_command()
and bool((event.text or "").strip())
)
if result:
logger.debug(
"[%s] Queue-text debounce candidate accepted: session=%s text_len=%d",
self.name,
getattr(event, "session_key", "?"),
len(event.text or ""),
)
return result
def _can_merge_text_debounce_events(self, existing: MessageEvent, event: MessageEvent) -> bool:
"""Return True when two text debounce events came from the same sender."""
def _identity(candidate: MessageEvent) -> tuple[str, ...] | None:
source = getattr(candidate, "source", None)
if source is None:
return None
platform = _platform_name(getattr(source, "platform", None))
sender = getattr(source, "user_id_alt", None) or getattr(source, "user_id", None)
if sender:
return (platform, str(sender))
if getattr(source, "chat_type", None) in {"dm", "private"} and getattr(source, "chat_id", None):
return (platform, "dm", str(source.chat_id))
return None
existing_sender = _identity(existing)
incoming_sender = _identity(event)
return existing_sender is not None and existing_sender == incoming_sender
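The sender-identity rule above can be sketched in isolation. This is an illustration only: `SimpleNamespace` objects stand in for the real event source, and the platform is lowercased directly instead of going through `_platform_name`.

```python
from types import SimpleNamespace


def sender_identity(source):
    """Return a tuple identifying the sender, or None when it is unknown."""
    if source is None:
        return None
    platform = str(getattr(source, "platform", "") or "").lower()
    sender = getattr(source, "user_id_alt", None) or getattr(source, "user_id", None)
    if sender:
        return (platform, str(sender))
    # Without a user id, only a DM chat identifies exactly one sender.
    if getattr(source, "chat_type", None) in {"dm", "private"} and getattr(source, "chat_id", None):
        return (platform, "dm", str(source.chat_id))
    return None


def can_merge(a, b) -> bool:
    left, right = sender_identity(a), sender_identity(b)
    return left is not None and left == right


alice = SimpleNamespace(platform="Telegram", user_id="42", chat_type="group", chat_id="g1")
bob = SimpleNamespace(platform="Telegram", user_id="43", chat_type="group", chat_id="g1")
anonymous = SimpleNamespace(platform="Telegram", user_id=None, chat_type="group", chat_id="g1")

same_sender = can_merge(alice, alice)
different_sender = can_merge(alice, bob)
unknown_sender = can_merge(anonymous, anonymous)
```

An unidentified sender in a group never merges, even with itself, so text from different people is not glued together when attribution is missing.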
def _text_debounce_delay(self, session_key: str) -> float:
"""Return bounded busy-text debounce delay for ``session_key``."""
state = self._text_debounce_store().get(session_key)
if state is None:
return 0.0
now = time.monotonic()
window_deadline = state.last_ts + self._busy_text_debounce_seconds
hard_cap_deadline = state.first_ts + self._busy_text_hard_cap_seconds
return max(0.0, min(window_deadline, hard_cap_deadline) - now)
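A worked example of the delay arithmetic above, using the default 0.35 s window and 1.0 s hard cap. The timestamps are invented, and `now` is passed in explicitly (instead of calling `time.monotonic()`) so the numbers are reproducible.

```python
def debounce_delay(first_ts: float, last_ts: float, now: float,
                   window: float = 0.35, hard_cap: float = 1.0) -> float:
    """Time left until the earlier of the sliding window and the hard cap."""
    window_deadline = last_ts + window
    hard_cap_deadline = first_ts + hard_cap
    return max(0.0, min(window_deadline, hard_cap_deadline) - now)


# A burst that starts at t=10.0 (arbitrary monotonic seconds).
d_first = debounce_delay(10.0, 10.0, 10.0)  # window applies: about 0.35
d_early = debounce_delay(10.0, 10.3, 10.3)  # window slides: about 0.35 again
d_late = debounce_delay(10.0, 10.9, 10.9)   # hard cap at 11.0 wins: about 0.1
d_over = debounce_delay(10.0, 11.2, 11.2)   # cap already passed: 0.0
```

Each new message pushes the window deadline out, but the flush can never be scheduled later than `first_ts + hard_cap`, which bounds the latency a steady typist can add.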
async def _queue_text_debounce(self, session_key: str, event: MessageEvent) -> None:
"""Buffer normal queue-mode busy text and schedule a bounded flush."""
store = self._text_debounce_store()
state = store.get(session_key)
if state is not None and not self._can_merge_text_debounce_events(state.event, event):
# Preserve sender attribution in shared sessions. The current
# buffer becomes the next pending turn; the new sender starts a
# fresh debounce burst when the pending slot allows it.
await self._flush_text_debounce_now(session_key)
state = store.get(session_key)
if state is not None and not self._can_merge_text_debounce_events(state.event, event):
existing_pending = self._pending_messages.get(session_key)
if existing_pending is not None and self._can_merge_text_debounce_events(existing_pending, event):
merge_pending_message_event(
self._pending_messages,
session_key,
event,
merge_text=True,
)
return
now = time.monotonic()
if state is None:
state = TextDebounceState(
event=event,
task=None,
first_ts=now,
last_ts=now,
)
store[session_key] = state
else:
if event.text:
state.event.text = (
f"{state.event.text}\n{event.text}"
if state.event.text
else event.text
)
latest_message_id = getattr(event, "message_id", None)
latest_anchor = latest_message_id or getattr(event, "reply_to_message_id", None)
if latest_message_id is not None:
state.event.message_id = str(latest_message_id)
if latest_anchor is not None and hasattr(state.event, "reply_to_message_id"):
state.event.reply_to_message_id = str(latest_anchor)
state.last_ts = now
if state.task is not None and not state.task.done():
state.task.cancel()
delay = self._text_debounce_delay(session_key)
state.task = asyncio.create_task(self._flush_text_debounce(session_key, delay))
async def _flush_text_debounce(self, session_key: str, delay: float) -> None:
"""Timer task that flushes the debounced text buffer."""
try:
await asyncio.sleep(delay)
await self._flush_text_debounce_now(session_key)
except asyncio.CancelledError:
return
finally:
current = asyncio.current_task()
state = self._text_debounce_store().get(session_key)
if state is not None and state.task is current:
state.task = None
async def _flush_text_debounce_now(self, session_key: str) -> bool:
"""Force-flush one debounced busy-text burst into the pending slot."""
store = self._text_debounce_store()
state = store.get(session_key)
if state is None:
return False
current = asyncio.current_task()
if state.task is not None and state.task is not current and not state.task.done():
state.task.cancel()
state.task = None
existing_pending = self._pending_messages.get(session_key)
if (
existing_pending is not None
and not self._can_merge_text_debounce_events(existing_pending, state.event)
):
return False
state = store.pop(session_key, None)
if state is None:
return False
merge_pending_message_event(
self._pending_messages,
session_key,
state.event,
merge_text=True,
)
return True
def _discard_text_debounce(self, session_key: str) -> None:
"""Cancel and drop pending text debounce state for control commands."""
state = self._text_debounce_store().pop(session_key, None)
if state is not None and state.task is not None and not state.task.done():
state.task.cancel()
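The debounce helpers above combine a per-session buffer, a cancellable timer task, and a flush into the pending slot. The following is a reduced sketch of that pattern, not the adapter's code: `TextDebouncer`, `Burst`, and the `flushed` dict are hypothetical names, sender checks are omitted, and the timings are enlarged for the demo.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Burst:
    text: str
    task: Optional[asyncio.Task]
    first_ts: float
    last_ts: float


class TextDebouncer:
    def __init__(self, window: float = 0.2, hard_cap: float = 1.0) -> None:
        self.window = window
        self.hard_cap = hard_cap
        self.bursts: Dict[str, Burst] = {}
        self.flushed: Dict[str, List[str]] = {}  # stands in for the pending slot

    def push(self, key: str, text: str) -> None:
        """Buffer text for ``key`` and (re)schedule a bounded flush."""
        now = time.monotonic()
        burst = self.bursts.get(key)
        if burst is None:
            burst = Burst(text=text, task=None, first_ts=now, last_ts=now)
            self.bursts[key] = burst
        else:
            burst.text = f"{burst.text}\n{text}"
            burst.last_ts = now
        # Replace the previous timer; only the newest one may flush.
        if burst.task is not None and not burst.task.done():
            burst.task.cancel()
        deadline = min(burst.last_ts + self.window, burst.first_ts + self.hard_cap)
        burst.task = asyncio.create_task(self._flush_later(key, max(0.0, deadline - now)))

    async def _flush_later(self, key: str, delay: float) -> None:
        try:
            await asyncio.sleep(delay)
        except asyncio.CancelledError:
            return
        burst = self.bursts.pop(key, None)
        if burst is not None:
            self.flushed.setdefault(key, []).append(burst.text)


async def demo() -> Dict[str, List[str]]:
    debouncer = TextDebouncer()
    for chunk in ("A", "B", "C"):
        debouncer.push("session-1", chunk)
        await asyncio.sleep(0.01)  # messages arrive faster than the window
    await asyncio.sleep(0.5)       # let the last timer fire
    return debouncer.flushed


result = asyncio.run(demo())
print(result)
```

Three rapid messages end up as one newline-joined follow-up instead of three separate turns, which is the behaviour the queue-mode path is aiming for.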
# ------------------------------------------------------------------
# Session task + guard ownership helpers
# ------------------------------------------------------------------
@@ -2794,6 +2979,7 @@ class BasePlatformAdapter(ABC):
self._active_sessions.pop(session_key, None)
self._pending_messages.pop(session_key, None)
self._session_tasks.pop(session_key, None)
self._discard_text_debounce(session_key)
return True
def _start_session_processing(
@@ -2875,6 +3061,7 @@ class BasePlatformAdapter(ABC):
)
if discard_pending:
self._pending_messages.pop(session_key, None)
self._discard_text_debounce(session_key)
if release_guard:
self._release_session_guard(session_key)
@@ -2889,6 +3076,7 @@ class BasePlatformAdapter(ABC):
command-scoped guard, then if a follow-up message landed while the
command was running spawns a fresh processing task for it.
"""
await self._flush_text_debounce_now(session_key)
pending_event = self._pending_messages.pop(session_key, None)
self._release_session_guard(session_key, guard=command_guard)
if pending_event is None:
@@ -3020,6 +3208,7 @@ class BasePlatformAdapter(ABC):
# through the dedicated handoff path that serializes
# cancellation + runner response + pending drain.
if cmd in {"stop", "new", "reset"}:
self._discard_text_debounce(session_key)
try:
await self._dispatch_active_session_command(event, session_key, cmd)
except Exception as e:
@@ -3064,8 +3253,9 @@ class BasePlatformAdapter(ABC):
# clarify-intercept can resolve it and unblock the agent.
#
# Without this bypass: the message gets queued in
# _pending_messages AND triggers an interrupt, killing the
# agent run mid-clarify and discarding the user's answer.
# _pending_messages as a follow-up turn instead of reaching the
# clarify resolver, leaving the agent blocked and discarding the
# user's answer.
# Same shape as the /approve deadlock fix (PR #4926) — both
# cases are "agent thread blocked on Event.wait, message must
# reach the resolver before being treated as a new turn."
@@ -3124,27 +3314,28 @@ class BasePlatformAdapter(ABC):
merge_pending_message_event(self._pending_messages, session_key, event)
return # Don't interrupt now - will run after current task completes
# Default behavior for non-photo follow-ups: interrupt the running agent.
#
# Use merge_text=True so rapid TEXT follow-ups (#4469) accumulate
# into the single pending slot instead of clobbering each other.
# Without merging, three rapid messages "A", "B", "C" land like:
# _pending_messages[k] = A (interrupts)
# _pending_messages[k] = B (replaces A before consumer reads)
# _pending_messages[k] = C (replaces B)
# ...and only "C" reaches the next turn. merge_pending_message_event
# already does the right thing for photo/media bursts; the
# ``merge_text=True`` flag extends that to plain TEXT events.
# Same shape as the Telegram bursty-grace path in gateway/run.py.
logger.debug("[%s] New message while session %s is active — triggering interrupt", self.name, session_key)
merge_pending_message_event(
self._pending_messages,
session_key,
event,
merge_text=True,
)
# Signal the interrupt (the processing task checks this)
self._active_sessions[session_key].set()
if self._is_queue_text_debounce_candidate(event):
logger.debug(
"[%s] New text message while session %s is active — "
"debouncing follow-up (busy_text_mode=queue, window=%.2fs)",
self.name,
session_key,
self._busy_text_debounce_seconds,
)
await self._queue_text_debounce(session_key, event)
else:
logger.debug(
"[%s] New message while session %s is active — queuing follow-up "
"(no interrupt, will cascade after current turn)",
self.name,
session_key,
)
merge_pending_message_event(
self._pending_messages,
session_key,
event,
merge_text=event.message_type == MessageType.TEXT,
)
return # Don't process now - will be handled after current task finishes
# Mark session as active BEFORE spawning background task to close
@@ -3498,10 +3689,15 @@ class BasePlatformAdapter(ABC):
ProcessingOutcome.SUCCESS if processing_ok else ProcessingOutcome.FAILURE,
)
# The active drain owns debounce state. If a queue-mode timer has
# not fired yet, force-flush into _pending_messages here and let
# this task hand off the follow-up.
await self._flush_text_debounce_now(session_key)
# Check if there's a pending message that was queued during our processing
if session_key in self._pending_messages:
pending_event = self._pending_messages.pop(session_key)
logger.debug("[%s] Processing queued message from interrupt", self.name)
logger.debug("[%s] Processing queued follow-up message", self.name)
# Keep the _active_sessions entry live across the turn chain
# and only CLEAR the interrupt Event — do NOT delete the entry.
# If we deleted here, a concurrent inbound message arriving
@@ -3510,7 +3706,7 @@ class BasePlatformAdapter(ABC):
# with the recursive drain below. Two agents on one
# session_key = duplicate responses, duplicate tool calls.
# Clearing the Event keeps the guard live so follow-ups take
# the busy-handler path (queue + interrupt) as intended.
# the busy-handler path as intended.
_active = self._active_sessions.get(session_key)
if _active is not None:
_active.clear()
@@ -3603,6 +3799,9 @@ class BasePlatformAdapter(ABC):
await self.stop_typing(event.source.chat_id)
except Exception:
pass
# Final drain/release boundary: force-flush any timer that missed
# the in-band drain before deciding whether the guard can clear.
await self._flush_text_debounce_now(session_key)
# Late-arrival drain: a message may have arrived during the
# cleanup awaits above (typing_task cancel, stop_typing). Such
# messages passed the Level-1 guard (entry still live, Event
@@ -3722,6 +3921,10 @@ class BasePlatformAdapter(ABC):
self._session_tasks.clear()
self._pending_messages.clear()
self._active_sessions.clear()
for state in list(self._text_debounce_store().values()):
if state.task is not None and not state.task.done():
state.task.cancel()
self._text_debounce_store().clear()
def has_pending_interrupt(self, session_key: str) -> bool:
"""Check if there's a pending interrupt for a session."""
+17 -5
@@ -189,7 +189,10 @@ class BlueBubblesAdapter(BasePlatformAdapter):
app = web.Application()
app.router.add_get("/health", lambda _: web.Response(text="ok"))
app.router.add_post(self.webhook_path, self._handle_webhook)
self._runner = web.AppRunner(app)
# The webhook auth value is carried in the query string because the
# BlueBubbles webhook API cannot send custom headers. Do not let
# aiohttp access logs write that request target to agent.log.
self._runner = web.AppRunner(app, access_log=None)
await self._runner.setup()
site = web.TCPSite(self._runner, self.webhook_host, self.webhook_port)
await site.start()
@@ -242,6 +245,14 @@ class BlueBubblesAdapter(BasePlatformAdapter):
return f"{base}?password={quote(self.password, safe='')}"
return base
@property
def _webhook_register_url_for_log(self) -> str:
"""Webhook registration URL safe for logs."""
base = self._webhook_url
if self.password:
return f"{base}?password=***"
return base
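A small sketch of the split between the URL that is registered and the URL that is logged. The host, path, and password below are invented, and `webhook_urls` is a hypothetical free function rather than the adapter's two properties.

```python
from urllib.parse import quote
from typing import Tuple


def webhook_urls(base: str, password: str) -> Tuple[str, str]:
    """Return (url_to_register, url_safe_for_logs)."""
    if not password:
        return base, base
    register_url = f"{base}?password={quote(password, safe='')}"
    log_url = f"{base}?password=***"
    return register_url, log_url


# Hypothetical local endpoint and password.
register_url, log_url = webhook_urls(
    "http://127.0.0.1:8645/bluebubbles-webhook", "p@ss/word&1"
)
print(log_url)
```

Only `register_url` carries the percent-encoded secret. Anything passed to `logger.info` should be built from the masked form.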
async def _find_registered_webhooks(self, url: str) -> list:
"""Return list of BB webhook entries matching *url*."""
try:
@@ -269,7 +280,8 @@ class BlueBubblesAdapter(BasePlatformAdapter):
existing = await self._find_registered_webhooks(webhook_url)
if existing:
logger.info(
"[bluebubbles] webhook already registered: %s", webhook_url
"[bluebubbles] webhook already registered: %s",
self._webhook_register_url_for_log,
)
return True
@@ -284,7 +296,7 @@ class BlueBubblesAdapter(BasePlatformAdapter):
if 200 <= status < 300:
logger.info(
"[bluebubbles] webhook registered with server: %s",
webhook_url,
self._webhook_register_url_for_log,
)
return True
else:
@@ -324,7 +336,8 @@ class BlueBubblesAdapter(BasePlatformAdapter):
removed = True
if removed:
logger.info(
"[bluebubbles] webhook unregistered: %s", webhook_url
"[bluebubbles] webhook unregistered: %s",
self._webhook_register_url_for_log,
)
except Exception as exc:
logger.debug(
@@ -934,4 +947,3 @@ class BlueBubblesAdapter(BasePlatformAdapter):
asyncio.create_task(self.mark_read(session_chat_id))
return web.Response(text="ok")
+13
@@ -358,6 +358,19 @@ class DingTalkAdapter(BasePlatformAdapter):
await asyncio.gather(*self._bg_tasks, return_exceptions=True)
self._bg_tasks.clear()
# Finalize any open streaming cards before the HTTP client closes so
# they don't stay stuck in streaming state on DingTalk's UI after
# a gateway restart. _close_streaming_siblings handles its own
# per-card exceptions; the outer try is a safety net for token fetch.
for _chat_id in list(self._streaming_cards):
try:
await self._close_streaming_siblings(_chat_id)
except Exception as _exc:
logger.debug(
"[%s] Failed to finalize streaming card on disconnect for %s: %s",
self.name, _chat_id, _exc,
)
if self._http_client:
await self._http_client.aclose()
self._http_client = None
+72 -10
@@ -1514,8 +1514,10 @@ class FeishuAdapter(BasePlatformAdapter):
connection_mode=str(
extra.get("connection_mode") or os.getenv("FEISHU_CONNECTION_MODE", "websocket")
).strip().lower(),
encrypt_key=os.getenv("FEISHU_ENCRYPT_KEY", "").strip(),
verification_token=os.getenv("FEISHU_VERIFICATION_TOKEN", "").strip(),
encrypt_key=str(extra.get("encrypt_key") or os.getenv("FEISHU_ENCRYPT_KEY", "")).strip(),
verification_token=str(
extra.get("verification_token") or os.getenv("FEISHU_VERIFICATION_TOKEN", "")
).strip(),
group_policy=os.getenv("FEISHU_GROUP_POLICY", "allowlist").strip().lower(),
allowed_group_users=frozenset(
item.strip()
@@ -1642,6 +1644,11 @@ class FeishuAdapter(BasePlatformAdapter):
self._connection_mode,
)
return False
if self._connection_mode == "webhook" and not (self._verification_token or self._encrypt_key):
logger.error(
"[Feishu] Webhook mode requires FEISHU_VERIFICATION_TOKEN or FEISHU_ENCRYPT_KEY."
)
return False
try:
self._app_lock_identity = self._app_id
@@ -2563,13 +2570,44 @@ class FeishuAdapter(BasePlatformAdapter):
if approval_id is None:
logger.debug("[Feishu] Card action missing approval_id, ignoring")
return P2CardActionTriggerResponse() if P2CardActionTriggerResponse else None
state = self._approval_state.get(approval_id)
if not state:
logger.debug("[Feishu] Approval %s already resolved or unknown", approval_id)
return P2CardActionTriggerResponse() if P2CardActionTriggerResponse else None
choice = _APPROVAL_CHOICE_MAP.get(action_value.get("hermes_action"), "deny")
operator = getattr(event, "operator", None)
open_id = str(getattr(operator, "open_id", "") or "")
sender_id = SimpleNamespace(open_id=open_id, user_id=str(getattr(operator, "user_id", "") or ""))
if not self._allow_group_message(sender_id, state.get("chat_id", ""), is_bot=False):
logger.warning("[Feishu] Unauthorized approval click by %s", open_id or "<unknown>")
return P2CardActionTriggerResponse() if P2CardActionTriggerResponse else None
callback_chat_id = str(getattr(getattr(event, "context", None), "open_chat_id", "") or "")
expected_chat_id = str(state.get("chat_id", "") or "")
if callback_chat_id and expected_chat_id and callback_chat_id != expected_chat_id:
logger.warning(
"[Feishu] Approval callback chat mismatch for %s (expected=%s, got=%s)",
approval_id,
expected_chat_id,
callback_chat_id,
)
return P2CardActionTriggerResponse() if P2CardActionTriggerResponse else None
user_name = self._get_cached_sender_name(open_id) or open_id
if not self._submit_on_loop(loop, self._resolve_approval(approval_id, choice, user_name)):
chat_context = getattr(event, "context", None)
chat_id = str(getattr(chat_context, "open_chat_id", "") or "")
if not self._submit_on_loop(
loop,
self._resolve_approval(
approval_id=approval_id,
choice=choice,
user_name=user_name,
open_id=open_id,
chat_id=chat_id,
),
):
return P2CardActionTriggerResponse() if P2CardActionTriggerResponse else None
if P2CardActionTriggerResponse is None:
@@ -2617,12 +2655,34 @@ class FeishuAdapter(BasePlatformAdapter):
response.card = card
return response
async def _resolve_approval(self, approval_id: Any, choice: str, user_name: str) -> None:
async def _resolve_approval(
self,
approval_id: Any,
choice: str,
user_name: str,
*,
open_id: str = "",
chat_id: str = "",
) -> None:
"""Pop approval state and unblock the waiting agent thread."""
state = self._approval_state.pop(approval_id, None)
state = self._approval_state.get(approval_id)
if not state:
logger.debug("[Feishu] Approval %s already resolved or unknown", approval_id)
return
if not self._is_interactive_operator_authorized(open_id):
logger.warning("[Feishu] Unauthorized approval click by %s for approval %s", open_id or "<unknown>", approval_id)
return
expected_chat_id = str(state.get("chat_id", "") or "")
if expected_chat_id and chat_id and expected_chat_id != chat_id:
logger.warning(
"[Feishu] Approval %s chat mismatch (expected=%s, got=%s)",
approval_id, expected_chat_id, chat_id,
)
return
state = self._approval_state.pop(approval_id, None)
if not state:
logger.debug("[Feishu] Approval %s already resolved while validating callback", approval_id)
return
try:
from tools.approval import resolve_gateway_approval
count = resolve_gateway_approval(state["session_key"], choice)
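The reworked `_resolve_approval` reads the state with `get`, validates the caller, and only then `pop`s, so a rejected click no longer consumes the approval. A reduced sketch of that ordering follows; the in-memory dict, the allowlist, and all IDs are hypothetical.

```python
from typing import Optional

# Hypothetical pending approval and operator allowlist.
approval_state = {"appr-1": {"chat_id": "oc_team", "session_key": "s1"}}
allowed_operators = {"ou_alice"}


def resolve(approval_id: str, open_id: str, chat_id: str) -> Optional[dict]:
    """Validate first, consume last; return the state only when accepted."""
    state = approval_state.get(approval_id)  # peek, do not consume yet
    if not state:
        return None
    if open_id not in allowed_operators:
        return None
    expected = str(state.get("chat_id", "") or "")
    if expected and chat_id and expected != chat_id:
        return None
    # pop() may still return None if another callback won the race.
    return approval_state.pop(approval_id, None)


rejected = resolve("appr-1", "ou_mallory", "oc_team")
still_pending = "appr-1" in approval_state
accepted = resolve("appr-1", "ou_alice", "oc_team")
replayed = resolve("appr-1", "ou_alice", "oc_team")
```

With the earlier pop-first ordering, the unauthorized click in this sketch would have removed the entry and the legitimate approver's click would have found nothing to resolve.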
@@ -3229,11 +3289,6 @@ class FeishuAdapter(BasePlatformAdapter):
self._record_webhook_anomaly(remote_ip, "400")
return web.json_response({"code": 400, "msg": "invalid json"}, status=400)
# URL verification challenge — respond before other checks so that Feishu's
# subscription setup works even before encrypt_key is wired.
if payload.get("type") == "url_verification":
return web.json_response({"challenge": payload.get("challenge", "")})
# Verification token check — second layer of defence beyond signature (matches openclaw).
if self._verification_token:
header = payload.get("header") or {}
@@ -3243,6 +3298,13 @@ class FeishuAdapter(BasePlatformAdapter):
self._record_webhook_anomaly(remote_ip, "401-token")
return web.Response(status=401, text="Invalid verification token")
# URL verification challenge — Feishu includes the verification token in
# challenge requests. Validate the token (above) before reflecting the
# challenge so an unauthenticated remote request cannot prove endpoint
# control by getting attacker-supplied challenge data echoed back.
if payload.get("type") == "url_verification":
return web.json_response({"challenge": payload.get("challenge", "")})
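The reordering above means the challenge is echoed only after the token check has passed. A framework-free sketch of that order follows. How the adapter extracts the token is not shown in this hunk, so reading it from `header.token` or a top-level `token` field is an assumption, as is the `handle_payload` name.

```python
import hmac
from typing import Tuple


def handle_payload(payload: dict, verification_token: str) -> Tuple[int, dict]:
    """Return (status, body); token check runs before the challenge echo."""
    if verification_token:
        header = payload.get("header") or {}
        # Assumed token locations; the real extraction lives outside this hunk.
        incoming = str(header.get("token") or payload.get("token") or "")
        if not hmac.compare_digest(incoming.encode(), verification_token.encode()):
            return 401, {"msg": "Invalid verification token"}
    if payload.get("type") == "url_verification":
        return 200, {"challenge": payload.get("challenge", "")}
    return 200, {"msg": "accepted"}


genuine = handle_payload(
    {"type": "url_verification", "challenge": "abc", "token": "secret"}, "secret"
)
forged = handle_payload(
    {"type": "url_verification", "challenge": "attacker-data"}, "secret"
)
```

A request without the right token now gets a 401 instead of having its attacker-chosen challenge string reflected back.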
# Timing-safe signature verification (only enforced when encrypt_key is set).
if self._encrypt_key and not self._is_webhook_signature_valid(request.headers, body_bytes):
logger.warning("[Feishu] Webhook rejected: invalid signature from %s", remote_ip)
+42 -8
@@ -138,7 +138,8 @@ _OUTBOUND_MENTION_RE = re.compile(
)
_E2EE_INSTALL_HINT = (
"Install with: pip install 'mautrix[encryption]' (requires libolm C library)"
"Install with: pip install 'mautrix[encryption]' asyncpg aiosqlite "
"(requires libolm C library)"
)
_MATRIX_IMAGE_FILENAME_EXTS = frozenset({
@@ -214,9 +215,22 @@ def _create_matrix_session(proxy_url: str | None):
def _check_e2ee_deps() -> bool:
"""Return True if mautrix E2EE dependencies (python-olm) are available."""
"""Return True if mautrix E2EE dependencies are available.
Verifies python-olm (via mautrix.crypto.OlmMachine), the SQLite crypto
store backend (mautrix.crypto.store.asyncpg.PgCryptoStore; yes, the
PgCryptoStore class also drives the sqlite backend in mautrix 0.21),
and the database drivers actually used at connect time (``asyncpg`` for
the underlying upgrade_table machinery, ``aiosqlite`` for the
``sqlite:///`` URL we pass to ``Database.create``). Without all four,
encrypted rooms fail at connect time with a confusing
``No module named 'asyncpg'`` (#31116).
"""
try:
from mautrix.crypto import OlmMachine # noqa: F401
from mautrix.crypto.store.asyncpg import PgCryptoStore # noqa: F401
import asyncpg # noqa: F401
import aiosqlite # noqa: F401
return True
except (ImportError, AttributeError):
@@ -226,8 +240,13 @@ def _check_e2ee_deps() -> bool:
def check_matrix_requirements() -> bool:
"""Return True if the Matrix adapter can be used.
Lazy-installs mautrix via ``tools.lazy_deps.ensure("platform.matrix")``
on first call if not present. Rebinds all module-level type globals on success.
Lazy-installs the full ``platform.matrix`` feature group via
``tools.lazy_deps.ensure_and_bind`` whenever any of the declared
packages (mautrix, Markdown, aiosqlite, asyncpg, aiohttp-socks) is
missing, not just mautrix itself. Previously this short-circuited on
``import mautrix``, which left the other four packages uninstalled
forever and broke E2EE connect with ``No module named 'asyncpg'``
(#31116). Rebinds module-level type globals on success.
"""
token = os.getenv("MATRIX_ACCESS_TOKEN", "")
password = os.getenv("MATRIX_PASSWORD", "")
@@ -239,9 +258,20 @@ def check_matrix_requirements() -> bool:
if not homeserver:
logger.warning("Matrix: MATRIX_HOMESERVER not set")
return False
# Check whether any package in the platform.matrix feature group is
# missing. ``feature_missing`` is cheap (per-spec importlib.metadata
# lookups) and correctly handles ``mautrix[encryption]`` by stripping
# the extras marker before checking the bare package.
try:
import mautrix # noqa: F401
except ImportError:
from tools.lazy_deps import feature_missing, ensure_and_bind
missing = feature_missing("platform.matrix")
except Exception as exc: # pragma: no cover — defensive
logger.debug("Matrix: lazy_deps lookup failed: %s", exc)
missing = ()
ensure_and_bind = None # type: ignore[assignment]
if missing or ensure_and_bind is None:
def _import():
from mautrix.types import (
ContentURI, EventID, EventType, PaginationDirection,
@@ -261,10 +291,14 @@ def check_matrix_requirements() -> bool:
"UserID": UserID,
}
from tools.lazy_deps import ensure_and_bind
if ensure_and_bind is None:
return False
if not ensure_and_bind("platform.matrix", _import, globals(), prompt=False):
logger.warning(
"Matrix: mautrix not installed. Run: pip install 'mautrix[encryption]'"
"Matrix: required packages not installed (%s). "
"Run: pip install 'mautrix[encryption]' asyncpg aiosqlite "
"Markdown aiohttp-socks",
", ".join(missing) if missing else "platform.matrix",
)
return False
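The comment above describes `feature_missing` as a per-spec `importlib.metadata` lookup that strips the extras marker before checking. The real function lives in `tools.lazy_deps` and is not shown here, so the following is only a guess at the idea, with hypothetical names `bare_name` and `missing_specs`.

```python
import re
from importlib import metadata
from typing import Iterable, Tuple


def bare_name(spec: str) -> str:
    """Drop extras, version pins, and markers: 'mautrix[encryption]' -> 'mautrix'."""
    return re.split(r"[\[<>=!~;\s]", spec, maxsplit=1)[0]


def missing_specs(specs: Iterable[str]) -> Tuple[str, ...]:
    """Return the specs whose distribution is not installed."""
    missing = []
    for spec in specs:
        try:
            metadata.version(bare_name(spec))
        except metadata.PackageNotFoundError:
            missing.append(spec)
    return tuple(missing)


# A deliberately nonexistent distribution name, for demonstration only.
absent = missing_specs(["hermes-definitely-not-installed[extra]>=1.0"])
print(absent)
```

Checking installed distributions rather than importing modules keeps the probe cheap, at the cost of not catching a package that is installed but broken at import time.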
+7 -1
@@ -133,6 +133,12 @@ class MSGraphWebhookAdapter(BasePlatformAdapter):
self._notification_scheduler = scheduler
async def connect(self) -> bool:
if self._client_state is None:
logger.error(
"[msgraph_webhook] Refusing to start without extra.client_state configured"
)
return False
app = web.Application()
app.router.add_get(self._health_path, self._handle_health)
app.router.add_get(self._webhook_path, self._handle_validation)
@@ -310,7 +316,7 @@ class MSGraphWebhookAdapter(BasePlatformAdapter):
"""
expected = self._client_state
if expected is None:
return True
return False
provided = self._string_or_none(notification.get("clientState"))
if provided is None:
return False
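The change above makes the `clientState` check fail closed when nothing is configured. A standalone sketch of that rule follows. The final comparison is outside this hunk, so the constant-time `hmac.compare_digest` used here is an assumption, as is the `client_state_matches` name.

```python
import hmac
from typing import Optional


def client_state_matches(expected: Optional[str], notification: dict) -> bool:
    """Accept a notification only when a configured clientState matches."""
    if expected is None:
        return False  # fail closed: no secret configured, nothing is trusted
    provided = notification.get("clientState")
    if not isinstance(provided, str) or not provided:
        return False
    # Assumed comparison; constant-time to avoid leaking prefix matches.
    return hmac.compare_digest(provided.encode(), expected.encode())


unconfigured = client_state_matches(None, {"clientState": "anything"})
missing_field = client_state_matches("s3cret", {})
wrong = client_state_matches("s3cret", {"clientState": "guess"})
correct = client_state_matches("s3cret", {"clientState": "s3cret"})
```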
+54
@@ -1054,6 +1054,46 @@ class QQAdapter(BasePlatformAdapter):
"deny": "deny",
}
@staticmethod
def _parse_gateway_session_key(session_key: str) -> Optional[Dict[str, str]]:
"""Parse ``agent:main:<platform>:<chat_type>:<chat_id>[:<user_id>]``."""
parts = str(session_key or "").split(":")
if len(parts) < 5 or parts[0] != "agent" or parts[1] != "main":
return None
parsed = {
"platform": parts[2],
"chat_type": parts[3],
"chat_id": parts[4],
}
if len(parts) > 5:
parsed["user_id"] = parts[5]
return parsed
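A quick illustration of what the session-key parser above returns for a few inputs. The function body is copied as a free function, and the key values are invented.

```python
from typing import Dict, Optional


def parse_gateway_session_key(session_key: str) -> Optional[Dict[str, str]]:
    """Parse ``agent:main:<platform>:<chat_type>:<chat_id>[:<user_id>]``."""
    parts = str(session_key or "").split(":")
    if len(parts) < 5 or parts[0] != "agent" or parts[1] != "main":
        return None
    parsed = {
        "platform": parts[2],
        "chat_type": parts[3],
        "chat_id": parts[4],
    }
    if len(parts) > 5:
        parsed["user_id"] = parts[5]
    return parsed


group_key = parse_gateway_session_key("agent:main:qqbot:group:G123:U456")
dm_key = parse_gateway_session_key("agent:main:qqbot:c2c:U1")
foreign_key = parse_gateway_session_key("cron:main:qqbot:c2c:U1")
too_short = parse_gateway_session_key("agent:main:qqbot")
```

Because the key is split on every `:`, a chat or user ID that itself contains a colon would be truncated at that point. Whether QQ IDs can contain one is not something this hunk establishes.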
def _is_authorized_interaction_for_session(
self,
event: InteractionEvent,
session_key: str,
) -> bool:
"""Authorize approval/update interactions against session + operator."""
parsed = self._parse_gateway_session_key(session_key)
operator = str(event.operator_openid or "").strip()
if not parsed or parsed.get("platform") != "qqbot" or not operator:
return False
chat_type = parsed.get("chat_type", "")
chat_id = parsed.get("chat_id", "")
if chat_type == "c2c":
return bool(chat_id) and operator == chat_id
if chat_type in {"group", "guild"}:
event_chat = str(event.group_openid or event.guild_id or "").strip()
if not event_chat or event_chat != chat_id:
return False
session_user = str(parsed.get("user_id", "")).strip()
return bool(session_user) and operator == session_user
return False
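The authorization rules above can be exercised with a stand-in event. `Interaction` below is a hypothetical dataclass carrying only the three fields the check reads, and the parsing is inlined for brevity.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    # Stand-in for InteractionEvent; only the fields the check uses.
    operator_openid: str = ""
    group_openid: str = ""
    guild_id: str = ""


def is_authorized(event: Interaction, session_key: str) -> bool:
    parts = str(session_key or "").split(":")
    if len(parts) < 5 or parts[:3] != ["agent", "main", "qqbot"]:
        return False
    chat_type, chat_id = parts[3], parts[4]
    session_user = parts[5].strip() if len(parts) > 5 else ""
    operator = event.operator_openid.strip()
    if not operator:
        return False
    if chat_type == "c2c":
        # In a DM the chat id is the user, so the clicker must be that user.
        return bool(chat_id) and operator == chat_id
    if chat_type in {"group", "guild"}:
        event_chat = (event.group_openid or event.guild_id).strip()
        if not event_chat or event_chat != chat_id:
            return False
        return bool(session_user) and operator == session_user
    return False


dm_owner = is_authorized(Interaction(operator_openid="U1"), "agent:main:qqbot:c2c:U1")
dm_other = is_authorized(Interaction(operator_openid="U2"), "agent:main:qqbot:c2c:U1")
group_owner = is_authorized(
    Interaction(operator_openid="U456", group_openid="G123"),
    "agent:main:qqbot:group:G123:U456",
)
group_no_user = is_authorized(
    Interaction(operator_openid="U456", group_openid="G123"),
    "agent:main:qqbot:group:G123",
)
```

The last case is worth noting: a group or guild key with no user segment never authorizes anyone under these rules, which would apply to any caller that builds the key without one.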
async def _default_interaction_dispatch(
self,
event: InteractionEvent,
@@ -1087,6 +1127,13 @@ class QQAdapter(BasePlatformAdapter):
self._log_tag, decision, session_key,
)
return
if not self._is_authorized_interaction_for_session(event, session_key):
logger.warning(
"[%s] Rejected unauthorized approval click for session %s "
"(operator=%s)",
self._log_tag, session_key, event.operator_openid,
)
return
try:
# Import lazily to keep the adapter importable in tests that
# don't exercise the approval subsystem.
@@ -1107,6 +1154,13 @@ class QQAdapter(BasePlatformAdapter):
update_answer = parse_update_prompt_button_data(button_data)
if update_answer is not None:
update_session_key = f"agent:main:qqbot:{event.scene}:{event.group_openid or event.guild_id or event.user_openid}"
if not self._is_authorized_interaction_for_session(event, update_session_key):
logger.warning(
"[%s] Rejected unauthorized update prompt click (operator=%s)",
self._log_tag, event.operator_openid,
)
return
self._write_update_response(update_answer, event.operator_openid)
return
+20 -1
@@ -429,6 +429,13 @@ class TelegramAdapter(BasePlatformAdapter):
self._polling_conflict_count: int = 0
self._polling_network_error_count: int = 0
self._polling_error_callback_ref = None
# After sustained reconnect storms the PTB httpx pool can return
# SendResult(success=True) for sends that never actually transmit.
# _handle_polling_network_error sets this; _verify_polling_after_reconnect
# clears it once getMe() confirms the Bot client is healthy.
# While True, send() short-circuits to a failure so callers
# (cron live-adapter branch) fall through to standalone delivery.
self._send_path_degraded: bool = False
# DM Topics: map of topic_name -> message_thread_id (populated at startup)
self._dm_topics: Dict[str, int] = {}
# Track forum chats where we've already registered bot commands
@@ -874,6 +881,7 @@ class TelegramAdapter(BasePlatformAdapter):
MAX_DELAY = 60
self._polling_network_error_count += 1
self._send_path_degraded = True
attempt = self._polling_network_error_count
if attempt > MAX_NETWORK_RETRIES:
@@ -971,6 +979,7 @@ class TelegramAdapter(BasePlatformAdapter):
try:
await asyncio.wait_for(self._app.bot.get_me(), PROBE_TIMEOUT)
self._send_path_degraded = False
except Exception as probe_err:
logger.warning(
"[%s] Polling heartbeat probe failed %ds after reconnect: %s",
@@ -1683,7 +1692,11 @@ class TelegramAdapter(BasePlatformAdapter):
"""Send a message to a Telegram chat."""
if not self._bot:
return SendResult(success=False, error="Not connected")
# getattr() — tests build adapters via object.__new__() (no __init__).
if getattr(self, "_send_path_degraded", False):
return SendResult(success=False, error="send_path_degraded", retryable=True)
# Skip whitespace-only text to prevent Telegram 400 empty-text errors.
if not content or not content.strip():
return SendResult(success=True, message_id=None)
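The degraded-send flag behaves like a small circuit breaker: a polling network error opens it, a successful probe closes it, and while it is open `send()` reports a retryable failure so the caller can use another delivery path. A sketch under those assumptions follows; `LiveSender`, `deliver`, and this `SendResult` are stand-ins, not the gateway's classes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SendResult:
    # Stand-in with only the fields used in this sketch.
    success: bool
    error: Optional[str] = None
    retryable: bool = False


@dataclass
class LiveSender:
    degraded: bool = False
    sent: List[str] = field(default_factory=list)

    def on_polling_network_error(self) -> None:
        self.degraded = True

    def on_probe_ok(self) -> None:
        self.degraded = False

    def send(self, text: str) -> SendResult:
        if self.degraded:
            # Refuse loudly rather than report a success that never transmits.
            return SendResult(False, error="send_path_degraded", retryable=True)
        self.sent.append(text)
        return SendResult(True)


def deliver(sender: LiveSender, text: str, fallback: Callable[[str], None]) -> str:
    """Try the live adapter first; fall back when it reports failure."""
    if sender.send(text).success:
        return "live"
    fallback(text)
    return "standalone"


sender = LiveSender()
fallback_log: List[str] = []
first = deliver(sender, "report 1", fallback_log.append)
sender.on_polling_network_error()
second = deliver(sender, "report 2", fallback_log.append)
sender.on_probe_ok()
third = deliver(sender, "report 3", fallback_log.append)
```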
@@ -4631,6 +4644,12 @@ class TelegramAdapter(BasePlatformAdapter):
shared_source = self._telegram_group_observe_shared_source(event.source)
observe_prompt = self._telegram_group_observe_channel_prompt()
channel_prompt = f"{event.channel_prompt}\n\n{observe_prompt}" if event.channel_prompt else observe_prompt
if event.message_type == MessageType.COMMAND:
return dataclasses.replace(
event,
source=shared_source,
channel_prompt=channel_prompt,
)
return dataclasses.replace(
event,
text=self._telegram_group_observe_attributed_text(event),
+97 -4
@@ -27,6 +27,8 @@ Security:
"""
import asyncio
import base64
import binascii
import hashlib
import hmac
import json
@@ -377,9 +379,21 @@ class WebhookAdapter(BasePlatformAdapter):
logger.error("[webhook] Failed to read body: %s", e)
return web.json_response({"error": "Bad request"}, status=400)
# Validate HMAC signature FIRST (skip for INSECURE_NO_AUTH testing mode)
# Validate HMAC signature FIRST (skip only for the explicit local-test
# INSECURE_NO_AUTH mode). Missing/empty secrets must fail closed here,
# not only during connect(), so direct handler reuse cannot turn a
# network webhook route into an unauthenticated agent-dispatch surface.
secret = route_config.get("secret", self._global_secret)
if secret and secret != _INSECURE_NO_AUTH:
if not secret:
logger.error(
"[webhook] Route %s has no HMAC secret; refusing request",
route_name,
)
return web.json_response(
{"error": "Webhook route is missing an HMAC secret"},
status=403,
)
if secret != _INSECURE_NO_AUTH:
if not self._validate_signature(request, raw_body, secret):
logger.warning(
"[webhook] Invalid signature for route %s", route_name
@@ -419,6 +433,7 @@ class WebhookAdapter(BasePlatformAdapter):
request.headers.get("X-GitHub-Event", "")
or request.headers.get("X-GitLab-Event", "")
or payload.get("event_type", "")
or payload.get("type", "")
or "unknown"
)
allowed_events = route_config.get("events", [])
@@ -471,7 +486,10 @@ class WebhookAdapter(BasePlatformAdapter):
# Build a unique delivery ID
delivery_id = request.headers.get(
"X-GitHub-Delivery",
request.headers.get("X-Request-ID", str(int(time.time() * 1000))),
request.headers.get(
"svix-id",
request.headers.get("X-Request-ID", str(int(time.time() * 1000))),
),
)
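The nested `get` calls above give the delivery ID a fixed precedence: GitHub's header, then `svix-id`, then `X-Request-ID`, then a millisecond timestamp. A sketch with a plain dict in place of the request headers (the `delivery_id` function name is hypothetical):

```python
import time
from typing import Mapping


def delivery_id(headers: Mapping[str, str]) -> str:
    """Pick the first available delivery identifier, else a timestamp."""
    return headers.get(
        "X-GitHub-Delivery",
        headers.get(
            "svix-id",
            headers.get("X-Request-ID", str(int(time.time() * 1000))),
        ),
    )


from_github = delivery_id({"X-GitHub-Delivery": "gh-1", "svix-id": "msg_1"})
from_svix = delivery_id({"svix-id": "msg_1", "X-Request-ID": "req-1"})
from_request_id = delivery_id({"X-Request-ID": "req-1"})
generated = delivery_id({})
```

A plain dict is case-sensitive, whereas aiohttp's `request.headers` is a case-insensitive multidict, so header casing matters in this sketch but not in the adapter. The timestamp fallback is only unique to the millisecond, so two header-less deliveries in the same millisecond would share an ID.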
# ── Idempotency ─────────────────────────────────────────
@@ -616,7 +634,32 @@ class WebhookAdapter(BasePlatformAdapter):
def _validate_signature(
self, request: "web.Request", body: bytes, secret: str
) -> bool:
"""Validate webhook signature (GitHub, GitLab, generic HMAC-SHA256)."""
"""Validate webhook signature (GitHub, GitLab, Svix, generic HMAC-SHA256)."""
def _header(name: str) -> str:
return (
request.headers.get(name, "")
or request.headers.get(name.lower(), "")
or request.headers.get(name.upper(), "")
)
# Svix / AgentMail:
# svix-id: msg_...
# svix-timestamp: unix seconds
# svix-signature: v1,<base64-hmac> [v1,<base64-hmac> ...]
# Signed content is: "{id}.{timestamp}.{raw_body}". Svix secrets
# usually start with "whsec_" and the remainder is base64-encoded.
svix_id = _header("svix-id")
svix_timestamp = _header("svix-timestamp")
svix_signature = _header("svix-signature")
if svix_id or svix_timestamp or svix_signature:
return self._validate_svix_signature(
body=body,
secret=secret,
msg_id=svix_id,
timestamp=svix_timestamp,
signature_header=svix_signature,
)
# GitHub: X-Hub-Signature-256 = sha256=<hex>
gh_sig = request.headers.get("X-Hub-Signature-256", "")
if gh_sig:
@@ -644,6 +687,56 @@ class WebhookAdapter(BasePlatformAdapter):
)
return False
def _validate_svix_signature(
self,
body: bytes,
secret: str,
msg_id: str,
timestamp: str,
signature_header: str,
tolerance_seconds: int = 300,
) -> bool:
"""Validate Svix-compatible signatures used by AgentMail webhooks."""
if not (msg_id and timestamp and signature_header and secret):
return False
try:
ts = int(timestamp)
except (TypeError, ValueError):
return False
if abs(int(time.time()) - ts) > tolerance_seconds:
logger.warning("[webhook] Svix signature timestamp outside replay window")
return False
if secret.startswith("whsec_"):
encoded_secret = secret.removeprefix("whsec_")
try:
key = base64.b64decode(encoded_secret, validate=True)
except (binascii.Error, ValueError):
logger.debug("[webhook] Invalid whsec_ Svix signing secret")
return False
else:
# Be permissive for providers that document Svix-style headers but
# hand out raw shared secrets rather than whsec_ base64 secrets.
logger.debug("[webhook] Validating Svix-style signature with raw secret")
key = secret.encode()
signed_content = msg_id.encode() + b"." + timestamp.encode() + b"." + body
expected = base64.b64encode(
hmac.new(key, signed_content, hashlib.sha256).digest()
).decode()
# Svix can send multiple signatures separated by spaces during secret
# rotation. Each entry is formatted as "vN,<base64>".
for part in signature_header.split():
try:
version, signature = part.split(",", 1)
except ValueError:
continue
if version == "v1" and hmac.compare_digest(signature, expected):
return True
return False
# ------------------------------------------------------------------
# Prompt rendering
# ------------------------------------------------------------------
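The Svix-style check in the hunk above can be exercised in isolation. The following is a minimal sketch, not the adapter's code: `decode_secret`, `sign_svix`, `verify_svix`, the injectable `now` parameter, and all demo values are hypothetical names introduced for illustration. It assumes the same signed content (`"{id}.{timestamp}.{raw_body}"`), the same `whsec_` base64 convention, and the same 300-second replay window described in the diff.

```python
import base64
import hashlib
import hmac
import time


def decode_secret(secret: str) -> bytes:
    """Decode a "whsec_" base64 secret, or fall back to the raw string."""
    if secret.startswith("whsec_"):
        return base64.b64decode(secret[len("whsec_"):], validate=True)
    return secret.encode()


def sign_svix(key: bytes, msg_id: str, timestamp: str, body: bytes) -> str:
    """Base64 HMAC-SHA256 over "{id}.{timestamp}.{raw_body}"."""
    signed = msg_id.encode() + b"." + timestamp.encode() + b"." + body
    return base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest()).decode()


def verify_svix(key, msg_id, timestamp, body, header, tolerance=300, now=None):
    """Return True if any "v1,<base64>" entry in the header matches."""
    now = int(time.time()) if now is None else now
    try:
        ts = int(timestamp)
    except (TypeError, ValueError):
        return False
    if abs(now - ts) > tolerance:
        return False  # outside the replay window
    expected = sign_svix(key, msg_id, timestamp, body)
    for part in header.split():
        version, _, signature = part.partition(",")
        if version == "v1" and hmac.compare_digest(signature, expected):
            return True
    return False


# Hypothetical demo values; nothing here comes from a real provider.
secret = "whsec_" + base64.b64encode(b"illustrative-signing-key").decode()
key = decode_secret(secret)
body = b'{"event_type": "message.received"}'
sig = sign_svix(key, "msg_demo", "1700000000", body)
```

Passing `now` explicitly keeps the replay-window check deterministic in tests; the adapter itself reads `time.time()` directly.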
+12
@@ -616,6 +616,18 @@ class WeComAdapter(BasePlatformAdapter):
else:
delay = self._text_batch_delay_seconds
await asyncio.sleep(delay)
# Guard against the cancel-delivery race: when the sleep timer
# fires just before cancel() is called, CPython sets
# Task._must_cancel but cannot cancel the already-done sleep
# future, so CancelledError is delivered at the *next* await
# (handle_message) rather than here. By that point this task
# has already popped the merged event, so the superseding task
# sees an empty batch and silently drops the message.
# This check is synchronous — no await between the sleep and
# the pop — so no other coroutine can modify the task registry
# in between.
if self._pending_text_batch_tasks.get(key) is not current_task:
return
event = self._pending_text_batches.pop(key, None)
if not event:
return
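The ownership guard added in this hunk can be illustrated with a small debounce loop. This is a simplified sketch under stated assumptions: the module-level dictionaries, `flush_after`, and `enqueue` are hypothetical stand-ins for the adapter's `_pending_text_batches` / `_pending_text_batch_tasks` bookkeeping, and the demo shows the guard on the normal path rather than reproducing the CPython timing race itself.

```python
import asyncio

pending_batches: dict[str, list[str]] = {}
pending_tasks: dict[str, asyncio.Task] = {}
delivered: list[str] = []


async def flush_after(key: str, delay: float) -> None:
    await asyncio.sleep(delay)
    # Synchronous ownership check: there is no await between the sleep
    # and the pop, so no other coroutine can change the registry here.
    if pending_tasks.get(key) is not asyncio.current_task():
        return  # a newer task owns this batch; leave it for that task
    batch = pending_batches.pop(key, None)
    pending_tasks.pop(key, None)
    if batch:
        delivered.append(" ".join(batch))


def enqueue(key: str, text: str, delay: float = 0.01) -> None:
    """Merge text into the pending batch and restart the flush timer."""
    pending_batches.setdefault(key, []).append(text)
    old = pending_tasks.get(key)
    if old is not None and not old.done():
        old.cancel()
    pending_tasks[key] = asyncio.create_task(flush_after(key, delay))


async def demo() -> list[str]:
    enqueue("chat", "hello")
    enqueue("chat", "world")  # supersedes the first timer
    await asyncio.sleep(0.1)
    return delivered


result = asyncio.run(demo())
```

Only the task currently registered for the key pops the batch, so a superseded task that wakes late returns without consuming the merged event.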
+25 -13
@@ -187,7 +187,6 @@ class WecomCallbackAdapter(BasePlatformAdapter):
app = self._resolve_app_for_chat(chat_id)
touser = chat_id.split(":", 1)[1] if ":" in chat_id else chat_id
try:
token = await self._get_access_token(app)
payload = {
"touser": touser,
"msgtype": "text",
@@ -195,18 +194,31 @@ class WecomCallbackAdapter(BasePlatformAdapter):
"text": {"content": content[:2048]},
"safe": 0,
}
resp = await self._http_client.post(
f"https://qyapi.weixin.qq.com/cgi-bin/message/send?access_token={token}",
json=payload,
)
data = resp.json()
if data.get("errcode") != 0:
return SendResult(success=False, error=str(data))
return SendResult(
success=True,
message_id=str(data.get("msgid", "")),
raw_response=data,
)
for _attempt in range(2):
token = await self._get_access_token(app)
resp = await self._http_client.post(
f"https://qyapi.weixin.qq.com/cgi-bin/message/send?access_token={token}",
json=payload,
)
data = resp.json()
errcode = data.get("errcode")
if errcode in {40001, 42001} and _attempt == 0:
# WeCom rejected the token — evict the cached entry so
# the next _get_access_token call forces a fresh fetch.
logger.warning(
"[WecomCallback] Token rejected for app '%s' (errcode=%s), refreshing",
app.get("name", "default"), errcode,
)
self._access_tokens.pop(app["name"], None)
continue
if errcode != 0:
return SendResult(success=False, error=str(data))
return SendResult(
success=True,
message_id=str(data.get("msgid", "")),
raw_response=data,
)
return SendResult(success=False, error="send failed after token refresh")
except Exception as exc:
return SendResult(success=False, error=str(exc))
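The retry-once-on-rejected-token flow above can be sketched against a fake API. Everything here is an assumption made for illustration: `FakeApi`, `send_with_refresh`, and the plain-dict token cache are hypothetical, and only the errcode values 40001 and 42001 are taken from the diff.

```python
import asyncio

TOKEN_REJECTED = {40001, 42001}  # errcodes the diff treats as "token rejected"


class FakeApi:
    """Hypothetical stand-in for the token endpoint and message/send."""

    def __init__(self) -> None:
        self.valid_token = "fresh"
        self.issued = 0

    async def fetch_token(self) -> str:
        self.issued += 1
        return self.valid_token

    async def send(self, token: str, text: str) -> dict:
        if token != self.valid_token:
            return {"errcode": 42001, "errmsg": "token rejected (fake)"}
        return {"errcode": 0, "msgid": "m1"}


async def send_with_refresh(api: FakeApi, cache: dict, app_name: str, text: str) -> dict:
    for attempt in range(2):
        token = cache.get(app_name)
        if token is None:
            token = await api.fetch_token()
            cache[app_name] = token
        data = await api.send(token, text)
        if data.get("errcode") in TOKEN_REJECTED and attempt == 0:
            # Evict the cached token so the second attempt fetches a new one.
            cache.pop(app_name, None)
            continue
        return data
    # Defensive fallback; the second attempt always returns above.
    return {"errcode": -1, "errmsg": "send failed after token refresh"}


api = FakeApi()
cache = {"default": "stale"}
result = asyncio.run(send_with_refresh(api, cache, "default", "hi"))
```

Bounding the loop at two attempts means a persistently rejected token produces one refresh and then a normal failure result, not an unbounded retry.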
+228 -68
@@ -139,6 +139,85 @@ def _gateway_platform_value(platform: Any) -> str:
return str(getattr(platform, "value", platform) or "").strip().lower()
def _is_transient_network_error(exc: BaseException) -> bool:
"""Return True for transient network errors safe to log + swallow.
The crash class targeted by #31066 / #31110: an unhandled Telegram
``TimedOut`` (or peer ``NetworkError`` / ``httpx`` connection error)
propagating to the event loop and killing the entire gateway
process. These are by definition transient (the next poll cycle or
user action recovers), so they must never crash the process.
Walk the exception cause chain so wrapped errors (e.g. PTB's
``NetworkError`` wrapping ``httpx.ConnectError``) are still
classified. The chain is bounded to avoid pathological cycles.
"""
seen: set[int] = set()
cur: Optional[BaseException] = exc
depth = 0
transient_class_names = {
"TimedOut",
"NetworkError",
"ReadError",
"WriteError",
"ConnectError",
"ConnectTimeout",
"ReadTimeout",
"WriteTimeout",
"PoolTimeout",
"RemoteProtocolError",
"ServerDisconnectedError",
"ClientConnectorError",
"ClientOSError",
}
while cur is not None and depth < 12:
ident = id(cur)
if ident in seen:
break
seen.add(ident)
depth += 1
name = type(cur).__name__
if name in transient_class_names:
return True
cur = cur.__cause__ or cur.__context__
return False
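The cause-chain walk in `_is_transient_network_error` can be tried on locally defined exceptions. This is a sketch with hypothetical names (`is_transient`, `TRANSIENT_NAMES`, the local `ConnectError` class, and `wrapped_error`); it matches on class names like the diff does, so a local class named `ConnectError` stands in for the `httpx` one.

```python
from typing import Optional

TRANSIENT_NAMES = {"TimedOut", "NetworkError", "ConnectError"}


class ConnectError(Exception):
    """Local stand-in; only the class name matters for the check."""


def is_transient(exc: BaseException, max_depth: int = 12) -> bool:
    seen: set = set()
    cur: Optional[BaseException] = exc
    depth = 0
    while cur is not None and depth < max_depth:
        if id(cur) in seen:
            break  # cycle in the cause/context chain
        seen.add(id(cur))
        depth += 1
        if type(cur).__name__ in TRANSIENT_NAMES:
            return True
        cur = cur.__cause__ or cur.__context__
    return False


def wrapped_error() -> BaseException:
    """Build RuntimeError -> ConnectError via "raise ... from ..."."""
    try:
        try:
            raise ConnectError("connection refused")
        except ConnectError as inner:
            raise RuntimeError("poll failed") from inner
    except RuntimeError as outer:
        return outer


# A deliberately cyclic chain to show the walk still terminates.
a, b = ValueError("a"), KeyError("b")
a.__cause__, b.__cause__ = b, a
```

Name matching avoids importing optional network libraries at classification time, at the cost of also matching unrelated classes that happen to share a name.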
def _gateway_loop_exception_handler(
loop: "asyncio.AbstractEventLoop", context: Dict[str, Any]
) -> None:
"""Loop-level safety net for transient network errors.
Installed once during :func:`start_gateway`. Catches the
``telegram.error.TimedOut`` crash class (issues #31066 / #31110)
and any peer transient network error before it can kill the
gateway process. Logs at WARNING with full traceback so the
originating call site stays diagnosable; non-transient errors
are forwarded to the default loop handler so real bugs still
surface.
"""
exc = context.get("exception")
if exc is not None and _is_transient_network_error(exc):
message = context.get("message") or "transient network error"
task = context.get("future") or context.get("task")
task_name = ""
if task is not None:
try:
task_name = task.get_name() if hasattr(task, "get_name") else repr(task)
except Exception:
task_name = repr(task)
logger.warning(
"Gateway swallowed transient network error from %s: %s: %s",
task_name or "<unknown task>",
type(exc).__name__,
exc,
exc_info=(type(exc), exc, exc.__traceback__),
)
return
# Fall back to the default handler for anything we don't recognise.
loop.default_exception_handler(context)
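A loop-level handler of this shape can be exercised with `loop.call_exception_handler`, which is the path asyncio itself uses to report, for example, exceptions from tasks whose result is never retrieved. The sketch below uses hypothetical names (`TimedOut` as a local class, `handler`, and the two recording lists); the `forwarded` list stands in for the call to `loop.default_exception_handler(context)` so the demo stays quiet.

```python
import asyncio


class TimedOut(Exception):
    """Local stand-in for a transient network error."""


swallowed: list = []
forwarded: list = []


def handler(loop: asyncio.AbstractEventLoop, context: dict) -> None:
    exc = context.get("exception")
    if isinstance(exc, TimedOut):
        swallowed.append(context.get("message"))
        return
    # The real handler forwards to loop.default_exception_handler(context).
    forwarded.append(context.get("message"))


async def main() -> None:
    loop = asyncio.get_running_loop()
    loop.set_exception_handler(handler)
    loop.call_exception_handler(
        {"message": "poll timed out", "exception": TimedOut()}
    )
    loop.call_exception_handler(
        {"message": "real bug", "exception": ValueError("boom")}
    )


asyncio.run(main())
```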
def _redact_gateway_user_facing_secrets(text: str) -> str:
"""Best-effort secret redaction before text can leave the gateway."""
redacted = str(text or "")
@@ -774,31 +853,29 @@ if _config_path.exists():
os.environ[_env_var] = str(_val)
# Compression config is read directly from config.yaml by run_agent.py
# and auxiliary_client.py — no env var bridging needed.
# Auxiliary model/direct-endpoint overrides (vision, web_extract).
# Each task has provider/model/base_url/api_key; bridge non-default values to env vars.
# Auxiliary model/direct-endpoint overrides (vision, web_extract,
# approval, plus any plugin-registered auxiliary tasks).
# Each task has provider/model/base_url/api_key; bridge non-default
# values to env vars named AUXILIARY_<KEY_UPPER>_*. The legacy
# hard-coded list (vision/web_extract/approval) is replaced by a
# dynamic loop so plugin-registered tasks benefit from the same
# config→env bridging without core knowing about each one.
_auxiliary_cfg = _cfg.get("auxiliary", {})
if _auxiliary_cfg and isinstance(_auxiliary_cfg, dict):
_aux_task_env = {
"vision": {
"provider": "AUXILIARY_VISION_PROVIDER",
"model": "AUXILIARY_VISION_MODEL",
"base_url": "AUXILIARY_VISION_BASE_URL",
"api_key": "AUXILIARY_VISION_API_KEY",
},
"web_extract": {
"provider": "AUXILIARY_WEB_EXTRACT_PROVIDER",
"model": "AUXILIARY_WEB_EXTRACT_MODEL",
"base_url": "AUXILIARY_WEB_EXTRACT_BASE_URL",
"api_key": "AUXILIARY_WEB_EXTRACT_API_KEY",
},
"approval": {
"provider": "AUXILIARY_APPROVAL_PROVIDER",
"model": "AUXILIARY_APPROVAL_MODEL",
"base_url": "AUXILIARY_APPROVAL_BASE_URL",
"api_key": "AUXILIARY_APPROVAL_API_KEY",
},
}
for _task_key, _env_map in _aux_task_env.items():
# Built-in tasks that previously had explicit env-var bridging.
# Kept here as the canonical bridged set; plugin tasks are added
# below via the plugin auxiliary registry.
_aux_bridged_keys = {"vision", "web_extract", "approval"}
try:
from hermes_cli.plugins import get_plugin_auxiliary_tasks
for _entry in get_plugin_auxiliary_tasks():
_aux_bridged_keys.add(_entry["key"])
except Exception:
# Plugin discovery failure must not break gateway startup;
# built-in bridging stays intact.
pass
for _task_key in _aux_bridged_keys:
_task_cfg = _auxiliary_cfg.get(_task_key, {})
if not isinstance(_task_cfg, dict):
continue
@@ -806,14 +883,15 @@ if _config_path.exists():
_model = str(_task_cfg.get("model", "")).strip()
_base_url = str(_task_cfg.get("base_url", "")).strip()
_api_key = str(_task_cfg.get("api_key", "")).strip()
_upper = _task_key.upper()
if _prov and _prov != "auto":
os.environ[_env_map["provider"]] = _prov
os.environ[f"AUXILIARY_{_upper}_PROVIDER"] = _prov
if _model:
os.environ[_env_map["model"]] = _model
os.environ[f"AUXILIARY_{_upper}_MODEL"] = _model
if _base_url:
os.environ[_env_map["base_url"]] = _base_url
os.environ[f"AUXILIARY_{_upper}_BASE_URL"] = _base_url
if _api_key:
os.environ[_env_map["api_key"]] = _api_key
os.environ[f"AUXILIARY_{_upper}_API_KEY"] = _api_key
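The dynamic `AUXILIARY_<KEY_UPPER>_*` naming can be sketched as a pure function that writes into a dictionary instead of `os.environ`. The function name `bridge_auxiliary_env`, the plugin key `my_plugin_task`, and the example URL are hypothetical; the skip rules (non-dict task config, empty values, provider `auto`) follow the diff.

```python
def bridge_auxiliary_env(auxiliary_cfg: dict, task_keys: set, env: dict) -> dict:
    """Copy non-default auxiliary settings into env-style keys."""
    for key in sorted(task_keys):
        task_cfg = auxiliary_cfg.get(key, {})
        if not isinstance(task_cfg, dict):
            continue  # malformed entry: ignore, never raise
        upper = key.upper()
        provider = str(task_cfg.get("provider", "")).strip()
        if provider and provider != "auto":
            env[f"AUXILIARY_{upper}_PROVIDER"] = provider
        for field in ("model", "base_url", "api_key"):
            value = str(task_cfg.get(field, "")).strip()
            if value:
                env[f"AUXILIARY_{upper}_{field.upper()}"] = value
    return env


cfg = {
    "vision": {"provider": "auto", "model": "m1"},
    "my_plugin_task": {"provider": "openrouter", "base_url": "https://example.invalid/v1"},
    "approval": "not-a-dict",
}
env = bridge_auxiliary_env(cfg, {"vision", "approval", "my_plugin_task"}, {})
```

Because the variable name is derived from the task key, a plugin-registered task gets the same bridging as the built-in ones without an entry in a hard-coded table.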
# config.yaml is the documented, authoritative source for these
# settings — it unconditionally wins over .env values. Previously
# the guards below read `if X not in os.environ` and let stale
@@ -840,6 +918,8 @@ if _config_path.exists():
if _display_cfg and isinstance(_display_cfg, dict):
if "busy_input_mode" in _display_cfg:
os.environ["HERMES_GATEWAY_BUSY_INPUT_MODE"] = str(_display_cfg["busy_input_mode"])
if "busy_text_mode" in _display_cfg:
os.environ["HERMES_GATEWAY_BUSY_TEXT_MODE"] = str(_display_cfg["busy_text_mode"])
if "busy_ack_enabled" in _display_cfg:
os.environ["HERMES_GATEWAY_BUSY_ACK_ENABLED"] = str(_display_cfg["busy_ack_enabled"])
# Timezone: bridge config.yaml → HERMES_TIMEZONE env var.
@@ -963,6 +1043,12 @@ _AGENT_PENDING_SENTINEL = object()
def _resolve_runtime_agent_kwargs() -> dict:
"""Resolve provider credentials for gateway-created AIAgent instances.
Provider is read from ``config.yaml`` ``model.provider`` (the single
source of truth). ``resolve_runtime_provider()`` falls through to env
var lookups internally for legacy compatibility, but the gateway does
not consult environment variables for behavioral config; config.yaml
is authoritative.
If the primary provider fails with an authentication error, attempt to
resolve credentials using the fallback provider chain from config.yaml
before giving up.
@@ -974,9 +1060,7 @@ def _resolve_runtime_agent_kwargs() -> dict:
from hermes_cli.auth import AuthError
try:
runtime = resolve_runtime_provider(
requested=os.getenv("HERMES_INFERENCE_PROVIDER"),
)
runtime = resolve_runtime_provider()
except AuthError as auth_exc:
# Primary provider auth failed (expired token, revoked key, etc.).
# Try the fallback provider chain before raising.
@@ -1551,6 +1635,7 @@ class GatewayRunner:
# blow up on attribute access.
_running_agents_ts: Dict[str, float] = {}
_busy_input_mode: str = "interrupt"
_busy_text_mode: str = "interrupt"
_restart_drain_timeout: float = DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT
_exit_code: Optional[int] = None
_draining: bool = False
@@ -1577,6 +1662,7 @@ class GatewayRunner:
self._service_tier = self._load_service_tier()
self._show_reasoning = self._load_show_reasoning()
self._busy_input_mode = self._load_busy_input_mode()
self._busy_text_mode = self._load_busy_text_mode()
self._restart_drain_timeout = self._load_restart_drain_timeout()
self._provider_routing = self._load_provider_routing()
self._fallback_model = self._load_fallback_model()
@@ -2186,13 +2272,14 @@ class GatewayRunner:
) -> Optional[str]:
"""Pin DM-topic routing to the user's last-active topic.
Telegram fragments topic-mode DMs two ways: a Reply on a message
in another topic delivers ``message_thread_id`` for *that* topic,
and ``_build_message_event`` strips the thread_id on plain replies
(#3206 — needed for non-topic users). Both route the user to the
wrong session. When topic mode is on, rewrite the thread_id to the
user's most-recent binding if the inbound id is missing/General or
not a known topic for this chat. Returns None to leave it alone.
Telegram can omit ``message_thread_id`` or surface General (``1``)
for some topic-mode DM replies. In those lobby-shaped cases, keep the
conversation attached to the user's most-recent bound topic.
Do not rewrite a non-lobby, previously-unbound thread id: a newly
created Telegram DM topic is also "unknown" until the first inbound
message is recorded, and rewriting it would send that brand-new topic's
answer into an older lane. Returns None to leave the source alone.
"""
if (
source.platform != Platform.TELEGRAM
@@ -2202,6 +2289,14 @@ class GatewayRunner:
or not self._telegram_topic_mode_enabled(source)
):
return None
inbound = str(source.thread_id or "")
is_lobby = not inbound or inbound in self._TELEGRAM_GENERAL_TOPIC_IDS
if not is_lobby:
# A non-lobby, unknown thread_id is most likely the first message in
# a brand-new Telegram DM topic. Preserve it so it can be recorded
# as a new independent lane below instead of hijacking the latest
# existing topic binding.
return None
session_db = getattr(self, "_session_db", None)
if session_db is None:
return None
@@ -2214,11 +2309,6 @@ class GatewayRunner:
return None
if not bindings:
return None
inbound = str(source.thread_id or "")
is_lobby = not inbound or inbound in self._TELEGRAM_GENERAL_TOPIC_IDS
known = {str(b.get("thread_id") or "") for b in bindings}
if not is_lobby and inbound in known:
return None
user_id = str(source.user_id)
for b in bindings: # newest-first
if str(b.get("user_id") or "") == user_id:
@@ -2823,6 +2913,17 @@ class GatewayRunner:
return "steer"
return "interrupt"
@staticmethod
def _load_busy_text_mode() -> str:
"""Load normal busy TEXT follow-up behavior from config/env."""
mode = os.getenv("HERMES_GATEWAY_BUSY_TEXT_MODE", "").strip().lower()
if not mode:
cfg = _load_gateway_runtime_config()
mode = str(cfg_get(cfg, "display", "busy_text_mode", default="") or "").strip().lower()
if mode == "interrupt":
return "interrupt"
return "queue"
@staticmethod
def _load_restart_drain_timeout() -> float:
"""Load graceful gateway restart/stop drain timeout in seconds."""
@@ -2970,11 +3071,19 @@ class GatewayRunner:
running_agent = self._running_agents.get(session_key)
effective_mode = self._busy_input_mode
busy_text_mode = getattr(self, "_busy_text_mode", "queue")
if (
event.message_type == MessageType.TEXT
and busy_text_mode == "queue"
and effective_mode != "steer"
):
return False
# Steer mode: inject mid-run via running_agent.steer() instead of
# queueing + interrupting. If the agent isn't running yet
# (sentinel) or lacks steer(), or the payload is empty, fall back
# to queue semantics so nothing is lost.
effective_mode = self._busy_input_mode
steered = False
if effective_mode == "steer":
steer_text = (event.text or "").strip()
@@ -2999,7 +3108,12 @@ class GatewayRunner:
# successful steer — the text already landed inside the run and
# must NOT also be replayed as a next-turn user message.
if not steered:
merge_pending_message_event(adapter._pending_messages, session_key, event)
merge_pending_message_event(
adapter._pending_messages,
session_key,
event,
merge_text=event.message_type == MessageType.TEXT,
)
is_queue_mode = effective_mode == "queue"
is_steer_mode = effective_mode == "steer"
@@ -3931,6 +4045,7 @@ class GatewayRunner:
adapter.set_fatal_error_handler(self._handle_adapter_fatal_error)
adapter.set_session_store(self.session_store)
adapter.set_busy_session_handler(self._handle_active_session_busy_message)
adapter._busy_text_mode = self._busy_text_mode
# Try to connect
logger.info("Connecting to %s...", platform.value)
@@ -5543,6 +5658,7 @@ class GatewayRunner:
adapter.set_fatal_error_handler(self._handle_adapter_fatal_error)
adapter.set_session_store(self.session_store)
adapter.set_busy_session_handler(self._handle_active_session_busy_message)
adapter._busy_text_mode = self._busy_text_mode
success = await self._connect_adapter_with_timeout(adapter, platform)
if success:
@@ -6296,18 +6412,6 @@ class GatewayRunner:
if allow_bots_var and os.getenv(allow_bots_var, "none").lower().strip() in {"mentions", "all"}:
return True
# Discord role-based access (DISCORD_ALLOWED_ROLES): the adapter's
# on_message pre-filter already verified role membership — if the
# message reached here, the user passed that check. Authorize
# directly to avoid the "no allowlists configured" branch below
# rejecting role-only setups where DISCORD_ALLOWED_USERS is empty
# (issue #7871).
if (
source.platform == Platform.DISCORD
and os.getenv("DISCORD_ALLOWED_ROLES", "").strip()
):
return True
# Check pairing store (always checked, regardless of allowlists)
platform_name = source.platform.value if source.platform else ""
if self.pairing_store.is_approved(platform_name, user_id):
@@ -12637,7 +12741,7 @@ class GatewayRunner:
return t("gateway.title.current_no_title", session_id=session_id)
async def _handle_resume_command(self, event: MessageEvent) -> str:
"""Handle /resume command — switch to a previously-named session."""
"""Handle /resume command — list or switch to a previous session."""
if not self._session_db:
from hermes_state import format_session_db_unavailable
return format_session_db_unavailable(prefix=t("gateway.shared.session_db_unavailable_prefix"))
@@ -12646,30 +12750,44 @@ class GatewayRunner:
session_key = self._session_key_for_source(source)
name = event.get_command_args().strip()
def _list_titled_sessions() -> list[dict]:
user_source = source.platform.value if source.platform else None
sessions = self._session_db.list_sessions_rich(source=user_source, limit=10)
return [s for s in sessions if s.get("title")][:10]
if not name:
# List recent titled sessions for this user/platform
try:
user_source = source.platform.value if source.platform else None
sessions = self._session_db.list_sessions_rich(
source=user_source, limit=10
)
titled = [s for s in sessions if s.get("title")]
titled = _list_titled_sessions()
if not titled:
return t("gateway.resume.no_named_sessions")
lines = [t("gateway.resume.list_header")]
for s in titled[:10]:
for idx, s in enumerate(titled[:10], start=1):
title = s["title"]
preview = s.get("preview", "")[:40]
preview_part = t("gateway.resume.list_preview_suffix", preview=preview) if preview else ""
lines.append(t("gateway.resume.list_item", title=title, preview_part=preview_part))
lines.append(t("gateway.resume.list_footer"))
lines.append(t("gateway.resume.list_item_numbered", index=idx, title=title, preview_part=preview_part))
lines.append(t("gateway.resume.list_footer_numbered"))
return "\n".join(lines)
except Exception as e:
logger.debug("Failed to list titled sessions: %s", e)
return t("gateway.resume.list_failed", error=e)
# Resolve the name to a session ID.
target_id = self._session_db.resolve_session_by_title(name)
# Resolve a numbered choice or a title to a session ID.
if name.isdecimal():
try:
titled = _list_titled_sessions()
except Exception as e:
logger.debug("Failed to list titled sessions for numeric resume: %s", e)
return t("gateway.resume.list_failed", error=e)
index = int(name)
if index < 1 or index > len(titled):
return t("gateway.resume.out_of_range", index=index)
target = titled[index - 1]
target_id = target.get("id")
name = target.get("title") or name
else:
target_id = self._session_db.resolve_session_by_title(name)
if not target_id:
return t("gateway.resume.not_found", name=name)
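The numbered-choice resolution can be sketched as a small pure function. `resolve_resume_target`, its tuple return shape, and the sample sessions are hypothetical. The sketch uses `str.isdecimal()` for the numeric test because `str.isdigit()` also accepts characters such as `"²"` that `int()` rejects.

```python
from typing import Callable, Optional, Tuple


def resolve_resume_target(
    arg: str,
    titled: list,
    resolve_by_title: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (session_id, error); exactly one of the two is None."""
    if arg.isdecimal():
        index = int(arg)
        if index < 1 or index > len(titled):
            return None, f"out of range: {index}"
        return titled[index - 1].get("id"), None
    target = resolve_by_title(arg)
    if not target:
        return None, f"not found: {arg}"
    return target, None


titled = [
    {"id": "s-1", "title": "refactor plan"},
    {"id": "s-2", "title": "release notes"},
]
by_title = {s["title"]: s["id"] for s in titled}
```

One consequence of checking for a number first is that a session whose title consists only of digits cannot be selected by title; the index interpretation always wins.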
# Compression creates child continuations that hold the live transcript.
@@ -17020,6 +17138,7 @@ class GatewayRunner:
"context_length": _context_length,
"session_id": effective_session_id,
"response_previewed": result.get("response_previewed", False),
"response_transformed": result.get("response_transformed", False),
}
# Start progress message sender if enabled
@@ -17657,7 +17776,11 @@ class GatewayRunner:
_content_delivered = bool(
_sc and getattr(_sc, "final_content_delivered", False)
)
if not _is_empty_sentinel and (_streamed or _previewed or _content_delivered):
# Plugin hooks (e.g. transform_llm_output) may have appended content
# after streaming finished — when the response was transformed, always
# send the final version so the appended content reaches the client.
_transformed = bool(response.get("response_transformed"))
if not _is_empty_sentinel and not _transformed and (_streamed or _previewed or _content_delivered):
logger.info(
"Suppressing normal final send for session %s: final delivery already confirmed (streamed=%s previewed=%s content_delivered=%s).",
session_key or "?",
@@ -17666,6 +17789,28 @@ class GatewayRunner:
_content_delivered,
)
response["already_sent"] = True
elif not _is_empty_sentinel and _transformed and _sc is not None:
# Plugin hooks transformed the response after streaming — edit the
# existing streamed message instead of sending a duplicate.
_sc_msg_id = _sc.message_id
if _sc_msg_id:
try:
await _sc.adapter.edit_message(
chat_id=source.chat_id,
message_id=_sc_msg_id,
content=response["final_response"],
finalize=True,
)
response["already_sent"] = True
logger.info(
"Edited streamed message %s for session %s to include plugin-transformed content.",
_sc_msg_id, session_key or "?",
)
except Exception as _edit_err:
logger.warning(
"Failed to edit streamed message for session %s: %s",
session_key or "?", _edit_err,
)
# Schedule deletion of tracked temporary progress bubbles after the
# final response lands. Failed runs skip this so bubbles remain as
@@ -18092,6 +18237,21 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =
runner.request_restart(detached=False, via_service=True)
loop = asyncio.get_running_loop()
# Install a loop-level exception handler that swallows transient
# network errors from background tasks. Issues #31066 / #31110:
# an unhandled ``telegram.error.TimedOut`` (or peer NetworkError /
# httpx connection error) in any awaited coroutine would propagate
# to the loop and kill the gateway process, taking down every
# profile attached to the same runner. systemd then restarts the
# service after ~5s but the active conversation turn is lost.
#
# The fix is intentionally narrow: only well-known transient
# network errors are swallowed (and logged with full traceback so
# the originating call site is still discoverable). Anything else
# is forwarded to the default handler so real bugs still surface.
loop.set_exception_handler(_gateway_loop_exception_handler)
if threading.current_thread() is threading.main_thread():
for sig in (signal.SIGINT, signal.SIGTERM):
try:
+15
@@ -83,6 +83,21 @@ _VAR_MAP = {
}
def set_current_session_id(session_id: str) -> None:
"""Synchronize ``HERMES_SESSION_ID`` across ContextVar and ``os.environ``.
Long-lived single-process entrypoints like the CLI can rotate sessions via
``/new``, ``/resume``, ``/branch``, or compression splits without
reconstructing the entire agent. Tools still consult
``get_session_env("HERMES_SESSION_ID")`` with an ``os.environ`` fallback,
so both storage paths must move together when the active session changes.
"""
import os
os.environ["HERMES_SESSION_ID"] = session_id
_SESSION_ID.set(session_id)
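The two-store synchronization can be sketched with a stand-alone `ContextVar`. The names `_DEMO_SESSION_ID`, `sync_session_id`, and `read_session_id` are hypothetical, and the reader's fallback order (ContextVar first, then `os.environ`) is an assumption based on the docstring above, not the project's actual `get_session_env`.

```python
import os
from contextvars import ContextVar

_DEMO_SESSION_ID: ContextVar[str] = ContextVar("demo_session_id", default="")


def sync_session_id(session_id: str) -> None:
    """Move both storage paths together when the active session changes."""
    os.environ["HERMES_SESSION_ID"] = session_id
    _DEMO_SESSION_ID.set(session_id)


def read_session_id() -> str:
    """Prefer the ContextVar; fall back to the process environment."""
    return _DEMO_SESSION_ID.get() or os.environ.get("HERMES_SESSION_ID", "")


sync_session_id("session-a")
first = read_session_id()
sync_session_id("session-b")  # e.g. after a /new or /resume rotation
second = read_session_id()
```

Updating only one of the two stores would leave readers that use the other path reporting the previous session after a rotation.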
def set_session_vars(
platform: str = "",
chat_id: str = "",
+5
@@ -192,6 +192,11 @@ class GatewayStreamConsumer:
"""True when the stream consumer delivered the final assistant reply."""
return self._final_response_sent
@property
def message_id(self) -> str | None:
"""The Discord/chat message ID of the last-sent or edited message."""
return self._message_id
@property
def final_content_delivered(self) -> bool:
"""True when the final response content reached the user, even if
+7 -2
@@ -129,7 +129,8 @@ def build_top_level_parser():
default=None,
help=(
"Provider override for this invocation (e.g. openrouter, anthropic). "
"Applies to -z/--oneshot and --tui. Also settable via HERMES_INFERENCE_PROVIDER env var."
"Applies to -z/--oneshot and --tui. The persistent provider lives in config.yaml "
"under model.provider — use `hermes setup` or edit the file to change it."
),
)
parser.add_argument(
@@ -268,7 +269,11 @@ def build_top_level_parser():
help="Inference provider (default: auto). Built-in or a user-defined name from `providers:` in config.yaml.",
)
chat_parser.add_argument(
"-v", "--verbose", action="store_true", help="Verbose output"
"-v",
"--verbose",
action="store_true",
default=argparse.SUPPRESS,
help="Verbose output",
)
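The effect of `default=argparse.SUPPRESS` on a subcommand flag can be checked with a toy parser. This sketch assumes a top-level parser that defines the same `-v/--verbose` flag, which is not shown in this hunk; `build_parser` and the `demo` program name are hypothetical. With `SUPPRESS`, the subparser adds no `verbose` attribute unless the flag is actually given after the subcommand, so a value parsed at the top level is not replaced by a subparser default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-v", "--verbose", action="store_true")
    sub = parser.add_subparsers(dest="command")
    chat = sub.add_parser("chat")
    chat.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        default=argparse.SUPPRESS,  # no attribute unless the flag is passed
    )
    return parser


parser = build_parser()
flag_before = parser.parse_args(["-v", "chat"])
flag_after = parser.parse_args(["chat", "-v"])
no_flag = parser.parse_args(["chat"])

# A parser with only the suppressed flag leaves the attribute absent.
solo = argparse.ArgumentParser(prog="solo")
solo.add_argument("-v", "--verbose", action="store_true", default=argparse.SUPPRESS)
solo_ns = solo.parse_args([])
```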
chat_parser.add_argument(
"-Q",
+5 -1
@@ -553,6 +553,7 @@ _PLACEHOLDER_SECRET_VALUES = {
"***",
"changeme",
"your_api_key",
"your_api_key_here",
"your-api-key",
"placeholder",
"example",
@@ -2065,7 +2066,10 @@ def resolve_qwen_runtime_credentials(
def get_qwen_auth_status() -> Dict[str, Any]:
auth_path = _qwen_cli_auth_path()
try:
creds = resolve_qwen_runtime_credentials(refresh_if_expiring=False)
# Validate the runtime credentials, including refresh when the cached
# CLI token is expired. Otherwise stale tokens show up as "logged in"
# and `hermes model` walks users into a broken Qwen setup flow.
creds = resolve_qwen_runtime_credentials(refresh_if_expiring=True)
return {
"logged_in": True,
"auth_file": str(auth_path),
+1 -1
@@ -164,7 +164,7 @@ COMMAND_REGISTRY: list[CommandDef] = [
cli_only=True),
CommandDef("skills", "Search, install, inspect, or manage skills",
"Tools & Skills", cli_only=True,
subcommands=("search", "browse", "inspect", "install")),
subcommands=("search", "browse", "inspect", "install", "audit")),
CommandDef("bundles", "List skill bundles (aliases /<name> for multiple skills)",
"Tools & Skills"),
CommandDef("cron", "Manage scheduled tasks", "Tools & Skills",
+23 -1
@@ -658,7 +658,8 @@ DEFAULT_CONFIG = {
# are owned by your host user instead of root, which avoids needing
# `sudo chown` after container runs. Default off to preserve behavior
# for images whose entrypoints expect to start as root (e.g. the
# bundled Hermes image, which drops to the `hermes` user via gosu).
# bundled Hermes image, which drops to the `hermes` user via
# s6-setuidgid inside each supervised service).
# When on, SETUID/SETGID caps are omitted from the container since
# no privilege drop is needed.
"docker_run_as_host_user": False,
@@ -1008,6 +1009,19 @@ DEFAULT_CONFIG = {
"compact": False,
"personality": "kawaii",
"resume_display": "full",
# Recap tuning for /resume and startup resume. The defaults match the
# historical hardcoded values; expose them as config so power users can
# widen or tighten the snapshot to taste.
"resume_exchanges": 10, # max user+assistant pairs to show
"resume_max_user_chars": 300, # truncate user message text
"resume_max_assistant_chars": 200, # truncate non-last assistant text
"resume_max_assistant_lines": 3, # truncate non-last assistant lines
# When True (default), assistant entries that are *only* tool calls
# (no visible text) are skipped in the recap. This prevents the recap
# from being dominated by `[2 tool calls: terminal, read_file]` lines
# when an exchange was tool-heavy. Set False to restore the legacy
# behavior of showing tool-call summaries inline.
"resume_skip_tool_only": True,
"busy_input_mode": "interrupt", # interrupt | queue | steer
# When true, `hermes --tui` auto-resumes the most recent human-
# facing session on launch instead of forging a fresh one.
@@ -1775,6 +1789,14 @@ DEFAULT_CONFIG = {
# ~/.hermes/bin/ on first use. When False you must install
# bws yourself and have it on PATH.
"auto_install": True,
# Bitwarden region / self-hosted endpoint. Empty string
# means use the bws CLI default (US Cloud,
# https://vault.bitwarden.com). Set to
# https://vault.bitwarden.eu for EU Cloud, or your own URL
# for self-hosted Bitwarden. Plumbed into the bws subprocess
# as BWS_SERVER_URL. Prompted for during
# `hermes secrets bitwarden setup`.
"server_url": "",
},
},
+325
@@ -0,0 +1,325 @@
"""Container-boot reconciliation of per-profile gateway s6 services.
Service directories under /run/service/ live on **tmpfs** and are wiped
on every container restart. Profile directories under
``$HERMES_HOME/profiles/<name>/`` live on the persistent VOLUME, and
each one records its gateway's last state in ``gateway_state.json``.
This module bridges the two: on every container boot, walk the
persistent profiles, recreate the s6 service slots, and auto-start
only those whose last recorded state was ``running``.
Wired into the image as /etc/cont-init.d/02-reconcile-profiles by the
Dockerfile (Phase 4 Task 4.0). Runs as root after 01-hermes-setup
(the stage2 hook) has chowned the volume and seeded $HERMES_HOME, but
before s6-rc starts user services.
Without this module, every ``docker restart`` would silently wipe
every per-profile gateway, even though the user's profiles still
exist on disk.
"""
from __future__ import annotations
import json
import logging
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Literal
log = logging.getLogger(__name__)
# Only this prior state triggers automatic restart. Everything else
# (startup_failed, starting, stopped, missing) registers the slot in
# the down state and waits for explicit user action — this avoids the
# crash-loop where a broken gateway keeps being restarted across
# `docker restart` cycles.
_AUTOSTART_STATES = frozenset({"running"})
# Stale runtime files we sweep before recreating service slots. These
# all hold container-namespaced state (PIDs, process tables) that's
# garbage post-restart — a numerically-equal PID in the new container
# is a different process. See the Risk Register in the plan.
_STALE_RUNTIME_FILES = ("gateway.pid", "processes.json")
ReconcileActionLabel = Literal["started", "registered", "skipped"]
@dataclass(frozen=True)
class ReconcileAction:
"""One profile's outcome from a single reconciliation pass."""
profile: str
prior_state: str | None
action: ReconcileActionLabel
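The autostart rule can be sketched independently of s6. Two things here are assumptions and not taken from the module: the JSON key `"state"` inside `gateway_state.json` is hypothetical (the file's schema is not shown in this diff), and `read_prior_state` / `plan_action` are illustrative names, not the module's private helpers. Only the rule itself, "start when the prior state was `running`, otherwise register the slot down", comes from the source.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

AUTOSTART_STATES = frozenset({"running"})


def read_prior_state(profile_dir: Path) -> Optional[str]:
    """Return the recorded state, or None if missing or unreadable."""
    try:
        data = json.loads((profile_dir / "gateway_state.json").read_text())
    except (OSError, ValueError):
        return None
    # Hypothetical key name; the real schema may differ.
    state = data.get("state") if isinstance(data, dict) else None
    return state if isinstance(state, str) else None


def plan_action(prior_state: Optional[str]) -> str:
    return "started" if prior_state in AUTOSTART_STATES else "registered"


def demo() -> dict:
    samples = {
        "alpha": '{"state": "running"}',
        "beta": '{"state": "startup_failed"}',
        "gamma": "not json",
    }
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        outcomes = {}
        for name, payload in samples.items():
            profile = root / name
            profile.mkdir()
            (profile / "gateway_state.json").write_text(payload)
            outcomes[name] = plan_action(read_prior_state(profile))
        (root / "delta").mkdir()  # profile with no state file at all
        outcomes["delta"] = plan_action(read_prior_state(root / "delta"))
        return outcomes


outcomes = demo()
```

Treating every unreadable or unknown state as "registered" keeps a broken gateway from being restarted on each container boot.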
def reconcile_profile_gateways(
*,
hermes_home: Path,
scandir: Path,
dry_run: bool = False,
) -> list[ReconcileAction]:
"""Recreate s6 service registrations for every persistent profile.
Always registers a ``gateway-default`` slot for the root profile
(the implicit profile that lives at the top of ``$HERMES_HOME``,
not under ``profiles/``). The dispatcher in ``hermes_cli.gateway``
maps an empty profile suffix to ``gateway-default``, so this slot
is what ``hermes gateway start`` (no ``-p``) targets. Without it,
bare ``hermes gateway start`` inside the container would land on
``s6-svc -u /run/service/gateway-default`` and surface an uncaught
``CalledProcessError`` traceback to the user (PR #30136 review).
The default slot's prior state is read from
``$HERMES_HOME/gateway_state.json`` (sibling to the profile root,
not under ``profiles/``); stale runtime files there are swept the
same way as for named profiles.
Args:
hermes_home: The container's HERMES_HOME (typically /opt/data).
Profiles live under ``<hermes_home>/profiles/<name>/``;
the default profile lives at ``<hermes_home>`` itself.
scandir: The s6 dynamic scandir (typically /run/service). Service
directories are created at ``<scandir>/gateway-<profile>/``.
dry_run: When True, walk and return the action list without
touching the filesystem. For tests and `--dry-run` debug.
Returns:
One :class:`ReconcileAction` per profile, in this order:
``default`` first, then named profiles in directory order.
"""
actions: list[ReconcileAction] = []
# Default profile — always register, even if nothing has ever
# populated the root profile dir. The slot exists so
# ``hermes gateway start`` (no ``-p``) has somewhere to land;
# auto-up only when the prior state was "running" (same rule as
# named profiles).
default_prior_state = _read_prior_state(hermes_home)
default_should_start = default_prior_state in _AUTOSTART_STATES
if not dry_run:
_cleanup_stale_runtime_files(hermes_home)
_register_service(scandir, "default", start=default_should_start)
actions.append(ReconcileAction(
profile="default",
prior_state=default_prior_state,
action="started" if default_should_start else "registered",
))
profiles_root = hermes_home / "profiles"
if profiles_root.is_dir():
for entry in sorted(profiles_root.iterdir()):
if not entry.is_dir():
continue
# SOUL.md is always seeded by `hermes profile create` (config.yaml
# is not — that comes later via `hermes setup`). Use it as the
# "real profile" marker so stray dirs (backups, manual mkdir)
# aren't picked up.
if not (entry / "SOUL.md").exists():
continue
# The "default" service name is reserved for the root
# profile (above) — if a user has somehow created a
# ``profiles/default/`` directory, skip it to avoid the
# slot collision. Their gateway would still be reachable
# via ``hermes -p default-named gateway start`` if they
# rename the directory; we don't try to disambiguate here.
if entry.name == "default":
log.warning(
"profiles/default/ exists — skipping to avoid colliding "
"with the reserved root-profile s6 slot",
)
continue
prior_state = _read_prior_state(entry)
should_start = prior_state in _AUTOSTART_STATES
if not dry_run:
_cleanup_stale_runtime_files(entry)
_register_service(scandir, entry.name, start=should_start)
actions.append(ReconcileAction(
profile=entry.name,
prior_state=prior_state,
action="started" if should_start else "registered",
))
if not dry_run:
_write_reconcile_log(hermes_home, actions)
return actions
def _read_prior_state(profile_dir: Path) -> str | None:
"""Read gateway_state.json's ``gateway_state`` field, or None if
missing or unparseable. Unparseable counts as "no prior state" so
we don't bork the whole reconciliation on a corrupt file."""
state_file = profile_dir / "gateway_state.json"
if not state_file.exists():
return None
try:
return json.loads(state_file.read_text()).get("gateway_state")
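except (OSError, ValueError, AttributeError):
# ValueError covers JSONDecodeError and UnicodeDecodeError;
# AttributeError covers valid JSON that is not an object.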
log.warning(
"could not read %s; treating as no prior state", state_file,
)
return None
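The autostart rule above can be illustrated in isolation. This is a minimal sketch rather than the module's code: `read_prior_state` and `decide_action` are hypothetical stand-ins, and treating a non-object JSON payload as "no prior state" is an assumption about how the read could be hardened.

```python
import json
import tempfile
from pathlib import Path

AUTOSTART_STATES = frozenset({"running"})  # mirrors _AUTOSTART_STATES


def read_prior_state(profile_dir: Path):
    """Return the recorded gateway state, or None when missing or unreadable."""
    state_file = profile_dir / "gateway_state.json"
    try:
        data = json.loads(state_file.read_text())
    except (OSError, ValueError):
        # FileNotFoundError is an OSError; JSONDecodeError is a ValueError.
        return None
    # Assumption: a non-object payload (list, string) counts as "no state".
    return data.get("gateway_state") if isinstance(data, dict) else None


def decide_action(prior_state):
    """Map a prior state to the reconcile action label."""
    return "started" if prior_state in AUTOSTART_STATES else "registered"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    running = root / "running"
    running.mkdir()
    (running / "gateway_state.json").write_text('{"gateway_state": "running"}')
    corrupt = root / "corrupt"
    corrupt.mkdir()
    (corrupt / "gateway_state.json").write_text("{not json")
    outcomes = {
        "running": decide_action(read_prior_state(running)),
        "corrupt": decide_action(read_prior_state(corrupt)),
        "missing": decide_action(read_prior_state(root / "absent")),
    }
```

Only a profile whose last recorded state was "running" is brought up; a corrupt or absent state file degrades to a registered-but-down slot instead of aborting the pass.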
def _cleanup_stale_runtime_files(profile_dir: Path) -> None:
"""Remove gateway.pid and processes.json — they reference PIDs in
the dead container's process namespace and would otherwise confuse
the newly-started gateway's process-mismatch checks."""
for name in _STALE_RUNTIME_FILES:
(profile_dir / name).unlink(missing_ok=True)
def _register_service(scandir: Path, profile: str, *, start: bool) -> None:
"""Recreate the s6 service slot for one profile.
Mirrors the rendering in :func:`S6ServiceManager.register_profile_gateway`,
but here we control the start state directly via the ``down`` marker
file (s6-svscan honors it on rescan). Cannot use the manager
directly because the cont-init.d phase runs as root before
s6-svscan starts scanning the dynamic scandir, so the manager's
``s6-svscanctl -a`` call would fail with no control socket.
Atomicity: build the new layout in a sibling temp directory and
rename it into place via :meth:`Path.replace`. This matches
:meth:`S6ServiceManager.register_profile_gateway` (PR #30136
review item O4). Even though cont-init.d runs before s6-svscan
starts scanning, an atomic publication keeps the contract uniform
between the two registration paths and protects against a
half-populated dir if the script is interrupted mid-write.
"""
import shutil
from hermes_cli.service_manager import (
S6ServiceManager,
_seed_supervise_skeleton,
validate_profile_name,
)
validate_profile_name(profile)
service_dir = scandir / f"gateway-{profile}"
tmp_dir = service_dir.with_name(service_dir.name + ".tmp")
# Wipe any leftover tmp from a previous interrupted run.
if tmp_dir.exists():
shutil.rmtree(tmp_dir, ignore_errors=True)
tmp_dir.mkdir(parents=True)
try:
(tmp_dir / "type").write_text("longrun\n")
# Reuse the manager's run-script rendering — single source of
# truth so register_profile_gateway and reconcile_profile_gateways
# stay consistent. extra_env is empty here; users who need
# per-profile env can set it via the profile's config.yaml
# (which the gateway itself loads).
run = tmp_dir / "run"
run.write_text(S6ServiceManager._render_run_script(profile, extra_env={}))
run.chmod(0o755)
# Persistent log rotation (OQ8-C).
log_subdir = tmp_dir / "log"
log_subdir.mkdir()
log_run = log_subdir / "run"
log_run.write_text(S6ServiceManager._render_log_run(profile))
log_run.chmod(0o755)
# The presence of a `down` file tells s6-supervise to NOT
# start the service when s6-svscan picks it up. User brings
# it up explicitly with `hermes -p <profile> gateway start`
# (which routes through the Phase 4
# _dispatch_via_service_manager_if_s6 helper to `s6-svc -u`).
if not start:
(tmp_dir / "down").touch()
# Pre-create the supervise/ skeleton with hermes ownership
# BEFORE we publish the slot. Mirrors the same pre-creation
# step in S6ServiceManager.register_profile_gateway — when
# s6-svscan picks the published slot up, the s6-supervise it
# spawns will EEXIST our dirs/FIFOs and inherit hermes
# ownership, so runtime s6-svc / s6-svstat / s6-svwait calls
# (all dispatched as the hermes user) won't hit EACCES. See
# ``_seed_supervise_skeleton`` in service_manager.py for the
# full rationale.
_seed_supervise_skeleton(tmp_dir)
# Publish via rename. Path.replace (os.rename on POSIX) cannot
# replace a non-empty directory, so a previous reconcile pass's
# slot is removed first and the new layout is then renamed into
# place in a single operation.
if service_dir.exists():
shutil.rmtree(service_dir)
tmp_dir.replace(service_dir)
except Exception:
shutil.rmtree(tmp_dir, ignore_errors=True)
raise
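The build-then-rename publication described in `_register_service` can be sketched on its own. This is an illustrative reduction under stated assumptions: `publish_slot` is a hypothetical name, and the run script, log subdirectory, and supervise skeleton are omitted.

```python
import shutil
import tempfile
from pathlib import Path


def publish_slot(scandir: Path, name: str, *, start: bool) -> Path:
    """Build a service dir in a sibling temp dir, then rename it into place."""
    service_dir = scandir / name
    tmp_dir = service_dir.with_name(service_dir.name + ".tmp")
    shutil.rmtree(tmp_dir, ignore_errors=True)  # leftover from a crashed run
    tmp_dir.mkdir(parents=True)
    try:
        (tmp_dir / "type").write_text("longrun\n")
        if not start:
            (tmp_dir / "down").touch()  # marker: do not start automatically
        # A rename cannot replace a non-empty directory, so clear the old
        # slot first. This leaves a brief window with no slot on disk.
        if service_dir.exists():
            shutil.rmtree(service_dir)
        tmp_dir.replace(service_dir)
    except Exception:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return service_dir


with tempfile.TemporaryDirectory() as tmp:
    scandir = Path(tmp)
    first = publish_slot(scandir, "gateway-demo", start=False)
    had_down_marker = (first / "down").exists()
    second = publish_slot(scandir, "gateway-demo", start=True)
    down_after_republish = (second / "down").exists()
    leftover_tmp = (scandir / "gateway-demo.tmp").exists()
```

A reader of the scandir never sees a half-written slot, because the directory only appears under its final name once every file is in place.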
def _write_reconcile_log(
hermes_home: Path, actions: list[ReconcileAction],
) -> None:
"""Append one line per profile to $HERMES_HOME/logs/container-boot.log.
Operators inspect this to debug "why didn't my profile come back
up". Keeping a separate log file (vs. mixing into agent.log) lets
troubleshooters grep for "profile=foo" without wading through
unrelated activity.
Size-bounded: when the file reaches ``_LOG_ROTATE_BYTES``
(defaults to 256 KiB, roughly 3000 reconcile lines), the current file
is renamed to ``container-boot.log.1`` (replacing any previous
rotation) before the new entries are appended. This gives long-
lived containers a soft cap of ~512 KiB across the two files
without pulling in logrotate or s6-log machinery just for this
one append-only file (PR #30136 review item O3).
"""
import time
log_dir = hermes_home / "logs"
log_dir.mkdir(parents=True, exist_ok=True)
log_path = log_dir / "container-boot.log"
# Rotate before opening to append, so the new entries always land
# in a fresh file when we crossed the threshold last time.
try:
if log_path.exists() and log_path.stat().st_size >= _LOG_ROTATE_BYTES:
log_path.replace(log_dir / "container-boot.log.1")
except OSError as exc:
# Rotation failure is non-fatal — keep appending to the
# existing file rather than losing the entry entirely.
log.warning("could not rotate %s: %s", log_path, exc)
ts = time.strftime("%Y-%m-%dT%H:%M:%S%z")
with log_path.open("a", encoding="utf-8") as f:
for a in actions:
f.write(
f"{ts} profile={a.profile} prior_state={a.prior_state} "
f"action={a.action}\n"
)
# 256 KiB soft cap on container-boot.log; rotated to .1 when crossed.
# At ~80 B per reconcile-action line this is ~3000 lines, or about a
# year of daily reboots on a 5-profile container. Two files = ~512 KiB
# worst case. Tuned for visibility (small enough to grep / cat without
# scrolling forever) more than space (the persistent volume has GB).
_LOG_ROTATE_BYTES = 256 * 1024
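The single-generation rotation used for container-boot.log can be tried out with a tiny threshold. A minimal sketch, assuming a hypothetical helper `append_with_rotation` and a 64-byte cap chosen only to make the demo short:

```python
import tempfile
from pathlib import Path

ROTATE_BYTES = 64  # tiny threshold for the demo; the module uses 256 KiB


def append_with_rotation(log_path: Path, lines: list[str]) -> None:
    """Append lines, first rotating to '<name>.1' if the file is too big."""
    try:
        if log_path.exists() and log_path.stat().st_size >= ROTATE_BYTES:
            log_path.replace(log_path.with_name(log_path.name + ".1"))
    except OSError:
        pass  # rotation is best effort; keep appending to the old file
    with log_path.open("a", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in lines)


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "container-boot.log"
    append_with_rotation(log_path, ["x" * 70])       # 71 bytes, over the cap
    append_with_rotation(log_path, ["second boot"])  # rotates before writing
    rotated = (Path(tmp) / "container-boot.log.1").read_text()
    current = log_path.read_text()
```

Because the size check happens before the append, a file can overshoot the cap by one pass's worth of lines, which is why the cap is described as soft.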
def main() -> int:
"""Entry point invoked from /etc/cont-init.d/02-reconcile-profiles."""
hermes_home = Path(os.environ.get("HERMES_HOME", "/opt/data"))
scandir = Path(os.environ.get("S6_PROFILE_GATEWAY_SCANDIR", "/run/service"))
actions = reconcile_profile_gateways(
hermes_home=hermes_home, scandir=scandir,
)
for a in actions:
print(
f"reconcile: profile={a.profile} "
f"prior_state={a.prior_state} action={a.action}"
)
return 0
if __name__ == "__main__":
raise SystemExit(main())
+9 -1
@@ -14,6 +14,7 @@ Currently supports:
import io
import json
import logging
import re
import sys
import time
import urllib.error
@@ -36,6 +37,12 @@ _REDACTION_BANNER = (
"run with --no-redact to disable]\n"
)
_EMAIL_ADDRESS_RE = re.compile(
r"(?<![A-Za-z0-9._%+-])"
r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
r"(?![A-Za-z0-9._%+-])"
)
# ---------------------------------------------------------------------------
# Paste services — try paste.rs first, dpaste.com as fallback.
@@ -398,7 +405,8 @@ def _redact_log_text(text: str) -> str:
return text
from agent.redact import redact_sensitive_text
return redact_sensitive_text(text, force=True)
text = redact_sensitive_text(text, force=True)
return _EMAIL_ADDRESS_RE.sub("[REDACTED_EMAIL]", text)
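To see what the added pattern does and does not catch, it can be run against a few sample strings. The regex below is copied from the hunk above; `redact_emails` and the sample inputs are illustrative only.

```python
import re

# Pattern copied from the diff above.
EMAIL_ADDRESS_RE = re.compile(
    r"(?<![A-Za-z0-9._%+-])"
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    r"(?![A-Za-z0-9._%+-])"
)


def redact_emails(text: str) -> str:
    """Replace every matched address with a fixed placeholder."""
    return EMAIL_ADDRESS_RE.sub("[REDACTED_EMAIL]", text)


sample = "login ok for alice@example.com, cc user+tag@mail.example.co.uk"
redacted = redact_emails(sample)
# No dot-separated TLD, so the pattern leaves this one alone.
no_tld = redact_emails("socket user@localhost closed")
```

The lookarounds keep the match from starting or ending in the middle of a longer token, so hostnames without a TLD pass through unchanged.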
def _capture_log_snapshot(
+85 -1
@@ -207,14 +207,69 @@ def _fail_and_issue(text: str, detail: str, fix: str, issues: list[str]) -> None
issues.append(fix)
def _check_s6_supervision(issues: list[str]) -> None:
"""Inside a container under our s6 /init, surface what s6 sees.
Runs as a counterpart to :func:`_check_gateway_service_linger` for
the systemd-on-host case. No-op everywhere except in the s6
container so host runs aren't cluttered with irrelevant output.
Reports:
- Whether the main-hermes and dashboard static services are up
- How many per-profile gateway slots are registered (via
``S6ServiceManager.list_profile_gateways()``) and how many are
currently supervised as ``up``
"""
try:
from hermes_cli.service_manager import (
S6ServiceManager,
detect_service_manager,
)
except Exception:
return
if detect_service_manager() != "s6":
return
_section("s6 Supervision")
mgr = S6ServiceManager()
# Static services. They live under /run/service/ via s6-rc symlinks,
# so the same s6-svstat probe works.
for static in ("main-hermes", "dashboard"):
if mgr.is_running(static):
check_ok(f"{static}: up")
else:
check_info(f"{static}: down (expected if not enabled via env)")
profiles = mgr.list_profile_gateways()
if not profiles:
check_info("No per-profile gateways registered yet — create one with `hermes profile create <name>`")
return
up_count = sum(1 for p in profiles if mgr.is_running(f"gateway-{p}"))
check_ok(
f"Per-profile gateways: {up_count}/{len(profiles)} supervised up"
+ (f" ({', '.join(sorted(profiles))})" if len(profiles) <= 8 else "")
)
def _check_gateway_service_linger(issues: list[str]) -> None:
"""Warn when a systemd user gateway service will stop after logout."""
"""Warn when a systemd user gateway service will stop after logout.
Skipped inside a container running under s6: the linger concept
(user-systemd surviving SSH logout) doesn't apply there, and the
s6 supervision state is surfaced separately by
``_check_s6_supervision``.
"""
try:
from hermes_cli.gateway import (
get_systemd_linger_status,
get_systemd_unit_path,
is_linux,
)
from hermes_cli.service_manager import detect_service_manager
except Exception as e:
check_warn("Gateway service linger", f"(could not import gateway helpers: {e})")
return
@@ -222,6 +277,12 @@ def _check_gateway_service_linger(issues: list[str]) -> None:
if not is_linux():
return
# Inside a container under our s6 /init, _check_s6_supervision
# reports the live supervision state; the linger warning would be
# confusing here (no systemd, no logout, no "lingering" concept).
if detect_service_manager() == "s6":
return
unit_path = get_systemd_unit_path()
if not unit_path.exists():
return
@@ -984,6 +1045,7 @@ def run_doctor(args):
pass
_check_gateway_service_linger(issues)
_check_s6_supervision(issues)
if sys.platform != "win32":
_section("Command Installation")
@@ -1076,6 +1138,26 @@ def run_doctor(args):
# Docker (optional)
terminal_env = os.getenv("TERMINAL_ENV", "local")
try:
from hermes_constants import is_container as _is_container
running_in_container = _is_container()
except Exception:
running_in_container = False
if running_in_container:
# Inside our container the Docker terminal backend is not
# configured by default (Docker-in-Docker isn't set up); the
# local backend is the intended one. Skip the noisy "docker
# not found" warning. If the user has explicitly chosen
# TERMINAL_ENV=docker inside the container they likely mounted
# /var/run/docker.sock, so fall through to the normal check.
if terminal_env != "docker":
check_info(
"Running inside a container — using local terminal backend "
"(docker-in-docker is not configured by default)"
)
# Skip to next section; Docker isn't relevant here.
terminal_env = "local"
if terminal_env == "docker":
if _safe_which("docker"):
# Check if docker daemon is running
@@ -1098,6 +1180,8 @@ def run_doctor(args):
check_ok("docker", "(optional)")
elif _is_termux():
check_info("Docker backend is not available inside Termux (expected on Android)")
elif running_in_container:
pass # already explained above
else:
check_warn("docker not found", "(optional)")
+10 -1
@@ -140,6 +140,10 @@ def _sanitize_env_file_if_needed(path: Path) -> None:
This produces mangled values, e.g. a bot token duplicated 8×
(see #8908).
Also strips embedded null bytes, which crash ``os.environ[k] = v``
with ``ValueError: embedded null byte`` and are typically introduced by
copy-pasting API keys from terminals or rich-text editors.
We delegate to ``hermes_cli.config._sanitize_env_lines`` which
already knows all valid Hermes env-var names and can split
concatenated lines correctly.
@@ -155,7 +159,11 @@ def _sanitize_env_file_if_needed(path: Path) -> None:
try:
with open(path, **read_kw) as f:
original = f.readlines()
sanitized = _sanitize_env_lines(original)
# Strip null bytes before _sanitize_env_lines so they never
# reach python-dotenv (which passes them to os.environ and
# crashes with ValueError).
stripped = [line.replace("\x00", "") for line in original]
sanitized = _sanitize_env_lines(stripped)
if sanitized != original:
import tempfile
fd, tmp = tempfile.mkstemp(
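The null-byte stripping in this hunk is small enough to show end to end. A minimal sketch, with `strip_null_bytes` as a hypothetical name and made-up sample lines; the real function goes on to call `_sanitize_env_lines`, which is not reproduced here.

```python
def strip_null_bytes(lines: list[str]) -> list[str]:
    """Drop NUL characters so values are safe to hand to the environment."""
    return [line.replace("\x00", "") for line in lines]


original = ["API_KEY=abc\x00def\n", "PLAIN=value\n"]
cleaned = strip_null_bytes(original)
# Comparing against the untouched input means a file that only contained
# stray NUL bytes is still detected as needing a rewrite.
needs_rewrite = cleaned != original
```

Comparing the sanitized result with `original` rather than with the stripped copy is what makes the rewrite trigger for files whose only defect is an embedded null byte.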
@@ -244,6 +252,7 @@ def _apply_external_secret_sources(home_path: Path) -> None:
override_existing=bool(bw_cfg.get("override_existing", False)),
cache_ttl_seconds=float(bw_cfg.get("cache_ttl_seconds", 300)),
auto_install=bool(bw_cfg.get("auto_install", True)),
server_url=str(bw_cfg.get("server_url", "") or "").strip(),
)
if result.applied:
+179 -5
@@ -981,6 +981,18 @@ def get_gateway_runtime_snapshot(system: bool = False) -> GatewayRuntimeSnapshot
from hermes_constants import is_container
if is_linux() and is_container():
# Phase 4: report s6 supervision when running under our /init.
# Other container runtimes (or containers built before Phase 2)
# still get the original "docker (foreground)" label.
try:
from hermes_cli.service_manager import detect_service_manager
if detect_service_manager() == "s6":
return GatewayRuntimeSnapshot(
manager="s6 (container supervisor)",
gateway_pids=gateway_pids,
)
except Exception:
pass # Fall through to the legacy label on any detection error.
return GatewayRuntimeSnapshot(
manager="docker (foreground)",
gateway_pids=gateway_pids,
@@ -1202,7 +1214,17 @@ def _systemd_operational(system: bool = False) -> bool:
def _container_systemd_operational() -> bool:
"""Return True when a container exposes working user or system systemd."""
"""Return True when a container exposes working user or system systemd.
This is NOT our Hermes Docker image; that one runs s6-overlay as
PID 1 (since Phase 2 of the s6-overlay supervision plan) and is
detected via ``service_manager.detect_service_manager() == "s6"``.
This function handles the "container managed by something else"
case: systemd-nspawn, certain k8s pods, containers built FROM
systemd-bearing distros where the user has wired systemd as their
init. In those environments systemctl behaves identically to the
host case, so we fall through to the normal systemd code paths.
"""
if _systemd_operational(system=False):
return True
if _systemd_operational(system=True):
@@ -3998,15 +4020,11 @@ def _setup_dingtalk():
client_id, client_secret = result
save_env_value("DINGTALK_CLIENT_ID", client_id)
save_env_value("DINGTALK_CLIENT_SECRET", client_secret)
save_env_value("DINGTALK_ALLOW_ALL_USERS", "true")
print()
print_success(f"{emoji} {label} configured via QR scan!")
else:
# ── Manual entry ──
_setup_standard_platform(dingtalk_platform)
# Also enable allow-all by default for convenience
if get_env_value("DINGTALK_CLIENT_ID"):
save_env_value("DINGTALK_ALLOW_ALL_USERS", "true")
def _setup_wecom():
@@ -5007,6 +5025,108 @@ def gateway_setup():
# Main Command Handler
# =============================================================================
def _dispatch_via_service_manager_if_s6(
action: str, profile: str | None = None,
) -> bool:
"""If we're in a container with s6, dispatch gateway lifecycle via s6.
Returns True iff dispatched (caller should ``return``); False
otherwise, in which case the caller continues with the host-side code path.
``action`` is one of ``start`` / ``stop`` / ``restart``. The
profile defaults to the current one (resolved via ``_profile_arg``).
The s6 service slot was created either by the Phase 4 profile-create
hook or by the container-boot reconciler (cont-init.d/02-). If it
doesn't exist or s6 returns an error, the named errors from
:mod:`hermes_cli.service_manager` are caught and surfaced as
actionable CLI messages (no raw ``CalledProcessError`` traceback).
"""
from hermes_cli.service_manager import (
GatewayNotRegisteredError,
S6CommandError,
detect_service_manager,
get_service_manager,
)
if detect_service_manager() != "s6":
return False
if profile is None:
# _profile_suffix() returns the bare profile name for
# HERMES_HOME=<root>/profiles/<name>, "" for the default root,
# or a hash for unrelated paths. Map "" → "default" so the
# default-profile gateway is reachable as gateway-default.
profile = _profile_suffix() or "default"
mgr = get_service_manager()
service_name = f"gateway-{profile}"
try:
if action == "start":
mgr.start(service_name)
elif action == "stop":
mgr.stop(service_name)
elif action == "restart":
mgr.restart(service_name)
else:
return False
except GatewayNotRegisteredError as exc:
print(f"{exc}")
sys.exit(1)
except S6CommandError as exc:
print(f"{exc}")
sys.exit(1)
return True
def _dispatch_all_via_service_manager_if_s6(action: str) -> bool:
"""Inside a container with s6, dispatch ``--all`` lifecycle to every
registered profile gateway.
Returns True iff dispatched (caller should ``return``); False
otherwise, in which case the caller continues with the host-side code path.
Without this, ``hermes gateway stop --all`` and ``... restart --all``
fall through to ``kill_gateway_processes(all_profiles=True)``, which
just ``pkill``s every gateway process. s6-supervise observes the
crash and restarts each one ~1s later, so ``--all`` ends up
*kicking* every gateway instead of *stopping* it. By iterating
``list_profile_gateways()`` and sending the lifecycle command
through the service manager we get the intended semantics (s6's
``want up``/``want down`` flips correctly so supervise stays down
after a stop).
``action`` is one of ``stop`` / ``restart`` (``start --all`` isn't
a supported CLI surface).
"""
from hermes_cli.service_manager import (
detect_service_manager,
get_service_manager,
)
if detect_service_manager() != "s6":
return False
if action not in ("stop", "restart"):
return False
mgr = get_service_manager()
profiles = mgr.list_profile_gateways()
if not profiles:
print("✗ No profile gateways registered under s6")
return True
fn = mgr.stop if action == "stop" else mgr.restart
errors: list[tuple[str, Exception]] = []
for profile in profiles:
service_name = f"gateway-{profile}"
try:
fn(service_name)
except Exception as exc: # noqa: BLE001 — report and continue
errors.append((profile, exc))
succeeded = len(profiles) - len(errors)
verb = "stopped" if action == "stop" else "restarted"
if succeeded:
print(f"{verb.capitalize()} {succeeded} profile gateway(s) under s6")
for profile, exc in errors:
print(f"✗ Could not {action} gateway-{profile}: {exc}")
return True
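The report-and-continue loop used for `--all` can be exercised without s6. In this sketch `FakeManager` is a hypothetical stand-in, not the real service manager API, and `stop_all` only mirrors the error-collection shape of the function above.

```python
class FakeManager:
    """Hypothetical stand-in for the service manager; not the real API."""

    def __init__(self, failing: set[str]) -> None:
        self.failing = failing
        self.stopped: list[str] = []

    def stop(self, service_name: str) -> None:
        if service_name in self.failing:
            raise RuntimeError(f"{service_name}: supervisor not responding")
        self.stopped.append(service_name)


def stop_all(mgr: FakeManager, profiles: list[str]) -> tuple[int, list[str]]:
    """Stop every profile gateway; collect failures instead of aborting."""
    failed: list[str] = []
    for profile in profiles:
        try:
            mgr.stop(f"gateway-{profile}")
        except Exception:  # report and continue, as in the diff
            failed.append(profile)
    return len(profiles) - len(failed), failed


mgr = FakeManager(failing={"gateway-beta"})
succeeded, failed = stop_all(mgr, ["alpha", "beta", "gamma"])
```

One unresponsive slot does not prevent the remaining gateways from being stopped, and the caller still gets a per-profile list of what went wrong.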
def gateway_command(args):
"""Handle gateway subcommands."""
try:
@@ -5091,6 +5211,21 @@ def _gateway_command_inner(args):
print(" nohup hermes gateway run > ~/.hermes/logs/gateway.log 2>&1 & # background")
sys.exit(1)
elif is_container():
# Phase 4: inside a container with s6 the gateway service is
# auto-registered when the profile is created (and reconciled
# at every container boot). `install` is therefore informational.
from hermes_cli.service_manager import detect_service_manager
if detect_service_manager() == "s6":
print("Per-profile gateways are auto-registered when you create a profile.")
print()
print(" hermes profile create <name> # creates the s6 service slot")
print(" hermes -p <name> gateway start # bring it up via s6")
print(" hermes status # see currently-supervised gateways")
return
# Fallback for pre-s6 containers or other container runtimes
# we haven't taught about supervision (Podman without our
# /init, k8s plain runs, etc.) — the historical guidance still
# applies.
print("Service installation is not needed inside a Docker container.")
print("The container runtime is your service manager — use Docker restart policies instead:")
print()
@@ -5121,6 +5256,13 @@ def _gateway_command_inner(args):
from hermes_cli import gateway_windows
gateway_windows.uninstall()
elif is_container():
from hermes_cli.service_manager import detect_service_manager
if detect_service_manager() == "s6":
print("Per-profile gateways are auto-unregistered when you delete the profile.")
print()
print(" hermes profile delete <name> # tears down the s6 service slot")
print(" hermes -p <name> gateway stop # stop without deleting the profile")
return
print("Service uninstall is not applicable inside a Docker container.")
print("To stop the gateway, stop or remove the container:")
print()
@@ -5135,6 +5277,14 @@ def _gateway_command_inner(args):
system = getattr(args, 'system', False)
start_all = getattr(args, 'all', False)
# Phase 4: inside a container with s6, dispatch via the service
# manager instead of falling through to systemd/launchd/windows.
# `--all` isn't meaningful here (each profile has its own service
# slot — start them individually via `hermes -p <name> gateway
# start`), so just bring up the current profile's slot.
if not start_all and _dispatch_via_service_manager_if_s6("start"):
return
if start_all:
# Kill all stale gateway processes across all profiles before starting
killed = kill_gateway_processes(all_profiles=True)
@@ -5164,6 +5314,11 @@ def _gateway_command_inner(args):
print("To enable systemd: add systemd=true to /etc/wsl.conf and run 'wsl --shutdown' from PowerShell.")
sys.exit(1)
elif is_container():
# Reached only when s6 ISN'T running (the early dispatch
# above handles the s6 case). Pre-s6 containers or other
# container runtimes that don't ship our /init get the
# historical guidance: the gateway is the container's main
# process, so use docker lifecycle commands.
print("Service start is not applicable inside a Docker container.")
print("The gateway runs as the container's main process.")
print()
@@ -5180,6 +5335,15 @@ def _gateway_command_inner(args):
stop_all = getattr(args, 'all', False)
system = getattr(args, 'system', False)
# Phase 4: inside a container with s6, dispatch via the service
# manager. ``--all`` iterates every registered profile gateway
# through s6 (otherwise it would fall through to ``pkill``,
# which s6-supervise observes as a crash and immediately restarts).
if stop_all and _dispatch_all_via_service_manager_if_s6("stop"):
return
if not stop_all and _dispatch_via_service_manager_if_s6("stop"):
return
if stop_all:
# --all: kill every gateway process on the machine
service_available = False
@@ -5249,6 +5413,16 @@ def _gateway_command_inner(args):
restart_all = getattr(args, 'all', False)
service_configured = False
# Phase 4: inside a container with s6, dispatch via the service
# manager (s6-svc -t restarts the supervised process). ``--all``
# iterates every registered profile gateway through s6; without
# this it would fall through to ``pkill``, which s6-supervise
# would observe as a crash and immediately restart anyway.
if restart_all and _dispatch_all_via_service_manager_if_s6("restart"):
return
if not restart_all and _dispatch_via_service_manager_if_s6("restart"):
return
if restart_all:
# --all: stop every gateway process across all profiles, then start fresh
service_stopped = False
+85
@@ -550,6 +550,39 @@ def build_parser(parent_subparsers: argparse._SubParsersAction) -> argparse.Argu
p_unblock = sub.add_parser("unblock", help="Return one or more blocked/scheduled tasks to ready")
p_unblock.add_argument("task_ids", nargs="+")
p_promote = sub.add_parser(
"promote",
help="Manually move one or more todo/blocked tasks to ready (recovery path)",
)
p_promote.add_argument("task_id")
p_promote.add_argument(
"reason",
nargs="*",
help="Audit-trail reason (recorded on the task_events row)",
)
p_promote.add_argument(
"--ids",
nargs="+",
default=None,
help="Additional task ids to promote with the same reason (bulk mode)",
)
p_promote.add_argument(
"--force",
action="store_true",
help="Promote even if parent dependencies are not yet done/archived",
)
p_promote.add_argument(
"--dry-run",
action="store_true",
help="Validate the promotion without mutating state",
)
p_promote.add_argument(
"--json",
dest="json",
action="store_true",
help="Emit machine-readable JSON result",
)
p_archive = sub.add_parser("archive", help="Archive one or more tasks")
p_archive.add_argument("task_ids", nargs="*",
help="Task ids to archive (default mode)")
@@ -899,6 +932,7 @@ def kanban_command(args: argparse.Namespace) -> int:
"block": _cmd_block,
"schedule": _cmd_schedule,
"unblock": _cmd_unblock,
"promote": _cmd_promote,
"archive": _cmd_archive,
"tail": _cmd_tail,
"dispatch": _cmd_dispatch,
@@ -1955,6 +1989,57 @@ def _cmd_unblock(args: argparse.Namespace) -> int:
return 0 if not failed else 1
def _cmd_promote(args: argparse.Namespace) -> int:
reason = " ".join(args.reason).strip() if args.reason else None
author = _profile_author()
as_json = getattr(args, "json", False)
extra_ids = list(getattr(args, "ids", None) or [])
# Dedupe while preserving order; positional task_id always first.
ids: list[str] = []
seen: set[str] = set()
for tid in [args.task_id, *extra_ids]:
if tid not in seen:
ids.append(tid)
seen.add(tid)
results: list[dict[str, object]] = []
with kb.connect() as conn:
for tid in ids:
ok, err = kb.promote_task(
conn,
tid,
actor=author,
reason=reason,
force=bool(args.force),
dry_run=bool(args.dry_run),
)
results.append({
"task_id": tid,
"promoted": ok,
"dry_run": bool(args.dry_run),
"forced": bool(args.force),
"reason": reason,
"error": err,
})
failed = [r for r in results if not r["promoted"]]
if as_json:
# Single-id stays a flat object for back-compat; bulk emits a list.
payload: object = results[0] if len(results) == 1 else results
print(json.dumps(payload, indent=2, ensure_ascii=False))
return 0 if not failed else 1
tag = " (dry)" if args.dry_run else ""
label = "Would promote" if args.dry_run else "Promoted"
for r in results:
if r["promoted"]:
suffix = f": {reason}" if reason else ""
print(f"{label} {r['task_id']} -> ready{tag}{suffix}")
else:
print(f"cannot promote {r['task_id']}: {r['error']}", file=sys.stderr)
return 0 if not failed else 1
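Two details of `_cmd_promote` are easy to check in isolation: the order-preserving dedupe of task ids and the JSON shape that stays flat for a single id. The helper names below are hypothetical and the result dicts are trimmed to two keys for brevity.

```python
import json


def dedupe_preserving_order(ids: list[str]) -> list[str]:
    """Keep the first occurrence of each id, in input order."""
    seen: set[str] = set()
    ordered: list[str] = []
    for tid in ids:
        if tid not in seen:
            ordered.append(tid)
            seen.add(tid)
    return ordered


def render_payload(results: list[dict]) -> str:
    """Single result stays a flat object; several results become a list."""
    payload = results[0] if len(results) == 1 else results
    return json.dumps(payload, ensure_ascii=False)


ids = dedupe_preserving_order(["t1", "t2", "t1", "t3", "t2"])
single = render_payload([{"task_id": "t1", "promoted": True}])
bulk = render_payload([{"task_id": i, "promoted": True} for i in ids])
```

Consumers of `--json` therefore need to handle both an object and a list, depending on how many ids were passed.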
def _cmd_archive(args: argparse.Namespace) -> int:
ids = list(args.task_ids or [])
purge_ids = list(getattr(args, "purge_ids", None) or [])
+170 -4
@@ -1651,8 +1651,15 @@ def create_task(
now = int(time.time())
# Resolve workspace_path from board-level default_workdir when the
# caller did not specify one explicitly.
if workspace_path is None:
# caller did not specify one explicitly. Board defaults represent
# persistent project checkouts, so only persistent workspace kinds may
# inherit them. Scratch workspaces are auto-deleted on completion and
# must stay under the per-board scratch root created by
# ``resolve_workspace``; inheriting ``default_workdir`` for a scratch
# task would point cleanup at the user's source tree (#28818). The
# containment guard in ``_cleanup_workspace`` is the safety rail, but
# we also stop the bad state from being created in the first place.
if workspace_path is None and workspace_kind in {"dir", "worktree"}:
board_slug = board if board else get_current_board()
board_meta = read_board_metadata(board_slug)
board_default = board_meta.get("default_workdir")
@@ -3037,6 +3044,81 @@ def complete_task(
# Workspace / tmux cleanup
# ---------------------------------------------------------------------------
def _is_managed_scratch_path(p: Path) -> bool:
"""Return True iff *p* is a strict descendant of a kanban-managed scratch root.
A managed root is exclusively a ``workspaces/`` directory, never the
broader kanban home, a board root, or sibling subtrees like ``logs/`` or
``boards/<slug>/`` itself. Allowed roots:
* ``HERMES_KANBAN_WORKSPACES_ROOT`` when set (worker-side override
injected by the dispatcher).
* ``<kanban_home>/kanban/workspaces``, the legacy default-board scratch root.
* ``<kanban_home>/kanban/boards/<slug>/workspaces`` for each board slug
that currently exists on disk.
The check requires strict descendancy: a path equal to one of these
roots is NOT managed (deleting the workspaces root would wipe every
task's scratch dir at once), and a path that resolves to
``<kanban_home>/kanban`` itself, ``<kanban_home>/kanban/logs``, or
``<kanban_home>/kanban/boards/<slug>`` is rejected because those
subtrees hold Hermes' own DB, metadata, and logs, not task workspaces.
Used by :func:`_cleanup_workspace` to refuse to ``shutil.rmtree`` paths
outside Hermes-managed storage. A board ``default_workdir`` pointing at a
real source tree can otherwise pair with ``workspace_kind='scratch'`` and
cause task completion to delete user data (#28818).
"""
try:
p_abs = p.resolve(strict=False)
except OSError:
return False
roots: list[Path] = []
override = os.environ.get("HERMES_KANBAN_WORKSPACES_ROOT", "").strip()
if override:
try:
roots.append(Path(override).expanduser().resolve(strict=False))
except OSError:
pass
try:
home = kanban_home()
except OSError:
home = None
if home is not None:
try:
roots.append((home / "kanban" / "workspaces").resolve(strict=False))
except OSError:
pass
try:
boards_parent = (home / "kanban" / "boards").resolve(strict=False)
except OSError:
boards_parent = None
if boards_parent is not None:
try:
entries = list(boards_parent.iterdir())
except OSError:
entries = []
for entry in entries:
try:
if not entry.is_dir():
continue
except OSError:
continue
try:
roots.append((entry / "workspaces").resolve(strict=False))
except OSError:
continue
for root in roots:
if p_abs == root:
continue
try:
if p_abs.is_relative_to(root):
return True
except ValueError:
continue
return False
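The strict-descendant rule at the heart of the containment guard can be demonstrated with pure paths. This is a sketch only: `is_strict_descendant` and the example paths are hypothetical, and it skips the symlink and `..` resolution that the real function performs with `Path.resolve`.

```python
from pathlib import PurePosixPath


def is_strict_descendant(path: PurePosixPath, roots: list[PurePosixPath]) -> bool:
    """True only when path lies below a root, never when it equals one."""
    for root in roots:
        if path == root:
            continue  # deleting the root would wipe every workspace at once
        if path.is_relative_to(root):  # available since Python 3.9
            return True
    return False


roots = [PurePosixPath("/data/kanban/workspaces")]
inside = is_strict_descendant(PurePosixPath("/data/kanban/workspaces/t_42"), roots)
the_root = is_strict_descendant(PurePosixPath("/data/kanban/workspaces"), roots)
source_tree = is_strict_descendant(PurePosixPath("/home/user/project"), roots)
sibling = is_strict_descendant(PurePosixPath("/data/kanban/logs"), roots)
```

Only the task directory under the workspaces root qualifies for removal; the root itself, a sibling subtree, and an unrelated source tree are all refused.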
def _cleanup_workspace(conn: sqlite3.Connection, task_id: str) -> None:
"""Remove a task's scratch workspace dir and kill its stale tmux session.
@@ -3059,8 +3141,21 @@ def _cleanup_workspace(conn: sqlite3.Connection, task_id: str) -> None:
import shutil
wp = Path(path)
if wp.is_dir():
- shutil.rmtree(wp, ignore_errors=True)
- _log.debug("Removed scratch workspace: %s", wp)
+ # Containment guard (#28818): a board's ``default_workdir`` can
+ # pair ``workspace_kind='scratch'`` with a user-supplied path
+ # pointing at a real source tree. Without this check, task
+ # completion would unconditionally ``shutil.rmtree`` that path
+ # and silently delete the user's source data.
+ if _is_managed_scratch_path(wp):
+ shutil.rmtree(wp, ignore_errors=True)
+ _log.debug("Removed scratch workspace: %s", wp)
+ else:
+ _log.warning(
+ "Refusing to remove out-of-scratch workspace for task %s: %s "
+ "(workspace_kind='scratch' but path is outside any "
+ "kanban-managed workspaces root)",
+ task_id, wp,
+ )
# Also kill the tmux session for the worker that owned this task,
# if the tmux session is now dead (worker process exited).
_cleanup_worker_tmux(conn, task_id)
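Reviewer sketch, not part of the patch: the containment check in `_is_managed_scratch_path` comes down to resolving both paths and requiring the candidate to sit strictly below a managed root. The helper name `is_inside_any_root` and the temporary directory layout are illustrative assumptions; `Path.resolve` and `Path.is_relative_to` (Python 3.9+) are standard library calls.

```python
import tempfile
from pathlib import Path


def is_inside_any_root(candidate: Path, roots: list[Path]) -> bool:
    """Return True only if candidate resolves to a path strictly below a root."""
    resolved = candidate.resolve(strict=False)
    for root in roots:
        root_resolved = root.resolve(strict=False)
        if resolved == root_resolved:
            # The root itself is never treated as a deletable scratch dir.
            continue
        if resolved.is_relative_to(root_resolved):
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    managed = Path(tmp) / "kanban" / "workspaces"   # hypothetical managed root
    scratch = managed / "task-1"
    scratch.mkdir(parents=True)
    outside = Path(tmp) / "src"                     # stands in for a user source tree
    outside.mkdir()

    inside_ok = is_inside_any_root(scratch, [managed])
    outside_ok = is_inside_any_root(outside, [managed])
    root_ok = is_inside_any_root(managed, [managed])
    # ".." segments are collapsed by resolve(), so this escape attempt is refused.
    escape_ok = is_inside_any_root(managed / ".." / ".." / "src", [managed])
```

Resolving before comparing is what defeats `..` segments and symlinked paths; comparing raw strings would not.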
@@ -3303,6 +3398,77 @@ def block_task(
return True
def promote_task(
conn: sqlite3.Connection,
task_id: str,
*,
actor: str,
reason: Optional[str] = None,
force: bool = False,
dry_run: bool = False,
) -> tuple[bool, Optional[str]]:
"""Manually promote a `todo` or `blocked` task to `ready`.
Mirrors the automatic promotion done by ``recompute_ready`` but
drives it from a deliberate operator action with an audit-trail
entry. Refuses to promote if any parent dep is not in a terminal
state (`done`/`archived`) unless ``force=True``. Does NOT change
assignee or claim state. Returns ``(True, None)`` on success and
``(False, reason)`` if refused. ``dry_run=True`` validates the
promotion would succeed without mutating state.
"""
row = conn.execute(
"SELECT status FROM tasks WHERE id = ?", (task_id,)
).fetchone()
if row is None:
return False, f"task {task_id} not found"
cur_status = row["status"]
if cur_status not in ("todo", "blocked"):
return False, (
f"task {task_id} is {cur_status!r}; promote only applies to "
f"'todo' or 'blocked'"
)
if not force:
parents = conn.execute(
"SELECT t.id, t.status FROM tasks t "
"JOIN task_links l ON l.parent_id = t.id "
"WHERE l.child_id = ?",
(task_id,),
).fetchall()
unsatisfied = [
p["id"] for p in parents
if p["status"] not in ("done", "archived")
]
if unsatisfied:
return False, (
f"unsatisfied parent dependencies: "
f"{', '.join(unsatisfied)} (use --force to override)"
)
if dry_run:
return True, None
with write_txn(conn):
upd = conn.execute(
"UPDATE tasks SET status = 'ready' "
"WHERE id = ? AND status IN ('todo', 'blocked')",
(task_id,),
)
if upd.rowcount != 1:
return False, f"task {task_id} status changed during promotion"
_append_event(
conn,
task_id,
"promoted_manual",
{"actor": actor, "reason": reason, "forced": force},
)
return True, None
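Reviewer sketch, not part of the patch: the `UPDATE ... WHERE id = ? AND status IN (...)` followed by a `rowcount` check is a compare-and-set, so a task whose status changed between the read and the write is not promoted. The example below assumes a reduced two-column `tasks` table and uses `with conn:` as a stand-in for the project's `write_txn` helper, which is not shown in this diff.

```python
import sqlite3


def promote_if_waiting(conn: sqlite3.Connection, task_id: str) -> bool:
    """Flip a task to 'ready' only if it is still 'todo' or 'blocked'."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE tasks SET status = 'ready' "
            "WHERE id = ? AND status IN ('todo', 'blocked')",
            (task_id,),
        )
        # rowcount is 0 when the status guard no longer matches.
        return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO tasks VALUES ('t1', 'todo'), ('t2', 'done')")
conn.commit()

first = promote_if_waiting(conn, "t1")      # status matches the guard
second = promote_if_waiting(conn, "t1")     # already 'ready', guard fails
done_task = promote_if_waiting(conn, "t2")  # 'done' is outside the guard
status = conn.execute("SELECT status FROM tasks WHERE id = 't1'").fetchone()[0]
```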
def unblock_task(conn: sqlite3.Connection, task_id: str) -> bool:
"""Transition ``blocked``/``scheduled`` -> ready or todo.
+114 -18
@@ -1454,7 +1454,7 @@ def _launch_tui(
provider: Optional[str] = None,
toolsets: object = None,
skills: object = None,
- verbose: bool = False,
+ verbose: Optional[bool] = None,
quiet: bool = False,
query: Optional[str] = None,
image: Optional[str] = None,
@@ -1763,7 +1763,7 @@ def cmd_chat(args):
provider=getattr(args, "provider", None),
toolsets=getattr(args, "toolsets", None),
skills=getattr(args, "skills", None),
- verbose=getattr(args, "verbose", False),
+ verbose=getattr(args, "verbose", None),
quiet=getattr(args, "quiet", False),
query=getattr(args, "query", None),
image=getattr(args, "image", None),
@@ -1783,7 +1783,7 @@ def cmd_chat(args):
"provider": getattr(args, "provider", None),
"toolsets": args.toolsets,
"skills": getattr(args, "skills", None),
- "verbose": args.verbose,
+ "verbose": getattr(args, "verbose", None),
"quiet": getattr(args, "quiet", False),
"query": args.query,
"image": getattr(args, "image", None),
@@ -2505,6 +2505,27 @@ _AUX_TASKS: list[tuple[str, str, str]] = [
]
def _all_aux_tasks() -> list[tuple[str, str, str]]:
"""Return built-in + plugin-registered auxiliary tasks for picker/menu use.
Built-in tasks come first (preserving order), followed by plugin tasks
sorted by key. Used by ``_aux_config_menu``, ``_reset_aux_to_auto``, and
display-name lookups so plugin-registered tasks (registered via
:meth:`hermes_cli.plugins.PluginContext.register_auxiliary_task`) appear
in the same surfaces as built-in ones without core knowing about them.
"""
tasks = list(_AUX_TASKS)
try:
from hermes_cli.plugins import get_plugin_auxiliary_tasks
for entry in get_plugin_auxiliary_tasks():
tasks.append((entry["key"], entry["display_name"], entry["description"]))
except Exception:
# Plugin discovery failure must not break the aux config UI.
# Built-in tasks remain available.
pass
return tasks
def _format_aux_current(task_cfg: dict) -> str:
"""Render the current aux config for display in the task menu."""
if not isinstance(task_cfg, dict):
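Reviewer sketch, not part of the patch: `_all_aux_tasks` keeps the built-in list authoritative and treats plugin discovery as optional. The version below takes the plugin loader as a parameter so it runs standalone; the built-in display names and descriptions and the `load_plugin_tasks` argument are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical built-in entries: (key, display_name, description).
BUILTIN_TASKS: List[Tuple[str, str, str]] = [
    ("vision", "Vision", "Image analysis"),
    ("compression", "Compression", "Context compression"),
]


def all_tasks(load_plugin_tasks: Callable[[], List[Dict[str, str]]]) -> List[Tuple[str, str, str]]:
    """Built-ins first (original order), then whatever the plugin loader yields."""
    tasks = list(BUILTIN_TASKS)
    try:
        for entry in load_plugin_tasks():
            tasks.append((entry["key"], entry["display_name"], entry["description"]))
    except Exception:
        # A broken plugin registry must not hide the built-in tasks.
        pass
    return tasks


def good_loader() -> List[Dict[str, str]]:
    return [{"key": "memory_retain_filter",
             "display_name": "Memory retain filter",
             "description": "pre-retain dedup/extract"}]


def broken_loader() -> List[Dict[str, str]]:
    raise RuntimeError("plugin discovery failed")


with_plugins = all_tasks(good_loader)
without_plugins = all_tasks(broken_loader)
```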
@@ -2555,7 +2576,11 @@ def _save_aux_choice(
def _reset_aux_to_auto() -> int:
- """Reset every known aux task back to auto/empty. Returns number reset."""
+ """Reset every known aux task back to auto/empty. Returns number reset.
+ Includes plugin-registered tasks (via ``_all_aux_tasks``) so a plugin
+ that contributed an auxiliary task gets reset alongside built-ins.
+ """
from hermes_cli.config import load_config, save_config
cfg = load_config()
@@ -2564,7 +2589,7 @@ def _reset_aux_to_auto() -> int:
aux = {}
cfg["auxiliary"] = aux
count = 0
- for task, _name, _desc in _AUX_TASKS:
+ for task, _name, _desc in _all_aux_tasks():
entry = aux.setdefault(task, {})
if not isinstance(entry, dict):
entry = {}
@@ -2607,10 +2632,11 @@ def _aux_config_menu() -> None:
print()
# Build the task menu with current settings inline
- name_col = max(len(name) for _, name, _ in _AUX_TASKS) + 2
- desc_col = max(len(desc) for _, _, desc in _AUX_TASKS) + 4
+ all_tasks = _all_aux_tasks()
+ name_col = max(len(name) for _, name, _ in all_tasks) + 2
+ desc_col = max(len(desc) for _, _, desc in all_tasks) + 4
entries: list[tuple[str, str]] = []
- for task_key, name, desc in _AUX_TASKS:
+ for task_key, name, desc in all_tasks:
task_cfg = (
aux.get(task_key, {}) if isinstance(aux.get(task_key), dict) else {}
)
@@ -2661,7 +2687,7 @@ def _aux_select_for_task(task: str) -> None:
current_model = str(task_cfg.get("model") or "").strip()
current_base_url = str(task_cfg.get("base_url") or "").strip()
- display_name = next((name for key, name, _ in _AUX_TASKS if key == task), task)
+ display_name = next((name for key, name, _ in _all_aux_tasks() if key == task), task)
# Gather authenticated providers (has credentials + curated model list)
try:
@@ -2732,7 +2758,7 @@ def _aux_flow_provider_model(
from hermes_cli.auth import _prompt_model_selection
from hermes_cli.models import get_pricing_for_provider
- display_name = next((name for key, name, _ in _AUX_TASKS if key == task), task)
+ display_name = next((name for key, name, _ in _all_aux_tasks() if key == task), task)
# Fetch live pricing for this provider (non-blocking)
pricing: dict = {}
@@ -2778,7 +2804,7 @@ def _aux_flow_custom_endpoint(task: str, task_cfg: dict) -> None:
"""Prompt for a direct OpenAI-compatible base_url + optional api_key/model."""
import getpass
- display_name = next((name for key, name, _ in _AUX_TASKS if key == task), task)
+ display_name = next((name for key, name, _ in _all_aux_tasks() if key == task), task)
current_base_url = str(task_cfg.get("base_url") or "").strip()
current_model = str(task_cfg.get("model") or "").strip()
@@ -6156,6 +6182,19 @@ def cmd_doctor(args):
run_doctor(args)
def cmd_security(args):
"""Dispatch `hermes security <subcmd>`."""
sub = getattr(args, "security_command", None)
if sub in ("audit", None):
from hermes_cli.security_audit import cmd_security_audit
# Default subcommand is `audit` when no subcmd is given.
code = cmd_security_audit(args)
sys.exit(int(code or 0))
print(f"unknown security subcommand: {sub}", file=sys.stderr)
sys.exit(2)
def cmd_dump(args):
"""Dump setup summary for support/debugging."""
from hermes_cli.dump import run_dump
@@ -6932,8 +6971,8 @@ def _update_via_zip(args):
)
print("→ Downloading latest version...")
+ tmp_dir = tempfile.mkdtemp(prefix="hermes-update-")
try:
- tmp_dir = tempfile.mkdtemp(prefix="hermes-update-")
zip_path = os.path.join(tmp_dir, f"hermes-agent-{branch}.zip")
urlretrieve(zip_url, zip_path)
@@ -6980,12 +7019,11 @@ def _update_via_zip(args):
print(f"✓ Updated {update_count} items from ZIP")
- # Cleanup
- shutil.rmtree(tmp_dir, ignore_errors=True)
except Exception as e:
print(f"✗ ZIP update failed: {e}")
sys.exit(1)
+ finally:
+ shutil.rmtree(tmp_dir, ignore_errors=True)
# Clear stale bytecode after ZIP extraction
removed = _clear_bytecode_cache(PROJECT_ROOT)
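Reviewer sketch, not part of the patch: the `_update_via_zip` hunks pair the temp-dir creation with a `finally:` cleanup. A `finally` block can only reference `tmp_dir` safely if the name is bound before `try` begins, and it also runs when the `except` branch calls `sys.exit`. The helper `run_in_scratch_dir` and the `demo-update-` prefix below are illustrative assumptions.

```python
import os
import shutil
import tempfile
from typing import Callable, List


def run_in_scratch_dir(work: Callable[[str], None]) -> None:
    # Bind tmp_dir before the try so the finally clause can always see it.
    tmp_dir = tempfile.mkdtemp(prefix="demo-update-")
    try:
        work(tmp_dir)
    finally:
        # Runs on success, on ordinary exceptions, and on SystemExit.
        shutil.rmtree(tmp_dir, ignore_errors=True)


seen: List[str] = []


def failing_step(path: str) -> None:
    seen.append(path)
    raise SystemExit(1)  # mimics the sys.exit(1) in the error branch


try:
    run_in_scratch_dir(failing_step)
except SystemExit:
    pass

cleaned_up = not os.path.exists(seen[0])
```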
@@ -9817,6 +9855,7 @@ def _coalesce_session_name_args(argv: list) -> list:
"honcho",
"claw",
"plugins",
"security",
"acp",
"webhook",
"memory",
@@ -10657,7 +10696,7 @@ _BUILTIN_SUBCOMMANDS = frozenset(
"model", "pairing", "plugins", "portal", "postinstall", "profile", "proxy",
"send", "sessions", "setup",
"skills", "slack", "status", "tools", "uninstall", "update",
- "version", "webhook", "whatsapp", "chat", "secrets",
+ "version", "webhook", "whatsapp", "chat", "secrets", "security",
# Help-ish invocations — plugin commands not being listed in
# top-level --help is an acceptable trade-off for skipping an
# expensive eager import of every bundled plugin module.
@@ -11977,6 +12016,58 @@ def main():
)
doctor_parser.set_defaults(func=cmd_doctor)
# =========================================================================
# security command — on-demand supply-chain audit
# =========================================================================
security_parser = subparsers.add_parser(
"security",
help="Supply-chain audit (OSV.dev) for venv, plugins, and MCP servers",
description=(
"On-demand vulnerability scan against OSV.dev. Covers the Hermes "
"venv (installed PyPI dists), Python deps declared by plugins under "
"~/.hermes/plugins/, and pinned npx/uvx MCP servers in config.yaml. "
"Does NOT scan globally-installed packages or editor/browser extensions."
),
)
security_subparsers = security_parser.add_subparsers(
dest="security_command",
metavar="<subcommand>",
)
audit_parser = security_subparsers.add_parser(
"audit",
help="Run a one-shot supply-chain audit",
description="Query OSV.dev for known vulnerabilities in installed components.",
)
audit_parser.add_argument(
"--json",
action="store_true",
help="Emit machine-readable JSON instead of human-readable text",
)
audit_parser.add_argument(
"--fail-on",
default="critical",
choices=["low", "moderate", "high", "critical"],
help="Exit non-zero when any finding meets this severity (default: critical)",
)
audit_parser.add_argument(
"--skip-venv",
action="store_true",
help="Skip scanning the Hermes Python venv",
)
audit_parser.add_argument(
"--skip-plugins",
action="store_true",
help="Skip scanning plugin requirements files",
)
audit_parser.add_argument(
"--skip-mcp",
action="store_true",
help="Skip scanning pinned MCP servers in config.yaml",
)
audit_parser.set_defaults(func=cmd_security)
security_parser.set_defaults(func=cmd_security)
# =========================================================================
# dump command
# =========================================================================
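Reviewer sketch, not part of the patch: with nested subparsers, a bare `security` invocation leaves `security_command` as `None`, and options declared only on the `audit` subparser are absent from the namespace. The parser below is a reduced stand-in (program name `demo`, no handlers wired); only standard `argparse` calls are used.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers(dest="command")

    security = subparsers.add_parser("security")
    security_sub = security.add_subparsers(
        dest="security_command", metavar="<subcommand>"
    )

    audit = security_sub.add_parser("audit")
    audit.add_argument("--json", action="store_true")
    audit.add_argument(
        "--fail-on",
        default="critical",
        choices=["low", "moderate", "high", "critical"],
    )
    return parser


parser = build_parser()
bare = parser.parse_args(["security"])
explicit = parser.parse_args(["security", "audit", "--fail-on", "high"])

# Defaults of the audit subparser only exist when "audit" was actually parsed.
bare_has_fail_on = hasattr(bare, "fail_on")
```

If the bare form is meant to behave like `audit`, the handler presumably has to read these options with `getattr(args, "fail_on", "critical")` or similar; the handler itself is outside this diff section.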
@@ -12302,6 +12393,11 @@ Examples:
skills_audit.add_argument(
"name", nargs="?", help="Specific skill to audit (default: all)"
)
skills_audit.add_argument(
"--deep",
action="store_true",
help="Run AST-level analysis on Python files (opt-in diagnostic)",
)
skills_uninstall = skills_subparsers.add_parser(
"uninstall", help="Remove a hub-installed skill"
@@ -13781,7 +13877,7 @@ Examples:
("model", None),
("provider", None),
("toolsets", None),
- ("verbose", False),
+ ("verbose", None),
("worktree", False),
]:
if not hasattr(args, attr):
@@ -13796,7 +13892,7 @@ Examples:
("model", None),
("provider", None),
("toolsets", None),
- ("verbose", False),
+ ("verbose", None),
("resume", None),
("continue_last", None),
("worktree", False),
+2 -4
@@ -17,7 +17,6 @@ Model / provider selection mirrors `hermes chat`:
Env var fallbacks (used when the corresponding arg is not passed):
- HERMES_INFERENCE_MODEL
- HERMES_INFERENCE_PROVIDER (already read by resolve_runtime_provider)
"""
from __future__ import annotations
@@ -135,9 +134,8 @@ def run_oneshot(
prompt: The user message to send.
model: Optional model override. Falls back to HERMES_INFERENCE_MODEL
env var, then config.yaml's model.default / model.model.
- provider: Optional provider override. Falls back to
- HERMES_INFERENCE_PROVIDER env var, then config.yaml's model.provider,
- then "auto".
+ provider: Optional provider override. Falls back to config.yaml's
+ model.provider, then "auto".
toolsets: Optional comma-separated string or iterable of toolsets.
Returns the exit code. Caller should sys.exit() with the return.
+132
@@ -698,6 +698,119 @@ class PluginContext:
# -- hook registration --------------------------------------------------
# -- auxiliary task registration ---------------------------------------
def register_auxiliary_task(
self,
key: str,
*,
display_name: str,
description: str,
defaults: Optional[Dict[str, Any]] = None,
) -> None:
"""Register a plugin-defined auxiliary LLM task.
Auxiliary tasks are LLM-backed side jobs (vision analysis, web extraction,
compression, smart-approval, etc.) that route through ``auxiliary_client.py``.
Each task has its own ``auxiliary.<key>`` config block where users can
pin a provider/model independent of the main chat model.
Plugins use this to declare their own auxiliary tasks without touching
core files. After registration, the task:
- Appears in the ``hermes model`` → ``Configure auxiliary models`` picker
- Has its provider/model/base_url/api_key bridged from config.yaml to
  ``AUXILIARY_<KEY_UPPER>_*`` env vars at gateway startup
- Gets default routing fields (provider="auto", model="", etc.) merged
  into loaded configs so ``cfg.get("auxiliary", {}).get(key)`` works
Args:
key: stable task key (snake_case). Used in config ``auxiliary.<key>``
and env vars ``AUXILIARY_<KEY_UPPER>_*``. Must not shadow a
built-in task key (vision, compression, web_extract, approval,
mcp, title_generation, skills_hub, curator).
display_name: human-readable name shown in the picker.
description: short one-line description shown next to the name.
defaults: optional dict of default routing fields. Recognized keys:
``provider`` (default "auto"), ``model`` (default ""),
``base_url`` (default ""), ``api_key`` (default ""),
``timeout`` (default 60), ``extra_body`` (default {}),
plus any task-specific extras (e.g. ``download_timeout``).
Unknown keys are preserved verbatim; the plugin owns the
schema for its own task.
Raises:
ValueError: if *key* is empty, contains invalid characters, or
shadows a built-in auxiliary task key.
Example:
ctx.register_auxiliary_task(
key="memory_retain_filter",
display_name="Memory retain filter",
description="hindsight pre-retain dedup/extract",
defaults={"provider": "auto", "timeout": 30},
)
"""
# Validate key shape
if not key or not isinstance(key, str):
raise ValueError(
f"Plugin '{self.manifest.name}' tried to register auxiliary task "
f"with invalid key {key!r}"
)
if not all(c.isalnum() or c == "_" for c in key):
raise ValueError(
f"Plugin '{self.manifest.name}' auxiliary task key {key!r} "
f"must contain only alphanumeric characters and underscores"
)
# Lazy import to avoid circular: hermes_cli.main imports plugins indirectly
from hermes_cli.main import _AUX_TASKS as _BUILTIN_AUX_TASKS
builtin_keys = {k for k, _name, _desc in _BUILTIN_AUX_TASKS}
if key in builtin_keys:
raise ValueError(
f"Plugin '{self.manifest.name}' cannot register auxiliary task "
f"{key!r} — that key is reserved for a built-in task. "
f"Pick a plugin-namespaced key (e.g. '{self.manifest.name}_{key}')."
)
# Reject duplicate registrations across plugins
existing = self._manager._aux_tasks.get(key)
if existing is not None and existing.get("plugin") != self.manifest.name:
raise ValueError(
f"Plugin '{self.manifest.name}' cannot register auxiliary task "
f"{key!r} — already registered by plugin "
f"'{existing.get('plugin')}'"
)
# Normalize defaults — plugin owns the schema, but we ensure routing
# fields exist with sensible types so consumers don't crash.
merged_defaults: Dict[str, Any] = {
"provider": "auto",
"model": "",
"base_url": "",
"api_key": "",
"timeout": 60,
"extra_body": {},
}
if defaults:
for k, v in defaults.items():
merged_defaults[k] = v
self._manager._aux_tasks[key] = {
"key": key,
"display_name": display_name,
"description": description,
"defaults": merged_defaults,
"plugin": self.manifest.name,
}
logger.debug(
"Plugin %s registered auxiliary task: %s (%s)",
self.manifest.name,
key,
display_name,
)
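Reviewer sketch, not part of the patch: the validation and defaults merge in `register_auxiliary_task` can be exercised in isolation. `normalize_registration` and `rejects` are hypothetical names; the reserved keys are the built-in task keys listed in the docstring above, and the routing defaults are copied from the patch.

```python
from typing import Any, Dict, Optional

RESERVED_KEYS = {
    "vision", "compression", "web_extract", "approval",
    "mcp", "title_generation", "skills_hub", "curator",
}


def normalize_registration(
    key: str, defaults: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Validate the key, then overlay plugin defaults on the routing defaults."""
    if not key or not isinstance(key, str):
        raise ValueError(f"invalid key {key!r}")
    if not all(c.isalnum() or c == "_" for c in key):
        raise ValueError(f"key {key!r} must be alphanumeric or underscore")
    if key in RESERVED_KEYS:
        raise ValueError(f"key {key!r} is reserved for a built-in task")
    merged: Dict[str, Any] = {
        "provider": "auto", "model": "", "base_url": "",
        "api_key": "", "timeout": 60, "extra_body": {},
    }
    merged.update(defaults or {})  # unknown keys pass through untouched
    return merged


def rejects(key: str) -> bool:
    try:
        normalize_registration(key)
    except ValueError:
        return True
    return False


merged = normalize_registration(
    "memory_retain_filter", {"timeout": 30, "download_timeout": 120}
)
rejects_builtin = rejects("vision")
rejects_hyphen = rejects("bad-key")
rejects_non_ascii = rejects("tâche")
```

One thing this makes visible: `str.isalnum` accepts uppercase and non-ASCII letters, so the check is looser than the snake_case wording in the docstring suggests. Whether that matters depends on how the key is later turned into an `AUXILIARY_<KEY_UPPER>_*` env var name.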
def register_hook(self, hook_name: str, callback: Callable) -> None:
"""Register a lifecycle hook callback.
@@ -782,6 +895,9 @@ class PluginManager:
self._cli_ref = None # Set by CLI after plugin discovery
# Plugin skill registry: qualified name → metadata dict.
self._plugin_skills: Dict[str, Dict[str, Any]] = {}
# Plugin-registered auxiliary tasks: key → {key, display_name,
# description, defaults, plugin}. See PluginContext.register_auxiliary_task.
self._aux_tasks: Dict[str, Dict[str, Any]] = {}
# -----------------------------------------------------------------------
# Public
@@ -803,6 +919,7 @@ class PluginManager:
self._cli_commands.clear()
self._plugin_commands.clear()
self._plugin_skills.clear()
self._aux_tasks.clear()
self._context_engine = None
self._discovered = True
@@ -1548,6 +1665,21 @@ def get_plugin_commands() -> Dict[str, dict]:
return _ensure_plugins_discovered()._plugin_commands
def get_plugin_auxiliary_tasks() -> List[Dict[str, Any]]:
"""Return all plugin-registered auxiliary tasks as a stable-ordered list.
Each entry is the registration dict from
:meth:`PluginContext.register_auxiliary_task`:
``{key, display_name, description, defaults, plugin}``.
Triggers idempotent plugin discovery so callers can read the registry
before any explicit ``discover_plugins()`` call. Sorted by ``key`` for
deterministic ordering in pickers and tests.
"""
manager = _ensure_plugins_discovered()
return [manager._aux_tasks[k] for k in sorted(manager._aux_tasks)]
def get_plugin_toolsets() -> List[tuple]:
"""Return plugin toolsets as ``(key, label, description)`` tuples.
+67
@@ -777,6 +777,14 @@ def create_profile(
except Exception:
pass # non-fatal — user can describe later with `hermes profile describe`
# Phase 4: when running inside a container under s6, register the
# new profile's gateway as a runtime s6 service so
# `hermes -p <profile> gateway start` can supervise it via
# `s6-svc -u` instead of spawning a bare process. On host (systemd
# / launchd / windows) this is a no-op — the existing per-profile
# unit-generation paths handle gateway lifecycle.
_maybe_register_gateway_service(canon)
return profile_dir
@@ -893,6 +901,10 @@ def delete_profile(name: str, yes: bool = False) -> Path:
# 1. Disable service (prevents auto-restart)
_cleanup_gateway_service(canon, profile_dir)
# 1b. Phase 4: unregister the s6 service slot (container path).
# On host this is a no-op; on container it removes
# /run/service/gateway-<profile>/ so s6-supervise drops it.
_maybe_unregister_gateway_service(canon)
# 2. Stop running gateway
if gw_running:
@@ -965,6 +977,61 @@ def delete_profile(name: str, yes: bool = False) -> Path:
return profile_dir
def _maybe_register_gateway_service(profile_name: str) -> None:
"""Register a profile's gateway with s6 inside the container.
No-op on host (systemd/launchd/windows): those backends raise
``NotImplementedError`` on ``register_profile_gateway`` and the
existing per-profile unit-generation paths handle lifecycle.
Best-effort: any error (no backend detected, s6 not yet ready,
etc.) is logged and swallowed so profile creation doesn't fail
because the s6 supervision tree is in a weird state. The user
can re-register manually later via the gateway start command,
which goes through the same dispatch path.
Port selection is governed by the profile's ``config.yaml``
(``[gateway] port = ``); there is no Python-side allocator
(PR #30136 review item I5 retired the SHA-256-derived range
[9200, 9800) because it was dead code through the entire stack).
"""
try:
from hermes_cli.service_manager import get_service_manager
mgr = get_service_manager()
except RuntimeError:
return # no backend on this host — nothing to do
if not mgr.supports_runtime_registration():
return # host backend; no-op
try:
mgr.register_profile_gateway(profile_name)
except ValueError:
# Already registered (e.g. the container-boot reconciler ran
# first and brought up a stale slot). That's fine.
pass
except Exception as exc:
# Don't fail profile create over a supervision-tree hiccup.
print(f"⚠ Could not register s6 gateway service: {exc}")
def _maybe_unregister_gateway_service(profile_name: str) -> None:
"""Tear down a profile's s6 gateway service inside the container.
No-op on host. Idempotent: absent services are silently skipped
by ``unregister_profile_gateway``.
"""
try:
from hermes_cli.service_manager import get_service_manager
mgr = get_service_manager()
except RuntimeError:
return
if not mgr.supports_runtime_registration():
return
try:
mgr.unregister_profile_gateway(profile_name)
except Exception as exc:
print(f"⚠ Could not unregister s6 gateway service: {exc}")
def _cleanup_gateway_service(name: str, profile_dir: Path) -> None:
"""Disable and remove systemd/launchd service for a profile."""
import platform as _platform
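Reviewer sketch, not part of the patch: the best-effort dispatch in `_maybe_register_gateway_service` has three outcomes, namely skipped on host backends, tolerated when the slot already exists, and downgraded to a warning on any other failure. `HostManager`, `ContainerManager`, and the returned status strings are hypothetical stand-ins for the real service manager, which is not shown in this section.

```python
class HostManager:
    """Stand-in for a systemd/launchd backend without runtime registration."""

    def supports_runtime_registration(self) -> bool:
        return False

    def register_profile_gateway(self, name: str) -> None:
        raise NotImplementedError


class ContainerManager:
    """Stand-in for an s6-style backend that tracks registered slots."""

    def __init__(self) -> None:
        self.registered: set[str] = set()

    def supports_runtime_registration(self) -> bool:
        return True

    def register_profile_gateway(self, name: str) -> None:
        if name in self.registered:
            raise ValueError(f"{name} already registered")
        self.registered.add(name)


def maybe_register(mgr, profile_name: str) -> str:
    if not mgr.supports_runtime_registration():
        return "skipped"  # host backend: nothing to do
    try:
        mgr.register_profile_gateway(profile_name)
    except ValueError:
        return "already-registered"  # treated as success
    except Exception as exc:
        return f"warning: {exc}"  # never fails profile creation
    return "registered"


container = ContainerManager()
on_host = maybe_register(HostManager(), "work")
first_time = maybe_register(container, "work")
second_time = maybe_register(container, "work")
```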
+137 -5
@@ -57,6 +57,15 @@ def register_cli(parent_parser: argparse.ArgumentParser) -> None:
"--access-token",
help="Provide the access token non-interactively (will be stored in .env)",
)
setup.add_argument(
"--server-url",
help=(
"Bitwarden region / self-hosted endpoint. Examples: "
"https://vault.bitwarden.com (US, default), "
"https://vault.bitwarden.eu (EU), or your self-hosted URL. "
"Skips the interactive region prompt."
),
)
setup.set_defaults(func=cmd_setup)
status = sub.add_parser("status", help="Show config + binary + last fetch")
@@ -145,14 +154,28 @@ def cmd_setup(args: argparse.Namespace) -> int:
os.environ[token_env] = token # so the test fetch below sees it
console.print(f" [green]✓[/green] stored in {get_env_path()} as {token_env}")
# ------------------------------------------------------------------ region
console.print()
console.print("[bold]Step 3[/bold] Pick a Bitwarden region")
server_url = _resolve_server_url(args, secrets_cfg, console)
if server_url is None:
return 1
if server_url:
console.print(f" [green]✓[/green] using {server_url}")
else:
console.print(
" [green]✓[/green] using bws default "
"(US Cloud, https://vault.bitwarden.com)"
)
# ------------------------------------------------------------------- project
if args.project_id and args.project_id.strip():
project_id = args.project_id.strip()
else:
console.print()
- console.print("[bold]Step 3[/bold] Pick a project")
+ console.print("[bold]Step 4[/bold] Pick a project")
project_id = ""
- projects = _list_projects(binary, token, console)
+ projects = _list_projects(binary, token, console, server_url=server_url)
if projects is None:
return 1
if not projects:
@@ -187,7 +210,7 @@ def cmd_setup(args: argparse.Namespace) -> int:
# ------------------------------------------------------------------- test
console.print()
- step_num = 4 if not (args.project_id and args.project_id.strip()) else 3
+ step_num = 5 if not (args.project_id and args.project_id.strip()) else 4
console.print(f"[bold]Step {step_num}[/bold] Test fetch")
try:
secrets, warnings = bw.fetch_bitwarden_secrets(
@@ -195,6 +218,7 @@ def cmd_setup(args: argparse.Namespace) -> int:
project_id=project_id,
binary=binary,
use_cache=False,
server_url=server_url,
)
except Exception as exc: # noqa: BLE001
console.print(f" [red]✗ Fetch failed: {exc}[/red]")
@@ -221,6 +245,7 @@ def cmd_setup(args: argparse.Namespace) -> int:
# ------------------------------------------------------------------- save
secrets_cfg["enabled"] = True
secrets_cfg["project_id"] = project_id
secrets_cfg["server_url"] = server_url
secrets_cfg.setdefault("access_token_env", token_env)
secrets_cfg.setdefault("cache_ttl_seconds", 300)
secrets_cfg.setdefault("override_existing", True)
@@ -248,6 +273,7 @@ def cmd_status(args: argparse.Namespace) -> int:
enabled = bool(bw_cfg.get("enabled"))
token_env = bw_cfg.get("access_token_env", "BWS_ACCESS_TOKEN")
project_id = bw_cfg.get("project_id", "")
server_url = str(bw_cfg.get("server_url", "") or "").strip()
token_set = bool(os.environ.get(token_env))
table = Table(show_header=False, box=None, padding=(0, 2))
@@ -257,6 +283,10 @@ def cmd_status(args: argparse.Namespace) -> int:
table.add_row("Token env var", token_env)
table.add_row("Token in env", _yn(token_set))
table.add_row("Project ID", project_id or "[dim](unset)[/dim]")
table.add_row(
"Server URL",
server_url or "[dim]default (US Cloud, https://vault.bitwarden.com)[/dim]",
)
table.add_row("Override existing", _yn(bool(bw_cfg.get("override_existing", False))))
table.add_row("Cache TTL (s)", str(bw_cfg.get("cache_ttl_seconds", 300)))
table.add_row("Auto-install", _yn(bool(bw_cfg.get("auto_install", True))))
@@ -306,11 +336,14 @@ def cmd_sync(args: argparse.Namespace) -> int:
console.print("[red]No project_id configured.[/red]")
return 1
server_url = str(bw_cfg.get("server_url", "") or "").strip()
try:
secrets, warnings = bw.fetch_bitwarden_secrets(
access_token=token,
project_id=project_id,
use_cache=False,
server_url=server_url,
)
except Exception as exc: # noqa: BLE001
console.print(f"[red]Fetch failed: {exc}[/red]")
@@ -407,12 +440,14 @@ def _bws_version(binary: Path) -> str:
def _list_projects(
- binary: Path, token: str, console: Console
+ binary: Path, token: str, console: Console, *, server_url: str = ""
) -> Optional[List[dict]]:
"""Call ``bws project list`` and return the parsed list, or None on failure."""
env = os.environ.copy()
env["BWS_ACCESS_TOKEN"] = token
env.setdefault("NO_COLOR", "1")
if server_url:
env["BWS_SERVER_URL"] = server_url
try:
res = subprocess.run(
[str(binary), "project", "list", "--output", "json"],
@@ -428,7 +463,16 @@ def _list_projects(
if res.returncode != 0:
err = (res.stderr or res.stdout).strip()[:300]
console.print(f" [red]bws project list failed: {err}[/red]")
- if "authorization" in err.lower() or "invalid" in err.lower():
+ lowered = err.lower()
+ if "invalid_client" in lowered or "400 bad request" in lowered:
+ console.print(
+ " [yellow]'invalid_client' from the US identity endpoint usually "
+ "means the token is for a different Bitwarden region. Re-run "
+ "[cyan]hermes secrets bitwarden setup[/cyan] and pick EU or "
+ "self-hosted at the region prompt, or set [cyan]secrets.bitwarden."
+ "server_url[/cyan] in config.yaml.[/yellow]"
+ )
+ elif "authorization" in lowered or "invalid" in lowered:
console.print(
" [yellow]This usually means the access token is wrong or revoked. "
"Double-check it in the Bitwarden web app.[/yellow]"
@@ -443,3 +487,91 @@ def _list_projects(
if not isinstance(data, list):
return []
return [p for p in data if isinstance(p, dict) and p.get("id")]
# Canonical Bitwarden region endpoints. Keep in sync with what Bitwarden
# publishes — these are stable but if a third region appears, add it here
# and to the prompt below.
_REGION_PRESETS = [
("US Cloud (https://vault.bitwarden.com — bws default)", ""),
("EU Cloud (https://vault.bitwarden.eu)", "https://vault.bitwarden.eu"),
]
def _resolve_server_url(
args: argparse.Namespace,
secrets_cfg: dict,
console: Console,
) -> Optional[str]:
"""Pick a Bitwarden server URL for setup.
Resolution order:
1. ``--server-url`` CLI flag (non-interactive)
2. ``BWS_SERVER_URL`` env var (so users running with that already set
in their shell don't have to re-enter it)
3. Existing ``secrets.bitwarden.server_url`` value (for re-runs)
4. Interactive menu: US / EU / self-hosted
Returns the chosen URL as a string (empty string = bws default,
i.e. US Cloud). Returns None if the user aborted with an empty
custom URL.
"""
if args.server_url and args.server_url.strip():
return args.server_url.strip()
env_url = os.environ.get("BWS_SERVER_URL", "").strip()
if env_url:
console.print(
f" Detected [cyan]BWS_SERVER_URL[/cyan]={env_url} in your shell — using it."
)
return env_url
existing = str(secrets_cfg.get("server_url", "") or "").strip()
if existing:
console.print(
f" Existing config: [cyan]{existing}[/cyan]. "
"Press Enter to keep, or pick a different option below."
)
table = Table(show_header=True, header_style="bold", box=None, padding=(0, 2))
table.add_column("#", style="cyan", width=4)
table.add_column("Region / endpoint")
for i, (label, _url) in enumerate(_REGION_PRESETS, 1):
table.add_row(str(i), label)
table.add_row(str(len(_REGION_PRESETS) + 1), "Self-hosted / custom URL")
console.print(table)
custom_idx = len(_REGION_PRESETS) + 1
while True:
prompt = f" Select region [1-{custom_idx}]"
if existing:
prompt += " (Enter to keep current)"
prompt += ": "
choice = console.input(prompt).strip()
if not choice:
if existing:
return existing
console.print(" [red]Enter a number.[/red]")
continue
try:
idx = int(choice)
except ValueError:
console.print(" [red]Enter a number.[/red]")
continue
if 1 <= idx <= len(_REGION_PRESETS):
return _REGION_PRESETS[idx - 1][1]
if idx == custom_idx:
custom = console.input(
" Enter your Bitwarden server URL "
"(e.g. https://vault.example.com): "
).strip()
if not custom:
console.print(" [red]Empty URL, aborting.[/red]")
return None
if not custom.startswith(("http://", "https://")):
console.print(
" [yellow]Warning: URL doesn't start with http:// or "
"https:// — bws may reject it.[/yellow]"
)
return custom
console.print(f" [red]Out of range — pick 1-{custom_idx}.[/red]")
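Reviewer sketch, not part of the patch: the non-interactive part of the resolution order in `_resolve_server_url` can be written as a pure function. `pick_server_url` and its `(source, url)` return shape are illustrative assumptions. As in the patch, a saved config value does not end the search; it becomes the default offered at the prompt.

```python
from typing import Mapping, Optional, Tuple


def pick_server_url(
    flag_value: Optional[str],
    environ: Mapping[str, str],
    saved_cfg: Mapping[str, object],
) -> Tuple[str, str]:
    """Return (source, url): flag beats env var; otherwise fall to the prompt."""
    if flag_value and flag_value.strip():
        return "flag", flag_value.strip()
    env_url = environ.get("BWS_SERVER_URL", "").strip()
    if env_url:
        return "env", env_url
    # The saved value is only the prompt's default, not a final answer.
    existing = str(saved_cfg.get("server_url", "") or "").strip()
    return "prompt", existing


from_flag = pick_server_url(
    " https://vault.bitwarden.eu ",
    {"BWS_SERVER_URL": "https://vault.example.com"},
    {},
)
from_env = pick_server_url(
    None,
    {"BWS_SERVER_URL": "https://vault.example.com"},
    {"server_url": "https://vault.bitwarden.eu"},
)
from_cfg = pick_server_url(None, {}, {"server_url": "https://vault.bitwarden.eu"})
fresh = pick_server_url("", {}, {"server_url": None})
```

Passing the environment and config in as arguments keeps the precedence rules testable without touching `os.environ` or a console.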
+576
@@ -0,0 +1,576 @@
"""On-demand supply-chain audit for Hermes Agent installs.
Scans three surfaces a Hermes user actually controls and we can map to
upstream advisories without auth or extra binaries:
1. The Hermes venv (every PyPI dist via ``importlib.metadata``).
2. Python deps declared by user-installed plugins under ``~/.hermes/plugins``
(``requirements.txt`` + ``pyproject.toml`` best-effort pin extraction).
3. MCP servers wired in ``config.yaml`` whose ``command/args`` look like
``npx -y <pkg>@<ver>`` or ``uvx <pkg>==<ver>``.
Vulnerabilities are looked up against OSV.dev (``api.osv.dev/v1/querybatch``
+ ``/v1/vulns/{id}``). Single-shot, on-demand, never daily; see the design
notes in ``references/security-disclosure-triage.md``.
Out of scope on purpose: global pip/npm, editor/browser extensions,
daily background scans, auto-blocking installs.
"""
from __future__ import annotations
import argparse
import concurrent.futures
import json
import re
import sys
import urllib.error
import urllib.request
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Iterable, Optional
from hermes_constants import get_hermes_home
OSV_BATCH_URL = "https://api.osv.dev/v1/querybatch"
OSV_VULN_URL = "https://api.osv.dev/v1/vulns/{vid}"
OSV_BATCH_MAX = 1000 # OSV documented hard cap per request
HTTP_TIMEOUT = 20
DETAIL_PARALLELISM = 8
# Severity ordering for --fail-on gating. UNKNOWN sits below LOW so it
# never blocks unless --fail-on is passed something even lower (we don't
# expose that).
SEVERITY_ORDER = {
"UNKNOWN": 0,
"LOW": 1,
"MODERATE": 2,
"MEDIUM": 2,
"HIGH": 3,
"CRITICAL": 4,
}
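Reviewer sketch, not part of the patch: one way a severity table like `SEVERITY_ORDER` could drive `--fail-on` gating. The gating function is not visible in this section of the diff, so `meets_threshold` is a hypothetical helper and its treatment of unrecognized labels (ranked as UNKNOWN) is an assumption.

```python
SEVERITY_ORDER = {
    "UNKNOWN": 0,
    "LOW": 1,
    "MODERATE": 2,
    "MEDIUM": 2,   # alias so both spellings rank the same
    "HIGH": 3,
    "CRITICAL": 4,
}


def meets_threshold(severity: str, fail_on: str) -> bool:
    """True when a finding's severity is at or above the --fail-on level."""
    rank = SEVERITY_ORDER.get(severity.upper(), SEVERITY_ORDER["UNKNOWN"])
    return rank >= SEVERITY_ORDER[fail_on.upper()]


high_vs_critical = meets_threshold("HIGH", "critical")
critical_vs_critical = meets_threshold("CRITICAL", "critical")
medium_vs_moderate = meets_threshold("MEDIUM", "moderate")
unknown_vs_low = meets_threshold("UNKNOWN", "low")
```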
# ─── Data shapes ──────────────────────────────────────────────────────────────
@dataclass(frozen=True)
class Component:
"""A single (name, version, ecosystem) tuple discovered on disk."""
name: str
version: str
ecosystem: str # "PyPI" | "npm" — exactly as OSV expects
source: str # human-readable origin, e.g. "venv", "plugin:foo", "mcp:bar"
@dataclass
class Vulnerability:
osv_id: str
severity: str = "UNKNOWN"
summary: str = ""
fixed_versions: list[str] = field(default_factory=list)
@dataclass
class Finding:
component: Component
vuln: Vulnerability
# ─── Component discovery ──────────────────────────────────────────────────────
def _discover_venv() -> list[Component]:
"""Every dist installed in the running Python's import path."""
from importlib.metadata import distributions
out: list[Component] = []
seen: set[tuple[str, str]] = set()
for dist in distributions():
try:
name = (dist.metadata["Name"] or "").strip()
except Exception:
continue
version = (dist.version or "").strip()
if not name or not version:
continue
key = (name.lower(), version)
if key in seen:
continue
seen.add(key)
out.append(Component(name=name, version=version, ecosystem="PyPI", source="venv"))
return out
# requirements.txt line: drop comments, environment markers, options, extras
_REQ_LINE = re.compile(
r"""^\s*
(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)
(?:\[[^\]]+\])? # extras
\s*==\s*
(?P<version>[A-Za-z0-9._+!-]+)
\s*(?:;.*)?$
""",
re.VERBOSE,
)
def _parse_requirements(text: str) -> list[tuple[str, str]]:
"""Extract ``name==version`` pins. Everything else (>=, ~=, no pin) is skipped.
A loose pin can't be mapped to a single OSV query, and getting it wrong
is worse than missing a finding for an audit tool: false positives
train users to ignore output.
"""
pins: list[tuple[str, str]] = []
for raw in text.splitlines():
line = raw.strip()
if not line or line.startswith("#") or line.startswith("-"):
continue
m = _REQ_LINE.match(line)
if m:
pins.append((m.group("name"), m.group("version")))
return pins
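# A minimal sketch of the exact-pin policy described above. The pattern is
# copied from _REQ_LINE so the block runs on its own; `parse_pins` and
# `SAMPLE` are illustrative names, not part of the module.

```python
import re

# Same exact-pin pattern as _REQ_LINE, repeated to keep the sketch self-contained.
REQ_LINE = re.compile(
    r"""^\s*
    (?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)
    (?:\[[^\]]+\])?            # extras
    \s*==\s*
    (?P<version>[A-Za-z0-9._+!-]+)
    \s*(?:;.*)?$
    """,
    re.VERBOSE,
)


def parse_pins(text: str) -> list[tuple[str, str]]:
    """Return (name, version) for every ``name==version`` line; skip the rest."""
    pins: list[tuple[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines, comments, and pip options (-r, -e, ...) are ignored.
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        m = REQ_LINE.match(line)
        if m:
            pins.append((m.group("name"), m.group("version")))
    return pins


SAMPLE = """\
# runtime deps
requests==2.31.0
flask[async]==3.0.0 ; python_version >= "3.9"
numpy>=1.26
-r extra.txt
rich
"""

pins = parse_pins(SAMPLE)
print(pins)
```

# Only the two exact pins survive; the ``>=`` range, the ``-r`` include and
# the unpinned name are dropped rather than guessed at.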
def _parse_pyproject_pins(text: str) -> list[tuple[str, str]]:
"""Pull ``name==version`` pins from a ``pyproject.toml`` ``dependencies`` list.
Uses stdlib ``tomllib`` (3.11+). Same exact-pin policy as requirements.
"""
try:
import tomllib
except ImportError: # pragma: no cover - 3.10 only
return []
try:
data = tomllib.loads(text)
except Exception:
return []
deps: list[str] = []
project = data.get("project") or {}
if isinstance(project.get("dependencies"), list):
deps.extend(str(x) for x in project["dependencies"])
optional = project.get("optional-dependencies") or {}
if isinstance(optional, dict):
for group in optional.values():
if isinstance(group, list):
deps.extend(str(x) for x in group)
pins: list[tuple[str, str]] = []
for dep in deps:
m = _REQ_LINE.match(dep)
if m:
pins.append((m.group("name"), m.group("version")))
return pins
def _discover_plugins(hermes_home: Path) -> list[Component]:
"""Python deps declared by plugins under ``~/.hermes/plugins``.
Plugins typically don't install into the venv (they're directory-based
with relative imports), so their stated requirements are useful audit
surface even when the venv scan misses them.
"""
plugins_dir = hermes_home / "plugins"
if not plugins_dir.is_dir():
return []
out: list[Component] = []
for plugin_dir in sorted(plugins_dir.iterdir()):
if not plugin_dir.is_dir() or plugin_dir.name.startswith("."):
continue
source = f"plugin:{plugin_dir.name}"
for req_file in ("requirements.txt", "requirements-dev.txt"):
path = plugin_dir / req_file
if path.is_file():
try:
pins = _parse_requirements(path.read_text(encoding="utf-8", errors="replace"))
except OSError:
continue
for name, version in pins:
out.append(Component(name=name, version=version, ecosystem="PyPI", source=source))
pyproject = plugin_dir / "pyproject.toml"
if pyproject.is_file():
try:
pins = _parse_pyproject_pins(pyproject.read_text(encoding="utf-8", errors="replace"))
except OSError:
continue
for name, version in pins:
out.append(Component(name=name, version=version, ecosystem="PyPI", source=source))
return out
# npx forms we recognise:
# npx -y @scope/pkg@1.2.3
# npx --yes pkg@1.2.3
# npx pkg@1.2.3 [...args]
# We deliberately don't try to resolve unversioned names — that maps to
# "latest" at runtime and isn't a stable audit subject.
_NPX_PKG = re.compile(r"^(@[A-Za-z0-9._-]+/[A-Za-z0-9._-]+|[A-Za-z0-9._-]+)@([A-Za-z0-9._+-]+)$")
# uvx forms:
# uvx pkg==1.2.3
# uvx --with pkg==1.2.3 entrypoint
_UVX_PKG = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)==([A-Za-z0-9._+!-]+)$")
def _extract_mcp_component(server_name: str, command: str, args: list[str]) -> Optional[Component]:
"""Best-effort: parse `command/args` into a (name, version, ecosystem).
Returns None when the entry doesn't pin a version we can audit (local
paths, Docker images, unversioned npx, etc.). Audit output stays silent
rather than guess.
"""
cmd = (command or "").strip().lower()
if not args:
return None
# npx (any prefix path)
if cmd.endswith("npx") or cmd == "npx":
# Skip flag tokens until we see the first thing that looks like a pkg ref
for token in args:
if token.startswith("-"):
continue
m = _NPX_PKG.match(token)
if m:
return Component(
name=m.group(1),
version=m.group(2),
ecosystem="npm",
source=f"mcp:{server_name}",
)
return None # First non-flag token isn't a pinned ref
# uvx (any prefix path)
if cmd.endswith("uvx") or cmd == "uvx":
for token in args:
if token.startswith("-"):
continue
m = _UVX_PKG.match(token)
if m:
return Component(
name=m.group(1),
version=m.group(2),
ecosystem="PyPI",
source=f"mcp:{server_name}",
)
return None
return None
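# A simplified sketch of the npx/uvx matching above. `pinned_package` is a
# hypothetical helper that returns a plain tuple instead of a Component, and
# it checks every non-flag token, which may differ from how the original
# treats the first non-flag token. The package names are made up.

```python
import re
from typing import Optional

# Patterns copied from _NPX_PKG and _UVX_PKG.
NPX_PKG = re.compile(
    r"^(@[A-Za-z0-9._-]+/[A-Za-z0-9._-]+|[A-Za-z0-9._-]+)@([A-Za-z0-9._+-]+)$"
)
UVX_PKG = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)==([A-Za-z0-9._+!-]+)$")


def pinned_package(command: str, args: list[str]) -> Optional[tuple[str, str, str]]:
    """Return (name, version, ecosystem) for a pinned npx/uvx entry, else None."""
    cmd = command.strip().lower()
    if cmd.endswith("npx"):
        pattern, ecosystem = NPX_PKG, "npm"
    elif cmd.endswith("uvx"):
        pattern, ecosystem = UVX_PKG, "PyPI"
    else:
        return None  # local paths, docker, etc. are not audited
    for token in args:
        if token.startswith("-"):
            continue  # skip flags such as -y, --yes, --with
        m = pattern.match(token)
        if m:
            return (m.group(1), m.group(2), ecosystem)
    return None


print(pinned_package("npx", ["-y", "@scope/pkg@1.2.3"]))
print(pinned_package("/usr/local/bin/uvx", ["--with", "mcp-tool==0.4.0", "entry"]))
print(pinned_package("npx", ["-y", "some-server"]))
```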
def _discover_mcp() -> list[Component]:
"""Pinned MCP server packages from ``config.yaml``."""
try:
from hermes_cli.mcp_config import _get_mcp_servers
except Exception:
return []
out: list[Component] = []
servers = _get_mcp_servers()
if not isinstance(servers, dict):
return []
for name, cfg in servers.items():
if not isinstance(cfg, dict):
continue
command = cfg.get("command", "") or ""
args = cfg.get("args") or []
if not isinstance(args, list):
continue
comp = _extract_mcp_component(name, command, [str(a) for a in args])
if comp is not None:
out.append(comp)
return out
# ─── OSV client ───────────────────────────────────────────────────────────────
def _http_post_json(url: str, payload: dict) -> dict:
data = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
url, data=data, headers={"Content-Type": "application/json"}, method="POST"
)
with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
return json.loads(resp.read().decode("utf-8"))
def _http_get_json(url: str) -> dict:
req = urllib.request.Request(url, method="GET")
with urllib.request.urlopen(req, timeout=HTTP_TIMEOUT) as resp:
return json.loads(resp.read().decode("utf-8"))
def _osv_query_batch(components: list[Component]) -> dict[Component, list[str]]:
"""Return {component -> [osv_id, ...]} for components with any vulns.
Components without findings are omitted from the result dict.
"""
if not components:
return {}
findings: dict[Component, list[str]] = {}
for chunk_start in range(0, len(components), OSV_BATCH_MAX):
chunk = components[chunk_start:chunk_start + OSV_BATCH_MAX]
payload = {
"queries": [
{
"package": {"name": c.name, "ecosystem": c.ecosystem},
"version": c.version,
}
for c in chunk
]
}
try:
resp = _http_post_json(OSV_BATCH_URL, payload)
except (urllib.error.URLError, TimeoutError, ConnectionError) as exc:
raise RuntimeError(f"OSV batch query failed: {exc}") from exc
results = resp.get("results") or []
for comp, result in zip(chunk, results):
vulns = (result or {}).get("vulns") or []
ids = [v.get("id") for v in vulns if v.get("id")]
if ids:
findings[comp] = ids
return findings
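# A network-free sketch of the chunking above: it builds the querybatch
# payloads but does not send them. `Pkg` is a hypothetical stand-in for
# Component, and `build_payloads` is an illustrative name.

```python
from typing import NamedTuple


class Pkg(NamedTuple):
    """Hypothetical stand-in for Component (name, version, ecosystem)."""
    name: str
    version: str
    ecosystem: str


def build_payloads(packages: list[Pkg], batch_max: int = 1000) -> list[dict]:
    """Split packages into OSV querybatch payloads of at most batch_max queries."""
    payloads: list[dict] = []
    for start in range(0, len(packages), batch_max):
        chunk = packages[start:start + batch_max]
        payloads.append(
            {
                "queries": [
                    {
                        "package": {"name": p.name, "ecosystem": p.ecosystem},
                        "version": p.version,
                    }
                    for p in chunk
                ]
            }
        )
    return payloads


packages = [Pkg(f"pkg{i}", "1.0.0", "PyPI") for i in range(2500)]
payloads = build_payloads(packages)
sizes = [len(p["queries"]) for p in payloads]
print(sizes)
```

# Because results are matched to queries by position, each payload keeps
# the order of its chunk.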
def _osv_severity_from_record(record: dict) -> str:
"""Extract CVSS-derived severity tier from an OSV vuln record."""
# OSV puts CVSS in `severity` (top-level or per-affected) and a
# human-readable bucket in `database_specific.severity` for GHSAs.
db_specific = record.get("database_specific") or {}
raw = db_specific.get("severity")
if isinstance(raw, str) and raw.strip():
upper = raw.strip().upper()
if upper in SEVERITY_ORDER:
return upper
# Fall back to CVSS score → tier
score: Optional[float] = None
for sev_entry in record.get("severity") or []:
s = sev_entry.get("score")
if isinstance(s, str):
# CVSS vector strings look like "CVSS:3.1/AV:N/..." — we can't
# parse without a lib. Look for an explicit numeric in
# affected[].ecosystem_specific later if present.
continue
affected = record.get("affected") or []
for entry in affected:
eco_spec = entry.get("ecosystem_specific") or {}
sev = eco_spec.get("severity")
if isinstance(sev, str) and sev.strip().upper() in SEVERITY_ORDER:
return sev.strip().upper()
if score is not None:
if score >= 9.0:
return "CRITICAL"
if score >= 7.0:
return "HIGH"
if score >= 4.0:
return "MODERATE"
if score > 0:
return "LOW"
return "UNKNOWN"
def _osv_fixed_versions(record: dict) -> list[str]:
fixes: list[str] = []
for entry in record.get("affected") or []:
for rng in entry.get("ranges") or []:
for event in rng.get("events") or []:
if "fixed" in event:
fixes.append(str(event["fixed"]))
# Dedupe, preserve order
seen: set[str] = set()
out: list[str] = []
for f in fixes:
if f not in seen:
seen.add(f)
out.append(f)
return out
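# An equivalent sketch of the fixed-version extraction, using dict.fromkeys
# for the order-preserving dedupe. `SAMPLE_RECORD` is a made-up record in the
# OSV ``affected[].ranges[].events[]`` shape, not a real advisory.

```python
def fixed_versions(record: dict) -> list[str]:
    """Collect every ``fixed`` event, deduplicated in first-seen order."""
    fixes = [
        str(event["fixed"])
        for entry in record.get("affected") or []
        for rng in entry.get("ranges") or []
        for event in rng.get("events") or []
        if "fixed" in event
    ]
    # dict preserves insertion order, so this drops repeats without reordering.
    return list(dict.fromkeys(fixes))


SAMPLE_RECORD = {
    "id": "EXAMPLE-0000",  # made-up identifier
    "affected": [
        {"ranges": [{"type": "ECOSYSTEM",
                     "events": [{"introduced": "0"}, {"fixed": "2.0.1"}]}]},
        {"ranges": [{"type": "ECOSYSTEM",
                     "events": [{"introduced": "2.1.0"}, {"fixed": "2.1.3"}]},
                    {"type": "ECOSYSTEM",
                     "events": [{"introduced": "0"}, {"fixed": "2.0.1"}]}]},
    ],
}

versions = fixed_versions(SAMPLE_RECORD)
print(versions)
```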
def _osv_fetch_details(vuln_ids: Iterable[str]) -> dict[str, Vulnerability]:
"""Fetch summary/severity for each unique vuln id, in parallel."""
unique = sorted({vid for vid in vuln_ids if vid})
if not unique:
return {}
out: dict[str, Vulnerability] = {}
def _fetch_one(vid: str) -> Vulnerability:
try:
rec = _http_get_json(OSV_VULN_URL.format(vid=vid))
except (urllib.error.URLError, TimeoutError, ConnectionError):
return Vulnerability(osv_id=vid)
return Vulnerability(
osv_id=vid,
severity=_osv_severity_from_record(rec),
summary=(rec.get("summary") or "").strip(),
fixed_versions=_osv_fixed_versions(rec),
)
with concurrent.futures.ThreadPoolExecutor(max_workers=DETAIL_PARALLELISM) as pool:
for vuln in pool.map(_fetch_one, unique):
out[vuln.osv_id] = vuln
return out
# ─── Orchestration ────────────────────────────────────────────────────────────
def run_audit(
*,
skip_venv: bool = False,
skip_plugins: bool = False,
skip_mcp: bool = False,
hermes_home: Optional[Path] = None,
) -> list[Finding]:
"""Discover components, query OSV, return findings sorted by severity desc."""
home = hermes_home or Path(get_hermes_home())
components: list[Component] = []
if not skip_venv:
components.extend(_discover_venv())
if not skip_plugins:
components.extend(_discover_plugins(home))
if not skip_mcp:
components.extend(_discover_mcp())
if not components:
return []
raw = _osv_query_batch(components)
if not raw:
return []
all_ids: list[str] = []
for ids in raw.values():
all_ids.extend(ids)
details = _osv_fetch_details(all_ids)
findings: list[Finding] = []
for comp, ids in raw.items():
for vid in ids:
vuln = details.get(vid) or Vulnerability(osv_id=vid)
findings.append(Finding(component=comp, vuln=vuln))
findings.sort(
key=lambda f: (
-SEVERITY_ORDER.get(f.vuln.severity, 0),
f.component.source,
f.component.name.lower(),
f.vuln.osv_id,
)
)
return findings
# ─── Rendering ────────────────────────────────────────────────────────────────
def _render_human(findings: list[Finding], total_components: int) -> str:
if not findings:
return f"No known vulnerabilities found across {total_components} component(s)."
lines: list[str] = []
lines.append(
f"Found {len(findings)} known vulnerability finding(s) "
f"across {total_components} component(s):"
)
lines.append("")
last_source = None
for f in findings:
if f.component.source != last_source:
lines.append(f"[{f.component.source}]")
last_source = f.component.source
sev = f.vuln.severity.ljust(8)
head = f" {sev} {f.component.name}=={f.component.version} {f.vuln.osv_id}"
lines.append(head)
if f.vuln.summary:
summary = f.vuln.summary
if len(summary) > 100:
summary = summary[:97] + "..."
lines.append(f" {summary}")
if f.vuln.fixed_versions:
lines.append(f" fixed in: {', '.join(f.vuln.fixed_versions[:3])}")
return "\n".join(lines)
def _render_json(findings: list[Finding], total_components: int) -> str:
payload = {
"total_components_scanned": total_components,
"finding_count": len(findings),
"findings": [
{
"package": f.component.name,
"version": f.component.version,
"ecosystem": f.component.ecosystem,
"source": f.component.source,
"vuln_id": f.vuln.osv_id,
"severity": f.vuln.severity,
"summary": f.vuln.summary,
"fixed_versions": f.vuln.fixed_versions,
}
for f in findings
],
}
return json.dumps(payload, indent=2)
def _count_components(
*, skip_venv: bool, skip_plugins: bool, skip_mcp: bool, hermes_home: Path
) -> int:
total = 0
if not skip_venv:
total += len(_discover_venv())
if not skip_plugins:
total += len(_discover_plugins(hermes_home))
if not skip_mcp:
total += len(_discover_mcp())
return total
# ─── CLI entrypoint ───────────────────────────────────────────────────────────
def cmd_security_audit(args: argparse.Namespace) -> int:
"""Implementation of `hermes security audit`."""
home = Path(get_hermes_home())
skip_venv = bool(getattr(args, "skip_venv", False))
skip_plugins = bool(getattr(args, "skip_plugins", False))
skip_mcp = bool(getattr(args, "skip_mcp", False))
output_json = bool(getattr(args, "json", False))
fail_on = (getattr(args, "fail_on", None) or "critical").upper()
if fail_on not in SEVERITY_ORDER:
print(
f"unknown --fail-on value: {fail_on.lower()} "
f"(choose from: low, moderate, high, critical)",
file=sys.stderr,
)
return 2
total = _count_components(
skip_venv=skip_venv, skip_plugins=skip_plugins, skip_mcp=skip_mcp, hermes_home=home
)
if total == 0:
msg = "No components discovered (everything skipped, or empty environment)."
if output_json:
print(json.dumps({"total_components_scanned": 0, "finding_count": 0, "findings": []}))
else:
print(msg)
return 0
try:
findings = run_audit(
skip_venv=skip_venv,
skip_plugins=skip_plugins,
skip_mcp=skip_mcp,
hermes_home=home,
)
except RuntimeError as exc:
print(f"audit failed: {exc}", file=sys.stderr)
return 2
if output_json:
print(_render_json(findings, total))
else:
print(_render_human(findings, total))
# Exit code: 1 iff any finding meets or exceeds the --fail-on threshold.
threshold = SEVERITY_ORDER[fail_on]
for f in findings:
if SEVERITY_ORDER.get(f.vuln.severity, 0) >= threshold:
return 1
return 0
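# A minimal sketch of the --fail-on gating at the end of cmd_security_audit.
# `exit_code` is a hypothetical helper; the severity table is copied from
# SEVERITY_ORDER.

```python
SEVERITY_ORDER = {
    "UNKNOWN": 0,
    "LOW": 1,
    "MODERATE": 2,
    "MEDIUM": 2,
    "HIGH": 3,
    "CRITICAL": 4,
}


def exit_code(severities: list[str], fail_on: str = "critical") -> int:
    """Return 1 if any severity meets or exceeds the threshold, else 0."""
    threshold = SEVERITY_ORDER[fail_on.upper()]
    hit = any(SEVERITY_ORDER.get(s, 0) >= threshold for s in severities)
    return 1 if hit else 0


print(exit_code(["HIGH", "LOW"]))             # default threshold is CRITICAL
print(exit_code(["HIGH", "LOW"], "high"))
print(exit_code(["MEDIUM"], "moderate"))      # MEDIUM and MODERATE share a rank
print(exit_code(["UNKNOWN"], "low"))
```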
@@ -0,0 +1,886 @@
"""Abstract service manager interface.
Wraps the existing systemd (Linux host), launchd (macOS host), Windows
Scheduled Task (native Windows host), and s6 (container) backends behind
a common Protocol. Only the s6 backend supports runtime registration
(for per-profile gateways); host backends raise NotImplementedError
from those methods, and callers MUST check supports_runtime_registration()
before invoking them.
Host-side call sites (setup wizard, uninstall, status) continue to use
the existing module-level functions in hermes_cli.gateway and
hermes_cli.gateway_windows directly. This protocol is a thin facade
used by new code that needs to be backend-agnostic, specifically the
profile create/delete hooks (Phase 4) and the s6 dispatch path in
``hermes gateway start/stop/restart`` when running inside a container.
"""
from __future__ import annotations
import re
from pathlib import Path
from typing import Literal, Protocol, runtime_checkable
ServiceManagerKind = Literal["systemd", "launchd", "windows", "s6", "none"]
# Profile name → service directory mapping. Profile names must be safe
# as filesystem directory names because the s6 backend creates a service
# directory at ``<scandir>/gateway-<profile>/``. We reject anything that
# could traverse paths, span filesystems, or break s6's own naming rules.
_VALID_PROFILE_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")
_MAX_PROFILE_LEN = 251 # s6-svscan default name_max
def validate_profile_name(name: str) -> None:
"""Raise ValueError if ``name`` is not usable as a profile name.
Profile names are used as s6 service directory names, so they must
match a conservative subset of filesystem-safe characters. Reject
empty strings, uppercase, path-traversal sequences, and anything
longer than s6's default ``name_max``.
"""
if not name:
raise ValueError("profile name must not be empty")
if len(name) > _MAX_PROFILE_LEN:
raise ValueError(
f"profile name too long ({len(name)} > {_MAX_PROFILE_LEN})"
)
if not _VALID_PROFILE_RE.fullmatch(name):  # with match(), "$" would accept a trailing newline
raise ValueError(
f"profile name must match [a-z0-9][a-z0-9_-]*, got {name!r}"
)
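# A quick sketch of which names pass the rules above. `is_valid_profile` is
# a hypothetical boolean variant of validate_profile_name; it uses
# ``fullmatch`` so that a name with a trailing newline is rejected as well.

```python
import re

VALID_PROFILE_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")
MAX_PROFILE_LEN = 251


def is_valid_profile(name: str) -> bool:
    """True when ``name`` is non-empty, short enough, and filesystem-safe."""
    if not name or len(name) > MAX_PROFILE_LEN:
        return False
    return VALID_PROFILE_RE.fullmatch(name) is not None


candidates = ["work", "team-a_2", "", "Work", "../etc", "a/b", "work\n", "x" * 252]
results = {name: is_valid_profile(name) for name in candidates}
for name, ok in results.items():
    print(repr(name[:12]), ok)
```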
@runtime_checkable
class ServiceManager(Protocol):
"""Abstract interface for init-system-specific service operations.
Lifecycle methods (start / stop / restart / is_running) are
implemented by every backend. Runtime registration
(register_profile_gateway / unregister_profile_gateway /
list_profile_gateways) is implemented only by the s6 backend;
callers MUST check ``supports_runtime_registration()`` before
invoking the registration methods.
"""
kind: ServiceManagerKind
# Lifecycle of a pre-declared service.
def start(self, name: str) -> None: ...
def stop(self, name: str) -> None: ...
def restart(self, name: str) -> None: ...
def is_running(self, name: str) -> bool: ...
# Runtime registration (s6 only).
def supports_runtime_registration(self) -> bool: ...
def register_profile_gateway(
self,
profile: str,
*,
extra_env: dict[str, str] | None = None,
) -> None: ...
def unregister_profile_gateway(self, profile: str) -> None: ...
def list_profile_gateways(self) -> list[str]: ...
def detect_service_manager() -> ServiceManagerKind:
"""Detect which service manager is available in this environment.
Returns:
"s6": inside a container when /init is s6-svscan (Phase 2+)
"windows": native Windows host
"launchd": macOS host
"systemd": Linux host with a working user/system bus
"none": anything else (Termux, sandbox shells, etc.)
This function does NOT replace ``supports_systemd_services()``;
host call sites continue to use that. It exists for new backend-
agnostic code (profile create/delete hooks, the s6 dispatch path
in ``hermes gateway start/stop/restart``).
"""
# Imports deferred so importing this module doesn't drag in the
# whole gateway dependency graph for callers that only need the
# Protocol type or validate_profile_name().
from hermes_constants import is_container
from hermes_cli.gateway import (
is_macos,
is_windows,
supports_systemd_services,
)
if is_container() and _s6_running():
return "s6"
if is_windows():
return "windows"
if is_macos():
return "launchd"
if supports_systemd_services():
return "systemd"
return "none"
def _s6_running() -> bool:
"""True when s6-svscan is running as PID 1 in this container.
Detection has to work for **both** root and the unprivileged hermes
user (UID 10000). The obvious probe ``Path('/proc/1/exe').resolve()``
only works as root: for any other UID, the symlink at
``/proc/1/exe`` is unreadable and ``resolve()`` silently returns the
path unchanged, so the resolved name is the literal ``"exe"`` and
detection always fails. Since every Hermes runtime call inside the
container drops to hermes via ``s6-setuidgid``, that silent failure
made the entire service-manager runtime-registration path inert in
production (PR #30136 review).
Probe instead via:
* ``/proc/1/comm``: world-readable, contains the process comm
(``s6-svscan`` when s6-overlay is PID 1).
* ``/run/s6/basedir``: s6-overlay-specific directory created by
stage1. World-readable. More specific than ``/run/s6`` (which
other tools occasionally create).
Both signals are required; either alone could false-positive
(e.g. a container with the s6 binaries installed but a different
init, or an unrelated process named ``s6-svscan``).
"""
try:
comm = Path("/proc/1/comm").read_text(encoding="utf-8").strip()
except OSError:
return False
if comm != "s6-svscan":
return False
return Path("/run/s6/basedir").is_dir()
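# A testable sketch of the two-signal probe above. `s6_running` here takes
# the two paths as parameters (an assumption made for illustration; the real
# function hardcodes /proc/1/comm and /run/s6/basedir), so the logic can be
# exercised against a temporary directory.

```python
import tempfile
from pathlib import Path


def s6_running(comm_path: Path, basedir: Path) -> bool:
    """True only when the comm file says s6-svscan AND basedir exists."""
    try:
        comm = comm_path.read_text(encoding="utf-8").strip()
    except OSError:
        return False
    return comm == "s6-svscan" and basedir.is_dir()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    comm_file = root / "comm"
    basedir = root / "basedir"

    comm_file.write_text("s6-svscan\n", encoding="utf-8")
    only_comm = s6_running(comm_file, basedir)       # basedir missing

    basedir.mkdir()
    both = s6_running(comm_file, basedir)            # both signals present

    comm_file.write_text("systemd\n", encoding="utf-8")
    wrong_init = s6_running(comm_file, basedir)      # different PID 1 name

    unreadable = s6_running(root / "missing", basedir)  # comm file absent

print(only_comm, both, wrong_init, unreadable)
```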
# ---------------------------------------------------------------------------
# Backend wrappers
#
# These adapters are thin facades over the existing module-level functions
# in ``hermes_cli.gateway`` (systemd/launchd) and ``hermes_cli.gateway_windows``
# (Windows Scheduled Tasks). The protocol's ``name`` parameter is currently
# unused for host backends — they operate on whichever profile is currently
# active (set via the ``hermes -p <profile>`` flag before the call). This
# matches existing host-side semantics; the parameter shape is designed
# for s6 where each profile maps to a distinct service directory.
# ---------------------------------------------------------------------------
class _RegistrationUnsupportedMixin:
"""Mixin for host backends that don't support runtime registration."""
def supports_runtime_registration(self) -> bool:
return False
def register_profile_gateway(
self,
profile: str,
*,
extra_env: dict[str, str] | None = None,
) -> None:
raise NotImplementedError(
f"{type(self).__name__} does not support runtime profile "
"gateway registration (container-only feature)"
)
def unregister_profile_gateway(self, profile: str) -> None:
raise NotImplementedError(
f"{type(self).__name__} does not support runtime profile "
"gateway unregistration (container-only feature)"
)
def list_profile_gateways(self) -> list[str]:
return []
class SystemdServiceManager(_RegistrationUnsupportedMixin):
"""Thin wrapper around the ``systemd_*`` functions in hermes_cli.gateway.
Existing host call sites continue to use those functions directly;
this wrapper exists for new code that needs to be backend-agnostic
(the Phase 4 profile create/delete hooks).
"""
kind: ServiceManagerKind = "systemd"
def start(self, name: str) -> None:
from hermes_cli.gateway import systemd_start
systemd_start()
def stop(self, name: str) -> None:
from hermes_cli.gateway import systemd_stop
systemd_stop()
def restart(self, name: str) -> None:
from hermes_cli.gateway import systemd_restart
systemd_restart()
def is_running(self, name: str) -> bool:
from hermes_cli.gateway import _probe_systemd_service_running
_, running = _probe_systemd_service_running()
return running
class LaunchdServiceManager(_RegistrationUnsupportedMixin):
"""Thin wrapper around the ``launchd_*`` functions in hermes_cli.gateway."""
kind: ServiceManagerKind = "launchd"
def start(self, name: str) -> None:
from hermes_cli.gateway import launchd_start
launchd_start()
def stop(self, name: str) -> None:
from hermes_cli.gateway import launchd_stop
launchd_stop()
def restart(self, name: str) -> None:
from hermes_cli.gateway import launchd_restart
launchd_restart()
def is_running(self, name: str) -> bool:
from hermes_cli.gateway import _probe_launchd_service_running
return _probe_launchd_service_running()
class WindowsServiceManager(_RegistrationUnsupportedMixin):
"""Thin wrapper around ``hermes_cli.gateway_windows`` (Scheduled Task /
Startup-folder fallback).
The native Windows backend uses a Scheduled Task rather than a true
init-system service, but for protocol purposes the lifecycle is the
same: start / stop / restart / is_running. ``install`` accepts a
handful of Windows-specific kwargs (start_now, start_on_login,
elevated_handoff) that are passed straight through; non-Windows
callers should never invoke ``install`` on this wrapper.
"""
kind: ServiceManagerKind = "windows"
def install(
self,
*,
force: bool = False,
start_now: bool | None = None,
start_on_login: bool | None = None,
elevated_handoff: bool = False,
) -> None:
from hermes_cli import gateway_windows
gateway_windows.install(
force=force,
start_now=start_now,
start_on_login=start_on_login,
elevated_handoff=elevated_handoff,
)
def start(self, name: str) -> None:
from hermes_cli import gateway_windows
gateway_windows.start()
def stop(self, name: str) -> None:
from hermes_cli import gateway_windows
gateway_windows.stop()
def restart(self, name: str) -> None:
from hermes_cli import gateway_windows
gateway_windows.restart()
def is_running(self, name: str) -> bool:
from hermes_cli import gateway_windows
from hermes_cli.gateway import find_gateway_pids
if not gateway_windows.is_installed():
return False
return bool(find_gateway_pids())
def get_service_manager() -> ServiceManager:
"""Return the ServiceManager instance for the current environment.
Raises:
RuntimeError: when no supported backend is available.
"""
kind = detect_service_manager()
if kind == "systemd":
return SystemdServiceManager()
if kind == "launchd":
return LaunchdServiceManager()
if kind == "windows":
return WindowsServiceManager()
if kind == "s6":
return S6ServiceManager()
raise RuntimeError("no supported service manager detected")
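# A minimal sketch of the caller-side contract: check
# supports_runtime_registration() before registering. `HostBackend`,
# `RecordingBackend` and `register_if_supported` are hypothetical stand-ins
# so the block runs without the hermes_cli package.

```python
class HostBackend:
    """Hypothetical host backend: registration is unsupported."""
    kind = "systemd"

    def supports_runtime_registration(self) -> bool:
        return False

    def register_profile_gateway(self, profile, *, extra_env=None) -> None:
        raise NotImplementedError("container-only feature")


class RecordingBackend:
    """Hypothetical container backend that just records registrations."""
    kind = "s6"

    def __init__(self) -> None:
        self.registered: list[str] = []

    def supports_runtime_registration(self) -> bool:
        return True

    def register_profile_gateway(self, profile, *, extra_env=None) -> None:
        self.registered.append(profile)


def register_if_supported(manager, profile: str) -> bool:
    """Register the profile gateway only when the backend allows it."""
    if not manager.supports_runtime_registration():
        return False  # host backends: skip instead of hitting NotImplementedError
    manager.register_profile_gateway(profile)
    return True


host = HostBackend()
container = RecordingBackend()
print(register_if_supported(host, "work"))
print(register_if_supported(container, "work"), container.registered)
```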
# ---------------------------------------------------------------------------
# S6ServiceManager (container-only)
#
# Per-profile gateways are registered dynamically when `hermes profile create`
# runs inside the container (Phase 4). Static services (main-hermes, dashboard)
# live in /etc/s6-overlay/s6-rc.d/ and are NOT managed by this class — they're
# part of the image, not runtime-created.
# ---------------------------------------------------------------------------
# s6-overlay's dynamic scandir for runtime-registered services. Lives on
# tmpfs and is the directory s6-svscan watches. Writes here trigger
# automatic supervision on the next rescan.
S6_DYNAMIC_SCANDIR = Path("/run/service")
S6_SERVICE_PREFIX = "gateway-"
# s6-overlay installs its binaries under /command/ and only adds that
# directory to PATH for processes started under the supervision tree
# (services started by s6-svscan, cont-init.d scripts, etc.). Code
# that runs via `docker exec` or any other out-of-tree entry point —
# notably our Phase 4 profile create/delete hooks — inherits the
# container's base PATH which does NOT include /command/.
#
# Rather than asking every caller to fix up its environment, the
# S6ServiceManager calls s6-* binaries by absolute path via this
# constant. We don't use `/usr/bin/s6-…` symlinks because the
# s6-overlay-symlinks-noarch tarball only links a subset, and we
# want every s6 invocation to be guaranteed-findable.
_S6_BIN_DIR = "/command"
# UID/GID of the in-image ``hermes`` user. Hardcoded to match what
# ``stage2-hook.sh`` enforces (the runtime invariant — see also
# tests/docker/test_uid_remap.py). The container starts s6-supervise
# under root and immediately drops to this UID via ``s6-setuidgid``.
_HERMES_UID = 10000
_HERMES_GID = 10000
def _seed_supervise_skeleton(svc_dir: Path) -> None:
"""Pre-create the ``supervise/`` and top-level ``event/`` skeleton
inside a service directory, owned by the hermes user.
Why this exists
---------------
When s6-supervise spawns a service it tries to ``mkdir`` two
directories: ``<svc>/event`` and ``<svc>/supervise``, both with mode
``0700``. It also ``mkfifo``s ``<svc>/supervise/control`` with mode
``0600``. Because s6-supervise runs as PID 1's effective UID (root),
these dirs end up root-owned with mode 0700, and an unprivileged client
(the ``hermes`` user, UID 10000, running every Hermes runtime
operation via ``s6-setuidgid``) gets ``EACCES`` on any ``s6-svc``,
``s6-svstat``, or ``s6-svwait`` invocation against the slot.
The PR #30136 review surfaced this as a real product gap: the
entire S6ServiceManager lifecycle (``register/start/stop/unregister
_profile_gateway``) was inert in production because every operation
is dispatched as the hermes user.
Why this works
--------------
Reading s6's source (src/supervision/s6-supervise.c::trymkdir +
control_init): the ``mkdir`` and ``mkfifo`` calls both treat
``EEXIST`` as success. If the directory is already present, the
chown/chmod fix-up that would normally make event/ ``03730
root:root`` is **skipped** entirely: s6-supervise just opens the
pre-existing FIFOs and proceeds. So if we lay the skeleton down
with hermes ownership before triggering ``s6-svscanctl -a``,
s6-supervise inherits our layout and never touches it.
Layout produced
---------------
``svc_dir/`` hermes:hermes, 0755 (parent must already exist)
``svc_dir/event/`` hermes:hermes, 03730 (setgid + g+rwx + sticky)
``svc_dir/supervise/`` hermes:hermes, 0755
``svc_dir/supervise/event/`` hermes:hermes, 03730
``svc_dir/supervise/control`` hermes:hermes, 0660 (FIFO)
The ``death_tally``, ``lock``, and ``status`` regular files end up
written by s6-supervise itself (as root), but those land mode 0644
world-readable and ``s6-svstat`` only needs read access, so the
hermes user reads them fine.
If ``svc_dir/log/`` is present (the canonical s6 logger pattern:
one s6-supervise instance per service, plus a second for its
logger), the same skeleton is seeded under ``log/`` as well:
``log/event/``, ``log/supervise/``, ``log/supervise/event/``,
``log/supervise/control``. Without this, unregister teardown
would EACCES on the logger's supervise dir even after the parent
slot's supervise/ was hermes-owned.
Idempotency
-----------
Safe to call against a directory where the skeleton already exists.
Existing entries are left untouched (the helper doesn't try to
re-chown / re-chmod live FIFOs that s6-supervise may have already
opened).
Reference
---------
Discussed at length on the skarnet `skaware` mailing list in 2020
(`<http://skarnet.org/lists/skaware/1424.html>`_); see also
just-containers/s6-overlay#130. The pre-creation pattern was
historically called out as forward-compatibility-fragile, but the
EEXIST handling in s6-supervise has been stable since 2015; it's
the same pattern ``s6-svperms`` and ``fix-attrs.d`` rely on.
"""
import os
def _mkdir_owned(path: Path, mode: int) -> None:
if path.exists():
return
path.mkdir(parents=False, exist_ok=False)
path.chmod(mode)
try:
os.chown(path, _HERMES_UID, _HERMES_GID)
except PermissionError:
# Running as the hermes user already — directory is hermes-
# owned by default. The chown is a no-op in that case, so
# swallowing this keeps both root and unprivileged callers
# on one code path.
pass
# Top-level event/ dir (this is the s6-svlisten1 event-subscription
# dir at the service root, distinct from supervise/event/).
_mkdir_owned(svc_dir / "event", 0o3730)
# supervise/ dir + its inner event/ dir.
supervise = svc_dir / "supervise"
_mkdir_owned(supervise, 0o755)
_mkdir_owned(supervise / "event", 0o3730)
# supervise/control FIFO. Same EEXIST-safe pattern: if it's already
# there (s6-supervise has already started against this slot), leave
# it alone. The explicit chmod after mkfifo is required because
# mkfifo honors the process umask, which can strip group-write
# (e.g. the default 0022 on most dev hosts → 0o660 becomes 0o640).
# The container runs with umask 0 inside s6-overlay's stage2, but
# being defensive here keeps the helper consistent under any
# invocation context.
control = supervise / "control"
if not control.exists():
os.mkfifo(control, 0o660)
control.chmod(0o660)
try:
os.chown(control, _HERMES_UID, _HERMES_GID)
except PermissionError:
pass
# If a log/ subdir is present (the canonical s6 logger pattern —
# see servicedir(7)), it gets its own s6-supervise instance and
# needs the same skeleton. Without this, unregister teardown
# would EACCES on the logger's root-owned supervise/ dir even
# when the parent slot's supervise/ is hermes-owned.
log_dir = svc_dir / "log"
if log_dir.is_dir():
_mkdir_owned(log_dir / "event", 0o3730)
log_supervise = log_dir / "supervise"
_mkdir_owned(log_supervise, 0o755)
_mkdir_owned(log_supervise / "event", 0o3730)
log_control = log_supervise / "control"
if not log_control.exists():
os.mkfifo(log_control, 0o660)
log_control.chmod(0o660)
try:
os.chown(log_control, _HERMES_UID, _HERMES_GID)
except PermissionError:
pass
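# A small POSIX-only sketch of why the helper above calls chmod after
# mkfifo: the mode passed to mkfifo is filtered through the process umask.
# It runs in a temporary directory with an assumed umask of 0022 and does
# not touch any s6 paths.

```python
import os
import stat
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    control = Path(tmp) / "control"
    old_umask = os.umask(0o022)  # typical default on dev hosts
    try:
        os.mkfifo(control, 0o660)
        # The umask strips group-write from the requested 0660.
        mode_after_mkfifo = stat.S_IMODE(control.stat().st_mode)
        # An explicit chmod is not subject to the umask.
        control.chmod(0o660)
        mode_after_chmod = stat.S_IMODE(control.stat().st_mode)
    finally:
        os.umask(old_umask)  # restore the caller's umask

print(oct(mode_after_mkfifo), oct(mode_after_chmod))
```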
class S6Error(RuntimeError):
"""Base error for S6ServiceManager lifecycle failures.
Concrete subclasses carry the slot name (and, where useful, the
underlying subprocess output) so the CLI can render an actionable
message instead of leaking a raw ``CalledProcessError`` traceback.
"""
def __init__(self, message: str, *, service: str | None = None) -> None:
super().__init__(message)
self.service = service
class GatewayNotRegisteredError(S6Error):
"""Raised when a lifecycle method targets a slot that doesn't exist.
Most commonly: ``hermes -p typo gateway start`` when no profile
``typo`` exists. Carries the unprefixed profile name (not the
full ``gateway-<profile>`` service-dir name) so callers can phrase
a user-facing message like "no such gateway 'typo'".
"""
def __init__(self, profile: str) -> None:
self.profile = profile
super().__init__(
f"no such gateway {profile!r}: register it with "
f"`hermes profile create {profile}` first, or pass "
"an existing profile name via `-p <name>`",
service=f"gateway-{profile}",
)
class S6CommandError(S6Error):
"""Raised when an s6 command fails for a reason other than a
missing slot, e.g. permission denied on the supervise control
FIFO, or s6-svc returning a non-zero exit for an unexpected
reason. Carries the stderr from the failing command so callers
can surface it.
"""
def __init__(
self, *, service: str, action: str, returncode: int, stderr: str,
) -> None:
self.action = action
self.returncode = returncode
self.stderr = stderr
message = (
f"s6-svc {action} on {service!r} failed (rc={returncode})"
)
if stderr.strip():
message += f": {stderr.strip()}"
super().__init__(message, service=service)
class S6ServiceManager:
"""Per-profile gateway supervision via s6-overlay.
Only handles runtime-registered services under
``S6_DYNAMIC_SCANDIR``. Static services (main-hermes, dashboard)
are managed by s6-rc at image-build time and are out of scope.
"""
kind: ServiceManagerKind = "s6"
def __init__(self, scandir: Path = S6_DYNAMIC_SCANDIR) -> None:
self.scandir = scandir
# -- internal helpers --------------------------------------------------
def _service_dir(self, profile: str) -> Path:
validate_profile_name(profile)
return self.scandir / f"{S6_SERVICE_PREFIX}{profile}"
def _service_name(self, profile: str) -> str:
return f"{S6_SERVICE_PREFIX}{profile}"
@staticmethod
def _render_run_script(
profile: str,
extra_env: dict[str, str],
) -> str:
"""Generate the run script for a profile-gateway s6 service.
The script:
1. Sources HERMES_HOME (and any extra env) via with-contenv
so e.g. ``-e HERMES_HOME=/data/hermes`` is honored at run
time, not Python-substituted at registration time (OQ8-C).
2. Activates the bundled venv.
3. Drops to the hermes user and exec's
``hermes -p <profile> gateway run`` (or just ``hermes
gateway run`` for the default profile; see below).
Special case: ``profile == "default"`` emits ``hermes gateway
run`` with **no** ``-p`` flag. This is the sentinel for "the
root HERMES_HOME profile" (the implicit profile that exists at
the top of $HERMES_HOME, not under profiles/). It must be
spelled this way because ``_profile_suffix()`` returns the
empty string for the root profile, and the dispatcher in
``hermes_cli.gateway`` maps that empty string to the
``gateway-default`` service slot. Passing ``-p default`` here
would instead look up ``$HERMES_HOME/profiles/default/``, a
completely different (and almost always nonexistent) profile.
Port selection: the gateway picks its bind port from the
profile's ``config.yaml`` (``[gateway] port = ...``) — that
is the single source of truth. Previously this method took a
``port`` parameter that was passed in but never substituted
into the rendered script (it was carried in for "API parity"
with a deterministic SHA-256 allocator in
``hermes_cli.profiles._allocate_gateway_port``). PR #30136
review item I5 retired both the allocator and the parameter
because they were dead code through the entire stack.
"""
import shlex
lines = [
"#!/command/with-contenv sh",
"# shellcheck shell=sh",
"set -e",
"cd /opt/data",
". /opt/hermes/.venv/bin/activate",
]
for k, v in sorted(extra_env.items()):
lines.append(f"export {k}={shlex.quote(v)}")
if profile == "default":
lines.append("exec s6-setuidgid hermes hermes gateway run")
else:
lines.append(
f"exec s6-setuidgid hermes hermes -p {shlex.quote(profile)} gateway run"
)
return "\n".join(lines) + "\n"
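To make the rendering rules above concrete, here is a minimal standalone sketch of the same logic, copied out of the class so it can run without the rest of `hermes_cli`. The profile name `work` and the environment values are made up for illustration.

```python
import shlex


def render_run_script(profile: str, extra_env: dict) -> str:
    """Standalone copy of the run-script rendering, for illustration only."""
    lines = [
        "#!/command/with-contenv sh",
        "# shellcheck shell=sh",
        "set -e",
        "cd /opt/data",
        ". /opt/hermes/.venv/bin/activate",
    ]
    # Sorted so the rendered script is deterministic for a given env dict.
    for key, value in sorted(extra_env.items()):
        lines.append(f"export {key}={shlex.quote(value)}")
    if profile == "default":
        # Root profile: no -p flag at all.
        lines.append("exec s6-setuidgid hermes hermes gateway run")
    else:
        lines.append(
            f"exec s6-setuidgid hermes hermes -p {shlex.quote(profile)} gateway run"
        )
    return "\n".join(lines) + "\n"


default_script = render_run_script("default", {})
# Hypothetical profile and env values, chosen to show quoting and ordering.
work_script = render_run_script("work", {"TZ": "Europe/Berlin", "NOTE": "two words"})
print(work_script)
```

Values containing whitespace are single-quoted by `shlex.quote`, and the `default` profile is the only one rendered without `-p`.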
@staticmethod
def _render_log_run(profile: str) -> str:
"""Generate the log/run script for a profile-gateway service.
OQ8-C: persist to ``${HERMES_HOME}/logs/gateways/<profile>/``.
CRITICAL: the HERMES_HOME path is sourced from the runtime env
via with-contenv, NOT Python-substituted at registration time,
so a container started with ``-e HERMES_HOME=/data/hermes``
gets its logs under /data/hermes/logs/..., not the build-time
default.
"""
import shlex
prof = shlex.quote(profile)
return (
f"#!/command/with-contenv sh\n"
f"# shellcheck shell=sh\n"
f': "${{HERMES_HOME:=/opt/data}}"\n'
f'log_dir="$HERMES_HOME/logs/gateways/{prof}"\n'
f'mkdir -p "$log_dir"\n'
f'chown -R hermes:hermes "$log_dir" 2>/dev/null || true\n'
f'exec s6-setuidgid hermes s6-log n10 s1000000 T "$log_dir"\n'
)
# -- lifecycle ---------------------------------------------------------
def _run_svc(self, action_flag: str, action_label: str, name: str) -> None:
"""Shared lifecycle dispatch for start / stop / restart.
Translates the two failure modes operators care about into
named errors:
* ``GatewayNotRegisteredError``: the service directory at
``<scandir>/<name>/`` doesn't exist. ``s6-svc`` would
exit non-zero with a fairly opaque message; we pre-empt
it with a clear "no such gateway 'X'" tied to the profile
name (without the ``gateway-`` prefix).
* ``S6CommandError``: anything else (EACCES on the
supervise control FIFO, timeout, etc.). Carries the
subprocess return code and stderr so callers can render
them inline.
``action_flag`` is the ``s6-svc`` flag (``-u`` / ``-d`` /
``-t``); ``action_label`` is the human verb (``start`` /
``stop`` / ``restart``) used in error messages.
"""
import subprocess
service_dir = self.scandir / name
if not service_dir.is_dir():
# Strip the gateway- prefix back off so the message
# matches what the user typed on the CLI (``-p <profile>``).
profile = (
name[len(S6_SERVICE_PREFIX):]
if name.startswith(S6_SERVICE_PREFIX)
else name
)
raise GatewayNotRegisteredError(profile)
try:
subprocess.run(
[f"{_S6_BIN_DIR}/s6-svc", action_flag, str(service_dir)],
check=True, capture_output=True, text=True, timeout=5,
)
except subprocess.CalledProcessError as exc:
raise S6CommandError(
service=name,
action=action_label,
returncode=exc.returncode,
stderr=exc.stderr or "",
) from exc
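The error translation in `_run_svc` can be tried without s6 installed. The sketch below is a simplified stand-in, not the project's code: `CommandError` and `run_action` are hypothetical names, and a short Python child process plays the role of a failing `s6-svc`.

```python
import subprocess
import sys

PREFIX = "gateway-"  # the prefix named in the docstrings above


class CommandError(RuntimeError):
    """Simplified stand-in for S6CommandError (hypothetical name)."""

    def __init__(self, *, service: str, action: str, returncode: int, stderr: str) -> None:
        self.service = service
        self.action = action
        self.returncode = returncode
        self.stderr = stderr
        message = f"s6-svc {action} on {service!r} failed (rc={returncode})"
        if stderr.strip():
            message += f": {stderr.strip()}"
        super().__init__(message)


def profile_from_service(name: str) -> str:
    """Strip the service prefix so messages match what the user typed."""
    return name[len(PREFIX):] if name.startswith(PREFIX) else name


def run_action(argv: list, *, service: str, action: str) -> None:
    """Run a command and translate a non-zero exit into CommandError."""
    try:
        # Generous timeout for the demo; the diff uses 5 seconds.
        subprocess.run(argv, check=True, capture_output=True, text=True, timeout=30)
    except subprocess.CalledProcessError as exc:
        raise CommandError(
            service=service,
            action=action,
            returncode=exc.returncode,
            stderr=exc.stderr or "",
        ) from exc


# A child process that fails the way a permission error might (made-up output).
failing_argv = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('permission denied'); sys.exit(111)",
]
caught = None
try:
    run_action(failing_argv, service="gateway-work", action="start")
except CommandError as exc:
    caught = exc

print(caught)
```

One detail worth keeping in mind: `subprocess.run(..., timeout=...)` raises `subprocess.TimeoutExpired`, which is not a `CalledProcessError`, so a caller that wants timeouts folded into the same error type would need a second `except` clause. The sketch leaves that out.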
def start(self, name: str) -> None:
"""Bring up a registered service (``s6-svc -u``).
Raises:
GatewayNotRegisteredError: no service directory for ``name``.
S6CommandError: s6-svc exited non-zero for any other reason
(permission denied on the supervise FIFO, timeout, etc.).
"""
self._run_svc("-u", "start", name)
def stop(self, name: str) -> None:
"""Bring down a registered service (``s6-svc -d``).
Raises:
GatewayNotRegisteredError: no service directory for ``name``.
S6CommandError: s6-svc exited non-zero for any other reason.
"""
self._run_svc("-d", "stop", name)
def restart(self, name: str) -> None:
"""Restart a registered service (``s6-svc -t`` = SIGTERM).
Raises:
GatewayNotRegisteredError: no service directory for ``name``.
S6CommandError: s6-svc exited non-zero for any other reason.
"""
self._run_svc("-t", "restart", name)
def is_running(self, name: str) -> bool:
"""True iff ``s6-svstat`` reports the service as up."""
import subprocess
result = subprocess.run(
[f"{_S6_BIN_DIR}/s6-svstat", str(self.scandir / name)],
capture_output=True, text=True, timeout=5,
)
return result.returncode == 0 and "up " in result.stdout
# -- runtime registration ---------------------------------------------
def supports_runtime_registration(self) -> bool:
return True
def register_profile_gateway(
self,
profile: str,
*,
extra_env: dict[str, str] | None = None,
) -> None:
"""Create the s6 service directory for a profile gateway.
Triggers ``s6-svscanctl -a`` so s6-svscan picks the new directory
up immediately. The service is created in the *up* state; to
register without auto-starting, follow up with ``stop(profile)``
(or pass the start flag via the future ``start_now=False`` arg,
which the Phase 4 reconciliation path uses via a ``down``
marker file written directly).
Raises:
ValueError: if the profile name is invalid or the service
directory already exists.
RuntimeError: if ``s6-svscanctl`` fails.
"""
import shutil
import subprocess
svc_dir = self._service_dir(profile)
if svc_dir.exists():
raise ValueError(
f"profile gateway {profile!r} already registered at {svc_dir}"
)
# Build the service directory atomically: write to a sibling
# temp dir, then rename. Avoids s6-svscan observing a half-
# populated directory on a fast rescan.
tmp_dir = svc_dir.with_name(svc_dir.name + ".tmp")
if tmp_dir.exists():
shutil.rmtree(tmp_dir, ignore_errors=True)
tmp_dir.mkdir(parents=True)
try:
(tmp_dir / "type").write_text("longrun\n")
run_script = self._render_run_script(profile, extra_env or {})
run_path = tmp_dir / "run"
run_path.write_text(run_script)
run_path.chmod(0o755)
# Persistent log rotation (OQ8-C).
log_subdir = tmp_dir / "log"
log_subdir.mkdir()
log_run = log_subdir / "run"
log_run.write_text(self._render_log_run(profile))
log_run.chmod(0o755)
# Pre-create the supervise/ skeleton with hermes ownership
# BEFORE we publish the slot. s6-supervise will EEXIST our
# dirs/FIFOs and inherit the ownership, so the runtime
# s6-svc / s6-svstat / s6-svwait calls (all dispatched as
# the hermes user) won't hit EACCES on root-owned 0700
# dirs. See ``_seed_supervise_skeleton`` for the full
# rationale.
_seed_supervise_skeleton(tmp_dir)
tmp_dir.rename(svc_dir)
except Exception:
shutil.rmtree(tmp_dir, ignore_errors=True)
raise
# Trigger rescan so s6-svscan picks up the new service.
result = subprocess.run(
[f"{_S6_BIN_DIR}/s6-svscanctl", "-a", str(self.scandir)],
capture_output=True, text=True, timeout=5,
)
if result.returncode != 0:
# Clean up: the rescan failed, and leaving the directory in place would
# be confusing (no supervisor watching it).
shutil.rmtree(svc_dir, ignore_errors=True)
raise RuntimeError(
f"s6-svscanctl failed: {result.stderr or result.stdout}"
)
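The build-then-rename step in `register_profile_gateway` is a general pattern for publishing a directory so that a watcher never sees it half-written. A minimal sketch, assuming a throwaway temp directory in place of the real scandir and leaving out the log subdirectory, the supervise skeleton, and the `s6-svscanctl` call (`publish_service_dir` is a hypothetical name):

```python
import shutil
import tempfile
from pathlib import Path


def publish_service_dir(scandir: Path, name: str, run_script: str) -> Path:
    """Populate a sibling temp dir, then rename it into place."""
    final_dir = scandir / name
    if final_dir.exists():
        raise ValueError(f"{name!r} already registered at {final_dir}")
    tmp_dir = final_dir.with_name(final_dir.name + ".tmp")
    if tmp_dir.exists():
        # Leftover from an earlier failed attempt.
        shutil.rmtree(tmp_dir, ignore_errors=True)
    tmp_dir.mkdir(parents=True)
    try:
        (tmp_dir / "type").write_text("longrun\n")
        run_path = tmp_dir / "run"
        run_path.write_text(run_script)
        run_path.chmod(0o755)
        # Same-filesystem rename: the final name appears fully populated.
        tmp_dir.rename(final_dir)
    except Exception:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return final_dir


with tempfile.TemporaryDirectory() as root:
    scandir = Path(root)
    svc = publish_service_dir(scandir, "gateway-work", "#!/bin/sh\nexec true\n")
    published = sorted(p.name for p in scandir.iterdir())
    service_files = sorted(p.name for p in svc.iterdir())
    try:
        publish_service_dir(scandir, "gateway-work", "#!/bin/sh\n")
        duplicate_rejected = False
    except ValueError:
        duplicate_rejected = True

print(published, service_files, duplicate_rejected)
```

The rename only gives this guarantee when the temp directory and the final directory live on the same filesystem, which is why the temp directory is created as a sibling.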
def unregister_profile_gateway(self, profile: str) -> None:
"""Stop the profile gateway service and remove its directory.
Idempotent: absent services are a no-op. Best-effort stop +
wait-for-down before removal so the running gateway process
gets a chance to shut down cleanly before its service dir
disappears.
Teardown ordering matters: ``s6-svscanctl -an`` is fired
**before** ``rmtree`` so s6-svscan reaps the supervise child
process (releasing its handle on ``supervise/lock`` and the
regular files inside the supervise dir), giving us a clean
directory to remove. Without the reap-first ordering, the
rmtree races s6-supervise on a set of root-owned files inside
the supervise dir and the dir is left half-removed.
"""
import shutil
import subprocess
import time
svc_dir = self._service_dir(profile)
if not svc_dir.exists():
return
# Stop the service (best effort — service may already be down).
subprocess.run(
[f"{_S6_BIN_DIR}/s6-svc", "-d", str(svc_dir)],
capture_output=True, text=True, timeout=5,
check=False,
)
# Wait for it to actually go down (up to 10s).
subprocess.run(
[f"{_S6_BIN_DIR}/s6-svwait", "-D", "-t", "10000", str(svc_dir)],
capture_output=True, text=True, timeout=15,
check=False,
)
# Reap the supervise child FIRST: -n tells s6-svscan to drop
# any supervise processes whose service dir is gone (which
# includes any service dir we're about to remove). This
# releases the file handles s6-supervise holds against the
# supervise/lock + supervise/status + supervise/death_tally
# files inside the slot, so the upcoming rmtree doesn't race.
subprocess.run(
[f"{_S6_BIN_DIR}/s6-svscanctl", "-an", str(self.scandir)],
capture_output=True, text=True, timeout=5,
check=False,
)
# Give s6-svscan a moment to reap. There's no synchronous
# "scan completed" handshake — the -a/-n trigger just sets a
# flag s6-svscan reads on its next loop iteration. 200ms is
# comfortably above the loop's resolution but well under any
# user-perceived latency.
time.sleep(0.2)
# Now the supervise dir's files are no longer held open by a
# live s6-supervise, so rmtree can remove them. Files inside
# supervise/ are root-owned (death_tally, lock, status, written
# by s6-supervise itself) — but the parent supervise/ directory
# is hermes-owned (see ``_seed_supervise_skeleton``), and on
# POSIX you only need write+execute on the parent to remove
# contained files regardless of file ownership.
shutil.rmtree(svc_dir, ignore_errors=True)
def list_profile_gateways(self) -> list[str]:
"""Return the profile names of all currently-registered gateway services.
Filters the scandir to entries that match the ``gateway-`` prefix.
Other services (e.g. ``s6-linux-init-shutdownd``) are ignored.
"""
if not self.scandir.exists():
return []
profiles: list[str] = []
for entry in self.scandir.iterdir():
if entry.name.startswith("."):
continue
if not entry.is_dir():
continue
if not entry.name.startswith(S6_SERVICE_PREFIX):
continue
profiles.append(entry.name[len(S6_SERVICE_PREFIX):])
return profiles
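The scandir filter in `list_profile_gateways` can be exercised against a temporary directory. This is a standalone sketch, not the class method: it takes the scandir as an argument and sorts the result so the output is deterministic, which the method in the diff does not do. The directory names are made up for the demo.

```python
import tempfile
from pathlib import Path

PREFIX = "gateway-"  # the prefix named in the docstring above


def list_profile_gateways(scandir: Path) -> list:
    """Return profile names for directories matching the gateway prefix."""
    if not scandir.exists():
        return []
    profiles = []
    for entry in scandir.iterdir():
        if entry.name.startswith(".") or not entry.is_dir():
            continue
        if not entry.name.startswith(PREFIX):
            continue
        profiles.append(entry.name[len(PREFIX):])
    return sorted(profiles)  # sorting is an addition for this sketch


with tempfile.TemporaryDirectory() as root:
    scandir = Path(root)
    for dirname in ("gateway-work", "gateway-default", "s6-linux-init-shutdownd", ".hidden"):
        (scandir / dirname).mkdir()
    (scandir / "gateway-stray-file").write_text("not a directory\n")
    found = list_profile_gateways(scandir)
    missing = list_profile_gateways(scandir / "does-not-exist")

print(found, missing)
```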
+50 -20
@@ -2188,28 +2188,58 @@ def _setup_matrix():
print_success("E2EE enabled")
matrix_pkg = "mautrix[encryption]" if want_e2ee else "mautrix"
# Use the central lazy-deps feature group so we install ALL of
# platform.matrix's dependencies (mautrix, Markdown, aiosqlite,
# asyncpg, aiohttp-socks) — not just mautrix itself. The previous
# hand-rolled ``pip install mautrix[encryption]`` left asyncpg /
# aiosqlite uninstalled and broke E2EE connect with
# ``No module named 'asyncpg'`` on every fresh install (#31116).
try:
__import__("mautrix")
from tools.lazy_deps import ensure as _lazy_ensure, feature_missing
_missing_before = feature_missing("platform.matrix")
if _missing_before:
print_info(
f"Installing {matrix_pkg} (+ {len(_missing_before)} runtime deps)..."
)
try:
_lazy_ensure("platform.matrix", prompt=False)
print_success(f"{matrix_pkg} installed")
except Exception as exc:
print_warning(
f"Install failed — run manually: pip install "
f"'mautrix[encryption]' asyncpg aiosqlite Markdown "
f"aiohttp-socks"
)
print_info(f" Error: {exc}")
except ImportError:
print_info(f"Installing {matrix_pkg}...")
import subprocess
uv_bin = shutil.which("uv")
if uv_bin:
result = subprocess.run(
[uv_bin, "pip", "install", "--python", sys.executable, matrix_pkg],
capture_output=True, text=True,
)
else:
result = subprocess.run(
[sys.executable, "-m", "pip", "install", matrix_pkg],
capture_output=True, text=True,
)
if result.returncode == 0:
print_success(f"{matrix_pkg} installed")
else:
print_warning(f"Install failed — run manually: pip install '{matrix_pkg}'")
if result.stderr:
print_info(f" Error: {result.stderr.strip().splitlines()[-1]}")
# tools.lazy_deps unavailable (extreme edge case — partial
# install). Fall back to the legacy single-package install
# path so the wizard still does *something*.
try:
__import__("mautrix")
except ImportError:
print_info(f"Installing {matrix_pkg}...")
import subprocess
uv_bin = shutil.which("uv")
if uv_bin:
result = subprocess.run(
[uv_bin, "pip", "install", "--python", sys.executable, matrix_pkg],
capture_output=True, text=True,
)
else:
result = subprocess.run(
[sys.executable, "-m", "pip", "install", matrix_pkg],
capture_output=True, text=True,
)
if result.returncode == 0:
print_success(f"{matrix_pkg} installed")
else:
print_warning(
f"Install failed — run manually: pip install "
f"'{matrix_pkg}' asyncpg aiosqlite Markdown aiohttp-socks"
)
if result.stderr:
print_info(f" Error: {result.stderr.strip().splitlines()[-1]}")
print()
print_info("🔒 Security: Restrict who can use your bot")
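The internals of `tools.lazy_deps` are not shown in this diff, so the following is only a sketch of the general check-before-install idea behind `feature_missing`, not its actual implementation. `missing_modules` is a hypothetical helper, and the module names are placeholders.

```python
import importlib.util


def missing_modules(module_names: list) -> list:
    """Return the top-level module names that cannot be imported.

    Uses importlib.util.find_spec, which locates a module without
    importing it. Only top-level names are handled here.
    """
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately nonexistent.
missing = missing_modules(["json", "no_such_module_hermes_demo"])
print(missing)
```

Checking the whole dependency group up front, rather than probing a single package, is what lets an installer report every missing runtime dependency at once.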
+22 -5
@@ -906,8 +906,14 @@ def do_update(name: Optional[str] = None, console: Optional[Console] = None) ->
c.print(f"[bold green]Updated {len(updates)} skill(s).[/]\n")
def do_audit(name: Optional[str] = None, console: Optional[Console] = None) -> None:
"""Re-run security scan on installed hub skills."""
def do_audit(name: Optional[str] = None, console: Optional[Console] = None,
deep: bool = False) -> None:
"""Re-run security scan on installed hub skills.
When ``deep=True``, also runs an opt-in AST-level diagnostic on Python
files (review aid only, not a security gate; skills_guard.py verdicts
are unchanged).
"""
from tools.skills_hub import HubLockFile, SKILLS_DIR
from tools.skills_guard import scan_skill, format_scan_report
@@ -928,6 +934,9 @@ def do_audit(name: Optional[str] = None, console: Optional[Console] = None) -> N
c.print(f"\n[bold]Auditing {len(targets)} skill(s)...[/]\n")
if deep:
from tools.skills_ast_audit import ast_scan_path, format_ast_report
for entry in targets:
skill_path = SKILLS_DIR / entry["install_path"]
if not skill_path.exists():
@@ -936,6 +945,10 @@ def do_audit(name: Optional[str] = None, console: Optional[Console] = None) -> N
result = scan_skill(skill_path, source=entry.get("identifier", entry["source"]))
c.print(format_scan_report(result))
if deep:
c.print(format_ast_report(ast_scan_path(skill_path), skill_name=entry["name"]))
c.print()
@@ -1343,7 +1356,8 @@ def skills_command(args) -> None:
elif action == "update":
do_update(name=getattr(args, "name", None))
elif action == "audit":
do_audit(name=getattr(args, "name", None))
do_audit(name=getattr(args, "name", None),
deep=getattr(args, "deep", False))
elif action == "uninstall":
do_uninstall(args.name)
elif action == "reset":
@@ -1395,6 +1409,8 @@ def handle_skills_slash(cmd: str, console: Optional[Console] = None) -> None:
/skills update
/skills audit
/skills audit my-skill
/skills audit --deep
/skills audit my-skill --deep
/skills uninstall my-skill
/skills tap list
/skills tap add owner/repo
@@ -1509,8 +1525,9 @@ def handle_skills_slash(cmd: str, console: Optional[Console] = None) -> None:
do_update(name=name, console=c)
elif action == "audit":
name = args[0] if args else None
do_audit(name=name, console=c)
name = args[0] if args and not args[0].startswith("--") else None
deep = "--deep" in args
do_audit(name=name, console=c, deep=deep)
elif action == "uninstall":
if not args:
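The slash-command parsing for `/skills audit` is small enough to check by hand. The sketch below lifts the two lines from the handler into a hypothetical helper, `parse_audit_args`, so the argument combinations listed in the help text can be tried directly.

```python
from typing import List, Optional, Tuple


def parse_audit_args(args: List[str]) -> Tuple[Optional[str], bool]:
    """Mirror the audit branch: optional skill name first, --deep anywhere."""
    name = args[0] if args and not args[0].startswith("--") else None
    deep = "--deep" in args
    return name, deep


cases = {
    "/skills audit": parse_audit_args([]),
    "/skills audit my-skill": parse_audit_args(["my-skill"]),
    "/skills audit --deep": parse_audit_args(["--deep"]),
    "/skills audit my-skill --deep": parse_audit_args(["my-skill", "--deep"]),
    "/skills audit --deep my-skill": parse_audit_args(["--deep", "my-skill"]),
}
for command, parsed in cases.items():
    print(command, "->", parsed)
```

With this rule the skill name is only recognized in the first position, so `/skills audit --deep my-skill` audits every skill rather than just `my-skill`.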
+38 -14
@@ -119,7 +119,6 @@ _PUBLIC_API_PATHS: frozenset = frozenset({
"/api/model/info",
"/api/dashboard/themes",
"/api/dashboard/plugins",
"/api/dashboard/plugins/rescan",
})
@@ -3296,24 +3295,49 @@ _VALID_CHANNEL_RE = re.compile(r"^[A-Za-z0-9._-]{1,128}$")
_LOOPBACK_HOSTS = frozenset({"127.0.0.1", "::1", "localhost", "testclient"})
def _is_public_bind() -> bool:
"""True when bound to all-interfaces (operator used --insecure)."""
return getattr(app.state, "bound_host", "") in {"0.0.0.0", "::"}
def _ws_client_is_allowed(ws: "WebSocket") -> bool:
"""Check if the WebSocket client IP is acceptable.
Allows loopback always; allows any IP when bound to all-interfaces
(--insecure mode, guarded by session token auth).
Allows loopback clients only.
"""
if _is_public_bind():
return True
client_host = ws.client.host if ws.client else ""
if not client_host:
return True
return client_host in _LOOPBACK_HOSTS
def _ws_host_origin_is_allowed(ws: "WebSocket") -> bool:
"""Apply the dashboard Host/Origin guard to WebSocket upgrades.
FastAPI HTTP middleware does not run for WebSocket routes, so the
DNS-rebinding Host check used for normal dashboard HTTP requests must be
repeated here before accepting the upgrade. Browsers also send an Origin
header on WebSocket handshakes; when present, require it to target the
same bound dashboard host.
"""
bound_host = getattr(app.state, "bound_host", None)
if not bound_host:
return True
host_header = ws.headers.get("host", "")
if not _is_accepted_host(host_header, bound_host):
return False
origin = ws.headers.get("origin", "")
if not origin:
return True
parsed = urllib.parse.urlparse(origin)
if parsed.scheme not in {"http", "https"} or not parsed.netloc:
return False
return _is_accepted_host(parsed.netloc, bound_host)
def _ws_request_is_allowed(ws: "WebSocket") -> bool:
"""Return True when the WebSocket upgrade matches dashboard boundaries."""
return _ws_host_origin_is_allowed(ws) and _ws_client_is_allowed(ws)
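The Origin half of the guard above can be sketched in isolation. `_is_accepted_host` is not shown in this diff, so `host_matches` below is a hypothetical, deliberately simple stand-in that compares hostnames and ignores the port; the real helper may accept more (for example loopback aliases). The port `9119` is a placeholder.

```python
import urllib.parse


def host_matches(host_value: str, bound_host: str) -> bool:
    """Hypothetical stand-in for _is_accepted_host: hostname equality, port ignored."""
    hostname = urllib.parse.urlsplit(f"//{host_value}").hostname
    return hostname is not None and hostname == bound_host.lower()


def origin_is_allowed(origin: str, bound_host: str) -> bool:
    """Accept a missing Origin; otherwise require an http(s) URL on the bound host."""
    if not origin:
        return True  # non-browser clients typically send no Origin header
    parsed = urllib.parse.urlparse(origin)
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        return False
    return host_matches(parsed.netloc, bound_host)


checks = {
    "": origin_is_allowed("", "127.0.0.1"),
    "http://127.0.0.1:9119": origin_is_allowed("http://127.0.0.1:9119", "127.0.0.1"),
    "https://evil.example": origin_is_allowed("https://evil.example", "127.0.0.1"),
    "file:///etc/passwd": origin_is_allowed("file:///etc/passwd", "127.0.0.1"),
    "null": origin_is_allowed("null", "127.0.0.1"),
}
for origin, allowed in checks.items():
    print(repr(origin), allowed)
```

An opaque origin such as the literal string `null` parses with an empty scheme and is therefore rejected, which matches the scheme check in the diff.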
# Per-channel subscriber registry used by /api/pub (PTY-side gateway → dashboard)
# and /api/events (dashboard → browser sidebar). Keyed by an opaque channel id
# the chat tab generates on mount; entries auto-evict when the last subscriber
@@ -3415,7 +3439,7 @@ async def pty_ws(ws: WebSocket) -> None:
await ws.close(code=4401)
return
if not _ws_client_is_allowed(ws):
if not _ws_request_is_allowed(ws):
await ws.close(code=4403)
return
@@ -3534,7 +3558,7 @@ async def gateway_ws(ws: WebSocket) -> None:
await ws.close(code=4401)
return
if not _ws_client_is_allowed(ws):
if not _ws_request_is_allowed(ws):
await ws.close(code=4403)
return
@@ -3566,7 +3590,7 @@ async def pub_ws(ws: WebSocket) -> None:
await ws.close(code=4401)
return
if not _ws_client_is_allowed(ws):
if not _ws_request_is_allowed(ws):
await ws.close(code=4403)
return
@@ -3595,7 +3619,7 @@ async def events_ws(ws: WebSocket) -> None:
await ws.close(code=4401)
return
if not _ws_client_is_allowed(ws):
if not _ws_request_is_allowed(ws):
await ws.close(code=4403)
return
+29 -5
@@ -11,8 +11,10 @@ hot-reloaded by the webhook adapter without a gateway restart.
"""
import json
import os
import re
import secrets
import tempfile
import time
from pathlib import Path
from typing import Dict
@@ -23,6 +25,7 @@ from hermes_cli.config import cfg_get
_SUBSCRIPTIONS_FILENAME = "webhook_subscriptions.json"
_SUBSCRIPTIONS_FILE_MODE = 0o600
def _hermes_home() -> Path:
@@ -48,12 +51,33 @@ def _load_subscriptions() -> Dict[str, dict]:
def _save_subscriptions(subs: Dict[str, dict]) -> None:
path = _subscriptions_path()
path.parent.mkdir(parents=True, exist_ok=True)
tmp_path = path.with_suffix(".tmp")
tmp_path.write_text(
json.dumps(subs, indent=2, ensure_ascii=False),
encoding="utf-8",
# webhook_subscriptions.json contains per-route HMAC secrets — write
# via tempfile + chmod 0o600 before the atomic rename so a permissive
# umask cannot leave the secrets readable to other local users in the
# window between create and rename.
fd, tmp_name = tempfile.mkstemp(
prefix=f".{path.name}.",
suffix=".tmp",
dir=path.parent,
text=True,
)
atomic_replace(tmp_path, path)
tmp_path = Path(tmp_name)
try:
with os.fdopen(fd, "w", encoding="utf-8") as fh:
json.dump(subs, fh, indent=2, ensure_ascii=False)
fh.flush()
os.fsync(fh.fileno())
os.chmod(tmp_path, _SUBSCRIPTIONS_FILE_MODE)
atomic_replace(tmp_path, path)
# Re-assert after rename in case the destination existed with a
# broader mode and atomic_replace preserved it.
os.chmod(path, _SUBSCRIPTIONS_FILE_MODE)
except Exception:
try:
tmp_path.unlink(missing_ok=True)
except OSError:
pass
raise
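The write path in `_save_subscriptions` follows a common pattern: create a private temp file, write and sync it, then rename it over the target. A minimal sketch of that pattern, assuming a POSIX system and using `os.replace` as a stand-in for the project's `atomic_replace` helper (whose implementation is not shown here). `save_json_private` and the payload are hypothetical.

```python
import json
import os
import stat
import tempfile
from pathlib import Path

FILE_MODE = 0o600


def save_json_private(path: Path, payload: dict) -> None:
    """Write JSON via a 0600 temp file in the same directory, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # mkstemp creates the file readable and writable only by the creating user.
    fd, tmp_name = tempfile.mkstemp(
        prefix=f".{path.name}.", suffix=".tmp", dir=path.parent
    )
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2, ensure_ascii=False)
            fh.flush()
            os.fsync(fh.fileno())
        os.chmod(tmp_path, FILE_MODE)
        os.replace(tmp_path, path)  # stand-in for atomic_replace
    except Exception:
        tmp_path.unlink(missing_ok=True)
        raise


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "webhook_subscriptions.json"
    # Start from a world-readable file to show the mode being tightened.
    target.write_text("{}", encoding="utf-8")
    os.chmod(target, 0o644)
    save_json_private(target, {"demo-route": {"secret": "not-a-real-secret"}})
    mode = stat.S_IMODE(target.stat().st_mode)
    saved = json.loads(target.read_text(encoding="utf-8"))
    leftovers = [p.name for p in Path(root).iterdir() if p.name != target.name]

print(oct(mode), saved, leftovers)
```

Creating the temp file in the target's own directory keeps the rename on one filesystem, and the dotted prefix keeps a stray temp file from being mistaken for the real one.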
def _get_webhook_config() -> dict:
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "Geen benoemde sessies gevind nie.\nGebruik `/title My Sessie` om jou huidige sessie 'n naam te gee, en dan `/resume My Sessie` om later daarheen terug te keer."
list_header: "📋 **Benoemde Sessies**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nGebruik: `/resume <session name>`"
list_footer_numbered: "\nGebruik: `/resume <sessienaam>` of `/resume <nommer>` (bv. `/resume 1` vir die mees onlangse)"
list_failed: "Kon nie sessies lys nie: {error}"
out_of_range: "Hervat-indeks {index} is buite bereik.\nGebruik `/resume` sonder argumente om beskikbare sessies te sien."
not_found: "Geen sessie gevind wat by '**{name}**' pas nie.\nGebruik `/resume` sonder argumente om beskikbare sessies te sien."
already_on: "📌 Reeds op sessie **{name}**."
switch_failed: "Kon nie sessie verander nie."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "Keine benannten Sitzungen gefunden.\nVerwenden Sie `/title Meine Sitzung`, um die aktuelle Sitzung zu benennen, dann `/resume Meine Sitzung`, um später dorthin zurückzukehren."
list_header: "📋 **Benannte Sitzungen**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nVerwendung: `/resume <Sitzungsname>`"
list_footer_numbered: "\nVerwendung: `/resume <Sitzungsname>` oder `/resume <Nummer>` (z. B. `/resume 1` für die zuletzt verwendete)"
list_failed: "Sitzungen konnten nicht aufgelistet werden: {error}"
out_of_range: "Wiederaufnahme-Index {index} liegt außerhalb des gültigen Bereichs.\nVerwenden Sie `/resume` ohne Argumente, um verfügbare Sitzungen anzuzeigen."
not_found: "Keine Sitzung passend zu '**{name}**' gefunden.\nVerwenden Sie `/resume` ohne Argumente, um verfügbare Sitzungen zu sehen."
already_on: "📌 Bereits in Sitzung **{name}**."
switch_failed: "Sitzungswechsel fehlgeschlagen."
+3
@@ -237,9 +237,12 @@ gateway:
no_named_sessions: "No named sessions found.\nUse `/title My Session` to name your current session, then `/resume My Session` to return to it later."
list_header: "📋 **Named Sessions**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nUsage: `/resume <session name>`"
list_footer_numbered: "\nUsage: `/resume <session name>` or `/resume <number>` (e.g. `/resume 1` for the most recent)"
list_failed: "Could not list sessions: {error}"
out_of_range: "Resume index {index} is out of range.\nUse `/resume` with no arguments to see available sessions."
not_found: "No session found matching '**{name}**'.\nUse `/resume` with no arguments to see available sessions."
already_on: "📌 Already on session **{name}**."
switch_failed: "Failed to switch session."
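The new `list_item_numbered` and `out_of_range` strings suggest how a numbered `/resume` could work. The gateway code that consumes them is not part of this hunk, so the sketch below is an assumption: it treats the index as 1-based into a most-recent-first list, as the footer text implies. The helper names and session titles are made up.

```python
ITEM_NUMBERED = "{index}. **{title}**{preview_part}"
OUT_OF_RANGE = (
    "Resume index {index} is out of range.\n"
    "Use `/resume` with no arguments to see available sessions."
)


def render_numbered(titles: list) -> str:
    """Render session titles as a 1-based numbered list (no previews)."""
    return "\n".join(
        ITEM_NUMBERED.format(index=i, title=title, preview_part="")
        for i, title in enumerate(titles, start=1)
    )


def resolve_resume_arg(arg: str, titles: list) -> str:
    """Map `/resume <number>` to a title; pass non-numeric names through."""
    if arg.isdigit():
        index = int(arg)
        if not 1 <= index <= len(titles):
            raise IndexError(OUT_OF_RANGE.format(index=index))
        return titles[index - 1]
    return arg


sessions = ["Deploy notes", "Bug triage"]  # assumed most-recent-first
listing = render_numbered(sessions)
print(listing)
print(resolve_resume_arg("1", sessions))
```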
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "No se encontraron sesiones con nombre.\nUsa `/title Mi sesión` para nombrar la sesión actual y luego `/resume Mi sesión` para volver a ella."
list_header: "📋 **Sesiones con nombre**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nUso: `/resume <nombre de sesión>`"
list_footer_numbered: "\nUso: `/resume <nombre de sesión>` o `/resume <número>` (p. ej. `/resume 1` para la más reciente)"
list_failed: "No se pudieron listar las sesiones: {error}"
out_of_range: "El índice de reanudación {index} está fuera de rango.\nUsa `/resume` sin argumentos para ver las sesiones disponibles."
not_found: "No se encontró ninguna sesión que coincida con '**{name}**'.\nUsa `/resume` sin argumentos para ver las sesiones disponibles."
already_on: "📌 Ya estás en la sesión **{name}**."
switch_failed: "No se pudo cambiar de sesión."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "Aucune session nommée trouvée.\nUtilisez `/title Ma session` pour nommer la session actuelle, puis `/resume Ma session` pour y revenir plus tard."
list_header: "📋 **Sessions nommées**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nUsage : `/resume <nom de session>`"
list_footer_numbered: "\nUtilisation : `/resume <nom de session>` ou `/resume <numéro>` (par exemple `/resume 1` pour la plus récente)"
list_failed: "Impossible de lister les sessions : {error}"
out_of_range: "L'index de reprise {index} est hors limites.\nUtilisez `/resume` sans arguments pour voir les sessions disponibles."
not_found: "Aucune session correspondant à '**{name}**' trouvée.\nUtilisez `/resume` sans argument pour voir les sessions disponibles."
already_on: "📌 Déjà sur la session **{name}**."
switch_failed: "Échec du changement de session."
+3
@@ -226,9 +226,12 @@ gateway:
no_named_sessions: "Níor aimsíodh aon seisiún ainmnithe.\nÚsáid `/title M'Ainm Seisiúin` chun do sheisiún reatha a ainmniú, ansin `/resume M'Ainm Seisiúin` chun filleadh air níos déanaí."
list_header: "📋 **Seisiúin Ainmnithe**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nÚsáid: `/resume <session name>`"
list_footer_numbered: "\nÚsáid: `/resume <ainm seisiúin>` nó `/resume <uimhir>` (m.sh. `/resume 1` don cheann is déanaí)"
list_failed: "Níorbh fhéidir seisiúin a liostáil: {error}"
out_of_range: "Tá an t-innéacs atosaithe {index} as raon.\nÚsáid `/resume` gan argóintí chun na seisiúin atá ar fáil a fheiceáil."
not_found: "Níor aimsíodh aon seisiún ag teacht le '**{name}**'.\nÚsáid `/resume` gan argóintí chun seisiúin atá ar fáil a fheiceáil."
already_on: "📌 Cheana ar an seisiún **{name}**."
switch_failed: "Theip ar athrú seisiúin."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "Nem található elnevezett munkamenet.\nHasználd a `/title Saját munkamenet` parancsot a jelenlegi munkamenet elnevezéséhez, majd a `/resume Saját munkamenet` paranccsal térhetsz vissza hozzá."
list_header: "📋 **Elnevezett munkamenetek**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nHasználat: `/resume <munkamenet neve>`"
list_footer_numbered: "\nHasználat: `/resume <munkamenet neve>` vagy `/resume <szám>` (pl. `/resume 1` a legutóbbihoz)"
list_failed: "Nem sikerült listázni a munkameneteket: {error}"
out_of_range: "A folytatási index ({index}) tartományon kívül esik.\nA `/resume` argumentumok nélküli használata megjeleníti az elérhető munkameneteket."
not_found: "Nem található '**{name}**' nevű munkamenet.\nArgumentumok nélkül használd a `/resume` parancsot az elérhető munkamenetek megtekintéséhez."
already_on: "📌 Már a **{name}** munkamenetben vagy."
switch_failed: "Nem sikerült munkamenetet váltani."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "Nessuna sessione con nome trovata.\nUsa `/title My Session` per dare un nome alla sessione attuale, poi `/resume My Session` per tornare a essa in seguito."
list_header: "📋 **Sessioni con nome**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nUso: `/resume <session name>`"
list_footer_numbered: "\nUso: `/resume <nome sessione>` o `/resume <numero>` (es. `/resume 1` per la più recente)"
list_failed: "Impossibile elencare le sessioni: {error}"
out_of_range: "L'indice di ripresa {index} è fuori intervallo.\nUsa `/resume` senza argomenti per vedere le sessioni disponibili."
not_found: "Nessuna sessione trovata corrispondente a '**{name}**'.\nUsa `/resume` senza argomenti per vedere le sessioni disponibili."
already_on: "📌 Già nella sessione **{name}**."
switch_failed: "Cambio di sessione non riuscito."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "名前付きセッションが見つかりません。\n`/title セッション名` で現在のセッションに名前を付けると、後で `/resume セッション名` で戻れます。"
list_header: "📋 **名前付きセッション**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\n使い方: `/resume <セッション名>`"
list_footer_numbered: "\n使い方: `/resume <セッション名>` または `/resume <番号>`(例: 最新のセッションには `/resume 1`)"
list_failed: "セッションを一覧表示できませんでした: {error}"
out_of_range: "再開インデックス {index} は範囲外です。\n引数なしで `/resume` を実行すると、利用可能なセッションが表示されます。"
not_found: "'**{name}**' に一致するセッションが見つかりません。\n引数なしで `/resume` を実行すると利用可能なセッションを表示します。"
already_on: "📌 既にセッション **{name}** にいます。"
switch_failed: "セッションの切り替えに失敗しました。"
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "이름이 지정된 세션이 없습니다.\n현재 세션에 이름을 지정하려면 `/title 내 세션`을 사용하고, 나중에 `/resume 내 세션`으로 돌아오세요."
list_header: "📋 **이름이 지정된 세션**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\n사용법: `/resume <session name>`"
list_footer_numbered: "\n사용법: `/resume <세션 이름>` 또는 `/resume <번호>` (예: 가장 최근 세션은 `/resume 1`)"
list_failed: "세션 목록을 가져올 수 없습니다: {error}"
out_of_range: "재개 인덱스 {index}이(가) 범위를 벗어났습니다.\n인자 없이 `/resume`을 실행하면 사용 가능한 세션이 표시됩니다."
not_found: "'**{name}**'와 일치하는 세션이 없습니다.\n사용 가능한 세션을 보려면 인수 없이 `/resume`을 사용하세요."
already_on: "📌 이미 **{name}** 세션에 있습니다."
switch_failed: "세션 전환에 실패했습니다."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "Não foram encontradas sessões com nome.\nUsa `/title A minha sessão` para nomear a sessão atual e depois `/resume A minha sessão` para voltar a ela."
list_header: "📋 **Sessões com nome**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nUso: `/resume <nome da sessão>`"
list_footer_numbered: "\nUso: `/resume <nome da sessão>` ou `/resume <número>` (ex.: `/resume 1` para a mais recente)"
list_failed: "Não foi possível listar as sessões: {error}"
out_of_range: "O índice de retomada {index} está fora do intervalo.\nUse `/resume` sem argumentos para ver as sessões disponíveis."
not_found: "Não foi encontrada nenhuma sessão correspondente a '**{name}**'.\nUsa `/resume` sem argumentos para ver as sessões disponíveis."
already_on: "📌 Já estás na sessão **{name}**."
switch_failed: "Falha ao mudar de sessão."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "Именованных сеансов не найдено.\nИспользуйте `/title Мой сеанс`, чтобы назвать текущий сеанс, затем `/resume Мой сеанс`, чтобы вернуться к нему позже."
list_header: "📋 **Именованные сеансы**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nИспользование: `/resume <название сеанса>`"
list_footer_numbered: "\nИспользование: `/resume <имя сеанса>` или `/resume <номер>` (например, `/resume 1` для самого недавнего)"
list_failed: "Не удалось получить список сеансов: {error}"
out_of_range: "Индекс возобновления {index} вне диапазона.\nИспользуйте `/resume` без аргументов, чтобы увидеть доступные сеансы."
not_found: "Сеанс, соответствующий '**{name}**', не найден.\nИспользуйте `/resume` без аргументов, чтобы увидеть доступные сеансы."
already_on: "📌 Уже в сеансе **{name}**."
switch_failed: "Не удалось переключить сеанс."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "Adlandırılmış oturum bulunamadı.\nMevcut oturumu adlandırmak için `/title Oturumum`, daha sonra geri dönmek için `/resume Oturumum` kullanın."
list_header: "📋 **Adlandırılmış Oturumlar**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nKullanım: `/resume <oturum adı>`"
list_footer_numbered: "\nKullanım: `/resume <oturum adı>` veya `/resume <numara>` (örn. en yenisi için `/resume 1`)"
list_failed: "Oturumlar listelenemedi: {error}"
out_of_range: "Devam endeksi {index} aralık dışında.\nKullanılabilir oturumları görmek için `/resume` komutunu argümansız çalıştırın."
not_found: "'**{name}**' ile eşleşen oturum bulunamadı.\nKullanılabilir oturumları görmek için argümansız `/resume` kullanın."
already_on: "📌 Zaten **{name}** oturumundasınız."
switch_failed: "Oturum değiştirilemedi."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "Іменованих сеансів не знайдено.\nВикористайте `/title Мій сеанс`, щоб назвати поточний сеанс, потім `/resume Мій сеанс`, щоб повернутися до нього."
list_header: "📋 **Іменовані сеанси**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\nВикористання: `/resume <назва сеансу>`"
list_footer_numbered: "\nВикористання: `/resume <назва сеансу>` або `/resume <номер>` (наприклад, `/resume 1` для найновішого)"
list_failed: "Не вдалося отримати список сеансів: {error}"
out_of_range: "Індекс відновлення {index} поза межами діапазону.\nВикористовуйте `/resume` без аргументів, щоб переглянути доступні сеанси."
not_found: "Сеанс, що відповідає '**{name}**', не знайдено.\nВикористайте `/resume` без аргументів, щоб побачити доступні сеанси."
already_on: "📌 Уже в сеансі **{name}**."
switch_failed: "Не вдалося переключити сеанс."
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "找不到已命名的工作階段。\n使用 `/title 我的工作階段` 為目前工作階段命名,然後使用 `/resume 我的工作階段` 返回。"
list_header: "📋 **已命名工作階段**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\n用法:`/resume <工作階段名稱>`"
list_footer_numbered: "\n用法:`/resume <工作階段名稱>` 或 `/resume <編號>`(例如,`/resume 1` 表示最近的工作階段)"
list_failed: "無法列出工作階段:{error}"
out_of_range: "恢復索引 {index} 超出範圍。\n請使用不帶參數的 `/resume` 檢視可用的工作階段。"
not_found: "找不到符合 '**{name}**' 的工作階段。\n使用不帶參數的 `/resume` 檢視可用的工作階段。"
already_on: "📌 已在工作階段 **{name}** 上。"
switch_failed: "切換工作階段失敗。"
+3
@@ -222,9 +222,12 @@ gateway:
no_named_sessions: "未找到已命名的会话。\n使用 `/title 我的会话` 为当前会话命名,然后用 `/resume 我的会话` 返回。"
list_header: "📋 **已命名会话**\n"
list_item: "• **{title}**{preview_part}"
list_item_numbered: "{index}. **{title}**{preview_part}"
list_preview_suffix: " — _{preview}_"
list_footer: "\n用法:`/resume <会话名称>`"
list_footer_numbered: "\n用法:`/resume <会话名称>` 或 `/resume <编号>`(例如,`/resume 1` 表示最近的会话)"
list_failed: "无法列出会话:{error}"
out_of_range: "恢复索引 {index} 超出范围。\n请使用不带参数的 `/resume` 查看可用会话。"
not_found: "未找到匹配 '**{name}**' 的会话。\n使用不带参数的 `/resume` 查看可用会话。"
already_on: "📌 已在会话 **{name}** 上。"
switch_failed: "切换会话失败。"
+3
@@ -0,0 +1,3 @@
from .adapter import register
__all__ = ["register"]

Some files were not shown because too many files have changed in this diff.