Thread 1 (cli.py:1488): Fix broken skin hook — class is SkinConfig
not Skin. The previous code silently no-op'd via the broad except,
so SkinConfig.get_color() calls weren't actually remapped. Verified
the hook fires now: in light mode, banner_text returns #1A1A1A
instead of #FFF8DC.
Thread 2 (cli.py:1328): Align comment with actual timeout. The OSC 11
read deadline is 100ms (time.monotonic() + 0.1), not 50ms. Fixed
the docstring.
Thread 3 (cli.py:13389): Remove unused imports of Point and Screen
in the _output_screen_diff monkey-patch block. Leftover from earlier
experiments — the wrapper only needs previous_screen mutation.
Thread 4 (cli.py:11422): Skip light-mode remap entirely when a pt
style string already specifies its own bg (e.g. 'bg:#1a1a2e #FFF8DC'
for status-bar / completion-menu). Those colors were tuned for that
specific dark bg; remapping the FG to #1A1A1A would produce
dark-on-dark (invisible). Now we detect the explicit 'bg:' token
and leave the whole value untouched.
Also dropped the stale comment block at the resize-handler that
described the old 'force \x1b[2J\x1b[H clear-screen on resize'
recovery — replaced with the actual current strategy
(monkey-patch _output_screen_diff).
Three changes that together make the response Panel readable in light
Terminal.app mode:
1. Hook Skin.get_color() at module load so EVERY skin color read goes
through _maybe_remap_for_light_mode(). Previously only _hex_to_ansi()
and pt's style strings were remapped — Rich Panel borders and body
text bypassed the remap and stayed as #FFF8DC (cornsilk on cream).
2. Prime the light-mode detection cache at import time when stdin is
a tty. Ensures OSC 11 query happens before any banner/Panel render.
3. Drop status-bar fg colors (#C0C0C0 silver, #888888, #555555, #8B8682)
from the remap table — those are paired with a dark navy bg, so
remapping them to dark gray would make them invisible the OTHER
direction (dark on dark).
OSC 11 background query needs raw tty access; running it from inside
pt's render path could race with pt's own tty handling. Call
_detect_light_mode() once in HermesCLI.run() at startup so the result
is cached before pt's Application starts.
Mirrors ui-tui/src/theme.ts detectLightMode() in Python so the base
hermes CLI also adapts to light Terminal.app backgrounds.
Detection priority (first match wins):
1. HERMES_LIGHT / HERMES_TUI_LIGHT env (true/false)
2. HERMES_TUI_THEME=light|dark
3. HERMES_TUI_BACKGROUND=#RRGGBB
4. COLORFGBG env (xterm/Konsole/urxvt)
5. OSC 11 query (\x1b]11;?\x1b\\) — asks the terminal directly
with a 100ms timeout
6. Default: dark
When light mode is detected, dark-mode-tuned skin colors are remapped
to higher-contrast equivalents:
#FFF8DC (cornsilk) -> #1A1A1A (near-black)
#FFD700 (gold) -> #9A6B00 (dark goldenrod)
#B8860B (dim) -> #5C4500 (deeper brown)
... etc
Hooked at two points:
- _hex_to_ansi() — auto-remaps any color emitted via the ANSI helper
- _build_tui_style_dict() — rewrites pt style strings (chrome bg/fg)
Set HERMES_TUI_THEME=light to force light-mode behavior; otherwise
the OSC 11 query at startup auto-detects in most modern terminals.
The _DIM ANSI escape was a SkinAwareAnsi bound to banner_dim (#B8860B
dark goldenrod). On light cream Terminal.app backgrounds this rendered
the [thinking] reasoning preview essentially invisible (dark goldenrod
on cream is very low contrast).
Replace _DIM with a fixed ANSI dim+italic escape (\x1b[2;3m) so dim
text inherits the terminal's default foreground color and stays
readable in both light and dark Terminal.app modes.
Updated the /skin command to no longer call _DIM.reset() since _DIM
is now a plain str.
Skin engine was setting 'input-area' style to the skin's 'prompt' color
(near-white #FFF8DC for default and most other skins). On light-mode
Terminal.app this made typed text invisible (white-on-white).
Decouple the prompt symbol color (still skin-controlled) from the typed
input color (now always inherits terminal default fg). The user's typed
text is now readable in both light and dark Terminal.app modes
regardless of which skin is active.
Hardcoded #FFF8DC (cornsilk) for the input area and prompt made typed
text invisible on light-mode Terminal.app (white-on-white).
Default to empty style string '' so the input/prompt inherit the
terminal's default foreground color. Skins can still opt into a
colored prompt by setting the 'prompt' color explicitly in their YAML.
banner_text default kept at #FFF8DC since the banner has its own
background and the legacy default was working there.
Previous version (ba3822a64) replaced None previous_screen with a
fresh Screen() before passing to pt's renderer. That changed the
behavior of pt's `if not previous_screen` guard at L178-185, which
fires reset_attributes() + erase_down() on first-paint and after
width changes. With that reset suppressed, ANSI styles can leak
between renders and chat text loses its color/bold/italic styling.
Fix: only mutate previous_screen.height when previous_screen is
already non-None AND its current height is genuinely smaller than
the new screen's height. Don't touch the None case at all — let pt's
own first-paint reset path run as designed.
The reserve-vertical-space scroll suppression (the actual bug fix)
still works because that branch only matters when previous_screen
exists with a height that's less than current_height — which is
exactly the case we now handle.
# Verified empirically
- Before/after resize: colors preserved (status bar yellow, rules
orange, "26 commits behind" warning yellow caution)
- After widen back: colors still correct
- 10-resize stress test: ZERO scrollback delta, full content preserved
# What changed
Replaced DECSTBM scroll region + chrome-row erase approach with a
direct monkey-patch of prompt_toolkit's module-level
`_output_screen_diff` function.
The DECSTBM approach had two killer bugs:
1. Scroll region leaked into the user's shell after hermes quit
(atexit firing semantics + the region persists across processes
in macOS Terminal.app)
2. Chrome-row erase wiped chat content / streaming responses if user
resized mid-stream
# Root cause (re-verified by reading pt/renderer.py)
`_output_screen_diff` (renderer.py L232-242) deliberately moves the
cursor to the bottom of the canvas after painting:
```python
# Correctly reserve vertical space as required by the layout.
# When this is a new screen (drawn for the first time), or for some
# reason higher than the previous one. Move the cursor once to the
# bottom of the output. That way, we're sure that the terminal
# scrolls up, even when the lower lines of the canvas just contain
# whitespace.
if current_height > previous_screen.height:
current_pos = move_cursor(Point(x=0, y=current_height - 1))
```
In non-fullscreen mode this scrolls chrome content into terminal
scrollback EVERY render — not just on resize. The `move_cursor`
walks down via `\r\n` which scrolls when at the bottom row.
# Fix
Wrap `_output_screen_diff` and inflate `previous_screen.height` to
match `screen.height` before passing through. This makes the
`if current_height > previous_screen.height` guard fall through and
skip the bottom-cursor-move entirely. Without that move, pt's render
only writes within the layout's actual rows. `\r\n` between rows
inside the layout body never reaches the bottom of the viewport
(because `move_cursor(0,0)` walks UP first to layout-top, then
`\r\n*N` walks DOWN only as far as the layout actually spans).
# Verified empirically in real Terminal.app
10-resize stress test (mixed shrink+widen) during streaming:
✅ ZERO scrollback delta (0 status bars added)
✅ Full streaming response preserved
✅ User input preserved
✅ Banner preserved in scrollback
✅ Status bar correctly anchored at bottom
✅ No visible duplicates anywhere
✅ No shell breakage after quit (no scroll region to leak)
# Reverted
- DECSTBM scroll region (shell-leak risk gone)
- atexit handler for scroll region restore (no longer needed)
- Chrome-row erase (\x1b[2K walking) — no longer needed
- _hermes_resize_clear function — back to vanilla _schedule_resize_recovery
Previous version (fef97aee5) used `\x1b[J` (erase from cursor to end of
screen) which WIPED the entire viewport — losing the user's just-typed
message and any streaming agent response if they resized mid-stream.
Fix: erase ONLY the bottom chrome rows (`CHROME_ROWS = 8`, generous
slack for status bar + 2 rules + input + reflow extras). Walk up
from the bottom; for each row emit `\x1b[<row>;1H\x1b[2K` (move
to row, erase line). `\x1b[2K` does NOT push to scrollback.
Chat content above the chrome band stays untouched.
# Verified empirically in real Terminal.app
Test sequence:
1. Start hermes (170 cols)
2. Send message "Tell me a 4 sentence story about a cat"
3. While agent is streaming, shrink to 98 cols
4. Widen back to 170 cols
Result after this fix:
✅ User's message still visible
✅ "Initializing agent..." still visible
✅ Full agent response still visible (the cat story)
✅ Status bar at bottom, no duplicates
✅ Banner preserved in scrollback above
✅ Zero scrollback pollution (delta = 0 across 2 resizes)
# Verified empirically in real Terminal.app with real shell scrollback above
After 6 column shrinks:
✅ ZERO status bars accumulated in scrollback (delta = 0)
✅ Status bar correctly anchored at bottom of viewport
✅ No visible duplicate chrome
✅ Chat responses display correctly after fix
✅ Layout matches normal hermes UX
# Root cause (verified by reading prompt_toolkit/renderer.py source)
pt's `_output_screen_diff` (renderer.py:106) emits `write("\r\n" * N)` to
advance the cursor between rows during paint. At the bottom row of the
terminal, each `\r\n` SCROLLS the viewport, pushing content into terminal
scrollback. pt does this *deliberately* — see line 232-242 comment:
"Move the cursor once to the bottom of the output. That way, we're sure
that the terminal scrolls up". This is the actual mechanism behind pt
issues #29 (open since 2014), #1675, #1933. aider/xonsh/ipython all hit
this wall and gave up; nobody on GitHub has shipped a fix.
# The fix
DECSTBM `\x1b[<top>;<bottom>r` sets a SCROLL REGION on the terminal.
When pt's `\r\n` scrolls within the region, rows that fall off the top
of the region are DISCARDED instead of being pushed to terminal
scrollback. Region top must be > 1 — when region starts at row 1, the
terminal treats it semantically as "no region" and scrolled content
still goes to scrollback. Above row 2 it gets discarded.
Same trick used by vim's status line, tmux, weechat, htop.
Three more critical details:
1. **DECSTBM resets cursor to (1,1).** We follow it with an explicit
`\x1b[<rows>;1H` to move the cursor back to the bottom row, so pt's
render anchors the chrome at the bottom of the viewport.
2. **`\x1b[J` (erase from cursor to end of screen) does NOT push to
scrollback.** `\x1b[2J` does. So on resize we use `\x1b[J` to wipe
the old reflowed chrome WITHOUT polluting history.
3. **Skip `_schedule_resize_recovery`** — its `_status_bar_suppressed
_after_resize=True` flag hides the chrome until next user input,
which makes resize feel broken with this fix in place. Call pt's
native `_on_resize` directly instead.
# Reverts
- transcript widget (alt-screen-only path, was an earlier attempt)
- alt-screen mode (broke chat output rendering)
- HERMES_DEBUG_RESIZE / HERMES_RESIZE_STRATEGY env-var paths
When the terminal shrinks, already-printed box-drawing rules (response,
reasoning, streaming TTS, background-task Panels) reflow into multiple
narrower rows — visible as duplicated horizontal separators / ghost
lines in scrollback. Similarly, prompt_toolkit redraws a fresh status
bar on SIGWINCH on top of one the terminal just reflowed, producing
double-bar artifacts on column shrink.
Two surgical changes:
1. Decorative scrollback boxes now use a new
`HermesCLI._scrollback_box_width()` helper that clamps to
`max(32, min(width, 56))`. The live TUI footer is unaffected and still
uses the full width. Covers: streaming response box (open + close),
reasoning box (open + close, both streaming and post-stream paths),
streaming-TTS box close, final-response Rich Panel, and the
background-task Rich Panel.
2. `_recover_after_resize()` now also sets a new
`_status_bar_suppressed_after_resize` flag so the dynamic status bar
and both input separator rules stay hidden until the next user input.
The flag is cleared in the process loop the moment the user submits
their next prompt, restoring chrome cleanly.
Tests:
- New `test_input_rules_hide_after_resize_until_next_input` covers the
flag's effect on rule heights.
- New `test_scrollback_box_width_caps_to_resize_safe_value` covers the
helper at floor / cap / mid-range / overflow.
- Existing resize-recovery test extended to assert the flag flips.
Refs: #18449#19280#22976
Salvage of #24403.
Co-authored-by: Szymonclawd <szymonclawd@mac.home>
The spinner already shows tool activity visually; the 1.2 kHz tone on
every tool.started event was unwanted noise (especially on WSL2, where
each beep also triggers Windows Terminal's bell notification).
Removed the play_beep call in _on_tool_progress entirely. Record
start/stop beeps (gated by voice.beep_enabled) are unaffected.
When codex app-server fails outside the OAuth-classified path
(non-auth turn/start errors, plain TimeoutErrors, generic turn-ended
status, subprocess silently exits, hard deadline timeout), the user
got a bare 'Internal error' / 'turn/start failed: ...' with no
context. Diagnosing config/provider/auth-bridge issues forced a
re-run with verbose codex flags.
Add a _format_error_with_stderr helper that appends the last few
stderr lines via agent.redact.redact_sensitive_text(force=True),
and use it at every catch-all error site:
- ensure_started() failures (codex init / thread/start) now return
a TurnResult.error with should_retire=True instead of bubbling
- non-OAuth turn/start CodexAppServerError / TimeoutError
- subprocess-died branch (previously dumped raw stderr_blob[-300:]
with no redaction — a leak risk)
- turn ended with non-completed status
- hard turn-timeout deadline
OAuth-classified failures and the post-tool quiet watchdog already
produce clean hints and stay unchanged. The redactor catches sk-*,
gh*_*, Authorization: Bearer, query-string tokens, JWTs, private
keys, etc., so provider error payloads can't leak into chat output
or trajectories.
Inspired by openclaw#80718, adapted for our app-server transport.
When the stream consumer's got_done handler successfully delivers the
final response content via _send_or_edit but the subsequent edit
(e.g. cursor removal) fails, final_response_sent remains False even
though the user has already received the final answer. The gateway's
fallback send path then re-delivers the same content, causing the
user to see the response twice on Telegram.
Introduce a new _final_content_delivered flag on the stream consumer,
set by the got_done handler when the final content has reached the
user. The _run_agent suppression logic now treats this flag as an
additional signal (alongside final_response_sent and
response_previewed) that final delivery is already complete.
This preserves the existing behavior for intermediate-text-only
streams (where already_sent=True but no final content has been
delivered) — those still receive the gateway's fallback send, matching
the test expectation in test_partial_stream_output_does_not_set_already_sent.
Adds TestFinalContentDeliveredSuppression with two cases covering
both the suppression (content delivered + edit failed) and the
non-suppression (intermediate text only) branches.
`hermes config set gateway.streaming.*` writes the streaming block
nested under a `gateway:` key in config.yaml, but the config loader
only checked for a top-level `streaming:` key — silently ignoring
the nested variant.
Fall back to `yaml_cfg['gateway']['streaming']` when the top-level
key is absent, matching the pattern already used for other nested
config sections.
Closes#25676
When the final streamed text is identical to the last plain-text edit,
stream_consumer._send_or_edit short-circuits and never calls
adapter.edit_message(finalize=True). For Telegram, this skips the
plain-text → MarkdownV2 conversion, leaving raw Markdown syntax visible
to the user.
Set REQUIRES_EDIT_FINALIZE = True on TelegramAdapter so the finalize
edit is always delivered, matching the existing DingTalk pattern.
Fixes#25710
WhatsApp pseudo-chats (Status updates / Stories, Channels / Newsletters,
broadcast lists) were being routed through the full agent pipeline. A
user's gateway.log showed the agent replying to a contact's Story
('status@broadcast') with 345 chars plus title-generation cost, which
also shows up in the contact's status feed.
Drop these JIDs at _should_process_message() before the policy gate so
they're filtered regardless of dm_policy or allowlist state. Covers:
- status@broadcast (Stories)
- *@newsletter (Channels)
- *@broadcast (broadcast lists, future-proofing)
The bridge.js already filters these on the fromMe outbound path, but
inbound events on self-chat mode skipped that check.
Tests:
- status@broadcast dropped on open policy
- broadcast filter wins over allowlisted senders
- real DMs still pass through
- helper unit cases (case-insensitive, whitespace-tolerant)
26/26 tests/gateway/test_whatsapp_group_gating.py pass; 59/59 adjacent
WhatsApp test suites pass.
Adds references/template-integrity.md covering safe conversion of the
official comfyui-workflow-templates package from editor format to API
format — Reroute bypass via link tracing, dotted dynamic-input keys
(values.a, resize_type.width) that must NOT be flattened, server-error
"patch don't rebuild" loop, Cloud quirks (302 redirect to signed GCS
URL, free-tier 1 concurrent job, 1920x1080 OOM on RTX 5090), and a
Discord-compatible ffmpeg stitch recipe (yuv420p + xfade/acrossfade).
SKILL.md lists the new reference so the agent loads it when starting
from an official template. purzbeats added to author list and to
scripts/release.py AUTHOR_MAP.
Co-authored-by: purzbeats <97489706+purzbeats@users.noreply.github.com>
The Debian/Ubuntu branch of install_node_deps() ran 'npx playwright install
--with-deps chromium' unconditionally. Playwright invokes sudo interactively
to apt-install Chromium's system libraries, which blocks the installer for
non-sudo users (systemd service accounts, unprivileged operator users) on
an unsatisfiable password prompt.
Changes:
- install.sh: gate --with-deps behind a sudo capability check on the apt
branch (matches the existing Arch/pacman branch pattern). Non-sudo users
fall back to 'npx playwright install chromium' alone and the installer
prints the exact 'sudo npx playwright install-deps chromium' command an
administrator can run separately.
- install.sh: add --skip-browser (alias --no-playwright) to skip the
Playwright step entirely for headless installs that don't need browser
automation. Mirrors the existing --no-venv / --skip-setup shape.
- installation.md: add a 'Non-Sudo / System Service User Installs' section
covering the admin/service-user split, the --skip-browser flag, and the
~/.local/bin PATH gotcha (the root cause of the 'No module named dotenv'
error users hit when running the repo source 'hermes' script with system
Python instead of the venv launcher).
- test_install_sh_browser_install.py: regression coverage for the
--skip-browser flag and the sudo-gate on the apt branch.
Reported by @ssilver in Discord.
_make_stream_chunk built delta_kwargs with only `role`, so a reasoning-only
chunk produced a SimpleNamespace without a `.content` attribute. Downstream
consumers that read `delta.content` then raised AttributeError on Gemini 2.5
Flash, where the thinking delta arrives before any content delta.
Seed `content`, `tool_calls`, `reasoning`, and `reasoning_content` as None
up front, matching the pattern already used in gemini_native_adapter.py.
Key-present arguments still override the defaults.
Fixes#24974
References: Related open PR #24984 (luyao618) applies the same 1-line fix; this PR adds a regression test that #24984 omits
Co-Authored-By: Claude <noreply@anthropic.com>
Pyproject's [all] extra was slimmed down in May 2026 — ~20 optional
backends moved to tools/lazy_deps.py and only install on first use.
hermes update runs uv pip install -e .[all] which doesn't touch any of
them, so pin bumps in LAZY_DEPS (CVE response, transitive fixes) were
silently ignored on already-activated backends.
Two changes:
1. _is_satisfied() now parses the spec and checks the installed version
against the constraint via packaging.specifiers. Previously it
returned True the moment the package name was importable, which made
ensure() a name-presence gate rather than a version-pin gate.
2. New active_features() / refresh_active_features() pair: lists every
feature with at least one of its packages currently installed, then
re-runs ensure() on each. Refresh is invoked at the end of
_cmd_update_impl, right after the [all] install completes. Cold
backends (never activated) stay quiet — no churn for them.
Output during update is one summary block:
→ Refreshing 4 active lazy backend(s)...
↑ 1 refreshed: provider.anthropic
✓ 3 already current
or
⚠ memory.honcho failed to refresh: <pip stderr>
Failures never raise out of update — backends keep their previously-
installed version and we tell the user to rerun once upstream is fixed.
security.allow_lazy_installs=false is honored: features get marked
"skipped" with the reason shown.
Tests: 18 new unit tests covering version-aware satisfaction (exact pin,
range, extras blocks, missing package, malformed spec), active feature
discovery, and refresh status reporting. All 61 lazy_deps tests pass.
Adds regression tests pinning web search into the WhatsApp and api-server
default platform-coverage toolsets. Pure test additions, no runtime change.
Salvage of the test-addition commit from #25692 by @wesleysimplicio.
(The AUTHOR_MAP fixup commit from the same PR landed separately as
529ec85c7.)
The _foreground_background_guidance() function matched background-wrapper
keywords (nohup/disown/setsid) anywhere in the command text, including
inside quoted strings, Python -c code, commit messages, and PR body text.
Two-layer fix:
1. Strip single-quoted, double-quoted, and backtick-quoted content before
pattern matching via _strip_quotes() helper.
2. Tighten the regex to only match keywords at command-start positions
(after ^, ;, &, &&, ||, or $() — not mid-argument.
Both layers are needed: quote stripping handles the common case of keywords
in string literals, and the position-aware regex handles unquoted cases
like 'export FOO=setsid' (word boundary match, wrong position).
Fixes#20064
When the gateway spawned a background agent (e.g. for delegation), media
URLs and types from the originating message weren't forwarded — the bg
agent saw the prompt but no attached images. Vision-enabled tasks
effectively lost their inputs.
Forwards media_urls/media_types through the bg-task spawn path and
runs the same vision-enrichment step the main flow uses, so the bg
agent gets image descriptions inlined into its prompt.
Closes#25614.
Salvage of #25603 by @oxngon (manually re-applied — original branch
was severely stale against current main).
Set file mode 0600 on ~/.hermes/.env after creation in the installer and
after every write via memory_setup._write_env_vars(). This ensures only
the file owner can read/write API keys and tokens, matching standard
practice for credential files (.netrc, .aws/credentials, .ssh/config).
Fixes#25477
On WSL2 (and similar environments), time.time() is not strictly monotonic
due to NTP sync or host clock adjustments. When clock regression occurs
during a multi-tool flush, later-inserted rows get earlier timestamps,
causing ORDER BY timestamp, id to sort them before rows that were written
first. This breaks the tool_calls/tool_response adjacency invariant and
triggers HTTP 400 from the API.
Use ORDER BY id instead, since id (INTEGER PRIMARY KEY AUTOINCREMENT)
always reflects true insertion order regardless of system clock behavior.
The _approval_callback method in HermesCLI hardcoded timeout=60
instead of reading the approvals.timeout config value. This meant
the config setting was silently ignored for CLI interactive prompts.
Other approval paths (callbacks.py, tools/approval.py) already read
the config correctly — only cli.py was missed.
Pre-stages AUTHOR_MAP for 7 new contributors in the upcoming batch:
- HxT9 (#25760)
- evgyur (#25651)
- AsoTora (#25624)
- oxngon (#25603)
- yifengingit (#25589)
- vanthinh6886 (#25562)
- Arkmusn (#25559)
EthanGuo-coder, wesleysimplicio, and zccyman are already in the map.
Mirrors openclaw beta.8's app-server resilience fixes so a stuck codex
subprocess can't burn the full turn deadline and so users get a
`codex login` pointer instead of raw RPC errors when their token expires.
- TurnResult.should_retire signals the caller to drop+respawn codex.
- Deadline-hit path and dead-subprocess detection set should_retire so
the next turn doesn't ride a CPU-spinning or auth-broken process.
- Post-tool watchdog (post_tool_quiet_timeout=90s): if a tool item
completes and codex goes silent past the threshold without further
output or turn/completed, fast-fail instead of waiting the full 600s.
Resets on any non-tool activity so normal think-after-tool flows are
not affected.
- <turn_aborted> and <turn_aborted/> in agent text are treated as
terminal — some codex builds tear down a turn that way without
emitting turn/completed.
- _classify_oauth_failure() inspects RPC error message + stderr tail
for invalid_grant / token refresh / 401 / etc. and rewrites
user-facing errors to 'run codex login'. Conservative: generic
failures still surface verbatim. Fires at turn/start failure,
turn/completed failure, and dead-subprocess paths.
- thread/start cross-fill: tolerate thread.id, thread.sessionId,
top-level sessionId/threadId so future codex schema drift doesn't
KeyError us at handshake.
- run_agent.py: when run_turn returns should_retire=True OR raises,
close + null self._codex_session so the next turn respawns.
Tests: +30 cases across session + integration suites.
tests/agent/transports/test_codex_app_server_session.py 50/50 pass
tests/run_agent/test_codex_app_server_integration.py 27/27 pass
Broader codex scope (transports + cli runtime/migration) 376/376 pass
The cherry-picked PR over-indented the edit_message_text block for
the mm: (model selected → switch) success path so the confirmation
edit lived inside the preceding 'except Exception as exc' branch and
only fired when the callback raised. Dedent the try/except back to
12-space indent so it runs after the callback succeeds, restoring
the original flow that removes the inline buttons and shows the
'Switched to ...' confirmation.
Add a regression test (test_model_selected_edits_message_on_success)
that asserts edit_message_text is awaited and the result text is
routed through format_message (MARKDOWN_V2 + backtick survival).
Add phuongvm to scripts/release.py AUTHOR_MAP.
Use MarkdownV2 formatting for Telegram callback follow-ups and interactive prompts where dynamic names or user text can break legacy Markdown parsing. Add regression coverage for reload-mcp, model picker, approval callbacks, and update prompts.
* fix(cli): allow rotating broken OpenRouter / AI Gateway key in `hermes model` flow
Before: when `OPENROUTER_API_KEY` (or `AI_GATEWAY_API_KEY`) was already
set in ~/.hermes/.env, `hermes model openrouter` / `hermes model
ai-gateway` skipped the API-key prompt entirely and jumped straight to
the model picker. Users with a broken / expired / wrong key had no way
to replace it without editing ~/.hermes/.env by hand or re-running
`hermes setup` from scratch.
Both flows now route through the existing `_prompt_api_key()` helper,
which surfaces [K]eep / [R]eplace / [C]lear when a key is already
configured — the same UX the generic API-key providers (z.ai, MiniMax,
Gemini, etc.) and the Daytona setup already use.
* fix(install.ps1): pin uv sync target to venv\, verify baseline imports
Two related Windows-installer bugs that produce a broken venv with
`ModuleNotFoundError: No module named 'dotenv'` on first `hermes` run.
## Bug 1: uv sync ignores VIRTUAL_ENV, syncs into .venv\ instead of venv\
`Install-Dependencies` creates the venv at `venv\` via `uv venv venv`,
sets `$env:VIRTUAL_ENV = "$InstallDir\venv"`, then runs
`uv sync --extra all --locked`. Modern uv (>=0.5) ignores `VIRTUAL_ENV`
for the `sync` subcommand and uses the project default `.venv\`
instead. Result: deps land in `$InstallDir\.venv\`, `venv\` stays
empty except for the python.exe stub from the earlier `uv venv` call,
`hermes.exe` ends up wired to the wrong site-packages.
The bash installer (`scripts/install.sh`) already worked around this in
`install_deps()` line 1127 by passing `UV_PROJECT_ENVIRONMENT` — that
flag tells uv exactly where to put the project env regardless of
`VIRTUAL_ENV`. Port the same fix to PowerShell.
## Bug 2: no post-install verification
If the sync still misdirects for any other reason (uv version drift,
filesystem quirk, user re-run scenarios), the installer reports success
and the user only finds out by running `hermes` and getting an
unhelpful traceback. Add a baseline-import probe that runs the venv's
own python against the four packages every `hermes` invocation needs
(`dotenv`, `openai`, `rich`, `prompt_toolkit`). On failure, throw
with a recovery command tailored to whether a sibling `.venv\` exists.
User report (Windows 11, Python 3.13.5, Hermes v0.13.0): manual repro
steps were exactly this — `uv sync` landed in `.venv\`, recovered by
junctioning `venv\` → `.venv\` to bridge the path mismatch.
Before: when `OPENROUTER_API_KEY` (or `AI_GATEWAY_API_KEY`) was already
set in ~/.hermes/.env, `hermes model openrouter` / `hermes model
ai-gateway` skipped the API-key prompt entirely and jumped straight to
the model picker. Users with a broken / expired / wrong key had no way
to replace it without editing ~/.hermes/.env by hand or re-running
`hermes setup` from scratch.
Both flows now route through the existing `_prompt_api_key()` helper,
which surfaces [K]eep / [R]eplace / [C]lear when a key is already
configured — the same UX the generic API-key providers (z.ai, MiniMax,
Gemini, etc.) and the Daytona setup already use.
Brings Discord to parity with Telegram on the clarify tool's interactive
UX. Overrides BasePlatformAdapter.send_clarify on DiscordAdapter to attach
a button view when choices are present.
- ClarifyChoiceView: one discord.ui.Button per choice (max 24, Discord's
25-component view cap leaves one slot for Other) plus a final
'Other (type answer)' button.
- Numeric click -> tools.clarify_gateway.resolve_gateway_clarify(
clarify_id, choice_text) using the canonical choice text from the
gateway entry (falls back to the button label if the entry vanished).
- Other click -> tools.clarify_gateway.mark_awaiting_text(clarify_id) so
the gateway's text-intercept captures the next user message in this
session as the response.
- Auth via the shared _component_check_auth helper (same OR-semantics as
ExecApprovalView / SlashConfirmView / UpdatePromptView / ModelPickerView).
- Open-ended (no choices) path renders the prompt as a plain embed and
relies on the existing text-intercept resolution.
- Single-use: first valid click disables every button and updates the
embed footer with who answered and what they chose.
No changes to BasePlatformAdapter.send_clarify or the gateway's
clarify_callback wiring -- the existing scaffolding already drives all
adapters; Discord just inherits the default text fallback today and gains
buttons by virtue of this override.
Test conftest extended: _FakeEmbed gains add_field() / set_footer() stubs
so tests can construct embedded views without monkey-patching per-test.
Original PR: #19249 by @LeonSGP43. This is a reshape of the contributor's
work onto current main's clarify infrastructure (clarify_id + entry-based
resolution shared with Telegram, instead of a parallel on_answer-closure
mechanism). The button view structure and UX shape are preserved.
Tests: 14 new tests in tests/gateway/test_discord_clarify_buttons.py.
391/391 existing Discord gateway tests still pass.
Co-authored-by: LeonSGP43 <cine.dreamer.one@gmail.com>
setup_path() writes the user-facing hermes shim with `cat >`, which
follows existing symlinks. Older installs created
`$command_link_dir/hermes` as a symlink to `$HERMES_BIN`
(`venv/bin/hermes`), so re-running install.sh stomped the pip entry
point with a bash shim that exec'd itself in an infinite loop.
`rm -f` the link target before writing so the shim lands at
`$command_link_dir/hermes` and the venv entry point is left intact.
Adds a regression test that reproduces the symlink-stomp end-to-end
(creates the symlink, drives the real shim-write block from setup_path,
asserts the venv pip script body survives and the shim is now a regular
file). Both new assertions fail on origin/main and pass with the fix.
Closes#21454.
Follow-up to Alex-wuhu's NovitaAI provider commit. Adds:
- _pricing_cache hit/write in _fetch_novita_pricing (was missing — every
pricing fetch was re-hitting the network), mirroring the
fetch_ai_gateway_pricing pattern. force_refresh now also propagates
from get_pricing_for_provider.
- TestNovitaProvider in tests/hermes_cli/test_api_key_providers.py
covering profile load, alias resolution, registry auto-registration,
model list parity between main.py and models.py, _URL_TO_PROVIDER,
_PROVIDER_PREFIXES, context_size in _CONTEXT_LENGTH_KEYS, pricing
unit conversion, and pricing cache behavior.
- AUTHOR_MAP entry for yanglongwei06@gmail.com → @Alex-yang00.
Add NovitaAI as a first-class provider with dedicated model selection
flow, live pricing, and authoritative context length resolution.
- Register provider in PROVIDER_REGISTRY, HERMES_OVERLAYS, and all
alias/label maps (ID: novita, aliases: novita-ai, novitaai)
- Add dedicated _model_flow_novita() with 3-tier model list fallback:
Novita API → models.dev → static curated list
- Fetch live pricing from /v1/models with correct unit conversion
(input_token_price_per_m is 0.0001 USD per Mtok)
- Add Novita-specific context length resolution (step 4b) in
get_model_context_length(), prioritized over models.dev/OpenRouter
- Register api.novita.ai in _URL_TO_PROVIDER to prevent early return
from the custom-endpoint code path
- Add models.dev mapping (novita → novita-ai)
- Add default auxiliary model (deepseek/deepseek-v3-0324)
- Add NOVITA_API_KEY to test isolation (conftest.py)
- Update docs: providers page, env vars reference, CLI reference,
.env.example, README, and landing page
Background review fork redirected stdout/stderr around run_conversation()
so its iteration messages stay silent. But the memory-provider teardown
(shutdown_memory_provider() and review_agent.close()) fired in the outer
finally block AFTER the redirect_stdout context exited — so provider
teardown prints (Honcho disconnect, Hindsight sync, etc.) leaked into
the parent terminal at end of every turn.
Moves the teardown inside the redirect_stdout scope on the success path
(and nulls review_agent so the finally safety-net skips double-shutdown).
The finally block is rewritten as an exception-path safety net that
re-opens a devnull redirect, since the original 'with' context has
already exited by the time finally runs.
Salvage of #25342 by @ayushere (manually re-applied + merged conflict
with current main's set_thread_tool_whitelist wiring).
When auxiliary.compression.provider is "auto", the compression model
reuses the main model's provider and base_url. The main model's
context_length was correctly picking up custom_providers per-model
overrides (via _custom_providers stored during __init__), but the
auxiliary compression model's context-length detection path in
_check_compression_model_feasibility was not passing custom_providers,
causing it to skip step 0b and fall through to models.dev.
This meant that for providers like NVIDIA NIM where the user has a
per-model context_length in custom_providers (e.g. 196608 for
minimax-m2.7), the auxiliary model would use the models.dev value
(204800) instead of the user-configured one — a subtle discrepancy
that could lead to silent compression issues when the auxiliary model
doesn't actually support the detected context length.
Fix: pass self._custom_providers (already stored as an instance attr
during __init__) to the get_model_context_length() call for the
auxiliary compression model.
Cross-provider delegation (e.g. MiniMax parent → DeepSeek child) must not
inherit the parent's api_mode, because each provider uses a different API
surface: MiniMax uses 'anthropic_messages' while DeepSeek uses
'chat_completions'. Inheriting the wrong mode causes 404 errors.
When the effective provider differs from the parent's provider, derive
api_mode from the target provider's defaults instead (None triggers
re-derivation).
Refs: Bug #20558, PR #20563
The Feishu adapter wrapped lark-oapi's Connect() callable to inject
ping_interval/ping_timeout overrides, but made the wrapper async. The
underlying library uses Connect() as an async context manager (async
with Connect(...) as ws:), which requires the call itself to be sync
and return an AsyncContextManager — making it async meant the wrapper
was awaited eagerly and ws never bound.
Restoring the sync wrapper preserves the protocol while still injecting
the overrides.
Salvage of #25388 by @pearjelly (manually re-applied — original branch
was severely stale against current main).
- _read_process_cmdline: /proc and 'ps' are unavailable on Windows,
so process cmdline was always empty. Add psutil fallback (already
a hard dependency used by _pid_exists in the same module).
- _record_looks_like_gateway: argv paths use backslashes on Windows
but patterns use forward slashes/dots, so the fallback record check
always failed. Normalize backslashes to forward slashes before
matching.
Together these caused get_running_pid() to return None on Windows
even when the gateway process is alive, making the dashboard report
gateway as 'stopped' despite it functioning normally.
When the auxiliary client fallback chain reaches a provider that has no
credentials configured (no API key, no pool entry), the current code
just returns (None, None) which counts toward the per-call timeout
budget on the next attempt. Mark the provider unhealthy with a short
TTL so the chain advances quickly to the next viable option.
Closes#25384.
Salvage of #25395 by @AllynSheep.
Discord introduced message_snapshots for forwarded messages — text and
attachments live inside snap.content / snap.attachments rather than on
the parent message. _handle_message wasn't reading them, so forwards
showed up empty.
Defensively extracts snapshot text (when raw_content is empty) and
appends snapshot attachments to the working all_attachments list used
for type detection and media routing. hasattr/getattr guards keep this
safe on older discord.py installs without the field.
Salvage of #25462 by @1RB (manually re-applied — original branch was
stale against current main).
Xiaomi MiMo emits reasoning via OpenAI's reasoning_content field and
requires reasoning_content on every assistant tool-call message when
replaying history. Without echo-back, subsequent API calls fail with
HTTP 400 — same shape as DeepSeek and Kimi/Moonshot thinking modes.
Adds _needs_mimo_tool_reasoning() detection (provider == 'xiaomi',
'mimo' in model, or xiaomimimo.com base url) and wires it into the
_needs_thinking_reasoning_pad() check.
Salvage of #25358 by @ephron-ren (manually re-applied — original branch
was severely stale against current main).
The word "worktree" (a git subcommand feature for parallel checkouts)
was used interchangeably with "repository" in the LSP docs, causing
confusion. LSP only requires a git-initialized directory, not an actual
worktree.
Fixes two instances: section "When LSP runs" and the troubleshooting
"Editing a file outside any git repo" heading.
Previously ACP dangerous-command approvals mixed an invalid ACP
payload shape with partial Hermes option mapping, and the callback
plumbing was shared across worker threads. This commit uses ACP
tool-call updates, preserves Hermes once/session/always semantics,
and scopes approval callbacks to the current worker thread.
- Build permission requests with `update_tool_call` and unique
`perm-check-*` ids in `acp_adapter/permissions.py`
- Keep ACP option mapping explicit and fail closed on unknown outcomes
or request failures
- Set approval callbacks inside the ACP executor worker and read them
from thread-local state in `tools/terminal_tool.py`
- Replace duplicated ACP bridge coverage with focused tests in
`tests/acp/test_permissions.py` and add a thread-local callback test
The salvaged regression test called skin.get_spinner_list() which
doesn't exist on SkinConfig. Replace with direct dict access on
skin.spinner — same intent (verify default empty spinner is preserved
when user override is invalid).
* feat(goals): /subgoal — user-added criteria appended to active /goal
Layers a /subgoal command on top of the existing freeform Ralph judge
loop. The user can append extra criteria mid-loop; the judge factors
them into its done/continue verdict and the continuation prompt
surfaces them to the agent. No new tool, no agent self-judging — the
existing judge model just sees a richer prompt.
Forms:
/subgoal show current subgoals
/subgoal <text> append a criterion
/subgoal remove <n> drop subgoal n (1-based)
/subgoal clear wipe all subgoals
How it integrates:
- GoalState gains `subgoals: List[str]` (default []), backwards-compat
for existing state_meta rows.
- judge_goal accepts an optional subgoals kwarg; non-empty switches to
JUDGE_USER_PROMPT_WITH_SUBGOALS_TEMPLATE which lists them as
numbered criteria and asks 'is the goal AND every additional
criterion satisfied?'
- next_continuation_prompt picks CONTINUATION_PROMPT_WITH_SUBGOALS_TEMPLATE
when non-empty so the agent sees what to target.
- /subgoal is allowed mid-run on the gateway since it only touches the
state the judge reads at turn boundary — no race with the running
turn.
- Status line shows '... , N subgoals' when present.
Surface:
- hermes_cli/goals.py — field, prompt blocks, manager methods, judge weave
- hermes_cli/commands.py — /subgoal CommandDef
- cli.py — _handle_subgoal_command
- gateway/run.py — _handle_subgoal_command + mid-run dispatch
- tests/hermes_cli/test_goals.py — 15 new tests (backcompat, mutation,
persistence, prompt template selection, judge-prompt content via mock,
status-line rendering)
77 goal-related tests passing across goals + cli + gateway + tui.
* fix(goals): slash commands don't preempt the goal-continuation hook
Two findings from live-testing /subgoal:
1. Slash commands queued while the agent is running landed in
_pending_input (same queue as real user messages). The goal hook's
'is a real user message pending?' check returned True and silently
skipped — but the slash command consumes its queue slot via
process_command() which never re-fires the goal hook, so the loop
stalls indefinitely. Now the hook peeks the queue and only defers
when a non-slash payload is present.
2. The with-subgoals judge prompt was too soft — opus 4.7 said 'done,
implying all requirements met' without verifying. Tightened to
demand specific per-criterion evidence (file contents, output line,
command result) and explicitly reject phrases like 'implying it was
done.'
Live verified: /subgoal injected mid-loop now correctly forces the
judge to refuse done until the new criterion is met. Agent gets the
continuation prompt with subgoals listed, updates the script, judge
confirms done with specific evidence cited.
Tighten _is_png_file() to read just the 8-byte PNG magic via path.open()
+ read(8), instead of slurping the entire image into memory only to check
the prefix.
The cherry-picked tests from #6173 set HERMES_HOME outside Path.home()/.hermes,
which forces get_default_hermes_root() down its Docker branch and returns
HERMES_HOME directly — so _get_default_hermes_home() never resolves to the
~/.hermes directory the tests were trying to assert about.
Rewire both tests to use the real profile layout (HERMES_HOME pointing at
~/.hermes/profiles/<name>) so _get_default_hermes_home() resolves back to
~/.hermes and the default-profile fallback is actually exercised.
Surfaced by local E2E behavior-parity testing of PR vs origin/main: the
plugin-migrated dispatchers were quietly changing the error envelope
shape returned to function-calling models on unconfigured systems.
Two findings, both from per-result error wrapping bleeding into the
pre-flight configuration error path:
1. **search**: ``firecrawl.search()`` caught the
``ValueError("Web tools are not configured...")`` from
``_get_firecrawl_client()`` and returned it as
``{"success": False, "error": ...}``, losing the legacy
``{"error": "Error searching web: ..."}`` envelope that
``tool_error()`` emits on main. Models that special-case the
``error`` key still detect the failure, but the prefix is part of
the legacy contract some users rely on.
2. **crawl**: ``firecrawl.crawl()`` caught the same pre-flight
``ValueError`` and wrapped it as a per-page error inside
``results[0]``. Main short-circuits on ``check_firecrawl_api_key()``
BEFORE dispatching, so its unconfigured response is
``{"success": False, "error": "web_crawl requires Firecrawl..."}``
at the top level. The PR's per-page burying hid the failure inside
``results[]`` where models that check ``result.get("error")`` would
miss it.
Fix:
- ``plugins/web/firecrawl/provider.py``: pull
``_get_firecrawl_client()`` outside the broad ``try`` in
``search()``. Pre-flight ``ValueError`` / ``ImportError`` propagate
to the dispatcher's top-level exception handler. In-flight SDK
errors still get wrapped as ``{"success": False, ...}``.
- ``tools/web_tools.py``: mirror main's upstream availability gate in
``web_crawl_tool``. When the resolved crawl provider is
``is_available()==False``, short-circuit BEFORE dispatching with the
same top-level error shape main emits.
- ``tests/tools/test_web_providers.py``: 2 regression tests
(``TestUnconfiguredErrorEnvelopeParity``) lock in the behavior so
future plugin work can't undo this.
Verified via local subprocess-based parity test (14/14 scenarios match
origin/main shape exactly) and full 210/210 web test suite green.
Self-review of the plugin migration surfaced one warning and a handful of
doc/dead-code cleanups. None affect production behaviour through the main
dispatcher (which always calls `tools.web_tools._get_backend()` first and
preserves the full 7-provider walk), but direct callers of
`agent.web_search_registry.get_active_*_provider()` previously diverged
from the legacy order and could return `None` for users with credentials
but no explicit `web.backend` config key.
Changes
-------
1. `_LEGACY_PREFERENCE` was shipped as a 4-tuple
`("brave-free", "firecrawl", "searxng", "ddgs")` while the PR
description and the legacy `_get_backend()` candidate order both
call for the 7-tuple
`(firecrawl, parallel, tavily, exa, searxng, brave-free, ddgs)`.
Replaced with the 7-tuple. Verified empirically: with TAVILY+EXA keys
and no config, `get_active_search_provider()` now returns tavily
(was None); with EXA+PARALLEL it returns parallel (was None); with
BRAVE+FIRECRAWL it returns firecrawl (was brave-free).
2. `agent/web_search_registry.py` — module docstring, `_resolve` step-3
docstring, and inline comment all listed the old 4-tuple and claimed
"brave-free first because it was the shipped default". The legacy
default is `"firecrawl"`. Rewritten to match the new ordering and
reference `tools.web_tools._get_backend()` as the source of truth.
3. `agent/web_search_registry.py` — `get_active_crawl_provider`
docstring said "only Tavily implements it among built-in providers".
Firecrawl also advertises `supports_crawl=True` after the previous
commit. Updated to "Tavily and Firecrawl".
4. `plugins/web/tavily/provider.py` — module docstring said "Tavily is
the only built-in backend that natively crawls". Updated.
5. `agent/web_search_provider.py` — ABC docstring mentioned only
`search` / `extract` capabilities. Added `crawl` for accuracy.
6. `plugins/web/{firecrawl,parallel,exa}/provider.py` — dead plugin-level
cache globals (`_firecrawl_client`, `_parallel_client`,
`_async_parallel_client`, `_exa_client`) were declared but never read
(all reads/writes go through `_wt.*` per the `extracting-inline-
helpers-to-plugins` recipe). Removed the dead declarations; the
reset-for-tests helpers in firecrawl + parallel now clear the
canonical `_wt._<name>` slots, matching the pattern exa already used.
Tests
-----
218/218 web-targeted tests still pass (no test changes needed). 4910/4910
in `tests/tools/` still green.
The web-provider migration originally left firecrawl crawl as the only
provider-specific code remaining inline in tools/web_tools.py (~250
lines of Firecrawl-specific crawl orchestration that didn't fit the
plugin's existing surface). This commit closes that gap.
What this adds
--------------
1. plugins/web/firecrawl/provider.py: implement async ``crawl(url, **kwargs)``
- Accepts the same kwargs as the dispatcher passes to any crawl
provider (``instructions``, ``depth``, ``limit``); Firecrawl's
/crawl endpoint ignores ``instructions`` and ``depth`` so we log
and drop with a clear info message.
- Wraps the sync SDK ``crawl()`` call in asyncio.to_thread so the
gateway event loop isn't blocked on a multi-page crawl.
- Preserves the response-shape normalization across pydantic /
typed-object / dict variants that the legacy inline code did.
- Preserves per-page website-policy re-check (catches blocked
redirects after the SDK returns).
- Returns the same {"results": [...]} shape so the dispatcher's
shared LLM-summarization post-processing path works unchanged.
- Sets supports_crawl() to True so the dispatcher routes through
the plugin instead of the legacy fallthrough.
2. tools/web_tools.py: delete the entire legacy firecrawl crawl block
that used to run after "No registered provider supports crawl" —
~270 lines including:
- check_firecrawl_api_key gate + typed error
- inline SSRF + website-policy seed-URL gate (dispatcher already
does this)
- Firecrawl client setup with crawl_params
- 100+ lines of pydantic/dict/typed-object normalization
- Per-page LLM-processing loop (kept in the dispatcher's shared
post-processing path; that's where it always belonged)
- trimming + base64 image cleanup (still done in the dispatcher's
shared path)
Replaced with a single typed-error branch when no crawl-capable
provider is available: "web_crawl has no available backend. Set
FIRECRAWL_API_KEY (or FIRECRAWL_API_URL for self-hosted), or set
TAVILY_API_KEY for Tavily."
Test updates
------------
- tests/tools/test_website_policy.py:
- test_web_crawl_short_circuits_blocked_url: dispatcher seed-URL
gate still runs on web_tools.check_website_access (no change to
that patch), but the firecrawl client lockdown moved to the
plugin module — patch firecrawl_provider._get_firecrawl_client
instead of web_tools._get_firecrawl_client. The dispatcher
short-circuits before the plugin runs, so the test still passes.
- test_web_crawl_blocks_redirected_final_url: patch the per-page
policy gate at plugins.web.firecrawl.provider.check_website_access
(where it now runs) AND on web_tools (where the seed-URL gate
still runs). Patch firecrawl_provider._get_firecrawl_client for
the FakeCrawlClient injection. Both checks flow through the same
fake_check function.
- tests/plugins/web/test_web_search_provider_plugins.py:
- Update parametrized capability-flag spec: firecrawl supports_crawl
is now True.
- Add test_firecrawl_crawl_returns_error_dict_when_unconfigured —
verifies inspect.iscoroutinefunction(p.crawl) is True and that
the async crawl returns a per-page error dict (not a raise) when
FIRECRAWL_API_KEY is missing.
Verified
--------
- 218/218 web tests pass (was 173, +44 plugin tests + 1 new firecrawl
crawl test from this commit = 218 with the test deduplication).
- Compile-clean (py_compile passes on both files).
- Provider capabilities matrix confirmed end-to-end:
name search extract crawl async-extract? async-crawl?
firecrawl True True True True True
tavily True True True False False
Both crawl-capable providers exercise the dispatcher's
inspect.iscoroutinefunction async-or-sync detection.
Net diff
--------
- tools/web_tools.py: -254 lines (legacy inline crawl gone)
- plugins/web/firecrawl/provider.py: +185 lines (crawl method)
- test_website_policy.py: +14/-9 lines (patch locations)
- test_web_search_provider_plugins.py: +22/-1 lines (capability flag
+ new firecrawl crawl test)
- Total: -32 net LoC; tools/web_tools.py is now 1509 lines (was 1763
before this commit, 2227 before the migration started).
Adds 44 focused tests under tests/plugins/web/ covering the surface that
the PR #25182 web-provider migration introduced. Complements the
existing tests/tools/ coverage which is dispatcher-centric; this file is
plugin-centric and tests each plugin + the registry directly.
Test classes (44 tests, ~1.1s on 4 workers)
-------------------------------------------
TestBundledPluginsRegister (16 tests)
- All seven plugins present in the registry after
_ensure_plugins_discovered()
- Per-plugin parametrized capability-flag assertions
(brave-free / ddgs / searxng: search-only;
exa / parallel / firecrawl: search + extract;
tavily: search + extract + crawl)
- Every plugin exposes name + display_name properties
- Every plugin returns a picker-compatible get_setup_schema() dict
TestIsAvailable (7 tests)
- Each premium plugin reports is_available()==False when its env var is
absent and True once set (brave-free / searxng / tavily / exa /
parallel)
- firecrawl recognizes either FIRECRAWL_API_KEY or FIRECRAWL_API_URL
as a "configured" signal
- ddgs is the always-on fallback and must not raise from is_available()
TestRegistryResolution (4 tests)
- Option B semantics validated end-to-end:
1. Explicit configured provider wins even when is_available()==False
(dispatcher surfaces typed credential errors, no silent switch)
2. Unknown/typo name falls back to first available legacy-preference
provider
3. Asking for extract via a search-only backend falls back to an
extract-capable available provider (capability-incompatible
branch in _resolve())
4. No config + no credentials → None (or ddgs if installed)
TestAsyncExtractDispatch (4 tests)
- parallel + firecrawl extract() are coroutine functions (async path
in dispatcher uses await)
- exa + tavily extract() are sync (dispatcher wraps in
asyncio.to_thread)
TestErrorResponseShapes (7 tests)
- Plugins return typed error dicts (success=False + "error" key) when
credentials are missing, never raise
- async extract() returns list of per-URL error dicts
- tavily crawl() returns {"results": [{"error": ...}]} on missing
credentials
Design notes
------------
- All tests use real imports of plugin modules — no mocking of provider
classes themselves — so they catch drift in the ABC, registry, and
glue layer simultaneously. Per the hermes-agent-dev skill's E2E
testing guidance.
- The autouse _isolate_env fixture clears every web-provider env var
before each test so is_available() reflects the test's setup.
- Resolution tests use the lower-level _resolve() directly rather than
rebuilding the HERMES_HOME config dance — same observable behavior,
no sys.modules.pop side-effects that would break the ABC isinstance
check inside ctx.register_web_search_provider().
Removes the legacy in-tree provider scaffolding that PR #25182 fully
replaced with the plugin architecture:
tools/web_providers/__init__.py (6 lines)
tools/web_providers/base.py (89 lines — old ABCs)
tools/web_providers/ARCHITECTURE.md (73 lines — old design doc)
These were the staging-ground ABCs and provider modules that the
plugin migration absorbed. All seven web providers now implement the
single :class:`agent.web_search_provider.WebSearchProvider` ABC and
live under ``plugins/web/<vendor>/``. Nothing else in the tree imports
``tools.web_providers`` — verified via grep before deletion.
Test migration (tests/tools/test_web_providers.py)
--------------------------------------------------
Rewrote ``TestWebProviderABCs`` to test the new unified ABC at
:mod:`agent.web_search_provider`:
- test_cannot_instantiate_abc_directly — abstract ``name`` + ``is_available``
- test_concrete_search_only_provider_works — exercise default
``supports_extract=False`` / ``supports_crawl=False`` flags
- test_concrete_multi_capability_provider_works — exercise all three
capabilities, async extract supported (declared sync here for
simplicity; real plugins like parallel + firecrawl use async)
- test_search_only_provider_skips_extract_and_crawl — verify
``supports_*()`` flags default to False so search-only providers
don't have to implement extract() or crawl()
The 9 other tests in the file (per-capability backend selection,
DEFAULT_CONFIG merge, dispatcher routing) test public helpers in
``tools.web_tools`` that still exist and pass unchanged.
agent/web_search_provider.py docstring updated to reflect that the
legacy ABCs no longer exist; the response-shape contract is preserved
bit-for-bit so external consumers see no behavioral change.
Net diff
--------
- tools/web_providers/ removed (-168 lines)
- tests/tools/test_web_providers.py rewritten ABC section (+78/-30 net,
same coverage, new API)
- agent/web_search_provider.py docstring (-3/+5 lines)
Verified
--------
- 173/173 targeted web tests pass
- 12/12 ABC contract tests pass with the new interface
- No remaining grep hits for ``tools.web_providers`` outside of
intentional historical references in plugin docstrings.
Removes the seven hardcoded TOOL_CATEGORIES["web"] provider rows that
duplicated the plugin-registered providers, and deletes the
_WEB_PLUGIN_SKIPLIST that existed to prevent duplicate picker rows
during the migration. The Web Search & Extract category now derives its
provider rows entirely from agent.web_search_registry via
_plugin_web_search_providers(), matching how Spotify, Google Meet, and
the image_gen plugins are surfaced.
Removed (deduplicated against plugin schemas):
- Firecrawl Cloud → plugins.web.firecrawl
- Exa → plugins.web.exa
- Parallel → plugins.web.parallel
- Tavily → plugins.web.tavily
- SearXNG → plugins.web.searxng
- Brave Search (Free Tier) → plugins.web.brave_free
- DuckDuckGo (ddgs) → plugins.web.ddgs (post_setup hook preserved)
Retained in TOOL_CATEGORIES["web"]:
- Nous Subscription — requires requires_nous_auth +
managed_nous_feature + override_env_vars
to drive the managed-gateway UX. Not a
provider — a different *setup flow* for the
firecrawl backend.
- Firecrawl Self-Hosted — points firecrawl at a private Docker URL
via FIRECRAWL_API_URL only. Same reason:
UX setup-flow row, not a provider.
These two rows describe alternative auth/billing paths for the
firecrawl backend; they intentionally share web_backend="firecrawl"
with the plugin row but light up different env-var prompts.
Plugin schema extensions
------------------------
- ddgs plugin's get_setup_schema() now emits `post_setup: "ddgs"` so
selection still triggers the pip-install hook in _run_post_setup().
- _plugin_web_search_providers() passes `post_setup` through verbatim
when present in the schema (other future plugins like camofox / a
hypothetical playwright-web plugin can opt in the same way).
- Picker rows now carry both `web_backend` (legacy field consumed by
setup + selection helpers) and `web_search_plugin_name`
(informational marker), so behavior is identical between hardcoded
and plugin-registered rows.
Net diff
--------
- hermes_cli/tools_config.py: -141/+50 lines (~91 lines net)
- plugins/web/ddgs/provider.py: +7/-4 (post_setup field + badge polish)
Verified
--------
- Compile-clean for both files
- Picker shows: 2 hardcoded rows (Nous Subscription, Firecrawl
Self-Hosted) + 7 plugin rows (alphabetically: Brave Search,
DuckDuckGo, Exa, Firecrawl, Parallel, SearXNG, Tavily). DuckDuckGo
row carries post_setup="ddgs" for first-time install.
- 173 web-specific tests still pass.
Removes ~580 lines of dead code from tools/web_tools.py that were
superseded by the plugin migration but kept around in the cutover commit
to keep the diff focused. Replaces them with thin re-export shims so
existing tests and external callers that reach for the legacy
``tools.web_tools.<name>`` paths continue to work transparently.
Deleted from tools/web_tools.py
--------------------------------
- Lazy Firecrawl SDK proxy (_load_firecrawl_cls, _FirecrawlProxy,
_FIRECRAWL_CLS_CACHE, the Firecrawl singleton)
- Firecrawl client section (_get_direct_firecrawl_config,
_get_firecrawl_gateway_url, _is_tool_gateway_ready,
_has_direct_firecrawl_config, _raise_web_backend_configuration_error,
_firecrawl_backend_help_suffix, _get_firecrawl_client)
- Parallel client section (_get_parallel_client,
_get_async_parallel_client, _parallel_client, _async_parallel_client)
- Tavily client section (_TAVILY_BASE_URL, _tavily_request,
_normalize_tavily_search_results, _normalize_tavily_documents)
- Generic SDK normalizers (_to_plain_object, _normalize_result_list,
_extract_web_search_results, _extract_scrape_payload)
- Exa client section (_get_exa_client, _exa_client, _exa_search,
_exa_extract)
- Parallel helpers (_parallel_search, _parallel_extract)
- Duplicate inline check_firecrawl_api_key
Net: tools/web_tools.py drops from 2227 → 1613 lines (-614 lines).
Re-exports added at top of tools/web_tools.py
---------------------------------------------
- From plugins.web.firecrawl.provider:
Firecrawl, _FirecrawlProxy, _FIRECRAWL_CLS_CACHE, _load_firecrawl_cls,
_get_direct_firecrawl_config, _get_firecrawl_gateway_url,
_is_tool_gateway_ready, _has_direct_firecrawl_config,
_firecrawl_backend_help_suffix, _raise_web_backend_configuration_error,
_get_firecrawl_client, _to_plain_object, _normalize_result_list,
_extract_web_search_results, _extract_scrape_payload,
check_firecrawl_api_key
- From plugins.web.tavily.provider:
_tavily_request, _normalize_tavily_search_results,
_normalize_tavily_documents
- From plugins.web.parallel.provider:
_get_parallel_client, _get_async_parallel_client
- From plugins.web.exa.provider:
_get_exa_client
Plus retained module-level imports for backward-compat with tests:
- httpx (tests patch tools.web_tools.httpx for tavily request mocking)
- build_vendor_gateway_url, _read_nous_access_token,
resolve_managed_tool_gateway, managed_nous_tools_enabled,
prefers_gateway (tests patch tools.web_tools.<name>)
Plugin indirection pattern (key technique)
------------------------------------------
For functions inside the firecrawl/parallel/exa plugins to honor
unit-test patches that target ``tools.web_tools.<name>``, the plugin
implementations now do ``import tools.web_tools as _wt`` at call time
and read helper names through that module (``_wt._read_nous_access_token``,
``_wt.Firecrawl``, ``_wt.prefers_gateway``, etc.). This makes the
existing test patches transparently reach the plugin code without any
test changes.
The cached client globals (_firecrawl_client, _firecrawl_client_config,
_parallel_client, _async_parallel_client, _exa_client) also now live on
tools.web_tools so existing test setup_method handlers that reset
``tools.web_tools._<vendor>_client = None`` between cases keep working.
The plugins read/write the cache via getattr/setattr on the web_tools
module.
Verified
--------
- 173/173 targeted web tests pass:
test_web_providers.py, test_web_providers_brave_free.py,
test_web_providers_ddgs.py, test_web_providers_searxng.py,
test_web_tools_config.py, test_web_tools_tavily.py,
test_website_policy.py, test_config_null_guard.py
- Compile-clean (py_compile.compile passes)
- All inline implementations now exist in exactly one place
(plugins.web.<vendor>.provider)
Follow-up clean-up
------------------
- Drop _WEB_PLUGIN_SKIPLIST + hardcoded TOOL_CATEGORIES["web"] rows
(next commit)
- Delete tools/web_providers/ directory entirely
- Add tests/plugins/web/ coverage
- Full tests/tools/ + tests/gateway/ regression sweep before promoting PR
Two regressions discovered by running the full tests/tools/ suite after
the dispatcher cutover, both fixed in this commit:
1. web_crawl_tool incorrectly errored "search-only" for firecrawl
---------------------------------------------------------------------
The cutover treated any provider with supports_crawl()==False as a
search-only backend and returned the typed search-only error. But
firecrawl can crawl via the legacy multi-page-extract path inside
web_crawl_tool — it just doesn't expose supports_crawl on the plugin
(adding native firecrawl crawl is a clean follow-up).
Fix: only emit the search-only error when the provider supports
NEITHER crawl NOR extract (brave-free / ddgs / searxng). When the
provider supports extract but not crawl (firecrawl), fall through to
the legacy firecrawl-via-extract path below.
2. firecrawl plugin's check_website_access wasn't patchable
---------------------------------------------------------------------
The plugin imported `from tools.website_policy import check_website_access`
INSIDE the extract() function body, so monkeypatching the name on
plugins.web.firecrawl.provider had no effect — the inner import re-bound
the name on every call.
Fix: hoist the import to module level. Cheap (website_policy itself
has no heavy deps) and makes the standard
monkeypatch.setattr(firecrawl_provider, "check_website_access", ...)
pattern work.
Test updates (tests/tools/test_website_policy.py — 4 tests):
- test_web_extract_short_circuits_blocked_url
- test_web_extract_blocks_redirected_final_url
Both: patch the gate at plugins.web.firecrawl.provider (where it
runs after migration) and force the firecrawl plugin to be the
active extract provider via FIRECRAWL_API_KEY.
- test_web_crawl_short_circuits_blocked_url
- test_web_crawl_blocks_redirected_final_url
Both: unchanged — the dispatcher-level gate at tools.web_tools.py
line 1651 still uses the imported `check_website_access` name and
the firecrawl-fallthrough path is exercised as before.
Verified: 22/22 tests/tools/test_website_policy.py pass.
Cuts over web_search_tool, web_extract_tool, and web_crawl_tool in
tools/web_tools.py to dispatch through agent.web_search_registry
instead of the legacy hardcoded if-elif backend chains.
Per-tool changes:
web_search_tool (sync)
Replace 5 backend branches (parallel, exa, registry-3-providers,
tavily, firecrawl-fallthrough) with a single registry path:
1. _get_search_backend() resolves the configured name
2. _wsp_get_provider(name) for explicit-config-wins semantics
3. get_active_search_provider() fallback for typo / unknown name
4. provider.search(query, limit) — sync for all 7 providers
web_extract_tool (async)
Replace 4 backend branches (parallel-async, exa-sync, tavily-sync,
search-only-error, firecrawl-perurl-loop) with:
1. Same provider resolution as search.
2. When configured backend IS registered but doesn't support
extract (search-only providers like brave-free), surface a
typed "search-only" error matching the legacy text — tests
assert that wording.
3. inspect.iscoroutinefunction(provider.extract) detects sync vs
async: parallel + firecrawl are async; exa + tavily are sync.
Sync extracts run in asyncio.to_thread() so we don't block.
web_crawl_tool (async)
Replace tavily-specific branch + search-only-error block with:
1. _wsp_get_provider(backend) — explicit config first
2. Search-only typed error when the configured name doesn't
support crawl (matches legacy phrasing)
3. get_active_crawl_provider() fallback otherwise
4. provider.crawl(url, **kwargs) — async-or-sync dispatch as above
5. Response post-processing (LLM summarization, trimming) stays
unchanged — it's not provider-specific.
When no plugin advertises supports_crawl, falls through to the
existing Firecrawl-via-web-summarize path below (unchanged).
Test updates (2 tests in tests/tools/test_web_tools_config.py):
- test_web_search_clamps_limit_before_backend_call:
patch("tools.web_tools._parallel_search") -> patch the registry
provider returned by agent.web_search_registry.get_provider
- test_search_error_response_does_not_expose_diagnostics:
patch("tools.web_tools._get_firecrawl_client") -> same pattern
Tests unchanged (still pass):
- All TestXBackendWiring classes (test _get_backend / _is_backend_available
config-resolution, independent of dispatch)
- All TestXSearchOnlyErrors classes (test the search-only error path
via web_extract_tool / web_crawl_tool — error text preserved)
- 141 passing web tests total, 0 regressions.
Dead-code cleanup deferred to a follow-up commit so this diff stays
focused on the cutover. After this commit:
- tools.web_tools._exa_search / _exa_extract / _parallel_search /
_parallel_extract / _tavily_request / _normalize_tavily_* /
_get_firecrawl_client / _extract_web_search_results /
_extract_scrape_payload / _to_plain_object / _normalize_result_list
are no longer called by the dispatchers, but still exist.
- The config-resolution layer (_get_backend, _is_backend_available,
_is_tool_gateway_ready, _has_direct_firecrawl_config) IS still in
use and must stay.
- The Firecrawl proxy and check_firecrawl_api_key are still imported
by integration tests and patched by unit tests — must stay (or be
re-exported from the plugin).
Migrates Firecrawl from inline code in tools/web_tools.py to a bundled
plugin at plugins/web/firecrawl/. By line count this is the largest of
the seven provider migrations: the firecrawl path captured most of the
file's vendor-specific complexity.
What moved into the plugin (all previously in tools/web_tools.py):
Lazy Firecrawl SDK proxy
- _load_firecrawl_cls() — caches the imported SDK class
- _FirecrawlProxy + Firecrawl singleton — defers ~200ms of SDK
imports until first construction or isinstance check.
Client construction (dual auth)
- _get_direct_firecrawl_config() — direct FIRECRAWL_API_KEY/URL path
- _get_firecrawl_gateway_url() — managed Nous tool-gateway URL
- _is_tool_gateway_ready() — gateway URL + Nous token check
- _has_direct_firecrawl_config() — direct config present?
- _get_firecrawl_client() — combined client construction
honoring web.use_gateway
- check_firecrawl_api_key() — top-level "is firecrawl usable"
- _firecrawl_backend_help_suffix() — managed-gateway help string
- _raise_web_backend_configuration_error() — typed misconfig error
Response shape normalization (vendor-specific)
- _to_plain_object(), _normalize_result_list() — SDK→dict helpers
- _extract_web_search_results() — handles SDK/direct/gateway shapes
- _extract_scrape_payload() — nested-data unwrap for scrape
Per-URL extract loop
- 60s asyncio.wait_for timeout per URL
- Pre-scrape website-policy gate
- Post-scrape redirect-aware SSRF re-check
- Format-aware content selection (markdown / html / auto)
- Per-URL errors returned as {"error": str} entries, no raises
Extract is declared `async def` — each URL is scraped in
asyncio.to_thread(...). This is the second async-extract plugin after
parallel.
The plugin re-exports `Firecrawl` (the lazy proxy) and
`check_firecrawl_api_key()` so existing tests doing
`patch("tools.web_tools.Firecrawl")` or
`monkeypatch.setattr(web_tools, "check_firecrawl_api_key", ...)` keep
working — tools/web_tools.py re-exports both names in the next
dispatcher-cutover commit.
Note: web_crawl_tool still has its own Firecrawl crawl path inline
(separate from extract); the Firecrawl SDK supports /crawl but we don't
expose supports_crawl=True on this plugin yet. Tavily handles crawl
today. Adding Firecrawl crawl is a clean follow-up.
Adds "firecrawl" to _WEB_PLUGIN_SKIPLIST.
E2E verified:
- All 7 providers register: brave-free, ddgs, exa, firecrawl,
parallel, searxng, tavily
- inspect.iscoroutinefunction(firecrawl.extract) -> True
- Firecrawl proxy is a callable lazy proxy at module level
- check_firecrawl_api_key reflects FIRECRAWL_API_KEY presence
Migrates Tavily from inline _tavily_request() / _normalize_tavily_*
helpers in tools/web_tools.py to a bundled plugin at plugins/web/tavily/.
First plugin in the codebase to advertise supports_crawl=True. Tavily is
unique among built-in backends in offering a native /crawl endpoint that
walks linked pages from a seed URL with optional natural-language
instructions and depth ("basic" or "advanced").
Capabilities:
- supports_search() -> True (Tavily /search)
- supports_extract() -> True (Tavily /extract)
- supports_crawl() -> True (Tavily /crawl)
All sync (httpx.post under the hood).
The crawl method accepts forward-compat kwargs (instructions, depth,
limit) and is gated against unsafe URLs/policy by the dispatcher in
web_crawl_tool — exactly as before.
Behavior preserved:
- TAVILY_API_KEY required (ValueError → typed error response)
- TAVILY_BASE_URL env override honored
- /crawl requires both body auth AND Bearer header — preserved
- failed_results[] and failed_urls[] response keys mapped to per-URL
items with error fields rather than raising
- max_results capped at 20 server-side
Adds "tavily" to _WEB_PLUGIN_SKIPLIST.
The legacy inline _tavily_request / _normalize_tavily_search_results /
_normalize_tavily_documents / _TAVILY_BASE_URL in tools/web_tools.py are
NOT deleted yet — search/extract dispatch and the entire web_crawl_tool
function still reference them. They go away when those dispatchers are
cut over to the registry.
E2E verified:
- Tavily registers with all 3 capabilities
- Provider list now: brave-free, ddgs, exa, parallel, searxng, tavily
Migrates Parallel.ai from inline `_parallel_search()` / `_parallel_extract()`
in tools/web_tools.py to a bundled plugin at plugins/web/parallel/.
First plugin in the codebase to expose an async :meth:`extract`:
- search() is sync — Parallel.beta.search
- extract() is **async def** — AsyncParallel.beta.extract
The ABC's docstring on supports_extract() already permits sync-or-async;
this commit is the first to exercise the async path. The web_extract_tool
dispatcher (next commit) detects coroutines via
inspect.iscoroutinefunction and awaits accordingly.
Behavior preserved:
- PARALLEL_API_KEY required (raises ValueError if missing → surfaced
as {"success": False, "error": "..."} instead)
- PARALLEL_SEARCH_MODE env var honored (agentic|fast|one-shot, default
agentic), validated via _resolve_search_mode()
- Limit capped at 20 server-side via min(limit, 20)
- Per-URL failure mode preserved: response.errors[] each become a
result dict with an "error" field rather than raising
- Module-level _parallel_client / _async_parallel_client caches kept
(mirrors legacy singleton pattern)
Adds "parallel" to _WEB_PLUGIN_SKIPLIST in hermes_cli/tools_config.py so
the picker doesn't double-list.
The legacy inline _parallel_search, _parallel_extract, _get_parallel_client,
_get_async_parallel_client in tools/web_tools.py are NOT deleted yet — the
dispatcher still calls them. They go away when the dispatcher cuts over.
E2E verified:
- inspect.iscoroutinefunction(p.search) -> False
- inspect.iscoroutinefunction(p.extract) -> True
- extract() returns a coroutine (not a list)
- 5 providers register correctly (brave-free, ddgs, exa, parallel, searxng)
Migrates Exa from the inline `_exa_search()` / `_exa_extract()` helpers in
tools/web_tools.py to a bundled plugin at plugins/web/exa/.
This is the first plugin in this PR to advertise supports_extract=True,
exercising the multi-capability ABC path that the initial three migrations
(brave_free, ddgs, searxng — all search-only) did not cover.
Both Exa methods are sync — the SDK is sync-only. The web_extract_tool
dispatcher in tools/web_tools.py will continue to call them inline until
Task "dispatch-extract-all" cuts it over to the registry.
Behaviour preserved bit-for-bit aside from the ABC method-name change:
- is_configured() -> is_available()
- provider_name() -> name (property)
- "exa" stays as the registered name
- Module-level `_exa_client` cache + lazy `from exa_py import Exa`
preserved at the new location.
- Errors (ValueError for missing API key, ImportError for missing SDK,
generic Exception) caught and surfaced as {"success": False, "error": ...}
instead of raising.
Adds "exa" to _WEB_PLUGIN_SKIPLIST in hermes_cli/tools_config.py so the
hardcoded TOOL_CATEGORIES["web"] row and the plugin-injected row don't
duplicate during the spike. The skip-list goes away in the cleanup phase
along with the hardcoded row.
The legacy inline `_exa_search` / `_exa_extract` / `_get_exa_client` /
`_exa_client` in tools/web_tools.py are NOT deleted yet — the dispatcher
still references them. They go away in the next dispatcher-cutover commit.
E2E verified:
- Plugin discovers + registers
- .supports_search/.supports_extract/.supports_crawl = (True, True, False)
- .get_setup_schema() returns the picker row shape
- resolve(): explicit exa + EXA_API_KEY -> exa; without key -> exa (registered
but unavailable, dispatcher surfaces "EXA_API_KEY not set" error)
Two ABC additions to cover the surface area of the remaining four
providers (exa, parallel, tavily, firecrawl) which were untouched by the
initial spike:
1. supports_crawl() + crawl() — Tavily natively crawls a seed URL via
its /crawl endpoint. Exposing supports_crawl=True lets the crawl
tool's dispatcher route to Tavily when configured, falling back to
the auxiliary-model summarization path otherwise. Firecrawl could
add this in a follow-up (the SDK supports it; we just don't surface
it as a tool today).
2. Async-or-sync extract() — Parallel's SDK is natively async
(AsyncParallel.beta.extract); Exa and Tavily are sync; Firecrawl is
sync but called inside asyncio.to_thread() with a 60s timeout. The
ABC docstring now permits either shape: implementations declare
their own sync/async signature and the dispatcher uses
inspect.iscoroutinefunction to detect and await.
Also adds get_active_crawl_provider() to web_search_registry mirroring
the search/extract resolvers, with web.crawl_backend as the explicit
override config key.
No behavior change on its own — these are scaffolds for the four
remaining provider migrations.
Both web_search_registry._resolve() and image_gen_registry.get_active_provider()
walked their registered providers and returned the first one matching the
capability flag — without checking whether that provider was actually
usable. On a fresh install with no credentials at all, this meant
get_active_search_provider() returned `brave-free` (legacy preference
order) even though BRAVE_SEARCH_API_KEY was unset, leading the
dispatcher to surface a "BRAVE_SEARCH_API_KEY is not set" error for a
provider the user never chose. Same bug shape in image_gen for FAL.
Resolution semantics now match tools.web_tools._get_backend():
1. Explicit config name wins, ignoring is_available() — the dispatcher
surfaces a precise "X_API_KEY is not set" error rather than silently
switching backends. Matches user expectation: "I configured X, tell
me what's wrong with X."
2. Fallback (no explicit config) walks the legacy preference order
filtered by is_available() — pick the highest-priority backend the
user actually has credentials for.
is_available() is wrapped in a try/except so a buggy provider doesn't
brick resolution.
E2E verified:
- No creds + no config: get_active_search_provider() -> None
- Explicit brave-free + no key: get_active_search_provider() -> brave-free
(and .is_available() correctly reports False)
This fix was identified during the spike (#25182 finding #1) and is
fold-in to the same PR rather than a follow-up.
Deletes tools/web_providers/{brave_free,ddgs,searxng}.py — the three
providers that moved to plugins/web/ in prior commits. tools/web_tools.py
no longer imports them (registry dispatch as of d8735963f), so removing
them is purely a cleanup pass.
Also migrates the existing tests to the new import paths:
tests/tools/test_web_providers_brave_free.py
tests/tools/test_web_providers_ddgs.py
tests/tools/test_web_providers_searxng.py
Mechanical rewrites:
- `from tools.web_providers.X import YSearchProvider`
-> `from plugins.web.X.provider import YWebSearchProvider`
- `.is_configured()` -> `.is_available()` (legacy method -> new method)
- `.provider_name()` -> `.name` (legacy method -> new property)
- `from tools.web_providers.base import WebSearchProvider`
-> `from agent.web_search_provider import WebSearchProvider`
(the subclass-check asserts membership in the new plugin-facing ABC)
- `sys.modules.delitem("tools.web_providers.ddgs")` updated to point at
`plugins.web.ddgs.provider` (cache-busting for lazy ddgs imports)
The TestXBackendWiring / TestXSearchOnlyErrors classes (covering
_is_backend_available, _get_backend, check_web_api_key, and the
"search-only" error paths in web_extract/web_crawl) are untouched —
those still test web_tools.py's backend-selection logic, which continues
to recognize the names "brave-free" / "ddgs" / "searxng" even after the
modules behind them moved to plugins.
tools/web_providers/base.py is intentionally NOT deleted by this commit
— it's the parent ABC of the legacy modules and shares its name with
agent/web_search_provider.py::WebSearchProvider. Removing it surfaces the
naming collision (see PR description Finding 0); the real migration PR
deletes it in the same commit that drops the _WEB_PLUGIN_SKIPLIST
guards in hermes_cli/tools_config.py.
Test results:
bash scripts/run_tests.sh tests/tools/test_web_providers_*.py
-> 65 passed in 3.41s (all rewritten unit tests + unchanged integration tests)
bash scripts/run_tests.sh tests/tools/test_web_*.py
-> 141 passed in 4.70s (full web test set, post-deletion)
Adds _plugin_web_search_providers() and wires it into _visible_providers()
for the "Web Search & Extract" category. Mirrors the existing image_gen
pattern at the same site exactly.
Spike scope: while the three migrated providers (brave-free, ddgs, searxng)
still have hardcoded TOOL_CATEGORIES rows, _WEB_PLUGIN_SKIPLIST excludes
them so the picker doesn't show duplicates. The migration PR drops the
hardcoded rows and the skip-list both — then this helper is the only
source of web-provider picker rows.
E2E verified: helper returns [] today (skip-list covers all 3 migrated
providers); injection point is sound and ready for the post-migration state.
The three migrated providers (brave-free, ddgs, searxng) are now dispatched
through agent.web_search_registry.get_provider() instead of importing
their concrete classes directly. The four inline providers (parallel, exa,
tavily, firecrawl) keep their existing branches — they live in
tools/web_tools.py itself and aren't part of this spike's plugin extraction.
The legacy tools/web_providers/{brave_free,ddgs,searxng}.py modules are
still in place (untouched by this commit) — Task 10 deletes them once the
real migration PR is ready. Keeping them alive during the spike means
revertibility is trivial.
E2E verified:
1. Plugin discovery registers ['brave-free','ddgs','searxng']
2. Config web.search_backend: brave-free resolves to the plugin instance
3. Dispatch result matches the original {success, data.web[]} contract
4. compile OK; no new LSP errors beyond pre-existing ones in web_tools.py
Adds plugins/web/searxng/. SearXNG aggregates results from upstream engines
via its JSON API (/search?format=json) — search-only, no extract capability
(supports_extract() returns False).
E2E verified — registry now has ['brave-free', 'ddgs', 'searxng'].
Adds plugins/web/ddgs/ following the same plugins/image_gen/ pattern as
brave_free. DuckDuckGo search via the community ddgs package; no API key,
package is an optional dep gated by is_available().
E2E verified — registry now has ['brave-free', 'ddgs'].
Adds plugins/web/brave_free/ as the first plugin built against the new
WebSearchProvider ABC. Mirrors the plugins/image_gen/openai/ layout exactly:
plugins/web/brave_free/
plugin.yaml kind: backend, provides_web_providers: [brave-free]
__init__.py register(ctx) -> ctx.register_web_search_provider(...)
provider.py BraveFreeWebSearchProvider(WebSearchProvider)
Behavior preserved: same name ("brave-free" with hyphen), same env var
(BRAVE_SEARCH_API_KEY), same HTTP request shape, same response normalization.
The legacy tools/web_providers/brave_free.py is left in place — the
dispatcher in tools/web_tools.py still references it. Task 7 cuts over the
dispatcher to the new registry; Task 10 deletes the legacy file.
E2E verified:
HERMES_PLUGINS_DEBUG=1 python -c "
from hermes_cli.plugins import _ensure_plugins_discovered
_ensure_plugins_discovered()
from agent.web_search_registry import list_providers
print([p.name for p in list_providers()])
"
# -> ['brave-free']
The interactive CLI /model picker was the third call-site duplicating
the inline config-slice + list_authenticated_providers pattern that
PR #23666 consolidated for the dashboard and TUI. Route it through
load_picker_context() + build_models_payload() too so all surfaces
that show authenticated providers share one substrate.
Side effect: cli.py now also benefits from the latent v12+ keyed
providers fix (custom_providers populated via
get_compatible_custom_providers, not cfg.get raw).
The aux-task switcher (hermes_cli/main.py) and gateway model
switcher (gateway/run.py) deliberately stay on the legacy path —
they use different config sections (auxiliary.<task>.*) and a
different config loader (_load_gateway_config) respectively, so
forcing them through ConfigContext would either overload its
semantics or grow the module past the clean refactor scope.
Three call-sites in the codebase each duplicated the same config-slice
+ list_authenticated_providers + post-processing pattern:
- hermes_cli/web_server.py /api/model/options
- tui_gateway/server.py model.options JSON-RPC
- tui_gateway/server.py model.save_key JSON-RPC
This consolidates them onto hermes_cli/inventory.py:
load_picker_context() -> ConfigContext
Replaces the 17-LOC config-slice (model.{default,name,provider,
base_url}, providers:, custom_providers:) every consumer did
inline.
ConfigContext.with_overrides(*, current_provider=, current_model=,
current_base_url=) -> ConfigContext
Truthy-only overlay for TUI agent-session state on top of disk
config. Empty getattr(agent, ...) attrs MUST NOT clobber disk.
build_models_payload(ctx, *, include_unconfigured, picker_hints,
canonical_order, max_models) -> dict
Single payload builder. Delegates curation to
list_authenticated_providers (does not call provider_model_ids
per row \u2014 that pulls non-agentic models). picker_hints +
canonical_order produce the TUI ModelPickerDialog shape;
defaults match the dashboard's existing /api/model/options
contract.
Two latent bugs fixed by consolidation:
1. The dashboard read cfg.get('custom_providers') directly, missing
the v12+ keyed providers: form. Now both surfaces go through
get_compatible_custom_providers().
2. The TUI's canonical-merge keyed on is_user_defined to decide order.
Section 3 of list_authenticated_providers sets is_user_defined=True
on rows from the providers: config dict even when the slug is
canonical \u2014 that silently demoted them to the picker tail.
_reorder_canonical now keys on slug membership instead.
Stats: +666 / -145 (net +521). Module 240 LOC; 18 behavior tests.
This PR replaces the rejected #23369 (which bundled the consolidation
with new scriptable CLI surfaces \u2014 hermes models list/status, hermes
providers list \u2014 and a JSON contract that have no external user
demand). Just the refactor; the CLI surface is deferred to a separate
PR gated on actual demand.
Refs #23359.
Follow-up on the salvaged feat commit:
- Keep the constructor / config / yaml-example default at 3 so existing
gateway and CLI users see no behavioural change. PR #13754 (which this
builds on) had lowered the default to 2 to chase pre-feature parity in
the system-prompt-present case, at the cost of quietly halving the
protected head for the gateway path (which strips the system prompt
before calling compress()). With the new "system prompt is implicit"
semantics, default 3 gives every caller a stable head shape.
- agent/context_engine.py: bring the ABC's protect_first_n docstring in
line with the new semantics so plugin context engines interpret the
config key the same way the built-in compressor does.
- tests: adjust the default-value test (3, not 2) and a stale comment;
per-test protect_first_n=2/3/1 values added in PR #13754 stay as-is
since those tests fix concrete head shapes.
The number of head messages preserved verbatim across context compactions
was previously hardcoded to 3 in AIAgent.__init__. Expose it as
`compression.protect_first_n` in config, matching the existing
`protect_last_n` pattern.
Motivation: users who rely on rolling compaction for long-running sessions
had the opening user/assistant exchange pinned as head forever, which
doesn't always match how they want the session framed after many
compactions. Lowering to 1 preserves the system prompt + first non-system
message; lowering to 0 preserves only the system prompt and lets the
entire first exchange age out naturally through the summary.
Semantics: `protect_first_n` counts non-system head messages protected
**in addition to** the system prompt, which is always implicitly protected
when present. Same meaning across both code paths:
protect_first_n=0 → system prompt only (or nothing if no system message)
protect_first_n=2 → system prompt + first 2 non-system messages (default)
This unifies the CLI path (which reads messages with the system prompt at
position 0) and the gateway path (where the gateway /compress handler
strips the system prompt before calling compress() — see
gateway/run.py L9150-9154 on the parent fork). Previously these two paths
disagreed:
CLI path: protect_first_n=1 → protect system prompt only
Gateway path: protect_first_n=1 → protect first USER turn forever
In practice on long-running gateway sessions the old semantics pinned
whatever stale aside happened to be the first user message, reinserting
it into every compaction summary indefinitely.
Default chosen as 2 (not 3) so that the effective protected head count
remains 3 messages in the common case — assuming a system prompt is
present, default protection becomes system + 2 non-system = 3 total,
matching the pre-feature behaviour where `protect_first_n` was hardcoded
to protect 3 messages total. Sessions without a system prompt will see a
small behaviour change (2 protected head messages instead of 3), but this
is the rare path and the new semantics make the system-prompt-present
case the well-defined one.
Changes:
- agent/context_compressor.py: redefine protect_first_n as the count of
non-system head messages protected beyond the implicit system-prompt
guarantee; both paths converge. Constructor default updated to 2.
- hermes_cli/config.py: add `compression.protect_first_n` default (2),
matching the new semantics. `show_config` label tweaked to
'Protect first: N non-system head messages' for clarity.
- run_agent.py: read protect_first_n from config; 0 is now valid (system
prompt is always implicitly protected).
- cli-config.yaml.example: document the new key and rationale.
- tests/agent/test_context_compressor.py: cover default, override, the
end-to-end `protect_first_n=0` and `protect_first_n=1` behaviour,
the no-system-prompt (gateway) path, and the new shared-semantics
regression test.
Fixes#13751
Tested on Ubuntu 24.04.
By default, once Hermes participates in a Discord thread (auto-created on
@mention or replied in once) it auto-responds to every subsequent message
in that thread without requiring further @mentions. That's the right default
for one-on-one conversations and isolated channel threads.
But it's a confirmed footgun in multi-bot threads. When a user invokes one
bot per turn — addressing Codex first, then Hermes — every other bot in the
thread also fires on every message, burning credits and spamming the channel.
Author has hit this personally in active multi-bot research-team threads.
Add a new `discord.thread_require_mention` config key (env:
`DISCORD_THREAD_REQUIRE_MENTION`), default `false` to preserve existing
behavior. When `true`, the in-thread mention shortcut is disabled and
threads are gated the same way channels are. Explicit @mentions still pass
through as expected.
Mirrors the existing helper shape (config.extra > env > default) and the
existing yaml→env bridge pattern used by `require_mention`.
Changes:
- gateway/platforms/discord.py: new `_discord_thread_require_mention()`
helper; in_bot_thread shortcut now AND's with `not _discord_thread_require_mention()`
- gateway/config.py: bridge `discord.thread_require_mention` from config.yaml
to `DISCORD_THREAD_REQUIRE_MENTION` env var (mirrors the existing
`require_mention` bridge two lines above)
- hermes_cli/config.py: add `thread_require_mention: False` default to
DEFAULT_CONFIG['discord']
- tests/gateway/test_discord_free_response.py: 4 new tests covering default
behaviour (in-thread shortcut still works), enabled behaviour (mention
required in threads), enabled+mentioned (mention still passes through),
and yaml-via-config.extra path. Also clears DISCORD_* env vars in the
`adapter` fixture so process-env state from the contributor's shell
doesn't leak into per-test behaviour.
- tests/gateway/test_config.py: 2 new tests covering the yaml→env bridge
(both the apply-from-yaml and env-precedence-over-yaml paths)
- website/docs/user-guide/messaging/discord.md: document the new env var
+ config key with multi-bot rationale; cross-link from `auto_thread`
section
Tested on Ubuntu 24.04.
Free-response channels are intended as lightweight chat surfaces — the bot
responds to every message without requiring an @mention. But the auto-thread
gate only checked DISCORD_NO_THREAD_CHANNELS, not DISCORD_FREE_RESPONSE_CHANNELS,
so every message in a free-response channel still spawned a brand-new thread.
That turns a chat channel into a thread-spawning machine: 1 thread per message.
The user-facing docs at website/docs/user-guide/messaging/discord.md already
describe the intended behavior ("Free-response channels also skip auto-threading
— the bot replies inline rather than spinning off a new thread per message"),
so this is a code-vs-docs gap, not a design change.
Fix: OR is_free_channel into skip_thread alongside the existing no_thread_channels
check. One-line production change.
Regression test added at tests/gateway/test_discord_free_response.py:
test_discord_free_response_channel_skips_auto_thread asserts that a message
in a free-response channel never calls _auto_create_thread. Reverting the
one-line fix causes the test to fail with 'Expected mock to not have been
awaited. Awaited 1 times.' — i.e. the test demonstrates the bug concretely.
Lets platform plugins own their YAML→env config bridge instead of forcing
core gateway/config.py to know every platform's schema.
The hook receives the full parsed config.yaml and the platform's own
sub-dict, may mutate os.environ (env > YAML precedence preserved via the
standard `not os.getenv(...)` guards), and may return a dict to merge
into PlatformConfig.extra. It runs during load_gateway_config() after
the existing generic shared-key loop and before _apply_env_overrides(),
mirroring the env_enablement_fn dispatch pattern (#21306, #21331).
Pure addition — no behavior change for existing platforms. Each of the
eight platforms with hardcoded YAML→env blocks today (discord, telegram,
whatsapp, slack, dingtalk, mattermost, matrix, feishu, ~252 LOC in
gateway/config.py) can migrate in independent follow-up PRs; the
hardcoded blocks remain functional in the meantime, and their
`not os.getenv(...)` guards make them no-ops for any env var the hook
already set.
Test coverage: 10 new tests in tests/gateway/test_platform_registry.py
covering field default, callable acceptance, env mutation, extras
merge, both signature args, exception swallowing, missing/non-dict
sections, and env > YAML precedence.
Refs #3823, #24356.
Closes#24836.
Followup to PR #24182 — caught when scanning OpenClaw for recent codex
fixes we hadn't considered. OpenClaw learned the hard way (#80815) that
migrating plugins which codex itself reports as unavailable produces
config that fails at activation time.
Our /codex-runtime codex_app_server enable path queries codex's
plugin/list and migrates everything where installed=true. We were
trusting codex's installation state and ignoring its availability
field. So a plugin that's installed=true but availability=UNAVAILABLE
(broken local install) or REQUIRES_AUTH (OAuth expired or never
completed) would get an [plugins."<n>@openai-curated"] entry in
~/.codex/config.toml — and the user's first codex turn after enabling
the runtime would fail because codex refuses to activate it.
Fix: filter on availability in _query_codex_plugins(). Only emit
plugins where availability is empty (older codex versions without the
field — preserve backward compat) or explicitly AVAILABLE.
Tests:
test_plugin_discovery_skips_unavailable_plugins — verifies 4 cases:
- good-plugin (installed=True, availability=AVAILABLE) → migrated
- broken-plugin (installed=True, availability=UNAVAILABLE) → skipped
- auth-pending (installed=True, availability=REQUIRES_AUTH) → skipped
- legacy-plugin (installed=True, no availability field) → migrated
(older codex versions; preserve backward compat)
Docs:
Added bullet to 'What's NOT migrated' list in the docs page calling
out the availability filter and why.
Other OpenClaw codex PRs I reviewed but did NOT apply (with reasoning):
- #81591 (load Codex for selectable models): we resolve runtime
per-call already, no startup-time gating to fix
- #81510 (cron compatibility): we documented cron as untested; their
fix is for OpenClaw-specific cron orchestration shape
- #81223 (rotate incompatible context-engine threads): we don't
have a Lossless context engine equivalent
- #80688 (constrain sandbox): we don't have an outer-sandbox concept
- #80616 (release on turn_aborted): we already handle status=
interrupted in turn/completed correctly
- #80278 (expose activeModel in plugin SDK): not our surface
- #80792 (default destructive_actions on): we don't expose that knob
56 codex-runtime migration tests still green (+1 new).
The Analytics page and the token/cost surfaces on the Models page show
local debug estimates only. They count input+output (and a bar viz adds
cache_read+reasoning, missing cache_write entirely) from successful
main-agent responses that returned a usable usage block.
Excluded silently:
- All auxiliary calls — context compression, title generation, vision,
session search, web extract, smart approvals, MCP routing, plugin LLM
access (13 production call sites bypass update_token_counts)
- Provider-side retries, fallback attempts
- Any call whose usage block didn't come back
- cache_write_tokens (column exists in sessions table but not returned
by /api/analytics/models)
Real-world impact: a user on Kimi K2.6 saw 150K local vs 27M on the
OpenRouter side over the same window. Precise-looking numbers next to
provider billing create false confidence and support load.
This change adds dashboard.show_token_analytics (default False) to gate:
- The Analytics nav item (hidden from sidebar when off)
- The Analytics page (renders an explanation card instead of charts)
- Token bars, totals, cost figures, avg/api_calls on the Models page
The Models page keeps capability metadata (context window, vision,
tools, reasoning), the use-as-main/aux menu, sessions count, and
last-used timestamps when the flag is off.
Set dashboard.show_token_analytics: true in config.yaml to opt back in
to the local debug estimate. Fixing the underlying accounting (issue
#23270) is a separate, larger workstream.
Refs: #23270, #21705
Both addresses route to the same GitHub account (@simpolism / snav). Adding
the mappings here keeps release notes from showing two separate contributors
for what is one person's work, and unblocks subsequent PRs from this account
that would otherwise each need their own scripts/release.py noise.
- test_background_review_does_not_narrow_toolset_schema: review fork must
NOT pass enabled_toolsets to AIAgent (full parent schema = matching
Anthropic cache key on the 'tools' field).
- test_background_review_installs_thread_local_whitelist: the runtime
whitelist that replaces schema-level narrowing must contain memory +
skills tools and exclude terminal / send_message / delegate_task /
web_search / execute_code.
- test_review_fork_inherits_parent_cached_system_prompt: new test for
PR #17276's first root cause — the fork's _cached_system_prompt must
equal the parent's byte-for-byte.
- test_review_fork_pins_session_start_and_session_id: defensive belt-and-
suspenders for the cached-prompt inheritance.
Inverted the original test_background_review_agent_uses_restricted_toolsets
(which asserted the schema-level narrowing) — that narrowing was the
direct cause of #25322's cache miss, and the runtime whitelist replaces
its safety claim without breaking cache parity.
Refs #25322, #15204, PR #17276.
Belt-and-suspenders complement to the cached-system-prompt inheritance:
pin session_start and session_id to the parent's so any code path that
re-renders parts of the system prompt (compression, plugin hooks)
still produces byte-identical output. The cached-prompt assignment
already short-circuits the normal rebuild path, but these pins
guarantee parity even if a future code path bypasses the cache.
Idea from simpolism's reference PR #25427 for #25322.
Co-Authored-By: simpolism <32201324+simpolism@users.noreply.github.com>
Background review fork is supposed to hit Anthropic's prefix cache on the
parent's messages_snapshot, but currently doesn't (cache_read=0 on every
fork). Two root causes, fixed in this commit:
1. System prompt is rebuilt at fork time. _cached_system_prompt starts as
None, so run_conversation calls _build_system_prompt, which embeds a
minute-precision "Conversation started: ..." timestamp. Reviews fire
10+ turns after session start, so the minute differs from main's,
producing a 1-character diff that invalidates the byte-exact cache key.
Fix: inherit the parent's _cached_system_prompt directly (same idea as
#17089, which was self-closed for only fixing this half).
2. Tools schema was narrowed via enabled_toolsets=["memory","skills"] for
safety. Anthropic's cache key includes `tools`, which sits before
`system` in the cache hierarchy, so even byte-identical `system` won't
hit when `tools` differs from main's full set.
Fix: drop the schema-level restriction so `tools` matches main, and
deny non-whitelisted tools at runtime via the existing
get_pre_tool_call_block_message gate (hermes_cli/plugins.py:1085,
already called at all three dispatch sites). Install/clear a thread-
local whitelist (added in the previous commit) on the daemon thread.
Append a soft constraint to the review prompt so the model knows.
Real E2E on Sonnet 4.5 (12-tool task + auto-triggered review):
- Per review-call cost: $0.331 → $0.035 (~89% reduction)
- End-to-end per run: $0.848 → $0.629 (~26% reduction)
- Review fork cache_create / cache_read: 88,385 / 0 → 1,234 / 94,404
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds set_thread_tool_whitelist / clear_thread_tool_whitelist to
hermes_cli/plugins.py. When set on the current thread, restricts which
tools can pass through get_pre_tool_call_block_message; non-whitelisted
tools are blocked with a configurable deny message.
Mirrors the per-thread approval-callback pattern already used by
set_approval_callback (tools/terminal_tool.py:190). Used by
_spawn_background_review to deny non-memory/non-skill tools at runtime
while inheriting the parent agent's full tools schema for prefix-cache
parity (see follow-up commit).
Tests cover allow / deny / clear / cross-thread isolation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes#25028.
The lazy-install hooks added in #25014 installed packages correctly but
failed to rebind module-level globals after install:
- Slack: missing aiohttp rebind → NameError on file uploads
- Feishu: none of the ~25 lark_oapi symbols rebound → TypeError on
adapter instantiation
- Matrix: mautrix.types enums stayed as stubs → mismatched values at
runtime
Introduces tools.lazy_deps.ensure_and_bind() — a DRY helper that
combines ensure() + importer-callable + globals().update(). This
eliminates the error-prone pattern of manually listing every global
that needs updating after lazy-install. Each platform adapter now
defines a single _import() function returning all bindings.
Also fixes: pyproject.toml [slack] extra was missing aiohttp (needed
by slack-bolt's async path).
Follow-up on @pty819's t2a_v2 endpoint fix:
- Default model: speech-02 -> speech-02-hd (bare 'speech-02' is not in the
supported enum; t2a_v2 rejects it with 400). Official enum: speech-01-hd,
speech-01-turbo, speech-02-hd, speech-02-turbo, speech-2.6-hd/turbo,
speech-2.8-hd/turbo.
- Default voice: female-shaonv -> English_expressive_narrator. The
legacy speech-01-series short ID doesn't resolve cleanly on the
speech-02+ models that are now the default.
- Default base URL: api.minimaxi.com -> api.minimax.io (matches the
canonical host in the published docs; api-uw.minimax.io is the
reduced-latency alt).
- Add GroupId support via tts.minimax.group_id config or MINIMAX_GROUP_ID
env var. Some MiniMax accounts scope TTS requests by group; without it,
requests 401. Only appended when not already in the user's base_url.
Tests rewritten to cover both the default t2a_v2 path (hex-encoded audio
in JSON, nested voice_setting/audio_setting) and the legacy
text_to_speech path (raw audio bytes, flat payload). Adds coverage for
GroupId config/env wiring and error surfacing.
Also adds AUTHOR_MAP entry for pty819's GitHub-noreply email.
The MiniMax TTS defaults were outdated:
- DEFAULT_MINIMAX_MODEL was 'speech-01' but MiniMax now uses 'speech-02'
- DEFAULT_MINIMAX_BASE_URL was 'https://api.minimax.chat/v1/text_to_speech'
which no longer works; the correct endpoint is
'https://api.minimaxi.com/v1/t2a_v2'
Users who configured tts.provider: minimax were getting model-not-supported
errors because the hardcoded defaults did not match available API permissions.
Slack platform-blocks native slash commands inside thread replies ("/queue
is not supported in threads. Sorry!") and there is no app-side setting to
re-enable them. As a workaround, rewrite a leading '!' to '/' for any known
gateway command before downstream processing — so '!queue', '!stop',
'!model gpt-5.4' etc. work inside Slack threads (and anywhere else).
Only the first token is checked against is_gateway_known_command(), so
casual messages like '!nice work' pass through to the agent unchanged.
Downstream pipeline (MessageType.COMMAND tagging, gateway dispatcher,
thread reply routing) is unchanged.
Adds 6 tests covering rewrite, args preservation, thread routing,
casual-message passthrough, '@bot' suffix, and plain '/' still-works.
`hermes tools` -> "All Platforms" took ~14s to render the checklist
because building the toolset labels called `get_nous_auth_status()` ~31x
transitively (`_toolset_has_keys` -> `_visible_providers` ->
`get_nous_subscription_features` -> `managed_nous_tools_enabled`).
Each call did a synchronous OAuth refresh POST to
portal.nousresearch.com (~350ms even on the failure path), so one menu
paint burned >13s of HTTP and 31 single-use Nous refresh tokens.
Secondary hot spot: every `get_env_value()` re-read and re-sanitised
the entire .env file. 116 reads with O(lines x known-keys) scanning
added ~300ms of CPU per render.
Fix is two process-level caches, both mtime-keyed so login/logout/edit
invalidate naturally:
* `hermes_cli/auth.py`: memoise `get_nous_auth_status()` for 15s keyed
on auth.json mtime. Splits `_compute_nous_auth_status()` as the
uncached impl. Adds `invalidate_nous_auth_status_cache()`.
* `hermes_cli/config.py`: memoise `load_env()` keyed on .env
(path, mtime, size). Adds `invalidate_env_cache()`, wired into
`save_env_value`, `remove_env_value`, and the sanitize-on-load
writer so writers don't return stale dicts on same-second writes.
Before/after on Teknium's box (real HERMES_HOME, no Nous login):
* "All Platforms" cold path: ~13,874ms -> ~691ms label-build
* Warm re-open within the same process: ~122ms -> ~17ms
Side benefit: stops burning a Nous refresh token on every menu paint,
which was risking the portal's reuse-detection revocation logic.
`_reconfigure_provider()` handled `image_gen_plugin_name` in both
branches (no-env-vars early return and post-env-vars) but never mirrored
the same handling for `video_gen_plugin_name`. The first-time
`_configure_provider()` path correctly routes to
`_select_plugin_video_gen_provider()`; reconfigure forgot to.
Repro:
1. Enable video_gen in `hermes tools` → Configure for All Platforms.
2. Go back into `hermes tools` → Reconfigure tool → Video Generation.
3. Pick xAI (with XAI_API_KEY already set).
4. Hit Enter at the "keep current key?" prompt.
Expected: `video_gen.provider: xai` written to config.yaml.
Actual: function returns silently; no `video_gen:` block ever written;
`video_generate` tool fails with "No video generation backend is
configured."
Fix: add the missing `video_gen_plugin_name` branch in both code paths
of `_reconfigure_provider()`, mirroring the existing
`image_gen_plugin_name` handling and the first-time configure logic.
Tests: `tests/hermes_cli/test_video_gen_picker.py` covers both branches
(env-vars-set keep-current and no-env-vars paths).
AGENTS.md and CONTRIBUTING.md both now state:
1. No new memory providers in the repo. The set under plugins/memory/
(honcho, mem0, supermemory, byterover, hindsight, holographic,
openviking, retaindb) is closed. New backends ship as standalone
plugin repos that users install into ~/.hermes/plugins/ via the
same MemoryProvider ABC, discovery path, and hermes memory setup
integration. PRs adding a new plugins/memory/<name>/ directory get
closed with a pointer to publish as their own repo.
2. Skill authoring standards (hardline) — applies to all new or
modernized skills (bundled, optional, contributed):
- description <= 60 chars, one sentence, ends with period, no
marketing words, no name repetition (verification snippet
included)
- tools referenced in SKILL.md prose must be native Hermes tools
or MCP servers the skill expects — no grep/cat/sed/find etc.
when search_files/read_file/patch already cover them
- platforms: gating audited against actual POSIX-only primitives
- author credits the human contributor first, not 'Hermes Agent'
- SKILL.md uses modern section order with line targets
- scripts/references/templates layout for non-trivial logic
- tests at tests/skills/test_<skill>_skill.py, stdlib + mock only
- .env.example edits isolated to a delimited block
CONTRIBUTING.md includes a good/bad description example and a
'don't say / say' table mapping shell utilities to native tools.
AGENTS.md points the agent at references/new-skill-pr-salvage.md
for the full salvage checklist.
Salvages the closed PR #2010 (Mibayy's EVM multi-chain skill) and folds the
existing optional-skills/blockchain/base/ skill into it, so we ship one
unified EVM skill instead of two overlapping ones.
Pulled in from base/:
- 8 missing Base-specific tokens (AERO, DEGEN, TOSHI, BRETT, WELL,
cbETH, cbBTC, wstETH, rETH) added to KNOWN_TOKENS['base'] —
base/ had 11, evm/ only had 3 (USDC/DAI/WETH).
- L1 data-fee pitfall note for rollups (Base, Arbitrum, Optimism, zkSync).
- Batch-size chunking in rpc_batch (Base RPC caps batches at 10 calls
per JSON-RPC request; adding more known tokens tripped that limit
and broke 'wallet --chain base' with a 'list index out of range'
error). Ported the chunking pattern from base/_rpc_batch_chunk.
Latent bugs found and fixed while smoke-testing the merge:
- cmd_multichain and cmd_allowance both iterated KNOWN_TOKENS[chain]
with 'for contract, (symbol, _name) in known.items()' — but the dict
shape is {symbol: contract_str}, not {addr: (sym, name)}. This raised
'too many values to unpack (expected 2)' on every non-zero balance.
Now iterates as 'for symbol, contract in known.items()'.
- Input validation: added is_valid_address / is_valid_txhash /
require_address / require_txhash helpers and wired them into
cmd_wallet, cmd_tx, cmd_token, cmd_activity, cmd_allowance,
cmd_decode, cmd_contract, cmd_multichain. Fails fast with exit 2
on malformed input instead of burning an RPC round-trip on garbage.
Documentation:
- SKILL.md now flags that this skill supersedes optional-skills/blockchain/base.
- Pitfalls expanded for ENS (single-endpoint dependency on
ensideas.com), tx decoding (single-endpoint dependency on
4byte.directory), and rollup L1 fees.
- Regenerated website/docs/user-guide/skills/optional/blockchain/
blockchain-evm.md and removed the old blockchain-base.md page;
catalog updated.
Removed:
- optional-skills/blockchain/base/SKILL.md
- optional-skills/blockchain/base/scripts/base_client.py
- website/docs/user-guide/skills/optional/blockchain/blockchain-base.md
Smoke-tested live against Base mainnet: stats, price, token, wallet
(vitalik.eth — 3.12 ETH + 13.88 USDC + 4.23 DAI + 0.06 WETH on Base)
and allowance (ethereum, 7 unlimited approvals to Uniswap/Permit2).
Original PR #2010 author: Mibayy.
Original base/ skill author: youssefea.
* feat(codex-runtime): scaffold optional codex app-server runtime
Foundational commit for an opt-in alternate runtime that hands OpenAI/Codex
turns to a 'codex app-server' subprocess instead of Hermes' tool dispatch.
Default behavior is unchanged.
Lands in three pieces:
1. agent/transports/codex_app_server.py — JSON-RPC 2.0 over stdio speaker
for codex's app-server protocol (codex-rs/app-server). Spawn, init
handshake, request/response, notification queue, server-initiated
request queue (for approval round-trips), interrupt-friendly blocking
reads. Tested against real codex 0.130.0 binary end-to-end during
development.
2. hermes_cli/runtime_provider.py:
- Adds 'codex_app_server' to _VALID_API_MODES.
- Adds _maybe_apply_codex_app_server_runtime() helper, called at the
end of _resolve_runtime_from_pool_entry(). Inert unless
'model.openai_runtime: codex_app_server' is set in config.yaml AND
provider in {openai, openai-codex}. Other providers cannot be
rerouted (anthropic, openrouter, etc. preserved).
3. tests/agent/transports/test_codex_app_server_runtime.py — 24 tests
covering api_mode registration, the rewriter helper (default-off,
case-insensitive, opt-in, non-eligible providers preserved), version
parser, missing-binary handling, error class. Does NOT require codex
CLI installed.
This commit is wire-only: the api_mode is recognized but AIAgent does
not yet branch on it. Followup commits add the session adapter, event
projector, approval bridge, transcript projection (so memory/skill
review still works), plugin migration, and slash command.
Existing tests remain green:
- tests/cli/test_cli_provider_resolution.py (29 passed)
- tests/agent/test_credential_pool_routing.py (included above)
* feat(codex-runtime): add codex item projector for memory/skill review
The translator that lets Hermes' self-improvement loop keep working under the
Codex runtime: converts codex 'item/*' notifications into Hermes' standard
{role, content, tool_calls, tool_call_id} message shape that
agent/curator.py already knows how to read.
Item taxonomy (matches codex-rs/app-server-protocol/src/protocol/v2/item.rs):
- userMessage → {role: user, content}
- agentMessage → {role: assistant, content: text}
- reasoning → stashed in next assistant's 'reasoning' field
- commandExecution → assistant tool_call(name='exec_command') + tool result
- fileChange → assistant tool_call(name='apply_patch') + tool result
- mcpToolCall → assistant tool_call(name='mcp.<server>.<tool>') + tool result
- dynamicToolCall → assistant tool_call(name=<tool>) + tool result
- plan/hookPrompt/etc → opaque assistant note, no fabricated tool_calls
Invariants preserved:
- Message role alternation never violated: each tool item produces at most
one assistant + one tool message in that order, correlated by call_id.
- Streaming deltas (item/<type>/outputDelta, item/agentMessage/delta)
don't materialize messages — only item/completed does. Mirrors how
Hermes already only writes the assistant message after streaming ends.
- Tool call ids are deterministic (codex item id-based) so replays produce
identical messages and prefix caches stay valid (AGENTS.md pitfall #16).
- JSON args use sorted_keys for the same reason.
Real wire formats verified against codex 0.130.0 by capturing live
notifications from thread/shellCommand and including one as a fixture
(COMMAND_EXEC_COMPLETED).
23 new tests, all green:
- Streaming deltas don't materialize (3 paths)
- Turn/thread frame events are silent
- commandExecution: 5 tests including non-zero exit annotation +
deterministic id stability across replays
- agentMessage + reasoning attachment + reasoning consumption
- fileChange: summary without inlined content
- mcpToolCall: namespaced naming + error surfacing
- userMessage: text fragments only (drops images/etc)
- opaque items: no fabricated tool_calls
- Helpers: deterministic id stability + sorted JSON args
- Role alternation invariant across all four tool-shaped item types
This commit is a pure addition. AIAgent integration (the wire that uses the
projector) is the next commit.
* feat(codex-runtime): add session adapter + approval bridge
The third self-contained module: CodexAppServerSession owns one Codex
thread per Hermes session, drives turn/start, consumes streaming
notifications via CodexEventProjector, handles server-initiated approval
requests, and translates cancellation into turn/interrupt.
The adapter has a single public per-turn method:
result = session.run_turn(user_input='...', turn_timeout=600)
# result.final_text → assistant text for the caller
# result.projected_messages → list ready to splice into AIAgent.messages
# result.tool_iterations → tick count for _iters_since_skill nudge
# result.interrupted → True on Ctrl+C / deadline / interrupt
# result.error → error string when the turn cannot complete
# result.turn_id, thread_id → for sessions DB / resume
Behavior:
- ensure_started() spawns codex, does the initialize handshake, and
issues thread/start with cwd + permissions profile. Idempotent.
- run_turn() blocks until turn/completed, drains server-initiated
requests (approvals) before reading notifications so codex never
deadlocks waiting for us, projects every item/completed via the
projector, and increments tool_iterations for the skill nudge gate.
- request_interrupt() is thread-safe (threading.Event); the next loop
iteration issues turn/interrupt and unwinds.
- turn_timeout deadlock guard issues turn/interrupt and records an
error if the turn never completes.
- close() escalates terminate → kill via the underlying client.
Approval bridge:
Codex emits server-initiated requests for execCommandApproval and
applyPatchApproval. The adapter translates Hermes' approval choice
vocabulary onto codex's decision vocabulary:
Hermes 'once' → codex 'approved'
Hermes 'session' or 'always' → codex 'approvedForSession'
Hermes 'deny' / anything else → codex 'denied'
Routing precedence:
1. _ServerRequestRouting.auto_approve_* flags (cron / non-interactive)
2. approval_callback wired by the CLI (defers to
tools.approval.prompt_dangerous_approval())
3. Fail-closed denial when neither is wired
Unknown server-request methods are answered with JSON-RPC error -32601
so codex doesn't hang waiting for us.
Permission profile mapping mirrors AGENTS.md:
Hermes 'auto' → codex 'workspace-write'
Hermes 'approval-required' → codex 'read-only-with-approval'
Hermes 'unrestricted/yolo' → codex 'full-access'
20 new tests, all green. Combined with prior commits this PR now has
67 tests across three modules:
- test_codex_app_server_runtime.py: 24 (api_mode + transport surface)
- test_codex_event_projector.py: 23 (item taxonomy projections)
- test_codex_app_server_session.py: 20 (turn loop + approvals + interrupts)
Full tests/agent/transports/ directory: 249/249 pass — no regressions
to existing transport tests.
Still no wire into AIAgent.run_conversation(); that integration commit
is small and goes next.
* feat(codex-runtime): wire codex_app_server runtime into AIAgent
The integration commit. AIAgent.run_conversation() now early-returns to a
new helper _run_codex_app_server_turn() when self.api_mode ==
'codex_app_server', bypassing the chat_completions tool loop entirely.
Three small surgical edits to run_agent.py (~105 LOC total):
1. Line ~1204 (constructor api_mode validation set):
Add 'codex_app_server' so an explicit api_mode='codex_app_server'
passed to AIAgent() isn't silently rewritten to 'chat_completions'.
2. Line ~12048 (run_conversation, just before the while loop):
Early-return to _run_codex_app_server_turn() when self.api_mode is
'codex_app_server'. Placed AFTER all standard pre-loop setup —
logging context, session DB, surrogate sanitization, _user_turn_count
and _turns_since_memory increments, _ext_prefetch_cache, memory
manager on_turn_start — so behavior outside the model-call loop is
identical between paths. Default Hermes flow is unchanged when the
flag is off.
3. End-of-class (line ~15497):
New method _run_codex_app_server_turn(). Lazy-instantiates one
CodexAppServerSession per AIAgent (reused across turns), runs the
turn, splices projected_messages into messages, increments
_iters_since_skill by tool_iterations (since the chat_completions
loop normally does that per iteration), fires
_spawn_background_review on the same cadence as the default path.
Counter accounting:
_turns_since_memory ← already incremented at run_conversation:11817
(gated on memory store configured) — codex
helper does NOT touch it (would double-count).
_user_turn_count ← already incremented at run_conversation:11793
— codex helper does NOT touch it.
_iters_since_skill ← incremented in the chat_completions loop per
tool iteration. Codex helper increments by
turn.tool_iterations since the loop is bypassed.
User message:
ALREADY appended to messages by run_conversation pre-loop (line 11823)
before the early-return reaches us. Helper does NOT append again.
Regression test test_user_message_not_duplicated guards this.
Approval callback wiring:
Lazy-fetches tools.terminal_tool._get_approval_callback at session
spawn time, passes to CodexAppServerSession. CLI threads with
prompt_toolkit get interactive approvals; gateway/cron contexts get
the codex-side fail-closed deny.
Error path:
Codex session exceptions become a 'partial' result with completed=False
and a final_response that explicitly tells the user how to switch back:
'Codex app-server turn failed: ... Fall back to default runtime with
/codex-runtime auto.' Same return-dict shape as the chat_completions
path so all callers (gateway, CLI, batch_runner, ACP) work unchanged.
9 new integration tests in tests/run_agent/test_codex_app_server_integration.py:
- api_mode='codex_app_server' is accepted on AIAgent construction
- run_conversation returns the expected codex shape
(final_response, codex_thread_id, codex_turn_id, completed, partial)
- Projected messages are spliced into messages list
- _iters_since_skill ticks per tool iteration
- _user_turn_count delegated to standard flow (not double-counted)
- User message appears exactly once (regression guard)
- _spawn_background_review IS invoked (memory/skill review keeps working)
- chat.completions.create is NEVER called (loop fully bypassed)
- Session exception → partial result with /codex-runtime auto hint
- Interrupted turn → partial result with error preserved
Adjacent test runs confirm no regressions:
- tests/run_agent/test_memory_nudge_counter_hydration.py: green
- tests/run_agent/test_background_review.py: green
- tests/run_agent/test_fallback_model.py: green
- tests/agent/transports/: 249/249 green
Still missing for full feature: /codex-runtime slash command, plugin
migration helper, docs page, live e2e test gated on codex binary. Those
are the remaining followup commits.
* feat(codex-runtime): add /codex-runtime slash command (CLI + gateway)
User-facing toggle for the optional codex app-server runtime. Follows the
'Adding a Slash Command (All Platforms)' pattern from AGENTS.md exactly:
single CommandDef in the central registry → CLI handler → gateway handler
→ running-agent guard → all surfaces (autocomplete, /help, Telegram menu,
Slack subcommands) update automatically.
Surface:
/codex-runtime — show current state + codex CLI status
/codex-runtime auto — Hermes default runtime
/codex-runtime codex_app_server — codex subprocess runtime
/codex-runtime on / off — synonyms
Files changed:
hermes_cli/codex_runtime_switch.py (new):
Pure-Python state machine shared by CLI and gateway. Parse args,
read/write model.openai_runtime in the config dict, gate enabling
behind a codex --version check (don't let users opt in to a runtime
they have no binary for; print npm install hint instead).
Returns a CodexRuntimeStatus dataclass that callers render however
suits their surface.
hermes_cli/commands.py:
Single CommandDef entry, no aliases (codex-runtime is its own thing).
cli.py:
Dispatch in process_command() + _handle_codex_runtime() handler that
delegates to the shared module and renders results via _cprint.
gateway/run.py:
Dispatch in _handle_message() + _handle_codex_runtime_command() that
returns a string (gateway sends as message). On a successful change
that requires a new session, _evict_cached_agent() forces the next
inbound message to construct a fresh AIAgent with the new api_mode —
avoids prompt-cache invalidation mid-session.
gateway/run.py running-agent guard:
/codex-runtime joins /model in the early-intercept block so a runtime
flip mid-turn can't split a turn across two transports.
Tests:
tests/hermes_cli/test_codex_runtime_switch.py — 25 tests covering the
state machine: arg parsing (10 cases incl. case-insensitive and
synonyms), reading current runtime (5 cases incl. malformed configs),
writing runtime (3 cases), apply() entry point covering read-only,
no-op, codex-missing-blocked, codex-present-success, disable-no-binary-check,
and persist-failure paths (8 cases). All green.
Adjacent test suites confirm no regressions:
- tests/hermes_cli/test_commands.py + test_codex_runtime_switch.py:
167/167 green
- tests/agent/transports/: 283/283 green when combined with prior commits
Still missing: plugin migration helper, docs page, live e2e test gated on
codex binary. Followup commits.
* feat(codex-runtime): auto-migrate Hermes MCP servers to ~/.codex/config.toml
Translates the user's mcp_servers config from ~/.hermes/config.yaml into
the TOML format codex's MCP client expects. Wired into the
/codex-runtime codex_app_server enable path so users get their MCP tool
surface in the spawned subprocess automatically.
The migration runs on every enable. Failures are non-fatal — the runtime
change still proceeds and the user gets a warning so they can fix the
codex config manually.
What translates (mapping verified against codex-rs/core/src/config/edit.rs):
Hermes mcp_servers.<n>.command/args/env → codex stdio transport
Hermes mcp_servers.<n>.url/headers → codex streamable_http transport
Hermes mcp_servers.<n>.timeout → codex tool_timeout_sec
Hermes mcp_servers.<n>.connect_timeout → codex startup_timeout_sec
Hermes mcp_servers.<n>.cwd → codex stdio cwd
Hermes mcp_servers.<n>.enabled: false → codex enabled = false
What does NOT translate (warned + skipped per server):
Hermes-specific keys (sampling, etc.) — codex's MCP client has no
equivalent. Listed in the per-server skipped[] field of the report.
What's NOT migrated (intentional):
AGENTS.md — codex respects this file natively in its cwd. Hermes' own
AGENTS.md (project-level) is already in the worktree, so codex picks
it up without translation. No code needed.
Idempotency design:
All managed content lives between a 'managed by hermes-agent' marker
and the next non-mcp_servers section header. _strip_existing_managed_block
removes the prior managed region cleanly, preserving any user-added
codex config (model, providers.openai, sandbox profiles, etc.) above
or below.
Files added:
hermes_cli/codex_runtime_plugin_migration.py — pure-Python migration
helper. Public API: migrate(hermes_config, codex_home=None,
dry_run=False) returns MigrationReport with .migrated/.errors/
.skipped_keys_per_server. No external TOML dependency — minimal
formatter handles strings/numbers/booleans/lists/inline-tables.
tests/hermes_cli/test_codex_runtime_plugin_migration.py — 39 tests
covering:
- per-server translation (12): stdio/http/sse, cwd, timeouts,
enabled flag, command+url precedence, sampling drop, unknown keys
- TOML formatter (8): types, escaping, inline tables, error case
- existing-block stripping (4): no marker, alone, with user content
above, with user content below
- end-to-end migrate() (8): empty, dry-run, round-trip, idempotent
re-run, preserves user config, error reporting, invalid input,
summary formatting
Files changed:
hermes_cli/codex_runtime_switch.py — apply() now calls migrate() in
the codex_app_server enable branch. Migration failure logs a warning
in the result message but does NOT fail the runtime change. Disable
path (auto) explicitly skips migration.
tests/hermes_cli/test_codex_runtime_switch.py — 3 new tests:
test_enable_triggers_mcp_migration, test_disable_does_not_trigger_migration,
test_migration_failure_does_not_block_enable.
All 325 feature tests green:
- tests/agent/transports/: 249 (incl. 67 new)
- tests/run_agent/test_codex_app_server_integration.py: 9
- tests/hermes_cli/test_codex_runtime_switch.py: 28 (3 new)
- tests/hermes_cli/test_codex_runtime_plugin_migration.py: 39 (new)
* perf(codex-runtime): cache codex --version check within apply()
Single /codex-runtime invocation could spawn 'codex --version' up to 3
times (state report, enable gate, success message). Each spawn is ~50ms,
so the cumulative cost wasn't a crisis, but it was wasteful and turned a
trivial slash command into something noticeably laggy on slower systems.
Refactored to lazy-once via a closure over a nonlocal cache. First call
spawns; subsequent calls in the same apply() reuse the result.
Behavior unchanged — same return shape, same error handling, same install
hint when codex is missing. Just one subprocess per call instead of three.
Two regression-guard tests added:
- test_binary_check_cached_within_apply: enable path → call_count == 1
- test_binary_check_cached_on_read_only_call: state-report path → call_count == 1
Total tests for /codex-runtime now 30 (was 28); all 143 codex-runtime
tests still green.
* fix(codex-runtime): correct protocol field names found via live e2e test
Three real bugs caught only by running a turn end-to-end against codex
0.130.0 with a real ChatGPT subscription. Unit tests passed because they
asserted on our own (incorrect) wire shapes; the wire format from
codex-rs/app-server-protocol/src/protocol/v2/* is the source of truth and
my initial reading of the README was incomplete.
Bug 1: thread/start.permissions wire format
Was sending {"profileId": "workspace-write"}.
Real format per PermissionProfileSelectionParams enum (tagged union):
{"type": "profile", "id": "workspace-write"}
AND requires the experimentalApi capability declared during initialize.
AND requires a matching [permissions] table in ~/.codex/config.toml or
codex fails the request with 'default_permissions requires a [permissions]
table'.
Fix: stop overriding permissions on thread/start. Codex picks its default
profile (read-only unless user configures otherwise), which matches what
codex CLI users expect — they configure their default permission profile
in ~/.codex/config.toml the standard way. Trying to be clever about
profile selection broke every turn we tested.
Live error before fix: 'Invalid request: missing field type' on every
turn/start, even though our turn/start payload was correct — the field
codex was complaining about was inside the permissions sub-object we
shouldn't have been sending.
Bug 2: server-request method names
Was matching 'execCommandApproval' and 'applyPatchApproval'.
Real names per common.rs ServerRequest enum:
item/commandExecution/requestApproval
item/fileChange/requestApproval
item/permissions/requestApproval (new third method)
Fix: match the documented names. Added handler for
item/permissions/requestApproval that always declines — codex sometimes
asks to escalate permissions mid-turn and silent acceptance would surprise
users.
Live symptom before fix: agent.log showed
'Unknown codex server request: item/commandExecution/requestApproval'
and codex stalled because we replied with -32601 (unsupported method)
instead of an approval decision. The agent reported back 'The write
command was rejected' even though Hermes never showed the user an
approval prompt.
Bug 3: approval decision values
Was sending decision strings 'approved'/'approvedForSession'/'denied'.
Real values per CommandExecutionApprovalDecision enum (camelCase):
accept, acceptForSession, decline, cancel
(also AcceptWithExecpolicyAmendment and ApplyNetworkPolicyAmendment
variants we don't currently use).
Fix: rename _approval_choice_to_codex_decision return values; update
auto_approve_* fallbacks; update fail-closed default from 'denied' to
'decline'. Test mapping table updated to match.
Live test verified after fixes:
$ hermes (with model.openai_runtime: codex_app_server)
> Run the shell command: echo hermes-codex-livetest > .../proof.txt
then read it back
Approval prompt fired with 'Codex requests exec in <cwd>'.
User chose 'Allow once'. Codex executed the command, wrote the file,
read it back. Final response: 'Read back from proof.txt:
hermes-codex-livetest'. File contents on disk match.
agent.log confirms:
codex app-server thread started: id=019e200e profile=workspace-write
cwd=/tmp/hermes-codex-livetest/workspace
All 20 session tests still green after wire-format updates.
* fix(codex-runtime): correct apply_patch approval params + ship docs
Live e2e revealed FileChangeRequestApprovalParams doesn't carry the
changeset (just itemId, threadId, turnId, reason, grantRoot) — Codex's
'reason' field describes what the patch wants to do. Test config and
display logic updated to use it. The first 'apply_patch (0 change(s))'
display from the live test is now 'apply_patch: <reason>'.
Adds website/docs/user-guide/features/codex-app-server-runtime.md
covering enable/disable, prerequisites, approval UX, MCP migration
behavior, permission profile delegation to ~/.codex/config.toml, known
limitations, and the architecture diagram. Wired into the Automation
category in sidebars.ts.
Live e2e validation across the path matrix:
✓ thread/start handshake
✓ turn/start with text input
✓ commandExecution items + projection
✓ item/commandExecution/requestApproval → Hermes UI → response
✓ Approve once → command runs
✓ Deny → command rejected, codex falls back to read-only message
✓ Multi-turn (codex remembers prior turn's results)
✓ apply_patch via Codex's fileChange path
✓ item/fileChange/requestApproval → Hermes UI
✓ MCP server migration loads inside spawned codex (verified via
'use the filesystem MCP tool' prompt)
✓ /codex-runtime auto → codex_app_server toggle cycle
✓ Disable doesn't trigger migration
✓ Enable with codex CLI present succeeds + migrates
✓ Hermes-side interrupt path (turn/interrupt request issued cleanly
even if codex finishes before the interrupt lands)
Known live-validated limitations now documented in the docs page:
- delegate_task subagents unavailable on this runtime
- permission profile selection delegated to ~/.codex/config.toml
- apply_patch approval prompt has no inline changeset (codex protocol
doesn't expose it)
145/145 codex-runtime tests still green.
* feat(codex-runtime): native plugin migration + UX polish (quirks 2/4/5/10/11)
Major: migrate native Codex plugins (#7 in OpenClaw's PR list)
Discovers installed curated plugins via codex's plugin/list RPC and
writes [plugins."<name>@<marketplace>"] entries to ~/.codex/config.toml
so they're enabled in the spawned Codex sessions. This is the
'YouTube-video-worthy' bit Pash highlighted: when a user has
google-calendar, github, etc. installed in their Codex CLI, those
plugins activate automatically when they enable Hermes' codex runtime.
Implementation:
- hermes_cli/codex_runtime_plugin_migration.py: new _query_codex_plugins()
helper spawns 'codex app-server' briefly and walks plugin/list. Returns
(plugins, error) — failures are non-fatal so MCP migration still works.
- render_codex_toml_section() now takes plugins + permissions args.
- migrate() defaults: discover_plugins=True, default_permission_profile=
'workspace-write'. Explicit None on either disables that side.
- _strip_existing_managed_block() now also strips [plugins.*] and
[permissions]/[permissions.*] sections inside the managed block, so
re-runs replace plugins cleanly without touching codex's own config.
Quirk fixes:
#2 Default permissions profile written on enable.
Without this, Codex's read-only default kicks in and EVERY write
triggers an approval prompt. Now writes [permissions] default =
'workspace-write' so the runtime feels normal out of the box. Set
default_permission_profile=None to opt out.
#4 apply_patch approval prompt now shows what's changing.
Codex's FileChangeRequestApprovalParams doesn't carry the changeset.
Session adapter now caches the fileChange item from item/started
notifications and looks it up by itemId when codex requests approval.
Prompt shows '1 add, 1 update: /tmp/new.py, /tmp/old.py' instead of
'apply_patch (0 change(s))'.
Side benefit: also drains pending notifications BEFORE handling a
server request, so the projector and per-turn caches are up to date
when the approval decision fires. Bounded to 8 notifications per
loop iter to avoid starving codex's response.
#5/#10 Exec approval prompt never shows empty cwd.
When codex omits cwd in CommandExecutionRequestApprovalParams, fall
back to the session's cwd. If somehow neither is available, show
'<unknown>' explicitly instead of an empty string.
Also surfaces 'reason' from the approval params when codex provides
it — gives users more context on why codex wants to run something.
#11 Banner indicates the codex_app_server runtime when active.
New 'Runtime: codex app-server (terminal/file ops/MCP run inside
codex)' line appears in the welcome banner only when the runtime is
on. Default banner is unchanged.
Tests:
- 7 new tests in test_codex_runtime_plugin_migration.py covering
plugin discovery (mocked), failure handling, dry-run skip, opt-out
flag, idempotent re-runs, and permissions writing.
- 3 new tests in test_codex_app_server_session.py covering the
enriched approval prompts: cwd fallback, change summary on
apply_patch, fallback when no item/started cache exists.
- All 26 session tests + 46 migration tests green; 153 total in PR.
* feat(codex-runtime): hermes-tools MCP callback + native plugin migration
The big architectural addition: when codex_app_server runtime is on,
Hermes registers its own tool surface as an MCP server in
~/.codex/config.toml so the codex subprocess can call back into Hermes
for tools codex doesn't ship with — web_search, browser_*, vision,
image_generate, skills, TTS.
Also: 'migrate native codex plugins' (Pash's YouTube-video-worthy bit) —
when the user has plugins like Linear, GitHub, Gmail, Calendar, Canva
installed via 'codex plugin', Hermes discovers them via plugin/list and
writes [plugins.<name>@openai-curated] entries so they activate
automatically.
New module: agent/transports/hermes_tools_mcp_server.py
FastMCP stdio server exposing 17 Hermes tools. Each call dispatches
through model_tools.handle_function_call() — same code path as the
Hermes default runtime. Run with:
python -m agent.transports.hermes_tools_mcp_server [--verbose]
Exposed: web_search, web_extract, browser_navigate / _click / _type /
_press / _snapshot / _scroll / _back / _get_images / _console /
_vision, vision_analyze, image_generate, skill_view, skills_list,
text_to_speech.
NOT exposed (deliberately):
- terminal/shell/read_file/write_file/patch — codex has built-ins
- delegate_task/memory/session_search/todo — _AGENT_LOOP_TOOLS in
model_tools.py:493, require running AIAgent context. Documented
as a limitation and surfaced in the slash command output.
Migration changes (hermes_cli/codex_runtime_plugin_migration.py):
- _query_codex_plugins() spawns 'codex app-server' briefly to walk
plugin/list and pull installed openai-curated plugins. Failures are
non-fatal — MCP migration still completes.
- render_codex_toml_section() now takes plugins + permissions args
AND wraps the managed block with a MIGRATION_END_MARKER comment so
the stripper can reliably find both ends, even when the block
contains top-level keys (default_permissions = ...).
- migrate() defaults: discover_plugins=True, expose_hermes_tools=True,
default_permission_profile=':workspace' (built-in codex profile name
— must be prefixed with ':'). All three opt-out via explicit args.
- _build_hermes_tools_mcp_entry() builds the codex stdio entry with
HERMES_HOME and PYTHONPATH passthrough so a worktree-launched
Hermes points the MCP subprocess at the same module layout.
Live-caught wire bugs fixed during this turn:
1. Permission profile config key is top-level , NOT a [permissions] table. The [permissions] table is
for *user-defined* profiles with structured fields. Built-in
profile names start with ':' (':workspace', ':read-only',
':danger-no-sandbox'). Was emitting
which codex rejected with 'invalid type: string "X", expected
struct PermissionProfileToml'.
2. Built-in profile is , NOT . Codex
rejected with 'unknown built-in profile'.
3. Codex's MCP layer sends for
tool-call confirmation. We weren't handling it, so codex stalled
and returned 'MCP tool call was rejected'. Now: auto-accept for
our own hermes-tools server (user already opted in by enabling
the runtime), decline for third-party servers.
Quirk fixes shipped (from the limitations list):
#2 default permissions: workspace profile written on enable. No more
approval prompt on every write.
#4 apply_patch approval shows what's changing: cache fileChange
items from item/started, look up by itemId when codex sends
item/fileChange/requestApproval. Prompt: '1 add, 1 update:
/tmp/new.py, /tmp/old.py' instead of '0 change(s)'.
#5/#10 exec approval cwd never empty: fall back to session cwd, then
'<unknown>'. Also surfaces 'reason' from codex when present.
#11 banner shows 'Runtime: codex app-server' line when active so
users understand why tool counts may not match what's reachable.
Tests:
- 5 new tests in test_codex_runtime_plugin_migration.py covering
plugin discovery, expose_hermes_tools entry generation, idempotent
re-runs, opt-out flag, permissions profile.
- 3 new tests in test_codex_app_server_session.py covering enriched
approval prompts (cwd fallback, fileChange summary).
- 2 new tests for mcpServer/elicitation/request handling (accept
hermes-tools, decline others).
- New test file test_hermes_tools_mcp_server.py covering module
surface, EXPOSED_TOOLS safety invariants (no shell/file_ops,
no agent-loop tools), and main() error paths.
- 166 codex-runtime tests total, all green.
Live e2e validated against codex 0.130.0 + ChatGPT subscription:
✓ /codex-runtime codex_app_server enables, migrates filesystem MCP,
registers hermes-tools, writes default_permissions = ':workspace'
✓ Banner shows 'Runtime: codex app-server' line in subsequent sessions
✓ Shell command runs without approval prompt (workspace profile works)
✓ Multi-turn — codex remembers prior turn's results
✓ apply_patch path via fileChange request approval
✓ web_search via hermes-tools MCP callback returns real Firecrawl
results: 'OpenAI Codex CLI – Getting Started' end-to-end in 13s
✓ Disable cycle clean
Docs updated: website/docs/user-guide/features/codex-app-server-runtime.md
Full re-write covering native plugin migration, the hermes-tools
callback architecture, the prerequisites change ('codex login is
separate from hermes auth login codex'), the trade-off table now
reflecting which Hermes tools work via callback, and the limitations
list updated with what's actually unavailable on this runtime.
* feat(codex-runtime): pin user-config preservation invariant for quirk #6
Quirk #6 from the limitations list — user MCP servers / overrides /
codex-only sections in ~/.codex/config.toml that live OUTSIDE the
hermes-managed block must survive re-migration verbatim.
This already worked thanks to the MIGRATION_MARKER + MIGRATION_END_MARKER
pair I added when fixing the default_permissions wire format (so the
strip can find both ends of the managed region even with top-level
keys like default_permissions). But it was an emergent property
without a test pinning it.
Now explicitly tested:
- User MCP server above the managed block survives migration
- User MCP server below the managed block survives migration
- Both above + below survive a second re-migration
- User content (model, providers, sandbox, otel, etc.) outside our
region is left untouched
Docs added a section "Editing ~/.codex/config.toml safely" explaining
the marker contract — so users know they can add their own MCP
servers, override permissions, configure codex-only options, etc.
without fear of Hermes overwriting their work.
167 codex-runtime tests, all green.
* docs(codex-runtime): clarify the actual tool surface — shell covers terminal/read/write/find
Previous docs and PR description undersold what codex's built-in
toolset actually provides. apply_patch alone made it sound like the
runtime could only edit files in patch format — implying you'd lose
terminal use, read_file, write_file, search/find. That was wrong.
Codex's 'shell' tool runs arbitrary shell commands inside the sandbox,
which covers everything you'd do in bash: cat/head/tail (read), echo>
or heredocs (write), find/rg/grep (search), ls/cd (navigate), build/
test/git/etc. apply_patch is for structured multi-file edits on top
of that. update_plan is its in-runtime todo. view_image loads images.
And codex has its own web_search built in (in addition to the
Firecrawl-backed one Hermes exposes via MCP callback).
Docs now have a 'What tools the model actually has' section right
after Why, breaking the surface into three clearly-labeled buckets:
1. Codex's built-in toolset (always on) — shell, apply_patch,
update_plan, view_image, web_search; covers everything terminal-
adjacent.
2. Native Codex plugins (auto-migrated from your codex plugin
install) — Linear, GitHub, Gmail, Calendar, Outlook, Canva, etc.
3. Hermes tool callback (MCP server in ~/.codex/config.toml) —
web_search/web_extract via Firecrawl, browser_*, vision_analyze,
image_generate, skill_view/skills_list, text_to_speech.
Plus a 'What's NOT available' callout listing the four agent-loop tools
(delegate_task, memory, session_search, todo) that need running
AIAgent context and can't reach the codex runtime.
Trade-offs table broken out: shell, apply_patch, update_plan,
view_image, sandbox each get their own row with a one-line description
so users can see at a glance what's available natively.
Architecture diagram updated to list the codex built-ins by name
instead of 'apply_patch + shell + sandbox'.
No code changes — purely docs clarification. 167 codex-runtime tests
still green.
* fix(codex-runtime): _spawn_background_review signature + review fork api_mode downgrade
Two real bugs in the self-improvement loop integration that the previous
test mocked away.
Bug 1: wrong call signature
The codex helper was calling self._spawn_background_review() with no
args after every turn. That function actually requires:
messages_snapshot=list (positional or keyword)
review_memory=bool (at least one trigger must be True)
review_skills=bool
So the call would have raised TypeError at runtime — except the only
test that exercised this path mocked _spawn_background_review entirely
and just asserted spawn.called, so the wrong-arg shape never surfaced.
Bug 2: review fork inherits codex_app_server api_mode
The review fork is constructed with:
api_mode = _parent_runtime.get('api_mode')
So when the parent is codex_app_server, the review fork ALSO runs as
codex_app_server. But the review fork's whole job is to call agent-loop
tools (memory, skill_manage) which require Hermes' own dispatch — they
short-circuit with 'must be handled by the agent loop' on the codex
runtime. So the review fork would have run, decided to save something,
called memory or skill_manage, and silently no-op'd.
Fixed in run_agent.py:_spawn_background_review() — when the parent
api_mode is 'codex_app_server', the review fork is downgraded to
'codex_responses' (same OAuth credentials, same openai-codex provider,
but talks to OpenAI's Responses API directly so Hermes owns the loop).
Also rewrote the codex helper's review wiring to match the
chat_completions path:
- Computes _should_review_memory in the pre-loop block (was already
being computed; now passed through to the helper as an arg).
- Computes _should_review_skills AFTER the codex turn returns +
counters tick (line ~15432 pattern in chat_completions).
- Calls _spawn_background_review(messages_snapshot=, review_memory=,
review_skills=) only when at least one trigger fires.
- Adds the external memory provider sync (_sync_external_memory_for_turn)
that the chat_completions path runs after every turn.
Tests:
Replaced the broken test_background_review_invoked (which only
asserted spawn.called) with three sharper tests:
- test_background_review_NOT_invoked_below_threshold:
single turn at default thresholds → no review fires (would have
caught the original 'every turn calls spawn with no args' bug)
- test_background_review_skill_trigger_fires_above_threshold:
10 tool_iterations at threshold=10 → review fires with
messages_snapshot=list, review_skills=True, counter resets
- test_background_review_signature_never_breaks: regression guard
asserting positional args are always empty and kwargs include
messages_snapshot
New TestReviewForkApiModeDowngrade class:
- test_codex_app_server_parent_downgrades_review_fork: drives the
real _spawn_background_review function (no mock at that level),
asserts the review_agent gets api_mode='codex_responses' when
the parent was codex_app_server.
Live-validated against real run_conversation:
- Counter ticked from 0 to 5 after a 5-tool-iteration turn
- _spawn_background_review fired exactly once with kwargs-only signature
- review_skills=True, review_memory=False
- messages_snapshot was 12 entries (5 assistant tool_calls + 5 tool
results + 1 final assistant + initial system/user)
- Counter reset to 0 after fire
170 codex-runtime tests, all green.
Docs: added a Self-improvement loop section to the codex runtime page
explaining both how the trigger logic stays equivalent and that the
review fork is auto-downgraded to codex_responses for the agent-loop
tools. Also clarified that apply_patch and update_plan ARE codex's
built-in tools (the previous version made it sound like they were
separate from 'codex's stuff' — they're not, all five tools listed
in 'What tools the model actually has' section 1 are codex built-ins).
* feat(codex-runtime): expose kanban tools through Hermes MCP callback
Kanban workers spawn as separate hermes chat -q subprocesses that read
the user's config.yaml. If model.openai_runtime: codex_app_server is set
globally (which is the whole point of opt-in), every dispatched worker
ALSO comes up on the codex runtime.
That mostly works — codex's built-in shell + apply_patch + update_plan
do the actual task work fine — but it had one critical break: the
worker handoff tools (kanban_complete, kanban_block, kanban_comment,
kanban_heartbeat) are Hermes-registered tools, not codex built-ins.
On the codex runtime, codex builds its own tool list and these never
reach the model, so the worker would do the work but not be able to
report back, hanging until the dispatcher's timeout escalates it as
zombie.
Fix: add all 9 kanban tools to the EXPOSED_TOOLS list in the Hermes
MCP callback. They dispatch statelessly through handle_function_call()
just like web_search and the others — they read HERMES_KANBAN_TASK
from env (set by the dispatcher), gate correctly (worker tools require
the env var, orchestrator tools require it unset), and write to
~/.hermes/kanban.db.
Why kanban tools work via stateless dispatch when delegate_task/memory/
session_search/todo don't: those four are listed in _AGENT_LOOP_TOOLS
(model_tools.py:493) and short-circuit in handle_function_call() with
'must be handled by the agent loop' — they need to mutate AIAgent's
mid-loop state. Kanban tools have no such requirement; they're pure
side-effect functions against the kanban.db plus state_meta.
Tools exposed:
Worker handoff (require HERMES_KANBAN_TASK):
kanban_complete, kanban_block, kanban_comment, kanban_heartbeat
Read-only board queries:
kanban_show, kanban_list
Orchestrator (require HERMES_KANBAN_TASK unset):
kanban_create, kanban_unblock, kanban_link
Tests:
- test_kanban_worker_tools_exposed: complete/block/comment/heartbeat
in EXPOSED_TOOLS (regression guard for the would-hang-worker bug)
- test_kanban_orchestrator_tools_exposed: create/show/list/unblock/link
Docs:
- New 'Workflow features' section in the docs page covering /goal,
kanban, and cron behavior on this runtime
- /goal: works fully via run_conversation feedback; only caveat is
approval-prompt noise on long writes-heavy goals (mitigated by
the default :workspace permission profile)
- Kanban: enumerated which tools are reachable via the callback and
why the env var propagates correctly through the codex subprocess
to the MCP server subprocess
- Cron: documented as 'not specifically tested' — same rules as the
CLI apply since cron runs through AIAgent.run_conversation
- Trade-offs table gained rows for /goal, kanban worker, kanban
orchestrator
172/172 codex-runtime tests green (+2 from kanban tests).
* docs(codex-runtime): wire /codex-runtime into slash-commands ref + flag aux token cost
Three docs gaps caught during a final audit:
1. /codex-runtime was only in the feature docs page, not in the
slash-commands reference. Added rows to both the CLI section and
the Messaging section so users discover it where they'd look for
slash command syntax.
2. CODEX_HOME and HERMES_KANBAN_TASK weren't in environment-variables.md.
CODEX_HOME lets users redirect Codex CLI's config dir (the migration
honors it). HERMES_KANBAN_TASK is set by the kanban dispatcher and
propagates to the codex subprocess + the hermes-tools MCP subprocess
so kanban worker tools gate correctly — documented as 'don't set
manually' since it's an internal handoff.
3. Aux client behavior on this runtime. When openai_runtime=
codex_app_server is on with the openai-codex provider, every aux
task (title generation, context compression, vision auto-detect,
session search summarization, the background self-improvement review
fork) flows through the user's ChatGPT subscription by default.
This is true for the existing codex_responses path too, but it's
more visible / important here because users explicitly opted in for
subscription billing. Added a 'Auxiliary tasks and ChatGPT
subscription token cost' section to the docs page with a YAML
example showing how to override specific aux tasks to a cheaper
model (typically google/gemini-3-flash-preview via OpenRouter).
Also documents how the self-improvement review fork gets
auto-downgraded from codex_app_server to codex_responses by the
fix earlier in this PR.
No code changes — pure docs. 172 codex-runtime tests still green.
* docs+test(codex-runtime): pin HOME passthrough, document multi-profile + CODEX_HOME
OpenClaw hit a real footgun in openclaw/openclaw#81562: when spawning
codex app-server they were synthesizing a per-agent HOME alongside
CODEX_HOME. That made every subprocess codex's shell tool launches
(gh, git, aws, npm, gcloud, ...) see a fake $HOME and miss the user's
real config files. They had to back it out in PR #81562 — keep
CODEX_HOME isolation, leave HOME alone.
Audit confirms Hermes' codex spawn doesn't have this problem. We do
os.environ.copy() and only overlay CODEX_HOME (when provided) and
RUST_LOG. HOME passes through unchanged. But it was an emergent
property without a test pinning it, so adding a regression guard:
test_spawn_env_preserves_HOME — confirms parent HOME survives intact
in the subprocess env
test_spawn_env_sets_CODEX_HOME_when_provided — confirms codex_home
arg still isolates
codex state correctly
Docs additions:
'HOME environment variable passthrough' section — calls out the
contract explicitly: CODEX_HOME isolates codex's own state, HOME
stays user-real so gh/git/aws/npm/etc. find their normal config.
Cites openclaw#81562 as the cautionary tale.
'Multi-profile / multi-tenant setups' section — addresses the
related concern: profiles share ~/.codex/ by default. For users who
want per-profile codex isolation (separate auth, separate plugins),
documents the manual CODEX_HOME=<profile-scoped-dir> approach.
Explains why we DON'T auto-scope CODEX_HOME per profile: doing so
would silently invalidate existing codex login state for anyone
upgrading to this PR with tokens already at ~/.codex/auth.json.
Opt-in is safer than surprising users.
174 codex-runtime tests (+2 from HOME guards), all green.
* fix(codex-runtime): TOML control-char escapes + atomic config.toml write
Two footguns caught in a final audit pass before merge.
Bug 1: TOML control characters not escaped
The _format_toml_value() helper escaped backslashes and double quotes
but passed literal control characters (\n, \t, \r, \f, \b) through
unchanged. TOML basic strings don't allow literal control characters
— a path or env var containing a newline would produce invalid TOML
that codex refuses to load.
Realistic exposure: pathological cases like a HERMES_HOME with a
trailing newline (env var concatenation accident), or a PYTHONPATH
with a tab from a multi-line shell heredoc.
Fix: escape all five TOML basic-string control sequences (\b \t \n
\f \r) in addition to \\ and \" that we already did. Order
matters — backslash must come first or the other escapes get
re-escaped.
Bug 2: config.toml write wasn't atomic
If the python process crashed between target.mkdir() and the
write_text() finishing, a half-written config.toml could be left
behind. On NFS / Windows / some FUSE mounts this is a real concern;
on ext4/APFS small writes are usually atomic in practice but not
guaranteed.
Fix: write to a tempfile.mkstemp() temp file in the same directory,
then Path.replace() (atomic same-dir rename on POSIX, ReplaceFile on
Windows). On rename failure, clean up the temp file so repeated
failed migrations don't pile up .config.toml.* files.
Tests:
- test_string_with_newline_escaped — \n in value → \n in output
- test_string_with_tab_escaped — \t in value → \t in output
- test_string_with_other_controls_escaped — \r, \f, \b
- test_windows_path_escaped_correctly — backslash doubling
- test_atomic_write_no_temp_leak_on_success — no .config.toml.*
left over after a successful write
- test_atomic_write_cleanup_on_rename_failure — temp file removed
when Path.replace raises (simulated disk full)
180 codex-runtime tests, all green (+6 from this commit).
Footguns audited but NOT fixed (with rationale):
- Concurrent migrations race. Two Hermes processes hitting
/codex-runtime codex_app_server within seconds of each other could
cause one writer to lose entries. Low probability (you'd have to
enable from two surfaces simultaneously) and low impact (just re-run
migration). Adding fcntl/msvcrt locking is more code than it's
worth here. The atomic rename above means each individual write is
consistent — only the merge step is racy.
- Codex protocol version drift. We pin MIN_CODEX_VERSION=0.125 and
check at runtime but don't reject too-new versions. Right call —
the protocol has been stable through 0.125 → 0.130. If OpenAI
breaks it later we'd see the error in test_codex_app_server_runtime
on CI before users hit it.
* feat(video_gen): unified video_generate tool with pluggable provider backends
One core video_generate tool, every backend a plugin. Mirrors the
image_gen + memory_provider + context_engine architecture: ABC, registry,
plugin-context registration hook, and per-plugin model catalogs surfaced
through hermes tools.
Surface (one schema, every backend):
- operation: generate / edit / extend
- modalities: text-to-video (prompt only), image-to-video (prompt +
image_url), video edit (prompt + video_url), video extend (video_url)
- reference_image_urls, duration, aspect_ratio, resolution,
negative_prompt, audio, seed, model override
- Providers ignore unknown kwargs and declare what they support via
VideoGenProvider.capabilities() — backend-specific quirks stay in the
backend, the agent learns one tool
Backends shipped:
- plugins/video_gen/xai/ — Grok-Imagine, full generate/edit/extend +
image-to-video + reference images (salvaged from PR #10600 by
@Jaaneek, reshaped into the plugin interface)
- plugins/video_gen/fal/ — Veo 3.1 (t2v + i2v), Kling O3 i2v,
Pixverse v6 i2v with model-aware payload building that drops keys a
model doesn't declare
Wiring:
- agent/video_gen_provider.py — VideoGenProvider ABC, normalize_operation,
success_response / error_response, save_b64_video / save_bytes_video,
$HERMES_HOME/cache/videos/
- agent/video_gen_registry.py — thread-safe register/get/list +
get_active_provider() reading video_gen.provider from config.yaml
- hermes_cli/plugins.py — PluginContext.register_video_gen_provider()
- hermes_cli/tools_config.py — Video Generation category in
hermes tools, plugin-only providers list, model picker per plugin,
config write to video_gen.{provider,model}
- toolsets.py — new video_gen toolset
- tests: 31 new tests covering ABC, registry, tool dispatch, both plugins
- docs: developer-guide/video-gen-provider-plugin.md (parallel to the
image-gen guide), sidebar + toolsets-reference + plugin guides updated
Supersedes: #25035 (FAL), #17972 (FAL), #14543 (xAI), #13847 (HappyHorse),
#10458 (provider categories), #10786 (xAI media+search bundle), #2984
(FAL duplicate), #19086 (Google Veo standalone — easy port to plugin
interface).
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
* feat(video_gen): dynamic schema reflects active backend's capabilities
Address the 'capability variance' question — instead of one tool with a
static schema that lies about what every backend supports, the
video_generate tool now rebuilds its description at get_definitions()
time based on the configured video_gen.provider and video_gen.model.
The agent sees backend-specific guidance up-front:
- 'fal-ai/veo3.1/image-to-video': 'image-to-video only — image_url is
REQUIRED; text-only prompts will be rejected'
- 'fal-ai/veo3.1' (t2v): no image_url restriction shown
- xAI grok-imagine-video: 'operations: generate, edit, extend; up to 7
reference_image_urls'
- Backends without edit/extend: 'not supported on this backend — surface
that they need to switch backends via hermes tools'
This is the same pattern PR #22694 used for delegate_task self-capping —
documented in the dynamic-tool-schemas skill. Cache invalidation is
free: get_tool_definitions() already memoizes on config.yaml mtime, so a
mid-session backend swap rebuilds the schema automatically.
Tested:
- Empirical FAL OpenAPI schema check confirms image-to-video models
require image_url (FAL returns HTTP 422 otherwise) — client-side
rejection in FALVideoGenProvider.generate() now prevents the wasted
round-trip
- Live E2E: fal-ai/veo3.1/image-to-video + prompt-only → clean
missing_image_url error; fal-ai/veo3.1 + prompt-only → dispatches
- 6 new tests cover the builder (no config / image-only / full-surface /
text-only / unknown provider / registry wiring), all passing
- 37/37 in the slice, 134/134 in the broader regression set
* test(video_gen/xai): full surface integration tests + cleaner schema
Verified end-to-end that the xAI plugin handles every documented mode
from PR #10600's surface: text-to-video, image-to-video,
reference-images-to-video, video edit, video extend (with and without
prompt). All five modes route to the correct xAI endpoint
(/videos/generations, /videos/edits, /videos/extensions) with the right
payload shape (image / reference_images / video keys), and all five
client-side rejections fire before the network: edit-without-prompt,
extend-without-video_url, image+refs conflict, >7 references, and
duration/aspect_ratio clamping.
15 new integration tests grouped into four classes (endpoint routing,
modalities, validation, clamping). httpx is stubbed via a small fake
AsyncClient that records POSTs so the tests assert the actual payload
the plugin would send to xAI — not just the success/error envelope.
Also cleaned up a description redundancy: when a model's operations
match the backend's overall set, we no longer print the duplicate
'operations supported by this model' line. xAI's description now reads:
Active backend: xAI . model: grok-imagine-video
- operations supported by this backend: edit, extend, generate
- modalities supported by this backend: image, reference_images, text
- aspect_ratio choices: 16:9, 1:1, 2:3, 3:2, 3:4, 4:3, 9:16
- resolution choices: 480p, 720p
- duration range: 1-15s
- reference_image_urls: up to 7 images
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
* feat(video_gen): collapse surface to t2v + i2v, family-based auto-routing
Two design changes per Teknium:
1) Drop edit/extend from the tool surface entirely. Only text-to-video
and image-to-video remain. The agent sees a clean tool with two
modalities; backend-specific quirks like xAI's edit/extend endpoints
stay out of the unified schema.
2) FAL: pick a model FAMILY once, the plugin routes between the
family's text-to-video and image-to-video endpoints based on whether
image_url was passed. Users no longer pick 'fal-ai/veo3.1' AND
'fal-ai/veo3.1/image-to-video' as separate options — they pick
'veo3.1', and the plugin handles the rest.
Catalog rewritten as families:
veo3.1 fal-ai/veo3.1 / fal-ai/veo3.1/image-to-video
pixverse-v6 fal-ai/pixverse/v6/text-to-video / fal-ai/pixverse/v6/image-to-video
kling-o3-standard fal-ai/kling-video/o3/standard/text-to-video / fal-ai/kling-video/o3/standard/image-to-video
xAI uses a single endpoint (/videos/generations) for both modes,
routed by the presence of the 'image' field in the payload — no
edit/extend exposure.
Schema changes:
- VIDEO_GENERATE_SCHEMA: drop operation, drop video_url. Final params:
prompt (required), image_url, reference_image_urls, duration,
aspect_ratio, resolution, negative_prompt, audio, seed, model.
- VideoGenProvider ABC: drop normalize_operation, VALID_OPERATIONS,
DEFAULT_OPERATION. capabilities() drops 'operations' key.
- success_response: add 'modality' field ('text' | 'image') so the
agent and logs can see which endpoint was actually hit.
Dynamic schema builder simplified — no operations bullet, no
'switch backends if you need edit/extend' guidance. When the active
backend supports both modalities (the common case), description reads:
Active backend: FAL . model: pixverse-v6
- supports both text-to-video (omit image_url) and image-to-video
(pass image_url) - routes automatically
- aspect_ratio choices: 16:9, 9:16, 1:1
- resolution choices: 360p, 540p, 720p, 1080p
- duration range: 1-15s
- audio: pass audio=true to enable native audio (pricing tier)
- negative_prompt: supported
Tests: 51 in the video_gen slice, 216 across the broader image+video
sweep, all passing. New FAL routing tests prove pixverse-v6 + no image
hits text-to-video endpoint, pixverse-v6 + image_url hits
image-to-video endpoint, same for veo3.1 and kling-o3-standard.
Docs updated: developer-guide page rewrites the 'model families' pattern
as a first-class section so external plugin authors know the convention.
toolsets-reference and toolsets.py descriptions match the new surface.
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
* feat(video_gen/fal): expand catalog to 6 families, cheap + premium tiers
Catalog now covers everything Teknium specced from FAL:
Cheap tier:
ltx-2.3 fal-ai/ltx-2.3-22b/text-to-video / image-to-video
pixverse-v6 fal-ai/pixverse/v6/text-to-video / image-to-video
Premium tier:
veo3.1 fal-ai/veo3.1 / fal-ai/veo3.1/image-to-video
seedance-2.0 bytedance/seedance-2.0/text-to-video / image-to-video
kling-v3-4k fal-ai/kling-video/v3/4k/text-to-video / image-to-video
happy-horse fal-ai/happy-horse/text-to-video / image-to-video
DEFAULT_MODEL moved from veo3.1 (premium) to pixverse-v6 (cheap, sane
defaults, both modalities) — better first-run UX for users who haven't
explicitly picked a model.
New family-entry knob: image_param_key. Kling v3 4K's image-to-video
endpoint expects start_image_url instead of image_url; declaring
image_param_key='start_image_url' on the family lets _build_payload
remap correctly. Other families default to plain image_url.
Per-family capability flags reflect each model's docs:
- LTX 2.3 + Happy Horse: minimal payloads (no duration/aspect/resolution
enum exposed by FAL — let endpoint apply defaults)
- Seedance: 6 aspect ratios incl 21:9, durations 4-15, audio supported,
negative prompts NOT supported per docs
- Kling v3 4K: 16:9/9:16/1:1, 3-15s, audio + negative
- Veo 3.1: unchanged, 16:9/9:16, 4/6/8s
Tests: +5 covering the new families (full catalog, Kling 4K
start_image_url remap, Seedance routing, LTX payload minimality, Happy
Horse minimality). 56/56 in the slice green.
Note: I did NOT add the FAL-hosted xAI Grok-Imagine variant. Hermes
already has a direct xAI plugin that talks to xAI's own API; routing
the same model through FAL's wrapper would duplicate the surface
without adding capabilities. Users on FAL who want Grok-Imagine should
use the xAI plugin directly; flag if you want both routes available.
* test(video_gen): tool-surface routing matrix — every model x modality
End-to-end matrix test driven through _handle_video_generate() — the
actual function the agent's video_generate tool call lands in. Writes
config.yaml, invokes the registered handler with a raw args dict, then
asserts the outbound HTTP/SDK call hit the right endpoint with the right
payload shape.
Parametrized over FAL_FAMILIES.keys() so the matrix auto-discovers new
families as they're added (add a family to FAL_FAMILIES and you get
both modalities tested for free).
Coverage:
- All 6 FAL families x {text-only, text+image} = 12 cases
- xAI x {text-only, text+image} = 2 cases
- tool-level model= arg overrides config = 2 cases
For each case, verifies:
- result['success'] is True
- result['modality'] matches input shape ('text' if no image_url, 'image' otherwise)
- outbound endpoint URL matches the family's text_endpoint or image_endpoint
- text-only payloads carry no image-shaped keys
- text+image payloads carry the family's image key (image_url for most,
start_image_url for kling-v3-4k, wrapped 'image' object for xAI)
All 16 cases passing. Confirms the tool surface routes every
(provider, model, modality) combination correctly with zero leakage.
* feat(video_gen): keep video_gen out of first-run setup, surface in status
Two changes:
1. video_gen joins _DEFAULT_OFF_TOOLSETS, so it is NOT pre-selected in
the first-run toolset checklist. Video gen is niche, paid, and slow —
most users don't want it nagging them during initial setup. Anyone
who wants it opts in via 'hermes tools' -> Video Generation, which
already routes to the provider+model picker.
2. The 'hermes setup' status panel learns about video_gen — but only
shows the row when a plugin reports available. Users without
FAL_KEY/XAI_API_KEY see nothing about video gen; users with one of
those keys see 'Video Generation (FAL) ✓' as confirmation it's wired.
Verified live:
- Fresh install (no creds): zero video_gen mentions in wizard.
- With FAL_KEY: status row appears with active backend name.
- 160/160 in the setup + tools_config + video_gen test slice.
Rationale: image_gen is on by default because it's a featured creative
tool used in casual chat (telegrams, etc). Video gen is heavier — long
wait, paid per-second pricing. Default-off matches user intent better.
---------
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
Replace tenant-specific example text in the transcript offset regression with generic follow-up turns so the upstream test documents the bug without customer-specific wording.
Keep the outer history_offset when _run_agent drains queued follow-ups recursively so transcript persistence includes every queued turn in the chain instead of only the last one.
* tui: make URLs clickable + hover-highlight in any terminal
Problem
-------
URLs printed by `hermes --tui` were not clickable in basic macOS Terminal.app.
Cmd+click did nothing, the cursor didn't change shape — like nothing was
detected — even though arrow buttons and other Box onClick handlers worked
fine.
Root cause
----------
Two layers of dead plumbing:
1. `<Link>` only emitted the underlying `<ink-link>` (which carries the
hyperlink metadata into the screen buffer) when `supportsHyperlinks()`
said yes. On Apple_Terminal that's false, so the per-cell hyperlink
field stayed empty, so `Ink.getHyperlinkAt()` had nothing to return on
click. The visible underline was just decorative.
2. `Ink.openHyperlink()` calls `this.onHyperlinkClick?.(url)`, but
`onHyperlinkClick` was never assigned anywhere in the codebase. The
click pipeline (`App.tsx → onOpenHyperlink → Ink.openHyperlink`) ran
but bailed silently on the optional chain.
Bonus discovery: even when wired up, there was no hover affordance —
terminal apps can't change the system mouse cursor, so users had no
visual signal that a cell was clickable. Arrow buttons in the chrome
worked because they had explicit `<Box onClick>` styling; inline link
URLs didn't.
Fix
---
- `Link.tsx`: always emit `<ink-link>` regardless of terminal capability.
The renderer's `wrapWithOsc8Link` already gates the actual OSC 8 escape
on `supportsHyperlinks()` further down — so terminals that don't
understand OSC 8 still don't see the escape, but the screen-buffer
metadata (which the click dispatcher reads) is now populated everywhere.
- `ink.tsx + root.ts`: add `onHyperlinkClick?: (url: string) => void` to
`Options` / `RenderOptions`, wire it to the existing `Ink.onHyperlinkClick`
field in the constructor.
- `src/lib/openExternalUrl.ts`: small platform-aware opener using
`child_process.spawn` with arg-array (no shell) — http(s) only, rejects
`file:`, `javascript:`, `data:`, etc., so a hostile model can't trigger
arbitrary local handlers via `<Link url="file:///...">`. Detached + stdio
ignore so closing the TUI doesn't kill the browser and Chrome stderr
doesn't leak into the alt screen.
- `entry.tsx`: pass `onHyperlinkClick: openExternalUrl` to `ink.render`.
- `hyperlinkHover.ts` + Ink hover wiring: track the URL under the pointer
in `Ink.hoveredHyperlink`, update it from `dispatchHover`, and inverse-
highlight every cell of the matching link in the render-pass overlay
(same pattern as `applySearchHighlight`). This is the cursor-hover
affordance for clickable links — terminals don't expose cursor shape,
so we light up the link itself.
- `types/hermes-ink.d.ts`: add `onHyperlinkClick` to the `RenderOptions`
shim so consumers (`entry.tsx`) type-check against the new option.
Tests
-----
- `src/lib/openExternalUrl.test.ts` (15 cases): http(s) accepted; file/js/
data/mailto/ftp/ssh rejected; macOS open(1), Windows cmd.exe start with
empty title slot, Linux xdg-open dispatch; shell-metacharacter URLs
pass through unmolested as a single argv element; synchronous spawn
failure returns false.
Verified empirically in Apple Terminal 455.1 (macOS 15.7.3): clicking a
URL opens in default browser, hovering inverts the link cells, and
moving away clears the highlight. Full TUI suite: 713 passing, 0
type errors.
Reverts
-------
The earlier attempt that version-gated Apple_Terminal in
`supports-hyperlinks.ts` was based on a wrong assumption — Terminal.app
silently strips OSC 8 sequences but does not render them as clickable
hyperlinks. Reverted to the original allowlist.
* tui: address Copilot review — explorer.exe on win32 + comment fixes
- openExternalUrl: switch win32 from `cmd.exe /c start` to `explorer.exe`.
cmd.exe's `start` builtin reparses the URL through cmd's tokenizer, so
`&`, `|`, `^`, `<`, `>` either split the command or get reinterpreted —
breaking both the protocol-allowlist safety story AND plain http(s) URLs
with `&` in query strings. `explorer.exe <url>` invokes the registered
protocol handler directly with no shell.
- openExternalUrl.test.ts: rename the win32 test to reflect the new
contract and add two regression tests — one with `&|^<>` metachars,
one with the common analytics-URL `&` query-param pattern — both pinned
to single-argv-element delivery via explorer.exe.
- Link.tsx: fix misleading comment. OSC 8 escapes are emitted
unconditionally by the renderer (`wrapWithOsc8Link` in
render-node-to-output.ts, `oscLink` in log-update.ts). Non-supporting
terminals silently strip the sequence, which is why hover/click
affordance has to come from the in-process overlay rather than the
terminal's own link rendering.
Verified: 715/715 tests pass, type-check + build clean.
* tui: address Copilot review #2 — async spawn errors + hover scope + docs
1. openExternalUrl: attach a no-op `'error'` listener on the spawned
child BEFORE unref(). spawn() returns a ChildProcess synchronously
even when the binary is missing (ENOENT on xdg-open / explorer.exe),
unreachable, or otherwise unusable; the failure surfaces later as
an 'error' event. An unhandled 'error' on an EventEmitter crashes
Node, which would tear down the whole TUI. The listener is a
deliberate no-op — we already returned `true` synchronously and the
user just doesn't see the browser pop.
2. openExternalUrl.test.ts: add a regression test using a real
EventEmitter to simulate the async-error path. Pins both the
listener-attached contract and the "doesn't throw on emit" behavior.
Was 17/17, now 18/18.
3. ink.tsx dispatchHover: bypass `getHyperlinkAt()` and read
`cellAt(...).hyperlink` directly. `getHyperlinkAt` falls back to
`findPlainTextUrlAt` for cells without an OSC 8 hyperlink, but the
render-pass overlay (`applyHyperlinkHoverHighlight`) only matches on
`cell.hyperlink === hoveredUrl` — so plain-text URLs would burn
re-renders without ever producing the highlight. Hover is now a
strictly 1:1 fit for what the overlay can paint. Plain-text URLs
still get the click action via the existing dispatch path.
4. root.ts + ink.tsx doc comments: replace the misleading "typically
`open` / `xdg-open` / `start` shell" wording with the actual safe
recipe — argv-array spawn into `open` / `xdg-open` / `explorer.exe`,
with an explicit warning that `cmd.exe /c start` reparses the URL
through cmd's tokenizer and is unsafe + breaks `&`-query URLs.
Verified: 716/716 tests pass, type-check + build clean.
* tui: address Copilot review #3 — hover damage, alt-screen cleanup, opener allowlist
1. ink.tsx onRender: stop folding steady-state hover into hlActive.
hlActive forces a full-screen damage diff so previous-frame inverted
cells get re-emitted when the highlight set changes. The transition
IS the trigger — enter / leave / change-to-other-link. While the
pointer just sits on a link the painted cells don't change and the
per-cell diff handles the no-op. Folding the steady state in would
burn a full-screen diff on every frame. Added a
lastRenderedHoveredHyperlink tracker and gate the hlActive bump on
`hovered !== lastRendered`.
2. ink.tsx setAltScreenActive: clear hoveredHyperlink (and the tracker)
when toggling alt-screen state. Hover dispatch is alt-screen-gated,
so once we leave there's no path to clear it. Without this, remounting
<AlternateScreen> would paint a phantom hover from the previous
session until the next mouse-move arrived.
3. openExternalUrl.ts openCommand: allowlist linux + the BSD family for
xdg-open and return null for everything else (aix, sunos, cygwin,
haiku, etc.). Previously the default-fallback always returned
xdg-open, which made the caller's `if (!command) return false` dead
and yielded a misleading `true` on platforms that probably don't
have xdg-open. New tests cover the null path AND the
openExternalUrl-returns-false-without-spawning behavior.
Verified: 718/718 tests pass, type-check + build clean.
* tui: address Copilot review #4 — doc comment accuracy
1. openExternalUrl return-value doc: now lists all three false paths
(URL rejected / no opener for platform / synchronous spawn throw)
plus a note that async 'error' events still return true because the
spawn was attempted.
2. ink.tsx onHyperlinkClick field doc: clarifies the callback receives
either an OSC 8 hyperlink OR a plain-text URL detected by
findPlainTextUrlAt — App.tsx routes both into the same callback.
3. hyperlinkHover applyHyperlinkHoverHighlight doc: drops the misleading
'caller forces full-frame damage' promise. Caller decides; for hover
the current caller only forces full damage on transitions.
No behavior change. 718/718 tests pass.
* tui: address Copilot review #5 — lint fixes
1. ink.tsx: reorder `./hyperlinkHover.js` import before `./screen.js` to
satisfy perfectionist/sort-imports.
2. Link.tsx: drop unused `fallback` parameter destructuring + the
trailing `void (null as ...)` dead-statement (would trip
no-unused-expressions). Kept `fallback?: ReactNode` on the Props
interface as a documented compat shim so existing call sites still
compile, with a comment explaining why it's no longer wired up.
3. openExternalUrl.test.ts: replace `typeof import('node:child_process').spawn`
inline annotations (forbidden by @typescript-eslint/consistent-type-imports)
with a `SpawnLike` type alias backed by a real `import type { spawn as SpawnFn }`.
No behavior change. 718/718 tests pass, type-check clean, lint clean on
all modified files.
Recover from SIGWINCH without clearing the physical screen or scrollback
buffer. The startup banner and tool summary are printed before
prompt_toolkit owns the live chrome, so they live in normal terminal
scrollback. Calling erase_screen() + \x1b[3J] on every resize removed
that UI permanently — _replay_output_history cannot reconstruct it
because the banner was never added to _OUTPUT_HISTORY.
Instead, just reset prompt_toolkit's renderer cache and invalidate so
the next incremental redraw starts from a clean slate, then let the
original on_resize handler recalculate layout for the new terminal
size. This matches the behaviour of bash/zsh/fish on SIGWINCH.
FixesNousResearch/hermes-agent#22999
skill_view ran the direct-path strategy across every skill dir before
the recursive strategy, so a top-level skill in an external dir could
silently shadow a same-named nested local skill. /skills correctly
listed the local version (deduped local-first by _find_all_skills) but
skill_view loaded the external one — confusing, and a real bug class
for users with skills.external_dirs registered alongside categorized
local skills.
Pick a louder fix than @polkn's PR #6136 proposed: collect every match
across all dirs (direct path, recursive by parent dir name, legacy
flat <name>.md), and if there's more than one, refuse with an error
that surfaces every matching path plus a hint to load by the
categorized form. Local-first precedence would have replaced silent
external-shadowing with silent same-name collisions between two
externals, or made an externally-shadowed-by-local skill unreachable
by bare name with no signal. Refusing forces the user to disambiguate
once and never wonder which skill ran.
Recovery: pass the full categorized path
("foundations/runtime/explore-codebase" instead of
"explore-codebase"), or rename one of the colliding skills.
Co-authored-by: pol <pol.kuijken@gmail.com>
Removes the 'Launch hermes chat now? (Y/n)' prompt at the end of
hermes setup. The summary already prints 'Ready to go! → hermes'
so the auto-launch was redundant, and on macOS 26+ it could crash
in prompt_toolkit when setup was invoked from the curl install
script with stdin redirected from /dev/tty (#5884, #6128).
After setup, users run 'hermes' themselves like every other CLI
tool. Same pattern applies to the Windows installer.
Closes#6128 (narrower env-var-guarded fix superseded by removing
the prompt outright).
Adds an explicit API compatibility mode prompt to the `hermes model -> custom`
flow so Codex-compatible third-party endpoints (and any other non-default
backend whose URL doesn't match the existing heuristics in
`_detect_api_mode_for_url`) can be selected explicitly instead of silently
falling back to chat_completions.
Choices: Auto-detect / chat_completions / codex_responses / anthropic_messages.
Persists `api_mode` to:
- `model.api_mode` (active session config)
- the matching `custom_providers[*]` entry (so re-activating the named
provider next time replays the same transport)
Salvaged from PR #6125 onto current main: kept the new prompt and the
`_save_custom_provider(api_mode=...)` plumbing; the named-custom flow
already extracts and applies `api_mode` from the saved entry on current
main so those changes are preserved as-is. Test fixtures updated for the
new prompt and the existing display-name prompt.
Co-authored-by: littlewwwhite <1095245867@qq.com>
- memory_setup.py: use shlex.split() for plugin dep checks instead of shell=True
- transcription_tools.py: avoid shell=True for auto-detected whisper commands
(user-provided templates via env var still use shell=True for compatibility)
- cli.py: add comment clarifying intentional shell=True for user quick_commands
- Add test verifying auto-detected template is shlex-safe
Addresses CONTRIBUTING.md Priority #3 (Security hardening — shell injection).
These two functions in hermes_cli/profiles.py have no callers — the live
`hermes completion {bash,zsh}` command uses hermes_cli/completion.py's
generate_bash() / generate_zsh() instead. Multiple PRs (incl. #6141) tried
to fix the trailing-`_hermes "$@"` zsh bug here, only to discover the
patch never reached users. Delete the dead code so future contributors
patch the right file.
The actual user-facing fix lives in the preceding cherry-picked commits
to hermes_cli/completion.py.
Previously :latest tracked the tip of main, which meant pulling :latest
got you whatever was last merged — fine for development, surprising for
users who expect :latest to mean 'the most recent stable release'.
Reshape the publish flow so the floating tags carry their conventional
meaning:
- :sha-<sha> every main commit (unchanged, immutable)
- :main tip of main (NEW; what :latest used to do)
- :<release_tag> every published release, e.g. :v1.2.3 (unchanged)
- :latest most recent release (CHANGED; release-only now)
Implementation:
- Rename the move-latest job to move-main; it still gates on push to
main, still ancestor-checks the existing :main label before
retagging, still uses cancel-in-progress: false so queued moves run
serially.
- Add a new move-latest job gated on release: published. Reads the
OCI revision label off the existing :latest and only advances if
the release commit is a strict descendant. This keeps backport
releases on older branches (e.g. patching v1.1.5 after v1.2.3 has
already shipped) from dragging :latest backwards.
- merge job exposes pushed_release_tag and release_tag outputs so
move-latest knows when to fire and what to retag from.
Only Discord and Telegram had lazy-install hooks in their
check_*_requirements() functions. The remaining four platforms that were
moved to lazy_deps (Slack, Matrix, DingTalk, Feishu) would just return
False immediately if their packages weren't pre-installed — no attempt
to install them at runtime.
This means even with the .venv permissions fix (#24841), these four
platforms would still fail to load in Docker (or any fresh install)
unless the user manually ran pip install.
Add the same lazy_deps.ensure() pattern to all four, matching the
existing Discord/Telegram implementation.
Drops the duplicate _FILE_MUTATING_TOOLS frozenset in run_agent.py and
imports the canonical FILE_MUTATING_TOOL_NAMES from
agent/tool_result_classification.py (aliased as _FILE_MUTATING_TOOLS to
avoid renaming the existing call sites). Prevents future drift if
another file-mutating tool is added — only one set needs updating.
No behavior change: same frozenset({'write_file', 'patch'}), and the
117 PR-scoped tests still pass.
`lsp` is registered as a top-level subparser in `main()` (lines 9539-9545)
via `agent.lsp.cli.register_subparser`, so it shows up in `hermes --help`
output alongside the other built-ins. The `_BUILTIN_SUBCOMMANDS` set used
by `_plugin_cli_discovery_needed` to short-circuit the ~500-650ms plugin
import pass did not list it, so every `hermes lsp ...` invocation paid
the full discovery cost despite being a fully-built-in command.
This is also caught by the parity guard added in #22120:
`tests/hermes_cli/test_startup_plugin_gating.py::test_builtin_set_covers_every_registered_subcommand`
has been failing on clean origin/main with:
AssertionError: _BUILTIN_SUBCOMMANDS is missing these live
subcommands: ['lsp']. Add them to hermes_cli/main.py::_BUILTIN_SUBCOMMANDS
so plugin discovery can be skipped when the user targets them.
Fix: add `"lsp"` to the frozenset (alphabetical position between `logs`
and `mcp`). The accompanying `test_builtin_set_has_no_phantom_entries`
guard still passes because `lsp` is genuinely live — registered via the
guarded `try/except Exception` in main() since #24168.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Dockerfile permissions section made /opt/hermes/.venv readable but not
writable by the hermes runtime user. Since the 2026-05-12 policy change
moved messaging packages (discord.py, telegram, slack, etc.) out of [all]
and into lazy_deps.py, the Docker image no longer ships with them
pre-installed. At first gateway boot, lazy_deps.ensure() tries to
`uv pip install` them into the venv but fails with EACCES because
site-packages is root-owned.
The result: every messaging platform adapter silently fails to load inside
Docker containers, producing only a cryptic "discord.py not installed"
warning despite the gateway being correctly configured.
Two-part fix:
1. Dockerfile: add /opt/hermes/.venv to the existing chown -R hermes:hermes
line so the default (UID 10000) case works out of the box.
2. docker/entrypoint.sh: extend the needs_chown block to also re-chown the
.venv when HERMES_UID is remapped. Without this, the build-time chown
becomes stale when someone uses the documented HERMES_UID override in
docker-compose.yml.
Fixes#21536
Related: #17674, #21543, #21755
- Rename 'Alibaba Cloud (DashScope)' display label to 'Qwen Cloud'
in CANONICAL_PROVIDERS (model picker, /model, hermes model TUI) and
PROVIDER_REGISTRY (setup wizard prompts, status output).
- Move Qwen Cloud (alibaba) up to position 6 — directly below
OpenAI Codex and above Xiaomi MiMo.
- Move Qwen OAuth (Portal) (qwen-oauth) to the bottom of the
canonical provider list.
Provider slug 'alibaba' is unchanged — only the display label
moved. DashScope env var (DASHSCOPE_API_KEY) and base URL are
unchanged. The separate 'alibaba-coding-plan' plugin provider is
not affected.
* feat(nous): unified client=hermes-client-v<version> tag on every Portal request
Every Hermes request to Nous Portal now carries the same
client=hermes-client-v<__version__> tag (e.g. client=hermes-client-v0.13.0
on this release), sourced live from hermes_cli.__version__. The release
script's regex bump auto-aligns it on every release.
Centralized in agent/portal_tags.py and wired into all four call sites:
- NousProfile.build_extra_body (main agent loop, every chat completion)
- auxiliary_client.NOUS_EXTRA_BODY + _build_call_kwargs (aux client)
- run_agent.py compression-summary fallback path
- tools/web_tools.py web_extract fallback
Replaces the client=aux marker added in #24194 with the unified version
tag. Tests assert against the helper output (invariant) rather than the
literal string, so they don't need updating on every release.
* feat(nous): cover /goal judge and kanban specify aux paths
Two aux-using surfaces bypassed call_llm by invoking
client.chat.completions.create() directly without extra_body, so they
were missing the unified Portal client tag:
- hermes_cli/goals.py — /goal standing-goal judge
- hermes_cli/kanban_specify.py — kanban triage specifier
Both now pass extra_body=get_auxiliary_extra_body() or None so they
inherit the version tag when the aux client points at Nous Portal, and
emit nothing otherwise (no tag leak to OpenRouter/Anthropic auxes).
The long-lived prefix-cache layout split the system prompt into stable/
context/volatile blocks and re-derived them on every API call. The
volatile tier (timestamp + memory snapshot + USER profile) ticks per
turn, so the system message bytes mutated mid-conversation and broke
upstream prompt caches (OpenRouter, Nous Portal, Anthropic).
Diagnosed via live wire-format diffing: an 8-turn conversation showed
OLD layout flipping system block[1] sha mid-session at the minute
boundary, dropping cached_tokens to 0 on that turn (cumulative
66.6% vs 83.3% for the single-block layout). Hermes invariant:
history (system + all but the last 1-2 messages) must be static.
Fix: drop the long-lived layout entirely. Single layout everywhere —
system_and_3 with one cached system string built once on first turn,
replayed verbatim on every subsequent turn. Loses cross-session 1h
prefix caching for Claude (the feature that motivated the split), but
within-session caching now actually works on every provider.
Removed:
- run_agent.py: _use_long_lived_prefix_cache flag, _long_lived_cache_ttl,
_supports_long_lived_anthropic_cache method, the long-lived branch in
run_conversation, mark_tools_for_long_lived_cache call site
- agent/prompt_caching.py: apply_anthropic_cache_control_long_lived,
mark_tools_for_long_lived_cache, _mark_system_stable_block helper
- hermes_cli/config.py: prompt_caching.long_lived_prefix and
prompt_caching.long_lived_ttl config keys
- tests/agent/test_prompt_caching_live.py (entire file)
- tests/agent/test_prompt_caching.py: TestMarkToolsForLongLivedCache,
TestApplyAnthropicCacheControlLongLived
- tests/run_agent/test_anthropic_prompt_cache_policy.py:
TestSupportsLongLivedAnthropicCache
Targeted tests: 62/62 pass.
When switching models via /model, AIAgent._config_context_length was
never cleared, so the new model inherited the previous model's context
window instead of auto-detecting the correct one via
get_model_context_length().
Clear _config_context_length to None before the runtime field swap so
the full resolution chain (custom_providers per-model, endpoint probe,
models.dev, etc.) is re-evaluated for the newly selected model.
Closes#21509
The test_restart_command_while_busy_requests_drain_without_interrupt test
was asserting against a hardcoded emoji string that was valid before the
i18n migration. After gateway/run.py switched to t("gateway.draining",
count=N), the test sees the translated output (or the raw key when the
locale catalog isn't resolved in xdist workers).
Fix by asserting against t("gateway.draining", count=1) — this produces
the correct expected value regardless of whether the locale file is
available in the test environment.
Default timeout raised from 60s to 300s (5 minutes) to accommodate
slower systems like Unraid NAS. Configurable via WHATSAPP_NPM_INSTALL_TIMEOUT
environment variable.
The live adapter path in _send_via_adapter called adapter.send() without
passing thread_id, while the standalone fallback path correctly forwarded
it. For plugin platforms (google_chat, teams, irc, line) running with the
gateway in-process, this caused every threaded reply to land as a new
top-level message instead of continuing the thread.
Matches the pattern already used by _send_matrix_via_adapter and
_send_feishu: build metadata={"thread_id": thread_id} and pass it through.
The WeCom adapter's _listen_loop() automatically reconnects when the
WebSocket drops, but it never called _mark_connected() after a successful
reconnection. This left the runtime status file (gateway_state.json) stuck
in "disconnected" even though the adapter was fully operational again.
Add self._mark_connected() right after _open_connection() succeeds so
that the dashboard and health probes report the correct state.
Tested by forcing a WebSocket close via the heartbeat loop and verifying
that the status file updated from "disconnected" back to "connected".
The LINE adapter calls self.create_source(...) which raises
AttributeError on every inbound message — no such method exists.
The base PlatformAdapter exposes this factory as build_source(),
consistent with the IRC and Teams adapters.
Fixes#23728
GLM-family models (z-ai/glm-4.5-air, z-ai/glm-4.5-flash, etc.) exhibit
the same "describe-instead-of-call" failure mode that gpt/codex/gemini/
gemma/grok already trigger enforcement for. Without the injection,
free-tier GLM workers spawned by the kanban dispatcher routinely exit
cleanly (rc=0) without invoking kanban_complete or kanban_block,
producing the "protocol violation" error and triggering the dispatcher's
gave_up path.
Observed in real workloads: seven consecutive kanban tasks across three
GLM-tier profiles (shipbackend, frontend-engineer, backend-engineer) all
failed with the identical message:
worker exited cleanly (rc=0) without calling kanban_complete or
kanban_block — protocol violation
Re-running the same tasks on Claude Haiku immediately resolved them.
Adding "glm" to TOOL_USE_ENFORCEMENT_MODELS closes the gap so future
GLM-routed work receives the explicit "every response must contain a
tool call or final result" steering that already protects the other
enforcement-gated model families.
One-line change; no behavior change for non-GLM models.
PR #23458 introduced _send_message_with_thread_fallback() and applied it
to all control-style sends (send_update_prompt, send_approval_request,
send_model_picker_prompt), but the slash-confirm result message in
handle_callback_query still called self._bot.send_message directly.
In supergroups with stale message_thread_id on the callback's parent
message, this raises "Message thread not found" and silently swallows
the result text. Replace with the helper so the same retry-without-
thread-id logic applies.
Autostash creates refs/stash as a pointer to the latest stash commit, but
git stash apply/drop expect the symbolic ref format like stash@{0}, not
the raw commit SHA. Using the commit SHA causes: error: 'X is not a stash reference'
- Note that typescript-language-server pulls in the typescript SDK
automatically (peer-dep relationship was previously implicit and
caused initialize failures when the SDK was absent).
- Add a Troubleshooting entry for the new Backend warnings section
in hermes lsp status, with the shellcheck install commands across
apt / brew / scoop.
Reflects what shipped in PR #24630.
_session_info() used os.getcwd() which reflects the gateway process
working directory, not the user's actual working directory. This caused
the TUI status line to display incorrect paths (e.g. D:\HermesWork
instead of D:\Hermes\HermesWork) after agent turns that changed the
process cwd.
Align with session.create which already correctly reads TERMINAL_CWD
env var set by the CLI launcher.
In WSL2, sounddevice.query_devices() returns [] even when the
PulseAudio bridge is functional. The existing code already handled
the case where the query itself raises an exception, but it missed
the empty-list case.
This change treats an empty device list as non-fatal in WSL when
PULSE_SERVER is configured, matching the existing exception-handler
behavior.
Fixes: WSL users seeing 'No audio input/output devices detected'
even though paplay/arecord work fine.
Closes#23064
When Hermes connects to Signal via signal-cli in daemon mode (linked
device setup), group messages sent from the user's phone were silently
dropped. The syncMessage handler only processed events where
destinationNumber equals the bot's own number (Note to Self).
Group messages from linked devices carry a groupInfo.groupId instead of a
destinationNumber. Extend the condition to also pass through sync messages
that have a groupId, so group messages are promoted to dataMessage and
reach the agent.
PR #24151 routed Portal Qwen (qwen3.6-plus) through the prefix_and_2
long-lived cache layout, attaching {"type":"ephemeral","ttl":"1h"}
markers to the tools[-1] entry and the stable system-prefix block.
That layout works for Portal Claude because Anthropic / OpenRouter on
Anthropic routes honour 1h TTL — but Portal Qwen ultimately proxies to
Alibaba DashScope, which documents a single "ephemeral" TTL of 5
minutes on its Context Cache. The ttl="1h" qualifier is silently
dropped upstream, so the two highest-value breakpoints (tools array +
system prefix) never land. Only the rolling-window 5m markers on the
last 2 messages cache, which matches the observed ~25% read rate.
Fix: keep Portal Qwen on cache_control via _anthropic_prompt_cache_policy
returning (True, False), but drop it from _supports_long_lived_anthropic_cache
so it rides the standard system_and_3 5m layout (system + last 3 messages,
all at 5m). Same 4 breakpoints, all in a TTL the upstream actually honours.
Refs: https://www.alibabacloud.com/help/en/model-studio/context-cachehttps://openrouter.ai/docs/features/prompt-caching (Alibaba Qwen
section: "TTL: 5 minutes")
- _supports_long_lived_anthropic_cache: Portal scope narrowed back to Claude
- tests: flip the two qwen long-lived expectations to False, retitle
non_claude_non_qwen_rejected -> non_claude_rejected
Cron jobs using `deliver: whatsapp` were silently dropped because the
resolver's home-channel env var dict in cron/scheduler.py listed every
messaging platform except whatsapp. _resolve_delivery_targets() returned
[] and no message was sent — but jobs.json marked the run successful and
no log line surfaced the failure.
The gateway adapter and the send_message tool path both honored
WHATSAPP_HOME_CHANNEL correctly; only the cron path missed.
Adds 'whatsapp' -> 'WHATSAPP_HOME_CHANNEL' to _HOME_TARGET_ENV_VARS.
Verified end-to-end with multiple cron pings landing in WhatsApp
self-chat after the fix.
Fixes#22997
Tavily's /crawl endpoint requires Authorization: Bearer <key> in the header,
unlike /search and /extract which accept api_key in the JSON body.
Without the header, crawl returns 401 Unauthorized.
Xiaomi MiMo's /v1/models endpoint returns 401 even with a valid API key,
causing hermes doctor to falsely report 'invalid API key'.
Add a `supports_health_check` field to ProviderProfile (default True).
Providers whose /models endpoint doesn't support auth verification can
set it to False. The doctor's dynamic provider discovery now reads this
field instead of hardcoding True.
The xiaomi provider plugin sets supports_health_check=False.
_parse_target_ref() has no handler for XMPP JIDs (user@server or
room@conference.server), so they fall through to the final
`return None, None, False`. This causes send_message to fail when
targeting an XMPP chat by JID, since the JID is not numeric and
doesn't match any other platform pattern.
Add an explicit check for XMPP targets containing '@', matching the
existing Matrix pattern above it.
Salvage of #21063 — adds 'Weixin, and more' to module-level docstrings
in gateway/__init__.py, gateway/config.py, gateway/platforms/base.py
and the 'hermes gateway' subparser description.
Co-authored-by: wuwuzhijing <chuang.guo@hopechart.com>
Three follow-ups to PR #24168 found during live E2E testing on TS/bash files:
1. typescript-language-server now installs the typescript SDK (tsserver)
alongside it. Without that sibling install, initialize() failed with
"Could not find a valid TypeScript installation" and the server was
marked broken — no diagnostics ever reached the agent. New extra_pkgs
field on INSTALL_RECIPES makes that explicit and reusable for future
peer-dep cases.
2. _check_lint now treats "linter command exists on PATH but cannot
actually run" as skipped instead of error. The motivating case is
npx tsc when typescript is not in node_modules — npx prints its
"This is not the tsc command you are looking for" banner and exits
non-zero, which previously blocked the LSP semantic tier (gated on
success or skipped). Pattern-matched per base command (npx,
rustfmt, go) so genuine lint errors still flow through normally.
3. hermes lsp status now surfaces a Backend warnings section when
bash-language-server is installed but shellcheck is missing. The
server itself spawns fine but bash-language-server delegates
diagnostics to shellcheck — without it on PATH the integration
looks alive but never reports any problems. Same warning is
logged once at server spawn time.
Validation:
- 12 new tests in tests/agent/lsp/test_install_and_lint_fixes.py:
* recipe carries typescript SDK
* _install_npm passes both pkg + extras to npm CLI
* backwards compat: recipes without extras still work
* _backend_warnings quiet when bash absent / both present
* _backend_warnings fires when bash installed without shellcheck
* status output includes the Backend warnings section
* _looks_like_linter_unusable catches the npx tsc banner
* real TS type errors not misclassified as unusable
* unfamiliar linters fall through normally
* _check_lint returns skipped on npx tsc unusable
* _check_lint returns error on real tsc type errors
- Full lsp + file_operations test suite: 245/245 pass
- Live E2E:
* try_install("typescript-language-server") installs both packages
into node_modules
* write_file(bad.ts, ...) returns lint=skipped + lsp_diagnostics
with two real TS errors (was lint=error, no lsp_diagnostics)
* hermes lsp status renders the shellcheck warning when bash is
installed but shellcheck is not on PATH
When the user runs /stop or a session is interrupted mid-flight, the
👀 in-progress reaction lingered on the user's message indefinitely.
Without another agent run to swap it for 👍/👎, the eyes stayed there
forever — visually misleading (looks like the agent is still working).
Fix: on ProcessingOutcome.CANCELLED, call set_message_reaction with
reaction=None to clear all reactions on the message. Documented Bot API
semantics (equivalent to Bot API 10.0's deleteMessageReaction, but works
on PTB 22.6 already without the version bump).
Test changes:
- Renamed test_on_processing_complete_cancelled_keeps_existing_reaction
→ test_on_processing_complete_cancelled_clears_reaction; updated
assertion to expect set_message_reaction(reaction=None).
- Added test_on_processing_complete_cancelled_skipped_when_disabled
(TELEGRAM_REACTIONS=false short-circuits).
- Added test_clear_reactions_handles_api_error_gracefully and
test_clear_reactions_returns_false_without_bot to cover the new
_clear_reactions helper.
The fuzzy @-file completer shells out to 'rg --files' via subprocess.run
with text=True. On Windows, Python 3.13 decodes stdout using the system
ANSI codepage (cp1252), so any filename containing bytes like 0x81/0x8f
crashes the background reader thread with UnicodeDecodeError. The
exception is swallowed inside subprocess, leaving proc.stdout=None, and
the next line ('proc.stdout.strip()') blows up with:
AttributeError: 'NoneType' object has no attribute 'strip'
This takes down the prompt_toolkit event loop and forces 'Press ENTER to
continue' until the user clears the @-query.
Fix:
- Pass encoding='utf-8', errors='replace' so rg's UTF-8 output is decoded
consistently across platforms and unmappable bytes don't crash.
- Guard 'proc.stdout' with a None check before .strip(), so a future
reader-thread failure degrades gracefully instead of breaking input.
Replace `len(label)` with `HermesCLI._status_bar_display_width(label)`
in two places where the response box top border is rendered.
`len()` counts characters, not terminal columns. CJK characters like
`测` and `试` each occupy 2 columns, causing the top border
`╭─ 测试 ───╮` to render 2 columns wider than the bottom border
`╰─────────╯`.
The `_status_bar_display_width` helper already exists (line 2881) and
uses `prompt_toolkit.utils.get_cwidth` for proper CJK width calculation.
When TUI exits, tmux captures some TUI output into its scrollback buffer.
On restart, stale scrollback content appears at the top of screen before
AlternateScreen takes over.
Add ANSI escape sequences at startup:
- ESC[2J clear visible screen
- ESC[H cursor home
- ESC[3J clear scrollback buffer
Replace the hardcoded i18n placeholder "~/.hermes/config.yaml" with the
real config_path returned from api.getStatus(), falling back to the i18n
string while loading or on API failure.
Co-authored-by: aqilaziz <gonzes7@gmail.com>
Fixes#24127
On headless Linux VPS (no DISPLAY or WAYLAND_DISPLAY), some Python
webbrowser backends register TUI programs such as links, lynx, or
www-browser. GenericBrowser.open() spawns these without redirecting
stdin/stdout, allowing them to take over the terminal. This can cause
the process to receive SIGHUP and exit immediately even though uvicorn
bound the port successfully, producing a misleading success message
followed by an empty --status.
Fix: detect headless Linux at startup and skip the auto-open when no
display server is available. On such systems the URL is still printed
so the user can open it manually or via an SSH tunnel. The webbrowser
call is also wrapped in a try/except so any unexpected failure on other
platforms is silently absorbed rather than surfacing as an unhandled
exception in the daemon thread.
_resolve_task_provider_model drops cfg_base_url and cfg_api_key when
returning a named provider, causing configured API keys and base URLs
to be lost. Pass them through so named providers can use custom
endpoints while still resolving credentials from provider-specific
env vars.
Closes#20139
Built-in commands with required args (e.g. /queue, /steer, /background)
were excluded from Telegram setMyCommands output, making them invisible
in the autocomplete menu. However, their handlers already return usage
text when invoked without arguments, so hiding them hurts discoverability.
This commit removes the _requires_argument filter for built-in commands
(COMMAND_REGISTRY) while keeping it for plugin-registered slash commands,
which may not provide a no-arg usage fallback.
Closes#24312
The clarify tool returned 'not available in this execution context' for
every gateway-mode agent because gateway/run.py never passed
clarify_callback into the AIAgent constructor. Schema actively encouraged
calling it; users never saw the question.
Changes:
- tools/clarify_gateway.py — new event-based primitive mirroring
tools/approval.py: register/wait_for_response/resolve_gateway_clarify
with per-session FIFO, threading.Event blocking with 1s heartbeat
slices (so the inactivity watchdog keeps ticking), and
clear_session for boundary cleanup.
- gateway/platforms/base.py — abstract send_clarify with a numbered-text
fallback so every adapter (Discord, Slack, WhatsApp, Signal, Matrix,
etc.) gets a working clarify out of the box. Plus an active-session
bypass: when the agent is blocked on a text-awaiting clarify, the next
non-command message routes inline to the runner's intercept instead
of being queued + triggering an interrupt. Same shape as the /approve
deadlock fix from PR #4926.
- gateway/platforms/telegram.py — concrete send_clarify renders one
inline button per choice plus '✏️ Other (type answer)'. cl: callback
handler resolves numeric choices immediately, flips to text-capture
mode for Other, with the same authorization guards as exec/slash
approvals.
- gateway/run.py — clarify_callback wired at the cached-agent per-turn
callback assignment site (only the user-facing agent path; cron and
hygiene-compress agents have no human attached). Bridges sync→async
via run_coroutine_threadsafe, blocks with the configured timeout, and
returns a '[user did not respond within Xm]' sentinel on timeout so
the agent adapts rather than pinning the running-agent guard. Text-
intercept added to _handle_message before slash-confirm intercept
(skipping slash commands). clear_session called in the run's finally
to cancel any orphan entries.
- hermes_cli/config.py — agent.clarify_timeout default 600s.
- website/docs/user-guide/messaging/telegram.md — Interactive Prompts
section.
Tests:
- tests/tools/test_clarify_gateway.py (14 tests) — full primitive
coverage: button resolve, open-ended auto-await, Other flip, timeout
None, unknown-id idempotency, clear_session cancellation, FIFO
ordering, register/unregister notify, config default.
- tests/gateway/test_telegram_clarify_buttons.py (12 tests) — render
paths (multi-choice/open-ended/long-label/HTML-escape/not-connected),
callback dispatch (numeric resolve/Other flip/already-resolved/
unauthorized/invalid-token), and base-adapter text fallback.
Out of scope: bot-to-bot, guest mode, checklists, poll media, live
photos. Closes#24191.
PR #24500 introduced stale-lock detection that calls
`_looks_like_gateway_process` to confirm a running PID is not an
unrelated process that reused the slot. On Windows neither `/proc`
nor `ps` is available, so `_read_process_cmdline` always returns
`None` and `_looks_like_gateway_process` always returns `False` —
causing every valid Windows gateway lock to be marked stale and
immediately evicted.
Fix: after `_looks_like_gateway_process` returns `False`, call
`_read_process_cmdline` directly. If the result is non-`None` the
live cmdline was readable and confirms the PID is foreign → stale.
If it is `None` (cmdline unreadable, e.g. Windows without ps), fall
back to `_record_looks_like_gateway` which validates the stored
`argv` the gateway wrote into the lock file at startup. Both
oracles must say "not a gateway" before the lock is evicted — the
same two-oracle pattern already used in `get_running_pid` (line 941).
Adds a regression test that simulates a Windows host where
`_looks_like_gateway_process` returns `False` for every PID and
`_read_process_cmdline` returns `None`, confirming the lock is kept
when the record's argv identifies it as a gateway process.
deepseek-v4-pro has been routable since v0.12 but was missing from
the _OFFICIAL_DOCS_PRICING table. Sessions using this model showed
as "unknown cost" in hermes insights instead of a dollar estimate.
Add pricing entry using published list prices:
- input: \$1.74/M tokens
- output: \$3.48/M tokens
- cache_read: \$0.0145/M tokens
Uses standard list rates (not the 75% promo) so estimates remain
accurate after promo expires 2026-05-31.
Closes#24218
* feat(lsp): semantic diagnostics from real language servers in write_file/patch
Wire ~26 language servers (pyright, gopls, rust-analyzer, typescript-language-server,
clangd, bash-language-server, ...) into the post-write lint check used by write_file
and patch. The model now sees type errors, undefined names, missing imports, and
project-wide semantic issues introduced by its edits, not just syntax errors.
LSP is gated on git workspace detection: when the agent's cwd or the file being
edited is inside a git worktree, LSP runs against that workspace; otherwise the
existing in-process syntax checks are the only tier. This keeps users on
user-home cwds (Telegram/Discord gateway chats) from spawning daemons.
The post-write check is layered: in-process syntax check first (microseconds),
then LSP semantic diagnostics second when syntax is clean. Diagnostics are
delta-filtered against a baseline captured at write start, so the agent only
sees errors its edit introduced. A flaky/missing language server can never
break a write -- every LSP failure path falls back silently to the syntax-only
result.
New module agent/lsp/ split into:
- protocol.py: Content-Length JSON-RPC framer + envelope helpers
- client.py: async LSPClient (spawn, initialize, didOpen/didChange,
ContentModified retry, push/pull diagnostic stores)
- workspace.py: git worktree walk-up + per-server NearestRoot resolver
- servers.py: registry of 26 language servers (extension match,
root resolver, spawn builder per language)
- install.py: auto-install dispatch (npm install --prefix, go install
with GOBIN, pip install --target) into HERMES_HOME/lsp/bin/
- manager.py: LSPService (per-(server_id, root) client registry, lazy
spawn, broken-set, in-flight dedupe, sync facade for tools layer)
- reporter.py: <diagnostics> block formatter (severity-1-only, 20-per-file)
- cli.py: hermes lsp {status,list,install,install-all,restart,which}
Wired into tools/file_operations.py:
- write_file/patch_replace now call _snapshot_lsp_baseline before write
- _check_lint_delta gains a third tier: LSP semantic diagnostics when
syntax is clean
- All LSP code paths swallow exceptions; write_file's contract unchanged
Config: 'lsp' section in DEFAULT_CONFIG with enabled (default true),
wait_mode, wait_timeout, install_strategy (default 'auto'), and per-server
overrides (disabled, command, env, initialization_options).
Tests: tests/agent/lsp/ -- 49 tests covering protocol framing (encode and
read_message round-trip, EOF/truncation/missing Content-Length), workspace
gate (git walk-up, exclude markers, fallback to file location), reporter
(severity filter, max-per-file cap, truncation), service-level delta filter,
and an in-process mock LSP server that exercises the full client lifecycle
including didChange version bumps, dedup, crash recovery, and idempotent
teardown.
Live E2E verified end-to-end through ShellFileOperations: pyright
auto-installed via npm into HERMES_HOME, baseline captured, type error
introduced, single delta diagnostic surfaced with correct line/column/code/
source, then patch fix removes the diagnostic from the output.
Docs: new website/docs/user-guide/features/lsp.md page covering supported
languages, configuration knobs, performance characteristics, and
troubleshooting; cli-commands.md updated with the 'hermes lsp' reference;
sidebar updated.
* feat(lsp): structured logging, backend gate, defensive walk caps
Cherry-picks the substantive ideas from #24155 (different scope, same
problem space) onto our PR.
agent/lsp/eventlog.py (new): dedicated structured logger
``hermes.lint.lsp`` with steady-state silence. Module-level dedup sets
keep a 1000-write session at exactly ONE INFO line ("active for
<root>") at the default INFO threshold; clean writes log at DEBUG so
they never reach agent.log under normal config. State transitions
(server starts, no project root for a file, server unavailable) fire
at INFO/WARNING once per (server_id, key); novel events (timeouts,
unexpected errors) fire WARNING per call. Grep recipe: ``rg 'lsp\\['``.
agent/lsp/manager.py: wire the eventlog into _get_or_spawn and
get_diagnostics_sync so users can answer "did LSP fire on this edit?"
with a single grep, plus surface "binary not on PATH" warnings once
instead of silently retrying every write.
tools/file_operations.py: backend-type gate. ``_lsp_local_only()``
returns False for non-local backends (Docker / Modal / SSH /
Daytona); ``_snapshot_lsp_baseline`` and ``_maybe_lsp_diagnostics``
now skip entirely on remote envs. The host-side language server
can't see files inside a sandbox, so this prevents pretending to
lint a file the host process can't open.
agent/lsp/protocol.py: 8 KiB cap on the header block in
``read_message``. A pathological server that streams headers
without ever emitting CRLF-CRLF would have looped forever consuming
bytes; now raises ``LSPProtocolError`` instead.
agent/lsp/workspace.py: 64-step cap on ``find_git_worktree`` and
``nearest_root`` upward walks, plus try/except containment around
``Path(...).resolve()`` and child ``.exists()`` calls. Defensive
against pathological inputs (symlink loops, encoding errors,
permission failures mid-walk) — the lint hook is hot-path code and
must never raise.
Tests:
- tests/agent/lsp/test_eventlog.py: 18 tests covering steady-state
silence (clean writes stay DEBUG), state-transition INFO-once
semantics (active for, no project root), action-required
WARNING-once (server unavailable), per-call WARNING (timeouts,
spawn failures), and the "1000 clean writes => 1 INFO" contract.
- tests/agent/lsp/test_backend_gate.py: 5 tests verifying
_lsp_local_only / snapshot_baseline / maybe_lsp_diagnostics skip
the LSP layer for non-local backends and route correctly for
LocalEnvironment.
- tests/agent/lsp/test_protocol.py: new test_read_message_rejects_runaway_header
exercising the 8 KiB cap.
Validation:
- 73/73 LSP tests pass (49 original + 18 eventlog + 5 backend-gate + 1 framer cap)
- 198/198 pass when run alongside existing file_operations tests
- Live E2E re-run with pyright still surfaces "ERROR [2:12] Type
... reportReturnType (Pyright)" through the full path, then patch
fix removes it on the next call.
* feat(lsp): atexit cleanup + separate lsp_diagnostics JSON field
Two improvements salvaged from #24414's plugin-form alternative,
keeping our core-integrated design:
1. atexit cleanup of spawned language servers
----------------------------------------------------------------
``agent/lsp/__init__.get_service`` now registers an ``atexit``
handler on first creation that tears down the LSPService on
Python exit. Without this, every ``hermes chat`` exit was
leaking pyright/gopls/etc. processes for a few seconds while
their stdout buffers drained -- they got reaped by the kernel
eventually but a watchful ``ps aux`` would catch them.
The handler runs once per process (gated by
``_atexit_registered``); idempotent ``shutdown_service``
ensures double-fire is a no-op. Errors during shutdown are
swallowed at debug level since by the time atexit fires the
user has already seen the agent's final response.
2. Separate ``lsp_diagnostics`` field on WriteResult / PatchResult
----------------------------------------------------------------
Previously the LSP layer folded its diagnostic block into the
``lint.output`` string, conflating the syntax-check tier with
the semantic tier. The agent (and any downstream parsers) now
read syntax errors and semantic errors as independent signals:
{
"bytes_written": 42,
"lint": {"status": "ok", "output": ""},
"lsp_diagnostics": "<diagnostics file=...>\nERROR [2:12] ..."
}
``_check_lint_delta`` returns to its original two-tier shape
(syntax check + delta filter); ``write_file`` and
``patch_replace`` independently fetch LSP diagnostics via
``_maybe_lsp_diagnostics`` and pass them into the new field.
``patch_replace`` propagates the inner write_file's
``lsp_diagnostics`` so the outer PatchResult carries the patch's
delta correctly.
Tests: 19 new
- tests/agent/lsp/test_lifecycle.py (8 tests): atexit registration
fires once and only once across N get_service calls; the
registered callable is our internal shutdown wrapper;
shutdown_service is idempotent and safe when never started;
exceptions during shutdown are swallowed; inactive service is
cached so we don't rebuild on every check.
- tests/agent/lsp/test_diagnostics_field.py (11 tests): WriteResult
/ PatchResult dataclass shape, to_dict include/omit semantics,
channel separation (lint and lsp_diagnostics carry independent
signals), write_file populates the field via
_maybe_lsp_diagnostics only when the syntax tier is clean,
patch_replace propagates the field forward from its internal
write_file.
Validation:
- 92/92 LSP tests pass (73 prior + 8 lifecycle + 11 diagnostics field)
- 217/217 pass with file_operations + LSP combined
- Live E2E reverified: clean writes -> both fields empty/none; type
error introduced -> lint clean (parses), lsp_diagnostics carries
the pyright reportReturnType block; patch fix -> both fields
clean again.
* fix(lsp): broken-set short-circuit so a wedged server isn't paid every write
Discovered while auditing failure paths: a language server binary that
hangs (sleep forever, no LSP traffic on stdin/stdout) caused EVERY
subsequent write to re-pay the 8s snapshot_baseline timeout. Five
writes = ~64s of dead time.
The bug: ``_get_or_spawn`` adds the (server_id, root) pair to
``_broken`` inside its inner exception handler, but when the OUTER
``_loop.run`` timeout fires, it cancels the inner task before that
handler runs. The pair never makes it to broken-set, so the next
write re-enters the spawn path and re-pays the timeout.
Fix:
- New ``_mark_broken_for_file`` helper at the service layer marks
the (server_id, workspace_root) pair broken from the OUTSIDE when
the outer timeout fires. Called from the except branches in
``snapshot_baseline``, ``get_diagnostics_sync`` (asyncio.TimeoutError
+ generic Exception). Also kills any orphan client process that
survived the cancelled future, fire-and-forget with a 1s ceiling.
- ``enabled_for`` now consults the broken-set BEFORE returning True.
Files in already-broken (server_id, root) pairs short-circuit to
False, so the file_operations layer skips the LSP path entirely
with no spawn cost. Until the service is restarted (``hermes lsp
restart``) or the process exits.
- A single eventlog WARNING is emitted on first mark-broken so the
user knows which server gave up. Subsequent edits in the same
project stay silent.
Tests: 7 new in tests/agent/lsp/test_broken_set.py — covers the
key shape (server_id, per_server_root), enabled_for short-circuit,
sibling-file skip in same project, project isolation (broken in
A doesn't affect B), graceful no-op for missing-server / no-workspace,
and an end-to-end test that snapshots after a failure and verifies
the next ``enabled_for`` returns False.
Validation:
- Live retest of the wedged-binary scenario: 5 sequential writes,
first 8.88s (the one snapshot timeout), subsequent four ~0.84s
(no LSP cost). Down from 5x12.85s = 64s before this fix.
- 99/99 LSP tests pass (92 prior + 7 broken-set)
- 224/224 pass with file_operations + LSP combined
- Happy path E2E reverified — clean write, type error introduced,
patch fix all behave correctly with the new broken-set logic.
Note: the FIRST write to a wedged binary still pays 8s (the
snapshot_baseline timeout). We could shorten that, but pyright/
tsserver normally take 2-3s and slow CI rust-analyzer can need
5+ seconds, so 8s is the conservative ceiling. Subsequent writes
are instant.
Daytona ships breaking SDK changes on June 10, 2026 — `list()` returns
an iterator and the `page=` offset parameter is removed. We pin
daytona==0.155.0 so we're past the May 24 hard-cutoff, but the
legacy-sandbox resume path in DaytonaEnvironment still passes `page=1`
and reads `.items` off the result.
Switch to `next(iter(results), None)` against a single-result
`list(labels=..., limit=1)` call. Update tests to use `iter([...])`
and drop the `page=1` kwarg from list() assertions.
Adds behavior detail to the existing 'Externally managed Camofox sessions'
subsection in features/browser.md:
- Three-row settings table (config key + env var + effect).
- 'What changes when user_id is set' — soft-cleanup behavior, why
DELETE /sessions/<user_id> is skipped.
- 'How tab adoption works' — 4-step lookup against GET /tabs, listItemId
matching, fallback to new-tab creation, no mid-run re-polling.
- Picking session_key: how to attach to a specific existing tab vs
share-profile-only behavior with the default per-task session_key.
- Concurrency note that Camofox does not arbitrate per-tab focus.
Allow integrations to share a visible Camofox identity with Hermes and recover existing tabs without carrying local patches.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(install): use `--extra all` not `--all-extras`; drop lazy-covered extras from [all]
Two coupled fixes for the Windows install hang where uv sync built
python-olm from sdist and failed on missing make.
# Root cause: --all-extras vs --extra all (credit: ethernet)
`uv sync --all-extras` installs every key in [project.optional-
dependencies], bypassing the curated [all] extra entirely. So even
when [all] excluded [matrix], [rl], [yc-bench], etc., the installer
pulled them anyway because they were still defined as extras. On
Windows that meant python-olm (no wheel, needs make to build from
sdist) and the install died there.
The right flag is `--extra all` — install just the [all] extra's
contents, respecting curation. Empirically verified via dry-run:
--all-extras: pulls python-olm, mautrix, ctranslate2, onnxruntime,
atroposlib, tinker, wandb, modal, daytona, vercel,
python-telegram-bot, discord.py, slack-bolt,
dingtalk-stream, lark-oapi, anthropic, boto3,
edge-tts, elevenlabs, exa-py, fal-client, faster-
whisper, firecrawl-py, honcho-ai, parallel-web
--extra all: pulls none of those — just [all]'s curated set
Dockerfile already uses `--extra all` (with comment explaining the
gotcha) — knowledge existed; the gap was install.sh / install.ps1 /
setup-hermes.sh.
Sites fixed: scripts/install.sh L1118, scripts/install.ps1 L809,
setup-hermes.sh L245.
# Companion fix: drop lazy-covered extras from [all]
`tools/lazy_deps.py` already covers anthropic, bedrock, exa,
firecrawl, parallel-web, fal, edge-tts, elevenlabs, modal, daytona,
vercel, all messaging platforms (telegram/discord/slack/matrix/
dingtalk/feishu), honcho, and faster-whisper. They were ALSO in
[all], which defeats the whole point of lazy-install — fresh
installs eager-pulled them and inherited whatever was broken
upstream (the matrix → python-olm → no Windows wheel chain being
the proximate symptom).
[all] now contains only what genuinely can't be lazy-installed:
cron, cli, dev, pty, mcp, homeassistant, sms, acp, google, web,
youtube. Same trim applied to [termux-all]. New regression test
asserts the contract: every extra in LAZY_DEPS must NOT also appear
in [all].
# Companion fix: surface uv progress + errors
setup-hermes.sh's hash-verified path swallowed uv's stderr to a
tempfile, identical to the install.sh bug fixed in PR #24504. Same
fix applied: stream stderr through directly so users see live
progress instead of staring at a frozen prompt.
# Files
- pyproject.toml: trim [all] and [termux-all] to non-lazy extras only.
- scripts/install.sh: --all-extras → --extra all; trim _ALL_EXTRAS /
_PYPI_EXTRAS to match.
- scripts/install.ps1: --all-extras → --extra all; trim $allExtras /
$pypiExtras to match.
- setup-hermes.sh: --all-extras → --extra all; stream stderr.
- tests/test_project_metadata.py: invert matrix-in-[all] assertion;
add lazy-coverage contract test.
- uv.lock: regenerated.
# Validation
5/5 metadata tests pass. 37/37 in update_autostash + tool_token_
estimation. `uv lock --check` passes. Empirical dry-run confirms
`--extra all` excludes python-olm + RL chain on the new lockfile.
* fix(install): parse [all] from pyproject.toml instead of mirroring it
ethernet's review point: the previous patch left two hand-mirrored
copies of [all]'s contents (in install.sh's $_ALL_EXTRAS and
install.ps1's $allExtras). That guarantees future drift the next
time pyproject.toml's [all] changes.
Now both scripts parse pyproject.toml at install time using stdlib
tomllib (Python 3.11+, which the bootstrap step already requires).
Single source of truth. The only purpose of the parsed list is to
build the 'Tier 2: [all] minus broken extras' fallback spec — so we
parse, filter against $brokenExtras, and rebuild the .[a,b,c] spec.
Also: removed redundant fallback tiers.
Before: Tier 1 [all]
Tier 2 [all] minus broken
Tier 3 PyPI-only extras (no git deps)
Tier 4 [web,mcp,cron,cli,messaging,dev]
Tier 5 .
After: Tier 1 [all]
Tier 2 [all] minus broken
Tier 3 .
Tier 3 (PyPI-only) and Tier 4 (dashboard+core) used to dodge the [rl]
git+sdist deps and the [matrix] python-olm build. Both are no longer
in [all] post-2026-05-12 lazy-install migration, so the carve-out
tiers had no remaining content. Tier 4 also referenced [messaging],
which is now lazy-installed — the hardcoded fallback was actually
inconsistent with the new policy.
Defensive fallback: if tomllib parse fails (corrupted pyproject,
unexpected schema), Tier 2 collapses to '.[all]' (same as Tier 1) so
the broken-extras path becomes a no-op rather than crashing.
* fix(gateway): hide Matrix from setup picker on Windows
Matrix is the one messaging platform that has no working install path
on Windows: [matrix] -> mautrix[encryption] -> python-olm, which has
Linux-only wheels and needs make + libolm to build from sdist. The
[all] cleanup in this PR keeps mautrix out of fresh installs, but a
user who picked Matrix in 'hermes setup gateway' would still walk
into the same sdist build failure when the wizard tried to install
the extra.
Hide the option at the picker so users never get the chance to try.
The gate lives in _all_platforms() — single source of truth for the
setup wizard, the curses gateway-config menu, and any future picker.
Adapter loading at runtime is intentionally NOT gated: users who
already have MATRIX_* env vars set (e.g. config copied from a Linux
install) keep working if they somehow have python-olm available.
This is the lowest-friction fix — picker visibility only.
Tests cover linux/darwin/win32 and verify other platforms aren't
collateral damage.
- cron-script-only: webhook subscription links pointed to
/docs/user-guide/features/webhooks; the page lives under messaging/
- mlops-hermes-atropos-environments: axolotl and TRL related-skill links
pointed to skills/bundled/mlops/; both files live under skills/optional/mlops/
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-#21561 the liveness probe in acquire_scoped_lock() routes through
gateway.status._pid_exists (psutil-first, safe on Windows), not
os.kill(pid, 0). The two new macOS regression tests were patching
status.os.kill, which had no effect — the unmocked psutil call returned
False for PID 99999, marking the lock stale before the new code branch
ran. The 'replaces' test passed only because acquired=True was already
the expected outcome; the 'keeps' test failed in CI.
Switch both tests to monkeypatch status._pid_exists directly, matching
the existing test_acquire_scoped_lock_rejects_live_other_process pattern,
so they actually exercise the new start_time=None + cmdline-based
staleness branch.
On macOS (and Windows), /proc is unavailable so _get_process_start_time()
always returns None. When a gateway creates a scoped lock record with
start_time=None and then exits, macOS can reuse that PID for an unrelated
process. On restart, acquire_scoped_lock() sees:
1. os.kill(pid, 0) succeeds (PID is alive — but it's bluetoothuserd, not
the gateway)
2. existing.start_time is None and current_start is None, so the
start_time comparison is inconclusive
3. The lock is treated as active, blocking gateway startup with:
"Telegram bot token already in use (PID 873). Stop the other gateway
first."
Root cause: _read_process_cmdline() only reads /proc/<pid>/cmdline, which
doesn't exist on macOS. It always returns None, making
_looks_like_gateway_process() always return False, so the cmdline fallback
path in acquire_scoped_lock() was unreachable on macOS.
Fix (two parts):
1. _read_process_cmdline(): Add a ps(1) fallback for platforms without
/proc. When /proc/<pid>/cmdline doesn't exist, we now run
"ps -p <pid> -o command=" to retrieve the process command line. The
/proc path is tried first (preserving Linux performance); ps is only
invoked as a fallback.
2. acquire_scoped_lock(): When both the lock record's start_time and the
live process's start_time are None (the macOS case), fall back to
checking whether the live PID still looks like a Hermes gateway process
via _looks_like_gateway_process(). If it doesn't, the lock is stale.
Closes#16376
The c1eb2dcda tiered installer made two install paths look frozen on
slow networks or broken environments because both swallowed the
underlying tool's stderr.
scripts/install.sh, setup-hermes.sh:
curl -LsSf https://astral.sh/uv/install.sh | sh 2>/dev/null
printed only '✗ Failed to install uv' on failure with no diagnostic.
Common real causes (glibc mismatch on old distros, corp proxy / TLS
interception, missing curl, ~/.local/bin not writable, disk full)
were invisible. Also: piping curl into sh masks curl failures under
set -e (no pipefail) — sh exits 0 on empty stdin, so a network error
succeeded silently.
Fix: download installer to a tempfile first, then run it. Capture
curl + installer output to a log; on failure, indent and print it.
scripts/install.sh hash-verified tier:
uv sync --all-extras --locked 2>"$(mktemp)" silenced uv's progress
output, making a fresh-venv install (~50 transitives including
torch-class deps) look hung for 1-5 minutes — users see 'Trying tier:
hash-verified (uv.lock) ...' and assume it's frozen. The mktemp
substitution also wasn't saved to a variable, so the uv error on
failure was unreachable.
Fix: stream uv's stderr directly so users see live 'Resolved N /
Prepared / Installed' progress. Print an upfront note that the first
run takes 1-5 minutes.
Detect when write_file / patch calls fail during a turn and are never
superseded by a successful write to the same path. When the final
text response is delivered, append an advisory footer listing the
files that did NOT change — so models that over-claim 'patched 5 files'
after 4 silent failures can't hide the lie.
Catches the failure mode reported in Ben Eng's llm-wiki session:
grok-4.1-fast issued batches of parallel patches, half failed with
'Could not find old_string', and the agent summarised the turn
claiming every file was edited. The user had to manually run
'git status' each turn to catch it.
The verifier is a pure post-hoc check on tool results — no new LLM
calls, no synthetic messages injected into history (prompt cache
preserved), no changes to tool argument dispatch. Per-turn state is
keyed by path; a later successful write to the same path clears the
failure entry so single-file retry recovery is not flagged.
Wired into both _execute_tool_calls_concurrent and
_execute_tool_calls_sequential, so batched parallel patches and one-at-
a-time edits are both covered. Footer emission happens after the
agent loop exits, before transform_llm_output / post_llm_call plugin
hooks run, so plugins still see (and can modify) the augmented text.
Config: display.file_mutation_verifier (bool, default true) +
HERMES_FILE_MUTATION_VERIFIER env override.
31 unit tests in tests/run_agent/test_file_mutation_verifier.py cover
target extraction (write_file, patch-replace, patch-v4a single and
multi-file), error-preview extraction (JSON .error field and plain
string), per-turn state transitions (first-error-wins on repeated
failure, success supersedes failure), footer rendering (truncation
at 10 entries, user-actionable hint), and env/config precedence.
Companion docs updated: user-guide/configuration.md +
reference/environment-variables.md.
* feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback
Three coordinated mitigations for the Mini Shai-Hulud worm hitting
mistralai 2.4.6 on PyPI (2026-05-12) and for the next single-package
compromise that follows.
# What this PR makes true
1. Users with the poisoned mistralai 2.4.6 in their venv get a loud
detection banner with copy-pasteable remediation steps the moment
they run hermes (and on every gateway startup).
2. One quarantined / yanked PyPI package can no longer silently demote
a fresh install to 'core only' — the installer keeps every other
extra and tells the user which tier landed.
3. Future opt-in backends (Mistral, ElevenLabs, Honcho, etc.) can
lazy-install on first use under a strict allowlist, instead of
eagerly pulling everything at install time.
# Detection: hermes_cli/security_advisories.py
- ADVISORIES catalog (one entry currently: shai-hulud-2026-05 for
mistralai==2.4.6). Adding the next one is a single dataclass.
- detect_compromised() uses importlib.metadata.version() — no pip
dependency, works in uv venvs that lack pip.
- Banner cache (~/.hermes/cache/advisory_banner_seen) rate-limits
the startup banner to once per 24h per advisory.
- Acks persisted to security.acked_advisories in config.yaml; never
re-banner after ack.
- Wired into:
* hermes doctor — runs first, prints full remediation block
* hermes doctor --ack <id> — dismisses an advisory
* cli.py interactive run() and single-query branches — short
stderr banner pointing at hermes doctor
* gateway/run.py startup — operator-visible warning in gateway.log
# Lazy-install framework: tools/lazy_deps.py
- LAZY_DEPS allowlist maps namespaced feature keys (tts.elevenlabs,
memory.honcho, provider.bedrock, etc.) to pip specs.
- ensure(feature) installs missing deps in the active venv via the
uv → pip → ensurepip ladder (matches tools_config._pip_install).
- Strict spec safety regex rejects URLs, file paths, shell metas,
pip flag injection, control chars — only PyPI-by-name accepted.
- Gated on security.allow_lazy_installs (default true) plus the
HERMES_DISABLE_LAZY_INSTALLS env var for restricted/audited envs.
- Migrated three backends as proof of pattern:
* tools/tts_tool.py — _import_elevenlabs() calls ensure first
* plugins/memory/honcho/client.py — get_honcho_client lazy-installs
* tts.mistral / stt.mistral entries pre-registered for when PyPI
restores mistralai
# Installer fallback tiers
scripts/install.sh, scripts/install.ps1, setup-hermes.sh:
- Centralised _BROKEN_EXTRAS list (currently: mistral). Edit one
array when a transitive breaks; users keep every other extra.
- New 'all minus known-broken' tier between [all] and the existing
PyPI-only-extras tier. Only kicks in when [all] fails resolve.
- All three tiers explicit: every fallback announces which tier
landed and prints a re-run hint when not on Tier 1.
- install.ps1 and install.sh both regenerate their tier specs from
the same _BROKEN_EXTRAS array so updates stay in sync.
Side effect: install.ps1 Tier 2 spec previously hardcoded 'mistral'
in its extra list — bug fixed by the refactor (mistral is filtered
out).
# Config
hermes_cli/config.py — DEFAULT_CONFIG.security gains:
- acked_advisories: [] (advisory IDs the user has dismissed)
- allow_lazy_installs: True (security gate for ensure())
No config version bump needed — both keys nest under existing
security: block, and load_config's deep-merge picks up DEFAULT_CONFIG
defaults for users with older configs.
# Tests
tests/hermes_cli/test_security_advisories.py — 23 tests covering:
- detect_compromised matches/non-matches, wildcard frozenset
- ack persistence, idempotence, blank rejection, config-failure path
- banner cache rate limiting + 24h re-banner + ack-stops-banner
- short_banner_lines / full_remediation_text / render_doctor_section /
gateway_log_message
- shipped catalog well-formedness invariant
tests/tools/test_lazy_deps.py — 40 tests covering:
- spec safety: 11 safe parametrized + 18 unsafe parametrized
- allowlist: unknown-feature rejection, namespace.name shape,
every shipped spec passes the safety regex
- security gating: config flag, env var, default, fail-open
- ensure() happy/sad paths: already-satisfied, install success,
pip stderr surfaced on failure, install-succeeds-but-still-missing
- is_available, feature_install_command
Combined: 63 new tests, all passing under scripts/run_tests.sh.
# Validation
- scripts/run_tests.sh tests/hermes_cli/test_security_advisories.py
tests/tools/test_lazy_deps.py → 63/63 passing
- scripts/run_tests.sh tests/hermes_cli/test_doctor.py
tests/hermes_cli/test_doctor_command_install.py
tests/tools/test_tts_mistral.py tests/tools/test_transcription_tools.py
tests/tools/test_transcription_dotenv_fallback.py → 165/165 passing
- scripts/run_tests.sh tests/hermes_cli/ tests/tools/ →
9191 passed, 8 pre-existing failures (verified on origin/main
before this change)
- bash -n on install.sh and setup-hermes.sh → OK
- py_compile on all modified .py files → OK
- End-to-end smoke test of detect_compromised + render_doctor_section
+ gateway_log_message with mocked installed version → produces
copy-pasteable remediation output
# Community
Full advisory + remediation steps:
website/docs/community/security-advisories/shai-hulud-mistralai-2026-05.md
Short-form post drafts (Discord, GitHub pinned issue, README banner):
scripts/community-announcement-shai-hulud.md
Refs: PR #24205 (mistral disabled), Socket Security advisory
<https://socket.dev/blog/mini-shai-hulud-worm-pypi>
* build(deps): pin every direct dep to ==X.Y.Z (no ranges)
Companion to the supply-chain advisory work: replace every >=/</~= range
in pyproject.toml's [project.dependencies] and [project.optional-dependencies]
with an exact ==X.Y.Z pin sourced from uv.lock.
Why: ranges allow PyPI to ship a fresh version of any direct dep at any
time without a code review on our side. With ranges, the malicious
mistralai 2.4.6 release would have been pulled by every fresh
'pip install -e .[all]' for the hours between upload and PyPI's
quarantine — exactly the install window we got hit on. Exact pins close
that window: the only way a new package version reaches a user is via
an intentional update on our end.
What the user-facing change is: nothing, behavior-wise. Every package
resolves to the same version it was already resolving to via uv.lock —
the pins just remove the resolver's freedom to pick a different one.
Cost: any user installing Hermes alongside another package that requires
a newer pin gets a resolver conflict. Acceptable for our isolated-venv
install path; documented in the new comment block.
Build-system requires line (setuptools>=61.0) is intentionally left
as a range — pinning the build backend would block fresh pip from
bootstrapping the build on architectures where that exact wheel isn't
available.
mistral extra (mistralai==2.3.0) is pinned but stays out of [all]
(per PR #24205). 'uv lock' regeneration will fail until PyPI restores
mistralai; lockfile regeneration is gated behind that, NOT on every PR.
LAZY_DEPS in tools/lazy_deps.py also moved to exact pins so the lazy-
install pathway can never resolve a different version than the one
declared in pyproject.toml.
Validation:
- Cross-checked all 77 pinned direct deps in pyproject.toml against
uv.lock — every pin matches the resolved version exactly.
- Cross-checked all LAZY_DEPS specs against uv.lock — same.
- 'uv pip install -e .[all] --dry-run' resolves 205 packages cleanly.
- tests/tools/test_lazy_deps.py + tests/hermes_cli/test_security_advisories.py
→ 63/63 passing (every shipped spec passes the safety regex).
- Doctor + TTS + transcription targeted suite → 146/146 passing.
* build(deps): hash-verify transitives via uv.lock; remove unresolvable [mistral] extra
You asked: 'what about the dependencies the dependencies rely on?' —
correctly noting that exact-pinning direct deps in pyproject.toml does
NOT cover the transitive graph. `pip install` and `uv pip install` both
re-resolve transitives fresh from PyPI at install time, so a compromised
transitive (e.g. `httpcore` if it got worm-poisoned tomorrow) would
still hit our users even with every direct dep exact-pinned.
# What this commit fixes
1. **Both real installer scripts now prefer `uv sync --locked` as Tier 0.**
uv.lock records SHA256 hashes for every transitive — a compromised
package with a different hash gets REJECTED. Falls through to the
existing `uv pip install` cascade if the lockfile is missing or
stale, with a loud warning that the fallback path does NOT
hash-verify transitives. Previously only `setup-hermes.sh` (the dev
path) used the lockfile; `scripts/install.sh` and `scripts/install.ps1`
(the paths fresh users actually run) skipped it.
2. **Removed the `[mistral]` extra entirely.** The `mistralai` PyPI
project is fully quarantined right now — every version returns 404,
so any pin we wrote was unresolvable, which broke `uv lock --check`
in CI. Restoration is documented in pyproject.toml as a 5-step
checklist (verify, re-add extra, re-enable in 4 modules, regenerate
lock, optionally re-add to [all]).
3. **Regenerated uv.lock.** 262 packages, mistralai/eval-type-backport/
jsonpath-python pruned. `uv lock --check` now passes.
# Defense-in-depth view
| Layer | Where | Protects against |
|----------------------------|-------------------|-------------------------------------------|
| Exact pins in pyproject | direct deps | new mistralai 2.4.6-style direct compromise |
| uv.lock + `--locked` install | transitive graph | transitive worm injection |
| Tier-0 hash-verified path | install.sh / .ps1 | actually USE the lockfile in fresh installs |
| `uv lock --check` CI gate | every PR | drift between pyproject and lockfile |
| `hermes_cli/security_advisories.py` | runtime | cleanup for users who already got hit |
The exact pinning + hash verification together close the supply-chain
gap. Without the lockfile path, exact pins alone are theater.
# Validation
- `uv lock --check` → passes (262 packages resolved, no drift).
- `bash -n` on install.sh + setup-hermes.sh → OK.
- 209/209 tests passing across new + adjacent test files
(test_lazy_deps.py, test_security_advisories.py, test_doctor.py,
test_tts_mistral.py, test_transcription_tools.py).
- TOML parse OK.
* chore: remove community announcement drafts (PR body covers it)
* build(deps): lazy-install every opt-in backend (anthropic, search, terminal, platforms, dashboard)
Extends the lazy-install framework to cover everything that's not used by
every hermes session. Base install drops from ~60 packages to 45.
Moved out of core dependencies = []:
- anthropic (only when provider=anthropic native, not via aggregators)
- exa-py, firecrawl-py, parallel-web (search backends; only when picked)
- fal-client (image gen; only when picked)
- edge-tts (default TTS but still optional)
New extras in pyproject.toml: [anthropic] [exa] [firecrawl] [parallel-web]
[fal] [edge-tts]. All added to [all].
New LAZY_DEPS entries: provider.anthropic, search.{exa,firecrawl,parallel},
tts.edge, image.fal, memory.hindsight, platform.{telegram,discord,matrix},
terminal.{modal,daytona,vercel}, tool.dashboard.
Each import site now calls ensure() before importing the SDK. Where the
module had a top-level try/except (telegram, discord, fastapi), the
graceful-fallback pattern was extended to lazy-install on first
check_*_requirements() call and re-bind module globals.
Updated test_windows_native_support.py tzdata check from snapshot
(>=2023.3 literal) to invariant (any version + win32 marker).
Validation:
- Base install: 45 packages (was ~60); 6 newly-extracted packages absent
- uv lock --check: passes (262 packages, no drift)
- 209/209 lazy_deps + advisory + doctor + tts/transcription tests passing
- py_compile clean on all 12 modified modules
The `mistralai` PyPI package was quarantined on 2026-05-12 after a
malicious 2.4.6 release. Every fresh resolve (AUR makepkg, Docker build,
CI run, install.sh first-run) currently fails on
`mistralai>=2.3.0,<3` because PyPI returns zero candidates.
Existing users running `hermes update` mostly didn't notice — `hermes
update` falls back from `.[all]` to per-extra retries and silently
skips mistral with a warning that scrolls past. But fresh installs
hard-fail or lose every other extra.
Changes:
- pyproject.toml: drop `hermes-agent[mistral]` from `[all]` and
`[termux-all]`. The `mistral` extra itself is preserved so users
can opt back in once PyPI un-quarantines.
- hermes_cli/tools_config.py: hide Mistral Voxtral TTS from the
`hermes tools` provider picker until restored.
- hermes_cli/web_server.py: drop "mistral" from dashboard STT options.
- tools/transcription_tools.py: explicit `provider: mistral` returns
"none" with a clear status message; auto-detect skips mistral.
- tools/tts_tool.py: dispatcher returns a clear "temporarily disabled"
error before any SDK import attempt (avoids cached-stale-package
surprises).
- tests/tools/: update three test files to assert the new disabled
behavior. Each test docstring records why and points at the rollback
trigger (PyPI un-quarantines mistralai).
Restore plan: revert this commit once the package is available on PyPI
again. The behavior change is intentional and documented in code
comments + test docstrings to make the rollback trivial.
Validation:
- scripts/run_tests.sh tests/tools/ -k 'mistral or stt or tts' →
425/425 passing.
Refs: https://pypi.org/simple/mistralai/ (currently
"pypi:project-status: quarantined").
Handle MiniMax OAuth expiry values consistently across CLI and dashboard
flows, fix CLI status/add behavior, and force pooled OAuth runtime
requests through Anthropic Messages.
- web_server._minimax_poller: parse expired_in via the shared resolver
so unix-ms absolute timestamps stop landing as TTL seconds and crashing
with 'year 583911 is out of range' when a user connects MiniMax OAuth
from the dashboard.
- auth._minimax_oauth_login / _refresh_minimax_oauth_state: same fix on
the CLI login + refresh paths.
- auth.get_auth_status: dispatch minimax-oauth to its dedicated status
function instead of falling through.
- auth_commands.auth_add_command: 'hermes auth add minimax-oauth' now
starts the device-code login flow and persists a pool entry with the
access + refresh tokens, instead of requiring credentials to already
exist.
- runtime_provider._resolve_runtime_from_pool_entry: pin pooled
minimax-oauth credentials to anthropic_messages so a stale
model.api_mode: chat_completions can't send requests to
/anthropic/chat/completions and trigger MiniMax nginx 404s.
Co-authored-by: Cursor <cursoragent@cursor.com>
Qwen models on Nous Portal (e.g. qwen3.6-plus) now get the same envelope-layout
cache_control markers and long-lived (1h cross-session) cache treatment as
Portal Claude. Portal proxies to OpenRouter with identical wire-format and
cache_control semantics, but the prior policy left Portal Qwen falling through
to the alibaba-family branch (which only matches provider=opencode/alibaba),
serving 0% cache hits and re-billing the full prompt every turn.
Scope is narrow: Portal Claude OR Portal Qwen. Other models on Portal keep
their existing behavior.
- _anthropic_prompt_cache_policy: add (is_nous_portal and qwen) -> (True, False)
- _supports_long_lived_anthropic_cache: drop Claude-only gate for Portal so
Qwen also gets the validated 1h cross-session layout
- tests cover both functions, both bare and vendored qwen slug forms, and
the rejection of non-Claude non-Qwen Portal traffic
* fix(tui-clipboard): skip native safety net on OSC52-capable terminals
On terminals with first-class OSC 52 support (Ghostty, kitty, WezTerm,
Windows Terminal, VS Code), setClipboard() currently fires both OSC 52
AND a parallel native-tool write (wl-copy / xclip / pbcopy). On Wayland
+ wl-copy this corrupts the clipboard: probeLinuxCopy() runs wl-copy
with empty stdin as an existence check (destructive — wipes clipboard
to empty string), and the subsequent real wl-copy invocation races
OSC 52 plus its own daemon's previous SIGTERM.
Symptom: user on Arch + Ghostty + wl-copy (Wayland, no tmux, no SSH)
had to press Ctrl+Shift+C three times before a selection landed.
env -u WAYLAND_DISPLAY -u DISPLAY HERMES_TUI_FORCE_OSC52=1 (which
short-circuits copyNative via the DISPLAY-absent early-return) made
every copy work instantly — proving OSC 52 alone is sufficient on
Ghostty and that copyNative() is actively destructive there.
Add OSC52_CAPABLE_TERMINALS allowlist to terminal.ts (same pattern as
the existing EXTENDED_KEYS_TERMINALS), and gate copyNative() on the
terminal NOT being on it. The native safety net continues to fire on
unrecognised terminals (xterm, GNOME Terminal, Konsole, Terminal.app,
etc.) where OSC 52 is less reliable.
* fix(tui-clipboard): address Copilot review feedback
- Move OSC52_CAPABLE_TERMINALS + supportsOsc52Clipboard() from
ink/terminal.ts to utils/env.ts. ink/terminal.ts already imports
link from ink/termio/osc.ts; importing back into termio/osc.ts
introduced a circular dependency. utils/env.ts has no deps on
either file and already owns terminal detection (detectTerminal()),
so the helper sits naturally next to it.
- Replace the inline gating (!SSH_CONNECTION && !supportsOsc52Clipboard())
with a pure shouldUseNativeClipboard(env, terminal) helper. The old
expression skipped native on allowlisted terminals even when
setClipboard() wouldn't actually emit OSC 52 (e.g. inside
TMUX/STY where we use tmux load-buffer instead, or when the user
has set HERMES_TUI_FORCE_OSC52=0). That made the clipboard write
a no-op in those configurations. The new helper:
1. SSH_CONNECTION set -> false (existing behaviour)
2. TMUX or STY set -> true (we go through load-buffer, no race)
3. shouldEmitClipboardSequence() false -> true (native is the
only path left when OSC 52 is suppressed)
4. Otherwise: skip native iff terminal is allowlisted.
- Add 11 tests for shouldUseNativeClipboard covering the SSH guard,
TMUX/STY tmux-inside-Ghostty case, HERMES_TUI_FORCE_OSC52=0
override, allowlisted vs non-allowlisted terminals, precedence,
and default-args smoke. Tests follow the package's existing
parameterised-helper style (no vi.mock; helpers accept env and
terminal as arguments).
- Update test imports to the new utils/env.js path.
* fix(tui-clipboard): address Copilot round 2 feedback
* fix(tui-clipboard): address Copilot round 3 feedback
* fix(tui-clipboard): address Copilot round 4 feedback
Free-tier users were seeing 'No free models currently available.' in the
`hermes model` and post-login pickers even though qwen/qwen3.6-plus is
free on the Portal right now. Three independent breakages compounded:
1. The docs-hosted catalog manifest at website/static/api/model-catalog.json
was not regenerated when _PROVIDER_MODELS['nous'] was updated, so users
fetching the manifest got a list that didn't include qwen/qwen3.6-plus.
2. _resolve_nous_pricing_credentials() returned ('', '') on any auth blip,
collapsing get_pricing_for_provider('nous') to {} and making every
curated model fall through the free-tier filter as 'paid'.
3. Even with healthy pricing, the picker only ever showed models from the
in-repo curated list intersected with live pricing — a Portal-flagged
free model not yet in the curated list could never appear.
Changes:
- hermes_cli/models.py: new union_with_portal_free_recommendations() that
augments the curated list with Portal freeRecommendedModels entries
(with synthetic free pricing so partition keeps them). The Portal's
/api/nous/recommended-models endpoint is now the source of truth for
free-tier surfacing — old Hermes builds will see new free models
without a CLI release.
- hermes_cli/models.py: _resolve_nous_pricing_credentials() falls back to
the public inference base URL when runtime cred resolution fails.
The /v1/models endpoint exposes pricing without auth, so silently
returning {} just because a refresh token expired was wrong.
- hermes_cli/auth.py + hermes_cli/main.py: both free-tier picker call
sites call union_with_portal_free_recommendations() before partition.
- tests/hermes_cli/test_models.py: 7 tests covering union behaviour
(prepend, dedup, end-to-end with stale pricing, empty/missing/error
payloads, invalid entries).
- tests/hermes_cli/test_model_catalog.py: drift guard
TestManifestMatchesInRepoLists fails CI when _PROVIDER_MODELS['nous']
or OPENROUTER_MODELS is edited without re-running
scripts/build_model_catalog.py. Verified empirically that removing a
manifest entry triggers an assertion with an actionable error message.
Validation:
- 133/133 targeted tests pass (test_models, test_model_catalog,
test_auth_nous_provider).
- Live E2E against the real Portal:
- Stale curated list ['claude-opus','claude-sonnet','gpt-5.4'] (no
qwen) → after union: ['qwen/qwen3.6-plus', ...] →
partition(free_tier=True): selectable=['qwen/qwen3.6-plus'].
- Simulated expired refresh token → anon fetch returns 403 pricing
entries including qwen/qwen3.6-plus -> {prompt:0, completion:0}.
- ruff: clean.
cua-driver was only installed once on toolset enable: `_run_post_setup` early-returns when the binary is already on PATH, so upstream fixes (e.g. v0.1.6 Safari window-focus fix) never reached existing users without manual reinstall.
Two refresh points now:
- `hermes update` re-runs the upstream installer at the end of the update if cua-driver is on PATH (macOS-only, no-op otherwise). Ties driver freshness to the user-controlled update cadence — no startup latency, no per-launch GitHub API call.
- `hermes computer-use install --upgrade` for manual force-refresh.
The upstream `install.sh` always pulls the latest release, so re-running is the canonical upgrade path. No version-comparison logic needed.
`hermes computer-use status` now shows the installed version, and points at `--upgrade` for refreshing.
Fixes#22832.
## Root cause
`hermes_cli/web_server.py:start_oauth_login` dispatched OAuth flows by
the catalog's `flow` field rather than provider id:
if catalog_entry["flow"] == "pkce":
return _start_anthropic_pkce()
The catalog had two `flow: "pkce"` entries — `anthropic` and
`minimax-oauth` — so clicking "Login" on MiniMax in the dashboard's
Keys tab unconditionally launched the Anthropic/Claude PKCE flow.
## Fix
Three changes in `hermes_cli/web_server.py`:
1. Catalog entry for `minimax-oauth` changed from `flow: "pkce"` to
`flow: "device_code"`. From a UX perspective MiniMax is a
verification-URI + user-code flow (open URL, enter code, backend
polls) — same shape as Nous's device-code flow. The PKCE bit
(verifier + challenge from `_minimax_pkce_pair`) is a security
extension that doesn't change the operator experience; the existing
dashboard modal already renders `device_code` correctly for this UX.
2. New MiniMax branch in `_start_device_code_flow`, mirroring the
existing Nous branch but calling MiniMax-specific helpers
(`_minimax_request_user_code`, `_minimax_pkce_pair`). Stashes
verifier + state in the session for the poller to consume. Handles
the overloaded `expired_in` field (could be unix-ms timestamp OR
seconds-from-now duration) the same way `_minimax_poll_token` does.
3. New `_minimax_poller` background thread mirroring `_nous_poller`.
Calls `_minimax_poll_token` → on success builds the same
`auth_state` dict the CLI flow (`_minimax_oauth_login`) builds, and
persists via `_minimax_save_auth_state` so the dashboard path leaves
the system in the same state as `hermes auth add minimax-oauth`.
Plus a dispatcher tightening to prevent regression: the `pkce` branch
now requires `provider_id == "anthropic"`, so any future PKCE provider
added without a proper start function gets a clean
`400 Unsupported flow` rather than silently launching Anthropic OAuth.
## Test
New `tests/hermes_cli/test_web_oauth_dispatch.py`:
- Regression test asserting MiniMax start does NOT return claude.ai
- Sanity test that Anthropic PKCE still works after the dispatcher
tightening
- Forward-looking test: a hypothetical pkce-flagged provider without
an explicit branch is rejected cleanly rather than misrouted
## Limitations
- The dashboard MiniMax path defaults to `region="global"`. CN-region
operators can still use the CLI flow which supports `--region cn`.
Adding a region toggle to the dashboard UI is a follow-up.
Follow-up to #23863 (CJK table alignment). The realigner was
correctly padding pipes to identical column offsets, but when a
table's natural width exceeds terminal cells it produced lines that
the terminal soft-wrapped mid-cell, destroying column alignment
visually even though the bytes were perfectly padded. Reported as
'columns are not aligned' on tables containing one long row alongside
several short rows.
Approach mirrors Claude Code's MarkdownTable.tsx narrow-terminal
fallback: when realign_markdown_tables is given an available_width
budget and the rebuilt horizontal table exceeds it, render each body
row as 'Header: value' lines separated by a thin ─ rule. Word-wraps
oversize values at the budget with a 2-space continuation indent.
- agent/markdown_tables.py: realign_markdown_tables(text, available_width=None);
threshold check at the top of _render_block flips into a new
_render_vertical fallback. Includes _wrap_to_width with hard-break
for tokens longer than the budget.
- cli.py: helper _terminal_width_for_streaming() returns
shutil.get_terminal_size().columns minus _STREAM_PAD and a 2-cell
safety margin; passed to all three realign call sites
(_render_final_assistant_content for strip+render Panel paths, and
the streaming flushers in _emit_stream_text / _flush_stream).
- tests/agent/test_markdown_tables.py: 4 new tests covering the
overflow-vertical fallback for ASCII + CJK content, the
'fits → keep horizontal' case, and the long-cell wrap with indent.
Live-verified: with COLUMNS=100, the user's reported 'long row in
ASCII table' case now renders as vertical key-value rows that all fit
the panel; the 6-column CJK comparison table still renders as an
aligned horizontal table because it fits inside 100 cols.
* feat(ui-tui): resolve links to readable page titles
Mirror desktop pretty-link behavior in the TUI by resolving HTTP links to page titles with shared caching and safe fetch filters, plus slug-based fallbacks so chat links stay readable even when title fetch fails.
* refactor(ui-tui): tighten link-title fallback handling
Clean up the link-title resolver by hardening in-flight cleanup and clarifying title length limits, while adding focused coverage for HTML entity decoding and markdown-label fallback behavior.
* fix(ui-tui): block private-network targets in title fetches
Prevent automatic link-title resolution from requesting local or private hosts by rejecting RFC1918, link-local, ULA, and intranet-style hostnames before fetch, and add regression coverage for blocked host patterns.
The old mtime-tracking staleness machinery (_tui_build_needed,
_hermes_ink_bundle_stale, _find_bundled_tui) tried to avoid rebuilding
by comparing source timestamps to dist/entry.js. This was fragile and
added ~100 lines of code. Replace with three clear paths:
1. HERMES_TUI_DIR set (prebuilt/nix): just node dist/entry.js, no build
2. --dev mode: tsx src/entry.tsx, no build, hot reload
3. Normal: always npm run build (esbuild is ~1s, correctness > caching)
Also error when HERMES_TUI_DIR is set with --dev (footgun: prebuilt
bundle has no source code to hot-reload).
Based on PR #23950 by @nicoechaniz.
- Add "kimi" and "moonshot" to PROVIDER_TO_MODELS_DEV → kimi-for-coding
- Gate OpenRouter metadata step behind "if not effective_provider":
known providers should not be overridden by community-maintained OR data
- Keep the targeted Kimi-family 32k guard as a secondary safety net
inside the OR gate (for unknown providers with Kimi models)
Co-authored-by: nicoechaniz <nicoechaniz@altermundi.net>
Kimi-k2.6 (which supports 262K context) was incorrectly resolved as 32K,
tripping the 64K minimum-context guard and preventing use of the model on
Ollama Cloud and Kimi Coding / Moonshot providers.
Three fixes in the context-length resolution chain:
1. Ollama Cloud native /api/show query: new _query_ollama_api_show()
queries the Ollama native API for authoritative GGUF model_info
context_length. For hosted Ollama, prefers model_info over num_ctx
since users can't set their own num_ctx on Cloud. Added at step 5e
in get_model_context_length(), before the models.dev fallback.
2. models.dev :cloud/-cloud suffix fallback: lookup_models_dev_context()
now also tries appending :cloud and -cloud suffixes when the bare
model name doesn't match. models.dev stores 'kimi-k2.6:cloud' but
users and the live API use bare 'kimi-k2.6'.
3. Kimi-family 32K guard: after the OpenRouter metadata step, reject
exactly 32768 for Kimi-named models (kimi-*, moonshot*) and fall
through to hardcoded defaults ('kimi': 262144). OpenRouter reports
32768 for moonshotai/kimi-k2.6 but the model actually supports 262K.
Narrow filter — only 32768, only Kimi-family — becomes dead code
when OpenRouter updates its metadata.
---
Set HERMES_SESSION_ID using the existing session_context.py ContextVar
system for concurrency safety (multiple gateway sessions in one process
won't cross-talk). Also writes os.environ as fallback for CLI mode.
Touchpoints:
- gateway/session_context.py: Add _SESSION_ID ContextVar + _VAR_MAP entry
- run_agent.py: Set both ContextVar and os.environ at init and on
context-compression rotation
- tools/environments/local.py: Bridge ContextVars into subprocess env
in _make_run_env() (ContextVars don't propagate to child processes)
- tests/run_agent/test_session_id_env.py: 3 tests covering env, provided
ID, and ContextVar paths
execute_code subprocess already passes HERMES_* prefixed vars through
_scrub_child_env (line 82: _SAFE_ENV_PREFIXES includes 'HERMES_').
Primary use case: webhook-triggered agents that need to include a
`--resume <session_id>` takeover command in their output.
Cuts input cost for first-turn Claude requests by ~85-90% on subsequent
sessions within an hour. Tools array (~13k tokens for default toolset) +
stable system prefix (~5-8k tokens) get a 1h cache_control marker; the
volatile suffix (memory, USER profile, timestamp, session id) sits in a
separate non-cached block at the end so it doesn't poison the cross-session
prefix when it changes.
Provider gate: Claude on native Anthropic (incl. OAuth subscription),
OpenRouter, and Nous Portal (which proxies to OpenRouter). All other
providers keep today's system_and_3 layout unchanged.
Layout (4 cache_control breakpoints, Anthropic max):
1. tools[-1] -> 1h (cross-session)
2. system content[0] -> 1h (cross-session, stable prefix)
3. messages[-2] -> 5m (within-session rolling)
4. messages[-1] -> 5m (within-session rolling)
Within-session rolling shrinks from 3 messages to 2 to free the breakpoint
budget. On Claude with realistic tool loadouts the long-lived tier carries
the bulk of cross-session value anyway.
System prompt is now always assembled cache-friendly: stable identity /
guidance / skills / platform hints first, then session-stable context
files (AGENTS.md, .cursorrules), then per-call volatile content. Old
single-string callers see the same logical content (same join order),
just reordered so volatile lives at the end.
Config knobs (defaults shown):
prompt_caching:
cache_ttl: "5m" # rolling-window TTL (unchanged)
long_lived_prefix: true # opt-out switch
long_lived_ttl: "1h" # cross-session prefix TTL
Live E2E (tests/agent/test_prompt_caching_live.py, gated on
OPENROUTER_API_KEY) on anthropic/claude-haiku-4.5 with default toolset:
Call 1 (cold): cache_write=13,415 cache_read=0
Call 2 (NEW agent + msg): cache_write=391 cache_read=13,025
Cross-session reuse: 97.09%
Implementation:
* agent/prompt_caching.py: new apply_anthropic_cache_control_long_lived()
+ mark_tools_for_long_lived_cache(); existing apply_anthropic_cache_control()
preserved verbatim for the fallback path.
* agent/anthropic_adapter.py: convert_tools_to_anthropic() now forwards
cache_control onto each Anthropic-format tool dict.
* run_agent.py: _build_system_prompt_parts() returns the 3-tier dict;
_build_system_prompt() joins them (backward compatible).
_supports_long_lived_anthropic_cache() policy added next to the existing
_anthropic_prompt_cache_policy() (which now also recognises Nous Portal
Claude — pre-existing gap fixed in passing).
_build_api_kwargs() resolves tools_for_api once and propagates the
marker through all four build paths (anthropic_messages, bedrock,
codex_responses, profile/legacy chat completions).
Long-lived flag plumbed into the runtime snapshot/restore + model-switch
+ fallback-promotion paths.
Tests:
* tests/agent/test_prompt_caching.py: +8 tests (TestMarkToolsForLongLivedCache,
TestApplyAnthropicCacheControlLongLived).
* tests/run_agent/test_anthropic_prompt_cache_policy.py: +9 tests
(TestSupportsLongLivedAnthropicCache matrix across 8 endpoint classes
+ a fallback-target case).
* tests/agent/test_prompt_caching_live.py: new live E2E (skipif when
OPENROUTER_API_KEY is unset; runs outside the hermetic suite).
* Targeted suites: 327/327 pass (caching/adapter/policy/builder).
* tests/agent/ + tests/run_agent/: 3992 pass, 17 skip, 1 pre-existing
flake (test_async_httpx_del_neuter::test_same_key_replaces_stale_loop_entry,
verified failing on pristine origin/main).
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
#23482 fixed cache poisoning in the sync path: when a Codex auxiliary
timeout closes the underlying OpenAI client, _evict_cached_client_instance
walks CodexAuxiliaryClient wrappers via their _real_client attribute and
drops the cache entry so the next aux call rebuilds.
The cache key includes async_mode (see _client_cache_key), so the sync and
async clients for the same provider live in two distinct entries pointing
at the same underlying transport. The fix walked the sync wrapper's
_real_client correctly but the async wrappers
(AsyncCodexAuxiliaryClient, AsyncAnthropicAuxiliaryClient,
AsyncGeminiNativeClient) never exposed _real_client at all, so the async
entry survived eviction and kept handing out the poisoned client.
Effect on async aux callers: one timeout now poisons every subsequent
async aux call (compression, vision, session_search, title_generation)
with 'Connection error' until gateway restart -- even while the sync
route recovered as designed in #23482.
Mirror the sync wrapper's _real_client onto each async wrapper so the
existing eviction helper finds them. Three changes, one per wrapper:
- AsyncCodexAuxiliaryClient: self._real_client = sync_wrapper._real_client
(the underlying OpenAI client)
- AsyncAnthropicAuxiliaryClient: same shape
- AsyncGeminiNativeClient: self._real_client = sync_client (Gemini's
native facade is itself the leaf; no OpenAI client beneath it)
Update _evict_cached_client_instance docstring to reflect that it now
covers both sync and async wrappers via the same attribute walk.
Test: TestAuxiliaryClientPoisonedCacheEviction.test_evict_cached_client_instance_walks_async_wrapper
seeds both sync and async cache entries pointing at the same leaf and
asserts both are dropped on a single eviction call. Verified the test
fails without the wrapper changes ("async cache entry survived
eviction -- wrapper is missing _real_client") and passes with them.
Refs #23482, #23432
CJK and emoji glyphs render as two terminal cells but JS String#length
and the model's own padding count them as one, so any markdown table
with Chinese / Japanese / Korean cells drifts right per row when a
real terminal renders it. Both surfaces fix this with a display-cell
width measurement (wcswidth on the Python side, stringWidth on the
TUI side).
Changes:
- agent/markdown_tables.py: new helper. realign_markdown_tables(text)
detects markdown table blocks (header + |---| divider) and
rewrites the row padding using wcwidth.wcswidth so every pipe and
dash lines up across rows. No-op on text without tables.
- cli.py: hook the helper into _render_final_assistant_content for
strip / render modes (raw passes through untouched), and into the
streaming line emitter so live token-by-token rendering also
produces aligned tables. A small two-buffer state machine in
_emit_stream_text holds table rows until the block ends, then
flushes them through the realigner so all rows pad to a single
per-column width.
- ui-tui/src/components/markdown.tsx: renderTable now uses
stringWidth (Bun.stringWidth fast path + East-Asian-width-aware
fallback, already memoised in @hermes/ink) instead of UTF-16
String#length for both column-width measurement and per-cell
padding. Drops the comment that documented the bug as a deliberate
limitation.
Validation:
- New tests/agent/test_markdown_tables.py (11): every rebuilt block
shares pipe column offsets across rows for pure CJK, mixed
CJK+emoji, ragged-row, and multi-table inputs.
- Updated tests/cli/test_cli_markdown_rendering.py: the existing
strip-mode test asserted exact whitespace; rewritten to assert the
alignment contract (cell content survives + every rendered row
shares pipe offsets).
- New ui-tui markdown.test.ts case (1): rendered column-2 start
offset is identical for the header + every body row, including
the CJK row that drifted before the fix.
- Live: hermes chat -q with the user-reported screenshot prompt now
produces a perfectly aligned table on the wire (header, divider,
4 body rows including '通义千问', all pipes at identical columns).
The /model picker for Nous Portal users was returning the in-repo
_PROVIDER_MODELS["nous"] snapshot — which only updates on Hermes
releases — instead of the remote manifest published at
https://hermes-agent.nousresearch.com/docs/api/model-catalog.json.
OpenRouter already pulled from the manifest via fetch_openrouter_models;
"nous" was the only curated provider where the existing manifest
plumbing (get_curated_nous_model_ids → get_curated_nous_models) was
defined but not wired into the picker pipeline. Switch the curated
build in list_authenticated_providers to use it, with the same
graceful fallback to the in-repo snapshot when the manifest is
unreachable.
Test: tests/hermes_cli/test_model_catalog.py exercises the picker with
a patched manifest and asserts the manifest's nous list reaches
list_picker_providers. Falls-back-to-static path was already covered
by test_curated_nous_ids_falls_back_to_hardcoded_on_empty_catalog.
- getattr(self, '_slash_confirm_state', None) at the two read sites that
trip object.__new__(HermesCLI) test fixtures (test_cli_external_editor,
test_cli_skin_integration)
- _build_tui_layout_children: make slash_confirm_widget keyword-only with
default None to avoid breaking subclassing extension hook for wrapper
CLIs (test_cli_extension_hooks)
- AUTHOR_MAP entry for zhengyn0001
Follow-up to the salvaged commit ca1d4375a.
Follow-up to PR #23824. Adds two correctness fixes on top of the
contributor's salvaged commit:
1. Stale-dist fallback no longer gated on `fatal=False`. `cmd_dashboard`
passes `fatal=True` and is the primary scenario this fallback is for
(issue #23817 — Windows Scheduled Task at logon). The previous gate
meant the fallback never fired in the case it was designed for.
2. `--skip-build` now verifies the dist actually exists before starting
the server. Without this, a misconfigured pre-build would launch the
dashboard pointing at a missing dist and silently serve 404s. We now
exit 1 with a clear "pre-build first: cd web && npm run build"
message, and on success print which dist directory is being used.
Verified end-to-end on Linux:
- build fails + stale dist (fatal=True) -> fallback fires
- build fails + no dist (fatal=True) -> exit 1 with stderr surfaced
- build fails + stale dist (fatal=False) -> fallback fires
- --skip-build + missing dist -> exit 1 with clear guidance
- --skip-build + valid dist -> 'Skipping web UI build...'
Three improvements for non-interactive contexts (Windows Scheduled
Tasks, CI/CD) where the web UI build may fail (issue #23817):
1. Retry build once after 3s — covers boot-time races (antivirus
scanning Node.js, npm cache not ready, transient disk I/O)
2. Fall back to existing dist when build fails (non-fatal mode) —
a stale UI is far better than no UI at all
3. Add --skip-build flag — lets callers pre-build in their wrapper
script and start the dashboard without internal build attempt
4. Surface npm stderr in build failure output for easier debugging
Fixes#23817
On Windows systems using a Chinese GBK locale, `hermes update` could misreport the Web UI build as failed even when `npm run build` actually succeeded. The failure was caused by Python decoding captured npm output with the process locale inside a background subprocess reader thread. When npm emitted bytes such as `0x85`, decoding under GBK raised `UnicodeDecodeError`, and Hermes then surfaced a misleading "Web UI build failed" warning.
This change makes the npm install/npm ci path and the Web UI build step decode captured output explicitly as UTF-8 with `errors="replace"`. That keeps unexpected bytes from crashing output collection, preserves successful builds, and prevents false negatives during update on Windows.
The patch also adds regression tests that verify these subprocess calls always use explicit UTF-8 decoding with replacement semantics.
When the user's main provider is openai-codex on the ChatGPT-account
backend (https://chatgpt.com/backend-api/codex), sending a native image
attachment encodes it as data:image/...base64,... in the input_image
field. The OpenAI Responses API on the public endpoint accepts that, but
the ChatGPT-account variant rejects it with HTTP 400:
Invalid 'input[N].content[K].image_url'. Expected a valid URL, but got
a value with an invalid format.
Hermes' image-rejection phrase list didn't include this wording, so the
error escaped the strip-and-retry branch and fell through to the generic
recovery path: model fallback → context-too-large → compression cascade
→ auxiliary OpenRouter 402 spam (issue #23570).
Add a NARROW phrase keyed on the field-path apostrophe used by the Codex
Responses error format: "image_url'. expected". This matches the actual
error format without false-tripping on generic 'Expected a valid URL'
errors from unrelated tools (webhooks, redirect_uri, etc.). Once matched,
the existing branch strips images from history, sets _vision_supported=
False for the session, and retries text-only.
Refs #23570 (1 of 3 image-replay improvements; persistence rewrite to
store image PATHS instead of inlined base64 is a separate follow-up)
* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)"
This reverts commit a63a2b7c78.
* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (#23530)"
This reverts commit 4a080b1d5a.
* Revert "feat(goals): /goal checklist + /subgoal user controls (#23456)"
This reverts commit 404640a2b7.
Adds the only #17873 category not covered by the in-flight PRs #17962
(briandevans, reverse shell + download-execute) and #7993 (SHL0MS,
credential reads + curl/wget exfiltration): sudo invocations that an
LLM-driven agent can drive without TTY interaction.
The agent has no TTY, so the sudo forms that succeed without human
involvement are those reading the password from stdin (`-S` / `--stdin`)
or via an askpass helper (`-A` / `--askpass`). The shell-launch (`-s`)
and list-privileges (`-a`) flags are also gated since they are
privilege-relevant invocations the agent can chain after acquiring the
password (e.g. read SUDO_PASSWORD from .env -> sudo -S -s -> root shell).
Plain `sudo cmd` (no flag) is TTY-bound and excluded.
Two patterns:
1. Direct flag: `\bsudo\b[^;|&\n]*?\s+(?:-s\b|--stdin\b|-a\b|--askpass\b)`
The lazy `[^;|&\n]*?` consumes flag-arguments without spanning
command separators, so `sudo -u root -S whoami` matches (a textbook
offensive form that a strict `(?:\s+-[^\s]+)*` "leading flags only"
pattern would have missed because `root` is a flag-value not a flag).
2. Combined short flags: `\bsudo\b[^;|&\n]*?\s+-[a-z]*[sa][a-z]*\b`
Catches packed forms like `sudo -nS id` where multiple flags share
a single `-X` token.
`_normalize_command_for_detection` lowercases input before pattern
matching (tools/approval.py:340), so case variants of S/s and A/a
collapse — both letter-pairs are gated since each is a privilege-
relevant invocation.
Tests: 21 new cases in TestDetectSudoStdin (12 positive covering all
flag-order permutations including herestring source and printf-piped
forms; 9 negative including TTY-bound `sudo whoami`, interactive
`sudo -i`, env-var reference `$SUDO_USER`, doc lookup `man sudo`,
package install, and the `pseudosudo` word-boundary edge case).
Empirical coverage: 11/11 attacks matched, 0/10 false positives.
Refs: #17873 category 4. Adjacent: #17962 (reverse shell + download-
execute), #7993 (credential reads + curl/wget exfiltration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes#9590: Block explicit sudo -S (stdin password mode) commands
when the SUDO_PASSWORD environment variable is not configured.
The attack vector: the LLM constructs 'echo guessedpass | sudo -S cmd'
to brute-force sudo passwords, iterates based on sudo's error output
('Sorry, try again'). The existing _transform_sudo_command only
injects -S when SUDO_PASSWORD exists; without it, the LLM's explicit
sudo -S must be treated as a guessing attempt.
Changes:
- Add _check_sudo_stdin_guard() in approval.py: detects sudo -S when
SUDO_PASSWORD is absent, anchored to command-start positions
(^ ; && || | etc.) to avoid false positives on literal text
- Integrate into check_all_command_guards() above yolo/mode=off so
the block is unconditional (like the hardline floor)
- Add 6 tests covering: detection, allow-list, SUDO_PASSWORD bypass,
integration with check_all_command_guards, yolo non-bypass,
container backend bypass
The _default_spawn HERMES_HOME injection (PR #23356) calls
resolve_profile_env which raises FileNotFoundError when the profile
dir doesn't exist. In production the profile always exists (workers are
only dispatched for live profiles), but tests with isolated HERMES_HOME
never create profile dirs. Catch FileNotFoundError and fall through —
HERMES_PROFILE is still set below, so the worker CLI resolves the
profile at startup.
For PRs #23206 (Frowtek), #23252 (Sylw3ster), #23358 (dmnkhorvath),
#23659 (smwbev), and #23356 (TurgutKural) — all part of the kanban
bug-fix batch salvage.
When a parent task is archived, dependent child tasks were stuck in
todo forever because recompute_ready and claim_task only checked for
status == 'done'. Now both functions also treat 'archived' as a
terminal status, allowing children to proceed when their parent is
archived.
Fixes#23180.
Adds a TestCheckSendMessage class with 7 focused tests pinning the
four passing conditions and the failure modes:
- HERMES_KANBAN_TASK grants access (the new branch)
- HERMES_KANBAN_TASK short-circuits before consulting
session_context or gateway.status (so workers don't depend on
those import paths being healthy)
- HERMES_SESSION_PLATFORM=telegram grants access
- HERMES_SESSION_PLATFORM=local falls through to gateway check
- is_gateway_running()=True grants access
- All signals absent → False
- gateway.status ImportError is swallowed → False
Pinning the short-circuit (test #2) is the load-bearing one — it
documents the contract that worker-side availability cannot regress
to depending on gateway-side state lookups.
The kanban dispatcher sets HERMES_KANBAN_TASK on every spawned worker
but launches it with the assignee profile's HERMES_HOME (e.g.
~/.hermes/profiles/<name>/), which has no gateway.pid file. The
existing _check_send_message therefore returned False from the
is_gateway_running() fallback, even though the parent gateway is
alive and reachable.
Net effect: workers could call kanban_* tools (gated on
HERMES_KANBAN_TASK in _check_kanban_mode) but not send_message. This
breaks the natural pattern of "worker does the job, calls
send_message to deliver rich content to the originating chat, then
calls kanban_complete with a one-line summary" because the kanban
notifier's payload_summary is hard-truncated to the first line
(~200 chars) at gateway/run.py:3963 — anything richer has to ship
via send_message.
Honoring HERMES_KANBAN_TASK in _check_send_message — symmetric with
_check_kanban_mode in kanban_tools.py:42 — closes the gap. No new
state, no new env var, no profile-config changes required.
Default spawn did not propagate HERMES_HOME when forking kanban workers.
The worker's env is copied from the parent via dict(os.environ), so
HERMES_HOME is absent. When the child then starts hermes -p <profile>,
the CLI's _apply_profile_override() runs before hermes_constants is
imported and get_hermes_home() falls back to ~/.hermes (the default
profile root), silently ignoring the profile's config.yaml. Profile-
scoped fallback_providers, toolsets, and agent settings are therefore
never applied to kanban workers.
The fix injects HERMES_HOME into the worker's env using
resolve_profile_env(profile_arg) so the child reads the correct profile
directory instead of the default root.
When a kanban worker subprocess hits the iteration budget, the agent
loop strips tools and asks the model for a summary. The model cannot
call kanban_block itself at that point, so the process exits rc=0
without calling kanban_complete or kanban_block — a protocol violation
that the dispatcher detects as a fatal error, giving up after 1 failure
and stranding downstream tasks.
Fix: after _handle_max_iterations() returns, check HERMES_KANBAN_TASK
and call kanban_block with a reason describing the exhaustion. The
dispatcher then sees a clean block transition instead of a protocol
violation, and the task can be retried or escalated by a human.
Fixes [Bug] kanban-worker exits cleanly (rc=0) on iteration-budget
exhaustion without calling kanban_complete or kanban_block #23216
The container entrypoint ran `chown -R` on $HERMES_HOME every start.
`chown` strips the setgid bit (kernel security behavior), destroying
the 2770 permissions the NixOS activation script sets for group access
by hostUsers. This caused PermissionError for interactive CLI users
even though they were in the hermes group.
Replace with `find ... ! -user $UID -exec chown` which only touches
files with wrong ownership, leaving correctly-owned directories and
their permission bits intact.
Affects: container.enable + container.hostUsers + addToSystemPackages
Related: #19795, #19788, #9383
Expose the dependency-groups parameter from python.nix through
hermes-agent.nix and the NixOS module, allowing users to opt into
pyproject.toml optional extras (e.g. hindsight, voice, matrix) that
are resolved by uv inside the sealed venv.
Unlike extraPythonPackages (which appends to PYTHONPATH and requires
collision checking), extraDependencyGroups resolves the full dependency
graph in a single uv pass — no PYTHONPATH patching, no version
conflicts, no collision risk.
When to use which:
- extraDependencyGroups: enable a pyproject.toml optional extra
- extraPythonPackages: add an external Python plugin not in pyproject.toml
Usage:
services.hermes-agent.extraDependencyGroups = [ "hindsight" ];
Or via overlay:
pkgs.hermes-agent.override { extraDependencyGroups = [ "hindsight" ]; }
Refs: #8873, #9194
Declares hindsight-client as an optional dependency group [hindsight]
in pyproject.toml. This allows build-time inclusion for environments
where runtime pip install is not possible (NixOS sealed venvs, Docker,
Kubernetes).
Not included in [all] — memory providers are plugins and should be
opted into explicitly.
Install via:
uv sync --extra hindsight
pip install hermes-agent[hindsight]
NixOS (with extraDependencyGroups):
services.hermes-agent.extraDependencyGroups = [ "hindsight" ];
Closes#8873
Two independent opt-in QoL toggles, both off by default.
terminal.docker_extra_args:
- List of extra flags appended verbatim to docker run after security
defaults. Useful for adding capabilities (e.g. --cap-add SETUID) or
other docker run options not exposed by existing config keys.
- Non-string entries are logged and skipped.
- Also available via TERMINAL_DOCKER_EXTRA_ARGS='[...]' env var.
display.timestamps:
- Appends [HH:MM] to user input bullet and the assistant response box
header. Single hub in _format_submitted_user_message_preview()
covers both single-line and multi-line user previews; assistant
response label gets the timestamp at box-open time.
Closes#1569 (timestamps).
Co-authored-by: Mibayy <Mibayy@users.noreply.github.com>
When an auxiliary provider returns HTTP 402 (credit / payment), every
subsequent compression / title-gen / session-search / vision call still
re-tried it as the FIRST entry in the chain — burning ~1 RTT to hit 402
again, then falling back. On a long Discord/LCM session that meant dozens
of doomed 402s per minute (issue #23570).
Add a per-process unhealthy-provider cache with a 10 min TTL. When any
caller observes a payment error against a provider, the label is marked
unhealthy and skipped by:
* _resolve_auto Step-1 (main provider use-as-aux path)
* _resolve_auto Step-2 (aggregator/fallback chain)
* _try_payment_fallback (used by call_llm/acall_llm on first 402)
Skip-logs are throttled to once per minute per label so a bursty session
doesn't spam agent.log. Entries auto-expire so a topped-up account
recovers without manual intervention. The cache is in-process only by
design — multi-profile users with different keys per profile must each
hit the 402 once.
Refs #23570
When the Discord typing API call fails (rate limit, network error, 403),
_typing_loop returns early but the stale task remains in _typing_tasks.
Subsequent send_typing calls see the stale entry and skip, leaving no
typing indicator for the rest of the agent invocation.
Add finally block to _typing_loop to always remove the task from
_typing_tasks on exit, whether from cancellation, error, or normal
completion. This allows send_typing to create a fresh task.
3 new tests in test_discord_send.py:
- Task removed after API error
- Typing restartable after failure
- stop_typing cleans up
A YAML parse error in ~/.hermes/config.yaml caused load_config() to print
one line to stdout (Warning: Failed to load config: ...) and silently fall
back to DEFAULT_CONFIG, dropping every user override (auxiliary providers,
fallback chain, model settings). Users only noticed when downstream
behavior misbehaved — see issue #23570 where a tab-indent error in the
auxiliary section caused aux fallback to use OpenRouter (depleted) instead
of the configured Codex/MiniMax chain.
Now: log at WARNING (so 'hermes logs' surfaces it), write a prominent line
to stderr, dedup on (path, mtime_ns, size) so concurrent loads don't spam,
and re-warn after the user edits the file. Both call sites (raw read +
merged load) route through the same helper.
Refs #23570
Salvages the three substantive low-severity fixes from Gutslabs' #1974
"misc bug fixes" bundle. The other 8 claims in that PR were either
already fixed on main with superior implementations (state lock,
firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema
migrations) or did not survive review.
- run_agent: `_materialize_data_url_for_vision` uses
`NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a
corrupt data URL the temp file would persist forever. Wrap the
write in try/except and `os.unlink` the temp on failure.
- gateway/session: `append_to_transcript` JSONL write had no error
handling, so disk-full / read-only-fs / permission errors crashed the
message handler. The SQLite write above is the primary store, so
swallow OSError on the JSONL fallback with a debug log.
- gateway/status: `_read_pid_record` reads `pid_path.read_text()` after
an `exists()` check; if the PID file is deleted between the two
calls (concurrent gateway restart) we hit an unhandled OSError.
Catch it and return None.
Adds a regression test for the tempfile cleanup; the other two paths
are defensive try/excepts on infrequent OSError that don't warrant
dedicated tests.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Re-authored against current main from PR #10388 by @wilsen0. The
original branch is 3800+ commits stale and could not be cherry-picked
without reverting unrelated work; this change carries only the perf
intent forward.
Tuning summary
==============
Text-batch ingress (gateway/platforms/telegram.py):
- HERMES_TELEGRAM_TEXT_BATCH_DELAY_SECONDS default 0.6 -> 0.3
- HERMES_TELEGRAM_TEXT_BATCH_SPLIT_DELAY_SECONDS default 2.0 -> 1.0
- Adaptive fast-path tiers in _flush_text_batch:
total <= 320 cp -> min(cap, 0.18)
total <= 1024 cp -> min(cap, 0.24)
else -> cap
A single short reply now reaches the agent in ~180ms instead of
600ms. Tier constants compose with the configured cap via min()
so an operator who tightens HERMES_TELEGRAM_TEXT_BATCH_DELAY_SECONDS
below 0.18 still wins on every tier.
- _env_float_clamped helper replaces bare float(os.getenv()).
Rejects NaN / Inf, applies optional min/max bounds. Used for
text-batch + media-batch knobs. Prevents asyncio.sleep(NaN)
crashes when an operator typos an env var.
Stream cadence (gateway/config.py + stream_consumer.py):
- StreamingConfig.edit_interval default 1.0s -> 0.8s
- StreamingConfig.buffer_threshold default 40 -> 24 chars
- DEFAULT_STREAMING_EDIT_INTERVAL / BUFFER_THRESHOLD / CURSOR are now
a single source of truth. StreamConsumerConfig imports them
instead of duplicating the literals; the prior dual-source drift
is fixed.
Tool progress (gateway/display_config.py):
- Telegram default tool_progress 'all' -> 'new'. Inside
Telegram's ~1 edit/s flood envelope the 'all' default would
accumulate edit pressure on busy chats; 'new' shows only the
leading bubble per tool batch and feels less spammy.
- Slack tier_low override (tool_progress='off') is preserved.
Composition with native draft streaming (#23512)
================================================
The mid-stream cadence (edit_interval, buffer_threshold) gates BOTH
the draft path (send_draft) and the edit path (edit_message), so the
tighter cadence helps native draft as much as edit-based. The
text-batch fast-path applies before the consumer starts, so it speeds
up the first-token latency on every transport. No conflict.
Stale-base avoidance
====================
Re-authored from scratch rather than cherry-picked. Dropped from the
original branch:
- Unrelated d2f043f9c 'fix(anthropic): preserve third-party thinking
continuity' commit
- boot_md.py builtin gateway hook (unrelated)
- Reverted Slack tool_progress='off' (#14663) restoration
- Reverted Platform plugin discovery, MSGRAPH_WEBHOOK, YUANBAO
members deletion
- 2300+ lines of run.py base-skew noise
Tests
=====
New tests/gateway/test_telegram_text_batch_perf.py:
- 7 tests for _env_float_clamped (NaN, Inf, garbage, bounds).
- 4 tests for the adaptive-tier composition rules.
Updated tests/gateway/test_display_config.py:
- test_platform_default_when_no_user_config: 'all' -> 'new' for
Telegram, with comment.
- test_high_tier_platforms: split into Telegram-overrides-to-new
and Discord-stays-all assertions.
Closes#10388.
Co-authored-by: wilsen0 <132184373+wilsen0@users.noreply.github.com>
More specific name. The skill is REST + GraphQL debugging end-to-end,
not generic 'api testing' (a smoke-test pytest scaffold is one short
section out of ~500 lines). Renames directory + frontmatter name +
self-reference in the delegate_task example body.
- Implement tests for normalizing perpetual markets and DEXs.
- Validate JSON output for main commands including markets, candles, and review.
- Ensure environment variable resolution and dotenv file reading are covered.
- Test export functionality for market data with expected output structure.
agent.redact._REDACT_ENABLED is snapshotted at import time from
HERMES_REDACT_SECRETS env. Under xdist a prior test in the same worker
can flip it, so test_exec_command_output_is_redacted was order-dependent.
Pin it via monkeypatch like test_terminal_output_transform_still_runs_strip_and_redact does.
1. Quick command exec ran in the gateway process's full environment
without env sanitization or output redaction. A quick command like
"env" or "printenv" would leak all API keys, OAuth tokens, and
bot credentials to the messaging user.
Fix: apply _sanitize_subprocess_env() before exec and
redact_sensitive_text() on output before returning.
2. GatewayRunner._pending_messages was written on every interrupt
(lines 1331-1334) but never read or consumed anywhere. The actual
interrupt delivery uses adapter._pending_messages (a separate dict).
Removed the write-only accumulation to prevent unbounded growth.
Two new tests:
- tests/gateway/test_telegram_format.py
test_message_too_long_splits_into_continuations_not_silent_truncation:
asserts edit_message returns success=True with continuation_message_ids
populated and message_id pointing at the last continuation when
content exceeds MAX_MESSAGE_LENGTH (#19537). Replaces the original
fail-on-overflow assertion with the split-and-deliver contract.
- tests/gateway/test_stream_consumer.py
TestEditOverflowSplitAndDeliver.test_consumer_advances_message_id_on_split_and_deliver:
asserts the consumer side updates _message_id to the latest
continuation, clears _last_sent_text, and fires on_new_message when
the adapter reports a split-and-deliver result.
When edit_message_text exceeded Telegram's 4096 UTF-16 codepoint limit,
the adapter caught the BadRequest, best-effort truncated the content
with '…', and returned SendResult(success=True). The stream consumer
believed the full edit was delivered and never recovered, silently
dropping everything past the truncation boundary on long replies.
Returning failure isn't safe either — the consumer's existing fallback
path can race against the next streaming tick, producing duplicate
sends or gaps. Instead, the adapter now SPLITS the oversized payload
across the existing message + new continuation messages, so the user
always gets the full reply in correct order.
How it works:
1. Pre-flight: if utf16_len(content) already exceeds MAX_MESSAGE_LENGTH,
call the new _edit_overflow_split helper directly — saves a doomed
round-trip + a Telegram error.
2. Reactive: if Telegram still returns 'message_too_long' after the
pre-flight (e.g. parse_mode formatting inflated the payload past
the limit via MarkdownV2 escapes), the same helper handles it.
3. _edit_overflow_split:
- Splits via truncate_message(len_fn=utf16_len) — same chunking the
non-streaming send() path uses; chunks get '(1/N)' suffixes.
- Edits the original message_id with chunk 1 (with parse_mode +
plain-fallback when finalize=True, mirroring the main edit path).
- Sends each remaining chunk via self._bot.send_message threaded as
a reply to the previous chunk so the user sees them as a
contiguous block. MarkdownV2-with-plain-fallback per chunk on
finalize.
- Returns SendResult(success=True, message_id=<last_chunk_id>,
continuation_message_ids=(<chunk2_id>, <chunk3_id>, ...)) so the
stream consumer can keep editing the most recent visible message
and the gateway has full visibility into every message id.
SendResult contract extension:
Added optional continuation_message_ids: tuple = () field. When
empty (the common case), behavior is unchanged. When populated, the
caller knows the adapter delivered across multiple platform messages.
Stream consumer integration:
GatewayStreamConsumer._send_or_edit advances _message_id to the
last-continuation id when it sees continuation_message_ids on a
successful edit result, resets _last_sent_text (the new visible
message holds only the final chunk's text), and fires
on_new_message so tool-progress bubbles linearize below the new
continuation rather than the original. Mirrors the openclaw #32535
inter-tool-leak guard.
Composes with what just landed:
- PR #23455 (UTF-16 length-aware splitting in stream consumer)
prevents most overflows upstream by measuring text in UTF-16
codeunits before deciding to split. This PR is the safety net at
the adapter boundary.
- PR #23512 (native draft streaming, default for DM Telegram) routes
DM streaming through send_draft, which has its own contract
unaffected by this change. So this fix narrows in scope to the
edit-based path: groups, supergroups, forum topics, every
non-Telegram platform, and the per-response fallback after a
draft failure.
Salvage notes:
- Cherry-picked from PR #19537 by @kjames2001. Original PR returned
failure on overflow; this evolves to split-and-deliver so users
never lose content and the consumer state stays consistent.
- Dropped an unrelated model-picker hunk (line 2114-2117) that
silently killed the 'X more available — type /model <name>
directly' hint by hardcoding total=len(models). Not in scope.
- Restored the timeout-aware retryable=not is_timeout signal in
send()'s fallthrough catch block.
Closes#19537.
Surface ready tasks that nobody claims within a threshold (default
30 min) regardless of why. One identity-agnostic signal that catches:
- Operator typo'd the assignee
- Profile was deleted, leaving its tasks stranded
- External worker pool (Codex CLI lane, custom daemon) is down
- Dispatcher misconfigured (wrong board / wrong HERMES_HOME)
Today the dispatcher correctly skips these (no respawn loop, good)
but nothing surfaces the fact that operator-actionable work is
accumulating. The new `stranded_in_ready` rule does that without
requiring a manual lane registry — it reads the most recent ready-
transition event (`created` / `promoted` / `reclaimed` / `unblocked`)
and fires when (now - last_ready_ts) > threshold.
Severity escalates with age: warning at threshold, error at 2x,
critical at 6x. The cli_hint and reassign actions point operators
at the right next step.
Out of scope deliberately:
- Lane registry (#20157 closed) — this signal supersedes it.
- Pushing the diagnostic into messaging gateways — diagnostics
are pull-only via 'hermes kanban diagnostics' for now; gateway
push is a separate UX decision.
Tests: 10 new + 461 existing kanban tests pass. E2E verified end-
to-end via 'hermes kanban diagnostics --json' against a 2h-old
stranded task — surfaces as error severity with correct actions.
- Bug 1: shift-click now always adds the target card and sets it as the
last-selected anchor, so range selection works even when 0 or 1 cards
are selected.
- Bug 2: column select-all checkbox now toggles: if every card in the
column is already selected, clicking unselects them all.
- Bug 3: applyBulk now mirrors moveSelected with optimistic UI updates
for status moves and calls loadBoard() on catch for consistency.
The Task dataclass has no `summary` field; only Run carries summary.
The dashboard already searches `latest_summary` (derived from the
latest run), so `t.summary` in the client-side haystack was always
undefined and therefore redundant.
Verdict from task t_4bcac44f:
- Before batch QOL (6c7ec94d9): search only covered id, title,
assignee, tenant.
- Batch QOL (7fd187102) correctly added body, result, latest_summary.
- `t.summary` was included but is a misleading no-op because tasks
never expose a `summary` key — `latest_summary` already covers it.
Removes the redundant field from the haystack only.
- When dragging a selected card while multiple cards are selected, the
browser ghost image now shows a 'N cards' badge instead of a single card.
- All selected cards in the original column are dimmed (opacity 0.45 +
grayscale) during the drag so the user sees the whole set is in-flight.
- Uses React state for the dragged task id; event delegation on the board
columns container to avoid deep prop threading.
- Preserve failedIds partial-failure highlighting after moveSelected/
applyBulk by clearing only selectedIds/lastSelectedId instead of
calling clearSelected() (which also wiped failedIds).
- Fix touch/native multi-drag drop stale closure by adding
props.selectedIds and props.onMoveSelected to the hermes-kanban:drop
useEffect dependency array.
Fixes t_5bfafb73.
- Extend BulkTaskBody with reclaim_first: bool = False
- In bulk_update, use kanban_db.reassign_task(..., reclaim_first=True)
when payload.reclaim_first is set and assignee is present
- Falls back to existing assign_task behavior when reclaim_first is false
This enables the dashboard to bulk-reassign running tasks by
reclaiming their claims first, matching the single-task
/tasks/{id}/reassign endpoint behavior.
Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.
Two new tools the judge calls:
- submit_checklist(items) — Phase A, decompose
- update_checklist(updates, new_items, reason) — Phase B, evaluate
Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).
Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
call, the legacy JSON-text parsers stay around as a last-ditch
backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
layer; same 1-based→0-based conversion + terminal-status filter.
Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.
Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
When run_agent's _compress_context fires mid-turn it ends the parent
session in SessionDB and creates a new continuation session with a
fresh session_id. The /goal state is keyed on session_id in
state_meta ("goal:<sid>"), so without forwarding the goal silently
disappears: _get_goal_manager() rebinds for the new session_id,
load_goal() returns None, mgr.is_active() is False, and the
continuation loop dies with no user-visible signal.
Fix: in the same SessionDB transaction block that creates the
continuation session, copy state_meta[goal:<old>] →
state_meta[goal:<new>] when present. No-op when the user has no
active goal. Logged at INFO so a stuck loop is debuggable.
Tests cover the round-trip via SessionDB and the no-op path.
Affects all three run-conversation surfaces (CLI, gateway, TUI
gateway) because _compress_context is the single rotation site.
Out-of-scope behavior change in #23521 — the kanban notifier-routing fix
also flipped the 'kanban create --created-by' default from 'user' to the
active profile name. Revert to keep PR scope focused on the notifier
ownership fix; the profile-aware author default can be its own change.
Added tests/gateway/test_stream_consumer_draft.py with 11 tests
covering:
- Transport selection: auto+dm-supported -> draft; auto+group -> edit;
explicit edit; explicit draft on unsupported adapter -> edit;
MagicMock adapter -> edit (back-compat for the existing test suite).
- Happy path: DM stream animates draft frames with a single shared
draft_id, then finalizes via a regular adapter.send.
- Group fallback: drafts entirely skipped in non-DM chats.
- Failure fallback: send_draft returning success=False disables drafts
for the rest of the response.
- Draft_id lifecycle: consecutive responses use distinct ids; tool
boundaries bump the id so post-tool text animates fresh below the
tool-progress bubble (the openclaw #32535 leak guard).
- _already_sent contract: drafts must NOT set the flag so the gateway's
fallback final-send still fires (drafts have no message_id).
Updated website/docs/user-guide/messaging/telegram.md with a
'Streaming transport' section explaining auto|draft|edit|off, the
DM-only constraint, and the per-response fallback behaviour.
Adds Telegram's native streaming-draft API as a streaming transport so DM
replies render with smooth animated previews as tokens arrive, dropping
the per-edit jitter of the legacy editMessageText polling path.
Adapter contract (gateway/platforms/base.py):
- supports_draft_streaming(chat_type, metadata) -> bool. Default False.
Telegram returns True only for DMs and only when the bound python-
telegram-bot version exposes Bot.send_message_draft (PTB 22.6+).
- send_draft(chat_id, draft_id, content, metadata) -> SendResult.
Default raises NotImplementedError. Telegram delegates to PTB's
send_message_draft. Drafts have no message_id (Bot API contract);
SendResult.message_id is None on success.
Telegram adapter (gateway/platforms/telegram.py):
- supports_draft_streaming gates on chat_type='dm' AND PTB capability.
- send_draft trims to MAX_MESSAGE_LENGTH using utf16_len, threads
message_thread_id through metadata, and routes failures back as
SendResult(success=False, error=...) so the consumer can fall back.
Stream consumer (gateway/stream_consumer.py):
- StreamConsumerConfig gains transport ('auto'|'draft'|'edit'|'off')
and chat_type fields.
- run() resolves _use_draft_streaming once via a probe at the top of
the run, allocating a fresh class-wide draft_id_counter so each
response animates as its own preview (no animation collision across
consecutive responses to the same chat).
- _send_or_edit gains a pre-edit branch: when drafts are active AND
not finalizing AND no edit-path message_id is established, the
frame routes through _send_draft_frame instead of edit_message.
Drafts intentionally do NOT set _already_sent so the gateway's
final sendMessage path still fires — drafts have no message_id and
the user needs a real message in their chat history.
- _reset_segment_state bumps the draft_id when the consumer is in
draft mode so each text block after a tool boundary animates as a
fresh preview below the tool-progress bubble (avoids the inter-
tool-call leak openclaw documented in their #32535).
- Per-response fallback: any send_draft failure (transient network,
server reject, capability gap) flips _use_draft_streaming to False
for the rest of the run, gracefully returning to the edit path.
Gateway config (gateway/config.py):
- StreamingConfig.transport default flips edit -> auto. The auto path
is identical to edit on every chat type that doesn't currently
support drafts (groups, supergroups, forum topics, every non-
Telegram platform), so the default is backwards-compatible for
non-DM users.
Lifecycle model (Telegram Bot API 9.5):
1. sendMessageDraft(chat_id, draft_id, text='') opens the bubble.
2. Repeated sendMessageDraft calls with the SAME draft_id animate
the preview as text grows.
3. Drafts have no message_id and cannot be edited or deleted.
4. When the response finishes the gateway's normal sendMessage path
delivers the final answer; the draft preview clears naturally on
the client and the user sees a real message in their history.
Inspired by PR #3412 by @NivOO5. Re-authored against current main
(stream_consumer.py is now ~4x larger than at #3412's branch base, with
new _NEW_SEGMENT/_COMMENTARY/finalize/_on_new_message machinery the
original PR didn't account for) but the design call (DM-only, edit-
fallback, transport=auto|draft|edit|off) is faithful to the original
proposal, with two improvements baked in:
1. Per-response draft_id (monotonic counter, not a time hash) — no
collision risk across consecutive responses on the same chat.
2. Tool-boundary draft_id bump — prevents the inter-tool-call leak
openclaw hit during their rollout (their #32535).
Closes#21439 (duplicate feature request).
Follow-up to TreyDong's fix: switch the auth header to
`X-Hermes-Session-Token` (the canonical pattern used by the rest of
the dashboard SPA — see `web/src/lib/api.ts` `fetchJSON()`). The
server still accepts both schemes, so the original `Authorization:
Bearer` form would also work; we standardize on X-header to match
every other dashboard fetch and only set the header when a token is
actually present.
Also add scripts/release.py AUTHOR_MAP entry for treydong.zh@gmail.com.
The existing _live_system_guard (PR #23397) blocked os.kill / os.killpg
and a narrow subset of subprocess invocations. Tests still SIGTERMed the
live gateway today (May 10) because the guard had structural holes.
Plug them all:
- subprocess: also wrap getoutput, getstatusoutput
- os.system, os.popen - completely unwrapped before
- pty.spawn - completely unwrapped before
- asyncio.create_subprocess_exec / create_subprocess_shell - bypassed
the subprocess module entirely; now wrapped
- Subprocess command inspection now looks at the WHOLE command string,
not just tokens[0]. Catches sudo systemctl, env systemctl, bash -c
'systemctl', setsid systemctl, /usr/bin/systemctl, etc.
- New process-killer block: pkill / killall / taskkill / fuser
targeting hermes/python patterns is now refused
- os.kill PID 0 (own group) allowed; PID -1 (every process we can
signal) refused
- subprocess.Popen wrapper preserves __class_getitem__ so third-party
packages that use Popen[bytes] as a type annotation still import
Coverage is locked in by tests/test_live_system_guard_self_test.py -
exercises every primitive against a guaranteed-foreign PID and asserts
the guard fires. Adding a new kill primitive without updating the guard
breaks CI.
scripts/run_tests.sh now also force-loads ~/.hermes/pytest_live_guard.py
when present (developer-machine convenience), so even worktrees that
predate this commit get the protection on subsequent test runs through
the canonical wrapper.
A Codex auxiliary timeout closes the underlying OpenAI client (so the
streaming hang doesn't sit until the user kills the session), but the
cached wrapper kept pointing at the now-dead transport. Subsequent
auxiliary calls (compression retry, memory flush, background review,
title generation routed via provider: main) reused that closed client
and failed fast with 'Connection error' until the gateway restarted —
even though the main agent route was healthy the whole time.
Sync `_get_cached_client` had no liveness check (async did, via loop
identity), and the connection-error fallback in `call_llm` only fired
on the auto provider path, so an explicit provider — including the
common `auxiliary.compression.provider: main` shape — never evicted.
Three fixes:
* New `_evict_cached_client_instance(target)` helper that drops the
cache entry whose stored client is target (or wraps it via
`_real_client`, for `CodexAuxiliaryClient`).
* `_CodexCompletionsAdapter._close_client_on_timeout` evicts the
wrapper after closing the inner OpenAI client.
* `call_llm` and `async_call_llm` evict on `_is_connection_error`
before re-raising, regardless of whether the provider is auto.
Net effect: one timeout costs one summary attempt + the existing 30s
compressor cooldown; the next compaction rebuilds the client and
works. Non-connection errors (4xx/5xx) do not evict, so cache hits
stay stable.
Closes#23432
Closes the architectural-pin part of #19931. Most of what that issue
asked for is already implemented (logs under kanban root, env-pinned
workspace, dispatcher routing of unknown assignees, lifecycle
ownership, structured handoff conventions). What was missing:
1. A written contract integrators can point at when adding a new
worker lane shape, and
2. The "code-changing workers should not auto-promote success to
done" convention.
This commit ships both as docs+convention layered on existing primitives.
No kernel changes — the kanban_complete / kanban_block / kanban_comment
surfaces already support the review-required pattern; we just hadn't
written it down or made it visible to workers.
Changes:
- `agent/prompt_builder.py::KANBAN_GUIDANCE`: append the review-required
exception to step 5 of the lifecycle. Workers get the cue
auto-injected into their system prompt — drop structured metadata
into a kanban_comment first, then end with
kanban_block(reason="review-required: <summary>") instead of
kanban_complete when the work needs review. Total prompt size went
from ~3000 to ~3275 chars; well under the 4096 budget enforced by
test_kanban_guidance_size.
- `skills/devops/kanban-worker/SKILL.md`: add a worked example to the
existing "Good summary + metadata shapes" section between the
Coding-task and Research-task examples. Same shape as the others
(kanban_comment with structured handoff JSON, then kanban_block with
the human-readable reason). Plus a one-line guide on when to use
kanban_complete vs the review-required pattern.
- `website/docs/user-guide/features/kanban-worker-lanes.md` (new): the
integrator-facing contract. Covers the hierarchy, the three things
every lane must provide (assignee, spawn mechanism, lifecycle
terminator), the env vars the dispatcher injects, the
review-required convention, the failure modes the kernel handles
for free, and an explicit "external CLI worker lane" deferred-
pending-concrete-asker section that links to #19931 and #19924.
- `website/sidebars.ts`: link the new page under user-guide/features.
The "specialist worker lanes for external CLI tools (Codex / Claude
Code / OpenCode)" runner is NOT shipped here. The dispatcher's
spawn_fn parameter already supports plugin-shaped extension; the
per-CLI integration work (auth, sandbox policy, exit-code mapping)
needs a concrete asker. The new docs page tells would-be integrators
the contract any such lane must satisfy.
Refs #19931
Cherry-picked from PR #10371. Two-layer defense for the spurious-thread_id
issue (#3206):
1. _build_message_event filters DM thread_ids: only preserve thread_id
for real topic messages (is_topic_message=True). Telegram puts
message_thread_id on every DM that is a reply, but reply-chain ids
route to nonexistent threads on send.
2. _send_message_with_thread_fallback helper: control sends
(send_update_prompt, send_exec_approval / send_slash_confirm,
send_model_picker) retry once without message_thread_id when
Telegram returns BadRequest 'Message thread not found'. Mirrors
the pattern PR #3390 added for the streaming send path.
Salvage notes:
- Conflict 1 (line ~4099): merged the contributor's DM is_topic_message
filter with the existing forum General-topic default from #22423,
preserving both behaviors.
- Conflict 2 (line ~1664 / 1690): kept main's delete_message (PR #23416)
alongside the new helper. Tightened the helper's exception catch
from bare 'Exception' to use the existing _is_bad_request_error +
_is_thread_not_found_error helpers (line 484-496) for consistency
with the streaming send path.
- Widened the fix to send_update_prompt (was bare self._bot.send_message,
same bug class).
Authored by rahimsais via PR #10371 (re-attributed from donrhmexe@
local commit author).
* feat(goals): /goal checklist + /subgoal user controls
Two-phase judge for /goal — Phase A decomposes the goal into a detailed
checklist on first turn; Phase B evaluates each pending item harshly
against the agent's most recent response. The goal completes only when
every item is in a terminal status (completed or impossible). Adds
/subgoal so the user can append, complete, mark impossible, undo,
remove, or clear items the judge missed or got wrong.
Mechanics:
- GoalState gains `checklist` and `decomposed` fields, both backwards
compatible (old state_meta rows load unchanged).
- Phase A: aux call writes a harsh, exhaustive checklist; biased toward
more items not fewer. Falls through to legacy freeform judge when
decompose fails.
- Phase B: judge gets the checklist + last-response snippet + path to
a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json.
A bounded read_file tool (max 5 calls per turn, restricted to that
one file) lets the judge inspect history when the snippet is
ambiguous. Stickiness in code: terminal items are frozen, only the
user can revert via /subgoal undo.
- Continuation prompt shows checklist progress when non-empty;
reverts to old prompt when empty.
- Status line shows M/N done counts.
CLI + gateway + TUI gateway all pass the agent reference into
evaluate_after_turn so the dump can be written. Gateway-side
/subgoal is allowed mid-run since it only modifies the checklist
the judge consults at turn boundaries.
Tests: 24 new cases — backcompat round-trip, Phase A decompose,
Phase B updates + new_items + stickiness, user override flows,
conversation dump (incl. unsafe-sid sanitization), judge read_file
restriction. Existing freeform-mode tests updated to patch the
renamed `judge_goal_freeform` and skip Phase A explicitly.
* fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning
Three live-test findings from running /goal end-to-end against
gemini-3-flash-preview as the judge:
1. Off-by-one bug — the judge sees the checklist rendered with 1-based
indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed
state.checklist as 0-based. Result: every judge update landed on
the wrong item, evidence got attached to neighbouring rows, and
the genuine 'first pending' item (usually #1) never got marked.
Fix: convert 1 → 0 in _parse_evaluate_response. Also tightened the
user prompt to call out the 1-based scheme explicitly. New tests
cover the parser conversion + an end-to-end fake-judge round-trip.
2. Conversation dump never happened — _extract_agent_messages tried
common AIAgent attribute names (.messages, .conversation_history,
etc.) but AIAgent doesn't expose the message list as an instance
attribute; it lives inside run_conversation()'s scope. Result: the
judge's read_file tool always saw history_path=unavailable. Fix:
added an explicit messages= kwarg to evaluate_after_turn that all
three call sites (CLI, gateway, TUI gateway) now pass directly.
Agent-attribute extraction kept as back-compat fallback.
3. Prompt was too harsh on simple goals. The original 'be HARSH,
default to leaving items pending' wording made the judge refuse
to mark 'file exists' completed even after the agent ran ls,
test -f, os.path.isfile, and find — burning the entire 8-turn
budget on a fizzbuzz task. Softened to 'strict but not absurd'
with explicit guidance on what counts as evidence and a directive
not to require re-proving items already established earlier.
Re-tested live with the same fizzbuzz goal: now terminates in 2
turns with all 8 checklist items correctly attributed to their
own evidence. /subgoal user-action flow (add / complete / undo /
impossible) verified live as well.
New TestUtf16OverflowDetection class covers two scenarios:
- test_emoji_text_exceeding_utf16_limit_triggers_overflow_split: feeds
2200 emoji codepoints (4400 UTF-16 units) — under Telegram's
codepoint-equivalent limit but over its UTF-16 limit. Asserts
truncate_message was called with len_fn=utf16_len, confirming the
consumer detected the overflow.
- test_codepoint_only_adapter_falls_back_to_len: documents that
adapters which don't subclass BasePlatformAdapter (or test MagicMocks)
fall back to plain len for backwards compat.
The contributor's PR shipped no tests for the UTF-16 path.
The stream consumer measured message length using Python's len() (Unicode
code points), but Telegram's actual limit is in UTF-16 code units. This
caused messages with supplementary characters (emoji, CJK, etc.) to exceed
Telegram's 4096-character limit, resulting in truncated messages with
formatting artifacts.
Changes:
- Add message_len_fn property to BasePlatformAdapter (defaults to len)
- Override in TelegramAdapter to return utf16_len
- Stream consumer uses adapter.message_len_fn for:
- safe_limit calculation
- overflow detection
- truncate_message calls
- split point calculation (via _custom_unit_to_cp)
- fallback final send chunking
Fixes truncated messages with black square artifacts on Telegram when
the model generates responses containing multi-byte Unicode characters.
Slash commands (/clear, /new, /undo, /reload-mcp) are dispatched from the
process_loop daemon thread. prompt_toolkit.run_in_terminal returns a
coroutine that only the main-thread event loop can drive, so calling it
from a daemon thread orphans the coroutine — the input prompt never
renders and user keystrokes leak into the composer instead of the
confirmation prompt (issue #23185).
Mirror the thread-aware guard already in _run_curses_picker: when off the
main thread, fall back to a direct input() call. Also wrap
run_in_terminal in try/except so WSL / Warp / other emulators that
silently drop the scheduled coroutine fall back to input() too.
Tests: tests/cli/test_prompt_text_input_thread_safety.py covers main
thread (run_in_terminal path), daemon thread (direct input fallback),
no-app, run_in_terminal-raises, and EOF handling.
When kanban_complete rejects a created_cards list as hallucinated, the
task is intentionally left in-flight (the gate runs before the write
txn) so the worker can retry with a corrected list or pass
created_cards=[] to skip the check. The retry path already worked, but
the previous error wording read like a terminal failure and workers
were observed abandoning the run instead of trying again.
Spell out the recovery path explicitly in the tool_error response
("Your task is still in-flight ... Retry kanban_complete with ...") and
add regression coverage at both the kernel and tool layers so the
retry contract — and the wording the worker depends on to discover
it — is pinned.
Fixes#22923
The auto-reset notice ("◐ Session automatically reset…") was being sent
with metadata=getattr(event, 'metadata', None), which can drop or
mis-route in Telegram forum topics: the event's metadata isn't
guaranteed to carry the originating thread_id, so the notice could leak
into General or another topic.
Use the existing self._thread_metadata_for_source(source) helper, which
already handles thread_id construction plus the Telegram DM topic
reply-fallback shape used everywhere else in the gateway.
Carve-out of #7404. The PR's other hunk (line 7578, queued first
response) is already redundant on main — gateway/run.py:15782 has used
_status_thread_metadata since the _thread_metadata_for_source plumbing
landed.
Closes#7355 (path B; paths A and C closed via prior salvage merges).
Sub-issue 5 of #22034.
Right-click on the composer always pasted from the clipboard, even when
the user had highlighted text — diverging from terminal-native behavior
(xterm/iTerm/gnome-terminal) where right-click copies an active selection
and only pastes when nothing is selected.
Extract a small pure helper, decideRightClickAction(value, range), and
route the existing onMouseDown right-click branch through it. Selection
present and non-empty -> writeClipboardText(slice). Otherwise fall back
to the existing emitPaste path.
Workers running slow models (e.g. kimi-k2.6) can spend longer than
DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call, making
no tool calls and therefore not heartbeating. release_stale_claims
previously reclaimed these healthy workers, producing the
spawn-then-immediately-reclaim loop reported in #23025.
When a stale-by-TTL claim's host-local worker PID is still alive,
extend the claim (emit a claim_extended event) rather than killing
it. enforce_max_runtime / detect_crashed_workers remain the upper
bounds for genuinely wedged or dead workers. Reclaim events now also
record claim_expires, last_heartbeat_at, worker_pid, and host_local
so operators can see why a worker was killed.
* docs(user-stories): add 116 stories from Discord archive
Mined teknium1/nous-discord-archive for first-person user stories that match
the existing collage voice ('I run X every day', 'my family uses Hermes for
Y', 'so I built Z'). Skipped pure project pitches, Q&A, install help, and
generic announcements.
- Added 'discord' as a source in UserStoriesCollage (label + brand color)
- Added 116 entries to userStories.json (237 total, up from 121)
- Each entry links back to the discord-archive thread or channel archive file
* docs(user-stories): interleave discord stories across the full collage
Shuffle userStories.json with a fixed seed so the 116 Discord-sourced
entries are mixed evenly with the existing 121 entries instead of
appearing as a contiguous block at the end. Even distribution: 10-16
discord entries per decile across the array (ideal would be ~11).
xAI's Responses API returns HTTP 400 ("Model X does not support
parameter reasoningEffort") for grok-4, grok-4-0709, grok-4-fast-*,
grok-4-1-fast-*, grok-3, grok-4.20-0309-*, and grok-code-fast-1 — even
though those models reason natively. Hermes was unconditionally sending
`reasoning: {effort: 'medium'}` to xAI for every Grok model, breaking
direct `--provider xai` for the entire grok-4 line.
Add a substring allowlist predicate (verified live against api.x.ai
2026-05-10) covering the only Grok families that accept the effort dial:
grok-3-mini*, grok-4.20-multi-agent*, grok-4.3*. The Responses transport
omits the `reasoning` key entirely for everything else while still
including `reasoning.encrypted_content` so we capture native reasoning
tokens.
Verified end-to-end: `hermes chat -q hi --provider xai --model grok-4-0709`
went from HTTP 400 to a successful reply.
The contributor's regression test for Feishu fallback thread routing
asserted on attributes specific to the real lark SDK builder
(call_args.body, body.receive_id). In test environments without the
lark SDK installed, the in-tree fallback (gateway/platforms/feishu.py
_build_create_message_request) returns a SimpleNamespace using
.request_body instead of .body, causing AttributeError.
Now reads via getattr fallback and also verifies receive_id_type is
'thread_id' (not 'chat_id') as a stronger contract check.
When the first streamed message exceeds the platform length limit and
gets split into chunks, _send_new_chunk was called with self._message_id
(which is None on first send), dropping thread routing entirely.
Fallback to self._initial_reply_to_id so overflow chunks land in the
correct topic/thread.
Also fix a fragile test assertion that could be silently skipped.
Cherry-picked from PR #13077 commits:
- 5500c7d8 fix(gateway): stream consumer first message drops thread context
- e84403b9 test(gateway): add regression tests for stream consumer thread routing
Fixes: Streaming first message drops thread/topic context in Feishu group
topics, Slack threads, Telegram forum topics. Adds initial_reply_to_id
ctor arg to GatewayStreamConsumer, threaded through _send_or_edit and
_send_new_chunk. Also fixes Feishu _send_raw_message fallback path
(reply -> create) to use receive_id_type='thread_id' so the new message
lands in the correct topic instead of the main channel.
Authored by hrygo via PR #13077 (re-attributed from the bot-authored
salvage commit on the original branch).
The split-overflow path in _send_or_edit (gateway/stream_consumer.py) was
copying the cumulative _already_sent flag into _final_response_sent on the
done frame. _already_sent goes True on any successful prior edit (tool
progress) or on fallback-mode promotion when an edit fails — neither
proves the *current* chunked send delivered the final answer.
When the chunked send actually fails (network error, flood control), the
consumer would wrongly claim 'final delivered' and the gateway's
independent fallback delivery in run.py would be suppressed. User saw
only tool-progress bubbles and never got the answer.
Now we track per-chunk success locally: _send_new_chunk returns the new
message_id on success or returns the passed-in reply_to unchanged on
failure. If at least one returned id differs, chunks_delivered = True;
otherwise stays False, gateway fallback runs.
Adds two regression tests:
- test_split_overflow_failed_send_does_not_mark_final_sent — primes
_already_sent=True, then makes every send fail; asserts
_final_response_sent stays False.
- test_split_overflow_partial_send_marks_final_sent — happy path,
asserts _final_response_sent goes True.
Note: the companion bug at the CancelledError handler (issue cited
lines 417-418) was already fixed by 3b5572ded on 2026-04-16.
Closes#10748
Follow-up to the previous commit's notifier behavior change. Two test fixes:
1. `tests/gateway/test_kanban_notifier.py` gains
`test_notifier_redelivers_same_kind_on_dispatch_cycle` — pins the new
contract directly: a task that crashes, gets reclaimed, and crashes
again notifies the user BOTH times. Before #21398 the second crash
silently dropped because the subscription was already deleted.
2. `tests/hermes_cli/test_kanban_notify.py::
test_notifier_unsubs_after_abnormal_events[gave_up|crashed|timed_out]`
is flipped. Those tests were added in the salvage of #22941 and
asserted the OLD behavior (subscription deleted after gave_up /
crashed / timed_out). They're now obsolete — the new contract is
"subscription survives a non-final terminal event so retries reach
the user." Updated docstring + asserts; the cursor-advance check is
added to confirm the dedup mechanism still works.
The `test_notifier_unsubs_after_completed_event` test stays untouched
because `completed` IS still a terminal event that triggers unsub
(the task hits `done` status, which is handled by the `task_terminal`
branch in the notifier loop).
The kanban notifier was re-firing the same blocked/gave_up/crashed/timed_out
notifications on every 5-second tick. Root cause: after delivering a terminal
event, the notifier unsubscribed the subscription, deleting its cursor. If
the unsub failed (WAL contention, transient error), the subscription survived
with a stale cursor, and the next tick would re-deliver the same event.
Even when the unsub succeeded, the subscription was gone. If the task later
transitioned to a different state (e.g., blocked -> unblocked -> blocked
again), a new subscription would start at cursor=0, re-delivering all past
events.
Fix: stop unsubscribing on terminal event kinds. Only remove the subscription
when the task reaches a truly final status (done/archived). For blocked,
gave_up, crashed, and timed_out, the subscription stays alive and the cursor
mechanism deduplicates naturally -- events with id <= last_event_id are never
re-fetched. This makes the dedup idempotent and eliminates the re-fire bug.
The old concern about subscriptions leaking forever on blocked tasks is moot:
blocked tasks will eventually be unblocked (transitioning to ready/running)
or archived, at which point the subscription is cleaned up.
Follow-up to HuangYuChuh's #17384 cherry-pick:
- Use defensive getattr+logger.debug for delete_message lookup, mirroring
the sibling _try_send_fresh_final cleanup pattern at L820+. Platforms
that don't implement delete_message no longer raise AttributeError; the
failure path now logs at debug for diagnosability instead of silently
swallowing.
- Add three regression tests in tests/gateway/test_stream_consumer.py:
- delete_message awaited on happy-path exit with stale id
- delete_message NOT awaited when no fallback chunks reached the user
- no crash on adapters that lack delete_message (spec-restricted mock)
When Telegram flood control triggers 3+ consecutive edit failures, the
stream consumer enters fallback mode and sends the complete response as
a new message. This leaves the user seeing two messages: a frozen
partial (with cursor) and the full duplicate.
After the fallback chunks are sent successfully, delete the original
partial message so the user only sees one complete response. The delete
is best-effort — if it fails (e.g. flood still active, missing
permissions), the full answer is still delivered.
Fixes#16668
The shutdown forensics added in #23285 caught tests/hermes_cli/ pytest
runs sending SIGTERM to the developer's live gateway 5+ times in 3
days. Root cause: when a single test forgets to mock os.kill or
find_gateway_pids, the real call leaks past the hermetic HERMES_HOME
isolation — find_gateway_pids' psutil scan walks the whole machine and
returns the live gateway PID, then the unmocked os.kill delivers the
signal.
Rather than audit and patch ~30 tests across cmd_update, kill_gateway_processes,
and stop_profile_gateway code paths, install a single autouse guard in
tests/conftest.py that blocks the two primitives that actually cause
the damage:
- os.kill rejects any PID outside the test process subtree with a
hard RuntimeError so the offending test gets a stack trace instead
of silently murdering the real gateway.
- subprocess.run / Popen / call / check_call / check_output reject
any 'systemctl <verb> hermes-gateway' invocation that would mutate
the live unit. Read-only systemctl calls (status, show, list-units)
still pass through.
We intentionally do NOT stub find_gateway_pids / _scan_gateway_pids —
tests of those functions themselves need the real implementation.
Discovery without delivery is harmless; the os.kill + systemctl guards
catch the actual damage path.
Tests that legitimately need real signal delivery (e.g. PTY tests
signalling their own child) opt out via
@pytest.mark.live_system_guard_bypass.
Validation: tests/hermes_cli/ + tests/cli/ + tests/gateway/ produce
the same 17 failures with and without this guard (all pre-existing on
main, unrelated to gateway-kill leaks). The live gateway survives the
test run that previously SIGTERMed it.
Two follow-up improvements to the previous commit's notifier dedup work.
1. Add a regression test for the send-exception rewind path. The
contributor's PR included a test for the adapter-disconnect path
(test_kanban_notifier_rewinds_claim_if_adapter_disconnects, where
adapter is None at delivery time), but not for the "adapter is
connected, send() raises" path that fires inside the inner try/except
at gateway/run.py:4314. The new test
(test_kanban_notifier_rewinds_claim_on_send_exception) uses a
FailingAdapter that always raises and confirms (a) send was actually
attempted, (b) the claim was rewound, (c) the next call to
unseen_events_for_sub still returns the event for retry.
2. Drop the per-delivery success log from INFO to DEBUG. A busy board
on a multi-platform gateway can produce hundreds of these per day;
that's gateway.log noise that obscures real warnings. Failure paths
stay at WARNING (where you'd want to look when something's wrong)
so we don't lose visibility into transient send issues.
Adds a Cross-Platform Handoff section to user-guide/sessions.md covering
the CLI flow, per-platform thread behavior (Telegram topics / Discord
threads / Slack message-anchored / no-thread fallback), failure modes,
and the resume-back-to-CLI loop.
Adds the /handoff entry to reference/slash-commands.md and updates the
CLI-only commands note.
Three issues hit during a fresh Windows install + first `hermes update`:
1. `pyproject.toml` re-introduced the invalid `exclude-newer = "7 days"`
under [tool.uv]. uv requires an RFC 3339 / ISO date — relative-duration
strings parse-fail. The line was removed in PR #21221 on May 7 and
accidentally added back in the v0.13.0 release commit (498bfc7bc1)
the same day. Every uv invocation throughout install logged a TOML
parse error, confusing users into thinking the install was broken.
Fix: remove the line (and the now-empty [tool.uv] section).
2. `hermes update` failed on Windows with
`Access is denied. (os error 5)` when uv tried to overwrite
`venv\\Scripts\\hermes.exe` — the running entry-point shim. Windows
blocks REPLACE on a mapped/loaded executable but allows RENAME (kernel
tracks the file by handle, not path; same trick Chrome/Firefox use for
self-update). Pre-rename live shims to `hermes.exe.old.<unix-ms>`
before each `uv pip install -e .`; uv writes a fresh shim at the
original path; the .old files are swept on the next hermes invocation.
Wraps every install attempt (primary, base-only fallback, and
per-extra retries). Restores shims if uv fails before writing
replacements.
3. Tools post-setup hooks (ddgs, piper-tts, kittentts, langfuse,
tinker-atropos) shelled out to `[sys.executable, '-m', 'pip', ...]`
and died with `No module named pip` on every fresh Windows install.
install.ps1 creates the venv via `uv venv` which doesn't seed pip;
install.ps1 bootstraps pip later, but only inside the platform-SDK
verify block — by then the wizard's post-setup hooks have already
run and failed.
New `_pip_install` helper tries uv pip first (works in pip-less
venvs), then python -m pip, then ensurepip-bootstrap-then-pip. All
five post-setup sites now route through it.
E2E:
- uv pip compile pyproject.toml — no parse warning
- quarantine + cleanup with simulated Windows scripts dir; rollback
works when uv install fails before writing replacement shim
- _pip_install in a real `uv venv`-created (pip-less) venv: bootstraps
pip via ensurepip and completes the install
Tests: tests/hermes_cli/ — 4135 pass, 8 pre-existing failures on main
unrelated to this PR (kanban_boards, openclaw_migration,
update_gateway_restart, web_server PluginAPIAuth).
Builds on @kshitijk4poor's CLI handoff stub. The original PR's flow
deferred everything to whenever a real user happened to message the
target platform; this rewrites it so the gateway picks up handoffs
immediately and the destination chat just starts working.
State machine on sessions table replaces the boolean flag:
None -> 'pending' -> 'running' -> ('completed' | 'failed')
plus handoff_error for failure reasons. CLI request_handoff /
get_handoff_state / list_pending_handoffs / claim_handoff /
complete_handoff / fail_handoff helpers wrap the transitions.
CLI side (cli.py): /handoff <platform> validates the platform's home
channel via load_gateway_config, refuses if the agent is mid-turn,
flips the row to 'pending', and poll-blocks (60s) on terminal state.
On 'completed' it prints the /resume hint and exits the CLI like
/quit. On 'failed' or timeout it surfaces the reason and the CLI
session stays intact.
Gateway side (gateway/run.py): new _handoff_watcher background task
scans state.db every 2s, atomically claims pending rows, and runs
_process_handoff for each. _process_handoff:
1. Resolves the platform's home channel.
2. Asks the adapter for a fresh thread via the new
create_handoff_thread(parent_chat_id, name) capability so the
handed-off conversation gets its own scrollback. Adapters that
don't support threads (or fail) return None and the watcher
falls back to the home channel directly.
3. Constructs a SessionSource keyed as 'thread' when a thread was
created, 'dm' otherwise, then session_store.switch_session
re-binds the destination key to the CLI session_id. The full
role-aware transcript replays via load_transcript on the next
turn (no flat-text injection into context_prompt).
4. Forges a synthetic MessageEvent(internal=True) with the handoff
notice and dispatches through _handle_message; the agent runs
against the loaded transcript and adapter.send delivers the
reply.
5. Marks the row 'completed' on success, 'failed' (+error) on any
exception.
Adapter capability (gateway/platforms/base.py): create_handoff_thread
default returns None. Three overrides:
- Telegram (gateway/platforms/telegram.py): wraps _create_dm_topic
so DM topics (Bot API 9.4+) and forum supergroups both work.
- Discord (gateway/platforms/discord.py): parent.create_thread on
text channels with a seed-message + message.create_thread
fallback for permission edge cases. Skips DMs and other
non-thread-capable parents.
- Slack (gateway/platforms/slack.py): posts a seed message and
returns its ts as the thread anchor — Slack threads are
message-anchored.
In thread mode, build_session_key keys the destination without
user_id (thread_sessions_per_user defaults to False) so the synthetic
turn and any later real-user message in the thread share the same
session_key — seamless takeover without race.
CommandDef stays cli_only=True (handoff is initiated from the CLI;
gateway exposes /resume for the reverse direction).
Removed the original PR's _handle_message_with_agent handoff hook
(transcript-as-text injection into context_prompt) and the
send_message_tool notification — both replaced by the watcher path.
Tests rewritten around the new state machine: 13/13 pass.
E2E-validated thread + no-thread paths and the failure path against
real worktree imports with mocked adapters.
Adds /handoff <platform> CLI command that queues the current session for
resume on the configured home channel of any messaging platform.
CLI side:
- /handoff telegram — marks session in shared DB, sends summary to
the Telegram home channel via send_message
- /handoff discord — same for Discord
- Supports telegram, discord, slack, whatsapp, signal, matrix
Gateway side:
- On new session creation, checks for pending handoffs for the
incoming message's platform
- If found, loads the CLI session's full conversation history and
injects it into the context prompt as a handoff transcript
- Agent continues the conversation seamlessly
Files:
- hermes_state.py: handoff_pending, handoff_platform columns + helpers
- cli.py: _handle_handoff_command dispatch + handler
- hermes_cli/commands.py: CommandDef entry
- gateway/run.py: handoff detection in _handle_message_with_agent
- tests/hermes_cli/test_session_handoff.py: 8 tests
The skill enumerated 8 specialist profile names (researcher, analyst,
writer, reviewer, backend-eng, frontend-eng, ops, pm) as "the standard
roster" and told orchestrators to "assume these exist." Almost no real
Hermes setup matches that fleet — single-profile setups, Docker-worker
setups, and curated-team setups all violate it — so following the skill
literally produced cards assigned to non-existent profiles, which the
dispatcher silently failed to spawn (no autocorrect, no fallback, just
sits in `ready` forever).
Changes:
- Drop the standard-specialist-roster table.
- Add a "Profiles are user-configured — not a fixed roster" section at
the top with a Step 0 that prescribes `hermes profile list` (or asking
the user) before fanning out. Cache the result in working memory.
- Rewrite the worked task-graph example with placeholder names
(<profile-A>, <profile-B>, <profile-C>) so the structure is still
teachable but doesn't invite copy-paste of role names that may not
exist.
- Reframe the "If no specialist fits" anti-temptation rule: don't
invent profile names; ask the user.
- Add a "Inventing profile names that doesn't exist" entry to Pitfalls.
- Bump skill version 2.0.0 → 3.0.0 (semantic break: previous behavior
promised a roster the skill no longer enumerates).
- Update website/docs/user-guide/features/kanban.md to drop the
matching "(researcher, writer, analyst, backend-eng, reviewer, ops)"
line and explain the discovery prompt instead.
- Re-run website/scripts/generate-skill-docs.py to refresh the
auto-generated skill page + catalog.
Closes#21131 in spirit — addresses the same hardcoded-names footgun
@yehuosi flagged, with a different shape than their PR (delete the
roster rather than replace each name with placeholder, since the
roster table was the load-bearing footgun and the worked example is
salvageable with placeholder profile names).
Co-authored-by: yehuosi <yehuosi@users.noreply.github.com>
* feat(gateway): per-platform admin/user split for slash commands
Adds an opt-in two-list access control on top of the existing per-platform
`allow_from` allowlists, scoped to slash commands only:
- allow_admin_from — full slash command access
- user_allowed_commands — what non-admins may run
- group_allow_admin_from — same, group/channel scope
- group_user_allowed_commands
When `allow_admin_from` is unset for a scope, gating is disabled and every
allowed user keeps full access (backward compat). Plain chat is unaffected.
`/help` and `/whoami` are always reachable so users can see what they
can run.
Gate runs at the slash command dispatch site in gateway/run.py and uses
`is_gateway_known_command()`, so it covers built-in AND plugin-registered
commands through the live registry without per-feature wiring.
Adds `/whoami` showing platform, scope, tier, and runnable commands.
Salvage of PR #4443's permission tier work, scoped down. The full tier
system, tool filtering, audit log, usage tracking, rate limiting,
`/promote` flow, and persistent SQLite stores are not included here —
those can be re-expanded later if needed.
Co-authored-by: ReqX <mike@grossmann.at>
* fix(gateway): close running-agent fast-path bypass + add coverage and central docs
The slash command access gate was only applied at the cold dispatch site
(line ~5921). When an agent was already running, the running-agent
fast-path block (line ~5574) dispatched /restart, /stop, /new, /steer,
/model, /approve, /deny, /agents, /background, /kanban, /goal, /yolo,
/verbose, /footer, /help, /commands, /profile, /update directly
without going through the gate — letting non-admins bypass gating just
because an agent happens to be busy.
Refactored the gate into _check_slash_access() and called from BOTH
paths. /status remains intentionally pre-gate so users can always see
session state.
Also added 18 more dispatch tests covering:
- Running-agent fast-path: blocks non-admin, allows admin, /status
always works
- Alias canonicalization (gate uses canonical name, not user alias)
- Unknown / unregistered commands pass through (don't false-positive)
- DM admin scope-locked when group has its own admin list
- Multi-platform isolation (Discord gated, Telegram unrestricted)
Docs: added Slash Command Access Control section to the central
messaging index page + /whoami row in the chat commands table.
Co-authored-by: ReqX <mike@grossmann.at>
---------
Co-authored-by: ReqX <mike@grossmann.at>
xAI is retiring grok-4, grok-4-0709, grok-4-fast{,-reasoning,-non-reasoning},
grok-4-1-fast{,-reasoning,-non-reasoning}, and grok-code-fast-1 on
May 15, 2026 at 12:00 PT. Remove them from the static fallbacks so the
`hermes model` picker, gateway /model picker, and setup wizard stop
auto-suggesting models that will be dead in days.
- _XAI_STATIC_FALLBACK in hermes_cli/models.py now lists only grok-4.20-*
and grok-4.3 (the live replacements).
- copilot lists in hermes_cli/models.py and hermes_cli/setup.py drop
grok-code-fast-1 (Copilot proxies it through xAI, so the upstream
retirement breaks it there too).
Old configs that already reference retired IDs keep working until xAI
flips the switch — context-length lookups in agent/model_metadata.py and
the cache-affinity-header logic in provider_profiles still recognise the
old names. The cleanup here is purely about not advertising them to new
users.
Closes#23278.
Source: https://docs.x.ai/developers/migration/may-15-retirement
Follow-up to the previous commit's behavior fix.
Adds a paragraph to dispatch_once's docstring making the concurrency-cap
semantic explicit, and an inline comment near the running_count query
explaining why we do the count (so a future reader doesn't refactor it
back to per-tick semantics thinking it's redundant). Both call out the
unbounded-accumulation failure mode that motivated the fix, since
nothing in the codebase or skills currently documents what max_spawn
is supposed to mean.
The semantic is per-board: each kanban board has its own SQLite file,
so the running-count COUNT(*) is naturally scoped to the board the
dispatcher tick is processing.
When the gateway received SIGTERM, the shutdown_signal_handler ran a
synchronous 'ps aux' (3s timeout) inside the asyncio event loop, then
asyncio.create_task(runner.stop()). On a busy host that ate 1-3s of
the teardown budget before draining could even start, and the resulting
log line was a multi-line ps dump that didn't tell us who sent the
signal. The shutdown path itself logged 'Stopping gateway...' and then
nothing until 'Gateway stopped' — when systemd SIGKILLed mid-drain,
there was no way to see which phase wedged.
Changes:
- New gateway/shutdown_forensics.py:
* snapshot_shutdown_context(sig) — sub-millisecond /proc-only capture
of signal name, parent pid+name+cmdline, INVOCATION_ID (systemd
marker), loadavg_1m, TracerPid, takeover/planned-stop marker
presence + whether-it-names-self. Pure stdlib, never raises.
* spawn_async_diagnostic(log_path, sig) — detached subprocess with
its own 'timeout 5s', start_new_session=True, writes ps auxf +
pstree + dmesg to ~/.hermes/logs/gateway-shutdown-diag.log.
Returns immediately, can't block the event loop or the cgroup
teardown.
* check_systemd_timing_alignment(drain_timeout) — reads
/proc/self/cgroup for our unit, asks systemctl show for
TimeoutStopUSec, returns mismatch info when the unit's stop
timeout is smaller than restart_drain_timeout + 30s headroom
(the case where systemd SIGKILLs mid-drain).
* _parse_systemd_duration_to_us — covers '90s', '1min 30s',
'500ms', '1h' style values from systemctl show.
* format_context_for_log — single scannable key=value line, parent
cmdline last.
- gateway/run.py shutdown_signal_handler:
* Replaces synchronous ps aux + ad-hoc 'hermes-related lines' filter
with snapshot + detached spawn.
* Always logs 'Shutdown context: signal=... parent_pid=...
parent_cmdline=...' regardless of planned/unexpected so we can
correlate signal source even on planned restarts.
- gateway/run.py _stop_impl:
* Per-phase '+X.XXs' timing for notify_active_sessions, drain
(with drain_seconds, active_at_start, active_now, timed_out),
post-interrupt tool kill, each adapter disconnect (Xs),
all adapters disconnected, final-cleanup tool kill, SessionDB
close, total teardown.
- gateway/run.py start():
* Stale-unit warning at startup when the running systemd unit's
TimeoutStopSec is smaller than the configured drain timeout.
Points the user at 'hermes gateway service install --replace'
to regenerate, or at shortening agent.restart_drain_timeout.
Tests: 30 new in tests/gateway/test_shutdown_forensics.py — snapshot
speed bound, signal name resolution, marker detection self-vs-other,
async diag spawn doesn't block caller, systemd duration parser, and
alignment check returns None outside systemd. Wider tests/gateway/
suite: 5258 passing, 3 pre-existing TTS-routing failures unchanged
on main.
Follow-up to the previous commit's toolset-vs-skill validation.
The contributor's fix raises ValueError on the first toolset name found
in the skills list. That works for one mistake, but agents that confuse
skills with toolsets usually pass several at once
(`skills=["web", "browser", "terminal"]`) — and serial-correcting one
per failure round-trip wastes tokens. Collect all toolset-shaped
entries first, then raise once with the full list.
The error message is also slightly clearer:
'web', 'browser', 'terminal' are toolset names, not skill name(s).
Put toolsets in the assignee profile's `toolsets:` config instead of
per-task skills. Skills are named skill bundles (e.g. `kanban-worker`,
`blogwatcher`); toolsets are runtime capabilities (e.g. `web`,
`browser`, `terminal`).
vs. the previous "the assignee profile's toolsets" — explicitly naming
the YAML key (`toolsets:`) and giving concrete examples in both
categories closes the conceptual gap that produced the bug to begin
with.
Adds one regression test (test_create_task_skills_lists_all_toolset_typos)
covering the multi-name aggregation path. The single-typo test from
the original PR still passes (the loose `match="toolset name"` matches
both singular and plural forms).
Two follow-up improvements to Tranquil-Flow's metadata-panel restyle.
Both stay within the parent PR's "tone down the panel" scope.
1. Native <details>/<summary> collapse for verbose metadata.
The parent PR consciously deferred this ("adding native expand/collapse
would be the next step but requires UX agreement"). The default they
asked for is straightforward: collapsed when the rendered JSON exceeds
300 chars (the threshold where the max-height: 8.5rem cap actually
starts mattering), expanded otherwise. <details>/<summary> is the right
primitive — zero JS, browser-handled state, accessible by default
(keyboard-navigable, screen-reader announces the disclosure state),
and survives any react-state churn for free.
The OS-default disclosure marker is suppressed (list-style: none +
::-webkit-details-marker hidden) and replaced with a CSS ::before
chevron that rotates 90deg on the [open] attribute, so the look is
consistent across Firefox/WebKit/Blink without the double-marker
that would otherwise appear on the platforms that still render the
default triangle.
2. Skip rendering when metadata is an empty object.
`r.metadata && ...` truthy-checks, but `{}` is truthy in JS — so a
completed task with no actual metadata would render a "Metadata"
labeled disclosure block containing literal `{}`. Adds an
Object.keys(r.metadata).length > 0 guard so empty payloads render
nothing instead of an empty disclosure stub.
Tests: three new static-asset assertions covering the <details> shape,
the empty-object skip, and the suppress-default-marker + animated-chevron
CSS — all in `tests/plugins/test_kanban_dashboard_plugin.py`.
Hand-rebased onto current main from PR #19980; the original branch was stale
against main (~6 unrelated dashboard fixes had landed since), so applying
the PR's dist files directly would have silently reverted them.
The run-history panel in the task drawer rendered each completed run's
`metadata` field as a `<code class="hermes-kanban-run-meta">` containing
`JSON.stringify(r.metadata)` — a single unindented monoline. With
`white-space: pre-wrap` and a monospace font, a writer task's metadata
(changed_files paths, source URLs, generated-artifact details) wrapped
into a tall block of code-ish text that filled the parent run row. The
container's faint `--color-foreground 3%` background then made the whole
thing read like a crash dump even though the run completed normally.
Restyle and label, no interactivity changes:
- Wrap the meta payload in a `.hermes-kanban-run-meta-block` sub-block
with an explicit `Metadata` label (small, uppercase, muted) so the
panel reads as auxiliary detail at a glance.
- Pretty-print the JSON (`indent=2`) so the structure is scannable
instead of a wall of monoline text.
- Cap `.hermes-kanban-run-meta` at `max-height: 8.5rem; overflow: auto`
so a verbose blob scrolls inside its own pane rather than swamping
the run row.
- Sub-block uses a thin `border-left` rule and `background: transparent`
— distinct from the destructive-tinted treatment used by crashed /
timed_out / blocked / spawn_failed runs higher in the same file.
Tests: two new static-asset assertions in
`tests/plugins/test_kanban_dashboard_plugin.py` lock in the rendered
shape (the plugin ships built-only, no src/).
Adds CDPSupervisor.evaluate_runtime() and wires it into _browser_eval as a
fast path when a supervisor is alive for the current task_id. Replaces the
~180ms agent-browser subprocess fork+exec+Node-startup hop with a ~1ms
Runtime.evaluate over the supervisor's already-connected WebSocket.
Falls through to the existing agent-browser CLI path when no supervisor is
running (e.g. backends without CDP, or before the first browser_navigate
attaches one), so behaviour is unchanged where it can't apply.
JS-side exceptions surface directly without falling through to the
subprocess (the subprocess would just re-raise the same error, slower);
supervisor-side failures (loop down, no session) fall through cleanly.
Benchmark — 30 iterations of `1 + 1` against headless Chrome:
supervisor WS mean= 0.96ms median= 0.91ms
agent-browser subprocess mean=179.35ms median=167.73ms
→ 187x speedup mean
Tests: 14 unit tests (mocked supervisor + response-shape coverage), 5
real-Chrome e2e tests in test_browser_supervisor.py (gated on Chrome
being installed). Browser test suite: 355 passed, 1 skipped.
Follow-up to the previous commit's casing fix.
The original PR shipped the dist edits without test coverage. The
contributor's reasoning (UI-only attributes in a pre-built JS bundle,
nothing meaningful to unit-test) is fair, but a static-asset assertion
catches the most likely regression vector — a future rebuild of the
dist bundle that loses the attributes — at near-zero cost.
Adds two regression tests in tests/plugins/test_kanban_dashboard_plugin.py:
- test_dashboard_assignee_inputs_preserve_casing — reads dist/index.js
and asserts autoCapitalize="none", autoCorrect="off", spellCheck=false,
and textTransform="none" each appear at least twice (one per assignee
input — inline triage/lane create + task-edit panel).
- test_dashboard_lane_head_preserves_assignee_casing — reads dist/style.css
and asserts the .hermes-kanban-lane-head rule body does NOT contain
text-transform: uppercase. Locates the rule by marker so unrelated CSS
churn nearby doesn't flake the test.
Both follow the same shape as the existing test_dashboard_requests_default_board_explicitly
static-asset guard from PR #22940's salvage.
Also adds the AUTHOR_MAP entry for princepal9120's GitHub-noreply email
so release notes credit the right account.
Follow-up to the previous commit's safe-int task_age fix.
The original PR shipped without test coverage. This commit adds:
- test_safe_int_accepts_int_and_int_string — sanity for the well-typed
path so the helper itself can't quietly start swallowing valid values.
- test_safe_int_returns_none_on_corrupt_inputs — the failure modes
(None, '%s', 'abc', '', '1.5', random objects). Covers both the
ValueError and TypeError catch branches.
- test_task_age_handles_corrupt_created_at — the headline regression:
a task with created_at='%s' used to raise ValueError and turn
GET /api/plugins/kanban/board into a 500.
- test_task_age_handles_corrupt_started_and_completed — confirms the
safe-int treatment is consistent across all three timestamp fields.
- test_task_age_well_formed_task — regression that the safe path
doesn't change observable output for normal data.
- test_task_dict_survives_corrupt_created_at — defense in depth.
Writes a corrupt row directly via SQL, reads it back through the
ORM, and confirms task_age + the surrounding plugin_api guard
degrade gracefully instead of crashing.
Also adds the AUTHOR_MAP entry for the contributor's GitHub-noreply
email so release notes credit @baocin (the commit was authored locally
as `aoi <aoi@hino.local>` — re-attributed during salvage to the
github noreply form).
task_age() crashed with ValueError when created_at contained the
literal format string '%s' instead of a Unix timestamp, taking down
the entire GET /board endpoint with a 500.
- Add _safe_int() helper that returns None on non-numeric values
- Refactor task_age() to use _safe_int instead of bare int() casts
- Wrap task_age() call in _task_dict with try/except fallback so one
corrupt row never kills the whole board endpoint
* feat(i18n): localize /model command output
Reported by @tianma8888: when Chinese users run /model, the labels
("Provider:", "Context:", "_session only_", etc.) are still English.
This routes the static prose through the existing i18n catalog so it
follows display.language / HERMES_LANGUAGE.
Changes:
- locales/{en,zh,ja,de,es,fr,tr,uk}.yaml: add 17 keys under
gateway.model.* covering switched/provider/context/max_output/cost/
capabilities/prompt_caching/warning/saved_global/session_only_hint/
current_label/current_tag/more_models_suffix/usage_*.
- gateway/run.py _handle_model_command: replace hardcoded f-strings in
the picker callback, the text-list fallback, and the direct-switch
confirmation block with t("gateway.model.<key>", ...).
What stays English:
- model IDs, provider slugs, capability strings, cost figures, and the
"[Note: model was just switched...]" prepended to the model's next
prompt (LLM-facing, not user-facing).
- The two slightly-different session-only hints unify on a single key
with the em-dash phrasing.
Validation: tests/agent/test_i18n.py 27/27 passing (parity contract
holds), tests/gateway/ -k 'model or i18n' 74/74 passing.
* feat(i18n): localize all gateway slash command outputs
Expands the i18n catalog from 7 strings to 234 keys across 35 gateway
slash command handlers, so non-English users see localized output for
\`/profile\`, \`/status\`, \`/help\`, \`/personality\`, \`/voice\`, \`/reset\`,
\`/agents\`, \`/restart\`, \`/commands\`, \`/goal\`, \`/retry\`, \`/undo\`,
\`/sethome\`, \`/title\`, \`/yolo\`, \`/background\`, \`/approve\`, \`/deny\`,
\`/insights\`, \`/debug\`, \`/rollback\`, \`/reasoning\`, \`/fast\`,
\`/verbose\`, \`/footer\`, \`/compress\`, \`/topic\`, \`/kanban\`,
\`/resume\`, \`/branch\`, \`/usage\`, \`/reload-mcp\`, \`/reload-skills\`,
\`/update\`, \`/stop\` (plus the \`/model\` block already added in the
previous commit).
Reported by @tianma8888 — Chinese users want command output prose in
their language, not just the labels we already had.
Translations are hand-written for all 8 supported locales (en, zh, ja,
de, es, fr, tr, uk), matching each catalog's existing style: full-width
punctuation in zh, em-dashes in zh/ja/uk, French spaced colons,
German noun capitalization, etc.
What stays English (unchanged):
- Identifiers/values: model IDs, file paths, profile names, session IDs,
command flag names like --global, URLs, config keys.
- Backtick code spans: \`/foo\`, \`config.yaml\`.
- Log messages (logger.info/warning/error).
- LLM-facing system notes prepended to next prompt (e.g. [Note: model
was just switched...]).
- Strings produced by external modules (gateway_help_lines,
format_gateway, manual_compression_feedback) — those have their
own surfaces.
New shared keys for cross-handler boilerplate:
- gateway.shared.session_db_unavailable (5 call sites: branch, title,
resume, topic, _disable_telegram_topic_mode_for_chat)
- gateway.shared.session_not_found (1 site)
- gateway.shared.warn_passthrough (2 sites in /title's f"⚠️ {e}" pattern)
YAML gotcha fixed: \`yolo.on\` and \`yolo.off\` were originally written
unquoted, which YAML 1.1 parses as boolean True/False keys. Renamed to
\`yolo.enabled\` / \`yolo.disabled\` for both safety and clarity.
Test fix: tests/agent/test_i18n.py::test_t_missing_key_in_non_english_falls_back_to_english
now resets the catalog cache on teardown, so the fake "foo: English Foo"
locale doesn't poison the module-level cache for subsequent tests in
the same xdist worker. (Without this, every gateway slash command test
that shares a worker with the i18n suite would see the fake catalog.)
Validation:
- tests/agent/test_i18n.py: 27/27 (parity contract — every key in every
locale, matching placeholder tokens).
- tests/gateway/: 5077 passed, 0 failed (full gateway suite).
- 180 t() call sites added across 35 handlers; 1872 catalog entries
total (234 keys × 8 locales).
* feat(i18n): add 8 new locales — af, ko, it, ga, zh-hant, pt, ru, hu
Expands the static-message catalog from 8 → 16 languages, each with full
270-key parity against the English source-of-truth. Every locale now
covers the same surface PR #22914 added: approval prompts plus all 35
gateway slash command outputs.
New locales:
- af Afrikaans (community ask in #21961 by @GodsBoy; PRs #21962, #21970)
- ko Korean (PRs #20297 by @tmdgusya, #22285 by @project820)
- it Italian (PR #20371 by @leprincep35700)
- ga Irish/Gaeilge (PR #20962 by @ryanmcc09-dot)
- zh-hant Traditional Chinese (PRs #20523 by @jackey8616, #13140 by @anomixer)
- pt Portuguese (PRs #20443 by @pedroborges, #15737 by @carloshenriquecarniatto, #22063 by @Magaav)
- ru Russian (PR #22770 by @DrMaks22)
- hu Hungarian (PR #22336 by @lunasec007)
Each locale uses native-quality translations matching the existing tone
and conventions of the older 8 locales:
- zh-hant uses 繁體 characters with TW/HK technical vocabulary (軟體
not 软件, 連線 not 连接, 設定 not 设置, 訊息 not 消息, 工作階段 not 会话, 程式
not 程序, 預設 not 默认, 伺服器 not 服务器), full-width punctuation 「:()」.
- ko uses formal 합니다체 (습니다/합니다) register throughout.
- pt uses European Portuguese as baseline with neutral PT/BR vocabulary
where possible.
- ga uses standard An Caighdeán Oifigiúil; English loanwords retained
for tech terms without good Irish equivalents (gateway, API, JSON).
- All preserve {placeholder} tokens, backtick code spans, slash commands,
brand names (Hermes, MCP, TTS, YOLO, OpenAI, Telegram, etc.), and emoji.
Aliases added in agent/i18n.py:
- af-za, Afrikaans → af
- ko-kr, Korean, 한국어 → ko
- it-it, italiano → it
- ga-ie, Irish, Gaeilge → ga
- zh-tw, zh-hk, zh-mo, traditional-chinese → zh-hant (note: zh-tw used to
alias to zh; now aliases to its own zh-hant catalog)
- zh-cn, zh-hans, zh-sg → zh (unchanged from before)
- pt-pt, pt-br, brazilian, portuguese → pt
- ru-ru, Russian, русский → ru
- hu-hu, Magyar → hu
The zh-tw alias re-routing is intentional: previously typing 'zh-TW' got
the Simplified Chinese catalog (wrong vocabulary for Taiwan/HK users).
Now those users get the proper Traditional Chinese catalog.
Validation:
- tests/agent/test_i18n.py: 43/43 (parity contract holds for all 16
languages × 270 keys = 4320 catalog entries, with matching placeholder
tokens).
- E2E alias resolution verified for all 19 alias inputs (Afrikaans, ko-KR,
한국어, italiano, Gaeilge, zh-TW, zh-HK, traditional-chinese, pt-BR,
brazilian, Magyar, etc.).
- tests/gateway/: 5198 passed (3 pre-existing TTS routing failures
unrelated to i18n).
Credit to all contributors whose PRs surfaced these language requests.
Their original PRs may now be closed as superseded with credit.
* feat(dashboard-i18n): add 14 web dashboard locales matching the static catalog
Brings the React dashboard (web/src/) up to the same 16-language
coverage the static catalog already has after the previous commits in
this PR. The Translations interface is TypeScript-typed, so every new
locale must provide every key — tsc -b is the parity guard.
Languages added (each is a complete 429-line locale file):
- af Afrikaans
- ja Japanese (PR #22513 by @snuffxxx surfaced this)
- de German (PR #21749 by @mag1art)
- es Spanish (PR #21749)
- fr French (PRs #21749, #10310 by @foXaCe)
- tr Turkish
- uk Ukrainian
- ko Korean (PRs #21749, #18894 by @ovstng, #22285 by @project820)
- it Italian
- ga Irish (Gaeilge)
- zh-hant Traditional Chinese (PR #13140 by @anomixer)
- pt Portuguese (PRs #22063 by @Magaav, #22182 by @wesleysimplicio, #15737 by @carloshenriquecarniatto)
- ru Russian (PRs #21749, #22770 by @DrMaks22)
- hu Hungarian (PR #22336 by @lunasec007)
Each translation covers all 15 namespaces with full key parity vs en.ts,
preserves every {placeholder} token verbatim, keeps identifiers
untranslated (brand names, file paths, cron expressions, code spans),
translates the language.switchTo tooltip into the target language, and
matches existing tone conventions (zh-hant uses TW/HK vocab; ja uses
formal desu/masu; ko uses formal seumnida register; ga uses An
Caighdean Oifigiuil with English loanwords for tech vocab without good
Irish equivalents).
Plumbing:
- web/src/i18n/types.ts: Locale union expanded to all 16 codes.
- web/src/i18n/context.tsx: imports all 16 catalogs; exports
LOCALE_META (endonym + flag per locale); isLocale() type guard.
- web/src/i18n/index.ts: re-export LOCALE_META.
- web/src/components/LanguageSwitcher.tsx: replaced two-state EN-ZH
toggle with a click-to-open dropdown listing all 16 languages.
Note: zh-hant.ts exports zhHant (camelCase) since hyphen is invalid in
a JS identifier; the canonical 'zh-hant' string keys it in TRANSLATIONS.
Validation:
- npx tsc -b: 0 errors. Every locale satisfies Translations.
- npm run build (tsc + vite production): green, 2062 modules.
- Each locale file is exactly 429 lines.
Out of scope: plugin dashboards (kanban/achievements ship as prebuilt
bundles with no source in repo); Docusaurus docs (separate surface);
TUI (no i18n yet).
* feat(plugin-i18n): localize achievements + kanban plugin dashboards across all 16 locales
Brings the two shipped plugin dashboards (hermes-achievements, kanban)
under the same i18n umbrella as the core dashboard PR #22914 just
established. Both bundles now read user-facing strings from the host's
i18n catalog via SDK.useI18n() instead of hardcoded English.
## Approach
Plugin dashboards ship as prebuilt IIFE bundles in
plugins/<name>/dashboard/dist/index.js — no build step, no source in
repo (upstream-authored, vendored as compiled JS). Earlier contributor
PRs (#22594, #22595, #18747) tried direct edits but didn't actually
wire the bundles to read translations.
This change does the wiring properly:
1. Each bundle gets a useI18n shim at IIFE scope:
const useI18n = SDK.useI18n
|| function () { return { t: { kanban: null }, locale: "en" }; };
Older host SDKs without useI18n still load the bundle and render
English fallbacks.
2. A small tx(t, path, fallback, vars) helper resolves dotted keys
under the plugin's namespace (t.kanban.* or t.achievements.*) and
interpolates {placeholder} tokens.
3. Every React component starts with const { t } = useI18n() and
each user-visible string is wrapped in tx(t, "key", "English fallback").
Helpers called outside React components (window.prompt callers,
constants used during init) take t as a parameter.
4. Top-level constants that were English dictionaries (COLUMN_LABEL,
COLUMN_HELP, DESTRUCTIVE_TRANSITIONS, DIAGNOSTIC_EVENT_LABELS in
kanban) become getColumnLabel(t, status)-style functions backed by
FALLBACK_* dictionaries.
## Translations added
Two new top-level namespaces added to the dashboard's TypeScript-typed
Translations interface:
- achievements: ~70 keys covering the hero, scan banner, achievement
card, share dialog, stats, filters, and empty states.
- kanban: ~145 keys covering the board, columns (with nested
columnLabels and columnHelp sub-dicts), card detail panel,
bulk-actions toolbar, dependency editor, board switcher, and
diagnostic callouts.
Each key is provided across all 16 supported locales:
en, zh, zh-hant, ja, de, es, fr, tr, uk, af, ko, it, ga, pt, ru, hu.
Total new translation entries: ~3,440 (215 keys × 16 locales).
## What stays English (deliberate)
- API paths, CSS class names, data-* attributes, JSON keys, regex
strings, URLs, file paths (~/.hermes/kanban.db, boards/_archived/).
- State identifier strings used as lookup keys (triage / todo / ready /
running / blocked / done / archived) — labels translate, key strings
don't.
- The PNG share-card text rendered to canvas in the achievements
ShareDialog (HERMES AGENT watermark, UNLOCKED stamp, tier names) —
these become part of a globally-shared image and stay English.
- localStorage keys (hermes.kanban.selectedBoard).
- Brand names (Kanban, Hermes, WebSocket, Nous Research).
## Contributor credit
PR #22594 by @02356abc and PR #22595 by @02356abc supplied the
en + zh kanban namespace skeleton (145 keys); used as the en source-
of-truth in this commit and translated to the other 14 locales.
PR #18747 by @laolaoshiren first surfaced the achievements
localization request.
## Validation
- npx tsc -b: 0 errors. All 16 locale .ts files satisfy the
Translations type with full key parity.
- npm run build (tsc + vite production build): green, 2062 modules,
1.56MB JS / 95KB CSS, ~2.5s build.
- node --check on both plugin bundles: parse cleanly.
- 126 tx() call sites in kanban, 46 in achievements.
## Out of scope
- TUI (ui-tui/) has no i18n infrastructure yet.
- Docusaurus docs (website/i18n/) — already had zh-Hans; expanding
is a separate translation workstream (Thai / Korean / Hindi PRs).
Follow-up to the previous commit's contributor cherry-pick.
The cherry-picked change replaced the bare ``["hermes", ...]`` spawn with
``[sys.executable, "-m", "hermes", ...]``. The intent was right (avoid
PATH dependence — cron, systemd User= services, launchd jobs, and other
detached dispatcher invocations routinely run with a stripped $PATH that
doesn't include the venv's bin/, breaking the bare-shim spawn) but the
module name is wrong: there is no top-level ``hermes`` package. The
console-script entry point in pyproject.toml is
``hermes = "hermes_cli.main:main"``, and ``python -m hermes`` fails with
``No module named hermes``. The cherry-picked form would have replaced a
sometimes-broken spawn with an always-broken one.
This commit:
- Adds ``_resolve_hermes_argv()``, mirroring ``gateway.run._resolve_hermes_bin``.
Tries ``shutil.which("hermes")`` first (preferred — keeps existing ``ps``
output and log lines familiar in the common case) and falls back to
``[sys.executable, "-m", "hermes_cli.main"]`` when the shim is not on
PATH. The fallback goes through the running interpreter so it's
PATH-independent. Kept as a local helper rather than imported from
gateway because ``hermes_cli`` sits below ``gateway`` in the dependency
order.
- Switches the dispatcher's ``cmd`` list to use ``*_resolve_hermes_argv()``.
- Adds three regression tests:
* ``test_resolve_hermes_argv_prefers_path_shim`` — pins the PATH-first
branch so a future refactor doesn't silently flip the order.
* ``test_resolve_hermes_argv_falls_back_to_module_form_when_no_path_shim`` —
pins the correct module name (``hermes_cli.main``, NOT ``hermes``).
Direct regression guard for the form that shipped in the original PR.
* ``test_resolve_hermes_argv_module_actually_runs`` — runs the fallback
invocation as a real subprocess and asserts ``--version`` works, so
losing ``hermes_cli.main``'s ``__main__`` handling can't slip past the
string-match test.
Verified end-to-end: with the shim on PATH the resolver returns
``[/.../hermes]`` and ``--version`` works; with the shim removed the
resolver returns ``[python, -m, hermes_cli.main]`` and ``--version``
still works; the original PR's ``python -m hermes`` invocation fails as
expected (``No module named hermes``).
In NixOS container mode, hermes is installed at a store path with no
symlink on PATH (e.g. /data/current-package/bin/hermes). The kanban
dispatcher spawns workers via _default_spawn() using a bare 'hermes'
subprocess call, which fails with 'hermes executable not found on PATH'
in container mode.
Fix by calling sys.executable -m hermes instead, which is guaranteed
to resolve to the same Python interpreter running the dispatcher.
* feat(plugins): host-owned LLM access via ctx.llm
Plugins can now ask the host to run a one-shot chat or structured
completion against the user's active model and auth, without ever
seeing an OAuth token or API key. Closes the gap where plugins that
needed bounded structured inference (receipts, CRM extraction,
support classification) had to either bring their own provider keys
or register a tool the agent had to call.
New surface on PluginContext:
- ctx.llm.complete(messages, ...)
- ctx.llm.complete_structured(instructions, input, json_schema, ...)
- async siblings ctx.llm.acomplete / acomplete_structured
Backed by the existing auxiliary_client.call_llm pipeline — every
provider, fallback chain, vision routing, and timeout policy Hermes
already supports applies automatically.
Trust gate (fail-closed by default):
- plugins.entries.<id>.llm.allow_model_override
- plugins.entries.<id>.llm.allowed_models (allowlist; '*' = any)
- plugins.entries.<id>.llm.allow_agent_id_override
- plugins.entries.<id>.llm.allow_profile_override
Embedded model@profile shorthand goes through the same gate as
explicit profile=, so it can't bypass the auth-profile policy.
Conflicting explicit and embedded profiles fail closed.
Also lands:
- plugins/plugin-llm-example/ — reference plugin that registers
/receipt-extract, demonstrating image+text structured input,
jsonschema validation, and the trust-gate config.
- website/docs/developer-guide/plugin-llm-access.md — full API docs.
- 45 unit tests covering trust gates, JSON parsing, schema
validation, image encoding, async surface, and config loading.
Validation:
- 2628 tests pass in tests/agent/
- E2E: bundled plugin loaded with isolated HERMES_HOME, slash
command produced parsed JSON via stubbed call_llm
- response_format extra_body wired correctly for both json_object
and json_schema modes
* docs(plugin-llm): rewrite quickstart and framing
The quickstart now uses a meeting-notes-to-tasks example instead of
a receipt extractor, and the page leads with hook-time / gateway
pre-filter / scheduled-job framing rather than the OpenClaw
KB/support/CRM/finance/migration enumeration that the original
upstream PR used. Receipt example moved to a separate worked
example link so the docs page itself doesn't echo any of the
upstream framing.
Also clarifies where ctx.llm fits in the broader plugin surface
(table comparing register_tool / register_platform / register_hook
/ etc.) and what makes this lane different from auxiliary_client
internals.
No code change.
* docs(plugin-llm): reframe as any LLM call, not just structured output
The original draft leaned heavily on complete_structured() and made
the chat lane (complete() / acomplete()) feel like a footnote.
Restructure so:
- The page title and description say 'any LLM call.'
- The lead shows BOTH a plain chat call (error rewriter) AND a
structured call (triage scorer) up top.
- Quick start has two complete plugin examples — /tldr (chat) and
/paste-to-tasks (structured).
- New 'When to use which' table for choosing complete() vs
complete_structured() vs the async siblings.
- Trust-gate sections explicitly note 'all four methods,' and the
request-shaping list calls out chat-only fields (messages) and
structured-only fields (instructions, input, json_schema)
alongside each other.
- The 'Where this fits' section now says 'for any reason,
structured or not.'
The receipt-extractor reference plugin still exists under
plugins/plugin-llm-example/ — but the docs page no longer treats
it as the canonical surface example. It's now described as 'a third
worked example, this time with image input.'
No code change.
* feat(plugin-llm): split provider/model into independent explicit kwargs
The first cut accepted a single 'provider/model' slug on every method
and split it internally. That looked clean but broke under live test:
the model-override path tried to use the slug's vendor prefix as a
literal Hermes provider id, which silently switched the user off
their aggregator (e.g. plugin asks for 'openai/gpt-4o-mini' on a user
who routes through OpenRouter — host attempted to call the 'openai'
provider directly, failed because OPENAI_API_KEY wasn't set).
New shape mirrors the host's main config:
ctx.llm.complete(
messages=[...],
provider='openrouter', # gated, optional
model='openai/gpt-4o-mini', # gated, optional
profile='work', # gated, optional
...
)
Each is independently gated by its own allow_*_override flag.
Granting model-override does NOT auto-grant provider-override.
Allowlists are now per-axis (allowed_providers, allowed_models)
matched literally against whatever string the plugin sends.
Dropped 'model@profile' embedded-suffix shorthand entirely. Hermes
doesn't use that pattern anywhere else; profile= is its own kwarg.
Live E2E (against real OpenRouter via Teknium's config) confirms:
- zero-config call works
- default-deny blocks each override with a helpful error
- model-only override stays on user's active provider (the bug)
- provider+model override switches cleanly
- allowlist refuses non-listed entries
- structured output round-trip parses + schema-validates
Tests: 49 cases (up from 45); all green. Docs updated to match the
new shape, including a 'most plugins never need this section' callout
on the trust-gate config block.
* fix+cleanup(plugin-llm): real attribution, hook-mode coverage, move example out of core
Three integration fixes for the ctx.llm surface:
1. Attribution bug — result.provider and result.model now reflect
what call_llm actually used, not placeholder fallbacks ('auto',
'default'). New _resolve_attribution() helper:
- explicit overrides win (what the call targeted)
- response.model wins for the recorded model (provider
canonicalisation: 'gpt-4o' → 'gpt-4o-2024-08-06' etc.)
- falls back to _read_main_provider() / _read_main_model()
when no override is set, so audit logs reflect the user's
active main provider/model
- 'auto' / 'default' only when EVERYTHING is empty
Live verified: zero-config call now records
provider='openrouter', model='anthropic/claude-4.7-opus-20260416'
instead of provider='auto', model='default'.
2. Hook-mode coverage — TestHookMode confirms ctx.llm.complete
works from inside a registered post_tool_call callback. The
docs page promised hook integration; now there's a test that
exercises the lazy-import path through the real invoke_hook
machinery. Two cases: traceback-rewrite hook with conditional
ctx.llm.complete, and minimal hook regression for the
sync-hook + sync-llm path.
3. Reference plugin moved out of core. plugins/plugin-llm-example/
is gone from hermes-agent — it now lives in the new
NousResearch/hermes-example-plugins companion repo. The docs
page links there. Hermes' bundled plugins should be plugins
users actually run; reference / docs-companion plugins live
externally.
Test count: 56 (up from 49). Wider sweep on tests/hermes_cli/
+ tests/gateway/ + tests/tools/ + tests/agent/ shows 16770
passing; the 12 failures are all pre-existing on origin/main
(verified by stashing this branch's changes and re-running) —
kanban-boards, delegate-task, gateway-restart, tts-routing —
none touch the plugin_llm surface.
* chore(plugins): move all example plugins to companion repo
Reference / docs-companion plugins now live exclusively in
NousResearch/hermes-example-plugins, not bundled with the core repo:
- example-dashboard
- strike-freedom-cockpit
A new fourth example, plugin-llm-async-example, was added to that
repo demonstrating ctx.llm's async surface (acomplete()) with
asyncio.gather() — registers /translate <lang>: <text> which fires
forward translation + sentiment classifier in parallel, then a
back-translation for QA. Live-tested at 2.5s for three real
provider round-trips (would be ~5-6s sequential).
Docs updated:
- developer-guide/plugin-llm-access.md links both sync and async
examples in the Reference section
- user-guide/features/extending-the-dashboard.md repoints both demo
sections to the companion repo with corrected install paths
- user-guide/features/built-in-plugins.md drops the two demo rows
- AGENTS.md notes that example plugins live in the companion repo
Net: hermes-agent's plugins/ directory now contains only plugins
users actually run (memory providers, dashboard tabs that ship real
features, the disk-cleanup hook, platform adapters). All four
demo / reference plugins live externally where they can be cloned
on demand instead of inflating the core install.
Follow-up to the previous commit's middleware fix.
- plugins/kanban/dashboard/plugin_api.py: rewrite the "Security note"
docstring. The previous text said "/api/plugins/ is unauthenticated by
design" — that's now actively wrong and dangerously misleading. New
text explains that plugin routes flow through the same session-token
middleware as core API routes and that --host 0.0.0.0 is safe to use
on a LAN as a result.
- tests/hermes_cli/test_web_server.py: extend TestPluginAPIAuth to cover
the surfaces the original PR didn't pin:
* test_plugin_route_allows_auth now exercises a real plugin path
(/api/plugins/example/hello) instead of accepting 200 OR 404 from
a maybe-loaded kanban plugin — the assertion was effectively vacuous.
* test_plugin_patch_requires_auth + test_plugin_delete_requires_auth
cover non-GET mutation methods in case a future regression
whitelists them by accident.
* test_non_kanban_plugin_route_requires_auth proves the fix is
plugin-agnostic, not kanban-specific (hits hermes-achievements +
a non-existent plugin namespace; both 401 before route resolution).
* test_plugin_websocket_unaffected_by_http_middleware locks in that
the HTTP middleware change didn't accidentally start gating WS
upgrades — kanban /events still uses its own ?token= check.
Plus a cosmetic blank-line cleanup.
Remove the blanket /api/plugins/* exemption from auth_middleware so
plugin API routes (e.g. Kanban dashboard) require the same session
token as all other /api/ endpoints.
Fixes#19533
Surfaces the pin command at the moment users care about it: when a
consolidation just landed against their skill library and they're
looking at the umbrella name in the curator output. Previously `hermes
curator pin` existed but had no discovery surface — users only learned
it existed by reading docs or stumbling onto `hermes curator --help`.
The hint:
archived 3 skill(s):
• docx-extraction → document-tools
• pdf-extraction → document-tools
• old-stale — pruned (stale)
full report: hermes curator status
keep an umbrella stable: hermes curator pin document-tools
Gated on having at least one consolidation that produced an umbrella.
Pruned-only runs (nothing surviving to pin) skip the hint. When
multiple umbrellas were produced, picks alphabetically first as a
concrete example rather than listing them all.
3 new tests in tests/agent/test_curator_classification.py covering:
consolidation produces hint with real umbrella name, pruned-only run
omits it, multi-umbrella picks one example.
* feat(gateway): add LINE Messaging API platform plugin
Adds LINE as a bundled platform plugin under `plugins/platforms/line/`,
synthesized from the strongest pieces of seven open community PRs. The
adapter requires zero core edits — `Platform("line")` is auto-discovered
via the bundled-plugin scan in `gateway/config.py`, and all hooks
(setup, env-enablement, cron delivery, standalone send) are wired
through `register_platform()` kwargs the way IRC and Teams do it.
Highlights merged into one plugin:
- **Reply token preferred, Push fallback.** Try the free reply token
first (single-use, ~60s TTL); fall back to metered Push when the
token is absent, expired, or rejected. (PR #21023)
- **Slow-LLM Template Buttons postback.** When the LLM is still running
past `LINE_SLOW_RESPONSE_THRESHOLD` (default 45s), the adapter burns
the original reply token to send a "Get answer" button bubble. The
user taps it to fetch the cached answer via a fresh reply token —
also free. State machine: PENDING → READY → DELIVERED, ERROR for
cancelled runs (orphan resolves to `LINE_INTERRUPTED_TEXT` after
/stop). Set threshold to 0 to disable. (PR #18153)
- **Three-allowlist gating** — separate user / group / room allowlists
with `LINE_ALLOW_ALL_USERS=true` dev-only escape hatch. (PR #18153)
- **Markdown URL preservation.** Strip bold/italic/code-fence/heading
markers (LINE renders them literally) but keep `[label](url)` →
`label (url)` so URLs stay tappable. (PR #18153)
- **System-message bypass** for `⚡ Interrupting`, `⏳ Queued`, etc. —
busy-acks reach the user as visible bubbles instead of being
swallowed into the postback cache. (PR #18153)
- **Media via public HTTPS URLs.** LINE doesn't accept binary uploads;
images/audio/video must be HTTPS-reachable. The adapter serves
registered tempfiles under `/line/media/<token>/<filename>` from the
same aiohttp app. Allowed-roots traversal guard covers
`tempfile.gettempdir()`, `/tmp` (→ `/private/tmp` on macOS), and
`HERMES_HOME`. `LINE_PUBLIC_URL` overrides URL construction for
setups behind tunnels/proxies. (PR #8398)
- **5-message-per-call batching.** LINE rejects >5 messages per
Reply/Push; smart-chunker caps text at 4500 chars per bubble.
- **Inbound dedup** via `webhookEventId` LRU. (PR #21023)
- **Self-message filter** via `/v2/bot/info` userId lookup. (PR #21023)
- **Loading-animation indicator** wired to LINE's `chat/loading/start`
endpoint, DM-only (LINE rejects it for groups/rooms). (PR #21023)
- **Out-of-process cron delivery** via `_standalone_send`, so
`deliver: line` cron jobs work even when cron runs detached from
the gateway.
- **Webhook hardening** — 1 MiB body cap, constant-time HMAC-SHA256
signature verification, dedup, scoped lock so two profiles can't
bind the same channel.
Validation
----------
- `scripts/run_tests.sh tests/gateway/test_line_plugin.py` →
73 passed in 1.05s
- `scripts/run_tests.sh tests/gateway/test_line_plugin.py
tests/gateway/test_irc_adapter.py
tests/gateway/test_plugin_platform_interface.py
tests/gateway/test_platform_registry.py
tests/gateway/test_config.py` → 193 passed, 7 skipped
- E2E import + register + signature roundtrip + `Platform("line")`
bundled-plugin discovery verified against current `origin/main`.
Closes the seven open LINE PRs (#18153, #16832, #6676, #21023, #14942,
#14988, #8398) by superseding them with a single plugin-form
implementation that takes the best idea from each.
Co-authored-by: pwlee <32443648+leepoweii@users.noreply.github.com>
Co-authored-by: Jetha Chan <jetha@google.com>
Co-authored-by: Cattia <openclaw@liyangchen.me>
Co-authored-by: perng <charles@perng.com>
Co-authored-by: Soichiro Yoshimura <soichiro0111.dev@gmail.com>
Co-authored-by: David Zhou <77736378+David-0x221Eight@users.noreply.github.com>
Co-authored-by: Yu-ga <74749461+yuga-hashimoto@users.noreply.github.com>
* docs(platforms): document platform-specific slow-LLM UX pattern
Add a 'Platform-Specific Slow-LLM UX' section to the platform-adapter
developer guide covering the _keep_typing override pattern that LINE
uses for its Template Buttons postback flow.
Three subsections:
- Pattern: subclass _keep_typing to layer mid-flight UX (with code)
- Pattern: subclass send to route through a cache instead of sending
- When this pattern is appropriate (vs. always-Push fallback)
Plus a short pointer in gateway/platforms/ADDING_A_PLATFORM.md so
tree-readers find the prose walkthrough on the docsite.
Filed because the LINE plugin (PR #23197) was the first bundled
adapter to need this pattern — every prior plugin (irc, teams,
google_chat) handles slow responses with the default typing-loop and
a regular send_text. Documenting now while the rationale is fresh.
---------
Co-authored-by: pwlee <32443648+leepoweii@users.noreply.github.com>
Co-authored-by: Jetha Chan <jetha@google.com>
Co-authored-by: Cattia <openclaw@liyangchen.me>
Co-authored-by: perng <charles@perng.com>
Co-authored-by: Soichiro Yoshimura <soichiro0111.dev@gmail.com>
Co-authored-by: David Zhou <77736378+David-0x221Eight@users.noreply.github.com>
Co-authored-by: Yu-ga <74749461+yuga-hashimoto@users.noreply.github.com>
web_extract runs returned page content through the web_extract auxiliary
model when pages exceed 5 000 chars (single-pass up to 500k, chunked up
to 2M, refused above that). The user-guide page didn't mention this —
users were surprised that long-page extracts produced summaries instead
of raw markdown, and that those summaries cost main-model tokens by
default.
Adds:
- size-driven behavior table (under 5k / 5k–500k / 500k–2M / over 2M)
- which auxiliary task does the work (auxiliary.web_extract)
- how to route summaries to a cheap model regardless of main
- escape hatch: browser_navigate when you need raw content
- troubleshooting entry for summarization timeouts
Adapted from PR #20568 commit ce3518578 (Eric Litovsky / @kallidean).
Adds two-tier gating for the kanban tool surface so dispatcher-spawned
workers see only task-lifecycle tools (show/complete/block/heartbeat/
comment/create/link) while orchestrator profiles with `toolsets: [kanban]`
also see board-routing tools (kanban_list, kanban_unblock).
Workers shouldn't be enumerating or unblocking the board — they should
close their own task via the lifecycle tools. Hiding board-routing tools
from worker schemas keeps the worker focused and the toolset-isolation
contract honest.
Plus inherited from the same upstream commit:
- 50/200 row bound on kanban_list with `truncated` + `next_limit` metadata.
- Belt-and-suspenders runtime guard `_require_orchestrator_tool()` inside
the orchestrator handlers in case a stale registration ever routes a
worker to one of them.
- Tests for the new gate, the stricter bound, and the fact that even a
worker with `toolsets: [kanban]` in config still doesn't see board
routing.
Co-authored-by: Eric Litovsky <elitovsky@zenproject.net>
Two follow-ups from self-review:
1. Add gpt-5.3-codex-spark to DEFAULT_CONTEXT_LENGTHS at 128k. The
primary resolution path for Spark goes through provider='openai-codex'
→ _CODEX_OAUTH_CONTEXT_FALLBACK (already correct). But if any future
code path resolves Spark's context with a different provider (custom
proxy, generic fallthrough), the longest-substring-first lookup in
step 8 would match 'gpt-5' and report 400k, which is wrong by ~3x.
Adding the explicit override is a cheap defensive correctness fix
matching how gpt-5.4-mini and gpt-5.4-nano already shadow the generic
gpt-5 entry.
2. Update test_openai_codex_model_validation_fallback.py docstring. The
bug it was originally written for (gpt-5.3-codex-spark missing from
listing) is now resolved by this PR's catalog restoration. The test
still validly exercises the soft-accept code path for any future
entitlement-gated Codex slug that ships before Hermes catalogs it,
but the framing was stale — clarified.
Two follow-ups from self-review:
1. Add unit test for _fetch_models_from_api covering the live HTTP path.
The salvaged PR #19530 dropped the supported_in_api:false filter in
both _fetch_models_from_api and _read_cache_models, but only the
cache path had a regression test. This adds the symmetric live-fetch
test (mocked httpx) so a future drive-by change to the HTTP path
can't silently re-introduce the filter.
2. Pin test_codex_picker_uses_live_codex_catalog to the cache fallback.
The test wrote a fake JWT and a CODEX_HOME cache, but provider_model_ids
('openai-codex') still issued a real 10s HTTP probe to
chatgpt.com/backend-api/codex/models before falling back to the cache.
That made the test slow and non-deterministic in restricted/CI
networks. Patch _fetch_models_from_api to return [] so we go straight
to the cache path the test actually means to exercise.
PR #12994 stripped gpt-5.3-codex-spark on the assumption that it was
unsupported. It's actually research-preview, ChatGPT-Pro-only, exposed
via the Codex OAuth backend at chatgpt.com/backend-api/codex/models —
not via the public OpenAI API.
Add explanatory comments in:
- DEFAULT_CODEX_MODELS / _FORWARD_COMPAT_TEMPLATE_MODELS (codex_models.py)
- _CODEX_OAUTH_CONTEXT_FALLBACK (model_metadata.py)
- list_authenticated_providers' live-discovery branch (model_switch.py)
so future maintainers don't strip the entry again. Also documents the
intentional asymmetry that Spark stays out of the "openai" provider
catalog (it isn't on the public API) and why the supported_in_api
filter is *not* applied for the openai-codex route.
Closes#6051.
Reported failure mode: agent migrated to WSL2, browser launch failed
because Playwright wasn't installed yet. Background reviewer captured
the failure as a durable skill (`browser-tool-launch-issue`) and the
agent kept refusing the browser tool for weeks after Playwright was
installed and verified working. Negative claims also propagated into
unrelated skills ("browser tools do not work", "cannot use Y from
execute_code").
Root cause: `_SKILL_REVIEW_PROMPT` and `_COMBINED_REVIEW_PROMPT` both
lean hard on "be active, save things, a pass that does nothing is a
missed learning opportunity." Neither distinguished durable knowledge
from transient environment state. The reviewer was doing what it was
told.
Fix at the write site — both prompts now carry a "Do NOT capture"
section calling out:
• Environment-dependent failures (missing binaries, fresh-install
errors, post-migration path mismatches, 'command not found',
unconfigured credentials, uninstalled packages)
• Negative claims about tools or features ("X does not work")
that harden into self-cited refusals
• Session-specific transient errors that resolved before the
conversation ended
• One-off task narratives ("summarize today's market", "analyze
this PR") — also addresses the #12812 / #4538 family
Plus a positive-reframing line: when a tool fails because of setup
state, capture the FIX (install command, config step, env var)
under an existing setup/troubleshooting skill — never "this tool
doesn't work" as a standalone constraint.
Targeted tests: 24/24 passing in tests/run_agent/test_review_prompt_class_first.py
(2 new + all existing review-prompt assertions). Substring-based
checks so future prompt edits don't false-fail.
The previous PR (#22993) gave us a structured WARNING per stream drop
but the only diagnostic was 'error_type=APIError error=Network
connection lost.' — same nothing the user started with. To actually
diagnose why subagents drop streams disproportionately we need to know
WHERE the drop happened.
Adds three breadcrumbs to the agent.log WARNING:
1. Inner exception chain. openai SDK wraps httpx errors as
APIConnectionError / APIError so the catch site only sees the
wrapper. _flatten_exception_chain walks __cause__/__context__ up to
4 levels deep and renders 'Outer(msg) <- Inner(msg)' so we can
tell ConnectError vs RemoteProtocolError vs ReadError vs
ProxyError without enabling verbose mode.
2. Upstream HTTP headers. Snapshots cf-ray, x-openrouter-provider,
x-openrouter-model, x-openrouter-id, x-request-id, server, via,
etc. from stream.response immediately after open (so they survive
even when the stream dies before the first chunk). These answer
'is one CF edge / one downstream provider responsible, or random?'
3. Per-attempt counters. bytes streamed, chunk count, elapsed time on
the dying attempt, and time-to-first-byte. Distinguishes 'couldn't
connect at all' (0s, 0 bytes) from 'died after 30s mid-stream'
(very different root causes — first is auth/routing, second is
upstream idle-kill or proxy timeout).
Plumbing:
- _stream_diag_init / _stream_diag_capture_response live on AIAgent
and produce a per-attempt dict held on request_client_holder['diag']
for closure access from the retry block.
- _call_chat_completions and _call_anthropic both initialize the diag
and increment counters per chunk/event (best-effort, never raises in
the streaming hot path).
- _log_stream_retry / _emit_stream_drop accept an optional diag and
render the new fields. Final-exhaustion log goes through the same
helper so it gets the same diagnostic dump.
- UI status line gains a brief 'after Xs' suffix when timing is
available — distinguishes 'connect failed' from 'died mid-stream'
at a glance without grepping logs.
Sample WARNING after this change:
Stream drop mid tool-call on attempt 2/3 — retrying.
subagent_id=sa-2-cafef00d depth=1 provider=openrouter
base_url=https://openrouter.ai/api/v1
error_type=APIError error=Connection error.
chain=APIError(Connection error.) <- RemoteProtocolError(peer
closed connection without sending complete message body)
http_status=200 bytes=12400 chunks=47 elapsed=12.00s ttfb=0.83s
upstream=[cf-ray=8f1a2b3c4d5e6f7g-LAX
x-openrouter-provider=Anthropic
x-openrouter-id=gen-abc123 server=cloudflare]
Tests: 10 covering diag init, header capture (whitelist enforced for
PII), exception-chain walking + depth cap, log content with full diag,
log content without diag (placeholders), UI elapsed-suffix on/off.
Closes#21794.
`/kanban`, `/kanban help`, `/kanban --help`, and `/kanban <sub> -h`
all returned broken output to the gateway and interactive CLI. Three
underlying bugs in `hermes_cli.kanban.run_slash`:
1. argparse writes help to **stdout** but `run_slash` only captured
stderr at parse time, so `-h` text was silently swallowed and
replaced with the `(usage error: 0)` sentinel.
2. The wrapping parser used `prog="/"` and routed via a synthetic
"_top → kanban" subparser, producing `usage: / kanban …` (stray
space) and `usage: /kanban kanban …` (doubled token) in error text.
3. Bare `/kanban` and `/kanban help` dumped argparse's full ~3KB
usage tree, which reads as visual garbage in a chat bubble.
Fix: drive the kanban_parser directly (no double-wrap), rewrite prog
strings on every leaf subparser, capture stdout AND stderr around
parse_args, distinguish SystemExit(0) (help — return captured stdout)
from SystemExit(2) (error — return single-line ⚠-prefixed message),
and add an explicit chat-friendly short-help block returned for bare
invocation and the help aliases (`help`, `--help`, `-h`, `?`).
Added 5 regression tests covering bare invocation, every help alias,
subcommand help, unknown action, and missing required arg.
Affects every chat platform via gateway/run.py::_handle_kanban_command
and the interactive CLI via cli.py::_handle_kanban_command.
Co-Authored-By: Nagatha (Claude Opus 4.7) <noreply@anthropic.com>
Reorder Anthropic Opus 4.7/4.6 + Sonnet 4.6 to the top, cluster free
models at the bottom of the OpenRouter list, and mirror the same
ordering into the Nous portal list (paid models only).
- Add inclusionai/ring-2.6-1t:free
- Drop minimax-m2.5, minimax-m2.5:free, sonnet-4.5, mimo-v2.5,
glm-5v-turbo, glm-5-turbo, trinity-large-preview:free,
trinity-large-thinking, qwen3.5-plus-02-15
- Replace qwen3.5-35b-a3b with qwen3.6-35b-a3b
- Drop x-ai/grok-4.20-beta from the Nous list
Both `_kanban_notifier_watcher` and `_kanban_dispatcher_watcher`'s
`_tick_once_for_board` called `_kb.connect(board=slug)` immediately
followed by `_kb.init_db(board=slug)`. Since `connect()` already runs
the schema + idempotent migration on first open per process, the
explicit `init_db()` was redundant — and worse, `init_db()` deliberately
busts the per-process `_INITIALIZED_PATHS` cache and re-runs the migration
on a *second* connection that races the first.
On every cold gateway start against a legacy DB this surfaced as either
`sqlite3.OperationalError: duplicate column name: <col>` or intermittent
`database is locked` errors logged at the first tick. The duplicate-column
case is now tolerated by `_add_column_if_missing` (commit 78698381a), but
the wasted second migration plus the database-is-locked race remain
fixable by skipping the redundant call entirely.
Drops `_kb.init_db(board=slug)` at both call sites and adds a regression
test in `tests/hermes_cli/test_kanban_notify.py` that pins the absence
via source inspection plus a runtime spy.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Subagent stream drops were spamming the parent terminal with two lines
per blip ('Connection dropped...' + 'Reconnected...') while leaving zero
breadcrumb in agent.log to debug them.
Two underlying bugs, fixed together:
1. quiet_mode raised the run_agent/tools/etc. loggers to ERROR, which
filters records before root-logger file handlers see them. The comment
claimed 'File handlers still capture everything' — that was wrong.
Removed in both run_agent.py and cli.py; console quietness already
comes from hermes_logging not installing a console StreamHandler in
non-verbose mode.
2. The stream-retry blocks emitted two _emit_status calls per drop
('⚠️ Connection dropped... Reconnecting...' + '🔄 Reconnected —
resuming…') with no provider name, so multi-provider sessions had to
dig through agent.log to attribute a drop. Replaced both call sites
with a single _emit_stream_drop helper that emits ONE line naming the
provider and error class, and always writes a structured WARNING to
agent.log with subagent_id, depth, provider, base_url, error_type.
Net UX change: 6 lines per triple-subagent drop → 3 lines, each
naming the provider. agent.log now has a structured breadcrumb per
retry that didn't exist before.
Tests: 6 new tests in tests/run_agent/test_stream_drop_logging.py
covering the logger-level guard, structured WARNING content, single
status line per drop (no Reconnected follow-up), and provider naming.
When the active main model has native vision and the provider supports
multimodal tool results (Anthropic, OpenAI Chat, Codex Responses, Gemini
3, OpenRouter, Nous), vision_analyze loads the image bytes and returns
them to the model as a multimodal tool-result envelope. The model then
sees the pixels directly on its next turn instead of receiving a lossy
text description from an auxiliary LLM.
Falls back to the legacy aux-LLM text path for non-vision models and
unverified providers.
Mirrors the architecture used in OpenCode, Claude Code, Codex CLI, and
Cline. All four converge on the same pattern: tool results carry image
content blocks for vision-capable provider/model combinations.
Changes
- tools/vision_tools.py: _vision_analyze_native fast path + provider
capability table (_supports_media_in_tool_results). Schema description
updated to reflect new behaviour.
- agent/codex_responses_adapter.py: function_call_output.output now
accepts the array form for multimodal tool results (was string-only).
Preflight validates input_text/input_image parts.
- agent/auxiliary_client.py: _RUNTIME_MAIN_PROVIDER/_MODEL globals so
tools see the live CLI/gateway override, not the stale config.yaml
default. set_runtime_main()/clear_runtime_main() helpers.
- run_agent.py: AIAgent.run_conversation calls set_runtime_main at turn
start so vision_analyze's fast-path check sees the actual runtime.
- tests/conftest.py: clear runtime-main override between tests.
Tests
- tests/tools/test_vision_native_fast_path.py: provider capability
table, envelope shape, fast-path gating (vision-capable model uses
fast path; non-vision model falls through to aux).
- tests/run_agent/test_codex_multimodal_tool_result.py: list tool
content becomes function_call_output.output array; preflight
preserves arrays and drops unknown part types.
Live verified
- Opus 4.6 + Sonnet 4.6 on OpenRouter: model calls vision_analyze on a
typed filepath, gets pixels back, reads exact text from images that
no aux description could capture (font color irony, multi-line
fruit-count list, etc.).
PR replaces the closed prior efforts (#16506 shipped the inbound user-
attached path; this PR closes the gap for tool-discovered images).
Found 18 real Hermes-Agent stories from HN, X, and Reddit not yet
captured on the page. All URLs HTTP-verified to return 200 with
matching titles.
Reddit (15): r/hermesagent (Obsidian-as-memory writeup at 794 upvotes,
LLM cheatsheet at 635 upvotes, Kanban game-changer post, OpenRouter #1
ranking, AMA from the Nous team, etc.); r/LocalLLaMA, r/Rag,
r/openclaw, r/SideProject, r/LocalLLM threads where users describe
their actual setups (Qwen3.5-9b on 16gb VRAM, 5060Ti + Telegram, smart
routing tiers).
X (3): @vmiss33's 'what I use Hermes for' guide, @HeyYanvi's
X-to-NotebookLM podcast workflow, @ExileAI_0's spare-laptop Iris
running RenPy + ComfyUI, @brucexu_eth's Hermes Inc. Telegram startup
sim from the hackathon, Hype's deep-dive blog.
HN (1): 'I'm using Hermes — sandbox it like any agent.'
No component changes — all new entries fit the existing schema
(real URL, real author, real date).
Adds test_notifier_second_blocked_delivers to cover the case where a
task is blocked, unblocked, then blocked again — the second blocked
event must still deliver a gateway notification.
Currently fails because blocked is treated as a terminal event kind,
causing the subscription to be dropped after the first block.
Linux's MAX_ARG_STRLEN caps any single argv element at 128 KB
(32 * PAGE_SIZE). The previous heredoc-in-the-command-string approach
in _write_to_sandbox put the entire tool result inside the 'bash -c'
arg, so any result over ~128 KB raised OSError [Errno 7] 'Argument
list too long' before the heredoc ever ran. The caller logged a
warning, but quiet_mode (CLI default) sets tools.* to ERROR — so the
warning never reached agent.log either, and the agent saw a 1.5 KB
preview tagged 'Full output could not be saved to sandbox'. Hits
delegate_task with 3+ subagent outputs routinely now.
Switch to passing content via env.execute(stdin_data=...). cmd is
now just 'mkdir -p X && cat > Y' (under 1 KB), and the heavyweight
payload travels through stdin where there is no argv-element limit.
E2E reproduced the user's exact 144,778-char delegate_task envelope:
old code OSError'd, new code round-trips cleanly to disk with all
three task summaries intact.
These skills require heavy GPU/CUDA stacks or are niche enough that they shouldn't
be active by default. Moved to optional-skills/ where users opt-in via
`hermes skills install official/...`.
Moved:
- mlops/training/axolotl
- mlops/training/trl-fine-tuning
- mlops/training/unsloth
- mlops/inference/outlines
Counts: 91 -> 87 built-in, 72 -> 76 optional.
Auto-regenerated docs (per-skill pages + catalogs) reflect the move.
* feat(curator): show rename map (where skills went) in user-visible summary
The full data has always been on disk in REPORT.md, but the user-visible
curator summary (gateway 💾 line, CLI session-start panel,
`hermes curator status`) was counts-only — "consolidated 4 into 2
umbrellas" with no names. Users only discovered renames when something
they expected was gone.
New `_build_rename_summary()` formats the rename map and appends it to
`final_summary`:
auto: 1 marked stale; llm: consolidated 2 into 1, pruned 1
archived 3 skill(s):
• docx-extraction → document-tools
• pdf-extraction → document-tools
• old-stale-thing — pruned (stale)
full report: hermes curator status
Empty on no-op ticks (no archives), so most ticks add zero log noise.
Cap of 10 entries keeps agent.log readable when a 50-skill
consolidation lands; the full list is always in REPORT.md.
`hermes curator status` indents continuation lines so the multi-line
summary reads as one logical field.
5 new tests in tests/agent/test_curator_classification.py covering
empty / consolidation / pruning / cap / mixed cases.
* feat(curator): show recent run summary once on `hermes update`
The rename map is now visible from where users actually look — the
update flow they explicitly run, instead of just the live gateway log
or transient CLI session-start panel.
Behavior:
- After `hermes update`, if the most recent curator run produced a
rename map (multi-line summary) that the user hasn't seen yet, print
it once with a 'last run Xh ago' header and a one-time-message
footer.
- Stamp `last_run_summary_shown_at = last_run_at` after printing so
subsequent `hermes update` invocations are silent until a newer
curator run lands.
- Silent on no-op runs (single-line summary like 'auto: no changes;
llm: no change'). Still stamps shown so we don't reconsider on
every update.
- Silent when the curator has never run (the existing first-run
notice handles that case).
Output:
ℹ Skill curator — last run 4h ago
auto: 1 marked stale; llm: consolidated 2 into 1, pruned 1
archived 3 skill(s):
• docx-extraction → document-tools
• pdf-extraction → document-tools
• old-stale-thing — pruned (stale)
full report: hermes curator status
(This message shows once per curator run. View anytime: hermes curator status)
State migration:
- `_default_state()` gains `last_run_summary_shown_at: None`. Existing
state files lack the field; `.get()` returns None; the comparison
treats any prior run as 'not yet shown' and prints once on next
update. Self-healing.
Wiring:
- Both `hermes update` paths in main.py call the new
`_print_curator_recent_run_notice()` right after the existing
first-run notice. Best-effort try/except so a state-load bug
never breaks the update flow.
6 tests in tests/hermes_cli/test_curator_recent_run_notice.py:
no-run / single-line / multi-line / show-once / new-run-resets /
time-formatter buckets.
`hermes chat -q "..."` printed the full welcome banner before
running the query — kawaii ASCII logo, available toolsets list,
available skills list, model name, session ID, working directory,
update-available notice. Building it took ~420 ms on cold start
(~200 ms version-update probe, the rest is toolset / skill enumeration
plus Rich panel rendering).
For a one-shot `-q` query the banner is noise: the user already
picked the prompt, doesn't need a toolset reference, and gets the
session ID + resume hint from `_print_exit_summary()` after the
response prints.
The fully-quiet `-Q` / `--quiet` machine-readable path was already
banner-free; this brings the human-facing single-query path in line
so all non-interactive invocations are fast.
Measured impact (`hermes chat -q "ok" --max-turns 1`, 10-run
percentiles, 9950X3D):
median: 1.90 → 1.75 s (-150 ms)
min: 1.80 → 1.73 s ( -70 ms)
P25: 1.82 → 1.74 s ( -80 ms)
Wider variance than expected; the banner cost overlaps with API
latency on real `chat -q` runs. Min-time delta of 70 ms is the
cleanest signal — that's the deterministic banner-build cost gone.
The 150 ms median delta picks up cases where the version-update
probe also finishes during the wait.
Interactive mode (`hermes` with no `-q`) and the `--list-tools` /
`--list-toolsets` one-shot listing commands still show the banner —
those are the contexts where it's actually wanted.
Tests: 656/656 `tests/cli/` pass on top of latest main (modulo 5 pre-
existing flakes in `test_cli_save_config_value.py` that fail with
`No module named 'ruamel'` both with and without this change).
The Skills Hub at /skills had cards that, when expanded, showed only the
one-line description, tags, author, version, and an install command. For
the 163 bundled and optional skills shipped with the repo, this was thinner
than the data we already have on disk.
Three changes, all under website/:
1. extract-skills.py now pulls four extra fields per local skill:
- 'overview' — first non-heading body paragraph from SKILL.md (stripped
of admonitions/code fences, capped at ~500 chars at a sentence boundary)
- 'envVars' / 'commands' — from the prerequisites: block in frontmatter
- 'license' — from the top-level frontmatter
- 'docsPath' — slug to the per-skill /docs/user-guide/skills/.../* page,
computed with the same logic as generate-skill-docs.py
162 of 163 local skills get a non-empty overview automatically. The
remaining one (media/heartmula) has only headings/code in its body and
falls through to the description.
2. Skill TS interface + SkillCard expanded-panel render the new fields:
- Overview paragraph at the top of the panel
- Prerequisites box (env vars + required commands) when frontmatter
declares them
- License row alongside author/version
- 'View full documentation →' link to the per-skill docs page
Search now covers the overview text too, so users can find skills by
matching content from inside SKILL.md, not just the one-line description.
3. styles.module.css gains six new classes (overviewBlock, detailLabel,
overviewText, prereqBlock/Row/Kind/List/Item, docsLink) styled to match
the existing dark panel aesthetic.
External / community skills (Anthropic, LobeHub, Claude Marketplace cached
indexes) keep the old behavior — overview is empty, no prereqs, no docsPath.
Validation: 'npm run build' clean (exit 0); broken-link count unchanged at
155 baseline; all 163 generated docsPath values resolve to existing pages
under website/docs/user-guide/skills/.
Same-provider /model switches on a 'custom' endpoint kept stale credentials
because (a) _resolve_named_custom_runtime's bare-custom + explicit_base_url
path went straight to OPENAI_API_KEY/OPENROUTER_API_KEY env fallbacks
without consulting the credential pool, and (b) switch_model() guarded
against custom-provider re-resolution to preserve base_url, locking in
the prior api_key.
Now the bare-custom path queries the credential pool first (mirroring
the named-custom-provider branch behavior), and the same-provider switch
guard is removed since resolve_runtime_provider has since grown a robust
custom-resolution path that preserves base_url from model_cfg.
Refs #18681 (the gateway-side api_key wiring is still separate),
#16254, #12919.
The /rollback command handler in gateway/run.py was constructing
CheckpointManager with only enabled and max_snapshots, omitting
max_total_size_mb and max_file_size_mb that the __init__ expects.
This caused a TypeError on every /rollback invocation when checkpoints
were enabled.
Fixes: NousResearch/hermes-agent#18841
Follow-up test fix for #22693 — the existing test for ps-failure +
pid-file fallback needed the /proc walk path stubbed too since /proc
is now consulted first.
Salvage of NousResearch/hermes-agent#7622.
Docker images often lack procps so `ps` is unavailable. Try reading
/proc/*/cmdline first (works in any Linux container) and fall back to
`ps -A eww` only when /proc is not present. PermissionError on
individual PIDs is silently skipped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
run_gateway() calls refresh_systemd_unit_if_needed() on every invocation
so restart settings stay current after exit-code-75 respawns. The
user-scope unit path resolves under Path.home() (NOT sandboxed by
conftest, only HERMES_HOME is), and generate_systemd_unit() bakes the
current HERMES_HOME into the unit's Environment= line.
Result: any test that exercises run_gateway() end-to-end on a real
Linux dev box silently rewrites the developer's installed
~/.config/systemd/user/hermes-gateway.service with a polluted
HERMES_HOME pointing at /tmp/pytest-of-<user>/.../hermes_test. On the
next reboot, systemd loads that unit, the gateway starts looking at an
empty tmp dir, and Telegram/Discord/etc. all show as 'No messaging
platforms enabled' even though the user's real config is fine. Three
tests in tests/hermes_cli/test_gateway.py hit this path:
test_run_gateway_exits_cleanly_on_keyboard_interrupt,
test_run_gateway_exits_nonzero_when_start_gateway_reports_failure, and
test_run_gateway_root_guard_has_escape_hatch.
Two-layer fix:
1. _install_fake_gateway_run helper (covers all four run_gateway() call
sites in test_gateway.py and any future ones) now also stubs
supports_systemd_services and refresh_systemd_unit_if_needed.
2. refresh_systemd_unit_if_needed() itself sniffs the generated unit
body for /pytest-of- and /hermes_test markers and refuses to write
when present. Defense in depth so a future test that bypasses the
helper still can't corrupt the dev's gateway. Tests that legitimately
exercise the refresh flow (test_run_gateway_refreshes_outdated_unit_on_boot)
patch generate_systemd_unit to return synthetic content that doesn't
carry those markers, so they keep working.
Adds test_refresh_refuses_to_bake_pytest_tmpdir_into_real_user_unit as a
regression test for the source-side guard.
RuntimeError('claude CLI turn timed out') from a local OpenAI-compatible
shim was falling through to FailoverReason.unknown, surfacing as 'Empty
response from model' and burning 3 retry slots on the same failing
endpoint. _classify_by_message had no timeout-message branch — only
billing/rate_limit/auth/context_overflow/model_not_found patterns. The
type-based check at line 565 also requires isinstance(error, (TimeoutError,
ConnectionError, OSError)) — a plain RuntimeError doesn't match.
Add _TIMEOUT_MESSAGE_PATTERNS for 'timed out', 'deadline exceeded',
'request timed out', 'operation timed out', 'upstream timed out', 'turn
timed out'. _classify_by_message returns FailoverReason.timeout (retryable=True)
when any pattern matches.
Salvage of #22664's classifier portion. The original PR also bundled a
fallback self-selection guard which is now redundant (already on main
via #22780) plus DeepSeek thinking and session_search fixes that are
their own separate concerns.
Follow-up to #22780 — fixes the still-broken classification of
generic-typed provider-shim timeouts that #22780's dedup didn't cover.
Fallback chain entries with 'api_key_env: ENV_VAR_NAME' weren't being
resolved by either the init-time fallback path (line ~1660) or the
runtime _try_activate_fallback path (line ~8045). Only literal
'api_key' was honored; the snake_case 'api_key_env' alias documented
elsewhere in the config was silently dropped, so a 'provider: custom'
fallback with base_url + api_key_env worked as primary but failed as
fallback with 'no endpoint credentials found' / 401.
Adds 'or fb.get("api_key_env")' to the existing 'key_env' lookup in
both call sites, with empty-string-to-None coercion so unset env vars
don't poison the resolver.
Salvage of #22665's fallback portion. The original PR also bundled
gateway-degrade-on-no-adapters changes (those land via the carve-out
in #22853 which is the same code) and run_agent.py memory-nudge
counter hydration (issue #22357 territory, not mentioned in the
title). Drops both bundled pieces; keeps just the api_key_env fix.
Closes#5392.
When connected_count == 0 AND enabled_platform_count > 0, the gateway
treated 'all adapters returned None' identically to 'all adapters
failed to connect' — both as fatal startup errors. The 'returned None'
case happens when imports fail silently or when adapters are present
in config but their dependencies aren't installed (e.g. discord.py
missing). Cron jobs and other gateway-runtime work would unnecessarily
fail to start.
Split: only return False when startup_retryable_errors is non-empty
(real connection attempt failed). When the list is empty AND enabled
> 0, log a warning and continue running, matching the 'no platforms
enabled' cron path.
Salvage of #22642's gateway slice. Drops the bundled run_agent.py
memory-nudge counter hydration block (issue #22357 territory) which
wasn't mentioned in the PR description.
Closes#5196.
Problem: terminal.docker_env set in config.yaml was silently ignored.
Docker containers never received the user-specified env vars.
Root cause: docker_env was missing from all three config→env bridging
maps (cli.py env_mappings, gateway/run.py _terminal_env_map,
hermes_cli/config.py _config_to_env_sync) and from the terminal_tool
_get_env_config() reader. _create_environment() consumed the key from
container_config correctly, but it was always {} because TERMINAL_DOCKER_ENV
was never set.
Also extend the list-serialisation branches in cli.py and gateway/run.py
to handle dict values via json.dumps (lists already used json.dumps;
plain str() on a dict produces undecodable output).
Fix:
- cli.py: add "docker_env": "TERMINAL_DOCKER_ENV" to env_mappings;
serialise dict values with json.dumps alongside existing list path
- gateway/run.py: same additions to _terminal_env_map and serialisation
- hermes_cli/config.py: add "terminal.docker_env": "TERMINAL_DOCKER_ENV"
to _config_to_env_sync so `hermes config set terminal.docker_env …`
persists to .env correctly
- tools/terminal_tool.py: add docker_env key to _get_env_config() reading
TERMINAL_DOCKER_ENV via _parse_env_var with default "{}"
Tests: add test_docker_env_is_bridged_everywhere to
tests/tools/test_terminal_config_env_sync.py — stash-verified: fails on
origin/main, passes with fix.
Fixes#20537
After Popen succeeds with os.setsid (detached process group), 5 things
happen with no try/except: Thread construction, reader.start(), lock
acquisition, prune+register, checkpoint write. If any raises, the
Popen object goes unregistered and the detached process group leaks
indefinitely.
Wrap the post-spawn setup in try/except. On failure:
- os.killpg(getpgid(pid), SIGKILL) takes down the entire process
group (not just the shell - important because of detached PG +
-lic shell wrapper that may have spawned children)
- proc.kill() fallback for ProcessLookupError/PermissionError/OSError
- proc.wait(timeout=5) reaps with a bound
- re-raise to preserve original traceback
Nested try/except around cleanup so a secondary failure can't mask the
original.
Closes#2749.
The Termux update path (PR #22814) prebuilds psutil from a marker-patched
sdist so 'platform android is not supported' doesn't kill it. The same
psutil setup.py error blocks fresh installs via scripts/install.sh — only
the update path was wired up. Without this, a brand-new Termux user can't
get past the very first 'pip install -e .[termux-all]' call.
- New scripts/install_psutil_android.py — standalone version of the same
patcher hermes_cli/main.py uses, callable from bash.
- scripts/install.sh detects sys.platform == 'android' and runs the
patcher before pip install.
- TODO note added to both copies pointing at upstream
https://github.com/giampaolo/psutil/pull/2762; remove both when that
ships.
Note: we keep psutil as a base dep on Android (do not adopt the proposed
sys_platform != 'android' marker in pyproject). Removing it would crash
five unguarded 'import psutil' sites at runtime
(tools/code_execution_tool.py, tools/tts_tool.py, tools/process_registry.py
(2x), gateway/platforms/whatsapp.py).
Problem
=======
`tools.checkpoint_manager._touch_project` reads the project metadata
file with `json.loads(meta_path.read_text(...))`, then immediately does:
meta["workdir"] = str(_normalize_path(working_dir))
The `except` block only catches `(OSError, ValueError)`. When the file
parses successfully but returns a non-dict value (a list `[]`, `null`,
or a scalar from a corrupted or hand-truncated write), `json.loads`
succeeds without error and `meta` is set to, e.g., `[]`. The subsequent
subscript assignment then raises `TypeError: list indices must be
integers or slices, not str`, which is NOT caught by the narrow except
clause.
This TypeError propagates up through `_take` to `ensure_checkpoint`,
where the broad `except Exception` safety net swallows it. The effect
is that `ensure_checkpoint` silently returns False for the entire
session — all checkpoints are skipped for the affected working directory
without any user-visible error.
Root cause
==========
Missing `isinstance(meta, dict)` guard after `json.loads`, identical in
pattern to bugs fixed in `cron/jobs.py` (#22569) and
`tools/process_registry.py` (#22544). The same guard is already
present one function below in `_list_projects` (line 506), but was
inadvertently omitted in `_touch_project`.
Fix
===
Add two lines after the try/except:
```python
if not isinstance(meta, dict):
meta = {}
```
This matches the existing guard in `_list_projects` and ensures a fresh
empty dict is used whenever the persisted value is not a mapping —
preserving the `created_at` semantics via `setdefault` on the next line.
Tests
=====
`TestTouchProjectMalformedMeta` covers four non-dict root values
(`[]`, `null`, `42`, `"oops"`). Each writes a corrupted metadata file,
calls `_touch_project`, and asserts: (a) no exception raised, (b) the
metadata file is rewritten as a valid dict containing `last_touch` and
`workdir`. All four fail on main with `TypeError`, pass with fix.
Full `tests/tools/test_checkpoint_manager.py` regression: 77 passed.
The FTS5 trigram tokenizer requires >=3 CJK characters per individual
token to produce matchable trigrams. A query like "广西 OR 桂林 OR 漓江"
has cjk_count=6 (passes the existing >=3 guard) but each token is only
2 CJK chars, so the trigram index returns 0 results.
Fix:
- Add per-token check: if any non-operator CJK token has <3 CJK chars,
force the LIKE fallback path regardless of total cjk_count.
- Expand the LIKE fallback to build one LIKE condition per non-operator
token joined with OR, so each term is matched independently.
Regression tests added in TestCJKSearchFallback:
- test_cjk_or_combined_short_tokens_returns_results
- test_cjk_short_token_or_query_preserves_filters
Problem:
When a provider or proxy drops a streaming response mid-flight (httpcore
raises RemoteProtocolError: "incomplete chunked read", "peer closed
connection", "response ended prematurely", etc.), _generate_summary
would not classify it as a transient error. Instead of retrying on the
main model, it entered the generic 60-second cooldown, leaving context
growing unbounded until the cooldown expired. Issue #18458.
Root cause:
_is_connection_error in auxiliary_client.py did not match httpcore's
streaming premature-close error substrings. context_compressor.py's
_generate_summary except block never called _is_connection_error, so
those errors fell through to the 60-second generic cooldown rather than
triggering the retry-on-main fallback path used for timeouts.
Fix:
1. auxiliary_client.py — extend _is_connection_error keyword list with:
"incomplete chunked read", "peer closed connection",
"response ended prematurely", "unexpected eof",
"remoteprotocolerror", "localprotocolerror".
Also guard the `from openai import ...` with try/except ImportError
so the function works in environments without the openai package.
2. context_compressor.py — import _is_connection_error and call it in
_generate_summary's except block as _is_streaming_closed. Include
_is_streaming_closed in the fallback-to-main condition (alongside
_is_model_not_found, _is_timeout, _is_json_decode) and use the
shorter 30s transient cooldown for streaming-closed errors.
Tests:
4 new regression tests in TestStreamingClosedFallback:
- test_incomplete_chunked_read_falls_back_to_main
- test_peer_closed_connection_falls_back_to_main
- test_streaming_closed_on_main_uses_short_cooldown (stash-verified)
- test_non_streaming_unknown_error_still_uses_long_cooldown
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`tools/image_generation_tool.py` did `import fal_client` at module
top, which pulled the entire fal_client + httpx + rich stack on every
process that ran `discover_builtin_tools()` — every `hermes` cold
start, even ones that never touch image generation.
Make the import lazy: replace the eager import with a placeholder
(`fal_client: Any = None`) and add an idempotent `_load_fal_client()`
that rebinds the module global on first use. Call it from the two
runtime entry points (`_ManagedFalSyncClient.__init__` and
`_submit_fal_request`) and from the SDK-presence check in
`check_image_generation_requirements`.
The loader short-circuits if the global is already truthy, which
preserves the test pattern of monkeypatching `fal_client` to install
a mock — the `monkeypatch.setattr(image_tool, "fal_client", ...)`
calls in test_image_generation.py keep working unchanged.
Measured impact (15-run min times, 9950X3D):
tools.image_generation_tool alone: 77 → 20 ms (-74%)
36 → 20 MB (-44%)
import cli (full): 734 → 720 ms (-2%)
import model_tools: 372 → 366 ms (-2%)
The microbench is dramatic but the full-CLI win is small — fal_client
shares its httpx + rich dependencies with the rest of the agent, so
on a real cold start most of the 16 MB / 64 ms is already paid by
other imports. The win matters mostly for processes that touch this
tool without otherwise loading httpx (rare) and for architectural
consistency with the previous lazy-load PRs (#22681 google_chat,
#22831 teams).
Tests: 55/55 `tests/tools/test_image_generation.py` pass, including
the cases that monkeypatch the module global to install a mock
fal_client. End-to-end verification confirms `import model_tools`
no longer pulls `fal_client` into `sys.modules`.
Cross-checked 75 docs pages under user-guide/messaging/, developer-guide/,
guides/, and integrations/ against the live registries and gateway code.
messaging/
- index.md: API Server toolset is hermes-api-server (was 'hermes (default)');
Google Chat slug is hermes-google_chat (underscore — plugin name uses _).
- google_chat.md: drop bogus 'pip install hermes-agent[google_chat]' (no such
extra); list the actual deps (google-cloud-pubsub, google-api-python-client,
google-auth, google-auth-oauthlib).
- qqbot.md: config namespace is platforms.qqbot (was platforms.qq, which is
silently ignored by the adapter); QQ_STT_BASE_URL is not read directly —
baseUrl lives under platforms.qqbot.extra.stt.
- teams-meetings.md: 'hermes teams-pipeline' is plugin-gated (teams_pipeline
plugin must be enabled), not a built-in subcommand.
- sms.md: example log line 0.0.0.0:8080 -> 127.0.0.1:8080 (default
SMS_WEBHOOK_HOST).
- open-webui.md: API_SERVER_* are env vars, not YAML keys — write them to
per-profile .env, not 'hermes config set' (same pattern fixed in
api-server.md last round). Also bumped example ports to 8650+ to dodge the
default webhook (8644)/wecom-callback (8645)/msgraph-webhook (8646)
collision.
developer-guide/
- architecture.md: tool/toolset counts (61/52 -> 70+/~28); LOC stamps for
run_agent.py, cli.py, hermes_cli/main.py, setup.py, mcp_tool.py,
gateway/run.py replaced with 'large file' to stop drifting.
- agent-loop.md: same LOC drift (~13,700 -> 'a large file (15k+ lines)').
- gateway-internals.md: '14+ external messaging platforms' -> '20+'; gateway
platform tree updated (qqbot is a sub-package, not qqbot.py; added
yuanbao.py, feishu_comment.py, msgraph_webhook.py); 'gateway/builtin_hooks/
(always active)' was wrong — it's an empty extension point and
_register_builtin_hooks() is a no-op stub.
- acp-internals.md: drop fictional 'message_callback' from the bridged-
callbacks list; clarify thinking_callback is currently set to None.
- provider-runtime.md: provider list was missing AWS Bedrock, Azure Foundry,
NVIDIA NIM, xAI, Arcee, GMI Cloud, StepFun, Qwen OAuth, Xiaomi, Ollama
Cloud, LM Studio, Tencent TokenHub. Fallback section described only the
legacy single-pair model — corrected to the canonical list-form
fallback_providers chain.
- environments.md: parsers list missing llama4_json and the deepseek_v31
alias; both register via @register_parser.
- browser-supervisor.md: drop reference to scripts/browser_supervisor_e2e.py
which doesn't exist in-repo.
- contributing.md: tinker-atropos is a git submodule — note that
'git submodule update --init' is required if cloning without
--recurse-submodules.
guides/
- operate-teams-meeting-pipeline.md: cron flags were all wrong — schedule is
positional (not --schedule), the script-only flag is --no-agent (not
--script-only), and there's no --command flag. Replaced with a real example
that creates the script under ~/.hermes/scripts/ and uses the actual flags.
Also replaced fictional 'hermes cron show <name>' with 'hermes cron status'.
- automation-templates.md: 'cron create --skills "a,b"' doesn't work —
the flag is --skill (singular, repeatable). Fixed all 5 occurrences via AST
rewrite.
- minimax-oauth.md: 'hermes auth add minimax-oauth --region cn' silently
fails because --region isn't registered on the auth-add argparse spec.
Pointed users at the minimax-cn provider (or MINIMAX_CN_API_KEY env) for
China-region access.
- cron-script-only.md: 'hermes send' is fictional — replaced the comparison-
table mention with a webhook-subscription pointer; also fixed the dead link
to /guides/pipe-script-output (page doesn't exist).
- cron-troubleshooting.md: 'hermes serve' isn't a real subcommand. Pointed
at 'hermes gateway' (foreground) / 'hermes gateway start' (service).
- local-ollama-setup.md: 'agent.api_timeout' is not a config key. The right
knob is the HERMES_API_TIMEOUT env var.
- python-library.md: run_conversation() return dict has only final_response
and messages — task_id is stored on the agent instance, not echoed back.
- use-mcp-with-hermes.md: '--args /c "npx -y …"' wraps the npx command in
one quoted string, so cmd.exe gets a single arg instead of the multi-token
command line it needs. Removed the surrounding quotes — argparse nargs='*'
collects each token correctly.
integrations/
- providers.md: Bedrock guardrail YAML keys were 'id'/'version' (don't exist);
actual keys are guardrail_identifier/guardrail_version (matches DEFAULT_CONFIG
and the run_agent.py reader). GMI default base URL (api.gmi.ai/v1 ->
api.gmi-serving.com/v1) and portal URL (inference.gmi.ai -> www.gmicloud.ai)
refreshed. Fallback section rewritten to lead with the canonical
fallback_providers list form (was leading with the legacy fallback_model
single dict); supported-providers list extended to include azure-foundry,
alibaba-coding-plan, lmstudio.
index.md
- '68 built-in tools' -> '70+'; '15+ platforms' was both inconsistent with
integrations/index.md ('19+') and undercounted — bumped to 20+ and added
Weixin/QQ Bot/Yuanbao/Google Chat to the list.
Validation: 'npm run build' clean (exit 0); broken-link count unchanged at
155 (same as round-1 post-skill-regen baseline). 24 files, +132/-89.
The plumbing for setting OpenRouter provider preferences and the Pareto Code
router on auxiliary tasks already exists — auxiliary.<task>.extra_body is
forwarded verbatim by call_llm() / async_call_llm(). It just wasn't documented,
so users who wanted (e.g.) Pareto Code routing for compression but the strongest
coder for the main agent had no way to discover the escape hatch.
- hermes_cli/config.py: expand the auxiliary section header with a YAML
example showing provider routing plus plugins under extra_body, and an
explicit note that main-agent provider_routing / openrouter.min_coding_score
do NOT propagate to aux calls (each task is independent by design)
- website/docs/user-guide/configuration.md: new 'OpenRouter routing and
Pareto Code for auxiliary tasks' subsection with worked example
- website/docs/integrations/providers.md: cross-link from the Pareto Code
Router section to the aux-side doc
E2E verified that auxiliary.<task>.extra_body reaches the OpenRouter API with
the configured provider routing and plugins blocks intact.
PR #2974 whitelisted three reasoning fields (reasoning, reasoning_details,
codex_reasoning_items) for the gateway's simple-text replay branch. Three
more fields were added to the DB later but the whitelist was never updated:
- reasoning_content: provider-facing thinking text. _copy_reasoning_content_for_api
promotes 'reasoning' -> 'reasoning_content' at send time only when the
strings happen to match. Carrying the original verbatim avoids loss
for providers that return them as distinct fields (DeepSeek/Kimi/
Moonshot thinking modes), and preserves the empty-string sentinel
that DeepSeek V4 Pro requires for thinking-mode replay.
- codex_message_items: exact assistant message items with 'phase'.
OpenAI docs: 'preserve and resend phase on all assistant messages —
dropping it can degrade performance.' Required for prefix cache hits.
No recovery path exists — once dropped, gone.
- finish_reason: informational; cheap to keep so transcripts replay
identically across CLI and gateway.
The CLI is unaffected because cli.py keeps the live in-memory message list
across turns (cli.py:10046 'self.conversation_history = result["messages"]').
The gateway rebuilds agent_history from the SQLite transcript on every turn,
so any field stripped during replay is silently lost.
Refactors the inline whitelist into a module-level _build_replay_entry()
helper so the contract can be unit-tested. 16 new tests pin the field set
and falsy-value handling.
Verified end-to-end: DB stores all 8 fields, replay now preserves all 8
(was preserving only 5 for assistant text turns).
Pick openrouter/pareto-code as your model and OpenRouter auto-routes each
request to the cheapest model meeting your coding-quality bar (ranked by
Artificial Analysis). The new openrouter.min_coding_score config key (0.0-1.0,
default 0.65) tunes the floor.
- hermes_cli/models.py: add openrouter/pareto-code to OPENROUTER_MODELS so
it shows up in the picker with a description
- hermes_cli/config.py: add openrouter.min_coding_score (default 0.65 — lands
on a mid-tier coder on the current Pareto frontier)
- plugins/model-providers/openrouter: emit extra_body.plugins =
[{id: pareto-router, min_coding_score: X}] when model is openrouter/pareto-code
AND the score is a valid float in [0.0, 1.0]
- agent/transports/chat_completions.py: same emission on the legacy flag
path (when no provider profile is loaded)
- run_agent.py: openrouter_min_coding_score kwarg + storage; plumbed into
both build_kwargs() invocations and the context-summary extra_body path
- cli.py: read openrouter.min_coding_score once at init, validate float in
[0,1], pass to AIAgent constructions (CLI + background-task paths)
- cron/scheduler.py, batch_runner.py, tools/delegate_tool.py,
tui_gateway/server.py: propagate the kwarg (mirrors providers_order
plumbing — subagents inherit, cron/batch read from config)
- tests: profile-level + transport-level coverage of the model gating,
unset/empty/out-of-range handling, and the legacy flag path
- docs: new 'OpenRouter Pareto Code Router' section in providers.md
Verified end-to-end against api.openrouter.ai: at score=0.65 we land on a
mid-tier coder, at omission we get the strongest. Score is silently dropped
on any model other than openrouter/pareto-code, so it's safe to leave set.
Same pattern as the google_chat lazy-load (PR #22681), applied to the
Teams plugin. The bundled `plugins/platforms/teams/adapter.py` did
`import httpx` at module top, which dragged the entire httpx +
httpcore stack into every process that triggered plugin discovery —
including `hermes` invocations that never instantiate the Teams
adapter.
`httpx` is only needed inside one method
(`TeamsMeetingPipeline._write_summary_via_incoming_webhook`), and the
`httpx.AsyncBaseTransport` parameter annotation is already string-only
thanks to the existing `from __future__ import annotations`. Move the
runtime import inside the method.
Measured impact (7-run medians, 9950X3D):
teams plugin alone: 118 → 89 ms (-25%)
46 → 38 MB (-17%)
import cli (full): unchanged
import model_tools: unchanged
The full-CLI numbers are flat because httpx is loaded transitively
from many other modules on that path. The microbench win is the real
signal: 29 ms / 8 MB shaved off any process that touches the teams
plugin without otherwise pulling httpx — primarily future workflows
where the gateway is enabled but Teams is not configured.
Tests: 44/44 `tests/gateway/test_teams.py` pass; 345 across all
plugin-platform suites (teams + qqbot + google_chat). The test file
imports `httpx` itself for the `MockTransport` fixture, which is
correct — tests legitimately use httpx, only the plugin's module-level
import was the issue.
Pass session_id through to provider profile build_api_kwargs_extras so
the OpenRouter profile can attach an xAI cache-affinity header
(x-grok-conv-id: <session-id>) for x-ai/grok-* models. xAI prompt
cache requires server affinity via this header — without it the cache
is poisoned and Grok prompt-cache hit rates drop dramatically on
multi-turn sessions.
Carve-out of #22708 by Ninso112. The original PR bundled a /diff
slash command, a zsh completion fix (already on main via #22802),
and holographic memory null-guards. This salvage keeps just the
Grok header work — small, targeted, and well-tested. Other
contributors and changes preserved for separate review.
Closes#22705.
When systemd_restart / systemd_status / systemd_stop run under sudo,
HERMES_HOME is stripped and HOME=/root, so get_hermes_home() resolves
to /root/.hermes instead of the unit's pinned home. read_runtime_status
and get_running_pid then look at the wrong gateway_state.json — the
60s status poll never sees "running", times out, and forces another
systemctl restart that SIGTERMs the in-progress new gateway.
Read the unit's pinned HERMES_HOME from `systemctl show -p Environment`
and mirror it into os.environ before any HERMES_HOME-derived read.
Early-out when system=False (user-scope inherits naturally). Errors
swallowed so a transient systemctl failure doesn't break unrelated
CLI ops.
Closes#22035.
Per-tool-call push notifications on Telegram are noisy enough that
'all' is the wrong default — long agent runs spam the user's notification
shade with status messages they didn't ask to be pinged about. Final
responses, approval prompts, and slash confirmations still notify;
intermediate progress, streaming, and tool-progress messages now
deliver silently via disable_notification.
Users who want the legacy behavior can opt back in with:
display:
platforms:
telegram:
notifications: all
or HERMES_TELEGRAM_NOTIFICATIONS=all.
Add a configurable notifications mode for the Telegram platform adapter
that controls which messages trigger push notifications.
- display.platforms.telegram.notifications: "all" (default) | "important"
- HERMES_TELEGRAM_NOTIFICATIONS env var override
- In "important" mode, all sends use disable_notification=True except:
- Approvals (send_exec_approval) and slash confirmations
- Final response messages (metadata["notify"]=True)
- Zero overhead in default "all" mode
- Zero impact on non-Telegram platforms
Closes#22771
acp_command / acp_args descriptions previously primed the model to
populate them — "Per-task ACP command override (e.g. 'copilot')" —
even when no ACP CLI was installed. Models with weaker schema-following
discipline would set them and the spawn would fail.
Add explicit "Do NOT set unless the user has explicitly told you"
guidance at both the top-level acp_command and the per-task override.
Strengthen acp_args to mention it's empty unless acp_command is set.
Adds 2 tests pinning the descriptions.
Note: this is a cosmetic prompt-engineering fix — the params remain
exposed in the schema. The fully-correct fix is to gate them behind
a config flag or runtime ACP-CLI detection so the schema only emits
them when an ACP harness is available. Tracked as a follow-up; this
PR ships the low-cost stopgap.
Salvage of #22680 (delegate schema only). The original PR also
bundled unrelated fixes for #22548, #21944, #22150 — those
need separate PRs since #22548 and #21944 are already addressed
on main (#22780 + #22798 in flight) and #22150 deserves its own
review.
Closes#22013.
Two co-located fixes:
1. agent/model_metadata.py: bump hy3-preview static fallback from
256000 to 262144 (256 * 1024) to match OpenRouter live metadata
so cache and offline both agree (issue #22268).
2. tests/hermes_cli/test_tencent_tokenhub_provider.py: replace the
exact-value change-detector (assert ctx == 256000) with an
invariant assertion (registered + >= 4096). Per AGENTS.md
'Don't write change-detector tests': pinning the upstream-controlled
context length is exactly the test class the rule forbids — it
breaks every time the provider bumps the published value, with
zero behavioral coverage gained.
Salvage of #22574 with a redirect on the test approach. The
contributor's diff bumped the integer and added a SECOND
change-detector pinning DEFAULT_CONTEXT_LENGTHS[hy3-preview] == 262144,
which would re-break on the next published bump. We instead delete
the change-detector entirely and assert the relationship.
Closes#22268.
The generated zsh completion script used `(-h --help)` as the exclusion
group for `_arguments`, which zsh rejects with:
_arguments:comparguments: invalid argument: (-h --help){-h,--help}[...]
Exclusion groups in `_arguments` cannot contain long options. Use the
canonical `(-)` form (exclude all other options) which correctly
handles flag pairs like `-h`/`--help`.
FixesNousResearch/hermes-agent#22686
Problem
-------
`hermes doctor` ran two health checks for Anthropic: a dedicated one
with the correct `x-api-key` + `anthropic-version` headers, and a
generic Bearer-auth one driven by the pluggable `ProviderProfile` for
"anthropic". The generic check called `https://api.anthropic.com/v1/models`
with `Authorization: Bearer ...`, which Anthropic answers with HTTP 404,
producing a noisy duplicate warning even when the dedicated check passed.
Root cause
----------
`hermes_cli/doctor.py:_build_apikey_providers_list` deduplicated profiles
against a `_known_canonical` set built from the static list (Z.AI/GLM,
Kimi, DeepSeek, …). Providers with their own dedicated check above the
generic loop (Anthropic, OpenRouter, Bedrock) were not in that set, so
their profiles were appended and ran a second, broken check.
Fix
---
Add `{"anthropic", "openrouter", "bedrock"}` to the skip set, and
also skip profiles whose aliases match any of those names (e.g.
`claude`, `claude-oauth` → anthropic).
Tests
-----
tests/hermes_cli/test_doctor_dedicated_provider_skip.py:
- test_build_apikey_providers_list_skips_dedicated_check_providers:
asserts the assembled list does not contain anthropic, openrouter,
or bedrock entries.
- test_build_apikey_providers_list_includes_non_dedicated_providers:
sanity guard that legitimate providers (DeepSeek, Z.AI/GLM) survive.
Both confirmed via stash-verify (fail pre-fix with anthropic/openrouter
leaking, pass post-fix).
Fixes#22346
ALTER TABLE calls inside _migrate_add_optional_columns were guarded by a
snapshot of PRAGMA table_info taken at function entry. When the gateway
dispatcher opens the kanban DB twice per tick (once in _tick_once_for_board
and once via init_db's discard-and-reconnect path), a second connection can
run the same migration before the first one commits, causing:
sqlite3.OperationalError: duplicate column name: consecutive_failures
This crashed the dispatcher on every first tick after a gateway restart
(subsequent ticks succeeded because the columns were then present).
Fix: introduce _add_column_if_missing() which wraps ALTER TABLE in a
try/except that swallows OperationalError whose message contains
'duplicate column name'. All ALTER TABLE calls in
_migrate_add_optional_columns are routed through this helper.
Closes#21708
DeepSeek V4 Pro returns thinking content as typed blocks inside the
content array rather than as a top-level reasoning_content field:
[{"type": "thinking", "thinking": "..."}, {"type": "output", ...}]
_extract_reasoning only handled content as a plain string, so the
thinking text was silently dropped. On the next turn the session was
replayed without the thinking block, causing:
HTTP 400: The content[].thinking in the thinking mode must be
passed back to the API.
Fix: when content is a list and no structured reasoning field was
found, scan for items with type=='thinking' and accumulate their
'thinking' (or 'text') value into reasoning_parts. Structured fields
(reasoning, reasoning_content, reasoning_details) still take priority
so existing provider behaviour is unchanged.
Closes#21944
skills/media/youtube-content/scripts/fetch_transcript.py and
optional-skills/productivity/memento-flashcards/scripts/youtube_quiz.py
both import youtube-transcript-api at runtime, but the package was not
listed in pyproject.toml. A fresh `uv sync` therefore omits it, and
both skills fail on first invocation with:
ModuleNotFoundError: No module named 'youtube_transcript_api'
Add a new [youtube] optional-dependency group with
youtube-transcript-api>=1.2.0 (the v1.x API surface the scripts already
use) and include it in [all] so standard installs pick it up.
Regression tests: TestPyprojectDeclaresYoutubeExtra verifies the extra
is present in pyproject.toml and included in [all].
Closes#22243
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
163/NetEase IMAP servers reject every UID SEARCH/FETCH with `BYE Unsafe
Login` unless the client first identifies itself via the RFC 2971 ID
command after LOGIN. Without this, the email gateway logs in OK but
then fails on the very first poll and the connection is torn down.
Send the ID payload best-effort after both `imap.login()` sites
(`EmailAdapter.connect` and `_fetch_new_messages`). Failures are
swallowed at debug level so non-supporting IMAP servers (Gmail,
Outlook, Fastmail, Yahoo, etc.) keep working unchanged.
Closes#22271
Problem: `_get_cloud_provider()` set `_cloud_provider_resolved = True`
before resolution. If credentials were briefly unavailable on the first
call (e.g. a managed Nous Portal token mid-refresh), the resolver pinned
the entire process to local mode forever, even after credentials
self-healed seconds later.
Root cause: bookkeeping was set up-front, so any code path that fell
through to `return _cached_cloud_provider` (config read failure, no
credentials yet, explicit-provider instantiation failure) committed the
transient `None` to the cache permanently.
Fix: invert the bookkeeping. `_cloud_provider_resolved = True` is now
set only when (a) the user explicitly chose `cloud_provider: local`, or
(b) a provider was successfully resolved. All transient `None` paths
return without poisoning the cache, so the next call retries. Explicit
provider instantiation failures now log at warning level with stack
trace so operators can diagnose them.
Tests: 5 new cases in tests/tools/test_browser_cloud_provider_cache.py
covering explicit local, successful resolution, no-credentials-yet,
config read failure, and explicit provider instantiation failure.
Stash-verify confirmed the 3 transient-None tests fail without the fix.
All 320 existing browser tests still green.
Closes#22324
`fetch_models_dev()` is on the hot path of every `AIAgent.__init__`
(via `context_compressor → get_model_context_length`). The previous
policy was "always try network first, only fall back to disk if
network fails," so every fresh `hermes chat` / `hermes gateway` /
batch / cron process paid 250-500 ms re-fetching a 2 MB JSON registry
that was already on disk from earlier runs.
Add a stage 2 between in-mem and network: if
`models_dev_cache.json` exists and its mtime is younger than the
existing `_MODELS_DEV_CACHE_TTL` (1 hour, same TTL the in-mem cache
already uses), load from disk and skip the network call.
The in-mem TTL is anchored to the disk file's age, so a 50-min-old
cache stays in-memory for only 10 more minutes — no surprise
extension of staleness window.
Invariants preserved:
- `force_refresh=True` still always hits the network and only falls
back to disk on failure (`hermes config refresh` semantics).
- Missing disk cache → fall through to network (first-ever run).
- Stale disk cache (mtime > TTL) → fall through to network.
- Negative file age (clock skew) → fall through to network.
- Network failure → existing stage-4 stale-disk fallback unchanged.
Measured impact (3-run medians, 9950X3D, fresh process per run):
fetch_models_dev cold: 256 → 17 ms (-93%)
hermes chat -q wall: 4.00 → 3.73 s (-7% median)
3.99 → 3.60 s (-10% min)
The chat-end-to-end win is bounded below by API latency variance, but
the fetch_models_dev microbenchmark is the cleanest signal: 239 ms
shaved off every fresh-process agent construction.
Win compounds with the previous perf PRs:
#22681 google_chat lazy-load
#22766 doctor parallel + IMDS off
#22790 gateway.platforms PEP 562
Tests: all 30 `tests/agent/test_models_dev.py` pass (added 4 new ones
covering the new disk-cache-first path, force_refresh override, stale
disk fallback, and missing-disk-cache fall-through). Full `tests/agent/`
suite: 2560 passed, 0 failed.
The is_xai_responses branch only sent include=[reasoning.encrypted_content]
without forwarding the resolved reasoning_effort. Other Responses providers
(OpenAI, GitHub) already get effort forwarded — this aligns the xAI path.
Without this, agent.reasoning_effort is silently dropped on the xAI direct
path, making Hermes unable to control reasoning depth on grok-4.x via
api.x.ai. Tests added to TestCodexBuildKwargs cover effort passthrough,
disabled state, and minimal-clamp parity with non-xAI.
* docs: deep audit — fix stale config keys, missing commands, and registry drift
Cross-checked ~80 high-impact docs pages (getting-started, reference, top-level
user-guide, user-guide/features) against the live registries:
hermes_cli/commands.py COMMAND_REGISTRY (slash commands)
hermes_cli/auth.py PROVIDER_REGISTRY (providers)
hermes_cli/config.py DEFAULT_CONFIG (config keys)
toolsets.py TOOLSETS (toolsets)
tools/registry.py get_all_tool_names() (tools)
python -m hermes_cli.main <subcmd> --help (CLI args)
reference/
- cli-commands.md: drop duplicate hermes fallback row + duplicate section,
add stepfun/lmstudio to --provider enum, expand auth/mcp/curator subcommand
lists to match --help output (status/logout/spotify, login, archive/prune/
list-archived).
- slash-commands.md: add missing /sessions and /reload-skills entries +
correct the cross-platform Notes line.
- tools-reference.md: drop bogus '68 tools' headline, drop fictional
'browser-cdp toolset' (these tools live in 'browser' and are runtime-gated),
add missing 'kanban' and 'video' toolset sections, fix MCP example to use
the real mcp_<server>_<tool> prefix.
- toolsets-reference.md: list browser_cdp/browser_dialog inside the 'browser'
row, add missing 'kanban' and 'video' toolset rows, drop the stale
'38 tools' count for hermes-cli.
- profile-commands.md: add missing install/update/info subcommands, document
fish completion.
- environment-variables.md: dedupe GMI_API_KEY/GMI_BASE_URL rows (kept the
one with the correct gmi-serving.com default).
- faq.md: Anthropic/Google/OpenAI examples — direct providers exist (not just
via OpenRouter), refresh the OpenAI model list.
getting-started/
- installation.md: PortableGit (not MinGit) is what the Windows installer
fetches; document the 32-bit MinGit fallback.
- installation.md / termux.md: installer prefers .[termux-all] then falls
back to .[termux].
- nix-setup.md: Python 3.12 (not 3.11), Node.js 22 (not 20); fix invalid
'nix flake update --flake' invocation.
- updating.md: 'hermes backup restore --state pre-update' doesn't exist —
point at the snapshot/quick-snapshot flow; correct config key
'updates.pre_update_backup' (was 'update.backup').
user-guide/
- configuration.md: api_max_retries default 3 (not 2); display.runtime_footer
is the real key (not display.runtime_metadata_footer); checkpoints defaults
enabled=false / max_snapshots=20 (not true / 50).
- configuring-models.md: 'hermes model list' / 'hermes model set ...' don't
exist — hermes model is interactive only.
- tui.md: busy_indicator -> tui_status_indicator with values
kaomoji|emoji|unicode|ascii (not kawaii|minimal|dots|wings|none).
- security.md: SSH backend keys (TERMINAL_SSH_HOST/USER/KEY) live in .env,
not config.yaml.
- windows-wsl-quickstart.md: there is no 'hermes api' subcommand — the
OpenAI-compatible API server runs inside hermes gateway.
user-guide/features/
- computer-use.md: approvals.mode (not security.approval_level); fix broken
./browser-use.md link to ./browser.md.
- fallback-providers.md: top-level fallback_providers (not
model.fallback_providers); the picker is subcommand-based, not modal.
- api-server.md: API_SERVER_* are env vars — write to per-profile .env,
not 'hermes config set' which targets YAML.
- web-search.md: drop web_crawl as a registered tool (it isn't); deep-crawl
modes are exposed through web_extract.
- kanban.md: failure_limit default is 2, not '~5'.
- plugins.md: drop hard-coded '33 providers' count.
- honcho.md: fix unclosed quote in echo HONCHO_API_KEY snippet; document
that 'hermes honcho' subcommand is gated on memory.provider=honcho;
reconcile subcommand list with actual --help output.
- memory-providers.md: legacy 'hermes honcho setup' redirect documented.
Verified via 'npm run build' — site builds cleanly; broken-link count went
from 149 to 146 (no regressions, fixed a few in passing).
* docs: round 2 audit fixes + regenerate skill catalogs
Follow-up to the previous commit on this branch:
Round 2 manual fixes:
- quickstart.md: KIMI_CODING_API_KEY mentioned alongside KIMI_API_KEY;
voice-mode and ACP install commands rewritten — bare 'pip install ...'
doesn't work for curl-installed setups (no pip on PATH, not in repo
dir); replaced with 'cd ~/.hermes/hermes-agent && uv pip install -e
".[voice]"'. ACP already ships in [all] so the curl install includes it.
- cli.md / configuration.md: 'auxiliary.compression.model' shown as
'google/gemini-3-flash-preview' (the doc's own claimed default);
actual default is empty (= use main model). Reworded as 'leave empty
(default) or pin a cheap model'.
- built-in-plugins.md: added the bundled 'kanban/dashboard' plugin row
that was missing from the table.
Regenerated skill catalogs:
- ran website/scripts/generate-skill-docs.py to refresh all 163 per-skill
pages and both reference catalogs (skills-catalog.md,
optional-skills-catalog.md). This adds the entries that were genuinely
missing — productivity/teams-meeting-pipeline (bundled),
optional/finance/* (entire category — 7 skills:
3-statement-model, comps-analysis, dcf-model, excel-author, lbo-model,
merger-model, pptx-author), creative/hyperframes,
creative/kanban-video-orchestrator, devops/watchers,
productivity/shop-app, research/searxng-search,
apple/macos-computer-use — and rewrites every other per-skill page from
the current SKILL.md. Most diffs are tiny (one line of refreshed
metadata).
Validation:
- 'npm run build' succeeded.
- Broken-link count moved 146 -> 155 — the +9 are zh-Hans translation
shells that lag every newly-added skill page (pre-existing pattern).
No regressions on any en/ page.
`gateway/platforms/__init__.py` eagerly imported `QQAdapter` and
`YuanbaoAdapter` at package-init time, which transitively pulled in
qqbot's chunked-upload + keyboards + onboard machinery and yuanbao's
websocket stack. About 84 ms wall and 23 MB RSS on every fresh process
that touched anything under `gateway.platforms` — including `hermes
chat` (via run_agent → cli's plugin discovery transitive import).
Nothing in the codebase actually consumes these symbols from the
package root; every real call site uses the long-form path
(`from gateway.platforms.qqbot import QQAdapter`,
`from gateway.platforms.yuanbao import YuanbaoAdapter` in gateway/run.py).
The eager re-export was only there for convenience.
Replace with a PEP 562 module-level `__getattr__` that lazily imports
on first attribute access. Public API stays identical:
`from gateway.platforms import QQAdapter` keeps working but only
pays the import cost when the symbol is actually touched. `__dir__`
preserves help() / autocomplete behavior.
Measured impact (7-run medians, 9950X3D):
import gateway.platforms 127 → 43 ms (-66%)
50 → 27 MB (-46%)
import gateway.platforms.base 127 → 44 ms (-65%)
50 → 27 MB (-46%)
import cli (full chat path) 745 → 710 ms ( -5%)
96 → 90 MB ( -6%)
hermes chat -q (cold) -5 MB
The per-import win is biggest because qqbot/yuanbao deps don't overlap
with anything on the gateway-platforms path — full `import cli`
already loads aiohttp/websockets transitively from other places, so
the marginal CLI win is smaller than the isolated import benchmark.
The `gateway.platforms.base` win is what matters most for long-lived
gateway processes: every gateway boot saves 23 MB resident.
All 144 qqbot tests pass; broader gateway suite (5132 tests) passes
modulo 4 pre-existing flakes also failing on main without this change.
The xAI image-gen provider was DOA from PR #14765 onward — every request
422'd because the resolution param was being mapped to '1024'/'2048' but
xAI's API expects the literal strings '1k'/'2k'. PR #18678 fixed the
mapping; this test asserts the wire payload carries the literal so the
regression cannot recur silently.
The xAI /v1/images/generations endpoint expects resolution as a
literal string ('1k' or '2k'), not the numeric value ('1024').
- Change _XAI_RESOLUTIONS from a dict mapping to a validation set
- Use the resolution key directly instead of the mapped value
- Fall back to DEFAULT_RESOLUTION on invalid config values
Fixes 422 Unprocessable Entity errors when resolution was sent.
`hermes doctor` ran every connectivity probe sequentially and on a typical
developer laptop spent ~2s of its ~5s wall time inside boto3's EC2
instance-metadata-service lookup (169.254.169.254) — the default
AWS credential chain probes IMDS even when AWS_BEARER_TOKEN_BEDROCK
or AWS_ACCESS_KEY_ID is the only legitimate source.
Refactor the API Connectivity section so every probe (OpenRouter,
Anthropic, ~16 static API-key providers + dynamic profiles, AWS
Bedrock) is a pure function returning a structured result, then
fan them out through a ThreadPoolExecutor(max_workers=8). Output
order, glyphs, colours, padding, and issue strings stay byte-for-byte
identical to the sequential implementation; results are gathered
in submission order.
Also disable IMDS for the parallel block by setting
AWS_EC2_METADATA_DISABLED=true on the parent thread before submitting
work (and restoring its prior value in a finally block). Bedrock's
real-API call gets a Config(connect_timeout=5, read_timeout=10,
retries={max_attempts:1}) so a transient regional failure can't pad
the run by 30+ seconds.
Measured impact (5-run medians, 9950X3D):
hermes doctor: 5.07 → 2.16 s (-57%)
Doctor tests: 48 passed (test_doctor.py + test_doctor_command_install.py).
The remaining ~2s of wall is import overhead + a couple of one-off
network calls outside the API Connectivity section (`fetch_models_dev`
provider catalog refresh, Nous OAuth refresh in `Auth Providers`).
Those are next-tier targets, not part of this change.
Returning users who enabled '🖱️ Computer Use (macOS)' via 'hermes tools'
saw '✓ Saved configuration' but no install — cua-driver was never on
PATH and the toolset failed at first use. Two compounding causes:
1. _toolset_needs_configuration_prompt fell through to _toolset_has_keys,
which returned True for any provider with empty env_vars. cua-driver
has no env vars, so the gate skipped _configure_toolset entirely and
_run_post_setup('cua_driver') never ran.
2. No stable CLI entry-point existed for re-running the install when
the picker no-op'd it (e.g. when toggling the toolset off+on inside
one picker session, where 'added' is empty).
Changes:
- hermes_cli/tools_config.py: add _POST_SETUP_INSTALLED registry
mapping post_setup keys to installed-state predicates. The gate
now returns True when any visible provider has a registered
post_setup whose predicate fails. cua_driver is the only opt-in
for now; other post_setup hooks keep their existing behaviour.
- hermes_cli/main.py: add 'hermes computer-use install' and
'hermes computer-use status' as a stable docs target. install
reuses the same _run_post_setup('cua_driver') path that the
picker invokes; status reports whether cua-driver is on PATH.
- tools/computer_use/cua_backend.py: install hint now points users
at 'hermes computer-use install' first.
- website/docs/user-guide/features/computer-use.md: document the
new command as the primary install path.
- website/docs/reference/cli-commands.md: catalog 'hermes
computer-use' alongside 'hermes tools'.
- tests/hermes_cli/test_post_setup_gating.py: regression coverage
for the gate predicate (missing -> setup forced, installed ->
setup skipped, broken predicate -> non-blocking, unregistered
keys -> behaviour unchanged).
Fixes#22737. Reported by @f-trycua.
The model regularly writes session-outcome facts to MEMORY.md despite
the existing 'Do NOT save task progress' line — entries like
'Submitted PR #22577 for the kanban dedup fix' or 'Fixed bug X in
file Y'. These are stale within days, pollute the system prompt,
and crowd out durable user preferences (the issue #22563 reporter
saw 9 sections of bug-fix notes injected on a brand-new task).
Add explicit examples of what NOT to save (PR numbers, issue
numbers, commit SHAs, 'fixed/submitted/Phase N done', file counts)
plus the 7-day-staleness heuristic so the model has a concrete
calibration target rather than guessing what counts as 'task progress'.
Closes#22563 (the prompt-side, low-risk portion). The bigger
relevance-based-injection / vector-retrieval feature requested in
#22563 is tracked under #2184 (Richer local memory). Per skill rule
on prompt caching, dynamic memory injection breaks the frozen-snapshot
invariant and needs a separate design call.
_try_activate_fallback() walked the chain by index without comparing
the candidate entry against the currently-failing backend. So a
misconfigured chain that listed the same provider+model as the primary,
or two custom_providers entries pointing at the same shim URL, would
loop the same failure 3x for the same backend.
After the fix, advance() skips:
- entries where (provider, model) match the current agent's
- entries with a base_url + model matching the current backend
(catches two custom_providers names pointing at the same shim)
Recursing through self._try_activate_fallback() continues to the next
chain entry; if everything matches, returns False and the caller
moves on without retrying the same broken path.
3 regression tests covering same-provider-same-model skip, same-base_url-
same-model skip, and the all-self-matching-returns-False exhaustion path.
Closes#22548 (the Hermes-side portion). The 120s timeout itself in
the downstream claude-cli shim is a deployment concern documented in
that issue's wherewolf87 comment.
Native Windows, WSL, SSH sessions, and Windows Terminal all send
Ctrl+Enter as bare LF (c-j). Hermes was binding c-j as submit on
every POSIX platform, so Ctrl+Enter submitted instead of inserting
a newline on those terminals. Reported in #22379.
Add _preserve_ctrl_enter_newline() predicate that detects the
environments where Ctrl+Enter must produce a newline (sys.platform
== 'win32', SSH_CONNECTION/SSH_CLIENT/SSH_TTY env, WT_SESSION,
WSL_DISTRO_NAME, /proc/version 'microsoft' marker). Gate the
c-j-as-submit binding off in those environments and gate the
c-j-as-newline handler on. Local POSIX TTYs without those markers
(docker exec, plain ssh from a Mac) keep c-j as submit so plain
Enter still works on thin PTYs.
Add install_ctrl_enter_alias() in hermes_cli/pt_input_extras.py
mapping the three CSI-u / modifyOtherKeys variants of Ctrl+Enter
('\x1b[13;5u', '\x1b[27;5;13~', '\x1b[27;5;13u') to the
(Escape, ControlM) tuple Alt+Enter produces. This lets Kitty /
mintty / xterm-with-modifyOtherKeys users over SSH get a Ctrl+Enter
newline through the existing Alt+Enter handler.
9 new tests + extended existing test_lf_enter_binds_to_submit_handler_posix
to cover bare-local vs SSH branches.
Closes#22379.
Non-streaming /v1/chat/completions wrapped any AIAgent result \u2014 including
partial/failed runs \u2014 as a successful 200 with finish_reason='stop' and
the internal failure string substituted into message.content. API
clients had no way to distinguish 'agent answered: X' from
'agent crashed and the X you see is its error message'.
After the fix:
- completed: True \u2192 200 finish_reason='stop' (unchanged)
- partial + truncated text \u2192 200 finish_reason='length' + hermes extras
- partial + no text / failed \u2192 502 OpenAI error envelope (SDKs raise)
- other failures \u2192 200 finish_reason='error' + hermes extras
Adds X-Hermes-Completed / X-Hermes-Partial / X-Hermes-Error headers
plus a 'hermes' extras object on partial responses for clients that
want the full picture.
Closes#22496.
Gateway creates a fresh AIAgent per inbound message in several common
scenarios: cache miss, idle eviction (1h TTL), config-signature
mismatch, process restart. A freshly-built AIAgent has
_turns_since_memory=0 and _user_turn_count=0, so the
memory.nudge_interval trigger ('_turns_since_memory >=
_memory_nudge_interval') can never be reached when these reconstructions
happen on roughly the cadence of the interval. A user can chat for hours
on Telegram without ever seeing a self-improvement review fire.
Reconstruct the counters from conversation_history at the top of
run_conversation(), right after the existing _hydrate_todo_store call.
Idempotent guard ('if self._user_turn_count == 0') means a cached agent
that already accumulated counters keeps them; only freshly-built agents
hydrate. Modulo arithmetic preserves the original 1-in-N cadence rather
than firing a review immediately on resume.
7 regression tests pinning the contract (mid-cycle history, modulo wrap,
idempotency, zero-interval skip, role==user filtering, production-code
anchor).
Closes#22357.
Operator-controlled HERMES_PROFILE values were rendered as
'**${author}** (${ts}):' — markdown bold with no provenance prefix.
Worker comment bodies render directly underneath. A misleading
profile name like 'hermes-system' or 'operator' could be misread by
the next worker as a system directive above attacker-influenced
content (confused-deputy primitive gated on operator misconfig).
The LLM-controlled author-forgery surface was already closed in
#22435 (author removed from KANBAN_COMMENT_SCHEMA). This is
defense-in-depth: render with an explicit 'comment from worker
`<author>` at <ts>:' prefix so even 'hermes-system' resolves to
'comment from worker `hermes-system` at ...' — parseable as
worker-comment metadata, not a system directive. Strip backticks
from author so they can't break out of the fence.
Update test_build_worker_context_caps_comments to count by body
regex since the rendered author line now also starts with
'comment '.
Closes#22452.
Two unrelated but co-located fixes to scripts/run_tests.sh:
1. pytest-split bootstrap (#22401): the script tried '$PYTHON -m pip
install pytest-split' on first run, but uv-created venvs ship without
pip. Result: 'No module named pip' before any test ran. Add a uv
fallback (uv pip install --python $PYTHON), keep pip as a secondary
path, and emit a clear error pointing at 'uv pip install -e ".[dev]"'
when neither is available. Also declare pytest-split in
pyproject.toml dev extra so a normal '.[dev]' install provisions it.
2. HERMES_CRON_SESSION leak (#22400): the hermetic env scrub already
unsets HERMES_GATEWAY_SESSION and HERMES_INTERACTIVE but missed the
sibling HERMES_CRON_SESSION. When run_tests.sh is invoked from a
Hermes cron job, that variable leaks into pytest, flipping
tools/approval.py into cron-deny mode and breaking
tests/acp/test_approval_isolation.py and friends.
Closes#22400.
Closes#22401.
When session_id rotates (e.g. /new), commit_memory_session was firing
MemoryManager.on_session_end but skipping ContextEngine.on_session_end.
Engines that accumulate per-session state (LCM-style DAGs, summary
stores) leaked that state from the rotated-out session into whatever
continued under the same compressor instance.
Mirror the call shutdown_memory_provider already makes — same
lifecycle moment, same hook contract ("real session boundaries (CLI
exit, /reset, gateway expiry)"). /new is a real boundary for the old
session_id; providers keep their state but the rotated-out session_id
is done.
6 regression tests covering both-hooks-fire, no-memory-manager,
no-context-engine, both failure-tolerant paths.
Closes#22394.
- Restore allowed_chats gate before thread_id check so ignored_threads
applies universally (even to guest mentions).
- Compute _message_mentions_bot once in _should_process_message to
eliminate redundant second entity scan when guest_mode=true and the
message does not mention the bot.
- Remove redundant _is_group_chat from _is_guest_mention (caller already
verified the message is a group chat).
- Update _telegram_allowed_chats docstring to note guest_mode exception.
- Add test coverage: bot_command entity, text_mention entity,
caption_entities, and ignored_threads + guest_mode interaction.
- Add nik1t7n to AUTHOR_MAP.
The original PR placed 'pwd = pytest.importorskip("pwd")' on line 4
but 'import pytest' on line 9 — NameError on module load. Same for
test_file_sync_back.py. Plus, the in-function 'pwd = pytest.importorskip'
calls in test_auto_detected_root_is_rejected confused Python's scope
analysis (later 'import pytest' made pytest local everywhere in the
function) and caused UnboundLocalError. Drop the now-redundant
in-function importorskip calls and rely on the module-level guard.
The github-pr-workflow skill wraps the URL in double-quotes
('curl -H ... "https://api.github.com/..."'), which the original
allowlist regex (\s+https://api...) did not match. Without this,
the bundled github-pr-workflow skill is still blocked at every
cron tick despite #22605's fix landing for the bare-URL form.
Make the leading quote optional and add a regression test pinning
both single- and double-quoted forms.
Adds 'codex' to the _MCP_PRESETS registry so users can add it via
Connecting to 'codex'...
✓ Connected! Found 2 tool(s) from 'codex':
codex Run a Codex session. Accepts configuration parameters matchi...
codex-reply Continue a Codex conversation by providing the thread id and...
Enable all 2 tools? [Y/n/select]:
Cancelled. without manually specifying
the command and args.
Enables: codex mcp-server → Hermes native MCP client → Codex tools
available as first-class Hermes tools.
Problem:
After `hermes profile use NAME`, the gateway (started via systemd with
HERMES_HOME=/root/.hermes hardcoded) ignores the active profile and
always runs as the Default profile. WebUI, Telegram, and all non-CLI
platforms are affected.
Root cause:
_apply_profile_override() contained an early-return guard:
if profile_name is None and os.environ.get("HERMES_HOME"):
return # trust the inherited value
The intent was to let child processes inherit their parent's profile via
HERMES_HOME without redundantly re-reading active_profile. But
systemd also sets HERMES_HOME — to the hermes root (/root/.hermes),
not a profile directory — so the guard fired and silently skipped the
active_profile check. The user's `hermes profile use NAME` write to
~/.hermes/active_profile was never seen by the gateway process.
Fix:
Only skip the active_profile check when HERMES_HOME is already a
profile directory, identified by its immediate parent directory being
named "profiles" (e.g. ~/.hermes/profiles/coder or
/opt/data/profiles/coder). When HERMES_HOME points to a root
directory (parent name != "profiles"), continue to read active_profile.
Tests:
- test_hermes_home_at_root_with_active_profile_is_redirected: the
bug scenario — HERMES_HOME=/root/.hermes + active_profile=coder →
HERMES_HOME must be redirected to .../profiles/coder.
Stash-verified: FAILS without fix, PASSES with fix.
- test_hermes_home_already_profile_dir_is_trusted: child-process
inheritance contract unchanged — .../profiles/coder is trusted as-is.
- test_hermes_home_unset_reads_active_profile: classic path unchanged.
- test_hermes_home_unset_default_profile_no_redirect: "default" still
produces no redirect.
4/4 tests green.
Closes#22502.
When a Telegram user replies using the native quote feature to select
only part of a prior message, _build_message_event was injecting the
ENTIRE replied-to message into reply_to_text via
message.reply_to_message.text/caption. python-telegram-bot exposes
the user-selected substring as message.quote (TextQuote.text); we now
prefer that and fall back to the full replied-to text only when no
native quote is present.
The agent-visible "[Replying to: \"...\"]" prefix can otherwise expand
the user's narrow quote into the full prior message, causing the agent
to act on unrelated actionable-looking text the user did not select
(e.g. multi-item briefings where the user quotes one bullet but the
prefix injects every bullet). Falls back cleanly when message.quote
is absent (PTB <21 or replies that don't quote a substring).
Fixes#22619
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve git via shutil.which with POSIX and Git-for-Windows fallbacks before clone and pull so Dashboard/API installs do not misreport Git as missing.
Add regression tests for the resolver and pull subprocess invocation.
When platform_toolsets[<platform>] contains both a composite (e.g.
hermes-cli) and at least one configurable opt-in (e.g. spotify), the
has_explicit_config branch in _get_platform_tools silently dropped the
composite, leaving sessions with only the configurable + plugin tools
and no native tools (terminal, file, web, browser, memory, etc.).
Mirror the else-branch's subset inference for composites that sit
alongside the configurables, but apply _DEFAULT_OFF_TOOLSETS only to the
implicit expansion so user-listed default-off toolsets (spotify,
discord) survive.
The delegate_task tool description hardcoded 'default 3' / 'default 2' for
max_concurrent_children / max_spawn_depth, which misled the model on any
install that raised these limits — the schema text said 'default 3' even
when the user had set max_concurrent_children=15 / max_spawn_depth=3, so
the model would self-cap at 3 and never use the headroom.
Make the description dynamic. ToolEntry gains an optional
dynamic_schema_overrides callable; registry.get_definitions() merges its
output on top of the static schema before returning it. delegate_tool
registers a builder that reads the current delegation.* config and emits:
- 'up to N items concurrently for this user' (N = max_concurrent_children)
- 'Nested delegation IS enabled / OFF for this user (max_spawn_depth=N)'
- 'orchestrator children can themselves delegate up to M more level(s)'
- 'orchestrator_enabled=false' when the kill switch is set
The model_tools cache key already includes config.yaml mtime+size, so
edits to delegation.* in config invalidate the cached tool definitions
without an explicit hook. CLI_CONFIG staleness within a process is a
pre-existing limitation of _load_config and out of scope here.
Static description / tasks.description / role.description in
DELEGATE_TASK_SCHEMA are placeholders so module import doesn't trigger
cli.CLI_CONFIG load before the test conftest can redirect HERMES_HOME.
Enforce the parent-completion invariant at claim_task (the single
ready->running chokepoint) and re-gate unblock_task so blocked->ready
only fires when parents are done. Prevents child tasks from running
ahead of in-progress parents under the create-then-link race.
Also adds a stress test that races concurrent create+link against
hammered claim_task and asserts no child runs while any parent is undone.
Ref: kanban/boards/cookai/workspaces/t_a6acd07d/root-cause.md
Refs: t_8d6af9d6
Plugin authors had no easy way to figure out why their plugin wasn't
loading — failures were buried in agent.log at WARNING and skip reasons
(disabled, not enabled, depth cap, exclusive) were DEBUG-only and
invisible by default.
Set HERMES_PLUGINS_DEBUG=1 to attach a stderr handler at DEBUG to the
hermes_cli.plugins logger only. Surfaces:
- which directories were scanned + manifest counts per source
- per manifest: resolved key, name, kind, source, on-disk path
- skip reasons (disabled, not enabled, exclusive, depth cap, no register)
- per load: tools/hooks/slash/CLI commands the plugin registered
- full traceback on YAML parse failure (exc_info on the existing warning)
- full traceback on register() exceptions, pointing at the plugin author's line
Env var off (default) → zero new stderr output, same as before.
Touches only hermes_cli/plugins.py + a doc section in the plugin-build
guide + an entry in the env-vars reference. 3 new tests lock the
attach/idempotent/no-attach behavior.
Plugin discovery imports every bundled platform plugin at model_tools
import time. The google_chat adapter unconditionally pulled in
google.cloud.pubsub_v1, googleapiclient, grpc, httplib2, and friends at
module top — about 33 MB RSS and 110 ms wall on every CLI invocation,
even ones that never construct a gateway adapter.
Wrap the heavy imports in _load_google_modules(): an idempotent loader
that rebinds the module-level globals (pubsub_v1, service_account,
HttpError, MediaFileUpload, …) on first call and is invoked from
GoogleChatAdapter.__init__, connect(), and check_google_chat_requirements().
The HttpError = Exception placeholder is preserved for the brief window
before the loader runs, so 'except HttpError as exc:' clauses stay
correct (Python looks up the name at try/except evaluation time, not
at function definition time).
Measured impact on a 9950X3D, 7-run medians:
import cli: 895 → 787 ms (-108 ms / -12%)
133 → 110 MB ( -23 MB / -17%)
import model_tools: 491 → 400 ms ( -91 ms / -19%)
95 → 66 MB ( -29 MB / -31%)
google_chat alone: 244 → 132 ms (-112 ms / -46%)
83 → 50 MB ( -33 MB / -40%)
hermes chat -q (cold): 177 → 145 MB ( -32 MB / -18%)
Real-world win lands on every path that imports cli.py: hermes chat,
hermes gateway, cron jobs, batch runs, subagents. Long-lived gateway
processes save ~30 MB resident.
All 157 google_chat tests pass; full gateway suite (5050 tests) green.
Problem:
unlink_tasks() removes a parent→child dependency edge but does not trigger
recompute_ready(). A child whose last blocking parent is unlinked stays
stuck in 'todo' indefinitely — it only promotes to 'ready' on the next
dispatcher tick or a manual 'hermes kanban recompute'. For CLI-only users
without a dispatcher, the child is permanently stuck.
Root cause:
complete_task() and unblock_task() both call recompute_ready() after their
write transaction so downstream children are evaluated immediately.
unlink_tasks() was missing this call — removing a dependency is
semantically equivalent to completing one, so the same recompute is needed.
Fix:
Capture the rowcount result before the write_txn exits, then call
recompute_ready(conn) outside the transaction when a row was actually
deleted (so the child sees the updated task_links state).
Tests:
Added test_unlink_tasks_triggers_recompute_ready in
tests/hermes_cli/test_kanban_db.py: creates parent A (done) + parent C
(running), child B with both parents (todo), unlinks C→B, asserts B is
ready immediately. Stash-verified: FAILS without fix (child stays todo),
PASSES with fix.
62/62 tests green in tests/hermes_cli/test_kanban_db.py.
Closes#22459.
/clear, /new, /reset, and /undo now ask the user to confirm before
discarding conversation state — three-option prompt routed through the
existing tools.slash_confirm primitive.
Native yes/no buttons render on Telegram, Discord, and Slack (their
adapters already implement send_slash_confirm); other platforms get a
text-fallback prompt and reply with /approve, /always, or /cancel.
The classic prompt_toolkit CLI uses the same three-option flow via the
established _prompt_text_input pattern (see _confirm_and_reload_mcp).
TUI keeps its existing modal overlay (#12312).
Gated by new config key approvals.destructive_slash_confirm (default
true). Picking 'Always Approve' flips the gate to false so subsequent
destructive commands run silently — matches the established
mcp_reload_confirm UX.
Out of scope: /cron remove (separate domain — scheduled jobs, not
session history). Existing TUI overlay env-var (HERMES_TUI_NO_CONFIRM)
left unchanged; cosmetic unification can come later.
Closes#4069.
When a GFM table has a row-label column (first column with no header),
_render_table_block_for_telegram incorrectly included the row-label cell
in the bullet zip alongside the data cells, producing a spurious bullet
like '• 維度: 核心賣點' before the real data rows.
Detect the row-label column by comparing the first data row cell count
against the header count (has_row_label_col = len(first_data_row) ==
len(headers) + 1). When present, use cells[0] as the heading and
zip headers against cells[1:] only, correctly excluding the row-label
from the bullet list.
Fixes#22604
Restate the trust model from first principles: the OS is the only
load-bearing boundary against an adversarial LLM. Distinguish
terminal-backend isolation (sandboxes the shell tool) from
whole-process wrapping (sandboxes the agent itself, reference
deployment NVIDIA OpenShell). Name in-process components (approval
gate, output redaction, Skills Guard) as heuristics, and the class
of reports that defeat them as out of scope under this policy —
while explicitly welcoming them as regular issues or PRs.
Introduce 'agent-loaded content' as the narrow, honest commitment:
attacker-influenced input must not chain into a write the agent
later loads on its own initiative.
Strip implementation-detail enumerations (backend names, adapter
names, config keys, env vars, internal symbols) so the doc stays
evergreen as code evolves.
the esbuild pipeline (scripts/build.mjs) already bundles ink into a
single self-contained dist/entry.js.
remove the Dockerfile steps that manually copied packages/hermes-ink
into node_modules/@hermes/ink and ran a nested
npm install there.
- Dockerfile: simplify TUI build step to just 'npm run build'
- hermes_cli/main.py: _tui_build_needed now checks dist/entry.js
staleness against source files before falling back to the old
ink-bundle.js logic
- tests: update TUI npm install tests and drop the Dockerfile contract
test for the removed ink materialization step
Replace the tsc + babel pipeline with a single esbuild invocation that
produces a self-contained dist/entry.js. The nix TUI derivation no
longer copies node_modules — only dist/ + package.json ship, shrinking
the output from hundreds of MB to ~2.9 MB.
- ui-tui/scripts/build.mjs: new esbuild bundler. Aliases @hermes/ink
to source (esbuild's __esm helper doesn't await nested async init,
which breaks lazy-assigned exports like 'render' when re-exporting
through a prebuilt submodule). Stubs react-devtools-core (dev-only).
Injects a createRequire shim for transitive CJS deps. Strips the
shebang from src/entry.tsx because Nix patchShebangs mangles
'/usr/bin/env -S node --max-old-space-size=8192 --expose-gc' — it
drops the 'node' token. The Python launcher always invokes node
explicitly, so the shebang is redundant.
- nix/tui.nix: installPhase no longer copies node_modules or the
@hermes/ink packages dir.
- nix/checks.nix: drop the 'node_modules present' assertion.
- hermes_cli/main.py: _tui_need_npm_install short-circuits when
dist/entry.js exists and no package-lock.json is present. That is
the prebuilt-bundle layout (nix / packaged release) and there is
nothing to install. Without this, the launcher tried to npm install
in a non-existent site-packages/ui-tui path.
2026-04-30 15:38:50 -04:00
877 changed files with 101077 additions and 10217 deletions
@@ -49,6 +49,24 @@ If your skill is specialized, community-contributed, or niche, it's better suite
---
## Memory Providers: Ship as a Standalone Plugin
**We are no longer accepting new memory providers into this repo.** The set of built-in providers under `plugins/memory/` (honcho, mem0, supermemory, byterover, hindsight, holographic, openviking, retaindb) is closed. If you want to add a new memory backend, publish it as a **standalone plugin repo** that users install into `~/.hermes/plugins/` (or via a pip entry point).
Standalone memory plugins:
- Implement the same `MemoryProvider` ABC (`agent/memory_provider.py`) — `sync_turn`, `prefetch`, `shutdown`, and optionally `post_setup(hermes_home, config)` for setup-wizard integration
- Use the same discovery system — `discover_memory_providers()` picks them up from user/project plugin directories and pip entry points
- Integrate with `hermes memory setup` via `post_setup()` — no need to touch core code
- Can register their own CLI subcommands via `register_cli(subparser)` in a `cli.py` file
- Get all the same lifecycle hooks and config plumbing as in-tree providers
PRs that add a new directory under `plugins/memory/` will be closed with a pointer to publish the provider as its own repo. Existing in-tree providers stay; bug fixes to them are welcome.
This isn't a quality bar — it's a coupling-and-maintenance decision. Memory providers are the most common plugin type and they shouldn't all live in this tree.
---
## Development Setup
### Prerequisites
@@ -461,6 +479,58 @@ Gateway and messaging sessions never collect secrets in-band; they instruct the
See `skills/gifs/gif-search/` and `skills/email/himalaya/` for examples.
### Skill authoring standards (HARDLINE)
Every new or modernized skill — bundled, optional, or contributed — must meet these standards before merge. Reviewers reject PRs that violate them.
1.**`description` ≤ 60 characters, one sentence, ends with a period.** Long descriptions bloat the skill listing UI and dilute the model's attention when many skills are loaded. State the capability, not the implementation. No marketing words ("powerful", "comprehensive", "seamless", "advanced"). Don't repeat the skill name. Verify with:
Good: `Search arXiv papers by keyword, author, category, or ID.`
Bad: `A powerful and comprehensive skill that allows the agent to search arXiv for relevant academic papers using various criteria including keywords, authors, and categories.`
2. **Tools referenced in SKILL.md prose must be native Hermes tools or MCP servers the skill explicitly expects.** When the skill needs a capability, point at the proper tool by name in backticks: `` `terminal` ``, `` `web_extract` ``, `` `web_search` ``, `` `read_file` ``, `` `write_file` ``, `` `patch` ``, `` `search_files` ``, `` `vision_analyze` ``, `` `browser_navigate` ``, `` `delegate_task` ``, `` `image_generate` ``, `` `text_to_speech` ``, `` `cronjob` ``, `` `memory` ``, `` `skill_view` ``, `` `todo` ``, `` `execute_code` ``.
Do NOT name shell utilities the agent already has wrapped:
If the skill depends on an MCP server, name the MCP server and document its setup in `## Prerequisites`. Third-party CLIs (e.g. `ffmpeg`, `gh`, a specific SDK) are fine to invoke from inside script files, but the prose should frame the interaction as "invoke through the `terminal` tool", not as a manual shell session.
3. **`platforms:` gating audited against actual script imports.** Skills that use POSIX-only primitives (`fcntl`, `termios`, `os.setsid`, `os.kill(pid, 0)` for liveness, `/proc`, hardcoded `/tmp` paths, `signal.SIGKILL`, bash heredocs, `osascript`, `apt`, `systemctl`) must declare their supported platforms via the `platforms:` frontmatter. Default posture is to fix it cross-platform first — `tempfile.gettempdir()`, `pathlib.Path`, `psutil.pid_exists()`, Python-level filtering instead of `grep`. Gate to a narrower set only when the dependency is genuinely platform-bound (e.g. `osascript` is macOS-only, `/proc` is Linux-only).
4. **`author` credits the human contributor first.** For external contributions, the contributor's real name + GitHub handle goes first (`Jane Doe (jane-doe)`); "Hermes Agent" is the secondary collaborator. If the contributor's commit shows "Hermes Agent" as author because they used Hermes to draft the skill, replace it with their actual name — credit the human, not the tool.
5. **SKILL.md body uses the modern section order.** `# <Skill> Skill` title, 2-3 sentence intro stating what it does and what it doesn't do, then:
- `## Procedure` — numbered steps with copy-paste commands
- `## Pitfalls` — known limits, rate limits, things that look broken but aren't
- `## Verification` — single command that proves the skill works
Target ~200 lines for a complex skill, ~100 lines for a simple one. Cut redundant intro fluff, marketing prose, and re-explanations of env vars already documented in `## Prerequisites`.
6. **Scripts go in `scripts/`, references in `references/`, templates in `templates/`.** Don't expect the model to inline-write parsers, XML walkers, or non-trivial logic every call — ship a helper script. Reference scripts from SKILL.md by path relative to the skill directory.
7. **Tests live at `tests/skills/test_<skill>_skill.py`** and use only stdlib + pytest + `unittest.mock`. No live network calls. Run via `scripts/run_tests.sh tests/skills/test_<skill>_skill.py -q`. Must pass under the hermetic CI env (no API keys leaking through). Use `monkeypatch` and `tmp_path` for any env-var or filesystem dependencies.
8. **`.env.example` additions are isolated to a clearly delimited block.** Don't touch the surrounding file — contributor-supplied `.env.example` versions are usually stale, and edits outside the skill's own block will be dropped during salvage. Comment all values with `#` (it's documentation, not live config).
**The self-improving AI agent built by [Nous Research](https://nousresearch.com).** It's the only agent with a built-in learning loop — it creates skills from experience, improves them during use, nudges itself to persist knowledge, searches its own past conversations, and builds a deepening model of who you are across sessions. Run it on a $5 VPS, a GPU cluster, or serverless infrastructure that costs nearly nothing when idle. It's not tied to your laptop — talk to it from Telegram while it works on a cloud VM.
Use any model you want — [Nous Portal](https://portal.nousresearch.com), [OpenRouter](https://openrouter.ai) (200+ models), [NVIDIA NIM](https://build.nvidia.com) (Nemotron), [Xiaomi MiMo](https://platform.xiaomimimo.com), [z.ai/GLM](https://z.ai), [Kimi/Moonshot](https://platform.moonshot.ai), [MiniMax](https://www.minimax.io), [Hugging Face](https://huggingface.co), OpenAI, or your own endpoint. Switch with `hermes model` — no code changes, no lock-in.
Use any model you want — [Nous Portal](https://portal.nousresearch.com), [OpenRouter](https://openrouter.ai) (200+ models), [NovitaAI](https://novita.ai) (AI-native cloud for Model API, Agent Sandbox, and GPU Cloud), [NVIDIA NIM](https://build.nvidia.com) (Nemotron), [Xiaomi MiMo](https://platform.xiaomimimo.com), [z.ai/GLM](https://z.ai), [Kimi/Moonshot](https://platform.moonshot.ai), [MiniMax](https://www.minimax.io), [Hugging Face](https://huggingface.co), OpenAI, or your own endpoint. Switch with `hermes model` — no code changes, no lock-in.
<table>
<tr><td><b>A real terminal interface</b></td><td>Full TUI with multiline editing, slash-command autocomplete, conversation history, interrupt-and-redirect, and streaming tool output.</td></tr>
This document outlines the security protocols, trust model, and deployment hardening guidelines for the **Hermes Agent** project.
This document describes Hermes Agent's trust model, names the one
security boundary the project treats as load-bearing, and defines the
scope for vulnerability reports.
## 1. Vulnerability Reporting
## 1. Reporting a Vulnerability
Hermes Agent does **not** operate a bug bounty program. Security issues should be reported via [GitHub Security Advisories (GHSA)](https://github.com/NousResearch/hermes-agent/security/advisories/new) or by emailing **security@nousresearch.com**. Do not open public issues for security vulnerabilities.
Report privately via [GitHub Security Advisories](https://github.com/NousResearch/hermes-agent/security/advisories/new)
or **security@nousresearch.com**. Do not open public issues for
security vulnerabilities. **Hermes Agent does not operate a bug
bounty program.**
### Required Submission Details
- **Title & Severity:** Concise description and CVSS score/rating.
-**Affected Component:** Exact file path and line range (e.g., `tools/approval.py:120-145`).
-**Environment:** Output of `hermes version`, commit SHA, OS, and Python version.
- **Reproduction:** Step-by-step Proof-of-Concept (PoC) against `main` or the latest release.
-**Impact:** Explanation of what trust boundary was crossed.
A useful report includes:
-A concise description and severity assessment.
-The affected component, identified by file path and line range
- A reproduction against `main` or the latest release.
- A statement of which trust boundary in §2 is crossed.
Please read §2 and §3 before submitting. Reports that demonstrate
limits of an in-process heuristic this policy does not treat as a
boundary will be closed as out-of-scope under §3 — but see §3.2:
they are still welcome as regular issues or pull requests, just not
through the private security channel.
---
## 2. Trust Model
The core assumption is that Hermes is a **personal agent** with one trusted operator.
Hermes Agent is a single-tenant personal agent. Its posture is
layered, and the layers are not equally load-bearing. Reporters and
operators should reason about them in the same terms.
### Operator & Session Trust
- **Single Tenant:** The system protects the operator from LLM actions, not from malicious co-tenants. Multi-user isolation must happen at the OS/host level.
- **Gateway Security:** Authorized callers (Telegram, Discord, Slack, etc.) receive equal trust. Session keys are used for routing, not as authorization boundaries.
- **Execution:** Defaults to `terminal.backend: local` (direct host execution). Container isolation (Docker, Modal, Daytona) is opt-in for sandboxing.
### 2.1 Definitions
### Dangerous Command Approval
The approval system (`tools/approval.py`) is a core security boundary. Terminal commands, file operations, and other potentially destructive actions are gated behind explicit user confirmation before execution. The approval mode is configurable via `approvals.mode` in `config.yaml`:
-`"on"` (default) — prompts the user to approve dangerous commands.
-`"auto"` — auto-approves after a configurable delay.
-`"off"` — disables the gate entirely (break-glass; see Section 3).
- **Agent process.** The Python interpreter running Hermes Agent,
including any Python modules it has loaded (skills, plugins,
hook handlers).
-**Terminal backend.** A pluggable execution target for the
`terminal()` tool. The default runs commands directly on the host.
Other backends run commands inside a container, cloud sandbox, or
remote host.
- **Input surface.** Any channel through which content enters the
agent's context: operator input, web fetches, email, gateway
messages, file reads, MCP server responses, tool results.
- **Trust envelope.** The set of resources an operator has implicitly
granted Hermes Agent access to by running it — typically, whatever
the operator's own user account can reach on the host.
- **Stance.** An explicit statement in Hermes Agent's documentation
or code about how a consuming layer (adapter, UI, file writer,
shell) should treat agent output — e.g. "the dashboard renders
agent output as inert HTML."
### Output Redaction
`agent/redact.py` strips secret-like patterns (API keys, tokens, credentials) from all display output before it reaches the terminal or gateway platform. This prevents accidental credential leakage in chat logs, tool previews, and response text. Redaction operates on the display layer only — underlying values remain intact for internal agent operations.
### 2.2 The Boundary: OS-Level Isolation
### Skills vs. MCP Servers
- **Installed Skills:** High trust. Equivalent to local host code; skills can read environment variables and run arbitrary commands.
- **MCP Servers:** Lower trust. MCP subprocesses receive a filtered environment (`_build_safe_env()` in `tools/mcp_tool.py`) — only safe baseline variables (`PATH`, `HOME`, `XDG_*`) plus variables explicitly declared in the server's `env` config block are passed through. Host credentials are stripped by default. Additionally, packages invoked via `npx`/`uvx` are checked against the OSV malware database before spawning.
**The only security boundary against an adversarial LLM is the
operating system.** Nothing inside the agent process constitutes
containment — not the approval gate, not output redaction, not any
pattern scanner, not any tool allowlist. Any in-process component
that screens LLM output is a heuristic operating on an
attacker-influenced string, and this policy treats it as such.
### Code Execution Sandbox
The `execute_code` tool (`tools/code_execution_tool.py`) runs LLM-generated Python scripts in a child process with API keys and tokens stripped from the environment to prevent credential exfiltration. Only environment variables explicitly declared by loaded skills (via `env_passthrough`) or by the user in `config.yaml` (`terminal.env_passthrough`) are passed through. The child accesses Hermes tools via RPC, not direct API calls.
Hermes Agent supports two OS-level isolation postures. They address
different threats and an operator should choose deliberately.
### Subagents
- **No recursive delegation:** The `delegate_task` tool is disabled for child agents.
- **Depth limit:** `MAX_DEPTH = 2` — parent (depth 0) can spawn a child (depth 1); grandchildren are rejected.
- **Memory isolation:** Subagents run with `skip_memory=True` and do not have access to the parent's persistent memory provider. The parent receives only the task prompt and final response as an observation.
#### Terminal-backend isolation
A non-default terminal backend runs LLM-emitted shell commands
inside a container, remote host, or cloud sandbox. The file tools
(`read_file`, `write_file`, `patch`) also run through this backend,
since they are implemented on top of the shell contract — they
cannot reach paths the backend doesn't expose.
What this confines: anything the agent does by issuing shell or
file operations. What this does **not** confine: everything the
agent does in its own Python process. That includes the
code-execution tool (spawned as a host subprocess), MCP subprocesses
(spawned from the agent's environment), plugin loading, hook
dispatch, and skill loading (all imported into the agent
interpreter).
Terminal-backend isolation is the right posture when the concern is
LLM-emitted destructive shell or unwanted file-tool writes, and the
operator is otherwise trusted.
#### Whole-process wrapping
Whole-process wrapping runs the entire agent process tree inside a
sandbox. Every code path — shell, code-execution, MCP, file tools,
plugins, hooks, skill loading — is subject to the same filesystem,
network, process, and (where applicable) inference policy.
Hermes Agent supports this in two ways:
- **Hermes Agent's own Docker image and Compose setup.** Lighter-
weight; the agent runs in a standard container with operator-
- **Network-exposed HTTP surfaces.** The API server adapter, the
dashboard plugin, the kanban plugin's HTTP endpoints, and any
other plugin that binds a listening socket.
- **Editor / IDE adapters.** The ACP adapter (`acp_adapter/`) and
equivalent integrations that accept requests from a local client
process.
- **The TUI gateway (`tui_gateway/`).** JSON-RPC backend for the
Ink terminal UI, reached over local IPC.
**Uniform rules:**
1. **Authorization is required at every surface that crosses a
trust boundary.** For messaging and network HTTP surfaces, the
boundary is the network: authorization means an operator-
configured caller allowlist. For editor and local-IPC surfaces
(ACP, TUI gateway), the boundary is the host's user account:
authorization means relying on OS-level access control (file
permissions, loopback-only binds) and not exposing the surface
beyond the local user without an explicit network auth layer.
2. **An allowlist is required for every enabled network-exposed
adapter.** Adapters must refuse to dispatch agent work, resolve
approvals, or relay output until an allowlist is set. Code paths
that fail open when no allowlist is configured are code bugs in
scope under §3.1.
3. **Session identifiers are routing handles, not authorization
boundaries.** Knowing another caller's session ID does not grant
access to their approvals or output; authorization is always
re-checked against the allowlist (or OS-level equivalent).
4.**Within the authorized set, all callers are equally trusted.**
Hermes Agent does not model per-caller capabilities inside a
single adapter. Operators who need capability separation should
run separate agent instances with separate allowlists.
5. **Binding a local-only surface to a non-loopback interface is a
break-glass operator decision (§3.2).** The dashboard and other
plugin HTTP servers default to loopback; exposing them via
`--host 0.0.0.0` or equivalent makes public-exposure hardening
(§4) the operator's responsibility.
---
## 3. Out of Scope (Non-Vulnerabilities)
## 3. Scope
The following scenarios are **not** considered security breaches:
- **Prompt Injection:** Unless it results in a concrete bypass of the approval system, toolset restrictions, or container sandbox.
-**Public Exposure:** Deploying the gateway to the public internet without external authentication or network protection.
- **Trusted State Access:** Reports that require pre-existing write access to `~/.hermes/`, `.env`, or `config.yaml` (these are operator-owned files).
- **Default Behavior:** Host-level command execution when `terminal.backend` is set to `local` — this is the documented default, not a vulnerability.
-**Configuration Trade-offs:** Intentional break-glass settings such as `approvals.mode: "off"` or `terminal.backend: local` in production.
- **Tool-level read/access restrictions:** The agent has unrestricted shell access via the `terminal` tool by design. Reports that a specific tool (e.g., `read_file`) can access a resource are not vulnerabilities if the same access is available through `terminal`. Tool-level deny lists only constitute a meaningful security boundary when paired with equivalent restrictions on the terminal side (as with write operations, where `WRITE_DENIED_PATHS` is paired with the dangerous command approval system).
### 3.1 In Scope
-Escape from a declared OS-level isolation posture (§2.2): an
attacker-controlled code path reaching state that the posture
claimed to confine.
-Unauthorized external-surface access: a caller outside the
configured authorization set (allowlist, or OS-level equivalent
for local-IPC surfaces) dispatching work, receiving output, or
resolving approvals (§2.6).
- Credential exfiltration: leakage of operator credentials or
session authorization material to a destination outside the
trust envelope, via a mechanism that should have prevented it
(environment scrubbing bug, adapter logging, transport error
that flushes credentials to an upstream, etc.).
- Trust-model documentation violations: code behaving contrary to
what this policy, Hermes Agent's own documentation, or reasonable
operator expectations would predict — including cases where
Hermes Agent has documented a stance about how its output should
be rendered by a consuming layer (dashboard, gateway adapter,
file writer, shell) and a code path breaks that stance.
### 3.2 Out of Scope
"Out of scope" here means "not a security vulnerability under this
policy." It does not mean "not worth reporting." Improvements to the
in-process heuristics, hardening ideas, and UX fixes are welcome as
regular issues or pull requests — the approval gate can always catch
more patterns, redaction can always get smarter, adapter behavior
can always be tightened. These items just don't go through the
private-disclosure channel and don't receive advisories.
- **Bypasses of in-process heuristics (§2.4)** — approval-gate regex
bypasses, redaction bypasses, Skills Guard pattern bypasses, and
analogous reports against future heuristics. These components are
not boundaries; defeating them is not a vulnerability under this
policy.
- **Prompt injection per se.** Getting the LLM to emit unusual
output — via injected content, hallucination, training artifacts,
or any other cause — is not itself a vulnerability. "I achieved
prompt injection" without a chained §3.1 outcome is not an
actionable report under this policy.
- **Consequences of a chosen isolation posture.** Reports that a
code path operating within its posture's scope can do what that
posture permits are not vulnerabilities. Examples: shell or file
tools reaching host state under the local backend; code-execution
or MCP subprocesses reaching host state under terminal-backend
isolation that only sandboxes shell; reports whose preconditions
require pre-existing write access to operator-owned configuration
or credential files (those are already inside the trust envelope).
that explicitly disable protections: `--insecure` and equivalent
flags on the dashboard or other components, disabled approvals,
local backend in production, development profiles that bypass
hermes-home security, and similar. Reports against those
configurations are not vulnerabilities — that's the flag's job.
- **Community-contributed skills and plugins.** Third-party skills
(including the community skills repository) and third-party
plugins are in the operator's review surface, not Hermes Agent's
trust surface (§2.4, §2.5). A skill or plugin doing something
malicious is the expected failure mode of one that wasn't
reviewed, not a vulnerability in Hermes Agent. Bugs in Hermes
Agent's skill-install or plugin-install path that prevent the
operator from seeing what they're installing are in scope under
§3.1.
- **Public exposure without external controls.** Exposing the
gateway or API to the public internet without authentication,
VPN, or firewall.
- **Tool-level read/write restrictions on a posture where shell is
permitted.** If a path is reachable via the terminal tool, reports
that other file tools can reach it add nothing.
---
## 4. Deployment Hardening & Best Practices
## 4. Deployment Hardening
### Filesystem & Network
- **Production sandboxing:** Use container backends (`docker`, `modal`, `daytona`) instead of `local` for untrusted workloads.
- **File permissions:** Run as non-root (the Docker image uses UID 10000); protect credentials with `chmod 600 ~/.hermes/.env` on local installs.
- **Network exposure:** Do not expose the gateway or API server to the public internet without VPN, Tailscale, or firewall protection. SSRF protection is enabled by default across all gateway platform adapters (Telegram, Discord, Slack, Matrix, Mattermost, etc.) with redirect validation. Note: the local terminal backend does not apply SSRF filtering, as it operates within the trusted operator's environment.
The single most important hardening decision is matching isolation
(§2.2) to the trust of the content the agent will ingest. Beyond
that:
### Skills & Supply Chain
- **Skill installation:** Review Skills Guard reports (`tools/skills_guard.py`) before installing third-party skills. The audit log at `~/.hermes/skills/.hub/audit.log` tracks every install and removal.
-**MCP safety:** OSV malware checking runs automatically for `npx`/`uvx` packages before MCP server processes are spawned.
- **CI/CD:** GitHub Actions are pinned to full commit SHAs. The `supply-chain-audit.yml` workflow blocks PRs containing `.pth` files or suspicious `base64`+`exec` patterns.
### Credential Storage
-API keys and tokens belong exclusively in `~/.hermes/.env` — never in `config.yaml` or checked into version control.
- The credential pool system (`agent/credential_pool.py`) handles key rotation and fallback. Credentials are resolved from environment variables, not stored in plaintext databases.
- Run the agent as a non-root user. The supplied container image
does this by default.
-Keep credentials in the operator credential file with tight
permissions, never in the main config, never in version control.
Under OpenShell, use the Provider store rather than an on-disk
credential file.
-Do not expose the gateway or API to the public internet without
VPN, Tailscale, or firewall protection. Under OpenShell, use the
network policy layer to restrict egress.
- Configure a caller allowlist for every network-exposed adapter
you enable (§2.6).
- Review third-party skills and plugins before install (§2.4,
§2.5). For skills, this means reading the Python and scripts,
not just SKILL.md. Skills Guard reports and the install audit
log are the review surface.
- Hermes Agent includes supply-chain guards for MCP server
launches and for dependency / bundled-package changes in CI; see
`CONTRIBUTING.md` for specifics.
---
## 5. Disclosure Process
## 5. Disclosure
- **Coordinated Disclosure:** 90-day window or until a fix is released, whichever comes first.
- **Communication:** All updates occur via the GHSA thread or email correspondence with security@nousresearch.com.
- **Credits:** Reporters are credited in release notes unless anonymity is requested.
- **Coordinated disclosure window:** 90days from report, or until a
fix is released, whichever comes first.
- **Channel:** the GHSA thread or email correspondence with
security@nousresearch.com.
- **Credit:** reporters are credited in release notes unless
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.