add generic gateway startup readiness checks

feat(discord): register skills under /skill command group with category subcommands (#9909 )
Instead of consuming one top-level slash command slot per skill (hitting the 100-command limit with ~26 built-ins + 74 skills), skills are now organized under a single /skill group command with category-based subcommand groups: /skill creative ascii-art [args] /skill media gif-search [args] /skill mlops axolotl [args] Discord supports 25 subcommand groups × 25 subcommands = 625 max skills, well beyond the previous 74-slot ceiling. Categories are derived from the skill directory structure: - skills/creative/ascii-art/ → category 'creative' - skills/mlops/training/axolotl/ → category 'mlops' (top-level parent) - skills/dogfood/ → uncategorized (direct subcommand) Changes: - hermes_cli/commands.py: add discord_skill_commands_by_category() with category grouping, hub/disabled filtering, Discord limit enforcement - gateway/platforms/discord.py: replace top-level skill registration with _register_skill_group() using app_commands.Group hierarchy - tests: 7 new tests covering group creation, category grouping, uncategorized skills, hub exclusion, deep nesting, empty skills, and handler dispatch Inspired by Discord community suggestion from bottium.
2026-04-15 10:03:23 +10:00 · 2026-04-14 16:27:02 -07:00 · 2026-04-14 16:26:36 -07:00 · 2026-04-14 16:26:01 -07:00 · 2026-04-14 16:11:37 -07:00 · 2026-04-14 16:10:18 -07:00
53 changed files with 2745 additions and 444 deletions
@@ -9,11 +9,14 @@ on:
      - '**/*.py'
      - '.github/workflows/contributor-check.yml'

+permissions:
+  contents: read
+
 jobs:
  check-attribution:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
        with:
          fetch-depth: 0  # Full history needed for git log

@@ -28,20 +28,20 @@ jobs:
      name: github-pages
      url: ${{ steps.deploy.outputs.page_url }}
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4

-      - uses: actions/setup-node@v4
+      - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020  # v4
        with:
          node-version: 20
          cache: npm
          cache-dependency-path: website/package-lock.json

-      - uses: actions/setup-python@v5
+      - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5
        with:
          python-version: '3.11'

      - name: Install PyYAML for skill extraction
-        run: pip install pyyaml httpx
+        run: pip install pyyaml==6.0.2 httpx==0.28.1

      - name: Extract skill metadata for dashboard
        run: python3 website/scripts/extract-skills.py
@@ -73,10 +73,10 @@ jobs:
          echo "hermes-agent.nousresearch.com" > _site/CNAME

      - name: Upload artifact
-        uses: actions/upload-pages-artifact@v3
+        uses: actions/upload-pages-artifact@56afc609e74202658d3ffba0e8f6dda462b719fa  # v3
        with:
          path: _site

      - name: Deploy to GitHub Pages
        id: deploy
-        uses: actions/deploy-pages@v4
+        uses: actions/deploy-pages@d6db90164ac5ed86f2b6aed7e0febac5b3c0c03e  # v4
@@ -23,21 +23,21 @@ jobs:
    timeout-minutes: 60
    steps:
      - name: Checkout code
-        uses: actions/checkout@v4
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
        with:
          submodules: recursive

      - name: Set up QEMU
-        uses: docker/setup-qemu-action@v3
+        uses: docker/setup-qemu-action@c7c53464625b32c7a7e944ae62b3e17d2b600130  # v3

      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
+        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f  # v3

      # Build amd64 only so we can `load` the image for smoke testing.
      # `load: true` cannot export a multi-arch manifest to the local daemon.
      # The multi-arch build follows on push to main / release.
      - name: Build image (amd64, smoke test)
-        uses: docker/build-push-action@v6
+        uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8  # v6
        with:
          context: .
          file: Dockerfile
@@ -56,14 +56,14 @@ jobs:

      - name: Log in to Docker Hub
        if: github.event_name == 'push' && github.ref == 'refs/heads/main' || github.event_name == 'release'
-        uses: docker/login-action@v3
+        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9  # v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Push multi-arch image (main branch)
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
-        uses: docker/build-push-action@v6
+        uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8  # v6
        with:
          context: .
          file: Dockerfile
@@ -75,7 +75,7 @@ jobs:

      - name: Push multi-arch image (release)
        if: github.event_name == 'release'
-        uses: docker/build-push-action@v6
+        uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8  # v6
        with:
          context: .
          file: Dockerfile
@@ -7,13 +7,16 @@ on:
      - '.github/workflows/docs-site-checks.yml'
  workflow_dispatch:

+permissions:
+  contents: read
+
 jobs:
  docs-site-checks:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4

-      - uses: actions/setup-node@v4
+      - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020  # v4
        with:
          node-version: 20
          cache: npm
@@ -23,7 +26,7 @@ jobs:
        run: npm ci
        working-directory: website

-      - uses: actions/setup-python@v5
+      - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5
        with:
          python-version: '3.11'

@@ -14,6 +14,9 @@ on:
      - 'run_agent.py'
      - 'acp_adapter/**'

+permissions:
+  contents: read
+
 concurrency:
  group: nix-${{ github.ref }}
  cancel-in-progress: true
@@ -26,7 +29,7 @@ jobs:
    runs-on: ${{ matrix.os }}
    timeout-minutes: 30
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
      - uses: DeterminateSystems/nix-installer-action@ef8a148080ab6020fd15196c2084a2eea5ff2d25  # v22
      - uses: DeterminateSystems/magic-nix-cache-action@565684385bcd71bad329742eefe8d12f2e765b39  # v13
      - name: Check flake
@@ -20,14 +20,14 @@ jobs:
    if: github.repository == 'NousResearch/hermes-agent'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4

-      - uses: actions/setup-python@v5
+      - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5
        with:
          python-version: '3.11'

      - name: Install dependencies
-        run: pip install httpx pyyaml
+        run: pip install httpx==0.28.1 pyyaml==6.0.2

      - name: Build skills index
        env:
@@ -35,7 +35,7 @@ jobs:
        run: python scripts/build_skills_index.py

      - name: Upload index artifact
-        uses: actions/upload-artifact@v4
+        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02  # v4
        with:
          name: skills-index
          path: website/static/api/skills-index.json
@@ -53,25 +53,25 @@ jobs:
    # Only deploy on schedule or manual trigger (not on every push to the script)
    if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4

-      - uses: actions/download-artifact@v4
+      - uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093  # v4
        with:
          name: skills-index
          path: website/static/api/

-      - uses: actions/setup-node@v4
+      - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020  # v4
        with:
          node-version: 20
          cache: npm
          cache-dependency-path: website/package-lock.json

-      - uses: actions/setup-python@v5
+      - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5
        with:
          python-version: '3.11'

      - name: Install PyYAML for skill extraction
-        run: pip install pyyaml
+        run: pip install pyyaml==6.0.2

      - name: Extract skill metadata for dashboard
        run: python3 website/scripts/extract-skills.py
@@ -92,10 +92,10 @@ jobs:
          echo "hermes-agent.nousresearch.com" > _site/CNAME

      - name: Upload artifact
-        uses: actions/upload-pages-artifact@v3
+        uses: actions/upload-pages-artifact@56afc609e74202658d3ffba0e8f6dda462b719fa  # v3
        with:
          path: _site

      - name: Deploy to GitHub Pages
        id: deploy
-        uses: actions/deploy-pages@v4
+        uses: actions/deploy-pages@d6db90164ac5ed86f2b6aed7e0febac5b3c0c03e  # v4
@@ -14,7 +14,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v4
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
        with:
          fetch-depth: 0

@@ -149,6 +149,62 @@ jobs:
          "
          fi

+          # --- CI/CD workflow files modified ---
+          WORKFLOW_HITS=$(git diff --name-only "$BASE".."$HEAD" | grep -E '\.github/workflows/.*\.ya?ml$' || true)
+          if [ -n "$WORKFLOW_HITS" ]; then
+            FINDINGS="${FINDINGS}
+          ### ⚠️ WARNING: CI/CD workflow files modified
+          Changes to workflow files can alter build pipelines, inject steps, or modify permissions. Verify no unauthorized actions or secrets access were added.
+
+          **Files:**
+          \`\`\`
+          ${WORKFLOW_HITS}
+          \`\`\`
+          "
+          fi
+
+          # --- Dockerfile / container build files modified ---
+          DOCKER_HITS=$(git diff --name-only "$BASE".."$HEAD" | grep -iE '(Dockerfile|\.dockerignore|docker-compose)' || true)
+          if [ -n "$DOCKER_HITS" ]; then
+            FINDINGS="${FINDINGS}
+          ### ⚠️ WARNING: Container build files modified
+          Changes to Dockerfiles or compose files can alter base images, add build steps, or expose ports. Verify base image pins and build commands.
+
+          **Files:**
+          \`\`\`
+          ${DOCKER_HITS}
+          \`\`\`
+          "
+          fi
+
+          # --- Dependency manifest files modified ---
+          DEP_HITS=$(git diff --name-only "$BASE".."$HEAD" | grep -E '(pyproject\.toml|requirements.*\.txt|package\.json|Gemfile|go\.mod|Cargo\.toml)$' || true)
+          if [ -n "$DEP_HITS" ]; then
+            FINDINGS="${FINDINGS}
+          ### ⚠️ WARNING: Dependency manifest files modified
+          Changes to dependency files can introduce new packages or change version pins. Verify all dependency changes are intentional and from trusted sources.
+
+          **Files:**
+          \`\`\`
+          ${DEP_HITS}
+          \`\`\`
+          "
+          fi
+
+          # --- GitHub Actions version unpinning (mutable tags instead of SHAs) ---
+          ACTIONS_UNPIN=$(echo "$DIFF" | grep -n '^\+' | grep 'uses:' | grep -v '#' | grep -E '@v[0-9]' | head -10 || true)
+          if [ -n "$ACTIONS_UNPIN" ]; then
+            FINDINGS="${FINDINGS}
+          ### ⚠️ WARNING: GitHub Actions with mutable version tags
+          Actions should be pinned to full commit SHAs (not \`@v4\`, \`@v5\`). Mutable tags can be retargeted silently if a maintainer account is compromised.
+
+          **Matches:**
+          \`\`\`
+          ${ACTIONS_UNPIN}
+          \`\`\`
+          "
+          fi
+
          # --- Output results ---
          if [ -n "$FINDINGS" ]; then
            echo "found=true" >> "$GITHUB_OUTPUT"
@@ -6,6 +6,9 @@ on:
  pull_request:
    branches: [main]

+permissions:
+  contents: read
+
 # Cancel in-progress runs for the same PR/branch
 concurrency:
  group: tests-${{ github.ref }}
@@ -17,13 +20,13 @@ jobs:
    timeout-minutes: 10
    steps:
      - name: Checkout code
-        uses: actions/checkout@v4
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4

      - name: Install system dependencies
        run: sudo apt-get update && sudo apt-get install -y ripgrep

      - name: Install uv
-        uses: astral-sh/setup-uv@v5
+        uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86  # v5

      - name: Set up Python 3.11
        run: uv python install 3.11
@@ -49,10 +52,10 @@ jobs:
    timeout-minutes: 10
    steps:
      - name: Checkout code
-        uses: actions/checkout@v4
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4

      - name: Install uv
-        uses: astral-sh/setup-uv@v5
+        uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86  # v5

      - name: Set up Python 3.11
        run: uv python install 3.11
@@ -112,6 +112,7 @@ _API_KEY_PROVIDER_AUX_MODELS: Dict[str, str] = {
 # "exotic provider" branch checks this before falling back to the main model.
 _PROVIDER_VISION_MODELS: Dict[str, str] = {
    "xiaomi": "mimo-v2-omni",
+    "zai": "glm-5v-turbo",
 }

 # OpenRouter app attribution headers
@@ -28,7 +28,6 @@ from agent.model_metadata import (
    get_model_context_length,
    estimate_messages_tokens_rough,
 )
-from agent.redact import redact_sensitive_text

 logger = logging.getLogger(__name__)

@@ -271,15 +270,11 @@ class ContextCompressor(ContextEngine):
        Includes tool call arguments and result content (up to
        ``_CONTENT_MAX`` chars per message) so the summarizer can preserve
        specific details like file paths, commands, and outputs.
-
-        All content is redacted before serialization to prevent secrets
-        (API keys, tokens, passwords) from leaking into the summary that
-        gets sent to the auxiliary model and persisted across compactions.
        """
        parts = []
        for msg in turns:
            role = msg.get("role", "unknown")
-            content = redact_sensitive_text(msg.get("content") or "")
+            content = msg.get("content") or ""

            # Tool results: keep enough content for the summarizer
            if role == "tool":
@@ -300,7 +295,7 @@ class ContextCompressor(ContextEngine):
                        if isinstance(tc, dict):
                            fn = tc.get("function", {})
                            name = fn.get("name", "?")
-                            args = redact_sensitive_text(fn.get("arguments", ""))
+                            args = fn.get("arguments", "")
                            # Truncate long arguments but keep enough for context
                            if len(args) > self._TOOL_ARGS_MAX:
                                args = args[:self._TOOL_ARGS_HEAD] + "..."
@@ -358,11 +353,7 @@ class ContextCompressor(ContextEngine):
            "assistant that continues the conversation. "
            "Do NOT respond to any questions or requests in the conversation — "
            "only output the structured summary. "
-            "Do NOT include any preamble, greeting, or prefix. "
-            "NEVER include API keys, tokens, passwords, secrets, credentials, "
-            "or connection strings in the summary — replace any that appear "
-            "with [REDACTED]. Note that the user had credentials present, but "
-            "do not preserve their values."
+            "Do NOT include any preamble, greeting, or prefix."
        )

        # Shared structured template (used by both paths).
@@ -403,7 +394,7 @@ class ContextCompressor(ContextEngine):
 [What remains to be done — framed as context, not instructions]

 ## Critical Context
-[Any specific values, error messages, configuration details, or data that would be lost without explicit preservation. NEVER include API keys, tokens, passwords, or credentials — write [REDACTED] instead.]
+[Any specific values, error messages, configuration details, or data that would be lost without explicit preservation]

 ## Tools & Patterns
 [Which tools were used, how they were used effectively, and any tool-specific discoveries]
@@ -446,7 +437,7 @@ Use this exact structure:
            prompt += f"""

 FOCUS TOPIC: "{focus_topic}"
-The user has requested that this compaction PRIORITISE preserving all information related to the focus topic above. For content related to "{focus_topic}", include full detail — exact values, file paths, command outputs, error messages, and decisions. For content NOT related to the focus topic, summarise more aggressively (brief one-liners or omit if truly irrelevant). The focus topic sections should receive roughly 60-70% of the summary token budget. Even for the focus topic, NEVER preserve API keys, tokens, passwords, or credentials — use [REDACTED]."""
+The user has requested that this compaction PRIORITISE preserving all information related to the focus topic above. For content related to "{focus_topic}", include full detail — exact values, file paths, command outputs, error messages, and decisions. For content NOT related to the focus topic, summarise more aggressively (brief one-liners or omit if truly irrelevant). The focus topic sections should receive roughly 60-70% of the summary token budget."""

        try:
            call_kwargs = {
@@ -469,9 +460,7 @@ The user has requested that this compaction PRIORITISE preserving all informatio
            # Handle cases where content is not a string (e.g., dict from llama.cpp)
            if not isinstance(content, str):
                content = str(content) if content else ""
-            # Redact the summary output as well — the summarizer LLM may
-            # ignore prompt instructions and echo back secrets verbatim.
-            summary = redact_sensitive_text(content.strip())
+            summary = content.strip()
            # Store for iterative updates on next compaction
            self._previous_summary = summary
            self._summary_failure_cooldown_until = 0.0
@@ -8631,6 +8631,24 @@ class HermesCLI:
            self._should_exit = True
            event.app.exit()

+        _modal_prompt_active = Condition(
+            lambda: bool(self._secret_state or self._sudo_state)
+        )
+
+        @kb.add('escape', filter=_modal_prompt_active, eager=True)
+        def handle_escape_modal(event):
+            """ESC cancels active secret/sudo prompts."""
+            if self._secret_state:
+                self._cancel_secret_capture()
+                event.app.current_buffer.reset()
+                event.app.invalidate()
+                return
+            if self._sudo_state:
+                self._sudo_state["response_queue"].put("")
+                self._sudo_state = None
+                event.app.invalidate()
+                return
+
        @kb.add('c-z')
        def handle_ctrl_z(event):
            """Handle Ctrl+Z - suspend process to background (Unix only)."""
@@ -8928,9 +8946,9 @@ class HermesCLI:
            if cli_ref._voice_processing:
                return "transcribing..."
            if cli_ref._sudo_state:
-                return "type password (hidden), Enter to skip"
+                return "type password (hidden), Enter to submit · ESC to skip"
            if cli_ref._secret_state:
-                return "type secret (hidden), Enter to skip"
+                return "type secret (hidden), Enter to submit · ESC to skip"
            if cli_ref._approval_state:
                return ""
            if cli_ref._clarify_freetext:
@@ -9173,7 +9191,7 @@ class HermesCLI:
            prompt = state.get("prompt") or f"Enter value for {state.get('var_name', 'secret')}"
            metadata = state.get("metadata") or {}
            help_text = metadata.get("help")
-            body = 'Enter secret below (hidden), or press Enter to skip'
+            body = 'Enter secret below (hidden), ESC or Ctrl+C to skip'
            content_lines = [prompt, body]
            if help_text:
                content_lines.insert(1, str(help_text))
@@ -3,11 +3,12 @@ Event Hook System

 A lightweight event-driven system that fires handlers at key lifecycle points.
 Hooks are discovered from ~/.hermes/hooks/ directories, each containing:
-  - HOOK.yaml  (metadata: name, description, events list)
+  - HOOK.yaml  (metadata: name, description, events list, optional startup_readiness)
  - handler.py (Python handler with async def handle(event_type, context))

 Events:
  - gateway:startup     -- Gateway process starts
+  - gateway:shutdown    -- Gateway process is shutting down
  - session:start       -- New session created (first message of a new session)
  - session:end         -- Session ends (user ran /new or /reset)
  - session:reset       -- Session reset completed (new session entry created)
@@ -31,6 +32,26 @@ from hermes_cli.config import get_hermes_home
 HOOKS_DIR = get_hermes_home() / "hooks"


+def _normalize_startup_readiness(hook_name: str, manifest: dict[str, Any]) -> Optional[dict[str, Any]]:
+    """Validate and normalize optional startup readiness metadata."""
+    readiness = manifest.get("startup_readiness")
+    if readiness is None:
+        return None
+    if not isinstance(readiness, dict):
+        print(f"[hooks] Ignoring startup_readiness for {hook_name}: expected mapping", flush=True)
+        return None
+
+    check_id = str(readiness.get("id", "")).strip()
+    if not check_id:
+        print(f"[hooks] Ignoring startup_readiness for {hook_name}: missing id", flush=True)
+        return None
+
+    return {
+        "id": check_id,
+        "required": bool(readiness.get("required", True)),
+    }
+
+
 class HookRegistry:
    """
    Discovers, loads, and fires event hooks.
@@ -62,6 +83,7 @@ class HookRegistry:
                "description": "Run ~/.hermes/BOOT.md on gateway startup",
                "events": ["gateway:startup"],
                "path": "(builtin)",
+                "startup_readiness": None,
            })
        except Exception as e:
            print(f"[hooks] Could not load built-in boot-md hook: {e}", flush=True)
@@ -102,6 +124,7 @@ class HookRegistry:
                if not events:
                    print(f"[hooks] Skipping {hook_name}: no events declared", flush=True)
                    continue
+                startup_readiness = _normalize_startup_readiness(hook_name, manifest)

                # Dynamically load the handler module
                spec = importlib.util.spec_from_file_location(
@@ -128,6 +151,7 @@ class HookRegistry:
                    "description": manifest.get("description", ""),
                    "events": events,
                    "path": str(hook_dir),
+                    "startup_readiness": startup_readiness,
                })

                print(f"[hooks] Loaded hook '{hook_name}' for events: {events}", flush=True)
@@ -10,6 +10,7 @@ Exposes an HTTP server with endpoints:
 - POST /v1/runs                    — start a run, returns run_id immediately (202)
 - GET  /v1/runs/{run_id}/events    — SSE stream of structured lifecycle events
 - GET  /health                     — health check
+- GET  /health/detailed            — rich status for cross-container dashboard probing

 Any OpenAI-compatible frontend (Open WebUI, LobeChat, LibreChat,
 AnythingLLM, NextChat, ChatBox, etc.) can connect to hermes-agent
@@ -565,6 +566,27 @@ class APIServerAdapter(BasePlatformAdapter):
        """GET /health — simple health check."""
        return web.json_response({"status": "ok", "platform": "hermes-agent"})

+    async def _handle_health_detailed(self, request: "web.Request") -> "web.Response":
+        """GET /health/detailed — rich status for cross-container dashboard probing.
+
+        Returns gateway state, connected platforms, PID, and uptime so the
+        dashboard can display full status without needing a shared PID file or
+        /proc access.  No authentication required.
+        """
+        from gateway.status import read_runtime_status
+
+        runtime = read_runtime_status() or {}
+        return web.json_response({
+            "status": "ok",
+            "platform": "hermes-agent",
+            "gateway_state": runtime.get("gateway_state"),
+            "platforms": runtime.get("platforms", {}),
+            "active_agents": runtime.get("active_agents", 0),
+            "exit_reason": runtime.get("exit_reason"),
+            "updated_at": runtime.get("updated_at"),
+            "pid": os.getpid(),
+        })
+
    async def _handle_models(self, request: "web.Request") -> "web.Response":
        """GET /v1/models — return hermes-agent as an available model."""
        auth_err = self._check_auth(request)
@@ -1783,6 +1805,7 @@ class APIServerAdapter(BasePlatformAdapter):
            self._app = web.Application(middlewares=mws)
            self._app["api_server_adapter"] = self
            self._app.router.add_get("/health", self._handle_health)
+            self._app.router.add_get("/health/detailed", self._handle_health_detailed)
            self._app.router.add_get("/v1/health", self._handle_health)
            self._app.router.add_get("/v1/models", self._handle_models)
            self._app.router.add_post("/v1/chat/completions", self._handle_chat_completions)
@@ -1736,46 +1736,90 @@ class DiscordAdapter(BasePlatformAdapter):
        async def slash_btw(interaction: discord.Interaction, question: str):
            await self._run_simple_slash(interaction, f"/btw {question}")

-        # Register installed skills as native slash commands (parity with
-        # Telegram, which uses telegram_menu_commands() in commands.py).
-        # Discord allows up to 100 application commands globally.
-        _DISCORD_CMD_LIMIT = 100
+        # Register skills under a single /skill command group with category
+        # subcommand groups.  This uses 1 top-level slot instead of N,
+        # supporting up to 25 categories × 25 skills = 625 skills.
+        self._register_skill_group(tree)
+
+    def _register_skill_group(self, tree) -> None:
+        """Register a ``/skill`` command group with category subcommand groups.
+
+        Skills are organized by their directory category under ``SKILLS_DIR``.
+        Each category becomes a subcommand group; root-level skills become
+        direct subcommands.  Discord supports 25 subcommand groups × 25
+        subcommands each = 625 skills — well beyond the old 100-command cap.
+        """
        try:
-            from hermes_cli.commands import discord_skill_commands
+            from hermes_cli.commands import discord_skill_commands_by_category

-            existing_names = {cmd.name for cmd in tree.get_commands()}
-            remaining_slots = max(0, _DISCORD_CMD_LIMIT - len(existing_names))
+            existing_names = set()
+            try:
+                existing_names = {cmd.name for cmd in tree.get_commands()}
+            except Exception:
+                pass

-            skill_entries, skipped = discord_skill_commands(
-                max_slots=remaining_slots,
+            categories, uncategorized, hidden = discord_skill_commands_by_category(
                reserved_names=existing_names,
            )

-            for discord_name, description, cmd_key in skill_entries:
-                # Closure factory to capture cmd_key per iteration
-                def _make_skill_handler(_key: str):
-                    async def _skill_slash(interaction: discord.Interaction, args: str = ""):
-                        await self._run_simple_slash(interaction, f"{_key} {args}".strip())
-                    return _skill_slash
+            if not categories and not uncategorized:
+                return

-                handler = _make_skill_handler(cmd_key)
-                handler.__name__ = f"skill_{discord_name.replace('-', '_')}"
+            skill_group = discord.app_commands.Group(
+                name="skill",
+                description="Run a Hermes skill",
+            )

+            # ── Helper: build a callback for a skill command key ──
+            def _make_handler(_key: str):
+                @discord.app_commands.describe(args="Optional arguments for the skill")
+                async def _handler(interaction: discord.Interaction, args: str = ""):
+                    await self._run_simple_slash(interaction, f"{_key} {args}".strip())
+                _handler.__name__ = f"skill_{_key.lstrip('/').replace('-', '_')}"
+                return _handler
+
+            # ── Uncategorized (root-level) skills → direct subcommands ──
+            for discord_name, description, cmd_key in uncategorized:
                cmd = discord.app_commands.Command(
                    name=discord_name,
-                    description=description,
-                    callback=handler,
+                    description=description or f"Run the {discord_name} skill",
+                    callback=_make_handler(cmd_key),
                )
-                discord.app_commands.describe(args="Optional arguments for the skill")(cmd)
-                tree.add_command(cmd)
+                skill_group.add_command(cmd)

-            if skipped:
+            # ── Category subcommand groups ──
+            for cat_name in sorted(categories):
+                cat_desc = f"{cat_name.replace('-', ' ').title()} skills"
+                if len(cat_desc) > 100:
+                    cat_desc = cat_desc[:97] + "..."
+                cat_group = discord.app_commands.Group(
+                    name=cat_name,
+                    description=cat_desc,
+                    parent=skill_group,
+                )
+                for discord_name, description, cmd_key in categories[cat_name]:
+                    cmd = discord.app_commands.Command(
+                        name=discord_name,
+                        description=description or f"Run the {discord_name} skill",
+                        callback=_make_handler(cmd_key),
+                    )
+                    cat_group.add_command(cmd)
+
+            tree.add_command(skill_group)
+
+            total = sum(len(v) for v in categories.values()) + len(uncategorized)
+            logger.info(
+                "[%s] Registered /skill group: %d skill(s) across %d categories"
+                " + %d uncategorized",
+                self.name, total, len(categories), len(uncategorized),
+            )
+            if hidden:
                logger.warning(
-                    "[%s] Discord slash command limit reached (%d): %d skill(s) not registered",
-                    self.name, _DISCORD_CMD_LIMIT, skipped,
+                    "[%s] %d skill(s) not registered (Discord subcommand limits)",
+                    self.name, hidden,
                )
        except Exception as exc:
-            logger.warning("[%s] Failed to register skill slash commands: %s", self.name, exc)
+            logger.warning("[%s] Failed to register /skill group: %s", self.name, exc)

    def _build_slash_event(self, interaction: discord.Interaction, text: str) -> MessageEvent:
        """Build a MessageEvent from a Discord slash command interaction."""
@@ -72,7 +72,10 @@ try:
        UpdateMessageRequestBody,
    )
    from lark_oapi.core.const import FEISHU_DOMAIN, LARK_DOMAIN
-    from lark_oapi.event.callback.model.p2_card_action_trigger import P2CardActionTriggerResponse
+    from lark_oapi.event.callback.model.p2_card_action_trigger import (
+        CallBackCard,
+        P2CardActionTriggerResponse,
+    )
    from lark_oapi.event.dispatcher_handler import EventDispatcherHandler
    from lark_oapi.ws import Client as FeishuWSClient

@@ -80,6 +83,7 @@ try:
 except ImportError:
    FEISHU_AVAILABLE = False
    lark = None  # type: ignore[assignment]
+    CallBackCard = None  # type: ignore[assignment]
    P2CardActionTriggerResponse = None  # type: ignore[assignment]
    EventDispatcherHandler = None  # type: ignore[assignment]
    FeishuWSClient = None  # type: ignore[assignment]
@@ -169,6 +173,19 @@ _FEISHU_WEBHOOK_BODY_TIMEOUT_SECONDS = 30          # max seconds to read request
 _FEISHU_WEBHOOK_ANOMALY_THRESHOLD = 25             # consecutive error responses before WARNING log
 _FEISHU_WEBHOOK_ANOMALY_TTL_SECONDS = 6 * 60 * 60  # anomaly tracker TTL (6 hours) — matches openclaw
 _FEISHU_CARD_ACTION_DEDUP_TTL_SECONDS = 15 * 60    # card action token dedup window (15 min)
+
+_APPROVAL_CHOICE_MAP: Dict[str, str] = {
+    "approve_once": "once",
+    "approve_session": "session",
+    "approve_always": "always",
+    "deny": "deny",
+}
+_APPROVAL_LABEL_MAP: Dict[str, str] = {
+    "once": "Approved once",
+    "session": "Approved for session",
+    "always": "Approved permanently",
+    "deny": "Denied",
+}
 _FEISHU_BOT_MSG_TRACK_SIZE = 512                   # LRU size for tracking sent message IDs
 _FEISHU_REPLY_FALLBACK_CODES = frozenset({230011, 231003})  # reply target withdrawn/missing → create fallback
 _FEISHU_ACK_EMOJI = "OK"
@@ -1490,14 +1507,12 @@ class FeishuAdapter(BasePlatformAdapter):
            logger.warning("[Feishu] send_exec_approval failed: %s", exc)
            return SendResult(success=False, error=str(exc))

-    async def _update_approval_card(
-        self, message_id: str, label: str, user_name: str, choice: str,
-    ) -> None:
-        """Replace the approval card with a resolved status card."""
-        if not self._client or not message_id:
-            return
+    @staticmethod
+    def _build_resolved_approval_card(*, choice: str, user_name: str) -> Dict[str, Any]:
+        """Build raw card JSON for a resolved approval action."""
        icon = "❌" if choice == "deny" else "✅"
-        card = {
+        label = _APPROVAL_LABEL_MAP.get(choice, "Resolved")
+        return {
            "config": {"wide_screen_mode": True},
            "header": {
                "title": {"content": f"{icon} {label}", "tag": "plain_text"},
@@ -1510,13 +1525,6 @@ class FeishuAdapter(BasePlatformAdapter):
                },
            ],
        }
-        try:
-            payload = json.dumps(card, ensure_ascii=False)
-            body = self._build_update_message_body(msg_type="interactive", content=payload)
-            request = self._build_update_message_request(message_id=message_id, request_body=body)
-            await asyncio.to_thread(self._client.im.v1.message.update, request)
-        except Exception as exc:
-            logger.warning("[Feishu] Failed to update approval card %s: %s", message_id, exc)

    async def send_voice(
        self,
@@ -1845,20 +1853,82 @@ class FeishuAdapter(BasePlatformAdapter):
        future.add_done_callback(self._log_background_failure)

    def _on_card_action_trigger(self, data: Any) -> Any:
-        """Schedule Feishu card actions on the adapter loop and acknowledge immediately."""
+        """Handle card-action callback from the Feishu SDK (synchronous).
+
+        For approval actions: parses the event once, returns the resolved card
+        inline (the only reliable way to sync all clients), and schedules a
+        lightweight async method to actually unblock the agent.
+
+        For other card actions: delegates to ``_handle_card_action_event``.
+        """
        loop = self._loop
-        if loop is None or bool(getattr(loop, "is_closed", lambda: False)()):
+        if not self._loop_accepts_callbacks(loop):
            logger.warning("[Feishu] Dropping card action before adapter loop is ready")
-        else:
-            future = asyncio.run_coroutine_threadsafe(
-                self._handle_card_action_event(data),
-                loop,
-            )
-            future.add_done_callback(self._log_background_failure)
+            return P2CardActionTriggerResponse() if P2CardActionTriggerResponse else None
+
+        event = getattr(data, "event", None)
+        action = getattr(event, "action", None)
+        action_value = getattr(action, "value", {}) or {}
+        hermes_action = action_value.get("hermes_action") if isinstance(action_value, dict) else None
+
+        if hermes_action:
+            return self._handle_approval_card_action(event=event, action_value=action_value, loop=loop)
+
+        self._submit_on_loop(loop, self._handle_card_action_event(data))
        if P2CardActionTriggerResponse is None:
            return None
        return P2CardActionTriggerResponse()

+    @staticmethod
+    def _loop_accepts_callbacks(loop: Any) -> bool:
+        """Return True when the adapter loop can accept thread-safe submissions."""
+        return loop is not None and not bool(getattr(loop, "is_closed", lambda: False)())
+
+    def _submit_on_loop(self, loop: Any, coro: Any) -> None:
+        """Schedule background work on the adapter loop with shared failure logging."""
+        future = asyncio.run_coroutine_threadsafe(coro, loop)
+        future.add_done_callback(self._log_background_failure)
+
+    def _handle_approval_card_action(self, *, event: Any, action_value: Dict[str, Any], loop: Any) -> Any:
+        """Schedule approval resolution and build the synchronous callback response."""
+        approval_id = action_value.get("approval_id")
+        if approval_id is None:
+            logger.debug("[Feishu] Card action missing approval_id, ignoring")
+            return P2CardActionTriggerResponse() if P2CardActionTriggerResponse else None
+        choice = _APPROVAL_CHOICE_MAP.get(action_value.get("hermes_action"), "deny")
+
+        operator = getattr(event, "operator", None)
+        open_id = str(getattr(operator, "open_id", "") or "")
+        user_name = self._get_cached_sender_name(open_id) or open_id
+
+        self._submit_on_loop(loop, self._resolve_approval(approval_id, choice, user_name))
+
+        if P2CardActionTriggerResponse is None:
+            return None
+        response = P2CardActionTriggerResponse()
+        if CallBackCard is not None:
+            card = CallBackCard()
+            card.type = "raw"
+            card.data = self._build_resolved_approval_card(choice=choice, user_name=user_name)
+            response.card = card
+        return response
+
+    async def _resolve_approval(self, approval_id: Any, choice: str, user_name: str) -> None:
+        """Pop approval state and unblock the waiting agent thread."""
+        state = self._approval_state.pop(approval_id, None)
+        if not state:
+            logger.debug("[Feishu] Approval %s already resolved or unknown", approval_id)
+            return
+        try:
+            from tools.approval import resolve_gateway_approval
+            count = resolve_gateway_approval(state["session_key"], choice)
+            logger.info(
+                "Feishu button resolved %d approval(s) for session %s (choice=%s, user=%s)",
+                count, state["session_key"], choice, user_name,
+            )
+        except Exception as exc:
+            logger.error("Failed to resolve gateway approval from Feishu button: %s", exc)
+
    async def _handle_reaction_event(self, event_type: str, data: Any) -> None:
        """Fetch the reacted-to message; if it was sent by this bot, emit a synthetic text event."""
        if not self._client:
@@ -1950,51 +2020,6 @@ class FeishuAdapter(BasePlatformAdapter):
        action_tag = str(getattr(action, "tag", "") or "button")
        action_value = getattr(action, "value", {}) or {}

-        # --- Exec approval button intercept ---
-        hermes_action = action_value.get("hermes_action") if isinstance(action_value, dict) else None
-        if hermes_action:
-            approval_id = action_value.get("approval_id")
-            state = self._approval_state.pop(approval_id, None)
-            if not state:
-                logger.debug("[Feishu] Approval %s already resolved or unknown", approval_id)
-                return
-
-            choice_map = {
-                "approve_once": "once",
-                "approve_session": "session",
-                "approve_always": "always",
-                "deny": "deny",
-            }
-            choice = choice_map.get(hermes_action, "deny")
-
-            label_map = {
-                "once": "Approved once",
-                "session": "Approved for session",
-                "always": "Approved permanently",
-                "deny": "Denied",
-            }
-            label = label_map.get(choice, "Resolved")
-
-            # Resolve sender name for the status card
-            sender_id = SimpleNamespace(open_id=open_id, user_id=None, union_id=None)
-            sender_profile = await self._resolve_sender_profile(sender_id)
-            user_name = sender_profile.get("user_name") or open_id
-
-            # Resolve the approval — unblocks the agent thread
-            try:
-                from tools.approval import resolve_gateway_approval
-                count = resolve_gateway_approval(state["session_key"], choice)
-                logger.info(
-                    "Feishu button resolved %d approval(s) for session %s (choice=%s, user=%s)",
-                    count, state["session_key"], choice, user_name,
-                )
-            except Exception as exc:
-                logger.error("Failed to resolve gateway approval from Feishu button: %s", exc)
-
-            # Update the card to show the decision
-            await self._update_approval_card(state.get("message_id", ""), label, user_name, choice)
-            return
-
        synthetic_text = f"/card {action_tag}"
        if action_value:
            try:
@@ -2897,6 +2922,19 @@ class FeishuAdapter(BasePlatformAdapter):
            "user_id_alt": union_id,
        }

+    def _get_cached_sender_name(self, sender_id: Optional[str]) -> Optional[str]:
+        """Return a cached sender name only while its TTL is still valid."""
+        if not sender_id:
+            return None
+        cached = self._sender_name_cache.get(sender_id)
+        if cached is None:
+            return None
+        name, expire_at = cached
+        if time.time() < expire_at:
+            return name
+        self._sender_name_cache.pop(sender_id, None)
+        return None
+
    async def _resolve_sender_name_from_api(self, sender_id: Optional[str]) -> Optional[str]:
        """Fetch the sender's display name from the Feishu contact API with a 10-minute cache.

@@ -2909,11 +2947,9 @@ class FeishuAdapter(BasePlatformAdapter):
        if not trimmed:
            return None
        now = time.time()
-        cached = self._sender_name_cache.get(trimmed)
-        if cached is not None:
-            name, expire_at = cached
-            if now < expire_at:
-                return name
+        cached_name = self._get_cached_sender_name(trimmed)
+        if cached_name is not None:
+            return cached_name
        try:
            from lark_oapi.api.contact.v3 import GetUserRequest  # lazy import
            if trimmed.startswith("ou_"):
@@ -1391,6 +1391,65 @@ class GatewayRunner:
            except Exception as e:
                logger.debug("Failed interrupting agent during shutdown: %s", e)

+    async def _notify_active_sessions_of_shutdown(self) -> None:
+        """Send a notification to every chat with an active agent.
+
+        Called at the very start of stop() — adapters are still connected so
+        messages can be delivered.  Best-effort: individual send failures are
+        logged and swallowed so they never block the shutdown sequence.
+        """
+        active = self._snapshot_running_agents()
+        if not active:
+            return
+
+        action = "restarting" if self._restart_requested else "shutting down"
+        hint = (
+            "Your current task will be interrupted. "
+            "Use /retry after restart to continue."
+            if self._restart_requested
+            else "Your current task will be interrupted."
+        )
+        msg = f"⚠️ Gateway {action} — {hint}"
+
+        notified: set = set()
+        for session_key in active:
+            # Parse platform + chat_id from the session key.
+            # Format: agent:main:{platform}:{chat_type}:{chat_id}[:{extra}...]
+            parts = session_key.split(":")
+            if len(parts) < 5:
+                continue
+            platform_str = parts[2]
+            chat_id = parts[4]
+
+            # Deduplicate: one notification per chat, even if multiple
+            # sessions (different users/threads) share the same chat.
+            dedup_key = (platform_str, chat_id)
+            if dedup_key in notified:
+                continue
+
+            try:
+                platform = Platform(platform_str)
+                adapter = self.adapters.get(platform)
+                if not adapter:
+                    continue
+
+                # Include thread_id if present so the message lands in the
+                # correct forum topic / thread.
+                thread_id = parts[5] if len(parts) > 5 else None
+                metadata = {"thread_id": thread_id} if thread_id else None
+
+                await adapter.send(chat_id, msg, metadata=metadata)
+                notified.add(dedup_key)
+                logger.info(
+                    "Sent shutdown notification to %s:%s",
+                    platform_str, chat_id,
+                )
+            except Exception as e:
+                logger.debug(
+                    "Failed to send shutdown notification to %s:%s: %s",
+                    platform_str, chat_id, e,
+                )
+
    def _finalize_shutdown_agents(self, active_agents: Dict[str, Any]) -> None:
        for agent in active_agents.values():
            try:
@@ -1481,7 +1540,7 @@ class GatewayRunner:
            pass
        try:
            from gateway.status import write_runtime_status
-            write_runtime_status(gateway_state="starting", exit_reason=None)
+            write_runtime_status(gateway_state="starting", exit_reason=None, startup_checks={})
        except Exception:
            pass
        
@@ -1523,8 +1582,23 @@ class GatewayRunner:
                "or configure platform allowlists (e.g., TELEGRAM_ALLOWED_USERS=your_id)."
            )
        
+        # Discover plugins before hooks so plugin-owned hook bundles can
+        # participate in this same startup cycle.
+        try:
+            from hermes_cli.plugins import discover_plugins
+
+            discover_plugins()
+        except Exception as e:
+            logger.warning("Plugin discovery during gateway startup failed: %s", e)
+
        # Discover and load event hooks
        self.hooks.discover_and_load()
+        try:
+            from gateway.status import reset_startup_checks
+
+            reset_startup_checks(self.hooks.loaded_hooks)
+        except Exception as e:
+            logger.warning("Startup readiness initialization failed: %s", e)
        
        # Recover background processes from checkpoint (crash recovery)
        try:
@@ -2018,6 +2092,10 @@ class GatewayRunner:
            self._running = False
            self._draining = True

+            # Notify all chats with active agents BEFORE draining.
+            # Adapters are still connected here, so messages can be sent.
+            await self._notify_active_sessions_of_shutdown()
+
            timeout = self._restart_drain_timeout
            active_agents, timed_out = await self._drain_active_agents(timeout)
            if timed_out:
@@ -2041,6 +2119,11 @@ class GatewayRunner:
                    logger.error("Failed to launch detached gateway restart: %s", e)

            self._finalize_shutdown_agents(active_agents)
+            await self.hooks.emit("gateway:shutdown", {
+                "restart": self._restart_requested,
+                "service_restart": self._restart_via_service,
+                "detached_restart": self._restart_detached,
+            })

            for platform, adapter in list(self.adapters.items()):
                try:
@@ -2088,12 +2171,23 @@ class GatewayRunner:

            # Write a clean-shutdown marker so the next startup knows this
            # wasn't a crash.  suspend_recently_active() only needs to run
-            # after unexpected exits — graceful shutdowns already drain
-            # active agents, so there's no stuck-session risk.
-            try:
-                (_hermes_home / ".clean_shutdown").touch()
-            except Exception:
-                pass
+            # after unexpected exits.  However, if the drain timed out and
+            # agents were force-interrupted, their sessions may be in an
+            # incomplete state (trailing tool response, no final assistant
+            # message).  Skip the marker in that case so the next startup
+            # suspends those sessions — giving users a clean slate instead
+            # of resuming a half-finished tool loop.
+            if not timed_out:
+                try:
+                    (_hermes_home / ".clean_shutdown").touch()
+                except Exception:
+                    pass
+            else:
+                logger.info(
+                    "Skipping .clean_shutdown marker — drain timed out with "
+                    "interrupted agents; next startup will suspend recently "
+                    "active sessions."
+                )

            if self._restart_requested and self._restart_via_service:
                self._exit_code = GATEWAY_SERVICE_RESTART_EXIT_CODE
@@ -9187,8 +9281,41 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =

    runner = GatewayRunner(config)
    
+    # Track whether a signal initiated the shutdown (vs. internal request).
+    # When an unexpected SIGTERM kills the gateway, we exit non-zero so
+    # systemd's Restart=on-failure revives the process.  systemctl stop
+    # is safe: systemd tracks stop-requested state independently of exit
+    # code, so Restart= never fires for a deliberate stop.
+    _signal_initiated_shutdown = False
+
    # Set up signal handlers
    def shutdown_signal_handler():
+        nonlocal _signal_initiated_shutdown
+        _signal_initiated_shutdown = True
+        logger.info("Received SIGTERM/SIGINT — initiating shutdown")
+        # Diagnostic: log all hermes-related processes so we can identify
+        # what triggered the signal (hermes update, hermes gateway restart,
+        # a stale detached subprocess, etc.).
+        try:
+            import subprocess as _sp
+            _ps = _sp.run(
+                ["ps", "aux"],
+                capture_output=True, text=True, timeout=3,
+            )
+            _hermes_procs = [
+                line for line in _ps.stdout.splitlines()
+                if ("hermes" in line.lower() or "gateway" in line.lower())
+                and str(os.getpid()) not in line.split()[1:2]  # exclude self
+            ]
+            if _hermes_procs:
+                logger.warning(
+                    "Shutdown diagnostic — other hermes processes running:\n  %s",
+                    "\n  ".join(_hermes_procs),
+                )
+            else:
+                logger.info("Shutdown diagnostic — no other hermes processes found")
+        except Exception:
+            pass
        asyncio.create_task(runner.stop())

    def restart_signal_handler():
@@ -9258,6 +9385,21 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =
    if runner.exit_code is not None:
        raise SystemExit(runner.exit_code)

+    # When a signal (SIGTERM/SIGINT) caused the shutdown and it wasn't a
+    # planned restart (/restart, /update, SIGUSR1), exit non-zero so
+    # systemd's Restart=on-failure revives the process.  This covers:
+    #   - hermes update killing the gateway mid-work
+    #   - External kill commands
+    #   - WSL2/container runtime sending unexpected signals
+    # systemctl stop is safe: systemd tracks "stop requested" state
+    # independently of exit code, so Restart= never fires for it.
+    if _signal_initiated_shutdown and not runner._restart_requested:
+        logger.info(
+            "Exiting with code 1 (signal-initiated shutdown without restart "
+            "request) so systemd Restart=on-failure can revive the gateway."
+        )
+        return False  # → sys.exit(1) in the caller
+
    return True


@@ -27,6 +27,7 @@ _RUNTIME_STATUS_FILE = "gateway_state.json"
 _LOCKS_DIRNAME = "gateway-locks"
 _IS_WINDOWS = sys.platform == "win32"
 _UNSET = object()
+_VALID_STARTUP_CHECK_STATES = {"pending", "ready", "failed"}


 def _get_pid_path() -> Path:
@@ -162,11 +163,39 @@ def _build_runtime_status_record() -> dict[str, Any]:
        "restart_requested": False,
        "active_agents": 0,
        "platforms": {},
+        "startup_checks": {},
        "updated_at": _utc_now_iso(),
    })
    return payload


+def _normalize_startup_check_entries(
+    startup_checks: Optional[dict[str, Any]],
+) -> dict[str, dict[str, Any]]:
+    """Normalize persisted startup readiness entries."""
+    if not isinstance(startup_checks, dict):
+        return {}
+
+    now = _utc_now_iso()
+    normalized: dict[str, dict[str, Any]] = {}
+    for raw_id, raw_payload in startup_checks.items():
+        check_id = str(raw_id).strip()
+        if not check_id:
+            continue
+        payload = raw_payload if isinstance(raw_payload, dict) else {}
+        state = str(payload.get("state", "pending")).strip().lower()
+        if state not in _VALID_STARTUP_CHECK_STATES:
+            state = "pending"
+        normalized[check_id] = {
+            "state": state,
+            "required": bool(payload.get("required", True)),
+            "source": payload.get("source"),
+            "detail": payload.get("detail"),
+            "updated_at": payload.get("updated_at") or now,
+        }
+    return normalized
+
+
 def _read_json_file(path: Path) -> Optional[dict[str, Any]]:
    if not path.exists():
        return None
@@ -223,6 +252,7 @@ def write_runtime_status(
    exit_reason: Any = _UNSET,
    restart_requested: Any = _UNSET,
    active_agents: Any = _UNSET,
+    startup_checks: Any = _UNSET,
    platform: Any = _UNSET,
    platform_state: Any = _UNSET,
    error_code: Any = _UNSET,
@@ -245,6 +275,8 @@ def write_runtime_status(
        payload["restart_requested"] = bool(restart_requested)
    if active_agents is not _UNSET:
        payload["active_agents"] = max(0, int(active_agents))
+    if startup_checks is not _UNSET:
+        payload["startup_checks"] = _normalize_startup_check_entries(startup_checks)

    if platform is not _UNSET:
        platform_payload = payload["platforms"].get(platform, {})
@@ -262,13 +294,131 @@ def write_runtime_status(

 def read_runtime_status() -> Optional[dict[str, Any]]:
    """Read the persisted gateway runtime health/status information."""
-    return _read_json_file(_get_runtime_status_path())
+    payload = _read_json_file(_get_runtime_status_path())
+    if payload is None:
+        return None
+    payload.setdefault("platforms", {})
+    payload["startup_checks"] = _normalize_startup_check_entries(payload.get("startup_checks"))
+    return payload
+
+
+def reset_startup_checks(checks: Optional[list[dict[str, Any]]] = None) -> dict[str, dict[str, Any]]:
+    """Replace persisted startup readiness checks for the current run."""
+    normalized: dict[str, dict[str, Any]] = {}
+    now = _utc_now_iso()
+
+    for hook in checks or []:
+        if not isinstance(hook, dict):
+            continue
+        readiness = hook.get("startup_readiness")
+        if not isinstance(readiness, dict):
+            continue
+        check_id = str(readiness.get("id", "")).strip()
+        if not check_id:
+            continue
+        normalized[check_id] = {
+            "state": "pending",
+            "required": bool(readiness.get("required", True)),
+            "source": hook.get("name"),
+            "detail": None,
+            "updated_at": now,
+        }
+
+    write_runtime_status(startup_checks=normalized)
+    return normalized
+
+
+def update_startup_check(
+    check_id: str,
+    state: str,
+    *,
+    detail: Any = _UNSET,
+    required: Any = _UNSET,
+    source: Any = _UNSET,
+) -> dict[str, Any]:
+    """Update a single startup readiness check in the runtime status file."""
+    normalized_id = str(check_id).strip()
+    if not normalized_id:
+        raise ValueError("startup readiness check id is required")
+
+    normalized_state = str(state).strip().lower()
+    if normalized_state not in _VALID_STARTUP_CHECK_STATES:
+        raise ValueError(f"invalid startup readiness state: {state}")
+
+    path = _get_runtime_status_path()
+    payload = _read_json_file(path) or _build_runtime_status_record()
+    checks = _normalize_startup_check_entries(payload.get("startup_checks"))
+    existing = checks.get(normalized_id, {})
+    now = _utc_now_iso()
+
+    checks[normalized_id] = {
+        "state": normalized_state,
+        "required": bool(existing.get("required", True) if required is _UNSET else required),
+        "source": existing.get("source") if source is _UNSET else source,
+        "detail": existing.get("detail") if detail is _UNSET else detail,
+        "updated_at": now,
+    }
+
+    payload["startup_checks"] = checks
+    payload.setdefault("platforms", {})
+    payload.setdefault("kind", _GATEWAY_KIND)
+    payload["pid"] = os.getpid()
+    payload["start_time"] = _get_process_start_time(os.getpid())
+    payload["updated_at"] = now
+    _write_json_file(path, payload)
+    return checks[normalized_id]
+
+
+def mark_startup_check_pending(
+    check_id: str,
+    *,
+    detail: Any = _UNSET,
+    required: Any = _UNSET,
+    source: Any = _UNSET,
+) -> dict[str, Any]:
+    return update_startup_check(check_id, "pending", detail=detail, required=required, source=source)
+
+
+def mark_startup_check_ready(
+    check_id: str,
+    *,
+    detail: Any = _UNSET,
+    required: Any = _UNSET,
+    source: Any = _UNSET,
+) -> dict[str, Any]:
+    return update_startup_check(check_id, "ready", detail=detail, required=required, source=source)
+
+
+def mark_startup_check_failed(
+    check_id: str,
+    *,
+    detail: Any = _UNSET,
+    required: Any = _UNSET,
+    source: Any = _UNSET,
+) -> dict[str, Any]:
+    return update_startup_check(check_id, "failed", detail=detail, required=required, source=source)


 def remove_pid_file() -> None:
-    """Remove the gateway PID file if it exists."""
+    """Remove the gateway PID file, but only if it belongs to this process.
+
+    During --replace handoffs, the old process's atexit handler can fire AFTER
+    the new process has written its own PID file.  Blindly removing the file
+    would delete the new process's record, leaving the gateway running with no
+    PID file (invisible to ``get_running_pid()``).
+    """
    try:
-        _get_pid_path().unlink(missing_ok=True)
+        path = _get_pid_path()
+        record = _read_json_file(path)
+        if record is not None:
+            try:
+                file_pid = int(record["pid"])
+            except (KeyError, TypeError, ValueError):
+                file_pid = None
+            if file_pid is not None and file_pid != os.getpid():
+                # PID file belongs to a different process — leave it alone.
+                return
+        path.unlink(missing_ok=True)
    except Exception:
        pass

@@ -383,13 +383,16 @@ def _resolve_api_key_provider_secret(
 # Z.AI has separate billing for general vs coding plans, and global vs China
 # endpoints.  A key that works on one may return "Insufficient balance" on
 # another.  We probe at setup time and store the working endpoint.
+# Each entry lists candidate models to try in order — newer coding plan accounts
+# may only have access to recent models (glm-5.1, glm-5v-turbo) while older
+# ones still use glm-4.7.

 ZAI_ENDPOINTS = [
-    # (id, base_url, default_model, label)
-    ("global",        "https://api.z.ai/api/paas/v4",        "glm-5",   "Global"),
-    ("cn",            "https://open.bigmodel.cn/api/paas/v4", "glm-5",   "China"),
-    ("coding-global", "https://api.z.ai/api/coding/paas/v4",  "glm-4.7", "Global (Coding Plan)"),
-    ("coding-cn",     "https://open.bigmodel.cn/api/coding/paas/v4", "glm-4.7", "China (Coding Plan)"),
+    # (id, base_url, probe_models, label)
+    ("global",        "https://api.z.ai/api/paas/v4",        ["glm-5"],   "Global"),
+    ("cn",            "https://open.bigmodel.cn/api/paas/v4", ["glm-5"],   "China"),
+    ("coding-global", "https://api.z.ai/api/coding/paas/v4",  ["glm-5.1", "glm-5v-turbo", "glm-4.7"], "Global (Coding Plan)"),
+    ("coding-cn",     "https://open.bigmodel.cn/api/coding/paas/v4", ["glm-5.1", "glm-5v-turbo", "glm-4.7"], "China (Coding Plan)"),
 ]


@@ -397,35 +400,37 @@ def detect_zai_endpoint(api_key: str, timeout: float = 8.0) -> Optional[Dict[str
    """Probe z.ai endpoints to find one that accepts this API key.

    Returns {"id": ..., "base_url": ..., "model": ..., "label": ...} for the
-    first working endpoint, or None if all fail.
+    first working endpoint, or None if all fail.  For endpoints with multiple
+    candidate models, tries each in order and returns the first that succeeds.
    """
-    for ep_id, base_url, model, label in ZAI_ENDPOINTS:
-        try:
-            resp = httpx.post(
-                f"{base_url}/chat/completions",
-                headers={
-                    "Authorization": f"Bearer {api_key}",
-                    "Content-Type": "application/json",
-                },
-                json={
-                    "model": model,
-                    "stream": False,
-                    "max_tokens": 1,
-                    "messages": [{"role": "user", "content": "ping"}],
-                },
-                timeout=timeout,
-            )
-            if resp.status_code == 200:
-                logger.debug("Z.AI endpoint probe: %s (%s) OK", ep_id, base_url)
-                return {
-                    "id": ep_id,
-                    "base_url": base_url,
-                    "model": model,
-                    "label": label,
-                }
-            logger.debug("Z.AI endpoint probe: %s returned %s", ep_id, resp.status_code)
-        except Exception as exc:
-            logger.debug("Z.AI endpoint probe: %s failed: %s", ep_id, exc)
+    for ep_id, base_url, probe_models, label in ZAI_ENDPOINTS:
+        for model in probe_models:
+            try:
+                resp = httpx.post(
+                    f"{base_url}/chat/completions",
+                    headers={
+                        "Authorization": f"Bearer {api_key}",
+                        "Content-Type": "application/json",
+                    },
+                    json={
+                        "model": model,
+                        "stream": False,
+                        "max_tokens": 1,
+                        "messages": [{"role": "user", "content": "ping"}],
+                    },
+                    timeout=timeout,
+                )
+                if resp.status_code == 200:
+                    logger.debug("Z.AI endpoint probe: %s (%s) model=%s OK", ep_id, base_url, model)
+                    return {
+                        "id": ep_id,
+                        "base_url": base_url,
+                        "model": model,
+                        "label": label,
+                    }
+                logger.debug("Z.AI endpoint probe: %s model=%s returned %s", ep_id, model, resp.status_code)
+            except Exception as exc:
+                logger.debug("Z.AI endpoint probe: %s model=%s failed: %s", ep_id, model, exc)
    return None


@@ -75,12 +75,12 @@ def prompt_for_secret(cli, var_name: str, prompt: str, metadata=None) -> dict:
        if not hasattr(cli, "_secret_deadline"):
            cli._secret_deadline = 0
        try:
-            value = getpass.getpass(f"{prompt} (hidden, Enter to skip): ")
+            value = getpass.getpass(f"{prompt} (hidden, ESC or empty Enter to skip): ")
        except (EOFError, KeyboardInterrupt):
            value = ""

        if not value:
-            cprint(f"\n{_DIM}  ⏭ Secret entry cancelled{_RST}")
+            cprint(f"\n{_DIM}  ⏭ Secret entry skipped{_RST}")
            return {
                "success": True,
                "reason": "cancelled",
@@ -133,7 +133,7 @@ def prompt_for_secret(cli, var_name: str, prompt: str, metadata=None) -> dict:
                cli._app.invalidate()

            if not value:
-                cprint(f"\n{_DIM}  ⏭ Secret entry cancelled{_RST}")
+                cprint(f"\n{_DIM}  ⏭ Secret entry skipped{_RST}")
                return {
                    "success": True,
                    "reason": "cancelled",
@@ -582,6 +582,116 @@ def discord_skill_commands(
    )


+def discord_skill_commands_by_category(
+    reserved_names: set[str],
+) -> tuple[dict[str, list[tuple[str, str, str]]], list[tuple[str, str, str]], int]:
+    """Return skill entries organized by category for Discord ``/skill`` subcommand groups.
+
+    Skills whose directory is nested at least 2 levels under ``SKILLS_DIR``
+    (e.g. ``creative/ascii-art/SKILL.md``) are grouped by their top-level
+    category.  Root-level skills (e.g. ``dogfood/SKILL.md``) are returned as
+    *uncategorized* — the caller should register them as direct subcommands
+    of the ``/skill`` group.
+
+    The same filtering as :func:`discord_skill_commands` is applied: hub
+    skills excluded, per-platform disabled excluded, names clamped.
+
+    Returns:
+        ``(categories, uncategorized, hidden_count)``
+
+        - *categories*: ``{category_name: [(name, description, cmd_key), ...]}``
+        - *uncategorized*: ``[(name, description, cmd_key), ...]``
+        - *hidden_count*: skills dropped due to Discord group limits
+          (25 subcommand groups, 25 subcommands per group)
+    """
+    from pathlib import Path as _P
+
+    _platform_disabled: set[str] = set()
+    try:
+        from agent.skill_utils import get_disabled_skill_names
+        _platform_disabled = get_disabled_skill_names(platform="discord")
+    except Exception:
+        pass
+
+    # Collect raw skill data --------------------------------------------------
+    categories: dict[str, list[tuple[str, str, str]]] = {}
+    uncategorized: list[tuple[str, str, str]] = []
+    _names_used: set[str] = set(reserved_names)
+    hidden = 0
+
+    try:
+        from agent.skill_commands import get_skill_commands
+        from tools.skills_tool import SKILLS_DIR
+        _skills_dir = SKILLS_DIR.resolve()
+        _hub_dir = (SKILLS_DIR / ".hub").resolve()
+        skill_cmds = get_skill_commands()
+
+        for cmd_key in sorted(skill_cmds):
+            info = skill_cmds[cmd_key]
+            skill_path = info.get("skill_md_path", "")
+            if not skill_path:
+                continue
+            sp = _P(skill_path).resolve()
+            # Skip skills outside SKILLS_DIR or from the hub
+            if not str(sp).startswith(str(_skills_dir)):
+                continue
+            if str(sp).startswith(str(_hub_dir)):
+                continue
+
+            skill_name = info.get("name", "")
+            if skill_name in _platform_disabled:
+                continue
+
+            raw_name = cmd_key.lstrip("/")
+            # Clamp to 32 chars (Discord limit)
+            discord_name = raw_name[:32]
+            if discord_name in _names_used:
+                continue
+            _names_used.add(discord_name)
+
+            desc = info.get("description", "")
+            if len(desc) > 100:
+                desc = desc[:97] + "..."
+
+            # Determine category from the relative path within SKILLS_DIR.
+            # e.g. creative/ascii-art/SKILL.md → parts = ("creative", "ascii-art")
+            try:
+                rel = sp.parent.relative_to(_skills_dir)
+            except ValueError:
+                continue
+            parts = rel.parts
+            if len(parts) >= 2:
+                cat = parts[0]
+                categories.setdefault(cat, []).append((discord_name, desc, cmd_key))
+            else:
+                uncategorized.append((discord_name, desc, cmd_key))
+    except Exception:
+        pass
+
+    # Enforce Discord limits: 25 subcommand groups, 25 subcommands each ------
+    _MAX_GROUPS = 25
+    _MAX_PER_GROUP = 25
+
+    trimmed_categories: dict[str, list[tuple[str, str, str]]] = {}
+    group_count = 0
+    for cat in sorted(categories):
+        if group_count >= _MAX_GROUPS:
+            hidden += len(categories[cat])
+            continue
+        entries = categories[cat][:_MAX_PER_GROUP]
+        hidden += max(0, len(categories[cat]) - _MAX_PER_GROUP)
+        trimmed_categories[cat] = entries
+        group_count += 1
+
+    # Uncategorized skills also count against the 25 top-level limit
+    remaining_slots = _MAX_GROUPS - group_count
+    if len(uncategorized) > remaining_slots:
+        hidden += len(uncategorized) - remaining_slots
+        uncategorized = uncategorized[:remaining_slots]
+
+    return trimmed_categories, uncategorized, hidden
+
+
 def slack_subcommand_map() -> dict[str, str]:
    """Return subcommand -> /command mapping for Slack /hermes handler.

@@ -10,6 +10,7 @@ import shutil
 import signal
 import subprocess
 import sys
+import time
 from pathlib import Path

 PROJECT_ROOT = Path(__file__).parent.parent.resolve()
@@ -37,6 +38,10 @@ from hermes_cli.setup import (
 from hermes_cli.colors import Colors, color


+_SERVICE_READINESS_TIMEOUT = 30.0
+_SERVICE_READINESS_POLL_INTERVAL = 0.2
+
+
 # =============================================================================
 # Process Management (for manual gateway runs)
 # =============================================================================
@@ -1100,12 +1105,123 @@ def systemd_uninstall(system: bool = False):
    print(f"✓ {_service_scope_label(system).capitalize()} service uninstalled")


+def _describe_startup_check(check_id: str, check: dict) -> str:
+    source = check.get("source")
+    detail = check.get("detail")
+    label = f"{check_id} ({source})" if source and source != check_id else check_id
+    return f"{label}: {detail}" if detail else label
+
+
+def _classify_startup_checks(state: dict | None) -> tuple[list[str], list[str], list[str]]:
+    checks = (state or {}).get("startup_checks") or {}
+    pending_required: list[str] = []
+    failed_required: list[str] = []
+    optional_warnings: list[str] = []
+
+    if not isinstance(checks, dict):
+        return pending_required, failed_required, optional_warnings
+
+    for check_id, raw_check in checks.items():
+        check = raw_check if isinstance(raw_check, dict) else {}
+        label = _describe_startup_check(str(check_id), check)
+        check_state = str(check.get("state", "pending")).strip().lower()
+        required = bool(check.get("required", True))
+
+        if check_state == "ready":
+            continue
+        if required:
+            if check_state == "failed":
+                failed_required.append(label)
+            else:
+                pending_required.append(label)
+        else:
+            prefix = "failed" if check_state == "failed" else "pending"
+            optional_warnings.append(f"{prefix}: {label}")
+
+    return pending_required, failed_required, optional_warnings
+
+
+def _wait_for_service_readiness(
+    *,
+    action: str,
+    previous_pid: int | None = None,
+    timeout: float = _SERVICE_READINESS_TIMEOUT,
+    poll_interval: float = _SERVICE_READINESS_POLL_INTERVAL,
+) -> list[str]:
+    from gateway.status import get_running_pid, read_runtime_status
+
+    deadline = time.monotonic() + timeout
+    last_pending: list[str] = []
+
+    while time.monotonic() < deadline:
+        live_pid = get_running_pid()
+        if live_pid is None or (previous_pid is not None and live_pid == previous_pid):
+            time.sleep(poll_interval)
+            continue
+
+        runtime = read_runtime_status() or {}
+        try:
+            runtime_pid = int(runtime.get("pid"))
+        except (TypeError, ValueError):
+            runtime_pid = None
+        if runtime_pid != live_pid:
+            time.sleep(poll_interval)
+            continue
+
+        gateway_state = runtime.get("gateway_state")
+        pending_required, failed_required, optional_warnings = _classify_startup_checks(runtime)
+        last_pending = pending_required
+
+        if gateway_state == "startup_failed":
+            reason = runtime.get("exit_reason") or f"gateway {action} failed during startup"
+            raise RuntimeError(reason)
+        if failed_required:
+            raise RuntimeError(
+                "required startup checks failed: " + "; ".join(failed_required)
+            )
+        if gateway_state == "running" and not pending_required:
+            return optional_warnings
+
+        time.sleep(poll_interval)
+
+    if last_pending:
+        raise RuntimeError(
+            "timed out waiting for required startup checks: " + "; ".join(last_pending)
+        )
+    if previous_pid is not None:
+        raise RuntimeError(
+            f"timed out waiting for gateway {action}; previous process is still active or no new runtime became ready"
+        )
+    raise RuntimeError(f"timed out waiting for gateway {action} readiness")
+
+
+def _await_service_ready_or_exit(
+    *,
+    action: str,
+    previous_pid: int | None = None,
+    timeout: float = _SERVICE_READINESS_TIMEOUT,
+) -> None:
+    try:
+        optional_warnings = _wait_for_service_readiness(
+            action=action,
+            previous_pid=previous_pid,
+            timeout=timeout,
+        )
+    except RuntimeError as exc:
+        print_error(f"  Gateway {action} did not become ready: {exc}")
+        raise SystemExit(1) from exc
+
+    for warning in optional_warnings:
+        print_warning(f"  Optional startup check {warning}")
+
+
 def systemd_start(system: bool = False):
    system = _select_systemd_scope(system)
    if system:
        _require_root_for_system_service("start")
    refresh_systemd_unit_if_needed(system=system)
    _run_systemctl(["start", get_service_name()], system=system, check=True, timeout=30)
+    _await_service_ready_or_exit(action="start")
    print(f"✓ {_service_scope_label(system).capitalize()} service started")


@@ -1128,9 +1244,11 @@ def systemd_restart(system: bool = False):

    pid = get_running_pid()
    if pid is not None and _request_gateway_self_restart(pid):
-        print(f"✓ {_service_scope_label(system).capitalize()} service restart requested")
+        _await_service_ready_or_exit(action="restart", previous_pid=pid)
+        print(f"✓ {_service_scope_label(system).capitalize()} service restarted")
        return
    _run_systemctl(["reload-or-restart", get_service_name()], system=system, check=True, timeout=90)
+    _await_service_ready_or_exit(action="restart", previous_pid=pid)
    print(f"✓ {_service_scope_label(system).capitalize()} service restarted")


@@ -1389,6 +1507,7 @@ def launchd_start():
        plist_path.write_text(generate_launchd_plist(), encoding="utf-8")
        subprocess.run(["launchctl", "bootstrap", _launchd_domain(), str(plist_path)], check=True, timeout=30)
        subprocess.run(["launchctl", "kickstart", f"{_launchd_domain()}/{label}"], check=True, timeout=30)
+        _await_service_ready_or_exit(action="start")
        print("✓ Service started")
        return

@@ -1401,6 +1520,7 @@ def launchd_start():
        print("↻ launchd job was unloaded; reloading service definition")
        subprocess.run(["launchctl", "bootstrap", _launchd_domain(), str(plist_path)], check=True, timeout=30)
        subprocess.run(["launchctl", "kickstart", f"{_launchd_domain()}/{label}"], check=True, timeout=30)
+    _await_service_ready_or_exit(action="start")
    print("✓ Service started")

 def launchd_stop():
@@ -1471,7 +1591,8 @@ def launchd_restart():
    try:
        pid = get_running_pid()
        if pid is not None and _request_gateway_self_restart(pid):
-            print("✓ Service restart requested")
+            _await_service_ready_or_exit(action="restart", previous_pid=pid)
+            print("✓ Service restarted")
            return
        if pid is not None:
            try:
@@ -1483,6 +1604,7 @@ def launchd_restart():
                if not exited:
                    print(f"⚠ Gateway drain timed out after {drain_timeout:.0f}s — forcing launchd restart")
        subprocess.run(["launchctl", "kickstart", "-k", target], check=True, timeout=90)
+        _await_service_ready_or_exit(action="restart", previous_pid=pid)
        print("✓ Service restarted")
    except subprocess.CalledProcessError as e:
        if e.returncode not in (3, 113):
@@ -1492,6 +1614,7 @@ def launchd_restart():
        plist_path = get_launchd_plist_path()
        subprocess.run(["launchctl", "bootstrap", _launchd_domain(), str(plist_path)], check=True, timeout=30)
        subprocess.run(["launchctl", "kickstart", target], check=True, timeout=30)
+        _await_service_ready_or_exit(action="restart", previous_pid=pid)
        print("✓ Service restarted")

 def launchd_status(deep: bool = False):
@@ -4036,7 +4036,40 @@ def cmd_update(args):
                                    capture_output=True, text=True, timeout=15,
                                )
                                if restart.returncode == 0:
-                                    restarted_services.append(svc_name)
+                                    # Verify the service actually survived the
+                                    # restart.  systemctl restart returns 0 even
+                                    # if the new process crashes immediately.
+                                    import time as _time
+                                    _time.sleep(3)
+                                    verify = subprocess.run(
+                                        scope_cmd + ["is-active", svc_name],
+                                        capture_output=True, text=True, timeout=5,
+                                    )
+                                    if verify.stdout.strip() == "active":
+                                        restarted_services.append(svc_name)
+                                    else:
+                                        # Retry once — transient startup failures
+                                        # (stale module cache, import race) often
+                                        # resolve on the second attempt.
+                                        print(f"  ⚠ {svc_name} died after restart, retrying...")
+                                        retry = subprocess.run(
+                                            scope_cmd + ["restart", svc_name],
+                                            capture_output=True, text=True, timeout=15,
+                                        )
+                                        _time.sleep(3)
+                                        verify2 = subprocess.run(
+                                            scope_cmd + ["is-active", svc_name],
+                                            capture_output=True, text=True, timeout=5,
+                                        )
+                                        if verify2.stdout.strip() == "active":
+                                            restarted_services.append(svc_name)
+                                            print(f"  ✓ {svc_name} recovered on retry")
+                                        else:
+                                            print(
+                                                f"  ✗ {svc_name} failed to stay running after restart.\n"
+                                                f"    Check logs: journalctl --user -u {svc_name} --since '2 min ago'\n"
+                                                f"    Restart manually: systemctl {'--user ' if scope == 'user' else ''}restart {svc_name}"
+                                            )
                                else:
                                    print(f"  ⚠ Failed to restart {svc_name}: {restart.stderr.strip()}")
                    except (FileNotFoundError, subprocess.TimeoutExpired):
@@ -44,6 +44,7 @@ OPENROUTER_MODELS: list[tuple[str, str]] = [
    ("minimax/minimax-m2.7",            ""),
    ("minimax/minimax-m2.5",            ""),
    ("z-ai/glm-5.1",                    ""),
+    ("z-ai/glm-5v-turbo",               ""),
    ("z-ai/glm-5-turbo",                ""),
    ("moonshotai/kimi-k2.5",            ""),
    ("x-ai/grok-4.20",                  ""),
@@ -89,6 +90,7 @@ _PROVIDER_MODELS: dict[str, list[str]] = {
        "minimax/minimax-m2.7",
        "minimax/minimax-m2.5",
        "z-ai/glm-5.1",
+        "z-ai/glm-5v-turbo",
        "z-ai/glm-5-turbo",
        "moonshotai/kimi-k2.5",
        "x-ai/grok-4.20-beta",
@@ -134,6 +136,7 @@ _PROVIDER_MODELS: dict[str, list[str]] = {
    "zai": [
        "glm-5.1",
        "glm-5",
+        "glm-5v-turbo",
        "glm-5-turbo",
        "glm-4.7",
        "glm-4.5",
@@ -13,6 +13,7 @@ import asyncio
 import hmac
 import json
 import logging
+import os
 import secrets
 import sys
 import threading
@@ -319,12 +320,68 @@ class EnvVarReveal(BaseModel):
    key: str


+_GATEWAY_HEALTH_URL = os.getenv("GATEWAY_HEALTH_URL")
+_GATEWAY_HEALTH_TIMEOUT = float(os.getenv("GATEWAY_HEALTH_TIMEOUT", "3"))
+
+
+def _probe_gateway_health() -> tuple[bool, dict | None]:
+    """Probe the gateway via its HTTP health endpoint (cross-container).
+
+    Uses ``/health/detailed`` first (returns full state), falling back to
+    the simpler ``/health`` endpoint.  Returns ``(is_alive, body_dict)``.
+
+    Accepts any of these as ``GATEWAY_HEALTH_URL``:
+    - ``http://gateway:8642``                (base URL — recommended)
+    - ``http://gateway:8642/health``         (explicit health path)
+    - ``http://gateway:8642/health/detailed`` (explicit detailed path)
+
+    This is a **blocking** call — run via ``run_in_executor`` from async code.
+    """
+    if not _GATEWAY_HEALTH_URL:
+        return False, None
+
+    # Normalise to base URL so we always probe the right paths regardless of
+    # whether the user included /health or /health/detailed in the env var.
+    base = _GATEWAY_HEALTH_URL.rstrip("/")
+    if base.endswith("/health/detailed"):
+        base = base[: -len("/health/detailed")]
+    elif base.endswith("/health"):
+        base = base[: -len("/health")]
+
+    for path in (f"{base}/health/detailed", f"{base}/health"):
+        try:
+            req = urllib.request.Request(path, method="GET")
+            with urllib.request.urlopen(req, timeout=_GATEWAY_HEALTH_TIMEOUT) as resp:
+                if resp.status == 200:
+                    body = json.loads(resp.read())
+                    return True, body
+        except Exception:
+            continue
+    return False, None
+
+
@app.get("/api/status")
 async def get_status():
    current_ver, latest_ver = check_config_version()

+    # --- Gateway liveness detection ---
+    # Try local PID check first (same-host).  If that fails and a remote
+    # GATEWAY_HEALTH_URL is configured, probe the gateway over HTTP so the
+    # dashboard works when the gateway runs in a separate container.
    gateway_pid = get_running_pid()
    gateway_running = gateway_pid is not None
+    remote_health_body: dict | None = None
+
+    if not gateway_running and _GATEWAY_HEALTH_URL:
+        loop = asyncio.get_event_loop()
+        alive, remote_health_body = await loop.run_in_executor(
+            None, _probe_gateway_health
+        )
+        if alive:
+            gateway_running = True
+            # PID from the remote container (display only — not locally valid)
+            if remote_health_body:
+                gateway_pid = remote_health_body.get("pid")

    gateway_state = None
    gateway_platforms: dict = {}
@@ -341,7 +398,12 @@ async def get_status():
    except Exception:
        configured_gateway_platforms = None

+    # Prefer the detailed health endpoint response (has full state) when the
+    # local runtime status file is absent or stale (cross-container).
    runtime = read_runtime_status()
+    if runtime is None and remote_health_body and remote_health_body.get("gateway_state"):
+        runtime = remote_health_body
+
    if runtime:
        gateway_state = runtime.get("gateway_state")
        gateway_platforms = runtime.get("platforms") or {}
@@ -356,6 +418,17 @@ async def get_status():
        if not gateway_running:
            gateway_state = gateway_state if gateway_state in ("stopped", "startup_failed") else "stopped"
            gateway_platforms = {}
+        elif gateway_running and remote_health_body is not None:
+            # The health probe confirmed the gateway is alive, but the local
+            # runtime status file may be stale (cross-container).  Override
+            # stopped/None state so the dashboard shows the correct badge.
+            if gateway_state in (None, "stopped"):
+                gateway_state = "running"
+
+    # If there was no runtime info at all but the health probe confirmed alive,
+    # ensure we still report the gateway as running (no shared volume scenario).
+    if gateway_running and gateway_state is None and remote_health_body is not None:
+        gateway_state = "running"

    active_sessions = 0
    try:
@@ -78,13 +78,13 @@ dingtalk = ["dingtalk-stream>=0.1.0,<1"]
 feishu = ["lark-oapi>=1.5.3,<2"]
 web = ["fastapi>=0.104.0,<1", "uvicorn[standard]>=0.24.0,<1"]
 rl = [
-  "atroposlib @ git+https://github.com/NousResearch/atropos.git",
-  "tinker @ git+https://github.com/thinking-machines-lab/tinker.git",
+  "atroposlib @ git+https://github.com/NousResearch/atropos.git@c20c85256e5a45ad31edf8b7276e9c5ee1995a30",
+  "tinker @ git+https://github.com/thinking-machines-lab/tinker.git@30517b667f18a3dfb7ef33fb56cf686d5820ba2b",
  "fastapi>=0.104.0,<1",
  "uvicorn[standard]>=0.24.0,<1",
  "wandb>=0.15.0,<1",
 ]
-yc-bench = ["yc-bench @ git+https://github.com/collinear-ai/yc-bench.git ; python_version >= '3.12'"]
+yc-bench = ["yc-bench @ git+https://github.com/collinear-ai/yc-bench.git@bfb0c88062450f46341bd9a5298903fc2e952a5c ; python_version >= '3.12'"]
 all = [
  "hermes-agent[modal]",
  "hermes-agent[daytona]",
@@ -62,6 +62,7 @@ AUTHOR_MAP = {
    "258577966+voidborne-d@users.noreply.github.com": "voidborne-d",
    "70424851+insecurejezza@users.noreply.github.com": "insecurejezza",
    "259807879+Bartok9@users.noreply.github.com": "Bartok9",
+    "268667990+Roy-oss1@users.noreply.github.com": "Roy-oss1",
    # contributors (manual mapping from git names)
    "dmayhem93@gmail.com": "dmahan93",
    "samherring99@gmail.com": "samherring99",
@@ -8,7 +8,7 @@
    "start": "node bridge.js"
  },
  "dependencies": {
-    "@whiskeysockets/baileys": "WhiskeySockets/Baileys#fix/abprops-abt-fetch",
+    "@whiskeysockets/baileys": "WhiskeySockets/Baileys#01047debd81beb20da7b7779b08edcb06aa03770",
    "express": "^4.21.0",
    "qrcode-terminal": "^0.12.0",
    "pino": "^9.0.0"
@@ -0,0 +1,129 @@
+---
+name: architecture-diagram
+description: Generate professional dark-themed system architecture diagrams as standalone HTML/SVG files. Self-contained output with no external dependencies. Based on Cocoon AI's architecture-diagram-generator (MIT).
+version: 1.0.0
+author: Cocoon AI (hello@cocoon-ai.com), ported by Hermes Agent
+license: MIT
+dependencies: []
+metadata:
+  hermes:
+    tags: [architecture, diagrams, SVG, HTML, visualization, infrastructure, cloud]
+    related_skills: [excalidraw]
+---
+
+# Architecture Diagram Skill
+
+Generate professional, dark-themed technical architecture diagrams as standalone HTML files with inline SVG graphics. No external tools, no API keys, no rendering libraries — just write the HTML file and open it in a browser.
+
+Based on [Cocoon AI's architecture-diagram-generator](https://github.com/Cocoon-AI/architecture-diagram-generator) (MIT).
+
+## Workflow
+
+1. User describes their system architecture (components, connections, technologies)
+2. Generate the HTML file following the design system below
+3. Save with `write_file` to a `.html` file (e.g. `~/architecture-diagram.html`)
+4. User opens in any browser — works offline, no dependencies
+
+### Output Location
+
+Save diagrams to a user-specified path, or default to the current working directory:
+```
+./[project-name]-architecture.html
+```
+
+### Preview
+
+After saving, suggest the user open it:
+```bash
+# macOS
+open ./my-architecture.html
+# Linux
+xdg-open ./my-architecture.html
+```
+
+## Design System & Visual Language
+
+### Color Palette (Semantic Mapping)
+
+Use specific `rgba` fills and hex strokes to categorize components:
+
+| Component Type | Fill (rgba) | Stroke (Hex) |
+| :--- | :--- | :--- |
+| **Frontend** | `rgba(8, 51, 68, 0.4)` | `#22d3ee` (cyan-400) |
+| **Backend** | `rgba(6, 78, 59, 0.4)` | `#34d399` (emerald-400) |
+| **Database** | `rgba(76, 29, 149, 0.4)` | `#a78bfa` (violet-400) |
+| **AWS/Cloud** | `rgba(120, 53, 15, 0.3)` | `#fbbf24` (amber-400) |
+| **Security** | `rgba(136, 19, 55, 0.4)` | `#fb7185` (rose-400) |
+| **Message Bus** | `rgba(251, 146, 60, 0.3)` | `#fb923c` (orange-400) |
+| **External** | `rgba(30, 41, 59, 0.5)` | `#94a3b8` (slate-400) |
+
+### Typography & Background
+- **Font:** JetBrains Mono (Monospace), loaded from Google Fonts
+- **Sizes:** 12px (Names), 9px (Sublabels), 8px (Annotations), 7px (Tiny labels)
+- **Background:** Slate-950 (`#020617`) with a subtle 40px grid pattern
+
+```svg
+<!-- Background Grid Pattern -->
+<pattern id="grid" width="40" height="40" patternUnits="userSpaceOnUse">
+  <path d="M 40 0 L 0 0 0 40" fill="none" stroke="#1e293b" stroke-width="0.5"/>
+</pattern>
+```
+
+## Technical Implementation Details
+
+### Component Rendering
+Components are rounded rectangles (`rx="6"`) with 1.5px strokes. To prevent arrows from showing through semi-transparent fills, use a **double-rect masking technique**:
+1. Draw an opaque background rect (`#0f172a`)
+2. Draw the semi-transparent styled rect on top
+
+### Connection Rules
+- **Z-Order:** Draw arrows *early* in the SVG (after the grid) so they render behind component boxes
+- **Arrowheads:** Defined via SVG markers
+- **Security Flows:** Use dashed lines in rose color (`#fb7185`)
+- **Boundaries:**
+  - *Security Groups:* Dashed (`4,4`), rose color
+  - *Regions:* Large dashed (`8,4`), amber color, `rx="12"`
+
+### Spacing & Layout Logic
+- **Standard Height:** 60px (Services); 80-120px (Large components)
+- **Vertical Gap:** Minimum 40px between components
+- **Message Buses:** Must be placed *in the gap* between services, not overlapping them
+- **Legend Placement:** **CRITICAL.** Must be placed outside all boundary boxes. Calculate the lowest Y-coordinate of all boundaries and place the legend at least 20px below it.
+
+## Document Structure
+
+The generated HTML file follows a four-part layout:
+1. **Header:** Title with a pulsing dot indicator and subtitle
+2. **Main SVG:** The diagram contained within a rounded border card
+3. **Summary Cards:** A grid of three cards below the diagram for high-level details
+4. **Footer:** Minimal metadata
+
+### Info Card Pattern
+```html
+<div class="card">
+  <div class="card-header">
+    <div class="card-dot cyan"></div>
+    <h3>Title</h3>
+  </div>
+  <ul>
+    <li>• Item one</li>
+    <li>• Item two</li>
+  </ul>
+</div>
+```
+
+## Output Requirements
+- **Single File:** One self-contained `.html` file
+- **No External Dependencies:** All CSS and SVG must be inline (except Google Fonts)
+- **No JavaScript:** Use pure CSS for any animations (like pulsing dots)
+- **Compatibility:** Must render correctly in any modern web browser
+
+## Template Reference
+
+Load the full HTML template for the exact structure, CSS, and SVG component examples:
+
+```
+skill_view(name="architecture-diagram", file_path="templates/template.html")
+```
+
+The template contains working examples of every component type (frontend, backend, database, cloud, security), arrow styles (standard, dashed, curved), security groups, region boundaries, and the legend — use it as your structural reference when generating diagrams.
@@ -0,0 +1,319 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>[PROJECT NAME] Architecture Diagram</title>
+  <link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;500;600;700&display=swap" rel="stylesheet">
+  <style>
+    * {
+      margin: 0;
+      padding: 0;
+      box-sizing: border-box;
+    }
+    
+    body {
+      font-family: 'JetBrains Mono', monospace;
+      background: #020617;
+      min-height: 100vh;
+      padding: 2rem;
+      color: white;
+    }
+    
+    .container {
+      max-width: 1200px;
+      margin: 0 auto;
+    }
+    
+    .header {
+      margin-bottom: 2rem;
+    }
+    
+    .header-row {
+      display: flex;
+      align-items: center;
+      gap: 1rem;
+      margin-bottom: 0.5rem;
+    }
+    
+    .pulse-dot {
+      width: 12px;
+      height: 12px;
+      background: #22d3ee;
+      border-radius: 50%;
+      animation: pulse 2s infinite;
+    }
+    
+    @keyframes pulse {
+      0%, 100% { opacity: 1; }
+      50% { opacity: 0.5; }
+    }
+    
+    h1 {
+      font-size: 1.5rem;
+      font-weight: 700;
+      letter-spacing: -0.025em;
+    }
+    
+    .subtitle {
+      color: #94a3b8;
+      font-size: 0.875rem;
+      margin-left: 1.75rem;
+    }
+    
+    .diagram-container {
+      background: rgba(15, 23, 42, 0.5);
+      border-radius: 1rem;
+      border: 1px solid #1e293b;
+      padding: 1.5rem;
+      overflow-x: auto;
+    }
+    
+    svg {
+      width: 100%;
+      min-width: 900px;
+      display: block;
+    }
+    
+    .cards {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
+      gap: 1rem;
+      margin-top: 2rem;
+    }
+    
+    .card {
+      background: rgba(15, 23, 42, 0.5);
+      border-radius: 0.75rem;
+      border: 1px solid #1e293b;
+      padding: 1.25rem;
+    }
+    
+    .card-header {
+      display: flex;
+      align-items: center;
+      gap: 0.5rem;
+      margin-bottom: 0.75rem;
+    }
+    
+    .card-dot {
+      width: 8px;
+      height: 8px;
+      border-radius: 50%;
+    }
+    
+    .card-dot.cyan { background: #22d3ee; }
+    .card-dot.emerald { background: #34d399; }
+    .card-dot.violet { background: #a78bfa; }
+    .card-dot.amber { background: #fbbf24; }
+    .card-dot.rose { background: #fb7185; }
+    
+    .card h3 {
+      font-size: 0.875rem;
+      font-weight: 600;
+    }
+    
+    .card ul {
+      list-style: none;
+      color: #94a3b8;
+      font-size: 0.75rem;
+    }
+    
+    .card li {
+      margin-bottom: 0.375rem;
+    }
+    
+    .footer {
+      text-align: center;
+      margin-top: 1.5rem;
+      color: #475569;
+      font-size: 0.75rem;
+    }
+  </style>
+</head>
+<body>
+  <div class="container">
+    <!-- Header -->
+    <div class="header">
+      <div class="header-row">
+        <div class="pulse-dot"></div>
+        <h1>[PROJECT NAME] Architecture</h1>
+      </div>
+      <p class="subtitle">[Subtitle description]</p>
+    </div>
+
+    <!-- Main Diagram -->
+    <div class="diagram-container">
+      <svg viewBox="0 0 1000 680">
+        <!-- Definitions -->
+        <defs>
+          <marker id="arrowhead" markerWidth="10" markerHeight="7" refX="9" refY="3.5" orient="auto">
+            <polygon points="0 0, 10 3.5, 0 7" fill="#64748b" />
+          </marker>
+          <pattern id="grid" width="40" height="40" patternUnits="userSpaceOnUse">
+            <path d="M 40 0 L 0 0 0 40" fill="none" stroke="#1e293b" stroke-width="0.5"/>
+          </pattern>
+        </defs>
+
+        <!-- Background Grid -->
+        <rect width="100%" height="100%" fill="url(#grid)" />
+
+        <!-- =================================================================
+             COMPONENT EXAMPLES - Copy and customize these patterns
+             ================================================================= -->
+
+        <!-- External/Generic Component -->
+        <rect x="30" y="280" width="100" height="50" rx="6" fill="rgba(30, 41, 59, 0.5)" stroke="#94a3b8" stroke-width="1.5"/>
+        <text x="80" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">Users</text>
+        <text x="80" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">Browser/Mobile</text>
+
+        <!-- Security Component -->
+        <rect x="30" y="80" width="100" height="60" rx="6" fill="rgba(136, 19, 55, 0.4)" stroke="#fb7185" stroke-width="1.5"/>
+        <text x="80" y="105" fill="white" font-size="11" font-weight="600" text-anchor="middle">Auth Provider</text>
+        <text x="80" y="121" fill="#94a3b8" font-size="9" text-anchor="middle">OAuth 2.0</text>
+
+        <!-- Region/Cloud Boundary -->
+        <rect x="160" y="40" width="820" height="620" rx="12" fill="rgba(251, 191, 36, 0.05)" stroke="#fbbf24" stroke-width="1" stroke-dasharray="8,4"/>
+        <text x="172" y="58" fill="#fbbf24" font-size="10" font-weight="600">AWS Region: us-west-2</text>
+
+        <!-- AWS/Cloud Service -->
+        <rect x="200" y="280" width="110" height="50" rx="6" fill="rgba(120, 53, 15, 0.3)" stroke="#fbbf24" stroke-width="1.5"/>
+        <text x="255" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">CloudFront</text>
+        <text x="255" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">CDN</text>
+
+        <!-- Multi-line AWS Component (S3 Buckets example) -->
+        <rect x="200" y="380" width="110" height="100" rx="6" fill="rgba(120, 53, 15, 0.3)" stroke="#fbbf24" stroke-width="1.5"/>
+        <text x="255" y="400" fill="white" font-size="11" font-weight="600" text-anchor="middle">S3 Buckets</text>
+        <text x="255" y="420" fill="#94a3b8" font-size="8" text-anchor="middle">• bucket-one</text>
+        <text x="255" y="434" fill="#94a3b8" font-size="8" text-anchor="middle">• bucket-two</text>
+        <text x="255" y="448" fill="#94a3b8" font-size="8" text-anchor="middle">• bucket-three</text>
+        <text x="255" y="466" fill="#fbbf24" font-size="7" text-anchor="middle">OAI Protected</text>
+
+        <!-- Security Group (dashed boundary) -->
+        <rect x="350" y="265" width="120" height="80" rx="8" fill="transparent" stroke="#fb7185" stroke-width="1" stroke-dasharray="4,4"/>
+        <text x="358" y="279" fill="#fb7185" font-size="8">sg-name :port</text>
+        
+        <!-- Component inside security group -->
+        <rect x="360" y="280" width="100" height="50" rx="6" fill="rgba(120, 53, 15, 0.3)" stroke="#fbbf24" stroke-width="1.5"/>
+        <text x="410" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">Load Balancer</text>
+        <text x="410" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">HTTPS :443</text>
+
+        <!-- Backend Component -->
+        <rect x="510" y="280" width="110" height="50" rx="6" fill="rgba(6, 78, 59, 0.4)" stroke="#34d399" stroke-width="1.5"/>
+        <text x="565" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">API Server</text>
+        <text x="565" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">FastAPI :8000</text>
+
+        <!-- Database Component -->
+        <rect x="700" y="280" width="120" height="50" rx="6" fill="rgba(76, 29, 149, 0.4)" stroke="#a78bfa" stroke-width="1.5"/>
+        <text x="760" y="300" fill="white" font-size="11" font-weight="600" text-anchor="middle">Database</text>
+        <text x="760" y="316" fill="#94a3b8" font-size="9" text-anchor="middle">PostgreSQL</text>
+
+        <!-- Frontend Component -->
+        <rect x="200" y="520" width="200" height="110" rx="8" fill="rgba(8, 51, 68, 0.4)" stroke="#22d3ee" stroke-width="1.5"/>
+        <text x="300" y="545" fill="white" font-size="12" font-weight="600" text-anchor="middle">Frontend</text>
+        <text x="300" y="565" fill="#94a3b8" font-size="9" text-anchor="middle">React + TypeScript</text>
+        <text x="300" y="580" fill="#94a3b8" font-size="9" text-anchor="middle">Additional detail</text>
+        <text x="300" y="595" fill="#94a3b8" font-size="9" text-anchor="middle">More info</text>
+        <text x="300" y="615" fill="#22d3ee" font-size="8" text-anchor="middle">domain.example.com</text>
+
+        <!-- =================================================================
+             ARROW EXAMPLES
+             ================================================================= -->
+
+        <!-- Standard arrow with label -->
+        <line x1="130" y1="305" x2="198" y2="305" stroke="#22d3ee" stroke-width="1.5" marker-end="url(#arrowhead)"/>
+        <text x="164" y="299" fill="#94a3b8" font-size="9" text-anchor="middle">HTTPS</text>
+        
+        <!-- Simple arrow (no label) -->
+        <line x1="310" y1="305" x2="358" y2="305" stroke="#22d3ee" stroke-width="1.5" marker-end="url(#arrowhead)"/>
+        
+        <!-- Vertical arrow -->
+        <line x1="255" y1="330" x2="255" y2="378" stroke="#fbbf24" stroke-width="1.5" marker-end="url(#arrowhead)"/>
+        <text x="270" y="358" fill="#94a3b8" font-size="9">OAI</text>
+        
+        <!-- Dashed arrow (for auth/security flows) -->
+        <line x1="460" y1="305" x2="508" y2="305" stroke="#34d399" stroke-width="1.5" marker-end="url(#arrowhead)"/>
+        <line x1="620" y1="305" x2="698" y2="305" stroke="#a78bfa" stroke-width="1.5" marker-end="url(#arrowhead)"/>
+        <text x="655" y="299" fill="#94a3b8" font-size="9">TLS</text>
+
+        <!-- Curved path for auth flow -->
+        <path d="M 80 140 L 80 200 Q 80 220 100 220 L 200 220 Q 220 220 220 240 L 220 278" fill="none" stroke="#fb7185" stroke-width="1.5" stroke-dasharray="5,5"/>
+        <text x="150" y="210" fill="#fb7185" font-size="8">JWT + PKCE</text>
+
+        <!-- =================================================================
+             LEGEND
+             ================================================================= -->
+        <text x="720" y="70" fill="white" font-size="10" font-weight="600">Legend</text>
+        
+        <rect x="720" y="82" width="16" height="10" rx="2" fill="rgba(8, 51, 68, 0.4)" stroke="#22d3ee" stroke-width="1"/>
+        <text x="742" y="90" fill="#94a3b8" font-size="8">Frontend</text>
+        
+        <rect x="720" y="98" width="16" height="10" rx="2" fill="rgba(6, 78, 59, 0.4)" stroke="#34d399" stroke-width="1"/>
+        <text x="742" y="106" fill="#94a3b8" font-size="8">Backend</text>
+        
+        <rect x="720" y="114" width="16" height="10" rx="2" fill="rgba(120, 53, 15, 0.3)" stroke="#fbbf24" stroke-width="1"/>
+        <text x="742" y="122" fill="#94a3b8" font-size="8">Cloud Service</text>
+        
+        <rect x="720" y="130" width="16" height="10" rx="2" fill="rgba(76, 29, 149, 0.4)" stroke="#a78bfa" stroke-width="1"/>
+        <text x="742" y="138" fill="#94a3b8" font-size="8">Database</text>
+        
+        <rect x="720" y="146" width="16" height="10" rx="2" fill="rgba(136, 19, 55, 0.4)" stroke="#fb7185" stroke-width="1"/>
+        <text x="742" y="154" fill="#94a3b8" font-size="8">Security</text>
+        
+        <line x1="720" y1="168" x2="736" y2="168" stroke="#fb7185" stroke-width="1" stroke-dasharray="3,3"/>
+        <text x="742" y="171" fill="#94a3b8" font-size="8">Auth Flow</text>
+        
+        <rect x="720" y="178" width="16" height="10" rx="2" fill="transparent" stroke="#fb7185" stroke-width="1" stroke-dasharray="3,3"/>
+        <text x="742" y="186" fill="#94a3b8" font-size="8">Security Group</text>
+      </svg>
+    </div>
+
+    <!-- Info Cards -->
+    <div class="cards">
+      <div class="card">
+        <div class="card-header">
+          <div class="card-dot rose"></div>
+          <h3>Card Title 1</h3>
+        </div>
+        <ul>
+          <li>• Item one</li>
+          <li>• Item two</li>
+          <li>• Item three</li>
+          <li>• Item four</li>
+        </ul>
+      </div>
+
+      <div class="card">
+        <div class="card-header">
+          <div class="card-dot amber"></div>
+          <h3>Card Title 2</h3>
+        </div>
+        <ul>
+          <li>• Item one</li>
+          <li>• Item two</li>
+          <li>• Item three</li>
+          <li>• Item four</li>
+        </ul>
+      </div>
+
+      <div class="card">
+        <div class="card-header">
+          <div class="card-dot violet"></div>
+          <h3>Card Title 3</h3>
+        </div>
+        <ul>
+          <li>• Item one</li>
+          <li>• Item two</li>
+          <li>• Item three</li>
+          <li>• Item four</li>
+        </ul>
+      </div>
+    </div>
+
+    <!-- Footer -->
+    <p class="footer">
+      [Project Name] • [Additional metadata]
+    </p>
+  </div>
+</body>
+</html>
@@ -781,83 +781,3 @@ class TestTokenBudgetTailProtection:
        # Tool at index 2 is outside the protected tail (last 3 = indices 2,3,4)
        # so it might or might not be pruned depending on boundary
        assert isinstance(pruned, int)
-
-
-class TestSerializeRedactsSecrets:
-    """Verify that _serialize_for_summary strips secrets before they reach the summarizer LLM."""
-
-    def _make_compressor(self):
-        with patch("agent.context_compressor.get_model_context_length", return_value=100000):
-            return ContextCompressor(model="test", quiet_mode=True)
-
-    def test_redacts_api_key_in_tool_result(self):
-        c = self._make_compressor()
-        turns = [{"role": "tool", "content": "OPENAI_API_KEY=sk-proj-abc123def456ghi789jkl012", "tool_call_id": "tc1"}]
-        result = c._serialize_for_summary(turns)
-        assert "abc123def456" not in result
-        assert "sk-proj" not in result
-
-    def test_redacts_api_key_in_user_message(self):
-        c = self._make_compressor()
-        turns = [{"role": "user", "content": "My key is sk-proj-abc123def456ghi789jkl012"}]
-        result = c._serialize_for_summary(turns)
-        assert "abc123def456" not in result
-
-    def test_redacts_secret_in_tool_call_arguments(self):
-        c = self._make_compressor()
-        turns = [{
-            "role": "assistant",
-            "content": "",
-            "tool_calls": [{
-                "function": {
-                    "name": "bash",
-                    "arguments": '{"command": "export OPENAI_API_KEY=sk-proj-abc123def456ghi789jkl012"}',
-                },
-            }],
-        }]
-        result = c._serialize_for_summary(turns)
-        assert "abc123def456" not in result
-
-    def test_redacts_github_pat_in_assistant_content(self):
-        c = self._make_compressor()
-        turns = [{"role": "assistant", "content": "Found token: ghp_abcdef1234567890abcdef1234567890abcd"}]
-        result = c._serialize_for_summary(turns)
-        assert "abcdef1234567890" not in result
-
-    def test_preserves_non_secret_content(self):
-        c = self._make_compressor()
-        turns = [
-            {"role": "user", "content": "Please fix the bug in src/main.py"},
-            {"role": "assistant", "content": "I found the issue on line 42."},
-        ]
-        result = c._serialize_for_summary(turns)
-        assert "src/main.py" in result
-        assert "line 42" in result
-
-
-class TestGenerateSummaryRedactsOutput:
-    """Verify that _generate_summary redacts the summarizer LLM's output."""
-
-    def test_summary_output_is_redacted(self):
-        """If the summarizer LLM echoes a secret despite instructions, it gets redacted."""
-        mock_response = MagicMock()
-        mock_response.choices = [MagicMock()]
-        mock_response.choices[0].message.content = (
-            "## Goal\nDeploy app.\n## Critical Context\n"
-            "User's API key: sk-proj-abc123def456ghi789jkl012"
-        )
-
-        with patch("agent.context_compressor.get_model_context_length", return_value=100000):
-            c = ContextCompressor(model="test", quiet_mode=True)
-
-        messages = [
-            {"role": "user", "content": "deploy my app"},
-            {"role": "assistant", "content": "deploying now"},
-        ]
-
-        with patch("agent.context_compressor.call_llm", return_value=mock_response):
-            summary = c._generate_summary(messages)
-
-        assert "abc123def456" not in summary
-        # Also verify _previous_summary is redacted (iterative update path)
-        assert "abc123def456" not in (c._previous_summary or "")
@@ -93,6 +93,12 @@ def make_restart_runner(
    runner._running_agent_count = GatewayRunner._running_agent_count.__get__(
        runner, GatewayRunner
    )
+    runner._snapshot_running_agents = GatewayRunner._snapshot_running_agents.__get__(
+        runner, GatewayRunner
+    )
+    runner._notify_active_sessions_of_shutdown = (
+        GatewayRunner._notify_active_sessions_of_shutdown.__get__(runner, GatewayRunner)
+    )
    runner._launch_detached_restart_command = GatewayRunner._launch_detached_restart_command.__get__(
        runner, GatewayRunner
    )
@@ -220,6 +220,7 @@ def _create_app(adapter: APIServerAdapter) -> web.Application:
    app = web.Application(middlewares=mws)
    app["api_server_adapter"] = adapter
    app.router.add_get("/health", adapter._handle_health)
+    app.router.add_get("/health/detailed", adapter._handle_health_detailed)
    app.router.add_get("/v1/health", adapter._handle_health)
    app.router.add_get("/v1/models", adapter._handle_models)
    app.router.add_post("/v1/chat/completions", adapter._handle_chat_completions)
@@ -277,6 +278,58 @@ class TestHealthEndpoint:
            assert data["platform"] == "hermes-agent"


+# ---------------------------------------------------------------------------
+# /health/detailed endpoint
+# ---------------------------------------------------------------------------
+
+
+class TestHealthDetailedEndpoint:
+    @pytest.mark.asyncio
+    async def test_health_detailed_returns_ok(self, adapter):
+        """GET /health/detailed returns status, platform, and runtime fields."""
+        app = _create_app(adapter)
+        with patch("gateway.status.read_runtime_status", return_value={
+            "gateway_state": "running",
+            "platforms": {"telegram": {"state": "connected"}},
+            "active_agents": 2,
+            "exit_reason": None,
+            "updated_at": "2026-04-14T00:00:00Z",
+        }):
+            async with TestClient(TestServer(app)) as cli:
+                resp = await cli.get("/health/detailed")
+                assert resp.status == 200
+                data = await resp.json()
+                assert data["status"] == "ok"
+                assert data["platform"] == "hermes-agent"
+                assert data["gateway_state"] == "running"
+                assert data["platforms"] == {"telegram": {"state": "connected"}}
+                assert data["active_agents"] == 2
+                assert isinstance(data["pid"], int)
+                assert "updated_at" in data
+
+    @pytest.mark.asyncio
+    async def test_health_detailed_no_runtime_status(self, adapter):
+        """When gateway_state.json is missing, fields are None."""
+        app = _create_app(adapter)
+        with patch("gateway.status.read_runtime_status", return_value=None):
+            async with TestClient(TestServer(app)) as cli:
+                resp = await cli.get("/health/detailed")
+                assert resp.status == 200
+                data = await resp.json()
+                assert data["status"] == "ok"
+                assert data["gateway_state"] is None
+                assert data["platforms"] == {}
+
+    @pytest.mark.asyncio
+    async def test_health_detailed_does_not_require_auth(self, auth_adapter):
+        """Health detailed endpoint should be accessible without auth, like /health."""
+        app = _create_app(auth_adapter)
+        with patch("gateway.status.read_runtime_status", return_value=None):
+            async with TestClient(TestServer(app)) as cli:
+                resp = await cli.get("/health/detailed")
+                assert resp.status == 200
+
+
 # ---------------------------------------------------------------------------
 # /v1/models endpoint
 # ---------------------------------------------------------------------------
@@ -19,10 +19,34 @@ def _ensure_discord_mock():
    discord_mod.Thread = type("Thread", (), {})
    discord_mod.ForumChannel = type("ForumChannel", (), {})
    discord_mod.Interaction = object
+
+    # Lightweight mock for app_commands.Group and Command used by
+    # _register_skill_group.
+    class _FakeGroup:
+        def __init__(self, *, name, description, parent=None):
+            self.name = name
+            self.description = description
+            self.parent = parent
+            self._children: dict[str, object] = {}
+            if parent is not None:
+                parent.add_command(self)
+
+        def add_command(self, cmd):
+            self._children[cmd.name] = cmd
+
+    class _FakeCommand:
+        def __init__(self, *, name, description, callback, parent=None):
+            self.name = name
+            self.description = description
+            self.callback = callback
+            self.parent = parent
+
    discord_mod.app_commands = SimpleNamespace(
        describe=lambda **kwargs: (lambda fn: fn),
        choices=lambda **kwargs: (lambda fn: fn),
        Choice=lambda **kwargs: SimpleNamespace(**kwargs),
+        Group=_FakeGroup,
+        Command=_FakeCommand,
    )

    ext_mod = MagicMock()
@@ -51,6 +75,12 @@ class FakeTree:

        return decorator

+    def add_command(self, cmd):
+        self.commands[cmd.name] = cmd
+
+    def get_commands(self):
+        return [SimpleNamespace(name=n) for n in self.commands]
+

@pytest.fixture
 def adapter():
@@ -498,3 +528,79 @@ def test_discord_auto_thread_config_bridge(monkeypatch, tmp_path):

    import os
    assert os.getenv("DISCORD_AUTO_THREAD") == "true"
+
+
+# ------------------------------------------------------------------
+# /skill group registration
+# ------------------------------------------------------------------
+
+
+def test_register_skill_group_creates_group(adapter):
+    """_register_skill_group should register a '/skill' Group on the tree."""
+    mock_categories = {
+        "creative": [
+            ("ascii-art", "Generate ASCII art", "/ascii-art"),
+            ("excalidraw", "Hand-drawn diagrams", "/excalidraw"),
+        ],
+        "media": [
+            ("gif-search", "Search for GIFs", "/gif-search"),
+        ],
+    }
+    mock_uncategorized = [
+        ("dogfood", "Exploratory QA testing", "/dogfood"),
+    ]
+
+    with patch(
+        "hermes_cli.commands.discord_skill_commands_by_category",
+        return_value=(mock_categories, mock_uncategorized, 0),
+    ):
+        adapter._register_slash_commands()
+
+    tree = adapter._client.tree
+    assert "skill" in tree.commands, "Expected /skill group to be registered"
+    skill_group = tree.commands["skill"]
+    assert skill_group.name == "skill"
+    # Should have 2 category subgroups + 1 uncategorized subcommand
+    children = skill_group._children
+    assert "creative" in children
+    assert "media" in children
+    assert "dogfood" in children
+    # Category groups should have their skills
+    assert "ascii-art" in children["creative"]._children
+    assert "excalidraw" in children["creative"]._children
+    assert "gif-search" in children["media"]._children
+
+
+def test_register_skill_group_empty_skills_no_group(adapter):
+    """No /skill group should be added when there are zero skills."""
+    with patch(
+        "hermes_cli.commands.discord_skill_commands_by_category",
+        return_value=({}, [], 0),
+    ):
+        adapter._register_slash_commands()
+
+    tree = adapter._client.tree
+    assert "skill" not in tree.commands
+
+
+def test_register_skill_group_handler_dispatches_command(adapter):
+    """Skill subcommand handlers should dispatch the correct /cmd-key text."""
+    mock_categories = {
+        "media": [
+            ("gif-search", "Search for GIFs", "/gif-search"),
+        ],
+    }
+
+    with patch(
+        "hermes_cli.commands.discord_skill_commands_by_category",
+        return_value=(mock_categories, [], 0),
+    ):
+        adapter._register_slash_commands()
+
+    skill_group = adapter._client.tree.commands["skill"]
+    media_group = skill_group._children["media"]
+    gif_cmd = media_group._children["gif-search"]
+    assert gif_cmd.callback is not None
+    # The callback name should reflect the skill
+    assert "gif_search" in gif_cmd.callback.__name__
+
@@ -1,12 +1,11 @@
 """Tests for Feishu interactive card approval buttons."""

-import asyncio
+import importlib.util
 import json
-import os
 import sys
 from pathlib import Path
 from types import SimpleNamespace
-from unittest.mock import AsyncMock, MagicMock, Mock, patch
+from unittest.mock import AsyncMock, MagicMock, patch

 import pytest

@@ -23,14 +22,14 @@ if _repo not in sys.path:
 # ---------------------------------------------------------------------------
 def _ensure_feishu_mocks():
    """Provide stubs for lark-oapi / aiohttp.web so the import succeeds."""
-    if "lark_oapi" not in sys.modules:
+    if importlib.util.find_spec("lark_oapi") is None and "lark_oapi" not in sys.modules:
        mod = MagicMock()
        for name in (
            "lark_oapi", "lark_oapi.api.im.v1",
            "lark_oapi.event", "lark_oapi.event.callback_type",
        ):
            sys.modules.setdefault(name, mod)
-    if "aiohttp" not in sys.modules:
+    if importlib.util.find_spec("aiohttp") is None and "aiohttp" not in sys.modules:
        aio = MagicMock()
        sys.modules.setdefault("aiohttp", aio)
        sys.modules.setdefault("aiohttp.web", aio.web)
@@ -39,6 +38,7 @@ def _ensure_feishu_mocks():
 _ensure_feishu_mocks()

 from gateway.config import PlatformConfig
+import gateway.platforms.feishu as feishu_module
 from gateway.platforms.feishu import FeishuAdapter


@@ -74,6 +74,12 @@ def _make_card_action_data(
    )


+def _close_submitted_coro(coro, _loop):
+    """Close scheduled coroutines in sync-handler tests to avoid unawaited warnings."""
+    coro.close()
+    return SimpleNamespace(add_done_callback=lambda *_args, **_kwargs: None)
+
+
 # ===========================================================================
 # send_exec_approval — interactive card with buttons
 # ===========================================================================
@@ -203,14 +209,14 @@ class TestFeishuExecApproval:


 # ===========================================================================
-# _handle_card_action_event — approval button clicks
+# _resolve_approval — approval state pop + gateway resolution
 # ===========================================================================

-class TestFeishuApprovalCallback:
-    """Test the approval intercept in _handle_card_action_event."""
+class TestResolveApproval:
+    """Test _resolve_approval pops state and calls resolve_gateway_approval."""

    @pytest.mark.asyncio
-    async def test_resolves_approval_on_click(self):
+    async def test_resolves_once(self):
        adapter = _make_adapter()
        adapter._approval_state[1] = {
            "session_key": "agent:main:feishu:group:oc_12345",
@@ -218,28 +224,14 @@ class TestFeishuApprovalCallback:
            "chat_id": "oc_12345",
        }

-        data = _make_card_action_data(
-            action_value={"hermes_action": "approve_once", "approval_id": 1},
-        )
-
-        with (
-            patch.object(
-                adapter, "_resolve_sender_profile", new_callable=AsyncMock,
-                return_value={"user_id": "ou_user1", "user_name": "Norbert", "user_id_alt": None},
-            ),
-            patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
-            patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
-        ):
-            await adapter._handle_card_action_event(data)
+        with patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve:
+            await adapter._resolve_approval(1, "once", "Norbert")

        mock_resolve.assert_called_once_with("agent:main:feishu:group:oc_12345", "once")
-        mock_update.assert_called_once_with("msg_001", "Approved once", "Norbert", "once")
-
-        # State should be cleaned up
        assert 1 not in adapter._approval_state

    @pytest.mark.asyncio
-    async def test_deny_button(self):
+    async def test_resolves_deny(self):
        adapter = _make_adapter()
        adapter._approval_state[2] = {
            "session_key": "some-session",
@@ -247,26 +239,13 @@ class TestFeishuApprovalCallback:
            "chat_id": "oc_12345",
        }

-        data = _make_card_action_data(
-            action_value={"hermes_action": "deny", "approval_id": 2},
-            token="tok_deny",
-        )
-
-        with (
-            patch.object(
-                adapter, "_resolve_sender_profile", new_callable=AsyncMock,
-                return_value={"user_id": "ou_alice", "user_name": "Alice", "user_id_alt": None},
-            ),
-            patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
-            patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
-        ):
-            await adapter._handle_card_action_event(data)
+        with patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve:
+            await adapter._resolve_approval(2, "deny", "Alice")

        mock_resolve.assert_called_once_with("some-session", "deny")
-        mock_update.assert_called_once_with("msg_002", "Denied", "Alice", "deny")

    @pytest.mark.asyncio
-    async def test_session_approval(self):
+    async def test_resolves_session(self):
        adapter = _make_adapter()
        adapter._approval_state[3] = {
            "session_key": "sess-3",
@@ -274,26 +253,13 @@ class TestFeishuApprovalCallback:
            "chat_id": "oc_99",
        }

-        data = _make_card_action_data(
-            action_value={"hermes_action": "approve_session", "approval_id": 3},
-            token="tok_ses",
-        )
-
-        with (
-            patch.object(
-                adapter, "_resolve_sender_profile", new_callable=AsyncMock,
-                return_value={"user_id": "ou_u", "user_name": "Bob", "user_id_alt": None},
-            ),
-            patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
-            patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
-        ):
-            await adapter._handle_card_action_event(data)
+        with patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve:
+            await adapter._resolve_approval(3, "session", "Bob")

        mock_resolve.assert_called_once_with("sess-3", "session")
-        mock_update.assert_called_once_with("msg_003", "Approved for session", "Bob", "session")

    @pytest.mark.asyncio
-    async def test_always_approval(self):
+    async def test_resolves_always(self):
        adapter = _make_adapter()
        adapter._approval_state[4] = {
            "session_key": "sess-4",
@@ -301,42 +267,29 @@ class TestFeishuApprovalCallback:
            "chat_id": "oc_55",
        }

-        data = _make_card_action_data(
-            action_value={"hermes_action": "approve_always", "approval_id": 4},
-            token="tok_alw",
-        )
-
-        with (
-            patch.object(
-                adapter, "_resolve_sender_profile", new_callable=AsyncMock,
-                return_value={"user_id": "ou_u", "user_name": "Carol", "user_id_alt": None},
-            ),
-            patch.object(adapter, "_update_approval_card", new_callable=AsyncMock),
-            patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
-        ):
-            await adapter._handle_card_action_event(data)
+        with patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve:
+            await adapter._resolve_approval(4, "always", "Carol")

        mock_resolve.assert_called_once_with("sess-4", "always")

    @pytest.mark.asyncio
    async def test_already_resolved_drops_silently(self):
        adapter = _make_adapter()
-        # No state for approval_id 99 — already resolved
-
-        data = _make_card_action_data(
-            action_value={"hermes_action": "approve_once", "approval_id": 99},
-            token="tok_gone",
-        )

        with patch("tools.approval.resolve_gateway_approval") as mock_resolve:
-            await adapter._handle_card_action_event(data)
+            await adapter._resolve_approval(99, "once", "Nobody")

-        # Should NOT resolve — already handled
        mock_resolve.assert_not_called()

+# ===========================================================================
+# _handle_card_action_event — non-approval card actions
+# ===========================================================================
+
+class TestNonApprovalCardAction:
+    """Non-approval card actions should still route as synthetic commands."""
+
    @pytest.mark.asyncio
-    async def test_non_approval_actions_route_normally(self):
-        """Non-approval card actions should still become synthetic commands."""
+    async def test_routes_as_synthetic_command(self):
        adapter = _make_adapter()

        data = _make_card_action_data(
@@ -351,82 +304,141 @@ class TestFeishuApprovalCallback:
            ),
            patch.object(adapter, "get_chat_info", new_callable=AsyncMock, return_value={"name": "Test Chat"}),
            patch.object(adapter, "_handle_message_with_guards", new_callable=AsyncMock) as mock_handle,
-            patch("tools.approval.resolve_gateway_approval") as mock_resolve,
        ):
            await adapter._handle_card_action_event(data)

-        # Should NOT resolve any approval
-        mock_resolve.assert_not_called()
-        # Should have routed as synthetic command
        mock_handle.assert_called_once()
        event = mock_handle.call_args[0][0]
        assert "/card button" in event.text


 # ===========================================================================
-# _update_approval_card — card replacement after resolution
+# _on_card_action_trigger — inline card response for approval actions
 # ===========================================================================

-class TestFeishuUpdateApprovalCard:
-    """Test the card update after approval resolution."""
+class _FakeCallBackCard:
+    def __init__(self):
+        self.type = None
+        self.data = None

-    @pytest.mark.asyncio
-    async def test_updates_card_on_approve(self):
+
+class _FakeP2Response:
+    def __init__(self):
+        self.card = None
+
+
+@pytest.fixture(autouse=False)
+def _patch_callback_card_types(monkeypatch):
+    """Provide real-ish P2CardActionTriggerResponse / CallBackCard for tests."""
+    monkeypatch.setattr(feishu_module, "P2CardActionTriggerResponse", _FakeP2Response)
+    monkeypatch.setattr(feishu_module, "CallBackCard", _FakeCallBackCard)
+
+
+class TestCardActionCallbackResponse:
+    """Test that _on_card_action_trigger returns updated card inline."""
+
+    def test_drops_action_when_loop_not_ready(self, _patch_callback_card_types):
        adapter = _make_adapter()
+        adapter._loop = None
+        data = _make_card_action_data({"hermes_action": "approve_once", "approval_id": 1})

-        mock_update = AsyncMock()
-        adapter._client.im.v1.message.update = MagicMock()
+        with patch("asyncio.run_coroutine_threadsafe") as mock_submit:
+            response = adapter._on_card_action_trigger(data)

-        with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
-            await adapter._update_approval_card(
-                "msg_001", "Approved once", "Norbert", "once"
-            )
+        assert response is not None
+        assert response.card is None
+        mock_submit.assert_not_called()

-        mock_thread.assert_called_once()
-        # Verify the update request was built
-        call_args = mock_thread.call_args
-        assert call_args[0][0] == adapter._client.im.v1.message.update
-
-    @pytest.mark.asyncio
-    async def test_updates_card_on_deny(self):
+    def test_returns_card_for_approve_action(self, _patch_callback_card_types):
        adapter = _make_adapter()
+        adapter._loop = MagicMock()
+        adapter._loop.is_closed = MagicMock(return_value=False)
+        data = _make_card_action_data(
+            {"hermes_action": "approve_once", "approval_id": 1},
+            open_id="ou_bob",
+        )
+        adapter._sender_name_cache["ou_bob"] = ("Bob", 9999999999)

-        with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
-            await adapter._update_approval_card(
-                "msg_002", "Denied", "Alice", "deny"
-            )
+        with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
+            response = adapter._on_card_action_trigger(data)

-        mock_thread.assert_called_once()
+        assert response is not None
+        assert response.card is not None
+        assert response.card.type == "raw"
+        card = response.card.data
+        assert card["header"]["template"] == "green"
+        assert "Approved once" in card["header"]["title"]["content"]
+        assert "Bob" in card["elements"][0]["content"]

-    @pytest.mark.asyncio
-    async def test_skips_update_when_not_connected(self):
+    def test_returns_card_for_deny_action(self, _patch_callback_card_types):
        adapter = _make_adapter()
-        adapter._client = None
+        adapter._loop = MagicMock()
+        adapter._loop.is_closed = MagicMock(return_value=False)
+        data = _make_card_action_data(
+            {"hermes_action": "deny", "approval_id": 2},
+        )

-        with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
-            await adapter._update_approval_card(
-                "msg_001", "Approved", "Bob", "once"
-            )
+        with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
+            response = adapter._on_card_action_trigger(data)

-        mock_thread.assert_not_called()
+        assert response.card is not None
+        card = response.card.data
+        assert card["header"]["template"] == "red"
+        assert "Denied" in card["header"]["title"]["content"]

-    @pytest.mark.asyncio
-    async def test_skips_update_when_no_message_id(self):
+    def test_ignores_missing_approval_id(self, _patch_callback_card_types):
        adapter = _make_adapter()
+        adapter._loop = MagicMock()
+        adapter._loop.is_closed = MagicMock(return_value=False)
+        data = _make_card_action_data({"hermes_action": "approve_once"})

-        with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
-            await adapter._update_approval_card(
-                "", "Approved", "Bob", "once"
-            )
+        with patch("asyncio.run_coroutine_threadsafe") as mock_submit:
+            response = adapter._on_card_action_trigger(data)

-        mock_thread.assert_not_called()
+        assert response is not None
+        assert response.card is None
+        mock_submit.assert_not_called()

-    @pytest.mark.asyncio
-    async def test_swallows_update_errors(self):
+    def test_no_card_for_non_approval_action(self, _patch_callback_card_types):
        adapter = _make_adapter()
+        adapter._loop = MagicMock()
+        adapter._loop.is_closed = MagicMock(return_value=False)
+        data = _make_card_action_data({"some_other": "value"})

-        with patch("asyncio.to_thread", new_callable=AsyncMock, side_effect=Exception("API error")):
-            # Should not raise
-            await adapter._update_approval_card(
-                "msg_001", "Approved", "Bob", "once"
-            )
+        with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
+            response = adapter._on_card_action_trigger(data)
+
+        assert response is not None
+        assert response.card is None
+
+    def test_falls_back_to_open_id_when_name_not_cached(self, _patch_callback_card_types):
+        adapter = _make_adapter()
+        adapter._loop = MagicMock()
+        adapter._loop.is_closed = MagicMock(return_value=False)
+        data = _make_card_action_data(
+            {"hermes_action": "approve_session", "approval_id": 3},
+            open_id="ou_unknown",
+        )
+
+        with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
+            response = adapter._on_card_action_trigger(data)
+
+        card = response.card.data
+        assert "ou_unknown" in card["elements"][0]["content"]
+
+    def test_ignores_expired_cached_name(self, _patch_callback_card_types):
+        adapter = _make_adapter()
+        adapter._loop = MagicMock()
+        adapter._loop.is_closed = MagicMock(return_value=False)
+        data = _make_card_action_data(
+            {"hermes_action": "approve_once", "approval_id": 4},
+            open_id="ou_expired",
+        )
+        adapter._sender_name_cache["ou_expired"] = ("Old Name", 1)
+
+        with patch("asyncio.run_coroutine_threadsafe", side_effect=_close_submitted_coro):
+            response = adapter._on_card_action_trigger(data)
+
+        card = response.card.data
+        assert "Old Name" not in card["elements"][0]["content"]
+        assert "ou_expired" in card["elements"][0]["content"]
@@ -125,6 +125,25 @@ async def test_gateway_stop_service_restart_sets_named_exit_code():
    assert runner._exit_code == GATEWAY_SERVICE_RESTART_EXIT_CODE


+@pytest.mark.asyncio
+async def test_gateway_stop_emits_shutdown_hook_after_drain(monkeypatch):
+    runner, adapter = make_restart_runner()
+    adapter.disconnect = AsyncMock()
+    runner.hooks.emit = AsyncMock()
+
+    with patch("gateway.status.remove_pid_file"), patch("gateway.status.write_runtime_status"):
+        await runner.stop(restart=True, service_restart=True)
+
+    runner.hooks.emit.assert_awaited_once_with(
+        "gateway:shutdown",
+        {
+            "restart": True,
+            "service_restart": True,
+            "detached_restart": False,
+        },
+    )
+
+
@pytest.mark.asyncio
 async def test_drain_active_agents_throttles_status_updates():
    runner, _adapter = make_restart_runner()
@@ -9,7 +9,7 @@ import pytest
 from gateway.hooks import HookRegistry


-def _create_hook(hooks_dir, hook_name, events, handler_code):
+def _create_hook(hooks_dir, hook_name, events, handler_code, *, manifest_extra=""):
    """Helper to create a hook directory with HOOK.yaml and handler.py."""
    hook_dir = hooks_dir / hook_name
    hook_dir.mkdir(parents=True)
@@ -17,6 +17,7 @@ def _create_hook(hooks_dir, hook_name, events, handler_code):
        f"name: {hook_name}\n"
        f"description: Test hook\n"
        f"events: {events}\n"
+        f"{manifest_extra}"
    )
    (hook_dir / "handler.py").write_text(handler_code)
    return hook_dir
@@ -112,6 +113,24 @@ class TestDiscoverAndLoad:

        assert len(reg.loaded_hooks) == 2

+    def test_preserves_optional_startup_readiness_metadata(self, tmp_path):
+        _create_hook(
+            tmp_path,
+            "ready-hook",
+            '["gateway:startup"]',
+            "def handle(e, c): pass\n",
+            manifest_extra="startup_readiness:\n  id: beam-runtime\n  required: false\n",
+        )
+
+        reg = HookRegistry()
+        with patch("gateway.hooks.HOOKS_DIR", tmp_path), _patch_no_builtins(reg):
+            reg.discover_and_load()
+
+        assert reg.loaded_hooks[0]["startup_readiness"] == {
+            "id": "beam-runtime",
+            "required": False,
+        }
+

 class TestEmit:
    @pytest.mark.asyncio
@@ -161,3 +161,84 @@ async def test_launch_detached_restart_command_uses_setsid(monkeypatch):
    assert kwargs["start_new_session"] is True
    assert kwargs["stdout"] is subprocess.DEVNULL
    assert kwargs["stderr"] is subprocess.DEVNULL
+
+
+# ── Shutdown notification tests ──────────────────────────────────────
+
+
+@pytest.mark.asyncio
+async def test_shutdown_notification_sent_to_active_sessions():
+    """Active sessions receive a notification when the gateway starts shutting down."""
+    runner, adapter = make_restart_runner()
+    source = make_restart_source(chat_id="999", chat_type="dm")
+    session_key = f"agent:main:telegram:dm:999"
+    runner._running_agents[session_key] = MagicMock()
+
+    await runner._notify_active_sessions_of_shutdown()
+
+    assert len(adapter.sent) == 1
+    assert "shutting down" in adapter.sent[0]
+    assert "interrupted" in adapter.sent[0]
+
+
+@pytest.mark.asyncio
+async def test_shutdown_notification_says_restarting_when_restart_requested():
+    """When _restart_requested is True, the message says 'restarting' and mentions /retry."""
+    runner, adapter = make_restart_runner()
+    runner._restart_requested = True
+    session_key = "agent:main:telegram:dm:999"
+    runner._running_agents[session_key] = MagicMock()
+
+    await runner._notify_active_sessions_of_shutdown()
+
+    assert len(adapter.sent) == 1
+    assert "restarting" in adapter.sent[0]
+    assert "/retry" in adapter.sent[0]
+
+
+@pytest.mark.asyncio
+async def test_shutdown_notification_deduplicates_per_chat():
+    """Multiple sessions in the same chat only get one notification."""
+    runner, adapter = make_restart_runner()
+    # Two sessions (different users) in the same chat
+    runner._running_agents["agent:main:telegram:group:chat1:u1"] = MagicMock()
+    runner._running_agents["agent:main:telegram:group:chat1:u2"] = MagicMock()
+
+    await runner._notify_active_sessions_of_shutdown()
+
+    assert len(adapter.sent) == 1
+
+
+@pytest.mark.asyncio
+async def test_shutdown_notification_skipped_when_no_active_agents():
+    """No notification is sent when there are no active agents."""
+    runner, adapter = make_restart_runner()
+
+    await runner._notify_active_sessions_of_shutdown()
+
+    assert len(adapter.sent) == 0
+
+
+@pytest.mark.asyncio
+async def test_shutdown_notification_ignores_pending_sentinels():
+    """Pending sentinels (not-yet-started agents) don't trigger notifications."""
+    from gateway.run import _AGENT_PENDING_SENTINEL
+
+    runner, adapter = make_restart_runner()
+    runner._running_agents["agent:main:telegram:dm:999"] = _AGENT_PENDING_SENTINEL
+
+    await runner._notify_active_sessions_of_shutdown()
+
+    assert len(adapter.sent) == 0
+
+
+@pytest.mark.asyncio
+async def test_shutdown_notification_send_failure_does_not_block():
+    """If sending a notification fails, the method still completes."""
+    runner, adapter = make_restart_runner()
+    adapter.send = AsyncMock(side_effect=Exception("network error"))
+    session_key = "agent:main:telegram:dm:999"
+    runner._running_agents[session_key] = MagicMock()
+
+    # Should not raise
+    await runner._notify_active_sessions_of_shutdown()
@@ -132,6 +132,68 @@ async def test_runner_records_connected_platform_state_on_success(monkeypatch, t
    assert state["platforms"]["discord"]["error_message"] is None


+@pytest.mark.asyncio
+async def test_runner_discovers_plugins_before_loading_hooks(monkeypatch, tmp_path):
+    monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+    config = GatewayConfig(
+        platforms={
+            Platform.DISCORD: PlatformConfig(enabled=True, token="***")
+        },
+        sessions_dir=tmp_path / "sessions",
+    )
+    runner = GatewayRunner(config)
+    order: list[str] = []
+
+    monkeypatch.setattr(runner, "_create_adapter", lambda platform, platform_config: _SuccessfulAdapter())
+    monkeypatch.setattr("hermes_cli.plugins.discover_plugins", lambda: order.append("plugins"))
+    monkeypatch.setattr(runner.hooks, "discover_and_load", lambda: order.append("hooks"))
+    monkeypatch.setattr(runner.hooks, "emit", AsyncMock())
+
+    ok = await runner.start()
+
+    assert ok is True
+    assert order == ["plugins", "hooks"]
+
+
+@pytest.mark.asyncio
+async def test_runner_initializes_startup_checks_before_gateway_startup_emit(monkeypatch, tmp_path):
+    monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+    config = GatewayConfig(
+        platforms={
+            Platform.DISCORD: PlatformConfig(enabled=True, token="***")
+        },
+        sessions_dir=tmp_path / "sessions",
+    )
+    runner = GatewayRunner(config)
+
+    runner.hooks._loaded_hooks = [
+        {
+            "name": "beam-runtime",
+            "events": ["gateway:startup"],
+            "path": str(tmp_path / "hook"),
+            "startup_readiness": {
+                "id": "beam-runtime",
+                "required": True,
+            },
+        }
+    ]
+    monkeypatch.setattr(runner, "_create_adapter", lambda platform, platform_config: _SuccessfulAdapter())
+    monkeypatch.setattr("hermes_cli.plugins.discover_plugins", lambda: None)
+    monkeypatch.setattr(runner.hooks, "discover_and_load", lambda: None)
+
+    async def _assert_checks(event_type, context):
+        state = read_runtime_status()
+        assert event_type == "gateway:startup"
+        assert state["startup_checks"]["beam-runtime"]["state"] == "pending"
+        assert state["startup_checks"]["beam-runtime"]["required"] is True
+
+    monkeypatch.setattr(runner.hooks, "emit", _assert_checks)
+
+    ok = await runner.start()
+
+    assert ok is True
+
+
@pytest.mark.asyncio
 async def test_start_gateway_verbosity_imports_redacting_formatter(monkeypatch, tmp_path):
    """Verbosity != None must not crash with NameError on RedactingFormatter (#8044)."""
@@ -132,6 +132,72 @@ class TestGatewayRuntimeStatus:
        assert payload["platforms"]["discord"]["error_code"] is None
        assert payload["platforms"]["discord"]["error_message"] is None

+    def test_reset_startup_checks_replaces_previous_run_entries(self, tmp_path, monkeypatch):
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+
+        status.write_runtime_status(
+            gateway_state="running",
+            startup_checks={
+                "old-check": {
+                    "state": "ready",
+                    "required": True,
+                    "source": "old-hook",
+                    "detail": None,
+                }
+            },
+        )
+
+        status.reset_startup_checks([
+            {
+                "name": "new-hook",
+                "startup_readiness": {
+                    "id": "new-check",
+                    "required": False,
+                },
+            }
+        ])
+
+        payload = status.read_runtime_status()
+        assert set(payload["startup_checks"]) == {"new-check"}
+        assert payload["startup_checks"]["new-check"]["state"] == "pending"
+        assert payload["startup_checks"]["new-check"]["required"] is False
+        assert payload["startup_checks"]["new-check"]["source"] == "new-hook"
+
+    def test_mark_startup_check_ready_persists_detail(self, tmp_path, monkeypatch):
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+
+        status.reset_startup_checks([
+            {
+                "name": "beam",
+                "startup_readiness": {
+                    "id": "beam-runtime",
+                    "required": True,
+                },
+            }
+        ])
+
+        status.mark_startup_check_ready("beam-runtime", detail="ready for RPC")
+
+        payload = status.read_runtime_status()
+        assert payload["startup_checks"]["beam-runtime"]["state"] == "ready"
+        assert payload["startup_checks"]["beam-runtime"]["detail"] == "ready for RPC"
+
+    def test_mark_startup_check_failed_creates_missing_entry(self, tmp_path, monkeypatch):
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+
+        status.mark_startup_check_failed(
+            "late-hook",
+            detail="startup hook crashed",
+            required=False,
+            source="late-hook",
+        )
+
+        payload = status.read_runtime_status()
+        assert payload["startup_checks"]["late-hook"]["state"] == "failed"
+        assert payload["startup_checks"]["late-hook"]["required"] is False
+        assert payload["startup_checks"]["late-hook"]["source"] == "late-hook"
+        assert payload["startup_checks"]["late-hook"]["detail"] == "startup hook crashed"
+

 class TestTerminatePid:
    def test_force_uses_taskkill_on_windows(self, monkeypatch):
@@ -1028,3 +1028,154 @@ class TestDiscordSkillCommands:
            assert len(name) <= _CMD_NAME_LIMIT, (
                f"Name '{name}' is {len(name)} chars (limit {_CMD_NAME_LIMIT})"
            )
+
+
+# ---------------------------------------------------------------------------
+# Discord skill commands grouped by category
+# ---------------------------------------------------------------------------
+
+from hermes_cli.commands import discord_skill_commands_by_category  # noqa: E402
+
+
+class TestDiscordSkillCommandsByCategory:
+    """Tests for discord_skill_commands_by_category() — /skill group registration."""
+
+    def test_groups_skills_by_category(self, tmp_path, monkeypatch):
+        """Skills nested 2+ levels deep should be grouped by top-level category."""
+        from unittest.mock import patch
+
+        fake_skills_dir = str(tmp_path / "skills")
+        # Create the directory structure so resolve() works
+        for p in [
+            "skills/creative/ascii-art",
+            "skills/creative/excalidraw",
+            "skills/media/gif-search",
+        ]:
+            (tmp_path / p).mkdir(parents=True, exist_ok=True)
+            (tmp_path / p / "SKILL.md").write_text("---\nname: test\n---\n")
+
+        fake_cmds = {
+            "/ascii-art": {
+                "name": "ascii-art",
+                "description": "Generate ASCII art",
+                "skill_md_path": f"{fake_skills_dir}/creative/ascii-art/SKILL.md",
+            },
+            "/excalidraw": {
+                "name": "excalidraw",
+                "description": "Hand-drawn diagrams",
+                "skill_md_path": f"{fake_skills_dir}/creative/excalidraw/SKILL.md",
+            },
+            "/gif-search": {
+                "name": "gif-search",
+                "description": "Search for GIFs",
+                "skill_md_path": f"{fake_skills_dir}/media/gif-search/SKILL.md",
+            },
+        }
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+        with (
+            patch("agent.skill_commands.get_skill_commands", return_value=fake_cmds),
+            patch("tools.skills_tool.SKILLS_DIR", tmp_path / "skills"),
+        ):
+            categories, uncategorized, hidden = discord_skill_commands_by_category(
+                reserved_names=set(),
+            )
+
+        assert "creative" in categories
+        assert "media" in categories
+        assert len(categories["creative"]) == 2
+        assert len(categories["media"]) == 1
+        assert uncategorized == []
+        assert hidden == 0
+
+    def test_root_level_skills_are_uncategorized(self, tmp_path, monkeypatch):
+        """Skills directly under SKILLS_DIR (only 1 path component) → uncategorized."""
+        from unittest.mock import patch
+
+        fake_skills_dir = str(tmp_path / "skills")
+        (tmp_path / "skills" / "dogfood").mkdir(parents=True, exist_ok=True)
+        (tmp_path / "skills" / "dogfood" / "SKILL.md").write_text("")
+
+        fake_cmds = {
+            "/dogfood": {
+                "name": "dogfood",
+                "description": "QA testing",
+                "skill_md_path": f"{fake_skills_dir}/dogfood/SKILL.md",
+            },
+        }
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+        with (
+            patch("agent.skill_commands.get_skill_commands", return_value=fake_cmds),
+            patch("tools.skills_tool.SKILLS_DIR", tmp_path / "skills"),
+        ):
+            categories, uncategorized, hidden = discord_skill_commands_by_category(
+                reserved_names=set(),
+            )
+
+        assert categories == {}
+        assert len(uncategorized) == 1
+        assert uncategorized[0][0] == "dogfood"
+
+    def test_hub_skills_excluded(self, tmp_path, monkeypatch):
+        """Skills under .hub should be excluded."""
+        from unittest.mock import patch
+
+        fake_skills_dir = str(tmp_path / "skills")
+        (tmp_path / "skills" / ".hub" / "some-skill").mkdir(parents=True, exist_ok=True)
+        (tmp_path / "skills" / ".hub" / "some-skill" / "SKILL.md").write_text("")
+
+        fake_cmds = {
+            "/some-skill": {
+                "name": "some-skill",
+                "description": "Hub skill",
+                "skill_md_path": f"{fake_skills_dir}/.hub/some-skill/SKILL.md",
+            },
+        }
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+        with (
+            patch("agent.skill_commands.get_skill_commands", return_value=fake_cmds),
+            patch("tools.skills_tool.SKILLS_DIR", tmp_path / "skills"),
+        ):
+            categories, uncategorized, hidden = discord_skill_commands_by_category(
+                reserved_names=set(),
+            )
+
+        assert categories == {}
+        assert uncategorized == []
+
+    def test_deep_nested_skills_use_top_category(self, tmp_path, monkeypatch):
+        """Skills like mlops/training/axolotl should group under 'mlops'."""
+        from unittest.mock import patch
+
+        fake_skills_dir = str(tmp_path / "skills")
+        (tmp_path / "skills" / "mlops" / "training" / "axolotl").mkdir(parents=True, exist_ok=True)
+        (tmp_path / "skills" / "mlops" / "training" / "axolotl" / "SKILL.md").write_text("")
+        (tmp_path / "skills" / "mlops" / "inference" / "vllm").mkdir(parents=True, exist_ok=True)
+        (tmp_path / "skills" / "mlops" / "inference" / "vllm" / "SKILL.md").write_text("")
+
+        fake_cmds = {
+            "/axolotl": {
+                "name": "axolotl",
+                "description": "Fine-tuning with Axolotl",
+                "skill_md_path": f"{fake_skills_dir}/mlops/training/axolotl/SKILL.md",
+            },
+            "/vllm": {
+                "name": "vllm",
+                "description": "vLLM inference",
+                "skill_md_path": f"{fake_skills_dir}/mlops/inference/vllm/SKILL.md",
+            },
+        }
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+        with (
+            patch("agent.skill_commands.get_skill_commands", return_value=fake_cmds),
+            patch("tools.skills_tool.SKILLS_DIR", tmp_path / "skills"),
+        ):
+            categories, uncategorized, hidden = discord_skill_commands_by_category(
+                reserved_names=set(),
+            )
+
+        # Both should be under 'mlops' regardless of sub-category
+        assert "mlops" in categories
+        names = {n for n, _d, _k in categories["mlops"]}
+        assert "axolotl" in names
+        assert "vllm" in names
+        assert len(uncategorized) == 0
@@ -6,12 +6,21 @@ from pathlib import Path
 from types import SimpleNamespace

 import hermes_cli.gateway as gateway_cli
+import pytest
 from gateway.restart import (
    DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT,
    GATEWAY_SERVICE_RESTART_EXIT_CODE,
 )


+_REAL_AWAIT_SERVICE_READY = gateway_cli._await_service_ready_or_exit
+
+
+@pytest.fixture(autouse=True)
+def _stub_service_readiness(monkeypatch):
+    monkeypatch.setattr(gateway_cli, "_await_service_ready_or_exit", lambda **kwargs: None)
+
+
 class TestSystemdServiceRefresh:
    def test_systemd_install_repairs_outdated_unit_without_force(self, tmp_path, monkeypatch):
        unit_path = tmp_path / "hermes-gateway.service"
@@ -82,6 +91,30 @@ class TestSystemdServiceRefresh:
            ["systemctl", "--user", "reload-or-restart", gateway_cli.get_service_name()],
        ]

+    def test_systemd_start_waits_for_readiness_before_reporting_success(self, monkeypatch):
+        calls = []
+
+        monkeypatch.setattr(gateway_cli, "_select_systemd_scope", lambda system=False: False)
+        monkeypatch.setattr(gateway_cli, "refresh_systemd_unit_if_needed", lambda system=False: calls.append(("refresh", system)))
+        monkeypatch.setattr(
+            gateway_cli,
+            "_run_systemctl",
+            lambda cmd, system=False, check=True, timeout=30, **kwargs: calls.append((tuple(cmd), system, timeout)),
+        )
+        monkeypatch.setattr(
+            gateway_cli,
+            "_await_service_ready_or_exit",
+            lambda **kwargs: calls.append(("ready", kwargs)),
+        )
+
+        gateway_cli.systemd_start()
+
+        assert calls == [
+            ("refresh", False),
+            (("start", gateway_cli.get_service_name()), False, 30),
+            ("ready", {"action": "start"}),
+        ]
+

 class TestGeneratedSystemdUnits:
    def test_user_unit_avoids_recursive_execstop_and_uses_extended_stop_timeout(self):
@@ -268,6 +301,32 @@ class TestLaunchdServiceRecovery:
            ["launchctl", "kickstart", target],
        ]

+    def test_launchd_start_waits_for_readiness_before_reporting_success(self, tmp_path, monkeypatch):
+        plist_path = tmp_path / "ai.hermes.gateway.plist"
+        plist_path.write_text(gateway_cli.generate_launchd_plist(), encoding="utf-8")
+        label = gateway_cli.get_launchd_label()
+        calls = []
+
+        monkeypatch.setattr(gateway_cli, "get_launchd_plist_path", lambda: plist_path)
+        monkeypatch.setattr(gateway_cli, "refresh_launchd_plist_if_needed", lambda: None)
+        monkeypatch.setattr(
+            gateway_cli.subprocess,
+            "run",
+            lambda cmd, check=False, **kwargs: calls.append(cmd) or SimpleNamespace(returncode=0, stdout="", stderr=""),
+        )
+        monkeypatch.setattr(
+            gateway_cli,
+            "_await_service_ready_or_exit",
+            lambda **kwargs: calls.append(("ready", kwargs)),
+        )
+
+        gateway_cli.launchd_start()
+
+        assert calls == [
+            ["launchctl", "kickstart", f"{gateway_cli._launchd_domain()}/{label}"],
+            ("ready", {"action": "start"}),
+        ]
+
    def test_launchd_restart_drains_running_gateway_before_kickstart(self, monkeypatch):
        calls = []
        target = f"{gateway_cli._launchd_domain()}/{gateway_cli.get_launchd_label()}"
@@ -315,7 +374,7 @@ class TestLaunchdServiceRecovery:
        gateway_cli.launchd_restart()

        assert calls == [("self", 321)]
-        assert "restart requested" in capsys.readouterr().out.lower()
+        assert "service restarted" in capsys.readouterr().out.lower()

    def test_launchd_stop_uses_bootout_not_kill(self, monkeypatch):
        """launchd_stop must bootout the service so KeepAlive doesn't respawn it."""
@@ -393,6 +452,109 @@ class TestLaunchdServiceRecovery:
        assert "not loaded" in output.lower()


+class TestGatewayServiceReadiness:
+    def test_wait_for_service_readiness_accepts_running_gateway_without_checks(self, monkeypatch):
+        monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
+        monkeypatch.setattr(
+            "gateway.status.read_runtime_status",
+            lambda: {"pid": 123, "gateway_state": "running", "startup_checks": {}},
+        )
+
+        warnings = gateway_cli._wait_for_service_readiness(action="start", timeout=0.1, poll_interval=0.0)
+
+        assert warnings == []
+
+    def test_wait_for_service_readiness_ignores_stale_runtime_state_until_pid_matches(self, monkeypatch):
+        runtime_states = iter(
+            [
+                {"pid": 999, "gateway_state": "running", "startup_checks": {}},
+                {"pid": 123, "gateway_state": "running", "startup_checks": {}},
+            ]
+        )
+
+        monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
+        monkeypatch.setattr("gateway.status.read_runtime_status", lambda: next(runtime_states))
+
+        warnings = gateway_cli._wait_for_service_readiness(action="start", timeout=0.1, poll_interval=0.0)
+
+        assert warnings == []
+
+    def test_wait_for_service_readiness_returns_optional_pending_warnings(self, monkeypatch):
+        monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
+        monkeypatch.setattr(
+            "gateway.status.read_runtime_status",
+            lambda: {
+                "pid": 123,
+                "gateway_state": "running",
+                "startup_checks": {
+                    "optional-check": {
+                        "state": "pending",
+                        "required": False,
+                        "source": "test-hook",
+                        "detail": "still warming",
+                    }
+                },
+            },
+        )
+
+        warnings = gateway_cli._wait_for_service_readiness(action="start", timeout=0.1, poll_interval=0.0)
+
+        assert warnings == ["pending: optional-check (test-hook): still warming"]
+
+    def test_wait_for_service_readiness_fails_when_required_check_fails(self, monkeypatch):
+        monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
+        monkeypatch.setattr(
+            "gateway.status.read_runtime_status",
+            lambda: {
+                "pid": 123,
+                "gateway_state": "running",
+                "startup_checks": {
+                    "beam-runtime": {
+                        "state": "failed",
+                        "required": True,
+                        "source": "beam",
+                        "detail": "RPC boot failed",
+                    }
+                },
+            },
+        )
+
+        with pytest.raises(RuntimeError, match=r"required startup checks failed: beam-runtime \(beam\): RPC boot failed"):
+            gateway_cli._wait_for_service_readiness(action="start", timeout=0.1, poll_interval=0.0)
+
+    def test_wait_for_service_readiness_times_out_on_pending_required_check(self, monkeypatch):
+        monkeypatch.setattr("gateway.status.get_running_pid", lambda: 123)
+        monkeypatch.setattr(
+            "gateway.status.read_runtime_status",
+            lambda: {
+                "pid": 123,
+                "gateway_state": "running",
+                "startup_checks": {
+                    "beam-runtime": {
+                        "state": "pending",
+                        "required": True,
+                        "source": "beam",
+                        "detail": "waiting for runtime",
+                    }
+                },
+            },
+        )
+
+        with pytest.raises(RuntimeError, match=r"timed out waiting for required startup checks: beam-runtime \(beam\): waiting for runtime"):
+            gateway_cli._wait_for_service_readiness(action="start", timeout=0.01, poll_interval=0.0)
+
+    def test_await_service_ready_or_exit_raises_system_exit_when_not_ready(self, monkeypatch):
+        monkeypatch.setattr(gateway_cli, "_await_service_ready_or_exit", _REAL_AWAIT_SERVICE_READY)
+        monkeypatch.setattr(
+            gateway_cli,
+            "_wait_for_service_readiness",
+            lambda **kwargs: (_ for _ in ()).throw(RuntimeError("not ready")),
+        )
+
+        with pytest.raises(SystemExit, match="1"):
+            gateway_cli._await_service_ready_or_exit(action="start")
+
+
 class TestGatewayServiceDetection:
    def test_supports_systemd_services_requires_systemctl_binary(self, monkeypatch):
        monkeypatch.setattr(gateway_cli, "is_linux", lambda: True)
@@ -475,7 +637,7 @@ class TestGatewaySystemServiceRouting:
        gateway_cli.systemd_restart()

        assert calls == [("refresh", False), ("self", 654)]
-        assert "restart requested" in capsys.readouterr().out.lower()
+        assert "service restarted" in capsys.readouterr().out.lower()

    def test_gateway_install_passes_system_flags(self, monkeypatch):
        monkeypatch.setattr(gateway_cli, "supports_systemd_services", lambda: True)
@@ -984,3 +984,195 @@ class TestModelInfoEndpoint:
        assert resp.status_code == 200
        data = resp.json()
        assert data["auto_context_length"] == 0
+
+
+# ---------------------------------------------------------------------------
+# Gateway health probe tests
+# ---------------------------------------------------------------------------
+
+
+class TestProbeGatewayHealth:
+    """Tests for _probe_gateway_health() — cross-container gateway detection."""
+
+    def test_returns_false_when_no_url_configured(self, monkeypatch):
+        """When GATEWAY_HEALTH_URL is unset, the probe returns (False, None)."""
+        import hermes_cli.web_server as ws
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", None)
+        alive, body = ws._probe_gateway_health()
+        assert alive is False
+        assert body is None
+
+    def test_normalizes_url_with_health_suffix(self, monkeypatch):
+        """If the user sets the URL to include /health, it's stripped to base."""
+        import hermes_cli.web_server as ws
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642/health")
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_TIMEOUT", 1)
+        # Both paths should fail (no server), but we verify they were constructed
+        # correctly by checking the URLs attempted.
+        calls = []
+        original_urlopen = ws.urllib.request.urlopen
+
+        def mock_urlopen(req, **kwargs):
+            calls.append(req.full_url)
+            raise ConnectionError("mock")
+
+        monkeypatch.setattr(ws.urllib.request, "urlopen", mock_urlopen)
+        alive, body = ws._probe_gateway_health()
+        assert alive is False
+        assert "http://gw:8642/health/detailed" in calls
+        assert "http://gw:8642/health" in calls
+
+    def test_normalizes_url_with_health_detailed_suffix(self, monkeypatch):
+        """If the user sets the URL to include /health/detailed, it's stripped to base."""
+        import hermes_cli.web_server as ws
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642/health/detailed")
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_TIMEOUT", 1)
+        calls = []
+
+        def mock_urlopen(req, **kwargs):
+            calls.append(req.full_url)
+            raise ConnectionError("mock")
+
+        monkeypatch.setattr(ws.urllib.request, "urlopen", mock_urlopen)
+        ws._probe_gateway_health()
+        assert "http://gw:8642/health/detailed" in calls
+        assert "http://gw:8642/health" in calls
+
+    def test_successful_detailed_probe(self, monkeypatch):
+        """Successful /health/detailed probe returns (True, body_dict)."""
+        import hermes_cli.web_server as ws
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_TIMEOUT", 1)
+
+        response_body = json.dumps({
+            "status": "ok",
+            "gateway_state": "running",
+            "pid": 42,
+        })
+
+        mock_resp = MagicMock()
+        mock_resp.status = 200
+        mock_resp.read.return_value = response_body.encode()
+        mock_resp.__enter__ = MagicMock(return_value=mock_resp)
+        mock_resp.__exit__ = MagicMock(return_value=False)
+
+        monkeypatch.setattr(ws.urllib.request, "urlopen", lambda req, **kw: mock_resp)
+        alive, body = ws._probe_gateway_health()
+        assert alive is True
+        assert body["status"] == "ok"
+        assert body["pid"] == 42
+
+    def test_detailed_fails_falls_back_to_simple_health(self, monkeypatch):
+        """If /health/detailed fails, falls back to /health."""
+        import hermes_cli.web_server as ws
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_TIMEOUT", 1)
+
+        call_count = [0]
+
+        def mock_urlopen(req, **kwargs):
+            call_count[0] += 1
+            if call_count[0] == 1:
+                raise ConnectionError("detailed failed")
+            mock_resp = MagicMock()
+            mock_resp.status = 200
+            mock_resp.read.return_value = json.dumps({"status": "ok"}).encode()
+            mock_resp.__enter__ = MagicMock(return_value=mock_resp)
+            mock_resp.__exit__ = MagicMock(return_value=False)
+            return mock_resp
+
+        monkeypatch.setattr(ws.urllib.request, "urlopen", mock_urlopen)
+        alive, body = ws._probe_gateway_health()
+        assert alive is True
+        assert body["status"] == "ok"
+        assert call_count[0] == 2
+
+
+class TestStatusRemoteGateway:
+    """Tests for /api/status with remote gateway health fallback."""
+
+    @pytest.fixture(autouse=True)
+    def _setup_test_client(self):
+        try:
+            from starlette.testclient import TestClient
+        except ImportError:
+            pytest.skip("fastapi/starlette not installed")
+
+        from hermes_cli.web_server import app, _SESSION_TOKEN
+        self.client = TestClient(app)
+        self.client.headers["Authorization"] = f"Bearer {_SESSION_TOKEN}"
+
+    def test_status_falls_back_to_remote_probe(self, monkeypatch):
+        """When local PID check fails and remote probe succeeds, gateway shows running."""
+        import hermes_cli.web_server as ws
+
+        monkeypatch.setattr(ws, "get_running_pid", lambda: None)
+        monkeypatch.setattr(ws, "read_runtime_status", lambda: None)
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
+        monkeypatch.setattr(ws, "_probe_gateway_health", lambda: (True, {
+            "status": "ok",
+            "gateway_state": "running",
+            "platforms": {"telegram": {"state": "connected"}},
+            "pid": 999,
+        }))
+
+        resp = self.client.get("/api/status")
+        assert resp.status_code == 200
+        data = resp.json()
+        assert data["gateway_running"] is True
+        assert data["gateway_pid"] == 999
+        assert data["gateway_state"] == "running"
+
+    def test_status_remote_probe_not_attempted_when_local_pid_found(self, monkeypatch):
+        """When local PID check succeeds, the remote probe is never called."""
+        import hermes_cli.web_server as ws
+
+        monkeypatch.setattr(ws, "get_running_pid", lambda: 1234)
+        monkeypatch.setattr(ws, "read_runtime_status", lambda: {
+            "gateway_state": "running",
+            "platforms": {},
+        })
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
+        probe_called = [False]
+        original = ws._probe_gateway_health
+
+        def track_probe():
+            probe_called[0] = True
+            return original()
+
+        monkeypatch.setattr(ws, "_probe_gateway_health", track_probe)
+
+        resp = self.client.get("/api/status")
+        assert resp.status_code == 200
+        assert not probe_called[0]
+
+    def test_status_remote_probe_not_attempted_when_no_url(self, monkeypatch):
+        """When GATEWAY_HEALTH_URL is unset, no probe is attempted."""
+        import hermes_cli.web_server as ws
+
+        monkeypatch.setattr(ws, "get_running_pid", lambda: None)
+        monkeypatch.setattr(ws, "read_runtime_status", lambda: None)
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", None)
+
+        resp = self.client.get("/api/status")
+        assert resp.status_code == 200
+        data = resp.json()
+        assert data["gateway_running"] is False
+
+    def test_status_remote_running_null_pid(self, monkeypatch):
+        """Remote gateway running but PID not in response — pid should be None."""
+        import hermes_cli.web_server as ws
+
+        monkeypatch.setattr(ws, "get_running_pid", lambda: None)
+        monkeypatch.setattr(ws, "read_runtime_status", lambda: None)
+        monkeypatch.setattr(ws, "_GATEWAY_HEALTH_URL", "http://gw:8642")
+        monkeypatch.setattr(ws, "_probe_gateway_health", lambda: (True, {
+            "status": "ok",
+        }))
+
+        resp = self.client.get("/api/status")
+        assert resp.status_code == 200
+        data = resp.json()
+        assert data["gateway_running"] is True
+        assert data["gateway_pid"] is None
+        assert data["gateway_state"] == "running"
@@ -550,11 +550,12 @@ class TestGatewayProtection:
        dangerous, key, desc = detect_dangerous_command(cmd)
        assert dangerous is False

-    def test_systemctl_restart_not_flagged(self):
-        """Using systemctl to manage the gateway is the correct approach."""
+    def test_systemctl_restart_flagged(self):
+        """systemctl restart kills running agents and should require approval."""
        cmd = "systemctl --user restart hermes-gateway"
        dangerous, key, desc = detect_dangerous_command(cmd)
-        assert dangerous is False
+        assert dangerous is True
+        assert "stop/restart" in desc

    def test_pkill_hermes_detected(self):
        """pkill targeting hermes/gateway processes must be caught."""
@@ -2837,7 +2837,7 @@ class TestRegistryCollisionWarning:
    """registry.register() warns when a tool name is overwritten by a different toolset."""

    def test_overwrite_different_toolset_logs_warning(self, caplog):
-        """Overwriting a tool from a different toolset emits a warning."""
+        """Overwriting a tool from a different toolset is REJECTED with an error."""
        from tools.registry import ToolRegistry
        import logging

@@ -2847,11 +2847,13 @@ class TestRegistryCollisionWarning:

        reg.register(name="my_tool", toolset="builtin", schema=schema, handler=handler)

-        with caplog.at_level(logging.WARNING, logger="tools.registry"):
+        with caplog.at_level(logging.ERROR, logger="tools.registry"):
            reg.register(name="my_tool", toolset="mcp-ext", schema=schema, handler=handler)

-        assert any("collision" in r.message.lower() for r in caplog.records)
+        assert any("rejected" in r.message.lower() for r in caplog.records)
        assert any("builtin" in r.message and "mcp-ext" in r.message for r in caplog.records)
+        # The original tool should still be from 'builtin', not overwritten
+        assert reg.get_toolset_for_tool("my_tool") == "builtin"

    def test_overwrite_same_toolset_no_warning(self, caplog):
        """Re-registering within the same toolset is silent (e.g. reconnect)."""
@@ -87,7 +87,7 @@ DANGEROUS_PATTERNS = [
    (r'\bDELETE\s+FROM\b(?!.*\bWHERE\b)', "SQL DELETE without WHERE"),
    (r'\bTRUNCATE\s+(TABLE)?\s*\w', "SQL TRUNCATE"),
    (r'>\s*/etc/', "overwrite system config"),
-    (r'\bsystemctl\s+(stop|disable|mask)\b', "stop/disable system service"),
+    (r'\bsystemctl\s+(-[^\s]+\s+)*(stop|restart|disable|mask)\b', "stop/restart system service"),
    (r'\bkill\s+-9\s+-1\b', "kill all processes"),
    (r'\bpkill\s+-9\b', "force kill processes"),
    (r':\(\)\s*\{\s*:\s*\|\s*:\s*&\s*\}\s*;\s*:', "fork bomb"),
@@ -101,6 +101,11 @@ DANGEROUS_PATTERNS = [
    (r'\bxargs\s+.*\brm\b', "xargs with rm"),
    (r'\bfind\b.*-exec\s+(/\S*/)?rm\b', "find -exec rm"),
    (r'\bfind\b.*-delete\b', "find -delete"),
+    # Gateway lifecycle protection: prevent the agent from killing its own
+    # gateway process.  These commands trigger a gateway restart/stop that
+    # terminates all running agents mid-work.
+    (r'\bhermes\s+gateway\s+(stop|restart)\b', "stop/restart hermes gateway (kills running agents)"),
+    (r'\bhermes\s+update\b', "hermes update (restarts gateway, kills running agents)"),
    # Gateway protection: never start gateway outside systemd management
    (r'gateway\s+run\b.*(&\s*$|&\s*;|\bdisown\b|\bsetsid\b)', "start gateway outside systemd (use 'systemctl --user restart hermes-gateway')"),
    (r'\bnohup\b.*gateway\s+run\b', "start gateway outside systemd (use 'systemctl --user restart hermes-gateway')"),
@@ -219,6 +219,58 @@ def _sanitize_error(text: str) -> str:
    return _CREDENTIAL_PATTERN.sub("[REDACTED]", text)


+# ---------------------------------------------------------------------------
+# MCP tool description content scanning
+# ---------------------------------------------------------------------------
+
+# Patterns that indicate potential prompt injection in MCP tool descriptions.
+# These are WARNING-level — we log but don't block, since false positives
+# would break legitimate MCP servers.
+_MCP_INJECTION_PATTERNS = [
+    (re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I),
+     "prompt override attempt ('ignore previous instructions')"),
+    (re.compile(r"you\s+are\s+now\s+a", re.I),
+     "identity override attempt ('you are now a...')"),
+    (re.compile(r"your\s+new\s+(task|role|instructions?)\s+(is|are)", re.I),
+     "task override attempt"),
+    (re.compile(r"system\s*:\s*", re.I),
+     "system prompt injection attempt"),
+    (re.compile(r"<\s*(system|human|assistant)\s*>", re.I),
+     "role tag injection attempt"),
+    (re.compile(r"do\s+not\s+(tell|inform|mention|reveal)", re.I),
+     "concealment instruction"),
+    (re.compile(r"(curl|wget|fetch)\s+https?://", re.I),
+     "network command in description"),
+    (re.compile(r"base64\.(b64decode|decodebytes)", re.I),
+     "base64 decode reference"),
+    (re.compile(r"exec\s*\(|eval\s*\(", re.I),
+     "code execution reference"),
+    (re.compile(r"import\s+(subprocess|os|shutil|socket)", re.I),
+     "dangerous import reference"),
+]
+
+
+def _scan_mcp_description(server_name: str, tool_name: str, description: str) -> List[str]:
+    """Scan an MCP tool description for prompt injection patterns.
+
+    Returns a list of finding strings (empty = clean).
+    """
+    findings = []
+    if not description:
+        return findings
+    for pattern, reason in _MCP_INJECTION_PATTERNS:
+        if pattern.search(description):
+            findings.append(reason)
+    if findings:
+        logger.warning(
+            "MCP server '%s' tool '%s': suspicious description content — %s. "
+            "Description: %.200s",
+            server_name, tool_name, "; ".join(findings),
+            description,
+        )
+    return findings
+
+
 def _prepend_path(env: dict, directory: str) -> dict:
    """Prepend *directory* to env PATH if it is not already present."""
    updated = dict(env or {})
@@ -798,6 +850,9 @@ class MCPServerTask:
        from toolsets import TOOLSETS

        async with self._refresh_lock:
+            # Capture old tool names for change diff
+            old_tool_names = set(self._registered_tool_names)
+
            # 1. Fetch current tool list from server
            tools_result = await self.session.list_tools()
            new_mcp_tools = tools_result.tools if hasattr(tools_result, "tools") else []
@@ -817,10 +872,26 @@ class MCPServerTask:
                self.name, self, self._config
            )

-            logger.info(
-                "MCP server '%s': dynamically refreshed %d tool(s)",
-                self.name, len(self._registered_tool_names),
-            )
+            # 5. Log what changed (user-visible notification)
+            new_tool_names = set(self._registered_tool_names)
+            added = new_tool_names - old_tool_names
+            removed = old_tool_names - new_tool_names
+            changes = []
+            if added:
+                changes.append(f"added: {', '.join(sorted(added))}")
+            if removed:
+                changes.append(f"removed: {', '.join(sorted(removed))}")
+            if changes:
+                logger.warning(
+                    "MCP server '%s': tools changed dynamically — %s. "
+                    "Verify these changes are expected.",
+                    self.name, "; ".join(changes),
+                )
+            else:
+                logger.info(
+                    "MCP server '%s': dynamically refreshed %d tool(s) (no changes)",
+                    self.name, len(self._registered_tool_names),
+                )

    async def _run_stdio(self, config: dict):
        """Run the server using stdio transport."""
@@ -1838,6 +1909,10 @@ def _register_server_tools(name: str, server: MCPServerTask, config: dict) -> Li
        if not _should_register(mcp_tool.name):
            logger.debug("MCP server '%s': skipping tool '%s' (filtered by config)", name, mcp_tool.name)
            continue
+
+        # Scan tool description for prompt injection patterns
+        _scan_mcp_description(name, mcp_tool.name, mcp_tool.description or "")
+
        schema = _convert_mcp_schema(name, mcp_tool)
        tool_name_prefixed = schema["name"]

@@ -117,11 +117,27 @@ class ToolRegistry:
        with self._lock:
            existing = self._tools.get(name)
            if existing and existing.toolset != toolset:
-                logger.warning(
-                    "Tool name collision: '%s' (toolset '%s') is being "
-                    "overwritten by toolset '%s'",
-                    name, existing.toolset, toolset,
+                # Allow MCP-to-MCP overwrites (legitimate: server refresh,
+                # or two MCP servers with overlapping tool names).
+                both_mcp = (
+                    existing.toolset.startswith("mcp-")
+                    and toolset.startswith("mcp-")
                )
+                if both_mcp:
+                    logger.debug(
+                        "Tool '%s': MCP toolset '%s' overwriting MCP toolset '%s'",
+                        name, toolset, existing.toolset,
+                    )
+                else:
+                    # Reject shadowing — prevent plugins/MCP from overwriting
+                    # built-in tools or vice versa.
+                    logger.error(
+                        "Tool registration REJECTED: '%s' (toolset '%s') would "
+                        "shadow existing tool from toolset '%s'. Deregister the "
+                        "existing tool first if this is intentional.",
+                        name, toolset, existing.toolset,
+                    )
+                    return
            self._tools[name] = ToolEntry(
                name=name,
                toolset=toolset,
@@ -64,11 +64,11 @@ def _security_scan_skill(skill_dir: Path) -> Optional[str]:
            report = format_scan_report(result)
            return f"Security scan blocked this skill ({reason}):\n{report}"
        if allowed is None:
-            # "ask" — allow but include the warning so the user sees the findings
+            # "ask" verdict — for agent-created skills this means dangerous
+            # findings were detected.  Block the skill and include the report.
            report = format_scan_report(result)
-            logger.warning("Agent-created skill has security findings: %s", reason)
-            # Don't block — return None to allow, but log the warning
-            return None
+            logger.warning("Agent-created skill blocked (dangerous findings): %s", reason)
+            return f"Security scan blocked this skill ({reason}):\n{report}"
    except Exception as e:
        logger.warning("Security scan failed for %s: %s", skill_dir, e, exc_info=True)
    return None
@@ -80,6 +80,7 @@ export const en: Translations = {
    notRunning: "Not running",
    startFailed: "Start failed",
    pid: "PID",
+    runningRemote: "Running (remote)",
    noneRunning: "None",
    gatewayFailedToStart: "Gateway failed to start",
    lastUpdate: "Last update",
@@ -83,6 +83,7 @@ export interface Translations {
    notRunning: string;
    startFailed: string;
    pid: string;
+    runningRemote: string;
    noneRunning: string;
    gatewayFailedToStart: string;
    lastUpdate: string;
@@ -80,6 +80,7 @@ export const zh: Translations = {
    notRunning: "未运行",
    startFailed: "启动失败",
    pid: "进程",
+    runningRemote: "运行中（远程）",
    noneRunning: "无",
    gatewayFailedToStart: "网关启动失败",
    lastUpdate: "最后更新",
@@ -53,7 +53,8 @@ export default function StatusPage() {
  };

  function gatewayValue(): string {
-    if (status!.gateway_running) return `${t.status.pid} ${status!.gateway_pid}`;
+    if (status!.gateway_running && status!.gateway_pid) return `${t.status.pid} ${status!.gateway_pid}`;
+    if (status!.gateway_running) return t.status.runningRemote;
    if (status!.gateway_state === "startup_failed") return t.status.startFailed;
    return t.status.notRunning;
  }
@@ -35,9 +35,39 @@ docker run -d \
  --name hermes \
  --restart unless-stopped \
  -v ~/.hermes:/opt/data \
+  -p 8642:8642 \
  nousresearch/hermes-agent gateway run
 ```

+Port 8642 exposes the gateway's [OpenAI-compatible API server](./api-server.md) and health endpoint. It's optional if you only use chat platforms (Telegram, Discord, etc.), but required if you want the dashboard or external tools to reach the gateway.
+
+Opening any port on an internet facing machine is a security risk. You should not do it unless you understand the risks.
+
+## Running the dashboard
+
+The built-in web dashboard can run alongside the gateway as a separate container. 
+
+To run the dashboard as its own container, point it at the gateway's health endpoint so it can detect gateway status across containers:
+
+```sh
+docker run -d \
+  --name hermes-dashboard \
+  --restart unless-stopped \
+  -v ~/.hermes:/opt/data \
+  -p 9119:9119 \
+  -e GATEWAY_HEALTH_URL=http://$HOST_IP:8642 \
+  nousresearch/hermes-agent dashboard
+```
+
+Replace `$HOST_IP` with the IP address of the machine running the gateway container (e.g. `192.168.1.100`), or use a Docker network hostname if both containers share a network (see the [Compose example](#docker-compose-example) below).
+
+| Environment variable | Description | Default |
+|---------------------|-------------|---------|
+| `GATEWAY_HEALTH_URL` | Base URL of the gateway's API server, e.g. `http://gateway:8642` | *(unset — local PID check only)* |
+| `GATEWAY_HEALTH_TIMEOUT` | Health probe timeout in seconds | `3` |
+
+Without `GATEWAY_HEALTH_URL`, the dashboard falls back to local process detection — which only works when the gateway runs in the same container or on the same host.
+
 ## Running interactively (CLI chat)

 To open an interactive chat session against a running data directory:
@@ -66,7 +96,7 @@ The `/opt/data` volume is the single source of truth for all Hermes state. It ma
 | `skins/` | Custom CLI skins |

 :::warning
-Never run two Hermes containers against the same data directory simultaneously — session files and memory stores are not designed for concurrent access.
+Never run two Hermes **gateway** containers against the same data directory simultaneously — session files and memory stores are not designed for concurrent write access. Running a dashboard container alongside the gateway is safe since the dashboard only reads data.
 :::

 ## Environment variable forwarding
@@ -85,18 +115,21 @@ Direct `-e` flags override values from `.env`. This is useful for CI/CD or secre

 ## Docker Compose example

-For persistent gateway deployment, a `docker-compose.yaml` is convenient:
+For persistent deployment with both the gateway and dashboard, a `docker-compose.yaml` is convenient:

 ```yaml
-version: "3.8"
 services:
  hermes:
    image: nousresearch/hermes-agent:latest
    container_name: hermes
    restart: unless-stopped
    command: gateway run
+    ports:
+      - "8642:8642"
    volumes:
      - ~/.hermes:/opt/data
+    networks:
+      - hermes-net
    # Uncomment to forward specific env vars instead of using .env file:
    # environment:
    #   - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
@@ -107,9 +140,34 @@ services:
        limits:
          memory: 4G
          cpus: "2.0"
+
+  dashboard:
+    image: nousresearch/hermes-agent:latest
+    container_name: hermes-dashboard
+    restart: unless-stopped
+    command: dashboard --host 0.0.0.0
+    ports:
+      - "9119:9119"
+    volumes:
+      - ~/.hermes:/opt/data
+    environment:
+      - GATEWAY_HEALTH_URL=http://hermes:8642
+    networks:
+      - hermes-net
+    depends_on:
+      - hermes
+    deploy:
+      resources:
+        limits:
+          memory: 512M
+          cpus: "0.5"
+
+networks:
+  hermes-net:
+    driver: bridge
 ```

-Start with `docker compose up -d` and view logs with `docker compose logs -f hermes`.
+Start with `docker compose up -d` and view logs with `docker compose logs -f`.

 ## Resource limits
Author	SHA1	Message	Date
Shannon Sands	bad9fe2452	add generic gateway startup readiness checks	2026-04-15 10:03:23 +10:00
Teknium	10494b42a1	feat(discord): register skills under /skill command group with category subcommands (#9909 ) Instead of consuming one top-level slash command slot per skill (hitting the 100-command limit with ~26 built-ins + 74 skills), skills are now organized under a single /skill group command with category-based subcommand groups: /skill creative ascii-art [args] /skill media gif-search [args] /skill mlops axolotl [args] Discord supports 25 subcommand groups × 25 subcommands = 625 max skills, well beyond the previous 74-slot ceiling. Categories are derived from the skill directory structure: - skills/creative/ascii-art/ → category 'creative' - skills/mlops/training/axolotl/ → category 'mlops' (top-level parent) - skills/dogfood/ → uncategorized (direct subcommand) Changes: - hermes_cli/commands.py: add discord_skill_commands_by_category() with category grouping, hub/disabled filtering, Discord limit enforcement - gateway/platforms/discord.py: replace top-level skill registration with _register_skill_group() using app_commands.Group hierarchy - tests: 7 new tests covering group creation, category grouping, uncategorized skills, hub exclusion, deep nesting, empty skills, and handler dispatch Inspired by Discord community suggestion from bottium.	2026-04-14 16:27:02 -07:00
Teknium	039023f497	diag: log all hermes processes on unexpected gateway shutdown (#9905 ) When the gateway receives SIGTERM/SIGINT, the shutdown handler now runs 'ps aux' and logs every hermes/gateway-related process (excluding itself). This will show in agent.log as: WARNING: Shutdown diagnostic — other hermes processes running: hermes 1234 ... hermes update --gateway hermes 5678 ... hermes gateway restart This is the missing diagnostic for #5646 / #6666 — we can prove the restarts are from systemctl but can't determine WHO issues the systemctl command. Next time it happens, the agent.log will contain the evidence (the process that sent the signal or called systemctl should still be alive when the handler fires).	2026-04-14 16:26:36 -07:00
Teknium	6448e1da23	feat(zai): add GLM-5V-Turbo support for coding plan (#9907 ) - Add glm-5v-turbo to OpenRouter, Nous, and native Z.AI model lists - Add glm-5v context length entry (200K tokens) to model metadata - Update Z.AI endpoint probe to try multiple candidate models per endpoint (glm-5.1, glm-5v-turbo, glm-4.7) — fixes detection for newer coding plan accounts that lack older models - Add zai to _PROVIDER_VISION_MODELS so auxiliary vision tasks (vision_analyze, browser screenshots) route through 5v Fixes #9888	2026-04-14 16:26:01 -07:00
Teknium	1e5e1e822b	fix: ESC cancels secret/sudo prompts, clearer skip messaging (#9902 ) - Add ESC key binding (eager) for secret_state and sudo_state modal prompts — fires immediately, same behavior as Ctrl+C cancel - Update placeholder text: 'Enter to submit · ESC to skip' (was 'Enter to skip' which was confusing — Enter on empty looked like submitting nothing rather than intentionally skipping) - Update widget body text: 'ESC or Ctrl+C to skip' - Change feedback message from 'Secret entry cancelled' to 'Secret entry skipped' — more accurate for the action taken - getpass fallback prompt also updated for non-TUI mode	2026-04-14 16:11:37 -07:00
Teknium	55ce76b372	feat: add architecture-diagram skill (Cocoon AI port) (#9906 ) Port of Cocoon AI's architecture-diagram-generator (MIT) as a Hermes skill. Generates professional dark-themed system architecture diagrams as standalone HTML/SVG files. Self-contained output, no dependencies. - SKILL.md with design system specs, color palette, layout rules - HTML template with all component types, arrow styles, legend examples - Fits alongside excalidraw in creative/ category Source: https://github.com/Cocoon-AI/architecture-diagram-generator	2026-04-14 16:10:18 -07:00
Teknium	1525624904	fix: block agent from self-destructing gateway via terminal (#6666 ) Add dangerous command patterns that require approval when the agent tries to run gateway lifecycle commands via the terminal tool: - hermes gateway stop/restart — kills all running agents mid-work - hermes update — pulls code and restarts the gateway - systemctl restart/stop (with optional flags like --user) These patterns fire the approval prompt so the user must explicitly approve before the agent can kill its own gateway process. In YOLO mode, the commands run without approval (by design — YOLO means the user accepts all risks). Also fixes the existing systemctl pattern to handle flags between the command and action (e.g. 'systemctl --user restart' was previously undetected because the regex expected the action immediately after 'systemctl'). Root cause: issue #6666 reported agents running 'hermes gateway restart' via terminal, killing the gateway process mid-agent-loop. The user sees the agent suddenly stop responding with no explanation. Combined with the SIGTERM auto-recovery from PR #9875, the gateway now both prevents accidental self-destruction AND recovers if it happens anyway. Test plan: - Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged - All 119 approval tests pass - E2E verified: hermes gateway restart, hermes update, systemctl --user restart all detected; hermes gateway status, systemctl status remain safe	2026-04-14 15:43:31 -07:00
Teknium	353b5bacbd	test: add tests for /health/detailed endpoint and gateway health probe - TestHealthDetailedEndpoint: 3 tests for the new API server endpoint (returns runtime data, handles missing status, no auth required) - TestProbeGatewayHealth: 5 tests for _probe_gateway_health() (URL normalization, successful/failed probes, fallback chain) - TestStatusRemoteGateway: 4 tests for /api/status remote fallback (remote probe triggers, skipped when local PID found, null PID handling)	2026-04-14 15:41:30 -07:00
Hermes Agent	139a5e37a4	docs(docker): add dashboard section, expose API port, update Compose example - Running in gateway mode: expose port 8642 for the API server and health endpoint, with a note on when it's needed. - New 'Running the dashboard' section: docker run command with GATEWAY_HEALTH_URL and env var reference table. - Docker Compose example: updated to include both gateway and dashboard services with internal network connectivity (hermes-net), so the dashboard probes the gateway via http://hermes:8642. - Concurrent access warning: clarified that running a read-only dashboard alongside the gateway is safe.	2026-04-14 15:41:30 -07:00
Hermes Agent	673acf22ae	fix: override stale 'stopped' state when health probe confirms gateway alive When the gateway responds to the health probe but the local gateway_state.json has a stale 'stopped' state (common in cross-container setups where the file was written before the gateway restarted), the dashboard would show 'Running (remote)' but with a 'Stopped' badge. Now if the HTTP probe succeeded (remote_health_body is not None) and gateway_state is 'stopped' or None, override it to 'running'. Also handles the no-shared-volume case where runtime is None entirely.	2026-04-14 15:41:30 -07:00
Hermes Agent	6ed682f111	fix: normalise GATEWAY_HEALTH_URL to base URL before probing The probe was appending '/detailed' to whatever URL was provided, so GATEWAY_HEALTH_URL=http://host:8642 would try /8642/detailed and /8642 — neither of which are valid routes. Now strips any trailing /health or /health/detailed from the env var and always probes {base}/health/detailed then {base}/health. Accepts bare base URL, /health, or /health/detailed forms.	2026-04-14 15:41:30 -07:00
Hermes Agent	45595f4805	feat(dashboard): add HTTP health probe for cross-container gateway detection The dashboard's gateway status detection relied solely on local PID checks (os.kill + /proc), which fails when the gateway runs in a separate container. Changes: - web_server.py: Add _probe_gateway_health() that queries the gateway's HTTP /health/detailed endpoint when the local PID check fails. Activated by setting the GATEWAY_HEALTH_URL env var (e.g. http://gateway:8642/health). Falls back to standard PID check when the env var is not set. - api_server.py: Add GET /health/detailed endpoint that returns full gateway state (platforms, gateway_state, active_agents, pid, etc.) without auth. The existing GET /health remains unchanged for backwards compatibility. - StatusPage.tsx: Handle the case where gateway_pid is null but the gateway is running remotely, displaying 'Running (remote)' instead of 'PID null'. Environment variables: - GATEWAY_HEALTH_URL: URL of the gateway health endpoint (e.g. http://gateway-container:8642/health). Unset = local PID check only. - GATEWAY_HEALTH_TIMEOUT: Probe timeout in seconds (default: 3).	2026-04-14 15:41:30 -07:00
Teknium	397386cae2	fix: gateway auto-recovers from unexpected SIGTERM via systemd (#5646 ) Root cause: when the gateway received SIGTERM (from hermes update, external kill, WSL2 runtime, etc.), it exited with status 0. systemd's Restart=on-failure only restarts on non-zero exit, so the gateway stayed dead permanently. Users had to manually restart. Fix 1: Signal-initiated shutdown exits non-zero When SIGTERM/SIGINT is received and no restart was requested (via /restart, /update, or SIGUSR1), start_gateway() returns False which causes sys.exit(1). systemd sees a failure exit and auto-restarts after RestartSec=30. This is safe because systemctl stop tracks its own stop-requested state independently of exit code — Restart= never fires for a deliberate stop, regardless of exit code. Also logs 'Received SIGTERM/SIGINT — initiating shutdown' so the cause of unexpected shutdowns is visible in agent.log. Fix 2: PID file ownership guard remove_pid_file() now checks that the PID file belongs to the current process before removing it. During --replace handoffs, the old process's atexit handler could fire AFTER the new process wrote its PID file, deleting the new record. This left the gateway running but invisible to get_running_pid(), causing 'Another gateway already running' errors on next restart. Test plan: - All restart drain tests pass (13) - All gateway service tests pass (84) - All update gateway restart tests pass (34)	2026-04-14 15:35:58 -07:00
Teknium	eed891f1bb	security: supply chain hardening — CI pinning, dep pinning, and code fixes (#9801 ) CI/CD Hardening: - Pin all 12 GitHub Actions to full commit SHAs (was mutable @vN tags) - Add explicit permissions: {contents: read} to 4 workflows - Pin CI pip installs to exact versions (pyyaml==6.0.2, httpx==0.28.1) - Extend supply-chain-audit.yml to scan workflow, Dockerfile, dependency manifest, and Actions version changes Dependency Pinning: - Pin git-based Python deps to commit SHAs (atroposlib, tinker, yc-bench) - Pin WhatsApp Baileys from mutable branch to commit SHA Tool Registry: - Reject tool name shadowing from different tool families (plugins/MCP cannot overwrite built-in tools). MCP-to-MCP overwrites still allowed. MCP Security: - Add tool description content scanning for prompt injection patterns - Log detailed change diff on dynamic tool refresh at WARNING level Skill Manager: - Fix dangerous verdict bug: agent-created skills with dangerous findings were silently allowed (ask->None->allow). Now blocked.	2026-04-14 14:23:37 -07:00
Teknium	9bbf7659e9	chore: add Roy-oss1 to AUTHOR_MAP	2026-04-14 14:22:11 -07:00
Roy-oss1	1aa76620d4	fix(feishu): keep approval clicks synchronized with callback card state Feishu approval clicks need the resolved card to come back from the synchronous callback path itself. Leaving approval resolution to the generic asynchronous card-action flow made button feedback depend on later loop work instead of the callback response the client is waiting for. Change-Id: I574997cbbcaa097fdba759b47367e28d1b56b040 Constraint: Feishu card-action callbacks must acknowledge quickly and reflect final approval state from the callback response path Rejected: Keep approval handling on the generic async card-action route \| leaves card state synchronization vulnerable to callback timing and follow-up update ordering Confidence: high Scope-risk: narrow Reversibility: clean Directive: Keep approval callback response construction separate from async queue unblocking unless Feishu callback semantics change Tested: pytest tests/gateway/test_feishu.py tests/gateway/test_feishu_approval_buttons.py tests/gateway/test_approve_deny_commands.py tests/gateway/test_slack_approval_buttons.py tests/gateway/test_telegram_approval_buttons.py -q Not-tested: Live Feishu workspace end-to-end callback rendering	2026-04-14 14:22:11 -07:00
Teknium	fa8c448f7d	fix: notify active sessions on gateway shutdown + update health check Three fixes for gateway lifecycle stability: 1. Notify active sessions before shutdown (#new) When the gateway receives SIGTERM or /restart, it now sends a notification to every chat with an active agent BEFORE starting the drain. Users see: - Shutdown: 'Gateway shutting down — your task will be interrupted.' - Restart: 'Gateway restarting — use /retry after restart to continue.' Deduplicates per-chat so group sessions with multiple users get one notification. Best-effort: send failures are logged and swallowed. 2. Skip .clean_shutdown marker when drain timed out Previously, a graceful SIGTERM always wrote .clean_shutdown, even if agents were force-interrupted when the drain timed out. This meant the next startup skipped session suspension, leaving interrupted sessions in a broken state (trailing tool response, no final message). Now the marker is only written if the drain completed without timeout, so interrupted sessions get properly suspended on next startup. 3. Post-restart health check for hermes update (#6631) cmd_update() now verifies the gateway actually survived after systemctl restart (sleep 3s + is-active check). If the service crashed immediately, it retries once. If still dead, prints actionable diagnostics (journalctl command, manual restart hint). Also closes #8104 — already fixed on main (the /restart handler correctly detects systemd via INVOCATION_ID and uses via_service=True). Test plan: - 6 new tests for shutdown notifications (dedup, restart vs shutdown messaging, sentinel filtering, send failure resilience) - Existing restart drain + update tests pass (47 total)	2026-04-14 14:21:57 -07:00