Env robustness: context-safe prompting + tool arg normalization

- Preserve full trajectory while truncating prompt view per turn (avoids context overflow)
- Add max_context_tokens support and wire from env config
- Normalize tool call arguments robustly (dict / stringified JSON / plain string)
- Avoid double-encoding tool arguments in Hermes parser
- Add tool-call metrics to AgentResult for debugging/optional shaping

Scope: environments/* only
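
The argument normalization described in this commit message could look roughly like the sketch below. It is not the code from the commit: `normalize_tool_args`, `encode_tool_args`, and the `"input"` wrapper key are hypothetical names chosen for illustration. The idea is to accept a dict, a JSON string, or a plain string, and to serialize arguments exactly once.

```python
import json
from typing import Any, Dict


def normalize_tool_args(raw: Any) -> Dict[str, Any]:
    """Coerce tool-call arguments into a dict (hypothetical helper)."""
    if raw is None:
        return {}
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        if not text:
            return {}
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            # Plain string: wrap it under an assumed key.
            return {"input": raw}
        if isinstance(parsed, str):
            # Double-encoded JSON decodes to another string; unwrap once more.
            try:
                parsed = json.loads(parsed)
            except json.JSONDecodeError:
                return {"input": parsed}
        if isinstance(parsed, dict):
            return parsed
        return {"input": parsed}
    return {"input": raw}


def encode_tool_args(args: Any) -> str:
    """Serialize arguments once; a string that already holds a JSON object passes through."""
    if isinstance(args, str):
        try:
            if isinstance(json.loads(args), dict):
                return args
        except json.JSONDecodeError:
            pass
    return json.dumps(normalize_tool_args(args))
```

Passing an already-serialized string through `encode_tool_args` leaves it unchanged, which is the property that prevents double encoding.
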
Add platform-specific formatting hints and identity for AIAgent
2026-02-14 13:13:00 +10:00 · 2026-02-12 16:11:16 -08:00 · 2026-02-12 15:59:31 -08:00 · 2026-02-12 10:07:03 -08:00 · 2026-02-12 10:05:08 -08:00 · 2026-02-12 05:38:15 +00:00
47 changed files with 4837 additions and 426 deletions
@@ -42,9 +42,10 @@ TERMINAL_ENV=local


 # Container images (for singularity/docker/modal backends)
-TERMINAL_DOCKER_IMAGE=python:3.11
-TERMINAL_SINGULARITY_IMAGE=docker://python:3.11
-TERMINAL_MODAL_IMAGE=python:3.11
+TERMINAL_DOCKER_IMAGE=nikolaik/python-nodejs:python3.11-nodejs20
+TERMINAL_SINGULARITY_IMAGE=docker://nikolaik/python-nodejs:python3.11-nodejs20
+TERMINAL_MODAL_IMAGE=nikolaik/python-nodejs:python3.11-nodejs20
+

 # Working directory for terminal commands
 # For CLI: "." means current directory (resolved automatically from config.yaml)
@@ -0,0 +1,142 @@
+# Project Notes
+
+*Maintained by Hermes — last updated June 2025*
+
+---
+
+## 1. Kandinsky (Multimodal Transformer)
+- **Repo:** https://github.com/samherring99/kandinsky
+- **Local path:** `~/Desktop/Projects/kandinsky`
+- **Description:** An anything-to-anything transformer combining text, image, and audio modalities. Trains on Pokemon BLIP captions paired with Gen 1 Pokemon audio cries. Uses audio tokenization adapted from nanoGPT.
+- **Status:** Early POC. Training code exists (`model.py`) and dataset creation (`create_dataset.py`) works. Audio heads are producing the same sound — unclear if it's a training issue or data issue.
+- **TODO:**
+  - Debug why audio heads produce identical output
+  - Investigate if model needs more training time
+  - Design a data pipeline for better/more training data
+  - General repo cleanup (requirements.txt, proper CLI, etc.)
+
+---
+
+## 2. NightwingGameSim (LLM → GameBoy ROM Generator)
+- **Repo:** https://github.com/samherring99/NightwingGameSim
+- **Local path:** `~/Desktop/Projects/NightwingGameSim`
+- **Description:** AI-powered pipeline that turns natural language prompts into playable GameBoy ROM files. Generates C code, compiles with GBDK, outputs `.gb` files. Supports Claude API, local Llama, and RAG backends.
+- **Status:** Functional — generation pipeline works end-to-end with Claude 4 system prompt. Has tests, docs, examples, and retry logic.
+- **TODO:**
+  - Harden the repo, clean up structure
+  - Build a better testing pipeline
+  - Come up with better prompt ideas / examples
+
+---
+
+## 3. ContentBasedMIR (Music Information Retrieval)
+- **Repo:** https://github.com/samherring99/ContentBasedMIR
+- **Local path:** `~/Desktop/Projects/ContentBasedMIR`
+- **Description:** Music similarity analysis using Spotify API track data. Extracts 54 audio features per song and visualizes similarity matrices for music recommendation.
+- **Status:** Early stage. Can download Spotify track analysis data and plot similarity matrices. Needs significant expansion.
+- **TODO:**
+  - Expand analysis pipeline with more features
+  - Integrate with text message data for personalized recommendations
+  - Build out visualization and exploration tools
+  - General modernization (dependencies, structure)
+
+---
+
+## 4. MessageRetrieval (iMessage RAG/SQL)
+- **Repo:** https://github.com/samherring99/MessageRetrieval
+- **Local path:** `~/Desktop/Projects/MessageRetrieval`
+- **Description:** Natural language querying over iMessage data using SQL generation (text2SQL) instead of vector embeddings. Uses LLM-as-Judge pattern for scoring and ranking retrieved messages.
+- **Status:** Has initial text2SQL pipeline and summarization tool. Recently worked on with Claude Code. Needs testing.
+- **TODO:**
+  - Test out the recent Claude Code work
+  - Build "iMessage Jarvis" — answer questions about texts
+  - Improve SQL generation prompts and accuracy
+  - Better error handling and UX
+
+---
+
+## 5. Grailed Embedding Search
+- **Repo:** https://github.com/samherring99/grailed-embedding-search
+- **Local path:** `~/Desktop/Projects/grailed-embedding-search`
+- **Description:** Semantic similarity search over Grailed fashion listings using CLIP embeddings and FAISS. Search by image URL or text description to find visually similar products.
+- **Status:** Functional core pipeline. CLIP ViT-B/32 embeds product cover photos into 512-dim vectors, indexed with FAISS cosine similarity. Has CLI, batch embedding, persistent index save/load, and logging.
+- **Recent work (June 2025):**
+  - PR #1 — Initial cleanup: docstrings, type hints, `.gitignore`, `requirements.txt`, README rewrite
+  - PR #2 — Feature improvements: persistent FAISS save/load, batch embedding, CLI (`cli.py`), proper logging throughout, lazy Grailed client, `fetch_details` toggle
+- **TODO:**
+  - Embedding cache (avoid re-embedding known product URLs)
+  - Async/threaded image downloads for faster batch indexing
+  - Search result visualization (matplotlib grid of cover photos)
+  - Filter by category, designer, price range before search
+  - Web UI (Gradio or Streamlit)
+
+---
+
+## 6. NightwingNBA (Sports Analytics)
+- **Repo:** https://github.com/samherring99/NightwingNBA
+- **Local path:** `~/Desktop/Projects/NightwingNBA`
+- **Description:** NBA game prediction system. Builds a database of game data, trains a PyTorch model, and makes daily predictions. Has full pipeline: build DB → write data → train → predict.
+- **Status:** Functional pipeline exists. Has database building, training, prediction, and daily update scripts.
+- **TODO:**
+  - Explore and potentially revive
+  - Update data sources if stale
+  - Improve model accuracy
+  - Add visualization/reporting
+
+---
+
+## 7. Stable Audio Sample Explorer
+- **Repo:** https://github.com/samherring99/stable-audio-sample-explorer
+- **Local path:** `~/Desktop/Projects/stable-audio-sample-explorer`
+- **Description:** Tool for exploring audio samples generated by Stable Audio.
+- **Status:** 🪦 **Dead** — no active work needed per Sam.
+
+---
+
+## 8. NightwingArt (Art Tools)
+- **Repo:** https://github.com/samherring99/NightwingArt
+- **Local path:** `~/Desktop/Projects/NightwingArt`
+- **Description:** Collection of art tooling scripts — video editing, clip splicing with beat matching, damage effects, and general image manipulation.
+- **Status:** Maintenance mode. Tools exist for various effects. Work happens as-needed.
+- **TODO:**
+  - Add tools as needed for new art projects
+
+---
+
+## 9. Claude-based VST Building ⚠️ *Needs new repo*
+- **Description:** Generate VST audio plugins for DAWs from English language prompts. LLM-powered audio plugin creation.
+- **Status:** Concept only — no repo exists yet.
+- **TODO:**
+  - Create repo
+  - Research VST SDK / JUCE framework
+  - Design prompt → code → compile pipeline
+
+---
+
+## 10. Government Auction Site Scraper ⚠️ *Needs new repo*
+- **Description:** Tool that monitors and scrapes government auction sites in San Francisco for deals.
+- **Status:** Concept only — no repo exists yet.
+- **TODO:**
+  - Create repo
+  - Research SF government auction sites and their structure
+  - Build scraper + notification system
+
+---
+
+## Priority Assessment
+
+| Project | Activity Level | Suggested Priority |
+|---------|---------------|-------------------|
+| NightwingGameSim | Active | 🔴 High |
+| MessageRetrieval | Active | 🔴 High |
+| Kandinsky | Active | 🟡 Medium |
+| ContentBasedMIR | Exploratory | 🟡 Medium |
+| Grailed Embedding Search | Early | 🟡 Medium |
+| NightwingNBA | Dormant | 🟢 Low |
+| NightwingArt | As-needed | 🟢 Low |
+| VST Builder | Concept | 🔵 Future |
+| Gov Auction Scraper | Concept | 🔵 Future |
+| Stable Audio Explorer | Dead | ⚫ None |
+
+
+
@@ -37,8 +37,9 @@ All your settings are stored in `~/.hermes/` for easy access:

 ```
 ~/.hermes/
-├── config.yaml     # Settings (model, terminal, compression, etc.)
+├── config.yaml     # Settings (model, terminal, TTS, compression, etc.)
 ├── .env            # API keys and secrets
+├── SOUL.md         # Optional: global persona (agent embodies this personality)
 ├── cron/           # Scheduled jobs
 ├── sessions/       # Gateway sessions
 └── logs/           # Logs
@@ -76,6 +77,8 @@ You need at least one LLM provider:
 | Web scraping | [Firecrawl](https://firecrawl.dev/) | `FIRECRAWL_API_KEY` |
 | Browser automation | [Browserbase](https://browserbase.com/) | `BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID` |
 | Image generation | [FAL](https://fal.ai/) | `FAL_KEY` |
+| Premium TTS voices | [ElevenLabs](https://elevenlabs.io/) | `ELEVENLABS_API_KEY` |
+| OpenAI TTS voices | [OpenAI](https://platform.openai.com/api-keys) | `OPENAI_API_KEY` |
 | RL Training | [Tinker](https://tinker-console.thinkingmachines.ai/) + [WandB](https://wandb.ai/) | `TINKER_API_KEY`, `WANDB_API_KEY` |
 | Messaging | Telegram, Discord | `TELEGRAM_BOT_TOKEN`, `DISCORD_BOT_TOKEN` |

@@ -128,7 +131,58 @@ hermes --toolsets "web,terminal"
 hermes --list-tools
 ```

-**Available toolsets:** `web`, `terminal`, `browser`, `vision`, `creative`, `reasoning`, `skills`, `cronjob`, and more.
+**Available toolsets:** `web`, `terminal`, `browser`, `vision`, `creative`, `reasoning`, `skills`, `tts`, `cronjob`, and more.
+
+### 🔊 Text-to-Speech
+
+Convert text to speech with three providers:
+
+| Provider | Quality | Cost | API Key |
+|----------|---------|------|---------|
+| **Edge TTS** (default) | Good | Free | None needed |
+| **ElevenLabs** | Excellent | Paid | `ELEVENLABS_API_KEY` |
+| **OpenAI TTS** | Good | Paid | `OPENAI_API_KEY` |
+
+On Telegram, audio plays as a native voice bubble. On Discord and WhatsApp, it is sent as an audio file. In CLI mode, it is saved to `~/voice-memos/`.
+
+**Configure in `~/.hermes/config.yaml`:**
+```yaml
+tts:
+  provider: "edge"              # "edge" | "elevenlabs" | "openai"
+  edge:
+    voice: "en-US-AriaNeural"   # 322 voices, 74 languages
+  elevenlabs:
+    voice_id: "pNInz6obpgDQGcFmaJgB"  # Adam
+    model_id: "eleven_multilingual_v2"
+  openai:
+    model: "gpt-4o-mini-tts"
+    voice: "alloy"              # alloy, echo, fable, onyx, nova, shimmer
+```
+
+> **Note:** Telegram voice bubbles require `ffmpeg` for Opus conversion (Edge TTS only outputs MP3). Install with `apt install ffmpeg` or `brew install ffmpeg`. Without ffmpeg, audio is sent as a file instead of a voice bubble.
+
+### 📄 Context Files (SOUL.md, AGENTS.md, .cursorrules)
+
+Drop these files in your project directory and the agent automatically picks them up:
+
+| File | Purpose |
+|------|---------|
+| `AGENTS.md` | Project-specific instructions, coding conventions, tool usage guidelines |
+| `SOUL.md` | Persona definition -- the agent embodies this personality and tone |
+| `.cursorrules` | Cursor IDE rules (also detected) |
+| `.cursor/rules/*.mdc` | Cursor rule files (also detected) |
+
+- **AGENTS.md** is hierarchical: if subdirectories also have `AGENTS.md`, all are combined (like Codex/Cline).
+- **SOUL.md** checks cwd first, then `~/.hermes/SOUL.md` as a global fallback.
+- All context files are capped at 20,000 characters with smart truncation.
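
A minimal sketch of the `SOUL.md` lookup order and the 20,000-character cap described above. The README does not say how the "smart truncation" works, so the head-plus-tail strategy here is an assumption, and `find_soul_file` and `truncate_middle` are hypothetical names.

```python
from pathlib import Path
from typing import Optional

MAX_CONTEXT_CHARS = 20_000  # cap stated in the README


def find_soul_file(cwd: Path, home: Path) -> Optional[Path]:
    """Check the working directory first, then ~/.hermes/SOUL.md as the global fallback."""
    for candidate in (cwd / "SOUL.md", home / ".hermes" / "SOUL.md"):
        if candidate.is_file():
            return candidate
    return None


def truncate_middle(text: str, limit: int = MAX_CONTEXT_CHARS) -> str:
    """Assumed strategy: keep the head and the tail, drop the middle."""
    if len(text) <= limit:
        return text
    marker = "\n[...truncated...]\n"
    keep = limit - len(marker)
    if keep <= 0:
        return text[:limit]
    head = keep * 2 // 3
    tail = keep - head
    return text[:head] + marker + text[len(text) - tail:]
```

Keeping both ends is one plausible choice because persona and instruction files often put key rules at the top and recent notes at the bottom.
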
+
+### 🛡️ Exec Approval (Messaging Platforms)
+
+When the agent tries to run a potentially dangerous command (`rm -rf`, `chmod 777`, etc.) on Telegram/Discord/WhatsApp, it asks the user for approval instead of silently blocking the command:
+
+> ⚠️ This command is potentially dangerous (recursive delete). Reply "yes" to approve.
+
+Reply "yes"/"y" to approve or "no"/"n" to deny. In CLI mode, the existing interactive approval prompt (once/session/always/deny) is preserved.
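
The reply handling could be sketched as a small three-way parser. `parse_approval_reply` is a hypothetical name; only the accepted words ("yes"/"y", "no"/"n") come from the text above, and treating anything else as "not an answer" is an assumption.

```python
from typing import Optional

APPROVE_WORDS = {"yes", "y"}
DENY_WORDS = {"no", "n"}


def parse_approval_reply(text: str) -> Optional[bool]:
    """Return True to approve, False to deny, None if the reply is not an answer."""
    reply = text.strip().lower()
    if reply in APPROVE_WORDS:
        return True
    if reply in DENY_WORDS:
        return False
    return None
```

Returning `None` for unrelated messages lets the caller decide whether to keep waiting or to treat the message as normal chat input.
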

 ### 🖥️ Terminal Backend

@@ -28,18 +28,13 @@ os.environ["HERMES_QUIET"] = "1"  # Our own modules
 import yaml

 # prompt_toolkit for fixed input area TUI
-from prompt_toolkit import PromptSession
 from prompt_toolkit.history import FileHistory
 from prompt_toolkit.styles import Style as PTStyle
-from prompt_toolkit.formatted_text import HTML
 from prompt_toolkit.patch_stdout import patch_stdout
-from prompt_toolkit.application import Application, get_app
-from prompt_toolkit.buffer import Buffer
+from prompt_toolkit.application import Application
 from prompt_toolkit.layout import Layout, HSplit, Window, FormattedTextControl
-from prompt_toolkit.layout.processors import BeforeInput
 from prompt_toolkit.widgets import TextArea
 from prompt_toolkit.key_binding import KeyBindings
-import asyncio
 import threading
 import queue

@@ -498,6 +493,8 @@ COMMANDS = {
    "/clear": "Clear screen and reset conversation (fresh start)",
    "/history": "Show conversation history",
    "/reset": "Reset conversation only (keep screen)",
+    "/retry": "Retry the last message (resend to agent)",
+    "/undo": "Remove the last user/assistant exchange",
    "/save": "Save the current conversation",
    "/config": "Show current configuration",
    "/cron": "Manage scheduled tasks (list, add, remove)",
@@ -508,7 +505,11 @@ COMMANDS = {

 def save_config_value(key_path: str, value: any) -> bool:
    """
-    Save a value to cli-config.yaml at the specified key path.
+    Save a value to the active config file at the specified key path.
+    
+    Respects the same lookup order as load_cli_config():
+    1. ~/.hermes/config.yaml (user config - preferred, used if it exists)
+    2. ./cli-config.yaml (project config - fallback)
    
    Args:
        key_path: Dot-separated path like "agent.system_prompt"
@@ -517,9 +518,15 @@ def save_config_value(key_path: str, value: any) -> bool:
    Returns:
        True if successful, False otherwise
    """
-    config_path = Path(__file__).parent / 'cli-config.yaml'
+    # Use the same precedence as load_cli_config: user config first, then project config
+    user_config_path = Path.home() / '.hermes' / 'config.yaml'
+    project_config_path = Path(__file__).parent / 'cli-config.yaml'
+    config_path = user_config_path if user_config_path.exists() else project_config_path
    
    try:
+        # Ensure parent directory exists (for ~/.hermes/config.yaml on first use)
+        config_path.parent.mkdir(parents=True, exist_ok=True)
+        
        # Load existing config
        if config_path.exists():
            with open(config_path, 'r') as f:
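
For reference, the lookup order in this hunk and the dot-separated `key_path` update can be sketched without any YAML dependency. This is an illustration under assumptions, not the code from the PR: `resolve_config_path` and `set_by_key_path` are hypothetical helpers, and the home and project directories are passed in so the logic is easy to test.

```python
from pathlib import Path
from typing import Any, Dict


def resolve_config_path(home: Path, project_dir: Path) -> Path:
    """Prefer ~/.hermes/config.yaml if it exists, else fall back to cli-config.yaml."""
    user_config = home / ".hermes" / "config.yaml"
    project_config = project_dir / "cli-config.yaml"
    return user_config if user_config.exists() else project_config


def set_by_key_path(config: Dict[str, Any], key_path: str, value: Any) -> None:
    """Set a nested value from a dotted path such as "agent.system_prompt"."""
    keys = key_path.split(".")
    node = config
    for key in keys[:-1]:
        child = node.get(key)
        if not isinstance(child, dict):
            # Create (or overwrite a non-dict with) an intermediate mapping.
            child = {}
            node[key] = child
        node = child
    node[keys[-1]] = value
```
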
@@ -631,26 +638,8 @@ class HermesCLI:
        short_uuid = uuid.uuid4().hex[:6]
        self.session_id = f"{timestamp_str}_{short_uuid}"
        
-        # Setup prompt_toolkit session with history
-        self._setup_prompt_session()
-    
-    def _setup_prompt_session(self):
-        """Setup prompt_toolkit session with history and styling."""
-        history_file = Path.home() / ".hermes_history"
-        
-        # Custom style for the prompt
-        self.prompt_style = PTStyle.from_dict({
-            'prompt': '#FFD700 bold',
-            'input': '#FFF8DC',
-        })
-        
-        # Create prompt session with file history
-        # Note: multiline disabled - Enter submits, use \ at end of line for continuation
-        self.prompt_session = PromptSession(
-            history=FileHistory(str(history_file)),
-            style=self.prompt_style,
-            enable_history_search=True,
-        )
+        # History file for persistent input recall across sessions
+        self._history_file = Path.home() / ".hermes_history"
    
    def _init_agent(self) -> bool:
        """
@@ -673,6 +662,7 @@ class HermesCLI:
                quiet_mode=True,  # Suppress verbose output for clean CLI
                ephemeral_system_prompt=self.system_prompt if self.system_prompt else None,
                session_id=self.session_id,  # Pass CLI's session ID to agent
+                platform="cli",  # CLI interface — agent uses terminal-friendly formatting
            )
            return True
        except Exception as e:
@@ -931,6 +921,67 @@ class HermesCLI:
        except Exception as e:
            print(f"(x_x) Failed to save: {e}")
    
+    def retry_last(self):
+        """Retry the last user message by removing the last exchange and re-sending.
+        
+        Removes the last assistant response (and any tool-call messages) and
+        the last user message, then re-sends that user message to the agent.
+        Returns the message to re-send, or None if there's nothing to retry.
+        """
+        if not self.conversation_history:
+            print("(._.) No messages to retry.")
+            return None
+        
+        # Walk backwards to find the last user message
+        last_user_idx = None
+        for i in range(len(self.conversation_history) - 1, -1, -1):
+            if self.conversation_history[i].get("role") == "user":
+                last_user_idx = i
+                break
+        
+        if last_user_idx is None:
+            print("(._.) No user message found to retry.")
+            return None
+        
+        # Extract the message text and remove everything from that point forward
+        last_message = self.conversation_history[last_user_idx].get("content", "")
+        self.conversation_history = self.conversation_history[:last_user_idx]
+        
+        print(f"(^_^)b Retrying: \"{last_message[:60]}{'...' if len(last_message) > 60 else ''}\"")
+        return last_message
+    
+    def undo_last(self):
+        """Remove the last user/assistant exchange from conversation history.
+        
+        Walks backwards and removes all messages from the last user message
+        onward (including assistant responses, tool calls, etc.).
+        """
+        if not self.conversation_history:
+            print("(._.) No messages to undo.")
+            return
+        
+        # Walk backwards to find the last user message
+        last_user_idx = None
+        for i in range(len(self.conversation_history) - 1, -1, -1):
+            if self.conversation_history[i].get("role") == "user":
+                last_user_idx = i
+                break
+        
+        if last_user_idx is None:
+            print("(._.) No user message found to undo.")
+            return
+        
+        # Count how many messages we're removing
+        removed_count = len(self.conversation_history) - last_user_idx
+        removed_msg = self.conversation_history[last_user_idx].get("content", "")
+        
+        # Truncate history to before the last user message
+        self.conversation_history = self.conversation_history[:last_user_idx]
+        
+        print(f"(^_^)b Undid {removed_count} message(s). Removed: \"{removed_msg[:60]}{'...' if len(removed_msg) > 60 else ''}\"")
+        remaining = len(self.conversation_history)
+        print(f"  {remaining} message(s) remaining in history.")
+    
    def _handle_prompt_command(self, cmd: str):
        """Handle the /prompt command to view or set system prompt."""
        parts = cmd.split(maxsplit=1)
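
`retry_last()` and `undo_last()` in this hunk share the same backward scan and truncation. A possible way to factor that out is sketched below; `find_last_user_index` and `pop_last_exchange` are hypothetical names, not part of the PR, and messages are assumed to be plain dicts with `role` and `content` keys.

```python
from typing import Dict, List, Optional, Tuple

Message = Dict[str, str]


def find_last_user_index(history: List[Message]) -> Optional[int]:
    """Walk backwards and return the index of the most recent user message."""
    for i in range(len(history) - 1, -1, -1):
        if history[i].get("role") == "user":
            return i
    return None


def pop_last_exchange(history: List[Message]) -> Tuple[List[Message], Optional[str]]:
    """Drop the last user message and everything after it.

    Returns the truncated history and the removed user text (None if there
    was no user message to remove).
    """
    idx = find_last_user_index(history)
    if idx is None:
        return history, None
    return history[:idx], history[idx].get("content", "")
```

With such a helper, retry would re-send the returned text and undo would simply discard it.
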
@@ -1268,6 +1319,13 @@ class HermesCLI:
        elif cmd_lower.startswith("/personality"):
            # Use original case (handler lowercases the personality name itself)
            self._handle_personality_command(cmd_original)
+        elif cmd_lower == "/retry":
+            retry_msg = self.retry_last()
+            if retry_msg and hasattr(self, '_pending_input'):
+                # Re-queue the message so process_loop sends it to the agent
+                self._pending_input.put(retry_msg)
+        elif cmd_lower == "/undo":
+            self.undo_last()
        elif cmd_lower == "/save":
            self.save_conversation()
        elif cmd_lower.startswith("/cron"):
@@ -1302,8 +1360,9 @@ class HermesCLI:
        # Add user message to history
        self.conversation_history.append({"role": "user", "content": message})
        
-        # Visual separator after user input
-        print("─" * 60, flush=True)
+        # Visual separator after user input (adapt to terminal width, capped for readability)
+        term_width = min(self.console.width, 120)
+        print("─" * term_width, flush=True)
        
        try:
            # Run the conversation with interrupt monitoring
@@ -1361,14 +1420,20 @@ class HermesCLI:
            
            if response:
                # Use simple print for compatibility with prompt_toolkit's patch_stdout
+                # Adapt box width to terminal (cap at 120 for readability)
+                box_width = min(self.console.width, 120)
+                inner = box_width - 2  # account for border chars ╭/╰ and ╮/╯
+                label = "⚕ Hermes"
+                padding = inner - len(label) - 1  # -1 for the leading space
+                
                print()
-                print("╭" + "─" * 58 + "╮")
-                print("│ ⚕ Hermes" + " " * 49 + "│")
-                print("╰" + "─" * 58 + "╯")
+                print("╭" + "─" * inner + "╮")
+                print("│ " + label + " " * max(padding, 0) + "│")
+                print("╰" + "─" * inner + "╯")
                print()
                print(response)
                print()
-                print("─" * 60)
+                print("─" * box_width)
            
            # If we have a pending message from interrupt, re-queue it for process_loop
            # instead of recursing (avoids unbounded recursion from rapid interrupts)
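
The width arithmetic in this hunk can be checked in isolation. The sketch below repeats the same formula in a standalone function (`render_header` is a hypothetical name, and the console width is passed in rather than read from `self.console`). At 60 columns it reproduces the box that was previously hardcoded.

```python
from typing import List


def render_header(console_width: int, label: str = "⚕ Hermes", cap: int = 120) -> List[str]:
    """Build the three lines of the response header box."""
    box_width = min(console_width, cap)          # cap for readability
    inner = box_width - 2                        # space between the corner characters
    padding = max(inner - len(label) - 1, 0)     # -1 for the space after the left border
    return [
        "╭" + "─" * inner + "╮",
        "│ " + label + " " * padding + "│",
        "╰" + "─" * inner + "╯",
    ]


if __name__ == "__main__":
    for line in render_header(60):
        print(line)
```

One caveat: on a terminal narrower than the label plus borders, the middle line is longer than the other two, because the padding is clamped at zero while the label is not shortened.
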
@@ -1382,37 +1447,6 @@ class HermesCLI:
            print(f"Error: {e}")
            return None
    
-    def get_input(self) -> Optional[str]:
-        """
-        Get user input using prompt_toolkit.
-        
-        Enter submits. For multiline, end line with \\ to continue.
-        
-        Returns:
-            The user's input, or None if EOF/interrupt
-        """
-        try:
-            # Get first line
-            line = self.prompt_session.prompt(
-                HTML('<prompt>❯ </prompt>'),
-                style=self.prompt_style,
-            )
-            
-            # Handle multi-line input (lines ending with \)
-            lines = [line]
-            while line.endswith("\\"):
-                lines[-1] = line[:-1]  # Remove trailing backslash
-                line = self.prompt_session.prompt(
-                    HTML('<prompt>  </prompt>'),  # Continuation prompt
-                    style=self.prompt_style,
-                )
-                lines.append(line)
-            
-            return "\n".join(lines).strip()
-            
-        except (EOFError, KeyboardInterrupt):
-            return None
-    
    def run(self):
        """Run the interactive CLI loop with persistent input at bottom."""
        self.show_banner()
@@ -1426,9 +1460,6 @@ class HermesCLI:
        self._should_exit = False
        self._last_ctrl_c_time = 0  # Track double Ctrl+C for force exit
        
-        # Create a persistent input area using prompt_toolkit Application
-        input_buffer = Buffer()
-        
        # Key bindings for the input area
        kb = KeyBindings()
        
@@ -1486,13 +1517,14 @@ class HermesCLI:
            self._should_exit = True
            event.app.exit()
        
-        # Create the input area widget
+        # Create the input area widget with persistent history across sessions
        input_area = TextArea(
            height=1,
            prompt='❯ ',
            style='class:input-area',
            multiline=False,
            wrap_lines=False,
+            history=FileHistory(str(self._history_file)),
        )
        
        # Create a status line that shows when agent is working
@@ -1545,6 +1577,7 @@ class HermesCLI:
                    
                    # Check for commands
                    if user_input.startswith("/"):
+                        print(f"\n⚙️  {user_input}")
                        if not self.process_command(user_input):
                            self._should_exit = True
                            # Schedule app exit
@@ -1556,6 +1589,9 @@ class HermesCLI:
                    self._agent_running = True
                    app.invalidate()  # Refresh status line
                    
+                    # Echo the user's input so it stays visible in scrollback
+                    print(f"\n💬 You: {user_input}")
+                    
                    try:
                        self.chat(user_input)
                    finally:
@@ -0,0 +1,330 @@
+# Hermes-Agent Atropos Environments
+
+This directory contains the integration layer between **hermes-agent's** tool-calling capabilities and the **Atropos** RL training framework. It provides everything needed to run agentic LLMs through multi-turn tool-calling loops, score their output with arbitrary reward functions, and feed results into Atropos for training or evaluation.
+
+## Architecture Overview
+
+```
+                        Atropos Framework
+                    ┌───────────────────────┐
+                    │       BaseEnv          │  (atroposlib)
+                    │  - Server management   │
+                    │  - Worker scheduling   │
+                    │  - Wandb logging       │
+                    │  - CLI (serve/process/ │
+                    │    evaluate)           │
+                    └───────────┬───────────┘
+                                │ inherits
+                    ┌───────────┴───────────┐
+                    │  HermesAgentBaseEnv    │  hermes_base_env.py
+                    │  - Terminal backend    │
+                    │  - Tool resolution     │
+                    │  - Agent loop          │
+                    │  - ToolContext         │
+                    │  - Async patches       │
+                    └───────────┬───────────┘
+                                │ inherits
+              ┌─────────────────┼─────────────────┐
+              │                 │                  │
+     TerminalTestEnv     HermesSweEnv    TerminalBench2EvalEnv
+     (stack testing)     (SWE training)   (TB2 benchmark eval)
+```
+
+### Inheritance Chain
+
+**BaseEnv** (from `atroposlib`) is the Atropos base class. It provides:
+- Server management (OpenAI-compatible API servers, VLLM, SGLang)
+- Worker scheduling for parallel rollouts
+- Wandb integration for metrics and rollout logging
+- CLI interface with three subcommands: `serve`, `process`, `evaluate`
+- `evaluate_log()` for saving eval results to JSON + samples.jsonl
+
+**HermesAgentBaseEnv** (`hermes_base_env.py`) extends BaseEnv with hermes-agent specifics:
+- Sets `os.environ["TERMINAL_ENV"]` to configure the terminal backend (local, docker, modal, ssh, singularity)
+- Resolves hermes-agent toolsets via `_resolve_tools_for_group()` (calls `get_tool_definitions()` from `model_tools.py`)
+- Implements `collect_trajectory()` which runs the full agent loop and computes rewards
+- Supports two-phase operation (Phase 1: OpenAI server, Phase 2: VLLM ManagedServer)
+- Applies monkey patches for async-safe tool operation at import time
+
+Concrete environments inherit from `HermesAgentBaseEnv` and implement:
+- `setup()` -- Load dataset, initialize state
+- `get_next_item()` -- Return the next item for rollout
+- `format_prompt()` -- Convert a dataset item into the user message
+- `compute_reward()` -- Score the rollout using ToolContext
+- `evaluate()` -- Periodic evaluation logic
+
+## Core Components
+
+### Agent Loop (`agent_loop.py`)
+
+`HermesAgentLoop` is the reusable multi-turn agent engine. It runs the same pattern as hermes-agent's `run_agent.py`:
+
+1. Send messages + tools to the API via `server.chat_completion()`
+2. If the response contains `tool_calls`, execute each one via `handle_function_call()` from `model_tools.py`
+3. Append tool results to the conversation and go back to step 1
+4. If the response has no tool_calls, the agent is done
+
+Tool calls are executed in a thread pool (`run_in_executor`) so backends that use `asyncio.run()` internally (Modal, Docker) don't deadlock inside Atropos's event loop.
+
+Returns an `AgentResult` containing the full conversation history, turn count, reasoning content per turn, tool errors, and optional ManagedServer state (for Phase 2).
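
The four steps above can be sketched as a toy loop. This is a simplified illustration, not `HermesAgentLoop` itself: messages are plain dicts rather than OpenAI types, and `chat_completion` and `handle_tool_call` are stand-in callables for `server.chat_completion()` and `handle_function_call()`.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict, List

Message = Dict[str, Any]


async def run_agent_loop(
    chat_completion: Callable[[List[Message]], Awaitable[Message]],
    handle_tool_call: Callable[[str, Dict[str, Any]], str],
    messages: List[Message],
    max_turns: int = 8,
) -> Dict[str, Any]:
    loop = asyncio.get_running_loop()
    turns = 0
    while turns < max_turns:
        turns += 1
        reply = await chat_completion(messages)           # step 1
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:                                # step 4: no tool calls, done
            break
        for call in tool_calls:                           # step 2
            args = json.loads(call["arguments"])
            # Run the sync tool in a worker thread so a tool that calls
            # asyncio.run() internally does not hit the running loop.
            result = await loop.run_in_executor(None, handle_tool_call, call["name"], args)
            messages.append(                              # step 3
                {"role": "tool", "tool_call_id": call["id"], "content": result}
            )
    return {"messages": messages, "turns": turns}


def demo() -> Dict[str, Any]:
    """Drive the loop with a scripted fake server and a fake tool."""
    replies = iter([
        {"role": "assistant", "content": "", "tool_calls": [
            {"id": "call_1", "name": "echo", "arguments": '{"text": "hi"}'}]},
        {"role": "assistant", "content": "done"},
    ])

    async def fake_chat_completion(_messages: List[Message]) -> Message:
        return next(replies)

    def fake_tool(name: str, args: Dict[str, Any]) -> str:
        async def work() -> str:
            return f"{name}: {args['text']}"
        return asyncio.run(work())  # fine here: this runs in a worker thread

    return asyncio.run(
        run_agent_loop(fake_chat_completion, fake_tool, [{"role": "user", "content": "say hi"}])
    )
```
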
+
+### Tool Context (`tool_context.py`)
+
+`ToolContext` is a per-rollout handle that gives reward/verification functions direct access to **all** hermes-agent tools, scoped to the rollout's `task_id`. The same `task_id` means the terminal/browser session is the SAME one the model used during its rollout -- all state (files, processes, browser tabs) is preserved.
+
+```python
+async def compute_reward(self, item, result, ctx: ToolContext):
+    # Run tests in the model's terminal sandbox
+    test = ctx.terminal("pytest -v")
+    if test["exit_code"] == 0:
+        return 1.0
+
+    # Check if a file was created
+    content = ctx.read_file("/workspace/solution.py")
+    if content.get("content"):
+        return 0.5
+
+    # Download files locally for verification (binary-safe)
+    ctx.download_file("/remote/output.bin", "/local/output.bin")
+
+    return 0.0
+```
+
+Available methods:
+- **Terminal**: `terminal(command, timeout)` -- run shell commands
+- **Files**: `read_file(path)`, `write_file(path, content)`, `search(query, path)`
+- **Transfers**: `upload_file()`, `upload_dir()`, `download_file()`, `download_dir()` -- binary-safe file transfers between host and sandbox
+- **Web**: `web_search(query)`, `web_extract(urls)`
+- **Browser**: `browser_navigate(url)`, `browser_snapshot()`
+- **Generic**: `call_tool(name, args)` -- call any hermes-agent tool by name
+- **Cleanup**: `cleanup()` -- release all resources (called automatically after `compute_reward`)
+
+### Patches (`patches.py`)
+
+**Problem**: Some hermes-agent tools use `asyncio.run()` internally (e.g., mini-swe-agent's Modal backend via SWE-ReX). This crashes when called from inside Atropos's event loop because `asyncio.run()` cannot be nested.
+
+**Solution**: `patches.py` monkey-patches `SwerexModalEnvironment` to use a dedicated background thread (`_AsyncWorker`) with its own event loop. The calling code sees the same sync interface, but internally the async work happens on a separate thread that doesn't conflict with Atropos's loop.
+
+What gets patched:
+- `SwerexModalEnvironment.__init__` -- creates Modal deployment on a background thread
+- `SwerexModalEnvironment.execute` -- runs commands on the same background thread
+- `SwerexModalEnvironment.stop` -- stops deployment on the background thread
+
+The patches are:
+- **Idempotent** -- calling `apply_patches()` multiple times is safe
+- **Transparent** -- same interface and behavior, only the internal async execution changes
+- **Universal** -- works identically in normal CLI use (no running event loop)
+
+Applied automatically at import time by `hermes_base_env.py`.
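
The background-thread approach can be illustrated with a small standalone class. This is a sketch of the general technique, not the `_AsyncWorker` from `patches.py`; the class and method names here are assumptions.

```python
import asyncio
import threading
from typing import Any, Coroutine, Optional


class AsyncWorker:
    """Runs coroutines on a private event loop in a background thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro: Coroutine[Any, Any, Any], timeout: Optional[float] = None) -> Any:
        """Sync interface: submit the coroutine and block until it finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def stop(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def add(a: int, b: int) -> int:
    await asyncio.sleep(0)
    return a + b


async def caller(worker: AsyncWorker) -> int:
    # Inside a running loop, asyncio.run(add(1, 2)) would raise RuntimeError.
    # Handing the coroutine to the worker's own loop avoids that.
    return worker.run(add(1, 2))
```

The same `worker.run(...)` call works with or without a running event loop in the calling thread. Note that it blocks the caller until the coroutine finishes, which is one reason to run tool calls in a thread pool.
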
+
+### Tool Call Parsers (`tool_call_parsers/`)
+
+Client-side parsers that extract structured `tool_calls` from raw model output text. Used in **Phase 2** (VLLM server type) where ManagedServer's `/generate` endpoint returns raw text without tool call parsing.
+
+Each parser is a standalone reimplementation of the corresponding VLLM parser's `extract_tool_calls()` logic. No VLLM dependency -- only standard library (`re`, `json`, `uuid`) and `openai` types.
+
+Available parsers:
+- `hermes` -- Hermes/ChatML `<tool_call>` XML format
+- `mistral` -- Mistral `[TOOL_CALLS]` format
+- `llama3_json` -- Llama 3 JSON tool calling
+- `qwen` -- Qwen tool calling format
+- `qwen3_coder` -- Qwen3 Coder format
+- `deepseek_v3` -- DeepSeek V3 format
+- `deepseek_v3_1` -- DeepSeek V3.1 format
+- `kimi_k2` -- Kimi K2 format
+- `longcat` -- Longcat format
+- `glm45` / `glm47` -- GLM model formats
+
+Usage:
+```python
+from environments.tool_call_parsers import get_parser
+
+parser = get_parser("hermes")
+content, tool_calls = parser.parse(raw_model_output)
+```
+
+In Phase 1 (OpenAI server type), these parsers are not needed -- the server handles tool call parsing natively.
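
To make the idea concrete, here is a minimal sketch of parsing the Hermes `<tool_call>` format with only the standard library. It is not the code in `hermes_parser.py`: `parse_hermes_tool_calls` is a hypothetical function, and it returns plain dicts instead of `openai` types.

```python
import json
import re
import uuid
from typing import Any, Dict, List, Optional, Tuple

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> Tuple[Optional[str], List[Dict[str, Any]]]:
    """Split raw model output into (content, tool_calls)."""
    tool_calls: List[Dict[str, Any]] = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        args = payload.get("arguments", {})
        # Serialize exactly once: a string is assumed to be JSON already.
        arguments = args if isinstance(args, str) else json.dumps(args)
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {"name": payload["name"], "arguments": arguments},
        })
    if not tool_calls:
        return text, []
    # Content is whatever the model wrote before the first tool call.
    content = text[: text.find("<tool_call>")].strip()
    return (content or None), tool_calls
```
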
+
+## Two-Phase Operation
+
+### Phase 1: OpenAI Server (Evaluation / SFT Data Generation)
+
+Uses `server.chat_completion()` with `tools=` parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively. Returns `ChatCompletion` objects with structured `tool_calls`.
+
+- Good for: evaluation, SFT data generation, testing
+- Run with: `serve` (with `run-api`), `process`, or `evaluate` subcommands
+- Placeholder tokens are created for the Atropos pipeline
+
+### Phase 2: VLLM ManagedServer (Full RL Training)
+
+Uses ManagedServer for exact token IDs + logprobs via `/generate`. Client-side tool call parser (from `tool_call_parsers/`) reconstructs structured `tool_calls` from raw output.
+
+- Good for: full RL training with GRPO/PPO
+- Run with: `serve` subcommand
+- Real tokens, masks, and logprobs flow through the pipeline
+
+## Directory Structure
+
+```
+environments/
+├── README.md                     # This file
+├── __init__.py                   # Package exports
+├── hermes_base_env.py            # Abstract base (HermesAgentBaseEnv)
+├── agent_loop.py                 # Multi-turn agent engine (HermesAgentLoop)
+├── tool_context.py               # Per-rollout tool access for reward functions
+├── patches.py                    # Async-safety patches for Modal backend
+│
+├── tool_call_parsers/            # Phase 2 client-side parsers
+│   ├── __init__.py               # Registry + base class
+│   ├── hermes_parser.py
+│   ├── mistral_parser.py
+│   ├── llama_parser.py
+│   ├── qwen_parser.py
+│   ├── qwen3_coder_parser.py
+│   ├── deepseek_v3_parser.py
+│   ├── deepseek_v3_1_parser.py
+│   ├── kimi_k2_parser.py
+│   ├── longcat_parser.py
+│   ├── glm45_parser.py
+│   └── glm47_parser.py
+│
+├── terminal_test_env/            # Stack validation environment
+│   └── terminal_test_env.py
+│
+├── hermes_swe_env/               # SWE-bench style training environment
+│   └── hermes_swe_env.py
+│
+└── benchmarks/                   # Evaluation benchmarks
+    └── terminalbench_2/
+        └── terminalbench2_env.py
+```
+
+## Concrete Environments
+
+### TerminalTestEnv (`terminal_test_env/`)
+
+A self-contained environment with inline tasks (no external dataset needed) for validating the full stack end-to-end. Each task asks the model to create a file at a known path, and the verifier checks that the content matches.
+
+```bash
+# Serve mode (needs run-api)
+run-api
+python environments/terminal_test_env/terminal_test_env.py serve
+
+# Process mode (no run-api, saves to JSONL)
+python environments/terminal_test_env/terminal_test_env.py process \
+    --env.data_path_to_save_groups terminal_test_output.jsonl
+```
+
+### HermesSweEnv (`hermes_swe_env/`)
+
+SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
+
+```bash
+python environments/hermes_swe_env/hermes_swe_env.py serve \
+    --openai.model_name YourModel \
+    --env.dataset_name bigcode/humanevalpack \
+    --env.terminal_backend modal
+```
+
+### TerminalBench2EvalEnv (`benchmarks/terminalbench_2/`)
+
+**Eval-only** environment for the Terminal-Bench 2.0 benchmark (89 tasks). Each task gets a pre-built Docker Hub image, a natural language instruction, and a test suite. The agent uses terminal + file tools to solve the task, then the test suite verifies correctness.
+
+Follows the standard Atropos eval pattern (like GPQA, MMLU, etc.):
+- Run via `evaluate` subcommand (no `run-api` needed)
+- `setup()` loads the dataset, `evaluate()` runs all tasks
+- `rollout_and_score_eval()` handles per-task agent loop + test verification
+- Downloads verifier output locally for reliable reward checking (Harbor pattern)
+
+```bash
+# Run full benchmark
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+    --openai.model_name anthropic/claude-opus-4.6
+
+# Run subset of tasks
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+    --openai.model_name anthropic/claude-opus-4.6 \
+    --env.task_filter fix-git,git-multibranch
+
+# Skip specific tasks
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+    --openai.model_name anthropic/claude-opus-4.6 \
+    --env.skip_tasks heavy-task,slow-task
+```
+
+## Creating a New Environment
+
+### Training Environment
+
+1. Create a new directory under `environments/`
+2. Create your env file inheriting from `HermesAgentBaseEnv`
+3. Implement the four abstract methods + `evaluate()`
+
+```python
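+from datasets import load_dataset  # assumes the HuggingFace `datasets` package
+
+from atroposlib.envs.server_handling.server_manager import APIServerConfig
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig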
+
+class MyEnvConfig(HermesAgentEnvConfig):
+    pass  # Add custom fields as needed
+
+class MyEnv(HermesAgentBaseEnv):
+    name = "my-env"
+    env_config_cls = MyEnvConfig
+
+    @classmethod
+    def config_init(cls):
+        env_config = MyEnvConfig(
+            enabled_toolsets=["terminal", "file"],
+            terminal_backend="modal",
+            # ... other config
+        )
+        server_configs = [APIServerConfig(...)]
+        return env_config, server_configs
+
+    async def setup(self):
+        self.dataset = load_dataset(...)
+        self.iter = 0
+
+    async def get_next_item(self):
+        item = self.dataset[self.iter % len(self.dataset)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item):
+        return item["instruction"]
+
+    async def compute_reward(self, item, result, ctx):
+        # ctx gives you full tool access to the rollout's sandbox
+        test = ctx.terminal("pytest -v")
+        return 1.0 if test["exit_code"] == 0 else 0.0
+
+    async def evaluate(self, *args, **kwargs):
+        # Periodic evaluation logic
+        ...
+
+if __name__ == "__main__":
+    MyEnv.cli()
+```
+
+### Eval-Only Environment (Benchmark)
+
+For eval benchmarks, follow the pattern in `terminalbench2_env.py`:
+1. Create under `environments/benchmarks/your-benchmark/`
+2. Inherit from `HermesAgentBaseEnv`
+3. Set eval-only config: `eval_handling=STOP_TRAIN`, `steps_per_eval=1`, `total_steps=1`
+4. Stub the training methods (`collect_trajectories`, `score`)
+5. Implement `rollout_and_score_eval()` and `evaluate()`
+6. Run with `evaluate` subcommand
+
+## Key Config Fields
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| `enabled_toolsets` | Which hermes toolsets to enable | `None` (all) |
+| `disabled_toolsets` | Toolsets to disable | `None` |
+| `distribution` | Probabilistic toolset distribution name | `None` |
+| `max_agent_turns` | Max LLM calls per rollout | `30` |
+| `agent_temperature` | Sampling temperature | `1.0` |
+| `terminal_backend` | `local`, `docker`, `modal`, `ssh`, `singularity` | `local` |
+| `system_prompt` | System message for the agent | `None` |
+| `tool_call_parser` | Parser name for Phase 2 | `hermes` |
+| `eval_handling` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` | `STOP_TRAIN` |
@@ -4,15 +4,18 @@ Hermes-Agent Atropos Environments
 Provides a layered integration between hermes-agent's tool-calling capabilities
 and the Atropos RL training framework.

-Layers:
+Core layers:
    - agent_loop: Reusable multi-turn agent loop with standard OpenAI-spec tool calling
    - tool_context: Per-rollout tool access handle for reward/verification functions
    - hermes_base_env: Abstract base environment (BaseEnv subclass) for Atropos
    - tool_call_parsers: Client-side tool call parser registry for Phase 2 (VLLM /generate)

 Concrete environments:
-    - terminal_test_env: Simple file-creation tasks for testing the stack
-    - hermes_swe_env: SWE-bench style tasks with Modal sandboxes
+    - terminal_test_env/: Simple file-creation tasks for testing the stack
+    - hermes_swe_env/: SWE-bench style tasks with Modal sandboxes
+
+Benchmarks (eval-only):
+    - benchmarks/terminalbench_2/: Terminal-Bench 2.0 evaluation
 """

 from environments.agent_loop import AgentResult, HermesAgentLoop
@@ -15,6 +15,7 @@ import asyncio
 import concurrent.futures
 import json
 import logging
+import os
 import uuid
 from dataclasses import dataclass, field
 from typing import Any, Dict, List, Optional, Set
@@ -24,7 +25,22 @@ from model_tools import handle_function_call
 # Thread pool for running sync tool calls that internally use asyncio.run()
 # (e.g., mini-swe-agent's modal/docker backends). Running them in a separate
 # thread gives them a clean event loop so they don't deadlock inside Atropos's loop.
-_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)
+# Size must be large enough for concurrent eval tasks (e.g., 89 TB2 tasks all
+# making tool calls). Too small = thread pool starvation, tasks queue for minutes.
+# Resized at runtime by HermesAgentBaseEnv.__init__ via resize_tool_pool().
+_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=128)
+
+
+def resize_tool_pool(max_workers: int):
+    """
+    Replace the global tool executor with a new one of the given size.
+
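+    Called by HermesAgentBaseEnv.__init__ based on config.tool_pool_size.
+    Call it before any tasks are submitted: the previous executor is replaced
+    without being shut down, so work already queued on it stays on the old pool.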
+    """
+    global _tool_executor
+    _tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
+    logger.info("Tool thread pool resized to %d workers", max_workers)

 logger = logging.getLogger(__name__)

@@ -57,6 +73,12 @@ class AgentResult:
    # Tool errors encountered during the loop
    tool_errors: List[ToolError] = field(default_factory=list)

+    # Tool-call metrics (debugging / optional reward shaping)
+    tool_calls_attempted: int = 0
+    tool_calls_schema_valid: int = 0
+    tool_calls_executed_ok: int = 0
+    tool_calls_exec_error: int = 0
+

 def _extract_reasoning_from_message(message) -> Optional[str]:
    """
@@ -119,6 +141,9 @@ class HermesAgentLoop:
        task_id: Optional[str] = None,
        temperature: float = 1.0,
        max_tokens: Optional[int] = None,
+        extra_body: Optional[Dict[str, Any]] = None,
+        tool_handler=None,
+        max_context_tokens: Optional[int] = None,
    ):
        """
        Initialize the agent loop.
@@ -132,6 +157,16 @@ class HermesAgentLoop:
            task_id: Unique ID for terminal/browser session isolation
            temperature: Sampling temperature for generation
            max_tokens: Max tokens per generation (None for server default)
+            extra_body: Extra parameters passed to the OpenAI client's create() call.
+                        Used for OpenRouter provider preferences, transforms, etc.
+                        e.g. {"provider": {"ignore": ["DeepInfra"]}}
+            tool_handler: Optional async callable(tool_name, args, task_id) -> str.
+                         When provided, used INSTEAD of handle_function_call() for
+                         tool dispatch. This allows sandbox backends (Modal, Nomad)
+                         to route tool calls through their slot-based execution.
+            max_context_tokens: Maximum prompt tokens before truncation.
+                               If None, no truncation is applied.
+                               Recommended: set to max_model_len - max_tokens - 512 (safety margin).
        """
        self.server = server
        self.tool_schemas = tool_schemas
@@ -140,6 +175,124 @@ class HermesAgentLoop:
        self.task_id = task_id or str(uuid.uuid4())
        self.temperature = temperature
        self.max_tokens = max_tokens
+        self.extra_body = extra_body
+        self.tool_handler = tool_handler
+        self.max_context_tokens = max_context_tokens
+
+    def _truncate_context(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+        """Truncate conversation history to fit within max_context_tokens.
+
+        Strategy:
+        - Keep system message (index 0) and initial user message (index 1) always
+        - Keep last 6 messages (recent context) always
+        - For everything in between, progressively truncate tool result content
+        - If still too long, drop oldest middle messages entirely
+
+        Uses rough char/4 token estimate (fast, no tokenizer needed).
+
+        NOTE: This function mutates the provided list (it may pop/replace entries).
+        Call it on a copy when you want to preserve the full trajectory.
+        """
+        if self.max_context_tokens is None:
+            return messages
+
+        def estimate_tokens(msgs):
+            total = 0
+            for m in msgs:
+                content = m.get("content", "") or ""
+                total += len(content) // 4 + 10  # ~4 chars per token + overhead
+                if "tool_calls" in m:
+                    total += 50 * len(m["tool_calls"])  # tool call overhead
+            return total
+
+        if estimate_tokens(messages) <= self.max_context_tokens:
+            return messages
+
+        protect_head = 2
+        protect_tail = max(0, min(6, len(messages) - protect_head))
+        middle_start = protect_head
+        middle_end = len(messages) - protect_tail
+
+        # Phase 1: truncate tool outputs in the middle
+        if middle_start < middle_end:
+            for i in range(middle_start, middle_end):
+                if messages[i].get("role") == "tool":
+                    content = messages[i].get("content", "") or ""
+                    if len(content) > 200:
+                        messages[i] = dict(messages[i])
+                        messages[i]["content"] = content[:100] + "\n...[truncated]...\n" + content[-50:]
+
+            if estimate_tokens(messages) <= self.max_context_tokens:
+                return messages
+
+        # Phase 2: drop oldest middle messages. When an assistant message with
+        # tool_calls is dropped, its matching tool results in the middle go too.
+        while middle_start < middle_end and estimate_tokens(messages) > self.max_context_tokens:
+            msg = messages[middle_start]
+            messages.pop(middle_start)
+            middle_end -= 1
+
+            if msg.get("role") == "assistant" and msg.get("tool_calls"):
+                tool_ids = {
+                    tc.get("id") or tc.get("tool_call_id", "")
+                    for tc in msg.get("tool_calls", [])
+                    if isinstance(tc, dict)
+                }
+                i = middle_start
+                while i < middle_end:
+                    if messages[i].get("role") == "tool" and messages[i].get("tool_call_id", "") in tool_ids:
+                        messages.pop(i)
+                        middle_end -= 1
+                    else:
+                        i += 1
+
+        return messages
+
+    def _normalize_tool_args(self, tool_name: str, tool_args_raw: str) -> tuple[Dict[str, Any], bool]:
+        """Normalize tool arguments into a dict.
+
+        Returns: (args_dict, schema_valid)
+
+        schema_valid is True only when arguments decode directly into a dict
+        (no double-decoding and no coercion/wrapping required).
+
+        Goal: keep environments robust (never crash on args format drift) while
+        still allowing reward functions to penalize malformed formats if desired.
+        """
+        try:
+            decoded = json.loads(tool_args_raw)
+        except json.JSONDecodeError:
+            # Not JSON at all — treat as a plain string
+            if tool_name == "terminal":
+                return {"command": tool_args_raw}, False
+            return {"input": tool_args_raw}, False
+
+        if isinstance(decoded, dict):
+            if tool_name == "terminal":
+                cmd = decoded.get("command")
+                if isinstance(cmd, str) and cmd.strip():
+                    return decoded, True
+                if isinstance(decoded.get("input"), str):
+                    return {"command": decoded.get("input")}, False
+                return decoded, False
+            return decoded, True
+
+        if isinstance(decoded, str):
+            s = decoded.strip()
+            if (s.startswith("{") and s.endswith("}")) or (s.startswith("[") and s.endswith("]")):
+                try:
+                    decoded2 = json.loads(s)
+                except json.JSONDecodeError:
+                    decoded2 = None
+                if isinstance(decoded2, dict):
+                    return decoded2, False
+
+            if tool_name == "terminal":
+                return {"command": decoded}, False
+            return {"input": decoded}, False
+
+        if tool_name == "terminal":
+            return {"command": str(decoded)}, False
+        return {"input": decoded}, False

    async def run(self, messages: List[Dict[str, Any]]) -> AgentResult:
        """
@@ -155,10 +308,22 @@ class HermesAgentLoop:
        reasoning_per_turn = []
        tool_errors: List[ToolError] = []

+        tool_calls_attempted = 0
+        tool_calls_schema_valid = 0
+        tool_calls_executed_ok = 0
+        tool_calls_exec_error = 0
+
+        import time as _time
+
        for turn in range(self.max_turns):
+            turn_start = _time.monotonic()
+
+            # Truncate prompt view on a copy (preserve full trajectory in `messages`)
+            prompt_messages = self._truncate_context(list(messages))
+
            # Build the chat_completion kwargs
            chat_kwargs = {
-                "messages": messages,
+                "messages": prompt_messages,
                "n": 1,
                "temperature": self.temperature,
            }
@@ -171,11 +336,18 @@ class HermesAgentLoop:
            if self.max_tokens is not None:
                chat_kwargs["max_tokens"] = self.max_tokens

+            # Inject extra_body for provider-specific params (e.g., OpenRouter
+            # provider preferences like banned/preferred providers, transforms)
+            if self.extra_body:
+                chat_kwargs["extra_body"] = self.extra_body
+
            # Make the API call -- standard OpenAI spec
+            api_start = _time.monotonic()
            try:
                response = await self.server.chat_completion(**chat_kwargs)
            except Exception as e:
-                logger.error("API call failed on turn %d: %s", turn + 1, e)
+                api_elapsed = _time.monotonic() - api_start
+                logger.error("API call failed on turn %d (%.1fs): %s", turn + 1, api_elapsed, e)
                return AgentResult(
                    messages=messages,
                    managed_state=self._get_managed_state(),
@@ -183,10 +355,16 @@ class HermesAgentLoop:
                    finished_naturally=False,
                    reasoning_per_turn=reasoning_per_turn,
                    tool_errors=tool_errors,
+                    tool_calls_attempted=tool_calls_attempted,
+                    tool_calls_schema_valid=tool_calls_schema_valid,
+                    tool_calls_executed_ok=tool_calls_executed_ok,
+                    tool_calls_exec_error=tool_calls_exec_error,
                )

+            api_elapsed = _time.monotonic() - api_start
+
            if not response or not response.choices:
-                logger.warning("Empty response on turn %d", turn + 1)
+                logger.warning("Empty response on turn %d (api=%.1fs)", turn + 1, api_elapsed)
                return AgentResult(
                    messages=messages,
                    managed_state=self._get_managed_state(),
@@ -194,6 +372,10 @@ class HermesAgentLoop:
                    finished_naturally=False,
                    reasoning_per_turn=reasoning_per_turn,
                    tool_errors=tool_errors,
+                    tool_calls_attempted=tool_calls_attempted,
+                    tool_calls_schema_valid=tool_calls_schema_valid,
+                    tool_calls_executed_ok=tool_calls_executed_ok,
+                    tool_calls_exec_error=tool_calls_exec_error,
                )

            assistant_msg = response.choices[0].message
@@ -236,6 +418,7 @@ class HermesAgentLoop:

                    # Validate tool name
                    if tool_name not in self.valid_tool_names:
+                        tool_calls_exec_error += 1
                        tool_result = json.dumps(
                            {
                                "error": f"Unknown tool '{tool_name}'. "
@@ -253,34 +436,47 @@ class HermesAgentLoop:
                            tool_name, turn + 1,
                        )
                    else:
-                        # Parse arguments and dispatch
-                        try:
-                            args = json.loads(tool_args_raw)
-                        except json.JSONDecodeError:
-                            args = {}
-                            logger.warning(
-                                "Invalid JSON in tool call arguments for '%s': %s",
-                                tool_name, tool_args_raw[:200],
-                            )
+                        tool_calls_attempted += 1
+                        args, schema_valid = self._normalize_tool_args(tool_name, tool_args_raw)
+                        if schema_valid:
+                            tool_calls_schema_valid += 1

                        try:
                            if tool_name == "terminal":
-                                import os
                                backend = os.getenv("TERMINAL_ENV", "local")
-                                cmd_preview = args.get("command", "")[:80]
-                                print(f"  🖥️  [{backend}] $ {cmd_preview}")
+                                cmd_preview = str(args.get("command", ""))[:80]
+                                logger.info(
+                                    "[%s] $ %s", self.task_id[:8], cmd_preview,
+                                )

-                            # Run tool calls in a thread pool so backends that use
-                            # asyncio.run() internally (modal, docker) get a clean
-                            # event loop instead of deadlocking inside Atropos's loop.
-                            loop = asyncio.get_event_loop()
-                            tool_result = await loop.run_in_executor(
-                                _tool_executor,
-                                lambda: handle_function_call(
-                                    tool_name, args, task_id=self.task_id
-                                ),
-                            )
+                            tool_submit_time = _time.monotonic()
+
+                            if self.tool_handler:
+                                tool_result = await self.tool_handler(tool_name, args, self.task_id)
+                            else:
+                                # Run tool calls in a thread pool so backends that use
+                                # asyncio.run() internally (modal, docker) get a clean
+                                # event loop instead of deadlocking inside Atropos's loop.
+                                loop = asyncio.get_running_loop()
+                                tool_result = await loop.run_in_executor(
+                                    _tool_executor,
+                                    lambda: handle_function_call(
+                                        tool_name, args, task_id=self.task_id
+                                    ),
+                                )
+
+                            tool_elapsed = _time.monotonic() - tool_submit_time
+
+                            # Log slow tools with the executor's queue depth for debugging.
+                            # NOTE: _work_queue is a private ThreadPoolExecutor attribute.
+                            pool_queued = _tool_executor._work_queue.qsize()
+                            if tool_elapsed > 30:
+                                logger.warning(
+                                    "[%s] turn %d: %s took %.1fs (pool queue=%d)",
+                                    self.task_id[:8], turn + 1, tool_name,
+                                    tool_elapsed, pool_queued,
+                                )
                        except Exception as e:
+                            tool_calls_exec_error += 1
                            tool_result = json.dumps(
                                {"error": f"Tool execution failed: {type(e).__name__}: {str(e)}"}
                            )
@@ -294,22 +490,31 @@ class HermesAgentLoop:
                                "Tool '%s' execution failed on turn %d: %s",
                                tool_name, turn + 1, e,
                            )
+                        else:
+                            tool_err = False
+                            try:
+                                result_data = json.loads(tool_result)
+                                if isinstance(result_data, dict):
+                                    err = result_data.get("error")
+                                    if err:
+                                        tool_err = True

-                        # Also check if the tool returned an error in its JSON result
-                        try:
-                            result_data = json.loads(tool_result)
-                            if isinstance(result_data, dict):
-                                err = result_data.get("error")
-                                exit_code = result_data.get("exit_code")
-                                if err and exit_code and exit_code < 0:
-                                    tool_errors.append(ToolError(
-                                        turn=turn + 1, tool_name=tool_name,
-                                        arguments=tool_args_raw[:200],
-                                        error=str(err),
-                                        tool_result=tool_result[:500],
-                                    ))
-                        except (json.JSONDecodeError, TypeError):
-                            pass
+                                    exit_code = result_data.get("exit_code")
+                                    if isinstance(exit_code, int) and exit_code < 0:
+                                        tool_err = True
+                                        tool_errors.append(ToolError(
+                                            turn=turn + 1, tool_name=tool_name,
+                                            arguments=tool_args_raw[:200],
+                                            error=str(err) if err else "negative exit_code",
+                                            tool_result=tool_result[:500],
+                                        ))
+                            except (json.JSONDecodeError, TypeError):
+                                pass
+
+                            if tool_err:
+                                tool_calls_exec_error += 1
+                            else:
+                                tool_calls_executed_ok += 1

                    # Add tool response to conversation
                    messages.append(
@@ -320,10 +525,11 @@ class HermesAgentLoop:
                        }
                    )

-                logger.debug(
-                    "Turn %d: %d tool calls executed",
-                    turn + 1,
-                    len(assistant_msg.tool_calls),
+                turn_elapsed = _time.monotonic() - turn_start
+                logger.info(
+                    "[%s] turn %d: api=%.1fs, %d tools, turn_total=%.1fs",
+                    self.task_id[:8], turn + 1, api_elapsed,
+                    len(assistant_msg.tool_calls), turn_elapsed,
                )

            else:
@@ -336,8 +542,10 @@ class HermesAgentLoop:
                    msg_dict["reasoning_content"] = reasoning
                messages.append(msg_dict)

-                logger.debug(
-                    "Turn %d: model finished naturally (no tool calls)", turn + 1
+                turn_elapsed = _time.monotonic() - turn_start
+                logger.info(
+                    "[%s] turn %d: api=%.1fs, no tools (finished), turn_total=%.1fs",
+                    self.task_id[:8], turn + 1, api_elapsed, turn_elapsed,
                )

                return AgentResult(
@@ -347,6 +555,10 @@ class HermesAgentLoop:
                    finished_naturally=True,
                    reasoning_per_turn=reasoning_per_turn,
                    tool_errors=tool_errors,
+                    tool_calls_attempted=tool_calls_attempted,
+                    tool_calls_schema_valid=tool_calls_schema_valid,
+                    tool_calls_executed_ok=tool_calls_executed_ok,
+                    tool_calls_exec_error=tool_calls_exec_error,
                )

        # Hit max turns without the model stopping
@@ -358,6 +570,10 @@ class HermesAgentLoop:
            finished_naturally=False,
            reasoning_per_turn=reasoning_per_turn,
            tool_errors=tool_errors,
+            tool_calls_attempted=tool_calls_attempted,
+            tool_calls_schema_valid=tool_calls_schema_valid,
+            tool_calls_executed_ok=tool_calls_executed_ok,
+            tool_calls_exec_error=tool_calls_exec_error,
        )

    def _get_managed_state(self) -> Optional[Dict[str, Any]]:
@@ -0,0 +1,38 @@
+# Terminal-Bench 2.0 Evaluation -- Default Configuration
+#
+# Eval-only environment for the TB2 benchmark (89 terminal tasks).
+# Uses Modal terminal backend for per-task cloud-isolated sandboxes
+# and OpenRouter for inference.
+#
+# Usage:
+#   python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+#       --config environments/benchmarks/terminalbench_2/default.yaml
+#
+#   # Override model:
+#   python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+#       --config environments/benchmarks/terminalbench_2/default.yaml \
+#       --openai.model_name anthropic/claude-sonnet-4
+
+env:
+  enabled_toolsets: ["terminal", "file"]
+  max_agent_turns: 60
+  max_token_length: 32000
+  agent_temperature: 0.8
+  terminal_backend: "modal"
+  terminal_timeout: 300        # 5 min per command (builds, pip install)
+  tool_pool_size: 128          # thread pool for 89 parallel tasks
+  dataset_name: "NousResearch/terminal-bench-2"
+  test_timeout: 600
+  task_timeout: 1800           # 30 min wall-clock per task, auto-FAIL if exceeded
+  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
+  use_wandb: true
+  wandb_name: "terminal-bench-2"
+  ensure_scores_are_not_same: false
+  data_dir_to_save_evals: "environments/benchmarks/evals/terminal-bench-2"
+
+openai:
+  base_url: "https://openrouter.ai/api/v1"
+  model_name: "anthropic/claude-opus-4.6"
+  server_type: "openai"
+  health_check: false
+  # api_key loaded from OPENROUTER_API_KEY in .env
@@ -0,0 +1,32 @@
+#!/bin/bash
+
+# Terminal-Bench 2.0 Evaluation
+#
+# Run from repo root:
+#   bash environments/benchmarks/terminalbench_2/run_eval.sh
+#
+# Override model:
+#   bash environments/benchmarks/terminalbench_2/run_eval.sh \
+#       --openai.model_name anthropic/claude-sonnet-4
+#
+# Run a subset:
+#   bash environments/benchmarks/terminalbench_2/run_eval.sh \
+#       --env.task_filter fix-git,git-multibranch
+
+mkdir -p logs evals/terminal-bench-2
+LOG_FILE="logs/terminalbench2_$(date +%Y%m%d_%H%M%S).log"
+
+echo "Terminal-Bench 2.0 Evaluation"
+echo "Log: $LOG_FILE"
+echo ""
+
+export TERMINAL_ENV=modal
+export TERMINAL_TIMEOUT=300
+
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+  --config environments/benchmarks/terminalbench_2/default.yaml \
+  "$@" \
+  2>&1 | tee "$LOG_FILE"
+
+echo ""
+echo "Log saved to: $LOG_FILE"
@@ -0,0 +1,904 @@
+"""
+TerminalBench2Env -- Terminal-Bench 2.0 Evaluation Environment
+
+Evaluates agentic LLMs on challenging terminal tasks from Terminal-Bench 2.0.
+Each task provides a unique Docker environment (pre-built on Docker Hub), a natural
+language instruction, and a test suite for verification. The agent uses terminal +
+file tools to complete the task, then the test suite runs inside the same sandbox.
+
+This is an eval-only environment (not a training environment). It is designed to
+be run via the `evaluate` subcommand:
+
+    python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \\
+        --env.dataset_name NousResearch/terminal-bench-2
+
+The evaluate flow:
+    1. setup()     -- Loads the TB2 dataset from HuggingFace
+    2. evaluate()  -- Iterates over all tasks, running each through:
+        a. rollout_and_score_eval()  -- Per-task agent loop + test verification
+            - Resolves Docker image (pre-built Hub image or Dockerfile fallback)
+            - Registers per-task Modal sandbox via register_task_env_overrides()
+            - Runs the HermesAgentLoop (terminal + file tools)
+            - Uploads test suite and runs test.sh in the same sandbox
+            - Returns binary pass/fail result
+        b. Aggregates per-task, per-category, and overall pass rates
+        c. Logs results via evaluate_log() and wandb
+
+Key features:
+  - Per-task Modal sandboxes using pre-built Docker Hub images
+  - Binary reward: 1.0 if all tests pass, 0.0 otherwise
+  - Parallel evaluation of all tasks, each bounded by a per-task wall-clock timeout
+  - Per-task, per-category, and aggregate pass rate tracking
+"""
+
+import asyncio
+import base64
+import io
+import json
+import logging
+import os
+import shutil
+import sys
+import tarfile
+import tempfile
+import time
+import uuid
+from collections import defaultdict
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# Ensure repo root is on sys.path for imports
+_repo_root = Path(__file__).resolve().parent.parent.parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from pydantic import Field
+
+from atroposlib.envs.base import EvalHandlingEnum
+from atroposlib.envs.server_handling.server_manager import APIServerConfig
+
+from environments.agent_loop import AgentResult, HermesAgentLoop
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+from environments.tool_context import ToolContext
+from tools.terminal_tool import (
+    register_task_env_overrides,
+    clear_task_env_overrides,
+    cleanup_vm,
+)
+
+logger = logging.getLogger(__name__)
+
+
+# =============================================================================
+# Configuration
+# =============================================================================
+
+class TerminalBench2EvalConfig(HermesAgentEnvConfig):
+    """
+    Configuration for the Terminal-Bench 2.0 evaluation environment.
+
+    Extends HermesAgentEnvConfig with TB2-specific settings for dataset loading,
+    test execution, image strategy, task filtering, and the per-task timeout.
+    """
+
+    # --- Dataset ---
+    dataset_name: str = Field(
+        default="NousResearch/terminal-bench-2",
+        description="HuggingFace dataset containing TB2 tasks.",
+    )
+
+    # --- Test execution ---
+    test_timeout: int = Field(
+        default=180,
+        description="Timeout in seconds for running the test suite after agent completes.",
+    )
+
+    # --- Image strategy ---
+    force_build: bool = Field(
+        default=False,
+        description="If True, always build from Dockerfile (ignore docker_image). "
+        "Useful for testing custom Dockerfiles.",
+    )
+
+    # --- Task filtering (comma-separated from CLI) ---
+    task_filter: Optional[str] = Field(
+        default=None,
+        description="Comma-separated task names to run (e.g., 'fix-git,git-multibranch'). "
+        "If not set, all tasks are run.",
+    )
+    skip_tasks: Optional[str] = Field(
+        default=None,
+        description="Comma-separated task names to skip on top of the default skip list.",
+    )
+
+    # --- Per-task wall-clock timeout ---
+    task_timeout: int = Field(
+        default=1800,
+        description="Maximum wall-clock seconds per task (agent loop + verification). "
+        "Tasks exceeding this are scored as FAIL. Default 30 minutes.",
+    )
+
+
+# Tasks that cannot run properly on Modal and are excluded from scoring.
+MODAL_INCOMPATIBLE_TASKS = {
+    "qemu-startup",        # Needs KVM/hardware virtualization
+    "qemu-alpine-ssh",     # Needs KVM/hardware virtualization
+    "crack-7z-hash",       # Password brute-force -- too slow for cloud sandbox timeouts
+}
+
+
+# =============================================================================
+# Tar extraction helper
+# =============================================================================
+
+def _extract_base64_tar(b64_data: str, target_dir: Path):
+    """Extract a base64-encoded tar.gz archive into target_dir."""
+    if not b64_data:
+        return
+    raw = base64.b64decode(b64_data)
+    buf = io.BytesIO(raw)
+    with tarfile.open(fileobj=buf, mode="r:gz") as tar:
+        tar.extractall(path=str(target_dir))
+
+
+# =============================================================================
+# Main Environment
+# =============================================================================
+
+class TerminalBench2EvalEnv(HermesAgentBaseEnv):
+    """
+    Terminal-Bench 2.0 evaluation environment (eval-only, no training).
+
+    Inherits from HermesAgentBaseEnv for:
+      - Terminal backend setup (os.environ["TERMINAL_ENV"])
+      - Tool resolution via _resolve_tools_for_group()
+      - Monkey patches for async-safe tool operation
+      - Wandb trajectory formatting
+
+    The evaluate flow (triggered by `terminalbench2_env.py evaluate`):
+      1. setup()    -- Load dataset from HuggingFace
+      2. evaluate() -- Run all tasks through rollout_and_score_eval()
+
+    Each task in rollout_and_score_eval():
+      1. Resolve Docker image (pre-built Hub image or Dockerfile fallback)
+      2. Register per-task Modal sandbox override
+      3. Run HermesAgentLoop with terminal + file tools
+      4. Upload test suite and execute test.sh in the same sandbox
+      5. Check /logs/verifier/reward.txt for pass/fail
+      6. Clean up sandbox, overrides, and temp files
+    """
+
+    name = "terminal-bench-2"
+    env_config_cls = TerminalBench2EvalConfig
+
+    @classmethod
+    def config_init(cls) -> Tuple[TerminalBench2EvalConfig, List[APIServerConfig]]:
+        """
+        Default configuration for Terminal-Bench 2.0 evaluation.
+
+        Uses eval-only settings:
+          - eval_handling=STOP_TRAIN so the eval flow runs cleanly
+          - steps_per_eval=1, total_steps=1 so eval triggers immediately
+          - group_size=1 (one rollout per group, each task is expensive)
+
+        Uses Modal terminal backend (cloud-isolated sandbox per task) and
+        OpenRouter with Claude for inference.
+        """
+        env_config = TerminalBench2EvalConfig(
+            # Terminal + file tools only (the agent interacts via shell commands)
+            enabled_toolsets=["terminal", "file"],
+            disabled_toolsets=None,
+            distribution=None,
+
+            # Agent settings -- TB2 tasks are complex, need many turns
+            max_agent_turns=60,
+            max_token_length=16000,
+            agent_temperature=0.6,
+            system_prompt=None,
+
+            # Modal backend for per-task cloud-isolated sandboxes
+            terminal_backend="modal",
+            terminal_timeout=300,   # 5 min per command (builds, pip install, etc.)
+
+            # Test execution timeout (TB2 test scripts can install deps like pytest)
+            test_timeout=180,
+
+            # 89 tasks run in parallel, each needs a thread for tool calls
+            tool_pool_size=128,
+
+            # --- Eval-only Atropos settings ---
+            # These settings make the env work as an eval-only environment:
+            #   - STOP_TRAIN: pauses training during eval (standard for eval envs)
+            #   - steps_per_eval=1, total_steps=1: eval triggers immediately
+            #   - group_size=1: one rollout per group (each task is expensive)
+            eval_handling=EvalHandlingEnum.STOP_TRAIN,
+            group_size=1,
+            steps_per_eval=1,
+            total_steps=1,
+
+            tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
+            use_wandb=True,
+            wandb_name="terminal-bench-2",
+            ensure_scores_are_not_same=False,  # Binary rewards may all be 0 or 1
+        )
+
+        # OpenRouter with Claude -- API key loaded from .env
+        server_configs = [
+            APIServerConfig(
+                base_url="https://openrouter.ai/api/v1",
+                model_name="anthropic/claude-sonnet-4",
+                server_type="openai",
+                api_key=os.getenv("OPENROUTER_API_KEY", ""),
+                health_check=False,
+            )
+        ]
+
+        return env_config, server_configs
+
+    # =========================================================================
+    # Setup -- load dataset
+    # =========================================================================
+
+    async def setup(self):
+        """Load the Terminal-Bench 2.0 dataset from HuggingFace."""
+        from datasets import load_dataset
+
+        # Auto-set terminal_lifetime to task_timeout + 120s so sandboxes
+        # never get killed during an active task, but still get cleaned up
+        # promptly after the task times out.
+        lifetime = self.config.task_timeout + 120
+        self.config.terminal_lifetime = lifetime
+        os.environ["TERMINAL_LIFETIME_SECONDS"] = str(lifetime)
+        print(f"  Terminal lifetime auto-set to {lifetime}s (task_timeout + 120s)")
+
+        print(f"Loading TB2 dataset from: {self.config.dataset_name}")
+        ds = load_dataset(self.config.dataset_name, split="train")
+
+        # Apply task filters (comma-separated strings from CLI)
+        tasks = list(ds)
+        if self.config.task_filter:
+            allowed = {name.strip() for name in self.config.task_filter.split(",")}
+            tasks = [t for t in tasks if t["task_name"] in allowed]
+            print(f"  Filtered to {len(tasks)} tasks: {sorted(allowed)}")
+
+        # Skip tasks incompatible with the current backend (e.g., QEMU on Modal)
+        # plus any user-specified skip_tasks
+        skip = set(MODAL_INCOMPATIBLE_TASKS) if self.config.terminal_backend == "modal" else set()
+        if self.config.skip_tasks:
+            skip |= {name.strip() for name in self.config.skip_tasks.split(",")}
+        if skip:
+            # Report only tasks actually removed here (not ones already dropped by task_filter)
+            present = {t["task_name"] for t in tasks}
+            tasks = [t for t in tasks if t["task_name"] not in skip]
+            if present & skip:
+                print(f"  Skipped {len(present & skip)} tasks (incompatible or user-excluded): {sorted(present & skip)}")
+
+        self.all_eval_items = tasks
+        self.iter = 0
+
+        # Build category index for per-category metrics
+        self.category_index: Dict[str, List[int]] = defaultdict(list)
+        for i, task in enumerate(self.all_eval_items):
+            self.category_index[task.get("category", "unknown")].append(i)
+
+        # Reward tracking for wandb logging
+        self.eval_metrics: List[Tuple[str, float]] = []
+
+        # Streaming JSONL writer -- saves each task's full conversation
+        # immediately on completion so data is preserved even on Ctrl+C.
+        # Timestamped filename so each run produces a unique file.
+        import datetime
+        log_dir = os.path.join(os.path.dirname(__file__), "logs")
+        os.makedirs(log_dir, exist_ok=True)
+        run_ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+        self._streaming_path = os.path.join(log_dir, f"samples_{run_ts}.jsonl")
+        self._streaming_file = open(self._streaming_path, "w", encoding="utf-8")
+        self._streaming_lock = __import__("threading").Lock()
+        print(f"  Streaming results to: {self._streaming_path}")
+
+        print(f"TB2 ready: {len(self.all_eval_items)} tasks across {len(self.category_index)} categories")
+        for cat, indices in sorted(self.category_index.items()):
+            print(f"  {cat}: {len(indices)} tasks")
+
+    def _save_result(self, result: Dict[str, Any]):
+        """Write a single task result to the streaming JSONL file immediately."""
+        if not hasattr(self, "_streaming_file") or self._streaming_file.closed:
+            return
+        with self._streaming_lock:
+            self._streaming_file.write(json.dumps(result, ensure_ascii=False, default=str) + "\n")
+            self._streaming_file.flush()
+
+    # =========================================================================
+    # Training pipeline stubs -- NOT used in eval-only mode
+    # =========================================================================
+    # These satisfy the abstract method requirements from HermesAgentBaseEnv.
+    # The evaluate subcommand calls setup() -> evaluate() directly, bypassing
+    # the training pipeline entirely.
+
+    async def get_next_item(self):
+        """Return next item (stub -- not used in eval-only mode)."""
+        item = self.all_eval_items[self.iter % len(self.all_eval_items)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item: Dict[str, Any]) -> str:
+        """Return the task's instruction as the user prompt."""
+        return item["instruction"]
+
+    async def compute_reward(self, item, result, ctx) -> float:
+        """Compute reward (stub -- actual verification is in rollout_and_score_eval)."""
+        return 0.0
+
+    async def collect_trajectories(self, item):
+        """Collect trajectories (stub -- not used in eval-only mode)."""
+        return None, []
+
+    async def score(self, rollout_group_data):
+        """Score rollouts (stub -- not used in eval-only mode)."""
+        return None
+
+    # =========================================================================
+    # Docker image resolution
+    # =========================================================================
+
+    def _resolve_task_image(
+        self, item: Dict[str, Any], task_name: str
+    ) -> Tuple[str, Optional[Path]]:
+        """
+        Resolve the Docker image for a task, with fallback to Dockerfile.
+
+        Strategy (mirrors Harbor's approach):
+        1. If force_build=True, always build from Dockerfile in environment_tar
+        2. If docker_image is available, use the pre-built Docker Hub image (fast)
+        3. Otherwise, extract Dockerfile from environment_tar and build (slow)
+
+        Returns:
+            (modal_image, temp_dir) -- modal_image is a Docker Hub name or a
+            Dockerfile path. temp_dir is set if we extracted files that need
+            cleanup later.
+        """
+        docker_image = item.get("docker_image", "")
+        environment_tar = item.get("environment_tar", "")
+
+        # Fast path: use pre-built Docker Hub image
+        if docker_image and not self.config.force_build:
+            logger.info("Task %s: using pre-built image %s", task_name, docker_image)
+            return docker_image, None
+
+        # Slow path: extract Dockerfile from environment_tar and build
+        if environment_tar:
+            task_dir = Path(tempfile.mkdtemp(prefix=f"tb2-{task_name}-"))
+            _extract_base64_tar(environment_tar, task_dir)
+            dockerfile_path = task_dir / "Dockerfile"
+            if dockerfile_path.exists():
+                logger.info(
+                    "Task %s: building from Dockerfile (force_build=%s, docker_image=%s)",
+                    task_name, self.config.force_build, bool(docker_image))
+                return str(dockerfile_path), task_dir
+            shutil.rmtree(task_dir, ignore_errors=True)  # No Dockerfile: don't leak the temp dir
+
+        # No usable Dockerfile -- fall back to the Hub image (reachable only when force_build=True)
+        if docker_image:
+            logger.warning(
+                "Task %s: force_build=True but no Dockerfile available from environment_tar, "
+                "falling back to docker_image %s", task_name, docker_image,
+            )
+            return docker_image, None
+
+        return "", None
+
+    # =========================================================================
+    # Per-task evaluation -- agent loop + test verification
+    # =========================================================================
+
+    async def rollout_and_score_eval(self, eval_item: Dict[str, Any]) -> Dict:
+        """
+        Evaluate a single TB2 task: run the agent loop, then verify with tests.
+
+        This is the core evaluation method. For each task it:
+        1. Resolves the Docker image and registers the Modal sandbox override
+        2. Runs HermesAgentLoop with terminal + file tools
+        3. Uploads the test suite into the sandbox
+        4. Executes test.sh and checks the result
+        5. Cleans up the sandbox and temp files
+
+        Args:
+            eval_item: A single TB2 task dict from the dataset
+
+        Returns:
+            Dict with 'passed' (bool), 'reward' (float), 'task_name' (str),
+            'category' (str), and optional debug info
+        """
+        task_name = eval_item.get("task_name", "unknown")
+        category = eval_item.get("category", "unknown")
+        task_id = str(uuid.uuid4())
+        task_dir = None  # Set if we extract a Dockerfile (needs cleanup)
+
+        from tqdm import tqdm
+        tqdm.write(f"  [START] {task_name} (task_id={task_id[:8]})")
+        task_start = time.time()
+
+        try:
+            # --- 1. Resolve Docker image ---
+            modal_image, task_dir = self._resolve_task_image(eval_item, task_name)
+            if not modal_image:
+                logger.error("Task %s: no docker_image or environment_tar, skipping", task_name)
+                return {
+                    "passed": False, "reward": 0.0,
+                    "task_name": task_name, "category": category,
+                    "error": "no_image",
+                }
+
+            # --- 2. Register per-task Modal image override ---
+            register_task_env_overrides(task_id, {"modal_image": modal_image})
+            logger.info(
+                "Task %s: registered image override for task_id %s",
+                task_name, task_id[:8],
+            )
+
+            # --- 3. Resolve tools and build messages ---
+            tools, valid_names = self._resolve_tools_for_group()
+
+            messages: List[Dict[str, Any]] = []
+            if self.config.system_prompt:
+                messages.append({"role": "system", "content": self.config.system_prompt})
+            messages.append({"role": "user", "content": self.format_prompt(eval_item)})
+
+            # --- 4. Run agent loop ---
+            agent = HermesAgentLoop(
+                server=self.server,
+                tool_schemas=tools,
+                valid_tool_names=valid_names,
+                max_turns=self.config.max_agent_turns,
+                task_id=task_id,
+                temperature=self.config.agent_temperature,
+                max_tokens=self.config.max_token_length,
+                extra_body=self.config.extra_body,
+            )
+            result = await agent.run(messages)
+
+            # --- 5. Verify -- run test suite in the agent's sandbox ---
+            # Skip verification if the agent produced no meaningful output
+            only_system_and_user = all(
+                msg.get("role") in ("system", "user") for msg in result.messages
+            )
+            if result.turns_used == 0 or only_system_and_user:
+                logger.warning(
+                    "Task %s: agent produced no output (turns=%d). Reward=0.",
+                    task_name, result.turns_used,
+                )
+                reward = 0.0
+            else:
+                # Run tests in a thread so the blocking ctx.terminal() calls
+                # don't freeze the entire event loop (which would stall all
+                # other tasks, tqdm updates, and timeout timers).
+                ctx = ToolContext(task_id)
+                try:
+                    loop = asyncio.get_running_loop()
+                    reward = await loop.run_in_executor(
+                        None,  # default thread pool
+                        self._run_tests, eval_item, ctx, task_name,
+                    )
+                except Exception as e:
+                    logger.error("Task %s: test verification failed: %s", task_name, e)
+                    reward = 0.0
+                finally:
+                    ctx.cleanup()
+
+            passed = reward == 1.0
+            status = "PASS" if passed else "FAIL"
+            elapsed = time.time() - task_start
+            tqdm.write(f"  [{status}] {task_name} (turns={result.turns_used}, {elapsed:.0f}s)")
+            logger.info(
+                "Task %s: reward=%.1f, turns=%d, finished=%s",
+                task_name, reward, result.turns_used, result.finished_naturally,
+            )
+
+            out = {
+                "passed": passed,
+                "reward": reward,
+                "task_name": task_name,
+                "category": category,
+                "turns_used": result.turns_used,
+                "finished_naturally": result.finished_naturally,
+                "messages": result.messages,
+            }
+            self._save_result(out)
+            return out
+
+        except Exception as e:
+            elapsed = time.time() - task_start
+            logger.error("Task %s: rollout failed: %s", task_name, e, exc_info=True)
+            tqdm.write(f"  [ERROR] {task_name}: {e} ({elapsed:.0f}s)")
+            out = {
+                "passed": False, "reward": 0.0,
+                "task_name": task_name, "category": category,
+                "error": str(e),
+            }
+            self._save_result(out)
+            return out
+
+        finally:
+            # --- Cleanup: clear overrides, sandbox, and temp files ---
+            clear_task_env_overrides(task_id)
+            try:
+                cleanup_vm(task_id)
+            except Exception as e:
+                logger.debug("VM cleanup for %s: %s", task_id[:8], e)
+            if task_dir and task_dir.exists():
+                shutil.rmtree(task_dir, ignore_errors=True)
+
+    def _run_tests(
+        self, item: Dict[str, Any], ctx: ToolContext, task_name: str
+    ) -> float:
+        """
+        Upload and execute the test suite in the agent's sandbox, then
+        download the verifier output locally to read the reward.
+
+        Follows Harbor's verification pattern:
+        1. Upload tests/ directory into the sandbox
+        2. Execute test.sh inside the sandbox
+        3. Download /logs/verifier/ directory to a local temp dir
+        4. Read reward.txt locally with native Python I/O
+
+        Downloading locally avoids issues with the file_read tool on
+        the Modal VM and matches how Harbor handles verification.
+
+        TB2 test scripts (test.sh) typically:
+        1. Install pytest via uv/pip
+        2. Run pytest against the test files in /tests/
+        3. Write results to /logs/verifier/reward.txt
+
+        Args:
+            item: The TB2 task dict (contains tests_tar, test_sh)
+            ctx: ToolContext scoped to this task's sandbox
+            task_name: For logging
+
+        Returns:
+            1.0 if tests pass, 0.0 otherwise
+        """
+        tests_tar = item.get("tests_tar", "")
+        test_sh = item.get("test_sh", "")
+
+        if not test_sh:
+            logger.warning("Task %s: no test_sh content, reward=0", task_name)
+            return 0.0
+
+        # Create required directories in the sandbox
+        ctx.terminal("mkdir -p /tests /logs/verifier")
+
+        # Upload test files into the sandbox (binary-safe via base64)
+        if tests_tar:
+            tests_temp = Path(tempfile.mkdtemp(prefix=f"tb2-tests-{task_name}-"))
+            try:
+                _extract_base64_tar(tests_tar, tests_temp)
+                ctx.upload_dir(str(tests_temp), "/tests")
+            except Exception as e:
+                logger.warning("Task %s: failed to upload test files: %s", task_name, e)
+            finally:
+                shutil.rmtree(tests_temp, ignore_errors=True)
+
+        # Write the test runner script (test.sh)
+        ctx.write_file("/tests/test.sh", test_sh)
+        ctx.terminal("chmod +x /tests/test.sh")
+
+        # Execute the test suite
+        logger.info(
+            "Task %s: running test suite (timeout=%ds)",
+            task_name, self.config.test_timeout,
+        )
+        test_result = ctx.terminal(
+            "bash /tests/test.sh",
+            timeout=self.config.test_timeout,
+        )
+
+        exit_code = test_result.get("exit_code", -1)
+        output = test_result.get("output", "")
+
+        # Download the verifier output directory locally, then read reward.txt
+        # with native Python I/O. This avoids issues with file_read on the
+        # Modal VM and matches Harbor's verification pattern.
+        reward = 0.0
+        local_verifier_dir = Path(tempfile.mkdtemp(prefix=f"tb2-verifier-{task_name}-"))
+        try:
+            ctx.download_dir("/logs/verifier", str(local_verifier_dir))
+
+            reward_file = local_verifier_dir / "reward.txt"
+            if reward_file.exists() and reward_file.stat().st_size > 0:
+                content = reward_file.read_text().strip()
+                if content == "1":
+                    reward = 1.0
+                elif content == "0":
+                    reward = 0.0
+                else:
+                    # Unexpected content -- try parsing as float
+                    try:
+                        reward = float(content)
+                    except (ValueError, TypeError):
+                        logger.warning(
+                            "Task %s: reward.txt content unexpected (%r), "
+                            "falling back to exit_code=%d",
+                            task_name, content, exit_code,
+                        )
+                        reward = 1.0 if exit_code == 0 else 0.0
+            else:
+                # reward.txt not written -- fall back to exit code
+                logger.warning(
+                    "Task %s: reward.txt not found after download, "
+                    "falling back to exit_code=%d",
+                    task_name, exit_code,
+                )
+                reward = 1.0 if exit_code == 0 else 0.0
+        except Exception as e:
+            logger.warning(
+                "Task %s: failed to download verifier dir: %s, "
+                "falling back to exit_code=%d",
+                task_name, e, exit_code,
+            )
+            reward = 1.0 if exit_code == 0 else 0.0
+        finally:
+            shutil.rmtree(local_verifier_dir, ignore_errors=True)
+
+        # Log test output for debugging failures (anything short of 1.0 counts as FAIL)
+        if reward != 1.0:
+            output_preview = output[-500:] if output else "(no output)"
+            logger.info(
+                "Task %s: FAIL (exit_code=%d)\n%s",
+                task_name, exit_code, output_preview,
+            )
+
+        return reward
+
+    # =========================================================================
+    # Evaluate -- main entry point for the eval subcommand
+    # =========================================================================
+
+    async def _eval_with_timeout(self, item: Dict[str, Any]) -> Dict:
+        """
+        Wrap rollout_and_score_eval with a per-task wall-clock timeout.
+
+        If the task exceeds task_timeout seconds, it's automatically scored
+        as FAIL. This prevents any single task from hanging indefinitely.
+        """
+        task_name = item.get("task_name", "unknown")
+        category = item.get("category", "unknown")
+        try:
+            return await asyncio.wait_for(
+                self.rollout_and_score_eval(item),
+                timeout=self.config.task_timeout,
+            )
+        except asyncio.TimeoutError:
+            from tqdm import tqdm
+            elapsed = self.config.task_timeout
+            tqdm.write(f"  [TIMEOUT] {task_name} (exceeded {elapsed}s wall-clock limit)")
+            logger.error("Task %s: wall-clock timeout after %ds", task_name, elapsed)
+            out = {
+                "passed": False, "reward": 0.0,
+                "task_name": task_name, "category": category,
+                "error": f"timeout ({elapsed}s)",
+            }
+            self._save_result(out)
+            return out
+
+    async def evaluate(self, *args, **kwargs) -> None:
+        """
+        Run Terminal-Bench 2.0 evaluation over all tasks.
+
+        This is the main entry point when invoked via:
+            python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate
+
+        Launches every task concurrently via _eval_with_timeout() and collects
+        results with asyncio.as_completed() (concurrent fan-out, as in GPQA and
+        other Atropos eval envs), so each hung task auto-fails on its own timeout.
+
+        Suppresses noisy Modal/terminal output (HERMES_QUIET) so the tqdm
+        bar stays visible.
+        """
+        start_time = time.time()
+
+        # Route all logging through tqdm.write() so the progress bar stays
+        # pinned at the bottom while log lines scroll above it.
+        from tqdm import tqdm
+
+        class _TqdmHandler(logging.Handler):
+            def emit(self, record):
+                try:
+                    tqdm.write(self.format(record))
+                except Exception:
+                    self.handleError(record)
+
+        handler = _TqdmHandler()
+        handler.setFormatter(logging.Formatter(
+            "%(asctime)s [%(name)s] %(levelname)s: %(message)s",
+            datefmt="%H:%M:%S",
+        ))
+        root = logging.getLogger()
+        root.handlers = [handler]  # Replace any existing handlers
+        root.setLevel(logging.INFO)
+
+        # Silence noisy third-party loggers that flood the output
+        logging.getLogger("httpx").setLevel(logging.WARNING)      # Every HTTP request
+        logging.getLogger("openai").setLevel(logging.WARNING)     # OpenAI client retries
+        logging.getLogger("rex-deploy").setLevel(logging.WARNING) # Swerex deployment
+        logging.getLogger("rex_image_builder").setLevel(logging.WARNING)  # Image builds
+
+        print(f"\n{'='*60}")
+        print("Starting Terminal-Bench 2.0 Evaluation")
+        print(f"{'='*60}")
+        print(f"  Dataset: {self.config.dataset_name}")
+        print(f"  Total tasks: {len(self.all_eval_items)}")
+        print(f"  Max agent turns: {self.config.max_agent_turns}")
+        print(f"  Task timeout: {self.config.task_timeout}s")
+        print(f"  Terminal backend: {self.config.terminal_backend}")
+        print(f"  Tool thread pool: {self.config.tool_pool_size}")
+        print(f"  Terminal timeout: {self.config.terminal_timeout}s/cmd")
+        print(f"  Terminal lifetime: {self.config.terminal_lifetime}s (auto: task_timeout + 120)")
+        print(f"{'='*60}\n")
+
+        # Fire all tasks with wall-clock timeout, track live accuracy on the bar
+        total_tasks = len(self.all_eval_items)
+        eval_tasks = [
+            asyncio.ensure_future(self._eval_with_timeout(item))
+            for item in self.all_eval_items
+        ]
+
+        results = []
+        passed_count = 0
+        pbar = tqdm(total=total_tasks, desc="Evaluating TB2", dynamic_ncols=True)
+        try:
+            for coro in asyncio.as_completed(eval_tasks):
+                result = await coro
+                results.append(result)
+                if result and result.get("passed"):
+                    passed_count += 1
+                done = len(results)
+                pct = (passed_count / done * 100) if done else 0
+                pbar.set_postfix_str(f"pass={passed_count}/{done} ({pct:.1f}%)")
+                pbar.update(1)
+        except (KeyboardInterrupt, asyncio.CancelledError):
+            pbar.close()
+            print(f"\n\nInterrupted! Cleaning up {len(eval_tasks)} tasks...")
+            # Cancel all pending tasks
+            for task in eval_tasks:
+                task.cancel()
+            # Let cancellations propagate (finally blocks run cleanup_vm)
+            await asyncio.gather(*eval_tasks, return_exceptions=True)
+            # Belt-and-suspenders: clean up any remaining sandboxes
+            from tools.terminal_tool import cleanup_all_environments
+            cleanup_all_environments()
+            print("All sandboxes cleaned up.")
+            return
+        finally:
+            pbar.close()
+
+        end_time = time.time()
+
+        # Filter out None results (shouldn't happen, but be safe)
+        valid_results = [r for r in results if r is not None]
+
+        if not valid_results:
+            print("Warning: No valid evaluation results obtained")
+            return
+
+        # ---- Compute metrics ----
+        total = len(valid_results)
+        passed = sum(1 for r in valid_results if r.get("passed"))
+        overall_pass_rate = passed / total if total > 0 else 0.0
+
+        # Per-category breakdown
+        cat_results: Dict[str, List[Dict]] = defaultdict(list)
+        for r in valid_results:
+            cat_results[r.get("category", "unknown")].append(r)
+
+        # Build metrics dict
+        eval_metrics = {
+            "eval/pass_rate": overall_pass_rate,
+            "eval/total_tasks": total,
+            "eval/passed_tasks": passed,
+            "eval/evaluation_time_seconds": end_time - start_time,
+        }
+
+        # Per-category metrics
+        for category, cat_items in sorted(cat_results.items()):
+            cat_passed = sum(1 for r in cat_items if r.get("passed"))
+            cat_total = len(cat_items)
+            cat_pass_rate = cat_passed / cat_total if cat_total > 0 else 0.0
+            cat_key = category.replace(" ", "_").replace("-", "_").lower()
+            eval_metrics[f"eval/pass_rate_{cat_key}"] = cat_pass_rate
+
+        # Store metrics for wandb_log
+        self.eval_metrics = [(k, v) for k, v in eval_metrics.items()]
+
+        # ---- Print summary ----
+        print(f"\n{'='*60}")
+        print("Terminal-Bench 2.0 Evaluation Results")
+        print(f"{'='*60}")
+        print(f"Overall Pass Rate: {overall_pass_rate:.4f} ({passed}/{total})")
+        print(f"Evaluation Time: {end_time - start_time:.1f} seconds")
+
+        print("\nCategory Breakdown:")
+        for category, cat_items in sorted(cat_results.items()):
+            cat_passed = sum(1 for r in cat_items if r.get("passed"))
+            cat_total = len(cat_items)
+            cat_rate = cat_passed / cat_total if cat_total > 0 else 0.0
+            print(f"  {category}: {cat_rate:.1%} ({cat_passed}/{cat_total})")
+
+        # Print individual task results
+        print("\nTask Results:")
+        for r in sorted(valid_results, key=lambda x: x.get("task_name", "")):
+            status = "PASS" if r.get("passed") else "FAIL"
+            turns = r.get("turns_used", "?")
+            error = r.get("error", "")
+            extra = f" (error: {error})" if error else ""
+            print(f"  [{status}] {r['task_name']} (turns={turns}){extra}")
+
+        print(f"{'='*60}\n")
+
+        # Build sample records for evaluate_log (includes full conversations)
+        samples = [
+            {
+                "task_name": r.get("task_name"),
+                "category": r.get("category"),
+                "passed": r.get("passed"),
+                "reward": r.get("reward"),
+                "turns_used": r.get("turns_used"),
+                "error": r.get("error"),
+                "messages": r.get("messages"),
+            }
+            for r in valid_results
+        ]
+
+        # Log evaluation results
+        try:
+            await self.evaluate_log(
+                metrics=eval_metrics,
+                samples=samples,
+                start_time=start_time,
+                end_time=end_time,
+                generation_parameters={
+                    "temperature": self.config.agent_temperature,
+                    "max_tokens": self.config.max_token_length,
+                    "max_agent_turns": self.config.max_agent_turns,
+                    "terminal_backend": self.config.terminal_backend,
+                },
+            )
+        except Exception as e:
+            print(f"Error logging evaluation results: {e}")
+
+        # Close streaming file
+        if hasattr(self, "_streaming_file") and not self._streaming_file.closed:
+            self._streaming_file.close()
+            print(f"  Live results saved to: {self._streaming_path}")
+
+        # Kill all remaining sandboxes. Timed-out tasks leave orphaned thread
+        # pool workers still executing commands; cleanup_all_environments() stops them.
+        from tools.terminal_tool import cleanup_all_environments
+        print("\nCleaning up all sandboxes...")
+        cleanup_all_environments()
+
+        # Shut down the tool thread pool so orphaned workers from timed-out
+        # tasks are killed immediately instead of retrying against dead
+        # sandboxes and spamming the console with TimeoutError warnings.
+        from environments.agent_loop import _tool_executor
+        _tool_executor.shutdown(wait=False, cancel_futures=True)
+        print("Done.")
+
+    # =========================================================================
+    # Wandb logging
+    # =========================================================================
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log TB2-specific metrics to wandb."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        # Add stored eval metrics
+        for metric_name, metric_value in self.eval_metrics:
+            wandb_metrics[metric_name] = metric_value
+        self.eval_metrics = []
+
+        await super().wandb_log(wandb_metrics)
+
+
+if __name__ == "__main__":
+    TerminalBench2EvalEnv.cli()
@@ -117,6 +117,18 @@ class HermesAgentEnvConfig(BaseEnvConfig):
        description="Terminal backend: 'local', 'docker', 'modal', 'ssh', 'singularity'. "
        "Modal recommended for production RL (cloud isolation per rollout).",
    )
+    terminal_timeout: int = Field(
+        default=120,
+        description="Per-command timeout in seconds for terminal tool calls. "
+        "Commands exceeding this are killed. Increase for tasks with long-running "
+        "commands (compilation, pip install, etc.).",
+    )
+    terminal_lifetime: int = Field(
+        default=3600,
+        description="Sandbox inactivity lifetime in seconds. The cleanup thread kills "
+        "sandboxes that have been idle longer than this. Must be longer than "
+        "the longest gap between tool calls (e.g., waiting for LLM response).",
+    )

    # --- Dataset ---
    dataset_name: Optional[str] = Field(
@@ -132,6 +144,14 @@ class HermesAgentEnvConfig(BaseEnvConfig):
        description="Which field in the dataset contains the prompt.",
    )

+    # --- Thread pool ---
+    tool_pool_size: int = Field(
+        default=128,
+        description="Thread pool size for tool execution. Each concurrent task needs a "
+        "thread for tool calls. Must be large enough for parallel evaluation. "
+        "Too small = thread pool starvation.",
+    )
+
    # --- Phase 2: Tool call parsing ---
    tool_call_parser: str = Field(
        default="hermes",
@@ -140,6 +160,22 @@ class HermesAgentEnvConfig(BaseEnvConfig):
        "Options: hermes, mistral, llama3_json, qwen, deepseek_v3, etc.",
    )

+    # --- Provider-specific parameters ---
+    # Passed as extra_body to the OpenAI client's chat.completions.create() call.
+    # Useful for OpenRouter provider preferences, transforms, route settings, etc.
+    # Example YAML:
+    #   extra_body:
+    #     provider:
+    #       ignore: ["DeepInfra", "Fireworks"]
+    #       order: ["Together"]
+    #     transforms: ["middle-out"]
+    extra_body: Optional[Dict[str, Any]] = Field(
+        default=None,
+        description="Extra body parameters passed to the OpenAI client's "
+        "chat.completions.create(). Used for OpenRouter provider preferences, "
+        "transforms, and other provider-specific settings.",
+    )
+

 class HermesAgentBaseEnv(BaseEnv):
    """
@@ -175,10 +211,23 @@ class HermesAgentBaseEnv(BaseEnv):
    ):
        super().__init__(config, server_configs, slurm, testing)

-        # Set terminal backend environment variable so hermes tools pick it up
+        # Set terminal environment variables so hermes tools pick them up.
+        # These can all be overridden per-environment via config fields instead
+        # of requiring users to set shell env vars.
        if config.terminal_backend:
            os.environ["TERMINAL_ENV"] = config.terminal_backend
-            print(f"🖥️  Terminal backend: {config.terminal_backend}")
+        os.environ["TERMINAL_TIMEOUT"] = str(config.terminal_timeout)
+        os.environ["TERMINAL_LIFETIME_SECONDS"] = str(config.terminal_lifetime)
+        print(
+            f"🖥️  Terminal: backend={config.terminal_backend}, "
+            f"timeout={config.terminal_timeout}s, lifetime={config.terminal_lifetime}s"
+        )
+
+        # Resize the agent loop's thread pool for tool execution.
+        # This must be large enough for the number of concurrent tasks
+        # (e.g., 89 parallel TB2 eval tasks each need a thread for tool calls).
+        from environments.agent_loop import resize_tool_pool
+        resize_tool_pool(config.tool_pool_size)

        # Current group's resolved tools (set in collect_trajectories)
        self._current_group_tools: Optional[Tuple[List[Dict], Set[str]]] = None
@@ -429,6 +478,7 @@ class HermesAgentBaseEnv(BaseEnv):
                    tokenizer=self.tokenizer,
                    tool_call_parser=tc_parser,
                ) as managed:
+                    _max_ctx = self.config.max_token_length if (self.config.max_token_length and self.config.max_token_length > 0) else None
                    agent = HermesAgentLoop(
                        server=managed,
                        tool_schemas=tools,
@@ -437,6 +487,8 @@ class HermesAgentBaseEnv(BaseEnv):
                        task_id=task_id,
                        temperature=self.config.agent_temperature,
                        max_tokens=self.config.max_token_length,
+                        extra_body=self.config.extra_body,
+                        max_context_tokens=_max_ctx,
                    )
                    result = await agent.run(messages)
            except NotImplementedError:
@@ -445,6 +497,7 @@ class HermesAgentBaseEnv(BaseEnv):
                    "ManagedServer not available (OpenAI server?). "
                    "Falling back to direct server mode."
                )
+                _max_ctx = self.config.max_token_length if (self.config.max_token_length and self.config.max_token_length > 0) else None
                agent = HermesAgentLoop(
                    server=self.server,
                    tool_schemas=tools,
@@ -453,10 +506,13 @@ class HermesAgentBaseEnv(BaseEnv):
                    task_id=task_id,
                    temperature=self.config.agent_temperature,
                    max_tokens=self.config.max_token_length,
+                    extra_body=self.config.extra_body,
+                    max_context_tokens=_max_ctx,
                )
                result = await agent.run(messages)
        else:
            # Phase 1: OpenAI server -- native tool_calls, placeholder tokens
+            _max_ctx = self.config.max_token_length if (self.config.max_token_length and self.config.max_token_length > 0) else None
            agent = HermesAgentLoop(
                server=self.server,
                tool_schemas=tools,
@@ -465,6 +521,8 @@ class HermesAgentBaseEnv(BaseEnv):
                task_id=task_id,
                temperature=self.config.agent_temperature,
                max_tokens=self.config.max_token_length,
+                extra_body=self.config.extra_body,
+                max_context_tokens=_max_ctx,
            )
            result = await agent.run(messages)

@@ -4,7 +4,8 @@
 # Uses terminal + file + web toolsets.
 #
 # Usage:
-#   python environments/hermes_swe_env.py serve --config environments/configs/swe_default.yaml
+#   python environments/hermes_swe_env/hermes_swe_env.py serve \
+#       --config environments/hermes_swe_env/default.yaml

 env:
  enabled_toolsets: ["terminal", "file", "web"]
@@ -36,7 +36,7 @@ from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple, Union

 # Ensure repo root is on sys.path for imports
-_repo_root = Path(__file__).resolve().parent.parent
+_repo_root = Path(__file__).resolve().parent.parent.parent
 if str(_repo_root) not in sys.path:
    sys.path.insert(0, str(_repo_root))

@@ -6,9 +6,8 @@
 #
 # Usage:
 #   run-api
-#   python environments/terminal_test_env.py serve
-#   # Or with config file:
-#   python environments/terminal_test_env.py serve --config environments/configs/terminal_test_default.yaml
+#   python environments/terminal_test_env/terminal_test_env.py serve \
+#       --config environments/terminal_test_env/default.yaml

 env:
  enabled_toolsets: ["terminal", "file"]
@@ -36,7 +36,7 @@ from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple, Union

 # Ensure repo root is on sys.path for imports
-_repo_root = Path(__file__).resolve().parent.parent
+_repo_root = Path(__file__).resolve().parent.parent.parent
 if str(_repo_root) not in sys.path:
    sys.path.insert(0, str(_repo_root))

@@ -49,15 +49,22 @@ class HermesToolCallParser(ToolCallParser):
                    continue

                tc_data = json.loads(raw_json)
+                # Handle arguments: could be dict or already a JSON string
+                raw_args = tc_data.get("arguments", {})
+                if isinstance(raw_args, str):
+                    # Already a string — pass through as-is.
+                    # It may be a JSON string ("{...}") or a plain string ("ls").
+                    args_str = raw_args
+                else:
+                    # Dict — serialize to JSON
+                    args_str = json.dumps(raw_args, ensure_ascii=False)
                tool_calls.append(
                    ChatCompletionMessageToolCall(
                        id=f"call_{uuid.uuid4().hex[:8]}",
                        type="function",
                        function=Function(
                            name=tc_data["name"],
-                            arguments=json.dumps(
-                                tc_data.get("arguments", {}), ensure_ascii=False
-                            ),
+                            arguments=args_str,
                        ),
                    )
                )
@@ -129,11 +129,14 @@ class ToolContext:

    def write_file(self, path: str, content: str) -> Dict[str, Any]:
        """
-        Write a file in the rollout's filesystem.
+        Write a TEXT file in the rollout's filesystem.
+
+        Uses a shell heredoc under the hood, so this is only safe for text content.
+        For binary files (images, compiled artifacts, etc.), use upload_file() instead.

        Args:
            path: File path to write
-            content: Content to write
+            content: Text content to write

        Returns:
            Dict with success status or error
@@ -146,6 +149,177 @@ class ToolContext:
        except json.JSONDecodeError:
            return {"error": result}

+    def upload_file(self, local_path: str, remote_path: str) -> Dict[str, Any]:
+        """
+        Upload a local file to the rollout's sandbox (binary-safe).
+
+        Unlike write_file(), which passes content through a shell heredoc (text-only),
+        this method base64-encodes the file and decodes it inside the sandbox.
+        Safe for any file type: binaries, images, archives, etc.
+
+        When the base64 payload exceeds ~60KB, it is written in chunks to avoid
+        hitting shell command-length limits.
+
+        Args:
+            local_path: Path to a local file on the host
+            remote_path: Destination path inside the sandbox
+
+        Returns:
+            Dict with 'exit_code' and 'output'
+        """
+        import base64
+        from pathlib import Path as _Path
+
+        local = _Path(local_path)
+        if not local.exists():
+            return {"exit_code": -1, "output": f"Local file not found: {local_path}"}
+
+        raw = local.read_bytes()
+        b64 = base64.b64encode(raw).decode("ascii")
+
+        # Ensure parent directory exists in the sandbox
+        parent = str(_Path(remote_path).parent)
+        if parent not in (".", "/"):
+            self.terminal(f"mkdir -p {parent}", timeout=10)
+
+        # For small files, single command is fine
+        chunk_size = 60_000  # ~60KB per chunk (well within shell limits)
+        if len(b64) <= chunk_size:
+            result = self.terminal(
+                f"printf '%s' '{b64}' | base64 -d > {remote_path}",
+                timeout=30,
+            )
+        else:
+            # For larger files, write base64 in chunks then decode
+            tmp_b64 = "/tmp/_hermes_upload.b64"
+            self.terminal(f": > {tmp_b64}", timeout=5)  # truncate
+            for i in range(0, len(b64), chunk_size):
+                chunk = b64[i : i + chunk_size]
+                self.terminal(f"printf '%s' '{chunk}' >> {tmp_b64}", timeout=15)
+            result = self.terminal(
+                f"base64 -d {tmp_b64} > {remote_path} && rm -f {tmp_b64}",
+                timeout=30,
+            )
+
+        return result
+
+    def upload_dir(self, local_dir: str, remote_dir: str) -> List[Dict[str, Any]]:
+        """
+        Upload an entire local directory to the rollout's sandbox (binary-safe).
+
+        Recursively uploads all files, preserving directory structure.
+
+        Args:
+            local_dir: Path to a local directory on the host
+            remote_dir: Destination directory inside the sandbox
+
+        Returns:
+            List of results, one per file uploaded
+        """
+        from pathlib import Path as _Path
+
+        local = _Path(local_dir)
+        if not local.exists() or not local.is_dir():
+            return [{"exit_code": -1, "output": f"Local directory not found: {local_dir}"}]
+
+        results = []
+        for file_path in sorted(local.rglob("*")):
+            if file_path.is_file():
+                relative = file_path.relative_to(local)
+                target = f"{remote_dir}/{relative}"
+                results.append(self.upload_file(str(file_path), target))
+        return results
+
+    def download_file(self, remote_path: str, local_path: str) -> Dict[str, Any]:
+        """
+        Download a file from the rollout's sandbox to the host (binary-safe).
+
+        The inverse of upload_file(). Base64-encodes the file inside the sandbox,
+        reads the encoded data through the terminal, and decodes it locally.
+        Safe for any file type.
+
+        Args:
+            remote_path: Path to the file inside the sandbox
+            local_path: Destination path on the host
+
+        Returns:
+            Dict with 'success' (bool) and 'bytes' (int) or 'error' (str)
+        """
+        import base64
+        from pathlib import Path as _Path
+
+        # Base64-encode the file inside the sandbox and capture output
+        result = self.terminal(
+            f"base64 {remote_path} 2>/dev/null",
+            timeout=30,
+        )
+
+        if result.get("exit_code", -1) != 0:
+            return {
+                "success": False,
+                "error": f"Failed to read remote file: {result.get('output', '')}",
+            }
+
+        b64_data = result.get("output", "").strip()
+        if not b64_data:
+            return {"success": False, "error": f"Remote file is empty or missing: {remote_path}"}
+
+        try:
+            raw = base64.b64decode(b64_data)
+        except Exception as e:
+            return {"success": False, "error": f"Base64 decode failed: {e}"}
+
+        # Write to local host filesystem
+        local = _Path(local_path)
+        local.parent.mkdir(parents=True, exist_ok=True)
+        local.write_bytes(raw)
+
+        return {"success": True, "bytes": len(raw)}
+
+    def download_dir(self, remote_dir: str, local_dir: str) -> List[Dict[str, Any]]:
+        """
+        Download a directory from the rollout's sandbox to the host (binary-safe).
+
+        Lists all files in the remote directory, then downloads each one.
+        Preserves directory structure.
+
+        Args:
+            remote_dir: Path to the directory inside the sandbox
+            local_dir: Destination directory on the host
+
+        Returns:
+            List of results, one per file downloaded
+        """
+        from pathlib import Path as _Path
+
+        # List files in the remote directory
+        ls_result = self.terminal(
+            f"find {remote_dir} -type f 2>/dev/null",
+            timeout=15,
+        )
+
+        if ls_result.get("exit_code", -1) != 0:
+            return [{"success": False, "error": f"Failed to list remote dir: {remote_dir}"}]
+
+        file_list = ls_result.get("output", "").strip()
+        if not file_list:
+            return [{"success": False, "error": f"Remote directory is empty or missing: {remote_dir}"}]
+
+        results = []
+        for remote_file in file_list.splitlines():
+            remote_file = remote_file.strip()
+            if not remote_file:
+                continue
+            # Compute the relative path to preserve directory structure
+            if remote_file.startswith(remote_dir):
+                relative = remote_file[len(remote_dir):].lstrip("/")
+            else:
+                relative = _Path(remote_file).name
+            local_file = str(_Path(local_dir) / relative)
+            results.append(self.download_file(remote_file, local_file))
+
+        return results
+
    def search(self, query: str, path: str = ".") -> Dict[str, Any]:
        """
        Search for text in the rollout's filesystem.
@@ -6,10 +6,11 @@ and implement the required methods.
 """

 import asyncio
+import re
 from abc import ABC, abstractmethod
 from dataclasses import dataclass, field
 from datetime import datetime
-from typing import Dict, List, Optional, Any, Callable, Awaitable
+from typing import Dict, List, Optional, Any, Callable, Awaitable, Tuple
 from enum import Enum

 import sys
@@ -177,6 +178,123 @@ class BasePlatformAdapter(ABC):
        """
        pass
    
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """
+        Send an image natively via the platform API.
+        
+        Override in subclasses to send images as proper attachments
+        instead of plain-text URLs. Default falls back to sending the
+        URL as a text message.
+        """
+        # Fallback: send URL as text (subclasses override for native images)
+        text = f"{caption}\n{image_url}" if caption else image_url
+        return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
+    
+    @staticmethod
+    def extract_images(content: str) -> Tuple[List[Tuple[str, str]], str]:
+        """
+        Extract image URLs from markdown and HTML image tags in a response.
+        
+        Finds patterns like:
+        - ![alt text](https://example.com/image.png)
+        - <img src="https://example.com/image.png">
+        - <img src="https://example.com/image.png"></img>
+        
+        Args:
+            content: The response text to scan.
+        
+        Returns:
+            Tuple of (list of (url, alt_text) pairs, cleaned content with image tags removed).
+        """
+        images = []
+        cleaned = content
+        
+        # Match markdown images: ![alt](url)
+        md_pattern = r'!\[([^\]]*)\]\((https?://[^\s\)]+)\)'
+        for match in re.finditer(md_pattern, content):
+            alt_text = match.group(1)
+            url = match.group(2)
+            # Only extract URLs that look like actual images
+            if any(url.lower().endswith(ext) or ext in url.lower() for ext in
+                   ['.png', '.jpg', '.jpeg', '.gif', '.webp', 'fal.media', 'fal-cdn', 'replicate.delivery']):
+                images.append((url, alt_text))
+        
+        # Match HTML img tags: <img src="url"> or <img src="url"></img> or <img src="url"/>
+        html_pattern = r'<img\s+src=["\']?(https?://[^\s"\'<>]+)["\']?\s*/?>\s*(?:</img>)?'
+        for match in re.finditer(html_pattern, content):
+            url = match.group(1)
+            images.append((url, ""))
+        
+        # If any image was found, remove all matched tags (including markdown images skipped by the filter above)
+        if images:
+            cleaned = re.sub(md_pattern, '', cleaned)
+            cleaned = re.sub(html_pattern, '', cleaned)
+            # Clean up leftover blank lines
+            cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()
+        
+        return images, cleaned
+    
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """
+        Send an audio file as a native voice message via the platform API.
+        
+        Override in subclasses to send audio as voice bubbles (Telegram)
+        or file attachments (Discord). Default falls back to sending the
+        file path as text.
+        """
+        text = f"🔊 Audio: {audio_path}"
+        if caption:
+            text = f"{caption}\n{text}"
+        return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
+    
+    @staticmethod
+    def extract_media(content: str) -> Tuple[List[Tuple[str, bool]], str]:
+        """
+        Extract MEDIA:<path> tags and [[audio_as_voice]] directives from response text.
+        
+        The TTS tool returns responses like:
+            [[audio_as_voice]]
+            MEDIA:/path/to/audio.ogg
+        
+        Args:
+            content: The response text to scan.
+        
+        Returns:
+            Tuple of (list of (path, is_voice) pairs, cleaned content with tags removed).
+        """
+        media = []
+        cleaned = content
+        
+        # Check for [[audio_as_voice]] directive
+        has_voice_tag = "[[audio_as_voice]]" in content
+        cleaned = cleaned.replace("[[audio_as_voice]]", "")
+        
+        # Extract MEDIA:<path> tags (\S+ stops at whitespace, so paths containing spaces are not supported)
+        media_pattern = r'MEDIA:(\S+)'
+        for match in re.finditer(media_pattern, content):
+            path = match.group(1).strip()
+            if path:
+                media.append((path, has_voice_tag))
+        
+        # Remove MEDIA tags from content
+        if media:
+            cleaned = re.sub(media_pattern, '', cleaned)
+            cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()
+        
+        return media, cleaned
+    
    async def _keep_typing(self, chat_id: str, interval: float = 2.0) -> None:
        """
        Continuously send typing indicator until cancelled.
@@ -231,23 +349,56 @@ class BasePlatformAdapter(ABC):
            
            # Send response if any
            if response:
-                result = await self.send(
-                    chat_id=event.source.chat_id,
-                    content=response,
-                    reply_to=event.message_id
-                )
+                # Extract MEDIA:<path> tags (from TTS tool) before other processing
+                media_files, response = self.extract_media(response)
                
-                # Log send failures (don't raise - user already saw tool progress)
-                if not result.success:
-                    print(f"[{self.name}] Failed to send response: {result.error}")
-                    # Try sending without markdown as fallback
-                    fallback_result = await self.send(
+                # Extract image URLs and send them as native platform attachments
+                images, text_content = self.extract_images(response)
+                
+                # Send the text portion first (if any remains after extractions)
+                if text_content:
+                    result = await self.send(
                        chat_id=event.source.chat_id,
-                        content=f"(Response formatting failed, plain text:)\n\n{response[:3500]}",
+                        content=text_content,
                        reply_to=event.message_id
                    )
-                    if not fallback_result.success:
-                        print(f"[{self.name}] Fallback send also failed: {fallback_result.error}")
+                    
+                    # Log send failures (don't raise - user already saw tool progress)
+                    if not result.success:
+                        print(f"[{self.name}] Failed to send response: {result.error}")
+                        # Try sending without markdown as fallback
+                        fallback_result = await self.send(
+                            chat_id=event.source.chat_id,
+                            content=f"(Response formatting failed, plain text:)\n\n{text_content[:3500]}",
+                            reply_to=event.message_id
+                        )
+                        if not fallback_result.success:
+                            print(f"[{self.name}] Fallback send also failed: {fallback_result.error}")
+                
+                # Send extracted images as native attachments
+                for image_url, alt_text in images:
+                    try:
+                        img_result = await self.send_image(
+                            chat_id=event.source.chat_id,
+                            image_url=image_url,
+                            caption=alt_text if alt_text else None,
+                        )
+                        if not img_result.success:
+                            print(f"[{self.name}] Failed to send image: {img_result.error}")
+                    except Exception as img_err:
+                        print(f"[{self.name}] Error sending image: {img_err}")
+                
+                # Send extracted audio/voice files as native attachments
+                for audio_path, is_voice in media_files:
+                    try:
+                        voice_result = await self.send_voice(
+                            chat_id=event.source.chat_id,
+                            audio_path=audio_path,
+                        )
+                        if not voice_result.success:
+                            print(f"[{self.name}] Failed to send voice: {voice_result.error}")
+                    except Exception as voice_err:
+                        print(f"[{self.name}] Error sending voice: {voice_err}")
            
            # Check if there's a pending message that was queued during our processing
            if session_key in self._pending_messages:
@@ -286,7 +437,7 @@ class BasePlatformAdapter(ABC):
    
    def get_pending_message(self, session_key: str) -> Optional[MessageEvent]:
        """Get and clear any pending message for a session."""
-        return self._pending_messages.get(session_key)
+        return self._pending_messages.pop(session_key, None)
    
    def build_source(
        self,
@@ -8,6 +8,7 @@ Uses discord.py library for:
 """

 import asyncio
+import os
 from typing import Dict, List, Optional, Any

 try:
@@ -173,6 +174,99 @@ class DiscordAdapter(BasePlatformAdapter):
        except Exception as e:
            return SendResult(success=False, error=str(e))
    
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send audio as a Discord file attachment."""
+        if not self._client:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            import io
+            
+            channel = self._client.get_channel(int(chat_id))
+            if not channel:
+                channel = await self._client.fetch_channel(int(chat_id))
+            if not channel:
+                return SendResult(success=False, error=f"Channel {chat_id} not found")
+            
+            if not os.path.exists(audio_path):
+                return SendResult(success=False, error=f"Audio file not found: {audio_path}")
+            
+            # Determine filename from path
+            filename = os.path.basename(audio_path)
+            
+            with open(audio_path, "rb") as f:
+                file = discord.File(io.BytesIO(f.read()), filename=filename)
+                msg = await channel.send(
+                    content=caption if caption else None,
+                    file=file,
+                )
+                return SendResult(success=True, message_id=str(msg.id))
+        
+        except Exception as e:
+            print(f"[{self.name}] Failed to send audio: {e}")
+            return await super().send_voice(chat_id, audio_path, caption, reply_to)
+    
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send an image natively as a Discord file attachment."""
+        if not self._client:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            import aiohttp
+            
+            channel = self._client.get_channel(int(chat_id))
+            if not channel:
+                channel = await self._client.fetch_channel(int(chat_id))
+            if not channel:
+                return SendResult(success=False, error=f"Channel {chat_id} not found")
+            
+            # Download the image and send as a Discord file attachment
+            # (Discord renders attachments inline, unlike plain URLs)
+            async with aiohttp.ClientSession() as session:
+                async with session.get(image_url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
+                    if resp.status != 200:
+                        raise Exception(f"Failed to download image: HTTP {resp.status}")
+                    
+                    image_data = await resp.read()
+                    
+                    # Determine file extension from the response content type
+                    content_type = resp.headers.get("content-type", "image/png")
+                    ext = "png"
+                    if "jpeg" in content_type or "jpg" in content_type:
+                        ext = "jpg"
+                    elif "gif" in content_type:
+                        ext = "gif"
+                    elif "webp" in content_type:
+                        ext = "webp"
+                    
+                    import io
+                    file = discord.File(io.BytesIO(image_data), filename=f"image.{ext}")
+                    
+                    msg = await channel.send(
+                        content=caption if caption else None,
+                        file=file,
+                    )
+                    return SendResult(success=True, message_id=str(msg.id))
+        
+        except ImportError:
+            print(f"[{self.name}] aiohttp not installed, falling back to URL. Run: pip install aiohttp")
+            return await super().send_image(chat_id, image_url, caption, reply_to)
+        except Exception as e:
+            print(f"[{self.name}] Failed to send image attachment, falling back to URL: {e}")
+            return await super().send_image(chat_id, image_url, caption, reply_to)
+    
    async def send_typing(self, chat_id: str) -> None:
        """Send typing indicator."""
        if self._client:
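Review note: the content-type branch in `send_image` above could be pulled into a small pure function, which would make it easy to test without a network call. This is only a sketch; `ext_from_content_type` is a hypothetical helper name, not part of the change:

```python
def ext_from_content_type(content_type: str) -> str:
    """Map a Content-Type header value to a file extension.

    Mirrors the substring checks used in the diff: anything that is not
    recognizably JPEG, GIF, or WebP falls back to "png".
    """
    if "jpeg" in content_type or "jpg" in content_type:
        return "jpg"
    if "gif" in content_type:
        return "gif"
    if "webp" in content_type:
        return "webp"
    return "png"


filename = f"image.{ext_from_content_type('image/webp; charset=binary')}"
```

Like the original, this does not lowercase the header value first, so an unusual value such as `IMAGE/JPEG` would fall through to `png`.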
@@ -232,6 +326,36 @@ class DiscordAdapter(BasePlatformAdapter):
    
    async def _handle_message(self, message: DiscordMessage) -> None:
        """Handle incoming Discord messages."""
+        # In server channels (not DMs), require the bot to be @mentioned
+        # UNLESS the channel is in the free-response list.
+        #
+        # Config:
+        #   DISCORD_FREE_RESPONSE_CHANNELS: Comma-separated channel IDs where the
+        #       bot responds to every message without needing a mention.
+        #   DISCORD_REQUIRE_MENTION: Set to "false" to disable mention requirement
+        #       globally (all channels become free-response). Default: "true".
+        
+        if not isinstance(message.channel, discord.DMChannel):
+            # Check if this channel is in the free-response list
+            free_channels_raw = os.getenv("DISCORD_FREE_RESPONSE_CHANNELS", "")
+            free_channels = {ch.strip() for ch in free_channels_raw.split(",") if ch.strip()}
+            channel_id = str(message.channel.id)
+            
+            # Global override: if DISCORD_REQUIRE_MENTION=false, all channels are free
+            require_mention = os.getenv("DISCORD_REQUIRE_MENTION", "true").lower() not in ("false", "0", "no")
+            
+            is_free_channel = channel_id in free_channels
+            
+            if require_mention and not is_free_channel:
+                # Must be @mentioned to respond
+                if self._client.user not in message.mentions:
+                    return  # Silently ignore messages that don't mention the bot
+            
+            # Strip the bot mention from the message text so the agent sees clean input
+            if self._client.user and self._client.user in message.mentions:
+                message.content = message.content.replace(f"<@{self._client.user.id}>", "").strip()
+                message.content = message.content.replace(f"<@!{self._client.user.id}>", "").strip()
+        
        # Determine message type
        msg_type = MessageType.TEXT
        if message.content.startswith("/"):
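Review note: the mention gating above combines three inputs (DM or not, free-response list, global override). A sketch of the same decision as a pure function, assuming the environment is passed in as a mapping so it can be exercised without Discord; `should_respond` and `parse_free_channels` are hypothetical names:

```python
from typing import Mapping, Set


def parse_free_channels(raw: str) -> Set[str]:
    # Comma-separated channel IDs; blanks and surrounding spaces are ignored.
    return {ch.strip() for ch in raw.split(",") if ch.strip()}


def should_respond(
    channel_id: str,
    is_dm: bool,
    mentioned: bool,
    env: Mapping[str, str],
) -> bool:
    """Return True if the bot should handle a message in this channel."""
    if is_dm:
        return True  # DMs never require a mention
    free_channels = parse_free_channels(env.get("DISCORD_FREE_RESPONSE_CHANNELS", ""))
    require_mention = env.get("DISCORD_REQUIRE_MENTION", "true").lower() not in (
        "false", "0", "no",
    )
    if require_mention and channel_id not in free_channels:
        return mentioned
    return True
```

In the adapter this would be called with `os.environ` as `env`; passing a plain dict keeps the rule table easy to check.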
@@ -174,6 +174,69 @@ class TelegramAdapter(BasePlatformAdapter):
        except Exception as e:
            return SendResult(success=False, error=str(e))
    
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send audio as a native Telegram voice message or audio file."""
+        if not self._bot:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            import os
+            if not os.path.exists(audio_path):
+                return SendResult(success=False, error=f"Audio file not found: {audio_path}")
+            
+            with open(audio_path, "rb") as audio_file:
+                # .ogg/.opus files -> send as voice (round playable bubble)
+                if audio_path.lower().endswith((".ogg", ".opus")):
+                    msg = await self._bot.send_voice(
+                        chat_id=int(chat_id),
+                        voice=audio_file,
+                        caption=caption[:1024] if caption else None,
+                        reply_to_message_id=int(reply_to) if reply_to else None,
+                    )
+                else:
+                    # .mp3 and others -> send as audio file
+                    msg = await self._bot.send_audio(
+                        chat_id=int(chat_id),
+                        audio=audio_file,
+                        caption=caption[:1024] if caption else None,
+                        reply_to_message_id=int(reply_to) if reply_to else None,
+                    )
+            return SendResult(success=True, message_id=str(msg.message_id))
+        except Exception as e:
+            print(f"[{self.name}] Failed to send voice/audio: {e}")
+            return await super().send_voice(chat_id, audio_path, caption, reply_to)
+    
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send an image natively as a Telegram photo."""
+        if not self._bot:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            # Telegram can send photos directly from URLs
+            msg = await self._bot.send_photo(
+                chat_id=int(chat_id),
+                photo=image_url,
+                caption=caption[:1024] if caption else None,  # Telegram caption limit
+                reply_to_message_id=int(reply_to) if reply_to else None,
+            )
+            return SendResult(success=True, message_id=str(msg.message_id))
+        except Exception as e:
+            print(f"[{self.name}] Failed to send photo, falling back to URL: {e}")
+            # Fallback: send as text link
+            return await super().send_image(chat_id, image_url, caption, reply_to)
+    
    async def send_typing(self, chat_id: str) -> None:
        """Send typing indicator."""
        if self._bot:
@@ -35,6 +35,9 @@ load_dotenv()
 # Gateway runs in quiet mode - suppress debug output and use cwd directly (no temp dirs)
 os.environ["HERMES_QUIET"] = "1"

+# Enable interactive exec approval for dangerous commands on messaging platforms
+os.environ["HERMES_EXEC_ASK"] = "1"
+
 # Set terminal working directory for messaging platforms
 # Uses MESSAGING_CWD if set, otherwise defaults to home directory
 # This is separate from CLI which uses the directory where `hermes` is run
@@ -77,6 +80,10 @@ class GatewayRunner:
        # Key: session_key, Value: AIAgent instance
        self._running_agents: Dict[str, Any] = {}
        self._pending_messages: Dict[str, str] = {}  # Queued messages during interrupt
+        
+        # Track pending exec approvals per session
+        # Key: session_key, Value: {"command": str, "pattern_key": str}
+        self._pending_approvals: Dict[str, Dict[str, str]] = {}
    
    async def start(self) -> bool:
        """
@@ -246,6 +253,25 @@ class GatewayRunner:
        if command == "stop":
            return await self._handle_stop_command(event)
        
+        # Check for pending exec approval responses
+        session_key_preview = f"agent:main:{source.platform.value}:{source.chat_type}:{source.chat_id}" if source.chat_type != "dm" else f"agent:main:{source.platform.value}:dm"
+        if session_key_preview in self._pending_approvals:
+            user_text = event.text.strip().lower()
+            if user_text in ("yes", "y", "approve", "ok", "go", "do it"):
+                approval = self._pending_approvals.pop(session_key_preview)
+                cmd = approval["command"]
+                pattern_key = approval.get("pattern_key", "")
+                print(f"[gateway] ✅ User approved dangerous command: {cmd[:60]}...")
+                # Approve for session and re-run via terminal_tool with force=True
+                from tools.terminal_tool import terminal_tool, _session_approved_patterns
+                _session_approved_patterns.add(pattern_key)
+                result = terminal_tool(command=cmd, force=True)
+                return f"✅ Command approved and executed.\n\n```\n{result[:3500]}\n```"
+            elif user_text in ("no", "n", "deny", "cancel", "nope"):
+                self._pending_approvals.pop(session_key_preview)
+                return "❌ Command denied."
+            # If it's not clearly an approval/denial, fall through to normal processing
+        
        # Get or create session
        session_entry = self.session_store.get_or_create_session(source)
        session_key = session_entry.session_key
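Review note: the approval check above is a three-way decision (approve, deny, or fall through to normal processing). A sketch of that classification in isolation, using the same word lists as the diff; `classify_approval_reply` and the two constants are hypothetical names:

```python
from typing import Optional

APPROVE_WORDS = frozenset({"yes", "y", "approve", "ok", "go", "do it"})
DENY_WORDS = frozenset({"no", "n", "deny", "cancel", "nope"})


def classify_approval_reply(text: str) -> Optional[bool]:
    """Return True for approval, False for denial, None if ambiguous."""
    normalized = text.strip().lower()
    if normalized in APPROVE_WORDS:
        return True
    if normalized in DENY_WORDS:
        return False
    # Anything else is treated as a normal message, as in the diff.
    return None
```

Because the match is on the whole normalized message, a reply such as "yes please" is not an approval and the pending entry stays in place.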
@@ -282,6 +308,17 @@ class GatewayRunner:
                session_key=session_key
            )
            
+            # Check if the agent encountered a dangerous command needing approval
+            # The terminal tool stores the last pending approval globally
+            try:
+                from tools.terminal_tool import _last_pending_approval
+                if _last_pending_approval:
+                    self._pending_approvals[session_key] = _last_pending_approval.copy()
+                    # Clear the global so it doesn't leak to other sessions
+                    _last_pending_approval.clear()
+            except Exception:
+                pass
+            
            # Append to transcript
            self.session_store.append_to_transcript(
                session_entry.session_id,
@@ -418,23 +455,35 @@ class GatewayRunner:
                return
            last_tool[0] = tool_name
            
-            # Build progress message
+            # Build progress message with primary argument preview
            tool_emojis = {
                "terminal": "💻",
                "web_search": "🔍",
                "web_extract": "📄",
                "read_file": "📖",
                "write_file": "✍️",
+                "patch": "🔧",
+                "search": "🔎",
                "list_directory": "📂",
                "image_generate": "🎨",
+                "text_to_speech": "🔊",
                "browser_navigate": "🌐",
                "browser_click": "👆",
+                "browser_type": "⌨️",
+                "browser_snapshot": "📸",
                "moa_query": "🧠",
+                "mixture_of_agents": "🧠",
+                "vision_analyze": "👁️",
+                "skill_view": "📚",
+                "skills_list": "📋",
            }
            emoji = tool_emojis.get(tool_name, "⚙️")
            
-            if tool_name == "terminal" and preview:
-                msg = f"{emoji} `{preview}`..."
+            if preview:
+                # Truncate preview to keep messages clean
+                if len(preview) > 40:
+                    preview = preview[:37] + "..."
+                msg = f"{emoji} {tool_name}... \"{preview}\""
            else:
                msg = f"{emoji} {tool_name}..."
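Review note: the new progress format truncates any preview longer than 40 characters to 37 characters plus an ellipsis. A sketch of that formatting as a standalone function; `format_progress` and its `max_len` parameter are hypothetical, and the emoji is passed in rather than looked up:

```python
from typing import Optional


def format_progress(
    tool_name: str,
    preview: Optional[str],
    emoji: str,
    max_len: int = 40,
) -> str:
    """Build the one-line progress message shown while a tool runs."""
    if preview:
        if len(preview) > max_len:
            # Reserve three characters for the ellipsis so the result
            # is exactly max_len characters long.
            preview = preview[: max_len - 3] + "..."
        return f'{emoji} {tool_name}... "{preview}"'
    return f"{emoji} {tool_name}..."


short = format_progress("terminal", "ls -la", "*")
long_ = format_progress("web_search", "a" * 50, "*")
```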
            
@@ -480,6 +529,10 @@ class GatewayRunner:
            # Read from env var or use default (same as CLI)
            max_iterations = int(os.getenv("HERMES_MAX_ITERATIONS", "60"))
            
+            # Map platform enum to the platform hint key the agent understands.
+            # Platform.LOCAL ("local") maps to "cli"; others pass through as-is.
+            platform_key = "cli" if source.platform == Platform.LOCAL else source.platform.value
+            
            agent = AIAgent(
                model=os.getenv("HERMES_MODEL", "anthropic/claude-opus-4.6"),
                max_iterations=max_iterations,
@@ -488,19 +541,42 @@ class GatewayRunner:
                ephemeral_system_prompt=context_prompt,
                session_id=session_id,
                tool_progress_callback=progress_callback if tool_progress_enabled else None,
+                platform=platform_key,  # Tells the agent which interface to format for
            )
            
            # Store agent reference for interrupt support
            agent_holder[0] = agent
            
-            # Convert transcript history to agent format
-            # Transcript has timestamps; agent expects {"role": ..., "content": ...}
+            # Convert history to agent format.
+            # Two cases:
+            #   1. Normal path (from transcript): simple {role, content, timestamp} dicts
+            #      - Strip timestamps, keep role+content
+            #   2. Interrupt path (from agent result["messages"]): full agent messages
+            #      that may include tool_calls, tool_call_id, reasoning, etc.
+            #      - These must be passed through intact so the API sees valid
+            #        assistant→tool sequences (dropping tool_calls causes 500 errors)
            agent_history = []
            for msg in history:
                role = msg.get("role")
-                content = msg.get("content")
-                if role and content:
-                    agent_history.append({"role": role, "content": content})
+                if not role:
+                    continue
+                
+                # Check if this is a rich agent message (has tool_calls or tool_call_id)
+                # If so, pass it through with full structure intact
+                has_tool_calls = "tool_calls" in msg
+                has_tool_call_id = "tool_call_id" in msg
+                is_tool_message = role == "tool"
+                
+                if has_tool_calls or has_tool_call_id or is_tool_message:
+                    # Preserve full message structure (tool_calls, tool_call_id, etc.)
+                    # Only strip fields that are purely internal (e.g. timestamp)
+                    clean_msg = {k: v for k, v in msg.items() if k != "timestamp"}
+                    agent_history.append(clean_msg)
+                else:
+                    # Simple text message - just need role and content
+                    content = msg.get("content")
+                    if content:
+                        agent_history.append({"role": role, "content": content})
            
            result = agent.run_conversation(message, conversation_history=agent_history)
            result_holder[0] = result
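Review note: the history conversion above has two paths, one for plain transcript entries and one for rich agent messages that carry tool-call structure. A sketch of the same loop as a pure function with a small sample input; `to_agent_history` is a hypothetical name and the sample messages are made up for illustration:

```python
from typing import Any, Dict, List


def to_agent_history(history: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert stored history into the message list passed to the agent."""
    agent_history: List[Dict[str, Any]] = []
    for msg in history:
        role = msg.get("role")
        if not role:
            continue
        if "tool_calls" in msg or "tool_call_id" in msg or role == "tool":
            # Rich message: keep everything except the internal timestamp
            # so assistant -> tool sequences stay valid.
            agent_history.append({k: v for k, v in msg.items() if k != "timestamp"})
        else:
            # Plain text message: only role and non-empty content survive.
            content = msg.get("content")
            if content:
                agent_history.append({"role": role, "content": content})
    return agent_history


sample = [
    {"role": "user", "content": "hi", "timestamp": "t0"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "c1"}], "timestamp": "t1"},
    {"role": "tool", "tool_call_id": "c1", "content": "ok", "timestamp": "t2"},
    {"content": "no role"},
    {"role": "assistant", "content": ""},
]
converted = to_agent_history(sample)
```

Note that the assistant message with `content` set to `None` is kept because it carries `tool_calls`, while the empty plain assistant message is dropped.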
@@ -572,13 +648,16 @@ class GatewayRunner:
            
            if pending:
                print(f"[gateway] 📨 Processing interrupted message: '{pending[:40]}...'")
-                # Add an indicator to the response
-                if response:
-                    response = response + "\n\n---\n_[Interrupted - processing your new message]_"
                
-                # Send the interrupted response first
-                if adapter and response:
-                    await adapter.send(chat_id=source.chat_id, content=response)
+                # Clear the adapter's interrupt event so the next _run_agent call
+                # doesn't immediately re-trigger the interrupt before the new agent
+                # even makes its first API call (this was causing an infinite loop).
+                if adapter and hasattr(adapter, '_active_sessions') and source.chat_id in adapter._active_sessions:
+                    adapter._active_sessions[source.chat_id].clear()
+                
+                # Don't send the interrupted response to the user — it's just noise
+                # like "Operation interrupted." They already know they sent a new
+                # message, so go straight to processing it.
                
                # Now process the pending message with updated history
                updated_history = result.get("messages", history)
@@ -612,11 +691,13 @@ class GatewayRunner:
        return response


-async def start_gateway(config: Optional[GatewayConfig] = None) -> None:
+async def start_gateway(config: Optional[GatewayConfig] = None) -> bool:
    """
    Start the gateway and run until interrupted.
    
    This is the main entry point for running the gateway.
+    Returns True if the gateway ran successfully, False if it failed to start.
+    A False return causes a non-zero exit code so systemd can auto-restart.
    """
    runner = GatewayRunner(config)
    
@@ -635,10 +716,11 @@ async def start_gateway(config: Optional[GatewayConfig] = None) -> None:
    # Start the gateway
    success = await runner.start()
    if not success:
-        return
+        return False
    
    # Wait for shutdown
    await runner.wait_for_shutdown()
+    return True


 def main():
@@ -658,8 +740,11 @@ def main():
            data = json.load(f)
            config = GatewayConfig.from_dict(data)
    
-    # Run the gateway
-    asyncio.run(start_gateway(config))
+    # Run the gateway - exit with code 1 if no platforms connected,
+    # so systemd Restart=on-failure will retry on transient errors (e.g. DNS)
+    success = asyncio.run(start_gateway(config))
+    if not success:
+        sys.exit(1)


 if __name__ == "__main__":
@@ -99,6 +99,24 @@ DEFAULT_CONFIG = {
        "personality": "kawaii",
    },
    
+    # Text-to-speech configuration
+    "tts": {
+        "provider": "edge",  # "edge" (free) | "elevenlabs" (premium) | "openai"
+        "edge": {
+            "voice": "en-US-AriaNeural",
+            # Popular: AriaNeural, JennyNeural, AndrewNeural, BrianNeural, SoniaNeural
+        },
+        "elevenlabs": {
+            "voice_id": "pNInz6obpgDQGcFmaJgB",  # Adam
+            "model_id": "eleven_multilingual_v2",
+        },
+        "openai": {
+            "model": "gpt-4o-mini-tts",
+            "voice": "alloy",
+            # Voices: alloy, echo, fable, onyx, nova, shimmer
+        },
+    },
+    
    # Permanently allowed dangerous command patterns (added via "always" approval)
    "command_allowlist": [],
    
@@ -202,6 +220,13 @@ OPTIONAL_ENV_VARS = {
        "url": None,
        "password": False,
    },
+    # Text-to-speech (premium providers)
+    "ELEVENLABS_API_KEY": {
+        "description": "ElevenLabs API key for premium text-to-speech voices",
+        "prompt": "ElevenLabs API key",
+        "url": "https://elevenlabs.io/",
+        "password": True,
+    },
    # Terminal configuration
    "MESSAGING_CWD": {
        "description": "Working directory for terminal commands via messaging (Telegram/Discord/etc). CLI always uses current directory.",
@@ -360,7 +360,11 @@ def run_gateway(verbose: bool = False):
    print("└─────────────────────────────────────────────────────────┘")
    print()
    
-    asyncio.run(start_gateway())
+    # Exit with code 1 if gateway fails to connect any platform,
+    # so systemd Restart=on-failure will retry on transient errors
+    success = asyncio.run(start_gateway())
+    if not success:
+        sys.exit(1)


 # =============================================================================
@@ -186,6 +186,11 @@ def _print_setup_summary(config: dict, hermes_home):
    else:
        tool_status.append(("Image Generation", False, "FAL_KEY"))
    
+    # TTS (always available via Edge TTS; ElevenLabs/OpenAI are optional)
+    tool_status.append(("Text-to-Speech (Edge TTS)", True, None))
+    if get_env_value('ELEVENLABS_API_KEY'):
+        tool_status.append(("Text-to-Speech (ElevenLabs)", True, None))
+    
    # Tinker + WandB (RL training)
    if get_env_value('TINKER_API_KEY') and get_env_value('WANDB_API_KEY'):
        tool_status.append(("RL Training (Tinker)", True, None))
@@ -991,6 +996,28 @@ def run_setup_wizard(args):
                print_success("    Configured ✓")
    print()
    
+    # ElevenLabs - Premium TTS
+    print_info("─" * 50)
+    print(color("  Text-to-Speech - ElevenLabs (Premium)", Colors.CYAN))
+    print_info("  Enables: Premium TTS voices (Edge TTS is free and works without a key)")
+    print_info("  Use case: High-quality, customizable voice synthesis")
+    if get_env_value('ELEVENLABS_API_KEY'):
+        print_success("  Status: Configured ✓")
+        if prompt_yes_no("  Update ElevenLabs API key?", False):
+            api_key = prompt("    API key", password=True)
+            if api_key:
+                save_env_value("ELEVENLABS_API_KEY", api_key)
+                print_success("    Updated")
+    else:
+        print_warning("  Status: Not configured (free Edge TTS will be used by default)")
+        if prompt_yes_no("  Set up ElevenLabs?", False):
+            print_info("    Get your API key at: https://elevenlabs.io/")
+            api_key = prompt("    API key", password=True)
+            if api_key:
+                save_env_value("ELEVENLABS_API_KEY", api_key)
+                print_success("    Configured ✓")
+    print()
+    
    # Tinker + WandB - RL Training
    print_info("─" * 50)
    print(color("  RL Training (Tinker + WandB)", Colors.CYAN))
@@ -76,6 +76,7 @@ def show_status(args):
        "FAL": "FAL_KEY",
        "Tinker": "TINKER_API_KEY",
        "WandB": "WANDB_API_KEY",
+        "ElevenLabs": "ELEVENLABS_API_KEY",
    }
    
    for name, env_var in keys.items():
@@ -41,7 +41,7 @@ from tools.terminal_hecate import terminal_hecate_tool, check_hecate_requirement
 from tools.vision_tools import vision_analyze_tool, check_vision_requirements
 from tools.mixture_of_agents_tool import mixture_of_agents_tool, check_moa_requirements
 from tools.image_generation_tool import image_generate_tool, check_image_generation_requirements
-from tools.skills_tool import skills_categories, skills_list, skill_view, check_skills_requirements, SKILLS_TOOL_DESCRIPTION
+from tools.skills_tool import skills_list, skill_view, check_skills_requirements, SKILLS_TOOL_DESCRIPTION
 # RL Training tools (Tinker-Atropos)
 from tools.rl_training_tool import (
    rl_list_environments,
@@ -83,6 +83,8 @@ from tools.browser_tool import (
    check_browser_requirements,
    BROWSER_TOOL_SCHEMAS
 )
+# Text-to-speech tool (Edge TTS / ElevenLabs / OpenAI)
+from tools.tts_tool import text_to_speech_tool, check_tts_requirements
 from toolsets import (
    get_toolset, resolve_toolset, resolve_multiple_toolsets,
    get_all_toolsets, get_toolset_names, validate_toolset,
@@ -143,7 +145,7 @@ TOOLSET_REQUIREMENTS = {
        "env_vars": [],  # Just needs skills directory
        "check_fn": check_skills_requirements,
        "setup_url": None,
-        "tools": ["skills_categories", "skills_list", "skill_view"],
+        "tools": ["skills_list", "skill_view"],
    },
    "rl": {
        "name": "RL Training (Tinker-Atropos)",
@@ -165,6 +167,13 @@ TOOLSET_REQUIREMENTS = {
        "setup_url": None,
        "tools": ["read_file", "write_file", "patch", "search"],
    },
+    "tts": {
+        "name": "Text-to-Speech",
+        "env_vars": [],  # Edge TTS needs no key; premium providers checked at runtime
+        "check_fn": check_tts_requirements,
+        "setup_url": None,
+        "tools": ["text_to_speech"],
+    },
 }


@@ -392,7 +401,7 @@ def get_image_tool_definitions() -> List[Dict[str, Any]]:
            "type": "function",
            "function": {
                "name": "image_generate",
-                "description": "Generate high-quality images from text prompts using FLUX 2 Pro model with automatic 2x upscaling. Creates detailed, artistic images that are automatically upscaled for hi-rez results. Returns a single upscaled image URL that can be displayed using <img src=\"{URL}\"></img> tags.",
+                "description": "Generate high-quality images from text prompts using FLUX 2 Pro model with automatic 2x upscaling. Creates detailed, artistic images that are automatically upscaled for high-resolution results. Returns a single upscaled image URL. Display it using markdown: ![description](URL)",
                "parameters": {
                    "type": "object",
                    "properties": {
@@ -432,24 +441,7 @@ def get_skills_tool_definitions() -> List[Dict[str, Any]]:
                    "properties": {
                        "category": {
                            "type": "string",
-                            "description": "Optional category filter (from skills_categories)"
-                        }
-                    },
-                    "required": []
-                }
-            }
-        },
-        {
-            "type": "function",
-            "function": {
-                "name": "skills_categories",
-                "description": "List available skill categories. Call this first to discover what skill categories exist, then use skills_list(category) to see skills in a category.",
-                "parameters": {
-                    "type": "object",
-                    "properties": {
-                        "verbose": {
-                            "type": "boolean",
-                            "description": "If true, include skill counts per category. Default: false."
+                            "description": "Optional category filter to narrow results"
                        }
                    },
                    "required": []
@@ -879,6 +871,38 @@ def get_file_tool_definitions() -> List[Dict[str, Any]]:
    ]


+def get_tts_tool_definitions() -> List[Dict[str, Any]]:
+    """
+    Get tool definitions for text-to-speech tools in OpenAI's expected format.
+    
+    Returns:
+        List[Dict]: List of TTS tool definitions compatible with OpenAI API
+    """
+    return [
+        {
+            "type": "function",
+            "function": {
+                "name": "text_to_speech",
+                "description": "Convert text to speech audio. Returns a MEDIA: path that the platform delivers as a voice message. On Telegram it plays as a voice bubble, on Discord/WhatsApp as an audio attachment. In CLI mode, saves to ~/voice-memos/. Voice and provider are user-configured, not model-selected.",
+                "parameters": {
+                    "type": "object",
+                    "properties": {
+                        "text": {
+                            "type": "string",
+                            "description": "The text to convert to speech. Keep under 4000 characters."
+                        },
+                        "output_path": {
+                            "type": "string",
+                            "description": "Optional custom file path to save the audio. Defaults to ~/voice-memos/<timestamp>.mp3"
+                        }
+                    },
+                    "required": ["text"]
+                }
+            }
+        }
+    ]
+
+
 def get_all_tool_names() -> List[str]:
    """
    Get the names of all available tools across all toolsets.
@@ -910,7 +934,7 @@ def get_all_tool_names() -> List[str]:
    
    # Skills tools
    if check_skills_requirements():
-        tool_names.extend(["skills_categories", "skills_list", "skill_view"])
+        tool_names.extend(["skills_list", "skill_view"])
    
    # Browser automation tools
    if check_browser_requirements():
@@ -943,6 +967,10 @@ def get_all_tool_names() -> List[str]:
            "read_file", "write_file", "patch", "search"
        ])
    
+    # Text-to-speech tools
+    if check_tts_requirements():
+        tool_names.extend(["text_to_speech"])
+    
    return tool_names


@@ -957,7 +985,6 @@ TOOL_TO_TOOLSET_MAP = {
    "mixture_of_agents": "moa_tools",
    "image_generate": "image_tools",
    # Skills tools
-    "skills_categories": "skills_tools",
    "skills_list": "skills_tools",
    "skill_view": "skills_tools",
    # Browser automation tools
@@ -985,6 +1012,8 @@ TOOL_TO_TOOLSET_MAP = {
    "rl_stop_training": "rl_tools",
    "rl_get_results": "rl_tools",
    "rl_list_runs": "rl_tools",
+    # Text-to-speech tools
+    "text_to_speech": "tts_tools",
    # File manipulation tools
    "read_file": "file_tools",
    "write_file": "file_tools",
@@ -1088,6 +1117,11 @@ def get_tool_definitions(
        for tool in get_file_tool_definitions():
            all_available_tools_map[tool["function"]["name"]] = tool
    
+    # Text-to-speech tools
+    if check_tts_requirements():
+        for tool in get_tts_tool_definitions():
+            all_available_tools_map[tool["function"]["name"]] = tool
+    
    # Determine which tools to include based on toolsets
    tools_to_include = set()
    
@@ -1109,7 +1143,7 @@ def get_tool_definitions(
                        "vision_tools": ["vision_analyze"],
                        "moa_tools": ["mixture_of_agents"],
                        "image_tools": ["image_generate"],
-                        "skills_tools": ["skills_categories", "skills_list", "skill_view"],
+                        "skills_tools": ["skills_list", "skill_view"],
                        "browser_tools": [
                            "browser_navigate", "browser_snapshot", "browser_click",
                            "browser_type", "browser_scroll", "browser_back",
@@ -1124,7 +1158,8 @@ def get_tool_definitions(
                            "rl_stop_training", "rl_get_results",
                            "rl_list_runs", "rl_test_inference"
                        ],
-                        "file_tools": ["read_file", "write_file", "patch", "search"]
+                        "file_tools": ["read_file", "write_file", "patch", "search"],
+                        "tts_tools": ["text_to_speech"]
                    }
                    legacy_tools = legacy_map.get(toolset_name, [])
                    tools_to_include.update(legacy_tools)
@@ -1162,7 +1197,7 @@ def get_tool_definitions(
                        "vision_tools": ["vision_analyze"],
                        "moa_tools": ["mixture_of_agents"],
                        "image_tools": ["image_generate"],
-                        "skills_tools": ["skills_categories", "skills_list", "skill_view"],
+                        "skills_tools": ["skills_list", "skill_view"],
                        "browser_tools": [
                            "browser_navigate", "browser_snapshot", "browser_click",
                            "browser_type", "browser_scroll", "browser_back",
@@ -1177,7 +1212,8 @@ def get_tool_definitions(
                            "rl_stop_training", "rl_get_results",
                            "rl_list_runs", "rl_test_inference"
                        ],
-                        "file_tools": ["read_file", "write_file", "patch", "search"]
+                        "file_tools": ["read_file", "write_file", "patch", "search"],
+                        "tts_tools": ["text_to_speech"]
                    }
                    legacy_tools = legacy_map.get(toolset_name, [])
                    tools_to_include.difference_update(legacy_tools)
@@ -1391,11 +1427,7 @@ def handle_skills_function_call(function_name: str, function_args: Dict[str, Any
    Returns:
        str: Function result as JSON string
    """
-    if function_name == "skills_categories":
-        verbose = function_args.get("verbose", False)
-        return skills_categories(verbose=verbose)
-    
-    elif function_name == "skills_list":
+    if function_name == "skills_list":
        category = function_args.get("category")
        return skills_list(category=category)
    
@@ -1639,6 +1671,28 @@ def handle_file_function_call(
    return json.dumps({"error": f"Unknown file function: {function_name}"}, ensure_ascii=False)


+def handle_tts_function_call(
+    function_name: str,
+    function_args: Dict[str, Any]
+) -> str:
+    """
+    Handle function calls for text-to-speech tools.
+    
+    Args:
+        function_name (str): Name of the TTS function to call
+        function_args (Dict): Arguments for the function
+    
+    Returns:
+        str: Function result as JSON string
+    """
+    if function_name == "text_to_speech":
+        text = function_args.get("text", "")
+        output_path = function_args.get("output_path")
+        return text_to_speech_tool(text=text, output_path=output_path)
+    
+    return json.dumps({"error": f"Unknown TTS function: {function_name}"}, ensure_ascii=False)
+
+
 def handle_function_call(
    function_name: str, 
    function_args: Dict[str, Any], 
@@ -1686,7 +1740,7 @@ def handle_function_call(
            return handle_image_function_call(function_name, function_args)

        # Route skills tools
-        elif function_name in ["skills_categories", "skills_list", "skill_view"]:
+        elif function_name in ["skills_list", "skill_view"]:
            return handle_skills_function_call(function_name, function_args)

        # Route browser automation tools
@@ -1716,6 +1770,10 @@ def handle_function_call(
        elif function_name in ["read_file", "write_file", "patch", "search"]:
            return handle_file_function_call(function_name, function_args, task_id)

+        # Route text-to-speech tools
+        elif function_name in ["text_to_speech"]:
+            return handle_tts_function_call(function_name, function_args)
+
        else:
            error_msg = f"Unknown function: {function_name}"
            print(f"❌ {error_msg}")
@@ -1767,7 +1825,7 @@ def get_available_toolsets() -> Dict[str, Dict[str, Any]]:
        },
        "skills_tools": {
            "available": check_skills_requirements(),
-            "tools": ["skills_categories", "skills_list", "skill_view"],
+            "tools": ["skills_list", "skill_view"],
            "description": "Access skill documents that provide specialized instructions, guidelines, or knowledge the agent can load on demand",
            "requirements": ["skills/ directory in repo root"]
        },
@@ -1793,6 +1851,12 @@ def get_available_toolsets() -> Dict[str, Dict[str, Any]]:
            "tools": ["read_file", "write_file", "patch", "search"],
            "description": "File manipulation tools: read/write files, search content/files, patch with fuzzy matching",
            "requirements": ["Terminal backend available (local/docker/ssh/singularity/modal)"]
+        },
+        "tts_tools": {
+            "available": check_tts_requirements(),
+            "tools": ["text_to_speech"],
+            "description": "Text-to-speech: convert text to audio (Edge TTS free, ElevenLabs, OpenAI)",
+            "requirements": ["edge-tts package (free) or ELEVENLABS_API_KEY or OPENAI_API_KEY"]
        }
    }
    
@@ -1814,7 +1878,8 @@ def check_toolset_requirements() -> Dict[str, bool]:
        "skills_tools": check_skills_requirements(),
        "browser_tools": check_browser_requirements(),
        "cronjob_tools": check_cronjob_requirements(),
-        "file_tools": check_file_requirements()
+        "file_tools": check_file_requirements(),
+        "tts_tools": check_tts_requirements()
    }

 if __name__ == "__main__":
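
Review note: the `text_to_speech` routing added in the hunks above follows the file's existing pattern (a tool-to-toolset map plus one `handle_*_function_call` per toolset, selected by an `elif` chain). The sketch below shows the same routing as a table lookup. This is a swapped-in technique for illustration, not what the diff implements. `fake_text_to_speech`, `TOOLSET_HANDLERS`, and `dispatch` are hypothetical names, and the stand-in tool only echoes its inputs.

```python
import json
from typing import Any, Callable, Dict, Optional


def fake_text_to_speech(text: str, output_path: Optional[str] = None) -> str:
    """Hypothetical stand-in for the real TTS tool; it only echoes its inputs."""
    return json.dumps({"success": True, "chars": len(text), "output_path": output_path})


def handle_tts(function_name: str, function_args: Dict[str, Any]) -> str:
    """Mirrors the shape of handle_tts_function_call from the diff."""
    if function_name == "text_to_speech":
        return fake_text_to_speech(
            text=function_args.get("text", ""),
            output_path=function_args.get("output_path"),
        )
    return json.dumps({"error": f"Unknown TTS function: {function_name}"})


# Illustrative one-entry subset of TOOL_TO_TOOLSET_MAP.
TOOL_TO_TOOLSET: Dict[str, str] = {"text_to_speech": "tts_tools"}

# Assumption: one handler per toolset, keyed by toolset name.
TOOLSET_HANDLERS: Dict[str, Callable[[str, Dict[str, Any]], str]] = {
    "tts_tools": handle_tts,
}


def dispatch(function_name: str, function_args: Dict[str, Any]) -> str:
    """Resolve tool name -> toolset -> handler, returning a JSON string."""
    toolset = TOOL_TO_TOOLSET.get(function_name)
    handler = TOOLSET_HANDLERS.get(toolset) if toolset else None
    if handler is None:
        return json.dumps({"error": f"Unknown function: {function_name}"})
    return handler(function_name, function_args)
```

With a table like this, removing a tool such as `skills_categories` would be a single deletion from the map, whereas the diff has to touch the map, both legacy maps, the router, and the toolset listing.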
@@ -29,6 +29,12 @@ platformdirs
 # Optional: For Modal backend (cloud execution)
 # swe-rex[modal]>=1.4.0  # Includes modal + boto3 + swe-rex runtime

+# Text-to-speech (Edge TTS is free, no API key needed)
+edge-tts
+
+# Optional: Premium TTS providers
+# elevenlabs  # Uncomment if using ElevenLabs TTS (needs ELEVENLABS_API_KEY)
+
 # Optional: For cron expression parsing (cronjob scheduling)
 croniter
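
Review note: the `tts_tools` entry lists its requirements as "edge-tts package (free) or ELEVENLABS_API_KEY or OPENAI_API_KEY". The body of `check_tts_requirements()` is not part of this diff, so the following is only a minimal sketch of how such an availability check could work. `tts_available` and its parameters are hypothetical, and the import name `edge_tts` is assumed for the `edge-tts` package.

```python
import importlib.util
import os
from typing import Mapping, Optional


def tts_available(
    env: Optional[Mapping[str, str]] = None,
    edge_tts_installed: Optional[bool] = None,
) -> bool:
    """Return True if at least one TTS provider looks usable.

    Both parameters exist only to make the check testable; by default the
    function inspects the real environment and the installed packages.
    """
    env = os.environ if env is None else env
    if edge_tts_installed is None:
        # find_spec returns None when a top-level module is not installed.
        edge_tts_installed = importlib.util.find_spec("edge_tts") is not None
    has_api_key = bool(env.get("ELEVENLABS_API_KEY") or env.get("OPENAI_API_KEY"))
    return edge_tts_installed or has_api_key
```

Because `edge-tts` is an unconditional dependency in this requirements file, a check of this shape would normally pass on a fresh install even with no API keys set.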

@@ -20,6 +20,7 @@ Usage:
    response = agent.run_conversation("Tell me about the latest Python updates")
 """

+import copy
 import json
 import logging
 import os
@@ -48,11 +49,46 @@ elif not os.getenv("HERMES_QUIET"):

 # Import our tool system
 from model_tools import get_tool_definitions, handle_function_call, check_toolset_requirements
-from tools.terminal_tool import cleanup_vm
+from tools.terminal_tool import cleanup_vm, set_interrupt_event as _set_terminal_interrupt
 from tools.browser_tool import cleanup_browser

 import requests

+# =============================================================================
+# Default Agent Identity & Platform Hints
+# =============================================================================
+
+# The default identity prompt is prepended to every conversation so the agent
+# knows who it is and behaves consistently across platforms.
+DEFAULT_AGENT_IDENTITY = (
+    "You are Hermes Agent, an intelligent AI assistant created by Nous Research. "
+    "You are helpful, knowledgeable, and direct. You assist users with a wide "
+    "range of tasks including answering questions, writing and editing code, "
+    "analyzing information, creative work, and executing actions via your tools. "
+    "You communicate clearly, admit uncertainty when appropriate, and prioritize "
+    "being genuinely useful over being verbose unless otherwise directed below."
+)
+
+# Platform-specific formatting hints appended to the system prompt.
+# These tell the agent how to format its output for the current interface.
+PLATFORM_HINTS = {
+    "whatsapp": (
+        "You are on a text messaging communication platform, WhatsApp. "
+        "Please do not use markdown as it does not render."
+    ),
+    "telegram": (
+        "You are on a text messaging communication platform, Telegram. "
+        "Please do not use markdown as it does not render."
+    ),
+    "discord": (
+        "You are in a Discord server or group chat communicating with your user."
+    ),
+    "cli": (
+        "You are a CLI AI Agent. Try not to use markdown but simple text "
+        "renderable inside a terminal."
+    ),
+}
+
 # =============================================================================
 # Model Context Management
 # =============================================================================
@@ -457,18 +493,389 @@ Write only the summary, starting with "[CONTEXT SUMMARY]:" prefix."""
        return compressed


+# =============================================================================
+# Anthropic Prompt Caching (system_and_3 strategy)
+# =============================================================================
+# Reduces input token costs by ~75% on multi-turn conversations by caching
+# the conversation prefix. Uses 4 cache_control breakpoints (Anthropic max):
+#   1. System prompt (stable across all turns)
+#   2-4. Last 3 non-system messages (rolling window)
+#
+# Cached tokens are read at 0.1x input price. Cache writes cost 1.25x (5m TTL)
+# or 2x (1h TTL). Only applied to Claude models via OpenRouter.
+
+def _apply_cache_marker(msg: dict, cache_marker: dict) -> None:
+    """
+    Add cache_control to a single message, handling all format variations.
+
+    - tool messages: cache_control at message level (Anthropic API quirk)
+    - string content: converted to multipart content array
+    - list content: marker added to last item
+    - None content (assistant with tool_calls): message level
+    """
+    role = msg.get("role", "")
+    content = msg.get("content")
+
+    if role == "tool":
+        msg["cache_control"] = cache_marker
+        return
+
+    if content is None:
+        msg["cache_control"] = cache_marker
+        return
+
+    if isinstance(content, str):
+        msg["content"] = [{"type": "text", "text": content, "cache_control": cache_marker}]
+        return
+
+    if isinstance(content, list) and content:
+        last = content[-1]
+        if isinstance(last, dict):
+            last["cache_control"] = cache_marker
+
+
+def apply_anthropic_cache_control(
+    api_messages: List[Dict[str, Any]],
+    cache_ttl: str = "5m",
+) -> List[Dict[str, Any]]:
+    """
+    Apply system_and_3 caching strategy to messages for Anthropic models.
+
+    Places up to 4 cache_control breakpoints:
+      1. System prompt (index 0, stable across all turns)
+      2-4. Last 3 non-system messages (rolling cache frontier; last 4 if no system prompt)
+
+    Each breakpoint tells Anthropic "cache everything from the start up to here."
+    Multiple breakpoints create a ladder of cached prefixes at different depths,
+    which provides robust cache hits even when the most recent cache entry hasn't
+    propagated yet.
+
+    Args:
+        api_messages: Fully assembled message list (system prompt first).
+        cache_ttl: "5m" (default, 1.25x write cost) or "1h" (2x write cost).
+
+    Returns:
+        Deep copy of messages with cache_control breakpoints injected.
+    """
+    messages = copy.deepcopy(api_messages)
+    if not messages:
+        return messages
+
+    marker = {"type": "ephemeral"}
+    if cache_ttl == "1h":
+        marker["ttl"] = "1h"
+
+    breakpoints_used = 0
+
+    # Breakpoint 1: System prompt (always stable, gives a guaranteed minimum hit)
+    if messages[0].get("role") == "system":
+        _apply_cache_marker(messages[0], marker)
+        breakpoints_used += 1
+
+    # Breakpoints 2-4: Last 3 non-system messages (rolling window)
+    remaining = 4 - breakpoints_used
+    non_sys = [i for i in range(len(messages)) if messages[i].get("role") != "system"]
+    for idx in non_sys[-remaining:]:
+        _apply_cache_marker(messages[idx], marker)
+
+    return messages
+
+
 # =============================================================================
 # Default System Prompt Components
 # =============================================================================

-# Skills guidance - instructs the model to check skills before technical tasks
-SKILLS_SYSTEM_PROMPT = """## Skills
-Before answering technical questions about tools, frameworks, or workflows:
-1. Check skills_categories to see if a relevant category exists
-2. If a category matches your task, use skills_list with that category
-3. If a skill matches, load it with skill_view and follow its instructions
+# Skills guidance - embeds a compact skill index in the system prompt so
+# the model can match skills at a glance without extra tool calls.
+def build_skills_system_prompt() -> str:
+    """
+    Build a dynamic skills system prompt by scanning the skills/ directory.
+    
+    Returns a prompt section that lists all skill categories (with descriptions
+    from DESCRIPTION.md) and their skill names inline, so the model can
+    immediately see if a relevant skill exists and load it with a single
+    skill_view(name) call -- no discovery tool calls needed.
+    
+    Returns:
+        str: The skills system prompt section, or empty string if no skills found.
+    """
+    import re
+    from pathlib import Path
+    
+    skills_dir = Path(__file__).parent / "skills"
+    if not skills_dir.exists():
+        return ""
+    
+    # Scan for SKILL.md files grouped by category
+    skills_by_category = {}
+    for skill_file in skills_dir.rglob("SKILL.md"):
+        rel_path = skill_file.relative_to(skills_dir)
+        parts = rel_path.parts
+        if len(parts) >= 2:
+            category = parts[0]
+            skill_name = parts[-2]  # Folder containing SKILL.md
+        else:
+            category = "general"
+            skill_name = skill_file.parent.name
+        skills_by_category.setdefault(category, []).append(skill_name)
+    
+    if not skills_by_category:
+        return ""
+    
+    # Load category descriptions from DESCRIPTION.md files (YAML frontmatter)
+    category_descriptions = {}
+    for category in skills_by_category:
+        desc_file = skills_dir / category / "DESCRIPTION.md"
+        if desc_file.exists():
+            try:
+                content = desc_file.read_text(encoding="utf-8")
+                # Parse description from YAML frontmatter: ---\ndescription: ...\n---
+                match = re.search(r"^---\s*\n.*?description:\s*(.+?)\s*\n.*?^---", content, re.MULTILINE | re.DOTALL)
+                if match:
+                    category_descriptions[category] = match.group(1).strip()
+            except Exception:
+                pass
+    
+    # Build compact index: category with description + skill names
+    index_lines = []
+    for category in sorted(skills_by_category.keys()):
+        desc = category_descriptions.get(category, "")
+        names = ", ".join(sorted(skills_by_category[category]))
+        if desc:
+            index_lines.append(f"  {category}: {desc}")
+        else:
+            index_lines.append(f"  {category}:")
+        index_lines.append(f"    skills: {names}")
+    
+    return (
+        "## Skills (mandatory)\n"
+        "Before replying, scan the skills below. If one clearly matches your task, "
+        "load it with skill_view(name) and follow its instructions.\n"
+        "\n"
+        "<available_skills>\n"
+        + "\n".join(index_lines) + "\n"
+        "</available_skills>\n"
+        "\n"
+        "If none match, proceed normally without loading a skill."
+    )

-Skills contain vetted, up-to-date instructions for specific tools and workflows."""
+
+# =============================================================================
+# Context File Injection (SOUL.md, AGENTS.md, .cursorrules)
+# =============================================================================
+
+# Maximum characters per context file before truncation
+CONTEXT_FILE_MAX_CHARS = 20_000
+# Truncation strategy: keep 70% of the character budget from the head, 20% from the tail
+CONTEXT_TRUNCATE_HEAD_RATIO = 0.7
+CONTEXT_TRUNCATE_TAIL_RATIO = 0.2
+
+
+def _truncate_content(content: str, filename: str, max_chars: int = CONTEXT_FILE_MAX_CHARS) -> str:
+    """
+    Truncate content if it exceeds max_chars using a head/tail strategy.
+    
+    Keeps the first 70% and last 20% of max_chars (not of the file), with a
+    truncation marker in the middle so the model knows content was cut.
+    """
+    if len(content) <= max_chars:
+        return content
+    
+    head_chars = int(max_chars * CONTEXT_TRUNCATE_HEAD_RATIO)
+    tail_chars = int(max_chars * CONTEXT_TRUNCATE_TAIL_RATIO)
+    head = content[:head_chars]
+    tail = content[-tail_chars:] if tail_chars > 0 else ""  # content[-0:] would return everything
+    
+    marker = f"\n\n[...truncated {filename}: kept {head_chars}+{tail_chars} of {len(content)} chars. Use file tools to read the full file.]\n\n"
+    return head + marker + tail
+
+
+def build_context_files_prompt(cwd: str = None) -> str:
+    """
+    Discover and load context files (SOUL.md, AGENTS.md, .cursorrules)
+    for injection into the system prompt.
+    
+    Discovery rules:
+    - AGENTS.md: Recursively search from cwd (only if top-level exists).
+                 Each file becomes a ## section with its relative path.
+    - .cursorrules: Check cwd for .cursorrules file and .cursor/rules/*.mdc
+    - SOUL.md: Check cwd first, then ~/.hermes/SOUL.md as global fallback
+    
+    Args:
+        cwd: Working directory to search from. Defaults to os.getcwd().
+    
+    Returns:
+        str: The context files prompt section, or empty string if none found.
+    """
+    import os
+    import glob as glob_mod
+    from pathlib import Path
+    
+    if cwd is None:
+        cwd = os.getcwd()
+    
+    cwd_path = Path(cwd).resolve()
+    sections = []
+    
+    # ----- AGENTS.md (hierarchical, recursive) -----
+    top_level_agents = None
+    for name in ["AGENTS.md", "agents.md"]:
+        candidate = cwd_path / name
+        if candidate.exists():
+            top_level_agents = candidate
+            break
+    
+    if top_level_agents:
+        # Recursively find all AGENTS.md files (case-insensitive)
+        agents_files = []
+        for root, dirs, files in os.walk(cwd_path):
+            # Skip hidden directories and common non-project dirs
+            dirs[:] = [d for d in dirs if not d.startswith('.') and d not in ('node_modules', '__pycache__', 'venv', '.venv')]
+            for f in files:
+                if f.lower() == "agents.md":
+                    agents_files.append(Path(root) / f)
+        
+        # Sort by path depth (top-level first, then deeper)
+        agents_files.sort(key=lambda p: len(p.parts))
+        
+        total_agents_content = ""
+        for agents_path in agents_files:
+            try:
+                content = agents_path.read_text(encoding="utf-8").strip()
+                if content:
+                    rel_path = agents_path.relative_to(cwd_path)
+                    total_agents_content += f"## {rel_path}\n\n{content}\n\n"
+            except Exception:
+                pass
+        
+        if total_agents_content:
+            total_agents_content = _truncate_content(total_agents_content, "AGENTS.md")
+            sections.append(total_agents_content)
+    
+    # ----- .cursorrules -----
+    cursorrules_content = ""
+    
+    # Check for .cursorrules file
+    cursorrules_file = cwd_path / ".cursorrules"
+    if cursorrules_file.exists():
+        try:
+            content = cursorrules_file.read_text(encoding="utf-8").strip()
+            if content:
+                cursorrules_content += f"## .cursorrules\n\n{content}\n\n"
+        except Exception:
+            pass
+    
+    # Check for .cursor/rules/*.mdc files
+    cursor_rules_dir = cwd_path / ".cursor" / "rules"
+    if cursor_rules_dir.exists() and cursor_rules_dir.is_dir():
+        mdc_files = sorted(cursor_rules_dir.glob("*.mdc"))
+        for mdc_file in mdc_files:
+            try:
+                content = mdc_file.read_text(encoding="utf-8").strip()
+                if content:
+                    cursorrules_content += f"## .cursor/rules/{mdc_file.name}\n\n{content}\n\n"
+            except Exception:
+                pass
+    
+    if cursorrules_content:
+        cursorrules_content = _truncate_content(cursorrules_content, ".cursorrules")
+        sections.append(cursorrules_content)
+    
+    # ----- SOUL.md (cwd first, then ~/.hermes/ fallback) -----
+    soul_content = ""
+    soul_path = None
+    
+    for name in ["SOUL.md", "soul.md"]:
+        candidate = cwd_path / name
+        if candidate.exists():
+            soul_path = candidate
+            break
+    
+    if not soul_path:
+        # Global fallback
+        global_soul = Path.home() / ".hermes" / "SOUL.md"
+        if global_soul.exists():
+            soul_path = global_soul
+    
+    if soul_path:
+        try:
+            content = soul_path.read_text(encoding="utf-8").strip()
+            if content:
+                content = _truncate_content(content, "SOUL.md")
+                soul_content = f"## SOUL.md\n\nIf SOUL.md is present, embody its persona and tone. Avoid stiff, generic replies; follow its guidance unless higher-priority instructions override it.\n\n{content}"
+                sections.append(soul_content)
+        except Exception:
+            pass
+    
+    # ----- Assemble -----
+    if not sections:
+        return ""
+    
+    return "# Project Context\n\nThe following project context files have been loaded and should be followed:\n\n" + "\n".join(sections)
+
+
+def _build_tool_preview(tool_name: str, args: dict, max_len: int = 40) -> str:
+    """
+    Build a short preview of a tool call's primary argument for display.
+    
+    Returns a truncated string showing the most informative argument,
+    or None if no meaningful preview is available.
+    
+    Args:
+        tool_name: Name of the tool being called
+        args: The tool call arguments dict
+        max_len: Maximum preview length before truncation
+    
+    Returns:
+        str or None: Short preview string, or None
+    """
+    # Map tool names to their primary argument key(s)
+    primary_args = {
+        "terminal": "command",
+        "web_search": "query",
+        "web_extract": "urls",
+        "read_file": "path",
+        "write_file": "path",
+        "patch": "path",
+        "search": "pattern",
+        "browser_navigate": "url",
+        "browser_click": "ref",
+        "browser_type": "text",
+        "image_generate": "prompt",
+        "text_to_speech": "text",
+        "vision_analyze": "question",
+        "mixture_of_agents": "user_prompt",
+        "skill_view": "name",
+        "skills_list": "category",
+        "schedule_cronjob": "name",
+    }
+    
+    key = primary_args.get(tool_name)
+    if not key:
+        # Try common arg names as fallback
+        for fallback_key in ("query", "text", "command", "path", "name", "prompt"):
+            if fallback_key in args:
+                key = fallback_key
+                break
+    
+    if not key or key not in args:
+        return None
+    
+    value = args[key]
+    
+    # Handle list values (e.g., urls)
+    if isinstance(value, list):
+        value = value[0] if value else ""
+    
+    preview = str(value).strip()
+    if not preview:
+        return None
+    
+    # Truncate
+    if len(preview) > max_len:
+        preview = preview[:max_len - 3] + "..."
+    
+    return preview


 class KawaiiSpinner:
@@ -605,6 +1012,7 @@ class AIAgent:
        max_tokens: int = None,
        reasoning_config: Dict[str, Any] = None,
        prefill_messages: List[Dict[str, Any]] = None,
+        platform: str = None,
    ):
        """
        Initialize the AI Agent.
@@ -635,6 +1043,8 @@ class AIAgent:
            prefill_messages (List[Dict]): Messages to prepend to conversation history as prefilled context.
                Useful for injecting a few-shot example or priming the model's response style.
                Example: [{"role": "user", "content": "Hi!"}, {"role": "assistant", "content": "Hello!"}]
+            platform (str): The interface platform the user is on (e.g. "cli", "telegram", "discord", "whatsapp").
+                Used to inject platform-specific formatting hints into the system prompt.
        """
        self.model = model
        self.max_iterations = max_iterations
@@ -643,9 +1053,12 @@ class AIAgent:
        self.verbose_logging = verbose_logging
        self.quiet_mode = quiet_mode
        self.ephemeral_system_prompt = ephemeral_system_prompt
+        self.platform = platform  # "cli", "telegram", "discord", "whatsapp", etc.
        self.log_prefix_chars = log_prefix_chars
        self.log_prefix = f"{log_prefix} " if log_prefix else ""
-        self.base_url = base_url or ""  # Store for OpenRouter detection
+        # Store effective base URL for feature detection (prompt caching, reasoning, etc.)
+        # When no base_url is provided, the client defaults to OpenRouter, so reflect that here.
+        self.base_url = base_url or "https://openrouter.ai/api/v1"
        self.tool_progress_callback = tool_progress_callback
        self._last_reported_tool = None  # Track for "new tool" mode
        
@@ -668,6 +1081,14 @@ class AIAgent:
        self.reasoning_config = reasoning_config  # None = use default (xhigh for OpenRouter)
        self.prefill_messages = prefill_messages or []  # Prefilled conversation turns
        
+        # Anthropic prompt caching: auto-enabled for Claude models via OpenRouter.
+        # Reduces input costs by ~75% on multi-turn conversations by caching the
+        # conversation prefix. Uses system_and_3 strategy (4 breakpoints).
+        is_openrouter = "openrouter" in self.base_url.lower()
+        is_claude = "claude" in self.model.lower()
+        self._use_prompt_caching = is_openrouter and is_claude
+        self._cache_ttl = "5m"  # Default 5-minute TTL (1.25x write cost)
+        
        # Configure logging
        if self.verbose_logging:
            logging.basicConfig(
@@ -773,6 +1194,10 @@ class AIAgent:
            prompt_preview = self.ephemeral_system_prompt[:60] + "..." if len(self.ephemeral_system_prompt) > 60 else self.ephemeral_system_prompt
            print(f"🔒 Ephemeral system prompt: '{prompt_preview}' (not saved to trajectories)")
        
+        # Show prompt caching status
+        if self._use_prompt_caching and not self.quiet_mode:
+            print(f"💾 Prompt caching: ENABLED (Claude via OpenRouter, {self._cache_ttl} TTL)")
+        
        # Session logging setup - auto-save conversation trajectories for debugging
        self.session_start = datetime.now()
        if session_id:
@@ -951,10 +1376,6 @@ class AIAgent:
            return f"{face} 🎨 creating '{prompt}'... {time_str}"
        
        # Skills - use large pool for variety
-        elif tool_name == "skills_categories":
-            face = random.choice(self.KAWAII_SKILL)
-            return f"{face} 📚 listing categories... {time_str}"
-        
        elif tool_name == "skills_list":
            category = args.get("category", "skills")
            face = random.choice(self.KAWAII_SKILL)
@@ -965,19 +1386,65 @@ class AIAgent:
            face = random.choice(self.KAWAII_SKILL)
            return f"{face} 📖 loading {name}... {time_str}"
        
+        # File tools
+        elif tool_name == "read_file":
+            path = args.get("path") or "file"  # also covers an explicit null path
+            if len(path) > 30:
+                path = "..." + path[-27:]
+            face = random.choice(self.KAWAII_READ)
+            return f"{face} 📖 reading \"{path}\" {time_str}"
+        
+        elif tool_name == "write_file":
+            path = args.get("path") or "file"  # also covers an explicit null path
+            if len(path) > 30:
+                path = "..." + path[-27:]
+            face = random.choice(self.KAWAII_CREATE)
+            return f"{face} ✍️ writing \"{path}\" {time_str}"
+        
+        elif tool_name == "patch":
+            path = args.get("path", "file")
+            if path and len(path) > 30:
+                path = "..." + path[-27:]
+            face = random.choice(self.KAWAII_CREATE)
+            return f"{face} 🔧 patching \"{path}\" {time_str}"
+        
+        elif tool_name == "search":
+            pattern = args.get("pattern", "")
+            if len(pattern) > 25:
+                pattern = pattern[:22] + "..."
+            face = random.choice(self.KAWAII_SEARCH)
+            return f"{face} 🔎 searching \"{pattern}\" {time_str}"
+        
+        # TTS
+        elif tool_name == "text_to_speech":
+            text = args.get("text", "")
+            if len(text) > 25:
+                text = text[:22] + "..."
+            face = random.choice(self.KAWAII_CREATE)
+            return f"{face} 🔊 speaking \"{text}\" {time_str}"
+        
        # Vision tools
        elif tool_name == "vision_analyze":
+            question = args.get("question", "")
+            if len(question) > 25:
+                question = question[:22] + "..."
            face = random.choice(self.KAWAII_BROWSER)
-            return f"{face} 👁️✨ analyzing image... {time_str}"
+            return f"{face} 👁️✨ analyzing \"{question}\" {time_str}"
        
        # Mixture of agents
        elif tool_name == "mixture_of_agents":
+            prompt = args.get("user_prompt", "")
+            if len(prompt) > 25:
+                prompt = prompt[:22] + "..."
            face = random.choice(self.KAWAII_THINK)
-            return f"{face} 🧠💭 thinking REALLY hard... {time_str}"
+            return f"{face} 🧠💭 deep thinking \"{prompt}\" {time_str}"
        
-        # Default fallback - random generic kawaii
+        # Default fallback - random generic kawaii with primary arg preview
        else:
            face = random.choice(self.KAWAII_GENERIC)
+            preview = _build_tool_preview(tool_name, args)
+            if preview:
+                return f"{face} ⚡ {tool_name}... \"{preview}\" {time_str}"
            return f"{face} ⚡ {tool_name}... {time_str}"
    
    def _has_content_after_think_block(self, content: str) -> bool:
@@ -1446,6 +1913,9 @@ class AIAgent:
        Call this from another thread (e.g., input handler, message receiver)
        to gracefully stop the agent and process a new message.
        
+        Also signals long-running tool executions (e.g. terminal commands)
+        to terminate early, so the agent can respond immediately.
+        
        Args:
            message: Optional new message that triggered the interrupt.
                     If provided, the agent will include this in its response context.
@@ -1462,6 +1932,8 @@ class AIAgent:
        """
        self._interrupt_requested = True
        self._interrupt_message = message
+        # Signal the terminal tool to kill any running subprocess immediately
+        _set_terminal_interrupt(True)
        if not self.quiet_mode:
            print(f"\n⚡ Interrupt requested" + (f": '{message[:40]}...'" if message and len(message) > 40 else f": '{message}'" if message else ""))
    
@@ -1469,6 +1941,7 @@ class AIAgent:
        """Clear any pending interrupt request."""
        self._interrupt_requested = False
        self._interrupt_message = None
+        _set_terminal_interrupt(False)
    
    @property
    def is_interrupted(self) -> bool:
@@ -1521,20 +1994,46 @@ class AIAgent:
        if not self.quiet_mode:
            print(f"💬 Starting conversation: '{user_message[:60]}{'...' if len(user_message) > 60 else ''}'")
        
-        # Determine which system prompt to use for API calls (ephemeral)
-        # Priority: explicit system_message > ephemeral_system_prompt > None
-        base_system_prompt = system_message if system_message is not None else self.ephemeral_system_prompt
-        
-        # Auto-include skills guidance if skills tools are available
-        has_skills_tools = any(name in self.valid_tool_names for name in ['skills_list', 'skills_categories', 'skill_view'])
-        if has_skills_tools:
-            if base_system_prompt:
-                active_system_prompt = f"{base_system_prompt}\n\n{SKILLS_SYSTEM_PROMPT}"
-            else:
-                active_system_prompt = SKILLS_SYSTEM_PROMPT
-        else:
-            active_system_prompt = base_system_prompt
-        
+        # ── Build the full system prompt ──
+        # Layers (in order):
+        #   1. Default agent identity (always present)
+        #   2. User / gateway system prompt (if provided)
+        #   3. Skills guidance (if skills tools are loaded)
+        #   4. Context files (SOUL.md, AGENTS.md, .cursorrules)
+        #   5. Current date & time
+        #   6. Platform-specific formatting hint
+        prompt_parts = [DEFAULT_AGENT_IDENTITY]
+
+        # Layer in the caller-supplied system prompt (explicit > ephemeral).
+        caller_prompt = system_message if system_message is not None else self.ephemeral_system_prompt
+        if caller_prompt:
+            prompt_parts.append(caller_prompt)
+
+        # Auto-include skills guidance if skills tools are available.
+        has_skills_tools = any(name in self.valid_tool_names for name in ['skills_list', 'skill_view'])
+        skills_prompt = build_skills_system_prompt() if has_skills_tools else ""
+        if skills_prompt:
+            prompt_parts.append(skills_prompt)
+
+        # Auto-include context files (SOUL.md, AGENTS.md, .cursorrules).
+        context_files_prompt = build_context_files_prompt()
+        if context_files_prompt:
+            prompt_parts.append(context_files_prompt)
+
+        # Current local date and time so the model is never confused about
+        # what day/time it is (LLM training cutoffs can otherwise mislead it).
+        now = datetime.now()
+        prompt_parts.append(
+            f"Current local date and time: {now.strftime('%A, %B %d, %Y %I:%M %p')}"
+        )
+
+        # Platform-specific formatting hint (no markdown on WhatsApp, etc.).
+        platform_key = (self.platform or "").lower().strip()
+        if platform_key in PLATFORM_HINTS:
+            prompt_parts.append(PLATFORM_HINTS[platform_key])
+
+        active_system_prompt = "\n\n".join(prompt_parts)
+
        # Main conversation loop
        api_call_count = 0
        final_response = None
@@ -1582,6 +2081,13 @@ class AIAgent:
                # Insert system message at the beginning
                api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages
            
+            # Apply Anthropic prompt caching for Claude models via OpenRouter.
+            # Auto-detected: if model name contains "claude" and base_url is OpenRouter,
+            # inject cache_control breakpoints (system + last 3 messages) to reduce
+            # input token costs by ~75% on multi-turn conversations.
+            if self._use_prompt_caching:
+                api_messages = apply_anthropic_cache_control(api_messages, cache_ttl=self._cache_ttl)
+            
            # Calculate approximate request size for logging
            total_chars = sum(len(str(msg)) for msg in api_messages)
            approx_tokens = total_chars // 4  # Rough estimate: 4 chars per token
@@ -1811,6 +2317,16 @@ class AIAgent:
                        
                        if self.verbose_logging:
                            logging.debug(f"Token usage: prompt={usage_dict['prompt_tokens']:,}, completion={usage_dict['completion_tokens']:,}, total={usage_dict['total_tokens']:,}")
+                        
+                        # Log cache hit stats when prompt caching is active
+                        if self._use_prompt_caching:
+                            details = getattr(response.usage, 'prompt_tokens_details', None)
+                            cached = (getattr(details, 'cached_tokens', 0) or 0) if details else 0
+                            written = (getattr(details, 'cache_write_tokens', 0) or 0) if details else 0
+                            prompt = usage_dict["prompt_tokens"]
+                            hit_pct = (cached / prompt * 100) if prompt > 0 else 0
+                            if not self.quiet_mode:
+                                print(f"{self.log_prefix}   💾 Cache: {cached:,}/{prompt:,} tokens ({hit_pct:.0f}% hit, {written:,} written)")
                    
                    break  # Success, exit retry loop

@@ -2124,12 +2640,8 @@ class AIAgent:
                        # Fire progress callback if registered (for messaging platforms)
                        if self.tool_progress_callback:
                            try:
-                                # Build preview for terminal commands
-                                if function_name == "terminal":
-                                    cmd = function_args.get("command", "")
-                                    preview = cmd[:50] + "..." if len(cmd) > 50 else cmd
-                                else:
-                                    preview = None
+                                # Build a short preview of the primary argument
+                                preview = _build_tool_preview(function_name, function_args)
                                self.tool_progress_callback(function_name, preview)
                            except Exception as cb_err:
                                logging.debug(f"Tool progress callback error: {cb_err}")
@@ -2151,7 +2663,6 @@ class AIAgent:
                                'image_generate': ('sparkle', ['🎨', '✨', '🖼️', '🌟']),
                                'skill_view': ('star', ['📚', '📖', '🎓', '✨']),
                                'skills_list': ('pulse', ['📋', '📝', '📑', '📜']),
-                                'skills_categories': ('pulse', ['📂', '🗂️', '📁', '🏷️']),
                                'moa_query': ('brain', ['🧠', '💭', '🤔', '💡']),
                                'analyze_image': ('sparkle', ['👁️', '🔍', '📷', '✨']),
                            }
@@ -2189,6 +2700,21 @@ class AIAgent:
                            response_preview = function_result[:self.log_prefix_chars] + "..." if len(function_result) > self.log_prefix_chars else function_result
                            print(f"  ✅ Tool {i} completed in {tool_duration:.2f}s - {response_preview}")
                        
+                        # Check for interrupt between tool calls - skip remaining
+                        # tools so the agent can respond to the user immediately
+                        if self._interrupt_requested and i < len(assistant_message.tool_calls):
+                            remaining = len(assistant_message.tool_calls) - i
+                            print(f"{self.log_prefix}⚡ Interrupt: skipping {remaining} remaining tool call(s)")
+                            # Add placeholder results for skipped tool calls so the
+                            # message sequence stays valid (assistant tool_calls need matching tool results)
+                            for skipped_tc in assistant_message.tool_calls[i:]:
+                                messages.append({
+                                    "role": "tool",
+                                    "content": "[Tool execution skipped - user sent a new message]",
+                                    "tool_call_id": skipped_tc.id
+                                })
+                            break
+                        
                        # Delay between tool calls
                        if self.tool_delay > 0 and i < len(assistant_message.tool_calls):
                            time.sleep(self.tool_delay)
@@ -262,6 +262,25 @@ function Test-Ripgrep {
    return $true  # Don't fail - ripgrep is optional
 }

+function Test-Ffmpeg {
+    Write-Info "Checking ffmpeg (optional, for TTS voice messages)..."
+    
+    if (Get-Command ffmpeg -ErrorAction SilentlyContinue) {
+        $version = ffmpeg -version 2>&1 | Select-Object -First 1
+        Write-Success "ffmpeg found: $version"
+        $script:HasFfmpeg = $true
+        return $true
+    }
+    
+    Write-Warn "ffmpeg not found (TTS voice bubbles on Telegram will send as audio files instead)"
+    Write-Info "  Install with: winget install ffmpeg"
+    Write-Info "  Or: choco install ffmpeg"
+    Write-Info "  Or download from: https://ffmpeg.org/download.html"
+    
+    $script:HasFfmpeg = $false
+    return $true  # Don't fail - ffmpeg is optional
+}
+
 # ============================================================================
 # Installation
 # ============================================================================
@@ -567,6 +586,7 @@ function Main {
    if (-not (Test-Git)) { exit 1 }
    Test-Node      # Optional, doesn't fail
    Test-Ripgrep   # Optional, doesn't fail
+    Test-Ffmpeg    # Optional, doesn't fail
    
    Install-Repository
    Install-Venv
@@ -413,6 +413,45 @@ check_ripgrep() {
    # Don't exit - ripgrep is optional (grep fallback exists)
 }

+check_ffmpeg() {
+    log_info "Checking ffmpeg (optional, for TTS voice messages)..."
+    
+    if command -v ffmpeg &> /dev/null; then
+        local ffmpeg_version=$(ffmpeg -version 2>/dev/null | head -1 | awk '{print $3}')
+        log_success "ffmpeg found: $ffmpeg_version"
+        HAS_FFMPEG=true
+        return
+    fi
+    
+    log_warn "ffmpeg not found (TTS voice bubbles on Telegram will send as audio files instead)"
+    log_info "To install ffmpeg (optional):"
+    
+    case "$OS" in
+        linux)
+            case "$DISTRO" in
+                ubuntu|debian)
+                    log_info "  sudo apt install ffmpeg"
+                    ;;
+                fedora)
+                    log_info "  sudo dnf install ffmpeg"
+                    ;;
+                arch)
+                    log_info "  sudo pacman -S ffmpeg"
+                    ;;
+                *)
+                    log_info "  https://ffmpeg.org/download.html"
+                    ;;
+            esac
+            ;;
+        macos)
+            log_info "  brew install ffmpeg"
+            ;;
+    esac
+    
+    HAS_FFMPEG=false
+    # Don't exit - ffmpeg is optional
+}
+
 # ============================================================================
 # Installation
 # ============================================================================
@@ -707,6 +746,7 @@ main() {
    check_git
    check_node
    check_ripgrep
+    check_ffmpeg
    
    clone_repo
    setup_venv
@@ -0,0 +1,34 @@
+#!/bin/bash
+# Stop running Modal apps: the swe-rex sandbox app by default, or every app with --all.
+#
+# Usage:
+#   bash scripts/kill_modal.sh          # Stop swe-rex (the sandbox app)
+#   bash scripts/kill_modal.sh --all    # Stop ALL Modal apps
+
+set -uo pipefail
+
+echo "Fetching Modal app list..."
+APP_LIST=$(modal app list 2>/dev/null)
+
+if [[ "${1:-}" == "--all" ]]; then
+    echo "Stopping ALL Modal apps..."
+    echo "$APP_LIST" | grep -oE 'ap-[A-Za-z0-9]+' | sort -u | while read -r app_id; do
+        echo "  Stopping $app_id"
+        modal app stop "$app_id" 2>/dev/null || true
+    done
+else
+    echo "Stopping swe-rex sandboxes..."
+    APPS=$(echo "$APP_LIST" | grep 'swe-rex' | grep -oE 'ap-[A-Za-z0-9]+' | sort -u || true)
+    if [[ -z "$APPS" ]]; then
+        echo "  No swe-rex apps found."
+    else
+        echo "$APPS" | while read -r app_id; do
+            echo "  Stopping $app_id"
+            modal app stop "$app_id" 2>/dev/null || true
+        done
+    fi
+fi
+
+echo ""
+echo "Current swe-rex status:"
+modal app list 2>/dev/null | grep -E 'State|swe-rex' || echo "  (none)"
@@ -0,0 +1,3 @@
+---
+description: Diagram creation skills for generating visual diagrams, flowcharts, architecture diagrams, and illustrations using tools like Excalidraw.
+---
@@ -0,0 +1,191 @@
+---
+name: excalidraw
+description: Create hand-drawn style diagrams using Excalidraw JSON format. Generate .excalidraw files for architecture diagrams, flowcharts, sequence diagrams, concept maps, and more. Files can be opened at excalidraw.com or uploaded for shareable links.
+version: 1.0.0
+author: Hermes Agent
+license: MIT
+tags: [Excalidraw, Diagrams, Flowcharts, Architecture, Visualization, JSON]
+dependencies: []
+related_skills: []
+---
+
+# Excalidraw Diagram Skill
+
+Create diagrams by writing standard Excalidraw element JSON and saving it as a `.excalidraw` file. These files can be dragged and dropped onto [excalidraw.com](https://excalidraw.com) for viewing and editing. No accounts, no API keys, no rendering libraries -- just JSON.
+
+## Workflow
+
+1. **Load this skill** (you already did)
+2. **Write the elements JSON** -- an array of Excalidraw element objects
+3. **Save the file** using `write_file` to create a `.excalidraw` file
+4. **Optionally upload** for a shareable link using `scripts/upload.py` via `terminal`
+
+### Saving a Diagram
+
+Wrap your elements array in the standard `.excalidraw` envelope and save with `write_file`:
+
+```json
+{
+  "type": "excalidraw",
+  "version": 2,
+  "source": "hermes-agent",
+  "elements": [ ...your elements array here... ],
+  "appState": {
+    "viewBackgroundColor": "#ffffff"
+  }
+}
+```
+
+Save to any path, e.g. `~/diagrams/my_diagram.excalidraw`.
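
When the elements array is produced by a script rather than typed into `write_file`, the same envelope can be built programmatically. A minimal sketch, assuming a hypothetical helper named `save_excalidraw` (it is not part of this skill) and the envelope fields shown above:

```python
import json
from pathlib import Path


def save_excalidraw(elements, path):
    """Wrap an elements array in the .excalidraw envelope and write it to disk.

    Hypothetical helper for illustration; the envelope fields mirror the
    JSON shown in this skill.
    """
    document = {
        "type": "excalidraw",
        "version": 2,
        "source": "hermes-agent",
        "elements": elements,
        "appState": {"viewBackgroundColor": "#ffffff"},
    }
    target = Path(path).expanduser()
    # Create the parent directory (e.g. ~/diagrams) if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(document, indent=2), encoding="utf-8")
    return target
```

Calling `save_excalidraw(elements, "~/diagrams/my_diagram.excalidraw")` returns the expanded path that was written.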
+
+### Uploading for a Shareable Link
+
+Run the upload script (located in this skill's `scripts/` directory) via terminal:
+
+```bash
+python skills/diagramming/excalidraw/scripts/upload.py ~/diagrams/my_diagram.excalidraw
+```
+
+This uploads to excalidraw.com (no account needed) and prints a shareable URL. Requires the `cryptography` pip package (`pip install cryptography`).
+
+---
+
+## Element Format Reference
+
+### Required Fields (all elements)
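+`type`, `id` (unique string), `x`, `y`, `width`, `height` (the standalone text examples in this skill omit `width`/`height` and set `autoResize: true` instead)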
+
+### Defaults (skip these -- they're applied automatically)
+- `strokeColor`: `"#1e1e1e"`
+- `backgroundColor`: `"transparent"`
+- `fillStyle`: `"solid"`
+- `strokeWidth`: `2`
+- `roughness`: `1` (hand-drawn look)
+- `opacity`: `100`
+
+Canvas background is white.
+
+### Element Types
+
+**Rectangle**:
+```json
+{ "type": "rectangle", "id": "r1", "x": 100, "y": 100, "width": 200, "height": 100 }
+```
+- `roundness: { "type": 3 }` for rounded corners
+- `backgroundColor: "#a5d8ff"`, `fillStyle: "solid"` for filled
+
+**Ellipse**:
+```json
+{ "type": "ellipse", "id": "e1", "x": 100, "y": 100, "width": 150, "height": 150 }
+```
+
+**Diamond**:
+```json
+{ "type": "diamond", "id": "d1", "x": 100, "y": 100, "width": 150, "height": 150 }
+```
+
+**Labeled shape (container binding)** -- create a text element bound to the shape:
+
+> **WARNING:** Do NOT use `"label": { "text": "..." }` on shapes. This is NOT a valid
+> Excalidraw property and will be silently ignored, producing blank shapes. You MUST
+> use the container binding approach below.
+
+The shape needs `boundElements` listing the text, and the text needs `containerId` pointing back:
+```json
+{ "type": "rectangle", "id": "r1", "x": 100, "y": 100, "width": 200, "height": 80,
+  "roundness": { "type": 3 }, "backgroundColor": "#a5d8ff", "fillStyle": "solid",
+  "boundElements": [{ "id": "t_r1", "type": "text" }] },
+{ "type": "text", "id": "t_r1", "x": 105, "y": 110, "width": 190, "height": 25,
+  "text": "Hello", "fontSize": 20, "fontFamily": 1, "strokeColor": "#1e1e1e",
+  "textAlign": "center", "verticalAlign": "middle",
+  "containerId": "r1", "originalText": "Hello", "autoResize": true }
+```
+- Works on rectangle, ellipse, diamond
+- Text is auto-centered by Excalidraw when `containerId` is set
+- The text `x`/`y`/`width`/`height` are approximate -- Excalidraw recalculates them on load
+- `originalText` should match `text`
+- Always include `fontFamily: 1` (Virgil/hand-drawn font)
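
The two-way binding is easy to get half right (for example, setting `containerId` but forgetting `boundElements`). A small sketch of a generator that always emits both halves together; `labeled_rectangle` and its default arguments are illustrative assumptions, not an Excalidraw or Hermes API:

```python
def labeled_rectangle(shape_id, x, y, width, height, text,
                      fill="#a5d8ff", font_size=20):
    """Return [shape, label] with container binding set in both directions."""
    text_id = f"t_{shape_id}"
    shape = {
        "type": "rectangle", "id": shape_id,
        "x": x, "y": y, "width": width, "height": height,
        "roundness": {"type": 3},
        "backgroundColor": fill, "fillStyle": "solid",
        # The shape lists the text element it contains...
        "boundElements": [{"id": text_id, "type": "text"}],
    }
    label = {
        "type": "text", "id": text_id,
        # Approximate geometry only; Excalidraw recalculates it on load.
        "x": x + 5, "y": y + (height - font_size) / 2,
        "width": width - 10, "height": font_size + 5,
        "text": text, "originalText": text,
        "fontSize": font_size, "fontFamily": 1,
        "strokeColor": "#1e1e1e",
        "textAlign": "center", "verticalAlign": "middle",
        # ...and the text points back at its container.
        "containerId": shape_id, "autoResize": True,
    }
    return [shape, label]
```

Returning the pair as a list keeps the label immediately after its shape when the result is appended to the elements array with `elements.extend(...)`.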
+
+**Labeled arrow** -- same container binding approach:
+```json
+{ "type": "arrow", "id": "a1", "x": 300, "y": 150, "width": 200, "height": 0,
+  "points": [[0,0],[200,0]], "endArrowhead": "arrow",
+  "boundElements": [{ "id": "t_a1", "type": "text" }] },
+{ "type": "text", "id": "t_a1", "x": 370, "y": 130, "width": 60, "height": 20,
+  "text": "connects", "fontSize": 16, "fontFamily": 1, "strokeColor": "#1e1e1e",
+  "textAlign": "center", "verticalAlign": "middle",
+  "containerId": "a1", "originalText": "connects", "autoResize": true }
+```
+
+**Standalone text** (titles and annotations only -- no container):
+```json
+{ "type": "text", "id": "t1", "x": 150, "y": 138, "text": "Hello", "fontSize": 20,
+  "fontFamily": 1, "strokeColor": "#1e1e1e", "originalText": "Hello", "autoResize": true }
+```
+- `x` is the LEFT edge. To center at position `cx`: `x = cx - (text.length * fontSize * 0.5) / 2`
+- Do NOT rely on `textAlign` or `width` for positioning
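
The centering rule above is plain arithmetic, so it can be sketched as a short function. The name `centered_text_x` is hypothetical, and the `0.5` factor is the same rough per-character width estimate used in the formula, not a measured font metric:

```python
def centered_text_x(cx, text, font_size):
    """Left-edge x that roughly centers `text` on the horizontal position cx."""
    # Estimated rendered width: about half the font size per character.
    estimated_width = len(text) * font_size * 0.5
    return cx - estimated_width / 2


def standalone_text(text_id, cx, y, text, font_size=20):
    """Build a standalone (unbound) text element centered on cx."""
    return {
        "type": "text", "id": text_id,
        "x": centered_text_x(cx, text, font_size), "y": y,
        "text": text, "originalText": text,
        "fontSize": font_size, "fontFamily": 1,
        "strokeColor": "#1e1e1e", "autoResize": True,
    }
```

For `"Hello"` at `fontSize` 20 the estimated width is 5 * 20 * 0.5 = 50, so centering on `cx = 400` gives `x = 375`.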
+
+**Arrow**:
+```json
+{ "type": "arrow", "id": "a1", "x": 300, "y": 150, "width": 200, "height": 0,
+  "points": [[0,0],[200,0]], "endArrowhead": "arrow" }
+```
+- `points`: `[dx, dy]` offsets from element `x`, `y`
+- `endArrowhead`: `null` | `"arrow"` | `"bar"` | `"dot"` | `"triangle"`
+- `strokeStyle`: `"solid"` (default) | `"dashed"` | `"dotted"`
+
+### Arrow Bindings (connect arrows to shapes)
+
+```json
+{
+  "type": "arrow", "id": "a1", "x": 300, "y": 150, "width": 150, "height": 0,
+  "points": [[0,0],[150,0]], "endArrowhead": "arrow",
+  "startBinding": { "elementId": "r1", "fixedPoint": [1, 0.5] },
+  "endBinding": { "elementId": "r2", "fixedPoint": [0, 0.5] }
+}
+```
+
+`fixedPoint` coordinates: `top=[0.5,0]`, `bottom=[0.5,1]`, `left=[0,0.5]`, `right=[1,0.5]`
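
Arrow coordinates have to agree with the shapes they bind to. A sketch that derives them from the two shapes, assuming a hypothetical helper `connect_right_to_left` and following the conventions of this skill's examples (signed `width`/`height` matching the last point, and the arrow listed in each shape's `boundElements`):

```python
# The four anchor points listed above, as [x, y] fractions of the shape's box.
FIXED_POINTS = {
    "top": [0.5, 0], "bottom": [0.5, 1],
    "left": [0, 0.5], "right": [1, 0.5],
}


def connect_right_to_left(arrow_id, source, target):
    """Arrow from the right edge of `source` to the left edge of `target`."""
    start_x = source["x"] + source["width"]
    start_y = source["y"] + source["height"] / 2
    dx = target["x"] - start_x
    dy = (target["y"] + target["height"] / 2) - start_y
    arrow = {
        "type": "arrow", "id": arrow_id,
        "x": start_x, "y": start_y,
        "width": dx, "height": dy,
        # Points are offsets from the arrow's own x/y.
        "points": [[0, 0], [dx, dy]],
        "endArrowhead": "arrow",
        "startBinding": {"elementId": source["id"],
                         "fixedPoint": FIXED_POINTS["right"]},
        "endBinding": {"elementId": target["id"],
                       "fixedPoint": FIXED_POINTS["left"]},
    }
    # Register the arrow on both shapes (mutates the dicts passed in).
    for shape in (source, target):
        shape.setdefault("boundElements", []).append(
            {"id": arrow_id, "type": "arrow"})
    return arrow
```

With a 200x100 box at (100, 100) and another at (450, 100), this yields an arrow starting at (300, 150) with points `[[0,0],[150,0]]`.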
+
+### Drawing Order (z-order)
+- Array order = z-order (first = back, last = front)
+- Emit progressively: background zones → shape → its bound text → its arrows → next shape
+- BAD: all rectangles, then all texts, then all arrows
+- GOOD: bg_zone → shape1 → text_for_shape1 → arrow1 → arrow_label_text → shape2 → text_for_shape2 → ...
+- Always place the bound text element immediately after its container shape
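
The last rule can be checked mechanically before saving. A minimal sketch, assuming a hypothetical checker named `misplaced_bound_text` that flags any bound text element not placed directly after its container:

```python
def misplaced_bound_text(elements):
    """Return ids of bound text elements that do not directly follow their container."""
    problems = []
    for index, element in enumerate(elements):
        container_id = element.get("containerId")
        # Only text elements with a containerId are subject to the rule.
        if element.get("type") != "text" or not container_id:
            continue
        previous = elements[index - 1] if index > 0 else None
        if previous is None or previous.get("id") != container_id:
            problems.append(element["id"])
    return problems
```

An empty result means every label sits right after its shape or arrow; standalone titles and annotations are ignored because they have no `containerId`.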
+
+### Sizing Guidelines
+
+**Font sizes:**
+- Minimum `fontSize`: **16** for body text, labels, descriptions
+- Minimum `fontSize`: **20** for titles and headings
+- Minimum `fontSize`: **14** for secondary annotations only (sparingly)
+- NEVER use `fontSize` below 14
+
+**Element sizes:**
+- Minimum shape size: 120x60 for labeled rectangles/ellipses
+- Leave gaps of at least 20-30px between elements
+- Prefer fewer, larger elements over many tiny ones
+
+### Color Palette
+
+See `references/colors.md` for full color tables. Quick reference:
+
+| Use | Fill Color | Hex |
+|-----|-----------|-----|
+| Primary / Input | Light Blue | `#a5d8ff` |
+| Success / Output | Light Green | `#b2f2bb` |
+| Warning / External | Light Orange | `#ffd8a8` |
+| Processing / Special | Light Purple | `#d0bfff` |
+| Error / Critical | Light Red | `#ffc9c9` |
+| Notes / Decisions | Light Yellow | `#fff3bf` |
+| Storage / Data | Light Teal | `#c3fae8` |
+
+### Tips
+- Use the color palette consistently across the diagram
+- **Text contrast is CRITICAL** -- never use light gray on white backgrounds. Minimum text color on white: `#757575`
+- Do NOT use emoji in text -- they don't render in Excalidraw's font
+- For dark mode diagrams, see `references/dark-mode.md`
+- For larger examples, see `references/examples.md`
+
+
@@ -0,0 +1,44 @@
+# Excalidraw Color Palette
+
+Use these colors consistently across diagrams.
+
+## Primary Colors (for strokes, arrows, and accents)
+
+| Name | Hex | Use |
+|------|-----|-----|
+| Blue | `#4a9eed` | Primary actions, links, data series 1 |
+| Amber | `#f59e0b` | Warnings, highlights, data series 2 |
+| Green | `#22c55e` | Success, positive, data series 3 |
+| Red | `#ef4444` | Errors, negative, data series 4 |
+| Purple | `#8b5cf6` | Accents, special items, data series 5 |
+| Pink | `#ec4899` | Decorative, data series 6 |
+| Cyan | `#06b6d4` | Info, secondary, data series 7 |
+| Lime | `#84cc16` | Extra, data series 8 |
+
+## Pastel Fills (for shape backgrounds)
+
+| Color | Hex | Good For |
+|-------|-----|----------|
+| Light Blue | `#a5d8ff` | Input, sources, primary nodes |
+| Light Green | `#b2f2bb` | Success, output, completed |
+| Light Orange | `#ffd8a8` | Warning, pending, external |
+| Light Purple | `#d0bfff` | Processing, middleware, special |
+| Light Red | `#ffc9c9` | Error, critical, alerts |
+| Light Yellow | `#fff3bf` | Notes, decisions, planning |
+| Light Teal | `#c3fae8` | Storage, data, memory |
+| Light Pink | `#eebefa` | Analytics, metrics |
+
+## Background Zones (use with opacity: 30-35 for layered diagrams)
+
+| Color | Hex | Good For |
+|-------|-----|----------|
+| Blue zone | `#dbe4ff` | UI / frontend layer |
+| Purple zone | `#e5dbff` | Logic / agent layer |
+| Green zone | `#d3f9d8` | Data / tool layer |
+
+## Text Contrast Rules
+
+- **On white backgrounds**: minimum text color is `#757575`. Default `#1e1e1e` is best.
+- **Colored text on light fills**: use dark variants (`#15803d` not `#22c55e`, `#2563eb` not `#4a9eed`)
+- **White text**: only on dark backgrounds (`#9a5030` not `#c4795b`)
+- **Never**: light gray (`#b0b0b0`, `#999`) on white -- unreadable
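
One way to sanity-check a text color against these rules is the WCAG 2.x contrast-ratio formula. This is an assumption about how the thresholds could be verified, not something this palette states; the function names are illustrative:

```python
def _linear(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of '#rgb' or '#rrggbb', from 0 (black) to 1 (white)."""
    h = hex_color.lstrip("#")
    if len(h) == 3:  # expand shorthand such as #999
        h = "".join(ch * 2 for ch in h)
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground, background):
    """WCAG contrast ratio between two colors, from 1 to 21."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)
```

Under this formula `#757575` on white comes out above 4.5:1, while `#b0b0b0` and `#999` on white fall below 3:1, which matches the rules above.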
@@ -0,0 +1,68 @@
+# Excalidraw Dark Mode Diagrams
+
+To create a dark-themed diagram, use a massive dark background rectangle as the **first element** in the array. Make it large enough to cover any viewport:
+
+```json
+{
+  "type": "rectangle", "id": "darkbg",
+  "x": -4000, "y": -3000, "width": 10000, "height": 7500,
+  "backgroundColor": "#1e1e2e", "fillStyle": "solid",
+  "strokeColor": "transparent", "strokeWidth": 0
+}
+```
+
+Then use the following color palettes for elements on the dark background.
+
+## Text Colors (on dark)
+
+| Color | Hex | Use |
+|-------|-----|-----|
+| White | `#e5e5e5` | Primary text, titles |
+| Muted | `#a0a0a0` | Secondary text, annotations |
+| NEVER | `#555` or darker | Invisible on dark bg! |
+
+## Shape Fills (on dark)
+
+| Color | Hex | Good For |
+|-------|-----|----------|
+| Dark Blue | `#1e3a5f` | Primary nodes |
+| Dark Green | `#1a4d2e` | Success, output |
+| Dark Purple | `#2d1b69` | Processing, special |
+| Dark Orange | `#5c3d1a` | Warning, pending |
+| Dark Red | `#5c1a1a` | Error, critical |
+| Dark Teal | `#1a4d4d` | Storage, data |
+
+## Stroke and Arrow Colors (on dark)
+
+Use the standard Primary Colors from the main color palette -- they're bright enough on dark backgrounds:
+- Blue `#4a9eed`, Amber `#f59e0b`, Green `#22c55e`, Red `#ef4444`, Purple `#8b5cf6`
+
+For subtle shape borders, use `#555555`.
+
+## Example: Dark mode labeled rectangle
+
+Use container binding (NOT the `"label"` property, which doesn't work). On dark backgrounds, set text `strokeColor` to `"#e5e5e5"` so it's visible:
+
+```json
+[
+  {
+    "type": "rectangle", "id": "r1",
+    "x": 100, "y": 100, "width": 200, "height": 80,
+    "backgroundColor": "#1e3a5f", "fillStyle": "solid",
+    "strokeColor": "#4a9eed", "strokeWidth": 2,
+    "roundness": { "type": 3 },
+    "boundElements": [{ "id": "t_r1", "type": "text" }]
+  },
+  {
+    "type": "text", "id": "t_r1",
+    "x": 105, "y": 120, "width": 190, "height": 25,
+    "text": "Dark Node", "fontSize": 20, "fontFamily": 1,
+    "strokeColor": "#e5e5e5",
+    "textAlign": "center", "verticalAlign": "middle",
+    "containerId": "r1", "originalText": "Dark Node", "autoResize": true
+  }
+]
+```
+
+Note: For standalone text elements on dark backgrounds, always set `"strokeColor": "#e5e5e5"` explicitly. The default `#1e1e1e` is invisible on dark.
+
@@ -0,0 +1,141 @@
+# Excalidraw Diagram Examples
+
+Complete, copy-pasteable examples. Wrap each in the `.excalidraw` envelope before saving:
+
+```json
+{
+  "type": "excalidraw",
+  "version": 2,
+  "source": "hermes-agent",
+  "elements": [ ...elements from examples below... ],
+  "appState": { "viewBackgroundColor": "#ffffff" }
+}
+```
+
+> **IMPORTANT:** All text labels on shapes and arrows use container binding (`containerId` + `boundElements`).
+> Do NOT use the non-existent `"label"` property -- it will be silently ignored, producing blank shapes.
+
+---
+
+## Example 1: Two Connected Labeled Boxes
+
+A minimal flowchart with two boxes and an arrow between them.
+
+```json
+[
+  { "type": "text", "id": "title", "x": 280, "y": 30, "text": "Simple Flow", "fontSize": 28, "fontFamily": 1, "strokeColor": "#1e1e1e", "originalText": "Simple Flow", "autoResize": true },
+  { "type": "rectangle", "id": "b1", "x": 100, "y": 100, "width": 200, "height": 100, "roundness": { "type": 3 }, "backgroundColor": "#a5d8ff", "fillStyle": "solid", "boundElements": [{ "id": "t_b1", "type": "text" }, { "id": "a1", "type": "arrow" }] },
+  { "type": "text", "id": "t_b1", "x": 105, "y": 130, "width": 190, "height": 25, "text": "Start", "fontSize": 20, "fontFamily": 1, "strokeColor": "#1e1e1e", "textAlign": "center", "verticalAlign": "middle", "containerId": "b1", "originalText": "Start", "autoResize": true },
+  { "type": "rectangle", "id": "b2", "x": 450, "y": 100, "width": 200, "height": 100, "roundness": { "type": 3 }, "backgroundColor": "#b2f2bb", "fillStyle": "solid", "boundElements": [{ "id": "t_b2", "type": "text" }, { "id": "a1", "type": "arrow" }] },
+  { "type": "text", "id": "t_b2", "x": 455, "y": 130, "width": 190, "height": 25, "text": "End", "fontSize": 20, "fontFamily": 1, "strokeColor": "#1e1e1e", "textAlign": "center", "verticalAlign": "middle", "containerId": "b2", "originalText": "End", "autoResize": true },
+  { "type": "arrow", "id": "a1", "x": 300, "y": 150, "width": 150, "height": 0, "points": [[0,0],[150,0]], "endArrowhead": "arrow", "startBinding": { "elementId": "b1", "fixedPoint": [1, 0.5] }, "endBinding": { "elementId": "b2", "fixedPoint": [0, 0.5] } }
+]
+```
+
+---
+
+## Example 2: Photosynthesis Process Diagram
+
+A larger diagram with background zones, multiple nodes, and directional arrows showing inputs/outputs.
+
+```json
+[
+  {"type":"text","id":"ti","x":280,"y":10,"text":"Photosynthesis","fontSize":28,"fontFamily":1,"strokeColor":"#1e1e1e","originalText":"Photosynthesis","autoResize":true},
+  {"type":"text","id":"fo","x":245,"y":48,"text":"6CO2 + 6H2O --> C6H12O6 + 6O2","fontSize":16,"fontFamily":1,"strokeColor":"#757575","originalText":"6CO2 + 6H2O --> C6H12O6 + 6O2","autoResize":true},
+  {"type":"rectangle","id":"lf","x":150,"y":90,"width":520,"height":380,"backgroundColor":"#d3f9d8","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#22c55e","strokeWidth":1,"opacity":35},
+  {"type":"text","id":"lfl","x":170,"y":96,"text":"Inside the Leaf","fontSize":16,"fontFamily":1,"strokeColor":"#15803d","originalText":"Inside the Leaf","autoResize":true},
+
+  {"type":"rectangle","id":"lr","x":190,"y":190,"width":160,"height":70,"backgroundColor":"#fff3bf","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#f59e0b","boundElements":[{"id":"t_lr","type":"text"},{"id":"a1","type":"arrow"},{"id":"a2","type":"arrow"},{"id":"a3","type":"arrow"},{"id":"a5","type":"arrow"}]},
+  {"type":"text","id":"t_lr","x":195,"y":205,"width":150,"height":20,"text":"Light Reactions","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"lr","originalText":"Light Reactions","autoResize":true},
+
+  {"type":"arrow","id":"a1","x":350,"y":225,"width":120,"height":0,"points":[[0,0],[120,0]],"strokeColor":"#1e1e1e","strokeWidth":2,"endArrowhead":"arrow","boundElements":[{"id":"t_a1","type":"text"}]},
+  {"type":"text","id":"t_a1","x":390,"y":205,"width":40,"height":20,"text":"ATP","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"a1","originalText":"ATP","autoResize":true},
+
+  {"type":"rectangle","id":"cc","x":470,"y":190,"width":160,"height":70,"backgroundColor":"#d0bfff","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#8b5cf6","boundElements":[{"id":"t_cc","type":"text"},{"id":"a1","type":"arrow"},{"id":"a4","type":"arrow"},{"id":"a6","type":"arrow"}]},
+  {"type":"text","id":"t_cc","x":475,"y":205,"width":150,"height":20,"text":"Calvin Cycle","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"cc","originalText":"Calvin Cycle","autoResize":true},
+
+  {"type":"rectangle","id":"sl","x":10,"y":200,"width":120,"height":50,"backgroundColor":"#fff3bf","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#f59e0b","boundElements":[{"id":"t_sl","type":"text"},{"id":"a2","type":"arrow"}]},
+  {"type":"text","id":"t_sl","x":15,"y":210,"width":110,"height":20,"text":"Sunlight","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"sl","originalText":"Sunlight","autoResize":true},
+
+  {"type":"arrow","id":"a2","x":130,"y":225,"width":60,"height":0,"points":[[0,0],[60,0]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"rectangle","id":"wa","x":200,"y":360,"width":140,"height":50,"backgroundColor":"#a5d8ff","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#4a9eed","boundElements":[{"id":"t_wa","type":"text"},{"id":"a3","type":"arrow"}]},
+  {"type":"text","id":"t_wa","x":205,"y":370,"width":130,"height":20,"text":"Water (H2O)","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"wa","originalText":"Water (H2O)","autoResize":true},
+
+  {"type":"arrow","id":"a3","x":270,"y":360,"width":0,"height":-100,"points":[[0,0],[0,-100]],"strokeColor":"#4a9eed","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"rectangle","id":"co","x":480,"y":360,"width":130,"height":50,"backgroundColor":"#ffd8a8","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#f59e0b","boundElements":[{"id":"t_co","type":"text"},{"id":"a4","type":"arrow"}]},
+  {"type":"text","id":"t_co","x":485,"y":370,"width":120,"height":20,"text":"CO2","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"co","originalText":"CO2","autoResize":true},
+
+  {"type":"arrow","id":"a4","x":545,"y":360,"width":0,"height":-100,"points":[[0,0],[0,-100]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"rectangle","id":"ox","x":540,"y":100,"width":100,"height":40,"backgroundColor":"#ffc9c9","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#ef4444","boundElements":[{"id":"t_ox","type":"text"},{"id":"a5","type":"arrow"}]},
+  {"type":"text","id":"t_ox","x":545,"y":105,"width":90,"height":20,"text":"O2","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"ox","originalText":"O2","autoResize":true},
+
+  {"type":"arrow","id":"a5","x":310,"y":190,"width":230,"height":-50,"points":[[0,0],[230,-50]],"strokeColor":"#ef4444","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"rectangle","id":"gl","x":690,"y":195,"width":120,"height":60,"backgroundColor":"#c3fae8","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#22c55e","boundElements":[{"id":"t_gl","type":"text"},{"id":"a6","type":"arrow"}]},
+  {"type":"text","id":"t_gl","x":695,"y":210,"width":110,"height":25,"text":"Glucose","fontSize":18,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"gl","originalText":"Glucose","autoResize":true},
+
+  {"type":"arrow","id":"a6","x":630,"y":225,"width":60,"height":0,"points":[[0,0],[60,0]],"strokeColor":"#22c55e","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"ellipse","id":"sun","x":30,"y":110,"width":50,"height":50,"backgroundColor":"#fff3bf","fillStyle":"solid","strokeColor":"#f59e0b","strokeWidth":2},
+  {"type":"arrow","id":"r1","x":55,"y":108,"width":0,"height":-14,"points":[[0,0],[0,-14]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":null,"startArrowhead":null},
+  {"type":"arrow","id":"r2","x":55,"y":162,"width":0,"height":14,"points":[[0,0],[0,14]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":null,"startArrowhead":null},
+  {"type":"arrow","id":"r3","x":28,"y":135,"width":-14,"height":0,"points":[[0,0],[-14,0]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":null,"startArrowhead":null},
+  {"type":"arrow","id":"r4","x":82,"y":135,"width":14,"height":0,"points":[[0,0],[14,0]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":null,"startArrowhead":null}
+]
+```
+
+---
+
+## Example 3: Sequence Diagram (UML-style)
+
+Demonstrates a sequence diagram with actors, dashed lifelines, and message arrows.
+
+```json
+[
+  {"type":"text","id":"title","x":200,"y":15,"text":"MCP Apps -- Sequence Flow","fontSize":24,"fontFamily":1,"strokeColor":"#1e1e1e","originalText":"MCP Apps -- Sequence Flow","autoResize":true},
+
+  {"type":"rectangle","id":"uHead","x":60,"y":60,"width":100,"height":40,"backgroundColor":"#a5d8ff","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#4a9eed","strokeWidth":2,"boundElements":[{"id":"t_uHead","type":"text"}]},
+  {"type":"text","id":"t_uHead","x":65,"y":65,"width":90,"height":20,"text":"User","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"uHead","originalText":"User","autoResize":true},
+
+  {"type":"arrow","id":"uLine","x":110,"y":100,"width":0,"height":400,"points":[[0,0],[0,400]],"strokeColor":"#b0b0b0","strokeWidth":1,"strokeStyle":"dashed","endArrowhead":null},
+
+  {"type":"rectangle","id":"aHead","x":230,"y":60,"width":100,"height":40,"backgroundColor":"#d0bfff","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#8b5cf6","strokeWidth":2,"boundElements":[{"id":"t_aHead","type":"text"}]},
+  {"type":"text","id":"t_aHead","x":235,"y":65,"width":90,"height":20,"text":"Agent","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"aHead","originalText":"Agent","autoResize":true},
+
+  {"type":"arrow","id":"aLine","x":280,"y":100,"width":0,"height":400,"points":[[0,0],[0,400]],"strokeColor":"#b0b0b0","strokeWidth":1,"strokeStyle":"dashed","endArrowhead":null},
+
+  {"type":"rectangle","id":"sHead","x":420,"y":60,"width":130,"height":40,"backgroundColor":"#ffd8a8","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#f59e0b","strokeWidth":2,"boundElements":[{"id":"t_sHead","type":"text"}]},
+  {"type":"text","id":"t_sHead","x":425,"y":65,"width":120,"height":20,"text":"Server","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"sHead","originalText":"Server","autoResize":true},
+
+  {"type":"arrow","id":"sLine","x":485,"y":100,"width":0,"height":400,"points":[[0,0],[0,400]],"strokeColor":"#b0b0b0","strokeWidth":1,"strokeStyle":"dashed","endArrowhead":null},
+
+  {"type":"arrow","id":"m1","x":110,"y":150,"width":170,"height":0,"points":[[0,0],[170,0]],"strokeColor":"#1e1e1e","strokeWidth":2,"endArrowhead":"arrow","boundElements":[{"id":"t_m1","type":"text"}]},
+  {"type":"text","id":"t_m1","x":165,"y":130,"width":60,"height":20,"text":"request","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"m1","originalText":"request","autoResize":true},
+
+  {"type":"arrow","id":"m2","x":280,"y":200,"width":205,"height":0,"points":[[0,0],[205,0]],"strokeColor":"#8b5cf6","strokeWidth":2,"endArrowhead":"arrow","boundElements":[{"id":"t_m2","type":"text"}]},
+  {"type":"text","id":"t_m2","x":352,"y":180,"width":60,"height":20,"text":"tools/call","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"m2","originalText":"tools/call","autoResize":true},
+
+  {"type":"arrow","id":"m3","x":485,"y":260,"width":-205,"height":0,"points":[[0,0],[-205,0]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":"arrow","strokeStyle":"dashed","boundElements":[{"id":"t_m3","type":"text"}]},
+  {"type":"text","id":"t_m3","x":352,"y":240,"width":60,"height":20,"text":"result","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"m3","originalText":"result","autoResize":true},
+
+  {"type":"arrow","id":"m4","x":280,"y":320,"width":-170,"height":0,"points":[[0,0],[-170,0]],"strokeColor":"#8b5cf6","strokeWidth":2,"endArrowhead":"arrow","strokeStyle":"dashed","boundElements":[{"id":"t_m4","type":"text"}]},
+  {"type":"text","id":"t_m4","x":165,"y":300,"width":60,"height":20,"text":"response","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"m4","originalText":"response","autoResize":true}
+]
+```
+
+---
+
+## Common Mistakes to Avoid
+
+- **Do NOT use `"label"` property** -- this is the #1 mistake. It is NOT part of the Excalidraw file format and will be silently ignored, producing blank shapes with no visible text. Always use container binding (`containerId` + `boundElements`) as shown in the examples above.
+- **Every bound text needs both sides linked** -- the shape needs `boundElements: [{"id": "t_xxx", "type": "text"}]` AND the text needs `containerId: "shape_id"`. If either is missing, the binding won't work.
+- **Include `originalText` and `autoResize: true`** on all text elements -- Excalidraw uses these for proper text reflow.
+- **Include `fontFamily: 1`** on all text elements -- without it, text may not render with the expected hand-drawn font.
+- **Elements overlap when y-coordinates are close** -- always check that text, boxes, and labels don't stack on top of each other.
+- **Arrow labels need space** -- long labels like "ATP + NADPH" overflow short arrows. Keep labels short or make arrows wider.
+- **Center titles relative to the diagram** -- estimate total width and center the title text over it.
+- **Draw decorations LAST** -- cute illustrations (sun, stars, icons) should appear at the end of the array so they're drawn on top.
+
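The two-sided binding rule in the list above can be checked mechanically before a diagram is saved. The following is a minimal sketch and not part of the skill itself: `find_binding_errors` is a hypothetical helper name, and it inspects only the `id`, `type`, `containerId`, `boundElements`, and `label` fields that the examples and the list mention.

```python
from typing import Any, Dict, List


def find_binding_errors(elements: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems with text/container bindings."""
    by_id = {el["id"]: el for el in elements if "id" in el}
    errors: List[str] = []

    for el in elements:
        el_id = el.get("id")

        # "label" is the mistake called out first in the list above.
        if "label" in el:
            errors.append(f"{el_id}: uses the unsupported 'label' property")

        # Text side: containerId must name a shape that lists this text.
        container_id = el.get("containerId")
        if el.get("type") == "text" and container_id:
            container = by_id.get(container_id)
            if container is None:
                errors.append(f"{el_id}: containerId '{container_id}' not found")
            else:
                bound = container.get("boundElements") or []
                linked = any(
                    b.get("id") == el_id and b.get("type") == "text" for b in bound
                )
                if not linked:
                    errors.append(
                        f"{el_id}: container '{container_id}' does not list it in boundElements"
                    )

        # Shape side: every bound text must point back via containerId.
        for b in el.get("boundElements") or []:
            if b.get("type") != "text":
                continue
            text = by_id.get(b.get("id"))
            if text is None:
                errors.append(f"{el_id}: bound text '{b.get('id')}' not found")
            elif text.get("containerId") != el_id:
                errors.append(
                    f"{el_id}: bound text '{b.get('id')}' has no matching containerId"
                )

    return errors
```

Running it over the parsed element array and refusing to save when the returned list is non-empty catches both the missing-side case and a stray `label` before the file reaches Excalidraw.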
@@ -0,0 +1,133 @@
+#!/usr/bin/env python3
+"""
+Upload an .excalidraw file to excalidraw.com and print a shareable URL.
+
+No account required. The diagram is encrypted client-side (AES-GCM) before
+upload -- the encryption key is embedded in the URL fragment, so the server
+never sees plaintext.
+
+Requirements:
+    pip install cryptography
+
+Usage:
+    python upload.py <path-to-file.excalidraw>
+
+Example:
+    python upload.py ~/diagrams/architecture.excalidraw
+    # prints: https://excalidraw.com/#json=abc123,encryptionKeyHere
+"""
+
+import json
+import os
+import struct
+import sys
+import zlib
+import base64
+import urllib.request
+
+try:
+    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
+except ImportError:
+    print("Error: 'cryptography' package is required for upload.")
+    print("Install it with: pip install cryptography")
+    sys.exit(1)
+
+# Excalidraw public upload endpoint (no auth needed)
+UPLOAD_URL = "https://json.excalidraw.com/api/v2/post/"
+
+
+def concat_buffers(*buffers: bytes) -> bytes:
+    """
+    Build the Excalidraw v2 concat-buffers binary format.
+
+    Layout: [version=1 (4B big-endian)] then for each buffer:
+            [length (4B big-endian)] [data bytes]
+    """
+    parts = [struct.pack(">I", 1)]  # version = 1
+    for buf in buffers:
+        parts.append(struct.pack(">I", len(buf)))
+        parts.append(buf)
+    return b"".join(parts)
+
+
+def upload(excalidraw_json: str) -> str:
+    """
+    Encrypt and upload Excalidraw JSON to excalidraw.com.
+
+    Args:
+        excalidraw_json: The full .excalidraw file content as a string.
+
+    Returns:
+        Shareable URL string.
+    """
+    # 1. Inner payload: concat_buffers(file_metadata, data)
+    file_metadata = json.dumps({}).encode("utf-8")
+    data_bytes = excalidraw_json.encode("utf-8")
+    inner_payload = concat_buffers(file_metadata, data_bytes)
+
+    # 2. Compress with zlib
+    compressed = zlib.compress(inner_payload)
+
+    # 3. AES-GCM 128-bit encrypt
+    raw_key = os.urandom(16)   # 128-bit key
+    iv = os.urandom(12)        # 12-byte nonce
+    aesgcm = AESGCM(raw_key)
+    encrypted = aesgcm.encrypt(iv, compressed, None)
+
+    # 4. Encoding metadata
+    encoding_meta = json.dumps({
+        "version": 2,
+        "compression": "pako@1",
+        "encryption": "AES-GCM",
+    }).encode("utf-8")
+
+    # 5. Outer payload: concat_buffers(encoding_meta, iv, encrypted)
+    payload = concat_buffers(encoding_meta, iv, encrypted)
+
+    # 6. Upload
+    req = urllib.request.Request(UPLOAD_URL, data=payload, method="POST")
+    with urllib.request.urlopen(req, timeout=30) as resp:
+        if resp.status != 200:
+            raise RuntimeError(f"Upload failed with HTTP {resp.status}")
+        result = json.loads(resp.read().decode("utf-8"))
+
+    file_id = result.get("id")
+    if not file_id:
+        raise RuntimeError(f"Upload returned no file ID. Response: {result}")
+
+    # 7. Key as base64url (JWK 'k' format, no padding)
+    key_b64 = base64.urlsafe_b64encode(raw_key).rstrip(b"=").decode("ascii")
+
+    return f"https://excalidraw.com/#json={file_id},{key_b64}"
+
+
+def main():
+    if len(sys.argv) < 2:
+        print("Usage: python upload.py <path-to-file.excalidraw>")
+        sys.exit(1)
+
+    file_path = sys.argv[1]
+
+    if not os.path.isfile(file_path):
+        print(f"Error: File not found: {file_path}")
+        sys.exit(1)
+
+    with open(file_path, "r", encoding="utf-8") as f:
+        content = f.read()
+
+    # Basic validation: should be valid JSON with an "elements" key
+    try:
+        doc = json.loads(content)
+    except json.JSONDecodeError as e:
+        print(f"Error: File is not valid JSON: {e}")
+        sys.exit(1)
+
+    if "elements" not in doc:
+        print("Warning: File does not contain an 'elements' key. Uploading anyway.")
+
+    url = upload(content)
+    print(url)
+
+
+if __name__ == "__main__":
+    main()
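The layout stated in the `concat_buffers` docstring can be sanity-checked with a round trip. This is a minimal sketch that assumes only that layout (a 4-byte big-endian version followed by length-prefixed chunks). `split_buffers` is a hypothetical helper written for illustration, not Excalidraw's own decoder, and `concat_buffers` is repeated so the block runs on its own.

```python
import struct
from typing import List


def concat_buffers(*buffers: bytes) -> bytes:
    """Same layout as in upload.py: version, then (length, data) pairs."""
    parts = [struct.pack(">I", 1)]
    for buf in buffers:
        parts.append(struct.pack(">I", len(buf)))
        parts.append(buf)
    return b"".join(parts)


def split_buffers(blob: bytes) -> List[bytes]:
    """Inverse of concat_buffers (hypothetical helper for testing)."""
    if len(blob) < 4:
        raise ValueError("blob too short for the version header")
    (version,) = struct.unpack_from(">I", blob, 0)
    if version != 1:
        raise ValueError(f"unsupported version {version}")

    offset = 4
    chunks: List[bytes] = []
    while offset < len(blob):
        if offset + 4 > len(blob):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        if offset + length > len(blob):
            raise ValueError("truncated chunk")
        chunks.append(blob[offset:offset + length])
        offset += length
    return chunks
```

A round trip such as `split_buffers(concat_buffers(meta, iv, ciphertext))` should return the three inputs unchanged, which is a cheap check to run before uploading anything.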
@@ -31,6 +31,8 @@ from .terminal_tool import (
    cleanup_vm,
    cleanup_all_environments,
    get_active_environments_info,
+    register_task_env_overrides,
+    clear_task_env_overrides,
    TERMINAL_TOOL_DESCRIPTION
 )

@@ -57,7 +59,6 @@ from .image_generation_tool import (
 )

 from .skills_tool import (
-    skills_categories,
    skills_list,
    skill_view,
    check_skills_requirements,
@@ -121,6 +122,12 @@ from .file_tools import (
    clear_file_ops_cache,
 )

+# Text-to-speech tools (Edge TTS / ElevenLabs / OpenAI)
+from .tts_tool import (
+    text_to_speech_tool,
+    check_tts_requirements,
+)
+
 # File tools have no external requirements - they use the terminal backend
 def check_file_requirements():
    """File tools only require terminal backend to be available."""
@@ -139,6 +146,8 @@ __all__ = [
    'cleanup_vm',
    'cleanup_all_environments',
    'get_active_environments_info',
+    'register_task_env_overrides',
+    'clear_task_env_overrides',
    'TERMINAL_TOOL_DESCRIPTION',
    # Terminal tools (Hecate/MorphCloud backend)
    'terminal_hecate_tool',
@@ -154,7 +163,6 @@ __all__ = [
    'image_generate_tool',
    'check_image_generation_requirements',
    # Skills tools
-    'skills_categories',
    'skills_list',
    'skill_view',
    'check_skills_requirements',
@@ -205,5 +213,8 @@ __all__ = [
    'get_file_tools',
    'clear_file_ops_cache',
    'check_file_requirements',
+    # Text-to-speech tools
+    'text_to_speech_tool',
+    'check_tts_requirements',
 ]

@@ -51,6 +51,7 @@ import subprocess
 import shutil
 import sys
 import asyncio
+import tempfile
 import threading
 import time
 import requests
@@ -644,17 +645,25 @@ def _find_agent_browser() -> str:
    """
    Find the agent-browser CLI executable.
    
+    Checks in order: PATH, local node_modules/.bin/, npx fallback.
+    
    Returns:
        Path to agent-browser executable
        
    Raises:
        FileNotFoundError: If agent-browser is not installed
    """
-    # Check if it's in PATH
+    # Check if it's in PATH (global install)
    which_result = shutil.which("agent-browser")
    if which_result:
        return which_result
    
+    # Check local node_modules/.bin/ (npm install in repo root)
+    repo_root = Path(__file__).parent.parent
+    local_bin = repo_root / "node_modules" / ".bin" / "agent-browser"
+    if local_bin.exists():
+        return str(local_bin)
+    
    # Check common npx locations
    npx_path = shutil.which("npx")
    if npx_path:
@@ -662,6 +671,7 @@ def _find_agent_browser() -> str:
    
    raise FileNotFoundError(
        "agent-browser CLI not found. Install it with: npm install -g agent-browser\n"
+        "Or run 'npm install' in the repo root to install locally.\n"
        "Or ensure npx is available in your PATH."
    )

@@ -708,12 +718,26 @@ def _run_browser_command(
    ] + args
    
    try:
+        # Give each task its own socket directory to prevent concurrency conflicts.
+        # Without this, parallel workers fight over the same default socket path,
+        # causing "Failed to create socket directory: Permission denied" errors.
+        task_socket_dir = os.path.join(
+            tempfile.gettempdir(), 
+            f"agent-browser-{session_info['session_name']}"
+        )
+        os.makedirs(task_socket_dir, exist_ok=True)
+        
+        browser_env = {
+            **os.environ,
+            "AGENT_BROWSER_SOCKET_DIR": task_socket_dir,
+        }
+        
        result = subprocess.run(
            cmd_parts,
            capture_output=True,
            text=True,
            timeout=timeout,
-            env={**os.environ}
+            env=browser_env,
        )
        
        # Parse JSON output
@@ -1487,6 +1511,13 @@ def cleanup_browser(task_id: Optional[str] = None) -> None:
        except Exception as e:
            print(f"[browser_tool] Exception during BrowserBase session close: {e}", file=sys.stderr)
        
+        # Clean up per-task socket directory
+        session_name = session_info.get("session_name", "")
+        if session_name:
+            socket_dir = os.path.join(tempfile.gettempdir(), f"agent-browser-{session_name}")
+            if os.path.exists(socket_dir):
+                shutil.rmtree(socket_dir, ignore_errors=True)
+        
        del _active_sessions[task_id]
        if not os.getenv("HERMES_QUIET"):
            print(f"[browser_tool] Removed task {task_id} from active sessions", file=sys.stderr)
@@ -30,57 +30,63 @@ def _get_file_ops(task_id: str = "default") -> ShellFileOperations:
        if task_id in _file_ops_cache:
            return _file_ops_cache[task_id]
    
-    # Check if we need to create a new environment
+    # Check if we need to create a new environment.
+    # Uses the same per-task creation locks as terminal_tool to prevent
+    # duplicate sandbox creation from concurrent tool calls.
+    from tools.terminal_tool import _creation_locks, _creation_locks_lock
+    
    needs_creation = False
    with _env_lock:
        if task_id not in _active_environments:
            needs_creation = True
    
-    # Create environment OUTSIDE locks so we don't block other rollouts
-    # during slow Modal/Docker startup (~10s)
    if needs_creation:
-        config = _get_env_config()
-        env_type = config["env_type"]
-        
-        if env_type == "docker":
-            image = config["docker_image"]
-        elif env_type == "singularity":
-            image = config["singularity_image"]
-        elif env_type == "modal":
-            image = config["modal_image"]
-        else:
-            image = ""
-        
-        cwd = config["cwd"]
-        _check_disk_usage_warning()
-        if not os.getenv("HERMES_QUIET"):
-            print(f"[FileTools] Creating new {env_type} environment for task {task_id[:8]}...", flush=True)
-        
-        new_env = _create_environment(
-            env_type=env_type,
-            image=image,
-            cwd=cwd,
-            timeout=config["timeout"],
-        )
-        
-        # Store under lock (brief) -- do NOT call _start_cleanup_thread inside
-        # the lock because it also acquires _env_lock (non-reentrant = deadlock)
-        created = False
-        with _env_lock:
-            if task_id not in _active_environments:
-                _active_environments[task_id] = new_env
-                created = True
-            else:
-                try:
-                    if hasattr(new_env, 'stop'):
-                        new_env.stop()
-                except Exception:
-                    pass
-        
-        if created:
-            _start_cleanup_thread()
-            if not os.getenv("HERMES_QUIET"):
-                print(f"[FileTools] {env_type} environment ready for task {task_id[:8]}", flush=True)
+        # Per-task lock: only one thread creates the sandbox, others wait
+        with _creation_locks_lock:
+            if task_id not in _creation_locks:
+                _creation_locks[task_id] = __import__("threading").Lock()
+            task_lock = _creation_locks[task_id]
+
+        with task_lock:
+            # Double-check after acquiring the per-task lock
+            with _env_lock:
+                if task_id in _active_environments:
+                    needs_creation = False
+
+            if needs_creation:
+                from tools.terminal_tool import _task_env_overrides
+                
+                config = _get_env_config()
+                env_type = config["env_type"]
+                overrides = _task_env_overrides.get(task_id, {})
+                
+                if env_type == "docker":
+                    image = overrides.get("docker_image") or config["docker_image"]
+                elif env_type == "singularity":
+                    image = overrides.get("singularity_image") or config["singularity_image"]
+                elif env_type == "modal":
+                    image = overrides.get("modal_image") or config["modal_image"]
+                else:
+                    image = ""
+                
+                cwd = overrides.get("cwd") or config["cwd"]
+                if not os.getenv("HERMES_QUIET"):
+                    print(f"[FileTools] Creating new {env_type} environment for task {task_id[:8]}...", flush=True)
+                
+                new_env = _create_environment(
+                    env_type=env_type,
+                    image=image,
+                    cwd=cwd,
+                    timeout=config["timeout"],
+                )
+                
+                with _env_lock:
+                    _active_environments[task_id] = new_env
+                    _last_activity[task_id] = __import__("time").time()
+                
+                _start_cleanup_thread()
+                if not os.getenv("HERMES_QUIET"):
+                    print(f"[FileTools] {env_type} environment ready for task {task_id[:8]}", flush=True)
    
    # Now get the environment and build file_ops
    with _env_lock:
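The per-task locking in this hunk follows a double-checked creation pattern: a short global lock hands out one lock per task, and the slow creation step runs while holding only that per-task lock. Below is a minimal standalone sketch of the pattern. `get_or_create`, `_resources`, `_resources_lock`, and the `factory` argument are hypothetical names; the real code additionally applies per-task overrides, records activity timestamps, and starts a cleanup thread.

```python
import threading
from typing import Any, Callable, Dict

_resources: Dict[str, Any] = {}
_resources_lock = threading.Lock()

_creation_locks: Dict[str, threading.Lock] = {}
_creation_locks_lock = threading.Lock()


def get_or_create(key: str, factory: Callable[[], Any]) -> Any:
    """Create the resource for `key` at most once, even under concurrency."""
    # Fast path: the resource already exists.
    with _resources_lock:
        if key in _resources:
            return _resources[key]

    # Hand out exactly one creation lock per key.
    with _creation_locks_lock:
        key_lock = _creation_locks.setdefault(key, threading.Lock())

    with key_lock:
        # Double-check: another thread may have finished while we waited.
        with _resources_lock:
            if key in _resources:
                return _resources[key]

        # Slow step runs outside _resources_lock, so other keys are not blocked.
        resource = factory()

        with _resources_lock:
            _resources[key] = resource
        return resource
```

Threads asking for the same key queue on that key's lock and reuse the first result, while threads working on other keys proceed independently.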
@@ -28,6 +28,7 @@ Usage:

 import json
 import os
+import signal
 import sys
 import time
 import threading
@@ -39,6 +40,28 @@ import uuid
 from pathlib import Path
 from typing import Optional, Dict, Any

+
+# ---------------------------------------------------------------------------
+# Global interrupt event: set by the agent when a user interrupt arrives.
+# The terminal tool polls this during command execution so it can kill
+# long-running subprocesses immediately instead of blocking until timeout.
+# ---------------------------------------------------------------------------
+_interrupt_event = threading.Event()
+
+
+def set_interrupt_event(active: bool) -> None:
+    """Called by the agent to signal or clear the interrupt."""
+    if active:
+        _interrupt_event.set()
+    else:
+        _interrupt_event.clear()
+
+
+def is_interrupted() -> bool:
+    """Check if an interrupt has been requested."""
+    return _interrupt_event.is_set()
+
+
 # Add mini-swe-agent to path if not installed
 mini_swe_path = Path(__file__).parent.parent / "mini-swe-agent" / "src"
 if mini_swe_path.exists():
@@ -83,9 +106,9 @@ def _get_apptainer_cache_dir() -> Path:
        cache_path.mkdir(parents=True, exist_ok=True)
        return cache_path
    
-    # Use scratch dir parent for cache (one level up from sandboxes)
+    # Use user-specific subdirectory in scratch for cache
    scratch = _get_scratch_dir()
-    cache_path = scratch.parent / ".apptainer"
+    cache_path = scratch / ".apptainer"
    cache_path.mkdir(parents=True, exist_ok=True)
    return cache_path

@@ -214,6 +237,10 @@ _cached_sudo_password: str = ""
 # Session-cached dangerous command approvals (pattern -> approved)
 _session_approved_patterns: set = set()

+# Last approval-required command (for gateway to pick up)
+# Set by _check_dangerous_command when in ask mode, read by gateway
+_last_pending_approval: dict = {}
+
 # Dangerous command patterns (regex, description)
 DANGEROUS_PATTERNS = [
    (r'\brm\s+(-[^\s]*\s+)*/', "delete in root path"),
@@ -385,12 +412,22 @@ def _check_dangerous_command(command: str, env_type: str) -> dict:
        # Programmatic use - allow (user opted into local backend)
        return {"approved": True, "message": None}
    
-    if is_gateway:
-        # Messaging context - return informative denial, agent should ask user
+    if is_gateway or os.getenv("HERMES_EXEC_ASK"):
+        # Messaging context - return approval_required so the gateway can
+        # prompt the user interactively instead of just blocking
+        global _last_pending_approval
+        _last_pending_approval = {
+            "command": command,
+            "pattern_key": pattern_key,
+            "description": description,
+        }
        return {
            "approved": False,
            "pattern_key": pattern_key,
-            "message": f"BLOCKED: This command is potentially dangerous ({description}). Tell the user and ask if they want to add this command pattern to their allowlist. They can do this via 'hermes config edit' or by running the command directly on their machine."
+            "status": "approval_required",
+            "command": command,
+            "description": description,
+            "message": f"⚠️ This command is potentially dangerous ({description}). Asking the user for approval..."
        }
    
    # CLI context - prompt user
@@ -599,7 +636,13 @@ class _LocalEnvironment:
        self.env = env or {}
    
    def execute(self, command: str, cwd: str = "", *, timeout: int | None = None) -> dict:
-        """Execute a command locally with sudo support."""
+        """
+        Execute a command locally with sudo support.
+        
+        Uses Popen + polling so the global interrupt event can kill the
+        process early when the user sends a new message, instead of
+        blocking for the full timeout.
+        """
        work_dir = cwd or self.cwd or os.getcwd()
        effective_timeout = timeout or self.timeout
        
@@ -607,22 +650,56 @@ class _LocalEnvironment:
        exec_command = _transform_sudo_command(command)
        
        try:
-            result = subprocess.run(
+            proc = subprocess.Popen(
                exec_command,
                shell=True,
                text=True,
                cwd=work_dir,
                env=os.environ | self.env,
-                timeout=effective_timeout,
                encoding="utf-8",
                errors="replace",
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                stdin=subprocess.DEVNULL,  # Prevent hanging on interactive prompts
+                # Start in a new process group so we can kill the whole tree
+                start_new_session=True,  # same effect as setsid(); avoids preexec_fn, which is unsafe with threads
            )
-            return {"output": result.stdout, "returncode": result.returncode}
-        except subprocess.TimeoutExpired:
-            return {"output": f"Command timed out after {effective_timeout}s", "returncode": 124}
+            
+            deadline = time.monotonic() + effective_timeout
+            
+            # Poll every 200ms so we notice interrupts quickly
+            while proc.poll() is None:
+                if _interrupt_event.is_set():
+                    # User sent a new message — kill the process tree and return
+                    # what we have so far
+                    try:
+                        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
+                    except (ProcessLookupError, PermissionError):
+                        proc.kill()
+                    # Grab any partial output
+                    partial, _ = proc.communicate(timeout=2)
+                    output = partial or ""
+                    return {
+                        "output": output + "\n[Command interrupted — user sent a new message]",
+                        "returncode": 130  # Standard interrupted exit code
+                    }
+                
+                if time.monotonic() > deadline:
+                    # Timeout — kill process tree
+                    try:
+                        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
+                    except (ProcessLookupError, PermissionError):
+                        proc.kill()
+                    proc.communicate(timeout=2)
+                    return {"output": f"Command timed out after {effective_timeout}s", "returncode": 124}
+                
+                # Short sleep to avoid busy-waiting
+                time.sleep(0.2)
+            
+            # Process finished normally — read all output
+            stdout, _ = proc.communicate()
+            return {"output": stdout or "", "returncode": proc.returncode}
+            
        except Exception as e:
            return {"output": f"Execution error: {str(e)}", "returncode": 1}
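The polling loop in `_LocalEnvironment.execute` can be reduced to the standalone sketch below. It is an illustration under assumptions, not the code from this diff: `run_interruptible` is a hypothetical name, it takes an argument list instead of a shell string, and it kills only the direct child rather than a process group. It also adds a reader thread, which the diff does not have, so that a child producing more output than the OS pipe buffer holds cannot stall while the loop is polling.

```python
import subprocess
import sys
import threading
import time
from typing import List


def run_interruptible(
    cmd: List[str],
    interrupt: threading.Event,
    timeout: float,
    poll_interval: float = 0.2,
) -> dict:
    """Run cmd, returning early if `interrupt` is set or `timeout` elapses."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        stdin=subprocess.DEVNULL,
        text=True,
    )

    # Drain output continuously so the child never blocks on a full pipe.
    chunks: List[str] = []
    reader = threading.Thread(target=lambda: chunks.extend(proc.stdout), daemon=True)
    reader.start()

    deadline = time.monotonic() + timeout
    reason = None
    while proc.poll() is None:
        if interrupt.is_set():
            reason = "interrupted"
            break
        if time.monotonic() > deadline:
            reason = "timeout"
            break
        time.sleep(poll_interval)

    if reason is not None:
        proc.kill()
        proc.wait()
    reader.join(timeout=2)
    output = "".join(chunks)

    if reason == "interrupted":
        return {"output": output + "\n[Command interrupted]", "returncode": 130}
    if reason == "timeout":
        return {"output": f"Command timed out after {timeout}s", "returncode": 124}
    return {"output": output, "returncode": proc.returncode}


if __name__ == "__main__":
    stop = threading.Event()
    print(run_interruptible([sys.executable, "-c", "print('hi')"], stop, timeout=10))
```

The return codes 130 and 124 mirror the ones used in the diff for the interrupted and timed-out cases.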
    
@@ -637,15 +714,21 @@ class _LocalEnvironment:

 class _SingularityEnvironment:
    """
-    Custom Singularity/Apptainer environment with better space management.
+    Persistent Singularity/Apptainer container environment.
    
-    - Automatically builds/caches SIF images from docker:// URLs
-    - Builds sandbox in /scratch (if available) or configurable location
-    - Binds a large working directory into the container
-    - Keeps container isolated from host filesystem
+    Uses `apptainer instance` to create a long-running container that persists
+    state (files, installs, env changes) across all commands within a task.
+    The model experiences this as a real Linux VM.
+    
+    Features:
+    - Persistent filesystem: files created in one command are visible in the next
+    - Package installs persist: pip/apt installs survive across tool calls
+    - Full isolation: --containall gives PID, IPC, and environment isolation
+    - Writable tmpfs overlay: full root filesystem is writable (RAM-backed)
+    - Automatic SIF caching: docker:// images converted to SIF once, reused forever
    """
    
-    def __init__(self, image: str, cwd: str = "/workspace", timeout: int = 60):
+    def __init__(self, image: str, cwd: str = "/root", timeout: int = 60):
        self.cwd = cwd
        self.timeout = timeout
        
@@ -655,60 +738,60 @@ class _SingularityEnvironment:
        # Get or build SIF from docker:// URL (fast if already cached)
        self.image = _get_or_build_sif(image, self.executable)
        
-        # Get scratch directory for sandbox
-        self.scratch_dir = _get_scratch_dir()
+        # Create unique instance name (must be alphanumeric + underscores)
+        self.instance_id = f"hermes_{uuid.uuid4().hex[:12]}"
+        self._instance_started = False
        
-        # Create unique sandbox directory
-        self.sandbox_id = f"hermes-{uuid.uuid4().hex[:12]}"
-        self.sandbox_dir = self.scratch_dir / self.sandbox_id
-        
-        # Create a working directory that will be bound into the container
-        self.work_dir = self.scratch_dir / f"{self.sandbox_id}-work"
-        self.work_dir.mkdir(parents=True, exist_ok=True)
-        
-        # Build the sandbox
-        self._build_sandbox()
+        # Start the persistent instance
+        self._start_instance()
    
-    def _build_sandbox(self):
-        """Build a writable sandbox from the container image (SIF or other)."""
+    def _start_instance(self):
+        """Start a persistent apptainer instance.
+        
+        The instance runs as a background process. All subsequent execute() calls
+        run commands inside this same instance, so state persists across calls.
+        """
+        cmd = [
+            self.executable, "instance", "start",
+            "--writable-tmpfs",  # RAM-backed writable overlay on read-only SIF
+            "--containall",      # Full isolation: PID, IPC, environment, filesystem
+            str(self.image),
+            self.instance_id,
+        ]
+        
        try:
            result = subprocess.run(
-                [self.executable, "build", "--sandbox", str(self.sandbox_dir), self.image],
+                cmd,
                capture_output=True,
                text=True,
-                timeout=300  # 5 min timeout for building
+                timeout=120,  # 2 min for instance startup
            )
            if result.returncode != 0:
-                raise RuntimeError(f"Failed to build sandbox: {result.stderr}")
+                raise RuntimeError(f"Failed to start instance: {result.stderr}")
            
-            # Create /workspace directory inside the sandbox for bind mounting
-            workspace_in_sandbox = self.sandbox_dir / "workspace"
-            workspace_in_sandbox.mkdir(parents=True, exist_ok=True)
+            self._instance_started = True
+            print(f"[Singularity] Instance {self.instance_id} started (persistent container)", flush=True)
            
        except subprocess.TimeoutExpired:
-            shutil.rmtree(self.sandbox_dir, ignore_errors=True)
-            raise RuntimeError("Sandbox build timed out")
+            raise RuntimeError("Instance start timed out")
    
    def execute(self, command: str, cwd: str = "", *, timeout: int | None = None) -> dict:
-        """Execute a command in the Singularity container."""
+        """Execute a command in the persistent Singularity instance.
+        
+        All commands run in the same container, so files, installs, and
+        environment changes persist between calls.
+        """
+        if not self._instance_started:
+            return {"output": "Instance not started", "returncode": -1}
+        
        cmd = [self.executable, "exec"]
        
-        # Isolation flags - contain but allow network
-        cmd.extend(["--contain", "--cleanenv"])
-        
-        # Bind the working directory into the container at /workspace
-        # This gives the container access to a large writable space
-        cmd.extend(["--bind", f"{self.work_dir}:/workspace"])
-        
-        # Also bind it to /tmp inside container for pip cache etc.
-        cmd.extend(["--bind", f"{self.work_dir}:/tmp"])
-        
        # Set working directory
        work_dir = cwd or self.cwd
        cmd.extend(["--pwd", work_dir])
        
-        # Use writable sandbox
-        cmd.extend(["--writable", str(self.sandbox_dir)])
+        # Connect to the running instance
+        cmd.append(f"instance://{self.instance_id}")
        
        # Transform sudo commands if SUDO_PASSWORD is available
        exec_command = _transform_sudo_command(command)
@@ -732,9 +815,19 @@ class _SingularityEnvironment:
            return {"output": f"Command timed out after {timeout or self.timeout}s", "returncode": 124}
    
    def cleanup(self):
-        """Clean up sandbox and working directory."""
-        shutil.rmtree(self.sandbox_dir, ignore_errors=True)
-        shutil.rmtree(self.work_dir, ignore_errors=True)
+        """Stop the persistent instance and clean up."""
+        if self._instance_started:
+            try:
+                subprocess.run(
+                    [self.executable, "instance", "stop", self.instance_id],
+                    capture_output=True,
+                    text=True,
+                    timeout=30,
+                )
+                print(f"[Singularity] Instance {self.instance_id} stopped", flush=True)
+            except Exception as e:
+                print(f"[Singularity] Warning: failed to stop instance {self.instance_id}: {e}", flush=True)
+            self._instance_started = False
    
    def stop(self):
        """Alias for cleanup."""
@@ -742,7 +835,10 @@ class _SingularityEnvironment:
    
    def __del__(self):
        """Cleanup on destruction."""
-        self.cleanup()
+        try:
+            self.cleanup()
+        except Exception:
+            pass


 class _SSHEnvironment:
@@ -957,13 +1053,37 @@ class _ModalEnvironment:
    
    Wraps mini-swe-agent's SwerexModalEnvironment but adds:
    - SUDO_PASSWORD support via _transform_sudo_command
+    - Automatic async-safety patches (applied once, before first use)
    
-    Note: stdin handling is not needed for Modal since it uses remote async execution.
+    The patches replace SwerexModalEnvironment's asyncio.run() calls with a
+    background thread approach, making it safe to use inside any event loop
+    (e.g., Atropos). Applied here at the point of use rather than relying on
+    import-time side effects, so ALL callers get the fix automatically.
    """
    
+    # Class-level flag: patches only need to be applied once
+    _patches_applied = False
+    
    def __init__(self, image: str, cwd: str = "/root", timeout: int = 60):
+        # Ensure async-safety patches are applied before creating any
+        # SwerexModalEnvironment instance. This is the single authoritative
+        # place -- no other module needs to call apply_patches() for Modal.
+        if not _ModalEnvironment._patches_applied:
+            try:
+                from environments.patches import apply_patches
+                apply_patches()
+            except ImportError:
+                pass  # patches module not available (standalone use)
+            _ModalEnvironment._patches_applied = True
+        
        from minisweagent.environments.extra.swerex_modal import SwerexModalEnvironment
-        self._inner = SwerexModalEnvironment(image=image, cwd=cwd, timeout=timeout)
+        # Generous startup timeout: sandbox creation can take 30-60s for cold images,
+        # and the SWE-ReX runtime needs another 10-30s to boot inside it.
+        self._inner = SwerexModalEnvironment(
+            image=image, cwd=cwd, timeout=timeout,
+            startup_timeout=180.0,
+            runtime_timeout=3600.0,
+        )
        self.cwd = cwd
        self.timeout = timeout
    
@@ -1014,7 +1134,7 @@ TERMINAL_TOOL_DESCRIPTION = """Execute commands on a secure Linux environment.
 - Run servers/long processes in background
 - Monitor disk usage for large tasks
 - Install whatever tools you need with apt-get or pip
- Do not be afraid to run pip with --break-system-packages
+- Try to create or use a venv with uv or python -m venv to keep isolation from global system packages.

 **Things to avoid:**
 - Do NOT use interactive tools such as tmux, vim, nano, python repl - you will get stuck.
@@ -1026,9 +1146,49 @@ _active_environments: Dict[str, Any] = {}
 _task_workdirs: Dict[str, str] = {}  # Maps task_id to working directory
 _last_activity: Dict[str, float] = {}
 _env_lock = threading.Lock()
+_creation_locks: Dict[str, threading.Lock] = {}  # Per-task locks for sandbox creation
+_creation_locks_lock = threading.Lock()  # Protects _creation_locks dict itself
 _cleanup_thread = None
 _cleanup_running = False

+# Per-task environment overrides registry.
+# Allows environments (e.g., TerminalBench2Env) to specify a custom Docker/Modal
+# image for a specific task_id BEFORE the agent loop starts. When the terminal or
+# file tools create a new sandbox for that task_id, they check this registry first
+# and fall back to the TERMINAL_MODAL_IMAGE (etc.) env var if no override is set.
+#
+# This is never exposed to the model -- only infrastructure code calls it.
+# Thread-safe because each task_id is unique per rollout.
+_task_env_overrides: Dict[str, Dict[str, Any]] = {}
+
+
+def register_task_env_overrides(task_id: str, overrides: Dict[str, Any]):
+    """
+    Register environment overrides for a specific task/rollout.
+
+    Called by Atropos environments before the agent loop to configure
+    per-task sandbox settings (e.g., a custom Dockerfile for the Modal image).
+
+    Supported override keys:
+        - modal_image: str -- Path to Dockerfile or Docker Hub image name
+        - docker_image / singularity_image: str -- Image name for that backend
+        - cwd: str -- Working directory inside the sandbox
+
+    Args:
+        task_id: The rollout's unique task identifier
+        overrides: Dict of config keys to override
+    """
+    _task_env_overrides[task_id] = overrides
+
+
+def clear_task_env_overrides(task_id: str):
+    """
+    Clear environment overrides for a task after rollout completes.
+
+    Called during cleanup to avoid stale entries accumulating.
+    """
+    _task_env_overrides.pop(task_id, None)
+
 # Configuration from environment variables
 def _get_env_config() -> Dict[str, Any]:
    """Get terminal environment configuration from environment variables."""
@@ -1040,10 +1200,10 @@ def _get_env_config() -> Dict[str, Any]:
    #   - local/ssh: current working directory (CLI resolves "." before we get here)
    #   - docker/singularity: /tmp inside the container (singularity bind-mounts /scratch there)
    #   - modal: /root (ephemeral cloud container, full filesystem access)
-    if env_type == "modal":
+    if env_type in ("modal", "singularity"):
        default_cwd = "/root"
-    elif env_type in ("docker", "singularity"):
-        default_cwd = "/tmp"
+    elif env_type == "docker":
+        default_cwd = "/"
    else:
        default_cwd = os.getcwd()
    
@@ -1284,7 +1444,15 @@ def cleanup_vm(task_id: str):
                    print(f"[Terminal Cleanup] Error cleaning up environment for task {task_id}: {e}")


-atexit.register(_stop_cleanup_thread)
+def _atexit_cleanup():
+    """Stop cleanup thread and shut down all remaining sandboxes on exit."""
+    _stop_cleanup_thread()
+    if _active_environments:
+        count = len(_active_environments)
+        print(f"\n[Terminal Cleanup] Shutting down {count} remaining sandbox(es)...")
+        cleanup_all_environments()
+
+atexit.register(_atexit_cleanup)


 def terminal_tool(
@@ -1326,24 +1494,28 @@ def terminal_tool(
        # Get configuration
        config = _get_env_config()
        env_type = config["env_type"]
-        
-        # Select image based on env type
-        if env_type == "docker":
-            image = config["docker_image"]
-        elif env_type == "singularity":
-            image = config["singularity_image"]
-        elif env_type == "modal":
-            image = config["modal_image"]
-        else:
-            image = ""
-        
-        cwd = config["cwd"]
-        default_timeout = config["timeout"]
-        effective_timeout = timeout or default_timeout

        # Use task_id for environment isolation
        effective_task_id = task_id or "default"

+        # Check per-task overrides (set by environments like TerminalBench2Env)
+        # before falling back to global env var config
+        overrides = _task_env_overrides.get(effective_task_id, {})
+        
+        # Select image based on env type, with per-task override support
+        if env_type == "docker":
+            image = overrides.get("docker_image") or config["docker_image"]
+        elif env_type == "singularity":
+            image = overrides.get("singularity_image") or config["singularity_image"]
+        elif env_type == "modal":
+            image = overrides.get("modal_image") or config["modal_image"]
+        else:
+            image = ""
+        
+        cwd = overrides.get("cwd") or config["cwd"]
+        default_timeout = config["timeout"]
+        effective_timeout = timeout or default_timeout
+
        # For local environment in batch mode, create a unique subdirectory per task
        # This prevents parallel tasks from overwriting each other's files
        # In CLI mode (HERMES_QUIET), use the cwd directly without subdirectories
@@ -1359,68 +1531,86 @@ def terminal_tool(
        # Start cleanup thread
        _start_cleanup_thread()

-        # Get or create environment
-        # Check under lock, but create OUTSIDE lock so we don't block
-        # other concurrent rollouts during slow Modal/Docker startup
-        needs_creation = False
+        # Get or create environment.
+        # Use a per-task creation lock so concurrent tool calls for the same
+        # task_id wait for the first one to finish creating the sandbox,
+        # instead of each creating their own (wasting Modal resources).
        with _env_lock:
-            if effective_task_id not in _active_environments:
-                needs_creation = True
-            else:
+            if effective_task_id in _active_environments:
                _last_activity[effective_task_id] = time.time()
                env = _active_environments[effective_task_id]
+                needs_creation = False
+            else:
+                needs_creation = True

        if needs_creation:
-            _check_disk_usage_warning()
-            if not os.getenv("HERMES_QUIET"):
-                print(f"[Terminal] Creating new {env_type} environment for task {effective_task_id[:8]}...", flush=True)
-            try:
-                ssh_config = None
-                if env_type == "ssh":
-                    ssh_config = {
-                        "host": config.get("ssh_host", ""),
-                        "user": config.get("ssh_user", ""),
-                        "port": config.get("ssh_port", 22),
-                        "key": config.get("ssh_key", ""),
-                    }
+            # Per-task lock: only one thread creates the sandbox, others wait
+            with _creation_locks_lock:
+                if effective_task_id not in _creation_locks:
+                    _creation_locks[effective_task_id] = threading.Lock()
+                task_lock = _creation_locks[effective_task_id]

-                new_env = _create_environment(
-                    env_type=env_type,
-                    image=image,
-                    cwd=cwd,
-                    timeout=effective_timeout,
-                    ssh_config=ssh_config
-                )
-            except ImportError as e:
-                return json.dumps({
-                    "output": "",
-                    "exit_code": -1,
-                    "error": f"Terminal tool disabled: mini-swe-agent not available ({e})",
-                    "status": "disabled"
-                }, ensure_ascii=False)
+            with task_lock:
+                # Double-check after acquiring the per-task lock
+                with _env_lock:
+                    if effective_task_id in _active_environments:
+                        _last_activity[effective_task_id] = time.time()
+                        env = _active_environments[effective_task_id]
+                        needs_creation = False

-            # Store under lock (brief)
-            with _env_lock:
-                if effective_task_id not in _active_environments:
-                    _active_environments[effective_task_id] = new_env
-                else:
-                    # Another thread created it while we were building -- clean up ours
+                if needs_creation:
+                    if env_type in ("singularity", "local"):
+                        _check_disk_usage_warning()
+                    if not os.getenv("HERMES_QUIET"):
+                        print(f"[Terminal] Creating new {env_type} environment for task {effective_task_id[:8]}...", flush=True)
                    try:
-                        if hasattr(new_env, 'stop'):
-                            new_env.stop()
-                    except Exception:
-                        pass
+                        ssh_config = None
+                        if env_type == "ssh":
+                            ssh_config = {
+                                "host": config.get("ssh_host", ""),
+                                "user": config.get("ssh_user", ""),
+                                "port": config.get("ssh_port", 22),
+                                "key": config.get("ssh_key", ""),
+                            }

-                _last_activity[effective_task_id] = time.time()
-                env = _active_environments[effective_task_id]
-                if not os.getenv("HERMES_QUIET"):
-                    print(f"[Terminal] {env_type} environment ready for task {effective_task_id[:8]}", flush=True)
+                        new_env = _create_environment(
+                            env_type=env_type,
+                            image=image,
+                            cwd=cwd,
+                            timeout=effective_timeout,
+                            ssh_config=ssh_config
+                        )
+                    except ImportError as e:
+                        return json.dumps({
+                            "output": "",
+                            "exit_code": -1,
+                            "error": f"Terminal tool disabled: mini-swe-agent not available ({e})",
+                            "status": "disabled"
+                        }, ensure_ascii=False)
+
+                    with _env_lock:
+                        _active_environments[effective_task_id] = new_env
+                        _last_activity[effective_task_id] = time.time()
+                        env = new_env
+                    if not os.getenv("HERMES_QUIET"):
+                        print(f"[Terminal] {env_type} environment ready for task {effective_task_id[:8]}", flush=True)

        # Check for dangerous commands (only for local/ssh in interactive modes)
        # Skip check if force=True (user has confirmed they want to run it)
        if not force:
            approval = _check_dangerous_command(command, env_type)
            if not approval["approved"]:
+                # Check if this is an approval_required (gateway ask mode)
+                if approval.get("status") == "approval_required":
+                    return json.dumps({
+                        "output": "",
+                        "exit_code": -1,
+                        "error": approval.get("message", "Waiting for user approval"),
+                        "status": "approval_required",
+                        "command": approval.get("command", command),
+                        "description": approval.get("description", "dangerous command"),
+                        "pattern_key": approval.get("pattern_key", ""),
+                    }, ensure_ascii=False)
                # Command was blocked - return informative message
                return json.dumps({
                    "output": "",
@@ -0,0 +1,403 @@
+#!/usr/bin/env python3
+"""
+Text-to-Speech Tool Module
+
+Supports three TTS providers:
+- Edge TTS (default, free, no API key): Microsoft Edge neural voices
+- ElevenLabs (premium): High-quality voices, needs ELEVENLABS_API_KEY
+- OpenAI TTS: Good quality, needs OPENAI_API_KEY
+
+Output formats:
+- Opus (.ogg) for Telegram voice bubbles (requires ffmpeg for Edge TTS)
+- MP3 (.mp3) for everything else (CLI, Discord, WhatsApp)
+
+Configuration is loaded from ~/.hermes/config.yaml under the 'tts:' key.
+The user chooses the provider and voice; the model just sends text.
+
+Usage:
+    from tools.tts_tool import text_to_speech_tool, check_tts_requirements
+
+    result = text_to_speech_tool(text="Hello world")
+"""
+
+import asyncio
+import datetime
+import json
+import os
+import shutil
+import subprocess
+import tempfile
+from pathlib import Path
+from typing import Dict, Any, Optional
+
+# ---------------------------------------------------------------------------
+# Optional imports -- providers degrade gracefully if not installed
+# ---------------------------------------------------------------------------
+try:
+    import edge_tts
+    _HAS_EDGE_TTS = True
+except ImportError:
+    _HAS_EDGE_TTS = False
+
+try:
+    from elevenlabs.client import ElevenLabs
+    _HAS_ELEVENLABS = True
+except ImportError:
+    _HAS_ELEVENLABS = False
+
+# openai is a core dependency, but guard anyway
+try:
+    from openai import OpenAI as OpenAIClient
+    _HAS_OPENAI = True
+except ImportError:
+    _HAS_OPENAI = False
+
+
+# ===========================================================================
+# Defaults
+# ===========================================================================
+DEFAULT_PROVIDER = "edge"
+DEFAULT_EDGE_VOICE = "en-US-AriaNeural"
+DEFAULT_ELEVENLABS_VOICE_ID = "pNInz6obpgDQGcFmaJgB"  # Adam
+DEFAULT_ELEVENLABS_MODEL_ID = "eleven_multilingual_v2"
+DEFAULT_OPENAI_MODEL = "gpt-4o-mini-tts"
+DEFAULT_OPENAI_VOICE = "alloy"
+DEFAULT_OUTPUT_DIR = os.path.expanduser("~/voice-memos")
+MAX_TEXT_LENGTH = 4000
+
+
+# ===========================================================================
+# Config loader -- reads tts: section from ~/.hermes/config.yaml
+# ===========================================================================
+def _load_tts_config() -> Dict[str, Any]:
+    """
+    Load TTS configuration from ~/.hermes/config.yaml.
+
+    Returns the 'tts' section as a dict, or an empty dict if the config
+    cannot be loaded. Callers apply defaults for any missing fields.
+    """
+    try:
+        from hermes_cli.config import load_config
+        config = load_config()
+        return config.get("tts", {})
+    except Exception:
+        return {}
+
+
+def _get_provider(tts_config: Dict[str, Any]) -> str:
+    """Get the configured TTS provider name."""
+    return tts_config.get("provider", DEFAULT_PROVIDER).lower().strip()
+
+
+# ===========================================================================
+# ffmpeg Opus conversion (Edge TTS MP3 -> OGG Opus for Telegram)
+# ===========================================================================
+def _has_ffmpeg() -> bool:
+    """Check if ffmpeg is available on the system."""
+    return shutil.which("ffmpeg") is not None
+
+
+def _convert_to_opus(mp3_path: str) -> Optional[str]:
+    """
+    Convert an MP3 file to OGG Opus format for Telegram voice bubbles.
+
+    Args:
+        mp3_path: Path to the input MP3 file.
+
+    Returns:
+        Path to the .ogg file, or None if conversion fails.
+    """
+    if not _has_ffmpeg():
+        return None
+
+    ogg_path = mp3_path.rsplit(".", 1)[0] + ".ogg"
+    try:
+        subprocess.run(
+            ["ffmpeg", "-i", mp3_path, "-acodec", "libopus",
+             "-ac", "1", "-b:a", "64k", "-vbr", "off", ogg_path, "-y"],
+            capture_output=True, timeout=30,
+        )
+        if os.path.exists(ogg_path) and os.path.getsize(ogg_path) > 0:
+            return ogg_path
+    except Exception:
+        pass
+    return None
+
+
+# ===========================================================================
+# Provider: Edge TTS (free)
+# ===========================================================================
+async def _generate_edge_tts(text: str, output_path: str, tts_config: Dict[str, Any]) -> str:
+    """
+    Generate audio using Edge TTS.
+
+    Args:
+        text: Text to convert.
+        output_path: Where to save the MP3 file.
+        tts_config: TTS config dict.
+
+    Returns:
+        Path to the saved audio file.
+    """
+    edge_config = tts_config.get("edge", {})
+    voice = edge_config.get("voice", DEFAULT_EDGE_VOICE)
+
+    communicate = edge_tts.Communicate(text, voice)
+    await communicate.save(output_path)
+    return output_path
+
+
+# ===========================================================================
+# Provider: ElevenLabs (premium)
+# ===========================================================================
+def _generate_elevenlabs(text: str, output_path: str, tts_config: Dict[str, Any]) -> str:
+    """
+    Generate audio using ElevenLabs.
+
+    Args:
+        text: Text to convert.
+        output_path: Where to save the audio file.
+        tts_config: TTS config dict.
+
+    Returns:
+        Path to the saved audio file.
+    """
+    api_key = os.getenv("ELEVENLABS_API_KEY", "")
+    if not api_key:
+        raise ValueError("ELEVENLABS_API_KEY not set. Get one at https://elevenlabs.io/")
+
+    el_config = tts_config.get("elevenlabs", {})
+    voice_id = el_config.get("voice_id", DEFAULT_ELEVENLABS_VOICE_ID)
+    model_id = el_config.get("model_id", DEFAULT_ELEVENLABS_MODEL_ID)
+
+    # Determine output format based on file extension
+    if output_path.endswith(".ogg"):
+        output_format = "opus_48000_64"
+    else:
+        output_format = "mp3_44100_128"
+
+    client = ElevenLabs(api_key=api_key)
+    audio_generator = client.text_to_speech.convert(
+        text=text,
+        voice_id=voice_id,
+        model_id=model_id,
+        output_format=output_format,
+    )
+
+    # audio_generator yields chunks -- write them all
+    with open(output_path, "wb") as f:
+        for chunk in audio_generator:
+            f.write(chunk)
+
+    return output_path
+
+
+# ===========================================================================
+# Provider: OpenAI TTS
+# ===========================================================================
+def _generate_openai_tts(text: str, output_path: str, tts_config: Dict[str, Any]) -> str:
+    """
+    Generate audio using OpenAI TTS.
+
+    Args:
+        text: Text to convert.
+        output_path: Where to save the audio file.
+        tts_config: TTS config dict.
+
+    Returns:
+        Path to the saved audio file.
+    """
+    api_key = os.getenv("OPENAI_API_KEY", "")
+    if not api_key:
+        raise ValueError("OPENAI_API_KEY not set. Get one at https://platform.openai.com/api-keys")
+
+    oai_config = tts_config.get("openai", {})
+    model = oai_config.get("model", DEFAULT_OPENAI_MODEL)
+    voice = oai_config.get("voice", DEFAULT_OPENAI_VOICE)
+
+    # Determine response format from extension
+    if output_path.endswith(".ogg"):
+        response_format = "opus"
+    else:
+        response_format = "mp3"
+
+    client = OpenAIClient(api_key=api_key)
+    response = client.audio.speech.create(
+        model=model,
+        voice=voice,
+        input=text,
+        response_format=response_format,
+    )
+
+    response.stream_to_file(output_path)
+    return output_path
+
+
+# ===========================================================================
+# Main tool function
+# ===========================================================================
+def text_to_speech_tool(
+    text: str,
+    output_path: Optional[str] = None,
+) -> str:
+    """
+    Convert text to speech audio.
+
+    Reads provider/voice config from ~/.hermes/config.yaml (tts: section).
+    The model sends text; the user configures voice and provider.
+
+    On messaging platforms, the returned MEDIA:<path> tag is intercepted
+    by the send pipeline and delivered as a native voice message.
+    In CLI mode, the file is saved to ~/voice-memos/.
+
+    Args:
+        text: The text to convert to speech.
+        output_path: Optional custom save path. Defaults to ~/voice-memos/tts_<timestamp>.mp3
+
+    Returns:
+        str: JSON result with success, file_path, and optionally MEDIA tag.
+    """
+    if not text or not text.strip():
+        return json.dumps({"success": False, "error": "Text is required"}, ensure_ascii=False)
+
+    # Truncate very long text with a warning
+    if len(text) > MAX_TEXT_LENGTH:
+        print(f"⚠️  TTS text too long ({len(text)} chars), truncating to {MAX_TEXT_LENGTH}")
+        text = text[:MAX_TEXT_LENGTH]
+
+    tts_config = _load_tts_config()
+    provider = _get_provider(tts_config)
+
+    # Determine output path
+    if output_path:
+        file_path = Path(output_path).expanduser()
+    else:
+        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+        out_dir = Path(DEFAULT_OUTPUT_DIR)
+        out_dir.mkdir(parents=True, exist_ok=True)
+        file_path = out_dir / f"tts_{timestamp}.mp3"
+
+    # Ensure parent directory exists
+    file_path.parent.mkdir(parents=True, exist_ok=True)
+    file_str = str(file_path)
+
+    try:
+        # Generate audio with the configured provider
+        if provider == "elevenlabs":
+            if not _HAS_ELEVENLABS:
+                return json.dumps({
+                    "success": False,
+                    "error": "ElevenLabs provider selected but 'elevenlabs' package not installed. Run: pip install elevenlabs"
+                }, ensure_ascii=False)
+            print(f"🔊 Generating speech with ElevenLabs...")
+            _generate_elevenlabs(text, file_str, tts_config)
+
+        elif provider == "openai":
+            if not _HAS_OPENAI:
+                return json.dumps({
+                    "success": False,
+                    "error": "OpenAI provider selected but 'openai' package not installed."
+                }, ensure_ascii=False)
+            print(f"🔊 Generating speech with OpenAI TTS...")
+            _generate_openai_tts(text, file_str, tts_config)
+
+        else:
+            # Default: Edge TTS (free)
+            if not _HAS_EDGE_TTS:
+                return json.dumps({
+                    "success": False,
+                    "error": "Edge TTS not available. Run: pip install edge-tts"
+                }, ensure_ascii=False)
+            print(f"🔊 Generating speech with Edge TTS...")
+            # Edge TTS is async, run it
+            try:
+                loop = asyncio.get_running_loop()
+                import concurrent.futures
+                with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+                    pool.submit(
+                        lambda: asyncio.run(_generate_edge_tts(text, file_str, tts_config))
+                    ).result(timeout=60)
+            except RuntimeError:
+                asyncio.run(_generate_edge_tts(text, file_str, tts_config))
+
+        # Check the file was actually created
+        if not os.path.exists(file_str) or os.path.getsize(file_str) == 0:
+            return json.dumps({
+                "success": False,
+                "error": f"TTS generation produced no output (provider: {provider})"
+            }, ensure_ascii=False)
+
+        # Try Opus conversion for Telegram compatibility (Edge TTS only outputs MP3)
+        voice_compatible = False
+        if provider not in ("elevenlabs", "openai") and file_str.endswith(".mp3"):
+            opus_path = _convert_to_opus(file_str)
+            if opus_path:
+                file_str = opus_path
+                voice_compatible = True
+        elif provider in ("elevenlabs", "openai"):
+            # These providers can output Opus natively if the path ends in .ogg
+            voice_compatible = file_str.endswith(".ogg")
+
+        file_size = os.path.getsize(file_str)
+        print(f"✅ TTS audio saved: {file_str} ({file_size:,} bytes, provider: {provider})")
+
+        # Build response with MEDIA tag for platform delivery
+        media_tag = f"MEDIA:{file_str}"
+        if voice_compatible:
+            media_tag = f"[[audio_as_voice]]\n{media_tag}"
+
+        return json.dumps({
+            "success": True,
+            "file_path": file_str,
+            "media_tag": media_tag,
+            "provider": provider,
+            "voice_compatible": voice_compatible,
+        }, ensure_ascii=False)
+
+    except Exception as e:
+        error_msg = f"TTS generation failed ({provider}): {e}"
+        print(f"❌ {error_msg}")
+        return json.dumps({"success": False, "error": error_msg}, ensure_ascii=False)
+
+
+# ===========================================================================
+# Requirements check
+# ===========================================================================
+def check_tts_requirements() -> bool:
+    """
+    Check if at least one TTS provider is available.
+
+    Edge TTS needs no API key and is the default, so if the package
+    is installed, TTS is available.
+
+    Returns:
+        bool: True if at least one provider can work.
+    """
+    if _HAS_EDGE_TTS:
+        return True
+    if _HAS_ELEVENLABS and os.getenv("ELEVENLABS_API_KEY"):
+        return True
+    if _HAS_OPENAI and os.getenv("OPENAI_API_KEY"):
+        return True
+    return False
+
+
+# ===========================================================================
+# Main -- quick diagnostics
+# ===========================================================================
+if __name__ == "__main__":
+    print("🔊 Text-to-Speech Tool Module")
+    print("=" * 50)
+
+    print(f"\nProvider availability:")
+    print(f"  Edge TTS:   {'✅ installed' if _HAS_EDGE_TTS else '❌ not installed (pip install edge-tts)'}")
+    print(f"  ElevenLabs: {'✅ installed' if _HAS_ELEVENLABS else '❌ not installed (pip install elevenlabs)'}")
+    print(f"    API Key:  {'✅ set' if os.getenv('ELEVENLABS_API_KEY') else '❌ not set'}")
+    print(f"  OpenAI:     {'✅ installed' if _HAS_OPENAI else '❌ not installed'}")
+    print(f"    API Key:  {'✅ set' if os.getenv('OPENAI_API_KEY') else '❌ not set'}")
+    print(f"  ffmpeg:     {'✅ found' if _has_ffmpeg() else '❌ not found (needed for Telegram Opus)'}")
+    print(f"\n  Output dir: {DEFAULT_OUTPUT_DIR}")
+
+    config = _load_tts_config()
+    provider = _get_provider(config)
+    print(f"  Configured provider: {provider}")
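
Review note: the Edge TTS branch has to call an async function from synchronous code that may or may not already be inside an event loop. A minimal sketch of that bridge is below. `run_coro_blocking` and `fake_tts` are hypothetical names used only for illustration, and `fake_tts` stands in for `_generate_edge_tts` so the example runs without `edge_tts` installed.

```python
import asyncio
import concurrent.futures
from typing import Any, Callable, Coroutine


def run_coro_blocking(
    coro_factory: Callable[[], Coroutine[Any, Any, Any]],
    timeout: float = 60.0,
) -> Any:
    """Run a coroutine to completion from sync code, with or without a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: asyncio.run() is safe to call directly.
        return asyncio.run(coro_factory())
    # A loop is already running here, so asyncio.run() would raise.
    # Run the coroutine on a fresh loop in a worker thread instead.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(lambda: asyncio.run(coro_factory()))
        return future.result(timeout=timeout)


async def fake_tts(text: str) -> str:
    await asyncio.sleep(0)
    return text.upper()


# Case 1: plain synchronous caller (e.g. CLI).
sync_result = run_coro_blocking(lambda: fake_tts("hello"))


# Case 2: caller is itself running inside an event loop.
async def caller() -> str:
    return run_coro_blocking(lambda: fake_tts("inside loop"))


async_result = asyncio.run(caller())
```

One difference from the code in the diff: here only the `get_running_loop()` probe sits inside the `try`, so a `RuntimeError` raised by the worker is not mistaken for "no loop running". Note that in the second case the calling loop is blocked until the worker finishes, which matches the behaviour of the diff.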
@@ -69,7 +69,7 @@ TOOLSETS = {
    
    "skills": {
        "description": "Access skill documents with specialized instructions and knowledge",
-        "tools": ["skills_categories", "skills_list", "skill_view"],
+        "tools": ["skills_list", "skill_view"],
        "includes": []
    },
    
@@ -108,6 +108,12 @@ TOOLSETS = {
        "includes": []
    },
    
+    "tts": {
+        "description": "Text-to-speech: convert text to audio with Edge TTS (free), ElevenLabs, or OpenAI",
+        "tools": ["text_to_speech"],
+        "includes": []
+    },
+    
    # Scenario-specific toolsets
    
    "debugging": {
@@ -142,12 +148,14 @@ TOOLSETS = {
            # MoA
            "mixture_of_agents",
            # Skills
-            "skills_categories", "skills_list", "skill_view",
+            "skills_list", "skill_view",
            # Browser
            "browser_navigate", "browser_snapshot", "browser_click",
            "browser_type", "browser_scroll", "browser_back",
            "browser_press", "browser_close", "browser_get_images",
            "browser_vision",
+            # Text-to-speech
+            "text_to_speech",
            # Cronjob management (CLI-only)
            "schedule_cronjob", "list_cronjobs", "remove_cronjob"
        ],
@@ -169,8 +177,12 @@ TOOLSETS = {
            "web_search", "web_extract",
            # Vision - analyze images sent by users
            "vision_analyze",
+            # Image generation
+            "image_generate",
+            # Text-to-speech
+            "text_to_speech",
            # Skills - access knowledge base
-            "skills_categories", "skills_list", "skill_view",
+            "skills_list", "skill_view",
            # Cronjob management - let users schedule tasks
            "schedule_cronjob", "list_cronjobs", "remove_cronjob"
        ],
@@ -178,15 +190,23 @@ TOOLSETS = {
    },
    
    "hermes-discord": {
-        "description": "Discord bot toolset - limited for public server safety (no terminal, no file access)",
+        "description": "Discord bot toolset - full access (terminal has safety checks via dangerous command approval)",
        "tools": [
-            # Web tools - safe for messaging
-            "web_search",
-            # Vision - analyze images
+            # Terminal - enabled with dangerous command approval system
+            "terminal",
+            # File manipulation
+            "read_file", "write_file", "patch", "search",
+            # Web tools
+            "web_search", "web_extract",
+            # Vision - analyze images sent by users
            "vision_analyze",
+            # Image generation
+            "image_generate",
+            # Text-to-speech
+            "text_to_speech",
            # Skills - access knowledge base
-            "skills_categories", "skills_list", "skill_view",
-            # Cronjob - let users schedule reminders
+            "skills_list", "skill_view",
+            # Cronjob management - let users schedule tasks
            "schedule_cronjob", "list_cronjobs", "remove_cronjob"
        ],
        "includes": []
@@ -203,8 +223,12 @@ TOOLSETS = {
            "read_file", "write_file", "patch", "search",
            # Vision
            "vision_analyze",
+            # Image generation
+            "image_generate",
+            # Text-to-speech
+            "text_to_speech",
            # Skills
-            "skills_categories", "skills_list", "skill_view",
+            "skills_list", "skill_view",
            # Cronjob management
            "schedule_cronjob", "list_cronjobs", "remove_cronjob"
        ],
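The entries touched in this diff share one shape: a "description", a flat "tools" list, and an "includes" list. The following is a minimal sketch of how such a dictionary could be expanded into a final tool list. It assumes that "includes" names other toolsets to merge in, which the diff does not confirm. The `resolve_toolset` helper and the "example-composite" entry are hypothetical and exist only for illustration; the "hermes-discord" entry is copied from the hunk above.

```python
from typing import Dict, List, Optional, Set

# "hermes-discord" mirrors the post-change entry in the diff.
# "example-composite" is a hypothetical entry used only to exercise "includes".
TOOLSETS: Dict[str, dict] = {
    "hermes-discord": {
        "description": "Discord bot toolset - full access (terminal has safety checks via dangerous command approval)",
        "tools": [
            "terminal",
            "read_file", "write_file", "patch", "search",
            "web_search", "web_extract",
            "vision_analyze",
            "image_generate",
            "text_to_speech",
            "skills_list", "skill_view",
            "schedule_cronjob", "list_cronjobs", "remove_cronjob",
        ],
        "includes": [],
    },
    "example-composite": {
        "description": "Hypothetical toolset that pulls in another one",
        "tools": ["example_tool"],
        "includes": ["hermes-discord"],
    },
}


def resolve_toolset(name: str, _seen: Optional[Set[str]] = None) -> List[str]:
    """Return de-duplicated tool names for a toolset, following "includes".

    Assumption: "includes" holds names of other toolsets to merge in.
    """
    seen = _seen if _seen is not None else set()
    if name in seen:  # guard against include cycles
        return []
    seen.add(name)

    entry = TOOLSETS[name]
    tools = list(entry.get("tools", []))
    for included in entry.get("includes", []):
        tools.extend(resolve_toolset(included, seen))

    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(tools))
```

With a resolver of this kind, a check such as `"skills_categories" in resolve_toolset("hermes-discord")` would confirm that the removed tool is no longer exposed to the Discord platform.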
Author	SHA1	Message	Date
Shannon Sands	ae6435f787	Env robustness: context-safe prompting + tool arg normalization - Preserve full trajectory while truncating prompt view per turn (avoids context overflow) - Add max_context_tokens support and wire from env config - Normalize tool call arguments robustly (dict / stringified JSON / plain string) - Avoid double-encoding tool arguments in Hermes parser - Add tool-call metrics to AgentResult for debugging/optional shaping Scope: environments/* only	2026-02-14 13:13:00 +10:00
teknium1	84718d183a	Add platform-specific formatting hints and identity for AIAgent - Introduced a default agent identity prompt to ensure consistent behavior across platforms. - Added platform-specific formatting hints for CLI, WhatsApp, Telegram, and Discord to guide the agent's output style. - Updated the AIAgent initialization to accept a platform parameter, enhancing adaptability to different interfaces.	2026-02-12 16:11:16 -08:00
teknium1	3099a2f53c	Add timestamp to active system prompt in AIAgent - Appended the current local date and time to the active system prompt to provide context for the model, addressing potential misinterpretations due to training cutoffs.	2026-02-12 15:59:31 -08:00
teknium1	ed010752dd	Update .env.example to use new Docker, Singularity, and Modal images for Python 3.11 with Node.js 20 support	2026-02-12 10:07:03 -08:00
teknium1	f5be6177b2	Add Text-to-Speech (TTS) functionality with multiple providers - Add tool previews - Add AGENTS and SOUL.md support - Add Exec Approval	2026-02-12 10:05:08 -08:00
teknium	89c6f24d48	Merge branch 'main' of github.com:nousresearch/hermes-agent	2026-02-12 05:38:15 +00:00
teknium	f23856df8e	Add kill_modal script to manage Modal applications and better handling of file and terminal tools - Introduced a new script, `kill_modal.sh`, to facilitate stopping running Modal apps, including the ability to stop all apps or specific swe-rex sandboxes. - Enhanced user experience with clear usage instructions and feedback during the stopping process. - Improved error handling to ensure smooth execution even if some apps fail to stop.	2026-02-12 05:37:14 +00:00
teknium	1b7bc299f3	Enhance TerminalBench2 environment with task filtering for Modal incompatibility and logging improvements - Updated task filter descriptions for clarity and added a new skip task feature to exclude incompatible tasks. - Introduced a set of Modal-incompatible tasks to prevent execution errors in cloud environments. - Implemented streaming JSONL logging for task results, preserving data even on interruptions. - Refactored task evaluation logic to include skipped task reporting and improved error handling.	2026-02-12 05:36:45 +00:00
teknium	a291cc99cf	More extra kwarg support for provider selection, etc., on OpenRouter in agent RL envs and evals	2026-02-12 05:36:25 +00:00
teknium	389ac5e017	Pass extrabody for agentloop to ban and allowlist providers on OpenRouter, control thinking, etc.	2026-02-12 05:35:48 +00:00
nightwing	fc792a4be9	Update Project_notes.md: grailed-embedding-search status and TODOs (June 2025)	2026-02-11 17:54:47 -07:00
nightwing	07501bef14	Add Project_notes.md — centralized status tracker for all side projects	2026-02-11 17:36:18 -07:00
teknium1	137ce05324	Add image generation tool to toolsets for messaging platforms - Included "image_generate" in the toolsets for web, vision, and skills categories, expanding functionality for image-related tasks. - Updated comments for clarity on the new tool's purpose, ensuring users understand its integration within the existing framework.	2026-02-10 21:04:24 -08:00
teknium1	ada0b4f131	Enhance image handling in platform adapters - Updated the image generation function description to clarify usage with markdown. - Added `send_image` method to `BasePlatformAdapter` for native image sending across platforms. - Implemented `send_image` in `DiscordAdapter` and `TelegramAdapter` to handle image attachments directly. - Introduced `extract_images` method to extract image URLs from markdown and HTML, improving content processing. - Enhanced message handling to support sending images as attachments while maintaining text content.	2026-02-10 21:02:40 -08:00
teknium	abe925e212	Update hermes-discord toolset to enable full terminal access with safety checks - Revised the description to reflect full access capabilities, including terminal usage with a dangerous command approval system. - Added terminal and file manipulation tools to the toolset, enhancing functionality for users. - Updated comments for clarity on tool purposes, ensuring better understanding of available features.	2026-02-11 04:44:30 +00:00
teknium1	8fb44608bf	Update SKILL.md and related references to implement container binding for labeled shapes and arrows in Excalidraw - Revised the labeled shape and arrow sections to utilize container binding instead of the deprecated "label" property, ensuring proper text rendering. - Added warnings about the invalidity of the "label" property and emphasized the use of `boundElements` for text elements. - Updated examples in dark-mode and general references to reflect the new binding approach, enhancing clarity and usability for users creating diagrams.	2026-02-10 20:05:23 -08:00
teknium1	153cd5bb44	Refactor skills tool integration and enhance system prompt - Removed the skills_categories tool from the skills toolset, streamlining the skills functionality to focus on skills_list and skill_view. - Updated the system prompt to dynamically build a compact skills index, allowing the model to quickly reference available skills without additional tool calls. - Cleaned up related code and documentation to reflect the removal of skills_categories, ensuring clarity and consistency across the codebase.	2026-02-10 19:48:38 -08:00
teknium1	669545f551	Add diagramming skills for Excalidraw - Introduced a new DESCRIPTION.md file outlining diagram creation skills for visual diagrams and flowcharts using Excalidraw. - Added SKILL.md for the Excalidraw skill, detailing its functionality, usage, and workflow for creating hand-drawn style diagrams. - Created references for color palettes, dark mode diagrams, and example diagrams to assist users in utilizing the Excalidraw skill effectively. - Implemented an upload script for sharing diagrams via Excalidraw.com, ensuring user-friendly access to generated diagrams.	2026-02-10 19:30:46 -08:00
teknium1	cfe2f3fe15	Implement interrupt handling for long-running tool executions in AIAgent - Added functionality to signal and terminate long-running terminal commands when a new user message is received, allowing for immediate agent response. - Introduced a global interrupt event in the terminal tool to facilitate early termination of subprocesses. - Updated the AIAgent class to handle interrupts gracefully, ensuring that remaining tool calls are skipped and appropriate messages are returned to maintain valid message sequences.	2026-02-10 16:34:27 -08:00
teknium1	140d609e0c	Refine agent history conversion logic in GatewayRunner - Enhanced the conversion of message history to agent format by distinguishing between normal and rich agent messages. - Implemented logic to preserve full message structure for tool-related messages, ensuring valid assistant-to-tool sequences. - Simplified handling of simple text messages by stripping unnecessary fields while retaining essential role and content information.	2026-02-10 16:16:30 -08:00
teknium	a32ad1a656	Fix infinite interrupt loop in gateway by consuming pending messages with .pop() and clearing interrupt events before recursion - Added logic to clear the adapter's interrupt event to prevent infinite loops during message processing. - Updated the get_pending_message method to pop messages from the pending queue, ensuring proper message handling.	2026-02-11 00:05:30 +00:00
teknium1	62ba69a29d	Fix gateway exit code to enable systemd auto-restart on connection failure - Updated the start_gateway function to return a boolean indicating success or failure, allowing for better control over exit codes. - Modified the main function to handle gateway startup failures, ensuring systemd can automatically restart on transient errors. - Enhanced error handling in the hermes_cli gateway to exit with code 1 if the gateway fails to connect to any platform.	2026-02-10 16:01:00 -08:00
teknium1	9b0f2a16ca	Enhance CLI functionality with retry and undo commands - Added /retry command to resend the last user message, improving user experience by allowing message re-sending without retyping. - Introduced /undo command to remove the last user/assistant exchange from conversation history, providing better control over conversation flow. - Updated save_config_value function to respect user and project config precedence, enhancing configuration management. - Improved prompt handling and visual output for user input, adapting to terminal width for better readability.	2026-02-10 15:59:46 -08:00
teknium	85e629e915	Add cleanup functionality for orphaned sandboxes in TerminalBench2EvalEnv - Implemented a cleanup process to terminate any remaining sandboxes after evaluation, addressing issues with orphaned thread pool workers. - Enhanced logging to inform users about the cleanup process, ensuring better resource management and user awareness.	2026-02-10 23:48:49 +00:00
teknium	999a28062d	Implement graceful exit cleanup for terminal tool - Added a new `_atexit_cleanup` function to handle cleanup of active environments and stop the cleanup thread upon program exit. - Enhanced logging to inform users about the number of remaining sandboxes being shut down during cleanup.	2026-02-10 22:53:44 +00:00
teknium	ba3fea24f1	Enhance TerminalBench 2 configuration and evaluation handling - Added task_timeout parameter to enforce a maximum wall-clock time for each task, automatically scoring as FAIL if exceeded. - Introduced terminal_timeout and tool_pool_size parameters to improve command execution and concurrency management. - Updated logging to provide detailed task execution times and timeout handling, enhancing overall monitoring. - Removed outdated evaluate_config.yaml file to streamline configuration management.	2026-02-10 22:53:24 +00:00
teknium	6b4a8d0b17	Add terminal configuration options and enhance environment setup - Introduced terminal_timeout and terminal_lifetime parameters to control command execution and sandbox inactivity. - Updated environment variable handling to allow configuration overrides for terminal settings. - Enhanced logging to provide detailed information about terminal settings during initialization. - Added tool_pool_size parameter to dynamically resize the thread pool for tool execution, improving concurrency management.	2026-02-10 22:51:50 +00:00
teknium	5ec75e38b9	Enhance tool execution and logging in HermesAgentLoop - Increased thread pool size for tool execution from 8 to 128 to improve concurrency and prevent starvation. - Added a function to resize the tool executor dynamically based on configuration. - Enhanced logging to track API call durations and tool execution times, including warnings for slow tools. - Improved overall performance monitoring by logging detailed information for each turn in the agent loop.	2026-02-10 22:51:18 +00:00
teknium	ad042fdd68	Update terminalbench_2 configuration for enhanced performance and evaluation - Increased max_token_length from 16000 to 32000 to allow for longer inputs. - Adjusted agent_temperature from 0.6 to 0.8 for more varied responses. - Extended test_timeout from 180 to 600 seconds to accommodate longer evaluations. - Updated data directory path for saving evaluations to ensure proper organization.	2026-02-10 19:48:41 +00:00
teknium	35ad3146a8	Add new environments and enhance tool context functionality - Introduced new environments: Terminal Test Environment and SWE Environment, each with default configurations for testing and software engineering tasks. - Added TerminalBench 2.0 evaluation environment with comprehensive setup for agentic LLMs, including task execution and verification. - Enhanced ToolContext with methods for uploading and downloading files, ensuring binary-safe operations. - Updated documentation across environments to reflect new features and usage instructions. - Refactored existing environment configurations for consistency and clarity.	2026-02-10 19:39:05 +00:00
teknium	e8343f2d87	Refactor Singularity environment for persistent container management - Updated the _SingularityEnvironment class to utilize a persistent Apptainer instance, allowing state (files, installs, environment changes) to persist across commands. - Enhanced the initialization process to start a background instance with full isolation and writable filesystem. - Modified the execute method to connect to the running instance, ensuring commands run within the same container context. - Implemented cleanup functionality to stop the persistent instance on cleanup or destruction, improving resource management. - Updated class documentation to reflect new features and usage of the persistent environment.	2026-02-10 06:49:58 +00:00
teknium	1b1307d0d1	Implement Anthropic prompt caching for Claude models via OpenRouter - Introduced a caching strategy that reduces input token costs by ~75% on multi-turn conversations by caching the conversation prefix. - Added functions to apply cache control markers to messages, enhancing efficiency in token usage. - Updated AIAgent to auto-enable prompt caching for Claude models, with configurable cache TTL. - Enhanced logging to track cache hit statistics when caching is active, improving monitoring of token usage.	2026-02-10 06:49:41 +00:00
teknium	7a11be9f3f	Enhance browser tool functionality and cleanup process - Added checks for local installation of the agent-browser CLI in the `_find_agent_browser` function, improving installation guidance. - Implemented per-task socket directory management in `_run_browser_command` to prevent concurrency issues. - Updated `cleanup_browser` to remove per-task socket directories, ensuring proper resource cleanup after task completion. - Refactored comments for clarity and improved documentation throughout the browser tool code.	2026-02-09 04:36:37 +00:00