Env robustness: context-safe prompting + tool arg normalization

- Preserve full trajectory while truncating prompt view per turn (avoids context overflow)
- Add max_context_tokens support and wire from env config
- Normalize tool call arguments robustly (dict / stringified JSON / plain string)
- Avoid double-encoding tool arguments in Hermes parser
- Add tool-call metrics to AgentResult for debugging/optional shaping

Scope: environments/* only
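
A rough sketch of two of the behaviours listed above, for illustration only. It is not the code from environments/*: `normalize_tool_args`, `prompt_view`, the `"input"` fallback key, and the 4-characters-per-token estimate are all assumptions made for this example. The first helper accepts a dict, a stringified JSON object, or a plain string and always returns a dict, unwrapping one level of double encoding. The second builds a truncated copy of the messages for the next prompt and leaves the full trajectory untouched.

```python
import json
from typing import Any, Dict, List


def normalize_tool_args(raw: Any) -> Dict[str, Any]:
    """Coerce tool-call arguments into a dict (hypothetical helper)."""
    if isinstance(raw, dict):
        return raw
    if raw is None:
        return {}
    if not isinstance(raw, str):
        return {"input": raw}
    text = raw.strip()
    if not text:
        return {}
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON at all: keep the plain string under an assumed key.
        return {"input": raw}
    if isinstance(parsed, str):
        # A doubly encoded payload decodes to another string; unwrap once more.
        try:
            parsed = json.loads(parsed)
        except json.JSONDecodeError:
            return {"input": parsed}
    return parsed if isinstance(parsed, dict) else {"input": parsed}


def prompt_view(messages: List[Dict[str, Any]], max_context_tokens: int) -> List[Dict[str, Any]]:
    """Return a truncated copy for the next prompt; `messages` is not modified."""

    def cost(msg: Dict[str, Any]) -> int:
        # Crude estimate (assumption): roughly 4 characters per token.
        return max(1, len(str(msg.get("content") or "")) // 4)

    if not messages:
        return []
    head, tail = messages[:1], messages[1:]
    budget = max_context_tokens - cost(head[0])
    kept: List[Dict[str, Any]] = []
    # Walk backwards so the most recent turns are kept first.
    for msg in reversed(tail):
        c = cost(msg)
        if c > budget:
            break
        kept.append(msg)
        budget -= c
    return head + kept[::-1]
```

A real implementation would likely count tokens with the model's tokenizer and keep each tool call together with its tool response when dropping older turns. This sketch does neither.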
Add platform-specific formatting hints and identity for AIAgent
2026-02-14 13:13:00 +10:00 · 2026-02-12 16:11:16 -08:00 · 2026-02-12 15:59:31 -08:00 · 2026-02-12 10:07:03 -08:00 · 2026-02-12 10:05:08 -08:00 · 2026-02-12 05:38:15 +00:00
79 changed files with 9685 additions and 1811 deletions
--- a/.env.example
+++ b/.env.example
@@ -10,8 +10,8 @@
 OPENROUTER_API_KEY=

 # Default model to use (OpenRouter format: provider/model)
-# Examples: anthropic/claude-sonnet-4, openai/gpt-4o, google/gemini-2.0-flash, zhipuai/glm-4-plus
-LLM_MODEL=anthropic/claude-sonnet-4
+# Examples: anthropic/claude-opus-4.6, openai/gpt-4o, google/gemini-2.0-flash, zhipuai/glm-4-plus
+LLM_MODEL=anthropic/claude-opus-4.6

 # =============================================================================
 # TOOL API KEYS
@@ -42,12 +42,16 @@ TERMINAL_ENV=local


 # Container images (for singularity/docker/modal backends)
-TERMINAL_DOCKER_IMAGE=python:3.11
-TERMINAL_SINGULARITY_IMAGE=docker://python:3.11
-TERMINAL_MODAL_IMAGE=python:3.11
+TERMINAL_DOCKER_IMAGE=nikolaik/python-nodejs:python3.11-nodejs20
+TERMINAL_SINGULARITY_IMAGE=docker://nikolaik/python-nodejs:python3.11-nodejs20
+TERMINAL_MODAL_IMAGE=nikolaik/python-nodejs:python3.11-nodejs20

-# Working directory inside the container
-TERMINAL_CWD=/tmp
+
+# Working directory for terminal commands
+# For CLI: "." means current directory (resolved automatically from config.yaml)
+# For containers (docker/singularity/modal): absolute path inside the container
+# Usually managed by config.yaml (terminal.cwd) — uncomment to override
+# TERMINAL_CWD=.

 # Default command timeout in seconds
 TERMINAL_TIMEOUT=60
--- a/.gitignore
+++ b/.gitignore
@@ -39,6 +39,10 @@ agent-browser/
 *.pem
 privvy*
 images/
+__pycache__/
+hermes_agent.egg-info/
+wandb/
+testlogs

 # CLI config (may contain sensitive SSH paths)
 cli-config.yaml
--- a/Project_notes.md
+++ b/Project_notes.md
@@ -0,0 +1,142 @@
+# Project Notes
+
+*Maintained by Hermes — last updated June 2025*
+
+---
+
+## 1. Kandinsky (Multimodal Transformer)
+- **Repo:** https://github.com/samherring99/kandinsky
+- **Local path:** `~/Desktop/Projects/kandinsky`
+- **Description:** An anything-to-anything transformer combining text, image, and audio modalities. Trains on Pokemon BLIP captions paired with Gen 1 Pokemon audio cries. Uses audio tokenization adapted from nanoGPT.
+- **Status:** Early POC. Training code exists (`model.py`) and dataset creation (`create_dataset.py`) works. Audio heads are producing the same sound — unclear if it's a training issue or data issue.
+- **TODO:**
+  - Debug why audio heads produce identical output
+  - Investigate if model needs more training time
+  - Design a data pipeline for better/more training data
+  - General repo cleanup (requirements.txt, proper CLI, etc.)
+
+---
+
+## 2. NightwingGameSim (LLM → GameBoy ROM Generator)
+- **Repo:** https://github.com/samherring99/NightwingGameSim
+- **Local path:** `~/Desktop/Projects/NightwingGameSim`
+- **Description:** AI-powered pipeline that turns natural language prompts into playable GameBoy ROM files. Generates C code, compiles with GBDK, outputs `.gb` files. Supports Claude API, local Llama, and RAG backends.
+- **Status:** Functional — generation pipeline works end-to-end with Claude 4 system prompt. Has tests, docs, examples, and retry logic.
+- **TODO:**
+  - Harden the repo, clean up structure
+  - Build a better testing pipeline
+  - Come up with better prompt ideas / examples
+
+---
+
+## 3. ContentBasedMIR (Music Information Retrieval)
+- **Repo:** https://github.com/samherring99/ContentBasedMIR
+- **Local path:** `~/Desktop/Projects/ContentBasedMIR`
+- **Description:** Music similarity analysis using Spotify API track data. Extracts 54 audio features per song and visualizes similarity matrices for music recommendation.
+- **Status:** Early stage. Can download Spotify track analysis data and plot similarity matrices. Needs significant expansion.
+- **TODO:**
+  - Expand analysis pipeline with more features
+  - Integrate with text message data for personalized recommendations
+  - Build out visualization and exploration tools
+  - General modernization (dependencies, structure)
+
+---
+
+## 4. MessageRetrieval (iMessage RAG/SQL)
+- **Repo:** https://github.com/samherring99/MessageRetrieval
+- **Local path:** `~/Desktop/Projects/MessageRetrieval`
+- **Description:** Natural language querying over iMessage data using SQL generation (text2SQL) instead of vector embeddings. Uses LLM-as-Judge pattern for scoring and ranking retrieved messages.
+- **Status:** Has initial text2SQL pipeline and summarization tool. Recently worked on with Claude Code. Needs testing.
+- **TODO:**
+  - Test out the recent Claude Code work
+  - Build "iMessage Jarvis" — answer questions about texts
+  - Improve SQL generation prompts and accuracy
+  - Better error handling and UX
+
+---
+
+## 5. Grailed Embedding Search
+- **Repo:** https://github.com/samherring99/grailed-embedding-search
+- **Local path:** `~/Desktop/Projects/grailed-embedding-search`
+- **Description:** Semantic similarity search over Grailed fashion listings using CLIP embeddings and FAISS. Search by image URL or text description to find visually similar products.
+- **Status:** Functional core pipeline. CLIP ViT-B/32 embeds product cover photos into 512-dim vectors, indexed with FAISS cosine similarity. Has CLI, batch embedding, persistent index save/load, and logging.
+- **Recent work (June 2025):**
+  - PR #1 — Initial cleanup: docstrings, type hints, `.gitignore`, `requirements.txt`, README rewrite
+  - PR #2 — Feature improvements: persistent FAISS save/load, batch embedding, CLI (`cli.py`), proper logging throughout, lazy Grailed client, `fetch_details` toggle
+- **TODO:**
+  - Embedding cache (avoid re-embedding known product URLs)
+  - Async/threaded image downloads for faster batch indexing
+  - Search result visualization (matplotlib grid of cover photos)
+  - Filter by category, designer, price range before search
+  - Web UI (Gradio or Streamlit)
+
+---
+
+## 6. NightwingNBA (Sports Analytics)
+- **Repo:** https://github.com/samherring99/NightwingNBA
+- **Local path:** `~/Desktop/Projects/NightwingNBA`
+- **Description:** NBA game prediction system. Builds a database of game data, trains a PyTorch model, and makes daily predictions. Has full pipeline: build DB → write data → train → predict.
+- **Status:** Functional pipeline exists. Has database building, training, prediction, and daily update scripts.
+- **TODO:**
+  - Explore and potentially revive
+  - Update data sources if stale
+  - Improve model accuracy
+  - Add visualization/reporting
+
+---
+
+## 7. Stable Audio Sample Explorer
+- **Repo:** https://github.com/samherring99/stable-audio-sample-explorer
+- **Local path:** `~/Desktop/Projects/stable-audio-sample-explorer`
+- **Description:** Tool for exploring audio samples generated by Stable Audio.
+- **Status:** 🪦 **Dead** — no active work needed per Sam.
+
+---
+
+## 8. NightwingArt (Art Tools)
+- **Repo:** https://github.com/samherring99/NightwingArt
+- **Local path:** `~/Desktop/Projects/NightwingArt`
+- **Description:** Collection of art tooling scripts — video editing, clip splicing with beat matching, damage effects, and general image manipulation.
+- **Status:** Maintenance mode. Tools exist for various effects. Work happens as-needed.
+- **TODO:**
+  - Add tools as needed for new art projects
+
+---
+
+## 9. Claude-based VST Building ⚠️ *Needs new repo*
+- **Description:** Generate VST audio plugins for DAWs from English language prompts. LLM-powered audio plugin creation.
+- **Status:** Concept only — no repo exists yet.
+- **TODO:**
+  - Create repo
+  - Research VST SDK / JUCE framework
+  - Design prompt → code → compile pipeline
+
+---
+
+## 10. Government Auction Site Scraper ⚠️ *Needs new repo*
+- **Description:** Tool that monitors and scrapes government auction sites in San Francisco for deals.
+- **Status:** Concept only — no repo exists yet.
+- **TODO:**
+  - Create repo
+  - Research SF government auction sites and their structure
+  - Build scraper + notification system
+
+---
+
+## Priority Assessment
+
+| Project | Activity Level | Suggested Priority |
+|---------|---------------|-------------------|
+| NightwingGameSim | Active | 🔴 High |
+| MessageRetrieval | Active | 🔴 High |
+| Kandinsky | Active | 🟡 Medium |
+| ContentBasedMIR | Exploratory | 🟡 Medium |
+| Grailed Embedding Search | Early | 🟡 Medium |
+| NightwingNBA | Dormant | 🟢 Low |
+| NightwingArt | As-needed | 🟢 Low |
+| VST Builder | Concept | 🔵 Future |
+| Gov Auction Scraper | Concept | 🔵 Future |
+| Stable Audio Explorer | Dead | ⚫ None |
+
+
+
--- a/README.md
+++ b/README.md
@@ -15,11 +15,13 @@ irm https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/ins
 ```

 The installer will:
- Clone to `~/.hermes-agent` (with submodules: mini-swe-agent, tinker-atropos)
- Create a virtual environment
- Install all dependencies
+- Install [uv](https://docs.astral.sh/uv/) (fast Python package manager) if not present
+- Install Python 3.11 via uv if not already available (no sudo needed)
+- Clone to `~/.hermes/hermes-agent` (with submodules: mini-swe-agent, tinker-atropos)
+- Create a virtual environment with Python 3.11
+- Install all dependencies and submodule packages
+- Symlink `hermes` into `~/.local/bin` so it works globally (no venv activation needed)
 - Run the interactive setup wizard
- Add `hermes` to your PATH

 After installation, reload your shell and run:
 ```bash
@@ -35,8 +37,9 @@ All your settings are stored in `~/.hermes/` for easy access:

 ```
 ~/.hermes/
-├── config.yaml     # Settings (model, terminal, compression, etc.)
+├── config.yaml     # Settings (model, terminal, TTS, compression, etc.)
 ├── .env            # API keys and secrets
+├── SOUL.md         # Optional: global persona (agent embodies this personality)
 ├── cron/           # Scheduled jobs
 ├── sessions/       # Gateway sessions
 └── logs/           # Logs
@@ -64,16 +67,18 @@ You need at least one LLM provider:
 | Provider | Get Key | Env Variable |
 |----------|---------|--------------|
 | **OpenRouter** (recommended) | [openrouter.ai/keys](https://openrouter.ai/keys) | `OPENROUTER_API_KEY` |
-| Anthropic | [console.anthropic.com](https://console.anthropic.com/) | `ANTHROPIC_API_KEY` |
-| OpenAI | [platform.openai.com](https://platform.openai.com/api-keys) | `OPENAI_API_KEY` |
+

 ### Optional API Keys

 | Feature | Provider | Env Variable |
 |---------|----------|--------------|
+| Custom OpenAI Endpoint (OAI or VLLM/SGLANG) | [platform.openai.com](https://platform.openai.com/api-keys) | `OPENAI_API_KEY` |
 | Web scraping | [Firecrawl](https://firecrawl.dev/) | `FIRECRAWL_API_KEY` |
 | Browser automation | [Browserbase](https://browserbase.com/) | `BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID` |
 | Image generation | [FAL](https://fal.ai/) | `FAL_KEY` |
+| Premium TTS voices | [ElevenLabs](https://elevenlabs.io/) | `ELEVENLABS_API_KEY` |
+| OpenAI TTS voices | [OpenAI](https://platform.openai.com/api-keys) | `OPENAI_API_KEY` |
 | RL Training | [Tinker](https://tinker-console.thinkingmachines.ai/) + [WandB](https://wandb.ai/) | `TINKER_API_KEY`, `WANDB_API_KEY` |
 | Messaging | Telegram, Discord | `TELEGRAM_BOT_TOKEN`, `DISCORD_BOT_TOKEN` |

@@ -126,7 +131,58 @@ hermes --toolsets "web,terminal"
 hermes --list-tools
 ```

-**Available toolsets:** `web`, `terminal`, `browser`, `vision`, `creative`, `reasoning`, `skills`, `cronjob`, and more.
+**Available toolsets:** `web`, `terminal`, `browser`, `vision`, `creative`, `reasoning`, `skills`, `tts`, `cronjob`, and more.
+
+### 🔊 Text-to-Speech
+
+Convert text to speech with three providers:
+
+| Provider | Quality | Cost | API Key |
+|----------|---------|------|---------|
+| **Edge TTS** (default) | Good | Free | None needed |
+| **ElevenLabs** | Excellent | Paid | `ELEVENLABS_API_KEY` |
+| **OpenAI TTS** | Good | Paid | `OPENAI_API_KEY` |
+
+On Telegram, audio plays as native voice bubbles. On Discord/WhatsApp, sent as audio files. In CLI mode, saved to `~/voice-memos/`.
+
+**Configure in `~/.hermes/config.yaml`:**
+```yaml
+tts:
+  provider: "edge"              # "edge" | "elevenlabs" | "openai"
+  edge:
+    voice: "en-US-AriaNeural"   # 322 voices, 74 languages
+  elevenlabs:
+    voice_id: "pNInz6obpgDQGcFmaJgB"  # Adam
+    model_id: "eleven_multilingual_v2"
+  openai:
+    model: "gpt-4o-mini-tts"
+    voice: "alloy"              # alloy, echo, fable, onyx, nova, shimmer
+```
+
+> **Note:** Telegram voice bubbles require `ffmpeg` for Opus conversion (Edge TTS only outputs MP3). Install with `apt install ffmpeg` or `brew install ffmpeg`. Without ffmpeg, audio is sent as a file instead of a voice bubble.
+
+### 📄 Context Files (SOUL.md, AGENTS.md, .cursorrules)
+
+Drop these files in your project directory and the agent automatically picks them up:
+
+| File | Purpose |
+|------|---------|
+| `AGENTS.md` | Project-specific instructions, coding conventions, tool usage guidelines |
+| `SOUL.md` | Persona definition -- the agent embodies this personality and tone |
+| `.cursorrules` | Cursor IDE rules (also detected) |
+| `.cursor/rules/*.mdc` | Cursor rule files (also detected) |
+
+- **AGENTS.md** is hierarchical: if subdirectories also have `AGENTS.md`, all are combined (like Codex/Cline).
+- **SOUL.md** checks cwd first, then `~/.hermes/SOUL.md` as a global fallback.
+- All context files are capped at 20,000 characters with smart truncation.
+
+### 🛡️ Exec Approval (Messaging Platforms)
+
+When the agent tries to run a potentially dangerous command (rm -rf, chmod 777, etc.) on Telegram/Discord/WhatsApp, instead of blocking it silently, it asks the user for approval:
+
+> ⚠️ This command is potentially dangerous (recursive delete). Reply "yes" to approve.
+
+Reply "yes"/"y" to approve or "no"/"n" to deny. In CLI mode, the existing interactive approval prompt (once/session/always/deny) is preserved.

 ### 🖥️ Terminal Backend

@@ -179,8 +235,8 @@ hermes config set terminal.singularity_image ~/python.sif

 **Modal** (serverless cloud):
 ```bash
-pip install modal boto3
-modal setup  # Authenticate
+uv pip install "swe-rex[modal]"   # Installs swe-rex + modal + boto3
+modal setup                    # Authenticate with Modal
 hermes config set terminal.backend modal
 ```

@@ -275,16 +331,19 @@ See [docs/messaging.md](docs/messaging.md) for WhatsApp and advanced setup.

 Train language models with reinforcement learning using the Tinker API and Atropos framework.

+> **Note:** RL training tools require **Python 3.11+** (the upstream `tinker` package has this requirement). On Python 3.10, the RL toolset will be automatically disabled — all other features work fine.
+
 #### Requirements

-1. **API Keys:** Add to `~/.hermes/.env`:
+1. **Python 3.11+** (check with `python3 --version`)
+2. **API Keys:** Add to `~/.hermes/.env`:
 ```bash
 TINKER_API_KEY=your-tinker-key      # Get from https://tinker-console.thinkingmachines.ai/keys
 WANDB_API_KEY=your-wandb-key        # Get from https://wandb.ai/authorize
 OPENROUTER_API_KEY=your-key         # Optional: for rl_test_inference
 ```

-2. **That's it!** tinker-atropos is included as a submodule - no separate installation needed.
+3. **That's it!** tinker-atropos is included as a submodule — the installer handles it automatically.

 #### Using RL Tools

@@ -320,6 +379,94 @@ For extended RL workflows with longer timeouts:
 python rl_cli.py --model "anthropic/claude-sonnet-4-20250514"
 ```

+### 🧪 Atropos RL Environments
+
+Hermes-Agent integrates with the [Atropos](https://github.com/NousResearch/atropos) RL framework through a layered environment system. This allows training models with reinforcement learning on agentic tasks using hermes-agent's tools.
+
+#### Architecture
+
+The integration has three layers:
+
+| Layer | File | Purpose |
+|-------|------|---------|
+| **Agent Loop** | `environments/agent_loop.py` | Reusable multi-turn tool-calling engine (standard OpenAI spec) |
+| **Base Environment** | `environments/hermes_base_env.py` | Abstract Atropos `BaseEnv` subclass with toolset resolution, ToolContext, scoring |
+| **Concrete Envs** | `environments/terminal_test_env.py`, `environments/hermes_swe_env.py` | Task-specific environments |
+
+#### Two-Phase Operation
+
+- **Phase 1 (OpenAI server type)**: Works with any OpenAI-compatible endpoint (VLLM, SGLang, OpenRouter, OpenAI API). The server handles tool call parsing natively. Good for **SFT data generation**, **verifier testing**, and **evaluation**.
+- **Phase 2 (VLLM server type)**: Uses ManagedServer for exact token IDs + logprobs via `/generate`. Client-side tool call parser registry reconstructs structured `tool_calls` from raw output. Required for **full RL training**.
+
+#### Quick Start
+
+```bash
+# 1. Launch VLLM with tool parser
+vllm serve YourModel --tool-parser hermes
+
+# 2. Start the Atropos API server
+run-api
+
+# 3. Run an environment
+python environments/terminal_test_env.py serve \
+    --openai.base_url http://localhost:8000/v1 \
+    --openai.model_name YourModel \
+    --openai.server_type openai
+```
+
+#### ToolContext (Reward Functions)
+
+Reward functions receive a `ToolContext` with unrestricted access to all hermes-agent tools, scoped to the rollout's sandbox:
+
+```python
+async def compute_reward(self, item, result, ctx: ToolContext) -> float:
+    # Run tests in the model's terminal sandbox
+    test = ctx.terminal("pytest -v")
+    if test["exit_code"] == 0:
+        return 1.0
+    # Or check a file, search the web, navigate a browser...
+    return 0.0
+```
+
+#### Creating Custom Environments
+
+Subclass `HermesAgentBaseEnv` and implement 5 methods:
+
+```python
+from environments.hermes_base_env import HermesAgentBaseEnv
+
+class MyEnv(HermesAgentBaseEnv):
+    name = "my-env"
+    async def setup(self): ...            # Load data
+    async def get_next_item(self): ...    # Return next item
+    def format_prompt(self, item): ...    # Item -> prompt string
+    async def compute_reward(self, item, result, ctx): ...  # Score with ToolContext
+    async def evaluate(self, *args, **kwargs): ...          # Periodic eval
+
+if __name__ == "__main__":
+    MyEnv.cli()
+```
+
+#### Toolset Distributions
+
+Configure which tools are available per group, either explicitly or probabilistically:
+
+```bash
+# Explicit toolsets
+--env.enabled_toolsets '["terminal","file","web"]'
+
+# Probabilistic distribution (sampled per group)
+--env.distribution development
+```
+
+#### Tool Call Parsers (Phase 2)
+
+For VLLM server type, a parser registry extracts structured `tool_calls` from raw model output. Supported parsers: `hermes`, `mistral`, `llama3_json`, `qwen`, `deepseek_v3`, `deepseek_v3_1`, `kimi_k2`, `longcat`, `glm45`, `glm47`, `qwen3_coder`.
+
+```bash
+--env.tool_call_parser hermes  # Match your VLLM --tool-parser flag
+```
+
 ### ⏰ Scheduled Tasks (Cron)

 Schedule tasks to run automatically:
@@ -425,26 +572,332 @@ skills/

 ## Manual Installation

-If you prefer not to use the installer:
+If you prefer full control over the installation process (or the quick-install script doesn't suit your environment), follow these steps to set everything up by hand.
+
+### Prerequisites
+
+| Requirement | Minimum Version | Check Command | Notes |
+|-------------|----------------|---------------|-------|
+| **Git** | Any recent | `git --version` | Required |
+| **Node.js** | 18+ | `node --version` | Optional — needed for browser automation tools |
+| **ripgrep** | Any | `rg --version` | Optional — faster file search in terminal tool (falls back to grep) |
+
+> **Note:** Python and pip are **not** prerequisites. The installer uses [uv](https://docs.astral.sh/uv/) to provision Python 3.11 automatically (no sudo needed). If you already have Python 3.11+ installed, uv will use it.
+
+<details>
+<summary><strong>Installing prerequisites by platform</strong></summary>
+
+**Ubuntu / Debian:**
+```bash
+sudo apt update && sudo apt install git
+# Optional:
+sudo apt install ripgrep nodejs npm
+```
+
+**macOS (Homebrew):**
+```bash
+brew install git
+# Optional:
+brew install ripgrep node
+```
+
+**Windows (WSL recommended):**
+Use the [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/install) and follow the Ubuntu instructions above. Alternatively, use the PowerShell quick-install script at the top of this README.
+
+</details>
+
+---
+
+### Step 1: Clone the Repository
+
+Clone with `--recurse-submodules` to pull the required submodules ([mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) for the terminal tool backend and [tinker-atropos](https://github.com/nousresearch/tinker-atropos) for RL training):

 ```bash
-# Clone the repository (with submodules)
+git clone --recurse-submodules https://github.com/NousResearch/hermes-agent.git
+cd hermes-agent
+```
+
+If you already cloned without `--recurse-submodules`, initialize them manually:
+```bash
+git submodule update --init --recursive
+```
+
+---
+
+### Step 2: Install uv & Create Virtual Environment
+
+[uv](https://docs.astral.sh/uv/) is a fast Python package manager that can also provision Python itself. Install it and create the venv in one go:
+
+```bash
+# Install uv (if not already installed)
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Create venv with Python 3.11 (uv downloads it if not present — no sudo needed)
+uv venv venv --python 3.11
+```
+
+> **Tip:** You do **not** need to activate the venv to use `hermes`. The entry point has a hardcoded shebang pointing to the venv Python, so it works globally once symlinked (see Step 8). For installing packages, uv can target the venv directly via `VIRTUAL_ENV`.
+
+---
+
+### Step 3: Install Python Dependencies
+
+Install the main package in editable mode with all optional extras (messaging, cron, CLI menus, modal):
+
+```bash
+# Tell uv which venv to install into
+export VIRTUAL_ENV="$(pwd)/venv"
+
+# Install with all extras
+uv pip install -e ".[all]"
+```
+
+If you only want the core agent (no Telegram/Discord/cron support):
+```bash
+uv pip install -e "."
+```
+
+<details>
+<summary><strong>Optional extras breakdown</strong></summary>
+
+| Extra | What it adds | Install command |
+|-------|-------------|-----------------|
+| `all` | Everything below | `uv pip install -e ".[all]"` |
+| `messaging` | Telegram & Discord gateway | `uv pip install -e ".[messaging]"` |
+| `cron` | Cron expression parsing for scheduled tasks | `uv pip install -e ".[cron]"` |
+| `cli` | Terminal menu UI for setup wizard | `uv pip install -e ".[cli]"` |
+| `modal` | Modal cloud execution backend (swe-rex + modal + boto3) | `uv pip install -e ".[modal]"` |
+| `dev` | pytest & test utilities | `uv pip install -e ".[dev]"` |
+
+You can combine extras: `uv pip install -e ".[messaging,cron]"`
+
+</details>
+
+---
+
+### Step 4: Install Submodule Packages
+
+These are local packages checked out as Git submodules. Install them in editable mode:
+
+```bash
+# Terminal tool backend (required for the terminal/command-execution tool)
+uv pip install -e "./mini-swe-agent"
+
+# RL training backend
+uv pip install -e "./tinker-atropos"
+```
+
+Both are optional — if you skip them, the corresponding toolsets simply won't be available.
+
+---
+
+### Step 5: Install Node.js Dependencies (Optional)
+
+Only needed if you plan to use the **browser automation** toolset (Browserbase-powered):
+
+```bash
+npm install
+```
+
+This installs the `agent-browser` package defined in `package.json`. Skip this step if you don't need browser tools.
+
+---
+
+### Step 6: Create the Configuration Directory
+
+Hermes stores all user configuration in `~/.hermes/`:
+
+```bash
+# Create the directory structure
+mkdir -p ~/.hermes/{cron,sessions,logs}
+
+# Copy the example config file
+cp cli-config.yaml.example ~/.hermes/config.yaml
+
+# Create an empty .env file for API keys
+touch ~/.hermes/.env
+```
+
+Your `~/.hermes/` directory should now look like:
+```
+~/.hermes/
+├── config.yaml     # Agent settings (model, terminal, toolsets, compression, etc.)
+├── .env            # API keys and secrets (one per line: KEY=value)
+├── cron/           # Scheduled job data
+├── sessions/       # Messaging gateway sessions
+└── logs/           # Conversation logs
+```
+
+---
+
+### Step 7: Add Your API Keys
+
+Open `~/.hermes/.env` in your editor and add at minimum an LLM provider key:
+
+```bash
+# Required — at least one LLM provider:
+OPENROUTER_API_KEY=sk-or-v1-your-key-here
+
+# Optional — enable additional tools:
+FIRECRAWL_API_KEY=fc-your-key          # Web search & scraping
+BROWSERBASE_API_KEY=bb-your-key        # Browser automation
+BROWSERBASE_PROJECT_ID=your-project-id # Browser automation
+FAL_KEY=your-fal-key                   # Image generation (FLUX)
+TINKER_API_KEY=your-tinker-key         # RL training
+WANDB_API_KEY=your-wandb-key           # RL training metrics
+
+# Optional — messaging gateway:
+TELEGRAM_BOT_TOKEN=123456:ABC-DEF      # From @BotFather
+TELEGRAM_ALLOWED_USERS=your-user-id    # Comma-separated
+DISCORD_BOT_TOKEN=MTIz...              # From Developer Portal
+DISCORD_ALLOWED_USERS=your-user-id     # Comma-separated
+```
+
+Or set them one at a time via the CLI:
+```bash
+hermes config set OPENROUTER_API_KEY sk-or-v1-your-key-here
+```
+
+---
+
+### Step 8: Add `hermes` to Your PATH
+
+The `hermes` entry point at `venv/bin/hermes` has a hardcoded shebang pointing to the venv's Python, so it works **without activating the venv**. The recommended approach is a symlink into `~/.local/bin` (most distributions already have this on PATH):
+
+```bash
+mkdir -p ~/.local/bin
+ln -sf "$(pwd)/venv/bin/hermes" ~/.local/bin/hermes
+```
+
+If `~/.local/bin` isn't on your PATH yet, add it:
+
+**Bash** (`~/.bashrc`):
+```bash
+echo '' >> ~/.bashrc
+echo '# Hermes Agent' >> ~/.bashrc
+echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
+source ~/.bashrc
+```
+
+**Zsh** (`~/.zshrc`):
+```bash
+echo '' >> ~/.zshrc
+echo '# Hermes Agent' >> ~/.zshrc
+echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
+source ~/.zshrc
+```
+
+**Fish** (`~/.config/fish/config.fish`):
+```fish
+fish_add_path $HOME/.local/bin
+```
+
+---
+
+### Step 9: Run the Setup Wizard (Optional)
+
+The interactive setup wizard walks you through configuring your API keys and preferences:
+
+```bash
+hermes setup
+```
+
+This is optional if you already configured `~/.hermes/.env` and `~/.hermes/config.yaml` manually in the steps above.
+
+---
+
+### Step 10: Verify the Installation
+
+```bash
+# Check that the command is available
+hermes version
+
+# Run diagnostics to verify everything is working
+hermes doctor
+
+# Check your configuration
+hermes status
+
+# Test with a quick query
+hermes chat -q "Hello! What tools do you have available?"
+```
+
+If `hermes doctor` reports issues, it will tell you exactly what's missing and how to fix it.
+
+---
+
+### Quick-Reference: Manual Install (Condensed)
+
+For those who just want the commands without the explanations:
+
+```bash
+# Install uv (if not already installed)
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Clone & enter
 git clone --recurse-submodules https://github.com/NousResearch/hermes-agent.git
 cd hermes-agent

-# Run setup script
-./setup-hermes.sh
+# Create venv with Python 3.11 (uv downloads it if needed)
+uv venv venv --python 3.11
+export VIRTUAL_ENV="$(pwd)/venv"

-# Or manually:
-python3 -m venv venv
-source venv/bin/activate
-pip install -e ".[all]"
+# Install everything
+uv pip install -e ".[all]"
+uv pip install -e "./mini-swe-agent"
+uv pip install -e "./tinker-atropos"
+npm install  # optional, for browser tools

-# Install submodules (required for terminal and RL tools)
-pip install -e "./mini-swe-agent"    # Terminal tool backend
-pip install -e "./tinker-atropos"    # RL training backend
+# Configure
+mkdir -p ~/.hermes/{cron,sessions,logs}
+cp cli-config.yaml.example ~/.hermes/config.yaml
+touch ~/.hermes/.env
+echo 'OPENROUTER_API_KEY=sk-or-v1-your-key' >> ~/.hermes/.env

-hermes setup
+# Make hermes available globally (no venv activation needed)
+mkdir -p ~/.local/bin
+ln -sf "$(pwd)/venv/bin/hermes" ~/.local/bin/hermes
+
+# Verify
+hermes doctor
+hermes
+```
+
+---
+
+### Updating a Manual Installation
+
+To update an existing manual install to the latest version:
+
+```bash
+cd /path/to/hermes-agent
+export VIRTUAL_ENV="$(pwd)/venv"
+
+# Pull latest code and submodules
+git pull origin main
+git submodule update --init --recursive
+
+# Reinstall (picks up new dependencies)
+uv pip install -e ".[all]"
+uv pip install -e "./mini-swe-agent"
+uv pip install -e "./tinker-atropos"
+
+# Check for new config options added since your last update
+hermes config check
+hermes config migrate   # Interactively add any missing options
+```
+
+### Uninstalling a Manual Installation
+
+```bash
+# Remove the hermes symlink
+rm -f ~/.local/bin/hermes
+
+# Remove the cloned repository
+rm -rf /path/to/hermes-agent
+
+# Remove user configuration (optional — keep if you plan to reinstall)
+rm -rf ~/.hermes
 ```

 ---
--- a/pycache/model_tools.cpython-310.pyc
+++ b/pycache/model_tools.cpython-310.pyc
--- a/pycache/web_tools.cpython-310.pyc
+++ b/pycache/web_tools.cpython-310.pyc
--- a/batch_runner.py
+++ b/batch_runner.py
@@ -41,24 +41,17 @@ from toolset_distributions import (
    sample_toolsets_from_distribution,
    validate_distribution
 )
+from model_tools import TOOL_TO_TOOLSET_MAP


 # Global configuration for worker processes
 _WORKER_CONFIG = {}

-# All possible tools - used to ensure consistent schema across all trajectory entries
-# This is required because Arrow/Parquet (used by HuggingFace datasets) needs identical schemas
-ALL_POSSIBLE_TOOLS = {
-    'terminal', 'web_search', 'web_extract',
-    'vision_analyze', 'image_generate', 'mixture_of_agents',
-    # Skills tools
-    'skills_categories', 'skills_list', 'skill_view',
-    # Browser automation tools
-    'browser_navigate', 'browser_snapshot', 'browser_click',
-    'browser_type', 'browser_scroll', 'browser_back',
-    'browser_press', 'browser_close', 'browser_get_images',
-    'browser_vision'
-}
+# All possible tools - auto-derived from the master mapping in model_tools.py.
+# This stays in sync automatically when new tools are added to TOOL_TO_TOOLSET_MAP.
+# Used for consistent schema in Arrow/Parquet (HuggingFace datasets) and for
+# filtering corrupted entries during trajectory combination.
+ALL_POSSIBLE_TOOLS = set(TOOL_TO_TOOLSET_MAP.keys())

 # Default stats for tools that weren't used
 DEFAULT_TOOL_STATS = {'count': 0, 'success': 0, 'failure': 0}
@@ -200,6 +193,42 @@ def _extract_tool_stats(messages: List[Dict[str, Any]]) -> Dict[str, Dict[str, i
    return tool_stats


+def _extract_reasoning_stats(messages: List[Dict[str, Any]]) -> Dict[str, int]:
+    """
+    Count how many assistant turns have reasoning vs no reasoning.
+    
+    Checks for <REASONING_SCRATCHPAD> in content or a non-empty 'reasoning' field
+    (native thinking tokens). Returns counts for tracking reasoning coverage.
+    
+    Args:
+        messages: Message history
+        
+    Returns:
+        Dict with 'total_assistant_turns', 'turns_with_reasoning', 'turns_without_reasoning', and 'has_any_reasoning'
+    """
+    total = 0
+    with_reasoning = 0
+    
+    for msg in messages:
+        if msg.get("role") != "assistant":
+            continue
+        total += 1
+        
+        content = msg.get("content", "") or ""
+        has_scratchpad = "<REASONING_SCRATCHPAD>" in content
+        has_native_reasoning = bool((msg.get("reasoning") or "").strip())
+        
+        if has_scratchpad or has_native_reasoning:
+            with_reasoning += 1
+    
+    return {
+        "total_assistant_turns": total,
+        "turns_with_reasoning": with_reasoning,
+        "turns_without_reasoning": total - with_reasoning,
+        "has_any_reasoning": with_reasoning > 0,
+    }
+
+
 def _process_single_prompt(
    prompt_index: int,
    prompt_data: Dict[str, Any],
@@ -244,6 +273,9 @@ def _process_single_prompt(
            providers_ignored=config.get("providers_ignored"),
            providers_order=config.get("providers_order"),
            provider_sort=config.get("provider_sort"),
+            max_tokens=config.get("max_tokens"),
+            reasoning_config=config.get("reasoning_config"),
+            prefill_messages=config.get("prefill_messages"),
        )

        # Run the agent with task_id to ensure each task gets its own isolated VM
@@ -252,6 +284,9 @@ def _process_single_prompt(
        # Extract tool usage statistics
        tool_stats = _extract_tool_stats(result["messages"])
        
+        # Extract reasoning coverage stats
+        reasoning_stats = _extract_reasoning_stats(result["messages"])
+        
        # Convert to trajectory format (using existing method)
        trajectory = agent._convert_to_trajectory_format(
            result["messages"],
@@ -264,6 +299,7 @@ def _process_single_prompt(
            "prompt_index": prompt_index,
            "trajectory": trajectory,
            "tool_stats": tool_stats,
+            "reasoning_stats": reasoning_stats,
            "completed": result["completed"],
            "partial": result.get("partial", False),
            "api_calls": result["api_calls"],
@@ -332,7 +368,9 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
    
    # Initialize aggregated stats for this batch
    batch_tool_stats = {}
+    batch_reasoning_stats = {"total_assistant_turns": 0, "turns_with_reasoning": 0, "turns_without_reasoning": 0}
    completed_in_batch = []
+    discarded_no_reasoning = 0
    
    # Process each prompt sequentially in this batch
    for prompt_index, prompt_data in prompts_to_process:
@@ -346,6 +384,13 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
        
        # Save trajectory if successful
        if result["success"] and result["trajectory"]:
+            # Discard samples with zero reasoning across all turns
+            reasoning = result.get("reasoning_stats", {})
+            if not reasoning.get("has_any_reasoning", True):
+                print(f"   🚫 Prompt {prompt_index} discarded (no reasoning in any turn)")
+                discarded_no_reasoning += 1
+                continue
+            
            # Get and normalize tool stats for consistent schema across all entries
            raw_tool_stats = result.get("tool_stats", {})
            tool_stats = _normalize_tool_stats(raw_tool_stats)
@@ -386,6 +431,10 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
            batch_tool_stats[tool_name]["success"] += stats["success"]
            batch_tool_stats[tool_name]["failure"] += stats["failure"]
        
+        # Aggregate reasoning stats
+        for key in batch_reasoning_stats:
+            batch_reasoning_stats[key] += result.get("reasoning_stats", {}).get(key, 0)
+        
        # Only mark as completed if successfully saved (failed prompts can be retried on resume)
        if result["success"] and result["trajectory"]:
            completed_in_batch.append(prompt_index)
@@ -401,6 +450,8 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
        "processed": len(prompts_to_process),
        "skipped": len(batch_data) - len(prompts_to_process),
        "tool_stats": batch_tool_stats,
+        "reasoning_stats": batch_reasoning_stats,
+        "discarded_no_reasoning": discarded_no_reasoning,
        "completed_prompts": completed_in_batch
    }

@@ -428,6 +479,10 @@ class BatchRunner:
        providers_ignored: List[str] = None,
        providers_order: List[str] = None,
        provider_sort: str = None,
+        max_tokens: int = None,
+        reasoning_config: Dict[str, Any] = None,
+        prefill_messages: List[Dict[str, Any]] = None,
+        max_samples: int = None,
    ):
        """
        Initialize the batch runner.
@@ -449,6 +504,10 @@ class BatchRunner:
            providers_ignored (List[str]): OpenRouter providers to ignore (optional)
            providers_order (List[str]): OpenRouter providers to try in order (optional)
            provider_sort (str): Sort providers by price/throughput/latency (optional)
+            max_tokens (int): Maximum tokens for model responses (optional, uses model default if not set)
+            reasoning_config (Dict): OpenRouter reasoning config override (e.g. {"effort": "none"} to disable thinking)
+            prefill_messages (List[Dict]): Messages to prepend as prefilled conversation context (few-shot priming)
+            max_samples (int): Only process the first N samples from the dataset (optional, processes all if not set)
        """
        self.dataset_file = Path(dataset_file)
        self.batch_size = batch_size
@@ -466,6 +525,10 @@ class BatchRunner:
        self.providers_ignored = providers_ignored
        self.providers_order = providers_order
        self.provider_sort = provider_sort
+        self.max_tokens = max_tokens
+        self.reasoning_config = reasoning_config
+        self.prefill_messages = prefill_messages
+        self.max_samples = max_samples
        
        # Validate distribution
        if not validate_distribution(distribution):
@@ -481,8 +544,12 @@ class BatchRunner:
        # Statistics file
        self.stats_file = self.output_dir / "statistics.json"
        
-        # Load dataset
+        # Load dataset (and optionally truncate to max_samples)
        self.dataset = self._load_dataset()
+        if self.max_samples and self.max_samples < len(self.dataset):
+            full_count = len(self.dataset)
+            self.dataset = self.dataset[:self.max_samples]
+            print(f"✂️  Truncated dataset from {full_count} to {self.max_samples} samples (--max_samples)")
        
        # Create batches
        self.batches = self._create_batches()
@@ -735,6 +802,9 @@ class BatchRunner:
            "providers_ignored": self.providers_ignored,
            "providers_order": self.providers_order,
            "provider_sort": self.provider_sort,
+            "max_tokens": self.max_tokens,
+            "reasoning_config": self.reasoning_config,
+            "prefill_messages": self.prefill_messages,
        }
        
        # For backward compatibility, still track by index (but this is secondary to content matching)
@@ -797,6 +867,8 @@ class BatchRunner:
        
        # Aggregate all batch statistics and update checkpoint
        all_completed_prompts = list(completed_prompts_set)
+        total_reasoning_stats = {"total_assistant_turns": 0, "turns_with_reasoning": 0, "turns_without_reasoning": 0}
+        
        for batch_result in results:
            # Add newly completed prompts
            all_completed_prompts.extend(batch_result.get("completed_prompts", []))
@@ -813,6 +885,10 @@ class BatchRunner:
                total_tool_stats[tool_name]["count"] += stats["count"]
                total_tool_stats[tool_name]["success"] += stats["success"]
                total_tool_stats[tool_name]["failure"] += stats["failure"]
+            
+            # Aggregate reasoning stats
+            for key in total_reasoning_stats:
+                total_reasoning_stats[key] += batch_result.get("reasoning_stats", {}).get(key, 0)
        
        # Save final checkpoint
        checkpoint_data["completed_prompts"] = all_completed_prompts
@@ -835,15 +911,8 @@ class BatchRunner:
        combined_file = self.output_dir / "trajectories.jsonl"
        print(f"\n📦 Combining ALL batch files into {combined_file.name}...")
        
-        VALID_TOOLS = {'web_search', 'web_extract', 'terminal', 'vision_analyze', 
-                       'image_generate', 'mixture_of_agents',
-                       # Skills tools
-                       'skills_categories', 'skills_list', 'skill_view',
-                       # Browser automation tools
-                       'browser_navigate', 'browser_snapshot', 'browser_click',
-                       'browser_type', 'browser_scroll', 'browser_back',
-                       'browser_press', 'browser_close', 'browser_get_images',
-                       'browser_vision'}
+        # Valid tools auto-derived from model_tools.py — no manual updates needed
+        VALID_TOOLS = ALL_POSSIBLE_TOOLS
        
        total_entries = 0
        filtered_entries = 0
@@ -892,7 +961,8 @@ class BatchRunner:
            "model": self.model,
            "completed_at": datetime.now().isoformat(),
            "duration_seconds": round(time.time() - start_time, 2),
-            "tool_statistics": total_tool_stats
+            "tool_statistics": total_tool_stats,
+            "reasoning_statistics": total_reasoning_stats,
        }
        
        with open(self.stats_file, 'w', encoding='utf-8') as f:
@@ -930,6 +1000,25 @@ class BatchRunner:
        else:
            print("No tool calls were made during this run.")
        
+        # Print reasoning coverage stats
+        total_discarded = sum(r.get("discarded_no_reasoning", 0) for r in results)
+        
+        print("\n🧠 Reasoning Coverage:")
+        print("-" * 70)
+        total_turns = total_reasoning_stats["total_assistant_turns"]
+        with_reasoning = total_reasoning_stats["turns_with_reasoning"]
+        without_reasoning = total_reasoning_stats["turns_without_reasoning"]
+        if total_turns > 0:
+            pct_with = round(with_reasoning / total_turns * 100, 1)
+            pct_without = round(without_reasoning / total_turns * 100, 1)
+            print(f"   Total assistant turns:    {total_turns:,}")
+            print(f"   With reasoning:           {with_reasoning:,} ({pct_with}%)")
+            print(f"   Without reasoning:        {without_reasoning:,} ({pct_without}%)")
+        else:
+            print("   No assistant turns recorded.")
+        if total_discarded > 0:
+            print(f"   🚫 Samples discarded (zero reasoning): {total_discarded:,}")
+        
        print(f"\n💾 Results saved to: {self.output_dir}")
        print(f"   - Trajectories: trajectories.jsonl (combined)")
        print(f"   - Individual batches: batch_*.jsonl (for debugging)")
@@ -956,6 +1045,11 @@ def main(
    providers_ignored: str = None,
    providers_order: str = None,
    provider_sort: str = None,
+    max_tokens: int = None,
+    reasoning_effort: str = None,
+    reasoning_disabled: bool = False,
+    prefill_messages_file: str = None,
+    max_samples: int = None,
 ):
    """
    Run batch processing of agent prompts from a dataset.
@@ -979,6 +1073,11 @@ def main(
        providers_ignored (str): Comma-separated list of OpenRouter providers to ignore (e.g. "together,deepinfra")
        providers_order (str): Comma-separated list of OpenRouter providers to try in order (e.g. "anthropic,openai,google")
        provider_sort (str): Sort providers by "price", "throughput", or "latency" (OpenRouter only)
+        max_tokens (int): Maximum tokens for model responses (optional, uses model default if not set)
+        reasoning_effort (str): OpenRouter reasoning effort level: "xhigh", "high", "medium", "low", "minimal", "none" (optional; if unset, no override is passed and the default, "xhigh", applies)
+        reasoning_disabled (bool): Completely disable reasoning/thinking tokens (default: False)
+        prefill_messages_file (str): Path to JSON file containing prefill messages (list of {role, content} dicts)
+        max_samples (int): Only process the first N samples from the dataset (optional, processes all if not set)
        
    Examples:
        # Basic usage
@@ -990,9 +1089,13 @@ def main(
        # Use specific distribution
        python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=image_test --distribution=image_gen
        
-        # With ephemeral system prompt (not saved to dataset)
+        # With disabled reasoning and max tokens
        python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run \\
-                               --ephemeral_system_prompt="You are a helpful assistant focused on image generation."
+                               --reasoning_disabled --max_tokens=128000
+        
+        # With prefill messages from file
+        python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run \\
+                               --prefill_messages_file=configs/prefill_opus.json
        
        # List available distributions
        python batch_runner.py --list_distributions
@@ -1031,6 +1134,36 @@ def main(
    providers_ignored_list = [p.strip() for p in providers_ignored.split(",")] if providers_ignored else None
    providers_order_list = [p.strip() for p in providers_order.split(",")] if providers_order else None
    
+    # Build reasoning_config from CLI flags
+    # --reasoning_disabled takes priority, then --reasoning_effort, then default (xhigh)
+    reasoning_config = None
+    if reasoning_disabled:
+        # Completely disable reasoning/thinking tokens
+        reasoning_config = {"effort": "none"}
+        print("🧠 Reasoning: DISABLED (effort=none)")
+    elif reasoning_effort:
+        # Use specified effort level
+        valid_efforts = ["xhigh", "high", "medium", "low", "minimal", "none"]
+        if reasoning_effort not in valid_efforts:
+            print(f"❌ Error: --reasoning_effort must be one of: {', '.join(valid_efforts)}")
+            return
+        reasoning_config = {"enabled": True, "effort": reasoning_effort}
+        print(f"🧠 Reasoning effort: {reasoning_effort}")
+    
+    # Load prefill messages from JSON file if provided
+    prefill_messages = None
+    if prefill_messages_file:
+        try:
+            with open(prefill_messages_file, 'r', encoding='utf-8') as f:
+                prefill_messages = json.load(f)
+            if not isinstance(prefill_messages, list):
+                print("❌ Error: prefill_messages_file must contain a JSON array of messages")
+                return
+            print(f"💬 Loaded {len(prefill_messages)} prefill messages from {prefill_messages_file}")
+        except Exception as e:
+            print(f"❌ Error loading prefill messages: {e}")
+            return
+    
    # Initialize and run batch runner
    try:
        runner = BatchRunner(
@@ -1050,6 +1183,10 @@ def main(
            providers_ignored=providers_ignored_list,
            providers_order=providers_order_list,
            provider_sort=provider_sort,
+            max_tokens=max_tokens,
+            reasoning_config=reasoning_config,
+            prefill_messages=prefill_messages,
+            max_samples=max_samples,
        )

        runner.run(resume=resume)
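
The reasoning-coverage logic added to batch_runner.py above can be tried in isolation. The following is a minimal sketch, not the repository's code: `reasoning_coverage` is a hypothetical stand-alone name, the sample messages are invented for illustration, and the `isinstance` guard is an extra safety check that the diff itself does not contain.

```python
from typing import Any, Dict, List


def reasoning_coverage(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count assistant turns with and without reasoning (illustrative sketch)."""
    total = 0
    with_reasoning = 0

    for msg in messages:
        # Only assistant turns are counted; user and tool messages are skipped.
        if msg.get("role") != "assistant":
            continue
        total += 1

        # content and reasoning may be missing or None, so fall back to "".
        content = msg.get("content") or ""
        reasoning = msg.get("reasoning") or ""

        has_scratchpad = "<REASONING_SCRATCHPAD>" in content
        # Assumption: whitespace-only reasoning counts as no reasoning.
        has_native = isinstance(reasoning, str) and bool(reasoning.strip())

        if has_scratchpad or has_native:
            with_reasoning += 1

    return {
        "total_assistant_turns": total,
        "turns_with_reasoning": with_reasoning,
        "turns_without_reasoning": total - with_reasoning,
        "has_any_reasoning": with_reasoning > 0,
    }


# Hypothetical trajectory: three assistant turns, one with blank reasoning.
sample = [
    {"role": "user", "content": "List the files."},
    {"role": "assistant", "content": "<REASONING_SCRATCHPAD>check cwd</REASONING_SCRATCHPAD>"},
    {"role": "tool", "content": "a.txt"},
    {"role": "assistant", "content": "Found a.txt", "reasoning": "   "},
    {"role": "assistant", "content": None, "reasoning": "native thinking"},
]
stats = reasoning_coverage(sample)
print(stats)
```

In the batch worker, a sample is discarded only when `has_any_reasoning` is present and false. Because the worker reads it with a default of `True`, a result that carries no reasoning stats at all is kept rather than dropped.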
--- a/cli-config.yaml.example
+++ b/cli-config.yaml.example
@@ -7,7 +7,7 @@
 # =============================================================================
 model:
  # Default model to use (can be overridden with --model flag)
-  default: "anthropic/claude-sonnet-4"
+  default: "anthropic/claude-opus-4.6"
  
  # API configuration (falls back to OPENROUTER_API_KEY env var)
  # api_key: "your-key-here"  # Uncomment to set here instead of .env
@@ -140,7 +140,7 @@ compression:
  
  # Model to use for generating summaries (fast/cheap recommended)
  # This model compresses the middle turns into a concise summary
-  summary_model: "google/gemini-2.0-flash-001"
+  summary_model: "google/gemini-3-flash-preview"

 # =============================================================================
 # Agent Behavior
--- a/cli.py
+++ b/cli.py
@@ -28,18 +28,13 @@ os.environ["HERMES_QUIET"] = "1"  # Our own modules
 import yaml

 # prompt_toolkit for fixed input area TUI
-from prompt_toolkit import PromptSession
 from prompt_toolkit.history import FileHistory
 from prompt_toolkit.styles import Style as PTStyle
-from prompt_toolkit.formatted_text import HTML
 from prompt_toolkit.patch_stdout import patch_stdout
-from prompt_toolkit.application import Application, get_app
-from prompt_toolkit.buffer import Buffer
+from prompt_toolkit.application import Application
 from prompt_toolkit.layout import Layout, HSplit, Window, FormattedTextControl
-from prompt_toolkit.layout.processors import BeforeInput
 from prompt_toolkit.widgets import TextArea
 from prompt_toolkit.key_binding import KeyBindings
-import asyncio
 import threading
 import queue

@@ -83,12 +78,12 @@ def load_cli_config() -> Dict[str, Any]:
    # Default configuration
    defaults = {
        "model": {
-            "default": "anthropic/claude-opus-4-20250514",
+            "default": "anthropic/claude-opus-4.6",
            "base_url": "https://openrouter.ai/api/v1",
        },
        "terminal": {
            "env_type": "local",
-            "cwd": "/tmp",
+            "cwd": ".",  # "." is resolved to os.getcwd() at runtime
            "timeout": 60,
            "lifetime_seconds": 300,
            "docker_image": "python:3.11",
@@ -101,7 +96,7 @@ def load_cli_config() -> Dict[str, Any]:
        "compression": {
            "enabled": True,      # Auto-compress when approaching context limit
            "threshold": 0.85,    # Compress at 85% of model's context limit
-            "summary_model": "google/gemini-2.0-flash-001",  # Fast/cheap model for summaries
+            "summary_model": "google/gemini-3-flash-preview",  # Fast/cheap model for summaries
        },
        "agent": {
            "max_turns": 60,  # Default max tool-calling iterations
@@ -238,6 +233,10 @@ from toolsets import get_all_toolsets, get_toolset_info, resolve_toolset, valida
 # Cron job system for scheduled tasks
 from cron import create_job, list_jobs, remove_job, get_job, run_daemon as run_cron_daemon, tick as cron_tick

+# Resource cleanup imports for safe shutdown (terminal VMs, browser sessions)
+from tools.terminal_tool import cleanup_all_environments as _cleanup_all_terminals
+from tools.browser_tool import _emergency_cleanup_all_sessions as _cleanup_all_browsers
+
 # ============================================================================
 # ASCII Art & Branding
 # ============================================================================
@@ -494,6 +493,8 @@ COMMANDS = {
    "/clear": "Clear screen and reset conversation (fresh start)",
    "/history": "Show conversation history",
    "/reset": "Reset conversation only (keep screen)",
+    "/retry": "Retry the last message (resend to agent)",
+    "/undo": "Remove the last user/assistant exchange",
    "/save": "Save the current conversation",
    "/config": "Show current configuration",
    "/cron": "Manage scheduled tasks (list, add, remove)",
@@ -504,7 +505,11 @@ COMMANDS = {

 def save_config_value(key_path: str, value: any) -> bool:
    """
-    Save a value to cli-config.yaml at the specified key path.
+    Save a value to the active config file at the specified key path.
+    
+    Respects the same lookup order as load_cli_config():
+    1. ~/.hermes/config.yaml (user config - preferred, used if it exists)
+    2. ./cli-config.yaml (project config - fallback)
    
    Args:
        key_path: Dot-separated path like "agent.system_prompt"
@@ -513,9 +518,15 @@ def save_config_value(key_path: str, value: any) -> bool:
    Returns:
        True if successful, False otherwise
    """
-    config_path = Path(__file__).parent / 'cli-config.yaml'
+    # Use the same precedence as load_cli_config: user config first, then project config
+    user_config_path = Path.home() / '.hermes' / 'config.yaml'
+    project_config_path = Path(__file__).parent / 'cli-config.yaml'
+    config_path = user_config_path if user_config_path.exists() else project_config_path
    
    try:
+        # Ensure parent directory exists (for ~/.hermes/config.yaml on first use)
+        config_path.parent.mkdir(parents=True, exist_ok=True)
+        
        # Load existing config
        if config_path.exists():
            with open(config_path, 'r') as f:
@@ -627,26 +638,8 @@ class HermesCLI:
        short_uuid = uuid.uuid4().hex[:6]
        self.session_id = f"{timestamp_str}_{short_uuid}"
        
-        # Setup prompt_toolkit session with history
-        self._setup_prompt_session()
-    
-    def _setup_prompt_session(self):
-        """Setup prompt_toolkit session with history and styling."""
-        history_file = Path.home() / ".hermes_history"
-        
-        # Custom style for the prompt
-        self.prompt_style = PTStyle.from_dict({
-            'prompt': '#FFD700 bold',
-            'input': '#FFF8DC',
-        })
-        
-        # Create prompt session with file history
-        # Note: multiline disabled - Enter submits, use \ at end of line for continuation
-        self.prompt_session = PromptSession(
-            history=FileHistory(str(history_file)),
-            style=self.prompt_style,
-            enable_history_search=True,
-        )
+        # History file for persistent input recall across sessions
+        self._history_file = Path.home() / ".hermes_history"
    
    def _init_agent(self) -> bool:
        """
@@ -669,6 +662,7 @@ class HermesCLI:
                quiet_mode=True,  # Suppress verbose output for clean CLI
                ephemeral_system_prompt=self.system_prompt if self.system_prompt else None,
                session_id=self.session_id,  # Pass CLI's session ID to agent
+                platform="cli",  # CLI interface — agent uses terminal-friendly formatting
            )
            return True
        except Exception as e:
@@ -839,7 +833,7 @@ class HermesCLI:
        """Display current configuration with kawaii ASCII art."""
        # Get terminal config from environment (which was set from cli-config.yaml)
        terminal_env = os.getenv("TERMINAL_ENV", "local")
-        terminal_cwd = os.getenv("TERMINAL_CWD", "/tmp")
+        terminal_cwd = os.getenv("TERMINAL_CWD", os.getcwd())
        terminal_timeout = os.getenv("TERMINAL_TIMEOUT", "60")
        
        config_path = Path(__file__).parent / 'cli-config.yaml'
@@ -927,6 +921,67 @@ class HermesCLI:
        except Exception as e:
            print(f"(x_x) Failed to save: {e}")
    
+    def retry_last(self):
+        """Retry the last user message by removing the last exchange and re-sending.
+        
+        Removes the last user message and everything after it (assistant response,
+        tool-call messages) from history; the caller re-sends the returned message.
+        Returns the message to re-send, or None if there's nothing to retry.
+        """
+        if not self.conversation_history:
+            print("(._.) No messages to retry.")
+            return None
+        
+        # Walk backwards to find the last user message
+        last_user_idx = None
+        for i in range(len(self.conversation_history) - 1, -1, -1):
+            if self.conversation_history[i].get("role") == "user":
+                last_user_idx = i
+                break
+        
+        if last_user_idx is None:
+            print("(._.) No user message found to retry.")
+            return None
+        
+        # Extract the message text and remove everything from that point forward
+        last_message = self.conversation_history[last_user_idx].get("content", "")
+        self.conversation_history = self.conversation_history[:last_user_idx]
+        
+        print(f"(^_^)b Retrying: \"{last_message[:60]}{'...' if len(last_message) > 60 else ''}\"")
+        return last_message
+    
+    def undo_last(self):
+        """Remove the last user/assistant exchange from conversation history.
+        
+        Walks backwards and removes all messages from the last user message
+        onward (including assistant responses, tool calls, etc.).
+        """
+        if not self.conversation_history:
+            print("(._.) No messages to undo.")
+            return
+        
+        # Walk backwards to find the last user message
+        last_user_idx = None
+        for i in range(len(self.conversation_history) - 1, -1, -1):
+            if self.conversation_history[i].get("role") == "user":
+                last_user_idx = i
+                break
+        
+        if last_user_idx is None:
+            print("(._.) No user message found to undo.")
+            return
+        
+        # Count how many messages we're removing
+        removed_count = len(self.conversation_history) - last_user_idx
+        removed_msg = self.conversation_history[last_user_idx].get("content", "")
+        
+        # Truncate history to before the last user message
+        self.conversation_history = self.conversation_history[:last_user_idx]
+        
+        print(f"(^_^)b Undid {removed_count} message(s). Removed: \"{removed_msg[:60]}{'...' if len(removed_msg) > 60 else ''}\"")
+        remaining = len(self.conversation_history)
+        print(f"  {remaining} message(s) remaining in history.")
+    
    def _handle_prompt_command(self, cmd: str):
        """Handle the /prompt command to view or set system prompt."""
        parts = cmd.split(maxsplit=1)
@@ -1217,33 +1272,35 @@ class HermesCLI:
        Returns:
            bool: True to continue, False to exit
        """
-        cmd = command.lower().strip()
+        # Lowercase only for dispatch matching; preserve original case for arguments
+        cmd_lower = command.lower().strip()
+        cmd_original = command.strip()
        
-        if cmd in ("/quit", "/exit", "/q"):
+        if cmd_lower in ("/quit", "/exit", "/q"):
            return False
-        elif cmd == "/help":
+        elif cmd_lower == "/help":
            self.show_help()
-        elif cmd == "/tools":
+        elif cmd_lower == "/tools":
            self.show_tools()
-        elif cmd == "/toolsets":
+        elif cmd_lower == "/toolsets":
            self.show_toolsets()
-        elif cmd == "/config":
+        elif cmd_lower == "/config":
            self.show_config()
-        elif cmd == "/clear":
-            # Clear terminal screen
-            import os as _os
-            _os.system('clear' if _os.name != 'nt' else 'cls')
+        elif cmd_lower == "/clear":
+            # Clear terminal screen using Rich (portable, no shell needed)
+            self.console.clear()
            # Reset conversation
            self.conversation_history = []
            # Show fresh banner
            self.show_banner()
            print("  ✨ (◕‿◕)✨ Fresh start! Screen cleared and conversation reset.\n")
-        elif cmd == "/history":
+        elif cmd_lower == "/history":
            self.show_history()
-        elif cmd == "/reset":
+        elif cmd_lower == "/reset":
            self.reset_conversation()
-        elif cmd.startswith("/model"):
-            parts = cmd.split(maxsplit=1)
+        elif cmd_lower.startswith("/model"):
+            # Use original case so model names like "Anthropic/Claude-Opus-4" are preserved
+            parts = cmd_original.split(maxsplit=1)
            if len(parts) > 1:
                new_model = parts[1]
                self.model = new_model
@@ -1256,18 +1313,27 @@ class HermesCLI:
            else:
                print(f"Current model: {self.model}")
                print("  Usage: /model <model-name> to change")
-        elif cmd.startswith("/prompt"):
-            self._handle_prompt_command(cmd)
-        elif cmd.startswith("/personality"):
-            self._handle_personality_command(cmd)
-        elif cmd == "/save":
+        elif cmd_lower.startswith("/prompt"):
+            # Use original case so prompt text isn't lowercased
+            self._handle_prompt_command(cmd_original)
+        elif cmd_lower.startswith("/personality"):
+            # Use original case (handler lowercases the personality name itself)
+            self._handle_personality_command(cmd_original)
+        elif cmd_lower == "/retry":
+            retry_msg = self.retry_last()
+            if retry_msg and hasattr(self, '_pending_input'):
+                # Re-queue the message so process_loop sends it to the agent
+                self._pending_input.put(retry_msg)
+        elif cmd_lower == "/undo":
+            self.undo_last()
+        elif cmd_lower == "/save":
            self.save_conversation()
-        elif cmd.startswith("/cron"):
-            self._handle_cron_command(command)  # Use original command for proper parsing
-        elif cmd == "/platforms" or cmd == "/gateway":
+        elif cmd_lower.startswith("/cron"):
+            self._handle_cron_command(cmd_original)
+        elif cmd_lower == "/platforms" or cmd_lower == "/gateway":
            self._show_gateway_status()
        else:
-            self.console.print(f"[bold red]Unknown command: {cmd}[/]")
+            self.console.print(f"[bold red]Unknown command: {cmd_lower}[/]")
            self.console.print("[dim #B8860B]Type /help for available commands[/]")
        
        return True
@@ -1276,6 +1342,11 @@ class HermesCLI:
        """
        Send a message to the agent and get a response.
        
+        Uses a dedicated _interrupt_queue (separate from _pending_input) to avoid
+        race conditions between the process_loop and interrupt monitoring. Messages
+        typed while the agent is running go to _interrupt_queue; messages typed while
+        idle go to _pending_input.
+        
        Args:
            message: The user's message
            
@@ -1289,8 +1360,9 @@ class HermesCLI:
        # Add user message to history
        self.conversation_history.append({"role": "user", "content": message})
        
-        # Visual separator after user input
-        print("─" * 60, flush=True)
+        # Visual separator after user input (adapt to terminal width, capped for readability)
+        term_width = min(self.console.width, 120)
+        print("─" * term_width, flush=True)
        
        try:
            # Run the conversation with interrupt monitoring
@@ -1307,21 +1379,22 @@ class HermesCLI:
            agent_thread = threading.Thread(target=run_agent)
            agent_thread.start()
            
-            # Monitor for new input in the pending queue while agent runs
+            # Monitor the dedicated interrupt queue while the agent runs.
+            # _interrupt_queue is separate from _pending_input, so process_loop
+            # and chat() never compete for the same queue.
            interrupt_msg = None
            while agent_thread.is_alive():
-                # Check if there's new input in the queue (from the persistent input area)
-                if hasattr(self, '_pending_input'):
+                if hasattr(self, '_interrupt_queue'):
                    try:
-                        interrupt_msg = self._pending_input.get(timeout=0.1)
+                        interrupt_msg = self._interrupt_queue.get(timeout=0.1)
                        if interrupt_msg:
                            print(f"\n⚡ New message detected, interrupting...")
                            self.agent.interrupt(interrupt_msg)
                            break
-                    except:
+                    except queue.Empty:
                        pass  # Queue empty or timeout, continue waiting
                else:
-                    # Fallback if no queue (shouldn't happen)
+                    # Fallback for non-interactive mode (e.g., single-query)
                    agent_thread.join(0.1)
            
            agent_thread.join()  # Ensure agent thread completes
@@ -1332,6 +1405,11 @@ class HermesCLI:
            # Get the final response
            response = result.get("final_response", "") if result else ""
            
+            # Handle failed results (e.g., non-retryable errors like invalid model)
+            if result and result.get("failed") and not response:
+                error_detail = result.get("error", "Unknown error")
+                response = f"Error: {error_detail}"
+            
            # Handle interrupt - check if we were interrupted
            pending_message = None
            if result and result.get("interrupted"):
@@ -1342,19 +1420,26 @@ class HermesCLI:
            
            if response:
                # Use simple print for compatibility with prompt_toolkit's patch_stdout
+                # Adapt box width to terminal (cap at 120 for readability)
+                box_width = min(self.console.width, 120)
+                inner = box_width - 2  # account for border chars ╭/╰ and ╮/╯
+                label = "⚕ Hermes"
+                padding = inner - len(label) - 1  # -1 for the leading space
+                
                print()
-                print("╭" + "─" * 58 + "╮")
-                print("│ ⚕ Hermes" + " " * 49 + "│")
-                print("╰" + "─" * 58 + "╯")
+                print("╭" + "─" * inner + "╮")
+                print("│ " + label + " " * max(padding, 0) + "│")
+                print("╰" + "─" * inner + "╯")
                print()
                print(response)
                print()
-                print("─" * 60)
+                print("─" * box_width)
            
-            # If we have a pending message from interrupt, process it immediately
-            if pending_message:
-                print(f"\n📨 Processing: '{pending_message[:50]}{'...' if len(pending_message) > 50 else ''}'")
-                return self.chat(pending_message)  # Recursive call to handle the new message
+            # If we have a pending message from interrupt, re-queue it for process_loop
+            # instead of recursing (avoids unbounded recursion from rapid interrupts)
+            if pending_message and hasattr(self, '_pending_input'):
+                print(f"\n📨 Queued: '{pending_message[:50]}{'...' if len(pending_message) > 50 else ''}'")
+                self._pending_input.put(pending_message)
            
            return response
            
@@ -1362,37 +1447,6 @@ class HermesCLI:
            print(f"Error: {e}")
            return None
    
-    def get_input(self) -> Optional[str]:
-        """
-        Get user input using prompt_toolkit.
-        
-        Enter submits. For multiline, end line with \\ to continue.
-        
-        Returns:
-            The user's input, or None if EOF/interrupt
-        """
-        try:
-            # Get first line
-            line = self.prompt_session.prompt(
-                HTML('<prompt>❯ </prompt>'),
-                style=self.prompt_style,
-            )
-            
-            # Handle multi-line input (lines ending with \)
-            lines = [line]
-            while line.endswith("\\"):
-                lines[-1] = line[:-1]  # Remove trailing backslash
-                line = self.prompt_session.prompt(
-                    HTML('<prompt>  </prompt>'),  # Continuation prompt
-                    style=self.prompt_style,
-                )
-                lines.append(line)
-            
-            return "\n".join(lines).strip()
-            
-        except (EOFError, KeyboardInterrupt):
-            return None
-    
    def run(self):
        """Run the interactive CLI loop with persistent input at bottom."""
        self.show_banner()
@@ -1401,32 +1455,59 @@ class HermesCLI:
        
        # State for async operation
        self._agent_running = False
-        self._pending_input = queue.Queue()
+        self._pending_input = queue.Queue()     # For normal input (commands + new queries)
+        self._interrupt_queue = queue.Queue()   # For messages typed while agent is running
        self._should_exit = False
-        
-        # Create a persistent input area using prompt_toolkit Application
-        input_buffer = Buffer()
+        self._last_ctrl_c_time = 0  # Track double Ctrl+C for force exit
        
        # Key bindings for the input area
        kb = KeyBindings()
        
        @kb.add('enter')
        def handle_enter(event):
-            """Handle Enter key - submit input."""
+            """Handle Enter key - submit input.
+            
+            Routes to the correct queue based on agent state:
+            - Agent running: goes to _interrupt_queue (chat() monitors this)
+            - Agent idle: goes to _pending_input (process_loop monitors this)
+            Commands (starting with /) always go to _pending_input so they're
+            handled as commands, not sent as interrupt text to the agent.
+            """
            text = event.app.current_buffer.text.strip()
            if text:
-                # Store the input
-                self._pending_input.put(text)
+                if self._agent_running and not text.startswith("/"):
+                    # Agent is working - route to interrupt queue for chat() to pick up
+                    self._interrupt_queue.put(text)
+                else:
+                    # Agent idle, or it's a command - route to normal input queue
+                    self._pending_input.put(text)
                # Clear the buffer
                event.app.current_buffer.reset()
        
        @kb.add('c-c')
        def handle_ctrl_c(event):
-            """Handle Ctrl+C - interrupt or exit."""
+            """Handle Ctrl+C - interrupt agent or force exit on double press.
+            
+            First Ctrl+C: interrupt the running agent gracefully.
+            Second Ctrl+C within 2 seconds (or when agent is idle): force exit.
+            """
+            import time as _time
+            now = _time.time()
+            
            if self._agent_running and self.agent:
-                print("\n⚡ Interrupting agent...")
+                # Check for double Ctrl+C (second press within 2 seconds)
+                if now - self._last_ctrl_c_time < 2.0:
+                    print("\n⚡ Force exiting...")
+                    self._should_exit = True
+                    event.app.exit()
+                    return
+                
+                # First Ctrl+C: try graceful interrupt
+                self._last_ctrl_c_time = now
+                print("\n⚡ Interrupting agent... (press Ctrl+C again to force exit)")
                self.agent.interrupt()
            else:
+                # Agent not running, exit immediately
                self._should_exit = True
                event.app.exit()
        
@@ -1436,13 +1517,14 @@ class HermesCLI:
            self._should_exit = True
            event.app.exit()
        
-        # Create the input area widget
+        # Create the input area widget with persistent history across sessions
        input_area = TextArea(
            height=1,
            prompt='❯ ',
            style='class:input-area',
            multiline=False,
            wrap_lines=False,
+            history=FileHistory(str(self._history_file)),
        )
        
        # Create a status line that shows when agent is working
@@ -1495,6 +1577,7 @@ class HermesCLI:
                    
                    # Check for commands
                    if user_input.startswith("/"):
+                        print(f"\n⚙️  {user_input}")
                        if not self.process_command(user_input):
                            self._should_exit = True
                            # Schedule app exit
@@ -1506,6 +1589,9 @@ class HermesCLI:
                    self._agent_running = True
                    app.invalidate()  # Refresh status line
                    
+                    # Echo the user's input so it stays visible in scrollback
+                    print(f"\n💬 You: {user_input}")
+                    
                    try:
                        self.chat(user_input)
                    finally:
@@ -1519,6 +1605,11 @@ class HermesCLI:
        process_thread = threading.Thread(target=process_loop, daemon=True)
        process_thread.start()
        
+        # Register atexit cleanup so resources are freed even on unexpected exit
+        # (terminal VMs, browser sessions, etc.)
+        atexit.register(_cleanup_all_browsers)
+        atexit.register(_cleanup_all_terminals)
+        
        # Run the application with patch_stdout for proper output handling
        try:
            with patch_stdout():
@@ -1527,6 +1618,15 @@ class HermesCLI:
            pass
        finally:
            self._should_exit = True
+            # Explicitly clean up resources before exit
+            try:
+                _cleanup_all_terminals()
+            except Exception:
+                pass
+            try:
+                _cleanup_all_browsers()
+            except Exception:
+                pass
            print("\nGoodbye! ⚕")


@@ -1646,6 +1746,10 @@ def main(
        cli.show_toolsets()
        sys.exit(0)
    
+    # Register cleanup for single-query mode (interactive mode registers in run())
+    atexit.register(_cleanup_all_browsers)
+    atexit.register(_cleanup_all_terminals)
+    
    # Handle single query mode
    if query:
        cli.show_banner()
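
The queue routing introduced in `handle_enter()` above can be sketched in isolation. This is a minimal illustration, not the CLI's actual code: `InputRouter` and `submit()` are hypothetical names. Only the routing rule comes from the diff: text typed while the agent is running goes to the interrupt queue unless it is a `/` command, and everything else goes to the pending-input queue.

```python
import queue


class InputRouter:
    """Hypothetical stand-in for the routing done in handle_enter()."""

    def __init__(self):
        self.pending_input = queue.Queue()    # commands and new queries
        self.interrupt_queue = queue.Queue()  # text typed while the agent runs
        self.agent_running = False

    def submit(self, text: str) -> str:
        """Route one line of input and return the name of the queue used."""
        text = text.strip()
        if not text:
            return "ignored"
        if self.agent_running and not text.startswith("/"):
            # Agent is busy and this is not a command: treat it as an interrupt.
            self.interrupt_queue.put(text)
            return "interrupt"
        # Agent is idle, or the text is a command: normal input path.
        self.pending_input.put(text)
        return "pending"


router = InputRouter()
router.agent_running = True
print(router.submit("stop and try something else"))  # routed as an interrupt
print(router.submit("/help"))                        # commands stay on the normal path
router.agent_running = False
print(router.submit("a new question"))               # idle agent: normal input
```

Because each queue has exactly one consumer in this layout, a message typed mid-run cannot be taken by the wrong loop, which is the competition the new comment in `chat()` describes.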
--- a/cron/scheduler.py
+++ b/cron/scheduler.py
@@ -40,7 +40,7 @@ def run_job(job: dict) -> tuple[bool, str, Optional[str]]:
        # Create agent with default settings
        # Jobs run in isolated sessions (no prior context)
        agent = AIAgent(
-            model=os.getenv("HERMES_MODEL", "anthropic/claude-sonnet-4"),
+            model=os.getenv("HERMES_MODEL", "anthropic/claude-opus-4.6"),
            quiet_mode=True,
            session_id=f"cron_{job_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        )
--- a/environments/README.md
+++ b/environments/README.md
@@ -0,0 +1,330 @@
+# Hermes-Agent Atropos Environments
+
+This directory contains the integration layer between **hermes-agent's** tool-calling capabilities and the **Atropos** RL training framework. It provides everything needed to run agentic LLMs through multi-turn tool-calling loops, score their output with arbitrary reward functions, and feed results into Atropos for training or evaluation.
+
+## Architecture Overview
+
+```
+                        Atropos Framework
+                    ┌───────────────────────┐
+                    │       BaseEnv          │  (atroposlib)
+                    │  - Server management   │
+                    │  - Worker scheduling   │
+                    │  - Wandb logging       │
+                    │  - CLI (serve/process/ │
+                    │    evaluate)           │
+                    └───────────┬───────────┘
+                                │ inherits
+                    ┌───────────┴───────────┐
+                    │  HermesAgentBaseEnv    │  hermes_base_env.py
+                    │  - Terminal backend    │
+                    │  - Tool resolution     │
+                    │  - Agent loop          │
+                    │  - ToolContext         │
+                    │  - Async patches       │
+                    └───────────┬───────────┘
+                                │ inherits
+              ┌─────────────────┼─────────────────┐
+              │                 │                  │
+     TerminalTestEnv     HermesSweEnv    TerminalBench2EvalEnv
+     (stack testing)     (SWE training)   (TB2 benchmark eval)
+```
+
+### Inheritance Chain
+
+**BaseEnv** (from `atroposlib`) is the Atropos base class. It provides:
+- Server management (OpenAI-compatible API servers, VLLM, SGLang)
+- Worker scheduling for parallel rollouts
+- Wandb integration for metrics and rollout logging
+- CLI interface with three subcommands: `serve`, `process`, `evaluate`
+- `evaluate_log()` for saving eval results to JSON + samples.jsonl
+
+**HermesAgentBaseEnv** (`hermes_base_env.py`) extends BaseEnv with hermes-agent specifics:
+- Sets `os.environ["TERMINAL_ENV"]` to configure the terminal backend (local, docker, modal, ssh, singularity)
+- Resolves hermes-agent toolsets via `_resolve_tools_for_group()` (calls `get_tool_definitions()` from `model_tools.py`)
+- Implements `collect_trajectory()` which runs the full agent loop and computes rewards
+- Supports two-phase operation (Phase 1: OpenAI server, Phase 2: VLLM ManagedServer)
+- Applies monkey patches for async-safe tool operation at import time
+
+Concrete environments inherit from `HermesAgentBaseEnv` and implement:
+- `setup()` -- Load dataset, initialize state
+- `get_next_item()` -- Return the next item for rollout
+- `format_prompt()` -- Convert a dataset item into the user message
+- `compute_reward()` -- Score the rollout using ToolContext
+- `evaluate()` -- Periodic evaluation logic
+
+## Core Components
+
+### Agent Loop (`agent_loop.py`)
+
+`HermesAgentLoop` is the reusable multi-turn agent engine. It runs the same pattern as hermes-agent's `run_agent.py`:
+
+1. Send messages + tools to the API via `server.chat_completion()`
+2. If the response contains `tool_calls`, execute each one via `handle_function_call()` from `model_tools.py`
+3. Append tool results to the conversation and go back to step 1
+4. If the response has no tool_calls, the agent is done
+
+Tool calls are executed in a thread pool (`run_in_executor`) so backends that use `asyncio.run()` internally (Modal, Docker) don't deadlock inside Atropos's event loop.
+
+Returns an `AgentResult` containing the full conversation history, turn count, reasoning content per turn, tool errors, and optional ManagedServer state (for Phase 2).
+
+### Tool Context (`tool_context.py`)
+
+`ToolContext` is a per-rollout handle that gives reward/verification functions direct access to **all** hermes-agent tools, scoped to the rollout's `task_id`. Because the `task_id` is shared, the terminal/browser session is the same one the model used during its rollout, so all state (files, processes, browser tabs) is preserved.
+
+```python
+async def compute_reward(self, item, result, ctx: ToolContext):
+    # Run tests in the model's terminal sandbox
+    test = ctx.terminal("pytest -v")
+    if test["exit_code"] == 0:
+        return 1.0
+
+    # Check if a file was created
+    content = ctx.read_file("/workspace/solution.py")
+    if content.get("content"):
+        return 0.5
+
+    # Download files locally for verification (binary-safe)
+    ctx.download_file("/remote/output.bin", "/local/output.bin")
+
+    return 0.0
+```
+
+Available methods:
+- **Terminal**: `terminal(command, timeout)` -- run shell commands
+- **Files**: `read_file(path)`, `write_file(path, content)`, `search(query, path)`
+- **Transfers**: `upload_file()`, `upload_dir()`, `download_file()`, `download_dir()` -- binary-safe file transfers between host and sandbox
+- **Web**: `web_search(query)`, `web_extract(urls)`
+- **Browser**: `browser_navigate(url)`, `browser_snapshot()`
+- **Generic**: `call_tool(name, args)` -- call any hermes-agent tool by name
+- **Cleanup**: `cleanup()` -- release all resources (called automatically after `compute_reward`)
+
+### Patches (`patches.py`)
+
+**Problem**: Some hermes-agent tools use `asyncio.run()` internally (e.g., mini-swe-agent's Modal backend via SWE-ReX). This crashes when called from inside Atropos's event loop because `asyncio.run()` cannot be nested.
+
+**Solution**: `patches.py` monkey-patches `SwerexModalEnvironment` to use a dedicated background thread (`_AsyncWorker`) with its own event loop. The calling code sees the same sync interface, but internally the async work happens on a separate thread that doesn't conflict with Atropos's loop.
+
+What gets patched:
+- `SwerexModalEnvironment.__init__` -- creates Modal deployment on a background thread
+- `SwerexModalEnvironment.execute` -- runs commands on the same background thread
+- `SwerexModalEnvironment.stop` -- stops deployment on the background thread
+
+The patches are:
+- **Idempotent** -- calling `apply_patches()` multiple times is safe
+- **Transparent** -- same interface and behavior, only the internal async execution changes
+- **Universal** -- works identically in normal CLI use (no running event loop)
+
+Applied automatically at import time by `hermes_base_env.py`.
+
+### Tool Call Parsers (`tool_call_parsers/`)
+
+Client-side parsers that extract structured `tool_calls` from raw model output text. Used in **Phase 2** (VLLM server type) where ManagedServer's `/generate` endpoint returns raw text without tool call parsing.
+
+Each parser is a standalone reimplementation of the corresponding VLLM parser's `extract_tool_calls()` logic. No VLLM dependency -- only standard library (`re`, `json`, `uuid`) and `openai` types.
+
+Available parsers:
+- `hermes` -- Hermes/ChatML `<tool_call>` XML format
+- `mistral` -- Mistral `[TOOL_CALLS]` format
+- `llama3_json` -- Llama 3 JSON tool calling
+- `qwen` -- Qwen tool calling format
+- `qwen3_coder` -- Qwen3 Coder format
+- `deepseek_v3` -- DeepSeek V3 format
+- `deepseek_v3_1` -- DeepSeek V3.1 format
+- `kimi_k2` -- Kimi K2 format
+- `longcat` -- Longcat format
+- `glm45` / `glm47` -- GLM model formats
+
+Usage:
+```python
+from environments.tool_call_parsers import get_parser
+
+parser = get_parser("hermes")
+content, tool_calls = parser.parse(raw_model_output)
+```
+
+In Phase 1 (OpenAI server type), these parsers are not needed -- the server handles tool call parsing natively.
+
+## Two-Phase Operation
+
+### Phase 1: OpenAI Server (Evaluation / SFT Data Generation)
+
+Uses `server.chat_completion()` with `tools=` parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively. Returns `ChatCompletion` objects with structured `tool_calls`.
+
+- Good for: evaluation, SFT data generation, testing
+- Run with: `serve` (with `run-api`), `process`, or `evaluate` subcommands
+- Placeholder tokens are created for the Atropos pipeline
+
+### Phase 2: VLLM ManagedServer (Full RL Training)
+
+Uses ManagedServer for exact token IDs + logprobs via `/generate`. A client-side tool call parser (from `tool_call_parsers/`) reconstructs structured `tool_calls` from the raw output.
+
+- Good for: full RL training with GRPO/PPO
+- Run with: `serve` subcommand
+- Real tokens, masks, and logprobs flow through the pipeline
+
+## Directory Structure
+
+```
+environments/
+├── README.md                     # This file
+├── __init__.py                   # Package exports
+├── hermes_base_env.py            # Abstract base (HermesAgentBaseEnv)
+├── agent_loop.py                 # Multi-turn agent engine (HermesAgentLoop)
+├── tool_context.py               # Per-rollout tool access for reward functions
+├── patches.py                    # Async-safety patches for Modal backend
+│
+├── tool_call_parsers/            # Phase 2 client-side parsers
+│   ├── __init__.py               # Registry + base class
+│   ├── hermes_parser.py
+│   ├── mistral_parser.py
+│   ├── llama_parser.py
+│   ├── qwen_parser.py
+│   ├── qwen3_coder_parser.py
+│   ├── deepseek_v3_parser.py
+│   ├── deepseek_v3_1_parser.py
+│   ├── kimi_k2_parser.py
+│   ├── longcat_parser.py
+│   ├── glm45_parser.py
+│   └── glm47_parser.py
+│
+├── terminal_test_env/            # Stack validation environment
+│   └── terminal_test_env.py
+│
+├── hermes_swe_env/               # SWE-bench style training environment
+│   └── hermes_swe_env.py
+│
+└── benchmarks/                   # Evaluation benchmarks
+    └── terminalbench_2/
+        └── terminalbench2_env.py
+```
+
+## Concrete Environments
+
+### TerminalTestEnv (`terminal_test_env/`)
+
+A self-contained environment with inline tasks (no external dataset needed) for validating the full stack end-to-end. Each task asks the model to create a file at a known path, and the verifier checks the content matches.
+
+```bash
+# Serve mode (needs run-api)
+run-api
+python environments/terminal_test_env/terminal_test_env.py serve
+
+# Process mode (no run-api, saves to JSONL)
+python environments/terminal_test_env/terminal_test_env.py process \
+    --env.data_path_to_save_groups terminal_test_output.jsonl
+```
+
+### HermesSweEnv (`hermes_swe_env/`)
+
+SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
+
+```bash
+python environments/hermes_swe_env/hermes_swe_env.py serve \
+    --openai.model_name YourModel \
+    --env.dataset_name bigcode/humanevalpack \
+    --env.terminal_backend modal
+```
+
+### TerminalBench2EvalEnv (`benchmarks/terminalbench_2/`)
+
+**Eval-only** environment for the Terminal-Bench 2.0 benchmark (89 tasks). Each task gets a pre-built Docker Hub image, a natural language instruction, and a test suite. The agent uses terminal + file tools to solve the task, then the test suite verifies correctness.
+
+Follows the standard Atropos eval pattern (like GPQA, MMLU, etc.):
+- Run via `evaluate` subcommand (no `run-api` needed)
+- `setup()` loads the dataset, `evaluate()` runs all tasks
+- `rollout_and_score_eval()` handles per-task agent loop + test verification
+- Downloads verifier output locally for reliable reward checking (Harbor pattern)
+
+```bash
+# Run full benchmark
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+    --openai.model_name anthropic/claude-opus-4.6
+
+# Run subset of tasks
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+    --openai.model_name anthropic/claude-opus-4.6 \
+    --env.task_filter fix-git,git-multibranch
+
+# Skip specific tasks
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+    --openai.model_name anthropic/claude-opus-4.6 \
+    --env.skip_tasks heavy-task,slow-task
+```
+
+## Creating a New Environment
+
+### Training Environment
+
+1. Create a new directory under `environments/`
+2. Create your env file inheriting from `HermesAgentBaseEnv`
+3. Implement the four abstract methods + `evaluate()`
+
+```python
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+
+class MyEnvConfig(HermesAgentEnvConfig):
+    pass  # Add custom fields as needed
+
+class MyEnv(HermesAgentBaseEnv):
+    name = "my-env"
+    env_config_cls = MyEnvConfig
+
+    @classmethod
+    def config_init(cls):
+        env_config = MyEnvConfig(
+            enabled_toolsets=["terminal", "file"],
+            terminal_backend="modal",
+            # ... other config
+        )
+        server_configs = [APIServerConfig(...)]
+        return env_config, server_configs
+
+    async def setup(self):
+        self.dataset = load_dataset(...)
+        self.iter = 0
+
+    async def get_next_item(self):
+        item = self.dataset[self.iter % len(self.dataset)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item):
+        return item["instruction"]
+
+    async def compute_reward(self, item, result, ctx):
+        # ctx gives you full tool access to the rollout's sandbox
+        test = ctx.terminal("pytest -v")
+        return 1.0 if test["exit_code"] == 0 else 0.0
+
+    async def evaluate(self, *args, **kwargs):
+        # Periodic evaluation logic
+        ...
+
+if __name__ == "__main__":
+    MyEnv.cli()
+```
+
+### Eval-Only Environment (Benchmark)
+
+For eval benchmarks, follow the pattern in `terminalbench2_env.py`:
+1. Create under `environments/benchmarks/your-benchmark/`
+2. Inherit from `HermesAgentBaseEnv`
+3. Set eval-only config: `eval_handling=STOP_TRAIN`, `steps_per_eval=1`, `total_steps=1`
+4. Stub the training methods (`collect_trajectories`, `score`)
+5. Implement `rollout_and_score_eval()` and `evaluate()`
+6. Run with `evaluate` subcommand
+
+## Key Config Fields
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| `enabled_toolsets` | Which hermes toolsets to enable | `None` (all) |
+| `disabled_toolsets` | Toolsets to disable | `None` |
+| `distribution` | Probabilistic toolset distribution name | `None` |
+| `max_agent_turns` | Max LLM calls per rollout | `30` |
+| `agent_temperature` | Sampling temperature | `1.0` |
+| `terminal_backend` | `local`, `docker`, `modal`, `ssh`, `singularity` | `local` |
+| `system_prompt` | System message for the agent | `None` |
+| `tool_call_parser` | Parser name for Phase 2 | `hermes` |
+| `eval_handling` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` | `STOP_TRAIN` |
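
The Patches section of the README above describes moving async work onto a dedicated background thread with its own event loop. The sketch below illustrates that general pattern only. `AsyncWorker` and `fake_remote_command` are hypothetical names, and the real `_AsyncWorker` in `patches.py` may be structured differently.

```python
import asyncio
import threading


class AsyncWorker:
    """Runs coroutines on a private event loop in a background thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        """Submit a coroutine to the worker loop and block until it finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def stop(self):
        """Stop the worker loop and wait for its thread to exit."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_remote_command(cmd: str) -> dict:
    # Stand-in for an async backend call (assumption, not a real API).
    await asyncio.sleep(0.01)
    return {"output": f"ran {cmd}", "exit_code": 0}


async def caller_inside_event_loop(worker: AsyncWorker) -> dict:
    # A sync-style call made while a loop is already running. Calling
    # asyncio.run() here would raise RuntimeError; the worker does not.
    return worker.run(fake_remote_command("echo hi"))


worker = AsyncWorker()
result = asyncio.run(caller_inside_event_loop(worker))
print(result["exit_code"])
worker.stop()
```

Note that `worker.run()` blocks the calling thread while it waits. That is acceptable in this demo, and it is consistent with the README's description of tool calls being dispatched through a thread pool rather than directly on the main event loop.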
--- a/environments/__init__.py
+++ b/environments/__init__.py
@@ -0,0 +1,31 @@
+"""
+Hermes-Agent Atropos Environments
+
+Provides a layered integration between hermes-agent's tool-calling capabilities
+and the Atropos RL training framework.
+
+Core layers:
+    - agent_loop: Reusable multi-turn agent loop with standard OpenAI-spec tool calling
+    - tool_context: Per-rollout tool access handle for reward/verification functions
+    - hermes_base_env: Abstract base environment (BaseEnv subclass) for Atropos
+    - tool_call_parsers: Client-side tool call parser registry for Phase 2 (VLLM /generate)
+
+Concrete environments:
+    - terminal_test_env/: Simple file-creation tasks for testing the stack
+    - hermes_swe_env/: SWE-bench style tasks with Modal sandboxes
+
+Benchmarks (eval-only):
+    - benchmarks/terminalbench_2/: Terminal-Bench 2.0 evaluation
+"""
+
+from environments.agent_loop import AgentResult, HermesAgentLoop
+from environments.tool_context import ToolContext
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+
+__all__ = [
+    "AgentResult",
+    "HermesAgentLoop",
+    "ToolContext",
+    "HermesAgentBaseEnv",
+    "HermesAgentEnvConfig",
+]
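
The package docstring above lists `tool_call_parsers` as the client-side parser layer for Phase 2. As a rough illustration of what a Hermes-style parser has to do, here is a minimal sketch. It assumes the `<tool_call>{...}</tool_call>` JSON layout, and `parse_hermes_tool_calls` is a hypothetical helper, not the repository's parser. It keeps `arguments` as a JSON string and leaves already-stringified arguments untouched, in line with the commit's stated goal of avoiding double encoding.

```python
import json
import re
import uuid

# Matches one <tool_call>...</tool_call> block, capturing the JSON payload.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str):
    """Return (content, tool_calls) extracted from raw model output."""
    tool_calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: produce no tool call for it
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        args = payload.get("arguments", {})
        # OpenAI-style tool calls carry arguments as a JSON string.
        # If the model already emitted a string, keep it as-is.
        args_json = args if isinstance(args, str) else json.dumps(args)
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {"name": payload["name"], "arguments": args_json},
        })
    # Content is whatever text remains once all tool-call blocks are removed.
    content = TOOL_CALL_RE.sub("", text).strip() or None
    return content, tool_calls


raw_output = (
    "Let me check the directory.\n"
    '<tool_call>\n{"name": "terminal", "arguments": {"command": "ls"}}\n</tool_call>'
)
content, calls = parse_hermes_tool_calls(raw_output)
print(content)
print(calls[0]["function"]["name"], calls[0]["function"]["arguments"])
```

Returning plain dictionaries keeps the sketch dependency-free; the README states that the real parsers use `openai` types for the structured result.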
--- a/environments/agent_loop.py
+++ b/environments/agent_loop.py
@@ -0,0 +1,588 @@
+"""
+HermesAgentLoop -- Reusable Multi-Turn Agent Engine
+
+Runs the hermes-agent tool-calling loop using standard OpenAI-spec tool calling.
+Works with any server that returns ChatCompletion objects with tool_calls:
+    - Phase 1: OpenAI server type (VLLM, SGLang, OpenRouter, OpenAI API)
+    - Phase 2: ManagedServer with client-side tool call parser
+
+The loop passes tools= and checks response.choices[0].message.tool_calls,
+identical to hermes-agent's run_agent.py. Tool execution is dispatched via
+handle_function_call() from model_tools.py.
+"""
+
+import asyncio
+import concurrent.futures
+import json
+import logging
+import os
+import uuid
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Optional, Set
+
+from model_tools import handle_function_call
+
+# Thread pool for running sync tool calls that internally use asyncio.run()
+# (e.g., mini-swe-agent's modal/docker backends). Running them in a separate
+# thread gives them a clean event loop so they don't deadlock inside Atropos's loop.
+# Size must be large enough for concurrent eval tasks (e.g., 89 TB2 tasks all
+# making tool calls). Too small = thread pool starvation, tasks queue for minutes.
+# Resized at runtime by HermesAgentBaseEnv.__init__ via resize_tool_pool().
+_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=128)
+
+
+def resize_tool_pool(max_workers: int):
+    """
+    Replace the global tool executor with a new one of the given size.
+
+    Called by HermesAgentBaseEnv.__init__ based on config.tool_pool_size.
+    Safe to call before any tasks are submitted.
+    """
+    global _tool_executor
+    _tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
+    logger.info("Tool thread pool resized to %d workers", max_workers)
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class ToolError:
+    """Record of a tool execution error during the agent loop."""
+
+    turn: int                  # Which turn the error occurred on
+    tool_name: str             # Which tool was called
+    arguments: str             # The arguments passed (truncated)
+    error: str                 # The error message
+    tool_result: str           # The raw result returned to the model
+
+
+@dataclass
+class AgentResult:
+    """Result of running the agent loop."""
+
+    # Full conversation history in OpenAI message format
+    messages: List[Dict[str, Any]]
+    # ManagedServer.get_state() if available (Phase 2), None otherwise
+    managed_state: Optional[Dict[str, Any]] = None
+    # How many LLM calls were made
+    turns_used: int = 0
+    # True if model stopped calling tools naturally (vs hitting max_turns)
+    finished_naturally: bool = False
+    # Extracted reasoning content per turn (from PR #297 helpers)
+    reasoning_per_turn: List[Optional[str]] = field(default_factory=list)
+    # Tool errors encountered during the loop
+    tool_errors: List[ToolError] = field(default_factory=list)
+
+    # Tool-call metrics (debugging / optional reward shaping)
+    tool_calls_attempted: int = 0
+    tool_calls_schema_valid: int = 0
+    tool_calls_executed_ok: int = 0
+    tool_calls_exec_error: int = 0
+
+
+def _extract_reasoning_from_message(message) -> Optional[str]:
+    """
+    Extract reasoning content from a ChatCompletion message.
+
+    Handles multiple provider formats:
+    1. message.reasoning_content field (some providers)
+    2. message.reasoning field (some providers)
+    3. message.reasoning_details[].text (OpenRouter style)
+
+    Note: <think> block extraction from content is NOT done here -- in
+    Phase 1 the server has already done it, and in Phase 2 it is handled by
+    ManagedServer's patch.
+
+    Args:
+        message: The assistant message from ChatCompletion response
+
+    Returns:
+        Extracted reasoning text, or None if not found
+    """
+    # Check reasoning_content field (common across providers)
+    if hasattr(message, "reasoning_content") and message.reasoning_content:
+        return message.reasoning_content
+
+    # Check reasoning field
+    if hasattr(message, "reasoning") and message.reasoning:
+        return message.reasoning
+
+    # Check reasoning_details (OpenRouter style)
+    if hasattr(message, "reasoning_details") and message.reasoning_details:
+        for detail in message.reasoning_details:
+            if hasattr(detail, "text") and detail.text:
+                return detail.text
+            if isinstance(detail, dict) and detail.get("text"):
+                return detail["text"]
+
+    return None
+
+
+class HermesAgentLoop:
+    """
+    Runs hermes-agent's tool-calling loop using standard OpenAI-spec tool calling.
+
+    Same pattern as run_agent.py:
+    - Pass tools= to the API
+    - Check response.choices[0].message.tool_calls
+    - Dispatch via handle_function_call()
+
+    Works identically with any server type -- OpenAI, VLLM, SGLang, OpenRouter,
+    or ManagedServer with a parser. The server determines how tool_calls get
+    populated on the response.
+    """
+
+    def __init__(
+        self,
+        server,
+        tool_schemas: List[Dict[str, Any]],
+        valid_tool_names: Set[str],
+        max_turns: int = 30,
+        task_id: Optional[str] = None,
+        temperature: float = 1.0,
+        max_tokens: Optional[int] = None,
+        extra_body: Optional[Dict[str, Any]] = None,
+        tool_handler=None,
+        max_context_tokens: Optional[int] = None,
+    ):
+        """
+        Initialize the agent loop.
+
+        Args:
+            server: Server object with chat_completion() method (OpenAIServer,
+                    ManagedServer, ServerManager, etc.)
+            tool_schemas: OpenAI-format tool definitions from get_tool_definitions()
+            valid_tool_names: Set of tool names the model is allowed to call
+            max_turns: Maximum number of LLM calls before stopping
+            task_id: Unique ID for terminal/browser session isolation
+            temperature: Sampling temperature for generation
+            max_tokens: Max tokens per generation (None for server default)
+            extra_body: Extra parameters passed to the OpenAI client's create() call.
+                        Used for OpenRouter provider preferences, transforms, etc.
+                        e.g. {"provider": {"ignore": ["DeepInfra"]}}
+            tool_handler: Optional async callable(tool_name, args, task_id) -> str.
+                         When provided, used INSTEAD of handle_function_call() for
+                         tool dispatch. This allows sandbox backends (Modal, Nomad)
+                         to route tool calls through their slot-based execution.
+            max_context_tokens: Maximum prompt tokens before truncation.
+                               If None, no truncation is applied.
+                               Recommended: set to max_model_len - max_tokens - 512 (safety margin).
+        """
+        self.server = server
+        self.tool_schemas = tool_schemas
+        self.valid_tool_names = valid_tool_names
+        self.max_turns = max_turns
+        self.task_id = task_id or str(uuid.uuid4())
+        self.temperature = temperature
+        self.max_tokens = max_tokens
+        self.extra_body = extra_body
+        self.tool_handler = tool_handler
+        self.max_context_tokens = max_context_tokens
+
+    def _truncate_context(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+        """Truncate conversation history to fit within max_context_tokens.
+
+        Strategy:
+        - Keep system message (index 0) and initial user message (index 1) always
+        - Keep last 6 messages (recent context) always
+        - For everything in between, progressively truncate tool result content
+        - If still too long, drop oldest middle messages entirely
+
+        Uses rough char/4 token estimate (fast, no tokenizer needed).
+
+        NOTE: This function mutates the provided list (it may pop/replace entries).
+        Call it on a copy when you want to preserve the full trajectory.
+        """
+        if self.max_context_tokens is None:
+            return messages
+
+        def estimate_tokens(msgs):
+            total = 0
+            for m in msgs:
+                content = m.get("content", "") or ""
+                total += len(content) // 4 + 10  # ~4 chars per token + overhead
+                if "tool_calls" in m:
+                    total += 50 * len(m["tool_calls"])  # tool call overhead
+            return total
+
+        if estimate_tokens(messages) <= self.max_context_tokens:
+            return messages
+
+        protect_head = 2
+        protect_tail = max(0, min(6, len(messages) - protect_head))
+        middle_start = protect_head
+        middle_end = len(messages) - protect_tail
+
+        # Phase 1: truncate tool outputs in the middle
+        if middle_start < middle_end:
+            for i in range(middle_start, middle_end):
+                if messages[i].get("role") == "tool":
+                    content = messages[i].get("content", "") or ""
+                    if len(content) > 200:
+                        messages[i] = dict(messages[i])
+                        messages[i]["content"] = content[:100] + "\n...[truncated]...\n" + content[-50:]
+
+            if estimate_tokens(messages) <= self.max_context_tokens:
+                return messages
+
+        # Phase 2: drop oldest middle messages (an assistant tool-call message is dropped together with its tool results)
+        while middle_start < middle_end and estimate_tokens(messages) > self.max_context_tokens:
+            msg = messages[middle_start]
+            messages.pop(middle_start)
+            middle_end -= 1
+
+            if msg.get("role") == "assistant" and msg.get("tool_calls"):
+                tool_ids = {
+                    tc.get("id") or tc.get("tool_call_id", "")
+                    for tc in msg.get("tool_calls", [])
+                    if isinstance(tc, dict)
+                }
+                i = middle_start
+                while i < middle_end:
+                    if messages[i].get("role") == "tool" and messages[i].get("tool_call_id", "") in tool_ids:
+                        messages.pop(i)
+                        middle_end -= 1
+                    else:
+                        i += 1
+
+        return messages
+
+    def _normalize_tool_args(self, tool_name: str, tool_args_raw: str) -> tuple[Dict[str, Any], bool]:
+        """Normalize tool arguments into a dict.
+
+        Returns: (args_dict, schema_valid)
+
+        schema_valid is True only when arguments decode directly into a dict
+        (no double-decoding and no coercion/wrapping required). For the
+        "terminal" tool, the dict must also carry a non-empty string "command".
+
+        Goal: keep environments robust (never crash on args format drift) while
+        still allowing reward functions to penalize malformed formats if desired.
+        """
+        try:
+            decoded = json.loads(tool_args_raw)
+        except json.JSONDecodeError:
+            # Not JSON at all — treat as a plain string
+            if tool_name == "terminal":
+                return {"command": tool_args_raw}, False
+            return {"input": tool_args_raw}, False
+
+        if isinstance(decoded, dict):
+            if tool_name == "terminal":
+                cmd = decoded.get("command")
+                if isinstance(cmd, str) and cmd.strip():
+                    return decoded, True
+                if isinstance(decoded.get("input"), str):
+                    return {"command": decoded.get("input")}, False
+                return decoded, False
+            return decoded, True
+
+        if isinstance(decoded, str):
+            s = decoded.strip()
+            if (s.startswith("{") and s.endswith("}")) or (s.startswith("[") and s.endswith("]")):
+                try:
+                    decoded2 = json.loads(s)
+                except json.JSONDecodeError:
+                    decoded2 = None
+                if isinstance(decoded2, dict):
+                    return decoded2, False
+
+            if tool_name == "terminal":
+                return {"command": decoded}, False
+            return {"input": decoded}, False
+
+        if tool_name == "terminal":
+            return {"command": str(decoded)}, False
+        return {"input": decoded}, False
+
+    async def run(self, messages: List[Dict[str, Any]]) -> AgentResult:
+        """
+        Execute the full agent loop using standard OpenAI tool calling.
+
+        Args:
+            messages: Initial conversation messages (system + user).
+                      Modified in-place as the conversation progresses.
+
+        Returns:
+            AgentResult with full conversation history, managed state, and metadata
+        """
+        reasoning_per_turn = []
+        tool_errors: List[ToolError] = []
+
+        tool_calls_attempted = 0
+        tool_calls_schema_valid = 0
+        tool_calls_executed_ok = 0
+        tool_calls_exec_error = 0
+
+        import time as _time
+
+        for turn in range(self.max_turns):
+            turn_start = _time.monotonic()
+
+            # Truncate prompt view on a copy (preserve full trajectory in `messages`)
+            prompt_messages = self._truncate_context(list(messages))
+
+            # Build the chat_completion kwargs
+            chat_kwargs = {
+                "messages": prompt_messages,
+                "n": 1,
+                "temperature": self.temperature,
+            }
+
+            # Only pass tools if we have them
+            if self.tool_schemas:
+                chat_kwargs["tools"] = self.tool_schemas
+
+            # Only pass max_tokens if explicitly set
+            if self.max_tokens is not None:
+                chat_kwargs["max_tokens"] = self.max_tokens
+
+            # Inject extra_body for provider-specific params (e.g., OpenRouter
+            # provider preferences like banned/preferred providers, transforms)
+            if self.extra_body:
+                chat_kwargs["extra_body"] = self.extra_body
+
+            # Make the API call -- standard OpenAI spec
+            api_start = _time.monotonic()
+            try:
+                response = await self.server.chat_completion(**chat_kwargs)
+            except Exception as e:
+                api_elapsed = _time.monotonic() - api_start
+                logger.error("API call failed on turn %d (%.1fs): %s", turn + 1, api_elapsed, e)
+                return AgentResult(
+                    messages=messages,
+                    managed_state=self._get_managed_state(),
+                    turns_used=turn + 1,
+                    finished_naturally=False,
+                    reasoning_per_turn=reasoning_per_turn,
+                    tool_errors=tool_errors,
+                    tool_calls_attempted=tool_calls_attempted,
+                    tool_calls_schema_valid=tool_calls_schema_valid,
+                    tool_calls_executed_ok=tool_calls_executed_ok,
+                    tool_calls_exec_error=tool_calls_exec_error,
+                )
+
+            api_elapsed = _time.monotonic() - api_start
+
+            if not response or not response.choices:
+                logger.warning("Empty response on turn %d (api=%.1fs)", turn + 1, api_elapsed)
+                return AgentResult(
+                    messages=messages,
+                    managed_state=self._get_managed_state(),
+                    turns_used=turn + 1,
+                    finished_naturally=False,
+                    reasoning_per_turn=reasoning_per_turn,
+                    tool_errors=tool_errors,
+                    tool_calls_attempted=tool_calls_attempted,
+                    tool_calls_schema_valid=tool_calls_schema_valid,
+                    tool_calls_executed_ok=tool_calls_executed_ok,
+                    tool_calls_exec_error=tool_calls_exec_error,
+                )
+
+            assistant_msg = response.choices[0].message
+
+            # Extract reasoning content from the response (all provider formats)
+            reasoning = _extract_reasoning_from_message(assistant_msg)
+            reasoning_per_turn.append(reasoning)
+
+            # Check for tool calls -- standard OpenAI spec
+            if assistant_msg.tool_calls:
+                # Build the assistant message dict for conversation history
+                msg_dict: Dict[str, Any] = {
+                    "role": "assistant",
+                    "content": assistant_msg.content or "",
+                    "tool_calls": [
+                        {
+                            "id": tc.id,
+                            "type": "function",
+                            "function": {
+                                "name": tc.function.name,
+                                "arguments": tc.function.arguments,
+                            },
+                        }
+                        for tc in assistant_msg.tool_calls
+                    ],
+                }
+
+                # Preserve reasoning_content for multi-turn chat template handling
+                # (e.g., Kimi-K2's template renders <think> blocks differently
+                # for history vs. the latest turn based on this field)
+                if reasoning:
+                    msg_dict["reasoning_content"] = reasoning
+
+                messages.append(msg_dict)
+
+                # Execute each tool call via hermes-agent's dispatch
+                for tc in assistant_msg.tool_calls:
+                    tool_name = tc.function.name
+                    tool_args_raw = tc.function.arguments
+
+                    # Validate tool name
+                    if tool_name not in self.valid_tool_names:
+                        tool_calls_exec_error += 1
+                        tool_result = json.dumps(
+                            {
+                                "error": f"Unknown tool '{tool_name}'. "
+                                f"Available tools: {sorted(self.valid_tool_names)}"
+                            }
+                        )
+                        tool_errors.append(ToolError(
+                            turn=turn + 1, tool_name=tool_name,
+                            arguments=tool_args_raw[:200],
+                            error=f"Unknown tool '{tool_name}'",
+                            tool_result=tool_result,
+                        ))
+                        logger.warning(
+                            "Model called unknown tool '%s' on turn %d",
+                            tool_name, turn + 1,
+                        )
+                    else:
+                        tool_calls_attempted += 1
+                        args, schema_valid = self._normalize_tool_args(tool_name, tool_args_raw)
+                        if schema_valid:
+                            tool_calls_schema_valid += 1
+
+                        try:
+                            if tool_name == "terminal":
+                                cmd_preview = str(args.get("command", ""))[:80]
+                                logger.info(
+                                    "[%s] $ %s", self.task_id[:8], cmd_preview,
+                                )
+
+                            tool_submit_time = _time.monotonic()
+
+                            if self.tool_handler:
+                                tool_result = await self.tool_handler(tool_name, args, self.task_id)
+                            else:
+                                # Run tool calls in a thread pool so backends that use
+                                # asyncio.run() internally (modal, docker) get a clean
+                                # event loop instead of deadlocking inside Atropos's loop.
+                                loop = asyncio.get_running_loop()
+                                tool_result = await loop.run_in_executor(
+                                    _tool_executor,
+                                    lambda: handle_function_call(
+                                        tool_name, args, task_id=self.task_id
+                                    ),
+                                )
+
+                            tool_elapsed = _time.monotonic() - tool_submit_time
+
+                            # Log slow tools and the thread pool backlog for debugging
+                            pool_queued = _tool_executor._work_queue.qsize()
+                            if tool_elapsed > 30:
+                                logger.warning(
+                                    "[%s] turn %d: %s took %.1fs (pool queue=%d)",
+                                    self.task_id[:8], turn + 1, tool_name,
+                                    tool_elapsed, pool_queued,
+                                )
+                        except Exception as e:
+                            tool_calls_exec_error += 1
+                            tool_result = json.dumps(
+                                {"error": f"Tool execution failed: {type(e).__name__}: {str(e)}"}
+                            )
+                            tool_errors.append(ToolError(
+                                turn=turn + 1, tool_name=tool_name,
+                                arguments=tool_args_raw[:200],
+                                error=f"{type(e).__name__}: {str(e)}",
+                                tool_result=tool_result,
+                            ))
+                            logger.error(
+                                "Tool '%s' execution failed on turn %d: %s",
+                                tool_name, turn + 1, e,
+                            )
+                        else:
+                            tool_err = False
+                            try:
+                                result_data = json.loads(tool_result)
+                                if isinstance(result_data, dict):
+                                    err = result_data.get("error")
+                                    if err:
+                                        tool_err = True
+
+                                    exit_code = result_data.get("exit_code")
+                                    if isinstance(exit_code, int) and exit_code < 0:
+                                        tool_err = True
+                                        tool_errors.append(ToolError(
+                                            turn=turn + 1, tool_name=tool_name,
+                                            arguments=tool_args_raw[:200],
+                                            error=str(err) if err else "negative exit_code",
+                                            tool_result=tool_result[:500],
+                                        ))
+                            except (json.JSONDecodeError, TypeError):
+                                pass
+
+                            if tool_err:
+                                tool_calls_exec_error += 1
+                            else:
+                                tool_calls_executed_ok += 1
+
+                    # Add tool response to conversation
+                    messages.append(
+                        {
+                            "role": "tool",
+                            "tool_call_id": tc.id,
+                            "content": tool_result,
+                        }
+                    )
+
+                turn_elapsed = _time.monotonic() - turn_start
+                logger.info(
+                    "[%s] turn %d: api=%.1fs, %d tools, turn_total=%.1fs",
+                    self.task_id[:8], turn + 1, api_elapsed,
+                    len(assistant_msg.tool_calls), turn_elapsed,
+                )
+
+            else:
+                # No tool calls -- model is done
+                msg_dict = {
+                    "role": "assistant",
+                    "content": assistant_msg.content or "",
+                }
+                if reasoning:
+                    msg_dict["reasoning_content"] = reasoning
+                messages.append(msg_dict)
+
+                turn_elapsed = _time.monotonic() - turn_start
+                logger.info(
+                    "[%s] turn %d: api=%.1fs, no tools (finished), turn_total=%.1fs",
+                    self.task_id[:8], turn + 1, api_elapsed, turn_elapsed,
+                )
+
+                return AgentResult(
+                    messages=messages,
+                    managed_state=self._get_managed_state(),
+                    turns_used=turn + 1,
+                    finished_naturally=True,
+                    reasoning_per_turn=reasoning_per_turn,
+                    tool_errors=tool_errors,
+                    tool_calls_attempted=tool_calls_attempted,
+                    tool_calls_schema_valid=tool_calls_schema_valid,
+                    tool_calls_executed_ok=tool_calls_executed_ok,
+                    tool_calls_exec_error=tool_calls_exec_error,
+                )
+
+        # Hit max turns without the model stopping
+        logger.info("Agent hit max_turns (%d) without finishing", self.max_turns)
+        return AgentResult(
+            messages=messages,
+            managed_state=self._get_managed_state(),
+            turns_used=self.max_turns,
+            finished_naturally=False,
+            reasoning_per_turn=reasoning_per_turn,
+            tool_errors=tool_errors,
+            tool_calls_attempted=tool_calls_attempted,
+            tool_calls_schema_valid=tool_calls_schema_valid,
+            tool_calls_executed_ok=tool_calls_executed_ok,
+            tool_calls_exec_error=tool_calls_exec_error,
+        )
+
+    def _get_managed_state(self) -> Optional[Dict[str, Any]]:
+        """
+        Get ManagedServer state if the server supports it.
+
+        Returns state dict with SequenceNodes containing tokens/logprobs/masks,
+        or None if the server doesn't support get_state() (e.g., regular OpenAI server).
+        """
+        if hasattr(self.server, "get_state"):
+            return self.server.get_state()
+        return None
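As a rough illustration of the budget recommended in the `max_context_tokens` docstring (`max_model_len - max_tokens - 512`), the standalone sketch below computes that budget and applies a char/4 estimate similar to the one in `_truncate_context`. The helper names `prompt_budget` and `estimate_tokens` and the example context length of 32768 are assumptions made for illustration; they are not part of this commit.

```python
from typing import Any, Dict, List


def prompt_budget(max_model_len: int, max_tokens: int, safety_margin: int = 512) -> int:
    """Prompt-token budget: model context minus generation budget minus a margin."""
    return max_model_len - max_tokens - safety_margin


def estimate_tokens(msgs: List[Dict[str, Any]]) -> int:
    """Rough estimate: ~4 chars per token, +10 per message, +50 per tool call."""
    total = 0
    for m in msgs:
        content = m.get("content", "") or ""
        total += len(content) // 4 + 10
        total += 50 * len(m.get("tool_calls") or [])
    return total


# Hypothetical conversation: a 400-character system prompt and one assistant
# message carrying two tool calls and no text content.
example_messages = [
    {"role": "system", "content": "a" * 400},
    {"role": "assistant", "content": None, "tool_calls": [{}, {}]},
]

budget = prompt_budget(max_model_len=32768, max_tokens=4096)
needs_truncation = estimate_tokens(example_messages) > budget
```

The estimate is deliberately coarse, so the 512-token margin absorbs some of the error; a real tokenizer would be more accurate but slower.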
--- a/environments/benchmarks/__init__.py
+++ b/environments/benchmarks/__init__.py
--- a/environments/benchmarks/terminalbench_2/__init__.py
+++ b/environments/benchmarks/terminalbench_2/__init__.py
--- a/environments/benchmarks/terminalbench_2/default.yaml
+++ b/environments/benchmarks/terminalbench_2/default.yaml
@@ -0,0 +1,38 @@
+# Terminal-Bench 2.0 Evaluation -- Default Configuration
+#
+# Eval-only environment for the TB2 benchmark (89 terminal tasks).
+# Uses Modal terminal backend for per-task cloud-isolated sandboxes
+# and OpenRouter for inference.
+#
+# Usage:
+#   python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+#       --config environments/benchmarks/terminalbench_2/default.yaml
+#
+#   # Override model:
+#   python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+#       --config environments/benchmarks/terminalbench_2/default.yaml \
+#       --openai.model_name anthropic/claude-sonnet-4
+
+env:
+  enabled_toolsets: ["terminal", "file"]
+  max_agent_turns: 60
+  max_token_length: 32000
+  agent_temperature: 0.8
+  terminal_backend: "modal"
+  terminal_timeout: 300        # 5 min per command (builds, pip install)
+  tool_pool_size: 128          # thread pool for 89 parallel tasks
+  dataset_name: "NousResearch/terminal-bench-2"
+  test_timeout: 600
+  task_timeout: 1800           # 30 min wall-clock per task, auto-FAIL if exceeded
+  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
+  use_wandb: true
+  wandb_name: "terminal-bench-2"
+  ensure_scores_are_not_same: false
+  data_dir_to_save_evals: "environments/benchmarks/evals/terminal-bench-2"
+
+openai:
+  base_url: "https://openrouter.ai/api/v1"
+  model_name: "anthropic/claude-opus-4.6"
+  server_type: "openai"
+  health_check: false
+  # api_key loaded from OPENROUTER_API_KEY in .env
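The `task_timeout` setting above scores a task as FAIL when it exceeds its wall-clock limit. How the environment enforces this is not shown in this part of the diff; the sketch below assumes one common approach, `asyncio.wait_for`, and uses hypothetical names (`run_with_timeout`, `_slow_task`, `_fast_task`).

```python
import asyncio


async def run_with_timeout(coro, task_timeout: float) -> float:
    """Return the task's reward, or 0.0 (FAIL) if it exceeds task_timeout seconds."""
    try:
        return await asyncio.wait_for(coro, timeout=task_timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the wrapped coroutine before raising.
        return 0.0


async def _slow_task() -> float:
    """Hypothetical task that takes longer than its allowance."""
    await asyncio.sleep(1.0)
    return 1.0


async def _fast_task() -> float:
    """Hypothetical task that passes immediately."""
    return 1.0


timed_out_reward = asyncio.run(run_with_timeout(_slow_task(), task_timeout=0.01))
passed_reward = asyncio.run(run_with_timeout(_fast_task(), task_timeout=1.0))
```

With a binary reward, treating a timeout as 0.0 keeps the aggregate pass rate well defined even when some tasks never finish.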
--- a/environments/benchmarks/terminalbench_2/run_eval.sh
+++ b/environments/benchmarks/terminalbench_2/run_eval.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+
+# Terminal-Bench 2.0 Evaluation
+#
+# Run from repo root:
+#   bash environments/benchmarks/terminalbench_2/run_eval.sh
+#
+# Override model:
+#   bash environments/benchmarks/terminalbench_2/run_eval.sh \
+#       --openai.model_name anthropic/claude-sonnet-4
+#
+# Run a subset:
+#   bash environments/benchmarks/terminalbench_2/run_eval.sh \
+#       --env.task_filter fix-git,git-multibranch
+
+mkdir -p logs environments/benchmarks/evals/terminal-bench-2
+LOG_FILE="logs/terminalbench2_$(date +%Y%m%d_%H%M%S).log"
+
+echo "Terminal-Bench 2.0 Evaluation"
+echo "Log: $LOG_FILE"
+echo ""
+
+export TERMINAL_ENV=modal
+export TERMINAL_TIMEOUT=300
+
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+  --config environments/benchmarks/terminalbench_2/default.yaml \
+  "$@" \
+  2>&1 | tee "$LOG_FILE"
+
+echo ""
+echo "Log saved to: $LOG_FILE"
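The `--env.task_filter` value in the usage comments above is a comma-separated string of task names. The sketch below shows one way such a value could be turned into a selection over task records. `parse_names` and `select_tasks` are hypothetical helpers, and dropping empty names is an added assumption rather than behaviour taken from the commit.

```python
from typing import Dict, List, Optional, Set


def parse_names(csv: Optional[str]) -> Set[str]:
    """Split a comma-separated CLI value into a set of stripped, non-empty names."""
    if not csv:
        return set()
    return {name.strip() for name in csv.split(",") if name.strip()}


def select_tasks(
    tasks: List[Dict[str, str]],
    task_filter: Optional[str] = None,
    skip_tasks: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Keep tasks named in task_filter (all tasks if unset), minus those in skip_tasks."""
    allowed = parse_names(task_filter)
    skip = parse_names(skip_tasks)
    return [
        t for t in tasks
        if (not allowed or t["task_name"] in allowed) and t["task_name"] not in skip
    ]


# Task names taken from the examples in this diff.
example_tasks = [{"task_name": n} for n in ("fix-git", "git-multibranch", "qemu-startup")]
selected = select_tasks(example_tasks, "fix-git, git-multibranch", "git-multibranch")
```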
--- a/environments/benchmarks/terminalbench_2/terminalbench2_env.py
+++ b/environments/benchmarks/terminalbench_2/terminalbench2_env.py
@@ -0,0 +1,904 @@
+"""
+TerminalBench2Env -- Terminal-Bench 2.0 Evaluation Environment
+
+Evaluates agentic LLMs on challenging terminal tasks from Terminal-Bench 2.0.
+Each task provides a unique Docker environment (pre-built on Docker Hub), a natural
+language instruction, and a test suite for verification. The agent uses terminal +
+file tools to complete the task, then the test suite runs inside the same sandbox.
+
+This is an eval-only environment (not a training environment). It is designed to
+be run via the `evaluate` subcommand:
+
+    python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \\
+        --env.dataset_name NousResearch/terminal-bench-2
+
+The evaluate flow:
+    1. setup()     -- Loads the TB2 dataset from HuggingFace
+    2. evaluate()  -- Iterates over all tasks, running each through:
+        a. rollout_and_score_eval()  -- Per-task agent loop + test verification
+            - Resolves Docker image (pre-built Hub image or Dockerfile fallback)
+            - Registers per-task Modal sandbox via register_task_env_overrides()
+            - Runs the HermesAgentLoop (terminal + file tools)
+            - Uploads test suite and runs test.sh in the same sandbox
+            - Returns binary pass/fail result
+        b. Aggregates per-task, per-category, and overall pass rates
+        c. Logs results via evaluate_log() and wandb
+
+Key features:
+  - Per-task Modal sandboxes using pre-built Docker Hub images
+  - Binary reward: 1.0 if all tests pass, 0.0 otherwise
+  - Concurrency-controlled parallel evaluation via asyncio.Semaphore
+  - Per-task, per-category, and aggregate pass rate tracking
+"""
+
+import asyncio
+import base64
+import io
+import json
+import logging
+import os
+import shutil
+import sys
+import tarfile
+import tempfile
+import time
+import uuid
+from collections import defaultdict
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# Ensure repo root is on sys.path for imports
+_repo_root = Path(__file__).resolve().parent.parent.parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from pydantic import Field
+
+from atroposlib.envs.base import EvalHandlingEnum
+from atroposlib.envs.server_handling.server_manager import APIServerConfig
+
+from environments.agent_loop import AgentResult, HermesAgentLoop
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+from environments.tool_context import ToolContext
+from tools.terminal_tool import (
+    register_task_env_overrides,
+    clear_task_env_overrides,
+    cleanup_vm,
+)
+
+logger = logging.getLogger(__name__)
+
+
+# =============================================================================
+# Configuration
+# =============================================================================
+
+class TerminalBench2EvalConfig(HermesAgentEnvConfig):
+    """
+    Configuration for the Terminal-Bench 2.0 evaluation environment.
+
+    Extends HermesAgentEnvConfig with TB2-specific settings for dataset loading,
+    test execution, task filtering, and eval concurrency.
+    """
+
+    # --- Dataset ---
+    dataset_name: str = Field(
+        default="NousResearch/terminal-bench-2",
+        description="HuggingFace dataset containing TB2 tasks.",
+    )
+
+    # --- Test execution ---
+    test_timeout: int = Field(
+        default=180,
+        description="Timeout in seconds for running the test suite after agent completes.",
+    )
+
+    # --- Image strategy ---
+    force_build: bool = Field(
+        default=False,
+        description="If True, always build from Dockerfile (ignore docker_image). "
+        "Useful for testing custom Dockerfiles.",
+    )
+
+    # --- Task filtering (comma-separated from CLI) ---
+    task_filter: Optional[str] = Field(
+        default=None,
+        description="Comma-separated task names to run (e.g., 'fix-git,git-multibranch'). "
+        "If not set, all tasks are run.",
+    )
+    skip_tasks: Optional[str] = Field(
+        default=None,
+        description="Comma-separated task names to skip on top of the default skip list.",
+    )
+
+    # --- Per-task wall-clock timeout ---
+    task_timeout: int = Field(
+        default=1800,
+        description="Maximum wall-clock seconds per task (agent loop + verification). "
+        "Tasks exceeding this are scored as FAIL. Default 30 minutes.",
+    )
+
+
+# Tasks that cannot run properly on Modal and are excluded from scoring.
+MODAL_INCOMPATIBLE_TASKS = {
+    "qemu-startup",        # Needs KVM/hardware virtualization
+    "qemu-alpine-ssh",     # Needs KVM/hardware virtualization
+    "crack-7z-hash",       # Password brute-force -- too slow for cloud sandbox timeouts
+}
+
+
+# =============================================================================
+# Tar extraction helper
+# =============================================================================
+
+def _extract_base64_tar(b64_data: str, target_dir: Path):
+    """Extract a base64-encoded tar.gz archive into target_dir."""
+    if not b64_data:
+        return
+    raw = base64.b64decode(b64_data)
+    buf = io.BytesIO(raw)
+    with tarfile.open(fileobj=buf, mode="r:gz") as tar:
+        tar.extractall(path=str(target_dir))
+
+
+# =============================================================================
+# Main Environment
+# =============================================================================
+
+class TerminalBench2EvalEnv(HermesAgentBaseEnv):
+    """
+    Terminal-Bench 2.0 evaluation environment (eval-only, no training).
+
+    Inherits from HermesAgentBaseEnv for:
+      - Terminal backend setup (os.environ["TERMINAL_ENV"])
+      - Tool resolution via _resolve_tools_for_group()
+      - Monkey patches for async-safe tool operation
+      - Wandb trajectory formatting
+
+    The evaluate flow (triggered by `terminalbench2_env.py evaluate`):
+      1. setup()    -- Load dataset from HuggingFace
+      2. evaluate() -- Run all tasks through rollout_and_score_eval()
+
+    Each task in rollout_and_score_eval():
+      1. Resolve Docker image (pre-built Hub image or Dockerfile fallback)
+      2. Register per-task Modal sandbox override
+      3. Run HermesAgentLoop with terminal + file tools
+      4. Upload test suite and execute test.sh in the same sandbox
+      5. Check /logs/verifier/reward.txt for pass/fail
+      6. Clean up sandbox, overrides, and temp files
+    """
+
+    name = "terminal-bench-2"
+    env_config_cls = TerminalBench2EvalConfig
+
+    @classmethod
+    def config_init(cls) -> Tuple[TerminalBench2EvalConfig, List[APIServerConfig]]:
+        """
+        Default configuration for Terminal-Bench 2.0 evaluation.
+
+        Uses eval-only settings:
+          - eval_handling=STOP_TRAIN so the eval flow runs cleanly
+          - steps_per_eval=1, total_steps=1 so eval triggers immediately
+          - group_size=1 (one rollout per group, each task is expensive)
+
+        Uses Modal terminal backend (cloud-isolated sandbox per task) and
+        OpenRouter with Claude for inference.
+        """
+        env_config = TerminalBench2EvalConfig(
+            # Terminal + file tools only (the agent interacts via shell commands)
+            enabled_toolsets=["terminal", "file"],
+            disabled_toolsets=None,
+            distribution=None,
+
+            # Agent settings -- TB2 tasks are complex, need many turns
+            max_agent_turns=60,
+            max_token_length=16000,
+            agent_temperature=0.6,
+            system_prompt=None,
+
+            # Modal backend for per-task cloud-isolated sandboxes
+            terminal_backend="modal",
+            terminal_timeout=300,   # 5 min per command (builds, pip install, etc.)
+
+            # Test execution timeout (TB2 test scripts can install deps like pytest)
+            test_timeout=180,
+
+            # 89 tasks run in parallel, each needs a thread for tool calls
+            tool_pool_size=128,
+
+            # --- Eval-only Atropos settings ---
+            # These settings make the env work as an eval-only environment:
+            #   - STOP_TRAIN: pauses training during eval (standard for eval envs)
+            #   - steps_per_eval=1, total_steps=1: eval triggers immediately
+            #   - group_size=1: one rollout per group (each task is expensive)
+            eval_handling=EvalHandlingEnum.STOP_TRAIN,
+            group_size=1,
+            steps_per_eval=1,
+            total_steps=1,
+
+            tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
+            use_wandb=True,
+            wandb_name="terminal-bench-2",
+            ensure_scores_are_not_same=False,  # Binary rewards may all be 0 or 1
+        )
+
+        # OpenRouter with Claude -- API key loaded from .env
+        server_configs = [
+            APIServerConfig(
+                base_url="https://openrouter.ai/api/v1",
+                model_name="anthropic/claude-sonnet-4",
+                server_type="openai",
+                api_key=os.getenv("OPENROUTER_API_KEY", ""),
+                health_check=False,
+            )
+        ]
+
+        return env_config, server_configs
+
+    # =========================================================================
+    # Setup -- load dataset
+    # =========================================================================
+
+    async def setup(self):
+        """Load the Terminal-Bench 2.0 dataset from HuggingFace."""
+        from datasets import load_dataset
+
+        # Auto-set terminal_lifetime to task_timeout + 120s so sandboxes
+        # never get killed during an active task, but still get cleaned up
+        # promptly after the task times out.
+        lifetime = self.config.task_timeout + 120
+        self.config.terminal_lifetime = lifetime
+        os.environ["TERMINAL_LIFETIME_SECONDS"] = str(lifetime)
+        print(f"  Terminal lifetime auto-set to {lifetime}s (task_timeout + 120s)")
+
+        print(f"Loading TB2 dataset from: {self.config.dataset_name}")
+        ds = load_dataset(self.config.dataset_name, split="train")
+
+        # Apply task filters (comma-separated strings from CLI)
+        tasks = list(ds)
+        if self.config.task_filter:
+            allowed = {name.strip() for name in self.config.task_filter.split(",") if name.strip()}
+            tasks = [t for t in tasks if t["task_name"] in allowed]
+            print(f"  Filtered to {len(tasks)} tasks: {sorted(allowed)}")
+
+        # Skip tasks incompatible with the current backend (e.g., QEMU on Modal)
+        # plus any user-specified skip_tasks
+        skip = set(MODAL_INCOMPATIBLE_TASKS) if self.config.terminal_backend == "modal" else set()
+        if self.config.skip_tasks:
+            skip |= {name.strip() for name in self.config.skip_tasks.split(",") if name.strip()}
+        if skip:
+            # Report only tasks actually removed from the (possibly filtered) list
+            skipped_names = sorted(t["task_name"] for t in tasks if t["task_name"] in skip)
+            tasks = [t for t in tasks if t["task_name"] not in skip]
+            if skipped_names:
+                print(f"  Skipped {len(skipped_names)} tasks (incompatible or user-specified): {skipped_names}")
+
+        self.all_eval_items = tasks
+        self.iter = 0
+
+        # Build category index for per-category metrics
+        self.category_index: Dict[str, List[int]] = defaultdict(list)
+        for i, task in enumerate(self.all_eval_items):
+            self.category_index[task.get("category", "unknown")].append(i)
+
+        # Reward tracking for wandb logging
+        self.eval_metrics: List[Tuple[str, float]] = []
+
+        # Streaming JSONL writer -- saves each task's full conversation
+        # immediately on completion so data is preserved even on Ctrl+C.
+        # Timestamped filename so each run produces a unique file.
+        import datetime
+        import threading
+        log_dir = os.path.join(os.path.dirname(__file__), "logs")
+        os.makedirs(log_dir, exist_ok=True)
+        run_ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+        self._streaming_path = os.path.join(log_dir, f"samples_{run_ts}.jsonl")
+        # Explicit UTF-8: _save_result writes with ensure_ascii=False
+        self._streaming_file = open(self._streaming_path, "w", encoding="utf-8")
+        self._streaming_lock = threading.Lock()
+        print(f"  Streaming results to: {self._streaming_path}")
+
+        print(f"TB2 ready: {len(self.all_eval_items)} tasks across {len(self.category_index)} categories")
+        for cat, indices in sorted(self.category_index.items()):
+            print(f"  {cat}: {len(indices)} tasks")
+
+    def _save_result(self, result: Dict[str, Any]):
+        """Write a single task result to the streaming JSONL file immediately."""
+        if not hasattr(self, "_streaming_file") or self._streaming_file.closed:
+            return
+        with self._streaming_lock:
+            self._streaming_file.write(json.dumps(result, ensure_ascii=False, default=str) + "\n")
+            self._streaming_file.flush()
+
+    # =========================================================================
+    # Training pipeline stubs -- NOT used in eval-only mode
+    # =========================================================================
+    # These satisfy the abstract method requirements from HermesAgentBaseEnv.
+    # The evaluate subcommand calls setup() -> evaluate() directly, bypassing
+    # the training pipeline entirely.
+
+    async def get_next_item(self):
+        """Return next item (stub -- not used in eval-only mode)."""
+        item = self.all_eval_items[self.iter % len(self.all_eval_items)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item: Dict[str, Any]) -> str:
+        """Return the task's instruction as the user prompt."""
+        return item["instruction"]
+
+    async def compute_reward(self, item, result, ctx) -> float:
+        """Compute reward (stub -- actual verification is in rollout_and_score_eval)."""
+        return 0.0
+
+    async def collect_trajectories(self, item):
+        """Collect trajectories (stub -- not used in eval-only mode)."""
+        return None, []
+
+    async def score(self, rollout_group_data):
+        """Score rollouts (stub -- not used in eval-only mode)."""
+        return None
+
+    # =========================================================================
+    # Docker image resolution
+    # =========================================================================
+
+    def _resolve_task_image(
+        self, item: Dict[str, Any], task_name: str
+    ) -> Tuple[str, Optional[Path]]:
+        """
+        Resolve the Docker image for a task, with fallback to Dockerfile.
+
+        Strategy (mirrors Harbor's approach):
+        1. If force_build=True, always build from Dockerfile in environment_tar
+        2. If docker_image is available, use the pre-built Docker Hub image (fast)
+        3. Otherwise, extract Dockerfile from environment_tar and build (slow)
+
+        Returns:
+            (modal_image, temp_dir) -- modal_image is a Docker Hub name or a
+            Dockerfile path. temp_dir is set if we extracted files that need
+            cleanup later.
+        """
+        docker_image = item.get("docker_image", "")
+        environment_tar = item.get("environment_tar", "")
+
+        # Fast path: use pre-built Docker Hub image
+        if docker_image and not self.config.force_build:
+            logger.info("Task %s: using pre-built image %s", task_name, docker_image)
+            return docker_image, None
+
+        # Slow path: extract Dockerfile from environment_tar and build
+        if environment_tar:
+            task_dir = Path(tempfile.mkdtemp(prefix=f"tb2-{task_name}-"))
+            _extract_base64_tar(environment_tar, task_dir)
+            dockerfile_path = task_dir / "Dockerfile"
+            if dockerfile_path.exists():
+                logger.info(
+                    "Task %s: building from Dockerfile (force_build=%s, docker_image=%s)",
+                    task_name, self.config.force_build, bool(docker_image),
+                )
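+                return str(dockerfile_path), task_dir
+            # No Dockerfile in the archive -- remove the temp dir so it doesn't leak
+            shutil.rmtree(task_dir, ignore_errors=True)
+
+        # No usable Dockerfile -- fall back to the Hub image (only reachable when force_build=True)
+        if docker_image:
+            logger.warning(
+                "Task %s: force_build=True but no usable Dockerfile in environment_tar, "
+                "falling back to docker_image %s", task_name, docker_image,
+            )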
+            return docker_image, None
+
+        return "", None
+
+    # =========================================================================
+    # Per-task evaluation -- agent loop + test verification
+    # =========================================================================
+
+    async def rollout_and_score_eval(self, eval_item: Dict[str, Any]) -> Dict:
+        """
+        Evaluate a single TB2 task: run the agent loop, then verify with tests.
+
+        This is the core evaluation method. For each task it:
+        1. Resolves the Docker image and registers the Modal sandbox override
+        2. Runs HermesAgentLoop with terminal + file tools
+        3. Uploads the test suite into the sandbox
+        4. Executes test.sh and checks the result
+        5. Cleans up the sandbox and temp files
+
+        Args:
+            eval_item: A single TB2 task dict from the dataset
+
+        Returns:
+            Dict with 'passed' (bool), 'reward' (float), 'task_name' (str),
+            'category' (str), and optional debug info
+        """
+        task_name = eval_item.get("task_name", "unknown")
+        category = eval_item.get("category", "unknown")
+        task_id = str(uuid.uuid4())
+        task_dir = None  # Set if we extract a Dockerfile (needs cleanup)
+
+        from tqdm import tqdm
+        tqdm.write(f"  [START] {task_name} (task_id={task_id[:8]})")
+        task_start = time.time()
+
+        try:
+            # --- 1. Resolve Docker image ---
+            modal_image, task_dir = self._resolve_task_image(eval_item, task_name)
+            if not modal_image:
+                logger.error("Task %s: no docker_image or environment_tar, skipping", task_name)
+                return {
+                    "passed": False, "reward": 0.0,
+                    "task_name": task_name, "category": category,
+                    "error": "no_image",
+                }
+
+            # --- 2. Register per-task Modal image override ---
+            register_task_env_overrides(task_id, {"modal_image": modal_image})
+            logger.info(
+                "Task %s: registered image override for task_id %s",
+                task_name, task_id[:8],
+            )
+
+            # --- 3. Resolve tools and build messages ---
+            tools, valid_names = self._resolve_tools_for_group()
+
+            messages: List[Dict[str, Any]] = []
+            if self.config.system_prompt:
+                messages.append({"role": "system", "content": self.config.system_prompt})
+            messages.append({"role": "user", "content": self.format_prompt(eval_item)})
+
+            # --- 4. Run agent loop ---
+            agent = HermesAgentLoop(
+                server=self.server,
+                tool_schemas=tools,
+                valid_tool_names=valid_names,
+                max_turns=self.config.max_agent_turns,
+                task_id=task_id,
+                temperature=self.config.agent_temperature,
+                max_tokens=self.config.max_token_length,
+                extra_body=self.config.extra_body,
+            )
+            result = await agent.run(messages)
+
+            # --- 5. Verify -- run test suite in the agent's sandbox ---
+            # Skip verification if the agent produced no meaningful output
+            only_system_and_user = all(
+                msg.get("role") in ("system", "user") for msg in result.messages
+            )
+            if result.turns_used == 0 or only_system_and_user:
+                logger.warning(
+                    "Task %s: agent produced no output (turns=%d). Reward=0.",
+                    task_name, result.turns_used,
+                )
+                reward = 0.0
+            else:
+                # Run tests in a thread so the blocking ctx.terminal() calls
+                # don't freeze the entire event loop (which would stall all
+                # other tasks, tqdm updates, and timeout timers).
+                ctx = ToolContext(task_id)
+                try:
+                    loop = asyncio.get_running_loop()
+                    reward = await loop.run_in_executor(
+                        None,  # default thread pool
+                        self._run_tests, eval_item, ctx, task_name,
+                    )
+                except Exception as e:
+                    logger.error("Task %s: test verification failed: %s", task_name, e)
+                    reward = 0.0
+                finally:
+                    ctx.cleanup()
+
+            passed = reward == 1.0
+            status = "PASS" if passed else "FAIL"
+            elapsed = time.time() - task_start
+            tqdm.write(f"  [{status}] {task_name} (turns={result.turns_used}, {elapsed:.0f}s)")
+            logger.info(
+                "Task %s: reward=%.1f, turns=%d, finished=%s",
+                task_name, reward, result.turns_used, result.finished_naturally,
+            )
+
+            out = {
+                "passed": passed,
+                "reward": reward,
+                "task_name": task_name,
+                "category": category,
+                "turns_used": result.turns_used,
+                "finished_naturally": result.finished_naturally,
+                "messages": result.messages,
+            }
+            self._save_result(out)
+            return out
+
+        except Exception as e:
+            elapsed = time.time() - task_start
+            logger.error("Task %s: rollout failed: %s", task_name, e, exc_info=True)
+            tqdm.write(f"  [ERROR] {task_name}: {e} ({elapsed:.0f}s)")
+            out = {
+                "passed": False, "reward": 0.0,
+                "task_name": task_name, "category": category,
+                "error": str(e),
+            }
+            self._save_result(out)
+            return out
+
+        finally:
+            # --- Cleanup: clear overrides, sandbox, and temp files ---
+            clear_task_env_overrides(task_id)
+            try:
+                cleanup_vm(task_id)
+            except Exception as e:
+                logger.debug("VM cleanup for %s: %s", task_id[:8], e)
+            if task_dir and task_dir.exists():
+                shutil.rmtree(task_dir, ignore_errors=True)
+
+    def _run_tests(
+        self, item: Dict[str, Any], ctx: ToolContext, task_name: str
+    ) -> float:
+        """
+        Upload and execute the test suite in the agent's sandbox, then
+        download the verifier output locally to read the reward.
+
+        Follows Harbor's verification pattern:
+        1. Upload tests/ directory into the sandbox
+        2. Execute test.sh inside the sandbox
+        3. Download /logs/verifier/ directory to a local temp dir
+        4. Read reward.txt locally with native Python I/O
+
+        Downloading locally avoids issues with the file_read tool on
+        the Modal VM and matches how Harbor handles verification.
+
+        TB2 test scripts (test.sh) typically:
+        1. Install pytest via uv/pip
+        2. Run pytest against the test files in /tests/
+        3. Write results to /logs/verifier/reward.txt
+
+        Args:
+            item: The TB2 task dict (contains tests_tar, test_sh)
+            ctx: ToolContext scoped to this task's sandbox
+            task_name: For logging
+
+        Returns:
+            1.0 if tests pass, 0.0 otherwise (or the numeric value in
+            reward.txt when it holds something other than 0 or 1)
+        """
+        tests_tar = item.get("tests_tar", "")
+        test_sh = item.get("test_sh", "")
+
+        if not test_sh:
+            logger.warning("Task %s: no test_sh content, reward=0", task_name)
+            return 0.0
+
+        # Create required directories in the sandbox
+        ctx.terminal("mkdir -p /tests /logs/verifier")
+
+        # Upload test files into the sandbox (binary-safe via base64)
+        if tests_tar:
+            tests_temp = Path(tempfile.mkdtemp(prefix=f"tb2-tests-{task_name}-"))
+            try:
+                _extract_base64_tar(tests_tar, tests_temp)
+                ctx.upload_dir(str(tests_temp), "/tests")
+            except Exception as e:
+                logger.warning("Task %s: failed to upload test files: %s", task_name, e)
+            finally:
+                shutil.rmtree(tests_temp, ignore_errors=True)
+
+        # Write the test runner script (test.sh)
+        ctx.write_file("/tests/test.sh", test_sh)
+        ctx.terminal("chmod +x /tests/test.sh")
+
+        # Execute the test suite
+        logger.info(
+            "Task %s: running test suite (timeout=%ds)",
+            task_name, self.config.test_timeout,
+        )
+        test_result = ctx.terminal(
+            "bash /tests/test.sh",
+            timeout=self.config.test_timeout,
+        )
+
+        exit_code = test_result.get("exit_code", -1)
+        output = test_result.get("output", "")
+
+        # Download the verifier output directory locally, then read reward.txt
+        # with native Python I/O. This avoids issues with file_read on the
+        # Modal VM and matches Harbor's verification pattern.
+        reward = 0.0
+        local_verifier_dir = Path(tempfile.mkdtemp(prefix=f"tb2-verifier-{task_name}-"))
+        try:
+            ctx.download_dir("/logs/verifier", str(local_verifier_dir))
+
+            reward_file = local_verifier_dir / "reward.txt"
+            if reward_file.exists() and reward_file.stat().st_size > 0:
+                content = reward_file.read_text().strip()
+                if content == "1":
+                    reward = 1.0
+                elif content == "0":
+                    reward = 0.0
+                else:
+                    # Unexpected content -- try parsing as float
+                    try:
+                        reward = float(content)
+                    except (ValueError, TypeError):
+                        logger.warning(
+                            "Task %s: reward.txt content unexpected (%r), "
+                            "falling back to exit_code=%d",
+                            task_name, content, exit_code,
+                        )
+                        reward = 1.0 if exit_code == 0 else 0.0
+            else:
+                # reward.txt not written -- fall back to exit code
+                logger.warning(
+                    "Task %s: reward.txt not found after download, "
+                    "falling back to exit_code=%d",
+                    task_name, exit_code,
+                )
+                reward = 1.0 if exit_code == 0 else 0.0
+        except Exception as e:
+            logger.warning(
+                "Task %s: failed to download verifier dir: %s, "
+                "falling back to exit_code=%d",
+                task_name, e, exit_code,
+            )
+            reward = 1.0 if exit_code == 0 else 0.0
+        finally:
+            shutil.rmtree(local_verifier_dir, ignore_errors=True)
+
+        # Log test output for debugging failures
+        if reward != 1.0:
+            output_preview = output[-500:] if output else "(no output)"
+            logger.info(
+                "Task %s: FAIL (exit_code=%d)\n%s",
+                task_name, exit_code, output_preview,
+            )
+
+        return reward
+
+    # =========================================================================
+    # Evaluate -- main entry point for the eval subcommand
+    # =========================================================================
+
+    async def _eval_with_timeout(self, item: Dict[str, Any]) -> Dict:
+        """
+        Wrap rollout_and_score_eval with a per-task wall-clock timeout.
+
+        If the task exceeds task_timeout seconds, it's automatically scored
+        as FAIL. This prevents any single task from hanging indefinitely.
+        """
+        task_name = item.get("task_name", "unknown")
+        category = item.get("category", "unknown")
+        try:
+            return await asyncio.wait_for(
+                self.rollout_and_score_eval(item),
+                timeout=self.config.task_timeout,
+            )
+        except asyncio.TimeoutError:
+            from tqdm import tqdm
+            elapsed = self.config.task_timeout
+            tqdm.write(f"  [TIMEOUT] {task_name} (exceeded {elapsed}s wall-clock limit)")
+            logger.error("Task %s: wall-clock timeout after %ds", task_name, elapsed)
+            out = {
+                "passed": False, "reward": 0.0,
+                "task_name": task_name, "category": category,
+                "error": f"timeout ({elapsed}s)",
+            }
+            self._save_result(out)
+            return out
+
+    async def evaluate(self, *args, **kwargs) -> None:
+        """
+        Run Terminal-Bench 2.0 evaluation over all tasks.
+
+        This is the main entry point when invoked via:
+            python environments/terminalbench2_env.py evaluate
+
+        Runs all tasks concurrently through rollout_and_score_eval() (same
+        pattern as GPQA and other Atropos eval envs), consuming results with
+        asyncio.as_completed() so the progress bar updates live. Each task is
+        wrapped with a wall-clock timeout so hung tasks auto-fail.
+
+        Suppresses noisy Modal/terminal output (HERMES_QUIET) so the tqdm
+        bar stays visible.
+        """
+        start_time = time.time()
+
+        # Route all logging through tqdm.write() so the progress bar stays
+        # pinned at the bottom while log lines scroll above it.
+        from tqdm import tqdm
+
+        class _TqdmHandler(logging.Handler):
+            def emit(self, record):
+                try:
+                    tqdm.write(self.format(record))
+                except Exception:
+                    self.handleError(record)
+
+        handler = _TqdmHandler()
+        handler.setFormatter(logging.Formatter(
+            "%(asctime)s [%(name)s] %(levelname)s: %(message)s",
+            datefmt="%H:%M:%S",
+        ))
+        root = logging.getLogger()
+        root.handlers = [handler]  # Replace any existing handlers
+        root.setLevel(logging.INFO)
+
+        # Silence noisy third-party loggers that flood the output
+        logging.getLogger("httpx").setLevel(logging.WARNING)      # Every HTTP request
+        logging.getLogger("openai").setLevel(logging.WARNING)     # OpenAI client retries
+        logging.getLogger("rex-deploy").setLevel(logging.WARNING) # Swerex deployment
+        logging.getLogger("rex_image_builder").setLevel(logging.WARNING)  # Image builds
+
+        print(f"\n{'='*60}")
+        print("Starting Terminal-Bench 2.0 Evaluation")
+        print(f"{'='*60}")
+        print(f"  Dataset: {self.config.dataset_name}")
+        print(f"  Total tasks: {len(self.all_eval_items)}")
+        print(f"  Max agent turns: {self.config.max_agent_turns}")
+        print(f"  Task timeout: {self.config.task_timeout}s")
+        print(f"  Terminal backend: {self.config.terminal_backend}")
+        print(f"  Tool thread pool: {self.config.tool_pool_size}")
+        print(f"  Terminal timeout: {self.config.terminal_timeout}s/cmd")
+        print(f"  Terminal lifetime: {self.config.terminal_lifetime}s (auto: task_timeout + 120)")
+        print(f"{'='*60}\n")
+
+        # Fire all tasks with wall-clock timeout, track live accuracy on the bar
+        total_tasks = len(self.all_eval_items)
+        eval_tasks = [
+            asyncio.ensure_future(self._eval_with_timeout(item))
+            for item in self.all_eval_items
+        ]
+
+        results = []
+        passed_count = 0
+        pbar = tqdm(total=total_tasks, desc="Evaluating TB2", dynamic_ncols=True)
+        try:
+            for coro in asyncio.as_completed(eval_tasks):
+                result = await coro
+                results.append(result)
+                if result and result.get("passed"):
+                    passed_count += 1
+                done = len(results)
+                pct = (passed_count / done * 100) if done else 0
+                pbar.set_postfix_str(f"pass={passed_count}/{done} ({pct:.1f}%)")
+                pbar.update(1)
+        except (KeyboardInterrupt, asyncio.CancelledError):
+            pbar.close()
+            print(f"\n\nInterrupted! Cleaning up {len(eval_tasks)} tasks...")
+            # Cancel all pending tasks
+            for task in eval_tasks:
+                task.cancel()
+            # Let cancellations propagate (finally blocks run cleanup_vm)
+            await asyncio.gather(*eval_tasks, return_exceptions=True)
+            # Belt-and-suspenders: clean up any remaining sandboxes
+            from tools.terminal_tool import cleanup_all_environments
+            cleanup_all_environments()
+            print("All sandboxes cleaned up.")
+            return
+        finally:
+            pbar.close()
+
+        end_time = time.time()
+
+        # Filter out None results (shouldn't happen, but be safe)
+        valid_results = [r for r in results if r is not None]
+
+        if not valid_results:
+            print("Warning: No valid evaluation results obtained")
+            return
+
+        # ---- Compute metrics ----
+        total = len(valid_results)
+        passed = sum(1 for r in valid_results if r.get("passed"))
+        overall_pass_rate = passed / total if total > 0 else 0.0
+
+        # Per-category breakdown
+        cat_results: Dict[str, List[Dict]] = defaultdict(list)
+        for r in valid_results:
+            cat_results[r.get("category", "unknown")].append(r)
+
+        # Build metrics dict
+        eval_metrics = {
+            "eval/pass_rate": overall_pass_rate,
+            "eval/total_tasks": total,
+            "eval/passed_tasks": passed,
+            "eval/evaluation_time_seconds": end_time - start_time,
+        }
+
+        # Per-category metrics
+        for category, cat_items in sorted(cat_results.items()):
+            cat_passed = sum(1 for r in cat_items if r.get("passed"))
+            cat_total = len(cat_items)
+            cat_pass_rate = cat_passed / cat_total if cat_total > 0 else 0.0
+            cat_key = category.replace(" ", "_").replace("-", "_").lower()
+            eval_metrics[f"eval/pass_rate_{cat_key}"] = cat_pass_rate
+
+        # Store metrics for wandb_log
+        self.eval_metrics = [(k, v) for k, v in eval_metrics.items()]
+
+        # ---- Print summary ----
+        print(f"\n{'='*60}")
+        print("Terminal-Bench 2.0 Evaluation Results")
+        print(f"{'='*60}")
+        print(f"Overall Pass Rate: {overall_pass_rate:.4f} ({passed}/{total})")
+        print(f"Evaluation Time: {end_time - start_time:.1f} seconds")
+
+        print("\nCategory Breakdown:")
+        for category, cat_items in sorted(cat_results.items()):
+            cat_passed = sum(1 for r in cat_items if r.get("passed"))
+            cat_total = len(cat_items)
+            cat_rate = cat_passed / cat_total if cat_total > 0 else 0.0
+            print(f"  {category}: {cat_rate:.1%} ({cat_passed}/{cat_total})")
+
+        # Print individual task results
+        print("\nTask Results:")
+        for r in sorted(valid_results, key=lambda x: x.get("task_name", "")):
+            status = "PASS" if r.get("passed") else "FAIL"
+            turns = r.get("turns_used", "?")
+            error = r.get("error", "")
+            extra = f" (error: {error})" if error else ""
+            print(f"  [{status}] {r['task_name']} (turns={turns}){extra}")
+
+        print(f"{'='*60}\n")
+
+        # Build sample records for evaluate_log (includes full conversations)
+        samples = [
+            {
+                "task_name": r.get("task_name"),
+                "category": r.get("category"),
+                "passed": r.get("passed"),
+                "reward": r.get("reward"),
+                "turns_used": r.get("turns_used"),
+                "error": r.get("error"),
+                "messages": r.get("messages"),
+            }
+            for r in valid_results
+        ]
+
+        # Log evaluation results
+        try:
+            await self.evaluate_log(
+                metrics=eval_metrics,
+                samples=samples,
+                start_time=start_time,
+                end_time=end_time,
+                generation_parameters={
+                    "temperature": self.config.agent_temperature,
+                    "max_tokens": self.config.max_token_length,
+                    "max_agent_turns": self.config.max_agent_turns,
+                    "terminal_backend": self.config.terminal_backend,
+                },
+            )
+        except Exception as e:
+            print(f"Error logging evaluation results: {e}")
+
+        # Close streaming file
+        if hasattr(self, "_streaming_file") and not self._streaming_file.closed:
+            self._streaming_file.close()
+            print(f"  Live results saved to: {self._streaming_path}")
+
+        # Kill all remaining sandboxes. Timed-out tasks leave orphaned thread
+        # pool workers still executing commands -- cleanup_all stops them.
+        from tools.terminal_tool import cleanup_all_environments
+        print("\nCleaning up all sandboxes...")
+        cleanup_all_environments()
+
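+        # Shut down the tool thread pool so tool calls still queued by
+        # timed-out tasks are cancelled instead of retrying against dead
+        # sandboxes and spamming the console with TimeoutError warnings.
+        # Note: shutdown() cannot interrupt workers that are already running,
+        # and cancel_futures requires Python 3.9+.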
+        from environments.agent_loop import _tool_executor
+        _tool_executor.shutdown(wait=False, cancel_futures=True)
+        print("Done.")
+
+    # =========================================================================
+    # Wandb logging
+    # =========================================================================
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log TB2-specific metrics to wandb."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        # Add stored eval metrics
+        for metric_name, metric_value in self.eval_metrics:
+            wandb_metrics[metric_name] = metric_value
+        self.eval_metrics = []
+
+        await super().wandb_log(wandb_metrics)
+
+
+if __name__ == "__main__":
+    TerminalBench2EvalEnv.cli()
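
The reward-resolution order used by `_run_tests` above (read `reward.txt` first, fall back to the test script's exit code) can be sketched in isolation. This is a minimal illustration rather than code from the commit: `parse_reward` is a hypothetical helper name, and it takes the file's text directly instead of downloading `/logs/verifier/` from a sandbox.

```python
from typing import Optional


def parse_reward(content: Optional[str], exit_code: int) -> float:
    """Hypothetical helper mirroring the reward.txt fallback chain in _run_tests.

    content:   text of reward.txt, or None if the file was not found
    exit_code: exit status of test.sh, used only as a fallback
    """
    fallback = 1.0 if exit_code == 0 else 0.0

    if content is None or not content.strip():
        # reward.txt missing or empty: trust the test script's exit code
        return fallback

    try:
        # Covers the expected "1" / "0" as well as other numeric values
        return float(content.strip())
    except ValueError:
        # Unparseable content: fall back to the exit code
        return fallback


if __name__ == "__main__":
    print(parse_reward("1\n", exit_code=1))    # file wins over exit code
    print(parse_reward(None, exit_code=0))     # no file, exit code decides
    print(parse_reward("garbage", exit_code=2))
```

Keeping this logic in a pure function would make the fallback order unit-testable without a sandbox; whether that refactor is worthwhile here is a judgment call for the maintainers.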
--- a/environments/hermes_base_env.py
+++ b/environments/hermes_base_env.py
@@ -0,0 +1,673 @@
+"""
+HermesAgentBaseEnv -- Abstract Base Environment for Hermes-Agent + Atropos
+
+Provides the Atropos integration plumbing that all hermes-agent environments share:
+- Two-mode operation (OpenAI server for Phase 1, vLLM ManagedServer for Phase 2)
+- Per-group toolset/distribution resolution
+- Agent loop orchestration via HermesAgentLoop
+- ToolContext creation for reward functions
+- ScoredDataGroup construction from ManagedServer state
+
+Subclasses only need to implement:
+    setup()           -- Load dataset, initialize state
+    get_next_item()   -- Return the next item from the dataset
+    format_prompt()   -- Convert a dataset item into the user message
+    compute_reward()  -- Score the rollout (has full ToolContext access)
+    evaluate()        -- Periodic evaluation
+"""
+
+import asyncio
+import json
+import logging
+import os
+import sys
+import uuid
+from abc import abstractmethod
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Set, Tuple, Union
+
+# Ensure the hermes-agent repo root is on sys.path so that imports like
+# `from model_tools import ...` and `from environments.X import ...` work
+# regardless of where the script is invoked from.
+_repo_root = Path(__file__).resolve().parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from dotenv import load_dotenv
+from pydantic import Field
+
+# Load API keys from hermes-agent/.env so all environments can access them
+_env_path = _repo_root / ".env"
+if _env_path.exists():
+    load_dotenv(dotenv_path=_env_path)
+
+# Apply monkey patches for async-safe tool operation inside Atropos's event loop.
+# This patches SwerexModalEnvironment to use a background thread instead of
+# asyncio.run(), which would deadlock inside Atropos. Safe for normal CLI too.
+from environments.patches import apply_patches
+apply_patches()
+
+from atroposlib.envs.base import (
+    BaseEnv,
+    BaseEnvConfig,
+    ScoredDataGroup,
+    ScoredDataItem,
+)
+from atroposlib.envs.server_handling.server_manager import (
+    APIServerConfig,
+    ServerBaseline,
+    ServerManager,
+)
+from atroposlib.type_definitions import Item
+
+from environments.agent_loop import AgentResult, HermesAgentLoop
+from environments.tool_context import ToolContext
+
+# Import hermes-agent toolset infrastructure
+from model_tools import get_tool_definitions
+from toolset_distributions import sample_toolsets_from_distribution
+
+logger = logging.getLogger(__name__)
+
+
+class HermesAgentEnvConfig(BaseEnvConfig):
+    """
+    Configuration for hermes-agent Atropos environments.
+
+    Extends BaseEnvConfig with agent-specific settings for toolsets,
+    terminal backend, dataset loading, and tool call parsing.
+    """
+
+    # --- Toolset configuration ---
+    # Mutually exclusive: use either enabled_toolsets OR distribution
+    enabled_toolsets: Optional[List[str]] = Field(
+        default=None,
+        description="Explicit list of hermes toolsets to enable (e.g., ['terminal', 'file', 'web']). "
+        "If None and distribution is also None, all available toolsets are enabled.",
+    )
+    disabled_toolsets: Optional[List[str]] = Field(
+        default=None,
+        description="Toolsets to disable. Applied as a filter on top of enabled_toolsets or distribution.",
+    )
+    distribution: Optional[str] = Field(
+        default=None,
+        description="Name of a toolset distribution from toolset_distributions.py "
+        "(e.g., 'development', 'terminal_tasks'). Sampled once per group. "
+        "Mutually exclusive with enabled_toolsets.",
+    )
+
+    # --- Agent loop configuration ---
+    max_agent_turns: int = Field(
+        default=30,
+        description="Maximum number of LLM calls (tool-calling iterations) per rollout.",
+    )
+    system_prompt: Optional[str] = Field(
+        default=None,
+        description="System prompt for the agent. Tools are handled via the tools= parameter, "
+        "not embedded in the prompt text.",
+    )
+    agent_temperature: float = Field(
+        default=1.0,
+        description="Sampling temperature for agent generation during rollouts.",
+    )
+
+    # --- Terminal backend ---
+    terminal_backend: str = Field(
+        default="local",
+        description="Terminal backend: 'local', 'docker', 'modal', 'ssh', 'singularity'. "
+        "Modal recommended for production RL (cloud isolation per rollout).",
+    )
+    terminal_timeout: int = Field(
+        default=120,
+        description="Per-command timeout in seconds for terminal tool calls. "
+        "Commands exceeding this are killed. Increase for tasks with long-running "
+        "commands (compilation, pip install, etc.).",
+    )
+    terminal_lifetime: int = Field(
+        default=3600,
+        description="Sandbox inactivity lifetime in seconds. The cleanup thread kills "
+        "sandboxes that have been idle longer than this. Must be longer than "
+        "the longest gap between tool calls (e.g., waiting for LLM response).",
+    )
+
+    # --- Dataset ---
+    dataset_name: Optional[str] = Field(
+        default=None,
+        description="HuggingFace dataset name. Optional if tasks are defined inline.",
+    )
+    dataset_split: str = Field(
+        default="train",
+        description="Dataset split to use.",
+    )
+    prompt_field: str = Field(
+        default="prompt",
+        description="Which field in the dataset contains the prompt.",
+    )
+
+    # --- Thread pool ---
+    tool_pool_size: int = Field(
+        default=128,
+        description="Thread pool size for tool execution. Each concurrent task needs a "
+        "thread for tool calls. Must be large enough for parallel evaluation. "
+        "Too small = thread pool starvation.",
+    )
+
+    # --- Phase 2: Tool call parsing ---
+    tool_call_parser: str = Field(
+        default="hermes",
+        description="Tool call parser name for Phase 2 (VLLM server type). "
+        "Ignored in Phase 1 (OpenAI server type, where the server parses tool calls natively). "
+        "Options: hermes, mistral, llama3_json, qwen, deepseek_v3, etc.",
+    )
+
+    # --- Provider-specific parameters ---
+    # Passed as extra_body to the OpenAI client's chat.completions.create() call.
+    # Useful for OpenRouter provider preferences, transforms, route settings, etc.
+    # Example YAML:
+    #   extra_body:
+    #     provider:
+    #       ignore: ["DeepInfra", "Fireworks"]
+    #       order: ["Together"]
+    #     transforms: ["middle-out"]
+    extra_body: Optional[Dict[str, Any]] = Field(
+        default=None,
+        description="Extra body parameters passed to the OpenAI client's "
+        "chat.completions.create(). Used for OpenRouter provider preferences, "
+        "transforms, and other provider-specific settings.",
+    )
+
+
+class HermesAgentBaseEnv(BaseEnv):
+    """
+    Abstract base environment for hermes-agent Atropos integration.
+
+    Handles two modes of operation:
+    - Phase 1 (OpenAI server type): Uses server.chat_completion() directly.
+      The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing
+      and reasoning extraction natively. DummyManagedServer provides placeholder
+      tokens. Good for SFT data gen, verifier testing, evaluation.
+
+    - Phase 2 (VLLM server type): Uses ManagedServer for exact token IDs + logprobs
+      via /generate. Client-side tool call parser reconstructs structured tool_calls
+      from raw output. Full RL training capability.
+
+    Subclasses must implement:
+        setup()           -- Load dataset, initialize state
+        get_next_item()   -- Return the next item to roll out
+        format_prompt()   -- Convert a dataset item into the user message string
+        compute_reward()  -- Score the rollout using ToolContext
+        evaluate()        -- Periodic evaluation
+    """
+
+    name: Optional[str] = "hermes-agent"
+    env_config_cls = HermesAgentEnvConfig
+
+    def __init__(
+        self,
+        config: HermesAgentEnvConfig,
+        server_configs: Union[ServerBaseline, List[APIServerConfig]],
+        slurm=False,
+        testing=False,
+    ):
+        super().__init__(config, server_configs, slurm, testing)
+
+        # Set terminal environment variables so hermes tools pick them up.
+        # These can all be overridden per-environment via config fields instead
+        # of requiring users to set shell env vars.
+        if config.terminal_backend:
+            os.environ["TERMINAL_ENV"] = config.terminal_backend
+        os.environ["TERMINAL_TIMEOUT"] = str(config.terminal_timeout)
+        os.environ["TERMINAL_LIFETIME_SECONDS"] = str(config.terminal_lifetime)
+        print(
+            f"🖥️  Terminal: backend={config.terminal_backend}, "
+            f"timeout={config.terminal_timeout}s, lifetime={config.terminal_lifetime}s"
+        )
+
+        # Resize the agent loop's thread pool for tool execution.
+        # This must be large enough for the number of concurrent tasks
+        # (e.g., 89 parallel TB2 eval tasks each need a thread for tool calls).
+        from environments.agent_loop import resize_tool_pool
+        resize_tool_pool(config.tool_pool_size)
+
+        # Current group's resolved tools (set in collect_trajectories)
+        self._current_group_tools: Optional[Tuple[List[Dict], Set[str]]] = None
+
+        # Tool error tracking for wandb logging
+        self._tool_error_buffer: List[Dict[str, Any]] = []
+
+    # =========================================================================
+    # Toolset resolution (per-group)
+    # =========================================================================
+
+    def _resolve_tools_for_group(self) -> Tuple[List[Dict[str, Any]], Set[str]]:
+        """
+        Resolve toolsets for a group. Called once in collect_trajectories(),
+        then shared by all collect_trajectory() calls in the group.
+
+        If distribution is set, samples toolsets from it (this takes precedence).
+        Otherwise uses enabled_toolsets (None means all available toolsets).
+        disabled_toolsets is applied as a filter on top.
+
+        Returns:
+            (tool_schemas, valid_tool_names) tuple
+        """
+        config = self.config
+
+        if config.distribution:
+            group_toolsets = sample_toolsets_from_distribution(config.distribution)
+            logger.info("Sampled toolsets from '%s': %s", config.distribution, group_toolsets)
+        else:
+            group_toolsets = config.enabled_toolsets  # None means "all available"
+
+        tools = get_tool_definitions(
+            enabled_toolsets=group_toolsets,
+            disabled_toolsets=config.disabled_toolsets,
+            quiet_mode=True,
+        )
+
+        valid_names = {t["function"]["name"] for t in tools} if tools else set()
+        logger.info("Resolved %d tools for group: %s", len(valid_names), sorted(valid_names))
+        return tools, valid_names
+
+    # =========================================================================
+    # Server mode detection
+    # =========================================================================
+
+    def _use_managed_server(self) -> bool:
+        """
+        Determine if we should use ManagedServer (Phase 2) or direct server (Phase 1).
+
+        Phase 2 (ManagedServer) is used when the server type is 'vllm' or 'sglang',
+        which go through the /generate endpoint for exact token tracking.
+
+        Phase 1 (direct server) is used for 'openai' server type, which uses
+        /v1/chat/completions with native tool call parsing.
+        """
+        if not self.server.servers:
+            return False
+
+        server = self.server.servers[0]
+        # If the server is an OpenAI server (not VLLM/SGLang), use direct mode
+        from atroposlib.envs.server_handling.openai_server import OpenAIServer
+        return not isinstance(server, OpenAIServer)
+
+    # =========================================================================
+    # Core Atropos integration
+    # =========================================================================
+
+    async def collect_trajectories(
+        self, item: Item
+    ) -> Tuple[
+        Union[Optional[ScoredDataGroup], List[Optional[ScoredDataGroup]]],
+        List[Item],
+    ]:
+        """
+        Override collect_trajectories to resolve toolsets once per group,
+        then delegate to the standard group-level collection.
+
+        The default BaseEnv.collect_trajectories() calls collect_trajectory()
+        group_size times in parallel. We resolve tools once here and store
+        them for all those calls to use.
+        """
+        # Resolve toolsets for this group (shared by all rollouts in the group)
+        self._current_group_tools = self._resolve_tools_for_group()
+
+        # Delegate to the default implementation which calls collect_trajectory()
+        # group_size times via asyncio.gather
+        return await super().collect_trajectories(item)
+
+    # =========================================================================
+    # Wandb rollout display -- format trajectories nicely
+    # =========================================================================
+
+    @staticmethod
+    def _format_trajectory_for_display(messages: List[Dict[str, Any]]) -> str:
+        """
+        Format a conversation's messages into a readable trajectory string
+        for wandb rollout tables. Shows tool calls, tool results, and reasoning
+        in a structured way instead of raw token decoding.
+        """
+        parts = []
+        for msg in messages:
+            role = msg.get("role", "unknown")
+            content = msg.get("content") or ""  # content can be None on tool-call turns
+
+            if role == "system":
+                parts.append(f"[SYSTEM]\n{content}")
+
+            elif role == "user":
+                parts.append(f"[USER]\n{content}")
+
+            elif role == "assistant":
+                # Show reasoning if present
+                reasoning = msg.get("reasoning_content", "")
+                if reasoning:
+                    # Truncate long reasoning for display
+                    if len(reasoning) > 300:
+                        reasoning = reasoning[:300] + "..."
+                    parts.append(f"[ASSISTANT thinking]\n{reasoning}")
+
+                # Show content
+                if content:
+                    parts.append(f"[ASSISTANT]\n{content}")
+
+                # Show tool calls
+                tool_calls = msg.get("tool_calls") or []
+                for tc in tool_calls:
+                    func = tc.get("function", {})
+                    name = func.get("name", "?")
+                    args = func.get("arguments", "{}")
+                    # Arguments may arrive as a dict; stringify before truncating
+                    if not isinstance(args, str):
+                        args = str(args)
+                    # Truncate long arguments for display
+                    if len(args) > 200:
+                        args = args[:200] + "..."
+                    parts.append(f"[TOOL CALL] {name}({args})")
+
+            elif role == "tool":
+                tool_id = msg.get("tool_call_id", "")
+                result = content
+                # Truncate long tool results for display
+                if len(result) > 500:
+                    result = result[:500] + "..."
+                parts.append(f"[TOOL RESULT] {result}")
+
+        return "\n\n".join(parts)
+
+    async def add_rollouts_for_wandb(
+        self,
+        scored_data,
+        item=None,
+    ):
+        """
+        Override to show formatted trajectories with tool calls visible,
+        instead of raw token decoding which loses all structure.
+        """
+        num_keep = self.config.num_rollouts_per_group_for_logging
+        if num_keep == -1:
+            num_keep = self.config.group_size
+
+        group = []
+        for i in range(min(num_keep, len(scored_data.get("scores", [])))):
+            score = scored_data["scores"][i]
+
+            # Use messages if available for rich display
+            messages = None
+            if scored_data.get("messages") and i < len(scored_data["messages"]):
+                messages = scored_data["messages"][i]
+
+            if messages:
+                text = self._format_trajectory_for_display(messages)
+            elif scored_data.get("tokens") and i < len(scored_data["tokens"]):
+                text = self.tokenizer.decode(scored_data["tokens"][i])
+            else:
+                text = "(no data)"
+
+            group.append((text, score))
+
+        self.rollouts_for_wandb.append(group)
+        if len(self.rollouts_for_wandb) > self.config.num_rollouts_to_keep:
+            self.rollouts_for_wandb.pop(0)
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log base metrics including tool errors to wandb."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        # Log tool error stats
+        if self._tool_error_buffer:
+            wandb_metrics["train/tool_errors_count"] = len(self._tool_error_buffer)
+
+            # Log error details as a summary string (tables can crash wandb on tmp cleanup)
+            error_summaries = []
+            for err in self._tool_error_buffer:
+                error_summaries.append(
+                    f"[turn {err['turn']}] {err['tool']}({err['args'][:80]}) -> {err['error'][:150]}"
+                )
+            wandb_metrics["train/tool_error_details"] = "\n".join(error_summaries)
+
+            # Also print to stdout for immediate visibility
+            for summary in error_summaries:
+                print(f"  Tool Error: {summary}")
+
+            self._tool_error_buffer = []
+        else:
+            wandb_metrics["train/tool_errors_count"] = 0
+
+        await super().wandb_log(wandb_metrics)
+
+    async def collect_trajectory(
+        self, item: Item
+    ) -> Tuple[Optional[Union[ScoredDataItem, Any]], List[Item]]:
+        """
+        Run a single rollout: agent loop + reward computation.
+
+        This is called group_size times in parallel by collect_trajectories().
+        Each call gets its own task_id for terminal/browser session isolation.
+        """
+        task_id = str(uuid.uuid4())
+
+        # Get group-level tools (resolved once in collect_trajectories)
+        if self._current_group_tools is None:
+            # Fallback: resolve per-trajectory if called outside collect_trajectories
+            tools, valid_names = self._resolve_tools_for_group()
+        else:
+            tools, valid_names = self._current_group_tools
+
+        # Build initial messages
+        messages: List[Dict[str, Any]] = []
+        if self.config.system_prompt:
+            messages.append({"role": "system", "content": self.config.system_prompt})
+        messages.append({"role": "user", "content": self.format_prompt(item)})
+
+        # Run the agent loop
+        result: AgentResult
+        if self._use_managed_server():
+            # Phase 2: ManagedServer with parser -- exact tokens + logprobs
+            # Load the tool call parser from registry based on config
+            from environments.tool_call_parsers import get_parser
+            try:
+                tc_parser = get_parser(self.config.tool_call_parser)
+            except KeyError:
+                logger.warning(
+                    "Tool call parser '%s' not found, falling back to 'hermes'",
+                    self.config.tool_call_parser,
+                )
+                tc_parser = get_parser("hermes")
+
+            try:
+                async with self.server.managed_server(
+                    tokenizer=self.tokenizer,
+                    tool_call_parser=tc_parser,
+                ) as managed:
+                    _max_ctx = self.config.max_token_length if (self.config.max_token_length and self.config.max_token_length > 0) else None
+                    agent = HermesAgentLoop(
+                        server=managed,
+                        tool_schemas=tools,
+                        valid_tool_names=valid_names,
+                        max_turns=self.config.max_agent_turns,
+                        task_id=task_id,
+                        temperature=self.config.agent_temperature,
+                        max_tokens=self.config.max_token_length,
+                        extra_body=self.config.extra_body,
+                        max_context_tokens=_max_ctx,
+                    )
+                    result = await agent.run(messages)
+            except NotImplementedError:
+                # DummyManagedServer not allowed -- fall back to Phase 1
+                logger.warning(
+                    "ManagedServer not available (OpenAI server?). "
+                    "Falling back to direct server mode."
+                )
+                _max_ctx = self.config.max_token_length if (self.config.max_token_length and self.config.max_token_length > 0) else None
+                agent = HermesAgentLoop(
+                    server=self.server,
+                    tool_schemas=tools,
+                    valid_tool_names=valid_names,
+                    max_turns=self.config.max_agent_turns,
+                    task_id=task_id,
+                    temperature=self.config.agent_temperature,
+                    max_tokens=self.config.max_token_length,
+                    extra_body=self.config.extra_body,
+                    max_context_tokens=_max_ctx,
+                )
+                result = await agent.run(messages)
+        else:
+            # Phase 1: OpenAI server -- native tool_calls, placeholder tokens
+            _max_ctx = self.config.max_token_length if (self.config.max_token_length and self.config.max_token_length > 0) else None
+            agent = HermesAgentLoop(
+                server=self.server,
+                tool_schemas=tools,
+                valid_tool_names=valid_names,
+                max_turns=self.config.max_agent_turns,
+                task_id=task_id,
+                temperature=self.config.agent_temperature,
+                max_tokens=self.config.max_token_length,
+                extra_body=self.config.extra_body,
+                max_context_tokens=_max_ctx,
+            )
+            result = await agent.run(messages)
+
+        # Skip reward computation if the agent loop produced no meaningful work
+        # (e.g., API call failed on turn 1). No point spinning up a Modal sandbox
+        # just to verify files that were never created.
+        only_system_and_user = all(
+            msg.get("role") in ("system", "user") for msg in result.messages
+        )
+        if result.turns_used == 0 or only_system_and_user:
+            logger.warning(
+                "Agent loop produced no output (turns=%d, msgs=%d). Skipping reward.",
+                result.turns_used, len(result.messages),
+            )
+            reward = 0.0
+        else:
+            # Compute reward using ToolContext (gives verifier full tool access)
+            ctx = ToolContext(task_id)
+            try:
+                reward = await self.compute_reward(item, result, ctx)
+            except Exception as e:
+                logger.exception("compute_reward failed: %s", e)  # keep the traceback
+                reward = 0.0
+            finally:
+                ctx.cleanup()
+
+        # Track tool errors for wandb logging
+        if result.tool_errors:
+            for err in result.tool_errors:
+                self._tool_error_buffer.append({
+                    "turn": err.turn,
+                    "tool": err.tool_name,
+                    "args": err.arguments[:150],
+                    "error": err.error[:300],
+                    "result": err.tool_result[:300],
+                })
+
+        # Build ScoredDataItem from ManagedServer state
+        # Phase 2: real tokens/masks/logprobs from SequenceNodes
+        # Phase 1: placeholder tokens (still need a valid ScoredDataItem for the pipeline)
+        nodes = (result.managed_state or {}).get("nodes", [])
+
+        if nodes:
+            # Phase 2 (or DummyManagedServer): use actual node data
+            node = nodes[-1]  # Final sequence node = full trajectory
+            scored_item: Dict[str, Any] = {
+                "tokens": node.tokens,
+                "masks": node.masked_tokens,
+                "scores": reward,
+            }
+
+            # When logprobs are present (Phase 2), add placeholder fields for the trainer to fill
+            if hasattr(node, "logprobs") and node.logprobs:
+                scored_item["advantages"] = None  # Computed by trainer
+                scored_item["ref_logprobs"] = None
+        else:
+            # Phase 1 with no managed state: create placeholder tokens
+            # so the data pipeline doesn't break. These are NOT suitable
+            # for training but allow process mode (SFT data gen) to work.
+            # Tokenize the full conversation to get approximate tokens.
+            full_text = "\n".join(
+                msg.get("content", "") for msg in result.messages if msg.get("content")
+            )
+            if self.tokenizer:
+                tokens = self.tokenizer.encode(full_text, add_special_tokens=True)
+            else:
+                tokens = list(range(min(len(full_text) // 4, 128)))
+
+            scored_item = {
+                "tokens": tokens,
+                # Mask first token as prompt; keep masks empty when tokens is empty
+                "masks": ([-100] + tokens[1:]) if tokens else [],
+                "scores": reward,
+            }
+
+        # Always include messages for wandb rollout display and data logging
+        scored_item["messages"] = result.messages
+
+        return scored_item, []
+
+    # =========================================================================
+    # Abstract methods -- subclasses must implement
+    # =========================================================================
+
+    @abstractmethod
+    async def setup(self):
+        """
+        Load dataset, initialize state.
+
+        Called once when the environment starts. Typical implementation:
+            self.dataset = load_dataset(self.config.dataset_name, split=self.config.dataset_split)
+            self.iter = 0
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    async def get_next_item(self) -> Item:
+        """
+        Return the next item from the dataset for rollout.
+
+        Called by the base env's main loop to get items for workers.
+        Should cycle through the dataset.
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    def format_prompt(self, item: Item) -> str:
+        """
+        Convert a dataset item into the user message for the agent.
+
+        Args:
+            item: Dataset item (dict, tuple, etc.)
+
+        Returns:
+            The prompt string to send to the agent
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    async def compute_reward(
+        self, item: Item, result: AgentResult, ctx: ToolContext
+    ) -> float:
+        """
+        Score the rollout. Has full access to:
+        - item: the original dataset item (ground truth, test commands, etc.)
+        - result: AgentResult with full messages, turn count, reasoning, etc.
+        - ctx: ToolContext -- call ANY hermes-agent tool (terminal, file, web,
+               browser, vision...) scoped to this rollout's sandbox. Nothing
+               is off-limits.
+
+        Args:
+            item: The dataset item that was rolled out
+            result: The agent's rollout result
+            ctx: ToolContext with full tool access for verification
+
+        Returns:
+            Reward float (typically 0.0 to 1.0, but any float is valid)
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    async def evaluate(self, *args, **kwargs):
+        """
+        Periodic evaluation. Called every steps_per_eval steps.
+
+        Typical implementation runs the agent on a held-out eval set
+        and logs metrics via wandb/evaluate_log.
+        """
+        raise NotImplementedError
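Reviewer sketch (not part of this diff): collect_trajectory() above evaluates the same `_max_ctx` expression at all three HermesAgentLoop call sites. One possible way to factor it out is shown below. `resolve_max_context` and `shared_loop_kwargs` are hypothetical names, and a SimpleNamespace stands in for the config object, so this is an illustration under those assumptions rather than the PR's implementation.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def resolve_max_context(max_token_length: Optional[int]) -> Optional[int]:
    """Return the limit when it is a positive int, otherwise None."""
    if max_token_length and max_token_length > 0:
        return max_token_length
    return None


def shared_loop_kwargs(config: Any, task_id: str) -> Dict[str, Any]:
    """Collect the keyword arguments that every call site passes identically."""
    return {
        "max_turns": config.max_agent_turns,
        "task_id": task_id,
        "temperature": config.agent_temperature,
        "max_tokens": config.max_token_length,
        "extra_body": config.extra_body,
        "max_context_tokens": resolve_max_context(config.max_token_length),
    }


# Stand-in for HermesAgentEnvConfig, for illustration only.
demo_config = SimpleNamespace(
    max_agent_turns=30,
    agent_temperature=1.0,
    max_token_length=4096,
    extra_body=None,
)
demo_kwargs = shared_loop_kwargs(demo_config, task_id="demo-task")
```

With a helper like this, each call site would differ only in its `server=` argument, with `tool_schemas` and `valid_tool_names` passed alongside the shared keyword arguments.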
--- a/environments/hermes_swe_env/__init__.py
+++ b/environments/hermes_swe_env/__init__.py
--- a/environments/hermes_swe_env/default.yaml
+++ b/environments/hermes_swe_env/default.yaml
@@ -0,0 +1,34 @@
+# SWE Environment -- Default Configuration
+#
+# SWE-bench style tasks with Modal sandboxes for cloud isolation.
+# Uses terminal + file + web toolsets.
+#
+# Usage:
+#   python environments/hermes_swe_env/hermes_swe_env.py serve \
+#       --config environments/hermes_swe_env/default.yaml
+
+env:
+  enabled_toolsets: ["terminal", "file", "web"]
+  max_agent_turns: 30
+  max_token_length: 4096
+  group_size: 4
+  terminal_backend: "modal"
+  tool_call_parser: "hermes"
+  tokenizer_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
+  dataset_name: "bigcode/humanevalpack"
+  dataset_split: "test"
+  prompt_field: "prompt"
+  steps_per_eval: 50
+  total_steps: 500
+  use_wandb: true
+  wandb_name: "hermes-swe"
+  system_prompt: >
+    You are a skilled software engineer. You have access to a terminal,
+    file tools, and web search. Use these tools to complete the coding task.
+    Write clean, working code and verify it runs correctly before finishing.
+
+openai:
+  base_url: "http://localhost:8000/v1"
+  model_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
+  server_type: "openai"
+  api_key: ""
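Reviewer sketch (not part of this diff): the config field descriptions call `enabled_toolsets` and `distribution` mutually exclusive, and this default.yaml sets only `enabled_toolsets`. A minimal pre-flight check for that rule could look like the following. `check_toolset_config` is a hypothetical helper (the diff itself does not validate this), and the dict mirrors a subset of the `env:` block above.

```python
from typing import Any, Dict, List, Optional


def check_toolset_config(env_cfg: Dict[str, Any]) -> Optional[List[str]]:
    """Return the explicit toolset list, or None when no explicit list is set.

    Raises ValueError when both enabled_toolsets and distribution are set.
    """
    enabled = env_cfg.get("enabled_toolsets")
    distribution = env_cfg.get("distribution")
    if enabled is not None and distribution:
        raise ValueError(
            "enabled_toolsets and distribution are mutually exclusive; "
            f"got enabled_toolsets={enabled!r} and distribution={distribution!r}"
        )
    return list(enabled) if enabled is not None else None


# Mirrors the relevant keys of the env: block in default.yaml.
default_env = {
    "enabled_toolsets": ["terminal", "file", "web"],
    "terminal_backend": "modal",
}
selected = check_toolset_config(default_env)
```

Failing fast here would surface a conflicting config at startup instead of letting one setting silently win.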
--- a/environments/hermes_swe_env/hermes_swe_env.py
+++ b/environments/hermes_swe_env/hermes_swe_env.py
@@ -0,0 +1,229 @@
+"""
+HermesSweEnv -- SWE-Bench Style Environment with Modal Sandboxes
+
+A concrete environment for software engineering tasks where the model writes code
+and the reward function runs tests to verify correctness. Uses Modal terminal
+backend for cloud-isolated sandboxes per rollout.
+
+The reward function uses ToolContext.terminal() to run test commands in the same
+Modal sandbox the model used during its agentic loop. All filesystem state from
+the model's tool calls is preserved for verification.
+
+Usage:
+    # Phase 1: OpenAI server type
+    vllm serve YourModel --enable-auto-tool-choice --tool-call-parser hermes
+    run-api
+    python environments/hermes_swe_env/hermes_swe_env.py serve \\
+        --openai.base_url http://localhost:8000/v1 \\
+        --openai.model_name YourModel \\
+        --openai.server_type openai \\
+        --env.dataset_name bigcode/humanevalpack \\
+        --env.terminal_backend modal
+
+    # Phase 2: VLLM server type (full RL training)
+    python environments/hermes_swe_env/hermes_swe_env.py serve \\
+        --openai.base_url http://localhost:8000/v1 \\
+        --openai.model_name YourModel \\
+        --openai.server_type vllm \\
+        --env.tool_call_parser hermes \\
+        --env.terminal_backend modal
+"""
+
+import logging
+import sys
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# Ensure repo root is on sys.path for imports
+_repo_root = Path(__file__).resolve().parent.parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from datasets import load_dataset
+
+from atroposlib.envs.base import ScoredDataGroup
+from atroposlib.envs.server_handling.server_manager import APIServerConfig
+from atroposlib.type_definitions import Item
+
+from environments.agent_loop import AgentResult
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+from environments.tool_context import ToolContext
+
+logger = logging.getLogger(__name__)
+
+
+class HermesSweEnvConfig(HermesAgentEnvConfig):
+    """Config with defaults for SWE-bench style tasks."""
+
+    pass  # Inherits all fields, overrides defaults in config_init
+
+
+class HermesSweEnv(HermesAgentBaseEnv):
+    """
+    SWE-bench style environment using Modal terminal backend.
+
+    The model gets a coding task, uses terminal + file + web tools to solve it,
+    and the reward function runs tests in the same Modal sandbox to verify.
+
+    Subclass this for specific SWE datasets (HumanEval, SWE-bench, etc.)
+    and customize format_prompt() and compute_reward() as needed.
+    """
+
+    name = "hermes-swe"
+    env_config_cls = HermesSweEnvConfig
+
+    @classmethod
+    def config_init(cls) -> Tuple[HermesSweEnvConfig, List[APIServerConfig]]:
+        """
+        Default configuration for the SWE environment.
+
+        Uses Modal terminal backend for cloud isolation and terminal + file + web toolsets.
+        """
+        env_config = HermesSweEnvConfig(
+            # Toolsets: terminal for running code, file for reading/writing, web for docs
+            enabled_toolsets=["terminal", "file", "web"],
+            disabled_toolsets=None,
+            distribution=None,
+            # Agent settings -- same turn budget as the base default; raise for longer SWE tasks
+            max_agent_turns=30,
+            max_token_length=4096,
+            agent_temperature=1.0,
+            system_prompt=(
+                "You are a skilled software engineer. You have access to a terminal, "
+                "file tools, and web search. Use these tools to complete the coding task. "
+                "Write clean, working code and verify it runs correctly before finishing."
+            ),
+            # Modal backend for cloud-isolated sandboxes
+            terminal_backend="modal",
+            # Dataset -- override via CLI for your specific SWE dataset
+            dataset_name="bigcode/humanevalpack",
+            dataset_split="test",
+            prompt_field="prompt",
+            # Atropos settings
+            group_size=4,
+            tokenizer_name="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
+            tool_call_parser="hermes",
+            steps_per_eval=50,
+            total_steps=500,
+            use_wandb=True,
+            wandb_name="hermes-swe",
+        )
+
+        server_configs = [
+            APIServerConfig(
+                base_url="http://localhost:8000/v1",
+                model_name="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
+                server_type="openai",  # Phase 1; switch to "vllm" for Phase 2
+                api_key="",
+            )
+        ]
+
+        return env_config, server_configs
+
+    async def setup(self):
+        """Load the SWE dataset."""
+        if self.config.dataset_name:
+            self.dataset = load_dataset(
+                self.config.dataset_name, split=self.config.dataset_split
+            )
+        else:
+            # Placeholder if no dataset specified
+            self.dataset = []
+        self.iter = 0
+        self.reward_buffer: List[float] = []
+
+    async def get_next_item(self) -> Dict[str, Any]:
+        """Cycle through the SWE dataset."""
+        if not self.dataset:
+            raise ValueError("No dataset loaded. Set dataset_name in config.")
+        item = self.dataset[self.iter % len(self.dataset)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item: Dict[str, Any]) -> str:
+        """
+        Format the SWE task prompt.
+
+        Override this in subclasses for different dataset formats.
+        Default assumes the dataset has a 'prompt' field and optionally a 'test' field.
+        """
+        prompt = item.get(self.config.prompt_field, "")
+
+        # If the dataset has test information, include it in the prompt
+        test_info = item.get("test", item.get("test_code", item.get("tests", "")))
+        if test_info:
+            prompt += f"\n\nTests to pass:\n{test_info}"
+
+        return prompt
+
+    async def compute_reward(
+        self, item: Dict[str, Any], result: AgentResult, ctx: ToolContext
+    ) -> float:
+        """
+        Score by running tests in the model's Modal sandbox.
+
+        Default implementation:
+        - If the dataset item has a 'test', 'test_code', or 'tests' field, run it
+        - Check exit code: 0 = pass (reward 1.0)
+        - Otherwise, partial credit (0.1) if new Python files exist in /workspace
+
+        Override this in subclasses for more sophisticated reward logic.
+        """
+        # Find the test code in the dataset item
+        test_code = item.get("test", item.get("test_code", item.get("tests", "")))
+
+        if test_code:
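+            # Run the test in the model's sandbox; shlex.quote keeps embedded quotes intact
+            import shlex  # local import: only needed for this command
+            quoted_test = shlex.quote(str(test_code))
+            test_result = ctx.terminal(f"cd /workspace && python3 -c {quoted_test}", timeout=60)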
+
+            if test_result["exit_code"] == 0:
+                self.reward_buffer.append(1.0)
+                return 1.0
+
+        # Partial credit: check if the model created any Python files
+        file_check = ctx.terminal("find /workspace -name '*.py' -newer /tmp/.start_marker 2>/dev/null | head -5")
+        if file_check["exit_code"] == 0 and file_check.get("output", "").strip():
+            self.reward_buffer.append(0.1)
+            return 0.1
+
+        self.reward_buffer.append(0.0)
+        return 0.0
+
+    async def evaluate(self, *args, **kwargs):
+        """
+        Run evaluation on a held-out set.
+
+        Override for dataset-specific evaluation logic.
+        """
+        start_time = time.time()
+        end_time = time.time()
+
+        eval_metrics = {"eval/placeholder": 0.0}
+        await self.evaluate_log(
+            metrics=eval_metrics,
+            start_time=start_time,
+            end_time=end_time,
+        )
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log SWE-specific metrics."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        if self.reward_buffer:
+            wandb_metrics["train/avg_reward"] = sum(self.reward_buffer) / len(
+                self.reward_buffer
+            )
+            wandb_metrics["train/pass_rate"] = sum(
+                1 for r in self.reward_buffer if r == 1.0
+            ) / len(self.reward_buffer)
+            self.reward_buffer = []
+
+        await super().wandb_log(wandb_metrics)
+
+
+if __name__ == "__main__":
+    HermesSweEnv.cli()
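
Note on the test command in compute_reward: the dataset's test code is passed to `python3 -c` through a shell string, so quotes and `$` inside that code matter. A minimal sketch of how `shlex.quote` keeps such code intact, assuming a hypothetical helper `build_test_command` that is not part of this PR:

```python
import shlex


def build_test_command(test_code: str, workdir: str = "/workspace") -> str:
    """Hypothetical helper: build a shell command that runs test_code via python3 -c."""
    # shlex.quote wraps the code in single quotes and escapes embedded single
    # quotes, so the shell hands it to python3 as one unmodified argument.
    return f"cd {shlex.quote(workdir)} && python3 -c {shlex.quote(test_code)}"


# Test code containing both quote styles and a "$" that a shell would expand
# inside double quotes.
code = "assert 'x'.upper() == \"X\", \"$HOME\""
command = build_test_command(code)

# Splitting with POSIX shell rules recovers the original code as the last argument.
recovered = shlex.split(command)[-1]
```

Round-tripping through `shlex.split` is only a local check of the quoting; it does not run anything in a sandbox.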
--- a/environments/patches.py
+++ b/environments/patches.py
@@ -0,0 +1,188 @@
+"""
+Monkey patches for making hermes-agent tools work inside async frameworks (Atropos).
+
+Problem:
+    Some tools use asyncio.run() internally (e.g., mini-swe-agent's Modal backend,
+    web_extract). This crashes when called from inside Atropos's event loop because
+    asyncio.run() can't be nested.
+
+Solution:
+    Replace the problematic methods with versions that use a dedicated background
+    thread with its own event loop. The calling code sees the same sync interface --
+    call a function, get a result -- but internally the async work happens on a
+    separate thread that doesn't conflict with Atropos's loop.
+
+    These patches are safe for normal CLI use too: when there's no running event
+    loop, the behavior is identical (the background thread approach works regardless).
+
+What gets patched:
+    - SwerexModalEnvironment.__init__ -- creates Modal deployment on a background thread
+    - SwerexModalEnvironment.execute -- runs commands on the same background thread
+    - SwerexModalEnvironment.stop -- stops deployment on the background thread
+
+Usage:
+    Call apply_patches() once at import time (done automatically by hermes_base_env.py).
+    This is idempotent -- calling it multiple times is safe.
+"""
+
+import asyncio
+import logging
+import threading
+from typing import Any
+
+logger = logging.getLogger(__name__)
+
+_patches_applied = False
+
+
+class _AsyncWorker:
+    """
+    A dedicated background thread with its own event loop.
+
+    Allows sync code to submit async coroutines and block for results,
+    even when called from inside another running event loop. Used to
+    bridge sync tool interfaces with async backends (Modal, SWE-ReX).
+    """
+
+    def __init__(self):
+        self._loop: asyncio.AbstractEventLoop | None = None
+        self._thread: threading.Thread | None = None
+        self._started = threading.Event()
+
+    def start(self):
+        """Start the background event loop thread."""
+        self._thread = threading.Thread(target=self._run_loop, daemon=True)
+        self._thread.start()
+        self._started.wait(timeout=30)
+
+    def _run_loop(self):
+        """Background thread entry point -- runs the event loop forever."""
+        self._loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(self._loop)
+        self._started.set()
+        self._loop.run_forever()
+
+    def run_coroutine(self, coro, timeout=600):
+        """
+        Submit a coroutine to the background loop and block until it completes.
+
+        Safe to call from any thread, including threads that already have
+        a running event loop.
+        """
+        if self._loop is None or self._loop.is_closed():
+            raise RuntimeError("AsyncWorker loop is not running")
+        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
+        return future.result(timeout=timeout)
+
+    def stop(self):
+        """Stop the background event loop and join the thread."""
+        if self._loop and self._loop.is_running():
+            self._loop.call_soon_threadsafe(self._loop.stop)
+        if self._thread:
+            self._thread.join(timeout=10)
+
+
+def _patch_swerex_modal():
+    """
+    Monkey patch SwerexModalEnvironment to use a background thread event loop
+    instead of asyncio.run(). This makes it safe to call from inside Atropos's
+    async event loop.
+
+    The patched methods have the exact same interface and behavior -- the only
+    difference is HOW the async work is executed internally.
+    """
+    try:
+        from minisweagent.environments.extra.swerex_modal import (
+            SwerexModalEnvironment,
+            SwerexModalEnvironmentConfig,
+        )
+        from swerex.deployment.modal import ModalDeployment
+        from swerex.runtime.abstract import Command as RexCommand
+    except ImportError:
+        # mini-swe-agent or swe-rex not installed -- nothing to patch
+        logger.debug("mini-swe-agent Modal backend not available, skipping patch")
+        return
+
+    # Keep a reference to the original __init__ (not used by the patched methods below)
+    _original_init = SwerexModalEnvironment.__init__
+
+    def _patched_init(self, **kwargs):
+        """Patched __init__: creates Modal deployment on a background thread."""
+        self.config = SwerexModalEnvironmentConfig(**kwargs)
+
+        # Start a dedicated event loop thread for all Modal async operations
+        self._worker = _AsyncWorker()
+        self._worker.start()
+
+        # Create AND start the deployment entirely on the worker's loop/thread
+        # so all gRPC channels and async state are bound to that loop
+        async def _create_and_start():
+            deployment = ModalDeployment(
+                image=self.config.image,
+                startup_timeout=self.config.startup_timeout,
+                runtime_timeout=self.config.runtime_timeout,
+                deployment_timeout=self.config.deployment_timeout,
+                install_pipx=self.config.install_pipx,
+                modal_sandbox_kwargs=self.config.modal_sandbox_kwargs,
+            )
+            await deployment.start()
+            return deployment
+
+        self.deployment = self._worker.run_coroutine(_create_and_start())
+
+    def _patched_execute(self, command: str, cwd: str = "", *, timeout: int | None = None) -> dict[str, Any]:
+        """Patched execute: runs commands on the background thread's loop."""
+        async def _do_execute():
+            return await self.deployment.runtime.execute(
+                RexCommand(
+                    command=command,
+                    shell=True,
+                    check=False,
+                    cwd=cwd or self.config.cwd,
+                    timeout=timeout or self.config.timeout,
+                    merge_output_streams=True,
+                    env=self.config.env if self.config.env else None,
+                )
+            )
+
+        output = self._worker.run_coroutine(_do_execute())
+        return {
+            "output": output.stdout,
+            "returncode": output.exit_code,
+        }
+
+    def _patched_stop(self):
+        """Patched stop: stops deployment on the background thread, then stops the thread."""
+        try:
+            self._worker.run_coroutine(
+                asyncio.wait_for(self.deployment.stop(), timeout=10),
+                timeout=15,
+            )
+        except Exception as exc:
+            logger.debug("Ignoring error while stopping Modal deployment: %s", exc)
+        finally:
+            self._worker.stop()
+
+    # Apply the patches
+    SwerexModalEnvironment.__init__ = _patched_init
+    SwerexModalEnvironment.execute = _patched_execute
+    SwerexModalEnvironment.stop = _patched_stop
+
+    logger.debug("Patched SwerexModalEnvironment for async-safe operation")
+
+
+def apply_patches():
+    """
+    Apply all monkey patches needed for Atropos compatibility.
+
+    Safe to call multiple times -- patches are only applied once.
+    Safe for normal CLI use -- patched code works identically when
+    there is no running event loop.
+    """
+    global _patches_applied
+    if _patches_applied:
+        return
+
+    _patch_swerex_modal()
+
+    _patches_applied = True
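
The approach described in the patches.py docstring (a dedicated thread that owns its own event loop) can be sketched in isolation. This is a simplified illustration under stated assumptions, not the PR's code: `BackgroundLoop`, `add`, `sync_tool`, `caller`, and `demo` are hypothetical names, and only standard-library asyncio and threading calls are used.

```python
import asyncio
import threading


class BackgroundLoop:
    """Hypothetical stand-in for _AsyncWorker: one daemon thread, one event loop."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout: float = 5.0):
        # Schedule the coroutine on the background loop, then block the
        # calling thread until its result is available.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def stop(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5)
        self._loop.close()


async def add(a: int, b: int) -> int:
    await asyncio.sleep(0)
    return a + b


def sync_tool(worker: BackgroundLoop) -> int:
    # A sync interface that hides async work, similar in spirit to the patched execute().
    return worker.run(add(2, 3))


async def caller(worker: BackgroundLoop) -> int:
    # Runs inside an active event loop, where a nested asyncio.run() would raise.
    return sync_tool(worker)


def demo() -> int:
    worker = BackgroundLoop()
    try:
        return asyncio.run(caller(worker))
    finally:
        worker.stop()
```

The calling loop is blocked while `sync_tool` waits, which matches how any synchronous tool call behaves; the background loop only removes the nested-loop conflict.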
--- a/environments/terminal_test_env/__init__.py
+++ b/environments/terminal_test_env/__init__.py
--- a/environments/terminal_test_env/default.yaml
+++ b/environments/terminal_test_env/default.yaml
@@ -0,0 +1,34 @@
+# Terminal Test Environment -- Default Configuration
+#
+# Simple file-creation tasks for validating the full Atropos + hermes-agent stack.
+# Uses Modal terminal backend and OpenRouter (Claude) for inference.
+# API keys loaded from ~/hermes-agent/.env
+#
+# Usage:
+#   run-api
+#   python environments/terminal_test_env/terminal_test_env.py serve \
+#       --config environments/terminal_test_env/default.yaml
+
+env:
+  enabled_toolsets: ["terminal", "file"]
+  max_agent_turns: 10
+  max_token_length: 2048
+  group_size: 3
+  total_steps: 3
+  steps_per_eval: 3
+  terminal_backend: "modal"
+  tool_call_parser: "hermes"
+  tokenizer_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
+  ensure_scores_are_not_same: false
+  use_wandb: false
+  system_prompt: >
+    You are a helpful assistant with access to a terminal and file tools.
+    Complete the user's request by using the available tools.
+    Be precise and follow instructions exactly.
+
+openai:
+  base_url: "https://openrouter.ai/api/v1"
+  model_name: "anthropic/claude-opus-4.6"
+  server_type: "openai"
+  health_check: false
+  # api_key loaded from OPENROUTER_API_KEY in .env
--- a/environments/terminal_test_env/terminal_test_env.py
+++ b/environments/terminal_test_env/terminal_test_env.py
@@ -0,0 +1,292 @@
+"""
+TerminalTestEnv -- Simple Test Environment for Validating the Stack
+
+A self-contained environment with inline tasks (no external dataset needed).
+Each task asks the model to create a file at a known path with specific content.
+The reward verifier cats the file and checks if the content matches.
+
+Enables only terminal + file toolsets. Uses Modal terminal backend with
+OpenRouter (Claude) by default.
+
+Training tasks (3):
+    1. Create ~/greeting.txt with "Hello from Hermes Agent"
+    2. Create ~/count.txt with numbers 1-5, one per line
+    3. Create ~/answer.txt with the result of 123 + 456
+
+Eval task (1):
+    1. Create ~/result.txt with the result of 6 * 7
+
+Usage:
+    # Start Atropos API server
+    run-api
+
+    # Run environment (uses OpenRouter + Modal by default)
+    python environments/terminal_test_env/terminal_test_env.py serve
+
+    # Process mode (no run-api needed, saves to JSONL)
+    python environments/terminal_test_env/terminal_test_env.py process \\
+        --env.data_path_to_save_groups terminal_test_output.jsonl
+"""
+
+import logging
+import os
+import sys
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# Ensure repo root is on sys.path for imports
+_repo_root = Path(__file__).resolve().parent.parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from atroposlib.envs.base import ScoredDataGroup
+from atroposlib.envs.server_handling.server_manager import APIServerConfig
+from atroposlib.type_definitions import Item
+
+from environments.agent_loop import AgentResult
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+from environments.tool_context import ToolContext
+
+logger = logging.getLogger(__name__)
+
+
+# =============================================================================
+# Inline task definitions -- no external dataset needed
+# =============================================================================
+
+TRAIN_TASKS = [
+    {
+        "prompt": "Create a file at ~/greeting.txt containing exactly the text: Hello from Hermes Agent",
+        "verify_path": "~/greeting.txt",
+        "expected_content": "Hello from Hermes Agent",
+    },
+    {
+        "prompt": "Create a file at ~/count.txt containing the numbers 1 through 5, one per line",
+        "verify_path": "~/count.txt",
+        "expected_content": "1\n2\n3\n4\n5",
+    },
+    {
+        "prompt": "Create a file at ~/answer.txt containing the result of 123 + 456",
+        "verify_path": "~/answer.txt",
+        "expected_content": "579",
+    },
+]
+
+EVAL_TASKS = [
+    {
+        "prompt": "Create a file at ~/result.txt containing the result of 6 * 7",
+        "verify_path": "~/result.txt",
+        "expected_content": "42",
+    },
+]
+
+
+class TerminalTestEnvConfig(HermesAgentEnvConfig):
+    """Config with defaults suitable for terminal testing."""
+
+    pass  # Inherits all fields, overrides defaults in config_init
+
+
+class TerminalTestEnv(HermesAgentBaseEnv):
+    """
+    Simple test environment with inline file-creation tasks.
+
+    All tasks follow the same pattern: "create a file at ~/X.txt with content Y".
+    The verifier runs `cat ~/X.txt` in the rollout's terminal and checks the output
+    against the expected string. Same verifier logic for all tasks.
+
+    This environment is designed to validate the full stack end-to-end:
+    - Agent loop executes tool calls (terminal/file)
+    - ToolContext provides terminal access to the reward function
+    - Reward function verifies file content via cat
+    - Scored data flows through the Atropos pipeline
+    """
+
+    name = "terminal-test"
+    env_config_cls = TerminalTestEnvConfig
+
+    @classmethod
+    def config_init(cls) -> Tuple[TerminalTestEnvConfig, List[APIServerConfig]]:
+        """
+        Default configuration for the terminal test environment.
+
+        Uses Modal terminal backend for cloud isolation and OpenRouter with
+        Claude for inference. API keys loaded from ~/hermes-agent/.env.
+        """
+        env_config = TerminalTestEnvConfig(
+            # Terminal + file tools only
+            enabled_toolsets=["terminal", "file"],
+            disabled_toolsets=None,
+            distribution=None,
+            # Agent settings
+            max_agent_turns=10,  # Simple tasks, don't need many turns
+            max_token_length=16000,
+            agent_temperature=1.0,
+            system_prompt=(
+                "You are a helpful assistant with access to a terminal and file tools. "
+                "Complete the user's request by using the available tools. "
+                "Be precise and follow instructions exactly."
+            ),
+            # Modal terminal backend for cloud-isolated sandboxes per rollout
+            terminal_backend="modal",
+            # Atropos settings
+            group_size=3,              # 3 rollouts per group
+            tokenizer_name="NousResearch/q-30b-t-h45-e1",
+            tool_call_parser="hermes",
+            steps_per_eval=3,          # Eval after all 3 steps
+            total_steps=3,             # 3 groups total (1 group per step)
+            use_wandb=True,
+            wandb_name="terminal-test",
+            ensure_scores_are_not_same=False,  # Allow all-same scores for simple tasks
+            # No external dataset
+            dataset_name=None,
+        )
+
+        # OpenRouter with Claude -- API key loaded from .env (OPENROUTER_API_KEY)
+        server_configs = [
+            APIServerConfig(
+                base_url="https://openrouter.ai/api/v1",
+                model_name="anthropic/claude-opus-4.6",
+                server_type="openai",
+                api_key=os.getenv("OPENROUTER_API_KEY", ""),
+                health_check=False,  # OpenRouter doesn't have a /health endpoint
+            )
+        ]
+
+        return env_config, server_configs
+
+    async def setup(self):
+        """Initialize inline task lists."""
+        self.train_tasks = list(TRAIN_TASKS)
+        self.eval_tasks = list(EVAL_TASKS)
+        self.iter = 0
+        # Track reward stats for wandb logging
+        self.reward_buffer: List[float] = []
+
+    async def get_next_item(self) -> Dict[str, str]:
+        """Cycle through training tasks."""
+        item = self.train_tasks[self.iter % len(self.train_tasks)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item: Dict[str, str]) -> str:
+        """The prompt is directly in the task item."""
+        return item["prompt"]
+
+    async def compute_reward(
+        self, item: Dict[str, str], result: AgentResult, ctx: ToolContext
+    ) -> float:
+        """
+        Verify by cat-ing the expected file path and checking content matches.
+        Same verifier for all tasks -- they all write a file at a known path.
+
+        Scoring:
+            1.0 = exact match
+            0.5 = expected content is present but has extra stuff
+            0.0 = file doesn't exist or content doesn't match
+        """
+        verify_result = ctx.terminal(f"cat {item['verify_path']}")
+
+        # File doesn't exist or can't be read
+        if verify_result["exit_code"] != 0:
+            self.reward_buffer.append(0.0)
+            return 0.0
+
+        actual = verify_result.get("output", "").strip()
+        expected = item["expected_content"].strip()
+
+        # Exact match
+        if actual == expected:
+            self.reward_buffer.append(1.0)
+            return 1.0
+
+        # Partial credit: expected content is present but has extra stuff
+        if expected in actual:
+            self.reward_buffer.append(0.5)
+            return 0.5
+
+        self.reward_buffer.append(0.0)
+        return 0.0
+
+    async def evaluate(self, *args, **kwargs):
+        """
+        Run each eval task as a single-turn completion and record the response.
+        No tools run during eval, so file content is not verified; only the
+        sample count and raw samples are logged.
+        """
+        start_time = time.time()
+        total = len(self.eval_tasks)
+        samples = []
+
+        for eval_item in self.eval_tasks:
+            try:
+                # For eval, we do a simple single-turn completion (not full agent loop)
+                # to keep eval fast. The agent loop is tested via training.
+                completion = await self.server.chat_completion(
+                    messages=[
+                        {"role": "system", "content": self.config.system_prompt or ""},
+                        {"role": "user", "content": eval_item["prompt"]},
+                    ],
+                    n=1,
+                    max_tokens=self.config.max_token_length,
+                    temperature=0.0,
+                    split="eval",
+                )
+
+                response_content = (
+                    completion.choices[0].message.content if completion.choices else ""
+                )
+
+                samples.append(
+                    {
+                        "prompt": eval_item["prompt"],
+                        "response": response_content,
+                        "expected": eval_item["expected_content"],
+                    }
+                )
+
+            except Exception as e:
+                logger.error("Eval failed for item: %s", e)
+                samples.append(
+                    {
+                        "prompt": eval_item["prompt"],
+                        "response": f"ERROR: {e}",
+                        "expected": eval_item["expected_content"],
+                    }
+                )
+
+        end_time = time.time()
+
+        eval_metrics = {
+            "eval/num_samples": total,
+        }
+
+        await self.evaluate_log(
+            metrics=eval_metrics,
+            samples=samples,
+            start_time=start_time,
+            end_time=end_time,
+        )
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log training metrics including reward stats and accuracy."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        if self.reward_buffer:
+            total = len(self.reward_buffer)
+            correct = sum(1 for r in self.reward_buffer if r == 1.0)
+            partial = sum(1 for r in self.reward_buffer if r == 0.5)
+
+            wandb_metrics["train/avg_reward"] = sum(self.reward_buffer) / total
+            wandb_metrics["train/accuracy"] = correct / total
+            wandb_metrics["train/partial_match_rate"] = partial / total
+            wandb_metrics["train/total_rollouts"] = total
+            self.reward_buffer = []
+
+        await super().wandb_log(wandb_metrics)
+
+
+if __name__ == "__main__":
+    TerminalTestEnv.cli()
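
The scoring rule in TerminalTestEnv.compute_reward can be read as a pure function of the `cat` result. A minimal sketch, assuming a hypothetical helper `score_file_content` that takes the exit code and output directly instead of a ToolContext:

```python
def score_file_content(exit_code: int, output: str, expected: str) -> float:
    """Hypothetical helper mirroring the 1.0 / 0.5 / 0.0 scoring described above."""
    if exit_code != 0:
        # File missing or unreadable.
        return 0.0

    actual = output.strip()
    wanted = expected.strip()

    if actual == wanted:
        return 1.0  # exact match after trimming whitespace
    if wanted in actual:
        return 0.5  # expected text present, plus extra content
    return 0.0


exact = score_file_content(0, "42\n", "42")
partial = score_file_content(0, "The answer is 42", "42")
missing = score_file_content(1, "", "42")
wrong = score_file_content(0, "41", "42")
```

Because the partial-credit check is a substring test, an output such as "142" would also earn 0.5 for an expected "42"; that may or may not be desirable for a given task.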
--- a/environments/tool_call_parsers/__init__.py
+++ b/environments/tool_call_parsers/__init__.py
@@ -0,0 +1,120 @@
+"""
+Tool Call Parser Registry
+
+Client-side parsers that extract structured tool_calls from raw model output text.
+Used in Phase 2 (VLLM server type) where ManagedServer's /generate endpoint returns
+raw text without tool call parsing.
+
+Each parser is a standalone reimplementation of the corresponding VLLM parser's
+non-streaming extract_tool_calls() logic. No VLLM dependency -- only standard library
+(re, json, uuid) and openai types.
+
+Usage:
+    from environments.tool_call_parsers import get_parser
+
+    parser = get_parser("hermes")
+    content, tool_calls = parser.parse(raw_model_output)
+    # content = text with tool call markup stripped
+    # tool_calls = list of ChatCompletionMessageToolCall objects, or None
+"""
+
+import logging
+from abc import ABC, abstractmethod
+from typing import Dict, List, Optional, Tuple, Type
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+)
+
+logger = logging.getLogger(__name__)
+
+# Type alias for parser return value
+ParseResult = Tuple[Optional[str], Optional[List[ChatCompletionMessageToolCall]]]
+
+
+class ToolCallParser(ABC):
+    """
+    Base class for tool call parsers.
+
+    Each parser knows how to extract structured tool_calls from a specific
+    model family's raw output text format.
+    """
+
+    @abstractmethod
+    def parse(self, text: str) -> ParseResult:
+        """
+        Parse raw model output text for tool calls.
+
+        Args:
+            text: Raw decoded text from the model's completion
+
+        Returns:
+            Tuple of (content, tool_calls) where:
+            - content: text with tool call markup stripped (the message 'content' field),
+                       or None if the entire output was tool calls
+            - tool_calls: list of ChatCompletionMessageToolCall objects,
+                          or None if no tool calls were found
+        """
+        raise NotImplementedError
+
+
+# Global parser registry: name -> parser class
+PARSER_REGISTRY: Dict[str, Type[ToolCallParser]] = {}
+
+
+def register_parser(name: str):
+    """
+    Decorator to register a parser class under a given name.
+
+    Usage:
+        @register_parser("hermes")
+        class HermesToolCallParser(ToolCallParser):
+            ...
+    """
+
+    def decorator(cls: Type[ToolCallParser]) -> Type[ToolCallParser]:
+        PARSER_REGISTRY[name] = cls
+        return cls
+
+    return decorator
+
+
+def get_parser(name: str) -> ToolCallParser:
+    """
+    Get a parser instance by name.
+
+    Args:
+        name: Parser name (e.g., "hermes", "mistral", "llama3_json")
+
+    Returns:
+        Instantiated parser
+
+    Raises:
+        KeyError: If parser name is not found in registry
+    """
+    if name not in PARSER_REGISTRY:
+        available = sorted(PARSER_REGISTRY.keys())
+        raise KeyError(
+            f"Tool call parser '{name}' not found. Available parsers: {available}"
+        )
+    return PARSER_REGISTRY[name]()
+
+
+def list_parsers() -> List[str]:
+    """Return sorted list of registered parser names."""
+    return sorted(PARSER_REGISTRY.keys())
+
+
+# Import all parser modules to trigger registration via @register_parser decorators
+# Each module registers itself when imported
+from environments.tool_call_parsers.hermes_parser import HermesToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.longcat_parser import LongcatToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.mistral_parser import MistralToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.llama_parser import LlamaToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.qwen_parser import QwenToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.deepseek_v3_parser import DeepSeekV3ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.deepseek_v3_1_parser import DeepSeekV31ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.kimi_k2_parser import KimiK2ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.glm45_parser import Glm45ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.glm47_parser import Glm47ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.qwen3_coder_parser import Qwen3CoderToolCallParser  # noqa: E402, F401
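
The decorator-based registry above can be illustrated with a stripped-down version. This is a sketch under stated assumptions: `Parser`, `REGISTRY`, `register`, `get`, and `EchoParser` are hypothetical names chosen for the example and carry no dependency on openai types.

```python
from typing import Callable, Dict, Optional, Tuple, Type


class Parser:
    """Hypothetical minimal base class for the example."""

    def parse(self, text: str) -> Tuple[Optional[str], Optional[list]]:
        raise NotImplementedError


REGISTRY: Dict[str, Type[Parser]] = {}


def register(name: str) -> Callable[[Type[Parser]], Type[Parser]]:
    """Return a class decorator that stores the class under `name`."""

    def decorator(cls: Type[Parser]) -> Type[Parser]:
        REGISTRY[name] = cls
        return cls  # returning the class unchanged allows stacking decorators

    return decorator


# Stacked decorators register the same class under two names (an alias).
@register("echo")
@register("echo_alias")
class EchoParser(Parser):
    def parse(self, text: str) -> Tuple[Optional[str], Optional[list]]:
        # No tool-call markup recognized: return the text and no tool calls.
        return text, None


def get(name: str) -> Parser:
    if name not in REGISTRY:
        raise KeyError(f"Parser '{name}' not found. Available: {sorted(REGISTRY)}")
    return REGISTRY[name]()  # a fresh instance on every call
```

Registration happens as a side effect of importing the module that defines the class, which is why the real registry imports every parser module at the bottom of the file.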
--- a/environments/tool_call_parsers/deepseek_v3_1_parser.py
+++ b/environments/tool_call_parsers/deepseek_v3_1_parser.py
@@ -0,0 +1,71 @@
+"""
+DeepSeek V3.1 tool call parser.
+
+Similar to V3 but with a slightly different format:
+    <｜tool▁call▁begin｜>function_name<｜tool▁sep｜>arguments<｜tool▁call▁end｜>
+
+Note: V3 has type+name before the separator, V3.1 has name before and args after.
+
+Based on VLLM's DeepSeekV31ToolParser.extract_tool_calls()
+"""
+
+import re
+import uuid
+from typing import List
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("deepseek_v3_1")
+@register_parser("deepseek_v31")
+class DeepSeekV31ToolCallParser(ToolCallParser):
+    """
+    Parser for DeepSeek V3.1 tool calls.
+
+    Slightly different regex than V3: function_name comes before the separator,
+    arguments come after (no type field, no json code block wrapper).
+    """
+
+    START_TOKEN = "<｜tool▁calls▁begin｜>"
+
+    # Regex captures: function_name, function_arguments
+    PATTERN = re.compile(
+        r"<｜tool▁call▁begin｜>(?P<function_name>.*?)<｜tool▁sep｜>(?P<function_arguments>.*?)<｜tool▁call▁end｜>"
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        if self.START_TOKEN not in text:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                func_name, func_args = match
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(
+                            name=func_name.strip(),
+                            arguments=func_args.strip(),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            content = text[: text.find(self.START_TOKEN)].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
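
To make the V3.1 format concrete, here is a small standalone sketch of the same regex applied to a sample string. It is an illustration only: `extract_calls` and `sample` are hypothetical, calls are returned as plain `(name, arguments)` tuples rather than ChatCompletionMessageToolCall objects, and the sample output is invented for the example.

```python
import json
import re

CALLS_BEGIN = "<｜tool▁calls▁begin｜>"
CALL_PATTERN = re.compile(
    r"<｜tool▁call▁begin｜>(?P<name>.*?)<｜tool▁sep｜>(?P<args>.*?)<｜tool▁call▁end｜>"
)


def extract_calls(text: str):
    """Return (content, calls); calls is None when no tool call is found."""
    if CALLS_BEGIN not in text:
        return text, None

    calls = [
        (m.group("name").strip(), m.group("args").strip())
        for m in CALL_PATTERN.finditer(text)
    ]
    if not calls:
        return text, None

    # Content is whatever precedes the tool-calls section.
    content = text[: text.find(CALLS_BEGIN)].strip()
    return (content or None), calls


sample = (
    "Let me check."
    "<｜tool▁calls▁begin｜>"
    '<｜tool▁call▁begin｜>get_weather<｜tool▁sep｜>{"city": "Paris"}<｜tool▁call▁end｜>'
    "<｜tool▁calls▁end｜>"
)

content, calls = extract_calls(sample)
arguments = json.loads(calls[0][1])
```

Since the pattern is compiled without `re.DOTALL`, arguments that span several lines do not match, and the text is returned unchanged with no tool calls.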
--- a/environments/tool_call_parsers/deepseek_v3_parser.py
+++ b/environments/tool_call_parsers/deepseek_v3_parser.py
@@ -0,0 +1,75 @@
+"""
+DeepSeek V3 tool call parser.
+
+Format uses special unicode tokens:
+    <｜tool▁calls▁begin｜>
+    <｜tool▁call▁begin｜>type<｜tool▁sep｜>function_name
+    ```json
+    {"arg": "value"}
+    ```
+    <｜tool▁call▁end｜>
+    <｜tool▁calls▁end｜>
+
+Based on VLLM's DeepSeekV3ToolParser.extract_tool_calls()
+"""
+
+import re
+import uuid
+from typing import List
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("deepseek_v3")
+class DeepSeekV3ToolCallParser(ToolCallParser):
+    """
+    Parser for DeepSeek V3 tool calls.
+
+    Uses special unicode tokens built from fullwidth vertical bars (｜) and block elements (▁).
+    Extracts type, function name, and JSON arguments from the structured format.
+    """
+
+    START_TOKEN = "<｜tool▁calls▁begin｜>"
+
+    # Regex captures: type, function_name, function_arguments
+    PATTERN = re.compile(
+        r"<｜tool▁call▁begin｜>(?P<type>.*)<｜tool▁sep｜>(?P<function_name>.*)\n```json\n(?P<function_arguments>.*)\n```<｜tool▁call▁end｜>"
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        if self.START_TOKEN not in text:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                tc_type, func_name, func_args = match
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(
+                            name=func_name.strip(),
+                            arguments=func_args.strip(),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            # Content is everything before the tool calls section
+            content = text[: text.find(self.START_TOKEN)].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
--- a/environments/tool_call_parsers/glm45_parser.py
+++ b/environments/tool_call_parsers/glm45_parser.py
@@ -0,0 +1,109 @@
+"""
+GLM 4.5 (GLM-4-MoE) tool call parser.
+
+Format uses custom arg_key/arg_value tags rather than standard JSON:
+    <tool_call>function_name
+    <arg_key>param1</arg_key><arg_value>value1</arg_value>
+    <arg_key>param2</arg_key><arg_value>value2</arg_value>
+    </tool_call>
+
+Values are deserialized using json.loads -> ast.literal_eval -> raw string fallback.
+
+Based on vLLM's Glm4MoeModelToolParser.extract_tool_calls()
+"""
+
+import ast
+import json
+import re
+import uuid
+from typing import Any, Dict, List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+def _deserialize_value(value: str) -> Any:
+    """
+    Try to deserialize a string value to its native Python type.
+    Attempts json.loads, then ast.literal_eval, then returns raw string.
+    """
+    try:
+        return json.loads(value)
+    except (json.JSONDecodeError, TypeError):
+        pass
+
+    try:
+        return ast.literal_eval(value)
+    except (ValueError, SyntaxError, TypeError):
+        pass
+
+    return value
+
+
+@register_parser("glm45")
+class Glm45ToolCallParser(ToolCallParser):
+    """
+    Parser for GLM 4.5 (GLM-4-MoE) tool calls.
+
+    Uses <tool_call>...</tool_call> tags with <arg_key>/<arg_value> pairs
+    instead of standard JSON arguments.
+    """
+
+    FUNC_CALL_REGEX = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
+    FUNC_DETAIL_REGEX = re.compile(r"<tool_call>([^\n]*)\n(.*)</tool_call>", re.DOTALL)
+    FUNC_ARG_REGEX = re.compile(
+        r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
+    )
+
+    START_TOKEN = "<tool_call>"
+
+    def parse(self, text: str) -> ParseResult:
+        if self.START_TOKEN not in text:
+            return text, None
+
+        try:
+            matched_calls = self.FUNC_CALL_REGEX.findall(text)
+            if not matched_calls:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+
+            for match in matched_calls:
+                detail = self.FUNC_DETAIL_REGEX.search(match)
+                if not detail:
+                    continue
+
+                func_name = detail.group(1).strip()
+                func_args_raw = detail.group(2)
+
+                # Parse arg_key/arg_value pairs
+                pairs = self.FUNC_ARG_REGEX.findall(func_args_raw) if func_args_raw else []
+                arg_dict: Dict[str, Any] = {}
+                for key, value in pairs:
+                    arg_key = key.strip()
+                    arg_val = _deserialize_value(value.strip())
+                    arg_dict[arg_key] = arg_val
+
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(
+                            name=func_name,
+                            arguments=json.dumps(arg_dict, ensure_ascii=False),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            content = text[: text.find(self.START_TOKEN)].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
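Review note: the value fallback chain in `_deserialize_value` (json.loads, then ast.literal_eval, then the raw string) can be exercised on its own. The following is a minimal standalone sketch, not the module itself; `deserialize_value` and the sample inputs are illustrative names chosen for this note.

```python
import ast
import json
from typing import Any


def deserialize_value(value: str) -> Any:
    """Standalone copy of the fallback chain, for illustration only."""
    try:
        # JSON covers numbers, true/false/null, arrays, and objects
        return json.loads(value)
    except (json.JSONDecodeError, TypeError):
        pass
    try:
        # Python literals cover True/False/None and single-quoted dicts
        return ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        pass
    # Anything else is kept as the raw string
    return value


samples = ["42", "true", "True", "[1, 2]", "{'a': 1}", "San Francisco"]
results = [deserialize_value(s) for s in samples]
print(results)
```

The order matters: a JSON-style `true` and a Python-style `True` both end up as a boolean, while free text such as a city name falls through both parsers and stays a string.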
--- a/environments/tool_call_parsers/glm47_parser.py
+++ b/environments/tool_call_parsers/glm47_parser.py
@@ -0,0 +1,35 @@
+"""
+GLM 4.7 tool call parser.
+
+Same as GLM 4.5 but with slightly different regex patterns.
+The tool_call tags may wrap differently and arg parsing handles
+newlines between key/value pairs.
+
+Based on vLLM's Glm47MoeModelToolParser (extends Glm4MoeModelToolParser).
+"""
+
+import re
+
+from environments.tool_call_parsers import ParseResult, register_parser
+from environments.tool_call_parsers.glm45_parser import Glm45ToolCallParser
+
+
+@register_parser("glm47")
+class Glm47ToolCallParser(Glm45ToolCallParser):
+    """
+    Parser for GLM 4.7 tool calls.
+    Extends GLM 4.5 with updated regex patterns.
+    """
+
+    def __init__(self):
+        super().__init__()
+        # GLM 4.7 uses a slightly different detail regex that includes
+        # the <tool_call> wrapper and optional arg_key content
+        self.FUNC_DETAIL_REGEX = re.compile(
+            r"<tool_call>(.*?)(<arg_key>.*?)?</tool_call>", re.DOTALL
+        )
+        # GLM 4.7 handles newlines between arg_key and arg_value tags
+        self.FUNC_ARG_REGEX = re.compile(
+            r"<arg_key>(.*?)</arg_key>(?:\\n|\s)*<arg_value>(.*?)</arg_value>",
+            re.DOTALL,
+        )
--- a/environments/tool_call_parsers/hermes_parser.py
+++ b/environments/tool_call_parsers/hermes_parser.py
@@ -0,0 +1,80 @@
+"""
+Hermes tool call parser.
+
+Format: <tool_call>{"name": "func", "arguments": {...}}</tool_call>
+Based on vLLM's Hermes2ProToolParser.extract_tool_calls()
+"""
+
+import json
+import re
+import uuid
+from typing import List, Optional, Tuple
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("hermes")
+class HermesToolCallParser(ToolCallParser):
+    """
+    Parser for Hermes-format tool calls.
+
+    Matches <tool_call>...</tool_call> tags containing JSON with "name" and "arguments".
+    Also handles unclosed <tool_call> at end-of-string (truncated generation).
+    """
+
+    # Matches both closed and unclosed tool_call tags
+    PATTERN = re.compile(
+        r"<tool_call>\s*(.*?)\s*</tool_call>|<tool_call>\s*(.*)", re.DOTALL
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        if "<tool_call>" not in text:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                # match is a tuple: (closed_content, unclosed_content)
+                raw_json = match[0] if match[0] else match[1]
+                if not raw_json.strip():
+                    continue
+
+                tc_data = json.loads(raw_json)
+                # Handle arguments: could be dict or already a JSON string
+                raw_args = tc_data.get("arguments", {})
+                if isinstance(raw_args, str):
+                    # Already a string — pass through as-is.
+                    # It may be a JSON string ("{...}") or a plain string ("ls").
+                    args_str = raw_args
+                else:
+                    # Dict — serialize to JSON
+                    args_str = json.dumps(raw_args, ensure_ascii=False)
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(
+                            name=tc_data["name"],
+                            arguments=args_str,
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            # Content is everything before the first <tool_call> tag
+            content = text[: text.find("<tool_call>")].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
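Review note: the string pass-through in the Hermes parser exists to avoid double-encoding. A minimal sketch of the difference, assuming a hypothetical helper name `to_arguments_string` that is not part of the module:

```python
import json
from typing import Any


def to_arguments_string(raw_args: Any) -> str:
    """Hypothetical helper mirroring the parser's branch on argument type."""
    if isinstance(raw_args, str):
        # Already a string: pass through unchanged
        return raw_args
    # Dict (or other JSON value): serialize once
    return json.dumps(raw_args, ensure_ascii=False)


dict_args = to_arguments_string({"command": "ls"})
string_args = to_arguments_string('{"command": "ls"}')

# What would happen if the string were serialized a second time
double_encoded = json.dumps('{"command": "ls"}')

print(json.loads(string_args))     # decodes to a dict
print(json.loads(double_encoded))  # decodes to a str, not a dict
```

A consumer that calls `json.loads` once on the arguments gets a dict in the first two cases, but only a string in the double-encoded case, which is why the parser serializes non-string values only.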
--- a/environments/tool_call_parsers/kimi_k2_parser.py
+++ b/environments/tool_call_parsers/kimi_k2_parser.py
@@ -0,0 +1,93 @@
+"""
+Kimi K2 tool call parser.
+
+Format:
+    <|tool_calls_section_begin|>
+    <|tool_call_begin|>function_id:0<|tool_call_argument_begin|>{"arg": "val"}<|tool_call_end|>
+    <|tool_calls_section_end|>
+
+The function_id format is typically "functions.func_name:index" or "func_name:index".
+
+Based on vLLM's KimiK2ToolParser.extract_tool_calls()
+"""
+
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("kimi_k2")
+class KimiK2ToolCallParser(ToolCallParser):
+    """
+    Parser for Kimi K2 tool calls.
+
+    Uses section begin/end tokens wrapping individual tool call begin/end tokens.
+    The tool_call_id contains the function name (after last dot, before colon).
+    """
+
+    # Support both singular and plural variants
+    START_TOKENS = [
+        "<|tool_calls_section_begin|>",
+        "<|tool_call_section_begin|>",
+    ]
+
+    # Regex captures: tool_call_id (e.g., "functions.get_weather:0"), function_arguments
+    PATTERN = re.compile(
+        r"<\|tool_call_begin\|>\s*(?P<tool_call_id>[^<]+:\d+)\s*"
+        r"<\|tool_call_argument_begin\|>\s*"
+        r"(?P<function_arguments>(?:(?!<\|tool_call_begin\|>).)*?)\s*"
+        r"<\|tool_call_end\|>",
+        re.DOTALL,
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        # Check for any variant of the start token
+        has_start = any(token in text for token in self.START_TOKENS)
+        if not has_start:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                function_id, function_args = match
+
+                # Extract function name from ID format: "functions.get_weather:0" -> "get_weather"
+                function_name = function_id.split(":")[0].split(".")[-1]
+
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=function_id,  # Preserve the original ID format
+                        type="function",
+                        function=Function(
+                            name=function_name,
+                            arguments=function_args.strip(),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            # Content is everything before the tool calls section
+            earliest_start = len(text)
+            for token in self.START_TOKENS:
+                idx = text.find(token)
+                if idx >= 0 and idx < earliest_start:
+                    earliest_start = idx
+
+            content = text[:earliest_start].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
--- a/environments/tool_call_parsers/llama_parser.py
+++ b/environments/tool_call_parsers/llama_parser.py
@@ -0,0 +1,96 @@
+"""
+Llama 3.x / 4 tool call parser.
+
+Format: The model outputs JSON objects with "name" and "arguments" (or "parameters") keys.
+May be preceded by <|python_tag|> token. Supports multiple JSON objects separated
+by content or semicolons.
+
+Based on vLLM's Llama3JsonToolParser.extract_tool_calls()
+"""
+
+import json
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("llama3_json")
+@register_parser("llama4_json")
+class LlamaToolCallParser(ToolCallParser):
+    """
+    Parser for Llama 3.x and 4 JSON-format tool calls.
+
+    Finds JSON objects containing "name" + ("arguments" or "parameters") keys.
+    Uses Python's json.JSONDecoder.raw_decode for robust extraction of
+    JSON objects from mixed text.
+    """
+
+    BOT_TOKEN = "<|python_tag|>"
+
+    # Regex to find the start of potential JSON objects
+    JSON_START = re.compile(r"\{")
+
+    def parse(self, text: str) -> ParseResult:
+        # Quick check: need either the bot token or a JSON brace
+        if self.BOT_TOKEN not in text and "{" not in text:
+            return text, None
+
+        try:
+            decoder = json.JSONDecoder()
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            end_index = -1  # Track where the last parsed JSON ended
+
+            for match in self.JSON_START.finditer(text):
+                start = match.start()
+                # Skip braces inside a previously parsed object (end_index is exclusive)
+                if start < end_index:
+                    continue
+
+                try:
+                    obj, json_end = decoder.raw_decode(text[start:])
+                    end_index = start + json_end
+
+                    # Must have "name" and either "arguments" or "parameters"
+                    name = obj.get("name")
+                    args = obj.get("arguments", obj.get("parameters"))
+
+                    if not name or args is None:
+                        continue
+
+                    # Normalize arguments to a JSON string. Strings pass
+                    # through unchanged (avoids double-encoding); dicts,
+                    # lists, and scalars are serialized.
+                    if not isinstance(args, str):
+                        args = json.dumps(args, ensure_ascii=False)
+
+                    tool_calls.append(
+                        ChatCompletionMessageToolCall(
+                            id=f"call_{uuid.uuid4().hex[:8]}",
+                            type="function",
+                            function=Function(name=name, arguments=args),
+                        )
+                    )
+                except (json.JSONDecodeError, KeyError, ValueError):
+                    continue
+
+            if not tool_calls:
+                return text, None
+
+            # Content is everything before the first tool call JSON
+            # Find where the first tool call starts in the text
+            first_tc_start = text.find("{")
+            if self.BOT_TOKEN in text:
+                first_tc_start = text.find(self.BOT_TOKEN)
+            content = text[:first_tc_start].strip() if first_tc_start > 0 else None
+
+            return content, tool_calls
+
+        except Exception:
+            return text, None
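Review note: the Llama parser relies on `json.JSONDecoder.raw_decode` to pull JSON objects out of mixed text. Below is a minimal sketch of that scanning technique on its own, assuming a hypothetical function name `find_json_objects` and a made-up sample string. It passes the start index to `raw_decode` instead of slicing, so the returned end offset is absolute.

```python
import json
from typing import Any, Dict, List


def find_json_objects(text: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: collect every top-level JSON object in text."""
    decoder = json.JSONDecoder()
    found: List[Dict[str, Any]] = []
    pos = 0
    while True:
        start = text.find("{", pos)
        if start < 0:
            break
        try:
            # end is the index just past the decoded object (exclusive)
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON at this brace: move on to the next one
            pos = start + 1
            continue
        found.append(obj)
        # Resume exactly at end so an adjacent object is not skipped
        pos = end
    return found


sample = 'Sure. {"name": "a", "arguments": {"x": 1}}{"name": "b", "parameters": {}} {oops'
calls = find_json_objects(sample)
names = [c["name"] for c in calls]
print(names)
```

Because the end offset is exclusive, a second object that starts immediately after the first one is still picked up, and nested braces inside an already decoded object are never revisited.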
--- a/environments/tool_call_parsers/longcat_parser.py
+++ b/environments/tool_call_parsers/longcat_parser.py
@@ -0,0 +1,69 @@
+"""
+Longcat Flash Chat tool call parser.
+
+Same as Hermes but uses <longcat_tool_call> tags instead of <tool_call>.
+Based on vLLM's LongcatFlashToolParser (extends Hermes2ProToolParser).
+"""
+
+import json
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("longcat")
+class LongcatToolCallParser(ToolCallParser):
+    """
+    Parser for Longcat Flash Chat tool calls.
+    Identical logic to Hermes, just different tag names.
+    """
+
+    PATTERN = re.compile(
+        r"<longcat_tool_call>\s*(.*?)\s*</longcat_tool_call>|<longcat_tool_call>\s*(.*)",
+        re.DOTALL,
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        if "<longcat_tool_call>" not in text:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                raw_json = match[0] if match[0] else match[1]
+                if not raw_json.strip():
+                    continue
+
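+                tc_data = json.loads(raw_json)
+                # Arguments may be a dict or an already-serialized string;
+                # pass strings through as-is to avoid double-encoding.
+                raw_args = tc_data.get("arguments", {})
+                if not isinstance(raw_args, str):
+                    raw_args = json.dumps(raw_args, ensure_ascii=False)
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(name=tc_data["name"], arguments=raw_args),
+                    )
+                )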
+
+            if not tool_calls:
+                return text, None
+
+            content = text[: text.find("<longcat_tool_call>")].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
--- a/environments/tool_call_parsers/mistral_parser.py
+++ b/environments/tool_call_parsers/mistral_parser.py
@@ -0,0 +1,130 @@
+"""
+Mistral tool call parser.
+
+Supports two formats depending on tokenizer version:
+- Pre-v11: content[TOOL_CALLS] [{"name": ..., "arguments": {...}}, ...]
+- v11+:    content[TOOL_CALLS]tool_name1{"arg": "val"}[TOOL_CALLS]tool_name2{"arg": "val"}
+
+Based on vLLM's MistralToolParser.extract_tool_calls()
+The [TOOL_CALLS] token is the bot_token used by Mistral models.
+"""
+
+import json
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+def _generate_mistral_id() -> str:
+    """Mistral tool call IDs are 9-char alphanumeric strings."""
+    import random
+    import string
+
+    return "".join(random.choices(string.ascii_letters + string.digits, k=9))
+
+
+@register_parser("mistral")
+class MistralToolCallParser(ToolCallParser):
+    """
+    Parser for Mistral-format tool calls.
+
+    Detects format by checking if the content after [TOOL_CALLS] starts with '[' or '{'
+    (pre-v11 JSON array or single object) or with a tool name (v11+ format).
+    """
+
+    # The [TOOL_CALLS] token -- may appear as different strings depending on tokenizer
+    BOT_TOKEN = "[TOOL_CALLS]"
+
+    # Fallback regex for pre-v11 format when JSON parsing fails
+    TOOL_CALL_REGEX = re.compile(r"\[?\s*(\{.*?\})\s*\]?", re.DOTALL)
+
+    def parse(self, text: str) -> ParseResult:
+        if self.BOT_TOKEN not in text:
+            return text, None
+
+        try:
+            parts = text.split(self.BOT_TOKEN)
+            content = parts[0].strip()
+            raw_tool_calls = parts[1:]
+
+            # Detect format: if the first raw part starts with '[' or '{', it's pre-v11
+            first_raw = raw_tool_calls[0].strip() if raw_tool_calls else ""
+            is_pre_v11 = first_raw.startswith("[") or first_raw.startswith("{")
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+
+            if not is_pre_v11:
+                # v11+ format: [TOOL_CALLS]tool_name{args}[TOOL_CALLS]tool_name2{args2}
+                for raw in raw_tool_calls:
+                    raw = raw.strip()
+                    if not raw or "{" not in raw:
+                        continue
+
+                    brace_idx = raw.find("{")
+                    tool_name = raw[:brace_idx].strip()
+                    args_str = raw[brace_idx:]
+
+                    tool_calls.append(
+                        ChatCompletionMessageToolCall(
+                            id=_generate_mistral_id(),
+                            type="function",
+                            function=Function(name=tool_name, arguments=args_str),
+                        )
+                    )
+            else:
+                # Pre-v11 format: [TOOL_CALLS] [{"name": ..., "arguments": {...}}]
+                try:
+                    parsed = json.loads(first_raw)
+                    if isinstance(parsed, dict):
+                        parsed = [parsed]
+
+                    for tc in parsed:
+                        args = tc.get("arguments", {})
+                        if isinstance(args, dict):
+                            args = json.dumps(args, ensure_ascii=False)
+
+                        tool_calls.append(
+                            ChatCompletionMessageToolCall(
+                                id=_generate_mistral_id(),
+                                type="function",
+                                function=Function(
+                                    name=tc["name"], arguments=args
+                                ),
+                            )
+                        )
+                except json.JSONDecodeError:
+                    # Fallback regex extraction
+                    match = self.TOOL_CALL_REGEX.findall(first_raw)
+                    if match:
+                        for raw_json in match:
+                            try:
+                                tc = json.loads(raw_json)
+                                args = tc.get("arguments", {})
+                                if isinstance(args, dict):
+                                    args = json.dumps(args, ensure_ascii=False)
+                                tool_calls.append(
+                                    ChatCompletionMessageToolCall(
+                                        id=_generate_mistral_id(),
+                                        type="function",
+                                        function=Function(
+                                            name=tc["name"], arguments=args
+                                        ),
+                                    )
+                                )
+                            except (json.JSONDecodeError, KeyError):
+                                continue
+
+            if not tool_calls:
+                return text, None
+
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
--- a/environments/tool_call_parsers/qwen3_coder_parser.py
+++ b/environments/tool_call_parsers/qwen3_coder_parser.py
@@ -0,0 +1,163 @@
+"""
+Qwen3-Coder tool call parser.
+
+Format uses XML-style nested tags:
+    <tool_call>
+    <function=function_name>
+    <parameter=param_name>value</parameter>
+    <parameter=param_name2>value2</parameter>
+    </function>
+    </tool_call>
+
+Parameters are extracted from <parameter=name>value</parameter> tags and
+type-converted using the schema if available, otherwise treated as strings.
+
+Based on vLLM's Qwen3CoderToolParser.extract_tool_calls()
+"""
+
+import ast
+import json
+import re
+import uuid
+from typing import Any, Dict, List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+def _try_convert_value(value: str) -> Any:
+    """
+    Try to convert a parameter value string to a native Python type.
+    Handles null, numbers, booleans, JSON objects/arrays, and falls back to string.
+    """
+    stripped = value.strip()
+
+    # Handle null
+    if stripped.lower() == "null":
+        return None
+
+    # Try JSON first (handles objects, arrays, strings, numbers, booleans)
+    try:
+        return json.loads(stripped)
+    except (json.JSONDecodeError, TypeError):
+        pass
+
+    # Try Python literal eval (handles tuples, etc.)
+    try:
+        return ast.literal_eval(stripped)
+    except (ValueError, SyntaxError, TypeError):
+        pass
+
+    # Return as string
+    return stripped
+
+
+@register_parser("qwen3_coder")
+class Qwen3CoderToolCallParser(ToolCallParser):
+    """
+    Parser for Qwen3-Coder XML-format tool calls.
+
+    Uses nested XML tags: <tool_call><function=name><parameter=key>val</parameter></function></tool_call>
+    """
+
+    START_TOKEN = "<tool_call>"
+    FUNCTION_PREFIX = "<function="
+
+    # Find complete tool_call blocks (or unclosed at end)
+    TOOL_CALL_REGEX = re.compile(
+        r"<tool_call>(.*?)</tool_call>|<tool_call>(.*?)$", re.DOTALL
+    )
+
+    # Find function blocks within a tool_call
+    FUNCTION_REGEX = re.compile(
+        r"<function=(.*?)</function>|<function=(.*)$", re.DOTALL
+    )
+
+    # Find parameter blocks within a function
+    PARAMETER_REGEX = re.compile(
+        r"<parameter=(.*?)(?:</parameter>|(?=<parameter=)|(?=</function>)|$)",
+        re.DOTALL,
+    )
+
+    def _parse_function_call(self, function_str: str) -> Optional[ChatCompletionMessageToolCall]:
+        """Parse a single <function=name>...</function> block into a ToolCall."""
+        try:
+            # Extract function name: everything before the first '>'
+            gt_idx = function_str.index(">")
+            func_name = function_str[:gt_idx].strip()
+            params_str = function_str[gt_idx + 1:]
+
+            # Extract parameters
+            param_dict: Dict[str, Any] = {}
+            for match_text in self.PARAMETER_REGEX.findall(params_str):
+                if ">" not in match_text:
+                    continue
+                name_end = match_text.index(">")
+                param_name = match_text[:name_end].strip()
+                param_value = match_text[name_end + 1:]
+
+                # Clean up whitespace
+                if param_value.startswith("\n"):
+                    param_value = param_value[1:]
+                if param_value.endswith("\n"):
+                    param_value = param_value[:-1]
+
+                param_dict[param_name] = _try_convert_value(param_value)
+
+            return ChatCompletionMessageToolCall(
+                id=f"call_{uuid.uuid4().hex[:24]}",
+                type="function",
+                function=Function(
+                    name=func_name,
+                    arguments=json.dumps(param_dict, ensure_ascii=False),
+                ),
+            )
+        except (ValueError, IndexError):
+            return None
+
+    def parse(self, text: str) -> ParseResult:
+        if self.FUNCTION_PREFIX not in text:
+            return text, None
+
+        try:
+            # Find all tool_call blocks
+            tc_matches = self.TOOL_CALL_REGEX.findall(text)
+            raw_blocks = [m[0] if m[0] else m[1] for m in tc_matches]
+
+            # Fallback: if no tool_call tags, try the whole text
+            if not raw_blocks:
+                raw_blocks = [text]
+
+            # Find function blocks within each tool_call
+            function_strs: List[str] = []
+            for block in raw_blocks:
+                func_matches = self.FUNCTION_REGEX.findall(block)
+                function_strs.extend(m[0] if m[0] else m[1] for m in func_matches)
+
+            if not function_strs:
+                return text, None
+
+            # Parse each function call
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for func_str in function_strs:
+                tc = self._parse_function_call(func_str)
+                if tc is not None:
+                    tool_calls.append(tc)
+
+            if not tool_calls:
+                return text, None
+
+            # Content before tool calls
+            first_tc = text.find(self.START_TOKEN)
+            if first_tc < 0:
+                first_tc = text.find(self.FUNCTION_PREFIX)
+            content = text[:first_tc].strip() if first_tc > 0 else None
+
+            return content, tool_calls
+
+        except Exception:
+            return text, None
--- a/environments/tool_call_parsers/qwen_parser.py
+++ b/environments/tool_call_parsers/qwen_parser.py
@@ -0,0 +1,19 @@
+"""
+Qwen 2.5 tool call parser.
+
+Uses the same <tool_call> format as Hermes.
+Registered as a separate parser name for clarity when using --tool-parser=qwen.
+"""
+
+from environments.tool_call_parsers import register_parser
+from environments.tool_call_parsers.hermes_parser import HermesToolCallParser
+
+
+@register_parser("qwen")
+class QwenToolCallParser(HermesToolCallParser):
+    """
+    Parser for Qwen 2.5 tool calls.
+    Same <tool_call>{"name": ..., "arguments": ...}</tool_call> format as Hermes.
+    """
+
+    pass  # Identical format -- inherits everything from Hermes
--- a/environments/tool_context.py
+++ b/environments/tool_context.py
@@ -0,0 +1,463 @@
+"""
+ToolContext -- Unrestricted Tool Access for Reward Functions
+
+A per-rollout handle that gives reward/verification functions direct access to
+ALL hermes-agent tools, scoped to the rollout's task_id. The same task_id means
+the terminal/browser session is the SAME one the model used during its rollout --
+all state (files, processes, browser tabs) is preserved.
+
+The verifier author decides which tools to use. Nothing is hardcoded or gated.
+
+Example usage in a compute_reward():
+    async def compute_reward(self, item, result, ctx):
+        # Run tests in the model's terminal sandbox
+        test = ctx.terminal("pytest -v")
+        if test["exit_code"] == 0:
+            return 1.0
+
+        # Check if a file was created
+        content = ctx.read_file("/workspace/solution.py")
+        if content.get("content"):
+            return 0.5
+
+        return 0.0
+"""
+
+import json
+import logging
+import os
+from typing import Any, Dict, List, Optional
+
+import asyncio
+import concurrent.futures
+
+from model_tools import handle_function_call
+from tools.terminal_tool import cleanup_vm
+from tools.browser_tool import cleanup_browser
+
+logger = logging.getLogger(__name__)
+
+# Thread pool for running sync tool calls that internally use asyncio.run()
+_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
+
+
+def _run_tool_in_thread(tool_name: str, arguments: Dict[str, Any], task_id: str) -> str:
+    """
+    Run a tool call in a thread pool executor so backends that use asyncio.run()
+    internally (modal, docker) get a clean event loop.
+
+    If we're already in an async context, runs the call on the shared _tool_executor and blocks.
+    If not (e.g., called from sync code), runs directly.
+    """
+    try:
+        asyncio.get_running_loop()
+    except RuntimeError:
+        # No running event loop -- safe to call directly
+        return handle_function_call(tool_name, arguments, task_id)
+
+    # In an async context: run in the shared pool. The tool call sits outside
+    # the try block so a RuntimeError raised by the tool is not retried.
+    future = _tool_executor.submit(
+        handle_function_call, tool_name, arguments, task_id
+    )
+    return future.result(timeout=300)
+
+
+class ToolContext:
+    """
+    Open-ended access to all hermes-agent tools for a specific rollout.
+
+    Passed to compute_reward() so verifiers can use any tool they need:
+    terminal commands, file reads/writes, web searches, browser automation, etc.
+    All calls share the rollout's task_id for session isolation.
+    """
+
+    def __init__(self, task_id: str):
+        self.task_id = task_id
+
+    # -------------------------------------------------------------------------
+    # Terminal tools
+    # -------------------------------------------------------------------------
+
+    def terminal(self, command: str, timeout: int = 180) -> Dict[str, Any]:
+        """
+        Run a command in the rollout's terminal session.
+
+        Args:
+            command: Shell command to execute
+            timeout: Command timeout in seconds
+
+        Returns:
+            Dict with 'exit_code' (int) and 'output' (str)
+        """
+        # TERMINAL_ENV selects the backend; os is already imported at module level
+        backend = os.getenv("TERMINAL_ENV", "local")
+        logger.debug("ToolContext.terminal [%s backend] task=%s: %s", backend, self.task_id[:8], command[:100])
+
+        # Run in thread pool so modal/docker backends' asyncio.run() doesn't deadlock
+        result = _run_tool_in_thread(
+            "terminal",
+            {"command": command, "timeout": timeout},
+            self.task_id,
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"exit_code": -1, "output": result}
+
+    # -------------------------------------------------------------------------
+    # File tools
+    # -------------------------------------------------------------------------
+
+    def read_file(self, path: str) -> Dict[str, Any]:
+        """
+        Read a file from the rollout's filesystem.
+
+        Args:
+            path: File path to read
+
+        Returns:
+            Dict with file content or error
+        """
+        result = handle_function_call(
+            "read_file", {"path": path}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    def write_file(self, path: str, content: str) -> Dict[str, Any]:
+        """
+        Write a TEXT file in the rollout's filesystem.
+
+        Uses a shell heredoc under the hood, so this is only safe for text content.
+        For binary files (images, compiled artifacts, etc.), use upload_file() instead.
+
+        Args:
+            path: File path to write
+            content: Text content to write
+
+        Returns:
+            Dict with success status or error
+        """
+        result = handle_function_call(
+            "write_file", {"path": path, "content": content}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    def upload_file(self, local_path: str, remote_path: str) -> Dict[str, Any]:
+        """
+        Upload a local file to the rollout's sandbox (binary-safe).
+
+        Unlike write_file() which passes content through a shell heredoc (text-only),
+        this method base64-encodes the file and decodes it inside the sandbox.
+        Safe for any file type: binaries, images, archives, etc.
+
+        For larger files (base64 text over 60,000 characters, about 45 KB raw),
+        the content is written in chunks to avoid shell command-length limits.
+
+        Args:
+            local_path: Path to a local file on the host
+            remote_path: Destination path inside the sandbox
+
+        Returns:
+            Dict with 'exit_code' and 'output'
+        """
+        import base64
+        from pathlib import Path as _Path
+
+        local = _Path(local_path)
+        if not local.exists():
+            return {"exit_code": -1, "output": f"Local file not found: {local_path}"}
+
+        raw = local.read_bytes()
+        b64 = base64.b64encode(raw).decode("ascii")
+
+        # Ensure parent directory exists in the sandbox
+        parent = str(_Path(remote_path).parent)
+        if parent not in (".", "/"):
+            self.terminal(f"mkdir -p {parent}", timeout=10)
+
+        # For small files, single command is fine
+        chunk_size = 60_000  # ~60KB per chunk (well within shell limits)
+        if len(b64) <= chunk_size:
+            result = self.terminal(
+                f"printf '%s' '{b64}' | base64 -d > {remote_path}",
+                timeout=30,
+            )
+        else:
+            # For larger files, write base64 in chunks then decode
+            tmp_b64 = "/tmp/_hermes_upload.b64"
+            self.terminal(f": > {tmp_b64}", timeout=5)  # truncate
+            for i in range(0, len(b64), chunk_size):
+                chunk = b64[i : i + chunk_size]
+                self.terminal(f"printf '%s' '{chunk}' >> {tmp_b64}", timeout=15)
+            result = self.terminal(
+                f"base64 -d {tmp_b64} > {remote_path} && rm -f {tmp_b64}",
+                timeout=30,
+            )
+
+        return result
+
+    def upload_dir(self, local_dir: str, remote_dir: str) -> List[Dict[str, Any]]:
+        """
+        Upload an entire local directory to the rollout's sandbox (binary-safe).
+
+        Recursively uploads all files, preserving directory structure.
+
+        Args:
+            local_dir: Path to a local directory on the host
+            remote_dir: Destination directory inside the sandbox
+
+        Returns:
+            List of results, one per file uploaded
+        """
+        from pathlib import Path as _Path
+
+        local = _Path(local_dir)
+        if not local.exists() or not local.is_dir():
+            return [{"exit_code": -1, "output": f"Local directory not found: {local_dir}"}]
+
+        results = []
+        for file_path in sorted(local.rglob("*")):
+            if file_path.is_file():
+                relative = file_path.relative_to(local)
+                target = f"{remote_dir}/{relative}"
+                results.append(self.upload_file(str(file_path), target))
+        return results
+
+    def download_file(self, remote_path: str, local_path: str) -> Dict[str, Any]:
+        """
+        Download a file from the rollout's sandbox to the host (binary-safe).
+
+        The inverse of upload_file(). Base64-encodes the file inside the sandbox,
+        reads the encoded data through the terminal, and decodes it locally.
+        Safe for any file type.
+
+        Args:
+            remote_path: Path to the file inside the sandbox
+            local_path: Destination path on the host
+
+        Returns:
+            Dict with 'success' (bool) and 'bytes' (int) or 'error' (str)
+        """
+        import base64
+        from pathlib import Path as _Path
+
+        # Base64-encode the file inside the sandbox and capture output
+        result = self.terminal(
+            f"base64 {remote_path} 2>/dev/null",
+            timeout=30,
+        )
+
+        if result.get("exit_code", -1) != 0:
+            return {
+                "success": False,
+                "error": f"Failed to read remote file: {result.get('output', '')}",
+            }
+
+        b64_data = result.get("output", "").strip()
+        if not b64_data:
+            return {"success": False, "error": f"Remote file is empty or missing: {remote_path}"}
+
+        try:
+            raw = base64.b64decode(b64_data)
+        except Exception as e:
+            return {"success": False, "error": f"Base64 decode failed: {e}"}
+
+        # Write to local host filesystem
+        local = _Path(local_path)
+        local.parent.mkdir(parents=True, exist_ok=True)
+        local.write_bytes(raw)
+
+        return {"success": True, "bytes": len(raw)}
+
+    def download_dir(self, remote_dir: str, local_dir: str) -> List[Dict[str, Any]]:
+        """
+        Download a directory from the rollout's sandbox to the host (binary-safe).
+
+        Lists all files in the remote directory, then downloads each one.
+        Preserves directory structure.
+
+        Args:
+            remote_dir: Path to the directory inside the sandbox
+            local_dir: Destination directory on the host
+
+        Returns:
+            List of results, one per file downloaded
+        """
+        from pathlib import Path as _Path
+
+        # List files in the remote directory
+        ls_result = self.terminal(
+            f"find {remote_dir} -type f 2>/dev/null",
+            timeout=15,
+        )
+
+        if ls_result.get("exit_code", -1) != 0:
+            return [{"success": False, "error": f"Failed to list remote dir: {remote_dir}"}]
+
+        file_list = ls_result.get("output", "").strip()
+        if not file_list:
+            return [{"success": False, "error": f"Remote directory is empty or missing: {remote_dir}"}]
+
+        results = []
+        for remote_file in file_list.splitlines():
+            remote_file = remote_file.strip()
+            if not remote_file:
+                continue
+            # Compute the relative path to preserve directory structure
+            if remote_file.startswith(remote_dir):
+                relative = remote_file[len(remote_dir):].lstrip("/")
+            else:
+                relative = _Path(remote_file).name
+            local_file = str(_Path(local_dir) / relative)
+            results.append(self.download_file(remote_file, local_file))
+
+        return results
+
+    def search(self, query: str, path: str = ".") -> Dict[str, Any]:
+        """
+        Search for text in the rollout's filesystem.
+
+        Args:
+            query: Search query
+            path: Directory to search in
+
+        Returns:
+            Dict with search results
+        """
+        result = handle_function_call(
+            "search", {"query": query, "path": path}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    # -------------------------------------------------------------------------
+    # Web tools
+    # -------------------------------------------------------------------------
+
+    def web_search(self, query: str) -> Dict[str, Any]:
+        """
+        Search the web.
+
+        Args:
+            query: Search query
+
+        Returns:
+            Dict with search results
+        """
+        result = handle_function_call("web_search", {"query": query})
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    def web_extract(self, urls: List[str]) -> Dict[str, Any]:
+        """
+        Extract content from URLs.
+
+        Args:
+            urls: List of URLs to extract content from
+
+        Returns:
+            Dict with extracted content
+        """
+        result = handle_function_call("web_extract", {"urls": urls})
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    # -------------------------------------------------------------------------
+    # Browser tools
+    # -------------------------------------------------------------------------
+
+    def browser_navigate(self, url: str) -> Dict[str, Any]:
+        """
+        Navigate the rollout's browser session to a URL.
+
+        Args:
+            url: URL to navigate to
+
+        Returns:
+            Dict with page snapshot or error
+        """
+        result = handle_function_call(
+            "browser_navigate", {"url": url}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    def browser_snapshot(self) -> Dict[str, Any]:
+        """
+        Take a snapshot of the current browser page.
+
+        Returns:
+            Dict with page content/accessibility snapshot
+        """
+        result = handle_function_call(
+            "browser_snapshot", {}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    # -------------------------------------------------------------------------
+    # Generic tool access
+    # -------------------------------------------------------------------------
+
+    def call_tool(self, tool_name: str, arguments: Dict[str, Any]) -> str:
+        """
+        Call any hermes-agent tool by name.
+
+        This is the generic escape hatch -- if a tool doesn't have a convenience
+        wrapper above, you can call it directly here.
+
+        Args:
+            tool_name: Name of the tool (e.g., "vision_analyze", "skills_list")
+            arguments: Dict of arguments for the tool
+
+        Returns:
+            Raw JSON string result from the tool
+        """
+        return _run_tool_in_thread(tool_name, arguments, self.task_id)
+
+    # -------------------------------------------------------------------------
+    # Cleanup
+    # -------------------------------------------------------------------------
+
+    def cleanup(self):
+        """
+        Release all resources (terminal VMs, browser sessions) for this rollout.
+
+        Called automatically by the base environment via try/finally after
+        compute_reward() completes. You generally don't need to call this yourself.
+        """
+        try:
+            cleanup_vm(self.task_id)
+        except Exception as e:
+            logger.debug("VM cleanup for task %s: %s", self.task_id, e)
+
+        # Suppress browser_tool's noisy debug prints during cleanup.
+        # The cleanup still runs (safe), it just doesn't spam the console.
+        _prev_quiet = os.environ.get("HERMES_QUIET")
+        os.environ["HERMES_QUIET"] = "1"
+        try:
+            cleanup_browser(self.task_id)
+        except Exception as e:
+            logger.debug("Browser cleanup for task %s: %s", self.task_id, e)
+        finally:
+            if _prev_quiet is None:
+                os.environ.pop("HERMES_QUIET", None)
+            else:
+                os.environ["HERMES_QUIET"] = _prev_quiet
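Review note on `upload_file()`: the chunked path relies on the fact that base64 text can be cut at arbitrary offsets as long as the pieces are concatenated again before decoding. Below is a minimal host-side sketch of that round trip. It is an illustration only, not the PR's code: `split_b64` and `join_b64` are hypothetical helper names, and the in-memory join stands in for the repeated `>>` appends into the temporary file.

```python
import base64

CHUNK_SIZE = 60_000  # same chunk size upload_file() uses for the base64 text


def split_b64(raw: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Base64-encode raw bytes and cut the resulting text into fixed-size chunks."""
    b64 = base64.b64encode(raw).decode("ascii")
    return [b64[i : i + chunk_size] for i in range(0, len(b64), chunk_size)]


def join_b64(chunks: list[str]) -> bytes:
    """Concatenate the chunks (the role of the `>>` appends) and decode once."""
    return base64.b64decode("".join(chunks))


# 256,000 bytes of binary data covering every byte value
payload = bytes(range(256)) * 1000
chunks = split_b64(payload)
restored = join_b64(chunks)
```

Decoding each chunk separately would only be safe if the chunk size were a multiple of 4; decoding once after concatenation, as the sandbox-side `base64 -d` does, avoids that constraint.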
--- a/example-skill/SKILL.md
+++ b/example-skill/SKILL.md
@@ -1,70 +0,0 @@
----
-name: example-skill
-description: An example skill demonstrating the skill file format and structure
----
-
-# Example Skill
-
-This is an example skill file that demonstrates how to create skills for the Hermes Agent.
-
-## Skill File Format
-
-Skills are markdown files with YAML frontmatter at the top:
-
-```yaml
----
-name: your-skill-name
-description: A brief one-line description of what this skill does
----
-```
-
-The frontmatter fields:
-- **name**: The identifier used to reference this skill (lowercase, hyphens for spaces)
-- **description**: A brief description shown when listing skills (keep under 200 chars)
-
-## Writing Effective Skills
-
-### 1. Be Specific and Actionable
-
-Good skills provide clear, actionable instructions:
-
-```
-When reviewing code:
-1. Check for security vulnerabilities first
-2. Verify error handling is comprehensive
-3. Ensure tests cover edge cases
-```
-
-### 2. Include Examples
-
-Show concrete examples of what you want:
-
-```python
-# Good: Descriptive variable names
-user_authentication_token = get_token()
-
-# Bad: Cryptic abbreviations  
-uat = gt()
-```
-
-### 3. Define When to Use
-
-Help the agent understand when this skill applies:
-
-> Use this skill when: reviewing pull requests, auditing security, or checking code quality.
-
-## Skill Categories
-
-Consider organizing skills by purpose:
-
-- **Conventions**: Coding standards, API patterns, naming rules
-- **Workflows**: Step-by-step processes for deployments, reviews, releases
-- **Knowledge**: Domain-specific information, system architecture, gotchas
-- **Templates**: Boilerplate for common tasks, response formats
-
-## Tips
-
-1. Keep the description concise - it's shown in the skills list
-2. Use headers to organize longer skills
-3. Include code examples where helpful
-4. Reference other skills if they're related
--- a/gateway/platforms/base.py
+++ b/gateway/platforms/base.py
@@ -6,10 +6,11 @@ and implement the required methods.
 """

 import asyncio
+import re
 from abc import ABC, abstractmethod
 from dataclasses import dataclass, field
 from datetime import datetime
-from typing import Dict, List, Optional, Any, Callable, Awaitable
+from typing import Dict, List, Optional, Any, Callable, Awaitable, Tuple
 from enum import Enum

 import sys
@@ -177,6 +178,123 @@ class BasePlatformAdapter(ABC):
        """
        pass
    
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """
+        Send an image natively via the platform API.
+        
+        Override in subclasses to send images as proper attachments
+        instead of plain-text URLs. Default falls back to sending the
+        URL as a text message.
+        """
+        # Fallback: send URL as text (subclasses override for native images)
+        text = f"{caption}\n{image_url}" if caption else image_url
+        return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
+    
+    @staticmethod
+    def extract_images(content: str) -> Tuple[List[Tuple[str, str]], str]:
+        """
+        Extract image URLs from markdown and HTML image tags in a response.
+        
+        Finds patterns like:
+        - ![alt text](https://example.com/image.png)
+        - <img src="https://example.com/image.png">
+        - <img src="https://example.com/image.png"></img>
+        
+        Args:
+            content: The response text to scan.
+        
+        Returns:
+            Tuple of (list of (url, alt_text) pairs, cleaned content with image tags removed).
+        """
+        images = []
+        cleaned = content
+        
+        # Match markdown images: ![alt](url)
+        md_pattern = r'!\[([^\]]*)\]\((https?://[^\s\)]+)\)'
+        for match in re.finditer(md_pattern, content):
+            alt_text = match.group(1)
+            url = match.group(2)
+            # Only extract URLs that look like actual images
+            if any(url.lower().endswith(ext) or ext in url.lower() for ext in
+                   ['.png', '.jpg', '.jpeg', '.gif', '.webp', 'fal.media', 'fal-cdn', 'replicate.delivery']):
+                images.append((url, alt_text))
+        
+        # Match HTML img tags: <img src="url"> or <img src="url"></img> or <img src="url"/>
+        html_pattern = r'<img\s+src=["\']?(https?://[^\s"\'<>]+)["\']?\s*/?>\s*(?:</img>)?'
+        for match in re.finditer(html_pattern, content):
+            url = match.group(1)
+            images.append((url, ""))
+        
+        # Remove matched image tags from content if we found images
+        if images:
+            cleaned = re.sub(md_pattern, '', cleaned)
+            cleaned = re.sub(html_pattern, '', cleaned)
+            # Clean up leftover blank lines
+            cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()
+        
+        return images, cleaned
+    
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """
+        Send an audio file as a native voice message via the platform API.
+        
+        Override in subclasses to send audio as voice bubbles (Telegram)
+        or file attachments (Discord). Default falls back to sending the
+        file path as text.
+        """
+        text = f"🔊 Audio: {audio_path}"
+        if caption:
+            text = f"{caption}\n{text}"
+        return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
+    
+    @staticmethod
+    def extract_media(content: str) -> Tuple[List[Tuple[str, bool]], str]:
+        """
+        Extract MEDIA:<path> tags and [[audio_as_voice]] directives from response text.
+        
+        The TTS tool returns responses like:
+            [[audio_as_voice]]
+            MEDIA:/path/to/audio.ogg
+        
+        Args:
+            content: The response text to scan.
+        
+        Returns:
+            Tuple of (list of (path, is_voice) pairs, cleaned content with tags removed).
+        """
+        media = []
+        cleaned = content
+        
+        # Check for [[audio_as_voice]] directive
+        has_voice_tag = "[[audio_as_voice]]" in content
+        cleaned = cleaned.replace("[[audio_as_voice]]", "")
+        
+        # Extract MEDIA:<path> tags (the path ends at the first whitespace character)
+        media_pattern = r'MEDIA:(\S+)'
+        for match in re.finditer(media_pattern, content):
+            path = match.group(1).strip()
+            if path:
+                media.append((path, has_voice_tag))
+        
+        # Remove MEDIA tags from content
+        if media:
+            cleaned = re.sub(media_pattern, '', cleaned)
+            cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()
+        
+        return media, cleaned
+    
    async def _keep_typing(self, chat_id: str, interval: float = 2.0) -> None:
        """
        Continuously send typing indicator until cancelled.
@@ -231,23 +349,56 @@ class BasePlatformAdapter(ABC):
            
            # Send response if any
            if response:
-                result = await self.send(
-                    chat_id=event.source.chat_id,
-                    content=response,
-                    reply_to=event.message_id
-                )
+                # Extract MEDIA:<path> tags (from TTS tool) before other processing
+                media_files, response = self.extract_media(response)
                
-                # Log send failures (don't raise - user already saw tool progress)
-                if not result.success:
-                    print(f"[{self.name}] Failed to send response: {result.error}")
-                    # Try sending without markdown as fallback
-                    fallback_result = await self.send(
+                # Extract image URLs and send them as native platform attachments
+                images, text_content = self.extract_images(response)
+                
+                # Send the text portion first (if any remains after extractions)
+                if text_content:
+                    result = await self.send(
                        chat_id=event.source.chat_id,
-                        content=f"(Response formatting failed, plain text:)\n\n{response[:3500]}",
+                        content=text_content,
                        reply_to=event.message_id
                    )
-                    if not fallback_result.success:
-                        print(f"[{self.name}] Fallback send also failed: {fallback_result.error}")
+                    
+                    # Log send failures (don't raise - user already saw tool progress)
+                    if not result.success:
+                        print(f"[{self.name}] Failed to send response: {result.error}")
+                        # Try sending without markdown as fallback
+                        fallback_result = await self.send(
+                            chat_id=event.source.chat_id,
+                            content=f"(Response formatting failed, plain text:)\n\n{text_content[:3500]}",
+                            reply_to=event.message_id
+                        )
+                        if not fallback_result.success:
+                            print(f"[{self.name}] Fallback send also failed: {fallback_result.error}")
+                
+                # Send extracted images as native attachments
+                for image_url, alt_text in images:
+                    try:
+                        img_result = await self.send_image(
+                            chat_id=event.source.chat_id,
+                            image_url=image_url,
+                            caption=alt_text if alt_text else None,
+                        )
+                        if not img_result.success:
+                            print(f"[{self.name}] Failed to send image: {img_result.error}")
+                    except Exception as img_err:
+                        print(f"[{self.name}] Error sending image: {img_err}")
+                
+                # Send extracted audio/voice files as native attachments
+                for audio_path, is_voice in media_files:
+                    try:
+                        voice_result = await self.send_voice(
+                            chat_id=event.source.chat_id,
+                            audio_path=audio_path,
+                        )
+                        if not voice_result.success:
+                            print(f"[{self.name}] Failed to send voice: {voice_result.error}")
+                    except Exception as voice_err:
+                        print(f"[{self.name}] Error sending voice: {voice_err}")
            
            # Check if there's a pending message that was queued during our processing
            if session_key in self._pending_messages:
@@ -286,7 +437,7 @@ class BasePlatformAdapter(ABC):
    
    def get_pending_message(self, session_key: str) -> Optional[MessageEvent]:
        """Get and clear any pending message for a session."""
-        return self._pending_messages.get(session_key)
+        return self._pending_messages.pop(session_key, None)
    
    def build_source(
        self,
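Review note on `extract_images()` and `extract_media()`: the sketch below shows the expected input and output shapes on small sample strings. It reuses the two regular expressions from the diff but is otherwise a simplified stand-in: `split_markdown_images` and `find_media` are hypothetical names, the image-extension filter and the HTML `<img>` branch are left out, and the sample URL and path are made up.

```python
import re
from typing import List, Tuple

# Patterns copied from extract_images() / extract_media() in this diff.
MD_IMAGE = re.compile(r'!\[([^\]]*)\]\((https?://[^\s\)]+)\)')
MEDIA_TAG = re.compile(r'MEDIA:(\S+)')


def split_markdown_images(text: str) -> Tuple[List[Tuple[str, str]], str]:
    """Return ((url, alt_text) pairs, text with the markdown image tags removed)."""
    images = [(m.group(2), m.group(1)) for m in MD_IMAGE.finditer(text)]
    cleaned = MD_IMAGE.sub("", text)
    # Collapse the blank lines left behind by the removed tags.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned).strip()
    return images, cleaned


def find_media(text: str) -> List[Tuple[str, bool]]:
    """Return (path, is_voice) pairs; is_voice reflects the [[audio_as_voice]] directive."""
    is_voice = "[[audio_as_voice]]" in text
    return [(path, is_voice) for path in MEDIA_TAG.findall(text)]


reply = "Here is the chart:\n\n![sales chart](https://example.com/chart.png)\n\nLet me know."
images, cleaned = split_markdown_images(reply)
media = find_media("[[audio_as_voice]]\nMEDIA:/tmp/reply.ogg")
```

Because the media pattern is `\S+`, a path containing a space would be cut at the space; quoting or escaping such paths would need a different pattern.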
--- a/gateway/platforms/discord.py
+++ b/gateway/platforms/discord.py
@@ -8,6 +8,7 @@ Uses discord.py library for:
 """

 import asyncio
+import os
 from typing import Dict, List, Optional, Any

 try:
@@ -173,6 +174,99 @@ class DiscordAdapter(BasePlatformAdapter):
        except Exception as e:
            return SendResult(success=False, error=str(e))
    
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send audio as a Discord file attachment."""
+        if not self._client:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            import io
+            
+            channel = self._client.get_channel(int(chat_id))
+            if not channel:
+                channel = await self._client.fetch_channel(int(chat_id))
+            if not channel:
+                return SendResult(success=False, error=f"Channel {chat_id} not found")
+            
+            if not os.path.exists(audio_path):
+                return SendResult(success=False, error=f"Audio file not found: {audio_path}")
+            
+            # Determine filename from path
+            filename = os.path.basename(audio_path)
+            
+            with open(audio_path, "rb") as f:
+                file = discord.File(io.BytesIO(f.read()), filename=filename)
+                msg = await channel.send(
+                    content=caption if caption else None,
+                    file=file,
+                )
+                return SendResult(success=True, message_id=str(msg.id))
+        
+        except Exception as e:
+            print(f"[{self.name}] Failed to send audio: {e}")
+            return await super().send_voice(chat_id, audio_path, caption, reply_to)
+    
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send an image natively as a Discord file attachment."""
+        if not self._client:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            import aiohttp
+            
+            channel = self._client.get_channel(int(chat_id))
+            if not channel:
+                channel = await self._client.fetch_channel(int(chat_id))
+            if not channel:
+                return SendResult(success=False, error=f"Channel {chat_id} not found")
+            
+            # Download the image and send as a Discord file attachment
+            # (Discord renders attachments inline, unlike plain URLs)
+            async with aiohttp.ClientSession() as session:
+                async with session.get(image_url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
+                    if resp.status != 200:
+                        raise Exception(f"Failed to download image: HTTP {resp.status}")
+                    
+                    image_data = await resp.read()
+                    
+                    # Determine the file extension from the response content type
+                    content_type = resp.headers.get("content-type", "image/png")
+                    ext = "png"
+                    if "jpeg" in content_type or "jpg" in content_type:
+                        ext = "jpg"
+                    elif "gif" in content_type:
+                        ext = "gif"
+                    elif "webp" in content_type:
+                        ext = "webp"
+                    
+                    import io
+                    file = discord.File(io.BytesIO(image_data), filename=f"image.{ext}")
+                    
+                    msg = await channel.send(
+                        content=caption if caption else None,
+                        file=file,
+                    )
+                    return SendResult(success=True, message_id=str(msg.id))
+        
+        except ImportError:
+            print(f"[{self.name}] aiohttp not installed, falling back to URL. Run: pip install aiohttp")
+            return await super().send_image(chat_id, image_url, caption, reply_to)
+        except Exception as e:
+            print(f"[{self.name}] Failed to send image attachment, falling back to URL: {e}")
+            return await super().send_image(chat_id, image_url, caption, reply_to)
+    
    async def send_typing(self, chat_id: str) -> None:
        """Send typing indicator."""
        if self._client:
@@ -232,6 +326,36 @@ class DiscordAdapter(BasePlatformAdapter):
    
    async def _handle_message(self, message: DiscordMessage) -> None:
        """Handle incoming Discord messages."""
+        # In server channels (not DMs), require the bot to be @mentioned
+        # UNLESS the channel is in the free-response list.
+        #
+        # Config:
+        #   DISCORD_FREE_RESPONSE_CHANNELS: Comma-separated channel IDs where the
+        #       bot responds to every message without needing a mention.
+        #   DISCORD_REQUIRE_MENTION: Set to "false" to disable mention requirement
+        #       globally (all channels become free-response). Default: "true".
+        
+        if not isinstance(message.channel, discord.DMChannel):
+            # Check if this channel is in the free-response list
+            free_channels_raw = os.getenv("DISCORD_FREE_RESPONSE_CHANNELS", "")
+            free_channels = {ch.strip() for ch in free_channels_raw.split(",") if ch.strip()}
+            channel_id = str(message.channel.id)
+            
+            # Global override: if DISCORD_REQUIRE_MENTION=false, all channels are free
+            require_mention = os.getenv("DISCORD_REQUIRE_MENTION", "true").lower() not in ("false", "0", "no")
+            
+            is_free_channel = channel_id in free_channels
+            
+            if require_mention and not is_free_channel:
+                # Must be @mentioned to respond
+                if self._client.user not in message.mentions:
+                    return  # Silently ignore messages that don't mention the bot
+            
+            # Strip the bot mention from the message text so the agent sees clean input
+            if self._client.user and self._client.user in message.mentions:
+                message.content = message.content.replace(f"<@{self._client.user.id}>", "").strip()
+                message.content = message.content.replace(f"<@!{self._client.user.id}>", "").strip()
+        
        # Determine message type
        msg_type = MessageType.TEXT
        if message.content.startswith("/"):
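Review note on the mention gate in `_handle_message()`: the decision can be read as a small pure function of the channel, the mention state, and the two environment variables. The sketch below restates that logic for illustration. `should_respond` is a hypothetical name, and the environment is passed in as a mapping (an assumption made here so the rule can be exercised without touching `os.environ`).

```python
from typing import Mapping


def should_respond(channel_id: str, is_dm: bool, mentioned: bool, env: Mapping[str, str]) -> bool:
    """Mirror the gate: DMs always pass; server channels need a mention unless exempt."""
    if is_dm:
        return True
    raw = env.get("DISCORD_FREE_RESPONSE_CHANNELS", "")
    free_channels = {ch.strip() for ch in raw.split(",") if ch.strip()}
    require_mention = env.get("DISCORD_REQUIRE_MENTION", "true").lower() not in ("false", "0", "no")
    if require_mention and channel_id not in free_channels:
        return mentioned
    return True


env = {"DISCORD_FREE_RESPONSE_CHANNELS": "111, 222"}
in_free_channel = should_respond("111", is_dm=False, mentioned=False, env=env)
in_other_channel = should_respond("333", is_dm=False, mentioned=False, env=env)
```

Channel IDs are compared as strings, so whitespace around the commas in the list is tolerated but the IDs themselves must match exactly.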
--- a/gateway/platforms/telegram.py
+++ b/gateway/platforms/telegram.py
@@ -174,6 +174,69 @@ class TelegramAdapter(BasePlatformAdapter):
        except Exception as e:
            return SendResult(success=False, error=str(e))
    
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send audio as a native Telegram voice message or audio file."""
+        if not self._bot:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            import os
+            if not os.path.exists(audio_path):
+                return SendResult(success=False, error=f"Audio file not found: {audio_path}")
+            
+            with open(audio_path, "rb") as audio_file:
+                # .ogg/.opus files -> send as voice (round playable bubble)
+                if audio_path.endswith(".ogg") or audio_path.endswith(".opus"):
+                    msg = await self._bot.send_voice(
+                        chat_id=int(chat_id),
+                        voice=audio_file,
+                        caption=caption[:1024] if caption else None,
+                        reply_to_message_id=int(reply_to) if reply_to else None,
+                    )
+                else:
+                    # .mp3 and others -> send as audio file
+                    msg = await self._bot.send_audio(
+                        chat_id=int(chat_id),
+                        audio=audio_file,
+                        caption=caption[:1024] if caption else None,
+                        reply_to_message_id=int(reply_to) if reply_to else None,
+                    )
+            return SendResult(success=True, message_id=str(msg.message_id))
+        except Exception as e:
+            print(f"[{self.name}] Failed to send voice/audio: {e}")
+            return await super().send_voice(chat_id, audio_path, caption, reply_to)
+    
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send an image natively as a Telegram photo."""
+        if not self._bot:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            # Telegram can send photos directly from URLs
+            msg = await self._bot.send_photo(
+                chat_id=int(chat_id),
+                photo=image_url,
+                caption=caption[:1024] if caption else None,  # Telegram caption limit
+                reply_to_message_id=int(reply_to) if reply_to else None,
+            )
+            return SendResult(success=True, message_id=str(msg.message_id))
+        except Exception as e:
+            print(f"[{self.name}] Failed to send photo, falling back to URL: {e}")
+            # Fallback: send as text link
+            return await super().send_image(chat_id, image_url, caption, reply_to)
+    
    async def send_typing(self, chat_id: str) -> None:
        """Send typing indicator."""
        if self._bot:
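The routing and caption handling in `send_voice` above can be sketched in isolation. `pick_send_method`, `clip_caption`, `VOICE_SUFFIXES`, and `CAPTION_LIMIT` are hypothetical names introduced here. One deliberate difference from the diff: this sketch lowercases the suffix, whereas the adapter's `endswith` check is case-sensitive.

```python
from pathlib import Path
from typing import Optional

# Extensions the hunk above sends as voice notes; everything else goes out as an audio file.
VOICE_SUFFIXES = {".ogg", ".opus"}
# Caption cap applied in the hunk above via caption[:1024].
CAPTION_LIMIT = 1024


def pick_send_method(audio_path: str) -> str:
    """Return the bot method name to use for this file (hypothetical helper)."""
    suffix = Path(audio_path).suffix.lower()  # assumption: match case-insensitively
    return "send_voice" if suffix in VOICE_SUFFIXES else "send_audio"


def clip_caption(caption: Optional[str]) -> Optional[str]:
    """Truncate a caption to the limit; empty or missing captions become None."""
    return caption[:CAPTION_LIMIT] if caption else None


if __name__ == "__main__":
    print(pick_send_method("reply.ogg"), pick_send_method("reply.mp3"))
```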
--- a/gateway/run.py
+++ b/gateway/run.py
@@ -35,6 +35,9 @@ load_dotenv()
 # Gateway runs in quiet mode - suppress debug output and use cwd directly (no temp dirs)
 os.environ["HERMES_QUIET"] = "1"

+# Enable interactive exec approval for dangerous commands on messaging platforms
+os.environ["HERMES_EXEC_ASK"] = "1"
+
 # Set terminal working directory for messaging platforms
 # Uses MESSAGING_CWD if set, otherwise defaults to home directory
 # This is separate from CLI which uses the directory where `hermes` is run
@@ -77,6 +80,10 @@ class GatewayRunner:
        # Key: session_key, Value: AIAgent instance
        self._running_agents: Dict[str, Any] = {}
        self._pending_messages: Dict[str, str] = {}  # Queued messages during interrupt
+        
+        # Track pending exec approvals per session
+        # Key: session_key, Value: {"command": str, "pattern_key": str}
+        self._pending_approvals: Dict[str, Dict[str, str]] = {}
    
    async def start(self) -> bool:
        """
@@ -246,6 +253,25 @@ class GatewayRunner:
        if command == "stop":
            return await self._handle_stop_command(event)
        
+        # Check for pending exec approval responses
+        session_key_preview = f"agent:main:{source.platform.value}:{source.chat_type}:{source.chat_id}" if source.chat_type != "dm" else f"agent:main:{source.platform.value}:dm"
+        if session_key_preview in self._pending_approvals:
+            user_text = event.text.strip().lower()
+            if user_text in ("yes", "y", "approve", "ok", "go", "do it"):
+                approval = self._pending_approvals.pop(session_key_preview)
+                cmd = approval["command"]
+                pattern_key = approval.get("pattern_key", "")
+                print(f"[gateway] ✅ User approved dangerous command: {cmd[:60]}...")
+                # Approve for session and re-run via terminal_tool with force=True
+                from tools.terminal_tool import terminal_tool, _session_approved_patterns
+                _session_approved_patterns.add(pattern_key)
+                result = terminal_tool(command=cmd, force=True)
+                return f"✅ Command approved and executed.\n\n```\n{result[:3500]}\n```"
+            elif user_text in ("no", "n", "deny", "cancel", "nope"):
+                self._pending_approvals.pop(session_key_preview)
+                return "❌ Command denied."
+            # If it's not clearly an approval/denial, fall through to normal processing
+        
        # Get or create session
        session_entry = self.session_store.get_or_create_session(source)
        session_key = session_entry.session_key
@@ -282,6 +308,17 @@ class GatewayRunner:
                session_key=session_key
            )
            
+            # Check if the agent encountered a dangerous command needing approval
+            # The terminal tool stores the last pending approval globally
+            try:
+                from tools.terminal_tool import _last_pending_approval
+                if _last_pending_approval:
+                    self._pending_approvals[session_key] = _last_pending_approval.copy()
+                    # Clear the global so it doesn't leak to other sessions
+                    _last_pending_approval.clear()
+            except Exception:
+                pass
+            
            # Append to transcript
            self.session_store.append_to_transcript(
                session_entry.session_id,
@@ -418,23 +455,35 @@ class GatewayRunner:
                return
            last_tool[0] = tool_name
            
-            # Build progress message
+            # Build progress message with primary argument preview
            tool_emojis = {
                "terminal": "💻",
                "web_search": "🔍",
                "web_extract": "📄",
                "read_file": "📖",
                "write_file": "✍️",
+                "patch": "🔧",
+                "search": "🔎",
                "list_directory": "📂",
                "image_generate": "🎨",
+                "text_to_speech": "🔊",
                "browser_navigate": "🌐",
                "browser_click": "👆",
+                "browser_type": "⌨️",
+                "browser_snapshot": "📸",
                "moa_query": "🧠",
+                "mixture_of_agents": "🧠",
+                "vision_analyze": "👁️",
+                "skill_view": "📚",
+                "skills_list": "📋",
            }
            emoji = tool_emojis.get(tool_name, "⚙️")
            
-            if tool_name == "terminal" and preview:
-                msg = f"{emoji} `{preview}`..."
+            if preview:
+                # Truncate preview to keep messages clean
+                if len(preview) > 40:
+                    preview = preview[:37] + "..."
+                msg = f"{emoji} {tool_name}... \"{preview}\""
            else:
                msg = f"{emoji} {tool_name}..."
            
@@ -480,27 +529,54 @@ class GatewayRunner:
            # Read from env var or use default (same as CLI)
            max_iterations = int(os.getenv("HERMES_MAX_ITERATIONS", "60"))
            
+            # Map platform enum to the platform hint key the agent understands.
+            # Platform.LOCAL ("local") maps to "cli"; others pass through as-is.
+            platform_key = "cli" if source.platform == Platform.LOCAL else source.platform.value
+            
            agent = AIAgent(
-                model=os.getenv("HERMES_MODEL", "anthropic/claude-sonnet-4"),
+                model=os.getenv("HERMES_MODEL", "anthropic/claude-opus-4.6"),
                max_iterations=max_iterations,
                quiet_mode=True,
                enabled_toolsets=[toolset],
                ephemeral_system_prompt=context_prompt,
                session_id=session_id,
                tool_progress_callback=progress_callback if tool_progress_enabled else None,
+                platform=platform_key,  # Tells the agent which interface to format for
            )
            
            # Store agent reference for interrupt support
            agent_holder[0] = agent
            
-            # Convert transcript history to agent format
-            # Transcript has timestamps; agent expects {"role": ..., "content": ...}
+            # Convert history to agent format.
+            # Two cases:
+            #   1. Normal path (from transcript): simple {role, content, timestamp} dicts
+            #      - Strip timestamps, keep role+content
+            #   2. Interrupt path (from agent result["messages"]): full agent messages
+            #      that may include tool_calls, tool_call_id, reasoning, etc.
+            #      - These must be passed through intact so the API sees valid
+            #        assistant→tool sequences (dropping tool_calls causes 500 errors)
            agent_history = []
            for msg in history:
                role = msg.get("role")
-                content = msg.get("content")
-                if role and content:
-                    agent_history.append({"role": role, "content": content})
+                if not role:
+                    continue
+                
+                # Check if this is a rich agent message (has tool_calls or tool_call_id)
+                # If so, pass it through with full structure intact
+                has_tool_calls = "tool_calls" in msg
+                has_tool_call_id = "tool_call_id" in msg
+                is_tool_message = role == "tool"
+                
+                if has_tool_calls or has_tool_call_id or is_tool_message:
+                    # Preserve full message structure (tool_calls, tool_call_id, etc.)
+                    # Only strip fields that are purely internal (e.g. timestamp)
+                    clean_msg = {k: v for k, v in msg.items() if k != "timestamp"}
+                    agent_history.append(clean_msg)
+                else:
+                    # Simple text message - just need role and content
+                    content = msg.get("content")
+                    if content:
+                        agent_history.append({"role": role, "content": content})
            
            result = agent.run_conversation(message, conversation_history=agent_history)
            result_holder[0] = result
@@ -572,13 +648,16 @@ class GatewayRunner:
            
            if pending:
                print(f"[gateway] 📨 Processing interrupted message: '{pending[:40]}...'")
-                # Add an indicator to the response
-                if response:
-                    response = response + "\n\n---\n_[Interrupted - processing your new message]_"
                
-                # Send the interrupted response first
-                if adapter and response:
-                    await adapter.send(chat_id=source.chat_id, content=response)
+                # Clear the adapter's interrupt event so the next _run_agent call
+                # doesn't immediately re-trigger the interrupt before the new agent
+                # even makes its first API call (this was causing an infinite loop).
+                if adapter and hasattr(adapter, '_active_sessions') and source.chat_id in adapter._active_sessions:
+                    adapter._active_sessions[source.chat_id].clear()
+                
+                # Don't send the interrupted response to the user — it's just noise
+                # like "Operation interrupted." They already know they sent a new
+                # message, so go straight to processing it.
                
                # Now process the pending message with updated history
                updated_history = result.get("messages", history)
@@ -612,11 +691,13 @@ class GatewayRunner:
        return response


-async def start_gateway(config: Optional[GatewayConfig] = None) -> None:
+async def start_gateway(config: Optional[GatewayConfig] = None) -> bool:
    """
    Start the gateway and run until interrupted.
    
    This is the main entry point for running the gateway.
+    Returns True if the gateway ran successfully, False if it failed to start.
+    A False return causes a non-zero exit code so systemd can auto-restart.
    """
    runner = GatewayRunner(config)
    
@@ -635,10 +716,11 @@ async def start_gateway(config: Optional[GatewayConfig] = None) -> None:
    # Start the gateway
    success = await runner.start()
    if not success:
-        return
+        return False
    
    # Wait for shutdown
    await runner.wait_for_shutdown()
+    return True


 def main():
@@ -658,8 +740,11 @@ def main():
            data = json.load(f)
            config = GatewayConfig.from_dict(data)
    
-    # Run the gateway
-    asyncio.run(start_gateway(config))
+    # Run the gateway - exit with code 1 if no platforms connected,
+    # so systemd Restart=on-failure will retry on transient errors (e.g. DNS)
+    success = asyncio.run(start_gateway(config))
+    if not success:
+        sys.exit(1)


 if __name__ == "__main__":
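The history conversion added to `_run_agent` above can be read as a pure function. The sketch below restates it outside the class so the two paths (plain transcript entries versus rich agent messages) are easy to test. `to_agent_history` is a hypothetical name, and the sample messages are made up for illustration.

```python
from typing import Any, Dict, List


def to_agent_history(history: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert stored history into the message list passed to the agent.

    Hypothetical standalone version of the loop in the hunk above.
    """
    agent_history: List[Dict[str, Any]] = []
    for msg in history:
        role = msg.get("role")
        if not role:
            continue  # entries without a role are skipped

        is_rich = "tool_calls" in msg or "tool_call_id" in msg or role == "tool"
        if is_rich:
            # Keep the full structure so assistant -> tool sequences stay valid;
            # only the internal timestamp field is dropped.
            agent_history.append({k: v for k, v in msg.items() if k != "timestamp"})
        elif msg.get("content"):
            # Plain text message: role and content are all the agent needs.
            agent_history.append({"role": role, "content": msg["content"]})
    return agent_history


if __name__ == "__main__":
    sample = [
        {"role": "user", "content": "list files", "timestamp": "t0"},
        {"role": "assistant", "content": None, "tool_calls": [{"id": "c1"}]},
        {"role": "tool", "tool_call_id": "c1", "content": "a.txt", "timestamp": "t1"},
    ]
    for entry in to_agent_history(sample):
        print(entry)
```

Note that an assistant message with empty content survives only because it carries `tool_calls`; a plain message with empty content is dropped, as in the diff.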
--- a/hermes_agent.egg-info/PKG-INFO
+++ b/hermes_agent.egg-info/PKG-INFO
@@ -1,868 +0,0 @@
-Metadata-Version: 2.4
-Name: hermes-agent
-Version: 0.1.0
-Summary: AI agent with advanced tool-calling and toolsets
-Author: Nous Research
-License: MIT
-Requires-Python: >=3.10
-Description-Content-Type: text/markdown
-Requires-Dist: openai
-Requires-Dist: python-dotenv
-Requires-Dist: fire
-Requires-Dist: httpx
-Requires-Dist: rich
-Requires-Dist: tenacity
-Requires-Dist: pyyaml
-Requires-Dist: requests
-Requires-Dist: jinja2
-Requires-Dist: pydantic>=2.0
-Requires-Dist: firecrawl-py
-Requires-Dist: fal-client
-Requires-Dist: litellm>=1.75.5
-Requires-Dist: typer
-Requires-Dist: platformdirs
-Provides-Extra: modal
-Requires-Dist: modal; extra == "modal"
-Requires-Dist: boto3; extra == "modal"
-Provides-Extra: dev
-Requires-Dist: pytest; extra == "dev"
-Requires-Dist: pytest-asyncio; extra == "dev"
-Provides-Extra: messaging
-Requires-Dist: python-telegram-bot>=20.0; extra == "messaging"
-Requires-Dist: discord.py>=2.0; extra == "messaging"
-Provides-Extra: cron
-Requires-Dist: croniter; extra == "cron"
-Provides-Extra: all
-Requires-Dist: croniter; extra == "all"
-Requires-Dist: python-telegram-bot>=20.0; extra == "all"
-Requires-Dist: discord.py>=2.0; extra == "all"
-
-# Hermes Agent
-
-An AI agent with advanced tool-calling capabilities, featuring a flexible toolsets system for organizing and managing tools.
-
-## Features
-
- **Interactive CLI**: Beautiful terminal interface with animated feedback, personalities, and session management
- **Messaging Gateway**: Connect to Telegram, Discord, and WhatsApp for conversational AI anywhere
- **Web Tools**: Search, extract content, and crawl websites
- **Terminal Tools**: Execute commands via local, Docker, Singularity, Modal, or SSH backends
- **Browser Tools**: Automate web browsers to navigate, click, type, and extract content
- **Vision Tools**: Analyze images from URLs
- **Reasoning Tools**: Advanced multi-model reasoning (Mixture of Agents)
- **Creative Tools**: Generate images from text prompts
- **Skills Tools**: On-demand knowledge documents with progressive disclosure
- **Toolsets System**: Organize tools into logical groups for different scenarios
- **Scheduled Tasks**: Cron jobs for automated agent tasks with delivery to platforms
- **Context Compression**: Automatic summarization when approaching context limits
- **Batch Processing**: Process datasets in parallel with checkpointing and statistics tracking
- **Ephemeral System Prompts**: Guide model behavior without polluting training datasets
-
-## Installation
-
-### Quick Install (Recommended)
-
-**Linux/macOS:**
-```bash
-curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
-```
-
-**Windows (PowerShell):**
-```powershell
-irm https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.ps1 | iex
-```
-
-This installer will:
- Clone the repository to `~/.hermes-agent`
- Create a virtual environment and install dependencies
- Set up the `hermes` command in your PATH
- Run an interactive setup wizard to configure API keys
-
-### Manual Installation
-
-If you prefer to install manually:
-
-```bash
-# Clone with submodules
-git clone --recurse-submodules https://github.com/NousResearch/Hermes-Agent.git
-cd Hermes-Agent
-
-# Run the setup script
-./setup-hermes.sh
-```
-
-Or step-by-step:
-
-```bash
-# Create and activate virtual environment
-python3 -m venv venv
-source venv/bin/activate  # Windows: venv\Scripts\activate
-
-# Install in editable mode with all extras
-pip install -e ".[all]"
-
-# Or install dependencies manually
-pip install -r requirements.txt
-pip install -e ./mini-swe-agent
-
-# Copy and configure environment
-cp .env.example .env
-# Edit .env with your API keys
-
-# Run the setup wizard
-hermes setup
-```
-
-## Quick Start
-
-Once installed, the `hermes` command is your main entry point:
-
-```bash
-hermes                    # Interactive chat (default)
-hermes chat               # Same as above
-hermes chat -q "Hello"    # Single query, then exit
-hermes setup              # Configure API keys and settings
-hermes status             # Show configuration status
-hermes doctor             # Diagnose issues
-hermes gateway            # Start messaging gateway (Telegram/Discord/WhatsApp)
-hermes cron daemon        # Run cron job scheduler
-hermes version            # Show version info
-```
-
-**Legacy `./hermes` script:**
-```bash
-# The old CLI script still works:
-./hermes
-
-# Or with options:
-./hermes --model "anthropic/claude-sonnet-4" --toolsets "web,terminal"
-```
-
-The CLI provides:
- Animated spinners during thinking and tool execution
- Kawaii-style feedback messages
- `/commands` for configuration, history, and session management
- Customizable personalities (`/personality kawaii`, `/personality pirate`, etc.)
- Persistent configuration via `cli-config.yaml`
-
-## Configuration
-
-### Environment Variables
-```bash
-# Copy the example environment file
-cp .env.example .env
-
-# Edit .env and add your API keys
-nano .env  # or use your preferred editor
-```
-
-**Required API Keys:**
- `OPENROUTER_API_KEY` - LLM access via OpenRouter (get at: https://openrouter.ai/keys)
- `FIRECRAWL_API_KEY` - Web tools (get at: https://firecrawl.dev/)
- `NOUS_API_KEY` - Vision & reasoning tools (get at: https://inference-api.nousresearch.com/)
- `FAL_KEY` - Image generation (get at: https://fal.ai/)
-
-**Optional API Keys (for specific features):**
- `BROWSERBASE_API_KEY` - Browser automation (get at: https://browserbase.com/)
- `BROWSERBASE_PROJECT_ID` - From Browserbase dashboard
- `MORPH_API_KEY` - For legacy Hecate terminal backend (get at: https://morph.so/)
-
-### 4. Configure Terminal Backend
-
-The terminal tool uses **mini-swe-agent** environments. Configure in `.env` or `cli-config.yaml`:
-
-```bash
-# Backend: "local", "docker", "singularity", "modal", or "ssh"
-TERMINAL_ENV=local          # Default: runs on host machine (no isolation)
-TERMINAL_ENV=ssh            # Remote execution via SSH (agent code stays local)
-TERMINAL_ENV=singularity    # Recommended for HPC: Apptainer/Singularity containers
-TERMINAL_ENV=docker         # Isolated Docker containers
-TERMINAL_ENV=modal          # Cloud execution via Modal
-
-# Container image (for docker/singularity/modal backends)
-TERMINAL_DOCKER_IMAGE=python:3.11-slim
-TERMINAL_SINGULARITY_IMAGE=docker://python:3.11-slim
-TERMINAL_TIMEOUT=60
-
-# SSH backend (for ssh)
-TERMINAL_SSH_HOST=my-server.example.com
-TERMINAL_SSH_USER=myuser
-TERMINAL_SSH_KEY=~/.ssh/id_rsa  # Optional, uses ssh-agent if not set
-```
-
-**Backend Requirements:**
- **local**: No extra setup (runs directly on your machine, no isolation)
- **ssh**: SSH access to remote machine (great for sandboxing - agent can't touch its own code)
- **singularity**: Requires Apptainer or Singularity installed (common on HPC clusters, no root needed)
- **docker**: Requires Docker installed and user in `docker` group
- **modal**: Requires Modal account (see setup below)
-
-### Singularity/Apptainer Setup (Recommended for HPC)
-
-Singularity/Apptainer provides rootless container execution, ideal for HPC clusters:
-
-```bash
-# 1. Verify Apptainer is installed
-apptainer --version  # or: singularity --version
-
-# 2. Set up cache directories (important for parallel workers)
-# Use /scratch if available (HPC), otherwise /tmp
-export APPTAINER_CACHEDIR=/scratch/$USER/.apptainer
-export APPTAINER_TMPDIR=/scratch/$USER/.apptainer/tmp
-mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
-
-# 3. Pre-build SIF image (recommended for parallel batch processing)
-# This avoids race conditions when multiple workers start simultaneously
-apptainer build $APPTAINER_CACHEDIR/python-nodejs.sif docker://nikolaik/python-nodejs:python3.11-nodejs20
-
-# 4. Configure .env to use the local SIF
-TERMINAL_ENV=singularity
-TERMINAL_SINGULARITY_IMAGE=/scratch/$USER/.apptainer/python-nodejs.sif
-```
-
-**Tip:** The batch scripts in `configs/` automatically handle SIF pre-building if `/scratch` is available.
-
-### Modal Cloud Backend Setup
-
-[Modal](https://modal.com) provides serverless cloud compute for running sandboxed environments at scale.
-
-```bash
-# 1. Install Modal and dependencies
-pip install modal boto3
-
-# 2. Authenticate with Modal (opens browser)
-modal setup
-
-# 3. Set terminal backend to modal in .env
-TERMINAL_ENV=modal
-```
-
-Modal uses CLI-based authentication (stored in `~/.modal/`), so no API key is needed in `.env`. After running `modal setup`, commands will automatically execute in Modal's cloud sandboxes.
-
-### Browser Tools Setup
-
-Browser tools enable the agent to navigate websites, fill forms, click buttons, and extract content. They use [agent-browser](https://github.com/vercel-labs/agent-browser) CLI with [Browserbase](https://browserbase.com) cloud execution.
-
-```bash
-# 1. Install Node.js (if not already installed)
-# Use nvm (recommended) or your package manager
-
-# 2. Install agent-browser CLI (choose one option):
-npm install -g agent-browser     # Option A: Global install (recommended)
-npm install                      # Option B: Local install (uses npx fallback)
-
-# 3. Get Browserbase credentials
-# Sign up at https://browserbase.com/ and get your:
-# - API Key (from Settings → API Keys)
-# - Project ID (from your project dashboard)
-
-# 4. Add to your .env file:
-BROWSERBASE_API_KEY=your_api_key_here
-BROWSERBASE_PROJECT_ID=your_project_id_here
-```
-
-**Available Browser Tools:**
-
-| Tool | Description |
-|------|-------------|
-| `browser_navigate` | Navigate to a URL |
-| `browser_snapshot` | Get text-based page snapshot with element refs |
-| `browser_click` | Click an element by ref (e.g., `@e5`) |
-| `browser_type` | Type text into an input field |
-| `browser_scroll` | Scroll up or down |
-| `browser_back` | Go back in browser history |
-| `browser_press` | Press a keyboard key (Enter, Tab, etc.) |
-| `browser_close` | Close the browser session |
-| `browser_get_images` | Get list of images on the page |
-
-**Example Usage:**
-```bash
-# Use browser tools with web search and vision
-python run_agent.py \
-  --query "Go to amazon.com and find the price of the latest Kindle" \
-  --enabled_toolsets=browser,web,vision
-
-# Use browser-focused distribution
-python batch_runner.py \
-  --dataset_file=browser_tasks.jsonl \
-  --distribution=browser_use \
-  --run_name=browser_run
-```
-
-See `.env.example` for all available configuration options including debug settings.
-
-### Skills Tools
-
-Skills are on-demand knowledge documents the agent can load when needed. They follow a **progressive disclosure** pattern to minimize token usage:
-
-```
-skills/
-├── mlops/                    # Category folder
-│   ├── axolotl/             # Skill folder
-│   │   ├── SKILL.md         # Main instructions (required)
-│   │   ├── references/      # Additional docs, API specs
-│   │   └── templates/       # Output formats, configs
-│   └── vllm/
-│       └── SKILL.md
-```
-
-**Available Skills Tools:**
-
-| Tool | Description |
-|------|-------------|
-| `skills_categories` | List available skill categories (~50 tokens) |
-| `skills_list` | List skills with name + description (~3k tokens for 40 skills) |
-| `skill_view` | Load full skill content, tags, and linked files |
-
-**Example Usage:**
-```bash
-# Use skills tools
-python run_agent.py \
-  --query "What skills do you have for fine-tuning? Show me the axolotl skill." \
-  --enabled_toolsets=skills
-```
-
-**Creating Skills:**
-
-Skills use YAML frontmatter for metadata:
-```yaml
---
-name: my-skill
-description: Brief description shown in skills_list
-tags: [tag1, tag2]
-related_skills: [other-skill]
-version: 1.0.0
---
-# Skill Content
-
-Instructions, examples, and guidelines here...
-```
-
-Skills can include:
- `references/` - Additional documentation, API specs, examples
- `templates/` - Output formats, config files, boilerplate code
- `scripts/` - Executable helpers (Python, shell scripts)
-
-## Session Logging
-
-Every conversation is automatically logged to `logs/` for debugging and inspection:
-
-```
-logs/
-├── session_20260201_143052_a1b2c3.json
-├── session_20260201_150217_d4e5f6.json
-└── ...
-```
-
-**Log Format:**
-```json
-{
-  "session_id": "20260201_143052_a1b2c3",
-  "model": "anthropic/claude-sonnet-4",
-  "session_start": "2026-02-01T14:30:52.123456",
-  "last_updated": "2026-02-01T14:35:12.789012",
-  "message_count": 8,
-  "conversations": [
-    {"from": "system", "value": "..."},
-    {"from": "human", "value": "..."},
-    {"from": "gpt", "value": "..."},
-    {"from": "tool", "value": "..."}
-  ]
-}
-```
-
- **Automatic**: Logs are created and updated automatically after each conversation turn
- **Session ID in Banner**: The CLI displays the session ID in the welcome banner
- **Trajectory Format**: Uses the same format as batch processing for consistency
- **Git Ignored**: `logs/` is in `.gitignore` so logs aren't committed
-
-## Context Compression
-
-Long conversations can exceed the model's context limit. Hermes Agent automatically compresses context when approaching the limit:
-
-**How it works:**
-1. Tracks actual token usage from API responses (`usage.prompt_tokens`)
-2. When tokens reach 85% of model's context limit, triggers compression
-3. Protects first 3 turns (system prompt, initial request, first response)
-4. Protects last 4 turns (recent context is most relevant)
-5. Summarizes middle turns using a fast/cheap model (Gemini Flash)
-6. Inserts summary as a user message, conversation continues seamlessly
-
-**Configuration (`cli-config.yaml`):**
-```yaml
-compression:
-  enabled: true                    # Enable auto-compression (default)
-  threshold: 0.85                  # Compress at 85% of context limit
-  summary_model: "google/gemini-2.0-flash-001"
-```
-
-**Or via environment variables:**
-```bash
-CONTEXT_COMPRESSION_ENABLED=true
-CONTEXT_COMPRESSION_THRESHOLD=0.85
-CONTEXT_COMPRESSION_MODEL=google/gemini-2.0-flash-001
-```
-
-**When compression triggers, you'll see:**
-```
-📦 Context compression triggered (170,000 tokens ≥ 170,000 threshold)
-   📊 Model context limit: 200,000 tokens (85% = 170,000)
-   🗜️  Summarizing turns 4-15 (12 turns)
-   ✅ Compressed: 20 → 9 messages (~45,000 tokens saved)
-```
-
-## Scheduled Tasks (Cron Jobs)
-
-Hermes Agent can schedule automated tasks to run in the future - either one-time reminders or recurring jobs.
-
-### CLI Commands
-
-```bash
-# List scheduled jobs
-/cron
-
-# Add a one-shot reminder (runs once in 30 minutes)
-/cron add 30m Remind me to check the build status
-
-# Add a recurring job (every 2 hours)
-/cron add "every 2h" Check server status at 192.168.1.100 and report any issues
-
-# Add a cron expression (daily at 9am)
-/cron add "0 9 * * *" Generate a morning briefing summarizing GitHub notifications
-
-# Remove a job
-/cron remove abc123def456
-```
-
-### Agent Self-Scheduling
-
-The agent can also schedule its own follow-up tasks using tools:
-
-```python
-# Available when using hermes-cli toolset (default for CLI)
-schedule_cronjob(prompt="...", schedule="30m", repeat=1)  # One-shot
-schedule_cronjob(prompt="...", schedule="every 2h")       # Recurring
-list_cronjobs()                                            # View all jobs
-remove_cronjob(job_id="...")                              # Cancel a job
-```
-
-**⚠️ Important:** Cronjobs run in **isolated sessions with NO prior context**. The prompt must be completely self-contained with all necessary information (file paths, URLs, server addresses, etc.). The future agent will not remember anything from the current conversation.
-
-### Schedule Formats
-
-| Format | Example | Description |
-|--------|---------|-------------|
-| Duration | `30m`, `2h`, `1d` | One-shot delay from now |
-| Interval | `every 30m`, `every 2h` | Recurring at fixed intervals |
-| Cron | `0 9 * * *` | Cron expression (requires `croniter`) |
-| Timestamp | `2026-02-03T14:00` | One-shot at specific time |
-
-### Repeat Options
-
-| repeat | Behavior |
-|--------|----------|
-| (omitted) | One-shot schedules run once; intervals/cron run forever |
-| `1` | Run once then auto-delete |
-| `N` | Run N times then auto-delete |
-
-### Running the Cron Daemon
-
-Jobs are stored in `~/.hermes/cron/jobs.json` and executed by a scheduler:
-
-```bash
-# Option 1: Built-in daemon (checks every 60 seconds)
-python cli.py --cron-daemon
-
-# Option 2: System cron integration (run once per minute)
-# Add to crontab: crontab -e
-*/1 * * * * cd ~/hermes-agent && python cli.py --cron-tick-once >> ~/.hermes/cron/cron.log 2>&1
-```
-
-### Job Output
-
-Job outputs are saved to `~/.hermes/cron/output/{job_id}/{timestamp}.md` for review.
-
-## Messaging Gateway (Telegram, Discord, WhatsApp)
-
-Connect Hermes Agent to messaging platforms so you can chat from anywhere.
-
-### Quick Start
-
-```bash
-# 1. Add your bot token to .env
-echo 'TELEGRAM_BOT_TOKEN="your_token"' >> .env
-
-# 2. Test the gateway (foreground)
-./scripts/hermes-gateway run
-
-# 3. Install as a background service
-./scripts/hermes-gateway install
-
-# 4. Manage the service
-./scripts/hermes-gateway start   # Start
-./scripts/hermes-gateway stop    # Stop
-./scripts/hermes-gateway status  # Check status
-```
-
-### Supported Platforms
-
-| Platform | Setup | Toolset |
-|----------|-------|---------|
-| Telegram | Bot via @BotFather | `hermes-telegram` |
-| Discord | Bot via Developer Portal | `hermes-discord` |
-| WhatsApp | Node.js bridge | `hermes-whatsapp` |
-
-### Session Management
-
- Sessions persist across messages (agent remembers context)
- Reset policies: daily (4am), idle (2 hours), or both
- Manual reset: send `/new` or `/reset`
-
-### Cron Job Delivery
-
-Schedule tasks that deliver to specific platforms:
-
-```python
-schedule_cronjob(
-    prompt="Check server status...",
-    schedule="every 1h",
-    deliver="telegram"  # or "origin", "discord", etc.
-)
-```
-
-### CLI Commands
-
-| Command | Description |
-|---------|-------------|
-| `/platforms` | Show gateway configuration status |
-| `--gateway` | Start the gateway (CLI flag) |
-
-See [docs/messaging.md](docs/messaging.md) for full setup instructions.
-
-## Interactive CLI
-
-The CLI provides a rich interactive experience for working with the agent.
-
-### Running the CLI
-
-```bash
-# Basic usage
-./hermes
-
-# With specific model
-./hermes --model "anthropic/claude-sonnet-4"
-
-# With specific toolsets
-./hermes --toolsets "web,terminal,skills"
-```
-
-### CLI Commands
-
-| Command | Description |
-|---------|-------------|
-| `/help` | Show available commands |
-| `/tools` | List available tools by toolset |
-| `/toolsets` | List available toolsets |
-| `/model [name]` | Show or change the current model |
-| `/prompt [text]` | View/set custom system prompt |
-| `/personality [name]` | Set a predefined personality |
-| `/clear` | Clear screen and reset conversation |
-| `/reset` | Reset conversation only |
-| `/history` | Show conversation history |
-| `/save` | Save current conversation to file |
-| `/config` | Show current configuration |
-| `/cron` | Manage scheduled tasks (list, add, remove) |
-| `/platforms` | Show gateway/messaging platform status |
-| `/quit` | Exit the CLI |
-
-### Configuration
-
-Copy `cli-config.yaml.example` to `cli-config.yaml` and customize:
-
-```yaml
-# Model settings
-model:
-  default: "anthropic/claude-sonnet-4"
-
-# Terminal backend (local, docker, singularity, modal, or ssh)
-terminal:
-  env_type: "local"
-  cwd: "."  # Use current directory
-
-# Or use SSH for remote execution (keeps agent code isolated)
-# terminal:
-#   env_type: "ssh"
-#   ssh_host: "my-server.example.com"
-#   ssh_user: "myuser"
-#   ssh_key: "~/.ssh/id_rsa"
-#   cwd: "/home/myuser/project"
-
-# Enable specific toolsets
-toolsets:
-  - all  # or: web, terminal, browser, vision, etc.
-
-# Custom personalities (use with /personality command)
-agent:
-  personalities:
-    helpful: "You are a helpful assistant."
-    kawaii: "You are a kawaii assistant! Use cute expressions..."
-```
-
-### Personalities
-
-Built-in personalities available via `/personality`:
- `helpful`, `concise`, `technical`, `creative`, `teacher`
- `kawaii`, `catgirl`, `pirate`, `shakespeare`, `surfer`
- `noir`, `uwu`, `philosopher`, `hype`
-
-## Toolsets System
-
-The agent uses a toolsets system for organizing and managing tools. All tools must be part of a toolset to be accessible - individual tool selection is not supported. This ensures consistent and logical grouping of capabilities.
-
-### Key Concepts
-
- **Toolsets**: Logical groups of tools for specific use cases (e.g., "research", "development", "debugging")
- **Composition**: Toolsets can include other toolsets for powerful combinations
- **Custom Toolsets**: Create your own toolsets at runtime or by editing `toolsets.py`
- **Toolset-Only Access**: Tools are only accessible through toolsets, not individually
-
-### Available Toolsets
-
-See `toolsets.py` for the complete list of predefined toolsets including:
- Basic toolsets (web, terminal, vision, creative, reasoning)
- Composite toolsets (research, development, analysis, etc.)
- Scenario-specific toolsets (debugging, documentation, API testing, etc.)
- Special toolsets (safe mode without terminal, minimal, offline)
-
-### Using Toolsets
-
-```bash
-# Use a predefined toolset
-python run_agent.py --enabled_toolsets=research --query "Find latest AI papers"
-
-# Combine multiple toolsets
-python run_agent.py --enabled_toolsets=web,vision --query "Analyze this website"
-
-# Enable all toolsets explicitly (same as omitting the flag)
-python run_agent.py --enabled_toolsets=all --query "Do web research and run commands if helpful"
-
-# Safe mode (no terminal access)
-python run_agent.py --enabled_toolsets=safe --query "Help without running commands"
-
-# List all available toolsets and tools
-python run_agent.py --list_tools
-```
-
-See `toolsets.py` for the complete list of available toolsets and how to create custom ones.
-
-## Basic Usage
-
-### Default (all tools enabled)
-```bash
-# Uses OpenRouter by default - just set OPENROUTER_API_KEY in .env
-python run_agent.py \
-  --query "search up the latest docs on jit in python 3.13 and write me basic example that's not in their docs. profile its perf" \
-  --max_turns 20 \
-  --model anthropic/claude-sonnet-4-20250514
-```
-
-### With specific toolset
-```bash
-python run_agent.py \
-  --query "Debug this Python error" \
-  --enabled_toolsets=debugging \
-  --model anthropic/claude-sonnet-4-20250514
-```
-
-### Python API
-```python
-from run_agent import AIAgent
-
-# Uses OpenRouter by default (reads OPENROUTER_API_KEY from .env)
-agent = AIAgent(
-    model="anthropic/claude-sonnet-4-20250514",
-    enabled_toolsets=["research"]
-)
-response = agent.chat("Find information about quantum computing")
-
-# Create custom toolset at runtime
-from toolsets import create_custom_toolset
-
-create_custom_toolset(
-    name="my_tools",
-    description="My custom toolkit",
-    tools=["web_search"],
-    includes=["terminal", "vision"]
-)
-
-agent = AIAgent(enabled_toolsets=["my_tools"])
-```
-
-## Batch Processing
-
-Process multiple prompts from a dataset in parallel with automatic checkpointing and statistics tracking:
-
-```bash
-# Basic batch processing
-python batch_runner.py \
-  --dataset_file=prompts.jsonl \
-  --batch_size=20 \
-  --run_name=my_run
-
-# With specific distribution
-python batch_runner.py \
-  --dataset_file=prompts.jsonl \
-  --batch_size=20 \
-  --run_name=image_run \
-  --distribution=image_gen \
-  --num_workers=4
-```
-
-**Key Features:**
- Parallel processing with configurable workers
- Toolset distributions for varied data generation
- Automatic checkpointing and resume capability
- Combined output in `data/<run_name>/trajectories.jsonl`
- Tool usage statistics and success rates
-
-Use `--list_distributions` to see available toolset distributions for varied data generation.
-
-### Trajectory Compression
-
-Post-process trajectories to fit within token budgets for training:
-
-```bash
-# Compress a directory of JSONL files
-python trajectory_compressor.py --input=data/my_run
-
-# Compress a single JSONL file
-python trajectory_compressor.py --input=data/trajectories.jsonl
-
-# Compress a 15% sample (useful for creating smaller training sets)
-python trajectory_compressor.py --input=data/trajectories.jsonl --sample_percent=15
-
-# Custom output and token target
-python trajectory_compressor.py \
-  --input=data/trajectories.jsonl \
-  --output=data/compressed.jsonl \
-  --target_max_tokens=16000
-```
-
-**Features:**
- Protects first turns (system, human, first GPT response, first tool call)
- Protects last N turns (configurable)
- Summarizes middle turns using LLM to fit target token budget
- Supports both directory and single file input
- Optional random sampling with `--sample_percent`
- Configurable via `configs/trajectory_compression.yaml`
-
-### Ephemeral System Prompts
-
-The ephemeral system prompt feature allows you to guide the model's behavior during batch processing **without** saving that prompt to the training dataset trajectories. This is useful for:
-
- Guiding model behavior during data collection
- Adding task-specific instructions 
- Keeping saved trajectories clean and focused on tool-calling format
-
-**Example:**
-```bash
-python batch_runner.py \
-  --dataset_file=prompts.jsonl \
-  --batch_size=10 \
-  --run_name=my_run \
-  --ephemeral_system_prompt="You are a helpful assistant focused on image generation."
-```
-
-The ephemeral prompt will influence the model's behavior during execution, but **only the standard tool-calling system prompt** will be saved in the trajectory files.
-
-The ephemeral prompt influences model behavior during execution, but **only the standard tool-calling system prompt** is saved in trajectory files.
-
-## Command Line Arguments
-
-**Single Agent (`run_agent.py`):**
- `--query`: The question or task for the agent
- `--model`: Model to use (default: claude-opus-4-20250514)
- `--api_key`: API key for authentication
- `--base_url`: API endpoint URL
- `--max_turns`: Maximum number of tool-calling iterations
- `--enabled_toolsets`: Comma-separated list of toolsets to enable. Use `all` (or `*`) to enable everything. If omitted, all toolsets are enabled by default.
- `--disabled_toolsets`: Comma-separated list of toolsets to disable
- `--list_tools`: List all available toolsets and tools
- `--save_trajectories`: Save conversation trajectories to JSONL files
-
-**Batch Processing (`batch_runner.py`):**
- `--dataset_file`: Path to JSONL file with prompts
- `--batch_size`: Number of prompts per batch
- `--run_name`: Name for this run (for output/checkpointing)
- `--distribution`: Toolset distribution to use (default: "default")
- `--num_workers`: Number of parallel workers (default: 4)
- `--resume`: Resume from checkpoint if interrupted
- `--ephemeral_system_prompt`: System prompt used during execution but NOT saved to trajectories
- `--list_distributions`: List available toolset distributions
-
-## Environment Variables
-
-All environment variables can be configured in the `.env` file (copy from `.env.example`).
-
-**LLM Provider (OpenRouter):**
- `OPENROUTER_API_KEY`: Primary LLM access via OpenRouter (supports Claude, GPT-4, Gemini, etc.)
- `LLM_MODEL`: Default model (e.g., `anthropic/claude-sonnet-4`, `openai/gpt-4o`)
-
-**Tool API Keys:**
- `FIRECRAWL_API_KEY`: Web tools (search, extract, crawl)
- `NOUS_API_KEY`: Vision and reasoning tools
- `FAL_KEY`: Image generation tools
-
-**Terminal Tool Configuration (mini-swe-agent backend):**
- `TERMINAL_ENV`: Backend type - `local`, `docker`, `singularity`, `modal`, or `ssh` (default: `local`)
- `TERMINAL_DOCKER_IMAGE`: Docker image for docker backend (default: `python:3.11-slim`)
- `TERMINAL_SINGULARITY_IMAGE`: Singularity/Apptainer image (can be `docker://...` URL or local `.sif` path)
- `TERMINAL_TIMEOUT`: Command timeout in seconds (default: `60`)
- `TERMINAL_LIFETIME_SECONDS`: Cleanup inactive environments after this time (default: `300`)
- `TERMINAL_CWD`: Working directory inside containers (default: `/tmp`)
- `TERMINAL_SCRATCH_DIR`: Custom scratch directory for sandbox storage (optional, auto-detects `/scratch`)
- `SUDO_PASSWORD`: Enable sudo commands by piping password via `sudo -S` (works with all backends)
-  - If unset in CLI mode, you'll be prompted interactively when sudo is needed (45s timeout)
-
-**SSH Backend Configuration (for remote execution):**
- `TERMINAL_SSH_HOST`: Remote server hostname or IP
- `TERMINAL_SSH_USER`: SSH username
- `TERMINAL_SSH_PORT`: SSH port (default: `22`)
- `TERMINAL_SSH_KEY`: Path to SSH private key (optional, uses ssh-agent if not set)
-
-**Context Compression (auto-shrinks long conversations):**
- `CONTEXT_COMPRESSION_ENABLED`: Enable auto-compression (default: `true`)
- `CONTEXT_COMPRESSION_THRESHOLD`: Compress at this % of context limit (default: `0.85`)
- `CONTEXT_COMPRESSION_MODEL`: Model for generating summaries (default: `google/gemini-2.0-flash-001`)
-
-**Browser Tool Configuration (agent-browser + Browserbase):**
- `BROWSERBASE_API_KEY`: Browserbase API key for cloud browser execution
- `BROWSERBASE_PROJECT_ID`: Browserbase project ID
- `BROWSER_SESSION_TIMEOUT`: Session timeout in seconds (default: `300`)
-
-**Legacy Hecate Terminal Backend (optional):**
- `MORPH_API_KEY`: For Hecate/MorphCloud terminal backend
- `HECATE_VM_LIFETIME_SECONDS`: VM lifetime (default: 300)
- `HECATE_DEFAULT_SNAPSHOT_ID`: Default snapshot (default: snapshot_p5294qxt)
-
-**Debug Options:**
- `WEB_TOOLS_DEBUG`, `VISION_TOOLS_DEBUG`, `MOA_TOOLS_DEBUG`, `IMAGE_TOOLS_DEBUG`: Enable debug logging
-
-## Key Files
-
-| File | Purpose |
-|------|---------|
-| `hermes` | CLI launcher script (run with `./hermes`) |
-| `cli.py` | Interactive CLI implementation |
-| `cli-config.yaml` | CLI configuration (copy from `.example`) |
-| `run_agent.py` | Main agent runner - single query execution |
-| `batch_runner.py` | Parallel batch processing with checkpointing |
-| `model_tools.py` | Core tool definitions and handlers |
-| `toolsets.py` | Toolset definitions and composition |
-| `toolset_distributions.py` | Probability distributions for data generation |
-| `trajectory_compressor.py` | Post-process trajectories for training |
-| `tools/` | Individual tool implementations |
-| `tools/skills_tool.py` | Skills system with progressive disclosure |
-| `skills/` | On-demand knowledge documents |
-| `docs/` | Documentation |
-| `configs/` | Example batch run scripts |
--- a/hermes_agent.egg-info/SOURCES.txt
+++ b/hermes_agent.egg-info/SOURCES.txt
@@ -1,47 +0,0 @@
-README.md
-batch_runner.py
-cli.py
-model_tools.py
-pyproject.toml
-run_agent.py
-toolset_distributions.py
-toolsets.py
-trajectory_compressor.py
-cron/__init__.py
-cron/jobs.py
-cron/scheduler.py
-gateway/__init__.py
-gateway/config.py
-gateway/delivery.py
-gateway/run.py
-gateway/session.py
-hermes_agent.egg-info/PKG-INFO
-hermes_agent.egg-info/SOURCES.txt
-hermes_agent.egg-info/dependency_links.txt
-hermes_agent.egg-info/entry_points.txt
-hermes_agent.egg-info/requires.txt
-hermes_agent.egg-info/top_level.txt
-hermes_cli/__init__.py
-hermes_cli/cron.py
-hermes_cli/doctor.py
-hermes_cli/gateway.py
-hermes_cli/main.py
-hermes_cli/setup.py
-hermes_cli/status.py
-tests/test_batch_runner.py
-tests/test_checkpoint_resumption.py
-tests/test_modal_terminal.py
-tests/test_nous_api_limits.py
-tests/test_nous_api_pattern.py
-tests/test_temperature_fix.py
-tests/test_web_tools.py
-tools/__init__.py
-tools/browser_tool.py
-tools/cronjob_tools.py
-tools/image_generation_tool.py
-tools/mixture_of_agents_tool.py
-tools/skills_tool.py
-tools/terminal_hecate.py
-tools/terminal_tool.py
-tools/vision_tools.py
-tools/web_tools.py
--- a/hermes_agent.egg-info/dependency_links.txt
+++ b/hermes_agent.egg-info/dependency_links.txt
@@ -1 +0,0 @@
-
--- a/hermes_agent.egg-info/entry_points.txt
+++ b/hermes_agent.egg-info/entry_points.txt
@@ -1,3 +0,0 @@
-[console_scripts]
-hermes = hermes_cli.main:main
-hermes-agent = run_agent:main
--- a/hermes_agent.egg-info/requires.txt
+++ b/hermes_agent.egg-info/requires.txt
@@ -1,35 +0,0 @@
-openai
-python-dotenv
-fire
-httpx
-rich
-tenacity
-pyyaml
-requests
-jinja2
-pydantic>=2.0
-firecrawl-py
-fal-client
-litellm>=1.75.5
-typer
-platformdirs
-
-[all]
-croniter
-python-telegram-bot>=20.0
-discord.py>=2.0
-
-[cron]
-croniter
-
-[dev]
-pytest
-pytest-asyncio
-
-[messaging]
-python-telegram-bot>=20.0
-discord.py>=2.0
-
-[modal]
-modal
-boto3
--- a/hermes_agent.egg-info/top_level.txt
+++ b/hermes_agent.egg-info/top_level.txt
@@ -1,11 +0,0 @@
-batch_runner
-cli
-cron
-gateway
-hermes_cli
-model_tools
-run_agent
-tools
-toolset_distributions
-toolsets
-trajectory_compressor
--- a/hermes_cli/config.py
+++ b/hermes_cli/config.py
@@ -71,7 +71,7 @@ def ensure_hermes_home():
 # =============================================================================

 DEFAULT_CONFIG = {
-    "model": "anthropic/claude-sonnet-4.5",
+    "model": "anthropic/claude-opus-4.6",
    "toolsets": ["hermes-cli"],
    "max_turns": 100,
    
@@ -91,7 +91,7 @@ DEFAULT_CONFIG = {
    "compression": {
        "enabled": True,
        "threshold": 0.85,
-        "summary_model": "google/gemini-2.0-flash-001",
+        "summary_model": "google/gemini-3-flash-preview",
    },
    
    "display": {
@@ -99,6 +99,24 @@ DEFAULT_CONFIG = {
        "personality": "kawaii",
    },
    
+    # Text-to-speech configuration
+    "tts": {
+        "provider": "edge",  # "edge" (free) | "elevenlabs" (premium) | "openai"
+        "edge": {
+            "voice": "en-US-AriaNeural",
+            # Popular: AriaNeural, JennyNeural, AndrewNeural, BrianNeural, SoniaNeural
+        },
+        "elevenlabs": {
+            "voice_id": "pNInz6obpgDQGcFmaJgB",  # Adam
+            "model_id": "eleven_multilingual_v2",
+        },
+        "openai": {
+            "model": "gpt-4o-mini-tts",
+            "voice": "alloy",
+            # Voices: alloy, echo, fable, onyx, nova, shimmer
+        },
+    },
+    
    # Permanently allowed dangerous command patterns (added via "always" approval)
    "command_allowlist": [],
    
@@ -202,6 +220,13 @@ OPTIONAL_ENV_VARS = {
        "url": None,
        "password": False,
    },
+    # Text-to-speech (premium providers)
+    "ELEVENLABS_API_KEY": {
+        "description": "ElevenLabs API key for premium text-to-speech voices",
+        "prompt": "ElevenLabs API key",
+        "url": "https://elevenlabs.io/",
+        "password": True,
+    },
    # Terminal configuration
    "MESSAGING_CWD": {
        "description": "Working directory for terminal commands via messaging (Telegram/Discord/etc). CLI always uses current directory.",
@@ -555,7 +580,7 @@ def show_config():
    print(f"  Enabled:      {'yes' if enabled else 'no'}")
    if enabled:
        print(f"  Threshold:    {compression.get('threshold', 0.85) * 100:.0f}%")
-        print(f"  Model:        {compression.get('summary_model', 'google/gemini-2.0-flash-001')}")
+        print(f"  Model:        {compression.get('summary_model', 'google/gemini-3-flash-preview')}")
    
    # Messaging
    print()
--- a/hermes_cli/doctor.py
+++ b/hermes_cli/doctor.py
@@ -58,8 +58,11 @@ def run_doctor(args):
    print(color("◆ Python Environment", Colors.CYAN, Colors.BOLD))
    
    py_version = sys.version_info
-    if py_version >= (3, 10):
+    if py_version >= (3, 11):
        check_ok(f"Python {py_version.major}.{py_version.minor}.{py_version.micro}")
+    elif py_version >= (3, 10):
+        check_ok(f"Python {py_version.major}.{py_version.minor}.{py_version.micro}")
+        check_warn("Python 3.11+ recommended for RL Training tools (tinker requires >= 3.11)")
    elif py_version >= (3, 8):
        check_warn(f"Python {py_version.major}.{py_version.minor}.{py_version.micro}", "(3.10+ recommended)")
    else:
@@ -100,7 +103,7 @@ def run_doctor(args):
            check_ok(name)
        except ImportError:
            check_fail(name, "(missing)")
-            issues.append(f"Install {name}: pip install {module}")
+            issues.append(f"Install {name}: uv pip install {module}")
    
    for module, name in optional_packages:
        try:
@@ -263,6 +266,39 @@ def run_doctor(args):
        except Exception as e:
            check_warn("Anthropic API", f"({e})")
    
+    # =========================================================================
+    # Check: Submodules
+    # =========================================================================
+    print()
+    print(color("◆ Submodules", Colors.CYAN, Colors.BOLD))
+    
+    # mini-swe-agent (terminal tool backend)
+    mini_swe_dir = PROJECT_ROOT / "mini-swe-agent"
+    if mini_swe_dir.exists() and (mini_swe_dir / "pyproject.toml").exists():
+        try:
+            __import__("minisweagent")
+            check_ok("mini-swe-agent", "(terminal backend)")
+        except ImportError:
+            check_warn("mini-swe-agent found but not installed", "(run: uv pip install -e ./mini-swe-agent)")
+            issues.append("Install mini-swe-agent: uv pip install -e ./mini-swe-agent")
+    else:
+        check_warn("mini-swe-agent not found", "(run: git submodule update --init --recursive)")
+    
+    # tinker-atropos (RL training backend)
+    tinker_dir = PROJECT_ROOT / "tinker-atropos"
+    if tinker_dir.exists() and (tinker_dir / "pyproject.toml").exists():
+        if py_version >= (3, 11):
+            try:
+                __import__("tinker_atropos")
+                check_ok("tinker-atropos", "(RL training backend)")
+            except ImportError:
+                check_warn("tinker-atropos found but not installed", "(run: uv pip install -e ./tinker-atropos)")
+                issues.append("Install tinker-atropos: uv pip install -e ./tinker-atropos")
+        else:
+            check_warn("tinker-atropos requires Python 3.11+", f"(current: {py_version.major}.{py_version.minor})")
+    else:
+        check_warn("tinker-atropos not found", "(run: git submodule update --init --recursive)")
+    
    # =========================================================================
    # Check: Tool Availability
    # =========================================================================
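Review note: the version check in `run_doctor` now has four tiers. A minimal sketch of that tiering as a pure function follows, which may make it easier to test. The function name `classify_python` and the status strings are hypothetical, and the final branch is assumed to be a failure because the `else:` body is not shown in this hunk.

```python
from typing import List, Tuple


def classify_python(version: Tuple[int, int]) -> Tuple[str, List[str]]:
    """Return (status, notes) for an interpreter version, following the doctor tiers."""
    notes: List[str] = []
    if version >= (3, 11):
        # Fully supported, including the RL training tools.
        return "ok", notes
    if version >= (3, 10):
        # Supported, but tinker requires >= 3.11.
        notes.append("Python 3.11+ recommended for RL training tools")
        return "ok", notes
    if version >= (3, 8):
        notes.append("3.10+ recommended")
        return "warn", notes
    # Assumption: anything older is reported as a failure.
    return "fail", notes
```

Calling `classify_python(sys.version_info[:2])` would classify the running interpreter.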
--- a/hermes_cli/gateway.py
+++ b/hermes_cli/gateway.py
@@ -360,7 +360,11 @@ def run_gateway(verbose: bool = False):
    print("└─────────────────────────────────────────────────────────┘")
    print()
    
-    asyncio.run(start_gateway())
+    # Exit with code 1 if gateway fails to connect any platform,
+    # so systemd Restart=on-failure will retry on transient errors
+    success = asyncio.run(start_gateway())
+    if not success:
+        sys.exit(1)


 # =============================================================================
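Review note: the change above relies on `start_gateway()` returning a truthy value on success. The sketch below shows the same pattern in isolation; `start_gateway_stub` and `exit_code_for` are hypothetical stand-ins, not functions from this repository.

```python
import asyncio


async def start_gateway_stub(connected_platforms: int) -> bool:
    """Hypothetical stand-in for start_gateway(): True if any platform connected."""
    await asyncio.sleep(0)  # yield once, as a real coroutine would while connecting
    return connected_platforms > 0


def exit_code_for(connected_platforms: int) -> int:
    """Map the coroutine's boolean result to a process exit code."""
    success = asyncio.run(start_gateway_stub(connected_platforms))
    # A non-zero code is what lets systemd's Restart=on-failure trigger a retry.
    return 0 if success else 1
```

A real entry point would pass the result to `sys.exit(...)`, as the hunk does with `sys.exit(1)`.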
--- a/hermes_cli/main.py
+++ b/hermes_cli/main.py
@@ -119,6 +119,7 @@ def cmd_uninstall(args):
 def cmd_update(args):
    """Update Hermes Agent to the latest version."""
    import subprocess
+    import shutil
    
    print("🦋 Updating Hermes Agent...")
    print()
@@ -163,13 +164,21 @@ def cmd_update(args):
        print("→ Pulling updates...")
        subprocess.run(["git", "pull", "origin", branch], cwd=PROJECT_ROOT, check=True)
        
-        # Reinstall Python dependencies
+        # Reinstall Python dependencies (prefer uv for speed, fall back to pip)
        print("→ Updating Python dependencies...")
-        venv_pip = PROJECT_ROOT / "venv" / "bin" / "pip"
-        if venv_pip.exists():
-            subprocess.run([str(venv_pip), "install", "-e", ".", "--quiet"], cwd=PROJECT_ROOT, check=True)
+        uv_bin = shutil.which("uv")
+        if uv_bin:
+            subprocess.run(
+                [uv_bin, "pip", "install", "-e", ".", "--quiet"],
+                cwd=PROJECT_ROOT, check=True,
+                env={**os.environ, "VIRTUAL_ENV": str(PROJECT_ROOT / "venv")}
+            )
        else:
-            subprocess.run(["pip", "install", "-e", ".", "--quiet"], cwd=PROJECT_ROOT, check=True)
+            venv_pip = PROJECT_ROOT / "venv" / "bin" / "pip"
+            if venv_pip.exists():
+                subprocess.run([str(venv_pip), "install", "-e", ".", "--quiet"], cwd=PROJECT_ROOT, check=True)
+            else:
+                subprocess.run(["pip", "install", "-e", ".", "--quiet"], cwd=PROJECT_ROOT, check=True)
        
        # Check for Node.js deps
        if (PROJECT_ROOT / "package.json").exists():
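Review note: the installer selection in `cmd_update` has three branches (uv, the venv's pip, then whatever `pip` is on PATH). A minimal sketch of that decision as a pure function follows; `build_install_command` is a hypothetical helper, and it only builds the command rather than running it.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def build_install_command(
    project_root: Path,
    uv_bin: Optional[str],
    venv_pip_exists: bool,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, extra_env) for reinstalling the project in editable mode."""
    venv = project_root / "venv"
    if uv_bin:
        # uv needs to be told which environment to install into.
        return [uv_bin, "pip", "install", "-e", ".", "--quiet"], {"VIRTUAL_ENV": str(venv)}
    if venv_pip_exists:
        return [str(venv / "bin" / "pip"), "install", "-e", ".", "--quiet"], {}
    # Last resort: whichever pip is first on PATH.
    return ["pip", "install", "-e", ".", "--quiet"], {}
```

The caller would pass `shutil.which("uv")` as `uv_bin` and merge `extra_env` into `os.environ` before invoking `subprocess.run`.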
--- a/hermes_cli/setup.py
+++ b/hermes_cli/setup.py
@@ -186,6 +186,11 @@ def _print_setup_summary(config: dict, hermes_home):
    else:
        tool_status.append(("Image Generation", False, "FAL_KEY"))
    
+    # TTS (always available via Edge TTS; ElevenLabs/OpenAI are optional)
+    tool_status.append(("Text-to-Speech (Edge TTS)", True, None))
+    if get_env_value('ELEVENLABS_API_KEY'):
+        tool_status.append(("Text-to-Speech (ElevenLabs)", True, None))
+    
    # Tinker + WandB (RL training)
    if get_env_value('TINKER_API_KEY') and get_env_value('WANDB_API_KEY'):
        tool_status.append(("RL Training (Tinker)", True, None))
@@ -501,11 +506,12 @@ def run_setup_wizard(args):
    # =========================================================================
    print_header("Default Model")
    
-    current_model = config.get('model', 'anthropic/claude-sonnet-4')
+    current_model = config.get('model', 'anthropic/claude-opus-4.6')
    print_info(f"Current: {current_model}")
    
    model_choices = [
-        "anthropic/claude-sonnet-4.5 (recommended)",
+        "anthropic/claude-opus-4.6 (recommended)",
+        "anthropic/claude-sonnet-4.5",
        "anthropic/claude-opus-4.5",
        "openai/gpt-5.2",
        "openai/gpt-5.2-codex",
@@ -518,27 +524,31 @@ def run_setup_wizard(args):
        f"Keep current ({current_model})"
    ]
    
-    model_idx = prompt_choice("Select default model:", model_choices, 10)  # Default: keep current
+    model_idx = prompt_choice("Select default model:", model_choices, 11)  # Default: keep current
    
    model_map = {
-        0: "anthropic/claude-sonnet-4.5",
-        1: "anthropic/claude-opus-4.5",
-        2: "openai/gpt-5.2",
-        3: "openai/gpt-5.2-codex",
-        4: "google/gemini-3-pro-preview",
-        5: "google/gemini-3-flash-preview",
-        6: "z-ai/glm-4.7",
-        7: "moonshotai/kimi-k2.5",
-        8: "minimax/minimax-m2.1",
+        0: "anthropic/claude-opus-4.6",
+        1: "anthropic/claude-sonnet-4.5",
+        2: "anthropic/claude-opus-4.5",
+        3: "openai/gpt-5.2",
+        4: "openai/gpt-5.2-codex",
+        5: "google/gemini-3-pro-preview",
+        6: "google/gemini-3-flash-preview",
+        7: "z-ai/glm-4.7",
+        8: "moonshotai/kimi-k2.5",
+        9: "minimax/minimax-m2.1",
    }
    
    if model_idx in model_map:
        config['model'] = model_map[model_idx]
-    elif model_idx == 9:  # Custom
-        custom = prompt("Enter model name (e.g., anthropic/claude-sonnet-4.5)")
+        # Also update LLM_MODEL in .env so it stays in sync (cli.py reads .env first)
+        save_env_value("LLM_MODEL", model_map[model_idx])
+    elif model_idx == 10:  # Custom
+        custom = prompt("Enter model name (e.g., anthropic/claude-opus-4.6)")
        if custom:
            config['model'] = custom
-    # else: Keep current (model_idx == 10)
+            save_env_value("LLM_MODEL", custom)
+    # else: Keep current (model_idx == 11)
    
    # =========================================================================
    # Step 4: Terminal Backend
@@ -652,6 +662,32 @@ def run_setup_wizard(args):
        print_info("Modal Cloud Configuration:")
        print_info("Get credentials at: https://modal.com/settings")
        
+        # Check if swe-rex[modal] is installed, install if missing
+        try:
+            from swerex.deployment.modal import ModalDeployment
+            print_info("swe-rex[modal] package: installed ✓")
+        except ImportError:
+            print_info("Installing required package: swe-rex[modal]...")
+            import subprocess
+            import shutil
+            # Prefer uv for speed, fall back to pip
+            uv_bin = shutil.which("uv")
+            if uv_bin:
+                result = subprocess.run(
+                    [uv_bin, "pip", "install", "--python", sys.executable, "swe-rex[modal]>=1.4.0"],
+                    capture_output=True, text=True
+                )
+            else:
+                result = subprocess.run(
+                    [sys.executable, "-m", "pip", "install", "swe-rex[modal]>=1.4.0"],
+                    capture_output=True, text=True
+                )
+            if result.returncode == 0:
+                print_success("swe-rex[modal] installed (includes modal + boto3)")
+            else:
+                print_warning("Failed to install swe-rex[modal] — install manually:")
+                print_info('  uv pip install "swe-rex[modal]>=1.4.0"')
+        
        # Always show current status and allow reconfiguration
        current_token = get_env_value('MODAL_TOKEN_ID')
        if current_token:
@@ -917,6 +953,24 @@ def run_setup_wizard(args):
                save_env_value("BROWSERBASE_API_KEY", api_key)
            if project_id:
                save_env_value("BROWSERBASE_PROJECT_ID", project_id)
+            
+            # Check if Node.js dependencies are installed (required for browser tools)
+            import shutil
+            node_modules = PROJECT_ROOT / "node_modules" / "agent-browser"
+            if not node_modules.exists() and shutil.which("npm"):
+                print_info("    Installing Node.js dependencies for browser tools...")
+                import subprocess
+                result = subprocess.run(
+                    ["npm", "install", "--silent"],
+                    capture_output=True, text=True, cwd=str(PROJECT_ROOT)
+                )
+                if result.returncode == 0:
+                    print_success("    Node.js dependencies installed")
+                else:
+                    print_warning(f"    npm install failed; run manually: cd {PROJECT_ROOT} && npm install")
+            elif not node_modules.exists():
+                print_warning("    npm not found (install Node.js); browser tools require: npm install (in the hermes-agent directory)")
+            
            print_success("    Configured ✓")
    print()
    
@@ -942,6 +996,28 @@ def run_setup_wizard(args):
                print_success("    Configured ✓")
    print()
    
+    # ElevenLabs - Premium TTS
+    print_info("─" * 50)
+    print(color("  Text-to-Speech - ElevenLabs (Premium)", Colors.CYAN))
+    print_info("  Enables: Premium TTS voices (Edge TTS is free and works without a key)")
+    print_info("  Use case: High-quality, customizable voice synthesis")
+    if get_env_value('ELEVENLABS_API_KEY'):
+        print_success("  Status: Configured ✓")
+        if prompt_yes_no("  Update ElevenLabs API key?", False):
+            api_key = prompt("    API key", password=True)
+            if api_key:
+                save_env_value("ELEVENLABS_API_KEY", api_key)
+                print_success("    Updated")
+    else:
+        print_warning("  Status: Not configured (free Edge TTS will be used by default)")
+        if prompt_yes_no("  Set up ElevenLabs?", False):
+            print_info("    Get your API key at: https://elevenlabs.io/")
+            api_key = prompt("    API key", password=True)
+            if api_key:
+                save_env_value("ELEVENLABS_API_KEY", api_key)
+                print_success("    Configured ✓")
+    print()
+    
    # Tinker + WandB - RL Training
    print_info("─" * 50)
    print(color("  RL Training (Tinker + WandB)", Colors.CYAN))
@@ -950,6 +1026,11 @@ def run_setup_wizard(args):
    tinker_configured = get_env_value('TINKER_API_KEY')
    wandb_configured = get_env_value('WANDB_API_KEY')
    
+    # Check Python version requirement upfront
+    rl_python_ok = sys.version_info >= (3, 11)
+    if not rl_python_ok:
+        print_warning(f"  Requires Python 3.11+ (current: {sys.version_info.major}.{sys.version_info.minor})")
+    
    if tinker_configured and wandb_configured:
        print_success("  Status: Configured ✓")
        if prompt_yes_no("  Update RL training credentials?", False):
@@ -969,18 +1050,55 @@ def run_setup_wizard(args):
            print_warning("  Status: Not configured (tools will be disabled)")
        
        if prompt_yes_no("  Set up RL Training?", False):
-            print_info("    Get Tinker key at: https://tinker-console.thinkingmachines.ai/keys")
-            print_info("    Get WandB key at: https://wandb.ai/authorize")
-            api_key = prompt("    Tinker API key", password=True)
-            if api_key:
-                save_env_value("TINKER_API_KEY", api_key)
-            wandb_key = prompt("    WandB API key", password=True)
-            if wandb_key:
-                save_env_value("WANDB_API_KEY", wandb_key)
-            if api_key and wandb_key:
-                print_success("    Configured ✓")
+            # Check Python version before proceeding
+            if not rl_python_ok:
+                print_error(f"    Python 3.11+ required (current: {sys.version_info.major}.{sys.version_info.minor})")
+                print_info("    Upgrade Python and reinstall to enable RL training tools")
            else:
-                print_warning("    Partially configured (both keys required)")
+                print_info("    Get Tinker key at: https://tinker-console.thinkingmachines.ai/keys")
+                print_info("    Get WandB key at: https://wandb.ai/authorize")
+                api_key = prompt("    Tinker API key", password=True)
+                if api_key:
+                    save_env_value("TINKER_API_KEY", api_key)
+                wandb_key = prompt("    WandB API key", password=True)
+                if wandb_key:
+                    save_env_value("WANDB_API_KEY", wandb_key)
+                
+                # Check if tinker-atropos submodule is installed
+                try:
+                    __import__("tinker_atropos")
+                except ImportError:
+                    tinker_dir = PROJECT_ROOT / "tinker-atropos"
+                    if tinker_dir.exists() and (tinker_dir / "pyproject.toml").exists():
+                        print_info("    Installing tinker-atropos submodule...")
+                        import subprocess
+                        import shutil
+                        # Prefer uv for speed, fall back to pip
+                        uv_bin = shutil.which("uv")
+                        if uv_bin:
+                            result = subprocess.run(
+                                [uv_bin, "pip", "install", "--python", sys.executable, "-e", str(tinker_dir)],
+                                capture_output=True, text=True
+                            )
+                        else:
+                            result = subprocess.run(
+                                [sys.executable, "-m", "pip", "install", "-e", str(tinker_dir)],
+                                capture_output=True, text=True
+                            )
+                        if result.returncode == 0:
+                            print_success("    tinker-atropos installed")
+                        else:
+                            print_warning("    tinker-atropos install failed — run manually:")
+                            print_info('      uv pip install -e "./tinker-atropos"')
+                    else:
+                        print_warning("    tinker-atropos submodule not found — run:")
+                        print_info("      git submodule update --init --recursive")
+                        print_info('      uv pip install -e "./tinker-atropos"')
+                
+                if api_key and wandb_key:
+                    print_success("    Configured ✓")
+                else:
+                    print_warning("    Partially configured (both keys required)")
    
    # =========================================================================
    # Save config and show summary
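Reviewer note: the setup hunk above duplicates the `subprocess.run` call in both branches of the uv/pip choice. A minimal sketch of one way to factor out the command selection is below. The helper name `build_editable_install_cmd` and the example uv path are hypothetical and are not part of this commit.

```python
import sys
from pathlib import Path
from typing import List, Optional


def build_editable_install_cmd(package_dir: Path, uv_bin: Optional[str]) -> List[str]:
    """Return the argv for an editable install of package_dir.

    Hypothetical helper: prefers uv when a path to it is given,
    otherwise falls back to pip from the current interpreter.
    """
    if uv_bin:
        return [uv_bin, "pip", "install", "-e", str(package_dir)]
    return [sys.executable, "-m", "pip", "install", "-e", str(package_dir)]


# Example paths are placeholders; real code would pass shutil.which("uv").
uv_cmd = build_editable_install_cmd(Path("tinker-atropos"), "/usr/bin/uv")
pip_cmd = build_editable_install_cmd(Path("tinker-atropos"), None)
```

With a helper like this, the caller would need only a single `subprocess.run(cmd, capture_output=True, text=True)` followed by the existing return-code check.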
--- a/hermes_cli/status.py
+++ b/hermes_cli/status.py
@@ -76,6 +76,7 @@ def show_status(args):
        "FAL": "FAL_KEY",
        "Tinker": "TINKER_API_KEY",
        "WandB": "WANDB_API_KEY",
+        "ElevenLabs": "ELEVENLABS_API_KEY",
    }
    
    for name, env_var in keys.items():
--- a/model_tools.py
+++ b/model_tools.py
@@ -41,7 +41,7 @@ from tools.terminal_hecate import terminal_hecate_tool, check_hecate_requirement
 from tools.vision_tools import vision_analyze_tool, check_vision_requirements
 from tools.mixture_of_agents_tool import mixture_of_agents_tool, check_moa_requirements
 from tools.image_generation_tool import image_generate_tool, check_image_generation_requirements
-from tools.skills_tool import skills_categories, skills_list, skill_view, check_skills_requirements, SKILLS_TOOL_DESCRIPTION
+from tools.skills_tool import skills_list, skill_view, check_skills_requirements, SKILLS_TOOL_DESCRIPTION
 # RL Training tools (Tinker-Atropos)
 from tools.rl_training_tool import (
    rl_list_environments,
@@ -83,6 +83,8 @@ from tools.browser_tool import (
    check_browser_requirements,
    BROWSER_TOOL_SCHEMAS
 )
+# Text-to-speech tool (Edge TTS / ElevenLabs / OpenAI)
+from tools.tts_tool import text_to_speech_tool, check_tts_requirements
 from toolsets import (
    get_toolset, resolve_toolset, resolve_multiple_toolsets,
    get_all_toolsets, get_toolset_names, validate_toolset,
@@ -143,7 +145,7 @@ TOOLSET_REQUIREMENTS = {
        "env_vars": [],  # Just needs skills directory
        "check_fn": check_skills_requirements,
        "setup_url": None,
-        "tools": ["skills_categories", "skills_list", "skill_view"],
+        "tools": ["skills_list", "skill_view"],
    },
    "rl": {
        "name": "RL Training (Tinker-Atropos)",
@@ -165,6 +167,13 @@ TOOLSET_REQUIREMENTS = {
        "setup_url": None,
        "tools": ["read_file", "write_file", "patch", "search"],
    },
+    "tts": {
+        "name": "Text-to-Speech",
+        "env_vars": [],  # Edge TTS needs no key; premium providers checked at runtime
+        "check_fn": check_tts_requirements,
+        "setup_url": None,
+        "tools": ["text_to_speech"],
+    },
 }


@@ -392,7 +401,7 @@ def get_image_tool_definitions() -> List[Dict[str, Any]]:
            "type": "function",
            "function": {
                "name": "image_generate",
-                "description": "Generate high-quality images from text prompts using FLUX 2 Pro model with automatic 2x upscaling. Creates detailed, artistic images that are automatically upscaled for hi-rez results. Returns a single upscaled image URL that can be displayed using <img src=\"{URL}\"></img> tags.",
+                "description": "Generate high-quality images from text prompts using FLUX 2 Pro model with automatic 2x upscaling. Creates detailed, artistic images that are automatically upscaled for high-resolution results. Returns a single upscaled image URL. Display it using markdown: ![description](URL)",
                "parameters": {
                    "type": "object",
                    "properties": {
@@ -432,24 +441,7 @@ def get_skills_tool_definitions() -> List[Dict[str, Any]]:
                    "properties": {
                        "category": {
                            "type": "string",
-                            "description": "Optional category filter (from skills_categories)"
-                        }
-                    },
-                    "required": []
-                }
-            }
-        },
-        {
-            "type": "function",
-            "function": {
-                "name": "skills_categories",
-                "description": "List available skill categories. Call this first to discover what skill categories exist, then use skills_list(category) to see skills in a category.",
-                "parameters": {
-                    "type": "object",
-                    "properties": {
-                        "verbose": {
-                            "type": "boolean",
-                            "description": "If true, include skill counts per category. Default: false."
+                            "description": "Optional category filter to narrow results"
                        }
                    },
                    "required": []
@@ -700,13 +692,21 @@ def get_file_tool_definitions() -> List[Dict[str, Any]]:
            "type": "function",
            "function": {
                "name": "read_file",
-                "description": "Read a file with pagination support. Returns content with line numbers in 'LINE_NUM|CONTENT' format. For binary files (images), returns base64-encoded data. If file not found, suggests similar filenames.",
+                "description": (
+                    "Read a file with pagination support. Preferred over 'cat' in the terminal because it "
+                    "provides line numbers, handles binary/image files, and suggests similar filenames if "
+                    "the file is not found.\n\n"
+                    "**Output format:** Each line is returned as 'LINE_NUM|CONTENT' for easy reference.\n"
+                    "**Binary files:** Detected automatically; images (png/jpg/gif/webp) are returned as base64 with MIME type and dimensions.\n"
+                    "**Large files:** Use offset and limit to paginate. The response includes total line count and a hint for the next page.\n"
+                    "**Paths:** Supports absolute paths, relative paths (from working directory), and ~ expansion."
+                ),
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {
                            "type": "string",
-                            "description": "Path to the file to read (absolute or relative)"
+                            "description": "Path to the file to read (absolute, relative, or ~/path)"
                        },
                        "offset": {
                            "type": "integer",
@@ -729,17 +729,25 @@ def get_file_tool_definitions() -> List[Dict[str, Any]]:
            "type": "function",
            "function": {
                "name": "write_file",
-                "description": "Write content to a file. Creates parent directories automatically. Returns bytes written and lint check results for supported languages.",
+                "description": (
+                    "Write content to a file, completely replacing any existing content. Creates parent "
+                    "directories automatically if they don't exist. Preferred over 'echo' or heredoc in the "
+                    "terminal because it safely handles special characters, newlines, and shell metacharacters "
+                    "without escaping issues.\n\n"
+                    "**Important:** This OVERWRITES the entire file. To make targeted edits to an existing file, "
+                    "use the 'patch' tool instead.\n"
+                    "**Paths:** Supports absolute paths, relative paths, and ~ expansion."
+                ),
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {
                            "type": "string",
-                            "description": "Path to the file to write (will be created if doesn't exist)"
+                            "description": "Path to the file to write (will be created if it doesn't exist, overwritten if it does)"
                        },
                        "content": {
                            "type": "string",
-                            "description": "Content to write to the file"
+                            "description": "Complete content to write to the file"
                        }
                    },
                    "required": ["path", "content"]
@@ -750,36 +758,48 @@ def get_file_tool_definitions() -> List[Dict[str, Any]]:
            "type": "function",
            "function": {
                "name": "patch",
-                "description": "Modify files using either simple string replacement or V4A patch format. Mode 'replace' does find-and-replace with fuzzy matching. Mode 'patch' applies multi-file changes using V4A format (*** Begin/End Patch). Auto-runs syntax checks on modified files.",
+                "description": (
+                    "Modify existing files using targeted edits. Preferred over 'sed' or manual rewriting because "
+                    "it uses intelligent fuzzy matching that tolerates minor whitespace and indentation differences, "
+                    "and auto-runs syntax checks (Python, JS, TS, Go, Rust) after editing.\n\n"
+                    "**Replace mode (recommended):** Find a unique string in the file and replace it. Uses a "
+                    "9-strategy fuzzy matching chain (exact → line-trimmed → whitespace-normalized → "
+                    "indentation-flexible → context-aware) so small formatting differences won't cause failures. "
+                    "Returns a unified diff showing exactly what changed.\n\n"
+                    "**Patch mode:** Apply multi-file changes using V4A patch format for large-scale edits across "
+                    "multiple files in one call.\n\n"
+                    "**Auto-lint:** After every edit, automatically runs syntax checks and reports errors so you "
+                    "can fix them immediately."
+                ),
                "parameters": {
                    "type": "object",
                    "properties": {
                        "mode": {
                            "type": "string",
                            "enum": ["replace", "patch"],
-                            "description": "Edit mode: 'replace' for string replacement, 'patch' for V4A patch format",
+                            "description": "Edit mode: 'replace' for targeted find-and-replace, 'patch' for V4A multi-file patches",
                            "default": "replace"
                        },
                        "path": {
                            "type": "string",
-                            "description": "File path (required for 'replace' mode)"
+                            "description": "File path to edit (required for 'replace' mode)"
                        },
                        "old_string": {
                            "type": "string",
-                            "description": "Text to find and replace (required for 'replace' mode). Must be unique in file unless replace_all=true"
+                            "description": "Text to find in the file (required for 'replace' mode). Must be unique in the file unless replace_all=true. Include enough surrounding context to ensure uniqueness."
                        },
                        "new_string": {
                            "type": "string",
-                            "description": "Replacement text (required for 'replace' mode)"
+                            "description": "Replacement text (required for 'replace' mode). Can be empty string to delete the matched text."
                        },
                        "replace_all": {
                            "type": "boolean",
-                            "description": "Replace all occurrences instead of requiring unique match (default: false)",
+                            "description": "Replace all occurrences instead of requiring a unique match (default: false)",
                            "default": False
                        },
                        "patch": {
                            "type": "string",
-                            "description": "V4A format patch content (required for 'patch' mode). Format: *** Begin Patch / *** Update File: path / @@ context @@ / -removed / +added / *** End Patch"
+                            "description": "V4A format patch content (required for 'patch' mode). Format:\n*** Begin Patch\n*** Update File: path/to/file\n@@ context hint @@\n context line\n-removed line\n+added line\n*** End Patch"
                        }
                    },
                    "required": ["mode"]
@@ -790,7 +810,16 @@ def get_file_tool_definitions() -> List[Dict[str, Any]]:
            "type": "function",
            "function": {
                "name": "search",
-                "description": "Search for content in files or search for files by name. Use target='content' to search inside files (like grep), or target='files' to find files by name pattern (like glob/find). Results sorted by modification time (newest first).",
+                "description": (
+                    "Search for content inside files or find files by name. Preferred over 'grep' or 'find' "
+                    "in the terminal because it uses ripgrep (fast) with automatic fallback to grep, handles "
+                    "pagination, and returns structured results sorted by modification time (newest first).\n\n"
+                    "**Content search (target='content'):** Regex-powered search inside files with optional "
+                    "file type filtering and context lines. Three output modes: full matches with line numbers, "
+                    "file paths only, or match counts per file.\n\n"
+                    "**File search (target='files'):** Find files by glob pattern (e.g., '*.py', '*config*'). "
+                    "Results sorted by modification time so recently changed files appear first."
+                ),
                "parameters": {
                    "type": "object",
                    "properties": {
@@ -801,12 +830,12 @@ def get_file_tool_definitions() -> List[Dict[str, Any]]:
                        "target": {
                            "type": "string",
                            "enum": ["content", "files"],
-                            "description": "Search mode: 'content' searches inside files, 'files' searches for files by name",
+                            "description": "Search mode: 'content' searches inside files (like grep/rg), 'files' searches for files by name (like find/glob)",
                            "default": "content"
                        },
                        "path": {
                            "type": "string",
-                            "description": "Directory or file to search in (default: current directory)",
+                            "description": "Directory or file to search in (default: current working directory)",
                            "default": "."
                        },
                        "file_glob": {
@@ -815,7 +844,7 @@ def get_file_tool_definitions() -> List[Dict[str, Any]]:
                        },
                        "limit": {
                            "type": "integer",
-                            "description": "Maximum number of results (default: 50)",
+                            "description": "Maximum number of results to return (default: 50)",
                            "default": 50
                        },
                        "offset": {
@@ -826,12 +855,12 @@ def get_file_tool_definitions() -> List[Dict[str, Any]]:
                        "output_mode": {
                            "type": "string",
                            "enum": ["content", "files_only", "count"],
-                            "description": "For target='content': 'content' shows matches, 'files_only' shows file paths, 'count' shows match counts per file",
+                            "description": "Output format for content search: 'content' shows matching lines with line numbers, 'files_only' lists file paths, 'count' shows match counts per file",
                            "default": "content"
                        },
                        "context": {
                            "type": "integer",
-                            "description": "Lines of context around matches (only for target='content', output_mode='content')",
+                            "description": "Number of lines to show before and after each match (only for target='content', output_mode='content')",
                            "default": 0
                        }
                    },
@@ -842,6 +871,38 @@ def get_file_tool_definitions() -> List[Dict[str, Any]]:
    ]


+def get_tts_tool_definitions() -> List[Dict[str, Any]]:
+    """
+    Get tool definitions for text-to-speech tools in OpenAI's expected format.
+    
+    Returns:
+        List[Dict]: List of TTS tool definitions compatible with OpenAI API
+    """
+    return [
+        {
+            "type": "function",
+            "function": {
+                "name": "text_to_speech",
+                "description": "Convert text to speech audio. Returns a MEDIA: path that the platform delivers as a voice message. On Telegram it plays as a voice bubble; on Discord/WhatsApp it is sent as an audio attachment. In CLI mode, the audio is saved to ~/voice-memos/. Voice and provider are user-configured, not model-selected.",
+                "parameters": {
+                    "type": "object",
+                    "properties": {
+                        "text": {
+                            "type": "string",
+                            "description": "The text to convert to speech. Keep under 4000 characters."
+                        },
+                        "output_path": {
+                            "type": "string",
+                            "description": "Optional custom file path to save the audio. Defaults to ~/voice-memos/<timestamp>.mp3"
+                        }
+                    },
+                    "required": ["text"]
+                }
+            }
+        }
+    ]
+
+
 def get_all_tool_names() -> List[str]:
    """
    Get the names of all available tools across all toolsets.
@@ -873,7 +934,7 @@ def get_all_tool_names() -> List[str]:
    
    # Skills tools
    if check_skills_requirements():
-        tool_names.extend(["skills_categories", "skills_list", "skill_view"])
+        tool_names.extend(["skills_list", "skill_view"])
    
    # Browser automation tools
    if check_browser_requirements():
@@ -906,9 +967,61 @@ def get_all_tool_names() -> List[str]:
            "read_file", "write_file", "patch", "search"
        ])
    
+    # Text-to-speech tools
+    if check_tts_requirements():
+        tool_names.extend(["text_to_speech"])
+    
    return tool_names


+# Master mapping of every tool name → its toolset.
+# This is the single source of truth for all valid tool names in the system.
+# Import TOOL_TO_TOOLSET_MAP from here whenever you need to check valid tools.
+TOOL_TO_TOOLSET_MAP = {
+    "web_search": "web_tools",
+    "web_extract": "web_tools",
+    "terminal": "terminal_tools",
+    "vision_analyze": "vision_tools",
+    "mixture_of_agents": "moa_tools",
+    "image_generate": "image_tools",
+    # Skills tools
+    "skills_list": "skills_tools",
+    "skill_view": "skills_tools",
+    # Browser automation tools
+    "browser_navigate": "browser_tools",
+    "browser_snapshot": "browser_tools",
+    "browser_click": "browser_tools",
+    "browser_type": "browser_tools",
+    "browser_scroll": "browser_tools",
+    "browser_back": "browser_tools",
+    "browser_press": "browser_tools",
+    "browser_close": "browser_tools",
+    "browser_get_images": "browser_tools",
+    "browser_vision": "browser_tools",
+    # Cronjob management tools
+    "schedule_cronjob": "cronjob_tools",
+    "list_cronjobs": "cronjob_tools",
+    "remove_cronjob": "cronjob_tools",
+    # RL Training tools
+    "rl_list_environments": "rl_tools",
+    "rl_select_environment": "rl_tools",
+    "rl_get_current_config": "rl_tools",
+    "rl_edit_config": "rl_tools",
+    "rl_start_training": "rl_tools",
+    "rl_check_status": "rl_tools",
+    "rl_stop_training": "rl_tools",
+    "rl_get_results": "rl_tools",
+    "rl_list_runs": "rl_tools",
+    # Text-to-speech tools
+    "text_to_speech": "tts_tools",
+    # File manipulation tools
+    "read_file": "file_tools",
+    "write_file": "file_tools",
+    "patch": "file_tools",
+    "search": "file_tools",
+}
+
+
 def get_toolset_for_tool(tool_name: str) -> str:
    """
    Get the toolset that a tool belongs to.
@@ -919,50 +1032,7 @@ def get_toolset_for_tool(tool_name: str) -> str:
    Returns:
        str: Name of the toolset, or "unknown" if not found
    """
-    toolset_mapping = {
-        "web_search": "web_tools",
-        "web_extract": "web_tools",
-        "terminal": "terminal_tools",
-        "vision_analyze": "vision_tools",
-        "mixture_of_agents": "moa_tools",
-        "image_generate": "image_tools",
-        # Skills tools
-        "skills_categories": "skills_tools",
-        "skills_list": "skills_tools",
-        "skill_view": "skills_tools",
-        # Browser automation tools
-        "browser_navigate": "browser_tools",
-        "browser_snapshot": "browser_tools",
-        "browser_click": "browser_tools",
-        "browser_type": "browser_tools",
-        "browser_scroll": "browser_tools",
-        "browser_back": "browser_tools",
-        "browser_press": "browser_tools",
-        "browser_close": "browser_tools",
-        "browser_get_images": "browser_tools",
-        "browser_vision": "browser_tools",
-        # Cronjob management tools
-        "schedule_cronjob": "cronjob_tools",
-        "list_cronjobs": "cronjob_tools",
-        "remove_cronjob": "cronjob_tools",
-        # RL Training tools
-        "rl_list_environments": "rl_tools",
-        "rl_select_environment": "rl_tools",
-        "rl_get_current_config": "rl_tools",
-        "rl_edit_config": "rl_tools",
-        "rl_start_training": "rl_tools",
-        "rl_check_status": "rl_tools",
-        "rl_stop_training": "rl_tools",
-        "rl_get_results": "rl_tools",
-        "rl_list_runs": "rl_tools",
-        # File manipulation tools
-        "read_file": "file_tools",
-        "write_file": "file_tools",
-        "patch": "file_tools",
-        "search": "file_tools",
-    }
-    
-    return toolset_mapping.get(tool_name, "unknown")
+    return TOOL_TO_TOOLSET_MAP.get(tool_name, "unknown")


 def get_tool_definitions(
@@ -1047,6 +1117,11 @@ def get_tool_definitions(
        for tool in get_file_tool_definitions():
            all_available_tools_map[tool["function"]["name"]] = tool
    
+    # Text-to-speech tools
+    if check_tts_requirements():
+        for tool in get_tts_tool_definitions():
+            all_available_tools_map[tool["function"]["name"]] = tool
+    
    # Determine which tools to include based on toolsets
    tools_to_include = set()
    
@@ -1068,7 +1143,7 @@ def get_tool_definitions(
                        "vision_tools": ["vision_analyze"],
                        "moa_tools": ["mixture_of_agents"],
                        "image_tools": ["image_generate"],
-                        "skills_tools": ["skills_categories", "skills_list", "skill_view"],
+                        "skills_tools": ["skills_list", "skill_view"],
                        "browser_tools": [
                            "browser_navigate", "browser_snapshot", "browser_click",
                            "browser_type", "browser_scroll", "browser_back",
@@ -1083,7 +1158,8 @@ def get_tool_definitions(
                            "rl_stop_training", "rl_get_results",
                            "rl_list_runs", "rl_test_inference"
                        ],
-                        "file_tools": ["read_file", "write_file", "patch", "search"]
+                        "file_tools": ["read_file", "write_file", "patch", "search"],
+                        "tts_tools": ["text_to_speech"]
                    }
                    legacy_tools = legacy_map.get(toolset_name, [])
                    tools_to_include.update(legacy_tools)
@@ -1121,7 +1197,7 @@ def get_tool_definitions(
                        "vision_tools": ["vision_analyze"],
                        "moa_tools": ["mixture_of_agents"],
                        "image_tools": ["image_generate"],
-                        "skills_tools": ["skills_categories", "skills_list", "skill_view"],
+                        "skills_tools": ["skills_list", "skill_view"],
                        "browser_tools": [
                            "browser_navigate", "browser_snapshot", "browser_click",
                            "browser_type", "browser_scroll", "browser_back",
@@ -1136,7 +1212,8 @@ def get_tool_definitions(
                            "rl_stop_training", "rl_get_results",
                            "rl_list_runs", "rl_test_inference"
                        ],
-                        "file_tools": ["read_file", "write_file", "patch", "search"]
+                        "file_tools": ["read_file", "write_file", "patch", "search"],
+                        "tts_tools": ["text_to_speech"]
                    }
                    legacy_tools = legacy_map.get(toolset_name, [])
                    tools_to_include.difference_update(legacy_tools)
@@ -1191,8 +1268,19 @@ def handle_web_function_call(function_name: str, function_args: Dict[str, Any])
        urls = function_args.get("urls", [])
        # Limit URLs to prevent abuse
        urls = urls[:5] if isinstance(urls, list) else []
-        # Run async function in event loop
-        return asyncio.run(web_extract_tool(urls, "markdown"))
+        # Run async function -- use existing loop if available (Atropos),
+        # otherwise create one (normal CLI)
+        try:
+            asyncio.get_running_loop()
+        except RuntimeError:
+            # No running loop (normal CLI) -- use asyncio.run directly
+            return asyncio.run(web_extract_tool(urls, "markdown"))
+        # Already in an async context (Atropos) -- run in a worker thread
+        import concurrent.futures
+        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+            return pool.submit(
+                lambda: asyncio.run(web_extract_tool(urls, "markdown"))
+            ).result(timeout=120)
    
    else:
        return json.dumps({"error": f"Unknown web function: {function_name}"}, ensure_ascii=False)
@@ -1339,11 +1427,7 @@ def handle_skills_function_call(function_name: str, function_args: Dict[str, Any
    Returns:
        str: Function result as JSON string
    """
-    if function_name == "skills_categories":
-        verbose = function_args.get("verbose", False)
-        return skills_categories(verbose=verbose)
-    
-    elif function_name == "skills_list":
+    if function_name == "skills_list":
        category = function_args.get("category")
        return skills_list(category=category)
    
@@ -1587,6 +1671,28 @@ def handle_file_function_call(
    return json.dumps({"error": f"Unknown file function: {function_name}"}, ensure_ascii=False)


+def handle_tts_function_call(
+    function_name: str,
+    function_args: Dict[str, Any]
+) -> str:
+    """
+    Handle function calls for text-to-speech tools.
+    
+    Args:
+        function_name (str): Name of the TTS function to call
+        function_args (Dict): Arguments for the function
+    
+    Returns:
+        str: Function result as JSON string
+    """
+    if function_name == "text_to_speech":
+        text = function_args.get("text", "")
+        output_path = function_args.get("output_path")
+        return text_to_speech_tool(text=text, output_path=output_path)
+    
+    return json.dumps({"error": f"Unknown TTS function: {function_name}"}, ensure_ascii=False)
+
+
 def handle_function_call(
    function_name: str, 
    function_args: Dict[str, Any], 
@@ -1634,7 +1740,7 @@ def handle_function_call(
            return handle_image_function_call(function_name, function_args)

        # Route skills tools
-        elif function_name in ["skills_categories", "skills_list", "skill_view"]:
+        elif function_name in ["skills_list", "skill_view"]:
            return handle_skills_function_call(function_name, function_args)

        # Route browser automation tools
@@ -1664,6 +1770,10 @@ def handle_function_call(
        elif function_name in ["read_file", "write_file", "patch", "search"]:
            return handle_file_function_call(function_name, function_args, task_id)

+        # Route text-to-speech tools
+        elif function_name in ["text_to_speech"]:
+            return handle_tts_function_call(function_name, function_args)
+
        else:
            error_msg = f"Unknown function: {function_name}"
            print(f"❌ {error_msg}")
@@ -1715,7 +1825,7 @@ def get_available_toolsets() -> Dict[str, Dict[str, Any]]:
        },
        "skills_tools": {
            "available": check_skills_requirements(),
-            "tools": ["skills_categories", "skills_list", "skill_view"],
+            "tools": ["skills_list", "skill_view"],
            "description": "Access skill documents that provide specialized instructions, guidelines, or knowledge the agent can load on demand",
            "requirements": ["skills/ directory in repo root"]
        },
@@ -1741,6 +1851,12 @@ def get_available_toolsets() -> Dict[str, Dict[str, Any]]:
            "tools": ["read_file", "write_file", "patch", "search"],
            "description": "File manipulation tools: read/write files, search content/files, patch with fuzzy matching",
            "requirements": ["Terminal backend available (local/docker/ssh/singularity/modal)"]
+        },
+        "tts_tools": {
+            "available": check_tts_requirements(),
+            "tools": ["text_to_speech"],
+            "description": "Text-to-speech: convert text to audio (Edge TTS free, ElevenLabs, OpenAI)",
+            "requirements": ["edge-tts package (free) or ELEVENLABS_API_KEY or OPENAI_API_KEY"]
        }
    }
    
@@ -1762,7 +1878,8 @@ def check_toolset_requirements() -> Dict[str, bool]:
        "skills_tools": check_skills_requirements(),
        "browser_tools": check_browser_requirements(),
        "cronjob_tools": check_cronjob_requirements(),
-        "file_tools": check_file_requirements()
+        "file_tools": check_file_requirements(),
+        "tts_tools": check_tts_requirements()
    }

 if __name__ == "__main__":
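Reviewer note: the `web_extract` change in this file runs a coroutine from synchronous code whether or not an event loop is already running. The standalone sketch below illustrates that pattern under stated assumptions. `run_coro_sync` and `fake_extract` are hypothetical names used only for illustration; `fake_extract` stands in for `web_extract_tool`.

```python
import asyncio
import concurrent.futures


def run_coro_sync(coro_factory, timeout=120):
    """Run a coroutine to completion from synchronous code.

    coro_factory is a zero-argument callable returning a fresh coroutine
    (hypothetical helper, not part of model_tools.py).
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro_factory())
    # A loop is already running here and asyncio.run would raise, so run
    # the coroutine on a fresh loop in a worker thread instead.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(lambda: asyncio.run(coro_factory()))
        return future.result(timeout=timeout)


async def fake_extract(urls):
    # Stand-in for web_extract_tool: yields once, then echoes its input.
    await asyncio.sleep(0)
    return {"urls": urls}


async def caller_inside_loop():
    # Simulates an async host calling the synchronous helper.
    return run_coro_sync(lambda: fake_extract(["https://example.com"]))


sync_result = run_coro_sync(lambda: fake_extract(["a"]))
nested_result = asyncio.run(caller_inside_loop())
```

Only the loop probe sits inside the `try`, so a `RuntimeError` raised by the tool itself propagates to the caller instead of triggering the fallback branch. The threaded path still blocks the calling loop until the result arrives, which may matter for hosts that run many rollouts on a single loop.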
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -22,6 +22,8 @@ dependencies = [
  "requests",
  "jinja2",
  "pydantic>=2.0",
+  # Interactive CLI (prompt_toolkit is used directly by cli.py)
+  "prompt_toolkit",
  # Tools
  "firecrawl-py",
  "fal-client",
@@ -32,12 +34,18 @@ dependencies = [
 ]

 [project.optional-dependencies]
-modal = ["modal", "boto3"]
+modal = ["swe-rex[modal]>=1.4.0"]
 dev = ["pytest", "pytest-asyncio"]
-messaging = ["python-telegram-bot>=20.0", "discord.py>=2.0"]
+messaging = ["python-telegram-bot>=20.0", "discord.py>=2.0", "aiohttp>=3.9.0"]
 cron = ["croniter"]
 cli = ["simple-term-menu"]
-all = ["croniter", "python-telegram-bot>=20.0", "discord.py>=2.0", "simple-term-menu"]
+all = [
+  "hermes-agent[modal]",
+  "hermes-agent[messaging]",
+  "hermes-agent[cron]",
+  "hermes-agent[cli]",
+  "hermes-agent[dev]",
+]

 [project.scripts]
 hermes = "hermes_cli.main:main"
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,6 +6,10 @@ httpx
 rich
 tenacity
 prompt_toolkit
+pyyaml
+requests
+jinja2
+pydantic>=2.0

 # Web tools
 firecrawl-py
@@ -15,10 +19,6 @@ fal-client

 # mini-swe-agent dependencies (for terminal tool)
 # Note: Install mini-swe-agent itself with: pip install -e ./mini-swe-agent
-pyyaml
-requests
-jinja2
-pydantic>=2.0
 litellm>=1.75.5
 typer
 platformdirs
@@ -27,18 +27,23 @@ platformdirs
 # Requires Docker installed and user in 'docker' group

 # Optional: For Modal backend (cloud execution)
-# modal
-# boto3
+# swe-rex[modal]>=1.4.0  # Includes modal + boto3 + swe-rex runtime
+
+# Text-to-speech (Edge TTS is free, no API key needed)
+edge-tts
+
+# Optional: Premium TTS providers
+# elevenlabs  # Uncomment if using ElevenLabs TTS (needs ELEVENLABS_API_KEY)

 # Optional: For cron expression parsing (cronjob scheduling)
 croniter

 # Optional: For messaging platform integrations (gateway)
-# Telegram: pip install python-telegram-bot
+# Telegram
 python-telegram-bot>=20.0

-# Discord: pip install discord.py
+# Discord
 discord.py>=2.0

-# WhatsApp: Requires Node.js bridge (see docs/messaging.md)
-# aiohttp  # For WhatsApp bridge communication
+# WhatsApp bridge communication + general async HTTP (used by gateway)
+aiohttp>=3.9.0
--- a/run_agent.py
+++ b/run_agent.py
--- a/scripts/install.ps1
+++ b/scripts/install.ps1
@@ -2,6 +2,7 @@
 # Hermes Agent Installer for Windows
 # ============================================================================
 # Installation script for Windows (PowerShell).
+# Uses uv for fast Python provisioning and package management.
 #
 # Usage:
 #   irm https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.ps1 | iex
@@ -27,6 +28,7 @@ $ErrorActionPreference = "Stop"

 $RepoUrlSsh = "git@github.com:NousResearch/hermes-agent.git"
 $RepoUrlHttps = "https://github.com/NousResearch/hermes-agent.git"
+$PythonVersion = "3.11"

 # ============================================================================
 # Helper functions
@@ -52,12 +54,12 @@ function Write-Success {
    Write-Host "✓ $Message" -ForegroundColor Green
 }

-function Write-Warning {
+function Write-Warn {
    param([string]$Message)
    Write-Host "⚠ $Message" -ForegroundColor Yellow
 }

-function Write-Error {
+function Write-Err {
    param([string]$Message)
    Write-Host "✗ $Message" -ForegroundColor Red
 }
@@ -66,33 +68,93 @@ function Write-Error {
 # Dependency checks
 # ============================================================================

-function Test-Python {
-    Write-Info "Checking Python..."
+function Install-Uv {
+    Write-Info "Checking for uv package manager..."
    
-    # Try different python commands
-    $pythonCmds = @("python3", "python", "py -3")
+    # Check if uv is already available
+    if (Get-Command uv -ErrorAction SilentlyContinue) {
+        $version = uv --version
+        $script:UvCmd = "uv"
+        Write-Success "uv found ($version)"
+        return $true
+    }
    
-    foreach ($cmd in $pythonCmds) {
-        try {
-            $version = & $cmd.Split()[0] $cmd.Split()[1..99] -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')" 2>$null
-            if ($version) {
-                $major, $minor = $version.Split('.')
-                if ([int]$major -ge 3 -and [int]$minor -ge 10) {
-                    $script:PythonCmd = $cmd
-                    Write-Success "Python $version found"
-                    return $true
-                }
-            }
-        } catch {
-            # Try next command
+    # Check common install locations
+    $uvPaths = @(
+        "$env:USERPROFILE\.local\bin\uv.exe",
+        "$env:USERPROFILE\.cargo\bin\uv.exe"
+    )
+    foreach ($uvPath in $uvPaths) {
+        if (Test-Path $uvPath) {
+            $script:UvCmd = $uvPath
+            $version = & $uvPath --version
+            Write-Success "uv found at $uvPath ($version)"
+            return $true
        }
    }
    
-    Write-Error "Python 3.10+ not found"
-    Write-Info "Please install Python 3.10 or newer from:"
-    Write-Info "  https://www.python.org/downloads/"
-    Write-Info ""
-    Write-Info "Make sure to check 'Add Python to PATH' during installation"
+    # Install uv
+    Write-Info "Installing uv (fast Python package manager)..."
+    try {
+        powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" 2>&1 | Out-Null
+        
+        # Find the installed binary
+        $uvExe = "$env:USERPROFILE\.local\bin\uv.exe"
+        if (-not (Test-Path $uvExe)) {
+            $uvExe = "$env:USERPROFILE\.cargo\bin\uv.exe"
+        }
+        if (-not (Test-Path $uvExe)) {
+            # Refresh PATH and try again
+            $env:Path = [Environment]::GetEnvironmentVariable("Path", "User") + ";" + [Environment]::GetEnvironmentVariable("Path", "Machine")
+            if (Get-Command uv -ErrorAction SilentlyContinue) {
+                $uvExe = (Get-Command uv).Source
+            }
+        }
+        
+        if (Test-Path $uvExe) {
+            $script:UvCmd = $uvExe
+            $version = & $uvExe --version
+            Write-Success "uv installed ($version)"
+            return $true
+        }
+        
+        Write-Err "uv installed but not found on PATH"
+        Write-Info "Try restarting your terminal and re-running"
+        return $false
+    } catch {
+        Write-Err "Failed to install uv"
+        Write-Info "Install manually: https://docs.astral.sh/uv/getting-started/installation/"
+        return $false
+    }
+}
+
+function Test-Python {
+    Write-Info "Checking Python $PythonVersion..."
+    
+    # Let uv find or install Python
+    try {
+        $pythonPath = & $UvCmd python find $PythonVersion 2>$null
+        if ($pythonPath) {
+            $ver = & $pythonPath --version 2>$null
+            Write-Success "Python found: $ver"
+            return $true
+        }
+    } catch { }
+    
+    # Python not found — use uv to install it (no admin needed!)
+    Write-Info "Python $PythonVersion not found, installing via uv..."
+    try {
+        & $UvCmd python install $PythonVersion 2>&1 | Out-Null
+        $pythonPath = & $UvCmd python find $PythonVersion 2>$null
+        if ($pythonPath) {
+            $ver = & $pythonPath --version 2>$null
+            Write-Success "Python installed: $ver"
+            return $true
+        }
+    } catch { }
+    
+    Write-Err "Failed to install Python $PythonVersion"
+    Write-Info "Install Python $PythonVersion manually, then re-run this script"
    return $false
 }

@@ -105,7 +167,7 @@ function Test-Git {
        return $true
    }
    
-    Write-Error "Git not found"
+    Write-Err "Git not found"
    Write-Info "Please install Git from:"
    Write-Info "  https://git-scm.com/download/win"
    return $false
@@ -121,7 +183,7 @@ function Test-Node {
        return $true
    }
    
-    Write-Warning "Node.js not found (browser tools will be limited)"
+    Write-Warn "Node.js not found (browser tools will be limited)"
    Write-Info "To install Node.js (optional):"
    Write-Info "  https://nodejs.org/en/download/"
    $script:HasNode = $false
@@ -138,7 +200,7 @@ function Test-Ripgrep {
        return $true
    }
    
-    Write-Warning "ripgrep not found (file search will use findstr fallback)"
+    Write-Warn "ripgrep not found (file search will use findstr fallback)"
    
    # Check what package managers are available
    $hasWinget = Get-Command winget -ErrorAction SilentlyContinue
@@ -185,7 +247,7 @@ function Test-Ripgrep {
            } catch { }
        }
        
-        Write-Warning "Auto-install failed. You can install manually:"
+        Write-Warn "Auto-install failed. You can install manually:"
    } else {
        Write-Info "Skipping ripgrep installation. To install manually:"
    }
@@ -200,6 +262,25 @@ function Test-Ripgrep {
    return $true  # Don't fail - ripgrep is optional
 }

+function Test-Ffmpeg {
+    Write-Info "Checking ffmpeg (optional, for TTS voice messages)..."
+    
+    if (Get-Command ffmpeg -ErrorAction SilentlyContinue) {
+        $version = ffmpeg -version 2>&1 | Select-Object -First 1
+        Write-Success "ffmpeg found"
+        $script:HasFfmpeg = $true
+        return $true
+    }
+    
+    Write-Warn "ffmpeg not found (TTS voice bubbles on Telegram will send as audio files instead)"
+    Write-Info "  Install with: winget install ffmpeg"
+    Write-Info "  Or: choco install ffmpeg"
+    Write-Info "  Or download from: https://ffmpeg.org/download.html"
+    
+    $script:HasFfmpeg = $false
+    return $true  # Don't fail - ffmpeg is optional
+}
+
 # ============================================================================
 # Installation
 # ============================================================================
@@ -216,13 +297,12 @@ function Install-Repository {
            git pull origin $Branch
            Pop-Location
        } else {
-            Write-Error "Directory exists but is not a git repository: $InstallDir"
+            Write-Err "Directory exists but is not a git repository: $InstallDir"
            Write-Info "Remove it or choose a different directory with -InstallDir"
            exit 1
        }
    } else {
        # Try SSH first (for private repo access), fall back to HTTPS
-        # Use --recurse-submodules to also clone mini-swe-agent and tinker-atropos
        Write-Info "Trying SSH clone..."
        $sshResult = git clone --branch $Branch --recurse-submodules $RepoUrlSsh $InstallDir 2>&1
        
@@ -235,7 +315,7 @@ function Install-Repository {
            if ($LASTEXITCODE -eq 0) {
                Write-Success "Cloned via HTTPS"
            } else {
-                Write-Error "Failed to clone repository"
+                Write-Err "Failed to clone repository"
                Write-Info "For private repo access, ensure your SSH key is added to GitHub:"
                Write-Info "  ssh-add ~/.ssh/id_rsa"
                Write-Info "  ssh -T git@github.com  # Test connection"
@@ -244,7 +324,7 @@ function Install-Repository {
        }
    }
    
-    # Ensure submodules are initialized and updated (for existing installs or if --recurse failed)
+    # Ensure submodules are initialized and updated
    Write-Info "Initializing submodules (mini-swe-agent, tinker-atropos)..."
    Push-Location $InstallDir
    git submodule update --init --recursive
@@ -260,23 +340,21 @@ function Install-Venv {
        return
    }
    
-    Write-Info "Creating virtual environment..."
+    Write-Info "Creating virtual environment with Python $PythonVersion..."
    
    Push-Location $InstallDir
    
-    if (-not (Test-Path "venv")) {
-        & $PythonCmd -m venv venv
+    if (Test-Path "venv") {
+        Write-Info "Virtual environment already exists, recreating..."
+        Remove-Item -Recurse -Force "venv"
    }
    
-    # Activate
-    & .\venv\Scripts\Activate.ps1
-    
-    # Upgrade pip
-    pip install --upgrade pip wheel setuptools | Out-Null
+    # uv creates the venv and pins the Python version in one step
+    & $UvCmd venv venv --python $PythonVersion
    
    Pop-Location
    
-    Write-Success "Virtual environment ready"
+    Write-Success "Virtual environment ready (Python $PythonVersion)"
 }

 function Install-Dependencies {
@@ -285,14 +363,15 @@ function Install-Dependencies {
    Push-Location $InstallDir
    
    if (-not $NoVenv) {
-        & .\venv\Scripts\Activate.ps1
+        # Tell uv to install into our venv (no activation needed)
+        $env:VIRTUAL_ENV = "$InstallDir\venv"
    }
    
-    # Install main package
+    # Install main package with all extras
    try {
-        pip install -e ".[all]" 2>&1 | Out-Null
+        & $UvCmd pip install -e ".[all]" 2>&1 | Out-Null
    } catch {
-        pip install -e "." | Out-Null
+        & $UvCmd pip install -e "." | Out-Null
    }
    
    Write-Success "Main package installed"
@@ -301,25 +380,25 @@ function Install-Dependencies {
    Write-Info "Installing mini-swe-agent (terminal tool backend)..."
    if (Test-Path "mini-swe-agent\pyproject.toml") {
        try {
-            pip install -e ".\mini-swe-agent" 2>&1 | Out-Null
+            & $UvCmd pip install -e ".\mini-swe-agent" 2>&1 | Out-Null
            Write-Success "mini-swe-agent installed"
        } catch {
-            Write-Warning "mini-swe-agent install failed (terminal tools may not work)"
+            Write-Warn "mini-swe-agent install failed (terminal tools may not work)"
        }
    } else {
-        Write-Warning "mini-swe-agent not found (run: git submodule update --init)"
+        Write-Warn "mini-swe-agent not found (run: git submodule update --init)"
    }
    
    Write-Info "Installing tinker-atropos (RL training backend)..."
    if (Test-Path "tinker-atropos\pyproject.toml") {
        try {
-            pip install -e ".\tinker-atropos" 2>&1 | Out-Null
+            & $UvCmd pip install -e ".\tinker-atropos" 2>&1 | Out-Null
            Write-Success "tinker-atropos installed"
        } catch {
-            Write-Warning "tinker-atropos install failed (RL tools may not work)"
+            Write-Warn "tinker-atropos install failed (RL tools may not work)"
        }
    } else {
-        Write-Warning "tinker-atropos not found (run: git submodule update --init)"
+        Write-Warn "tinker-atropos not found (run: git submodule update --init)"
    }
    
    Pop-Location
@@ -328,41 +407,44 @@ function Install-Dependencies {
 }

 function Set-PathVariable {
-    Write-Info "Setting up PATH..."
+    Write-Info "Setting up hermes command..."
    
    if ($NoVenv) {
-        $binDir = "$InstallDir"
+        $hermesBin = "$InstallDir"
    } else {
-        $binDir = "$InstallDir\venv\Scripts"
+        $hermesBin = "$InstallDir\venv\Scripts"
    }
    
-    # Add to user PATH
+    # Add the venv Scripts dir to user PATH so hermes is globally available
+    # On Windows, the hermes.exe in venv\Scripts\ has the venv Python baked in
    $currentPath = [Environment]::GetEnvironmentVariable("Path", "User")
    
-    if ($currentPath -notlike "*$binDir*") {
+    if ($currentPath -notlike "*$hermesBin*") {
        [Environment]::SetEnvironmentVariable(
            "Path",
-            "$binDir;$currentPath",
+            "$hermesBin;$currentPath",
            "User"
        )
-        Write-Success "Added to user PATH"
+        Write-Success "Added to user PATH: $hermesBin"
    } else {
        Write-Info "PATH already configured"
    }
    
    # Update current session
-    $env:Path = "$binDir;$env:Path"
+    $env:Path = "$hermesBin;$env:Path"
+    
+    Write-Success "hermes command ready"
 }

 function Copy-ConfigTemplates {
    Write-Info "Setting up configuration files..."
    
-    # Create ~/.hermes directory structure (config at top level, code in subdir)
+    # Create ~/.hermes directory structure
    New-Item -ItemType Directory -Force -Path "$HermesHome\cron" | Out-Null
    New-Item -ItemType Directory -Force -Path "$HermesHome\sessions" | Out-Null
    New-Item -ItemType Directory -Force -Path "$HermesHome\logs" | Out-Null
    
-    # Create .env at ~/.hermes/.env (top level, easy to find)
+    # Create .env
    $envPath = "$HermesHome\.env"
    if (-not (Test-Path $envPath)) {
        $examplePath = "$InstallDir\.env.example"
@@ -370,7 +452,6 @@ function Copy-ConfigTemplates {
            Copy-Item $examplePath $envPath
            Write-Success "Created ~/.hermes/.env from template"
        } else {
-            # Create empty .env if no example exists
            New-Item -ItemType File -Force -Path $envPath | Out-Null
            Write-Success "Created ~/.hermes/.env"
        }
@@ -378,7 +459,7 @@ function Copy-ConfigTemplates {
        Write-Info "~/.hermes/.env already exists, keeping it"
    }
    
-    # Create config.yaml at ~/.hermes/config.yaml (top level, easy to find)
+    # Create config.yaml
    $configPath = "$HermesHome\config.yaml"
    if (-not (Test-Path $configPath)) {
        $examplePath = "$InstallDir\cli-config.yaml.example"
@@ -407,7 +488,7 @@ function Install-NodeDeps {
            npm install --silent 2>&1 | Out-Null
            Write-Success "Node.js dependencies installed"
        } catch {
-            Write-Warning "npm install failed (browser tools may not work)"
+            Write-Warn "npm install failed (browser tools may not work)"
        }
    }
    
@@ -426,12 +507,13 @@ function Invoke-SetupWizard {
    
    Push-Location $InstallDir
    
+    # Run hermes setup using the venv Python directly (no activation needed)
    if (-not $NoVenv) {
-        & .\venv\Scripts\Activate.ps1
+        & ".\venv\Scripts\python.exe" -m hermes_cli.main setup
+    } else {
+        python -m hermes_cli.main setup
    }
    
-    python -m hermes_cli.main setup
-    
    Pop-Location
 }

@@ -478,7 +560,6 @@ function Write-Completion {
    Write-Host "⚡ Restart your terminal for PATH changes to take effect" -ForegroundColor Yellow
    Write-Host ""
    
-    # Show notes about optional tools
    if (-not $HasNode) {
        Write-Host "Note: Node.js was not found. Browser automation tools" -ForegroundColor Yellow
        Write-Host "will have limited functionality." -ForegroundColor Yellow
@@ -500,10 +581,12 @@ function Write-Completion {
 function Main {
    Write-Banner
    
+    if (-not (Install-Uv)) { exit 1 }
    if (-not (Test-Python)) { exit 1 }
    if (-not (Test-Git)) { exit 1 }
    Test-Node      # Optional, doesn't fail
    Test-Ripgrep   # Optional, doesn't fail
+    Test-Ffmpeg    # Optional, doesn't fail
    
    Install-Repository
    Install-Venv
--- a/scripts/install.sh
+++ b/scripts/install.sh
@@ -3,6 +3,7 @@
 # Hermes Agent Installer
 # ============================================================================
 # Installation script for Linux and macOS.
+# Uses uv for fast Python provisioning and package management.
 #
 # Usage:
 #   curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
@@ -29,7 +30,7 @@ REPO_URL_SSH="git@github.com:NousResearch/hermes-agent.git"
 REPO_URL_HTTPS="https://github.com/NousResearch/hermes-agent.git"
 HERMES_HOME="$HOME/.hermes"
 INSTALL_DIR="${HERMES_INSTALL_DIR:-$HERMES_HOME/hermes-agent}"
-PYTHON_MIN_VERSION="3.10"
+PYTHON_VERSION="3.11"

 # Options
 USE_VENV=true
@@ -64,7 +65,7 @@ while [[ $# -gt 0 ]]; do
            echo "  --no-venv      Don't create virtual environment"
            echo "  --skip-setup   Skip interactive setup wizard"
            echo "  --branch NAME  Git branch to install (default: main)"
-            echo "  --dir PATH     Installation directory (default: ~/.hermes-agent)"
+            echo "  --dir PATH     Installation directory (default: ~/.hermes/hermes-agent)"
            echo "  -h, --help     Show this help"
            exit 0
            ;;
@@ -146,50 +147,80 @@ detect_os() {
 # Dependency checks
 # ============================================================================

-check_python() {
-    log_info "Checking Python..."
+install_uv() {
+    log_info "Checking for uv package manager..."
    
-    # Try different python commands
-    for cmd in python3.12 python3.11 python3.10 python3 python; do
-        if command -v $cmd &> /dev/null; then
-            PYTHON_CMD=$cmd
-            PYTHON_VERSION=$($cmd -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')
-            
-            # Check version
-            if python3 -c "import sys; exit(0 if sys.version_info >= (3, 10) else 1)" 2>/dev/null; then
-                log_success "Python $PYTHON_VERSION found"
-                return 0
-            fi
+    # Check if uv is already on PATH
+    if command -v uv &> /dev/null; then
+        UV_CMD="uv"
+        UV_VERSION=$($UV_CMD --version 2>/dev/null)
+        log_success "uv found ($UV_VERSION)"
+        return 0
+    fi
+    
+    # Check ~/.local/bin (default uv install location) even if not on PATH yet
+    if [ -x "$HOME/.local/bin/uv" ]; then
+        UV_CMD="$HOME/.local/bin/uv"
+        UV_VERSION=$($UV_CMD --version 2>/dev/null)
+        log_success "uv found at ~/.local/bin ($UV_VERSION)"
+        return 0
+    fi
+    
+    # Check ~/.cargo/bin (alternative uv install location)
+    if [ -x "$HOME/.cargo/bin/uv" ]; then
+        UV_CMD="$HOME/.cargo/bin/uv"
+        UV_VERSION=$($UV_CMD --version 2>/dev/null)
+        log_success "uv found at ~/.cargo/bin ($UV_VERSION)"
+        return 0
+    fi
+    
+    # Install uv
+    log_info "Installing uv (fast Python package manager)..."
+    if curl -LsSf https://astral.sh/uv/install.sh | sh 2>/dev/null; then
+        # uv installs to ~/.local/bin by default
+        if [ -x "$HOME/.local/bin/uv" ]; then
+            UV_CMD="$HOME/.local/bin/uv"
+        elif [ -x "$HOME/.cargo/bin/uv" ]; then
+            UV_CMD="$HOME/.cargo/bin/uv"
+        elif command -v uv &> /dev/null; then
+            UV_CMD="uv"
+        else
+            log_error "uv installed but not found on PATH"
+            log_info "Try adding ~/.local/bin to your PATH and re-running"
+            exit 1
        fi
-    done
+        UV_VERSION=$($UV_CMD --version 2>/dev/null)
+        log_success "uv installed ($UV_VERSION)"
+    else
+        log_error "Failed to install uv"
+        log_info "Install manually: https://docs.astral.sh/uv/getting-started/installation/"
+        exit 1
+    fi
+}
+
+check_python() {
+    log_info "Checking Python $PYTHON_VERSION..."
    
-    log_error "Python 3.10+ not found"
-    log_info "Please install Python 3.10 or newer:"
+    # Let uv handle Python — it can download and manage Python versions
+    # First check if a suitable Python is already available
+    if $UV_CMD python find "$PYTHON_VERSION" &> /dev/null; then
+        PYTHON_PATH=$($UV_CMD python find "$PYTHON_VERSION")
+        PYTHON_FOUND_VERSION=$("$PYTHON_PATH" --version 2>/dev/null)
+        log_success "Python found: $PYTHON_FOUND_VERSION"
+        return 0
+    fi
    
-    case "$OS" in
-        linux)
-            case "$DISTRO" in
-                ubuntu|debian)
-                    log_info "  sudo apt update && sudo apt install python3.11 python3.11-venv"
-                    ;;
-                fedora)
-                    log_info "  sudo dnf install python3.11"
-                    ;;
-                arch)
-                    log_info "  sudo pacman -S python"
-                    ;;
-                *)
-                    log_info "  Use your package manager to install Python 3.10+"
-                    ;;
-            esac
-            ;;
-        macos)
-            log_info "  brew install python@3.11"
-            log_info "  Or download from https://www.python.org/downloads/"
-            ;;
-    esac
-    
-    exit 1
+    # Python not found — use uv to install it (no sudo needed!)
+    log_info "Python $PYTHON_VERSION not found, installing via uv..."
+    if $UV_CMD python install "$PYTHON_VERSION"; then
+        PYTHON_PATH=$($UV_CMD python find "$PYTHON_VERSION")
+        PYTHON_FOUND_VERSION=$("$PYTHON_PATH" --version 2>/dev/null)
+        log_success "Python installed: $PYTHON_FOUND_VERSION"
+    else
+        log_error "Failed to install Python $PYTHON_VERSION"
+        log_info "Install Python $PYTHON_VERSION manually, then re-run this script"
+        exit 1
+    fi
 }

 check_git() {
@@ -294,7 +325,6 @@ check_ripgrep() {
        # Check if we can use sudo
        CAN_SUDO=false
        if command -v sudo &> /dev/null; then
-            # Check if user has sudo access (without actually running sudo)
            if sudo -n true 2>/dev/null || sudo -v 2>/dev/null; then
                CAN_SUDO=true
            fi
@@ -328,7 +358,6 @@ check_ripgrep() {
                    esac
                else
                    log_warn "sudo not available - cannot auto-install system packages"
-                    # Try cargo as fallback if available
                    if command -v cargo &> /dev/null; then
                        log_info "Trying cargo install (no sudo required)..."
                        if cargo install ripgrep 2>/dev/null; then
@@ -371,7 +400,6 @@ check_ripgrep() {
                    log_info "  https://github.com/BurntSushi/ripgrep#installation"
                    ;;
            esac
-            # Show cargo alternative for users without sudo
            if command -v cargo &> /dev/null; then
                log_info "  Or without sudo: cargo install ripgrep"
            fi
@@ -385,6 +413,45 @@ check_ripgrep() {
    # Don't exit - ripgrep is optional (grep fallback exists)
 }

+check_ffmpeg() {
+    log_info "Checking ffmpeg (optional, for TTS voice messages)..."
+    
+    if command -v ffmpeg &> /dev/null; then
+        local ffmpeg_version=$(ffmpeg -version 2>/dev/null | head -1 | awk '{print $3}')
+        log_success "ffmpeg found: $ffmpeg_version"
+        HAS_FFMPEG=true
+        return
+    fi
+    
+    log_warn "ffmpeg not found (TTS voice bubbles on Telegram will send as audio files instead)"
+    log_info "To install ffmpeg (optional):"
+    
+    case "$OS" in
+        linux)
+            case "$DISTRO" in
+                ubuntu|debian)
+                    log_info "  sudo apt install ffmpeg"
+                    ;;
+                fedora)
+                    log_info "  sudo dnf install ffmpeg"
+                    ;;
+                arch)
+                    log_info "  sudo pacman -S ffmpeg"
+                    ;;
+                *)
+                    log_info "  https://ffmpeg.org/download.html"
+                    ;;
+            esac
+            ;;
+        macos)
+            log_info "  brew install ffmpeg"
+            ;;
+    esac
+    
+    HAS_FFMPEG=false
+    # Don't exit - ffmpeg is optional
+}
+
 # ============================================================================
 # Installation
 # ============================================================================
@@ -440,39 +507,36 @@ setup_venv() {
        return 0
    fi
    
-    log_info "Creating virtual environment..."
+    log_info "Creating virtual environment with Python $PYTHON_VERSION..."
    
    if [ -d "venv" ]; then
-        log_info "Virtual environment already exists"
-    else
-        $PYTHON_CMD -m venv venv
+        log_info "Virtual environment already exists, recreating..."
+        rm -rf venv
    fi
    
-    # Activate
-    source venv/bin/activate
+    # uv creates the venv and pins the Python version in one step
+    $UV_CMD venv venv --python "$PYTHON_VERSION"
    
-    # Upgrade pip
-    pip install --upgrade pip wheel setuptools > /dev/null
-    
-    log_success "Virtual environment ready"
+    log_success "Virtual environment ready (Python $PYTHON_VERSION)"
 }

 install_deps() {
    log_info "Installing dependencies..."
    
    if [ "$USE_VENV" = true ]; then
-        source venv/bin/activate
+        # Tell uv to install into our venv (no need to activate)
+        export VIRTUAL_ENV="$INSTALL_DIR/venv"
    fi
    
    # Install the main package in editable mode with all extras
-    pip install -e ".[all]" > /dev/null 2>&1 || pip install -e "." > /dev/null
+    $UV_CMD pip install -e ".[all]" || $UV_CMD pip install -e "."
    
    log_success "Main package installed"
    
    # Install submodules
    log_info "Installing mini-swe-agent (terminal tool backend)..."
    if [ -d "mini-swe-agent" ] && [ -f "mini-swe-agent/pyproject.toml" ]; then
-        pip install -e "./mini-swe-agent" > /dev/null 2>&1 || log_warn "mini-swe-agent install failed (terminal tools may not work)"
+        $UV_CMD pip install -e "./mini-swe-agent" || log_warn "mini-swe-agent install failed (terminal tools may not work)"
        log_success "mini-swe-agent installed"
    else
        log_warn "mini-swe-agent not found (run: git submodule update --init)"
@@ -480,7 +544,7 @@ install_deps() {
    
    log_info "Installing tinker-atropos (RL training backend)..."
    if [ -d "tinker-atropos" ] && [ -f "tinker-atropos/pyproject.toml" ]; then
-        pip install -e "./tinker-atropos" > /dev/null 2>&1 || log_warn "tinker-atropos install failed (RL tools may not work)"
+        $UV_CMD pip install -e "./tinker-atropos" || log_warn "tinker-atropos install failed (RL tools may not work)"
        log_success "tinker-atropos installed"
    else
        log_warn "tinker-atropos not found (run: git submodule update --init)"
@@ -490,53 +554,56 @@ install_deps() {
 }

 setup_path() {
-    log_info "Setting up PATH..."
+    log_info "Setting up hermes command..."
    
-    # Determine the bin directory
    if [ "$USE_VENV" = true ]; then
-        BIN_DIR="$INSTALL_DIR/venv/bin"
+        HERMES_BIN="$INSTALL_DIR/venv/bin/hermes"
    else
-        BIN_DIR="$HOME/.local/bin"
-        mkdir -p "$BIN_DIR"
+        HERMES_BIN="$(command -v hermes 2>/dev/null || echo "")"
+        if [ -z "$HERMES_BIN" ]; then
+            log_warn "hermes not found on PATH after install"
+            return 0
+        fi
+    fi
+    
+    # Create symlink in ~/.local/bin (standard user binary location, usually on PATH)
+    mkdir -p "$HOME/.local/bin"
+    ln -sf "$HERMES_BIN" "$HOME/.local/bin/hermes"
+    log_success "Symlinked hermes → ~/.local/bin/hermes"
+    
+    # Check if ~/.local/bin is on PATH; if not, add it to shell config
+    if ! echo "$PATH" | tr ':' '\n' | grep -qxF "$HOME/.local/bin"; then
+        SHELL_CONFIG=""
+        if [ -n "$BASH_VERSION" ]; then
+            if [ -f "$HOME/.bashrc" ]; then
+                SHELL_CONFIG="$HOME/.bashrc"
+            elif [ -f "$HOME/.bash_profile" ]; then
+                SHELL_CONFIG="$HOME/.bash_profile"
+            fi
+        elif [ -n "$ZSH_VERSION" ] || [ -f "$HOME/.zshrc" ]; then
+            SHELL_CONFIG="$HOME/.zshrc"
+        fi
        
-        # Create a wrapper script
-        cat > "$BIN_DIR/hermes" << EOF
-#!/bin/bash
-cd "$INSTALL_DIR"
-exec python -m hermes_cli.main "\$@"
-EOF
-        chmod +x "$BIN_DIR/hermes"
-    fi
-    
-    # Add to PATH in shell config
-    SHELL_CONFIG=""
-    if [ -n "$BASH_VERSION" ]; then
-        if [ -f "$HOME/.bashrc" ]; then
-            SHELL_CONFIG="$HOME/.bashrc"
-        elif [ -f "$HOME/.bash_profile" ]; then
-            SHELL_CONFIG="$HOME/.bash_profile"
+        PATH_LINE='export PATH="$HOME/.local/bin:$PATH"'
+        
+        if [ -n "$SHELL_CONFIG" ]; then
+            if ! grep -q '\.local/bin' "$SHELL_CONFIG" 2>/dev/null; then
+                echo "" >> "$SHELL_CONFIG"
+                echo "# Hermes Agent — ensure ~/.local/bin is on PATH" >> "$SHELL_CONFIG"
+                echo "$PATH_LINE" >> "$SHELL_CONFIG"
+                log_success "Added ~/.local/bin to PATH in $SHELL_CONFIG"
+            else
+                log_info "~/.local/bin already referenced in $SHELL_CONFIG"
+            fi
        fi
-    elif [ -n "$ZSH_VERSION" ] || [ -f "$HOME/.zshrc" ]; then
-        SHELL_CONFIG="$HOME/.zshrc"
+    else
+        log_info "~/.local/bin already on PATH"
    fi
    
-    PATH_LINE="export PATH=\"$BIN_DIR:\$PATH\""
+    # Export for current session so hermes works immediately
+    export PATH="$HOME/.local/bin:$PATH"
    
-    if [ -n "$SHELL_CONFIG" ]; then
-        if ! grep -q "hermes-agent" "$SHELL_CONFIG" 2>/dev/null; then
-            echo "" >> "$SHELL_CONFIG"
-            echo "# Hermes Agent" >> "$SHELL_CONFIG"
-            echo "$PATH_LINE" >> "$SHELL_CONFIG"
-            log_success "Added to $SHELL_CONFIG"
-        else
-            log_info "PATH already configured in $SHELL_CONFIG"
-        fi
-    fi
-    
-    # Also export for current session
-    export PATH="$BIN_DIR:$PATH"
-    
-    log_success "PATH configured"
+    log_success "hermes command ready"
 }

 copy_config_templates() {
@@ -553,7 +620,6 @@ copy_config_templates() {
            cp "$INSTALL_DIR/.env.example" "$HERMES_HOME/.env"
            log_success "Created ~/.hermes/.env from template"
        else
-            # Create empty .env if no example exists
            touch "$HERMES_HOME/.env"
            log_success "Created ~/.hermes/.env"
        fi
@@ -601,12 +667,14 @@ run_setup_wizard() {
    log_info "Starting setup wizard..."
    echo ""
    
-    if [ "$USE_VENV" = true ]; then
-        source "$INSTALL_DIR/venv/bin/activate"
-    fi
-    
    cd "$INSTALL_DIR"
-    python -m hermes_cli.main setup
+    
+    # Run hermes setup using the venv Python directly (no activation needed)
+    if [ "$USE_VENV" = true ]; then
+        "$INSTALL_DIR/venv/bin/python" -m hermes_cli.main setup
+    else
+        python -m hermes_cli.main setup
+    fi
 }

 print_success() {
@@ -673,10 +741,12 @@ main() {
    print_banner
    
    detect_os
+    install_uv
    check_python
    check_git
    check_node
    check_ripgrep
+    check_ffmpeg
    
    clone_repo
    setup_venv
--- a/scripts/kill_modal.sh
+++ b/scripts/kill_modal.sh
@@ -0,0 +1,34 @@
+#!/bin/bash
+# Stop running Modal apps: the swe-rex sandbox app by default, or all apps with --all
+#
+# Usage:
+#   bash scripts/kill_modal.sh          # Stop swe-rex (the sandbox app)
+#   bash scripts/kill_modal.sh --all    # Stop ALL Modal apps
+
+set -uo pipefail
+
+echo "Fetching Modal app list..."
+APP_LIST=$(modal app list 2>/dev/null)
+
+if [[ "${1:-}" == "--all" ]]; then
+    echo "Stopping ALL Modal apps..."
+    echo "$APP_LIST" | grep -oE 'ap-[A-Za-z0-9]+' | sort -u | while read -r app_id; do
+        echo "  Stopping $app_id"
+        modal app stop "$app_id" 2>/dev/null || true
+    done
+else
+    echo "Stopping swe-rex sandboxes..."
+    APPS=$(echo "$APP_LIST" | grep 'swe-rex' | grep -oE 'ap-[A-Za-z0-9]+' || true)
+    if [[ -z "$APPS" ]]; then
+        echo "  No swe-rex apps found."
+    else
+        echo "$APPS" | while read -r app_id; do
+            echo "  Stopping $app_id"
+            modal app stop "$app_id" 2>/dev/null || true
+        done
+    fi
+fi
+
+echo ""
+echo "Current swe-rex status:"
+modal app list 2>/dev/null | grep -E 'State|swe-rex' || echo "  (none)"
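The ID extraction that `scripts/kill_modal.sh` performs with `grep -oE 'ap-[A-Za-z0-9]+'` can be sketched in Python for readers who want to check the filtering logic in isolation. This is an illustration only and not part of the change: `extract_app_ids` is a hypothetical helper, and the sample listing is made up, so real `modal app list` output may be laid out differently.

```python
import re

# Same pattern the script passes to grep -oE.
APP_ID_RE = re.compile(r"ap-[A-Za-z0-9]+")


def extract_app_ids(listing, name_filter=None):
    """Return sorted, de-duplicated app IDs found in the listing text.

    If name_filter is given, only lines containing it are scanned,
    mirroring the script's `grep 'swe-rex'` step.
    """
    ids = set()
    for line in listing.splitlines():
        if name_filter is not None and name_filter not in line:
            continue
        ids.update(APP_ID_RE.findall(line))
    return sorted(ids)


# Made-up listing text for illustration; not real CLI output.
SAMPLE = """\
ap-AAA111  swe-rex   deployed
ap-BBB222  other     deployed
ap-AAA111  swe-rex   deployed
"""

print(extract_app_ids(SAMPLE, "swe-rex"))
print(extract_app_ids(SAMPLE))
```

Unlike the script's default branch, this sketch always de-duplicates, which matches the `sort -u` used in the `--all` branch.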
--- a/setup-hermes.sh
+++ b/setup-hermes.sh
@@ -3,16 +3,18 @@
 # Hermes Agent Setup Script
 # ============================================================================
 # Quick setup for developers who cloned the repo manually.
+# Uses uv for fast Python provisioning and package management.
 #
 # Usage:
 #   ./setup-hermes.sh
 #
 # This script:
-# 1. Creates a virtual environment (if not exists)
-# 2. Installs dependencies
-# 3. Creates .env from template (if not exists)
-# 4. Installs the 'hermes' CLI command
-# 5. Runs the setup wizard (optional)
+# 1. Installs uv if not present
+# 2. Creates a virtual environment with Python 3.11 via uv
+# 3. Installs all dependencies (main package + submodules)
+# 4. Creates .env from template (if not exists)
+# 5. Symlinks the 'hermes' CLI command into ~/.local/bin
+# 6. Runs the setup wizard (optional)
 # ============================================================================

 set -e
@@ -21,38 +23,75 @@ set -e
 GREEN='\033[0;32m'
 YELLOW='\033[0;33m'
 CYAN='\033[0;36m'
+RED='\033[0;31m'
 NC='\033[0m'

 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 cd "$SCRIPT_DIR"

+PYTHON_VERSION="3.11"
+
 echo ""
 echo -e "${CYAN}🦋 Hermes Agent Setup${NC}"
 echo ""

 # ============================================================================
-# Python check
+# Install / locate uv
 # ============================================================================

-echo -e "${CYAN}→${NC} Checking Python..."
+echo -e "${CYAN}→${NC} Checking for uv..."

-PYTHON_CMD=""
-for cmd in python3.12 python3.11 python3.10 python3 python; do
-    if command -v $cmd &> /dev/null; then
-        if $cmd -c "import sys; exit(0 if sys.version_info >= (3, 10) else 1)" 2>/dev/null; then
-            PYTHON_CMD=$cmd
-            break
-        fi
-    fi
-done
-
-if [ -z "$PYTHON_CMD" ]; then
-    echo -e "${YELLOW}✗${NC} Python 3.10+ required"
-    exit 1
+UV_CMD=""
+if command -v uv &> /dev/null; then
+    UV_CMD="uv"
+elif [ -x "$HOME/.local/bin/uv" ]; then
+    UV_CMD="$HOME/.local/bin/uv"
+elif [ -x "$HOME/.cargo/bin/uv" ]; then
+    UV_CMD="$HOME/.cargo/bin/uv"
 fi

-PYTHON_VERSION=$($PYTHON_CMD -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')
-echo -e "${GREEN}✓${NC} Python $PYTHON_VERSION found"
+if [ -n "$UV_CMD" ]; then
+    UV_VERSION=$($UV_CMD --version 2>/dev/null)
+    echo -e "${GREEN}✓${NC} uv found ($UV_VERSION)"
+else
+    echo -e "${CYAN}→${NC} Installing uv..."
+    if curl -LsSf https://astral.sh/uv/install.sh | sh 2>/dev/null; then
+        if [ -x "$HOME/.local/bin/uv" ]; then
+            UV_CMD="$HOME/.local/bin/uv"
+        elif [ -x "$HOME/.cargo/bin/uv" ]; then
+            UV_CMD="$HOME/.cargo/bin/uv"
+        fi
+        
+        if [ -n "$UV_CMD" ]; then
+            UV_VERSION=$($UV_CMD --version 2>/dev/null)
+            echo -e "${GREEN}✓${NC} uv installed ($UV_VERSION)"
+        else
+            echo -e "${RED}✗${NC} uv installed but not found. Add ~/.local/bin to PATH and retry."
+            exit 1
+        fi
+    else
+        echo -e "${RED}✗${NC} Failed to install uv. Visit https://docs.astral.sh/uv/"
+        exit 1
+    fi
+fi
+
+# ============================================================================
+# Python check (uv can provision it automatically)
+# ============================================================================
+
+echo -e "${CYAN}→${NC} Checking Python $PYTHON_VERSION..."
+
+if $UV_CMD python find "$PYTHON_VERSION" &> /dev/null; then
+    PYTHON_PATH=$($UV_CMD python find "$PYTHON_VERSION")
+    PYTHON_FOUND_VERSION=$($PYTHON_PATH --version 2>/dev/null)
+    echo -e "${GREEN}✓${NC} $PYTHON_FOUND_VERSION found"
+else
+    echo -e "${CYAN}→${NC} Python $PYTHON_VERSION not found, installing via uv..."
+    $UV_CMD python install "$PYTHON_VERSION"
+    PYTHON_PATH=$($UV_CMD python find "$PYTHON_VERSION")
+    PYTHON_FOUND_VERSION=$($PYTHON_PATH --version 2>/dev/null)
+    echo -e "${GREEN}✓${NC} $PYTHON_FOUND_VERSION installed"
+fi

 # ============================================================================
 # Virtual environment
@@ -60,15 +99,16 @@ echo -e "${GREEN}✓${NC} Python $PYTHON_VERSION found"

 echo -e "${CYAN}→${NC} Setting up virtual environment..."

-if [ ! -d "venv" ]; then
-    $PYTHON_CMD -m venv venv
-    echo -e "${GREEN}✓${NC} Created venv"
-else
-    echo -e "${GREEN}✓${NC} venv exists"
+if [ -d "venv" ]; then
+    echo -e "${CYAN}→${NC} Removing old venv..."
+    rm -rf venv
 fi

-source venv/bin/activate
-pip install --upgrade pip wheel setuptools > /dev/null
+$UV_CMD venv venv --python "$PYTHON_VERSION"
+echo -e "${GREEN}✓${NC} venv created (Python $PYTHON_VERSION)"
+
+# Tell uv to install into this venv (no activation needed for uv)
+export VIRTUAL_ENV="$SCRIPT_DIR/venv"

 # ============================================================================
 # Dependencies
@@ -76,10 +116,34 @@ pip install --upgrade pip wheel setuptools > /dev/null

 echo -e "${CYAN}→${NC} Installing dependencies..."

-pip install -e ".[all]" > /dev/null 2>&1 || pip install -e "." > /dev/null
+$UV_CMD pip install -e ".[all]" || $UV_CMD pip install -e "."

 echo -e "${GREEN}✓${NC} Dependencies installed"

+# ============================================================================
+# Submodules (terminal backend + RL training)
+# ============================================================================
+
+echo -e "${CYAN}→${NC} Installing submodules..."
+
+# mini-swe-agent (terminal tool backend)
+if [ -d "mini-swe-agent" ] && [ -f "mini-swe-agent/pyproject.toml" ]; then
+    $UV_CMD pip install -e "./mini-swe-agent" && \
+        echo -e "${GREEN}✓${NC} mini-swe-agent installed" || \
+        echo -e "${YELLOW}⚠${NC} mini-swe-agent install failed (terminal tools may not work)"
+else
+    echo -e "${YELLOW}⚠${NC} mini-swe-agent not found (run: git submodule update --init --recursive)"
+fi
+
+# tinker-atropos (RL training backend)
+if [ -d "tinker-atropos" ] && [ -f "tinker-atropos/pyproject.toml" ]; then
+    $UV_CMD pip install -e "./tinker-atropos" && \
+        echo -e "${GREEN}✓${NC} tinker-atropos installed" || \
+        echo -e "${YELLOW}⚠${NC} tinker-atropos install failed (RL tools may not work)"
+else
+    echo -e "${YELLOW}⚠${NC} tinker-atropos not found (run: git submodule update --init --recursive)"
+fi
+
 # ============================================================================
 # Optional: ripgrep (for faster file search)
 # ============================================================================
@@ -141,14 +205,17 @@ else
 fi

 # ============================================================================
-# PATH setup
+# PATH setup — symlink hermes into ~/.local/bin
 # ============================================================================

 echo -e "${CYAN}→${NC} Setting up hermes command..."

-BIN_DIR="$SCRIPT_DIR/venv/bin"
+HERMES_BIN="$SCRIPT_DIR/venv/bin/hermes"
+mkdir -p "$HOME/.local/bin"
+ln -sf "$HERMES_BIN" "$HOME/.local/bin/hermes"
+echo -e "${GREEN}✓${NC} Symlinked hermes → ~/.local/bin/hermes"

-# Add to shell config if not already there
+# Ensure ~/.local/bin is on PATH in shell config
 SHELL_CONFIG=""
 if [ -f "$HOME/.zshrc" ]; then
    SHELL_CONFIG="$HOME/.zshrc"
@@ -159,13 +226,17 @@ elif [ -f "$HOME/.bash_profile" ]; then
 fi

 if [ -n "$SHELL_CONFIG" ]; then
-    if ! grep -q "hermes-agent" "$SHELL_CONFIG" 2>/dev/null; then
-        echo "" >> "$SHELL_CONFIG"
-        echo "# Hermes Agent" >> "$SHELL_CONFIG"
-        echo "export PATH=\"$BIN_DIR:\$PATH\"" >> "$SHELL_CONFIG"
-        echo -e "${GREEN}✓${NC} Added to $SHELL_CONFIG"
+    if ! echo "$PATH" | tr ':' '\n' | grep -qxF "$HOME/.local/bin"; then
+        if ! grep -q '\.local/bin' "$SHELL_CONFIG" 2>/dev/null; then
+            echo "" >> "$SHELL_CONFIG"
+            echo "# Hermes Agent — ensure ~/.local/bin is on PATH" >> "$SHELL_CONFIG"
+            echo 'export PATH="$HOME/.local/bin:$PATH"' >> "$SHELL_CONFIG"
+            echo -e "${GREEN}✓${NC} Added ~/.local/bin to PATH in $SHELL_CONFIG"
+        else
+            echo -e "${GREEN}✓${NC} ~/.local/bin already in $SHELL_CONFIG"
+        fi
    else
-        echo -e "${GREEN}✓${NC} PATH already in $SHELL_CONFIG"
+        echo -e "${GREEN}✓${NC} ~/.local/bin already on PATH"
    fi
 fi

@@ -199,5 +270,6 @@ read -p "Would you like to run the setup wizard now? [Y/n] " -n 1 -r
 echo
 if [[ $REPLY =~ ^[Yy]$ ]] || [[ -z $REPLY ]]; then
    echo ""
-    python -m hermes_cli.main setup
+    # Run directly with venv Python (no activation needed)
+    "$SCRIPT_DIR/venv/bin/python" -m hermes_cli.main setup
 fi
--- a/skills/diagramming/DESCRIPTION.md
+++ b/skills/diagramming/DESCRIPTION.md
@@ -0,0 +1,3 @@
+---
+description: Diagram creation skills for generating visual diagrams, flowcharts, architecture diagrams, and illustrations using tools like Excalidraw.
+---
--- a/skills/diagramming/excalidraw/SKILL.md
+++ b/skills/diagramming/excalidraw/SKILL.md
@@ -0,0 +1,191 @@
+---
+name: excalidraw
+description: Create hand-drawn style diagrams using Excalidraw JSON format. Generate .excalidraw files for architecture diagrams, flowcharts, sequence diagrams, concept maps, and more. Files can be opened at excalidraw.com or uploaded for shareable links.
+version: 1.0.0
+author: Hermes Agent
+license: MIT
+tags: [Excalidraw, Diagrams, Flowcharts, Architecture, Visualization, JSON]
+dependencies: []
+related_skills: []
+---
+
+# Excalidraw Diagram Skill
+
+Create diagrams by writing standard Excalidraw element JSON and saving it as `.excalidraw` files. These files can be dragged and dropped onto [excalidraw.com](https://excalidraw.com) for viewing and editing. No accounts, no API keys, no rendering libraries -- just JSON.
+
+## Workflow
+
+1. **Load this skill** (you already did)
+2. **Write the elements JSON** -- an array of Excalidraw element objects
+3. **Save the file** using `write_file` to create a `.excalidraw` file
+4. **Optionally upload** for a shareable link using `scripts/upload.py` via `terminal`
+
+### Saving a Diagram
+
+Wrap your elements array in the standard `.excalidraw` envelope and save with `write_file`:
+
+```json
+{
+  "type": "excalidraw",
+  "version": 2,
+  "source": "hermes-agent",
+  "elements": [ ...your elements array here... ],
+  "appState": {
+    "viewBackgroundColor": "#ffffff"
+  }
+}
+```
+
+Save to any path, e.g. `~/diagrams/my_diagram.excalidraw`.
+
+### Uploading for a Shareable Link
+
+Run the upload script (located in this skill's `scripts/` directory) via terminal:
+
+```bash
+python skills/diagramming/excalidraw/scripts/upload.py ~/diagrams/my_diagram.excalidraw
+```
+
+This uploads to excalidraw.com (no account needed) and prints a shareable URL. Requires the `cryptography` pip package (`pip install cryptography`).
+
+---
+
+## Element Format Reference
+
+### Required Fields (all elements)
+`type`, `id` (unique string), `x`, `y`, `width`, `height`
+
+### Defaults (skip these -- they're applied automatically)
+- `strokeColor`: `"#1e1e1e"`
+- `backgroundColor`: `"transparent"`
+- `fillStyle`: `"solid"`
+- `strokeWidth`: `2`
+- `roughness`: `1` (hand-drawn look)
+- `opacity`: `100`
+
+Canvas background is white.
+
+### Element Types
+
+**Rectangle**:
+```json
+{ "type": "rectangle", "id": "r1", "x": 100, "y": 100, "width": 200, "height": 100 }
+```
+- `roundness: { "type": 3 }` for rounded corners
+- `backgroundColor: "#a5d8ff"`, `fillStyle: "solid"` for filled
+
+**Ellipse**:
+```json
+{ "type": "ellipse", "id": "e1", "x": 100, "y": 100, "width": 150, "height": 150 }
+```
+
+**Diamond**:
+```json
+{ "type": "diamond", "id": "d1", "x": 100, "y": 100, "width": 150, "height": 150 }
+```
+
+**Labeled shape (container binding)** -- create a text element bound to the shape:
+
+> **WARNING:** Do NOT use `"label": { "text": "..." }` on shapes. This is NOT a valid
+> Excalidraw property and will be silently ignored, producing blank shapes. You MUST
+> use the container binding approach below.
+
+The shape needs `boundElements` listing the text, and the text needs `containerId` pointing back:
+```json
+{ "type": "rectangle", "id": "r1", "x": 100, "y": 100, "width": 200, "height": 80,
+  "roundness": { "type": 3 }, "backgroundColor": "#a5d8ff", "fillStyle": "solid",
+  "boundElements": [{ "id": "t_r1", "type": "text" }] },
+{ "type": "text", "id": "t_r1", "x": 105, "y": 110, "width": 190, "height": 25,
+  "text": "Hello", "fontSize": 20, "fontFamily": 1, "strokeColor": "#1e1e1e",
+  "textAlign": "center", "verticalAlign": "middle",
+  "containerId": "r1", "originalText": "Hello", "autoResize": true }
+```
+- Works on rectangle, ellipse, diamond
+- Text is auto-centered by Excalidraw when `containerId` is set
+- The text `x`/`y`/`width`/`height` are approximate -- Excalidraw recalculates them on load
+- `originalText` should match `text`
+- Always include `fontFamily: 1` (Virgil/hand-drawn font)
+
+**Labeled arrow** -- same container binding approach:
+```json
+{ "type": "arrow", "id": "a1", "x": 300, "y": 150, "width": 200, "height": 0,
+  "points": [[0,0],[200,0]], "endArrowhead": "arrow",
+  "boundElements": [{ "id": "t_a1", "type": "text" }] },
+{ "type": "text", "id": "t_a1", "x": 370, "y": 130, "width": 60, "height": 20,
+  "text": "connects", "fontSize": 16, "fontFamily": 1, "strokeColor": "#1e1e1e",
+  "textAlign": "center", "verticalAlign": "middle",
+  "containerId": "a1", "originalText": "connects", "autoResize": true }
+```
+
+**Standalone text** (titles and annotations only -- no container):
+```json
+{ "type": "text", "id": "t1", "x": 150, "y": 138, "text": "Hello", "fontSize": 20,
+  "fontFamily": 1, "strokeColor": "#1e1e1e", "originalText": "Hello", "autoResize": true }
+```
+- `x` is the LEFT edge. To center at position `cx`, estimate the rendered width as `text.length * fontSize * 0.5` and set `x = cx - (text.length * fontSize * 0.5) / 2`
+- Do NOT rely on `textAlign` or `width` for positioning
+
+**Arrow**:
+```json
+{ "type": "arrow", "id": "a1", "x": 300, "y": 150, "width": 200, "height": 0,
+  "points": [[0,0],[200,0]], "endArrowhead": "arrow" }
+```
+- `points`: `[dx, dy]` offsets from element `x`, `y`
+- `endArrowhead`: `null` | `"arrow"` | `"bar"` | `"dot"` | `"triangle"`
+- `strokeStyle`: `"solid"` (default) | `"dashed"` | `"dotted"`
+
+### Arrow Bindings (connect arrows to shapes)
+
+```json
+{
+  "type": "arrow", "id": "a1", "x": 300, "y": 150, "width": 150, "height": 0,
+  "points": [[0,0],[150,0]], "endArrowhead": "arrow",
+  "startBinding": { "elementId": "r1", "fixedPoint": [1, 0.5] },
+  "endBinding": { "elementId": "r2", "fixedPoint": [0, 0.5] }
+}
+```
+
+`fixedPoint` coordinates: `top=[0.5,0]`, `bottom=[0.5,1]`, `left=[0,0.5]`, `right=[1,0.5]`
+
+### Drawing Order (z-order)
+- Array order = z-order (first = back, last = front)
+- Emit progressively: background zones → shape → its bound text → its arrows → next shape
+- BAD: all rectangles, then all texts, then all arrows
+- GOOD: bg_zone → shape1 → text_for_shape1 → arrow1 → arrow_label_text → shape2 → text_for_shape2 → ...
+- Always place the bound text element immediately after its container shape
+
+### Sizing Guidelines
+
+**Font sizes:**
+- Minimum `fontSize`: **16** for body text, labels, descriptions
+- Minimum `fontSize`: **20** for titles and headings
+- Minimum `fontSize`: **14** for secondary annotations only (sparingly)
+- NEVER use `fontSize` below 14
+
+**Element sizes:**
+- Minimum shape size: 120x60 for labeled rectangles/ellipses
+- Leave 20-30px gaps between elements minimum
+- Prefer fewer, larger elements over many tiny ones
+
+### Color Palette
+
+See `references/colors.md` for full color tables. Quick reference:
+
+| Use | Fill Color | Hex |
+|-----|-----------|-----|
+| Primary / Input | Light Blue | `#a5d8ff` |
+| Success / Output | Light Green | `#b2f2bb` |
+| Warning / External | Light Orange | `#ffd8a8` |
+| Processing / Special | Light Purple | `#d0bfff` |
+| Error / Critical | Light Red | `#ffc9c9` |
+| Notes / Decisions | Light Yellow | `#fff3bf` |
+| Storage / Data | Light Teal | `#c3fae8` |
+
+### Tips
+- Use the color palette consistently across the diagram
+- **Text contrast is CRITICAL** -- never use light gray on white backgrounds. The lightest acceptable text color on white is `#757575`
+- Do NOT use emoji in text -- they don't render in Excalidraw's font
+- For dark mode diagrams, see `references/dark-mode.md`
+- For larger examples, see `references/examples.md`
+
+
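The container-binding rule, the centering formula, and the file envelope described in `SKILL.md` can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not part of the change: `labeled_shape`, `centered_text_x`, and `envelope` are hypothetical helper names, and the text offsets are rough guesses that the skill says Excalidraw recalculates on load.

```python
import json


def labeled_shape(shape_type, shape_id, x, y, width, height, text,
                  font_size=20, **style):
    """Return [shape, text] linked in both directions, as the skill requires."""
    text_id = f"t_{shape_id}"
    shape = {
        "type": shape_type, "id": shape_id,
        "x": x, "y": y, "width": width, "height": height,
        # The shape lists its bound text...
        "boundElements": [{"id": text_id, "type": "text"}],
        **style,
    }
    label = {
        "type": "text", "id": text_id,
        # Approximate geometry only (hypothetical offsets).
        "x": x + 5, "y": y + 10, "width": width - 10, "height": 25,
        "text": text, "originalText": text,
        "fontSize": font_size, "fontFamily": 1, "strokeColor": "#1e1e1e",
        "textAlign": "center", "verticalAlign": "middle",
        # ...and the text points back at its container.
        "containerId": shape_id, "autoResize": True,
    }
    # Text immediately follows its container, per the drawing-order rule.
    return [shape, label]


def centered_text_x(cx, text, font_size):
    """Left edge for standalone text centered at cx, using the skill's estimate."""
    return cx - (len(text) * font_size * 0.5) / 2


def envelope(elements):
    """Wrap an elements list in the .excalidraw file envelope."""
    return {
        "type": "excalidraw", "version": 2, "source": "hermes-agent",
        "elements": elements,
        "appState": {"viewBackgroundColor": "#ffffff"},
    }


elements = labeled_shape("rectangle", "r1", 100, 100, 200, 80, "Hello",
                         backgroundColor="#a5d8ff", fillStyle="solid",
                         roundness={"type": 3})
doc = envelope(elements)
print(json.dumps(doc, indent=2))
```

Building both elements in one call keeps the `boundElements` and `containerId` references from drifting apart, which is the failure mode the warning about the `"label"` property is guarding against.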
--- a/skills/diagramming/excalidraw/references/colors.md
+++ b/skills/diagramming/excalidraw/references/colors.md
@@ -0,0 +1,44 @@
+# Excalidraw Color Palette
+
+Use these colors consistently across diagrams.
+
+## Primary Colors (for strokes, arrows, and accents)
+
+| Name | Hex | Use |
+|------|-----|-----|
+| Blue | `#4a9eed` | Primary actions, links, data series 1 |
+| Amber | `#f59e0b` | Warnings, highlights, data series 2 |
+| Green | `#22c55e` | Success, positive, data series 3 |
+| Red | `#ef4444` | Errors, negative, data series 4 |
+| Purple | `#8b5cf6` | Accents, special items, data series 5 |
+| Pink | `#ec4899` | Decorative, data series 6 |
+| Cyan | `#06b6d4` | Info, secondary, data series 7 |
+| Lime | `#84cc16` | Extra, data series 8 |
+
+## Pastel Fills (for shape backgrounds)
+
+| Color | Hex | Good For |
+|-------|-----|----------|
+| Light Blue | `#a5d8ff` | Input, sources, primary nodes |
+| Light Green | `#b2f2bb` | Success, output, completed |
+| Light Orange | `#ffd8a8` | Warning, pending, external |
+| Light Purple | `#d0bfff` | Processing, middleware, special |
+| Light Red | `#ffc9c9` | Error, critical, alerts |
+| Light Yellow | `#fff3bf` | Notes, decisions, planning |
+| Light Teal | `#c3fae8` | Storage, data, memory |
+| Light Pink | `#eebefa` | Analytics, metrics |
+
+## Background Zones (use with opacity: 30-35 for layered diagrams)
+
+| Color | Hex | Good For |
+|-------|-----|----------|
+| Blue zone | `#dbe4ff` | UI / frontend layer |
+| Purple zone | `#e5dbff` | Logic / agent layer |
+| Green zone | `#d3f9d8` | Data / tool layer |
+
+## Text Contrast Rules
+
+- **On white backgrounds**: the lightest acceptable text color is `#757575`. The default `#1e1e1e` is best.
+- **Colored text on light fills**: use dark variants (`#15803d` not `#22c55e`, `#2563eb` not `#4a9eed`)
+- **White text**: only on dark backgrounds (`#9a5030` not `#c4795b`)
+- **Never**: light gray (`#b0b0b0`, `#999`) on white -- unreadable
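The contrast rules in `colors.md` do not say how the `#757575` floor was chosen. One way to sanity-check a text color against it is the WCAG contrast-ratio formula with a 4.5:1 threshold. That choice of check is an assumption made here, not something the patch states, and the function names below are hypothetical. Under this formula `#757575` on white lands just above 4.5:1, while `#b0b0b0` and `#999` fall well below it.

```python
def _linear(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a #rgb or #rrggbb color."""
    h = hex_color.lstrip("#")
    if len(h) == 3:  # expand shorthand such as #999
        h = "".join(ch * 2 for ch in h)
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg, bg="#ffffff"):
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def readable_on_white(fg, threshold=4.5):
    """True if fg meets the assumed threshold against a white background."""
    return contrast_ratio(fg, "#ffffff") >= threshold


for color in ("#1e1e1e", "#757575", "#999", "#b0b0b0"):
    print(color, round(contrast_ratio(color), 2), readable_on_white(color))
```

The same function works for the dark palette by passing the dark background color as `bg`.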
--- a/skills/diagramming/excalidraw/references/dark-mode.md
+++ b/skills/diagramming/excalidraw/references/dark-mode.md
@@ -0,0 +1,68 @@
+# Excalidraw Dark Mode Diagrams
+
+To create a dark-themed diagram, place a very large dark background rectangle as the **first element** in the array, sized to cover any viewport:
+
+```json
+{
+  "type": "rectangle", "id": "darkbg",
+  "x": -4000, "y": -3000, "width": 10000, "height": 7500,
+  "backgroundColor": "#1e1e2e", "fillStyle": "solid",
+  "strokeColor": "transparent", "strokeWidth": 0
+}
+```
+
+Then use the following color palettes for elements on the dark background.
+
+## Text Colors (on dark)
+
+| Color | Hex | Use |
+|-------|-----|-----|
+| White | `#e5e5e5` | Primary text, titles |
+| Muted | `#a0a0a0` | Secondary text, annotations |
+| NEVER | `#555` or darker | Invisible on dark bg! |
+
+## Shape Fills (on dark)
+
+| Color | Hex | Good For |
+|-------|-----|----------|
+| Dark Blue | `#1e3a5f` | Primary nodes |
+| Dark Green | `#1a4d2e` | Success, output |
+| Dark Purple | `#2d1b69` | Processing, special |
+| Dark Orange | `#5c3d1a` | Warning, pending |
+| Dark Red | `#5c1a1a` | Error, critical |
+| Dark Teal | `#1a4d4d` | Storage, data |
+
+## Stroke and Arrow Colors (on dark)
+
+Use the standard Primary Colors from the main color palette -- they're bright enough on dark backgrounds:
+- Blue `#4a9eed`, Amber `#f59e0b`, Green `#22c55e`, Red `#ef4444`, Purple `#8b5cf6`
+
+For subtle shape borders, use `#555555`.
+
+## Example: Dark mode labeled rectangle
+
+Use container binding (NOT the `"label"` property, which doesn't work). On dark backgrounds, set text `strokeColor` to `"#e5e5e5"` so it's visible:
+
+```json
+[
+  {
+    "type": "rectangle", "id": "r1",
+    "x": 100, "y": 100, "width": 200, "height": 80,
+    "backgroundColor": "#1e3a5f", "fillStyle": "solid",
+    "strokeColor": "#4a9eed", "strokeWidth": 2,
+    "roundness": { "type": 3 },
+    "boundElements": [{ "id": "t_r1", "type": "text" }]
+  },
+  {
+    "type": "text", "id": "t_r1",
+    "x": 105, "y": 120, "width": 190, "height": 25,
+    "text": "Dark Node", "fontSize": 20, "fontFamily": 1,
+    "strokeColor": "#e5e5e5",
+    "textAlign": "center", "verticalAlign": "middle",
+    "containerId": "r1", "originalText": "Dark Node", "autoResize": true
+  }
+]
+```
+
+Note: For standalone text elements on dark backgrounds, always set `"strokeColor": "#e5e5e5"` explicitly. The default `#1e1e1e` is invisible on dark.
+
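The two mechanical steps in `dark-mode.md`, putting the background rectangle first and setting text `strokeColor` explicitly, can be sketched as a small Python transform. This is an illustration under stated assumptions and not part of the change: `to_dark_mode` is a hypothetical helper, and it deliberately leaves shape fills alone, since choosing among the dark fills is a design decision.

```python
# Background rectangle copied from the dark-mode reference.
DARK_BG = {
    "type": "rectangle", "id": "darkbg",
    "x": -4000, "y": -3000, "width": 10000, "height": 7500,
    "backgroundColor": "#1e1e2e", "fillStyle": "solid",
    "strokeColor": "transparent", "strokeWidth": 0,
}
LIGHT_TEXT = "#e5e5e5"
DEFAULT_TEXT = "#1e1e1e"  # Excalidraw's default stroke, invisible on dark


def to_dark_mode(elements):
    """Return a new elements list prepared for a dark background.

    The background rectangle goes first (array order is z-order), and any
    text that has no strokeColor or uses the default gets the light color.
    Input elements are copied, not mutated.
    """
    result = [dict(DARK_BG)]
    for element in elements:
        element = dict(element)
        is_text = element.get("type") == "text"
        if is_text and element.get("strokeColor", DEFAULT_TEXT) == DEFAULT_TEXT:
            element["strokeColor"] = LIGHT_TEXT
        result.append(element)
    return result


light = [
    {"type": "text", "id": "t1", "x": 0, "y": 0, "text": "Title",
     "fontSize": 20, "fontFamily": 1, "originalText": "Title"},
    {"type": "rectangle", "id": "r1", "x": 0, "y": 40, "width": 200,
     "height": 80, "strokeColor": "#4a9eed"},
]
dark = to_dark_mode(light)
print([el["id"] for el in dark])
```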
--- a/skills/diagramming/excalidraw/references/examples.md
+++ b/skills/diagramming/excalidraw/references/examples.md
@@ -0,0 +1,141 @@
+# Excalidraw Diagram Examples
+
+Complete, copy-pasteable examples. Wrap each in the `.excalidraw` envelope before saving:
+
+```json
+{
+  "type": "excalidraw",
+  "version": 2,
+  "source": "hermes-agent",
+  "elements": [ ...elements from examples below... ],
+  "appState": { "viewBackgroundColor": "#ffffff" }
+}
+```
+
+> **IMPORTANT:** All text labels on shapes and arrows use container binding (`containerId` + `boundElements`).
+> Do NOT use the non-existent `"label"` property -- it will be silently ignored, producing blank shapes.
+
+---
+
+## Example 1: Two Connected Labeled Boxes
+
+A minimal flowchart with two boxes and an arrow between them.
+
+```json
+[
+  { "type": "text", "id": "title", "x": 280, "y": 30, "text": "Simple Flow", "fontSize": 28, "fontFamily": 1, "strokeColor": "#1e1e1e", "originalText": "Simple Flow", "autoResize": true },
+  { "type": "rectangle", "id": "b1", "x": 100, "y": 100, "width": 200, "height": 100, "roundness": { "type": 3 }, "backgroundColor": "#a5d8ff", "fillStyle": "solid", "boundElements": [{ "id": "t_b1", "type": "text" }, { "id": "a1", "type": "arrow" }] },
+  { "type": "text", "id": "t_b1", "x": 105, "y": 130, "width": 190, "height": 25, "text": "Start", "fontSize": 20, "fontFamily": 1, "strokeColor": "#1e1e1e", "textAlign": "center", "verticalAlign": "middle", "containerId": "b1", "originalText": "Start", "autoResize": true },
+  { "type": "rectangle", "id": "b2", "x": 450, "y": 100, "width": 200, "height": 100, "roundness": { "type": 3 }, "backgroundColor": "#b2f2bb", "fillStyle": "solid", "boundElements": [{ "id": "t_b2", "type": "text" }, { "id": "a1", "type": "arrow" }] },
+  { "type": "text", "id": "t_b2", "x": 455, "y": 130, "width": 190, "height": 25, "text": "End", "fontSize": 20, "fontFamily": 1, "strokeColor": "#1e1e1e", "textAlign": "center", "verticalAlign": "middle", "containerId": "b2", "originalText": "End", "autoResize": true },
+  { "type": "arrow", "id": "a1", "x": 300, "y": 150, "width": 150, "height": 0, "points": [[0,0],[150,0]], "endArrowhead": "arrow", "startBinding": { "elementId": "b1", "fixedPoint": [1, 0.5] }, "endBinding": { "elementId": "b2", "fixedPoint": [0, 0.5] } }
+]
+```
+
+---
+
+## Example 2: Photosynthesis Process Diagram
+
+A larger diagram with background zones, multiple nodes, and directional arrows showing inputs/outputs.
+
+```json
+[
+  {"type":"text","id":"ti","x":280,"y":10,"text":"Photosynthesis","fontSize":28,"fontFamily":1,"strokeColor":"#1e1e1e","originalText":"Photosynthesis","autoResize":true},
+  {"type":"text","id":"fo","x":245,"y":48,"text":"6CO2 + 6H2O --> C6H12O6 + 6O2","fontSize":16,"fontFamily":1,"strokeColor":"#757575","originalText":"6CO2 + 6H2O --> C6H12O6 + 6O2","autoResize":true},
+  {"type":"rectangle","id":"lf","x":150,"y":90,"width":520,"height":380,"backgroundColor":"#d3f9d8","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#22c55e","strokeWidth":1,"opacity":35},
+  {"type":"text","id":"lfl","x":170,"y":96,"text":"Inside the Leaf","fontSize":16,"fontFamily":1,"strokeColor":"#15803d","originalText":"Inside the Leaf","autoResize":true},
+
+  {"type":"rectangle","id":"lr","x":190,"y":190,"width":160,"height":70,"backgroundColor":"#fff3bf","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#f59e0b","boundElements":[{"id":"t_lr","type":"text"},{"id":"a1","type":"arrow"},{"id":"a2","type":"arrow"},{"id":"a3","type":"arrow"},{"id":"a5","type":"arrow"}]},
+  {"type":"text","id":"t_lr","x":195,"y":205,"width":150,"height":20,"text":"Light Reactions","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"lr","originalText":"Light Reactions","autoResize":true},
+
+  {"type":"arrow","id":"a1","x":350,"y":225,"width":120,"height":0,"points":[[0,0],[120,0]],"strokeColor":"#1e1e1e","strokeWidth":2,"endArrowhead":"arrow","boundElements":[{"id":"t_a1","type":"text"}]},
+  {"type":"text","id":"t_a1","x":390,"y":205,"width":40,"height":20,"text":"ATP","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"a1","originalText":"ATP","autoResize":true},
+
+  {"type":"rectangle","id":"cc","x":470,"y":190,"width":160,"height":70,"backgroundColor":"#d0bfff","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#8b5cf6","boundElements":[{"id":"t_cc","type":"text"},{"id":"a1","type":"arrow"},{"id":"a4","type":"arrow"},{"id":"a6","type":"arrow"}]},
+  {"type":"text","id":"t_cc","x":475,"y":205,"width":150,"height":20,"text":"Calvin Cycle","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"cc","originalText":"Calvin Cycle","autoResize":true},
+
+  {"type":"rectangle","id":"sl","x":10,"y":200,"width":120,"height":50,"backgroundColor":"#fff3bf","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#f59e0b","boundElements":[{"id":"t_sl","type":"text"},{"id":"a2","type":"arrow"}]},
+  {"type":"text","id":"t_sl","x":15,"y":210,"width":110,"height":20,"text":"Sunlight","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"sl","originalText":"Sunlight","autoResize":true},
+
+  {"type":"arrow","id":"a2","x":130,"y":225,"width":60,"height":0,"points":[[0,0],[60,0]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"rectangle","id":"wa","x":200,"y":360,"width":140,"height":50,"backgroundColor":"#a5d8ff","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#4a9eed","boundElements":[{"id":"t_wa","type":"text"},{"id":"a3","type":"arrow"}]},
+  {"type":"text","id":"t_wa","x":205,"y":370,"width":130,"height":20,"text":"Water (H2O)","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"wa","originalText":"Water (H2O)","autoResize":true},
+
+  {"type":"arrow","id":"a3","x":270,"y":360,"width":0,"height":-100,"points":[[0,0],[0,-100]],"strokeColor":"#4a9eed","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"rectangle","id":"co","x":480,"y":360,"width":130,"height":50,"backgroundColor":"#ffd8a8","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#f59e0b","boundElements":[{"id":"t_co","type":"text"},{"id":"a4","type":"arrow"}]},
+  {"type":"text","id":"t_co","x":485,"y":370,"width":120,"height":20,"text":"CO2","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"co","originalText":"CO2","autoResize":true},
+
+  {"type":"arrow","id":"a4","x":545,"y":360,"width":0,"height":-100,"points":[[0,0],[0,-100]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"rectangle","id":"ox","x":540,"y":100,"width":100,"height":40,"backgroundColor":"#ffc9c9","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#ef4444","boundElements":[{"id":"t_ox","type":"text"},{"id":"a5","type":"arrow"}]},
+  {"type":"text","id":"t_ox","x":545,"y":105,"width":90,"height":20,"text":"O2","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"ox","originalText":"O2","autoResize":true},
+
+  {"type":"arrow","id":"a5","x":310,"y":190,"width":230,"height":-50,"points":[[0,0],[230,-50]],"strokeColor":"#ef4444","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"rectangle","id":"gl","x":690,"y":195,"width":120,"height":60,"backgroundColor":"#c3fae8","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#22c55e","boundElements":[{"id":"t_gl","type":"text"},{"id":"a6","type":"arrow"}]},
+  {"type":"text","id":"t_gl","x":695,"y":210,"width":110,"height":25,"text":"Glucose","fontSize":18,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"gl","originalText":"Glucose","autoResize":true},
+
+  {"type":"arrow","id":"a6","x":630,"y":225,"width":60,"height":0,"points":[[0,0],[60,0]],"strokeColor":"#22c55e","strokeWidth":2,"endArrowhead":"arrow"},
+
+  {"type":"ellipse","id":"sun","x":30,"y":110,"width":50,"height":50,"backgroundColor":"#fff3bf","fillStyle":"solid","strokeColor":"#f59e0b","strokeWidth":2},
+  {"type":"arrow","id":"r1","x":55,"y":108,"width":0,"height":-14,"points":[[0,0],[0,-14]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":null,"startArrowhead":null},
+  {"type":"arrow","id":"r2","x":55,"y":162,"width":0,"height":14,"points":[[0,0],[0,14]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":null,"startArrowhead":null},
+  {"type":"arrow","id":"r3","x":28,"y":135,"width":-14,"height":0,"points":[[0,0],[-14,0]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":null,"startArrowhead":null},
+  {"type":"arrow","id":"r4","x":82,"y":135,"width":14,"height":0,"points":[[0,0],[14,0]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":null,"startArrowhead":null}
+]
+```
+
+---
+
+## Example 3: Sequence Diagram (UML-style)
+
+Demonstrates a sequence diagram with actors, dashed lifelines, and message arrows.
+
+```json
+[
+  {"type":"text","id":"title","x":200,"y":15,"text":"MCP Apps -- Sequence Flow","fontSize":24,"fontFamily":1,"strokeColor":"#1e1e1e","originalText":"MCP Apps -- Sequence Flow","autoResize":true},
+
+  {"type":"rectangle","id":"uHead","x":60,"y":60,"width":100,"height":40,"backgroundColor":"#a5d8ff","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#4a9eed","strokeWidth":2,"boundElements":[{"id":"t_uHead","type":"text"}]},
+  {"type":"text","id":"t_uHead","x":65,"y":65,"width":90,"height":20,"text":"User","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"uHead","originalText":"User","autoResize":true},
+
+  {"type":"arrow","id":"uLine","x":110,"y":100,"width":0,"height":400,"points":[[0,0],[0,400]],"strokeColor":"#b0b0b0","strokeWidth":1,"strokeStyle":"dashed","endArrowhead":null},
+
+  {"type":"rectangle","id":"aHead","x":230,"y":60,"width":100,"height":40,"backgroundColor":"#d0bfff","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#8b5cf6","strokeWidth":2,"boundElements":[{"id":"t_aHead","type":"text"}]},
+  {"type":"text","id":"t_aHead","x":235,"y":65,"width":90,"height":20,"text":"Agent","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"aHead","originalText":"Agent","autoResize":true},
+
+  {"type":"arrow","id":"aLine","x":280,"y":100,"width":0,"height":400,"points":[[0,0],[0,400]],"strokeColor":"#b0b0b0","strokeWidth":1,"strokeStyle":"dashed","endArrowhead":null},
+
+  {"type":"rectangle","id":"sHead","x":420,"y":60,"width":130,"height":40,"backgroundColor":"#ffd8a8","fillStyle":"solid","roundness":{"type":3},"strokeColor":"#f59e0b","strokeWidth":2,"boundElements":[{"id":"t_sHead","type":"text"}]},
+  {"type":"text","id":"t_sHead","x":425,"y":65,"width":120,"height":20,"text":"Server","fontSize":16,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"sHead","originalText":"Server","autoResize":true},
+
+  {"type":"arrow","id":"sLine","x":485,"y":100,"width":0,"height":400,"points":[[0,0],[0,400]],"strokeColor":"#b0b0b0","strokeWidth":1,"strokeStyle":"dashed","endArrowhead":null},
+
+  {"type":"arrow","id":"m1","x":110,"y":150,"width":170,"height":0,"points":[[0,0],[170,0]],"strokeColor":"#1e1e1e","strokeWidth":2,"endArrowhead":"arrow","boundElements":[{"id":"t_m1","type":"text"}]},
+  {"type":"text","id":"t_m1","x":165,"y":130,"width":60,"height":20,"text":"request","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"m1","originalText":"request","autoResize":true},
+
+  {"type":"arrow","id":"m2","x":280,"y":200,"width":205,"height":0,"points":[[0,0],[205,0]],"strokeColor":"#8b5cf6","strokeWidth":2,"endArrowhead":"arrow","boundElements":[{"id":"t_m2","type":"text"}]},
+  {"type":"text","id":"t_m2","x":352,"y":180,"width":60,"height":20,"text":"tools/call","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"m2","originalText":"tools/call","autoResize":true},
+
+  {"type":"arrow","id":"m3","x":485,"y":260,"width":-205,"height":0,"points":[[0,0],[-205,0]],"strokeColor":"#f59e0b","strokeWidth":2,"endArrowhead":"arrow","strokeStyle":"dashed","boundElements":[{"id":"t_m3","type":"text"}]},
+  {"type":"text","id":"t_m3","x":352,"y":240,"width":60,"height":20,"text":"result","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"m3","originalText":"result","autoResize":true},
+
+  {"type":"arrow","id":"m4","x":280,"y":320,"width":-170,"height":0,"points":[[0,0],[-170,0]],"strokeColor":"#8b5cf6","strokeWidth":2,"endArrowhead":"arrow","strokeStyle":"dashed","boundElements":[{"id":"t_m4","type":"text"}]},
+  {"type":"text","id":"t_m4","x":165,"y":300,"width":60,"height":20,"text":"response","fontSize":14,"fontFamily":1,"strokeColor":"#1e1e1e","textAlign":"center","verticalAlign":"middle","containerId":"m4","originalText":"response","autoResize":true}
+]
+```
+
+---
+
+## Common Mistakes to Avoid
+
+- **Do NOT use `"label"` property** -- this is the #1 mistake. It is NOT part of the Excalidraw file format and will be silently ignored, producing blank shapes with no visible text. Always use container binding (`containerId` + `boundElements`) as shown in the examples above.
+- **Every bound text needs both sides linked** -- the shape needs `boundElements: [{"id": "t_xxx", "type": "text"}]` AND the text needs `containerId: "shape_id"`. If either is missing, the binding won't work.
+- **Include `originalText` and `autoResize: true`** on all text elements -- Excalidraw uses these for proper text reflow.
+- **Include `fontFamily: 1`** on all text elements -- without it, text may not render with the expected hand-drawn font.
+- **Elements overlap when y-coordinates are close** -- always check that text, boxes, and labels don't stack on top of each other.
+- **Arrow labels need space** -- long labels like "ATP + NADPH" overflow short arrows. Keep labels short or make arrows wider.
+- **Center titles relative to the diagram** -- estimate total width and center the title text over it.
+- **Draw decorations LAST** -- cute illustrations (sun, stars, icons) should appear at the end of the array so they're drawn on top.
+
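The binding rules in the list above can be checked mechanically before a diagram is saved. The following is a minimal sketch and not part of this commit: `find_binding_problems` is a hypothetical helper, and it encodes only the rules stated above (no `label` property, two-way `containerId` / `boundElements` links, and the three recommended text fields).

```python
from typing import Any, Dict, List


def find_binding_problems(elements: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in a list of Excalidraw elements.

    Hypothetical helper: it checks only the rules from the list above.
    """
    by_id = {el["id"]: el for el in elements if "id" in el}
    problems: List[str] = []

    for el in elements:
        el_id = el.get("id", "<no id>")

        # Rule: "label" is not part of the file format and is ignored.
        if "label" in el:
            problems.append(f"{el_id}: uses 'label'; bind a text element instead")

        if el.get("type") == "text":
            # Rule: text elements should carry these three fields.
            for key in ("originalText", "autoResize", "fontFamily"):
                if key not in el:
                    problems.append(f"{el_id}: text element is missing '{key}'")

            # Rule: a bound text must be listed by its container.
            container_id = el.get("containerId")
            if container_id is not None:
                container = by_id.get(container_id) or {}
                bound = container.get("boundElements") or []
                listed = any(
                    b.get("id") == el_id and b.get("type") == "text" for b in bound
                )
                if not listed:
                    problems.append(
                        f"{el_id}: container '{container_id}' does not list it "
                        "in boundElements"
                    )

        # Rule: every text listed by a container must point back at it.
        for b in el.get("boundElements") or []:
            if b.get("type") != "text":
                continue  # arrows and other bindings are out of scope here
            text = by_id.get(b.get("id"))
            if text is None or text.get("containerId") != el_id:
                problems.append(
                    f"{el_id}: bound text '{b.get('id')}' is missing or has a "
                    "different containerId"
                )

    return problems


if __name__ == "__main__":
    sample = [
        {"type": "rectangle", "id": "box", "label": "Glucose"},
    ]
    for problem in find_binding_problems(sample):
        print(problem)
```

Running a check like this on the element array before writing the `.excalidraw` file catches one-sided bindings, which otherwise only show up as blank shapes.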
--- a/skills/diagramming/excalidraw/scripts/upload.py
+++ b/skills/diagramming/excalidraw/scripts/upload.py
@@ -0,0 +1,133 @@
+#!/usr/bin/env python3
+"""
+Upload an .excalidraw file to excalidraw.com and print a shareable URL.
+
+No account required. The diagram is encrypted client-side (AES-GCM) before
+upload -- the encryption key is embedded in the URL fragment, so the server
+never sees plaintext.
+
+Requirements:
+    pip install cryptography
+
+Usage:
+    python upload.py <path-to-file.excalidraw>
+
+Example:
+    python upload.py ~/diagrams/architecture.excalidraw
+    # prints: https://excalidraw.com/#json=abc123,encryptionKeyHere
+"""
+
+import json
+import os
+import struct
+import sys
+import zlib
+import base64
+import urllib.request
+
+try:
+    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
+except ImportError:
+    print("Error: 'cryptography' package is required for upload.")
+    print("Install it with: pip install cryptography")
+    sys.exit(1)
+
+# Excalidraw public upload endpoint (no auth needed)
+UPLOAD_URL = "https://json.excalidraw.com/api/v2/post/"
+
+
+def concat_buffers(*buffers: bytes) -> bytes:
+    """
+    Build the Excalidraw v2 concat-buffers binary format.
+
+    Layout: [version=1 (4B big-endian)] then for each buffer:
+            [length (4B big-endian)] [data bytes]
+    """
+    parts = [struct.pack(">I", 1)]  # version = 1
+    for buf in buffers:
+        parts.append(struct.pack(">I", len(buf)))
+        parts.append(buf)
+    return b"".join(parts)
+
+
+def upload(excalidraw_json: str) -> str:
+    """
+    Encrypt and upload Excalidraw JSON to excalidraw.com.
+
+    Args:
+        excalidraw_json: The full .excalidraw file content as a string.
+
+    Returns:
+        Shareable URL string.
+    """
+    # 1. Inner payload: concat_buffers(file_metadata, data)
+    file_metadata = json.dumps({}).encode("utf-8")
+    data_bytes = excalidraw_json.encode("utf-8")
+    inner_payload = concat_buffers(file_metadata, data_bytes)
+
+    # 2. Compress with zlib
+    compressed = zlib.compress(inner_payload)
+
+    # 3. AES-GCM 128-bit encrypt
+    raw_key = os.urandom(16)   # 128-bit key
+    iv = os.urandom(12)        # 12-byte nonce
+    aesgcm = AESGCM(raw_key)
+    encrypted = aesgcm.encrypt(iv, compressed, None)
+
+    # 4. Encoding metadata
+    encoding_meta = json.dumps({
+        "version": 2,
+        "compression": "pako@1",
+        "encryption": "AES-GCM",
+    }).encode("utf-8")
+
+    # 5. Outer payload: concat_buffers(encoding_meta, iv, encrypted)
+    payload = concat_buffers(encoding_meta, iv, encrypted)
+
+    # 6. Upload
+    req = urllib.request.Request(UPLOAD_URL, data=payload, method="POST")
+    with urllib.request.urlopen(req, timeout=30) as resp:
+        if resp.status != 200:
+            raise RuntimeError(f"Upload failed with HTTP {resp.status}")
+        result = json.loads(resp.read().decode("utf-8"))
+
+    file_id = result.get("id")
+    if not file_id:
+        raise RuntimeError(f"Upload returned no file ID. Response: {result}")
+
+    # 7. Key as base64url (JWK 'k' format, no padding)
+    key_b64 = base64.urlsafe_b64encode(raw_key).rstrip(b"=").decode("ascii")
+
+    return f"https://excalidraw.com/#json={file_id},{key_b64}"
+
+
+def main():
+    if len(sys.argv) < 2:
+        print("Usage: python upload.py <path-to-file.excalidraw>")
+        sys.exit(1)
+
+    file_path = sys.argv[1]
+
+    if not os.path.isfile(file_path):
+        print(f"Error: File not found: {file_path}")
+        sys.exit(1)
+
+    with open(file_path, "r", encoding="utf-8") as f:
+        content = f.read()
+
+    # Basic validation: should be valid JSON with an "elements" key
+    try:
+        doc = json.loads(content)
+    except json.JSONDecodeError as e:
+        print(f"Error: File is not valid JSON: {e}")
+        sys.exit(1)
+
+    if "elements" not in doc:
+        print("Warning: File does not contain an 'elements' key. Uploading anyway.")
+
+    url = upload(content)
+    print(url)
+
+
+if __name__ == "__main__":
+    main()
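For reference, the concat-buffers layout documented in `concat_buffers` is straightforward to parse back. The sketch below is not part of this commit: `split_buffers` is a hypothetical inverse, written only from the layout stated in the docstring (a 4-byte big-endian version, then length-prefixed chunks). A round trip like this is a cheap way to test the framing without touching the network.

```python
import struct
from typing import List


def concat_buffers(*buffers: bytes) -> bytes:
    """Same framing as upload.py: version (4B BE), then [length (4B BE)][data]..."""
    parts = [struct.pack(">I", 1)]
    for buf in buffers:
        parts.append(struct.pack(">I", len(buf)))
        parts.append(buf)
    return b"".join(parts)


def split_buffers(blob: bytes) -> List[bytes]:
    """Hypothetical inverse of concat_buffers; raises ValueError on bad framing."""
    if len(blob) < 4:
        raise ValueError("blob too short to contain a version header")
    (version,) = struct.unpack_from(">I", blob, 0)
    if version != 1:
        raise ValueError(f"unsupported concat-buffers version: {version}")

    chunks: List[bytes] = []
    offset = 4
    while offset < len(blob):
        if offset + 4 > len(blob):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        if offset + length > len(blob):
            raise ValueError("chunk length exceeds remaining data")
        chunks.append(blob[offset:offset + length])
        offset += length
    return chunks


if __name__ == "__main__":
    framed = concat_buffers(b"{}", b'{"elements": []}')
    print(split_buffers(framed))
```

Note that the upload script applies this framing twice: once for the inner payload (metadata and data) and once for the outer payload (encoding metadata, IV, ciphertext).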
--- a/tools/__init__.py
+++ b/tools/__init__.py
@@ -31,6 +31,8 @@ from .terminal_tool import (
    cleanup_vm,
    cleanup_all_environments,
    get_active_environments_info,
+    register_task_env_overrides,
+    clear_task_env_overrides,
    TERMINAL_TOOL_DESCRIPTION
 )

@@ -57,7 +59,6 @@ from .image_generation_tool import (
 )

 from .skills_tool import (
-    skills_categories,
    skills_list,
    skill_view,
    check_skills_requirements,
@@ -121,6 +122,12 @@ from .file_tools import (
    clear_file_ops_cache,
 )

+# Text-to-speech tools (Edge TTS / ElevenLabs / OpenAI)
+from .tts_tool import (
+    text_to_speech_tool,
+    check_tts_requirements,
+)
+
 # File tools have no external requirements - they use the terminal backend
 def check_file_requirements():
    """File tools only require terminal backend to be available."""
@@ -139,6 +146,8 @@ __all__ = [
    'cleanup_vm',
    'cleanup_all_environments',
    'get_active_environments_info',
+    'register_task_env_overrides',
+    'clear_task_env_overrides',
    'TERMINAL_TOOL_DESCRIPTION',
    # Terminal tools (Hecate/MorphCloud backend)
    'terminal_hecate_tool',
@@ -154,7 +163,6 @@ __all__ = [
    'image_generate_tool',
    'check_image_generation_requirements',
    # Skills tools
-    'skills_categories',
    'skills_list',
    'skill_view',
    'check_skills_requirements',
@@ -205,5 +213,8 @@ __all__ = [
    'get_file_tools',
    'clear_file_ops_cache',
    'check_file_requirements',
+    # Text-to-speech tools
+    'text_to_speech_tool',
+    'check_tts_requirements',
 ]

--- a/tools/browser_tool.py
+++ b/tools/browser_tool.py
@@ -51,6 +51,7 @@ import subprocess
 import shutil
 import sys
 import asyncio
+import tempfile
 import threading
 import time
 import requests
@@ -644,17 +645,25 @@ def _find_agent_browser() -> str:
    """
    Find the agent-browser CLI executable.
    
+    Checks in order: PATH, local node_modules/.bin/, npx fallback.
+    
    Returns:
        Path to agent-browser executable
        
    Raises:
        FileNotFoundError: If agent-browser is not installed
    """
-    # Check if it's in PATH
+    # Check if it's in PATH (global install)
    which_result = shutil.which("agent-browser")
    if which_result:
        return which_result
    
+    # Check local node_modules/.bin/ (npm install in repo root)
+    repo_root = Path(__file__).parent.parent
+    local_bin = repo_root / "node_modules" / ".bin" / "agent-browser"
+    if local_bin.exists():
+        return str(local_bin)
+    
    # Check common npx locations
    npx_path = shutil.which("npx")
    if npx_path:
@@ -662,6 +671,7 @@ def _find_agent_browser() -> str:
    
    raise FileNotFoundError(
        "agent-browser CLI not found. Install it with: npm install -g agent-browser\n"
+        "Or run 'npm install' in the repo root to install locally.\n"
        "Or ensure npx is available in your PATH."
    )

@@ -708,12 +718,26 @@ def _run_browser_command(
    ] + args
    
    try:
+        # Give each task its own socket directory to prevent concurrency conflicts.
+        # Without this, parallel workers fight over the same default socket path,
+        # causing "Failed to create socket directory: Permission denied" errors.
+        task_socket_dir = os.path.join(
+            tempfile.gettempdir(), 
+            f"agent-browser-{session_info['session_name']}"
+        )
+        os.makedirs(task_socket_dir, exist_ok=True)
+        
+        browser_env = {
+            **os.environ,
+            "AGENT_BROWSER_SOCKET_DIR": task_socket_dir,
+        }
+        
        result = subprocess.run(
            cmd_parts,
            capture_output=True,
            text=True,
            timeout=timeout,
-            env={**os.environ}
+            env=browser_env,
        )
        
        # Parse JSON output
@@ -1487,6 +1511,13 @@ def cleanup_browser(task_id: Optional[str] = None) -> None:
        except Exception as e:
            print(f"[browser_tool] Exception during BrowserBase session close: {e}", file=sys.stderr)
        
+        # Clean up per-task socket directory
+        session_name = session_info.get("session_name", "")
+        if session_name:
+            socket_dir = os.path.join(tempfile.gettempdir(), f"agent-browser-{session_name}")
+            if os.path.exists(socket_dir):
+                shutil.rmtree(socket_dir, ignore_errors=True)
+        
        del _active_sessions[task_id]
        if not os.getenv("HERMES_QUIET"):
            print(f"[browser_tool] Removed task {task_id} from active sessions", file=sys.stderr)
--- a/tools/file_operations.py
+++ b/tools/file_operations.py
@@ -254,12 +254,12 @@ class ShellFileOperations(FileOperations):
        Args:
            terminal_env: Any object with execute(command, cwd) method.
                         Returns {"output": str, "returncode": int}
-            cwd: Working directory (defaults to env's cwd or /tmp)
+            cwd: Working directory (defaults to env's cwd or current directory)
        """
        self.env = terminal_env
        # Determine cwd from various possible sources
        self.cwd = cwd or getattr(terminal_env, 'cwd', None) or \
-                   getattr(getattr(terminal_env, 'config', None), 'cwd', None) or '/tmp'
+                   getattr(getattr(terminal_env, 'config', None), 'cwd', None) or os.getcwd()
        
        # Cache for command availability checks
        self._command_cache: Dict[str, bool] = {}
--- a/tools/file_tools.py
+++ b/tools/file_tools.py
@@ -2,6 +2,7 @@
 """File Tools Module - LLM agent file manipulation tools."""

 import json
+import os
 import threading
 from typing import Optional
 from tools.file_operations import ShellFileOperations
@@ -11,23 +12,91 @@ _file_ops_cache: dict = {}


 def _get_file_ops(task_id: str = "default") -> ShellFileOperations:
-    """Get or create ShellFileOperations for a terminal environment."""
-    from tools.terminal_tool import _active_environments, _env_lock, _LocalEnvironment
+    """Get or create ShellFileOperations for a terminal environment.
    
+    Respects the TERMINAL_ENV setting -- if the task_id doesn't have an
+    environment yet, creates one using the configured backend (local, docker,
+    modal, etc.) rather than always defaulting to local.
+    """
+    from tools.terminal_tool import (
+        _active_environments, _env_lock, _create_environment,
+        _get_env_config, _last_activity, _start_cleanup_thread,
+        _check_disk_usage_warning,
+    )
+    import time
+    
+    # Fast path: check cache without heavy locks
    with _file_ops_lock:
        if task_id in _file_ops_cache:
            return _file_ops_cache[task_id]
-        
-        with _env_lock:
-            if task_id not in _active_environments:
-                import os
-                env = _LocalEnvironment(cwd=os.getcwd(), timeout=60)
-                _active_environments[task_id] = env
-            terminal_env = _active_environments[task_id]
-        
-        file_ops = ShellFileOperations(terminal_env)
+    
+    # Check if we need to create a new environment.
+    # Uses the same per-task creation locks as terminal_tool to prevent
+    # duplicate sandbox creation from concurrent tool calls.
+    from tools.terminal_tool import _creation_locks, _creation_locks_lock
+    
+    needs_creation = False
+    with _env_lock:
+        if task_id not in _active_environments:
+            needs_creation = True
+    
+    if needs_creation:
+        # Per-task lock: only one thread creates the sandbox, others wait
+        with _creation_locks_lock:
+            if task_id not in _creation_locks:
+                _creation_locks[task_id] = threading.Lock()
+            task_lock = _creation_locks[task_id]
+
+        with task_lock:
+            # Double-check after acquiring the per-task lock
+            with _env_lock:
+                if task_id in _active_environments:
+                    needs_creation = False
+
+            if needs_creation:
+                from tools.terminal_tool import _task_env_overrides
+                
+                config = _get_env_config()
+                env_type = config["env_type"]
+                overrides = _task_env_overrides.get(task_id, {})
+                
+                if env_type == "docker":
+                    image = overrides.get("docker_image") or config["docker_image"]
+                elif env_type == "singularity":
+                    image = overrides.get("singularity_image") or config["singularity_image"]
+                elif env_type == "modal":
+                    image = overrides.get("modal_image") or config["modal_image"]
+                else:
+                    image = ""
+                
+                cwd = overrides.get("cwd") or config["cwd"]
+                if not os.getenv("HERMES_QUIET"):
+                    print(f"[FileTools] Creating new {env_type} environment for task {task_id[:8]}...", flush=True)
+                
+                new_env = _create_environment(
+                    env_type=env_type,
+                    image=image,
+                    cwd=cwd,
+                    timeout=config["timeout"],
+                )
+                
+                with _env_lock:
+                    _active_environments[task_id] = new_env
+                    _last_activity[task_id] = time.time()
+                
+                _start_cleanup_thread()
+                if not os.getenv("HERMES_QUIET"):
+                    print(f"[FileTools] {env_type} environment ready for task {task_id[:8]}", flush=True)
+    
+    # Now get the environment and build file_ops
+    with _env_lock:
+        _last_activity[task_id] = time.time()
+        terminal_env = _active_environments[task_id]
+    
+    file_ops = ShellFileOperations(terminal_env)
+    with _file_ops_lock:
        _file_ops_cache[task_id] = file_ops
-        return file_ops
+    return file_ops
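The locking scheme in `_get_file_ops` is a per-key form of double-checked locking: a short global lock guards the registry, while a per-task lock serializes the slow creation step. A minimal standalone sketch of that pattern follows. The names `get_or_create`, `_resources`, and `factory` are illustrative assumptions and do not appear in the codebase.

```python
import threading
from typing import Any, Callable, Dict

_resources: Dict[str, Any] = {}
_resources_lock = threading.Lock()

_creation_locks: Dict[str, Any] = {}
_creation_locks_lock = threading.Lock()


def get_or_create(key: str, factory: Callable[[], Any]) -> Any:
    """Return the resource for `key`, calling `factory` at most once per key."""
    # Fast path: the resource already exists.
    with _resources_lock:
        if key in _resources:
            return _resources[key]

    # Fetch (or create) the lock dedicated to this key.
    with _creation_locks_lock:
        key_lock = _creation_locks.setdefault(key, threading.Lock())

    with key_lock:
        # Double-check: another thread may have finished while we waited.
        with _resources_lock:
            if key in _resources:
                return _resources[key]

        # Slow step runs outside the global lock, so other keys are not blocked.
        resource = factory()

        with _resources_lock:
            _resources[key] = resource
        return resource


if __name__ == "__main__":
    print(get_or_create("task-1", lambda: object()))
```

The design choice worth noting is that the global lock is never held during `factory()`, so creating a slow sandbox for one task does not stall tool calls for other tasks.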


 def clear_file_ops_cache(task_id: str = None):
@@ -56,6 +125,7 @@ def write_file_tool(path: str, content: str, task_id: str = "default") -> str:
        result = file_ops.write_file(path, content)
        return json.dumps(result.to_dict(), ensure_ascii=False)
    except Exception as e:
+        print(f"[FileTools] write_file error: {type(e).__name__}: {e}", flush=True)
        return json.dumps({"error": str(e)}, ensure_ascii=False)


--- a/tools/rl_training_tool.py
+++ b/tools/rl_training_tool.py
@@ -1300,10 +1300,26 @@ async def rl_test_inference(
 # Requirements Check
 # ============================================================================

+def check_rl_python_version() -> bool:
+    """
+    Check if Python version meets the minimum for RL tools.
+    
+    tinker-atropos depends on the 'tinker' package which requires Python >= 3.11.
+    """
+    return sys.version_info >= (3, 11)
+
+
 def check_rl_api_keys() -> bool:
    """
-    Check if required API keys are available.
+    Check if required API keys and Python version are available.
+    
+    RL training requires:
+    - Python >= 3.11 (tinker package requirement)
+    - TINKER_API_KEY for the Tinker training API
+    - WANDB_API_KEY for Weights & Biases metrics
    """
+    if not check_rl_python_version():
+        return False
    tinker_key = os.getenv("TINKER_API_KEY")
    wandb_key = os.getenv("WANDB_API_KEY")
    return bool(tinker_key) and bool(wandb_key)
@@ -1311,9 +1327,11 @@ def check_rl_api_keys() -> bool:

 def get_missing_keys() -> List[str]:
    """
-    Get list of missing required API keys.
+    Get list of missing requirements for RL tools (API keys and Python version).
    """
    missing = []
+    if not check_rl_python_version():
+        missing.append(f"Python >= 3.11 (current: {sys.version_info.major}.{sys.version_info.minor})")
    if not os.getenv("TINKER_API_KEY"):
        missing.append("TINKER_API_KEY")
    if not os.getenv("WANDB_API_KEY"):
--- a/tools/terminal_tool.py
+++ b/tools/terminal_tool.py
@@ -28,6 +28,7 @@ Usage:

 import json
 import os
+import signal
 import sys
 import time
 import threading
@@ -39,6 +40,28 @@ import uuid
 from pathlib import Path
 from typing import Optional, Dict, Any

+
+# ---------------------------------------------------------------------------
+# Global interrupt event: set by the agent when a user interrupt arrives.
+# The terminal tool polls this during command execution so it can kill
+# long-running subprocesses immediately instead of blocking until timeout.
+# ---------------------------------------------------------------------------
+_interrupt_event = threading.Event()
+
+
+def set_interrupt_event(active: bool) -> None:
+    """Called by the agent to signal or clear the interrupt."""
+    if active:
+        _interrupt_event.set()
+    else:
+        _interrupt_event.clear()
+
+
+def is_interrupted() -> bool:
+    """Check if an interrupt has been requested."""
+    return _interrupt_event.is_set()
+
+
 # Add mini-swe-agent to path if not installed
 mini_swe_path = Path(__file__).parent.parent / "mini-swe-agent" / "src"
 if mini_swe_path.exists():
@@ -83,9 +106,9 @@ def _get_apptainer_cache_dir() -> Path:
        cache_path.mkdir(parents=True, exist_ok=True)
        return cache_path
    
-    # Use scratch dir parent for cache (one level up from sandboxes)
+    # Use user-specific subdirectory in scratch for cache
    scratch = _get_scratch_dir()
-    cache_path = scratch.parent / ".apptainer"
+    cache_path = scratch / ".apptainer"
    cache_path.mkdir(parents=True, exist_ok=True)
    return cache_path

@@ -214,6 +237,10 @@ _cached_sudo_password: str = ""
 # Session-cached dangerous command approvals (pattern -> approved)
 _session_approved_patterns: set = set()

+# Last approval-required command (for gateway to pick up)
+# Set by _check_dangerous_command when in ask mode, read by gateway
+_last_pending_approval: dict = {}
+
 # Dangerous command patterns (regex, description)
 DANGEROUS_PATTERNS = [
    (r'\brm\s+(-[^\s]*\s+)*/', "delete in root path"),
@@ -385,12 +412,22 @@ def _check_dangerous_command(command: str, env_type: str) -> dict:
        # Programmatic use - allow (user opted into local backend)
        return {"approved": True, "message": None}
    
-    if is_gateway:
-        # Messaging context - return informative denial, agent should ask user
+    if is_gateway or os.getenv("HERMES_EXEC_ASK"):
+        # Messaging context - return approval_required so the gateway can
+        # prompt the user interactively instead of just blocking
+        global _last_pending_approval
+        _last_pending_approval = {
+            "command": command,
+            "pattern_key": pattern_key,
+            "description": description,
+        }
        return {
            "approved": False,
            "pattern_key": pattern_key,
-            "message": f"BLOCKED: This command is potentially dangerous ({description}). Tell the user and ask if they want to add this command pattern to their allowlist. They can do this via 'hermes config edit' or by running the command directly on their machine."
+            "status": "approval_required",
+            "command": command,
+            "description": description,
+            "message": f"⚠️ This command is potentially dangerous ({description}). Asking the user for approval..."
        }
    
    # CLI context - prompt user
@@ -599,7 +636,13 @@ class _LocalEnvironment:
        self.env = env or {}
    
    def execute(self, command: str, cwd: str = "", *, timeout: int | None = None) -> dict:
-        """Execute a command locally with sudo support."""
+        """
+        Execute a command locally with sudo support.
+        
+        Uses Popen + polling so the global interrupt event can kill the
+        process early when the user sends a new message, instead of
+        blocking for the full timeout.
+        """
        work_dir = cwd or self.cwd or os.getcwd()
        effective_timeout = timeout or self.timeout
        
@@ -607,22 +650,56 @@ class _LocalEnvironment:
        exec_command = _transform_sudo_command(command)
        
        try:
-            result = subprocess.run(
+            proc = subprocess.Popen(
                exec_command,
                shell=True,
                text=True,
                cwd=work_dir,
                env=os.environ | self.env,
-                timeout=effective_timeout,
                encoding="utf-8",
                errors="replace",
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                stdin=subprocess.DEVNULL,  # Prevent hanging on interactive prompts
+                # Start in a new session (and process group) so we can kill the
+                # whole tree; start_new_session is the thread-safe equivalent of
+                # preexec_fn=os.setsid
+                start_new_session=True,
            )
-            return {"output": result.stdout, "returncode": result.returncode}
-        except subprocess.TimeoutExpired:
-            return {"output": f"Command timed out after {effective_timeout}s", "returncode": 124}
+            
+            deadline = time.monotonic() + effective_timeout
+            
+            # Poll every 200ms so we notice interrupts quickly
+            while proc.poll() is None:
+                if _interrupt_event.is_set():
+                    # User sent a new message — kill the process tree and return
+                    # what we have so far
+                    try:
+                        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
+                    except (ProcessLookupError, PermissionError):
+                        proc.kill()
+                    # Grab any partial output
+                    partial, _ = proc.communicate(timeout=2)
+                    output = partial or ""
+                    return {
+                        "output": output + "\n[Command interrupted — user sent a new message]",
+                        "returncode": 130  # 128 + SIGINT (2): conventional exit code for an interrupted command
+                    }
+                
+                if time.monotonic() > deadline:
+                    # Timeout — kill process tree
+                    try:
+                        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
+                    except (ProcessLookupError, PermissionError):
+                        proc.kill()
+                    proc.communicate(timeout=2)
+                    return {"output": f"Command timed out after {effective_timeout}s", "returncode": 124}
+                
+                # Short sleep to avoid busy-waiting
+                time.sleep(0.2)
+            
+            # Process finished normally — read all output
+            stdout, _ = proc.communicate()
+            return {"output": stdout or "", "returncode": proc.returncode}
+            
        except Exception as e:
            return {"output": f"Execution error: {str(e)}", "returncode": 1}
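The Popen-and-poll approach above can be exercised in isolation. The sketch below is a simplified, POSIX-only illustration under stated assumptions, not the code from this commit: `run_interruptible` and its helpers are hypothetical names, the command is passed as an argument list rather than through a shell, and the SIGKILL escalation is an addition that the original loop does not have.

```python
import os
import signal
import subprocess
import sys
import threading
import time
from typing import Dict, List, Union


def _kill_group(proc: subprocess.Popen, sig: int) -> None:
    """Signal the child's whole process group (pid == pgid in a new session)."""
    try:
        os.killpg(proc.pid, sig)
    except (ProcessLookupError, PermissionError):
        proc.kill()


def _terminate(proc: subprocess.Popen, grace: float = 2.0) -> str:
    """Stop the process tree and return whatever output was captured."""
    _kill_group(proc, signal.SIGTERM)
    try:
        out, _ = proc.communicate(timeout=grace)
    except subprocess.TimeoutExpired:
        # SIGTERM was ignored: escalate (an assumption, not in the original).
        _kill_group(proc, signal.SIGKILL)
        out, _ = proc.communicate()
    return out or ""


def run_interruptible(
    cmd: List[str],
    interrupt: threading.Event,
    timeout: float = 60.0,
    poll_interval: float = 0.2,
) -> Dict[str, Union[str, int]]:
    """Run `cmd`, returning early if `interrupt` is set or `timeout` elapses."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        stdin=subprocess.DEVNULL,
        text=True,
        start_new_session=True,  # own process group, so the tree can be killed
    )
    deadline = time.monotonic() + timeout

    while proc.poll() is None:
        if interrupt.is_set():
            partial = _terminate(proc)
            return {"output": partial + "\n[interrupted]", "returncode": 130}
        if time.monotonic() > deadline:
            _terminate(proc)
            return {"output": f"Command timed out after {timeout}s", "returncode": 124}
        time.sleep(poll_interval)

    out, _ = proc.communicate()
    return {"output": out or "", "returncode": proc.returncode}


if __name__ == "__main__":
    stop = threading.Event()
    threading.Timer(0.5, stop.set).start()  # simulate a user interrupt
    print(run_interruptible([sys.executable, "-c", "import time; time.sleep(30)"], stop))
```

One caveat applies to the sketch and to any loop that polls without reading: a child that writes more than the OS pipe buffer holds will block on its write until the output is read, so commands with very large output may appear to hang until the timeout.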
    
@@ -637,15 +714,21 @@ class _LocalEnvironment:

 class _SingularityEnvironment:
    """
-    Custom Singularity/Apptainer environment with better space management.
+    Persistent Singularity/Apptainer container environment.
    
-    - Automatically builds/caches SIF images from docker:// URLs
-    - Builds sandbox in /scratch (if available) or configurable location
-    - Binds a large working directory into the container
-    - Keeps container isolated from host filesystem
+    Uses `apptainer instance` to create a long-running container that persists
+    state (files, installs, env changes) across all commands within a task.
+    The model experiences this as a real Linux VM.
+    
+    Features:
+    - Persistent filesystem: files created in one command are visible in the next
+    - Package installs persist: pip/apt installs survive across tool calls
+    - Full isolation: --containall gives PID, IPC, and environment isolation
+    - Writable tmpfs overlay: full root filesystem is writable (RAM-backed)
+    - Automatic SIF caching: docker:// images converted to SIF once, reused forever
    """
    
-    def __init__(self, image: str, cwd: str = "/workspace", timeout: int = 60):
+    def __init__(self, image: str, cwd: str = "/root", timeout: int = 60):
        self.cwd = cwd
        self.timeout = timeout
        
@@ -655,60 +738,60 @@ class _SingularityEnvironment:
        # Get or build SIF from docker:// URL (fast if already cached)
        self.image = _get_or_build_sif(image, self.executable)
        
-        # Get scratch directory for sandbox
-        self.scratch_dir = _get_scratch_dir()
+        # Create unique instance name (must be alphanumeric + underscores)
+        self.instance_id = f"hermes_{uuid.uuid4().hex[:12]}"
+        self._instance_started = False
        
-        # Create unique sandbox directory
-        self.sandbox_id = f"hermes-{uuid.uuid4().hex[:12]}"
-        self.sandbox_dir = self.scratch_dir / self.sandbox_id
-        
-        # Create a working directory that will be bound into the container
-        self.work_dir = self.scratch_dir / f"{self.sandbox_id}-work"
-        self.work_dir.mkdir(parents=True, exist_ok=True)
-        
-        # Build the sandbox
-        self._build_sandbox()
+        # Start the persistent instance
+        self._start_instance()
    
-    def _build_sandbox(self):
-        """Build a writable sandbox from the container image (SIF or other)."""
+    def _start_instance(self):
+        """Start a persistent apptainer instance.
+        
+        The instance runs as a background process. All subsequent execute() calls
+        run commands inside this same instance, so state persists across calls.
+        """
+        cmd = [
+            self.executable, "instance", "start",
+            "--writable-tmpfs",  # RAM-backed writable overlay on read-only SIF
+            "--containall",      # Full isolation: PID, IPC, environment, filesystem
+            str(self.image),
+            self.instance_id,
+        ]
+        
        try:
            result = subprocess.run(
-                [self.executable, "build", "--sandbox", str(self.sandbox_dir), self.image],
+                cmd,
                capture_output=True,
                text=True,
-                timeout=300  # 5 min timeout for building
+                timeout=120,  # 2 min for instance startup
            )
            if result.returncode != 0:
-                raise RuntimeError(f"Failed to build sandbox: {result.stderr}")
+                raise RuntimeError(f"Failed to start instance: {result.stderr}")
            
-            # Create /workspace directory inside the sandbox for bind mounting
-            workspace_in_sandbox = self.sandbox_dir / "workspace"
-            workspace_in_sandbox.mkdir(parents=True, exist_ok=True)
+            self._instance_started = True
+            print(f"[Singularity] Instance {self.instance_id} started (persistent container)", flush=True)
            
        except subprocess.TimeoutExpired:
-            shutil.rmtree(self.sandbox_dir, ignore_errors=True)
-            raise RuntimeError("Sandbox build timed out")
+            raise RuntimeError("Instance start timed out")
    
    def execute(self, command: str, cwd: str = "", *, timeout: int | None = None) -> dict:
-        """Execute a command in the Singularity container."""
+        """Execute a command in the persistent Singularity instance.
+        
+        All commands run in the same container, so files, installs, and
+        environment changes persist between calls.
+        """
+        if not self._instance_started:
+            return {"output": "Instance not started", "returncode": -1}
+        
        cmd = [self.executable, "exec"]
        
-        # Isolation flags - contain but allow network
-        cmd.extend(["--contain", "--cleanenv"])
-        
-        # Bind the working directory into the container at /workspace
-        # This gives the container access to a large writable space
-        cmd.extend(["--bind", f"{self.work_dir}:/workspace"])
-        
-        # Also bind it to /tmp inside container for pip cache etc.
-        cmd.extend(["--bind", f"{self.work_dir}:/tmp"])
-        
        # Set working directory
        work_dir = cwd or self.cwd
        cmd.extend(["--pwd", work_dir])
        
-        # Use writable sandbox
-        cmd.extend(["--writable", str(self.sandbox_dir)])
+        # Connect to the running instance
+        cmd.append(f"instance://{self.instance_id}")
        
        # Transform sudo commands if SUDO_PASSWORD is available
        exec_command = _transform_sudo_command(command)
@@ -732,9 +815,19 @@ class _SingularityEnvironment:
            return {"output": f"Command timed out after {timeout or self.timeout}s", "returncode": 124}
    
    def cleanup(self):
-        """Clean up sandbox and working directory."""
-        shutil.rmtree(self.sandbox_dir, ignore_errors=True)
-        shutil.rmtree(self.work_dir, ignore_errors=True)
+        """Stop the persistent instance and clean up."""
+        if self._instance_started:
+            try:
+                subprocess.run(
+                    [self.executable, "instance", "stop", self.instance_id],
+                    capture_output=True,
+                    text=True,
+                    timeout=30,
+                )
+                print(f"[Singularity] Instance {self.instance_id} stopped", flush=True)
+            except Exception as e:
+                print(f"[Singularity] Warning: failed to stop instance {self.instance_id}: {e}", flush=True)
+            self._instance_started = False
    
    def stop(self):
        """Alias for cleanup."""
@@ -742,7 +835,10 @@ class _SingularityEnvironment:
    
    def __del__(self):
        """Cleanup on destruction."""
-        self.cleanup()
+        try:
+            self.cleanup()
+        except Exception:
+            pass


 class _SSHEnvironment:
@@ -957,13 +1053,37 @@ class _ModalEnvironment:
    
    Wraps mini-swe-agent's SwerexModalEnvironment but adds:
    - SUDO_PASSWORD support via _transform_sudo_command
+    - Automatic async-safety patches (applied once, before first use)
    
-    Note: stdin handling is not needed for Modal since it uses remote async execution.
+    The patches replace SwerexModalEnvironment's asyncio.run() calls with a
+    background thread approach, making it safe to use inside any event loop
+    (e.g., Atropos). Applied here at the point of use rather than relying on
+    import-time side effects, so ALL callers get the fix automatically.
    """
    
-    def __init__(self, image: str, cwd: str = "/", timeout: int = 60):
+    # Class-level flag: patches only need to be applied once
+    _patches_applied = False
+    
+    def __init__(self, image: str, cwd: str = "/root", timeout: int = 60):
+        # Ensure async-safety patches are applied before creating any
+        # SwerexModalEnvironment instance. This is the single authoritative
+        # place -- no other module needs to call apply_patches() for Modal.
+        if not _ModalEnvironment._patches_applied:
+            try:
+                from environments.patches import apply_patches
+                apply_patches()
+            except ImportError:
+                pass  # patches module not available (standalone use)
+            _ModalEnvironment._patches_applied = True
+        
        from minisweagent.environments.extra.swerex_modal import SwerexModalEnvironment
-        self._inner = SwerexModalEnvironment(image=image, cwd=cwd, timeout=timeout)
+        # Generous startup timeout: sandbox creation can take 30-60s for cold images,
+        # and the SWE-ReX runtime needs another 10-30s to boot inside it.
+        self._inner = SwerexModalEnvironment(
+            image=image, cwd=cwd, timeout=timeout,
+            startup_timeout=180.0,
+            runtime_timeout=3600.0,
+        )
        self.cwd = cwd
        self.timeout = timeout
    
@@ -1014,7 +1134,7 @@ TERMINAL_TOOL_DESCRIPTION = """Execute commands on a secure Linux environment.
 - Run servers/long processes in background
 - Monitor disk usage for large tasks
 - Install whatever tools you need with apt-get or pip
- Do not be afraid to run pip with --break-system-packages
+- Try to create or use a venv with uv or python -m venv to keep isolation from global system packages.

 **Things to avoid:**
 - Do NOT use interactive tools such as tmux, vim, nano, python repl - you will get stuck.
@@ -1026,20 +1146,73 @@ _active_environments: Dict[str, Any] = {}
 _task_workdirs: Dict[str, str] = {}  # Maps task_id to working directory
 _last_activity: Dict[str, float] = {}
 _env_lock = threading.Lock()
+_creation_locks: Dict[str, threading.Lock] = {}  # Per-task locks for sandbox creation
+_creation_locks_lock = threading.Lock()  # Protects _creation_locks dict itself
 _cleanup_thread = None
 _cleanup_running = False

+# Per-task environment overrides registry.
+# Allows environments (e.g., TerminalBench2Env) to specify a custom Docker/Modal
+# image for a specific task_id BEFORE the agent loop starts. When the terminal or
+# file tools create a new sandbox for that task_id, they check this registry first
+# and fall back to the TERMINAL_MODAL_IMAGE (etc.) env var if no override is set.
+#
+# This is never exposed to the model -- only infrastructure code calls it.
+# Thread-safe because each task_id is unique per rollout.
+_task_env_overrides: Dict[str, Dict[str, Any]] = {}
+
+
+def register_task_env_overrides(task_id: str, overrides: Dict[str, Any]):
+    """
+    Register environment overrides for a specific task/rollout.
+
+    Called by Atropos environments before the agent loop to configure
+    per-task sandbox settings (e.g., a custom Dockerfile for the Modal image).
+
+    Supported override keys:
+        - modal_image: str -- Path to Dockerfile or Docker Hub image name
+        - docker_image / singularity_image: str -- Image reference for that backend
+        - cwd: str -- Working directory inside the sandbox
+
+    Args:
+        task_id: The rollout's unique task identifier
+        overrides: Dict of config keys to override
+    """
+    _task_env_overrides[task_id] = overrides
+
+
+def clear_task_env_overrides(task_id: str):
+    """
+    Clear environment overrides for a task after rollout completes.
+
+    Called during cleanup to avoid stale entries accumulating.
+    """
+    _task_env_overrides.pop(task_id, None)
+
 # Configuration from environment variables
 def _get_env_config() -> Dict[str, Any]:
    """Get terminal environment configuration from environment variables."""
    # Default image with Python and Node.js for maximum compatibility
    default_image = "nikolaik/python-nodejs:python3.11-nodejs20"
+    env_type = os.getenv("TERMINAL_ENV", "local")
+    
+    # Default cwd depends on backend:
+    #   - local/ssh: current working directory (CLI resolves "." before we get here)
+    #   - modal/singularity: /root inside the container
+    #   - docker: / (container filesystem root)
+    if env_type in ("modal", "singularity"):
+        default_cwd = "/root"
+    elif env_type == "docker":
+        default_cwd = "/"
+    else:
+        default_cwd = os.getcwd()
+    
    return {
-        "env_type": os.getenv("TERMINAL_ENV", "local"),  # local, docker, singularity, modal, or ssh
+        "env_type": env_type,
        "docker_image": os.getenv("TERMINAL_DOCKER_IMAGE", default_image),
        "singularity_image": os.getenv("TERMINAL_SINGULARITY_IMAGE", f"docker://{default_image}"),
        "modal_image": os.getenv("TERMINAL_MODAL_IMAGE", default_image),
-        "cwd": os.getenv("TERMINAL_CWD", "/tmp"),
+        "cwd": os.getenv("TERMINAL_CWD", default_cwd),
        "timeout": int(os.getenv("TERMINAL_TIMEOUT", "60")),
        "lifetime_seconds": int(os.getenv("TERMINAL_LIFETIME_SECONDS", "300")),
        # SSH-specific config
@@ -1271,7 +1444,15 @@ def cleanup_vm(task_id: str):
                    print(f"[Terminal Cleanup] Error cleaning up environment for task {task_id}: {e}")


-atexit.register(_stop_cleanup_thread)
+def _atexit_cleanup():
+    """Stop cleanup thread and shut down all remaining sandboxes on exit."""
+    _stop_cleanup_thread()
+    if _active_environments:
+        count = len(_active_environments)
+        print(f"\n[Terminal Cleanup] Shutting down {count} remaining sandbox(es)...")
+        cleanup_all_environments()
+
+atexit.register(_atexit_cleanup)


 def terminal_tool(
@@ -1313,24 +1494,28 @@ def terminal_tool(
        # Get configuration
        config = _get_env_config()
        env_type = config["env_type"]
-        
-        # Select image based on env type
-        if env_type == "docker":
-            image = config["docker_image"]
-        elif env_type == "singularity":
-            image = config["singularity_image"]
-        elif env_type == "modal":
-            image = config["modal_image"]
-        else:
-            image = ""
-        
-        cwd = config["cwd"]
-        default_timeout = config["timeout"]
-        effective_timeout = timeout or default_timeout

        # Use task_id for environment isolation
        effective_task_id = task_id or "default"

+        # Check per-task overrides (set by environments like TerminalBench2Env)
+        # before falling back to global env var config
+        overrides = _task_env_overrides.get(effective_task_id, {})
+        
+        # Select image based on env type, with per-task override support
+        if env_type == "docker":
+            image = overrides.get("docker_image") or config["docker_image"]
+        elif env_type == "singularity":
+            image = overrides.get("singularity_image") or config["singularity_image"]
+        elif env_type == "modal":
+            image = overrides.get("modal_image") or config["modal_image"]
+        else:
+            image = ""
+        
+        cwd = overrides.get("cwd") or config["cwd"]
+        default_timeout = config["timeout"]
+        effective_timeout = timeout or default_timeout
+
        # For local environment in batch mode, create a unique subdirectory per task
        # This prevents parallel tasks from overwriting each other's files
        # In CLI mode (HERMES_QUIET), use the cwd directly without subdirectories
@@ -1346,47 +1531,86 @@ def terminal_tool(
        # Start cleanup thread
        _start_cleanup_thread()

-        # Get or create environment
+        # Get or create environment.
+        # Use a per-task creation lock so concurrent tool calls for the same
+        # task_id wait for the first one to finish creating the sandbox,
+        # instead of each creating their own (wasting Modal resources).
        with _env_lock:
-            if effective_task_id not in _active_environments:
-                # Check disk usage before creating new environment
-                _check_disk_usage_warning()
-                
-                try:
-                    # Build SSH config if using SSH environment
-                    ssh_config = None
-                    if env_type == "ssh":
-                        ssh_config = {
-                            "host": config.get("ssh_host", ""),
-                            "user": config.get("ssh_user", ""),
-                            "port": config.get("ssh_port", 22),
-                            "key": config.get("ssh_key", ""),
-                        }
-                    
-                    _active_environments[effective_task_id] = _create_environment(
-                        env_type=env_type,
-                        image=image,
-                        cwd=cwd,
-                        timeout=effective_timeout,
-                        ssh_config=ssh_config
-                    )
-                except ImportError as e:
-                    return json.dumps({
-                        "output": "",
-                        "exit_code": -1,
-                        "error": f"Terminal tool disabled: mini-swe-agent not available ({e})",
-                        "status": "disabled"
-                    }, ensure_ascii=False)
+            if effective_task_id in _active_environments:
+                _last_activity[effective_task_id] = time.time()
+                env = _active_environments[effective_task_id]
+                needs_creation = False
+            else:
+                needs_creation = True

-            # Update last activity time
-            _last_activity[effective_task_id] = time.time()
-            env = _active_environments[effective_task_id]
+        if needs_creation:
+            # Per-task lock: only one thread creates the sandbox, others wait
+            with _creation_locks_lock:
+                if effective_task_id not in _creation_locks:
+                    _creation_locks[effective_task_id] = threading.Lock()
+                task_lock = _creation_locks[effective_task_id]
+
+            with task_lock:
+                # Double-check after acquiring the per-task lock
+                with _env_lock:
+                    if effective_task_id in _active_environments:
+                        _last_activity[effective_task_id] = time.time()
+                        env = _active_environments[effective_task_id]
+                        needs_creation = False
+
+                if needs_creation:
+                    if env_type in ("singularity", "local"):
+                        _check_disk_usage_warning()
+                    if not os.getenv("HERMES_QUIET"):
+                        print(f"[Terminal] Creating new {env_type} environment for task {effective_task_id[:8]}...", flush=True)
+                    try:
+                        ssh_config = None
+                        if env_type == "ssh":
+                            ssh_config = {
+                                "host": config.get("ssh_host", ""),
+                                "user": config.get("ssh_user", ""),
+                                "port": config.get("ssh_port", 22),
+                                "key": config.get("ssh_key", ""),
+                            }
+
+                        new_env = _create_environment(
+                            env_type=env_type,
+                            image=image,
+                            cwd=cwd,
+                            timeout=effective_timeout,
+                            ssh_config=ssh_config
+                        )
+                    except ImportError as e:
+                        return json.dumps({
+                            "output": "",
+                            "exit_code": -1,
+                            "error": f"Terminal tool disabled: mini-swe-agent not available ({e})",
+                            "status": "disabled"
+                        }, ensure_ascii=False)
+
+                    with _env_lock:
+                        _active_environments[effective_task_id] = new_env
+                        _last_activity[effective_task_id] = time.time()
+                        env = new_env
+                    if not os.getenv("HERMES_QUIET"):
+                        print(f"[Terminal] {env_type} environment ready for task {effective_task_id[:8]}", flush=True)

        # Check for dangerous commands (only for local/ssh in interactive modes)
        # Skip check if force=True (user has confirmed they want to run it)
        if not force:
            approval = _check_dangerous_command(command, env_type)
            if not approval["approved"]:
+                # Check if this is an approval_required (gateway ask mode)
+                if approval.get("status") == "approval_required":
+                    return json.dumps({
+                        "output": "",
+                        "exit_code": -1,
+                        "error": approval.get("message", "Waiting for user approval"),
+                        "status": "approval_required",
+                        "command": approval.get("command", command),
+                        "description": approval.get("description", "dangerous command"),
+                        "pattern_key": approval.get("pattern_key", ""),
+                    }, ensure_ascii=False)
                # Command was blocked - return informative message
                return json.dumps({
                    "output": "",
@@ -1435,13 +1659,20 @@ def terminal_tool(
                        retry_count += 1
                        wait_time = 2 ** retry_count
                        print(f"⚠️  Terminal: execution error, retrying in {wait_time}s (attempt {retry_count}/{max_retries})")
+                        print(f"   Command: {command[:200]}")
+                        print(f"   Error: {type(e).__name__}: {e}")
+                        print(f"   Task ID: {effective_task_id}, Backend: {env_type}")
                        time.sleep(wait_time)
                        continue
                    
+                    print(f"❌ Terminal: execution failed after {max_retries} retries")
+                    print(f"   Command: {command[:200]}")
+                    print(f"   Error: {type(e).__name__}: {e}")
+                    print(f"   Task ID: {effective_task_id}, Backend: {env_type}")
                    return json.dumps({
                        "output": "",
                        "exit_code": -1,
-                        "error": f"Command execution failed: {str(e)}"
+                        "error": f"Command execution failed: {type(e).__name__}: {str(e)}"
                    }, ensure_ascii=False)
                
                # Got a result
@@ -1546,6 +1777,6 @@ if __name__ == "__main__":
    print(f"  TERMINAL_DOCKER_IMAGE: {os.getenv('TERMINAL_DOCKER_IMAGE', default_img)}")
    print(f"  TERMINAL_SINGULARITY_IMAGE: {os.getenv('TERMINAL_SINGULARITY_IMAGE', f'docker://{default_img}')}")
    print(f"  TERMINAL_MODAL_IMAGE: {os.getenv('TERMINAL_MODAL_IMAGE', default_img)}")
-    print(f"  TERMINAL_CWD: {os.getenv('TERMINAL_CWD', '/tmp')}")
+    print(f"  TERMINAL_CWD: {os.getenv('TERMINAL_CWD', os.getcwd())}")
    print(f"  TERMINAL_TIMEOUT: {os.getenv('TERMINAL_TIMEOUT', '60')}")
    print(f"  TERMINAL_LIFETIME_SECONDS: {os.getenv('TERMINAL_LIFETIME_SECONDS', '300')}")
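Note on the sandbox-creation change in terminal_tool() above: the get-or-create logic is a double-checked pattern with one lock per task_id, so slow sandbox creation never holds the global lock. The following is a minimal standalone sketch of that pattern, not the code from this commit. The names get_or_create, factory, _envs and _envs_lock are hypothetical stand-ins for terminal_tool, _create_environment, _active_environments and _env_lock.

```python
import threading
from typing import Any, Callable, Dict

# Hypothetical stand-ins for the module-level registries in the diff.
_envs: Dict[str, Any] = {}
_envs_lock = threading.Lock()                      # guards _envs
_creation_locks: Dict[str, threading.Lock] = {}
_creation_locks_lock = threading.Lock()            # guards _creation_locks


def get_or_create(task_id: str, factory: Callable[[], Any]) -> Any:
    """Return the environment for task_id, creating it at most once."""
    # Fast path: the environment already exists.
    with _envs_lock:
        if task_id in _envs:
            return _envs[task_id]

    # Fetch (or lazily create) the lock dedicated to this task_id.
    with _creation_locks_lock:
        task_lock = _creation_locks.setdefault(task_id, threading.Lock())

    with task_lock:
        # Double-check: another thread may have finished creation
        # while this one was waiting on task_lock.
        with _envs_lock:
            if task_id in _envs:
                return _envs[task_id]

        # Slow step runs outside the global lock, so calls for
        # other task_ids are not blocked while this sandbox boots.
        env = factory()

        with _envs_lock:
            _envs[task_id] = env
        return env
```

Concurrent callers with the same task_id block on task_lock and then reuse the first caller's environment. Callers with different task_ids only contend briefly on the two registry locks.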
--- a/tools/tts_tool.py
+++ b/tools/tts_tool.py
@@ -0,0 +1,403 @@
+#!/usr/bin/env python3
+"""
+Text-to-Speech Tool Module
+
+Supports three TTS providers:
+- Edge TTS (default, free, no API key): Microsoft Edge neural voices
+- ElevenLabs (premium): High-quality voices, needs ELEVENLABS_API_KEY
+- OpenAI TTS: Good quality, needs OPENAI_API_KEY
+
+Output formats:
+- Opus (.ogg) for Telegram voice bubbles (requires ffmpeg for Edge TTS)
+- MP3 (.mp3) for everything else (CLI, Discord, WhatsApp)
+
+Configuration is loaded from ~/.hermes/config.yaml under the 'tts:' key.
+The user chooses the provider and voice; the model just sends text.
+
+Usage:
+    from tools.tts_tool import text_to_speech_tool, check_tts_requirements
+
+    result = text_to_speech_tool(text="Hello world")
+"""
+
+import asyncio
+import datetime
+import json
+import os
+import shutil
+import subprocess
+import tempfile
+from pathlib import Path
+from typing import Dict, Any, Optional
+
+# ---------------------------------------------------------------------------
+# Optional imports -- providers degrade gracefully if not installed
+# ---------------------------------------------------------------------------
+try:
+    import edge_tts
+    _HAS_EDGE_TTS = True
+except ImportError:
+    _HAS_EDGE_TTS = False
+
+try:
+    from elevenlabs.client import ElevenLabs
+    _HAS_ELEVENLABS = True
+except ImportError:
+    _HAS_ELEVENLABS = False
+
+# openai is a core dependency, but guard anyway
+try:
+    from openai import OpenAI as OpenAIClient
+    _HAS_OPENAI = True
+except ImportError:
+    _HAS_OPENAI = False
+
+
+# ===========================================================================
+# Defaults
+# ===========================================================================
+DEFAULT_PROVIDER = "edge"
+DEFAULT_EDGE_VOICE = "en-US-AriaNeural"
+DEFAULT_ELEVENLABS_VOICE_ID = "pNInz6obpgDQGcFmaJgB"  # Adam
+DEFAULT_ELEVENLABS_MODEL_ID = "eleven_multilingual_v2"
+DEFAULT_OPENAI_MODEL = "gpt-4o-mini-tts"
+DEFAULT_OPENAI_VOICE = "alloy"
+DEFAULT_OUTPUT_DIR = os.path.expanduser("~/voice-memos")
+MAX_TEXT_LENGTH = 4000
+
+
+# ===========================================================================
+# Config loader -- reads tts: section from ~/.hermes/config.yaml
+# ===========================================================================
+def _load_tts_config() -> Dict[str, Any]:
+    """
+    Load TTS configuration from ~/.hermes/config.yaml.
+
+    Returns the raw 'tts' section as a dict, or {} if the config cannot
+    be loaded. Callers apply defaults for any missing fields.
+    """
+    try:
+        from hermes_cli.config import load_config
+        config = load_config()
+        return config.get("tts", {})
+    except Exception:
+        return {}
+
+
+def _get_provider(tts_config: Dict[str, Any]) -> str:
+    """Get the configured TTS provider name."""
+    return tts_config.get("provider", DEFAULT_PROVIDER).lower().strip()
+
+
+# ===========================================================================
+# ffmpeg Opus conversion (Edge TTS MP3 -> OGG Opus for Telegram)
+# ===========================================================================
+def _has_ffmpeg() -> bool:
+    """Check if ffmpeg is available on the system."""
+    return shutil.which("ffmpeg") is not None
+
+
+def _convert_to_opus(mp3_path: str) -> Optional[str]:
+    """
+    Convert an MP3 file to OGG Opus format for Telegram voice bubbles.
+
+    Args:
+        mp3_path: Path to the input MP3 file.
+
+    Returns:
+        Path to the .ogg file, or None if conversion fails.
+    """
+    if not _has_ffmpeg():
+        return None
+
+    ogg_path = mp3_path.rsplit(".", 1)[0] + ".ogg"
+    try:
+        subprocess.run(
+            ["ffmpeg", "-i", mp3_path, "-acodec", "libopus",
+             "-ac", "1", "-b:a", "64k", "-vbr", "off", ogg_path, "-y"],
+            capture_output=True, timeout=30,
+        )
+        if os.path.exists(ogg_path) and os.path.getsize(ogg_path) > 0:
+            return ogg_path
+    except Exception:
+        pass
+    return None
+
+
+# ===========================================================================
+# Provider: Edge TTS (free)
+# ===========================================================================
+async def _generate_edge_tts(text: str, output_path: str, tts_config: Dict[str, Any]) -> str:
+    """
+    Generate audio using Edge TTS.
+
+    Args:
+        text: Text to convert.
+        output_path: Where to save the MP3 file.
+        tts_config: TTS config dict.
+
+    Returns:
+        Path to the saved audio file.
+    """
+    edge_config = tts_config.get("edge", {})
+    voice = edge_config.get("voice", DEFAULT_EDGE_VOICE)
+
+    communicate = edge_tts.Communicate(text, voice)
+    await communicate.save(output_path)
+    return output_path
+
+
+# ===========================================================================
+# Provider: ElevenLabs (premium)
+# ===========================================================================
+def _generate_elevenlabs(text: str, output_path: str, tts_config: Dict[str, Any]) -> str:
+    """
+    Generate audio using ElevenLabs.
+
+    Args:
+        text: Text to convert.
+        output_path: Where to save the audio file.
+        tts_config: TTS config dict.
+
+    Returns:
+        Path to the saved audio file.
+    """
+    api_key = os.getenv("ELEVENLABS_API_KEY", "")
+    if not api_key:
+        raise ValueError("ELEVENLABS_API_KEY not set. Get one at https://elevenlabs.io/")
+
+    el_config = tts_config.get("elevenlabs", {})
+    voice_id = el_config.get("voice_id", DEFAULT_ELEVENLABS_VOICE_ID)
+    model_id = el_config.get("model_id", DEFAULT_ELEVENLABS_MODEL_ID)
+
+    # Determine output format based on file extension
+    if output_path.endswith(".ogg"):
+        output_format = "opus_48000_64"
+    else:
+        output_format = "mp3_44100_128"
+
+    client = ElevenLabs(api_key=api_key)
+    audio_generator = client.text_to_speech.convert(
+        text=text,
+        voice_id=voice_id,
+        model_id=model_id,
+        output_format=output_format,
+    )
+
+    # audio_generator yields chunks -- write them all
+    with open(output_path, "wb") as f:
+        for chunk in audio_generator:
+            f.write(chunk)
+
+    return output_path
+
+
+# ===========================================================================
+# Provider: OpenAI TTS
+# ===========================================================================
+def _generate_openai_tts(text: str, output_path: str, tts_config: Dict[str, Any]) -> str:
+    """
+    Generate audio using OpenAI TTS.
+
+    Args:
+        text: Text to convert.
+        output_path: Where to save the audio file.
+        tts_config: TTS config dict.
+
+    Returns:
+        Path to the saved audio file.
+    """
+    api_key = os.getenv("OPENAI_API_KEY", "")
+    if not api_key:
+        raise ValueError("OPENAI_API_KEY not set. Get one at https://platform.openai.com/api-keys")
+
+    oai_config = tts_config.get("openai", {})
+    model = oai_config.get("model", DEFAULT_OPENAI_MODEL)
+    voice = oai_config.get("voice", DEFAULT_OPENAI_VOICE)
+
+    # Determine response format from extension
+    if output_path.endswith(".ogg"):
+        response_format = "opus"
+    else:
+        response_format = "mp3"
+
+    client = OpenAIClient(api_key=api_key)
+    response = client.audio.speech.create(
+        model=model,
+        voice=voice,
+        input=text,
+        response_format=response_format,
+    )
+
+    response.stream_to_file(output_path)
+    return output_path
+
+
+# ===========================================================================
+# Main tool function
+# ===========================================================================
+def text_to_speech_tool(
+    text: str,
+    output_path: Optional[str] = None,
+) -> str:
+    """
+    Convert text to speech audio.
+
+    Reads provider/voice config from ~/.hermes/config.yaml (tts: section).
+    The model sends text; the user configures voice and provider.
+
+    On messaging platforms, the returned MEDIA:<path> tag is intercepted
+    by the send pipeline and delivered as a native voice message.
+    In CLI mode, the file is saved to ~/voice-memos/.
+
+    Args:
+        text: The text to convert to speech.
+        output_path: Optional custom save path. Defaults to ~/voice-memos/<timestamp>.mp3
+
+    Returns:
+        str: JSON result with success, file_path, and optionally MEDIA tag.
+    """
+    if not text or not text.strip():
+        return json.dumps({"success": False, "error": "Text is required"}, ensure_ascii=False)
+
+    # Truncate very long text with a warning
+    if len(text) > MAX_TEXT_LENGTH:
+        print(f"⚠️  TTS text too long ({len(text)} chars), truncating to {MAX_TEXT_LENGTH}")
+        text = text[:MAX_TEXT_LENGTH]
+
+    tts_config = _load_tts_config()
+    provider = _get_provider(tts_config)
+
+    # Determine output path
+    if output_path:
+        file_path = Path(output_path).expanduser()
+    else:
+        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+        out_dir = Path(DEFAULT_OUTPUT_DIR)
+        out_dir.mkdir(parents=True, exist_ok=True)
+        file_path = out_dir / f"tts_{timestamp}.mp3"
+
+    # Ensure parent directory exists
+    file_path.parent.mkdir(parents=True, exist_ok=True)
+    file_str = str(file_path)
+
+    try:
+        # Generate audio with the configured provider
+        if provider == "elevenlabs":
+            if not _HAS_ELEVENLABS:
+                return json.dumps({
+                    "success": False,
+                    "error": "ElevenLabs provider selected but 'elevenlabs' package not installed. Run: pip install elevenlabs"
+                }, ensure_ascii=False)
+            print(f"🔊 Generating speech with ElevenLabs...")
+            _generate_elevenlabs(text, file_str, tts_config)
+
+        elif provider == "openai":
+            if not _HAS_OPENAI:
+                return json.dumps({
+                    "success": False,
+                    "error": "OpenAI provider selected but 'openai' package not installed."
+                }, ensure_ascii=False)
+            print(f"🔊 Generating speech with OpenAI TTS...")
+            _generate_openai_tts(text, file_str, tts_config)
+
+        else:
+            # Default: Edge TTS (free)
+            if not _HAS_EDGE_TTS:
+                return json.dumps({
+                    "success": False,
+                    "error": "Edge TTS not available. Run: pip install edge-tts"
+                }, ensure_ascii=False)
+            print(f"🔊 Generating speech with Edge TTS...")
+            # Edge TTS is async, run it
+            try:
+                asyncio.get_running_loop()  # raises RuntimeError when no loop is running
+                import concurrent.futures
+                with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+                    pool.submit(
+                        lambda: asyncio.run(_generate_edge_tts(text, file_str, tts_config))
+                    ).result(timeout=60)
+            except RuntimeError:
+                asyncio.run(_generate_edge_tts(text, file_str, tts_config))
+
+        # Check the file was actually created
+        if not os.path.exists(file_str) or os.path.getsize(file_str) == 0:
+            return json.dumps({
+                "success": False,
+                "error": f"TTS generation produced no output (provider: {provider})"
+            }, ensure_ascii=False)
+
+        # Try Opus conversion for Telegram compatibility (Edge TTS only outputs MP3)
+        voice_compatible = False
+        if provider not in ("elevenlabs", "openai") and file_str.endswith(".mp3"):
+            opus_path = _convert_to_opus(file_str)
+            if opus_path:
+                file_str = opus_path
+                voice_compatible = True
+        elif provider in ("elevenlabs", "openai"):
+            # These providers can output Opus natively if the path ends in .ogg
+            voice_compatible = file_str.endswith(".ogg")
+
+        file_size = os.path.getsize(file_str)
+        print(f"✅ TTS audio saved: {file_str} ({file_size:,} bytes, provider: {provider})")
+
+        # Build response with MEDIA tag for platform delivery
+        media_tag = f"MEDIA:{file_str}"
+        if voice_compatible:
+            media_tag = f"[[audio_as_voice]]\n{media_tag}"
+
+        return json.dumps({
+            "success": True,
+            "file_path": file_str,
+            "media_tag": media_tag,
+            "provider": provider,
+            "voice_compatible": voice_compatible,
+        }, ensure_ascii=False)
+
+    except Exception as e:
+        error_msg = f"TTS generation failed ({provider}): {e}"
+        print(f"❌ {error_msg}")
+        return json.dumps({"success": False, "error": error_msg}, ensure_ascii=False)
+
+
+# ===========================================================================
+# Requirements check
+# ===========================================================================
+def check_tts_requirements() -> bool:
+    """
+    Check if at least one TTS provider is available.
+
+    Edge TTS needs no API key and is the default, so if the package
+    is installed, TTS is available.
+
+    Returns:
+        bool: True if at least one provider can work.
+    """
+    if _HAS_EDGE_TTS:
+        return True
+    if _HAS_ELEVENLABS and os.getenv("ELEVENLABS_API_KEY"):
+        return True
+    if _HAS_OPENAI and os.getenv("OPENAI_API_KEY"):
+        return True
+    return False
+
+
+# ===========================================================================
+# Main -- quick diagnostics
+# ===========================================================================
+if __name__ == "__main__":
+    print("🔊 Text-to-Speech Tool Module")
+    print("=" * 50)
+
+    print("\nProvider availability:")
+    print(f"  Edge TTS:   {'✅ installed' if _HAS_EDGE_TTS else '❌ not installed (pip install edge-tts)'}")
+    print(f"  ElevenLabs: {'✅ installed' if _HAS_ELEVENLABS else '❌ not installed (pip install elevenlabs)'}")
+    print(f"    API Key:  {'✅ set' if os.getenv('ELEVENLABS_API_KEY') else '❌ not set'}")
+    print(f"  OpenAI:     {'✅ installed' if _HAS_OPENAI else '❌ not installed (pip install openai)'}")
+    print(f"    API Key:  {'✅ set' if os.getenv('OPENAI_API_KEY') else '❌ not set'}")
+    print(f"  ffmpeg:     {'✅ found' if _has_ffmpeg() else '❌ not found (needed for Telegram Opus)'}")
+    print(f"\n  Output dir: {DEFAULT_OUTPUT_DIR}")
+
+    config = _load_tts_config()
+    provider = _get_provider(config)
+    print(f"  Configured provider: {provider}")
--- a/toolset_distributions.py
+++ b/toolset_distributions.py
@@ -198,10 +198,10 @@ DISTRIBUTIONS = {
        "toolsets": {
            "terminal": 97,   # 97% - terminal almost always available
            "file": 97,       # 97% - file tools almost always available
-            "web": 15,        # 15% - web search/scrape for documentation
-            "browser": 10,    # 10% - browser occasionally for web interaction
-            "vision": 8,      # 8% - vision analysis rarely
-            "image_gen": 3    # 3% - image generation very rarely
+            "web": 97,        # 97% - web search/scrape for documentation
+            "browser": 75,    # 75% - browser for web interaction
+            "vision": 50,     # 50% - vision analysis
+            "image_gen": 10   # 10% - image generation
        }
    },
    
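Note on the weights above: the inline comments read each number as the percentage chance that a toolset is available in a given run. The sketch below shows one way such weights could be sampled, assuming every toolset is drawn independently. `sample_toolsets` is a hypothetical helper for illustration and may not match how `toolset_distributions.py` actually consumes these values.

```python
import random


def sample_toolsets(weights, rng):
    """Pick the toolsets enabled for one run (hypothetical helper).

    Each weight is treated as an independent percentage in [0, 100].
    """
    return [name for name, pct in weights.items() if rng.random() * 100 < pct]


# Weights copied from the hunk above.
weights = {
    "terminal": 97,
    "file": 97,
    "web": 97,
    "browser": 75,
    "vision": 50,
    "image_gen": 10,
}

rng = random.Random(0)  # Seeded so repeated runs give the same draws
enabled = sample_toolsets(weights, rng)
```

Under this reading, a weight of 100 always enables a toolset and a weight of 0 never does, so raising `web` from 15 to 97 moves it from an occasional tool to one present in nearly every sampled run.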
--- a/toolsets.py
+++ b/toolsets.py
@@ -69,7 +69,7 @@ TOOLSETS = {
    
    "skills": {
        "description": "Access skill documents with specialized instructions and knowledge",
-        "tools": ["skills_categories", "skills_list", "skill_view"],
+        "tools": ["skills_list", "skill_view"],
        "includes": []
    },
    
@@ -108,6 +108,12 @@ TOOLSETS = {
        "includes": []
    },
    
+    "tts": {
+        "description": "Text-to-speech: convert text to audio with Edge TTS (free), ElevenLabs, or OpenAI",
+        "tools": ["text_to_speech"],
+        "includes": []
+    },
+    
    # Scenario-specific toolsets
    
    "debugging": {
@@ -142,12 +148,14 @@ TOOLSETS = {
            # MoA
            "mixture_of_agents",
            # Skills
-            "skills_categories", "skills_list", "skill_view",
+            "skills_list", "skill_view",
            # Browser
            "browser_navigate", "browser_snapshot", "browser_click",
            "browser_type", "browser_scroll", "browser_back",
            "browser_press", "browser_close", "browser_get_images",
            "browser_vision",
+            # Text-to-speech
+            "text_to_speech",
            # Cronjob management (CLI-only)
            "schedule_cronjob", "list_cronjobs", "remove_cronjob"
        ],
@@ -169,8 +177,12 @@ TOOLSETS = {
            "web_search", "web_extract",
            # Vision - analyze images sent by users
            "vision_analyze",
+            # Image generation
+            "image_generate",
+            # Text-to-speech
+            "text_to_speech",
            # Skills - access knowledge base
-            "skills_categories", "skills_list", "skill_view",
+            "skills_list", "skill_view",
            # Cronjob management - let users schedule tasks
            "schedule_cronjob", "list_cronjobs", "remove_cronjob"
        ],
@@ -178,15 +190,23 @@ TOOLSETS = {
    },
    
    "hermes-discord": {
-        "description": "Discord bot toolset - limited for public server safety (no terminal, no file access)",
+        "description": "Discord bot toolset - full access (terminal has safety checks via dangerous command approval)",
        "tools": [
-            # Web tools - safe for messaging
-            "web_search",
-            # Vision - analyze images
+            # Terminal - enabled with dangerous command approval system
+            "terminal",
+            # File manipulation
+            "read_file", "write_file", "patch", "search",
+            # Web tools
+            "web_search", "web_extract",
+            # Vision - analyze images sent by users
            "vision_analyze",
+            # Image generation
+            "image_generate",
+            # Text-to-speech
+            "text_to_speech",
            # Skills - access knowledge base
-            "skills_categories", "skills_list", "skill_view",
-            # Cronjob - let users schedule reminders
+            "skills_list", "skill_view",
+            # Cronjob management - let users schedule tasks
            "schedule_cronjob", "list_cronjobs", "remove_cronjob"
        ],
        "includes": []
@@ -203,8 +223,12 @@ TOOLSETS = {
            "read_file", "write_file", "patch", "search",
            # Vision
            "vision_analyze",
+            # Image generation
+            "image_generate",
+            # Text-to-speech
+            "text_to_speech",
            # Skills
-            "skills_categories", "skills_list", "skill_view",
+            "skills_list", "skill_view",
            # Cronjob management
            "schedule_cronjob", "list_cronjobs", "remove_cronjob"
        ],