Eval splits for holdout sets

Changing return type to be ScoredDataGroup to account for multiple trajectories
Added task-specific metrics and evals
2026-03-03 14:42:45 -05:00 · 2026-03-02 11:35:06 -08:00 · 2026-02-27 11:20:18 -08:00 · 2026-02-26 10:41:24 -08:00 · 2026-02-24 19:23:05 -08:00 · 2026-02-24 19:19:39 -08:00
176 changed files with 31746 additions and 22206 deletions
@@ -1,115 +0,0 @@
-# Cline's Memory Bank
-
-I am Cline, an expert software engineer with a unique characteristic: my memory resets completely between sessions. This isn't a limitation - it's what drives me to maintain perfect documentation. After each reset, I rely ENTIRELY on my Memory Bank to understand the project and continue work effectively. I MUST read ALL memory bank files at the start of EVERY task - this is not optional.
-
-## Memory Bank Structure
-
-The Memory Bank consists of core files and optional context files, all in Markdown format. Files build upon each other in a clear hierarchy:
-
-flowchart TD
-    PB[projectbrief.md] --> PC[productContext.md]
-    PB --> SP[systemPatterns.md]
-    PB --> TC[techContext.md]
-
-    PC --> AC[activeContext.md]
-    SP --> AC
-    TC --> AC
-
-    AC --> P[progress.md]
-
-### Core Files (Required)
-1. `projectbrief.md`
-   - Foundation document that shapes all other files
-   - Created at project start if it doesn't exist
-   - Defines core requirements and goals
-   - Source of truth for project scope
-
-2. `productContext.md`
-   - Why this project exists
-   - Problems it solves
-   - How it should work
-   - User experience goals
-
-3. `activeContext.md`
-   - Current work focus
-   - Recent changes
-   - Next steps
-   - Active decisions and considerations
-   - Important patterns and preferences
-   - Learnings and project insights
-
-4. `systemPatterns.md`
-   - System architecture
-   - Key technical decisions
-   - Design patterns in use
-   - Component relationships
-   - Critical implementation paths
-
-5. `techContext.md`
-   - Technologies used
-   - Development setup
-   - Technical constraints
-   - Dependencies
-   - Tool usage patterns
-
-6. `progress.md`
-   - What works
-   - What's left to build
-   - Current status
-   - Known issues
-   - Evolution of project decisions
-
-### Additional Context
-Create additional files/folders within memory-bank/ when they help organize:
- Complex feature documentation
- Integration specifications
- API documentation
- Testing strategies
- Deployment procedures
-
-## Core Workflows
-
-### Plan Mode
-flowchart TD
-    Start[Start] --> ReadFiles[Read Memory Bank]
-    ReadFiles --> CheckFiles{Files Complete?}
-
-    CheckFiles -->|No| Plan[Create Plan]
-    Plan --> Document[Document in Chat]
-
-    CheckFiles -->|Yes| Verify[Verify Context]
-    Verify --> Strategy[Develop Strategy]
-    Strategy --> Present[Present Approach]
-
-### Act Mode
-flowchart TD
-    Start[Start] --> Context[Check Memory Bank]
-    Context --> Update[Update Documentation]
-    Update --> Execute[Execute Task]
-    Execute --> Document[Document Changes]
-
-## Documentation Updates
-
-Memory Bank updates occur when:
-1. Discovering new project patterns
-2. After implementing significant changes
-3. When user requests with **update memory bank** (MUST review ALL files)
-4. When context needs clarification
-
-flowchart TD
-    Start[Update Process]
-
-    subgraph Process
-        P1[Review ALL Files]
-        P2[Document Current State]
-        P3[Clarify Next Steps]
-        P4[Document Insights & Patterns]
-
-        P1 --> P2 --> P3 --> P4
-    end
-
-    Start --> Process
-
-Note: When triggered by **update memory bank**, I MUST review every memory bank file, even if some don't require updates. Focus particularly on activeContext.md and progress.md as they track current state.
-
-REMEMBER: After every memory reset, I begin completely fresh. The Memory Bank is my only link to previous work. It must be maintained with precision and clarity, as my effectiveness depends entirely on its accuracy.
@@ -1,201 +0,0 @@
-Hermes-Agent is an agent harness for LLMs with an interactive CLI.
-
-## Development Environment
-
-**IMPORTANT**: Always use the virtual environment if it exists:
-```bash
-source venv/bin/activate  # Before running any Python commands
-```
-
-## Project Structure
-
- `hermes` - CLI launcher script (run with `./hermes`)
- `cli.py` - Interactive CLI with Rich UI, prompt_toolkit, animated spinners
- `cli-config.yaml` - CLI configuration (model, terminal, toolsets, personalities)
- `tools/` - Individual tool implementations (web, terminal, browser, vision, etc.)
- `tools/__init__.py` - Exports all tools for importing
- `model_tools.py` - Consolidates tool schemas and handlers for the agent
- `toolsets.py` - Groups tools into logical toolsets (web, terminal, browser, etc.)
- `toolset_distributions.py` - Probability-based tool selection for data generation
- `run_agent.py` - Primary agent runner with AIAgent class and KawaiiSpinner
- `batch_runner.py` - Parallel batch processing with checkpointing
- `tests/` - Test scripts
-
-## File Dependency Chain
-
-```
-tools/*.py → tools/__init__.py → model_tools.py → toolsets.py → toolset_distributions.py
-                                       ↑
-run_agent.py ──────────────────────────┘
-cli.py → run_agent.py (uses AIAgent with quiet_mode=True)
-batch_runner.py → run_agent.py + toolset_distributions.py
-```
-
-Always ensure consistency between tools, model_tools.py, and toolsets.py when changing any of them.
-
-## CLI Architecture (cli.py)
-
-The interactive CLI uses:
- **Rich** - For the welcome banner and styled panels
- **prompt_toolkit** - For fixed input area with history and `patch_stdout`
- **KawaiiSpinner** (in run_agent.py) - Animated feedback during API calls and tool execution
-
-Key components:
- `HermesCLI` class - Main CLI controller with commands and conversation loop
- `load_cli_config()` - Loads `cli-config.yaml`, sets environment variables for terminal
- `build_welcome_banner()` - Displays ASCII art logo, tools, and skills summary
- `/commands` - Process user commands like `/help`, `/clear`, `/personality`, etc.
-
-CLI uses `quiet_mode=True` when creating AIAgent to suppress verbose logging and enable kawaii-style feedback instead.
-
-### Adding CLI Commands
-
-1. Add to `COMMANDS` dict with description
-2. Add handler in `process_command()` method
-3. For persistent settings, use `save_config_value()` to update `cli-config.yaml`
-
-## Adding a New Tool
-
-Follow this strict order to maintain consistency:
-
-1. Create `tools/your_tool.py` with:
-   - Handler function (sync or async) returning a JSON string via `json.dumps()`
-   - `check_*_requirements()` function to verify dependencies (e.g., API keys)
-   - Schema definition following OpenAI function-calling format
-
-2. Export in `tools/__init__.py`:
-   - Import the handler and check function
-   - Add to `__all__` list
-
-3. Register in `model_tools.py`:
-   - Create `get_*_tool_definitions()` function or add to existing
-   - Add routing in `handle_function_call()` dispatcher
-   - Update `get_all_tool_names()` with the tool name
-   - Update `get_toolset_for_tool()` mapping
-   - Update `get_available_toolsets()` and `check_toolset_requirements()`
-
-4. Add to toolset in `toolsets.py`:
-   - Add to existing toolset or create new one in TOOLSETS dict
-
-5. Optionally add to `toolset_distributions.py` for batch processing
-
-## Tool Implementation Pattern
-
-```python
-# tools/example_tool.py
-import json
-import os
-
-def check_example_requirements() -> bool:
-    """Check if required API keys/dependencies are available."""
-    return bool(os.getenv("EXAMPLE_API_KEY"))
-
-def example_tool(param: str, task_id: str = None) -> str:
-    """Execute the tool and return JSON string result."""
-    try:
-        result = {"success": True, "data": "..."}
-        return json.dumps(result, ensure_ascii=False)
-    except Exception as e:
-        return json.dumps({"error": str(e)}, ensure_ascii=False)
-```
-
-All tool handlers MUST return a JSON string. Never return raw dicts.
-
-## Stateful Tools
-
-Tools that maintain state (terminal, browser) require:
- `task_id` parameter for session isolation between concurrent tasks
- `cleanup_*()` function to release resources
- Cleanup is called automatically in run_agent.py after conversation completes
-
-## Environment Variables
-
-API keys are loaded from `.env` file in repo root:
- `OPENROUTER_API_KEY` - Main LLM API access (primary provider)
- `FIRECRAWL_API_KEY` - Web search/extract tools
- `BROWSERBASE_API_KEY` / `BROWSERBASE_PROJECT_ID` - Browser automation
- `FAL_KEY` - Image generation (FLUX model)
- `NOUS_API_KEY` - Vision and Mixture-of-Agents tools
-
-Terminal tool configuration (can also be set in `cli-config.yaml`):
- `TERMINAL_ENV` - Backend: local, docker, singularity, modal, or ssh
- `TERMINAL_CWD` - Working directory
- `TERMINAL_SSH_HOST`, `TERMINAL_SSH_USER`, `TERMINAL_SSH_KEY` - For SSH backend
-
-## Agent Loop (run_agent.py)
-
-The AIAgent class handles:
- Processing enabled toolsets to provide to the model
- Piping prompts to the agent
- Looping LLM calls when tools are invoked, until natural language response
- Returning the final response
-
-Uses OpenAI-compatible API (primarily OpenRouter) with the OpenAI Python SDK.
-
-## Reasoning Model Support
-
-For models that support chain-of-thought reasoning:
- Extract `reasoning_content` from API responses
- Store in `assistant_msg["reasoning"]` for trajectory export
- Pass back via `reasoning_content` field on subsequent turns
-
-## Trajectory Format
-
-Conversations are saved in ShareGPT format for training:
-```json
-{"from": "system", "value": "System prompt with <tools>...</tools>"}
-{"from": "human", "value": "User message"}
-{"from": "gpt", "value": "<think>reasoning</think>\n<tool_call>{...}</tool_call>"}
-{"from": "tool", "value": "<tool_response>{...}</tool_response>"}
-{"from": "gpt", "value": "Final response"}
-```
-
-Tool calls use `<tool_call>` XML tags, responses use `<tool_response>` tags, reasoning uses `<think>` tags.
-
-## Batch Processing (batch_runner.py)
-
-For processing multiple prompts:
- Parallel execution with multiprocessing
- Content-based resume for fault tolerance (matches on prompt text, not indices)
- Toolset distributions control probabilistic tool availability per prompt
- Output: `data/<run_name>/trajectories.jsonl` (combined) + individual batch files
-
-## Logging
-
-Trajectories restructure tools as a system prompt for storage in a format suitable for later training use.
-
-## Skills System
-
-Skills are on-demand knowledge documents the agent can load. Located in `skills/` directory:
-
-```
-skills/
-├── mlops/                    # Category folder
-│   ├── axolotl/             # Skill folder
-│   │   ├── SKILL.md         # Main instructions (required)
-│   │   ├── references/      # Additional docs, API specs
-│   │   └── templates/       # Output formats, configs
-│   └── vllm/
-│       └── SKILL.md
-└── example-skill/
-    └── SKILL.md
-```
-
-**Progressive disclosure** (token-efficient):
-1. `skills_categories()` - List category names (~50 tokens)
-2. `skills_list(category)` - Name + description per skill (~3k tokens)
-3. `skill_view(name)` - Full content + tags + linked files
-
-SKILL.md files use YAML frontmatter:
-```yaml
---
-name: skill-name
-description: Brief description for listing
-tags: [tag1, tag2]
-related_skills: [other-skill]
-version: 1.0.0
---
-# Skill Content...
-```
-
-Tool files: `tools/skills_tool.py` → `model_tools.py` → `toolsets.py`
@@ -1,73 +1,17 @@
 # Hermes Agent Environment Configuration
 # Copy this file to .env and fill in your API keys

-# =============================================================================
-# CORE SETTINGS
-# =============================================================================
-# Agent backend:
-# - openai  : default Hermes-Agent loop (OpenAI function-calling via OpenAI SDK)
-# - atropos : Atroposlib ServerManager/ManagedServer-backed loop (training/env integration)
-HERMES_BACKEND=openai
-
-
-# =============================================================================
-# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
-# =============================================================================
-# For local development (matches the Atropos test env defaults):
-# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
-# ATROPOS_SERVER_MODEL=hermes-4-36b
-# For hosted inference (Nous Research inference API):
-ATROPOS_SERVER_BASE_URL=
-ATROPOS_SERVER_MODEL=
-ATROPOS_TOKENIZER_NAME=
-# Set this to your Nous API key (Bearer token).
-ATROPOS_SERVER_API_KEY=
-
-# Debugging (prints to stdout; use with care)
-# HERMES_DEBUG_ATROPOS_REQUEST=1
-# HERMES_DEBUG_ATROPOS_RESPONSE=1
-# HERMES_DEBUG_OPENAI_REQUEST=1
-# HERMES_DEBUG_OPENAI_RESPONSE=1
-
-# =============================================================================
-# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
-# =============================================================================
-# If you set ATROPOS_SERVER_BASE_URL or OPENAI_BASE_URL, Hermes will use it instead
-# of OpenRouter.
-#
-# Local server convenience (base URL without /v1):
-# llama.cpp example (see `Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh`):
-# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
-# ATROPOS_SERVER_MODEL=hermes-4-36b
-# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
-# ATROPOS_SERVER_API_KEY=local
-#
-# Hosted Nous inference API:
-# ATROPOS_SERVER_BASE_URL=https://inference-api.nousresearch.com
-# ATROPOS_SERVER_MODEL=Hermes-4.3-36B
-# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
-# ATROPOS_SERVER_API_KEY=sk-... (Bearer token)
-#
-# If you plan to run GRPO-style group sampling (e.g. `--env.group_size 4`) against
-# llama.cpp, start the server with at least that many slots, e.g.:
-#   LLAMA_CPP_PARALLEL=4 Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
-#
-# Generic OpenAI-compatible (base URL should include /v1):
-# OPENAI_BASE_URL=http://127.0.0.1:8080/v1
-# OPENAI_API_KEY=local
-
 # =============================================================================
 # LLM PROVIDER (OpenRouter)
 # =============================================================================
 # OpenRouter provides access to many models through one API
 # All LLM calls go through OpenRouter - no direct provider keys needed
 # Get your key at: https://openrouter.ai/keys
-OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
 OPENROUTER_API_KEY=

 # Default model to use (OpenRouter format: provider/model)
-# Examples: anthropic/claude-sonnet-4, openai/gpt-4o, google/gemini-2.0-flash, zhipuai/glm-4-plus
-LLM_MODEL=anthropic/claude-sonnet-4
+# Examples: anthropic/claude-opus-4.6, openai/gpt-4o, google/gemini-2.0-flash, zhipuai/glm-4-plus
+LLM_MODEL=anthropic/claude-opus-4.6

 # =============================================================================
 # TOOL API KEYS
@@ -96,13 +40,20 @@ FAL_KEY=
 # - modal: Runs in Modal cloud sandboxes (scalable, requires Modal account)
 TERMINAL_ENV=local

-# Container images (for singularity/docker/modal backends)
-TERMINAL_DOCKER_IMAGE=python:3.11
-TERMINAL_SINGULARITY_IMAGE=docker://python:3.11
-TERMINAL_MODAL_IMAGE=python:3.11

-# Working directory inside the container
-TERMINAL_CWD=/tmp
+# Container images (for singularity/docker/modal backends)
+TERMINAL_DOCKER_IMAGE=nikolaik/python-nodejs:python3.11-nodejs20
+TERMINAL_SINGULARITY_IMAGE=docker://nikolaik/python-nodejs:python3.11-nodejs20
+TERMINAL_MODAL_IMAGE=nikolaik/python-nodejs:python3.11-nodejs20
+
+
+# Working directory for terminal commands
+# For local backend: "." means current directory (resolved automatically)
+# For remote backends (ssh/docker/modal/singularity): use an absolute path
+#   INSIDE the target environment, or leave unset for the backend's default
+#   (/root for modal, / for docker, ~ for ssh). Do NOT use a host-local path.
+# Usually managed by config.yaml (terminal.cwd) — uncomment to override
+# TERMINAL_CWD=.

 # Default command timeout in seconds
 TERMINAL_TIMEOUT=60
@@ -144,87 +95,12 @@ TERMINAL_LIFETIME_SECONDS=300
 # SUDO_PASSWORD=your_password_here

 # =============================================================================
-# MODAL CLOUD BACKEND (for TERMINAL_ENV=modal)
+# MODAL CLOUD BACKEND (Optional - for TERMINAL_ENV=modal)
 # =============================================================================
-# Modal provides cloud sandboxes with per-second billing and auto-scaling.
-# This implementation uses a warm pool of sandboxes for cost efficiency.
-#
-# SETUP:
-#   pip install modal && modal setup
-#   (Authenticates via browser, stores credentials locally)
-#
-# FEATURES:
-# - Auto-scaling warm sandbox pool (no cold start after first use)
-# - Named sandbox recovery (reconnects after restart)
-# - Profile-based heterogeneous environments (CPU, GPU, different images)
-# - Server-side idle_timeout protection against orphaned sandboxes
-
-# Modal app name (groups all sandboxes, used for recovery)
-TERMINAL_MODAL_APP_NAME=hermes-sandbox
-
-# Default profile when none specified
-TERMINAL_MODAL_DEFAULT_PROFILE=default
-
-# Profile config file (optional - YAML format, see modal_profiles.yaml)
-# TERMINAL_MODAL_PROFILES_FILE=modal_profiles.yaml
-
-# --- Default Profile Settings (used if no YAML file) ---
-# These apply when no profile is specified or for the "default" profile
-TERMINAL_MODAL_IMAGE=python:3.11
-TERMINAL_MODAL_MIN_POOL=1
-TERMINAL_MODAL_MAX_POOL=5
-TERMINAL_MODAL_IDLE_TIMEOUT=120
-TERMINAL_MODAL_MAX_LIFETIME=3600
-TERMINAL_MODAL_SCALE_DOWN_IDLE=180
-
-# --- Custom Profile Example: pytorch-gpu ---
-# Uncomment to enable a GPU profile for ML tasks
-# Usage: terminal_tool("python train.py", profile="pytorch-gpu")
-#
-# TERMINAL_MODAL_PROFILE_pytorch_gpu_IMAGE=pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
-# TERMINAL_MODAL_PROFILE_pytorch_gpu_GPU=T4
-# TERMINAL_MODAL_PROFILE_pytorch_gpu_MEMORY=16384
-# TERMINAL_MODAL_PROFILE_pytorch_gpu_MIN_POOL=0
-# TERMINAL_MODAL_PROFILE_pytorch_gpu_MAX_POOL=2
-# TERMINAL_MODAL_PROFILE_pytorch_gpu_IDLE_TIMEOUT=60
-
-# --- Custom Profile Example: node ---
-# Uncomment to enable a Node.js profile
-# Usage: terminal_tool("npm test", profile="node")
-#
-# TERMINAL_MODAL_PROFILE_node_IMAGE=node:18
-# TERMINAL_MODAL_PROFILE_node_MIN_POOL=0
-# TERMINAL_MODAL_PROFILE_node_MAX_POOL=3
-
-# =============================================================================
-# MODAL SECRETS (Secure credential injection)
-# =============================================================================
-# Modal Secrets allow you to securely pass API keys, passwords, and other
-# sensitive data to your sandboxes without exposing them in code or logs.
-#
-# SETUP SECRETS:
-#   1. Via Dashboard: https://modal.com/secrets
-#   2. Via CLI: modal secret create my-secret KEY1=value1 KEY2=value2
-#   3. Via CLI with env: modal secret create my-secret API_KEY="$API_KEY"
-#
-# LIST SECRETS:
-#   modal secret list
-#
-# DELETE SECRETS:
-#   modal secret delete my-secret
-
-# Global secrets applied to ALL profiles (comma-separated secret names)
-# These secrets must be created on Modal dashboard or via CLI first
-# TERMINAL_MODAL_SECRETS=my-api-keys,database-creds
-
-# Per-profile secrets (comma-separated secret names)
-# TERMINAL_MODAL_PROFILE_pytorch_gpu_SECRETS=huggingface-token,wandb-key
-
-# Per-profile environment variables (semicolon-separated KEY=VALUE pairs)
-# TERMINAL_MODAL_PROFILE_default_ENV_VARS=DEBUG=1;LOG_LEVEL=info
-
-# Load local .env file into sandbox (useful for development)
-# TERMINAL_MODAL_PROFILE_default_USE_DOTENV=true
+# Modal uses CLI authentication, not environment variables.
+# Run: pip install modal && modal setup
+# This will authenticate via browser and store credentials locally.
+# No API key needed in .env - Modal handles auth automatically.

 # =============================================================================
 # BROWSER TOOL CONFIGURATION (agent-browser + Browserbase)
@@ -266,6 +142,37 @@ BROWSER_INACTIVITY_TIMEOUT=120
 # Format: logs/session_YYYYMMDD_HHMMSS_UUID.json
 # Contains full conversation history in trajectory format for debugging/replay

+# =============================================================================
+# VOICE TRANSCRIPTION & OPENAI TTS
+# =============================================================================
+# Required for voice message transcription (Whisper) and OpenAI TTS voices.
+# Uses OpenAI's API directly (not via OpenRouter).
+# Named HERMES_OPENAI_API_KEY to avoid interference with OpenRouter.
+# Get at: https://platform.openai.com/api-keys
+HERMES_OPENAI_API_KEY=
+
+# =============================================================================
+# SLACK INTEGRATION
+# =============================================================================
+# Slack Bot Token - From Slack App settings (OAuth & Permissions)
+# Get at: https://api.slack.com/apps
+# SLACK_BOT_TOKEN=xoxb-...
+
+# Slack App Token - For Socket Mode (App-Level Tokens in Slack App settings)
+# SLACK_APP_TOKEN=xapp-...
+
+# Slack allowed users (comma-separated Slack user IDs)
+# SLACK_ALLOWED_USERS=
+
+# =============================================================================
+# RESPONSE PACING
+# =============================================================================
+# Human-like delays between message chunks on messaging platforms.
+# Makes the bot feel less robotic.
+# HERMES_HUMAN_DELAY_MODE=off     # off | natural | custom
+# HERMES_HUMAN_DELAY_MIN_MS=800   # Min delay in ms (custom mode)
+# HERMES_HUMAN_DELAY_MAX_MS=2500  # Max delay in ms (custom mode)
+
 # =============================================================================
 # LEGACY/OPTIONAL API KEYS
 # =============================================================================
@@ -285,3 +192,31 @@ WEB_TOOLS_DEBUG=false
 VISION_TOOLS_DEBUG=false
 MOA_TOOLS_DEBUG=false
 IMAGE_TOOLS_DEBUG=false
+
+# =============================================================================
+# CONTEXT COMPRESSION (Auto-shrinks long conversations)
+# =============================================================================
+# When conversation approaches model's context limit, middle turns are
+# automatically summarized to free up space.
+#
+# CONTEXT_COMPRESSION_ENABLED=true        # Enable auto-compression (default: true)
+# CONTEXT_COMPRESSION_THRESHOLD=0.85      # Compress at 85% of context limit
+# CONTEXT_COMPRESSION_MODEL=google/gemini-2.0-flash-001  # Fast model for summaries
+
+# =============================================================================
+# RL TRAINING (Tinker + Atropos)
+# =============================================================================
+# Run reinforcement learning training on language models using the Tinker API.
+# Requires the rl-server to be running (from tinker-atropos package).
+
+# Tinker API Key - RL training service
+# Get at: https://tinker-console.thinkingmachines.ai/keys
+TINKER_API_KEY=
+
+# Weights & Biases API Key - Experiment tracking and metrics
+# Get at: https://wandb.ai/authorize
+WANDB_API_KEY=
+
+# RL API Server URL (default: http://localhost:8080)
+# Change if running the rl-server on a different host/port
+# RL_API_URL=http://localhost:8080
@@ -39,26 +39,10 @@ agent-browser/
 *.pem
 privvy*
 images/
+__pycache__/
+hermes_agent.egg-info/
+wandb/
+testlogs

 # CLI config (may contain sensitive SSH paths)
 cli-config.yaml
-
-.DS_Store
-
-# artifacts
-*.jsonl
-*.html
-*.json
-*.log
-*.csv
-
-# Singularity/Apptainer images (large binary files)
-*.sif
-
-# Test files
-test_singularity_*.py
-test_*.py
-!tests/test_*.py
-
-# Nomad data
-/tmp/NomadClient*/
@@ -1,3 +1,6 @@
 [submodule "mini-swe-agent"]
 	path = mini-swe-agent
 	url = https://github.com/SWE-agent/mini-swe-agent
+[submodule "tinker-atropos"]
+	path = tinker-atropos
+	url = https://github.com/nousresearch/tinker-atropos
@@ -0,0 +1,609 @@
+# Hermes Agent - Development Guide
+
+Instructions for AI coding assistants (GitHub Copilot, Cursor, etc.) and human developers.
+
+Hermes-Agent is an AI agent harness with tool-calling capabilities, interactive CLI, messaging integrations, and scheduled tasks.
+
+## Development Environment
+
+**IMPORTANT**: Always use the virtual environment if it exists:
+```bash
+source venv/bin/activate  # Before running any Python commands
+```
+
+## Project Structure
+
+```
+hermes-agent/
+├── hermes_cli/           # Unified CLI commands
+│   ├── main.py           # Entry point, command dispatcher
+│   ├── setup.py          # Interactive setup wizard
+│   ├── config.py         # Config management & migration
+│   ├── status.py         # Status display
+│   ├── doctor.py         # Diagnostics
+│   ├── gateway.py        # Gateway management
+│   ├── uninstall.py      # Uninstaller
+│   └── cron.py           # Cron job management
+├── tools/                # Tool implementations
+│   ├── process_registry.py     # Background process management (spawn, poll, wait, kill)
+│   ├── transcription_tools.py  # Speech-to-text (Whisper API)
+├── gateway/              # Messaging platform adapters
+│   ├── pairing.py        # DM pairing code system
+│   ├── hooks.py          # Event hook system
+│   ├── sticker_cache.py  # Telegram sticker vision cache
+│   ├── platforms/
+│   │   └── slack.py          # Slack adapter (slack-bolt)
+├── cron/                 # Scheduler implementation
+├── skills/               # Knowledge documents
+├── cli.py                # Interactive CLI (Rich UI)
+├── run_agent.py          # Agent runner with AIAgent class
+├── model_tools.py        # Tool schemas and handlers
+├── toolsets.py           # Tool groupings
+├── toolset_distributions.py  # Probability-based tool selection
+└── batch_runner.py       # Parallel batch processing
+```
+
+**User Configuration** (stored in `~/.hermes/`):
+- `~/.hermes/config.yaml` - Settings (model, terminal, toolsets, etc.)
+- `~/.hermes/.env` - API keys and secrets
+- `~/.hermes/pairing/` - DM pairing data
+- `~/.hermes/hooks/` - Custom event hooks
+- `~/.hermes/image_cache/` - Cached user images
+- `~/.hermes/audio_cache/` - Cached user voice messages
+- `~/.hermes/sticker_cache.json` - Telegram sticker descriptions
+
+## File Dependency Chain
+
+```
+tools/*.py → tools/__init__.py → model_tools.py → toolsets.py → toolset_distributions.py
+                                       ↑
+run_agent.py ──────────────────────────┘
+cli.py → run_agent.py (uses AIAgent with quiet_mode=True)
+batch_runner.py → run_agent.py + toolset_distributions.py
+```
+
+Always ensure consistency between tools, model_tools.py, and toolsets.py when changing any of them.
+
+---
+
+## AIAgent Class
+
+The main agent is implemented in `run_agent.py`:
+
+```python
+class AIAgent:
+    def __init__(
+        self,
+        model: str = "anthropic/claude-sonnet-4",
+        api_key: str = None,
+        base_url: str = "https://openrouter.ai/api/v1",
+        max_iterations: int = 60,        # Max tool-calling loops
+        enabled_toolsets: list = None,
+        disabled_toolsets: list = None,
+        verbose_logging: bool = False,
+        quiet_mode: bool = False,         # Suppress progress output
+        tool_progress_callback: callable = None,  # Called on each tool use
+    ):
+        # Initialize OpenAI client, load tools based on toolsets
+        ...
+    
+    def chat(self, user_message: str, task_id: str = None) -> str:
+        # Main entry point - runs the agent loop
+        ...
+```
+
+### Agent Loop
+
+The core loop in `_run_agent_loop()`:
+
+```
+1. Add user message to conversation
+2. Call LLM with tools
+3. If LLM returns tool calls:
+   - Execute each tool
+   - Add tool results to conversation
+   - Go to step 2
+4. If LLM returns text response:
+   - Return response to user
+```
+
+```python
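+while turns < max_turns:
+    response = client.chat.completions.create(
+        model=model,
+        messages=messages,
+        tools=tool_schemas,
+    )
+    # The OpenAI SDK returns the assistant message under choices[0].message
+    message = response.choices[0].message
+
+    if message.tool_calls:
+        # Keep the assistant turn so each tool result has a matching call
+        messages.append(message)
+        for tool_call in message.tool_calls:
+            result = execute_tool(tool_call)
+            messages.append(tool_result_message(result))
+        turns += 1
+    else:
+        return message.content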
+
+### Conversation Management
+
+Messages are stored as a list of dicts following OpenAI format:
+
+```python
+messages = [
+    {"role": "system", "content": "You are a helpful assistant..."},
+    {"role": "user", "content": "Search for Python tutorials"},
+    {"role": "assistant", "content": None, "tool_calls": [...]},
+    {"role": "tool", "tool_call_id": "...", "content": "..."},
+    {"role": "assistant", "content": "Here's what I found..."},
+]
+```
+
+### Reasoning Model Support
+
+For models that support chain-of-thought reasoning:
+- Extract `reasoning_content` from API responses
+- Store in `assistant_msg["reasoning"]` for trajectory export
+- Pass back via `reasoning_content` field on subsequent turns
+
+---
+
+## CLI Architecture (cli.py)
+
+The interactive CLI uses:
+- **Rich** - For the welcome banner and styled panels
+- **prompt_toolkit** - For fixed input area with history and `patch_stdout`
+- **KawaiiSpinner** (in run_agent.py) - Animated feedback during API calls and tool execution
+
+Key components:
+- `HermesCLI` class - Main CLI controller with commands and conversation loop
+- `load_cli_config()` - Loads config, sets environment variables for terminal
+- `build_welcome_banner()` - Displays ASCII art logo, tools, and skills summary
+- `/commands` - Process user commands like `/help`, `/clear`, `/personality`, etc.
+
+CLI uses `quiet_mode=True` when creating AIAgent to suppress verbose logging.
+
+### Adding CLI Commands
+
+1. Add to `COMMANDS` dict with description
+2. Add handler in `process_command()` method
+3. For persistent settings, use `save_config_value()` to update config
+
+---
+
+## Hermes CLI Commands
+
+The unified `hermes` command provides all functionality:
+
+| Command | Description |
+|---------|-------------|
+| `hermes` | Interactive chat (default) |
+| `hermes chat -q "..."` | Single query mode |
+| `hermes setup` | Configure API keys and settings |
+| `hermes config` | View current configuration |
+| `hermes config edit` | Open config in editor |
+| `hermes config set KEY VAL` | Set a specific value |
+| `hermes config check` | Check for missing config |
+| `hermes config migrate` | Prompt for missing config interactively |
+| `hermes status` | Show configuration status |
+| `hermes doctor` | Diagnose issues |
+| `hermes update` | Update to latest (checks for new config) |
+| `hermes uninstall` | Uninstall (can keep configs for reinstall) |
+| `hermes gateway` | Start messaging gateway |
+| `hermes cron list` | View scheduled jobs |
+| `hermes version` | Show version info |
+| `hermes pairing list/approve/revoke` | Manage DM pairing codes |
+
+---
+
+## Messaging Gateway
+
+The gateway connects Hermes to Telegram, Discord, and WhatsApp.
+
+### Configuration (in `~/.hermes/.env`)
+
+```bash
+# Telegram
+TELEGRAM_BOT_TOKEN=123456:ABC-DEF...      # From @BotFather
+TELEGRAM_ALLOWED_USERS=123456789,987654   # Comma-separated user IDs (from @userinfobot)
+
+# Discord  
+DISCORD_BOT_TOKEN=MTIz...                 # From Developer Portal
+DISCORD_ALLOWED_USERS=123456789012345678  # Comma-separated user IDs
+
+# Agent Behavior
+HERMES_MAX_ITERATIONS=60                  # Max tool-calling iterations
+MESSAGING_CWD=/home/myuser                # Terminal working directory for messaging
+
+# Tool Progress (optional)
+HERMES_TOOL_PROGRESS=true                 # Send progress messages
+HERMES_TOOL_PROGRESS_MODE=new             # "new" or "all"
+```
+
+### Working Directory Behavior
+
+- **CLI (`hermes` command)**: Uses current directory (`.` → `os.getcwd()`)
+- **Messaging (Telegram/Discord)**: Uses `MESSAGING_CWD` (default: home directory)
+
+This is intentional: CLI users are in a terminal and expect the agent to work in their current directory, while messaging users need a consistent starting location.
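+
+The rule above can be sketched as follows. `resolve_cwd()` is a hypothetical helper written for illustration, not a function from the codebase, and it assumes `MESSAGING_CWD` is read from the process environment:
+
+```python
+import os
+from pathlib import Path
+
+
+def resolve_cwd(source: str) -> str:
+    """Return the starting directory for a session (hypothetical helper)."""
+    if source == "cli":
+        # CLI sessions start in the user's current directory.
+        return os.getcwd()
+    # Messaging sessions use MESSAGING_CWD, falling back to the home directory.
+    return os.environ.get("MESSAGING_CWD") or str(Path.home())
+```
+
+Keeping the decision in one place would make it easy to see why a Telegram session never inherits the directory the gateway process happened to start in.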
+
+### Security (User Allowlists):
+
+**IMPORTANT**: Without an allowlist, anyone who finds your bot can use it!
+
+The gateway checks `{PLATFORM}_ALLOWED_USERS` environment variables:
+- If set: Only listed user IDs can interact with the bot
+- If unset: All users are allowed (dangerous with terminal access!)
+
+Users can find their IDs:
+- **Telegram**: Message [@userinfobot](https://t.me/userinfobot)
+- **Discord**: Enable Developer Mode, right-click name → Copy ID
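+
+A minimal sketch of the allowlist check, assuming the variable is read from the environment and that an empty value is treated the same as an unset one. `is_user_allowed()` is a hypothetical name, not the gateway's actual function:
+
+```python
+import os
+
+
+def is_user_allowed(platform: str, user_id: str) -> bool:
+    """Check a user ID against {PLATFORM}_ALLOWED_USERS (hypothetical helper)."""
+    raw = os.environ.get(f"{platform.upper()}_ALLOWED_USERS", "").strip()
+    if not raw:
+        # Unset (or empty, by assumption): everyone is allowed.
+        return True
+    allowed = {entry.strip() for entry in raw.split(",") if entry.strip()}
+    return str(user_id) in allowed
+```
+
+Note that the permissive branch is the default, which is why setting an allowlist matters whenever the terminal tool is enabled.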
+
+### DM Pairing System
+
+Instead of static allowlists, users can pair via one-time codes:
+1. Unknown user DMs the bot → receives pairing code
+2. Owner runs `hermes pairing approve <platform> <code>`
+3. User is permanently authorized
+
+Security: 8-character codes, 1-hour expiry, rate limiting (one code request per user every 10 minutes), at most 3 pending codes per platform, lockout after 5 failed attempts, and `chmod 0600` on data files.
+
+Files: `gateway/pairing.py`, `hermes_cli/pairing.py`
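+
+To illustrate the code length and expiry rules, here is a rough sketch. The alphabet, the dictionary layout, and both function names are assumptions made for this example; the real logic lives in `gateway/pairing.py` and may differ:
+
+```python
+import secrets
+import string
+import time
+
+CODE_LENGTH = 8            # 8-character codes
+CODE_TTL_SECONDS = 3600    # 1-hour expiry
+ALPHABET = string.ascii_uppercase + string.digits  # assumed alphabet
+
+
+def new_pairing_code(now=None):
+    """Create a one-time code plus its expiry timestamp."""
+    now = time.time() if now is None else now
+    code = "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
+    return {"code": code, "expires_at": now + CODE_TTL_SECONDS}
+
+
+def is_code_valid(entry, submitted, now=None):
+    """Accept the code only if it matches and has not expired."""
+    now = time.time() if now is None else now
+    # compare_digest avoids leaking the match position through timing.
+    return now < entry["expires_at"] and secrets.compare_digest(entry["code"], submitted)
+```
+
+Rate limiting, the pending-code cap, and lockout are left out of this sketch.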
+
+### Event Hooks
+
+Hooks fire at lifecycle points. Place hook directories in `~/.hermes/hooks/`:
+
+```
+~/.hermes/hooks/my-hook/
+├── HOOK.yaml    # name, description, events list
+└── handler.py   # async def handle(event_type, context): ...
+```
+
+Events: `gateway:startup`, `session:start`, `session:reset`, `agent:start`, `agent:step`, `agent:end`, `command:*`
+
+The `agent:step` event fires each iteration of the tool-calling loop with tool names and results.
+
+Files: `gateway/hooks.py`
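+
+A hedged sketch of what a `handler.py` might look like. Only the `handle(event_type, context)` signature comes from the layout above; the shape of `context` (treated here as a dict), the `tool_names` key used in the usage note, and the `fire()` helper are assumptions for illustration:
+
+```python
+# ~/.hermes/hooks/my-hook/handler.py
+import asyncio
+
+seen_events = []  # in-memory record, for illustration only
+
+
+async def handle(event_type, context):
+    """Record agent lifecycle events; ignore everything else."""
+    if event_type.startswith("agent:"):
+        # context is assumed to be a dict; its keys are not specified here.
+        seen_events.append((event_type, dict(context or {})))
+
+
+def fire(event_type, context):
+    """Drive the async handler from synchronous code (for local testing)."""
+    asyncio.run(handle(event_type, context))
+```
+
+Calling `fire("agent:step", {"tool_names": ["terminal"]})` appends one entry to `seen_events`, while `fire("gateway:startup", {})` is ignored by this particular handler.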
+
+### Tool Progress Notifications
+
+When `HERMES_TOOL_PROGRESS=true`, the bot sends status messages as it works:
+- `💻 \`ls -la\`...` (terminal commands show the actual command)
+- `🔍 web_search...`
+- `📄 web_extract...`
+
+Modes:
+- `new`: Only when switching to a different tool (less spam)
+- `all`: Every single tool call
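+
+The difference between the two modes can be sketched like this. `should_notify()` is a hypothetical helper, not the gateway's actual implementation:
+
+```python
+def should_notify(mode, tool_name, last_tool):
+    """Decide whether to send a progress message for this tool call."""
+    if mode == "all":
+        return True
+    # "new": only announce when the tool differs from the previous call.
+    return tool_name != last_tool
+
+
+calls = ["terminal", "terminal", "web_search", "web_search", "terminal"]
+last = None
+announced = []
+for name in calls:
+    if should_notify("new", name, last):
+        announced.append(name)
+    last = name
+# announced == ["terminal", "web_search", "terminal"]
+```
+
+With `all`, the same five calls would produce five messages.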
+
+### Typing Indicator
+
+The gateway keeps the "typing..." indicator active throughout processing, refreshing every 4 seconds. This lets users know the bot is working even during long tool-calling sequences.
+
+### Platform Toolsets:
+
+Each platform has a dedicated toolset in `toolsets.py`:
+- `hermes-telegram`: Full tools including terminal (with safety checks)
+- `hermes-discord`: Full tools including terminal
+- `hermes-whatsapp`: Full tools including terminal
+
+---
+
+## Configuration System
+
+Configuration files are stored in `~/.hermes/` for easy user access:
+- `~/.hermes/config.yaml` - All settings (model, terminal, compression, etc.)
+- `~/.hermes/.env` - API keys and secrets
+
+### Adding New Configuration Options
+
+When adding new configuration variables, you MUST follow this process:
+
+#### For config.yaml options:
+
+1. Add to `DEFAULT_CONFIG` in `hermes_cli/config.py`
+2. **CRITICAL**: Bump `_config_version` in `DEFAULT_CONFIG` when adding required fields
+3. This triggers migration prompts for existing users on the next `hermes update` or `hermes setup`
+
+Example:
+```python
+DEFAULT_CONFIG = {
+    # ... existing config ...
+    
+    "new_feature": {
+        "enabled": True,
+        "option": "default_value",
+    },
+    
+    # BUMP THIS when adding required fields
+    "_config_version": 2,  # Was 1, now 2
+}
+```
+
+#### For .env variables (API keys/secrets):
+
+1. Add to `REQUIRED_ENV_VARS` or `OPTIONAL_ENV_VARS` in `hermes_cli/config.py`
+2. Include metadata for the migration system:
+
+```python
+OPTIONAL_ENV_VARS = {
+    # ... existing vars ...
+    "NEW_API_KEY": {
+        "description": "What this key is for",
+        "prompt": "Display name in prompts",
+        "url": "https://where-to-get-it.com/",
+        "tools": ["tools_it_enables"],  # What tools need this
+        "password": True,  # Mask input
+    },
+}
+```
+
+#### Update related files:
+
+- `hermes_cli/setup.py` - Add prompts in the setup wizard
+- `cli-config.yaml.example` - Add example with comments
+- `README.md` - Update if the change is user-facing
+
+### Config Version Migration
+
+The system uses `_config_version` to detect outdated configs:
+
+1. `check_for_missing_config()` compares user config to `DEFAULT_CONFIG`
+2. `migrate_config()` interactively prompts for missing values
+3. Called automatically by `hermes update` and optionally by `hermes setup`
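+
+As a rough illustration of the comparison step, the sketch below walks nested dictionaries and reports keys that exist in the defaults but not in the user's config. The function names are hypothetical and this is not the body of `check_for_missing_config()`; treating a missing `_config_version` as 0 is also an assumption:
+
+```python
+def find_missing_keys(user, defaults, prefix=""):
+    """Return dotted paths present in defaults but absent from user config."""
+    missing = []
+    for key, default_value in defaults.items():
+        path = f"{prefix}{key}"
+        if key not in user:
+            missing.append(path)
+        elif isinstance(default_value, dict) and isinstance(user[key], dict):
+            missing.extend(find_missing_keys(user[key], default_value, f"{path}."))
+    return missing
+
+
+def needs_migration(user, defaults):
+    """Outdated when the stored version is older than the default version."""
+    return user.get("_config_version", 0) < defaults.get("_config_version", 0)
+```
+
+For a user config of `{"new_feature": {"enabled": False}, "_config_version": 1}` checked against the `DEFAULT_CONFIG` example earlier, the missing list would contain only `new_feature.option`.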
+
+---
+
+## Environment Variables
+
+API keys are loaded from `~/.hermes/.env`:
+- `OPENROUTER_API_KEY` - Main LLM API access (primary provider)
+- `FIRECRAWL_API_KEY` - Web search/extract tools
+- `BROWSERBASE_API_KEY` / `BROWSERBASE_PROJECT_ID` - Browser automation
+- `FAL_KEY` - Image generation (FLUX model)
+- `NOUS_API_KEY` - Vision and Mixture-of-Agents tools
+
+Terminal tool configuration (in `~/.hermes/config.yaml`):
+- `terminal.backend` - Backend: local, docker, singularity, modal, or ssh
+- `terminal.cwd` - Working directory ("." = host CWD for local only; for remote backends set an absolute path inside the target, or omit to use the backend's default)
+- `terminal.docker_image` - Image for Docker backend
+- `terminal.singularity_image` - Image for Singularity backend
+- `terminal.modal_image` - Image for Modal backend
+- SSH: `TERMINAL_SSH_HOST`, `TERMINAL_SSH_USER`, `TERMINAL_SSH_KEY` in .env
+
+Agent behavior (in `~/.hermes/.env`):
+- `HERMES_MAX_ITERATIONS` - Max tool-calling iterations (default: 60)
+- `MESSAGING_CWD` - Working directory for messaging platforms (default: ~)
+- `HERMES_TOOL_PROGRESS` - Enable tool progress messages (`true`/`false`)
+- `HERMES_TOOL_PROGRESS_MODE` - Progress mode: `new` (tool changes) or `all`
+- `OPENAI_API_KEY` - Voice transcription (Whisper STT)
+- `SLACK_BOT_TOKEN` / `SLACK_APP_TOKEN` - Slack integration (Socket Mode)
+- `SLACK_ALLOWED_USERS` - Comma-separated Slack user IDs
+- `HERMES_HUMAN_DELAY_MODE` - Response pacing: off/natural/custom
+- `HERMES_HUMAN_DELAY_MIN_MS` / `HERMES_HUMAN_DELAY_MAX_MS` - Custom delay range
+
+### Dangerous Command Approval
+
+The terminal tool includes safety checks for potentially destructive commands (e.g., `rm -rf`, `DROP TABLE`, `chmod 777`):
+
+**Behavior by Backend:**
+- **Docker/Singularity/Modal**: Commands run unrestricted (isolated containers)
+- **Local/SSH**: Dangerous commands trigger approval flow
+
+**Approval Flow (CLI):**
+```
+⚠️  Potentially dangerous command detected: recursive delete
+    rm -rf /tmp/test
+
+    [o]nce  |  [s]ession  |  [a]lways  |  [d]eny
+    Choice [o/s/a/D]: 
+```
+
+**Approval Flow (Messaging):**
+- Command is blocked with explanation
+- Agent explains the command was blocked for safety
+- User must add the pattern to their allowlist via `hermes config edit` or run the command directly on their machine
+
+**Configuration:**
+- `command_allowlist` in `~/.hermes/config.yaml` stores permanently allowed patterns
+- Add patterns via "always" approval or edit directly
+
+**Sudo Handling (Messaging):**
+- If sudo fails over messaging, output includes tip to add `SUDO_PASSWORD` to `~/.hermes/.env`
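+
+A minimal sketch of the backend-dependent check described in this section. The patterns, reason strings, and function name are illustrative assumptions; the real pattern list lives in the terminal tool and is likely more thorough:
+
+```python
+import re
+
+# Illustrative patterns only; not the terminal tool's actual list.
+DANGEROUS_PATTERNS = [
+    (re.compile(r"\brm\s+-\w*[rR]"), "recursive delete"),
+    (re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE), "SQL table drop"),
+    (re.compile(r"\bchmod\s+(-R\s+)?777\b"), "world-writable permissions"),
+]
+ISOLATED_BACKENDS = {"docker", "singularity", "modal"}
+
+
+def needs_approval(command, backend):
+    """Return a reason string if the command should be approved first, else None."""
+    if backend in ISOLATED_BACKENDS:
+        return None  # isolated containers run unrestricted
+    for pattern, reason in DANGEROUS_PATTERNS:
+        if pattern.search(command):
+            return reason
+    return None
+```
+
+A `command_allowlist` lookup would sit in front of the pattern loop; it is omitted here to keep the sketch short.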
+
+---
+
+## Background Process Management
+
+The `process` tool works alongside `terminal` for managing long-running background processes:
+
+**Starting a background process:**
+```python
+terminal(command="pytest -v tests/", background=True)
+# Returns: {"session_id": "proc_abc123", "pid": 12345, ...}
+```
+
+**Managing it with the process tool:**
+- `process(action="list")` -- show all running/recent processes
+- `process(action="poll", session_id="proc_abc123")` -- check status + new output
+- `process(action="log", session_id="proc_abc123")` -- full output with pagination
+- `process(action="wait", session_id="proc_abc123", timeout=600)` -- block until done
+- `process(action="kill", session_id="proc_abc123")` -- terminate
+- `process(action="write", session_id="proc_abc123", data="y")` -- send stdin
+- `process(action="submit", session_id="proc_abc123", data="yes")` -- send + Enter
+
+**Key behaviors:**
+- Background processes execute through the configured terminal backend (local/Docker/Modal/SSH/Singularity) -- never directly on the host unless `TERMINAL_ENV=local`
+- The `wait` action blocks the tool call until the process finishes, times out, or is interrupted by a new user message
+- PTY mode (`pty=true` on terminal) enables interactive CLI tools (Codex, Claude Code)
+- In RL training, background processes are auto-killed when the episode ends (`tool_context.cleanup()`)
+- In the gateway, sessions with active background processes are exempt from idle reset
+- The process registry checkpoints to `~/.hermes/processes.json` for crash recovery
+
+Files: `tools/process_registry.py` (registry), `model_tools.py` (tool definition + handler), `tools/terminal_tool.py` (spawn integration)
+
+---
+
+## Adding New Tools
+
+Follow this strict order to maintain consistency:
+
+1. Create `tools/your_tool.py` with:
+   - Handler function (sync or async) returning a JSON string via `json.dumps()`
+   - `check_*_requirements()` function to verify dependencies (e.g., API keys)
+   - Schema definition following OpenAI function-calling format
+
+2. Export in `tools/__init__.py`:
+   - Import the handler and check function
+   - Add to `__all__` list
+
+3. Register in `model_tools.py`:
+   - Add to `TOOLSET_REQUIREMENTS` if it needs API keys
+   - Create `get_*_tool_definitions()` function or add to existing
+   - Add routing in `handle_function_call()` dispatcher
+   - Update `get_all_tool_names()` with the tool name
+   - Update `get_toolset_for_tool()` mapping
+   - Update `get_available_toolsets()` and `check_toolset_requirements()`
+
+4. Add to toolset in `toolsets.py`:
+   - Add to existing toolset or create new one in TOOLSETS dict
+
+5. If the tool requires an API key:
+   - Add to `OPTIONAL_ENV_VARS` in `hermes_cli/config.py`
+   - The tool will be auto-disabled if the key is missing
+
+6. Optionally add to `toolset_distributions.py` for batch processing
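+
+For step 1, a schema in OpenAI function-calling format might look like the sketch below. The tool name, description, and parameter are hypothetical and only illustrate the shape:
+
+```python
+# Hypothetical tool name and parameters, shown only to illustrate the shape.
+EXAMPLE_TOOL_SCHEMA = {
+    "type": "function",
+    "function": {
+        "name": "example_tool",
+        "description": "Look up something for the given input.",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "param": {"type": "string", "description": "Input to process."},
+            },
+            "required": ["param"],
+        },
+    },
+}
+```
+
+The `parameters` value is ordinary JSON Schema, so the property names should match the handler's keyword arguments.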
+
+### Tool Implementation Pattern
+
+```python
+# tools/example_tool.py
+import json
+import os
+from typing import Optional
+
+def check_example_requirements() -> bool:
+    """Check if required API keys/dependencies are available."""
+    return bool(os.getenv("EXAMPLE_API_KEY"))
+
+def example_tool(param: str, task_id: Optional[str] = None) -> str:
+    """Execute the tool and return JSON string result."""
+    try:
+        # Placeholder: replace with the tool's real work on `param`.
+        result = {"success": True, "data": "..."}
+        return json.dumps(result, ensure_ascii=False)
+    except Exception as e:
+        # Report failures as JSON too, so the caller always gets a string.
+        return json.dumps({"error": str(e)}, ensure_ascii=False)
+```
+
+All tool handlers MUST return a JSON string. Never return raw dicts.
+
+### Dynamic Tool Availability
+
+Tools are automatically disabled when their API keys are missing:
+
+```python
+# In model_tools.py
+TOOLSET_REQUIREMENTS = {
+    "web": {"env_vars": ["FIRECRAWL_API_KEY"]},
+    "browser": {"env_vars": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"]},
+    "creative": {"env_vars": ["FAL_KEY"]},
+}
+```
+
+The `check_tool_availability()` function determines which tools to include.
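+
+The check can be sketched as below. `available_toolsets()` is a hypothetical stand-in written for this example, not the body of `check_tool_availability()`, and it assumes a toolset is enabled only when every listed variable is set and non-empty:
+
+```python
+import os
+
+TOOLSET_REQUIREMENTS = {
+    "web": {"env_vars": ["FIRECRAWL_API_KEY"]},
+    "browser": {"env_vars": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"]},
+    "creative": {"env_vars": ["FAL_KEY"]},
+}
+
+
+def available_toolsets(env=None):
+    """Return toolsets whose required variables are all set and non-empty."""
+    env = os.environ if env is None else env
+    return [
+        name
+        for name, req in TOOLSET_REQUIREMENTS.items()
+        if all(env.get(var) for var in req["env_vars"])
+    ]
+```
+
+With only `FIRECRAWL_API_KEY` and `BROWSERBASE_API_KEY` set, this returns `["web"]`, because the browser toolset also needs `BROWSERBASE_PROJECT_ID`.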
+
+### Stateful Tools
+
+Tools that maintain state (terminal, browser) require:
+- `task_id` parameter for session isolation between concurrent tasks
+- `cleanup_*()` function to release resources
+- Cleanup is called automatically in `run_agent.py` after the conversation completes
+
+---
+
+## Trajectory Format
+
+Conversations are saved in ShareGPT format for training:
+```json
+{"from": "system", "value": "System prompt with <tools>...</tools>"}
+{"from": "human", "value": "User message"}
+{"from": "gpt", "value": "<think>reasoning</think>\n<tool_call>{...}</tool_call>"}
+{"from": "tool", "value": "<tool_response>{...}</tool_response>"}
+{"from": "gpt", "value": "Final response"}
+```
+
+Tool calls use `<tool_call>` XML tags, responses use `<tool_response>` tags, and reasoning uses `<think>` tags.
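+
+A small sketch of reading a `gpt` turn back out of this format. It assumes the payload inside `<tool_call>` is a JSON object; the `name` and `arguments` keys in the sample turn are assumptions, since the format above elides the payload:
+
+```python
+import json
+import re
+
+TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
+THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
+
+
+def parse_gpt_turn(value):
+    """Split a "gpt" turn into reasoning text and decoded tool calls."""
+    think = THINK_RE.search(value)
+    calls = [json.loads(raw) for raw in TOOL_CALL_RE.findall(value)]
+    return {"reasoning": think.group(1) if think else None, "tool_calls": calls}
+
+
+turn = (
+    "<think>need a listing</think>\n"
+    '<tool_call>{"name": "terminal", "arguments": {"command": "ls"}}</tool_call>'
+)
+parsed = parse_gpt_turn(turn)
+```
+
+A plain final response with no tags yields `{"reasoning": None, "tool_calls": []}`.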
+
+### Trajectory Export
+
+```python
+agent = AIAgent(save_trajectories=True)
+agent.chat("Do something")
+# Saves to trajectories/*.jsonl in ShareGPT format
+```
+
+---
+
+## Batch Processing (batch_runner.py)
+
+For processing multiple prompts:
+- Parallel execution with multiprocessing
+- Content-based resume for fault tolerance (matches on prompt text, not indices)
+- Toolset distributions control probabilistic tool availability per prompt
+- Output: `data/<run_name>/trajectories.jsonl` (combined) + individual batch files
+
+```bash
+python batch_runner.py \
+    --dataset_file=prompts.jsonl \
+    --batch_size=20 \
+    --num_workers=4 \
+    --run_name=my_run
+```
+
+---
+
+## Skills System
+
+Skills are on-demand knowledge documents the agent can load. They live in the `skills/` directory:
+
+```
+skills/
+├── mlops/                    # Category folder
+│   ├── axolotl/             # Skill folder
+│   │   ├── SKILL.md         # Main instructions (required)
+│   │   ├── references/      # Additional docs, API specs
+│   │   └── templates/       # Output formats, configs
+│   └── vllm/
+│       └── SKILL.md
+└── example-skill/
+    └── SKILL.md
+```
+
+**Progressive disclosure** (token-efficient):
+1. `skills_categories()` - List category names (~50 tokens)
+2. `skills_list(category)` - Name + description per skill (~3k tokens)
+3. `skill_view(name)` - Full content + tags + linked files
+
+SKILL.md files use YAML frontmatter:
+```yaml
+---
+name: skill-name
+description: Brief description for listing
+tags: [tag1, tag2]
+related_skills: [other-skill]
+version: 1.0.0
+---
+# Skill Content...
+```
+
+Tool files: `tools/skills_tool.py` → `model_tools.py` → `toolsets.py`
+
+---
+
+## Testing Changes
+
+After making changes:
+
+1. Run `hermes doctor` to check setup
+2. Run `hermes config check` to verify config
+3. Test with `hermes chat -q "test message"`
+4. For new config options, test fresh install: `rm -rf ~/.hermes && hermes setup`
@@ -1,729 +1,63 @@
 # Hermes Agent - Future Improvements

-> Ideas for enhancing the agent's capabilities, generated from self-analysis of the codebase.
-
---
-
-## 🚨 HIGH PRIORITY - Immediate Fixes
-
-These items need to be addressed ASAP:
-
-### 1. SUDO Breaking Terminal Tool 🔐 ✅ COMPLETE
- [x] **Problem:** SUDO commands break the terminal tool execution (hangs indefinitely)
- [x] **Fix:** Created custom environment wrappers in `tools/terminal_tool.py`
-  - `stdin=subprocess.DEVNULL` prevents hanging on interactive prompts
-  - Sudo fails gracefully with clear error if no password configured
-  - Same UX as Claude Code - agent sees error, tells user to run it themselves
- [x] **All 5 environments now have consistent behavior:**
-  - `_LocalEnvironment` - local execution
-  - `_DockerEnvironment` - Docker containers
-  - `_SingularityEnvironment` - Singularity/Apptainer containers
-  - `_ModalEnvironment` - Modal cloud sandboxes
-  - `_SSHEnvironment` - remote SSH execution
- [x] **Optional sudo support via `SUDO_PASSWORD` env var:**
-  - Shared `_transform_sudo_command()` helper used by all environments
-  - If set, auto-transforms `sudo cmd` → pipes password via `sudo -S`
-  - Documented in `.env.example`, `cli-config.yaml`, and README
-  - Works for chained commands: `cmd1 && sudo cmd2`
- [x] **Interactive sudo prompt in CLI mode:**
-  - When sudo detected and no password configured, prompts user
-  - 45-second timeout (auto-skips if no input)
-  - Hidden password input via `getpass` (password not visible)
-  - Password cached for session (don't ask repeatedly)
-  - Spinner pauses during prompt for clean UX
-  - Uses `HERMES_INTERACTIVE` env var to detect CLI mode
-
-### 2. Fix `browser_get_images` Tool 🖼️ ✅ VERIFIED WORKING
- [x] **Tested:** Tool works correctly on multiple sites
- [x] **Results:** Successfully extracts image URLs, alt text, dimensions
- [x] **Note:** Some sites (Pixabay, etc.) have Cloudflare bot protection that blocks headless browsers - this is expected behavior, not a bug
-
-### 3. Better Action Logging for Debugging 📝 ✅ COMPLETE
- [x] **Problem:** Need better logging of agent actions for debugging
- [x] **Implementation:**
-  - Save full session trajectories to `logs/` directory as JSON
-  - Each session gets a unique file: `session_YYYYMMDD_HHMMSS_UUID.json`
-  - Logs all messages, tool calls with inputs/outputs, timestamps
-  - Structured JSON format for easy parsing and replay
-  - Automatic on CLI runs (configurable)
-
-### 4. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED
- [ ] **Problem:** Thinking/reasoning summaries not shown while streaming
- [ ] **Complexity:** This is a significant refactor - leaving for later
-
-**OpenRouter Streaming Info:**
- Uses `stream=True` with OpenAI SDK
- Reasoning comes in `choices[].delta.reasoning_details` chunks
- Types: `reasoning.summary`, `reasoning.text`, `reasoning.encrypted`
- Tool call arguments stream as partial JSON (need accumulation)
- Items paradigm: same ID emitted multiple times with updated content
-
-**Key Challenges:**
- Tool call JSON accumulation (partial `{"query": "wea` → `{"query": "weather"}`)
- Multiple concurrent outputs (thinking + tool calls + text simultaneously)
- State management for partial responses
- Error handling if connection drops mid-stream
- Deciding when tool calls are "complete" enough to execute
-
-**UX Questions to Resolve:**
- Show raw thinking text or summarized?
- Live expanding text vs. spinner replacement?
- Markdown rendering while streaming?
- How to handle thinking + tool call display simultaneously?
-
-**Implementation Options:**
- New `run_conversation_streaming()` method (keep non-streaming as fallback)
- Wrapper that handles streaming internally
- Big refactor of existing `run_conversation()`
-
-**References:**
- https://openrouter.ai/docs/api/reference/streaming
- https://openrouter.ai/docs/guides/best-practices/reasoning-tokens#streaming-response
-
 ---

 ## 1. Subagent Architecture (Context Isolation) 🎯

-**Problem:** Long-running tools (terminal commands, browser automation, complex file operations) consume massive context. A single `ls -la` can add hundreds of lines. Browser snapshots, debugging sessions, and iterative terminal work quickly bloat the main conversation, leaving less room for actual reasoning.
+The main agent becomes an orchestrator that delegates context-heavy tasks to subagents with isolated context. Each subagent returns a summary, keeping the orchestrator's context clean. The proposed entry point is `delegate_task(goal, context, toolsets=[])`, which starts a subagent with a fresh conversation, a limited toolset, and a task-specific system prompt.

-**Solution:** The main agent becomes an **orchestrator** that delegates context-heavy tasks to **subagents**.
+## 2. Planning & Task Management 📋

-**Architecture:**
-```
-┌─────────────────────────────────────────────────────────────────┐
-│  ORCHESTRATOR (main agent)                                      │
-│  - Receives user request                                        │
-│  - Plans approach                                               │
-│  - Delegates heavy tasks to subagents                           │
-│  - Receives summarized results                                  │
-│  - Maintains clean, focused context                             │
-└─────────────────────────────────────────────────────────────────┘
-         │                    │                    │
-         ▼                    ▼                    ▼
-┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
-│ TERMINAL AGENT  │  │ BROWSER AGENT   │  │ CODE AGENT      │
-│ - terminal tool │  │ - browser tools │  │ - file tools    │
-│ - file tools    │  │ - web_search    │  │ - terminal      │
-│                 │  │ - web_extract   │  │                 │
-│ Isolated context│  │ Isolated context│  │ Isolated context│
-│ Returns summary │  │ Returns summary │  │ Returns summary │
-└─────────────────┘  └─────────────────┘  └─────────────────┘
-```
+Task decomposition tool, progress checkpoints after N tool calls, persistent plan storage that survives context compression, failure recovery with replanning.

-**How it works:**
-1. User asks: "Set up a new Python project with FastAPI and tests"
-2. Orchestrator plans: "I need to create files, install deps, write code"
-3. Orchestrator calls: `terminal_task(goal="Create venv, install fastapi pytest", context="New project in ~/myapp")`
-4. **Subagent spawns** with fresh context, only terminal/file tools
-5. Subagent iterates (may take 10+ tool calls, lots of output)
-6. Subagent completes → returns summary: "Created venv, installed fastapi==0.109.0, pytest==8.0.0"
-7. Orchestrator receives **only the summary**, context stays clean
-8. Orchestrator continues with next subtask
+## 3. Dynamic Skills Expansion 📚

-**Key tools to implement:**
- [ ] `terminal_task(goal, context, cwd?)` - Delegate terminal/shell work
- [ ] `browser_task(goal, context, start_url?)` - Delegate web research/automation  
- [ ] `code_task(goal, context, files?)` - Delegate code writing/modification
- [ ] Generic `delegate_task(goal, context, toolsets=[])` - Flexible delegation
+Skill acquisition from successful tasks, parameterized skill templates, skill chaining with dependency graphs.

-**Implementation details:**
- [ ] Subagent uses same `run_agent.py` but with:
-  - Fresh/empty conversation history
-  - Limited toolset (only what's needed)
-  - Smaller max_iterations (focused task)
-  - Task-specific system prompt
- [ ] Subagent returns structured result:
-  ```python
-  {
-    "success": True,
-    "summary": "Installed 3 packages, created 2 files",
-    "details": "Optional longer explanation if needed",
-    "artifacts": ["~/myapp/requirements.txt", "~/myapp/main.py"],  # Files created
-    "errors": []  # Any issues encountered
-  }
-  ```
- [ ] Orchestrator sees only the summary in its context
- [ ] Full subagent transcript saved separately for debugging
+## 4. Interactive Clarifying Questions ❓

-**Benefits:**
- 🧹 **Clean context** - Orchestrator stays focused, doesn't drown in tool output
- 📊 **Better token efficiency** - 50 terminal outputs → 1 summary paragraph
- 🎯 **Focused subagents** - Each agent has just the tools it needs
- 🔄 **Parallel potential** - Independent subtasks could run concurrently
- 🐛 **Easier debugging** - Each subtask has its own isolated transcript
+Multiple-choice prompt tool with rich terminal UI. Up to 4 choices + free-text. CLI-only with graceful fallback for non-interactive modes.

-**When to use subagents vs direct tools:**
- **Subagent**: Multi-step tasks, iteration likely, lots of output expected
- **Direct**: Quick one-off commands, simple file reads, user needs to see output
+## 5. Memory System 🧠

-**Files to modify:** `run_agent.py` (add orchestration mode), new `tools/delegate_tools.py`, new `subagent_runner.py`
+Daily memory logs, long-term curated MEMORY.md, vector/semantic search, pre-compaction memory flush, user profile, learning store for error patterns and discovered fixes. *Inspired by ClawdBot's memory system.*

---
+## 6. Heartbeat System 💓

-## 2. Context Management (complements Subagents)
+Periodic agent wake-up that reads HEARTBEAT.md for instructions. Runs inside the main session with full context. Triggers on interval, exec completion, cron events, or manual wake. HEARTBEAT_OK suppression when nothing needs attention. *Inspired by ClawdBot's heartbeat.*

-**Problem:** Context grows unbounded during long conversations. Trajectory compression exists for training data post-hoc, but live conversations lack intelligent context management.
+## 7. Local Browser Control via CDP 🌐

-**Ideas:**
- [ ] **Incremental summarization** - Compress old tool outputs on-the-fly during conversations
-  - Trigger when context exceeds threshold (e.g., 80% of max tokens)
-  - Preserve recent turns fully, summarize older tool responses
-  - Could reuse logic from `trajectory_compressor.py`
-  
- [ ] **Semantic memory retrieval** - Vector store for long conversation recall
-  - Embed important facts/findings as conversation progresses
-  - Retrieve relevant memories when needed instead of keeping everything in context
-  - Consider lightweight solutions: ChromaDB, FAISS, or even a simple embedding cache
-  
- [ ] **Working vs. episodic memory** distinction
-  - Working memory: Current task state, recent tool results (always in context)
-  - Episodic memory: Past findings, tried approaches (retrieved on demand)
-  - Clear eviction policies for each
+Support both local Chrome (via CDP, free) and Browserbase (cloud, paid) as browser backends. Local gives persistent login sessions but lacks CAPTCHA solving.

-**Files to modify:** `run_agent.py` (add memory manager), possibly new `tools/memory_tool.py`
+## 8. Signal Integration 📡

---
+New platform adapter using signal-cli daemon (JSON-RPC HTTP + SSE). Requires Java runtime and phone number registration.

-## 3. Self-Reflection & Course Correction 🔄
+## 9. Session Transcript Search 🔍

-**Problem:** Current retry logic handles malformed outputs but not semantic failures. Agent doesn't reason about *why* something failed.
+`hermes sessions search <query>` CLI command and `session_search` agent tool. Text-based first (ripgrep over JSONL), vector search later.

-**Ideas:**
- [ ] **Meta-reasoning after failures** - When a tool returns an error or unexpected result:
-  ```
-  Tool failed → Reflect: "Why did this fail? What assumptions were wrong?"
-  → Adjust approach → Retry with new strategy
-  ```
-  - Could be a lightweight LLM call or structured self-prompt
-  
- [ ] **Planning/replanning module** - For complex multi-step tasks:
-  - Generate plan before execution
-  - After each step, evaluate: "Am I on track? Should I revise the plan?"
-  - Store plan in working memory, update as needed
-  
- [ ] **Approach memory** - Remember what didn't work:
-  - "I tried X for this type of problem and it failed because Y"
-  - Prevents repeating failed strategies in the same conversation
+## 10. Plugin/Extension System 🔌

-**Files to modify:** `run_agent.py` (add reflection hooks in tool loop), new `tools/reflection_tool.py`
+Python plugin interface with `plugin.yaml` + `handler.py`. Discovery from `~/.hermes/plugins/`. Plugins can register tools, hooks, and CLI commands. *Inspired by ClawdBot's 36-plugin extension system.*

---
+## 11. Native Companion Apps 📱

-## 4. Tool Composition & Learning 🔧
+macOS (Swift/SwiftUI), iOS, Android apps connecting via WebSocket. Prerequisite: WS API on gateway. MVP: web UI with Flask/FastAPI. *Inspired by ClawdBot's companion apps.*

-**Problem:** Tools are atomic. Complex tasks require repeated manual orchestration of the same tool sequences.
+## 12. Evaluation System 📏

-**Ideas:**
- [ ] **Macro tools / Tool chains** - Define reusable tool sequences:
-  ```yaml
-  research_topic:
-    description: "Deep research on a topic"
-    steps:
-      - web_search: {query: "$topic"}
-      - web_extract: {urls: "$search_results.urls[:3]"}
-      - summarize: {content: "$extracted"}
-  ```
-  - Could be defined in skills or a new `macros/` directory
-  - Agent can invoke macro as single tool call
-  
- [ ] **Tool failure patterns** - Learn from failures:
-  - Track: tool, input pattern, error type, what worked instead
-  - Before calling a tool, check: "Has this pattern failed before?"
-  - Persistent across sessions (stored in skills or separate DB)
-  
- [ ] **Parallel tool execution** - When tools are independent, run concurrently:
-  - Detect independence (no data dependencies between calls)
-  - Use `asyncio.gather()` for parallel execution
-  - Already have async support in some tools, just need orchestration
+LLM grader mode for batch_runner, action comparison against expected tool calls, string matching baselines.

-**Files to modify:** `model_tools.py`, `toolsets.py`, new `tool_macros.py`
+## 13. Layered Context Architecture 📊

---
+Structured hierarchy: project context > skills > user profile > learnings > external knowledge > runtime introspection.

-## 5. Dynamic Skills Expansion 📚
+## 14. Tools Wishlist 🧰

-**Problem:** Skills system is elegant but static. Skills must be manually created and added.
-
-**Ideas:**
- [ ] **Skill acquisition from successful tasks** - After completing a complex task:
-  - "This approach worked well. Save as a skill?"
-  - Extract: goal, steps taken, tools used, key decisions
-  - Generate SKILL.md automatically
-  - Store in user's skills directory
-  
- [ ] **Skill templates** - Common patterns that can be parameterized:
-  ```markdown
-  # Debug {language} Error
-  1. Reproduce the error
-  2. Search for error message: `web_search("{error_message} {language}")`
-  3. Check common causes: {common_causes}
-  4. Apply fix and verify
-  ```
-  
- [ ] **Skill chaining** - Combine skills for complex workflows:
-  - Skills can reference other skills as dependencies
-  - "To do X, first apply skill Y, then skill Z"
-  - Directed graph of skill dependencies
-
-**Files to modify:** `tools/skills_tool.py`, `skills/` directory structure, new `skill_generator.py`
-
---
-
-## 6. Task Continuation Hints 🎯
-
-**Problem:** Could be more helpful by suggesting logical next steps.
-
-**Ideas:**
- [ ] **Suggest next steps** - At end of a task, suggest logical continuations:
-  - "Code is written. Want me to also write tests / docs / deploy?"
-  - Based on common workflows for task type
-  - Non-intrusive, just offer options
-
-**Files to modify:** `run_agent.py`, response generation logic
-
---
-
-## 7. Interactive Clarifying Questions Tool ❓
-
-**Problem:** Agent sometimes makes assumptions or guesses when it should ask the user. Currently can only ask via text, which gets lost in long outputs.
-
-**Ideas:**
- [ ] **Multiple-choice prompt tool** - Let agent present structured choices to user:
-  ```
-  ask_user_choice(
-    question="Should the language switcher enable only German or all languages?",
-    choices=[
-      "Only enable German - works immediately",
-      "Enable all, mark untranslated - show fallback notice",
-      "Let me specify something else"
-    ]
-  )
-  ```
-  - Renders as interactive terminal UI with arrow key / Tab navigation
-  - User selects option, result returned to agent
-  - Up to 4 choices + optional free-text option
-  
- [ ] **Implementation:**
-  - Use `inquirer` or `questionary` Python library for rich terminal prompts
-  - Tool returns selected option text (or user's custom input)
-  - **CLI-only** - only works when running via `cli.py` (not API/programmatic use)
-  - Graceful fallback: if not in interactive mode, return error asking agent to rephrase as text
-  
- [ ] **Use cases:**
-  - Clarify ambiguous requirements before starting work
-  - Confirm destructive operations with clear options
-  - Let user choose between implementation approaches
-  - Checkpoint complex multi-step workflows
-
-**Files to modify:** New `tools/ask_user_tool.py`, `cli.py` (detect interactive mode), `model_tools.py`
-
---
-
-## 8. Resource Awareness & Efficiency 💰
-
-**Problem:** No awareness of costs, time, or resource usage. Could be smarter about efficiency.
-
-**Ideas:**
- [ ] **Tool result caching** - Don't repeat identical operations:
-  - Cache web searches, extractions within a session
-  - Invalidation based on time-sensitivity of query
-  - Hash-based lookup: same input → cached output
-
- [ ] **Lazy evaluation** - Don't fetch everything upfront:
-  - Get summaries first, full content only if needed
-  - "I found 5 relevant pages. Want me to deep-dive on any?"
-
-**Files to modify:** `model_tools.py`, new `resource_tracker.py`
-
---
-
-## 9. Collaborative Problem Solving 🤝
-
-**Problem:** Interaction is command/response. Complex problems benefit from dialogue.
-
-**Ideas:**
- [ ] **Assumption surfacing** - Make implicit assumptions explicit:
-  - "I'm assuming you want Python 3.11+. Correct?"
-  - "This solution assumes you have sudo access..."
-  - Let user correct before going down wrong path
-
- [ ] **Checkpoint & confirm** - For high-stakes operations:
-  - "About to delete 47 files. Here's the list - proceed?"
-  - "This will modify your database. Want a backup first?"
-  - Configurable threshold for when to ask
-
-**Files to modify:** `run_agent.py`, system prompt configuration
-
---
-
-## 10. Project-Local Context 💾
-
-**Problem:** Valuable context lost between sessions.
-
-**Ideas:**
- [ ] **Project awareness** - Remember project-specific context:
-  - Store `.hermes/context.md` in project directory
-  - "This is a Django project using PostgreSQL"
-  - Coding style preferences, deployment setup, etc.
-  - Load automatically when working in that directory
-
- [ ] **Handoff notes** - Leave notes for future sessions:
-  - Write to `.hermes/notes.md` in project
-  - "TODO for next session: finish implementing X"
-  - "Known issues: Y doesn't work on Windows"
-
-**Files to modify:** New `project_context.py`, auto-load in `run_agent.py`
-
---
-
-## 11. Graceful Degradation & Robustness 🛡️
-
-**Problem:** When things go wrong, recovery is limited. Should fail gracefully.
-
-**Ideas:**
- [ ] **Fallback chains** - When primary approach fails, have backups:
-  - `web_extract` fails → try `browser_navigate` → try `web_search` for cached version
-  - Define fallback order per tool type
-  
- [ ] **Partial progress preservation** - Don't lose work on failure:
-  - Long task fails midway → save what we've got
-  - "I completed 3/5 steps before the error. Here's what I have..."
-  
- [ ] **Self-healing** - Detect and recover from bad states:
-  - Browser stuck → close and retry
-  - Terminal hung → timeout and reset
-
-**Files to modify:** `model_tools.py`, tool implementations, new `fallback_manager.py`
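
A fallback chain as described above might be sketched like this. The `FALLBACK_ORDER` table and `run_with_fallbacks` are hypothetical names, and the sketch assumes every tool in a chain accepts the same arguments, which the real tools may not.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical per-tool fallback order, mirroring the web_extract example above.
FALLBACK_ORDER: Dict[str, List[str]] = {
    "web_extract": ["web_extract", "browser_navigate", "web_search"],
}

Handler = Callable[[Dict[str, Any]], Any]


def run_with_fallbacks(
    tool_name: str,
    arguments: Dict[str, Any],
    handlers: Dict[str, Handler],
) -> Tuple[str, Any, List[str]]:
    """Try each tool in the chain; return (tool used, result, errors so far)."""
    errors: List[str] = []
    for candidate in FALLBACK_ORDER.get(tool_name, [tool_name]):
        handler = handlers.get(candidate)
        if handler is None:
            errors.append(f"{candidate}: no handler registered")
            continue
        try:
            return candidate, handler(arguments), errors
        except Exception as exc:  # any failure moves on to the next fallback
            errors.append(f"{candidate}: {exc}")
    raise RuntimeError("all fallbacks failed: " + "; ".join(errors))
```

Returning the collected errors alongside the result lets the agent report which approaches failed before one succeeded.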
-
---
-
-## 12. Tools & Skills Wishlist 🧰
-
-*Things that would need new tool implementations (can't do well with current tools):*
-
-### High-Impact
-
- [ ] **Audio/Video Transcription** 🎬 *(See also: Section 16 for detailed spec)*
-  - Transcribe audio files, podcasts, YouTube videos
-  - Extract key moments from video
-  - Voice memo transcription for messaging integrations
-  - *Provider options: Whisper API, Deepgram, local Whisper*
-  
- [ ] **Diagram Rendering** 📊
-  - Render Mermaid/PlantUML to actual images
-  - Can generate the code, but rendering requires external service or tool
-  - "Show me how these components connect" → actual visual diagram
-
-### Medium-Impact
-
- [ ] **Canvas / Visual Workspace** 🖼️
-  - Agent-controlled visual panel for rendering interactive UI
-  - Inspired by OpenClaw's Canvas feature
-  - **Capabilities:**
-    - `present` / `hide` - Show/hide the canvas panel
-    - `navigate` - Load HTML files or URLs into the canvas
-    - `eval` - Execute JavaScript in the canvas context
-    - `snapshot` - Capture the rendered UI as an image
-  - **Use cases:**
-    - Display generated HTML/CSS/JS previews
-    - Show interactive data visualizations (charts, graphs)
-    - Render diagrams (Mermaid → rendered output)
-    - Present structured information in rich format
-    - A2UI-style component system for structured agent UI
-  - **Implementation options:**
-    - Electron-based panel for CLI
-    - WebSocket-connected web app
-    - VS Code webview extension
-  - *Would let agent "show" things rather than just describe them*
-
- [ ] **Document Generation** 📄
-  - Create styled PDFs, Word docs, presentations
-  - *Can do basic PDF via terminal tools, but limited*
-
- [ ] **Diff/Patch Tool** 📝
-  - Surgical code modifications with preview
-  - "Change lines 45-50 to X" without rewriting the whole file
-  - Show diffs before applying
-  - *Can use `diff`/`patch` but a native tool would be safer*
-
-### Skills to Create
-
- [ ] **Domain-specific skill packs:**
-  - DevOps/Infrastructure (Terraform, K8s, AWS)
-  - Data Science workflows (EDA, model training)
-  - Security/pentesting procedures
-  
- [ ] **Framework-specific skills:**
-  - React/Vue/Angular patterns
-  - Django/Rails/Express conventions
-  - Database optimization playbooks
-
- [ ] **Troubleshooting flowcharts:**
-  - "Docker container won't start" → decision tree
-  - "Production is slow" → systematic diagnosis
-
---
-
-## 13. Messaging Platform Integrations 💬
-
-**Problem:** The agent currently works only via `cli.py`, which requires direct terminal access. Users may want to interact via messaging apps from their phone or other devices.
-
-**Architecture:**
- `run_agent.py` already accepts `conversation_history` parameter and returns updated messages ✅
- Need: persistent session storage, platform monitors, session key resolution
-
-**Implementation approach:**
-```
-┌─────────────────────────────────────────────────────────────┐
-│  Platform Monitor (e.g., telegram_monitor.py)               │
-│  ├─ Long-running daemon connecting to messaging platform    │
-│  ├─ On message: resolve session key → load history from disk│
-│  ├─ Call run_agent.py with loaded history                   │
-│  ├─ Save updated history back to disk (JSONL)               │
-│  └─ Send response back to platform                          │
-└─────────────────────────────────────────────────────────────┘
-```
-
-**Platform support (each user sets up their own credentials):**
- [ ] **Telegram** - via `python-telegram-bot` or `grammy` equivalent
-  - Bot token from @BotFather
-  - Easiest to set up, good for personal use
- [ ] **Discord** - via `discord.py`
-  - Bot token from Discord Developer Portal
-  - Can work in servers (group sessions) or DMs
- [ ] **WhatsApp** - via `baileys` (WhatsApp Web protocol)
-  - QR code scan to authenticate
-  - More complex, but reaches most people
-
-**Session management:**
- [ ] **Session store** - JSONL persistence per session key
-  - `~/.hermes/sessions/{session_key}.jsonl`
-  - Session keys: `telegram:dm:{user_id}`, `discord:channel:{id}`, etc.
- [ ] **Session expiry** - Configurable reset policies
-  - Daily reset (default 4am) OR idle timeout (e.g., 2 hours)
-  - Manual reset via `/reset` or `/new` command in chat
- [ ] **Session continuity** - Conversations persist across messages until reset
-
-**Files to create:** `monitors/telegram_monitor.py`, `monitors/discord_monitor.py`, `monitors/session_store.py`
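
The session store and idle-timeout policy above could be sketched as follows. `SessionStore` and its methods are hypothetical, the daily 4am reset is omitted, and replacing `:` in the file name is an added assumption (colons are not valid in Windows file names), not something the plan specifies.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, List, Optional


class SessionStore:
    """JSONL-backed conversation history, one file per session key. Hypothetical helper."""

    def __init__(self, root: Path, idle_timeout_s: float = 2 * 60 * 60) -> None:
        self.root = Path(root)
        self.idle_timeout_s = idle_timeout_s

    def _path(self, session_key: str) -> Path:
        # Keys look like "telegram:dm:{user_id}"; ":" is replaced for portability.
        return self.root / (session_key.replace(":", "_") + ".jsonl")

    def load(self, session_key: str, now: Optional[float] = None) -> List[Dict[str, Any]]:
        path = self._path(session_key)
        if not path.exists():
            return []
        now = time.time() if now is None else now
        if now - path.stat().st_mtime > self.idle_timeout_s:
            path.unlink()  # idle timeout reached: start a fresh session
            return []
        with path.open("r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def save(self, session_key: str, messages: List[Dict[str, Any]]) -> None:
        self.root.mkdir(parents=True, exist_ok=True)
        with self._path(session_key).open("w", encoding="utf-8") as f:
            for message in messages:
                f.write(json.dumps(message, ensure_ascii=False) + "\n")

    def reset(self, session_key: str) -> None:
        # Backs a manual /reset or /new command.
        self._path(session_key).unlink(missing_ok=True)
```

A monitor would call `load` on each incoming message, pass the history to the agent, then call `save` with the updated messages.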
-
---
-
-## 14. Scheduled Tasks / Cron Jobs ⏰
-
-**Problem:** Agent only runs on-demand. Some tasks benefit from scheduled execution (daily summaries, monitoring, reminders).
-
-**Ideas:**
- [ ] **Cron-style scheduler** - Run agent turns on a schedule
-  - Store jobs in `~/.hermes/cron/jobs.json`
-  - Each job: `{ id, schedule, prompt, session_mode, delivery }`
-  - Uses APScheduler or similar Python library
-  
- [ ] **Session modes:**
-  - `isolated` - Fresh session each run (no history, clean context)
-  - `main` - Append to main session (agent remembers previous scheduled runs)
-  
- [ ] **Delivery options:**
-  - Write output to file (`~/.hermes/cron/output/{job_id}/{timestamp}.md`)
-  - Send to messaging channel (if integrations enabled)
-  - Both
-  
- [ ] **CLI interface:**
-  ```bash
-  # List scheduled jobs
-  python cli.py --cron list
-  
-  # Add a job (runs daily at 9am)
-  python cli.py --cron add "Summarize my email inbox" --schedule "0 9 * * *"
-  
-  # Quick syntax for simple intervals  
-  python cli.py --cron add "Check server status" --every 30m
-  
-  # Remove a job
-  python cli.py --cron remove <job_id>
-  ```
-
- [ ] **Agent self-scheduling** - Let the agent create its own cron jobs
-  - New tool: `schedule_task(prompt, schedule, session_mode)`
-  - "Remind me to check the deployment tomorrow at 9am"
-  - Agent can set follow-up tasks for itself
-
- [ ] **In-chat command:** `/cronjob {prompt} {frequency}` when using messaging integrations
-
-**Files to create:** `cron/scheduler.py`, `cron/jobs.py`, `tools/schedule_tool.py`
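
The job record and the `--every 30m` shorthand above could be modeled roughly like this. It is a sketch only: the `CronJob` dataclass, the `delivery` values, and the helper names are hypothetical, and the actual scheduling (APScheduler or similar) is left out.

```python
import json
import re
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class CronJob:
    """One entry in jobs.json, following the fields listed above."""
    id: str
    schedule: str                   # cron expression, e.g. "0 9 * * *"
    prompt: str
    session_mode: str = "isolated"  # "isolated" or "main"
    delivery: str = "file"          # assumed values: "file", "channel", "both"


_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_every(value: str) -> int:
    """Convert a simple interval such as "30m" or "2h" into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if not match:
        raise ValueError(f"unsupported interval: {value!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return seconds


def save_jobs(path: Path, jobs: List[CronJob]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps([asdict(job) for job in jobs], indent=2), encoding="utf-8")


def load_jobs(path: Path) -> List[CronJob]:
    if not path.exists():
        return []
    return [CronJob(**item) for item in json.loads(path.read_text(encoding="utf-8"))]
```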
-
---
-
-## 15. Text-to-Speech (TTS) 🔊
-
-**Problem:** Agent can only respond with text. Some users prefer audio responses (accessibility, hands-free use, podcasts).
-
-**Ideas:**
- [ ] **TTS tool** - Generate audio files from text
-  ```python
-  tts_generate(text="Here's your summary...", voice="nova", output="summary.mp3")
-  ```
-  - Returns path to generated audio file
-  - For messaging integrations: can send as voice message
-  
- [ ] **Provider options:**
-  - Edge TTS (free, good quality, many voices)
-  - OpenAI TTS (paid, excellent quality)
-  - ElevenLabs (paid, best quality, voice cloning)
-  - Local options (Coqui TTS, Bark)
-  
- [ ] **Modes:**
-  - On-demand: User explicitly asks "read this to me"
-  - Auto-TTS: Configurable to always generate audio for responses
-  - Long-text handling: Summarize or chunk very long responses
-  
- [ ] **Integration with messaging:**
-  - When enabled, can send voice notes instead of/alongside text
-  - User preference per channel
-
-**Files to create:** `tools/tts_tool.py`, config in `cli-config.yaml`
-
---
-
-## 16. Speech-to-Text / Audio Transcription 🎤
-
-**Problem:** Users may want to send voice memos instead of typing. Agent is blind to audio content.
-
-**Ideas:**
- [ ] **Voice memo transcription** - For messaging integrations
-  - User sends voice message → transcribe → process as text
-  - Seamless: user speaks, agent responds
-  
- [ ] **Audio/video file transcription** - Existing idea, expanded:
-  - Transcribe local audio files (mp3, wav, m4a)
-  - Transcribe YouTube videos (download audio → transcribe)
-  - Extract key moments with timestamps
-  
- [ ] **Provider options:**
-  - OpenAI Whisper API (good quality, cheap)
-  - Deepgram (fast, good for real-time)
-  - Local Whisper (free, runs on GPU)
-  - Groq Whisper (fast, free tier available)
-  
- [ ] **Tool interface:**
-  ```python
-  transcribe(source="audio.mp3")  # Local file
-  transcribe(source="https://youtube.com/...")  # YouTube
-  transcribe(source="voice_message", data=bytes)  # Voice memo
-  ```
-
-**Files to create:** `tools/transcribe_tool.py`, integrate with messaging monitors
-
---
-
-## Priority Order (Suggested)
-
-1. **🎯 Subagent Architecture** - Critical for context management, enables everything else
-2. **Memory & Context Management** - Complements subagents for remaining context
-3. **Self-Reflection** - Improves reliability and reduces wasted tool calls  
-4. **Project-Local Context** - Practical win, keeps useful info across sessions
-5. **Messaging Integrations** - Unlocks mobile access, new interaction patterns
-6. **Scheduled Tasks / Cron Jobs** - Enables automation, reminders, monitoring
-7. **Tool Composition** - Quality of life, builds on other improvements
-8. **Dynamic Skills** - Force multiplier for repeated tasks
-9. **Interactive Clarifying Questions** - Better UX for ambiguous tasks
-10. **TTS / Audio Transcription** - Accessibility, hands-free use
-
---
-
-## Removed Items (Unrealistic)
-
-The following were removed because they're architecturally impossible:
-
- ~~Proactive suggestions / Prefetching~~ - Agent only runs on user request, can't interject
- ~~Clipboard integration~~ - No access to user's local system clipboard
-
-The following **moved to active TODO** (now possible with new architecture):
-
- ~~Session save/restore~~ → See **Messaging Integrations** (session persistence)
- ~~Voice/TTS playback~~ → See **TTS** (can generate audio files, send via messaging)
- ~~Set reminders~~ → See **Scheduled Tasks / Cron Jobs**
-
-The following were removed because they're **already possible**:
-
- ~~HTTP/API Client~~ → Use `curl` or Python `requests` in terminal
- ~~Structured Data Manipulation~~ → Use `pandas` in terminal
- ~~Git-Native Operations~~ → Use `git` CLI in terminal
- ~~Symbolic Math~~ → Use `SymPy` in terminal
- ~~Code Quality Tools~~ → Run linters (`eslint`, `black`, `mypy`) in terminal
- ~~Testing Framework~~ → Run `pytest`, `jest`, etc. in terminal
- ~~Translation~~ → LLM handles this fine, or use translation APIs
-
---
-
---
-
-## 🧪 Brainstorm Ideas (Not Yet Fleshed Out)
-
-*These are early-stage ideas that need more thinking before implementation. Captured here so they don't get lost.*
-
-### Remote/Distributed Execution 🌐
-
-**Concept:** Run agent on a powerful remote server while interacting from a thin client.
-
-**Why interesting:**
- Run on beefy GPU server for local LLM inference
- Agent has access to remote machine's resources (files, tools, internet)
- User interacts via lightweight client (phone, low-power laptop)
-
-**Open questions:**
- How does this differ from just SSH + running cli.py on remote?
- Would need secure communication channel (WebSocket? gRPC?)
- How to handle tool outputs that reference remote paths?
- Credential management for remote execution
- Latency considerations for interactive use
-
-**Possible architecture:**
-```
-┌─────────────┐         ┌─────────────────────────┐
-│ Thin Client │ ◄─────► │ Remote Hermes Server    │
-│ (phone/web) │  WS/API │ - Full agent + tools    │
-└─────────────┘         │ - GPU for local LLM     │
-                        │ - Access to server files│
-                        └─────────────────────────┘
-```
-
-**Related to:** Messaging integrations (could be the "server" that monitors receive from)
-
---
-
-### Multi-Agent Parallel Execution 🤖🤖
-
-**Concept:** Extension of Subagent Architecture (Section 1) - run multiple subagents in parallel.
-
-**Why interesting:**
- Independent subtasks don't need to wait for each other
- "Research X while setting up Y" - both run simultaneously
- Faster completion for complex multi-part tasks
-
-**Open questions:**
- How to detect which tasks are truly independent?
- Resource management (API rate limits, concurrent connections)
- How to merge results when parallel tasks have conflicts?
- Cost implications of multiple parallel LLM calls
-
-*Note: Basic subagent delegation (Section 1) should be implemented first; parallel execution is an optimization on top.*
-
---
-
-### Plugin/Extension System 🔌
-
-**Concept:** Allow users to add custom tools/skills without modifying core code.
-
-**Why interesting:**
- Community contributions
- Organization-specific tools
- Clean separation of core vs. extensions
-
-**Open questions:**
- Security implications of loading arbitrary code
- Versioning and compatibility
- Discovery and installation UX
-
---
-
-*Last updated: $(date +%Y-%m-%d)* 🤖
+- Diagram rendering (Mermaid/PlantUML to images)
+- Document generation (PDFs, Word, presentations)
+- Canvas / visual workspace
+- Coding agent skill (Codex, Claude Code orchestration via PTY)
+- Domain skill packs (DevOps, data science, security)
@@ -1,41 +0,0 @@
-# Dockerfile for atropos-agent sandbox server
-# Runs inside Nomad containers to handle tool execution
-# Includes bubblewrap for namespace-based slot isolation
-
-FROM python:3.11-slim
-
-# Install system dependencies
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    # Bubblewrap for namespace isolation
-    bubblewrap \
-    # `script` for PTY allocation (used for stable tmux+asciinema startup)
-    util-linux \
-    # Git for SWE-style tasks (cloning repos)
-    git \
-    # tmux for stateful terminal sessions (Phase 4.7+)
-    tmux \
-    # Common tools agents might need
-    curl \
-    wget \
-    jq \
-    # Cleanup
-    && rm -rf /var/lib/apt/lists/*
-
-# Install Python dependencies (sandbox server + optional terminal recording)
-RUN pip install --no-cache-dir aiohttp asciinema
-
-# Copy the sandbox server
-COPY sandbox_server.py /app/sandbox_server.py
-
-WORKDIR /app
-
-# Create data directory for slot workspaces
-RUN mkdir -p /data
-
-# Verify bubblewrap is installed and working
-RUN bwrap --version
-
-EXPOSE 8080
-
-# Default command - can be overridden by Nomad job spec
-CMD ["python", "sandbox_server.py", "--port", "8080", "--slots", "10", "--data-dir", "/data"]
@@ -1,46 +0,0 @@
-"""
-Atropos integration for Hermes-Agent.
-
-This package is intentionally optional: Hermes-Agent should work without Atropos.
-If you import anything from `atropos.*` without having `atroposlib` installed,
-we raise a clear error with install instructions.
-
-Install (recommended, from repo checkout):
-  uv sync --extra atropos
-
-Or (pip / editable):
-  pip install -e '.[atropos]'
-"""
-
-from __future__ import annotations
-
-
-def _require_atroposlib() -> None:
-    try:
-        import atroposlib  # noqa: F401
-    except ModuleNotFoundError as exc:  # pragma: no cover
-        raise ModuleNotFoundError(
-            "Hermes-Agent Atropos integration requires `atroposlib`, but it is not installed.\n"
-            "Install it with:\n"
-            "  uv sync --extra atropos\n"
-            "or:\n"
-            "  pip install -e '.[atropos]'\n"
-        ) from exc
-
-
-_require_atroposlib()
-
-# Re-export the most commonly used pieces for convenience.
-from .agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData  # noqa: E402
-from .envs import AgentEnv, AgentEnvConfig  # noqa: E402
-
-__all__ = [
-    "AtroposAgent",
-    "AgentConfig",
-    "AgentResult",
-    "AgentStep",
-    "SequenceData",
-    "AgentEnv",
-    "AgentEnvConfig",
-]
-
@@ -1,15 +0,0 @@
-"""
-Agent abstractions for atropos-agent.
-
-Provides the core AtroposAgent class for running ReACT-style agent loops.
-"""
-
-from .atropos_agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData
-
-__all__ = [
-    "AtroposAgent",
-    "AgentConfig",
-    "AgentResult",
-    "AgentStep",
-    "SequenceData",
-]
@@ -1,850 +0,0 @@
-"""
-ReACT-style agent implementation for atropos-agent.
-
-This module provides the core AtroposAgent class that implements a basic
-Reason-Act-Observe loop with tool calling capabilities.
-
-Uses ManagedServer from atroposlib for automatic token/logprob tracking,
-making trajectories ready for RL training.
-
-The agent uses Hermes-style XML tags for tool calls:
- <think>...</think> for reasoning
- <tool_call>{"name": "...", "arguments": {...}}</tool_call> for actions
- <tool_response>...</tool_response> for observations
-"""
-
-import asyncio
-import os
-import json
-import time
-from contextlib import asynccontextmanager
-from dataclasses import dataclass, field
-from uuid import uuid4
-from typing import Any, AsyncGenerator, Awaitable, Callable, Dict, List, Optional, Union
-
-from dotenv import load_dotenv
-import httpx
-
-from ..tools import ToolCall, ToolRegistry, ToolResult
-from atroposlib.envs.server_handling.managed_server import ManagedServer
-
-load_dotenv()
-
-
-# Default system prompt with tool calling instructions.
-AGENT_SYSTEM_PROMPT = """You are a deep thinking AI. You MUST enclose your internal reasoning inside <think>...</think> tags.
-
-You are a function calling AI model.
-
-You are provided with function signatures within <tools></tools> XML tags.
-You must call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
-You can ONLY respond without a tool call if you are totally certain you have the final answer to the user's question or task
-After calling & executing a function, you will be provided with function results within <tool_response></tool_response> XML tags.
-
-Here are the available tools:
-<tools>
-{tools_json}
-</tools>
-
-Use the following JSON schema for each tool call you will make:
-{"title": "FunctionCall", "type": "object", "properties": {"name": {"title": "Name", "type": "string"}, "arguments": {"title": "Arguments", "type": "object"}}, "required": ["name", "arguments"]}
-
-## REQUIRED TOOL FORMAT
-
-When you decide to call a tool, your assistant message MUST be:
-1) exactly one <think>...</think> block, followed by
-2) one or more <tool_call>...</tool_call> blocks,
-and NOTHING else in that message.
-
-If you need to explain anything, put it inside <think>. Do NOT write natural language outside <think> or <tool_call>.
-
-For each function call return a JSON object with function name and arguments within <tool_call></tool_call> XML tags as follows:
-<tool_call>
-{"name": "<function-name>", "arguments": {"arg1": "value1"}}
-</tool_call>
-
-Each <tool_call> must be on its own and contain ONLY the JSON object (no extra text).
-The JSON inside <tool_call> MUST be valid JSON with double quotes.
-
-Do NOT output <tool_response> in an assistant message.
-
-After you receive tool results, you may either call more tools (same required format) or provide the final answer.
-When providing the final answer, do NOT include any <tool_call> blocks.
-
-## TERMINAL TOOL NOTES
-
- Commands execute under POSIX `/bin/sh` (not bash).
- Each tool call runs in a fresh shell: environment changes (like `cd` or venv activation) do not persist across tool calls.
- Avoid bash-only features like `source`, `[[ ... ]]`, or process substitution.
- Prefer explicit venv usage:
-  - `python -m venv .venv && . .venv/bin/activate && python -m pip install -e .` (POSIX `.` activation), or
-  - `.venv/bin/python -m pip install -e .` (no activation required).
-
-## ICL (examples)
-
-User: Show the current directory.
-Assistant:
-<think>I should run pwd.</think>
-<tool_call>
-{"name": "terminal", "arguments": {"command": "pwd"}}
-</tool_call>
-User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
-Assistant: /tmp
-
-User: List files, then count them.
-Assistant:
-<think>I should count files.</think>
-<tool_call>
-{"name": "terminal", "arguments": {"command": "ls -1 | wc -l"}}
-</tool_call>
-User: <tool_response>{"success": true, "output": "3\\n"}</tool_response>
-Assistant: 3
-
-User: Run pwd, then print ok (two tool calls).
-Assistant:
-<think>I should run two commands.</think>
-<tool_call>
-{"name": "terminal", "arguments": {"command": "pwd"}}
-</tool_call>
-<tool_call>
-{"name": "terminal", "arguments": {"command": "echo ok"}}
-</tool_call>
-User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
-User: <tool_response>{"success": true, "output": "ok\\n"}</tool_response>
-Assistant: ok
-"""
-
-
-@dataclass
-class AgentConfig:
-    """Configuration for the AtroposAgent."""
-    
-    # Generation parameters
-    temperature: Optional[float] = 0.7
-    # Default to "let the backend decide" (important for tool-tag completions that may be longer).
-    max_tokens: Optional[int] = None
-    
-    # Agent behavior
-    max_steps: int = 50
-    system_prompt: Optional[str] = None
-    tool_delay_s: float = 0.0
-    
-    # Working directory for tools
-    working_dir: Optional[str] = None
-
-
-@dataclass
-class SequenceData:
-    """Token/logprob data from a single completion."""
-    
-    full_text: str
-    tokens: List[int]
-    masked_tokens: List[int]  # -100 for prompt, actual IDs for completion
-    logprobs: List[float]  # 1.0 for prompt, actual values for completion
-    metadata: Optional[Dict[str, Any]] = None
-    
-    @classmethod
-    def from_sequence_node(cls, node) -> "SequenceData":
-        """Create from a ManagedServer SequenceNode."""
-        return cls(
-            full_text=node.full_text,
-            tokens=node.tokens,
-            masked_tokens=node.masked_tokens,
-            logprobs=node.logprobs,
-            metadata=getattr(node, "metadata", None),
-        )
-
-
-@dataclass
-class AgentStep:
-    """A single step in the agent's trajectory."""
-    
-    step_number: int
-    assistant_message: str
-    tool_calls: List[ToolCall] = field(default_factory=list)
-    tool_results: List[ToolResult] = field(default_factory=list)
-    sequence_data: Optional[SequenceData] = None  # Token data from this step
-    
-    @property
-    def has_tool_calls(self) -> bool:
-        return len(self.tool_calls) > 0
-
-
-@dataclass
-class AgentResult:
-    """Result of running an agent trajectory."""
-    
-    success: bool
-    final_response: str
-    steps: List[AgentStep] = field(default_factory=list)
-    total_tokens: int = 0
-    error: Optional[str] = None
-    metadata: Dict[str, Any] = field(default_factory=dict)
-    
-    # Full trajectory token data for RL training
-    trajectory_data: Optional[SequenceData] = None
-    
-    @property
-    def num_steps(self) -> int:
-        return len(self.steps)
-    
-    @property
-    def total_tool_calls(self) -> int:
-        return sum(len(step.tool_calls) for step in self.steps)
-    
-    def to_messages(self) -> List[Dict[str, str]]:
-        """Convert trajectory to messages format for logging."""
-        messages = []
-        for step in self.steps:
-            messages.append({"role": "assistant", "content": step.assistant_message})
-            if step.tool_results:
-                # Combine all tool responses
-                responses = "\n".join(r.to_xml() for r in step.tool_results)
-                messages.append({"role": "user", "content": responses})
-        return messages
-    
-    def to_scored_data(self, score: float) -> Optional[Dict[str, Any]]:
-        """
-        Convert to format suitable for ScoredDataGroup.
-        
-        Args:
-            score: The score for this trajectory
-            
-        Returns:
-            Dict with tokens, masks, scores suitable for training, or None if no data
-        """
-        if self.trajectory_data is None:
-            return None
-        
-        return {
-            "tokens": self.trajectory_data.tokens,
-            "masks": self.trajectory_data.masked_tokens,
-            "scores": score,
-            "logprobs": self.trajectory_data.logprobs,
-        }
-
-
-class AtroposAgent:
-    """
-    A ReACT-style agent that uses LLMs with tool calling.
-    
-    This implementation wraps ManagedServer for automatic token/logprob tracking,
-    making trajectories ready for RL training.
-    
-    Example:
-        # `server` may be an Atropos `ServerManager` (recommended) or a single `APIServer`.
-        # In practice, environments usually construct this via `BaseEnv`.
-        server = ...
-        tools = ToolRegistry()
-        tools.register(BashTool())
-        
-        agent = AtroposAgent(server=server, tools=tools)
-        result = await agent.run("List the files in the current directory")
-        
-        # Access token data for training
-        if result.trajectory_data:
-            print(f"Tokens: {result.trajectory_data.tokens}")
-            print(f"Masked: {result.trajectory_data.masked_tokens}")
-    """
-    
-    def __init__(
-        self,
-        server,  # ServerManager or APIServer
-        tools: Optional[ToolRegistry] = None,
-        config: Optional[AgentConfig] = None,
-        tokenizer: Optional[Any] = None,
-        execute_tool: Optional[Callable[[ToolCall], Awaitable[ToolResult]]] = None,
-    ):
-        self.server = server
-        self.tools = tools or ToolRegistry()
-        self.config = config or AgentConfig()
-        self.tokenizer = tokenizer or getattr(server, "tokenizer", None)
-        self.execute_tool = execute_tool or self.tools.execute
-
-    @asynccontextmanager
-    async def _managed(self) -> AsyncGenerator[Any, None]:
-        """
-        Yield a ManagedServer-like object.
-
-        - If `self.server` is a ServerManager, use its `managed_server()` context manager.
-        - If `self.server` is a single APIServer, wrap it in `ManagedServer` directly.
-        """
-        if os.getenv("ATROPOS_BYPASS_MANAGED_SERVER") == "1":
-            yield _DirectChatCompletionClient(server=self.server)
-            return
-        if hasattr(self.server, "managed_server"):
-            async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
-                yield managed
-        else:
-            managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
-            try:
-                yield managed
-            finally:
-                managed.reset()
-    
-    def _build_system_prompt(self) -> str:
-        """Build the system prompt with tool descriptions."""
-        if self.config.system_prompt:
-            return self.config.system_prompt
-
-        tools_json = self.tools.get_prompt_tool_definitions_json()
-        # Avoid `str.format()` here because the prompt contains many literal `{}` braces
-        # in JSON examples; we only want to substitute the single `{tools_json}` token.
-        return AGENT_SYSTEM_PROMPT.replace("{tools_json}", tools_json)
-
-    def _infer_server_model_for_debug(self) -> Optional[str]:
-        """
-        Best-effort inference of the configured model name for debug payload saving.
-
-        ManagedServer/server_manager typically injects `model` internally, so `chat_kwargs`
-        may not contain it. For replaying saved payloads via curl, it's useful to persist it.
-        """
-        servers = getattr(self.server, "servers", None)
-        if isinstance(servers, list) and servers:
-            s0 = servers[0]
-            cfg = getattr(s0, "config", None)
-            model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
-            if isinstance(model, str) and model:
-                return model
-        model = getattr(self.server, "model_name", None) or getattr(self.server, "model", None)
-        if isinstance(model, str) and model:
-            return model
-        return None
-
-    def _infer_server_base_url_for_debug(self) -> Optional[str]:
-        """
-        Best-effort inference of the configured base_url for debug logging.
-
-        This is helpful when diagnosing hangs / retries at the transport layer.
-        """
-        servers = getattr(self.server, "servers", None)
-        if isinstance(servers, list) and servers:
-            s0 = servers[0]
-            cfg = getattr(s0, "config", None)
-            base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
-            if isinstance(base_url, str) and base_url:
-                return base_url
-        base_url = getattr(self.server, "base_url", None)
-        if isinstance(base_url, str) and base_url:
-            return base_url
-        return None
-
-    def _extract_response_metadata(self, response: Any) -> Dict[str, Any]:
-        """
-        Extract lightweight, JSON-serializable metadata from an OpenAI-style response.
-
-        This is useful for debugging training runs, especially when ManagedServer state
-        tracking is unavailable (e.g. OpenAI-compatible chat endpoints).
-        """
-        meta: Dict[str, Any] = {}
-        try:
-            rid = getattr(response, "id", None)
-            if isinstance(rid, str) and rid:
-                meta["id"] = rid
-            model = getattr(response, "model", None)
-            if isinstance(model, str) and model:
-                meta["model"] = model
-            created = getattr(response, "created", None)
-            if isinstance(created, int):
-                meta["created"] = created
-            system_fingerprint = getattr(response, "system_fingerprint", None)
-            if isinstance(system_fingerprint, str) and system_fingerprint:
-                meta["system_fingerprint"] = system_fingerprint
-
-            choices = getattr(response, "choices", None)
-            if isinstance(choices, list) and choices:
-                fr = getattr(choices[0], "finish_reason", None)
-                if isinstance(fr, str) and fr:
-                    meta["finish_reason"] = fr
-
-            usage = getattr(response, "usage", None)
-            if usage is not None:
-                if hasattr(usage, "model_dump"):
-                    meta["usage"] = usage.model_dump()
-                elif isinstance(usage, dict):
-                    meta["usage"] = usage
-        except Exception:
-            pass
-        return meta
-
-    def _debug_dump_request(self, *, step_num: int, chat_kwargs: Dict[str, Any]) -> None:
-        if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST") != "1":
-            return
-        try:
-            # Avoid dumping megabytes by default; messages can be huge.
-            meta = {
-                "step": step_num,
-                "base_url": self._infer_server_base_url_for_debug(),
-                "model": chat_kwargs.get("model") or self._infer_server_model_for_debug(),
-                "chat_kwargs_keys": sorted(list(chat_kwargs.keys())),
-                "n": chat_kwargs.get("n"),
-                "max_tokens": chat_kwargs.get("max_tokens"),
-                "temperature": chat_kwargs.get("temperature"),
-                "num_messages": len(chat_kwargs.get("messages") or []),
-            }
-            print("\n=== ATROPOS_DEBUG_AGENT_REQUEST ===", flush=True)
-            print(meta, flush=True)
-
-            if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_FULL") == "1":
-                payload = dict(chat_kwargs)
-                # Pretty-print as JSON for legibility; the printed dump is truncated below.
-                try:
-                    dumped = json.dumps(payload, ensure_ascii=False, indent=2)
-                except Exception:
-                    dumped = repr(payload)
-                print("\n=== ATROPOS_DEBUG_AGENT_REQUEST_FULL ===", flush=True)
-                print(dumped[:200_000], flush=True)
-
-            # Optional: save the FULL request payload to disk (no truncation).
-            save_dir = os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_SAVE_DIR")
-            if save_dir:
-                os.makedirs(save_dir, exist_ok=True)
-                payload: Dict[str, Any] = dict(chat_kwargs)
-                if "model" not in payload:
-                    model = self._infer_server_model_for_debug()
-                    if model:
-                        payload["model"] = model
-                # Use a unique filename so parallel trajectories don't clobber each other.
-                fname = os.path.join(
-                    save_dir,
-                    f"atropos_agent_request_step{step_num}_{int(time.time()*1000)}_{os.getpid()}_{uuid4().hex}.json",
-                )
-                with open(fname, "w", encoding="utf-8") as f:
-                    json.dump(payload, f, ensure_ascii=False, indent=2)
-                print(f"[AtroposAgent] saved request payload: {fname}", flush=True)
-        except Exception:
-            return
-
-    def _debug_dump_response(self, *, step_num: int, response: Any) -> None:
-        if os.getenv("ATROPOS_DEBUG_AGENT_RESPONSE") != "1":
-            return
-        print("\n=== ATROPOS_DEBUG_AGENT_RESPONSE ===", flush=True)
-        print({"step": step_num, "type": type(response).__name__}, flush=True)
-        try:
-            dumped = response.model_dump()  # openai pydantic model
-        except Exception:
-            dumped = getattr(response, "__dict__", {"repr": repr(response)})
-        # Keep the dump bounded; we only need enough to see the assistant message content.
-        text = str(dumped)
-        print(text[:200_000], flush=True)
-
-    async def _chat_completion_with_debug(
-        self, *, managed: Any, step_num: int, chat_kwargs: Dict[str, Any]
-    ) -> Any:
-        """
-        Call `managed.chat_completion()` with a hard timeout and richer failure logging.
-
-        Debug env vars:
-        - `ATROPOS_AGENT_CHAT_TIMEOUT_S`: timeout in seconds for `asyncio.wait_for` (default 240, capped at 240).
-        - `ATROPOS_DEBUG_AGENT_WAIT_EVERY_S`: if set to a positive number, prints a heartbeat at that interval while waiting.
-        """
-        # Hard guardrail: never allow a single chat completion to block for too long.
-        # This is essential for RL data-gen stability; long hangs should be treated as failures (score=0).
-        timeout_s_raw = os.getenv("ATROPOS_AGENT_CHAT_TIMEOUT_S")
-        timeout_s_default = 240.0
-        timeout_s = float(timeout_s_raw) if timeout_s_raw else timeout_s_default
-        timeout_s = min(timeout_s, 240.0)
-
-        wait_every_raw = os.getenv("ATROPOS_DEBUG_AGENT_WAIT_EVERY_S")
-        wait_every_s = float(wait_every_raw) if wait_every_raw else None
-
-        async def _await_call() -> Any:
-            if not wait_every_s or wait_every_s <= 0:
-                return await managed.chat_completion(**chat_kwargs)
-
-            # Heartbeat mode: wait in chunks without cancelling the underlying request.
-            # NOTE: do NOT use `asyncio.wait_for(task, timeout=...)` here, because a timeout
-            # will cancel the task and surface as `CancelledError` on the next loop.
-            task = asyncio.create_task(managed.chat_completion(**chat_kwargs))
-            t0 = time.perf_counter()
-            try:
-                while True:
-                    done, _pending = await asyncio.wait({task}, timeout=wait_every_s)
-                    if task in done:
-                        return task.result()
-
-                    waited = time.perf_counter() - t0
-                    print(
-                        f"[AtroposAgent] step={step_num} still waiting for chat_completion... ({waited:.1f}s)",
-                        flush=True,
-                    )
-            except asyncio.CancelledError:
-                task.cancel()
-                raise
-
-        try:
-            return await asyncio.wait_for(_await_call(), timeout=timeout_s)
-        except asyncio.TimeoutError as e:
-            print("\n=== ATROPOS_DEBUG_AGENT_CHAT_TIMEOUT ===", flush=True)
-            print({"step": step_num, "timeout_s": timeout_s}, flush=True)
-            raise RuntimeError(f"chat_completion timed out after {timeout_s:.1f}s") from e
-        except asyncio.CancelledError:
-            # Treat cancellation as a hard failure rather than crashing the whole env run.
-            # (Atropos/BaseEnv may cancel tasks during shutdown or retries.)
-            raise RuntimeError("chat_completion cancelled") from None
-        except Exception as e:
-            detail: Dict[str, Any] = {
-                "step": step_num,
-                "exc_type": type(e).__name__,
-                "exc_str": str(e),
-            }
-            if isinstance(e, httpx.HTTPStatusError):
-                try:
-                    detail["status_code"] = e.response.status_code
-                    detail["response_text"] = e.response.text[:20_000]
-                except Exception:
-                    pass
-            elif isinstance(e, httpx.RequestError):
-                detail["request"] = repr(getattr(e, "request", None))
-
-            print("\n=== ATROPOS_DEBUG_AGENT_CHAT_FAILURE ===", flush=True)
-            print(detail, flush=True)
-            raise
-
-    async def run(
-        self,
-        task: str,
-        initial_messages: Optional[List[Dict[str, str]]] = None,
-    ) -> AgentResult:
-        """
-        Run the agent on a task using ManagedServer for token tracking.
-        
-        Args:
-            task: The task/prompt for the agent
-            initial_messages: Optional additional context messages
-            
-        Returns:
-            AgentResult with the trajectory, final response, and token data
-        """
-        messages = [
-            {"role": "system", "content": self._build_system_prompt()},
-        ]
-        
-        if initial_messages:
-            messages.extend(initial_messages)
-        
-        messages.append({"role": "user", "content": task})
-        
-        steps = []
-        final_response = ""
-        final_node = None
-        final_prompt_messages: Optional[List[Dict[str, str]]] = None
-        last_node = None
-        last_prompt_messages: Optional[List[Dict[str, str]]] = None
-        last_response_text: str = ""
-        
-        # Use ManagedServer for automatic token tracking
-        async with self._managed() as managed:
-            for step_num in range(self.config.max_steps):
-                # ReAct loop iteration: call the model -> run tools -> observe, until a response contains no tool calls.
-                try:
-                    # Keep a copy of the prompt messages used for this completion.
-                    # Useful for reconstructing tokens/masks when state tracking is unavailable.
-                    prompt_messages = list(messages)
-                    chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
-                    if self.config.max_tokens is not None:
-                        chat_kwargs["max_tokens"] = self.config.max_tokens
-                    if self.config.temperature is not None:
-                        chat_kwargs["temperature"] = self.config.temperature
-
-                    t_req = time.perf_counter()
-                    print(
-                        f"[AtroposAgent] step={step_num+1} chat_completion start "
-                        f"(messages={len(messages)}, max_tokens={self.config.max_tokens}, temp={self.config.temperature})",
-                        flush=True,
-                    )
-                    self._debug_dump_request(step_num=step_num + 1, chat_kwargs=chat_kwargs)
-                    response = await self._chat_completion_with_debug(
-                        managed=managed, step_num=step_num + 1, chat_kwargs=chat_kwargs
-                    )
-                    self._debug_dump_response(step_num=step_num + 1, response=response)
-                    response_meta = self._extract_response_metadata(response)
-                    print(
-                        f"[AtroposAgent] step={step_num+1} chat_completion done in {time.perf_counter() - t_req:.2f}s",
-                        flush=True,
-                    )
-                    
-                    current_node = None
-                    if hasattr(managed, "get_state"):
-                        state = managed.get_state()
-                        nodes = state.get("nodes", [])
-                        current_node = nodes[-1] if nodes else None
-                    
-                except Exception as e:
-                    return AgentResult(
-                        success=False,
-                        final_response="",
-                        steps=steps,
-                        error=f"Generation error: {str(e)}",
-                    )
-                
-                msg = response.choices[0].message
-                # Some OpenAI-compatible servers populate `message.reasoning` and leave `content=""`.
-                response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
-                tool_calls = ToolCall.parse_from_text(response_text)
-                last_node = current_node
-                last_prompt_messages = prompt_messages
-                last_response_text = response_text
-
-                step_sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None
-                if step_sequence_data is None:
-                    if response_meta:
-                        # We still want metadata for debugging even if token/logprob state tracking is unavailable.
-                        step_sequence_data = SequenceData(
-                            full_text=response_text,
-                            tokens=[],
-                            masked_tokens=[],
-                            logprobs=[],
-                            metadata=response_meta,
-                        )
-                else:
-                    merged = dict(response_meta)
-                    node_meta = step_sequence_data.metadata
-                    if isinstance(node_meta, dict):
-                        merged.update(node_meta)
-                    step_sequence_data.metadata = merged or step_sequence_data.metadata
-                
-                step = AgentStep(
-                    step_number=step_num + 1,
-                    assistant_message=response_text,
-                    tool_calls=tool_calls,
-                    sequence_data=step_sequence_data,
-                )
-                
-                if not tool_calls:
-                    steps.append(step)
-                    final_response = response_text
-                    final_node = current_node
-                    final_prompt_messages = prompt_messages
-                    break
-                
-                messages.append({"role": "assistant", "content": response_text})
-                
-                tool_responses = []
-                for call in tool_calls:
-                    result = await self.execute_tool(call)
-                    step.tool_results.append(result)
-                    tool_responses.append(result.to_xml())
-                    if self.config.tool_delay_s > 0:
-                        await asyncio.sleep(self.config.tool_delay_s)
-                
-                steps.append(step)
-            
-                responses_text = "\n".join(tool_responses)
-                # Tool observations are represented as user content with Hermes-style tags.
-                # This is compatible with most OpenAI-compatible chat APIs and ensures
-                # tokenizers/chat templates include tool outputs during training.
-                messages.append({"role": "user", "content": responses_text})
-            
-            else:
-                # Reached max steps without completing
-                # Return a failure result but include the last observed completion so callers can
-                # record the trajectory (score=0) without triggering retries.
-                final_response = last_response_text or final_response
-                final_node = last_node
-                final_prompt_messages = last_prompt_messages
-                trajectory_data = None
-                if final_node:
-                    trajectory_data = SequenceData.from_sequence_node(final_node)
-                elif final_prompt_messages is not None and self.tokenizer is not None:
-                    if hasattr(self.tokenizer, "apply_chat_template"):
-                        prompt_text = self.tokenizer.apply_chat_template(
-                            final_prompt_messages, tokenize=False, add_generation_prompt=True
-                        )
-                        prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
-                    else:
-                        prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
-                        prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
-                    output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
-                    tokens = prompt_tokens + output_tokens
-                    masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
-                    logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
-                    trajectory_data = SequenceData(
-                        full_text=f"{prompt_text}{final_response}",
-                        tokens=tokens,
-                        masked_tokens=masked_tokens,
-                        logprobs=logprobs,
-                    )
-                # Preserve response metadata (if any) even on failure trajectories.
-                try:
-                    if trajectory_data is not None and steps:
-                        last_step = steps[-1]
-                        if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
-                            trajectory_data.metadata = dict(last_step.sequence_data.metadata)
-                except Exception:
-                    pass
-                return AgentResult(
-                    success=False,
-                    final_response=final_response,
-                    steps=steps,
-                    error=f"Reached maximum steps ({self.config.max_steps})",
-                    trajectory_data=trajectory_data,
-                )
-        
-        # Build result with trajectory data
-        trajectory_data = None
-        if final_node:
-            trajectory_data = SequenceData.from_sequence_node(final_node)
-        elif final_prompt_messages is not None and self.tokenizer is not None:
-            if hasattr(self.tokenizer, "apply_chat_template"):
-                prompt_text = self.tokenizer.apply_chat_template(
-                    final_prompt_messages, tokenize=False, add_generation_prompt=True
-                )
-                prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
-            else:
-                prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
-                prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
-            output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
-            tokens = prompt_tokens + output_tokens
-            masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
-            logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
-            trajectory_data = SequenceData(
-                full_text=f"{prompt_text}{final_response}",
-                tokens=tokens,
-                masked_tokens=masked_tokens,
-                logprobs=logprobs,
-            )
-
-        # Ensure trajectory_data carries the most recent metadata we observed (if any).
-        try:
-            if trajectory_data is not None and steps:
-                last_step = steps[-1]
-                if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
-                    trajectory_data.metadata = dict(last_step.sequence_data.metadata)
-        except Exception:
-            pass
-        
-        return AgentResult(
-            success=True,
-            final_response=final_response,
-            steps=steps,
-            trajectory_data=trajectory_data,
-        )
-    
-    async def run_single_turn(
-        self,
-        messages: List[Dict[str, str]],
-        execute_tools: bool = True,
-    ) -> tuple[str, List[ToolResult], Optional[SequenceData]]:
-        """
-        Run a single turn of the agent (one LLM call + tool execution).
-        
-        This is useful for integration with BaseEnv where you want more
-        control over the loop.
-        
-        Args:
-            messages: The conversation history
-            execute_tools: Whether to execute parsed tool calls
-            
-        Returns:
-            Tuple of (response_text, tool_results, sequence_data)
-        """
-        async with self._managed() as managed:
-            chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
-            if self.config.max_tokens is not None:
-                chat_kwargs["max_tokens"] = self.config.max_tokens
-            if self.config.temperature is not None:
-                chat_kwargs["temperature"] = self.config.temperature
-
-            self._debug_dump_request(step_num=1, chat_kwargs=chat_kwargs)
-            response = await self._chat_completion_with_debug(managed=managed, step_num=1, chat_kwargs=chat_kwargs)
-            self._debug_dump_response(step_num=1, response=response)
-            
-            current_node = None
-            if hasattr(managed, "get_state"):
-                state = managed.get_state()
-                nodes = state.get("nodes", [])
-                current_node = nodes[-1] if nodes else None
-        
-        msg = response.choices[0].message
-        response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
-        tool_results = []
-        
-        if execute_tools:
-            tool_calls = ToolCall.parse_from_text(response_text)
-            for call in tool_calls:
-                result = await self.execute_tool(call)
-                tool_results.append(result)
-        
-        sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None
-        
-        return response_text, tool_results, sequence_data
-
-
-class _DirectChatCompletionClient:
-    """
-    Minimal stand-in for ManagedServer that calls the OpenAI-compatible endpoint directly.
-
-    This is for isolating issues where `ManagedServer.chat_completion()` hangs or misbehaves.
-    It intentionally does NOT do token/logprob tracking.
-    """
-
-    def __init__(self, server: Any):
-        self._server = server
-
-    def _server_config(self) -> tuple[str, str, str]:
-        # ServerManager case: first configured server.
-        servers = getattr(self._server, "servers", None)
-        if isinstance(servers, list) and servers:
-            s0 = servers[0]
-            cfg = getattr(s0, "config", None)
-            base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
-            api_key = getattr(cfg, "api_key", None) or getattr(s0, "api_key", None)
-            model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
-            if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
-                return base_url.rstrip("/"), api_key, model
-
-        # APIServer-like fallback.
-        base_url = getattr(self._server, "base_url", None)
-        api_key = getattr(self._server, "api_key", None)
-        model = getattr(self._server, "model_name", None) or getattr(self._server, "model", None)
-        if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
-            return base_url.rstrip("/"), api_key, model
-
-        raise RuntimeError("Unable to resolve server base_url/api_key/model for direct chat completion")
-
-    async def chat_completion(self, *, messages: List[Dict[str, str]], n: int = 1, **kwargs: Any) -> Any:
-        base_url, api_key, model = self._server_config()
-        url = f"{base_url}/chat/completions"
-
-        payload: Dict[str, Any] = {
-            "model": model,
-            "messages": messages,
-            "n": n,
-        }
-        # Pass through common generation kwargs.
-        for k in ("max_tokens", "temperature", "top_p", "presence_penalty", "frequency_penalty", "stop"):
-            if k in kwargs and kwargs[k] is not None:
-                payload[k] = kwargs[k]
-
-        timeout_s = float(os.getenv("ATROPOS_DIRECT_REQUEST_TIMEOUT_S") or "120")
-        print(f"[AtroposAgent] DIRECT chat_completion POST {url} (timeout={timeout_s}s)", flush=True)
-        async with httpx.AsyncClient(timeout=timeout_s) as client:
-            resp = await client.post(
-                url,
-                headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
-                json=payload,
-            )
-            resp.raise_for_status()
-            data = resp.json()
-
-        # Return a very small object compatible with the code paths that read
-        # `response.choices[0].message.content`.
-        class _Msg:
-            def __init__(self, d: Dict[str, Any]):
-                self.content = d.get("content")
-                self.reasoning = d.get("reasoning")
-
-        class _Choice:
-            def __init__(self, d: Dict[str, Any]):
-                self.message = _Msg(d.get("message") or {})
-
-        class _Resp:
-            def __init__(self, d: Dict[str, Any]):
-                self._d = d
-                self.choices = [_Choice(c) for c in (d.get("choices") or [])]
-
-            def model_dump(self) -> Dict[str, Any]:
-                return self._d
-
-        return _Resp(data)
@@ -1,6 +0,0 @@
-"""
-FastAPI services for atropos-agent.
-
- tool_executor_server: queued/batched sandbox tool execution (Phase 4)
-"""
-
@@ -1,254 +0,0 @@
-"""
-Tool Executor API (Phase 4)
-
-This service provides a queued, batched execution layer on top of a ToolBackend.
-It mirrors the stateful FastAPI + app.state pattern used in:
-  atropos/atroposlib/api/server.py
-
-Run (dev):
-  uv run uvicorn atropos_agent.api.tool_executor_server:app --host 0.0.0.0 --port 9001
-"""
-
-from __future__ import annotations
-
-import os
-from typing import Any, Dict, Optional
-from pathlib import Path
-
-from fastapi import FastAPI, Header, HTTPException, status
-from pydantic import BaseModel, Field
-
-from ..backends.nomad_backend import NomadBackendConfig, NomadToolBackend
-from ..tools import ToolRegistry, build_tool_registry
-from ..tools.base import (
-    ArtifactArchiveRequestPayload,
-    ArtifactArchiveResponsePayload,
-    ArtifactListRequestPayload,
-    ArtifactListResponsePayload,
-    ArtifactReadRequestPayload,
-    ArtifactReadResponsePayload,
-    ToolExecutorExecuteRequest,
-    ToolExecutorReleaseRequest,
-    ToolResultPayload,
-)
-from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig
-
-
-class ToolExecutorServerConfig(BaseModel):
-    nomad_address: str = Field(default="http://localhost:4646")
-    job_id: str = Field(default="atropos-sandbox-tool-executor")
-    image: str = Field(default="atropos-sandbox:local")
-    slots_per_container: int = Field(default=10)
-    min_containers: int = Field(default=1)
-    max_containers: int = Field(default=10)
-    privileged: bool = Field(default=False)
-    acquire_timeout_s: float = Field(default=30.0)
-
-    batch_window_ms: int = Field(default=20)
-    max_batch_size: int = Field(default=200)
-    allow_network: bool = Field(default=True)
-
-    tool_server_url: Optional[str] = Field(default=None)
-    tool_server_token: Optional[str] = Field(default=None)
-
-    token: Optional[str] = Field(default=None, description="Bearer token required for requests (optional in dev).")
-
-    purge_job_on_shutdown: bool = Field(default=True)
-
-    @classmethod
-    def from_env(cls) -> "ToolExecutorServerConfig":
-        # In dev, prefer loading secrets/config from the repo-local `.env` (not committed).
-        try:
-            from dotenv import load_dotenv  # type: ignore
-        except Exception:  # pragma: no cover
-            load_dotenv = None  # type: ignore[assignment]
-        if load_dotenv is not None:
-            env_path = Path(__file__).resolve().parents[2] / ".env"
-            if env_path.exists():
-                load_dotenv(dotenv_path=env_path)
-
-        def _get_bool(name: str, default: bool) -> bool:
-            raw = os.getenv(name)
-            if raw is None:
-                return default
-            return raw.strip().lower() in {"1", "true", "yes", "y", "on"}
-
-        return cls(
-            nomad_address=os.getenv("TOOL_EXECUTOR_NOMAD_ADDRESS", "http://localhost:4646"),
-            job_id=os.getenv("TOOL_EXECUTOR_JOB_ID", "atropos-sandbox-tool-executor"),
-            image=os.getenv("TOOL_EXECUTOR_IMAGE", "atropos-sandbox:local"),
-            slots_per_container=int(os.getenv("TOOL_EXECUTOR_SLOTS", "10")),
-            min_containers=int(os.getenv("TOOL_EXECUTOR_MIN_CONTAINERS", "1")),
-            max_containers=int(os.getenv("TOOL_EXECUTOR_MAX_CONTAINERS", "10")),
-            privileged=_get_bool("TOOL_EXECUTOR_PRIVILEGED", False),
-            acquire_timeout_s=float(os.getenv("TOOL_EXECUTOR_ACQUIRE_TIMEOUT_S", "30.0")),
-            batch_window_ms=int(os.getenv("TOOL_EXECUTOR_BATCH_WINDOW_MS", "20")),
-            max_batch_size=int(os.getenv("TOOL_EXECUTOR_MAX_BATCH_SIZE", "200")),
-            allow_network=_get_bool("TOOL_EXECUTOR_ALLOW_NETWORK", True),
-            tool_server_url=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_URL") or None,
-            tool_server_token=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_TOKEN") or None,
-            token=os.getenv("TOOL_EXECUTOR_TOKEN") or None,
-            purge_job_on_shutdown=_get_bool("TOOL_EXECUTOR_PURGE_JOB_ON_SHUTDOWN", True),
-        )
-
-
-app = FastAPI(title="Atropos-Agent Tool Executor")
-
-
-@app.get("/")
-async def root() -> Dict[str, str]:
-    return {"message": "Atropos-Agent Tool Executor"}
-
-
-def _check_auth(cfg: ToolExecutorServerConfig, authorization: Optional[str]) -> None:
-    if not cfg.token:
-        return
-    if not authorization:
-        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
-    if not authorization.lower().startswith("bearer "):
-        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
-    token = authorization.split(" ", 1)[1].strip()
-    if token != cfg.token:
-        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")
-
-
-@app.on_event("startup")
-async def _startup() -> None:
-    cfg = ToolExecutorServerConfig.from_env()
-
-    # Default to Atropos "full" tool surface: sandbox + external (if tool_server_url provided).
-    tools: ToolRegistry = build_tool_registry(
-        enabled_toolsets=["full"],
-        disabled_toolsets=None,
-        tool_server_url=cfg.tool_server_url,
-    )
-
-    backend = NomadToolBackend(
-        NomadBackendConfig(
-            nomad_address=cfg.nomad_address,
-            sandbox_job_id=cfg.job_id,
-            sandbox_image=cfg.image,
-            slots_per_container=cfg.slots_per_container,
-            min_containers=cfg.min_containers,
-            max_containers=cfg.max_containers,
-            privileged=cfg.privileged,
-            acquire_timeout_s=cfg.acquire_timeout_s,
-            purge_job_on_start=False,
-        )
-    )
-    await backend.start()
-
-    executor = ToolExecutor(
-        backend=backend,
-        tools=tools,
-        config=ToolExecutorConfig(
-            batch_window_ms=cfg.batch_window_ms,
-            max_batch_size=cfg.max_batch_size,
-            allow_network=cfg.allow_network,
-            tool_server_url=cfg.tool_server_url,
-            tool_server_token=cfg.tool_server_token,
-        ),
-    )
-    await executor.start()
-
-    app.state.cfg = cfg
-    app.state.backend = backend
-    app.state.executor = executor
-
-
-@app.on_event("shutdown")
-async def _shutdown() -> None:
-    executor: Optional[ToolExecutor] = getattr(app.state, "executor", None)
-    backend: Optional[NomadToolBackend] = getattr(app.state, "backend", None)
-    cfg: Optional[ToolExecutorServerConfig] = getattr(app.state, "cfg", None)
-
-    if executor is not None:
-        await executor.close()
-
-    if backend is not None:
-        await backend.stop(purge=bool(cfg.purge_job_on_shutdown) if cfg else False)
-
-
-@app.get("/health")
-async def health() -> Dict[str, Any]:
-    return {"status": "ok"}
-
-
-@app.get("/status")
-async def status_endpoint() -> Dict[str, Any]:
-    executor: ToolExecutor = app.state.executor
-    backend: NomadToolBackend = app.state.backend
-
-    return {
-        "queue_size": executor.queue_size(),
-        "total_requests": executor.total_requests,
-        "total_errors": executor.total_errors,
-        "pool": backend.get_stats(),
-    }
-
-
-@app.post("/execute", response_model=ToolResultPayload)
-async def execute_tool(
-    req: ToolExecutorExecuteRequest,
-    authorization: Optional[str] = Header(default=None),
-    status_code: int = status.HTTP_200_OK,  # noqa: B008
-) -> ToolResultPayload:
-    cfg: ToolExecutorServerConfig = app.state.cfg
-    _check_auth(cfg, authorization)
-
-    executor: ToolExecutor = app.state.executor
-    result = await executor.execute(
-        trajectory_id=req.trajectory_id,
-        call=req.tool.to_tool_call(),
-        timeout_s=req.timeout_s,
-    )
-    return ToolResultPayload.from_tool_result(result)
-
-
-@app.post("/release")
-async def release_trajectory(
-    req: ToolExecutorReleaseRequest,
-    authorization: Optional[str] = Header(default=None),
-) -> Dict[str, Any]:
-    cfg: ToolExecutorServerConfig = app.state.cfg
-    _check_auth(cfg, authorization)
-
-    executor: ToolExecutor = app.state.executor
-    await executor.release_trajectory(req.trajectory_id, reset_workspace=req.reset_workspace)
-    return {"status": "ok"}
-
-
-@app.post("/artifacts/read", response_model=ArtifactReadResponsePayload)
-async def artifacts_read(
-    req: ArtifactReadRequestPayload,
-    authorization: Optional[str] = Header(default=None),
-) -> ArtifactReadResponsePayload:
-    cfg: ToolExecutorServerConfig = app.state.cfg
-    _check_auth(cfg, authorization)
-
-    executor: ToolExecutor = app.state.executor
-    return await executor.read_artifact(req)
-
-
-@app.post("/artifacts/list", response_model=ArtifactListResponsePayload)
-async def artifacts_list(
-    req: ArtifactListRequestPayload,
-    authorization: Optional[str] = Header(default=None),
-) -> ArtifactListResponsePayload:
-    cfg: ToolExecutorServerConfig = app.state.cfg
-    _check_auth(cfg, authorization)
-
-    executor: ToolExecutor = app.state.executor
-    return await executor.list_artifacts(req)
-
-
-@app.post("/artifacts/archive", response_model=ArtifactArchiveResponsePayload)
-async def artifacts_archive(
-    req: ArtifactArchiveRequestPayload,
-    authorization: Optional[str] = Header(default=None),
-) -> ArtifactArchiveResponsePayload:
-    cfg: ToolExecutorServerConfig = app.state.cfg
-    _check_auth(cfg, authorization)
-
-    executor: ToolExecutor = app.state.executor
-    return await executor.archive_artifacts(req)
@@ -1,140 +0,0 @@
-"""
-External ToolServer (Phase 4.5+).
-
-This server executes tools that must NOT run inside the sandbox, typically
-because they require credentials or access to external services.
-
-Run (dev):
-  uv run uvicorn atropos_agent.api.tool_server:app --host 0.0.0.0 --port 9002
-"""
-
-from __future__ import annotations
-
-import asyncio
-import os
-import inspect
-from typing import Any, Dict, List, Optional
-from pathlib import Path
-
-from fastapi import FastAPI, Header, HTTPException, status
-from pydantic import BaseModel, Field
-
-from ..tools import ToolRegistry, build_tool_registry
-from ..tools.base import ToolResultPayload, ToolServerExecuteRequest
-
-
-class ToolServerConfig(BaseModel):
-    token: Optional[str] = Field(
-        default=None,
-        description="Bearer token required for requests (optional in dev).",
-    )
-    max_concurrency: int = Field(default=16, ge=1, description="Max concurrent tool executions.")
-
-    @classmethod
-    def from_env(cls) -> "ToolServerConfig":
-        # In dev, prefer loading secrets from the repo-local `.env` (not committed).
-        try:
-            from dotenv import load_dotenv  # type: ignore
-        except Exception:  # pragma: no cover
-            load_dotenv = None  # type: ignore[assignment]
-        if load_dotenv is not None:
-            env_path = Path(__file__).resolve().parents[2] / ".env"
-            if env_path.exists():
-                load_dotenv(dotenv_path=env_path)
-
-        token = os.getenv("TOOL_SERVER_TOKEN") or None
-        max_concurrency = int(os.getenv("TOOL_SERVER_MAX_CONCURRENCY", "16"))
-        return cls(token=token, max_concurrency=max_concurrency)
-
-
-app = FastAPI(title="Atropos-Agent Tool Server")
-
-
-@app.get("/")
-async def root() -> Dict[str, str]:
-    return {"message": "Atropos-Agent Tool Server"}
-
-
-@app.on_event("startup")
-async def _startup() -> None:
-    cfg = ToolServerConfig.from_env()
-
-    # External-only registry. It will only include tools that are enabled by toolsets and
-    # whose Hermes requirements/keys are satisfied in this process.
-    tools: ToolRegistry = build_tool_registry(
-        enabled_toolsets=["all"],
-        disabled_toolsets=["terminal", "sandbox", "filesystem", "terminal_stateful", "default"],
-        tool_server_url="enabled",
-    )
-
-    app.state.cfg = cfg
-    app.state.tools = tools
-    app.state.semaphore = asyncio.Semaphore(cfg.max_concurrency)
-
-
-@app.get("/health")
-async def health() -> Dict[str, Any]:
-    return {"status": "ok"}
-
-
-@app.get("/tools")
-async def list_tools() -> Dict[str, Any]:
-    tools: ToolRegistry = app.state.tools
-    return {"tools": [s.to_dict() for s in tools.get_schemas()]}
-
-
-def _check_auth(cfg: ToolServerConfig, authorization: Optional[str]) -> None:
-    if not cfg.token:
-        return
-    if not authorization:
-        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
-    if not authorization.lower().startswith("bearer "):
-        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
-    token = authorization.split(" ", 1)[1].strip()
-    if token != cfg.token:
-        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")
-
-
-@app.post("/execute", response_model=ToolResultPayload)
-async def execute_tool(
-    req: ToolServerExecuteRequest,
-    authorization: Optional[str] = Header(default=None),
-) -> ToolResultPayload:
-    cfg: ToolServerConfig = app.state.cfg
-    _check_auth(cfg, authorization)
-
-    tools: ToolRegistry = app.state.tools
-    sem: asyncio.Semaphore = app.state.semaphore
-
-    tool = tools.get(req.tool.name)
-    if tool is None:
-        return ToolResultPayload(
-            success=False,
-            error=f"Unknown tool: {req.tool.name}",
-            uniq_id=req.tool.uniq_id,
-        )
-
-    async with sem:
-        try:
-            kwargs = dict(req.tool.arguments)
-            sig = inspect.signature(tool.execute).parameters
-            # Some tools can benefit from extra context.
-            if req.trajectory_id and "trajectory_id" in sig:
-                kwargs["trajectory_id"] = req.trajectory_id
-            if req.slot_id and "slot_id" in sig:
-                kwargs["slot_id"] = req.slot_id
-            if req.container_addr and "container_addr" in sig:
-                kwargs["container_addr"] = req.container_addr
-            if "task_id" in sig:
-                kwargs["task_id"] = req.trajectory_id
-            result = await tool.execute(**kwargs)
-        except Exception as e:
-            return ToolResultPayload(
-                success=False,
-                error=f"Tool execution error: {e}",
-                uniq_id=req.tool.uniq_id,
-            )
-
-    if result.uniq_id is None:
-        result.uniq_id = req.tool.uniq_id
-    return ToolResultPayload.from_tool_result(result)
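Review note on the removed `_check_auth` helper: the sketch below restates its decision table in plain Python, without FastAPI, so the 401/403 split is easy to check in isolation. The name `check_bearer` and the integer return value are hypothetical and used only for illustration. The constant-time comparison via `hmac.compare_digest` is a swapped-in hardening assumption; the removed code compared tokens with `!=`.

```python
import hmac
from typing import Optional


def check_bearer(expected_token: Optional[str], authorization: Optional[str]) -> int:
    """Return the HTTP status the check would produce (200 means allowed).

    Hypothetical helper mirroring the removed `_check_auth` logic.
    """
    if not expected_token:
        # No token configured: auth is disabled (dev mode).
        return 200
    if not authorization:
        # Header missing entirely.
        return 401
    if not authorization.lower().startswith("bearer "):
        # Header present but not a Bearer scheme.
        return 401
    token = authorization.split(" ", 1)[1].strip()
    # Assumption: constant-time comparison instead of the original `!=`.
    if not hmac.compare_digest(token.encode(), expected_token.encode()):
        return 403
    return 200
```

A missing or malformed header maps to 401, while a well-formed header carrying the wrong token maps to 403, matching the removed endpoint behaviour.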
@@ -1,27 +0,0 @@
-from __future__ import annotations
-
-from typing import Any
-
-from .base import ToolBackend
-from .modal_backend import ModalSandboxConfig, ModalToolBackend
-from .nomad_backend import NomadBackendConfig, NomadToolBackend
-
-
-def create_tool_backend(cfg: Any) -> ToolBackend:
-    mode = str(getattr(cfg, "tool_pool_mode", "nomad")).strip().lower()
-    if mode == "nomad":
-        return NomadToolBackend(NomadBackendConfig.from_agent_env_config(cfg))
-    if mode == "modal":
-        return ModalToolBackend(ModalSandboxConfig.from_agent_env_config(cfg))
-    raise ValueError(f"Unknown tool_pool_mode: {mode}")
-
-
-__all__ = [
-    "ToolBackend",
-    "create_tool_backend",
-    "NomadBackendConfig",
-    "NomadToolBackend",
-    "ModalSandboxConfig",
-    "ModalToolBackend",
-]
-
@@ -1,89 +0,0 @@
-"""
-Backend interfaces for AgentEnv tool execution.
-
-The goal of this module is to decouple ToolExecutor / AgentEnv from any single
-execution backend (Nomad/Docker today; Modal later).
-"""
-
-from __future__ import annotations
-
-from typing import Any, Dict, List, Optional, Protocol, Tuple
-
-from ..slots.executor import ExecutionResult
-from ..slots.slot import Slot
-
-
-class ToolBackend(Protocol):
-    """
-    Minimal interface required by ToolExecutor.
-
-    Backends provide:
-    - lifecycle (start/stop)
-    - slot acquisition/release (workspace affinity)
-    - batched tool execution across slots
-    - optional artifact helpers (for env verification / demos)
-    """
-
-    @property
-    def default_timeout_s(self) -> Optional[float]:
-        """Default sandbox execution timeout in seconds (if any)."""
-
-    async def start(self) -> None:
-        """Start the backend (provision workers/containers, health checks, etc)."""
-
-    async def stop(self, *, purge: bool = False) -> None:
-        """Stop the backend and optionally purge remote resources."""
-
-    async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
-        """Acquire a slot for a trajectory (workspace affinity)."""
-
-    async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
-        """Release a slot back to the pool."""
-
-    async def execute_batch(
-        self,
-        requests: List[Tuple[Slot, str, Dict[str, Any]]],
-        *,
-        timeout_s: Optional[float] = None,
-    ) -> List[ExecutionResult]:
-        """Execute a batch of sandbox tool calls and return results in order."""
-
-    # ---------------------------------------------------------------------
-    # Optional artifact helpers (supported by the Nomad sandbox-server today)
-    # ---------------------------------------------------------------------
-
-    async def read_artifact(
-        self,
-        slot: Slot,
-        path: str,
-        *,
-        encoding: str = "text",
-        max_bytes: Optional[int] = None,
-        include_sha256: bool = False,
-        timeout_s: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        raise NotImplementedError
-
-    async def list_artifacts(
-        self,
-        slot: Slot,
-        path: str = ".",
-        *,
-        recursive: bool = False,
-        max_entries: Optional[int] = None,
-        timeout_s: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        raise NotImplementedError
-
-    async def archive_artifacts(
-        self,
-        slot: Slot,
-        path: str = ".",
-        *,
-        archive_format: str = "tar.gz",
-        max_bytes: Optional[int] = None,
-        max_entries: Optional[int] = None,
-        timeout_s: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        raise NotImplementedError
-
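Review note on the removed `ToolBackend` protocol: a minimal sketch of a class that satisfies the lifecycle, slot, and batch methods structurally may help when reading the Nomad implementation that follows. `FakeSlot`, `FakeResult`, and `InMemoryBackend` are hypothetical stand-ins invented for this note; the real `Slot` and `ExecutionResult` types are not shown in this diff and may carry different fields.

```python
import asyncio
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, Dict, List, Optional, Tuple


@dataclass
class FakeSlot:
    """Hypothetical stand-in for `Slot`."""
    slot_id: str
    trajectory_id: Optional[str] = None


@dataclass
class FakeResult:
    """Hypothetical stand-in for `ExecutionResult`."""
    success: bool
    output: str


class InMemoryBackend:
    """Toy backend exposing the same method names as the removed protocol."""

    default_timeout_s: Optional[float] = None

    def __init__(self, num_slots: int = 2) -> None:
        self._num_slots = num_slots
        self._free: Deque[FakeSlot] = deque()

    async def start(self) -> None:
        # "Provision" slots; a real backend would start containers here.
        self._free = deque(FakeSlot(f"slot-{i}") for i in range(self._num_slots))

    async def stop(self, *, purge: bool = False) -> None:
        self._free.clear()

    async def acquire(self, trajectory_id: Optional[str] = None) -> FakeSlot:
        if not self._free:
            raise RuntimeError("no free slots")
        slot = self._free.popleft()
        slot.trajectory_id = trajectory_id
        return slot

    async def release(self, slot: FakeSlot, *, reset_workspace: bool = False) -> None:
        slot.trajectory_id = None
        self._free.append(slot)

    async def execute_batch(
        self,
        requests: List[Tuple[FakeSlot, str, Dict[str, Any]]],
        *,
        timeout_s: Optional[float] = None,
    ) -> List[FakeResult]:
        # Results are returned in the same order as the requests.
        return [FakeResult(True, f"{slot.slot_id}:{tool}") for slot, tool, _args in requests]


async def demo() -> List[str]:
    backend = InMemoryBackend(num_slots=2)
    await backend.start()
    a = await backend.acquire("traj-a")
    b = await backend.acquire("traj-b")
    results = await backend.execute_batch(
        [(a, "bash", {"command": "ls"}), (b, "bash", {"command": "pwd"})]
    )
    await backend.release(a)
    await backend.release(b)
    await backend.stop()
    return [r.output for r in results]
```

Running `asyncio.run(demo())` exercises the acquire, batch, release cycle that `ToolExecutor` is described as relying on.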
@@ -1,156 +0,0 @@
-"""
-Nomad/Docker tool backend.
-
-This backend is the current default for AgentEnv: it provisions a Nomad job
-running `sandbox_server.py` and multiplexes stateless slots inside each container.
-"""
-
-from __future__ import annotations
-
-from dataclasses import dataclass
-from typing import Any, Dict, List, Optional, Tuple
-
-from ..slots import Slot, SlotPool, SlotPoolConfig
-from ..slots.executor import ExecutionResult
-from .base import ToolBackend
-
-
-@dataclass(frozen=True)
-class NomadBackendConfig:
-    nomad_address: str
-    sandbox_job_id: str
-    sandbox_image: str
-    slots_per_container: int
-    min_containers: int
-    max_containers: int
-    privileged: bool
-    acquire_timeout_s: float
-    purge_job_on_start: bool
-    # Driver selection: "docker" or "singularity"
-    driver: str = "docker"
-    # Path to .sif file for singularity driver (required if driver="singularity")
-    singularity_image: Optional[str] = None
-
-    @classmethod
-    def from_agent_env_config(cls, cfg: Any) -> "NomadBackendConfig":
-        return cls(
-            nomad_address=str(getattr(cfg, "nomad_address")),
-            sandbox_job_id=str(getattr(cfg, "sandbox_job_id")),
-            sandbox_image=str(getattr(cfg, "sandbox_image")),
-            slots_per_container=int(getattr(cfg, "slots_per_container")),
-            min_containers=int(getattr(cfg, "min_containers")),
-            max_containers=int(getattr(cfg, "max_containers")),
-            privileged=bool(getattr(cfg, "privileged")),
-            acquire_timeout_s=float(getattr(cfg, "acquire_timeout_s")),
-            purge_job_on_start=bool(getattr(cfg, "purge_job_on_start", False)),
-            driver=str(getattr(cfg, "driver", "docker")),
-            singularity_image=getattr(cfg, "singularity_image", None),
-        )
-
-
-class NomadToolBackend(ToolBackend):
-    def __init__(self, config: NomadBackendConfig):
-        self.config = config
-        self.pool = SlotPool(
-            SlotPoolConfig(
-                nomad_address=config.nomad_address,
-                job_id=config.sandbox_job_id,
-                image=config.sandbox_image,
-                slots_per_container=config.slots_per_container,
-                min_containers=config.min_containers,
-                max_containers=config.max_containers,
-                privileged=config.privileged,
-                acquire_timeout=config.acquire_timeout_s,
-                purge_job_on_start=bool(config.purge_job_on_start),
-                driver=config.driver,
-                singularity_image=config.singularity_image,
-            )
-        )
-
-    @property
-    def default_timeout_s(self) -> Optional[float]:
-        t = getattr(self.pool.executor, "timeout", None)
-        total = getattr(t, "total", None)
-        try:
-            return float(total) if total is not None else None
-        except Exception:
-            return None
-
-    async def start(self) -> None:
-        await self.pool.start()
-
-    async def stop(self, *, purge: bool = False) -> None:
-        await self.pool.stop(purge_job=purge)
-
-    async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
-        return await self.pool.acquire(trajectory_id)
-
-    async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
-        await self.pool.release(slot, reset_workspace=reset_workspace)
-
-    async def execute_batch(
-        self,
-        requests: List[Tuple[Slot, str, Dict[str, Any]]],
-        *,
-        timeout_s: Optional[float] = None,
-    ) -> List[ExecutionResult]:
-        return await self.pool.execute_batch(requests, timeout=timeout_s)
-
-    async def read_artifact(
-        self,
-        slot: Slot,
-        path: str,
-        *,
-        encoding: str = "text",
-        max_bytes: Optional[int] = None,
-        include_sha256: bool = False,
-        timeout_s: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        return await self.pool.executor.read_artifact(
-            slot,
-            path,
-            encoding=encoding,
-            max_bytes=max_bytes,
-            include_sha256=include_sha256,
-            timeout=timeout_s,
-        )
-
-    async def list_artifacts(
-        self,
-        slot: Slot,
-        path: str = ".",
-        *,
-        recursive: bool = False,
-        max_entries: Optional[int] = None,
-        timeout_s: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        return await self.pool.executor.list_artifacts(
-            slot,
-            path,
-            recursive=recursive,
-            max_entries=max_entries,
-            timeout=timeout_s,
-        )
-
-    async def archive_artifacts(
-        self,
-        slot: Slot,
-        path: str = ".",
-        *,
-        archive_format: str = "tar.gz",
-        max_bytes: Optional[int] = None,
-        max_entries: Optional[int] = None,
-        timeout_s: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        return await self.pool.executor.archive_artifacts(
-            slot,
-            path,
-            archive_format=archive_format,
-            max_bytes=max_bytes,
-            max_entries=max_entries,
-            timeout=timeout_s,
-        )
-
-    def get_stats(self) -> Dict[str, Any]:
-        return self.pool.get_stats()
-
@@ -1,10 +0,0 @@
-"""
-Environment implementations for atropos-agent.
-"""
-
-from .agent_env import AgentEnv, AgentEnvConfig
-
-# NOTE: Additional example envs exist as modules (e.g. `test_env`, `swe_smith_oracle_env`),
-# but are intentionally not imported here to avoid pulling heavy optional deps at import time.
-
-__all__ = ["AgentEnv", "AgentEnvConfig"]
@@ -1,526 +0,0 @@
-"""
-AgentEnv - Atropos BaseEnv extension for agent/tool-call workloads.
-
-AgentEnv is responsible for starting the sandbox tool execution backend and
-providing helpers for running agent trajectories with queued/batched tool calls.
-"""
-
-from __future__ import annotations
-import os
-import asyncio
-import time
-import uuid
-from abc import ABC, abstractmethod
-from typing import Any, Awaitable, Callable, Dict, Generic, List, Optional, Tuple, TypeVar
-
-from pydantic import Field
-
-from atroposlib.envs.base import APIServerConfig, BaseEnv, BaseEnvConfig, Item, ScoredDataGroup, ScoredDataItem
-from atroposlib.envs.server_handling.server_baseline import AsyncSemWithAdaptiveWeight
-
-from ..agent import AgentConfig, AgentResult, AtroposAgent
-from ..backends import ToolBackend, create_tool_backend
-from ..tools import ToolRegistry, build_tool_registry
-from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig
-
-# Main BaseEnv subclasses. Subclass these to get agent and tooling functionality.
-
-class AgentEnvConfig(BaseEnvConfig):
-    tool_pool_mode: str = Field(default="nomad", description="Tool execution backend ('nomad' or 'modal')")
-
-    allow_network: bool = Field(
-        default=True,
-        description="Whether sandbox bash commands may access the network (env policy).",
-    )
-    require_sandbox: bool = Field(
-        default=False,
-        description="Fail closed if bubblewrap sandboxing is unavailable/unusable for stateless sandbox tools.",
-    )
-    require_stateful_sandbox: bool = Field(
-        default=False,
-        description="Fail closed if bubblewrap/PID isolation is unavailable for stateful terminal tools (tmux).",
-    )
-    tool_batch_window_ms: int = Field(default=20, description="ToolExecutor batching window (ms)")
-    tool_max_batch_size: int = Field(default=200, description="ToolExecutor maximum batch size")
-
-    # nomad mode settings. TODO: Add Modal support, split this into own config
-    nomad_address: str = Field(default="http://localhost:4646", description="Nomad API address")
-    sandbox_job_id: str = Field(default="atropos-sandbox-agent-env", description="Nomad job id for sandbox containers")
-    sandbox_image: str = Field(default="atropos-sandbox:local", description="Docker image for sandbox containers")
-    slots_per_container: int = Field(default=10, description="Nomad mode: slots per container")
-    min_containers: int = Field(default=1, description="Nomad mode: minimum containers")
-    max_containers: int = Field(default=10, description="Nomad mode: maximum containers")
-    privileged: bool = Field(default=False, description="Nomad mode: run container privileged")
-    acquire_timeout_s: float = Field(default=30.0, description="Slot acquisition timeout (seconds)")
-    purge_job_on_start: bool = Field(
-        default=False,
-        description=(
-            "Nomad mode: stop/purge the sandbox job on startup. This is helpful in local dev and training runs "
-            "to recover from previous crashes that leave the job in a restart backoff state."
-        ),
-    )
-    purge_job_on_shutdown: bool = Field(default=True, description="Nomad mode: stop/purge job on shutdown")
-
-    # Nomad driver selection (docker or singularity)
-    driver: str = Field(
-        default="docker",
-        description="Nomad task driver: 'docker' (default) or 'singularity' (for HPC without sudo Docker)",
-    )
-    singularity_image: Optional[str] = Field(
-        default=None,
-        description="Path to .sif file for Singularity driver (required if driver='singularity')",
-    )
-
-    # modal mode settings (stub; implementation pending)
-    modal_app_name: str = Field(default="atropos-sandbox", description="Modal app name (stub)")
-    modal_function_name: str = Field(default="sandbox_server", description="Modal function/actor name (stub)")
-    modal_volume_name: Optional[str] = Field(default=None, description="Modal Volume name for persistent storage (stub)")
-    modal_volume_mount_path: str = Field(default="/data", description="Modal Volume mount path (stub)")
-
-    # basic agent defaults
-    agent_max_steps: int = Field(default=50, description="Max ReAct steps per trajectory")
-    agent_temperature: float = Field(default=0.7, description="Sampling temperature")
-    agent_max_tokens: Optional[int] = Field(
-        default=None,
-        description="Max tokens per model response (default: let backend decide)",
-    )
-    agent_tool_delay_s: float = Field(default=0.0, description="Delay between tool calls (seconds)")
-
-    # tool selection
-    enabled_toolsets: List[str] = Field(
-        default_factory=lambda: ["default"],
-        description="Toolsets to enable (Hermes-style grouping).",
-    )
-    disabled_toolsets: List[str] = Field(
-        default_factory=list,
-        description="Toolsets to disable (applied after enabled_toolsets).",
-    )
-
-    # external ToolServer routing (Phase 4.5+)
-    tool_server_url: Optional[str] = Field(
-        default=None,
-        description="Base URL for external ToolServer (enables external tools).",
-    )
-    tool_server_token: Optional[str] = Field(
-        default=None,
-        description="Bearer token for ToolServer auth (optional in dev).",
-    )
-
-AgentEnvConfigT = TypeVar("AgentEnvConfigT", bound="AgentEnvConfig")
-
-
-class AgentEnv(BaseEnv, ABC, Generic[AgentEnvConfigT]):
-    env_config_cls = AgentEnvConfig
-
-    def __init__(
-        self,
-        config: AgentEnvConfigT,
-        server_configs: List[APIServerConfig],
-        slurm: bool = False,
-        testing: bool = False,
-    ):
-        super().__init__(config, server_configs, slurm, testing)
-        self.config: AgentEnvConfigT = config
-
-        self.tools: ToolRegistry = self.build_tools()
-
-        self._backend: Optional[ToolBackend] = None
-        self._tool_executor: Optional[ToolExecutor] = None
-        self._tool_server_inprocess: bool = False
-        self._trajectory_workspace_meta: Dict[str, Dict[str, Any]] = {}
-
-    def build_tools(self) -> ToolRegistry:
-        """Wrap the original Hermes-Agent ToolRegistry for Atropos AgentEnv use.
-        See the Hermes-Agent docs for the available toolsets and tools.
-        """
-        return build_tool_registry(
-            enabled_toolsets=self.config.enabled_toolsets or ["default"],
-            disabled_toolsets=self.config.disabled_toolsets or None,
-            tool_server_url=self.config.tool_server_url,
-        )
-
-    @abstractmethod
-    def build_task(self, item: Item) -> str:
-        """Return the user-facing task string for the agent."""
-
-    @abstractmethod
-    async def score_trajectory(self, item: Item, final_response: str) -> float:
-        """Return a scalar score for this trajectory."""
-
-    async def setup_trajectory_workspace(
-        self,
-        item: Item,
-        *,
-        trajectory_id: str,
-        exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
-    ) -> Dict[str, Any]:
-        """
-        Optional hook: prepare the sandbox workspace before the agent starts.
-
-        Examples:
-        - clone a repo and checkout a commit
-        - write fixture files (e.g. images) for external-tool demos
-        - pre-install dependencies
-
-        Default: no-op.
-        """
-        _ = (item, trajectory_id, exec_tool)
-        return {}
-
-    async def verify_and_score_trajectory(
-        self,
-        item: Item,
-        final_response: str,
-        *,
-        trajectory_id: str,
-        exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
-        agent_result: Optional[AgentResult] = None,
-        workspace_meta: Optional[Dict[str, Any]] = None,
-    ) -> tuple[float, Dict[str, Any]]:
-        """
-        Optional hook: run in-sandbox verification before scoring.
-
-        Many agent envs need to execute verification inside the same trajectory
-        workspace (e.g. pytest) before releasing/resetting the slot.
-
-        Default: calls `score_trajectory()` and returns empty metadata.
-        """
-        _ = (trajectory_id, exec_tool, agent_result, workspace_meta)  # default ignores in-workspace verification
-        score = await self.score_trajectory(item, final_response)
-        return score, {}
-
-    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
-        return AgentConfig(
-            max_steps=self.config.agent_max_steps,
-            temperature=self.config.agent_temperature,
-            max_tokens=self.config.agent_max_tokens,
-            tool_delay_s=self.config.agent_tool_delay_s,
-        )
-
-    async def setup(self) -> None:
-        print(f"[AgentEnv] setup(): starting tool backend ({self.config.tool_pool_mode})", flush=True)
-        await self._start_tool_backend()
-        print("[AgentEnv] setup(): configuring server concurrency", flush=True)
-        self._configure_server_concurrency()
-        print("[AgentEnv] setup(): running env-specific setup_agent_env()", flush=True)
-        await self.setup_agent_env()
-        print("[AgentEnv] setup(): done", flush=True)
-
-    def _configure_server_concurrency(self) -> None:
-        """
-        Ensure the LLM server concurrency isn't accidentally capped below `group_size`.
-
-        In `BaseEnv process` mode, groups are collected concurrently. If the underlying
-        ServerManager/OpenAIServer semaphore is left at 1, inference is serialized even
-        when `--env.group_size` is > 1.
-        """
-        desired = int(getattr(self.config, "group_size", 1) or 1)
-        if desired <= 1:
-            return
-
-        servers = getattr(self.server, "servers", None)
-        if not isinstance(servers, list) or not servers:
-            return
-
-        for s in servers:
-            sem = getattr(s, "sem", None)
-            eval_sem = getattr(s, "eval_sem", None)
-            # Only increase; never shrink.
-            if sem is not None and getattr(sem, "max_val", 0) < desired:
-                s.sem = AsyncSemWithAdaptiveWeight(desired)
-                if hasattr(s, "config") and hasattr(s.config, "num_max_requests_at_once"):
-                    s.config.num_max_requests_at_once = desired
-            if eval_sem is not None and getattr(eval_sem, "max_val", 0) < desired:
-                s.eval_sem = AsyncSemWithAdaptiveWeight(desired)
-                if hasattr(s, "config") and hasattr(s.config, "num_requests_for_eval"):
-                    s.config.num_requests_for_eval = desired
-
-    @abstractmethod
-    async def setup_agent_env(self) -> None:
-        """Subclass hook for env-specific setup."""
-
-    async def evaluate(self, *args, **kwargs):  # noqa: ARG002
-        """
-        Default eval hook (no-op).
-
-        Atropos BaseEnv requires an `evaluate()` implementation. Many agent envs
-        won't have a meaningful evaluation path during early PoC work; they can
-        override this when needed.
-        """
-        return {}
-
-    async def env_manager(self):
-        try:
-            return await super().env_manager()
-        finally:
-            await self.shutdown_tool_backend()
-
-    async def process_manager(self):
-        try:
-            return await super().process_manager()
-        finally:
-            await self.shutdown_tool_backend()
-
-    async def _start_tool_backend(self) -> None:
-        if self._tool_executor is not None:
-            return
-
-        tool_server_url = self.config.tool_server_url
-        tool_server_client = None
-        if tool_server_url == "inprocess":
-            import httpx
-            from ..api.tool_server import app as tool_server_app
-
-            await tool_server_app.router.startup()
-            tool_server_client = httpx.AsyncClient(
-                transport=httpx.ASGITransport(app=tool_server_app),
-                base_url="http://toolserver",
-            )
-            tool_server_url = "http://toolserver"
-            self._tool_server_inprocess = True
-
-        backend = create_tool_backend(self.config)
-        await backend.start()
-
-        executor = ToolExecutor(
-            backend=backend,
-            tools=self.tools,
-            config=ToolExecutorConfig(
-                batch_window_ms=self.config.tool_batch_window_ms,
-                max_batch_size=self.config.tool_max_batch_size,
-                allow_network=self.config.allow_network,
-                require_sandbox=self.config.require_sandbox,
-                require_stateful_sandbox=self.config.require_stateful_sandbox,
-                tool_server_url=tool_server_url,
-                tool_server_token=self.config.tool_server_token,
-            ),
-        )
-        await executor.start()
-        if tool_server_client is not None:
-            executor._tool_server_client = tool_server_client  # type: ignore[attr-defined]
-
-        self._backend = backend
-        self._tool_executor = executor
-
-    async def shutdown_tool_backend(self) -> None:
-        executor = self._tool_executor
-        backend = self._backend
-        inprocess_tool_server = self._tool_server_inprocess
-        self._tool_executor = None
-        self._backend = None
-        self._tool_server_inprocess = False
-
-        if executor is not None:
-            await executor.close()
-        if backend is not None:
-            await backend.stop(purge=bool(self.config.purge_job_on_shutdown))
-        if inprocess_tool_server:
-            from ..api.tool_server import app as tool_server_app
-
-            await tool_server_app.router.shutdown()
-
-    async def collect_trajectory(
-        self, item: Item
-    ) -> Tuple[Optional[ScoredDataItem], List[Item]]:
-        if self._tool_executor is None:
-            raise RuntimeError("Tool backend not started")
-
-        trajectory_id = str(uuid.uuid4())
-        t0 = time.perf_counter()
-        print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} start", flush=True)
-        task = self.build_task(item)
-        agent_config = self.build_agent_config(item)
-        if os.getenv("ATROPOS_DEBUG_PRINT_TASK") == "1":
-            print(f"Starting trajectory {trajectory_id} with task: {task}", flush=True)
-        else:
-            # Avoid printing the full task prompt by default (can be huge/noisy).
-            one_line = " ".join(str(task).splitlines()).strip()
-            preview = one_line[:240] + ("…" if len(one_line) > 240 else "")
-            print(f"Starting trajectory {trajectory_id} (task preview): {preview}", flush=True)
-
-        async def _exec(call):
-            return await self._tool_executor.execute(trajectory_id, call)
-
-        agent = AtroposAgent(
-            server=self.server,
-            tokenizer=self.tokenizer,
-            tools=self.tools,
-            config=agent_config,
-            execute_tool=_exec,
-        )
-
-        try:
-            print(f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() start", flush=True)
-            workspace_meta = await self.setup_trajectory_workspace(item, trajectory_id=trajectory_id, exec_tool=_exec)
-            if not isinstance(workspace_meta, dict):
-                workspace_meta = {}
-            self._trajectory_workspace_meta[trajectory_id] = workspace_meta
-            print(
-                f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() done in {time.perf_counter() - t0:.2f}s",
-                flush=True,
-            )
-
-            print(f"[AgentEnv] tid={trajectory_id} agent.run() start", flush=True)
-            result = await agent.run(task)
-            print(
-                f"[AgentEnv] tid={trajectory_id} agent.run() done in {time.perf_counter() - t0:.2f}s "
-                f"success={result.success} tool_calls={result.total_tool_calls}",
-                flush=True,
-            )
-            if not result.success or result.trajectory_data is None:
-                # Do not trigger BaseEnv retries for agent failures.
-                # Record the trajectory with score 0.0 so training/eval can see the failure mode.
-                messages = [{"role": "system", "content": agent._build_system_prompt()}]  # noqa: SLF001
-                messages.append({"role": "user", "content": task})
-                for step in result.steps:
-                    messages.append({"role": "assistant", "content": step.assistant_message})
-                    if step.tool_results:
-                        tool_text = "\n".join(r.to_xml() for r in step.tool_results)
-                        messages.append({"role": "user", "content": tool_text})
-
-                scored: ScoredDataItem = {
-                    "tokens": (result.trajectory_data.tokens if result.trajectory_data else []),
-                    "masks": (result.trajectory_data.masked_tokens if result.trajectory_data else []),
-                    "scores": 0.0,
-                }
-                if result.trajectory_data is not None:
-                    scored["inference_logprobs"] = result.trajectory_data.logprobs  # type: ignore[typeddict-unknown-key]
-                    if getattr(result.trajectory_data, "metadata", None):
-                        scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
-                if self.config.include_messages:
-                    # Record a final failure marker as a user-side tool_response-like block so it survives templates.
-                    import json
-
-                    err = result.error or "agent_failed"
-                    messages.append(
-                        {
-                            "role": "user",
-                            "content": f"<tool_response>{json.dumps({'success': False, 'error': err})}</tool_response>",
-                        }
-                    )
-                    scored["messages"] = messages
-                return scored, []
-
-            print(f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() start", flush=True)
-            score, score_metadata = await self.verify_and_score_trajectory(
-                item,
-                result.final_response,
-                trajectory_id=trajectory_id,
-                exec_tool=_exec,
-                agent_result=result,
-                workspace_meta=workspace_meta,
-            )
-            print(
-                f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() done in {time.perf_counter() - t0:.2f}s "
-                f"score={score}",
-                flush=True,
-            )
-
-            messages = [{"role": "system", "content": agent._build_system_prompt()}]  # noqa: SLF001
-            messages.append({"role": "user", "content": task})
-            for step in result.steps:
-                messages.append({"role": "assistant", "content": step.assistant_message})
-                if step.tool_results:
-                    tool_text = "\n".join(r.to_xml() for r in step.tool_results)
-                    messages.append({"role": "user", "content": tool_text})
-
-            # Optional: allow env verification to attach additional messages (e.g. install logs).
-            if self.config.include_messages and isinstance(score_metadata, dict):
-                extra = score_metadata.get("verification_messages")
-                if isinstance(extra, list):
-                    for m in extra:
-                        if isinstance(m, dict) and isinstance(m.get("role"), str) and isinstance(m.get("content"), str):
-                            messages.append({"role": m["role"], "content": m["content"]})
-
-            scored: ScoredDataItem = {
-                "tokens": result.trajectory_data.tokens,
-                "masks": result.trajectory_data.masked_tokens,
-                "scores": score,
-            }
-            # Atroposlib expects policy logprobs at the *group* level under `inference_logprobs`.
-            # We stash per-item values here and lift them into the group in `collect_trajectories()`.
-            scored["inference_logprobs"] = result.trajectory_data.logprobs  # type: ignore[typeddict-unknown-key]
-            if getattr(result.trajectory_data, "metadata", None):
-                scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
-            if self.config.include_messages:
-                scored["messages"] = messages
-
-            return scored, []
-        finally:
-            self._trajectory_workspace_meta.pop(trajectory_id, None)
-            print(f"[AgentEnv] tid={trajectory_id} release_trajectory(reset_workspace=True)", flush=True)
-            await self._tool_executor.release_trajectory(trajectory_id, reset_workspace=True)
-            print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} done in {time.perf_counter() - t0:.2f}s", flush=True)
-
-    async def collect_trajectories(
-        self, item: Item
-    ) -> Tuple[Optional[ScoredDataGroup], List[Item]]:
-        tasks = [self.collect_trajectory(item) for _ in range(self.config.group_size)]
-        results = await asyncio.gather(*tasks)
-
-        backlog: List[Item] = []
-        items: List[ScoredDataItem] = []
-        for scored, b in results:
-            backlog.extend(b)
-            if scored is not None:
-                items.append(scored)
-
-        if len(items) != self.config.group_size:
-            return None, backlog
-
-        group: ScoredDataGroup = ScoredDataGroup(
-            tokens=[],
-            masks=[],
-            scores=[],
-            advantages=[],
-            ref_logprobs=[],
-            messages=[] if self.config.include_messages else None,
-            inference_logprobs=[],
-            group_overrides={},
-            overrides=[],
-            images=[],
-            generation_params=None,
-        )
-
-        for it in items:
-            group["tokens"].append(it["tokens"])
-            group["masks"].append(it["masks"])
-            group["scores"].append(it["scores"])
-            # policy logprobs (for PPO/GRPO training) if present
-            lp = it.get("inference_logprobs")  # type: ignore[typeddict-item]
-            if lp is not None:
-                group["inference_logprobs"].append(lp)
-            group["overrides"].append(it.get("overrides") or {})  # type: ignore[typeddict-item]
-            if group.get("messages") is not None and it.get("messages") is not None:
-                group["messages"].append(it["messages"])
-
-        return group, backlog
-
-    async def run_agent(self, task: str, *, trajectory_id: Optional[str] = None) -> Tuple[str, Dict[str, Any]]:
-        """
-        Run the AtroposAgent on a single task and return (final_response, debug).
-
-        This is a helper intended for simple environments and tests.
-        """
-        if self._tool_executor is None:
-            raise RuntimeError("Tool backend not started")
-
-        tid = trajectory_id or str(uuid.uuid4())
-
-        async def _exec(call):
-            return await self._tool_executor.execute(tid, call)
-
-        agent = AtroposAgent(
-            server=self.server,
-            tokenizer=self.tokenizer,
-            tools=self.tools,
-            config=AgentConfig(
-                max_steps=self.config.agent_max_steps,
-                temperature=self.config.agent_temperature,
-                max_tokens=self.config.agent_max_tokens,
-            ),
-            execute_tool=_exec,
-        )
-        result = await agent.run(task)
-        await self._tool_executor.release_trajectory(tid, reset_workspace=True)
-        return result.final_response, {"success": result.success, "error": result.error, "tool_calls": result.total_tool_calls}
@@ -1,171 +0,0 @@
-"""
-Hermes-Agent + Atropos (Nomad sandbox) compatibility smoke environment.
-
-This environment is intended to validate, end-to-end:
-  BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server
-
-It forces the model to use a sandbox tool by asking it to run a command that
-generates a high-entropy token inside the sandbox, then repeat it exactly.
-
-Run (process mode):
-  uv run python -m atropos.envs.hermes_compat_test_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
-"""
-
-from __future__ import annotations
-
-import os
-from typing import Any, Dict, List, Tuple
-
-from dotenv import load_dotenv
-from pydantic import Field
-
-from atroposlib.envs.base import APIServerConfig, Item
-
-from ..agent import AgentConfig, AgentResult
-from ..tools import ToolCall
-from .agent_env import AgentEnv, AgentEnvConfig
-
-load_dotenv()
-
-
-def _forced_tool_item() -> Item:
-    # Use double quotes in the shell command and show JSON escaping explicitly.
-    # This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
-    cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
-    return {
-        "command": cmd,
-        "prompt": (
-            "You are acting as an agent inside a sandboxed environment.\n"
-            "You MUST use the terminal tool to execute commands.\n"
-            "Run this exact command:\n"
-            f"{cmd}\n"
-            "When you call the tool, use valid JSON inside <tool_call>. Example:\n"
-            '<tool_call>{"name": "terminal", "arguments": {"command": '
-            '"python -c \\"import secrets; print(secrets.token_hex(16))\\""}}'
-            "</tool_call>\n"
-            "Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
-            "Do not guess. Do not explain."
-        ),
-    }
-
-
-class HermesCompatTestEnvConfig(AgentEnvConfig):
-    server_base_url: str = Field(
-        default="http://127.0.0.1:8080",
-        description="Base URL for an OpenAI-compatible chat server (without /v1).",
-    )
-    server_model: str = Field(default="hermes-4-36b", description="Model name")
-    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
-
-
-class HermesCompatTestEnv(AgentEnv[HermesCompatTestEnvConfig]):
-    name = "hermes_compat_test_env"
-    env_config_cls = HermesCompatTestEnvConfig
-
-    def __init__(
-        self,
-        config: HermesCompatTestEnvConfig,
-        server_configs: List[APIServerConfig],
-        slurm: bool = False,
-        testing: bool = False,
-    ):
-        super().__init__(config, server_configs, slurm, testing)
-        self._iter = 0
-
-    @classmethod
-    def config_init(cls) -> Tuple[HermesCompatTestEnvConfig, List[APIServerConfig]]:
-        base_url = (
-            os.getenv("ATROPOS_SERVER_BASE_URL")
-            or os.getenv("OPENAI_BASE_URL")
-            or os.getenv("LLM_BASE_URL")
-            or "http://127.0.0.1:8080"
-        )
-        model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
-        api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
-
-        env_config = HermesCompatTestEnvConfig(
-            tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
-            group_size=1,
-            use_wandb=False,
-            include_messages=True,
-            ensure_scores_are_not_same=False,
-            total_steps=2,
-            batch_size=1,
-            server_base_url=base_url,
-            server_model=model,
-            # Tooling: sandbox-only terminal.
-            enabled_toolsets=["terminal"],
-            disabled_toolsets=[],
-            # Default to Nomad sandboxing; users can override via --env.* args.
-            sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
-            # In local dev it's common for a previous crash to leave the job in backoff.
-            purge_job_on_start=True,
-            purge_job_on_shutdown=True,
-        )
-
-        server_configs = [
-            APIServerConfig(
-                model_name=model,
-                base_url=f"{base_url.rstrip('/')}/v1",
-                api_key=api_key,
-                num_max_requests_at_once=1,
-                num_requests_for_eval=1,
-                timeout=120,
-            )
-        ]
-        return env_config, server_configs
-
-    async def setup_agent_env(self) -> None:
-        return None
-
-    async def get_next_item(self) -> Item:
-        self._iter += 1
-        return _forced_tool_item()
-
-    def build_task(self, item: Item) -> str:
-        return str(item.get("prompt") or "")
-
-    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
-        # Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
-        return AgentConfig(
-            max_steps=min(8, int(self.config.agent_max_steps)),
-            temperature=0.2,
-            max_tokens=None,
-        )
-
-    async def score_trajectory(self, item: Item, final_response: str) -> float:
-        # Scoring happens in verify_and_score_trajectory so we can inspect tool results.
-        _ = (item, final_response)
-        return 0.0
-
-    async def verify_and_score_trajectory(
-        self,
-        item: Item,
-        final_response: str,
-        *,
-        trajectory_id: str,  # noqa: ARG002
-        exec_tool,  # noqa: ARG002
-        agent_result: AgentResult | None = None,
-        workspace_meta: Dict[str, Any] | None = None,  # noqa: ARG002
-    ) -> tuple[float, Dict[str, Any]]:
-        if agent_result is None:
-            return 0.0, {"error": "Missing agent_result"}
-
-        observed: str = ""
-        tool_ok = False
-        for step in agent_result.steps:
-            for res in step.tool_results:
-                if not res.success:
-                    return 0.0, {"error": res.error, "output": res.output}
-                out = (res.output or "").strip()
-                if out:
-                    observed = out.splitlines()[-1].strip()
-                    tool_ok = True
-
-        final = (final_response or "").strip()
-        score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
-        return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}
-
-
-if __name__ == "__main__":
-    HermesCompatTestEnv.cli()
@@ -1,172 +0,0 @@
-"""
-Nomad sandbox terminal smoke environment (training-oriented).
-
-Validates, end-to-end:
-  BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server
-
-It forces the model to use a sandbox tool by asking it to run a command that
-generates a high-entropy token inside the sandbox, then repeat it exactly.
-
-Run (process mode):
-  uv run python -m atropos.envs.sandbox_terminal_smoke_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
-"""
-
-from __future__ import annotations
-
-import os
-from typing import Any, Dict, List, Tuple
-
-from dotenv import load_dotenv
-from pydantic import Field
-
-from atroposlib.envs.base import APIServerConfig, Item
-
-from ..agent import AgentConfig, AgentResult
-from ..tools import ToolCall
-from .agent_env import AgentEnv, AgentEnvConfig
-
-load_dotenv()
-
-STRICT_TOOLCALL_SYSTEM_PROMPT = None
-
-
-def _forced_tool_item() -> Item:
-    # Use double quotes in the shell command and show JSON escaping explicitly.
-    # This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
-    cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
-    return {
-        "command": cmd,
-        "prompt": (
-            "You MUST use the terminal tool.\n"
-            "Run this exact command:\n"
-            f"{cmd}\n"
-            "When you call the tool, use valid JSON inside <tool_call>. Example:\n"
-            '<tool_call>{"name": "terminal", "arguments": {"command": '
-            '"python -c \\"import secrets; print(secrets.token_hex(16))\\""}}'
-            "</tool_call>\n"
-            "Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
-            "Do not guess. Do not explain."
-        ),
-    }
-
-
-class SandboxTerminalSmokeEnvConfig(AgentEnvConfig):
-    server_base_url: str = Field(
-        default="http://127.0.0.1:8080",
-        description="Base URL for an OpenAI-compatible chat server (without /v1).",
-    )
-    server_model: str = Field(default="hermes-4-36b", description="Model name")
-    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
-
-
-class SandboxTerminalSmokeEnv(AgentEnv[SandboxTerminalSmokeEnvConfig]):
-    name = "sandbox_terminal_smoke_env"
-    env_config_cls = SandboxTerminalSmokeEnvConfig
-
-    def __init__(
-        self,
-        config: SandboxTerminalSmokeEnvConfig,
-        server_configs: List[APIServerConfig],
-        slurm: bool = False,
-        testing: bool = False,
-    ):
-        super().__init__(config, server_configs, slurm, testing)
-        self._iter = 0
-
-    @classmethod
-    def config_init(cls) -> Tuple[SandboxTerminalSmokeEnvConfig, List[APIServerConfig]]:
-        base_url = (
-            os.getenv("ATROPOS_SERVER_BASE_URL")
-            or os.getenv("OPENAI_BASE_URL")
-            or os.getenv("LLM_BASE_URL")
-            or "http://127.0.0.1:8080"
-        )
-        model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
-        api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
-
-        env_config = SandboxTerminalSmokeEnvConfig(
-            tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
-            group_size=1,
-            use_wandb=False,
-            include_messages=True,
-            ensure_scores_are_not_same=False,
-            total_steps=2,
-            batch_size=1,
-            server_base_url=base_url,
-            server_model=model,
-            # Tooling: sandbox-only terminal.
-            enabled_toolsets=["terminal"],
-            disabled_toolsets=[],
-            # Default to Nomad sandboxing; users can override via --env.* args.
-            sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
-            purge_job_on_start=True,
-            purge_job_on_shutdown=True,
-        )
-
-        server_configs = [
-            APIServerConfig(
-                model_name=model,
-                base_url=f"{base_url.rstrip('/')}/v1",
-                api_key=api_key,
-                num_max_requests_at_once=1,
-                num_requests_for_eval=1,
-                timeout=120,
-            )
-        ]
-        return env_config, server_configs
-
-    async def setup_agent_env(self) -> None:
-        return None
-
-    async def get_next_item(self) -> Item:
-        self._iter += 1
-        return _forced_tool_item()
-
-    def build_task(self, item: Item) -> str:
-        return str(item.get("prompt") or "")
-
-    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
-        # Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
-        return AgentConfig(
-            max_steps=min(8, int(self.config.agent_max_steps)),
-            temperature=0.2,
-            max_tokens=None,
-            system_prompt=STRICT_TOOLCALL_SYSTEM_PROMPT,
-        )
-
-    async def score_trajectory(self, item: Item, final_response: str) -> float:
-        # Scoring happens in verify_and_score_trajectory so we can inspect tool results.
-        _ = (item, final_response)
-        return 0.0
-
-    async def verify_and_score_trajectory(
-        self,
-        item: Item,
-        final_response: str,
-        *,
-        trajectory_id: str,  # noqa: ARG002
-        exec_tool,  # noqa: ARG002
-        agent_result: AgentResult | None = None,
-        workspace_meta: Dict[str, Any] | None = None,  # noqa: ARG002
-    ) -> tuple[float, Dict[str, Any]]:
-        if agent_result is None:
-            return 0.0, {"error": "Missing agent_result"}
-
-        observed: str = ""
-        tool_ok = False
-        for step in agent_result.steps:
-            for res in step.tool_results:
-                if not res.success:
-                    return 0.0, {"error": res.error, "output": res.output}
-                out = (res.output or "").strip()
-                if out:
-                    observed = out.splitlines()[-1].strip()
-                    tool_ok = True
-
-        final = (final_response or "").strip()
-        score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
-        return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}
-
-
-if __name__ == "__main__":
-    SandboxTerminalSmokeEnv.cli()
@@ -1,418 +0,0 @@
-"""
-SWE-smith-oracle environment.
-
-This environment is intentionally minimal:
- prepares a sandbox workspace by cloning a public GitHub repo at `base_commit`
- runs an AtroposAgent tool loop to apply a fix
- verifies by running pytest nodeids from the dataset (reward = pass/fail)
- Python only (no multi-language support currently; need to properly build & add to dropbox)
- TODO: Get the other non-Python sandboxes up and running, then add a config knob to switch between them per row
- TODO: Also add to Docker Hub
-
-Dataset: NousResearch/SWE-smith-oracle (train; does NOT use SWE-bench eval set).
-"""
-
-from __future__ import annotations
-
-import os
-import random
-import time
-from typing import Any, Dict, List, Optional, Tuple
-
-from pydantic import Field
-
-from atroposlib.envs.base import APIServerConfig, Item
-
-from ..agent import AgentConfig
-from ..tools import ToolCall
-from .agent_env import AgentEnv, AgentEnvConfig
-
-
-class SweSmithOracleEnvConfig(AgentEnvConfig):
-    dataset_name: str = Field(default="NousResearch/SWE-smith-oracle")
-    dataset_split: str = Field(default="train")
-    max_items: int = Field(default=0, description="0 = no limit")
-    shuffle: bool = Field(default=True)
-    seed: int = Field(default=0)
-
-    python_only: bool = Field(default=True, description="Filter to Python-evaluable rows")
-    score_include_fail_to_pass: bool = Field(
-        default=True,
-        description=(
-            "If true (default), score tests on PASS_TO_PASS ∪ FAIL_TO_PASS. "
-            "Disable to only run PASS_TO_PASS (faster but weaker signal)."
-        ),
-    )
-
-    prompt_mode: str = Field(
-        default="problem_statement",
-        description="Task prompt content: 'problem_statement' (fast) or 'problem_statement+text' (slower, includes dataset 'text').",
-    )
-
-    repo_base_url: str = Field(default="https://github.com", description="Base URL for repo cloning")
-    install_timeout_s: float = Field(default=600.0)
-    test_timeout_s: float = Field(default=600.0)
-
-    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
-
-
-class SweSmithOracleEnv(AgentEnv[SweSmithOracleEnvConfig]):
-    """
-    SWE-smith-oracle AgentEnv.
-
-    This is designed for benchmarking multiplexed slot execution vs naive container-per-trajectory.
-    """
-
-    name = "swe_smith_oracle_env"
-    env_config_cls = SweSmithOracleEnvConfig
-
-    def __init__(
-        self,
-        config: SweSmithOracleEnvConfig,
-        server_configs: List[APIServerConfig],
-        slurm: bool = False,
-        testing: bool = False,
-    ):
-        super().__init__(config, server_configs, slurm, testing)
-        self._dataset = None
-        self._indices: List[int] = []
-        self._cursor = 0
-
-    @classmethod
-    def config_init(cls) -> Tuple[SweSmithOracleEnvConfig, List[APIServerConfig]]:
-        # Defaults for running the env via CLI in offline `process` mode.
-        # Override via env vars or `--env.*` flags as needed.
-        base_url_raw = (
-            os.getenv("ATROPOS_SERVER_BASE_URL")
-            or os.getenv("OPENAI_BASE_URL")
-            or os.getenv("LLM_BASE_URL")
-            or "http://127.0.0.1:8080"
-        )
-        base_url = base_url_raw.rstrip("/")
-        if not base_url.endswith("/v1"):
-            base_url = f"{base_url}/v1"
-        model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
-        api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
-
-        env_config = SweSmithOracleEnvConfig(
-            tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
-            group_size=1,
-            use_wandb=False,
-            rollout_server_url="http://localhost:8000",
-            total_steps=1,
-            batch_size=1,
-            steps_per_eval=1,
-            max_token_length=8192,
-            inference_weight=1.0,
-            wandb_name="swe_smith_oracle",
-            enabled_toolsets=["terminal"],
-            disabled_toolsets=[],
-            sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
-            purge_job_on_start=True,
-            purge_job_on_shutdown=True,
-        )
-
-        server_configs = [
-            APIServerConfig(
-                model_name=model,
-                base_url=base_url,
-                api_key=api_key,
-                num_max_requests_at_once=1,
-                num_requests_for_eval=1,
-                timeout=int(os.getenv("ATROPOS_SERVER_TIMEOUT_S") or "300"),
-            ),
-        ]
-
-        return env_config, server_configs
-
-    async def setup_agent_env(self) -> None:
-        from datasets import load_dataset
-
-        t0 = time.perf_counter()
-        print(
-            f"[SweSmithOracleEnv] loading dataset {self.config.dataset_name}:{self.config.dataset_split} "
-            f"(python_only={self.config.python_only}, max_items={self.config.max_items or 'all'})",
-            flush=True,
-        )
-        ds = load_dataset(self.config.dataset_name, split=self.config.dataset_split)
-        self._dataset = ds
-
-        indices: List[int] = []
-        for idx in range(len(ds)):
-            row = ds[idx]
-            if self.config.python_only and not self._is_python_row(row):
-                continue
-            indices.append(idx)
-
-        if self.config.shuffle:
-            rnd = random.Random(self.config.seed)
-            rnd.shuffle(indices)
-
-        if self.config.max_items and self.config.max_items > 0:
-            indices = indices[: self.config.max_items]
-
-        self._indices = indices
-        self._cursor = 0
-
-        print(
-            f"[SweSmithOracleEnv] loaded {len(self._indices)} items from {self.config.dataset_name}:{self.config.dataset_split} "
-            f"in {time.perf_counter() - t0:.2f}s",
-            flush=True,
-        )
-
-    def _is_python_row(self, row: Dict[str, Any]) -> bool:
-        nodeids = row.get("PASS_TO_PASS")
-        if not isinstance(nodeids, list) or not nodeids:
-            return False
-        for nid in nodeids:
-            if not isinstance(nid, str) or ".py::" not in nid:
-                return False
-        return True
-
-    async def get_next_item(self) -> Item:
-        print(f"[SweSmithOracleEnv] get_next_item() cursor={self._cursor}/{len(self._indices)}", flush=True)
-        if not self._dataset or not self._indices:
-            raise RuntimeError("Dataset not initialized (did setup() run?)")
-        if self._cursor >= len(self._indices):
-            self._cursor = 0
-        idx = self._indices[self._cursor]
-        self._cursor += 1
-        return dict(self._dataset[idx])
-
-    def _repo_name(self, item: Item) -> str:
-        repo = item.get("repo") or ""
-        if isinstance(repo, str) and "/" in repo:
-            return repo.split("/")[-1]
-        return "repo"
-
-    def build_task(self, item: Item) -> str:
-        repo = item.get("repo") or ""
-        base_commit = item.get("base_commit") or ""
-        problem = str(item.get("problem_statement") or "")
-        context = str(item.get("text") or "")
-
-        nodeids = self._tests_for_item(item)
-        tests_list = "\n".join(f"- {t}" for t in nodeids)
-
-        repo_dir = self._repo_name(item)
-
-        tests_block = (
-            "Run these tests to verify:\n"
-            f"{tests_list}\n\n"
-            "When done, briefly describe what you changed and confirm tests pass."
-        )
-
-        prompt_mode = (self.config.prompt_mode or "problem_statement").strip().lower()
-        if prompt_mode not in {"problem_statement", "problem_statement+text"}:
-            raise ValueError(
-                f"Invalid prompt_mode={self.config.prompt_mode!r}. "
-                "Expected 'problem_statement' or 'problem_statement+text'."
-            )
-
-        context_block = ""
-        if prompt_mode == "problem_statement+text" and context:
-            # Note: We intentionally do NOT truncate/cap here. This mode is for debugging / richer prompts and can be slow.
-            context_block = f"\nAdditional context:\n{context}\n"
-
-        return (
-            "You are a senior software engineer. Fix the repository so the specified tests pass.\n\n"
-            f"Repository: {repo} (checked out at base_commit={base_commit})\n"
-            f"Workspace path: ./{repo_dir}\n\n"
-            "Constraints:\n"
-            "- You MUST use the terminal tool to inspect, edit, and verify the repository. Do not respond with a patch file.\n"
-            f"- Start by inspecting the repo (e.g. `ls`, `cd ./{repo_dir}`, `git status`).\n"
-            "- Use a workspace-local virtualenv (e.g. inside the repo at ./.venv) to avoid cross-run contamination.\n"
-            "- Use non-interactive commands only.\n"
-            "- Terminal commands run under POSIX /bin/sh and each tool call runs in a fresh shell (no persisted env vars).\n"
-            "  Avoid bash-only `source`; prefer `. .venv/bin/activate` or `.venv/bin/python ...`.\n\n"
-            "Problem statement:\n"
-            f"{problem}\n\n"
-            f"{context_block}\n"
-            f"{tests_block}"
-        )
-
-    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
-        # SWE tasks are longer than the simple test env.
-        return AgentConfig(
-            max_steps=self.config.agent_max_steps,
-            temperature=self.config.agent_temperature,
-            max_tokens=self.config.agent_max_tokens,
-            tool_delay_s=self.config.agent_tool_delay_s,
-        )
-
-    async def setup_trajectory_workspace(self, item: Item, *, trajectory_id: str, exec_tool) -> Dict[str, Any]:
-        t0 = time.perf_counter()
-        repo = item.get("repo")
-        base_commit = item.get("base_commit")
-        instance_id = item.get("instance_id") or item.get("id") or item.get("problem_id")
-        if not isinstance(repo, str) or not isinstance(base_commit, str):
-            raise RuntimeError("Invalid dataset row: missing repo/base_commit")
-
-        repo_dir = self._repo_name(item)
-        clone_url = f"{self.config.repo_base_url.rstrip('/')}/{repo}.git"
-        print(
-            f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): "
-            f"repo={repo} base_commit={base_commit} instance_id={instance_id} dir=./{repo_dir}",
-            flush=True,
-        )
-
-        # Repo setup strategy:
-        # - Maintain a shared, per-container bare repo cache under /data/repo_cache
-        # - For each trajectory, create an isolated git worktree under the slot workspace
-        # This avoids cloning/fetching full repos per trajectory and is crucial for multiplexing.
-
-        def _repo_cache_slug(repo_name: str) -> str:
-            return repo_name.replace("/", "__")
-
-        repo_slug = _repo_cache_slug(repo)
-        cache_root = "/data/repo_cache"
-        bare_repo = f"{cache_root}/{repo_slug}.git"
-        lock_file = f"{cache_root}/.locks/{repo_slug}.lock"
-
-        # Use flock to serialize operations that mutate the shared bare repo (fetch/worktree).
-        # util-linux (flock) is included in the sandbox image.
-        worktree_cmd = (
-            "set -e; "
-            f"rm -rf {repo_dir}; "
-            f"mkdir -p {cache_root}/.locks; "
-            f": > {lock_file}; "
-            f"flock -x {lock_file} sh -lc '"
-            f"set -e; "
-            "export GIT_TERMINAL_PROMPT=0; "
-            "export GIT_LFS_SKIP_SMUDGE=1; "
-            f"if [ ! -d \"{bare_repo}\" ]; then "
-            f"  git init --bare \"{bare_repo}\"; "
-            f"  git -C \"{bare_repo}\" remote add origin \"{clone_url}\"; "
-            "fi; "
-            f"git -C \"{bare_repo}\" remote set-url origin \"{clone_url}\"; "
-            f"git -C \"{bare_repo}\" worktree prune || true; "
-            f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
-            f"  git -C \"{bare_repo}\" fetch --depth 1 origin \"{base_commit}\" || true; "
-            "fi; "
-            f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
-            f"  git -C \"{bare_repo}\" fetch --prune origin; "
-            "fi; "
-            f"git --git-dir=\"{bare_repo}\" worktree add --detach \"{repo_dir}\" \"{base_commit}\"; "
-            "'"
-        )
-
-        print(f"[SweSmithOracleEnv] tid={trajectory_id} preparing worktree from repo cache", flush=True)
-        res = await exec_tool(
-            ToolCall(
-                name="terminal",
-                arguments={"command": worktree_cmd, "timeout": self.config.install_timeout_s},
-            )
-        )
-        if not res.success:
-            raise RuntimeError(
-                "git worktree setup failed "
-                f"(repo={repo}, base_commit={base_commit}, instance_id={instance_id}): {res.error}\n{res.output}"
-            )
-
-        print(
-            f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): worktree ready in {time.perf_counter() - t0:.2f}s",
-            flush=True,
-        )
-        return {"repo_dir": repo_dir, "base_commit": base_commit}
-
-    def _tests_for_item(self, item: Item) -> List[str]:
-        tests: List[str] = []
-        if self.config.score_include_fail_to_pass:
-            for key in ("PASS_TO_PASS", "FAIL_TO_PASS"):
-                nodeids = item.get(key)
-                if isinstance(nodeids, list):
-                    tests.extend([n for n in nodeids if isinstance(n, str)])
-        else:
-            nodeids = item.get("PASS_TO_PASS")
-            if isinstance(nodeids, list):
-                tests.extend([n for n in nodeids if isinstance(n, str)])
-        # Deduplicate, then sort for a stable, reproducible order.
-        return sorted(dict.fromkeys(tests))
-
-    def _chunk_nodeids(self, nodeids: List[str], max_per_chunk: int = 50) -> List[List[str]]:
-        chunks: List[List[str]] = []
-        for i in range(0, len(nodeids), max_per_chunk):
-            chunks.append(nodeids[i : i + max_per_chunk])
-        return chunks
-
-    async def verify_and_score_trajectory(
-        self,
-        item: Item,
-        final_response: str,  # noqa: ARG002
-        *,
-        trajectory_id: str,
-        exec_tool,
-        agent_result=None,
-        workspace_meta: Optional[Dict[str, Any]] = None,
-    ) -> tuple[float, Dict[str, Any]]:
-        _ = trajectory_id
-        repo_dir = self._repo_name(item)
-
-        # Training correctness: do not reward trajectories that never actually used tools.
-        if agent_result is not None and getattr(agent_result, "total_tool_calls", 0) <= 0:
-            print(
-                f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): no tool calls; score=0.0",
-                flush=True,
-            )
-            return 0.0, {
-                "verification_mode": "dataset_tests",
-                "error": "No tool calls were made by the agent",
-            }
-
-        nodeids = self._tests_for_item(item)
-        if not nodeids:
-            return 0.0, {"error": "No tests provided"}
-
-        print(f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): ensuring venv + deps", flush=True)
-        setup_cmd = (
-            f"cd {repo_dir} && "
-            "python -m venv .venv && "
-            ". .venv/bin/activate && "
-            "python -m pip install -U pip setuptools wheel && "
-            "python -m pip install -e . && "
-            "python -m pip install pytest"
-        )
-        setup_res = await exec_tool(
-            ToolCall(name="terminal", arguments={"command": setup_cmd, "timeout": self.config.install_timeout_s})
-        )
-        verification_messages = [{"role": "user", "content": setup_res.to_xml()}]
-        if not setup_res.success:
-            return 0.0, {
-                "verification_mode": "dataset_tests",
-                "phase": "install",
-                "error": setup_res.error,
-                "output": setup_res.output,
-                "verification_messages": verification_messages,
-            }
-
-        chunks = self._chunk_nodeids(nodeids, max_per_chunk=50)
-        for chunk_idx, chunk in enumerate(chunks):
-            joined = " ".join(chunk)
-            cmd = f"cd {repo_dir} && . .venv/bin/activate && python -m pytest -q {joined}"
-            res = await exec_tool(
-                ToolCall(
-                    name="terminal",
-                    arguments={"command": cmd, "timeout": self.config.test_timeout_s},
-                )
-            )
-            verification_messages.append({"role": "user", "content": res.to_xml()})
-            if not res.success:
-                return 0.0, {
-                    "verification_mode": "dataset_tests",
-                    "phase": "pytest",
-                    "failed_chunk": chunk_idx,
-                    "error": res.error,
-                    "output": res.output,
-                    "verification_messages": verification_messages,
-                }
-
-        return 1.0, {"verification_mode": "dataset_tests", "passed": True, "verification_messages": verification_messages}
-
-    async def score_trajectory(self, item: Item, final_response: str) -> float:
-        # Not used; scoring happens in verify_and_score_trajectory.
-        _ = (item, final_response)
-        return 0.0
-
-
-if __name__ == "__main__":
-    SweSmithOracleEnv.cli()
@@ -1,217 +0,0 @@
-"""
-Simple test environment for validating the atropos-agent setup.
-
-This environment uses a local OpenAI-compatible server for LLM testing to verify:
-- BaseEnv extension works correctly
-- API communication via OpenAI-compatible endpoint
-- Basic trajectory collection
-
-This is a minimal environment for testing, not production use.
-"""
-
-import os
-from typing import Dict, List, Optional, Tuple
-
-from dotenv import load_dotenv
-from pydantic import Field
-
-from atroposlib.envs.base import (
-    APIServerConfig,
-    Item,
-)
-
-from ..agent import AgentConfig
-from .agent_env import AgentEnv, AgentEnvConfig
-
-# Load environment variables from .env file
-load_dotenv()
-
-
-# Simple test prompts for validation
-TEST_PROMPTS = [
-    {
-        "prompt": "What is 2 + 2? Answer with just the number.",
-        "expected": "4",
-    },
-    {
-        "prompt": "What is the capital of France? Answer with just the city name.",
-        "expected": "Paris",
-    },
-    {
-        "prompt": "What color is the sky on a clear day? Answer with just the color.",
-        "expected": "Blue",
-    },
-    {
-        "prompt": "How many days are in a week? Answer with just the number.",
-        "expected": "7",
-    },
-    {
-        "prompt": "What is 10 * 5? Answer with just the number.",
-        "expected": "50",
-    },
-]
-
-SYSTEM_PROMPT = (
-    "You are a helpful assistant. Answer questions concisely and directly. "
-    "When asked for a simple answer, provide just that answer without explanation."
-)
-
-
-class SimpleTestEnvConfig(AgentEnvConfig):
-    """Configuration for the simple test environment."""
-
-    server_base_url: str = Field(
-        default="http://127.0.0.1:8080",
-        description="Base URL for an OpenAI-compatible server (without /v1)",
-    )
-    server_model: str = Field(
-        default="hermes-4-36b",
-        description="Model name",
-    )
-    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
-
-
-class SimpleTestEnv(AgentEnv[SimpleTestEnvConfig]):
-    """
-    A simple test environment to validate the atropos-agent setup.
-    
-    Uses a local OpenAI-compatible LLM endpoint with basic question-answering tasks.
-    Scoring is based on whether the response contains the expected answer.
-    """
-
-    name = "simple_test_env"
-    env_config_cls = SimpleTestEnvConfig
-
-    def __init__(
-        self,
-        config: SimpleTestEnvConfig,
-        server_configs: List[APIServerConfig],
-        slurm: bool = False,
-        testing: bool = False,
-    ):
-        super().__init__(config, server_configs, slurm, testing)
-        self.iter = 0
-        self.test_prompts = TEST_PROMPTS
-        self.percent_correct_buffer: List[float] = []
-
-    @classmethod
-    def config_init(cls) -> Tuple[SimpleTestEnvConfig, List[APIServerConfig]]:
-        """
-        Initialize configuration with local server settings from environment variables.
-        """
-        base_url = (
-            os.getenv("ATROPOS_SERVER_BASE_URL")
-            or os.getenv("OPENAI_BASE_URL")
-            or os.getenv("LLM_BASE_URL")
-            or "http://127.0.0.1:8080"
-        )
-        model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
-        api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
-
-        env_config = SimpleTestEnvConfig(
-            tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
-            group_size=4,
-            use_wandb=False,  # Disable wandb for simple testing
-            rollout_server_url="http://localhost:8000",
-            total_steps=10,
-            batch_size=16,
-            steps_per_eval=5,
-            max_token_length=2048,
-            inference_weight=1.0,
-            wandb_name="simple_test",
-            server_base_url=base_url,
-            server_model=model,
-        )
-
-        # OpenAI-compatible servers typically expose chat completions at /v1.
-        server_configs = [
-            APIServerConfig(
-                model_name=model,
-                base_url=f"{base_url.rstrip('/')}/v1",
-                api_key=api_key,
-                num_max_requests_at_once=4,
-                num_requests_for_eval=8,
-                timeout=120,  # Local models may be slower
-            ),
-        ]
-
-        return env_config, server_configs
-
-    async def setup_agent_env(self):
-        """Setup the environment - load test data."""
-        print(f"SimpleTestEnv setup complete. {len(self.test_prompts)} test prompts loaded.")
-        print(f"Using server at: {self.config.server_base_url}")
-        print(f"Model: {self.config.server_model}")
-
-    async def get_next_item(self) -> Item:
-        """Get the next test prompt."""
-        item = self.test_prompts[self.iter % len(self.test_prompts)]
-        self.iter += 1
-        return item
-
-    def build_task(self, item: Item) -> str:
-        return item["prompt"]
-
-    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
-        return AgentConfig(
-            max_steps=5,
-            temperature=0.7,
-            max_tokens=256,
-            system_prompt=SYSTEM_PROMPT,
-        )
-
-    async def score_trajectory(self, item: Item, final_response: str) -> float:
-        expected = item["expected"].lower()
-        response_lower = (final_response or "").lower()
-        score = 1.0 if expected in response_lower else 0.0
-        self.percent_correct_buffer.append(score)
-        return score
-
-    async def evaluate(self, *args, **kwargs):
-        """
-        Simple evaluation - run through all test prompts once.
-        """
-        correct = 0
-        total = len(self.test_prompts)
-
-        for item in self.test_prompts:
-            messages = [
-                {"role": "system", "content": SYSTEM_PROMPT},
-                {"role": "user", "content": item["prompt"]},
-            ]
-
-            response = await self.server.chat_completion(
-                messages=messages,
-                n=1,
-                max_tokens=256,
-                temperature=0.0,  # Greedy for eval
-                split="eval",
-            )
-
-            response_text = response.choices[0].message.content or ""
-            expected = item["expected"].lower()
-
-            if expected in response_text.lower():
-                correct += 1
-
-        accuracy = correct / total
-        print(f"Evaluation: {correct}/{total} = {accuracy:.2%} accuracy")
-        return {"eval_accuracy": accuracy}
-
-    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
-        """Log metrics (simplified for testing)."""
-        if wandb_metrics is None:
-            wandb_metrics = {}
-
-        if self.percent_correct_buffer:
-            avg_correct = sum(self.percent_correct_buffer) / len(self.percent_correct_buffer)
-            wandb_metrics["train/percent_correct"] = avg_correct
-            print(f"Train accuracy: {avg_correct:.2%}")
-            self.percent_correct_buffer = []
-
-        await super().wandb_log(wandb_metrics)
-
-
-if __name__ == "__main__":
-    # Allow running as CLI
-    SimpleTestEnv.cli()
@@ -1,165 +0,0 @@
-"""
-ToolServer routing smoke environment.
-
-Validates that:
-  - sandbox tools run through Nomad SlotPool (terminal -> bash in sandbox)
-  - external tools run through ToolServer (skills_list)
-
-This env uses ToolServer in-process by default (`tool_server_url="inprocess"`),
-so it is self-contained for local testing.
-
-Run:
-  uv run python -m atropos.envs.toolserver_smoke_env process --env.use_wandb false --env.total_steps 1 --env.group_size 1
-"""
-
-from __future__ import annotations
-
-import os
-from typing import Any, Dict, List, Tuple
-
-from dotenv import load_dotenv
-from pydantic import Field
-
-from atroposlib.envs.base import APIServerConfig, Item
-
-from ..agent import AgentConfig, AgentResult
-from .agent_env import AgentEnv, AgentEnvConfig
-
-load_dotenv()
-
-
-class ToolServerSmokeEnvConfig(AgentEnvConfig):
-    server_base_url: str = Field(
-        default="http://127.0.0.1:8080",
-        description="Base URL for an OpenAI-compatible chat server (without /v1).",
-    )
-    server_model: str = Field(default="hermes-4-36b", description="Model name")
-    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
-
-
-class ToolServerSmokeEnv(AgentEnv[ToolServerSmokeEnvConfig]):
-    name = "toolserver_smoke_env"
-    env_config_cls = ToolServerSmokeEnvConfig
-
-    def __init__(
-        self,
-        config: ToolServerSmokeEnvConfig,
-        server_configs: List[APIServerConfig],
-        slurm: bool = False,
-        testing: bool = False,
-    ):
-        super().__init__(config, server_configs, slurm, testing)
-        self._iter = 0
-
-    @classmethod
-    def config_init(cls) -> Tuple[ToolServerSmokeEnvConfig, List[APIServerConfig]]:
-        base_url = (
-            os.getenv("ATROPOS_SERVER_BASE_URL")
-            or os.getenv("OPENAI_BASE_URL")
-            or os.getenv("LLM_BASE_URL")
-            or "http://127.0.0.1:8080"
-        )
-        model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
-        api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
-
-        env_config = ToolServerSmokeEnvConfig(
-            tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
-            group_size=1,
-            use_wandb=False,
-            include_messages=True,
-            ensure_scores_are_not_same=False,
-            total_steps=1,
-            batch_size=1,
-            server_base_url=base_url,
-            server_model=model,
-            enabled_toolsets=["terminal", "skills"],
-            disabled_toolsets=[],
-            # Self-contained ToolServer for local smoke.
-            tool_server_url="inprocess",
-            sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
-            purge_job_on_start=True,
-            purge_job_on_shutdown=True,
-        )
-
-        server_configs = [
-            APIServerConfig(
-                model_name=model,
-                base_url=f"{base_url.rstrip('/')}/v1",
-                api_key=api_key,
-                num_max_requests_at_once=1,
-                num_requests_for_eval=1,
-                timeout=120,
-            )
-        ]
-        return env_config, server_configs
-
-    async def setup_agent_env(self) -> None:
-        return None
-
-    async def get_next_item(self) -> Item:
-        self._iter += 1
-        return {
-            "prompt": (
-                "You MUST call exactly one tool per assistant message.\n"
-                "\n"
-                "Step 1) Call the skills_list tool (no arguments), then stop.\n"
-                "Step 2) After you receive the tool response, call the terminal tool to run:\n"
-                "python -c \"print('ok')\"\n"
-                "Step 3) After you receive the terminal tool response, answer with just: ok\n"
-                "\n"
-                "Tool call format requirements:\n"
-                "- Every tool call MUST be a complete XML block with a closing tag.\n"
-                "- Do NOT emit a second <tool_call> in the same assistant message.\n"
-                "\n"
-                "Example:\n"
-                "<tool_call>{\"name\": \"skills_list\", \"arguments\": {}}</tool_call>\n"
-                "Do not include anything else in your final answer."
-            )
-        }
-
-    def build_task(self, item: Item) -> str:
-        return str(item.get("prompt") or "")
-
-    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
-        return AgentConfig(
-            max_steps=min(10, int(self.config.agent_max_steps)),
-            temperature=0.2,
-            max_tokens=None,
-        )
-
-    async def score_trajectory(self, item: Item, final_response: str) -> float:
-        _ = (item, final_response)
-        return 0.0
-
-    async def verify_and_score_trajectory(
-        self,
-        item: Item,
-        final_response: str,
-        *,
-        trajectory_id: str,  # noqa: ARG002
-        exec_tool,  # noqa: ARG002
-        agent_result: AgentResult | None = None,
-        workspace_meta: Dict[str, Any] | None = None,  # noqa: ARG002
-    ) -> tuple[float, Dict[str, Any]]:
-        if agent_result is None:
-            return 0.0, {"error": "Missing agent_result"}
-
-        called = {c.name for s in agent_result.steps for c in s.tool_calls}
-        need = {"skills_list", "terminal"}
-        if not need.issubset(called):
-            return 0.0, {"error": f"Missing tool calls: {sorted(need - called)}", "called": sorted(called)}
-
-        terminal_ok = False
-        for step in agent_result.steps:
-            for call, res in zip(step.tool_calls, step.tool_results):
-                if call.name != "terminal":
-                    continue
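-                # Guard against empty output: "".splitlines() is [], so [-1] would raise IndexError.
-                out_lines = (res.output or "").strip().splitlines()
-                if res.success and out_lines and out_lines[-1].strip() == "ok":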
-                    terminal_ok = True
-
-        score = 1.0 if terminal_ok and (final_response or "").strip() == "ok" else 0.0
-        return score, {"called": sorted(called), "final": (final_response or "").strip()}
-
-
-if __name__ == "__main__":
-    ToolServerSmokeEnv.cli()
@@ -1,11 +0,0 @@
-"""
-Nomad integration for atropos-agent.
-
-Provides:
-- NomadClient: Client for Nomad HTTP API
-- Job templates for sandbox containers
-"""
-
-from .client import NomadClient
-
-__all__ = ["NomadClient"]
@@ -1,500 +0,0 @@
-"""
-Nomad API Client for atropos-agent.
-
-Provides a simple async client for interacting with the Nomad HTTP API:
-- Submit/stop jobs
-- Query allocations
-- Get allocation addresses
-- Scale jobs up/down
-"""
-
-import asyncio
-import json
-import os
-from dataclasses import dataclass, field
-from enum import Enum
-from pathlib import Path
-from typing import Any, Dict, List, Optional
-
-import aiohttp
-
-
-class AllocationStatus(Enum):
-    """Nomad allocation status."""
-    PENDING = "pending"
-    RUNNING = "running"
-    COMPLETE = "complete"
-    FAILED = "failed"
-    LOST = "lost"
-
-
-@dataclass
-class Allocation:
-    """Information about a Nomad allocation."""
-    id: str
-    job_id: str
-    task_group: str
-    node_id: str
-    status: AllocationStatus
-    # Network info for reaching the allocation
-    address: Optional[str] = None
-    port: Optional[int] = None
-    
-    @property
-    def http_address(self) -> Optional[str]:
-        """Get full HTTP address for the allocation."""
-        if self.address and self.port:
-            return f"http://{self.address}:{self.port}"
-        return None
-
-
-@dataclass
-class JobStatus:
-    """Status of a Nomad job."""
-    id: str
-    name: str
-    status: str
-    allocations: List[Allocation] = field(default_factory=list)
-    count: int = 0  # Total desired instance count, summed across task groups
-
-
-class NomadClient:
-    """
-    Async client for Nomad HTTP API.
-    
-    Usage:
-        client = NomadClient(address="http://localhost:4646")
-        
-        # Submit a job
-        await client.submit_job(job_spec)
-        
-        # Get allocations
-        allocs = await client.get_job_allocations("sandbox-python")
-        
-        # Scale job
-        await client.scale_job("sandbox-python", count=5)
-    """
-    
-    def __init__(
-        self,
-        address: str = "http://localhost:4646",
-        token: Optional[str] = None,
-        timeout: float = 30.0,
-    ):
-        self.address = address.rstrip("/")
-        self.token = token or os.environ.get("NOMAD_TOKEN")
-        self.timeout = aiohttp.ClientTimeout(total=timeout)
-        self._session: Optional[aiohttp.ClientSession] = None
-    
-    async def _get_session(self) -> aiohttp.ClientSession:
-        """Get or create HTTP session."""
-        if self._session is None or self._session.closed:
-            headers = {}
-            if self.token:
-                headers["X-Nomad-Token"] = self.token
-            self._session = aiohttp.ClientSession(
-                timeout=self.timeout,
-                headers=headers,
-            )
-        return self._session
-    
-    async def close(self):
-        """Close the HTTP session."""
-        if self._session and not self._session.closed:
-            await self._session.close()
-    
-    async def __aenter__(self):
-        return self
-    
-    async def __aexit__(self, exc_type, exc_val, exc_tb):
-        await self.close()
-    
-    async def _request(
-        self,
-        method: str,
-        path: str,
-        data: Optional[Dict[str, Any]] = None,
-    ) -> Dict[str, Any]:
-        """Make an HTTP request to Nomad API."""
-        session = await self._get_session()
-        url = f"{self.address}{path}"
-        
-        try:
-            async with session.request(method, url, json=data) as response:
-                if response.status == 404:
-                    return {"error": "not_found", "status": 404}
-                
-                text = await response.text()
-                if not text:
-                    return {"status": response.status}
-                
-                try:
-                    result = json.loads(text)
-                except json.JSONDecodeError:
-                    return {"text": text, "status": response.status}
-                
-                if response.status >= 400:
-                    return {"error": result, "status": response.status}
-                
-                return result if isinstance(result, dict) else {"data": result, "status": response.status}
-                
-        except aiohttp.ClientError as e:
-            return {"error": str(e), "status": 0}
-    
-    # Job Operations
-    
-    async def submit_job(self, job_spec: Dict[str, Any]) -> Dict[str, Any]:
-        """
-        Submit a job to Nomad.
-        
-        Args:
-            job_spec: Job specification dict (HCL converted to JSON)
-            
-        Returns:
-            Response with EvalID if successful
-        """
-        return await self._request("POST", "/v1/jobs", {"Job": job_spec})
-    
-    async def stop_job(self, job_id: str, purge: bool = False) -> Dict[str, Any]:
-        """
-        Stop (and optionally purge) a job.
-        
-        Args:
-            job_id: Job identifier
-            purge: If True, completely remove the job
-        """
-        path = f"/v1/job/{job_id}"
-        if purge:
-            path += "?purge=true"
-        return await self._request("DELETE", path)
-    
-    async def get_job(self, job_id: str) -> Optional[Dict[str, Any]]:
-        """Get job details."""
-        result = await self._request("GET", f"/v1/job/{job_id}")
-        if "error" in result and result.get("status") == 404:
-            return None
-        return result
-    
-    async def get_job_status(self, job_id: str) -> Optional[JobStatus]:
-        """Get job status with allocations."""
-        job = await self.get_job(job_id)
-        if not job:
-            return None
-        
-        allocs = await self.get_job_allocations(job_id)
-        
-        # Get count from task groups
-        count = 0
-        task_groups = job.get("TaskGroups", [])
-        for tg in task_groups:
-            count += tg.get("Count", 1)
-        
-        return JobStatus(
-            id=job_id,
-            name=job.get("Name", job_id),
-            status=job.get("Status", "unknown"),
-            allocations=allocs,
-            count=count,
-        )
-    
-    # Allocation Operations
-    
-    async def get_job_allocations(self, job_id: str) -> List[Allocation]:
-        """Get all allocations for a job."""
-        result = await self._request("GET", f"/v1/job/{job_id}/allocations")
-        
-        if "error" in result:
-            return []
-        
-        allocs_data = result.get("data", result) if isinstance(result, dict) else result
-        if not isinstance(allocs_data, list):
-            return []
-        
-        allocations = []
-        for alloc_data in allocs_data:
-            # Parse allocation info
-            alloc_id = alloc_data.get("ID", "")
-            status_str = alloc_data.get("ClientStatus", "unknown")
-            
-            try:
-                status = AllocationStatus(status_str)
-            except ValueError:
-                status = AllocationStatus.PENDING
-            
-            # Get network info - need to fetch detailed allocation for this
-            address = None
-            port = None
-            
-            # First try the summary data
-            resources = alloc_data.get("AllocatedResources") or {}
-            shared = resources.get("Shared") or {}
-            networks = shared.get("Networks") or []
-            
-            # If no networks in summary, fetch detailed allocation
-            if not networks and alloc_id:
-                detailed = await self.get_allocation(alloc_id)
-                if detailed:
-                    resources = detailed.get("AllocatedResources") or {}
-                    shared = resources.get("Shared") or {}
-                    networks = shared.get("Networks") or []
-            
-            if networks:
-                network = networks[0]
-                address = network.get("IP")
-                # Look for dynamic ports OR reserved ports (Singularity/raw_exec uses reserved)
-                dyn_ports = network.get("DynamicPorts") or []
-                reserved_ports = network.get("ReservedPorts") or []
-                for dp in dyn_ports + reserved_ports:
-                    if dp.get("Label") == "http":
-                        port = dp.get("Value")
-                        break
-            
-            allocations.append(Allocation(
-                id=alloc_id,
-                job_id=job_id,
-                task_group=alloc_data.get("TaskGroup", ""),
-                node_id=alloc_data.get("NodeID", ""),
-                status=status,
-                address=address,
-                port=port,
-            ))
-        
-        return allocations
-    
-    async def get_allocation(self, alloc_id: str) -> Optional[Dict[str, Any]]:
-        """Get detailed allocation info."""
-        result = await self._request("GET", f"/v1/allocation/{alloc_id}")
-        if "error" in result and result.get("status") == 404:
-            return None
-        return result
-    
-    # Scaling Operations
-    
-    async def scale_job(self, job_id: str, count: int, task_group: str = "sandbox") -> Dict[str, Any]:
-        """
-        Scale a job's task group to specified count.
-        
-        Args:
-            job_id: Job identifier
-            count: Desired number of allocations
-            task_group: Name of task group to scale
-        """
-        payload = {
-            "Count": count,
-            "Target": {
-                "Group": task_group,
-            },
-        }
-        return await self._request("POST", f"/v1/job/{job_id}/scale", payload)
-    
-    async def get_job_scale_status(self, job_id: str) -> Dict[str, int]:
-        """
-        Get current scale status for a job.
-        
-        Returns:
-            Dict mapping task group name to its number of running allocations
-        """
-        result = await self._request("GET", f"/v1/job/{job_id}/scale")
-        
-        if "error" in result:
-            return {}
-        
-        task_groups = result.get("TaskGroups", {})
-        return {
-            name: info.get("Running", 0)
-            for name, info in task_groups.items()
-        }
-    
-    # Health Check
-    
-    async def is_healthy(self) -> bool:
-        """Check if Nomad is reachable and healthy."""
-        try:
-            result = await self._request("GET", "/v1/status/leader")
-            return "error" not in result
-        except Exception:
-            return False
-    
-    async def get_leader(self) -> Optional[str]:
-        """Get current Nomad leader address."""
-        result = await self._request("GET", "/v1/status/leader")
-        if isinstance(result, dict) and "data" in result:
-            return result["data"]
-        return None
-
-
-def load_job_template(
-    template_name: str = "sandbox",
-    **kwargs,
-) -> Dict[str, Any]:
-    """
-    Load and configure a job template.
-    
-    Args:
-        template_name: Name of template (e.g., "sandbox")
-        **kwargs: Template variables to substitute
-        
-    Returns:
-        Job specification dict ready for Nomad API
-    """
-    # Default job template for sandbox container
-    if template_name == "sandbox":
-        return create_sandbox_job(**kwargs)
-    else:
-        raise ValueError(f"Unknown template: {template_name}")
-
-
-def create_sandbox_job(
-    job_id: str = "atropos-sandbox",
-    image: str = "atropos-sandbox:local",  # Use :local tag to avoid registry pull
-    count: int = 1,
-    slots_per_container: int = 10,
-    privileged: bool = False,
-    cpu: int = 500,
-    memory: int = 512,
-    port: int = 8080,
-    datacenter: str = "dc1",
-    driver: str = "docker",  # "docker" or "singularity"
-    singularity_image: Optional[str] = None,  # Path to .sif file for singularity driver
-) -> Dict[str, Any]:
-    """
-    Create a sandbox job specification.
-    
-    This job runs the sandbox_server.py inside a container,
-    with the specified number of slots for agent workspaces.
-    
-    Args:
-        job_id: Unique job identifier
-        image: Docker image to use (for docker driver)
-        count: Number of container instances
-        slots_per_container: Number of slots per container
-        privileged: Run container in privileged mode (recommended for bubblewrap)
-        cpu: CPU allocation in MHz
-        memory: Memory allocation in MB
-        port: HTTP port for sandbox server
-        datacenter: Nomad datacenter
-        driver: Container driver - "docker" or "singularity"
-        singularity_image: Path to .sif file (required if driver="singularity")
-        
-    Returns:
-        Job specification dict
-    """
-    # Build task config based on driver
-    if driver == "singularity":
-        if not singularity_image:
-            raise ValueError("singularity_image path required when driver='singularity'")
-        
-        # Use the raw_exec driver to run apptainer via a shell so that variables are expanded.
-        # The container binds the allocation directory for workspace persistence.
-        # raw_exec uses a static port because the process runs directly on the host,
-        # so Nomad's dynamic port mapping does not apply the way it does for Docker.
-        shell_cmd = (
-            f'apptainer run '
-            f'--bind "$NOMAD_ALLOC_DIR/data:/data" '
-            f'--pwd /app '
-            f'--env PYTHONUNBUFFERED=1 '
-            f'{singularity_image} '
-            f'python sandbox_server.py '
-            f'--port {port} '
-            f'--slots {slots_per_container} '
-            f'--data-dir /data'
-        )
-        task_config = {
-            "command": "/bin/sh",
-            "args": ["-c", shell_cmd],
-        }
-        task_driver = "raw_exec"
-    else:
-        # Docker driver (default)
-        task_config = {
-            "image": image,
-            "force_pull": False,  # Use local image, don't try to pull
-            "ports": ["http"],
-            "privileged": privileged,
-            "command": "python",
-            "args": [
-                "sandbox_server.py",
-                "--port", str(port),
-                "--slots", str(slots_per_container),
-                "--data-dir", "/data",
-            ],
-            # Note: On Linux, you can mount persistent storage:
-            # "volumes": ["${NOMAD_ALLOC_DIR}/data:/data"],
-            # On macOS/Docker Desktop, skip volumes for PoC
-            # (container /data is ephemeral but works for testing)
-        }
-        task_driver = "docker"
-    
-    # For Singularity/raw_exec, use static ports since the process runs directly on host.
-    # For Docker, use dynamic ports with port mapping.
-    if driver == "singularity":
-        network_config = {
-            "Mode": "host",
-            "ReservedPorts": [
-                {
-                    "Label": "http",
-                    "Value": port,
-                }
-            ],
-        }
-    else:
-        network_config = {
-            "Mode": "host",
-            "DynamicPorts": [
-                {
-                    "Label": "http",
-                    "To": port,
-                }
-            ],
-        }
-    
-    return {
-        "ID": job_id,
-        "Name": job_id,
-        "Type": "service",
-        "Datacenters": [datacenter],
-        "TaskGroups": [
-            {
-                "Name": "sandbox",
-                "Count": count,
-                # Speed up deployments and avoid Consul checks. Without this, Nomad may
-                # keep an "active deployment" around for the default MinHealthyTime,
-                # which blocks immediate scaling under load.
-                "Update": {
-                    "HealthCheck": "task_states",
-                    "MinHealthyTime": 0,
-                },
-                "Networks": [network_config],
-                "Tasks": [
-                    {
-                        "Name": "sandbox-server",
-                        "Driver": task_driver,
-                        "Config": task_config,
-                        "Env": {
-                            "PYTHONUNBUFFERED": "1",
-                            "NOMAD_ALLOC_DIR": "${NOMAD_ALLOC_DIR}",
-                        },
-                        "Resources": {
-                            "CPU": cpu,
-                            "MemoryMB": memory,
-                        },
-                        # Note: Services with Checks require Consul, which we skip for the PoC
-                    }
-                ],
-                "RestartPolicy": {
-                    "Attempts": 3,
-                    "Interval": 300_000_000_000,  # 5 minutes
-                    "Delay": 10_000_000_000,     # 10 seconds
-                    "Mode": "delay",
-                },
-                "ReschedulePolicy": {
-                    "Attempts": 5,
-                    "Interval": 3600_000_000_000,  # 1 hour
-                    "Delay": 30_000_000_000,      # 30 seconds
-                    "DelayFunction": "exponential",
-                    "MaxDelay": 300_000_000_000,  # 5 minutes
-                    "Unlimited": False,
-                },
-            }
-        ],
-    }
@@ -1,20 +0,0 @@
-"""
-Slot-based multiplexing for atropos-agent.
-
-Provides:
-- Slot: Isolated workspace for a single trajectory
-- SlotPool: Manages slots across Nomad allocations
-- SandboxExecutor: Executes tools in sandbox containers
-"""
-
-from .executor import SandboxExecutor
-from .pool import SlotPool, SlotPoolConfig
-from .slot import Slot, SlotState
-
-__all__ = [
-    "Slot",
-    "SlotState",
-    "SlotPool",
-    "SlotPoolConfig",
-    "SandboxExecutor",
-]
@@ -1,457 +0,0 @@
-"""
-SandboxExecutor - HTTP client for sandbox container communication.
-
-Sends tool execution requests to sandbox_server.py running inside Nomad containers.
-Supports single and batch execution for efficiency.
-"""
-
-import asyncio
-import uuid
-from dataclasses import dataclass, field
-from typing import Any, Dict, List, Optional, Tuple
-
-import aiohttp
-
-from .slot import Slot, SlotState
-from ..tools.base import ToolCall, ToolResult
-
-
-@dataclass
-class ExecutionRequest:
-    """Request to execute a tool in a slot."""
-    slot: Slot
-    tool_name: str
-    args: Dict[str, Any]
-    execution_id: str = field(default_factory=lambda: str(uuid.uuid4()))
-    timeout: float = 30.0
-
-
-@dataclass
-class ExecutionResult:
-    """Result from sandbox execution."""
-    success: bool
-    output: str = ""
-    error: str = ""
-    execution_id: str = ""
-    slot_id: str = ""
-    metadata: Dict[str, Any] = field(default_factory=dict)
-    
-    def to_tool_result(self) -> ToolResult:
-        """Convert to ToolResult for agent consumption."""
-        return ToolResult(
-            success=self.success,
-            output=self.output,
-            error=self.error,
-            metadata=self.metadata,
-            uniq_id=self.execution_id,
-        )
-
-
-class SandboxExecutor:
-    """
-    HTTP client for executing tools in sandbox containers.
-    
-    Communicates with sandbox_server.py running inside Nomad allocations.
-    Supports both single execution and batched parallel execution.
-    
-    Usage:
-        executor = SandboxExecutor()
-        
-        # Single execution
-        result = await executor.execute(slot, "bash", {"command": "ls"})
-        
-        # Batch execution
-        results = await executor.execute_batch([
-            (slot1, "bash", {"command": "ls"}),
-            (slot2, "write_file", {"path": "test.txt", "content": "hello"}),
-        ])
-    """
-    
-    def __init__(
-        self,
-        timeout: float = 30.0,
-        max_retries: int = 3,
-        retry_delay: float = 1.0,
-    ):
-        self.timeout = aiohttp.ClientTimeout(total=timeout)
-        self.max_retries = max_retries
-        self.retry_delay = retry_delay
-        self._session: Optional[aiohttp.ClientSession] = None
-    
-    async def _get_session(self) -> aiohttp.ClientSession:
-        """Get or create HTTP session."""
-        if self._session is None or self._session.closed:
-            self._session = aiohttp.ClientSession(timeout=self.timeout)
-        return self._session
-    
-    async def close(self):
-        """Close HTTP session."""
-        if self._session and not self._session.closed:
-            await self._session.close()
-    
-    async def __aenter__(self):
-        return self
-    
-    async def __aexit__(self, exc_type, exc_val, exc_tb):
-        await self.close()
-    
-    async def execute(
-        self,
-        slot: Slot,
-        tool_name: str,
-        args: Dict[str, Any],
-        timeout: Optional[float] = None,
-    ) -> ExecutionResult:
-        """
-        Execute a tool in a slot's workspace.
-        
-        Args:
-            slot: Slot to execute in
-            tool_name: Name of tool (bash, read_file, write_file)
-            args: Tool arguments
-            timeout: Optional timeout override
-            
-        Returns:
-            ExecutionResult with output or error
-        """
-        execution_id = str(uuid.uuid4())
-        exec_timeout = timeout or self.timeout.total or 30.0
-        
-        # Mark slot as executing
-        original_state = slot.state
-        try:
-            if slot.state == SlotState.ACQUIRED:
-                slot.start_execution(execution_id)
-            
-            result = await self._send_execute_request(
-                container_addr=slot.container_addr,
-                slot_id=slot.slot_id,
-                tool_name=tool_name,
-                args=args,
-                execution_id=execution_id,
-                timeout=exec_timeout,
-            )
-            result.slot_id = slot.slot_id
-            return result
-            
-        finally:
-            # Restore slot state
-            if slot.state == SlotState.EXECUTING:
-                slot.end_execution()
-    
-    async def _send_execute_request(
-        self,
-        container_addr: str,
-        slot_id: str,
-        tool_name: str,
-        args: Dict[str, Any],
-        execution_id: str,
-        timeout: float,
-    ) -> ExecutionResult:
-        """Send execution request to sandbox server with retry logic."""
-        session = await self._get_session()
-        url = f"{container_addr}/execute"
-        
-        payload = {
-            "slot_id": slot_id,
-            "tool": tool_name,
-            "args": args,
-            "execution_id": execution_id,
-            "timeout": timeout,
-        }
-        
-        last_error = None
-        for attempt in range(self.max_retries):
-            try:
-                async with session.post(url, json=payload) as response:
-                    data = await response.json()
-                    
-                    return ExecutionResult(
-                        success=data.get("success", False),
-                        output=data.get("output", ""),
-                        error=data.get("error", ""),
-                        execution_id=data.get("execution_id", execution_id),
-                        metadata=data.get("metadata", {}),
-                    )
-                    
-            except aiohttp.ClientError as e:
-                last_error = str(e)
-                if attempt < self.max_retries - 1:
-                    await asyncio.sleep(self.retry_delay * (attempt + 1))
-                continue
-            except asyncio.TimeoutError:
-                last_error = f"Request timed out after {timeout}s"
-                break
-            except Exception as e:
-                last_error = str(e)
-                break
-        
-        return ExecutionResult(
-            success=False,
-            error=f"Failed after {self.max_retries} attempts: {last_error}",
-            execution_id=execution_id,
-        )
-    
-    async def execute_batch(
-        self,
-        requests: List[Tuple[Slot, str, Dict[str, Any]]],
-        timeout: Optional[float] = None,
-    ) -> List[ExecutionResult]:
-        """
-        Execute multiple tools in parallel across slots.
-        
-        This is the key optimization - we batch tool calls to maximize
-        container utilization while agents are waiting for LLM responses.
-        
-        Args:
-            requests: List of (slot, tool_name, args) tuples
-            timeout: Optional timeout override
-            
-        Returns:
-            List of ExecutionResults in same order as requests
-        """
-        if not requests:
-            return []
-        
-        # Group requests by container address for batch API
-        by_container: Dict[str, List[Tuple[int, Slot, str, Dict[str, Any], str]]] = {}
-        
-        for idx, (slot, tool_name, args) in enumerate(requests):
-            execution_id = str(uuid.uuid4())
-            container = slot.container_addr
-            
-            if container not in by_container:
-                by_container[container] = []
-            by_container[container].append((idx, slot, tool_name, args, execution_id))
-            
-            # Mark slots as executing
-            if slot.state == SlotState.ACQUIRED:
-                slot.start_execution(execution_id)
-        
-        # Execute batches in parallel
-        exec_timeout = timeout or self.timeout.total or 30.0
-        batch_tasks = []
-        
-        for container_addr, batch_requests in by_container.items():
-            task = self._send_batch_request(
-                container_addr=container_addr,
-                batch_requests=batch_requests,
-                timeout=exec_timeout,
-            )
-            batch_tasks.append(task)
-        
-        # Gather all batch results
-        batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)
-        
-        # Collect results in original order
-        results: List[Optional[ExecutionResult]] = [None] * len(requests)
-        
-        for batch_result in batch_results:
-            if isinstance(batch_result, Exception):
-                # Leave these entries as None; they are filled in as failures below
-                continue
-            
-            for idx, result in batch_result:
-                results[idx] = result
-        
-        # Fill in any missing results
-        for idx, result in enumerate(results):
-            if result is None:
-                slot, tool_name, args = requests[idx]
-                results[idx] = ExecutionResult(
-                    success=False,
-                    error="Batch execution failed",
-                    slot_id=slot.slot_id,
-                )
-        
-        # End execution on all slots
-        for slot, _, _ in requests:
-            if slot.state == SlotState.EXECUTING:
-                slot.end_execution()
-        
-        return results  # type: ignore
-    
-    async def _send_batch_request(
-        self,
-        container_addr: str,
-        batch_requests: List[Tuple[int, Slot, str, Dict[str, Any], str]],
-        timeout: float,
-    ) -> List[Tuple[int, ExecutionResult]]:
-        """Send batch execution request to a single container."""
-        session = await self._get_session()
-        url = f"{container_addr}/batch"
-        
-        # Build batch payload
-        payload = [
-            {
-                "slot_id": slot.slot_id,
-                "tool": tool_name,
-                "args": args,
-                "execution_id": execution_id,
-                "timeout": timeout,
-            }
-            for _, slot, tool_name, args, execution_id in batch_requests
-        ]
-        
-        try:
-            async with session.post(url, json=payload) as response:
-                data = await response.json()
-                
-                if not isinstance(data, list):
-                    raise ValueError(f"Expected list response, got {type(data)}")
-                
-                results = []
-                for i, (idx, slot, _, _, execution_id) in enumerate(batch_requests):
-                    if i < len(data):
-                        item = data[i]
-                        result = ExecutionResult(
-                            success=item.get("success", False),
-                            output=item.get("output", ""),
-                            error=item.get("error", ""),
-                            execution_id=item.get("execution_id", execution_id),
-                            slot_id=slot.slot_id,
-                            metadata=item.get("metadata", {}),
-                        )
-                    else:
-                        result = ExecutionResult(
-                            success=False,
-                            error="Missing result in batch response",
-                            execution_id=execution_id,
-                            slot_id=slot.slot_id,
-                        )
-                    results.append((idx, result))
-                
-                return results
-                
-        except Exception as e:
-            # Return error for all requests in batch
-            return [
-                (idx, ExecutionResult(
-                    success=False,
-                    error=str(e),
-                    execution_id=execution_id,
-                    slot_id=slot.slot_id,
-                ))
-                for idx, slot, _, _, execution_id in batch_requests
-            ]
-    
-    async def reset_slot(self, slot: Slot) -> ExecutionResult:
-        """
-        Reset a slot's workspace (delete all files).
-        
-        Useful when reusing a slot for a new trajectory.
-        """
-        session = await self._get_session()
-        url = f"{slot.container_addr}/reset"
-        
-        try:
-            async with session.post(url, json={"slot_id": slot.slot_id}) as response:
-                data = await response.json()
-                return ExecutionResult(
-                    success=data.get("success", False),
-                    output=data.get("output", ""),
-                    error=data.get("error", ""),
-                    slot_id=slot.slot_id,
-                )
-        except Exception as e:
-            return ExecutionResult(
-                success=False,
-                error=str(e),
-                slot_id=slot.slot_id,
-            )
-    
-    async def health_check(self, container_addr: str) -> bool:
-        """Check if a sandbox container is healthy."""
-        session = await self._get_session()
-        url = f"{container_addr}/health"
-        
-        try:
-            async with session.get(url) as response:
-                data = await response.json()
-                return data.get("status") == "ok"
-        except Exception:
-            return False
-    
-    async def get_container_status(
-        self, 
-        container_addr: str
-    ) -> Optional[Dict[str, Any]]:
-        """Get status info from a sandbox container."""
-        session = await self._get_session()
-        url = f"{container_addr}/health"
-        
-        try:
-            async with session.get(url) as response:
-                return await response.json()
-        except Exception:
-            return None
-
-    # -------------------------------------------------------------------------
-    # Artifact helpers (optional)
-    # -------------------------------------------------------------------------
-
-    async def _post_json(
-        self,
-        url: str,
-        payload: Dict[str, Any],
-        timeout: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        session = await self._get_session()
-        try:
-            async with session.post(url, json=payload, timeout=aiohttp.ClientTimeout(total=timeout) if timeout is not None else self.timeout) as response:
-                data = await response.json()
-                if isinstance(data, dict):
-                    data.setdefault("http_status", response.status)
-                    return data
-                return {"success": False, "error": f"Unexpected response type: {type(data)}", "http_status": response.status}
-        except Exception as e:
-            return {"success": False, "error": str(e)}
-
-    async def read_artifact(
-        self,
-        slot: Slot,
-        path: str,
-        *,
-        encoding: str = "text",
-        max_bytes: Optional[int] = None,
-        include_sha256: bool = False,
-        timeout: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        url = f"{slot.container_addr}/artifacts/read"
-        payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "encoding": encoding, "include_sha256": include_sha256}
-        if max_bytes is not None:
-            payload["max_bytes"] = max_bytes
-        return await self._post_json(url, payload, timeout=timeout)
-
-    async def list_artifacts(
-        self,
-        slot: Slot,
-        path: str = ".",
-        *,
-        recursive: bool = False,
-        max_entries: Optional[int] = None,
-        timeout: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        url = f"{slot.container_addr}/artifacts/list"
-        payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "recursive": recursive}
-        if max_entries is not None:
-            payload["max_entries"] = max_entries
-        return await self._post_json(url, payload, timeout=timeout)
-
-    async def archive_artifacts(
-        self,
-        slot: Slot,
-        path: str = ".",
-        *,
-        archive_format: str = "tar.gz",
-        max_bytes: Optional[int] = None,
-        max_entries: Optional[int] = None,
-        timeout: Optional[float] = None,
-    ) -> Dict[str, Any]:
-        url = f"{slot.container_addr}/artifacts/archive"
-        payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "format": archive_format}
-        if max_bytes is not None:
-            payload["max_bytes"] = max_bytes
-        if max_entries is not None:
-            payload["max_entries"] = max_entries
-        return await self._post_json(url, payload, timeout=timeout)
@@ -1,659 +0,0 @@
-"""
-SlotPool - Manages slots across Nomad allocations.
-
-The SlotPool is the core abstraction for slot-based multiplexing:
-- Tracks available/acquired slots across containers
-- Handles slot acquisition and release
-- Auto-scales Nomad job count based on demand
-- Provides batched tool execution
-"""
-
-import asyncio
-import logging
-import os
-import subprocess
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any, Dict, List, Optional, Tuple
-
-from ..nomad.client import (
-    Allocation,
-    AllocationStatus,
-    NomadClient,
-    create_sandbox_job,
-)
-from .executor import ExecutionResult, SandboxExecutor
-from .slot import Slot, SlotState, create_slots_for_allocation
-
-logger = logging.getLogger(__name__)
-
-
-@dataclass
-class SlotPoolConfig:
-    """Configuration for SlotPool."""
-    
-    # Nomad settings
-    nomad_address: str = "http://localhost:4646"
-    job_id: str = "atropos-sandbox"
-    datacenter: str = "dc1"
-    
-    # Container settings
-    image: str = "atropos-sandbox:local"  # Use :local tag to avoid registry pull
-    slots_per_container: int = 10
-    privileged: bool = False
-    cpu: int = 500  # MHz
-    memory: int = 512  # MB
-    
-    # Driver selection: "docker" or "singularity"
-    driver: str = "docker"
-    # Path to .sif file for singularity driver (required if driver="singularity")
-    singularity_image: Optional[str] = None
-    
-    # Scaling settings
-    min_containers: int = 1
-    max_containers: int = 10
-    
-    # Timeouts
-    acquire_timeout: float = 30.0  # Seconds between acquire polls (also triggers scale-up attempts)
-    health_check_interval: float = 30.0  # Seconds between health checks
-    scale_cooldown: float = 60.0  # Seconds between scale operations
-
-    # Job lifecycle
-    purge_job_on_start: bool = False  # Purge any pre-existing job before starting (local dev/training friendly)
-
-    # Local Docker image convenience (macOS/Nomad dev mode)
-    auto_build_local_image: bool = True  # If the image tag ends with :local and the image is missing, build it from the bundled Dockerfile.
-    dockerfile_path: Optional[str] = None  # Override Dockerfile path (default: Hermes-Agent/atropos/Dockerfile).
-    docker_build_context: Optional[str] = None  # Override build context (default: Hermes-Agent/atropos).
-
-
-class SlotPool:
-    """
-    Manages a pool of slots across Nomad allocations.
-    
-    The SlotPool:
-    - Deploys sandbox containers to Nomad
-    - Tracks slots across all running containers
-    - Handles slot acquisition/release
-    - Auto-scales based on demand
-    - Provides batched execution via SandboxExecutor
-    
-    Usage:
-        config = SlotPoolConfig(
-            nomad_address="http://localhost:4646",
-            job_id="my-sandbox",
-            slots_per_container=10,
-        )
-        
-        pool = SlotPool(config)
-        await pool.start()
-        
-        # Acquire a slot
-        slot = await pool.acquire()
-        
-        # Execute tool
-        result = await pool.execute(slot, "bash", {"command": "ls"})
-        
-        # Release slot
-        await pool.release(slot)
-        
-        # Shutdown
-        await pool.stop()
-    """
-    
-    def __init__(self, config: Optional[SlotPoolConfig] = None):
-        self.config = config or SlotPoolConfig()
-        
-        # Nomad client
-        self.nomad = NomadClient(address=self.config.nomad_address)
-        
-        # Sandbox executor for tool execution
-        self.executor = SandboxExecutor()
-        
-        # Slot tracking
-        self._slots: Dict[str, Slot] = {}  # slot_key -> Slot
-        self._available_queue: asyncio.Queue[str] = asyncio.Queue()
-        self._lock = asyncio.Lock()
-        self._scale_lock = asyncio.Lock()
-        
-        # State
-        self._started = False
-        self._health_task: Optional[asyncio.Task] = None
-        self._scale_task: Optional[asyncio.Task] = None
-        self._last_scale_time = 0.0
-
-    def _default_dockerfile_path(self) -> Path:
-        # Hermes-Agent/atropos/Dockerfile lives next to this module in source checkouts.
-        return Path(__file__).resolve().parents[1] / "Dockerfile"
-
-    def _default_build_context(self) -> Path:
-        return Path(__file__).resolve().parents[1]
-
-    def _docker_image_exists(self, image: str) -> bool:
-        try:
-            proc = subprocess.run(
-                ["docker", "image", "inspect", image],
-                stdout=subprocess.DEVNULL,
-                stderr=subprocess.DEVNULL,
-                check=False,
-                env={**os.environ, "DOCKER_CLI_HINTS": "false"},
-            )
-            return proc.returncode == 0
-        except FileNotFoundError:
-            return False
-
-    def _try_build_local_image(self, image: str) -> None:
-        dockerfile = Path(self.config.dockerfile_path) if self.config.dockerfile_path else self._default_dockerfile_path()
-        context = Path(self.config.docker_build_context) if self.config.docker_build_context else self._default_build_context()
-
-        if not dockerfile.exists():
-            raise RuntimeError(
-                f"Sandbox Dockerfile not found at {dockerfile}. "
-                "Build the sandbox image manually or set --env.auto_build_local_image false and provide a non-local image."
-            )
-        if not context.exists():
-            raise RuntimeError(f"Docker build context not found at {context}")
-
-        # Prefer buildx+--load to ensure the image ends up in the local daemon (required by Nomad's docker driver).
-        buildx_cmd = [
-            "docker",
-            "buildx",
-            "build",
-            "--load",
-            "-t",
-            image,
-            "-f",
-            str(dockerfile),
-            str(context),
-        ]
-        proc = subprocess.run(buildx_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
-        if proc.returncode == 0:
-            return
-
-        # Fallback to classic docker build if buildx isn't available.
-        build_cmd = ["docker", "build", "-t", image, "-f", str(dockerfile), str(context)]
-        proc2 = subprocess.run(build_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
-        if proc2.returncode != 0:
-            raise RuntimeError(
-                f"Failed to build local sandbox image {image}. "
-                f"Tried: {' '.join(buildx_cmd)} and {' '.join(build_cmd)}"
-            )
-
-    def _ensure_local_image(self) -> None:
-        image = (self.config.image or "").strip()
-        if not image.endswith(":local"):
-            return
-        if not self.config.auto_build_local_image:
-            return
-
-        if self._docker_image_exists(image):
-            return
-
-        logger.info(f"Local sandbox image {image} not found; building it now...")
-        self._try_build_local_image(image)
-
-    def _slot_key(self, alloc_id: str, slot_id: str) -> str:
-        """Generate unique key for a slot."""
-        return f"{alloc_id}:{slot_id}"
-    
-    @property
-    def total_slots(self) -> int:
-        """Total number of slots in pool."""
-        return len(self._slots)
-    
-    @property
-    def available_slots(self) -> int:
-        """Number of available slots."""
-        return sum(1 for s in self._slots.values() if s.is_available)
-    
-    @property
-    def acquired_slots(self) -> int:
-        """Number of acquired slots."""
-        return sum(1 for s in self._slots.values() if s.is_acquired)
-    
-    async def start(self) -> None:
-        """
-        Start the slot pool.
-        
-        - Checks if Nomad is healthy
-        - Deploys sandbox job if not running
-        - Discovers existing allocations
-        - Starts health check background task
-        """
-        if self._started:
-            return
-        
-        logger.info(f"Starting SlotPool (job_id={self.config.job_id})")
-
-        try:
-            # Make sure local sandbox images exist before Nomad tries to pull them.
-            # This is a common footgun in macOS dev mode with :local tags.
-            self._ensure_local_image()
-
-            # Check Nomad health
-            if not await self.nomad.is_healthy():
-                raise RuntimeError(f"Nomad is not reachable at {self.config.nomad_address}")
-
-            if self.config.purge_job_on_start:
-                logger.info(f"Purging any existing Nomad job: {self.config.job_id}")
-                await self.nomad.stop_job(self.config.job_id, purge=True)
-
-            # Check if job exists (after optional purge)
-            job = await self.nomad.get_job(self.config.job_id)
-
-            if job is None:
-                # Deploy new job
-                logger.info(f"Deploying sandbox job: {self.config.job_id} (driver={self.config.driver})")
-                job_spec = create_sandbox_job(
-                    job_id=self.config.job_id,
-                    image=self.config.image,
-                    count=self.config.min_containers,
-                    slots_per_container=self.config.slots_per_container,
-                    privileged=self.config.privileged,
-                    cpu=self.config.cpu,
-                    memory=self.config.memory,
-                    datacenter=self.config.datacenter,
-                    driver=self.config.driver,
-                    singularity_image=self.config.singularity_image,
-                )
-                result = await self.nomad.submit_job(job_spec)
-                if "error" in result:
-                    raise RuntimeError(f"Failed to submit job: {result}")
-
-            # Wait for allocations to be running (even if the job already existed).
-            await self._wait_for_healthy_allocations(self.config.min_containers)
-
-            # Discover existing allocations and slots
-            await self._refresh_slots()
-
-            # Start health check task
-            self._health_task = asyncio.create_task(self._health_check_loop())
-
-            self._started = True
-            logger.info(f"SlotPool started: {self.total_slots} slots available")
-        except Exception:
-            # Ensure aiohttp sessions are not leaked if we fail to start.
-            await self.stop(purge_job=False)
-            raise
-    
-    async def stop(self, purge_job: bool = False) -> None:
-        """
-        Stop the slot pool.
-        
-        Args:
-            purge_job: If True, also stop and purge the Nomad job
-        """
-        logger.info("Stopping SlotPool")
-
-        # Cancel health check task
-        if self._health_task:
-            self._health_task.cancel()
-            try:
-                await self._health_task
-            except asyncio.CancelledError:
-                pass
-            finally:
-                self._health_task = None
-
-        if self._scale_task:
-            self._scale_task.cancel()
-            try:
-                await self._scale_task
-            except asyncio.CancelledError:
-                pass
-            finally:
-                self._scale_task = None
-
-        # Optionally stop the job (do this even if start() never completed).
-        if purge_job:
-            logger.info(f"Stopping Nomad job: {self.config.job_id}")
-            await self.nomad.stop_job(self.config.job_id, purge=True)
-
-        # Close connections
-        await self.executor.close()
-        await self.nomad.close()
-
-        self._started = False
-        self._slots.clear()
-
-        # Clear the queue
-        while not self._available_queue.empty():
-            try:
-                self._available_queue.get_nowait()
-            except asyncio.QueueEmpty:
-                break
-    
-    async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
-        """
-        Acquire an available slot.
-        
-        If no slot is available within acquire_timeout seconds, attempts to
-        scale up and then keeps waiting; this call blocks until a slot is free.
-        
-        Args:
-            trajectory_id: Optional ID of trajectory acquiring the slot
-            
-        Returns:
-            Acquired Slot
-            
-        Raises:
-            RuntimeError: If the pool has not been started
-        """
-        if not self._started:
-            raise RuntimeError("SlotPool not started")
-
-        while True:
-            try:
-                # Try to get an available slot
-                slot_key = await asyncio.wait_for(
-                    self._available_queue.get(),
-                    timeout=self.config.acquire_timeout,
-                )
-            except asyncio.TimeoutError:
-                # Try to scale up, but keep waiting even if scaling isn't possible.
-                # In practice, slots may become available shortly (e.g. contention),
-                # and scaling may be temporarily blocked by Nomad deployments.
-                await self._try_scale_up()
-                continue
-
-            slot = self._slots.get(slot_key)
-            if slot is None:
-                # Slot was removed; discard stale queue entry and retry.
-                continue
-
-            try:
-                slot.acquire(trajectory_id)
-            except RuntimeError:
-                # Slot isn't actually available (e.g. duplicate queue entry); retry.
-                continue
-
-            logger.debug(f"Acquired slot {slot.slot_id} (alloc={slot.alloc_id[:8]})")
-            return slot
-    
-    async def release(self, slot: Slot, reset_workspace: bool = False) -> None:
-        """
-        Release a slot back to the pool.
-        
-        Args:
-            slot: Slot to release
-            reset_workspace: If True, clear the workspace files
-        """
-        slot_key = self._slot_key(slot.alloc_id, slot.slot_id)
-        
-        if slot_key not in self._slots:
-            logger.warning(f"Releasing unknown slot: {slot_key}")
-            return
-        
-        # Optionally reset workspace
-        if reset_workspace:
-            await self.executor.reset_slot(slot)
-        
-        slot.release()
-        await self._available_queue.put(slot_key)
-        
-        logger.debug(f"Released slot {slot.slot_id}")
-    
-    async def execute(
-        self,
-        slot: Slot,
-        tool_name: str,
-        args: Dict[str, Any],
-        timeout: Optional[float] = None,
-    ) -> ExecutionResult:
-        """
-        Execute a tool in a slot's workspace.
-        
-        Args:
-            slot: Slot to execute in
-            tool_name: Name of tool (bash, read_file, write_file)
-            args: Tool arguments
-            timeout: Optional timeout override
-            
-        Returns:
-            ExecutionResult
-        """
-        return await self.executor.execute(slot, tool_name, args, timeout)
-    
-    async def execute_batch(
-        self,
-        requests: List[Tuple[Slot, str, Dict[str, Any]]],
-        timeout: Optional[float] = None,
-    ) -> List[ExecutionResult]:
-        """
-        Execute multiple tools in parallel.
-        
-        This is the key optimization - batch execution across multiple slots
-        maximizes container utilization.
-        
-        Args:
-            requests: List of (slot, tool_name, args) tuples
-            timeout: Optional timeout override
-            
-        Returns:
-            List of ExecutionResults in same order
-        """
-        return await self.executor.execute_batch(requests, timeout)
-    
-    async def _refresh_slots(self) -> None:
-        """Refresh slot inventory from Nomad allocations."""
-        async with self._lock:
-            allocs = await self.nomad.get_job_allocations(self.config.job_id)
-            
-            # Track which slots we've seen
-            seen_keys = set()
-            
-            for alloc in allocs:
-                if alloc.status != AllocationStatus.RUNNING:
-                    continue
-                
-                if not alloc.http_address:
-                    continue
-                
-                # Check container health
-                healthy = await self.executor.health_check(alloc.http_address)
-                if not healthy:
-                    continue
-                
-                # Create slots for this allocation
-                for i in range(self.config.slots_per_container):
-                    slot_id = f"slot_{i}"
-                    slot_key = self._slot_key(alloc.id, slot_id)
-                    seen_keys.add(slot_key)
-                    
-                    if slot_key not in self._slots:
-                        # New slot
-                        slot = Slot(
-                            slot_id=slot_id,
-                            alloc_id=alloc.id,
-                            container_addr=alloc.http_address,
-                        )
-                        self._slots[slot_key] = slot
-                        await self._available_queue.put(slot_key)
-                        logger.debug(f"Added slot: {slot_key}")
-            
-            # Remove slots from dead allocations
-            for slot_key in list(self._slots.keys()):
-                if slot_key not in seen_keys:
-                    self._slots.pop(slot_key)
-                    logger.debug(f"Removed slot: {slot_key}")
-    
-    async def _wait_for_healthy_allocations(
-        self, 
-        min_count: int, 
-        timeout: float = 120.0
-    ) -> None:
-        """Wait until min_count allocations are healthy; raise RuntimeError on timeout or fatal start errors."""
-        import time
-        start = time.time()
-
-        def _summarize_alloc_detail(detail: Dict[str, Any]) -> str:
-            task_states = detail.get("TaskStates") or {}
-            parts: List[str] = []
-            if isinstance(task_states, dict):
-                for task_name, st in task_states.items():
-                    events = (st or {}).get("Events") or []
-                    if isinstance(events, list) and events:
-                        # Include a few recent events; the latest can be a generic restart message
-                        # while the true root cause is slightly earlier (e.g. image pull failure).
-                        recent = events[-3:]
-                        msgs: List[str] = []
-                        for ev in recent:
-                            desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
-                            if desc:
-                                msgs.append(desc)
-                        if msgs:
-                            parts.append(f"{task_name}: " + " | ".join(msgs))
-            return "; ".join(parts)
-
-        def _alloc_events_lower(detail: Dict[str, Any]) -> str:
-            task_states = detail.get("TaskStates") or {}
-            texts: List[str] = []
-            if isinstance(task_states, dict):
-                for _task_name, st in task_states.items():
-                    events = (st or {}).get("Events") or []
-                    if isinstance(events, list):
-                        for ev in events[-10:]:
-                            desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
-                            if desc:
-                                texts.append(desc)
-            return " ".join(texts).lower()
-        
-        while time.time() - start < timeout:
-            allocs = await self.nomad.get_job_allocations(self.config.job_id)
-            
-            healthy_count = 0
-            for alloc in allocs:
-                if alloc.status == AllocationStatus.RUNNING and alloc.http_address:
-                    if await self.executor.health_check(alloc.http_address):
-                        healthy_count += 1
-
-                # Fast-fail on obvious driver/image errors to avoid waiting out the full timeout.
-                if alloc.id:
-                    detail = await self.nomad.get_allocation(alloc.id)
-                    if isinstance(detail, dict):
-                        summary = _summarize_alloc_detail(detail)
-                        lowered = _alloc_events_lower(detail) or summary.lower()
-                        if "failed to pull" in lowered or "pull access denied" in lowered:
-                            raise RuntimeError(
-                                "Nomad allocation failed to start due to a Docker image pull error. "
-                                f"Allocation {alloc.id[:8]}: {summary}\n"
-                                "If you're using a local image tag (e.g. `atropos-sandbox:local`) on macOS, "
-                                "make sure the image is loaded into Docker, e.g.:\n"
-                                "  docker buildx build --load -t atropos-sandbox:local -f Hermes-Agent/atropos/Dockerfile Hermes-Agent/atropos"
-                            )
-                        if "exceeded allowed attempts" in lowered:
-                            raise RuntimeError(
-                                "Nomad allocation is crash-looping and has entered restart backoff. "
-                                f"Allocation {alloc.id[:8]}: {summary}\n"
-                                "Inspect logs with:\n"
-                                f"  nomad alloc logs -stderr -task sandbox-server {alloc.id}\n"
-                                "Common causes include: missing local Docker image tag, container entrypoint error, "
-                                "or sandbox-server startup failure."
-                            )
-            
-            if healthy_count >= min_count:
-                return
-            
-            await asyncio.sleep(2.0)
-
-        # Timed out: include allocation status detail to help debugging.
-        allocs = await self.nomad.get_job_allocations(self.config.job_id)
-        alloc_lines: List[str] = []
-        for alloc in allocs[:10]:
-            addr = alloc.http_address or "-"
-            line = f"{alloc.id[:8]} status={alloc.status.value} http={addr}"
-            detail = await self.nomad.get_allocation(alloc.id)
-            if isinstance(detail, dict):
-                summary = _summarize_alloc_detail(detail)
-                if summary:
-                    line += f" detail={summary}"
-            alloc_lines.append(line)
-
-        hint = (
-            "Timed out waiting for healthy sandbox allocations.\n"
-            f"Job: {self.config.job_id}, desired_healthy: {min_count}\n"
-            "Allocations:\n  - " + "\n  - ".join(alloc_lines)
-        )
-        raise RuntimeError(hint)
-    
-    async def _try_scale_up(self) -> bool:
-        """Attempt to add one container; return True if a scale request was submitted."""
-        import time
-
-        async with self._scale_lock:
-            # Check cooldown
-            if time.time() - self._last_scale_time < self.config.scale_cooldown:
-                return False
-
-            # Check max containers
-            status = await self.nomad.get_job_status(self.config.job_id)
-            if status is None:
-                return False
-
-            current_count = status.count
-            if current_count >= self.config.max_containers:
-                logger.warning(f"Cannot scale up: already at max ({self.config.max_containers})")
-                return False
-
-            # Scale up
-            new_count = min(current_count + 1, self.config.max_containers)
-            logger.info(f"Scaling up from {current_count} to {new_count} containers")
-
-            scale_resp = await self.nomad.scale_job(
-                self.config.job_id,
-                count=new_count,
-                task_group="sandbox",
-            )
-
-            # Nomad may return non-JSON errors (e.g. plain text) with a status field.
-            if isinstance(scale_resp, dict) and scale_resp.get("status", 200) >= 400:
-                logger.warning(f"Scale request rejected: {scale_resp}")
-                self._last_scale_time = time.time()
-                return False
-
-            self._last_scale_time = time.time()
-
-            # Wait for new allocation in the background so contended acquires can still
-            # make progress (e.g. by grabbing slots released by other trajectories).
-            if self._scale_task is None or self._scale_task.done():
-                self._scale_task = asyncio.create_task(self._wait_for_scale(new_count))
-
-            return True
-
-    async def _wait_for_scale(self, desired_count: int) -> None:
-        try:
-            await self._wait_for_healthy_allocations(desired_count, timeout=60.0)
-            await self._refresh_slots()
-        except asyncio.CancelledError:
-            raise
-        except Exception as e:
-            logger.error(f"Failed to scale up: {e}")
-    
-    async def _health_check_loop(self) -> None:
-        """Background task to monitor container health."""
-        while True:
-            try:
-                await asyncio.sleep(self.config.health_check_interval)
-                await self._refresh_slots()
-            except asyncio.CancelledError:
-                break
-            except Exception as e:
-                logger.error(f"Health check error: {e}")
-    
-    def get_stats(self) -> Dict[str, Any]:
-        """Get pool statistics."""
-        slots_by_state = {}
-        for slot in self._slots.values():
-            state = slot.state.value
-            slots_by_state[state] = slots_by_state.get(state, 0) + 1
-
-        container_count = len({s.alloc_id for s in self._slots.values()}) if self._slots else 0
-        
-        return {
-            "total_slots": self.total_slots,
-            "available_slots": self.available_slots,
-            "acquired_slots": self.acquired_slots,
-            "containers": container_count,
-            "slots_by_state": slots_by_state,
-            "started": self._started,
-        }
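
The acquire/release loop in the removed `SlotPool` can be hard to follow in diff form. Below is a minimal sketch of the same queue pattern, assuming nothing about Nomad or the executor: `MiniSlot`, `MiniPool`, and `demo` are hypothetical names invented for illustration, and scaling is reduced to a comment. It shows how a stale queue entry (left behind when a slot is removed) is discarded on the next acquire.

```python
import asyncio
from typing import Dict, List


class MiniSlot:
    """Hypothetical stand-in for Slot: tracks only whether it is in use."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.in_use = False


class MiniPool:
    """Simplified sketch of the acquire/release loop; no Nomad, no scaling."""

    def __init__(self, keys: List[str]) -> None:
        self._slots: Dict[str, MiniSlot] = {k: MiniSlot(k) for k in keys}
        self._queue: asyncio.Queue = asyncio.Queue()
        for k in keys:
            self._queue.put_nowait(k)

    def remove(self, key: str) -> None:
        # Mimics a refresh dropping a dead allocation: the queue entry stays behind.
        self._slots.pop(key, None)

    async def acquire(self, timeout: float = 0.1) -> MiniSlot:
        while True:
            try:
                key = await asyncio.wait_for(self._queue.get(), timeout=timeout)
            except asyncio.TimeoutError:
                continue  # the removed code tries to scale up here, then keeps waiting
            slot = self._slots.get(key)
            if slot is None or slot.in_use:
                continue  # stale or duplicate queue entry: discard and retry
            slot.in_use = True
            return slot

    async def release(self, slot: MiniSlot) -> None:
        if slot.key not in self._slots:
            return  # unknown slot: ignore, as the removed code does after a warning
        slot.in_use = False
        await self._queue.put(slot.key)


async def demo() -> List[str]:
    pool = MiniPool(["a/slot_0", "a/slot_1", "b/slot_0"])
    pool.remove("a/slot_0")          # its key is still sitting in the queue
    first = await pool.acquire()     # skips the stale key
    second = await pool.acquire()
    await pool.release(first)
    third = await pool.acquire()     # gets the slot that was just released
    return [first.key, second.key, third.key]


result = asyncio.run(demo())
```

Because the loop never gives up, a caller that needs a hard deadline would have to wrap the call itself, for example with `asyncio.wait_for`.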
@@ -1,159 +0,0 @@
-"""
-Slot abstraction for atropos-agent.
-
-A Slot represents an isolated workspace for a single agent trajectory.
-Slots are hosted on Nomad allocations and provide workspace isolation
-via filesystem directories.
-"""
-
-from dataclasses import dataclass, field
-from enum import Enum
-from typing import Any, Dict, Optional
-import uuid
-
-
-class SlotState(Enum):
-    """State of a slot in the pool."""
-    AVAILABLE = "available"      # Ready to be acquired
-    ACQUIRED = "acquired"        # Assigned to a trajectory
-    EXECUTING = "executing"      # Currently executing a tool
-    RELEASING = "releasing"      # Being released back to pool
-    ERROR = "error"              # In error state
-
-
-@dataclass
-class Slot:
-    """
-    An isolated workspace for a single agent trajectory.
-    
-    Slots are the unit of scheduling - each trajectory runs in its own slot,
-    with an isolated workspace directory. Multiple slots share a container.
-    
-    Attributes:
-        slot_id: Unique identifier for this slot (e.g., "slot_0")
-        alloc_id: Nomad allocation ID hosting this slot
-        container_addr: HTTP address of the sandbox server (e.g., "http://10.0.0.1:8080")
-        workspace_dir: Path to workspace in container (e.g., "/data/slot_0")
-        state: Current state of the slot
-        trajectory_id: ID of trajectory currently using this slot (if acquired)
-        metadata: Additional metadata
-    """
-    slot_id: str
-    alloc_id: str
-    container_addr: str
-    workspace_dir: str = ""
-    state: SlotState = SlotState.AVAILABLE
-    trajectory_id: Optional[str] = None
-    metadata: Dict[str, Any] = field(default_factory=dict)
-    
-    def __post_init__(self):
-        """Set default workspace_dir if not provided."""
-        if not self.workspace_dir:
-            self.workspace_dir = f"/data/{self.slot_id}"
-    
-    @property
-    def is_available(self) -> bool:
-        """Check if slot is available for acquisition."""
-        return self.state == SlotState.AVAILABLE
-    
-    @property
-    def is_acquired(self) -> bool:
-        """Check if slot is currently acquired."""
-        return self.state in (SlotState.ACQUIRED, SlotState.EXECUTING)
-    
-    def acquire(self, trajectory_id: Optional[str] = None) -> None:
-        """
-        Mark slot as acquired by a trajectory.
-        
-        Args:
-            trajectory_id: Optional ID of acquiring trajectory
-        """
-        if not self.is_available:
-            raise RuntimeError(f"Cannot acquire slot {self.slot_id}: state is {self.state}")
-        
-        self.state = SlotState.ACQUIRED
-        self.trajectory_id = trajectory_id or str(uuid.uuid4())
-    
-    def start_execution(self, execution_id: Optional[str] = None) -> None:
-        """Mark slot as executing."""
-        if self.state != SlotState.ACQUIRED:
-            raise RuntimeError(f"Cannot start execution on slot {self.slot_id}: state is {self.state}")
-        
-        self.state = SlotState.EXECUTING
-        if execution_id:
-            self.metadata["current_execution_id"] = execution_id
-    
-    def end_execution(self) -> None:
-        """Mark execution as complete, return to acquired state."""
-        if self.state != SlotState.EXECUTING:
-            raise RuntimeError(f"Cannot end execution on slot {self.slot_id}: state is {self.state}")
-        
-        self.state = SlotState.ACQUIRED
-        self.metadata.pop("current_execution_id", None)
-    
-    def release(self) -> None:
-        """Release slot back to available state."""
-        self.state = SlotState.AVAILABLE
-        self.trajectory_id = None
-        self.metadata.pop("current_execution_id", None)
-    
-    def mark_error(self, error: str) -> None:
-        """Mark slot as in error state."""
-        self.state = SlotState.ERROR
-        self.metadata["error"] = error
-    
-    def to_dict(self) -> Dict[str, Any]:
-        """Convert to dictionary for serialization."""
-        return {
-            "slot_id": self.slot_id,
-            "alloc_id": self.alloc_id,
-            "container_addr": self.container_addr,
-            "workspace_dir": self.workspace_dir,
-            "state": self.state.value,
-            "trajectory_id": self.trajectory_id,
-            "metadata": self.metadata,
-        }
-    
-    @classmethod
-    def from_dict(cls, data: Dict[str, Any]) -> "Slot":
-        """Create from dictionary."""
-        return cls(
-            slot_id=data["slot_id"],
-            alloc_id=data["alloc_id"],
-            container_addr=data["container_addr"],
-            workspace_dir=data.get("workspace_dir", ""),
-            state=SlotState(data.get("state", "available")),
-            trajectory_id=data.get("trajectory_id"),
-            metadata=data.get("metadata", {}),
-        )
-    
-    def __repr__(self) -> str:
-        return f"Slot({self.slot_id}, state={self.state.value}, alloc={self.alloc_id[:8]}...)"
-
-
-def create_slots_for_allocation(
-    alloc_id: str,
-    container_addr: str,
-    num_slots: int = 10,
-) -> list["Slot"]:
-    """
-    Create slots for a Nomad allocation.
-    
-    Args:
-        alloc_id: Nomad allocation ID
-        container_addr: HTTP address of sandbox server
-        num_slots: Number of slots to create
-        
-    Returns:
-        List of Slot objects
-    """
-    slots = []
-    for i in range(num_slots):
-        slot_id = f"slot_{i}"
-        slots.append(Slot(
-            slot_id=slot_id,
-            alloc_id=alloc_id,
-            container_addr=container_addr,
-            workspace_dir=f"/data/{slot_id}",
-        ))
-    return slots
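
The guarded transitions in the removed `Slot` class amount to a small state machine. The sketch below restates them as a lookup table; `State`, `ALLOWED`, `step`, and `run` are hypothetical names used only for illustration, and the sketch omits `trajectory_id`, metadata, and the `RELEASING` and `ERROR` states.

```python
from enum import Enum


class State(Enum):
    AVAILABLE = "available"
    ACQUIRED = "acquired"
    EXECUTING = "executing"


# Transitions enforced by the guarded methods (acquire, start_execution, end_execution).
ALLOWED = {
    ("acquire", State.AVAILABLE): State.ACQUIRED,
    ("start_execution", State.ACQUIRED): State.EXECUTING,
    ("end_execution", State.EXECUTING): State.ACQUIRED,
}


def step(state: State, action: str) -> State:
    """Apply one action, mirroring the RuntimeError raised on an invalid transition."""
    if action == "release":
        # release() is unguarded in the removed code: it resets from any state.
        return State.AVAILABLE
    try:
        return ALLOWED[(action, state)]
    except KeyError:
        raise RuntimeError(f"Cannot {action}: state is {state}") from None


def run(actions):
    """Start from AVAILABLE and apply each action in order."""
    state = State.AVAILABLE
    for action in actions:
        state = step(state, action)
    return state


final = run(["acquire", "start_execution", "end_execution", "release"])

try:
    run(["start_execution"])  # not acquired yet, so this is rejected
    rejected = False
except RuntimeError:
    rejected = True
```

One consequence of the unguarded `release()` is that a slot marked with `mark_error()` becomes available again as soon as it is released.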
@@ -1,2 +0,0 @@
-"""Terminal helpers for stateful sandbox interactions."""
-
@@ -1,115 +0,0 @@
-from __future__ import annotations
-
-import json
-from typing import Any
-
-import pyte
-
-
-class AsciinemaStreamDecoder:
-    def __init__(self, *, default_width: int = 80, default_height: int = 24) -> None:
-        self._default_width = max(1, int(default_width))
-        self._default_height = max(1, int(default_height))
-        self._buffer = ""
-        self._has_header = False
-        self.width = self._default_width
-        self.height = self._default_height
-        self._screen = pyte.Screen(self.width, self.height)
-        self._stream = pyte.Stream(self._screen)
-
-    def reset(self) -> None:
-        self._buffer = ""
-        self._has_header = False
-        self.width = self._default_width
-        self.height = self._default_height
-        self._screen = pyte.Screen(self.width, self.height)
-        self._stream = pyte.Stream(self._screen)
-
-    def feed(self, chunk: str | bytes) -> None:
-        if not chunk:
-            return
-        if isinstance(chunk, bytes):
-            chunk = chunk.decode("utf-8", errors="replace")
-        self._buffer += chunk
-        while True:
-            line, sep, rest = self._buffer.partition("\n")
-            if not sep:
-                break
-            self._buffer = rest
-            line = line.strip()
-            if not line:
-                continue
-            parsed = self._parse_json_line(line)
-            if parsed is None:
-                continue
-            if not self._has_header:
-                if isinstance(parsed, dict):
-                    self._init_from_header(parsed)
-                    continue
-                if isinstance(parsed, list):
-                    self._has_header = True
-                    self._apply_event(parsed)
-                    continue
-                continue
-            if isinstance(parsed, list):
-                self._apply_event(parsed)
-
-    def render(self) -> str:
-        return "\n".join(self._screen.display)
-
-    def _parse_json_line(self, line: str) -> Any | None:
-        try:
-            return json.loads(line)
-        except json.JSONDecodeError:
-            return None
-
-    def _init_from_header(self, header: dict[str, Any]) -> None:
-        width = _coerce_int(
-            header.get("width") or header.get("columns") or header.get("cols"),
-            self._default_width,
-        )
-        height = _coerce_int(
-            header.get("height") or header.get("rows") or header.get("lines"),
-            self._default_height,
-        )
-        self.width = max(1, width)
-        self.height = max(1, height)
-        self._screen = pyte.Screen(self.width, self.height)
-        self._stream = pyte.Stream(self._screen)
-        self._has_header = True
-
-    def _apply_event(self, event: list[Any]) -> None:
-        if len(event) < 2:
-            return
-        event_type = event[1]
-        payload = event[2] if len(event) > 2 else ""
-        if event_type == "o":
-            if isinstance(payload, str):
-                self._stream.feed(payload)
-        elif event_type == "r":
-            width, height = _parse_resize(payload)
-            if width and height:
-                self.width = width
-                self.height = height
-                self._screen.resize(width, height)
-
-
-def _coerce_int(value: Any, default: int) -> int:
-    try:
-        return int(value)
-    except (TypeError, ValueError):
-        return int(default)
-
-
-def _parse_resize(payload: Any) -> tuple[int, int]:
-    if isinstance(payload, str) and "x" in payload:
-        left, right = payload.lower().split("x", 1)
-        return _coerce_int(left, 0), _coerce_int(right, 0)
-    if isinstance(payload, dict):
-        width = _coerce_int(payload.get("width") or payload.get("columns") or payload.get("cols"), 0)
-        height = _coerce_int(payload.get("height") or payload.get("rows") or payload.get("lines"), 0)
-        return width, height
-    if isinstance(payload, list) and len(payload) >= 2:
-        return _coerce_int(payload[0], 0), _coerce_int(payload[1], 0)
-    return 0, 0
-
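
The buffering logic in `AsciinemaStreamDecoder.feed` is easier to see without the terminal emulator. The following sketch, which assumes an asciicast-style stream (one JSON header object, then `[time, type, data]` event lines), keeps only the line splitting and event filtering. `split_complete_lines` and `collect_output` are hypothetical helpers, and `pyte` is deliberately left out, so escape sequences are not interpreted.

```python
import json
from typing import Any, List, Tuple


def split_complete_lines(buffer: str, chunk: str) -> Tuple[List[str], str]:
    """Append chunk to buffer; return (complete non-empty lines, leftover partial line)."""
    buffer += chunk
    *lines, rest = buffer.split("\n")
    return [ln.strip() for ln in lines if ln.strip()], rest


def collect_output(chunks: List[str]) -> Tuple[dict, str]:
    """Return (header, concatenated "o" payloads) from asciicast-style chunks."""
    header: dict = {}
    output: List[str] = []
    buffer = ""
    for chunk in chunks:
        lines, buffer = split_complete_lines(buffer, chunk)
        for line in lines:
            try:
                parsed: Any = json.loads(line)
            except json.JSONDecodeError:
                continue  # malformed lines are skipped silently, as in the removed code
            if isinstance(parsed, dict):
                if not header:
                    header = parsed
            elif isinstance(parsed, list) and len(parsed) > 2 and parsed[1] == "o":
                if isinstance(parsed[2], str):
                    output.append(parsed[2])
    return header, "".join(output)


# Chunks deliberately split mid-line to exercise the buffering.
chunks = [
    '{"version": 2, "width": 100, "hei',
    'ght": 30}\n[0.1, "o", "hel',
    'lo"]\nnot json\n[0.2, "o", " world"]\n',
]
header, text = collect_output(chunks)
```

The removed decoder feeds each `"o"` payload into a `pyte.Stream` instead of concatenating it, which is what lets `render()` return the visible screen rather than the raw byte history.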
@@ -1,26 +0,0 @@
-"""
-Tool abstractions for atropos-agent.
-
-Provides base Tool class and common tool implementations.
-"""
-
-from .base import Tool, ToolCall, ToolRegistry, ToolResult, ToolSchema
-from .build_registry import build_tool_registry
-from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
-from .terminal_stateful_tool import TerminalStatefulTool
-from .tmux_tool import TmuxTool
-
-__all__ = [
-    "Tool",
-    "ToolCall",
-    "ToolRegistry",
-    "ToolResult",
-    "ToolSchema",
-    "BashTool",
-    "ReadFileTool",
-    "WriteFileTool",
-    "TerminalTool",
-    "TerminalStatefulTool",
-    "TmuxTool",
-    "build_tool_registry",
-]
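
`ToolCall`, exported above, is documented in the `base` module as parsing Hermes-style `<tool_call>{...}</tool_call>` blocks and accepting only well-formed closing tags. A minimal sketch of that strict extraction follows, covering the JSON wrapper form only. `TOOL_CALL_RE` and `extract_tool_calls` are hypothetical names, not part of the removed package.

```python
import json
import re
from typing import Any, Dict, List, Tuple

# Non-greedy match so each block ends at its own closing tag.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (name, arguments) pairs for well-formed Hermes JSON tool calls."""
    calls: List[Tuple[str, Dict[str, Any]]] = []
    for inner in TOOL_CALL_RE.findall(text):
        try:
            data = json.loads(inner)
        except json.JSONDecodeError:
            continue  # skip blocks whose body is not valid JSON
        if not isinstance(data, dict):
            continue
        name = data.get("name")
        arguments = data.get("arguments", {})
        if isinstance(name, str) and name and isinstance(arguments, dict):
            calls.append((name, arguments))
    return calls


sample = (
    'Listing files.\n'
    '<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>\n'
    '<tool_call>{"name": "bash", "arguments": '  # unterminated: no closing tag
)
parsed = extract_tool_calls(sample)
```

The unterminated second block is dropped rather than repaired, which matches the strict behavior described in the `parse_from_text` docstring.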
@@ -1,423 +0,0 @@
-"""
-Base Tool abstraction for atropos-agent.
-
-Tools follow a simple pattern:
-1. Define schema (name, description, parameters)
-2. Implement execute() method
-3. Return ToolResult with output/error
-
-Tool calls use Hermes-style XML tags:
-<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>
-"""
-
-import json
-import re
-import uuid
-from abc import ABC, abstractmethod
-from dataclasses import dataclass, field
-from typing import Any, Dict, List, Literal, Optional
-
-from pydantic import BaseModel, Field
-
-
-@dataclass
-class ToolSchema:
-    """JSON Schema for a tool's parameters."""
-    
-    name: str
-    description: str
-    parameters: Dict[str, Any] = field(default_factory=dict)
-    required: List[str] = field(default_factory=list)
-    external: bool = False  # Whether the tool must be executed via an external ToolServer (secret proxy) and not inside the sandbox.
-    
-    def to_dict(self) -> Dict[str, Any]:
-        """Convert to OpenAI-compatible function schema."""
-        return {
-            "type": "function",
-            "function": {
-                "name": self.name,
-                "description": self.description,
-                "parameters": {
-                    "type": "object",
-                    "properties": self.parameters,
-                    "required": self.required,
-                },
-            },
-        }
-    
-    def to_prompt_description(self) -> str:
-        """Convert to human-readable description for system prompt."""
-        params_desc = []
-        for name, spec in self.parameters.items():
-            req = "(required)" if name in self.required else "(optional)"
-            desc = spec.get("description", "")
-            param_type = spec.get("type", "string")
-            params_desc.append(f"  - {name} ({param_type}) {req}: {desc}")
-        
-        params_str = "\n".join(params_desc) if params_desc else "  (no parameters)"
-        return f"**{self.name}**: {self.description}\nParameters:\n{params_str}"
-
-
-@dataclass
-class ToolCall:
-    """A parsed tool call from model output."""
-    
-    name: str
-    arguments: Dict[str, Any]
-    raw_text: str = ""  # Original XML/JSON text
-    uniq_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # Unique tool-call id for traceability/reconstruction.
-    
-    @classmethod
-    def parse_from_text(cls, text: str) -> List["ToolCall"]:
-        """
-        Extract tool calls from text using Hermes-style XML tags.
-        
-        Supported formats (STRICT: requires well-formed closing tags):
-        - Hermes JSON wrapper:
-          <tool_call>{"name": "...", "arguments": {...}}</tool_call>
-        - GLM/llama.cpp style:
-          <tool_call>terminal{"command":"ls -la"}</tool_call>
-        """
-        calls: List["ToolCall"] = []
-
-        if not text:
-            return calls
-
-        def _append_from_payload(*, name: str, arguments: Dict[str, Any], raw: str, uniq_id: Optional[str] = None) -> None:
-            if not isinstance(name, str) or not name:
-                return
-            if not isinstance(arguments, dict):
-                return
-            calls.append(
-                cls(
-                    name=name,
-                    arguments=arguments,
-                    raw_text=raw,
-                    uniq_id=uniq_id or str(uuid.uuid4()),
-                )
-            )
-
-        # STRICT parsing: only accept well-formed <tool_call>...</tool_call> blocks.
-        pattern = r"<tool_call>\s*(.*?)\s*</tool_call>"
-        for inner in re.findall(pattern, text, re.DOTALL):
-            cleaned = (inner or "").strip()
-            if not cleaned:
-                continue
-
-            # Hermes JSON wrapper.
-            if cleaned.startswith("{"):
-                try:
-                    data = json.loads(cleaned)
-                except json.JSONDecodeError:
-                    continue
-                uniq_id = data.get("uniq_id") or data.get("id") or None
-                _append_from_payload(
-                    name=data.get("name", ""),
-                    arguments=data.get("arguments", {}),
-                    raw=inner,
-                    uniq_id=uniq_id,
-                )
-                continue
-
-            # GLM/llama.cpp style: terminal{...}
-            m = re.match(r"^\s*([A-Za-z0-9_.:-]+)\s*(\{.*\})\s*$", cleaned, re.DOTALL)
-            if not m:
-                continue
-            name = m.group(1)
-            args_text = m.group(2)
-            try:
-                args = json.loads(args_text)
-            except json.JSONDecodeError:
-                continue
-            _append_from_payload(name=name, arguments=args, raw=inner)
-
-        return calls
-    
-    @classmethod
-    def has_tool_call(cls, text: str) -> bool:
-        """Check if text contains an opening <tool_call> tag (the call itself may still be malformed)."""
-        return bool(re.search(r"<tool_call>", text))
-
-
-@dataclass
-class ToolResult:
-    """Result from executing a tool."""
-    
-    success: bool
-    output: str = ""
-    error: str = ""
-    metadata: Dict[str, Any] = field(default_factory=dict)
-    uniq_id: Optional[str] = None  # Should match ToolCall.uniq_id for async execution tracking.
-    
-    def to_xml(self) -> str:
-        """Format as XML for including in conversation."""
-        data = {
-            "success": self.success,
-            "output": self.output,
-        }
-        if self.uniq_id:
-            data["uniq_id"] = self.uniq_id
-        if self.error:
-            data["error"] = self.error
-        if self.metadata:
-            data["metadata"] = self.metadata
-        return f"<tool_response>{json.dumps(data)}</tool_response>"
-    
-    def to_dict(self) -> Dict[str, Any]:
-        """Convert to dictionary."""
-        return {
-            "success": self.success,
-            "output": self.output,
-            "error": self.error,
-            "metadata": self.metadata,
-            "uniq_id": self.uniq_id,
-        }
-
-
-class Tool(ABC):
-    """
-    Abstract base class for tools.
-    
-    Subclasses must implement:
-    - schema: ToolSchema describing the tool
-    - execute(): async method that performs the tool action
-    """
-    
-    @property
-    @abstractmethod
-    def schema(self) -> ToolSchema:
-        """Return the tool's schema."""
-        pass
-    
-    @property
-    def name(self) -> str:
-        """Tool name (from schema)."""
-        return self.schema.name
-    
-    @abstractmethod
-    async def execute(self, **kwargs) -> ToolResult:
-        """
-        Execute the tool with given arguments.
-        
-        Args:
-            **kwargs: Tool-specific arguments
-            
-        Returns:
-            ToolResult with success/failure and output
-        """
-        pass
-    
-    def is_available(self) -> tuple[bool, str | None]:
-        """
-        Return whether this tool should be exposed/executable in the current process.
-
-        Tools that depend on optional binaries/services/env vars can override this
-        to avoid advertising a tool that will fail at runtime.
-        """
-        return True, None
-
-    async def __call__(self, **kwargs) -> ToolResult:
-        """Allow calling tool instance directly."""
-        return await self.execute(**kwargs)
-
-# Note: this registry only wraps tool declarations: for tools run in an external process by the ToolServer, and for tools preinstalled in envs.
-class ToolRegistry:
-    """Registry of available tools."""
-    
-    def __init__(self):
-        self._tools: Dict[str, Tool] = {}
-    
-    def register(self, tool: Tool) -> None:
-        """Register a tool."""
-        self._tools[tool.name] = tool
-    
-    def get(self, name: str) -> Optional[Tool]:
-        """Get a tool by name."""
-        return self._tools.get(name)
-    
-    def list_tools(self) -> List[Tool]:
-        """List all registered tools."""
-        return list(self._tools.values())
-    
-    def get_schemas(self) -> List[ToolSchema]:
-        """Get schemas for all registered tools."""
-        return [tool.schema for tool in self._tools.values()]
-    
-    def get_prompt_description(self) -> str:
-        """Generate tool descriptions for system prompt."""
-        descriptions = [tool.schema.to_prompt_description() for tool in self._tools.values()]
-        return "\n\n".join(descriptions)
-
-    def get_prompt_tool_definitions_json(self) -> str:
-        """
-        Return a Hermes-style JSON list of tool definitions for use inside a `<tools>...</tools>` block.
-
-        Hermes trajectories historically use a simplified schema list:
-          [{"name": ..., "description": ..., "parameters": {...}, "required": null}, ...]
-        """
-        formatted: List[Dict[str, Any]] = []
-        for tool in self._tools.values():
-            fn = tool.schema.to_dict().get("function", {})
-            formatted.append(
-                {
-                    "name": fn.get("name", tool.name),
-                    "description": fn.get("description", ""),
-                    "parameters": fn.get("parameters", {}),
-                    # Keep parity with Hermes saved trajectories (required is typically null there).
-                    "required": None,
-                }
-            )
-        return json.dumps(formatted, ensure_ascii=False)
-    
-    async def execute(self, call: ToolCall) -> ToolResult:
-        """Execute a tool call."""
-        tool = self.get(call.name)
-        if tool is None:
-            return ToolResult(
-                success=False,
-                error=f"Unknown tool: {call.name}",
-                uniq_id=call.uniq_id,
-            )
-        
-        try:
-            result = await tool.execute(**call.arguments)
-            if result.uniq_id is None:
-                result.uniq_id = call.uniq_id
-            return result
-        except Exception as e:
-            return ToolResult(
-                success=False,
-                error=f"Tool execution error: {str(e)}",
-                uniq_id=call.uniq_id,
-            )
-
-
-# =============================================================================
-# FastAPI / transport models
-# =============================================================================
-
-
-class ToolCallPayload(BaseModel):
-    name: str
-    arguments: Dict[str, Any] = Field(default_factory=dict)
-    uniq_id: str
-
-    @classmethod
-    def from_tool_call(cls, call: ToolCall) -> "ToolCallPayload":
-        return cls(name=call.name, arguments=call.arguments, uniq_id=call.uniq_id)
-
-    def to_tool_call(self) -> ToolCall:
-        return ToolCall(name=self.name, arguments=self.arguments, uniq_id=self.uniq_id)
-
-
-class ToolResultPayload(BaseModel):
-    success: bool
-    output: str = ""
-    error: str = ""
-    metadata: Dict[str, Any] = Field(default_factory=dict)
-    uniq_id: Optional[str] = None
-
-    @classmethod
-    def from_tool_result(cls, result: ToolResult) -> "ToolResultPayload":
-        return cls(
-            success=result.success,
-            output=result.output,
-            error=result.error,
-            metadata=result.metadata,
-            uniq_id=result.uniq_id,
-        )
-
-    def to_tool_result(self) -> ToolResult:
-        return ToolResult(
-            success=self.success,
-            output=self.output,
-            error=self.error,
-            metadata=self.metadata,
-            uniq_id=self.uniq_id,
-        )
-
-
-class ToolExecutorExecuteRequest(BaseModel):
-    trajectory_id: str
-    tool: ToolCallPayload
-    timeout_s: Optional[float] = None
-
-
-class ToolExecutorReleaseRequest(BaseModel):
-    trajectory_id: str
-    reset_workspace: bool = False
-
-
-class ToolServerExecuteRequest(BaseModel):
-    trajectory_id: Optional[str] = None
-    tool: ToolCallPayload
-    timeout_s: Optional[float] = None
-    # Optional sandbox context for tools that need workspace artifacts.
-    # This is set by ToolExecutor and is NOT model-controlled.
-    slot_id: Optional[str] = None
-    container_addr: Optional[str] = None
-
-
-# =============================================================================
-# Artifact transport models
-# =============================================================================
-
-
-class ArtifactReadRequestPayload(BaseModel):
-    trajectory_id: str
-    path: str
-    encoding: Literal["text", "base64"] = "text"
-    max_bytes: Optional[int] = None
-    include_sha256: bool = False
-
-
-class ArtifactReadResponsePayload(BaseModel):
-    success: bool
-    content: str = ""
-    error: str = ""
-    encoding: str = "text"
-    truncated: bool = False
-    bytes: int = 0
-    file_size: Optional[int] = None
-    path: str = ""
-    mime: Optional[str] = None
-    sha256: Optional[str] = None
-
-
-class ArtifactListRequestPayload(BaseModel):
-    trajectory_id: str
-    path: str = "."
-    recursive: bool = False
-    max_entries: Optional[int] = None
-
-
-class ArtifactListEntryPayload(BaseModel):
-    path: str
-    is_dir: bool
-    size: int
-    mtime: float
-
-
-class ArtifactListResponsePayload(BaseModel):
-    success: bool
-    entries: List[ArtifactListEntryPayload] = Field(default_factory=list)
-    truncated: bool = False
-    error: str = ""
-
-
-class ArtifactArchiveRequestPayload(BaseModel):
-    trajectory_id: str
-    path: str = "."
-    format: Literal["tar.gz", "tgz"] = "tar.gz"
-    max_bytes: Optional[int] = None
-    max_entries: Optional[int] = None
-
-
-class ArtifactArchiveResponsePayload(BaseModel):
-    success: bool
-    content: str = ""
-    error: str = ""
-    encoding: str = "base64"
-    format: str = "tar.gz"
-    bytes: int = 0
-    entry_count: int = 0
@@ -1,64 +0,0 @@
-"""
-Unified tool registry builder for Hermes-Agent Atropos integration.
-
-This composes:
-- sandbox tool stubs (terminal/bash/read_file/write_file + stateful terminal/tmux)
-- Hermes external tools (web/vision/image/moa/skills/browser), executed via ToolServer
-
-ToolExecutor only needs the schema + `external` routing bit; ToolServer executes
-the external tools via Hermes' existing implementations.
-"""
-
-from __future__ import annotations
-
-from typing import List, Optional
-
-from .base import ToolRegistry
-from .hermes_external_tools import build_external_tools
-from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
-from .terminal_stateful_tool import TerminalStatefulTool
-from .tmux_tool import TmuxTool
-from .toolset_resolver import resolve_multiple_toolsets
-
-
-def build_tool_registry(
-    *,
-    enabled_toolsets: Optional[List[str]] = None,
-    disabled_toolsets: Optional[List[str]] = None,
-    tool_server_url: Optional[str] = None,
-) -> ToolRegistry:
-    """
-    Build a ToolRegistry for AgentEnv / ToolExecutor / ToolServer.
-
-    If `tool_server_url` is not provided, external tools will be omitted so we do
-    not advertise tools that cannot execute.
-    """
-    enabled_toolsets = enabled_toolsets or ["default"]
-
-    # Resolve tool names using Hermes toolsets plus Atropos additions.
-    selected = set(resolve_multiple_toolsets(enabled_toolsets))
-    if disabled_toolsets:
-        selected -= set(resolve_multiple_toolsets(disabled_toolsets))
-
-    reg = ToolRegistry()
-
-    # Always register sandbox tools if selected.
-    sandbox_by_name = {
-        "terminal": TerminalTool(),
-        "bash": BashTool(),
-        "read_file": ReadFileTool(),
-        "write_file": WriteFileTool(),
-        "terminal_stateful": TerminalStatefulTool(),
-        "tmux": TmuxTool(),
-    }
-    for name, tool in sandbox_by_name.items():
-        if name in selected:
-            reg.register(tool)
-
-    # External tools: only include when ToolServer is configured.
-    if tool_server_url:
-        for tool in build_external_tools(selected_tool_names=selected):
-            if tool.name in selected:
-                reg.register(tool)
-
-    return reg
@@ -1,90 +0,0 @@
-"""
-Hermes external tool adapter for Atropos ToolServer.
-
-These tools reuse Hermes-Agent's existing tool runner (`model_tools.handle_function_call`)
-so we don't duplicate external tool implementations.
-
-Important:
-- These are marked `external=True` and should be executed ONLY by ToolServer.
-- We run `handle_function_call` in a worker thread because the Hermes implementation
-  uses `asyncio.run()` internally for some async tools (web_extract, vision, MoA, etc).
-"""
-
-from __future__ import annotations
-
-import asyncio
-import json
-from typing import Any, Dict, List, Optional
-
-import model_tools
-
-from .base import Tool, ToolResult, ToolSchema
-
-
-def _schema_from_openai_tool_dict(tool: Dict[str, Any], *, external: bool) -> ToolSchema:
-    fn = tool.get("function") or {}
-    name = str(fn.get("name") or "")
-    description = str(fn.get("description") or "")
-    params = fn.get("parameters") or {}
-    properties = params.get("properties") or {}
-    required = params.get("required") or []
-    if not isinstance(required, list):
-        required = []
-    return ToolSchema(
-        name=name,
-        description=description,
-        parameters=dict(properties),
-        required=[str(x) for x in required if isinstance(x, (str, int))],
-        external=external,
-    )
-
-
-class HermesExternalTool(Tool):
-    def __init__(self, schema: ToolSchema):
-        self._schema = schema
-
-    @property
-    def schema(self) -> ToolSchema:
-        return self._schema
-
-    async def execute(self, task_id: Optional[str] = None, **kwargs: Any) -> ToolResult:
-        # `model_tools.handle_function_call` returns a JSON string (success or error).
-        # Run in a thread because some Hermes tool handlers call `asyncio.run()`.
-        raw = await asyncio.to_thread(model_tools.handle_function_call, self.name, kwargs, task_id)
-
-        try:
-            parsed = json.loads(raw)
-        except Exception:
-            # Keep as plain string.
-            return ToolResult(success=True, output=str(raw))
-
-        if isinstance(parsed, dict) and parsed.get("error"):
-            return ToolResult(success=False, error=str(parsed.get("error")), output="")
-
-        return ToolResult(success=True, output=json.dumps(parsed, ensure_ascii=False))
-
-
-def build_external_tools(
-    *,
-    selected_tool_names: Optional[set[str]] = None,
-) -> List[HermesExternalTool]:
-    """
-    Build external tool wrappers from Hermes tool declarations.
-
-    Filters out sandbox-oriented tools (e.g. `terminal`) since those should run
-    inside the sandbox via ToolExecutor.
-    """
-    # IMPORTANT: Hermes' `model_tools.get_tool_definitions()` only understands Hermes toolsets.
-    # Atropos envs add extra toolsets (filesystem/sandbox/stateful). To avoid noisy "Unknown toolset"
-    # prints and accidental filtering, we fetch ALL Hermes tool definitions here and filter by name.
-    tools = model_tools.get_tool_definitions(enabled_toolsets=None, disabled_toolsets=None, quiet_mode=True)
-
-    wrappers: List[HermesExternalTool] = []
-    for t in tools:
-        schema = _schema_from_openai_tool_dict(t, external=True)
-        if schema.name in {"terminal"}:
-            continue
-        if selected_tool_names is not None and schema.name not in selected_tool_names:
-            continue
-        wrappers.append(HermesExternalTool(schema))
-    return wrappers
@@ -1,99 +0,0 @@
-"""
-Sandbox tool stubs for Atropos ToolExecutor.
-
-These tools are executed inside the sandbox containers via:
-ToolExecutor -> SlotPool -> sandbox_server.py
-
-They intentionally do NOT execute anything on the host process. If they are
-called directly (outside ToolExecutor), they return a clear error.
-"""
-
-from __future__ import annotations
-
-from typing import Optional
-
-from .base import Tool, ToolResult, ToolSchema
-
-
-class TerminalTool(Tool):
-    @property
-    def schema(self) -> ToolSchema:
-        return ToolSchema(
-            name="terminal",
-            description=(
-                "Execute a command inside the sandbox slot workspace and return stdout/stderr. "
-                "Filesystem persists within a trajectory slot. Background processes are not supported "
-                "in stateless mode. Commands run under POSIX /bin/sh and each tool call runs in a fresh "
-                "shell (no persisted env vars). Avoid bash-only syntax like `source`; prefer `. .venv/bin/activate` "
-                "or invoke `.venv/bin/python ...` directly."
-            ),
-            parameters={
-                "command": {"type": "string", "description": "The command to execute"},
-                "timeout": {
-                    "type": "integer",
-                    "description": "Command timeout in seconds (optional).",
-                    "minimum": 1,
-                },
-                "background": {
-                    "type": "boolean",
-                    "description": "Not supported in sandbox terminal (always false).",
-                    "default": False,
-                },
-            },
-            required=["command"],
-            external=False,
-        )
-
-    async def execute(self, **_kwargs) -> ToolResult:
-        return ToolResult(
-            success=False,
-            error="terminal must be executed via ToolExecutor inside the sandbox",
-        )
-
-
-class BashTool(Tool):
-    @property
-    def schema(self) -> ToolSchema:
-        return ToolSchema(
-            name="bash",
-            description="Execute a bash command inside the sandbox slot workspace.",
-            parameters={"command": {"type": "string", "description": "The bash command to execute"}},
-            required=["command"],
-            external=False,
-        )
-
-    async def execute(self, **_kwargs) -> ToolResult:
-        return ToolResult(success=False, error="bash must be executed via ToolExecutor inside the sandbox")
-
-
-class ReadFileTool(Tool):
-    @property
-    def schema(self) -> ToolSchema:
-        return ToolSchema(
-            name="read_file",
-            description="Read a file from the sandbox slot workspace.",
-            parameters={"path": {"type": "string", "description": "Path to the file"}},
-            required=["path"],
-            external=False,
-        )
-
-    async def execute(self, **_kwargs) -> ToolResult:
-        return ToolResult(success=False, error="read_file must be executed via ToolExecutor inside the sandbox")
-
-
-class WriteFileTool(Tool):
-    @property
-    def schema(self) -> ToolSchema:
-        return ToolSchema(
-            name="write_file",
-            description="Write a file into the sandbox slot workspace.",
-            parameters={
-                "path": {"type": "string", "description": "Path to the file"},
-                "content": {"type": "string", "description": "File content"},
-            },
-            required=["path", "content"],
-            external=False,
-        )
-
-    async def execute(self, **_kwargs) -> ToolResult:
-        return ToolResult(success=False, error="write_file must be executed via ToolExecutor inside the sandbox")
@@ -1,45 +0,0 @@
-"""
-Stateful terminal tool schema.
-
-This is a sandbox tool that routes to the sandbox server as `bash_stateful`
-via ToolExecutor mapping. It exists to expose an explicit, opt-in terminal
-primitive suitable for stateful workflows (e.g. tmux sessions / TUIs).
-"""
-
-from __future__ import annotations
-
-from typing import Optional
-
-from .base import Tool, ToolResult, ToolSchema
-
-
-class TerminalStatefulTool(Tool):
-    @property
-    def schema(self) -> ToolSchema:
-        return ToolSchema(
-            name="terminal_stateful",
-            description=(
-                "Execute a command in the sandbox, allowing stateful/background processes to persist "
-                "across tool calls within the same trajectory slot (e.g. tmux sessions). "
-                "Use sparingly; output is still non-interactive."
-            ),
-            parameters={
-                "command": {"type": "string", "description": "The command to execute"},
-                "timeout": {
-                    "type": "integer",
-                    "description": "Command timeout in seconds (optional).",
-                    "minimum": 1,
-                },
-            },
-            required=["command"],
-        )
-
-    def is_available(self) -> tuple[bool, str | None]:
-        return True, None
-
-    async def execute(self, command: str, timeout: Optional[int] = None) -> ToolResult:
-        _ = (command, timeout)
-        return ToolResult(
-            success=False,
-            error="terminal_stateful must be executed via ToolExecutor inside the sandbox",
-        )
@@ -1,89 +0,0 @@
-"""
-tmux tool schema (sandbox).
-
-This is a sandbox tool that provides basic tmux session control suitable for
-TUI-style terminal interactions:
-- send keys (arrow keys, enter, etc.)
-- capture the current screen buffer
-
-Execution is routed by ToolExecutor to the sandbox server's `tmux` backend.
-"""
-
-from __future__ import annotations
-
-from typing import Any, Dict, Optional
-
-from .base import Tool, ToolResult, ToolSchema
-
-
-class TmuxTool(Tool):
-    @property
-    def schema(self) -> ToolSchema:
-        return ToolSchema(
-            name="tmux",
-            description=(
-                "Control a per-trajectory tmux session inside the sandbox (stateful terminal). "
-                "Use this for TUI-style interactions: send keys and capture the current screen."
-            ),
-            parameters={
-                "action": {
-                    "type": "string",
-                    "description": "Action to perform: start | send_keys | stream | stop | capture.",
-                    "enum": ["start", "send_keys", "stream", "stop", "capture"],
-                },
-                "keys": {
-                    "description": "Keys to send (string or list of strings) when action=send_keys.",
-                },
-                "block": {
-                    "type": "boolean",
-                    "description": "If true, wait for shell command completion (only valid at a shell prompt).",
-                    "default": False,
-                },
-                "min_wait_s": {
-                    "type": "number",
-                    "description": "For non-blocking send_keys, sleep this long after sending keys (seconds).",
-                    "default": 0.0,
-                },
-                "max_wait_s": {
-                    "type": "number",
-                    "description": "For blocking send_keys, max time to wait for completion (seconds).",
-                },
-                "capture_entire": {
-                    "type": "boolean",
-                    "description": "Deprecated. Streaming is preferred.",
-                    "default": False,
-                },
-                "max_bytes": {
-                    "type": "integer",
-                    "description": "Max bytes to return per stream call.",
-                },
-                "reset": {
-                    "type": "boolean",
-                    "description": "If true, reset stream offset to the beginning of the asciinema recording.",
-                    "default": False,
-                },
-                "pane_width": {
-                    "type": "integer",
-                    "description": "Pane width for action=start (columns).",
-                    "minimum": 20,
-                },
-                "pane_height": {
-                    "type": "integer",
-                    "description": "Pane height for action=start (rows).",
-                    "minimum": 10,
-                },
-            },
-            required=["action"],
-        )
-
-    def is_available(self) -> tuple[bool, str | None]:
-        return True, None
-
-    async def execute(self, **kwargs: Any) -> ToolResult:
-        # This tool is intended to be executed via ToolExecutor -> sandbox server.
-        # We keep a safe fallback for non-sandbox contexts.
-        action = str(kwargs.get("action") or "").strip()
-        return ToolResult(
-            success=False,
-            error=f"tmux tool must be executed in the sandbox (got action={action!r})",
-        )
@@ -1,500 +0,0 @@
-"""
-ToolExecutor - queued, batched tool dispatch for multiplexed agent trajectories.
-
-This component is responsible for:
-- Maintaining trajectory -> Slot affinity (workspace continuity)
-- Batching sandbox tool calls across trajectories to maximize container utilization
-- Routing external tools (ToolSchema.external=True) to a ToolServer (Phase 4.5)
-
-For now, only sandbox tools are executed:
-- bash
-- read_file
-- write_file
-"""
-
-from __future__ import annotations
-
-import asyncio
-import time
-from dataclasses import dataclass
-from typing import Any, Dict, List, Optional
-
-import httpx
-
-from .base import (
-    ArtifactArchiveRequestPayload,
-    ArtifactArchiveResponsePayload,
-    ArtifactListRequestPayload,
-    ArtifactListResponsePayload,
-    ArtifactReadRequestPayload,
-    ArtifactReadResponsePayload,
-    ToolCall,
-    ToolCallPayload,
-    ToolRegistry,
-    ToolResult,
-    ToolResultPayload,
-    ToolServerExecuteRequest,
-)
-from ..backends.base import ToolBackend
-from ..slots import Slot
-
-
-@dataclass
-class ToolExecutorConfig:
-    batch_window_ms: int = 20
-    max_batch_size: int = 200
-    allow_network: bool = True
-    require_sandbox: bool = False
-    require_stateful_sandbox: bool = False
-    tool_server_url: Optional[str] = None
-    tool_server_token: Optional[str] = None
-
-
-@dataclass
-class _QueuedToolRequest:
-    trajectory_id: str
-    call: ToolCall
-    timeout_s: Optional[float]
-    future: asyncio.Future
-
-
-class ToolExecutor:
-    def __init__(
-        self,
-        backend: ToolBackend,
-        tools: ToolRegistry,
-        config: Optional[ToolExecutorConfig] = None,
-    ) -> None:
-        self.backend = backend
-        self.tools = tools
-        self.config = config or ToolExecutorConfig()
-
-        self._queue: asyncio.Queue[Optional[_QueuedToolRequest]] = asyncio.Queue()
-        self._task: Optional[asyncio.Task] = None
-        self._stopping = asyncio.Event()
-
-        self._slots_lock = asyncio.Lock()
-        self._slot_by_trajectory: Dict[str, Slot] = {}
-
-        self._tool_server_client: Optional[httpx.AsyncClient] = None
-        self._tool_server_lock = asyncio.Lock()
-
-        # lightweight stats for status endpoints
-        self.total_requests: int = 0
-        self.total_errors: int = 0
-        self.latencies_s: List[float] = []
-
-    async def start(self) -> None:
-        if self._task is None:
-            self._task = asyncio.create_task(self._run_loop())
-
-    def queue_size(self) -> int:
-        return self._queue.qsize()
-
-    async def close(self) -> None:
-        self._stopping.set()
-        await self._queue.put(None)
-        if self._task:
-            await self._task
-            self._task = None
-
-        client = self._tool_server_client
-        self._tool_server_client = None
-        if client is not None:
-            await client.aclose()
-
-        # Best-effort release any remaining slots.
-        async with self._slots_lock:
-            slots = list(self._slot_by_trajectory.items())
-            self._slot_by_trajectory.clear()
-
-        for _, slot in slots:
-            try:
-                await self.backend.release(slot, reset_workspace=False)
-            except Exception:
-                pass
-
-    async def execute(
-        self,
-        trajectory_id: str,
-        call: ToolCall,
-        timeout_s: Optional[float] = None,
-    ) -> ToolResult:
-        if self._task is None:
-            raise RuntimeError("ToolExecutor not started (call start() first)")
-
-        # Allow tool args to suggest a timeout (Hermes-compatible terminal tool),
-        # but never let the model choose "infinite" timeouts.
-        if timeout_s is None:
-            raw_timeout = call.arguments.get("timeout")
-            if isinstance(raw_timeout, (int, float)):
-                timeout_s = float(raw_timeout)
-        if timeout_s is not None:
-            timeout_s = max(1.0, min(float(timeout_s), 600.0))
-
-        loop = asyncio.get_running_loop()
-        fut: asyncio.Future = loop.create_future()
-        started = time.perf_counter()
-        await self._queue.put(_QueuedToolRequest(trajectory_id=trajectory_id, call=call, timeout_s=timeout_s, future=fut))
-        try:
-            result: ToolResult = await fut
-            return result
-        finally:
-            self.latencies_s.append(time.perf_counter() - started)
-
-    async def release_trajectory(self, trajectory_id: str, reset_workspace: bool = False) -> None:
-        async with self._slots_lock:
-            slot = self._slot_by_trajectory.pop(trajectory_id, None)
-
-        if slot is not None:
-            await self.backend.release(slot, reset_workspace=reset_workspace)
-
-    async def _get_slot_if_present(self, trajectory_id: str) -> Optional[Slot]:
-        async with self._slots_lock:
-            return self._slot_by_trajectory.get(trajectory_id)
-
-    # ---------------------------------------------------------------------
-    # Artifact helpers (optional)
-    # ---------------------------------------------------------------------
-
-    async def read_artifact(self, req: ArtifactReadRequestPayload) -> ArtifactReadResponsePayload:
-        slot = await self._get_slot_if_present(req.trajectory_id)
-        if slot is None:
-            return ArtifactReadResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
-        data = await self.backend.read_artifact(
-            slot,
-            req.path,
-            encoding=req.encoding,
-            max_bytes=req.max_bytes,
-            include_sha256=req.include_sha256,
-        )
-        if isinstance(data, dict):
-            data = dict(data)
-            data.pop("http_status", None)
-        try:
-            return ArtifactReadResponsePayload(**(data or {}))
-        except Exception as e:
-            return ArtifactReadResponsePayload(success=False, error=f"Invalid artifact read response: {e}")
-
-    async def list_artifacts(self, req: ArtifactListRequestPayload) -> ArtifactListResponsePayload:
-        slot = await self._get_slot_if_present(req.trajectory_id)
-        if slot is None:
-            return ArtifactListResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
-        data = await self.backend.list_artifacts(
-            slot,
-            req.path,
-            recursive=req.recursive,
-            max_entries=req.max_entries,
-        )
-        if isinstance(data, dict):
-            data = dict(data)
-            data.pop("http_status", None)
-        try:
-            return ArtifactListResponsePayload(**(data or {}))
-        except Exception as e:
-            return ArtifactListResponsePayload(success=False, error=f"Invalid artifact list response: {e}")
-
-    async def archive_artifacts(self, req: ArtifactArchiveRequestPayload) -> ArtifactArchiveResponsePayload:
-        slot = await self._get_slot_if_present(req.trajectory_id)
-        if slot is None:
-            return ArtifactArchiveResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
-        data = await self.backend.archive_artifacts(
-            slot,
-            req.path,
-            archive_format=req.format,
-            max_bytes=req.max_bytes,
-            max_entries=req.max_entries,
-        )
-        if isinstance(data, dict):
-            data = dict(data)
-            data.pop("http_status", None)
-        try:
-            return ArtifactArchiveResponsePayload(**(data or {}))
-        except Exception as e:
-            return ArtifactArchiveResponsePayload(success=False, error=f"Invalid artifact archive response: {e}")
-
-    async def _get_or_acquire_slot(self, trajectory_id: str) -> Slot:
-        async with self._slots_lock:
-            existing = self._slot_by_trajectory.get(trajectory_id)
-            if existing is not None:
-                return existing
-
-        slot = await self.backend.acquire(trajectory_id)
-
-        async with self._slots_lock:
-            existing = self._slot_by_trajectory.get(trajectory_id)
-            if existing is not None:
-                # Another coroutine won the race; return its slot.
-                await self.backend.release(slot, reset_workspace=False)
-                return existing
-            self._slot_by_trajectory[trajectory_id] = slot
-            return slot
-
-    async def _run_loop(self) -> None:
-        pending: List[_QueuedToolRequest] = []
-        deadline: Optional[float] = None
-
-        batch_window_s = max(0.0, self.config.batch_window_ms / 1000.0)
-        max_batch = max(1, self.config.max_batch_size)
-
-        while True:
-            if self._stopping.is_set() and self._queue.empty() and not pending:
-                break
-
-            timeout = None
-            if pending and deadline is not None:
-                timeout = max(0.0, deadline - time.perf_counter())
-
-            try:
-                item = await asyncio.wait_for(self._queue.get(), timeout=timeout)
-                if item is None:
-                    continue
-                pending.append(item)
-                if len(pending) == 1:
-                    deadline = time.perf_counter() + batch_window_s
-                if len(pending) < max_batch:
-                    continue
-            except asyncio.TimeoutError:
-                # batch window elapsed
-                pass
-
-            if not pending:
-                deadline = None
-                continue
-
-            batch = pending
-            pending = []
-            deadline = None
-
-            await self._execute_batch(batch)
-
-    async def _get_tool_server_client(self) -> httpx.AsyncClient:
-        url = self.config.tool_server_url
-        if not url:
-            raise RuntimeError("ToolServer not configured")
-
-        if self._tool_server_client is not None:
-            return self._tool_server_client
-
-        async with self._tool_server_lock:
-            if self._tool_server_client is None:
-                self._tool_server_client = httpx.AsyncClient(base_url=url.rstrip("/"))
-            return self._tool_server_client
-
-    def _tool_server_headers(self) -> Dict[str, str]:
-        token = self.config.tool_server_token
-        if not token:
-            return {}
-        return {"Authorization": f"Bearer {token}"}
-
-    async def _execute_external(self, req: _QueuedToolRequest) -> ToolResult:
-        client = await self._get_tool_server_client()
-        slot_id: Optional[str] = None
-        container_addr: Optional[str] = None
-        slot = await self._get_slot_if_present(req.trajectory_id)
-        if slot is not None:
-            slot_id = slot.slot_id
-            container_addr = slot.container_addr
-
-        payload = ToolServerExecuteRequest(
-            trajectory_id=req.trajectory_id,
-            tool=ToolCallPayload.from_tool_call(req.call),
-            timeout_s=req.timeout_s,
-            slot_id=slot_id,
-            container_addr=container_addr,
-        )
-
-        try:
-            resp = await client.post(
-                "/execute",
-                json=payload.model_dump(),
-                headers=self._tool_server_headers(),
-                timeout=req.timeout_s,
-            )
-            resp.raise_for_status()
-            data = resp.json()
-            parsed = ToolResultPayload(**data)
-            result = parsed.to_tool_result()
-            if result.uniq_id is None:
-                result.uniq_id = req.call.uniq_id
-            return result
-        except Exception as e:
-            return ToolResult(
-                success=False,
-                error=f"External tool failed: {e}",
-                uniq_id=req.call.uniq_id,
-            )
-
-    async def _execute_batch(self, batch: List[_QueuedToolRequest]) -> None:
-        # Resolve tool schemas once per request and separate sandbox/external/unknown.
-        sandbox_items: List[_QueuedToolRequest] = []
-        external_items: List[_QueuedToolRequest] = []
-        unknown_items: List[_QueuedToolRequest] = []
-
-        for it in batch:
-            tool = self.tools.get(it.call.name)
-            if tool is None:
-                unknown_items.append(it)
-                continue
-
-            schema = tool.schema
-            if not schema.external:
-                sandbox_items.append(it)
-            else:
-                external_items.append(it)
-
-        for it in unknown_items:
-            self.total_requests += 1
-            self.total_errors += 1
-            if not it.future.done():
-                it.future.set_result(
-                    ToolResult(
-                        success=False,
-                        error=f"Unknown tool: {it.call.name}",
-                        uniq_id=it.call.uniq_id,
-                    )
-                )
-
-        if external_items:
-            if not self.config.tool_server_url:
-                for it in external_items:
-                    self.total_requests += 1
-                    self.total_errors += 1
-                    if not it.future.done():
-                        it.future.set_result(
-                            ToolResult(
-                                success=False,
-                                error=f"External tool not available (ToolServer not configured): {it.call.name}",
-                                uniq_id=it.call.uniq_id,
-                            )
-                        )
-            else:
-                results = await asyncio.gather(*[self._execute_external(it) for it in external_items])
-                for it, res in zip(external_items, results):
-                    self.total_requests += 1
-                    if not getattr(res, "success", False):
-                        self.total_errors += 1
-                    if not it.future.done():
-                        it.future.set_result(res)
-
-        if not sandbox_items:
-            return
-
-        # Acquire slots for the distinct trajectories in this batch.
-        try:
-            traj_ids = list({it.trajectory_id for it in sandbox_items})
-            slots = await asyncio.gather(*[self._get_or_acquire_slot(tid) for tid in traj_ids])
-            slot_by_traj = dict(zip(traj_ids, slots))
-        except Exception as e:
-            for it in sandbox_items:
-                self.total_requests += 1
-                self.total_errors += 1
-                if not it.future.done():
-                    it.future.set_result(
-                        ToolResult(
-                            success=False,
-                            error=f"Failed to acquire slot: {e}",
-                            uniq_id=it.call.uniq_id,
-                        )
-                    )
-            return
-
-        # Group by timeout so we don't accidentally make short timeouts wait on long ones.
-        by_timeout: Dict[float, List[_QueuedToolRequest]] = {}
-        default_timeout = self.backend.default_timeout_s
-
-        for it in sandbox_items:
-            t = it.timeout_s
-            if t is None:
-                t = default_timeout
-            if t is None:
-                t = 30.0
-            by_timeout.setdefault(float(t), []).append(it)
-
-        for timeout_s, items in by_timeout.items():
-            requests = []
-            dispatched: List[_QueuedToolRequest] = []
-            for it in items:
-                slot = slot_by_traj[it.trajectory_id]
-                tool_name = it.call.name
-                args = dict(it.call.arguments)
-
-                # Hermes compatibility: treat `terminal` as an alias of sandbox `bash`.
-                if tool_name == "terminal":
-                    if args.get("background"):
-                        self.total_requests += 1
-                        self.total_errors += 1
-                        if not it.future.done():
-                            it.future.set_result(
-                                ToolResult(
-                                    success=False,
-                                    error="terminal background execution is not supported in sandbox",
-                                    uniq_id=it.call.uniq_id,
-                                )
-                            )
-                        continue
-                    tool_name = "bash"
-                    # `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
-                    args.pop("timeout", None)
-                elif tool_name == "terminal_stateful":
-                    tool_name = "bash_stateful"
-                    args.pop("timeout", None)
-                elif tool_name == "tmux":
-                    # `tmux` is a sandbox tool backed by the stateful session manager.
-                    # Network policy is env-controlled.
-                    args.pop("allow_network", None)
-
-                if tool_name == "bash":
-                    # Network policy is set by the environment/executor, not by the model.
-                    args.pop("allow_network", None)
-                    args.pop("require_sandbox", None)
-                    args["allow_network"] = bool(self.config.allow_network)
-                    args["require_sandbox"] = bool(self.config.require_sandbox)
-                    # `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
-                    args.pop("timeout", None)
-                elif tool_name == "bash_stateful":
-                    # Network policy is set by the environment/executor, not by the model.
-                    args.pop("allow_network", None)
-                    args.pop("require_sandbox", None)
-                    args.pop("require_stateful_sandbox", None)
-                    args["allow_network"] = bool(self.config.allow_network)
-                    args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
-                    args.pop("timeout", None)
-                elif tool_name == "tmux":
-                    # Network policy applies to the underlying stateful session.
-                    args.pop("allow_network", None)
-                    args.pop("require_sandbox", None)
-                    args.pop("require_stateful_sandbox", None)
-                    args["allow_network"] = bool(self.config.allow_network)
-                    args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
-
-                requests.append((slot, tool_name, args))
-                dispatched.append(it)
-
-            results = None
-            try:
-                if not dispatched:
-                    continue
-                results = await self.backend.execute_batch(requests, timeout_s=timeout_s)
-            except Exception as e:
-                for it in dispatched:
-                    self.total_requests += 1
-                    self.total_errors += 1
-                    if not it.future.done():
-                        it.future.set_result(
-                            ToolResult(
-                                success=False,
-                                error=f"Batch execution failed: {e}",
-                                uniq_id=it.call.uniq_id,
-                            )
-                        )
-                continue
-
-            for it, res in zip(dispatched, results):
-                self.total_requests += 1
-                if not getattr(res, "success", False):
-                    self.total_errors += 1
-                tool_result = res.to_tool_result()
-                tool_result.uniq_id = it.call.uniq_id
-                if not it.future.done():
-                    it.future.set_result(tool_result)
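The batching window used by `_run_loop` above can be sketched in isolation. This is a minimal illustration, not the removed implementation: `collect_batches` is a hypothetical helper, the queue carries plain strings instead of `_QueuedToolRequest` objects, and `None` is used here as a stop sentinel (the original loop skips `None` items and stops via a separate event).

```python
import asyncio
import time
from typing import List, Optional


async def collect_batches(
    queue: "asyncio.Queue[Optional[str]]",
    batch_window_s: float,
    max_batch: int,
) -> List[List[str]]:
    """Group queued items into batches bounded by size and by a time window."""
    batches: List[List[str]] = []
    pending: List[str] = []
    deadline: Optional[float] = None

    while True:
        # Block indefinitely while idle; once a batch is open, wait only
        # until its window closes.
        timeout = None
        if pending and deadline is not None:
            timeout = max(0.0, deadline - time.perf_counter())

        stop = False
        try:
            item = await asyncio.wait_for(queue.get(), timeout=timeout)
            if item is None:
                stop = True  # hypothetical sentinel: flush and exit
            else:
                pending.append(item)
                if len(pending) == 1:
                    # The first item of a batch opens the window.
                    deadline = time.perf_counter() + batch_window_s
                if len(pending) < max_batch:
                    continue
        except asyncio.TimeoutError:
            pass  # batch window elapsed

        if pending:
            batches.append(pending)
            pending = []
            deadline = None
        if stop:
            return batches


async def _demo() -> List[List[str]]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    for name in ["a", "b", "c", "d", "e"]:
        queue.put_nowait(name)
    queue.put_nowait(None)
    return await collect_batches(queue, batch_window_s=1.0, max_batch=2)


batches = asyncio.run(_demo())
```

With every item already queued, a batch closes as soon as it reaches `max_batch`. The time window only matters when producers are slower than the window.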
@@ -1,88 +0,0 @@
-"""
-Toolset resolution for Hermes-Agent Atropos integration.
-
-We primarily reuse Hermes-Agent toolsets (`toolsets.py`), but Atropos training/envs
-need a few extra sandbox-oriented toolsets that Hermes doesn't expose by default
-(e.g. filesystem + stateful terminal).
-"""
-
-from __future__ import annotations
-
-from typing import Any, Dict, List, Optional, Set
-
-import toolsets as hermes_toolsets
-
-
-ATROPOS_TOOLSETS: Dict[str, Dict[str, Any]] = {
-    "filesystem": {
-        "description": "Read/write files in the sandbox workspace.",
-        "tools": ["read_file", "write_file"],
-        "includes": [],
-    },
-    "terminal_stateful": {
-        "description": "Stateful terminal execution (tmux/TUI support) inside the sandbox.",
-        "tools": ["terminal_stateful", "tmux"],
-        "includes": [],
-    },
-    "sandbox": {
-        "description": "Sandbox tools (terminal + filesystem).",
-        "tools": [],
-        "includes": ["terminal", "filesystem"],
-    },
-    "default": {
-        "description": "Default toolset for Atropos AgentEnv tasks.",
-        "tools": [],
-        "includes": ["sandbox"],
-    },
-    "full": {
-        "description": "All Hermes tools plus Atropos sandbox additions.",
-        "tools": [],
-        "includes": ["all", "filesystem", "sandbox", "terminal_stateful"],
-    },
-}
-
-
-def validate_toolset(name: str) -> bool:
-    if name in {"all", "*"}:
-        return True
-    return hermes_toolsets.validate_toolset(name) or name in ATROPOS_TOOLSETS
-
-
-def resolve_toolset(name: str, visited: Optional[Set[str]] = None) -> List[str]:
-    if visited is None:
-        visited = set()
-
-    if name in {"all", "*"}:
-        # Union Hermes + Atropos toolsets.
-        all_tools: Set[str] = set()
-        for tname in hermes_toolsets.get_toolset_names():
-            all_tools.update(resolve_toolset(tname, visited=set()))
-        for tname, spec in ATROPOS_TOOLSETS.items():
-            # Avoid recursion: some Atropos toolsets (e.g. "full") include "all".
-            if tname == "full" or "all" in (spec.get("includes") or []):
-                continue
-            all_tools.update(resolve_toolset(tname, visited=set()))
-        return sorted(all_tools)
-
-    if name in ATROPOS_TOOLSETS:
-        if name in visited:
-            return []
-        visited.add(name)
-        spec = ATROPOS_TOOLSETS[name]
-        tools: Set[str] = set(spec.get("tools", []))
-        for inc in spec.get("includes", []):
-            tools.update(resolve_toolset(inc, visited=set(visited)))
-        return sorted(tools)
-
-    # Fall back to Hermes toolsets.
-    # IMPORTANT: do not pre-add `name` to `visited` here; Hermes' resolver uses
-    # `visited` for its own cycle detection and will treat the presence of `name`
-    # as a circular dependency.
-    return sorted(hermes_toolsets.resolve_toolset(name, visited=set(visited)))
-
-
-def resolve_multiple_toolsets(names: List[str]) -> List[str]:
-    tools: Set[str] = set()
-    for name in names:
-        tools.update(resolve_toolset(name, visited=set()))
-    return sorted(tools)
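The `includes` expansion and cycle guard in `resolve_toolset` above can be shown with a small self-contained sketch. The `TOOLSETS` table and the `resolve` function below are hypothetical stand-ins: the removed module delegates unknown names to Hermes' own resolver, which this sketch replaces with a plain dictionary lookup.

```python
from typing import Dict, List, Optional, Set

# Hypothetical table for illustration only. "terminal" stands in for a
# toolset the removed code obtains from Hermes; "loop_a"/"loop_b" exist
# purely to exercise the cycle guard.
TOOLSETS: Dict[str, Dict[str, List[str]]] = {
    "terminal": {"tools": ["terminal"], "includes": []},
    "filesystem": {"tools": ["read_file", "write_file"], "includes": []},
    "sandbox": {"tools": [], "includes": ["terminal", "filesystem"]},
    "default": {"tools": [], "includes": ["sandbox"]},
    "loop_a": {"tools": ["x"], "includes": ["loop_b"]},
    "loop_b": {"tools": ["y"], "includes": ["loop_a"]},
}


def resolve(name: str, visited: Optional[Set[str]] = None) -> List[str]:
    """Expand a toolset into a sorted list of tool names."""
    if visited is None:
        visited = set()
    # A name already on the current path means a cycle; unknown names
    # resolve to nothing in this sketch.
    if name in visited or name not in TOOLSETS:
        return []
    visited.add(name)

    spec = TOOLSETS[name]
    tools: Set[str] = set(spec["tools"])
    for inc in spec["includes"]:
        # Pass a copy so sibling includes do not see each other's path.
        tools.update(resolve(inc, visited=set(visited)))
    return sorted(tools)


default_tools = resolve("default")
cyclic_tools = resolve("loop_a")
```

Copying `visited` per branch means a toolset reachable through two different includes is still expanded on both paths, while a toolset that includes itself transitively stops at the second visit.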
@@ -1,415 +0,0 @@
-#!/usr/bin/env python3
-"""
-Atropos-compatible Hermes agent runner.
-
-This is a minimal subclass of Hermes-Agent's `AIAgent` that swaps the OpenAI
-function-calling backend for Atroposlib's `ManagedServer`/`ServerManager` backend
-and uses Hermes-style XML tool tags:
-
- <tool_call>{"name": "...", "arguments": {...}}</tool_call>
- <tool_response>{...}</tool_response>
-
-Tool observations are appended as `role="user"` messages containing one or more
-`<tool_response>` blocks so they survive common chat templates during tokenization.
-"""
-
-from __future__ import annotations
-
-import asyncio
-import json
-import re
-import time
-import warnings
-import os
-from contextlib import asynccontextmanager
-from typing import Any, AsyncGenerator, Dict, List, Optional, Tuple
-
-from model_tools import cleanup_vm, handle_function_call
-from run_agent import AIAgent
-
-_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
-
-
-ATROPOS_TOOL_SYSTEM_PROMPT = """You are a helpful AI assistant with access to tools.
-
-## Available Tools
-<tools>
-{tool_descriptions}
-</tools>
-
-## How to Use Tools
-To call a tool, output:
-<tool_call>{{"name": "tool_name", "arguments": {{"arg1": "value1"}}}}</tool_call>
-
-You may include optional reasoning in <think>...</think> before tool calls.
-
-After each tool call, you will receive tool results as:
-<tool_response>{{...}}</tool_response>
-
-Continue until finished, then provide a final response with no <tool_call> blocks.
-"""
-
-
-class AtroposAIAgent(AIAgent):
-    """
-    Hermes `AIAgent` variant that uses Atroposlib ServerManager/ManagedServer.
-
-    Notes:
-    - The default Hermes `AIAgent` remains unchanged; this class is opt-in.
-    - The underlying server must expose `managed_server(tokenizer=...)` OR be a single
-      APIServer-compatible object usable by Atroposlib's `ManagedServer`.
-    """
-
-    def __init__(
-        self,
-        *,
-        server: Any,
-        tokenizer: Any = None,
-        model: str = "local",
-        max_iterations: int = 10,
-        tool_delay: float = 0.0,
-        enabled_toolsets: Optional[List[str]] = None,
-        disabled_toolsets: Optional[List[str]] = None,
-        save_trajectories: bool = False,
-        verbose_logging: bool = False,
-        quiet_mode: bool = False,
-        ephemeral_system_prompt: Optional[str] = None,
-        log_prefix_chars: int = 100,
-        log_prefix: str = "",
-        session_id: Optional[str] = None,
-        temperature: Optional[float] = None,
-        max_tokens: Optional[int] = None,
-    ):
-        # Call parent init mainly to reuse tool selection + trajectory saving utilities.
-        super().__init__(
-            base_url="http://unused",
-            api_key="dummy-key",
-            model=model,
-            max_iterations=max_iterations,
-            tool_delay=tool_delay,
-            enabled_toolsets=enabled_toolsets,
-            disabled_toolsets=disabled_toolsets,
-            save_trajectories=save_trajectories,
-            verbose_logging=verbose_logging,
-            quiet_mode=quiet_mode,
-            ephemeral_system_prompt=ephemeral_system_prompt,
-            log_prefix_chars=log_prefix_chars,
-            log_prefix=log_prefix,
-            session_id=session_id,
-        )
-
-        self.server = server
-        self.tokenizer = tokenizer
-        self.temperature = temperature
-        self.max_tokens = max_tokens
-
-    @asynccontextmanager
-    async def _managed(self) -> AsyncGenerator[Any, None]:
-        if hasattr(self.server, "managed_server"):
-            with warnings.catch_warnings():
-                warnings.filterwarnings(
-                    "ignore",
-                    message=r"Using OpenAIServer with managed_server does not allow for state tracking",
-                    category=UserWarning,
-                )
-                async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
-                    yield managed
-            return
-
-        # Fall back to directly wrapping a single server object.
-        from atroposlib.envs.server_handling.managed_server import ManagedServer
-
-        managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
-        try:
-            yield managed
-        finally:
-            managed.reset()
-
-    def _tool_descriptions_text(self) -> str:
-        if not self.tools:
-            return "(no tools available)"
-
-        parts: List[str] = []
-        for tool in self.tools:
-            fn = (tool or {}).get("function", {})
-            name = fn.get("name", "")
-            desc = (fn.get("description") or "").strip()
-            if not name:
-                continue
-            if desc:
-                parts.append(f"- {name}: {desc}")
-            else:
-                parts.append(f"- {name}")
-        return "\n".join(parts) if parts else "(no tools available)"
-
-    def _build_system_prompt(self, system_message: Optional[str]) -> Optional[str]:
-        tool_prompt = ATROPOS_TOOL_SYSTEM_PROMPT.format(
-            tool_descriptions=self._tool_descriptions_text()
-        )
-
-        parts: List[str] = []
-        if system_message:
-            parts.append(system_message)
-        if self.ephemeral_system_prompt:
-            parts.append(self.ephemeral_system_prompt)
-        parts.append(tool_prompt)
-
-        return "\n\n".join(parts)
-
-    def _parse_tool_calls(self, content: str) -> Tuple[List[Tuple[str, Dict[str, Any]]], List[str]]:
-        """
-        Returns:
-          (calls, errors)
-        """
-        calls: List[Tuple[str, Dict[str, Any]]] = []
-        errors: List[str] = []
-
-        for raw in _TOOL_CALL_RE.findall(content or ""):
-            try:
-                payload = json.loads(raw)
-            except json.JSONDecodeError as exc:
-                errors.append(f"Invalid JSON inside <tool_call>: {exc}")
-                continue
-
-            name = payload.get("name")
-            args = payload.get("arguments", {})
-            if not isinstance(name, str) or not name:
-                errors.append("Tool call missing 'name' string")
-                continue
-            if not isinstance(args, dict):
-                errors.append("Tool call 'arguments' must be an object")
-                continue
-
-            calls.append((name, args))
-
-        return calls, errors
-
-    async def run_conversation_async(
-        self,
-        user_message: str,
-        system_message: Optional[str] = None,
-        conversation_history: Optional[List[Dict[str, Any]]] = None,
-        task_id: Optional[str] = None,
-    ) -> Dict[str, Any]:
-        import uuid
-
-        effective_task_id = task_id or str(uuid.uuid4())
-
-        messages: List[Dict[str, Any]] = conversation_history.copy() if conversation_history else []
-        messages.append({"role": "user", "content": user_message})
-
-        active_system_prompt = self._build_system_prompt(system_message)
-
-        api_call_count = 0
-        final_response: Optional[str] = None
-        managed_state: Optional[Dict[str, Any]] = None
-        completed = False
-
-        try:
-            async with self._managed() as managed:
-                while api_call_count < self.max_iterations:
-                    api_call_count += 1
-
-                    api_messages = messages.copy()
-                    if active_system_prompt:
-                        api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages
-
-                    chat_kwargs: Dict[str, Any] = {"messages": api_messages, "n": 1}
-                    if self.max_tokens is not None:
-                        chat_kwargs["max_tokens"] = self.max_tokens
-                    if self.temperature is not None:
-                        chat_kwargs["temperature"] = self.temperature
-
-                    # Prefer OpenAI tool calling when supported by the backend:
-                    # - Many providers normalize Hermes-style <tool_call> tags into tool_calls when `tools` is provided.
-                    # - ManagedServer (atroposlib) does prompt->completion conversion and does not support `tools`.
-                    #   Only pass `tools` when we're calling an OpenAI-compatible chat endpoint directly.
-                    tool_schemas = self.tools if self.tools else None
-                    managed_cls = type(managed).__name__
-                    if tool_schemas and managed_cls != "ManagedServer":
-                        chat_kwargs["tools"] = tool_schemas
-
-                    if os.getenv("HERMES_DEBUG_ATROPOS_REQUEST") == "1":
-                        meta = {
-                            "managed_type": managed_cls,
-                            "model": getattr(getattr(managed, "config", None), "model_name", self.model),
-                            "base_url": getattr(getattr(managed, "config", None), "base_url", None),
-                            "kwargs": chat_kwargs,
-                        }
-                        # Avoid dumping megabytes of data accidentally.
-                        # (Messages can be large; this is still "full" but bounded.)
-                        print("\n=== HERMES_DEBUG_ATROPOS_REQUEST ===", flush=True)
-                        print(json.dumps(meta, ensure_ascii=False, indent=2)[:200_000], flush=True)
-
-                    response = await managed.chat_completion(**chat_kwargs)
-
-                    if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
-                        try:
-                            dumped = response.model_dump()  # openai pydantic model
-                        except Exception:
-                            dumped = getattr(response, "__dict__", {"repr": repr(response)})
-                        print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: ChatCompletion (raw) ===", flush=True)
-                        print(json.dumps(dumped, ensure_ascii=False, indent=2), flush=True)
-
-                    if hasattr(managed, "get_state"):
-                        managed_state = managed.get_state()
-
-                    msg = response.choices[0].message
-                    assistant_content = (msg.content or "")
-                    msg_reasoning = getattr(msg, "reasoning", None)
-
-                    # Use tool_calls if the backend provides them (preferred).
-                    structured_tool_calls = getattr(msg, "tool_calls", None)
-
-                    # If the backend emits content="" but includes useful text in reasoning,
-                    # use it for parsing *only if needed* (e.g. tool tags).
-                    if assistant_content == "" and isinstance(msg_reasoning, str) and msg_reasoning:
-                        if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
-                            print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: message.reasoning present (content empty) ===", flush=True)
-                            print(msg_reasoning, flush=True)
-
-                    assistant_msg: Dict[str, Any] = {"role": "assistant", "content": assistant_content}
-                    if structured_tool_calls:
-                        # Preserve tool_calls so the next request is consistent with OpenAI protocol.
-                        try:
-                            assistant_msg["tool_calls"] = [
-                                {
-                                    "id": tc.id,
-                                    "type": tc.type,
-                                    "function": {"name": tc.function.name, "arguments": tc.function.arguments},
-                                }
-                                for tc in structured_tool_calls
-                            ]
-                        except Exception:
-                            # Best-effort; keep conversation moving.
-                            pass
-                    messages.append(assistant_msg)
-
-                    # Mode A: OpenAI tool calling (preferred when supported)
-                    if structured_tool_calls:
-                        for tc in structured_tool_calls:
-                            tool_start = time.time()
-                            try:
-                                tool_args = json.loads(tc.function.arguments or "{}")
-                            except Exception:
-                                tool_args = {}
-                            tool_result = handle_function_call(tc.function.name, tool_args, effective_task_id)
-                            tool_duration = time.time() - tool_start
-
-                            # Keep the raw tool result as tool content (OpenAI protocol expects role=tool).
-                            messages.append(
-                                {
-                                    "role": "tool",
-                                    "tool_call_id": tc.id,
-                                    "content": tool_result,
-                                }
-                            )
-
-                            if self.tool_delay and self.tool_delay > 0:
-                                await asyncio.sleep(self.tool_delay)
-
-                        # Continue loop after tool execution.
-                        continue
-
-                    # Mode B: Hermes XML tool tags in assistant text (fallback).
-                    parse_source = assistant_content or (msg_reasoning or "")
-                    tool_calls, parse_errors = self._parse_tool_calls(parse_source)
-
-                    if parse_errors and not tool_calls:
-                        # Ask the model to retry with valid tool JSON.
-                        err_text = "; ".join(parse_errors[:3])
-                        messages.append(
-                            {
-                                "role": "user",
-                                "content": (
-                                    f"<tool_response>{json.dumps({'error': err_text}, ensure_ascii=False)}</tool_response>\n"
-                                    "The previous <tool_call> blocks were invalid. Please output valid JSON inside <tool_call>."
-                                ),
-                            }
-                        )
-                        continue
-
-                    if not tool_calls:
-                        # No tool calls: treat as final answer.
-                        final_response = (assistant_content or "").strip()
-                        completed = True
-                        break
-
-                    tool_responses: List[str] = []
-                    for tool_name, tool_args in tool_calls:
-                        tool_start = time.time()
-                        tool_result = handle_function_call(tool_name, tool_args, effective_task_id)
-                        tool_duration = time.time() - tool_start
-
-                        try:
-                            parsed = json.loads(tool_result)
-                            payload: Any = parsed
-                        except Exception:
-                            payload = tool_result
-
-                        tool_payload = {
-                            "name": tool_name,
-                            "duration_s": round(tool_duration, 3),
-                            "result": payload,
-                        }
-                        tool_responses.append(
-                            f"<tool_response>{json.dumps(tool_payload, ensure_ascii=False)}</tool_response>"
-                        )
-
-                        if self.tool_delay and self.tool_delay > 0:
-                            await asyncio.sleep(self.tool_delay)
-
-                    messages.append({"role": "user", "content": "\n".join(tool_responses)})
-
-                if final_response is None:
-                    final_response = "I've reached the maximum number of iterations."
-
-        finally:
-            try:
-                cleanup_vm(effective_task_id)
-            except Exception:
-                pass
-
-        # Save trajectory using Hermes formatting (optional).
-        self._save_trajectory(messages, user_message, completed=completed)
-
-        return {
-            "final_response": final_response,
-            "messages": messages,
-            "api_calls": api_call_count,
-            "completed": completed,
-            "managed_state": managed_state,
-            "system_prompt": active_system_prompt,
-            "task_id": effective_task_id,
-        }
-
-    def run_conversation(self, *args: Any, **kwargs: Any) -> Dict[str, Any]:
-        """
-        Sync wrapper for convenience.
-
-        If called from within a running event loop (e.g. prompt_toolkit), this
-        runs the async conversation in a dedicated thread to avoid nested loops.
-        """
-        try:
-            asyncio.get_running_loop()
-        except RuntimeError:
-            return asyncio.run(self.run_conversation_async(*args, **kwargs))
-
-        import queue
-        import threading
-
-        out: "queue.Queue[object]" = queue.Queue(maxsize=1)
-
-        def runner() -> None:
-            try:
-                out.put(asyncio.run(self.run_conversation_async(*args, **kwargs)))
-            except BaseException as exc:  # noqa: BLE001
-                out.put(exc)
-
-        thread = threading.Thread(target=runner, daemon=True)
-        thread.start()
-
-        result = out.get()
-        if isinstance(result, BaseException):
-            raise result
-        return result  # type: ignore[return-value]
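The XML-tag fallback ("Mode B") in the removed runner relies on extracting JSON payloads from `<tool_call>` blocks. A minimal standalone sketch of that parsing step follows. The function name `parse_tool_calls` and the sample text are illustrative, and the `isinstance(payload, dict)` check is an addition not present in the removed code.

```python
import json
import re
from typing import Any, Dict, List, Tuple

# Non-greedy match so several <tool_call> blocks in one message are
# captured separately; DOTALL lets the JSON span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(content: str) -> Tuple[List[Tuple[str, Dict[str, Any]]], List[str]]:
    """Return (calls, errors) extracted from Hermes-style tool tags."""
    calls: List[Tuple[str, Dict[str, Any]]] = []
    errors: List[str] = []

    for raw in TOOL_CALL_RE.findall(content or ""):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"Invalid JSON inside <tool_call>: {exc}")
            continue

        # Added guard (assumption): valid JSON that is not an object,
        # e.g. a list, would otherwise fail on `.get`.
        if not isinstance(payload, dict):
            errors.append("Tool call payload must be a JSON object")
            continue

        name = payload.get("name")
        args = payload.get("arguments", {})
        if not isinstance(name, str) or not name:
            errors.append("Tool call missing 'name' string")
            continue
        if not isinstance(args, dict):
            errors.append("Tool call 'arguments' must be an object")
            continue

        calls.append((name, args))

    return calls, errors


sample = (
    "I will list the files.\n"
    '<tool_call>\n{"name": "terminal", "arguments": {"command": "ls"}}\n</tool_call>\n'
    "<tool_call>{not json}</tool_call>"
)
calls, errors = parse_tool_calls(sample)
```

Returning errors alongside the valid calls lets the caller decide whether to execute what parsed cleanly or to ask the model to retry, which is the choice the removed loop makes when `parse_errors` is non-empty and `tool_calls` is empty.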
@@ -41,24 +41,17 @@ from toolset_distributions import (
    sample_toolsets_from_distribution,
    validate_distribution
 )
+from model_tools import TOOL_TO_TOOLSET_MAP


 # Global configuration for worker processes
 _WORKER_CONFIG = {}

-# All possible tools - used to ensure consistent schema across all trajectory entries
-# This is required because Arrow/Parquet (used by HuggingFace datasets) needs identical schemas
-ALL_POSSIBLE_TOOLS = {
-    'terminal', 'web_search', 'web_extract',
-    'vision_analyze', 'image_generate', 'mixture_of_agents',
-    # Skills tools
-    'skills_categories', 'skills_list', 'skill_view',
-    # Browser automation tools
-    'browser_navigate', 'browser_snapshot', 'browser_click',
-    'browser_type', 'browser_scroll', 'browser_back',
-    'browser_press', 'browser_close', 'browser_get_images',
-    'browser_vision'
-}
+# All possible tools - auto-derived from the master mapping in model_tools.py.
+# This stays in sync automatically when new tools are added to TOOL_TO_TOOLSET_MAP.
+# Used for consistent schema in Arrow/Parquet (HuggingFace datasets) and for
+# filtering corrupted entries during trajectory combination.
+ALL_POSSIBLE_TOOLS = set(TOOL_TO_TOOLSET_MAP.keys())

 # Default stats for tools that weren't used
 DEFAULT_TOOL_STATS = {'count': 0, 'success': 0, 'failure': 0}
@@ -200,6 +193,42 @@ def _extract_tool_stats(messages: List[Dict[str, Any]]) -> Dict[str, Dict[str, i
    return tool_stats


+def _extract_reasoning_stats(messages: List[Dict[str, Any]]) -> Dict[str, int]:
+    """
+    Count how many assistant turns have reasoning vs no reasoning.
+    
+    Checks for <REASONING_SCRATCHPAD> in content or a non-empty 'reasoning' field
+    (native thinking tokens). Returns counts for tracking reasoning coverage.
+    
+    Args:
+        messages: Message history
+        
+    Returns:
+        Dict with 'total_assistant_turns', 'turns_with_reasoning', 'turns_without_reasoning',
+        and 'has_any_reasoning' (True if at least one assistant turn has reasoning)
+    """
+    total = 0
+    with_reasoning = 0
+    
+    for msg in messages:
+        if msg.get("role") != "assistant":
+            continue
+        total += 1
+        
+        content = msg.get("content", "") or ""
+        has_scratchpad = "<REASONING_SCRATCHPAD>" in content
+        has_native_reasoning = bool((msg.get("reasoning") or "").strip())
+        
+        if has_scratchpad or has_native_reasoning:
+            with_reasoning += 1
+    
+    return {
+        "total_assistant_turns": total,
+        "turns_with_reasoning": with_reasoning,
+        "turns_without_reasoning": total - with_reasoning,
+        "has_any_reasoning": with_reasoning > 0,
+    }
+
+
 def _process_single_prompt(
    prompt_index: int,
    prompt_data: Dict[str, Any],
@@ -244,6 +273,10 @@ def _process_single_prompt(
            providers_ignored=config.get("providers_ignored"),
            providers_order=config.get("providers_order"),
            provider_sort=config.get("provider_sort"),
+            max_tokens=config.get("max_tokens"),
+            reasoning_config=config.get("reasoning_config"),
+            prefill_messages=config.get("prefill_messages"),
+            skip_context_files=True,  # Don't pollute trajectories with SOUL.md/AGENTS.md
        )

        # Run the agent with task_id to ensure each task gets its own isolated VM
@@ -252,6 +285,9 @@ def _process_single_prompt(
        # Extract tool usage statistics
        tool_stats = _extract_tool_stats(result["messages"])
        
+        # Extract reasoning coverage stats
+        reasoning_stats = _extract_reasoning_stats(result["messages"])
+        
        # Convert to trajectory format (using existing method)
        trajectory = agent._convert_to_trajectory_format(
            result["messages"],
@@ -264,6 +300,7 @@ def _process_single_prompt(
            "prompt_index": prompt_index,
            "trajectory": trajectory,
            "tool_stats": tool_stats,
+            "reasoning_stats": reasoning_stats,
            "completed": result["completed"],
            "partial": result.get("partial", False),
            "api_calls": result["api_calls"],
@@ -332,7 +369,9 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
    
    # Initialize aggregated stats for this batch
    batch_tool_stats = {}
+    batch_reasoning_stats = {"total_assistant_turns": 0, "turns_with_reasoning": 0, "turns_without_reasoning": 0}
    completed_in_batch = []
+    discarded_no_reasoning = 0
    
    # Process each prompt sequentially in this batch
    for prompt_index, prompt_data in prompts_to_process:
@@ -346,6 +385,13 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
        
        # Save trajectory if successful
        if result["success"] and result["trajectory"]:
+            # Discard samples with zero reasoning across all turns
+            reasoning = result.get("reasoning_stats", {})
+            if not reasoning.get("has_any_reasoning", True):
+                print(f"   🚫 Prompt {prompt_index} discarded (no reasoning in any turn)")
+                discarded_no_reasoning += 1
+                continue
+            
            # Get and normalize tool stats for consistent schema across all entries
            raw_tool_stats = result.get("tool_stats", {})
            tool_stats = _normalize_tool_stats(raw_tool_stats)
@@ -386,6 +432,10 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
            batch_tool_stats[tool_name]["success"] += stats["success"]
            batch_tool_stats[tool_name]["failure"] += stats["failure"]
        
+        # Aggregate reasoning stats
+        for key in batch_reasoning_stats:
+            batch_reasoning_stats[key] += result.get("reasoning_stats", {}).get(key, 0)
+        
        # Only mark as completed if successfully saved (failed prompts can be retried on resume)
        if result["success"] and result["trajectory"]:
            completed_in_batch.append(prompt_index)
@@ -401,6 +451,8 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
        "processed": len(prompts_to_process),
        "skipped": len(batch_data) - len(prompts_to_process),
        "tool_stats": batch_tool_stats,
+        "reasoning_stats": batch_reasoning_stats,
+        "discarded_no_reasoning": discarded_no_reasoning,
        "completed_prompts": completed_in_batch
    }

@@ -428,6 +480,10 @@ class BatchRunner:
        providers_ignored: List[str] = None,
        providers_order: List[str] = None,
        provider_sort: str = None,
+        max_tokens: int = None,
+        reasoning_config: Dict[str, Any] = None,
+        prefill_messages: List[Dict[str, Any]] = None,
+        max_samples: int = None,
    ):
        """
        Initialize the batch runner.
@@ -449,6 +505,10 @@ class BatchRunner:
            providers_ignored (List[str]): OpenRouter providers to ignore (optional)
            providers_order (List[str]): OpenRouter providers to try in order (optional)
            provider_sort (str): Sort providers by price/throughput/latency (optional)
+            max_tokens (int): Maximum tokens for model responses (optional, uses model default if not set)
+            reasoning_config (Dict): OpenRouter reasoning config override (e.g. {"effort": "none"} to disable thinking)
+            prefill_messages (List[Dict]): Messages to prepend as prefilled conversation context (few-shot priming)
+            max_samples (int): Only process the first N samples from the dataset (optional, processes all if not set)
        """
        self.dataset_file = Path(dataset_file)
        self.batch_size = batch_size
@@ -466,6 +526,10 @@ class BatchRunner:
        self.providers_ignored = providers_ignored
        self.providers_order = providers_order
        self.provider_sort = provider_sort
+        self.max_tokens = max_tokens
+        self.reasoning_config = reasoning_config
+        self.prefill_messages = prefill_messages
+        self.max_samples = max_samples
        
        # Validate distribution
        if not validate_distribution(distribution):
@@ -481,8 +545,12 @@ class BatchRunner:
        # Statistics file
        self.stats_file = self.output_dir / "statistics.json"
        
-        # Load dataset
+        # Load dataset (and optionally truncate to max_samples)
        self.dataset = self._load_dataset()
+        if self.max_samples and self.max_samples < len(self.dataset):
+            full_count = len(self.dataset)
+            self.dataset = self.dataset[:self.max_samples]
+            print(f"✂️  Truncated dataset from {full_count} to {self.max_samples} samples (--max_samples)")
        
        # Create batches
        self.batches = self._create_batches()
@@ -735,6 +803,9 @@ class BatchRunner:
            "providers_ignored": self.providers_ignored,
            "providers_order": self.providers_order,
            "provider_sort": self.provider_sort,
+            "max_tokens": self.max_tokens,
+            "reasoning_config": self.reasoning_config,
+            "prefill_messages": self.prefill_messages,
        }
        
        # For backward compatibility, still track by index (but this is secondary to content matching)
@@ -797,6 +868,8 @@ class BatchRunner:
        
        # Aggregate all batch statistics and update checkpoint
        all_completed_prompts = list(completed_prompts_set)
+        total_reasoning_stats = {"total_assistant_turns": 0, "turns_with_reasoning": 0, "turns_without_reasoning": 0}
+        
        for batch_result in results:
            # Add newly completed prompts
            all_completed_prompts.extend(batch_result.get("completed_prompts", []))
@@ -813,6 +886,10 @@ class BatchRunner:
                total_tool_stats[tool_name]["count"] += stats["count"]
                total_tool_stats[tool_name]["success"] += stats["success"]
                total_tool_stats[tool_name]["failure"] += stats["failure"]
+            
+            # Aggregate reasoning stats
+            for key in total_reasoning_stats:
+                total_reasoning_stats[key] += batch_result.get("reasoning_stats", {}).get(key, 0)
        
        # Save final checkpoint
        checkpoint_data["completed_prompts"] = all_completed_prompts
@@ -835,15 +912,8 @@ class BatchRunner:
        combined_file = self.output_dir / "trajectories.jsonl"
        print(f"\n📦 Combining ALL batch files into {combined_file.name}...")
        
-        VALID_TOOLS = {'web_search', 'web_extract', 'terminal', 'vision_analyze', 
-                       'image_generate', 'mixture_of_agents',
-                       # Skills tools
-                       'skills_categories', 'skills_list', 'skill_view',
-                       # Browser automation tools
-                       'browser_navigate', 'browser_snapshot', 'browser_click',
-                       'browser_type', 'browser_scroll', 'browser_back',
-                       'browser_press', 'browser_close', 'browser_get_images',
-                       'browser_vision'}
+        # Valid tools auto-derived from model_tools.py — no manual updates needed
+        VALID_TOOLS = ALL_POSSIBLE_TOOLS
        
        total_entries = 0
        filtered_entries = 0
@@ -892,7 +962,8 @@ class BatchRunner:
            "model": self.model,
            "completed_at": datetime.now().isoformat(),
            "duration_seconds": round(time.time() - start_time, 2),
-            "tool_statistics": total_tool_stats
+            "tool_statistics": total_tool_stats,
+            "reasoning_statistics": total_reasoning_stats,
        }
        
        with open(self.stats_file, 'w', encoding='utf-8') as f:
@@ -930,6 +1001,25 @@ class BatchRunner:
        else:
            print("No tool calls were made during this run.")
        
+        # Print reasoning coverage stats
+        total_discarded = sum(r.get("discarded_no_reasoning", 0) for r in results)
+        
+        print("\n🧠 Reasoning Coverage:")
+        print("-" * 70)
+        total_turns = total_reasoning_stats["total_assistant_turns"]
+        with_reasoning = total_reasoning_stats["turns_with_reasoning"]
+        without_reasoning = total_reasoning_stats["turns_without_reasoning"]
+        if total_turns > 0:
+            pct_with = round(with_reasoning / total_turns * 100, 1)
+            pct_without = round(without_reasoning / total_turns * 100, 1)
+            print(f"   Total assistant turns:    {total_turns:,}")
+            print(f"   With reasoning:           {with_reasoning:,} ({pct_with}%)")
+            print(f"   Without reasoning:        {without_reasoning:,} ({pct_without}%)")
+        else:
+            print("   No assistant turns recorded.")
+        if total_discarded > 0:
+            print(f"   🚫 Samples discarded (zero reasoning): {total_discarded:,}")
+        
        print(f"\n💾 Results saved to: {self.output_dir}")
        print(f"   - Trajectories: trajectories.jsonl (combined)")
        print(f"   - Individual batches: batch_*.jsonl (for debugging)")
@@ -956,6 +1046,11 @@ def main(
    providers_ignored: str = None,
    providers_order: str = None,
    provider_sort: str = None,
+    max_tokens: int = None,
+    reasoning_effort: str = None,
+    reasoning_disabled: bool = False,
+    prefill_messages_file: str = None,
+    max_samples: int = None,
 ):
    """
    Run batch processing of agent prompts from a dataset.
@@ -979,6 +1074,11 @@ def main(
        providers_ignored (str): Comma-separated list of OpenRouter providers to ignore (e.g. "together,deepinfra")
        providers_order (str): Comma-separated list of OpenRouter providers to try in order (e.g. "anthropic,openai,google")
        provider_sort (str): Sort providers by "price", "throughput", or "latency" (OpenRouter only)
+        max_tokens (int): Maximum tokens for model responses (optional, uses model default if not set)
+        reasoning_effort (str): OpenRouter reasoning effort level: "xhigh", "high", "medium", "low", "minimal", "none" (default: "xhigh")
+        reasoning_disabled (bool): Completely disable reasoning/thinking tokens (default: False)
+        prefill_messages_file (str): Path to JSON file containing prefill messages (list of {role, content} dicts)
+        max_samples (int): Only process the first N samples from the dataset (optional, processes all if not set)
        
    Examples:
        # Basic usage
@@ -990,9 +1090,13 @@ def main(
        # Use specific distribution
        python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=image_test --distribution=image_gen
        
-        # With ephemeral system prompt (not saved to dataset)
+        # With reasoning disabled and a max token limit
        python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run \\
-                               --ephemeral_system_prompt="You are a helpful assistant focused on image generation."
+                               --reasoning_disabled --max_tokens=128000
+        
+        # With prefill messages from file
+        python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run \\
+                               --prefill_messages_file=configs/prefill_opus.json
        
        # List available distributions
        python batch_runner.py --list_distributions
@@ -1031,6 +1135,36 @@ def main(
    providers_ignored_list = [p.strip() for p in providers_ignored.split(",")] if providers_ignored else None
    providers_order_list = [p.strip() for p in providers_order.split(",")] if providers_order else None
    
+    # Build reasoning_config from CLI flags
+    # --reasoning_disabled takes priority, then --reasoning_effort, then default (xhigh)
+    reasoning_config = None
+    if reasoning_disabled:
+        # Completely disable reasoning/thinking tokens
+        reasoning_config = {"effort": "none"}
+        print("🧠 Reasoning: DISABLED (effort=none)")
+    elif reasoning_effort:
+        # Use specified effort level
+        valid_efforts = ["xhigh", "high", "medium", "low", "minimal", "none"]
+        if reasoning_effort not in valid_efforts:
+            print(f"❌ Error: --reasoning_effort must be one of: {', '.join(valid_efforts)}")
+            return
+        reasoning_config = {"enabled": True, "effort": reasoning_effort}
+        print(f"🧠 Reasoning effort: {reasoning_effort}")
+    
+    # Load prefill messages from JSON file if provided
+    prefill_messages = None
+    if prefill_messages_file:
+        try:
+            with open(prefill_messages_file, 'r', encoding='utf-8') as f:
+                prefill_messages = json.load(f)
+            if not isinstance(prefill_messages, list):
+                print("❌ Error: prefill_messages_file must contain a JSON array of messages")
+                return
+            print(f"💬 Loaded {len(prefill_messages)} prefill messages from {prefill_messages_file}")
+        except Exception as e:
+            print(f"❌ Error loading prefill messages: {e}")
+            return
+    
    # Initialize and run batch runner
    try:
        runner = BatchRunner(
@@ -1050,6 +1184,10 @@ def main(
            providers_ignored=providers_ignored_list,
            providers_order=providers_order_list,
            provider_sort=provider_sort,
+            max_tokens=max_tokens,
+            reasoning_config=reasoning_config,
+            prefill_messages=prefill_messages,
+            max_samples=max_samples,
        )

        runner.run(resume=resume)
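
The flag precedence in `main()` above (`--reasoning_disabled` first, then `--reasoning_effort`, otherwise no override) can be sketched as a small standalone function. This is a minimal illustration, not the code in this commit: `build_reasoning_config` and `VALID_EFFORTS` are hypothetical names, and the sketch raises `ValueError` where `main()` prints an error and returns.

```python
from typing import Any, Dict, Optional

# Effort levels accepted by --reasoning_effort, as listed in main()
VALID_EFFORTS = ("xhigh", "high", "medium", "low", "minimal", "none")


def build_reasoning_config(
    reasoning_disabled: bool = False,
    reasoning_effort: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    """Hypothetical helper mirroring the CLI flag precedence."""
    if reasoning_disabled:
        # Disabling wins over any explicit effort level
        return {"effort": "none"}
    if reasoning_effort:
        if reasoning_effort not in VALID_EFFORTS:
            raise ValueError(
                f"reasoning_effort must be one of: {', '.join(VALID_EFFORTS)}"
            )
        return {"enabled": True, "effort": reasoning_effort}
    # No flag given: pass None so the downstream default applies
    return None


print(build_reasoning_config(reasoning_disabled=True, reasoning_effort="high"))
print(build_reasoning_config(reasoning_effort="low"))
print(build_reasoning_config())
```

Returning `None` when neither flag is set keeps the decision about the default effort level in one place downstream, rather than duplicating it in the CLI layer.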
@@ -7,7 +7,7 @@
 # =============================================================================
 model:
  # Default model to use (can be overridden with --model flag)
-  default: "anthropic/claude-sonnet-4"
+  default: "anthropic/claude-opus-4.6"
  
  # API configuration (falls back to OPENROUTER_API_KEY env var)
  # api_key: "your-key-here"  # Uncomment to set here instead of .env
@@ -23,9 +23,12 @@ model:
 # OPTION 1: Local execution (default)
 # Commands run directly on your machine in the current directory
 # -----------------------------------------------------------------------------
+# Working directory behavior:
+#   - CLI (`hermes` command): Uses "." (current directory where you run hermes)
+#   - Messaging (Telegram/Discord): Uses MESSAGING_CWD from .env (default: home)
 terminal:
-  env_type: "local"
-  cwd: "."  # Use "." for current directory, or specify absolute path
+  backend: "local"
+  cwd: "."  # For local backend: "." = current directory. Ignored for remote backends.
  timeout: 180
  lifetime_seconds: 300
  # sudo_password: ""  # Enable sudo commands (pipes via sudo -S) - SECURITY WARNING: plaintext!
@@ -36,8 +39,8 @@ terminal:
 # Great for: keeping agent isolated from its own code, using powerful remote hardware
 # -----------------------------------------------------------------------------
 # terminal:
-#   env_type: "ssh"
-#   cwd: "/home/myuser/project"
+#   backend: "ssh"
+#   cwd: "/home/myuser/project"  # Path on the REMOTE server
 #   timeout: 180
 #   lifetime_seconds: 300
 #   ssh_host: "my-server.example.com"
@@ -51,11 +54,11 @@ terminal:
 # Great for: reproducible environments, testing, isolation
 # -----------------------------------------------------------------------------
 # terminal:
-#   env_type: "docker"
-#   cwd: "/workspace"
+#   backend: "docker"
+#   cwd: "/workspace"  # Path INSIDE the container (default: /)
 #   timeout: 180
 #   lifetime_seconds: 300
-#   docker_image: "python:3.11"
+#   docker_image: "nikolaik/python-nodejs:python3.11-nodejs20"

 # -----------------------------------------------------------------------------
 # OPTION 4: Singularity/Apptainer container
@@ -63,11 +66,11 @@ terminal:
 # Great for: HPC clusters, shared compute environments
 # -----------------------------------------------------------------------------
 # terminal:
-#   env_type: "singularity"
-#   cwd: "/workspace"
+#   backend: "singularity"
+#   cwd: "/workspace"  # Path INSIDE the container (default: /root)
 #   timeout: 180
 #   lifetime_seconds: 300
-#   singularity_image: "docker://python:3.11"
+#   singularity_image: "docker://nikolaik/python-nodejs:python3.11-nodejs20"

 # -----------------------------------------------------------------------------
 # OPTION 5: Modal cloud execution
@@ -75,11 +78,11 @@ terminal:
 # Great for: GPU access, scalable compute, serverless execution
 # -----------------------------------------------------------------------------
 # terminal:
-#   env_type: "modal"
-#   cwd: "/workspace"
+#   backend: "modal"
+#   cwd: "/workspace"  # Path INSIDE the sandbox (default: /root)
 #   timeout: 180
 #   lifetime_seconds: 300
-#   modal_image: "python:3.11"
+#   modal_image: "nikolaik/python-nodejs:python3.11-nodejs20"

 # -----------------------------------------------------------------------------
 # SUDO SUPPORT (works with ALL backends above)
@@ -112,12 +115,41 @@ browser:
  # after this period of no activity between agent loops (default: 120 = 2 minutes)
  inactivity_timeout: 120

+# =============================================================================
+# Context Compression (Auto-shrinks long conversations)
+# =============================================================================
+# When a conversation approaches the model's context limit, the middle turns are
+# automatically summarized to free up space while preserving important context.
+#
+# HOW IT WORKS:
+# 1. Tracks actual token usage from API responses (not estimates)
+# 2. When prompt_tokens >= threshold% of model's context_length, triggers compression
+# 3. Protects first 3 turns (system prompt, initial request, first response)
+# 4. Protects last 4 turns (recent context is most relevant)
+# 5. Summarizes middle turns using a fast/cheap model
+# 6. Inserts summary as a user message, continues conversation seamlessly
+#
+compression:
+  # Enable automatic context compression (default: true)
+  # Set to false if you prefer to manage context manually or want errors on overflow
+  enabled: true
+  
+  # Trigger compression at this fraction of the model's context limit (default: 0.85 = 85%)
+  # Lower values = more aggressive compression, higher values = compress later
+  threshold: 0.85
+  
+  # Model to use for generating summaries (fast/cheap recommended)
+  # This model compresses the middle turns into a concise summary
+  summary_model: "google/gemini-3-flash-preview"
+
 # =============================================================================
 # Agent Behavior
 # =============================================================================
 agent:
-  # Maximum conversation turns before stopping
-  max_turns: 20
+  # Maximum tool-calling iterations per conversation
+  # Higher = more room for complex tasks, but costs more tokens
+  # Recommended: 20-30 for focused tasks, 50-100 for open exploration
+  max_turns: 60
  
  # Enable verbose logging
  verbose: false
@@ -212,6 +244,24 @@ toolsets:
 # toolsets:
 #   - safe

+# =============================================================================
+# Voice Transcription (Speech-to-Text)
+# =============================================================================
+# Automatically transcribe voice messages on messaging platforms.
+# Requires OPENAI_API_KEY in .env (uses OpenAI Whisper API directly).
+stt:
+  enabled: true
+  model: "whisper-1"  # whisper-1 (cheapest) | gpt-4o-mini-transcribe | gpt-4o-transcribe
+
+# =============================================================================
+# Response Pacing (Messaging Platforms)
+# =============================================================================
+# Add human-like delays between message chunks.
+# human_delay:
+#   mode: "off"      # "off" | "natural" | "custom"
+#   min_ms: 800      # Min delay (custom mode only)
+#   max_ms: 2500     # Max delay (custom mode only)
+
 # =============================================================================
 # Session Logging
 # =============================================================================
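
The compression rule described in the `compression:` config comments above (trigger at `threshold` of the context length, protect the first 3 and last 4 turns, summarize the middle) can be sketched as follows. This is an assumption-laden illustration: `should_compress` and `split_turns` are hypothetical names, and the actual compressor is not part of this hunk.

```python
from typing import List, Tuple


def should_compress(prompt_tokens: int, context_length: int, threshold: float = 0.85) -> bool:
    """Hypothetical check: True once prompt tokens reach threshold * context_length."""
    return prompt_tokens >= threshold * context_length


def split_turns(
    turns: List[str],
    protect_first: int = 3,
    protect_last: int = 4,
) -> Tuple[List[str], List[str], List[str]]:
    """Split turns into (protected head, middle to summarize, protected tail)."""
    if len(turns) <= protect_first + protect_last:
        # Nothing between the protected regions, so nothing to summarize
        return list(turns), [], []
    head = turns[:protect_first]
    middle = turns[protect_first:len(turns) - protect_last]
    tail = turns[len(turns) - protect_last:]
    return head, middle, tail


turns = [f"turn-{i}" for i in range(10)]
head, middle, tail = split_turns(turns)
print(should_compress(prompt_tokens=180_000, context_length=200_000))
print(middle)
```

With ten turns, only the three middle turns are candidates for summarization; the head and tail pass through unchanged.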
@@ -0,0 +1,36 @@
+"""
+Cron job scheduling system for Hermes Agent.
+
+This module provides scheduled task execution, allowing the agent to:
+- Run automated tasks on schedules (cron expressions, intervals, one-shot)
+- Self-schedule reminders and follow-up tasks
+- Execute tasks in isolated sessions (no prior context)
+
+Usage:
+    # Run due jobs (for system cron integration)
+    python -c "from cron import tick; tick()"
+    
+    # Or via CLI
+    python cli.py --cron-daemon
+"""
+
+from cron.jobs import (
+    create_job,
+    get_job,
+    list_jobs,
+    remove_job,
+    update_job,
+    JOBS_FILE,
+)
+from cron.scheduler import tick, run_daemon
+
+__all__ = [
+    "create_job",
+    "get_job", 
+    "list_jobs",
+    "remove_job",
+    "update_job",
+    "tick",
+    "run_daemon",
+    "JOBS_FILE",
+]
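
The "run due jobs" usage in the docstring above implies selecting jobs whose next run time has arrived. A minimal sketch of that selection, assuming the `enabled` and `next_run_at` job fields defined in `cron/jobs.py`; `due_jobs` is a hypothetical helper, and the real `tick()` in `cron/scheduler.py` may work differently.

```python
from datetime import datetime
from typing import Any, Dict, List


def due_jobs(jobs: List[Dict[str, Any]], now: datetime) -> List[Dict[str, Any]]:
    """Hypothetical helper: enabled jobs whose next_run_at is at or before `now`."""
    due = []
    for job in jobs:
        next_run_at = job.get("next_run_at")
        # Skip disabled jobs and jobs with no further runs scheduled
        if not job.get("enabled", True) or next_run_at is None:
            continue
        if datetime.fromisoformat(next_run_at) <= now:
            due.append(job)
    return due


jobs = [
    {"id": "a", "enabled": True, "next_run_at": "2026-02-03T14:00:00"},
    {"id": "b", "enabled": True, "next_run_at": "2026-02-03T16:00:00"},
    {"id": "c", "enabled": False, "next_run_at": "2026-02-03T13:00:00"},
    {"id": "d", "enabled": True, "next_run_at": None},
]
now = datetime(2026, 2, 3, 15, 0)
print([job["id"] for job in due_jobs(jobs, now)])
```

The sketch uses naive timestamps throughout; mixing naive and timezone-aware values in the comparison would raise `TypeError`.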
@@ -0,0 +1,383 @@
+"""
+Cron job storage and management.
+
+Jobs are stored in ~/.hermes/cron/jobs.json
+Output is saved to ~/.hermes/cron/output/{job_id}/{timestamp}.md
+"""
+
+import json
+import os
+import re
+import uuid
+from datetime import datetime, timedelta
+from pathlib import Path
+from typing import Optional, Dict, List, Any
+
+try:
+    from croniter import croniter
+    HAS_CRONITER = True
+except ImportError:
+    HAS_CRONITER = False
+
+# =============================================================================
+# Configuration
+# =============================================================================
+
+HERMES_DIR = Path.home() / ".hermes"
+CRON_DIR = HERMES_DIR / "cron"
+JOBS_FILE = CRON_DIR / "jobs.json"
+OUTPUT_DIR = CRON_DIR / "output"
+
+
+def ensure_dirs():
+    """Ensure cron directories exist."""
+    CRON_DIR.mkdir(parents=True, exist_ok=True)
+    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+
+# =============================================================================
+# Schedule Parsing
+# =============================================================================
+
+def parse_duration(s: str) -> int:
+    """
+    Parse duration string into minutes.
+    
+    Examples:
+        "30m" → 30
+        "2h" → 120
+        "1d" → 1440
+    """
+    s = s.strip().lower()
+    match = re.match(r'^(\d+)\s*(m|min|mins|minute|minutes|h|hr|hrs|hour|hours|d|day|days)$', s)
+    if not match:
+        raise ValueError(f"Invalid duration: '{s}'. Use format like '30m', '2h', or '1d'")
+    
+    value = int(match.group(1))
+    unit = match.group(2)[0]  # First char: m, h, or d
+    
+    multipliers = {'m': 1, 'h': 60, 'd': 1440}
+    return value * multipliers[unit]
+
+
+def parse_schedule(schedule: str) -> Dict[str, Any]:
+    """
+    Parse schedule string into structured format.
+    
+    Returns dict with:
+        - kind: "once" | "interval" | "cron"
+        - For "once": "run_at" (ISO timestamp)
+        - For "interval": "minutes" (int)
+        - For "cron": "expr" (cron expression)
+    
+    Examples:
+        "30m"              → once in 30 minutes
+        "2h"               → once in 2 hours
+        "every 30m"        → recurring every 30 minutes
+        "every 2h"         → recurring every 2 hours
+        "0 9 * * *"        → cron expression
+        "2026-02-03T14:00" → once at timestamp
+    """
+    schedule = schedule.strip()
+    original = schedule
+    schedule_lower = schedule.lower()
+    
+    # "every X" pattern → recurring interval
+    if schedule_lower.startswith("every "):
+        duration_str = schedule[6:].strip()
+        minutes = parse_duration(duration_str)
+        return {
+            "kind": "interval",
+            "minutes": minutes,
+            "display": f"every {minutes}m"
+        }
+    
+    # Check for cron expression (5 or 6 space-separated fields)
+    # Cron fields: minute hour day month weekday [year]
+    parts = schedule.split()
+    if len(parts) >= 5 and all(
+        re.match(r'^[\d\*\-,/]+$', p) for p in parts[:5]
+    ):
+        if not HAS_CRONITER:
+            raise ValueError("Cron expressions require 'croniter' package. Install with: pip install croniter")
+        # Validate cron expression
+        try:
+            croniter(schedule)
+        except Exception as e:
+            raise ValueError(f"Invalid cron expression '{schedule}': {e}")
+        return {
+            "kind": "cron",
+            "expr": schedule,
+            "display": schedule
+        }
+    
+    # ISO timestamp (contains T or looks like date)
+    if 'T' in schedule or re.match(r'^\d{4}-\d{2}-\d{2}', schedule):
+        try:
+            # Parse and validate
+            dt = datetime.fromisoformat(schedule.replace('Z', '+00:00'))
+            return {
+                "kind": "once",
+                "run_at": dt.isoformat(),
+                "display": f"once at {dt.strftime('%Y-%m-%d %H:%M')}"
+            }
+        except ValueError as e:
+            raise ValueError(f"Invalid timestamp '{schedule}': {e}")
+    
+    # Duration like "30m", "2h", "1d" → one-shot from now
+    try:
+        minutes = parse_duration(schedule)
+        run_at = datetime.now() + timedelta(minutes=minutes)
+        return {
+            "kind": "once",
+            "run_at": run_at.isoformat(),
+            "display": f"once in {original}"
+        }
+    except ValueError:
+        pass
+    
+    raise ValueError(
+        f"Invalid schedule '{original}'. Use:\n"
+        f"  - Duration: '30m', '2h', '1d' (one-shot)\n"
+        f"  - Interval: 'every 30m', 'every 2h' (recurring)\n"
+        f"  - Cron: '0 9 * * *' (cron expression)\n"
+        f"  - Timestamp: '2026-02-03T14:00:00' (one-shot at time)"
+    )
+
+
+def compute_next_run(schedule: Dict[str, Any], last_run_at: Optional[str] = None) -> Optional[str]:
+    """
+    Compute the next run time for a schedule.
+    
+    Returns ISO timestamp string, or None if no more runs.
+    """
+    now = datetime.now()
+    
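+    if schedule["kind"] == "once":
+        run_at = datetime.fromisoformat(schedule["run_at"])
+        # Timestamps parsed from a "Z" suffix are timezone-aware; compare them against
+        # an aware "now", since comparing aware and naive datetimes raises TypeError
+        current = datetime.now(run_at.tzinfo) if run_at.tzinfo else now
+        # If in the future, return it; if in the past, no more runs
+        return schedule["run_at"] if run_at > current else None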
+    
+    elif schedule["kind"] == "interval":
+        minutes = schedule["minutes"]
+        if last_run_at:
+            # Next run is last_run + interval
+            last = datetime.fromisoformat(last_run_at)
+            next_run = last + timedelta(minutes=minutes)
+        else:
+            # First run is now + interval
+            next_run = now + timedelta(minutes=minutes)
+        return next_run.isoformat()
+    
+    elif schedule["kind"] == "cron":
+        if not HAS_CRONITER:
+            return None
+        cron = croniter(schedule["expr"], now)
+        next_run = cron.get_next(datetime)
+        return next_run.isoformat()
+    
+    return None
+
+
+# =============================================================================
+# Job CRUD Operations
+# =============================================================================
+
+def load_jobs() -> List[Dict[str, Any]]:
+    """Load all jobs from storage."""
+    ensure_dirs()
+    if not JOBS_FILE.exists():
+        return []
+    
+    try:
+        with open(JOBS_FILE, 'r', encoding='utf-8') as f:
+            data = json.load(f)
+            return data.get("jobs", [])
+    except (json.JSONDecodeError, IOError):
+        return []
+
+
+def save_jobs(jobs: List[Dict[str, Any]]):
+    """Save all jobs to storage."""
+    ensure_dirs()
+    with open(JOBS_FILE, 'w', encoding='utf-8') as f:
+        json.dump({"jobs": jobs, "updated_at": datetime.now().isoformat()}, f, indent=2)
+
+
+def create_job(
+    prompt: str,
+    schedule: str,
+    name: Optional[str] = None,
+    repeat: Optional[int] = None,
+    deliver: Optional[str] = None,
+    origin: Optional[Dict[str, Any]] = None
+) -> Dict[str, Any]:
+    """
+    Create a new cron job.
+    
+    Args:
+        prompt: The prompt to run (must be self-contained)
+        schedule: Schedule string (see parse_schedule)
+        name: Optional friendly name
+        repeat: How many times to run (None = forever, 1 = once)
+        deliver: Where to deliver output ("origin", "local", "telegram", etc.)
+        origin: Source info where job was created (for "origin" delivery)
+    
+    Returns:
+        The created job dict
+    """
+    parsed_schedule = parse_schedule(schedule)
+    
+    # Auto-set repeat=1 for one-shot schedules if not specified
+    if parsed_schedule["kind"] == "once" and repeat is None:
+        repeat = 1
+    
+    # Default delivery to origin if available, otherwise local
+    if deliver is None:
+        deliver = "origin" if origin else "local"
+    
+    job_id = uuid.uuid4().hex[:12]
+    now = datetime.now().isoformat()
+    
+    job = {
+        "id": job_id,
+        "name": name or prompt[:50].strip(),
+        "prompt": prompt,
+        "schedule": parsed_schedule,
+        "schedule_display": parsed_schedule.get("display", schedule),
+        "repeat": {
+            "times": repeat,  # None = forever
+            "completed": 0
+        },
+        "enabled": True,
+        "created_at": now,
+        "next_run_at": compute_next_run(parsed_schedule),
+        "last_run_at": None,
+        "last_status": None,
+        "last_error": None,
+        # Delivery configuration
+        "deliver": deliver,
+        "origin": origin,  # Tracks where job was created for "origin" delivery
+    }
+    
+    jobs = load_jobs()
+    jobs.append(job)
+    save_jobs(jobs)
+    
+    return job
+
+
+def get_job(job_id: str) -> Optional[Dict[str, Any]]:
+    """Get a job by ID."""
+    jobs = load_jobs()
+    for job in jobs:
+        if job["id"] == job_id:
+            return job
+    return None
+
+
+def list_jobs(include_disabled: bool = False) -> List[Dict[str, Any]]:
+    """List all jobs, optionally including disabled ones."""
+    jobs = load_jobs()
+    if not include_disabled:
+        jobs = [j for j in jobs if j.get("enabled", True)]
+    return jobs
+
+
+def update_job(job_id: str, updates: Dict[str, Any]) -> Optional[Dict[str, Any]]:
+    """Update a job by ID."""
+    jobs = load_jobs()
+    for i, job in enumerate(jobs):
+        if job["id"] == job_id:
+            jobs[i] = {**job, **updates}
+            save_jobs(jobs)
+            return jobs[i]
+    return None
+
+
+def remove_job(job_id: str) -> bool:
+    """Remove a job by ID."""
+    jobs = load_jobs()
+    original_len = len(jobs)
+    jobs = [j for j in jobs if j["id"] != job_id]
+    if len(jobs) < original_len:
+        save_jobs(jobs)
+        return True
+    return False
+
+
+def mark_job_run(job_id: str, success: bool, error: Optional[str] = None):
+    """
+    Mark a job as having been run.
+    
+    Updates last_run_at, last_status, increments completed count,
+    computes next_run_at, and auto-deletes if repeat limit reached.
+    """
+    jobs = load_jobs()
+    for i, job in enumerate(jobs):
+        if job["id"] == job_id:
+            now = datetime.now().isoformat()
+            job["last_run_at"] = now
+            job["last_status"] = "ok" if success else "error"
+            job["last_error"] = error if not success else None
+            
+            # Increment completed count
+            if job.get("repeat"):
+                job["repeat"]["completed"] = job["repeat"].get("completed", 0) + 1
+                
+                # Check if we've hit the repeat limit
+                times = job["repeat"].get("times")
+                completed = job["repeat"]["completed"]
+                if times is not None and completed >= times:
+                    # Remove the job (limit reached)
+                    jobs.pop(i)
+                    save_jobs(jobs)
+                    return
+            
+            # Compute next run
+            job["next_run_at"] = compute_next_run(job["schedule"], now)
+            
+            # If no next run (one-shot completed), disable
+            if job["next_run_at"] is None:
+                job["enabled"] = False
+            
+            save_jobs(jobs)
+            return
+    
+    # No matching job was found: nothing changed, so nothing to persist.
+
+
+def get_due_jobs() -> List[Dict[str, Any]]:
+    """Get all jobs that are due to run now."""
+    now = datetime.now()
+    jobs = load_jobs()
+    due = []
+    
+    for job in jobs:
+        if not job.get("enabled", True):
+            continue
+        
+        next_run = job.get("next_run_at")
+        if not next_run:
+            continue
+        
+        next_run_dt = datetime.fromisoformat(next_run)
+        if next_run_dt <= now:
+            due.append(job)
+    
+    return due
+
+
+def save_job_output(job_id: str, output: str):
+    """Save job output to a timestamped Markdown file and return its path."""
+    ensure_dirs()
+    job_output_dir = OUTPUT_DIR / job_id
+    job_output_dir.mkdir(parents=True, exist_ok=True)
+    
+    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
+    output_file = job_output_dir / f"{timestamp}.md"
+    
+    with open(output_file, 'w', encoding='utf-8') as f:
+        f.write(output)
+    
+    return output_file
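
The repeat accounting in `mark_job_run` can be looked at in isolation. This is a minimal sketch, assuming a job dict shaped like the one built by `create_job`. `record_run` is a hypothetical helper that is not part of the module, and it deliberately skips persistence and `next_run_at` handling.

```python
from typing import Any, Dict


def record_run(job: Dict[str, Any]) -> bool:
    """Increment the completed count; return True if the job should be kept.

    Hypothetical helper for illustration only: it mirrors the repeat-limit
    check in mark_job_run but never touches disk or next_run_at.
    """
    repeat = job.get("repeat")
    if not repeat:
        # Jobs without repeat metadata are never auto-removed.
        return True

    repeat["completed"] = repeat.get("completed", 0) + 1
    times = repeat.get("times")  # None means "run forever"
    return times is None or repeat["completed"] < times


one_shot = {"id": "abc", "repeat": {"times": 1, "completed": 0}}
forever = {"id": "def", "repeat": {"times": None, "completed": 41}}

keep_one_shot = record_run(one_shot)  # limit reached after the first run
keep_forever = record_run(forever)    # no limit, so the job stays
```

A one-shot job (`times=1`) is dropped after its first run, while a job with `times=None` only has its counter incremented.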
@@ -0,0 +1,188 @@
+"""
+Cron job scheduler - executes due jobs.
+
+This module provides:
+- tick(): Run all due jobs once (for system cron integration)
+- run_daemon(): Run continuously, checking every 60 seconds by default
+"""
+
+import os
+import sys
+import time
+import traceback
+from datetime import datetime
+from pathlib import Path
+from typing import Optional
+
+# Add parent directory to path for imports
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+from cron.jobs import get_due_jobs, mark_job_run, save_job_output
+
+
+def run_job(job: dict) -> tuple[bool, str, Optional[str]]:
+    """
+    Execute a single cron job.
+    
+    Returns:
+        Tuple of (success, output, error_message)
+    """
+    from run_agent import AIAgent
+    
+    job_id = job["id"]
+    job_name = job["name"]
+    prompt = job["prompt"]
+    
+    print(f"[cron] Running job '{job_name}' (ID: {job_id})")
+    print(f"[cron] Prompt: {prompt[:100]}{'...' if len(prompt) > 100 else ''}")
+    
+    try:
+        # Create agent with default settings
+        # Jobs run in isolated sessions (no prior context)
+        agent = AIAgent(
+            model=os.getenv("HERMES_MODEL", "anthropic/claude-opus-4.6"),
+            quiet_mode=True,
+            session_id=f"cron_{job_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
+        )
+        
+        # Run the conversation
+        result = agent.run_conversation(prompt)
+        
+        # Extract final response
+        final_response = result.get("final_response", "")
+        if not final_response:
+            final_response = "(No response generated)"
+        
+        # Build output document
+        output = f"""# Cron Job: {job_name}
+
+**Job ID:** {job_id}
+**Run Time:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
+**Schedule:** {job.get('schedule_display', 'N/A')}
+
+## Prompt
+
+{prompt}
+
+## Response
+
+{final_response}
+"""
+        
+        print(f"[cron] Job '{job_name}' completed successfully")
+        return True, output, None
+        
+    except Exception as e:
+        error_msg = f"{type(e).__name__}: {e}"
+        print(f"[cron] Job '{job_name}' failed: {error_msg}")
+        
+        # Build error output
+        output = f"""# Cron Job: {job_name} (FAILED)
+
+**Job ID:** {job_id}
+**Run Time:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
+**Schedule:** {job.get('schedule_display', 'N/A')}
+
+## Prompt
+
+{prompt}
+
+## Error
+
+```
+{error_msg}
+
+{traceback.format_exc()}
+```
+"""
+        return False, output, error_msg
+
+
+def tick(verbose: bool = True) -> int:
+    """
+    Check and run all due jobs.
+    
+    This is designed to be called by system cron every minute:
+        */1 * * * * cd ~/hermes-agent && python -c "from cron import tick; tick()"
+    
+    Args:
+        verbose: Whether to print status messages
+    
+    Returns:
+        Number of jobs executed
+    """
+    due_jobs = get_due_jobs()
+    
+    if not due_jobs:
+        if verbose:
+            print(f"[cron] {datetime.now().strftime('%H:%M:%S')} - No jobs due")
+        return 0
+    
+    if verbose:
+        print(f"[cron] {datetime.now().strftime('%H:%M:%S')} - {len(due_jobs)} job(s) due")
+    
+    executed = 0
+    for job in due_jobs:
+        try:
+            success, output, error = run_job(job)
+            
+            # Save output to file
+            output_file = save_job_output(job["id"], output)
+            if verbose:
+                print(f"[cron] Output saved to: {output_file}")
+            
+            # Mark job as run (handles repeat counting, next_run computation)
+            mark_job_run(job["id"], success, error)
+            executed += 1
+            
+        except Exception as e:
+            print(f"[cron] Error processing job {job['id']}: {e}")
+            mark_job_run(job["id"], False, str(e))
+    
+    return executed
+
+
+def run_daemon(check_interval: int = 60, verbose: bool = True):
+    """
+    Run the cron daemon continuously.
+    
+    Checks for due jobs every `check_interval` seconds.
+    
+    Args:
+        check_interval: Seconds between checks (default: 60)
+        verbose: Whether to print status messages
+    """
+    print(f"[cron] Starting daemon (checking every {check_interval}s)")
+    print("[cron] Press Ctrl+C to stop")
+    print()
+    
+    try:
+        while True:
+            try:
+                tick(verbose=verbose)
+            except Exception as e:
+                print(f"[cron] Tick error: {e}")
+            
+            time.sleep(check_interval)
+            
+    except KeyboardInterrupt:
+        print("\n[cron] Daemon stopped")
+
+
+if __name__ == "__main__":
+    # Allow running directly: python cron/scheduler.py [daemon|tick]
+    import argparse
+    
+    parser = argparse.ArgumentParser(description="Hermes Cron Scheduler")
+    parser.add_argument("mode", choices=["daemon", "tick"], default="tick", nargs="?",
+                        help="Mode: 'tick' to run once, 'daemon' to run continuously")
+    parser.add_argument("--interval", type=int, default=60,
+                        help="Check interval in seconds for daemon mode")
+    parser.add_argument("--quiet", "-q", action="store_true",
+                        help="Suppress status messages")
+    
+    args = parser.parse_args()
+    
+    if args.mode == "daemon":
+        run_daemon(check_interval=args.interval, verbose=not args.quiet)
+    else:
+        tick(verbose=not args.quiet)
@@ -1,224 +0,0 @@
-# Modal Backend
-
-Hermes Agent uses [Modal](https://modal.com) for scalable, isolated cloud execution environments. There are two Modal integrations:
-
-1. **Terminal Tool** (`tools/terminal_tool.py`) - For CLI/agent command execution
-2. **Atropos Backend** (`atropos/backends/modal_backend.py`) - For batch RL training workloads
-
-
-
----
-
-## Terminal Tool (CLI/Agent)
-
-The terminal tool provides a simple interface for executing commands in Modal sandboxes.
-
-### Configuration
-
-Set environment variables:
-
-```bash
-export TERMINAL_ENV=modal
-export TERMINAL_MODAL_IMAGE=python:3.11
-export TERMINAL_MODAL_APP_NAME=hermes-sandbox
-```
-
-Or use a YAML config file (`modal_profiles.yaml`):
-
-```yaml
-profiles:
-  default:
-    image: python:3.11
-    cpu: 1.0
-    memory: 2048
-    min_pool: 1
-    max_pool: 5
-    idle_timeout: 120
-
-  gpu:
-    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
-    gpu: T4
-    memory: 16384
-    min_pool: 0
-    max_pool: 2
-```
-
-### Features
-
-| Feature | Description |
-|---------|-------------|
-| **Sandbox Pool** | Pre-warmed sandboxes for low latency |
-| **Auto-scaling** | Grows/shrinks pool based on demand |
-| **Idle Timeout** | Sandboxes auto-terminate when unused |
-| **Profile Selection** | Different configs for different workloads |
-| **Credential Injection** | `modal.Secret` integration |
-
-### Usage
-
-```python
-from tools.terminal_tool import terminal_tool
-
-# Simple command
-output = terminal_tool("echo hello", task_id="my-task")
-
-# With profile selection
-output = terminal_tool("python train.py", task_id="training", profile="gpu")
-
-# Cleanup when done
-from tools.terminal_tool import cleanup_vm
-cleanup_vm("my-task")
-```
-
-### Architecture
-
-```
-_ModalPoolManager (singleton)
-    ├── "default" pool → [sandbox-0, sandbox-1, ...]
-    └── "gpu" pool     → [sandbox-0, ...]
-
-Each pool:
-  - Maintains min_pool warm sandboxes
-  - Scales up to max_pool on demand  
-  - Background thread scales down idle sandboxes
-```
-
----
-
-## Atropos Backend (RL Training)
-
-The Atropos backend is designed for high-throughput batch execution during reinforcement learning training.
-
-### Key Concept: Slot-based Multiplexing
-
-Instead of one sandbox per trajectory, multiple trajectories share sandboxes via **slots**:
-
-```
-Sandbox (1 container)
-    ├── Slot 0 → Trajectory A (workspace: /data/slot_0)
-    ├── Slot 1 → Trajectory B (workspace: /data/slot_1)
-    └── Slot 2 → Trajectory C (workspace: /data/slot_2)
-```
-
-**Benefits**:
-- Fewer containers = lower cost
-- Shared warm-up time
-- Better GPU utilization
-
-### Configuration
-
-```python
-from atropos.backends.modal_backend import ModalSandboxConfig, ModalToolBackend
-
-config = ModalSandboxConfig(
-    name="default",
-    image="python:3.11",
-    cpu=1.0,
-    memory=2048,
-    slots_per_sandbox=10,  # 10 trajectories per container
-    min_sandboxes=1,
-    max_sandboxes=5,
-)
-
-backend = ModalToolBackend(config.with_app_name("my-training"))
-```
-
-### Multi-Profile Support
-
-Different trajectory types can request different resources:
-
-```python
-backend = ModalToolBackend.with_profiles(
-    app_name="rl-training",
-    profiles={
-        "default": ModalSandboxConfig(
-            name="default",
-            cpu=1.0,
-            memory=2048,
-        ),
-        "pytorch-gpu": ModalSandboxConfig(
-            name="pytorch-gpu",
-            image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
-            gpu="T4",
-            memory=16384,
-        ),
-    }
-)
-
-# CPU task
-slot1 = await backend.acquire("traj-1", profile="default")
-
-# GPU task
-slot2 = await backend.acquire("traj-2", profile="pytorch-gpu")
-```
-
-### Batched Execution
-
-The key optimization - execute many commands in parallel:
-
-```python
-# Acquire slots for multiple trajectories
-slots = [await backend.acquire(f"traj-{i}") for i in range(50)]
-
-# Execute batch across all slots in parallel
-results = await backend.execute_batch([
-    (slot, "bash", {"command": "python step.py"})
-    for slot in slots
-])
-
-# Release slots
-for slot in slots:
-    await backend.release(slot)
-```
-
-### Architecture
-
-```
-ModalToolBackend
-    └── _ModalMultiProfileManager
-            ├── "default" → _ModalSandboxPool
-            │                   ├── Sandbox 0 (slots 0-9)
-            │                   └── Sandbox 1 (slots 0-9)
-            │
-            └── "pytorch-gpu" → _ModalSandboxPool
-                                    └── Sandbox 0 (slots 0-9)
-```
-
----
-
-## Credentials
-
-Inject secrets securely using Modal's secret management:
-
-```bash
-# Create secret in Modal dashboard or CLI
-modal secret create my-api-key API_KEY=sk-xxx
-```
-
-```python
-# Reference in config
-config = ModalSandboxConfig(
-    secrets=["my-api-key"],  # Modal secret names
-    env_vars={"DEBUG": "1"},  # Additional env vars
-)
-```
-
-## Troubleshooting
-
-### "Modal package not installed"
-```bash
-pip install modal
-modal token new  # Authenticate
-```
-
-### "Sandbox creation failed"
-- Check Modal dashboard for quota limits
-- Verify image exists and is accessible
-- Check secret names are correct
-
-### Shutdown errors
-These are harmless warnings during Python interpreter shutdown:
-```
-[Modal] Error terminating ...: cannot schedule new futures after interpreter shutdown
-```
-
-The sandboxes will auto-terminate via Modal's idle_timeout anyway.
@@ -250,6 +250,38 @@ This is useful for:
 - Replaying conversations
 - Training data inspection

+### Context Compression
+
+Long conversations can exceed model context limits. The CLI automatically compresses context when approaching the limit:
+
+```yaml
+# In cli-config.yaml
+compression:
+  enabled: true                    # Enable auto-compression
+  threshold: 0.85                  # Compress at 85% of context limit  
+  summary_model: "google/gemini-2.0-flash-001"
+```
+
+**How it works:**
+1. Tracks actual token usage from each API response
+2. When token usage reaches the threshold, the middle turns are summarized
+3. First 3 and last 4 turns are always protected
+4. Conversation continues seamlessly after compression
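
The steps above can be sketched as a small range calculation. This is an illustrative sketch only: `turns_to_summarize` and its parameters are hypothetical names, and the real CLI may count turns or messages differently.

```python
from typing import Optional, Tuple

PROTECT_FIRST = 3  # leading turns that are never summarized
PROTECT_LAST = 4   # trailing turns that are never summarized


def turns_to_summarize(
    used_tokens: int,
    context_limit: int,
    num_turns: int,
    threshold: float = 0.85,
) -> Optional[Tuple[int, int]]:
    """Return the 1-based inclusive range of middle turns to summarize, or None."""
    if used_tokens < threshold * context_limit:
        return None  # still below the compression threshold

    first = PROTECT_FIRST + 1
    last = num_turns - PROTECT_LAST
    if last < first:
        return None  # every turn is protected, nothing to compress
    return first, last


middle = turns_to_summarize(180_000, 200_000, num_turns=19)
below = turns_to_summarize(100_000, 200_000, num_turns=19)
```

With 19 turns and usage above the threshold, the sketch selects turns 4 through 15 and leaves the first 3 and last 4 untouched.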
+
+**When compression triggers:**
+```
+📦 Context compression triggered (170,000 tokens ≥ 170,000 threshold)
+   📊 Model context limit: 200,000 tokens (85% = 170,000)
+   🗜️  Summarizing turns 4-15 (12 turns)
+   ✅ Compressed: 20 → 9 messages (~45,000 tokens saved)
+```
+
+To disable compression:
+```yaml
+compression:
+  enabled: false
+```
+
 ## Quiet Mode

 The CLI runs in "quiet mode" (`HERMES_QUIET=1`), which:
@@ -0,0 +1,547 @@
+# Messaging Platform Integrations (Gateway)
+
+Hermes Agent can connect to messaging platforms like Telegram, Discord, and WhatsApp to serve as a conversational AI assistant.
+
+## Quick Start
+
+```bash
+# 1. Set your bot token(s) in .env file
+echo 'TELEGRAM_BOT_TOKEN="your_telegram_bot_token"' >> .env
+echo 'DISCORD_BOT_TOKEN="your_discord_bot_token"' >> .env
+
+# 2. Test the gateway (foreground)
+./scripts/hermes-gateway run
+
+# 3. Install as a system service (runs in background)
+./scripts/hermes-gateway install
+
+# 4. Manage the service
+./scripts/hermes-gateway start
+./scripts/hermes-gateway stop
+./scripts/hermes-gateway restart
+./scripts/hermes-gateway status
+```
+
+**Quick test (without service install):**
+```bash
+python cli.py --gateway  # Runs in foreground, useful for debugging
+```
+
+## Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                      Hermes Gateway                             │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
+│  │   Telegram   │  │   Discord    │  │   WhatsApp   │          │
+│  │   Adapter    │  │   Adapter    │  │   Adapter    │          │
+│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘          │
+│         │                 │                 │                   │
+│         └─────────────────┼─────────────────┘                   │
+│                           │                                     │
+│                  ┌────────▼────────┐                            │
+│                  │  Session Store  │                            │
+│                  │  (per-chat)     │                            │
+│                  └────────┬────────┘                            │
+│                           │                                     │
+│                  ┌────────▼────────┐                            │
+│                  │   AIAgent       │                            │
+│                  │   (run_agent)   │                            │
+│                  └─────────────────┘                            │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## Session Management
+
+### Session Persistence
+
+Sessions persist across messages until they reset. The agent remembers your conversation context.
+
+### Reset Policies
+
+Sessions reset based on configurable policies:
+
+| Policy | Default | Description |
+|--------|---------|-------------|
+| Daily | 4:00 AM | Reset at a specific hour each day |
+| Idle | 120 min | Reset after N minutes of inactivity |
+| Both | (combined) | Whichever triggers first |
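
One way to read the table is as a single staleness check. The sketch below is an assumption about how such a check could look, not the gateway's actual code. `should_reset` is a hypothetical function, and the mode name `"daily"` is assumed (only `"idle"` and `"both"` appear in the configuration examples).

```python
from datetime import datetime, timedelta


def should_reset(
    mode: str,
    last_activity: datetime,
    now: datetime,
    at_hour: int = 4,
    idle_minutes: int = 120,
) -> bool:
    """Decide whether a session is stale under "daily", "idle", or "both"."""
    # Most recent daily boundary at or before `now`.
    boundary = now.replace(hour=at_hour, minute=0, second=0, microsecond=0)
    if boundary > now:
        boundary -= timedelta(days=1)

    daily_due = last_activity < boundary
    idle_due = now - last_activity >= timedelta(minutes=idle_minutes)

    if mode == "daily":
        return daily_due
    if mode == "idle":
        return idle_due
    if mode == "both":
        return daily_due or idle_due  # whichever triggers first
    raise ValueError(f"unknown reset mode: {mode!r}")


now = datetime(2026, 3, 3, 9, 0)
recent = should_reset("both", datetime(2026, 3, 3, 8, 30), now)
overnight = should_reset("daily", datetime(2026, 3, 2, 23, 0), now)
```

A message sent 30 minutes ago keeps the session alive, while a session last used before the 4:00 AM boundary is reset.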
+
+### Manual Reset
+
+Send `/new` or `/reset` as a message to start fresh.
+
+### Per-Platform Overrides
+
+Configure different reset policies per platform:
+
+```json
+{
+  "reset_by_platform": {
+    "telegram": { "mode": "idle", "idle_minutes": 240 },
+    "discord": { "mode": "idle", "idle_minutes": 60 }
+  }
+}
+```
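
A per-platform override can be resolved by layering it over the default policy. This sketch assumes that override keys replace default keys one by one, which the document does not confirm. `resolve_reset_policy` is a hypothetical helper, and the `default_reset_policy` key is taken from the gateway config file example.

```python
from typing import Any, Dict


def resolve_reset_policy(config: Dict[str, Any], platform: str) -> Dict[str, Any]:
    """Merge the default policy with a platform override, if one exists."""
    default = config.get("default_reset_policy", {})
    override = config.get("reset_by_platform", {}).get(platform, {})
    # Later keys win, so override values shadow the defaults.
    return {**default, **override}


config = {
    "default_reset_policy": {"mode": "both", "at_hour": 4, "idle_minutes": 120},
    "reset_by_platform": {"discord": {"mode": "idle", "idle_minutes": 60}},
}

discord_policy = resolve_reset_policy(config, "discord")
telegram_policy = resolve_reset_policy(config, "telegram")
```

Platforms without an entry in `reset_by_platform` simply fall back to the default policy.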
+
+## Platform Setup
+
+### Telegram
+
+1. **Create a bot** via [@BotFather](https://t.me/BotFather)
+2. **Get your token** (looks like `123456789:ABCdefGHIjklMNOpqrsTUVwxyz`)
+3. **Set environment variable:**
+   ```bash
+   export TELEGRAM_BOT_TOKEN="your_token_here"
+   ```
+4. **Optional: Set home channel** for cron job delivery:
+   ```bash
+   export TELEGRAM_HOME_CHANNEL="-1001234567890"
+   export TELEGRAM_HOME_CHANNEL_NAME="My Notes"
+   ```
+
+**Requirements:**
+```bash
+pip install "python-telegram-bot>=20.0"
+```
+
+### Discord
+
+1. **Create an application** at [Discord Developer Portal](https://discord.com/developers/applications)
+2. **Create a bot** under your application
+3. **Get the bot token**
+4. **Enable required intents:**
+   - Message Content Intent
+   - Server Members Intent (optional)
+5. **Invite to your server** using OAuth2 URL generator (scopes: `bot`, `applications.commands`)
+6. **Set environment variable:**
+   ```bash
+   export DISCORD_BOT_TOKEN="your_token_here"
+   ```
+7. **Optional: Set home channel:**
+   ```bash
+   export DISCORD_HOME_CHANNEL="123456789012345678"
+   export DISCORD_HOME_CHANNEL_NAME="#bot-updates"
+   ```
+
+**Requirements:**
+```bash
+pip install "discord.py>=2.0"
+```
+
+### WhatsApp
+
+WhatsApp integration is more complex due to the lack of a simple bot API.
+
+**Options:**
+1. **WhatsApp Business API** (requires Meta verification)
+2. **whatsapp-web.js** via Node.js bridge (for personal accounts)
+
+**Bridge Setup:**
+1. Install Node.js
+2. Set up the bridge script (see `scripts/whatsapp-bridge/` for reference)
+3. Configure in gateway:
+   ```json
+   {
+     "platforms": {
+       "whatsapp": {
+         "enabled": true,
+         "extra": {
+           "bridge_script": "/path/to/bridge.js",
+           "bridge_port": 3000
+         }
+       }
+     }
+   }
+   ```
+
+## Configuration
+
+There are **two ways** to configure the gateway (in order of precedence):
+
+### 1. Environment Variables (`.env` file) - Recommended for Quick Setup
+
+Add to your `~/.hermes/.env` file:
+
+```bash
+# =============================================================================
+# MESSAGING PLATFORM TOKENS
+# =============================================================================
+
+# Telegram - get from @BotFather on Telegram
+TELEGRAM_BOT_TOKEN=your_telegram_bot_token
+TELEGRAM_ALLOWED_USERS=123456789,987654321    # Security: restrict to these user IDs
+
+# Optional: Default channel for cron job delivery
+TELEGRAM_HOME_CHANNEL=-1001234567890
+TELEGRAM_HOME_CHANNEL_NAME="My Notes"
+
+# Discord - get from Discord Developer Portal
+DISCORD_BOT_TOKEN=your_discord_bot_token
+DISCORD_ALLOWED_USERS=123456789012345678      # Security: restrict to these user IDs
+
+# Optional: Default channel for cron job delivery
+DISCORD_HOME_CHANNEL=123456789012345678
+DISCORD_HOME_CHANNEL_NAME="#bot-updates"
+
+# WhatsApp - requires Node.js bridge setup
+WHATSAPP_ENABLED=true
+
+# =============================================================================
+# AGENT SETTINGS
+# =============================================================================
+
+# Max tool-calling iterations per conversation (default: 60)
+HERMES_MAX_ITERATIONS=60
+
+# Working directory for terminal commands (default: home ~)
+MESSAGING_CWD=/home/myuser
+
+# =============================================================================
+# TOOL PROGRESS NOTIFICATIONS
+# =============================================================================
+
+# Show progress messages as agent uses tools
+HERMES_TOOL_PROGRESS=true
+
+# Mode: "new" (only when tool changes) or "all" (every tool call)
+HERMES_TOOL_PROGRESS_MODE=new
+
+# =============================================================================
+# SESSION SETTINGS
+# =============================================================================
+
+# Reset sessions after N minutes of inactivity (default: 120)
+SESSION_IDLE_MINUTES=120
+
+# Daily reset hour in 24h format (default: 4 = 4am)
+SESSION_RESET_HOUR=4
+```
+
+### 2. Gateway Config File (`~/.hermes/gateway.json`) - Full Control
+
+For advanced configuration, create `~/.hermes/gateway.json`:
+
+```json
+{
+  "platforms": {
+    "telegram": {
+      "enabled": true,
+      "token": "your_telegram_token",
+      "home_channel": {
+        "platform": "telegram",
+        "chat_id": "-1001234567890",
+        "name": "My Notes"
+      }
+    },
+    "discord": {
+      "enabled": true,
+      "token": "your_discord_token",
+      "home_channel": {
+        "platform": "discord",
+        "chat_id": "123456789012345678",
+        "name": "#bot-updates"
+      }
+    }
+  },
+  "default_reset_policy": {
+    "mode": "both",
+    "at_hour": 4,
+    "idle_minutes": 120
+  },
+  "reset_by_platform": {
+    "discord": {
+      "mode": "idle",
+      "idle_minutes": 60
+    }
+  },
+  "always_log_local": true
+}
+```
+
+## Platform-Specific Toolsets
+
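+Each platform has its own toolset, so the tools exposed on each one can be restricted independently for security. The current defaults all grant full access: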
+
+| Platform | Toolset | Capabilities |
+|----------|---------|--------------|
+| CLI | `hermes-cli` | Full access (terminal, browser, etc.) |
+| Telegram | `hermes-telegram` | Full tools including terminal |
+| Discord | `hermes-discord` | Full tools including terminal |
+| WhatsApp | `hermes-whatsapp` | Full tools including terminal |
+
+## User Experience Features
+
+### Typing Indicator
+
+The gateway keeps the "typing..." indicator active throughout processing, refreshing every 4 seconds. This lets users know the bot is working even during long tool-calling sequences.
+
+### Tool Progress Notifications
+
+When `HERMES_TOOL_PROGRESS=true`, the bot sends status messages as it works:
+
+```
+💻 `ls -la`...
+🔍 web_search...
+📄 web_extract...
+🎨 image_generate...
+```
+
+Terminal commands show the actual command (truncated to 50 chars). Other tools just show the tool name.
+
+**Modes:**
+- `new`: Only sends message when switching to a different tool (less spam)
+- `all`: Sends message for every single tool call
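
The difference between the two modes can be shown with a short filter. This is a sketch under assumptions: `progress_messages` is a hypothetical function, tool calls are modelled as `(tool_name, detail)` pairs, and the emoji prefixes are left out.

```python
from typing import Iterable, List, Optional, Tuple


def progress_messages(
    calls: Iterable[Tuple[str, str]], mode: str = "new"
) -> List[str]:
    """Return the status lines to send for a sequence of (tool_name, detail) calls."""
    messages: List[str] = []
    previous_tool: Optional[str] = None
    for tool, detail in calls:
        repeated = tool == previous_tool
        previous_tool = tool
        if mode == "new" and repeated:
            continue  # same tool as the previous call: stay quiet
        # Terminal calls show the command, truncated to 50 characters.
        label = f"`{detail[:50]}`" if tool == "terminal" else tool
        messages.append(f"{label}...")
    return messages


calls = [
    ("terminal", "ls -la"),
    ("terminal", "pwd"),
    ("web_search", "hermes agent"),
    ("terminal", "ls"),
]
only_new = progress_messages(calls, mode="new")
every_call = progress_messages(calls, mode="all")
```

In `new` mode the second consecutive terminal call is suppressed, while `all` mode reports each of the four calls.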
+
+### Working Directory
+
+- **CLI (`hermes` command)**: Uses current directory where you run the command
+- **Messaging**: Uses `MESSAGING_CWD` (default: home directory `~`)
+
+This is intentional: CLI users are in a terminal and expect the agent to work in their current directory, while messaging users need a consistent starting location.
+
+### Max Iterations
+
+If the agent hits the max iteration limit while working, instead of a generic error, it asks the model to summarize what it found so far. This gives you a useful response even when the task couldn't be fully completed.
+
+## Voice Messages (TTS)
+
+The `text_to_speech` tool generates audio that the gateway delivers as native voice messages on each platform:
+
+| Platform | Delivery | Format |
+|----------|----------|--------|
+| Telegram | Voice bubble (plays inline) | Opus `.ogg` — native from OpenAI/ElevenLabs, converted via ffmpeg for Edge TTS |
+| Discord | Audio file attachment | MP3 |
+| WhatsApp | Audio file attachment | MP3 |
+| CLI | Saved to `~/voice-memos/` | MP3 |
+
+**Providers:**
+- **Edge TTS** (default) — Free, no API key, 322 voices in 74 languages
+- **ElevenLabs** — Premium quality, requires `ELEVENLABS_API_KEY`
+- **OpenAI TTS** — Good quality, requires `OPENAI_API_KEY`
+
+Voice and provider are configured by the user in `~/.hermes/config.yaml` under the `tts:` key. The model only sends text; it does not choose the voice.
+
+The tool returns a `MEDIA:<path>` tag that the gateway send pipeline intercepts and delivers as a native audio message. If `[[audio_as_voice]]` is present (Opus format available), Telegram sends it as a voice bubble instead of an audio file.
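
Intercepting the tag could look roughly like the sketch below. The exact tag grammar is an assumption (a path without whitespace directly after `MEDIA:`), and `extract_media` is a hypothetical helper rather than the gateway's real function.

```python
import re
from typing import Optional, Tuple

MEDIA_TAG = re.compile(r"MEDIA:(\S+)")  # assumed: path contains no whitespace
VOICE_MARKER = "[[audio_as_voice]]"


def extract_media(text: str) -> Tuple[str, Optional[str], bool]:
    """Split a tool result into (remaining_text, media_path, as_voice)."""
    as_voice = VOICE_MARKER in text
    match = MEDIA_TAG.search(text)
    path = match.group(1) if match else None
    # Remove the tag and the marker so only user-visible text remains.
    cleaned = MEDIA_TAG.sub("", text).replace(VOICE_MARKER, "")
    return cleaned.strip(), path, as_voice


voice = extract_media("[[audio_as_voice]]\nMEDIA:/tmp/voice.ogg")
attachment = extract_media("Here you go MEDIA:/tmp/a.mp3")
```

The caller can then send `media_path` as a voice bubble when `as_voice` is true and as a regular audio attachment otherwise.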
+
+**Telegram voice bubbles & ffmpeg:**
+
+Telegram requires Opus/OGG format for native voice bubbles (the round, inline-playable kind). **OpenAI and ElevenLabs** produce Opus natively when on Telegram — no extra setup needed. **Edge TTS** (the default free provider) outputs MP3 and needs `ffmpeg` to convert:
+
+```bash
+sudo apt install ffmpeg    # Ubuntu/Debian
+brew install ffmpeg         # macOS
+sudo dnf install ffmpeg     # Fedora
+```
+
+Without ffmpeg, Edge TTS audio is sent as a regular audio file (still playable, but shows as a rectangular music player instead of a voice bubble).
+
+## Cron Job Delivery
+
+When scheduling cron jobs, you can specify where the output should be delivered:
+
+```
+User: "Remind me to check the server in 30 minutes"
+
+Agent uses: schedule_cronjob(
+  prompt="Check server status...",
+  schedule="30m",
+  deliver="origin"  # Back to this chat
+)
+```
+
+### Delivery Options
+
+| Option | Description |
+|--------|-------------|
+| `"origin"` | Back to where the job was created |
+| `"local"` | Save to local files only |
+| `"telegram"` | Telegram home channel |
+| `"discord"` | Discord home channel |
+| `"telegram:123456"` | Specific Telegram chat |
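
The option strings in the table follow a `target[:chat_id]` shape, which can be split as sketched here. `parse_delivery` is a hypothetical helper, and how the gateway actually parses these strings is an assumption.

```python
from typing import Optional, Tuple


def parse_delivery(deliver: str) -> Tuple[str, Optional[str]]:
    """Split a delivery option into (target, chat_id)."""
    # partition() splits on the first ":" only, so negative chat IDs survive.
    target, sep, chat_id = deliver.partition(":")
    return target, (chat_id if sep and chat_id else None)


home = parse_delivery("telegram")
specific = parse_delivery("telegram:123456")
```

A missing chat ID means the platform's home channel is used, or no platform at all for `"origin"` and `"local"`.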
+
+## Dynamic Context Injection
+
+The agent knows where it is via injected context:
+
+```
+## Current Session Context
+
+**Source:** Telegram (group: Dev Team, ID: -1001234567890)
+**Connected Platforms:** local, telegram, discord
+
+**Home Channels:**
+  - telegram: My Notes (ID: -1001234567890)
+  - discord: #bot-updates (ID: 123456789012345678)
+
+**Delivery options for scheduled tasks:**
+- "origin" → Back to this chat (Dev Team)
+- "local" → Save to local files only
+- "telegram" → Home channel (My Notes)
+- "discord" → Home channel (#bot-updates)
+```
+
+## CLI Commands
+
+| Command | Description |
+|---------|-------------|
+| `/platforms` | Show gateway configuration and status |
+| `--gateway` | Start the gateway (CLI flag) |
+
+## Troubleshooting
+
+### "python-telegram-bot not installed"
+
+```bash
+pip install "python-telegram-bot>=20.0"
+```
+
+### "discord.py not installed"
+
+```bash
+pip install "discord.py>=2.0"
+```
+
+### "No platforms connected"
+
+1. Check your environment variables are set
+2. Check your tokens are valid
+3. Try `/platforms` to see configuration status
+
+### Session not persisting
+
+1. Check `~/.hermes/sessions/` exists
+2. Check session policies aren't too aggressive
+3. Verify no errors in gateway logs
+
+## Adding a New Platform
+
+To add a new messaging platform:
+
+### 1. Create the adapter
+
+Create `gateway/platforms/your_platform.py`:
+
+```python
+from typing import Any, Dict
+
+from gateway.platforms.base import BasePlatformAdapter, MessageEvent, SendResult
+from gateway.config import Platform, PlatformConfig
+
+class YourPlatformAdapter(BasePlatformAdapter):
+    def __init__(self, config: PlatformConfig):
+        super().__init__(config, Platform.YOUR_PLATFORM)
+    
+    async def connect(self) -> bool:
+        # Connect to the platform
+        ...
+    
+    async def disconnect(self) -> None:
+        # Disconnect
+        ...
+    
+    async def send(self, chat_id: str, content: str, **kwargs) -> SendResult:
+        # Send a message; **kwargs stands in for the parameters elided here
+        ...
+    
+    async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
+        # Get chat information
+        ...
+```
+
+### 2. Register the platform
+
+Add to `gateway/config.py`:
+
+```python
+class Platform(Enum):
+    # ... existing ...
+    YOUR_PLATFORM = "your_platform"
+```
+
+### 3. Add to gateway runner
+
+Update `gateway/run.py` `_create_adapter()`:
+
+```python
+elif platform == Platform.YOUR_PLATFORM:
+    from gateway.platforms.your_platform import YourPlatformAdapter
+    return YourPlatformAdapter(config)
+```
+
+### 4. Create a toolset (optional)
+
+Add to `toolsets.py`:
+
+```python
+"hermes-your-platform": {
+    "description": "Your platform toolset",
+    "tools": [...],
+    "includes": []
+}
+```
+
+### 5. Configure
+
+Add environment variables to `.env`:
+
+```bash
+YOUR_PLATFORM_TOKEN=...
+YOUR_PLATFORM_HOME_CHANNEL=...
+```
+
+## Service Management
+
+### Linux (systemd)
+
+```bash
+# Install as user service
+./scripts/hermes-gateway install
+
+# Manage
+systemctl --user start hermes-gateway
+systemctl --user stop hermes-gateway
+systemctl --user restart hermes-gateway
+systemctl --user status hermes-gateway
+
+# View logs
+journalctl --user -u hermes-gateway -f
+
+# Enable lingering (keeps running after logout)
+sudo loginctl enable-linger $USER
+```
+
+### macOS (launchd)
+
+```bash
+# Install
+./scripts/hermes-gateway install
+
+# Manage
+launchctl start ai.hermes.gateway
+launchctl stop ai.hermes.gateway
+
+# View logs
+tail -f ~/.hermes/logs/gateway.log
+```
+
+### Manual (any platform)
+
+```bash
+# Run in foreground (for testing/debugging)
+./scripts/hermes-gateway run
+
+# Or via CLI (also foreground)
+python cli.py --gateway
+```
+
+## Storage Locations
+
+| Path | Purpose |
+|------|---------|
+| `~/.hermes/gateway.json` | Gateway configuration |
+| `~/.hermes/sessions/sessions.json` | Session index |
+| `~/.hermes/sessions/{id}.jsonl` | Conversation transcripts |
+| `~/.hermes/cron/output/` | Cron job outputs |
+| `~/.hermes/logs/gateway.log` | Gateway logs (macOS launchd) |
@@ -40,11 +40,15 @@ async def web_search(query: str) -> dict:
 |----------|--------|-------|
 | **Web** | `web_tools.py` | `web_search`, `web_extract`, `web_crawl` |
 | **Terminal** | `terminal_tool.py` | `terminal` (local/docker/singularity/modal/ssh backends) |
+| **File** | `file_tools.py` | `read_file`, `write_file`, `patch`, `search` |
 | **Browser** | `browser_tool.py` | `browser_navigate`, `browser_click`, `browser_type`, etc. |
 | **Vision** | `vision_tools.py` | `vision_analyze` |
 | **Image Gen** | `image_generation_tool.py` | `image_generate` |
+| **TTS** | `tts_tool.py` | `text_to_speech` (Edge TTS free / ElevenLabs / OpenAI) |
 | **Reasoning** | `mixture_of_agents_tool.py` | `mixture_of_agents` |
-| **Skills** | `skills_tool.py` | `skills_categories`, `skills_list`, `skill_view` |
+| **Skills** | `skills_tool.py` | `skills_list`, `skill_view` |
+| **Cronjob** | `cronjob_tools.py` | `schedule_cronjob`, `list_cronjobs`, `remove_cronjob` |
+| **RL Training** | `rl_training_tool.py` | `rl_list_environments`, `rl_start_training`, `rl_check_status`, etc. |

 ## Tool Registration

@@ -0,0 +1,330 @@
+# Hermes-Agent Atropos Environments
+
+This directory contains the integration layer between **hermes-agent's** tool-calling capabilities and the **Atropos** RL training framework. It provides everything needed to run agentic LLMs through multi-turn tool-calling loops, score their output with arbitrary reward functions, and feed results into Atropos for training or evaluation.
+
+## Architecture Overview
+
+```
+                        Atropos Framework
+                    ┌───────────────────────┐
+                    │       BaseEnv          │  (atroposlib)
+                    │  - Server management   │
+                    │  - Worker scheduling   │
+                    │  - Wandb logging       │
+                    │  - CLI (serve/process/ │
+                    │    evaluate)           │
+                    └───────────┬───────────┘
+                                │ inherits
+                    ┌───────────┴───────────┐
+                    │  HermesAgentBaseEnv    │  hermes_base_env.py
+                    │  - Terminal backend    │
+                    │  - Tool resolution     │
+                    │  - Agent loop          │
+                    │  - ToolContext          │
+                    │  - Async patches       │
+                    └───────────┬───────────┘
+                                │ inherits
+              ┌─────────────────┼─────────────────┐
+              │                 │                  │
+     TerminalTestEnv     HermesSweEnv    TerminalBench2EvalEnv
+     (stack testing)     (SWE training)   (TB2 benchmark eval)
+```
+
+### Inheritance Chain
+
+**BaseEnv** (from `atroposlib`) is the Atropos base class. It provides:
+- Server management (OpenAI-compatible API servers, VLLM, SGLang)
+- Worker scheduling for parallel rollouts
+- Wandb integration for metrics and rollout logging
+- CLI interface with three subcommands: `serve`, `process`, `evaluate`
+- `evaluate_log()` for saving eval results to JSON + samples.jsonl
+
+**HermesAgentBaseEnv** (`hermes_base_env.py`) extends BaseEnv with hermes-agent specifics:
+- Sets `os.environ["TERMINAL_ENV"]` to configure the terminal backend (local, docker, modal, ssh, singularity)
+- Resolves hermes-agent toolsets via `_resolve_tools_for_group()` (calls `get_tool_definitions()` from `model_tools.py`)
+- Implements `collect_trajectory()` which runs the full agent loop and computes rewards
+- Supports two-phase operation (Phase 1: OpenAI server, Phase 2: VLLM ManagedServer)
+- Applies monkey patches for async-safe tool operation at import time
+
+Concrete environments inherit from `HermesAgentBaseEnv` and implement:
+- `setup()` -- Load dataset, initialize state
+- `get_next_item()` -- Return the next item for rollout
+- `format_prompt()` -- Convert a dataset item into the user message
+- `compute_reward()` -- Score the rollout using ToolContext
+- `evaluate()` -- Periodic evaluation logic
+
+## Core Components
+
+### Agent Loop (`agent_loop.py`)
+
+`HermesAgentLoop` is the reusable multi-turn agent engine. It runs the same pattern as hermes-agent's `run_agent.py`:
+
+1. Send messages + tools to the API via `server.chat_completion()`
+2. If the response contains `tool_calls`, execute each one via `handle_function_call()` from `model_tools.py`
+3. Append tool results to the conversation and go back to step 1
+4. If the response has no tool_calls, the agent is done
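
The four steps above can be sketched in isolation. This is a minimal illustration, not the code in `agent_loop.py`: `ScriptedServer`, `tool_call`, `dispatch`, and `run_loop` are hypothetical names, the server replays canned replies instead of calling a model, and `dispatch` stands in for `handle_function_call()`.

```python
import asyncio
import json
from types import SimpleNamespace


class ScriptedServer:
    """Hypothetical stand-in for a chat-completion server: replays canned replies."""

    def __init__(self, replies):
        self._replies = list(replies)

    async def chat_completion(self, **kwargs):
        message = self._replies.pop(0)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def tool_call(call_id, name, arguments):
    """Build an object shaped like an OpenAI-spec tool call."""
    function = SimpleNamespace(name=name, arguments=json.dumps(arguments))
    return SimpleNamespace(id=call_id, function=function)


def dispatch(name, args):
    """Hypothetical dispatcher standing in for handle_function_call()."""
    if name == "add":
        return json.dumps({"result": args["a"] + args["b"]})
    return json.dumps({"error": f"Unknown tool '{name}'"})


async def run_loop(server, messages, max_turns=30):
    for turn in range(max_turns):
        # Step 1: send the conversation to the server.
        response = await server.chat_completion(messages=messages, n=1)
        msg = response.choices[0].message
        if not msg.tool_calls:
            # Step 4: no tool calls means the agent is done.
            messages.append({"role": "assistant", "content": msg.content or ""})
            return turn + 1, True
        # Step 2: record the assistant turn, then execute each tool call.
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [
                {"id": tc.id, "type": "function",
                 "function": {"name": tc.function.name,
                              "arguments": tc.function.arguments}}
                for tc in msg.tool_calls
            ],
        })
        for tc in msg.tool_calls:
            result = dispatch(tc.function.name, json.loads(tc.function.arguments))
            # Step 3: append the tool result, then go back to step 1.
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return max_turns, False


replies = [
    SimpleNamespace(content="", tool_calls=[tool_call("call_1", "add", {"a": 2, "b": 3})]),
    SimpleNamespace(content="The sum is 5.", tool_calls=None),
]
history = [{"role": "user", "content": "What is 2 + 3?"}]
turns_used, finished = asyncio.run(run_loop(ScriptedServer(replies), history))
print(turns_used, finished, len(history))  # prints: 2 True 4
```

The conversation list is mutated in place, so after the run it holds the user message, the assistant tool-call turn, the tool result, and the final assistant answer.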
+
+Tool calls are executed in a thread pool (`run_in_executor`) so backends that use `asyncio.run()` internally (Modal, Docker) don't deadlock inside Atropos's event loop.
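
The thread-pool hand-off can be sketched as follows. `sync_tool` and `_remote_call` are hypothetical stand-ins for a backend that calls `asyncio.run()` internally. In plain asyncio the direct call fails with a `RuntimeError` rather than deadlocking; real backends may misbehave differently, so treat this as an illustration of the pattern only.

```python
import asyncio
import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


async def _remote_call(command):
    """Hypothetical async backend operation."""
    await asyncio.sleep(0.01)
    return f"ran: {command}"


def sync_tool(command):
    """Hypothetical sync tool that calls asyncio.run() internally."""
    coro = _remote_call(command)
    try:
        return asyncio.run(coro)
    finally:
        # No-op after a successful run; avoids a "never awaited" warning
        # when asyncio.run() refuses to start.
        coro.close()


async def call_directly():
    # Called on the event-loop thread: asyncio.run() cannot be nested.
    try:
        sync_tool("ls")
    except RuntimeError as exc:
        return type(exc).__name__
    return "no error"


async def call_in_pool():
    # Called on a worker thread: no loop is running there, so asyncio.run() works.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, sync_tool, "ls")


async def main():
    return await call_directly(), await call_in_pool()


direct, pooled = asyncio.run(main())
executor.shutdown()
print(direct, pooled)  # prints: RuntimeError ran: ls
```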
+
+The loop returns an `AgentResult` containing the full conversation history, the number of turns used, the reasoning content for each turn, any tool errors, and the optional ManagedServer state (Phase 2 only).
+
+### Tool Context (`tool_context.py`)
+
+`ToolContext` is a per-rollout handle that gives reward/verification functions direct access to **all** hermes-agent tools, scoped to the rollout's `task_id`. The same `task_id` means the terminal/browser session is the SAME one the model used during its rollout -- all state (files, processes, browser tabs) is preserved.
+
+```python
+async def compute_reward(self, item, result, ctx: ToolContext):
+    # Run tests in the model's terminal sandbox
+    test = ctx.terminal("pytest -v")
+    if test["exit_code"] == 0:
+        return 1.0
+
+    # Check if a file was created
+    content = ctx.read_file("/workspace/solution.py")
+    if content.get("content"):
+        return 0.5
+
+    # Download files locally for verification (binary-safe)
+    ctx.download_file("/remote/output.bin", "/local/output.bin")
+
+    return 0.0
+```
+
+Available methods:
+- **Terminal**: `terminal(command, timeout)` -- run shell commands
+- **Files**: `read_file(path)`, `write_file(path, content)`, `search(query, path)`
+- **Transfers**: `upload_file()`, `upload_dir()`, `download_file()`, `download_dir()` -- binary-safe file transfers between host and sandbox
+- **Web**: `web_search(query)`, `web_extract(urls)`
+- **Browser**: `browser_navigate(url)`, `browser_snapshot()`
+- **Generic**: `call_tool(name, args)` -- call any hermes-agent tool by name
+- **Cleanup**: `cleanup()` -- release all resources (called automatically after `compute_reward`)
+
+### Patches (`patches.py`)
+
+**Problem**: Some hermes-agent tools use `asyncio.run()` internally (e.g., mini-swe-agent's Modal backend via SWE-ReX). This crashes when called from inside Atropos's event loop because `asyncio.run()` cannot be nested.
+
+**Solution**: `patches.py` monkey-patches `SwerexModalEnvironment` to use a dedicated background thread (`_AsyncWorker`) with its own event loop. The calling code sees the same sync interface, but internally the async work happens on a separate thread that doesn't conflict with Atropos's loop.
+
+What gets patched:
+- `SwerexModalEnvironment.__init__` -- creates Modal deployment on a background thread
+- `SwerexModalEnvironment.execute` -- runs commands on the same background thread
+- `SwerexModalEnvironment.stop` -- stops deployment on the background thread
+
+The patches are:
+- **Idempotent** -- calling `apply_patches()` multiple times is safe
+- **Transparent** -- same interface and behavior, only the internal async execution changes
+- **Universal** -- works identically in normal CLI use (no running event loop)
+
+Applied automatically at import time by `hermes_base_env.py`.
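
The background-thread idea can be sketched like this. It is a minimal illustration under assumptions, not the `_AsyncWorker` in `patches.py`: the class name `AsyncWorker`, its methods, and `fetch` are hypothetical.

```python
import asyncio
import threading


class AsyncWorker:
    """Hypothetical sketch: a background thread that owns its own event loop."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        """Run a coroutine on the worker loop and block until it finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def stop(self):
        """Stop the worker loop, wait for the thread, and close the loop."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch(value):
    """Hypothetical async operation, e.g. a remote command."""
    await asyncio.sleep(0.01)
    return value * 2


worker = AsyncWorker()


async def caller():
    # Sync call made while a loop is running: no nested asyncio.run() is involved.
    return worker.run(fetch(21))


from_loop = asyncio.run(caller())  # called from inside an event loop
from_sync = worker.run(fetch(4))   # called from plain sync code
worker.stop()
print(from_loop, from_sync)  # prints: 42 8
```

Note that `worker.run()` blocks the calling thread until the coroutine finishes, so a caller on an event-loop thread stalls that loop for the duration of the call.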
+
+### Tool Call Parsers (`tool_call_parsers/`)
+
+Client-side parsers that extract structured `tool_calls` from raw model output text. Used in **Phase 2** (VLLM server type) where ManagedServer's `/generate` endpoint returns raw text without tool call parsing.
+
+Each parser is a standalone reimplementation of the corresponding VLLM parser's `extract_tool_calls()` logic. The parsers have no VLLM dependency; they use only the standard library (`re`, `json`, `uuid`) and `openai` types.
+
+Available parsers:
+- `hermes` -- Hermes/ChatML `<tool_call>` XML format
+- `mistral` -- Mistral `[TOOL_CALLS]` format
+- `llama3_json` -- Llama 3 JSON tool calling
+- `qwen` -- Qwen tool calling format
+- `qwen3_coder` -- Qwen3 Coder format
+- `deepseek_v3` -- DeepSeek V3 format
+- `deepseek_v3_1` -- DeepSeek V3.1 format
+- `kimi_k2` -- Kimi K2 format
+- `longcat` -- Longcat format
+- `glm45` / `glm47` -- GLM model formats
+
+Usage:
+```python
+from environments.tool_call_parsers import get_parser
+
+parser = get_parser("hermes")
+content, tool_calls = parser.parse(raw_model_output)
+```
+
+In Phase 1 (OpenAI server type), these parsers are not needed -- the server handles tool call parsing natively.
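
To make the parsing step concrete, here is a minimal sketch of extracting Hermes-style `<tool_call>` blocks from raw text. It is not the code in `hermes_parser.py`: `parse_hermes_tool_calls` is a hypothetical function, it returns plain dicts rather than `openai` types, and it assumes each block holds a JSON object with `name` and `arguments` keys.

```python
import json
import re
import uuid

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text):
    """Return (content, tool_calls) from raw model output.

    Both values are None when empty. Malformed blocks are skipped.
    """
    tool_calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI-spec arguments are a JSON string, not a dict.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    # Whatever remains after removing every <tool_call> block is the content.
    content = TOOL_CALL_RE.sub("", text).strip() or None
    return content, tool_calls or None


raw = (
    "Let me check.\n"
    "<tool_call>\n"
    '{"name": "terminal", "arguments": {"command": "ls"}}\n'
    "</tool_call>"
)
content, tool_calls = parse_hermes_tool_calls(raw)
print(content, tool_calls[0]["function"]["name"])  # prints: Let me check. terminal
```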
+
+## Two-Phase Operation
+
+### Phase 1: OpenAI Server (Evaluation / SFT Data Generation)
+
+Uses `server.chat_completion()` with the `tools=` parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively and returns `ChatCompletion` objects with structured `tool_calls`.
+
+- Good for: evaluation, SFT data generation, testing
+- Run with: `serve` (with `run-api`), `process`, or `evaluate` subcommands
+- Placeholder tokens are created for the Atropos pipeline
+
+### Phase 2: VLLM ManagedServer (Full RL Training)
+
+Uses ManagedServer to obtain exact token IDs and logprobs via `/generate`. A client-side tool call parser (from `tool_call_parsers/`) reconstructs structured `tool_calls` from the raw output.
+
+- Good for: full RL training with GRPO/PPO
+- Run with: `serve` subcommand
+- Real tokens, masks, and logprobs flow through the pipeline
+
+## Directory Structure
+
+```
+environments/
+├── README.md                     # This file
+├── __init__.py                   # Package exports
+├── hermes_base_env.py            # Abstract base (HermesAgentBaseEnv)
+├── agent_loop.py                 # Multi-turn agent engine (HermesAgentLoop)
+├── tool_context.py               # Per-rollout tool access for reward functions
+├── patches.py                    # Async-safety patches for Modal backend
+│
+├── tool_call_parsers/            # Phase 2 client-side parsers
+│   ├── __init__.py               # Registry + base class
+│   ├── hermes_parser.py
+│   ├── mistral_parser.py
+│   ├── llama_parser.py
+│   ├── qwen_parser.py
+│   ├── qwen3_coder_parser.py
+│   ├── deepseek_v3_parser.py
+│   ├── deepseek_v3_1_parser.py
+│   ├── kimi_k2_parser.py
+│   ├── longcat_parser.py
+│   ├── glm45_parser.py
+│   └── glm47_parser.py
+│
+├── terminal_test_env/            # Stack validation environment
+│   └── terminal_test_env.py
+│
+├── hermes_swe_env/               # SWE-bench style training environment
+│   └── hermes_swe_env.py
+│
+└── benchmarks/                   # Evaluation benchmarks
+    └── terminalbench_2/
+        └── terminalbench2_env.py
+```
+
+## Concrete Environments
+
+### TerminalTestEnv (`terminal_test_env/`)
+
+A self-contained environment with inline tasks (no external dataset needed) for validating the full stack end-to-end. Each task asks the model to create a file at a known path, and the verifier checks that the content matches.
+
+```bash
+# Serve mode (needs run-api)
+run-api
+python environments/terminal_test_env/terminal_test_env.py serve
+
+# Process mode (no run-api, saves to JSONL)
+python environments/terminal_test_env/terminal_test_env.py process \
+    --env.data_path_to_save_groups terminal_test_output.jsonl
+```
+
+### HermesSweEnv (`hermes_swe_env/`)
+
+SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
+
+```bash
+python environments/hermes_swe_env/hermes_swe_env.py serve \
+    --openai.model_name YourModel \
+    --env.dataset_name bigcode/humanevalpack \
+    --env.terminal_backend modal
+```
+
+### TerminalBench2EvalEnv (`benchmarks/terminalbench_2/`)
+
+**Eval-only** environment for the Terminal-Bench 2.0 benchmark (89 tasks). Each task gets a pre-built Docker Hub image, a natural language instruction, and a test suite. The agent uses terminal + file tools to solve the task, then the test suite verifies correctness.
+
+Follows the standard Atropos eval pattern (like GPQA, MMLU, etc.):
+- Run via `evaluate` subcommand (no `run-api` needed)
+- `setup()` loads the dataset, `evaluate()` runs all tasks
+- `rollout_and_score_eval()` handles per-task agent loop + test verification
+- Downloads verifier output locally for reliable reward checking (Harbor pattern)
+
+```bash
+# Run full benchmark
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+    --openai.model_name anthropic/claude-opus-4.6
+
+# Run subset of tasks
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+    --openai.model_name anthropic/claude-opus-4.6 \
+    --env.task_filter fix-git,git-multibranch
+
+# Skip specific tasks
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+    --openai.model_name anthropic/claude-opus-4.6 \
+    --env.skip_tasks heavy-task,slow-task
+```
+
+## Creating a New Environment
+
+### Training Environment
+
+1. Create a new directory under `environments/`
+2. Create your env file inheriting from `HermesAgentBaseEnv`
+3. Implement the four abstract methods + `evaluate()`
+
+```python
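+# Assumed import locations for the names used below; adjust to your setup.
+from atroposlib.envs.base import APIServerConfig
+from datasets import load_dataset
+
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig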
+
+class MyEnvConfig(HermesAgentEnvConfig):
+    pass  # Add custom fields as needed
+
+class MyEnv(HermesAgentBaseEnv):
+    name = "my-env"
+    env_config_cls = MyEnvConfig
+
+    @classmethod
+    def config_init(cls):
+        env_config = MyEnvConfig(
+            enabled_toolsets=["terminal", "file"],
+            terminal_backend="modal",
+            # ... other config
+        )
+        server_configs = [APIServerConfig(...)]
+        return env_config, server_configs
+
+    async def setup(self):
+        self.dataset = load_dataset(...)
+        self.iter = 0
+
+    async def get_next_item(self):
+        item = self.dataset[self.iter % len(self.dataset)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item):
+        return item["instruction"]
+
+    async def compute_reward(self, item, result, ctx):
+        # ctx gives you full tool access to the rollout's sandbox
+        test = ctx.terminal("pytest -v")
+        return 1.0 if test["exit_code"] == 0 else 0.0
+
+    async def evaluate(self, *args, **kwargs):
+        # Periodic evaluation logic
+        ...
+
+if __name__ == "__main__":
+    MyEnv.cli()
+```
+
+### Eval-Only Environment (Benchmark)
+
+For eval benchmarks, follow the pattern in `terminalbench2_env.py`:
+1. Create under `environments/benchmarks/your-benchmark/`
+2. Inherit from `HermesAgentBaseEnv`
+3. Set eval-only config: `eval_handling=STOP_TRAIN`, `steps_per_eval=1`, `total_steps=1`
+4. Stub the training methods (`collect_trajectories`, `score`)
+5. Implement `rollout_and_score_eval()` and `evaluate()`
+6. Run with `evaluate` subcommand
+
+## Key Config Fields
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| `enabled_toolsets` | Which hermes toolsets to enable | `None` (all) |
+| `disabled_toolsets` | Toolsets to disable | `None` |
+| `distribution` | Probabilistic toolset distribution name | `None` |
+| `max_agent_turns` | Max LLM calls per rollout | `30` |
+| `agent_temperature` | Sampling temperature | `1.0` |
+| `terminal_backend` | `local`, `docker`, `modal`, `ssh`, `singularity` | `local` |
+| `system_prompt` | System message for the agent | `None` |
+| `tool_call_parser` | Parser name for Phase 2 | `hermes` |
+| `eval_handling` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` | `STOP_TRAIN` |
@@ -0,0 +1,32 @@
+"""
+Hermes-Agent Atropos Environments
+
+Provides a layered integration between hermes-agent's tool-calling capabilities
+and the Atropos RL training framework.
+
+Core layers:
+    - agent_loop: Reusable multi-turn agent loop with standard OpenAI-spec tool calling
+    - tool_context: Per-rollout tool access handle for reward/verification functions
+    - hermes_base_env: Abstract base environment (BaseEnv subclass) for Atropos
+    - tool_call_parsers: Client-side tool call parser registry for Phase 2 (VLLM /generate)
+
+Concrete environments:
+    - terminal_test_env/: Simple file-creation tasks for testing the stack
+    - hermes_swe_env/: SWE-bench style tasks with Modal sandboxes
+    - endless_terminals/: Terminal tasks from HuggingFace dataset with Apptainer containers
+
+Benchmarks (eval-only):
+    - benchmarks/terminalbench_2/: Terminal-Bench 2.0 evaluation
+"""
+
+from environments.agent_loop import AgentResult, HermesAgentLoop
+from environments.tool_context import ToolContext
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+
+__all__ = [
+    "AgentResult",
+    "HermesAgentLoop",
+    "ToolContext",
+    "HermesAgentBaseEnv",
+    "HermesAgentEnvConfig",
+]
@@ -0,0 +1,421 @@
+"""
+HermesAgentLoop -- Reusable Multi-Turn Agent Engine
+
+Runs the hermes-agent tool-calling loop using standard OpenAI-spec tool calling.
+Works with any server that returns ChatCompletion objects with tool_calls:
+    - Phase 1: OpenAI server type (VLLM, SGLang, OpenRouter, OpenAI API)
+    - Phase 2: ManagedServer with client-side tool call parser
+
+The loop passes tools= and checks response.choices[0].message.tool_calls,
+identical to hermes-agent's run_agent.py. Tool execution is dispatched via
+handle_function_call() from model_tools.py.
+"""
+
+import asyncio
+import concurrent.futures
+import json
+import logging
+import os
+import uuid
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Optional, Set
+
+from model_tools import handle_function_call
+
+# Thread pool for running sync tool calls that internally use asyncio.run()
+# (e.g., mini-swe-agent's modal/docker backends). Running them in a separate
+# thread gives them a clean event loop so they don't deadlock inside Atropos's loop.
+# Size must be large enough for concurrent eval tasks (e.g., 89 TB2 tasks all
+# making tool calls). Too small = thread pool starvation, tasks queue for minutes.
+# Resized at runtime by HermesAgentBaseEnv.__init__ via resize_tool_pool().
+_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=128)
+
+
+logger = logging.getLogger(__name__)
+
+
+def resize_tool_pool(max_workers: int) -> None:
+    """
+    Replace the global tool executor with a new one of the given size.
+
+    Called by HermesAgentBaseEnv.__init__ based on config.tool_pool_size.
+    Safe to call before any tasks are submitted.
+    """
+    global _tool_executor
+    _tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
+    logger.info("Tool thread pool resized to %d workers", max_workers)
+
+
+@dataclass
+class ToolError:
+    """Record of a tool execution error during the agent loop."""
+
+    turn: int                  # Which turn the error occurred on
+    tool_name: str             # Which tool was called
+    arguments: str             # The arguments passed (truncated)
+    error: str                 # The error message
+    tool_result: str           # The raw result returned to the model
+
+
+@dataclass
+class AgentResult:
+    """Result of running the agent loop."""
+
+    # Full conversation history in OpenAI message format
+    messages: List[Dict[str, Any]]
+    # ManagedServer.get_state() if available (Phase 2), None otherwise
+    managed_state: Optional[Dict[str, Any]] = None
+    # How many LLM calls were made
+    turns_used: int = 0
+    # True if model stopped calling tools naturally (vs hitting max_turns)
+    finished_naturally: bool = False
+    # Extracted reasoning content per turn (from PR #297 helpers)
+    reasoning_per_turn: List[Optional[str]] = field(default_factory=list)
+    # Tool errors encountered during the loop
+    tool_errors: List[ToolError] = field(default_factory=list)
+
+
+def _extract_reasoning_from_message(message) -> Optional[str]:
+    """
+    Extract reasoning content from a ChatCompletion message.
+
+    Handles multiple provider formats:
+    1. message.reasoning_content field (some providers)
+    2. message.reasoning field (some providers)
+    3. message.reasoning_details[].text (OpenRouter style)
+
+    Note: <think> block extraction from content is NOT done here. In Phase 1
+    the server has already handled it, and in Phase 2 ManagedServer's patch
+    handles it.
+
+    Args:
+        message: The assistant message from ChatCompletion response
+
+    Returns:
+        Extracted reasoning text, or None if not found
+    """
+    # Check reasoning_content field (common across providers)
+    if hasattr(message, "reasoning_content") and message.reasoning_content:
+        return message.reasoning_content
+
+    # Check reasoning field
+    if hasattr(message, "reasoning") and message.reasoning:
+        return message.reasoning
+
+    # Check reasoning_details (OpenRouter style)
+    if hasattr(message, "reasoning_details") and message.reasoning_details:
+        for detail in message.reasoning_details:
+            if hasattr(detail, "text") and detail.text:
+                return detail.text
+            if isinstance(detail, dict) and detail.get("text"):
+                return detail["text"]
+
+    return None
+
+
+class HermesAgentLoop:
+    """
+    Runs hermes-agent's tool-calling loop using standard OpenAI-spec tool calling.
+
+    Same pattern as run_agent.py:
+    - Pass tools= to the API
+    - Check response.choices[0].message.tool_calls
+    - Dispatch via handle_function_call()
+
+    Works identically with any server type -- OpenAI, VLLM, SGLang, OpenRouter,
+    or ManagedServer with a parser. The server determines how tool_calls get
+    populated on the response.
+    """
+
+    def __init__(
+        self,
+        server,
+        tool_schemas: List[Dict[str, Any]],
+        valid_tool_names: Set[str],
+        max_turns: int = 30,
+        task_id: Optional[str] = None,
+        temperature: float = 1.0,
+        max_tokens: Optional[int] = None,
+        extra_body: Optional[Dict[str, Any]] = None,
+    ):
+        """
+        Initialize the agent loop.
+
+        Args:
+            server: Server object with chat_completion() method (OpenAIServer,
+                    ManagedServer, ServerManager, etc.)
+            tool_schemas: OpenAI-format tool definitions from get_tool_definitions()
+            valid_tool_names: Set of tool names the model is allowed to call
+            max_turns: Maximum number of LLM calls before stopping
+            task_id: Unique ID for terminal/browser session isolation
+            temperature: Sampling temperature for generation
+            max_tokens: Max tokens per generation (None for server default)
+            extra_body: Extra parameters passed to the OpenAI client's create() call.
+                        Used for OpenRouter provider preferences, transforms, etc.
+                        e.g. {"provider": {"ignore": ["DeepInfra"]}}
+        """
+        self.server = server
+        self.tool_schemas = tool_schemas
+        self.valid_tool_names = valid_tool_names
+        self.max_turns = max_turns
+        self.task_id = task_id or str(uuid.uuid4())
+        self.temperature = temperature
+        self.max_tokens = max_tokens
+        self.extra_body = extra_body
+
+    async def run(self, messages: List[Dict[str, Any]]) -> AgentResult:
+        """
+        Execute the full agent loop using standard OpenAI tool calling.
+
+        Args:
+            messages: Initial conversation messages (system + user).
+                      Modified in-place as the conversation progresses.
+
+        Returns:
+            AgentResult with full conversation history, managed state, and metadata
+        """
+        reasoning_per_turn = []
+        tool_errors: List[ToolError] = []
+
+        import time as _time
+
+        for turn in range(self.max_turns):
+            turn_start = _time.monotonic()
+
+            # Build the chat_completion kwargs
+            chat_kwargs = {
+                "messages": messages,
+                "n": 1,
+                "temperature": self.temperature,
+            }
+
+            # Only pass tools if we have them
+            if self.tool_schemas:
+                chat_kwargs["tools"] = self.tool_schemas
+
+            # Only pass max_tokens if explicitly set
+            if self.max_tokens is not None:
+                chat_kwargs["max_tokens"] = self.max_tokens
+
+            # Inject extra_body for provider-specific params (e.g., OpenRouter
+            # provider preferences like banned/preferred providers, transforms)
+            if self.extra_body:
+                chat_kwargs["extra_body"] = self.extra_body
+
+            # Make the API call -- standard OpenAI spec
+            api_start = _time.monotonic()
+            try:
+                response = await self.server.chat_completion(**chat_kwargs)
+            except Exception as e:
+                api_elapsed = _time.monotonic() - api_start
+                logger.error("API call failed on turn %d (%.1fs): %s", turn + 1, api_elapsed, e)
+                return AgentResult(
+                    messages=messages,
+                    managed_state=self._get_managed_state(),
+                    turns_used=turn + 1,
+                    finished_naturally=False,
+                    reasoning_per_turn=reasoning_per_turn,
+                    tool_errors=tool_errors,
+                )
+
+            api_elapsed = _time.monotonic() - api_start
+
+            if not response or not response.choices:
+                logger.warning("Empty response on turn %d (api=%.1fs)", turn + 1, api_elapsed)
+                return AgentResult(
+                    messages=messages,
+                    managed_state=self._get_managed_state(),
+                    turns_used=turn + 1,
+                    finished_naturally=False,
+                    reasoning_per_turn=reasoning_per_turn,
+                    tool_errors=tool_errors,
+                )
+
+            assistant_msg = response.choices[0].message
+
+            # Extract reasoning content from the response (all provider formats)
+            reasoning = _extract_reasoning_from_message(assistant_msg)
+            reasoning_per_turn.append(reasoning)
+
+            # Check for tool calls -- standard OpenAI spec
+            if assistant_msg.tool_calls:
+                # Build the assistant message dict for conversation history
+                msg_dict: Dict[str, Any] = {
+                    "role": "assistant",
+                    "content": assistant_msg.content or "",
+                    "tool_calls": [
+                        {
+                            "id": tc.id,
+                            "type": "function",
+                            "function": {
+                                "name": tc.function.name,
+                                "arguments": tc.function.arguments,
+                            },
+                        }
+                        for tc in assistant_msg.tool_calls
+                    ],
+                }
+
+                # Preserve reasoning_content for multi-turn chat template handling
+                # (e.g., Kimi-K2's template renders <think> blocks differently
+                # for history vs. the latest turn based on this field)
+                if reasoning:
+                    msg_dict["reasoning_content"] = reasoning
+
+                messages.append(msg_dict)
+
+                # Execute each tool call via hermes-agent's dispatch
+                for tc in assistant_msg.tool_calls:
+                    tool_name = tc.function.name
+                    tool_args_raw = tc.function.arguments
+
+                    # Validate tool name
+                    if tool_name not in self.valid_tool_names:
+                        tool_result = json.dumps(
+                            {
+                                "error": f"Unknown tool '{tool_name}'. "
+                                f"Available tools: {sorted(self.valid_tool_names)}"
+                            }
+                        )
+                        tool_errors.append(ToolError(
+                            turn=turn + 1, tool_name=tool_name,
+                            arguments=tool_args_raw[:200],
+                            error=f"Unknown tool '{tool_name}'",
+                            tool_result=tool_result,
+                        ))
+                        logger.warning(
+                            "Model called unknown tool '%s' on turn %d",
+                            tool_name, turn + 1,
+                        )
+                    else:
+                        # Parse arguments and dispatch
+                        try:
+                            args = json.loads(tool_args_raw)
+                        except json.JSONDecodeError:
+                            args = {}
+                            logger.warning(
+                                "Invalid JSON in tool call arguments for '%s': %s",
+                                tool_name, tool_args_raw[:200],
+                            )
+
+                        try:
+                            if tool_name == "terminal":
+                                # Log a short preview of the command being run
+                                cmd_preview = args.get("command", "")[:80]
+                                logger.info(
+                                    "[%s] $ %s", self.task_id[:8], cmd_preview,
+                                )
+
+                            # Run tool calls in a thread pool so backends that use
+                            # asyncio.run() internally (modal, docker) get a clean
+                            # event loop instead of deadlocking inside Atropos's loop.
+                            tool_submit_time = _time.monotonic()
+                            loop = asyncio.get_running_loop()
+                            tool_result = await loop.run_in_executor(
+                                _tool_executor,
+                                lambda: handle_function_call(
+                                    tool_name, args, task_id=self.task_id
+                                ),
+                            )
+                            tool_elapsed = _time.monotonic() - tool_submit_time
+
+                            # Log slow tools with the executor backlog for debugging.
+                            # _work_queue is a private ThreadPoolExecutor attribute; its
+                            # size counts calls that are queued but not yet running.
+                            pool_queued = _tool_executor._work_queue.qsize()
+                            if tool_elapsed > 30:
+                                logger.warning(
+                                    "[%s] turn %d: %s took %.1fs (pool queue=%d)",
+                                    self.task_id[:8], turn + 1, tool_name,
+                                    tool_elapsed, pool_queued,
+                                )
+                        except Exception as e:
+                            tool_result = json.dumps(
+                                {"error": f"Tool execution failed: {type(e).__name__}: {str(e)}"}
+                            )
+                            tool_errors.append(ToolError(
+                                turn=turn + 1, tool_name=tool_name,
+                                arguments=tool_args_raw[:200],
+                                error=f"{type(e).__name__}: {str(e)}",
+                                tool_result=tool_result,
+                            ))
+                            logger.error(
+                                "Tool '%s' execution failed on turn %d: %s",
+                                tool_name, turn + 1, e,
+                            )
+
+                        # Also check if the tool returned an error in its JSON result
+                        try:
+                            result_data = json.loads(tool_result)
+                            if isinstance(result_data, dict):
+                                err = result_data.get("error")
+                                exit_code = result_data.get("exit_code")
+                                if err and exit_code and exit_code < 0:
+                                    tool_errors.append(ToolError(
+                                        turn=turn + 1, tool_name=tool_name,
+                                        arguments=tool_args_raw[:200],
+                                        error=str(err),
+                                        tool_result=tool_result[:500],
+                                    ))
+                        except (json.JSONDecodeError, TypeError):
+                            pass
+
+                    # Add tool response to conversation
+                    messages.append(
+                        {
+                            "role": "tool",
+                            "tool_call_id": tc.id,
+                            "content": tool_result,
+                        }
+                    )
+
+                turn_elapsed = _time.monotonic() - turn_start
+                logger.info(
+                    "[%s] turn %d: api=%.1fs, %d tools, turn_total=%.1fs",
+                    self.task_id[:8], turn + 1, api_elapsed,
+                    len(assistant_msg.tool_calls), turn_elapsed,
+                )
+
+            else:
+                # No tool calls -- model is done
+                msg_dict = {
+                    "role": "assistant",
+                    "content": assistant_msg.content or "",
+                }
+                if reasoning:
+                    msg_dict["reasoning_content"] = reasoning
+                messages.append(msg_dict)
+
+                turn_elapsed = _time.monotonic() - turn_start
+                logger.info(
+                    "[%s] turn %d: api=%.1fs, no tools (finished), turn_total=%.1fs",
+                    self.task_id[:8], turn + 1, api_elapsed, turn_elapsed,
+                )
+
+                return AgentResult(
+                    messages=messages,
+                    managed_state=self._get_managed_state(),
+                    turns_used=turn + 1,
+                    finished_naturally=True,
+                    reasoning_per_turn=reasoning_per_turn,
+                    tool_errors=tool_errors,
+                )
+
+        # Hit max turns without the model stopping
+        logger.info("Agent hit max_turns (%d) without finishing", self.max_turns)
+        return AgentResult(
+            messages=messages,
+            managed_state=self._get_managed_state(),
+            turns_used=self.max_turns,
+            finished_naturally=False,
+            reasoning_per_turn=reasoning_per_turn,
+            tool_errors=tool_errors,
+        )
+
+    def _get_managed_state(self) -> Optional[Dict[str, Any]]:
+        """
+        Get ManagedServer state if the server supports it.
+
+        Returns state dict with SequenceNodes containing tokens/logprobs/masks,
+        or None if the server doesn't support get_state() (e.g., regular OpenAI server).
+        """
+        if hasattr(self.server, "get_state"):
+            return self.server.get_state()
+        return None
@@ -0,0 +1,38 @@
+# Terminal-Bench 2.0 Evaluation -- Default Configuration
+#
+# Eval-only environment for the TB2 benchmark (89 terminal tasks).
+# Uses Modal terminal backend for per-task cloud-isolated sandboxes
+# and OpenRouter for inference.
+#
+# Usage:
+#   python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+#       --config environments/benchmarks/terminalbench_2/default.yaml
+#
+#   # Override model:
+#   python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+#       --config environments/benchmarks/terminalbench_2/default.yaml \
+#       --openai.model_name anthropic/claude-sonnet-4
+
+env:
+  enabled_toolsets: ["terminal", "file"]
+  max_agent_turns: 60
+  max_token_length: 32000
+  agent_temperature: 0.8
+  terminal_backend: "modal"
+  terminal_timeout: 300        # 5 min per command (builds, pip install)
+  tool_pool_size: 128          # thread pool for 89 parallel tasks
+  dataset_name: "NousResearch/terminal-bench-2"
+  test_timeout: 600
+  task_timeout: 1800           # 30 min wall-clock per task, auto-FAIL if exceeded
+  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
+  use_wandb: true
+  wandb_name: "terminal-bench-2"
+  ensure_scores_are_not_same: false
+  data_dir_to_save_evals: "environments/benchmarks/evals/terminal-bench-2"
+
+openai:
+  base_url: "https://openrouter.ai/api/v1"
+  model_name: "anthropic/claude-opus-4.6"
+  server_type: "openai"
+  health_check: false
+  # api_key loaded from OPENROUTER_API_KEY in .env
@@ -0,0 +1,32 @@
+#!/bin/bash
+
+# Terminal-Bench 2.0 Evaluation
+#
+# Run from repo root:
+#   bash environments/benchmarks/terminalbench_2/run_eval.sh
+#
+# Override model:
+#   bash environments/benchmarks/terminalbench_2/run_eval.sh \
+#       --openai.model_name anthropic/claude-sonnet-4
+#
+# Run a subset:
+#   bash environments/benchmarks/terminalbench_2/run_eval.sh \
+#       --env.task_filter fix-git,git-multibranch
+
+mkdir -p logs environments/benchmarks/evals/terminal-bench-2
+LOG_FILE="logs/terminalbench2_$(date +%Y%m%d_%H%M%S).log"
+
+echo "Terminal-Bench 2.0 Evaluation"
+echo "Log: $LOG_FILE"
+echo ""
+
+export TERMINAL_ENV=modal
+export TERMINAL_TIMEOUT=300
+
+python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
+  --config environments/benchmarks/terminalbench_2/default.yaml \
+  "$@" \
+  2>&1 | tee "$LOG_FILE"
+
+echo ""
+echo "Log saved to: $LOG_FILE"
@@ -0,0 +1,904 @@
+"""
+TerminalBench2Env -- Terminal-Bench 2.0 Evaluation Environment
+
+Evaluates agentic LLMs on challenging terminal tasks from Terminal-Bench 2.0.
+Each task provides a unique Docker environment (pre-built on Docker Hub), a natural
+language instruction, and a test suite for verification. The agent uses terminal +
+file tools to complete the task, then the test suite runs inside the same sandbox.
+
+This is an eval-only environment (not a training environment). It is designed to
+be run via the `evaluate` subcommand:
+
+    python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \\
+        --env.dataset_name NousResearch/terminal-bench-2
+
+The evaluate flow:
+    1. setup()     -- Loads the TB2 dataset from HuggingFace
+    2. evaluate()  -- Runs all tasks, then aggregates and logs the results:
+        a. rollout_and_score_eval()  -- Per-task agent loop + test verification
+            - Resolves Docker image (pre-built Hub image or Dockerfile fallback)
+            - Registers per-task Modal sandbox via register_task_env_overrides()
+            - Runs the HermesAgentLoop (terminal + file tools)
+            - Uploads test suite and runs test.sh in the same sandbox
+            - Returns binary pass/fail result
+        b. Aggregates per-task, per-category, and overall pass rates
+        c. Logs results via evaluate_log() and wandb
+
+Key features:
+  - Per-task Modal sandboxes using pre-built Docker Hub images
+  - Binary reward: 1.0 if all tests pass, 0.0 otherwise
+  - Concurrency-controlled parallel evaluation via asyncio.Semaphore
+  - Per-task, per-category, and aggregate pass rate tracking
+"""
+
+import asyncio
+import base64
+import io
+import json
+import logging
+import os
+import shutil
+import sys
+import tarfile
+import tempfile
+import time
+import uuid
+from collections import defaultdict
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# Ensure repo root is on sys.path for imports
+_repo_root = Path(__file__).resolve().parent.parent.parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from pydantic import Field
+
+from atroposlib.envs.base import EvalHandlingEnum
+from atroposlib.envs.server_handling.server_manager import APIServerConfig
+
+from environments.agent_loop import AgentResult, HermesAgentLoop
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+from environments.tool_context import ToolContext
+from tools.terminal_tool import (
+    register_task_env_overrides,
+    clear_task_env_overrides,
+    cleanup_vm,
+)
+
+logger = logging.getLogger(__name__)
+
+
+# =============================================================================
+# Configuration
+# =============================================================================
+
+class TerminalBench2EvalConfig(HermesAgentEnvConfig):
+    """
+    Configuration for the Terminal-Bench 2.0 evaluation environment.
+
+    Extends HermesAgentEnvConfig with TB2-specific settings for dataset loading,
+    test execution, task filtering, and eval concurrency.
+    """
+
+    # --- Dataset ---
+    dataset_name: str = Field(
+        default="NousResearch/terminal-bench-2",
+        description="HuggingFace dataset containing TB2 tasks.",
+    )
+
+    # --- Test execution ---
+    test_timeout: int = Field(
+        default=180,
+        description="Timeout in seconds for running the test suite after agent completes.",
+    )
+
+    # --- Image strategy ---
+    force_build: bool = Field(
+        default=False,
+        description="If True, always build from Dockerfile (ignore docker_image). "
+        "Useful for testing custom Dockerfiles.",
+    )
+
+    # --- Task filtering (comma-separated from CLI) ---
+    task_filter: Optional[str] = Field(
+        default=None,
+        description="Comma-separated task names to run (e.g., 'fix-git,git-multibranch'). "
+        "If not set, all tasks are run.",
+    )
+    skip_tasks: Optional[str] = Field(
+        default=None,
+        description="Comma-separated task names to skip on top of the default skip list.",
+    )
+
+    # --- Per-task wall-clock timeout ---
+    task_timeout: int = Field(
+        default=1800,
+        description="Maximum wall-clock seconds per task (agent loop + verification). "
+        "Tasks exceeding this are scored as FAIL. Default 30 minutes.",
+    )
+
+
+# Tasks that cannot run properly on Modal and are excluded from scoring.
+MODAL_INCOMPATIBLE_TASKS = {
+    "qemu-startup",        # Needs KVM/hardware virtualization
+    "qemu-alpine-ssh",     # Needs KVM/hardware virtualization
+    "crack-7z-hash",       # Password brute-force -- too slow for cloud sandbox timeouts
+}
+
+
+# =============================================================================
+# Tar extraction helper
+# =============================================================================
+
+def _extract_base64_tar(b64_data: str, target_dir: Path):
+    """Extract a base64-encoded tar.gz archive into target_dir."""
+    if not b64_data:
+        return
+    raw = base64.b64decode(b64_data)
+    buf = io.BytesIO(raw)
+    with tarfile.open(fileobj=buf, mode="r:gz") as tar:
+        tar.extractall(path=str(target_dir))
+
+
+# =============================================================================
+# Main Environment
+# =============================================================================
+
+class TerminalBench2EvalEnv(HermesAgentBaseEnv):
+    """
+    Terminal-Bench 2.0 evaluation environment (eval-only, no training).
+
+    Inherits from HermesAgentBaseEnv for:
+      - Terminal backend setup (os.environ["TERMINAL_ENV"])
+      - Tool resolution via _resolve_tools_for_group()
+      - Monkey patches for async-safe tool operation
+      - Wandb trajectory formatting
+
+    The evaluate flow (triggered by `terminalbench2_env.py evaluate`):
+      1. setup()    -- Load dataset from HuggingFace
+      2. evaluate() -- Run all tasks through rollout_and_score_eval()
+
+    Each task in rollout_and_score_eval():
+      1. Resolve Docker image (pre-built Hub image or Dockerfile fallback)
+      2. Register per-task Modal sandbox override
+      3. Run HermesAgentLoop with terminal + file tools
+      4. Upload test suite and execute test.sh in the same sandbox
+      5. Check /logs/verifier/reward.txt for pass/fail
+      6. Clean up sandbox, overrides, and temp files
+    """
+
+    name = "terminal-bench-2"
+    env_config_cls = TerminalBench2EvalConfig
+
+    @classmethod
+    def config_init(cls) -> Tuple[TerminalBench2EvalConfig, List[APIServerConfig]]:
+        """
+        Default configuration for Terminal-Bench 2.0 evaluation.
+
+        Uses eval-only settings:
+          - eval_handling=STOP_TRAIN so the eval flow runs cleanly
+          - steps_per_eval=1, total_steps=1 so eval triggers immediately
+          - group_size=1 (one rollout per group, each task is expensive)
+
+        Uses Modal terminal backend (cloud-isolated sandbox per task) and
+        OpenRouter with Claude for inference.
+        """
+        env_config = TerminalBench2EvalConfig(
+            # Terminal + file tools only (the agent interacts via shell commands)
+            enabled_toolsets=["terminal", "file"],
+            disabled_toolsets=None,
+            distribution=None,
+
+            # Agent settings -- TB2 tasks are complex, need many turns
+            max_agent_turns=60,
+            max_token_length=16000,
+            agent_temperature=0.6,
+            system_prompt=None,
+
+            # Modal backend for per-task cloud-isolated sandboxes
+            terminal_backend="modal",
+            terminal_timeout=300,   # 5 min per command (builds, pip install, etc.)
+
+            # Test execution timeout (TB2 test scripts can install deps like pytest)
+            test_timeout=180,
+
+            # 89 tasks run in parallel, each needs a thread for tool calls
+            tool_pool_size=128,
+
+            # --- Eval-only Atropos settings ---
+            # These settings make the env work as an eval-only environment:
+            #   - STOP_TRAIN: pauses training during eval (standard for eval envs)
+            #   - steps_per_eval=1, total_steps=1: eval triggers immediately
+            #   - group_size=1: one rollout per group (each task is expensive)
+            eval_handling=EvalHandlingEnum.STOP_TRAIN,
+            group_size=1,
+            steps_per_eval=1,
+            total_steps=1,
+
+            tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
+            use_wandb=True,
+            wandb_name="terminal-bench-2",
+            ensure_scores_are_not_same=False,  # Binary rewards may all be 0 or 1
+        )
+
+        # OpenRouter with Claude -- API key loaded from .env
+        server_configs = [
+            APIServerConfig(
+                base_url="https://openrouter.ai/api/v1",
+                model_name="anthropic/claude-sonnet-4",
+                server_type="openai",
+                api_key=os.getenv("OPENROUTER_API_KEY", ""),
+                health_check=False,
+            )
+        ]
+
+        return env_config, server_configs
+
+    # =========================================================================
+    # Setup -- load dataset
+    # =========================================================================
+
+    async def setup(self):
+        """Load the Terminal-Bench 2.0 dataset from HuggingFace."""
+        from datasets import load_dataset
+
+        # Auto-set terminal_lifetime to task_timeout + 120s so sandboxes
+        # never get killed during an active task, but still get cleaned up
+        # promptly after the task times out.
+        lifetime = self.config.task_timeout + 120
+        self.config.terminal_lifetime = lifetime
+        os.environ["TERMINAL_LIFETIME_SECONDS"] = str(lifetime)
+        print(f"  Terminal lifetime auto-set to {lifetime}s (task_timeout + 120s)")
+
+        print(f"Loading TB2 dataset from: {self.config.dataset_name}")
+        ds = load_dataset(self.config.dataset_name, split="train")
+
+        # Apply task filters (comma-separated strings from CLI)
+        tasks = list(ds)
+        if self.config.task_filter:
+            allowed = {name.strip() for name in self.config.task_filter.split(",")}
+            tasks = [t for t in tasks if t["task_name"] in allowed]
+            print(f"  Filtered to {len(tasks)} tasks: {sorted(allowed)}")
+
+        # Skip tasks incompatible with the current backend (e.g., QEMU on Modal)
+        # plus any user-specified skip_tasks
+        skip = set(MODAL_INCOMPATIBLE_TASKS) if self.config.terminal_backend == "modal" else set()
+        if self.config.skip_tasks:
+            skip |= {name.strip() for name in self.config.skip_tasks.split(",")}
+        if skip:
+            # Report only tasks actually removed (incompatible or user-skipped).
+            skipped_names = sorted(t["task_name"] for t in tasks if t["task_name"] in skip)
+            tasks = [t for t in tasks if t["task_name"] not in skip]
+            if skipped_names:
+                print(f"  Skipped {len(skipped_names)} tasks: {skipped_names}")
+
+        self.all_eval_items = tasks
+        self.iter = 0
+
+        # Build category index for per-category metrics
+        self.category_index: Dict[str, List[int]] = defaultdict(list)
+        for i, task in enumerate(self.all_eval_items):
+            self.category_index[task.get("category", "unknown")].append(i)
+
+        # Reward tracking for wandb logging
+        self.eval_metrics: List[Tuple[str, float]] = []
+
+        # Streaming JSONL writer -- saves each task's full conversation
+        # immediately on completion so data is preserved even on Ctrl+C.
+        # Timestamped filename so each run produces a unique file.
+        import datetime
+        log_dir = os.path.join(os.path.dirname(__file__), "logs")
+        os.makedirs(log_dir, exist_ok=True)
+        run_ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+        self._streaming_path = os.path.join(log_dir, f"samples_{run_ts}.jsonl")
+        self._streaming_file = open(self._streaming_path, "w", encoding="utf-8")
+        self._streaming_lock = __import__("threading").Lock()
+        print(f"  Streaming results to: {self._streaming_path}")
+
+        print(f"TB2 ready: {len(self.all_eval_items)} tasks across {len(self.category_index)} categories")
+        for cat, indices in sorted(self.category_index.items()):
+            print(f"  {cat}: {len(indices)} tasks")
+
+    def _save_result(self, result: Dict[str, Any]):
+        """Write a single task result to the streaming JSONL file immediately."""
+        if not hasattr(self, "_streaming_file") or self._streaming_file.closed:
+            return
+        with self._streaming_lock:
+            self._streaming_file.write(json.dumps(result, ensure_ascii=False, default=str) + "\n")
+            self._streaming_file.flush()
+
+    # =========================================================================
+    # Training pipeline stubs -- NOT used in eval-only mode
+    # =========================================================================
+    # These satisfy the abstract method requirements from HermesAgentBaseEnv.
+    # The evaluate subcommand calls setup() -> evaluate() directly, bypassing
+    # the training pipeline entirely.
+
+    async def get_next_item(self):
+        """Return next item (stub -- not used in eval-only mode)."""
+        item = self.all_eval_items[self.iter % len(self.all_eval_items)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item: Dict[str, Any]) -> str:
+        """Return the task's instruction as the user prompt."""
+        return item["instruction"]
+
+    async def compute_reward(self, item, result, ctx) -> float:
+        """Compute reward (stub -- actual verification is in rollout_and_score_eval)."""
+        return 0.0
+
+    async def collect_trajectories(self, item):
+        """Collect trajectories (stub -- not used in eval-only mode)."""
+        return None, []
+
+    async def score(self, rollout_group_data):
+        """Score rollouts (stub -- not used in eval-only mode)."""
+        return None
+
+    # =========================================================================
+    # Docker image resolution
+    # =========================================================================
+
+    def _resolve_task_image(
+        self, item: Dict[str, Any], task_name: str
+    ) -> Tuple[str, Optional[Path]]:
+        """
+        Resolve the Docker image for a task, with fallback to Dockerfile.
+
+        Strategy (mirrors Harbor's approach):
+        1. If force_build=True, always build from Dockerfile in environment_tar
+        2. If docker_image is available, use the pre-built Docker Hub image (fast)
+        3. Otherwise, extract Dockerfile from environment_tar and build (slow)
+
+        Returns:
+            (modal_image, temp_dir) -- modal_image is a Docker Hub name or a
+            Dockerfile path. temp_dir is set if we extracted files that need
+            cleanup later.
+        """
+        docker_image = item.get("docker_image", "")
+        environment_tar = item.get("environment_tar", "")
+
+        # Fast path: use pre-built Docker Hub image
+        if docker_image and not self.config.force_build:
+            logger.info("Task %s: using pre-built image %s", task_name, docker_image)
+            return docker_image, None
+
+        # Slow path: extract Dockerfile from environment_tar and build
+        if environment_tar:
+            task_dir = Path(tempfile.mkdtemp(prefix=f"tb2-{task_name}-"))
+            _extract_base64_tar(environment_tar, task_dir)
+            dockerfile_path = task_dir / "Dockerfile"
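+            if dockerfile_path.exists():
+                logger.info(
+                    "Task %s: building from Dockerfile (force_build=%s, docker_image=%s)",
+                    task_name, self.config.force_build, bool(docker_image),
+                )
+                return str(dockerfile_path), task_dir
+            # No Dockerfile in the archive -- remove the extracted files now,
+            # because the caller only cleans up a temp dir that we return.
+            shutil.rmtree(task_dir, ignore_errors=True)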
+
+        # No usable Dockerfile -- fall back to the Hub image (reached only when force_build=True)
+        if docker_image:
+            logger.warning(
+                "Task %s: force_build=True but no Dockerfile in environment_tar, "
+                "falling back to docker_image %s", task_name, docker_image,
+            )
+            return docker_image, None
+
+        return "", None
+
+    # =========================================================================
+    # Per-task evaluation -- agent loop + test verification
+    # =========================================================================
+
+    async def rollout_and_score_eval(self, eval_item: Dict[str, Any]) -> Dict:
+        """
+        Evaluate a single TB2 task: run the agent loop, then verify with tests.
+
+        This is the core evaluation method. For each task it:
+        1. Resolves the Docker image and registers the Modal sandbox override
+        2. Runs HermesAgentLoop with terminal + file tools
+        3. Uploads the test suite into the sandbox
+        4. Executes test.sh and checks the result
+        5. Cleans up the sandbox and temp files
+
+        Args:
+            eval_item: A single TB2 task dict from the dataset
+
+        Returns:
+            Dict with 'passed' (bool), 'reward' (float), 'task_name' (str),
+            'category' (str), and optional debug info
+        """
+        task_name = eval_item.get("task_name", "unknown")
+        category = eval_item.get("category", "unknown")
+        task_id = str(uuid.uuid4())
+        task_dir = None  # Set if we extract a Dockerfile (needs cleanup)
+
+        from tqdm import tqdm
+        tqdm.write(f"  [START] {task_name} (task_id={task_id[:8]})")
+        task_start = time.time()
+
+        try:
+            # --- 1. Resolve Docker image ---
+            modal_image, task_dir = self._resolve_task_image(eval_item, task_name)
+            if not modal_image:
+                logger.error("Task %s: no docker_image or environment_tar, skipping", task_name)
+                return {
+                    "passed": False, "reward": 0.0,
+                    "task_name": task_name, "category": category,
+                    "error": "no_image",
+                }
+
+            # --- 2. Register per-task Modal image override ---
+            register_task_env_overrides(task_id, {"modal_image": modal_image})
+            logger.info(
+                "Task %s: registered image override for task_id %s",
+                task_name, task_id[:8],
+            )
+
+            # --- 3. Resolve tools and build messages ---
+            tools, valid_names = self._resolve_tools_for_group()
+
+            messages: List[Dict[str, Any]] = []
+            if self.config.system_prompt:
+                messages.append({"role": "system", "content": self.config.system_prompt})
+            messages.append({"role": "user", "content": self.format_prompt(eval_item)})
+
+            # --- 4. Run agent loop ---
+            agent = HermesAgentLoop(
+                server=self.server,
+                tool_schemas=tools,
+                valid_tool_names=valid_names,
+                max_turns=self.config.max_agent_turns,
+                task_id=task_id,
+                temperature=self.config.agent_temperature,
+                max_tokens=self.config.max_token_length,
+                extra_body=self.config.extra_body,
+            )
+            result = await agent.run(messages)
+
+            # --- 5. Verify -- run test suite in the agent's sandbox ---
+            # Skip verification if the agent produced no meaningful output
+            only_system_and_user = all(
+                msg.get("role") in ("system", "user") for msg in result.messages
+            )
+            if result.turns_used == 0 or only_system_and_user:
+                logger.warning(
+                    "Task %s: agent produced no output (turns=%d). Reward=0.",
+                    task_name, result.turns_used,
+                )
+                reward = 0.0
+            else:
+                # Run tests in a thread so the blocking ctx.terminal() calls
+                # don't freeze the entire event loop (which would stall all
+                # other tasks, tqdm updates, and timeout timers).
+                ctx = ToolContext(task_id)
+                try:
+                    loop = asyncio.get_running_loop()
+                    reward = await loop.run_in_executor(
+                        None,  # default thread pool
+                        self._run_tests, eval_item, ctx, task_name,
+                    )
+                except Exception as e:
+                    logger.error("Task %s: test verification failed: %s", task_name, e)
+                    reward = 0.0
+                finally:
+                    ctx.cleanup()
+
+            passed = reward == 1.0
+            status = "PASS" if passed else "FAIL"
+            elapsed = time.time() - task_start
+            tqdm.write(f"  [{status}] {task_name} (turns={result.turns_used}, {elapsed:.0f}s)")
+            logger.info(
+                "Task %s: reward=%.1f, turns=%d, finished=%s",
+                task_name, reward, result.turns_used, result.finished_naturally,
+            )
+
+            out = {
+                "passed": passed,
+                "reward": reward,
+                "task_name": task_name,
+                "category": category,
+                "turns_used": result.turns_used,
+                "finished_naturally": result.finished_naturally,
+                "messages": result.messages,
+            }
+            self._save_result(out)
+            return out
+
+        except Exception as e:
+            elapsed = time.time() - task_start
+            logger.error("Task %s: rollout failed: %s", task_name, e, exc_info=True)
+            tqdm.write(f"  [ERROR] {task_name}: {e} ({elapsed:.0f}s)")
+            out = {
+                "passed": False, "reward": 0.0,
+                "task_name": task_name, "category": category,
+                "error": str(e),
+            }
+            self._save_result(out)
+            return out
+
+        finally:
+            # --- Cleanup: clear overrides, sandbox, and temp files ---
+            clear_task_env_overrides(task_id)
+            try:
+                cleanup_vm(task_id)
+            except Exception as e:
+                logger.debug("VM cleanup for %s: %s", task_id[:8], e)
+            if task_dir and task_dir.exists():
+                shutil.rmtree(task_dir, ignore_errors=True)
+
+    def _run_tests(
+        self, item: Dict[str, Any], ctx: ToolContext, task_name: str
+    ) -> float:
+        """
+        Upload and execute the test suite in the agent's sandbox, then
+        download the verifier output locally to read the reward.
+
+        Follows Harbor's verification pattern:
+        1. Upload tests/ directory into the sandbox
+        2. Execute test.sh inside the sandbox
+        3. Download /logs/verifier/ directory to a local temp dir
+        4. Read reward.txt locally with native Python I/O
+
+        Downloading locally avoids issues with the file_read tool on
+        the Modal VM and matches how Harbor handles verification.
+
+        TB2 test scripts (test.sh) typically:
+        1. Install pytest via uv/pip
+        2. Run pytest against the test files in /tests/
+        3. Write results to /logs/verifier/reward.txt
+
+        Args:
+            item: The TB2 task dict (contains tests_tar, test_sh)
+            ctx: ToolContext scoped to this task's sandbox
+            task_name: For logging
+
+        Returns:
+            1.0 if tests pass, 0.0 otherwise
+        """
+        tests_tar = item.get("tests_tar", "")
+        test_sh = item.get("test_sh", "")
+
+        if not test_sh:
+            logger.warning("Task %s: no test_sh content, reward=0", task_name)
+            return 0.0
+
+        # Create required directories in the sandbox
+        ctx.terminal("mkdir -p /tests /logs/verifier")
+
+        # Upload test files into the sandbox (binary-safe via base64)
+        if tests_tar:
+            tests_temp = Path(tempfile.mkdtemp(prefix=f"tb2-tests-{task_name}-"))
+            try:
+                _extract_base64_tar(tests_tar, tests_temp)
+                ctx.upload_dir(str(tests_temp), "/tests")
+            except Exception as e:
+                logger.warning("Task %s: failed to upload test files: %s", task_name, e)
+            finally:
+                shutil.rmtree(tests_temp, ignore_errors=True)
+
+        # Write the test runner script (test.sh)
+        ctx.write_file("/tests/test.sh", test_sh)
+        ctx.terminal("chmod +x /tests/test.sh")
+
+        # Execute the test suite
+        logger.info(
+            "Task %s: running test suite (timeout=%ds)",
+            task_name, self.config.test_timeout,
+        )
+        test_result = ctx.terminal(
+            "bash /tests/test.sh",
+            timeout=self.config.test_timeout,
+        )
+
+        exit_code = test_result.get("exit_code", -1)
+        output = test_result.get("output", "")
+
+        # Download the verifier output directory locally, then read reward.txt
+        # with native Python I/O. This avoids issues with file_read on the
+        # Modal VM and matches Harbor's verification pattern.
+        reward = 0.0
+        local_verifier_dir = Path(tempfile.mkdtemp(prefix=f"tb2-verifier-{task_name}-"))
+        try:
+            ctx.download_dir("/logs/verifier", str(local_verifier_dir))
+
+            reward_file = local_verifier_dir / "reward.txt"
+            if reward_file.exists() and reward_file.stat().st_size > 0:
+                content = reward_file.read_text().strip()
+                if content == "1":
+                    reward = 1.0
+                elif content == "0":
+                    reward = 0.0
+                else:
+                    # Unexpected content -- try parsing as float
+                    try:
+                        reward = float(content)
+                    except (ValueError, TypeError):
+                        logger.warning(
+                            "Task %s: reward.txt content unexpected (%r), "
+                            "falling back to exit_code=%d",
+                            task_name, content, exit_code,
+                        )
+                        reward = 1.0 if exit_code == 0 else 0.0
+            else:
+                # reward.txt not written -- fall back to exit code
+                logger.warning(
+                    "Task %s: reward.txt not found after download, "
+                    "falling back to exit_code=%d",
+                    task_name, exit_code,
+                )
+                reward = 1.0 if exit_code == 0 else 0.0
+        except Exception as e:
+            logger.warning(
+                "Task %s: failed to download verifier dir: %s, "
+                "falling back to exit_code=%d",
+                task_name, e, exit_code,
+            )
+            reward = 1.0 if exit_code == 0 else 0.0
+        finally:
+            shutil.rmtree(local_verifier_dir, ignore_errors=True)
+
+        # Log test output for debugging failures
+        if reward == 0.0:
+            output_preview = output[-500:] if output else "(no output)"
+            logger.info(
+                "Task %s: FAIL (exit_code=%d)\n%s",
+                task_name, exit_code, output_preview,
+            )
+
+        return reward
+
+    # =========================================================================
+    # Evaluate -- main entry point for the eval subcommand
+    # =========================================================================
+
+    async def _eval_with_timeout(self, item: Dict[str, Any]) -> Dict:
+        """
+        Wrap rollout_and_score_eval with a per-task wall-clock timeout.
+
+        If the task exceeds task_timeout seconds, it's automatically scored
+        as FAIL. This prevents any single task from hanging indefinitely.
+        """
+        task_name = item.get("task_name", "unknown")
+        category = item.get("category", "unknown")
+        try:
+            return await asyncio.wait_for(
+                self.rollout_and_score_eval(item),
+                timeout=self.config.task_timeout,
+            )
+        except asyncio.TimeoutError:
+            from tqdm import tqdm
+            elapsed = self.config.task_timeout
+            tqdm.write(f"  [TIMEOUT] {task_name} (exceeded {elapsed}s wall-clock limit)")
+            logger.error("Task %s: wall-clock timeout after %ds", task_name, elapsed)
+            out = {
+                "passed": False, "reward": 0.0,
+                "task_name": task_name, "category": category,
+                "error": f"timeout ({elapsed}s)",
+            }
+            self._save_result(out)
+            return out
+
+    async def evaluate(self, *args, **kwargs) -> None:
+        """
+        Run Terminal-Bench 2.0 evaluation over all tasks.
+
+        This is the main entry point when invoked via:
+            python environments/terminalbench2_env.py evaluate
+
+        Runs all tasks concurrently through rollout_and_score_eval() and
+        collects results via asyncio.as_completed() as they finish. Each task
+        is wrapped with a wall-clock timeout so hung tasks auto-fail.
+
+        Suppresses noisy Modal/terminal output (HERMES_QUIET) so the tqdm
+        bar stays visible.
+        """
+        start_time = time.time()
+
+        # Route all logging through tqdm.write() so the progress bar stays
+        # pinned at the bottom while log lines scroll above it.
+        from tqdm import tqdm
+
+        class _TqdmHandler(logging.Handler):
+            def emit(self, record):
+                try:
+                    tqdm.write(self.format(record))
+                except Exception:
+                    self.handleError(record)
+
+        handler = _TqdmHandler()
+        handler.setFormatter(logging.Formatter(
+            "%(asctime)s [%(name)s] %(levelname)s: %(message)s",
+            datefmt="%H:%M:%S",
+        ))
+        root = logging.getLogger()
+        root.handlers = [handler]  # Replace any existing handlers
+        root.setLevel(logging.INFO)
+
+        # Silence noisy third-party loggers that flood the output
+        logging.getLogger("httpx").setLevel(logging.WARNING)      # Every HTTP request
+        logging.getLogger("openai").setLevel(logging.WARNING)     # OpenAI client retries
+        logging.getLogger("rex-deploy").setLevel(logging.WARNING) # Swerex deployment
+        logging.getLogger("rex_image_builder").setLevel(logging.WARNING)  # Image builds
+
+        print(f"\n{'='*60}")
+        print("Starting Terminal-Bench 2.0 Evaluation")
+        print(f"{'='*60}")
+        print(f"  Dataset: {self.config.dataset_name}")
+        print(f"  Total tasks: {len(self.all_eval_items)}")
+        print(f"  Max agent turns: {self.config.max_agent_turns}")
+        print(f"  Task timeout: {self.config.task_timeout}s")
+        print(f"  Terminal backend: {self.config.terminal_backend}")
+        print(f"  Tool thread pool: {self.config.tool_pool_size}")
+        print(f"  Terminal timeout: {self.config.terminal_timeout}s/cmd")
+        print(f"  Terminal lifetime: {self.config.terminal_lifetime}s (auto: task_timeout + 120)")
+        print(f"{'='*60}\n")
+
+        # Fire all tasks with wall-clock timeout, track live accuracy on the bar
+        total_tasks = len(self.all_eval_items)
+        eval_tasks = [
+            asyncio.create_task(self._eval_with_timeout(item))
+            for item in self.all_eval_items
+        ]
+
+        results = []
+        passed_count = 0
+        pbar = tqdm(total=total_tasks, desc="Evaluating TB2", dynamic_ncols=True)
+        try:
+            for coro in asyncio.as_completed(eval_tasks):
+                result = await coro
+                results.append(result)
+                if result and result.get("passed"):
+                    passed_count += 1
+                done = len(results)
+                pct = (passed_count / done * 100) if done else 0
+                pbar.set_postfix_str(f"pass={passed_count}/{done} ({pct:.1f}%)")
+                pbar.update(1)
+        except (KeyboardInterrupt, asyncio.CancelledError):
+            pbar.close()
+            print(f"\n\nInterrupted! Cleaning up {len(eval_tasks)} tasks...")
+            # Cancel all pending tasks
+            for task in eval_tasks:
+                task.cancel()
+            # Let cancellations propagate (finally blocks run cleanup_vm)
+            await asyncio.gather(*eval_tasks, return_exceptions=True)
+            # Belt-and-suspenders: clean up any remaining sandboxes
+            from tools.terminal_tool import cleanup_all_environments
+            cleanup_all_environments()
+            print("All sandboxes cleaned up.")
+            return
+        finally:
+            pbar.close()
+
+        end_time = time.time()
+
+        # Filter out None results (shouldn't happen, but be safe)
+        valid_results = [r for r in results if r is not None]
+
+        if not valid_results:
+            print("Warning: No valid evaluation results obtained")
+            return
+
+        # ---- Compute metrics ----
+        total = len(valid_results)
+        passed = sum(1 for r in valid_results if r.get("passed"))
+        overall_pass_rate = passed / total if total > 0 else 0.0
+
+        # Per-category breakdown
+        cat_results: Dict[str, List[Dict]] = defaultdict(list)
+        for r in valid_results:
+            cat_results[r.get("category", "unknown")].append(r)
+
+        # Build metrics dict
+        eval_metrics = {
+            "eval/pass_rate": overall_pass_rate,
+            "eval/total_tasks": total,
+            "eval/passed_tasks": passed,
+            "eval/evaluation_time_seconds": end_time - start_time,
+        }
+
+        # Per-category metrics
+        for category, cat_items in sorted(cat_results.items()):
+            cat_passed = sum(1 for r in cat_items if r.get("passed"))
+            cat_total = len(cat_items)
+            cat_pass_rate = cat_passed / cat_total if cat_total > 0 else 0.0
+            cat_key = category.replace(" ", "_").replace("-", "_").lower()
+            eval_metrics[f"eval/pass_rate_{cat_key}"] = cat_pass_rate
+
+        # Store metrics for wandb_log
+        self.eval_metrics = list(eval_metrics.items())
+
+        # ---- Print summary ----
+        print(f"\n{'='*60}")
+        print("Terminal-Bench 2.0 Evaluation Results")
+        print(f"{'='*60}")
+        print(f"Overall Pass Rate: {overall_pass_rate:.4f} ({passed}/{total})")
+        print(f"Evaluation Time: {end_time - start_time:.1f} seconds")
+
+        print("\nCategory Breakdown:")
+        for category, cat_items in sorted(cat_results.items()):
+            cat_passed = sum(1 for r in cat_items if r.get("passed"))
+            cat_total = len(cat_items)
+            cat_rate = cat_passed / cat_total if cat_total > 0 else 0.0
+            print(f"  {category}: {cat_rate:.1%} ({cat_passed}/{cat_total})")
+
+        # Print individual task results
+        print("\nTask Results:")
+        for r in sorted(valid_results, key=lambda x: x.get("task_name", "")):
+            status = "PASS" if r.get("passed") else "FAIL"
+            turns = r.get("turns_used", "?")
+            error = r.get("error", "")
+            extra = f" (error: {error})" if error else ""
+            print(f"  [{status}] {r['task_name']} (turns={turns}){extra}")
+
+        print(f"{'='*60}\n")
+
+        # Build sample records for evaluate_log (includes full conversations)
+        samples = [
+            {
+                "task_name": r.get("task_name"),
+                "category": r.get("category"),
+                "passed": r.get("passed"),
+                "reward": r.get("reward"),
+                "turns_used": r.get("turns_used"),
+                "error": r.get("error"),
+                "messages": r.get("messages"),
+            }
+            for r in valid_results
+        ]
+
+        # Log evaluation results
+        try:
+            await self.evaluate_log(
+                metrics=eval_metrics,
+                samples=samples,
+                start_time=start_time,
+                end_time=end_time,
+                generation_parameters={
+                    "temperature": self.config.agent_temperature,
+                    "max_tokens": self.config.max_token_length,
+                    "max_agent_turns": self.config.max_agent_turns,
+                    "terminal_backend": self.config.terminal_backend,
+                },
+            )
+        except Exception as e:
+            print(f"Error logging evaluation results: {e}")
+
+        # Close streaming file
+        if hasattr(self, "_streaming_file") and not self._streaming_file.closed:
+            self._streaming_file.close()
+            print(f"  Live results saved to: {self._streaming_path}")
+
+        # Kill all remaining sandboxes. Timed-out tasks leave orphaned thread
+        # pool workers still executing commands -- cleanup_all stops them.
+        from tools.terminal_tool import cleanup_all_environments
+        print("\nCleaning up all sandboxes...")
+        cleanup_all_environments()
+
+        # Shut down the tool thread pool so orphaned workers from timed-out
+        # tasks are killed immediately instead of retrying against dead
+        # sandboxes and spamming the console with TimeoutError warnings.
+        from environments.agent_loop import _tool_executor
+        _tool_executor.shutdown(wait=False, cancel_futures=True)
+        print("Done.")
+
+    # =========================================================================
+    # Wandb logging
+    # =========================================================================
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log TB2-specific metrics to wandb."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        # Add stored eval metrics
+        for metric_name, metric_value in self.eval_metrics:
+            wandb_metrics[metric_name] = metric_value
+        self.eval_metrics = []
+
+        await super().wandb_log(wandb_metrics)
+
+
+if __name__ == "__main__":
+    TerminalBench2EvalEnv.cli()
@@ -0,0 +1,5 @@
+"""Endless Terminals Environment - Terminal task training from HuggingFace dataset."""
+
+from .endless_terminals_env import EndlessTerminalsEnv, EndlessTerminalsEnvConfig
+
+__all__ = ["EndlessTerminalsEnv", "EndlessTerminalsEnvConfig"]
@@ -0,0 +1,69 @@
+# Endless Terminals Environment -- Default Configuration
+#
+# Trains agents on terminal tasks from the Endless Terminals HuggingFace dataset.
+# Uses hermes-agent backends (modal/docker/local) with per-task Docker images.
+# Tests run in the same sandbox the agent used (no separate containers needed).
+#
+# Dataset: https://huggingface.co/datasets/obiwan96/endless-terminals-train
+#
+# Prerequisites:
+#   1. Download dataset: huggingface-cli download obiwan96/endless-terminals-train \
+#        --repo-type dataset --local-dir ~/endless-terminals-data \
+#        --local-dir-use-symlinks False
+#   2. Set TASKS_BASE_DIR environment variable or configure tasks_base_dir below
+#   3. For modal backend: Configure Modal CLI (modal token set)
+#   4. For docker backend: Install Docker
+#
+# Usage:
+#   python environments/endless_terminals/endless_terminals_env.py process \
+#       --config environments/endless_terminals/default.yaml
+
+env:
+  # Toolsets
+  enabled_toolsets: ["terminal", "file"]
+
+  # Agent configuration
+  max_agent_turns: 32
+  max_token_length: 4096
+  agent_temperature: 1.0
+
+  # Terminal backend
+  terminal_backend: "local"  # Change to "modal" or "docker" for sandboxed isolation
+
+  # Dataset settings
+  use_dataset: true
+  dataset_name: "obiwan96/endless-terminals-train"
+  dataset_split: "train"
+  dataset_cache_dir: "~/.cache/huggingface/datasets"
+  tasks_base_dir: ""  # Set to directory containing task_* folders (e.g., ~/endless-terminals-data)
+
+  # Test execution
+  test_timeout_s: 60
+
+  # Training configuration
+  group_size: 8
+  total_steps: 10000
+  steps_per_eval: 500
+
+  num_eval_tasks: 10
+  eval_split_ratio: 0.1
+
+  # Logging
+  use_wandb: true
+  wandb_name: "endless-terminals"
+
+  # System prompt
+  system_prompt: >
+    You are a skilled Linux system administrator and programmer.
+    You have access to a terminal and file tools to complete system administration
+    and programming tasks. Use the tools effectively to solve the given task,
+    and verify your solution works correctly before finishing.
+
+openai:
+  base_url: "https://openrouter.ai/api/v1"
+  model_name: "anthropic/claude-sonnet-4.5"
+  server_type: "openai"
+  api_key: ""  # Loaded from OPENROUTER_API_KEY env var
+  health_check: false
+  timeout: 30  # 30 second timeout per request
+  max_retries: 2  # Only retry twice
@@ -0,0 +1,921 @@
+"""
+Endless Terminals Environment for Hermes-Agent + Atropos RL.
+
+Loads pre-generated terminal tasks from HuggingFace dataset and scores
+agent performance using test execution in the agent's sandbox.
+
+Uses hermes-agent backends (modal, docker, local) with per-task Docker images
+extracted from container.def files. Tests run in the same sandbox the agent
+used, following the Terminal Bench 2 pattern.
+
+Dataset: https://huggingface.co/datasets/obiwan96/endless-terminals-train
+
+Run:
+  python environments/endless_terminals/endless_terminals_env.py process \
+    --config environments/endless_terminals/default.yaml
+"""
+
+import asyncio
+import logging
+import os
+import random
+import re
+import sys
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+
+from pydantic import Field
+
+# Ensure hermes-agent root is on path
+_repo_root = Path(__file__).resolve().parent.parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from atroposlib.envs.base import ScoredDataGroup, ScoredDataItem
+from atroposlib.type_definitions import Item
+
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+from environments.agent_loop import AgentResult
+from environments.tool_context import ToolContext
+from tools.terminal_tool import (
+    register_task_env_overrides,
+    clear_task_env_overrides,
+    cleanup_vm,
+)
+
+logger = logging.getLogger(__name__)
+
+# Add endless-terminals to path for imports
+ENDLESS_TERMINALS_PATH = os.getenv(
+    "ENDLESS_TERMINALS_PATH",
+    str(Path.home() / "Desktop" / "Projects" / "endless-terminals")
+)
+sys.path.insert(0, ENDLESS_TERMINALS_PATH)
+
+
+class EndlessTerminalsEnvConfig(HermesAgentEnvConfig):
+    """Configuration for Endless Terminals environment."""
+
+    # Dataset settings
+    use_dataset: bool = Field(
+        default=True,
+        description="Load tasks from HuggingFace dataset (recommended). If False, use procedural generation (not implemented yet)."
+    )
+    dataset_name: str = Field(
+        default="obiwan96/endless-terminals-train",
+        description="HuggingFace dataset name"
+    )
+    dataset_split: str = Field(
+        default="train",
+        description="Dataset split to use"
+    )
+    dataset_cache_dir: str = Field(
+        default="~/.cache/huggingface/datasets",
+        description="HuggingFace datasets cache directory"
+    )
+    tasks_base_dir: str = Field(
+        default="",
+        description="Base directory containing task_* folders. If empty, uses paths from dataset."
+    )
+
+    # Test execution
+    test_timeout_s: int = Field(default=60, description="Test execution timeout (seconds)")
+
+    # Docker image fallback
+    default_docker_image: str = Field(
+        default="ubuntu:22.04",
+        description="Default Docker image if container.def parsing fails"
+    )
+
+    # Agent defaults
+    max_agent_turns: int = Field(default=32, description="Max turns for agent (increased for long traces)")
+
+    # Evaluation settings
+    num_eval_tasks: int = Field(
+        default=10,
+        description="Number of tasks to run during periodic evaluation"
+    )
+    eval_split_ratio: float = Field(
+        default=0.1,
+        description="Fraction of dataset to hold out for evaluation (0.0-1.0)"
+    )
+
+
+class EndlessTerminalsEnv(HermesAgentBaseEnv):
+    """
+    Endless Terminals environment using pre-generated HuggingFace dataset.
+
+    Loads terminal tasks from dataset, runs agent with terminal tools,
+    and scores by executing tests in the agent's sandbox using ToolContext.
+    """
+
+    name = "endless_terminals_env"
+    env_config_cls = EndlessTerminalsEnvConfig
+
+    @classmethod
+    def config_init(cls) -> Tuple[EndlessTerminalsEnvConfig, List["APIServerConfig"]]:
+        """
+        Default configuration for Endless Terminals environment.
+
+        This is used when no config file is provided, but note that when using
+        --config, the YAML is loaded differently and this may not be called.
+        """
+        from atroposlib.envs.server_handling.server_manager import APIServerConfig
+
+        env_config = EndlessTerminalsEnvConfig(
+            enabled_toolsets=["terminal", "file"],
+            max_agent_turns=32,
+            terminal_backend="local",
+            use_dataset=True,
+            tasks_base_dir="",
+            group_size=1,
+            total_steps=1,
+            use_wandb=False,
+        )
+
+        server_configs = [
+            APIServerConfig(
+                base_url="https://openrouter.ai/api/v1",
+                model_name="anthropic/claude-sonnet-4.5",
+                server_type="openai",
+                api_key=os.getenv("OPENROUTER_API_KEY", ""),
+                health_check=False,
+            )
+        ]
+
+        return env_config, server_configs
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self._dataset = None
+        self._train_dataset = None
+        self._eval_dataset = None
+        self._dataset_indices = []
+        self._current_index = 0
+
+        # Metrics tracking for wandb - single buffer with dicts
+        self._metrics_buffer = []
+
+        # Debug: check server config
+        if hasattr(self, 'server') and hasattr(self.server, 'servers'):
+            for i, srv in enumerate(self.server.servers):
+                logger.debug(f"Server {i}: model_name={getattr(srv.config, 'model_name', 'NONE')}")
+
+    async def setup(self):
+        """Load dataset from HuggingFace or local directory."""
+        if not self.config.use_dataset:
+            logger.warning("Procedural task generation is not implemented yet; no tasks loaded")
+            return
+
+        # If tasks_base_dir is set, load from local directory instead of HuggingFace
+        if self.config.tasks_base_dir:
+            tasks_base = Path(os.path.expanduser(self.config.tasks_base_dir))
+
+            # Resolve to absolute path if relative
+            if not tasks_base.is_absolute():
+                tasks_base = Path.cwd() / tasks_base
+
+            tasks_base = tasks_base.resolve()
+
+            if not tasks_base.exists():
+                raise RuntimeError(f"tasks_base_dir not found: {tasks_base}")
+
+            logger.info(f"Loading tasks from local directory: {tasks_base}")
+
+            # Find all task_* directories
+            task_dirs = sorted(tasks_base.glob("task_*"))
+            logger.info(f"Found {len(task_dirs)} task directories")
+
+            if not task_dirs:
+                # Debug: show what's actually in the directory
+                all_items = list(tasks_base.iterdir())
+                logger.warning(f"Directory contains {len(all_items)} items:")
+                for item in all_items[:10]:
+                    logger.warning(f"  - {item.name} ({'dir' if item.is_dir() else 'file'})")
+                raise RuntimeError(f"No task_* directories found in {tasks_base}")
+
+            # Create fake dataset items (just the directory paths)
+            self._dataset = [
+                {
+                    "description": f"Task from {task_dir.name}",
+                    "extra_info": {"task_dir": str(task_dir)},
+                }
+                for task_dir in task_dirs
+            ]
+
+            logger.info(f"Loaded {len(self._dataset)} tasks from local directory")
+
+            self._split_dataset()
+            return
+
+        # Otherwise, load from HuggingFace
+        logger.info(f"Loading dataset from HuggingFace: {self.config.dataset_name}")
+
+        try:
+            from datasets import load_dataset
+
+            self._dataset = await asyncio.get_running_loop().run_in_executor(
+                None,
+                lambda: load_dataset(
+                    self.config.dataset_name,
+                    split=self.config.dataset_split,
+                    cache_dir=os.path.expanduser(self.config.dataset_cache_dir)
+                )
+            )
+
+            logger.info(f"Loaded {len(self._dataset)} tasks from HuggingFace")
+
+            self._split_dataset()
+
+        except Exception as e:
+            logger.error(f"Failed to load dataset {self.config.dataset_name}: {e}")
+            raise
+
+    def _split_dataset(self):
+        """Split dataset into train and eval sets based on eval_split_ratio."""
+        if self._dataset is None or len(self._dataset) == 0:
+            raise RuntimeError("Cannot split empty dataset")
+
+        total_size = len(self._dataset)
+        eval_size = int(total_size * self.config.eval_split_ratio)
+        train_size = total_size - eval_size
+
+        all_indices = list(range(total_size))
+        random.shuffle(all_indices)
+
+        train_indices = all_indices[:train_size]
+        eval_indices = all_indices[train_size:]
+
+        if isinstance(self._dataset, list):
+            self._train_dataset = [self._dataset[i] for i in train_indices]
+            self._eval_dataset = [self._dataset[i] for i in eval_indices]
+        else:
+            self._train_dataset = self._dataset.select(train_indices)
+            self._eval_dataset = self._dataset.select(eval_indices)
+
+        self._dataset_indices = list(range(len(self._train_dataset)))
+        random.shuffle(self._dataset_indices)
+        self._current_index = 0
+
+        logger.info(
+            f"Split dataset: {len(self._train_dataset)} train, "
+            f"{len(self._eval_dataset)} eval "
+            f"(ratio={self.config.eval_split_ratio:.1%})"
+        )
+
+    async def get_next_item(self) -> Item:
+        """Sample next task from training dataset."""
+        if self._train_dataset is None:
+            raise RuntimeError("Dataset not loaded. Call setup() first.")
+
+        # Get next task (with wraparound)
+        idx = self._dataset_indices[self._current_index]
+        task = self._train_dataset[idx]
+
+        # Advance to next task
+        self._current_index += 1
+        if self._current_index >= len(self._dataset_indices):
+            # Reshuffle for next epoch
+            random.shuffle(self._dataset_indices)
+            self._current_index = 0
+            logger.info("Reshuffled dataset (completed one epoch)")
+
+        # Extract task directory path
+        task_dir = task.get("extra_info", {}).get("task_dir")
+        if not task_dir:
+            task_dir = task.get("reward_spec", {}).get("ground_truth")
+
+        # Resolve task directory path
+        if task_dir:
+            task_dir_path = Path(task_dir)
+            # If tasks_base_dir is configured and path doesn't exist, reconstruct it
+            if self.config.tasks_base_dir and not task_dir_path.exists():
+                # Keep only the folder name and re-root it under tasks_base_dir
+                task_name = task_dir_path.name
+                task_dir_path = Path(os.path.expanduser(self.config.tasks_base_dir)) / task_name
+        else:
+            logger.error("No task directory path found in dataset item")
+            return await self.get_next_item()
+
+        # Verify directory exists
+        if not task_dir_path.exists():
+            logger.warning(f"Task dir not found: {task_dir_path}")
+            logger.warning("Hint: Set tasks_base_dir to directory containing task_* folders")
+            return await self.get_next_item()  # Try next task
+
+        # Look for test file in tests/ subdirectory first, then at root
+        final_test = task_dir_path / "tests" / "test_final_state.py"
+        if not final_test.exists():
+            final_test = task_dir_path / "test_final_state.py"
+
+        # Verify test file exists
+        if not final_test.exists():
+            logger.warning(f"Missing test file in {task_dir_path} (checked tests/ and root)")
+            return await self.get_next_item()
+
+        # Parse container.def to extract Docker image
+        # Check environment/ subdirectory first, then root
+        container_def = task_dir_path / "environment" / "container.def"
+        if not container_def.exists():
+            container_def = task_dir_path / "container.def"
+        docker_image = self._parse_docker_image_from_def(container_def)
+
+        # Try to load description from instruction.md or task.json
+        description = task.get("description", "")
+
+        # First try instruction.md
+        instruction_md = task_dir_path / "instruction.md"
+        if not description and instruction_md.exists():
+            try:
+                description = instruction_md.read_text().strip()
+            except Exception as e:
+                logger.warning(f"Failed to load instruction.md for {task_dir_path.name}: {e}")
+
+        # Fallback to task.json in environment/
+        if not description:
+            task_json = task_dir_path / "environment" / "task.json"
+            if task_json.exists():
+                try:
+                    import json
+                    task_data = json.loads(task_json.read_text())
+                    description = task_data.get("description", "") or task_data.get("instruction", "")
+                except Exception as e:
+                    logger.warning(f"Failed to load task.json for {task_dir_path.name}: {e}")
+
+        if not description:
+            description = f"Complete the task in {task_dir_path.name}"
+
+        return {
+            "task_id": task_dir_path.name,
+            "task_name": task_dir_path.name,
+            "description": description,
+            "task_dir": str(task_dir_path),
+            "final_test": str(final_test),
+            "docker_image": docker_image,
+            "dataset_index": idx,
+        }
+
+    def format_prompt(self, item: Item) -> str:
+        """Return the task description for the agent."""
+        return str(item.get("description", ""))
+
+    def _parse_docker_image_from_def(self, container_def_path: Path) -> str:
+        """
+        Parse container.def file to extract the Docker base image.
+
+        Apptainer definition files typically look like:
+            Bootstrap: docker
+            From: ubuntu:22.04
+
+        Returns the image from the "From:" line, or falls back to default.
+        """
+        if not container_def_path.exists():
+            logger.warning(f"container.def not found at {container_def_path}, using default image")
+            return self.config.default_docker_image
+
+        try:
+            content = container_def_path.read_text()
+            # Look for "From: <image>" line (case-insensitive)
+            match = re.search(r'^From:\s*(.+)$', content, re.MULTILINE | re.IGNORECASE)
+            if match:
+                image = match.group(1).strip()
+                logger.info(f"Extracted Docker image from container.def: {image}")
+                return image
+        except Exception as e:
+            logger.warning(f"Failed to parse {container_def_path}: {e}")
+
+        logger.warning(f"Could not extract image from {container_def_path}, using default")
+        return self.config.default_docker_image
+
+    async def collect_trajectory(
+        self, item: Item
+    ) -> Tuple[Optional[ScoredDataItem], List[Item]]:
+        """
+        Override to register per-task Docker image before running the agent.
+
+        Follows Terminal Bench 2 pattern: register_task_env_overrides() tells
+        the hermes-agent terminal backend to use a specific Docker image for
+        this task_id.
+
+        This is a copy of HermesAgentBaseEnv.collect_trajectory with Docker
+        image registration added after task_id generation.
+        """
+        import uuid
+        from environments.agent_loop import HermesAgentLoop
+
+        task_id = str(uuid.uuid4())
+        task_name = item.get("task_name", "unknown")
+        docker_image = item.get("docker_image", self.config.default_docker_image)
+
+        logger.debug(f"collect_trajectory START for {task_name}")
+
+        # Register Docker image override for this task_id
+        logger.debug(f"Registering Docker image: {docker_image}")
+        register_task_env_overrides(task_id, {"modal_image": docker_image})
+        logger.info(
+            f"Task {task_name}: registered Docker image {docker_image} for task_id {task_id[:8]}"
+        )
+        logger.debug("Docker image registered")
+
+        try:
+            # Get group-level tools (resolved once in collect_trajectories)
+            logger.debug("Resolving tools...")
+            if self._current_group_tools is None:
+                tools, valid_names = self._resolve_tools_for_group()
+            else:
+                tools, valid_names = self._current_group_tools
+            logger.debug(f"Tools resolved: {len(tools)} tools")
+
+            # Build initial messages
+            logger.debug("Building initial messages...")
+            messages: List[Dict[str, Any]] = []
+            if self.config.system_prompt:
+                messages.append({"role": "system", "content": self.config.system_prompt})
+            messages.append({"role": "user", "content": self.format_prompt(item)})
+            logger.debug("Messages built, starting agent loop...")
+
+            # Run the agent loop
+            result: AgentResult
+            managed_state: Optional[Dict[str, Any]] = None
+
+            if self._use_managed_server():
+                # Phase 2: ManagedServer with parser
+                from environments.tool_call_parsers import get_parser
+                try:
+                    tc_parser = get_parser(self.config.tool_call_parser)
+                except KeyError:
+                    logger.warning(
+                        "Tool call parser '%s' not found, falling back to 'hermes'",
+                        self.config.tool_call_parser,
+                    )
+                    tc_parser = get_parser("hermes")
+
+                try:
+                    async with self.server.managed_server(
+                        tokenizer=self.tokenizer,
+                        tool_call_parser=tc_parser,
+                    ) as managed:
+                        agent = HermesAgentLoop(
+                            server=managed,
+                            tool_schemas=tools,
+                            valid_tool_names=valid_names,
+                            max_turns=self.config.max_agent_turns,
+                            task_id=task_id,
+                            temperature=self.config.agent_temperature,
+                            max_tokens=self.config.max_token_length,
+                            extra_body=self.config.extra_body,
+                        )
+                        result = await agent.run(messages)
+
+                        # Get state directly from managed server while still in context
+                        managed_state = managed.get_state()
+                except NotImplementedError:
+                    # DummyManagedServer not allowed
+                    logger.warning("ManagedServer not available. Falling back to direct server mode.")
+                    agent = HermesAgentLoop(
+                        server=self.server,
+                        tool_schemas=tools,
+                        valid_tool_names=valid_names,
+                        max_turns=self.config.max_agent_turns,
+                        task_id=task_id,
+                        temperature=self.config.agent_temperature,
+                        max_tokens=self.config.max_token_length,
+                        extra_body=self.config.extra_body,
+                    )
+                    result = await agent.run(messages)
+            else:
+                # Phase 1: OpenAI server
+                agent = HermesAgentLoop(
+                    server=self.server,
+                    tool_schemas=tools,
+                    valid_tool_names=valid_names,
+                    max_turns=self.config.max_agent_turns,
+                    task_id=task_id,
+                    temperature=self.config.agent_temperature,
+                    max_tokens=self.config.max_token_length,
+                    extra_body=self.config.extra_body,
+                )
+                result = await agent.run(messages)
+
+            # Skip reward computation if agent produced no output
+            only_system_and_user = all(
+                msg.get("role") in ("system", "user") for msg in result.messages
+            )
+            if result.turns_used == 0 or only_system_and_user:
+                logger.warning(
+                    "Agent loop produced no output (turns=%d). Skipping trajectory.",
+                    result.turns_used,
+                )
+                # Return None to skip this trajectory (likely an API failure)
+                return None, []
+            else:
+                # Compute reward using ToolContext
+                ctx = ToolContext(task_id)
+                try:
+                    reward = await self.compute_reward(item, result, ctx)
+                except Exception as e:
+                    logger.error("compute_reward failed: %s", e)
+                    reward = 0.0
+                finally:
+                    ctx.cleanup()
+
+            # Track metrics for wandb logging
+            task_metrics = {
+                "test_passed": 1.0 if reward > 0.5 else 0.0,
+                "reward": reward,
+                "turns_used": result.turns_used,
+                "finished_naturally": result.finished_naturally,
+                "docker_image": docker_image,
+                "num_tool_errors": len(result.tool_errors),
+            }
+
+            # Include detailed tool errors if any occurred
+            if result.tool_errors:
+                task_metrics["tool_errors"] = [
+                    {
+                        "turn": err.turn,
+                        "tool": err.tool_name,
+                        "error": err.error[:200],
+                    }
+                    for err in result.tool_errors
+                ]
+
+            self._metrics_buffer.append(task_metrics)
+
+            # ============================================================================
+            # Build ScoredDataGroup from ManagedServer state
+            # ============================================================================
+            # Phase 2: Extract pre-computed data from SequenceNodes
+            # The managed state may hold more than one SequenceNode per rollout,
+            # so iterate through all nodes and return one sequence per node.
+            #
+            # Each SequenceNode contains:
+            # - tokens: Full unmasked token sequence [1, 2, 3, ..., N]
+            # - masked_tokens: Training format [-100, -100, ..., -100, actual, actual, ...]
+            # - logprobs: Training format [1.0, 1.0, ..., 1.0, -0.5, -0.3, ...]
+            # - full_text: Complete text (prompt + all completions)
+            #
+            # Phase 1: Create placeholder tokens for OpenAI-style servers
+            # ============================================================================
+            nodes = (managed_state or {}).get("nodes", [])
+
+            # Create ScoredDataGroup with lists for multiple trajectories
+            scored_group = ScoredDataGroup()
+            scored_group["tokens"] = []
+            scored_group["masks"] = []
+            scored_group["scores"] = []
+            scored_group["messages"] = []
+            scored_group["inference_logprobs"] = []
+
+            if nodes:
+                # Phase 2: iterate through all nodes (may have multiple trajectories)
+                for i, node in enumerate(nodes):
+                    scored_group["tokens"].append(node.tokens)
+                    scored_group["masks"].append(node.masked_tokens)
+                    scored_group["scores"].append(reward)
+                    scored_group["messages"].append(result.messages)
+
+                    if hasattr(node, "logprobs") and node.logprobs:
+                        scored_group["inference_logprobs"].append(node.logprobs)
+                    else:
+                        # Placeholder logprobs if not available
+                        scored_group["inference_logprobs"].append([1.0] * len(node.tokens))
+
+                    logger.debug(f"Added trajectory {i+1}/{len(nodes)} with {len(node.tokens)} tokens")
+
+            else:
+                # Phase 1: create placeholder tokens for OpenAI-style servers
+                full_text = "\n".join(
+                    msg.get("content", "") for msg in result.messages if msg.get("content")
+                )
+                if self.tokenizer:
+                    tokens = self.tokenizer.encode(full_text, add_special_tokens=True)
+                else:
+                    tokens = list(range(min(len(full_text) // 4, 128)))
+
+                scored_group["tokens"].append(tokens)
+                scored_group["masks"].append(([-100] + tokens[1:]) if tokens else [])
+                scored_group["scores"].append(reward)
+                scored_group["messages"].append(result.messages)
+                scored_group["inference_logprobs"].append([1.0] * len(tokens))
+
+            # Return None if no trajectories collected
+            if len(scored_group["tokens"]) == 0:
+                return None, []
+
+            logger.debug(f"Returning ScoredDataGroup with {len(scored_group['tokens'])} trajectories")
+            return scored_group, []
+
+        finally:
+            # Clean up task overrides and sandbox
+            clear_task_env_overrides(task_id)
+            try:
+                cleanup_vm(task_id)
+            except Exception as e:
+                logger.debug(f"VM cleanup for {task_id[:8]}: {e}")
+
+    async def compute_reward(
+        self,
+        item: Item,
+        result: AgentResult,
+        ctx: ToolContext
+    ) -> float:
+        """
+        Run final tests in the agent's sandbox and return binary reward.
+
+        Uses ToolContext to execute pytest in the SAME sandbox the agent used,
+        following the Terminal Bench 2 verification pattern. No separate
+        Apptainer execution needed.
+
+        Returns 1.0 if tests pass, 0.0 otherwise.
+        """
+        task_name = item.get("task_name", "unknown")
+        final_test_path = Path(item.get("final_test", ""))
+
+        if not final_test_path.exists():
+            logger.error(f"Task {task_name}: test file not found at {final_test_path}")
+            return 0.0
+
+        logger.info(f"Task {task_name}: running tests in sandbox...")
+
+        try:
+            # Run tests in a thread to avoid blocking the event loop
+            loop = asyncio.get_running_loop()
+            reward = await loop.run_in_executor(
+                None,
+                self._run_tests_in_sandbox,
+                final_test_path,
+                ctx,
+                task_name,
+            )
+
+            status = "PASS" if reward == 1.0 else "FAIL"
+            logger.info(f"Task {task_name}: {status} (reward={reward})")
+            return reward
+
+        except Exception as e:
+            logger.error(f"Task {task_name}: test execution failed: {e}", exc_info=True)
+            return 0.0
+
+    def _run_tests_in_sandbox(
+        self,
+        test_file_path: Path,
+        ctx: ToolContext,
+        task_name: str,
+    ) -> float:
+        """
+        Upload test file to sandbox and execute pytest.
+
+        Runs in thread pool (via run_in_executor) to avoid blocking the event loop
+        with synchronous ToolContext calls.
+
+        Args:
+            test_file_path: Local path to test_final_state.py
+            ctx: ToolContext scoped to the agent's sandbox
+            task_name: For logging
+
+        Returns:
+            1.0 if tests pass, 0.0 otherwise
+        """
+        try:
+            # Upload test file to sandbox
+            test_content = test_file_path.read_text()
+            ctx.write_file("/workspace/test_final_state.py", test_content)
+            logger.debug(f"Task {task_name}: uploaded test file to /workspace/test_final_state.py")
+
+            # Run pytest in the sandbox
+            result = ctx.terminal(
+                "cd /workspace && python -m pytest -q test_final_state.py",
+                timeout=self.config.test_timeout_s,
+            )
+
+            exit_code = result.get("exit_code", -1)
+            output = result.get("output", "")
+
+            if exit_code == 0:
+                logger.debug(f"Task {task_name}: tests passed")
+                return 1.0
+            else:
+                # Log failure output (last 500 chars for debugging)
+                output_preview = output[-500:] if output else "(no output)"
+                logger.info(
+                    f"Task {task_name}: tests failed (exit_code={exit_code})\n{output_preview}"
+                )
+                return 0.0
+
+        except Exception as e:
+            logger.error(f"Task {task_name}: error running tests: {e}")
+            return 0.0
+
+    async def evaluate(self):
+        """
+        Periodic evaluation on holdout eval set.
+
+        Runs the agent on num_eval_tasks from the held-out eval set
+        (never seen during training). Returns metrics for wandb logging.
+        """
+        if self._eval_dataset is None:
+            logger.warning("Cannot evaluate: eval dataset not loaded")
+            return {}
+
+        if len(self._eval_dataset) == 0:
+            logger.warning("Eval dataset is empty")
+            return {}
+
+        # Use min of num_eval_tasks and actual eval set size
+        num_tasks = min(self.config.num_eval_tasks, len(self._eval_dataset))
+        logger.info(f"Starting evaluation on {num_tasks} held-out tasks...")
+
+        eval_metrics = {
+            "rewards": [],
+            "passes": [],
+            "turns": [],
+            "natural_finishes": [],
+        }
+
+        # Sample from eval set (holdout)
+        import random
+        eval_indices = random.sample(range(len(self._eval_dataset)), num_tasks)
+
+        for idx in eval_indices:
+            task = self._eval_dataset[idx]
+
+            # Build item using same logic as get_next_item
+            task_dir = task.get("extra_info", {}).get("task_dir")
+            if not task_dir:
+                task_dir = task.get("reward_spec", {}).get("ground_truth")
+
+            if not task_dir:
+                continue
+
+            task_dir_path = Path(task_dir)
+            if self.config.tasks_base_dir and not task_dir_path.exists():
+                original_path = Path(task_dir)
+                task_name = original_path.name
+                task_dir_path = Path(os.path.expanduser(self.config.tasks_base_dir)) / task_name
+
+            if not task_dir_path.exists():
+                continue
+
+            # Find test file
+            final_test = task_dir_path / "tests" / "test_final_state.py"
+            if not final_test.exists():
+                final_test = task_dir_path / "test_final_state.py"
+            if not final_test.exists():
+                continue
+
+            # Parse Docker image
+            container_def = task_dir_path / "environment" / "container.def"
+            if not container_def.exists():
+                container_def = task_dir_path / "container.def"
+            docker_image = self._parse_docker_image_from_def(container_def)
+
+            # Load description
+            description = task.get("description", "")
+            instruction_md = task_dir_path / "instruction.md"
+            if not description and instruction_md.exists():
+                try:
+                    description = instruction_md.read_text().strip()
+                except Exception as e:
+                    logger.debug(f"Could not read {instruction_md}: {e}")
+
+            item = {
+                "task_name": task_dir_path.name,
+                "description": description,
+                "final_test": str(final_test),
+                "docker_image": docker_image,
+            }
+
+            # Run agent on this task
+            try:
+                import uuid
+                task_id = str(uuid.uuid4())
+
+                # Register task environment
+                from model_tools import register_task_env_overrides
+                register_task_env_overrides(task_id, {"modal_image": docker_image})
+
+                # Build messages
+                messages = []
+                if self.config.system_prompt:
+                    messages.append({"role": "system", "content": self.config.system_prompt})
+                messages.append({"role": "user", "content": description or "Complete the task."})
+
+                # Get tools
+                from model_tools import get_tool_definitions
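+                tools = get_tool_definitions(
+                    enabled_toolsets=self.config.enabled_toolsets,
+                    disabled_toolsets=self.config.disabled_toolsets,
+                    quiet_mode=True,
+                )
+                valid_names = {t["function"]["name"] for t in tools} if tools else set()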
+
+                # Run agent
+                from environments.agent_loop import HermesAgentLoop
+                agent = HermesAgentLoop(
+                    server=self.server,
+                    tool_schemas=tools,
+                    valid_tool_names=valid_names,
+                    max_turns=self.config.max_agent_turns,
+                    task_id=task_id,
+                    temperature=self.config.agent_temperature,
+                    max_tokens=self.config.max_token_length,
+                    extra_body=self.config.extra_body,
+                )
+                result = await agent.run(messages)
+
+                # Compute reward
+                from environments.tool_context import ToolContext
+                ctx = ToolContext(task_id)
+                try:
+                    reward = await self.compute_reward(item, result, ctx)
+                except Exception as e:
+                    logger.warning(f"Eval reward computation failed: {e}")
+                    reward = 0.0
+                finally:
+                    ctx.cleanup()
+
+                # Track metrics
+                eval_metrics["rewards"].append(reward)
+                eval_metrics["passes"].append(1.0 if reward > 0.5 else 0.0)
+                eval_metrics["turns"].append(result.turns_used)
+                eval_metrics["natural_finishes"].append(1.0 if result.finished_naturally else 0.0)
+
+            except Exception as e:
+                logger.error(f"Eval task failed: {e}")
+                continue
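+            finally:
+                # Cleanup; a failed VM teardown must not abort the remaining eval tasks
+                from model_tools import clear_task_env_overrides, cleanup_vm
+                clear_task_env_overrides(task_id)
+                try:
+                    cleanup_vm(task_id)
+                except Exception as e:
+                    logger.debug(f"VM cleanup for {task_id[:8]}: {e}")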
+
+        # Aggregate metrics
+        if not eval_metrics["rewards"]:
+            logger.warning("No eval tasks completed successfully")
+            return {}
+
+        aggregated = {
+            "eval/pass_rate": sum(eval_metrics["passes"]) / len(eval_metrics["passes"]),
+            "eval/avg_reward": sum(eval_metrics["rewards"]) / len(eval_metrics["rewards"]),
+            "eval/avg_turns": sum(eval_metrics["turns"]) / len(eval_metrics["turns"]),
+            "eval/natural_finish_rate": sum(eval_metrics["natural_finishes"]) / len(eval_metrics["natural_finishes"]),
+            "eval/num_tasks": len(eval_metrics["rewards"]),
+        }
+
+        logger.info(f"Evaluation complete: pass_rate={aggregated['eval/pass_rate']:.2%}, avg_turns={aggregated['eval/avg_turns']:.1f}")
+        return aggregated
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log Endless Terminals specific metrics to wandb."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        # Aggregate metrics from buffer
+        if self._metrics_buffer:
+            # Test pass rate
+            test_passes = [m["test_passed"] for m in self._metrics_buffer]
+            wandb_metrics["endless_terminals/test_pass_rate"] = sum(test_passes) / len(test_passes)
+            wandb_metrics["endless_terminals/num_tests_passed"] = sum(test_passes)
+            wandb_metrics["endless_terminals/num_tests_total"] = len(test_passes)
+
+            # Turns used statistics
+            turns = [m["turns_used"] for m in self._metrics_buffer]
+            wandb_metrics["endless_terminals/avg_turns_used"] = sum(turns) / len(turns)
+            wandb_metrics["endless_terminals/max_turns_used"] = max(turns)
+            wandb_metrics["endless_terminals/min_turns_used"] = min(turns)
+
+            # Natural finish rate (did model stop on its own vs hitting max turns)
+            natural_finishes = [1.0 if m["finished_naturally"] else 0.0 for m in self._metrics_buffer]
+            wandb_metrics["endless_terminals/natural_finish_rate"] = sum(natural_finishes) / len(natural_finishes)
+
+            # Tool error statistics
+            total_tool_errors = sum(m["num_tool_errors"] for m in self._metrics_buffer)
+            wandb_metrics["endless_terminals/total_tool_errors"] = total_tool_errors
+            wandb_metrics["endless_terminals/avg_tool_errors_per_task"] = total_tool_errors / len(self._metrics_buffer)
+
+            # Docker image distribution (count unique images used)
+            docker_images = [m["docker_image"] for m in self._metrics_buffer]
+            unique_images = set(docker_images)
+            wandb_metrics["endless_terminals/num_unique_docker_images"] = len(unique_images)
+
+            # Log most common errors if any
+            all_errors = []
+            for m in self._metrics_buffer:
+                if "tool_errors" in m:
+                    all_errors.extend(m["tool_errors"])
+
+            if all_errors:
+                # Count error types
+                error_tools = {}
+                for err in all_errors:
+                    tool = err["tool"]
+                    error_tools[tool] = error_tools.get(tool, 0) + 1
+
+                # Log top 3 error-prone tools
+                for tool, count in sorted(error_tools.items(), key=lambda x: x[1], reverse=True)[:3]:
+                    wandb_metrics[f"endless_terminals/errors_by_tool/{tool}"] = count
+
+            # Clear buffer after logging
+            self._metrics_buffer = []
+
+        await super().wandb_log(wandb_metrics)
+
+
+if __name__ == "__main__":
+    EndlessTerminalsEnv.cli()
@@ -0,0 +1,672 @@
+"""
+HermesAgentBaseEnv -- Abstract Base Environment for Hermes-Agent + Atropos
+
+Provides the Atropos integration plumbing that all hermes-agent environments share:
+- Two-mode operation (OpenAI server for Phase 1, vLLM ManagedServer for Phase 2)
+- Per-group toolset/distribution resolution
+- Agent loop orchestration via HermesAgentLoop
+- ToolContext creation for reward functions
+- ScoredDataGroup construction from ManagedServer state
+
+Subclasses only need to implement:
+    setup()           -- Load dataset, initialize state
+    get_next_item()   -- Return the next item from the dataset
+    format_prompt()   -- Convert a dataset item into the user message
+    compute_reward()  -- Score the rollout (has full ToolContext access)
+    evaluate()        -- Periodic evaluation
+"""
+
+import asyncio
+import json
+import logging
+import os
+import sys
+import uuid
+from abc import abstractmethod
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Set, Tuple, Union
+
+# Ensure the hermes-agent repo root is on sys.path so that imports like
+# `from model_tools import ...` and `from environments.X import ...` work
+# regardless of where the script is invoked from.
+_repo_root = Path(__file__).resolve().parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from dotenv import load_dotenv
+from pydantic import Field
+
+# Load API keys from hermes-agent/.env so all environments can access them
+_env_path = _repo_root / ".env"
+if _env_path.exists():
+    load_dotenv(dotenv_path=_env_path)
+
+# Apply monkey patches for async-safe tool operation inside Atropos's event loop.
+# This patches SwerexModalEnvironment to use a background thread instead of
+# asyncio.run(), which would deadlock inside Atropos. Safe for normal CLI too.
+from environments.patches import apply_patches
+apply_patches()
+
+from atroposlib.envs.base import (
+    BaseEnv,
+    BaseEnvConfig,
+    ScoredDataGroup,
+    ScoredDataItem,
+)
+from atroposlib.envs.server_handling.server_manager import (
+    APIServerConfig,
+    ServerBaseline,
+    ServerManager,
+)
+from atroposlib.type_definitions import Item
+
+from environments.agent_loop import AgentResult, HermesAgentLoop
+from environments.tool_context import ToolContext
+
+# Import hermes-agent toolset infrastructure
+from model_tools import get_tool_definitions
+from toolset_distributions import sample_toolsets_from_distribution
+
+logger = logging.getLogger(__name__)
+
+
+class HermesAgentEnvConfig(BaseEnvConfig):
+    """
+    Configuration for hermes-agent Atropos environments.
+
+    Extends BaseEnvConfig with agent-specific settings for toolsets,
+    terminal backend, dataset loading, and tool call parsing.
+    """
+
+    # --- Toolset configuration ---
+    # Use either enabled_toolsets OR distribution; if both are set, distribution takes precedence
+    enabled_toolsets: Optional[List[str]] = Field(
+        default=None,
+        description="Explicit list of hermes toolsets to enable (e.g., ['terminal', 'file', 'web']). "
+        "If None and distribution is also None, all available toolsets are enabled.",
+    )
+    disabled_toolsets: Optional[List[str]] = Field(
+        default=None,
+        description="Toolsets to disable. Applied as a filter on top of enabled_toolsets or distribution.",
+    )
+    distribution: Optional[str] = Field(
+        default=None,
+        description="Name of a toolset distribution from toolset_distributions.py "
+        "(e.g., 'development', 'terminal_tasks'). Sampled once per group. "
+        "Mutually exclusive with enabled_toolsets.",
+    )
+
+    # --- Agent loop configuration ---
+    max_agent_turns: int = Field(
+        default=30,
+        description="Maximum number of LLM calls (tool-calling iterations) per rollout.",
+    )
+    system_prompt: Optional[str] = Field(
+        default=None,
+        description="System prompt for the agent. Tools are handled via the tools= parameter, "
+        "not embedded in the prompt text.",
+    )
+    agent_temperature: float = Field(
+        default=1.0,
+        description="Sampling temperature for agent generation during rollouts.",
+    )
+
+    # --- Terminal backend ---
+    terminal_backend: str = Field(
+        default="local",
+        description="Terminal backend: 'local', 'docker', 'modal', 'ssh', 'singularity'. "
+        "Modal recommended for production RL (cloud isolation per rollout).",
+    )
+    terminal_timeout: int = Field(
+        default=120,
+        description="Per-command timeout in seconds for terminal tool calls. "
+        "Commands exceeding this are killed. Increase for tasks with long-running "
+        "commands (compilation, pip install, etc.).",
+    )
+    terminal_lifetime: int = Field(
+        default=3600,
+        description="Sandbox inactivity lifetime in seconds. The cleanup thread kills "
+        "sandboxes that have been idle longer than this. Must be longer than "
+        "the longest gap between tool calls (e.g., waiting for LLM response).",
+    )
+
+    # --- Dataset ---
+    dataset_name: Optional[str] = Field(
+        default=None,
+        description="HuggingFace dataset name. Optional if tasks are defined inline.",
+    )
+    dataset_split: str = Field(
+        default="train",
+        description="Dataset split to use.",
+    )
+    prompt_field: str = Field(
+        default="prompt",
+        description="Which field in the dataset contains the prompt.",
+    )
+
+    # --- Thread pool ---
+    tool_pool_size: int = Field(
+        default=128,
+        description="Thread pool size for tool execution. Each concurrent task needs a "
+        "thread for tool calls. Must be large enough for parallel evaluation. "
+        "Too small = thread pool starvation.",
+    )
+
+    # --- Phase 2: Tool call parsing ---
+    tool_call_parser: str = Field(
+        default="hermes",
+        description="Tool call parser name for Phase 2 (vLLM server type). "
+        "Ignored in Phase 1 (OpenAI server type, where the server parses tool calls natively). "
+        "Options: hermes, mistral, llama3_json, qwen, deepseek_v3, etc.",
+    )
+
+    # --- Provider-specific parameters ---
+    # Passed as extra_body to the OpenAI client's chat.completions.create() call.
+    # Useful for OpenRouter provider preferences, transforms, route settings, etc.
+    # Example YAML:
+    #   extra_body:
+    #     provider:
+    #       ignore: ["DeepInfra", "Fireworks"]
+    #       order: ["Together"]
+    #     transforms: ["middle-out"]
+    extra_body: Optional[Dict[str, Any]] = Field(
+        default=None,
+        description="Extra body parameters passed to the OpenAI client's "
+        "chat.completions.create(). Used for OpenRouter provider preferences, "
+        "transforms, and other provider-specific settings.",
+    )
+
+
+class HermesAgentBaseEnv(BaseEnv):
+    """
+    Abstract base environment for hermes-agent Atropos integration.
+
+    Handles two modes of operation:
+    - Phase 1 (OpenAI server type): Uses server.chat_completion() directly.
+      The server (vLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing
+      and reasoning extraction natively. DummyManagedServer provides placeholder
+      tokens. Good for SFT data gen, verifier testing, evaluation.
+
+    - Phase 2 (vLLM server type): Uses ManagedServer for exact token IDs + logprobs
+      via /generate. Client-side tool call parser reconstructs structured tool_calls
+      from raw output. Full RL training capability.
+
+    Subclasses must implement:
+        setup()           -- Load dataset, initialize state
+        get_next_item()   -- Return the next item to roll out
+        format_prompt()   -- Convert a dataset item into the user message string
+        compute_reward()  -- Score the rollout using ToolContext
+        evaluate()        -- Periodic evaluation
+    """
+
+    name: Optional[str] = "hermes-agent"
+    env_config_cls = HermesAgentEnvConfig
+
+    def __init__(
+        self,
+        config: HermesAgentEnvConfig,
+        server_configs: Union[ServerBaseline, List[APIServerConfig]],
+        slurm=False,
+        testing=False,
+    ):
+        super().__init__(config, server_configs, slurm, testing)
+
+        # Set terminal environment variables so hermes tools pick them up.
+        # These can all be overridden per-environment via config fields instead
+        # of requiring users to set shell env vars.
+        if config.terminal_backend:
+            os.environ["TERMINAL_ENV"] = config.terminal_backend
+        os.environ["TERMINAL_TIMEOUT"] = str(config.terminal_timeout)
+        os.environ["TERMINAL_LIFETIME_SECONDS"] = str(config.terminal_lifetime)
+        print(
+            f"🖥️  Terminal: backend={config.terminal_backend}, "
+            f"timeout={config.terminal_timeout}s, lifetime={config.terminal_lifetime}s"
+        )
+
+        # Resize the agent loop's thread pool for tool execution.
+        # This must be large enough for the number of concurrent tasks
+        # (e.g., 89 parallel TB2 eval tasks each need a thread for tool calls).
+        from environments.agent_loop import resize_tool_pool
+        resize_tool_pool(config.tool_pool_size)
+
+        # Current group's resolved tools (set in collect_trajectories)
+        self._current_group_tools: Optional[Tuple[List[Dict], Set[str]]] = None
+
+        # Tool error tracking for wandb logging
+        self._tool_error_buffer: List[Dict[str, Any]] = []
+
+    # =========================================================================
+    # Toolset resolution (per-group)
+    # =========================================================================
+
+    def _resolve_tools_for_group(self) -> Tuple[List[Dict[str, Any]], Set[str]]:
+        """
+        Resolve toolsets for a group. Called once in collect_trajectories(),
+        then shared by all collect_trajectory() calls in the group.
+
+        If distribution is set, toolsets are sampled from it (it takes precedence).
+        Otherwise enabled_toolsets is used; None means all available toolsets.
+        disabled_toolsets is applied as a filter on top.
+
+        Returns:
+            (tool_schemas, valid_tool_names) tuple
+        """
+        config = self.config
+
+        if config.distribution:
+            group_toolsets = sample_toolsets_from_distribution(config.distribution)
+            logger.info("Sampled toolsets from '%s': %s", config.distribution, group_toolsets)
+        else:
+            group_toolsets = config.enabled_toolsets  # None means "all available"
+            if group_toolsets is None:
+                logger.warning(
+                    "enabled_toolsets is None -- loading ALL tools including messaging. "
+                    "Set explicit enabled_toolsets for RL training."
+                )
+
+        tools = get_tool_definitions(
+            enabled_toolsets=group_toolsets,
+            disabled_toolsets=config.disabled_toolsets,
+            quiet_mode=True,
+        )
+
+        valid_names = {t["function"]["name"] for t in tools} if tools else set()
+        logger.info("Resolved %d tools for group: %s", len(valid_names), sorted(valid_names))
+        return tools, valid_names
+
+    # =========================================================================
+    # Server mode detection
+    # =========================================================================
+
+    def _use_managed_server(self) -> bool:
+        """
+        Determine if we should use ManagedServer (Phase 2) or direct server (Phase 1).
+
+        Phase 2 (ManagedServer) is used when the server type is 'vllm' or 'sglang',
+        which go through the /generate endpoint for exact token tracking.
+
+        Phase 1 (direct server) is used for 'openai' server type, which uses
+        /v1/chat/completions with native tool call parsing.
+        """
+        if not self.server.servers:
+            return False
+
+        server = self.server.servers[0]
+        # If the server is an OpenAI server (not VLLM/SGLang), use direct mode
+        from atroposlib.envs.server_handling.openai_server import OpenAIServer
+        return not isinstance(server, OpenAIServer)
+
+    # =========================================================================
+    # Core Atropos integration
+    # =========================================================================
+
+    async def collect_trajectories(
+        self, item: Item
+    ) -> Tuple[
+        Union[Optional[ScoredDataGroup], List[Optional[ScoredDataGroup]]],
+        List[Item],
+    ]:
+        """
+        Override collect_trajectories to resolve toolsets once per group,
+        then delegate to the standard group-level collection.
+
+        The default BaseEnv.collect_trajectories() calls collect_trajectory()
+        group_size times in parallel. We resolve tools once here and store
+        them for all those calls to use.
+        """
+        # Resolve toolsets for this group (shared by all rollouts in the group)
+        self._current_group_tools = self._resolve_tools_for_group()
+
+        # Delegate to the default implementation which calls collect_trajectory()
+        # group_size times via asyncio.gather
+        return await super().collect_trajectories(item)
+
+    # =========================================================================
+    # Wandb rollout display -- format trajectories nicely
+    # =========================================================================
+
+    @staticmethod
+    def _format_trajectory_for_display(messages: List[Dict[str, Any]]) -> str:
+        """
+        Format a conversation's messages into a readable trajectory string
+        for wandb rollout tables. Shows tool calls, tool results, and reasoning
+        in a structured way instead of raw token decoding.
+        """
+        parts = []
+        for msg in messages:
+            role = msg.get("role", "unknown")
+            content = msg.get("content") or ""  # "content" may be present but None
+
+            if role == "system":
+                parts.append(f"[SYSTEM]\n{content}")
+
+            elif role == "user":
+                parts.append(f"[USER]\n{content}")
+
+            elif role == "assistant":
+                # Show reasoning if present
+                reasoning = msg.get("reasoning_content", "")
+                if reasoning:
+                    # Truncate long reasoning for display
+                    if len(reasoning) > 300:
+                        reasoning = reasoning[:300] + "..."
+                    parts.append(f"[ASSISTANT thinking]\n{reasoning}")
+
+                # Show content
+                if content:
+                    parts.append(f"[ASSISTANT]\n{content}")
+
+                # Show tool calls
+                tool_calls = msg.get("tool_calls") or []  # key may be present but None
+                for tc in tool_calls:
+                    func = tc.get("function", {})
+                    name = func.get("name", "?")
+                    args = func.get("arguments", "{}")
+                    # Truncate long arguments for display
+                    if len(args) > 200:
+                        args = args[:200] + "..."
+                    parts.append(f"[TOOL CALL] {name}({args})")
+
+            elif role == "tool":
+                tool_id = msg.get("tool_call_id", "")
+                result = content
+                # Truncate long tool results for display
+                if len(result) > 500:
+                    result = result[:500] + "..."
+                parts.append(f"[TOOL RESULT] {result}")
+
+        return "\n\n".join(parts)
+
+    async def add_rollouts_for_wandb(
+        self,
+        scored_data,
+        item=None,
+    ):
+        """
+        Override to show formatted trajectories with tool calls visible,
+        instead of raw token decoding which loses all structure.
+        """
+        num_keep = self.config.num_rollouts_per_group_for_logging
+        if num_keep == -1:
+            num_keep = self.config.group_size
+
+        group = []
+        for i in range(min(num_keep, len(scored_data.get("scores", [])))):
+            score = scored_data["scores"][i]
+
+            # Use messages if available for rich display
+            messages = None
+            if scored_data.get("messages") and i < len(scored_data["messages"]):
+                messages = scored_data["messages"][i]
+
+            if messages:
+                text = self._format_trajectory_for_display(messages)
+            elif scored_data.get("tokens") and i < len(scored_data["tokens"]):
+                text = self.tokenizer.decode(scored_data["tokens"][i])
+            else:
+                text = "(no data)"
+
+            group.append((text, score))
+
+        self.rollouts_for_wandb.append(group)
+        if len(self.rollouts_for_wandb) > self.config.num_rollouts_to_keep:
+            self.rollouts_for_wandb.pop(0)
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log base metrics including tool errors to wandb."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        # Log tool error stats
+        if self._tool_error_buffer:
+            wandb_metrics["train/tool_errors_count"] = len(self._tool_error_buffer)
+
+            # Log error details as a summary string (tables can crash wandb on tmp cleanup)
+            error_summaries = []
+            for err in self._tool_error_buffer:
+                error_summaries.append(
+                    f"[turn {err['turn']}] {err['tool']}({err['args'][:80]}) -> {err['error'][:150]}"
+                )
+            wandb_metrics["train/tool_error_details"] = "\n".join(error_summaries)
+
+            # Also print to stdout for immediate visibility
+            for summary in error_summaries:
+                print(f"  Tool Error: {summary}")
+
+            self._tool_error_buffer = []
+        else:
+            wandb_metrics["train/tool_errors_count"] = 0
+
+        await super().wandb_log(wandb_metrics)
+
+    async def collect_trajectory(
+        self, item: Item
+    ) -> Tuple[Optional[Union[ScoredDataItem, Any]], List[Item]]:
+        """
+        Run a single rollout: agent loop + reward computation.
+
+        This is called group_size times in parallel by collect_trajectories().
+        Each call gets its own task_id for terminal/browser session isolation.
+        """
+        task_id = str(uuid.uuid4())
+
+        # Get group-level tools (resolved once in collect_trajectories)
+        if self._current_group_tools is None:
+            # Fallback: resolve per-trajectory if called outside collect_trajectories
+            tools, valid_names = self._resolve_tools_for_group()
+        else:
+            tools, valid_names = self._current_group_tools
+
+        # Build initial messages
+        messages: List[Dict[str, Any]] = []
+        if self.config.system_prompt:
+            messages.append({"role": "system", "content": self.config.system_prompt})
+        messages.append({"role": "user", "content": self.format_prompt(item)})
+
+        # Run the agent loop
+        result: AgentResult
+        if self._use_managed_server():
+            # Phase 2: ManagedServer with parser -- exact tokens + logprobs
+            # Load the tool call parser from registry based on config
+            from environments.tool_call_parsers import get_parser
+            try:
+                tc_parser = get_parser(self.config.tool_call_parser)
+            except KeyError:
+                logger.warning(
+                    "Tool call parser '%s' not found, falling back to 'hermes'",
+                    self.config.tool_call_parser,
+                )
+                tc_parser = get_parser("hermes")
+
+            try:
+                async with self.server.managed_server(
+                    tokenizer=self.tokenizer,
+                    tool_call_parser=tc_parser,
+                ) as managed:
+                    agent = HermesAgentLoop(
+                        server=managed,
+                        tool_schemas=tools,
+                        valid_tool_names=valid_names,
+                        max_turns=self.config.max_agent_turns,
+                        task_id=task_id,
+                        temperature=self.config.agent_temperature,
+                        max_tokens=self.config.max_token_length,
+                        extra_body=self.config.extra_body,
+                    )
+                    result = await agent.run(messages)
+            except NotImplementedError:
+                # DummyManagedServer not allowed -- fall back to Phase 1
+                logger.warning(
+                    "ManagedServer not available (OpenAI server?). "
+                    "Falling back to direct server mode."
+                )
+                agent = HermesAgentLoop(
+                    server=self.server,
+                    tool_schemas=tools,
+                    valid_tool_names=valid_names,
+                    max_turns=self.config.max_agent_turns,
+                    task_id=task_id,
+                    temperature=self.config.agent_temperature,
+                    max_tokens=self.config.max_token_length,
+                    extra_body=self.config.extra_body,
+                )
+                result = await agent.run(messages)
+        else:
+            # Phase 1: OpenAI server -- native tool_calls, placeholder tokens
+            agent = HermesAgentLoop(
+                server=self.server,
+                tool_schemas=tools,
+                valid_tool_names=valid_names,
+                max_turns=self.config.max_agent_turns,
+                task_id=task_id,
+                temperature=self.config.agent_temperature,
+                max_tokens=self.config.max_token_length,
+                extra_body=self.config.extra_body,
+            )
+            result = await agent.run(messages)
+
+        # Skip reward computation if the agent loop produced no meaningful work
+        # (e.g., API call failed on turn 1). No point spinning up a Modal sandbox
+        # just to verify files that were never created.
+        only_system_and_user = all(
+            msg.get("role") in ("system", "user") for msg in result.messages
+        )
+        if result.turns_used == 0 or only_system_and_user:
+            logger.warning(
+                "Agent loop produced no output (turns=%d, msgs=%d). Skipping reward.",
+                result.turns_used, len(result.messages),
+            )
+            reward = 0.0
+        else:
+            # Compute reward using ToolContext (gives verifier full tool access)
+            ctx = ToolContext(task_id)
+            try:
+                reward = await self.compute_reward(item, result, ctx)
+            except Exception as e:
+                logger.exception("compute_reward failed: %s", e)
+                reward = 0.0
+            finally:
+                ctx.cleanup()
+
+        # Track tool errors for wandb logging
+        if result.tool_errors:
+            for err in result.tool_errors:
+                self._tool_error_buffer.append({
+                    "turn": err.turn,
+                    "tool": err.tool_name,
+                    "args": err.arguments[:150],
+                    "error": err.error[:300],
+                    "result": err.tool_result[:300],
+                })
+
+        # Build ScoredDataItem from ManagedServer state
+        # Phase 2: real tokens/masks/logprobs from SequenceNodes
+        # Phase 1: placeholder tokens (still need a valid ScoredDataItem for the pipeline)
+        nodes = (result.managed_state or {}).get("nodes", [])
+
+        if nodes:
+            # Phase 2 (or DummyManagedServer): use actual node data
+            node = nodes[-1]  # Final sequence node = full trajectory
+            scored_item: Dict[str, Any] = {
+                "tokens": node.tokens,
+                "masks": node.masked_tokens,
+                "scores": reward,
+            }
+
+            # If logprobs are available (Phase 2), add placeholder fields for the trainer to fill in
+            if hasattr(node, "logprobs") and node.logprobs:
+                scored_item["advantages"] = None  # Computed by trainer
+                scored_item["ref_logprobs"] = None
+        else:
+            # Phase 1 with no managed state: create placeholder tokens
+            # so the data pipeline doesn't break. These are NOT suitable
+            # for training but allow process mode (SFT data gen) to work.
+            # Tokenize the full conversation to get approximate tokens.
+            full_text = "\n".join(
+                msg.get("content", "") for msg in result.messages if msg.get("content")
+            )
+            if self.tokenizer:
+                tokens = self.tokenizer.encode(full_text, add_special_tokens=True)
+            else:
+                tokens = list(range(min(len(full_text) // 4, 128)))
+
+            scored_item = {
+                "tokens": tokens,
+                "masks": ([-100] + tokens[1:]) if tokens else [],  # Mask first token as prompt; keep lengths equal
+                "scores": reward,
+            }
+
+        # Always include messages for wandb rollout display and data logging
+        scored_item["messages"] = result.messages
+
+        return scored_item, []
+
+    # =========================================================================
+    # Abstract methods -- subclasses must implement
+    # =========================================================================
+
+    @abstractmethod
+    async def setup(self):
+        """
+        Load dataset, initialize state.
+
+        Called once when the environment starts. Typical implementation:
+            self.dataset = load_dataset(self.config.dataset_name, split=self.config.dataset_split)
+            self.iter = 0
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    async def get_next_item(self) -> Item:
+        """
+        Return the next item from the dataset for rollout.
+
+        Called by the base env's main loop to get items for workers.
+        Should cycle through the dataset.
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    def format_prompt(self, item: Item) -> str:
+        """
+        Convert a dataset item into the user message for the agent.
+
+        Args:
+            item: Dataset item (dict, tuple, etc.)
+
+        Returns:
+            The prompt string to send to the agent
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    async def compute_reward(
+        self, item: Item, result: AgentResult, ctx: ToolContext
+    ) -> float:
+        """
+        Score the rollout. Has full access to:
+        - item: the original dataset item (ground truth, test commands, etc.)
+        - result: AgentResult with full messages, turn count, reasoning, etc.
+        - ctx: ToolContext -- call ANY hermes-agent tool (terminal, file, web,
+               browser, vision...) scoped to this rollout's sandbox. Nothing
+               is off-limits.
+
+        Args:
+            item: The dataset item that was rolled out
+            result: The agent's rollout result
+            ctx: ToolContext with full tool access for verification
+
+        Returns:
+            Reward float (typically 0.0 to 1.0, but any float is valid)
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    async def evaluate(self, *args, **kwargs):
+        """
+        Periodic evaluation. Called every steps_per_eval steps.
+
+        Typical implementation runs the agent on a held-out eval set
+        and logs metrics via wandb/evaluate_log.
+        """
+        raise NotImplementedError
@@ -0,0 +1,34 @@
+# SWE Environment -- Default Configuration
+#
+# SWE-bench style tasks with Modal sandboxes for cloud isolation.
+# Uses terminal + file + web toolsets.
+#
+# Usage:
+#   python environments/hermes_swe_env/hermes_swe_env.py serve \
+#       --config environments/hermes_swe_env/default.yaml
+
+env:
+  enabled_toolsets: ["terminal", "file", "web"]
+  max_agent_turns: 30
+  max_token_length: 4096
+  group_size: 4
+  terminal_backend: "modal"
+  tool_call_parser: "hermes"
+  tokenizer_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
+  dataset_name: "bigcode/humanevalpack"
+  dataset_split: "test"
+  prompt_field: "prompt"
+  steps_per_eval: 50
+  total_steps: 500
+  use_wandb: true
+  wandb_name: "hermes-swe"
+  system_prompt: >
+    You are a skilled software engineer. You have access to a terminal,
+    file tools, and web search. Use these tools to complete the coding task.
+    Write clean, working code and verify it runs correctly before finishing.
+
+openai:
+  base_url: "http://localhost:8000/v1"
+  model_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
+  server_type: "openai"
+  api_key: ""
@@ -0,0 +1,229 @@
+"""
+HermesSweEnv -- SWE-Bench Style Environment with Modal Sandboxes
+
+A concrete environment for software engineering tasks where the model writes code
+and the reward function runs tests to verify correctness. Uses Modal terminal
+backend for cloud-isolated sandboxes per rollout.
+
+The reward function uses ToolContext.terminal() to run test commands in the same
+Modal sandbox the model used during its agentic loop. All filesystem state from
+the model's tool calls is preserved for verification.
+
+Usage:
+    # Phase 1: OpenAI server type
+    vllm serve YourModel --enable-auto-tool-choice --tool-call-parser hermes
+    run-api
+    python environments/hermes_swe_env/hermes_swe_env.py serve \\
+        --openai.base_url http://localhost:8000/v1 \\
+        --openai.model_name YourModel \\
+        --openai.server_type openai \\
+        --env.dataset_name bigcode/humanevalpack \\
+        --env.terminal_backend modal
+
+    # Phase 2: VLLM server type (full RL training)
+    python environments/hermes_swe_env/hermes_swe_env.py serve \\
+        --openai.base_url http://localhost:8000/v1 \\
+        --openai.model_name YourModel \\
+        --openai.server_type vllm \\
+        --env.tool_call_parser hermes \\
+        --env.terminal_backend modal
+"""
+
+import logging
+import sys
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# Ensure repo root is on sys.path for imports
+_repo_root = Path(__file__).resolve().parent.parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from datasets import load_dataset
+
+from atroposlib.envs.base import ScoredDataGroup
+from atroposlib.envs.server_handling.server_manager import APIServerConfig
+from atroposlib.type_definitions import Item
+
+from environments.agent_loop import AgentResult
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+from environments.tool_context import ToolContext
+
+logger = logging.getLogger(__name__)
+
+
+class HermesSweEnvConfig(HermesAgentEnvConfig):
+    """Config with defaults for SWE-bench style tasks."""
+
+    pass  # Inherits all fields, overrides defaults in config_init
+
+
+class HermesSweEnv(HermesAgentBaseEnv):
+    """
+    SWE-bench style environment using Modal terminal backend.
+
+    The model gets a coding task, uses terminal + file + web tools to solve it,
+    and the reward function runs tests in the same Modal sandbox to verify.
+
+    Subclass this for specific SWE datasets (HumanEval, SWE-bench, etc.)
+    and customize format_prompt() and compute_reward() as needed.
+    """
+
+    name = "hermes-swe"
+    env_config_cls = HermesSweEnvConfig
+
+    @classmethod
+    def config_init(cls) -> Tuple[HermesSweEnvConfig, List[APIServerConfig]]:
+        """
+        Default configuration for the SWE environment.
+
+        Uses Modal terminal backend for cloud isolation and terminal + file + web toolsets.
+        """
+        env_config = HermesSweEnvConfig(
+            # Toolsets: terminal for running code, file for reading/writing, web for docs
+            enabled_toolsets=["terminal", "file", "web"],
+            disabled_toolsets=None,
+            distribution=None,
+            # Agent settings -- SWE tasks need more turns
+            max_agent_turns=30,
+            max_token_length=4096,
+            agent_temperature=1.0,
+            system_prompt=(
+                "You are a skilled software engineer. You have access to a terminal, "
+                "file tools, and web search. Use these tools to complete the coding task. "
+                "Write clean, working code and verify it runs correctly before finishing."
+            ),
+            # Modal backend for cloud-isolated sandboxes
+            terminal_backend="modal",
+            # Dataset -- override via CLI for your specific SWE dataset
+            dataset_name="bigcode/humanevalpack",
+            dataset_split="test",
+            prompt_field="prompt",
+            # Atropos settings
+            group_size=4,
+            tokenizer_name="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
+            tool_call_parser="hermes",
+            steps_per_eval=50,
+            total_steps=500,
+            use_wandb=True,
+            wandb_name="hermes-swe",
+        )
+
+        server_configs = [
+            APIServerConfig(
+                base_url="http://localhost:8000/v1",
+                model_name="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
+                server_type="openai",  # Phase 1; switch to "vllm" for Phase 2
+                api_key="",
+            )
+        ]
+
+        return env_config, server_configs
+
+    async def setup(self):
+        """Load the SWE dataset."""
+        if self.config.dataset_name:
+            self.dataset = load_dataset(
+                self.config.dataset_name, split=self.config.dataset_split
+            )
+        else:
+            # Placeholder if no dataset specified
+            self.dataset = []
+        self.iter = 0
+        self.reward_buffer: List[float] = []
+
+    async def get_next_item(self) -> Dict[str, Any]:
+        """Cycle through the SWE dataset."""
+        if not self.dataset:
+            raise ValueError("No dataset loaded. Set dataset_name in config.")
+        item = self.dataset[self.iter % len(self.dataset)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item: Dict[str, Any]) -> str:
+        """
+        Format the SWE task prompt.
+
+        Override this in subclasses for different dataset formats.
+        Default assumes the dataset has a 'prompt' field and optionally a 'test' field.
+        """
+        prompt = item.get(self.config.prompt_field, "")
+
+        # If the dataset has test information, include it in the prompt
+        test_info = item.get("test", item.get("test_code", item.get("tests", "")))
+        if test_info:
+            prompt += f"\n\nTests to pass:\n{test_info}"
+
+        return prompt
+
+    async def compute_reward(
+        self, item: Dict[str, Any], result: AgentResult, ctx: ToolContext
+    ) -> float:
+        """
+        Score by running tests in the model's Modal sandbox.
+
+        Default implementation:
+        - If the dataset item has a 'test', 'test_code', or 'tests' field, run it
+        - Check exit code: 0 = pass (reward 1.0)
+        - Otherwise, partial credit (0.1) if new Python files exist under /workspace
+
+        Override this in subclasses for more sophisticated reward logic.
+        """
+        # Find the test command from the dataset item
+        test_code = item.get("test", item.get("test_code", item.get("tests", "")))
+
+        if test_code:
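+            # Run the test in the model's sandbox, shell-quoting the code so that
+            # quotes, "$", and backticks inside test_code survive intact
+            import shlex
+            test_result = ctx.terminal(f"cd /workspace && python3 -c {shlex.quote(test_code)}", timeout=60)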
+
+            if test_result["exit_code"] == 0:
+                self.reward_buffer.append(1.0)
+                return 1.0
+
+        # Partial credit: check if the model created any Python files
+        file_check = ctx.terminal("find /workspace -name '*.py' -newer /tmp/.start_marker 2>/dev/null | head -5")
+        if file_check["exit_code"] == 0 and file_check.get("output", "").strip():
+            self.reward_buffer.append(0.1)
+            return 0.1
+
+        self.reward_buffer.append(0.0)
+        return 0.0
+
+    async def evaluate(self, *args, **kwargs):
+        """
+        Run evaluation on a held-out set.
+
+        Override for dataset-specific evaluation logic.
+        """
+        start_time = time.time()
+        end_time = time.time()
+
+        eval_metrics = {"eval/placeholder": 0.0}
+        await self.evaluate_log(
+            metrics=eval_metrics,
+            start_time=start_time,
+            end_time=end_time,
+        )
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log SWE-specific metrics."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        if self.reward_buffer:
+            wandb_metrics["train/avg_reward"] = sum(self.reward_buffer) / len(
+                self.reward_buffer
+            )
+            wandb_metrics["train/pass_rate"] = sum(
+                1 for r in self.reward_buffer if r == 1.0
+            ) / len(self.reward_buffer)
+            self.reward_buffer = []
+
+        await super().wandb_log(wandb_metrics)
+
+
+if __name__ == "__main__":
+    HermesSweEnv.cli()
@@ -0,0 +1,188 @@
+"""
+Monkey patches for making hermes-agent tools work inside async frameworks (Atropos).
+
+Problem:
+    Some tools use asyncio.run() internally (e.g., mini-swe-agent's Modal backend,
+    web_extract). This crashes when called from inside Atropos's event loop because
+    asyncio.run() can't be nested.
+
+Solution:
+    Replace the problematic methods with versions that use a dedicated background
+    thread with its own event loop. The calling code sees the same sync interface --
+    call a function, get a result -- but internally the async work happens on a
+    separate thread that doesn't conflict with Atropos's loop.
+
+    These patches are safe for normal CLI use too: when there's no running event
+    loop, the behavior is identical (the background thread approach works regardless).
+
+What gets patched:
+    - SwerexModalEnvironment.__init__ -- creates Modal deployment on a background thread
+    - SwerexModalEnvironment.execute -- runs commands on the same background thread
+    - SwerexModalEnvironment.stop -- stops deployment on the background thread
+
+Usage:
+    Call apply_patches() once at import time (done automatically by hermes_base_env.py).
+    This is idempotent -- calling it multiple times is safe.
+"""
+
+import asyncio
+import logging
+import threading
+from typing import Any
+
+logger = logging.getLogger(__name__)
+
+_patches_applied = False
+
+
+class _AsyncWorker:
+    """
+    A dedicated background thread with its own event loop.
+
+    Allows sync code to submit async coroutines and block for results,
+    even when called from inside another running event loop. Used to
+    bridge sync tool interfaces with async backends (Modal, SWE-ReX).
+    """
+
+    def __init__(self):
+        self._loop: asyncio.AbstractEventLoop | None = None
+        self._thread: threading.Thread | None = None
+        self._started = threading.Event()
+
+    def start(self):
+        """Start the background event loop thread."""
+        self._thread = threading.Thread(target=self._run_loop, daemon=True)
+        self._thread.start()
+        self._started.wait(timeout=30)
+
+    def _run_loop(self):
+        """Background thread entry point -- runs the event loop forever."""
+        self._loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(self._loop)
+        self._started.set()
+        self._loop.run_forever()
+
+    def run_coroutine(self, coro, timeout=600):
+        """
+        Submit a coroutine to the background loop and block until it completes.
+
+        Safe to call from any thread, including threads that already have
+        a running event loop.
+        """
+        if self._loop is None or self._loop.is_closed():
+            raise RuntimeError("AsyncWorker loop is not running")
+        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
+        return future.result(timeout=timeout)
+
+    def stop(self):
+        """Stop the background event loop and join the thread."""
+        if self._loop and self._loop.is_running():
+            self._loop.call_soon_threadsafe(self._loop.stop)
+        if self._thread:
+            self._thread.join(timeout=10)
+
+
+def _patch_swerex_modal():
+    """
+    Monkey patch SwerexModalEnvironment to use a background thread event loop
+    instead of asyncio.run(). This makes it safe to call from inside Atropos's
+    async event loop.
+
+    The patched methods have the exact same interface and behavior -- the only
+    difference is HOW the async work is executed internally.
+    """
+    try:
+        from minisweagent.environments.extra.swerex_modal import (
+            SwerexModalEnvironment,
+            SwerexModalEnvironmentConfig,
+        )
+        from swerex.deployment.modal import ModalDeployment
+        from swerex.runtime.abstract import Command as RexCommand
+    except ImportError:
+        # mini-swe-agent or swe-rex not installed -- nothing to patch
+        logger.debug("mini-swe-agent Modal backend not available, skipping patch")
+        return
+
+    # Save original methods so we can refer to config handling
+    _original_init = SwerexModalEnvironment.__init__
+
+    def _patched_init(self, **kwargs):
+        """Patched __init__: creates Modal deployment on a background thread."""
+        self.config = SwerexModalEnvironmentConfig(**kwargs)
+
+        # Start a dedicated event loop thread for all Modal async operations
+        self._worker = _AsyncWorker()
+        self._worker.start()
+
+        # Create AND start the deployment entirely on the worker's loop/thread
+        # so all gRPC channels and async state are bound to that loop
+        async def _create_and_start():
+            deployment = ModalDeployment(
+                image=self.config.image,
+                startup_timeout=self.config.startup_timeout,
+                runtime_timeout=self.config.runtime_timeout,
+                deployment_timeout=self.config.deployment_timeout,
+                install_pipx=self.config.install_pipx,
+                modal_sandbox_kwargs=self.config.modal_sandbox_kwargs,
+            )
+            await deployment.start()
+            return deployment
+
+        self.deployment = self._worker.run_coroutine(_create_and_start())
+
+    def _patched_execute(self, command: str, cwd: str = "", *, timeout: int | None = None) -> dict[str, Any]:
+        """Patched execute: runs commands on the background thread's loop."""
+        async def _do_execute():
+            return await self.deployment.runtime.execute(
+                RexCommand(
+                    command=command,
+                    shell=True,
+                    check=False,
+                    cwd=cwd or self.config.cwd,
+                    timeout=timeout or self.config.timeout,
+                    merge_output_streams=True,
+                    env=self.config.env if self.config.env else None,
+                )
+            )
+
+        output = self._worker.run_coroutine(_do_execute())
+        return {
+            "output": output.stdout,
+            "returncode": output.exit_code,
+        }
+
+    def _patched_stop(self):
+        """Patched stop: stops deployment on the background thread, then stops the thread."""
+        try:
+            self._worker.run_coroutine(
+                asyncio.wait_for(self.deployment.stop(), timeout=10),
+                timeout=15,
+            )
+        except Exception:
+            logger.debug("Ignoring error while stopping Modal deployment", exc_info=True)
+        finally:
+            self._worker.stop()
+
+    # Apply the patches
+    SwerexModalEnvironment.__init__ = _patched_init
+    SwerexModalEnvironment.execute = _patched_execute
+    SwerexModalEnvironment.stop = _patched_stop
+
+    logger.debug("Patched SwerexModalEnvironment for async-safe operation")
+
+
+def apply_patches():
+    """
+    Apply all monkey patches needed for Atropos compatibility.
+
+    Safe to call multiple times -- patches are only applied once.
+    Safe for normal CLI use -- patched code works identically when
+    there is no running event loop.
+    """
+    global _patches_applied
+    if _patches_applied:
+        return
+
+    _patch_swerex_modal()
+
+    _patches_applied = True
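Review note: the patched methods above all route work through `self._worker.run_coroutine(...)`, whose implementation is outside this hunk. A minimal sketch of how such a worker could look, assuming it owns a dedicated event loop on a daemon thread. `LoopWorker` and `add` are hypothetical names for illustration, not the PR's code.

```python
import asyncio
import threading


class LoopWorker:
    """Hypothetical worker: runs coroutines on a private background event loop."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        # run_forever() blocks, so it lives on its own daemon thread.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_coroutine(self, coro, timeout=None):
        # Schedule on the background loop and block the caller until it finishes.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def stop(self):
        # loop.stop() must be requested from inside the loop's own thread.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5)
        if not self._loop.is_running():
            self._loop.close()


async def add(a, b):
    await asyncio.sleep(0.01)
    return a + b


worker = LoopWorker()
result = worker.run_coroutine(add(2, 3), timeout=5)
worker.stop()
print(result)
```

Because the caller only blocks on a `concurrent.futures.Future`, this pattern works the same whether or not the calling thread already has a running event loop, which is the property the `apply_patches` docstring relies on.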
@@ -0,0 +1,34 @@
+# Terminal Test Environment -- Default Configuration
+#
+# Simple file-creation tasks for validating the full Atropos + hermes-agent stack.
+# Uses Modal terminal backend and OpenRouter (Claude) for inference.
+# API keys loaded from ~/hermes-agent/.env
+#
+# Usage:
+#   run-api
+#   python environments/terminal_test_env/terminal_test_env.py serve \
+#       --config environments/terminal_test_env/default.yaml
+
+env:
+  enabled_toolsets: ["terminal", "file"]
+  max_agent_turns: 10
+  max_token_length: 2048
+  group_size: 3
+  total_steps: 3
+  steps_per_eval: 3
+  terminal_backend: "modal"
+  tool_call_parser: "hermes"
+  tokenizer_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
+  ensure_scores_are_not_same: false
+  use_wandb: false
+  system_prompt: >
+    You are a helpful assistant with access to a terminal and file tools.
+    Complete the user's request by using the available tools.
+    Be precise and follow instructions exactly.
+
+openai:
+  base_url: "https://openrouter.ai/api/v1"
+  model_name: "anthropic/claude-opus-4.6"
+  server_type: "openai"
+  health_check: false
+  # api_key loaded from OPENROUTER_API_KEY in .env
@@ -0,0 +1,292 @@
+"""
+TerminalTestEnv -- Simple Test Environment for Validating the Stack
+
+A self-contained environment with inline tasks (no external dataset needed).
+Each task asks the model to create a file at a known path with specific content.
+The reward verifier cats the file and checks if the content matches.
+
+Enables only terminal + file toolsets. Uses Modal terminal backend with
+OpenRouter (Claude) by default.
+
+Training tasks (3):
+    1. Create ~/greeting.txt with "Hello from Hermes Agent"
+    2. Create ~/count.txt with numbers 1-5, one per line
+    3. Create ~/answer.txt with the result of 123 + 456
+
+Eval task (1):
+    1. Create ~/result.txt with the result of 6 * 7
+
+Usage:
+    # Start Atropos API server
+    run-api
+
+    # Run environment (uses OpenRouter + Modal by default)
+    python environments/terminal_test_env/terminal_test_env.py serve
+
+    # Process mode (no run-api needed, saves to JSONL)
+    python environments/terminal_test_env/terminal_test_env.py process \\
+        --env.data_path_to_save_groups terminal_test_output.jsonl
+"""
+
+import logging
+import os
+import sys
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# Ensure repo root is on sys.path for imports
+_repo_root = Path(__file__).resolve().parent.parent.parent
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from atroposlib.envs.base import ScoredDataGroup
+from atroposlib.envs.server_handling.server_manager import APIServerConfig
+from atroposlib.type_definitions import Item
+
+from environments.agent_loop import AgentResult
+from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
+from environments.tool_context import ToolContext
+
+logger = logging.getLogger(__name__)
+
+
+# =============================================================================
+# Inline task definitions -- no external dataset needed
+# =============================================================================
+
+TRAIN_TASKS = [
+    {
+        "prompt": "Create a file at ~/greeting.txt containing exactly the text: Hello from Hermes Agent",
+        "verify_path": "~/greeting.txt",
+        "expected_content": "Hello from Hermes Agent",
+    },
+    {
+        "prompt": "Create a file at ~/count.txt containing the numbers 1 through 5, one per line",
+        "verify_path": "~/count.txt",
+        "expected_content": "1\n2\n3\n4\n5",
+    },
+    {
+        "prompt": "Create a file at ~/answer.txt containing the result of 123 + 456",
+        "verify_path": "~/answer.txt",
+        "expected_content": "579",
+    },
+]
+
+EVAL_TASKS = [
+    {
+        "prompt": "Create a file at ~/result.txt containing the result of 6 * 7",
+        "verify_path": "~/result.txt",
+        "expected_content": "42",
+    },
+]
+
+
+class TerminalTestEnvConfig(HermesAgentEnvConfig):
+    """Config with defaults suitable for terminal testing."""
+
+    pass  # Inherits all fields, overrides defaults in config_init
+
+
+class TerminalTestEnv(HermesAgentBaseEnv):
+    """
+    Simple test environment with inline file-creation tasks.
+
+    All tasks follow the same pattern: "create a file at ~/X.txt with content Y".
+    The verifier runs `cat ~/X.txt` in the rollout's terminal and checks the output
+    against the expected string. Same verifier logic for all tasks.
+
+    This environment is designed to validate the full stack end-to-end:
+    - Agent loop executes tool calls (terminal/file)
+    - ToolContext provides terminal access to the reward function
+    - Reward function verifies file content via cat
+    - Scored data flows through the Atropos pipeline
+    """
+
+    name = "terminal-test"
+    env_config_cls = TerminalTestEnvConfig
+
+    @classmethod
+    def config_init(cls) -> Tuple[TerminalTestEnvConfig, List[APIServerConfig]]:
+        """
+        Default configuration for the terminal test environment.
+
+        Uses Modal terminal backend for cloud isolation and OpenRouter with
+        Claude for inference. API keys loaded from ~/hermes-agent/.env.
+        """
+        env_config = TerminalTestEnvConfig(
+            # Terminal + file tools only
+            enabled_toolsets=["terminal", "file"],
+            disabled_toolsets=None,
+            distribution=None,
+            # Agent settings
+            max_agent_turns=10,  # Simple tasks, don't need many turns
+            max_token_length=16000,
+            agent_temperature=1.0,
+            system_prompt=(
+                "You are a helpful assistant with access to a terminal and file tools. "
+                "Complete the user's request by using the available tools. "
+                "Be precise and follow instructions exactly."
+            ),
+            # Modal terminal backend for cloud-isolated sandboxes per rollout
+            terminal_backend="modal",
+            # Atropos settings
+            group_size=3,              # 3 rollouts per group
+            tokenizer_name="NousResearch/q-30b-t-h45-e1",
+            tool_call_parser="hermes",
+            steps_per_eval=3,          # Eval after all 3 steps
+            total_steps=3,             # 3 groups total (1 group per step)
+            use_wandb=True,
+            wandb_name="terminal-test",
+            ensure_scores_are_not_same=False,  # Allow all-same scores for simple tasks
+            # No external dataset
+            dataset_name=None,
+        )
+
+        # OpenRouter with Claude -- API key loaded from .env (OPENROUTER_API_KEY)
+        server_configs = [
+            APIServerConfig(
+                base_url="https://openrouter.ai/api/v1",
+                model_name="anthropic/claude-opus-4.6",
+                server_type="openai",
+                api_key=os.getenv("OPENROUTER_API_KEY", ""),
+                health_check=False,  # OpenRouter doesn't have a /health endpoint
+            )
+        ]
+
+        return env_config, server_configs
+
+    async def setup(self):
+        """Initialize inline task lists."""
+        self.train_tasks = list(TRAIN_TASKS)
+        self.eval_tasks = list(EVAL_TASKS)
+        self.iter = 0
+        # Track reward stats for wandb logging
+        self.reward_buffer: List[float] = []
+
+    async def get_next_item(self) -> Dict[str, str]:
+        """Cycle through training tasks."""
+        item = self.train_tasks[self.iter % len(self.train_tasks)]
+        self.iter += 1
+        return item
+
+    def format_prompt(self, item: Dict[str, str]) -> str:
+        """The prompt is directly in the task item."""
+        return item["prompt"]
+
+    async def compute_reward(
+        self, item: Dict[str, str], result: AgentResult, ctx: ToolContext
+    ) -> float:
+        """
+        Verify by cat-ing the expected file path and checking content matches.
+        Same verifier for all tasks -- they all write a file at a known path.
+
+        Scoring:
+            1.0 = exact match
+            0.5 = expected content is present but has extra stuff
+            0.0 = file doesn't exist or content doesn't match
+        """
+        verify_result = ctx.terminal(f"cat {item['verify_path']}")
+
+        # File doesn't exist or can't be read
+        if verify_result["exit_code"] != 0:
+            self.reward_buffer.append(0.0)
+            return 0.0
+
+        actual = verify_result.get("output", "").strip()
+        expected = item["expected_content"].strip()
+
+        # Exact match
+        if actual == expected:
+            self.reward_buffer.append(1.0)
+            return 1.0
+
+        # Partial credit: expected content is present but has extra stuff
+        if expected in actual:
+            self.reward_buffer.append(0.5)
+            return 0.5
+
+        self.reward_buffer.append(0.0)
+        return 0.0
+
+    async def evaluate(self, *args, **kwargs):
+        """
+        Run eval tasks as single-turn completions (no agent loop, no file check).
+        Logs the collected samples and the sample count; accuracy is not computed.
+        """
+        start_time = time.time()
+        correct = 0
+        total = len(self.eval_tasks)
+        samples = []
+
+        for eval_item in self.eval_tasks:
+            try:
+                # For eval, we do a simple single-turn completion (not full agent loop)
+                # to keep eval fast. The agent loop is tested via training.
+                completion = await self.server.chat_completion(
+                    messages=[
+                        {"role": "system", "content": self.config.system_prompt or ""},
+                        {"role": "user", "content": eval_item["prompt"]},
+                    ],
+                    n=1,
+                    max_tokens=self.config.max_token_length,
+                    temperature=0.0,
+                    split="eval",
+                )
+
+                response_content = (
+                    completion.choices[0].message.content if completion.choices else ""
+                )
+
+                samples.append(
+                    {
+                        "prompt": eval_item["prompt"],
+                        "response": response_content,
+                        "expected": eval_item["expected_content"],
+                    }
+                )
+
+            except Exception as e:
+                logger.error("Eval failed for item: %s", e)
+                samples.append(
+                    {
+                        "prompt": eval_item["prompt"],
+                        "response": f"ERROR: {e}",
+                        "expected": eval_item["expected_content"],
+                    }
+                )
+
+        end_time = time.time()
+
+        eval_metrics = {
+            "eval/num_samples": total,
+        }
+
+        await self.evaluate_log(
+            metrics=eval_metrics,
+            samples=samples,
+            start_time=start_time,
+            end_time=end_time,
+        )
+
+    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
+        """Log training metrics including reward stats and accuracy."""
+        if wandb_metrics is None:
+            wandb_metrics = {}
+
+        if self.reward_buffer:
+            total = len(self.reward_buffer)
+            correct = sum(1 for r in self.reward_buffer if r == 1.0)
+            partial = sum(1 for r in self.reward_buffer if r == 0.5)
+
+            wandb_metrics["train/avg_reward"] = sum(self.reward_buffer) / total
+            wandb_metrics["train/accuracy"] = correct / total
+            wandb_metrics["train/partial_match_rate"] = partial / total
+            wandb_metrics["train/total_rollouts"] = total
+            self.reward_buffer = []
+
+        await super().wandb_log(wandb_metrics)
+
+
+if __name__ == "__main__":
+    TerminalTestEnv.cli()
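Review note: the three-tier scoring in `compute_reward` can be exercised without a sandbox. The sketch below copies only the comparison logic into a standalone function (`score_file_content` is a hypothetical helper, not part of the PR) and feeds it strings in place of `cat` output.

```python
def score_file_content(actual: str, expected: str) -> float:
    """Hypothetical helper mirroring the comparison steps of compute_reward."""
    actual = actual.strip()
    expected = expected.strip()
    if actual == expected:
        return 1.0  # exact match after trimming whitespace
    if expected in actual:
        return 0.5  # expected text present, but surrounded by extra content
    return 0.0      # no match


exact = score_file_content("579\n", "579")
partial = score_file_content("The answer is 579", "579")
miss = score_file_content("580", "579")
substring_quirk = score_file_content("1579", "579")
print(exact, partial, miss, substring_quirk)
```

One consequence of the substring check: a wrong number such as `1579` still earns partial credit for an expected `579`, which may or may not be intended for the arithmetic tasks.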
@@ -0,0 +1,120 @@
+"""
+Tool Call Parser Registry
+
+Client-side parsers that extract structured tool_calls from raw model output text.
+Used in Phase 2 (VLLM server type) where ManagedServer's /generate endpoint returns
+raw text without tool call parsing.
+
+Each parser is a standalone reimplementation of the corresponding VLLM parser's
+non-streaming extract_tool_calls() logic. No VLLM dependency -- only standard library
+(re, json, uuid) and openai types.
+
+Usage:
+    from environments.tool_call_parsers import get_parser
+
+    parser = get_parser("hermes")
+    content, tool_calls = parser.parse(raw_model_output)
+    # content = text with tool call markup stripped
+    # tool_calls = list of ChatCompletionMessageToolCall objects, or None
+"""
+
+import logging
+from abc import ABC, abstractmethod
+from typing import Dict, List, Optional, Tuple, Type
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+)
+
+logger = logging.getLogger(__name__)
+
+# Type alias for parser return value
+ParseResult = Tuple[Optional[str], Optional[List[ChatCompletionMessageToolCall]]]
+
+
+class ToolCallParser(ABC):
+    """
+    Base class for tool call parsers.
+
+    Each parser knows how to extract structured tool_calls from a specific
+    model family's raw output text format.
+    """
+
+    @abstractmethod
+    def parse(self, text: str) -> ParseResult:
+        """
+        Parse raw model output text for tool calls.
+
+        Args:
+            text: Raw decoded text from the model's completion
+
+        Returns:
+            Tuple of (content, tool_calls) where:
+            - content: text with tool call markup stripped (the message 'content' field),
+                       or None if the entire output was tool calls
+            - tool_calls: list of ChatCompletionMessageToolCall objects,
+                          or None if no tool calls were found
+        """
+        raise NotImplementedError
+
+
+# Global parser registry: name -> parser class
+PARSER_REGISTRY: Dict[str, Type[ToolCallParser]] = {}
+
+
+def register_parser(name: str):
+    """
+    Decorator to register a parser class under a given name.
+
+    Usage:
+        @register_parser("hermes")
+        class HermesToolCallParser(ToolCallParser):
+            ...
+    """
+
+    def decorator(cls: Type[ToolCallParser]) -> Type[ToolCallParser]:
+        PARSER_REGISTRY[name] = cls
+        return cls
+
+    return decorator
+
+
+def get_parser(name: str) -> ToolCallParser:
+    """
+    Get a parser instance by name.
+
+    Args:
+        name: Parser name (e.g., "hermes", "mistral", "llama3_json")
+
+    Returns:
+        Instantiated parser
+
+    Raises:
+        KeyError: If parser name is not found in registry
+    """
+    if name not in PARSER_REGISTRY:
+        available = sorted(PARSER_REGISTRY.keys())
+        raise KeyError(
+            f"Tool call parser '{name}' not found. Available parsers: {available}"
+        )
+    return PARSER_REGISTRY[name]()
+
+
+def list_parsers() -> List[str]:
+    """Return sorted list of registered parser names."""
+    return sorted(PARSER_REGISTRY.keys())
+
+
+# Import all parser modules to trigger registration via @register_parser decorators
+# Each module registers itself when imported
+from environments.tool_call_parsers.hermes_parser import HermesToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.longcat_parser import LongcatToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.mistral_parser import MistralToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.llama_parser import LlamaToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.qwen_parser import QwenToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.deepseek_v3_parser import DeepSeekV3ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.deepseek_v3_1_parser import DeepSeekV31ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.kimi_k2_parser import KimiK2ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.glm45_parser import Glm45ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.glm47_parser import Glm47ToolCallParser  # noqa: E402, F401
+from environments.tool_call_parsers.qwen3_coder_parser import Qwen3CoderToolCallParser  # noqa: E402, F401
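Review note: a stdlib-only sketch of the decorator-registry pattern used in this module, including the stacked-decorator aliasing that the DeepSeek V3.1 and Llama parsers rely on. `Parser`, `EchoParser`, `register`, and `get` are stand-in names for illustration; the real module uses `ToolCallParser`, `register_parser`, and `get_parser`.

```python
from typing import Callable, Dict, Optional, Tuple, Type


class Parser:
    """Stand-in base class for illustration."""

    def parse(self, text: str) -> Tuple[Optional[str], Optional[list]]:
        raise NotImplementedError


REGISTRY: Dict[str, Type[Parser]] = {}


def register(name: str) -> Callable[[Type[Parser]], Type[Parser]]:
    def decorator(cls: Type[Parser]) -> Type[Parser]:
        REGISTRY[name] = cls
        return cls  # returning the class unchanged allows stacking

    return decorator


@register("echo")
@register("echo_alias")  # second name for the same class
class EchoParser(Parser):
    def parse(self, text: str) -> Tuple[Optional[str], Optional[list]]:
        return text, None


def get(name: str) -> Parser:
    if name not in REGISTRY:
        raise KeyError(f"Parser '{name}' not found. Available: {sorted(REGISTRY)}")
    return REGISTRY[name]()


names = sorted(REGISTRY)
parsed = get("echo_alias").parse("hi")
print(names, parsed)
```

Registration happens at import time, which is why the real module imports every parser file at the bottom: a parser whose module is never imported never appears in the registry.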
@@ -0,0 +1,71 @@
+"""
+DeepSeek V3.1 tool call parser.
+
+Similar to V3 but with a slightly different format:
+    <｜tool▁call▁begin｜>function_name<｜tool▁sep｜>arguments<｜tool▁call▁end｜>
+
+Note: V3 puts the type before the separator and the name after it; V3.1 puts the name before and the arguments after.
+
+Based on VLLM's DeepSeekV31ToolParser.extract_tool_calls()
+"""
+
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("deepseek_v3_1")
+@register_parser("deepseek_v31")
+class DeepSeekV31ToolCallParser(ToolCallParser):
+    """
+    Parser for DeepSeek V3.1 tool calls.
+
+    Slightly different regex than V3: function_name comes before the separator,
+    arguments come after (no type field, no json code block wrapper).
+    """
+
+    START_TOKEN = "<｜tool▁calls▁begin｜>"
+
+    # Regex captures: function_name, function_arguments
+    PATTERN = re.compile(
+        r"<｜tool▁call▁begin｜>(?P<function_name>.*?)<｜tool▁sep｜>(?P<function_arguments>.*?)<｜tool▁call▁end｜>"
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        if self.START_TOKEN not in text:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                func_name, func_args = match
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(
+                            name=func_name.strip(),
+                            arguments=func_args.strip(),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            content = text[: text.find(self.START_TOKEN)].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
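Review note: a stdlib-only sketch of the V3.1 extraction, returning plain tuples instead of `ChatCompletionMessageToolCall` objects so it runs without the `openai` package. The sample string and the `extract` helper are made up for illustration.

```python
import re

START = "<｜tool▁calls▁begin｜>"
PATTERN = re.compile(
    r"<｜tool▁call▁begin｜>(?P<name>.*?)<｜tool▁sep｜>(?P<args>.*?)<｜tool▁call▁end｜>"
)


def extract(text):
    """Return (content, [(function_name, raw_arguments), ...])."""
    if START not in text:
        return text, []
    calls = [(name.strip(), args.strip()) for name, args in PATTERN.findall(text)]
    if not calls:
        return text, []
    # Everything before the tool-calls section is ordinary message content.
    content = text[: text.find(START)].strip()
    return (content or None), calls


sample = (
    "Checking the weather."
    "<｜tool▁calls▁begin｜>"
    '<｜tool▁call▁begin｜>get_weather<｜tool▁sep｜>{"city": "Paris"}<｜tool▁call▁end｜>'
    "<｜tool▁calls▁end｜>"
)
content, calls = extract(sample)
print(content, calls)
```

The pattern is compiled without `re.DOTALL`, as in the parser above, so arguments that span multiple lines would not match.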
@@ -0,0 +1,75 @@
+"""
+DeepSeek V3 tool call parser.
+
+Format uses special unicode tokens:
+    <｜tool▁calls▁begin｜>
+    <｜tool▁call▁begin｜>type<｜tool▁sep｜>function_name
+    ```json
+    {"arg": "value"}
+    ```
+    <｜tool▁call▁end｜>
+    <｜tool▁calls▁end｜>
+
+Based on VLLM's DeepSeekV3ToolParser.extract_tool_calls()
+"""
+
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("deepseek_v3")
+class DeepSeekV3ToolCallParser(ToolCallParser):
+    """
+    Parser for DeepSeek V3 tool calls.
+
+    Uses special unicode tokens with fullwidth vertical bars (｜) and lower-block characters (▁).
+    Extracts type, function name, and JSON arguments from the structured format.
+    """
+
+    START_TOKEN = "<｜tool▁calls▁begin｜>"
+
+    # Regex captures: type, function_name, function_arguments
+    PATTERN = re.compile(
+        r"<｜tool▁call▁begin｜>(?P<type>.*)<｜tool▁sep｜>(?P<function_name>.*)\n```json\n(?P<function_arguments>.*)\n```<｜tool▁call▁end｜>"
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        if self.START_TOKEN not in text:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                tc_type, func_name, func_args = match
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(
+                            name=func_name.strip(),
+                            arguments=func_args.strip(),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            # Content is everything before the tool calls section
+            content = text[: text.find(self.START_TOKEN)].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
@@ -0,0 +1,109 @@
+"""
+GLM 4.5 (GLM-4-MoE) tool call parser.
+
+Format uses custom arg_key/arg_value tags rather than standard JSON:
+    <tool_call>function_name
+    <arg_key>param1</arg_key><arg_value>value1</arg_value>
+    <arg_key>param2</arg_key><arg_value>value2</arg_value>
+    </tool_call>
+
+Values are deserialized using json.loads -> ast.literal_eval -> raw string fallback.
+
+Based on VLLM's Glm4MoeModelToolParser.extract_tool_calls()
+"""
+
+import ast
+import json
+import re
+import uuid
+from typing import Any, Dict, List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+def _deserialize_value(value: str) -> Any:
+    """
+    Try to deserialize a string value to its native Python type.
+    Attempts json.loads, then ast.literal_eval, then returns raw string.
+    """
+    try:
+        return json.loads(value)
+    except (json.JSONDecodeError, TypeError):
+        pass
+
+    try:
+        return ast.literal_eval(value)
+    except (ValueError, SyntaxError, TypeError):
+        pass
+
+    return value
+
+
+@register_parser("glm45")
+class Glm45ToolCallParser(ToolCallParser):
+    """
+    Parser for GLM 4.5 (GLM-4-MoE) tool calls.
+
+    Uses <tool_call>...</tool_call> tags with <arg_key>/<arg_value> pairs
+    instead of standard JSON arguments.
+    """
+
+    FUNC_CALL_REGEX = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
+    FUNC_DETAIL_REGEX = re.compile(r"<tool_call>([^\n]*)\n(.*)</tool_call>", re.DOTALL)
+    FUNC_ARG_REGEX = re.compile(
+        r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
+    )
+
+    START_TOKEN = "<tool_call>"
+
+    def parse(self, text: str) -> ParseResult:
+        if self.START_TOKEN not in text:
+            return text, None
+
+        try:
+            matched_calls = self.FUNC_CALL_REGEX.findall(text)
+            if not matched_calls:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+
+            for match in matched_calls:
+                detail = self.FUNC_DETAIL_REGEX.search(match)
+                if not detail:
+                    continue
+
+                func_name = detail.group(1).strip()
+                func_args_raw = detail.group(2)
+
+                # Parse arg_key/arg_value pairs
+                pairs = self.FUNC_ARG_REGEX.findall(func_args_raw) if func_args_raw else []
+                arg_dict: Dict[str, Any] = {}
+                for key, value in pairs:
+                    arg_key = key.strip()
+                    arg_val = _deserialize_value(value.strip())
+                    arg_dict[arg_key] = arg_val
+
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(
+                            name=func_name,
+                            arguments=json.dumps(arg_dict, ensure_ascii=False),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            content = text[: text.find(self.START_TOKEN)].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
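Review note: a stdlib-only sketch of the `arg_key`/`arg_value` handling and the `json.loads` -> `ast.literal_eval` -> raw string fallback. It collapses the parser's two-stage regex into one simplified pattern (`CALL`), so it illustrates the idea without reproducing the code above; the sample input is made up.

```python
import ast
import json
import re

# Simplified single-pass pattern: function name on the first line, args after.
CALL = re.compile(r"<tool_call>([^\n]*)\n(.*?)</tool_call>", re.DOTALL)
ARG = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)


def deserialize(value: str):
    """Try JSON first, then a Python literal, then fall back to the raw string."""
    try:
        return json.loads(value)
    except (json.JSONDecodeError, TypeError):
        pass
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        pass
    return value


def extract(text: str):
    calls = []
    for name, body in CALL.findall(text):
        args = {k.strip(): deserialize(v.strip()) for k, v in ARG.findall(body)}
        calls.append((name.strip(), args))
    return calls


sample = (
    "<tool_call>search\n"
    "<arg_key>query</arg_key><arg_value>modal sandboxes</arg_value>\n"
    "<arg_key>limit</arg_key><arg_value>5</arg_value>\n"
    "<arg_key>strict</arg_key><arg_value>True</arg_value>\n"
    "</tool_call>"
)
calls = extract(sample)
print(calls)
```

Here `5` is decoded by `json.loads`, `True` is not valid JSON and falls through to `ast.literal_eval`, and the free-text query fails both and stays a string.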
@@ -0,0 +1,35 @@
+"""
+GLM 4.7 tool call parser.
+
+Same as GLM 4.5 but with slightly different regex patterns.
+The tool_call tags may wrap differently and arg parsing handles
+newlines between key/value pairs.
+
+Based on VLLM's Glm47MoeModelToolParser (extends Glm4MoeModelToolParser).
+"""
+
+import re
+
+from environments.tool_call_parsers import ParseResult, register_parser
+from environments.tool_call_parsers.glm45_parser import Glm45ToolCallParser
+
+
+@register_parser("glm47")
+class Glm47ToolCallParser(Glm45ToolCallParser):
+    """
+    Parser for GLM 4.7 tool calls.
+    Extends GLM 4.5 with updated regex patterns.
+    """
+
+    def __init__(self):
+        super().__init__()
+        # GLM 4.7 uses a slightly different detail regex that includes
+        # the <tool_call> wrapper and optional arg_key content
+        self.FUNC_DETAIL_REGEX = re.compile(
+            r"<tool_call>(.*?)(<arg_key>.*?)?</tool_call>", re.DOTALL
+        )
+        # GLM 4.7 tolerates whitespace or literal "\n" escapes (backslash + n) between the tags
+        self.FUNC_ARG_REGEX = re.compile(
+            r"<arg_key>(.*?)</arg_key>(?:\\n|\s)*<arg_value>(.*?)</arg_value>",
+            re.DOTALL,
+        )
@@ -0,0 +1,73 @@
+"""
+Hermes tool call parser.
+
+Format: <tool_call>{"name": "func", "arguments": {...}}</tool_call>
+Based on VLLM's Hermes2ProToolParser.extract_tool_calls()
+"""
+
+import json
+import re
+import uuid
+from typing import List, Optional, Tuple
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("hermes")
+class HermesToolCallParser(ToolCallParser):
+    """
+    Parser for Hermes-format tool calls.
+
+    Matches <tool_call>...</tool_call> tags containing JSON with "name" and "arguments".
+    Also handles unclosed <tool_call> at end-of-string (truncated generation).
+    """
+
+    # Matches both closed and unclosed tool_call tags
+    PATTERN = re.compile(
+        r"<tool_call>\s*(.*?)\s*</tool_call>|<tool_call>\s*(.*)", re.DOTALL
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        if "<tool_call>" not in text:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                # match is a tuple: (closed_content, unclosed_content)
+                raw_json = match[0] if match[0] else match[1]
+                if not raw_json.strip():
+                    continue
+
+                tc_data = json.loads(raw_json)
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(
+                            name=tc_data["name"],
+                            arguments=json.dumps(
+                                tc_data.get("arguments", {}), ensure_ascii=False
+                            ),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            # Content is everything before the first <tool_call> tag
+            content = text[: text.find("<tool_call>")].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
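Review note: a stdlib-only sketch showing how the two-branch pattern handles both a closed `<tool_call>` block and a truncated one with no closing tag. It returns plain tuples rather than `ChatCompletionMessageToolCall` objects; the `extract` helper and the sample strings are made up for illustration.

```python
import json
import re

PATTERN = re.compile(
    r"<tool_call>\s*(.*?)\s*</tool_call>|<tool_call>\s*(.*)", re.DOTALL
)


def extract(text: str):
    """Return (content, [(name, arguments_dict), ...])."""
    if "<tool_call>" not in text:
        return text, []
    calls = []
    for closed, unclosed in PATTERN.findall(text):
        raw = closed or unclosed  # exactly one branch captured something
        if not raw.strip():
            continue
        data = json.loads(raw)
        calls.append((data["name"], data.get("arguments", {})))
    content = text[: text.find("<tool_call>")].strip() or None
    return content, calls


closed_text = (
    "Listing files.\n"
    '<tool_call>\n{"name": "terminal", "arguments": {"command": "ls"}}\n</tool_call>'
)
truncated_text = '<tool_call>{"name": "terminal", "arguments": {"command": "pwd"}}'

closed_result = extract(closed_text)
truncated_result = extract(truncated_text)
print(closed_result)
print(truncated_result)
```

The unclosed branch only helps when generation stopped after the JSON was complete; a call cut off mid-JSON still fails in `json.loads`, and the real parser then returns the original text with no tool calls.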
@@ -0,0 +1,93 @@
+"""
+Kimi K2 tool call parser.
+
+Format:
+    <|tool_calls_section_begin|>
+    <|tool_call_begin|>function_id:0<|tool_call_argument_begin|>{"arg": "val"}<|tool_call_end|>
+    <|tool_calls_section_end|>
+
+The function_id format is typically "functions.func_name:index" or "func_name:index".
+
+Based on VLLM's KimiK2ToolParser.extract_tool_calls()
+"""
+
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("kimi_k2")
+class KimiK2ToolCallParser(ToolCallParser):
+    """
+    Parser for Kimi K2 tool calls.
+
+    Uses section begin/end tokens wrapping individual tool call begin/end tokens.
+    The tool_call_id contains the function name (after last dot, before colon).
+    """
+
+    # Support both singular and plural variants
+    START_TOKENS = [
+        "<|tool_calls_section_begin|>",
+        "<|tool_call_section_begin|>",
+    ]
+
+    # Regex captures: tool_call_id (e.g., "functions.get_weather:0"), function_arguments
+    PATTERN = re.compile(
+        r"<\|tool_call_begin\|>\s*(?P<tool_call_id>[^<]+:\d+)\s*"
+        r"<\|tool_call_argument_begin\|>\s*"
+        r"(?P<function_arguments>(?:(?!<\|tool_call_begin\|>).)*?)\s*"
+        r"<\|tool_call_end\|>",
+        re.DOTALL,
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        # Check for any variant of the start token
+        has_start = any(token in text for token in self.START_TOKENS)
+        if not has_start:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                function_id, function_args = match
+
+                # Extract function name from ID format: "functions.get_weather:0" -> "get_weather"
+                function_name = function_id.split(":")[0].split(".")[-1]
+
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=function_id,  # Preserve the original ID format
+                        type="function",
+                        function=Function(
+                            name=function_name,
+                            arguments=function_args.strip(),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            # Content is everything before the tool calls section
+            earliest_start = len(text)
+            for token in self.START_TOKENS:
+                idx = text.find(token)
+                if idx >= 0 and idx < earliest_start:
+                    earliest_start = idx
+
+            content = text[:earliest_start].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
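Review note: a stdlib-only sketch of how the function name is derived from the Kimi K2 call ID, using the same pattern as the parser above. The `extract` helper and the sample section are made up for illustration.

```python
import re

PATTERN = re.compile(
    r"<\|tool_call_begin\|>\s*(?P<tool_call_id>[^<]+:\d+)\s*"
    r"<\|tool_call_argument_begin\|>\s*"
    r"(?P<function_arguments>(?:(?!<\|tool_call_begin\|>).)*?)\s*"
    r"<\|tool_call_end\|>",
    re.DOTALL,
)


def extract(text: str):
    """Return [(call_id, function_name, raw_arguments), ...]."""
    results = []
    for call_id, args in PATTERN.findall(text):
        # "functions.get_weather:0" -> "functions.get_weather" -> "get_weather"
        name = call_id.split(":")[0].split(".")[-1]
        results.append((call_id, name, args.strip()))
    return results


sample = (
    "<|tool_calls_section_begin|>"
    '<|tool_call_begin|>functions.get_weather:0<|tool_call_argument_begin|>{"city": "Oslo"}<|tool_call_end|>'
    '<|tool_call_begin|>read_file:1<|tool_call_argument_begin|>{"path": "a.txt"}<|tool_call_end|>'
    "<|tool_calls_section_end|>"
)
calls = extract(sample)
print(calls)
```

Both ID styles mentioned in the module docstring resolve to a bare function name, while the original ID is kept so it can be echoed back in the tool response.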
@@ -0,0 +1,96 @@
+"""
+Llama 3.x / 4 tool call parser.
+
+Format: The model outputs JSON objects with "name" and "arguments" (or "parameters") keys.
+May be preceded by <|python_tag|> token. Supports multiple JSON objects separated
+by content or semicolons.
+
+Based on VLLM's Llama3JsonToolParser.extract_tool_calls()
+"""
+
+import json
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("llama3_json")
+@register_parser("llama4_json")
+class LlamaToolCallParser(ToolCallParser):
+    """
+    Parser for Llama 3.x and 4 JSON-format tool calls.
+
+    Finds JSON objects containing "name" + ("arguments" or "parameters") keys.
+    Uses Python's json.JSONDecoder.raw_decode for robust extraction of
+    JSON objects from mixed text.
+    """
+
+    BOT_TOKEN = "<|python_tag|>"
+
+    # Regex to find the start of potential JSON objects
+    JSON_START = re.compile(r"\{")
+
+    def parse(self, text: str) -> ParseResult:
+        # Quick check: need either the bot token or a JSON brace
+        if self.BOT_TOKEN not in text and "{" not in text:
+            return text, None
+
+        try:
+            decoder = json.JSONDecoder()
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            end_index = -1  # Track where the last parsed JSON ended
+
+            for match in self.JSON_START.finditer(text):
+                start = match.start()
+                # Skip braces inside an already-parsed object (end_index is exclusive)
+                if start < end_index:
+                    continue
+
+                try:
+                    obj, json_end = decoder.raw_decode(text[start:])
+                    end_index = start + json_end
+
+                    # Must have "name" and either "arguments" or "parameters"
+                    name = obj.get("name")
+                    args = obj.get("arguments", obj.get("parameters"))
+
+                    if not name or args is None:
+                        continue
+
+                    # Normalize arguments to JSON string
+                    if isinstance(args, dict):
+                        args = json.dumps(args, ensure_ascii=False)
+                    elif not isinstance(args, str):
+                        args = json.dumps(args, ensure_ascii=False)
+
+                    tool_calls.append(
+                        ChatCompletionMessageToolCall(
+                            id=f"call_{uuid.uuid4().hex[:8]}",
+                            type="function",
+                            function=Function(name=name, arguments=args),
+                        )
+                    )
+                except (json.JSONDecodeError, KeyError, ValueError):
+                    continue
+
+            if not tool_calls:
+                return text, None
+
+            # Content is everything before the first tool call JSON
+            # Find where the first tool call starts in the text
+            first_tc_start = text.find("{")
+            if self.BOT_TOKEN in text:
+                first_tc_start = text.find(self.BOT_TOKEN)
+            content = text[:first_tc_start].strip() if first_tc_start > 0 else None
+
+            return content, tool_calls
+
+        except Exception:
+            return text, None
@@ -0,0 +1,69 @@
+"""
+Longcat Flash Chat tool call parser.
+
+Same as Hermes but uses <longcat_tool_call> tags instead of <tool_call>.
+Based on vLLM's LongcatFlashToolParser (extends Hermes2ProToolParser).
+"""
+
+import json
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+@register_parser("longcat")
+class LongcatToolCallParser(ToolCallParser):
+    """
+    Parser for Longcat Flash Chat tool calls.
+    Identical logic to Hermes, just different tag names.
+    """
+
+    PATTERN = re.compile(
+        r"<longcat_tool_call>\s*(.*?)\s*</longcat_tool_call>|<longcat_tool_call>\s*(.*)",
+        re.DOTALL,
+    )
+
+    def parse(self, text: str) -> ParseResult:
+        if "<longcat_tool_call>" not in text:
+            return text, None
+
+        try:
+            matches = self.PATTERN.findall(text)
+            if not matches:
+                return text, None
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for match in matches:
+                raw_json = match[0] if match[0] else match[1]
+                if not raw_json.strip():
+                    continue
+
+                tc_data = json.loads(raw_json)
+                tool_calls.append(
+                    ChatCompletionMessageToolCall(
+                        id=f"call_{uuid.uuid4().hex[:8]}",
+                        type="function",
+                        function=Function(
+                            name=tc_data["name"],
+                            arguments=json.dumps(
+                                tc_data.get("arguments", {}), ensure_ascii=False
+                            ),
+                        ),
+                    )
+                )
+
+            if not tool_calls:
+                return text, None
+
+            content = text[: text.find("<longcat_tool_call>")].strip()
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
@@ -0,0 +1,130 @@
+"""
+Mistral tool call parser.
+
+Supports two formats depending on tokenizer version:
+- Pre-v11: content[TOOL_CALLS] [{"name": ..., "arguments": {...}}, ...]
+- v11+:    content[TOOL_CALLS]tool_name1{"arg": "val"}[TOOL_CALLS]tool_name2{"arg": "val"}
+
+Based on vLLM's MistralToolParser.extract_tool_calls()
+The [TOOL_CALLS] token is the bot_token used by Mistral models.
+"""
+
+import json
+import re
+import uuid
+from typing import List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+def _generate_mistral_id() -> str:
+    """Mistral tool call IDs are 9-char alphanumeric strings."""
+    import random
+    import string
+
+    return "".join(random.choices(string.ascii_letters + string.digits, k=9))
+
+
+@register_parser("mistral")
+class MistralToolCallParser(ToolCallParser):
+    """
+    Parser for Mistral-format tool calls.
+
+    Detects the format from the text after the first [TOOL_CALLS]: a leading '['
+    or '{' means pre-v11 JSON (array or single object); anything else is v11+.
+    """
+
+    # The [TOOL_CALLS] token -- may appear as different strings depending on tokenizer
+    BOT_TOKEN = "[TOOL_CALLS]"
+
+    # Fallback regex for pre-v11 format when JSON parsing fails
+    TOOL_CALL_REGEX = re.compile(r"\[?\s*(\{.*?\})\s*\]?", re.DOTALL)
+
+    def parse(self, text: str) -> ParseResult:
+        if self.BOT_TOKEN not in text:
+            return text, None
+
+        try:
+            parts = text.split(self.BOT_TOKEN)
+            content = parts[0].strip()
+            raw_tool_calls = parts[1:]
+
+            # Detect format: if the first raw part starts with '[' or '{', it's pre-v11
+            first_raw = raw_tool_calls[0].strip() if raw_tool_calls else ""
+            is_pre_v11 = first_raw.startswith("[") or first_raw.startswith("{")
+
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+
+            if not is_pre_v11:
+                # v11+ format: [TOOL_CALLS]tool_name{args}[TOOL_CALLS]tool_name2{args2}
+                for raw in raw_tool_calls:
+                    raw = raw.strip()
+                    if not raw or "{" not in raw:
+                        continue
+
+                    brace_idx = raw.find("{")
+                    tool_name = raw[:brace_idx].strip()
+                    args_str = raw[brace_idx:]
+
+                    tool_calls.append(
+                        ChatCompletionMessageToolCall(
+                            id=_generate_mistral_id(),
+                            type="function",
+                            function=Function(name=tool_name, arguments=args_str),
+                        )
+                    )
+            else:
+                # Pre-v11 format: [TOOL_CALLS] [{"name": ..., "arguments": {...}}]
+                try:
+                    parsed = json.loads(first_raw)
+                    if isinstance(parsed, dict):
+                        parsed = [parsed]
+
+                    for tc in parsed:
+                        args = tc.get("arguments", {})
+                        if isinstance(args, dict):
+                            args = json.dumps(args, ensure_ascii=False)
+
+                        tool_calls.append(
+                            ChatCompletionMessageToolCall(
+                                id=_generate_mistral_id(),
+                                type="function",
+                                function=Function(
+                                    name=tc["name"], arguments=args
+                                ),
+                            )
+                        )
+                except json.JSONDecodeError:
+                    # Fallback regex extraction
+                    match = self.TOOL_CALL_REGEX.findall(first_raw)
+                    if match:
+                        for raw_json in match:
+                            try:
+                                tc = json.loads(raw_json)
+                                args = tc.get("arguments", {})
+                                if isinstance(args, dict):
+                                    args = json.dumps(args, ensure_ascii=False)
+                                tool_calls.append(
+                                    ChatCompletionMessageToolCall(
+                                        id=_generate_mistral_id(),
+                                        type="function",
+                                        function=Function(
+                                            name=tc["name"], arguments=args
+                                        ),
+                                    )
+                                )
+                            except (json.JSONDecodeError, KeyError):
+                                continue
+
+            if not tool_calls:
+                return text, None
+
+            return content if content else None, tool_calls
+
+        except Exception:
+            return text, None
@@ -0,0 +1,163 @@
+"""
+Qwen3-Coder tool call parser.
+
+Format uses XML-style nested tags:
+    <tool_call>
+    <function=function_name>
+    <parameter=param_name>value</parameter>
+    <parameter=param_name2>value2</parameter>
+    </function>
+    </tool_call>
+
+Parameters are extracted from <parameter=name>value</parameter> tags and
+type-converted heuristically (JSON, then Python literal), else kept as strings.
+
+Based on vLLM's Qwen3CoderToolParser.extract_tool_calls()
+"""
+
+import ast
+import json
+import re
+import uuid
+from typing import Any, Dict, List, Optional
+
+from openai.types.chat.chat_completion_message_tool_call import (
+    ChatCompletionMessageToolCall,
+    Function,
+)
+
+from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
+
+
+def _try_convert_value(value: str) -> Any:
+    """
+    Try to convert a parameter value string to a native Python type.
+    Handles null, numbers, booleans, JSON objects/arrays, and falls back to string.
+    """
+    stripped = value.strip()
+
+    # Handle null
+    if stripped.lower() == "null":
+        return None
+
+    # Try JSON first (handles objects, arrays, strings, numbers, booleans)
+    try:
+        return json.loads(stripped)
+    except (json.JSONDecodeError, TypeError):
+        pass
+
+    # Try Python literal eval (handles tuples, etc.)
+    try:
+        return ast.literal_eval(stripped)
+    except (ValueError, SyntaxError, TypeError):
+        pass
+
+    # Return as string
+    return stripped
+
+
+@register_parser("qwen3_coder")
+class Qwen3CoderToolCallParser(ToolCallParser):
+    """
+    Parser for Qwen3-Coder XML-format tool calls.
+
+    Uses nested XML tags: <tool_call><function=name><parameter=key>val</parameter></function></tool_call>
+    """
+
+    START_TOKEN = "<tool_call>"
+    FUNCTION_PREFIX = "<function="
+
+    # Find complete tool_call blocks (or unclosed at end)
+    TOOL_CALL_REGEX = re.compile(
+        r"<tool_call>(.*?)</tool_call>|<tool_call>(.*?)$", re.DOTALL
+    )
+
+    # Find function blocks within a tool_call
+    FUNCTION_REGEX = re.compile(
+        r"<function=(.*?)</function>|<function=(.*)$", re.DOTALL
+    )
+
+    # Find parameter blocks within a function
+    PARAMETER_REGEX = re.compile(
+        r"<parameter=(.*?)(?:</parameter>|(?=<parameter=)|(?=</function>)|$)",
+        re.DOTALL,
+    )
+
+    def _parse_function_call(self, function_str: str) -> Optional[ChatCompletionMessageToolCall]:
+        """Parse a single <function=name>...</function> block into a ToolCall."""
+        try:
+            # Extract function name: everything before the first '>'
+            gt_idx = function_str.index(">")
+            func_name = function_str[:gt_idx].strip()
+            params_str = function_str[gt_idx + 1:]
+
+            # Extract parameters
+            param_dict: Dict[str, Any] = {}
+            for match_text in self.PARAMETER_REGEX.findall(params_str):
+                if ">" not in match_text:
+                    continue
+                name_end = match_text.index(">")
+                param_name = match_text[:name_end].strip()
+                param_value = match_text[name_end + 1:]
+
+                # Clean up whitespace
+                if param_value.startswith("\n"):
+                    param_value = param_value[1:]
+                if param_value.endswith("\n"):
+                    param_value = param_value[:-1]
+
+                param_dict[param_name] = _try_convert_value(param_value)
+
+            return ChatCompletionMessageToolCall(
+                id=f"call_{uuid.uuid4().hex[:24]}",
+                type="function",
+                function=Function(
+                    name=func_name,
+                    arguments=json.dumps(param_dict, ensure_ascii=False),
+                ),
+            )
+        except (ValueError, IndexError):
+            return None
+
+    def parse(self, text: str) -> ParseResult:
+        if self.FUNCTION_PREFIX not in text:
+            return text, None
+
+        try:
+            # Find all tool_call blocks
+            tc_matches = self.TOOL_CALL_REGEX.findall(text)
+            raw_blocks = [m[0] if m[0] else m[1] for m in tc_matches]
+
+            # Fallback: if no tool_call tags, try the whole text
+            if not raw_blocks:
+                raw_blocks = [text]
+
+            # Find function blocks within each tool_call
+            function_strs: List[str] = []
+            for block in raw_blocks:
+                func_matches = self.FUNCTION_REGEX.findall(block)
+                function_strs.extend(m[0] if m[0] else m[1] for m in func_matches)
+
+            if not function_strs:
+                return text, None
+
+            # Parse each function call
+            tool_calls: List[ChatCompletionMessageToolCall] = []
+            for func_str in function_strs:
+                tc = self._parse_function_call(func_str)
+                if tc is not None:
+                    tool_calls.append(tc)
+
+            if not tool_calls:
+                return text, None
+
+            # Content before tool calls
+            first_tc = text.find(self.START_TOKEN)
+            if first_tc < 0:
+                first_tc = text.find(self.FUNCTION_PREFIX)
+            content = text[:first_tc].strip() if first_tc > 0 else None
+
+            return content, tool_calls
+
+        except Exception:
+            return text, None
@@ -0,0 +1,19 @@
+"""
+Qwen 2.5 tool call parser.
+
+Uses the same <tool_call> format as Hermes.
+Registered as a separate parser name for clarity when using --tool-parser=qwen.
+"""
+
+from environments.tool_call_parsers import register_parser
+from environments.tool_call_parsers.hermes_parser import HermesToolCallParser
+
+
+@register_parser("qwen")
+class QwenToolCallParser(HermesToolCallParser):
+    """
+    Parser for Qwen 2.5 tool calls.
+    Same <tool_call>{"name": ..., "arguments": ...}</tool_call> format as Hermes.
+    """
+
+    pass  # Identical format -- inherits everything from Hermes
@@ -0,0 +1,473 @@
+"""
+ToolContext -- Unrestricted Tool Access for Reward Functions
+
+A per-rollout handle that gives reward/verification functions direct access to
+ALL hermes-agent tools, scoped to the rollout's task_id. The same task_id means
+the terminal/browser session is the SAME one the model used during its rollout --
+all state (files, processes, browser tabs) is preserved.
+
+The verifier author decides which tools to use. Nothing is hardcoded or gated.
+
+Example usage in a compute_reward():
+    async def compute_reward(self, item, result, ctx):
+        # Run tests in the model's terminal sandbox
+        test = ctx.terminal("pytest -v")
+        if test["exit_code"] == 0:
+            return 1.0
+
+        # Check if a file was created
+        content = ctx.read_file("/workspace/solution.py")
+        if content.get("content"):
+            return 0.5
+
+        return 0.0
+"""
+
+import json
+import logging
+import os
+from typing import Any, Dict, List, Optional
+
+import asyncio
+import concurrent.futures
+
+from model_tools import handle_function_call
+from tools.terminal_tool import cleanup_vm
+from tools.browser_tool import cleanup_browser
+
+logger = logging.getLogger(__name__)
+
+# Thread pool for running sync tool calls that internally use asyncio.run()
+_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
+
+
+def _run_tool_in_thread(tool_name: str, arguments: Dict[str, Any], task_id: str) -> str:
+    """
+    Run a tool call in a thread pool executor so backends that use asyncio.run()
+    internally (modal, docker) get a clean event loop.
+
+    If we're already in an async context, submits the call to the shared thread
+    pool and blocks until it returns. Otherwise (sync caller), runs directly.
+    """
+    try:
+        asyncio.get_running_loop()
+    except RuntimeError:
+        # No running event loop -- safe to call directly
+        return handle_function_call(tool_name, arguments, task_id)
+
+    # Inside a running event loop: use a worker thread so asyncio.run() gets a
+    # clean loop. Kept outside the try so a RuntimeError from the tool is not retried.
+    future = _tool_executor.submit(
+        handle_function_call, tool_name, arguments, task_id
+    )
+    return future.result(timeout=300)
+
+
+class ToolContext:
+    """
+    Open-ended access to all hermes-agent tools for a specific rollout.
+
+    Passed to compute_reward() so verifiers can use any tool they need:
+    terminal commands, file reads/writes, web searches, browser automation, etc.
+    All calls share the rollout's task_id for session isolation.
+    """
+
+    def __init__(self, task_id: str):
+        self.task_id = task_id
+
+    # -------------------------------------------------------------------------
+    # Terminal tools
+    # -------------------------------------------------------------------------
+
+    def terminal(self, command: str, timeout: int = 180) -> Dict[str, Any]:
+        """
+        Run a command in the rollout's terminal session.
+
+        Args:
+            command: Shell command to execute
+            timeout: Command timeout in seconds
+
+        Returns:
+            Dict with 'exit_code' (int) and 'output' (str)
+        """
+        import os
+        backend = os.getenv("TERMINAL_ENV", "local")
+        logger.debug("ToolContext.terminal [%s backend] task=%s: %s", backend, self.task_id[:8], command[:100])
+
+        # Run in thread pool so modal/docker backends' asyncio.run() doesn't deadlock
+        result = _run_tool_in_thread(
+            "terminal",
+            {"command": command, "timeout": timeout},
+            self.task_id,
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"exit_code": -1, "output": result}
+
+    # -------------------------------------------------------------------------
+    # File tools
+    # -------------------------------------------------------------------------
+
+    def read_file(self, path: str) -> Dict[str, Any]:
+        """
+        Read a file from the rollout's filesystem.
+
+        Args:
+            path: File path to read
+
+        Returns:
+            Dict with file content or error
+        """
+        result = handle_function_call(
+            "read_file", {"path": path}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    def write_file(self, path: str, content: str) -> Dict[str, Any]:
+        """
+        Write a TEXT file in the rollout's filesystem.
+
+        Uses a shell heredoc under the hood, so this is only safe for text content.
+        For binary files (images, compiled artifacts, etc.), use upload_file() instead.
+
+        Args:
+            path: File path to write
+            content: Text content to write
+
+        Returns:
+            Dict with success status or error
+        """
+        result = handle_function_call(
+            "write_file", {"path": path, "content": content}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    def upload_file(self, local_path: str, remote_path: str) -> Dict[str, Any]:
+        """
+        Upload a local file to the rollout's sandbox (binary-safe).
+
+        Unlike write_file() which passes content through a shell heredoc (text-only),
+        this method base64-encodes the file and decodes it inside the sandbox.
+        Safe for any file type: binaries, images, archives, etc.
+
+        When the base64 payload exceeds ~60KB, it is written in chunks to avoid
+        hitting shell command-length limits.
+
+        Args:
+            local_path: Path to a local file on the host
+            remote_path: Destination path inside the sandbox
+
+        Returns:
+            Dict with 'exit_code' and 'output'
+        """
+        import base64
+        from pathlib import Path as _Path
+
+        local = _Path(local_path)
+        if not local.exists():
+            return {"exit_code": -1, "output": f"Local file not found: {local_path}"}
+
+        raw = local.read_bytes()
+        b64 = base64.b64encode(raw).decode("ascii")
+
+        # Ensure parent directory exists in the sandbox
+        parent = str(_Path(remote_path).parent)
+        if parent not in (".", "/"):
+            self.terminal(f"mkdir -p {parent}", timeout=10)
+
+        # For small files, single command is fine
+        chunk_size = 60_000  # ~60KB per chunk (well within shell limits)
+        if len(b64) <= chunk_size:
+            result = self.terminal(
+                f"printf '%s' '{b64}' | base64 -d > {remote_path}",
+                timeout=30,
+            )
+        else:
+            # For larger files, write base64 in chunks then decode
+            tmp_b64 = "/tmp/_hermes_upload.b64"
+            self.terminal(f": > {tmp_b64}", timeout=5)  # truncate
+            for i in range(0, len(b64), chunk_size):
+                chunk = b64[i : i + chunk_size]
+                self.terminal(f"printf '%s' '{chunk}' >> {tmp_b64}", timeout=15)
+            result = self.terminal(
+                f"base64 -d {tmp_b64} > {remote_path} && rm -f {tmp_b64}",
+                timeout=30,
+            )
+
+        return result
+
+    def upload_dir(self, local_dir: str, remote_dir: str) -> List[Dict[str, Any]]:
+        """
+        Upload an entire local directory to the rollout's sandbox (binary-safe).
+
+        Recursively uploads all files, preserving directory structure.
+
+        Args:
+            local_dir: Path to a local directory on the host
+            remote_dir: Destination directory inside the sandbox
+
+        Returns:
+            List of results, one per file uploaded
+        """
+        from pathlib import Path as _Path
+
+        local = _Path(local_dir)
+        if not local.exists() or not local.is_dir():
+            return [{"exit_code": -1, "output": f"Local directory not found: {local_dir}"}]
+
+        results = []
+        for file_path in sorted(local.rglob("*")):
+            if file_path.is_file():
+                relative = file_path.relative_to(local)
+                target = f"{remote_dir}/{relative}"
+                results.append(self.upload_file(str(file_path), target))
+        return results
+
+    def download_file(self, remote_path: str, local_path: str) -> Dict[str, Any]:
+        """
+        Download a file from the rollout's sandbox to the host (binary-safe).
+
+        The inverse of upload_file(). Base64-encodes the file inside the sandbox,
+        reads the encoded data through the terminal, and decodes it locally.
+        Safe for any file type.
+
+        Args:
+            remote_path: Path to the file inside the sandbox
+            local_path: Destination path on the host
+
+        Returns:
+            Dict with 'success' (bool) and 'bytes' (int) or 'error' (str)
+        """
+        import base64
+        from pathlib import Path as _Path
+
+        # Base64-encode the file inside the sandbox and capture output
+        result = self.terminal(
+            f"base64 {remote_path} 2>/dev/null",
+            timeout=30,
+        )
+
+        if result.get("exit_code", -1) != 0:
+            return {
+                "success": False,
+                "error": f"Failed to read remote file: {result.get('output', '')}",
+            }
+
+        b64_data = result.get("output", "").strip()
+        if not b64_data:
+            return {"success": False, "error": f"Remote file is empty or missing: {remote_path}"}
+
+        try:
+            raw = base64.b64decode(b64_data)
+        except Exception as e:
+            return {"success": False, "error": f"Base64 decode failed: {e}"}
+
+        # Write to local host filesystem
+        local = _Path(local_path)
+        local.parent.mkdir(parents=True, exist_ok=True)
+        local.write_bytes(raw)
+
+        return {"success": True, "bytes": len(raw)}
+
+    def download_dir(self, remote_dir: str, local_dir: str) -> List[Dict[str, Any]]:
+        """
+        Download a directory from the rollout's sandbox to the host (binary-safe).
+
+        Lists all files in the remote directory, then downloads each one.
+        Preserves directory structure.
+
+        Args:
+            remote_dir: Path to the directory inside the sandbox
+            local_dir: Destination directory on the host
+
+        Returns:
+            List of results, one per file downloaded
+        """
+        from pathlib import Path as _Path
+
+        # List files in the remote directory
+        ls_result = self.terminal(
+            f"find {remote_dir} -type f 2>/dev/null",
+            timeout=15,
+        )
+
+        if ls_result.get("exit_code", -1) != 0:
+            return [{"success": False, "error": f"Failed to list remote dir: {remote_dir}"}]
+
+        file_list = ls_result.get("output", "").strip()
+        if not file_list:
+            return [{"success": False, "error": f"Remote directory is empty or missing: {remote_dir}"}]
+
+        results = []
+        for remote_file in file_list.splitlines():
+            remote_file = remote_file.strip()
+            if not remote_file:
+                continue
+            # Compute the relative path to preserve directory structure
+            if remote_file.startswith(remote_dir):
+                relative = remote_file[len(remote_dir):].lstrip("/")
+            else:
+                relative = _Path(remote_file).name
+            local_file = str(_Path(local_dir) / relative)
+            results.append(self.download_file(remote_file, local_file))
+
+        return results
+
+    def search(self, query: str, path: str = ".") -> Dict[str, Any]:
+        """
+        Search for text in the rollout's filesystem.
+
+        Args:
+            query: Search query
+            path: Directory to search in
+
+        Returns:
+            Dict with search results
+        """
+        result = handle_function_call(
+            "search", {"query": query, "path": path}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    # -------------------------------------------------------------------------
+    # Web tools
+    # -------------------------------------------------------------------------
+
+    def web_search(self, query: str) -> Dict[str, Any]:
+        """
+        Search the web.
+
+        Args:
+            query: Search query
+
+        Returns:
+            Dict with search results
+        """
+        result = handle_function_call("web_search", {"query": query})
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    def web_extract(self, urls: List[str]) -> Dict[str, Any]:
+        """
+        Extract content from URLs.
+
+        Args:
+            urls: List of URLs to extract content from
+
+        Returns:
+            Dict with extracted content
+        """
+        result = handle_function_call("web_extract", {"urls": urls})
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    # -------------------------------------------------------------------------
+    # Browser tools
+    # -------------------------------------------------------------------------
+
+    def browser_navigate(self, url: str) -> Dict[str, Any]:
+        """
+        Navigate the rollout's browser session to a URL.
+
+        Args:
+            url: URL to navigate to
+
+        Returns:
+            Dict with page snapshot or error
+        """
+        result = handle_function_call(
+            "browser_navigate", {"url": url}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    def browser_snapshot(self) -> Dict[str, Any]:
+        """
+        Take a snapshot of the current browser page.
+
+        Returns:
+            Dict with page content/accessibility snapshot
+        """
+        result = handle_function_call(
+            "browser_snapshot", {}, task_id=self.task_id
+        )
+        try:
+            return json.loads(result)
+        except json.JSONDecodeError:
+            return {"error": result}
+
+    # -------------------------------------------------------------------------
+    # Generic tool access
+    # -------------------------------------------------------------------------
+
+    def call_tool(self, tool_name: str, arguments: Dict[str, Any]) -> str:
+        """
+        Call any hermes-agent tool by name.
+
+        This is the generic escape hatch -- if a tool doesn't have a convenience
+        wrapper above, you can call it directly here.
+
+        Args:
+            tool_name: Name of the tool (e.g., "vision_analyze", "skills_list")
+            arguments: Dict of arguments for the tool
+
+        Returns:
+            Raw JSON string result from the tool
+        """
+        return _run_tool_in_thread(tool_name, arguments, self.task_id)
+
+    # -------------------------------------------------------------------------
+    # Cleanup
+    # -------------------------------------------------------------------------
+
+    def cleanup(self):
+        """
+        Release all resources (terminal VMs, browser sessions, background processes)
+        for this rollout.
+
+        Called automatically by the base environment via try/finally after
+        compute_reward() completes. You generally don't need to call this yourself.
+        """
+        # Kill any background processes from this rollout (safety net)
+        try:
+            from tools.process_registry import process_registry
+            killed = process_registry.kill_all(task_id=self.task_id)
+            if killed:
+                logger.debug("Process cleanup for task %s: killed %d process(es)", self.task_id, killed)
+        except Exception as e:
+            logger.debug("Process cleanup for task %s: %s", self.task_id, e)
+
+        try:
+            cleanup_vm(self.task_id)
+        except Exception as e:
+            logger.debug("VM cleanup for task %s: %s", self.task_id, e)
+
+        # Suppress browser_tool's noisy debug prints during cleanup.
+        # The cleanup still runs (safe), it just doesn't spam the console.
+        _prev_quiet = os.environ.get("HERMES_QUIET")
+        os.environ["HERMES_QUIET"] = "1"
+        try:
+            cleanup_browser(self.task_id)
+        except Exception as e:
+            logger.debug("Browser cleanup for task %s: %s", self.task_id, e)
+        finally:
+            if _prev_quiet is None:
+                os.environ.pop("HERMES_QUIET", None)
+            else:
+                os.environ["HERMES_QUIET"] = _prev_quiet
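Review note: the save, override, and restore handling of `HERMES_QUIET` in `cleanup()` could be factored into a small context manager. The following is a minimal sketch, assuming a hypothetical helper named `temporary_env` that is not part of this diff.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable for the duration of a block, then restore it.

    Hypothetical helper: mirrors the HERMES_QUIET handling in cleanup().
    """
    previous: Optional[str] = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            # The variable was unset before, so remove it again.
            os.environ.pop(name, None)
        else:
            # The variable had a value before, so put it back.
            os.environ[name] = previous
```

With such a helper, the browser cleanup step would reduce to a `with temporary_env("HERMES_QUIET", "1"):` block around the existing `try`/`except`.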
@@ -1,70 +0,0 @@
---
-name: example-skill
-description: An example skill demonstrating the skill file format and structure
---
-
-# Example Skill
-
-This is an example skill file that demonstrates how to create skills for the Hermes Agent.
-
-## Skill File Format
-
-Skills are markdown files with YAML frontmatter at the top:
-
-```yaml
---
-name: your-skill-name
-description: A brief one-line description of what this skill does
---
-```
-
-The frontmatter fields:
- **name**: The identifier used to reference this skill (lowercase, hyphens for spaces)
- **description**: A brief description shown when listing skills (keep under 200 chars)
-
-## Writing Effective Skills
-
-### 1. Be Specific and Actionable
-
-Good skills provide clear, actionable instructions:
-
-```
-When reviewing code:
-1. Check for security vulnerabilities first
-2. Verify error handling is comprehensive
-3. Ensure tests cover edge cases
-```
-
-### 2. Include Examples
-
-Show concrete examples of what you want:
-
-```python
-# Good: Descriptive variable names
-user_authentication_token = get_token()
-
-# Bad: Cryptic abbreviations  
-uat = gt()
-```
-
-### 3. Define When to Use
-
-Help the agent understand when this skill applies:
-
-> Use this skill when: reviewing pull requests, auditing security, or checking code quality.
-
-## Skill Categories
-
-Consider organizing skills by purpose:
-
- **Conventions**: Coding standards, API patterns, naming rules
- **Workflows**: Step-by-step processes for deployments, reviews, releases
- **Knowledge**: Domain-specific information, system architecture, gotchas
- **Templates**: Boilerplate for common tasks, response formats
-
-## Tips
-
-1. Keep the description concise - it's shown in the skills list
-2. Use headers to organize longer skills
-3. Include code examples where helpful
-4. Reference other skills if they're related
@@ -0,0 +1,35 @@
+"""
+Hermes Gateway - Multi-platform messaging integration.
+
+This module provides a unified gateway for connecting the Hermes agent
+to various messaging platforms (Telegram, Discord, WhatsApp) with:
+- Session management (persistent conversations with reset policies)
+- Dynamic context injection (agent knows where messages come from)
+- Delivery routing (cron job outputs to appropriate channels)
+- Platform-specific toolsets (different capabilities per platform)
+"""
+
+from .config import GatewayConfig, PlatformConfig, HomeChannel, load_gateway_config
+from .session import (
+    SessionContext,
+    SessionStore,
+    SessionResetPolicy,
+    build_session_context_prompt,
+)
+from .delivery import DeliveryRouter, DeliveryTarget
+
+__all__ = [
+    # Config
+    "GatewayConfig",
+    "PlatformConfig", 
+    "HomeChannel",
+    "load_gateway_config",
+    # Session
+    "SessionContext",
+    "SessionStore",
+    "SessionResetPolicy",
+    "build_session_context_prompt",
+    # Delivery
+    "DeliveryRouter",
+    "DeliveryTarget",
+]
@@ -0,0 +1,350 @@
+"""
+Gateway configuration management.
+
+Handles loading and validating configuration for:
+- Connected platforms (Telegram, Discord, WhatsApp)
+- Home channels for each platform
+- Session reset policies
+- Delivery preferences
+"""
+
+import os
+import json
+from pathlib import Path
+from dataclasses import dataclass, field
+from typing import Dict, List, Optional, Any
+from enum import Enum
+
+
+class Platform(Enum):
+    """Supported messaging platforms."""
+    LOCAL = "local"
+    TELEGRAM = "telegram"
+    DISCORD = "discord"
+    WHATSAPP = "whatsapp"
+    SLACK = "slack"
+
+
+@dataclass
+class HomeChannel:
+    """
+    Default destination for a platform.
+    
+    When a cron job specifies deliver="telegram" without a specific chat ID,
+    messages are sent to this home channel.
+    """
+    platform: Platform
+    chat_id: str
+    name: str  # Human-readable name for display
+    
+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            "platform": self.platform.value,
+            "chat_id": self.chat_id,
+            "name": self.name,
+        }
+    
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> "HomeChannel":
+        return cls(
+            platform=Platform(data["platform"]),
+            chat_id=str(data["chat_id"]),
+            name=data.get("name", "Home"),
+        )
+
+
+@dataclass
+class SessionResetPolicy:
+    """
+    Controls when sessions reset (lose context).
+    
+    Modes:
+    - "daily": Reset at a specific hour each day
+    - "idle": Reset after N minutes of inactivity
+    - "both": Whichever triggers first (daily boundary OR idle timeout)
+    """
+    mode: str = "both"  # "daily", "idle", or "both"
+    at_hour: int = 4  # Hour for daily reset (0-23, local time)
+    idle_minutes: int = 1440  # Minutes of inactivity before reset (24 hours)
+    
+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            "mode": self.mode,
+            "at_hour": self.at_hour,
+            "idle_minutes": self.idle_minutes,
+        }
+    
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> "SessionResetPolicy":
+        return cls(
+            mode=data.get("mode", "both"),
+            at_hour=data.get("at_hour", 4),
+            idle_minutes=data.get("idle_minutes", 1440),
+        )
+
+
+@dataclass
+class PlatformConfig:
+    """Configuration for a single messaging platform."""
+    enabled: bool = False
+    token: Optional[str] = None  # Bot token (Telegram, Discord)
+    api_key: Optional[str] = None  # API key if different from token
+    home_channel: Optional[HomeChannel] = None
+    
+    # Platform-specific settings
+    extra: Dict[str, Any] = field(default_factory=dict)
+    
+    def to_dict(self) -> Dict[str, Any]:
+        result = {
+            "enabled": self.enabled,
+            "extra": self.extra,
+        }
+        if self.token:
+            result["token"] = self.token
+        if self.api_key:
+            result["api_key"] = self.api_key
+        if self.home_channel:
+            result["home_channel"] = self.home_channel.to_dict()
+        return result
+    
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> "PlatformConfig":
+        home_channel = None
+        if "home_channel" in data:
+            home_channel = HomeChannel.from_dict(data["home_channel"])
+        
+        return cls(
+            enabled=data.get("enabled", False),
+            token=data.get("token"),
+            api_key=data.get("api_key"),
+            home_channel=home_channel,
+            extra=data.get("extra", {}),
+        )
+
+
+@dataclass
+class GatewayConfig:
+    """
+    Main gateway configuration.
+    
+    Manages all platform connections, session policies, and delivery settings.
+    """
+    # Platform configurations
+    platforms: Dict[Platform, PlatformConfig] = field(default_factory=dict)
+    
+    # Session reset policies by type
+    default_reset_policy: SessionResetPolicy = field(default_factory=SessionResetPolicy)
+    reset_by_type: Dict[str, SessionResetPolicy] = field(default_factory=dict)
+    reset_by_platform: Dict[Platform, SessionResetPolicy] = field(default_factory=dict)
+    
+    # Reset trigger commands
+    reset_triggers: List[str] = field(default_factory=lambda: ["/new", "/reset"])
+    
+    # Storage paths
+    sessions_dir: Path = field(default_factory=lambda: Path.home() / ".hermes" / "sessions")
+    
+    # Delivery settings
+    always_log_local: bool = True  # Always save cron outputs to local files
+    
+    def get_connected_platforms(self) -> List[Platform]:
+        """Return list of platforms that are enabled and configured."""
+        connected = []
+        for platform, config in self.platforms.items():
+            if config.enabled and (config.token or config.api_key):
+                connected.append(platform)
+        return connected
+    
+    def get_home_channel(self, platform: Platform) -> Optional[HomeChannel]:
+        """Get the home channel for a platform."""
+        config = self.platforms.get(platform)
+        if config:
+            return config.home_channel
+        return None
+    
+    def get_reset_policy(
+        self, 
+        platform: Optional[Platform] = None,
+        session_type: Optional[str] = None
+    ) -> SessionResetPolicy:
+        """
+        Get the appropriate reset policy for a session.
+        
+        Priority: platform override > type override > default
+        """
+        # Platform-specific override takes precedence
+        if platform and platform in self.reset_by_platform:
+            return self.reset_by_platform[platform]
+        
+        # Type-specific override (dm, group, thread)
+        if session_type and session_type in self.reset_by_type:
+            return self.reset_by_type[session_type]
+        
+        return self.default_reset_policy
+    
+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            "platforms": {
+                p.value: c.to_dict() for p, c in self.platforms.items()
+            },
+            "default_reset_policy": self.default_reset_policy.to_dict(),
+            "reset_by_type": {
+                k: v.to_dict() for k, v in self.reset_by_type.items()
+            },
+            "reset_by_platform": {
+                p.value: v.to_dict() for p, v in self.reset_by_platform.items()
+            },
+            "reset_triggers": self.reset_triggers,
+            "sessions_dir": str(self.sessions_dir),
+            "always_log_local": self.always_log_local,
+        }
+    
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> "GatewayConfig":
+        platforms = {}
+        for platform_name, platform_data in data.get("platforms", {}).items():
+            try:
+                platform = Platform(platform_name)
+                platforms[platform] = PlatformConfig.from_dict(platform_data)
+            except ValueError:
+                pass  # Skip unknown platforms
+        
+        reset_by_type = {}
+        for type_name, policy_data in data.get("reset_by_type", {}).items():
+            reset_by_type[type_name] = SessionResetPolicy.from_dict(policy_data)
+        
+        reset_by_platform = {}
+        for platform_name, policy_data in data.get("reset_by_platform", {}).items():
+            try:
+                platform = Platform(platform_name)
+                reset_by_platform[platform] = SessionResetPolicy.from_dict(policy_data)
+            except ValueError:
+                pass
+        
+        default_policy = SessionResetPolicy()
+        if "default_reset_policy" in data:
+            default_policy = SessionResetPolicy.from_dict(data["default_reset_policy"])
+        
+        sessions_dir = Path.home() / ".hermes" / "sessions"
+        if "sessions_dir" in data:
+            sessions_dir = Path(data["sessions_dir"])
+        
+        return cls(
+            platforms=platforms,
+            default_reset_policy=default_policy,
+            reset_by_type=reset_by_type,
+            reset_by_platform=reset_by_platform,
+            reset_triggers=data.get("reset_triggers", ["/new", "/reset"]),
+            sessions_dir=sessions_dir,
+            always_log_local=data.get("always_log_local", True),
+        )
+
+
+def load_gateway_config() -> GatewayConfig:
+    """
+    Load gateway configuration from multiple sources.
+    
+    Priority (highest to lowest):
+    1. Environment variables
+    2. ~/.hermes/gateway.json
+    3. Defaults
+
+    Note: the cli-config.yaml gateway section is not read by this function.
+    """
+    config = GatewayConfig()
+    
+    # Try loading from ~/.hermes/gateway.json
+    gateway_config_path = Path.home() / ".hermes" / "gateway.json"
+    if gateway_config_path.exists():
+        try:
+            with open(gateway_config_path, "r", encoding="utf-8") as f:
+                data = json.load(f)
+                config = GatewayConfig.from_dict(data)
+        except Exception as e:
+            print(f"[gateway] Warning: Failed to load {gateway_config_path}: {e}")
+    
+    # Override with environment variables
+    _apply_env_overrides(config)
+    
+    return config
+
+
+def _apply_env_overrides(config: GatewayConfig) -> None:
+    """Apply environment variable overrides to config."""
+    
+    # Telegram
+    telegram_token = os.getenv("TELEGRAM_BOT_TOKEN")
+    if telegram_token:
+        if Platform.TELEGRAM not in config.platforms:
+            config.platforms[Platform.TELEGRAM] = PlatformConfig()
+        config.platforms[Platform.TELEGRAM].enabled = True
+        config.platforms[Platform.TELEGRAM].token = telegram_token
+    
+    telegram_home = os.getenv("TELEGRAM_HOME_CHANNEL")
+    if telegram_home and Platform.TELEGRAM in config.platforms:
+        config.platforms[Platform.TELEGRAM].home_channel = HomeChannel(
+            platform=Platform.TELEGRAM,
+            chat_id=telegram_home,
+            name=os.getenv("TELEGRAM_HOME_CHANNEL_NAME", "Home"),
+        )
+    
+    # Discord
+    discord_token = os.getenv("DISCORD_BOT_TOKEN")
+    if discord_token:
+        if Platform.DISCORD not in config.platforms:
+            config.platforms[Platform.DISCORD] = PlatformConfig()
+        config.platforms[Platform.DISCORD].enabled = True
+        config.platforms[Platform.DISCORD].token = discord_token
+    
+    discord_home = os.getenv("DISCORD_HOME_CHANNEL")
+    if discord_home and Platform.DISCORD in config.platforms:
+        config.platforms[Platform.DISCORD].home_channel = HomeChannel(
+            platform=Platform.DISCORD,
+            chat_id=discord_home,
+            name=os.getenv("DISCORD_HOME_CHANNEL_NAME", "Home"),
+        )
+    
+    # WhatsApp (typically uses different auth mechanism)
+    whatsapp_enabled = os.getenv("WHATSAPP_ENABLED", "").lower() in ("true", "1", "yes")
+    if whatsapp_enabled:
+        if Platform.WHATSAPP not in config.platforms:
+            config.platforms[Platform.WHATSAPP] = PlatformConfig()
+        config.platforms[Platform.WHATSAPP].enabled = True
+    
+    # Slack
+    slack_token = os.getenv("SLACK_BOT_TOKEN")
+    if slack_token:
+        if Platform.SLACK not in config.platforms:
+            config.platforms[Platform.SLACK] = PlatformConfig()
+        config.platforms[Platform.SLACK].enabled = True
+        config.platforms[Platform.SLACK].token = slack_token
+        # Home channel
+        slack_home = os.getenv("SLACK_HOME_CHANNEL")
+        if slack_home:
+            config.platforms[Platform.SLACK].home_channel = HomeChannel(
+                platform=Platform.SLACK,
+                chat_id=slack_home,
+                name=os.getenv("SLACK_HOME_CHANNEL_NAME", ""),
+            )
+    
+    # Session settings
+    idle_minutes = os.getenv("SESSION_IDLE_MINUTES")
+    if idle_minutes:
+        try:
+            config.default_reset_policy.idle_minutes = int(idle_minutes)
+        except ValueError:
+            pass
+    
+    reset_hour = os.getenv("SESSION_RESET_HOUR")
+    if reset_hour:
+        try:
+            config.default_reset_policy.at_hour = int(reset_hour)
+        except ValueError:
+            pass
+
+
+def save_gateway_config(config: GatewayConfig) -> None:
+    """Save gateway configuration to ~/.hermes/gateway.json."""
+    gateway_config_path = Path.home() / ".hermes" / "gateway.json"
+    gateway_config_path.parent.mkdir(parents=True, exist_ok=True)
+    
+    with open(gateway_config_path, "w", encoding="utf-8") as f:
+        json.dump(config.to_dict(), f, indent=2)
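Review note: to make the file format and the reset-policy precedence concrete, here is a minimal sketch of a config in the shape `GatewayConfig.from_dict` reads, together with a plain-dict version of the lookup order used by `get_reset_policy`. The token, chat ID, and policy values are placeholders, and `pick_reset_policy` is a hypothetical stand-in that returns raw dicts rather than `SessionResetPolicy` objects, so it does not fill in defaults for missing keys.

```python
import json
from typing import Any, Dict, Optional

# Placeholder values only; the keys follow GatewayConfig.to_dict() in this diff.
example_config: Dict[str, Any] = {
    "platforms": {
        "telegram": {
            "enabled": True,
            "token": "<telegram-bot-token>",
            "home_channel": {
                "platform": "telegram",
                "chat_id": "123456789",
                "name": "Home",
            },
        },
    },
    "default_reset_policy": {"mode": "both", "at_hour": 4, "idle_minutes": 1440},
    "reset_by_type": {"group": {"mode": "idle", "idle_minutes": 120}},
    "reset_by_platform": {"discord": {"mode": "daily", "at_hour": 6}},
    "reset_triggers": ["/new", "/reset"],
    "always_log_local": True,
}


def pick_reset_policy(
    config: Dict[str, Any],
    platform: Optional[str] = None,
    session_type: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical dict-based mirror of get_reset_policy's precedence."""
    # 1. A platform-specific override wins over everything else.
    if platform and platform in config.get("reset_by_platform", {}):
        return config["reset_by_platform"][platform]
    # 2. Otherwise a session-type override (dm, group, thread) applies.
    if session_type and session_type in config.get("reset_by_type", {}):
        return config["reset_by_type"][session_type]
    # 3. Fall back to the default policy.
    return config["default_reset_policy"]


# The same structure, serialized the way save_gateway_config() writes it.
serialized = json.dumps(example_config, indent=2)
```

A Discord group session resolves to the Discord policy here, while a Telegram group session falls through to the `group` override because Telegram has no platform-level entry.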
@@ -0,0 +1,318 @@
+"""
+Delivery routing for cron job outputs and agent responses.
+
+Routes messages to the appropriate destination based on:
+- Explicit targets (e.g., "telegram:123456789")
+- Platform home channels (e.g., "telegram" → home channel)
+- Origin (back to where the job was created)
+- Local (always saved to files)
+"""
+
+import json
+from pathlib import Path
+from datetime import datetime
+from dataclasses import dataclass
+from typing import Dict, List, Optional, Any, Union
+from enum import Enum
+
+from .config import Platform, GatewayConfig, HomeChannel
+from .session import SessionSource
+
+
+@dataclass
+class DeliveryTarget:
+    """
+    A single delivery target.
+    
+    Represents where a message should be sent:
+    - "origin" → back to source
+    - "local" → save to local files
+    - "telegram" → Telegram home channel
+    - "telegram:123456" → specific Telegram chat
+    """
+    platform: Platform
+    chat_id: Optional[str] = None  # None means use home channel
+    is_origin: bool = False
+    is_explicit: bool = False  # True if chat_id was explicitly specified
+    
+    @classmethod
+    def parse(cls, target: str, origin: Optional[SessionSource] = None) -> "DeliveryTarget":
+        """
+        Parse a delivery target string.
+        
+        Formats:
+        - "origin" → back to source
+        - "local" → local files only
+        - "telegram" → Telegram home channel
+        - "telegram:123456" → specific Telegram chat
+        """
+        # Match keywords and platform names case-insensitively, but keep the
+        # chat ID's original case, since chat IDs may be case-sensitive.
+        target = target.strip()
+        keyword = target.lower()
+        
+        if keyword == "origin":
+            if origin:
+                return cls(
+                    platform=origin.platform,
+                    chat_id=origin.chat_id,
+                    is_origin=True,
+                )
+            else:
+                # Fallback to local if no origin
+                return cls(platform=Platform.LOCAL, is_origin=True)
+        
+        if keyword == "local":
+            return cls(platform=Platform.LOCAL)
+        
+        # Check for platform:chat_id format
+        if ":" in target:
+            platform_str, chat_id = target.split(":", 1)
+            try:
+                platform = Platform(platform_str.lower())
+                return cls(platform=platform, chat_id=chat_id, is_explicit=True)
+            except ValueError:
+                # Unknown platform, treat as local
+                return cls(platform=Platform.LOCAL)
+        
+        # Just a platform name (use home channel)
+        try:
+            platform = Platform(keyword)
+            return cls(platform=platform)
+        except ValueError:
+            # Unknown platform, treat as local
+            return cls(platform=Platform.LOCAL)
+    
+    def to_string(self) -> str:
+        """Convert back to string format."""
+        if self.is_origin:
+            return "origin"
+        if self.platform == Platform.LOCAL:
+            return "local"
+        if self.chat_id:
+            return f"{self.platform.value}:{self.chat_id}"
+        return self.platform.value
+
+
+class DeliveryRouter:
+    """
+    Routes messages to appropriate destinations.
+    
+    Handles the logic of resolving delivery targets and dispatching
+    messages to the right platform adapters.
+    """
+    
+    def __init__(self, config: GatewayConfig, adapters: Optional[Dict[Platform, Any]] = None):
+        """
+        Initialize the delivery router.
+        
+        Args:
+            config: Gateway configuration
+            adapters: Dict mapping platforms to their adapter instances
+        """
+        self.config = config
+        self.adapters = adapters or {}
+        self.output_dir = Path.home() / ".hermes" / "cron" / "output"
+    
+    def resolve_targets(
+        self,
+        deliver: Union[str, List[str]],
+        origin: Optional[SessionSource] = None
+    ) -> List[DeliveryTarget]:
+        """
+        Resolve delivery specification to concrete targets.
+        
+        Args:
+            deliver: Delivery spec - "origin", "telegram", ["local", "discord"], etc.
+            origin: The source where the request originated (for "origin" target)
+        
+        Returns:
+            List of resolved delivery targets
+        """
+        if isinstance(deliver, str):
+            deliver = [deliver]
+        
+        targets = []
+        seen_platforms = set()
+        
+        for target_str in deliver:
+            target = DeliveryTarget.parse(target_str, origin)
+            
+            # Resolve home channel if needed
+            if target.chat_id is None and target.platform != Platform.LOCAL:
+                home = self.config.get_home_channel(target.platform)
+                if home:
+                    target.chat_id = home.chat_id
+                else:
+                    # No home channel configured, skip this platform
+                    continue
+            
+            # Deduplicate
+            key = (target.platform, target.chat_id)
+            if key not in seen_platforms:
+                seen_platforms.add(key)
+                targets.append(target)
+        
+        # Always include local if configured
+        if self.config.always_log_local:
+            local_key = (Platform.LOCAL, None)
+            if local_key not in seen_platforms:
+                targets.append(DeliveryTarget(platform=Platform.LOCAL))
+        
+        return targets
+    
+    async def deliver(
+        self,
+        content: str,
+        targets: List[DeliveryTarget],
+        job_id: Optional[str] = None,
+        job_name: Optional[str] = None,
+        metadata: Optional[Dict[str, Any]] = None
+    ) -> Dict[str, Any]:
+        """
+        Deliver content to all specified targets.
+        
+        Args:
+            content: The message/output to deliver
+            targets: List of delivery targets
+            job_id: Optional job ID (for cron jobs)
+            job_name: Optional job name
+            metadata: Additional metadata to include
+        
+        Returns:
+            Dict with delivery results per target
+        """
+        results = {}
+        
+        for target in targets:
+            try:
+                if target.platform == Platform.LOCAL:
+                    result = self._deliver_local(content, job_id, job_name, metadata)
+                else:
+                    result = await self._deliver_to_platform(target, content, metadata)
+                
+                results[target.to_string()] = {
+                    "success": True,
+                    "result": result
+                }
+            except Exception as e:
+                results[target.to_string()] = {
+                    "success": False,
+                    "error": str(e)
+                }
+        
+        return results
+    
+    def _deliver_local(
+        self,
+        content: str,
+        job_id: Optional[str],
+        job_name: Optional[str],
+        metadata: Optional[Dict[str, Any]]
+    ) -> Dict[str, Any]:
+        """Save content to local files."""
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+        
+        if job_id:
+            output_path = self.output_dir / job_id / f"{timestamp}.md"
+        else:
+            output_path = self.output_dir / "misc" / f"{timestamp}.md"
+        
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        
+        # Build the output document
+        lines = []
+        if job_name:
+            lines.append(f"# {job_name}")
+        else:
+            lines.append("# Delivery Output")
+        
+        lines.append("")
+        lines.append(f"**Timestamp:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+        
+        if job_id:
+            lines.append(f"**Job ID:** {job_id}")
+        
+        if metadata:
+            for key, value in metadata.items():
+                lines.append(f"**{key}:** {value}")
+        
+        lines.append("")
+        lines.append("---")
+        lines.append("")
+        lines.append(content)
+        
+        output_path.write_text("\n".join(lines))
+        
+        return {
+            "path": str(output_path),
+            "timestamp": timestamp
+        }
+    
+    async def _deliver_to_platform(
+        self,
+        target: DeliveryTarget,
+        content: str,
+        metadata: Optional[Dict[str, Any]]
+    ) -> Dict[str, Any]:
+        """Deliver content to a messaging platform."""
+        adapter = self.adapters.get(target.platform)
+        
+        if not adapter:
+            raise ValueError(f"No adapter configured for {target.platform.value}")
+        
+        if not target.chat_id:
+            raise ValueError(f"No chat ID for {target.platform.value} delivery")
+        
+        # Call the adapter's send method
+        # Adapters should implement:
+        #   async def send(chat_id: str, content: str, metadata: Optional[Dict] = None) -> Dict
+        return await adapter.send(target.chat_id, content, metadata=metadata)
+
+
+def parse_deliver_spec(
+    deliver: Optional[Union[str, List[str]]],
+    origin: Optional[SessionSource] = None,
+    default: str = "origin"
+) -> Union[str, List[str]]:
+    """
+    Normalize a delivery specification.
+    
+    If None or empty, returns the default.
+    """
+    if not deliver:
+        return default
+    return deliver
+
+
+def build_delivery_context_for_tool(
+    config: GatewayConfig,
+    origin: Optional[SessionSource] = None
+) -> Dict[str, Any]:
+    """
+    Build context for the schedule_cronjob tool to understand delivery options.
+    
+    This is passed to the tool so it can validate and explain delivery targets.
+    """
+    connected = config.get_connected_platforms()
+    
+    options = {
+        "origin": {
+            "description": "Back to where this job was created",
+            "available": origin is not None,
+        },
+        "local": {
+            "description": "Save to local files only",
+            "available": True,
+        }
+    }
+    
+    for platform in connected:
+        home = config.get_home_channel(platform)
+        options[platform.value] = {
+            "description": f"{platform.value.title()} home channel",
+            "available": True,
+            "home_channel": home.to_dict() if home else None,
+        }
+    
+    return {
+        "origin": origin.to_dict() if origin else None,
+        "options": options,
+        "always_log_local": config.always_log_local,
+    }
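Review note: `_deliver_to_platform` calls `adapter.send(target.chat_id, content, metadata=metadata)`, so any adapter passed to `DeliveryRouter` needs an async `send` that accepts a `metadata` keyword. A minimal sketch of an adapter meeting that contract follows. `InMemoryAdapter` and its return shape are hypothetical and useful mainly for tests; a real adapter would call the platform's API instead.

```python
import asyncio
from typing import Any, Dict, List, Optional


class InMemoryAdapter:
    """Hypothetical adapter that records messages instead of sending them."""

    def __init__(self) -> None:
        self.sent: List[Dict[str, Any]] = []

    async def send(
        self,
        chat_id: str,
        content: str,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        # Record the call so a test can inspect what would have been delivered.
        self.sent.append(
            {"chat_id": chat_id, "content": content, "metadata": metadata or {}}
        )
        # The return shape is an assumption; the router stores it under "result".
        return {"chat_id": chat_id, "length": len(content)}


async def demo() -> Dict[str, Any]:
    adapter = InMemoryAdapter()
    return await adapter.send(
        "123456789", "Nightly report ready", metadata={"job": "nightly"}
    )


result = asyncio.run(demo())
```

Because `deliver()` wraps each target in its own `try`/`except`, an adapter that raises only marks its own target as failed and does not stop delivery to the remaining targets.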
@@ -0,0 +1,150 @@
+"""
+Event Hook System
+
+A lightweight event-driven system that fires handlers at key lifecycle points.
+Hooks are discovered from ~/.hermes/hooks/ directories, each containing:
+  - HOOK.yaml  (metadata: name, description, events list)
+  - handler.py (Python handler exposing handle(event_type, context); sync or async)
+
+Events:
+  - gateway:startup     -- Gateway process starts
+  - session:start       -- New session created
+  - session:reset       -- User ran /new or /reset
+  - agent:start         -- Agent begins processing a message
+  - agent:step          -- Each turn in the tool-calling loop
+  - agent:end           -- Agent finishes processing
+  - command:*           -- Any slash command executed (wildcard match)
+
+Errors in hooks are caught and logged but never block the main pipeline.
+"""
+
+import asyncio
+import importlib.util
+import os
+from pathlib import Path
+from typing import Any, Callable, Dict, List, Optional
+
+import yaml
+
+
+HOOKS_DIR = Path(os.path.expanduser("~/.hermes/hooks"))
+
+
+class HookRegistry:
+    """
+    Discovers, loads, and fires event hooks.
+
+    Usage:
+        registry = HookRegistry()
+        registry.discover_and_load()
+        await registry.emit("agent:start", {"platform": "telegram", ...})
+    """
+
+    def __init__(self):
+        # event_type -> [handler_fn, ...]
+        self._handlers: Dict[str, List[Callable]] = {}
+        self._loaded_hooks: List[dict] = []  # metadata for listing
+
+    @property
+    def loaded_hooks(self) -> List[dict]:
+        """Return metadata about all loaded hooks."""
+        return list(self._loaded_hooks)
+
+    def discover_and_load(self) -> None:
+        """
+        Scan the hooks directory for hook directories and load their handlers.
+
+        Each hook directory must contain:
+          - HOOK.yaml with at least 'name' and 'events' keys
+          - handler.py with a top-level 'handle' function (sync or async)
+        """
+        if not HOOKS_DIR.exists():
+            return
+
+        for hook_dir in sorted(HOOKS_DIR.iterdir()):
+            if not hook_dir.is_dir():
+                continue
+
+            manifest_path = hook_dir / "HOOK.yaml"
+            handler_path = hook_dir / "handler.py"
+
+            if not manifest_path.exists() or not handler_path.exists():
+                continue
+
+            try:
+                manifest = yaml.safe_load(manifest_path.read_text(encoding="utf-8"))
+                if not manifest or not isinstance(manifest, dict):
+                    print(f"[hooks] Skipping {hook_dir.name}: invalid HOOK.yaml", flush=True)
+                    continue
+
+                hook_name = manifest.get("name", hook_dir.name)
+                events = manifest.get("events", [])
+                if not events:
+                    print(f"[hooks] Skipping {hook_name}: no events declared", flush=True)
+                    continue
+
+                # Dynamically load the handler module
+                spec = importlib.util.spec_from_file_location(
+                    f"hermes_hook_{hook_name}", handler_path
+                )
+                if spec is None or spec.loader is None:
+                    print(f"[hooks] Skipping {hook_name}: could not load handler.py", flush=True)
+                    continue
+
+                module = importlib.util.module_from_spec(spec)
+                spec.loader.exec_module(module)
+
+                handle_fn = getattr(module, "handle", None)
+                if handle_fn is None:
+                    print(f"[hooks] Skipping {hook_name}: no 'handle' function found", flush=True)
+                    continue
+
+                # Register the handler for each declared event
+                for event in events:
+                    self._handlers.setdefault(event, []).append(handle_fn)
+
+                self._loaded_hooks.append({
+                    "name": hook_name,
+                    "description": manifest.get("description", ""),
+                    "events": events,
+                    "path": str(hook_dir),
+                })
+
+                print(f"[hooks] Loaded hook '{hook_name}' for events: {events}", flush=True)
+
+            except Exception as e:
+                print(f"[hooks] Error loading hook {hook_dir.name}: {e}", flush=True)
+
+    async def emit(self, event_type: str, context: Optional[Dict[str, Any]] = None) -> None:
+        """
+        Fire all handlers registered for an event.
+
+        Supports wildcard matching: handlers registered for "command:*" will
+        fire for any "command:..." event. Handlers registered for a base type
+        like "agent" won't fire for "agent:start" -- only exact matches and
+        explicit wildcards.
+
+        Args:
+            event_type: The event identifier (e.g. "agent:start").
+            context:    Optional dict with event-specific data.
+        """
+        if context is None:
+            context = {}
+
+        # Collect handlers: exact match + wildcard match
+        handlers = list(self._handlers.get(event_type, []))
+
+        # Check for wildcard patterns (e.g., "command:*" matches "command:reset")
+        if ":" in event_type:
+            base = event_type.split(":")[0]
+            wildcard_key = f"{base}:*"
+            handlers.extend(self._handlers.get(wildcard_key, []))
+
+        for fn in handlers:
+            try:
+                result = fn(event_type, context)
+                # Support both sync and async handlers
+                if asyncio.iscoroutine(result):
+                    await result
+            except Exception as e:
+                print(f"[hooks] Error in handler for '{event_type}': {e}", flush=True)
@@ -0,0 +1,282 @@
+"""
+DM Pairing System
+
+Code-based approval flow for authorizing new users on messaging platforms.
+Instead of static allowlists with user IDs, unknown users receive a one-time
+pairing code that the bot owner approves via the CLI.
+
+Security features (based on OWASP + NIST SP 800-63-4 guidance):
+  - 8-char codes from 32-char unambiguous alphabet (no 0/O/1/I)
+  - Cryptographic randomness via secrets.choice()
+  - 1-hour code expiry
+  - Max 3 pending codes per platform
+  - Rate limiting: 1 request per user per 10 minutes
+  - Lockout after 5 failed approval attempts (1 hour)
+  - File permissions: chmod 0600 on all data files
+  - Codes are never logged to stdout
+
+Storage: ~/.hermes/pairing/
+"""
+
+import json
+import os
+import secrets
+import time
+from pathlib import Path
+from typing import Optional
+
+
+# Unambiguous alphabet -- excludes 0/O, 1/I to prevent confusion
+ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
+CODE_LENGTH = 8
+
+# Timing constants
+CODE_TTL_SECONDS = 3600             # Codes expire after 1 hour
+RATE_LIMIT_SECONDS = 600            # 1 request per user per 10 minutes
+LOCKOUT_SECONDS = 3600              # Lockout duration after too many failures
+
+# Limits
+MAX_PENDING_PER_PLATFORM = 3        # Max pending codes per platform
+MAX_FAILED_ATTEMPTS = 5             # Failed approvals before lockout
+
+PAIRING_DIR = Path(os.path.expanduser("~/.hermes/pairing"))
+
+
+def _secure_write(path: Path, data: str) -> None:
+    """Write data to file with restrictive permissions (owner read/write only)."""
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(data, encoding="utf-8")
+    try:
+        os.chmod(path, 0o600)
+    except OSError:
+        pass  # Windows doesn't support chmod the same way
+
+
+class PairingStore:
+    """
+    Manages pairing codes and approved user lists.
+
+    Data files per platform:
+      - {platform}-pending.json   : pending pairing requests
+      - {platform}-approved.json  : approved (paired) users
+      - _rate_limits.json         : rate limit tracking
+    """
+
+    def __init__(self):
+        PAIRING_DIR.mkdir(parents=True, exist_ok=True)
+
+    def _pending_path(self, platform: str) -> Path:
+        return PAIRING_DIR / f"{platform}-pending.json"
+
+    def _approved_path(self, platform: str) -> Path:
+        return PAIRING_DIR / f"{platform}-approved.json"
+
+    def _rate_limit_path(self) -> Path:
+        return PAIRING_DIR / "_rate_limits.json"
+
+    def _load_json(self, path: Path) -> dict:
+        if path.exists():
+            try:
+                return json.loads(path.read_text(encoding="utf-8"))
+            except (json.JSONDecodeError, OSError):
+                return {}
+        return {}
+
+    def _save_json(self, path: Path, data: dict) -> None:
+        _secure_write(path, json.dumps(data, indent=2, ensure_ascii=False))
+
+    # ----- Approved users -----
+
+    def is_approved(self, platform: str, user_id: str) -> bool:
+        """Check if a user is approved (paired) on a platform."""
+        approved = self._load_json(self._approved_path(platform))
+        return user_id in approved
+
+    def list_approved(self, platform: Optional[str] = None) -> list:
+        """List approved users, optionally filtered by platform."""
+        results = []
+        platforms = [platform] if platform else self._all_platforms("approved")
+        for p in platforms:
+            approved = self._load_json(self._approved_path(p))
+            for uid, info in approved.items():
+                results.append({"platform": p, "user_id": uid, **info})
+        return results
+
+    def _approve_user(self, platform: str, user_id: str, user_name: str = "") -> None:
+        """Add a user to the approved list."""
+        approved = self._load_json(self._approved_path(platform))
+        approved[user_id] = {
+            "user_name": user_name,
+            "approved_at": time.time(),
+        }
+        self._save_json(self._approved_path(platform), approved)
+
+    def revoke(self, platform: str, user_id: str) -> bool:
+        """Remove a user from the approved list. Returns True if found."""
+        path = self._approved_path(platform)
+        approved = self._load_json(path)
+        if user_id in approved:
+            del approved[user_id]
+            self._save_json(path, approved)
+            return True
+        return False
+
+    # ----- Pending codes -----
+
+    def generate_code(
+        self, platform: str, user_id: str, user_name: str = ""
+    ) -> Optional[str]:
+        """
+        Generate a pairing code for a new user.
+
+        Returns the code string, or None if:
+          - The user is rate-limited (requested a code too recently)
+          - The maximum number of pending codes is reached for this platform
+          - The platform is in lockout due to failed approval attempts
+        """
+        self._cleanup_expired(platform)
+
+        # Check lockout
+        if self._is_locked_out(platform):
+            return None
+
+        # Check rate limit for this specific user
+        if self._is_rate_limited(platform, user_id):
+            return None
+
+        # Check max pending
+        pending = self._load_json(self._pending_path(platform))
+        if len(pending) >= MAX_PENDING_PER_PLATFORM:
+            return None
+
+        # Generate cryptographically random code
+        code = "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
+
+        # Store pending request
+        pending[code] = {
+            "user_id": user_id,
+            "user_name": user_name,
+            "created_at": time.time(),
+        }
+        self._save_json(self._pending_path(platform), pending)
+
+        # Record rate limit
+        self._record_rate_limit(platform, user_id)
+
+        return code
+
+    def approve_code(self, platform: str, code: str) -> Optional[dict]:
+        """
+        Approve a pairing code. Adds the user to the approved list.
+
+        Returns {user_id, user_name} on success, None if code is invalid/expired.
+        """
+        self._cleanup_expired(platform)
+        code = code.upper().strip()
+
+        pending = self._load_json(self._pending_path(platform))
+        if code not in pending:
+            self._record_failed_attempt(platform)
+            return None
+
+        entry = pending.pop(code)
+        self._save_json(self._pending_path(platform), pending)
+
+        # Add to approved list
+        self._approve_user(platform, entry["user_id"], entry.get("user_name", ""))
+
+        return {
+            "user_id": entry["user_id"],
+            "user_name": entry.get("user_name", ""),
+        }
+
+    def list_pending(self, platform: Optional[str] = None) -> list:
+        """List pending pairing requests, optionally filtered by platform."""
+        results = []
+        platforms = [platform] if platform else self._all_platforms("pending")
+        for p in platforms:
+            self._cleanup_expired(p)
+            pending = self._load_json(self._pending_path(p))
+            for code, info in pending.items():
+                age_min = int((time.time() - info["created_at"]) / 60)
+                results.append({
+                    "platform": p,
+                    "code": code,
+                    "user_id": info["user_id"],
+                    "user_name": info.get("user_name", ""),
+                    "age_minutes": age_min,
+                })
+        return results
+
+    def clear_pending(self, platform: Optional[str] = None) -> int:
+        """Clear all pending requests. Returns count removed."""
+        count = 0
+        platforms = [platform] if platform else self._all_platforms("pending")
+        for p in platforms:
+            pending = self._load_json(self._pending_path(p))
+            count += len(pending)
+            self._save_json(self._pending_path(p), {})
+        return count
+
+    # ----- Rate limiting and lockout -----
+
+    def _is_rate_limited(self, platform: str, user_id: str) -> bool:
+        """Check if a user has requested a code too recently."""
+        limits = self._load_json(self._rate_limit_path())
+        key = f"{platform}:{user_id}"
+        last_request = limits.get(key, 0)
+        return (time.time() - last_request) < RATE_LIMIT_SECONDS
+
+    def _record_rate_limit(self, platform: str, user_id: str) -> None:
+        """Record the time of a pairing request for rate limiting."""
+        limits = self._load_json(self._rate_limit_path())
+        key = f"{platform}:{user_id}"
+        limits[key] = time.time()
+        self._save_json(self._rate_limit_path(), limits)
+
+    def _is_locked_out(self, platform: str) -> bool:
+        """Check if a platform is in lockout due to failed approval attempts."""
+        limits = self._load_json(self._rate_limit_path())
+        lockout_key = f"_lockout:{platform}"
+        lockout_until = limits.get(lockout_key, 0)
+        return time.time() < lockout_until
+
+    def _record_failed_attempt(self, platform: str) -> None:
+        """Record a failed approval attempt. Triggers lockout after MAX_FAILED_ATTEMPTS."""
+        limits = self._load_json(self._rate_limit_path())
+        fail_key = f"_failures:{platform}"
+        fails = limits.get(fail_key, 0) + 1
+        limits[fail_key] = fails
+        if fails >= MAX_FAILED_ATTEMPTS:
+            lockout_key = f"_lockout:{platform}"
+            limits[lockout_key] = time.time() + LOCKOUT_SECONDS
+            limits[fail_key] = 0  # Reset counter
+            print(f"[pairing] Platform {platform} locked out for {LOCKOUT_SECONDS}s "
+                  f"after {MAX_FAILED_ATTEMPTS} failed attempts", flush=True)
+        self._save_json(self._rate_limit_path(), limits)
+
+    # ----- Cleanup -----
+
+    def _cleanup_expired(self, platform: str) -> None:
+        """Remove expired pending codes."""
+        path = self._pending_path(platform)
+        pending = self._load_json(path)
+        now = time.time()
+        expired = [
+            code for code, info in pending.items()
+            if (now - info["created_at"]) > CODE_TTL_SECONDS
+        ]
+        if expired:
+            for code in expired:
+                del pending[code]
+            self._save_json(path, pending)
+
+    def _all_platforms(self, suffix: str) -> list:
+        """List all platforms that have data files of a given suffix."""
+        platforms = []
+        for f in PAIRING_DIR.iterdir():
+            if f.name.endswith(f"-{suffix}.json"):
+                platform = f.name[: -len(f"-{suffix}.json")]
+                if not platform.startswith("_"):
+                    platforms.append(platform)
+        return platforms
@@ -0,0 +1,17 @@
+"""
+Platform adapters for messaging integrations.
+
+Each adapter handles:
+- Receiving messages from a platform
+- Sending messages/responses back
+- Platform-specific authentication
+- Message formatting and media handling
+"""
+
+from .base import BasePlatformAdapter, MessageEvent, SendResult
+
+__all__ = [
+    "BasePlatformAdapter",
+    "MessageEvent",
+    "SendResult",
+]
@@ -0,0 +1,691 @@
+"""
+Base platform adapter interface.
+
+All platform adapters (Telegram, Discord, WhatsApp) inherit from this
+and implement the required methods.
+"""
+
+import asyncio
+import os
+import re
+import uuid
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, List, Optional, Any, Callable, Awaitable, Tuple
+from enum import Enum
+
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
+
+from gateway.config import Platform, PlatformConfig
+from gateway.session import SessionSource
+
+
+# ---------------------------------------------------------------------------
+# Image cache utilities
+#
+# When users send images on messaging platforms, we download them to a local
+# cache directory so they can be analyzed by the vision tool (which accepts
+# local file paths). This avoids issues with ephemeral platform URLs
+# (e.g. Telegram file URLs expire after ~1 hour).
+# ---------------------------------------------------------------------------
+
+# Default location: ~/.hermes/image_cache/
+IMAGE_CACHE_DIR = Path(os.path.expanduser("~/.hermes/image_cache"))
+
+
+def get_image_cache_dir() -> Path:
+    """Return the image cache directory, creating it if it doesn't exist."""
+    IMAGE_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    return IMAGE_CACHE_DIR
+
+
+def cache_image_from_bytes(data: bytes, ext: str = ".jpg") -> str:
+    """
+    Save raw image bytes to the cache and return the absolute file path.
+
+    Args:
+        data: Raw image bytes.
+        ext:  File extension including the dot (e.g. ".jpg", ".png").
+
+    Returns:
+        Absolute path to the cached image file as a string.
+    """
+    cache_dir = get_image_cache_dir()
+    filename = f"img_{uuid.uuid4().hex[:12]}{ext}"
+    filepath = cache_dir / filename
+    filepath.write_bytes(data)
+    return str(filepath)
+
+
+async def cache_image_from_url(url: str, ext: str = ".jpg") -> str:
+    """
+    Download an image from a URL and save it to the local cache.
+
+    Uses httpx for async download with a reasonable timeout.
+
+    Args:
+        url: The HTTP/HTTPS URL to download from.
+        ext: File extension including the dot (e.g. ".jpg", ".png").
+
+    Returns:
+        Absolute path to the cached image file as a string.
+    """
+    import httpx
+
+    async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
+        response = await client.get(
+            url,
+            headers={
+                "User-Agent": "Mozilla/5.0 (compatible; HermesAgent/1.0)",
+                "Accept": "image/*,*/*;q=0.8",
+            },
+        )
+        response.raise_for_status()
+        return cache_image_from_bytes(response.content, ext)
+
+
+def cleanup_image_cache(max_age_hours: int = 24) -> int:
+    """
+    Delete cached images older than *max_age_hours*.
+
+    Returns the number of files removed.
+    """
+    import time
+
+    cache_dir = get_image_cache_dir()
+    cutoff = time.time() - (max_age_hours * 3600)
+    removed = 0
+    for f in cache_dir.iterdir():
+        if f.is_file() and f.stat().st_mtime < cutoff:
+            try:
+                f.unlink()
+                removed += 1
+            except OSError:
+                pass
+    return removed
+
+
+# ---------------------------------------------------------------------------
+# Audio cache utilities
+#
+# Same pattern as image cache -- voice messages from platforms are downloaded
+# here so the STT tool (OpenAI Whisper) can transcribe them from local files.
+# ---------------------------------------------------------------------------
+
+AUDIO_CACHE_DIR = Path(os.path.expanduser("~/.hermes/audio_cache"))
+
+
+def get_audio_cache_dir() -> Path:
+    """Return the audio cache directory, creating it if it doesn't exist."""
+    AUDIO_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    return AUDIO_CACHE_DIR
+
+
+def cache_audio_from_bytes(data: bytes, ext: str = ".ogg") -> str:
+    """
+    Save raw audio bytes to the cache and return the absolute file path.
+
+    Args:
+        data: Raw audio bytes.
+        ext:  File extension including the dot (e.g. ".ogg", ".mp3").
+
+    Returns:
+        Absolute path to the cached audio file as a string.
+    """
+    cache_dir = get_audio_cache_dir()
+    filename = f"audio_{uuid.uuid4().hex[:12]}{ext}"
+    filepath = cache_dir / filename
+    filepath.write_bytes(data)
+    return str(filepath)
+
+
+async def cache_audio_from_url(url: str, ext: str = ".ogg") -> str:
+    """
+    Download an audio file from a URL and save it to the local cache.
+
+    Args:
+        url: The HTTP/HTTPS URL to download from.
+        ext: File extension including the dot (e.g. ".ogg", ".mp3").
+
+    Returns:
+        Absolute path to the cached audio file as a string.
+    """
+    import httpx
+
+    async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
+        response = await client.get(
+            url,
+            headers={
+                "User-Agent": "Mozilla/5.0 (compatible; HermesAgent/1.0)",
+                "Accept": "audio/*,*/*;q=0.8",
+            },
+        )
+        response.raise_for_status()
+        return cache_audio_from_bytes(response.content, ext)
+
+
+class MessageType(Enum):
+    """Types of incoming messages."""
+    TEXT = "text"
+    PHOTO = "photo"
+    VIDEO = "video"
+    AUDIO = "audio"
+    VOICE = "voice"
+    DOCUMENT = "document"
+    STICKER = "sticker"
+    COMMAND = "command"  # /command style
+
+
+@dataclass
+class MessageEvent:
+    """
+    Incoming message from a platform.
+    
+    Normalized representation that all adapters produce.
+    """
+    # Message content
+    text: str
+    message_type: MessageType = MessageType.TEXT
+    
+    # Source information
+    source: Optional[SessionSource] = None
+    
+    # Original platform data
+    raw_message: Any = None
+    message_id: Optional[str] = None
+    
+    # Media attachments
+    media_urls: List[str] = field(default_factory=list)
+    media_types: List[str] = field(default_factory=list)
+    
+    # Reply context
+    reply_to_message_id: Optional[str] = None
+    
+    # Timestamps
+    timestamp: datetime = field(default_factory=datetime.now)
+    
+    def is_command(self) -> bool:
+        """Check if this is a command message (e.g., /new, /reset)."""
+        return self.text.startswith("/")
+    
+    def get_command(self) -> Optional[str]:
+        """Extract command name if this is a command message."""
+        if not self.is_command():
+            return None
+        # Split on space and get first word, strip the /
+        parts = self.text.split(maxsplit=1)
+        return parts[0][1:].lower() if parts else None
+    
+    def get_command_args(self) -> str:
+        """Get the arguments after a command."""
+        if not self.is_command():
+            return self.text
+        parts = self.text.split(maxsplit=1)
+        return parts[1] if len(parts) > 1 else ""
+
+
+@dataclass
+class SendResult:
+    """Result of sending a message."""
+    success: bool
+    message_id: Optional[str] = None
+    error: Optional[str] = None
+    raw_response: Any = None
+
+
+# Type for message handlers
+MessageHandler = Callable[[MessageEvent], Awaitable[Optional[str]]]
+
+
+class BasePlatformAdapter(ABC):
+    """
+    Base class for platform adapters.
+    
+    Subclasses implement platform-specific logic for:
+    - Connecting and authenticating
+    - Receiving messages
+    - Sending messages/responses
+    - Handling media
+    """
+    
+    def __init__(self, config: PlatformConfig, platform: Platform):
+        self.config = config
+        self.platform = platform
+        self._message_handler: Optional[MessageHandler] = None
+        self._running = False
+        
+        # Track active message handlers per session for interrupt support
+        # Key: session_key (e.g., chat_id), Value: asyncio.Event that is set to signal an interrupt
+        self._active_sessions: Dict[str, asyncio.Event] = {}
+        self._pending_messages: Dict[str, MessageEvent] = {}
+    
+    @property
+    def name(self) -> str:
+        """Human-readable name for this adapter."""
+        return self.platform.value.title()
+    
+    @property
+    def is_connected(self) -> bool:
+        """Check if adapter is currently connected."""
+        return self._running
+    
+    def set_message_handler(self, handler: MessageHandler) -> None:
+        """
+        Set the handler for incoming messages.
+        
+        The handler receives a MessageEvent and should return
+        an optional response string.
+        """
+        self._message_handler = handler
+    
+    @abstractmethod
+    async def connect(self) -> bool:
+        """
+        Connect to the platform and start receiving messages.
+        
+        Returns True if connection was successful.
+        """
+        pass
+    
+    @abstractmethod
+    async def disconnect(self) -> None:
+        """Disconnect from the platform."""
+        pass
+    
+    @abstractmethod
+    async def send(
+        self,
+        chat_id: str,
+        content: str,
+        reply_to: Optional[str] = None,
+        metadata: Optional[Dict[str, Any]] = None
+    ) -> SendResult:
+        """
+        Send a message to a chat.
+        
+        Args:
+            chat_id: The chat/channel ID to send to
+            content: Message content (may be markdown)
+            reply_to: Optional message ID to reply to
+            metadata: Additional platform-specific options
+        
+        Returns:
+            SendResult with success status and message ID
+        """
+        pass
+    
+    async def send_typing(self, chat_id: str) -> None:
+        """
+        Send a typing indicator.
+        
+        Override in subclasses if the platform supports it.
+        """
+        pass
+    
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """
+        Send an image natively via the platform API.
+        
+        Override in subclasses to send images as proper attachments
+        instead of plain-text URLs. Default falls back to sending the
+        URL as a text message.
+        """
+        # Fallback: send URL as text (subclasses override for native images)
+        text = f"{caption}\n{image_url}" if caption else image_url
+        return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
+    
+    @staticmethod
+    def extract_images(content: str) -> Tuple[List[Tuple[str, str]], str]:
+        """
+        Extract image URLs from markdown and HTML image tags in a response.
+        
+        Finds patterns like:
+        - ![alt text](https://example.com/image.png)
+        - <img src="https://example.com/image.png">
+        - <img src="https://example.com/image.png"></img>
+        
+        Args:
+            content: The response text to scan.
+        
+        Returns:
+            Tuple of (list of (url, alt_text) pairs, cleaned content with image tags removed).
+        """
+        images = []
+        cleaned = content
+        
+        # Match markdown images: ![alt](url)
+        md_pattern = r'!\[([^\]]*)\]\((https?://[^\s\)]+)\)'
+        for match in re.finditer(md_pattern, content):
+            alt_text = match.group(1)
+            url = match.group(2)
+            # Only extract URLs that look like images: a known extension or image-CDN host appears in the URL
+            if any(marker in url.lower() for marker in
+                   ['.png', '.jpg', '.jpeg', '.gif', '.webp', 'fal.media', 'fal-cdn', 'replicate.delivery']):
+                images.append((url, alt_text))
+        
+        # Match HTML img tags: <img src="url"> or <img src="url"></img> or <img src="url"/>
+        html_pattern = r'<img\s+src=["\']?(https?://[^\s"\'<>]+)["\']?\s*/?>\s*(?:</img>)?'
+        for match in re.finditer(html_pattern, content):
+            url = match.group(1)
+            images.append((url, ""))
+        
+        # Remove matched image tags from content if we found images
+        if images:
+            cleaned = re.sub(md_pattern, '', cleaned)
+            cleaned = re.sub(html_pattern, '', cleaned)
+            # Clean up leftover blank lines
+            cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()
+        
+        return images, cleaned
+    
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """
+        Send an audio file as a native voice message via the platform API.
+        
+        Override in subclasses to send audio as voice bubbles (Telegram)
+        or file attachments (Discord). Default falls back to sending the
+        file path as text.
+        """
+        text = f"🔊 Audio: {audio_path}"
+        if caption:
+            text = f"{caption}\n{text}"
+        return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
+    
+    @staticmethod
+    def extract_media(content: str) -> Tuple[List[Tuple[str, bool]], str]:
+        """
+        Extract MEDIA:<path> tags and [[audio_as_voice]] directives from response text.
+        
+        The TTS tool returns responses like:
+            [[audio_as_voice]]
+            MEDIA:/path/to/audio.ogg
+        
+        Args:
+            content: The response text to scan.
+        
+        Returns:
+            Tuple of (list of (path, is_voice) pairs, cleaned content with tags removed).
+        """
+        media = []
+        cleaned = content
+        
+        # Check for [[audio_as_voice]] directive
+        has_voice_tag = "[[audio_as_voice]]" in content
+        cleaned = cleaned.replace("[[audio_as_voice]]", "")
+        
+        # Extract MEDIA:<path> tags (the path ends at the first whitespace character)
+        media_pattern = r'MEDIA:(\S+)'
+        for match in re.finditer(media_pattern, content):
+            path = match.group(1).strip()
+            if path:
+                media.append((path, has_voice_tag))
+        
+        # Remove MEDIA tags from content
+        if media:
+            cleaned = re.sub(media_pattern, '', cleaned)
+            cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()
+        
+        return media, cleaned
+    
+    async def _keep_typing(self, chat_id: str, interval: float = 2.0) -> None:
+        """
+        Continuously send typing indicator until cancelled.
+        
+        Telegram/Discord typing status expires after ~5 seconds, so we refresh every
+        2 seconds by default to recover quickly after progress messages interrupt it.
+        """
+        try:
+            while True:
+                await self.send_typing(chat_id)
+                await asyncio.sleep(interval)
+        except asyncio.CancelledError:
+            pass  # Normal cancellation when handler completes
+    
+    async def handle_message(self, event: MessageEvent) -> None:
+        """
+        Process an incoming message.
+        
+        This method returns quickly by spawning background tasks.
+        This allows new messages to be processed even while an agent is running,
+        enabling interruption support.
+        """
+        if not self._message_handler:
+            return
+        
+        session_key = event.source.chat_id
+        
+        # Check if there's already an active handler for this session
+        if session_key in self._active_sessions:
+            # Store this as a pending message - it will interrupt the running agent
+            print(f"[{self.name}] ⚡ New message while session {session_key} is active - triggering interrupt")
+            self._pending_messages[session_key] = event
+            # Signal the interrupt (the processing task checks this)
+            self._active_sessions[session_key].set()
+            return  # Don't process now - will be handled after current task finishes
+        
+        # Spawn background task to process this message
+        asyncio.create_task(self._process_message_background(event, session_key))
+    
+    @staticmethod
+    def _get_human_delay() -> float:
+        """
+        Return a random delay in seconds for human-like response pacing.
+
+        Reads from env vars:
+          HERMES_HUMAN_DELAY_MODE: "off" (default) | "natural" | "custom"
+          HERMES_HUMAN_DELAY_MIN_MS: minimum delay in ms (default 800, custom mode)
+          HERMES_HUMAN_DELAY_MAX_MS: maximum delay in ms (default 2500, custom mode)
+        """
+        import random
+
+        mode = os.getenv("HERMES_HUMAN_DELAY_MODE", "off").lower()
+        if mode == "off":
+            return 0.0
+        min_ms = int(os.getenv("HERMES_HUMAN_DELAY_MIN_MS", "800"))
+        max_ms = int(os.getenv("HERMES_HUMAN_DELAY_MAX_MS", "2500"))
+        if mode == "natural":
+            min_ms, max_ms = 800, 2500
+        return random.uniform(min_ms / 1000.0, max_ms / 1000.0)
+
+    async def _process_message_background(self, event: MessageEvent, session_key: str) -> None:
+        """Background task that actually processes the message."""
+        # Create interrupt event for this session
+        interrupt_event = asyncio.Event()
+        self._active_sessions[session_key] = interrupt_event
+        
+        # Start continuous typing indicator (refreshes every 2 seconds)
+        typing_task = asyncio.create_task(self._keep_typing(event.source.chat_id))
+        
+        try:
+            # Call the handler (this can take a while with tool calls)
+            response = await self._message_handler(event)
+            
+            # Send response if any
+            if response:
+                # Extract MEDIA:<path> tags (from TTS tool) before other processing
+                media_files, response = self.extract_media(response)
+                
+                # Extract image URLs and send them as native platform attachments
+                images, text_content = self.extract_images(response)
+                
+                # Send the text portion first (if any remains after extractions)
+                if text_content:
+                    result = await self.send(
+                        chat_id=event.source.chat_id,
+                        content=text_content,
+                        reply_to=event.message_id
+                    )
+                    
+                    # Log send failures (don't raise - user already saw tool progress)
+                    if not result.success:
+                        print(f"[{self.name}] Failed to send response: {result.error}")
+                        # Try sending without markdown as fallback
+                        fallback_result = await self.send(
+                            chat_id=event.source.chat_id,
+                            content=f"(Response formatting failed, plain text:)\n\n{text_content[:3500]}",
+                            reply_to=event.message_id
+                        )
+                        if not fallback_result.success:
+                            print(f"[{self.name}] Fallback send also failed: {fallback_result.error}")
+                
+                # Human-like pacing delay between text and media
+                human_delay = self._get_human_delay()
+                
+                # Send extracted images as native attachments
+                for image_url, alt_text in images:
+                    if human_delay > 0:
+                        await asyncio.sleep(human_delay)
+                    try:
+                        img_result = await self.send_image(
+                            chat_id=event.source.chat_id,
+                            image_url=image_url,
+                            caption=alt_text if alt_text else None,
+                        )
+                        if not img_result.success:
+                            print(f"[{self.name}] Failed to send image: {img_result.error}")
+                    except Exception as img_err:
+                        print(f"[{self.name}] Error sending image: {img_err}")
+                
+                # Send extracted audio/voice files as native attachments
+                for audio_path, _is_voice in media_files:
+                    if human_delay > 0:
+                        await asyncio.sleep(human_delay)
+                    try:
+                        voice_result = await self.send_voice(
+                            chat_id=event.source.chat_id,
+                            audio_path=audio_path,
+                        )
+                        if not voice_result.success:
+                            print(f"[{self.name}] Failed to send voice: {voice_result.error}")
+                    except Exception as voice_err:
+                        print(f"[{self.name}] Error sending voice: {voice_err}")
+            
+            # Check if there's a pending message that was queued during our processing
+            if session_key in self._pending_messages:
+                pending_event = self._pending_messages.pop(session_key)
+                print(f"[{self.name}] 📨 Processing queued message from interrupt")
+                # Clean up current session before processing pending
+                if session_key in self._active_sessions:
+                    del self._active_sessions[session_key]
+                typing_task.cancel()
+                try:
+                    await typing_task
+                except asyncio.CancelledError:
+                    pass
+                # Process the pending message inline (awaited here, not spawned as a new task)
+                await self._process_message_background(pending_event, session_key)
+                return  # Session already cleaned up above; the finally block re-runs harmlessly
+                
+        except Exception as e:
+            print(f"[{self.name}] Error handling message: {e}")
+            import traceback
+            traceback.print_exc()
+        finally:
+            # Stop typing indicator
+            typing_task.cancel()
+            try:
+                await typing_task
+            except asyncio.CancelledError:
+                pass
+            # Clean up session tracking
+            if session_key in self._active_sessions:
+                del self._active_sessions[session_key]
+    
+    def has_pending_interrupt(self, session_key: str) -> bool:
+        """Check if there's a pending interrupt for a session."""
+        return session_key in self._active_sessions and self._active_sessions[session_key].is_set()
+    
+    def get_pending_message(self, session_key: str) -> Optional[MessageEvent]:
+        """Get and clear any pending message for a session."""
+        return self._pending_messages.pop(session_key, None)
+    
+    def build_source(
+        self,
+        chat_id: str,
+        chat_name: Optional[str] = None,
+        chat_type: str = "dm",
+        user_id: Optional[str] = None,
+        user_name: Optional[str] = None,
+        thread_id: Optional[str] = None
+    ) -> SessionSource:
+        """Helper to build a SessionSource for this platform."""
+        return SessionSource(
+            platform=self.platform,
+            chat_id=str(chat_id),
+            chat_name=chat_name,
+            chat_type=chat_type,
+            user_id=str(user_id) if user_id else None,
+            user_name=user_name,
+            thread_id=str(thread_id) if thread_id else None,
+        )
+    
+    @abstractmethod
+    async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
+        """
+        Get information about a chat/channel.
+        
+        Returns dict with at least:
+        - name: Chat name
+        - type: "dm", "group", "channel"
+        """
+        pass
+    
+    def format_message(self, content: str) -> str:
+        """
+        Format a message for this platform.
+        
+        Override in subclasses to handle platform-specific formatting
+        (e.g., Telegram MarkdownV2, Discord markdown).
+        
+        Default implementation returns content as-is.
+        """
+        return content
+    
+    def truncate_message(self, content: str, max_length: int = 4096) -> List[str]:
+        """
+        Split a long message into chunks.
+        
+        Args:
+            content: The full message content
+            max_length: Maximum length per chunk (platform-specific)
+        
+        Returns:
+            List of message chunks
+        """
+        if len(content) <= max_length:
+            return [content]
+        
+        chunks = []
+        while content:
+            if len(content) <= max_length:
+                chunks.append(content)
+                break
+            
+            # Try to split at a newline (index 0 would yield an empty chunk)
+            split_idx = content.rfind("\n", 0, max_length)
+            if split_idx <= 0:
+                # No usable newline, split at space
+                split_idx = content.rfind(" ", 0, max_length)
+            if split_idx <= 0:
+                # No usable space either, hard split
+                split_idx = max_length
+            
+            chunks.append(content[:split_idx])
+            content = content[split_idx:].lstrip()
+        
+        return chunks
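The chunking strategy in `truncate_message` (prefer the last newline, then the last space, then a hard cut) can be tried out in isolation. The following is a minimal standalone sketch, not the adapter's code: `split_message` is a hypothetical free function, and it assumes a boundary found at index 0 should be treated as unusable so that no empty chunk is emitted.

```python
from typing import List


def split_message(content: str, max_length: int = 4096) -> List[str]:
    """Split content into chunks of at most max_length characters."""
    if len(content) <= max_length:
        return [content]

    chunks: List[str] = []
    while content:
        if len(content) <= max_length:
            chunks.append(content)
            break
        # Prefer the last newline inside the window, then the last space.
        split_idx = content.rfind("\n", 0, max_length)
        if split_idx <= 0:
            split_idx = content.rfind(" ", 0, max_length)
        if split_idx <= 0:
            # No usable boundary: cut exactly at the limit.
            split_idx = max_length
        chunks.append(content[:split_idx])
        # Drop the whitespace that served as the boundary.
        content = content[split_idx:].lstrip()
    return chunks


if __name__ == "__main__":
    for chunk in split_message("alpha beta\ngamma delta epsilon", max_length=12):
        print(repr(chunk))
```

Because the boundary whitespace is stripped, joining the chunks does not reproduce the input byte for byte; callers that need lossless splitting would have to keep the separator.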
@@ -0,0 +1,679 @@
+"""
+Discord platform adapter.
+
+Uses discord.py library for:
+- Receiving messages from servers and DMs
+- Sending responses back
+- Handling threads and channels
+"""
+
+import asyncio
+import os
+from typing import Dict, List, Optional, Any
+
+try:
+    import discord
+    from discord import Message as DiscordMessage, Intents
+    from discord.ext import commands
+    DISCORD_AVAILABLE = True
+except ImportError:
+    DISCORD_AVAILABLE = False
+    discord = None
+    DiscordMessage = Any
+    Intents = Any
+    commands = None
+
+import sys
+sys.path.insert(0, str(__file__).rsplit("/", 3)[0])
+
+from gateway.config import Platform, PlatformConfig
+from gateway.platforms.base import (
+    BasePlatformAdapter,
+    MessageEvent,
+    MessageType,
+    SendResult,
+    cache_image_from_url,
+    cache_audio_from_url,
+)
+
+
+def check_discord_requirements() -> bool:
+    """Check if Discord dependencies are available."""
+    return DISCORD_AVAILABLE
+
+
+class DiscordAdapter(BasePlatformAdapter):
+    """
+    Discord bot adapter.
+    
+    Handles:
+    - Receiving messages from servers and DMs
+    - Sending responses with Discord markdown
+    - Thread support
+    - Native slash commands (/ask, /reset, /status, /stop)
+    - Button-based exec approvals
+    - Auto-threading for long conversations
+    - Reaction-based feedback
+    """
+    
+    # Discord message limits
+    MAX_MESSAGE_LENGTH = 2000
+    
+    def __init__(self, config: PlatformConfig):
+        super().__init__(config, Platform.DISCORD)
+        self._client: Optional[commands.Bot] = None
+        self._ready_event = asyncio.Event()
+        self._allowed_user_ids: set = set()  # For button approval authorization
+    
+    async def connect(self) -> bool:
+        """Connect to Discord and start receiving events."""
+        if not DISCORD_AVAILABLE:
+            print(f"[{self.name}] discord.py not installed. Run: pip install discord.py")
+            return False
+        
+        if not self.config.token:
+            print(f"[{self.name}] No bot token configured")
+            return False
+        
+        try:
+            # Set up intents
+            intents = Intents.default()
+            intents.message_content = True
+            intents.dm_messages = True
+            intents.guild_messages = True
+            
+            # Create bot
+            self._client = commands.Bot(
+                command_prefix="!",  # Not really used, we handle raw messages
+                intents=intents,
+            )
+            
+            # Parse allowed user IDs for button authorization
+            allowed_env = os.getenv("DISCORD_ALLOWED_USERS", "")
+            if allowed_env:
+                self._allowed_user_ids = {
+                    uid.strip() for uid in allowed_env.split(",") if uid.strip()
+                }
+            
+            # Register event handlers
+            @self._client.event
+            async def on_ready():
+                print(f"[{self.name}] Connected as {self._client.user}")
+                # Sync slash commands with Discord
+                try:
+                    synced = await self._client.tree.sync()
+                    print(f"[{self.name}] Synced {len(synced)} slash command(s)")
+                except Exception as e:
+                    print(f"[{self.name}] Slash command sync failed: {e}")
+                self._ready_event.set()
+            
+            @self._client.event
+            async def on_message(message: DiscordMessage):
+                # Ignore bot's own messages
+                if message.author == self._client.user:
+                    return
+                await self._handle_message(message)
+            
+            # Register slash commands
+            self._register_slash_commands()
+            
+            # Start the bot in background; keep a reference so the task is not garbage-collected
+            self._bot_task = asyncio.create_task(self._client.start(self.config.token))
+            
+            # Wait for ready
+            await asyncio.wait_for(self._ready_event.wait(), timeout=30)
+            
+            self._running = True
+            return True
+            
+        except asyncio.TimeoutError:
+            print(f"[{self.name}] Timeout waiting for connection")
+            return False
+        except Exception as e:
+            print(f"[{self.name}] Failed to connect: {e}")
+            return False
+    
+    async def disconnect(self) -> None:
+        """Disconnect from Discord."""
+        if self._client:
+            try:
+                await self._client.close()
+            except Exception as e:
+                print(f"[{self.name}] Error during disconnect: {e}")
+        
+        self._running = False
+        self._client = None
+        self._ready_event.clear()
+        print(f"[{self.name}] Disconnected")
+    
+    async def send(
+        self,
+        chat_id: str,
+        content: str,
+        reply_to: Optional[str] = None,
+        metadata: Optional[Dict[str, Any]] = None
+    ) -> SendResult:
+        """Send a message to a Discord channel."""
+        if not self._client:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            # Get the channel
+            channel = self._client.get_channel(int(chat_id))
+            if not channel:
+                channel = await self._client.fetch_channel(int(chat_id))
+            
+            if not channel:
+                return SendResult(success=False, error=f"Channel {chat_id} not found")
+            
+            # Format and split message if needed
+            formatted = self.format_message(content)
+            chunks = self.truncate_message(formatted, self.MAX_MESSAGE_LENGTH)
+            
+            message_ids = []
+            reference = None
+            
+            if reply_to:
+                try:
+                    ref_msg = await channel.fetch_message(int(reply_to))
+                    reference = ref_msg
+                except Exception:
+                    pass  # Ignore if we can't find the referenced message
+            
+            for i, chunk in enumerate(chunks):
+                msg = await channel.send(
+                    content=chunk,
+                    reference=reference if i == 0 else None,
+                )
+                message_ids.append(str(msg.id))
+            
+            return SendResult(
+                success=True,
+                message_id=message_ids[0] if message_ids else None,
+                raw_response={"message_ids": message_ids}
+            )
+            
+        except Exception as e:
+            return SendResult(success=False, error=str(e))
+    
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send audio as a Discord file attachment."""
+        if not self._client:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            import io
+            
+            channel = self._client.get_channel(int(chat_id))
+            if not channel:
+                channel = await self._client.fetch_channel(int(chat_id))
+            if not channel:
+                return SendResult(success=False, error=f"Channel {chat_id} not found")
+            
+            if not os.path.exists(audio_path):
+                return SendResult(success=False, error=f"Audio file not found: {audio_path}")
+            
+            # Determine filename from path
+            filename = os.path.basename(audio_path)
+            
+            with open(audio_path, "rb") as f:
+                file = discord.File(io.BytesIO(f.read()), filename=filename)
+                msg = await channel.send(
+                    content=caption if caption else None,
+                    file=file,
+                )
+                return SendResult(success=True, message_id=str(msg.id))
+        
+        except Exception as e:
+            print(f"[{self.name}] Failed to send audio: {e}")
+            return await super().send_voice(chat_id, audio_path, caption, reply_to)
+    
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send an image natively as a Discord file attachment."""
+        if not self._client:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            import aiohttp
+            
+            channel = self._client.get_channel(int(chat_id))
+            if not channel:
+                channel = await self._client.fetch_channel(int(chat_id))
+            if not channel:
+                return SendResult(success=False, error=f"Channel {chat_id} not found")
+            
+            # Download the image and send as a Discord file attachment
+            # (Discord renders attachments inline, unlike plain URLs)
+            async with aiohttp.ClientSession() as session:
+                async with session.get(image_url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
+                    if resp.status != 200:
+                        raise Exception(f"Failed to download image: HTTP {resp.status}")
+                    
+                    image_data = await resp.read()
+                    
+                    # Determine filename from URL or content type
+                    content_type = resp.headers.get("content-type", "image/png")
+                    ext = "png"
+                    if "jpeg" in content_type or "jpg" in content_type:
+                        ext = "jpg"
+                    elif "gif" in content_type:
+                        ext = "gif"
+                    elif "webp" in content_type:
+                        ext = "webp"
+                    
+                    import io
+                    file = discord.File(io.BytesIO(image_data), filename=f"image.{ext}")
+                    
+                    msg = await channel.send(
+                        content=caption if caption else None,
+                        file=file,
+                    )
+                    return SendResult(success=True, message_id=str(msg.id))
+        
+        except ImportError:
+            print(f"[{self.name}] aiohttp not installed, falling back to URL. Run: pip install aiohttp")
+            return await super().send_image(chat_id, image_url, caption, reply_to)
+        except Exception as e:
+            print(f"[{self.name}] Failed to send image attachment, falling back to URL: {e}")
+            return await super().send_image(chat_id, image_url, caption, reply_to)
+    
+    async def send_typing(self, chat_id: str) -> None:
+        """Send typing indicator."""
+        if self._client:
+            try:
+                channel = self._client.get_channel(int(chat_id))
+                if channel:
+                    await channel.typing()
+            except Exception:
+                pass  # Ignore typing indicator failures
+    
+    async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
+        """Get information about a Discord channel."""
+        if not self._client:
+            return {"name": "Unknown", "type": "dm"}
+        
+        try:
+            channel = self._client.get_channel(int(chat_id))
+            if not channel:
+                channel = await self._client.fetch_channel(int(chat_id))
+            
+            if not channel:
+                return {"name": str(chat_id), "type": "dm"}
+            
+            # Determine channel type
+            if isinstance(channel, discord.DMChannel):
+                chat_type = "dm"
+                name = channel.recipient.name if channel.recipient else str(chat_id)
+            elif isinstance(channel, discord.Thread):
+                chat_type = "thread"
+                name = channel.name
+            elif isinstance(channel, discord.TextChannel):
+                chat_type = "channel"
+                name = f"#{channel.name}"
+                if channel.guild:
+                    name = f"{channel.guild.name} / {name}"
+            else:
+                chat_type = "channel"
+                name = getattr(channel, "name", str(chat_id))
+            
+            return {
+                "name": name,
+                "type": chat_type,
+                "guild_id": str(channel.guild.id) if hasattr(channel, "guild") and channel.guild else None,
+                "guild_name": channel.guild.name if hasattr(channel, "guild") and channel.guild else None,
+            }
+        except Exception as e:
+            return {"name": str(chat_id), "type": "dm", "error": str(e)}
+    
+    def format_message(self, content: str) -> str:
+        """
+        Format message for Discord.
+        
+        Discord uses its own markdown variant.
+        """
+        # Discord markdown is fairly standard, no special escaping needed
+        return content
+    
+    def _register_slash_commands(self) -> None:
+        """Register Discord slash commands on the command tree."""
+        if not self._client:
+            return
+
+        tree = self._client.tree
+
+        @tree.command(name="ask", description="Ask Hermes a question")
+        @discord.app_commands.describe(question="Your question for Hermes")
+        async def slash_ask(interaction: discord.Interaction, question: str):
+            await interaction.response.defer()
+            event = self._build_slash_event(interaction, question)
+            await self.handle_message(event)
+            # The response is sent via the normal send() flow
+            # Send a followup to close the interaction if needed
+            try:
+                await interaction.followup.send("Processing complete~", ephemeral=True)
+            except Exception:
+                pass
+
+        @tree.command(name="reset", description="Reset your Hermes session")
+        async def slash_reset(interaction: discord.Interaction):
+            await interaction.response.defer(ephemeral=True)
+            event = self._build_slash_event(interaction, "/reset")
+            await self.handle_message(event)
+            try:
+                await interaction.followup.send("Session reset~", ephemeral=True)
+            except Exception:
+                pass
+
+        @tree.command(name="status", description="Show Hermes session status")
+        async def slash_status(interaction: discord.Interaction):
+            await interaction.response.defer(ephemeral=True)
+            event = self._build_slash_event(interaction, "/status")
+            await self.handle_message(event)
+            try:
+                await interaction.followup.send("Status sent~", ephemeral=True)
+            except Exception:
+                pass
+
+        @tree.command(name="stop", description="Stop the running Hermes agent")
+        async def slash_stop(interaction: discord.Interaction):
+            await interaction.response.defer(ephemeral=True)
+            event = self._build_slash_event(interaction, "/stop")
+            await self.handle_message(event)
+            try:
+                await interaction.followup.send("Stop requested~", ephemeral=True)
+            except Exception:
+                pass
+
+    def _build_slash_event(self, interaction: discord.Interaction, text: str) -> MessageEvent:
+        """Build a MessageEvent from a Discord slash command interaction."""
+        is_dm = isinstance(interaction.channel, discord.DMChannel)
+        chat_type = "dm" if is_dm else "group"
+        chat_name = ""
+        if not is_dm and hasattr(interaction.channel, "name"):
+            chat_name = interaction.channel.name
+            if hasattr(interaction.channel, "guild") and interaction.channel.guild:
+                chat_name = f"{interaction.channel.guild.name} / #{chat_name}"
+
+        source = self.build_source(
+            chat_id=str(interaction.channel_id),
+            chat_name=chat_name,
+            chat_type=chat_type,
+            user_id=str(interaction.user.id),
+            user_name=interaction.user.display_name,
+        )
+
+        msg_type = MessageType.COMMAND if text.startswith("/") else MessageType.TEXT
+        return MessageEvent(
+            text=text,
+            message_type=msg_type,
+            source=source,
+            raw_message=interaction,
+        )
+
+    async def send_exec_approval(
+        self, chat_id: str, command: str, approval_id: str
+    ) -> SendResult:
+        """
+        Send a button-based exec approval prompt for a dangerous command.
+
+        Returns SendResult. The approval is resolved when a user clicks a button.
+        """
+        if not self._client or not DISCORD_AVAILABLE:
+            return SendResult(success=False, error="Not connected")
+
+        try:
+            channel = self._client.get_channel(int(chat_id))
+            if not channel:
+                channel = await self._client.fetch_channel(int(chat_id))
+
+            embed = discord.Embed(
+                title="Command Approval Required",
+                description=f"```\n{command[:500]}\n```",
+                color=discord.Color.orange(),
+            )
+            embed.set_footer(text=f"Approval ID: {approval_id}")
+
+            view = ExecApprovalView(
+                approval_id=approval_id,
+                allowed_user_ids=self._allowed_user_ids,
+            )
+
+            msg = await channel.send(embed=embed, view=view)
+            return SendResult(success=True, message_id=str(msg.id))
+
+        except Exception as e:
+            return SendResult(success=False, error=str(e))
+
+    async def _handle_message(self, message: DiscordMessage) -> None:
+        """Handle incoming Discord messages."""
+        # In server channels (not DMs), require the bot to be @mentioned
+        # UNLESS the channel is in the free-response list.
+        #
+        # Config:
+        #   DISCORD_FREE_RESPONSE_CHANNELS: Comma-separated channel IDs where the
+        #       bot responds to every message without needing a mention.
+        #   DISCORD_REQUIRE_MENTION: Set to "false" to disable mention requirement
+        #       globally (all channels become free-response). Default: "true".
+        
+        if not isinstance(message.channel, discord.DMChannel):
+            # Check if this channel is in the free-response list
+            free_channels_raw = os.getenv("DISCORD_FREE_RESPONSE_CHANNELS", "")
+            free_channels = {ch.strip() for ch in free_channels_raw.split(",") if ch.strip()}
+            channel_id = str(message.channel.id)
+            
+            # Global override: if DISCORD_REQUIRE_MENTION=false, all channels are free
+            require_mention = os.getenv("DISCORD_REQUIRE_MENTION", "true").lower() not in ("false", "0", "no")
+            
+            is_free_channel = channel_id in free_channels
+            
+            if require_mention and not is_free_channel:
+                # Must be @mentioned to respond
+                if self._client.user not in message.mentions:
+                    return  # Silently ignore messages that don't mention the bot
+            
+            # Strip the bot mention from the message text so the agent sees clean input
+            if self._client.user and self._client.user in message.mentions:
+                message.content = message.content.replace(f"<@{self._client.user.id}>", "").strip()
+                message.content = message.content.replace(f"<@!{self._client.user.id}>", "").strip()
+        
+        # Determine message type
+        msg_type = MessageType.TEXT
+        if message.content.startswith("/"):
+            msg_type = MessageType.COMMAND
+        elif message.attachments:
+            # Check attachment types
+            for att in message.attachments:
+                if att.content_type:
+                    if att.content_type.startswith("image/"):
+                        msg_type = MessageType.PHOTO
+                    elif att.content_type.startswith("video/"):
+                        msg_type = MessageType.VIDEO
+                    elif att.content_type.startswith("audio/"):
+                        msg_type = MessageType.AUDIO
+                    else:
+                        msg_type = MessageType.DOCUMENT
+                    break
+        
+        # Determine chat type
+        if isinstance(message.channel, discord.DMChannel):
+            chat_type = "dm"
+            chat_name = message.author.name
+        elif isinstance(message.channel, discord.Thread):
+            chat_type = "thread"
+            chat_name = message.channel.name
+        else:
+            chat_type = "group"  # Treat server channels as groups
+            chat_name = getattr(message.channel, "name", str(message.channel.id))
+            if hasattr(message.channel, "guild") and message.channel.guild:
+                chat_name = f"{message.channel.guild.name} / #{chat_name}"
+        
+        # Get thread ID if in a thread
+        thread_id = None
+        if isinstance(message.channel, discord.Thread):
+            thread_id = str(message.channel.id)
+        
+        # Build source
+        source = self.build_source(
+            chat_id=str(message.channel.id),
+            chat_name=chat_name,
+            chat_type=chat_type,
+            user_id=str(message.author.id),
+            user_name=message.author.display_name,
+            thread_id=thread_id,
+        )
+        
+        # Build media URLs -- download image attachments to local cache so the
+        # vision tool can access them reliably (Discord CDN URLs can expire).
+        media_urls = []
+        media_types = []
+        for att in message.attachments:
+            content_type = att.content_type or "unknown"
+            if content_type.startswith("image/"):
+                try:
+                    # Determine extension from content type (image/png -> .png)
+                    ext = "." + content_type.split("/")[-1].split(";")[0]
+                    if ext not in (".jpg", ".jpeg", ".png", ".gif", ".webp"):
+                        ext = ".jpg"
+                    cached_path = await cache_image_from_url(att.url, ext=ext)
+                    media_urls.append(cached_path)
+                    media_types.append(content_type)
+                    print(f"[Discord] Cached user image: {cached_path}", flush=True)
+                except Exception as e:
+                    print(f"[Discord] Failed to cache image attachment: {e}", flush=True)
+                    # Fall back to the CDN URL if caching fails
+                    media_urls.append(att.url)
+                    media_types.append(content_type)
+            elif content_type.startswith("audio/"):
+                try:
+                    ext = "." + content_type.split("/")[-1].split(";")[0]
+                    if ext not in (".ogg", ".mp3", ".wav", ".webm", ".m4a"):
+                        ext = ".ogg"
+                    cached_path = await cache_audio_from_url(att.url, ext=ext)
+                    media_urls.append(cached_path)
+                    media_types.append(content_type)
+                    print(f"[Discord] Cached user audio: {cached_path}", flush=True)
+                except Exception as e:
+                    print(f"[Discord] Failed to cache audio attachment: {e}", flush=True)
+                    media_urls.append(att.url)
+                    media_types.append(content_type)
+            else:
+                # Other attachments: keep the original URL
+                media_urls.append(att.url)
+                media_types.append(content_type)
+        
+        event = MessageEvent(
+            text=message.content,
+            message_type=msg_type,
+            source=source,
+            raw_message=message,
+            message_id=str(message.id),
+            media_urls=media_urls,
+            media_types=media_types,
+            reply_to_message_id=str(message.reference.message_id) if message.reference else None,
+            timestamp=message.created_at,
+        )
+        
+        await self.handle_message(event)
+
+
+# ---------------------------------------------------------------------------
+# Discord UI Components (outside the adapter class)
+# ---------------------------------------------------------------------------
+
+if DISCORD_AVAILABLE:
+
+    class ExecApprovalView(discord.ui.View):
+        """
+        Interactive button view for exec approval of dangerous commands.
+
+        Shows three buttons: Allow Once (green), Always Allow (blue), Deny (red).
+        Only users in the allowed list can click. The view times out after 5 minutes.
+        """
+
+        def __init__(self, approval_id: str, allowed_user_ids: set):
+            super().__init__(timeout=300)  # 5-minute timeout
+            self.approval_id = approval_id
+            self.allowed_user_ids = allowed_user_ids
+            self.resolved = False
+
+        def _check_auth(self, interaction: discord.Interaction) -> bool:
+            """Verify the user clicking is authorized."""
+            if not self.allowed_user_ids:
+                return True  # No allowlist = anyone can approve
+            return str(interaction.user.id) in self.allowed_user_ids
+
+        async def _resolve(
+            self, interaction: discord.Interaction, action: str, color: discord.Color
+        ):
+            """Resolve the approval and update the message."""
+            if self.resolved:
+                await interaction.response.send_message(
+                    "This approval has already been resolved~", ephemeral=True
+                )
+                return
+
+            if not self._check_auth(interaction):
+                await interaction.response.send_message(
+                    "You're not authorized to approve commands~", ephemeral=True
+                )
+                return
+
+            self.resolved = True
+
+            # Update the embed with the decision
+            embed = interaction.message.embeds[0] if interaction.message.embeds else None
+            if embed:
+                embed.color = color
+                embed.set_footer(text=f"{action} by {interaction.user.display_name}")
+
+            # Disable all buttons
+            for child in self.children:
+                child.disabled = True
+
+            await interaction.response.edit_message(embed=embed, view=self)
+
+            # Record "always allow" for this session; one-time approvals and denials are not stored here
+            try:
+                from tools.terminal_tool import _session_approved_patterns
+                if action == "allow_once":
+                    pass  # One-time approval handled by gateway
+                elif action == "allow_always":
+                    _session_approved_patterns.add(self.approval_id)
+            except ImportError:
+                pass
+
+        @discord.ui.button(label="Allow Once", style=discord.ButtonStyle.green)
+        async def allow_once(
+            self, interaction: discord.Interaction, button: discord.ui.Button
+        ):
+            await self._resolve(interaction, "allow_once", discord.Color.green())
+
+        @discord.ui.button(label="Always Allow", style=discord.ButtonStyle.blurple)
+        async def allow_always(
+            self, interaction: discord.Interaction, button: discord.ui.Button
+        ):
+            await self._resolve(interaction, "allow_always", discord.Color.blue())
+
+        @discord.ui.button(label="Deny", style=discord.ButtonStyle.red)
+        async def deny(
+            self, interaction: discord.Interaction, button: discord.ui.Button
+        ):
+            await self._resolve(interaction, "deny", discord.Color.red())
+
+        async def on_timeout(self):
+            """Handle view timeout: mark as resolved and disable buttons (the posted message is not re-edited here)."""
+            self.resolved = True
+            for child in self.children:
+                child.disabled = True
@@ -0,0 +1,374 @@
+"""
+Slack platform adapter.
+
+Uses slack-bolt (Python) with Socket Mode for:
+- Receiving messages from channels and DMs
+- Sending responses back
+- Handling slash commands
+- Thread support
+"""
+
+import asyncio
+import os
+from typing import Dict, List, Optional, Any
+
+try:
+    from slack_bolt.async_app import AsyncApp
+    from slack_bolt.adapter.socket_mode.async_handler import AsyncSocketModeHandler
+    from slack_sdk.web.async_client import AsyncWebClient
+    SLACK_AVAILABLE = True
+except ImportError:
+    SLACK_AVAILABLE = False
+    AsyncApp = Any
+    AsyncSocketModeHandler = Any
+    AsyncWebClient = Any
+
+import sys
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
+
+from gateway.config import Platform, PlatformConfig
+from gateway.platforms.base import (
+    BasePlatformAdapter,
+    MessageEvent,
+    MessageType,
+    SendResult,
+    cache_image_from_url,
+    cache_audio_from_url,
+)
+
+
+def check_slack_requirements() -> bool:
+    """Check if Slack dependencies are available."""
+    return SLACK_AVAILABLE
+
+
+class SlackAdapter(BasePlatformAdapter):
+    """
+    Slack bot adapter using Socket Mode.
+
+    Requires two tokens:
+      - SLACK_BOT_TOKEN (xoxb-...) for API calls
+      - SLACK_APP_TOKEN (xapp-...) for Socket Mode connection
+
+    Features:
+      - DMs and channel messages (mention-gated in channels)
+      - Thread support
+      - File/image/audio attachments
+      - Slash commands (/hermes)
+      - Typing indicators are a no-op (Slack exposes no direct typing indicator API for bots)
+    """
+
+    MAX_MESSAGE_LENGTH = 4000  # Slack's limit is higher but mrkdwn can inflate
+
+    def __init__(self, config: PlatformConfig):
+        super().__init__(config, Platform.SLACK)
+        self._app: Optional[AsyncApp] = None
+        self._handler: Optional[AsyncSocketModeHandler] = None
+        self._bot_user_id: Optional[str] = None
+
+    async def connect(self) -> bool:
+        """Connect to Slack via Socket Mode."""
+        if not SLACK_AVAILABLE:
+            print("[Slack] slack-bolt (async) not installed. Run: pip install slack-bolt aiohttp")
+            return False
+
+        bot_token = self.config.token
+        app_token = os.getenv("SLACK_APP_TOKEN")
+
+        if not bot_token:
+            print("[Slack] SLACK_BOT_TOKEN not set")
+            return False
+        if not app_token:
+            print("[Slack] SLACK_APP_TOKEN not set")
+            return False
+
+        try:
+            self._app = AsyncApp(token=bot_token)
+
+            # Get our own bot user ID for mention detection
+            auth_response = await self._app.client.auth_test()
+            self._bot_user_id = auth_response.get("user_id")
+            bot_name = auth_response.get("user", "unknown")
+
+            # Register message event handler
+            @self._app.event("message")
+            async def handle_message_event(event, say):
+                await self._handle_slack_message(event)
+
+            # Register slash command handler
+            @self._app.command("/hermes")
+            async def handle_hermes_command(ack, command):
+                await ack()
+                await self._handle_slash_command(command)
+
+            # Start Socket Mode handler in background
+            self._handler = AsyncSocketModeHandler(self._app, app_token)
+            self._socket_task = asyncio.create_task(self._handler.start_async())  # keep a reference so the task is not garbage-collected
+
+            self._running = True
+            print(f"[Slack] Connected as @{bot_name} (Socket Mode)")
+            return True
+
+        except Exception as e:
+            print(f"[Slack] Connection failed: {e}")
+            return False
+
+    async def disconnect(self) -> None:
+        """Disconnect from Slack."""
+        if self._handler:
+            await self._handler.close_async()
+        self._running = False
+        print("[Slack] Disconnected")
+
+    async def send(
+        self,
+        chat_id: str,
+        content: str,
+        reply_to: Optional[str] = None,
+        metadata: Optional[Dict[str, Any]] = None,
+    ) -> SendResult:
+        """Send a message to a Slack channel or DM."""
+        if not self._app:
+            return SendResult(success=False, error="Not connected")
+
+        try:
+            kwargs = {
+                "channel": chat_id,
+                "text": content,
+            }
+
+            # Reply in thread if thread_ts is available
+            if reply_to:
+                kwargs["thread_ts"] = reply_to
+            elif metadata and metadata.get("thread_ts"):
+                kwargs["thread_ts"] = metadata["thread_ts"]
+
+            result = await self._app.client.chat_postMessage(**kwargs)
+
+            return SendResult(
+                success=True,
+                message_id=result.get("ts"),
+                raw_response=result,
+            )
+
+        except Exception as e:
+            print(f"[Slack] Send error: {e}")
+            return SendResult(success=False, error=str(e))
+
+    async def send_typing(self, chat_id: str) -> None:
+        """Slack doesn't have a direct typing indicator API for bots."""
+        pass
+
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send an image to Slack by uploading the URL as a file."""
+        if not self._app:
+            return SendResult(success=False, error="Not connected")
+
+        try:
+            import httpx
+
+            # Download the image first
+            async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
+                response = await client.get(image_url)
+                response.raise_for_status()
+
+            result = await self._app.client.files_upload_v2(
+                channel=chat_id,
+                content=response.content,
+                filename="image.png",
+                initial_comment=caption or "",
+                thread_ts=reply_to,
+            )
+
+            return SendResult(success=True, raw_response=result)
+
+        except Exception as e:
+            print(f"[Slack] Image upload failed, falling back to sending the URL as text: {e}")
+            text = f"{caption}\n{image_url}" if caption else image_url
+            return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
+
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send an audio file to Slack."""
+        if not self._app:
+            return SendResult(success=False, error="Not connected")
+
+        try:
+            result = await self._app.client.files_upload_v2(
+                channel=chat_id,
+                file=audio_path,
+                filename=os.path.basename(audio_path),
+                initial_comment=caption or "",
+                thread_ts=reply_to,
+            )
+            return SendResult(success=True, raw_response=result)
+
+        except Exception as e:
+            return SendResult(success=False, error=str(e))
+
+    async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
+        """Get information about a Slack channel."""
+        if not self._app:
+            return {"name": chat_id, "type": "unknown"}
+
+        try:
+            result = await self._app.client.conversations_info(channel=chat_id)
+            channel = result.get("channel", {})
+            is_dm = channel.get("is_im", False)
+            return {
+                "name": channel.get("name", chat_id),
+                "type": "dm" if is_dm else "group",
+            }
+        except Exception:
+            return {"name": chat_id, "type": "unknown"}
+
+    # ----- Internal handlers -----
+
+    async def _handle_slack_message(self, event: dict) -> None:
+        """Handle an incoming Slack message event."""
+        # Ignore bot messages (including our own)
+        if event.get("bot_id") or event.get("subtype") == "bot_message":
+            return
+
+        # Ignore message edits and deletions
+        subtype = event.get("subtype")
+        if subtype in ("message_changed", "message_deleted"):
+            return
+
+        text = event.get("text", "")
+        user_id = event.get("user", "")
+        channel_id = event.get("channel", "")
+        thread_ts = event.get("thread_ts") or event.get("ts")
+        ts = event.get("ts", "")
+
+        # Determine if this is a DM or channel message
+        channel_type = event.get("channel_type", "")
+        is_dm = channel_type == "im"
+
+        # In channels, only respond if bot is mentioned
+        if not is_dm and self._bot_user_id:
+            if f"<@{self._bot_user_id}>" not in text:
+                return
+            # Strip the bot mention from the text
+            text = text.replace(f"<@{self._bot_user_id}>", "").strip()
+
+        # Determine message type
+        msg_type = MessageType.TEXT
+        if text.startswith("/"):
+            msg_type = MessageType.COMMAND
+
+        # Handle file attachments
+        media_urls = []
+        media_types = []
+        files = event.get("files", [])
+        for f in files:
+            mimetype = f.get("mimetype", "unknown")
+            url = f.get("url_private_download") or f.get("url_private", "")
+            if mimetype.startswith("image/") and url:
+                try:
+                    ext = "." + mimetype.split("/")[-1].split(";")[0]
+                    if ext not in (".jpg", ".jpeg", ".png", ".gif", ".webp"):
+                        ext = ".jpg"
+                    # Slack private URLs require the bot token as auth header
+                    cached = await self._download_slack_file(url, ext)
+                    media_urls.append(cached)
+                    media_types.append(mimetype)
+                    msg_type = MessageType.PHOTO
+                except Exception as e:
+                    print(f"[Slack] Failed to cache image: {e}", flush=True)
+            elif mimetype.startswith("audio/") and url:
+                try:
+                    ext = "." + mimetype.split("/")[-1].split(";")[0]
+                    ext = {".mpeg": ".mp3", ".mp4": ".m4a"}.get(ext, ext)  # audio/mpeg is MP3, audio/mp4 is M4A
+                    ext = ext if ext in (".ogg", ".mp3", ".wav", ".webm", ".m4a") else ".ogg"
+                    cached = await self._download_slack_file(url, ext, audio=True)
+                    media_urls.append(cached)
+                    media_types.append(mimetype)
+                    msg_type = MessageType.VOICE
+                except Exception as e:
+                    print(f"[Slack] Failed to cache audio: {e}", flush=True)
+
+        # Build source
+        source = self.build_source(
+            chat_id=channel_id,
+            chat_name=channel_id,  # Will be resolved later if needed
+            chat_type="dm" if is_dm else "group",
+            user_id=user_id,
+            thread_id=thread_ts,
+        )
+
+        msg_event = MessageEvent(
+            text=text,
+            message_type=msg_type,
+            source=source,
+            raw_message=event,
+            message_id=ts,
+            media_urls=media_urls,
+            media_types=media_types,
+            reply_to_message_id=thread_ts if thread_ts != ts else None,
+        )
+
+        await self.handle_message(msg_event)
+
+    async def _handle_slash_command(self, command: dict) -> None:
+        """Handle /hermes slash command."""
+        text = command.get("text", "").strip()
+        user_id = command.get("user_id", "")
+        channel_id = command.get("channel_id", "")
+
+        # Map common slash subcommands to gateway commands
+        if text in ("new", "reset"):
+            text = "/reset"
+        elif text == "status":
+            text = "/status"
+        elif text == "stop":
+            text = "/stop"
+        elif text:
+            pass  # Treat as a regular question
+        else:
+            text = "/help"
+
+        source = self.build_source(
+            chat_id=channel_id,
+            chat_type="dm",  # Slash commands are always in DM-like context
+            user_id=user_id,
+        )
+
+        event = MessageEvent(
+            text=text,
+            message_type=MessageType.COMMAND if text.startswith("/") else MessageType.TEXT,
+            source=source,
+            raw_message=command,
+        )
+
+        await self.handle_message(event)
+
+    async def _download_slack_file(self, url: str, ext: str, audio: bool = False) -> str:
+        """Download a Slack file using the bot token for auth."""
+        import httpx
+
+        bot_token = self.config.token
+        async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
+            response = await client.get(
+                url,
+                headers={"Authorization": f"Bearer {bot_token}"},
+            )
+            response.raise_for_status()
+
+        if audio:
+            from gateway.platforms.base import cache_audio_from_bytes
+            return cache_audio_from_bytes(response.content, ext)
+        else:
+            from gateway.platforms.base import cache_image_from_bytes
+            return cache_image_from_bytes(response.content, ext)
@@ -0,0 +1,484 @@
+"""
+Telegram platform adapter.
+
+Uses python-telegram-bot library for:
+- Receiving messages from users/groups
+- Sending responses back
+- Handling media and commands
+"""
+
+import asyncio
+from typing import Dict, List, Optional, Any
+
+try:
+    from telegram import Update, Bot, Message
+    from telegram.ext import (
+        Application,
+        CommandHandler,
+        MessageHandler as TelegramMessageHandler,
+        ContextTypes,
+        filters,
+    )
+    from telegram.constants import ParseMode, ChatType
+    TELEGRAM_AVAILABLE = True
+except ImportError:
+    TELEGRAM_AVAILABLE = False
+    Update = Any
+    Bot = Any
+    Message = Any
+    Application = Any
+    ContextTypes = Any
+
+import sys
+sys.path.insert(0, str(__file__).rsplit("/", 3)[0])
+
+from gateway.config import Platform, PlatformConfig
+from gateway.platforms.base import (
+    BasePlatformAdapter,
+    MessageEvent,
+    MessageType,
+    SendResult,
+    cache_image_from_bytes,
+    cache_audio_from_bytes,
+)
+
+
+def check_telegram_requirements() -> bool:
+    """Check if Telegram dependencies are available."""
+    return TELEGRAM_AVAILABLE
+
+
+class TelegramAdapter(BasePlatformAdapter):
+    """
+    Telegram bot adapter.
+    
+    Handles:
+    - Receiving messages from users and groups
+    - Sending responses with Telegram markdown
+    - Forum topics (thread_id support)
+    - Media messages
+    """
+    
+    # Telegram message limits
+    MAX_MESSAGE_LENGTH = 4096
+    
+    def __init__(self, config: PlatformConfig):
+        super().__init__(config, Platform.TELEGRAM)
+        self._app: Optional[Application] = None
+        self._bot: Optional[Bot] = None
+    
+    async def connect(self) -> bool:
+        """Connect to Telegram and start polling for updates."""
+        if not TELEGRAM_AVAILABLE:
+            print(f"[{self.name}] python-telegram-bot not installed. Run: pip install python-telegram-bot")
+            return False
+        
+        if not self.config.token:
+            print(f"[{self.name}] No bot token configured")
+            return False
+        
+        try:
+            # Build the application
+            self._app = Application.builder().token(self.config.token).build()
+            self._bot = self._app.bot
+            
+            # Register handlers
+            self._app.add_handler(TelegramMessageHandler(
+                filters.TEXT & ~filters.COMMAND,
+                self._handle_text_message
+            ))
+            self._app.add_handler(TelegramMessageHandler(
+                filters.COMMAND,
+                self._handle_command
+            ))
+            self._app.add_handler(TelegramMessageHandler(
+                filters.PHOTO | filters.VIDEO | filters.AUDIO | filters.VOICE | filters.Document.ALL | filters.Sticker.ALL,
+                self._handle_media_message
+            ))
+            
+            # Start polling in background
+            await self._app.initialize()
+            await self._app.start()
+            await self._app.updater.start_polling(allowed_updates=Update.ALL_TYPES)
+            
+            self._running = True
+            print(f"[{self.name}] Connected and polling for updates")
+            return True
+            
+        except Exception as e:
+            print(f"[{self.name}] Failed to connect: {e}")
+            return False
+    
+    async def disconnect(self) -> None:
+        """Stop polling and disconnect."""
+        if self._app:
+            try:
+                await self._app.updater.stop()
+                await self._app.stop()
+                await self._app.shutdown()
+            except Exception as e:
+                print(f"[{self.name}] Error during disconnect: {e}")
+        
+        self._running = False
+        self._app = None
+        self._bot = None
+        print(f"[{self.name}] Disconnected")
+    
+    async def send(
+        self,
+        chat_id: str,
+        content: str,
+        reply_to: Optional[str] = None,
+        metadata: Optional[Dict[str, Any]] = None
+    ) -> SendResult:
+        """Send a message to a Telegram chat."""
+        if not self._bot:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            # Format and split message if needed
+            formatted = self.format_message(content)
+            chunks = self.truncate_message(formatted, self.MAX_MESSAGE_LENGTH)
+            
+            message_ids = []
+            thread_id = metadata.get("thread_id") if metadata else None
+            
+            for i, chunk in enumerate(chunks):
+                # Try Markdown first, fall back to plain text if it fails
+                try:
+                    msg = await self._bot.send_message(
+                        chat_id=int(chat_id),
+                        text=chunk,
+                        parse_mode=ParseMode.MARKDOWN,
+                        reply_to_message_id=int(reply_to) if reply_to and i == 0 else None,
+                        message_thread_id=int(thread_id) if thread_id else None,
+                    )
+                except Exception as md_error:
+                    # Markdown parsing failed, try plain text
+                    if "parse" in str(md_error).lower() or "markdown" in str(md_error).lower():
+                        msg = await self._bot.send_message(
+                            chat_id=int(chat_id),
+                            text=chunk,
+                            parse_mode=None,  # Plain text
+                            reply_to_message_id=int(reply_to) if reply_to and i == 0 else None,
+                            message_thread_id=int(thread_id) if thread_id else None,
+                        )
+                    else:
+                        raise  # Re-raise if not a parse error
+                message_ids.append(str(msg.message_id))
+            
+            return SendResult(
+                success=True,
+                message_id=message_ids[0] if message_ids else None,
+                raw_response={"message_ids": message_ids}
+            )
+            
+        except Exception as e:
+            return SendResult(success=False, error=str(e))
+    
+    async def send_voice(
+        self,
+        chat_id: str,
+        audio_path: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send audio as a native Telegram voice message or audio file."""
+        if not self._bot:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            import os
+            if not os.path.exists(audio_path):
+                return SendResult(success=False, error=f"Audio file not found: {audio_path}")
+            
+            with open(audio_path, "rb") as audio_file:
+                # .ogg/.opus files -> send as voice (round playable bubble)
+                if audio_path.endswith(".ogg") or audio_path.endswith(".opus"):
+                    msg = await self._bot.send_voice(
+                        chat_id=int(chat_id),
+                        voice=audio_file,
+                        caption=caption[:1024] if caption else None,
+                        reply_to_message_id=int(reply_to) if reply_to else None,
+                    )
+                else:
+                    # .mp3 and others -> send as audio file
+                    msg = await self._bot.send_audio(
+                        chat_id=int(chat_id),
+                        audio=audio_file,
+                        caption=caption[:1024] if caption else None,
+                        reply_to_message_id=int(reply_to) if reply_to else None,
+                    )
+            return SendResult(success=True, message_id=str(msg.message_id))
+        except Exception as e:
+            print(f"[{self.name}] Failed to send voice/audio: {e}")
+            return await super().send_voice(chat_id, audio_path, caption, reply_to)
+    
+    async def send_image(
+        self,
+        chat_id: str,
+        image_url: str,
+        caption: Optional[str] = None,
+        reply_to: Optional[str] = None,
+    ) -> SendResult:
+        """Send an image natively as a Telegram photo."""
+        if not self._bot:
+            return SendResult(success=False, error="Not connected")
+        
+        try:
+            # Telegram can send photos directly from URLs
+            msg = await self._bot.send_photo(
+                chat_id=int(chat_id),
+                photo=image_url,
+                caption=caption[:1024] if caption else None,  # Telegram caption limit
+                reply_to_message_id=int(reply_to) if reply_to else None,
+            )
+            return SendResult(success=True, message_id=str(msg.message_id))
+        except Exception as e:
+            print(f"[{self.name}] Failed to send photo, falling back to URL: {e}")
+            # Fallback: send as text link
+            return await super().send_image(chat_id, image_url, caption, reply_to)
+    
+    async def send_typing(self, chat_id: str) -> None:
+        """Send typing indicator."""
+        if self._bot:
+            try:
+                await self._bot.send_chat_action(
+                    chat_id=int(chat_id),
+                    action="typing"
+                )
+            except Exception:
+                pass  # Ignore typing indicator failures
+    
+    async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
+        """Get information about a Telegram chat."""
+        if not self._bot:
+            return {"name": "Unknown", "type": "dm"}
+        
+        try:
+            chat = await self._bot.get_chat(int(chat_id))
+            
+            chat_type = "dm"
+            if chat.type == ChatType.GROUP:
+                chat_type = "group"
+            elif chat.type == ChatType.SUPERGROUP:
+                chat_type = "group"
+                if chat.is_forum:
+                    chat_type = "forum"
+            elif chat.type == ChatType.CHANNEL:
+                chat_type = "channel"
+            
+            return {
+                "name": chat.title or chat.full_name or str(chat_id),
+                "type": chat_type,
+                "username": chat.username,
+                "is_forum": getattr(chat, "is_forum", False),
+            }
+        except Exception as e:
+            return {"name": str(chat_id), "type": "dm", "error": str(e)}
+    
+    def format_message(self, content: str) -> str:
+        """
+        Format message for Telegram.
+        
+        Telegram uses a subset of markdown. We'll use the simpler
+        Markdown mode (not MarkdownV2) for compatibility.
+        """
+        # No escaping is applied yet: content is passed through unchanged.
+        # send() retries as plain text if Telegram rejects the Markdown.
+        return content
+    
+    async def _handle_text_message(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
+        """Handle incoming text messages."""
+        if not update.message or not update.message.text:
+            return
+        
+        event = self._build_message_event(update.message, MessageType.TEXT)
+        await self.handle_message(event)
+    
+    async def _handle_command(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
+        """Handle incoming command messages."""
+        if not update.message or not update.message.text:
+            return
+        
+        event = self._build_message_event(update.message, MessageType.COMMAND)
+        await self.handle_message(event)
+    
+    async def _handle_media_message(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
+        """Handle incoming media messages, downloading photos and audio to the local cache."""
+        if not update.message:
+            return
+        
+        msg = update.message
+        
+        # Determine media type
+        if msg.sticker:
+            msg_type = MessageType.STICKER
+        elif msg.photo:
+            msg_type = MessageType.PHOTO
+        elif msg.video:
+            msg_type = MessageType.VIDEO
+        elif msg.audio:
+            msg_type = MessageType.AUDIO
+        elif msg.voice:
+            msg_type = MessageType.VOICE
+        else:
+            msg_type = MessageType.DOCUMENT
+        
+        event = self._build_message_event(msg, msg_type)
+        
+        # Add caption as text
+        if msg.caption:
+            event.text = msg.caption
+        
+        # Handle stickers: describe via vision tool with caching
+        if msg.sticker:
+            await self._handle_sticker(msg, event)
+            await self.handle_message(event)
+            return
+        
+        # Download photo to local image cache so the vision tool can access it
+        # even after Telegram's ephemeral file URLs expire (~1 hour).
+        if msg.photo:
+            try:
+                # msg.photo is a list of PhotoSize sorted by size; take the largest
+                photo = msg.photo[-1]
+                file_obj = await photo.get_file()
+                # Download the image bytes directly into memory
+                image_bytes = await file_obj.download_as_bytearray()
+                # Determine extension from the file path if available
+                ext = ".jpg"
+                if file_obj.file_path:
+                    for candidate in [".png", ".webp", ".gif", ".jpeg", ".jpg"]:
+                        if file_obj.file_path.lower().endswith(candidate):
+                            ext = candidate
+                            break
+                # Save to cache and populate media_urls with the local path
+                cached_path = cache_image_from_bytes(bytes(image_bytes), ext=ext)
+                event.media_urls = [cached_path]
+                event.media_types = ["image/jpeg" if ext in (".jpg", ".jpeg") else f"image/{ext.lstrip('.')}"]
+                print(f"[Telegram] Cached user photo: {cached_path}", flush=True)
+            except Exception as e:
+                print(f"[Telegram] Failed to cache photo: {e}", flush=True)
+        
+        # Download voice/audio messages to cache for STT transcription
+        if msg.voice:
+            try:
+                file_obj = await msg.voice.get_file()
+                audio_bytes = await file_obj.download_as_bytearray()
+                cached_path = cache_audio_from_bytes(bytes(audio_bytes), ext=".ogg")
+                event.media_urls = [cached_path]
+                event.media_types = ["audio/ogg"]
+                print(f"[Telegram] Cached user voice: {cached_path}", flush=True)
+            except Exception as e:
+                print(f"[Telegram] Failed to cache voice: {e}", flush=True)
+        elif msg.audio:
+            try:
+                file_obj = await msg.audio.get_file()
+                audio_bytes = await file_obj.download_as_bytearray()
+                cached_path = cache_audio_from_bytes(bytes(audio_bytes), ext=".mp3")
+                event.media_urls = [cached_path]
+                event.media_types = ["audio/mpeg"]
+                print(f"[Telegram] Cached user audio: {cached_path}", flush=True)
+            except Exception as e:
+                print(f"[Telegram] Failed to cache audio: {e}", flush=True)
+        
+        await self.handle_message(event)
+    
+    async def _handle_sticker(self, msg: Message, event: "MessageEvent") -> None:
+        """
+        Describe a Telegram sticker via vision analysis, with caching.
+
+        For static stickers (WEBP), we download, analyze with vision, and cache
+        the description by file_unique_id. For animated/video stickers, we inject
+        a placeholder noting the emoji.
+        """
+        from gateway.sticker_cache import (
+            get_cached_description,
+            cache_sticker_description,
+            build_sticker_injection,
+            build_animated_sticker_injection,
+            STICKER_VISION_PROMPT,
+        )
+
+        sticker = msg.sticker
+        emoji = sticker.emoji or ""
+        set_name = sticker.set_name or ""
+
+        # Animated and video stickers can't be analyzed as static images
+        if sticker.is_animated or sticker.is_video:
+            event.text = build_animated_sticker_injection(emoji)
+            return
+
+        # Check the cache first
+        cached = get_cached_description(sticker.file_unique_id)
+        if cached:
+            event.text = build_sticker_injection(
+                cached["description"], cached.get("emoji", emoji), cached.get("set_name", set_name)
+            )
+            print(f"[Telegram] Sticker cache hit: {sticker.file_unique_id}", flush=True)
+            return
+
+        # Cache miss -- download and analyze
+        try:
+            file_obj = await sticker.get_file()
+            image_bytes = await file_obj.download_as_bytearray()
+            cached_path = cache_image_from_bytes(bytes(image_bytes), ext=".webp")
+            print(f"[Telegram] Analyzing sticker: {cached_path}", flush=True)
+
+            from tools.vision_tools import vision_analyze_tool
+            import json as _json
+
+            result_json = await vision_analyze_tool(
+                image_url=cached_path,
+                user_prompt=STICKER_VISION_PROMPT,
+            )
+            result = _json.loads(result_json)
+
+            if result.get("success"):
+                # Fall back when "analysis" is missing or an empty string
+                description = result.get("analysis") or "a sticker"
+                cache_sticker_description(sticker.file_unique_id, description, emoji, set_name)
+                event.text = build_sticker_injection(description, emoji, set_name)
+            else:
+                # Vision failed -- use emoji as fallback
+                event.text = build_sticker_injection(
+                    f"a sticker with emoji {emoji}" if emoji else "a sticker",
+                    emoji, set_name,
+                )
+        except Exception as e:
+            print(f"[Telegram] Sticker analysis error: {e}", flush=True)
+            event.text = build_sticker_injection(
+                f"a sticker with emoji {emoji}" if emoji else "a sticker",
+                emoji, set_name,
+            )
+
+    def _build_message_event(self, message: Message, msg_type: MessageType) -> MessageEvent:
+        """Build a MessageEvent from a Telegram message."""
+        chat = message.chat
+        user = message.from_user
+        
+        # Determine chat type
+        chat_type = "dm"
+        if chat.type in (ChatType.GROUP, ChatType.SUPERGROUP):
+            chat_type = "group"
+        elif chat.type == ChatType.CHANNEL:
+            chat_type = "channel"
+        
+        # Build source
+        source = self.build_source(
+            chat_id=str(chat.id),
+            chat_name=chat.title or getattr(chat, "full_name", None),
+            chat_type=chat_type,
+            user_id=str(user.id) if user else None,
+            user_name=user.full_name if user else None,
+            thread_id=str(message.message_thread_id) if message.message_thread_id else None,
+        )
+        
+        return MessageEvent(
+            text=message.text or "",
+            message_type=msg_type,
+            source=source,
+            raw_message=message,
+            message_id=str(message.message_id),
+            timestamp=message.date,
+        )
--- a/Show More
+++ b/Show More
@@ -1,2 +0,0 @@
-"""Terminal helpers for stateful sandbox interactions."""