Compare commits

..

118 Commits

Author SHA1 Message Date
Sam Herring dff5481e58 Eval splits for holdout sets 2026-03-03 14:42:45 -05:00
Sam Herring fe17b5ff08 Changing return type to be ScoredDataGroup to account for multiple trajectories 2026-03-02 11:35:06 -08:00
Sam Herring 6fdb38ed29 Added task sppecific metris and evals 2026-02-27 11:20:18 -08:00
Sam Herring b7e713b101 Wandb changes 2026-02-26 10:41:24 -08:00
Sam Herring 0e694b954a Updating config 2026-02-24 19:23:05 -08:00
Sam Herring c12e46cf13 Adding config init method 2026-02-24 19:19:39 -08:00
Sam Herring b93ad43191 Updating path vars and dataset loading 2026-02-24 18:58:19 -08:00
Sam Herring f1c2f8a414 Updating to use hermes-agent backend and parse container definition out of provided .sif files 2026-02-24 16:35:18 -08:00
Sam Herring 9139eeaa60 Adding endless terminal environment after rebase: 2026-02-17 16:56:17 -08:00
Sam Herring d0f82e6dcc Removing random project notes doc 2026-02-17 08:02:29 -08:00
teknium1 49e1f9ea89 Refactor TODO.md to summarize future improvements for the Hermes Agent, focusing on subagent architecture, task management, dynamic skills expansion, and interactive clarifying questions. Key ideas include context isolation for subagents, task decomposition, progress tracking, and skill acquisition from successful tasks. 2026-02-17 03:24:38 -08:00
teknium1 6731230d73 Add special handling for 'process' tool in _build_tool_preview function
- Enhanced the _build_tool_preview function to include specific formatting for the 'process' tool, displaying action, session_id, data, and timeout when applicable.
- This update improves the clarity of tool previews, particularly for actions that require session tracking and timeout management.
2026-02-17 03:18:27 -08:00
teknium1 ec59d71e60 Update PTY write handling in ProcessRegistry to ensure data is encoded as bytes before writing. This change improves compatibility with string inputs and clarifies the expected data type in comments. 2026-02-17 03:14:47 -08:00
teknium1 bdac541d1e Rename OPENAI_API_KEY to HERMES_OPENAI_API_KEY in configuration and codebase for clarity and to avoid conflicts. Update related documentation and error messages to reflect the new key name, ensuring backward compatibility with existing setups. 2026-02-17 03:11:17 -08:00
teknium1 061fa70907 Add background process management with process tool, wait, PTY, and stdin support
New process registry and tool for managing long-running background processes
across all terminal backends (local, Docker, Singularity, Modal, SSH).

Process Registry (tools/process_registry.py):
- ProcessSession tracking with rolling 200KB output buffer
- spawn_local() with optional PTY via ptyprocess for interactive CLIs
- spawn_via_env() for non-local backends (runs inside sandbox, never on host)
- Background reader threads per process (Popen stdout or PTY)
- wait() with timeout clamping, interrupt support, and transparent limit reporting
- JSON checkpoint to ~/.hermes/processes.json for gateway crash recovery
- Module-level singleton shared across agent loop, gateway, and RL

Process Tool (model_tools.py):
- 7 actions: list, poll, log, wait, kill, write, submit
- Paired with terminal in all toolsets (CLI, messaging, RL)
- Timeout clamping with transparent notes in response

Terminal Tool Updates (tools/terminal_tool.py):
- Replaced nohup background mode with registry spawn (returns session_id)
- Added workdir parameter for per-command working directory
- Added check_interval parameter for gateway auto-check watchers
- Added pty parameter for interactive CLI tools (Codex, Claude Code)
- Updated TERMINAL_TOOL_DESCRIPTION with full background workflow docs
- Cleanup thread now respects active background processes (won't reap sandbox)

Gateway Integration (gateway/run.py, session.py, config.py):
- Session reset protection: sessions with active processes exempt from reset
- Default idle timeout increased from 2 hours to 24 hours
- from_dict fallback aligned to match (was 120, now 1440)
- session_key env var propagated to process registry for session mapping
- Crash recovery on gateway startup via checkpoint probe
- check_interval watcher: asyncio task polls process, delivers updates to platform

RL Safety (environments/):
- tool_context.py cleanup() kills background processes on episode end
- hermes_base_env.py warns when enabled_toolsets is None (loads all tools)
- Process tool safe in RL via wait() blocking the agent loop

Also:
- Added ptyprocess as optional dependency (in pyproject.toml [pty] extra + [all])
- Fixed pre-existing bug: rl_test_inference missing from TOOL_TO_TOOLSET_MAP
- Updated AGENTS.md with process management docs and project structure
- Updated README.md terminal section with process management overview
2026-02-17 02:51:31 -08:00
teknium1 48b5cfd085 Add skip_context_files option to AIAgent for batch processing
- Introduced a new parameter `skip_context_files` in the AIAgent class to control the inclusion of context files (SOUL.md, AGENTS.md, .cursorrules) in the system prompt.
- Updated the _process_single_prompt function to set `skip_context_files` to True, preventing pollution of trajectories during batch processing and data generation.
2026-02-16 22:40:31 -08:00
teknium1 a7609c97be Update docs to match backend key rename and CWD behavior
- cli-config.yaml.example: env_type → backend everywhere, matching the
  documented config key that hermes_cli/config.py and README already use
- cli-config.yaml.example: added comments clarifying cwd is a path
  INSIDE the target environment for non-local backends
- AGENTS.md: updated terminal.cwd description to explain "." only
  resolves to host CWD for the local backend
- .env.example: updated TERMINAL_CWD comment to warn against using
  host-local paths with remote backends, lists per-backend defaults
2026-02-16 22:31:41 -08:00
teknium1 c33feb6dc9 Fix host CWD leaking into non-local terminal backends
When using Modal, Docker, SSH, or Singularity as the terminal backend
from the CLI, the agent resolved cwd: "." to the host machine's local
path (e.g. /Users/rewbs/code/hermes-agent) and passed it to the remote
sandbox, where it doesn't exist. All commands failed with "No such file
or directory".

Root cause: cli.py unconditionally resolved "." to os.getcwd() and wrote
it to TERMINAL_CWD regardless of backend type. Every tool then used that
host-local path as the working directory inside the remote environment.

Fixes:
- cli.py: only resolve "." to os.getcwd() for the local backend. For all
  remote backends (ssh, docker, modal, singularity), leave TERMINAL_CWD
  unset so the tool layer uses per-backend defaults (/root, /, ~, etc.)
- terminal_tool.py: added sanity check -- if TERMINAL_CWD contains a
  host-local prefix (/Users/, /home/, C:\) for a non-local backend, log
  a warning and fall back to the backend's default
- terminal_tool.py: SSH default CWD is now ~ instead of os.getcwd()
- file_operations.py: last-resort CWD fallback changed from os.getcwd()
  to "/" so host paths never leak into remote file operations
2026-02-16 22:30:04 -08:00
teknium1 2c7deb41f6 Fix Modal backend not working from CLI
Two config systems used different key names for the terminal backend:
- hermes_cli/config.py, README, and all docs use "terminal.backend"
- cli.py's env var mapping only recognized "terminal.env_type"

Users following the docs who set `backend: modal` in ~/.hermes/config.yaml
had it silently ignored -- TERMINAL_ENV always defaulted to "local".

Additionally, when no config file existed, cli.py's hardcoded defaults
overwrote any TERMINAL_ENV=modal set in .env, despite the comment saying
"env vars take precedence."

Fixes:
- cli.py now normalizes "backend" -> "env_type" (backend takes precedence)
- Defaults no longer overwrite .env when no config file terminal section exists
- hermes status reads from config as fallback when env var isn't set

Also fixes four related bugs found in the Modal/sandbox lifecycle:
- file_tools cache not cleared on sandbox cleanup (stale ops on dead sandbox)
- Global lock held during slow Modal teardown (blocked all tool calls 10-15s)
- Race condition in file_tools between existence check and access (KeyError)
- Per-task creation locks never cleaned up (memory leak)
2026-02-16 19:47:23 -08:00
teknium1 8117d0adab Refactor file operations and environment management in file_tools and terminal_tool
- Improved the caching mechanism for ShellFileOperations to ensure stale entries are invalidated when environments are cleaned up.
- Enhanced thread safety by refining the use of locks during environment creation and cleanup processes.
- Streamlined the cleanup of inactive environments to prevent blocking other tool calls, ensuring efficient resource management.
- Added error handling and messaging improvements for better user feedback during environment cleanup.
2026-02-16 19:37:40 -08:00
teknium1 01a3a6ab0d Implement cleanup guard to prevent multiple executions on exit
- Introduced a new cleanup function that ensures terminal and browser sessions are cleaned up only once during application exit.
- Updated atexit registration to use the new cleanup function, enhancing resource management and preventing potential issues from multiple cleanup calls.
- Modified terminal cleanup messaging to only display when environments are cleaned, improving user feedback.
2026-02-16 02:43:45 -08:00
teknium1 45a8098d3a Remove browserbase SDK check and add Node.js and agent-browser validation in doctor script
- Removed the check for the browserbase SDK from the optional packages list.
- Added validation for Node.js installation and the presence of the agent-browser package, providing feedback on their status for browser automation tools.
2026-02-16 02:41:24 -08:00
teknium1 60812ae041 Enhance configuration checks and persona file creation in doctor and install scripts
- Updated the doctor script to load environment variables from user-specific and project-specific `.env` files, improving configuration management.
- Added checks for the existence of the `SOUL.md` persona file, providing feedback on its status and creating it with a template if missing.
- Enhanced install scripts to create the `SOUL.md` file if it doesn't exist, ensuring users can easily customize the agent's personality.
2026-02-16 02:38:19 -08:00
teknium1 635bec06cb Update tool definitions handling in GatewayRunner
- Modified the retrieval of tool definitions to use the agent result's "tools" key, ensuring accurate logging in the transcript.
- Enhanced the response structure to include tools in the final output, improving the clarity of tool usage in session interactions.
2026-02-16 00:55:18 -08:00
teknium1 0f58dfdea4 Enhance agent response handling and transcript logging
- Refactored the agent response processing to return a comprehensive result dictionary, including final responses and full message history.
- Improved transcript logging to capture the complete conversation, including tool calls and intermediate reasoning, facilitating session resumption and debugging.
- Added handling for fresh sessions to include tool definitions in the transcript for clarity.
- Implemented logic to filter and timestamp new messages, ensuring accurate logging of user and assistant interactions.
2026-02-16 00:53:17 -08:00
teknium1 dd5fe334f3 Refactor configuration handling to improve user experience
- Implemented deep copy of DEFAULT_CONFIG to prevent mutations during config loading.
- Enhanced user config merging process to clarify the deep merge of user values over defaults.
- Added newline handling when appending environment variables to ensure proper formatting.
- Updated the set_config_value function to write only user-specific configurations back to the file, avoiding overwriting default values.
2026-02-16 00:33:45 -08:00
teknium1 e0c9d495ef Refine configuration migration process to improve user experience
- Updated prompts for the OPENAI_BASE_URL to clarify its use for custom endpoints.
- Enhanced the migration function to skip "advanced" environment variables during interactive configuration, streamlining the setup for standard users.
- Improved messaging for missing optional API keys, ensuring clearer guidance for users during configuration.
2026-02-15 21:53:59 -08:00
teknium1 2f34e6fd30 Update OpenAI configuration prompts for clarity and detail
- Revised descriptions and prompts for the OPENAI_BASE_URL and OPENAI_API_KEY environment variables to enhance user understanding.
- Added a URL reference for the OPENAI_API_KEY to guide users in obtaining their API key.
- Specified the use of the API key for voice transcription and custom endpoints, improving the overall configuration documentation.
2026-02-15 21:48:07 -08:00
teknium1 69aa35a51c Add messaging platform enhancements: STT, stickers, Discord UX, Slack, pairing, hooks
Major feature additions inspired by OpenClaw/ClawdBot integration analysis:

Voice Message Transcription (STT):
- Auto-transcribe voice/audio messages via OpenAI Whisper API
- Download voice to ~/.hermes/audio_cache/ on Telegram/Discord/WhatsApp
- Inject transcript as text so all models can understand voice input
- Configurable model (whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe)

Telegram Sticker Understanding:
- Describe static stickers via vision tool with JSON-backed cache
- Cache keyed by file_unique_id avoids redundant API calls
- Animated/video stickers get emoji-based fallback description

Discord Rich UX:
- Native slash commands (/ask, /reset, /status, /stop) via app_commands
- Button-based exec approvals (Allow Once / Always Allow / Deny)
- ExecApprovalView with user authorization and timeout handling

Slack Integration:
- Full SlackAdapter using slack-bolt with Socket Mode
- DMs, channel messages (mention-gated), /hermes slash command
- File attachment handling with bot-token-authenticated downloads

DM Pairing System:
- Code-based user authorization as alternative to static allowlists
- 8-char codes from unambiguous alphabet, 1-hour expiry
- Rate limiting, lockout after failed attempts, chmod 0600 on data
- CLI: hermes pairing list/approve/revoke/clear-pending

Event Hook System:
- File-based hook discovery from ~/.hermes/hooks/
- HOOK.yaml + handler.py per hook, sync/async handler support
- Events: gateway:startup, session:start/reset, agent:start/step/end
- Wildcard matching (command:* catches all command events)

Cross-Channel Messaging:
- send_message agent tool for delivering to any connected platform
- Enables cron job delivery and cross-platform notifications

Human-Like Response Pacing:
- Configurable delays between message chunks (off/natural/custom)
- HERMES_HUMAN_DELAY_MODE env var with min/max ms settings

Warm Injection Message Style:
- Retrofitted image vision messages with friendly kawaii-consistent tone
- All new injection messages (STT, stickers, errors) use warm style

Also: updated config migration to prompt for optional keys interactively,
bumped config version, updated README, AGENTS.md, .env.example,
cli-config.yaml.example, install scripts, pyproject.toml, and toolsets.
2026-02-15 21:38:59 -08:00
teknium1 5404a8fcd8 Enhance image handling and analysis capabilities across platforms
- Updated the vision tool to accept both HTTP/HTTPS URLs and local file paths for image analysis.
- Implemented caching of user-uploaded images in local directories to ensure reliable access for the vision tool, addressing issues with ephemeral URLs.
- Enhanced platform adapters (Discord, Telegram, WhatsApp) to download and cache images, allowing for immediate analysis and enriched message context.
- Added a new method to auto-analyze images attached by users, enriching the conversation with detailed descriptions.
- Improved documentation for image handling processes and updated related functions for clarity and efficiency.
2026-02-15 16:10:50 -08:00
teknium1 eb49936a60 Update documentation and installation scripts for TTS audio formats
- Clarified the requirements for Telegram voice bubbles, specifying the need for ffmpeg when using Edge TTS.
- Enhanced README and messaging documentation to detail audio delivery formats across platforms.
- Improved installation script messages to inform users about the necessity of ffmpeg for proper audio playback on Telegram.
2026-02-14 16:16:54 -08:00
teknium1 ff9ea6c4b1 Enhance TTS tool to support platform-specific audio formats
- Added detection of the platform from the environment variable to determine the appropriate audio output format.
- Implemented logic to output Opus (.ogg) files for Telegram when using compatible TTS providers, while defaulting to MP3 for others.
2026-02-14 16:13:26 -08:00
teknium1 586b0a7047 Add Text-to-Speech (TTS) support with Edge TTS and ElevenLabs integration
- Updated `pyproject.toml` to include Edge TTS and ElevenLabs as dependencies.
- Enhanced documentation to detail voice message capabilities across platforms and TTS provider options.
- Modified the GatewayRunner to handle MEDIA tags from TTS tool responses, ensuring proper delivery of audio messages.
2026-02-14 16:08:14 -08:00
teknium1 84718d183a Add platform-specific formatting hints and identity for AIAgent
- Introduced a default agent identity prompt to ensure consistent behavior across platforms.
- Added platform-specific formatting hints for CLI, WhatsApp, Telegram, and Discord to guide the agent's output style.
- Updated the AIAgent initialization to accept a platform parameter, enhancing adaptability to different interfaces.
2026-02-12 16:11:16 -08:00
teknium1 3099a2f53c Add timestamp to active system prompt in AIAgent
- Appended the current local date and time to the active system prompt to provide context for the model, addressing potential misinterpretations due to training cutoffs.
2026-02-12 15:59:31 -08:00
teknium1 ed010752dd Update .env.example to use new Docker, Singularity, and Modal images for Python 3.11 with Node.js 20 support 2026-02-12 10:07:03 -08:00
teknium1 f5be6177b2 Add Text-to-Speech (TTS) functionality with multiple providers
Add tool previews

Add AGENTS and SOUL.md support

Add Exec Approval
2026-02-12 10:05:08 -08:00
teknium 89c6f24d48 Merge branch 'main' of github.com:nousresearch/hermes-agent 2026-02-12 05:38:15 +00:00
teknium f23856df8e Add kill_modal script to manage Modal applications and better handling of file and terminal tools
- Introduced a new script, `kill_modal.sh`, to facilitate stopping running Modal apps, including the ability to stop all apps or specific swe-rex sandboxes.
- Enhanced user experience with clear usage instructions and feedback during the stopping process.
- Improved error handling to ensure smooth execution even if some apps fail to stop.
2026-02-12 05:37:14 +00:00
teknium 1b7bc299f3 Enhance TerminalBench2 environment with task filtering due to incompat with modal and logging improvements
- Updated task filter descriptions for clarity and added a new skip task feature to exclude incompatible tasks.
- Introduced a set of modal incompatible tasks to prevent execution errors in cloud environments.
- Implemented streaming JSONL logging for task results, preserving data even on interruptions.
- Refactored task evaluation logic to include skipped task reporting and improved error handling.
2026-02-12 05:36:45 +00:00
teknium a291cc99cf more extra kwarg support for provider selection etc on openrouter in agent rl envs and evals 2026-02-12 05:36:25 +00:00
teknium 389ac5e017 pass extrabody for agentloop to ban and allowlist providers on openrouter, control thinking, etc 2026-02-12 05:35:48 +00:00
nightwing fc792a4be9 Update Project_notes.md: grailed-embedding-search status and TODOs (June 2025) 2026-02-11 17:54:47 -07:00
nightwing 07501bef14 Add Project_notes.md — centralized status tracker for all side projects 2026-02-11 17:36:18 -07:00
teknium1 137ce05324 Add image generation tool to toolsets for messaging platforms
- Included "image_generate" in the toolsets for web, vision, and skills categories, expanding functionality for image-related tasks.
- Updated comments for clarity on the new tool's purpose, ensuring users understand its integration within the existing framework.
2026-02-10 21:04:24 -08:00
teknium1 ada0b4f131 Enhance image handling in platform adapters
- Updated the image generation function description to clarify usage with markdown.
- Added `send_image` method to `BasePlatformAdapter` for native image sending across platforms.
- Implemented `send_image` in `DiscordAdapter` and `TelegramAdapter` to handle image attachments directly.
- Introduced `extract_images` method to extract image URLs from markdown and HTML, improving content processing.
- Enhanced message handling to support sending images as attachments while maintaining text content.
2026-02-10 21:02:40 -08:00
teknium abe925e212 Update hermes-discord toolset to enable full terminal access with safety checks
- Revised the description to reflect full access capabilities, including terminal usage with a dangerous command approval system.
- Added terminal and file manipulation tools to the toolset, enhancing functionality for users.
- Updated comments for clarity on tool purposes, ensuring better understanding of available features.
2026-02-11 04:44:30 +00:00
teknium1 8fb44608bf Update SKILL.md and related references to implement container binding for labeled shapes and arrows in Excalidraw
- Revised the labeled shape and arrow sections to utilize container binding instead of the deprecated "label" property, ensuring proper text rendering.
- Added warnings about the invalidity of the "label" property and emphasized the use of `boundElements` for text elements.
- Updated examples in dark-mode and general references to reflect the new binding approach, enhancing clarity and usability for users creating diagrams.
2026-02-10 20:05:23 -08:00
teknium1 153cd5bb44 Refactor skills tool integration and enhance system prompt
- Removed the skills_categories tool from the skills toolset, streamlining the skills functionality to focus on skills_list and skill_view.
- Updated the system prompt to dynamically build a compact skills index, allowing the model to quickly reference available skills without additional tool calls.
- Cleaned up related code and documentation to reflect the removal of skills_categories, ensuring clarity and consistency across the codebase.
2026-02-10 19:48:38 -08:00
teknium1 669545f551 Add diagramming skills for Excalidraw
- Introduced a new DESCRIPTION.md file outlining diagram creation skills for visual diagrams and flowcharts using Excalidraw.
- Added SKILL.md for the Excalidraw skill, detailing its functionality, usage, and workflow for creating hand-drawn style diagrams.
- Created references for color palettes, dark mode diagrams, and example diagrams to assist users in utilizing the Excalidraw skill effectively.
- Implemented an upload script for sharing diagrams via Excalidraw.com, ensuring user-friendly access to generated diagrams.
2026-02-10 19:30:46 -08:00
teknium1 cfe2f3fe15 Implement interrupt handling for long-running tool executions in AIAgent
- Added functionality to signal and terminate long-running terminal commands when a new user message is received, allowing for immediate agent response.
- Introduced a global interrupt event in the terminal tool to facilitate early termination of subprocesses.
- Updated the AIAgent class to handle interrupts gracefully, ensuring that remaining tool calls are skipped and appropriate messages are returned to maintain valid message sequences.
2026-02-10 16:34:27 -08:00
teknium1 140d609e0c Refine agent history conversion logic in GatewayRunner
- Enhanced the conversion of message history to agent format by distinguishing between normal and rich agent messages.
- Implemented logic to preserve full message structure for tool-related messages, ensuring valid assistant-to-tool sequences.
- Simplified handling of simple text messages by stripping unnecessary fields while retaining essential role and content information.
2026-02-10 16:16:30 -08:00
teknium a32ad1a656 Fix infinite interrupt loop in gateway by consuming pending messages with .pop() and clearing interrupt events before recursion
- Added logic to clear the adapter's interrupt event to prevent infinite loops during message processing.
- Updated the get_pending_message method to pop messages from the pending queue, ensuring proper message handling.
2026-02-11 00:05:30 +00:00
teknium1 62ba69a29d Fix gateway exit code to enable systemd auto-restart on connection failure
- Updated the start_gateway function to return a boolean indicating success or failure, allowing for better control over exit codes.
- Modified the main function to handle gateway startup failures, ensuring systemd can automatically restart on transient errors.
- Enhanced error handling in the hermes_cli gateway to exit with code 1 if the gateway fails to connect to any platform.
2026-02-10 16:01:00 -08:00
teknium1 9b0f2a16ca Enhance CLI functionality with retry and undo commands
- Added /retry command to resend the last user message, improving user experience by allowing message re-sending without retyping.
- Introduced /undo command to remove the last user/assistant exchange from conversation history, providing better control over conversation flow.
- Updated save_config_value function to respect user and project config precedence, enhancing configuration management.
- Improved prompt handling and visual output for user input, adapting to terminal width for better readability.
2026-02-10 15:59:46 -08:00
teknium 85e629e915 Add cleanup functionality for orphaned sandboxes in TerminalBench2EvalEnv
- Implemented a cleanup process to terminate any remaining sandboxes after evaluation, addressing issues with orphaned thread pool workers.
- Enhanced logging to inform users about the cleanup process, ensuring better resource management and user awareness.
2026-02-10 23:48:49 +00:00
teknium 999a28062d Implement graceful exit cleanup for terminal tool
- Added a new `_atexit_cleanup` function to handle cleanup of active environments and stop the cleanup thread upon program exit.
- Enhanced logging to inform users about the number of remaining sandboxes being shut down during cleanup.
2026-02-10 22:53:44 +00:00
teknium ba3fea24f1 Enhance TerminalBench 2 configuration and evaluation handling
- Added task_timeout parameter to enforce a maximum wall-clock time for each task, automatically scoring as FAIL if exceeded.
- Introduced terminal_timeout and tool_pool_size parameters to improve command execution and concurrency management.
- Updated logging to provide detailed task execution times and timeout handling, enhancing overall monitoring.
- Removed outdated evaluate_config.yaml file to streamline configuration management.
2026-02-10 22:53:24 +00:00
teknium 6b4a8d0b17 Add terminal configuration options and enhance environment setup
- Introduced terminal_timeout and terminal_lifetime parameters to control command execution and sandbox inactivity.
- Updated environment variable handling to allow configuration overrides for terminal settings.
- Enhanced logging to provide detailed information about terminal settings during initialization.
- Added tool_pool_size parameter to dynamically resize the thread pool for tool execution, improving concurrency management.
2026-02-10 22:51:50 +00:00
teknium 5ec75e38b9 Enhance tool execution and logging in HermesAgentLoop
- Increased thread pool size for tool execution from 8 to 128 to improve concurrency and prevent starvation.
- Added a function to resize the tool executor dynamically based on configuration.
- Enhanced logging to track API call durations and tool execution times, including warnings for slow tools.
- Improved overall performance monitoring by logging detailed information for each turn in the agent loop.
2026-02-10 22:51:18 +00:00
teknium ad042fdd68 Update terminalbench_2 configuration for enhanced performance and evaluation
- Increased max_token_length from 16000 to 32000 to allow for longer inputs.
- Adjusted agent_temperature from 0.6 to 0.8 for more varied responses.
- Extended test_timeout from 180 to 600 seconds to accommodate longer evaluations.
- Updated data directory path for saving evaluations to ensure proper organization.
2026-02-10 19:48:41 +00:00
teknium 35ad3146a8 Add new environments and enhance tool context functionality
- Introduced new environments: Terminal Test Environment and SWE Environment, each with default configurations for testing and software engineering tasks.
- Added TerminalBench 2.0 evaluation environment with comprehensive setup for agentic LLMs, including task execution and verification.
- Enhanced ToolContext with methods for uploading and downloading files, ensuring binary-safe operations.
- Updated documentation across environments to reflect new features and usage instructions.
- Refactored existing environment configurations for consistency and clarity.
2026-02-10 19:39:05 +00:00
teknium e8343f2d87 Refactor Singularity environment for persistent container management
- Updated the _SingularityEnvironment class to utilize a persistent Apptainer instance, allowing state (files, installs, environment changes) to persist across commands.
- Enhanced the initialization process to start a background instance with full isolation and writable filesystem.
- Modified the execute method to connect to the running instance, ensuring commands run within the same container context.
- Implemented cleanup functionality to stop the persistent instance on cleanup or destruction, improving resource management.
- Updated class documentation to reflect new features and usage of the persistent environment.
2026-02-10 06:49:58 +00:00
teknium 1b1307d0d1 Implement Anthropic prompt caching for Claude models via OpenRouter
- Introduced a caching strategy that reduces input token costs by ~75% on multi-turn conversations by caching the conversation prefix.
- Added functions to apply cache control markers to messages, enhancing efficiency in token usage.
- Updated AIAgent to auto-enable prompt caching for Claude models, with configurable cache TTL.
- Enhanced logging to track cache hit statistics when caching is active, improving monitoring of token usage.
2026-02-10 06:49:41 +00:00
teknium 7a11be9f3f Enhance browser tool functionality and cleanup process
- Added checks for local installation of the agent-browser CLI in the `_find_agent_browser` function, improving installation guidance.
- Implemented per-task socket directory management in `_run_browser_command` to prevent concurrency issues.
- Updated `cleanup_browser` to remove per-task socket directories, ensuring proper resource cleanup after task completion.
- Refactored comments for clarity and improved documentation throughout the browser tool code.
2026-02-09 04:36:37 +00:00
teknium1 192ce958c3 Enhance CLI command handling and introduce resource cleanup features
- Added imports for resource cleanup during safe shutdown, including terminal and browser session cleanup.
- Refactored command handling to preserve original case for model names and prompt text, improving user experience.
- Introduced a dedicated interrupt queue to manage user input while the agent is running, preventing race conditions.
- Updated comments and documentation for clarity on command processing and input handling.
2026-02-08 13:31:45 -08:00
teknium1 c441681dc2 Update default model to 'anthropic/claude-opus-4.6' and refine terminal working directory settings
- Changed the default LLM model in the setup wizard and example environment file to 'anthropic/claude-opus-4.6'.
- Updated terminal working directory settings in CLI and related files to use the current directory ('.') instead of '/tmp'.
- Enhanced documentation comments for clarity on terminal configuration and working directory behavior.
2026-02-08 12:56:40 -08:00
teknium dd70d57b9b Refactor BatchRunner and AIAgent for enhanced reasoning and tool management, improved tool definitions for fileops
- Updated `ALL_POSSIBLE_TOOLS` to auto-derive from `TOOL_TO_TOOLSET_MAP` for consistent schema.
- Introduced `_extract_reasoning_stats` function to track reasoning coverage in assistant turns.
- Enhanced `_process_batch_worker` to discard prompts with no reasoning and aggregate reasoning statistics.
- Updated documentation and comments for clarity on new features and changes.
2026-02-08 20:19:14 +00:00
teknium f12ea1bc02 Enhance BatchRunner and AIAgent with new configuration options, default model now opus 4.6, default summarizer gemini flash 3
- Added `max_tokens`, `reasoning_config`, and `prefill_messages` parameters to `BatchRunner` and `AIAgent` for improved model response control.
- Updated CLI to support new options for reasoning effort and prefill messages from a JSON file.
- Modified example configuration files to reflect changes in default model and summary model.
- Improved error handling for loading prefill messages and reasoning configurations in the CLI.
- Updated documentation to include new parameters and usage examples.
2026-02-08 10:49:24 +00:00
Teknium fa76a331b0 Merge pull request #19 from NousResearch/atropos-hermes-agent
Enhance async tool execution and error handling in Hermes agent for A…
2026-02-07 21:01:06 -08:00
teknium d999d9876d Enhance async tool execution and error handling in Hermes agent for Atropos integration
- Updated `.gitignore` to exclude `testlogs` directory.
- Refactored `handle_web_function_call` in `model_tools.py` to support running async functions in existing event loops, improving compatibility with Atropos.
- Introduced a thread pool executor in `agent_loop.py` for running synchronous tool calls that internally use `asyncio.run()`, preventing deadlocks.
- Added `ToolError` class to track tool execution errors, enhancing error reporting during agent loops.
- Updated `wandb_log` method in `hermes_base_env.py` to log tool error statistics for better monitoring.
- Implemented patches in `patches.py` to ensure async-safe operation of tools within Atropos's event loop.
- Enhanced `ToolContext` and `terminal_tool.py` to utilize the new async handling, improving overall tool execution reliability.
2026-02-08 05:00:47 +00:00
Teknium 578a5fb6a9 Merge pull request #18 from NousResearch/atropos-hermes-agent
Upgrade installers to use uv
2026-02-07 16:00:05 -08:00
teknium a8809bbd3e Transition installation to uv for py version and speed to be easier to streamline
- Integrated `uv` as a fast Python package manager for automatic Python provisioning and dependency management.
- Updated installation scripts (`setup-hermes.sh`, `install.sh`, `install.ps1`) to utilize `uv` for installing Python and packages, streamlining the setup process.
- Revised `README.md` to reflect changes in installation steps, including symlinking `hermes` for global access and clarifying Python version requirements.
- Adjusted commands in `doctor.py` and other scripts to recommend `uv` for package installations, ensuring consistency across the project.
2026-02-07 23:54:53 +00:00
teknium a478e44585 Increase max_token_length in TerminalTestEnv to 16000 for enhanced processing capacity 2026-02-07 21:11:07 +00:00
teknium c0494b3558 Update pyproject.toml to refine dependency management
- Reorganized the 'all' dependencies to include specific optional groups for better modularity.
- Added support for 'hermes-agent' with distinct categories: modal, messaging, cron, cli, and dev.
2026-02-07 21:11:01 +00:00
Teknium 7f1cd014f2 Merge pull request #17 from NousResearch/atropos-hermes-agent
Add support for Atropos Agentic RL environments (requires branch tool…
2026-02-07 09:12:10 -08:00
teknium 07b615e96e Add support for Atropos Agentic RL environments (requires branch tool_call_support in Atropos atm)
- Added new environments for reinforcement learning, including `HermesSweEnv` for software engineering tasks and `TerminalTestEnv` for inline testing.
- Introduced `ToolContext` for unrestricted access to tools during reward computation.
- Updated `.gitignore` to exclude `wandb/` directory.
- Enhanced `README.md` with detailed architecture and usage instructions for Atropos environments.
- Added configuration files for SWE and terminal test environments to streamline setup.
- Removed unnecessary compiled Python files from `__pycache__`.
2026-02-07 09:17:16 +00:00
Teknium ab387a6120 Merge pull request #16 from NousResearch/atropos-hermes-agent
Update dependencies and enhance installation scripts
2026-02-06 16:05:50 -08:00
teknium ac79725923 Update dependencies and enhance installation scripts
- Added `prompt_toolkit` as a direct dependency for interactive CLI support.
- Updated `modal` optional dependency to require `swe-rex[modal]>=1.4.0` for improved cloud execution capabilities.
- Enhanced `messaging` optional dependencies to include `aiohttp>=3.9.0` for WhatsApp bridge communication.
- Refined installation scripts to check for Python version requirements, emphasizing the need for Python 3.11+ for RL training tools.
- Improved setup scripts to ensure proper installation of submodules and dependencies, enhancing user experience during setup.
2026-02-07 00:05:04 +00:00
Teknium 8dd38318fc Merge pull request #15 from NousResearch/rl-capabilities
Rl capabilities && File Operator Tools
2026-02-05 03:50:42 -08:00
teknium1 533c064269 Add file manipulation tools and enhance setup scripts
- Introduced file manipulation capabilities in `model_tools.py`, including functions for reading, writing, patching, and searching files.
- Added a new `file` toolset in `toolsets.py` and updated distributions to include file tools.
- Enhanced `setup-hermes.sh` and `install.sh` scripts to check for and optionally install `ripgrep` for faster file searching.
- Implemented a new `file_operations.py` module to encapsulate file operations using shell commands.
- Updated `doctor.py` and `install.ps1` to check for `ripgrep` and provide installation guidance if not found.
- Added fuzzy matching and patch parsing capabilities to improve file manipulation accuracy and flexibility.
2026-02-05 03:49:46 -08:00
teknium1 5c3105b437 Enhance RL test inference with WandB integration and real-time output streaming
- Added unique run ID generation for WandB tracking during test inference.
- Enabled WandB usage for test tracking and updated command-line arguments accordingly.
- Implemented real-time output streaming for process execution, improving log visibility and debugging.
- Enhanced error handling to display last few lines of stderr for better troubleshooting.
2026-02-04 21:07:07 -08:00
teknium1 3c0d0dba49 Update RL tools and enhance configuration management
- Modified `model_tools.py` to update default model IDs and add new RL function `rl_test_inference`.
- Enhanced `README.md` with installation instructions for submodules and updated API key usage.
- Improved `rl_cli.py` to load configuration from `~/.hermes/config.yaml` and set terminal working directory for RL tools.
- Updated `run_agent.py` to handle empty string arguments as empty objects for better JSON validation.
- Refined installation scripts to ensure submodules are cloned and installed correctly, enhancing setup experience.
2026-02-04 13:57:59 -08:00
teknium1 12bbca95ec Add tinker-atropos submodule and update RL training tools
- Added the tinker-atropos submodule for enhanced RL training capabilities.
- Updated model_tools.py to reorder RL function definitions and improve descriptions.
- Modified rl_cli.py to include checks for the tinker-atropos setup and provide user guidance.
- Adjusted toolsets.py and __init__.py to reflect changes in RL function availability.
- Enhanced rl_training_tool.py to manage training processes directly without a separate API server.
2026-02-04 10:36:01 -08:00
teknium1 f6574978de Add RL training configuration and tools
- Updated `.env.example` to include Tinker and WandB API keys for reinforcement learning training.
- Enhanced `model_tools.py` to clarify configuration options and streamline the RL training process.
- Expanded `README.md` with detailed instructions for setting up RL training using Tinker and WandB.
- Modified `hermes_cli` files to integrate RL training tools and ensure proper configuration checks.
- Improved `rl_training_tool.py` to reflect changes in training parameters and configuration management.
2026-02-04 09:36:51 -08:00
Teknium 8380895ae3 Update README.md 2026-02-04 00:35:45 -08:00
teknium1 f018999da9 initial RL training tools and loop 2026-02-03 23:41:26 -08:00
teknium1 51a6b7d2b5 Implement interrupt handling for message processing in GatewayRunner and BasePlatformAdapter
- Introduced a monitoring mechanism in GatewayRunner to detect incoming messages while an agent is active, allowing for graceful interruption and processing of new messages.
- Enhanced BasePlatformAdapter to manage active sessions and pending messages, ensuring that new messages can interrupt ongoing tasks effectively.
- Improved the handling of pending messages by checking for interrupts and processing them in the correct order, enhancing user experience during message interactions.
- Updated the cleanup process for active tasks to ensure proper resource management after interruptions.
2026-02-03 20:10:15 -08:00
teknium1 9bfe185a2e Implement interrupt handling for agent and CLI input and persistent prompt line at bottom of CLI :)
- Enhanced the AIAgent class to support interrupt requests, allowing for graceful interruption of ongoing tasks and processing of new messages.
- Updated the HermesCLI to manage user input in a persistent manner, enabling real-time interruption of the agent's conversation.
- Introduced a mechanism in the GatewayRunner to handle incoming messages while an agent is running, allowing for immediate response to user commands.
- Improved overall user experience by providing feedback during interruptions and ensuring that pending messages are processed correctly.
2026-02-03 16:15:49 -08:00
teknium1 beeb7896e0 Refactor message handling and error logging in agent and gateway
- Updated the AIAgent class to extract the first user message for trajectory formatting, improving the accuracy of user queries in the trajectory format.
- Enhanced the GatewayRunner to convert transcript history into the agent format, ensuring proper handling of message roles and content.
- Adjusted the typing indicator refresh rate to every 2 seconds for better responsiveness.
- Improved error handling in the message sending process for the Telegram adapter, implementing a fallback mechanism for Markdown parsing failures, and logging send failures for better debugging.
2026-02-03 15:42:54 -08:00
teknium1 212460289b Enhance skills tool to have an arg so it is more reliably called, and error handling in agent
- Updated the `skills_categories` function to include a `verbose` parameter, allowing users to request skill counts per category.
- Modified the `handle_skills_function_call` method to pass the `verbose` argument to `skills_categories`.
- Improved error handling in the `AIAgent` class by injecting a recovery message when invalid JSON arguments are detected, guiding users on how to correct their tool calls.
- Enhanced the `GatewayRunner` to return a user-friendly error message if the agent fails to generate a final response, improving overall user experience.
2026-02-03 15:26:59 -08:00
teknium1 221fb17c5e Refine typing indicator behavior in message handling
- Adjusted the `_keep_typing` method to refresh the typing indicator every 2 seconds instead of 4, improving responsiveness after progress messages.
- Updated the `GatewayRunner` to restore the typing indicator after sending progress messages, enhancing user experience during message processing.
2026-02-03 15:06:18 -08:00
teknium1 488deb04a4 fix telegram, import asyncio 2026-02-03 15:02:41 -08:00
teknium1 9d9eea9ac9 Enhance agent configuration and documentation for tool progress and working directory
- Updated the AIAgent class to include new parameters for maximum iterations and tool progress callback, improving agent behavior and user feedback.
- Added detailed documentation on working directory behavior for CLI and messaging platforms, clarifying the use of `MESSAGING_CWD`.
- Introduced tool progress notifications in messaging, allowing users to receive real-time updates during tool execution.
- Updated relevant sections in AGENTS.md, README.md, and messaging.md to reflect these enhancements and provide clearer setup instructions.
2026-02-03 14:57:27 -08:00
teknium1 e7f0ffbf5d Add tool progress notifications for messaging channels
- Introduced a new callback mechanism in the AIAgent class to send tool progress messages during execution, enhancing user feedback in messaging platforms.
- Updated the GatewayRunner to support tool progress notifications, allowing users to enable or disable this feature via environment variables.
- Enhanced the CLI setup wizard to prompt users for enabling tool progress messages and selecting the notification mode (all or new), improving configuration options.
- Updated relevant documentation to reflect the new features and configuration settings for tool progress notifications.
2026-02-03 14:54:43 -08:00
teknium1 a09b018bd5 Implement continuous typing indicator in message handling
- Added a new private method `_keep_typing` to send a typing indicator continuously while processing messages, refreshing every 4 seconds to comply with Telegram/Discord limitations.
- Updated the `handle_message` method to initiate the typing indicator at the start of message processing and ensure it stops once processing is complete, improving user experience during message handling.
2026-02-03 14:51:31 -08:00
teknium1 7eac4ee9fe Update agent configuration for maximum tool-calling iterations
- Increased the default maximum tool-calling iterations from 20 to 60 in the CLI configuration and related files, allowing for more complex tasks.
- Updated documentation and comments to reflect the new recommended range for iterations, enhancing user guidance.
- Implemented backward compatibility for loading max iterations from the root-level configuration, ensuring a smooth transition for existing users.
- Adjusted the setup wizard to prompt for the maximum iterations setting, improving user experience during configuration.
2026-02-03 14:48:19 -08:00
teknium1 17a5efb416 Enhance messaging gateway configuration and security features
- Added new environment variables for Telegram and Discord bot configurations, including `TELEGRAM_ALLOWED_USERS` and `DISCORD_ALLOWED_USERS`, to restrict bot access to specific users.
- Updated documentation in AGENTS.md and README.md to include detailed setup instructions for the messaging gateway, emphasizing the importance of user allowlists for security.
- Improved the CLI setup wizard to prompt for allowed user IDs during configuration, enhancing user guidance and security awareness.
- Refined the gateway run script to support user authorization checks, ensuring only allowed users can interact with the bot.
2026-02-03 10:46:23 -08:00
teknium1 3e634aa7e4 Update requirements and enhance environment variable loading in gateway
- Updated requirements.txt to uncomment and ensure the installation of `python-telegram-bot` and `discord.py` packages.
- Enhanced the gateway run script to load environment variables from a specified path, improving configuration management and flexibility for different environments.
2026-02-03 07:02:59 -08:00
teknium1 5d3398aa8a Refactor terminal tool command approval process and enhance CLI feedback
- Updated the terminal tool's command approval flow to improve user interaction when executing potentially dangerous commands, replacing the previous confirmation method with a clear explanation and instructions for adding commands to the allowlist.
- Removed the internal `force` parameter from the model API, ensuring that dangerous command approvals are handled solely through user prompts.
- Enhanced the CLI to provide better feedback regarding tool availability, including improved messaging for enabled and disabled toolsets.
- Updated AGENTS.md to reflect changes in the command approval process and configuration instructions.
2026-02-02 23:46:41 -08:00
teknium1 76d929e177 Implement dangerous command approval system for terminal tool
- Added a safety mechanism to detect and approve potentially dangerous commands (e.g., `rm -rf`, `DROP TABLE`).
- Introduced an approval flow for local/SSH backends, prompting users for confirmation with options to allow once, for the session, or permanently.
- Updated configuration to include a `command_allowlist` for storing approved patterns.
- Enhanced messaging for sudo failures in messaging contexts.
- Updated relevant documentation in AGENTS.md and TODO.md to reflect these changes.
2026-02-02 23:35:18 -08:00
Teknium be91af7551 Refactor TODO list and remove completed items
Removed high-priority immediate fixes section and reorganized the TODO list. Updated various sections to reflect new priorities and ideas.
2026-02-02 23:08:27 -08:00
teknium1 c9011fc7e1 Add uninstall command to CLI and update documentation
- Introduced a new `uninstall` command in the CLI for the Hermes Agent, allowing users to remove the agent while optionally retaining configuration files for future reinstallation.
- Updated AGENTS.md and README.md to include the new uninstall functionality, enhancing user guidance on available commands and their purposes.
- Improved command-line interface with detailed help options for the uninstall process, including flags for full removal and confirmation prompts.
2026-02-02 22:18:18 -08:00
teknium1 ff776b57bf Remove outdated .cursorrules file and add comprehensive AGENTS.md documentation
- Deleted the .cursorrules file, which contained legacy information about the Hermes-Agent project structure and development environment.
- Introduced AGENTS.md, a detailed development guide for the Hermes Agent, outlining project structure, configuration management, CLI architecture, and agent functionality.
- Enhanced user guidance for setting up the development environment and utilizing the CLI effectively, including new commands for configuration management.
2026-02-02 19:45:42 -08:00
teknium1 3ee788dacc Implement configuration migration system and enhance CLI setup
- Introduced a configuration migration system to check for missing required environment variables and outdated config fields, prompting users for necessary inputs during updates.
- Enhanced the CLI with new commands for checking and migrating configuration, improving user experience by providing clear guidance on required settings.
- Updated the setup wizard to detect existing installations and offer quick setup options for missing configurations, streamlining the user onboarding process.
- Improved messaging throughout the CLI to inform users about the status of their configuration and any required actions.
2026-02-02 19:39:23 -08:00
teknium1 fef504f038 Refactor configuration file management and improve user feedback
- Updated the setup wizard and installation scripts to standardize the configuration file paths under ~/.hermes, enhancing clarity for users.
- Improved messaging in the CLI to clearly indicate where configuration files and data directories are located.
- Streamlined the creation of configuration files, ensuring they are easily accessible and organized within the new directory structure.
2026-02-02 19:34:56 -08:00
teknium1 bbb5776763 Enhance tool availability checks and user feedback in CLI
- Updated the CLI to include a new method for displaying warnings about disabled tools due to missing API keys.
- Integrated tool availability checks into the setup wizard and doctor commands, providing users with clear information on which tools are available and what is required for full functionality.
- Improved user prompts and feedback regarding API key configuration, emphasizing the importance of setting up keys for certain tools.
- Added detailed summaries of tool availability during setup and diagnostics, enhancing the overall user experience.
2026-02-02 19:28:27 -08:00
teknium1 e87bee9ccd Refactor setup wizard for improved API key and provider configuration
- Updated the setup wizard to clarify the OpenRouter API key requirement and enhance user prompts for API key input.
- Streamlined the main agent provider selection process, allowing users to choose between OpenRouter and custom endpoints with improved guidance.
- Renumbered setup steps for better organization and clarity, ensuring a smoother user experience during configuration.
- Enhanced error handling and user feedback for API configuration, emphasizing the importance of the OpenRouter key for certain tools.
2026-02-02 19:23:20 -08:00
teknium1 69a338610a Enhance repository cloning logic in install script
- Updated the install script to attempt cloning via SSH first for private repositories, falling back to HTTPS if the SSH method fails.
- Added detailed error handling and user guidance for SSH key setup, improving the installation experience for users with private repositories.
2026-02-02 19:19:26 -08:00
teknium1 aa6394e94f Update install script to support SSH and HTTPS repository URLs
- Modified the install script to include separate variables for SSH and HTTPS repository URLs, enhancing flexibility for users during the cloning process.
- This change allows users to choose their preferred method of accessing the repository, improving the overall installation experience.
2026-02-02 19:19:12 -08:00
teknium1 ef409c6a24 Enhance repository cloning in install script
- Updated the install script to support both SSH and HTTPS cloning methods for the repository, improving flexibility for users with different access configurations.
- Added error handling and informative logging to guide users in case of cloning failures, particularly for private repositories requiring SSH key setup.
- Refactored the cloning logic to attempt SSH first, falling back to HTTPS if necessary, ensuring a smoother installation experience.
2026-02-02 19:19:07 -08:00
teknium1 da4167560f Enhance terminal backend selection in setup wizard
- Added platform detection to customize available terminal backend options based on the operating system (Linux, macOS, Windows).
- Updated terminal choices to include Singularity/Apptainer only for Linux users, with appropriate warnings for unsupported selections.
- Improved user prompts for Docker and local configurations to provide platform-specific guidance.
- Refactored backend selection logic to streamline the process and ensure accurate mapping of user choices to backend configurations.
2026-02-02 19:15:30 -08:00
teknium1 3488576bd8 Update terminal configuration and enhance CLI model management
- Changed default Docker, Singularity, and Modal images in configuration files to use "nikolaik/python-nodejs:python3.11-nodejs20" for improved compatibility.
- Updated the default model in the configuration to "anthropic/claude-sonnet-4.5" and adjusted related setup prompts for API provider configuration.
- Introduced a new CLI option for selecting a custom OpenAI-compatible endpoint, enhancing flexibility in model provider setup.
- Enhanced the prompt choice functionality to support arrow key navigation for better user experience in CLI interactions.
- Updated documentation in relevant files to reflect these changes and improve user guidance.
2026-02-02 19:13:41 -08:00
teknium1 619c72e566 Enhance CLI with multi-platform messaging integration and configuration management
- Updated CLI to load configuration from user-specific and project-specific YAML files, prioritizing user settings.
- Introduced a new command `/platforms` to display the status of connected messaging platforms (Telegram, Discord, WhatsApp).
- Implemented a gateway system for handling messaging interactions, including session management and delivery routing for cron job outputs.
- Added support for environment variable configuration and a dedicated gateway configuration file for advanced settings.
- Enhanced documentation in README.md and added a new messaging.md file to guide users on platform integrations and setup.
- Updated toolsets to include platform-specific capabilities for Telegram, Discord, and WhatsApp, ensuring secure and tailored interactions.
2026-02-02 19:01:51 -08:00
teknium1 a3ba41fce2 Implement cron job management system for scheduled tasks (similar to OpenAI's Pulse but the AI can also schedule jobs)
- Introduced a new cron job system allowing users to schedule automated tasks via the CLI, supporting one-time reminders and recurring jobs.
- Added commands for managing cron jobs: `/cron` to list jobs, `/cron add` to create new jobs, and `/cron remove` to delete jobs.
- Implemented job storage in `~/.hermes/cron/jobs.json` with output saved to `~/.hermes/cron/output/{job_id}/{timestamp}.md`.
- Enhanced the CLI and README documentation to include detailed usage instructions and examples for cron job management.
- Integrated cron job tools into the hermes-cli toolset, ensuring they are only available in interactive CLI mode.
- Added support for cron expression parsing with the `croniter` package, enabling flexible scheduling options.
2026-02-02 08:26:42 -08:00
teknium1 c935a604f8 Refactor TODO.md to reorganize task sections and update descriptions
- Renamed and reordered sections in the TODO list for clarity, moving "Interactive Clarifying Questions Tool" to section 5 and "Collaborative Problem Solving" to section 6.
- Removed outdated ideas related to task continuation hints and resource awareness, streamlining the focus on current development priorities.
- Enhanced the overall structure of the TODO list to better reflect ongoing and future tasks.
2026-02-02 01:25:03 -08:00
teknium1 e114f09f70 Implement reasoning extraction and enhance assistant message handling
- Added a new method `_extract_reasoning` to extract reasoning content from assistant messages, accommodating multiple formats from various providers.
- Updated message handling to ensure all assistant messages include reasoning content for API compatibility, preserving multi-turn reasoning context.
- Enhanced logging to capture reasoning details for debugging and analysis.
- Modified the TODO.md to reflect changes in planning and task management, emphasizing the need for structured task decomposition and progress tracking.
2026-02-01 22:48:18 -08:00
teknium1 9b4d9452ba Add context compression feature for long conversations
- Implemented automatic context compression to manage long conversations that approach the model's context limit.
- Configured the feature to summarize middle turns while protecting the first three and last four turns, ensuring important context is retained.
- Added configuration options in `cli-config.yaml` and environment variables for enabling/disabling compression and setting thresholds.
- Updated documentation in `README.md`, `cli.md`, and `.env.example` to explain the context compression functionality and its configuration.
- Enhanced the `cli.py` to load compression settings into environment variables, ensuring seamless integration with the CLI.
- Completed the implementation of context compression as outlined in the TODO list, marking it as a significant enhancement to conversation management.
2026-02-01 18:01:31 -08:00
176 changed files with 31746 additions and 22206 deletions
-115
View File
@@ -1,115 +0,0 @@
# Cline's Memory Bank
I am Cline, an expert software engineer with a unique characteristic: my memory resets completely between sessions. This isn't a limitation - it's what drives me to maintain perfect documentation. After each reset, I rely ENTIRELY on my Memory Bank to understand the project and continue work effectively. I MUST read ALL memory bank files at the start of EVERY task - this is not optional.
## Memory Bank Structure
The Memory Bank consists of core files and optional context files, all in Markdown format. Files build upon each other in a clear hierarchy:
flowchart TD
PB[projectbrief.md] --> PC[productContext.md]
PB --> SP[systemPatterns.md]
PB --> TC[techContext.md]
PC --> AC[activeContext.md]
SP --> AC
TC --> AC
AC --> P[progress.md]
### Core Files (Required)
1. `projectbrief.md`
- Foundation document that shapes all other files
- Created at project start if it doesn't exist
- Defines core requirements and goals
- Source of truth for project scope
2. `productContext.md`
- Why this project exists
- Problems it solves
- How it should work
- User experience goals
3. `activeContext.md`
- Current work focus
- Recent changes
- Next steps
- Active decisions and considerations
- Important patterns and preferences
- Learnings and project insights
4. `systemPatterns.md`
- System architecture
- Key technical decisions
- Design patterns in use
- Component relationships
- Critical implementation paths
5. `techContext.md`
- Technologies used
- Development setup
- Technical constraints
- Dependencies
- Tool usage patterns
6. `progress.md`
- What works
- What's left to build
- Current status
- Known issues
- Evolution of project decisions
### Additional Context
Create additional files/folders within memory-bank/ when they help organize:
- Complex feature documentation
- Integration specifications
- API documentation
- Testing strategies
- Deployment procedures
## Core Workflows
### Plan Mode
flowchart TD
Start[Start] --> ReadFiles[Read Memory Bank]
ReadFiles --> CheckFiles{Files Complete?}
CheckFiles -->|No| Plan[Create Plan]
Plan --> Document[Document in Chat]
CheckFiles -->|Yes| Verify[Verify Context]
Verify --> Strategy[Develop Strategy]
Strategy --> Present[Present Approach]
### Act Mode
flowchart TD
Start[Start] --> Context[Check Memory Bank]
Context --> Update[Update Documentation]
Update --> Execute[Execute Task]
Execute --> Document[Document Changes]
## Documentation Updates
Memory Bank updates occur when:
1. Discovering new project patterns
2. After implementing significant changes
3. When user requests with **update memory bank** (MUST review ALL files)
4. When context needs clarification
flowchart TD
Start[Update Process]
subgraph Process
P1[Review ALL Files]
P2[Document Current State]
P3[Clarify Next Steps]
P4[Document Insights & Patterns]
P1 --> P2 --> P3 --> P4
end
Start --> Process
Note: When triggered by **update memory bank**, I MUST review every memory bank file, even if some don't require updates. Focus particularly on activeContext.md and progress.md as they track current state.
REMEMBER: After every memory reset, I begin completely fresh. The Memory Bank is my only link to previous work. It must be maintained with precision and clarity, as my effectiveness depends entirely on its accuracy.
-201
View File
@@ -1,201 +0,0 @@
Hermes-Agent is an agent harness for LLMs with an interactive CLI.
## Development Environment
**IMPORTANT**: Always use the virtual environment if it exists:
```bash
source venv/bin/activate # Before running any Python commands
```
## Project Structure
- `hermes` - CLI launcher script (run with `./hermes`)
- `cli.py` - Interactive CLI with Rich UI, prompt_toolkit, animated spinners
- `cli-config.yaml` - CLI configuration (model, terminal, toolsets, personalities)
- `tools/` - Individual tool implementations (web, terminal, browser, vision, etc.)
- `tools/__init__.py` - Exports all tools for importing
- `model_tools.py` - Consolidates tool schemas and handlers for the agent
- `toolsets.py` - Groups tools into logical toolsets (web, terminal, browser, etc.)
- `toolset_distributions.py` - Probability-based tool selection for data generation
- `run_agent.py` - Primary agent runner with AIAgent class and KawaiiSpinner
- `batch_runner.py` - Parallel batch processing with checkpointing
- `tests/` - Test scripts
## File Dependency Chain
```
tools/*.py → tools/__init__.py → model_tools.py → toolsets.py → toolset_distributions.py
run_agent.py ──────────────────────────┘
cli.py → run_agent.py (uses AIAgent with quiet_mode=True)
batch_runner.py → run_agent.py + toolset_distributions.py
```
Always ensure consistency between tools, model_tools.py, and toolsets.py when changing any of them.
## CLI Architecture (cli.py)
The interactive CLI uses:
- **Rich** - For the welcome banner and styled panels
- **prompt_toolkit** - For fixed input area with history and `patch_stdout`
- **KawaiiSpinner** (in run_agent.py) - Animated feedback during API calls and tool execution
Key components:
- `HermesCLI` class - Main CLI controller with commands and conversation loop
- `load_cli_config()` - Loads `cli-config.yaml`, sets environment variables for terminal
- `build_welcome_banner()` - Displays ASCII art logo, tools, and skills summary
- `/commands` - Process user commands like `/help`, `/clear`, `/personality`, etc.
CLI uses `quiet_mode=True` when creating AIAgent to suppress verbose logging and enable kawaii-style feedback instead.
### Adding CLI Commands
1. Add to `COMMANDS` dict with description
2. Add handler in `process_command()` method
3. For persistent settings, use `save_config_value()` to update `cli-config.yaml`
## Adding a New Tool
Follow this strict order to maintain consistency:
1. Create `tools/your_tool.py` with:
- Handler function (sync or async) returning a JSON string via `json.dumps()`
- `check_*_requirements()` function to verify dependencies (e.g., API keys)
- Schema definition following OpenAI function-calling format
2. Export in `tools/__init__.py`:
- Import the handler and check function
- Add to `__all__` list
3. Register in `model_tools.py`:
- Create `get_*_tool_definitions()` function or add to existing
- Add routing in `handle_function_call()` dispatcher
- Update `get_all_tool_names()` with the tool name
- Update `get_toolset_for_tool()` mapping
- Update `get_available_toolsets()` and `check_toolset_requirements()`
4. Add to toolset in `toolsets.py`:
- Add to existing toolset or create new one in TOOLSETS dict
5. Optionally add to `toolset_distributions.py` for batch processing
## Tool Implementation Pattern
```python
# tools/example_tool.py
import json
import os
def check_example_requirements() -> bool:
"""Check if required API keys/dependencies are available."""
return bool(os.getenv("EXAMPLE_API_KEY"))
def example_tool(param: str, task_id: str = None) -> str:
"""Execute the tool and return JSON string result."""
try:
result = {"success": True, "data": "..."}
return json.dumps(result, ensure_ascii=False)
except Exception as e:
return json.dumps({"error": str(e)}, ensure_ascii=False)
```
All tool handlers MUST return a JSON string. Never return raw dicts.
## Stateful Tools
Tools that maintain state (terminal, browser) require:
- `task_id` parameter for session isolation between concurrent tasks
- `cleanup_*()` function to release resources
- Cleanup is called automatically in run_agent.py after conversation completes
## Environment Variables
API keys are loaded from `.env` file in repo root:
- `OPENROUTER_API_KEY` - Main LLM API access (primary provider)
- `FIRECRAWL_API_KEY` - Web search/extract tools
- `BROWSERBASE_API_KEY` / `BROWSERBASE_PROJECT_ID` - Browser automation
- `FAL_KEY` - Image generation (FLUX model)
- `NOUS_API_KEY` - Vision and Mixture-of-Agents tools
Terminal tool configuration (can also be set in `cli-config.yaml`):
- `TERMINAL_ENV` - Backend: local, docker, singularity, modal, or ssh
- `TERMINAL_CWD` - Working directory
- `TERMINAL_SSH_HOST`, `TERMINAL_SSH_USER`, `TERMINAL_SSH_KEY` - For SSH backend
## Agent Loop (run_agent.py)
The AIAgent class handles:
- Processing enabled toolsets to provide to the model
- Piping prompts to the agent
- Looping LLM calls when tools are invoked, until natural language response
- Returning the final response
Uses OpenAI-compatible API (primarily OpenRouter) with the OpenAI Python SDK.
## Reasoning Model Support
For models that support chain-of-thought reasoning:
- Extract `reasoning_content` from API responses
- Store in `assistant_msg["reasoning"]` for trajectory export
- Pass back via `reasoning_content` field on subsequent turns
## Trajectory Format
Conversations are saved in ShareGPT format for training:
```json
{"from": "system", "value": "System prompt with <tools>...</tools>"}
{"from": "human", "value": "User message"}
{"from": "gpt", "value": "<think>reasoning</think>\n<tool_call>{...}</tool_call>"}
{"from": "tool", "value": "<tool_response>{...}</tool_response>"}
{"from": "gpt", "value": "Final response"}
```
Tool calls use `<tool_call>` XML tags, responses use `<tool_response>` tags, reasoning uses `<think>` tags.
## Batch Processing (batch_runner.py)
For processing multiple prompts:
- Parallel execution with multiprocessing
- Content-based resume for fault tolerance (matches on prompt text, not indices)
- Toolset distributions control probabilistic tool availability per prompt
- Output: `data/<run_name>/trajectories.jsonl` (combined) + individual batch files
## Logging
Trajectories restructure tools as a system prompt for storage in a format suitable for later training use.
## Skills System
Skills are on-demand knowledge documents the agent can load. Located in `skills/` directory:
```
skills/
├── mlops/ # Category folder
│ ├── axolotl/ # Skill folder
│ │ ├── SKILL.md # Main instructions (required)
│ │ ├── references/ # Additional docs, API specs
│ │ └── templates/ # Output formats, configs
│ └── vllm/
│ └── SKILL.md
└── example-skill/
└── SKILL.md
```
**Progressive disclosure** (token-efficient):
1. `skills_categories()` - List category names (~50 tokens)
2. `skills_list(category)` - Name + description per skill (~3k tokens)
3. `skill_view(name)` - Full content + tags + linked files
SKILL.md files use YAML frontmatter:
```yaml
---
name: skill-name
description: Brief description for listing
tags: [tag1, tag2]
related_skills: [other-skill]
version: 1.0.0
---
# Skill Content...
```
Tool files: `tools/skills_tool.py` → `model_tools.py` → `toolsets.py`
+79 -144
View File
@@ -1,73 +1,17 @@
# Hermes Agent Environment Configuration
# Copy this file to .env and fill in your API keys
# =============================================================================
# CORE SETTINGS
# =============================================================================
# Agent backend:
# - openai : default Hermes-Agent loop (OpenAI function-calling via OpenAI SDK)
# - atropos : Atroposlib ServerManager/ManagedServer-backed loop (training/env integration)
HERMES_BACKEND=openai
# =============================================================================
# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
# =============================================================================
# For local development (matches the Atropos test env defaults):
# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
# ATROPOS_SERVER_MODEL=hermes-4-36b
# For hosted inference (Nous Research inference API):
ATROPOS_SERVER_BASE_URL=
ATROPOS_SERVER_MODEL=
ATROPOS_TOKENIZER_NAME=
# Set this to your Nous API key (Bearer token).
ATROPOS_SERVER_API_KEY=
# Debugging (prints to stdout; use with care)
# HERMES_DEBUG_ATROPOS_REQUEST=1
# HERMES_DEBUG_ATROPOS_RESPONSE=1
# HERMES_DEBUG_OPENAI_REQUEST=1
# HERMES_DEBUG_OPENAI_RESPONSE=1
# =============================================================================
# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
# =============================================================================
# If you set ATROPOS_SERVER_BASE_URL or OPENAI_BASE_URL, Hermes will use it instead
# of OpenRouter.
#
# Local server convenience (base URL without /v1):
# llama.cpp example (see `Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh`):
# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
# ATROPOS_SERVER_MODEL=hermes-4-36b
# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
# ATROPOS_SERVER_API_KEY=local
#
# Hosted Nous inference API:
# ATROPOS_SERVER_BASE_URL=https://inference-api.nousresearch.com
# ATROPOS_SERVER_MODEL=Hermes-4.3-36B
# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
# ATROPOS_SERVER_API_KEY=sk-... (Bearer token)
#
# If you plan to run GRPO-style group sampling (e.g. `--env.group_size 4`) against
# llama.cpp, start the server with at least that many slots, e.g.:
# LLAMA_CPP_PARALLEL=4 Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
#
# Generic OpenAI-compatible (base URL should include /v1):
# OPENAI_BASE_URL=http://127.0.0.1:8080/v1
# OPENAI_API_KEY=local
# =============================================================================
# LLM PROVIDER (OpenRouter)
# =============================================================================
# OpenRouter provides access to many models through one API
# All LLM calls go through OpenRouter - no direct provider keys needed
# Get your key at: https://openrouter.ai/keys
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENROUTER_API_KEY=
# Default model to use (OpenRouter format: provider/model)
# Examples: anthropic/claude-sonnet-4, openai/gpt-4o, google/gemini-2.0-flash, zhipuai/glm-4-plus
LLM_MODEL=anthropic/claude-sonnet-4
# Examples: anthropic/claude-opus-4.6, openai/gpt-4o, google/gemini-2.0-flash, zhipuai/glm-4-plus
LLM_MODEL=anthropic/claude-opus-4.6
# =============================================================================
# TOOL API KEYS
@@ -96,13 +40,20 @@ FAL_KEY=
# - modal: Runs in Modal cloud sandboxes (scalable, requires Modal account)
TERMINAL_ENV=local
# Container images (for singularity/docker/modal backends)
TERMINAL_DOCKER_IMAGE=python:3.11
TERMINAL_SINGULARITY_IMAGE=docker://python:3.11
TERMINAL_MODAL_IMAGE=python:3.11
# Working directory inside the container
TERMINAL_CWD=/tmp
# Container images (for singularity/docker/modal backends)
TERMINAL_DOCKER_IMAGE=nikolaik/python-nodejs:python3.11-nodejs20
TERMINAL_SINGULARITY_IMAGE=docker://nikolaik/python-nodejs:python3.11-nodejs20
TERMINAL_MODAL_IMAGE=nikolaik/python-nodejs:python3.11-nodejs20
# Working directory for terminal commands
# For local backend: "." means current directory (resolved automatically)
# For remote backends (ssh/docker/modal/singularity): use an absolute path
# INSIDE the target environment, or leave unset for the backend's default
# (/root for modal, / for docker, ~ for ssh). Do NOT use a host-local path.
# Usually managed by config.yaml (terminal.cwd) — uncomment to override
# TERMINAL_CWD=.
# Default command timeout in seconds
TERMINAL_TIMEOUT=60
@@ -144,87 +95,12 @@ TERMINAL_LIFETIME_SECONDS=300
# SUDO_PASSWORD=your_password_here
# =============================================================================
# MODAL CLOUD BACKEND (for TERMINAL_ENV=modal)
# MODAL CLOUD BACKEND (Optional - for TERMINAL_ENV=modal)
# =============================================================================
# Modal provides cloud sandboxes with per-second billing and auto-scaling.
# This implementation uses a warm pool of sandboxes for cost efficiency.
#
# SETUP:
# pip install modal && modal setup
# (Authenticates via browser, stores credentials locally)
#
# FEATURES:
# - Auto-scaling warm sandbox pool (no cold start after first use)
# - Named sandbox recovery (reconnects after restart)
# - Profile-based heterogeneous environments (CPU, GPU, different images)
# - Server-side idle_timeout protection against orphaned sandboxes
# Modal app name (groups all sandboxes, used for recovery)
TERMINAL_MODAL_APP_NAME=hermes-sandbox
# Default profile when none specified
TERMINAL_MODAL_DEFAULT_PROFILE=default
# Profile config file (optional - YAML format, see modal_profiles.yaml)
# TERMINAL_MODAL_PROFILES_FILE=modal_profiles.yaml
# --- Default Profile Settings (used if no YAML file) ---
# These apply when no profile is specified or for the "default" profile
TERMINAL_MODAL_IMAGE=python:3.11
TERMINAL_MODAL_MIN_POOL=1
TERMINAL_MODAL_MAX_POOL=5
TERMINAL_MODAL_IDLE_TIMEOUT=120
TERMINAL_MODAL_MAX_LIFETIME=3600
TERMINAL_MODAL_SCALE_DOWN_IDLE=180
# --- Custom Profile Example: pytorch-gpu ---
# Uncomment to enable a GPU profile for ML tasks
# Usage: terminal_tool("python train.py", profile="pytorch-gpu")
#
# TERMINAL_MODAL_PROFILE_pytorch_gpu_IMAGE=pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
# TERMINAL_MODAL_PROFILE_pytorch_gpu_GPU=T4
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MEMORY=16384
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MIN_POOL=0
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MAX_POOL=2
# TERMINAL_MODAL_PROFILE_pytorch_gpu_IDLE_TIMEOUT=60
# --- Custom Profile Example: node ---
# Uncomment to enable a Node.js profile
# Usage: terminal_tool("npm test", profile="node")
#
# TERMINAL_MODAL_PROFILE_node_IMAGE=node:18
# TERMINAL_MODAL_PROFILE_node_MIN_POOL=0
# TERMINAL_MODAL_PROFILE_node_MAX_POOL=3
# =============================================================================
# MODAL SECRETS (Secure credential injection)
# =============================================================================
# Modal Secrets allow you to securely pass API keys, passwords, and other
# sensitive data to your sandboxes without exposing them in code or logs.
#
# SETUP SECRETS:
# 1. Via Dashboard: https://modal.com/secrets
# 2. Via CLI: modal secret create my-secret KEY1=value1 KEY2=value2
# 3. Via CLI with env: modal secret create my-secret API_KEY="$API_KEY"
#
# LIST SECRETS:
# modal secret list
#
# DELETE SECRETS:
# modal secret delete my-secret
# Global secrets applied to ALL profiles (comma-separated secret names)
# These secrets must be created on Modal dashboard or via CLI first
# TERMINAL_MODAL_SECRETS=my-api-keys,database-creds
# Per-profile secrets (comma-separated secret names)
# TERMINAL_MODAL_PROFILE_pytorch_gpu_SECRETS=huggingface-token,wandb-key
# Per-profile environment variables (semicolon-separated KEY=VALUE pairs)
# TERMINAL_MODAL_PROFILE_default_ENV_VARS=DEBUG=1;LOG_LEVEL=info
# Load local .env file into sandbox (useful for development)
# TERMINAL_MODAL_PROFILE_default_USE_DOTENV=true
# Modal uses CLI authentication, not environment variables.
# Run: pip install modal && modal setup
# This will authenticate via browser and store credentials locally.
# No API key needed in .env - Modal handles auth automatically.
# =============================================================================
# BROWSER TOOL CONFIGURATION (agent-browser + Browserbase)
@@ -266,6 +142,37 @@ BROWSER_INACTIVITY_TIMEOUT=120
# Format: logs/session_YYYYMMDD_HHMMSS_UUID.json
# Contains full conversation history in trajectory format for debugging/replay
# =============================================================================
# VOICE TRANSCRIPTION & OPENAI TTS
# =============================================================================
# Required for voice message transcription (Whisper) and OpenAI TTS voices.
# Uses OpenAI's API directly (not via OpenRouter).
# Named HERMES_OPENAI_API_KEY to avoid interference with OpenRouter.
# Get at: https://platform.openai.com/api-keys
HERMES_OPENAI_API_KEY=
# =============================================================================
# SLACK INTEGRATION
# =============================================================================
# Slack Bot Token - From Slack App settings (OAuth & Permissions)
# Get at: https://api.slack.com/apps
# SLACK_BOT_TOKEN=xoxb-...
# Slack App Token - For Socket Mode (App-Level Tokens in Slack App settings)
# SLACK_APP_TOKEN=xapp-...
# Slack allowed users (comma-separated Slack user IDs)
# SLACK_ALLOWED_USERS=
# =============================================================================
# RESPONSE PACING
# =============================================================================
# Human-like delays between message chunks on messaging platforms.
# Makes the bot feel less robotic.
# HERMES_HUMAN_DELAY_MODE=off # off | natural | custom
# HERMES_HUMAN_DELAY_MIN_MS=800 # Min delay in ms (custom mode)
# HERMES_HUMAN_DELAY_MAX_MS=2500 # Max delay in ms (custom mode)
# =============================================================================
# LEGACY/OPTIONAL API KEYS
# =============================================================================
@@ -285,3 +192,31 @@ WEB_TOOLS_DEBUG=false
VISION_TOOLS_DEBUG=false
MOA_TOOLS_DEBUG=false
IMAGE_TOOLS_DEBUG=false
# =============================================================================
# CONTEXT COMPRESSION (Auto-shrinks long conversations)
# =============================================================================
# When conversation approaches model's context limit, middle turns are
# automatically summarized to free up space.
#
# CONTEXT_COMPRESSION_ENABLED=true # Enable auto-compression (default: true)
# CONTEXT_COMPRESSION_THRESHOLD=0.85 # Compress at 85% of context limit
# CONTEXT_COMPRESSION_MODEL=google/gemini-2.0-flash-001 # Fast model for summaries
# =============================================================================
# RL TRAINING (Tinker + Atropos)
# =============================================================================
# Run reinforcement learning training on language models using the Tinker API.
# Requires the rl-server to be running (from tinker-atropos package).
# Tinker API Key - RL training service
# Get at: https://tinker-console.thinkingmachines.ai/keys
TINKER_API_KEY=
# Weights & Biases API Key - Experiment tracking and metrics
# Get at: https://wandb.ai/authorize
WANDB_API_KEY=
# RL API Server URL (default: http://localhost:8080)
# Change if running the rl-server on a different host/port
# RL_API_URL=http://localhost:8080
+4 -20
View File
@@ -39,26 +39,10 @@ agent-browser/
*.pem
privvy*
images/
__pycache__/
hermes_agent.egg-info/
wandb/
testlogs
# CLI config (may contain sensitive SSH paths)
cli-config.yaml
.DS_Store
# artifacts
*.jsonl
*.html
*.json
*.log
*.csv
# Singularity/Apptainer images (large binary files)
*.sif
# Test files
test_singularity_*.py
test_*.py
!tests/test_*.py
# Nomad data
/tmp/NomadClient*/
+3
View File
@@ -1,3 +1,6 @@
[submodule "mini-swe-agent"]
path = mini-swe-agent
url = https://github.com/SWE-agent/mini-swe-agent
[submodule "tinker-atropos"]
path = tinker-atropos
url = https://github.com/nousresearch/tinker-atropos
+609
View File
@@ -0,0 +1,609 @@
# Hermes Agent - Development Guide
Instructions for AI coding assistants (GitHub Copilot, Cursor, etc.) and human developers.
Hermes-Agent is an AI agent harness with tool-calling capabilities, interactive CLI, messaging integrations, and scheduled tasks.
## Development Environment
**IMPORTANT**: Always use the virtual environment if it exists:
```bash
source venv/bin/activate # Before running any Python commands
```
## Project Structure
```
hermes-agent/
├── hermes_cli/ # Unified CLI commands
│ ├── main.py # Entry point, command dispatcher
│ ├── setup.py # Interactive setup wizard
│ ├── config.py # Config management & migration
│ ├── status.py # Status display
│ ├── doctor.py # Diagnostics
│ ├── gateway.py # Gateway management
│ ├── uninstall.py # Uninstaller
│ └── cron.py # Cron job management
├── tools/ # Tool implementations
│ ├── process_registry.py # Background process management (spawn, poll, wait, kill)
│ ├── transcription_tools.py # Speech-to-text (Whisper API)
├── gateway/ # Messaging platform adapters
│ ├── pairing.py # DM pairing code system
│ ├── hooks.py # Event hook system
│ ├── sticker_cache.py # Telegram sticker vision cache
│ ├── platforms/
│ │ └── slack.py # Slack adapter (slack-bolt)
├── cron/ # Scheduler implementation
├── skills/ # Knowledge documents
├── cli.py # Interactive CLI (Rich UI)
├── run_agent.py # Agent runner with AIAgent class
├── model_tools.py # Tool schemas and handlers
├── toolsets.py # Tool groupings
├── toolset_distributions.py # Probability-based tool selection
└── batch_runner.py # Parallel batch processing
```
**User Configuration** (stored in `~/.hermes/`):
- `~/.hermes/config.yaml` - Settings (model, terminal, toolsets, etc.)
- `~/.hermes/.env` - API keys and secrets
- `~/.hermes/pairing/` - DM pairing data
- `~/.hermes/hooks/` - Custom event hooks
- `~/.hermes/image_cache/` - Cached user images
- `~/.hermes/audio_cache/` - Cached user voice messages
- `~/.hermes/sticker_cache.json` - Telegram sticker descriptions
## File Dependency Chain
```
tools/*.py → tools/__init__.py → model_tools.py → toolsets.py → toolset_distributions.py
run_agent.py ──────────────────────────┘
cli.py → run_agent.py (uses AIAgent with quiet_mode=True)
batch_runner.py → run_agent.py + toolset_distributions.py
```
Always ensure consistency between tools, model_tools.py, and toolsets.py when changing any of them.
---
## AIAgent Class
The main agent is implemented in `run_agent.py`:
```python
class AIAgent:
def __init__(
self,
model: str = "anthropic/claude-sonnet-4",
api_key: str = None,
base_url: str = "https://openrouter.ai/api/v1",
max_iterations: int = 60, # Max tool-calling loops
enabled_toolsets: list = None,
disabled_toolsets: list = None,
verbose_logging: bool = False,
quiet_mode: bool = False, # Suppress progress output
tool_progress_callback: callable = None, # Called on each tool use
):
# Initialize OpenAI client, load tools based on toolsets
...
def chat(self, user_message: str, task_id: str = None) -> str:
# Main entry point - runs the agent loop
...
```
### Agent Loop
The core loop in `_run_agent_loop()`:
```
1. Add user message to conversation
2. Call LLM with tools
3. If LLM returns tool calls:
- Execute each tool
- Add tool results to conversation
- Go to step 2
4. If LLM returns text response:
- Return response to user
```
```python
while turns < max_turns:
response = client.chat.completions.create(
model=model,
messages=messages,
tools=tool_schemas,
)
if response.tool_calls:
for tool_call in response.tool_calls:
result = await execute_tool(tool_call)
messages.append(tool_result_message(result))
turns += 1
else:
return response.content
```
### Conversation Management
Messages are stored as a list of dicts following OpenAI format:
```python
messages = [
{"role": "system", "content": "You are a helpful assistant..."},
{"role": "user", "content": "Search for Python tutorials"},
{"role": "assistant", "content": None, "tool_calls": [...]},
{"role": "tool", "tool_call_id": "...", "content": "..."},
{"role": "assistant", "content": "Here's what I found..."},
]
```
### Reasoning Model Support
For models that support chain-of-thought reasoning:
- Extract `reasoning_content` from API responses
- Store in `assistant_msg["reasoning"]` for trajectory export
- Pass back via `reasoning_content` field on subsequent turns
---
## CLI Architecture (cli.py)
The interactive CLI uses:
- **Rich** - For the welcome banner and styled panels
- **prompt_toolkit** - For fixed input area with history and `patch_stdout`
- **KawaiiSpinner** (in run_agent.py) - Animated feedback during API calls and tool execution
Key components:
- `HermesCLI` class - Main CLI controller with commands and conversation loop
- `load_cli_config()` - Loads config, sets environment variables for terminal
- `build_welcome_banner()` - Displays ASCII art logo, tools, and skills summary
- `/commands` - Process user commands like `/help`, `/clear`, `/personality`, etc.
CLI uses `quiet_mode=True` when creating AIAgent to suppress verbose logging.
### Adding CLI Commands
1. Add to `COMMANDS` dict with description
2. Add handler in `process_command()` method
3. For persistent settings, use `save_config_value()` to update config
---
## Hermes CLI Commands
The unified `hermes` command provides all functionality:
| Command | Description |
|---------|-------------|
| `hermes` | Interactive chat (default) |
| `hermes chat -q "..."` | Single query mode |
| `hermes setup` | Configure API keys and settings |
| `hermes config` | View current configuration |
| `hermes config edit` | Open config in editor |
| `hermes config set KEY VAL` | Set a specific value |
| `hermes config check` | Check for missing config |
| `hermes config migrate` | Prompt for missing config interactively |
| `hermes status` | Show configuration status |
| `hermes doctor` | Diagnose issues |
| `hermes update` | Update to latest (checks for new config) |
| `hermes uninstall` | Uninstall (can keep configs for reinstall) |
| `hermes gateway` | Start messaging gateway |
| `hermes cron list` | View scheduled jobs |
| `hermes version` | Show version info |
| `hermes pairing list/approve/revoke` | Manage DM pairing codes |
---
## Messaging Gateway
The gateway connects Hermes to Telegram, Discord, and WhatsApp.
### Configuration (in `~/.hermes/.env`):
```bash
# Telegram
TELEGRAM_BOT_TOKEN=123456:ABC-DEF... # From @BotFather
TELEGRAM_ALLOWED_USERS=123456789,987654 # Comma-separated user IDs (from @userinfobot)
# Discord
DISCORD_BOT_TOKEN=MTIz... # From Developer Portal
DISCORD_ALLOWED_USERS=123456789012345678 # Comma-separated user IDs
# Agent Behavior
HERMES_MAX_ITERATIONS=60 # Max tool-calling iterations
MESSAGING_CWD=/home/myuser # Terminal working directory for messaging
# Tool Progress (optional)
HERMES_TOOL_PROGRESS=true # Send progress messages
HERMES_TOOL_PROGRESS_MODE=new # "new" or "all"
```
### Working Directory Behavior
- **CLI (`hermes` command)**: Uses current directory (`.``os.getcwd()`)
- **Messaging (Telegram/Discord)**: Uses `MESSAGING_CWD` (default: home directory)
This is intentional: CLI users are in a terminal and expect the agent to work in their current directory, while messaging users need a consistent starting location.
### Security (User Allowlists):
**IMPORTANT**: Without an allowlist, anyone who finds your bot can use it!
The gateway checks `{PLATFORM}_ALLOWED_USERS` environment variables:
- If set: Only listed user IDs can interact with the bot
- If unset: All users are allowed (dangerous with terminal access!)
Users can find their IDs:
- **Telegram**: Message [@userinfobot](https://t.me/userinfobot)
- **Discord**: Enable Developer Mode, right-click name → Copy ID
### DM Pairing System
Instead of static allowlists, users can pair via one-time codes:
1. Unknown user DMs the bot → receives pairing code
2. Owner runs `hermes pairing approve <platform> <code>`
3. User is permanently authorized
Security: 8-char codes, 1-hour expiry, rate-limited (1/10min/user), max 3 pending per platform, lockout after 5 failed attempts, `chmod 0600` on data files.
Files: `gateway/pairing.py`, `hermes_cli/pairing.py`
### Event Hooks
Hooks fire at lifecycle points. Place hook directories in `~/.hermes/hooks/`:
```
~/.hermes/hooks/my-hook/
├── HOOK.yaml # name, description, events list
└── handler.py # async def handle(event_type, context): ...
```
Events: `gateway:startup`, `session:start`, `session:reset`, `agent:start`, `agent:step`, `agent:end`, `command:*`
The `agent:step` event fires each iteration of the tool-calling loop with tool names and results.
Files: `gateway/hooks.py`
### Tool Progress Notifications
When `HERMES_TOOL_PROGRESS=true`, the bot sends status messages as it works:
- `💻 \`ls -la\`...` (terminal commands show the actual command)
- `🔍 web_search...`
- `📄 web_extract...`
Modes:
- `new`: Only when switching to a different tool (less spam)
- `all`: Every single tool call
### Typing Indicator
The gateway keeps the "typing..." indicator active throughout processing, refreshing every 4 seconds. This lets users know the bot is working even during long tool-calling sequences.
### Platform Toolsets:
Each platform has a dedicated toolset in `toolsets.py`:
- `hermes-telegram`: Full tools including terminal (with safety checks)
- `hermes-discord`: Full tools including terminal
- `hermes-whatsapp`: Full tools including terminal
---
## Configuration System
Configuration files are stored in `~/.hermes/` for easy user access:
- `~/.hermes/config.yaml` - All settings (model, terminal, compression, etc.)
- `~/.hermes/.env` - API keys and secrets
### Adding New Configuration Options
When adding new configuration variables, you MUST follow this process:
#### For config.yaml options:
1. Add to `DEFAULT_CONFIG` in `hermes_cli/config.py`
2. **CRITICAL**: Bump `_config_version` in `DEFAULT_CONFIG` when adding required fields
3. This triggers migration prompts for existing users on next `hermes update` or `hermes setup`
Example:
```python
DEFAULT_CONFIG = {
# ... existing config ...
"new_feature": {
"enabled": True,
"option": "default_value",
},
# BUMP THIS when adding required fields
"_config_version": 2, # Was 1, now 2
}
```
#### For .env variables (API keys/secrets):
1. Add to `REQUIRED_ENV_VARS` or `OPTIONAL_ENV_VARS` in `hermes_cli/config.py`
2. Include metadata for the migration system:
```python
OPTIONAL_ENV_VARS = {
# ... existing vars ...
"NEW_API_KEY": {
"description": "What this key is for",
"prompt": "Display name in prompts",
"url": "https://where-to-get-it.com/",
"tools": ["tools_it_enables"], # What tools need this
"password": True, # Mask input
},
}
```
#### Update related files:
- `hermes_cli/setup.py` - Add prompts in the setup wizard
- `cli-config.yaml.example` - Add example with comments
- Update README.md if user-facing
### Config Version Migration
The system uses `_config_version` to detect outdated configs:
1. `check_for_missing_config()` compares user config to `DEFAULT_CONFIG`
2. `migrate_config()` interactively prompts for missing values
3. Called automatically by `hermes update` and optionally by `hermes setup`
---
## Environment Variables
API keys are loaded from `~/.hermes/.env`:
- `OPENROUTER_API_KEY` - Main LLM API access (primary provider)
- `FIRECRAWL_API_KEY` - Web search/extract tools
- `BROWSERBASE_API_KEY` / `BROWSERBASE_PROJECT_ID` - Browser automation
- `FAL_KEY` - Image generation (FLUX model)
- `NOUS_API_KEY` - Vision and Mixture-of-Agents tools
Terminal tool configuration (in `~/.hermes/config.yaml`):
- `terminal.backend` - Backend: local, docker, singularity, modal, or ssh
- `terminal.cwd` - Working directory ("." = host CWD for local only; for remote backends set an absolute path inside the target, or omit to use the backend's default)
- `terminal.docker_image` - Image for Docker backend
- `terminal.singularity_image` - Image for Singularity backend
- `terminal.modal_image` - Image for Modal backend
- SSH: `TERMINAL_SSH_HOST`, `TERMINAL_SSH_USER`, `TERMINAL_SSH_KEY` in .env
Agent behavior (in `~/.hermes/.env`):
- `HERMES_MAX_ITERATIONS` - Max tool-calling iterations (default: 60)
- `MESSAGING_CWD` - Working directory for messaging platforms (default: ~)
- `HERMES_TOOL_PROGRESS` - Enable tool progress messages (`true`/`false`)
- `HERMES_TOOL_PROGRESS_MODE` - Progress mode: `new` (tool changes) or `all`
- `OPENAI_API_KEY` - Voice transcription (Whisper STT)
- `SLACK_BOT_TOKEN` / `SLACK_APP_TOKEN` - Slack integration (Socket Mode)
- `SLACK_ALLOWED_USERS` - Comma-separated Slack user IDs
- `HERMES_HUMAN_DELAY_MODE` - Response pacing: off/natural/custom
- `HERMES_HUMAN_DELAY_MIN_MS` / `HERMES_HUMAN_DELAY_MAX_MS` - Custom delay range
### Dangerous Command Approval
The terminal tool includes safety checks for potentially destructive commands (e.g., `rm -rf`, `DROP TABLE`, `chmod 777`, etc.):
**Behavior by Backend:**
- **Docker/Singularity/Modal**: Commands run unrestricted (isolated containers)
- **Local/SSH**: Dangerous commands trigger approval flow
**Approval Flow (CLI):**
```
⚠️ Potentially dangerous command detected: recursive delete
rm -rf /tmp/test
[o]nce | [s]ession | [a]lways | [d]eny
Choice [o/s/a/D]:
```
**Approval Flow (Messaging):**
- Command is blocked with explanation
- Agent explains the command was blocked for safety
- User must add the pattern to their allowlist via `hermes config edit` or run the command directly on their machine
**Configuration:**
- `command_allowlist` in `~/.hermes/config.yaml` stores permanently allowed patterns
- Add patterns via "always" approval or edit directly
**Sudo Handling (Messaging):**
- If sudo fails over messaging, output includes tip to add `SUDO_PASSWORD` to `~/.hermes/.env`
---
## Background Process Management
The `process` tool works alongside `terminal` for managing long-running background processes:
**Starting a background process:**
```python
terminal(command="pytest -v tests/", background=true)
# Returns: {"session_id": "proc_abc123", "pid": 12345, ...}
```
**Managing it with the process tool:**
- `process(action="list")` -- show all running/recent processes
- `process(action="poll", session_id="proc_abc123")` -- check status + new output
- `process(action="log", session_id="proc_abc123")` -- full output with pagination
- `process(action="wait", session_id="proc_abc123", timeout=600)` -- block until done
- `process(action="kill", session_id="proc_abc123")` -- terminate
- `process(action="write", session_id="proc_abc123", data="y")` -- send stdin
- `process(action="submit", session_id="proc_abc123", data="yes")` -- send + Enter
**Key behaviors:**
- Background processes execute through the configured terminal backend (local/Docker/Modal/SSH/Singularity) -- never directly on the host unless `TERMINAL_ENV=local`
- The `wait` action blocks the tool call until the process finishes, times out, or is interrupted by a new user message
- PTY mode (`pty=true` on terminal) enables interactive CLI tools (Codex, Claude Code)
- In RL training, background processes are auto-killed when the episode ends (`tool_context.cleanup()`)
- In the gateway, sessions with active background processes are exempt from idle reset
- The process registry checkpoints to `~/.hermes/processes.json` for crash recovery
Files: `tools/process_registry.py` (registry), `model_tools.py` (tool definition + handler), `tools/terminal_tool.py` (spawn integration)
---
## Adding New Tools
Follow this strict order to maintain consistency:
1. Create `tools/your_tool.py` with:
- Handler function (sync or async) returning a JSON string via `json.dumps()`
- `check_*_requirements()` function to verify dependencies (e.g., API keys)
- Schema definition following OpenAI function-calling format
2. Export in `tools/__init__.py`:
- Import the handler and check function
- Add to `__all__` list
3. Register in `model_tools.py`:
- Add to `TOOLSET_REQUIREMENTS` if it needs API keys
- Create `get_*_tool_definitions()` function or add to existing
- Add routing in `handle_function_call()` dispatcher
- Update `get_all_tool_names()` with the tool name
- Update `get_toolset_for_tool()` mapping
- Update `get_available_toolsets()` and `check_toolset_requirements()`
4. Add to toolset in `toolsets.py`:
- Add to existing toolset or create new one in TOOLSETS dict
5. If the tool requires an API key:
- Add to `OPTIONAL_ENV_VARS` in `hermes_cli/config.py`
- The tool will be auto-disabled if the key is missing
6. Optionally add to `toolset_distributions.py` for batch processing
### Tool Implementation Pattern
```python
# tools/example_tool.py
import json
import os
def check_example_requirements() -> bool:
"""Check if required API keys/dependencies are available."""
return bool(os.getenv("EXAMPLE_API_KEY"))
def example_tool(param: str, task_id: str = None) -> str:
"""Execute the tool and return JSON string result."""
try:
result = {"success": True, "data": "..."}
return json.dumps(result, ensure_ascii=False)
except Exception as e:
return json.dumps({"error": str(e)}, ensure_ascii=False)
```
All tool handlers MUST return a JSON string. Never return raw dicts.
### Dynamic Tool Availability
Tools are automatically disabled when their API keys are missing:
```python
# In model_tools.py
TOOLSET_REQUIREMENTS = {
"web": {"env_vars": ["FIRECRAWL_API_KEY"]},
"browser": {"env_vars": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"]},
"creative": {"env_vars": ["FAL_KEY"]},
}
```
The `check_tool_availability()` function determines which tools to include.
### Stateful Tools
Tools that maintain state (terminal, browser) require:
- `task_id` parameter for session isolation between concurrent tasks
- `cleanup_*()` function to release resources
- Cleanup is called automatically in run_agent.py after conversation completes
---
## Trajectory Format
Conversations are saved in ShareGPT format for training:
```json
{"from": "system", "value": "System prompt with <tools>...</tools>"}
{"from": "human", "value": "User message"}
{"from": "gpt", "value": "<think>reasoning</think>\n<tool_call>{...}</tool_call>"}
{"from": "tool", "value": "<tool_response>{...}</tool_response>"}
{"from": "gpt", "value": "Final response"}
```
Tool calls use `<tool_call>` XML tags, responses use `<tool_response>` tags, reasoning uses `<think>` tags.
### Trajectory Export
```python
agent = AIAgent(save_trajectories=True)
agent.chat("Do something")
# Saves to trajectories/*.jsonl in ShareGPT format
```
---
## Batch Processing (batch_runner.py)
For processing multiple prompts:
- Parallel execution with multiprocessing
- Content-based resume for fault tolerance (matches on prompt text, not indices)
- Toolset distributions control probabilistic tool availability per prompt
- Output: `data/<run_name>/trajectories.jsonl` (combined) + individual batch files
```bash
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--num_workers=4 \
--run_name=my_run
```
---
## Skills System
Skills are on-demand knowledge documents the agent can load. Located in `skills/` directory:
```
skills/
├── mlops/ # Category folder
│ ├── axolotl/ # Skill folder
│ │ ├── SKILL.md # Main instructions (required)
│ │ ├── references/ # Additional docs, API specs
│ │ └── templates/ # Output formats, configs
│ └── vllm/
│ └── SKILL.md
└── example-skill/
└── SKILL.md
```
**Progressive disclosure** (token-efficient):
1. `skills_categories()` - List category names (~50 tokens)
2. `skills_list(category)` - Name + description per skill (~3k tokens)
3. `skill_view(name)` - Full content + tags + linked files
SKILL.md files use YAML frontmatter:
```yaml
---
name: skill-name
description: Brief description for listing
tags: [tag1, tag2]
related_skills: [other-skill]
version: 1.0.0
---
# Skill Content...
```
Tool files: `tools/skills_tool.py``model_tools.py``toolsets.py`
---
## Testing Changes
After making changes:
1. Run `hermes doctor` to check setup
2. Run `hermes config check` to verify config
3. Test with `hermes chat -q "test message"`
4. For new config options, test fresh install: `rm -rf ~/.hermes && hermes setup`
+985 -689
View File
File diff suppressed because it is too large Load Diff
+31 -697
View File
@@ -1,729 +1,63 @@
# Hermes Agent - Future Improvements
> Ideas for enhancing the agent's capabilities, generated from self-analysis of the codebase.
---
## 🚨 HIGH PRIORITY - Immediate Fixes
These items need to be addressed ASAP:
### 1. SUDO Breaking Terminal Tool 🔐 ✅ COMPLETE
- [x] **Problem:** SUDO commands break the terminal tool execution (hangs indefinitely)
- [x] **Fix:** Created custom environment wrappers in `tools/terminal_tool.py`
- `stdin=subprocess.DEVNULL` prevents hanging on interactive prompts
- Sudo fails gracefully with clear error if no password configured
- Same UX as Claude Code - agent sees error, tells user to run it themselves
- [x] **All 5 environments now have consistent behavior:**
- `_LocalEnvironment` - local execution
- `_DockerEnvironment` - Docker containers
- `_SingularityEnvironment` - Singularity/Apptainer containers
- `_ModalEnvironment` - Modal cloud sandboxes
- `_SSHEnvironment` - remote SSH execution
- [x] **Optional sudo support via `SUDO_PASSWORD` env var:**
- Shared `_transform_sudo_command()` helper used by all environments
- If set, auto-transforms `sudo cmd` → pipes password via `sudo -S`
- Documented in `.env.example`, `cli-config.yaml`, and README
- Works for chained commands: `cmd1 && sudo cmd2`
- [x] **Interactive sudo prompt in CLI mode:**
- When sudo detected and no password configured, prompts user
- 45-second timeout (auto-skips if no input)
- Hidden password input via `getpass` (password not visible)
- Password cached for session (don't ask repeatedly)
- Spinner pauses during prompt for clean UX
- Uses `HERMES_INTERACTIVE` env var to detect CLI mode
### 2. Fix `browser_get_images` Tool 🖼️ ✅ VERIFIED WORKING
- [x] **Tested:** Tool works correctly on multiple sites
- [x] **Results:** Successfully extracts image URLs, alt text, dimensions
- [x] **Note:** Some sites (Pixabay, etc.) have Cloudflare bot protection that blocks headless browsers - this is expected behavior, not a bug
### 3. Better Action Logging for Debugging 📝 ✅ COMPLETE
- [x] **Problem:** Need better logging of agent actions for debugging
- [x] **Implementation:**
- Save full session trajectories to `logs/` directory as JSON
- Each session gets a unique file: `session_YYYYMMDD_HHMMSS_UUID.json`
- Logs all messages, tool calls with inputs/outputs, timestamps
- Structured JSON format for easy parsing and replay
- Automatic on CLI runs (configurable)
### 4. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED
- [ ] **Problem:** Thinking/reasoning summaries not shown while streaming
- [ ] **Complexity:** This is a significant refactor - leaving for later
**OpenRouter Streaming Info:**
- Uses `stream=True` with OpenAI SDK
- Reasoning comes in `choices[].delta.reasoning_details` chunks
- Types: `reasoning.summary`, `reasoning.text`, `reasoning.encrypted`
- Tool call arguments stream as partial JSON (need accumulation)
- Items paradigm: same ID emitted multiple times with updated content
**Key Challenges:**
- Tool call JSON accumulation (partial `{"query": "wea``{"query": "weather"}`)
- Multiple concurrent outputs (thinking + tool calls + text simultaneously)
- State management for partial responses
- Error handling if connection drops mid-stream
- Deciding when tool calls are "complete" enough to execute
**UX Questions to Resolve:**
- Show raw thinking text or summarized?
- Live expanding text vs. spinner replacement?
- Markdown rendering while streaming?
- How to handle thinking + tool call display simultaneously?
**Implementation Options:**
- New `run_conversation_streaming()` method (keep non-streaming as fallback)
- Wrapper that handles streaming internally
- Big refactor of existing `run_conversation()`
**References:**
- https://openrouter.ai/docs/api/reference/streaming
- https://openrouter.ai/docs/guides/best-practices/reasoning-tokens#streaming-response
---
## 1. Subagent Architecture (Context Isolation) 🎯
**Problem:** Long-running tools (terminal commands, browser automation, complex file operations) consume massive context. A single `ls -la` can add hundreds of lines. Browser snapshots, debugging sessions, and iterative terminal work quickly bloat the main conversation, leaving less room for actual reasoning.
The main agent becomes an orchestrator that delegates context-heavy tasks to subagents with isolated context. Each subagent returns a summary, keeping the orchestrator's context clean. `delegate_task(goal, context, toolsets=[])` with fresh conversation, limited toolset, task-specific system prompt.
**Solution:** The main agent becomes an **orchestrator** that delegates context-heavy tasks to **subagents**.
## 2. Planning & Task Management 📋
**Architecture:**
```
┌─────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (main agent) │
│ - Receives user request │
│ - Plans approach │
│ - Delegates heavy tasks to subagents │
│ - Receives summarized results │
│ - Maintains clean, focused context │
└─────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ TERMINAL AGENT │ │ BROWSER AGENT │ │ CODE AGENT │
│ - terminal tool │ │ - browser tools │ │ - file tools │
│ - file tools │ │ - web_search │ │ - terminal │
│ │ │ - web_extract │ │ │
│ Isolated context│ │ Isolated context│ │ Isolated context│
│ Returns summary │ │ Returns summary │ │ Returns summary │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
Task decomposition tool, progress checkpoints after N tool calls, persistent plan storage that survives context compression, failure recovery with replanning.
**How it works:**
1. User asks: "Set up a new Python project with FastAPI and tests"
2. Orchestrator plans: "I need to create files, install deps, write code"
3. Orchestrator calls: `terminal_task(goal="Create venv, install fastapi pytest", context="New project in ~/myapp")`
4. **Subagent spawns** with fresh context, only terminal/file tools
5. Subagent iterates (may take 10+ tool calls, lots of output)
6. Subagent completes → returns summary: "Created venv, installed fastapi==0.109.0, pytest==8.0.0"
7. Orchestrator receives **only the summary**, context stays clean
8. Orchestrator continues with next subtask
## 3. Dynamic Skills Expansion 📚
**Key tools to implement:**
- [ ] `terminal_task(goal, context, cwd?)` - Delegate terminal/shell work
- [ ] `browser_task(goal, context, start_url?)` - Delegate web research/automation
- [ ] `code_task(goal, context, files?)` - Delegate code writing/modification
- [ ] Generic `delegate_task(goal, context, toolsets=[])` - Flexible delegation
Skill acquisition from successful tasks, parameterized skill templates, skill chaining with dependency graphs.
**Implementation details:**
- [ ] Subagent uses same `run_agent.py` but with:
- Fresh/empty conversation history
- Limited toolset (only what's needed)
- Smaller max_iterations (focused task)
- Task-specific system prompt
- [ ] Subagent returns structured result:
```python
{
"success": True,
"summary": "Installed 3 packages, created 2 files",
"details": "Optional longer explanation if needed",
"artifacts": ["~/myapp/requirements.txt", "~/myapp/main.py"], # Files created
"errors": [] # Any issues encountered
}
```
- [ ] Orchestrator sees only the summary in its context
- [ ] Full subagent transcript saved separately for debugging
## 4. Interactive Clarifying Questions ❓
**Benefits:**
- 🧹 **Clean context** - Orchestrator stays focused, doesn't drown in tool output
- 📊 **Better token efficiency** - 50 terminal outputs → 1 summary paragraph
- 🎯 **Focused subagents** - Each agent has just the tools it needs
- 🔄 **Parallel potential** - Independent subtasks could run concurrently
- 🐛 **Easier debugging** - Each subtask has its own isolated transcript
Multiple-choice prompt tool with rich terminal UI. Up to 4 choices + free-text. CLI-only with graceful fallback for non-interactive modes.
**When to use subagents vs direct tools:**
- **Subagent**: Multi-step tasks, iteration likely, lots of output expected
- **Direct**: Quick one-off commands, simple file reads, user needs to see output
## 5. Memory System 🧠
**Files to modify:** `run_agent.py` (add orchestration mode), new `tools/delegate_tools.py`, new `subagent_runner.py`
Daily memory logs, long-term curated MEMORY.md, vector/semantic search, pre-compaction memory flush, user profile, learning store for error patterns and discovered fixes. *Inspired by ClawdBot's memory system.*
---
## 6. Heartbeat System 💓
## 2. Context Management (complements Subagents)
Periodic agent wake-up that reads HEARTBEAT.md for instructions. Runs inside the main session with full context. Triggers on interval, exec completion, cron events, or manual wake. HEARTBEAT_OK suppression when nothing needs attention. *Inspired by ClawdBot's heartbeat.*
**Problem:** Context grows unbounded during long conversations. Trajectory compression exists for training data post-hoc, but live conversations lack intelligent context management.
## 7. Local Browser Control via CDP 🌐
**Ideas:**
- [ ] **Incremental summarization** - Compress old tool outputs on-the-fly during conversations
- Trigger when context exceeds threshold (e.g., 80% of max tokens)
- Preserve recent turns fully, summarize older tool responses
- Could reuse logic from `trajectory_compressor.py`
- [ ] **Semantic memory retrieval** - Vector store for long conversation recall
- Embed important facts/findings as conversation progresses
- Retrieve relevant memories when needed instead of keeping everything in context
- Consider lightweight solutions: ChromaDB, FAISS, or even a simple embedding cache
- [ ] **Working vs. episodic memory** distinction
- Working memory: Current task state, recent tool results (always in context)
- Episodic memory: Past findings, tried approaches (retrieved on demand)
- Clear eviction policies for each
Support both local Chrome (via CDP, free) and Browserbase (cloud, paid) as browser backends. Local gives persistent login sessions but lacks CAPTCHA solving.
**Files to modify:** `run_agent.py` (add memory manager), possibly new `tools/memory_tool.py`
## 8. Signal Integration 📡
---
New platform adapter using signal-cli daemon (JSON-RPC HTTP + SSE). Requires Java runtime and phone number registration.
## 3. Self-Reflection & Course Correction 🔄
## 9. Session Transcript Search 🔍
**Problem:** Current retry logic handles malformed outputs but not semantic failures. Agent doesn't reason about *why* something failed.
`hermes sessions search <query>` CLI command and `session_search` agent tool. Text-based first (ripgrep over JSONL), vector search later.
**Ideas:**
- [ ] **Meta-reasoning after failures** - When a tool returns an error or unexpected result:
```
Tool failed → Reflect: "Why did this fail? What assumptions were wrong?"
→ Adjust approach → Retry with new strategy
```
- Could be a lightweight LLM call or structured self-prompt
- [ ] **Planning/replanning module** - For complex multi-step tasks:
- Generate plan before execution
- After each step, evaluate: "Am I on track? Should I revise the plan?"
- Store plan in working memory, update as needed
- [ ] **Approach memory** - Remember what didn't work:
- "I tried X for this type of problem and it failed because Y"
- Prevents repeating failed strategies in the same conversation
## 10. Plugin/Extension System 🔌
**Files to modify:** `run_agent.py` (add reflection hooks in tool loop), new `tools/reflection_tool.py`
Python plugin interface with `plugin.yaml` + `handler.py`. Discovery from `~/.hermes/plugins/`. Plugins can register tools, hooks, and CLI commands. *Inspired by ClawdBot's 36-plugin extension system.*
---
## 11. Native Companion Apps 📱
## 4. Tool Composition & Learning 🔧
macOS (Swift/SwiftUI), iOS, Android apps connecting via WebSocket. Prerequisite: WS API on gateway. MVP: web UI with Flask/FastAPI. *Inspired by ClawdBot's companion apps.*
**Problem:** Tools are atomic. Complex tasks require repeated manual orchestration of the same tool sequences.
## 12. Evaluation System 📏
**Ideas:**
- [ ] **Macro tools / Tool chains** - Define reusable tool sequences:
```yaml
research_topic:
description: "Deep research on a topic"
steps:
- web_search: {query: "$topic"}
- web_extract: {urls: "$search_results.urls[:3]"}
- summarize: {content: "$extracted"}
```
- Could be defined in skills or a new `macros/` directory
- Agent can invoke macro as single tool call
- [ ] **Tool failure patterns** - Learn from failures:
- Track: tool, input pattern, error type, what worked instead
- Before calling a tool, check: "Has this pattern failed before?"
- Persistent across sessions (stored in skills or separate DB)
- [ ] **Parallel tool execution** - When tools are independent, run concurrently:
- Detect independence (no data dependencies between calls)
- Use `asyncio.gather()` for parallel execution
- Already have async support in some tools, just need orchestration
LLM grader mode for batch_runner, action comparison against expected tool calls, string matching baselines.
**Files to modify:** `model_tools.py`, `toolsets.py`, new `tool_macros.py`
## 13. Layered Context Architecture 📊
---
Structured hierarchy: project context > skills > user profile > learnings > external knowledge > runtime introspection.
## 5. Dynamic Skills Expansion 📚
## 14. Tools Wishlist 🧰
**Problem:** Skills system is elegant but static. Skills must be manually created and added.
**Ideas:**
- [ ] **Skill acquisition from successful tasks** - After completing a complex task:
- "This approach worked well. Save as a skill?"
- Extract: goal, steps taken, tools used, key decisions
- Generate SKILL.md automatically
- Store in user's skills directory
- [ ] **Skill templates** - Common patterns that can be parameterized:
```markdown
# Debug {language} Error
1. Reproduce the error
2. Search for error message: `web_search("{error_message} {language}")`
3. Check common causes: {common_causes}
4. Apply fix and verify
```
- [ ] **Skill chaining** - Combine skills for complex workflows:
- Skills can reference other skills as dependencies
- "To do X, first apply skill Y, then skill Z"
- Directed graph of skill dependencies
**Files to modify:** `tools/skills_tool.py`, `skills/` directory structure, new `skill_generator.py`
---
## 6. Task Continuation Hints 🎯
**Problem:** Could be more helpful by suggesting logical next steps.
**Ideas:**
- [ ] **Suggest next steps** - At end of a task, suggest logical continuations:
- "Code is written. Want me to also write tests / docs / deploy?"
- Based on common workflows for task type
- Non-intrusive, just offer options
**Files to modify:** `run_agent.py`, response generation logic
---
## 7. Interactive Clarifying Questions Tool ❓
**Problem:** Agent sometimes makes assumptions or guesses when it should ask the user. Currently can only ask via text, which gets lost in long outputs.
**Ideas:**
- [ ] **Multiple-choice prompt tool** - Let agent present structured choices to user:
```
ask_user_choice(
question="Should the language switcher enable only German or all languages?",
choices=[
"Only enable German - works immediately",
"Enable all, mark untranslated - show fallback notice",
"Let me specify something else"
]
)
```
- Renders as interactive terminal UI with arrow key / Tab navigation
- User selects option, result returned to agent
- Up to 4 choices + optional free-text option
- [ ] **Implementation:**
- Use `inquirer` or `questionary` Python library for rich terminal prompts
- Tool returns selected option text (or user's custom input)
- **CLI-only** - only works when running via `cli.py` (not API/programmatic use)
- Graceful fallback: if not in interactive mode, return error asking agent to rephrase as text
- [ ] **Use cases:**
- Clarify ambiguous requirements before starting work
- Confirm destructive operations with clear options
- Let user choose between implementation approaches
- Checkpoint complex multi-step workflows
**Files to modify:** New `tools/ask_user_tool.py`, `cli.py` (detect interactive mode), `model_tools.py`
---
## 8. Resource Awareness & Efficiency 💰
**Problem:** No awareness of costs, time, or resource usage. Could be smarter about efficiency.
**Ideas:**
- [ ] **Tool result caching** - Don't repeat identical operations:
- Cache web searches, extractions within a session
- Invalidation based on time-sensitivity of query
- Hash-based lookup: same input → cached output
- [ ] **Lazy evaluation** - Don't fetch everything upfront:
- Get summaries first, full content only if needed
- "I found 5 relevant pages. Want me to deep-dive on any?"
**Files to modify:** `model_tools.py`, new `resource_tracker.py`
---
## 9. Collaborative Problem Solving 🤝
**Problem:** Interaction is command/response. Complex problems benefit from dialogue.
**Ideas:**
- [ ] **Assumption surfacing** - Make implicit assumptions explicit:
- "I'm assuming you want Python 3.11+. Correct?"
- "This solution assumes you have sudo access..."
- Let user correct before going down wrong path
- [ ] **Checkpoint & confirm** - For high-stakes operations:
- "About to delete 47 files. Here's the list - proceed?"
- "This will modify your database. Want a backup first?"
- Configurable threshold for when to ask
**Files to modify:** `run_agent.py`, system prompt configuration
---
## 10. Project-Local Context 💾
**Problem:** Valuable context lost between sessions.
**Ideas:**
- [ ] **Project awareness** - Remember project-specific context:
- Store `.hermes/context.md` in project directory
- "This is a Django project using PostgreSQL"
- Coding style preferences, deployment setup, etc.
- Load automatically when working in that directory
- [ ] **Handoff notes** - Leave notes for future sessions:
- Write to `.hermes/notes.md` in project
- "TODO for next session: finish implementing X"
- "Known issues: Y doesn't work on Windows"
**Files to modify:** New `project_context.py`, auto-load in `run_agent.py`
---
## 11. Graceful Degradation & Robustness 🛡️
**Problem:** When things go wrong, recovery is limited. Should fail gracefully.
**Ideas:**
- [ ] **Fallback chains** - When primary approach fails, have backups:
- `web_extract` fails → try `browser_navigate` → try `web_search` for cached version
- Define fallback order per tool type
- [ ] **Partial progress preservation** - Don't lose work on failure:
- Long task fails midway → save what we've got
- "I completed 3/5 steps before the error. Here's what I have..."
- [ ] **Self-healing** - Detect and recover from bad states:
- Browser stuck → close and retry
- Terminal hung → timeout and reset
**Files to modify:** `model_tools.py`, tool implementations, new `fallback_manager.py`
---
## 12. Tools & Skills Wishlist 🧰
*Things that would need new tool implementations (can't do well with current tools):*
### High-Impact
- [ ] **Audio/Video Transcription** 🎬 *(See also: Section 16 for detailed spec)*
- Transcribe audio files, podcasts, YouTube videos
- Extract key moments from video
- Voice memo transcription for messaging integrations
- *Provider options: Whisper API, Deepgram, local Whisper*
- [ ] **Diagram Rendering** 📊
- Render Mermaid/PlantUML to actual images
- Can generate the code, but rendering requires external service or tool
- "Show me how these components connect" → actual visual diagram
### Medium-Impact
- [ ] **Canvas / Visual Workspace** 🖼️
- Agent-controlled visual panel for rendering interactive UI
- Inspired by OpenClaw's Canvas feature
- **Capabilities:**
- `present` / `hide` - Show/hide the canvas panel
- `navigate` - Load HTML files or URLs into the canvas
- `eval` - Execute JavaScript in the canvas context
- `snapshot` - Capture the rendered UI as an image
- **Use cases:**
- Display generated HTML/CSS/JS previews
- Show interactive data visualizations (charts, graphs)
- Render diagrams (Mermaid → rendered output)
- Present structured information in rich format
- A2UI-style component system for structured agent UI
- **Implementation options:**
- Electron-based panel for CLI
- WebSocket-connected web app
- VS Code webview extension
- *Would let agent "show" things rather than just describe them*
- [ ] **Document Generation** 📄
- Create styled PDFs, Word docs, presentations
- *Can do basic PDF via terminal tools, but limited*
- [ ] **Diff/Patch Tool** 📝
- Surgical code modifications with preview
- "Change line 45-50 to X" without rewriting whole file
- Show diffs before applying
- *Can use `diff`/`patch` but a native tool would be safer*
### Skills to Create
- [ ] **Domain-specific skill packs:**
- DevOps/Infrastructure (Terraform, K8s, AWS)
- Data Science workflows (EDA, model training)
- Security/pentesting procedures
- [ ] **Framework-specific skills:**
- React/Vue/Angular patterns
- Django/Rails/Express conventions
- Database optimization playbooks
- [ ] **Troubleshooting flowcharts:**
- "Docker container won't start" → decision tree
- "Production is slow" → systematic diagnosis
---
## 13. Messaging Platform Integrations 💬
**Problem:** Agent currently only works via `cli.py` which requires direct terminal access. Users may want to interact via messaging apps from their phone or other devices.
**Architecture:**
- `run_agent.py` already accepts `conversation_history` parameter and returns updated messages ✅
- Need: persistent session storage, platform monitors, session key resolution
**Implementation approach:**
```
┌─────────────────────────────────────────────────────────────┐
│ Platform Monitor (e.g., telegram_monitor.py) │
│ ├─ Long-running daemon connecting to messaging platform │
│ ├─ On message: resolve session key → load history from disk│
│ ├─ Call run_agent.py with loaded history │
│ ├─ Save updated history back to disk (JSONL) │
│ └─ Send response back to platform │
└─────────────────────────────────────────────────────────────┘
```
**Platform support (each user sets up their own credentials):**
- [ ] **Telegram** - via `python-telegram-bot` or `grammy` equivalent
- Bot token from @BotFather
- Easiest to set up, good for personal use
- [ ] **Discord** - via `discord.py`
- Bot token from Discord Developer Portal
- Can work in servers (group sessions) or DMs
- [ ] **WhatsApp** - via `baileys` (WhatsApp Web protocol)
- QR code scan to authenticate
- More complex, but reaches most people
**Session management:**
- [ ] **Session store** - JSONL persistence per session key
- `~/.hermes/sessions/{session_key}.jsonl`
- Session keys: `telegram:dm:{user_id}`, `discord:channel:{id}`, etc.
- [ ] **Session expiry** - Configurable reset policies
- Daily reset (default 4am) OR idle timeout (e.g., 2 hours)
- Manual reset via `/reset` or `/new` command in chat
- [ ] **Session continuity** - Conversations persist across messages until reset
**Files to create:** `monitors/telegram_monitor.py`, `monitors/discord_monitor.py`, `monitors/session_store.py`
---
## 14. Scheduled Tasks / Cron Jobs ⏰
**Problem:** Agent only runs on-demand. Some tasks benefit from scheduled execution (daily summaries, monitoring, reminders).
**Ideas:**
- [ ] **Cron-style scheduler** - Run agent turns on a schedule
- Store jobs in `~/.hermes/cron/jobs.json`
- Each job: `{ id, schedule, prompt, session_mode, delivery }`
- Uses APScheduler or similar Python library
- [ ] **Session modes:**
- `isolated` - Fresh session each run (no history, clean context)
- `main` - Append to main session (agent remembers previous scheduled runs)
- [ ] **Delivery options:**
- Write output to file (`~/.hermes/cron/output/{job_id}/{timestamp}.md`)
- Send to messaging channel (if integrations enabled)
- Both
- [ ] **CLI interface:**
```bash
# List scheduled jobs
python cli.py --cron list
# Add a job (runs daily at 9am)
python cli.py --cron add "Summarize my email inbox" --schedule "0 9 * * *"
# Quick syntax for simple intervals
python cli.py --cron add "Check server status" --every 30m
# Remove a job
python cli.py --cron remove <job_id>
```
- [ ] **Agent self-scheduling** - Let the agent create its own cron jobs
- New tool: `schedule_task(prompt, schedule, session_mode)`
- "Remind me to check the deployment tomorrow at 9am"
- Agent can set follow-up tasks for itself
- [ ] **In-chat command:** `/cronjob {prompt} {frequency}` when using messaging integrations
**Files to create:** `cron/scheduler.py`, `cron/jobs.py`, `tools/schedule_tool.py`
---
## 15. Text-to-Speech (TTS) 🔊
**Problem:** Agent can only respond with text. Some users prefer audio responses (accessibility, hands-free use, podcasts).
**Ideas:**
- [ ] **TTS tool** - Generate audio files from text
```python
tts_generate(text="Here's your summary...", voice="nova", output="summary.mp3")
```
- Returns path to generated audio file
- For messaging integrations: can send as voice message
- [ ] **Provider options:**
- Edge TTS (free, good quality, many voices)
- OpenAI TTS (paid, excellent quality)
- ElevenLabs (paid, best quality, voice cloning)
- Local options (Coqui TTS, Bark)
- [ ] **Modes:**
- On-demand: User explicitly asks "read this to me"
- Auto-TTS: Configurable to always generate audio for responses
- Long-text handling: Summarize or chunk very long responses
- [ ] **Integration with messaging:**
- When enabled, can send voice notes instead of/alongside text
- User preference per channel
**Files to create:** `tools/tts_tool.py`, config in `cli-config.yaml`
---
## 16. Speech-to-Text / Audio Transcription 🎤
**Problem:** Users may want to send voice memos instead of typing. Agent is blind to audio content.
**Ideas:**
- [ ] **Voice memo transcription** - For messaging integrations
- User sends voice message → transcribe → process as text
- Seamless: user speaks, agent responds
- [ ] **Audio/video file transcription** - Existing idea, expanded:
- Transcribe local audio files (mp3, wav, m4a)
- Transcribe YouTube videos (download audio → transcribe)
- Extract key moments with timestamps
- [ ] **Provider options:**
- OpenAI Whisper API (good quality, cheap)
- Deepgram (fast, good for real-time)
- Local Whisper (free, runs on GPU)
- Groq Whisper (fast, free tier available)
- [ ] **Tool interface:**
```python
transcribe(source="audio.mp3") # Local file
transcribe(source="https://youtube.com/...") # YouTube
transcribe(source="voice_message", data=bytes) # Voice memo
```
**Files to create:** `tools/transcribe_tool.py`, integrate with messaging monitors
---
## Priority Order (Suggested)
1. **🎯 Subagent Architecture** - Critical for context management, enables everything else
2. **Memory & Context Management** - Complements subagents for remaining context
3. **Self-Reflection** - Improves reliability and reduces wasted tool calls
4. **Project-Local Context** - Practical win, keeps useful info across sessions
5. **Messaging Integrations** - Unlocks mobile access, new interaction patterns
6. **Scheduled Tasks / Cron Jobs** - Enables automation, reminders, monitoring
7. **Tool Composition** - Quality of life, builds on other improvements
8. **Dynamic Skills** - Force multiplier for repeated tasks
9. **Interactive Clarifying Questions** - Better UX for ambiguous tasks
10. **TTS / Audio Transcription** - Accessibility, hands-free use
---
## Removed Items (Unrealistic)
The following were removed because they're architecturally impossible:
- ~~Proactive suggestions / Prefetching~~ - Agent only runs on user request, can't interject
- ~~Clipboard integration~~ - No access to user's local system clipboard
The following **moved to active TODO** (now possible with new architecture):
- ~~Session save/restore~~ → See **Messaging Integrations** (session persistence)
- ~~Voice/TTS playback~~ → See **TTS** (can generate audio files, send via messaging)
- ~~Set reminders~~ → See **Scheduled Tasks / Cron Jobs**
The following were removed because they're **already possible**:
- ~~HTTP/API Client~~ → Use `curl` or Python `requests` in terminal
- ~~Structured Data Manipulation~~ → Use `pandas` in terminal
- ~~Git-Native Operations~~ → Use `git` CLI in terminal
- ~~Symbolic Math~~ → Use `SymPy` in terminal
- ~~Code Quality Tools~~ → Run linters (`eslint`, `black`, `mypy`) in terminal
- ~~Testing Framework~~ → Run `pytest`, `jest`, etc. in terminal
- ~~Translation~~ → LLM handles this fine, or use translation APIs
---
---
## 🧪 Brainstorm Ideas (Not Yet Fleshed Out)
*These are early-stage ideas that need more thinking before implementation. Captured here so they don't get lost.*
### Remote/Distributed Execution 🌐
**Concept:** Run agent on a powerful remote server while interacting from a thin client.
**Why interesting:**
- Run on beefy GPU server for local LLM inference
- Agent has access to remote machine's resources (files, tools, internet)
- User interacts via lightweight client (phone, low-power laptop)
**Open questions:**
- How does this differ from just SSH + running cli.py on remote?
- Would need secure communication channel (WebSocket? gRPC?)
- How to handle tool outputs that reference remote paths?
- Credential management for remote execution
- Latency considerations for interactive use
**Possible architecture:**
```
┌─────────────┐ ┌─────────────────────────┐
│ Thin Client │ ◄─────► │ Remote Hermes Server │
│ (phone/web) │ WS/API │ - Full agent + tools │
└─────────────┘ │ - GPU for local LLM │
│ - Access to server files│
└─────────────────────────┘
```
**Related to:** Messaging integrations (could be the "server" that monitors receive from)
---
### Multi-Agent Parallel Execution 🤖🤖
**Concept:** Extension of Subagent Architecture (Section 1) - run multiple subagents in parallel.
**Why interesting:**
- Independent subtasks don't need to wait for each other
- "Research X while setting up Y" - both run simultaneously
- Faster completion for complex multi-part tasks
**Open questions:**
- How to detect which tasks are truly independent?
- Resource management (API rate limits, concurrent connections)
- How to merge results when parallel tasks have conflicts?
- Cost implications of multiple parallel LLM calls
*Note: Basic subagent delegation (Section 1) should be implemented first, parallel execution is an optimization on top.*
---
### Plugin/Extension System 🔌
**Concept:** Allow users to add custom tools/skills without modifying core code.
**Why interesting:**
- Community contributions
- Organization-specific tools
- Clean separation of core vs. extensions
**Open questions:**
- Security implications of loading arbitrary code
- Versioning and compatibility
- Discovery and installation UX
---
*Last updated: $(date +%Y-%m-%d)* 🤖
- Diagram rendering (Mermaid/PlantUML to images)
- Document generation (PDFs, Word, presentations)
- Canvas / visual workspace
- Coding agent skill (Codex, Claude Code orchestration via PTY)
- Domain skill packs (DevOps, data science, security)
Binary file not shown.
Binary file not shown.
-41
View File
@@ -1,41 +0,0 @@
# Dockerfile for atropos-agent sandbox server
# Runs inside Nomad containers to handle tool execution
# Includes bubblewrap for namespace-based slot isolation
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
# Bubblewrap for namespace isolation
bubblewrap \
# `script` for PTY allocation (used for stable tmux+asciinema startup)
util-linux \
# Git for SWE-style tasks (cloning repos)
git \
# tmux for stateful terminal sessions (Phase 4.7+)
tmux \
# Common tools agents might need
curl \
wget \
jq \
# Cleanup
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies (sandbox server + optional terminal recording)
RUN pip install --no-cache-dir aiohttp asciinema
# Copy the sandbox server
COPY sandbox_server.py /app/sandbox_server.py
WORKDIR /app
# Create data directory for slot workspaces
RUN mkdir -p /data
# Verify bubblewrap is installed and working
RUN bwrap --version
EXPOSE 8080
# Default command - can be overridden by Nomad job spec
CMD ["python", "sandbox_server.py", "--port", "8080", "--slots", "10", "--data-dir", "/data"]
-46
View File
@@ -1,46 +0,0 @@
"""
Atropos integration for Hermes-Agent.
This package is intentionally optional: Hermes-Agent should work without Atropos.
If you import anything from `atropos.*` without having `atroposlib` installed,
we raise a clear error with install instructions.
Install (recommended, from repo checkout):
uv sync --extra atropos
Or (pip / editable):
pip install -e '.[atropos]'
"""
from __future__ import annotations
def _require_atroposlib() -> None:
try:
import atroposlib # noqa: F401
except ModuleNotFoundError as exc: # pragma: no cover
raise ModuleNotFoundError(
"Hermes-Agent Atropos integration requires `atroposlib`, but it is not installed.\n"
"Install it with:\n"
" uv sync --extra atropos\n"
"or:\n"
" pip install -e '.[atropos]'\n"
) from exc
_require_atroposlib()
# Re-export the most commonly used pieces for convenience.
from .agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData # noqa: E402
from .envs import AgentEnv, AgentEnvConfig # noqa: E402
__all__ = [
"AtroposAgent",
"AgentConfig",
"AgentResult",
"AgentStep",
"SequenceData",
"AgentEnv",
"AgentEnvConfig",
]
-15
View File
@@ -1,15 +0,0 @@
"""
Agent abstractions for atropos-agent.
Provides the core AtroposAgent class for running ReACT-style agent loops.
"""
from .atropos_agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData
__all__ = [
"AtroposAgent",
"AgentConfig",
"AgentResult",
"AgentStep",
"SequenceData",
]
-850
View File
@@ -1,850 +0,0 @@
"""
ReACT-style agent implementation for atropos-agent.
This module provides the core AtroposAgent class that implements a basic
Reason-Act-Observe loop with tool calling capabilities.
Uses ManagedServer from atroposlib for automatic token/logprob tracking,
making trajectories ready for RL training.
The agent uses Hermes-style XML tags for tool calls:
- <think>...</think> for reasoning
- <tool_call>{"name": "...", "arguments": {...}}</tool_call> for actions
- <tool_response>...</tool_response> for observations
"""
import asyncio
import os
import json
import time
from contextlib import asynccontextmanager
from dataclasses import dataclass, field
from uuid import uuid4
from typing import Any, AsyncGenerator, Awaitable, Callable, Dict, List, Optional, Union
from dotenv import load_dotenv
import httpx
from ..tools import ToolCall, ToolRegistry, ToolResult
from atroposlib.envs.server_handling.managed_server import ManagedServer
load_dotenv()
# Default system prompt with tool calling instructions.
AGENT_SYSTEM_PROMPT = """You are a deep thinking AI. You MUST enclose your internal reasoning inside <think>...</think> tags.
You are a function calling AI model.
You are provided with function signatures within <tools></tools> XML tags.
You must call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
You can ONLY respond without a tool call if you are totally certain you have the final answer to the user's question or task
After calling & executing a function, you will be provided with function results within <tool_response></tool_response> XML tags.
Here are the available tools:
<tools>
{tools_json}
</tools>
Use the following JSON schema for each tool call you will make:
{"title": "FunctionCall", "type": "object", "properties": {"name": {"title": "Name", "type": "string"}, "arguments": {"title": "Arguments", "type": "object"}}, "required": ["name", "arguments"]}
## REQUIRED TOOL FORMAT
When you decide to call a tool, your assistant message MUST be:
1) exactly one <think>...</think> block, followed by
2) one or more <tool_call>...</tool_call> blocks,
and NOTHING else in that message.
If you need to explain anything, put it inside <think>. Do NOT write natural language outside <think> or <tool_call>.
For each function call return a JSON object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{"name": "<function-name>", "arguments": {"arg1": "value1"}}
</tool_call>
Each <tool_call> must be on its own and contain ONLY the JSON object (no extra text).
The JSON inside <tool_call> MUST be valid JSON with double quotes.
Do NOT output <tool_response> in an assistant message.
After you receive tool results, you may either call more tools (same required format) or provide the final answer.
When providing the final answer, do NOT include any <tool_call> blocks.
## TERMINAL TOOL NOTES
- Commands execute under POSIX `/bin/sh` (not bash).
- Each tool call runs in a fresh shell: environment changes (like `cd` or venv activation) do not persist across tool calls.
- Avoid bash-only features like `source`, `[[ ... ]]`, or process substitution.
- Prefer explicit venv usage:
- `python -m venv .venv && . .venv/bin/activate && python -m pip install -e .` (POSIX `.` activation), or
- `.venv/bin/python -m pip install -e .` (no activation required).
## ICL (examples)
User: Show the current directory.
Assistant:
<think>I should run pwd.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "pwd"}}
</tool_call>
User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
Assistant: /tmp
User: List files, then count them.
Assistant:
<think>I should count files.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "ls -1 | wc -l"}}
</tool_call>
User: <tool_response>{"success": true, "output": "3\\n"}</tool_response>
Assistant: 3
User: Run pwd, then print ok (two tool calls).
Assistant:
<think>I should run two commands.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "pwd"}}
</tool_call>
<tool_call>
{"name": "terminal", "arguments": {"command": "echo ok"}}
</tool_call>
User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
User: <tool_response>{"success": true, "output": "ok\\n"}</tool_response>
Assistant: ok
"""
@dataclass
class AgentConfig:
"""Configuration for the AtroposAgent."""
# Generation parameters
temperature: Optional[float] = 0.7
# Default to "let the backend decide" (important for tool-tag completions that may be longer).
max_tokens: Optional[int] = None
# Agent behavior
max_steps: int = 50
system_prompt: Optional[str] = None
tool_delay_s: float = 0.0
# Working directory for tools
working_dir: Optional[str] = None
@dataclass
class SequenceData:
"""Token/logprob data from a single completion."""
full_text: str
tokens: List[int]
masked_tokens: List[int] # -100 for prompt, actual IDs for completion
logprobs: List[float] # 1.0 for prompt, actual values for completion
metadata: Optional[Dict[str, Any]] = None
@classmethod
def from_sequence_node(cls, node) -> "SequenceData":
"""Create from a ManagedServer SequenceNode."""
return cls(
full_text=node.full_text,
tokens=node.tokens,
masked_tokens=node.masked_tokens,
logprobs=node.logprobs,
metadata=getattr(node, "metadata", None),
)
@dataclass
class AgentStep:
"""A single step in the agent's trajectory."""
step_number: int
assistant_message: str
tool_calls: List[ToolCall] = field(default_factory=list)
tool_results: List[ToolResult] = field(default_factory=list)
sequence_data: Optional[SequenceData] = None # Token data from this step
@property
def has_tool_calls(self) -> bool:
return len(self.tool_calls) > 0
@dataclass
class AgentResult:
"""Result of running an agent trajectory."""
success: bool
final_response: str
steps: List[AgentStep] = field(default_factory=list)
total_tokens: int = 0
error: Optional[str] = None
metadata: Dict[str, Any] = field(default_factory=dict)
# Full trajectory token data for RL training
trajectory_data: Optional[SequenceData] = None
@property
def num_steps(self) -> int:
return len(self.steps)
@property
def total_tool_calls(self) -> int:
return sum(len(step.tool_calls) for step in self.steps)
def to_messages(self) -> List[Dict[str, str]]:
"""Convert trajectory to messages format for logging."""
messages = []
for step in self.steps:
messages.append({"role": "assistant", "content": step.assistant_message})
if step.tool_results:
# Combine all tool responses
responses = "\n".join(r.to_xml() for r in step.tool_results)
messages.append({"role": "user", "content": responses})
return messages
def to_scored_data(self, score: float) -> Optional[Dict[str, Any]]:
"""
Convert to format suitable for ScoredDataGroup.
Args:
score: The score for this trajectory
Returns:
Dict with tokens, masks, scores suitable for training, or None if no data
"""
if self.trajectory_data is None:
return None
return {
"tokens": self.trajectory_data.tokens,
"masks": self.trajectory_data.masked_tokens,
"scores": score,
"logprobs": self.trajectory_data.logprobs,
}
class AtroposAgent:
"""
A ReACT-style agent that uses LLMs with tool calling.
This implementation wraps ManagedServer for automatic token/logprob tracking,
making trajectories ready for RL training.
Example:
# `server` may be an Atropos `ServerManager` (recommended) or a single `APIServer`.
# In practice, environments usually construct this via `BaseEnv`.
server = ...
tools = ToolRegistry()
tools.register(BashTool())
agent = AtroposAgent(server=server, tools=tools)
result = await agent.run("List the files in the current directory")
# Access token data for training
if result.trajectory_data:
print(f"Tokens: {result.trajectory_data.tokens}")
print(f"Masked: {result.trajectory_data.masked_tokens}")
"""
def __init__(
self,
server, # ServerManager or APIServer
tools: Optional[ToolRegistry] = None,
config: Optional[AgentConfig] = None,
tokenizer: Optional[Any] = None,
execute_tool: Optional[Callable[[ToolCall], Awaitable[ToolResult]]] = None,
):
self.server = server
self.tools = tools or ToolRegistry()
self.config = config or AgentConfig()
self.tokenizer = tokenizer or getattr(server, "tokenizer", None)
self.execute_tool = execute_tool or self.tools.execute
@asynccontextmanager
async def _managed(self) -> AsyncGenerator[Any, None]:
"""
Yield a ManagedServer-like object.
- If `self.server` is a ServerManager, use its `managed_server()` context manager.
- If `self.server` is a single APIServer, wrap it in `ManagedServer` directly.
"""
if os.getenv("ATROPOS_BYPASS_MANAGED_SERVER") == "1":
yield _DirectChatCompletionClient(server=self.server)
return
if hasattr(self.server, "managed_server"):
async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
yield managed
else:
managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
try:
yield managed
finally:
managed.reset()
def _build_system_prompt(self) -> str:
"""Build the system prompt with tool descriptions."""
if self.config.system_prompt:
return self.config.system_prompt
tools_json = self.tools.get_prompt_tool_definitions_json()
# Avoid `str.format()` here because the prompt contains many literal `{}` braces
# in JSON examples; we only want to substitute the single `{tools_json}` token.
return AGENT_SYSTEM_PROMPT.replace("{tools_json}", tools_json)
def _infer_server_model_for_debug(self) -> Optional[str]:
"""
Best-effort inference of the configured model name for debug payload saving.
ManagedServer/server_manager typically injects `model` internally, so `chat_kwargs`
may not contain it. For replaying saved payloads via curl, it's useful to persist it.
"""
servers = getattr(self.server, "servers", None)
if isinstance(servers, list) and servers:
s0 = servers[0]
cfg = getattr(s0, "config", None)
model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
if isinstance(model, str) and model:
return model
model = getattr(self.server, "model_name", None) or getattr(self.server, "model", None)
if isinstance(model, str) and model:
return model
return None
def _infer_server_base_url_for_debug(self) -> Optional[str]:
"""
Best-effort inference of the configured base_url for debug logging.
This is helpful when diagnosing hangs / retries at the transport layer.
"""
servers = getattr(self.server, "servers", None)
if isinstance(servers, list) and servers:
s0 = servers[0]
cfg = getattr(s0, "config", None)
base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
if isinstance(base_url, str) and base_url:
return base_url
base_url = getattr(self.server, "base_url", None)
if isinstance(base_url, str) and base_url:
return base_url
return None
def _extract_response_metadata(self, response: Any) -> Dict[str, Any]:
"""
Extract lightweight, JSON-serializable metadata from an OpenAI-style response.
This is useful for debugging training runs, especially when ManagedServer state
tracking is unavailable (e.g. OpenAI-compatible chat endpoints).
"""
meta: Dict[str, Any] = {}
try:
rid = getattr(response, "id", None)
if isinstance(rid, str) and rid:
meta["id"] = rid
model = getattr(response, "model", None)
if isinstance(model, str) and model:
meta["model"] = model
created = getattr(response, "created", None)
if isinstance(created, int):
meta["created"] = created
system_fingerprint = getattr(response, "system_fingerprint", None)
if isinstance(system_fingerprint, str) and system_fingerprint:
meta["system_fingerprint"] = system_fingerprint
choices = getattr(response, "choices", None)
if isinstance(choices, list) and choices:
fr = getattr(choices[0], "finish_reason", None)
if isinstance(fr, str) and fr:
meta["finish_reason"] = fr
usage = getattr(response, "usage", None)
if usage is not None:
if hasattr(usage, "model_dump"):
meta["usage"] = usage.model_dump()
elif isinstance(usage, dict):
meta["usage"] = usage
except Exception:
pass
return meta
def _debug_dump_request(self, *, step_num: int, chat_kwargs: Dict[str, Any]) -> None:
if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST") != "1":
return
try:
# Avoid dumping megabytes by default; messages can be huge.
meta = {
"step": step_num,
"base_url": self._infer_server_base_url_for_debug(),
"model": chat_kwargs.get("model") or self._infer_server_model_for_debug(),
"chat_kwargs_keys": sorted(list(chat_kwargs.keys())),
"n": chat_kwargs.get("n"),
"max_tokens": chat_kwargs.get("max_tokens"),
"temperature": chat_kwargs.get("temperature"),
"num_messages": len(chat_kwargs.get("messages") or []),
}
print("\n=== ATROPOS_DEBUG_AGENT_REQUEST ===", flush=True)
print(meta, flush=True)
if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_FULL") == "1":
payload = dict(chat_kwargs)
# Make the payload more legible and less huge.
try:
dumped = json.dumps(payload, ensure_ascii=False, indent=2)
except Exception:
dumped = repr(payload)
print("\n=== ATROPOS_DEBUG_AGENT_REQUEST_FULL ===", flush=True)
print(dumped[:200_000], flush=True)
# Optional: save the FULL request payload to disk (no truncation).
save_dir = os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_SAVE_DIR")
if save_dir:
os.makedirs(save_dir, exist_ok=True)
payload: Dict[str, Any] = dict(chat_kwargs)
if "model" not in payload:
model = self._infer_server_model_for_debug()
if model:
payload["model"] = model
# Use a unique filename so parallel trajectories don't clobber each other.
fname = os.path.join(
save_dir,
f"atropos_agent_request_step{step_num}_{int(time.time()*1000)}_{os.getpid()}_{uuid4().hex}.json",
)
with open(fname, "w", encoding="utf-8") as f:
json.dump(payload, f, ensure_ascii=False, indent=2)
print(f"[AtroposAgent] saved request payload: {fname}", flush=True)
except Exception:
return
def _debug_dump_response(self, *, step_num: int, response: Any) -> None:
if os.getenv("ATROPOS_DEBUG_AGENT_RESPONSE") != "1":
return
print("\n=== ATROPOS_DEBUG_AGENT_RESPONSE ===", flush=True)
print({"step": step_num, "type": type(response).__name__}, flush=True)
try:
dumped = response.model_dump() # openai pydantic model
except Exception:
dumped = getattr(response, "__dict__", {"repr": repr(response)})
# Keep the dump bounded; we only need enough to see the assistant message content.
text = str(dumped)
print(text[:200_000], flush=True)
async def _chat_completion_with_debug(
self, *, managed: Any, step_num: int, chat_kwargs: Dict[str, Any]
) -> Any:
"""
Call `managed.chat_completion()` with optional timeout + richer failure logging.
Debug env vars:
- `ATROPOS_AGENT_CHAT_TIMEOUT_S`: if set, wraps the await in `asyncio.wait_for`.
- `ATROPOS_DEBUG_AGENT_WAIT_EVERY_S`: if set, prints a heartbeat while waiting.
"""
# Hard guardrail: never allow a single chat completion to block for too long.
# This is essential for RL data-gen stability; long hangs should be treated as failures (score=0).
timeout_s_raw = os.getenv("ATROPOS_AGENT_CHAT_TIMEOUT_S")
timeout_s_default = 240.0
timeout_s = float(timeout_s_raw) if timeout_s_raw else timeout_s_default
timeout_s = min(timeout_s, 240.0)
wait_every_raw = os.getenv("ATROPOS_DEBUG_AGENT_WAIT_EVERY_S")
wait_every_s = float(wait_every_raw) if wait_every_raw else None
async def _await_call() -> Any:
if not wait_every_s or wait_every_s <= 0:
return await managed.chat_completion(**chat_kwargs)
# Heartbeat mode: wait in chunks without cancelling the underlying request.
# NOTE: do NOT use `asyncio.wait_for(task, timeout=...)` here, because a timeout
# will cancel the task and surface as `CancelledError` on the next loop.
task = asyncio.create_task(managed.chat_completion(**chat_kwargs))
t0 = time.perf_counter()
try:
while True:
done, _pending = await asyncio.wait({task}, timeout=wait_every_s)
if task in done:
return task.result()
waited = time.perf_counter() - t0
print(
f"[AtroposAgent] step={step_num} still waiting for chat_completion... ({waited:.1f}s)",
flush=True,
)
except asyncio.CancelledError:
task.cancel()
raise
try:
return await asyncio.wait_for(_await_call(), timeout=timeout_s)
except asyncio.TimeoutError as e:
print("\n=== ATROPOS_DEBUG_AGENT_CHAT_TIMEOUT ===", flush=True)
print({"step": step_num, "timeout_s": timeout_s}, flush=True)
raise RuntimeError(f"chat_completion timed out after {timeout_s:.1f}s") from e
except asyncio.CancelledError:
# Treat cancellation as a hard failure rather than crashing the whole env run.
# (Atropos/BaseEnv may cancel tasks during shutdown or retries.)
raise RuntimeError("chat_completion cancelled") from None
except Exception as e:
detail: Dict[str, Any] = {
"step": step_num,
"exc_type": type(e).__name__,
"exc_str": str(e),
}
if isinstance(e, httpx.HTTPStatusError):
try:
detail["status_code"] = e.response.status_code
detail["response_text"] = e.response.text[:20_000]
except Exception:
pass
elif isinstance(e, httpx.RequestError):
detail["request"] = repr(getattr(e, "request", None))
print("\n=== ATROPOS_DEBUG_AGENT_CHAT_FAILURE ===", flush=True)
print(detail, flush=True)
raise
async def run(
self,
task: str,
initial_messages: Optional[List[Dict[str, str]]] = None,
) -> AgentResult:
"""
Run the agent on a task using ManagedServer for token tracking.
Args:
task: The task/prompt for the agent
initial_messages: Optional additional context messages
Returns:
AgentResult with the trajectory, final response, and token data
"""
messages = [
{"role": "system", "content": self._build_system_prompt()},
]
if initial_messages:
messages.extend(initial_messages)
messages.append({"role": "user", "content": task})
steps = []
final_response = ""
final_node = None
final_prompt_messages: Optional[List[Dict[str, str]]] = None
last_node = None
last_prompt_messages: Optional[List[Dict[str, str]]] = None
last_response_text: str = ""
# Use ManagedServer for automatic token tracking
async with self._managed() as managed:
for step_num in range(self.config.max_steps):
# ReACT loop iteration here, just call -> tools -> observe until done (no tools called)
try:
# Keep a copy of the prompt messages used for this completion.
# Useful for reconstructing tokens/masks when state tracking is unavailable.
prompt_messages = list(messages)
chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
if self.config.max_tokens is not None:
chat_kwargs["max_tokens"] = self.config.max_tokens
if self.config.temperature is not None:
chat_kwargs["temperature"] = self.config.temperature
t_req = time.perf_counter()
print(
f"[AtroposAgent] step={step_num+1} chat_completion start "
f"(messages={len(messages)}, max_tokens={self.config.max_tokens}, temp={self.config.temperature})",
flush=True,
)
self._debug_dump_request(step_num=step_num + 1, chat_kwargs=chat_kwargs)
response = await self._chat_completion_with_debug(
managed=managed, step_num=step_num + 1, chat_kwargs=chat_kwargs
)
self._debug_dump_response(step_num=step_num + 1, response=response)
response_meta = self._extract_response_metadata(response)
print(
f"[AtroposAgent] step={step_num+1} chat_completion done in {time.perf_counter() - t_req:.2f}s",
flush=True,
)
current_node = None
if hasattr(managed, "get_state"):
state = managed.get_state()
nodes = state.get("nodes", [])
current_node = nodes[-1] if nodes else None
except Exception as e:
return AgentResult(
success=False,
final_response="",
steps=steps,
error=f"Generation error: {str(e)}",
)
msg = response.choices[0].message
# Some OpenAI-compatible servers populate `message.reasoning` and leave `content=""`.
response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
tool_calls = ToolCall.parse_from_text(response_text)
last_node = current_node
last_prompt_messages = prompt_messages
last_response_text = response_text
step_sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None
if step_sequence_data is None:
if response_meta:
# We still want metadata for debugging even if token/logprob state tracking is unavailable.
step_sequence_data = SequenceData(
full_text=response_text,
tokens=[],
masked_tokens=[],
logprobs=[],
metadata=response_meta,
)
else:
merged = dict(response_meta)
node_meta = step_sequence_data.metadata
if isinstance(node_meta, dict):
merged.update(node_meta)
step_sequence_data.metadata = merged or step_sequence_data.metadata
step = AgentStep(
step_number=step_num + 1,
assistant_message=response_text,
tool_calls=tool_calls,
sequence_data=step_sequence_data,
)
if not tool_calls:
steps.append(step)
final_response = response_text
final_node = current_node
final_prompt_messages = prompt_messages
break
messages.append({"role": "assistant", "content": response_text})
tool_responses = []
for call in tool_calls:
result = await self.execute_tool(call)
step.tool_results.append(result)
tool_responses.append(result.to_xml())
if self.config.tool_delay_s > 0:
await asyncio.sleep(self.config.tool_delay_s)
steps.append(step)
responses_text = "\n".join(tool_responses)
# Tool observations are represented as user content with Hermes-style tags.
# This is compatible with most OpenAI-compatible chat APIs and ensures
# tokenizers/chat templates include tool outputs during training.
messages.append({"role": "user", "content": responses_text})
else:
# Reached max steps without completing
# Return a failure result but include the last observed completion so callers can
# record the trajectory (score=0) without triggering retries.
final_response = last_response_text or final_response
final_node = last_node
final_prompt_messages = last_prompt_messages
trajectory_data = None
if final_node:
trajectory_data = SequenceData.from_sequence_node(final_node)
elif final_prompt_messages is not None and self.tokenizer is not None:
if hasattr(self.tokenizer, "apply_chat_template"):
prompt_text = self.tokenizer.apply_chat_template(
final_prompt_messages, tokenize=False, add_generation_prompt=True
)
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
else:
prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
tokens = prompt_tokens + output_tokens
masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
trajectory_data = SequenceData(
full_text=f"{prompt_text}{final_response}",
tokens=tokens,
masked_tokens=masked_tokens,
logprobs=logprobs,
)
# Preserve response metadata (if any) even on failure trajectories.
try:
if trajectory_data is not None and steps:
last_step = steps[-1]
if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
trajectory_data.metadata = dict(last_step.sequence_data.metadata)
except Exception:
pass
return AgentResult(
success=False,
final_response=final_response,
steps=steps,
error=f"Reached maximum steps ({self.config.max_steps})",
trajectory_data=trajectory_data,
)
# Build result with trajectory data
trajectory_data = None
if final_node:
trajectory_data = SequenceData.from_sequence_node(final_node)
elif final_prompt_messages is not None and self.tokenizer is not None:
if hasattr(self.tokenizer, "apply_chat_template"):
prompt_text = self.tokenizer.apply_chat_template(
final_prompt_messages, tokenize=False, add_generation_prompt=True
)
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
else:
prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
tokens = prompt_tokens + output_tokens
masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
trajectory_data = SequenceData(
full_text=f"{prompt_text}{final_response}",
tokens=tokens,
masked_tokens=masked_tokens,
logprobs=logprobs,
)
# Ensure trajectory_data carries the most recent metadata we observed (if any).
try:
if trajectory_data is not None and steps:
last_step = steps[-1]
if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
trajectory_data.metadata = dict(last_step.sequence_data.metadata)
except Exception:
pass
return AgentResult(
success=True,
final_response=final_response,
steps=steps,
trajectory_data=trajectory_data,
)
async def run_single_turn(
self,
messages: List[Dict[str, str]],
execute_tools: bool = True,
) -> tuple[str, List[ToolResult], Optional[SequenceData]]:
"""
Run a single turn of the agent (one LLM call + tool execution).
This is useful for integration with BaseEnv where you want more
control over the loop.
Args:
messages: The conversation history
execute_tools: Whether to execute parsed tool calls
Returns:
Tuple of (response_text, tool_results, sequence_data)
"""
async with self._managed() as managed:
chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
if self.config.max_tokens is not None:
chat_kwargs["max_tokens"] = self.config.max_tokens
if self.config.temperature is not None:
chat_kwargs["temperature"] = self.config.temperature
self._debug_dump_request(step_num=1, chat_kwargs=chat_kwargs)
response = await self._chat_completion_with_debug(managed=managed, step_num=1, chat_kwargs=chat_kwargs)
self._debug_dump_response(step_num=1, response=response)
current_node = None
if hasattr(managed, "get_state"):
state = managed.get_state()
nodes = state.get("nodes", [])
current_node = nodes[-1] if nodes else None
msg = response.choices[0].message
response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
tool_results = []
if execute_tools:
tool_calls = ToolCall.parse_from_text(response_text)
for call in tool_calls:
result = await self.execute_tool(call)
tool_results.append(result)
sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None
return response_text, tool_results, sequence_data
class _DirectChatCompletionClient:
"""
Minimal stand-in for ManagedServer that calls the OpenAI-compatible endpoint directly.
This is for isolating issues where `ManagedServer.chat_completion()` hangs or misbehaves.
It intentionally does NOT do token/logprob tracking.
"""
def __init__(self, server: Any):
self._server = server
def _server_config(self) -> tuple[str, str, str]:
# ServerManager case: first configured server.
servers = getattr(self._server, "servers", None)
if isinstance(servers, list) and servers:
s0 = servers[0]
cfg = getattr(s0, "config", None)
base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
api_key = getattr(cfg, "api_key", None) or getattr(s0, "api_key", None)
model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
return base_url.rstrip("/"), api_key, model
# APIServer-like fallback.
base_url = getattr(self._server, "base_url", None)
api_key = getattr(self._server, "api_key", None)
model = getattr(self._server, "model_name", None) or getattr(self._server, "model", None)
if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
return base_url.rstrip("/"), api_key, model
raise RuntimeError("Unable to resolve server base_url/api_key/model for direct chat completion")
async def chat_completion(self, *, messages: List[Dict[str, str]], n: int = 1, **kwargs: Any) -> Any:
base_url, api_key, model = self._server_config()
url = f"{base_url}/chat/completions"
payload: Dict[str, Any] = {
"model": model,
"messages": messages,
"n": n,
}
# Pass through common generation kwargs.
for k in ("max_tokens", "temperature", "top_p", "presence_penalty", "frequency_penalty", "stop"):
if k in kwargs and kwargs[k] is not None:
payload[k] = kwargs[k]
timeout_s = float(os.getenv("ATROPOS_DIRECT_REQUEST_TIMEOUT_S") or "120")
print(f"[AtroposAgent] DIRECT chat_completion POST {url} (timeout={timeout_s}s)", flush=True)
async with httpx.AsyncClient(timeout=timeout_s) as client:
resp = await client.post(
url,
headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
json=payload,
)
resp.raise_for_status()
data = resp.json()
# Return a very small object compatible with the code paths that read
# `response.choices[0].message.content`.
class _Msg:
def __init__(self, d: Dict[str, Any]):
self.content = d.get("content")
self.reasoning = d.get("reasoning")
class _Choice:
def __init__(self, d: Dict[str, Any]):
self.message = _Msg(d.get("message") or {})
class _Resp:
def __init__(self, d: Dict[str, Any]):
self._d = d
self.choices = [_Choice(c) for c in (d.get("choices") or [])]
def model_dump(self) -> Dict[str, Any]:
return self._d
return _Resp(data)
-6
View File
@@ -1,6 +0,0 @@
"""
FastAPI services for atropos-agent.
- tool_executor_server: queued/batched sandbox tool execution (Phase 4)
"""
-254
View File
@@ -1,254 +0,0 @@
"""
Tool Executor API (Phase 4)
This service provides a queued, batched execution layer on top of a ToolBackend.
It mirrors the stateful FastAPI + app.state pattern used in:
atropos/atroposlib/api/server.py
Run (dev):
uv run uvicorn atropos_agent.api.tool_executor_server:app --host 0.0.0.0 --port 9001
"""
from __future__ import annotations
import os
from typing import Any, Dict, Optional
from pathlib import Path
from fastapi import FastAPI, Header, HTTPException, status
from pydantic import BaseModel, Field
from ..backends.nomad_backend import NomadBackendConfig, NomadToolBackend
from ..tools import ToolRegistry, build_tool_registry
from ..tools.base import (
ArtifactArchiveRequestPayload,
ArtifactArchiveResponsePayload,
ArtifactListRequestPayload,
ArtifactListResponsePayload,
ArtifactReadRequestPayload,
ArtifactReadResponsePayload,
ToolExecutorExecuteRequest,
ToolExecutorReleaseRequest,
ToolResultPayload,
)
from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig
class ToolExecutorServerConfig(BaseModel):
nomad_address: str = Field(default="http://localhost:4646")
job_id: str = Field(default="atropos-sandbox-tool-executor")
image: str = Field(default="atropos-sandbox:local")
slots_per_container: int = Field(default=10)
min_containers: int = Field(default=1)
max_containers: int = Field(default=10)
privileged: bool = Field(default=False)
acquire_timeout_s: float = Field(default=30.0)
batch_window_ms: int = Field(default=20)
max_batch_size: int = Field(default=200)
allow_network: bool = Field(default=True)
tool_server_url: Optional[str] = Field(default=None)
tool_server_token: Optional[str] = Field(default=None)
token: Optional[str] = Field(default=None, description="Bearer token required for requests (optional in dev).")
purge_job_on_shutdown: bool = Field(default=True)
@classmethod
def from_env(cls) -> "ToolExecutorServerConfig":
# In dev, prefer loading secrets/config from the repo-local `.env` (not committed).
try:
from dotenv import load_dotenv # type: ignore
except Exception: # pragma: no cover
load_dotenv = None # type: ignore[assignment]
if load_dotenv is not None:
env_path = Path(__file__).resolve().parents[2] / ".env"
if env_path.exists():
load_dotenv(dotenv_path=env_path)
def _get_bool(name: str, default: bool) -> bool:
raw = os.getenv(name)
if raw is None:
return default
return raw.strip().lower() in {"1", "true", "yes", "y", "on"}
return cls(
nomad_address=os.getenv("TOOL_EXECUTOR_NOMAD_ADDRESS", "http://localhost:4646"),
job_id=os.getenv("TOOL_EXECUTOR_JOB_ID", "atropos-sandbox-tool-executor"),
image=os.getenv("TOOL_EXECUTOR_IMAGE", "atropos-sandbox:local"),
slots_per_container=int(os.getenv("TOOL_EXECUTOR_SLOTS", "10")),
min_containers=int(os.getenv("TOOL_EXECUTOR_MIN_CONTAINERS", "1")),
max_containers=int(os.getenv("TOOL_EXECUTOR_MAX_CONTAINERS", "10")),
privileged=_get_bool("TOOL_EXECUTOR_PRIVILEGED", False),
acquire_timeout_s=float(os.getenv("TOOL_EXECUTOR_ACQUIRE_TIMEOUT_S", "30.0")),
batch_window_ms=int(os.getenv("TOOL_EXECUTOR_BATCH_WINDOW_MS", "20")),
max_batch_size=int(os.getenv("TOOL_EXECUTOR_MAX_BATCH_SIZE", "200")),
allow_network=_get_bool("TOOL_EXECUTOR_ALLOW_NETWORK", True),
tool_server_url=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_URL") or None,
tool_server_token=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_TOKEN") or None,
token=os.getenv("TOOL_EXECUTOR_TOKEN") or None,
purge_job_on_shutdown=_get_bool("TOOL_EXECUTOR_PURGE_JOB_ON_SHUTDOWN", True),
)
app = FastAPI(title="Atropos-Agent Tool Executor")
@app.get("/")
async def root() -> Dict[str, str]:
return {"message": "Atropos-Agent Tool Executor"}
def _check_auth(cfg: ToolExecutorServerConfig, authorization: Optional[str]) -> None:
if not cfg.token:
return
if not authorization:
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
if not authorization.lower().startswith("bearer "):
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
token = authorization.split(" ", 1)[1].strip()
if token != cfg.token:
raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")
@app.on_event("startup")
async def _startup() -> None:
cfg = ToolExecutorServerConfig.from_env()
# Default to Atropos "full" tool surface: sandbox + external (if tool_server_url provided).
tools: ToolRegistry = build_tool_registry(
enabled_toolsets=["full"],
disabled_toolsets=None,
tool_server_url=cfg.tool_server_url,
)
backend = NomadToolBackend(
NomadBackendConfig(
nomad_address=cfg.nomad_address,
sandbox_job_id=cfg.job_id,
sandbox_image=cfg.image,
slots_per_container=cfg.slots_per_container,
min_containers=cfg.min_containers,
max_containers=cfg.max_containers,
privileged=cfg.privileged,
acquire_timeout_s=cfg.acquire_timeout_s,
purge_job_on_start=False,
)
)
await backend.start()
executor = ToolExecutor(
backend=backend,
tools=tools,
config=ToolExecutorConfig(
batch_window_ms=cfg.batch_window_ms,
max_batch_size=cfg.max_batch_size,
allow_network=cfg.allow_network,
tool_server_url=cfg.tool_server_url,
tool_server_token=cfg.tool_server_token,
),
)
await executor.start()
app.state.cfg = cfg
app.state.backend = backend
app.state.executor = executor
@app.on_event("shutdown")
async def _shutdown() -> None:
executor: Optional[ToolExecutor] = getattr(app.state, "executor", None)
backend: Optional[NomadToolBackend] = getattr(app.state, "backend", None)
cfg: Optional[ToolExecutorServerConfig] = getattr(app.state, "cfg", None)
if executor is not None:
await executor.close()
if backend is not None:
await backend.stop(purge=bool(cfg.purge_job_on_shutdown) if cfg else False)
@app.get("/health")
async def health() -> Dict[str, Any]:
return {"status": "ok"}
@app.get("/status")
async def status_endpoint() -> Dict[str, Any]:
executor: ToolExecutor = app.state.executor
backend: NomadToolBackend = app.state.backend
return {
"queue_size": executor.queue_size(),
"total_requests": executor.total_requests,
"total_errors": executor.total_errors,
"pool": backend.get_stats(),
}
@app.post("/execute", response_model=ToolResultPayload)
async def execute_tool(
req: ToolExecutorExecuteRequest,
authorization: Optional[str] = Header(default=None),
status_code: int = status.HTTP_200_OK, # noqa: B008
) -> ToolResultPayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
result = await executor.execute(
trajectory_id=req.trajectory_id,
call=req.tool.to_tool_call(),
timeout_s=req.timeout_s,
)
return ToolResultPayload.from_tool_result(result)
@app.post("/release")
async def release_trajectory(
req: ToolExecutorReleaseRequest,
authorization: Optional[str] = Header(default=None),
) -> Dict[str, Any]:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
await executor.release_trajectory(req.trajectory_id, reset_workspace=req.reset_workspace)
return {"status": "ok"}
@app.post("/artifacts/read", response_model=ArtifactReadResponsePayload)
async def artifacts_read(
req: ArtifactReadRequestPayload,
authorization: Optional[str] = Header(default=None),
) -> ArtifactReadResponsePayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
return await executor.read_artifact(req)
@app.post("/artifacts/list", response_model=ArtifactListResponsePayload)
async def artifacts_list(
req: ArtifactListRequestPayload,
authorization: Optional[str] = Header(default=None),
) -> ArtifactListResponsePayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
return await executor.list_artifacts(req)
@app.post("/artifacts/archive", response_model=ArtifactArchiveResponsePayload)
async def artifacts_archive(
req: ArtifactArchiveRequestPayload,
authorization: Optional[str] = Header(default=None),
) -> ArtifactArchiveResponsePayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
return await executor.archive_artifacts(req)
-140
View File
@@ -1,140 +0,0 @@
"""
External ToolServer (Phase 4.5+).
This server executes tools that must NOT run inside the sandbox, typically
because they require credentials or access to external services.
Run (dev):
uv run uvicorn atropos_agent.api.tool_server:app --host 0.0.0.0 --port 9002
"""
from __future__ import annotations
import asyncio
import os
import inspect
from typing import Any, Dict, List, Optional
from pathlib import Path
from fastapi import FastAPI, Header, HTTPException, status
from pydantic import BaseModel, Field
from ..tools import ToolRegistry, build_tool_registry
from ..tools.base import ToolResultPayload, ToolServerExecuteRequest
class ToolServerConfig(BaseModel):
token: Optional[str] = Field(
default=None,
description="Bearer token required for requests (optional in dev).",
)
max_concurrency: int = Field(default=16, ge=1, description="Max concurrent tool executions.")
@classmethod
def from_env(cls) -> "ToolServerConfig":
# In dev, prefer loading secrets from the repo-local `.env` (not committed).
try:
from dotenv import load_dotenv # type: ignore
except Exception: # pragma: no cover
load_dotenv = None # type: ignore[assignment]
if load_dotenv is not None:
env_path = Path(__file__).resolve().parents[2] / ".env"
if env_path.exists():
load_dotenv(dotenv_path=env_path)
token = os.getenv("TOOL_SERVER_TOKEN") or None
max_concurrency = int(os.getenv("TOOL_SERVER_MAX_CONCURRENCY", "16"))
return cls(token=token, max_concurrency=max_concurrency)
app = FastAPI(title="Atropos-Agent Tool Server")
@app.get("/")
async def root() -> Dict[str, str]:
return {"message": "Atropos-Agent Tool Server"}
@app.on_event("startup")
async def _startup() -> None:
cfg = ToolServerConfig.from_env()
# External-only registry. It will only include tools that are enabled by toolsets and
# whose Hermes requirements/keys are satisfied in this process.
tools: ToolRegistry = build_tool_registry(
enabled_toolsets=["all"],
disabled_toolsets=["terminal", "sandbox", "filesystem", "terminal_stateful", "default"],
tool_server_url="enabled",
)
app.state.cfg = cfg
app.state.tools = tools
app.state.semaphore = asyncio.Semaphore(cfg.max_concurrency)
@app.get("/health")
async def health() -> Dict[str, Any]:
return {"status": "ok"}
@app.get("/tools")
async def list_tools() -> Dict[str, Any]:
tools: ToolRegistry = app.state.tools
return {"tools": [s.to_dict() for s in tools.get_schemas()]}
def _check_auth(cfg: ToolServerConfig, authorization: Optional[str]) -> None:
if not cfg.token:
return
if not authorization:
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
if not authorization.lower().startswith("bearer "):
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
token = authorization.split(" ", 1)[1].strip()
if token != cfg.token:
raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")
@app.post("/execute", response_model=ToolResultPayload)
async def execute_tool(
req: ToolServerExecuteRequest,
authorization: Optional[str] = Header(default=None),
) -> ToolResultPayload:
cfg: ToolServerConfig = app.state.cfg
_check_auth(cfg, authorization)
tools: ToolRegistry = app.state.tools
sem: asyncio.Semaphore = app.state.semaphore
tool = tools.get(req.tool.name)
if tool is None:
return ToolResultPayload(
success=False,
error=f"Unknown tool: {req.tool.name}",
uniq_id=req.tool.uniq_id,
)
async with sem:
try:
kwargs = dict(req.tool.arguments)
sig = inspect.signature(tool.execute).parameters
# Some tools can benefit from extra context.
if req.trajectory_id and "trajectory_id" in sig:
kwargs["trajectory_id"] = req.trajectory_id
if req.slot_id and "slot_id" in sig:
kwargs["slot_id"] = req.slot_id
if req.container_addr and "container_addr" in sig:
kwargs["container_addr"] = req.container_addr
if "task_id" in sig:
kwargs["task_id"] = req.trajectory_id
result = await tool.execute(**kwargs)
except Exception as e:
return ToolResultPayload(
success=False,
error=f"Tool execution error: {e}",
uniq_id=req.tool.uniq_id,
)
if result.uniq_id is None:
result.uniq_id = req.tool.uniq_id
return ToolResultPayload.from_tool_result(result)
-27
View File
@@ -1,27 +0,0 @@
from __future__ import annotations
from typing import Any
from .base import ToolBackend
from .modal_backend import ModalSandboxConfig, ModalToolBackend
from .nomad_backend import NomadBackendConfig, NomadToolBackend
def create_tool_backend(cfg: Any) -> ToolBackend:
mode = str(getattr(cfg, "tool_pool_mode", "nomad")).strip().lower()
if mode == "nomad":
return NomadToolBackend(NomadBackendConfig.from_agent_env_config(cfg))
if mode == "modal":
return ModalToolBackend(ModalSandboxConfig.from_agent_env_config(cfg))
raise ValueError(f"Unknown tool_pool_mode: {mode}")
__all__ = [
"ToolBackend",
"create_tool_backend",
"NomadBackendConfig",
"NomadToolBackend",
"ModalSandboxConfig",
"ModalToolBackend",
]
-89
View File
@@ -1,89 +0,0 @@
"""
Backend interfaces for AgentEnv tool execution.
The goal of this module is to decouple ToolExecutor / AgentEnv from any single
execution backend (Nomad/Docker today; Modal later).
"""
from __future__ import annotations
from typing import Any, Dict, List, Optional, Protocol, Tuple
from ..slots.executor import ExecutionResult
from ..slots.slot import Slot
class ToolBackend(Protocol):
"""
Minimal interface required by ToolExecutor.
Backends provide:
- lifecycle (start/stop)
- slot acquisition/release (workspace affinity)
- batched tool execution across slots
- optional artifact helpers (for env verification / demos)
"""
@property
def default_timeout_s(self) -> Optional[float]:
"""Default sandbox execution timeout in seconds (if any)."""
async def start(self) -> None:
"""Start the backend (provision workers/containers, health checks, etc)."""
async def stop(self, *, purge: bool = False) -> None:
"""Stop the backend and optionally purge remote resources."""
async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
"""Acquire a slot for a trajectory (workspace affinity)."""
async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
"""Release a slot back to the pool."""
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
*,
timeout_s: Optional[float] = None,
) -> List[ExecutionResult]:
"""Execute a batch of sandbox tool calls and return results in order."""
# ---------------------------------------------------------------------
# Optional artifact helpers (supported by the Nomad sandbox-server today)
# ---------------------------------------------------------------------
async def read_artifact(
self,
slot: Slot,
path: str,
*,
encoding: str = "text",
max_bytes: Optional[int] = None,
include_sha256: bool = False,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
raise NotImplementedError
async def list_artifacts(
self,
slot: Slot,
path: str = ".",
*,
recursive: bool = False,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
raise NotImplementedError
async def archive_artifacts(
self,
slot: Slot,
path: str = ".",
*,
archive_format: str = "tar.gz",
max_bytes: Optional[int] = None,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
raise NotImplementedError
File diff suppressed because it is too large Load Diff
-156
View File
@@ -1,156 +0,0 @@
"""
Nomad/Docker tool backend.
This backend is the current default for AgentEnv: it provisions a Nomad job
running `sandbox_server.py` and multiplexes stateless slots inside each container.
"""
from __future__ import annotations
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple
from ..slots import Slot, SlotPool, SlotPoolConfig
from ..slots.executor import ExecutionResult
from .base import ToolBackend
@dataclass(frozen=True)
class NomadBackendConfig:
nomad_address: str
sandbox_job_id: str
sandbox_image: str
slots_per_container: int
min_containers: int
max_containers: int
privileged: bool
acquire_timeout_s: float
purge_job_on_start: bool
# Driver selection: "docker" or "singularity"
driver: str = "docker"
# Path to .sif file for singularity driver (required if driver="singularity")
singularity_image: Optional[str] = None
@classmethod
def from_agent_env_config(cls, cfg: Any) -> "NomadBackendConfig":
return cls(
nomad_address=str(getattr(cfg, "nomad_address")),
sandbox_job_id=str(getattr(cfg, "sandbox_job_id")),
sandbox_image=str(getattr(cfg, "sandbox_image")),
slots_per_container=int(getattr(cfg, "slots_per_container")),
min_containers=int(getattr(cfg, "min_containers")),
max_containers=int(getattr(cfg, "max_containers")),
privileged=bool(getattr(cfg, "privileged")),
acquire_timeout_s=float(getattr(cfg, "acquire_timeout_s")),
purge_job_on_start=bool(getattr(cfg, "purge_job_on_start", False)),
driver=str(getattr(cfg, "driver", "docker")),
singularity_image=getattr(cfg, "singularity_image", None),
)
class NomadToolBackend(ToolBackend):
def __init__(self, config: NomadBackendConfig):
self.config = config
self.pool = SlotPool(
SlotPoolConfig(
nomad_address=config.nomad_address,
job_id=config.sandbox_job_id,
image=config.sandbox_image,
slots_per_container=config.slots_per_container,
min_containers=config.min_containers,
max_containers=config.max_containers,
privileged=config.privileged,
acquire_timeout=config.acquire_timeout_s,
purge_job_on_start=bool(config.purge_job_on_start),
driver=config.driver,
singularity_image=config.singularity_image,
)
)
@property
def default_timeout_s(self) -> Optional[float]:
t = getattr(self.pool.executor, "timeout", None)
total = getattr(t, "total", None)
try:
return float(total) if total is not None else None
except Exception:
return None
async def start(self) -> None:
await self.pool.start()
async def stop(self, *, purge: bool = False) -> None:
await self.pool.stop(purge_job=purge)
async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
return await self.pool.acquire(trajectory_id)
async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
await self.pool.release(slot, reset_workspace=reset_workspace)
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
*,
timeout_s: Optional[float] = None,
) -> List[ExecutionResult]:
return await self.pool.execute_batch(requests, timeout=timeout_s)
async def read_artifact(
self,
slot: Slot,
path: str,
*,
encoding: str = "text",
max_bytes: Optional[int] = None,
include_sha256: bool = False,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
return await self.pool.executor.read_artifact(
slot,
path,
encoding=encoding,
max_bytes=max_bytes,
include_sha256=include_sha256,
timeout=timeout_s,
)
async def list_artifacts(
self,
slot: Slot,
path: str = ".",
*,
recursive: bool = False,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
return await self.pool.executor.list_artifacts(
slot,
path,
recursive=recursive,
max_entries=max_entries,
timeout=timeout_s,
)
async def archive_artifacts(
self,
slot: Slot,
path: str = ".",
*,
archive_format: str = "tar.gz",
max_bytes: Optional[int] = None,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
return await self.pool.executor.archive_artifacts(
slot,
path,
archive_format=archive_format,
max_bytes=max_bytes,
max_entries=max_entries,
timeout=timeout_s,
)
def get_stats(self) -> Dict[str, Any]:
return self.pool.get_stats()
-10
View File
@@ -1,10 +0,0 @@
"""
Environment implementations for atropos-agent.
"""
from .agent_env import AgentEnv, AgentEnvConfig
# NOTE: Additional example envs exist as modules (e.g. `test_env`, `swe_smith_oracle_env`),
# but are intentionally not imported here to avoid pulling heavy optional deps at import time.
__all__ = ["AgentEnv", "AgentEnvConfig"]
-526
View File
@@ -1,526 +0,0 @@
"""
AgentEnv - Atropos BaseEnv extension for agent/tool-call workloads.
AgentEnv is responsible for starting the sandbox tool execution backend and
providing helpers for running agent trajectories with queued/batched tool calls.
"""
from __future__ import annotations
import os
import asyncio
import time
import uuid
from abc import ABC, abstractmethod
from typing import Any, Awaitable, Callable, Dict, Generic, List, Optional, Tuple, TypeVar
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, BaseEnv, BaseEnvConfig, Item, ScoredDataGroup, ScoredDataItem
from atroposlib.envs.server_handling.server_baseline import AsyncSemWithAdaptiveWeight
from ..agent import AgentConfig, AgentResult, AtroposAgent
from ..backends import ToolBackend, create_tool_backend
from ..tools import ToolRegistry, build_tool_registry
from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig
# Main BaseEnv child classes. Child class THESE to get agent+tooling functionality easily.
class AgentEnvConfig(BaseEnvConfig):
tool_pool_mode: str = Field(default="nomad", description="Tool execution backend ('nomad' or 'modal')")
allow_network: bool = Field(
default=True,
description="Whether sandbox bash commands may access the network (env policy).",
)
require_sandbox: bool = Field(
default=False,
description="Fail closed if bubblewrap sandboxing is unavailable/unusable for stateless sandbox tools.",
)
require_stateful_sandbox: bool = Field(
default=False,
description="Fail closed if bubblewrap/PID isolation is unavailable for stateful terminal tools (tmux).",
)
tool_batch_window_ms: int = Field(default=20, description="ToolExecutor batching window (ms)")
tool_max_batch_size: int = Field(default=200, description="ToolExecutor maximum batch size")
# nomad mode settings. TODO: Add Modal support, split this into own config
nomad_address: str = Field(default="http://localhost:4646", description="Nomad API address")
sandbox_job_id: str = Field(default="atropos-sandbox-agent-env", description="Nomad job id for sandbox containers")
sandbox_image: str = Field(default="atropos-sandbox:local", description="Docker image for sandbox containers")
slots_per_container: int = Field(default=10, description="Nomad mode: slots per container")
min_containers: int = Field(default=1, description="Nomad mode: minimum containers")
max_containers: int = Field(default=10, description="Nomad mode: maximum containers")
privileged: bool = Field(default=False, description="Nomad mode: run container privileged")
acquire_timeout_s: float = Field(default=30.0, description="Slot acquisition timeout (seconds)")
purge_job_on_start: bool = Field(
default=False,
description=(
"Nomad mode: stop/purge the sandbox job on startup. This is helpful in local dev and training runs "
"to recover from previous crashes that leave the job in a restart backoff state."
),
)
purge_job_on_shutdown: bool = Field(default=True, description="Nomad mode: stop/purge job on shutdown")
# Nomad driver selection (docker or singularity)
driver: str = Field(
default="docker",
description="Nomad task driver: 'docker' (default) or 'singularity' (for HPC without sudo Docker)",
)
singularity_image: Optional[str] = Field(
default=None,
description="Path to .sif file for Singularity driver (required if driver='singularity')",
)
# modal mode settings (stub; implementation pending)
modal_app_name: str = Field(default="atropos-sandbox", description="Modal app name (stub)")
modal_function_name: str = Field(default="sandbox_server", description="Modal function/actor name (stub)")
modal_volume_name: Optional[str] = Field(default=None, description="Modal Volume name for persistent storage (stub)")
modal_volume_mount_path: str = Field(default="/data", description="Modal Volume mount path (stub)")
# basic agent defaults
agent_max_steps: int = Field(default=50, description="Max ReACT steps per trajectory")
agent_temperature: float = Field(default=0.7, description="Sampling temperature")
agent_max_tokens: Optional[int] = Field(
default=None,
description="Max tokens per model response (default: let backend decide)",
)
agent_tool_delay_s: float = Field(default=0.0, description="Delay between tool calls (seconds)")
# tool selection
enabled_toolsets: List[str] = Field(
default_factory=lambda: ["default"],
description="Toolsets to enable (Hermes-style grouping).",
)
disabled_toolsets: List[str] = Field(
default_factory=list,
description="Toolsets to disable (applied after enabled_toolsets).",
)
# external ToolServer routing (Phase 4.5+)
tool_server_url: Optional[str] = Field(
default=None,
description="Base URL for external ToolServer (enables external tools).",
)
tool_server_token: Optional[str] = Field(
default=None,
description="Bearer token for ToolServer auth (optional in dev).",
)
AgentEnvConfigT = TypeVar("AgentEnvConfigT", bound="AgentEnvConfig")
class AgentEnv(BaseEnv, ABC, Generic[AgentEnvConfigT]):
env_config_cls = AgentEnvConfig
def __init__(
self,
config: AgentEnvConfigT,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self.config: AgentEnvConfigT = config
self.tools: ToolRegistry = self.build_tools()
self._backend: Optional[ToolBackend] = None
self._tool_executor: Optional[ToolExecutor] = None
self._tool_server_inprocess: bool = False
self._trajectory_workspace_meta: Dict[str, Dict[str, Any]] = {}
def build_tools(self) -> ToolRegistry:
"""Wraps original Hermes-Agent ToolRegistry for atropos AgentEnv use.
See Hermes-Agent docs for toolsets and available tools etc.
"""
return build_tool_registry(
enabled_toolsets=self.config.enabled_toolsets or ["default"],
disabled_toolsets=self.config.disabled_toolsets or None,
tool_server_url=self.config.tool_server_url,
)
@abstractmethod
def build_task(self, item: Item) -> str:
"""Return the user-facing task string for the agent."""
@abstractmethod
async def score_trajectory(self, item: Item, final_response: str) -> float:
"""Return a scalar score for this trajectory."""
async def setup_trajectory_workspace(
self,
item: Item,
*,
trajectory_id: str,
exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
) -> Dict[str, Any]:
"""
Optional hook: prepare the sandbox workspace before the agent starts.
Examples:
- clone a repo and checkout a commit
- write fixture files (e.g. images) for external-tool demos
- pre-install dependencies
Default: no-op.
"""
_ = (item, trajectory_id, exec_tool)
return {}
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str,
exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
agent_result: Optional[AgentResult] = None,
workspace_meta: Optional[Dict[str, Any]] = None,
) -> tuple[float, Dict[str, Any]]:
"""
Optional hook: run in-sandbox verification before scoring.
Many agent envs need to execute verification inside the same trajectory
workspace (e.g. pytest) before releasing/resetting the slot.
Default: calls `score_trajectory()` and returns empty metadata.
"""
_ = (trajectory_id, exec_tool, agent_result, workspace_meta) # default ignores in-workspace verification
score = await self.score_trajectory(item, final_response)
return score, {}
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
return AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
tool_delay_s=self.config.agent_tool_delay_s,
)
async def setup(self) -> None:
print(f"[AgentEnv] setup(): starting tool backend ({self.config.tool_pool_mode})", flush=True)
await self._start_tool_backend()
print("[AgentEnv] setup(): configuring server concurrency", flush=True)
self._configure_server_concurrency()
print("[AgentEnv] setup(): running env-specific setup_agent_env()", flush=True)
await self.setup_agent_env()
print("[AgentEnv] setup(): done", flush=True)
def _configure_server_concurrency(self) -> None:
"""
Ensure the LLM server concurrency isn't accidentally capped below `group_size`.
In `BaseEnv process` mode, groups are collected concurrently and if the underlying
ServerManager/OpenAIServer semaphore is left at 1, we serialize inference even
when `--env.group_size` is > 1.
"""
desired = int(getattr(self.config, "group_size", 1) or 1)
if desired <= 1:
return
servers = getattr(self.server, "servers", None)
if not isinstance(servers, list) or not servers:
return
for s in servers:
sem = getattr(s, "sem", None)
eval_sem = getattr(s, "eval_sem", None)
# Only increase; never shrink.
if sem is not None and getattr(sem, "max_val", 0) < desired:
s.sem = AsyncSemWithAdaptiveWeight(desired)
if hasattr(s, "config") and hasattr(s.config, "num_max_requests_at_once"):
s.config.num_max_requests_at_once = desired
if eval_sem is not None and getattr(eval_sem, "max_val", 0) < desired:
s.eval_sem = AsyncSemWithAdaptiveWeight(desired)
if hasattr(s, "config") and hasattr(s.config, "num_requests_for_eval"):
s.config.num_requests_for_eval = desired
@abstractmethod
async def setup_agent_env(self) -> None:
"""Subclass hook for env-specific setup."""
async def evaluate(self, *args, **kwargs): # noqa: ARG002
"""
Default eval hook (no-op).
Atropos BaseEnv requires an `evaluate()` implementation. Many agent envs
won't have a meaningful evaluation path during early PoC work; they can
override this when needed.
"""
return {}
async def env_manager(self):
try:
return await super().env_manager()
finally:
await self.shutdown_tool_backend()
async def process_manager(self):
try:
return await super().process_manager()
finally:
await self.shutdown_tool_backend()
async def _start_tool_backend(self) -> None:
if self._tool_executor is not None:
return
tool_server_url = self.config.tool_server_url
tool_server_client = None
if tool_server_url == "inprocess":
import httpx
from ..api.tool_server import app as tool_server_app
await tool_server_app.router.startup()
tool_server_client = httpx.AsyncClient(
transport=httpx.ASGITransport(app=tool_server_app),
base_url="http://toolserver",
)
tool_server_url = "http://toolserver"
self._tool_server_inprocess = True
backend = create_tool_backend(self.config)
await backend.start()
executor = ToolExecutor(
backend=backend,
tools=self.tools,
config=ToolExecutorConfig(
batch_window_ms=self.config.tool_batch_window_ms,
max_batch_size=self.config.tool_max_batch_size,
allow_network=self.config.allow_network,
require_sandbox=self.config.require_sandbox,
require_stateful_sandbox=self.config.require_stateful_sandbox,
tool_server_url=tool_server_url,
tool_server_token=self.config.tool_server_token,
),
)
await executor.start()
if tool_server_client is not None:
executor._tool_server_client = tool_server_client # type: ignore[attr-defined]
self._backend = backend
self._tool_executor = executor
async def shutdown_tool_backend(self) -> None:
executor = self._tool_executor
backend = self._backend
inprocess_tool_server = self._tool_server_inprocess
self._tool_executor = None
self._backend = None
self._tool_server_inprocess = False
if executor is not None:
await executor.close()
if backend is not None:
await backend.stop(purge=bool(self.config.purge_job_on_shutdown))
if inprocess_tool_server:
from ..api.tool_server import app as tool_server_app
await tool_server_app.router.shutdown()
async def collect_trajectory(
self, item: Item
) -> Tuple[Optional[ScoredDataItem], List[Item]]:
if self._tool_executor is None:
raise RuntimeError("Tool backend not started")
trajectory_id = str(uuid.uuid4())
t0 = time.perf_counter()
print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} start", flush=True)
task = self.build_task(item)
agent_config = self.build_agent_config(item)
if os.getenv("ATROPOS_DEBUG_PRINT_TASK") == "1":
print(f"Starting trajectory {trajectory_id} with task: {task}", flush=True)
else:
# Avoid printing the full task prompt by default (can be huge/noisy).
one_line = " ".join(str(task).splitlines()).strip()
preview = one_line[:240] + ("" if len(one_line) > 240 else "")
print(f"Starting trajectory {trajectory_id} (task preview): {preview}", flush=True)
async def _exec(call):
return await self._tool_executor.execute(trajectory_id, call)
agent = AtroposAgent(
server=self.server,
tokenizer=self.tokenizer,
tools=self.tools,
config=agent_config,
execute_tool=_exec,
)
try:
print(f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() start", flush=True)
workspace_meta = await self.setup_trajectory_workspace(item, trajectory_id=trajectory_id, exec_tool=_exec)
if not isinstance(workspace_meta, dict):
workspace_meta = {}
self._trajectory_workspace_meta[trajectory_id] = workspace_meta
print(
f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() done in {time.perf_counter() - t0:.2f}s",
flush=True,
)
print(f"[AgentEnv] tid={trajectory_id} agent.run() start", flush=True)
result = await agent.run(task)
print(
f"[AgentEnv] tid={trajectory_id} agent.run() done in {time.perf_counter() - t0:.2f}s "
f"success={result.success} tool_calls={result.total_tool_calls}",
flush=True,
)
if not result.success or result.trajectory_data is None:
# Do not trigger BaseEnv retries for agent failures.
# Record the trajectory with score 0.0 so training/eval can see the failure mode.
messages = [{"role": "system", "content": agent._build_system_prompt()}] # noqa: SLF001
messages.append({"role": "user", "content": task})
for step in result.steps:
messages.append({"role": "assistant", "content": step.assistant_message})
if step.tool_results:
tool_text = "\n".join(r.to_xml() for r in step.tool_results)
messages.append({"role": "user", "content": tool_text})
scored: ScoredDataItem = {
"tokens": (result.trajectory_data.tokens if result.trajectory_data else []),
"masks": (result.trajectory_data.masked_tokens if result.trajectory_data else []),
"scores": 0.0,
}
if result.trajectory_data is not None:
scored["inference_logprobs"] = result.trajectory_data.logprobs # type: ignore[typeddict-unknown-key]
if getattr(result.trajectory_data, "metadata", None):
scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
if self.config.include_messages:
# Record a final failure marker as a user-side tool_response-like block so it survives templates.
import json
err = result.error or "agent_failed"
messages.append(
{
"role": "user",
"content": f"<tool_response>{json.dumps({'success': False, 'error': err})}</tool_response>",
}
)
scored["messages"] = messages
return scored, []
print(f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() start", flush=True)
score, score_metadata = await self.verify_and_score_trajectory(
item,
result.final_response,
trajectory_id=trajectory_id,
exec_tool=_exec,
agent_result=result,
workspace_meta=workspace_meta,
)
print(
f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() done in {time.perf_counter() - t0:.2f}s "
f"score={score}",
flush=True,
)
messages = [{"role": "system", "content": agent._build_system_prompt()}] # noqa: SLF001
messages.append({"role": "user", "content": task})
for step in result.steps:
messages.append({"role": "assistant", "content": step.assistant_message})
if step.tool_results:
tool_text = "\n".join(r.to_xml() for r in step.tool_results)
messages.append({"role": "user", "content": tool_text})
# Optional: allow env verification to attach additional messages (e.g. install logs).
if self.config.include_messages and isinstance(score_metadata, dict):
extra = score_metadata.get("verification_messages")
if isinstance(extra, list):
for m in extra:
if isinstance(m, dict) and isinstance(m.get("role"), str) and isinstance(m.get("content"), str):
messages.append({"role": m["role"], "content": m["content"]})
scored: ScoredDataItem = {
"tokens": result.trajectory_data.tokens,
"masks": result.trajectory_data.masked_tokens,
"scores": score,
}
# Atroposlib expects policy logprobs at the *group* level under `inference_logprobs`.
# We stash per-item values here and lift them into the group in `collect_trajectories()`.
scored["inference_logprobs"] = result.trajectory_data.logprobs # type: ignore[typeddict-unknown-key]
if getattr(result.trajectory_data, "metadata", None):
scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
if self.config.include_messages:
scored["messages"] = messages
return scored, []
finally:
self._trajectory_workspace_meta.pop(trajectory_id, None)
print(f"[AgentEnv] tid={trajectory_id} release_trajectory(reset_workspace=True)", flush=True)
await self._tool_executor.release_trajectory(trajectory_id, reset_workspace=True)
print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} done in {time.perf_counter() - t0:.2f}s", flush=True)
async def collect_trajectories(
self, item: Item
) -> Tuple[Optional[ScoredDataGroup], List[Item]]:
tasks = [self.collect_trajectory(item) for _ in range(self.config.group_size)]
results = await asyncio.gather(*tasks)
backlog: List[Item] = []
items: List[ScoredDataItem] = []
for scored, b in results:
backlog.extend(b)
if scored is not None:
items.append(scored)
if len(items) != self.config.group_size:
return None, backlog
group: ScoredDataGroup = ScoredDataGroup(
tokens=[],
masks=[],
scores=[],
advantages=[],
ref_logprobs=[],
messages=[] if self.config.include_messages else None,
inference_logprobs=[],
group_overrides={},
overrides=[],
images=[],
generation_params=None,
)
for it in items:
group["tokens"].append(it["tokens"])
group["masks"].append(it["masks"])
group["scores"].append(it["scores"])
# policy logprobs (for PPO/GRPO training) if present
lp = it.get("inference_logprobs") # type: ignore[typeddict-item]
if lp is not None:
group["inference_logprobs"].append(lp)
group["overrides"].append(it.get("overrides") or {}) # type: ignore[typeddict-item]
if group.get("messages") is not None and it.get("messages") is not None:
group["messages"].append(it["messages"])
return group, backlog
async def run_agent(self, task: str, *, trajectory_id: Optional[str] = None) -> Tuple[str, Dict[str, Any]]:
"""
Run the AtroposAgent on a single task and return (final_response, debug).
This is a helper intended for simple environments and tests.
"""
if self._tool_executor is None:
raise RuntimeError("Tool backend not started")
tid = trajectory_id or str(uuid.uuid4())
async def _exec(call):
return await self._tool_executor.execute(tid, call)
agent = AtroposAgent(
server=self.server,
tokenizer=self.tokenizer,
tools=self.tools,
config=AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
),
execute_tool=_exec,
)
result = await agent.run(task)
await self._tool_executor.release_trajectory(tid, reset_workspace=True)
return result.final_response, {"success": result.success, "error": result.error, "tool_calls": result.total_tool_calls}
-171
View File
@@ -1,171 +0,0 @@
"""
Hermes-Agent + Atropos (Nomad sandbox) compatibility smoke environment.
This environment is intended to validate, end-to-end:
BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server
It forces the model to use a sandbox tool by asking it to run a command that
generates a high-entropy token inside the sandbox, then repeat it exactly.
Run (process mode):
uv run python -m atropos.envs.hermes_compat_test_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
"""
from __future__ import annotations
import os
from typing import Any, Dict, List, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig, AgentResult
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
def _forced_tool_item() -> Item:
# Use double quotes in the shell command and show JSON escaping explicitly.
# This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
return {
"command": cmd,
"prompt": (
"You are acting as an agent inside a sandboxed environment.\n"
"You MUST use the terminal tool to execute commands.\n"
"Run this exact command:\n"
f"{cmd}\n"
"When you call the tool, use valid JSON inside <tool_call>. Example:\n"
'<tool_call>{"name": "terminal", "arguments": {"command": '
'"python -c \\\\"import secrets; print(secrets.token_hex(16))\\\\""}}'
"</tool_call>\n"
"Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
"Do not guess. Do not explain."
),
}
class HermesCompatTestEnvConfig(AgentEnvConfig):
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible chat server (without /v1).",
)
server_model: str = Field(default="hermes-4-36b", description="Model name")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class HermesCompatTestEnv(AgentEnv[HermesCompatTestEnvConfig]):
name = "hermes_compat_test_env"
env_config_cls = HermesCompatTestEnvConfig
def __init__(
self,
config: HermesCompatTestEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iter = 0
@classmethod
def config_init(cls) -> Tuple[HermesCompatTestEnvConfig, List[APIServerConfig]]:
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = HermesCompatTestEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
include_messages=True,
ensure_scores_are_not_same=False,
total_steps=2,
batch_size=1,
server_base_url=base_url,
server_model=model,
# Tooling: sandbox-only terminal.
enabled_toolsets=["terminal"],
disabled_toolsets=[],
# Default to Nomad sandboxing; users can override via --env.* args.
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
# In local dev it's common for a previous crash to leave the job in backoff.
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=120,
)
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
return None
async def get_next_item(self) -> Item:
self._iter += 1
return _forced_tool_item()
def build_task(self, item: Item) -> str:
return str(item.get("prompt") or "")
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
# Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
return AgentConfig(
max_steps=min(8, int(self.config.agent_max_steps)),
temperature=0.2,
max_tokens=None,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
# Scoring happens in verify_and_score_trajectory so we can inspect tool results.
_ = (item, final_response)
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str, # noqa: ARG002
exec_tool, # noqa: ARG002
agent_result: AgentResult | None = None,
workspace_meta: Dict[str, Any] | None = None, # noqa: ARG002
) -> tuple[float, Dict[str, Any]]:
if agent_result is None:
return 0.0, {"error": "Missing agent_result"}
observed: str = ""
tool_ok = False
for step in agent_result.steps:
for res in step.tool_results:
if not res.success:
return 0.0, {"error": res.error, "output": res.output}
out = (res.output or "").strip()
if out:
observed = out.splitlines()[-1].strip()
tool_ok = True
final = (final_response or "").strip()
score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}
if __name__ == "__main__":
HermesCompatTestEnv.cli()
-172
View File
@@ -1,172 +0,0 @@
"""
Nomad sandbox terminal smoke environment (training-oriented).
Validates, end-to-end:
BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server
It forces the model to use a sandbox tool by asking it to run a command that
generates a high-entropy token inside the sandbox, then repeat it exactly.
Run (process mode):
uv run python -m atropos.envs.sandbox_terminal_smoke_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
"""
from __future__ import annotations
import os
from typing import Any, Dict, List, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig, AgentResult
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
STRICT_TOOLCALL_SYSTEM_PROMPT = None
def _forced_tool_item() -> Item:
# Use double quotes in the shell command and show JSON escaping explicitly.
# This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
return {
"command": cmd,
"prompt": (
"You MUST use the terminal tool.\n"
"Run this exact command:\n"
f"{cmd}\n"
"When you call the tool, use valid JSON inside <tool_call>. Example:\n"
'<tool_call>{"name": "terminal", "arguments": {"command": '
'"python -c \\\\"import secrets; print(secrets.token_hex(16))\\\\""}}'
"</tool_call>\n"
"Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
"Do not guess. Do not explain."
),
}
class SandboxTerminalSmokeEnvConfig(AgentEnvConfig):
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible chat server (without /v1).",
)
server_model: str = Field(default="hermes-4-36b", description="Model name")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class SandboxTerminalSmokeEnv(AgentEnv[SandboxTerminalSmokeEnvConfig]):
name = "sandbox_terminal_smoke_env"
env_config_cls = SandboxTerminalSmokeEnvConfig
def __init__(
self,
config: SandboxTerminalSmokeEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iter = 0
@classmethod
def config_init(cls) -> Tuple[SandboxTerminalSmokeEnvConfig, List[APIServerConfig]]:
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = SandboxTerminalSmokeEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
include_messages=True,
ensure_scores_are_not_same=False,
total_steps=2,
batch_size=1,
server_base_url=base_url,
server_model=model,
# Tooling: sandbox-only terminal.
enabled_toolsets=["terminal"],
disabled_toolsets=[],
# Default to Nomad sandboxing; users can override via --env.* args.
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=120,
)
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
return None
async def get_next_item(self) -> Item:
self._iter += 1
return _forced_tool_item()
def build_task(self, item: Item) -> str:
return str(item.get("prompt") or "")
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
# Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
return AgentConfig(
max_steps=min(8, int(self.config.agent_max_steps)),
temperature=0.2,
max_tokens=None,
system_prompt=STRICT_TOOLCALL_SYSTEM_PROMPT,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
# Scoring happens in verify_and_score_trajectory so we can inspect tool results.
_ = (item, final_response)
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str, # noqa: ARG002
exec_tool, # noqa: ARG002
agent_result: AgentResult | None = None,
workspace_meta: Dict[str, Any] | None = None, # noqa: ARG002
) -> tuple[float, Dict[str, Any]]:
if agent_result is None:
return 0.0, {"error": "Missing agent_result"}
observed: str = ""
tool_ok = False
for step in agent_result.steps:
for res in step.tool_results:
if not res.success:
return 0.0, {"error": res.error, "output": res.output}
out = (res.output or "").strip()
if out:
observed = out.splitlines()[-1].strip()
tool_ok = True
final = (final_response or "").strip()
score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}
if __name__ == "__main__":
SandboxTerminalSmokeEnv.cli()
-418
View File
@@ -1,418 +0,0 @@
"""
SWE-smith-oracle environment.
This environment is intentionally minimal:
- prepares a sandbox workspace by cloning a public GitHub repo at `base_commit`
- runs an AtroposAgent tool loop to apply a fix
- verifies by running pytest nodeids from the dataset (reward = pass/fail)
- Python only (no multi-language support currently, need to properly bauild & add to dropbox)
- TODO: Get the other nonpython sandboxes up and running, then add a config knob to switch between them per row
- oh and add to dockerhub
Dataset: NousResearch/SWE-smith-oracle (train; does NOT use SWE-bench eval set).
"""
from __future__ import annotations
import os
import random
import time
from typing import Any, Dict, List, Optional, Tuple
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig
class SweSmithOracleEnvConfig(AgentEnvConfig):
dataset_name: str = Field(default="NousResearch/SWE-smith-oracle")
dataset_split: str = Field(default="train")
max_items: int = Field(default=0, description="0 = no limit")
shuffle: bool = Field(default=True)
seed: int = Field(default=0)
python_only: bool = Field(default=True, description="Filter to Python-evaluable rows")
score_include_fail_to_pass: bool = Field(
default=True,
description=(
"If true (default), score tests on PASS_TO_PASS FAIL_TO_PASS. "
"Disable to only run PASS_TO_PASS (faster but weaker signal)."
),
)
prompt_mode: str = Field(
default="problem_statement",
description="Task prompt content: 'problem_statement' (fast) or 'problem_statement+text' (slower, includes dataset 'text').",
)
repo_base_url: str = Field(default="https://github.com", description="Base URL for repo cloning")
install_timeout_s: float = Field(default=600.0)
test_timeout_s: float = Field(default=600.0)
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class SweSmithOracleEnv(AgentEnv[SweSmithOracleEnvConfig]):
"""
SWE-smith-oracle AgentEnv.
This is designed for benchmarking multiplexed slot execution vs naive container-per-trajectory.
"""
name = "swe_smith_oracle_env"
env_config_cls = SweSmithOracleEnvConfig
def __init__(
self,
config: SweSmithOracleEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._dataset = None
self._indices: List[int] = []
self._cursor = 0
@classmethod
def config_init(cls) -> Tuple[SweSmithOracleEnvConfig, List[APIServerConfig]]:
# Defaults for running the env via CLI in offline `process` mode.
# Override via env vars or `--env.*` flags as needed.
base_url_raw = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
base_url = base_url_raw.rstrip("/")
if not base_url.endswith("/v1"):
base_url = f"{base_url}/v1"
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = SweSmithOracleEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
rollout_server_url="http://localhost:8000",
total_steps=1,
batch_size=1,
steps_per_eval=1,
max_token_length=8192,
inference_weight=1.0,
wandb_name="swe_smith_oracle",
enabled_toolsets=["terminal"],
disabled_toolsets=[],
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=base_url,
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=int(os.getenv("ATROPOS_SERVER_TIMEOUT_S") or "300"),
),
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
from datasets import load_dataset
t0 = time.perf_counter()
print(
f"[SweSmithOracleEnv] loading dataset {self.config.dataset_name}:{self.config.dataset_split} "
f"(python_only={self.config.python_only}, max_items={self.config.max_items or 'all'})",
flush=True,
)
ds = load_dataset(self.config.dataset_name, split=self.config.dataset_split)
self._dataset = ds
indices: List[int] = []
for idx in range(len(ds)):
row = ds[idx]
if self.config.python_only and not self._is_python_row(row):
continue
indices.append(idx)
if self.config.shuffle:
rnd = random.Random(self.config.seed)
rnd.shuffle(indices)
if self.config.max_items and self.config.max_items > 0:
indices = indices[: self.config.max_items]
self._indices = indices
self._cursor = 0
print(
f"[SweSmithOracleEnv] loaded {len(self._indices)} items from {self.config.dataset_name}:{self.config.dataset_split} "
f"in {time.perf_counter() - t0:.2f}s",
flush=True,
)
def _is_python_row(self, row: Dict[str, Any]) -> bool:
nodeids = row.get("PASS_TO_PASS")
if not isinstance(nodeids, list) or not nodeids:
return False
for nid in nodeids:
if not isinstance(nid, str) or ".py::" not in nid:
return False
return True
async def get_next_item(self) -> Item:
print(f"[SweSmithOracleEnv] get_next_item() cursor={self._cursor}/{len(self._indices)}", flush=True)
if not self._dataset or not self._indices:
raise RuntimeError("Dataset not initialized (did setup() run?)")
if self._cursor >= len(self._indices):
self._cursor = 0
idx = self._indices[self._cursor]
self._cursor += 1
return dict(self._dataset[idx])
def _repo_name(self, item: Item) -> str:
repo = item.get("repo") or ""
if isinstance(repo, str) and "/" in repo:
return repo.split("/")[-1]
return "repo"
def build_task(self, item: Item) -> str:
repo = item.get("repo") or ""
base_commit = item.get("base_commit") or ""
problem = str(item.get("problem_statement") or "")
context = str(item.get("text") or "")
nodeids = self._tests_for_item(item)
tests_list = "\n".join(f"- {t}" for t in nodeids)
repo_dir = self._repo_name(item)
tests_block = (
"Run these tests to verify:\n"
f"{tests_list}\n\n"
"When done, briefly describe what you changed and confirm tests pass."
)
prompt_mode = (self.config.prompt_mode or "problem_statement").strip().lower()
if prompt_mode not in {"problem_statement", "problem_statement+text"}:
raise ValueError(
f"Invalid prompt_mode={self.config.prompt_mode!r}. "
"Expected 'problem_statement' or 'problem_statement+text'."
)
context_block = ""
if prompt_mode == "problem_statement+text" and context:
# Note: We intentionally do NOT truncate/cap here. This mode is for debugging / richer prompts and can be slow.
context_block = f"\nAdditional context:\n{context}\n"
return (
"You are a senior software engineer. Fix the repository so the specified tests pass.\n\n"
f"Repository: {repo} (checked out at base_commit={base_commit})\n"
f"Workspace path: ./{repo_dir}\n\n"
"Constraints:\n"
"- You MUST use the terminal tool to inspect, edit, and verify the repository. Do not respond with a patch file.\n"
f"- Start by inspecting the repo (e.g. `ls`, `cd ./{repo_dir}`, `git status`).\n"
"- Use a workspace-local virtualenv (e.g. inside the repo at ./.venv) to avoid cross-run contamination.\n"
"- Use non-interactive commands only.\n\n"
"- Terminal commands run under POSIX /bin/sh and each tool call runs in a fresh shell (no persisted env vars).\n"
" Avoid bash-only `source`; prefer `. .venv/bin/activate` or `.venv/bin/python ...`.\n\n"
"Problem statement:\n"
f"{problem}\n\n"
f"{context_block}\n"
f"{tests_block}"
)
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
# SWE tasks are longer than the simple test env.
return AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
tool_delay_s=self.config.agent_tool_delay_s,
)
async def setup_trajectory_workspace(self, item: Item, *, trajectory_id: str, exec_tool) -> Dict[str, Any]:
t0 = time.perf_counter()
repo = item.get("repo")
base_commit = item.get("base_commit")
instance_id = item.get("instance_id") or item.get("id") or item.get("problem_id")
if not isinstance(repo, str) or not isinstance(base_commit, str):
raise RuntimeError("Invalid dataset row: missing repo/base_commit")
repo_dir = self._repo_name(item)
clone_url = f"{self.config.repo_base_url.rstrip('/')}/{repo}.git"
print(
f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): "
f"repo={repo} base_commit={base_commit} instance_id={instance_id} dir=./{repo_dir}",
flush=True,
)
# Repo setup strategy:
# - Maintain a shared, per-container bare repo cache under /data/repo_cache
# - For each trajectory, create an isolated git worktree under the slot workspace
# This avoids cloning/fetching full repos per trajectory and is crucial for multiplexing.
def _repo_cache_slug(repo_name: str) -> str:
return repo_name.replace("/", "__")
repo_slug = _repo_cache_slug(repo)
cache_root = "/data/repo_cache"
bare_repo = f"{cache_root}/{repo_slug}.git"
lock_file = f"{cache_root}/.locks/{repo_slug}.lock"
# Use flock to serialize operations that mutate the shared bare repo (fetch/worktree).
# util-linux (flock) is included in the sandbox image.
worktree_cmd = (
"set -e; "
f"rm -rf {repo_dir}; "
f"mkdir -p {cache_root}/.locks; "
f": > {lock_file}; "
f"flock -x {lock_file} sh -lc '"
f"set -e; "
"export GIT_TERMINAL_PROMPT=0; "
"export GIT_LFS_SKIP_SMUDGE=1; "
f"if [ ! -d \"{bare_repo}\" ]; then "
f" git init --bare \"{bare_repo}\"; "
f" git -C \"{bare_repo}\" remote add origin \"{clone_url}\"; "
"fi; "
f"git -C \"{bare_repo}\" remote set-url origin \"{clone_url}\"; "
f"git -C \"{bare_repo}\" worktree prune || true; "
f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
f" git -C \"{bare_repo}\" fetch --depth 1 origin \"{base_commit}\" || true; "
"fi; "
f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
f" git -C \"{bare_repo}\" fetch --prune origin; "
"fi; "
f"git --git-dir=\"{bare_repo}\" worktree add --detach \"{repo_dir}\" \"{base_commit}\"; "
"'"
)
print(f"[SweSmithOracleEnv] tid={trajectory_id} preparing worktree from repo cache", flush=True)
res = await exec_tool(
ToolCall(
name="terminal",
arguments={"command": worktree_cmd, "timeout": self.config.install_timeout_s},
)
)
if not res.success:
raise RuntimeError(
"git worktree setup failed "
f"(repo={repo}, base_commit={base_commit}, instance_id={instance_id}): {res.error}\n{res.output}"
)
print(
f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): worktree ready in {time.perf_counter() - t0:.2f}s",
flush=True,
)
return {"repo_dir": repo_dir, "base_commit": base_commit}
def _tests_for_item(self, item: Item) -> List[str]:
tests: List[str] = []
if self.config.score_include_fail_to_pass:
for key in ("PASS_TO_PASS", "FAIL_TO_PASS"):
nodeids = item.get(key)
if isinstance(nodeids, list):
tests.extend([n for n in nodeids if isinstance(n, str)])
else:
nodeids = item.get("PASS_TO_PASS")
if isinstance(nodeids, list):
tests.extend([n for n in nodeids if isinstance(n, str)])
# Stable order for reproducibility.
return sorted(dict.fromkeys(tests))
def _chunk_nodeids(self, nodeids: List[str], max_per_chunk: int = 50) -> List[List[str]]:
chunks: List[List[str]] = []
for i in range(0, len(nodeids), max_per_chunk):
chunks.append(nodeids[i : i + max_per_chunk])
return chunks
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str, # noqa: ARG002
*,
trajectory_id: str,
exec_tool,
agent_result=None,
workspace_meta: Optional[Dict[str, Any]] = None,
) -> tuple[float, Dict[str, Any]]:
_ = trajectory_id
repo_dir = self._repo_name(item)
# Training correctness: do not reward trajectories that never actually used tools.
if agent_result is not None and getattr(agent_result, "total_tool_calls", 0) <= 0:
print(
f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): no tool calls; score=0.0",
flush=True,
)
return 0.0, {
"verification_mode": "dataset_tests",
"error": "No tool calls were made by the agent",
}
nodeids = self._tests_for_item(item)
if not nodeids:
return 0.0, {"error": "No tests provided"}
print(f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): ensuring venv + deps", flush=True)
setup_cmd = (
f"cd {repo_dir} && "
"python -m venv .venv && "
". .venv/bin/activate && "
"python -m pip install -U pip setuptools wheel && "
"python -m pip install -e . && "
"python -m pip install pytest"
)
setup_res = await exec_tool(
ToolCall(name="terminal", arguments={"command": setup_cmd, "timeout": self.config.install_timeout_s})
)
verification_messages = [{"role": "user", "content": setup_res.to_xml()}]
if not setup_res.success:
return 0.0, {
"verification_mode": "dataset_tests",
"phase": "install",
"error": setup_res.error,
"output": setup_res.output,
"verification_messages": verification_messages,
}
chunks = self._chunk_nodeids(nodeids, max_per_chunk=50)
for chunk_idx, chunk in enumerate(chunks):
joined = " ".join(chunk)
cmd = f"cd {repo_dir} && . .venv/bin/activate && python -m pytest -q {joined}"
res = await exec_tool(
ToolCall(
name="terminal",
arguments={"command": cmd, "timeout": self.config.test_timeout_s},
)
)
verification_messages.append({"role": "user", "content": res.to_xml()})
if not res.success:
return 0.0, {
"verification_mode": "dataset_tests",
"phase": "pytest",
"failed_chunk": chunk_idx,
"error": res.error,
"output": res.output,
"verification_messages": verification_messages,
}
return 1.0, {"verification_mode": "dataset_tests", "passed": True, "verification_messages": verification_messages}
async def score_trajectory(self, item: Item, final_response: str) -> float:
# Not used; scoring happens in verify_and_score_trajectory.
_ = (item, final_response)
return 0.0
if __name__ == "__main__":
SweSmithOracleEnv.cli()
-217
View File
@@ -1,217 +0,0 @@
"""
Simple test environment for validating the atropos-agent setup.
This environment uses a local OpenAI-compatible server for LLM testing to verify:
- BaseEnv extension works correctly
- API communication via OpenAI-compatible endpoint
- Basic trajectory collection
This is a minimal environment for testing, not production use.
"""
import os
from typing import Dict, List, Optional, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import (
APIServerConfig,
Item,
)
from ..agent import AgentConfig
from .agent_env import AgentEnv, AgentEnvConfig
# Load environment variables from .env file
load_dotenv()
# Simple test prompts for validation
TEST_PROMPTS = [
{
"prompt": "What is 2 + 2? Answer with just the number.",
"expected": "4",
},
{
"prompt": "What is the capital of France? Answer with just the city name.",
"expected": "Paris",
},
{
"prompt": "What color is the sky on a clear day? Answer with just the color.",
"expected": "Blue",
},
{
"prompt": "How many days are in a week? Answer with just the number.",
"expected": "7",
},
{
"prompt": "What is 10 * 5? Answer with just the number.",
"expected": "50",
},
]
SYSTEM_PROMPT = (
"You are a helpful assistant. Answer questions concisely and directly. "
"When asked for a simple answer, provide just that answer without explanation."
)
class SimpleTestEnvConfig(AgentEnvConfig):
"""Configuration for the simple test environment."""
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible server (without /v1)",
)
server_model: str = Field(
default="hermes-4-36b",
description="Model name",
)
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class SimpleTestEnv(AgentEnv[SimpleTestEnvConfig]):
"""
A simple test environment to validate the atropos-agent setup.
Uses a local OpenAI-compatible LLM endpoint with basic question-answering tasks.
Scoring is based on whether the response contains the expected answer.
"""
name = "simple_test_env"
env_config_cls = SimpleTestEnvConfig
def __init__(
self,
config: SimpleTestEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self.iter = 0
self.test_prompts = TEST_PROMPTS
self.percent_correct_buffer: List[float] = []
@classmethod
def config_init(cls) -> Tuple[SimpleTestEnvConfig, List[APIServerConfig]]:
"""
Initialize configuration with local server settings from environment variables.
"""
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = SimpleTestEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=4,
use_wandb=False, # Disable wandb for simple testing
rollout_server_url="http://localhost:8000",
total_steps=10,
batch_size=16,
steps_per_eval=5,
max_token_length=2048,
inference_weight=1.0,
wandb_name="simple_test",
server_base_url=base_url,
server_model=model,
)
# OpenAI-compatible servers typically expose chat completions at /v1.
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url}/v1",
api_key=api_key,
num_max_requests_at_once=4,
num_requests_for_eval=8,
timeout=120, # Local models may be slower
),
]
return env_config, server_configs
async def setup_agent_env(self):
"""Setup the environment - load test data."""
print(f"SimpleTestEnv setup complete. {len(self.test_prompts)} test prompts loaded.")
print(f"Using server at: {self.config.server_base_url}")
print(f"Model: {self.config.server_model}")
async def get_next_item(self) -> Item:
"""Get the next test prompt."""
item = self.test_prompts[self.iter % len(self.test_prompts)]
self.iter += 1
return item
def build_task(self, item: Item) -> str:
return item["prompt"]
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
return AgentConfig(
max_steps=5,
temperature=0.7,
max_tokens=256,
system_prompt=SYSTEM_PROMPT,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
expected = item["expected"].lower()
response_lower = (final_response or "").lower()
score = 1.0 if expected in response_lower else 0.0
self.percent_correct_buffer.append(score)
return score
async def evaluate(self, *args, **kwargs):
"""
Simple evaluation - run through all test prompts once.
"""
correct = 0
total = len(self.test_prompts)
for item in self.test_prompts:
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": item["prompt"]},
]
response = await self.server.chat_completion(
messages=messages,
n=1,
max_tokens=256,
temperature=0.0, # Greedy for eval
split="eval",
)
response_text = response.choices[0].message.content or ""
expected = item["expected"].lower()
if expected in response_text.lower():
correct += 1
accuracy = correct / total
print(f"Evaluation: {correct}/{total} = {accuracy:.2%} accuracy")
return {"eval_accuracy": accuracy}
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log metrics (simplified for testing)."""
if wandb_metrics is None:
wandb_metrics = {}
if self.percent_correct_buffer:
avg_correct = sum(self.percent_correct_buffer) / len(self.percent_correct_buffer)
wandb_metrics["train/percent_correct"] = avg_correct
print(f"Train accuracy: {avg_correct:.2%}")
self.percent_correct_buffer = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
# Allow running as CLI
SimpleTestEnv.cli()
-165
View File
@@ -1,165 +0,0 @@
"""
ToolServer routing smoke environment.
Validates that:
- sandbox tools run through Nomad SlotPool (terminal -> bash in sandbox)
- external tools run through ToolServer (skills_list)
This env uses ToolServer in-process by default (`tool_server_url="inprocess"`),
so it is self-contained for local testing.
Run:
uv run python -m atropos.envs.toolserver_smoke_env process --env.use_wandb false --env.total_steps 1 --env.group_size 1
"""
from __future__ import annotations
import os
from typing import Any, Dict, List, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig, AgentResult
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
class ToolServerSmokeEnvConfig(AgentEnvConfig):
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible chat server (without /v1).",
)
server_model: str = Field(default="hermes-4-36b", description="Model name")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class ToolServerSmokeEnv(AgentEnv[ToolServerSmokeEnvConfig]):
name = "toolserver_smoke_env"
env_config_cls = ToolServerSmokeEnvConfig
def __init__(
self,
config: ToolServerSmokeEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iter = 0
@classmethod
def config_init(cls) -> Tuple[ToolServerSmokeEnvConfig, List[APIServerConfig]]:
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = ToolServerSmokeEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
include_messages=True,
ensure_scores_are_not_same=False,
total_steps=1,
batch_size=1,
server_base_url=base_url,
server_model=model,
enabled_toolsets=["terminal", "skills"],
disabled_toolsets=[],
# Self-contained ToolServer for local smoke.
tool_server_url="inprocess",
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=120,
)
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
return None
async def get_next_item(self) -> Item:
self._iter += 1
return {
"prompt": (
"You MUST call exactly one tool per assistant message.\n"
"\n"
"Step 1) Call the skills_list tool (no arguments), then stop.\n"
"Step 2) After you receive the tool response, call the terminal tool to run:\n"
"python -c \"print('ok')\"\n"
"Step 3) After you receive the terminal tool response, answer with just: ok\n"
"\n"
"Tool call format requirements:\n"
"- Every tool call MUST be a complete XML block with a closing tag.\n"
"- Do NOT emit a second <tool_call> in the same assistant message.\n"
"\n"
"Example:\n"
"<tool_call>{\"name\": \"skills_list\", \"arguments\": {}}</tool_call>\n"
"Do not include anything else in your final answer."
)
}
def build_task(self, item: Item) -> str:
return str(item.get("prompt") or "")
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
return AgentConfig(
max_steps=min(10, int(self.config.agent_max_steps)),
temperature=0.2,
max_tokens=None,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
_ = (item, final_response)
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str, # noqa: ARG002
exec_tool, # noqa: ARG002
agent_result: AgentResult | None = None,
workspace_meta: Dict[str, Any] | None = None, # noqa: ARG002
) -> tuple[float, Dict[str, Any]]:
if agent_result is None:
return 0.0, {"error": "Missing agent_result"}
called = {c.name for s in agent_result.steps for c in s.tool_calls}
need = {"skills_list", "terminal"}
if not need.issubset(called):
return 0.0, {"error": f"Missing tool calls: {sorted(need - called)}", "called": sorted(called)}
terminal_ok = False
for step in agent_result.steps:
for call, res in zip(step.tool_calls, step.tool_results):
if call.name != "terminal":
continue
if res.success and (res.output or "").strip().splitlines()[-1].strip() == "ok":
terminal_ok = True
score = 1.0 if terminal_ok and (final_response or "").strip() == "ok" else 0.0
return score, {"called": sorted(called), "final": (final_response or "").strip()}
if __name__ == "__main__":
ToolServerSmokeEnv.cli()
-11
View File
@@ -1,11 +0,0 @@
"""
Nomad integration for atropos-agent.
Provides:
- NomadClient: Client for Nomad HTTP API
- Job templates for sandbox containers
"""
from .client import NomadClient
__all__ = ["NomadClient"]
-500
View File
@@ -1,500 +0,0 @@
"""
Nomad API Client for atropos-agent.
Provides a simple async client for interacting with the Nomad HTTP API:
- Submit/stop jobs
- Query allocations
- Get allocation addresses
- Scale jobs up/down
"""
import asyncio
import json
import os
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional
import aiohttp
class AllocationStatus(Enum):
"""Nomad allocation status."""
PENDING = "pending"
RUNNING = "running"
COMPLETE = "complete"
FAILED = "failed"
LOST = "lost"
@dataclass
class Allocation:
"""Information about a Nomad allocation."""
id: str
job_id: str
task_group: str
node_id: str
status: AllocationStatus
# Network info for reaching the allocation
address: Optional[str] = None
port: Optional[int] = None
@property
def http_address(self) -> Optional[str]:
"""Get full HTTP address for the allocation."""
if self.address and self.port:
return f"http://{self.address}:{self.port}"
return None
@dataclass
class JobStatus:
"""Status of a Nomad job."""
id: str
name: str
status: str
allocations: List[Allocation] = field(default_factory=list)
count: int = 0 # Number of task groups
class NomadClient:
"""
Async client for Nomad HTTP API.
Usage:
client = NomadClient(address="http://localhost:4646")
# Submit a job
await client.submit_job(job_spec)
# Get allocations
allocs = await client.get_job_allocations("sandbox-python")
# Scale job
await client.scale_job("sandbox-python", count=5)
"""
def __init__(
self,
address: str = "http://localhost:4646",
token: Optional[str] = None,
timeout: float = 30.0,
):
self.address = address.rstrip("/")
self.token = token or os.environ.get("NOMAD_TOKEN")
self.timeout = aiohttp.ClientTimeout(total=timeout)
self._session: Optional[aiohttp.ClientSession] = None
async def _get_session(self) -> aiohttp.ClientSession:
"""Get or create HTTP session."""
if self._session is None or self._session.closed:
headers = {}
if self.token:
headers["X-Nomad-Token"] = self.token
self._session = aiohttp.ClientSession(
timeout=self.timeout,
headers=headers,
)
return self._session
async def close(self):
"""Close the HTTP session."""
if self._session and not self._session.closed:
await self._session.close()
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
await self.close()
async def _request(
self,
method: str,
path: str,
data: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
"""Make an HTTP request to Nomad API."""
session = await self._get_session()
url = f"{self.address}{path}"
try:
async with session.request(method, url, json=data) as response:
if response.status == 404:
return {"error": "not_found", "status": 404}
text = await response.text()
if not text:
return {"status": response.status}
try:
result = json.loads(text)
except json.JSONDecodeError:
return {"text": text, "status": response.status}
if response.status >= 400:
return {"error": result, "status": response.status}
return result if isinstance(result, dict) else {"data": result, "status": response.status}
except aiohttp.ClientError as e:
return {"error": str(e), "status": 0}
# Job Operations
async def submit_job(self, job_spec: Dict[str, Any]) -> Dict[str, Any]:
"""
Submit a job to Nomad.
Args:
job_spec: Job specification dict (HCL converted to JSON)
Returns:
Response with EvalID if successful
"""
return await self._request("POST", "/v1/jobs", {"Job": job_spec})
async def stop_job(self, job_id: str, purge: bool = False) -> Dict[str, Any]:
"""
Stop (and optionally purge) a job.
Args:
job_id: Job identifier
purge: If True, completely remove the job
"""
path = f"/v1/job/{job_id}"
if purge:
path += "?purge=true"
return await self._request("DELETE", path)
async def get_job(self, job_id: str) -> Optional[Dict[str, Any]]:
"""Get job details."""
result = await self._request("GET", f"/v1/job/{job_id}")
if "error" in result and result.get("status") == 404:
return None
return result
async def get_job_status(self, job_id: str) -> Optional[JobStatus]:
"""Get job status with allocations."""
job = await self.get_job(job_id)
if not job:
return None
allocs = await self.get_job_allocations(job_id)
# Get count from task groups
count = 0
task_groups = job.get("TaskGroups", [])
for tg in task_groups:
count += tg.get("Count", 1)
return JobStatus(
id=job_id,
name=job.get("Name", job_id),
status=job.get("Status", "unknown"),
allocations=allocs,
count=count,
)
# Allocation Operations
async def get_job_allocations(self, job_id: str) -> List[Allocation]:
"""Get all allocations for a job."""
result = await self._request("GET", f"/v1/job/{job_id}/allocations")
if "error" in result:
return []
allocs_data = result.get("data", result) if isinstance(result, dict) else result
if not isinstance(allocs_data, list):
return []
allocations = []
for alloc_data in allocs_data:
# Parse allocation info
alloc_id = alloc_data.get("ID", "")
status_str = alloc_data.get("ClientStatus", "unknown")
try:
status = AllocationStatus(status_str)
except ValueError:
status = AllocationStatus.PENDING
# Get network info - need to fetch detailed allocation for this
address = None
port = None
# First try the summary data
resources = alloc_data.get("AllocatedResources") or {}
shared = resources.get("Shared") or {}
networks = shared.get("Networks") or []
# If no networks in summary, fetch detailed allocation
if not networks and alloc_id:
detailed = await self.get_allocation(alloc_id)
if detailed:
resources = detailed.get("AllocatedResources") or {}
shared = resources.get("Shared") or {}
networks = shared.get("Networks") or []
if networks:
network = networks[0]
address = network.get("IP")
# Look for dynamic ports OR reserved ports (Singularity/raw_exec uses reserved)
dyn_ports = network.get("DynamicPorts") or []
reserved_ports = network.get("ReservedPorts") or []
for dp in dyn_ports + reserved_ports:
if dp.get("Label") == "http":
port = dp.get("Value")
break
allocations.append(Allocation(
id=alloc_id,
job_id=job_id,
task_group=alloc_data.get("TaskGroup", ""),
node_id=alloc_data.get("NodeID", ""),
status=status,
address=address,
port=port,
))
return allocations
async def get_allocation(self, alloc_id: str) -> Optional[Dict[str, Any]]:
"""Get detailed allocation info."""
result = await self._request("GET", f"/v1/allocation/{alloc_id}")
if "error" in result and result.get("status") == 404:
return None
return result
# Scaling Operations
async def scale_job(self, job_id: str, count: int, task_group: str = "sandbox") -> Dict[str, Any]:
"""
Scale a job's task group to specified count.
Args:
job_id: Job identifier
count: Desired number of allocations
task_group: Name of task group to scale
"""
payload = {
"Count": count,
"Target": {
"Group": task_group,
},
}
return await self._request("POST", f"/v1/job/{job_id}/scale", payload)
async def get_job_scale_status(self, job_id: str) -> Dict[str, int]:
"""
Get current scale status for a job.
Returns:
Dict mapping task group name to count
"""
result = await self._request("GET", f"/v1/job/{job_id}/scale")
if "error" in result:
return {}
task_groups = result.get("TaskGroups", {})
return {
name: info.get("Running", 0)
for name, info in task_groups.items()
}
# Health Check
async def is_healthy(self) -> bool:
"""Check if Nomad is reachable and healthy."""
try:
result = await self._request("GET", "/v1/status/leader")
return "error" not in result
except Exception:
return False
async def get_leader(self) -> Optional[str]:
"""Get current Nomad leader address."""
result = await self._request("GET", "/v1/status/leader")
if isinstance(result, dict) and "data" in result:
return result["data"]
return None
def load_job_template(
template_name: str = "sandbox",
**kwargs,
) -> Dict[str, Any]:
"""
Load and configure a job template.
Args:
template_name: Name of template (e.g., "sandbox")
**kwargs: Template variables to substitute
Returns:
Job specification dict ready for Nomad API
"""
# Default job template for sandbox container
if template_name == "sandbox":
return create_sandbox_job(**kwargs)
else:
raise ValueError(f"Unknown template: {template_name}")
def create_sandbox_job(
job_id: str = "atropos-sandbox",
image: str = "atropos-sandbox:local", # Use :local tag to avoid registry pull
count: int = 1,
slots_per_container: int = 10,
privileged: bool = False,
cpu: int = 500,
memory: int = 512,
port: int = 8080,
datacenter: str = "dc1",
driver: str = "docker", # "docker" or "singularity"
singularity_image: str = None, # Path to .sif file for singularity driver
) -> Dict[str, Any]:
"""
Create a sandbox job specification.
This job runs the sandbox_server.py inside a container,
with the specified number of slots for agent workspaces.
Args:
job_id: Unique job identifier
image: Docker image to use (for docker driver)
count: Number of container instances
slots_per_container: Number of slots per container
privileged: Run container in privileged mode (recommended for bubblewrap)
cpu: CPU allocation in MHz
memory: Memory allocation in MB
port: HTTP port for sandbox server
datacenter: Nomad datacenter
driver: Container driver - "docker" or "singularity"
singularity_image: Path to .sif file (required if driver="singularity")
Returns:
Job specification dict
"""
# Build task config based on driver
if driver == "singularity":
if not singularity_image:
raise ValueError("singularity_image path required when driver='singularity'")
# Use raw_exec driver to run apptainer via shell for variable expansion
# The container binds the allocation directory for workspace persistence
# For raw_exec, we use static port since Nomad's dynamic port mapping doesn't
# work the same as Docker - the process runs directly on the host.
shell_cmd = (
f'apptainer run '
f'--bind "$NOMAD_ALLOC_DIR/data:/data" '
f'--pwd /app '
f'--env PYTHONUNBUFFERED=1 '
f'{singularity_image} '
f'python sandbox_server.py '
f'--port {port} '
f'--slots {slots_per_container} '
f'--data-dir /data'
)
task_config = {
"command": "/bin/sh",
"args": ["-c", shell_cmd],
}
task_driver = "raw_exec"
else:
# Docker driver (default)
task_config = {
"image": image,
"force_pull": False, # Use local image, don't try to pull
"ports": ["http"],
"privileged": privileged,
"command": "python",
"args": [
"sandbox_server.py",
"--port", str(port),
"--slots", str(slots_per_container),
"--data-dir", "/data",
],
# Note: On Linux, you can mount persistent storage:
# "volumes": ["${NOMAD_ALLOC_DIR}/data:/data"],
# On macOS/Docker Desktop, skip volumes for PoC
# (container /data is ephemeral but works for testing)
}
task_driver = "docker"
# For Singularity/raw_exec, use static ports since the process runs directly on host.
# For Docker, use dynamic ports with port mapping.
if driver == "singularity":
network_config = {
"Mode": "host",
"ReservedPorts": [
{
"Label": "http",
"Value": port,
}
],
}
else:
network_config = {
"Mode": "host",
"DynamicPorts": [
{
"Label": "http",
"To": port,
}
],
}
return {
"ID": job_id,
"Name": job_id,
"Type": "service",
"Datacenters": [datacenter],
"TaskGroups": [
{
"Name": "sandbox",
"Count": count,
# Speed up deployments and avoid Consul checks. Without this, Nomad may
# keep an "active deployment" around for the default MinHealthyTime,
# which blocks immediate scaling under load.
"Update": {
"HealthCheck": "task_states",
"MinHealthyTime": 0,
},
"Networks": [network_config],
"Tasks": [
{
"Name": "sandbox-server",
"Driver": task_driver,
"Config": task_config,
"Env": {
"PYTHONUNBUFFERED": "1",
"NOMAD_ALLOC_DIR": "${NOMAD_ALLOC_DIR}",
},
"Resources": {
"CPU": cpu,
"MemoryMB": memory,
},
# Note: Services with Checks require Consul, which we skip for the PoC
}
],
"RestartPolicy": {
"Attempts": 3,
"Interval": 300_000_000_000, # 5 minutes
"Delay": 10_000_000_000, # 10 seconds
"Mode": "delay",
},
"ReschedulePolicy": {
"Attempts": 5,
"Interval": 3600_000_000_000, # 1 hour
"Delay": 30_000_000_000, # 30 seconds
"DelayFunction": "exponential",
"MaxDelay": 300_000_000_000, # 5 minutes
"Unlimited": False,
},
}
],
}
File diff suppressed because it is too large Load Diff
-20
View File
@@ -1,20 +0,0 @@
"""
Slot-based multiplexing for atropos-agent.
Provides:
- Slot: Isolated workspace for a single trajectory
- SlotPool: Manages slots across Nomad allocations
- SandboxExecutor: Executes tools in sandbox containers
"""
from .executor import SandboxExecutor
from .pool import SlotPool, SlotPoolConfig
from .slot import Slot, SlotState
__all__ = [
"Slot",
"SlotState",
"SlotPool",
"SlotPoolConfig",
"SandboxExecutor",
]
-457
View File
@@ -1,457 +0,0 @@
"""
SandboxExecutor - HTTP client for sandbox container communication.
Sends tool execution requests to sandbox_server.py running inside Nomad containers.
Supports single and batch execution for efficiency.
"""
import asyncio
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple
import aiohttp
from .slot import Slot, SlotState
from ..tools.base import ToolCall, ToolResult
@dataclass
class ExecutionRequest:
"""Request to execute a tool in a slot."""
slot: Slot
tool_name: str
args: Dict[str, Any]
execution_id: str = field(default_factory=lambda: str(uuid.uuid4()))
timeout: float = 30.0
@dataclass
class ExecutionResult:
"""Result from sandbox execution."""
success: bool
output: str = ""
error: str = ""
execution_id: str = ""
slot_id: str = ""
metadata: Dict[str, Any] = field(default_factory=dict)
def to_tool_result(self) -> ToolResult:
"""Convert to ToolResult for agent consumption."""
return ToolResult(
success=self.success,
output=self.output,
error=self.error,
metadata=self.metadata,
uniq_id=self.execution_id,
)
class SandboxExecutor:
"""
HTTP client for executing tools in sandbox containers.
Communicates with sandbox_server.py running inside Nomad allocations.
Supports both single execution and batched parallel execution.
Usage:
executor = SandboxExecutor()
# Single execution
result = await executor.execute(slot, "bash", {"command": "ls"})
# Batch execution
results = await executor.execute_batch([
(slot1, "bash", {"command": "ls"}),
(slot2, "write_file", {"path": "test.txt", "content": "hello"}),
])
"""
def __init__(
self,
timeout: float = 30.0,
max_retries: int = 3,
retry_delay: float = 1.0,
):
self.timeout = aiohttp.ClientTimeout(total=timeout)
self.max_retries = max_retries
self.retry_delay = retry_delay
self._session: Optional[aiohttp.ClientSession] = None
async def _get_session(self) -> aiohttp.ClientSession:
"""Get or create HTTP session."""
if self._session is None or self._session.closed:
self._session = aiohttp.ClientSession(timeout=self.timeout)
return self._session
async def close(self):
"""Close HTTP session."""
if self._session and not self._session.closed:
await self._session.close()
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
await self.close()
async def execute(
self,
slot: Slot,
tool_name: str,
args: Dict[str, Any],
timeout: Optional[float] = None,
) -> ExecutionResult:
"""
Execute a tool in a slot's workspace.
Args:
slot: Slot to execute in
tool_name: Name of tool (bash, read_file, write_file)
args: Tool arguments
timeout: Optional timeout override
Returns:
ExecutionResult with output or error
"""
execution_id = str(uuid.uuid4())
exec_timeout = timeout or self.timeout.total or 30.0
# Mark slot as executing
original_state = slot.state
try:
if slot.state == SlotState.ACQUIRED:
slot.start_execution(execution_id)
result = await self._send_execute_request(
container_addr=slot.container_addr,
slot_id=slot.slot_id,
tool_name=tool_name,
args=args,
execution_id=execution_id,
timeout=exec_timeout,
)
result.slot_id = slot.slot_id
return result
finally:
# Restore slot state
if slot.state == SlotState.EXECUTING:
slot.end_execution()
async def _send_execute_request(
self,
container_addr: str,
slot_id: str,
tool_name: str,
args: Dict[str, Any],
execution_id: str,
timeout: float,
) -> ExecutionResult:
"""Send execution request to sandbox server with retry logic."""
session = await self._get_session()
url = f"{container_addr}/execute"
payload = {
"slot_id": slot_id,
"tool": tool_name,
"args": args,
"execution_id": execution_id,
"timeout": timeout,
}
last_error = None
for attempt in range(self.max_retries):
try:
async with session.post(url, json=payload) as response:
data = await response.json()
return ExecutionResult(
success=data.get("success", False),
output=data.get("output", ""),
error=data.get("error", ""),
execution_id=data.get("execution_id", execution_id),
metadata=data.get("metadata", {}),
)
except aiohttp.ClientError as e:
last_error = str(e)
if attempt < self.max_retries - 1:
await asyncio.sleep(self.retry_delay * (attempt + 1))
continue
except asyncio.TimeoutError:
last_error = f"Request timed out after {timeout}s"
break
except Exception as e:
last_error = str(e)
break
return ExecutionResult(
success=False,
error=f"Failed after {self.max_retries} attempts: {last_error}",
execution_id=execution_id,
)
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
timeout: Optional[float] = None,
) -> List[ExecutionResult]:
"""
Execute multiple tools in parallel across slots.
This is the key optimization - we batch tool calls to maximize
container utilization while agents are waiting for LLM responses.
Args:
requests: List of (slot, tool_name, args) tuples
timeout: Optional timeout override
Returns:
List of ExecutionResults in same order as requests
"""
if not requests:
return []
# Group requests by container address for batch API
by_container: Dict[str, List[Tuple[int, Slot, str, Dict[str, Any], str]]] = {}
for idx, (slot, tool_name, args) in enumerate(requests):
execution_id = str(uuid.uuid4())
container = slot.container_addr
if container not in by_container:
by_container[container] = []
by_container[container].append((idx, slot, tool_name, args, execution_id))
# Mark slots as executing
if slot.state == SlotState.ACQUIRED:
slot.start_execution(execution_id)
# Execute batches in parallel
exec_timeout = timeout or self.timeout.total or 30.0
batch_tasks = []
for container_addr, batch_requests in by_container.items():
task = self._send_batch_request(
container_addr=container_addr,
batch_requests=batch_requests,
timeout=exec_timeout,
)
batch_tasks.append(task)
# Gather all batch results
batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)
# Collect results in original order
results: List[Optional[ExecutionResult]] = [None] * len(requests)
for batch_result in batch_results:
if isinstance(batch_result, Exception):
# Mark all in this batch as failed
continue
for idx, result in batch_result:
results[idx] = result
# Fill in any missing results
for idx, result in enumerate(results):
if result is None:
slot, tool_name, args = requests[idx]
results[idx] = ExecutionResult(
success=False,
error="Batch execution failed",
slot_id=slot.slot_id,
)
# End execution on all slots
for slot, _, _ in requests:
if slot.state == SlotState.EXECUTING:
slot.end_execution()
return results # type: ignore
async def _send_batch_request(
self,
container_addr: str,
batch_requests: List[Tuple[int, Slot, str, Dict[str, Any], str]],
timeout: float,
) -> List[Tuple[int, ExecutionResult]]:
"""Send batch execution request to a single container."""
session = await self._get_session()
url = f"{container_addr}/batch"
# Build batch payload
payload = [
{
"slot_id": slot.slot_id,
"tool": tool_name,
"args": args,
"execution_id": execution_id,
"timeout": timeout,
}
for _, slot, tool_name, args, execution_id in batch_requests
]
try:
async with session.post(url, json=payload) as response:
data = await response.json()
if not isinstance(data, list):
raise ValueError(f"Expected list response, got {type(data)}")
results = []
for i, (idx, slot, _, _, execution_id) in enumerate(batch_requests):
if i < len(data):
item = data[i]
result = ExecutionResult(
success=item.get("success", False),
output=item.get("output", ""),
error=item.get("error", ""),
execution_id=item.get("execution_id", execution_id),
slot_id=slot.slot_id,
metadata=item.get("metadata", {}),
)
else:
result = ExecutionResult(
success=False,
error="Missing result in batch response",
execution_id=execution_id,
slot_id=slot.slot_id,
)
results.append((idx, result))
return results
except Exception as e:
# Return error for all requests in batch
return [
(idx, ExecutionResult(
success=False,
error=str(e),
execution_id=execution_id,
slot_id=slot.slot_id,
))
for idx, slot, _, _, execution_id in batch_requests
]
async def reset_slot(self, slot: Slot) -> ExecutionResult:
"""
Reset a slot's workspace (delete all files).
Useful when reusing a slot for a new trajectory.
"""
session = await self._get_session()
url = f"{slot.container_addr}/reset"
try:
async with session.post(url, json={"slot_id": slot.slot_id}) as response:
data = await response.json()
return ExecutionResult(
success=data.get("success", False),
output=data.get("output", ""),
error=data.get("error", ""),
slot_id=slot.slot_id,
)
except Exception as e:
return ExecutionResult(
success=False,
error=str(e),
slot_id=slot.slot_id,
)
async def health_check(self, container_addr: str) -> bool:
"""Check if a sandbox container is healthy."""
session = await self._get_session()
url = f"{container_addr}/health"
try:
async with session.get(url) as response:
data = await response.json()
return data.get("status") == "ok"
except Exception:
return False
async def get_container_status(
self,
container_addr: str
) -> Optional[Dict[str, Any]]:
"""Get status info from a sandbox container."""
session = await self._get_session()
url = f"{container_addr}/health"
try:
async with session.get(url) as response:
return await response.json()
except Exception:
return None
# -------------------------------------------------------------------------
# Artifact helpers (optional)
# -------------------------------------------------------------------------
async def _post_json(
self,
url: str,
payload: Dict[str, Any],
timeout: Optional[float] = None,
) -> Dict[str, Any]:
session = await self._get_session()
try:
async with session.post(url, json=payload, timeout=timeout) as response:
data = await response.json()
if isinstance(data, dict):
data.setdefault("http_status", response.status)
return data
return {"success": False, "error": f"Unexpected response type: {type(data)}", "http_status": response.status}
except Exception as e:
return {"success": False, "error": str(e)}
async def read_artifact(
self,
slot: Slot,
path: str,
*,
encoding: str = "text",
max_bytes: Optional[int] = None,
include_sha256: bool = False,
timeout: Optional[float] = None,
) -> Dict[str, Any]:
url = f"{slot.container_addr}/artifacts/read"
payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "encoding": encoding, "include_sha256": include_sha256}
if max_bytes is not None:
payload["max_bytes"] = max_bytes
return await self._post_json(url, payload, timeout=timeout)
async def list_artifacts(
self,
slot: Slot,
path: str = ".",
*,
recursive: bool = False,
max_entries: Optional[int] = None,
timeout: Optional[float] = None,
) -> Dict[str, Any]:
url = f"{slot.container_addr}/artifacts/list"
payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "recursive": recursive}
if max_entries is not None:
payload["max_entries"] = max_entries
return await self._post_json(url, payload, timeout=timeout)
async def archive_artifacts(
self,
slot: Slot,
path: str = ".",
*,
archive_format: str = "tar.gz",
max_bytes: Optional[int] = None,
max_entries: Optional[int] = None,
timeout: Optional[float] = None,
) -> Dict[str, Any]:
url = f"{slot.container_addr}/artifacts/archive"
payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "format": archive_format}
if max_bytes is not None:
payload["max_bytes"] = max_bytes
if max_entries is not None:
payload["max_entries"] = max_entries
return await self._post_json(url, payload, timeout=timeout)
-659
View File
@@ -1,659 +0,0 @@
"""
SlotPool - Manages slots across Nomad allocations.
The SlotPool is the core abstraction for slot-based multiplexing:
- Tracks available/acquired slots across containers
- Handles slot acquisition and release
- Auto-scales Nomad job count based on demand
- Provides batched tool execution
"""
import asyncio
import logging
import os
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from ..nomad.client import (
Allocation,
AllocationStatus,
NomadClient,
create_sandbox_job,
)
from .executor import ExecutionResult, SandboxExecutor
from .slot import Slot, SlotState, create_slots_for_allocation
logger = logging.getLogger(__name__)
@dataclass
class SlotPoolConfig:
"""Configuration for SlotPool."""
# Nomad settings
nomad_address: str = "http://localhost:4646"
job_id: str = "atropos-sandbox"
datacenter: str = "dc1"
# Container settings
image: str = "atropos-sandbox:local" # Use :local tag to avoid registry pull
slots_per_container: int = 10
privileged: bool = False
cpu: int = 500 # MHz
memory: int = 512 # MB
# Driver selection: "docker" or "singularity"
driver: str = "docker"
# Path to .sif file for singularity driver (required if driver="singularity")
singularity_image: Optional[str] = None
# Scaling settings
min_containers: int = 1
max_containers: int = 10
# Timeouts
acquire_timeout: float = 30.0 # Seconds between acquire polls (also triggers scale-up attempts)
health_check_interval: float = 30.0 # Seconds between health checks
scale_cooldown: float = 60.0 # Seconds between scale operations
# Job lifecycle
purge_job_on_start: bool = False # Purge any pre-existing job before starting (local dev/training friendly)
# Local Docker image convenience (macOS/Nomad dev mode)
auto_build_local_image: bool = True # If image endswith :local and is missing, build it from the bundled Dockerfile.
dockerfile_path: Optional[str] = None # Override Dockerfile path (default: Hermes-Agent/atropos/Dockerfile).
docker_build_context: Optional[str] = None # Override build context (default: Hermes-Agent/atropos).
class SlotPool:
"""
Manages a pool of slots across Nomad allocations.
The SlotPool:
- Deploys sandbox containers to Nomad
- Tracks slots across all running containers
- Handles slot acquisition/release
- Auto-scales based on demand
- Provides batched execution via SandboxExecutor
Usage:
config = SlotPoolConfig(
nomad_address="http://localhost:4646",
job_id="my-sandbox",
slots_per_container=10,
)
pool = SlotPool(config)
await pool.start()
# Acquire a slot
slot = await pool.acquire()
# Execute tool
result = await pool.execute(slot, "bash", {"command": "ls"})
# Release slot
await pool.release(slot)
# Shutdown
await pool.stop()
"""
def __init__(self, config: Optional[SlotPoolConfig] = None):
self.config = config or SlotPoolConfig()
# Nomad client
self.nomad = NomadClient(address=self.config.nomad_address)
# Sandbox executor for tool execution
self.executor = SandboxExecutor()
# Slot tracking
self._slots: Dict[str, Slot] = {} # slot_key -> Slot
self._available_queue: asyncio.Queue[str] = asyncio.Queue()
self._lock = asyncio.Lock()
self._scale_lock = asyncio.Lock()
# State
self._started = False
self._health_task: Optional[asyncio.Task] = None
self._scale_task: Optional[asyncio.Task] = None
self._last_scale_time = 0.0
def _default_dockerfile_path(self) -> Path:
# Hermes-Agent/atropos/Dockerfile lives next to this module in source checkouts.
return Path(__file__).resolve().parents[1] / "Dockerfile"
def _default_build_context(self) -> Path:
return Path(__file__).resolve().parents[1]
def _docker_image_exists(self, image: str) -> bool:
try:
proc = subprocess.run(
["docker", "image", "inspect", image],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=False,
env={**os.environ, "DOCKER_CLI_HINTS": "false"},
)
return proc.returncode == 0
except FileNotFoundError:
return False
def _try_build_local_image(self, image: str) -> None:
dockerfile = Path(self.config.dockerfile_path) if self.config.dockerfile_path else self._default_dockerfile_path()
context = Path(self.config.docker_build_context) if self.config.docker_build_context else self._default_build_context()
if not dockerfile.exists():
raise RuntimeError(
f"Sandbox Dockerfile not found at {dockerfile}. "
"Build the sandbox image manually or set --env.purge_job_on_start false and provide a non-local image."
)
if not context.exists():
raise RuntimeError(f"Docker build context not found at {context}")
# Prefer buildx+--load to ensure the image ends up in the local daemon (required by Nomad's docker driver).
buildx_cmd = [
"docker",
"buildx",
"build",
"--load",
"-t",
image,
"-f",
str(dockerfile),
str(context),
]
proc = subprocess.run(buildx_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
if proc.returncode == 0:
return
# Fallback to classic docker build if buildx isn't available.
build_cmd = ["docker", "build", "-t", image, "-f", str(dockerfile), str(context)]
proc2 = subprocess.run(build_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
if proc2.returncode != 0:
raise RuntimeError(
f"Failed to build local sandbox image {image}. "
f"Tried: {' '.join(buildx_cmd)} and {' '.join(build_cmd)}"
)
def _ensure_local_image(self) -> None:
image = (self.config.image or "").strip()
if not image.endswith(":local"):
return
if not self.config.auto_build_local_image:
return
if self._docker_image_exists(image):
return
logger.info(f"Local sandbox image {image} not found; building it now...")
self._try_build_local_image(image)
def _slot_key(self, alloc_id: str, slot_id: str) -> str:
"""Generate unique key for a slot."""
return f"{alloc_id}:{slot_id}"
@property
def total_slots(self) -> int:
"""Total number of slots in pool."""
return len(self._slots)
@property
def available_slots(self) -> int:
"""Number of available slots."""
return sum(1 for s in self._slots.values() if s.is_available)
@property
def acquired_slots(self) -> int:
"""Number of acquired slots."""
return sum(1 for s in self._slots.values() if s.is_acquired)
async def start(self) -> None:
"""
Start the slot pool.
- Checks if Nomad is healthy
- Deploys sandbox job if not running
- Discovers existing allocations
- Starts health check background task
"""
if self._started:
return
logger.info(f"Starting SlotPool (job_id={self.config.job_id})")
try:
# Make sure local sandbox images exist before Nomad tries to pull them.
# This is a common footgun in macOS dev mode with :local tags.
self._ensure_local_image()
# Check Nomad health
if not await self.nomad.is_healthy():
raise RuntimeError(f"Nomad is not reachable at {self.config.nomad_address}")
if self.config.purge_job_on_start:
logger.info(f"Purging any existing Nomad job: {self.config.job_id}")
await self.nomad.stop_job(self.config.job_id, purge=True)
# Check if job exists (after optional purge)
job = await self.nomad.get_job(self.config.job_id)
if job is None:
# Deploy new job
logger.info(f"Deploying sandbox job: {self.config.job_id} (driver={self.config.driver})")
job_spec = create_sandbox_job(
job_id=self.config.job_id,
image=self.config.image,
count=self.config.min_containers,
slots_per_container=self.config.slots_per_container,
privileged=self.config.privileged,
cpu=self.config.cpu,
memory=self.config.memory,
datacenter=self.config.datacenter,
driver=self.config.driver,
singularity_image=self.config.singularity_image,
)
result = await self.nomad.submit_job(job_spec)
if "error" in result:
raise RuntimeError(f"Failed to submit job: {result}")
# Wait for allocations to be running (even if the job already existed).
await self._wait_for_healthy_allocations(self.config.min_containers)
# Discover existing allocations and slots
await self._refresh_slots()
# Start health check task
self._health_task = asyncio.create_task(self._health_check_loop())
self._started = True
logger.info(f"SlotPool started: {self.total_slots} slots available")
except Exception:
# Ensure aiohttp sessions are not leaked if we fail to start.
await self.stop(purge_job=False)
raise
async def stop(self, purge_job: bool = False) -> None:
"""
Stop the slot pool.
Args:
purge_job: If True, also stop the Nomad job
"""
logger.info("Stopping SlotPool")
# Cancel health check task
if self._health_task:
self._health_task.cancel()
try:
await self._health_task
except asyncio.CancelledError:
pass
finally:
self._health_task = None
if self._scale_task:
self._scale_task.cancel()
try:
await self._scale_task
except asyncio.CancelledError:
pass
finally:
self._scale_task = None
# Optionally stop the job (do this even if start() never completed).
if purge_job:
logger.info(f"Stopping Nomad job: {self.config.job_id}")
await self.nomad.stop_job(self.config.job_id, purge=True)
# Close connections
await self.executor.close()
await self.nomad.close()
self._started = False
self._slots.clear()
# Clear the queue
while not self._available_queue.empty():
try:
self._available_queue.get_nowait()
except asyncio.QueueEmpty:
break
async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
"""
Acquire an available slot.
If no slots are available, waits up to acquire_timeout seconds.
If still no slots, attempts to scale up.
Args:
trajectory_id: Optional ID of trajectory acquiring the slot
Returns:
Acquired Slot
Raises:
asyncio.TimeoutError: If no slot becomes available
"""
if not self._started:
raise RuntimeError("SlotPool not started")
while True:
try:
# Try to get an available slot
slot_key = await asyncio.wait_for(
self._available_queue.get(),
timeout=self.config.acquire_timeout,
)
except asyncio.TimeoutError:
# Try to scale up, but keep waiting even if scaling isn't possible.
# In practice, slots may become available shortly (e.g. contention),
# and scaling may be temporarily blocked by Nomad deployments.
await self._try_scale_up()
continue
slot = self._slots.get(slot_key)
if slot is None:
# Slot was removed; discard stale queue entry and retry.
continue
try:
slot.acquire(trajectory_id)
except RuntimeError:
# Slot isn't actually available (e.g. duplicate queue entry); retry.
continue
logger.debug(f"Acquired slot {slot.slot_id} (alloc={slot.alloc_id[:8]})")
return slot
async def release(self, slot: Slot, reset_workspace: bool = False) -> None:
"""
Release a slot back to the pool.
Args:
slot: Slot to release
reset_workspace: If True, clear the workspace files
"""
slot_key = self._slot_key(slot.alloc_id, slot.slot_id)
if slot_key not in self._slots:
logger.warning(f"Releasing unknown slot: {slot_key}")
return
# Optionally reset workspace
if reset_workspace:
await self.executor.reset_slot(slot)
slot.release()
await self._available_queue.put(slot_key)
logger.debug(f"Released slot {slot.slot_id}")
async def execute(
self,
slot: Slot,
tool_name: str,
args: Dict[str, Any],
timeout: Optional[float] = None,
) -> ExecutionResult:
"""
Execute a tool in a slot's workspace.
Args:
slot: Slot to execute in
tool_name: Name of tool (bash, read_file, write_file)
args: Tool arguments
timeout: Optional timeout override
Returns:
ExecutionResult
"""
return await self.executor.execute(slot, tool_name, args, timeout)
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
timeout: Optional[float] = None,
) -> List[ExecutionResult]:
"""
Execute multiple tools in parallel.
This is the key optimization - batch execution across multiple slots
maximizes container utilization.
Args:
requests: List of (slot, tool_name, args) tuples
timeout: Optional timeout override
Returns:
List of ExecutionResults in same order
"""
return await self.executor.execute_batch(requests, timeout)
async def _refresh_slots(self) -> None:
"""Refresh slot inventory from Nomad allocations."""
async with self._lock:
allocs = await self.nomad.get_job_allocations(self.config.job_id)
# Track which slots we've seen
seen_keys = set()
for alloc in allocs:
if alloc.status != AllocationStatus.RUNNING:
continue
if not alloc.http_address:
continue
# Check container health
healthy = await self.executor.health_check(alloc.http_address)
if not healthy:
continue
# Create slots for this allocation
for i in range(self.config.slots_per_container):
slot_id = f"slot_{i}"
slot_key = self._slot_key(alloc.id, slot_id)
seen_keys.add(slot_key)
if slot_key not in self._slots:
# New slot
slot = Slot(
slot_id=slot_id,
alloc_id=alloc.id,
container_addr=alloc.http_address,
)
self._slots[slot_key] = slot
await self._available_queue.put(slot_key)
logger.debug(f"Added slot: {slot_key}")
# Remove slots from dead allocations
for slot_key in list(self._slots.keys()):
if slot_key not in seen_keys:
slot = self._slots.pop(slot_key)
logger.debug(f"Removed slot: {slot_key}")
async def _wait_for_healthy_allocations(
self,
min_count: int,
timeout: float = 120.0
) -> None:
"""Wait for allocations to become healthy."""
import time
start = time.time()
def _summarize_alloc_detail(detail: Dict[str, Any]) -> str:
task_states = detail.get("TaskStates") or {}
parts: List[str] = []
if isinstance(task_states, dict):
for task_name, st in task_states.items():
events = (st or {}).get("Events") or []
if isinstance(events, list) and events:
# Include a few recent events; the latest can be a generic restart message
# while the true root cause is slightly earlier (e.g. image pull failure).
recent = events[-3:]
msgs: List[str] = []
for ev in recent:
desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
if desc:
msgs.append(desc)
if msgs:
parts.append(f"{task_name}: " + " | ".join(msgs))
return "; ".join(parts)
def _alloc_events_lower(detail: Dict[str, Any]) -> str:
task_states = detail.get("TaskStates") or {}
texts: List[str] = []
if isinstance(task_states, dict):
for _task_name, st in task_states.items():
events = (st or {}).get("Events") or []
if isinstance(events, list):
for ev in events[-10:]:
desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
if desc:
texts.append(desc)
return " ".join(texts).lower()
while time.time() - start < timeout:
allocs = await self.nomad.get_job_allocations(self.config.job_id)
healthy_count = 0
for alloc in allocs:
if alloc.status == AllocationStatus.RUNNING and alloc.http_address:
if await self.executor.health_check(alloc.http_address):
healthy_count += 1
# Fast-fail on obvious driver/image errors to avoid waiting out the full timeout.
if alloc.id:
detail = await self.nomad.get_allocation(alloc.id)
if isinstance(detail, dict):
summary = _summarize_alloc_detail(detail)
lowered = _alloc_events_lower(detail) or summary.lower()
if "failed to pull" in lowered or "pull access denied" in lowered:
raise RuntimeError(
"Nomad allocation failed to start due to a Docker image pull error. "
f"Allocation {alloc.id[:8]}: {summary}\n"
"If you're using a local image tag (e.g. `atropos-sandbox:local`) on macOS, "
"make sure the image is loaded into Docker, e.g.:\n"
" docker buildx build --load -t atropos-sandbox:local -f Hermes-Agent/atropos/Dockerfile Hermes-Agent/atropos"
)
if "exceeded allowed attempts" in lowered:
raise RuntimeError(
"Nomad allocation is crash-looping and has entered restart backoff. "
f"Allocation {alloc.id[:8]}: {summary}\n"
"Inspect logs with:\n"
f" nomad alloc logs -stderr -task sandbox-server {alloc.id}\n"
"Common causes include: missing local Docker image tag, container entrypoint error, "
"or sandbox-server startup failure."
)
if healthy_count >= min_count:
return
await asyncio.sleep(2.0)
# Timed out: include allocation status detail to help debugging.
allocs = await self.nomad.get_job_allocations(self.config.job_id)
alloc_lines: List[str] = []
for alloc in allocs[:10]:
addr = alloc.http_address or "-"
line = f"{alloc.id[:8]} status={alloc.status.value} http={addr}"
detail = await self.nomad.get_allocation(alloc.id)
if isinstance(detail, dict):
summary = _summarize_alloc_detail(detail)
if summary:
line += f" detail={summary}"
alloc_lines.append(line)
hint = (
"Timed out waiting for healthy sandbox allocations.\n"
f"Job: {self.config.job_id}, desired_healthy: {min_count}\n"
"Allocations:\n - " + "\n - ".join(alloc_lines)
)
raise RuntimeError(hint)
async def _try_scale_up(self) -> bool:
"""Attempt to scale up the job."""
import time
async with self._scale_lock:
# Check cooldown
if time.time() - self._last_scale_time < self.config.scale_cooldown:
return False
# Check max containers
status = await self.nomad.get_job_status(self.config.job_id)
if status is None:
return False
current_count = status.count
if current_count >= self.config.max_containers:
logger.warning(f"Cannot scale up: already at max ({self.config.max_containers})")
return False
# Scale up
new_count = min(current_count + 1, self.config.max_containers)
logger.info(f"Scaling up from {current_count} to {new_count} containers")
scale_resp = await self.nomad.scale_job(
self.config.job_id,
count=new_count,
task_group="sandbox",
)
# Nomad may return non-JSON errors (e.g. plain text) with a status field.
if isinstance(scale_resp, dict) and scale_resp.get("status", 200) >= 400:
logger.warning(f"Scale request rejected: {scale_resp}")
self._last_scale_time = time.time()
return False
self._last_scale_time = time.time()
# Wait for new allocation in the background so contended acquires can still
# make progress (e.g. by grabbing slots released by other trajectories).
if self._scale_task is None or self._scale_task.done():
self._scale_task = asyncio.create_task(self._wait_for_scale(new_count))
return True
async def _wait_for_scale(self, desired_count: int) -> None:
try:
await self._wait_for_healthy_allocations(desired_count, timeout=60.0)
await self._refresh_slots()
except asyncio.CancelledError:
raise
except Exception as e:
logger.error(f"Failed to scale up: {e}")
async def _health_check_loop(self) -> None:
"""Background task to monitor container health."""
while True:
try:
await asyncio.sleep(self.config.health_check_interval)
await self._refresh_slots()
except asyncio.CancelledError:
break
except Exception as e:
logger.error(f"Health check error: {e}")
def get_stats(self) -> Dict[str, Any]:
"""Get pool statistics."""
slots_by_state = {}
for slot in self._slots.values():
state = slot.state.value
slots_by_state[state] = slots_by_state.get(state, 0) + 1
container_count = len({s.alloc_id for s in self._slots.values()}) if self._slots else 0
return {
"total_slots": self.total_slots,
"available_slots": self.available_slots,
"acquired_slots": self.acquired_slots,
"containers": container_count,
"slots_by_state": slots_by_state,
"started": self._started,
}
-159
View File
@@ -1,159 +0,0 @@
"""
Slot abstraction for atropos-agent.
A Slot represents an isolated workspace for a single agent trajectory.
Slots are hosted on Nomad allocations and provide workspace isolation
via filesystem directories.
"""
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional
import uuid
class SlotState(Enum):
"""State of a slot in the pool."""
AVAILABLE = "available" # Ready to be acquired
ACQUIRED = "acquired" # Assigned to a trajectory
EXECUTING = "executing" # Currently executing a tool
RELEASING = "releasing" # Being released back to pool
ERROR = "error" # In error state
@dataclass
class Slot:
"""
An isolated workspace for a single agent trajectory.
Slots are the unit of scheduling - each trajectory runs in its own slot,
with an isolated workspace directory. Multiple slots share a container.
Attributes:
slot_id: Unique identifier for this slot (e.g., "slot_0")
alloc_id: Nomad allocation ID hosting this slot
container_addr: HTTP address of the sandbox server (e.g., "http://10.0.0.1:8080")
workspace_dir: Path to workspace in container (e.g., "/data/slot_0")
state: Current state of the slot
trajectory_id: ID of trajectory currently using this slot (if acquired)
metadata: Additional metadata
"""
slot_id: str
alloc_id: str
container_addr: str
workspace_dir: str = ""
state: SlotState = SlotState.AVAILABLE
trajectory_id: Optional[str] = None
metadata: Dict[str, Any] = field(default_factory=dict)
def __post_init__(self):
"""Set default workspace_dir if not provided."""
if not self.workspace_dir:
self.workspace_dir = f"/data/{self.slot_id}"
@property
def is_available(self) -> bool:
"""Check if slot is available for acquisition."""
return self.state == SlotState.AVAILABLE
@property
def is_acquired(self) -> bool:
"""Check if slot is currently acquired."""
return self.state in (SlotState.ACQUIRED, SlotState.EXECUTING)
def acquire(self, trajectory_id: Optional[str] = None) -> None:
"""
Mark slot as acquired by a trajectory.
Args:
trajectory_id: Optional ID of acquiring trajectory
"""
if not self.is_available:
raise RuntimeError(f"Cannot acquire slot {self.slot_id}: state is {self.state}")
self.state = SlotState.ACQUIRED
self.trajectory_id = trajectory_id or str(uuid.uuid4())
def start_execution(self, execution_id: Optional[str] = None) -> None:
"""Mark slot as executing."""
if self.state != SlotState.ACQUIRED:
raise RuntimeError(f"Cannot start execution on slot {self.slot_id}: state is {self.state}")
self.state = SlotState.EXECUTING
if execution_id:
self.metadata["current_execution_id"] = execution_id
def end_execution(self) -> None:
"""Mark execution as complete, return to acquired state."""
if self.state != SlotState.EXECUTING:
raise RuntimeError(f"Cannot end execution on slot {self.slot_id}: state is {self.state}")
self.state = SlotState.ACQUIRED
self.metadata.pop("current_execution_id", None)
def release(self) -> None:
"""Release slot back to available state."""
self.state = SlotState.AVAILABLE
self.trajectory_id = None
self.metadata.pop("current_execution_id", None)
def mark_error(self, error: str) -> None:
"""Mark slot as in error state."""
self.state = SlotState.ERROR
self.metadata["error"] = error
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary for serialization."""
return {
"slot_id": self.slot_id,
"alloc_id": self.alloc_id,
"container_addr": self.container_addr,
"workspace_dir": self.workspace_dir,
"state": self.state.value,
"trajectory_id": self.trajectory_id,
"metadata": self.metadata,
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "Slot":
"""Create from dictionary."""
return cls(
slot_id=data["slot_id"],
alloc_id=data["alloc_id"],
container_addr=data["container_addr"],
workspace_dir=data.get("workspace_dir", ""),
state=SlotState(data.get("state", "available")),
trajectory_id=data.get("trajectory_id"),
metadata=data.get("metadata", {}),
)
def __repr__(self) -> str:
return f"Slot({self.slot_id}, state={self.state.value}, alloc={self.alloc_id[:8]}...)"
def create_slots_for_allocation(
alloc_id: str,
container_addr: str,
num_slots: int = 10,
) -> list["Slot"]:
"""
Create slots for a Nomad allocation.
Args:
alloc_id: Nomad allocation ID
container_addr: HTTP address of sandbox server
num_slots: Number of slots to create
Returns:
List of Slot objects
"""
slots = []
for i in range(num_slots):
slot_id = f"slot_{i}"
slots.append(Slot(
slot_id=slot_id,
alloc_id=alloc_id,
container_addr=container_addr,
workspace_dir=f"/data/{slot_id}",
))
return slots
-2
View File
@@ -1,2 +0,0 @@
"""Terminal helpers for stateful sandbox interactions."""
-115
View File
@@ -1,115 +0,0 @@
from __future__ import annotations
import json
from typing import Any
import pyte
class AsciinemaStreamDecoder:
def __init__(self, *, default_width: int = 80, default_height: int = 24) -> None:
self._default_width = max(1, int(default_width))
self._default_height = max(1, int(default_height))
self._buffer = ""
self._has_header = False
self.width = self._default_width
self.height = self._default_height
self._screen = pyte.Screen(self.width, self.height)
self._stream = pyte.Stream(self._screen)
def reset(self) -> None:
self._buffer = ""
self._has_header = False
self.width = self._default_width
self.height = self._default_height
self._screen = pyte.Screen(self.width, self.height)
self._stream = pyte.Stream(self._screen)
def feed(self, chunk: str | bytes) -> None:
if not chunk:
return
if isinstance(chunk, bytes):
chunk = chunk.decode("utf-8", errors="replace")
self._buffer += chunk
while True:
line, sep, rest = self._buffer.partition("\n")
if not sep:
break
self._buffer = rest
line = line.strip()
if not line:
continue
parsed = self._parse_json_line(line)
if parsed is None:
continue
if not self._has_header:
if isinstance(parsed, dict):
self._init_from_header(parsed)
continue
if isinstance(parsed, list):
self._has_header = True
self._apply_event(parsed)
continue
continue
if isinstance(parsed, list):
self._apply_event(parsed)
def render(self) -> str:
return "\n".join(self._screen.display)
def _parse_json_line(self, line: str) -> Any | None:
try:
return json.loads(line)
except json.JSONDecodeError:
return None
def _init_from_header(self, header: dict[str, Any]) -> None:
width = _coerce_int(
header.get("width") or header.get("columns") or header.get("cols"),
self._default_width,
)
height = _coerce_int(
header.get("height") or header.get("rows") or header.get("lines"),
self._default_height,
)
self.width = max(1, width)
self.height = max(1, height)
self._screen = pyte.Screen(self.width, self.height)
self._stream = pyte.Stream(self._screen)
self._has_header = True
def _apply_event(self, event: list[Any]) -> None:
if len(event) < 2:
return
event_type = event[1]
payload = event[2] if len(event) > 2 else ""
if event_type == "o":
if isinstance(payload, str):
self._stream.feed(payload)
elif event_type == "r":
width, height = _parse_resize(payload)
if width and height:
self.width = width
self.height = height
self._screen.resize(width, height)
def _coerce_int(value: Any, default: int) -> int:
try:
return int(value)
except (TypeError, ValueError):
return int(default)
def _parse_resize(payload: Any) -> tuple[int, int]:
if isinstance(payload, str) and "x" in payload:
left, right = payload.lower().split("x", 1)
return _coerce_int(left, 0), _coerce_int(right, 0)
if isinstance(payload, dict):
width = _coerce_int(payload.get("width") or payload.get("columns") or payload.get("cols"), 0)
height = _coerce_int(payload.get("height") or payload.get("rows") or payload.get("lines"), 0)
return width, height
if isinstance(payload, list) and len(payload) >= 2:
return _coerce_int(payload[0], 0), _coerce_int(payload[1], 0)
return 0, 0
-26
View File
@@ -1,26 +0,0 @@
"""
Tool abstractions for atropos-agent.
Provides base Tool class and common tool implementations.
"""
from .base import Tool, ToolCall, ToolRegistry, ToolResult, ToolSchema
from .build_registry import build_tool_registry
from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
from .terminal_stateful_tool import TerminalStatefulTool
from .tmux_tool import TmuxTool
__all__ = [
"Tool",
"ToolCall",
"ToolRegistry",
"ToolResult",
"ToolSchema",
"BashTool",
"ReadFileTool",
"WriteFileTool",
"TerminalTool",
"TerminalStatefulTool",
"TmuxTool",
"build_tool_registry",
]
-423
View File
@@ -1,423 +0,0 @@
"""
Base Tool abstraction for atropos-agent.
Tools follow a simple pattern:
1. Define schema (name, description, parameters)
2. Implement execute() method
3. Return ToolResult with output/error
Tool calls use Hermes-style XML tags:
<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>
"""
import json
import re
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Literal, Optional
from pydantic import BaseModel, Field
@dataclass
class ToolSchema:
"""JSON Schema for a tool's parameters."""
name: str
description: str
parameters: Dict[str, Any] = field(default_factory=dict)
required: List[str] = field(default_factory=list)
external: bool = False # Whether the tool must be executed via an external ToolServer (secret proxy) and not inside the sandbox.
def to_dict(self) -> Dict[str, Any]:
"""Convert to OpenAI-compatible function schema."""
return {
"type": "function",
"function": {
"name": self.name,
"description": self.description,
"parameters": {
"type": "object",
"properties": self.parameters,
"required": self.required,
},
},
}
def to_prompt_description(self) -> str:
"""Convert to human-readable description for system prompt."""
params_desc = []
for name, spec in self.parameters.items():
req = "(required)" if name in self.required else "(optional)"
desc = spec.get("description", "")
param_type = spec.get("type", "string")
params_desc.append(f" - {name} ({param_type}) {req}: {desc}")
params_str = "\n".join(params_desc) if params_desc else " (no parameters)"
return f"**{self.name}**: {self.description}\nParameters:\n{params_str}"
@dataclass
class ToolCall:
"""A parsed tool call from model output."""
name: str
arguments: Dict[str, Any]
raw_text: str = "" # Original XML/JSON text
uniq_id: str = field(default_factory=lambda: str(uuid.uuid4())) # Unique tool-call id for traceability/reconstruction.
@classmethod
def parse_from_text(cls, text: str) -> List["ToolCall"]:
"""
Extract tool calls from text using Hermes-style XML tags.
Supported formats (STRICT: requires well-formed closing tags):
- Hermes JSON wrapper:
<tool_call>{"name": "...", "arguments": {...}}</tool_call>
- GLM/llama.cpp style:
<tool_call>terminal{"command":"ls -la"}</tool_call>
"""
calls: List["ToolCall"] = []
if not text:
return calls
def _append_from_payload(*, name: str, arguments: Dict[str, Any], raw: str, uniq_id: Optional[str] = None) -> None:
if not isinstance(name, str) or not name:
return
if not isinstance(arguments, dict):
return
calls.append(
cls(
name=name,
arguments=arguments,
raw_text=raw,
uniq_id=uniq_id or str(uuid.uuid4()),
)
)
# STRICT parsing: only accept well-formed <tool_call>...</tool_call> blocks.
pattern = r"<tool_call>\s*(.*?)\s*</tool_call>"
for inner in re.findall(pattern, text, re.DOTALL):
cleaned = (inner or "").strip()
if not cleaned:
continue
# Hermes JSON wrapper.
if cleaned.startswith("{"):
try:
data = json.loads(cleaned)
except json.JSONDecodeError:
continue
uniq_id = data.get("uniq_id") or data.get("id") or None
_append_from_payload(
name=data.get("name", ""),
arguments=data.get("arguments", {}),
raw=inner,
uniq_id=uniq_id,
)
continue
# GLM/llama.cpp style: terminal{...}
m = re.match(r"^\s*([A-Za-z0-9_.:\\-]+)\s*(\{.*\})\s*$", cleaned, re.DOTALL)
if not m:
continue
name = m.group(1)
args_text = m.group(2)
try:
args = json.loads(args_text)
except json.JSONDecodeError:
continue
_append_from_payload(name=name, arguments=args, raw=inner)
return calls
@classmethod
def has_tool_call(cls, text: str) -> bool:
"""Check if text contains any tool calls."""
return bool(re.search(r"<tool_call>", text))
@dataclass
class ToolResult:
"""Result from executing a tool."""
success: bool
output: str = ""
error: str = ""
metadata: Dict[str, Any] = field(default_factory=dict)
uniq_id: Optional[str] = None # Should match ToolCall.uniq_id for async execution tracking.
def to_xml(self) -> str:
"""Format as XML for including in conversation."""
data = {
"success": self.success,
"output": self.output,
}
if self.uniq_id:
data["uniq_id"] = self.uniq_id
if self.error:
data["error"] = self.error
if self.metadata:
data["metadata"] = self.metadata
return f"<tool_response>{json.dumps(data)}</tool_response>"
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary."""
return {
"success": self.success,
"output": self.output,
"error": self.error,
"metadata": self.metadata,
"uniq_id": self.uniq_id,
}
class Tool(ABC):
"""
Abstract base class for tools.
Subclasses must implement:
- schema: ToolSchema describing the tool
- execute(): async method that performs the tool action
"""
@property
@abstractmethod
def schema(self) -> ToolSchema:
"""Return the tool's schema."""
pass
@property
def name(self) -> str:
"""Tool name (from schema)."""
return self.schema.name
@abstractmethod
async def execute(self, **kwargs) -> ToolResult:
"""
Execute the tool with given arguments.
Args:
**kwargs: Tool-specific arguments
Returns:
ToolResult with success/failure and output
"""
pass
def is_available(self) -> tuple[bool, str | None]:
"""
Return whether this tool should be exposed/executable in the current process.
Tools that depend on optional binaries/services/env vars can override this
to avoid advertising a tool that will fail at runtime.
"""
return True, None
async def __call__(self, **kwargs) -> ToolResult:
"""Allow calling tool instance directly."""
return await self.execute(**kwargs)
# Note: This is only wrapping declarations for the external ToolServer (for execution on external process tools), and tools preinstalled in envs
class ToolRegistry:
"""Registry of available tools."""
def __init__(self):
self._tools: Dict[str, Tool] = {}
def register(self, tool: Tool) -> None:
"""Register a tool."""
self._tools[tool.name] = tool
def get(self, name: str) -> Optional[Tool]:
"""Get a tool by name."""
return self._tools.get(name)
def list_tools(self) -> List[Tool]:
"""List all registered tools."""
return list(self._tools.values())
def get_schemas(self) -> List[ToolSchema]:
"""Get schemas for all registered tools."""
return [tool.schema for tool in self._tools.values()]
def get_prompt_description(self) -> str:
"""Generate tool descriptions for system prompt."""
descriptions = [tool.schema.to_prompt_description() for tool in self._tools.values()]
return "\n\n".join(descriptions)
def get_prompt_tool_definitions_json(self) -> str:
"""
Return a Hermes-style JSON list of tool definitions for use inside a `<tools>...</tools>` block.
Hermes trajectories historically use a simplified schema list:
[{"name": ..., "description": ..., "parameters": {...}, "required": null}, ...]
"""
formatted: List[Dict[str, Any]] = []
for tool in self._tools.values():
fn = tool.schema.to_dict().get("function", {})
formatted.append(
{
"name": fn.get("name", tool.name),
"description": fn.get("description", ""),
"parameters": fn.get("parameters", {}),
# Keep parity with Hermes saved trajectories (required is typically null there).
"required": None,
}
)
return json.dumps(formatted, ensure_ascii=False)
async def execute(self, call: ToolCall) -> ToolResult:
"""Execute a tool call."""
tool = self.get(call.name)
if tool is None:
return ToolResult(
success=False,
error=f"Unknown tool: {call.name}",
uniq_id=call.uniq_id,
)
try:
result = await tool.execute(**call.arguments)
if result.uniq_id is None:
result.uniq_id = call.uniq_id
return result
except Exception as e:
return ToolResult(
success=False,
error=f"Tool execution error: {str(e)}",
uniq_id=call.uniq_id,
)
# =============================================================================
# FastAPI / transport models
# =============================================================================
class ToolCallPayload(BaseModel):
name: str
arguments: Dict[str, Any] = Field(default_factory=dict)
uniq_id: str
@classmethod
def from_tool_call(cls, call: ToolCall) -> "ToolCallPayload":
return cls(name=call.name, arguments=call.arguments, uniq_id=call.uniq_id)
def to_tool_call(self) -> ToolCall:
return ToolCall(name=self.name, arguments=self.arguments, uniq_id=self.uniq_id)
class ToolResultPayload(BaseModel):
success: bool
output: str = ""
error: str = ""
metadata: Dict[str, Any] = Field(default_factory=dict)
uniq_id: Optional[str] = None
@classmethod
def from_tool_result(cls, result: ToolResult) -> "ToolResultPayload":
return cls(
success=result.success,
output=result.output,
error=result.error,
metadata=result.metadata,
uniq_id=result.uniq_id,
)
def to_tool_result(self) -> ToolResult:
return ToolResult(
success=self.success,
output=self.output,
error=self.error,
metadata=self.metadata,
uniq_id=self.uniq_id,
)
class ToolExecutorExecuteRequest(BaseModel):
trajectory_id: str
tool: ToolCallPayload
timeout_s: Optional[float] = None
class ToolExecutorReleaseRequest(BaseModel):
trajectory_id: str
reset_workspace: bool = False
class ToolServerExecuteRequest(BaseModel):
trajectory_id: Optional[str] = None
tool: ToolCallPayload
timeout_s: Optional[float] = None
# Optional sandbox context for tools that need workspace artifacts.
# This is set by ToolExecutor and is NOT model-controlled.
slot_id: Optional[str] = None
container_addr: Optional[str] = None
# =============================================================================
# Artifact transport models
# =============================================================================
class ArtifactReadRequestPayload(BaseModel):
trajectory_id: str
path: str
encoding: Literal["text", "base64"] = "text"
max_bytes: Optional[int] = None
include_sha256: bool = False
class ArtifactReadResponsePayload(BaseModel):
success: bool
content: str = ""
error: str = ""
encoding: str = "text"
truncated: bool = False
bytes: int = 0
file_size: Optional[int] = None
path: str = ""
mime: Optional[str] = None
sha256: Optional[str] = None
class ArtifactListRequestPayload(BaseModel):
trajectory_id: str
path: str = "."
recursive: bool = False
max_entries: Optional[int] = None
class ArtifactListEntryPayload(BaseModel):
path: str
is_dir: bool
size: int
mtime: float
class ArtifactListResponsePayload(BaseModel):
success: bool
entries: List[ArtifactListEntryPayload] = Field(default_factory=list)
truncated: bool = False
error: str = ""
class ArtifactArchiveRequestPayload(BaseModel):
trajectory_id: str
path: str = "."
format: Literal["tar.gz", "tgz"] = "tar.gz"
max_bytes: Optional[int] = None
max_entries: Optional[int] = None
class ArtifactArchiveResponsePayload(BaseModel):
success: bool
content: str = ""
error: str = ""
encoding: str = "base64"
format: str = "tar.gz"
bytes: int = 0
entry_count: int = 0
-64
View File
@@ -1,64 +0,0 @@
"""
Unified tool registry builder for Hermes-Agent Atropos integration.
This composes:
- sandbox tool stubs (terminal/bash/read_file/write_file + stateful terminal/tmux)
- Hermes external tools (web/vision/image/moa/skills/browser), executed via ToolServer
ToolExecutor only needs the schema + `external` routing bit; ToolServer executes
the external tools via Hermes' existing implementations.
"""
from __future__ import annotations
from typing import List, Optional
from .base import ToolRegistry
from .hermes_external_tools import build_external_tools
from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
from .terminal_stateful_tool import TerminalStatefulTool
from .tmux_tool import TmuxTool
from .toolset_resolver import resolve_multiple_toolsets
def build_tool_registry(
*,
enabled_toolsets: Optional[List[str]] = None,
disabled_toolsets: Optional[List[str]] = None,
tool_server_url: Optional[str] = None,
) -> ToolRegistry:
"""
Build a ToolRegistry for AgentEnv / ToolExecutor / ToolServer.
If `tool_server_url` is not provided, external tools will be omitted so we do
not advertise tools that cannot execute.
"""
enabled_toolsets = enabled_toolsets or ["default"]
# Resolve tool names using Hermes toolsets plus Atropos additions.
selected = set(resolve_multiple_toolsets(enabled_toolsets))
if disabled_toolsets:
selected -= set(resolve_multiple_toolsets(disabled_toolsets))
reg = ToolRegistry()
# Always register sandbox tools if selected.
sandbox_by_name = {
"terminal": TerminalTool(),
"bash": BashTool(),
"read_file": ReadFileTool(),
"write_file": WriteFileTool(),
"terminal_stateful": TerminalStatefulTool(),
"tmux": TmuxTool(),
}
for name, tool in sandbox_by_name.items():
if name in selected:
reg.register(tool)
# External tools: only include when ToolServer is configured.
if tool_server_url:
for tool in build_external_tools(selected_tool_names=selected):
if tool.name in selected:
reg.register(tool)
return reg
-90
View File
@@ -1,90 +0,0 @@
"""
Hermes external tool adapter for Atropos ToolServer.
These tools reuse Hermes-Agent's existing tool runner (`model_tools.handle_function_call`)
so we don't duplicate external tool implementations.
Important:
- These are marked `external=True` and should be executed ONLY by ToolServer.
- We run `handle_function_call` in a worker thread because the Hermes implementation
uses `asyncio.run()` internally for some async tools (web_extract, vision, MoA, etc).
"""
from __future__ import annotations
import asyncio
import json
from typing import Any, Dict, List, Optional
import model_tools
from .base import Tool, ToolResult, ToolSchema
def _schema_from_openai_tool_dict(tool: Dict[str, Any], *, external: bool) -> ToolSchema:
fn = tool.get("function") or {}
name = str(fn.get("name") or "")
description = str(fn.get("description") or "")
params = fn.get("parameters") or {}
properties = params.get("properties") or {}
required = params.get("required") or []
if not isinstance(required, list):
required = []
return ToolSchema(
name=name,
description=description,
parameters=dict(properties),
required=[str(x) for x in required if isinstance(x, (str, int))],
external=external,
)
class HermesExternalTool(Tool):
def __init__(self, schema: ToolSchema):
self._schema = schema
@property
def schema(self) -> ToolSchema:
return self._schema
async def execute(self, task_id: Optional[str] = None, **kwargs: Any) -> ToolResult:
# `model_tools.handle_function_call` returns a JSON string (success or error).
# Run in a thread because some Hermes tool handlers call `asyncio.run()`.
raw = await asyncio.to_thread(model_tools.handle_function_call, self.name, kwargs, task_id)
try:
parsed = json.loads(raw)
except Exception:
# Keep as plain string.
return ToolResult(success=True, output=str(raw))
if isinstance(parsed, dict) and parsed.get("error"):
return ToolResult(success=False, error=str(parsed.get("error")), output="")
return ToolResult(success=True, output=json.dumps(parsed, ensure_ascii=False))
def build_external_tools(
*,
selected_tool_names: Optional[set[str]] = None,
) -> List[HermesExternalTool]:
"""
Build external tool wrappers from Hermes tool declarations.
Filters out sandbox-oriented tools (e.g. `terminal`) since those should run
inside the sandbox via ToolExecutor.
"""
# IMPORTANT: Hermes' `model_tools.get_tool_definitions()` only understands Hermes toolsets.
# Atropos envs add extra toolsets (filesystem/sandbox/stateful). To avoid noisy "Unknown toolset"
# prints and accidental filtering, we fetch ALL Hermes tool definitions here and filter by name.
tools = model_tools.get_tool_definitions(enabled_toolsets=None, disabled_toolsets=None, quiet_mode=True)
wrappers: List[HermesExternalTool] = []
for t in tools:
schema = _schema_from_openai_tool_dict(t, external=True)
if schema.name in {"terminal"}:
continue
if selected_tool_names is not None and schema.name not in selected_tool_names:
continue
wrappers.append(HermesExternalTool(schema))
return wrappers
-99
View File
@@ -1,99 +0,0 @@
"""
Sandbox tool stubs for Atropos ToolExecutor.
These tools are executed inside the sandbox containers via:
ToolExecutor -> SlotPool -> sandbox_server.py
They intentionally do NOT execute anything on the host process. If they are
called directly (outside ToolExecutor), they return a clear error.
"""
from __future__ import annotations
from typing import Optional
from .base import Tool, ToolResult, ToolSchema
class TerminalTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="terminal",
description=(
"Execute a command inside the sandbox slot workspace and return stdout/stderr. "
"Filesystem persists within a trajectory slot. Background processes are not supported "
"in stateless mode. Commands run under POSIX /bin/sh and each tool call runs in a fresh "
"shell (no persisted env vars). Avoid bash-only syntax like `source`; prefer `. .venv/bin/activate` "
"or invoke `.venv/bin/python ...` directly."
),
parameters={
"command": {"type": "string", "description": "The command to execute"},
"timeout": {
"type": "integer",
"description": "Command timeout in seconds (optional).",
"minimum": 1,
},
"background": {
"type": "boolean",
"description": "Not supported in sandbox terminal (always false).",
"default": False,
},
},
required=["command"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(
success=False,
error="terminal must be executed via ToolExecutor inside the sandbox",
)
class BashTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="bash",
description="Execute a bash command inside the sandbox slot workspace.",
parameters={"command": {"type": "string", "description": "The bash command to execute"}},
required=["command"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(success=False, error="bash must be executed via ToolExecutor inside the sandbox")
class ReadFileTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="read_file",
description="Read a file from the sandbox slot workspace.",
parameters={"path": {"type": "string", "description": "Path to the file"}},
required=["path"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(success=False, error="read_file must be executed via ToolExecutor inside the sandbox")
class WriteFileTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="write_file",
description="Write a file into the sandbox slot workspace.",
parameters={
"path": {"type": "string", "description": "Path to the file"},
"content": {"type": "string", "description": "File content"},
},
required=["path", "content"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(success=False, error="write_file must be executed via ToolExecutor inside the sandbox")
-45
View File
@@ -1,45 +0,0 @@
"""
Stateful terminal tool schema.
This is a sandbox tool that routes to the sandbox server as `bash_stateful`
via ToolExecutor mapping. It exists to expose an explicit, opt-in terminal
primitive suitable for stateful workflows (e.g. tmux sessions / TUIs).
"""
from __future__ import annotations
from typing import Optional
from .base import Tool, ToolResult, ToolSchema
class TerminalStatefulTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="terminal_stateful",
description=(
"Execute a command in the sandbox, allowing stateful/background processes to persist "
"across tool calls within the same trajectory slot (e.g. tmux sessions). "
"Use sparingly; output is still non-interactive."
),
parameters={
"command": {"type": "string", "description": "The command to execute"},
"timeout": {
"type": "integer",
"description": "Command timeout in seconds (optional).",
"minimum": 1,
},
},
required=["command"],
)
def is_available(self) -> tuple[bool, str | None]:
return True, None
async def execute(self, command: str, timeout: Optional[int] = None) -> ToolResult:
_ = (command, timeout)
return ToolResult(
success=False,
error="terminal_stateful must be executed via ToolExecutor inside the sandbox",
)
-89
View File
@@ -1,89 +0,0 @@
"""
tmux tool schema (sandbox).
This is a sandbox tool that provides basic tmux session control suitable for
TUI-style terminal interactions:
- send keys (arrow keys, enter, etc.)
- capture the current screen buffer
Execution is routed by ToolExecutor to the sandbox server's `tmux` backend.
"""
from __future__ import annotations
from typing import Any, Dict, Optional
from .base import Tool, ToolResult, ToolSchema
class TmuxTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="tmux",
description=(
"Control a per-trajectory tmux session inside the sandbox (stateful terminal). "
"Use this for TUI-style interactions: send keys and capture the current screen."
),
parameters={
"action": {
"type": "string",
"description": "Action to perform: start | send_keys | stream | stop.",
"enum": ["start", "send_keys", "stream", "stop", "capture"],
},
"keys": {
"description": "Keys to send (string or list of strings) when action=send_keys.",
},
"block": {
"type": "boolean",
"description": "If true, wait for shell command completion (only valid at a shell prompt).",
"default": False,
},
"min_wait_s": {
"type": "number",
"description": "For non-blocking send_keys, sleep this long after sending keys (seconds).",
"default": 0.0,
},
"max_wait_s": {
"type": "number",
"description": "For blocking send_keys, max time to wait for completion (seconds).",
},
"capture_entire": {
"type": "boolean",
"description": "Deprecated. Streaming is preferred.",
"default": False,
},
"max_bytes": {
"type": "integer",
"description": "Max bytes to return per stream call.",
},
"reset": {
"type": "boolean",
"description": "If true, reset stream offset to the beginning of the asciinema recording.",
"default": False,
},
"pane_width": {
"type": "integer",
"description": "Pane width for action=start (columns).",
"minimum": 20,
},
"pane_height": {
"type": "integer",
"description": "Pane height for action=start (rows).",
"minimum": 10,
},
},
required=["action"],
)
def is_available(self) -> tuple[bool, str | None]:
return True, None
async def execute(self, **kwargs: Dict[str, Any]) -> ToolResult:
# This tool is intended to be executed via ToolExecutor -> sandbox server.
# We keep a safe fallback for non-sandbox contexts.
action = str(kwargs.get("action") or "").strip()
return ToolResult(
success=False,
error=f"tmux tool must be executed in the sandbox (got action={action!r})",
)
-500
View File
@@ -1,500 +0,0 @@
"""
ToolExecutor - queued, batched tool dispatch for multiplexed agent trajectories.
This component is responsible for:
- Maintaining trajectory -> Slot affinity (workspace continuity)
- Batching sandbox tool calls across trajectories to maximize container utilization
- Routing external tools (ToolSchema.external=True) to a ToolServer (Phase 4.5)
For now, only sandbox tools are executed:
- bash
- read_file
- write_file
"""
from __future__ import annotations
import asyncio
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
import httpx
from .base import (
ArtifactArchiveRequestPayload,
ArtifactArchiveResponsePayload,
ArtifactListRequestPayload,
ArtifactListResponsePayload,
ArtifactReadRequestPayload,
ArtifactReadResponsePayload,
ToolCall,
ToolCallPayload,
ToolRegistry,
ToolResult,
ToolResultPayload,
ToolServerExecuteRequest,
)
from ..backends.base import ToolBackend
from ..slots import Slot
@dataclass
class ToolExecutorConfig:
batch_window_ms: int = 20
max_batch_size: int = 200
allow_network: bool = True
require_sandbox: bool = False
require_stateful_sandbox: bool = False
tool_server_url: Optional[str] = None
tool_server_token: Optional[str] = None
@dataclass
class _QueuedToolRequest:
trajectory_id: str
call: ToolCall
timeout_s: Optional[float]
future: asyncio.Future
class ToolExecutor:
def __init__(
self,
backend: ToolBackend,
tools: ToolRegistry,
config: Optional[ToolExecutorConfig] = None,
) -> None:
self.backend = backend
self.tools = tools
self.config = config or ToolExecutorConfig()
self._queue: asyncio.Queue[Optional[_QueuedToolRequest]] = asyncio.Queue()
self._task: Optional[asyncio.Task] = None
self._stopping = asyncio.Event()
self._slots_lock = asyncio.Lock()
self._slot_by_trajectory: Dict[str, Slot] = {}
self._tool_server_client: Optional[httpx.AsyncClient] = None
self._tool_server_lock = asyncio.Lock()
# lightweight stats for status endpoints
self.total_requests: int = 0
self.total_errors: int = 0
self.latencies_s: List[float] = []
async def start(self) -> None:
if self._task is None:
self._task = asyncio.create_task(self._run_loop())
def queue_size(self) -> int:
return self._queue.qsize()
async def close(self) -> None:
self._stopping.set()
await self._queue.put(None)
if self._task:
await self._task
self._task = None
client = self._tool_server_client
self._tool_server_client = None
if client is not None:
await client.aclose()
# Best-effort release any remaining slots.
async with self._slots_lock:
slots = list(self._slot_by_trajectory.items())
self._slot_by_trajectory.clear()
for _, slot in slots:
try:
await self.backend.release(slot, reset_workspace=False)
except Exception:
pass
async def execute(
self,
trajectory_id: str,
call: ToolCall,
timeout_s: Optional[float] = None,
) -> ToolResult:
if self._task is None:
raise RuntimeError("ToolExecutor not started (call start() first)")
# Allow tool args to suggest a timeout (Hermes-compatible terminal tool),
# but never let the model choose "infinite" timeouts.
if timeout_s is None:
raw_timeout = call.arguments.get("timeout")
if isinstance(raw_timeout, (int, float)):
timeout_s = float(raw_timeout)
if timeout_s is not None:
timeout_s = max(1.0, min(float(timeout_s), 600.0))
loop = asyncio.get_running_loop()
fut: asyncio.Future = loop.create_future()
started = time.perf_counter()
await self._queue.put(_QueuedToolRequest(trajectory_id=trajectory_id, call=call, timeout_s=timeout_s, future=fut))
try:
result: ToolResult = await fut
return result
finally:
self.latencies_s.append(time.perf_counter() - started)
async def release_trajectory(self, trajectory_id: str, reset_workspace: bool = False) -> None:
async with self._slots_lock:
slot = self._slot_by_trajectory.pop(trajectory_id, None)
if slot is not None:
await self.backend.release(slot, reset_workspace=reset_workspace)
async def _get_slot_if_present(self, trajectory_id: str) -> Optional[Slot]:
async with self._slots_lock:
return self._slot_by_trajectory.get(trajectory_id)
# ---------------------------------------------------------------------
# Artifact helpers (optional)
# ---------------------------------------------------------------------
async def read_artifact(self, req: ArtifactReadRequestPayload) -> ArtifactReadResponsePayload:
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is None:
return ArtifactReadResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
data = await self.backend.read_artifact(
slot,
req.path,
encoding=req.encoding,
max_bytes=req.max_bytes,
include_sha256=req.include_sha256,
)
if isinstance(data, dict):
data = dict(data)
data.pop("http_status", None)
try:
return ArtifactReadResponsePayload(**(data or {}))
except Exception as e:
return ArtifactReadResponsePayload(success=False, error=f"Invalid artifact read response: {e}")
async def list_artifacts(self, req: ArtifactListRequestPayload) -> ArtifactListResponsePayload:
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is None:
return ArtifactListResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
data = await self.backend.list_artifacts(
slot,
req.path,
recursive=req.recursive,
max_entries=req.max_entries,
)
if isinstance(data, dict):
data = dict(data)
data.pop("http_status", None)
try:
return ArtifactListResponsePayload(**(data or {}))
except Exception as e:
return ArtifactListResponsePayload(success=False, error=f"Invalid artifact list response: {e}")
async def archive_artifacts(self, req: ArtifactArchiveRequestPayload) -> ArtifactArchiveResponsePayload:
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is None:
return ArtifactArchiveResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
data = await self.backend.archive_artifacts(
slot,
req.path,
archive_format=req.format,
max_bytes=req.max_bytes,
max_entries=req.max_entries,
)
if isinstance(data, dict):
data = dict(data)
data.pop("http_status", None)
try:
return ArtifactArchiveResponsePayload(**(data or {}))
except Exception as e:
return ArtifactArchiveResponsePayload(success=False, error=f"Invalid artifact archive response: {e}")
async def _get_or_acquire_slot(self, trajectory_id: str) -> Slot:
async with self._slots_lock:
existing = self._slot_by_trajectory.get(trajectory_id)
if existing is not None:
return existing
slot = await self.backend.acquire(trajectory_id)
async with self._slots_lock:
existing = self._slot_by_trajectory.get(trajectory_id)
if existing is not None:
# Another coroutine won the race; return its slot.
await self.backend.release(slot, reset_workspace=False)
return existing
self._slot_by_trajectory[trajectory_id] = slot
return slot
async def _run_loop(self) -> None:
pending: List[_QueuedToolRequest] = []
deadline: Optional[float] = None
batch_window_s = max(0.0, self.config.batch_window_ms / 1000.0)
max_batch = max(1, self.config.max_batch_size)
while True:
if self._stopping.is_set() and self._queue.empty() and not pending:
break
timeout = None
if pending and deadline is not None:
timeout = max(0.0, deadline - time.perf_counter())
try:
item = await asyncio.wait_for(self._queue.get(), timeout=timeout)
if item is None:
continue
pending.append(item)
if len(pending) == 1:
deadline = time.perf_counter() + batch_window_s
if len(pending) < max_batch:
continue
except asyncio.TimeoutError:
# batch window elapsed
pass
if not pending:
deadline = None
continue
batch = pending
pending = []
deadline = None
await self._execute_batch(batch)
async def _get_tool_server_client(self) -> httpx.AsyncClient:
url = self.config.tool_server_url
if not url:
raise RuntimeError("ToolServer not configured")
if self._tool_server_client is not None:
return self._tool_server_client
async with self._tool_server_lock:
if self._tool_server_client is None:
self._tool_server_client = httpx.AsyncClient(base_url=url.rstrip("/"))
return self._tool_server_client
def _tool_server_headers(self) -> Dict[str, str]:
token = self.config.tool_server_token
if not token:
return {}
return {"Authorization": f"Bearer {token}"}
async def _execute_external(self, req: _QueuedToolRequest) -> ToolResult:
client = await self._get_tool_server_client()
slot_id: Optional[str] = None
container_addr: Optional[str] = None
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is not None:
slot_id = slot.slot_id
container_addr = slot.container_addr
payload = ToolServerExecuteRequest(
trajectory_id=req.trajectory_id,
tool=ToolCallPayload.from_tool_call(req.call),
timeout_s=req.timeout_s,
slot_id=slot_id,
container_addr=container_addr,
)
try:
resp = await client.post(
"/execute",
json=payload.model_dump(),
headers=self._tool_server_headers(),
timeout=req.timeout_s,
)
resp.raise_for_status()
data = resp.json()
parsed = ToolResultPayload(**data)
result = parsed.to_tool_result()
if result.uniq_id is None:
result.uniq_id = req.call.uniq_id
return result
except Exception as e:
return ToolResult(
success=False,
error=f"External tool failed: {e}",
uniq_id=req.call.uniq_id,
)
async def _execute_batch(self, batch: List[_QueuedToolRequest]) -> None:
# Resolve tool schemas once per request and separate sandbox/external/unknown.
sandbox_items: List[_QueuedToolRequest] = []
external_items: List[_QueuedToolRequest] = []
unknown_items: List[_QueuedToolRequest] = []
for it in batch:
tool = self.tools.get(it.call.name)
if tool is None:
unknown_items.append(it)
continue
schema = tool.schema
if not schema.external:
sandbox_items.append(it)
else:
external_items.append(it)
for it in unknown_items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"Unknown tool: {it.call.name}",
uniq_id=it.call.uniq_id,
)
)
if external_items:
if not self.config.tool_server_url:
for it in external_items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"External tool not available (ToolServer not configured): {it.call.name}",
uniq_id=it.call.uniq_id,
)
)
else:
results = await asyncio.gather(*[self._execute_external(it) for it in external_items])
for it, res in zip(external_items, results):
self.total_requests += 1
if not getattr(res, "success", False):
self.total_errors += 1
if not it.future.done():
it.future.set_result(res)
if not sandbox_items:
return
# Acquire slots for the distinct trajectories in this batch.
try:
traj_ids = list({it.trajectory_id for it in sandbox_items})
slots = await asyncio.gather(*[self._get_or_acquire_slot(tid) for tid in traj_ids])
slot_by_traj = dict(zip(traj_ids, slots))
except Exception as e:
for it in sandbox_items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"Failed to acquire slot: {e}",
uniq_id=it.call.uniq_id,
)
)
return
# Group by timeout so we don't accidentally make short timeouts wait on long ones.
by_timeout: Dict[float, List[_QueuedToolRequest]] = {}
default_timeout = self.backend.default_timeout_s
for it in sandbox_items:
t = it.timeout_s
if t is None:
t = default_timeout
if t is None:
t = 30.0
by_timeout.setdefault(float(t), []).append(it)
for timeout_s, items in by_timeout.items():
requests = []
dispatched: List[_QueuedToolRequest] = []
for it in items:
slot = slot_by_traj[it.trajectory_id]
tool_name = it.call.name
args = dict(it.call.arguments)
# Hermes compatibility: treat `terminal` as an alias of sandbox `bash`.
if tool_name == "terminal":
if args.get("background"):
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error="terminal background execution is not supported in sandbox",
uniq_id=it.call.uniq_id,
)
)
continue
tool_name = "bash"
# `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
args.pop("timeout", None)
elif tool_name == "terminal_stateful":
tool_name = "bash_stateful"
args.pop("timeout", None)
elif tool_name == "tmux":
# `tmux` is a sandbox tool backed by the stateful session manager.
# Network policy is env-controlled.
args.pop("allow_network", None)
if tool_name == "bash":
# Network policy is set by the environment/executor, not by the model.
args.pop("allow_network", None)
args.pop("require_sandbox", None)
args["allow_network"] = bool(self.config.allow_network)
args["require_sandbox"] = bool(self.config.require_sandbox)
# `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
args.pop("timeout", None)
elif tool_name == "bash_stateful":
# Network policy is set by the environment/executor, not by the model.
args.pop("allow_network", None)
args.pop("require_sandbox", None)
args.pop("require_stateful_sandbox", None)
args["allow_network"] = bool(self.config.allow_network)
args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
args.pop("timeout", None)
elif tool_name == "tmux":
# Network policy applies to the underlying stateful session.
args.pop("allow_network", None)
args.pop("require_sandbox", None)
args.pop("require_stateful_sandbox", None)
args["allow_network"] = bool(self.config.allow_network)
args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
requests.append((slot, tool_name, args))
dispatched.append(it)
results = None
try:
if not dispatched:
continue
results = await self.backend.execute_batch(requests, timeout_s=timeout_s)
except Exception as e:
for it in items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"Batch execution failed: {e}",
uniq_id=it.call.uniq_id,
)
)
continue
for it, res in zip(dispatched, results):
self.total_requests += 1
if not getattr(res, "success", False):
self.total_errors += 1
tool_result = res.to_tool_result()
tool_result.uniq_id = it.call.uniq_id
if not it.future.done():
it.future.set_result(tool_result)
-88
View File
@@ -1,88 +0,0 @@
"""
Toolset resolution for Hermes-Agent Atropos integration.
We primarily reuse Hermes-Agent toolsets (`toolsets.py`), but Atropos training/envs
need a few extra sandbox-oriented toolsets that Hermes doesn't expose by default
(e.g. filesystem + stateful terminal).
"""
from __future__ import annotations
from typing import Any, Dict, List, Optional, Set
import toolsets as hermes_toolsets
ATROPOS_TOOLSETS: Dict[str, Dict[str, Any]] = {
"filesystem": {
"description": "Read/write files in the sandbox workspace.",
"tools": ["read_file", "write_file"],
"includes": [],
},
"terminal_stateful": {
"description": "Stateful terminal execution (tmux/TUI support) inside the sandbox.",
"tools": ["terminal_stateful", "tmux"],
"includes": [],
},
"sandbox": {
"description": "Sandbox tools (terminal + filesystem).",
"tools": [],
"includes": ["terminal", "filesystem"],
},
"default": {
"description": "Default toolset for Atropos AgentEnv tasks.",
"tools": [],
"includes": ["sandbox"],
},
"full": {
"description": "All Hermes tools plus Atropos sandbox additions.",
"tools": [],
"includes": ["all", "filesystem", "sandbox", "terminal_stateful"],
},
}
def validate_toolset(name: str) -> bool:
if name in {"all", "*"}:
return True
return hermes_toolsets.validate_toolset(name) or name in ATROPOS_TOOLSETS
def resolve_toolset(name: str, visited: Optional[Set[str]] = None) -> List[str]:
if visited is None:
visited = set()
if name in {"all", "*"}:
# Union Hermes + Atropos toolsets.
all_tools: Set[str] = set()
for tname in hermes_toolsets.get_toolset_names():
all_tools.update(resolve_toolset(tname, visited=set()))
for tname, spec in ATROPOS_TOOLSETS.items():
# Avoid recursion: some Atropos toolsets (e.g. "full") include "all".
if tname == "full" or "all" in (spec.get("includes") or []):
continue
all_tools.update(resolve_toolset(tname, visited=set()))
return sorted(all_tools)
if name in ATROPOS_TOOLSETS:
if name in visited:
return []
visited.add(name)
spec = ATROPOS_TOOLSETS[name]
tools: Set[str] = set(spec.get("tools", []))
for inc in spec.get("includes", []):
tools.update(resolve_toolset(inc, visited=set(visited)))
return sorted(tools)
# Fall back to Hermes toolsets.
# IMPORTANT: do not pre-add `name` to `visited` here; Hermes' resolver uses
# `visited` for its own cycle detection and will treat the presence of `name`
# as a circular dependency.
return sorted(hermes_toolsets.resolve_toolset(name, visited=set(visited)))
def resolve_multiple_toolsets(names: List[str]) -> List[str]:
tools: Set[str] = set()
for name in names:
tools.update(resolve_toolset(name, visited=set()))
return sorted(tools)
-415
View File
@@ -1,415 +0,0 @@
#!/usr/bin/env python3
"""
Atropos-compatible Hermes agent runner.
This is a minimal subclass of Hermes-Agent's `AIAgent` that swaps the OpenAI
function-calling backend for Atroposlib's `ManagedServer`/`ServerManager` backend
and uses Hermes-style XML tool tags:
- <tool_call>{"name": "...", "arguments": {...}}</tool_call>
- <tool_response>{...}</tool_response>
Tool observations are appended as `role="user"` messages containing one or more
`<tool_response>` blocks so they survive common chat templates during tokenization.
"""
from __future__ import annotations
import asyncio
import json
import re
import time
import warnings
import os
from contextlib import asynccontextmanager
from typing import Any, AsyncGenerator, Dict, List, Optional, Tuple
from model_tools import cleanup_vm, handle_function_call
from run_agent import AIAgent
_TOOL_CALL_RE = re.compile(r"<tool_call>\\s*(.*?)\\s*</tool_call>", re.DOTALL)
ATROPOS_TOOL_SYSTEM_PROMPT = """You are a helpful AI assistant with access to tools.
## Available Tools
<tools>
{tool_descriptions}
</tools>
## How to Use Tools
To call a tool, output:
<tool_call>{{"name": "tool_name", "arguments": {{"arg1": "value1"}}}}</tool_call>
You may include optional reasoning in <think>...</think> before tool calls.
After each tool call, you will receive tool results as:
<tool_response>{{...}}</tool_response>
Continue until finished, then provide a final response with no <tool_call> blocks.
"""
class AtroposAIAgent(AIAgent):
"""
Hermes `AIAgent` variant that uses Atroposlib ServerManager/ManagedServer.
Notes:
- The default Hermes `AIAgent` remains unchanged; this class is opt-in.
- The underlying server must expose `managed_server(tokenizer=...)` OR be a single
APIServer-compatible object usable by Atroposlib's `ManagedServer`.
"""
def __init__(
self,
*,
server: Any,
tokenizer: Any = None,
model: str = "local",
max_iterations: int = 10,
tool_delay: float = 0.0,
enabled_toolsets: Optional[List[str]] = None,
disabled_toolsets: Optional[List[str]] = None,
save_trajectories: bool = False,
verbose_logging: bool = False,
quiet_mode: bool = False,
ephemeral_system_prompt: Optional[str] = None,
log_prefix_chars: int = 100,
log_prefix: str = "",
session_id: Optional[str] = None,
temperature: Optional[float] = None,
max_tokens: Optional[int] = None,
):
# Call parent init mainly to reuse tool selection + trajectory saving utilities.
super().__init__(
base_url="http://unused",
api_key="dummy-key",
model=model,
max_iterations=max_iterations,
tool_delay=tool_delay,
enabled_toolsets=enabled_toolsets,
disabled_toolsets=disabled_toolsets,
save_trajectories=save_trajectories,
verbose_logging=verbose_logging,
quiet_mode=quiet_mode,
ephemeral_system_prompt=ephemeral_system_prompt,
log_prefix_chars=log_prefix_chars,
log_prefix=log_prefix,
session_id=session_id,
)
self.server = server
self.tokenizer = tokenizer
self.temperature = temperature
self.max_tokens = max_tokens
@asynccontextmanager
async def _managed(self) -> AsyncGenerator[Any, None]:
if hasattr(self.server, "managed_server"):
with warnings.catch_warnings():
warnings.filterwarnings(
"ignore",
message=r"Using OpenAIServer with managed_server does not allow for state tracking",
category=UserWarning,
)
async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
yield managed
return
# Fall back to directly wrapping a single server object.
from atroposlib.envs.server_handling.managed_server import ManagedServer
managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
try:
yield managed
finally:
managed.reset()
def _tool_descriptions_text(self) -> str:
if not self.tools:
return "(no tools available)"
parts: List[str] = []
for tool in self.tools:
fn = (tool or {}).get("function", {})
name = fn.get("name", "")
desc = (fn.get("description") or "").strip()
if not name:
continue
if desc:
parts.append(f"- {name}: {desc}")
else:
parts.append(f"- {name}")
return "\n".join(parts) if parts else "(no tools available)"
def _build_system_prompt(self, system_message: Optional[str]) -> Optional[str]:
tool_prompt = ATROPOS_TOOL_SYSTEM_PROMPT.format(
tool_descriptions=self._tool_descriptions_text()
)
parts: List[str] = []
if system_message:
parts.append(system_message)
if self.ephemeral_system_prompt:
parts.append(self.ephemeral_system_prompt)
parts.append(tool_prompt)
return "\n\n".join(parts)
def _parse_tool_calls(self, content: str) -> Tuple[List[Tuple[str, Dict[str, Any]]], List[str]]:
"""
Returns:
(calls, errors)
"""
calls: List[Tuple[str, Dict[str, Any]]] = []
errors: List[str] = []
for raw in _TOOL_CALL_RE.findall(content or ""):
try:
payload = json.loads(raw)
except json.JSONDecodeError as exc:
errors.append(f"Invalid JSON inside <tool_call>: {exc}")
continue
name = payload.get("name")
args = payload.get("arguments", {})
if not isinstance(name, str) or not name:
errors.append("Tool call missing 'name' string")
continue
if not isinstance(args, dict):
errors.append("Tool call 'arguments' must be an object")
continue
calls.append((name, args))
return calls, errors
async def run_conversation_async(
self,
user_message: str,
system_message: Optional[str] = None,
conversation_history: Optional[List[Dict[str, Any]]] = None,
task_id: Optional[str] = None,
) -> Dict[str, Any]:
import uuid
effective_task_id = task_id or str(uuid.uuid4())
messages: List[Dict[str, Any]] = conversation_history.copy() if conversation_history else []
messages.append({"role": "user", "content": user_message})
active_system_prompt = self._build_system_prompt(system_message)
api_call_count = 0
final_response: Optional[str] = None
managed_state: Optional[Dict[str, Any]] = None
completed = False
try:
async with self._managed() as managed:
while api_call_count < self.max_iterations:
api_call_count += 1
api_messages = messages.copy()
if active_system_prompt:
api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages
chat_kwargs: Dict[str, Any] = {"messages": api_messages, "n": 1}
if self.max_tokens is not None:
chat_kwargs["max_tokens"] = self.max_tokens
if self.temperature is not None:
chat_kwargs["temperature"] = self.temperature
# Prefer OpenAI tool calling when supported by the backend:
# - Many providers normalize Hermes-style <tool_call> tags into tool_calls when `tools` is provided.
# - ManagedServer (atroposlib) does prompt->completion conversion and does not support `tools`.
# Only pass `tools` when we're calling an OpenAI-compatible chat endpoint directly.
tool_schemas = self.tools if self.tools else None
managed_cls = type(managed).__name__
if tool_schemas and managed_cls != "ManagedServer":
chat_kwargs["tools"] = tool_schemas
if os.getenv("HERMES_DEBUG_ATROPOS_REQUEST") == "1":
meta = {
"managed_type": managed_cls,
"model": getattr(getattr(managed, "config", None), "model_name", self.model),
"base_url": getattr(getattr(managed, "config", None), "base_url", None),
"kwargs": chat_kwargs,
}
# Avoid dumping megabytes of data accidentally.
# (Messages can be large; this is still "full" but bounded.)
print("\n=== HERMES_DEBUG_ATROPOS_REQUEST ===", flush=True)
print(json.dumps(meta, ensure_ascii=False, indent=2)[:200_000], flush=True)
response = await managed.chat_completion(**chat_kwargs)
if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
try:
dumped = response.model_dump() # openai pydantic model
except Exception:
dumped = getattr(response, "__dict__", {"repr": repr(response)})
print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: ChatCompletion (raw) ===", flush=True)
print(json.dumps(dumped, ensure_ascii=False, indent=2), flush=True)
if hasattr(managed, "get_state"):
managed_state = managed.get_state()
msg = response.choices[0].message
assistant_content = (msg.content or "")
msg_reasoning = getattr(msg, "reasoning", None)
# Use tool_calls if the backend provides them (preferred).
structured_tool_calls = getattr(msg, "tool_calls", None)
# If the backend emits content="" but includes useful text in reasoning,
# use it for parsing *only if needed* (e.g. tool tags).
if assistant_content == "" and isinstance(msg_reasoning, str) and msg_reasoning:
if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: message.reasoning present (content empty) ===", flush=True)
print(msg_reasoning, flush=True)
assistant_msg: Dict[str, Any] = {"role": "assistant", "content": assistant_content}
if structured_tool_calls:
# Preserve tool_calls so the next request is consistent with OpenAI protocol.
try:
assistant_msg["tool_calls"] = [
{
"id": tc.id,
"type": tc.type,
"function": {"name": tc.function.name, "arguments": tc.function.arguments},
}
for tc in structured_tool_calls
]
except Exception:
# Best-effort; keep conversation moving.
pass
messages.append(assistant_msg)
# Mode A: OpenAI tool calling (preferred when supported)
if structured_tool_calls:
for tc in structured_tool_calls:
tool_start = time.time()
try:
tool_args = json.loads(tc.function.arguments or "{}")
except Exception:
tool_args = {}
tool_result = handle_function_call(tc.function.name, tool_args, effective_task_id)
tool_duration = time.time() - tool_start
# Keep the raw tool result as tool content (OpenAI protocol expects role=tool).
messages.append(
{
"role": "tool",
"tool_call_id": tc.id,
"content": tool_result,
}
)
if self.tool_delay and self.tool_delay > 0:
await asyncio.sleep(self.tool_delay)
# Continue loop after tool execution.
continue
# Mode B: Hermes XML tool tags in assistant text (fallback).
parse_source = assistant_content or (msg_reasoning or "")
tool_calls, parse_errors = self._parse_tool_calls(parse_source)
if parse_errors and not tool_calls:
# Ask the model to retry with valid tool JSON.
err_text = "; ".join(parse_errors[:3])
messages.append(
{
"role": "user",
"content": (
f"<tool_response>{json.dumps({'error': err_text}, ensure_ascii=False)}</tool_response>\n"
"The previous <tool_call> blocks were invalid. Please output valid JSON inside <tool_call>."
),
}
)
continue
if not tool_calls:
# No tool calls: treat as final answer.
final_response = (assistant_content or "").strip()
completed = True
break
tool_responses: List[str] = []
for tool_name, tool_args in tool_calls:
tool_start = time.time()
tool_result = handle_function_call(tool_name, tool_args, effective_task_id)
tool_duration = time.time() - tool_start
try:
parsed = json.loads(tool_result)
payload: Any = parsed
except Exception:
payload = tool_result
tool_payload = {
"name": tool_name,
"duration_s": round(tool_duration, 3),
"result": payload,
}
tool_responses.append(
f"<tool_response>{json.dumps(tool_payload, ensure_ascii=False)}</tool_response>"
)
if self.tool_delay and self.tool_delay > 0:
await asyncio.sleep(self.tool_delay)
messages.append({"role": "user", "content": "\n".join(tool_responses)})
if final_response is None:
final_response = "I've reached the maximum number of iterations."
finally:
try:
cleanup_vm(effective_task_id)
except Exception:
pass
# Save trajectory using Hermes formatting (optional).
self._save_trajectory(messages, user_message, completed=completed)
return {
"final_response": final_response,
"messages": messages,
"api_calls": api_call_count,
"completed": completed,
"managed_state": managed_state,
"system_prompt": active_system_prompt,
"task_id": effective_task_id,
}
def run_conversation(self, *args: Any, **kwargs: Any) -> Dict[str, Any]:
"""
Sync wrapper for convenience.
If called from within a running event loop (e.g. prompt_toolkit), this
runs the async conversation in a dedicated thread to avoid nested loops.
"""
try:
asyncio.get_running_loop()
except RuntimeError:
return asyncio.run(self.run_conversation_async(*args, **kwargs))
import queue
import threading
out: "queue.Queue[object]" = queue.Queue(maxsize=1)
def runner() -> None:
try:
out.put(asyncio.run(self.run_conversation_async(*args, **kwargs)))
except BaseException as exc: # noqa: BLE001
out.put(exc)
thread = threading.Thread(target=runner, daemon=True)
thread.start()
result = out.get()
if isinstance(result, BaseException):
raise result
return result # type: ignore[return-value]
+164 -26
View File
@@ -41,24 +41,17 @@ from toolset_distributions import (
sample_toolsets_from_distribution,
validate_distribution
)
from model_tools import TOOL_TO_TOOLSET_MAP
# Global configuration for worker processes
_WORKER_CONFIG = {}
# All possible tools - used to ensure consistent schema across all trajectory entries
# This is required because Arrow/Parquet (used by HuggingFace datasets) needs identical schemas
ALL_POSSIBLE_TOOLS = {
'terminal', 'web_search', 'web_extract',
'vision_analyze', 'image_generate', 'mixture_of_agents',
# Skills tools
'skills_categories', 'skills_list', 'skill_view',
# Browser automation tools
'browser_navigate', 'browser_snapshot', 'browser_click',
'browser_type', 'browser_scroll', 'browser_back',
'browser_press', 'browser_close', 'browser_get_images',
'browser_vision'
}
# All possible tools - auto-derived from the master mapping in model_tools.py.
# This stays in sync automatically when new tools are added to TOOL_TO_TOOLSET_MAP.
# Used for consistent schema in Arrow/Parquet (HuggingFace datasets) and for
# filtering corrupted entries during trajectory combination.
ALL_POSSIBLE_TOOLS = set(TOOL_TO_TOOLSET_MAP.keys())
# Default stats for tools that weren't used
DEFAULT_TOOL_STATS = {'count': 0, 'success': 0, 'failure': 0}
@@ -200,6 +193,42 @@ def _extract_tool_stats(messages: List[Dict[str, Any]]) -> Dict[str, Dict[str, i
return tool_stats
def _extract_reasoning_stats(messages: List[Dict[str, Any]]) -> Dict[str, int]:
"""
Count how many assistant turns have reasoning vs no reasoning.
Checks for <REASONING_SCRATCHPAD> in content or a non-empty 'reasoning' field
(native thinking tokens). Returns counts for tracking reasoning coverage.
Args:
messages: Message history
Returns:
Dict with 'total_assistant_turns', 'turns_with_reasoning', 'turns_without_reasoning'
"""
total = 0
with_reasoning = 0
for msg in messages:
if msg.get("role") != "assistant":
continue
total += 1
content = msg.get("content", "") or ""
has_scratchpad = "<REASONING_SCRATCHPAD>" in content
has_native_reasoning = bool(msg.get("reasoning", "").strip()) if msg.get("reasoning") else False
if has_scratchpad or has_native_reasoning:
with_reasoning += 1
return {
"total_assistant_turns": total,
"turns_with_reasoning": with_reasoning,
"turns_without_reasoning": total - with_reasoning,
"has_any_reasoning": with_reasoning > 0,
}
def _process_single_prompt(
prompt_index: int,
prompt_data: Dict[str, Any],
@@ -244,6 +273,10 @@ def _process_single_prompt(
providers_ignored=config.get("providers_ignored"),
providers_order=config.get("providers_order"),
provider_sort=config.get("provider_sort"),
max_tokens=config.get("max_tokens"),
reasoning_config=config.get("reasoning_config"),
prefill_messages=config.get("prefill_messages"),
skip_context_files=True, # Don't pollute trajectories with SOUL.md/AGENTS.md
)
# Run the agent with task_id to ensure each task gets its own isolated VM
@@ -252,6 +285,9 @@ def _process_single_prompt(
# Extract tool usage statistics
tool_stats = _extract_tool_stats(result["messages"])
# Extract reasoning coverage stats
reasoning_stats = _extract_reasoning_stats(result["messages"])
# Convert to trajectory format (using existing method)
trajectory = agent._convert_to_trajectory_format(
result["messages"],
@@ -264,6 +300,7 @@ def _process_single_prompt(
"prompt_index": prompt_index,
"trajectory": trajectory,
"tool_stats": tool_stats,
"reasoning_stats": reasoning_stats,
"completed": result["completed"],
"partial": result.get("partial", False),
"api_calls": result["api_calls"],
@@ -332,7 +369,9 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
# Initialize aggregated stats for this batch
batch_tool_stats = {}
batch_reasoning_stats = {"total_assistant_turns": 0, "turns_with_reasoning": 0, "turns_without_reasoning": 0}
completed_in_batch = []
discarded_no_reasoning = 0
# Process each prompt sequentially in this batch
for prompt_index, prompt_data in prompts_to_process:
@@ -346,6 +385,13 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
# Save trajectory if successful
if result["success"] and result["trajectory"]:
# Discard samples with zero reasoning across all turns
reasoning = result.get("reasoning_stats", {})
if not reasoning.get("has_any_reasoning", True):
print(f" 🚫 Prompt {prompt_index} discarded (no reasoning in any turn)")
discarded_no_reasoning += 1
continue
# Get and normalize tool stats for consistent schema across all entries
raw_tool_stats = result.get("tool_stats", {})
tool_stats = _normalize_tool_stats(raw_tool_stats)
@@ -386,6 +432,10 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
batch_tool_stats[tool_name]["success"] += stats["success"]
batch_tool_stats[tool_name]["failure"] += stats["failure"]
# Aggregate reasoning stats
for key in batch_reasoning_stats:
batch_reasoning_stats[key] += result.get("reasoning_stats", {}).get(key, 0)
# Only mark as completed if successfully saved (failed prompts can be retried on resume)
if result["success"] and result["trajectory"]:
completed_in_batch.append(prompt_index)
@@ -401,6 +451,8 @@ def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
"processed": len(prompts_to_process),
"skipped": len(batch_data) - len(prompts_to_process),
"tool_stats": batch_tool_stats,
"reasoning_stats": batch_reasoning_stats,
"discarded_no_reasoning": discarded_no_reasoning,
"completed_prompts": completed_in_batch
}
@@ -428,6 +480,10 @@ class BatchRunner:
providers_ignored: List[str] = None,
providers_order: List[str] = None,
provider_sort: str = None,
max_tokens: int = None,
reasoning_config: Dict[str, Any] = None,
prefill_messages: List[Dict[str, Any]] = None,
max_samples: int = None,
):
"""
Initialize the batch runner.
@@ -449,6 +505,10 @@ class BatchRunner:
providers_ignored (List[str]): OpenRouter providers to ignore (optional)
providers_order (List[str]): OpenRouter providers to try in order (optional)
provider_sort (str): Sort providers by price/throughput/latency (optional)
max_tokens (int): Maximum tokens for model responses (optional, uses model default if not set)
reasoning_config (Dict): OpenRouter reasoning config override (e.g. {"effort": "none"} to disable thinking)
prefill_messages (List[Dict]): Messages to prepend as prefilled conversation context (few-shot priming)
max_samples (int): Only process the first N samples from the dataset (optional, processes all if not set)
"""
self.dataset_file = Path(dataset_file)
self.batch_size = batch_size
@@ -466,6 +526,10 @@ class BatchRunner:
self.providers_ignored = providers_ignored
self.providers_order = providers_order
self.provider_sort = provider_sort
self.max_tokens = max_tokens
self.reasoning_config = reasoning_config
self.prefill_messages = prefill_messages
self.max_samples = max_samples
# Validate distribution
if not validate_distribution(distribution):
@@ -481,8 +545,12 @@ class BatchRunner:
# Statistics file
self.stats_file = self.output_dir / "statistics.json"
# Load dataset
# Load dataset (and optionally truncate to max_samples)
self.dataset = self._load_dataset()
if self.max_samples and self.max_samples < len(self.dataset):
full_count = len(self.dataset)
self.dataset = self.dataset[:self.max_samples]
print(f"✂️ Truncated dataset from {full_count} to {self.max_samples} samples (--max_samples)")
# Create batches
self.batches = self._create_batches()
@@ -735,6 +803,9 @@ class BatchRunner:
"providers_ignored": self.providers_ignored,
"providers_order": self.providers_order,
"provider_sort": self.provider_sort,
"max_tokens": self.max_tokens,
"reasoning_config": self.reasoning_config,
"prefill_messages": self.prefill_messages,
}
# For backward compatibility, still track by index (but this is secondary to content matching)
@@ -797,6 +868,8 @@ class BatchRunner:
# Aggregate all batch statistics and update checkpoint
all_completed_prompts = list(completed_prompts_set)
total_reasoning_stats = {"total_assistant_turns": 0, "turns_with_reasoning": 0, "turns_without_reasoning": 0}
for batch_result in results:
# Add newly completed prompts
all_completed_prompts.extend(batch_result.get("completed_prompts", []))
@@ -813,6 +886,10 @@ class BatchRunner:
total_tool_stats[tool_name]["count"] += stats["count"]
total_tool_stats[tool_name]["success"] += stats["success"]
total_tool_stats[tool_name]["failure"] += stats["failure"]
# Aggregate reasoning stats
for key in total_reasoning_stats:
total_reasoning_stats[key] += batch_result.get("reasoning_stats", {}).get(key, 0)
# Save final checkpoint
checkpoint_data["completed_prompts"] = all_completed_prompts
@@ -835,15 +912,8 @@ class BatchRunner:
combined_file = self.output_dir / "trajectories.jsonl"
print(f"\n📦 Combining ALL batch files into {combined_file.name}...")
VALID_TOOLS = {'web_search', 'web_extract', 'terminal', 'vision_analyze',
'image_generate', 'mixture_of_agents',
# Skills tools
'skills_categories', 'skills_list', 'skill_view',
# Browser automation tools
'browser_navigate', 'browser_snapshot', 'browser_click',
'browser_type', 'browser_scroll', 'browser_back',
'browser_press', 'browser_close', 'browser_get_images',
'browser_vision'}
# Valid tools auto-derived from model_tools.py — no manual updates needed
VALID_TOOLS = ALL_POSSIBLE_TOOLS
total_entries = 0
filtered_entries = 0
@@ -892,7 +962,8 @@ class BatchRunner:
"model": self.model,
"completed_at": datetime.now().isoformat(),
"duration_seconds": round(time.time() - start_time, 2),
"tool_statistics": total_tool_stats
"tool_statistics": total_tool_stats,
"reasoning_statistics": total_reasoning_stats,
}
with open(self.stats_file, 'w', encoding='utf-8') as f:
@@ -930,6 +1001,25 @@ class BatchRunner:
else:
print("No tool calls were made during this run.")
# Print reasoning coverage stats
total_discarded = sum(r.get("discarded_no_reasoning", 0) for r in results)
print(f"\n🧠 Reasoning Coverage:")
print("-" * 70)
total_turns = total_reasoning_stats["total_assistant_turns"]
with_reasoning = total_reasoning_stats["turns_with_reasoning"]
without_reasoning = total_reasoning_stats["turns_without_reasoning"]
if total_turns > 0:
pct_with = round(with_reasoning / total_turns * 100, 1)
pct_without = round(without_reasoning / total_turns * 100, 1)
print(f" Total assistant turns: {total_turns:,}")
print(f" With reasoning: {with_reasoning:,} ({pct_with}%)")
print(f" Without reasoning: {without_reasoning:,} ({pct_without}%)")
else:
print(" No assistant turns recorded.")
if total_discarded > 0:
print(f" 🚫 Samples discarded (zero reasoning): {total_discarded:,}")
print(f"\n💾 Results saved to: {self.output_dir}")
print(f" - Trajectories: trajectories.jsonl (combined)")
print(f" - Individual batches: batch_*.jsonl (for debugging)")
@@ -956,6 +1046,11 @@ def main(
providers_ignored: str = None,
providers_order: str = None,
provider_sort: str = None,
max_tokens: int = None,
reasoning_effort: str = None,
reasoning_disabled: bool = False,
prefill_messages_file: str = None,
max_samples: int = None,
):
"""
Run batch processing of agent prompts from a dataset.
@@ -979,6 +1074,11 @@ def main(
providers_ignored (str): Comma-separated list of OpenRouter providers to ignore (e.g. "together,deepinfra")
providers_order (str): Comma-separated list of OpenRouter providers to try in order (e.g. "anthropic,openai,google")
provider_sort (str): Sort providers by "price", "throughput", or "latency" (OpenRouter only)
max_tokens (int): Maximum tokens for model responses (optional, uses model default if not set)
reasoning_effort (str): OpenRouter reasoning effort level: "xhigh", "high", "medium", "low", "minimal", "none" (default: "xhigh")
reasoning_disabled (bool): Completely disable reasoning/thinking tokens (default: False)
prefill_messages_file (str): Path to JSON file containing prefill messages (list of {role, content} dicts)
max_samples (int): Only process the first N samples from the dataset (optional, processes all if not set)
Examples:
# Basic usage
@@ -990,9 +1090,13 @@ def main(
# Use specific distribution
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=image_test --distribution=image_gen
# With ephemeral system prompt (not saved to dataset)
# With disabled reasoning and max tokens
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run \\
--ephemeral_system_prompt="You are a helpful assistant focused on image generation."
--reasoning_disabled --max_tokens=128000
# With prefill messages from file
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run \\
--prefill_messages_file=configs/prefill_opus.json
# List available distributions
python batch_runner.py --list_distributions
@@ -1031,6 +1135,36 @@ def main(
providers_ignored_list = [p.strip() for p in providers_ignored.split(",")] if providers_ignored else None
providers_order_list = [p.strip() for p in providers_order.split(",")] if providers_order else None
# Build reasoning_config from CLI flags
# --reasoning_disabled takes priority, then --reasoning_effort, then default (xhigh)
reasoning_config = None
if reasoning_disabled:
# Completely disable reasoning/thinking tokens
reasoning_config = {"effort": "none"}
print("🧠 Reasoning: DISABLED (effort=none)")
elif reasoning_effort:
# Use specified effort level
valid_efforts = ["xhigh", "high", "medium", "low", "minimal", "none"]
if reasoning_effort not in valid_efforts:
print(f"❌ Error: --reasoning_effort must be one of: {', '.join(valid_efforts)}")
return
reasoning_config = {"enabled": True, "effort": reasoning_effort}
print(f"🧠 Reasoning effort: {reasoning_effort}")
# Load prefill messages from JSON file if provided
prefill_messages = None
if prefill_messages_file:
try:
with open(prefill_messages_file, 'r', encoding='utf-8') as f:
prefill_messages = json.load(f)
if not isinstance(prefill_messages, list):
print(f"❌ Error: prefill_messages_file must contain a JSON array of messages")
return
print(f"💬 Loaded {len(prefill_messages)} prefill messages from {prefill_messages_file}")
except Exception as e:
print(f"❌ Error loading prefill messages: {e}")
return
# Initialize and run batch runner
try:
runner = BatchRunner(
@@ -1050,6 +1184,10 @@ def main(
providers_ignored=providers_ignored_list,
providers_order=providers_order_list,
provider_sort=provider_sort,
max_tokens=max_tokens,
reasoning_config=reasoning_config,
prefill_messages=prefill_messages,
max_samples=max_samples,
)
runner.run(resume=resume)
+66 -16
View File
@@ -7,7 +7,7 @@
# =============================================================================
model:
# Default model to use (can be overridden with --model flag)
default: "anthropic/claude-sonnet-4"
default: "anthropic/claude-opus-4.6"
# API configuration (falls back to OPENROUTER_API_KEY env var)
# api_key: "your-key-here" # Uncomment to set here instead of .env
@@ -23,9 +23,12 @@ model:
# OPTION 1: Local execution (default)
# Commands run directly on your machine in the current directory
# -----------------------------------------------------------------------------
# Working directory behavior:
# - CLI (`hermes` command): Uses "." (current directory where you run hermes)
# - Messaging (Telegram/Discord): Uses MESSAGING_CWD from .env (default: home)
terminal:
env_type: "local"
cwd: "." # Use "." for current directory, or specify absolute path
backend: "local"
cwd: "." # For local backend: "." = current directory. Ignored for remote backends.
timeout: 180
lifetime_seconds: 300
# sudo_password: "" # Enable sudo commands (pipes via sudo -S) - SECURITY WARNING: plaintext!
@@ -36,8 +39,8 @@ terminal:
# Great for: keeping agent isolated from its own code, using powerful remote hardware
# -----------------------------------------------------------------------------
# terminal:
# env_type: "ssh"
# cwd: "/home/myuser/project"
# backend: "ssh"
# cwd: "/home/myuser/project" # Path on the REMOTE server
# timeout: 180
# lifetime_seconds: 300
# ssh_host: "my-server.example.com"
@@ -51,11 +54,11 @@ terminal:
# Great for: reproducible environments, testing, isolation
# -----------------------------------------------------------------------------
# terminal:
# env_type: "docker"
# cwd: "/workspace"
# backend: "docker"
# cwd: "/workspace" # Path INSIDE the container (default: /)
# timeout: 180
# lifetime_seconds: 300
# docker_image: "python:3.11"
# docker_image: "nikolaik/python-nodejs:python3.11-nodejs20"
# -----------------------------------------------------------------------------
# OPTION 4: Singularity/Apptainer container
@@ -63,11 +66,11 @@ terminal:
# Great for: HPC clusters, shared compute environments
# -----------------------------------------------------------------------------
# terminal:
# env_type: "singularity"
# cwd: "/workspace"
# backend: "singularity"
# cwd: "/workspace" # Path INSIDE the container (default: /root)
# timeout: 180
# lifetime_seconds: 300
# singularity_image: "docker://python:3.11"
# singularity_image: "docker://nikolaik/python-nodejs:python3.11-nodejs20"
# -----------------------------------------------------------------------------
# OPTION 5: Modal cloud execution
@@ -75,11 +78,11 @@ terminal:
# Great for: GPU access, scalable compute, serverless execution
# -----------------------------------------------------------------------------
# terminal:
# env_type: "modal"
# cwd: "/workspace"
# backend: "modal"
# cwd: "/workspace" # Path INSIDE the sandbox (default: /root)
# timeout: 180
# lifetime_seconds: 300
# modal_image: "python:3.11"
# modal_image: "nikolaik/python-nodejs:python3.11-nodejs20"
# -----------------------------------------------------------------------------
# SUDO SUPPORT (works with ALL backends above)
@@ -112,12 +115,41 @@ browser:
# after this period of no activity between agent loops (default: 120 = 2 minutes)
inactivity_timeout: 120
# =============================================================================
# Context Compression (Auto-shrinks long conversations)
# =============================================================================
# When conversation approaches model's context limit, middle turns are
# automatically summarized to free up space while preserving important context.
#
# HOW IT WORKS:
# 1. Tracks actual token usage from API responses (not estimates)
# 2. When prompt_tokens >= threshold% of model's context_length, triggers compression
# 3. Protects first 3 turns (system prompt, initial request, first response)
# 4. Protects last 4 turns (recent context is most relevant)
# 5. Summarizes middle turns using a fast/cheap model
# 6. Inserts summary as a user message, continues conversation seamlessly
#
compression:
# Enable automatic context compression (default: true)
# Set to false if you prefer to manage context manually or want errors on overflow
enabled: true
# Trigger compression at this % of model's context limit (default: 0.85 = 85%)
# Lower values = more aggressive compression, higher values = compress later
threshold: 0.85
# Model to use for generating summaries (fast/cheap recommended)
# This model compresses the middle turns into a concise summary
summary_model: "google/gemini-3-flash-preview"
# =============================================================================
# Agent Behavior
# =============================================================================
agent:
# Maximum conversation turns before stopping
max_turns: 20
# Maximum tool-calling iterations per conversation
# Higher = more room for complex tasks, but costs more tokens
# Recommended: 20-30 for focused tasks, 50-100 for open exploration
max_turns: 60
# Enable verbose logging
verbose: false
@@ -212,6 +244,24 @@ toolsets:
# toolsets:
# - safe
# =============================================================================
# Voice Transcription (Speech-to-Text)
# =============================================================================
# Automatically transcribe voice messages on messaging platforms.
# Requires OPENAI_API_KEY in .env (uses OpenAI Whisper API directly).
stt:
enabled: true
model: "whisper-1" # whisper-1 (cheapest) | gpt-4o-mini-transcribe | gpt-4o-transcribe
# =============================================================================
# Response Pacing (Messaging Platforms)
# =============================================================================
# Add human-like delays between message chunks.
# human_delay:
# mode: "off" # "off" | "natural" | "custom"
# min_ms: 800 # Min delay (custom mode only)
# max_ms: 2500 # Max delay (custom mode only)
# =============================================================================
# Session Logging
# =============================================================================
+821 -278
View File
File diff suppressed because it is too large Load Diff
+36
View File
@@ -0,0 +1,36 @@
"""
Cron job scheduling system for Hermes Agent.
This module provides scheduled task execution, allowing the agent to:
- Run automated tasks on schedules (cron expressions, intervals, one-shot)
- Self-schedule reminders and follow-up tasks
- Execute tasks in isolated sessions (no prior context)
Usage:
# Run due jobs (for system cron integration)
python -c "from cron import tick; tick()"
# Or via CLI
python cli.py --cron-daemon
"""
from cron.jobs import (
create_job,
get_job,
list_jobs,
remove_job,
update_job,
JOBS_FILE,
)
from cron.scheduler import tick, run_daemon
__all__ = [
"create_job",
"get_job",
"list_jobs",
"remove_job",
"update_job",
"tick",
"run_daemon",
"JOBS_FILE",
]
+383
View File
@@ -0,0 +1,383 @@
"""
Cron job storage and management.
Jobs are stored in ~/.hermes/cron/jobs.json
Output is saved to ~/.hermes/cron/output/{job_id}/{timestamp}.md
"""
import json
import os
import re
import uuid
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional, Dict, List, Any
try:
from croniter import croniter
HAS_CRONITER = True
except ImportError:
HAS_CRONITER = False
# =============================================================================
# Configuration
# =============================================================================
HERMES_DIR = Path.home() / ".hermes"
CRON_DIR = HERMES_DIR / "cron"
JOBS_FILE = CRON_DIR / "jobs.json"
OUTPUT_DIR = CRON_DIR / "output"
def ensure_dirs():
"""Ensure cron directories exist."""
CRON_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
# =============================================================================
# Schedule Parsing
# =============================================================================
def parse_duration(s: str) -> int:
"""
Parse duration string into minutes.
Examples:
"30m" 30
"2h" 120
"1d" 1440
"""
s = s.strip().lower()
match = re.match(r'^(\d+)\s*(m|min|mins|minute|minutes|h|hr|hrs|hour|hours|d|day|days)$', s)
if not match:
raise ValueError(f"Invalid duration: '{s}'. Use format like '30m', '2h', or '1d'")
value = int(match.group(1))
unit = match.group(2)[0] # First char: m, h, or d
multipliers = {'m': 1, 'h': 60, 'd': 1440}
return value * multipliers[unit]
def parse_schedule(schedule: str) -> Dict[str, Any]:
"""
Parse schedule string into structured format.
Returns dict with:
- kind: "once" | "interval" | "cron"
- For "once": "run_at" (ISO timestamp)
- For "interval": "minutes" (int)
- For "cron": "expr" (cron expression)
Examples:
"30m" once in 30 minutes
"2h" once in 2 hours
"every 30m" recurring every 30 minutes
"every 2h" recurring every 2 hours
"0 9 * * *" cron expression
"2026-02-03T14:00" once at timestamp
"""
schedule = schedule.strip()
original = schedule
schedule_lower = schedule.lower()
# "every X" pattern → recurring interval
if schedule_lower.startswith("every "):
duration_str = schedule[6:].strip()
minutes = parse_duration(duration_str)
return {
"kind": "interval",
"minutes": minutes,
"display": f"every {minutes}m"
}
# Check for cron expression (5 or 6 space-separated fields)
# Cron fields: minute hour day month weekday [year]
parts = schedule.split()
if len(parts) >= 5 and all(
re.match(r'^[\d\*\-,/]+$', p) for p in parts[:5]
):
if not HAS_CRONITER:
raise ValueError("Cron expressions require 'croniter' package. Install with: pip install croniter")
# Validate cron expression
try:
croniter(schedule)
except Exception as e:
raise ValueError(f"Invalid cron expression '{schedule}': {e}")
return {
"kind": "cron",
"expr": schedule,
"display": schedule
}
# ISO timestamp (contains T or looks like date)
if 'T' in schedule or re.match(r'^\d{4}-\d{2}-\d{2}', schedule):
try:
# Parse and validate
dt = datetime.fromisoformat(schedule.replace('Z', '+00:00'))
return {
"kind": "once",
"run_at": dt.isoformat(),
"display": f"once at {dt.strftime('%Y-%m-%d %H:%M')}"
}
except ValueError as e:
raise ValueError(f"Invalid timestamp '{schedule}': {e}")
# Duration like "30m", "2h", "1d" → one-shot from now
try:
minutes = parse_duration(schedule)
run_at = datetime.now() + timedelta(minutes=minutes)
return {
"kind": "once",
"run_at": run_at.isoformat(),
"display": f"once in {original}"
}
except ValueError:
pass
raise ValueError(
f"Invalid schedule '{original}'. Use:\n"
f" - Duration: '30m', '2h', '1d' (one-shot)\n"
f" - Interval: 'every 30m', 'every 2h' (recurring)\n"
f" - Cron: '0 9 * * *' (cron expression)\n"
f" - Timestamp: '2026-02-03T14:00:00' (one-shot at time)"
)
def compute_next_run(schedule: Dict[str, Any], last_run_at: Optional[str] = None) -> Optional[str]:
"""
Compute the next run time for a schedule.
Returns ISO timestamp string, or None if no more runs.
"""
now = datetime.now()
if schedule["kind"] == "once":
run_at = datetime.fromisoformat(schedule["run_at"])
# If in the future, return it; if in the past, no more runs
return schedule["run_at"] if run_at > now else None
elif schedule["kind"] == "interval":
minutes = schedule["minutes"]
if last_run_at:
# Next run is last_run + interval
last = datetime.fromisoformat(last_run_at)
next_run = last + timedelta(minutes=minutes)
else:
# First run is now + interval
next_run = now + timedelta(minutes=minutes)
return next_run.isoformat()
elif schedule["kind"] == "cron":
if not HAS_CRONITER:
return None
cron = croniter(schedule["expr"], now)
next_run = cron.get_next(datetime)
return next_run.isoformat()
return None
# =============================================================================
# Job CRUD Operations
# =============================================================================
def load_jobs() -> List[Dict[str, Any]]:
"""Load all jobs from storage."""
ensure_dirs()
if not JOBS_FILE.exists():
return []
try:
with open(JOBS_FILE, 'r', encoding='utf-8') as f:
data = json.load(f)
return data.get("jobs", [])
except (json.JSONDecodeError, IOError):
return []
def save_jobs(jobs: List[Dict[str, Any]]):
"""Save all jobs to storage."""
ensure_dirs()
with open(JOBS_FILE, 'w', encoding='utf-8') as f:
json.dump({"jobs": jobs, "updated_at": datetime.now().isoformat()}, f, indent=2)
def create_job(
prompt: str,
schedule: str,
name: Optional[str] = None,
repeat: Optional[int] = None,
deliver: Optional[str] = None,
origin: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
"""
Create a new cron job.
Args:
prompt: The prompt to run (must be self-contained)
schedule: Schedule string (see parse_schedule)
name: Optional friendly name
repeat: How many times to run (None = forever, 1 = once)
deliver: Where to deliver output ("origin", "local", "telegram", etc.)
origin: Source info where job was created (for "origin" delivery)
Returns:
The created job dict
"""
parsed_schedule = parse_schedule(schedule)
# Auto-set repeat=1 for one-shot schedules if not specified
if parsed_schedule["kind"] == "once" and repeat is None:
repeat = 1
# Default delivery to origin if available, otherwise local
if deliver is None:
deliver = "origin" if origin else "local"
job_id = uuid.uuid4().hex[:12]
now = datetime.now().isoformat()
job = {
"id": job_id,
"name": name or prompt[:50].strip(),
"prompt": prompt,
"schedule": parsed_schedule,
"schedule_display": parsed_schedule.get("display", schedule),
"repeat": {
"times": repeat, # None = forever
"completed": 0
},
"enabled": True,
"created_at": now,
"next_run_at": compute_next_run(parsed_schedule),
"last_run_at": None,
"last_status": None,
"last_error": None,
# Delivery configuration
"deliver": deliver,
"origin": origin, # Tracks where job was created for "origin" delivery
}
jobs = load_jobs()
jobs.append(job)
save_jobs(jobs)
return job
def get_job(job_id: str) -> Optional[Dict[str, Any]]:
"""Get a job by ID."""
jobs = load_jobs()
for job in jobs:
if job["id"] == job_id:
return job
return None
def list_jobs(include_disabled: bool = False) -> List[Dict[str, Any]]:
"""List all jobs, optionally including disabled ones."""
jobs = load_jobs()
if not include_disabled:
jobs = [j for j in jobs if j.get("enabled", True)]
return jobs
def update_job(job_id: str, updates: Dict[str, Any]) -> Optional[Dict[str, Any]]:
"""Update a job by ID."""
jobs = load_jobs()
for i, job in enumerate(jobs):
if job["id"] == job_id:
jobs[i] = {**job, **updates}
save_jobs(jobs)
return jobs[i]
return None
def remove_job(job_id: str) -> bool:
"""Remove a job by ID."""
jobs = load_jobs()
original_len = len(jobs)
jobs = [j for j in jobs if j["id"] != job_id]
if len(jobs) < original_len:
save_jobs(jobs)
return True
return False
def mark_job_run(job_id: str, success: bool, error: Optional[str] = None):
"""
Mark a job as having been run.
Updates last_run_at, last_status, increments completed count,
computes next_run_at, and auto-deletes if repeat limit reached.
"""
jobs = load_jobs()
for i, job in enumerate(jobs):
if job["id"] == job_id:
now = datetime.now().isoformat()
job["last_run_at"] = now
job["last_status"] = "ok" if success else "error"
job["last_error"] = error if not success else None
# Increment completed count
if job.get("repeat"):
job["repeat"]["completed"] = job["repeat"].get("completed", 0) + 1
# Check if we've hit the repeat limit
times = job["repeat"].get("times")
completed = job["repeat"]["completed"]
if times is not None and completed >= times:
# Remove the job (limit reached)
jobs.pop(i)
save_jobs(jobs)
return
# Compute next run
job["next_run_at"] = compute_next_run(job["schedule"], now)
# If no next run (one-shot completed), disable
if job["next_run_at"] is None:
job["enabled"] = False
save_jobs(jobs)
return
save_jobs(jobs)
def get_due_jobs() -> List[Dict[str, Any]]:
"""Get all jobs that are due to run now."""
now = datetime.now()
jobs = load_jobs()
due = []
for job in jobs:
if not job.get("enabled", True):
continue
next_run = job.get("next_run_at")
if not next_run:
continue
next_run_dt = datetime.fromisoformat(next_run)
if next_run_dt <= now:
due.append(job)
return due
def save_job_output(job_id: str, output: str):
"""Save job output to file."""
ensure_dirs()
job_output_dir = OUTPUT_DIR / job_id
job_output_dir.mkdir(parents=True, exist_ok=True)
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
output_file = job_output_dir / f"{timestamp}.md"
with open(output_file, 'w', encoding='utf-8') as f:
f.write(output)
return output_file
+188
View File
@@ -0,0 +1,188 @@
"""
Cron job scheduler - executes due jobs.
This module provides:
- tick(): Run all due jobs once (for system cron integration)
- run_daemon(): Run continuously, checking every 60 seconds
"""
import os
import sys
import time
import traceback
from datetime import datetime
from pathlib import Path
from typing import Optional
# Add parent directory to path for imports
sys.path.insert(0, str(Path(__file__).parent.parent))
from cron.jobs import get_due_jobs, mark_job_run, save_job_output
def run_job(job: dict) -> tuple[bool, str, Optional[str]]:
"""
Execute a single cron job.
Returns:
Tuple of (success, output, error_message)
"""
from run_agent import AIAgent
job_id = job["id"]
job_name = job["name"]
prompt = job["prompt"]
print(f"[cron] Running job '{job_name}' (ID: {job_id})")
print(f"[cron] Prompt: {prompt[:100]}{'...' if len(prompt) > 100 else ''}")
try:
# Create agent with default settings
# Jobs run in isolated sessions (no prior context)
agent = AIAgent(
model=os.getenv("HERMES_MODEL", "anthropic/claude-opus-4.6"),
quiet_mode=True,
session_id=f"cron_{job_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
)
# Run the conversation
result = agent.run_conversation(prompt)
# Extract final response
final_response = result.get("final_response", "")
if not final_response:
final_response = "(No response generated)"
# Build output document
output = f"""# Cron Job: {job_name}
**Job ID:** {job_id}
**Run Time:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
**Schedule:** {job.get('schedule_display', 'N/A')}
## Prompt
{prompt}
## Response
{final_response}
"""
print(f"[cron] Job '{job_name}' completed successfully")
return True, output, None
except Exception as e:
error_msg = f"{type(e).__name__}: {str(e)}"
print(f"[cron] Job '{job_name}' failed: {error_msg}")
# Build error output
output = f"""# Cron Job: {job_name} (FAILED)
**Job ID:** {job_id}
**Run Time:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
**Schedule:** {job.get('schedule_display', 'N/A')}
## Prompt
{prompt}
## Error
```
{error_msg}
{traceback.format_exc()}
```
"""
return False, output, error_msg
def tick(verbose: bool = True) -> int:
"""
Check and run all due jobs.
This is designed to be called by system cron every minute:
*/1 * * * * cd ~/hermes-agent && python -c "from cron import tick; tick()"
Args:
verbose: Whether to print status messages
Returns:
Number of jobs executed
"""
due_jobs = get_due_jobs()
if verbose and not due_jobs:
print(f"[cron] {datetime.now().strftime('%H:%M:%S')} - No jobs due")
return 0
if verbose:
print(f"[cron] {datetime.now().strftime('%H:%M:%S')} - {len(due_jobs)} job(s) due")
executed = 0
for job in due_jobs:
try:
success, output, error = run_job(job)
# Save output to file
output_file = save_job_output(job["id"], output)
if verbose:
print(f"[cron] Output saved to: {output_file}")
# Mark job as run (handles repeat counting, next_run computation)
mark_job_run(job["id"], success, error)
executed += 1
except Exception as e:
print(f"[cron] Error processing job {job['id']}: {e}")
mark_job_run(job["id"], False, str(e))
return executed
def run_daemon(check_interval: int = 60, verbose: bool = True):
"""
Run the cron daemon continuously.
Checks for due jobs every `check_interval` seconds.
Args:
check_interval: Seconds between checks (default: 60)
verbose: Whether to print status messages
"""
print(f"[cron] Starting daemon (checking every {check_interval}s)")
print(f"[cron] Press Ctrl+C to stop")
print()
try:
while True:
try:
tick(verbose=verbose)
except Exception as e:
print(f"[cron] Tick error: {e}")
time.sleep(check_interval)
except KeyboardInterrupt:
print("\n[cron] Daemon stopped")
if __name__ == "__main__":
# Allow running directly: python cron/scheduler.py [daemon|tick]
import argparse
parser = argparse.ArgumentParser(description="Hermes Cron Scheduler")
parser.add_argument("mode", choices=["daemon", "tick"], default="tick", nargs="?",
help="Mode: 'tick' to run once, 'daemon' to run continuously")
parser.add_argument("--interval", type=int, default=60,
help="Check interval in seconds for daemon mode")
parser.add_argument("--quiet", "-q", action="store_true",
help="Suppress status messages")
args = parser.parse_args()
if args.mode == "daemon":
run_daemon(check_interval=args.interval, verbose=not args.quiet)
else:
tick(verbose=not args.quiet)
-224
View File
@@ -1,224 +0,0 @@
# Modal Backend
Hermes Agent uses [Modal](https://modal.com) for scalable, isolated cloud execution environments. There are two Modal integrations:
1. **Terminal Tool** (`tools/terminal_tool.py`) - For CLI/agent command execution
2. **Atropos Backend** (`atropos/backends/modal_backend.py`) - For batch RL training workloads
---
## Terminal Tool (CLI/Agent)
The terminal tool provides a simple interface for executing commands in Modal sandboxes.
### Configuration
Set environment variables:
```bash
export TERMINAL_ENV=modal
export TERMINAL_MODAL_IMAGE=python:3.11
export TERMINAL_MODAL_APP_NAME=hermes-sandbox
```
Or use a YAML config file (`modal_profiles.yaml`):
```yaml
profiles:
default:
image: python:3.11
cpu: 1.0
memory: 2048
min_pool: 1
max_pool: 5
idle_timeout: 120
gpu:
image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
gpu: T4
memory: 16384
min_pool: 0
max_pool: 2
```
### Features
| Feature | Description |
|---------|-------------|
| **Sandbox Pool** | Pre-warmed sandboxes for low latency |
| **Auto-scaling** | Grows/shrinks pool based on demand |
| **Idle Timeout** | Sandboxes auto-terminate when unused |
| **Profile Selection** | Different configs for different workloads |
| **Credential Injection** | `modal.Secret` integration |
### Usage
```python
from tools.terminal_tool import terminal_tool
# Simple command
output = terminal_tool("echo hello", task_id="my-task")
# With profile selection
output = terminal_tool("python train.py", task_id="training", profile="gpu")
# Cleanup when done
from tools.terminal_tool import cleanup_vm
cleanup_vm("my-task")
```
### Architecture
```
_ModalPoolManager (singleton)
├── "default" pool → [sandbox-0, sandbox-1, ...]
└── "gpu" pool → [sandbox-0, ...]
Each pool:
- Maintains min_pool warm sandboxes
- Scales up to max_pool on demand
- Background thread scales down idle sandboxes
```
---
## Atropos Backend (RL Training)
The Atropos backend is designed for high-throughput batch execution during reinforcement learning training.
### Key Concept: Slot-based Multiplexing
Instead of one sandbox per trajectory, multiple trajectories share sandboxes via **slots**:
```
Sandbox (1 container)
├── Slot 0 → Trajectory A (workspace: /data/slot_0)
├── Slot 1 → Trajectory B (workspace: /data/slot_1)
└── Slot 2 → Trajectory C (workspace: /data/slot_2)
```
**Benefits**:
- Fewer containers = lower cost
- Shared warm-up time
- Better GPU utilization
### Configuration
```python
from atropos.backends.modal_backend import ModalSandboxConfig, ModalToolBackend
config = ModalSandboxConfig(
name="default",
image="python:3.11",
cpu=1.0,
memory=2048,
slots_per_sandbox=10, # 10 trajectories per container
min_sandboxes=1,
max_sandboxes=5,
)
backend = ModalToolBackend(config.with_app_name("my-training"))
```
### Multi-Profile Support
Different trajectory types can request different resources:
```python
backend = ModalToolBackend.with_profiles(
app_name="rl-training",
profiles={
"default": ModalSandboxConfig(
name="default",
cpu=1.0,
memory=2048,
),
"pytorch-gpu": ModalSandboxConfig(
name="pytorch-gpu",
image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
gpu="T4",
memory=16384,
),
}
)
# CPU task
slot1 = await backend.acquire("traj-1", profile="default")
# GPU task
slot2 = await backend.acquire("traj-2", profile="pytorch-gpu")
```
### Batched Execution
The key optimization - execute many commands in parallel:
```python
# Acquire slots for multiple trajectories
slots = [await backend.acquire(f"traj-{i}") for i in range(50)]
# Execute batch across all slots in parallel
results = await backend.execute_batch([
(slot, "bash", {"command": "python step.py"})
for slot in slots
])
# Release slots
for slot in slots:
await backend.release(slot)
```
### Architecture
```
ModalToolBackend
└── _ModalMultiProfileManager
├── "default" → _ModalSandboxPool
│ ├── Sandbox 0 (slots 0-9)
│ └── Sandbox 1 (slots 0-9)
└── "pytorch-gpu" → _ModalSandboxPool
└── Sandbox 0 (slots 0-9)
```
---
## Credentials
Inject secrets securely using Modal's secret management:
```bash
# Create secret in Modal dashboard or CLI
modal secret create my-api-key API_KEY=sk-xxx
```
```python
# Reference in config
config = ModalSandboxConfig(
secrets=["my-api-key"], # Modal secret names
env_vars={"DEBUG": "1"}, # Additional env vars
)
```
## Troubleshooting
### "Modal package not installed"
```bash
pip install modal
modal token new # Authenticate
```
### "Sandbox creation failed"
- Check Modal dashboard for quota limits
- Verify image exists and is accessible
- Check secret names are correct
### Shutdown errors
These are harmless warnings during Python interpreter shutdown:
```
[Modal] Error terminating ...: cannot schedule new futures after interpreter shutdown
```
The sandboxes will auto-terminate via Modal's idle_timeout anyway.
+32
View File
@@ -250,6 +250,38 @@ This is useful for:
- Replaying conversations
- Training data inspection
### Context Compression
Long conversations can exceed model context limits. The CLI automatically compresses context when approaching the limit:
```yaml
# In cli-config.yaml
compression:
enabled: true # Enable auto-compression
threshold: 0.85 # Compress at 85% of context limit
summary_model: "google/gemini-2.0-flash-001"
```
**How it works:**
1. Tracks actual token usage from each API response
2. When tokens reach threshold, middle turns are summarized
3. First 3 and last 4 turns are always protected
4. Conversation continues seamlessly after compression
**When compression triggers:**
```
📦 Context compression triggered (170,000 tokens ≥ 170,000 threshold)
📊 Model context limit: 200,000 tokens (85% = 170,000)
🗜️ Summarizing turns 4-15 (12 turns)
✅ Compressed: 20 → 9 messages (~45,000 tokens saved)
```
To disable compression:
```yaml
compression:
enabled: false
```
## Quiet Mode
The CLI runs in "quiet mode" (`HERMES_QUIET=1`), which:
+547
View File
@@ -0,0 +1,547 @@
# Messaging Platform Integrations (Gateway)
Hermes Agent can connect to messaging platforms like Telegram, Discord, and WhatsApp to serve as a conversational AI assistant.
## Quick Start
```bash
# 1. Set your bot token(s) in .env file
echo 'TELEGRAM_BOT_TOKEN="your_telegram_bot_token"' >> .env
echo 'DISCORD_BOT_TOKEN="your_discord_bot_token"' >> .env
# 2. Test the gateway (foreground)
./scripts/hermes-gateway run
# 3. Install as a system service (runs in background)
./scripts/hermes-gateway install
# 4. Manage the service
./scripts/hermes-gateway start
./scripts/hermes-gateway stop
./scripts/hermes-gateway restart
./scripts/hermes-gateway status
```
**Quick test (without service install):**
```bash
python cli.py --gateway # Runs in foreground, useful for debugging
```
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ Hermes Gateway │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Telegram │ │ Discord │ │ WhatsApp │ │
│ │ Adapter │ │ Adapter │ │ Adapter │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────────────┼─────────────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ Session Store │ │
│ │ (per-chat) │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ AIAgent │ │
│ │ (run_agent) │ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
## Session Management
### Session Persistence
Sessions persist across messages until they reset. The agent remembers your conversation context.
### Reset Policies
Sessions reset based on configurable policies:
| Policy | Default | Description |
|--------|---------|-------------|
| Daily | 4:00 AM | Reset at a specific hour each day |
| Idle | 120 min | Reset after N minutes of inactivity |
| Both | (combined) | Whichever triggers first |
### Manual Reset
Send `/new` or `/reset` as a message to start fresh.
### Per-Platform Overrides
Configure different reset policies per platform:
```json
{
"reset_by_platform": {
"telegram": { "mode": "idle", "idle_minutes": 240 },
"discord": { "mode": "idle", "idle_minutes": 60 }
}
}
```
## Platform Setup
### Telegram
1. **Create a bot** via [@BotFather](https://t.me/BotFather)
2. **Get your token** (looks like `123456789:ABCdefGHIjklMNOpqrsTUVwxyz`)
3. **Set environment variable:**
```bash
export TELEGRAM_BOT_TOKEN="your_token_here"
```
4. **Optional: Set home channel** for cron job delivery:
```bash
export TELEGRAM_HOME_CHANNEL="-1001234567890"
export TELEGRAM_HOME_CHANNEL_NAME="My Notes"
```
**Requirements:**
```bash
pip install python-telegram-bot>=20.0
```
### Discord
1. **Create an application** at [Discord Developer Portal](https://discord.com/developers/applications)
2. **Create a bot** under your application
3. **Get the bot token**
4. **Enable required intents:**
- Message Content Intent
- Server Members Intent (optional)
5. **Invite to your server** using OAuth2 URL generator (scopes: `bot`, `applications.commands`)
6. **Set environment variable:**
```bash
export DISCORD_BOT_TOKEN="your_token_here"
```
7. **Optional: Set home channel:**
```bash
export DISCORD_HOME_CHANNEL="123456789012345678"
export DISCORD_HOME_CHANNEL_NAME="#bot-updates"
```
**Requirements:**
```bash
pip install discord.py>=2.0
```
### WhatsApp
WhatsApp integration is more complex due to the lack of a simple bot API.
**Options:**
1. **WhatsApp Business API** (requires Meta verification)
2. **whatsapp-web.js** via Node.js bridge (for personal accounts)
**Bridge Setup:**
1. Install Node.js
2. Set up the bridge script (see `scripts/whatsapp-bridge/` for reference)
3. Configure in gateway:
```json
{
"platforms": {
"whatsapp": {
"enabled": true,
"extra": {
"bridge_script": "/path/to/bridge.js",
"bridge_port": 3000
}
}
}
}
```
## Configuration
There are **three ways** to configure the gateway (in order of precedence):
### 1. Environment Variables (`.env` file) - Recommended for Quick Setup
Add to your `~/.hermes/.env` file:
```bash
# =============================================================================
# MESSAGING PLATFORM TOKENS
# =============================================================================
# Telegram - get from @BotFather on Telegram
TELEGRAM_BOT_TOKEN=your_telegram_bot_token
TELEGRAM_ALLOWED_USERS=123456789,987654321 # Security: restrict to these user IDs
# Optional: Default channel for cron job delivery
TELEGRAM_HOME_CHANNEL=-1001234567890
TELEGRAM_HOME_CHANNEL_NAME="My Notes"
# Discord - get from Discord Developer Portal
DISCORD_BOT_TOKEN=your_discord_bot_token
DISCORD_ALLOWED_USERS=123456789012345678 # Security: restrict to these user IDs
# Optional: Default channel for cron job delivery
DISCORD_HOME_CHANNEL=123456789012345678
DISCORD_HOME_CHANNEL_NAME="#bot-updates"
# WhatsApp - requires Node.js bridge setup
WHATSAPP_ENABLED=true
# =============================================================================
# AGENT SETTINGS
# =============================================================================
# Max tool-calling iterations per conversation (default: 60)
HERMES_MAX_ITERATIONS=60
# Working directory for terminal commands (default: home ~)
MESSAGING_CWD=/home/myuser
# =============================================================================
# TOOL PROGRESS NOTIFICATIONS
# =============================================================================
# Show progress messages as agent uses tools
HERMES_TOOL_PROGRESS=true
# Mode: "new" (only when tool changes) or "all" (every tool call)
HERMES_TOOL_PROGRESS_MODE=new
# =============================================================================
# SESSION SETTINGS
# =============================================================================
# Reset sessions after N minutes of inactivity (default: 120)
SESSION_IDLE_MINUTES=120
# Daily reset hour in 24h format (default: 4 = 4am)
SESSION_RESET_HOUR=4
```
### 2. Gateway Config File (`~/.hermes/gateway.json`) - Full Control
For advanced configuration, create `~/.hermes/gateway.json`:
```json
{
"platforms": {
"telegram": {
"enabled": true,
"token": "your_telegram_token",
"home_channel": {
"platform": "telegram",
"chat_id": "-1001234567890",
"name": "My Notes"
}
},
"discord": {
"enabled": true,
"token": "your_discord_token",
"home_channel": {
"platform": "discord",
"chat_id": "123456789012345678",
"name": "#bot-updates"
}
}
},
"default_reset_policy": {
"mode": "both",
"at_hour": 4,
"idle_minutes": 120
},
"reset_by_platform": {
"discord": {
"mode": "idle",
"idle_minutes": 60
}
},
"always_log_local": true
}
```
## Platform-Specific Toolsets
Each platform has its own toolset for security:
| Platform | Toolset | Capabilities |
|----------|---------|--------------|
| CLI | `hermes-cli` | Full access (terminal, browser, etc.) |
| Telegram | `hermes-telegram` | Full tools including terminal |
| Discord | `hermes-discord` | Full tools including terminal |
| WhatsApp | `hermes-whatsapp` | Full tools including terminal |
## User Experience Features
### Typing Indicator
The gateway keeps the "typing..." indicator active throughout processing, refreshing every 4 seconds. This lets users know the bot is working even during long tool-calling sequences.
### Tool Progress Notifications
When `HERMES_TOOL_PROGRESS=true`, the bot sends status messages as it works:
```
💻 `ls -la`...
🔍 web_search...
📄 web_extract...
🎨 image_generate...
```
Terminal commands show the actual command (truncated to 50 chars). Other tools just show the tool name.
**Modes:**
- `new`: Only sends message when switching to a different tool (less spam)
- `all`: Sends message for every single tool call
### Working Directory
- **CLI (`hermes` command)**: Uses current directory where you run the command
- **Messaging**: Uses `MESSAGING_CWD` (default: home directory `~`)
This is intentional: CLI users are in a terminal and expect the agent to work in their current directory, while messaging users need a consistent starting location.
### Max Iterations
If the agent hits the max iteration limit while working, instead of a generic error, it asks the model to summarize what it found so far. This gives you a useful response even when the task couldn't be fully completed.
## Voice Messages (TTS)
The `text_to_speech` tool generates audio that the gateway delivers as native voice messages on each platform:
| Platform | Delivery | Format |
|----------|----------|--------|
| Telegram | Voice bubble (plays inline) | Opus `.ogg` — native from OpenAI/ElevenLabs, converted via ffmpeg for Edge TTS |
| Discord | Audio file attachment | MP3 |
| WhatsApp | Audio file attachment | MP3 |
| CLI | Saved to `~/voice-memos/` | MP3 |
**Providers:**
- **Edge TTS** (default) — Free, no API key, 322 voices in 74 languages
- **ElevenLabs** — Premium quality, requires `ELEVENLABS_API_KEY`
- **OpenAI TTS** — Good quality, requires `OPENAI_API_KEY`
Voice and provider are configured by the user in `~/.hermes/config.yaml` under the `tts:` key. The model only sends text; it does not choose the voice.
The tool returns a `MEDIA:<path>` tag that the gateway send pipeline intercepts and delivers as a native audio message. If `[[audio_as_voice]]` is present (Opus format available), Telegram sends it as a voice bubble instead of an audio file.
**Telegram voice bubbles & ffmpeg:**
Telegram requires Opus/OGG format for native voice bubbles (the round, inline-playable kind). **OpenAI and ElevenLabs** produce Opus natively when on Telegram — no extra setup needed. **Edge TTS** (the default free provider) outputs MP3 and needs `ffmpeg` to convert:
```bash
sudo apt install ffmpeg # Ubuntu/Debian
brew install ffmpeg # macOS
sudo dnf install ffmpeg # Fedora
```
Without ffmpeg, Edge TTS audio is sent as a regular audio file (still playable, but shows as a rectangular music player instead of a voice bubble).
## Cron Job Delivery
When scheduling cron jobs, you can specify where the output should be delivered:
```
User: "Remind me to check the server in 30 minutes"
Agent uses: schedule_cronjob(
prompt="Check server status...",
schedule="30m",
deliver="origin" # Back to this chat
)
```
### Delivery Options
| Option | Description |
|--------|-------------|
| `"origin"` | Back to where the job was created |
| `"local"` | Save to local files only |
| `"telegram"` | Telegram home channel |
| `"discord"` | Discord home channel |
| `"telegram:123456"` | Specific Telegram chat |
## Dynamic Context Injection
The agent knows where it is via injected context:
```
## Current Session Context
**Source:** Telegram (group: Dev Team, ID: -1001234567890)
**Connected Platforms:** local, telegram, discord
**Home Channels:**
- telegram: My Notes (ID: -1001234567890)
- discord: #bot-updates (ID: 123456789012345678)
**Delivery options for scheduled tasks:**
- "origin" → Back to this chat (Dev Team)
- "local" → Save to local files only
- "telegram" → Home channel (My Notes)
- "discord" → Home channel (#bot-updates)
```
## CLI Commands
| Command | Description |
|---------|-------------|
| `/platforms` | Show gateway configuration and status |
| `--gateway` | Start the gateway (CLI flag) |
## Troubleshooting
### "python-telegram-bot not installed"
```bash
pip install python-telegram-bot>=20.0
```
### "discord.py not installed"
```bash
pip install discord.py>=2.0
```
### "No platforms connected"
1. Check your environment variables are set
2. Check your tokens are valid
3. Try `/platforms` to see configuration status
### Session not persisting
1. Check `~/.hermes/sessions/` exists
2. Check session policies aren't too aggressive
3. Verify no errors in gateway logs
## Adding a New Platform
To add a new messaging platform:
### 1. Create the adapter
Create `gateway/platforms/your_platform.py`:
```python
from gateway.platforms.base import BasePlatformAdapter, MessageEvent, SendResult
from gateway.config import Platform, PlatformConfig
class YourPlatformAdapter(BasePlatformAdapter):
def __init__(self, config: PlatformConfig):
super().__init__(config, Platform.YOUR_PLATFORM)
async def connect(self) -> bool:
# Connect to the platform
...
async def disconnect(self) -> None:
# Disconnect
...
async def send(self, chat_id: str, content: str, ...) -> SendResult:
# Send a message
...
async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
# Get chat information
...
```
### 2. Register the platform
Add to `gateway/config.py`:
```python
class Platform(Enum):
# ... existing ...
YOUR_PLATFORM = "your_platform"
```
### 3. Add to gateway runner
Update `gateway/run.py` `_create_adapter()`:
```python
elif platform == Platform.YOUR_PLATFORM:
from gateway.platforms.your_platform import YourPlatformAdapter
return YourPlatformAdapter(config)
```
### 4. Create a toolset (optional)
Add to `toolsets.py`:
```python
"hermes-your-platform": {
"description": "Your platform toolset",
"tools": [...],
"includes": []
}
```
### 5. Configure
Add environment variables to `.env`:
```bash
YOUR_PLATFORM_TOKEN=...
YOUR_PLATFORM_HOME_CHANNEL=...
```
## Service Management
### Linux (systemd)
```bash
# Install as user service
./scripts/hermes-gateway install
# Manage
systemctl --user start hermes-gateway
systemctl --user stop hermes-gateway
systemctl --user restart hermes-gateway
systemctl --user status hermes-gateway
# View logs
journalctl --user -u hermes-gateway -f
# Enable lingering (keeps running after logout)
sudo loginctl enable-linger $USER
```
### macOS (launchd)
```bash
# Install
./scripts/hermes-gateway install
# Manage
launchctl start ai.hermes.gateway
launchctl stop ai.hermes.gateway
# View logs
tail -f ~/.hermes/logs/gateway.log
```
### Manual (any platform)
```bash
# Run in foreground (for testing/debugging)
./scripts/hermes-gateway run
# Or via CLI (also foreground)
python cli.py --gateway
```
## Storage Locations
| Path | Purpose |
|------|---------|
| `~/.hermes/gateway.json` | Gateway configuration |
| `~/.hermes/sessions/sessions.json` | Session index |
| `~/.hermes/sessions/{id}.jsonl` | Conversation transcripts |
| `~/.hermes/cron/output/` | Cron job outputs |
| `~/.hermes/logs/gateway.log` | Gateway logs (macOS launchd) |
+5 -1
View File
@@ -40,11 +40,15 @@ async def web_search(query: str) -> dict:
|----------|--------|-------|
| **Web** | `web_tools.py` | `web_search`, `web_extract`, `web_crawl` |
| **Terminal** | `terminal_tool.py` | `terminal` (local/docker/singularity/modal/ssh backends) |
| **File** | `file_tools.py` | `read_file`, `write_file`, `patch`, `search` |
| **Browser** | `browser_tool.py` | `browser_navigate`, `browser_click`, `browser_type`, etc. |
| **Vision** | `vision_tools.py` | `vision_analyze` |
| **Image Gen** | `image_generation_tool.py` | `image_generate` |
| **TTS** | `tts_tool.py` | `text_to_speech` (Edge TTS free / ElevenLabs / OpenAI) |
| **Reasoning** | `mixture_of_agents_tool.py` | `mixture_of_agents` |
| **Skills** | `skills_tool.py` | `skills_categories`, `skills_list`, `skill_view` |
| **Skills** | `skills_tool.py` | `skills_list`, `skill_view` |
| **Cronjob** | `cronjob_tools.py` | `schedule_cronjob`, `list_cronjobs`, `remove_cronjob` |
| **RL Training** | `rl_training_tool.py` | `rl_list_environments`, `rl_start_training`, `rl_check_status`, etc. |
## Tool Registration
+330
View File
@@ -0,0 +1,330 @@
# Hermes-Agent Atropos Environments
This directory contains the integration layer between **hermes-agent's** tool-calling capabilities and the **Atropos** RL training framework. It provides everything needed to run agentic LLMs through multi-turn tool-calling loops, score their output with arbitrary reward functions, and feed results into Atropos for training or evaluation.
## Architecture Overview
```
Atropos Framework
┌───────────────────────┐
│ BaseEnv │ (atroposlib)
│ - Server management │
│ - Worker scheduling │
│ - Wandb logging │
│ - CLI (serve/process/ │
│ evaluate) │
└───────────┬───────────┘
│ inherits
┌───────────┴───────────┐
│ HermesAgentBaseEnv │ hermes_base_env.py
│ - Terminal backend │
│ - Tool resolution │
│ - Agent loop │
│ - ToolContext │
│ - Async patches │
└───────────┬───────────┘
│ inherits
┌─────────────────┼─────────────────┐
│ │ │
TerminalTestEnv HermesSweEnv TerminalBench2EvalEnv
(stack testing) (SWE training) (TB2 benchmark eval)
```
### Inheritance Chain
**BaseEnv** (from `atroposlib`) is the Atropos base class. It provides:
- Server management (OpenAI-compatible API servers, VLLM, SGLang)
- Worker scheduling for parallel rollouts
- Wandb integration for metrics and rollout logging
- CLI interface with three subcommands: `serve`, `process`, `evaluate`
- `evaluate_log()` for saving eval results to JSON + samples.jsonl
**HermesAgentBaseEnv** (`hermes_base_env.py`) extends BaseEnv with hermes-agent specifics:
- Sets `os.environ["TERMINAL_ENV"]` to configure the terminal backend (local, docker, modal, ssh, singularity)
- Resolves hermes-agent toolsets via `_resolve_tools_for_group()` (calls `get_tool_definitions()` from `model_tools.py`)
- Implements `collect_trajectory()` which runs the full agent loop and computes rewards
- Supports two-phase operation (Phase 1: OpenAI server, Phase 2: VLLM ManagedServer)
- Applies monkey patches for async-safe tool operation at import time
Concrete environments inherit from `HermesAgentBaseEnv` and implement:
- `setup()` -- Load dataset, initialize state
- `get_next_item()` -- Return the next item for rollout
- `format_prompt()` -- Convert a dataset item into the user message
- `compute_reward()` -- Score the rollout using ToolContext
- `evaluate()` -- Periodic evaluation logic
## Core Components
### Agent Loop (`agent_loop.py`)
`HermesAgentLoop` is the reusable multi-turn agent engine. It runs the same pattern as hermes-agent's `run_agent.py`:
1. Send messages + tools to the API via `server.chat_completion()`
2. If the response contains `tool_calls`, execute each one via `handle_function_call()` from `model_tools.py`
3. Append tool results to the conversation and go back to step 1
4. If the response has no tool_calls, the agent is done
Tool calls are executed in a thread pool (`run_in_executor`) so backends that use `asyncio.run()` internally (Modal, Docker) don't deadlock inside Atropos's event loop.
Returns an `AgentResult` containing the full conversation history, turn count, reasoning content per turn, tool errors, and optional ManagedServer state (for Phase 2).
### Tool Context (`tool_context.py`)
`ToolContext` is a per-rollout handle that gives reward/verification functions direct access to **all** hermes-agent tools, scoped to the rollout's `task_id`. The same `task_id` means the terminal/browser session is the SAME one the model used during its rollout -- all state (files, processes, browser tabs) is preserved.
```python
async def compute_reward(self, item, result, ctx: ToolContext):
# Run tests in the model's terminal sandbox
test = ctx.terminal("pytest -v")
if test["exit_code"] == 0:
return 1.0
# Check if a file was created
content = ctx.read_file("/workspace/solution.py")
if content.get("content"):
return 0.5
# Download files locally for verification (binary-safe)
ctx.download_file("/remote/output.bin", "/local/output.bin")
return 0.0
```
Available methods:
- **Terminal**: `terminal(command, timeout)` -- run shell commands
- **Files**: `read_file(path)`, `write_file(path, content)`, `search(query, path)`
- **Transfers**: `upload_file()`, `upload_dir()`, `download_file()`, `download_dir()` -- binary-safe file transfers between host and sandbox
- **Web**: `web_search(query)`, `web_extract(urls)`
- **Browser**: `browser_navigate(url)`, `browser_snapshot()`
- **Generic**: `call_tool(name, args)` -- call any hermes-agent tool by name
- **Cleanup**: `cleanup()` -- release all resources (called automatically after `compute_reward`)
### Patches (`patches.py`)
**Problem**: Some hermes-agent tools use `asyncio.run()` internally (e.g., mini-swe-agent's Modal backend via SWE-ReX). This crashes when called from inside Atropos's event loop because `asyncio.run()` cannot be nested.
**Solution**: `patches.py` monkey-patches `SwerexModalEnvironment` to use a dedicated background thread (`_AsyncWorker`) with its own event loop. The calling code sees the same sync interface, but internally the async work happens on a separate thread that doesn't conflict with Atropos's loop.
What gets patched:
- `SwerexModalEnvironment.__init__` -- creates Modal deployment on a background thread
- `SwerexModalEnvironment.execute` -- runs commands on the same background thread
- `SwerexModalEnvironment.stop` -- stops deployment on the background thread
The patches are:
- **Idempotent** -- calling `apply_patches()` multiple times is safe
- **Transparent** -- same interface and behavior, only the internal async execution changes
- **Universal** -- works identically in normal CLI use (no running event loop)
Applied automatically at import time by `hermes_base_env.py`.
### Tool Call Parsers (`tool_call_parsers/`)
Client-side parsers that extract structured `tool_calls` from raw model output text. Used in **Phase 2** (VLLM server type) where ManagedServer's `/generate` endpoint returns raw text without tool call parsing.
Each parser is a standalone reimplementation of the corresponding VLLM parser's `extract_tool_calls()` logic. No VLLM dependency -- only standard library (`re`, `json`, `uuid`) and `openai` types.
Available parsers:
- `hermes` -- Hermes/ChatML `<tool_call>` XML format
- `mistral` -- Mistral `[TOOL_CALLS]` format
- `llama3_json` -- Llama 3 JSON tool calling
- `qwen` -- Qwen tool calling format
- `qwen3_coder` -- Qwen3 Coder format
- `deepseek_v3` -- DeepSeek V3 format
- `deepseek_v3_1` -- DeepSeek V3.1 format
- `kimi_k2` -- Kimi K2 format
- `longcat` -- Longcat format
- `glm45` / `glm47` -- GLM model formats
Usage:
```python
from environments.tool_call_parsers import get_parser
parser = get_parser("hermes")
content, tool_calls = parser.parse(raw_model_output)
```
In Phase 1 (OpenAI server type), these parsers are not needed -- the server handles tool call parsing natively.
## Two-Phase Operation
### Phase 1: OpenAI Server (Evaluation / SFT Data Generation)
Uses `server.chat_completion()` with `tools=` parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively. Returns `ChatCompletion` objects with structured `tool_calls`.
- Good for: evaluation, SFT data generation, testing
- Run with: `serve` (with `run-api`), `process`, or `evaluate` subcommands
- Placeholder tokens are created for the Atropos pipeline
### Phase 2: VLLM ManagedServer (Full RL Training)
Uses ManagedServer for exact token IDs + logprobs via `/generate`. Client-side tool call parser (from `tool_call_parsers/`) reconstructs structured `tool_calls` from raw output.
- Good for: full RL training with GRPO/PPO
- Run with: `serve` subcommand
- Real tokens, masks, and logprobs flow through the pipeline
## Directory Structure
```
environments/
├── README.md # This file
├── __init__.py # Package exports
├── hermes_base_env.py # Abstract base (HermesAgentBaseEnv)
├── agent_loop.py # Multi-turn agent engine (HermesAgentLoop)
├── tool_context.py # Per-rollout tool access for reward functions
├── patches.py # Async-safety patches for Modal backend
├── tool_call_parsers/ # Phase 2 client-side parsers
│ ├── __init__.py # Registry + base class
│ ├── hermes_parser.py
│ ├── mistral_parser.py
│ ├── llama_parser.py
│ ├── qwen_parser.py
│ ├── qwen3_coder_parser.py
│ ├── deepseek_v3_parser.py
│ ├── deepseek_v3_1_parser.py
│ ├── kimi_k2_parser.py
│ ├── longcat_parser.py
│ ├── glm45_parser.py
│ └── glm47_parser.py
├── terminal_test_env/ # Stack validation environment
│ └── terminal_test_env.py
├── hermes_swe_env/ # SWE-bench style training environment
│ └── hermes_swe_env.py
└── benchmarks/ # Evaluation benchmarks
└── terminalbench_2/
└── terminalbench2_env.py
```
## Concrete Environments
### TerminalTestEnv (`terminal_test_env/`)
A self-contained environment with inline tasks (no external dataset needed) for validating the full stack end-to-end. Each task asks the model to create a file at a known path, and the verifier checks the content matches.
```bash
# Serve mode (needs run-api)
run-api
python environments/terminal_test_env/terminal_test_env.py serve
# Process mode (no run-api, saves to JSONL)
python environments/terminal_test_env/terminal_test_env.py process \
--env.data_path_to_save_groups terminal_test_output.jsonl
```
### HermesSweEnv (`hermes_swe_env/`)
SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
```bash
python environments/hermes_swe_env/hermes_swe_env.py serve \
--openai.model_name YourModel \
--env.dataset_name bigcode/humanevalpack \
--env.terminal_backend modal
```
### TerminalBench2EvalEnv (`benchmarks/terminalbench_2/`)
**Eval-only** environment for the Terminal-Bench 2.0 benchmark (89 tasks). Each task gets a pre-built Docker Hub image, a natural language instruction, and a test suite. The agent uses terminal + file tools to solve the task, then the test suite verifies correctness.
Follows the standard Atropos eval pattern (like GPQA, MMLU, etc.):
- Run via `evaluate` subcommand (no `run-api` needed)
- `setup()` loads the dataset, `evaluate()` runs all tasks
- `rollout_and_score_eval()` handles per-task agent loop + test verification
- Downloads verifier output locally for reliable reward checking (Harbor pattern)
```bash
# Run full benchmark
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--openai.model_name anthropic/claude-opus-4.6
# Run subset of tasks
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--openai.model_name anthropic/claude-opus-4.6 \
--env.task_filter fix-git,git-multibranch
# Skip specific tasks
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--openai.model_name anthropic/claude-opus-4.6 \
--env.skip_tasks heavy-task,slow-task
```
## Creating a New Environment
### Training Environment
1. Create a new directory under `environments/`
2. Create your env file inheriting from `HermesAgentBaseEnv`
3. Implement the four abstract methods + `evaluate()`
```python
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
class MyEnvConfig(HermesAgentEnvConfig):
pass # Add custom fields as needed
class MyEnv(HermesAgentBaseEnv):
name = "my-env"
env_config_cls = MyEnvConfig
@classmethod
def config_init(cls):
env_config = MyEnvConfig(
enabled_toolsets=["terminal", "file"],
terminal_backend="modal",
# ... other config
)
server_configs = [APIServerConfig(...)]
return env_config, server_configs
async def setup(self):
self.dataset = load_dataset(...)
self.iter = 0
async def get_next_item(self):
item = self.dataset[self.iter % len(self.dataset)]
self.iter += 1
return item
def format_prompt(self, item):
return item["instruction"]
async def compute_reward(self, item, result, ctx):
# ctx gives you full tool access to the rollout's sandbox
test = ctx.terminal("pytest -v")
return 1.0 if test["exit_code"] == 0 else 0.0
async def evaluate(self, *args, **kwargs):
# Periodic evaluation logic
...
if __name__ == "__main__":
MyEnv.cli()
```
### Eval-Only Environment (Benchmark)
For eval benchmarks, follow the pattern in `terminalbench2_env.py`:
1. Create under `environments/benchmarks/your-benchmark/`
2. Inherit from `HermesAgentBaseEnv`
3. Set eval-only config: `eval_handling=STOP_TRAIN`, `steps_per_eval=1`, `total_steps=1`
4. Stub the training methods (`collect_trajectories`, `score`)
5. Implement `rollout_and_score_eval()` and `evaluate()`
6. Run with `evaluate` subcommand
## Key Config Fields
| Field | Description | Default |
|-------|-------------|---------|
| `enabled_toolsets` | Which hermes toolsets to enable | `None` (all) |
| `disabled_toolsets` | Toolsets to disable | `None` |
| `distribution` | Probabilistic toolset distribution name | `None` |
| `max_agent_turns` | Max LLM calls per rollout | `30` |
| `agent_temperature` | Sampling temperature | `1.0` |
| `terminal_backend` | `local`, `docker`, `modal`, `ssh`, `singularity` | `local` |
| `system_prompt` | System message for the agent | `None` |
| `tool_call_parser` | Parser name for Phase 2 | `hermes` |
| `eval_handling` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` | `STOP_TRAIN` |
+32
View File
@@ -0,0 +1,32 @@
"""
Hermes-Agent Atropos Environments
Provides a layered integration between hermes-agent's tool-calling capabilities
and the Atropos RL training framework.
Core layers:
- agent_loop: Reusable multi-turn agent loop with standard OpenAI-spec tool calling
- tool_context: Per-rollout tool access handle for reward/verification functions
- hermes_base_env: Abstract base environment (BaseEnv subclass) for Atropos
- tool_call_parsers: Client-side tool call parser registry for Phase 2 (VLLM /generate)
Concrete environments:
- terminal_test_env/: Simple file-creation tasks for testing the stack
- hermes_swe_env/: SWE-bench style tasks with Modal sandboxes
- endless_terminals/: Terminal tasks from HuggingFace dataset with Apptainer containers
Benchmarks (eval-only):
- benchmarks/terminalbench_2/: Terminal-Bench 2.0 evaluation
"""
from environments.agent_loop import AgentResult, HermesAgentLoop
from environments.tool_context import ToolContext
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
__all__ = [
"AgentResult",
"HermesAgentLoop",
"ToolContext",
"HermesAgentBaseEnv",
"HermesAgentEnvConfig",
]
+421
View File
@@ -0,0 +1,421 @@
"""
HermesAgentLoop -- Reusable Multi-Turn Agent Engine
Runs the hermes-agent tool-calling loop using standard OpenAI-spec tool calling.
Works with any server that returns ChatCompletion objects with tool_calls:
- Phase 1: OpenAI server type (VLLM, SGLang, OpenRouter, OpenAI API)
- Phase 2: ManagedServer with client-side tool call parser
The loop passes tools= and checks response.choices[0].message.tool_calls,
identical to hermes-agent's run_agent.py. Tool execution is dispatched via
handle_function_call() from model_tools.py.
"""
import asyncio
import concurrent.futures
import json
import logging
import os
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set
from model_tools import handle_function_call
# Thread pool for running sync tool calls that internally use asyncio.run()
# (e.g., mini-swe-agent's modal/docker backends). Running them in a separate
# thread gives them a clean event loop so they don't deadlock inside Atropos's loop.
# Size must be large enough for concurrent eval tasks (e.g., 89 TB2 tasks all
# making tool calls). Too small = thread pool starvation, tasks queue for minutes.
# Resized at runtime by HermesAgentBaseEnv.__init__ via resize_tool_pool().
_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=128)
def resize_tool_pool(max_workers: int):
"""
Replace the global tool executor with a new one of the given size.
Called by HermesAgentBaseEnv.__init__ based on config.tool_pool_size.
Safe to call before any tasks are submitted.
"""
global _tool_executor
_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
logger.info("Tool thread pool resized to %d workers", max_workers)
logger = logging.getLogger(__name__)
@dataclass
class ToolError:
"""Record of a tool execution error during the agent loop."""
turn: int # Which turn the error occurred on
tool_name: str # Which tool was called
arguments: str # The arguments passed (truncated)
error: str # The error message
tool_result: str # The raw result returned to the model
@dataclass
class AgentResult:
"""Result of running the agent loop."""
# Full conversation history in OpenAI message format
messages: List[Dict[str, Any]]
# ManagedServer.get_state() if available (Phase 2), None otherwise
managed_state: Optional[Dict[str, Any]] = None
# How many LLM calls were made
turns_used: int = 0
# True if model stopped calling tools naturally (vs hitting max_turns)
finished_naturally: bool = False
# Extracted reasoning content per turn (from PR #297 helpers)
reasoning_per_turn: List[Optional[str]] = field(default_factory=list)
# Tool errors encountered during the loop
tool_errors: List[ToolError] = field(default_factory=list)
def _extract_reasoning_from_message(message) -> Optional[str]:
"""
Extract reasoning content from a ChatCompletion message.
Handles multiple provider formats:
1. message.reasoning_content field (some providers)
2. message.reasoning field (some providers)
3. message.reasoning_details[].text (OpenRouter style)
Note: <think> block extraction from content is NOT done here -- that's
handled by the response already in Phase 1 (server does it) or by
ManagedServer's patch in Phase 2.
Args:
message: The assistant message from ChatCompletion response
Returns:
Extracted reasoning text, or None if not found
"""
# Check reasoning_content field (common across providers)
if hasattr(message, "reasoning_content") and message.reasoning_content:
return message.reasoning_content
# Check reasoning field
if hasattr(message, "reasoning") and message.reasoning:
return message.reasoning
# Check reasoning_details (OpenRouter style)
if hasattr(message, "reasoning_details") and message.reasoning_details:
for detail in message.reasoning_details:
if hasattr(detail, "text") and detail.text:
return detail.text
if isinstance(detail, dict) and detail.get("text"):
return detail["text"]
return None
class HermesAgentLoop:
"""
Runs hermes-agent's tool-calling loop using standard OpenAI-spec tool calling.
Same pattern as run_agent.py:
- Pass tools= to the API
- Check response.choices[0].message.tool_calls
- Dispatch via handle_function_call()
Works identically with any server type -- OpenAI, VLLM, SGLang, OpenRouter,
or ManagedServer with a parser. The server determines how tool_calls get
populated on the response.
"""
def __init__(
self,
server,
tool_schemas: List[Dict[str, Any]],
valid_tool_names: Set[str],
max_turns: int = 30,
task_id: Optional[str] = None,
temperature: float = 1.0,
max_tokens: Optional[int] = None,
extra_body: Optional[Dict[str, Any]] = None,
):
"""
Initialize the agent loop.
Args:
server: Server object with chat_completion() method (OpenAIServer,
ManagedServer, ServerManager, etc.)
tool_schemas: OpenAI-format tool definitions from get_tool_definitions()
valid_tool_names: Set of tool names the model is allowed to call
max_turns: Maximum number of LLM calls before stopping
task_id: Unique ID for terminal/browser session isolation
temperature: Sampling temperature for generation
max_tokens: Max tokens per generation (None for server default)
extra_body: Extra parameters passed to the OpenAI client's create() call.
Used for OpenRouter provider preferences, transforms, etc.
e.g. {"provider": {"ignore": ["DeepInfra"]}}
"""
self.server = server
self.tool_schemas = tool_schemas
self.valid_tool_names = valid_tool_names
self.max_turns = max_turns
self.task_id = task_id or str(uuid.uuid4())
self.temperature = temperature
self.max_tokens = max_tokens
self.extra_body = extra_body
async def run(self, messages: List[Dict[str, Any]]) -> AgentResult:
"""
Execute the full agent loop using standard OpenAI tool calling.
Args:
messages: Initial conversation messages (system + user).
Modified in-place as the conversation progresses.
Returns:
AgentResult with full conversation history, managed state, and metadata
"""
reasoning_per_turn = []
tool_errors: List[ToolError] = []
import time as _time
for turn in range(self.max_turns):
turn_start = _time.monotonic()
# Build the chat_completion kwargs
chat_kwargs = {
"messages": messages,
"n": 1,
"temperature": self.temperature,
}
# Only pass tools if we have them
if self.tool_schemas:
chat_kwargs["tools"] = self.tool_schemas
# Only pass max_tokens if explicitly set
if self.max_tokens is not None:
chat_kwargs["max_tokens"] = self.max_tokens
# Inject extra_body for provider-specific params (e.g., OpenRouter
# provider preferences like banned/preferred providers, transforms)
if self.extra_body:
chat_kwargs["extra_body"] = self.extra_body
# Make the API call -- standard OpenAI spec
api_start = _time.monotonic()
try:
response = await self.server.chat_completion(**chat_kwargs)
except Exception as e:
api_elapsed = _time.monotonic() - api_start
logger.error("API call failed on turn %d (%.1fs): %s", turn + 1, api_elapsed, e)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=turn + 1,
finished_naturally=False,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
api_elapsed = _time.monotonic() - api_start
if not response or not response.choices:
logger.warning("Empty response on turn %d (api=%.1fs)", turn + 1, api_elapsed)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=turn + 1,
finished_naturally=False,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
assistant_msg = response.choices[0].message
# Extract reasoning content from the response (all provider formats)
reasoning = _extract_reasoning_from_message(assistant_msg)
reasoning_per_turn.append(reasoning)
# Check for tool calls -- standard OpenAI spec
if assistant_msg.tool_calls:
# Build the assistant message dict for conversation history
msg_dict: Dict[str, Any] = {
"role": "assistant",
"content": assistant_msg.content or "",
"tool_calls": [
{
"id": tc.id,
"type": "function",
"function": {
"name": tc.function.name,
"arguments": tc.function.arguments,
},
}
for tc in assistant_msg.tool_calls
],
}
# Preserve reasoning_content for multi-turn chat template handling
# (e.g., Kimi-K2's template renders <think> blocks differently
# for history vs. the latest turn based on this field)
if reasoning:
msg_dict["reasoning_content"] = reasoning
messages.append(msg_dict)
# Execute each tool call via hermes-agent's dispatch
for tc in assistant_msg.tool_calls:
tool_name = tc.function.name
tool_args_raw = tc.function.arguments
# Validate tool name
if tool_name not in self.valid_tool_names:
tool_result = json.dumps(
{
"error": f"Unknown tool '{tool_name}'. "
f"Available tools: {sorted(self.valid_tool_names)}"
}
)
tool_errors.append(ToolError(
turn=turn + 1, tool_name=tool_name,
arguments=tool_args_raw[:200],
error=f"Unknown tool '{tool_name}'",
tool_result=tool_result,
))
logger.warning(
"Model called unknown tool '%s' on turn %d",
tool_name, turn + 1,
)
else:
# Parse arguments and dispatch
try:
args = json.loads(tool_args_raw)
except json.JSONDecodeError:
args = {}
logger.warning(
"Invalid JSON in tool call arguments for '%s': %s",
tool_name, tool_args_raw[:200],
)
try:
if tool_name == "terminal":
backend = os.getenv("TERMINAL_ENV", "local")
cmd_preview = args.get("command", "")[:80]
logger.info(
"[%s] $ %s", self.task_id[:8], cmd_preview,
)
# Run tool calls in a thread pool so backends that use
# asyncio.run() internally (modal, docker) get a clean
# event loop instead of deadlocking inside Atropos's loop.
tool_submit_time = _time.monotonic()
loop = asyncio.get_event_loop()
tool_result = await loop.run_in_executor(
_tool_executor,
lambda: handle_function_call(
tool_name, args, task_id=self.task_id
),
)
tool_elapsed = _time.monotonic() - tool_submit_time
# Log slow tools and thread pool stats for debugging
pool_active = _tool_executor._work_queue.qsize()
if tool_elapsed > 30:
logger.warning(
"[%s] turn %d: %s took %.1fs (pool queue=%d)",
self.task_id[:8], turn + 1, tool_name,
tool_elapsed, pool_active,
)
except Exception as e:
tool_result = json.dumps(
{"error": f"Tool execution failed: {type(e).__name__}: {str(e)}"}
)
tool_errors.append(ToolError(
turn=turn + 1, tool_name=tool_name,
arguments=tool_args_raw[:200],
error=f"{type(e).__name__}: {str(e)}",
tool_result=tool_result,
))
logger.error(
"Tool '%s' execution failed on turn %d: %s",
tool_name, turn + 1, e,
)
# Also check if the tool returned an error in its JSON result
try:
result_data = json.loads(tool_result)
if isinstance(result_data, dict):
err = result_data.get("error")
exit_code = result_data.get("exit_code")
if err and exit_code and exit_code < 0:
tool_errors.append(ToolError(
turn=turn + 1, tool_name=tool_name,
arguments=tool_args_raw[:200],
error=str(err),
tool_result=tool_result[:500],
))
except (json.JSONDecodeError, TypeError):
pass
# Add tool response to conversation
messages.append(
{
"role": "tool",
"tool_call_id": tc.id,
"content": tool_result,
}
)
turn_elapsed = _time.monotonic() - turn_start
logger.info(
"[%s] turn %d: api=%.1fs, %d tools, turn_total=%.1fs",
self.task_id[:8], turn + 1, api_elapsed,
len(assistant_msg.tool_calls), turn_elapsed,
)
else:
# No tool calls -- model is done
msg_dict = {
"role": "assistant",
"content": assistant_msg.content or "",
}
if reasoning:
msg_dict["reasoning_content"] = reasoning
messages.append(msg_dict)
turn_elapsed = _time.monotonic() - turn_start
logger.info(
"[%s] turn %d: api=%.1fs, no tools (finished), turn_total=%.1fs",
self.task_id[:8], turn + 1, api_elapsed, turn_elapsed,
)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=turn + 1,
finished_naturally=True,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
# Hit max turns without the model stopping
logger.info("Agent hit max_turns (%d) without finishing", self.max_turns)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=self.max_turns,
finished_naturally=False,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
def _get_managed_state(self) -> Optional[Dict[str, Any]]:
"""
Get ManagedServer state if the server supports it.
Returns state dict with SequenceNodes containing tokens/logprobs/masks,
or None if the server doesn't support get_state() (e.g., regular OpenAI server).
"""
if hasattr(self.server, "get_state"):
return self.server.get_state()
return None
View File
@@ -0,0 +1,38 @@
# Terminal-Bench 2.0 Evaluation -- Default Configuration
#
# Eval-only environment for the TB2 benchmark (89 terminal tasks).
# Uses Modal terminal backend for per-task cloud-isolated sandboxes
# and OpenRouter for inference.
#
# Usage:
# python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
# --config environments/benchmarks/terminalbench_2/default.yaml
#
# # Override model:
# python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
# --config environments/benchmarks/terminalbench_2/default.yaml \
# --openai.model_name anthropic/claude-sonnet-4
env:
enabled_toolsets: ["terminal", "file"]
max_agent_turns: 60
max_token_length: 32000
agent_temperature: 0.8
terminal_backend: "modal"
terminal_timeout: 300 # 5 min per command (builds, pip install)
tool_pool_size: 128 # thread pool for 89 parallel tasks
dataset_name: "NousResearch/terminal-bench-2"
test_timeout: 600
task_timeout: 1800 # 30 min wall-clock per task, auto-FAIL if exceeded
tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
use_wandb: true
wandb_name: "terminal-bench-2"
ensure_scores_are_not_same: false
data_dir_to_save_evals: "environments/benchmarks/evals/terminal-bench-2"
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "anthropic/claude-opus-4.6"
server_type: "openai"
health_check: false
# api_key loaded from OPENROUTER_API_KEY in .env
+32
View File
@@ -0,0 +1,32 @@
#!/bin/bash
# Terminal-Bench 2.0 Evaluation
#
# Run from repo root:
# bash environments/benchmarks/terminalbench_2/run_eval.sh
#
# Override model:
# bash environments/benchmarks/terminalbench_2/run_eval.sh \
# --openai.model_name anthropic/claude-sonnet-4
#
# Run a subset:
# bash environments/benchmarks/terminalbench_2/run_eval.sh \
# --env.task_filter fix-git,git-multibranch
mkdir -p logs evals/terminal-bench-2
LOG_FILE="logs/terminalbench2_$(date +%Y%m%d_%H%M%S).log"
echo "Terminal-Bench 2.0 Evaluation"
echo "Log: $LOG_FILE"
echo ""
export TERMINAL_ENV=modal
export TERMINAL_TIMEOUT=300
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--config environments/benchmarks/terminalbench_2/default.yaml \
"$@" \
2>&1 | tee "$LOG_FILE"
echo ""
echo "Log saved to: $LOG_FILE"
@@ -0,0 +1,904 @@
"""
TerminalBench2Env -- Terminal-Bench 2.0 Evaluation Environment
Evaluates agentic LLMs on challenging terminal tasks from Terminal-Bench 2.0.
Each task provides a unique Docker environment (pre-built on Docker Hub), a natural
language instruction, and a test suite for verification. The agent uses terminal +
file tools to complete the task, then the test suite runs inside the same sandbox.
This is an eval-only environment (not a training environment). It is designed to
be run via the `evaluate` subcommand:
python environments/terminalbench2_env.py evaluate \\
--env.dataset_name NousResearch/terminal-bench-2
The evaluate flow:
1. setup() -- Loads the TB2 dataset from HuggingFace
2. evaluate() -- Iterates over all tasks, running each through:
a. rollout_and_score_eval() -- Per-task agent loop + test verification
- Resolves Docker image (pre-built Hub image or Dockerfile fallback)
- Registers per-task Modal sandbox via register_task_env_overrides()
- Runs the HermesAgentLoop (terminal + file tools)
- Uploads test suite and runs test.sh in the same sandbox
- Returns binary pass/fail result
b. Aggregates per-task, per-category, and overall pass rates
c. Logs results via evaluate_log() and wandb
Key features:
- Per-task Modal sandboxes using pre-built Docker Hub images
- Binary reward: 1.0 if all tests pass, 0.0 otherwise
- Concurrency-controlled parallel evaluation via asyncio.Semaphore
- Per-task, per-category, and aggregate pass rate tracking
"""
import asyncio
import base64
import io
import json
import logging
import os
import shutil
import sys
import tarfile
import tempfile
import time
import uuid
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union
# Ensure repo root is on sys.path for imports
_repo_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from pydantic import Field
from atroposlib.envs.base import EvalHandlingEnum
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from environments.agent_loop import AgentResult, HermesAgentLoop
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from environments.tool_context import ToolContext
from tools.terminal_tool import (
register_task_env_overrides,
clear_task_env_overrides,
cleanup_vm,
)
logger = logging.getLogger(__name__)
# =============================================================================
# Configuration
# =============================================================================
class TerminalBench2EvalConfig(HermesAgentEnvConfig):
"""
Configuration for the Terminal-Bench 2.0 evaluation environment.
Extends HermesAgentEnvConfig with TB2-specific settings for dataset loading,
test execution, task filtering, and eval concurrency.
"""
# --- Dataset ---
dataset_name: str = Field(
default="NousResearch/terminal-bench-2",
description="HuggingFace dataset containing TB2 tasks.",
)
# --- Test execution ---
test_timeout: int = Field(
default=180,
description="Timeout in seconds for running the test suite after agent completes.",
)
# --- Image strategy ---
force_build: bool = Field(
default=False,
description="If True, always build from Dockerfile (ignore docker_image). "
"Useful for testing custom Dockerfiles.",
)
# --- Task filtering (comma-separated from CLI) ---
task_filter: Optional[str] = Field(
default=None,
description="Comma-separated task names to run (e.g., 'fix-git,git-multibranch'). "
"If not set, all tasks are run.",
)
skip_tasks: Optional[str] = Field(
default=None,
description="Comma-separated task names to skip on top of the default skip list.",
)
# --- Per-task wall-clock timeout ---
task_timeout: int = Field(
default=1800,
description="Maximum wall-clock seconds per task (agent loop + verification). "
"Tasks exceeding this are scored as FAIL. Default 30 minutes.",
)
# Tasks that cannot run properly on Modal and are excluded from scoring.
MODAL_INCOMPATIBLE_TASKS = {
"qemu-startup", # Needs KVM/hardware virtualization
"qemu-alpine-ssh", # Needs KVM/hardware virtualization
"crack-7z-hash", # Password brute-force -- too slow for cloud sandbox timeouts
}
# =============================================================================
# Tar extraction helper
# =============================================================================
def _extract_base64_tar(b64_data: str, target_dir: Path):
"""Extract a base64-encoded tar.gz archive into target_dir."""
if not b64_data:
return
raw = base64.b64decode(b64_data)
buf = io.BytesIO(raw)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
tar.extractall(path=str(target_dir))
# =============================================================================
# Main Environment
# =============================================================================
class TerminalBench2EvalEnv(HermesAgentBaseEnv):
"""
Terminal-Bench 2.0 evaluation environment (eval-only, no training).
Inherits from HermesAgentBaseEnv for:
- Terminal backend setup (os.environ["TERMINAL_ENV"])
- Tool resolution via _resolve_tools_for_group()
- Monkey patches for async-safe tool operation
- Wandb trajectory formatting
The evaluate flow (triggered by `environment.py evaluate`):
1. setup() -- Load dataset from HuggingFace
2. evaluate() -- Run all tasks through rollout_and_score_eval()
Each task in rollout_and_score_eval():
1. Resolve Docker image (pre-built Hub image or Dockerfile fallback)
2. Register per-task Modal sandbox override
3. Run HermesAgentLoop with terminal + file tools
4. Upload test suite and execute test.sh in the same sandbox
5. Check /logs/verifier/reward.txt for pass/fail
6. Clean up sandbox, overrides, and temp files
"""
name = "terminal-bench-2"
env_config_cls = TerminalBench2EvalConfig
@classmethod
def config_init(cls) -> Tuple[TerminalBench2EvalConfig, List[APIServerConfig]]:
"""
Default configuration for Terminal-Bench 2.0 evaluation.
Uses eval-only settings:
- eval_handling=STOP_TRAIN so the eval flow runs cleanly
- steps_per_eval=1, total_steps=1 so eval triggers immediately
- group_size=1 (one rollout per group, each task is expensive)
Uses Modal terminal backend (cloud-isolated sandbox per task) and
OpenRouter with Claude for inference.
"""
env_config = TerminalBench2EvalConfig(
# Terminal + file tools only (the agent interacts via shell commands)
enabled_toolsets=["terminal", "file"],
disabled_toolsets=None,
distribution=None,
# Agent settings -- TB2 tasks are complex, need many turns
max_agent_turns=60,
max_token_length=16000,
agent_temperature=0.6,
system_prompt=None,
# Modal backend for per-task cloud-isolated sandboxes
terminal_backend="modal",
terminal_timeout=300, # 5 min per command (builds, pip install, etc.)
# Test execution timeout (TB2 test scripts can install deps like pytest)
test_timeout=180,
# 89 tasks run in parallel, each needs a thread for tool calls
tool_pool_size=128,
# --- Eval-only Atropos settings ---
# These settings make the env work as an eval-only environment:
# - STOP_TRAIN: pauses training during eval (standard for eval envs)
# - steps_per_eval=1, total_steps=1: eval triggers immediately
# - group_size=1: one rollout per group (each task is expensive)
eval_handling=EvalHandlingEnum.STOP_TRAIN,
group_size=1,
steps_per_eval=1,
total_steps=1,
tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
use_wandb=True,
wandb_name="terminal-bench-2",
ensure_scores_are_not_same=False, # Binary rewards may all be 0 or 1
)
# OpenRouter with Claude -- API key loaded from .env
server_configs = [
APIServerConfig(
base_url="https://openrouter.ai/api/v1",
model_name="anthropic/claude-sonnet-4",
server_type="openai",
api_key=os.getenv("OPENROUTER_API_KEY", ""),
health_check=False,
)
]
return env_config, server_configs
# =========================================================================
# Setup -- load dataset
# =========================================================================
async def setup(self):
"""Load the Terminal-Bench 2.0 dataset from HuggingFace."""
from datasets import load_dataset
# Auto-set terminal_lifetime to task_timeout + 120s so sandboxes
# never get killed during an active task, but still get cleaned up
# promptly after the task times out.
lifetime = self.config.task_timeout + 120
self.config.terminal_lifetime = lifetime
os.environ["TERMINAL_LIFETIME_SECONDS"] = str(lifetime)
print(f" Terminal lifetime auto-set to {lifetime}s (task_timeout + 120s)")
print(f"Loading TB2 dataset from: {self.config.dataset_name}")
ds = load_dataset(self.config.dataset_name, split="train")
# Apply task filters (comma-separated strings from CLI)
tasks = list(ds)
if self.config.task_filter:
allowed = {name.strip() for name in self.config.task_filter.split(",")}
tasks = [t for t in tasks if t["task_name"] in allowed]
print(f" Filtered to {len(tasks)} tasks: {sorted(allowed)}")
# Skip tasks incompatible with the current backend (e.g., QEMU on Modal)
# plus any user-specified skip_tasks
skip = set(MODAL_INCOMPATIBLE_TASKS) if self.config.terminal_backend == "modal" else set()
if self.config.skip_tasks:
skip |= {name.strip() for name in self.config.skip_tasks.split(",")}
if skip:
before = len(tasks)
tasks = [t for t in tasks if t["task_name"] not in skip]
skipped = before - len(tasks)
if skipped > 0:
print(f" Skipped {skipped} incompatible tasks: {sorted(skip & {t['task_name'] for t in ds})}")
self.all_eval_items = tasks
self.iter = 0
# Build category index for per-category metrics
self.category_index: Dict[str, List[int]] = defaultdict(list)
for i, task in enumerate(self.all_eval_items):
self.category_index[task.get("category", "unknown")].append(i)
# Reward tracking for wandb logging
self.eval_metrics: List[Tuple[str, float]] = []
# Streaming JSONL writer -- saves each task's full conversation
# immediately on completion so data is preserved even on Ctrl+C.
# Timestamped filename so each run produces a unique file.
import datetime
log_dir = os.path.join(os.path.dirname(__file__), "logs")
os.makedirs(log_dir, exist_ok=True)
run_ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
self._streaming_path = os.path.join(log_dir, f"samples_{run_ts}.jsonl")
self._streaming_file = open(self._streaming_path, "w")
self._streaming_lock = __import__("threading").Lock()
print(f" Streaming results to: {self._streaming_path}")
print(f"TB2 ready: {len(self.all_eval_items)} tasks across {len(self.category_index)} categories")
for cat, indices in sorted(self.category_index.items()):
print(f" {cat}: {len(indices)} tasks")
def _save_result(self, result: Dict[str, Any]):
"""Write a single task result to the streaming JSONL file immediately."""
if not hasattr(self, "_streaming_file") or self._streaming_file.closed:
return
with self._streaming_lock:
self._streaming_file.write(json.dumps(result, ensure_ascii=False, default=str) + "\n")
self._streaming_file.flush()
# =========================================================================
# Training pipeline stubs -- NOT used in eval-only mode
# =========================================================================
# These satisfy the abstract method requirements from HermesAgentBaseEnv.
# The evaluate subcommand calls setup() -> evaluate() directly, bypassing
# the training pipeline entirely.
async def get_next_item(self):
"""Return next item (stub -- not used in eval-only mode)."""
item = self.all_eval_items[self.iter % len(self.all_eval_items)]
self.iter += 1
return item
def format_prompt(self, item: Dict[str, Any]) -> str:
"""Return the task's instruction as the user prompt."""
return item["instruction"]
async def compute_reward(self, item, result, ctx) -> float:
"""Compute reward (stub -- actual verification is in rollout_and_score_eval)."""
return 0.0
async def collect_trajectories(self, item):
"""Collect trajectories (stub -- not used in eval-only mode)."""
return None, []
async def score(self, rollout_group_data):
"""Score rollouts (stub -- not used in eval-only mode)."""
return None
# =========================================================================
# Docker image resolution
# =========================================================================
def _resolve_task_image(
self, item: Dict[str, Any], task_name: str
) -> Tuple[str, Optional[Path]]:
"""
Resolve the Docker image for a task, with fallback to Dockerfile.
Strategy (mirrors Harbor's approach):
1. If force_build=True, always build from Dockerfile in environment_tar
2. If docker_image is available, use the pre-built Docker Hub image (fast)
3. Otherwise, extract Dockerfile from environment_tar and build (slow)
Returns:
(modal_image, temp_dir) -- modal_image is a Docker Hub name or a
Dockerfile path. temp_dir is set if we extracted files that need
cleanup later.
"""
docker_image = item.get("docker_image", "")
environment_tar = item.get("environment_tar", "")
# Fast path: use pre-built Docker Hub image
if docker_image and not self.config.force_build:
logger.info("Task %s: using pre-built image %s", task_name, docker_image)
return docker_image, None
# Slow path: extract Dockerfile from environment_tar and build
if environment_tar:
task_dir = Path(tempfile.mkdtemp(prefix=f"tb2-{task_name}-"))
_extract_base64_tar(environment_tar, task_dir)
dockerfile_path = task_dir / "Dockerfile"
if dockerfile_path.exists():
logger.info(
"Task %s: building from Dockerfile (force_build=%s, docker_image=%s)",
task_name, self.config.force_build, bool(docker_image),
)
return str(dockerfile_path), task_dir
# Neither available -- fall back to Hub image if force_build was True
if docker_image:
logger.warning(
"Task %s: force_build=True but no environment_tar, "
"falling back to docker_image %s", task_name, docker_image,
)
return docker_image, None
return "", None
# =========================================================================
# Per-task evaluation -- agent loop + test verification
# =========================================================================
async def rollout_and_score_eval(self, eval_item: Dict[str, Any]) -> Dict:
"""
Evaluate a single TB2 task: run the agent loop, then verify with tests.
This is the core evaluation method. For each task it:
1. Resolves the Docker image and registers the Modal sandbox override
2. Runs HermesAgentLoop with terminal + file tools
3. Uploads the test suite into the sandbox
4. Executes test.sh and checks the result
5. Cleans up the sandbox and temp files
Args:
eval_item: A single TB2 task dict from the dataset
Returns:
Dict with 'passed' (bool), 'reward' (float), 'task_name' (str),
'category' (str), and optional debug info
"""
task_name = eval_item.get("task_name", "unknown")
category = eval_item.get("category", "unknown")
task_id = str(uuid.uuid4())
task_dir = None # Set if we extract a Dockerfile (needs cleanup)
from tqdm import tqdm
tqdm.write(f" [START] {task_name} (task_id={task_id[:8]})")
task_start = time.time()
try:
# --- 1. Resolve Docker image ---
modal_image, task_dir = self._resolve_task_image(eval_item, task_name)
if not modal_image:
logger.error("Task %s: no docker_image or environment_tar, skipping", task_name)
return {
"passed": False, "reward": 0.0,
"task_name": task_name, "category": category,
"error": "no_image",
}
# --- 2. Register per-task Modal image override ---
register_task_env_overrides(task_id, {"modal_image": modal_image})
logger.info(
"Task %s: registered image override for task_id %s",
task_name, task_id[:8],
)
# --- 3. Resolve tools and build messages ---
tools, valid_names = self._resolve_tools_for_group()
messages: List[Dict[str, Any]] = []
if self.config.system_prompt:
messages.append({"role": "system", "content": self.config.system_prompt})
messages.append({"role": "user", "content": self.format_prompt(eval_item)})
# --- 4. Run agent loop ---
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# --- 5. Verify -- run test suite in the agent's sandbox ---
# Skip verification if the agent produced no meaningful output
only_system_and_user = all(
msg.get("role") in ("system", "user") for msg in result.messages
)
if result.turns_used == 0 or only_system_and_user:
logger.warning(
"Task %s: agent produced no output (turns=%d). Reward=0.",
task_name, result.turns_used,
)
reward = 0.0
else:
# Run tests in a thread so the blocking ctx.terminal() calls
# don't freeze the entire event loop (which would stall all
# other tasks, tqdm updates, and timeout timers).
ctx = ToolContext(task_id)
try:
loop = asyncio.get_event_loop()
reward = await loop.run_in_executor(
None, # default thread pool
self._run_tests, eval_item, ctx, task_name,
)
except Exception as e:
logger.error("Task %s: test verification failed: %s", task_name, e)
reward = 0.0
finally:
ctx.cleanup()
passed = reward == 1.0
status = "PASS" if passed else "FAIL"
elapsed = time.time() - task_start
tqdm.write(f" [{status}] {task_name} (turns={result.turns_used}, {elapsed:.0f}s)")
logger.info(
"Task %s: reward=%.1f, turns=%d, finished=%s",
task_name, reward, result.turns_used, result.finished_naturally,
)
out = {
"passed": passed,
"reward": reward,
"task_name": task_name,
"category": category,
"turns_used": result.turns_used,
"finished_naturally": result.finished_naturally,
"messages": result.messages,
}
self._save_result(out)
return out
except Exception as e:
elapsed = time.time() - task_start
logger.error("Task %s: rollout failed: %s", task_name, e, exc_info=True)
tqdm.write(f" [ERROR] {task_name}: {e} ({elapsed:.0f}s)")
out = {
"passed": False, "reward": 0.0,
"task_name": task_name, "category": category,
"error": str(e),
}
self._save_result(out)
return out
finally:
# --- Cleanup: clear overrides, sandbox, and temp files ---
clear_task_env_overrides(task_id)
try:
cleanup_vm(task_id)
except Exception as e:
logger.debug("VM cleanup for %s: %s", task_id[:8], e)
if task_dir and task_dir.exists():
shutil.rmtree(task_dir, ignore_errors=True)
def _run_tests(
self, item: Dict[str, Any], ctx: ToolContext, task_name: str
) -> float:
"""
Upload and execute the test suite in the agent's sandbox, then
download the verifier output locally to read the reward.
Follows Harbor's verification pattern:
1. Upload tests/ directory into the sandbox
2. Execute test.sh inside the sandbox
3. Download /logs/verifier/ directory to a local temp dir
4. Read reward.txt locally with native Python I/O
Downloading locally avoids issues with the file_read tool on
the Modal VM and matches how Harbor handles verification.
TB2 test scripts (test.sh) typically:
1. Install pytest via uv/pip
2. Run pytest against the test files in /tests/
3. Write results to /logs/verifier/reward.txt
Args:
item: The TB2 task dict (contains tests_tar, test_sh)
ctx: ToolContext scoped to this task's sandbox
task_name: For logging
Returns:
1.0 if tests pass, 0.0 otherwise
"""
tests_tar = item.get("tests_tar", "")
test_sh = item.get("test_sh", "")
if not test_sh:
logger.warning("Task %s: no test_sh content, reward=0", task_name)
return 0.0
# Create required directories in the sandbox
ctx.terminal("mkdir -p /tests /logs/verifier")
# Upload test files into the sandbox (binary-safe via base64)
if tests_tar:
tests_temp = Path(tempfile.mkdtemp(prefix=f"tb2-tests-{task_name}-"))
try:
_extract_base64_tar(tests_tar, tests_temp)
ctx.upload_dir(str(tests_temp), "/tests")
except Exception as e:
logger.warning("Task %s: failed to upload test files: %s", task_name, e)
finally:
shutil.rmtree(tests_temp, ignore_errors=True)
# Write the test runner script (test.sh)
ctx.write_file("/tests/test.sh", test_sh)
ctx.terminal("chmod +x /tests/test.sh")
# Execute the test suite
logger.info(
"Task %s: running test suite (timeout=%ds)",
task_name, self.config.test_timeout,
)
test_result = ctx.terminal(
"bash /tests/test.sh",
timeout=self.config.test_timeout,
)
exit_code = test_result.get("exit_code", -1)
output = test_result.get("output", "")
# Download the verifier output directory locally, then read reward.txt
# with native Python I/O. This avoids issues with file_read on the
# Modal VM and matches Harbor's verification pattern.
reward = 0.0
local_verifier_dir = Path(tempfile.mkdtemp(prefix=f"tb2-verifier-{task_name}-"))
try:
ctx.download_dir("/logs/verifier", str(local_verifier_dir))
reward_file = local_verifier_dir / "reward.txt"
if reward_file.exists() and reward_file.stat().st_size > 0:
content = reward_file.read_text().strip()
if content == "1":
reward = 1.0
elif content == "0":
reward = 0.0
else:
# Unexpected content -- try parsing as float
try:
reward = float(content)
except (ValueError, TypeError):
logger.warning(
"Task %s: reward.txt content unexpected (%r), "
"falling back to exit_code=%d",
task_name, content, exit_code,
)
reward = 1.0 if exit_code == 0 else 0.0
else:
# reward.txt not written -- fall back to exit code
logger.warning(
"Task %s: reward.txt not found after download, "
"falling back to exit_code=%d",
task_name, exit_code,
)
reward = 1.0 if exit_code == 0 else 0.0
except Exception as e:
logger.warning(
"Task %s: failed to download verifier dir: %s, "
"falling back to exit_code=%d",
task_name, e, exit_code,
)
reward = 1.0 if exit_code == 0 else 0.0
finally:
shutil.rmtree(local_verifier_dir, ignore_errors=True)
# Log test output for debugging failures
if reward == 0.0:
output_preview = output[-500:] if output else "(no output)"
logger.info(
"Task %s: FAIL (exit_code=%d)\n%s",
task_name, exit_code, output_preview,
)
return reward
# =========================================================================
# Evaluate -- main entry point for the eval subcommand
# =========================================================================
async def _eval_with_timeout(self, item: Dict[str, Any]) -> Dict:
"""
Wrap rollout_and_score_eval with a per-task wall-clock timeout.
If the task exceeds task_timeout seconds, it's automatically scored
as FAIL. This prevents any single task from hanging indefinitely.
"""
task_name = item.get("task_name", "unknown")
category = item.get("category", "unknown")
try:
return await asyncio.wait_for(
self.rollout_and_score_eval(item),
timeout=self.config.task_timeout,
)
except asyncio.TimeoutError:
from tqdm import tqdm
elapsed = self.config.task_timeout
tqdm.write(f" [TIMEOUT] {task_name} (exceeded {elapsed}s wall-clock limit)")
logger.error("Task %s: wall-clock timeout after %ds", task_name, elapsed)
out = {
"passed": False, "reward": 0.0,
"task_name": task_name, "category": category,
"error": f"timeout ({elapsed}s)",
}
self._save_result(out)
return out
async def evaluate(self, *args, **kwargs) -> None:
"""
Run Terminal-Bench 2.0 evaluation over all tasks.
This is the main entry point when invoked via:
python environments/terminalbench2_env.py evaluate
Runs all tasks through rollout_and_score_eval() via asyncio.gather()
(same pattern as GPQA and other Atropos eval envs). Each task is
wrapped with a wall-clock timeout so hung tasks auto-fail.
Suppresses noisy Modal/terminal output (HERMES_QUIET) so the tqdm
bar stays visible.
"""
start_time = time.time()
# Route all logging through tqdm.write() so the progress bar stays
# pinned at the bottom while log lines scroll above it.
from tqdm import tqdm
class _TqdmHandler(logging.Handler):
def emit(self, record):
try:
tqdm.write(self.format(record))
except Exception:
self.handleError(record)
handler = _TqdmHandler()
handler.setFormatter(logging.Formatter(
"%(asctime)s [%(name)s] %(levelname)s: %(message)s",
datefmt="%H:%M:%S",
))
root = logging.getLogger()
root.handlers = [handler] # Replace any existing handlers
root.setLevel(logging.INFO)
# Silence noisy third-party loggers that flood the output
logging.getLogger("httpx").setLevel(logging.WARNING) # Every HTTP request
logging.getLogger("openai").setLevel(logging.WARNING) # OpenAI client retries
logging.getLogger("rex-deploy").setLevel(logging.WARNING) # Swerex deployment
logging.getLogger("rex_image_builder").setLevel(logging.WARNING) # Image builds
print(f"\n{'='*60}")
print("Starting Terminal-Bench 2.0 Evaluation")
print(f"{'='*60}")
print(f" Dataset: {self.config.dataset_name}")
print(f" Total tasks: {len(self.all_eval_items)}")
print(f" Max agent turns: {self.config.max_agent_turns}")
print(f" Task timeout: {self.config.task_timeout}s")
print(f" Terminal backend: {self.config.terminal_backend}")
print(f" Tool thread pool: {self.config.tool_pool_size}")
print(f" Terminal timeout: {self.config.terminal_timeout}s/cmd")
print(f" Terminal lifetime: {self.config.terminal_lifetime}s (auto: task_timeout + 120)")
print(f"{'='*60}\n")
# Fire all tasks with wall-clock timeout, track live accuracy on the bar
total_tasks = len(self.all_eval_items)
eval_tasks = [
asyncio.ensure_future(self._eval_with_timeout(item))
for item in self.all_eval_items
]
results = []
passed_count = 0
pbar = tqdm(total=total_tasks, desc="Evaluating TB2", dynamic_ncols=True)
try:
for coro in asyncio.as_completed(eval_tasks):
result = await coro
results.append(result)
if result and result.get("passed"):
passed_count += 1
done = len(results)
pct = (passed_count / done * 100) if done else 0
pbar.set_postfix_str(f"pass={passed_count}/{done} ({pct:.1f}%)")
pbar.update(1)
except (KeyboardInterrupt, asyncio.CancelledError):
pbar.close()
print(f"\n\nInterrupted! Cleaning up {len(eval_tasks)} tasks...")
# Cancel all pending tasks
for task in eval_tasks:
task.cancel()
# Let cancellations propagate (finally blocks run cleanup_vm)
await asyncio.gather(*eval_tasks, return_exceptions=True)
# Belt-and-suspenders: clean up any remaining sandboxes
from tools.terminal_tool import cleanup_all_environments
cleanup_all_environments()
print("All sandboxes cleaned up.")
return
finally:
pbar.close()
end_time = time.time()
# Filter out None results (shouldn't happen, but be safe)
valid_results = [r for r in results if r is not None]
if not valid_results:
print("Warning: No valid evaluation results obtained")
return
# ---- Compute metrics ----
total = len(valid_results)
passed = sum(1 for r in valid_results if r.get("passed"))
overall_pass_rate = passed / total if total > 0 else 0.0
# Per-category breakdown
cat_results: Dict[str, List[Dict]] = defaultdict(list)
for r in valid_results:
cat_results[r.get("category", "unknown")].append(r)
# Build metrics dict
eval_metrics = {
"eval/pass_rate": overall_pass_rate,
"eval/total_tasks": total,
"eval/passed_tasks": passed,
"eval/evaluation_time_seconds": end_time - start_time,
}
# Per-category metrics
for category, cat_items in sorted(cat_results.items()):
cat_passed = sum(1 for r in cat_items if r.get("passed"))
cat_total = len(cat_items)
cat_pass_rate = cat_passed / cat_total if cat_total > 0 else 0.0
cat_key = category.replace(" ", "_").replace("-", "_").lower()
eval_metrics[f"eval/pass_rate_{cat_key}"] = cat_pass_rate
# Store metrics for wandb_log
self.eval_metrics = [(k, v) for k, v in eval_metrics.items()]
# ---- Print summary ----
print(f"\n{'='*60}")
print("Terminal-Bench 2.0 Evaluation Results")
print(f"{'='*60}")
print(f"Overall Pass Rate: {overall_pass_rate:.4f} ({passed}/{total})")
print(f"Evaluation Time: {end_time - start_time:.1f} seconds")
print("\nCategory Breakdown:")
for category, cat_items in sorted(cat_results.items()):
cat_passed = sum(1 for r in cat_items if r.get("passed"))
cat_total = len(cat_items)
cat_rate = cat_passed / cat_total if cat_total > 0 else 0.0
print(f" {category}: {cat_rate:.1%} ({cat_passed}/{cat_total})")
# Print individual task results
print("\nTask Results:")
for r in sorted(valid_results, key=lambda x: x.get("task_name", "")):
status = "PASS" if r.get("passed") else "FAIL"
turns = r.get("turns_used", "?")
error = r.get("error", "")
extra = f" (error: {error})" if error else ""
print(f" [{status}] {r['task_name']} (turns={turns}){extra}")
print(f"{'='*60}\n")
# Build sample records for evaluate_log (includes full conversations)
samples = [
{
"task_name": r.get("task_name"),
"category": r.get("category"),
"passed": r.get("passed"),
"reward": r.get("reward"),
"turns_used": r.get("turns_used"),
"error": r.get("error"),
"messages": r.get("messages"),
}
for r in valid_results
]
# Log evaluation results
try:
await self.evaluate_log(
metrics=eval_metrics,
samples=samples,
start_time=start_time,
end_time=end_time,
generation_parameters={
"temperature": self.config.agent_temperature,
"max_tokens": self.config.max_token_length,
"max_agent_turns": self.config.max_agent_turns,
"terminal_backend": self.config.terminal_backend,
},
)
except Exception as e:
print(f"Error logging evaluation results: {e}")
# Close streaming file
if hasattr(self, "_streaming_file") and not self._streaming_file.closed:
self._streaming_file.close()
print(f" Live results saved to: {self._streaming_path}")
# Kill all remaining sandboxes. Timed-out tasks leave orphaned thread
# pool workers still executing commands -- cleanup_all stops them.
from tools.terminal_tool import cleanup_all_environments
print("\nCleaning up all sandboxes...")
cleanup_all_environments()
# Shut down the tool thread pool so orphaned workers from timed-out
# tasks are killed immediately instead of retrying against dead
# sandboxes and spamming the console with TimeoutError warnings.
from environments.agent_loop import _tool_executor
_tool_executor.shutdown(wait=False, cancel_futures=True)
print("Done.")
# =========================================================================
# Wandb logging
# =========================================================================
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log TB2-specific metrics to wandb."""
if wandb_metrics is None:
wandb_metrics = {}
# Add stored eval metrics
for metric_name, metric_value in self.eval_metrics:
wandb_metrics[metric_name] = metric_value
self.eval_metrics = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
TerminalBench2EvalEnv.cli()
@@ -0,0 +1,5 @@
"""Endless Terminals Environment - Terminal task training from HuggingFace dataset."""
from .endless_terminals_env import EndlessTerminalsEnv, EndlessTerminalsEnvConfig
__all__ = ["EndlessTerminalsEnv", "EndlessTerminalsEnvConfig"]
@@ -0,0 +1,69 @@
# Endless Terminals Environment -- Default Configuration
#
# Trains agents on terminal tasks from the Endless Terminals HuggingFace dataset.
# Uses hermes-agent backends (modal/docker/local) with per-task Docker images.
# Tests run in the same sandbox the agent used (no separate containers needed).
#
# Dataset: https://huggingface.co/datasets/obiwan96/endless-terminals-train
#
# Prerequisites:
# 1. Download dataset: huggingface-cli download obiwan96/endless-terminals-train \
# --repo-type dataset --local-dir ~/endless-terminals-data \
# --local-dir-use-symlinks False
# 2. Set TASKS_BASE_DIR environment variable or configure tasks_base_dir below
# 3. For modal backend: Configure Modal CLI (modal token set)
# 4. For docker backend: Install Docker
#
# Usage:
# python environments/endless_terminals/endless_terminals_env.py process \
# --config environments/endless_terminals/default.yaml
env:
# Toolsets
enabled_toolsets: ["terminal", "file"]
# Agent configuration
max_agent_turns: 32
max_token_length: 4096
agent_temperature: 1.0
# Terminal backend
terminal_backend: "local" # Change to "modal" or "docker" for cloud isolation
# Dataset settings
use_dataset: true
dataset_name: "obiwan96/endless-terminals"
dataset_split: "train"
dataset_cache_dir: "~/.cache/huggingface/datasets"
tasks_base_dir: "" # Set to directory containing task_* folders (e.g., ~/endless-terminals-data)
# Test execution
test_timeout_s: 60
# Training configuration
group_size: 8
total_steps: 10000
steps_per_eval: 500
num_eval_tasks: 10
eval_split_ratio: 0.1
# Logging
use_wandb: true
wandb_name: "endless-terminals"
# System prompt
system_prompt: >
You are a skilled Linux system administrator and programmer.
You have access to a terminal and file tools to complete system administration
and programming tasks. Use the tools effectively to solve the given task,
and verify your solution works correctly before finishing.
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "anthropic/claude-sonnet-4.5"
server_type: "openai"
api_key: "" # Loaded from OPENROUTER_API_KEY env var
health_check: false
timeout: 30 # 30 second timeout per request
max_retries: 2 # Only retry twice
@@ -0,0 +1,921 @@
"""
Endless Terminals Environment for Hermes-Agent + Atropos RL.
Loads pre-generated terminal tasks from HuggingFace dataset and scores
agent performance using test execution in the agent's sandbox.
Uses hermes-agent backends (modal, docker, local) with per-task Docker images
extracted from container.def files. Tests run in the same sandbox the agent
used, following the Terminal Bench 2 pattern.
Dataset: https://huggingface.co/datasets/obiwan96/endless-terminals-train
Run:
python environments/endless_terminals/endless_terminals_env.py process \
--config environments/endless_terminals/default.yaml
"""
import asyncio
import logging
import os
import random
import re
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from pydantic import Field
# Ensure hermes-agent root is on path
_repo_root = Path(__file__).resolve().parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from atroposlib.envs.base import ScoredDataGroup, ScoredDataItem
from atroposlib.type_definitions import Item
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from environments.agent_loop import AgentResult
from environments.tool_context import ToolContext
from tools.terminal_tool import (
register_task_env_overrides,
clear_task_env_overrides,
cleanup_vm,
)
logger = logging.getLogger(__name__)
# Add endless-terminals to path for imports
ENDLESS_TERMINALS_PATH = os.getenv(
"ENDLESS_TERMINALS_PATH",
str(Path.home() / "Desktop" / "Projects" / "endless-terminals")
)
sys.path.insert(0, ENDLESS_TERMINALS_PATH)
class EndlessTerminalsEnvConfig(HermesAgentEnvConfig):
"""Configuration for Endless Terminals environment."""
# Dataset settings
use_dataset: bool = Field(
default=True,
description="Load tasks from HuggingFace dataset (recommended). If False, generate procedurally."
)
dataset_name: str = Field(
default="obiwan96/endless-terminals-train",
description="HuggingFace dataset name"
)
dataset_split: str = Field(
default="train",
description="Dataset split to use"
)
dataset_cache_dir: str = Field(
default="~/.cache/huggingface/datasets",
description="HuggingFace datasets cache directory"
)
tasks_base_dir: str = Field(
default="",
description="Base directory containing task_* folders. If empty, uses paths from dataset."
)
# Test execution
test_timeout_s: int = Field(default=60, description="Test execution timeout (seconds)")
# Docker image fallback
default_docker_image: str = Field(
default="ubuntu:22.04",
description="Default Docker image if container.def parsing fails"
)
# Agent defaults
max_agent_turns: int = Field(default=32, description="Max turns for agent (increased for long traces)")
# Evaluation settings
num_eval_tasks: int = Field(
default=10,
description="Number of tasks to run during periodic evaluation"
)
eval_split_ratio: float = Field(
default=0.1,
description="Fraction of dataset to hold out for evaluation (0.0-1.0)"
)
class EndlessTerminalsEnv(HermesAgentBaseEnv):
"""
Endless Terminals environment using pre-generated HuggingFace dataset.
Loads terminal tasks from dataset, runs agent with terminal tools,
and scores by executing tests in the agent's sandbox using ToolContext.
"""
name = "endless_terminals_env"
env_config_cls = EndlessTerminalsEnvConfig
@classmethod
def config_init(cls) -> Tuple[EndlessTerminalsEnvConfig, List["APIServerConfig"]]:
"""
Default configuration for Endless Terminals environment.
This is used when no config file is provided, but note that when using
--config, the YAML is loaded differently and this may not be called.
"""
from atroposlib.envs.server_handling.server_manager import APIServerConfig
env_config = EndlessTerminalsEnvConfig(
enabled_toolsets=["terminal", "file"],
max_agent_turns=32,
terminal_backend="local",
use_dataset=True,
tasks_base_dir="",
group_size=1,
total_steps=1,
use_wandb=False,
)
server_configs = [
APIServerConfig(
base_url="https://openrouter.ai/api/v1",
model_name="anthropic/claude-sonnet-4.5",
server_type="openai",
api_key=os.getenv("OPENROUTER_API_KEY", ""),
health_check=False,
)
]
return env_config, server_configs
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._dataset = None
self._train_dataset = None
self._eval_dataset = None
self._dataset_indices = []
self._current_index = 0
# Metrics tracking for wandb - single buffer with dicts
self._metrics_buffer = []
# Debug: check server config
if hasattr(self, 'server') and hasattr(self.server, 'servers'):
for i, srv in enumerate(self.server.servers):
logger.debug(f"Server {i}: model_name={getattr(srv.config, 'model_name', 'NONE')}")
async def setup(self):
"""Load dataset from HuggingFace or local directory."""
if not self.config.use_dataset:
logger.info("Using procedural task generation (not implemented yet)")
return
# If tasks_base_dir is set, load from local directory instead of HuggingFace
if self.config.tasks_base_dir:
tasks_base = Path(os.path.expanduser(self.config.tasks_base_dir))
# Resolve to absolute path if relative
if not tasks_base.is_absolute():
tasks_base = Path.cwd() / tasks_base
tasks_base = tasks_base.resolve()
if not tasks_base.exists():
raise RuntimeError(f"tasks_base_dir not found: {tasks_base}")
logger.info(f"Loading tasks from local directory: {tasks_base}")
# Find all task_* directories
task_dirs = sorted(tasks_base.glob("task_*"))
logger.info(f"Found {len(task_dirs)} task directories")
if not task_dirs:
# Debug: show what's actually in the directory
all_items = list(tasks_base.iterdir())
logger.warning(f"Directory contains {len(all_items)} items:")
for item in all_items[:10]:
logger.warning(f" - {item.name} ({'dir' if item.is_dir() else 'file'})")
raise RuntimeError(f"No task_* directories found in {tasks_base}")
# Create fake dataset items (just the directory paths)
self._dataset = [
{
"description": f"Task from {task_dir.name}",
"extra_info": {"task_dir": str(task_dir)},
}
for task_dir in task_dirs
]
logger.info(f"Loaded {len(self._dataset)} tasks from local directory")
self._split_dataset()
return
# Otherwise, load from HuggingFace
logger.info(f"Loading dataset from HuggingFace: {self.config.dataset_name}")
try:
from datasets import load_dataset
self._dataset = await asyncio.get_event_loop().run_in_executor(
None,
lambda: load_dataset(
self.config.dataset_name,
split=self.config.dataset_split,
cache_dir=os.path.expanduser(self.config.dataset_cache_dir)
)
)
logger.info(f"Loaded {len(self._dataset)} tasks from HuggingFace")
self._split_dataset()
except Exception as e:
logger.error(f"ERROR loading dataset: {e}")
raise
def _split_dataset(self):
"""Split dataset into train and eval sets based on eval_split_ratio."""
if self._dataset is None or len(self._dataset) == 0:
raise RuntimeError("Cannot split empty dataset")
total_size = len(self._dataset)
eval_size = int(total_size * self.config.eval_split_ratio)
train_size = total_size - eval_size
all_indices = list(range(total_size))
random.shuffle(all_indices)
train_indices = all_indices[:train_size]
eval_indices = all_indices[train_size:]
if isinstance(self._dataset, list):
self._train_dataset = [self._dataset[i] for i in train_indices]
self._eval_dataset = [self._dataset[i] for i in eval_indices]
else:
self._train_dataset = self._dataset.select(train_indices)
self._eval_dataset = self._dataset.select(eval_indices)
self._dataset_indices = list(range(len(self._train_dataset)))
random.shuffle(self._dataset_indices)
self._current_index = 0
logger.info(
f"Split dataset: {len(self._train_dataset)} train, "
f"{len(self._eval_dataset)} eval "
f"(ratio={self.config.eval_split_ratio:.1%})"
)
async def get_next_item(self) -> Item:
"""Sample next task from training dataset."""
if self._train_dataset is None:
raise RuntimeError("Dataset not loaded. Call setup() first.")
# Get next task (with wraparound)
idx = self._dataset_indices[self._current_index]
task = self._train_dataset[idx]
# Advance to next task
self._current_index += 1
if self._current_index >= len(self._dataset_indices):
# Reshuffle for next epoch
random.shuffle(self._dataset_indices)
self._current_index = 0
logger.info("Reshuffled dataset (completed one epoch)")
# Extract task directory path
task_dir = task.get("extra_info", {}).get("task_dir")
if not task_dir:
task_dir = task.get("reward_spec", {}).get("ground_truth")
# Resolve task directory path
if task_dir:
task_dir_path = Path(task_dir)
# If tasks_base_dir is configured and path doesn't exist, reconstruct it
if self.config.tasks_base_dir and not task_dir_path.exists():
original_path = Path(task_dir)
task_name = original_path.name
task_dir_path = Path(os.path.expanduser(self.config.tasks_base_dir)) / task_name
else:
logger.error("No task directory path found in dataset item")
return await self.get_next_item()
# Verify directory exists
if not task_dir_path.exists():
logger.warning(f"Task dir not found: {task_dir_path}")
logger.warning("Hint: Set tasks_base_dir to directory containing task_* folders")
return await self.get_next_item() # Try next task
# Look for test file in tests/ subdirectory first, then at root
final_test = task_dir_path / "tests" / "test_final_state.py"
if not final_test.exists():
final_test = task_dir_path / "test_final_state.py"
# Verify test file exists
if not final_test.exists():
logger.warning(f"Missing test file in {task_dir_path} (checked tests/ and root)")
return await self.get_next_item()
# Parse container.def to extract Docker image
# Check environment/ subdirectory first, then root
container_def = task_dir_path / "environment" / "container.def"
if not container_def.exists():
container_def = task_dir_path / "container.def"
docker_image = self._parse_docker_image_from_def(container_def)
# Try to load description from instruction.md or task.json
description = task.get("description", "")
# First try instruction.md
instruction_md = task_dir_path / "instruction.md"
if not description and instruction_md.exists():
try:
description = instruction_md.read_text().strip()
except Exception as e:
logger.warning(f"Failed to load instruction.md for {task_dir_path.name}: {e}")
# Fallback to task.json in environment/
if not description:
task_json = task_dir_path / "environment" / "task.json"
if task_json.exists():
try:
import json
task_data = json.loads(task_json.read_text())
description = task_data.get("description", "") or task_data.get("instruction", "")
except Exception as e:
logger.warning(f"Failed to load task.json for {task_dir_path.name}: {e}")
if not description:
description = f"Complete the task in {task_dir_path.name}"
return {
"task_id": f"{task_dir_path.name}",
"task_name": task_dir_path.name,
"description": description,
"task_dir": str(task_dir_path),
"final_test": str(final_test),
"docker_image": docker_image,
"dataset_index": idx,
}
def format_prompt(self, item: Item) -> str:
"""Return the task description for the agent."""
return str(item.get("description", ""))
def _parse_docker_image_from_def(self, container_def_path: Path) -> str:
"""
Parse container.def file to extract the Docker base image.
Apptainer definition files typically look like:
Bootstrap: docker
From: ubuntu:22.04
Returns the image from the "From:" line, or falls back to default.
"""
if not container_def_path.exists():
logger.warning(f"container.def not found at {container_def_path}, using default image")
return self.config.default_docker_image
try:
content = container_def_path.read_text()
# Look for "From: <image>" line (case-insensitive)
match = re.search(r'^From:\s*(.+)$', content, re.MULTILINE | re.IGNORECASE)
if match:
image = match.group(1).strip()
logger.info(f"Extracted Docker image from container.def: {image}")
return image
except Exception as e:
logger.warning(f"Failed to parse {container_def_path}: {e}")
logger.warning(f"Could not extract image from {container_def_path}, using default")
return self.config.default_docker_image
async def collect_trajectory(
self, item: Item
) -> Tuple[Optional[ScoredDataItem], List[Item]]:
"""
Override to register per-task Docker image before running the agent.
Follows Terminal Bench 2 pattern: register_task_env_overrides() tells
the hermes-agent terminal backend to use a specific Docker image for
this task_id.
This is a copy of HermesAgentBaseEnv.collect_trajectory with Docker
image registration added after task_id generation.
"""
import uuid
from environments.agent_loop import HermesAgentLoop
task_id = str(uuid.uuid4())
task_name = item.get("task_name", "unknown")
docker_image = item.get("docker_image", self.config.default_docker_image)
logger.debug(f"collect_trajectory START for {task_name}")
# Register Docker image override for this task_id
logger.debug(f"Registering Docker image: {docker_image}")
register_task_env_overrides(task_id, {"modal_image": docker_image})
logger.info(
f"Task {task_name}: registered Docker image {docker_image} for task_id {task_id[:8]}"
)
logger.debug("Docker image registered")
try:
# Get group-level tools (resolved once in collect_trajectories)
logger.debug("Resolving tools...")
if self._current_group_tools is None:
tools, valid_names = self._resolve_tools_for_group()
else:
tools, valid_names = self._current_group_tools
logger.debug(f"Tools resolved: {len(tools)} tools")
# Build initial messages
logger.debug("Building initial messages...")
messages: List[Dict[str, Any]] = []
if self.config.system_prompt:
messages.append({"role": "system", "content": self.config.system_prompt})
messages.append({"role": "user", "content": self.format_prompt(item)})
logger.debug("Messages built, starting agent loop...")
# Run the agent loop
result: AgentResult
managed_state: Optional[Dict[str, Any]] = None
if self._use_managed_server():
# Phase 2: ManagedServer with parser
from environments.tool_call_parsers import get_parser
try:
tc_parser = get_parser(self.config.tool_call_parser)
except KeyError:
logger.warning(
"Tool call parser '%s' not found, falling back to 'hermes'",
self.config.tool_call_parser,
)
tc_parser = get_parser("hermes")
try:
async with self.server.managed_server(
tokenizer=self.tokenizer,
tool_call_parser=tc_parser,
) as managed:
agent = HermesAgentLoop(
server=managed,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# Get state directly from managed server while still in context
managed_state = managed.get_state()
except NotImplementedError:
# DummyManagedServer not allowed
logger.warning("ManagedServer not available. Falling back to direct server mode.")
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
else:
# Phase 1: OpenAI server
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# Skip reward computation if agent produced no output
only_system_and_user = all(
msg.get("role") in ("system", "user") for msg in result.messages
)
if result.turns_used == 0 or only_system_and_user:
logger.warning(
"Agent loop produced no output (turns=%d). Skipping trajectory.",
result.turns_used,
)
# Return None to skip this trajectory (likely an API failure)
return None, []
else:
# Compute reward using ToolContext
ctx = ToolContext(task_id)
try:
reward = await self.compute_reward(item, result, ctx)
except Exception as e:
logger.error("compute_reward failed: %s", e)
reward = 0.0
finally:
ctx.cleanup()
# Track metrics for wandb logging
task_metrics = {
"test_passed": 1.0 if reward > 0.5 else 0.0,
"reward": reward,
"turns_used": result.turns_used,
"finished_naturally": result.finished_naturally,
"docker_image": docker_image,
"num_tool_errors": len(result.tool_errors),
}
# Include detailed tool errors if any occurred
if result.tool_errors:
task_metrics["tool_errors"] = [
{
"turn": err.turn,
"tool": err.tool_name,
"error": err.error[:200],
}
for err in result.tool_errors
]
self._metrics_buffer.append(task_metrics)
# ============================================================================
# Build ScoredDataGroup from ManagedServer state
# ============================================================================
# Phase 2: Extract pre-computed data from SequenceNodes
# We may have multiple trajectories in the nodes due to how interesting
# agents can be, so iterate through all nodes and return multiple sequences.
#
# Each SequenceNode contains:
# - tokens: Full unmasked token sequence [1, 2, 3, ..., N]
# - masked_tokens: Training format [-100, -100, ..., -100, actual, actual, ...]
# - logprobs: Training format [1.0, 1.0, ..., 1.0, -0.5, -0.3, ...]
# - full_text: Complete text (prompt + all completions)
#
# Phase 1: Create placeholder tokens for OpenAI-style servers
# ============================================================================
nodes = (managed_state or {}).get("nodes", []) if managed_state else []
# Create ScoredDataGroup with lists for multiple trajectories
scored_group = ScoredDataGroup()
scored_group["tokens"] = []
scored_group["masks"] = []
scored_group["scores"] = []
scored_group["messages"] = []
scored_group["inference_logprobs"] = []
if nodes:
# Phase 2: iterate through all nodes (may have multiple trajectories)
for i, node in enumerate(nodes):
scored_group["tokens"].append(node.tokens)
scored_group["masks"].append(node.masked_tokens)
scored_group["scores"].append(reward)
scored_group["messages"].append(result.messages)
if hasattr(node, "logprobs") and node.logprobs:
scored_group["inference_logprobs"].append(node.logprobs)
else:
# Placeholder logprobs if not available
scored_group["inference_logprobs"].append([1.0] * len(node.tokens))
logger.debug(f"Added trajectory {i+1}/{len(nodes)} with {len(node.tokens)} tokens")
else:
# Phase 1: create placeholder tokens for OpenAI-style servers
full_text = "\n".join(
msg.get("content", "") for msg in result.messages if msg.get("content")
)
if self.tokenizer:
tokens = self.tokenizer.encode(full_text, add_special_tokens=True)
else:
tokens = list(range(min(len(full_text) // 4, 128)))
scored_group["tokens"].append(tokens)
scored_group["masks"].append([-100] + tokens[1:])
scored_group["scores"].append(reward)
scored_group["messages"].append(result.messages)
scored_group["inference_logprobs"].append([1.0] * len(tokens))
# Return None if no trajectories collected
if len(scored_group["tokens"]) == 0:
return None, []
logger.debug(f"Returning ScoredDataGroup with {len(scored_group['tokens'])} trajectories")
return scored_group, []
finally:
# Clean up task overrides and sandbox
clear_task_env_overrides(task_id)
try:
cleanup_vm(task_id)
except Exception as e:
logger.debug(f"VM cleanup for {task_id[:8]}: {e}")
async def compute_reward(
self,
item: Item,
result: AgentResult,
ctx: ToolContext
) -> float:
"""
Run final tests in the agent's sandbox and return binary reward.
Uses ToolContext to execute pytest in the SAME sandbox the agent used,
following the Terminal Bench 2 verification pattern. No separate
Apptainer execution needed.
Returns 1.0 if tests pass, 0.0 otherwise.
"""
task_name = item.get("task_name", "unknown")
final_test_path = Path(item.get("final_test", ""))
if not final_test_path.exists():
logger.error(f"Task {task_name}: test file not found at {final_test_path}")
return 0.0
logger.info(f"Task {task_name}: running tests in sandbox...")
try:
# Run tests in a thread to avoid blocking the event loop
loop = asyncio.get_event_loop()
reward = await loop.run_in_executor(
None,
self._run_tests_in_sandbox,
final_test_path,
ctx,
task_name,
)
status = "PASS" if reward == 1.0 else "FAIL"
logger.info(f"Task {task_name}: {status} (reward={reward})")
return reward
except Exception as e:
logger.error(f"Task {task_name}: test execution failed: {e}", exc_info=True)
return 0.0
def _run_tests_in_sandbox(
self,
test_file_path: Path,
ctx: ToolContext,
task_name: str,
) -> float:
"""
Upload test file to sandbox and execute pytest.
Runs in thread pool (via run_in_executor) to avoid blocking the event loop
with synchronous ToolContext calls.
Args:
test_file_path: Local path to test_final_state.py
ctx: ToolContext scoped to the agent's sandbox
task_name: For logging
Returns:
1.0 if tests pass, 0.0 otherwise
"""
try:
# Upload test file to sandbox
test_content = test_file_path.read_text()
ctx.write_file("/workspace/test_final_state.py", test_content)
logger.debug(f"Task {task_name}: uploaded test file to /workspace/test_final_state.py")
# Run pytest in the sandbox
result = ctx.terminal(
"cd /workspace && python -m pytest -q test_final_state.py",
timeout=self.config.test_timeout_s,
)
exit_code = result.get("exit_code", -1)
output = result.get("output", "")
if exit_code == 0:
logger.debug(f"Task {task_name}: tests passed")
return 1.0
else:
# Log failure output (last 500 chars for debugging)
output_preview = output[-500:] if output else "(no output)"
logger.info(
f"Task {task_name}: tests failed (exit_code={exit_code})\n{output_preview}"
)
return 0.0
except Exception as e:
logger.error(f"Task {task_name}: error running tests: {e}")
return 0.0
async def evaluate(self):
"""
Periodic evaluation on holdout eval set.
Runs the agent on num_eval_tasks from the held-out eval set
(never seen during training). Returns metrics for wandb logging.
"""
if self._eval_dataset is None:
logger.warning("Cannot evaluate: eval dataset not loaded")
return {}
if len(self._eval_dataset) == 0:
logger.warning("Eval dataset is empty")
return {}
# Use min of num_eval_tasks and actual eval set size
num_tasks = min(self.config.num_eval_tasks, len(self._eval_dataset))
logger.info(f"Starting evaluation on {num_tasks} held-out tasks...")
eval_metrics = {
"rewards": [],
"passes": [],
"turns": [],
"natural_finishes": [],
}
# Sample from eval set (holdout)
import random
eval_indices = random.sample(range(len(self._eval_dataset)), num_tasks)
for idx in eval_indices:
task = self._eval_dataset[idx]
# Build item using same logic as get_next_item
task_dir = task.get("extra_info", {}).get("task_dir")
if not task_dir:
task_dir = task.get("reward_spec", {}).get("ground_truth")
if not task_dir:
continue
task_dir_path = Path(task_dir)
if self.config.tasks_base_dir and not task_dir_path.exists():
original_path = Path(task_dir)
task_name = original_path.name
task_dir_path = Path(os.path.expanduser(self.config.tasks_base_dir)) / task_name
if not task_dir_path.exists():
continue
# Find test file
final_test = task_dir_path / "tests" / "test_final_state.py"
if not final_test.exists():
final_test = task_dir_path / "test_final_state.py"
if not final_test.exists():
continue
# Parse Docker image
container_def = task_dir_path / "environment" / "container.def"
if not container_def.exists():
container_def = task_dir_path / "container.def"
docker_image = self._parse_docker_image_from_def(container_def)
# Load description
description = task.get("description", "")
instruction_md = task_dir_path / "instruction.md"
if not description and instruction_md.exists():
try:
description = instruction_md.read_text().strip()
except Exception:
pass
item = {
"description": description,
"final_test": str(final_test),
"docker_image": docker_image,
}
# Run agent on this task
try:
import uuid
task_id = str(uuid.uuid4())
# Register task environment
from model_tools import register_task_env_overrides
register_task_env_overrides(task_id, {"modal_image": docker_image})
# Build messages
messages = [
{"role": "system", "content": self.config.system_prompt},
{"role": "user", "content": description or "Complete the task."},
]
# Get tools
from model_tools import get_tool_definitions
tools = get_tool_definitions(self.config.enabled_toolsets)
valid_names = {t["function"]["name"] for t in tools}
# Run agent
from environments.agent_loop import HermesAgentLoop
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# Compute reward
from environments.tool_context import ToolContext
ctx = ToolContext(task_id)
try:
reward = await self.compute_reward(item, result, ctx)
except Exception as e:
logger.warning(f"Eval reward computation failed: {e}")
reward = 0.0
finally:
ctx.cleanup()
# Track metrics
eval_metrics["rewards"].append(reward)
eval_metrics["passes"].append(1.0 if reward > 0.5 else 0.0)
eval_metrics["turns"].append(result.turns_used)
eval_metrics["natural_finishes"].append(1.0 if result.finished_naturally else 0.0)
except Exception as e:
logger.error(f"Eval task failed: {e}")
continue
finally:
# Cleanup
from model_tools import clear_task_env_overrides, cleanup_vm
clear_task_env_overrides(task_id)
cleanup_vm(task_id)
# Aggregate metrics
if not eval_metrics["rewards"]:
logger.warning("No eval tasks completed successfully")
return {}
aggregated = {
"eval/pass_rate": sum(eval_metrics["passes"]) / len(eval_metrics["passes"]),
"eval/avg_reward": sum(eval_metrics["rewards"]) / len(eval_metrics["rewards"]),
"eval/avg_turns": sum(eval_metrics["turns"]) / len(eval_metrics["turns"]),
"eval/natural_finish_rate": sum(eval_metrics["natural_finishes"]) / len(eval_metrics["natural_finishes"]),
"eval/num_tasks": len(eval_metrics["rewards"]),
}
logger.info(f"Evaluation complete: pass_rate={aggregated['eval/pass_rate']:.2%}, avg_turns={aggregated['eval/avg_turns']:.1f}")
return aggregated
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log Endless Terminals specific metrics to wandb."""
if wandb_metrics is None:
wandb_metrics = {}
# Aggregate metrics from buffer
if self._metrics_buffer:
# Test pass rate
test_passes = [m["test_passed"] for m in self._metrics_buffer]
wandb_metrics["endless_terminals/test_pass_rate"] = sum(test_passes) / len(test_passes)
wandb_metrics["endless_terminals/num_tests_passed"] = sum(test_passes)
wandb_metrics["endless_terminals/num_tests_total"] = len(test_passes)
# Turns used statistics
turns = [m["turns_used"] for m in self._metrics_buffer]
wandb_metrics["endless_terminals/avg_turns_used"] = sum(turns) / len(turns)
wandb_metrics["endless_terminals/max_turns_used"] = max(turns)
wandb_metrics["endless_terminals/min_turns_used"] = min(turns)
# Natural finish rate (did model stop on its own vs hitting max turns)
natural_finishes = [1.0 if m["finished_naturally"] else 0.0 for m in self._metrics_buffer]
wandb_metrics["endless_terminals/natural_finish_rate"] = sum(natural_finishes) / len(natural_finishes)
# Tool error statistics
total_tool_errors = sum(m["num_tool_errors"] for m in self._metrics_buffer)
wandb_metrics["endless_terminals/total_tool_errors"] = total_tool_errors
wandb_metrics["endless_terminals/avg_tool_errors_per_task"] = total_tool_errors / len(self._metrics_buffer)
# Docker image distribution (count unique images used)
docker_images = [m["docker_image"] for m in self._metrics_buffer]
unique_images = set(docker_images)
wandb_metrics["endless_terminals/num_unique_docker_images"] = len(unique_images)
# Log most common errors if any
all_errors = []
for m in self._metrics_buffer:
if "tool_errors" in m:
all_errors.extend(m["tool_errors"])
if all_errors:
# Count error types
error_tools = {}
for err in all_errors:
tool = err["tool"]
error_tools[tool] = error_tools.get(tool, 0) + 1
# Log top 3 error-prone tools
for i, (tool, count) in enumerate(sorted(error_tools.items(), key=lambda x: x[1], reverse=True)[:3]):
wandb_metrics[f"endless_terminals/errors_by_tool/{tool}"] = count
# Clear buffer after logging
self._metrics_buffer = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
EndlessTerminalsEnv.cli()
+672
View File
@@ -0,0 +1,672 @@
"""
HermesAgentBaseEnv -- Abstract Base Environment for Hermes-Agent + Atropos
Provides the Atropos integration plumbing that all hermes-agent environments share:
- Two-mode operation (OpenAI server for Phase 1, VLLM ManagedServer for Phase 2)
- Per-group toolset/distribution resolution
- Agent loop orchestration via HermesAgentLoop
- ToolContext creation for reward functions
- ScoredDataGroup construction from ManagedServer state
Subclasses only need to implement:
setup() -- Load dataset, initialize state
get_next_item() -- Return the next item from the dataset
format_prompt() -- Convert a dataset item into the user message
compute_reward() -- Score the rollout (has full ToolContext access)
evaluate() -- Periodic evaluation
"""
import asyncio
import json
import logging
import os
import sys
import uuid
from abc import abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional, Set, Tuple, Union
# Ensure the hermes-agent repo root is on sys.path so that imports like
# `from model_tools import ...` and `from environments.X import ...` work
# regardless of where the script is invoked from.
_repo_root = Path(__file__).resolve().parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from dotenv import load_dotenv
from pydantic import Field
# Load API keys from hermes-agent/.env so all environments can access them
_env_path = _repo_root / ".env"
if _env_path.exists():
load_dotenv(dotenv_path=_env_path)
# Apply monkey patches for async-safe tool operation inside Atropos's event loop.
# This patches SwerexModalEnvironment to use a background thread instead of
# asyncio.run(), which would deadlock inside Atropos. Safe for normal CLI too.
from environments.patches import apply_patches
apply_patches()
from atroposlib.envs.base import (
BaseEnv,
BaseEnvConfig,
ScoredDataGroup,
ScoredDataItem,
)
from atroposlib.envs.server_handling.server_manager import (
APIServerConfig,
ServerBaseline,
ServerManager,
)
from atroposlib.type_definitions import Item
from environments.agent_loop import AgentResult, HermesAgentLoop
from environments.tool_context import ToolContext
# Import hermes-agent toolset infrastructure
from model_tools import get_tool_definitions
from toolset_distributions import sample_toolsets_from_distribution
logger = logging.getLogger(__name__)
class HermesAgentEnvConfig(BaseEnvConfig):
"""
Configuration for hermes-agent Atropos environments.
Extends BaseEnvConfig with agent-specific settings for toolsets,
terminal backend, dataset loading, and tool call parsing.
"""
# --- Toolset configuration ---
# Mutually exclusive: use either enabled_toolsets OR distribution
enabled_toolsets: Optional[List[str]] = Field(
default=None,
description="Explicit list of hermes toolsets to enable (e.g., ['terminal', 'file', 'web']). "
"If None and distribution is also None, all available toolsets are enabled.",
)
disabled_toolsets: Optional[List[str]] = Field(
default=None,
description="Toolsets to disable. Applied as a filter on top of enabled_toolsets or distribution.",
)
distribution: Optional[str] = Field(
default=None,
description="Name of a toolset distribution from toolset_distributions.py "
"(e.g., 'development', 'terminal_tasks'). Sampled once per group. "
"Mutually exclusive with enabled_toolsets.",
)
# --- Agent loop configuration ---
max_agent_turns: int = Field(
default=30,
description="Maximum number of LLM calls (tool-calling iterations) per rollout.",
)
system_prompt: Optional[str] = Field(
default=None,
description="System prompt for the agent. Tools are handled via the tools= parameter, "
"not embedded in the prompt text.",
)
agent_temperature: float = Field(
default=1.0,
description="Sampling temperature for agent generation during rollouts.",
)
# --- Terminal backend ---
terminal_backend: str = Field(
default="local",
description="Terminal backend: 'local', 'docker', 'modal', 'ssh', 'singularity'. "
"Modal recommended for production RL (cloud isolation per rollout).",
)
terminal_timeout: int = Field(
default=120,
description="Per-command timeout in seconds for terminal tool calls. "
"Commands exceeding this are killed. Increase for tasks with long-running "
"commands (compilation, pip install, etc.).",
)
terminal_lifetime: int = Field(
default=3600,
description="Sandbox inactivity lifetime in seconds. The cleanup thread kills "
"sandboxes that have been idle longer than this. Must be longer than "
"the longest gap between tool calls (e.g., waiting for LLM response).",
)
# --- Dataset ---
dataset_name: Optional[str] = Field(
default=None,
description="HuggingFace dataset name. Optional if tasks are defined inline.",
)
dataset_split: str = Field(
default="train",
description="Dataset split to use.",
)
prompt_field: str = Field(
default="prompt",
description="Which field in the dataset contains the prompt.",
)
# --- Thread pool ---
tool_pool_size: int = Field(
default=128,
description="Thread pool size for tool execution. Each concurrent task needs a "
"thread for tool calls. Must be large enough for parallel evaluation. "
"Too small = thread pool starvation.",
)
# --- Phase 2: Tool call parsing ---
tool_call_parser: str = Field(
default="hermes",
description="Tool call parser name for Phase 2 (VLLM server type). "
"Ignored in Phase 1 (OpenAI server type where VLLM parses natively). "
"Options: hermes, mistral, llama3_json, qwen, deepseek_v3, etc.",
)
# --- Provider-specific parameters ---
# Passed as extra_body to the OpenAI client's chat.completions.create() call.
# Useful for OpenRouter provider preferences, transforms, route settings, etc.
# Example YAML:
# extra_body:
# provider:
# ignore: ["DeepInfra", "Fireworks"]
# order: ["Together"]
# transforms: ["middle-out"]
extra_body: Optional[Dict[str, Any]] = Field(
default=None,
description="Extra body parameters passed to the OpenAI client's "
"chat.completions.create(). Used for OpenRouter provider preferences, "
"transforms, and other provider-specific settings.",
)
class HermesAgentBaseEnv(BaseEnv):
"""
Abstract base environment for hermes-agent Atropos integration.
Handles two modes of operation:
- Phase 1 (OpenAI server type): Uses server.chat_completion() directly.
The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing
and reasoning extraction natively. DummyManagedServer provides placeholder
tokens. Good for SFT data gen, verifier testing, evaluation.
- Phase 2 (VLLM server type): Uses ManagedServer for exact token IDs + logprobs
via /generate. Client-side tool call parser reconstructs structured tool_calls
from raw output. Full RL training capability.
Subclasses must implement:
setup() -- Load dataset, initialize state
get_next_item() -- Return the next item to roll out
format_prompt() -- Convert a dataset item into the user message string
compute_reward() -- Score the rollout using ToolContext
evaluate() -- Periodic evaluation
"""
name: Optional[str] = "hermes-agent"
env_config_cls = HermesAgentEnvConfig
def __init__(
self,
config: HermesAgentEnvConfig,
server_configs: Union[ServerBaseline, List[APIServerConfig]],
slurm=False,
testing=False,
):
super().__init__(config, server_configs, slurm, testing)
# Set terminal environment variables so hermes tools pick them up.
# These can all be overridden per-environment via config fields instead
# of requiring users to set shell env vars.
if config.terminal_backend:
os.environ["TERMINAL_ENV"] = config.terminal_backend
os.environ["TERMINAL_TIMEOUT"] = str(config.terminal_timeout)
os.environ["TERMINAL_LIFETIME_SECONDS"] = str(config.terminal_lifetime)
print(
f"🖥️ Terminal: backend={config.terminal_backend}, "
f"timeout={config.terminal_timeout}s, lifetime={config.terminal_lifetime}s"
)
# Resize the agent loop's thread pool for tool execution.
# This must be large enough for the number of concurrent tasks
# (e.g., 89 parallel TB2 eval tasks each need a thread for tool calls).
from environments.agent_loop import resize_tool_pool
resize_tool_pool(config.tool_pool_size)
# Current group's resolved tools (set in collect_trajectories)
self._current_group_tools: Optional[Tuple[List[Dict], Set[str]]] = None
# Tool error tracking for wandb logging
self._tool_error_buffer: List[Dict[str, Any]] = []
# =========================================================================
# Toolset resolution (per-group)
# =========================================================================
def _resolve_tools_for_group(self) -> Tuple[List[Dict[str, Any]], Set[str]]:
"""
Resolve toolsets for a group. Called once in collect_trajectories(),
then shared by all collect_trajectory() calls in the group.
If distribution is set, samples probabilistically.
If enabled_toolsets is set, uses that explicit list.
disabled_toolsets is applied as a filter on top.
Returns:
(tool_schemas, valid_tool_names) tuple
"""
config = self.config
if config.distribution:
group_toolsets = sample_toolsets_from_distribution(config.distribution)
logger.info("Sampled toolsets from '%s': %s", config.distribution, group_toolsets)
else:
group_toolsets = config.enabled_toolsets # None means "all available"
if group_toolsets is None:
logger.warning(
"enabled_toolsets is None -- loading ALL tools including messaging. "
"Set explicit enabled_toolsets for RL training."
)
tools = get_tool_definitions(
enabled_toolsets=group_toolsets,
disabled_toolsets=config.disabled_toolsets,
quiet_mode=True,
)
valid_names = {t["function"]["name"] for t in tools} if tools else set()
logger.info("Resolved %d tools for group: %s", len(valid_names), sorted(valid_names))
return tools, valid_names
# =========================================================================
# Server mode detection
# =========================================================================
def _use_managed_server(self) -> bool:
"""
Determine if we should use ManagedServer (Phase 2) or direct server (Phase 1).
Phase 2 (ManagedServer) is used when the server type is 'vllm' or 'sglang',
which go through the /generate endpoint for exact token tracking.
Phase 1 (direct server) is used for 'openai' server type, which uses
/v1/chat/completions with native tool call parsing.
"""
if not self.server.servers:
return False
server = self.server.servers[0]
# If the server is an OpenAI server (not VLLM/SGLang), use direct mode
from atroposlib.envs.server_handling.openai_server import OpenAIServer
return not isinstance(server, OpenAIServer)
# =========================================================================
# Core Atropos integration
# =========================================================================
async def collect_trajectories(
self, item: Item
) -> Tuple[
Union[Optional[ScoredDataGroup], List[Optional[ScoredDataGroup]]],
List[Item],
]:
"""
Override collect_trajectories to resolve toolsets once per group,
then delegate to the standard group-level collection.
The default BaseEnv.collect_trajectories() calls collect_trajectory()
group_size times in parallel. We resolve tools once here and store
them for all those calls to use.
"""
# Resolve toolsets for this group (shared by all rollouts in the group)
self._current_group_tools = self._resolve_tools_for_group()
# Delegate to the default implementation which calls collect_trajectory()
# group_size times via asyncio.gather
return await super().collect_trajectories(item)
# =========================================================================
# Wandb rollout display -- format trajectories nicely
# =========================================================================
@staticmethod
def _format_trajectory_for_display(messages: List[Dict[str, Any]]) -> str:
"""
Format a conversation's messages into a readable trajectory string
for wandb rollout tables. Shows tool calls, tool results, and reasoning
in a structured way instead of raw token decoding.
"""
parts = []
for msg in messages:
role = msg.get("role", "unknown")
content = msg.get("content", "")
if role == "system":
parts.append(f"[SYSTEM]\n{content}")
elif role == "user":
parts.append(f"[USER]\n{content}")
elif role == "assistant":
# Show reasoning if present
reasoning = msg.get("reasoning_content", "")
if reasoning:
# Truncate long reasoning for display
if len(reasoning) > 300:
reasoning = reasoning[:300] + "..."
parts.append(f"[ASSISTANT thinking]\n{reasoning}")
# Show content
if content:
parts.append(f"[ASSISTANT]\n{content}")
# Show tool calls
tool_calls = msg.get("tool_calls", [])
for tc in tool_calls:
func = tc.get("function", {})
name = func.get("name", "?")
args = func.get("arguments", "{}")
# Truncate long arguments for display
if len(args) > 200:
args = args[:200] + "..."
parts.append(f"[TOOL CALL] {name}({args})")
elif role == "tool":
tool_id = msg.get("tool_call_id", "")
result = content
# Truncate long tool results for display
if len(result) > 500:
result = result[:500] + "..."
parts.append(f"[TOOL RESULT] {result}")
return "\n\n".join(parts)
async def add_rollouts_for_wandb(
self,
scored_data,
item=None,
):
"""
Override to show formatted trajectories with tool calls visible,
instead of raw token decoding which loses all structure.
"""
num_keep = self.config.num_rollouts_per_group_for_logging
if num_keep == -1:
num_keep = self.config.group_size
group = []
for i in range(min(num_keep, len(scored_data.get("scores", [])))):
score = scored_data["scores"][i]
# Use messages if available for rich display
messages = None
if scored_data.get("messages") and i < len(scored_data["messages"]):
messages = scored_data["messages"][i]
if messages:
text = self._format_trajectory_for_display(messages)
elif scored_data.get("tokens") and i < len(scored_data["tokens"]):
text = self.tokenizer.decode(scored_data["tokens"][i])
else:
text = "(no data)"
group.append((text, score))
self.rollouts_for_wandb.append(group)
if len(self.rollouts_for_wandb) > self.config.num_rollouts_to_keep:
self.rollouts_for_wandb.pop(0)
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log base metrics including tool errors to wandb."""
if wandb_metrics is None:
wandb_metrics = {}
# Log tool error stats
if self._tool_error_buffer:
wandb_metrics["train/tool_errors_count"] = len(self._tool_error_buffer)
# Log error details as a summary string (tables can crash wandb on tmp cleanup)
error_summaries = []
for err in self._tool_error_buffer:
error_summaries.append(
f"[turn {err['turn']}] {err['tool']}({err['args'][:80]}) -> {err['error'][:150]}"
)
wandb_metrics["train/tool_error_details"] = "\n".join(error_summaries)
# Also print to stdout for immediate visibility
for summary in error_summaries:
print(f" Tool Error: {summary}")
self._tool_error_buffer = []
else:
wandb_metrics["train/tool_errors_count"] = 0
await super().wandb_log(wandb_metrics)
async def collect_trajectory(
self, item: Item
) -> Tuple[Optional[Union[ScoredDataItem, Any]], List[Item]]:
"""
Run a single rollout: agent loop + reward computation.
This is called group_size times in parallel by collect_trajectories().
Each call gets its own task_id for terminal/browser session isolation.
"""
task_id = str(uuid.uuid4())
# Get group-level tools (resolved once in collect_trajectories)
if self._current_group_tools is None:
# Fallback: resolve per-trajectory if called outside collect_trajectories
tools, valid_names = self._resolve_tools_for_group()
else:
tools, valid_names = self._current_group_tools
# Build initial messages
messages: List[Dict[str, Any]] = []
if self.config.system_prompt:
messages.append({"role": "system", "content": self.config.system_prompt})
messages.append({"role": "user", "content": self.format_prompt(item)})
# Run the agent loop
result: AgentResult
if self._use_managed_server():
# Phase 2: ManagedServer with parser -- exact tokens + logprobs
# Load the tool call parser from registry based on config
from environments.tool_call_parsers import get_parser
try:
tc_parser = get_parser(self.config.tool_call_parser)
except KeyError:
logger.warning(
"Tool call parser '%s' not found, falling back to 'hermes'",
self.config.tool_call_parser,
)
tc_parser = get_parser("hermes")
try:
async with self.server.managed_server(
tokenizer=self.tokenizer,
tool_call_parser=tc_parser,
) as managed:
agent = HermesAgentLoop(
server=managed,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
except NotImplementedError:
# DummyManagedServer not allowed -- fall back to Phase 1
logger.warning(
"ManagedServer not available (OpenAI server?). "
"Falling back to direct server mode."
)
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
else:
# Phase 1: OpenAI server -- native tool_calls, placeholder tokens
agent = HermesAgentLoop(
server=self.server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# Skip reward computation if the agent loop produced no meaningful work
# (e.g., API call failed on turn 1). No point spinning up a Modal sandbox
# just to verify files that were never created.
only_system_and_user = all(
msg.get("role") in ("system", "user") for msg in result.messages
)
if result.turns_used == 0 or only_system_and_user:
logger.warning(
"Agent loop produced no output (turns=%d, msgs=%d). Skipping reward.",
result.turns_used, len(result.messages),
)
reward = 0.0
else:
# Compute reward using ToolContext (gives verifier full tool access)
ctx = ToolContext(task_id)
try:
reward = await self.compute_reward(item, result, ctx)
except Exception as e:
logger.error("compute_reward failed: %s", e)
reward = 0.0
finally:
ctx.cleanup()
# Track tool errors for wandb logging
if result.tool_errors:
for err in result.tool_errors:
self._tool_error_buffer.append({
"turn": err.turn,
"tool": err.tool_name,
"args": err.arguments[:150],
"error": err.error[:300],
"result": err.tool_result[:300],
})
# Build ScoredDataItem from ManagedServer state
# Phase 2: real tokens/masks/logprobs from SequenceNodes
# Phase 1: placeholder tokens (still need a valid ScoredDataItem for the pipeline)
nodes = (result.managed_state or {}).get("nodes", [])
if nodes:
# Phase 2 (or DummyManagedServer): use actual node data
node = nodes[-1] # Final sequence node = full trajectory
scored_item: Dict[str, Any] = {
"tokens": node.tokens,
"masks": node.masked_tokens,
"scores": reward,
}
# Include logprobs if available (Phase 2)
if hasattr(node, "logprobs") and node.logprobs:
scored_item["advantages"] = None # Computed by trainer
scored_item["ref_logprobs"] = None
else:
# Phase 1 with no managed state: create placeholder tokens
# so the data pipeline doesn't break. These are NOT suitable
# for training but allow process mode (SFT data gen) to work.
# Tokenize the full conversation to get approximate tokens.
full_text = "\n".join(
msg.get("content", "") for msg in result.messages if msg.get("content")
)
if self.tokenizer:
tokens = self.tokenizer.encode(full_text, add_special_tokens=True)
else:
tokens = list(range(min(len(full_text) // 4, 128)))
scored_item = {
"tokens": tokens,
"masks": [-100] + tokens[1:], # Mask first token as prompt
"scores": reward,
}
# Always include messages for wandb rollout display and data logging
scored_item["messages"] = result.messages
return scored_item, []
# =========================================================================
# Abstract methods -- subclasses must implement
# =========================================================================
@abstractmethod
async def setup(self):
"""
Load dataset, initialize state.
Called once when the environment starts. Typical implementation:
self.dataset = load_dataset(self.config.dataset_name, split=self.config.dataset_split)
self.iter = 0
"""
raise NotImplementedError
@abstractmethod
async def get_next_item(self) -> Item:
"""
Return the next item from the dataset for rollout.
Called by the base env's main loop to get items for workers.
Should cycle through the dataset.
"""
raise NotImplementedError
@abstractmethod
def format_prompt(self, item: Item) -> str:
"""
Convert a dataset item into the user message for the agent.
Args:
item: Dataset item (dict, tuple, etc.)
Returns:
The prompt string to send to the agent
"""
raise NotImplementedError
@abstractmethod
async def compute_reward(
self, item: Item, result: AgentResult, ctx: ToolContext
) -> float:
"""
Score the rollout. Has full access to:
- item: the original dataset item (ground truth, test commands, etc.)
- result: AgentResult with full messages, turn count, reasoning, etc.
- ctx: ToolContext -- call ANY hermes-agent tool (terminal, file, web,
browser, vision...) scoped to this rollout's sandbox. Nothing
is off-limits.
Args:
item: The dataset item that was rolled out
result: The agent's rollout result
ctx: ToolContext with full tool access for verification
Returns:
Reward float (typically 0.0 to 1.0, but any float is valid)
"""
raise NotImplementedError
@abstractmethod
async def evaluate(self, *args, **kwargs):
"""
Periodic evaluation. Called every steps_per_eval steps.
Typical implementation runs the agent on a held-out eval set
and logs metrics via wandb/evaluate_log.
"""
raise NotImplementedError
+34
View File
@@ -0,0 +1,34 @@
# SWE Environment -- Default Configuration
#
# SWE-bench style tasks with Modal sandboxes for cloud isolation.
# Uses terminal + file + web toolsets.
#
# Usage:
# python environments/hermes_swe_env/hermes_swe_env.py serve \
# --config environments/hermes_swe_env/default.yaml
env:
enabled_toolsets: ["terminal", "file", "web"]
max_agent_turns: 30
max_token_length: 4096
group_size: 4
terminal_backend: "modal"
tool_call_parser: "hermes"
tokenizer_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
dataset_name: "bigcode/humanevalpack"
dataset_split: "test"
prompt_field: "prompt"
steps_per_eval: 50
total_steps: 500
use_wandb: true
wandb_name: "hermes-swe"
system_prompt: >
You are a skilled software engineer. You have access to a terminal,
file tools, and web search. Use these tools to complete the coding task.
Write clean, working code and verify it runs correctly before finishing.
openai:
base_url: "http://localhost:8000/v1"
model_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
server_type: "openai"
api_key: ""
@@ -0,0 +1,229 @@
"""
HermesSweEnv -- SWE-Bench Style Environment with Modal Sandboxes
A concrete environment for software engineering tasks where the model writes code
and the reward function runs tests to verify correctness. Uses Modal terminal
backend for cloud-isolated sandboxes per rollout.
The reward function uses ToolContext.terminal() to run test commands in the same
Modal sandbox the model used during its agentic loop. All filesystem state from
the model's tool calls is preserved for verification.
Usage:
# Phase 1: OpenAI server type
vllm serve YourModel --tool-parser hermes
run-api
python environments/hermes_swe_env.py serve \\
--openai.base_url http://localhost:8000/v1 \\
--openai.model_name YourModel \\
--openai.server_type openai \\
--env.dataset_name bigcode/humanevalpack \\
--env.terminal_backend modal
# Phase 2: VLLM server type (full RL training)
python environments/hermes_swe_env.py serve \\
--openai.base_url http://localhost:8000/v1 \\
--openai.model_name YourModel \\
--openai.server_type vllm \\
--env.tool_call_parser hermes \\
--env.terminal_backend modal
"""
import logging
import sys
import time
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union
# Ensure repo root is on sys.path for imports
_repo_root = Path(__file__).resolve().parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from datasets import load_dataset
from atroposlib.envs.base import ScoredDataGroup
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from atroposlib.type_definitions import Item
from environments.agent_loop import AgentResult
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from environments.tool_context import ToolContext
logger = logging.getLogger(__name__)
class HermesSweEnvConfig(HermesAgentEnvConfig):
"""Config with defaults for SWE-bench style tasks."""
pass # Inherits all fields, overrides defaults in config_init
class HermesSweEnv(HermesAgentBaseEnv):
"""
SWE-bench style environment using Modal terminal backend.
The model gets a coding task, uses terminal + file + web tools to solve it,
and the reward function runs tests in the same Modal sandbox to verify.
Subclass this for specific SWE datasets (HumanEval, SWE-bench, etc.)
and customize format_prompt() and compute_reward() as needed.
"""
name = "hermes-swe"
env_config_cls = HermesSweEnvConfig
@classmethod
def config_init(cls) -> Tuple[HermesSweEnvConfig, List[APIServerConfig]]:
"""
Default configuration for the SWE environment.
Uses Modal terminal backend for cloud isolation and terminal + file + web toolsets.
"""
env_config = HermesSweEnvConfig(
# Toolsets: terminal for running code, file for reading/writing, web for docs
enabled_toolsets=["terminal", "file", "web"],
disabled_toolsets=None,
distribution=None,
# Agent settings -- SWE tasks need more turns
max_agent_turns=30,
max_token_length=4096,
agent_temperature=1.0,
system_prompt=(
"You are a skilled software engineer. You have access to a terminal, "
"file tools, and web search. Use these tools to complete the coding task. "
"Write clean, working code and verify it runs correctly before finishing."
),
# Modal backend for cloud-isolated sandboxes
terminal_backend="modal",
# Dataset -- override via CLI for your specific SWE dataset
dataset_name="bigcode/humanevalpack",
dataset_split="test",
prompt_field="prompt",
# Atropos settings
group_size=4,
tokenizer_name="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
tool_call_parser="hermes",
steps_per_eval=50,
total_steps=500,
use_wandb=True,
wandb_name="hermes-swe",
)
server_configs = [
APIServerConfig(
base_url="http://localhost:8000/v1",
model_name="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
server_type="openai", # Phase 1; switch to "vllm" for Phase 2
api_key="",
)
]
return env_config, server_configs
async def setup(self):
"""Load the SWE dataset."""
if self.config.dataset_name:
self.dataset = load_dataset(
self.config.dataset_name, split=self.config.dataset_split
)
else:
# Placeholder if no dataset specified
self.dataset = []
self.iter = 0
self.reward_buffer: List[float] = []
async def get_next_item(self) -> Dict[str, Any]:
"""Cycle through the SWE dataset."""
if not self.dataset:
raise ValueError("No dataset loaded. Set dataset_name in config.")
item = self.dataset[self.iter % len(self.dataset)]
self.iter += 1
return item
def format_prompt(self, item: Dict[str, Any]) -> str:
"""
Format the SWE task prompt.
Override this in subclasses for different dataset formats.
Default assumes the dataset has a 'prompt' field and optionally a 'test' field.
"""
prompt = item.get(self.config.prompt_field, "")
# If the dataset has test information, include it in the prompt
test_info = item.get("test", item.get("test_code", item.get("tests", "")))
if test_info:
prompt += f"\n\nTests to pass:\n{test_info}"
return prompt
async def compute_reward(
self, item: Dict[str, Any], result: AgentResult, ctx: ToolContext
) -> float:
"""
Score by running tests in the model's Modal sandbox.
Default implementation:
- If the dataset item has a 'test' or 'test_code' field, run it
- Check exit code: 0 = pass, non-zero = fail
- Partial credit for file creation
Override this in subclasses for more sophisticated reward logic.
"""
# Find the test command from the dataset item
test_code = item.get("test", item.get("test_code", item.get("tests", "")))
if test_code:
# Run the test in the model's sandbox
test_result = ctx.terminal(
f'cd /workspace && python3 -c "{test_code}"', timeout=60
)
if test_result["exit_code"] == 0:
self.reward_buffer.append(1.0)
return 1.0
# Partial credit: check if the model created any Python files
file_check = ctx.terminal("find /workspace -name '*.py' -newer /tmp/.start_marker 2>/dev/null | head -5")
if file_check["exit_code"] == 0 and file_check.get("output", "").strip():
self.reward_buffer.append(0.1)
return 0.1
self.reward_buffer.append(0.0)
return 0.0
async def evaluate(self, *args, **kwargs):
"""
Run evaluation on a held-out set.
Override for dataset-specific evaluation logic.
"""
start_time = time.time()
end_time = time.time()
eval_metrics = {"eval/placeholder": 0.0}
await self.evaluate_log(
metrics=eval_metrics,
start_time=start_time,
end_time=end_time,
)
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log SWE-specific metrics."""
if wandb_metrics is None:
wandb_metrics = {}
if self.reward_buffer:
wandb_metrics["train/avg_reward"] = sum(self.reward_buffer) / len(
self.reward_buffer
)
wandb_metrics["train/pass_rate"] = sum(
1 for r in self.reward_buffer if r == 1.0
) / len(self.reward_buffer)
self.reward_buffer = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
HermesSweEnv.cli()
+188
View File
@@ -0,0 +1,188 @@
"""
Monkey patches for making hermes-agent tools work inside async frameworks (Atropos).
Problem:
Some tools use asyncio.run() internally (e.g., mini-swe-agent's Modal backend,
web_extract). This crashes when called from inside Atropos's event loop because
asyncio.run() can't be nested.
Solution:
Replace the problematic methods with versions that use a dedicated background
thread with its own event loop. The calling code sees the same sync interface --
call a function, get a result -- but internally the async work happens on a
separate thread that doesn't conflict with Atropos's loop.
These patches are safe for normal CLI use too: when there's no running event
loop, the behavior is identical (the background thread approach works regardless).
What gets patched:
- SwerexModalEnvironment.__init__ -- creates Modal deployment on a background thread
- SwerexModalEnvironment.execute -- runs commands on the same background thread
- SwerexModalEnvironment.stop -- stops deployment on the background thread
Usage:
Call apply_patches() once at import time (done automatically by hermes_base_env.py).
This is idempotent -- calling it multiple times is safe.
"""
import asyncio
import logging
import threading
from typing import Any
logger = logging.getLogger(__name__)
_patches_applied = False
class _AsyncWorker:
"""
A dedicated background thread with its own event loop.
Allows sync code to submit async coroutines and block for results,
even when called from inside another running event loop. Used to
bridge sync tool interfaces with async backends (Modal, SWE-ReX).
"""
def __init__(self):
self._loop: asyncio.AbstractEventLoop = None
self._thread: threading.Thread = None
self._started = threading.Event()
def start(self):
"""Start the background event loop thread."""
self._thread = threading.Thread(target=self._run_loop, daemon=True)
self._thread.start()
self._started.wait(timeout=30)
def _run_loop(self):
"""Background thread entry point -- runs the event loop forever."""
self._loop = asyncio.new_event_loop()
asyncio.set_event_loop(self._loop)
self._started.set()
self._loop.run_forever()
def run_coroutine(self, coro, timeout=600):
"""
Submit a coroutine to the background loop and block until it completes.
Safe to call from any thread, including threads that already have
a running event loop.
"""
if self._loop is None or self._loop.is_closed():
raise RuntimeError("AsyncWorker loop is not running")
future = asyncio.run_coroutine_threadsafe(coro, self._loop)
return future.result(timeout=timeout)
def stop(self):
"""Stop the background event loop and join the thread."""
if self._loop and self._loop.is_running():
self._loop.call_soon_threadsafe(self._loop.stop)
if self._thread:
self._thread.join(timeout=10)
def _patch_swerex_modal():
"""
Monkey patch SwerexModalEnvironment to use a background thread event loop
instead of asyncio.run(). This makes it safe to call from inside Atropos's
async event loop.
The patched methods have the exact same interface and behavior -- the only
difference is HOW the async work is executed internally.
"""
try:
from minisweagent.environments.extra.swerex_modal import (
SwerexModalEnvironment,
SwerexModalEnvironmentConfig,
)
from swerex.deployment.modal import ModalDeployment
from swerex.runtime.abstract import Command as RexCommand
except ImportError:
# mini-swe-agent or swe-rex not installed -- nothing to patch
logger.debug("mini-swe-agent Modal backend not available, skipping patch")
return
# Save original methods so we can refer to config handling
_original_init = SwerexModalEnvironment.__init__
def _patched_init(self, **kwargs):
"""Patched __init__: creates Modal deployment on a background thread."""
self.config = SwerexModalEnvironmentConfig(**kwargs)
# Start a dedicated event loop thread for all Modal async operations
self._worker = _AsyncWorker()
self._worker.start()
# Create AND start the deployment entirely on the worker's loop/thread
# so all gRPC channels and async state are bound to that loop
async def _create_and_start():
deployment = ModalDeployment(
image=self.config.image,
startup_timeout=self.config.startup_timeout,
runtime_timeout=self.config.runtime_timeout,
deployment_timeout=self.config.deployment_timeout,
install_pipx=self.config.install_pipx,
modal_sandbox_kwargs=self.config.modal_sandbox_kwargs,
)
await deployment.start()
return deployment
self.deployment = self._worker.run_coroutine(_create_and_start())
def _patched_execute(self, command: str, cwd: str = "", *, timeout: int | None = None) -> dict[str, Any]:
"""Patched execute: runs commands on the background thread's loop."""
async def _do_execute():
return await self.deployment.runtime.execute(
RexCommand(
command=command,
shell=True,
check=False,
cwd=cwd or self.config.cwd,
timeout=timeout or self.config.timeout,
merge_output_streams=True,
env=self.config.env if self.config.env else None,
)
)
output = self._worker.run_coroutine(_do_execute())
return {
"output": output.stdout,
"returncode": output.exit_code,
}
def _patched_stop(self):
"""Patched stop: stops deployment on the background thread, then stops the thread."""
try:
self._worker.run_coroutine(
asyncio.wait_for(self.deployment.stop(), timeout=10),
timeout=15,
)
except Exception:
pass
finally:
self._worker.stop()
# Apply the patches
SwerexModalEnvironment.__init__ = _patched_init
SwerexModalEnvironment.execute = _patched_execute
SwerexModalEnvironment.stop = _patched_stop
logger.debug("Patched SwerexModalEnvironment for async-safe operation")
def apply_patches():
"""
Apply all monkey patches needed for Atropos compatibility.
Safe to call multiple times -- patches are only applied once.
Safe for normal CLI use -- patched code works identically when
there is no running event loop.
"""
global _patches_applied
if _patches_applied:
return
_patch_swerex_modal()
_patches_applied = True
@@ -0,0 +1,34 @@
# Terminal Test Environment -- Default Configuration
#
# Simple file-creation tasks for validating the full Atropos + hermes-agent stack.
# Uses Modal terminal backend and OpenRouter (Claude) for inference.
# API keys loaded from ~/hermes-agent/.env
#
# Usage:
# run-api
# python environments/terminal_test_env/terminal_test_env.py serve \
# --config environments/terminal_test_env/default.yaml
env:
enabled_toolsets: ["terminal", "file"]
max_agent_turns: 10
max_token_length: 2048
group_size: 3
total_steps: 3
steps_per_eval: 3
terminal_backend: "modal"
tool_call_parser: "hermes"
tokenizer_name: "NousResearch/DeepHermes-3-Llama-3-3B-Preview"
ensure_scores_are_not_same: false
use_wandb: false
system_prompt: >
You are a helpful assistant with access to a terminal and file tools.
Complete the user's request by using the available tools.
Be precise and follow instructions exactly.
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "anthropic/claude-opus-4.6"
server_type: "openai"
health_check: false
# api_key loaded from OPENROUTER_API_KEY in .env
@@ -0,0 +1,292 @@
"""
TerminalTestEnv -- Simple Test Environment for Validating the Stack
A self-contained environment with inline tasks (no external dataset needed).
Each task asks the model to create a file at a known path with specific content.
The reward verifier cats the file and checks if the content matches.
Enables only terminal + file toolsets. Uses Modal terminal backend with
OpenRouter (Claude) by default.
Training tasks (3):
1. Create ~/greeting.txt with "Hello from Hermes Agent"
2. Create ~/count.txt with numbers 1-5, one per line
3. Create ~/answer.txt with the result of 123 + 456
Eval task (1):
1. Create ~/result.txt with the result of 6 * 7
Usage:
# Start Atropos API server
run-api
# Run environment (uses OpenRouter + Modal by default)
python environments/terminal_test_env.py serve
# Process mode (no run-api needed, saves to JSONL)
python environments/terminal_test_env.py process \\
--env.data_path_to_save_groups terminal_test_output.jsonl
"""
import logging
import os
import sys
import time
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union
# Ensure repo root is on sys.path for imports
_repo_root = Path(__file__).resolve().parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from atroposlib.envs.base import ScoredDataGroup
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from atroposlib.type_definitions import Item
from environments.agent_loop import AgentResult
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from environments.tool_context import ToolContext
logger = logging.getLogger(__name__)
# =============================================================================
# Inline task definitions -- no external dataset needed
# =============================================================================
TRAIN_TASKS = [
{
"prompt": "Create a file at ~/greeting.txt containing exactly the text: Hello from Hermes Agent",
"verify_path": "~/greeting.txt",
"expected_content": "Hello from Hermes Agent",
},
{
"prompt": "Create a file at ~/count.txt containing the numbers 1 through 5, one per line",
"verify_path": "~/count.txt",
"expected_content": "1\n2\n3\n4\n5",
},
{
"prompt": "Create a file at ~/answer.txt containing the result of 123 + 456",
"verify_path": "~/answer.txt",
"expected_content": "579",
},
]
EVAL_TASKS = [
{
"prompt": "Create a file at ~/result.txt containing the result of 6 * 7",
"verify_path": "~/result.txt",
"expected_content": "42",
},
]
class TerminalTestEnvConfig(HermesAgentEnvConfig):
"""Config with defaults suitable for terminal testing."""
pass # Inherits all fields, overrides defaults in config_init
class TerminalTestEnv(HermesAgentBaseEnv):
"""
Simple test environment with inline file-creation tasks.
All tasks follow the same pattern: "create a file at ~/X.txt with content Y".
The verifier runs `cat ~/X.txt` in the rollout's terminal and checks the output
against the expected string. Same verifier logic for all tasks.
This environment is designed to validate the full stack end-to-end:
- Agent loop executes tool calls (terminal/file)
- ToolContext provides terminal access to the reward function
- Reward function verifies file content via cat
- Scored data flows through the Atropos pipeline
"""
name = "terminal-test"
env_config_cls = TerminalTestEnvConfig
@classmethod
def config_init(cls) -> Tuple[TerminalTestEnvConfig, List[APIServerConfig]]:
"""
Default configuration for the terminal test environment.
Uses Modal terminal backend for cloud isolation and OpenRouter with
Claude for inference. API keys loaded from ~/hermes-agent/.env.
"""
env_config = TerminalTestEnvConfig(
# Terminal + file tools only
enabled_toolsets=["terminal", "file"],
disabled_toolsets=None,
distribution=None,
# Agent settings
max_agent_turns=10, # Simple tasks, don't need many turns
max_token_length=16000,
agent_temperature=1.0,
system_prompt=(
"You are a helpful assistant with access to a terminal and file tools. "
"Complete the user's request by using the available tools. "
"Be precise and follow instructions exactly."
),
# Modal terminal backend for cloud-isolated sandboxes per rollout
terminal_backend="modal",
# Atropos settings
group_size=3, # 3 rollouts per group
tokenizer_name="NousResearch/q-30b-t-h45-e1",
tool_call_parser="hermes",
steps_per_eval=3, # Eval after all 3 steps
total_steps=3, # 3 groups total (1 group per step)
use_wandb=True,
wandb_name="terminal-test",
ensure_scores_are_not_same=False, # Allow all-same scores for simple tasks
# No external dataset
dataset_name=None,
)
# OpenRouter with Claude -- API key loaded from .env (OPENROUTER_API_KEY)
server_configs = [
APIServerConfig(
base_url="https://openrouter.ai/api/v1",
model_name="anthropic/claude-opus-4.6",
server_type="openai",
api_key=os.getenv("OPENROUTER_API_KEY", ""),
health_check=False, # OpenRouter doesn't have a /health endpoint
)
]
return env_config, server_configs
async def setup(self):
"""Initialize inline task lists."""
self.train_tasks = list(TRAIN_TASKS)
self.eval_tasks = list(EVAL_TASKS)
self.iter = 0
# Track reward stats for wandb logging
self.reward_buffer: List[float] = []
async def get_next_item(self) -> Dict[str, str]:
"""Cycle through training tasks."""
item = self.train_tasks[self.iter % len(self.train_tasks)]
self.iter += 1
return item
def format_prompt(self, item: Dict[str, str]) -> str:
"""The prompt is directly in the task item."""
return item["prompt"]
async def compute_reward(
self, item: Dict[str, str], result: AgentResult, ctx: ToolContext
) -> float:
"""
Verify by cat-ing the expected file path and checking content matches.
Same verifier for all tasks -- they all write a file at a known path.
Scoring:
1.0 = exact match
0.5 = expected content is present but has extra stuff
0.0 = file doesn't exist or content doesn't match
"""
verify_result = ctx.terminal(f"cat {item['verify_path']}")
# File doesn't exist or can't be read
if verify_result["exit_code"] != 0:
self.reward_buffer.append(0.0)
return 0.0
actual = verify_result.get("output", "").strip()
expected = item["expected_content"].strip()
# Exact match
if actual == expected:
self.reward_buffer.append(1.0)
return 1.0
# Partial credit: expected content is present but has extra stuff
if expected in actual:
self.reward_buffer.append(0.5)
return 0.5
self.reward_buffer.append(0.0)
return 0.0
async def evaluate(self, *args, **kwargs):
"""
Run eval tasks using the agent loop and verify results.
Logs accuracy metrics.
"""
start_time = time.time()
correct = 0
total = len(self.eval_tasks)
samples = []
for eval_item in self.eval_tasks:
try:
# For eval, we do a simple single-turn completion (not full agent loop)
# to keep eval fast. The agent loop is tested via training.
completion = await self.server.chat_completion(
messages=[
{"role": "system", "content": self.config.system_prompt or ""},
{"role": "user", "content": eval_item["prompt"]},
],
n=1,
max_tokens=self.config.max_token_length,
temperature=0.0,
split="eval",
)
response_content = (
completion.choices[0].message.content if completion.choices else ""
)
samples.append(
{
"prompt": eval_item["prompt"],
"response": response_content,
"expected": eval_item["expected_content"],
}
)
except Exception as e:
logger.error("Eval failed for item: %s", e)
samples.append(
{
"prompt": eval_item["prompt"],
"response": f"ERROR: {e}",
"expected": eval_item["expected_content"],
}
)
end_time = time.time()
eval_metrics = {
"eval/num_samples": total,
}
await self.evaluate_log(
metrics=eval_metrics,
samples=samples,
start_time=start_time,
end_time=end_time,
)
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log training metrics including reward stats and accuracy."""
if wandb_metrics is None:
wandb_metrics = {}
if self.reward_buffer:
total = len(self.reward_buffer)
correct = sum(1 for r in self.reward_buffer if r == 1.0)
partial = sum(1 for r in self.reward_buffer if r == 0.5)
wandb_metrics["train/avg_reward"] = sum(self.reward_buffer) / total
wandb_metrics["train/accuracy"] = correct / total
wandb_metrics["train/partial_match_rate"] = partial / total
wandb_metrics["train/total_rollouts"] = total
self.reward_buffer = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
TerminalTestEnv.cli()
+120
View File
@@ -0,0 +1,120 @@
"""
Tool Call Parser Registry
Client-side parsers that extract structured tool_calls from raw model output text.
Used in Phase 2 (VLLM server type) where ManagedServer's /generate endpoint returns
raw text without tool call parsing.
Each parser is a standalone reimplementation of the corresponding VLLM parser's
non-streaming extract_tool_calls() logic. No VLLM dependency -- only standard library
(re, json, uuid) and openai types.
Usage:
from environments.tool_call_parsers import get_parser
parser = get_parser("hermes")
content, tool_calls = parser.parse(raw_model_output)
# content = text with tool call markup stripped
# tool_calls = list of ChatCompletionMessageToolCall objects, or None
"""
import logging
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple, Type
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
)
logger = logging.getLogger(__name__)
# Type alias for parser return value
ParseResult = Tuple[Optional[str], Optional[List[ChatCompletionMessageToolCall]]]
class ToolCallParser(ABC):
"""
Base class for tool call parsers.
Each parser knows how to extract structured tool_calls from a specific
model family's raw output text format.
"""
@abstractmethod
def parse(self, text: str) -> ParseResult:
"""
Parse raw model output text for tool calls.
Args:
text: Raw decoded text from the model's completion
Returns:
Tuple of (content, tool_calls) where:
- content: text with tool call markup stripped (the message 'content' field),
or None if the entire output was tool calls
- tool_calls: list of ChatCompletionMessageToolCall objects,
or None if no tool calls were found
"""
raise NotImplementedError
# Global parser registry: name -> parser class
PARSER_REGISTRY: Dict[str, Type[ToolCallParser]] = {}
def register_parser(name: str):
"""
Decorator to register a parser class under a given name.
Usage:
@register_parser("hermes")
class HermesToolCallParser(ToolCallParser):
...
"""
def decorator(cls: Type[ToolCallParser]) -> Type[ToolCallParser]:
PARSER_REGISTRY[name] = cls
return cls
return decorator
def get_parser(name: str) -> ToolCallParser:
"""
Get a parser instance by name.
Args:
name: Parser name (e.g., "hermes", "mistral", "llama3_json")
Returns:
Instantiated parser
Raises:
KeyError: If parser name is not found in registry
"""
if name not in PARSER_REGISTRY:
available = sorted(PARSER_REGISTRY.keys())
raise KeyError(
f"Tool call parser '{name}' not found. Available parsers: {available}"
)
return PARSER_REGISTRY[name]()
def list_parsers() -> List[str]:
"""Return sorted list of registered parser names."""
return sorted(PARSER_REGISTRY.keys())
# Import all parser modules to trigger registration via @register_parser decorators
# Each module registers itself when imported
from environments.tool_call_parsers.hermes_parser import HermesToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.longcat_parser import LongcatToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.mistral_parser import MistralToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.llama_parser import LlamaToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.qwen_parser import QwenToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.deepseek_v3_parser import DeepSeekV3ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.deepseek_v3_1_parser import DeepSeekV31ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.kimi_k2_parser import KimiK2ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.glm45_parser import Glm45ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.glm47_parser import Glm47ToolCallParser # noqa: E402, F401
from environments.tool_call_parsers.qwen3_coder_parser import Qwen3CoderToolCallParser # noqa: E402, F401
@@ -0,0 +1,71 @@
"""
DeepSeek V3.1 tool call parser.
Similar to V3 but with a slightly different format:
<toolcallbegin>function_name<toolsep>arguments<toolcallend>
Note: V3 has type+name before the separator, V3.1 has name before and args after.
Based on VLLM's DeepSeekV31ToolParser.extract_tool_calls()
"""
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("deepseek_v3_1")
@register_parser("deepseek_v31")
class DeepSeekV31ToolCallParser(ToolCallParser):
"""
Parser for DeepSeek V3.1 tool calls.
Slightly different regex than V3: function_name comes before the separator,
arguments come after (no type field, no json code block wrapper).
"""
START_TOKEN = "<tool▁calls▁begin>"
# Regex captures: function_name, function_arguments
PATTERN = re.compile(
r"<tool▁call▁begin>(?P<function_name>.*?)<tool▁sep>(?P<function_arguments>.*?)<tool▁call▁end>"
)
def parse(self, text: str) -> ParseResult:
if self.START_TOKEN not in text:
return text, None
try:
matches = self.PATTERN.findall(text)
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
func_name, func_args = match
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=func_name.strip(),
arguments=func_args.strip(),
),
)
)
if not tool_calls:
return text, None
content = text[: text.find(self.START_TOKEN)].strip()
return content if content else None, tool_calls
except Exception:
return text, None
@@ -0,0 +1,75 @@
"""
DeepSeek V3 tool call parser.
Format uses special unicode tokens:
<toolcallsbegin>
<toolcallbegin>type<toolsep>function_name
```json
{"arg": "value"}
```
<toolcallend>
<toolcallsend>
Based on VLLM's DeepSeekV3ToolParser.extract_tool_calls()
"""
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("deepseek_v3")
class DeepSeekV3ToolCallParser(ToolCallParser):
"""
Parser for DeepSeek V3 tool calls.
Uses special unicode tokens with fullwidth angle brackets and block elements.
Extracts type, function name, and JSON arguments from the structured format.
"""
START_TOKEN = "<tool▁calls▁begin>"
# Regex captures: type, function_name, function_arguments
PATTERN = re.compile(
r"<tool▁call▁begin>(?P<type>.*)<tool▁sep>(?P<function_name>.*)\n```json\n(?P<function_arguments>.*)\n```<tool▁call▁end>"
)
def parse(self, text: str) -> ParseResult:
if self.START_TOKEN not in text:
return text, None
try:
matches = self.PATTERN.findall(text)
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
tc_type, func_name, func_args = match
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=func_name.strip(),
arguments=func_args.strip(),
),
)
)
if not tool_calls:
return text, None
# Content is everything before the tool calls section
content = text[: text.find(self.START_TOKEN)].strip()
return content if content else None, tool_calls
except Exception:
return text, None
@@ -0,0 +1,109 @@
"""
GLM 4.5 (GLM-4-MoE) tool call parser.
Format uses custom arg_key/arg_value tags rather than standard JSON:
<tool_call>function_name
<arg_key>param1</arg_key><arg_value>value1</arg_value>
<arg_key>param2</arg_key><arg_value>value2</arg_value>
</tool_call>
Values are deserialized using json.loads -> ast.literal_eval -> raw string fallback.
Based on VLLM's Glm4MoeModelToolParser.extract_tool_calls()
"""
import ast
import json
import re
import uuid
from typing import Any, Dict, List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
def _deserialize_value(value: str) -> Any:
"""
Try to deserialize a string value to its native Python type.
Attempts json.loads, then ast.literal_eval, then returns raw string.
"""
try:
return json.loads(value)
except (json.JSONDecodeError, TypeError):
pass
try:
return ast.literal_eval(value)
except (ValueError, SyntaxError, TypeError):
pass
return value
@register_parser("glm45")
class Glm45ToolCallParser(ToolCallParser):
"""
Parser for GLM 4.5 (GLM-4-MoE) tool calls.
Uses <tool_call>...</tool_call> tags with <arg_key>/<arg_value> pairs
instead of standard JSON arguments.
"""
FUNC_CALL_REGEX = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
FUNC_DETAIL_REGEX = re.compile(r"<tool_call>([^\n]*)\n(.*)</tool_call>", re.DOTALL)
FUNC_ARG_REGEX = re.compile(
r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)
START_TOKEN = "<tool_call>"
def parse(self, text: str) -> ParseResult:
if self.START_TOKEN not in text:
return text, None
try:
matched_calls = self.FUNC_CALL_REGEX.findall(text)
if not matched_calls:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matched_calls:
detail = self.FUNC_DETAIL_REGEX.search(match)
if not detail:
continue
func_name = detail.group(1).strip()
func_args_raw = detail.group(2)
# Parse arg_key/arg_value pairs
pairs = self.FUNC_ARG_REGEX.findall(func_args_raw) if func_args_raw else []
arg_dict: Dict[str, Any] = {}
for key, value in pairs:
arg_key = key.strip()
arg_val = _deserialize_value(value.strip())
arg_dict[arg_key] = arg_val
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=func_name,
arguments=json.dumps(arg_dict, ensure_ascii=False),
),
)
)
if not tool_calls:
return text, None
content = text[: text.find(self.START_TOKEN)].strip()
return content if content else None, tool_calls
except Exception:
return text, None
@@ -0,0 +1,35 @@
"""
GLM 4.7 tool call parser.
Same as GLM 4.5 but with slightly different regex patterns.
The tool_call tags may wrap differently and arg parsing handles
newlines between key/value pairs.
Based on VLLM's Glm47MoeModelToolParser (extends Glm4MoeModelToolParser).
"""
import re
from environments.tool_call_parsers import ParseResult, register_parser
from environments.tool_call_parsers.glm45_parser import Glm45ToolCallParser
@register_parser("glm47")
class Glm47ToolCallParser(Glm45ToolCallParser):
"""
Parser for GLM 4.7 tool calls.
Extends GLM 4.5 with updated regex patterns.
"""
def __init__(self):
super().__init__()
# GLM 4.7 uses a slightly different detail regex that includes
# the <tool_call> wrapper and optional arg_key content
self.FUNC_DETAIL_REGEX = re.compile(
r"<tool_call>(.*?)(<arg_key>.*?)?</tool_call>", re.DOTALL
)
# GLM 4.7 handles newlines between arg_key and arg_value tags
self.FUNC_ARG_REGEX = re.compile(
r"<arg_key>(.*?)</arg_key>(?:\\n|\s)*<arg_value>(.*?)</arg_value>",
re.DOTALL,
)
@@ -0,0 +1,73 @@
"""
Hermes tool call parser.
Format: <tool_call>{"name": "func", "arguments": {...}}</tool_call>
Based on VLLM's Hermes2ProToolParser.extract_tool_calls()
"""
import json
import re
import uuid
from typing import List, Optional, Tuple
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("hermes")
class HermesToolCallParser(ToolCallParser):
"""
Parser for Hermes-format tool calls.
Matches <tool_call>...</tool_call> tags containing JSON with "name" and "arguments".
Also handles unclosed <tool_call> at end-of-string (truncated generation).
"""
# Matches both closed and unclosed tool_call tags
PATTERN = re.compile(
r"<tool_call>\s*(.*?)\s*</tool_call>|<tool_call>\s*(.*)", re.DOTALL
)
def parse(self, text: str) -> ParseResult:
if "<tool_call>" not in text:
return text, None
try:
matches = self.PATTERN.findall(text)
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
# match is a tuple: (closed_content, unclosed_content)
raw_json = match[0] if match[0] else match[1]
if not raw_json.strip():
continue
tc_data = json.loads(raw_json)
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=tc_data["name"],
arguments=json.dumps(
tc_data.get("arguments", {}), ensure_ascii=False
),
),
)
)
if not tool_calls:
return text, None
# Content is everything before the first <tool_call> tag
content = text[: text.find("<tool_call>")].strip()
return content if content else None, tool_calls
except Exception:
return text, None
@@ -0,0 +1,93 @@
"""
Kimi K2 tool call parser.
Format:
<|tool_calls_section_begin|>
<|tool_call_begin|>function_id:0<|tool_call_argument_begin|>{"arg": "val"}<|tool_call_end|>
<|tool_calls_section_end|>
The function_id format is typically "functions.func_name:index" or "func_name:index".
Based on VLLM's KimiK2ToolParser.extract_tool_calls()
"""
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("kimi_k2")
class KimiK2ToolCallParser(ToolCallParser):
"""
Parser for Kimi K2 tool calls.
Uses section begin/end tokens wrapping individual tool call begin/end tokens.
The tool_call_id contains the function name (after last dot, before colon).
"""
# Support both singular and plural variants
START_TOKENS = [
"<|tool_calls_section_begin|>",
"<|tool_call_section_begin|>",
]
# Regex captures: tool_call_id (e.g., "functions.get_weather:0"), function_arguments
PATTERN = re.compile(
r"<\|tool_call_begin\|>\s*(?P<tool_call_id>[^<]+:\d+)\s*"
r"<\|tool_call_argument_begin\|>\s*"
r"(?P<function_arguments>(?:(?!<\|tool_call_begin\|>).)*?)\s*"
r"<\|tool_call_end\|>",
re.DOTALL,
)
def parse(self, text: str) -> ParseResult:
# Check for any variant of the start token
has_start = any(token in text for token in self.START_TOKENS)
if not has_start:
return text, None
try:
matches = self.PATTERN.findall(text)
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
function_id, function_args = match
# Extract function name from ID format: "functions.get_weather:0" -> "get_weather"
function_name = function_id.split(":")[0].split(".")[-1]
tool_calls.append(
ChatCompletionMessageToolCall(
id=function_id, # Preserve the original ID format
type="function",
function=Function(
name=function_name,
arguments=function_args.strip(),
),
)
)
if not tool_calls:
return text, None
# Content is everything before the tool calls section
earliest_start = len(text)
for token in self.START_TOKENS:
idx = text.find(token)
if idx >= 0 and idx < earliest_start:
earliest_start = idx
content = text[:earliest_start].strip()
return content if content else None, tool_calls
except Exception:
return text, None
@@ -0,0 +1,96 @@
"""
Llama 3.x / 4 tool call parser.
Format: The model outputs JSON objects with "name" and "arguments" (or "parameters") keys.
May be preceded by <|python_tag|> token. Supports multiple JSON objects separated
by content or semicolons.
Based on VLLM's Llama3JsonToolParser.extract_tool_calls()
"""
import json
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("llama3_json")
@register_parser("llama4_json")
class LlamaToolCallParser(ToolCallParser):
"""
Parser for Llama 3.x and 4 JSON-format tool calls.
Finds JSON objects containing "name" + ("arguments" or "parameters") keys.
Uses Python's json.JSONDecoder.raw_decode for robust extraction of
JSON objects from mixed text.
"""
BOT_TOKEN = "<|python_tag|>"
# Regex to find the start of potential JSON objects
JSON_START = re.compile(r"\{")
def parse(self, text: str) -> ParseResult:
# Quick check: need either the bot token or a JSON brace
if self.BOT_TOKEN not in text and "{" not in text:
return text, None
try:
decoder = json.JSONDecoder()
tool_calls: List[ChatCompletionMessageToolCall] = []
end_index = -1 # Track where the last parsed JSON ended
for match in self.JSON_START.finditer(text):
start = match.start()
# Skip if this brace is inside a previously parsed JSON object
if start <= end_index:
continue
try:
obj, json_end = decoder.raw_decode(text[start:])
end_index = start + json_end
# Must have "name" and either "arguments" or "parameters"
name = obj.get("name")
args = obj.get("arguments", obj.get("parameters"))
if not name or args is None:
continue
# Normalize arguments to JSON string
if isinstance(args, dict):
args = json.dumps(args, ensure_ascii=False)
elif not isinstance(args, str):
args = json.dumps(args, ensure_ascii=False)
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(name=name, arguments=args),
)
)
except (json.JSONDecodeError, KeyError, ValueError):
continue
if not tool_calls:
return text, None
# Content is everything before the first tool call JSON
# Find where the first tool call starts in the text
first_tc_start = text.find("{")
if self.BOT_TOKEN in text:
first_tc_start = text.find(self.BOT_TOKEN)
content = text[:first_tc_start].strip() if first_tc_start > 0 else None
return content, tool_calls
except Exception:
return text, None
@@ -0,0 +1,69 @@
"""
Longcat Flash Chat tool call parser.
Same as Hermes but uses <longcat_tool_call> tags instead of <tool_call>.
Based on VLLM's LongcatFlashToolParser (extends Hermes2ProToolParser).
"""
import json
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
@register_parser("longcat")
class LongcatToolCallParser(ToolCallParser):
"""
Parser for Longcat Flash Chat tool calls.
Identical logic to Hermes, just different tag names.
"""
PATTERN = re.compile(
r"<longcat_tool_call>\s*(.*?)\s*</longcat_tool_call>|<longcat_tool_call>\s*(.*)",
re.DOTALL,
)
def parse(self, text: str) -> ParseResult:
if "<longcat_tool_call>" not in text:
return text, None
try:
matches = self.PATTERN.findall(text)
if not matches:
return text, None
tool_calls: List[ChatCompletionMessageToolCall] = []
for match in matches:
raw_json = match[0] if match[0] else match[1]
if not raw_json.strip():
continue
tc_data = json.loads(raw_json)
tool_calls.append(
ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:8]}",
type="function",
function=Function(
name=tc_data["name"],
arguments=json.dumps(
tc_data.get("arguments", {}), ensure_ascii=False
),
),
)
)
if not tool_calls:
return text, None
content = text[: text.find("<longcat_tool_call>")].strip()
return content if content else None, tool_calls
except Exception:
return text, None
@@ -0,0 +1,130 @@
"""
Mistral tool call parser.
Supports two formats depending on tokenizer version:
- Pre-v11: content[TOOL_CALLS] [{"name": ..., "arguments": {...}}, ...]
- v11+: content[TOOL_CALLS]tool_name1{"arg": "val"}[TOOL_CALLS]tool_name2{"arg": "val"}
Based on VLLM's MistralToolParser.extract_tool_calls()
The [TOOL_CALLS] token is the bot_token used by Mistral models.
"""
import json
import re
import uuid
from typing import List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
def _generate_mistral_id() -> str:
"""Mistral tool call IDs are 9-char alphanumeric strings."""
import random
import string
return "".join(random.choices(string.ascii_letters + string.digits, k=9))
@register_parser("mistral")
class MistralToolCallParser(ToolCallParser):
"""
Parser for Mistral-format tool calls.
Detects format by checking if the content after [TOOL_CALLS] starts with '['
(pre-v11 JSON array) or with a tool name (v11+ format).
"""
# The [TOOL_CALLS] token -- may appear as different strings depending on tokenizer
BOT_TOKEN = "[TOOL_CALLS]"
# Fallback regex for pre-v11 format when JSON parsing fails
TOOL_CALL_REGEX = re.compile(r"\[?\s*(\{.*?\})\s*\]?", re.DOTALL)
def parse(self, text: str) -> ParseResult:
if self.BOT_TOKEN not in text:
return text, None
try:
parts = text.split(self.BOT_TOKEN)
content = parts[0].strip()
raw_tool_calls = parts[1:]
# Detect format: if the first raw part starts with '[', it's pre-v11
first_raw = raw_tool_calls[0].strip() if raw_tool_calls else ""
is_pre_v11 = first_raw.startswith("[") or first_raw.startswith("{")
tool_calls: List[ChatCompletionMessageToolCall] = []
if not is_pre_v11:
# v11+ format: [TOOL_CALLS]tool_name{args}[TOOL_CALLS]tool_name2{args2}
for raw in raw_tool_calls:
raw = raw.strip()
if not raw or "{" not in raw:
continue
brace_idx = raw.find("{")
tool_name = raw[:brace_idx].strip()
args_str = raw[brace_idx:]
tool_calls.append(
ChatCompletionMessageToolCall(
id=_generate_mistral_id(),
type="function",
function=Function(name=tool_name, arguments=args_str),
)
)
else:
# Pre-v11 format: [TOOL_CALLS] [{"name": ..., "arguments": {...}}]
try:
parsed = json.loads(first_raw)
if isinstance(parsed, dict):
parsed = [parsed]
for tc in parsed:
args = tc.get("arguments", {})
if isinstance(args, dict):
args = json.dumps(args, ensure_ascii=False)
tool_calls.append(
ChatCompletionMessageToolCall(
id=_generate_mistral_id(),
type="function",
function=Function(
name=tc["name"], arguments=args
),
)
)
except json.JSONDecodeError:
# Fallback regex extraction
match = self.TOOL_CALL_REGEX.findall(first_raw)
if match:
for raw_json in match:
try:
tc = json.loads(raw_json)
args = tc.get("arguments", {})
if isinstance(args, dict):
args = json.dumps(args, ensure_ascii=False)
tool_calls.append(
ChatCompletionMessageToolCall(
id=_generate_mistral_id(),
type="function",
function=Function(
name=tc["name"], arguments=args
),
)
)
except (json.JSONDecodeError, KeyError):
continue
if not tool_calls:
return text, None
return content if content else None, tool_calls
except Exception:
return text, None
@@ -0,0 +1,163 @@
"""
Qwen3-Coder tool call parser.
Format uses XML-style nested tags:
<tool_call>
<function=function_name>
<parameter=param_name>value</parameter>
<parameter=param_name2>value2</parameter>
</function>
</tool_call>
Parameters are extracted from <parameter=name>value</parameter> tags and
type-converted using the schema if available, otherwise treated as strings.
Based on VLLM's Qwen3CoderToolParser.extract_tool_calls()
"""
import ast
import json
import re
import uuid
from typing import Any, Dict, List, Optional
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from environments.tool_call_parsers import ParseResult, ToolCallParser, register_parser
def _try_convert_value(value: str) -> Any:
"""
Try to convert a parameter value string to a native Python type.
Handles null, numbers, booleans, JSON objects/arrays, and falls back to string.
"""
stripped = value.strip()
# Handle null
if stripped.lower() == "null":
return None
# Try JSON first (handles objects, arrays, strings, numbers, booleans)
try:
return json.loads(stripped)
except (json.JSONDecodeError, TypeError):
pass
# Try Python literal eval (handles tuples, etc.)
try:
return ast.literal_eval(stripped)
except (ValueError, SyntaxError, TypeError):
pass
# Return as string
return stripped
@register_parser("qwen3_coder")
class Qwen3CoderToolCallParser(ToolCallParser):
"""
Parser for Qwen3-Coder XML-format tool calls.
Uses nested XML tags: <tool_call><function=name><parameter=key>val</parameter></function></tool_call>
"""
START_TOKEN = "<tool_call>"
FUNCTION_PREFIX = "<function="
# Find complete tool_call blocks (or unclosed at end)
TOOL_CALL_REGEX = re.compile(
r"<tool_call>(.*?)</tool_call>|<tool_call>(.*?)$", re.DOTALL
)
# Find function blocks within a tool_call
FUNCTION_REGEX = re.compile(
r"<function=(.*?)</function>|<function=(.*)$", re.DOTALL
)
# Find parameter blocks within a function
PARAMETER_REGEX = re.compile(
r"<parameter=(.*?)(?:</parameter>|(?=<parameter=)|(?=</function>)|$)",
re.DOTALL,
)
def _parse_function_call(self, function_str: str) -> Optional[ChatCompletionMessageToolCall]:
"""Parse a single <function=name>...</function> block into a ToolCall."""
try:
# Extract function name: everything before the first '>'
gt_idx = function_str.index(">")
func_name = function_str[:gt_idx].strip()
params_str = function_str[gt_idx + 1:]
# Extract parameters
param_dict: Dict[str, Any] = {}
for match_text in self.PARAMETER_REGEX.findall(params_str):
if ">" not in match_text:
continue
eq_idx = match_text.index(">")
param_name = match_text[:eq_idx].strip()
param_value = match_text[eq_idx + 1:]
# Clean up whitespace
if param_value.startswith("\n"):
param_value = param_value[1:]
if param_value.endswith("\n"):
param_value = param_value[:-1]
param_dict[param_name] = _try_convert_value(param_value)
return ChatCompletionMessageToolCall(
id=f"call_{uuid.uuid4().hex[:24]}",
type="function",
function=Function(
name=func_name,
arguments=json.dumps(param_dict, ensure_ascii=False),
),
)
except (ValueError, IndexError):
return None
def parse(self, text: str) -> ParseResult:
if self.FUNCTION_PREFIX not in text:
return text, None
try:
# Find all tool_call blocks
tc_matches = self.TOOL_CALL_REGEX.findall(text)
raw_blocks = [m[0] if m[0] else m[1] for m in tc_matches]
# Fallback: if no tool_call tags, try the whole text
if not raw_blocks:
raw_blocks = [text]
# Find function blocks within each tool_call
function_strs: List[str] = []
for block in raw_blocks:
func_matches = self.FUNCTION_REGEX.findall(block)
function_strs.extend(m[0] if m[0] else m[1] for m in func_matches)
if not function_strs:
return text, None
# Parse each function call
tool_calls: List[ChatCompletionMessageToolCall] = []
for func_str in function_strs:
tc = self._parse_function_call(func_str)
if tc is not None:
tool_calls.append(tc)
if not tool_calls:
return text, None
# Content before tool calls
first_tc = text.find(self.START_TOKEN)
if first_tc < 0:
first_tc = text.find(self.FUNCTION_PREFIX)
content = text[:first_tc].strip() if first_tc > 0 else None
return content, tool_calls
except Exception:
return text, None
@@ -0,0 +1,19 @@
"""
Qwen 2.5 tool call parser.
Uses the same <tool_call> format as Hermes.
Registered as a separate parser name for clarity when using --tool-parser=qwen.
"""
from environments.tool_call_parsers import register_parser
from environments.tool_call_parsers.hermes_parser import HermesToolCallParser
@register_parser("qwen")
class QwenToolCallParser(HermesToolCallParser):
"""
Parser for Qwen 2.5 tool calls.
Same <tool_call>{"name": ..., "arguments": ...}</tool_call> format as Hermes.
"""
pass # Identical format -- inherits everything from Hermes
+473
View File
@@ -0,0 +1,473 @@
"""
ToolContext -- Unrestricted Tool Access for Reward Functions
A per-rollout handle that gives reward/verification functions direct access to
ALL hermes-agent tools, scoped to the rollout's task_id. The same task_id means
the terminal/browser session is the SAME one the model used during its rollout --
all state (files, processes, browser tabs) is preserved.
The verifier author decides which tools to use. Nothing is hardcoded or gated.
Example usage in a compute_reward():
async def compute_reward(self, item, result, ctx):
# Run tests in the model's terminal sandbox
test = ctx.terminal("pytest -v")
if test["exit_code"] == 0:
return 1.0
# Check if a file was created
content = ctx.read_file("/workspace/solution.py")
if content.get("content"):
return 0.5
return 0.0
"""
import json
import logging
import os
from typing import Any, Dict, List, Optional
import asyncio
import concurrent.futures
from model_tools import handle_function_call
from tools.terminal_tool import cleanup_vm
from tools.browser_tool import cleanup_browser
logger = logging.getLogger(__name__)
# Thread pool for running sync tool calls that internally use asyncio.run()
_tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
def _run_tool_in_thread(tool_name: str, arguments: Dict[str, Any], task_id: str) -> str:
"""
Run a tool call in a thread pool executor so backends that use asyncio.run()
internally (modal, docker) get a clean event loop.
If we're already in an async context, uses run_in_executor.
If not (e.g., called from sync code), runs directly.
"""
try:
loop = asyncio.get_running_loop()
# We're in an async context -- need to run in thread
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
future = pool.submit(
handle_function_call, tool_name, arguments, task_id
)
return future.result(timeout=300)
except RuntimeError:
# No running event loop -- safe to call directly
return handle_function_call(tool_name, arguments, task_id)
class ToolContext:
"""
Open-ended access to all hermes-agent tools for a specific rollout.
Passed to compute_reward() so verifiers can use any tool they need:
terminal commands, file reads/writes, web searches, browser automation, etc.
All calls share the rollout's task_id for session isolation.
"""
def __init__(self, task_id: str):
self.task_id = task_id
# -------------------------------------------------------------------------
# Terminal tools
# -------------------------------------------------------------------------
def terminal(self, command: str, timeout: int = 180) -> Dict[str, Any]:
"""
Run a command in the rollout's terminal session.
Args:
command: Shell command to execute
timeout: Command timeout in seconds
Returns:
Dict with 'exit_code' (int) and 'output' (str)
"""
import os
backend = os.getenv("TERMINAL_ENV", "local")
logger.debug("ToolContext.terminal [%s backend] task=%s: %s", backend, self.task_id[:8], command[:100])
# Run in thread pool so modal/docker backends' asyncio.run() doesn't deadlock
result = _run_tool_in_thread(
"terminal",
{"command": command, "timeout": timeout},
self.task_id,
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"exit_code": -1, "output": result}
# -------------------------------------------------------------------------
# File tools
# -------------------------------------------------------------------------
def read_file(self, path: str) -> Dict[str, Any]:
"""
Read a file from the rollout's filesystem.
Args:
path: File path to read
Returns:
Dict with file content or error
"""
result = handle_function_call(
"read_file", {"path": path}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
def write_file(self, path: str, content: str) -> Dict[str, Any]:
"""
Write a TEXT file in the rollout's filesystem.
Uses a shell heredoc under the hood, so this is only safe for text content.
For binary files (images, compiled artifacts, etc.), use upload_file() instead.
Args:
path: File path to write
content: Text content to write
Returns:
Dict with success status or error
"""
result = handle_function_call(
"write_file", {"path": path, "content": content}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
def upload_file(self, local_path: str, remote_path: str) -> Dict[str, Any]:
"""
Upload a local file to the rollout's sandbox (binary-safe).
Unlike write_file() which passes content through a shell heredoc (text-only),
this method base64-encodes the file and decodes it inside the sandbox.
Safe for any file type: binaries, images, archives, etc.
For large files (>1MB), the content is split into chunks to avoid
hitting shell command-length limits.
Args:
local_path: Path to a local file on the host
remote_path: Destination path inside the sandbox
Returns:
Dict with 'exit_code' and 'output'
"""
import base64
from pathlib import Path as _Path
local = _Path(local_path)
if not local.exists():
return {"exit_code": -1, "output": f"Local file not found: {local_path}"}
raw = local.read_bytes()
b64 = base64.b64encode(raw).decode("ascii")
# Ensure parent directory exists in the sandbox
parent = str(_Path(remote_path).parent)
if parent not in (".", "/"):
self.terminal(f"mkdir -p {parent}", timeout=10)
# For small files, single command is fine
chunk_size = 60_000 # ~60KB per chunk (well within shell limits)
if len(b64) <= chunk_size:
result = self.terminal(
f"printf '%s' '{b64}' | base64 -d > {remote_path}",
timeout=30,
)
else:
# For larger files, write base64 in chunks then decode
tmp_b64 = "/tmp/_hermes_upload.b64"
self.terminal(f": > {tmp_b64}", timeout=5) # truncate
for i in range(0, len(b64), chunk_size):
chunk = b64[i : i + chunk_size]
self.terminal(f"printf '%s' '{chunk}' >> {tmp_b64}", timeout=15)
result = self.terminal(
f"base64 -d {tmp_b64} > {remote_path} && rm -f {tmp_b64}",
timeout=30,
)
return result
def upload_dir(self, local_dir: str, remote_dir: str) -> List[Dict[str, Any]]:
"""
Upload an entire local directory to the rollout's sandbox (binary-safe).
Recursively uploads all files, preserving directory structure.
Args:
local_dir: Path to a local directory on the host
remote_dir: Destination directory inside the sandbox
Returns:
List of results, one per file uploaded
"""
from pathlib import Path as _Path
local = _Path(local_dir)
if not local.exists() or not local.is_dir():
return [{"exit_code": -1, "output": f"Local directory not found: {local_dir}"}]
results = []
for file_path in sorted(local.rglob("*")):
if file_path.is_file():
relative = file_path.relative_to(local)
target = f"{remote_dir}/{relative}"
results.append(self.upload_file(str(file_path), target))
return results
def download_file(self, remote_path: str, local_path: str) -> Dict[str, Any]:
"""
Download a file from the rollout's sandbox to the host (binary-safe).
The inverse of upload_file(). Base64-encodes the file inside the sandbox,
reads the encoded data through the terminal, and decodes it locally.
Safe for any file type.
Args:
remote_path: Path to the file inside the sandbox
local_path: Destination path on the host
Returns:
Dict with 'success' (bool) and 'bytes' (int) or 'error' (str)
"""
import base64
from pathlib import Path as _Path
# Base64-encode the file inside the sandbox and capture output
result = self.terminal(
f"base64 {remote_path} 2>/dev/null",
timeout=30,
)
if result.get("exit_code", -1) != 0:
return {
"success": False,
"error": f"Failed to read remote file: {result.get('output', '')}",
}
b64_data = result.get("output", "").strip()
if not b64_data:
return {"success": False, "error": f"Remote file is empty or missing: {remote_path}"}
try:
raw = base64.b64decode(b64_data)
except Exception as e:
return {"success": False, "error": f"Base64 decode failed: {e}"}
# Write to local host filesystem
local = _Path(local_path)
local.parent.mkdir(parents=True, exist_ok=True)
local.write_bytes(raw)
return {"success": True, "bytes": len(raw)}
def download_dir(self, remote_dir: str, local_dir: str) -> List[Dict[str, Any]]:
"""
Download a directory from the rollout's sandbox to the host (binary-safe).
Lists all files in the remote directory, then downloads each one.
Preserves directory structure.
Args:
remote_dir: Path to the directory inside the sandbox
local_dir: Destination directory on the host
Returns:
List of results, one per file downloaded
"""
from pathlib import Path as _Path
# List files in the remote directory
ls_result = self.terminal(
f"find {remote_dir} -type f 2>/dev/null",
timeout=15,
)
if ls_result.get("exit_code", -1) != 0:
return [{"success": False, "error": f"Failed to list remote dir: {remote_dir}"}]
file_list = ls_result.get("output", "").strip()
if not file_list:
return [{"success": False, "error": f"Remote directory is empty or missing: {remote_dir}"}]
results = []
for remote_file in file_list.splitlines():
remote_file = remote_file.strip()
if not remote_file:
continue
# Compute the relative path to preserve directory structure
if remote_file.startswith(remote_dir):
relative = remote_file[len(remote_dir):].lstrip("/")
else:
relative = _Path(remote_file).name
local_file = str(_Path(local_dir) / relative)
results.append(self.download_file(remote_file, local_file))
return results
def search(self, query: str, path: str = ".") -> Dict[str, Any]:
"""
Search for text in the rollout's filesystem.
Args:
query: Search query
path: Directory to search in
Returns:
Dict with search results
"""
result = handle_function_call(
"search", {"query": query, "path": path}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
# -------------------------------------------------------------------------
# Web tools
# -------------------------------------------------------------------------
def web_search(self, query: str) -> Dict[str, Any]:
"""
Search the web.
Args:
query: Search query
Returns:
Dict with search results
"""
result = handle_function_call("web_search", {"query": query})
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
def web_extract(self, urls: List[str]) -> Dict[str, Any]:
"""
Extract content from URLs.
Args:
urls: List of URLs to extract content from
Returns:
Dict with extracted content
"""
result = handle_function_call("web_extract", {"urls": urls})
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
# -------------------------------------------------------------------------
# Browser tools
# -------------------------------------------------------------------------
def browser_navigate(self, url: str) -> Dict[str, Any]:
"""
Navigate the rollout's browser session to a URL.
Args:
url: URL to navigate to
Returns:
Dict with page snapshot or error
"""
result = handle_function_call(
"browser_navigate", {"url": url}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
def browser_snapshot(self) -> Dict[str, Any]:
"""
Take a snapshot of the current browser page.
Returns:
Dict with page content/accessibility snapshot
"""
result = handle_function_call(
"browser_snapshot", {}, task_id=self.task_id
)
try:
return json.loads(result)
except json.JSONDecodeError:
return {"error": result}
# -------------------------------------------------------------------------
# Generic tool access
# -------------------------------------------------------------------------
def call_tool(self, tool_name: str, arguments: Dict[str, Any]) -> str:
"""
Call any hermes-agent tool by name.
This is the generic escape hatch -- if a tool doesn't have a convenience
wrapper above, you can call it directly here.
Args:
tool_name: Name of the tool (e.g., "vision_analyze", "skills_list")
arguments: Dict of arguments for the tool
Returns:
Raw JSON string result from the tool
"""
return _run_tool_in_thread(tool_name, arguments, self.task_id)
# -------------------------------------------------------------------------
# Cleanup
# -------------------------------------------------------------------------
def cleanup(self):
"""
Release all resources (terminal VMs, browser sessions, background processes)
for this rollout.
Called automatically by the base environment via try/finally after
compute_reward() completes. You generally don't need to call this yourself.
"""
# Kill any background processes from this rollout (safety net)
try:
from tools.process_registry import process_registry
killed = process_registry.kill_all(task_id=self.task_id)
if killed:
logger.debug("Process cleanup for task %s: killed %d process(es)", self.task_id, killed)
except Exception as e:
logger.debug("Process cleanup for task %s: %s", self.task_id, e)
try:
cleanup_vm(self.task_id)
except Exception as e:
logger.debug("VM cleanup for task %s: %s", self.task_id, e)
# Suppress browser_tool's noisy debug prints during cleanup.
# The cleanup still runs (safe), it just doesn't spam the console.
_prev_quiet = os.environ.get("HERMES_QUIET")
os.environ["HERMES_QUIET"] = "1"
try:
cleanup_browser(self.task_id)
except Exception as e:
logger.debug("Browser cleanup for task %s: %s", self.task_id, e)
finally:
if _prev_quiet is None:
os.environ.pop("HERMES_QUIET", None)
else:
os.environ["HERMES_QUIET"] = _prev_quiet
-70
View File
@@ -1,70 +0,0 @@
---
name: example-skill
description: An example skill demonstrating the skill file format and structure
---
# Example Skill
This is an example skill file that demonstrates how to create skills for the Hermes Agent.
## Skill File Format
Skills are markdown files with YAML frontmatter at the top:
```yaml
---
name: your-skill-name
description: A brief one-line description of what this skill does
---
```
The frontmatter fields:
- **name**: The identifier used to reference this skill (lowercase, hyphens for spaces)
- **description**: A brief description shown when listing skills (keep under 200 chars)
## Writing Effective Skills
### 1. Be Specific and Actionable
Good skills provide clear, actionable instructions:
```
When reviewing code:
1. Check for security vulnerabilities first
2. Verify error handling is comprehensive
3. Ensure tests cover edge cases
```
### 2. Include Examples
Show concrete examples of what you want:
```python
# Good: Descriptive variable names
user_authentication_token = get_token()
# Bad: Cryptic abbreviations
uat = gt()
```
### 3. Define When to Use
Help the agent understand when this skill applies:
> Use this skill when: reviewing pull requests, auditing security, or checking code quality.
## Skill Categories
Consider organizing skills by purpose:
- **Conventions**: Coding standards, API patterns, naming rules
- **Workflows**: Step-by-step processes for deployments, reviews, releases
- **Knowledge**: Domain-specific information, system architecture, gotchas
- **Templates**: Boilerplate for common tasks, response formats
## Tips
1. Keep the description concise - it's shown in the skills list
2. Use headers to organize longer skills
3. Include code examples where helpful
4. Reference other skills if they're related
+35
View File
@@ -0,0 +1,35 @@
"""
Hermes Gateway - Multi-platform messaging integration.
This module provides a unified gateway for connecting the Hermes agent
to various messaging platforms (Telegram, Discord, WhatsApp) with:
- Session management (persistent conversations with reset policies)
- Dynamic context injection (agent knows where messages come from)
- Delivery routing (cron job outputs to appropriate channels)
- Platform-specific toolsets (different capabilities per platform)
"""
from .config import GatewayConfig, PlatformConfig, HomeChannel, load_gateway_config
from .session import (
SessionContext,
SessionStore,
SessionResetPolicy,
build_session_context_prompt,
)
from .delivery import DeliveryRouter, DeliveryTarget
__all__ = [
# Config
"GatewayConfig",
"PlatformConfig",
"HomeChannel",
"load_gateway_config",
# Session
"SessionContext",
"SessionStore",
"SessionResetPolicy",
"build_session_context_prompt",
# Delivery
"DeliveryRouter",
"DeliveryTarget",
]
+350
View File
@@ -0,0 +1,350 @@
"""
Gateway configuration management.
Handles loading and validating configuration for:
- Connected platforms (Telegram, Discord, WhatsApp)
- Home channels for each platform
- Session reset policies
- Delivery preferences
"""
import os
import json
from pathlib import Path
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Any
from enum import Enum
class Platform(Enum):
"""Supported messaging platforms."""
LOCAL = "local"
TELEGRAM = "telegram"
DISCORD = "discord"
WHATSAPP = "whatsapp"
SLACK = "slack"
@dataclass
class HomeChannel:
"""
Default destination for a platform.
When a cron job specifies deliver="telegram" without a specific chat ID,
messages are sent to this home channel.
"""
platform: Platform
chat_id: str
name: str # Human-readable name for display
def to_dict(self) -> Dict[str, Any]:
return {
"platform": self.platform.value,
"chat_id": self.chat_id,
"name": self.name,
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HomeChannel":
return cls(
platform=Platform(data["platform"]),
chat_id=str(data["chat_id"]),
name=data.get("name", "Home"),
)
@dataclass
class SessionResetPolicy:
"""
Controls when sessions reset (lose context).
Modes:
- "daily": Reset at a specific hour each day
- "idle": Reset after N minutes of inactivity
- "both": Whichever triggers first (daily boundary OR idle timeout)
"""
mode: str = "both" # "daily", "idle", or "both"
at_hour: int = 4 # Hour for daily reset (0-23, local time)
idle_minutes: int = 1440 # Minutes of inactivity before reset (24 hours)
def to_dict(self) -> Dict[str, Any]:
return {
"mode": self.mode,
"at_hour": self.at_hour,
"idle_minutes": self.idle_minutes,
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "SessionResetPolicy":
return cls(
mode=data.get("mode", "both"),
at_hour=data.get("at_hour", 4),
idle_minutes=data.get("idle_minutes", 1440),
)
@dataclass
class PlatformConfig:
"""Configuration for a single messaging platform."""
enabled: bool = False
token: Optional[str] = None # Bot token (Telegram, Discord)
api_key: Optional[str] = None # API key if different from token
home_channel: Optional[HomeChannel] = None
# Platform-specific settings
extra: Dict[str, Any] = field(default_factory=dict)
def to_dict(self) -> Dict[str, Any]:
result = {
"enabled": self.enabled,
"extra": self.extra,
}
if self.token:
result["token"] = self.token
if self.api_key:
result["api_key"] = self.api_key
if self.home_channel:
result["home_channel"] = self.home_channel.to_dict()
return result
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PlatformConfig":
home_channel = None
if "home_channel" in data:
home_channel = HomeChannel.from_dict(data["home_channel"])
return cls(
enabled=data.get("enabled", False),
token=data.get("token"),
api_key=data.get("api_key"),
home_channel=home_channel,
extra=data.get("extra", {}),
)
@dataclass
class GatewayConfig:
"""
Main gateway configuration.
Manages all platform connections, session policies, and delivery settings.
"""
# Platform configurations
platforms: Dict[Platform, PlatformConfig] = field(default_factory=dict)
# Session reset policies by type
default_reset_policy: SessionResetPolicy = field(default_factory=SessionResetPolicy)
reset_by_type: Dict[str, SessionResetPolicy] = field(default_factory=dict)
reset_by_platform: Dict[Platform, SessionResetPolicy] = field(default_factory=dict)
# Reset trigger commands
reset_triggers: List[str] = field(default_factory=lambda: ["/new", "/reset"])
# Storage paths
sessions_dir: Path = field(default_factory=lambda: Path.home() / ".hermes" / "sessions")
# Delivery settings
always_log_local: bool = True # Always save cron outputs to local files
def get_connected_platforms(self) -> List[Platform]:
"""Return list of platforms that are enabled and configured."""
connected = []
for platform, config in self.platforms.items():
if config.enabled and (config.token or config.api_key):
connected.append(platform)
return connected
def get_home_channel(self, platform: Platform) -> Optional[HomeChannel]:
"""Get the home channel for a platform."""
config = self.platforms.get(platform)
if config:
return config.home_channel
return None
def get_reset_policy(
self,
platform: Optional[Platform] = None,
session_type: Optional[str] = None
) -> SessionResetPolicy:
"""
Get the appropriate reset policy for a session.
Priority: platform override > type override > default
"""
# Platform-specific override takes precedence
if platform and platform in self.reset_by_platform:
return self.reset_by_platform[platform]
# Type-specific override (dm, group, thread)
if session_type and session_type in self.reset_by_type:
return self.reset_by_type[session_type]
return self.default_reset_policy
def to_dict(self) -> Dict[str, Any]:
return {
"platforms": {
p.value: c.to_dict() for p, c in self.platforms.items()
},
"default_reset_policy": self.default_reset_policy.to_dict(),
"reset_by_type": {
k: v.to_dict() for k, v in self.reset_by_type.items()
},
"reset_by_platform": {
p.value: v.to_dict() for p, v in self.reset_by_platform.items()
},
"reset_triggers": self.reset_triggers,
"sessions_dir": str(self.sessions_dir),
"always_log_local": self.always_log_local,
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "GatewayConfig":
platforms = {}
for platform_name, platform_data in data.get("platforms", {}).items():
try:
platform = Platform(platform_name)
platforms[platform] = PlatformConfig.from_dict(platform_data)
except ValueError:
pass # Skip unknown platforms
reset_by_type = {}
for type_name, policy_data in data.get("reset_by_type", {}).items():
reset_by_type[type_name] = SessionResetPolicy.from_dict(policy_data)
reset_by_platform = {}
for platform_name, policy_data in data.get("reset_by_platform", {}).items():
try:
platform = Platform(platform_name)
reset_by_platform[platform] = SessionResetPolicy.from_dict(policy_data)
except ValueError:
pass
default_policy = SessionResetPolicy()
if "default_reset_policy" in data:
default_policy = SessionResetPolicy.from_dict(data["default_reset_policy"])
sessions_dir = Path.home() / ".hermes" / "sessions"
if "sessions_dir" in data:
sessions_dir = Path(data["sessions_dir"])
return cls(
platforms=platforms,
default_reset_policy=default_policy,
reset_by_type=reset_by_type,
reset_by_platform=reset_by_platform,
reset_triggers=data.get("reset_triggers", ["/new", "/reset"]),
sessions_dir=sessions_dir,
always_log_local=data.get("always_log_local", True),
)
def load_gateway_config() -> GatewayConfig:
"""
Load gateway configuration from multiple sources.
Priority (highest to lowest):
1. Environment variables
2. ~/.hermes/gateway.json
3. cli-config.yaml gateway section
4. Defaults
"""
config = GatewayConfig()
# Try loading from ~/.hermes/gateway.json
gateway_config_path = Path.home() / ".hermes" / "gateway.json"
if gateway_config_path.exists():
try:
with open(gateway_config_path, "r") as f:
data = json.load(f)
config = GatewayConfig.from_dict(data)
except Exception as e:
print(f"[gateway] Warning: Failed to load {gateway_config_path}: {e}")
# Override with environment variables
_apply_env_overrides(config)
return config
def _apply_env_overrides(config: GatewayConfig) -> None:
"""Apply environment variable overrides to config."""
# Telegram
telegram_token = os.getenv("TELEGRAM_BOT_TOKEN")
if telegram_token:
if Platform.TELEGRAM not in config.platforms:
config.platforms[Platform.TELEGRAM] = PlatformConfig()
config.platforms[Platform.TELEGRAM].enabled = True
config.platforms[Platform.TELEGRAM].token = telegram_token
telegram_home = os.getenv("TELEGRAM_HOME_CHANNEL")
if telegram_home and Platform.TELEGRAM in config.platforms:
config.platforms[Platform.TELEGRAM].home_channel = HomeChannel(
platform=Platform.TELEGRAM,
chat_id=telegram_home,
name=os.getenv("TELEGRAM_HOME_CHANNEL_NAME", "Home"),
)
# Discord
discord_token = os.getenv("DISCORD_BOT_TOKEN")
if discord_token:
if Platform.DISCORD not in config.platforms:
config.platforms[Platform.DISCORD] = PlatformConfig()
config.platforms[Platform.DISCORD].enabled = True
config.platforms[Platform.DISCORD].token = discord_token
discord_home = os.getenv("DISCORD_HOME_CHANNEL")
if discord_home and Platform.DISCORD in config.platforms:
config.platforms[Platform.DISCORD].home_channel = HomeChannel(
platform=Platform.DISCORD,
chat_id=discord_home,
name=os.getenv("DISCORD_HOME_CHANNEL_NAME", "Home"),
)
# WhatsApp (typically uses different auth mechanism)
whatsapp_enabled = os.getenv("WHATSAPP_ENABLED", "").lower() in ("true", "1", "yes")
if whatsapp_enabled:
if Platform.WHATSAPP not in config.platforms:
config.platforms[Platform.WHATSAPP] = PlatformConfig()
config.platforms[Platform.WHATSAPP].enabled = True
# Slack
slack_token = os.getenv("SLACK_BOT_TOKEN")
if slack_token:
if Platform.SLACK not in config.platforms:
config.platforms[Platform.SLACK] = PlatformConfig()
config.platforms[Platform.SLACK].enabled = True
config.platforms[Platform.SLACK].token = slack_token
# Home channel
slack_home = os.getenv("SLACK_HOME_CHANNEL")
if slack_home:
config.platforms[Platform.SLACK].home_channel = HomeChannel(
platform=Platform.SLACK,
chat_id=slack_home,
name=os.getenv("SLACK_HOME_CHANNEL_NAME", ""),
)
# Session settings
idle_minutes = os.getenv("SESSION_IDLE_MINUTES")
if idle_minutes:
try:
config.default_reset_policy.idle_minutes = int(idle_minutes)
except ValueError:
pass
reset_hour = os.getenv("SESSION_RESET_HOUR")
if reset_hour:
try:
config.default_reset_policy.at_hour = int(reset_hour)
except ValueError:
pass
def save_gateway_config(config: GatewayConfig) -> None:
"""Save gateway configuration to ~/.hermes/gateway.json."""
gateway_config_path = Path.home() / ".hermes" / "gateway.json"
gateway_config_path.parent.mkdir(parents=True, exist_ok=True)
with open(gateway_config_path, "w") as f:
json.dump(config.to_dict(), f, indent=2)
+318
View File
@@ -0,0 +1,318 @@
"""
Delivery routing for cron job outputs and agent responses.
Routes messages to the appropriate destination based on:
- Explicit targets (e.g., "telegram:123456789")
- Platform home channels (e.g., "telegram" home channel)
- Origin (back to where the job was created)
- Local (always saved to files)
"""
import json
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass
from typing import Dict, List, Optional, Any, Union
from enum import Enum
from .config import Platform, GatewayConfig, HomeChannel
from .session import SessionSource
@dataclass
class DeliveryTarget:
"""
A single delivery target.
Represents where a message should be sent:
- "origin" back to source
- "local" save to local files
- "telegram" Telegram home channel
- "telegram:123456" specific Telegram chat
"""
platform: Platform
chat_id: Optional[str] = None # None means use home channel
is_origin: bool = False
is_explicit: bool = False # True if chat_id was explicitly specified
@classmethod
def parse(cls, target: str, origin: Optional[SessionSource] = None) -> "DeliveryTarget":
"""
Parse a delivery target string.
Formats:
- "origin" back to source
- "local" local files only
- "telegram" Telegram home channel
- "telegram:123456" specific Telegram chat
"""
target = target.strip().lower()
if target == "origin":
if origin:
return cls(
platform=origin.platform,
chat_id=origin.chat_id,
is_origin=True,
)
else:
# Fallback to local if no origin
return cls(platform=Platform.LOCAL, is_origin=True)
if target == "local":
return cls(platform=Platform.LOCAL)
# Check for platform:chat_id format
if ":" in target:
platform_str, chat_id = target.split(":", 1)
try:
platform = Platform(platform_str)
return cls(platform=platform, chat_id=chat_id, is_explicit=True)
except ValueError:
# Unknown platform, treat as local
return cls(platform=Platform.LOCAL)
# Just a platform name (use home channel)
try:
platform = Platform(target)
return cls(platform=platform)
except ValueError:
# Unknown platform, treat as local
return cls(platform=Platform.LOCAL)
def to_string(self) -> str:
"""Convert back to string format."""
if self.is_origin:
return "origin"
if self.platform == Platform.LOCAL:
return "local"
if self.chat_id:
return f"{self.platform.value}:{self.chat_id}"
return self.platform.value
class DeliveryRouter:
"""
Routes messages to appropriate destinations.
Handles the logic of resolving delivery targets and dispatching
messages to the right platform adapters.
"""
def __init__(self, config: GatewayConfig, adapters: Dict[Platform, Any] = None):
"""
Initialize the delivery router.
Args:
config: Gateway configuration
adapters: Dict mapping platforms to their adapter instances
"""
self.config = config
self.adapters = adapters or {}
self.output_dir = Path.home() / ".hermes" / "cron" / "output"
def resolve_targets(
self,
deliver: Union[str, List[str]],
origin: Optional[SessionSource] = None
) -> List[DeliveryTarget]:
"""
Resolve delivery specification to concrete targets.
Args:
deliver: Delivery spec - "origin", "telegram", ["local", "discord"], etc.
origin: The source where the request originated (for "origin" target)
Returns:
List of resolved delivery targets
"""
if isinstance(deliver, str):
deliver = [deliver]
targets = []
seen_platforms = set()
for target_str in deliver:
target = DeliveryTarget.parse(target_str, origin)
# Resolve home channel if needed
if target.chat_id is None and target.platform != Platform.LOCAL:
home = self.config.get_home_channel(target.platform)
if home:
target.chat_id = home.chat_id
else:
# No home channel configured, skip this platform
continue
# Deduplicate
key = (target.platform, target.chat_id)
if key not in seen_platforms:
seen_platforms.add(key)
targets.append(target)
# Always include local if configured
if self.config.always_log_local:
local_key = (Platform.LOCAL, None)
if local_key not in seen_platforms:
targets.append(DeliveryTarget(platform=Platform.LOCAL))
return targets
async def deliver(
self,
content: str,
targets: List[DeliveryTarget],
job_id: Optional[str] = None,
job_name: Optional[str] = None,
metadata: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
"""
Deliver content to all specified targets.
Args:
content: The message/output to deliver
targets: List of delivery targets
job_id: Optional job ID (for cron jobs)
job_name: Optional job name
metadata: Additional metadata to include
Returns:
Dict with delivery results per target
"""
results = {}
for target in targets:
try:
if target.platform == Platform.LOCAL:
result = self._deliver_local(content, job_id, job_name, metadata)
else:
result = await self._deliver_to_platform(target, content, metadata)
results[target.to_string()] = {
"success": True,
"result": result
}
except Exception as e:
results[target.to_string()] = {
"success": False,
"error": str(e)
}
return results
def _deliver_local(
self,
content: str,
job_id: Optional[str],
job_name: Optional[str],
metadata: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
"""Save content to local files."""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
if job_id:
output_path = self.output_dir / job_id / f"{timestamp}.md"
else:
output_path = self.output_dir / "misc" / f"{timestamp}.md"
output_path.parent.mkdir(parents=True, exist_ok=True)
# Build the output document
lines = []
if job_name:
lines.append(f"# {job_name}")
else:
lines.append("# Delivery Output")
lines.append("")
lines.append(f"**Timestamp:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
if job_id:
lines.append(f"**Job ID:** {job_id}")
if metadata:
for key, value in metadata.items():
lines.append(f"**{key}:** {value}")
lines.append("")
lines.append("---")
lines.append("")
lines.append(content)
output_path.write_text("\n".join(lines))
return {
"path": str(output_path),
"timestamp": timestamp
}
async def _deliver_to_platform(
self,
target: DeliveryTarget,
content: str,
metadata: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
"""Deliver content to a messaging platform."""
adapter = self.adapters.get(target.platform)
if not adapter:
raise ValueError(f"No adapter configured for {target.platform.value}")
if not target.chat_id:
raise ValueError(f"No chat ID for {target.platform.value} delivery")
# Call the adapter's send method
# Adapters should implement: async def send(chat_id: str, content: str) -> Dict
return await adapter.send(target.chat_id, content, metadata=metadata)
def parse_deliver_spec(
deliver: Optional[Union[str, List[str]]],
origin: Optional[SessionSource] = None,
default: str = "origin"
) -> Union[str, List[str]]:
"""
Normalize a delivery specification.
If None or empty, returns the default.
"""
if not deliver:
return default
return deliver
def build_delivery_context_for_tool(
config: GatewayConfig,
origin: Optional[SessionSource] = None
) -> Dict[str, Any]:
"""
Build context for the schedule_cronjob tool to understand delivery options.
This is passed to the tool so it can validate and explain delivery targets.
"""
connected = config.get_connected_platforms()
options = {
"origin": {
"description": "Back to where this job was created",
"available": origin is not None,
},
"local": {
"description": "Save to local files only",
"available": True,
}
}
for platform in connected:
home = config.get_home_channel(platform)
options[platform.value] = {
"description": f"{platform.value.title()} home channel",
"available": True,
"home_channel": home.to_dict() if home else None,
}
return {
"origin": origin.to_dict() if origin else None,
"options": options,
"always_log_local": config.always_log_local,
}
+150
View File
@@ -0,0 +1,150 @@
"""
Event Hook System
A lightweight event-driven system that fires handlers at key lifecycle points.
Hooks are discovered from ~/.hermes/hooks/ directories, each containing:
- HOOK.yaml (metadata: name, description, events list)
- handler.py (Python handler with async def handle(event_type, context))
Events:
- gateway:startup -- Gateway process starts
- session:start -- New session created
- session:reset -- User ran /new or /reset
- agent:start -- Agent begins processing a message
- agent:step -- Each turn in the tool-calling loop
- agent:end -- Agent finishes processing
- command:* -- Any slash command executed (wildcard match)
Errors in hooks are caught and logged but never block the main pipeline.
"""
import asyncio
import importlib.util
import os
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional
import yaml
HOOKS_DIR = Path(os.path.expanduser("~/.hermes/hooks"))
class HookRegistry:
"""
Discovers, loads, and fires event hooks.
Usage:
registry = HookRegistry()
registry.discover_and_load()
await registry.emit("agent:start", {"platform": "telegram", ...})
"""
def __init__(self):
# event_type -> [handler_fn, ...]
self._handlers: Dict[str, List[Callable]] = {}
self._loaded_hooks: List[dict] = [] # metadata for listing
@property
def loaded_hooks(self) -> List[dict]:
"""Return metadata about all loaded hooks."""
return list(self._loaded_hooks)
def discover_and_load(self) -> None:
"""
Scan the hooks directory for hook directories and load their handlers.
Each hook directory must contain:
- HOOK.yaml with at least 'name' and 'events' keys
- handler.py with a top-level 'handle' function (sync or async)
"""
if not HOOKS_DIR.exists():
return
for hook_dir in sorted(HOOKS_DIR.iterdir()):
if not hook_dir.is_dir():
continue
manifest_path = hook_dir / "HOOK.yaml"
handler_path = hook_dir / "handler.py"
if not manifest_path.exists() or not handler_path.exists():
continue
try:
manifest = yaml.safe_load(manifest_path.read_text(encoding="utf-8"))
if not manifest or not isinstance(manifest, dict):
print(f"[hooks] Skipping {hook_dir.name}: invalid HOOK.yaml", flush=True)
continue
hook_name = manifest.get("name", hook_dir.name)
events = manifest.get("events", [])
if not events:
print(f"[hooks] Skipping {hook_name}: no events declared", flush=True)
continue
# Dynamically load the handler module
spec = importlib.util.spec_from_file_location(
f"hermes_hook_{hook_name}", handler_path
)
if spec is None or spec.loader is None:
print(f"[hooks] Skipping {hook_name}: could not load handler.py", flush=True)
continue
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
handle_fn = getattr(module, "handle", None)
if handle_fn is None:
print(f"[hooks] Skipping {hook_name}: no 'handle' function found", flush=True)
continue
# Register the handler for each declared event
for event in events:
self._handlers.setdefault(event, []).append(handle_fn)
self._loaded_hooks.append({
"name": hook_name,
"description": manifest.get("description", ""),
"events": events,
"path": str(hook_dir),
})
print(f"[hooks] Loaded hook '{hook_name}' for events: {events}", flush=True)
except Exception as e:
print(f"[hooks] Error loading hook {hook_dir.name}: {e}", flush=True)
async def emit(self, event_type: str, context: Optional[Dict[str, Any]] = None) -> None:
"""
Fire all handlers registered for an event.
Supports wildcard matching: handlers registered for "command:*" will
fire for any "command:..." event. Handlers registered for a base type
like "agent" won't fire for "agent:start" -- only exact matches and
explicit wildcards.
Args:
event_type: The event identifier (e.g. "agent:start").
context: Optional dict with event-specific data.
"""
if context is None:
context = {}
# Collect handlers: exact match + wildcard match
handlers = list(self._handlers.get(event_type, []))
# Check for wildcard patterns (e.g., "command:*" matches "command:reset")
if ":" in event_type:
base = event_type.split(":")[0]
wildcard_key = f"{base}:*"
handlers.extend(self._handlers.get(wildcard_key, []))
for fn in handlers:
try:
result = fn(event_type, context)
# Support both sync and async handlers
if asyncio.iscoroutine(result):
await result
except Exception as e:
print(f"[hooks] Error in handler for '{event_type}': {e}", flush=True)
+282
View File
@@ -0,0 +1,282 @@
"""
DM Pairing System
Code-based approval flow for authorizing new users on messaging platforms.
Instead of static allowlists with user IDs, unknown users receive a one-time
pairing code that the bot owner approves via the CLI.
Security features (based on OWASP + NIST SP 800-63-4 guidance):
- 8-char codes from 32-char unambiguous alphabet (no 0/O/1/I)
- Cryptographic randomness via secrets.choice()
- 1-hour code expiry
- Max 3 pending codes per platform
- Rate limiting: 1 request per user per 10 minutes
- Lockout after 5 failed approval attempts (1 hour)
- File permissions: chmod 0600 on all data files
- Codes are never logged to stdout
Storage: ~/.hermes/pairing/
"""
import json
import os
import secrets
import time
from pathlib import Path
from typing import Optional
# Unambiguous alphabet -- excludes 0/O, 1/I to prevent confusion
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
CODE_LENGTH = 8
# Timing constants
CODE_TTL_SECONDS = 3600 # Codes expire after 1 hour
RATE_LIMIT_SECONDS = 600 # 1 request per user per 10 minutes
LOCKOUT_SECONDS = 3600 # Lockout duration after too many failures
# Limits
MAX_PENDING_PER_PLATFORM = 3 # Max pending codes per platform
MAX_FAILED_ATTEMPTS = 5 # Failed approvals before lockout
PAIRING_DIR = Path(os.path.expanduser("~/.hermes/pairing"))
def _secure_write(path: Path, data: str) -> None:
"""Write data to file with restrictive permissions (owner read/write only)."""
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(data, encoding="utf-8")
try:
os.chmod(path, 0o600)
except OSError:
pass # Windows doesn't support chmod the same way
class PairingStore:
"""
Manages pairing codes and approved user lists.
Data files per platform:
- {platform}-pending.json : pending pairing requests
- {platform}-approved.json : approved (paired) users
- _rate_limits.json : rate limit tracking
"""
def __init__(self):
PAIRING_DIR.mkdir(parents=True, exist_ok=True)
def _pending_path(self, platform: str) -> Path:
return PAIRING_DIR / f"{platform}-pending.json"
def _approved_path(self, platform: str) -> Path:
return PAIRING_DIR / f"{platform}-approved.json"
def _rate_limit_path(self) -> Path:
return PAIRING_DIR / "_rate_limits.json"
def _load_json(self, path: Path) -> dict:
if path.exists():
try:
return json.loads(path.read_text(encoding="utf-8"))
except (json.JSONDecodeError, OSError):
return {}
return {}
def _save_json(self, path: Path, data: dict) -> None:
_secure_write(path, json.dumps(data, indent=2, ensure_ascii=False))
# ----- Approved users -----
def is_approved(self, platform: str, user_id: str) -> bool:
"""Check if a user is approved (paired) on a platform."""
approved = self._load_json(self._approved_path(platform))
return user_id in approved
def list_approved(self, platform: str = None) -> list:
"""List approved users, optionally filtered by platform."""
results = []
platforms = [platform] if platform else self._all_platforms("approved")
for p in platforms:
approved = self._load_json(self._approved_path(p))
for uid, info in approved.items():
results.append({"platform": p, "user_id": uid, **info})
return results
def _approve_user(self, platform: str, user_id: str, user_name: str = "") -> None:
"""Add a user to the approved list."""
approved = self._load_json(self._approved_path(platform))
approved[user_id] = {
"user_name": user_name,
"approved_at": time.time(),
}
self._save_json(self._approved_path(platform), approved)
def revoke(self, platform: str, user_id: str) -> bool:
"""Remove a user from the approved list. Returns True if found."""
path = self._approved_path(platform)
approved = self._load_json(path)
if user_id in approved:
del approved[user_id]
self._save_json(path, approved)
return True
return False
# ----- Pending codes -----
def generate_code(
self, platform: str, user_id: str, user_name: str = ""
) -> Optional[str]:
"""
Generate a pairing code for a new user.
Returns the code string, or None if:
- User is rate-limited (too recent request)
- Max pending codes reached for this platform
- User/platform is in lockout due to failed attempts
"""
self._cleanup_expired(platform)
# Check lockout
if self._is_locked_out(platform):
return None
# Check rate limit for this specific user
if self._is_rate_limited(platform, user_id):
return None
# Check max pending
pending = self._load_json(self._pending_path(platform))
if len(pending) >= MAX_PENDING_PER_PLATFORM:
return None
# Generate cryptographically random code
code = "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
# Store pending request
pending[code] = {
"user_id": user_id,
"user_name": user_name,
"created_at": time.time(),
}
self._save_json(self._pending_path(platform), pending)
# Record rate limit
self._record_rate_limit(platform, user_id)
return code
def approve_code(self, platform: str, code: str) -> Optional[dict]:
"""
Approve a pairing code. Adds the user to the approved list.
Returns {user_id, user_name} on success, None if code is invalid/expired.
"""
self._cleanup_expired(platform)
code = code.upper().strip()
pending = self._load_json(self._pending_path(platform))
if code not in pending:
self._record_failed_attempt(platform)
return None
entry = pending.pop(code)
self._save_json(self._pending_path(platform), pending)
# Add to approved list
self._approve_user(platform, entry["user_id"], entry.get("user_name", ""))
return {
"user_id": entry["user_id"],
"user_name": entry.get("user_name", ""),
}
def list_pending(self, platform: str = None) -> list:
"""List pending pairing requests, optionally filtered by platform."""
results = []
platforms = [platform] if platform else self._all_platforms("pending")
for p in platforms:
self._cleanup_expired(p)
pending = self._load_json(self._pending_path(p))
for code, info in pending.items():
age_min = int((time.time() - info["created_at"]) / 60)
results.append({
"platform": p,
"code": code,
"user_id": info["user_id"],
"user_name": info.get("user_name", ""),
"age_minutes": age_min,
})
return results
def clear_pending(self, platform: str = None) -> int:
"""Clear all pending requests. Returns count removed."""
count = 0
platforms = [platform] if platform else self._all_platforms("pending")
for p in platforms:
pending = self._load_json(self._pending_path(p))
count += len(pending)
self._save_json(self._pending_path(p), {})
return count
# ----- Rate limiting and lockout -----
def _is_rate_limited(self, platform: str, user_id: str) -> bool:
"""Check if a user has requested a code too recently."""
limits = self._load_json(self._rate_limit_path())
key = f"{platform}:{user_id}"
last_request = limits.get(key, 0)
return (time.time() - last_request) < RATE_LIMIT_SECONDS
def _record_rate_limit(self, platform: str, user_id: str) -> None:
"""Record the time of a pairing request for rate limiting."""
limits = self._load_json(self._rate_limit_path())
key = f"{platform}:{user_id}"
limits[key] = time.time()
self._save_json(self._rate_limit_path(), limits)
def _is_locked_out(self, platform: str) -> bool:
"""Check if a platform is in lockout due to failed approval attempts."""
limits = self._load_json(self._rate_limit_path())
lockout_key = f"_lockout:{platform}"
lockout_until = limits.get(lockout_key, 0)
return time.time() < lockout_until
def _record_failed_attempt(self, platform: str) -> None:
"""Record a failed approval attempt. Triggers lockout after MAX_FAILED_ATTEMPTS."""
limits = self._load_json(self._rate_limit_path())
fail_key = f"_failures:{platform}"
fails = limits.get(fail_key, 0) + 1
limits[fail_key] = fails
if fails >= MAX_FAILED_ATTEMPTS:
lockout_key = f"_lockout:{platform}"
limits[lockout_key] = time.time() + LOCKOUT_SECONDS
limits[fail_key] = 0 # Reset counter
print(f"[pairing] Platform {platform} locked out for {LOCKOUT_SECONDS}s "
f"after {MAX_FAILED_ATTEMPTS} failed attempts", flush=True)
self._save_json(self._rate_limit_path(), limits)
# ----- Cleanup -----
def _cleanup_expired(self, platform: str) -> None:
"""Remove expired pending codes."""
path = self._pending_path(platform)
pending = self._load_json(path)
now = time.time()
expired = [
code for code, info in pending.items()
if (now - info["created_at"]) > CODE_TTL_SECONDS
]
if expired:
for code in expired:
del pending[code]
self._save_json(path, pending)
def _all_platforms(self, suffix: str) -> list:
"""List all platforms that have data files of a given suffix."""
platforms = []
for f in PAIRING_DIR.iterdir():
if f.name.endswith(f"-{suffix}.json"):
platform = f.name.replace(f"-{suffix}.json", "")
if not platform.startswith("_"):
platforms.append(platform)
return platforms
+17
View File
@@ -0,0 +1,17 @@
"""
Platform adapters for messaging integrations.
Each adapter handles:
- Receiving messages from a platform
- Sending messages/responses back
- Platform-specific authentication
- Message formatting and media handling
"""
from .base import BasePlatformAdapter, MessageEvent, SendResult
__all__ = [
"BasePlatformAdapter",
"MessageEvent",
"SendResult",
]
+691
View File
@@ -0,0 +1,691 @@
"""
Base platform adapter interface.
All platform adapters (Telegram, Discord, WhatsApp) inherit from this
and implement the required methods.
"""
import asyncio
import os
import re
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Any, Callable, Awaitable, Tuple
from enum import Enum
import sys
sys.path.insert(0, str(__file__).rsplit("/", 3)[0])
from gateway.config import Platform, PlatformConfig
from gateway.session import SessionSource
# ---------------------------------------------------------------------------
# Image cache utilities
#
# When users send images on messaging platforms, we download them to a local
# cache directory so they can be analyzed by the vision tool (which accepts
# local file paths). This avoids issues with ephemeral platform URLs
# (e.g. Telegram file URLs expire after ~1 hour).
# ---------------------------------------------------------------------------
# Default location: ~/.hermes/image_cache/
IMAGE_CACHE_DIR = Path(os.path.expanduser("~/.hermes/image_cache"))
def get_image_cache_dir() -> Path:
"""Return the image cache directory, creating it if it doesn't exist."""
IMAGE_CACHE_DIR.mkdir(parents=True, exist_ok=True)
return IMAGE_CACHE_DIR
def cache_image_from_bytes(data: bytes, ext: str = ".jpg") -> str:
"""
Save raw image bytes to the cache and return the absolute file path.
Args:
data: Raw image bytes.
ext: File extension including the dot (e.g. ".jpg", ".png").
Returns:
Absolute path to the cached image file as a string.
"""
cache_dir = get_image_cache_dir()
filename = f"img_{uuid.uuid4().hex[:12]}{ext}"
filepath = cache_dir / filename
filepath.write_bytes(data)
return str(filepath)
async def cache_image_from_url(url: str, ext: str = ".jpg") -> str:
"""
Download an image from a URL and save it to the local cache.
Uses httpx for async download with a reasonable timeout.
Args:
url: The HTTP/HTTPS URL to download from.
ext: File extension including the dot (e.g. ".jpg", ".png").
Returns:
Absolute path to the cached image file as a string.
"""
import httpx
async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
response = await client.get(
url,
headers={
"User-Agent": "Mozilla/5.0 (compatible; HermesAgent/1.0)",
"Accept": "image/*,*/*;q=0.8",
},
)
response.raise_for_status()
return cache_image_from_bytes(response.content, ext)
def cleanup_image_cache(max_age_hours: int = 24) -> int:
"""
Delete cached images older than *max_age_hours*.
Returns the number of files removed.
"""
import time
cache_dir = get_image_cache_dir()
cutoff = time.time() - (max_age_hours * 3600)
removed = 0
for f in cache_dir.iterdir():
if f.is_file() and f.stat().st_mtime < cutoff:
try:
f.unlink()
removed += 1
except OSError:
pass
return removed
# ---------------------------------------------------------------------------
# Audio cache utilities
#
# Same pattern as image cache -- voice messages from platforms are downloaded
# here so the STT tool (OpenAI Whisper) can transcribe them from local files.
# ---------------------------------------------------------------------------
AUDIO_CACHE_DIR = Path(os.path.expanduser("~/.hermes/audio_cache"))
def get_audio_cache_dir() -> Path:
"""Return the audio cache directory, creating it if it doesn't exist."""
AUDIO_CACHE_DIR.mkdir(parents=True, exist_ok=True)
return AUDIO_CACHE_DIR
def cache_audio_from_bytes(data: bytes, ext: str = ".ogg") -> str:
"""
Save raw audio bytes to the cache and return the absolute file path.
Args:
data: Raw audio bytes.
ext: File extension including the dot (e.g. ".ogg", ".mp3").
Returns:
Absolute path to the cached audio file as a string.
"""
cache_dir = get_audio_cache_dir()
filename = f"audio_{uuid.uuid4().hex[:12]}{ext}"
filepath = cache_dir / filename
filepath.write_bytes(data)
return str(filepath)
async def cache_audio_from_url(url: str, ext: str = ".ogg") -> str:
"""
Download an audio file from a URL and save it to the local cache.
Args:
url: The HTTP/HTTPS URL to download from.
ext: File extension including the dot (e.g. ".ogg", ".mp3").
Returns:
Absolute path to the cached audio file as a string.
"""
import httpx
async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
response = await client.get(
url,
headers={
"User-Agent": "Mozilla/5.0 (compatible; HermesAgent/1.0)",
"Accept": "audio/*,*/*;q=0.8",
},
)
response.raise_for_status()
return cache_audio_from_bytes(response.content, ext)
class MessageType(Enum):
"""Types of incoming messages."""
TEXT = "text"
PHOTO = "photo"
VIDEO = "video"
AUDIO = "audio"
VOICE = "voice"
DOCUMENT = "document"
STICKER = "sticker"
COMMAND = "command" # /command style
@dataclass
class MessageEvent:
"""
Incoming message from a platform.
Normalized representation that all adapters produce.
"""
# Message content
text: str
message_type: MessageType = MessageType.TEXT
# Source information
source: SessionSource = None
# Original platform data
raw_message: Any = None
message_id: Optional[str] = None
# Media attachments
media_urls: List[str] = field(default_factory=list)
media_types: List[str] = field(default_factory=list)
# Reply context
reply_to_message_id: Optional[str] = None
# Timestamps
timestamp: datetime = field(default_factory=datetime.now)
def is_command(self) -> bool:
"""Check if this is a command message (e.g., /new, /reset)."""
return self.text.startswith("/")
def get_command(self) -> Optional[str]:
"""Extract command name if this is a command message."""
if not self.is_command():
return None
# Split on space and get first word, strip the /
parts = self.text.split(maxsplit=1)
return parts[0][1:].lower() if parts else None
def get_command_args(self) -> str:
"""Get the arguments after a command."""
if not self.is_command():
return self.text
parts = self.text.split(maxsplit=1)
return parts[1] if len(parts) > 1 else ""
@dataclass
class SendResult:
"""Result of sending a message."""
success: bool
message_id: Optional[str] = None
error: Optional[str] = None
raw_response: Any = None
# Type for message handlers
MessageHandler = Callable[[MessageEvent], Awaitable[Optional[str]]]
class BasePlatformAdapter(ABC):
"""
Base class for platform adapters.
Subclasses implement platform-specific logic for:
- Connecting and authenticating
- Receiving messages
- Sending messages/responses
- Handling media
"""
def __init__(self, config: PlatformConfig, platform: Platform):
self.config = config
self.platform = platform
self._message_handler: Optional[MessageHandler] = None
self._running = False
# Track active message handlers per session for interrupt support
# Key: session_key (e.g., chat_id), Value: (event, asyncio.Event for interrupt)
self._active_sessions: Dict[str, asyncio.Event] = {}
self._pending_messages: Dict[str, MessageEvent] = {}
@property
def name(self) -> str:
"""Human-readable name for this adapter."""
return self.platform.value.title()
@property
def is_connected(self) -> bool:
"""Check if adapter is currently connected."""
return self._running
def set_message_handler(self, handler: MessageHandler) -> None:
"""
Set the handler for incoming messages.
The handler receives a MessageEvent and should return
an optional response string.
"""
self._message_handler = handler
@abstractmethod
async def connect(self) -> bool:
"""
Connect to the platform and start receiving messages.
Returns True if connection was successful.
"""
pass
@abstractmethod
async def disconnect(self) -> None:
"""Disconnect from the platform."""
pass
@abstractmethod
async def send(
self,
chat_id: str,
content: str,
reply_to: Optional[str] = None,
metadata: Optional[Dict[str, Any]] = None
) -> SendResult:
"""
Send a message to a chat.
Args:
chat_id: The chat/channel ID to send to
content: Message content (may be markdown)
reply_to: Optional message ID to reply to
metadata: Additional platform-specific options
Returns:
SendResult with success status and message ID
"""
pass
async def send_typing(self, chat_id: str) -> None:
"""
Send a typing indicator.
Override in subclasses if the platform supports it.
"""
pass
async def send_image(
self,
chat_id: str,
image_url: str,
caption: Optional[str] = None,
reply_to: Optional[str] = None,
) -> SendResult:
"""
Send an image natively via the platform API.
Override in subclasses to send images as proper attachments
instead of plain-text URLs. Default falls back to sending the
URL as a text message.
"""
# Fallback: send URL as text (subclasses override for native images)
text = f"{caption}\n{image_url}" if caption else image_url
return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
@staticmethod
def extract_images(content: str) -> Tuple[List[Tuple[str, str]], str]:
"""
Extract image URLs from markdown and HTML image tags in a response.
Finds patterns like:
- ![alt text](https://example.com/image.png)
- <img src="https://example.com/image.png">
- <img src="https://example.com/image.png"></img>
Args:
content: The response text to scan.
Returns:
Tuple of (list of (url, alt_text) pairs, cleaned content with image tags removed).
"""
images = []
cleaned = content
# Match markdown images: ![alt](url)
md_pattern = r'!\[([^\]]*)\]\((https?://[^\s\)]+)\)'
for match in re.finditer(md_pattern, content):
alt_text = match.group(1)
url = match.group(2)
# Only extract URLs that look like actual images
if any(url.lower().endswith(ext) or ext in url.lower() for ext in
['.png', '.jpg', '.jpeg', '.gif', '.webp', 'fal.media', 'fal-cdn', 'replicate.delivery']):
images.append((url, alt_text))
# Match HTML img tags: <img src="url"> or <img src="url"></img> or <img src="url"/>
html_pattern = r'<img\s+src=["\']?(https?://[^\s"\'<>]+)["\']?\s*/?>\s*(?:</img>)?'
for match in re.finditer(html_pattern, content):
url = match.group(1)
images.append((url, ""))
# Remove matched image tags from content if we found images
if images:
cleaned = re.sub(md_pattern, '', cleaned)
cleaned = re.sub(html_pattern, '', cleaned)
# Clean up leftover blank lines
cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()
return images, cleaned
async def send_voice(
self,
chat_id: str,
audio_path: str,
caption: Optional[str] = None,
reply_to: Optional[str] = None,
) -> SendResult:
"""
Send an audio file as a native voice message via the platform API.
Override in subclasses to send audio as voice bubbles (Telegram)
or file attachments (Discord). Default falls back to sending the
file path as text.
"""
text = f"🔊 Audio: {audio_path}"
if caption:
text = f"{caption}\n{text}"
return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
@staticmethod
def extract_media(content: str) -> Tuple[List[Tuple[str, bool]], str]:
"""
Extract MEDIA:<path> tags and [[audio_as_voice]] directives from response text.
The TTS tool returns responses like:
[[audio_as_voice]]
MEDIA:/path/to/audio.ogg
Args:
content: The response text to scan.
Returns:
Tuple of (list of (path, is_voice) pairs, cleaned content with tags removed).
"""
media = []
cleaned = content
# Check for [[audio_as_voice]] directive
has_voice_tag = "[[audio_as_voice]]" in content
cleaned = cleaned.replace("[[audio_as_voice]]", "")
# Extract MEDIA:<path> tags (path may contain spaces)
media_pattern = r'MEDIA:(\S+)'
for match in re.finditer(media_pattern, content):
path = match.group(1).strip()
if path:
media.append((path, has_voice_tag))
# Remove MEDIA tags from content
if media:
cleaned = re.sub(media_pattern, '', cleaned)
cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()
return media, cleaned
async def _keep_typing(self, chat_id: str, interval: float = 2.0) -> None:
"""
Continuously send typing indicator until cancelled.
Telegram/Discord typing status expires after ~5 seconds, so we refresh every 2
to recover quickly after progress messages interrupt it.
"""
try:
while True:
await self.send_typing(chat_id)
await asyncio.sleep(interval)
except asyncio.CancelledError:
pass # Normal cancellation when handler completes
async def handle_message(self, event: MessageEvent) -> None:
"""
Process an incoming message.
This method returns quickly by spawning background tasks.
This allows new messages to be processed even while an agent is running,
enabling interruption support.
"""
if not self._message_handler:
return
session_key = event.source.chat_id
# Check if there's already an active handler for this session
if session_key in self._active_sessions:
# Store this as a pending message - it will interrupt the running agent
print(f"[{self.name}] ⚡ New message while session {session_key} is active - triggering interrupt")
self._pending_messages[session_key] = event
# Signal the interrupt (the processing task checks this)
self._active_sessions[session_key].set()
return # Don't process now - will be handled after current task finishes
# Spawn background task to process this message
asyncio.create_task(self._process_message_background(event, session_key))
@staticmethod
def _get_human_delay() -> float:
"""
Return a random delay in seconds for human-like response pacing.
Reads from env vars:
HERMES_HUMAN_DELAY_MODE: "off" (default) | "natural" | "custom"
HERMES_HUMAN_DELAY_MIN_MS: minimum delay in ms (default 800, custom mode)
HERMES_HUMAN_DELAY_MAX_MS: maximum delay in ms (default 2500, custom mode)
"""
import random
mode = os.getenv("HERMES_HUMAN_DELAY_MODE", "off").lower()
if mode == "off":
return 0.0
min_ms = int(os.getenv("HERMES_HUMAN_DELAY_MIN_MS", "800"))
max_ms = int(os.getenv("HERMES_HUMAN_DELAY_MAX_MS", "2500"))
if mode == "natural":
min_ms, max_ms = 800, 2500
return random.uniform(min_ms / 1000.0, max_ms / 1000.0)
async def _process_message_background(self, event: MessageEvent, session_key: str) -> None:
"""Background task that actually processes the message."""
# Create interrupt event for this session
interrupt_event = asyncio.Event()
self._active_sessions[session_key] = interrupt_event
# Start continuous typing indicator (refreshes every 2 seconds)
typing_task = asyncio.create_task(self._keep_typing(event.source.chat_id))
try:
# Call the handler (this can take a while with tool calls)
response = await self._message_handler(event)
# Send response if any
if response:
# Extract MEDIA:<path> tags (from TTS tool) before other processing
media_files, response = self.extract_media(response)
# Extract image URLs and send them as native platform attachments
images, text_content = self.extract_images(response)
# Send the text portion first (if any remains after extractions)
if text_content:
result = await self.send(
chat_id=event.source.chat_id,
content=text_content,
reply_to=event.message_id
)
# Log send failures (don't raise - user already saw tool progress)
if not result.success:
print(f"[{self.name}] Failed to send response: {result.error}")
# Try sending without markdown as fallback
fallback_result = await self.send(
chat_id=event.source.chat_id,
content=f"(Response formatting failed, plain text:)\n\n{text_content[:3500]}",
reply_to=event.message_id
)
if not fallback_result.success:
print(f"[{self.name}] Fallback send also failed: {fallback_result.error}")
# Human-like pacing delay between text and media
human_delay = self._get_human_delay()
# Send extracted images as native attachments
for image_url, alt_text in images:
if human_delay > 0:
await asyncio.sleep(human_delay)
try:
img_result = await self.send_image(
chat_id=event.source.chat_id,
image_url=image_url,
caption=alt_text if alt_text else None,
)
if not img_result.success:
print(f"[{self.name}] Failed to send image: {img_result.error}")
except Exception as img_err:
print(f"[{self.name}] Error sending image: {img_err}")
# Send extracted audio/voice files as native attachments
for audio_path, is_voice in media_files:
if human_delay > 0:
await asyncio.sleep(human_delay)
try:
voice_result = await self.send_voice(
chat_id=event.source.chat_id,
audio_path=audio_path,
)
if not voice_result.success:
print(f"[{self.name}] Failed to send voice: {voice_result.error}")
except Exception as voice_err:
print(f"[{self.name}] Error sending voice: {voice_err}")
# Check if there's a pending message that was queued during our processing
if session_key in self._pending_messages:
pending_event = self._pending_messages.pop(session_key)
print(f"[{self.name}] 📨 Processing queued message from interrupt")
# Clean up current session before processing pending
if session_key in self._active_sessions:
del self._active_sessions[session_key]
typing_task.cancel()
try:
await typing_task
except asyncio.CancelledError:
pass
# Process pending message in new background task
await self._process_message_background(pending_event, session_key)
return # Already cleaned up
except Exception as e:
print(f"[{self.name}] Error handling message: {e}")
import traceback
traceback.print_exc()
finally:
# Stop typing indicator
typing_task.cancel()
try:
await typing_task
except asyncio.CancelledError:
pass
# Clean up session tracking
if session_key in self._active_sessions:
del self._active_sessions[session_key]
def has_pending_interrupt(self, session_key: str) -> bool:
"""Check if there's a pending interrupt for a session."""
return session_key in self._active_sessions and self._active_sessions[session_key].is_set()
def get_pending_message(self, session_key: str) -> Optional[MessageEvent]:
"""Get and clear any pending message for a session."""
return self._pending_messages.pop(session_key, None)
def build_source(
self,
chat_id: str,
chat_name: Optional[str] = None,
chat_type: str = "dm",
user_id: Optional[str] = None,
user_name: Optional[str] = None,
thread_id: Optional[str] = None
) -> SessionSource:
"""Helper to build a SessionSource for this platform."""
return SessionSource(
platform=self.platform,
chat_id=str(chat_id),
chat_name=chat_name,
chat_type=chat_type,
user_id=str(user_id) if user_id else None,
user_name=user_name,
thread_id=str(thread_id) if thread_id else None,
)
@abstractmethod
async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
"""
Get information about a chat/channel.
Returns dict with at least:
- name: Chat name
- type: "dm", "group", "channel"
"""
pass
def format_message(self, content: str) -> str:
"""
Format a message for this platform.
Override in subclasses to handle platform-specific formatting
(e.g., Telegram MarkdownV2, Discord markdown).
Default implementation returns content as-is.
"""
return content
def truncate_message(self, content: str, max_length: int = 4096) -> List[str]:
"""
Split a long message into chunks.
Args:
content: The full message content
max_length: Maximum length per chunk (platform-specific)
Returns:
List of message chunks
"""
if len(content) <= max_length:
return [content]
chunks = []
while content:
if len(content) <= max_length:
chunks.append(content)
break
# Try to split at a newline
split_idx = content.rfind("\n", 0, max_length)
if split_idx == -1:
# No newline, split at space
split_idx = content.rfind(" ", 0, max_length)
if split_idx == -1:
# No space either, hard split
split_idx = max_length
chunks.append(content[:split_idx])
content = content[split_idx:].lstrip()
return chunks
+679
View File
@@ -0,0 +1,679 @@
"""
Discord platform adapter.
Uses discord.py library for:
- Receiving messages from servers and DMs
- Sending responses back
- Handling threads and channels
"""
import asyncio
import os
from typing import Dict, List, Optional, Any
try:
import discord
from discord import Message as DiscordMessage, Intents
from discord.ext import commands
DISCORD_AVAILABLE = True
except ImportError:
DISCORD_AVAILABLE = False
discord = None
DiscordMessage = Any
Intents = Any
commands = None
import sys
sys.path.insert(0, str(__file__).rsplit("/", 3)[0])
from gateway.config import Platform, PlatformConfig
from gateway.platforms.base import (
BasePlatformAdapter,
MessageEvent,
MessageType,
SendResult,
cache_image_from_url,
cache_audio_from_url,
)
def check_discord_requirements() -> bool:
"""Check if Discord dependencies are available."""
return DISCORD_AVAILABLE
class DiscordAdapter(BasePlatformAdapter):
"""
Discord bot adapter.
Handles:
- Receiving messages from servers and DMs
- Sending responses with Discord markdown
- Thread support
- Native slash commands (/ask, /reset, /status, /stop)
- Button-based exec approvals
- Auto-threading for long conversations
- Reaction-based feedback
"""
# Discord message limits
MAX_MESSAGE_LENGTH = 2000
def __init__(self, config: PlatformConfig):
super().__init__(config, Platform.DISCORD)
self._client: Optional[commands.Bot] = None
self._ready_event = asyncio.Event()
self._allowed_user_ids: set = set() # For button approval authorization
async def connect(self) -> bool:
"""Connect to Discord and start receiving events."""
if not DISCORD_AVAILABLE:
print(f"[{self.name}] discord.py not installed. Run: pip install discord.py")
return False
if not self.config.token:
print(f"[{self.name}] No bot token configured")
return False
try:
# Set up intents
intents = Intents.default()
intents.message_content = True
intents.dm_messages = True
intents.guild_messages = True
# Create bot
self._client = commands.Bot(
command_prefix="!", # Not really used, we handle raw messages
intents=intents,
)
# Parse allowed user IDs for button authorization
allowed_env = os.getenv("DISCORD_ALLOWED_USERS", "")
if allowed_env:
self._allowed_user_ids = {
uid.strip() for uid in allowed_env.split(",") if uid.strip()
}
# Register event handlers
@self._client.event
async def on_ready():
print(f"[{self.name}] Connected as {self._client.user}")
# Sync slash commands with Discord
try:
synced = await self._client.tree.sync()
print(f"[{self.name}] Synced {len(synced)} slash command(s)")
except Exception as e:
print(f"[{self.name}] Slash command sync failed: {e}")
self._ready_event.set()
@self._client.event
async def on_message(message: DiscordMessage):
# Ignore bot's own messages
if message.author == self._client.user:
return
await self._handle_message(message)
# Register slash commands
self._register_slash_commands()
# Start the bot in background
asyncio.create_task(self._client.start(self.config.token))
# Wait for ready
await asyncio.wait_for(self._ready_event.wait(), timeout=30)
self._running = True
return True
except asyncio.TimeoutError:
print(f"[{self.name}] Timeout waiting for connection")
return False
except Exception as e:
print(f"[{self.name}] Failed to connect: {e}")
return False
async def disconnect(self) -> None:
"""Disconnect from Discord."""
if self._client:
try:
await self._client.close()
except Exception as e:
print(f"[{self.name}] Error during disconnect: {e}")
self._running = False
self._client = None
self._ready_event.clear()
print(f"[{self.name}] Disconnected")
async def send(
self,
chat_id: str,
content: str,
reply_to: Optional[str] = None,
metadata: Optional[Dict[str, Any]] = None
) -> SendResult:
"""Send a message to a Discord channel."""
if not self._client:
return SendResult(success=False, error="Not connected")
try:
# Get the channel
channel = self._client.get_channel(int(chat_id))
if not channel:
channel = await self._client.fetch_channel(int(chat_id))
if not channel:
return SendResult(success=False, error=f"Channel {chat_id} not found")
# Format and split message if needed
formatted = self.format_message(content)
chunks = self.truncate_message(formatted, self.MAX_MESSAGE_LENGTH)
message_ids = []
reference = None
if reply_to:
try:
ref_msg = await channel.fetch_message(int(reply_to))
reference = ref_msg
except Exception:
pass # Ignore if we can't find the referenced message
for i, chunk in enumerate(chunks):
msg = await channel.send(
content=chunk,
reference=reference if i == 0 else None,
)
message_ids.append(str(msg.id))
return SendResult(
success=True,
message_id=message_ids[0] if message_ids else None,
raw_response={"message_ids": message_ids}
)
except Exception as e:
return SendResult(success=False, error=str(e))
async def send_voice(
self,
chat_id: str,
audio_path: str,
caption: Optional[str] = None,
reply_to: Optional[str] = None,
) -> SendResult:
"""Send audio as a Discord file attachment."""
if not self._client:
return SendResult(success=False, error="Not connected")
try:
import io
channel = self._client.get_channel(int(chat_id))
if not channel:
channel = await self._client.fetch_channel(int(chat_id))
if not channel:
return SendResult(success=False, error=f"Channel {chat_id} not found")
if not os.path.exists(audio_path):
return SendResult(success=False, error=f"Audio file not found: {audio_path}")
# Determine filename from path
filename = os.path.basename(audio_path)
with open(audio_path, "rb") as f:
file = discord.File(io.BytesIO(f.read()), filename=filename)
msg = await channel.send(
content=caption if caption else None,
file=file,
)
return SendResult(success=True, message_id=str(msg.id))
except Exception as e:
print(f"[{self.name}] Failed to send audio: {e}")
return await super().send_voice(chat_id, audio_path, caption, reply_to)
async def send_image(
self,
chat_id: str,
image_url: str,
caption: Optional[str] = None,
reply_to: Optional[str] = None,
) -> SendResult:
"""Send an image natively as a Discord file attachment."""
if not self._client:
return SendResult(success=False, error="Not connected")
try:
import aiohttp
channel = self._client.get_channel(int(chat_id))
if not channel:
channel = await self._client.fetch_channel(int(chat_id))
if not channel:
return SendResult(success=False, error=f"Channel {chat_id} not found")
# Download the image and send as a Discord file attachment
# (Discord renders attachments inline, unlike plain URLs)
async with aiohttp.ClientSession() as session:
async with session.get(image_url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
if resp.status != 200:
raise Exception(f"Failed to download image: HTTP {resp.status}")
image_data = await resp.read()
# Determine filename from URL or content type
content_type = resp.headers.get("content-type", "image/png")
ext = "png"
if "jpeg" in content_type or "jpg" in content_type:
ext = "jpg"
elif "gif" in content_type:
ext = "gif"
elif "webp" in content_type:
ext = "webp"
import io
file = discord.File(io.BytesIO(image_data), filename=f"image.{ext}")
msg = await channel.send(
content=caption if caption else None,
file=file,
)
return SendResult(success=True, message_id=str(msg.id))
except ImportError:
print(f"[{self.name}] aiohttp not installed, falling back to URL. Run: pip install aiohttp")
return await super().send_image(chat_id, image_url, caption, reply_to)
except Exception as e:
print(f"[{self.name}] Failed to send image attachment, falling back to URL: {e}")
return await super().send_image(chat_id, image_url, caption, reply_to)
async def send_typing(self, chat_id: str) -> None:
"""Send typing indicator."""
if self._client:
try:
channel = self._client.get_channel(int(chat_id))
if channel:
await channel.typing()
except Exception:
pass # Ignore typing indicator failures
async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
"""Get information about a Discord channel."""
if not self._client:
return {"name": "Unknown", "type": "dm"}
try:
channel = self._client.get_channel(int(chat_id))
if not channel:
channel = await self._client.fetch_channel(int(chat_id))
if not channel:
return {"name": str(chat_id), "type": "dm"}
# Determine channel type
if isinstance(channel, discord.DMChannel):
chat_type = "dm"
name = channel.recipient.name if channel.recipient else str(chat_id)
elif isinstance(channel, discord.Thread):
chat_type = "thread"
name = channel.name
elif isinstance(channel, discord.TextChannel):
chat_type = "channel"
name = f"#{channel.name}"
if channel.guild:
name = f"{channel.guild.name} / {name}"
else:
chat_type = "channel"
name = getattr(channel, "name", str(chat_id))
return {
"name": name,
"type": chat_type,
"guild_id": str(channel.guild.id) if hasattr(channel, "guild") and channel.guild else None,
"guild_name": channel.guild.name if hasattr(channel, "guild") and channel.guild else None,
}
except Exception as e:
return {"name": str(chat_id), "type": "dm", "error": str(e)}
def format_message(self, content: str) -> str:
"""
Format message for Discord.
Discord uses its own markdown variant.
"""
# Discord markdown is fairly standard, no special escaping needed
return content
def _register_slash_commands(self) -> None:
"""Register Discord slash commands on the command tree."""
if not self._client:
return
tree = self._client.tree
@tree.command(name="ask", description="Ask Hermes a question")
@discord.app_commands.describe(question="Your question for Hermes")
async def slash_ask(interaction: discord.Interaction, question: str):
await interaction.response.defer()
event = self._build_slash_event(interaction, question)
await self.handle_message(event)
# The response is sent via the normal send() flow
# Send a followup to close the interaction if needed
try:
await interaction.followup.send("Processing complete~", ephemeral=True)
except Exception:
pass
@tree.command(name="reset", description="Reset your Hermes session")
async def slash_reset(interaction: discord.Interaction):
await interaction.response.defer(ephemeral=True)
event = self._build_slash_event(interaction, "/reset")
await self.handle_message(event)
try:
await interaction.followup.send("Session reset~", ephemeral=True)
except Exception:
pass
@tree.command(name="status", description="Show Hermes session status")
async def slash_status(interaction: discord.Interaction):
await interaction.response.defer(ephemeral=True)
event = self._build_slash_event(interaction, "/status")
await self.handle_message(event)
try:
await interaction.followup.send("Status sent~", ephemeral=True)
except Exception:
pass
@tree.command(name="stop", description="Stop the running Hermes agent")
async def slash_stop(interaction: discord.Interaction):
await interaction.response.defer(ephemeral=True)
event = self._build_slash_event(interaction, "/stop")
await self.handle_message(event)
try:
await interaction.followup.send("Stop requested~", ephemeral=True)
except Exception:
pass
def _build_slash_event(self, interaction: discord.Interaction, text: str) -> MessageEvent:
"""Build a MessageEvent from a Discord slash command interaction."""
is_dm = isinstance(interaction.channel, discord.DMChannel)
chat_type = "dm" if is_dm else "group"
chat_name = ""
if not is_dm and hasattr(interaction.channel, "name"):
chat_name = interaction.channel.name
if hasattr(interaction.channel, "guild") and interaction.channel.guild:
chat_name = f"{interaction.channel.guild.name} / #{chat_name}"
source = self.build_source(
chat_id=str(interaction.channel_id),
chat_name=chat_name,
chat_type=chat_type,
user_id=str(interaction.user.id),
user_name=interaction.user.display_name,
)
msg_type = MessageType.COMMAND if text.startswith("/") else MessageType.TEXT
return MessageEvent(
text=text,
message_type=msg_type,
source=source,
raw_message=interaction,
)
async def send_exec_approval(
self, chat_id: str, command: str, approval_id: str
) -> SendResult:
"""
Send a button-based exec approval prompt for a dangerous command.
Returns SendResult. The approval is resolved when a user clicks a button.
"""
if not self._client or not DISCORD_AVAILABLE:
return SendResult(success=False, error="Not connected")
try:
channel = self._client.get_channel(int(chat_id))
if not channel:
channel = await self._client.fetch_channel(int(chat_id))
embed = discord.Embed(
title="Command Approval Required",
description=f"```\n{command[:500]}\n```",
color=discord.Color.orange(),
)
embed.set_footer(text=f"Approval ID: {approval_id}")
view = ExecApprovalView(
approval_id=approval_id,
allowed_user_ids=self._allowed_user_ids,
)
msg = await channel.send(embed=embed, view=view)
return SendResult(success=True, message_id=str(msg.id))
except Exception as e:
return SendResult(success=False, error=str(e))
async def _handle_message(self, message: DiscordMessage) -> None:
"""Handle incoming Discord messages."""
# In server channels (not DMs), require the bot to be @mentioned
# UNLESS the channel is in the free-response list.
#
# Config:
# DISCORD_FREE_RESPONSE_CHANNELS: Comma-separated channel IDs where the
# bot responds to every message without needing a mention.
# DISCORD_REQUIRE_MENTION: Set to "false" to disable mention requirement
# globally (all channels become free-response). Default: "true".
if not isinstance(message.channel, discord.DMChannel):
# Check if this channel is in the free-response list
free_channels_raw = os.getenv("DISCORD_FREE_RESPONSE_CHANNELS", "")
free_channels = {ch.strip() for ch in free_channels_raw.split(",") if ch.strip()}
channel_id = str(message.channel.id)
# Global override: if DISCORD_REQUIRE_MENTION=false, all channels are free
require_mention = os.getenv("DISCORD_REQUIRE_MENTION", "true").lower() not in ("false", "0", "no")
is_free_channel = channel_id in free_channels
if require_mention and not is_free_channel:
# Must be @mentioned to respond
if self._client.user not in message.mentions:
return # Silently ignore messages that don't mention the bot
# Strip the bot mention from the message text so the agent sees clean input
if self._client.user and self._client.user in message.mentions:
message.content = message.content.replace(f"<@{self._client.user.id}>", "").strip()
message.content = message.content.replace(f"<@!{self._client.user.id}>", "").strip()
# Determine message type
msg_type = MessageType.TEXT
if message.content.startswith("/"):
msg_type = MessageType.COMMAND
elif message.attachments:
# Check attachment types
for att in message.attachments:
if att.content_type:
if att.content_type.startswith("image/"):
msg_type = MessageType.PHOTO
elif att.content_type.startswith("video/"):
msg_type = MessageType.VIDEO
elif att.content_type.startswith("audio/"):
msg_type = MessageType.AUDIO
else:
msg_type = MessageType.DOCUMENT
break
# Determine chat type
if isinstance(message.channel, discord.DMChannel):
chat_type = "dm"
chat_name = message.author.name
elif isinstance(message.channel, discord.Thread):
chat_type = "thread"
chat_name = message.channel.name
else:
chat_type = "group" # Treat server channels as groups
chat_name = getattr(message.channel, "name", str(message.channel.id))
if hasattr(message.channel, "guild") and message.channel.guild:
chat_name = f"{message.channel.guild.name} / #{chat_name}"
# Get thread ID if in a thread
thread_id = None
if isinstance(message.channel, discord.Thread):
thread_id = str(message.channel.id)
# Build source
source = self.build_source(
chat_id=str(message.channel.id),
chat_name=chat_name,
chat_type=chat_type,
user_id=str(message.author.id),
user_name=message.author.display_name,
thread_id=thread_id,
)
# Build media URLs -- download image attachments to local cache so the
# vision tool can access them reliably (Discord CDN URLs can expire).
media_urls = []
media_types = []
for att in message.attachments:
content_type = att.content_type or "unknown"
if content_type.startswith("image/"):
try:
# Determine extension from content type (image/png -> .png)
ext = "." + content_type.split("/")[-1].split(";")[0]
if ext not in (".jpg", ".jpeg", ".png", ".gif", ".webp"):
ext = ".jpg"
cached_path = await cache_image_from_url(att.url, ext=ext)
media_urls.append(cached_path)
media_types.append(content_type)
print(f"[Discord] Cached user image: {cached_path}", flush=True)
except Exception as e:
print(f"[Discord] Failed to cache image attachment: {e}", flush=True)
# Fall back to the CDN URL if caching fails
media_urls.append(att.url)
media_types.append(content_type)
elif content_type.startswith("audio/"):
try:
ext = "." + content_type.split("/")[-1].split(";")[0]
if ext not in (".ogg", ".mp3", ".wav", ".webm", ".m4a"):
ext = ".ogg"
cached_path = await cache_audio_from_url(att.url, ext=ext)
media_urls.append(cached_path)
media_types.append(content_type)
print(f"[Discord] Cached user audio: {cached_path}", flush=True)
except Exception as e:
print(f"[Discord] Failed to cache audio attachment: {e}", flush=True)
media_urls.append(att.url)
media_types.append(content_type)
else:
# Other attachments: keep the original URL
media_urls.append(att.url)
media_types.append(content_type)
event = MessageEvent(
text=message.content,
message_type=msg_type,
source=source,
raw_message=message,
message_id=str(message.id),
media_urls=media_urls,
media_types=media_types,
reply_to_message_id=str(message.reference.message_id) if message.reference else None,
timestamp=message.created_at,
)
await self.handle_message(event)
# ---------------------------------------------------------------------------
# Discord UI Components (outside the adapter class)
# ---------------------------------------------------------------------------
if DISCORD_AVAILABLE:
class ExecApprovalView(discord.ui.View):
"""
Interactive button view for exec approval of dangerous commands.
Shows three buttons: Allow Once (green), Always Allow (blue), Deny (red).
Only users in the allowed list can click. The view times out after 5 minutes.
"""
def __init__(self, approval_id: str, allowed_user_ids: set):
super().__init__(timeout=300) # 5-minute timeout
self.approval_id = approval_id
self.allowed_user_ids = allowed_user_ids
self.resolved = False
def _check_auth(self, interaction: discord.Interaction) -> bool:
"""Verify the user clicking is authorized."""
if not self.allowed_user_ids:
return True # No allowlist = anyone can approve
return str(interaction.user.id) in self.allowed_user_ids
async def _resolve(
self, interaction: discord.Interaction, action: str, color: discord.Color
):
"""Resolve the approval and update the message."""
if self.resolved:
await interaction.response.send_message(
"This approval has already been resolved~", ephemeral=True
)
return
if not self._check_auth(interaction):
await interaction.response.send_message(
"You're not authorized to approve commands~", ephemeral=True
)
return
self.resolved = True
# Update the embed with the decision
embed = interaction.message.embeds[0] if interaction.message.embeds else None
if embed:
embed.color = color
embed.set_footer(text=f"{action} by {interaction.user.display_name}")
# Disable all buttons
for child in self.children:
child.disabled = True
await interaction.response.edit_message(embed=embed, view=self)
# Store the approval decision for the gateway to pick up
try:
from tools.terminal_tool import _session_approved_patterns
if action == "allow_once":
pass # One-time approval handled by gateway
elif action == "allow_always":
_session_approved_patterns.add(self.approval_id)
except ImportError:
pass
@discord.ui.button(label="Allow Once", style=discord.ButtonStyle.green)
async def allow_once(
self, interaction: discord.Interaction, button: discord.ui.Button
):
await self._resolve(interaction, "allow_once", discord.Color.green())
@discord.ui.button(label="Always Allow", style=discord.ButtonStyle.blurple)
async def allow_always(
self, interaction: discord.Interaction, button: discord.ui.Button
):
await self._resolve(interaction, "allow_always", discord.Color.blue())
@discord.ui.button(label="Deny", style=discord.ButtonStyle.red)
async def deny(
self, interaction: discord.Interaction, button: discord.ui.Button
):
await self._resolve(interaction, "deny", discord.Color.red())
async def on_timeout(self):
"""Handle view timeout -- disable buttons and mark as expired."""
self.resolved = True
for child in self.children:
child.disabled = True
+374
View File
@@ -0,0 +1,374 @@
"""
Slack platform adapter.
Uses slack-bolt (Python) with Socket Mode for:
- Receiving messages from channels and DMs
- Sending responses back
- Handling slash commands
- Thread support
"""
import asyncio
import os
from typing import Dict, List, Optional, Any
try:
from slack_bolt.async_app import AsyncApp
from slack_bolt.adapter.socket_mode.async_handler import AsyncSocketModeHandler
from slack_sdk.web.async_client import AsyncWebClient
SLACK_AVAILABLE = True
except ImportError:
SLACK_AVAILABLE = False
AsyncApp = Any
AsyncSocketModeHandler = Any
AsyncWebClient = Any
import sys
sys.path.insert(0, str(__file__).rsplit("/", 3)[0])
from gateway.config import Platform, PlatformConfig
from gateway.platforms.base import (
BasePlatformAdapter,
MessageEvent,
MessageType,
SendResult,
cache_image_from_url,
cache_audio_from_url,
)
def check_slack_requirements() -> bool:
"""Check if Slack dependencies are available."""
return SLACK_AVAILABLE
class SlackAdapter(BasePlatformAdapter):
"""
Slack bot adapter using Socket Mode.
Requires two tokens:
- SLACK_BOT_TOKEN (xoxb-...) for API calls
- SLACK_APP_TOKEN (xapp-...) for Socket Mode connection
Features:
- DMs and channel messages (mention-gated in channels)
- Thread support
- File/image/audio attachments
- Slash commands (/hermes)
- Typing indicators (not natively supported by Slack bots)
"""
MAX_MESSAGE_LENGTH = 4000 # Slack's limit is higher but mrkdwn can inflate
def __init__(self, config: PlatformConfig):
super().__init__(config, Platform.SLACK)
self._app: Optional[AsyncApp] = None
self._handler: Optional[AsyncSocketModeHandler] = None
self._bot_user_id: Optional[str] = None
async def connect(self) -> bool:
"""Connect to Slack via Socket Mode."""
if not SLACK_AVAILABLE:
print("[Slack] slack-bolt not installed. Run: pip install slack-bolt")
return False
bot_token = self.config.token
app_token = os.getenv("SLACK_APP_TOKEN")
if not bot_token:
print("[Slack] SLACK_BOT_TOKEN not set")
return False
if not app_token:
print("[Slack] SLACK_APP_TOKEN not set")
return False
try:
self._app = AsyncApp(token=bot_token)
# Get our own bot user ID for mention detection
auth_response = await self._app.client.auth_test()
self._bot_user_id = auth_response.get("user_id")
bot_name = auth_response.get("user", "unknown")
# Register message event handler
@self._app.event("message")
async def handle_message_event(event, say):
await self._handle_slack_message(event)
# Register slash command handler
@self._app.command("/hermes")
async def handle_hermes_command(ack, command):
await ack()
await self._handle_slash_command(command)
# Start Socket Mode handler in background
self._handler = AsyncSocketModeHandler(self._app, app_token)
asyncio.create_task(self._handler.start_async())
self._running = True
print(f"[Slack] Connected as @{bot_name} (Socket Mode)")
return True
except Exception as e:
print(f"[Slack] Connection failed: {e}")
return False
async def disconnect(self) -> None:
"""Disconnect from Slack."""
if self._handler:
await self._handler.close_async()
self._running = False
print("[Slack] Disconnected")
async def send(
self,
chat_id: str,
content: str,
reply_to: Optional[str] = None,
metadata: Optional[Dict[str, Any]] = None,
) -> SendResult:
"""Send a message to a Slack channel or DM."""
if not self._app:
return SendResult(success=False, error="Not connected")
try:
kwargs = {
"channel": chat_id,
"text": content,
}
# Reply in thread if thread_ts is available
if reply_to:
kwargs["thread_ts"] = reply_to
elif metadata and metadata.get("thread_ts"):
kwargs["thread_ts"] = metadata["thread_ts"]
result = await self._app.client.chat_postMessage(**kwargs)
return SendResult(
success=True,
message_id=result.get("ts"),
raw_response=result,
)
except Exception as e:
print(f"[Slack] Send error: {e}")
return SendResult(success=False, error=str(e))
async def send_typing(self, chat_id: str) -> None:
"""Slack doesn't have a direct typing indicator API for bots."""
pass
async def send_image(
self,
chat_id: str,
image_url: str,
caption: Optional[str] = None,
reply_to: Optional[str] = None,
) -> SendResult:
"""Send an image to Slack by uploading the URL as a file."""
if not self._app:
return SendResult(success=False, error="Not connected")
try:
import httpx
# Download the image first
async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
response = await client.get(image_url)
response.raise_for_status()
result = await self._app.client.files_upload_v2(
channel=chat_id,
content=response.content,
filename="image.png",
initial_comment=caption or "",
thread_ts=reply_to,
)
return SendResult(success=True, raw_response=result)
except Exception as e:
# Fall back to sending the URL as text
text = f"{caption}\n{image_url}" if caption else image_url
return await self.send(chat_id=chat_id, content=text, reply_to=reply_to)
async def send_voice(
self,
chat_id: str,
audio_path: str,
caption: Optional[str] = None,
reply_to: Optional[str] = None,
) -> SendResult:
"""Send an audio file to Slack."""
if not self._app:
return SendResult(success=False, error="Not connected")
try:
result = await self._app.client.files_upload_v2(
channel=chat_id,
file=audio_path,
filename=os.path.basename(audio_path),
initial_comment=caption or "",
thread_ts=reply_to,
)
return SendResult(success=True, raw_response=result)
except Exception as e:
return SendResult(success=False, error=str(e))
async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
"""Get information about a Slack channel."""
if not self._app:
return {"name": chat_id, "type": "unknown"}
try:
result = await self._app.client.conversations_info(channel=chat_id)
channel = result.get("channel", {})
is_dm = channel.get("is_im", False)
return {
"name": channel.get("name", chat_id),
"type": "dm" if is_dm else "group",
}
except Exception:
return {"name": chat_id, "type": "unknown"}
# ----- Internal handlers -----
async def _handle_slack_message(self, event: dict) -> None:
"""Handle an incoming Slack message event."""
# Ignore bot messages (including our own)
if event.get("bot_id") or event.get("subtype") == "bot_message":
return
# Ignore message edits and deletions
subtype = event.get("subtype")
if subtype in ("message_changed", "message_deleted"):
return
text = event.get("text", "")
user_id = event.get("user", "")
channel_id = event.get("channel", "")
thread_ts = event.get("thread_ts") or event.get("ts")
ts = event.get("ts", "")
# Determine if this is a DM or channel message
channel_type = event.get("channel_type", "")
is_dm = channel_type == "im"
# In channels, only respond if bot is mentioned
if not is_dm and self._bot_user_id:
if f"<@{self._bot_user_id}>" not in text:
return
# Strip the bot mention from the text
text = text.replace(f"<@{self._bot_user_id}>", "").strip()
# Determine message type
msg_type = MessageType.TEXT
if text.startswith("/"):
msg_type = MessageType.COMMAND
# Handle file attachments
media_urls = []
media_types = []
files = event.get("files", [])
for f in files:
mimetype = f.get("mimetype", "unknown")
url = f.get("url_private_download") or f.get("url_private", "")
if mimetype.startswith("image/") and url:
try:
ext = "." + mimetype.split("/")[-1].split(";")[0]
if ext not in (".jpg", ".jpeg", ".png", ".gif", ".webp"):
ext = ".jpg"
# Slack private URLs require the bot token as auth header
cached = await self._download_slack_file(url, ext)
media_urls.append(cached)
media_types.append(mimetype)
msg_type = MessageType.PHOTO
except Exception as e:
print(f"[Slack] Failed to cache image: {e}", flush=True)
elif mimetype.startswith("audio/") and url:
try:
ext = "." + mimetype.split("/")[-1].split(";")[0]
if ext not in (".ogg", ".mp3", ".wav", ".webm", ".m4a"):
ext = ".ogg"
cached = await self._download_slack_file(url, ext, audio=True)
media_urls.append(cached)
media_types.append(mimetype)
msg_type = MessageType.VOICE
except Exception as e:
print(f"[Slack] Failed to cache audio: {e}", flush=True)
# Build source
source = self.build_source(
chat_id=channel_id,
chat_name=channel_id, # Will be resolved later if needed
chat_type="dm" if is_dm else "group",
user_id=user_id,
thread_id=thread_ts,
)
msg_event = MessageEvent(
text=text,
message_type=msg_type,
source=source,
raw_message=event,
message_id=ts,
media_urls=media_urls,
media_types=media_types,
reply_to_message_id=thread_ts if thread_ts != ts else None,
)
await self.handle_message(msg_event)
async def _handle_slash_command(self, command: dict) -> None:
"""Handle /hermes slash command."""
text = command.get("text", "").strip()
user_id = command.get("user_id", "")
channel_id = command.get("channel_id", "")
# Map common slash subcommands to gateway commands
if text in ("new", "reset"):
text = "/reset"
elif text == "status":
text = "/status"
elif text == "stop":
text = "/stop"
elif text:
pass # Treat as a regular question
else:
text = "/help"
source = self.build_source(
chat_id=channel_id,
chat_type="dm", # Slash commands are always in DM-like context
user_id=user_id,
)
event = MessageEvent(
text=text,
message_type=MessageType.COMMAND if text.startswith("/") else MessageType.TEXT,
source=source,
raw_message=command,
)
await self.handle_message(event)
async def _download_slack_file(self, url: str, ext: str, audio: bool = False) -> str:
"""Download a Slack file using the bot token for auth."""
import httpx
bot_token = self.config.token
async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
response = await client.get(
url,
headers={"Authorization": f"Bearer {bot_token}"},
)
response.raise_for_status()
if audio:
from gateway.platforms.base import cache_audio_from_bytes
return cache_audio_from_bytes(response.content, ext)
else:
from gateway.platforms.base import cache_image_from_bytes
return cache_image_from_bytes(response.content, ext)
+484
View File
@@ -0,0 +1,484 @@
"""
Telegram platform adapter.
Uses python-telegram-bot library for:
- Receiving messages from users/groups
- Sending responses back
- Handling media and commands
"""
import asyncio
from typing import Dict, List, Optional, Any
try:
from telegram import Update, Bot, Message
from telegram.ext import (
Application,
CommandHandler,
MessageHandler as TelegramMessageHandler,
ContextTypes,
filters,
)
from telegram.constants import ParseMode, ChatType
TELEGRAM_AVAILABLE = True
except ImportError:
TELEGRAM_AVAILABLE = False
Update = Any
Bot = Any
Message = Any
Application = Any
ContextTypes = Any
import sys
sys.path.insert(0, str(__file__).rsplit("/", 3)[0])
from gateway.config import Platform, PlatformConfig
from gateway.platforms.base import (
BasePlatformAdapter,
MessageEvent,
MessageType,
SendResult,
cache_image_from_bytes,
cache_audio_from_bytes,
)
def check_telegram_requirements() -> bool:
"""Check if Telegram dependencies are available."""
return TELEGRAM_AVAILABLE
class TelegramAdapter(BasePlatformAdapter):
"""
Telegram bot adapter.
Handles:
- Receiving messages from users and groups
- Sending responses with Telegram markdown
- Forum topics (thread_id support)
- Media messages
"""
# Telegram message limits
MAX_MESSAGE_LENGTH = 4096
def __init__(self, config: PlatformConfig):
super().__init__(config, Platform.TELEGRAM)
self._app: Optional[Application] = None
self._bot: Optional[Bot] = None
async def connect(self) -> bool:
"""Connect to Telegram and start polling for updates."""
if not TELEGRAM_AVAILABLE:
print(f"[{self.name}] python-telegram-bot not installed. Run: pip install python-telegram-bot")
return False
if not self.config.token:
print(f"[{self.name}] No bot token configured")
return False
try:
# Build the application
self._app = Application.builder().token(self.config.token).build()
self._bot = self._app.bot
# Register handlers
self._app.add_handler(TelegramMessageHandler(
filters.TEXT & ~filters.COMMAND,
self._handle_text_message
))
self._app.add_handler(TelegramMessageHandler(
filters.COMMAND,
self._handle_command
))
self._app.add_handler(TelegramMessageHandler(
filters.PHOTO | filters.VIDEO | filters.AUDIO | filters.VOICE | filters.Document.ALL | filters.Sticker.ALL,
self._handle_media_message
))
# Start polling in background
await self._app.initialize()
await self._app.start()
await self._app.updater.start_polling(allowed_updates=Update.ALL_TYPES)
self._running = True
print(f"[{self.name}] Connected and polling for updates")
return True
except Exception as e:
print(f"[{self.name}] Failed to connect: {e}")
return False
async def disconnect(self) -> None:
"""Stop polling and disconnect."""
if self._app:
try:
await self._app.updater.stop()
await self._app.stop()
await self._app.shutdown()
except Exception as e:
print(f"[{self.name}] Error during disconnect: {e}")
self._running = False
self._app = None
self._bot = None
print(f"[{self.name}] Disconnected")
async def send(
self,
chat_id: str,
content: str,
reply_to: Optional[str] = None,
metadata: Optional[Dict[str, Any]] = None
) -> SendResult:
"""Send a message to a Telegram chat."""
if not self._bot:
return SendResult(success=False, error="Not connected")
try:
# Format and split message if needed
formatted = self.format_message(content)
chunks = self.truncate_message(formatted, self.MAX_MESSAGE_LENGTH)
message_ids = []
thread_id = metadata.get("thread_id") if metadata else None
for i, chunk in enumerate(chunks):
# Try Markdown first, fall back to plain text if it fails
try:
msg = await self._bot.send_message(
chat_id=int(chat_id),
text=chunk,
parse_mode=ParseMode.MARKDOWN,
reply_to_message_id=int(reply_to) if reply_to and i == 0 else None,
message_thread_id=int(thread_id) if thread_id else None,
)
except Exception as md_error:
# Markdown parsing failed, try plain text
if "parse" in str(md_error).lower() or "markdown" in str(md_error).lower():
msg = await self._bot.send_message(
chat_id=int(chat_id),
text=chunk,
parse_mode=None, # Plain text
reply_to_message_id=int(reply_to) if reply_to and i == 0 else None,
message_thread_id=int(thread_id) if thread_id else None,
)
else:
raise # Re-raise if not a parse error
message_ids.append(str(msg.message_id))
return SendResult(
success=True,
message_id=message_ids[0] if message_ids else None,
raw_response={"message_ids": message_ids}
)
except Exception as e:
return SendResult(success=False, error=str(e))
async def send_voice(
self,
chat_id: str,
audio_path: str,
caption: Optional[str] = None,
reply_to: Optional[str] = None,
) -> SendResult:
"""Send audio as a native Telegram voice message or audio file."""
if not self._bot:
return SendResult(success=False, error="Not connected")
try:
import os
if not os.path.exists(audio_path):
return SendResult(success=False, error=f"Audio file not found: {audio_path}")
with open(audio_path, "rb") as audio_file:
# .ogg files -> send as voice (round playable bubble)
if audio_path.endswith(".ogg") or audio_path.endswith(".opus"):
msg = await self._bot.send_voice(
chat_id=int(chat_id),
voice=audio_file,
caption=caption[:1024] if caption else None,
reply_to_message_id=int(reply_to) if reply_to else None,
)
else:
# .mp3 and others -> send as audio file
msg = await self._bot.send_audio(
chat_id=int(chat_id),
audio=audio_file,
caption=caption[:1024] if caption else None,
reply_to_message_id=int(reply_to) if reply_to else None,
)
return SendResult(success=True, message_id=str(msg.message_id))
except Exception as e:
print(f"[{self.name}] Failed to send voice/audio: {e}")
return await super().send_voice(chat_id, audio_path, caption, reply_to)
async def send_image(
self,
chat_id: str,
image_url: str,
caption: Optional[str] = None,
reply_to: Optional[str] = None,
) -> SendResult:
"""Send an image natively as a Telegram photo."""
if not self._bot:
return SendResult(success=False, error="Not connected")
try:
# Telegram can send photos directly from URLs
msg = await self._bot.send_photo(
chat_id=int(chat_id),
photo=image_url,
caption=caption[:1024] if caption else None, # Telegram caption limit
reply_to_message_id=int(reply_to) if reply_to else None,
)
return SendResult(success=True, message_id=str(msg.message_id))
except Exception as e:
print(f"[{self.name}] Failed to send photo, falling back to URL: {e}")
# Fallback: send as text link
return await super().send_image(chat_id, image_url, caption, reply_to)
async def send_typing(self, chat_id: str) -> None:
"""Send typing indicator."""
if self._bot:
try:
await self._bot.send_chat_action(
chat_id=int(chat_id),
action="typing"
)
except Exception:
pass # Ignore typing indicator failures
async def get_chat_info(self, chat_id: str) -> Dict[str, Any]:
"""Get information about a Telegram chat."""
if not self._bot:
return {"name": "Unknown", "type": "dm"}
try:
chat = await self._bot.get_chat(int(chat_id))
chat_type = "dm"
if chat.type == ChatType.GROUP:
chat_type = "group"
elif chat.type == ChatType.SUPERGROUP:
chat_type = "group"
if chat.is_forum:
chat_type = "forum"
elif chat.type == ChatType.CHANNEL:
chat_type = "channel"
return {
"name": chat.title or chat.full_name or str(chat_id),
"type": chat_type,
"username": chat.username,
"is_forum": getattr(chat, "is_forum", False),
}
except Exception as e:
return {"name": str(chat_id), "type": "dm", "error": str(e)}
def format_message(self, content: str) -> str:
"""
Format message for Telegram.
Telegram uses a subset of markdown. We'll use the simpler
Markdown mode (not MarkdownV2) for compatibility.
"""
# Basic escaping for Telegram Markdown
# In Markdown mode (not V2), only certain characters need escaping
return content
async def _handle_text_message(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
"""Handle incoming text messages."""
if not update.message or not update.message.text:
return
event = self._build_message_event(update.message, MessageType.TEXT)
await self.handle_message(event)
async def _handle_command(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
"""Handle incoming command messages."""
if not update.message or not update.message.text:
return
event = self._build_message_event(update.message, MessageType.COMMAND)
await self.handle_message(event)
async def _handle_media_message(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
"""Handle incoming media messages, downloading images to local cache."""
if not update.message:
return
msg = update.message
# Determine media type
if msg.sticker:
msg_type = MessageType.STICKER
elif msg.photo:
msg_type = MessageType.PHOTO
elif msg.video:
msg_type = MessageType.VIDEO
elif msg.audio:
msg_type = MessageType.AUDIO
elif msg.voice:
msg_type = MessageType.VOICE
else:
msg_type = MessageType.DOCUMENT
event = self._build_message_event(msg, msg_type)
# Add caption as text
if msg.caption:
event.text = msg.caption
# Handle stickers: describe via vision tool with caching
if msg.sticker:
await self._handle_sticker(msg, event)
await self.handle_message(event)
return
# Download photo to local image cache so the vision tool can access it
# even after Telegram's ephemeral file URLs expire (~1 hour).
if msg.photo:
try:
# msg.photo is a list of PhotoSize sorted by size; take the largest
photo = msg.photo[-1]
file_obj = await photo.get_file()
# Download the image bytes directly into memory
image_bytes = await file_obj.download_as_bytearray()
# Determine extension from the file path if available
ext = ".jpg"
if file_obj.file_path:
for candidate in [".png", ".webp", ".gif", ".jpeg", ".jpg"]:
if file_obj.file_path.lower().endswith(candidate):
ext = candidate
break
# Save to cache and populate media_urls with the local path
cached_path = cache_image_from_bytes(bytes(image_bytes), ext=ext)
event.media_urls = [cached_path]
event.media_types = [f"image/{ext.lstrip('.')}"]
print(f"[Telegram] Cached user photo: {cached_path}", flush=True)
except Exception as e:
print(f"[Telegram] Failed to cache photo: {e}", flush=True)
# Download voice/audio messages to cache for STT transcription
if msg.voice:
try:
file_obj = await msg.voice.get_file()
audio_bytes = await file_obj.download_as_bytearray()
cached_path = cache_audio_from_bytes(bytes(audio_bytes), ext=".ogg")
event.media_urls = [cached_path]
event.media_types = ["audio/ogg"]
print(f"[Telegram] Cached user voice: {cached_path}", flush=True)
except Exception as e:
print(f"[Telegram] Failed to cache voice: {e}", flush=True)
elif msg.audio:
try:
file_obj = await msg.audio.get_file()
audio_bytes = await file_obj.download_as_bytearray()
cached_path = cache_audio_from_bytes(bytes(audio_bytes), ext=".mp3")
event.media_urls = [cached_path]
event.media_types = ["audio/mp3"]
print(f"[Telegram] Cached user audio: {cached_path}", flush=True)
except Exception as e:
print(f"[Telegram] Failed to cache audio: {e}", flush=True)
await self.handle_message(event)
async def _handle_sticker(self, msg: Message, event: "MessageEvent") -> None:
"""
Describe a Telegram sticker via vision analysis, with caching.
For static stickers (WEBP), we download, analyze with vision, and cache
the description by file_unique_id. For animated/video stickers, we inject
a placeholder noting the emoji.
"""
from gateway.sticker_cache import (
get_cached_description,
cache_sticker_description,
build_sticker_injection,
build_animated_sticker_injection,
STICKER_VISION_PROMPT,
)
sticker = msg.sticker
emoji = sticker.emoji or ""
set_name = sticker.set_name or ""
# Animated and video stickers can't be analyzed as static images
if sticker.is_animated or sticker.is_video:
event.text = build_animated_sticker_injection(emoji)
return
# Check the cache first
cached = get_cached_description(sticker.file_unique_id)
if cached:
event.text = build_sticker_injection(
cached["description"], cached.get("emoji", emoji), cached.get("set_name", set_name)
)
print(f"[Telegram] Sticker cache hit: {sticker.file_unique_id}", flush=True)
return
# Cache miss -- download and analyze
try:
file_obj = await sticker.get_file()
image_bytes = await file_obj.download_as_bytearray()
cached_path = cache_image_from_bytes(bytes(image_bytes), ext=".webp")
print(f"[Telegram] Analyzing sticker: {cached_path}", flush=True)
from tools.vision_tools import vision_analyze_tool
import json as _json
result_json = await vision_analyze_tool(
image_url=cached_path,
user_prompt=STICKER_VISION_PROMPT,
)
result = _json.loads(result_json)
if result.get("success"):
description = result.get("analysis", "a sticker")
cache_sticker_description(sticker.file_unique_id, description, emoji, set_name)
event.text = build_sticker_injection(description, emoji, set_name)
else:
# Vision failed -- use emoji as fallback
event.text = build_sticker_injection(
f"a sticker with emoji {emoji}" if emoji else "a sticker",
emoji, set_name,
)
except Exception as e:
print(f"[Telegram] Sticker analysis error: {e}", flush=True)
event.text = build_sticker_injection(
f"a sticker with emoji {emoji}" if emoji else "a sticker",
emoji, set_name,
)
def _build_message_event(self, message: Message, msg_type: MessageType) -> MessageEvent:
"""Build a MessageEvent from a Telegram message."""
chat = message.chat
user = message.from_user
# Determine chat type
chat_type = "dm"
if chat.type in (ChatType.GROUP, ChatType.SUPERGROUP):
chat_type = "group"
elif chat.type == ChatType.CHANNEL:
chat_type = "channel"
# Build source
source = self.build_source(
chat_id=str(chat.id),
chat_name=chat.title or (chat.full_name if hasattr(chat, "full_name") else None),
chat_type=chat_type,
user_id=str(user.id) if user else None,
user_name=user.full_name if user else None,
thread_id=str(message.message_thread_id) if message.message_thread_id else None,
)
return MessageEvent(
text=message.text or "",
message_type=msg_type,
source=source,
raw_message=message,
message_id=str(message.message_id),
timestamp=message.date,
)

Some files were not shown because too many files have changed in this diff Show More