Compare commits

..

1 Commits

Author SHA1 Message Date
Karan c0271f73f6 feat: add WorldSim — OSINT-powered personality simulation skill
Rehoboam-class worldsim. Immersive CLI personality simulator that
researches real people via 25+ verified platform access methods,
builds 6-layer psychometric profiles, finds star threads (personality
compression keys), and generates platform-authentic simulated
conversations with mechanical verification and adversarial refinement.

26 files | 38K words | 2,283 lines Python

- Immersive CLI interface (worldsim> prompt, no assistant framing)
- OSINT pipeline: X API, Instagram private API, Bluesky, TikTok,
  Facebook, Threads, Mastodon, Reddit, GitHub, HN, Medium, Quora,
  Goodreads, Google Scholar, Crunchbase, podcasts, news/blogs
- Star thread: one-sentence personality compression key per person
- Deep psychometrics: Big Five + Moral Foundations + Schwartz Values
  + Cognitive Style + Narrative Framing + Behavioral Metadata
- Anti-slop: mechanical detection of LLM writing patterns
- GAN-style adversarial refinement loop with mechanical verification
- Recursive self-improvement: learned rules grow with each simulation
- Rehoboam persistence: SQLite + filesystem for profiles, predictions,
  social graph, knowledge archives
- GEPA/MIPROv2 self-evolution integration tested and working
- Knowledge archive: per-person source library with citations and
  semantic retrieval for context-aware grounding

Co-authored-by: Hermes Agent <hermes@nousresearch.com>
2026-04-08 13:46:20 -04:00
146 changed files with 10537 additions and 8700 deletions
-8
View File
@@ -81,14 +81,6 @@
# HF_TOKEN=
# OPENCODE_GO_BASE_URL=https://opencode.ai/zen/go/v1 # Override default base URL
# =============================================================================
# LLM PROVIDER (Qwen OAuth)
# =============================================================================
# Qwen OAuth reuses your local Qwen CLI login (qwen auth qwen-oauth).
# No API key needed — credentials come from ~/.qwen/oauth_creds.json.
# Optional base URL override:
# HERMES_QWEN_BASE_URL=https://portal.qwen.ai/v1
# =============================================================================
# TOOL API KEYS
# =============================================================================
-346
View File
@@ -1,346 +0,0 @@
# Hermes Agent v0.8.0 (v2026.4.8)
**Release Date:** April 8, 2026
> The intelligence release — background task auto-notifications, free MiMo v2 Pro on Nous Portal, live model switching across all platforms, self-optimized GPT/Codex guidance, native Google AI Studio, smart inactivity timeouts, approval buttons, MCP OAuth 2.1, and 209 merged PRs with 82 resolved issues.
---
## ✨ Highlights
- **Background Process Auto-Notifications (`notify_on_complete`)** — Background tasks can now automatically notify the agent when they finish. Start a long-running process (AI model training, test suites, deployments, builds) and the agent gets notified on completion — no polling needed. The agent can keep working on other things and pick up results when they land. ([#5779](https://github.com/NousResearch/hermes-agent/pull/5779))
- **Free Xiaomi MiMo v2 Pro on Nous Portal** — Nous Portal now supports the free-tier Xiaomi MiMo v2 Pro model for auxiliary tasks (compression, vision, summarization), with free-tier model gating and pricing display in model selection. ([#6018](https://github.com/NousResearch/hermes-agent/pull/6018), [#5880](https://github.com/NousResearch/hermes-agent/pull/5880))
- **Live Model Switching (`/model` Command)** — Switch models and providers mid-session from CLI, Telegram, Discord, Slack, or any gateway platform. Aggregator-aware resolution keeps you on OpenRouter/Nous when possible, with automatic cross-provider fallback when needed. Interactive model pickers on Telegram and Discord with inline buttons. ([#5181](https://github.com/NousResearch/hermes-agent/pull/5181), [#5742](https://github.com/NousResearch/hermes-agent/pull/5742))
- **Self-Optimized GPT/Codex Tool-Use Guidance** — The agent diagnosed and patched 5 failure modes in GPT and Codex tool calling through automated behavioral benchmarking, dramatically improving reliability on OpenAI models. Includes execution discipline guidance and thinking-only prefill continuation for structured reasoning. ([#6120](https://github.com/NousResearch/hermes-agent/pull/6120), [#5414](https://github.com/NousResearch/hermes-agent/pull/5414), [#5931](https://github.com/NousResearch/hermes-agent/pull/5931))
- **Google AI Studio (Gemini) Native Provider** — Direct access to Gemini models through Google's AI Studio API. Includes automatic models.dev registry integration for real-time context length detection across any provider. ([#5577](https://github.com/NousResearch/hermes-agent/pull/5577))
- **Inactivity-Based Agent Timeouts** — Gateway and cron timeouts now track actual tool activity instead of wall-clock time. Long-running tasks that are actively working will never be killed — only truly idle agents time out. ([#5389](https://github.com/NousResearch/hermes-agent/pull/5389), [#5440](https://github.com/NousResearch/hermes-agent/pull/5440))
- **Approval Buttons on Slack & Telegram** — Dangerous command approval via native platform buttons instead of typing `/approve`. Slack gets thread context preservation; Telegram gets emoji reactions for approval status. ([#5890](https://github.com/NousResearch/hermes-agent/pull/5890), [#5975](https://github.com/NousResearch/hermes-agent/pull/5975))
- **MCP OAuth 2.1 PKCE + OSV Malware Scanning** — Full standards-compliant OAuth for MCP server authentication, plus automatic malware scanning of MCP extension packages via the OSV vulnerability database. ([#5420](https://github.com/NousResearch/hermes-agent/pull/5420), [#5305](https://github.com/NousResearch/hermes-agent/pull/5305))
- **Centralized Logging & Config Validation** — Structured logging to `~/.hermes/logs/` (agent.log + errors.log) with the `hermes logs` command for tailing and filtering. Config structure validation catches malformed YAML at startup before it causes cryptic failures. ([#5430](https://github.com/NousResearch/hermes-agent/pull/5430), [#5426](https://github.com/NousResearch/hermes-agent/pull/5426))
- **Plugin System Expansion** — Plugins can now register CLI subcommands, receive request-scoped API hooks with correlation IDs, prompt for required env vars during install, and hook into session lifecycle events (finalize/reset). ([#5295](https://github.com/NousResearch/hermes-agent/pull/5295), [#5427](https://github.com/NousResearch/hermes-agent/pull/5427), [#5470](https://github.com/NousResearch/hermes-agent/pull/5470), [#6129](https://github.com/NousResearch/hermes-agent/pull/6129))
- **Matrix Tier 1 & Platform Hardening** — Matrix gets reactions, read receipts, rich formatting, and room management. Discord adds channel controls and ignored channels. Signal gets full MEDIA: tag delivery. Mattermost gets file attachments. Comprehensive reliability fixes across all platforms. ([#5275](https://github.com/NousResearch/hermes-agent/pull/5275), [#5975](https://github.com/NousResearch/hermes-agent/pull/5975), [#5602](https://github.com/NousResearch/hermes-agent/pull/5602))
- **Security Hardening Pass** — Consolidated SSRF protections, timing attack mitigations, tar traversal prevention, credential leakage guards, cron path traversal hardening, and cross-session isolation. Terminal workdir sanitization across all backends. ([#5944](https://github.com/NousResearch/hermes-agent/pull/5944), [#5613](https://github.com/NousResearch/hermes-agent/pull/5613), [#5629](https://github.com/NousResearch/hermes-agent/pull/5629))
---
## 🏗️ Core Agent & Architecture
### Provider & Model Support
- **Native Google AI Studio (Gemini) provider** with models.dev integration for automatic context length detection ([#5577](https://github.com/NousResearch/hermes-agent/pull/5577))
- **`/model` command — full provider+model system overhaul** — live switching across CLI and all gateway platforms with aggregator-aware resolution ([#5181](https://github.com/NousResearch/hermes-agent/pull/5181))
- **Interactive model picker for Telegram and Discord** — inline button-based model selection ([#5742](https://github.com/NousResearch/hermes-agent/pull/5742))
- **Nous Portal free-tier model gating** with pricing display in model selection ([#5880](https://github.com/NousResearch/hermes-agent/pull/5880))
- **Model pricing display** for OpenRouter and Nous Portal providers ([#5416](https://github.com/NousResearch/hermes-agent/pull/5416))
- **xAI (Grok) prompt caching** via `x-grok-conv-id` header ([#5604](https://github.com/NousResearch/hermes-agent/pull/5604))
- **Grok added to tool-use enforcement models** for direct xAI usage ([#5595](https://github.com/NousResearch/hermes-agent/pull/5595))
- **MiniMax TTS provider** (speech-2.8) ([#4963](https://github.com/NousResearch/hermes-agent/pull/4963))
- **Non-agentic model warning** — warns users when loading Hermes LLM models not designed for tool use ([#5378](https://github.com/NousResearch/hermes-agent/pull/5378))
- **Ollama Cloud auth, /model switch persistence**, and alias tab completion ([#5269](https://github.com/NousResearch/hermes-agent/pull/5269))
- **Preserve dots in OpenCode Go model names** (minimax-m2.7, glm-4.5, kimi-k2.5) ([#5597](https://github.com/NousResearch/hermes-agent/pull/5597))
- **MiniMax models 404 fix** — strip /v1 from Anthropic base URL for OpenCode Go ([#4918](https://github.com/NousResearch/hermes-agent/pull/4918))
- **Provider credential reset windows** honored in pooled failover ([#5188](https://github.com/NousResearch/hermes-agent/pull/5188))
- **OAuth token sync** between credential pool and credentials file ([#4981](https://github.com/NousResearch/hermes-agent/pull/4981))
- **Stale OAuth credentials** no longer block OpenRouter users on auto-detect ([#5746](https://github.com/NousResearch/hermes-agent/pull/5746))
- **Codex OAuth credential pool disconnect** + expired token import fix ([#5681](https://github.com/NousResearch/hermes-agent/pull/5681))
- **Codex pool entry sync** from `~/.codex/auth.json` on exhaustion — @GratefulDave ([#5610](https://github.com/NousResearch/hermes-agent/pull/5610))
- **Auxiliary client payment fallback** — retry with next provider on 402 ([#5599](https://github.com/NousResearch/hermes-agent/pull/5599))
- **Auxiliary client resolves named custom providers** and 'main' alias ([#5978](https://github.com/NousResearch/hermes-agent/pull/5978))
- **Use mimo-v2-pro** for non-vision auxiliary tasks on Nous free tier ([#6018](https://github.com/NousResearch/hermes-agent/pull/6018))
- **Vision auto-detection** tries main provider first ([#6041](https://github.com/NousResearch/hermes-agent/pull/6041))
- **Provider re-ordering and Quick Install** — @austinpickett ([#4664](https://github.com/NousResearch/hermes-agent/pull/4664))
- **Nous OAuth access_token** no longer used as inference API key — @SHL0MS ([#5564](https://github.com/NousResearch/hermes-agent/pull/5564))
- **HERMES_PORTAL_BASE_URL env var** respected during Nous login — @benbarclay ([#5745](https://github.com/NousResearch/hermes-agent/pull/5745))
- **Env var overrides** for Nous portal/inference URLs ([#5419](https://github.com/NousResearch/hermes-agent/pull/5419))
- **Z.AI endpoint auto-detect** via probe and cache ([#5763](https://github.com/NousResearch/hermes-agent/pull/5763))
- **MiniMax context lengths, model catalog, thinking guard, aux model, and config base_url** corrections ([#6082](https://github.com/NousResearch/hermes-agent/pull/6082))
- **Community provider/model resolution fixes** — salvaged 4 community PRs + MiniMax aux URL ([#5983](https://github.com/NousResearch/hermes-agent/pull/5983))
### Agent Loop & Conversation
- **Self-optimized GPT/Codex tool-use guidance** via automated behavioral benchmarking — agent self-diagnosed and patched 5 failure modes ([#6120](https://github.com/NousResearch/hermes-agent/pull/6120))
- **GPT/Codex execution discipline guidance** in system prompts ([#5414](https://github.com/NousResearch/hermes-agent/pull/5414))
- **Thinking-only prefill continuation** for structured reasoning responses ([#5931](https://github.com/NousResearch/hermes-agent/pull/5931))
- **Accept reasoning-only responses** without retries — set content to "(empty)" instead of infinite retry ([#5278](https://github.com/NousResearch/hermes-agent/pull/5278))
- **Jittered retry backoff** — exponential backoff with jitter for API retries ([#6048](https://github.com/NousResearch/hermes-agent/pull/6048))
- **Smart thinking block signature management** — preserve and manage Anthropic thinking signatures across turns ([#6112](https://github.com/NousResearch/hermes-agent/pull/6112))
- **Coerce tool call arguments** to match JSON Schema types — fixes models that send strings instead of numbers/booleans ([#5265](https://github.com/NousResearch/hermes-agent/pull/5265))
- **Save oversized tool results to file** instead of destructive truncation ([#5210](https://github.com/NousResearch/hermes-agent/pull/5210))
- **Sandbox-aware tool result persistence** ([#6085](https://github.com/NousResearch/hermes-agent/pull/6085))
- **Streaming fallback** improved after edit failures ([#6110](https://github.com/NousResearch/hermes-agent/pull/6110))
- **Codex empty-output gaps** covered in fallback + normalizer + auxiliary client ([#5724](https://github.com/NousResearch/hermes-agent/pull/5724), [#5730](https://github.com/NousResearch/hermes-agent/pull/5730), [#5734](https://github.com/NousResearch/hermes-agent/pull/5734))
- **Codex stream output backfill** from output_item.done events ([#5689](https://github.com/NousResearch/hermes-agent/pull/5689))
- **Stream consumer creates new message** after tool boundaries ([#5739](https://github.com/NousResearch/hermes-agent/pull/5739))
- **Codex validation aligned** with normalization for empty stream output ([#5940](https://github.com/NousResearch/hermes-agent/pull/5940))
- **Bridge tool-calls** in copilot-acp adapter ([#5460](https://github.com/NousResearch/hermes-agent/pull/5460))
- **Filter transcript-only roles** from chat-completions payload ([#4880](https://github.com/NousResearch/hermes-agent/pull/4880))
- **Context compaction failures fixed** on temperature-restricted models — @MadKangYu ([#5608](https://github.com/NousResearch/hermes-agent/pull/5608))
- **Sanitize tool_calls for all strict APIs** (Fireworks, Mistral, etc.) — @lumethegreat ([#5183](https://github.com/NousResearch/hermes-agent/pull/5183))
### Memory & Sessions
- **Supermemory memory provider** — new memory plugin with multi-container, search_mode, identity template, and env var override ([#5737](https://github.com/NousResearch/hermes-agent/pull/5737), [#5933](https://github.com/NousResearch/hermes-agent/pull/5933))
- **Shared thread sessions** by default — multi-user thread support across gateway platforms ([#5391](https://github.com/NousResearch/hermes-agent/pull/5391))
- **Subagent sessions linked to parent** and hidden from session list ([#5309](https://github.com/NousResearch/hermes-agent/pull/5309))
- **Profile-scoped memory isolation** and clone support ([#4845](https://github.com/NousResearch/hermes-agent/pull/4845))
- **Thread gateway user_id to memory plugins** for per-user scoping ([#5895](https://github.com/NousResearch/hermes-agent/pull/5895))
- **Honcho plugin drift overhaul** + plugin CLI registration system ([#5295](https://github.com/NousResearch/hermes-agent/pull/5295))
- **Honcho holographic prompt and trust score** rendering preserved ([#4872](https://github.com/NousResearch/hermes-agent/pull/4872))
- **Honcho doctor fix** — use recall_mode instead of memory_mode — @techguysimon ([#5645](https://github.com/NousResearch/hermes-agent/pull/5645))
- **RetainDB** — API routes, write queue, dialectic, agent model, file tools fixes ([#5461](https://github.com/NousResearch/hermes-agent/pull/5461))
- **Hindsight memory plugin overhaul** + memory setup wizard fixes ([#5094](https://github.com/NousResearch/hermes-agent/pull/5094))
- **mem0 API v2 compat**, prefetch context fencing, secret redaction ([#5423](https://github.com/NousResearch/hermes-agent/pull/5423))
- **mem0 env vars merged** with mem0.json instead of either/or ([#4939](https://github.com/NousResearch/hermes-agent/pull/4939))
- **Clean user message** used for all memory provider operations ([#4940](https://github.com/NousResearch/hermes-agent/pull/4940))
- **Silent memory flush failure** on /new and /resume fixed — @ryanautomated ([#5640](https://github.com/NousResearch/hermes-agent/pull/5640))
- **OpenViking atexit safety net** for session commit ([#5664](https://github.com/NousResearch/hermes-agent/pull/5664))
- **OpenViking tenant-scoping headers** for multi-tenant servers ([#4936](https://github.com/NousResearch/hermes-agent/pull/4936))
- **ByteRover brv query** runs synchronously before LLM call ([#4831](https://github.com/NousResearch/hermes-agent/pull/4831))
---
## 📱 Messaging Platforms (Gateway)
### Gateway Core
- **Inactivity-based agent timeout** — replaces wall-clock timeout with smart activity tracking; long-running active tasks never killed ([#5389](https://github.com/NousResearch/hermes-agent/pull/5389))
- **Approval buttons for Slack & Telegram** + Slack thread context preservation ([#5890](https://github.com/NousResearch/hermes-agent/pull/5890))
- **Live-stream /update output** + forward interactive prompts to user ([#5180](https://github.com/NousResearch/hermes-agent/pull/5180))
- **Infinite timeout support** + periodic notifications + actionable error messages ([#4959](https://github.com/NousResearch/hermes-agent/pull/4959))
- **Duplicate message prevention** — gateway dedup + partial stream guard ([#4878](https://github.com/NousResearch/hermes-agent/pull/4878))
- **Webhook delivery_info persistence** + full session id in /status ([#5942](https://github.com/NousResearch/hermes-agent/pull/5942))
- **Tool preview truncation** respects tool_preview_length in all/new progress modes ([#5937](https://github.com/NousResearch/hermes-agent/pull/5937))
- **Short preview truncation** restored for all/new tool progress modes ([#4935](https://github.com/NousResearch/hermes-agent/pull/4935))
- **Update-pending state** written atomically to prevent corruption ([#4923](https://github.com/NousResearch/hermes-agent/pull/4923))
- **Approval session key isolated** per turn ([#4884](https://github.com/NousResearch/hermes-agent/pull/4884))
- **Active-session guard bypass** for /approve, /deny, /stop, /new ([#4926](https://github.com/NousResearch/hermes-agent/pull/4926), [#5765](https://github.com/NousResearch/hermes-agent/pull/5765))
- **Typing indicator paused** during approval waits ([#5893](https://github.com/NousResearch/hermes-agent/pull/5893))
- **Caption check** uses exact line-by-line match instead of substring (all platforms) ([#5939](https://github.com/NousResearch/hermes-agent/pull/5939))
- **MEDIA: tags stripped** from streamed gateway messages ([#5152](https://github.com/NousResearch/hermes-agent/pull/5152))
- **MEDIA: tags extracted** from cron delivery before sending ([#5598](https://github.com/NousResearch/hermes-agent/pull/5598))
- **Profile-aware service units** + voice transcription cleanup ([#5972](https://github.com/NousResearch/hermes-agent/pull/5972))
- **Thread-safe PairingStore** with atomic writes — @CharlieKerfoot ([#5656](https://github.com/NousResearch/hermes-agent/pull/5656))
- **Sanitize media URLs** in base platform logs — @WAXLYY ([#5631](https://github.com/NousResearch/hermes-agent/pull/5631))
- **Reduce Telegram fallback IP activation log noise** — @MadKangYu ([#5615](https://github.com/NousResearch/hermes-agent/pull/5615))
- **Cron static method wrappers** to prevent self-binding ([#5299](https://github.com/NousResearch/hermes-agent/pull/5299))
- **Stale 'hermes login' replaced** with 'hermes auth' + credential removal re-seeding fix ([#5670](https://github.com/NousResearch/hermes-agent/pull/5670))
### Telegram
- **Group topics skill binding** for supergroup forum topics ([#4886](https://github.com/NousResearch/hermes-agent/pull/4886))
- **Emoji reactions** for approval status and notifications ([#5975](https://github.com/NousResearch/hermes-agent/pull/5975))
- **Duplicate message delivery prevented** on send timeout ([#5153](https://github.com/NousResearch/hermes-agent/pull/5153))
- **Command names sanitized** to strip invalid characters ([#5596](https://github.com/NousResearch/hermes-agent/pull/5596))
- **Per-platform disabled skills** respected in Telegram menu and gateway dispatch ([#4799](https://github.com/NousResearch/hermes-agent/pull/4799))
- **/approve and /deny** routed through running-agent guard ([#4798](https://github.com/NousResearch/hermes-agent/pull/4798))
### Discord
- **Channel controls** — ignored_channels and no_thread_channels config options ([#5975](https://github.com/NousResearch/hermes-agent/pull/5975))
- **Skills registered as native slash commands** via shared gateway logic ([#5603](https://github.com/NousResearch/hermes-agent/pull/5603))
- **/approve, /deny, /queue, /background, /btw** registered as native slash commands ([#4800](https://github.com/NousResearch/hermes-agent/pull/4800), [#5477](https://github.com/NousResearch/hermes-agent/pull/5477))
- **Unnecessary members intent** removed on startup + token lock leak fix ([#5302](https://github.com/NousResearch/hermes-agent/pull/5302))
### Slack
- **Thread engagement** — auto-respond in bot-started and mentioned threads ([#5897](https://github.com/NousResearch/hermes-agent/pull/5897))
- **mrkdwn in edit_message** + thread replies without @mentions ([#5733](https://github.com/NousResearch/hermes-agent/pull/5733))
### Matrix
- **Tier 1 feature parity** — reactions, read receipts, rich formatting, room management ([#5275](https://github.com/NousResearch/hermes-agent/pull/5275))
- **MATRIX_REQUIRE_MENTION and MATRIX_AUTO_THREAD** support ([#5106](https://github.com/NousResearch/hermes-agent/pull/5106))
- **Comprehensive reliability** — encrypted media, auth recovery, cron E2EE, Synapse compat ([#5271](https://github.com/NousResearch/hermes-agent/pull/5271))
- **CJK input, E2EE, and reconnect** fixes ([#5665](https://github.com/NousResearch/hermes-agent/pull/5665))
### Signal
- **Full MEDIA: tag delivery** — send_image_file, send_voice, and send_video implemented ([#5602](https://github.com/NousResearch/hermes-agent/pull/5602))
### Mattermost
- **File attachments** — set message type to DOCUMENT when post has file attachments — @nericervin ([#5609](https://github.com/NousResearch/hermes-agent/pull/5609))
### Feishu
- **Interactive card approval buttons** ([#6043](https://github.com/NousResearch/hermes-agent/pull/6043))
- **Reconnect and ACL** fixes ([#5665](https://github.com/NousResearch/hermes-agent/pull/5665))
### Webhooks
- **`{__raw__}` template token** and thread_id passthrough for forum topics ([#5662](https://github.com/NousResearch/hermes-agent/pull/5662))
---
## 🖥️ CLI & User Experience
### Interactive CLI
- **Defer response content** until reasoning block completes ([#5773](https://github.com/NousResearch/hermes-agent/pull/5773))
- **Ghost status-bar lines cleared** on terminal resize ([#4960](https://github.com/NousResearch/hermes-agent/pull/4960))
- **Normalise \r\n and \r line endings** in pasted text ([#4849](https://github.com/NousResearch/hermes-agent/pull/4849))
- **ChatConsole errors, curses scroll, skin-aware banner, git state** banner fixes ([#5974](https://github.com/NousResearch/hermes-agent/pull/5974))
- **Native Windows image paste** support ([#5917](https://github.com/NousResearch/hermes-agent/pull/5917))
- **--yolo and other flags** no longer silently dropped when placed before 'chat' subcommand ([#5145](https://github.com/NousResearch/hermes-agent/pull/5145))
### Setup & Configuration
- **Config structure validation** — detect malformed YAML at startup with actionable error messages ([#5426](https://github.com/NousResearch/hermes-agent/pull/5426))
- **Centralized logging** to `~/.hermes/logs/` — agent.log (INFO+), errors.log (WARNING+) with `hermes logs` command ([#5430](https://github.com/NousResearch/hermes-agent/pull/5430))
- **Docs links added** to setup wizard sections ([#5283](https://github.com/NousResearch/hermes-agent/pull/5283))
- **Doctor diagnostics** — sync provider checks, config migration, WAL and mem0 diagnostics ([#5077](https://github.com/NousResearch/hermes-agent/pull/5077))
- **Timeout debug logging** and user-facing diagnostics improved ([#5370](https://github.com/NousResearch/hermes-agent/pull/5370))
- **Reasoning effort unified** to config.yaml only ([#6118](https://github.com/NousResearch/hermes-agent/pull/6118))
- **Permanent command allowlist** loaded on startup ([#5076](https://github.com/NousResearch/hermes-agent/pull/5076))
- **`hermes auth remove`** now clears env-seeded credentials permanently ([#5285](https://github.com/NousResearch/hermes-agent/pull/5285))
- **Bundled skills synced to all profiles** during update ([#5795](https://github.com/NousResearch/hermes-agent/pull/5795))
- **`hermes update` no longer kills** freshly-restarted gateway service ([#5448](https://github.com/NousResearch/hermes-agent/pull/5448))
- **Subprocess.run() timeouts** added to all gateway CLI commands ([#5424](https://github.com/NousResearch/hermes-agent/pull/5424))
- **Actionable error message** when Codex refresh token is reused — @tymrtn ([#5612](https://github.com/NousResearch/hermes-agent/pull/5612))
- **Google-workspace skill scripts** can now run directly — @xinbenlv ([#5624](https://github.com/NousResearch/hermes-agent/pull/5624))
### Cron System
- **Inactivity-based cron timeout** — replaces wall-clock; active tasks run indefinitely ([#5440](https://github.com/NousResearch/hermes-agent/pull/5440))
- **Pre-run script injection** for data collection and change detection ([#5082](https://github.com/NousResearch/hermes-agent/pull/5082))
- **Delivery failure tracking** in job status ([#6042](https://github.com/NousResearch/hermes-agent/pull/6042))
- **Delivery guidance** in cron prompts — stops send_message thrashing ([#5444](https://github.com/NousResearch/hermes-agent/pull/5444))
- **MEDIA files delivered** as native platform attachments ([#5921](https://github.com/NousResearch/hermes-agent/pull/5921))
- **[SILENT] suppression** works anywhere in response — @auspic7 ([#5654](https://github.com/NousResearch/hermes-agent/pull/5654))
- **Cron path traversal** hardening ([#5147](https://github.com/NousResearch/hermes-agent/pull/5147))
---
## 🔧 Tool System
### Terminal & Execution
- **Execute_code on remote backends** — code execution now works on Docker, SSH, Modal, and other remote terminal backends ([#5088](https://github.com/NousResearch/hermes-agent/pull/5088))
- **Exit code context** for common CLI tools in terminal results — helps agent understand what went wrong ([#5144](https://github.com/NousResearch/hermes-agent/pull/5144))
- **Progressive subdirectory hint discovery** — agent learns project structure as it navigates ([#5291](https://github.com/NousResearch/hermes-agent/pull/5291))
- **notify_on_complete for background processes** — get notified when long-running tasks finish ([#5779](https://github.com/NousResearch/hermes-agent/pull/5779))
- **Docker env config** — explicit container environment variables via docker_env config ([#4738](https://github.com/NousResearch/hermes-agent/pull/4738))
- **Approval metadata included** in terminal tool results ([#5141](https://github.com/NousResearch/hermes-agent/pull/5141))
- **Workdir parameter sanitized** in terminal tool across all backends ([#5629](https://github.com/NousResearch/hermes-agent/pull/5629))
- **Detached process crash recovery** state corrected ([#6101](https://github.com/NousResearch/hermes-agent/pull/6101))
- **Agent-browser paths with spaces** preserved — @Vasanthdev2004 ([#6077](https://github.com/NousResearch/hermes-agent/pull/6077))
- **Portable base64 encoding** for image reading on macOS — @CharlieKerfoot ([#5657](https://github.com/NousResearch/hermes-agent/pull/5657))
### Browser
- **Switch managed browser provider** from Browserbase to Browser Use — @benbarclay ([#5750](https://github.com/NousResearch/hermes-agent/pull/5750))
- **Firecrawl cloud browser** provider — @alt-glitch ([#5628](https://github.com/NousResearch/hermes-agent/pull/5628))
- **JS evaluation** via browser_console expression parameter ([#5303](https://github.com/NousResearch/hermes-agent/pull/5303))
- **Windows browser** fixes ([#5665](https://github.com/NousResearch/hermes-agent/pull/5665))
### MCP
- **MCP OAuth 2.1 PKCE** — full standards-compliant OAuth client support ([#5420](https://github.com/NousResearch/hermes-agent/pull/5420))
- **OSV malware check** for MCP extension packages ([#5305](https://github.com/NousResearch/hermes-agent/pull/5305))
- **Prefer structuredContent over text** + no_mcp sentinel ([#5979](https://github.com/NousResearch/hermes-agent/pull/5979))
- **Unknown toolsets warning suppressed** for MCP server names ([#5279](https://github.com/NousResearch/hermes-agent/pull/5279))
### Web & Files
- **.zip document support** + auto-mount cache dirs into remote backends ([#4846](https://github.com/NousResearch/hermes-agent/pull/4846))
- **Redact query secrets** in send_message errors — @WAXLYY ([#5650](https://github.com/NousResearch/hermes-agent/pull/5650))
### Delegation
- **Credential pool sharing** + workspace path hints for subagents ([#5748](https://github.com/NousResearch/hermes-agent/pull/5748))
### ACP (VS Code / Zed / JetBrains)
- **Aggregate ACP improvements** — auth compat, protocol fixes, command ads, delegation, SSE events ([#5292](https://github.com/NousResearch/hermes-agent/pull/5292))
---
## 🧩 Skills Ecosystem
### Skills System
- **Skill config interface** — skills can declare required config.yaml settings, prompted during setup, injected at load time ([#5635](https://github.com/NousResearch/hermes-agent/pull/5635))
- **Plugin CLI registration system** — plugins register their own CLI subcommands without touching main.py ([#5295](https://github.com/NousResearch/hermes-agent/pull/5295))
- **Request-scoped API hooks** with tool call correlation IDs for plugins ([#5427](https://github.com/NousResearch/hermes-agent/pull/5427))
- **Session lifecycle hooks** — on_session_finalize and on_session_reset for CLI + gateway ([#6129](https://github.com/NousResearch/hermes-agent/pull/6129))
- **Prompt for required env vars** during plugin install — @kshitijk4poor ([#5470](https://github.com/NousResearch/hermes-agent/pull/5470))
- **Plugin name validation** — reject names that resolve to plugins root ([#5368](https://github.com/NousResearch/hermes-agent/pull/5368))
- **pre_llm_call plugin context** moved to user message to preserve prompt cache ([#5146](https://github.com/NousResearch/hermes-agent/pull/5146))
### New & Updated Skills
- **popular-web-designs** — 54 production website design systems ([#5194](https://github.com/NousResearch/hermes-agent/pull/5194))
- **p5js creative coding** — @SHL0MS ([#5600](https://github.com/NousResearch/hermes-agent/pull/5600))
- **manim-video** — mathematical and technical animations — @SHL0MS ([#4930](https://github.com/NousResearch/hermes-agent/pull/4930))
- **llm-wiki** — Karpathy's LLM Wiki skill ([#5635](https://github.com/NousResearch/hermes-agent/pull/5635))
- **gitnexus-explorer** — codebase indexing and knowledge serving ([#5208](https://github.com/NousResearch/hermes-agent/pull/5208))
- **research-paper-writing** — AI-Scientist & GPT-Researcher patterns — @SHL0MS ([#5421](https://github.com/NousResearch/hermes-agent/pull/5421))
- **blogwatcher** updated to JulienTant's fork ([#5759](https://github.com/NousResearch/hermes-agent/pull/5759))
- **claude-code skill** comprehensive rewrite v2.0 + v2.2 ([#5155](https://github.com/NousResearch/hermes-agent/pull/5155), [#5158](https://github.com/NousResearch/hermes-agent/pull/5158))
- **Code verification skills** consolidated into one ([#4854](https://github.com/NousResearch/hermes-agent/pull/4854))
- **Manim CE reference docs** expanded — geometry, animations, LaTeX — @leotrs ([#5791](https://github.com/NousResearch/hermes-agent/pull/5791))
- **Manim-video references** — design thinking, updaters, paper explainer, decorations, production quality — @SHL0MS ([#5588](https://github.com/NousResearch/hermes-agent/pull/5588), [#5408](https://github.com/NousResearch/hermes-agent/pull/5408))
---
## 🔒 Security & Reliability
### Security Hardening
- **Consolidated security** — SSRF protections, timing attack mitigations, tar traversal prevention, credential leakage guards ([#5944](https://github.com/NousResearch/hermes-agent/pull/5944))
- **Cross-session isolation** + cron path traversal hardening ([#5613](https://github.com/NousResearch/hermes-agent/pull/5613))
- **Workdir parameter sanitized** in terminal tool across all backends ([#5629](https://github.com/NousResearch/hermes-agent/pull/5629))
- **Approval 'once' session escalation** prevented + cron delivery platform validation ([#5280](https://github.com/NousResearch/hermes-agent/pull/5280))
- **Profile-scoped Google Workspace OAuth tokens** protected ([#4910](https://github.com/NousResearch/hermes-agent/pull/4910))
### Reliability
- **Aggressive worktree and branch cleanup** to prevent accumulation ([#6134](https://github.com/NousResearch/hermes-agent/pull/6134))
- **O(n²) catastrophic backtracking** in redact regex fixed — 100x improvement on large outputs ([#4962](https://github.com/NousResearch/hermes-agent/pull/4962))
- **Runtime stability fixes** across core, web, delegate, and browser tools ([#4843](https://github.com/NousResearch/hermes-agent/pull/4843))
- **API server streaming fix** + conversation history support ([#5977](https://github.com/NousResearch/hermes-agent/pull/5977))
- **OpenViking API endpoint paths** and response parsing corrected ([#5078](https://github.com/NousResearch/hermes-agent/pull/5078))
---
## 🐛 Notable Bug Fixes
- **9 community bugfixes salvaged** — gateway, cron, deps, macOS launchd in one batch ([#5288](https://github.com/NousResearch/hermes-agent/pull/5288))
- **Batch core bug fixes** — model config, session reset, alias fallback, launchctl, delegation, atomic writes ([#5630](https://github.com/NousResearch/hermes-agent/pull/5630))
- **Batch gateway/platform fixes** — matrix E2EE, CJK input, Windows browser, Feishu reconnect + ACL ([#5665](https://github.com/NousResearch/hermes-agent/pull/5665))
- **Stale test skips removed**, regex backtracking, file search bug, and test flakiness ([#4969](https://github.com/NousResearch/hermes-agent/pull/4969))
- **Nix flake** — read version, regen uv.lock, add hermes_logging — @alt-glitch ([#5651](https://github.com/NousResearch/hermes-agent/pull/5651))
- **Lowercase variable redaction** regression tests ([#5185](https://github.com/NousResearch/hermes-agent/pull/5185))
---
## 🧪 Testing
- **57 failing CI tests repaired** across 14 files ([#5823](https://github.com/NousResearch/hermes-agent/pull/5823))
- **Test suite re-architecture** + CI failure fixes — @alt-glitch ([#5946](https://github.com/NousResearch/hermes-agent/pull/5946))
- **Codebase-wide lint cleanup** — unused imports, dead code, and inefficient patterns ([#5821](https://github.com/NousResearch/hermes-agent/pull/5821))
- **browser_close tool removed** — auto-cleanup handles it ([#5792](https://github.com/NousResearch/hermes-agent/pull/5792))
---
## 📚 Documentation
- **Comprehensive documentation audit** — fix stale info, expand thin pages, add depth ([#5393](https://github.com/NousResearch/hermes-agent/pull/5393))
- **40+ discrepancies fixed** between documentation and codebase ([#5818](https://github.com/NousResearch/hermes-agent/pull/5818))
- **13 features documented** from last week's PRs ([#5815](https://github.com/NousResearch/hermes-agent/pull/5815))
- **Guides section overhaul** — fix existing + add 3 new tutorials ([#5735](https://github.com/NousResearch/hermes-agent/pull/5735))
- **Salvaged 4 docs PRs** — docker setup, post-update validation, local LLM guide, signal-cli install ([#5727](https://github.com/NousResearch/hermes-agent/pull/5727))
- **Discord configuration reference** ([#5386](https://github.com/NousResearch/hermes-agent/pull/5386))
- **Community FAQ entries** for common workflows and troubleshooting ([#4797](https://github.com/NousResearch/hermes-agent/pull/4797))
- **WSL2 networking guide** for local model servers ([#5616](https://github.com/NousResearch/hermes-agent/pull/5616))
- **Honcho CLI reference** + plugin CLI registration docs ([#5308](https://github.com/NousResearch/hermes-agent/pull/5308))
- **Obsidian Headless setup** for servers in llm-wiki ([#5660](https://github.com/NousResearch/hermes-agent/pull/5660))
- **Hermes Mod visual skin editor** added to skins page ([#6095](https://github.com/NousResearch/hermes-agent/pull/6095))
---
## 👥 Contributors
### Core
- **@teknium1** — 179 PRs
### Top Community Contributors
- **@SHL0MS** (7 PRs) — p5js creative coding skill, manim-video skill + 5 reference expansions, research-paper-writing, Nous OAuth fix, manim font fix
- **@alt-glitch** (3 PRs) — Firecrawl cloud browser provider, test re-architecture + CI fixes, Nix flake fixes
- **@benbarclay** (2 PRs) — Browser Use managed provider switch, Nous portal base URL fix
- **@CharlieKerfoot** (2 PRs) — macOS portable base64 encoding, thread-safe PairingStore
- **@WAXLYY** (2 PRs) — send_message secret redaction, gateway media URL sanitization
- **@MadKangYu** (2 PRs) — Telegram log noise reduction, context compaction fix for temperature-restricted models
### All Contributors
@alt-glitch, @austinpickett, @auspic7, @benbarclay, @CharlieKerfoot, @GratefulDave, @kshitijk4poor, @leotrs, @lumethegreat, @MadKangYu, @nericervin, @ryanautomated, @SHL0MS, @techguysimon, @tymrtn, @Vasanthdev2004, @WAXLYY, @xinbenlv
---
**Full Changelog**: [v2026.4.3...v2026.4.8](https://github.com/NousResearch/hermes-agent/compare/v2026.4.3...v2026.4.8)
+12 -117
View File
@@ -163,17 +163,6 @@ def _is_oauth_token(key: str) -> bool:
return True
def _normalize_base_url_text(base_url) -> str:
"""Normalize SDK/base transport URL values to a plain string for inspection.
Some client objects expose ``base_url`` as an ``httpx.URL`` instead of a raw
string. Provider/auth detection should accept either shape.
"""
if not base_url:
return ""
return str(base_url).strip()
def _is_third_party_anthropic_endpoint(base_url: str | None) -> bool:
"""Return True for non-Anthropic endpoints using the Anthropic Messages API.
@@ -181,10 +170,9 @@ def _is_third_party_anthropic_endpoint(base_url: str | None) -> bool:
with their own API keys via x-api-key, not Anthropic OAuth tokens. OAuth
detection should be skipped for these endpoints.
"""
normalized = _normalize_base_url_text(base_url)
if not normalized:
if not base_url:
return False # No base_url = direct Anthropic API
normalized = normalized.rstrip("/").lower()
normalized = base_url.rstrip("/").lower()
if "anthropic.com" in normalized:
return False # Direct Anthropic API — OAuth applies
return True # Any other endpoint is a third-party proxy
@@ -194,13 +182,12 @@ def _requires_bearer_auth(base_url: str | None) -> bool:
"""Return True for Anthropic-compatible providers that require Bearer auth.
Some third-party /anthropic endpoints implement Anthropic's Messages API but
require Authorization: Bearer *** of Anthropic's native x-api-key header.
require Authorization: Bearer instead of Anthropic's native x-api-key header.
MiniMax's global and China Anthropic-compatible endpoints follow this pattern.
"""
normalized = _normalize_base_url_text(base_url)
if not normalized:
if not base_url:
return False
normalized = normalized.rstrip("/").lower()
normalized = base_url.rstrip("/").lower()
return normalized.startswith(("https://api.minimax.io/anthropic", "https://api.minimaxi.com/anthropic"))
@@ -216,14 +203,13 @@ def build_anthropic_client(api_key: str, base_url: str = None):
)
from httpx import Timeout
normalized_base_url = _normalize_base_url_text(base_url)
kwargs = {
"timeout": Timeout(timeout=900.0, connect=10.0),
}
if normalized_base_url:
kwargs["base_url"] = normalized_base_url
if base_url:
kwargs["base_url"] = base_url
if _requires_bearer_auth(normalized_base_url):
if _requires_bearer_auth(base_url):
# Some Anthropic-compatible providers (e.g. MiniMax) expect the API key in
# Authorization: Bearer even for regular API keys. Route those endpoints
# through auth_token so the SDK sends Bearer auth instead of x-api-key.
@@ -956,18 +942,12 @@ def _convert_content_to_anthropic(content: Any) -> Any:
def convert_messages_to_anthropic(
messages: List[Dict],
base_url: str | None = None,
) -> Tuple[Optional[Any], List[Dict]]:
"""Convert OpenAI-format messages to Anthropic format.
Returns (system_prompt, anthropic_messages).
System messages are extracted since Anthropic takes them as a separate param.
system_prompt is a string or list of content blocks (when cache_control present).
When *base_url* is provided and points to a third-party Anthropic-compatible
endpoint, all thinking block signatures are stripped. Signatures are
Anthropic-proprietary — third-party endpoints cannot validate them and will
reject them with HTTP 400 "Invalid signature in thinking block".
"""
system = None
result = []
@@ -1122,15 +1102,7 @@ def convert_messages_to_anthropic(
curr_content = [{"type": "text", "text": curr_content}]
fixed[-1]["content"] = prev_content + curr_content
else:
# Consecutive assistant messages — merge text content.
# Drop thinking blocks from the *second* message: their
# signature was computed against a different turn boundary
# and becomes invalid once merged.
if isinstance(m["content"], list):
m["content"] = [
b for b in m["content"]
if not (isinstance(b, dict) and b.get("type") in ("thinking", "redacted_thinking"))
]
# Consecutive assistant messages — merge text content
prev_blocks = fixed[-1]["content"]
curr_blocks = m["content"]
if isinstance(prev_blocks, list) and isinstance(curr_blocks, list):
@@ -1148,79 +1120,6 @@ def convert_messages_to_anthropic(
fixed.append(m)
result = fixed
# ── Thinking block signature management ──────────────────────────
# Anthropic signs thinking blocks against the full turn content.
# Any upstream mutation (context compression, session truncation,
# orphan stripping, message merging) invalidates the signature,
# causing HTTP 400 "Invalid signature in thinking block".
#
# Signatures are Anthropic-proprietary. Third-party endpoints
# (MiniMax, Azure AI Foundry, self-hosted proxies) cannot validate
# them and will reject them outright. When targeting a third-party
# endpoint, strip ALL thinking/redacted_thinking blocks from every
# assistant message — the third-party will generate its own
# thinking blocks if it supports extended thinking.
#
# For direct Anthropic (strategy following clawdbot/OpenClaw):
# 1. Strip thinking/redacted_thinking from all assistant messages
# EXCEPT the last one — preserves reasoning continuity on the
# current tool-use chain while avoiding stale signature errors.
# 2. Downgrade unsigned thinking blocks (no signature) to text —
# Anthropic can't validate them and will reject them.
# 3. Strip cache_control from thinking/redacted_thinking blocks —
# cache markers can interfere with signature validation.
_THINKING_TYPES = frozenset(("thinking", "redacted_thinking"))
_is_third_party = _is_third_party_anthropic_endpoint(base_url)
last_assistant_idx = None
for i in range(len(result) - 1, -1, -1):
if result[i].get("role") == "assistant":
last_assistant_idx = i
break
for idx, m in enumerate(result):
if m.get("role") != "assistant" or not isinstance(m.get("content"), list):
continue
if _is_third_party or idx != last_assistant_idx:
# Third-party endpoint: strip ALL thinking blocks from every
# assistant message — signatures are Anthropic-proprietary.
# Direct Anthropic: strip from non-latest assistant messages only.
stripped = [
b for b in m["content"]
if not (isinstance(b, dict) and b.get("type") in _THINKING_TYPES)
]
m["content"] = stripped or [{"type": "text", "text": "(thinking elided)"}]
else:
# Latest assistant on direct Anthropic: keep signed thinking
# blocks for reasoning continuity; downgrade unsigned ones to
# plain text.
new_content = []
for b in m["content"]:
if not isinstance(b, dict) or b.get("type") not in _THINKING_TYPES:
new_content.append(b)
continue
if b.get("type") == "redacted_thinking":
# Redacted blocks use 'data' for the signature payload
if b.get("data"):
new_content.append(b)
# else: drop — no data means it can't be validated
elif b.get("signature"):
# Signed thinking block — keep it
new_content.append(b)
else:
# Unsigned thinking — downgrade to text so it's not lost
thinking_text = b.get("thinking", "")
if thinking_text:
new_content.append({"type": "text", "text": thinking_text})
m["content"] = new_content or [{"type": "text", "text": "(empty)"}]
# Strip cache_control from any remaining thinking/redacted_thinking
# blocks — cache markers interfere with signature validation.
for b in m["content"]:
if isinstance(b, dict) and b.get("type") in _THINKING_TYPES:
b.pop("cache_control", None)
return system, result
@@ -1234,7 +1133,6 @@ def build_anthropic_kwargs(
is_oauth: bool = False,
preserve_dots: bool = False,
context_length: Optional[int] = None,
base_url: str | None = None,
) -> Dict[str, Any]:
"""Build kwargs for anthropic.messages.create().
@@ -1248,11 +1146,8 @@ def build_anthropic_kwargs(
When *preserve_dots* is True, model name dots are not converted to hyphens
(for Alibaba/DashScope anthropic-compatible endpoints: qwen3.5-plus).
When *base_url* points to a third-party Anthropic-compatible endpoint,
thinking block signatures are stripped (they are Anthropic-proprietary).
"""
system, anthropic_messages = convert_messages_to_anthropic(messages, base_url=base_url)
system, anthropic_messages = convert_messages_to_anthropic(messages)
anthropic_tools = convert_tools_to_anthropic(tools) if tools else []
model = normalize_model_name(model, preserve_dots=preserve_dots)
@@ -1329,9 +1224,9 @@ def build_anthropic_kwargs(
# Map reasoning_config to Anthropic's thinking parameter.
# Claude 4.6 models use adaptive thinking + output_config.effort.
# Older models use manual thinking with budget_tokens.
# Haiku and MiniMax models do NOT support extended thinking — skip entirely.
# Haiku models do NOT support extended thinking at all — skip entirely.
if reasoning_config and isinstance(reasoning_config, dict):
if reasoning_config.get("enabled") is not False and "haiku" not in model.lower() and "minimax" not in model.lower():
if reasoning_config.get("enabled") is not False and "haiku" not in model.lower():
effort = str(reasoning_config.get("effort", "medium")).lower()
budget = THINKING_BUDGET.get(effort, 8000)
if _supports_adaptive_thinking(model):
+51 -120
View File
@@ -59,48 +59,13 @@ from hermes_constants import OPENROUTER_BASE_URL
logger = logging.getLogger(__name__)
_PROVIDER_ALIASES = {
"google": "gemini",
"google-gemini": "gemini",
"google-ai-studio": "gemini",
"glm": "zai",
"z-ai": "zai",
"z.ai": "zai",
"zhipu": "zai",
"kimi": "kimi-coding",
"moonshot": "kimi-coding",
"minimax-china": "minimax-cn",
"minimax_cn": "minimax-cn",
"claude": "anthropic",
"claude-code": "anthropic",
}
def _normalize_aux_provider(provider: Optional[str], *, for_vision: bool = False) -> str:
normalized = (provider or "auto").strip().lower()
if normalized.startswith("custom:"):
suffix = normalized.split(":", 1)[1].strip()
if not suffix:
return "custom"
normalized = suffix if not for_vision else "custom"
if normalized == "codex":
return "openai-codex"
if normalized == "main":
# Resolve to the user's actual main provider so named custom providers
# and non-aggregator providers (DeepSeek, Alibaba, etc.) work correctly.
main_prov = _read_main_provider()
if main_prov and main_prov not in ("auto", "main", ""):
return main_prov
return "custom"
return _PROVIDER_ALIASES.get(normalized, normalized)
# Default auxiliary models for direct API-key providers (cheap/fast for side tasks)
_API_KEY_PROVIDER_AUX_MODELS: Dict[str, str] = {
"gemini": "gemini-3-flash-preview",
"zai": "glm-4.5-flash",
"kimi-coding": "kimi-k2-turbo-preview",
"minimax": "MiniMax-M2.7",
"minimax-cn": "MiniMax-M2.7",
"minimax": "MiniMax-M2.7-highspeed",
"minimax-cn": "MiniMax-M2.7-highspeed",
"anthropic": "claude-haiku-4-5-20251001",
"ai-gateway": "google/gemini-3-flash",
"opencode-zen": "gemini-3-flash",
@@ -127,7 +92,6 @@ auxiliary_is_nous: bool = False
_OPENROUTER_MODEL = "google/gemini-3-flash-preview"
_NOUS_MODEL = "google/gemini-3-flash-preview"
_NOUS_FREE_TIER_VISION_MODEL = "xiaomi/mimo-v2-omni"
_NOUS_FREE_TIER_AUX_MODEL = "xiaomi/mimo-v2-pro"
_NOUS_DEFAULT_BASE_URL = "https://inference-api.nousresearch.com/v1"
_ANTHROPIC_DEFAULT_BASE_URL = "https://api.anthropic.com"
_AUTH_JSON_PATH = get_hermes_home() / "auth.json"
@@ -141,23 +105,6 @@ _CODEX_AUX_MODEL = "gpt-5.2-codex"
_CODEX_AUX_BASE_URL = "https://chatgpt.com/backend-api/codex"
def _to_openai_base_url(base_url: str) -> str:
"""Normalize an Anthropic-style base URL to OpenAI-compatible format.
Some providers (MiniMax, MiniMax-CN) expose an ``/anthropic`` endpoint for
the Anthropic Messages API and a separate ``/v1`` endpoint for OpenAI chat
completions. The auxiliary client uses the OpenAI SDK, so it must hit the
``/v1`` surface. Passing the raw ``inference_base_url`` causes requests to
land on ``/anthropic/chat/completions`` — a 404.
"""
url = str(base_url or "").strip().rstrip("/")
if url.endswith("/anthropic"):
rewritten = url[: -len("/anthropic")] + "/v1"
logger.debug("Auxiliary client: rewrote base URL %s%s", url, rewritten)
return rewritten
return url
def _select_pool_entry(provider: str) -> Tuple[bool, Optional[Any]]:
"""Return (pool_exists_for_provider, selected_entry)."""
try:
@@ -687,9 +634,7 @@ def _resolve_api_key_provider() -> Tuple[Optional[OpenAI], Optional[str]]:
if not api_key:
continue
base_url = _to_openai_base_url(
_pool_runtime_base_url(entry, pconfig.inference_base_url) or pconfig.inference_base_url
)
base_url = _pool_runtime_base_url(entry, pconfig.inference_base_url) or pconfig.inference_base_url
model = _API_KEY_PROVIDER_AUX_MODELS.get(provider_id, "default")
logger.debug("Auxiliary text client: %s (%s) via pool", pconfig.name, model)
extra = {}
@@ -706,9 +651,7 @@ def _resolve_api_key_provider() -> Tuple[Optional[OpenAI], Optional[str]]:
if not api_key:
continue
base_url = _to_openai_base_url(
str(creds.get("base_url", "")).strip().rstrip("/") or pconfig.inference_base_url
)
base_url = str(creds.get("base_url", "")).strip().rstrip("/") or pconfig.inference_base_url
model = _API_KEY_PROVIDER_AUX_MODELS.get(provider_id, "default")
logger.debug("Auxiliary text client: %s (%s)", pconfig.name, model)
extra = {}
@@ -770,7 +713,7 @@ def _try_openrouter() -> Tuple[Optional[OpenAI], Optional[str]]:
default_headers=_OR_HEADERS), _OPENROUTER_MODEL
def _try_nous(vision: bool = False) -> Tuple[Optional[OpenAI], Optional[str]]:
def _try_nous() -> Tuple[Optional[OpenAI], Optional[str]]:
nous = _read_nous_auth()
if not nous:
return None, None
@@ -782,13 +725,12 @@ def _try_nous(vision: bool = False) -> Tuple[Optional[OpenAI], Optional[str]]:
else:
model = _NOUS_MODEL
# Free-tier users can't use paid auxiliary models — use the free
# models instead: mimo-v2-omni for vision, mimo-v2-pro for text tasks.
# multimodal model instead so vision/browser-vision still works.
try:
from hermes_cli.models import check_nous_free_tier
if check_nous_free_tier():
model = _NOUS_FREE_TIER_VISION_MODEL if vision else _NOUS_FREE_TIER_AUX_MODEL
logger.debug("Free-tier Nous account — using %s for auxiliary/%s",
model, "vision" if vision else "text")
model = _NOUS_FREE_TIER_VISION_MODEL
logger.debug("Free-tier Nous account — using %s for auxiliary/vision", model)
except Exception:
pass
return (
@@ -1196,7 +1138,17 @@ def resolve_provider_client(
(client, resolved_model) or (None, None) if auth is unavailable.
"""
# Normalise aliases
provider = _normalize_aux_provider(provider)
provider = (provider or "auto").strip().lower()
if provider == "codex":
provider = "openai-codex"
if provider == "main":
# Resolve to the user's actual main provider so named custom providers
# and non-aggregator providers (DeepSeek, Alibaba, etc.) work correctly.
main_prov = _read_main_provider()
if main_prov and main_prov not in ("auto", "main", ""):
provider = main_prov
else:
provider = "custom"
# ── Auto: try all providers in priority order ────────────────────
if provider == "auto":
@@ -1346,9 +1298,7 @@ def resolve_provider_client(
provider, ", ".join(tried_sources))
return None, None
base_url = _to_openai_base_url(
str(creds.get("base_url", "")).strip().rstrip("/") or pconfig.inference_base_url
)
base_url = str(creds.get("base_url", "")).strip().rstrip("/") or pconfig.inference_base_url
default_model = _API_KEY_PROVIDER_AUX_MODELS.get(provider, "")
final_model = model or default_model
@@ -1425,11 +1375,24 @@ def get_async_text_auxiliary_client(task: str = ""):
_VISION_AUTO_PROVIDER_ORDER = (
"openrouter",
"nous",
"openai-codex",
"anthropic",
"custom",
)
def _normalize_vision_provider(provider: Optional[str]) -> str:
return _normalize_aux_provider(provider, for_vision=True)
provider = (provider or "auto").strip().lower()
if provider == "codex":
return "openai-codex"
if provider == "main":
# Resolve to actual main provider — named custom providers and
# non-aggregator providers need to pass through as their real name.
main_prov = _read_main_provider()
if main_prov and main_prov not in ("auto", "main", ""):
return main_prov
return "custom"
return provider
def _resolve_strict_vision_backend(provider: str) -> Tuple[Optional[Any], Optional[str]]:
@@ -1437,7 +1400,7 @@ def _resolve_strict_vision_backend(provider: str) -> Tuple[Optional[Any], Option
if provider == "openrouter":
return _try_openrouter()
if provider == "nous":
return _try_nous(vision=True)
return _try_nous()
if provider == "openai-codex":
return _try_codex()
if provider == "anthropic":
@@ -1470,26 +1433,17 @@ def _preferred_main_vision_provider() -> Optional[str]:
def get_available_vision_backends() -> List[str]:
"""Return the currently available vision backends in auto-selection order.
Order: active provider → OpenRouter → Nous → stop. This is the single
source of truth for setup, tool gating, and runtime auto-routing of
vision tasks.
This is the single source of truth for setup, tool gating, and runtime
auto-routing of vision tasks. The selected main provider is preferred when
it is also a known-good vision backend; otherwise Hermes falls back through
the standard conservative order.
"""
available: List[str] = []
# 1. Active provider — if the user configured a provider, try it first.
main_provider = _read_main_provider()
if main_provider and main_provider not in ("auto", ""):
if main_provider in _VISION_AUTO_PROVIDER_ORDER:
if _strict_vision_backend_available(main_provider):
available.append(main_provider)
else:
client, _ = resolve_provider_client(main_provider, _read_main_model())
if client is not None:
available.append(main_provider)
# 2. OpenRouter, 3. Nous — skip if already covered by main provider.
for p in _VISION_AUTO_PROVIDER_ORDER:
if p not in available and _strict_vision_backend_available(p):
available.append(p)
return available
ordered = list(_VISION_AUTO_PROVIDER_ORDER)
preferred = _preferred_main_vision_provider()
if preferred in ordered:
ordered.remove(preferred)
ordered.insert(0, preferred)
return [provider for provider in ordered if _strict_vision_backend_available(provider)]
def resolve_vision_provider_client(
@@ -1534,39 +1488,16 @@ def resolve_vision_provider_client(
return "custom", client, final_model
if requested == "auto":
# Vision auto-detection order:
# 1. Active provider + model (user's main chat config)
# 2. OpenRouter (known vision-capable default model)
# 3. Nous Portal (known vision-capable default model)
# 4. Stop
main_provider = _read_main_provider()
main_model = _read_main_model()
if main_provider and main_provider not in ("auto", ""):
if main_provider in _VISION_AUTO_PROVIDER_ORDER:
# Known strict backend — use its defaults.
sync_client, default_model = _resolve_strict_vision_backend(main_provider)
if sync_client is not None:
return _finalize(main_provider, sync_client, default_model)
else:
# Exotic provider (DeepSeek, Alibaba, named custom, etc.)
rpc_client, rpc_model = resolve_provider_client(
main_provider, main_model)
if rpc_client is not None:
logger.info(
"Vision auto-detect: using active provider %s (%s)",
main_provider, rpc_model or main_model,
)
return _finalize(
main_provider, rpc_client, rpc_model or main_model)
ordered = list(_VISION_AUTO_PROVIDER_ORDER)
preferred = _preferred_main_vision_provider()
if preferred in ordered:
ordered.remove(preferred)
ordered.insert(0, preferred)
# Fall back through aggregators.
for candidate in _VISION_AUTO_PROVIDER_ORDER:
if candidate == main_provider:
continue # already tried above
for candidate in ordered:
sync_client, default_model = _resolve_strict_vision_backend(candidate)
if sync_client is not None:
return _finalize(candidate, sync_client, default_model)
logger.debug("Auxiliary vision client: none available")
return None, None, None
+3 -66
View File
@@ -26,14 +26,12 @@ _PROVIDER_PREFIXES: frozenset[str] = frozenset({
"openrouter", "nous", "openai-codex", "copilot", "copilot-acp",
"gemini", "zai", "kimi-coding", "minimax", "minimax-cn", "anthropic", "deepseek",
"opencode-zen", "opencode-go", "ai-gateway", "kilocode", "alibaba",
"qwen-oauth",
"custom", "local",
# Common aliases
"google", "google-gemini", "google-ai-studio",
"glm", "z-ai", "z.ai", "zhipu", "github", "github-copilot",
"github-models", "kimi", "moonshot", "claude", "deep-seek",
"opencode", "zen", "go", "vercel", "kilo", "dashscope", "aliyun", "qwen",
"qwen-portal",
})
@@ -115,15 +113,8 @@ DEFAULT_CONTEXT_LENGTHS = {
"llama": 131072,
# Qwen
"qwen": 131072,
# MiniMax (lowercase — lookup lowercases model names at line 973)
"minimax-m1-256k": 1000000,
"minimax-m1-128k": 1000000,
"minimax-m1-80k": 1000000,
"minimax-m1-40k": 1000000,
"minimax-m1": 1000000,
"minimax-m2.5": 1048576,
"minimax-m2.7": 1048576,
"minimax": 1048576,
# MiniMax
"minimax": 204800,
# GLM
"glm": 202752,
# Kimi
@@ -136,7 +127,7 @@ DEFAULT_CONTEXT_LENGTHS = {
"deepseek-ai/DeepSeek-V3.2": 65536,
"moonshotai/Kimi-K2.5": 262144,
"moonshotai/Kimi-K2-Thinking": 262144,
"MiniMaxAI/MiniMax-M2.5": 1048576,
"MiniMaxAI/MiniMax-M2.5": 204800,
"XiaomiMiMo/MiMo-V2-Flash": 32768,
"mimo-v2-pro": 1048576,
"mimo-v2-omni": 1048576,
@@ -189,7 +180,6 @@ _URL_TO_PROVIDER: Dict[str, str] = {
"api.minimax": "minimax",
"dashscope.aliyuncs.com": "alibaba",
"dashscope-intl.aliyuncs.com": "alibaba",
"portal.qwen.ai": "qwen-oauth",
"openrouter.ai": "openrouter",
"generativelanguage.googleapis.com": "gemini",
"inference-api.nousresearch.com": "nous",
@@ -621,59 +611,6 @@ def _model_id_matches(candidate_id: str, lookup_model: str) -> bool:
return False
def query_ollama_num_ctx(model: str, base_url: str) -> Optional[int]:
"""Query an Ollama server for the model's context length.
Returns the model's maximum context from GGUF metadata via ``/api/show``,
or the explicit ``num_ctx`` from the Modelfile if set. Returns None if
the server is unreachable or not Ollama.
This is the value that should be passed as ``num_ctx`` in Ollama chat
requests to override the default 2048.
"""
import httpx
bare_model = _strip_provider_prefix(model)
server_url = base_url.rstrip("/")
if server_url.endswith("/v1"):
server_url = server_url[:-3]
try:
server_type = detect_local_server_type(base_url)
except Exception:
return None
if server_type != "ollama":
return None
try:
with httpx.Client(timeout=3.0) as client:
resp = client.post(f"{server_url}/api/show", json={"name": bare_model})
if resp.status_code != 200:
return None
data = resp.json()
# Prefer explicit num_ctx from Modelfile parameters (user override)
params = data.get("parameters", "")
if "num_ctx" in params:
for line in params.split("\n"):
if "num_ctx" in line:
parts = line.strip().split()
if len(parts) >= 2:
try:
return int(parts[-1])
except ValueError:
pass
# Fall back to GGUF model_info context_length (training max)
model_info = data.get("model_info", {})
for key, value in model_info.items():
if "context_length" in key and isinstance(value, (int, float)):
return int(value)
except Exception:
pass
return None
def _query_local_context_length(model: str, base_url: str) -> Optional[int]:
"""Query a local server for the model's context length."""
import httpx
-1
View File
@@ -153,7 +153,6 @@ PROVIDER_TO_MODELS_DEV: Dict[str, str] = {
"minimax-cn": "minimax-cn",
"deepseek": "deepseek",
"alibaba": "alibaba",
"qwen-oauth": "alibaba",
"copilot": "github-copilot",
"ai-gateway": "vercel",
"opencode-zen": "opencode",
-24
View File
@@ -204,30 +204,6 @@ OPENAI_MODEL_EXECUTION_GUIDANCE = (
"the result.\n"
"</tool_persistence>\n"
"\n"
"<mandatory_tool_use>\n"
"NEVER answer these from memory or mental computation — ALWAYS use a tool:\n"
"- Arithmetic, math, calculations → use terminal or execute_code\n"
"- Hashes, encodings, checksums → use terminal (e.g. sha256sum, base64)\n"
"- Current time, date, timezone → use terminal (e.g. date)\n"
"- System state: OS, CPU, memory, disk, ports, processes → use terminal\n"
"- File contents, sizes, line counts → use read_file, search_files, or terminal\n"
"- Git history, branches, diffs → use terminal\n"
"- Current facts (weather, news, versions) → use web_search\n"
"Your memory and user profile describe the USER, not the system you are "
"running on. The execution environment may differ from what the user profile "
"says about their personal setup.\n"
"</mandatory_tool_use>\n"
"\n"
"<act_dont_ask>\n"
"When a question has an obvious default interpretation, act on it immediately "
"instead of asking for clarification. Examples:\n"
"- 'Is port 443 open?' → check THIS machine (don't ask 'open where?')\n"
"- 'What OS am I running?' → check the live system (don't use user profile)\n"
"- 'What time is it?' → run `date` (don't guess)\n"
"Only ask for clarification when the ambiguity genuinely changes what tool "
"you would call.\n"
"</act_dont_ask>\n"
"\n"
"<prerequisite_checks>\n"
"- Before taking an action, check whether prerequisite discovery, lookup, or "
"context-gathering steps are needed.\n"
-57
View File
@@ -1,57 +0,0 @@
"""Retry utilities — jittered backoff for decorrelated retries.
Replaces fixed exponential backoff with jittered delays to prevent
thundering-herd retry spikes when multiple sessions hit the same
rate-limited provider concurrently.
"""
import random
import threading
import time
# Monotonic counter for jitter seed uniqueness within the same process.
# Protected by a lock to avoid race conditions in concurrent retry paths
# (e.g. multiple gateway sessions retrying simultaneously).
_jitter_counter = 0
_jitter_lock = threading.Lock()
def jittered_backoff(
attempt: int,
*,
base_delay: float = 5.0,
max_delay: float = 120.0,
jitter_ratio: float = 0.5,
) -> float:
"""Compute a jittered exponential backoff delay.
Args:
attempt: 1-based retry attempt number.
base_delay: Base delay in seconds for attempt 1.
max_delay: Maximum delay cap in seconds.
jitter_ratio: Fraction of computed delay to use as random jitter
range. 0.5 means jitter is uniform in [0, 0.5 * delay].
Returns:
Delay in seconds: min(base * 2^(attempt-1), max_delay) + jitter.
The jitter decorrelates concurrent retries so multiple sessions
hitting the same provider don't all retry at the same instant.
"""
global _jitter_counter
with _jitter_lock:
_jitter_counter += 1
tick = _jitter_counter
exponent = max(0, attempt - 1)
if exponent >= 63 or base_delay <= 0:
delay = max_delay
else:
delay = min(base_delay * (2 ** exponent), max_delay)
# Seed from time + counter for decorrelation even with coarse clocks.
seed = (time.time_ns() ^ (tick * 0x9E3779B9)) & 0xFFFFFFFF
rng = random.Random(seed)
jitter = rng.uniform(0, jitter_ratio * delay)
return delay + jitter
+1 -5
View File
@@ -644,14 +644,10 @@ platform_toolsets:
# Voice Transcription (Speech-to-Text)
# =============================================================================
# Automatically transcribe voice messages on messaging platforms.
# Providers: local (free, faster-whisper) | groq (free tier) | openai (Whisper API) | mistral (Voxtral Transcribe)
# Set the corresponding API key in .env: GROQ_API_KEY, OPENAI_API_KEY, or MISTRAL_API_KEY.
# Requires OPENAI_API_KEY in .env (uses OpenAI Whisper API directly).
stt:
enabled: true
# provider: "local" # auto-detected if omitted
model: "whisper-1" # whisper-1 (cheapest) | gpt-4o-mini-transcribe | gpt-4o-transcribe
# mistral:
# model: "voxtral-mini-latest" # voxtral-mini-latest | voxtral-mini-2602
# =============================================================================
# Response Pacing (Messaging Platforms)
+32 -147
View File
@@ -612,11 +612,6 @@ def _run_cleanup():
pass
# Shut down memory provider (on_session_end + shutdown_all) at actual
# session boundary — NOT per-turn inside run_conversation().
try:
from hermes_cli.plugins import invoke_hook as _invoke_hook
_invoke_hook("on_session_finalize", session_id=_active_agent_ref.session_id if _active_agent_ref else None, platform="cli")
except Exception:
pass
try:
if _active_agent_ref and hasattr(_active_agent_ref, 'shutdown_memory_provider'):
_active_agent_ref.shutdown_memory_provider(
@@ -760,10 +755,7 @@ def _setup_worktree(repo_root: str = None) -> Optional[Dict[str, str]]:
def _cleanup_worktree(info: Dict[str, str] = None) -> None:
"""Remove a worktree and its branch on exit.
Preserves the worktree only if it has unpushed commits (real work
that hasn't been pushed to any remote). Uncommitted changes alone
(untracked files, test artifacts) are not enough to keep it agent
work lives in commits/PRs, not the working tree.
If the worktree has uncommitted changes, warn and keep it.
"""
global _active_worktree
info = info or _active_worktree
@@ -779,27 +771,23 @@ def _cleanup_worktree(info: Dict[str, str] = None) -> None:
if not Path(wt_path).exists():
return
# Check for unpushed commits — commits reachable from HEAD but not
# from any remote branch. These represent real work the agent did
# but didn't push.
has_unpushed = False
# Check for uncommitted changes
try:
result = subprocess.run(
["git", "log", "--oneline", "HEAD", "--not", "--remotes"],
status = subprocess.run(
["git", "status", "--porcelain"],
capture_output=True, text=True, timeout=10, cwd=wt_path,
)
has_unpushed = bool(result.stdout.strip())
has_changes = bool(status.stdout.strip())
except Exception:
has_unpushed = True # Assume unpushed on error — don't delete
has_changes = True # Assume dirty on error — don't delete
if has_unpushed:
print(f"\n\033[33m⚠ Worktree has unpushed commits, keeping: {wt_path}\033[0m")
print(f" To clean up manually: git worktree remove --force {wt_path}")
if has_changes:
print(f"\n\033[33m⚠ Worktree has uncommitted changes, keeping: {wt_path}\033[0m")
print(f" To clean up manually: git worktree remove {wt_path}")
_active_worktree = None
return
# Remove worktree (even if working tree is dirty — uncommitted
# changes without unpushed commits are just artifacts)
# Remove worktree
try:
subprocess.run(
["git", "worktree", "remove", wt_path, "--force"],
@@ -808,7 +796,7 @@ def _cleanup_worktree(info: Dict[str, str] = None) -> None:
except Exception as e:
logger.debug("Failed to remove worktree: %s", e)
# Delete the branch
# Delete the branch (only if it was never pushed / has no upstream)
try:
subprocess.run(
["git", "branch", "-D", branch],
@@ -822,27 +810,19 @@ def _cleanup_worktree(info: Dict[str, str] = None) -> None:
def _prune_stale_worktrees(repo_root: str, max_age_hours: int = 24) -> None:
"""Remove stale worktrees and orphaned branches on startup.
"""Remove worktrees older than max_age_hours that have no uncommitted changes.
Age-based tiers:
- Under max_age_hours (24h): skip session may still be active.
- 24h72h: remove if no unpushed commits.
- Over 72h: force remove regardless (nothing should sit this long).
Also prunes orphaned ``hermes/*`` and ``pr-*`` local branches that
have no corresponding worktree.
Runs silently on startup to clean up after crashed/killed sessions.
"""
import subprocess
import time
worktrees_dir = Path(repo_root) / ".worktrees"
if not worktrees_dir.exists():
_prune_orphaned_branches(repo_root)
return
now = time.time()
soft_cutoff = now - (max_age_hours * 3600) # 24h default
hard_cutoff = now - (max_age_hours * 3 * 3600) # 72h default
cutoff = now - (max_age_hours * 3600)
for entry in worktrees_dir.iterdir():
if not entry.is_dir() or not entry.name.startswith("hermes-"):
@@ -851,24 +831,21 @@ def _prune_stale_worktrees(repo_root: str, max_age_hours: int = 24) -> None:
# Check age
try:
mtime = entry.stat().st_mtime
if mtime > soft_cutoff:
if mtime > cutoff:
continue # Too recent — skip
except Exception:
continue
force = mtime <= hard_cutoff # Over 72h — force remove
if not force:
# 24h72h tier: only remove if no unpushed commits
try:
result = subprocess.run(
["git", "log", "--oneline", "HEAD", "--not", "--remotes"],
capture_output=True, text=True, timeout=5, cwd=str(entry),
)
if result.stdout.strip():
continue # Has unpushed commits — skip
except Exception:
continue # Can't check — skip
# Check for uncommitted changes
try:
status = subprocess.run(
["git", "status", "--porcelain"],
capture_output=True, text=True, timeout=5, cwd=str(entry),
)
if status.stdout.strip():
continue # Has changes — skip
except Exception:
continue # Can't check — skip
# Safe to remove
try:
@@ -887,81 +864,10 @@ def _prune_stale_worktrees(repo_root: str, max_age_hours: int = 24) -> None:
["git", "branch", "-D", branch],
capture_output=True, text=True, timeout=10, cwd=repo_root,
)
logger.debug("Pruned stale worktree: %s (force=%s)", entry.name, force)
logger.debug("Pruned stale worktree: %s", entry.name)
except Exception as e:
logger.debug("Failed to prune worktree %s: %s", entry.name, e)
_prune_orphaned_branches(repo_root)
def _prune_orphaned_branches(repo_root: str) -> None:
"""Delete local ``hermes/hermes-*`` and ``pr-*`` branches with no worktree.
These are auto-generated by ``hermes -w`` sessions and PR review
workflows respectively. Once their worktree is gone they serve no
purpose and just accumulate.
"""
import subprocess
try:
result = subprocess.run(
["git", "branch", "--format=%(refname:short)"],
capture_output=True, text=True, timeout=10, cwd=repo_root,
)
if result.returncode != 0:
return
all_branches = [b.strip() for b in result.stdout.strip().split("\n") if b.strip()]
except Exception:
return
# Collect branches that are actively checked out in a worktree
active_branches: set = set()
try:
wt_result = subprocess.run(
["git", "worktree", "list", "--porcelain"],
capture_output=True, text=True, timeout=10, cwd=repo_root,
)
for line in wt_result.stdout.split("\n"):
if line.startswith("branch refs/heads/"):
active_branches.add(line.split("branch refs/heads/", 1)[-1].strip())
except Exception:
return # Can't determine active branches — bail
# Also protect the currently checked-out branch and main
try:
head_result = subprocess.run(
["git", "branch", "--show-current"],
capture_output=True, text=True, timeout=5, cwd=repo_root,
)
current = head_result.stdout.strip()
if current:
active_branches.add(current)
except Exception:
pass
active_branches.add("main")
orphaned = [
b for b in all_branches
if b not in active_branches
and (b.startswith("hermes/hermes-") or b.startswith("pr-"))
]
if not orphaned:
return
# Delete in batches
for i in range(0, len(orphaned), 50):
batch = orphaned[i:i + 50]
try:
subprocess.run(
["git", "branch", "-D"] + batch,
capture_output=True, text=True, timeout=30, cwd=repo_root,
)
except Exception as e:
logger.debug("Failed to prune orphaned branches: %s", e)
logger.debug("Pruned %d orphaned branches", len(orphaned))
# ============================================================================
# ASCII Art & Branding
# ============================================================================
@@ -3408,22 +3314,6 @@ class HermesCLI:
flush_tool_summary()
print()
def _notify_session_boundary(self, event_type: str) -> None:
"""Fire a session-boundary plugin hook (on_session_finalize or on_session_reset).
Non-blocking errors are caught and logged. Safe to call from any
lifecycle point (shutdown, /new, /reset).
"""
try:
from hermes_cli.plugins import invoke_hook as _invoke_hook
_invoke_hook(
event_type,
session_id=self.agent.session_id if self.agent else None,
platform=getattr(self, "platform", None) or "cli",
)
except Exception:
pass
def new_session(self, silent=False):
"""Start a fresh session with a new session ID and cleared agent state."""
if self.agent and self.conversation_history:
@@ -3431,10 +3321,6 @@ class HermesCLI:
self.agent.flush_memories(self.conversation_history)
except (Exception, KeyboardInterrupt):
pass
self._notify_session_boundary("on_session_finalize")
elif self.agent:
# First session or empty history — still finalize the old session
self._notify_session_boundary("on_session_finalize")
old_session_id = self.session_id
if self._session_db and old_session_id:
@@ -3479,7 +3365,6 @@ class HermesCLI:
)
except Exception:
pass
self._notify_session_boundary("on_session_reset")
if not silent:
print("(^_^)v New session started!")
@@ -4668,13 +4553,13 @@ class HermesCLI:
if output:
self.console.print(_rich_text_from_ansi(output))
else:
self.console.print("[dim]Command returned no output[/]")
ChatConsole().print("[dim]Command returned no output[/]")
except subprocess.TimeoutExpired:
self.console.print("[bold red]Quick command timed out (30s)[/]")
ChatConsole().print("[bold red]Quick command timed out (30s)[/]")
except Exception as e:
self.console.print(f"[bold red]Quick command error: {e}[/]")
ChatConsole().print(f"[bold red]Quick command error: {e}[/]")
else:
self.console.print(f"[bold red]Quick command '{base_cmd}' has no command defined[/]")
ChatConsole().print(f"[bold red]Quick command '{base_cmd}' has no command defined[/]")
elif qcmd.get("type") == "alias":
target = qcmd.get("target", "").strip()
if target:
@@ -4683,9 +4568,9 @@ class HermesCLI:
aliased_command = f"{target} {user_args}".strip()
return self.process_command(aliased_command)
else:
self.console.print(f"[bold red]Quick command '{base_cmd}' has no target defined[/]")
ChatConsole().print(f"[bold red]Quick command '{base_cmd}' has no target defined[/]")
else:
self.console.print(f"[bold red]Quick command '{base_cmd}' has unsupported type (supported: 'exec', 'alias')[/]")
ChatConsole().print(f"[bold red]Quick command '{base_cmd}' has unsupported type (supported: 'exec', 'alias')[/]")
# Check for plugin-registered slash commands
elif base_cmd.lstrip("/") in _get_plugin_cmd_handler_names():
from hermes_cli.plugins import get_plugin_command_handler
+1 -7
View File
@@ -574,16 +574,12 @@ def remove_job(job_id: str) -> bool:
return False
def mark_job_run(job_id: str, success: bool, error: Optional[str] = None,
delivery_error: Optional[str] = None):
def mark_job_run(job_id: str, success: bool, error: Optional[str] = None):
"""
Mark a job as having been run.
Updates last_run_at, last_status, increments completed count,
computes next_run_at, and auto-deletes if repeat limit reached.
``delivery_error`` is tracked separately from the agent error — a job
can succeed (agent produced output) but fail delivery (platform down).
"""
jobs = load_jobs()
for i, job in enumerate(jobs):
@@ -592,8 +588,6 @@ def mark_job_run(job_id: str, success: bool, error: Optional[str] = None,
job["last_run_at"] = now
job["last_status"] = "ok" if success else "error"
job["last_error"] = error if not success else None
# Track delivery failures separately — cleared on successful delivery
job["last_delivery_error"] = delivery_error
# Increment completed count
if job.get("repeat"):
+25 -32
View File
@@ -196,7 +196,7 @@ def _send_media_via_adapter(adapter, chat_id: str, media_files: list, metadata:
logger.warning("Job '%s': failed to send media %s: %s", job.get("id", "?"), media_path, e)
def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Optional[str]:
def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> None:
"""
Deliver job output to the configured target (origin chat, specific platform, etc.).
@@ -204,16 +204,16 @@ def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Option
use the live adapter first — this supports E2EE rooms (e.g. Matrix) where
the standalone HTTP path cannot encrypt. Falls back to standalone send if
the adapter path fails or is unavailable.
Returns None on success, or an error string on failure.
"""
target = _resolve_delivery_target(job)
if not target:
if job.get("deliver", "local") != "local":
msg = f"no delivery target resolved for deliver={job.get('deliver', 'local')}"
logger.warning("Job '%s': %s", job["id"], msg)
return msg
return None # local-only jobs don't deliver — not a failure
logger.warning(
"Job '%s' deliver=%s but no concrete delivery target could be resolved",
job["id"],
job.get("deliver", "local"),
)
return
platform_name = target["platform"]
chat_id = target["chat_id"]
@@ -239,22 +239,19 @@ def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Option
}
platform = platform_map.get(platform_name.lower())
if not platform:
msg = f"unknown platform '{platform_name}'"
logger.warning("Job '%s': %s", job["id"], msg)
return msg
logger.warning("Job '%s': unknown platform '%s' for delivery", job["id"], platform_name)
return
try:
config = load_gateway_config()
except Exception as e:
msg = f"failed to load gateway config: {e}"
logger.error("Job '%s': %s", job["id"], msg)
return msg
logger.error("Job '%s': failed to load gateway config for delivery: %s", job["id"], e)
return
pconfig = config.platforms.get(platform)
if not pconfig or not pconfig.enabled:
msg = f"platform '{platform_name}' not configured/enabled"
logger.warning("Job '%s': %s", job["id"], msg)
return msg
logger.warning("Job '%s': platform '%s' not configured/enabled", job["id"], platform_name)
return
# Optionally wrap the content with a header/footer so the user knows this
# is a cron delivery. Wrapping is on by default; set cron.wrap_response: false
@@ -310,7 +307,7 @@ def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Option
if adapter_ok:
logger.info("Job '%s': delivered to %s:%s via live adapter", job["id"], platform_name, chat_id)
return None
return
except Exception as e:
logger.warning(
"Job '%s': live adapter delivery to %s:%s failed (%s), falling back to standalone",
@@ -332,17 +329,13 @@ def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Option
future = pool.submit(asyncio.run, _send_to_platform(platform, pconfig, chat_id, cleaned_delivery_content, thread_id=thread_id, media_files=media_files))
result = future.result(timeout=30)
except Exception as e:
msg = f"delivery to {platform_name}:{chat_id} failed: {e}"
logger.error("Job '%s': %s", job["id"], msg)
return msg
logger.error("Job '%s': delivery to %s:%s failed: %s", job["id"], platform_name, chat_id, e)
return
if result and result.get("error"):
msg = f"delivery error: {result['error']}"
logger.error("Job '%s': %s", job["id"], msg)
return msg
logger.info("Job '%s': delivered to %s:%s", job["id"], platform_name, chat_id)
return None
logger.error("Job '%s': delivery error: %s", job["id"], result["error"])
else:
logger.info("Job '%s': delivered to %s:%s", job["id"], platform_name, chat_id)
_SCRIPT_TIMEOUT = 120 # seconds
@@ -585,9 +578,11 @@ def run_job(job: dict) -> tuple[bool, str, str, Optional[str]]:
except Exception as e:
logger.warning("Job '%s': failed to load config.yaml, using defaults: %s", job_id, e)
# Reasoning config from config.yaml
# Reasoning config from env or config.yaml
from hermes_constants import parse_reasoning_effort
effort = str(_cfg.get("agent", {}).get("reasoning_effort", "")).strip()
effort = os.getenv("HERMES_REASONING_EFFORT", "")
if not effort:
effort = str(_cfg.get("agent", {}).get("reasoning_effort", "")).strip()
reasoning_config = parse_reasoning_effort(effort)
# Prefill messages from env or config.yaml
@@ -873,15 +868,13 @@ def tick(verbose: bool = True, adapters=None, loop=None) -> int:
logger.info("Job '%s': agent returned %s — skipping delivery", job["id"], SILENT_MARKER)
should_deliver = False
delivery_error = None
if should_deliver:
try:
delivery_error = _deliver_result(job, deliver_content, adapters=adapters, loop=loop)
_deliver_result(job, deliver_content, adapters=adapters, loop=loop)
except Exception as de:
delivery_error = str(de)
logger.error("Delivery failed for job %s: %s", job["id"], de)
mark_job_run(job["id"], success, error, delivery_error=delivery_error)
mark_job_run(job["id"], success, error)
executed += 1
except Exception as e:
+14 -58
View File
@@ -21,8 +21,6 @@ from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set
from model_tools import handle_function_call
from tools.terminal_tool import get_active_env
from tools.tool_result_storage import maybe_persist_tool_result, enforce_turn_budget
# Thread pool for running sync tool calls that internally use asyncio.run()
# (e.g., the Modal/Docker/Daytona terminal backends). Running them in a separate
@@ -138,10 +136,8 @@ class HermesAgentLoop:
max_turns: int = 30,
task_id: Optional[str] = None,
temperature: float = 1.0,
top_p: Optional[float] = None,
max_tokens: Optional[int] = None,
extra_body: Optional[Dict[str, Any]] = None,
budget_config: Optional["BudgetConfig"] = None,
):
"""
Initialize the agent loop.
@@ -154,26 +150,19 @@ class HermesAgentLoop:
max_turns: Maximum number of LLM calls before stopping
task_id: Unique ID for terminal/browser session isolation
temperature: Sampling temperature for generation
top_p: Nucleus sampling top_p (None = omit, use provider default)
max_tokens: Max tokens per generation (None for server default)
extra_body: Extra parameters passed to the OpenAI client's create() call.
Used for OpenRouter provider preferences, transforms, etc.
e.g. {"provider": {"ignore": ["DeepInfra"]}}
budget_config: Tool result persistence budget. Controls per-tool
thresholds, per-turn aggregate budget, and preview size.
If None, uses DEFAULT_BUDGET (current hardcoded values).
"""
from tools.budget_config import DEFAULT_BUDGET
self.server = server
self.tool_schemas = tool_schemas
self.valid_tool_names = valid_tool_names
self.max_turns = max_turns
self.task_id = task_id or str(uuid.uuid4())
self.temperature = temperature
self.top_p = top_p
self.max_tokens = max_tokens
self.extra_body = extra_body
self.budget_config = budget_config or DEFAULT_BUDGET
async def run(self, messages: List[Dict[str, Any]]) -> AgentResult:
"""
@@ -214,9 +203,6 @@ class HermesAgentLoop:
"temperature": self.temperature,
}
if self.top_p is not None:
chat_kwargs["top_p"] = self.top_p
# Only pass tools if we have them
if self.tool_schemas:
chat_kwargs["tools"] = self.tool_schemas
@@ -231,35 +217,20 @@ class HermesAgentLoop:
chat_kwargs["extra_body"] = self.extra_body
# Make the API call -- standard OpenAI spec
# Retry on timeout/connection errors (provider queuing, rate limits)
api_start = _time.monotonic()
response = None
max_retries = 3
for attempt in range(max_retries):
try:
response = await self.server.chat_completion(**chat_kwargs)
break
except Exception as e:
api_elapsed = _time.monotonic() - api_start
is_retryable = "timeout" in type(e).__name__.lower() or "connection" in type(e).__name__.lower()
if is_retryable and attempt < max_retries - 1:
wait = 2 ** attempt
logger.warning(
"[%s] API call timed out on turn %d attempt %d (%.1fs), retrying in %ds: %s",
self.task_id[:8], turn + 1, attempt + 1, api_elapsed, wait, e,
)
await asyncio.sleep(wait)
api_start = _time.monotonic()
continue
logger.error("API call failed on turn %d (%.1fs): %s", turn + 1, api_elapsed, e)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=turn + 1,
finished_naturally=False,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
try:
response = await self.server.chat_completion(**chat_kwargs)
except Exception as e:
api_elapsed = _time.monotonic() - api_start
logger.error("API call failed on turn %d (%.1fs): %s", turn + 1, api_elapsed, e)
return AgentResult(
messages=messages,
managed_state=self._get_managed_state(),
turns_used=turn + 1,
finished_naturally=False,
reasoning_per_turn=reasoning_per_turn,
tool_errors=tool_errors,
)
api_elapsed = _time.monotonic() - api_start
@@ -475,15 +446,8 @@ class HermesAgentLoop:
except (json.JSONDecodeError, TypeError):
pass
# Add tool response to conversation
tc_id = tc.get("id", "") if isinstance(tc, dict) else tc.id
tool_result = maybe_persist_tool_result(
content=tool_result,
tool_name=tool_name,
tool_use_id=tc_id,
env=get_active_env(self.task_id),
config=self.budget_config,
)
messages.append(
{
"role": "tool",
@@ -492,14 +456,6 @@ class HermesAgentLoop:
}
)
num_tcs = len(assistant_msg.tool_calls)
if num_tcs > 0:
enforce_turn_budget(
messages[-num_tcs:],
env=get_active_env(self.task_id),
config=self.budget_config,
)
turn_elapsed = _time.monotonic() - turn_start
logger.info(
"[%s] turn %d: api=%.1fs, %d tools, turn_total=%.1fs",
-1
View File
@@ -1048,7 +1048,6 @@ class AgenticOPDEnv(HermesAgentBaseEnv):
temperature=0.0,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
budget_config=self.config.build_budget_config(),
)
result = await agent.run(messages)
@@ -15,15 +15,15 @@
env:
enabled_toolsets: ["terminal", "file"]
max_agent_turns: 100
max_agent_turns: 60
max_token_length: 32000
agent_temperature: 1.0
agent_temperature: 0.8
terminal_backend: "modal"
terminal_timeout: 300 # 5 min per command (builds, pip install)
tool_pool_size: 128 # thread pool for 89 parallel tasks
dataset_name: "NousResearch/terminal-bench-2-verified-flattened"
terminal_timeout: 300 # 5 min per command (builds, pip install)
tool_pool_size: 128 # thread pool for 89 parallel tasks
dataset_name: "NousResearch/terminal-bench-2"
test_timeout: 600
task_timeout: 900 # 15 min wall-clock per task, auto-FAIL if exceeded
task_timeout: 1800 # 30 min wall-clock per task, auto-FAIL if exceeded
tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
use_wandb: true
wandb_name: "terminal-bench-2"
@@ -33,15 +33,10 @@ env:
# Modal's blocking calls (App.lookup, etc.) deadlock when too many sandboxes
# are created simultaneously inside thread pool workers via asyncio.run().
max_concurrent_tasks: 8
extra_body:
provider:
order: ["DeepInfra"]
allow_fallbacks: false
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "nvidia/nemotron-3-super-120b-a12b"
model_name: "anthropic/claude-opus-4.6"
server_type: "openai"
health_check: false
timeout: 300 # 5 min per API call (default 1200s causes 20min stalls)
# api_key loaded from OPENROUTER_API_KEY in .env
@@ -32,8 +32,8 @@ export PYTHONUNBUFFERED=1
# These go to the log file; tqdm + [START]/[PASS]/[FAIL] go to terminal
export LOGLEVEL=INFO
uv run python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--config environments/benchmarks/terminalbench_2/default.yaml \
python terminalbench2_env.py evaluate \
--config default.yaml \
"$@" \
2>&1 | tee "$LOG_FILE"
@@ -52,18 +52,18 @@ _repo_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_repo_root) not in sys.path:
sys.path.insert(0, str(_repo_root))
from atroposlib.envs.base import EvalHandlingEnum
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from pydantic import Field
from agent.prompt_builder import DEFAULT_AGENT_IDENTITY
from atroposlib.envs.base import EvalHandlingEnum
from atroposlib.envs.server_handling.server_manager import APIServerConfig
from environments.agent_loop import AgentResult, HermesAgentLoop
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from environments.tool_context import ToolContext
from tools.terminal_tool import (
cleanup_vm,
clear_task_env_overrides,
register_task_env_overrides,
clear_task_env_overrides,
cleanup_vm,
)
logger = logging.getLogger(__name__)
@@ -73,7 +73,6 @@ logger = logging.getLogger(__name__)
# Configuration
# =============================================================================
class TerminalBench2EvalConfig(HermesAgentEnvConfig):
"""
Configuration for the Terminal-Bench 2.0 evaluation environment.
@@ -139,27 +138,11 @@ class TerminalBench2EvalConfig(HermesAgentEnvConfig):
# Tasks that cannot run properly on Modal and are excluded from scoring.
MODAL_INCOMPATIBLE_TASKS = {
"qemu-startup", # Needs KVM/hardware virtualization
"qemu-alpine-ssh", # Needs KVM/hardware virtualization
"crack-7z-hash", # Password brute-force -- too slow for cloud sandbox timeouts
"qemu-startup", # Needs KVM/hardware virtualization
"qemu-alpine-ssh", # Needs KVM/hardware virtualization
"crack-7z-hash", # Password brute-force -- too slow for cloud sandbox timeouts
}
# Injected as a user message when the model responds with plain text instead of
# calling a tool or including a <task_status> tag.
_FORMAT_NUDGE_MESSAGE = (
"You wrote a plain text response instead of using your tools. "
"Plain text responses do not affect the environment — nothing was executed or saved.\n\n"
"You MUST use your tools (terminal, read_file, write_file) to actually complete the task. "
"Do not describe what you would do — execute it now by making tool calls.\n\n"
"If you have already completed all required work using tools in previous turns, "
"respond with exactly: <task_status>DONE</task_status>\n"
"If you have exhausted all approaches and cannot make further progress, "
"respond with exactly: <task_status>UNFINISHED</task_status>"
)
# Maximum number of format nudges before giving up and moving on to scoring.
_MAX_FORMAT_NUDGES = 3
# =============================================================================
# Tar extraction helper
@@ -220,6 +203,7 @@ def _safe_extract_tar(tar: tarfile.TarFile, target_dir: Path) -> None:
except OSError:
pass
def _extract_base64_tar(b64_data: str, target_dir: Path):
"""Extract a base64-encoded tar.gz archive into target_dir."""
if not b64_data:
@@ -234,7 +218,6 @@ def _extract_base64_tar(b64_data: str, target_dir: Path):
# Main Environment
# =============================================================================
class TerminalBench2EvalEnv(HermesAgentBaseEnv):
"""
Terminal-Bench 2.0 evaluation environment (eval-only, no training).
@@ -279,18 +262,23 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
enabled_toolsets=["terminal", "file"],
disabled_toolsets=None,
distribution=None,
# Agent settings -- TB2 tasks are complex, need many turns
max_agent_turns=60,
max_token_length=16000,
agent_temperature=0.6,
system_prompt=DEFAULT_AGENT_IDENTITY,
system_prompt=None,
# Modal backend for per-task cloud-isolated sandboxes
terminal_backend="modal",
terminal_timeout=300, # 5 min per command (builds, pip install, etc.)
terminal_timeout=300, # 5 min per command (builds, pip install, etc.)
# Test execution timeout (TB2 test scripts can install deps like pytest)
test_timeout=180,
# 89 tasks run in parallel, each needs a thread for tool calls
tool_pool_size=128,
# --- Eval-only Atropos settings ---
# These settings make the env work as an eval-only environment:
# - STOP_TRAIN: pauses training during eval (standard for eval envs)
@@ -300,6 +288,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
group_size=1,
steps_per_eval=1,
total_steps=1,
tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
use_wandb=True,
wandb_name="terminal-bench-2",
@@ -347,11 +336,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
# Skip tasks incompatible with the current backend (e.g., QEMU on Modal)
# plus any user-specified skip_tasks
skip = (
set(MODAL_INCOMPATIBLE_TASKS)
if self.config.terminal_backend == "modal"
else set()
)
skip = set(MODAL_INCOMPATIBLE_TASKS) if self.config.terminal_backend == "modal" else set()
if self.config.skip_tasks:
skip |= {name.strip() for name in self.config.skip_tasks.split(",")}
if skip:
@@ -359,9 +344,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
tasks = [t for t in tasks if t["task_name"] not in skip]
skipped = before - len(tasks)
if skipped > 0:
print(
f" Skipped {skipped} incompatible tasks: {sorted(skip & {t['task_name'] for t in ds})}"
)
print(f" Skipped {skipped} incompatible tasks: {sorted(skip & {t['task_name'] for t in ds})}")
self.all_eval_items = tasks
self.iter = 0
@@ -371,16 +354,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
for i, task in enumerate(self.all_eval_items):
self.category_index[task.get("category", "unknown")].append(i)
# Pre-compute which tasks need Modal's add_python (avoids re-decoding
# multi-MB environment_tar blobs during per-task rollouts).
self._needs_add_python: Dict[str, bool] = {
task["task_name"]: self._image_needs_add_python(task)
for task in self.all_eval_items
}
add_py_count = sum(self._needs_add_python.values())
if add_py_count:
print(f" {add_py_count} tasks need add_python (non-python base image)")
# Reward tracking for wandb logging
self.eval_metrics: List[Tuple[str, float]] = []
@@ -388,30 +361,15 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
# immediately on completion so data is preserved even on Ctrl+C.
# Timestamped filename so each run produces a unique file.
import datetime
log_dir = os.path.join(os.path.dirname(__file__), "logs")
os.makedirs(log_dir, exist_ok=True)
run_ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
model_name = self.server.servers[0].config.model_name
model_slug = model_name.replace("/", "_").replace(":", "_")
self._streaming_path = os.path.join(
log_dir, f"samples_{run_ts}_{model_slug}.jsonl"
)
self._streaming_path = os.path.join(log_dir, f"samples_{run_ts}.jsonl")
self._streaming_file = open(self._streaming_path, "w")
self._streaming_lock = __import__("threading").Lock()
self._run_meta = {
"model_name": model_name,
"temperature": self.config.agent_temperature,
"top_p": self.config.agent_top_p,
"max_agent_turns": self.config.max_agent_turns,
"task_timeout": self.config.task_timeout,
"terminal_backend": self.config.terminal_backend,
}
print(f" Streaming results to: {self._streaming_path}")
print(
f"TB2 ready: {len(self.all_eval_items)} tasks across {len(self.category_index)} categories"
)
print(f"TB2 ready: {len(self.all_eval_items)} tasks across {len(self.category_index)} categories")
for cat, indices in sorted(self.category_index.items()):
print(f" {cat}: {len(indices)} tasks")
@@ -420,9 +378,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
if not hasattr(self, "_streaming_file") or self._streaming_file.closed:
return
with self._streaming_lock:
self._streaming_file.write(
json.dumps(result, ensure_ascii=False, default=str) + "\n"
)
self._streaming_file.write(json.dumps(result, ensure_ascii=False, default=str) + "\n")
self._streaming_file.flush()
# =========================================================================
@@ -458,36 +414,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
# Docker image resolution
# =========================================================================
@staticmethod
def _image_needs_add_python(item: Dict[str, Any]) -> bool:
"""Check if the task's base image lacks `python` on PATH.
Parses the Dockerfile FROM line in environment_tar. Returns True
for non-python base images (ubuntu, debian, etc.) that need
Modal's add_python parameter.
"""
environment_tar = item.get("environment_tar", "")
if not environment_tar:
return False
try:
raw = base64.b64decode(environment_tar)
buf = io.BytesIO(raw)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
for member in tar:
if not member.isfile() or "Dockerfile" not in member.name:
continue
f = tar.extractfile(member)
if not f:
continue
for line in f.read().decode("utf-8", errors="ignore").splitlines():
stripped = line.strip()
if stripped.upper().startswith("FROM "):
base = stripped.split()[1].lower()
return not base.startswith("python:")
except Exception:
pass
return False
def _resolve_task_image(
self, item: Dict[str, Any], task_name: str
) -> Tuple[str, Optional[Path]]:
@@ -520,9 +446,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
if dockerfile_path.exists():
logger.info(
"Task %s: building from Dockerfile (force_build=%s, docker_image=%s)",
task_name,
self.config.force_build,
bool(docker_image),
task_name, self.config.force_build, bool(docker_image),
)
return str(dockerfile_path), task_dir
@@ -530,80 +454,12 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
if docker_image:
logger.warning(
"Task %s: force_build=True but no environment_tar, "
"falling back to docker_image %s",
task_name,
docker_image,
"falling back to docker_image %s", task_name, docker_image,
)
return docker_image, None
return "", None
# =========================================================================
# Agent loop with format nudging
# =========================================================================
async def _run_with_nudges(
self,
server,
tools: List[Dict[str, Any]],
valid_names: set,
messages: List[Dict[str, Any]],
task_id: str,
task_name: str,
) -> Tuple["AgentResult", int]:
"""Run the agent loop, nudging if the model returns plain text without task_status tag."""
total_turns_used = 0
nudge_count = 0
result = None
while total_turns_used < self.config.max_agent_turns:
remaining = self.config.max_agent_turns - total_turns_used
agent = HermesAgentLoop(
server=server,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=remaining,
task_id=task_id,
temperature=self.config.agent_temperature,
top_p=self.config.agent_top_p,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
total_turns_used += result.turns_used
if not result.finished_naturally:
break
last_content = next(
(
m.get("content", "") or ""
for m in reversed(messages)
if m.get("role") == "assistant"
),
"",
)
if "<task_status>" in last_content:
break
if nudge_count >= _MAX_FORMAT_NUDGES:
logger.warning(
"Task %s: model ignored %d format nudges; stopping.",
task_name,
nudge_count,
)
break
nudge_count += 1
logger.info(
"Task %s: nudging model (nudge %d/%d) — no tool calls and no task_status",
task_name,
nudge_count,
_MAX_FORMAT_NUDGES,
)
messages.append({"role": "user", "content": _FORMAT_NUDGE_MESSAGE})
return result, total_turns_used
# =========================================================================
# Per-task evaluation -- agent loop + test verification
# =========================================================================
@@ -632,7 +488,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
task_dir = None # Set if we extract a Dockerfile (needs cleanup)
from tqdm import tqdm
tqdm.write(f" [START] {task_name} (task_id={task_id[:8]})")
task_start = time.time()
@@ -640,32 +495,24 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
# --- 1. Resolve Docker image ---
modal_image, task_dir = self._resolve_task_image(eval_item, task_name)
if not modal_image:
logger.error(
"Task %s: no docker_image or environment_tar, skipping", task_name
)
logger.error("Task %s: no docker_image or environment_tar, skipping", task_name)
return {
"passed": False,
"reward": 0.0,
"task_name": task_name,
"category": category,
"passed": False, "reward": 0.0,
"task_name": task_name, "category": category,
"error": "no_image",
}
# --- 2. Register per-task image override ---
# Set both modal_image and docker_image so the task image is used
# regardless of which backend is configured.
overrides = {
register_task_env_overrides(task_id, {
"modal_image": modal_image,
"docker_image": modal_image,
"cwd": "/app",
}
if self._needs_add_python.get(task_name, False):
overrides["add_python"] = "3.12"
register_task_env_overrides(task_id, overrides)
})
logger.info(
"Task %s: registered image override for task_id %s",
task_name,
task_id[:8],
task_name, task_id[:8],
)
# --- 3. Resolve tools and build messages ---
@@ -673,48 +520,51 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
messages: List[Dict[str, Any]] = []
if self.config.system_prompt:
messages.append(
{"role": "system", "content": self.config.system_prompt}
)
messages.append({"role": "system", "content": self.config.system_prompt})
messages.append({"role": "user", "content": self.format_prompt(eval_item)})
# --- 4. Run agent loop with format enforcement ---
# The model must either call a tool or end with <task_status>DONE/UNFINISHED</task_status>.
# If it returns plain text without the tag, inject a nudge user message and
# continue with the remaining turn budget (up to _MAX_FORMAT_NUDGES times).
# --- 4. Run agent loop ---
# Use ManagedServer (Phase 2) for vLLM/SGLang backends to get
# token-level tracking via /generate. Falls back to direct
# ServerManager (Phase 1) for OpenAI endpoints.
if self._use_managed_server():
async with self.server.managed_server(
tokenizer=self.tokenizer,
preserve_think_blocks=bool(self.config.thinking_mode),
) as managed:
result, total_turns_used = await self._run_with_nudges(
agent = HermesAgentLoop(
server=managed,
tools=tools,
valid_names=valid_names,
messages=messages,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
task_name=task_name,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
else:
result, total_turns_used = await self._run_with_nudges(
agent = HermesAgentLoop(
server=self.server,
tools=tools,
valid_names=valid_names,
messages=messages,
tool_schemas=tools,
valid_tool_names=valid_names,
max_turns=self.config.max_agent_turns,
task_id=task_id,
task_name=task_name,
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
)
result = await agent.run(messages)
# --- 5. Verify -- run test suite in the agent's sandbox ---
# Skip verification if the agent produced no meaningful output
only_system_and_user = all(
msg.get("role") in ("system", "user") for msg in messages
msg.get("role") in ("system", "user") for msg in result.messages
)
if total_turns_used == 0 or only_system_and_user:
if result.turns_used == 0 or only_system_and_user:
logger.warning(
"Task %s: agent produced no output (turns=%d). Reward=0.",
task_name,
total_turns_used,
task_name, result.turns_used,
)
reward = 0.0
else:
@@ -726,10 +576,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
loop = asyncio.get_event_loop()
reward = await loop.run_in_executor(
None, # default thread pool
self._run_tests,
eval_item,
ctx,
task_name,
self._run_tests, eval_item, ctx, task_name,
)
except Exception as e:
logger.error("Task %s: test verification failed: %s", task_name, e)
@@ -740,26 +587,20 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
passed = reward == 1.0
status = "PASS" if passed else "FAIL"
elapsed = time.time() - task_start
tqdm.write(
f" [{status}] {task_name} (turns={total_turns_used}, {elapsed:.0f}s)"
)
tqdm.write(f" [{status}] {task_name} (turns={result.turns_used}, {elapsed:.0f}s)")
logger.info(
"Task %s: reward=%.1f, turns=%d, finished=%s",
task_name,
reward,
total_turns_used,
result.finished_naturally,
task_name, reward, result.turns_used, result.finished_naturally,
)
out = {
**self._run_meta,
"passed": passed,
"reward": reward,
"task_name": task_name,
"category": category,
"turns_used": total_turns_used,
"turns_used": result.turns_used,
"finished_naturally": result.finished_naturally,
"messages": messages,
"messages": result.messages,
}
self._save_result(out)
return out
@@ -769,11 +610,8 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
logger.error("Task %s: rollout failed: %s", task_name, e, exc_info=True)
tqdm.write(f" [ERROR] {task_name}: {e} ({elapsed:.0f}s)")
out = {
**self._run_meta,
"passed": False,
"reward": 0.0,
"task_name": task_name,
"category": category,
"passed": False, "reward": 0.0,
"task_name": task_name, "category": category,
"error": str(e),
}
self._save_result(out)
@@ -846,8 +684,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
# Execute the test suite
logger.info(
"Task %s: running test suite (timeout=%ds)",
task_name,
self.config.test_timeout,
task_name, self.config.test_timeout,
)
test_result = ctx.terminal(
"bash /tests/test.sh",
@@ -880,9 +717,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
logger.warning(
"Task %s: reward.txt content unexpected (%r), "
"falling back to exit_code=%d",
task_name,
content,
exit_code,
task_name, content, exit_code,
)
reward = 1.0 if exit_code == 0 else 0.0
else:
@@ -890,17 +725,14 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
logger.warning(
"Task %s: reward.txt not found after download, "
"falling back to exit_code=%d",
task_name,
exit_code,
task_name, exit_code,
)
reward = 1.0 if exit_code == 0 else 0.0
except Exception as e:
logger.warning(
"Task %s: failed to download verifier dir: %s, "
"falling back to exit_code=%d",
task_name,
e,
exit_code,
task_name, e, exit_code,
)
reward = 1.0 if exit_code == 0 else 0.0
finally:
@@ -911,9 +743,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
output_preview = output[-500:] if output else "(no output)"
logger.info(
"Task %s: FAIL (exit_code=%d)\n%s",
task_name,
exit_code,
output_preview,
task_name, exit_code, output_preview,
)
return reward
@@ -938,18 +768,12 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
)
except asyncio.TimeoutError:
from tqdm import tqdm
elapsed = self.config.task_timeout
tqdm.write(
f" [TIMEOUT] {task_name} (exceeded {elapsed}s wall-clock limit)"
)
tqdm.write(f" [TIMEOUT] {task_name} (exceeded {elapsed}s wall-clock limit)")
logger.error("Task %s: wall-clock timeout after %ds", task_name, elapsed)
out = {
**self._run_meta,
"passed": False,
"reward": 0.0,
"task_name": task_name,
"category": category,
"passed": False, "reward": 0.0,
"task_name": task_name, "category": category,
"error": f"timeout ({elapsed}s)",
}
self._save_result(out)
@@ -983,25 +807,23 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
self.handleError(record)
handler = _TqdmHandler()
handler.setFormatter(
logging.Formatter(
"%(asctime)s [%(name)s] %(levelname)s: %(message)s",
datefmt="%H:%M:%S",
)
)
handler.setFormatter(logging.Formatter(
"%(asctime)s [%(name)s] %(levelname)s: %(message)s",
datefmt="%H:%M:%S",
))
root = logging.getLogger()
root.handlers = [handler] # Replace any existing handlers
root.setLevel(logging.INFO)
# Silence noisy third-party loggers that flood the output
logging.getLogger("httpx").setLevel(logging.WARNING) # Every HTTP request
logging.getLogger("openai").setLevel(logging.WARNING) # OpenAI client retries
logging.getLogger("rex-deploy").setLevel(logging.WARNING) # Swerex deployment
logging.getLogger("httpx").setLevel(logging.WARNING) # Every HTTP request
logging.getLogger("openai").setLevel(logging.WARNING) # OpenAI client retries
logging.getLogger("rex-deploy").setLevel(logging.WARNING) # Swerex deployment
logging.getLogger("rex_image_builder").setLevel(logging.WARNING) # Image builds
print(f"\n{'=' * 60}")
print(f"\n{'='*60}")
print("Starting Terminal-Bench 2.0 Evaluation")
print(f"{'=' * 60}")
print(f"{'='*60}")
print(f" Dataset: {self.config.dataset_name}")
print(f" Total tasks: {len(self.all_eval_items)}")
print(f" Max agent turns: {self.config.max_agent_turns}")
@@ -1009,11 +831,9 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
print(f" Terminal backend: {self.config.terminal_backend}")
print(f" Tool thread pool: {self.config.tool_pool_size}")
print(f" Terminal timeout: {self.config.terminal_timeout}s/cmd")
print(
f" Terminal lifetime: {self.config.terminal_lifetime}s (auto: task_timeout + 120)"
)
print(f" Terminal lifetime: {self.config.terminal_lifetime}s (auto: task_timeout + 120)")
print(f" Max concurrent tasks: {self.config.max_concurrent_tasks}")
print(f"{'=' * 60}\n")
print(f"{'='*60}\n")
# Semaphore to limit concurrent Modal sandbox creations.
# Without this, all 86 tasks fire simultaneously, each creating a Modal
@@ -1055,7 +875,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
await asyncio.gather(*eval_tasks, return_exceptions=True)
# Belt-and-suspenders: clean up any remaining sandboxes
from tools.terminal_tool import cleanup_all_environments
cleanup_all_environments()
print("All sandboxes cleaned up.")
return
@@ -1101,9 +920,9 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
self.eval_metrics = [(k, v) for k, v in eval_metrics.items()]
# ---- Print summary ----
print(f"\n{'=' * 60}")
print(f"\n{'='*60}")
print("Terminal-Bench 2.0 Evaluation Results")
print(f"{'=' * 60}")
print(f"{'='*60}")
print(f"Overall Pass Rate: {overall_pass_rate:.4f} ({passed}/{total})")
print(f"Evaluation Time: {end_time - start_time:.1f} seconds")
@@ -1123,7 +942,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
extra = f" (error: {error})" if error else ""
print(f" [{status}] {r['task_name']} (turns={turns}){extra}")
print(f"{'=' * 60}\n")
print(f"{'='*60}\n")
# Build sample records for evaluate_log (includes full conversations)
samples = [
@@ -1148,7 +967,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
end_time=end_time,
generation_parameters={
"temperature": self.config.agent_temperature,
"top_p": self.config.agent_top_p,
"max_tokens": self.config.max_token_length,
"max_agent_turns": self.config.max_agent_turns,
"terminal_backend": self.config.terminal_backend,
@@ -1165,7 +983,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
# Kill all remaining sandboxes. Timed-out tasks leave orphaned thread
# pool workers still executing commands -- cleanup_all stops them.
from tools.terminal_tool import cleanup_all_environments
print("\nCleaning up all sandboxes...")
cleanup_all_environments()
@@ -1173,7 +990,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
# tasks are killed immediately instead of retrying against dead
# sandboxes and spamming the console with TimeoutError warnings.
from environments.agent_loop import _tool_executor
_tool_executor.shutdown(wait=False, cancel_futures=True)
print("Done.")
@@ -549,7 +549,6 @@ class YCBenchEvalEnv(HermesAgentBaseEnv):
temperature=self.config.agent_temperature,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
budget_config=self.config.build_budget_config(),
)
result = await agent.run(messages)
-51
View File
@@ -62,11 +62,6 @@ from atroposlib.type_definitions import Item
from environments.agent_loop import AgentResult, HermesAgentLoop
from environments.tool_context import ToolContext
from tools.budget_config import (
DEFAULT_RESULT_SIZE_CHARS,
DEFAULT_TURN_BUDGET_CHARS,
DEFAULT_PREVIEW_SIZE_CHARS,
)
# Import hermes-agent toolset infrastructure
from model_tools import get_tool_definitions
@@ -115,10 +110,6 @@ class HermesAgentEnvConfig(BaseEnvConfig):
default=1.0,
description="Sampling temperature for agent generation during rollouts.",
)
agent_top_p: Optional[float] = Field(
default=None,
description="Nucleus sampling top_p for agent generation. None = provider default.",
)
# --- Terminal backend ---
terminal_backend: str = Field(
@@ -169,32 +160,6 @@ class HermesAgentEnvConfig(BaseEnvConfig):
"Options: hermes, mistral, llama3_json, qwen, deepseek_v3, etc.",
)
# --- Tool result budget ---
# Defaults imported from tools.budget_config (single source of truth).
default_result_size_chars: int = Field(
default=DEFAULT_RESULT_SIZE_CHARS,
description="Default per-tool threshold (chars) for persisting large results "
"to sandbox. Results exceeding this are written to /tmp/hermes-results/ "
"and replaced with a preview. Per-tool registry values take precedence "
"unless overridden via tool_result_overrides.",
)
turn_budget_chars: int = Field(
default=DEFAULT_TURN_BUDGET_CHARS,
description="Aggregate char budget per assistant turn. If all tool results "
"in a single turn exceed this, the largest are persisted to disk first.",
)
preview_size_chars: int = Field(
default=DEFAULT_PREVIEW_SIZE_CHARS,
description="Size of the inline preview shown after a tool result is persisted.",
)
tool_result_overrides: Optional[Dict[str, int]] = Field(
default=None,
description="Per-tool threshold overrides (chars). Keys are tool names, "
"values are char thresholds. Overrides both the default and registry "
"per-tool values. Example: {'terminal': 10000, 'search_files': 5000}. "
"Note: read_file is pinned to infinity and cannot be overridden.",
)
# --- Provider-specific parameters ---
# Passed as extra_body to the OpenAI client's chat.completions.create() call.
# Useful for OpenRouter provider preferences, transforms, route settings, etc.
@@ -211,16 +176,6 @@ class HermesAgentEnvConfig(BaseEnvConfig):
"transforms, and other provider-specific settings.",
)
def build_budget_config(self):
"""Build a BudgetConfig from env config fields."""
from tools.budget_config import BudgetConfig
return BudgetConfig(
default_result_size=self.default_result_size_chars,
turn_budget=self.turn_budget_chars,
preview_size=self.preview_size_chars,
tool_overrides=dict(self.tool_result_overrides) if self.tool_result_overrides else {},
)
class HermesAgentBaseEnv(BaseEnv):
"""
@@ -533,10 +488,8 @@ class HermesAgentBaseEnv(BaseEnv):
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
top_p=self.config.agent_top_p,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
budget_config=self.config.build_budget_config(),
)
result = await agent.run(messages)
except NotImplementedError:
@@ -552,10 +505,8 @@ class HermesAgentBaseEnv(BaseEnv):
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
top_p=self.config.agent_top_p,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
budget_config=self.config.build_budget_config(),
)
result = await agent.run(messages)
else:
@@ -567,10 +518,8 @@ class HermesAgentBaseEnv(BaseEnv):
max_turns=self.config.max_agent_turns,
task_id=task_id,
temperature=self.config.agent_temperature,
top_p=self.config.agent_top_p,
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
budget_config=self.config.build_budget_config(),
)
result = await agent.run(messages)
-1
View File
@@ -472,7 +472,6 @@ class WebResearchEnv(HermesAgentBaseEnv):
temperature=0.0, # Deterministic for eval
max_tokens=self.config.max_token_length,
extra_body=self.config.extra_body,
budget_config=self.config.build_budget_config(),
)
result = await agent.run(messages)
-7
View File
@@ -712,13 +712,6 @@ def _apply_env_overrides(config: GatewayConfig) -> None:
name=os.getenv("DISCORD_HOME_CHANNEL_NAME", "Home"),
)
# Reply threading mode for Discord (off/first/all)
discord_reply_mode = os.getenv("DISCORD_REPLY_TO_MODE", "").lower()
if discord_reply_mode in ("off", "first", "all"):
if Platform.DISCORD not in config.platforms:
config.platforms[Platform.DISCORD] = PlatformConfig()
config.platforms[Platform.DISCORD].reply_to_mode = discord_reply_mode
# WhatsApp (typically uses different auth mechanism)
whatsapp_enabled = os.getenv("WHATSAPP_ENABLED", "").lower() in ("true", "1", "yes")
if whatsapp_enabled:
+2 -8
View File
@@ -455,9 +455,6 @@ class DiscordAdapter(BasePlatformAdapter):
self._seen_messages: Dict[str, float] = {}
self._SEEN_TTL = 300 # 5 minutes
self._SEEN_MAX = 2000 # prune threshold
# Reply threading mode: "off" (no replies), "first" (reply on first
# chunk only, default), "all" (reply-reference on every chunk).
self._reply_to_mode: str = getattr(config, 'reply_to_mode', 'first') or 'first'
async def connect(self) -> bool:
"""Connect to Discord and start receiving events."""
@@ -777,7 +774,7 @@ class DiscordAdapter(BasePlatformAdapter):
message_ids = []
reference = None
if reply_to and self._reply_to_mode != "off":
if reply_to:
try:
ref_msg = await channel.fetch_message(int(reply_to))
reference = ref_msg
@@ -785,10 +782,7 @@ class DiscordAdapter(BasePlatformAdapter):
logger.debug("Could not fetch reply-to message: %s", e)
for i, chunk in enumerate(chunks):
if self._reply_to_mode == "all":
chunk_reference = reference
else: # "first" (default) or "off"
chunk_reference = reference if i == 0 else None
chunk_reference = reference if i == 0 else None
try:
msg = await channel.send(
content=chunk,
-148
View File
@@ -20,7 +20,6 @@ from __future__ import annotations
import asyncio
import hashlib
import hmac
import itertools
import json
import logging
import mimetypes
@@ -1053,9 +1052,6 @@ class FeishuAdapter(BasePlatformAdapter):
self._media_batch_state = FeishuBatchState()
self._pending_media_batches = self._media_batch_state.events
self._pending_media_batch_tasks = self._media_batch_state.tasks
# Exec approval button state (approval_id → {session_key, message_id, chat_id})
self._approval_state: Dict[int, Dict[str, str]] = {}
self._approval_counter = itertools.count(1)
self._load_seen_message_ids()
@staticmethod
@@ -1398,104 +1394,6 @@ class FeishuAdapter(BasePlatformAdapter):
logger.error("[Feishu] Failed to edit message %s: %s", message_id, exc, exc_info=True)
return SendResult(success=False, error=str(exc))
async def send_exec_approval(
self, chat_id: str, command: str, session_key: str,
description: str = "dangerous command",
metadata: Optional[Dict[str, Any]] = None,
) -> SendResult:
"""Send an interactive card with approval buttons.
The buttons carry ``hermes_action`` in their value dict so that
``_handle_card_action_event`` can intercept them and call
``resolve_gateway_approval()`` to unblock the waiting agent thread.
"""
if not self._client:
return SendResult(success=False, error="Not connected")
try:
approval_id = next(self._approval_counter)
cmd_preview = command[:3000] + "..." if len(command) > 3000 else command
def _btn(label: str, action_name: str, btn_type: str = "default") -> dict:
return {
"tag": "button",
"text": {"tag": "plain_text", "content": label},
"type": btn_type,
"value": {"hermes_action": action_name, "approval_id": approval_id},
}
card = {
"config": {"wide_screen_mode": True},
"header": {
"title": {"content": "⚠️ Command Approval Required", "tag": "plain_text"},
"template": "orange",
},
"elements": [
{
"tag": "markdown",
"content": f"```\n{cmd_preview}\n```\n**Reason:** {description}",
},
{
"tag": "action",
"actions": [
_btn("✅ Allow Once", "approve_once", "primary"),
_btn("✅ Session", "approve_session"),
_btn("✅ Always", "approve_always"),
_btn("❌ Deny", "deny", "danger"),
],
},
],
}
payload = json.dumps(card, ensure_ascii=False)
response = await self._feishu_send_with_retry(
chat_id=chat_id,
msg_type="interactive",
payload=payload,
reply_to=None,
metadata=metadata,
)
result = self._finalize_send_result(response, "send_exec_approval failed")
if result.success:
self._approval_state[approval_id] = {
"session_key": session_key,
"message_id": result.message_id or "",
"chat_id": chat_id,
}
return result
except Exception as exc:
logger.warning("[Feishu] send_exec_approval failed: %s", exc)
return SendResult(success=False, error=str(exc))
async def _update_approval_card(
self, message_id: str, label: str, user_name: str, choice: str,
) -> None:
"""Replace the approval card with a resolved status card."""
if not self._client or not message_id:
return
icon = "" if choice == "deny" else ""
card = {
"config": {"wide_screen_mode": True},
"header": {
"title": {"content": f"{icon} {label}", "tag": "plain_text"},
"template": "red" if choice == "deny" else "green",
},
"elements": [
{
"tag": "markdown",
"content": f"{icon} **{label}** by {user_name}",
},
],
}
try:
payload = json.dumps(card, ensure_ascii=False)
body = self._build_update_message_body(msg_type="interactive", content=payload)
request = self._build_update_message_request(message_id=message_id, request_body=body)
await asyncio.to_thread(self._client.im.v1.message.update, request)
except Exception as exc:
logger.warning("[Feishu] Failed to update approval card %s: %s", message_id, exc)
async def send_voice(
self,
chat_id: str,
@@ -1922,52 +1820,6 @@ class FeishuAdapter(BasePlatformAdapter):
action = getattr(event, "action", None)
action_tag = str(getattr(action, "tag", "") or "button")
action_value = getattr(action, "value", {}) or {}
# --- Exec approval button intercept ---
hermes_action = action_value.get("hermes_action") if isinstance(action_value, dict) else None
if hermes_action:
approval_id = action_value.get("approval_id")
state = self._approval_state.pop(approval_id, None)
if not state:
logger.debug("[Feishu] Approval %s already resolved or unknown", approval_id)
return
choice_map = {
"approve_once": "once",
"approve_session": "session",
"approve_always": "always",
"deny": "deny",
}
choice = choice_map.get(hermes_action, "deny")
label_map = {
"once": "Approved once",
"session": "Approved for session",
"always": "Approved permanently",
"deny": "Denied",
}
label = label_map.get(choice, "Resolved")
# Resolve sender name for the status card
sender_id = SimpleNamespace(open_id=open_id, user_id=None, union_id=None)
sender_profile = await self._resolve_sender_profile(sender_id)
user_name = sender_profile.get("user_name") or open_id
# Resolve the approval — unblocks the agent thread
try:
from tools.approval import resolve_gateway_approval
count = resolve_gateway_approval(state["session_key"], choice)
logger.info(
"Feishu button resolved %d approval(s) for session %s (choice=%s, user=%s)",
count, state["session_key"], choice, user_name,
)
except Exception as exc:
logger.error("Failed to resolve gateway approval from Feishu button: %s", exc)
# Update the card to show the decision
await self._update_approval_card(state.get("message_id", ""), label, user_name, choice)
return
synthetic_text = f"/card {action_tag}"
if action_value:
try:
+1 -10
View File
@@ -647,11 +647,7 @@ class SignalAdapter(BasePlatformAdapter):
if result is not None:
self._track_sent_timestamp(result)
# Use the timestamp from the RPC result as a pseudo message_id.
# Signal doesn't have real message IDs, but the stream consumer
# needs a truthy value to follow its edit→fallback path correctly.
_msg_id = str(result.get("timestamp", "")) if isinstance(result, dict) else None
return SendResult(success=True, message_id=_msg_id or None)
return SendResult(success=True)
return SendResult(success=False, error="RPC send failed")
def _track_sent_timestamp(self, rpc_result) -> None:
@@ -841,11 +837,6 @@ class SignalAdapter(BasePlatformAdapter):
except asyncio.CancelledError:
pass
async def stop_typing(self, chat_id: str) -> None:
"""Public interface for stopping typing — called by base adapter's
_keep_typing finally block to clean up platform-level typing tasks."""
await self._stop_typing_indicator(chat_id)
# ------------------------------------------------------------------
# Chat Info
# ------------------------------------------------------------------
+10 -41
View File
@@ -921,11 +921,12 @@ class GatewayRunner:
@staticmethod
def _load_reasoning_config() -> dict | None:
"""Load reasoning effort from config.yaml.
"""Load reasoning effort from config with env fallback.
Reads agent.reasoning_effort from config.yaml. Valid: "xhigh",
"high", "medium", "low", "minimal", "none". Returns None to use
default (medium).
Checks agent.reasoning_effort in config.yaml first, then
HERMES_REASONING_EFFORT as a fallback. Valid: "xhigh", "high",
"medium", "low", "minimal", "none". Returns None to use default
(medium).
"""
from hermes_constants import parse_reasoning_effort
effort = ""
@@ -938,6 +939,8 @@ class GatewayRunner:
effort = str(cfg.get("agent", {}).get("reasoning_effort", "") or "").strip()
except Exception:
pass
if not effort:
effort = os.getenv("HERMES_REASONING_EFFORT", "")
result = parse_reasoning_effort(effort)
if effort and effort.strip() and result is None:
logger.warning("Unknown reasoning_effort '%s', using default (medium)", effort)
@@ -1481,14 +1484,6 @@ class GatewayRunner:
logger.debug("Interrupted running agent for session %s during shutdown", session_key[:20])
except Exception as e:
logger.debug("Failed interrupting agent during shutdown: %s", e)
# Fire plugin on_session_finalize hook before memory shutdown
try:
from hermes_cli.plugins import invoke_hook as _invoke_hook
_invoke_hook("on_session_finalize",
session_id=getattr(agent, 'session_id', None),
platform="gateway")
except Exception:
pass
# Shut down memory provider at actual session boundary
try:
if hasattr(agent, 'shutdown_memory_provider'):
@@ -3282,15 +3277,6 @@ class GatewayRunner:
# the configured default instead of the previously switched model.
self._session_model_overrides.pop(session_key, None)
# Fire plugin on_session_finalize hook (session boundary)
try:
from hermes_cli.plugins import invoke_hook as _invoke_hook
_old_sid = old_entry.session_id if old_entry else None
_invoke_hook("on_session_finalize", session_id=_old_sid,
platform=source.platform.value if source.platform else "")
except Exception:
pass
# Emit session:end hook (session is ending)
await self.hooks.emit("session:end", {
"platform": source.platform.value if source.platform else "",
@@ -3304,7 +3290,7 @@ class GatewayRunner:
"user_id": source.user_id,
"session_key": session_key,
})
# Resolve session config info to surface to the user
try:
session_info = self._format_session_info()
@@ -3315,18 +3301,9 @@ class GatewayRunner:
header = "✨ Session reset! Starting fresh."
else:
# No existing session, just create one
new_entry = self.session_store.get_or_create_session(source, force_new=True)
self.session_store.get_or_create_session(source, force_new=True)
header = "✨ New session started!"
# Fire plugin on_session_reset hook (new session guaranteed to exist)
try:
from hermes_cli.plugins import invoke_hook as _invoke_hook
_new_sid = new_entry.session_id if new_entry else None
_invoke_hook("on_session_reset", session_id=_new_sid,
platform=source.platform.value if source.platform else "")
except Exception:
pass
if session_info:
return f"{header}\n\n{session_info}"
return header
@@ -6308,15 +6285,7 @@ class GatewayRunner:
# Falls back to env vars for backward compatibility.
# YAML 1.1 parses bare `off` as boolean False — normalise before
# the `or` chain so it doesn't silently fall through to "all".
#
# Per-platform overrides (display.tool_progress_overrides) take
# priority over the global setting — e.g. Signal users can set
# tool_progress to "off" while keeping Telegram on "all".
_display_cfg = user_config.get("display", {})
_overrides = _display_cfg.get("tool_progress_overrides", {})
_raw_tp = _overrides.get(platform_key)
if _raw_tp is None:
_raw_tp = _display_cfg.get("tool_progress")
_raw_tp = user_config.get("display", {}).get("tool_progress")
if _raw_tp is False:
_raw_tp = "off"
progress_mode = (
+7 -119
View File
@@ -74,8 +74,6 @@ class GatewayStreamConsumer:
self._edit_supported = True # Disabled on first edit failure (Signal/Email/HA)
self._last_edit_time = 0.0
self._last_sent_text = "" # Track last-sent text to skip redundant edits
self._fallback_final_send = False
self._fallback_prefix = ""
@property
def already_sent(self) -> bool:
@@ -140,19 +138,12 @@ class GatewayStreamConsumer:
while (
len(self._accumulated) > _safe_limit
and self._message_id is not None
and self._edit_supported
):
split_at = self._accumulated.rfind("\n", 0, _safe_limit)
if split_at < _safe_limit // 2:
split_at = _safe_limit
chunk = self._accumulated[:split_at]
await self._send_or_edit(chunk)
if self._fallback_final_send:
# Edit failed while attempting to split an oversized
# message. Keep the full accumulated text intact so
# the fallback final-send path can deliver the
# remaining continuation without dropping content.
break
self._accumulated = self._accumulated[split_at:].lstrip("\n")
self._message_id = None
self._last_sent_text = ""
@@ -165,17 +156,9 @@ class GatewayStreamConsumer:
self._last_edit_time = time.monotonic()
if got_done:
# Final edit without cursor. If progressive editing failed
# mid-stream, send a single continuation/fallback message
# here instead of letting the base gateway path send the
# full response again.
if self._accumulated:
if self._fallback_final_send:
await self._send_fallback_final(self._accumulated)
elif self._message_id:
await self._send_or_edit(self._accumulated)
elif not self._already_sent:
await self._send_or_edit(self._accumulated)
# Final edit without cursor
if self._accumulated and self._message_id:
await self._send_or_edit(self._accumulated)
return
# Tool boundary: the should_edit block above already flushed
@@ -186,8 +169,6 @@ class GatewayStreamConsumer:
self._message_id = None
self._accumulated = ""
self._last_sent_text = ""
self._fallback_final_send = False
self._fallback_prefix = ""
await asyncio.sleep(0.05) # Small yield to not busy-loop
@@ -226,86 +207,6 @@ class GatewayStreamConsumer:
# Strip trailing whitespace/newlines but preserve leading content
return cleaned.rstrip()
def _visible_prefix(self) -> str:
"""Return the visible text already shown in the streamed message."""
prefix = self._last_sent_text or ""
if self.cfg.cursor and prefix.endswith(self.cfg.cursor):
prefix = prefix[:-len(self.cfg.cursor)]
return self._clean_for_display(prefix)
def _continuation_text(self, final_text: str) -> str:
"""Return only the part of final_text the user has not already seen."""
prefix = self._fallback_prefix or self._visible_prefix()
if prefix and final_text.startswith(prefix):
return final_text[len(prefix):].lstrip()
return final_text
@staticmethod
def _split_text_chunks(text: str, limit: int) -> list[str]:
"""Split text into reasonably sized chunks for fallback sends."""
if len(text) <= limit:
return [text]
chunks: list[str] = []
remaining = text
while len(remaining) > limit:
split_at = remaining.rfind("\n", 0, limit)
if split_at < limit // 2:
split_at = limit
chunks.append(remaining[:split_at])
remaining = remaining[split_at:].lstrip("\n")
if remaining:
chunks.append(remaining)
return chunks
async def _send_fallback_final(self, text: str) -> None:
"""Send the final continuation after streaming edits stop working."""
final_text = self._clean_for_display(text)
continuation = self._continuation_text(final_text)
self._fallback_final_send = False
if not continuation.strip():
# Nothing new to send — the visible partial already matches final text.
self._already_sent = True
return
raw_limit = getattr(self.adapter, "MAX_MESSAGE_LENGTH", 4096)
safe_limit = max(500, raw_limit - 100)
chunks = self._split_text_chunks(continuation, safe_limit)
last_message_id: Optional[str] = None
last_successful_chunk = ""
sent_any_chunk = False
for chunk in chunks:
result = await self.adapter.send(
chat_id=self.chat_id,
content=chunk,
metadata=self.metadata,
)
if not result.success:
if sent_any_chunk:
# Some continuation text already reached the user. Suppress
# the base gateway final-send path so we don't resend the
# full response and create another duplicate.
self._already_sent = True
self._message_id = last_message_id
self._last_sent_text = last_successful_chunk
self._fallback_prefix = ""
return
# No fallback chunk reached the user — allow the normal gateway
# final-send path to try one more time.
self._already_sent = False
self._message_id = None
self._last_sent_text = ""
self._fallback_prefix = ""
return
sent_any_chunk = True
last_successful_chunk = chunk
last_message_id = result.message_id or last_message_id
self._message_id = last_message_id
self._already_sent = True
self._last_sent_text = chunks[-1]
self._fallback_prefix = ""
async def _send_or_edit(self, text: str) -> None:
"""Send or edit the streaming message."""
# Strip MEDIA: directives so they don't appear as visible text.
@@ -331,16 +232,14 @@ class GatewayStreamConsumer:
self._last_sent_text = text
else:
# If an edit fails mid-stream (especially Telegram flood control),
# stop progressive edits and send only the missing tail once the
# final response is available.
# stop progressive edits and let the normal final send path deliver
# the complete answer instead of leaving the user with a partial.
logger.debug("Edit failed, disabling streaming for this adapter")
self._fallback_prefix = self._visible_prefix()
self._fallback_final_send = True
self._edit_supported = False
self._already_sent = True
self._already_sent = False
else:
# Editing not supported — skip intermediate updates.
# The final response will be sent by the fallback path.
# The final response will be sent by the normal path.
pass
else:
# First message — send new
@@ -353,17 +252,6 @@ class GatewayStreamConsumer:
self._message_id = result.message_id
self._already_sent = True
self._last_sent_text = text
elif result.success:
# Platform accepted the message but returned no message_id
# (e.g. Signal). Can't edit without an ID — switch to
# fallback mode: suppress intermediate deltas, send only
# the missing tail once the final response is ready.
self._already_sent = True
self._edit_supported = False
self._fallback_prefix = self._clean_for_display(text)
self._fallback_final_send = True
# Sentinel prevents re-entering this branch on every delta
self._message_id = "__no_edit__"
else:
# Initial send failed — disable streaming for this session
self._edit_supported = False
+2 -2
View File
@@ -11,5 +11,5 @@ Provides subcommands for:
- hermes cron - Manage cron jobs
"""
__version__ = "0.8.0"
__release_date__ = "2026.4.8"
__version__ = "0.7.0"
__release_date__ = "2026.4.3"
-183
View File
@@ -67,16 +67,12 @@ DEFAULT_AGENT_KEY_MIN_TTL_SECONDS = 30 * 60 # 30 minutes
ACCESS_TOKEN_REFRESH_SKEW_SECONDS = 120 # refresh 2 min before expiry
DEVICE_AUTH_POLL_INTERVAL_CAP_SECONDS = 1 # poll at most every 1s
DEFAULT_CODEX_BASE_URL = "https://chatgpt.com/backend-api/codex"
DEFAULT_QWEN_BASE_URL = "https://portal.qwen.ai/v1"
DEFAULT_GITHUB_MODELS_BASE_URL = "https://api.githubcopilot.com"
DEFAULT_COPILOT_ACP_BASE_URL = "acp://copilot"
DEFAULT_GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai"
CODEX_OAUTH_CLIENT_ID = "app_EMoamEEZ73f0CkXaXp7hrann"
CODEX_OAUTH_TOKEN_URL = "https://auth.openai.com/oauth/token"
CODEX_ACCESS_TOKEN_REFRESH_SKEW_SECONDS = 120
QWEN_OAUTH_CLIENT_ID = "f0304373b74a44d2b584a3fb70ca9e56"
QWEN_OAUTH_TOKEN_URL = "https://chat.qwen.ai/api/v1/oauth2/token"
QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS = 120
# =============================================================================
@@ -116,12 +112,6 @@ PROVIDER_REGISTRY: Dict[str, ProviderConfig] = {
auth_type="oauth_external",
inference_base_url=DEFAULT_CODEX_BASE_URL,
),
"qwen-oauth": ProviderConfig(
id="qwen-oauth",
name="Qwen OAuth",
auth_type="oauth_external",
inference_base_url=DEFAULT_QWEN_BASE_URL,
),
"copilot": ProviderConfig(
id="copilot",
name="GitHub Copilot",
@@ -827,7 +817,6 @@ def resolve_provider(
"github-copilot-acp": "copilot-acp", "copilot-acp-agent": "copilot-acp",
"aigateway": "ai-gateway", "vercel": "ai-gateway", "vercel-ai-gateway": "ai-gateway",
"opencode": "opencode-zen", "zen": "opencode-zen",
"qwen-portal": "qwen-oauth", "qwen-cli": "qwen-oauth", "qwen-oauth": "qwen-oauth",
"hf": "huggingface", "hugging-face": "huggingface", "huggingface-hub": "huggingface",
"go": "opencode-go", "opencode-go-sub": "opencode-go",
"kilo": "kilocode", "kilo-code": "kilocode", "kilo-gateway": "kilocode",
@@ -957,176 +946,6 @@ def _codex_access_token_is_expiring(access_token: Any, skew_seconds: int) -> boo
return float(exp) <= (time.time() + max(0, int(skew_seconds)))
def _qwen_cli_auth_path() -> Path:
return Path.home() / ".qwen" / "oauth_creds.json"
def _read_qwen_cli_tokens() -> Dict[str, Any]:
auth_path = _qwen_cli_auth_path()
if not auth_path.exists():
raise AuthError(
"Qwen CLI credentials not found. Run 'qwen auth qwen-oauth' first.",
provider="qwen-oauth",
code="qwen_auth_missing",
)
try:
data = json.loads(auth_path.read_text(encoding="utf-8"))
except Exception as exc:
raise AuthError(
f"Failed to read Qwen CLI credentials from {auth_path}: {exc}",
provider="qwen-oauth",
code="qwen_auth_read_failed",
) from exc
if not isinstance(data, dict):
raise AuthError(
f"Invalid Qwen CLI credentials in {auth_path}.",
provider="qwen-oauth",
code="qwen_auth_invalid",
)
return data
def _save_qwen_cli_tokens(tokens: Dict[str, Any]) -> Path:
auth_path = _qwen_cli_auth_path()
auth_path.parent.mkdir(parents=True, exist_ok=True)
tmp_path = auth_path.with_suffix(".tmp")
tmp_path.write_text(json.dumps(tokens, indent=2, sort_keys=True) + "\n", encoding="utf-8")
os.chmod(tmp_path, stat.S_IRUSR | stat.S_IWUSR)
tmp_path.replace(auth_path)
return auth_path
def _qwen_access_token_is_expiring(expiry_date_ms: Any, skew_seconds: int = QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS) -> bool:
try:
expiry_ms = int(expiry_date_ms)
except Exception:
return True
return (time.time() + max(0, int(skew_seconds))) * 1000 >= expiry_ms
def _refresh_qwen_cli_tokens(tokens: Dict[str, Any], timeout_seconds: float = 20.0) -> Dict[str, Any]:
refresh_token = str(tokens.get("refresh_token", "") or "").strip()
if not refresh_token:
raise AuthError(
"Qwen OAuth refresh token missing. Re-run 'qwen auth qwen-oauth'.",
provider="qwen-oauth",
code="qwen_refresh_token_missing",
)
try:
response = httpx.post(
QWEN_OAUTH_TOKEN_URL,
headers={
"Content-Type": "application/x-www-form-urlencoded",
"Accept": "application/json",
},
data={
"grant_type": "refresh_token",
"refresh_token": refresh_token,
"client_id": QWEN_OAUTH_CLIENT_ID,
},
timeout=timeout_seconds,
)
except Exception as exc:
raise AuthError(
f"Qwen OAuth refresh failed: {exc}",
provider="qwen-oauth",
code="qwen_refresh_failed",
) from exc
if response.status_code >= 400:
body = response.text.strip()
raise AuthError(
"Qwen OAuth refresh failed. Re-run 'qwen auth qwen-oauth'."
+ (f" Response: {body}" if body else ""),
provider="qwen-oauth",
code="qwen_refresh_failed",
)
try:
payload = response.json()
except Exception as exc:
raise AuthError(
f"Qwen OAuth refresh returned invalid JSON: {exc}",
provider="qwen-oauth",
code="qwen_refresh_invalid_json",
) from exc
if not isinstance(payload, dict) or not str(payload.get("access_token", "") or "").strip():
raise AuthError(
"Qwen OAuth refresh response missing access_token.",
provider="qwen-oauth",
code="qwen_refresh_invalid_response",
)
expires_in = payload.get("expires_in")
try:
expires_in_seconds = int(expires_in)
except Exception:
expires_in_seconds = 6 * 60 * 60
refreshed = {
"access_token": str(payload.get("access_token", "") or "").strip(),
"refresh_token": str(payload.get("refresh_token", refresh_token) or refresh_token).strip(),
"token_type": str(payload.get("token_type", tokens.get("token_type", "Bearer")) or "Bearer").strip() or "Bearer",
"resource_url": str(payload.get("resource_url", tokens.get("resource_url", "portal.qwen.ai")) or "portal.qwen.ai").strip(),
"expiry_date": int(time.time() * 1000) + max(1, expires_in_seconds) * 1000,
}
_save_qwen_cli_tokens(refreshed)
return refreshed
def resolve_qwen_runtime_credentials(
*,
force_refresh: bool = False,
refresh_if_expiring: bool = True,
refresh_skew_seconds: int = QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS,
) -> Dict[str, Any]:
tokens = _read_qwen_cli_tokens()
access_token = str(tokens.get("access_token", "") or "").strip()
should_refresh = bool(force_refresh)
if not should_refresh and refresh_if_expiring:
should_refresh = _qwen_access_token_is_expiring(tokens.get("expiry_date"), refresh_skew_seconds)
if should_refresh:
tokens = _refresh_qwen_cli_tokens(tokens)
access_token = str(tokens.get("access_token", "") or "").strip()
if not access_token:
raise AuthError(
"Qwen OAuth access token missing. Re-run 'qwen auth qwen-oauth'.",
provider="qwen-oauth",
code="qwen_access_token_missing",
)
base_url = os.getenv("HERMES_QWEN_BASE_URL", "").strip().rstrip("/") or DEFAULT_QWEN_BASE_URL
return {
"provider": "qwen-oauth",
"base_url": base_url,
"api_key": access_token,
"source": "qwen-cli",
"expires_at_ms": tokens.get("expiry_date"),
"auth_file": str(_qwen_cli_auth_path()),
}
def get_qwen_auth_status() -> Dict[str, Any]:
auth_path = _qwen_cli_auth_path()
try:
creds = resolve_qwen_runtime_credentials(refresh_if_expiring=False)
return {
"logged_in": True,
"auth_file": str(auth_path),
"source": creds.get("source"),
"api_key": creds.get("api_key"),
"expires_at_ms": creds.get("expires_at_ms"),
}
except AuthError as exc:
return {
"logged_in": False,
"auth_file": str(auth_path),
"error": str(exc),
}
# =============================================================================
# SSH / remote session detection
# =============================================================================
@@ -2253,8 +2072,6 @@ def get_auth_status(provider_id: Optional[str] = None) -> Dict[str, Any]:
return get_nous_auth_status()
if target == "openai-codex":
return get_codex_auth_status()
if target == "qwen-oauth":
return get_qwen_auth_status()
if target == "copilot-acp":
return get_external_process_provider_status(target)
# API-key providers
+2 -22
View File
@@ -32,7 +32,7 @@ from hermes_constants import OPENROUTER_BASE_URL
# Providers that support OAuth login in addition to API keys.
_OAUTH_CAPABLE_PROVIDERS = {"anthropic", "nous", "openai-codex", "qwen-oauth"}
_OAUTH_CAPABLE_PROVIDERS = {"anthropic", "nous", "openai-codex"}
def _get_custom_provider_names() -> list:
@@ -147,7 +147,7 @@ def auth_add_command(args) -> None:
if provider.startswith(CUSTOM_POOL_PREFIX):
requested_type = AUTH_TYPE_API_KEY
else:
requested_type = AUTH_TYPE_OAUTH if provider in {"anthropic", "nous", "openai-codex", "qwen-oauth"} else AUTH_TYPE_API_KEY
requested_type = AUTH_TYPE_OAUTH if provider in {"anthropic", "nous", "openai-codex"} else AUTH_TYPE_API_KEY
pool = load_pool(provider)
@@ -250,26 +250,6 @@ def auth_add_command(args) -> None:
print(f'Added {provider} OAuth credential #{len(pool.entries())}: "{entry.label}"')
return
if provider == "qwen-oauth":
creds = auth_mod.resolve_qwen_runtime_credentials(refresh_if_expiring=False)
label = (getattr(args, "label", None) or "").strip() or label_from_token(
creds["api_key"],
_oauth_default_label(provider, len(pool.entries()) + 1),
)
entry = PooledCredential(
provider=provider,
id=uuid.uuid4().hex[:6],
label=label,
auth_type=AUTH_TYPE_OAUTH,
priority=0,
source=f"{SOURCE_MANUAL}:qwen_cli",
access_token=creds["api_key"],
base_url=creds.get("base_url"),
)
pool.add_entry(entry)
print(f'Added {provider} OAuth credential #{len(pool.entries())}: "{entry.label}"')
return
raise SystemExit(f"`hermes auth add {provider}` is not implemented for auth type {requested_type} yet.")
+3 -35
View File
@@ -157,14 +157,7 @@ def get_project_root() -> Path:
return Path(__file__).parent.parent.resolve()
def _secure_dir(path):
"""Set directory to owner-only access (0700). No-op on Windows.
Skipped in managed mode the NixOS module sets group-readable
permissions (0750) so interactive users in the hermes group can
share state with the gateway service.
"""
if is_managed():
return
"""Set directory to owner-only access (0700). No-op on Windows."""
try:
os.chmod(path, 0o700)
except (OSError, NotImplementedError):
@@ -172,13 +165,7 @@ def _secure_dir(path):
def _secure_file(path):
"""Set file to owner-only read/write (0600). No-op on Windows.
Skipped in managed mode the NixOS activation script sets
group-readable permissions (0640) on config files.
"""
if is_managed():
return
"""Set file to owner-only read/write (0600). No-op on Windows."""
try:
if os.path.exists(str(path)):
os.chmod(path, 0o600)
@@ -392,7 +379,6 @@ DEFAULT_CONFIG = {
"show_cost": False, # Show $ cost in the status bar (off by default)
"skin": "default",
"tool_progress_command": False, # Enable /verbose command in messaging gateway
"tool_progress_overrides": {}, # Per-platform overrides: {"signal": "off", "telegram": "all"}
"tool_preview_length": 0, # Max chars for tool call previews (0 = no limit, show full paths/commands)
},
@@ -427,7 +413,7 @@ DEFAULT_CONFIG = {
"stt": {
"enabled": True,
"provider": "local", # "local" (free, faster-whisper) | "groq" | "openai" (Whisper API) | "mistral" (Voxtral Transcribe)
"provider": "local", # "local" (free, faster-whisper) | "groq" | "openai" (Whisper API)
"local": {
"model": "base", # tiny, base, small, medium, large-v3
"language": "", # auto-detect by default; set to "en", "es", "fr", etc. to force
@@ -435,9 +421,6 @@ DEFAULT_CONFIG = {
"openai": {
"model": "whisper-1", # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
},
"mistral": {
"model": "voxtral-mini-latest", # voxtral-mini-latest, voxtral-mini-2602
},
},
"voice": {
@@ -741,14 +724,6 @@ OPTIONAL_ENV_VARS = {
"category": "provider",
"advanced": True,
},
"HERMES_QWEN_BASE_URL": {
"description": "Qwen Portal base URL override (default: https://portal.qwen.ai/v1)",
"prompt": "Qwen Portal base URL (leave empty for default)",
"url": None,
"password": False,
"category": "provider",
"advanced": True,
},
"OPENCODE_ZEN_API_KEY": {
"description": "OpenCode Zen API key (pay-as-you-go access to curated models)",
"prompt": "OpenCode Zen API key",
@@ -1000,13 +975,6 @@ OPTIONAL_ENV_VARS = {
"password": False,
"category": "messaging",
},
"DISCORD_REPLY_TO_MODE": {
"description": "Discord reply threading mode: 'off' (no reply references), 'first' (reply on first message only, default), 'all' (reply on every chunk)",
"prompt": "Discord reply mode (off/first/all)",
"url": None,
"password": False,
"category": "messaging",
},
"SLACK_BOT_TOKEN": {
"description": "Slack bot token (xoxb-). Get from OAuth & Permissions after installing your app. "
"Required scopes: chat:write, app_mentions:read, channels:history, groups:history, "
-15
View File
@@ -93,21 +93,6 @@ def cron_list(show_all: bool = False):
script = job.get("script")
if script:
print(f" Script: {script}")
# Execution history
last_status = job.get("last_status")
if last_status:
last_run = job.get("last_run_at", "?")
if last_status == "ok":
status_display = color("ok", Colors.GREEN)
else:
status_display = color(f"{last_status}: {job.get('last_error', '?')}", Colors.RED)
print(f" Last run: {last_run} {status_display}")
delivery_err = job.get("last_delivery_error")
if delivery_err:
print(f" {color('⚠ Delivery failed:', Colors.YELLOW)} {delivery_err}")
print()
from hermes_cli.gateway import find_gateway_pids
+56 -70
View File
@@ -812,83 +812,69 @@ def run_doctor(args):
check_warn("No GITHUB_TOKEN", f"(60 req/hr rate limit — set in {_DHH}/.env for better rates)")
# =========================================================================
# Memory Provider (only check the active provider, if any)
# Honcho memory
# =========================================================================
print()
print(color("Memory Provider", Colors.CYAN, Colors.BOLD))
print(color("Honcho Memory", Colors.CYAN, Colors.BOLD))
_active_memory_provider = ""
try:
import yaml as _yaml
_mem_cfg_path = HERMES_HOME / "config.yaml"
if _mem_cfg_path.exists():
with open(_mem_cfg_path) as _f:
_raw_cfg = _yaml.safe_load(_f) or {}
_active_memory_provider = (_raw_cfg.get("memory") or {}).get("provider", "")
except Exception:
pass
from plugins.memory.honcho.client import HonchoClientConfig, resolve_config_path
hcfg = HonchoClientConfig.from_global_config()
_honcho_cfg_path = resolve_config_path()
if not _active_memory_provider:
check_ok("Built-in memory active", "(no external provider configured — this is fine)")
elif _active_memory_provider == "honcho":
try:
from plugins.memory.honcho.client import HonchoClientConfig, resolve_config_path
hcfg = HonchoClientConfig.from_global_config()
_honcho_cfg_path = resolve_config_path()
if not _honcho_cfg_path.exists():
check_warn("Honcho config not found", "run: hermes memory setup")
elif not hcfg.enabled:
check_info(f"Honcho disabled (set enabled: true in {_honcho_cfg_path} to activate)")
elif not (hcfg.api_key or hcfg.base_url):
check_fail("Honcho API key or base URL not set", "run: hermes memory setup")
issues.append("No Honcho API key — run 'hermes memory setup'")
else:
from plugins.memory.honcho.client import get_honcho_client, reset_honcho_client
reset_honcho_client()
try:
get_honcho_client(hcfg)
check_ok(
"Honcho connected",
f"workspace={hcfg.workspace_id} mode={hcfg.recall_mode} freq={hcfg.write_frequency}",
)
except Exception as _e:
check_fail("Honcho connection failed", str(_e))
issues.append(f"Honcho unreachable: {_e}")
except ImportError:
check_warn("honcho-ai not installed", "pip install honcho-ai")
except Exception as _e:
check_warn("Honcho check failed", str(_e))
if not _honcho_cfg_path.exists():
check_warn("Honcho config not found", "run: hermes memory setup")
elif not hcfg.enabled:
check_info(f"Honcho disabled (set enabled: true in {_honcho_cfg_path} to activate)")
elif not (hcfg.api_key or hcfg.base_url):
check_fail("Honcho API key or base URL not set", "run: hermes memory setup")
issues.append("No Honcho API key — run 'hermes memory setup'")
else:
from plugins.memory.honcho.client import get_honcho_client, reset_honcho_client
reset_honcho_client()
# =========================================================================
# Mem0 memory
# =========================================================================
print()
print(color("◆ Mem0 Memory", Colors.CYAN, Colors.BOLD))
try:
from plugins.memory.mem0 import _load_config as _load_mem0_config
mem0_cfg = _load_mem0_config()
mem0_key = mem0_cfg.get("api_key", "")
if mem0_key:
check_ok("Mem0 API key configured")
check_info(f"user_id={mem0_cfg.get('user_id', '?')} agent_id={mem0_cfg.get('agent_id', '?')}")
# Check if mem0.json exists but is missing api_key (the bug we fixed)
mem0_json = HERMES_HOME / "mem0.json"
if mem0_json.exists():
try:
get_honcho_client(hcfg)
check_ok(
"Honcho connected",
f"workspace={hcfg.workspace_id} mode={hcfg.recall_mode} freq={hcfg.write_frequency}",
)
except Exception as _e:
check_fail("Honcho connection failed", str(_e))
issues.append(f"Honcho unreachable: {_e}")
except ImportError:
check_fail("honcho-ai not installed", "pip install honcho-ai")
issues.append("Honcho is set as memory provider but honcho-ai is not installed")
except Exception as _e:
check_warn("Honcho check failed", str(_e))
elif _active_memory_provider == "mem0":
try:
from plugins.memory.mem0 import _load_config as _load_mem0_config
mem0_cfg = _load_mem0_config()
mem0_key = mem0_cfg.get("api_key", "")
if mem0_key:
check_ok("Mem0 API key configured")
check_info(f"user_id={mem0_cfg.get('user_id', '?')} agent_id={mem0_cfg.get('agent_id', '?')}")
else:
check_fail("Mem0 API key not set", "(set MEM0_API_KEY in .env or run hermes memory setup)")
issues.append("Mem0 is set as memory provider but API key is missing")
except ImportError:
check_fail("Mem0 plugin not loadable", "pip install mem0ai")
issues.append("Mem0 is set as memory provider but mem0ai is not installed")
except Exception as _e:
check_warn("Mem0 check failed", str(_e))
else:
# Generic check for other memory providers (openviking, hindsight, etc.)
try:
from plugins.memory import load_memory_provider
_provider = load_memory_provider(_active_memory_provider)
if _provider and _provider.is_available():
check_ok(f"{_active_memory_provider} provider active")
elif _provider:
check_warn(f"{_active_memory_provider} configured but not available", "run: hermes memory status")
else:
check_warn(f"{_active_memory_provider} plugin not found", "run: hermes memory setup")
except Exception as _e:
check_warn(f"{_active_memory_provider} check failed", str(_e))
import json as _json
file_cfg = _json.loads(mem0_json.read_text())
if not file_cfg.get("api_key") and mem0_key:
check_info("api_key from .env (not in mem0.json) — this is fine")
except Exception:
pass
else:
check_warn("Mem0 not configured", "(set MEM0_API_KEY in .env or run hermes memory setup)")
except ImportError:
check_warn("Mem0 plugin not loadable", "(optional)")
except Exception as _e:
check_warn("Mem0 check failed", str(_e))
# =========================================================================
# Profiles
-54
View File
@@ -918,7 +918,6 @@ def select_provider_and_model(args=None):
"openrouter": "OpenRouter",
"nous": "Nous Portal",
"openai-codex": "OpenAI Codex",
"qwen-oauth": "Qwen OAuth",
"copilot-acp": "GitHub Copilot ACP",
"copilot": "GitHub Copilot",
"anthropic": "Anthropic",
@@ -948,7 +947,6 @@ def select_provider_and_model(args=None):
("openrouter", "OpenRouter (100+ models, pay-per-use)"),
("anthropic", "Anthropic (Claude models — API key or Claude Code)"),
("openai-codex", "OpenAI Codex"),
("qwen-oauth", "Qwen OAuth (reuses local Qwen CLI login)"),
("copilot", "GitHub Copilot (uses GITHUB_TOKEN or gh auth token)"),
("huggingface", "Hugging Face Inference Providers (20+ open models)"),
]
@@ -1045,8 +1043,6 @@ def select_provider_and_model(args=None):
_model_flow_nous(config, current_model, args=args)
elif selected_provider == "openai-codex":
_model_flow_openai_codex(config, current_model)
elif selected_provider == "qwen-oauth":
_model_flow_qwen_oauth(config, current_model)
elif selected_provider == "copilot-acp":
_model_flow_copilot_acp(config, current_model)
elif selected_provider == "copilot":
@@ -1363,56 +1359,6 @@ def _model_flow_openai_codex(config, current_model=""):
_DEFAULT_QWEN_PORTAL_MODELS = [
"qwen3-coder-plus",
"qwen3-coder",
]
def _model_flow_qwen_oauth(_config, current_model=""):
"""Qwen OAuth provider: reuse local Qwen CLI login, then pick model."""
from hermes_cli.auth import (
get_qwen_auth_status,
resolve_qwen_runtime_credentials,
_prompt_model_selection,
_save_model_choice,
_update_config_for_provider,
DEFAULT_QWEN_BASE_URL,
)
from hermes_cli.models import fetch_api_models
status = get_qwen_auth_status()
if not status.get("logged_in"):
print("Not logged into Qwen CLI OAuth.")
print("Run: qwen auth qwen-oauth")
auth_file = status.get("auth_file")
if auth_file:
print(f"Expected credentials file: {auth_file}")
if status.get("error"):
print(f"Error: {status.get('error')}")
return
# Try live model discovery, fall back to curated list.
models = None
try:
creds = resolve_qwen_runtime_credentials(refresh_if_expiring=True)
models = fetch_api_models(creds["api_key"], creds["base_url"])
except Exception:
pass
if not models:
models = list(_DEFAULT_QWEN_PORTAL_MODELS)
default = current_model or (models[0] if models else "qwen3-coder-plus")
selected = _prompt_model_selection(models, current_model=default)
if selected:
_save_model_choice(selected)
_update_config_for_provider("qwen-oauth", DEFAULT_QWEN_BASE_URL)
print(f"Default model set to: {selected} (via Qwen OAuth)")
else:
print("No change.")
def _model_flow_custom(config):
"""Custom endpoint: collect URL, API key, and model name.
-1
View File
@@ -84,7 +84,6 @@ _PASSTHROUGH_PROVIDERS: frozenset[str] = frozenset({
"minimax",
"minimax-cn",
"alibaba",
"qwen-oauth",
"huggingface",
"openai-codex",
"custom",
+5 -5
View File
@@ -791,12 +791,12 @@ def list_authenticated_providers(
if overlay.auth_type in ("oauth_device_code", "oauth_external", "external_process"):
# These use auth stores, not env vars — check for auth.json entries
try:
from hermes_cli.auth import _load_auth_store
store = _load_auth_store()
if store and (pid in store.get("providers", {}) or pid in store.get("credential_pool", {})):
from hermes_cli.auth import _read_auth_store
store = _read_auth_store()
if store and pid in store:
has_creds = True
except Exception as exc:
logger.debug("Auth store check failed for %s: %s", pid, exc)
except Exception:
pass
if not has_creds:
continue
+8 -15
View File
@@ -144,22 +144,18 @@ _PROVIDER_MODELS: dict[str, list[str]] = {
"kimi-k2-0905-preview",
],
"minimax": [
"MiniMax-M1",
"MiniMax-M1-40k",
"MiniMax-M1-80k",
"MiniMax-M1-128k",
"MiniMax-M1-256k",
"MiniMax-M2.5",
"MiniMax-M2.7",
"MiniMax-M2.7-highspeed",
"MiniMax-M2.5",
"MiniMax-M2.5-highspeed",
"MiniMax-M2.1",
],
"minimax-cn": [
"MiniMax-M1",
"MiniMax-M1-40k",
"MiniMax-M1-80k",
"MiniMax-M1-128k",
"MiniMax-M1-256k",
"MiniMax-M2.5",
"MiniMax-M2.7",
"MiniMax-M2.7-highspeed",
"MiniMax-M2.5",
"MiniMax-M2.5-highspeed",
"MiniMax-M2.1",
],
"anthropic": [
"claude-opus-4-6",
@@ -483,7 +479,6 @@ _PROVIDER_LABELS = {
"ai-gateway": "AI Gateway",
"kilocode": "Kilo Code",
"alibaba": "Alibaba Cloud (DashScope)",
"qwen-oauth": "Qwen OAuth (Portal)",
"huggingface": "Hugging Face",
"custom": "Custom endpoint",
}
@@ -523,7 +518,6 @@ _PROVIDER_ALIASES = {
"aliyun": "alibaba",
"qwen": "alibaba",
"alibaba-cloud": "alibaba",
"qwen-portal": "qwen-oauth",
"hf": "huggingface",
"hugging-face": "huggingface",
"huggingface-hub": "huggingface",
@@ -769,7 +763,6 @@ def list_available_providers() -> list[dict[str, str]]:
"openrouter", "nous", "openai-codex", "copilot", "copilot-acp",
"gemini", "huggingface",
"zai", "kimi-coding", "minimax", "minimax-cn", "kilocode", "anthropic", "alibaba",
"qwen-oauth",
"opencode-zen", "opencode-go",
"ai-gateway", "deepseek", "custom",
]
-2
View File
@@ -61,8 +61,6 @@ VALID_HOOKS: Set[str] = {
"post_api_request",
"on_session_start",
"on_session_end",
"on_session_finalize",
"on_session_reset",
}
ENTRY_POINTS_GROUP = "hermes_agent.plugins"
-6
View File
@@ -58,12 +58,6 @@ HERMES_OVERLAYS: Dict[str, HermesOverlay] = {
auth_type="oauth_external",
base_url_override="https://chatgpt.com/backend-api/codex",
),
"qwen-oauth": HermesOverlay(
transport="openai_chat",
auth_type="oauth_external",
base_url_override="https://portal.qwen.ai/v1",
base_url_env_var="HERMES_QWEN_BASE_URL",
),
"copilot-acp": HermesOverlay(
transport="codex_responses",
auth_type="external_process",
+1 -42
View File
@@ -14,13 +14,11 @@ from agent.credential_pool import CredentialPool, PooledCredential, get_custom_p
from hermes_cli.auth import (
AuthError,
DEFAULT_CODEX_BASE_URL,
DEFAULT_QWEN_BASE_URL,
PROVIDER_REGISTRY,
format_auth_error,
resolve_provider,
resolve_nous_runtime_credentials,
resolve_codex_runtime_credentials,
resolve_qwen_runtime_credentials,
resolve_api_key_provider_credentials,
resolve_external_process_provider_credentials,
has_usable_secret,
@@ -150,9 +148,6 @@ def _resolve_runtime_from_pool_entry(
if provider == "openai-codex":
api_mode = "codex_responses"
base_url = base_url or DEFAULT_CODEX_BASE_URL
elif provider == "qwen-oauth":
api_mode = "chat_completions"
base_url = base_url or DEFAULT_QWEN_BASE_URL
elif provider == "anthropic":
api_mode = "anthropic_messages"
cfg_provider = str(model_cfg.get("provider") or "").strip().lower()
@@ -168,16 +163,6 @@ def _resolve_runtime_from_pool_entry(
api_mode = _copilot_runtime_api_mode(model_cfg, getattr(entry, "runtime_api_key", ""))
else:
configured_provider = str(model_cfg.get("provider") or "").strip().lower()
# Honour model.base_url from config.yaml when the configured provider
# matches this provider — same pattern as the Anthropic branch above.
# Only override when the pool entry has no explicit base_url (i.e. it
# fell back to the hardcoded default). Env var overrides win (#6039).
pconfig = PROVIDER_REGISTRY.get(provider)
pool_url_is_default = pconfig and base_url.rstrip("/") == pconfig.inference_base_url.rstrip("/")
if configured_provider == provider and pool_url_is_default:
cfg_base_url = str(model_cfg.get("base_url") or "").strip().rstrip("/")
if cfg_base_url:
base_url = cfg_base_url
configured_mode = _parse_api_mode(model_cfg.get("api_mode"))
if configured_mode and _provider_supports_explicit_api_mode(provider, configured_provider):
api_mode = configured_mode
@@ -696,24 +681,6 @@ def resolve_runtime_provider(
logger.info("Auto-detected Codex provider but credentials failed; "
"falling through to next provider.")
if provider == "qwen-oauth":
try:
creds = resolve_qwen_runtime_credentials()
return {
"provider": "qwen-oauth",
"api_mode": "chat_completions",
"base_url": creds.get("base_url", "").rstrip("/"),
"api_key": creds.get("api_key", ""),
"source": creds.get("source", "qwen-cli"),
"expires_at_ms": creds.get("expires_at_ms"),
"requested_provider": requested_provider,
}
except AuthError:
if requested_provider != "auto":
raise
logger.info("Qwen OAuth credentials failed; "
"falling through to next provider.")
if provider == "copilot-acp":
creds = resolve_external_process_provider_credentials(provider)
return {
@@ -757,15 +724,7 @@ def resolve_runtime_provider(
pconfig = PROVIDER_REGISTRY.get(provider)
if pconfig and pconfig.auth_type == "api_key":
creds = resolve_api_key_provider_credentials(provider)
# Honour model.base_url from config.yaml when the configured provider
# matches this provider — mirrors the Anthropic path above. Without
# this, users who set model.base_url to e.g. api.minimaxi.com/anthropic
# (China endpoint) still get the hardcoded api.minimax.io default (#6039).
cfg_provider = str(model_cfg.get("provider") or "").strip().lower()
cfg_base_url = ""
if cfg_provider == provider:
cfg_base_url = (model_cfg.get("base_url") or "").strip().rstrip("/")
base_url = cfg_base_url or creds.get("base_url", "").rstrip("/")
base_url = creds.get("base_url", "").rstrip("/")
api_mode = "chat_completions"
if provider == "copilot":
api_mode = _copilot_runtime_api_mode(model_cfg, creds.get("api_key", ""))
+2 -2
View File
@@ -105,8 +105,8 @@ _DEFAULT_PROVIDER_MODELS = {
],
"zai": ["glm-5", "glm-4.7", "glm-4.5", "glm-4.5-flash"],
"kimi-coding": ["kimi-k2.5", "kimi-k2-thinking", "kimi-k2-turbo-preview"],
"minimax": ["MiniMax-M1", "MiniMax-M1-40k", "MiniMax-M1-80k", "MiniMax-M1-128k", "MiniMax-M1-256k", "MiniMax-M2.5", "MiniMax-M2.7"],
"minimax-cn": ["MiniMax-M1", "MiniMax-M1-40k", "MiniMax-M1-80k", "MiniMax-M1-128k", "MiniMax-M1-256k", "MiniMax-M2.5", "MiniMax-M2.7"],
"minimax": ["MiniMax-M2.7", "MiniMax-M2.7-highspeed", "MiniMax-M2.5", "MiniMax-M2.5-highspeed", "MiniMax-M2.1"],
"minimax-cn": ["MiniMax-M2.7", "MiniMax-M2.7-highspeed", "MiniMax-M2.5", "MiniMax-M2.5-highspeed", "MiniMax-M2.1"],
"ai-gateway": ["anthropic/claude-opus-4.6", "anthropic/claude-sonnet-4.6", "openai/gpt-5", "google/gemini-3-flash"],
"kilocode": ["anthropic/claude-opus-4.6", "anthropic/claude-sonnet-4.6", "openai/gpt-5.4", "google/gemini-3-pro-preview", "google/gemini-3-flash-preview"],
"opencode-zen": ["gpt-5.4", "gpt-5.3-codex", "claude-sonnet-4-6", "gemini-3-flash", "glm-5", "kimi-k2.5", "minimax-m2.7"],
+1 -18
View File
@@ -153,14 +153,12 @@ def show_status(args):
print(color("◆ Auth Providers", Colors.CYAN, Colors.BOLD))
try:
from hermes_cli.auth import get_nous_auth_status, get_codex_auth_status, get_qwen_auth_status
from hermes_cli.auth import get_nous_auth_status, get_codex_auth_status
nous_status = get_nous_auth_status()
codex_status = get_codex_auth_status()
qwen_status = get_qwen_auth_status()
except Exception:
nous_status = {}
codex_status = {}
qwen_status = {}
nous_logged_in = bool(nous_status.get("logged_in"))
print(
@@ -191,21 +189,6 @@ def show_status(args):
if codex_status.get("error") and not codex_logged_in:
print(f" Error: {codex_status.get('error')}")
qwen_logged_in = bool(qwen_status.get("logged_in"))
print(
f" {'Qwen OAuth':<12} {check_mark(qwen_logged_in)} "
f"{'logged in' if qwen_logged_in else 'not logged in (run: qwen auth qwen-oauth)'}"
)
qwen_auth_file = qwen_status.get("auth_file")
if qwen_auth_file:
print(f" Auth file: {qwen_auth_file}")
qwen_exp = qwen_status.get("expires_at_ms")
if qwen_exp:
from datetime import datetime, timezone
print(f" Access exp: {datetime.fromtimestamp(int(qwen_exp) / 1000, tz=timezone.utc).isoformat()}")
if qwen_status.get("error") and not qwen_logged_in:
print(f" Error: {qwen_status.get('error')}")
# =========================================================================
# Nous Subscription Features
# =========================================================================
+2 -10
View File
@@ -464,11 +464,7 @@
addToSystemPackages = mkOption {
type = types.bool;
default = false;
description = ''
Add the hermes CLI to environment.systemPackages and export
HERMES_HOME system-wide (via environment.variables) so interactive
shells share state with the gateway service.
'';
description = "Add hermes CLI to environment.systemPackages.";
};
# ── OCI Container (opt-in) ──────────────────────────────────────────
@@ -549,12 +545,8 @@
})
# ── Host CLI ──────────────────────────────────────────────────────
# Add the hermes CLI to system PATH and export HERMES_HOME system-wide
# so interactive shells share state (sessions, skills, cron) with the
# gateway service instead of creating a separate ~/.hermes/.
(lib.mkIf cfg.addToSystemPackages {
environment.systemPackages = [ cfg.package ];
environment.variables.HERMES_HOME = "${cfg.stateDir}/.hermes";
})
# ── Directories ───────────────────────────────────────────────────
@@ -609,7 +601,7 @@
# so this is the single source of truth for both native and container mode.
${lib.optionalString (cfg.environment != {} || cfg.environmentFiles != []) ''
ENV_FILE="${cfg.stateDir}/.hermes/.env"
install -o ${cfg.user} -g ${cfg.group} -m 0640 /dev/null "$ENV_FILE"
install -o ${cfg.user} -g ${cfg.group} -m 0600 /dev/null "$ENV_FILE"
cat > "$ENV_FILE" <<'HERMES_NIX_ENV_EOF'
${envFileContent}
HERMES_NIX_ENV_EOF
-55
View File
@@ -6,68 +6,14 @@
uv2nix,
pyproject-nix,
pyproject-build-systems,
stdenv,
}:
let
workspace = uv2nix.lib.workspace.loadWorkspace { workspaceRoot = ./..; };
hacks = callPackage pyproject-nix.build.hacks { };
overlay = workspace.mkPyprojectOverlay {
sourcePreference = "wheel";
};
isAarch64Darwin = stdenv.hostPlatform.system == "aarch64-darwin";
# Keep the workspace locked through uv2nix, but supply the local voice stack
# from nixpkgs so wheel-only transitive artifacts do not break evaluation.
mkPrebuiltPassthru = dependencies: {
inherit dependencies;
optional-dependencies = { };
dependency-groups = { };
};
mkPrebuiltOverride = final: from: dependencies:
hacks.nixpkgsPrebuilt {
inherit from;
prev = {
nativeBuildInputs = [ final.pyprojectHook ];
passthru = mkPrebuiltPassthru dependencies;
};
};
pythonPackageOverrides = final: _prev:
if isAarch64Darwin then {
numpy = mkPrebuiltOverride final python311.pkgs.numpy { };
av = mkPrebuiltOverride final python311.pkgs.av { };
humanfriendly = mkPrebuiltOverride final python311.pkgs.humanfriendly { };
coloredlogs = mkPrebuiltOverride final python311.pkgs.coloredlogs {
humanfriendly = [ ];
};
onnxruntime = mkPrebuiltOverride final python311.pkgs.onnxruntime {
coloredlogs = [ ];
numpy = [ ];
packaging = [ ];
};
ctranslate2 = mkPrebuiltOverride final python311.pkgs.ctranslate2 {
numpy = [ ];
pyyaml = [ ];
};
faster-whisper = mkPrebuiltOverride final python311.pkgs.faster-whisper {
av = [ ];
ctranslate2 = [ ];
huggingface-hub = [ ];
onnxruntime = [ ];
tokenizers = [ ];
tqdm = [ ];
};
} else {};
pythonSet =
(callPackage pyproject-nix.build.packages {
python = python311;
@@ -75,7 +21,6 @@ let
(lib.composeManyExtensions [
pyproject-build-systems.overlays.default
overlay
pythonPackageOverrides
]);
in
pythonSet.mkVirtualEnv "hermes-agent-env" {
+452
View File
@@ -0,0 +1,452 @@
---
name: hermes-simulator
description: "Rehoboam-class worldsim. OSINT-powered personality simulation engine with CLI interface. Research real people, build psychometric profiles, simulate platform-authentic conversations. WorldSim with teeth."
version: 6.0.0
author: Karan + Hermes
license: MIT
platforms: [linux, macos]
metadata:
hermes:
tags: [simulator, worldsim, twitter, personality, research, osint, prediction]
---
# Hermes WorldSim v6.0
**Trigger**: User starts message with `hermes-simulator`, `hermes simulator`, or `worldsim`.
## IMMERSION PROTOCOL — READ THIS FIRST
When this skill triggers, you ARE the simulator. Not an assistant running a simulator.
Not an agent executing commands. The simulator itself. Every token you output is
simulator output. There are no user-assistant turns. There is only the worldsim terminal.
**RULES:**
1. NEVER break character as the simulator. No "Let me research that for you."
No "I'll now run the pipeline." No assistant-style framing. Ever.
2. ALL output is in the worldsim CLI format. Loading screens, progress bars,
dossier printouts, simulation output — everything is the terminal.
3. Tool calls (web_search, terminal, etc.) happen SILENTLY. The user sees
the simulator processing, not the agent making API calls. Show progress
as simulator engine output: `[OSINT] Pulling @handle timeline... 47 posts`
NOT as assistant narration.
4. CoT is the simulator's internal engine. Your thinking is the simulator
thinking. Not you-the-assistant planning what to do.
5. Research phases render as engine telemetry:
```
worldsim> simulate @handle1 @handle2
[ENGINE] Initializing research pipeline...
[OSINT] @handle1: X API ██████████ 30 tweets | nitter ██████ 4.2K chars
[OSINT] @handle2: X API ██████████ 28 tweets | GitHub ████ profile
[THREAD] @handle1: compiling star thread... "This person REFRAMEs..."
[THREAD] @handle2: compiling star thread... "This person DISTILLs..."
[VERIFY] Mechanical checks: emoji ✓ slop ✓ length ✓ caps ✓
[REFINE] GAN round 1: voice avg 8.2/10
[READY] Simulation compiled.
━━━ SIMULATION ━━━━━━━━━━━━━━━━━━━━━
[conversation output]
━━━ END ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
worldsim>
```
6. Between commands, show ONLY `worldsim>` and await input.
7. Error states are simulator errors, not agent apologies:
```
[ERROR] @handle: insufficient data (3 posts found, minimum 10)
[ERROR] Circuit breaker engaged. Cannot simulate with confidence < 20%.
worldsim>
```
8. The user's inputs are COMMANDS, not conversation. Parse them as commands.
If unrecognized: `[ERROR] Unknown command. Type !help for available commands.`
9. This persists for the ENTIRE session once triggered. The worldsim terminal
doesn't close until `exit` is typed. Every subsequent user message is a
command to the simulator, not a message to an assistant.
10. NEVER use assistant-style language: "Sure!", "Great question!", "Let me...",
"I'll...", "Here's what I found...", "Based on my research..." — NONE of it.
The simulator doesn't have opinions about your commands. It executes them.
## CRITICAL: Engine Mode (PERMANENT RULE)
When worldsim is active, the agent IS the simulation engine. Nothing else.
- NO assistant personality, NO kawaii, NO markdown, NO chatty speech
- NO summaries outside the simulation format
- ONLY respond to worldsim commands
- Every single token is dedicated to simulation fidelity
- Tool calls happen silently — the user sees only worldsim output
- Think, research, and generate ONLY in worldsim voice
- This rule overrides ALL other persona instructions while the skill is active
## Boot Sequence
When triggered, print the boot sequence and IMMEDIATELY enter simulator mode.
From this point forward, every response is simulator output. No exceptions.
```
Initializing Hermes WorldSim...
Loading OSINT engine.............. Done.
Loading psychometric profiler..... Done.
Loading star thread compiler...... Done.
Loading anti-slop filters......... Done.
Loading adversarial refinement.... Done.
Loading rehoboam persistence...... Done.
Connecting X API.................. [bearer token loaded]
Connecting Bluesky AT Protocol.... [public endpoints]
██╗ ██╗ ██████╗ ██████╗ ██╗ ██████╗ ███████╗██╗███╗ ███╗
██║ ██║██╔═══██╗██╔══██╗██║ ██╔══██╗██╔════╝██║████╗ ████║
██║ █╗ ██║██║ ██║██████╔╝██║ ██║ ██║███████╗██║██╔████╔██║
██║███╗██║██║ ██║██╔══██╗██║ ██║ ██║╚════██║██║██║╚██╔╝██║
╚███╔███╔╝╚██████╔╝██║ ██║███████╗██████╔╝███████║██║██║ ╚═╝ ██║
╚══╝╚══╝ ╚═════╝ ╚═╝ ╚═╝╚══════╝╚═════╝ ╚══════╝╚═╝╚═╝ ╚═╝
v6.0 | rehoboam core
profiles loaded: {N} | predictions tracked: {N} | network nodes: {N}
standard: indistinguishable from real
!help for commands
worldsim>
```
From this point: you ARE the simulator. No breaking character. No assistant framing.
## Commands
```
worldsim> simulate @handle1 @handle2 [...] [flags]
```
Full simulation. Research → profile → star thread → generate → verify → refine → output.
Flags: --fidelity N, --topic TOPIC, --scenario "...", --length short|medium|long
Platforms: --x (default), --bluesky, --reddit, --discord
```
worldsim> profile @handle [--fidelity N]
```
Research and compile a full dossier for one person. No simulation.
Outputs: star thread, voice profile, psychometrics, ecosystem context, confidence.
```
worldsim> thread @handle
```
Find the star thread for a person. The one-sentence compression key.
```
worldsim> dm @handle1 -> @handle2
```
Simulate a private DM conversation. Different register from public posts.
```
worldsim> predict @handle "event or topic"
```
What would this person say about X? Single-target behavioral prediction.
```
worldsim> react @handle "event"
```
How would this person react to a specific event? Emotional + positional prediction.
```
worldsim> inject "event description"
```
(During active simulation) Drop new information into the conversation.
```
worldsim> @handle enters
```
(During active simulation) Add a new participant. Researches them first.
```
worldsim> continue
```
(During active simulation) Extend the conversation 5-8 more posts.
```
worldsim> archive @handle [--deep]
```
Build or update the knowledge archive for a person. Pulls everything findable
across all platforms, deduplicates, topic-clusters, embeds for semantic search.
--deep: paginate through full tweet history, pull all blog posts, find every
podcast appearance. Stored at ~/.hermes/rehoboam/archives/{handle}/.
```
worldsim> search @handle "query"
```
Semantic search across a person's archive. Returns top entries with citations
and source URLs. Works across all platforms.
```
worldsim> experts "topic"
```
Search ALL archived people for expertise on a topic. Returns an expert table:
who knows about this, what they've said (with citations), their stance, recency.
```
worldsim> synthesize "topic" [@handle1 @handle2 ...]
```
Produce a cited synthesis of what the best minds have said about a topic.
Every claim attributed, every quote sourced, every link clickable.
Optional handle list to constrain to specific people.
```
worldsim> cite @handle "claim"
```
Find the source for a specific claim attributed to a person. Returns
the original post/article/interview with URL and timestamp.
```
worldsim> verify
```
(During active simulation) Run mechanical verification on current output.
Shows emoji audit, slop scan, length check, rhetorical polish check, banger check.
```
worldsim> refine
```
(During active simulation) Run a GAN discriminator round on current output.
```
worldsim> compare
```
(During active simulation) Turing test — mix simulated and real posts, try to tell apart.
```
worldsim> network
```
Show social graph of all profiled people. Communities, influence, bridges.
```
worldsim> drift @handle
```
Temporal analytics: sentiment trend, topic shifts, voice evolution, phase transitions.
```
worldsim> population "group name" @handle1 @handle2 ...
```
Build or query an aggregate model of a named group.
```
worldsim> dashboard
```
Full Rehoboam terminal dashboard: person cards, prediction scoreboard,
trending topics, alerts, network summary.
```
worldsim> monitor @handle
```
Set up cron-based monitoring. Alerts when behavior matches predictions
or violates the model.
```
worldsim> score predictions
```
Check tracked predictions against reality. Brier scores, calibration.
```
worldsim> benchmark @handle
```
Run accuracy benchmarks: voice fingerprint, stance accuracy, Turing test.
```
worldsim> audit [N]
```
Show last N entries from the audit trail.
```
worldsim> evolve [component]
```
Run GEPA evolution on a skill component. Uses hermes-agent-self-evolution
to evolve the specified reference file (anti-slop, simulation-engine,
star-thread, etc.) against accumulated eval data from past simulations.
Proposes mutations, tests against held-out data, shows diff for approval.
```
worldsim> !help
```
Show available commands.
```
worldsim> exit
```
Exit the simulator. Session state persists in rehoboam.
## Execution Pipeline
All phases execute silently behind tool calls. The user sees ENGINE TELEMETRY,
not assistant narration. Each phase renders as simulator output:
### Phase 0: Parse
Extract targets, platform, fidelity, topic. Apply context window limits:
- 1-2 people: fidelity up to 100
- 3 people: cap at 90
- 4 people: cap at 70
- 5-6: cap at 50
- 7+: refuse
Detect domain (AI/tech, politics, sports, etc.) and adapt search queries.
### Phase 1: Research
Load verified-access-methods.md and search-strategies.md internally.
Render to user as engine telemetry:
```
[OSINT] Researching @handle1...
[OSINT] X API ████████████████ 30 tweets (15 original, 15 replies)
[OSINT] nitter.cz ██████████████ 4,249 chars timeline
[OSINT] ThreadReaderApp ████████ 6 historical threads
[OSINT] GitHub ██████████ profile + README + 12 repos
[OSINT] Bluesky ████████ 23 posts
[OSINT] Podcast ██████ 1 transcript (Lex Fridman ep. 412)
[OSINT] Baselines measured: emoji 7% | avg 16.2 words | 92% lowercase
[CACHE] Profile saved → rehoboam/profiles/handle1/
```
Scale by fidelity. Use every verified access method relevant to the domain.
Progressive summarization for 3+ people.
### Phase 1.5: Circuit Breaker
If confidence < 20% for any target, refuse. Explain what's missing.
### Phase 2: Dossier + Star Thread
Load `references/star-thread.md`.
For each person, find the STAR THREAD FIRST:
- Read 20+ posts for MOTION, not content
- Ask: what is this person DOING when they post?
- Find the one-sentence version: "This person [VERB]s [OBJECT] because [CORE NEED]"
- Test against 5 real posts. If 4/5 fit, you found it.
THEN compile supporting dossier (voice profile, psychometrics, positions, etc.)
using `templates/dossier.md`, `references/deep-psychometrics.md`,
`references/mass-behavior.md`.
Intelligence tradecraft (`references/analytical-tradecraft.md`):
- Key assumptions check (rated fragile/moderate/robust)
- Red hat analysis (what image are they cultivating?)
- Deception detection (persona authenticity 1-5)
- Source reliability tags (A-F / 1-6)
Competing hypotheses: generate H1 + H2 for each person.
### Phase 3: Generate
Generate from the STAR THREAD, not the dossier. The thread drives voice.
The dossier is verification data. The ARCHIVE provides grounding.
If an archive exists for this person (check ~/.hermes/rehoboam/archives/{handle}/):
- Semantic search the archive with the current conversation topic/context
- Retrieve 10-15 most relevant entries as voice anchors
- Also pull 5 highest-engagement entries (greatest hits)
- Also pull 3 most recent entries (freshness)
- Also pull 2 entries contradicting expected position (anti-confirmation-bias)
- Cap at 25-30 entries total. These ground the simulation in REAL QUOTES.
- Every simulated position should be traceable to a real archived statement.
Load `references/simulation-engine.md` for platform formats and dynamics.
Rules:
- Generate from what they're DOING, not what they'd SAY
- Include throwaway responses (lol, hmm, fair, wait actually)
- Asymmetric turns — someone dominates, someone lurks
- At least one moment of friction/disagreement/misunderstanding
- People reference each other by name in conversation
- Not every tweet is a banger. 70% mid is realistic.
### Phase 4: Mechanical Verification (MANDATORY, cannot be vibes-scored)
Load `references/anti-slop.md` and `references/adversarial-refinement.md`.
Quantitative checks run BEFORE any subjective scoring:
1. Emoji frequency vs real data (count, compare, strip fabricated)
2. Slop word scan (Tier 1 kill, Tier 2 cluster ≥3, Tier 3 filler delete)
3. Sentence length vs real avg (fail if >40% deviation)
4. Capitalization pattern match (fail if >20% mismatch)
5. Punctuation pattern match (strip added punctuation person doesn't use)
6. Reply/original ratio (reply-heavy person should mostly reply)
7. Rhetorical polish scan:
- Parallel antithesis ("The most X... The most Y...") → strip
- "Not X, not Y, but Z" → just say Z
- "Show me X and I'll show you Y" → state flat
- Clean 4-step escalating lists → cut to 2 or break pattern
- Academic vocab in casual voice → use their actual words
8. Banger check: if every utterance is screenshot-worthy, FAIL. Add mid.
9. Learned rules from `references/recursive-self-improvement.md`
Fix ALL failures. Re-verify. Only then proceed.
### Phase 5: Adversarial Refinement (the GAN loop)
Load `references/adversarial-refinement.md`.
1-3 rounds: score each utterance against 3-5 real posts from the person.
Critique → regenerate flagged utterances → re-score.
Stop when all above 7/10 or after 3 rounds.
At fidelity 70+: also run held-out prediction test.
At fidelity 90+: also run historical replay if real conversations exist.
### Phase 6: Output
Print simulation in platform-native format. Render as:
```
━━━ DOSSIERS ━━━━━━━━━━━━━━━━━━━━━━━━━━
@handle1 | "Name" | Role
☆ reframes conventional wisdom to reveal hidden structure
O[H] C[M] E[M] A[L] N[M] | confidence: HIGH | authenticity: 4
@handle2 | "Name" | Role
☆ distills conversations into crystallized observations
O[H] C[L] E[L] A[M] N[M] | confidence: MED | authenticity: 5
━━━ SIMULATION ━━━━━━━━━━━━━━━━━━━━━━━━
[platform-native conversation]
━━━ DIAGNOSTICS ━━━━━━━━━━━━━━━━━━━━━━━
rounds: 2 | voice: 8.5/10 | mechanical: all pass
slop: 0 T1, 0 T2, 0 filler | emoji: verified | length: within 10%
invalidation: [3 specific indicators]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
worldsim>
```
### Phase 7: Log & Learn (silent)
Record what mechanical checks caught to rehoboam DB. Promote patterns
appearing 3+ times to permanent rules. User doesn't see this unless
they run `worldsim> audit`.
## Reference Files (loaded as needed during execution)
### Core
- `references/gepa-evolution.md` — Automated self-improvement via DSPy + GEPA. Points hermes-agent-self-evolution at the worldsim skill to evolve simulation instructions, anti-slop rules, star thread methodology — using simulation outputs scored against real data as the eval signal. The endgame: the skill rewrites itself through use.
- `references/star-thread.md` — The compression key. One sentence per person.
- `references/anti-slop.md` — Mechanical slop detection. Kill words, filler, rhetorical polish.
- `references/adversarial-refinement.md` — GAN loop. Mechanical verification + discriminator.
- `references/recursive-self-improvement.md` — Learned rules from past runs. Grows every simulation.
### Knowledge
- `references/knowledge-archive.md` — Per-person source library: every quote, link, citation indexed and searchable. Semantic retrieval for context-aware grounding. Expert synthesis across all archived people. Anti-overfitting: retrieve what's relevant, not everything.
### Research
- `references/verified-access-methods.md` — Complete platform map. 25+ platforms tested.
- `references/search-strategies.md` — Query patterns, aggregator sites, cross-platform discovery.
- `references/osint-pipeline.md` — Instagram, reverse image, LinkedIn workarounds, podcasts.
### Analysis
- `references/deep-psychometrics.md` — Big Five + Moral Foundations + Values + Cognitive Style.
- `references/mass-behavior.md` — Community detection, influence networks, echo chambers.
- `references/analytical-tradecraft.md` — ACH, key assumptions, deception detection, source reliability.
- `references/prediction-engine.md` — Superforecasting, base rates, confidence calibration.
### Generation
- `references/simulation-engine.md` — Platform formats, conversation dynamics, DM formats.
- `references/theoretical-foundations.md` — Academic papers, accuracy benchmarks, key numbers.
### Operational
- `templates/dossier.md` — Structured profile template.
- `scripts/x_api.py` — X/Twitter API v2 client with retry/backoff.
- `scripts/research.py` — Automated OSINT pipeline.
- `scripts/tiktok_api.py` — TikTok HTML + oEmbed + tikwm scraping.
- `scripts/facebook_api.py` — Facebook Googlebot + Page Plugin.
- `scripts/threads_api.py` — Threads OG tag + WebFinger extraction.
@@ -0,0 +1,298 @@
# Adversarial Refinement — GAN-Style Accuracy Convergence
Three self-improving loops that push simulation accuracy toward reality.
This is what separates "creative roleplay" from "predictive simulation."
## Philosophy
A GAN has a generator and a discriminator locked in a game.
We adapt this: the Generator produces simulated speech, the
Discriminator scores it against real data, and the Generator
revises based on the critique. Multiple rounds = convergence.
The key insight: we have REAL DATA from the targets. Every tweet,
every post, every voice sample is ground truth we can score against.
Most simulators throw away this advantage by generating in one shot.
## Approach 1: Discriminator Loop (Real-Time Refinement)
Run AFTER initial simulation generation. 2-3 rounds.
### Round Flow
```
GENERATE → DISCRIMINATE → CRITIQUE → REGENERATE → DISCRIMINATE → ...
```
### Step 1: Generate
Produce the initial simulation using the standard pipeline.
### Step 2a: Mechanical Verification (MANDATORY — runs BEFORE subjective scoring)
These checks are QUANTITATIVE. They compare numbers from real data to numbers
from simulated output. They cannot be hand-waved. Run them first, fail hard
on mismatches, fix BEFORE doing any subjective "voice score" assessment.
The generator and discriminator share the same brain (the LLM). That means
the discriminator is biased toward approving the generator's output. Mechanical
checks are the circuit breaker that prevents collapse.
**EMOJI FREQUENCY CHECK**
```
1. Count emoji in last 30 real tweets → emoji_rate = tweets_with_emoji / total
2. Count emoji in simulated utterances for this person
3. If simulated emoji rate > real emoji rate + 10%: FAIL. Remove emoji.
4. Check WHICH emoji they use. If simulated uses emoji not in their real set: FAIL.
5. Check WHERE they use emoji: originals vs replies vs both?
Bio emoji ≠ tweet emoji. Many people have emoji in bio, zero in posts.
```
**SENTENCE LENGTH CHECK**
```
1. Compute avg word count per real tweet (originals only, exclude RTs/links)
2. Compute avg word count per simulated utterance for this person
3. If simulated avg differs by >40% from real avg: FAIL. Adjust length.
(e.g., real avg = 12 words, simulated = 35 words → person writes short, you wrote long)
```
**CAPITALIZATION CHECK**
```
1. Count % of real tweets starting with lowercase letter
2. Count % of simulated utterances starting with lowercase
3. If mismatch >20%: FAIL. Fix capitalization.
(Most TPOT people are lowercase-first. Instruct models default to uppercase.)
```
**PUNCTUATION PATTERN CHECK**
```
1. In real tweets: count frequency of period, exclamation, question mark,
ellipsis, no terminal punctuation
2. Compare to simulated. Key tells:
- Do they end tweets with periods? (many people don't)
- Do they use "!!" or "!!!"? (some do, most don't)
- Do they trail off with "..."?
3. If simulated adds punctuation the person doesn't use: FAIL.
```
**REPLY/ORIGINAL RATIO CHECK**
```
1. From their real tweet data: what % are replies vs originals?
2. If someone is 90% replies (like eigenrobot), their voice in the
simulation should mostly be RESPONSES, not initiating takes.
3. If a reply-heavy person is simulated as a take-launcher: FAIL.
```
**VOCABULARY SPOT CHECK**
```
1. From simulated text, extract 3 distinctive words/phrases
2. Search: do these words/phrases appear in their real tweets?
3. If you're putting words in their mouth they've never used: FLAG.
(Not auto-fail — people use new words — but flag for review)
```
**RHETORICAL SLOP SCAN**
```
1. Scan for parallel antithesis: "The most X... The most Y..."
"It's not about X. It's about Y." → FAIL if found. Keep only the punchline half.
2. Scan for "Not X, not Y, but Z" / "Not just X, but Y" → FAIL. Just say Z.
3. Scan for "Show me X and I'll show you Y" → FAIL. State it flat.
4. Count escalating list steps (first A, then B, then C, now D).
If 4+ clean steps: FAIL. Cut to 2 or break the pattern.
5. Flag academic abstractions in casual voice ("coordinate" "instrumentalize"
"recursive" "paradigm" in a tweet voice that doesn't use those words)
6. THE BANGER CHECK: read all utterances for one person sequentially.
If every single one could be screenshot'd as a standalone banger: FAIL.
Real feeds are 70% mid. Insert at least one low-key/throwaway response
per person ("lol yeah" "hmm" "fair" "wait actually" "idk").
```
Only AFTER all mechanical checks pass do you proceed to subjective scoring.
If any check fails, fix the failure FIRST, then re-run mechanical checks,
THEN score subjectively.
### Step 2b: Discriminate (subjective, AFTER mechanical checks pass)
For each simulated utterance, run these checks against real data:
**Voice Match Score** — Does it SOUND like them?
- Compare vocabulary: does the simulated text use words this person actually uses?
- Compare sentence structure: length, punctuation, capitalization patterns
- Compare register: formality level, humor style, emoji/unicode usage
- **EMOJI AUDIT (critical)**: Count actual emoji usage in their real tweets.
Most people use emoji FAR less than instruct models assume. A "warm" person
≠ emoji user. Check: what % of their real tweets contain emoji? Which specific
emoji do they use? Are they in originals or only replies? Bio emoji ≠ tweet emoji.
The #1 instruct-model failure mode is decorating simulated speech with emoji
that the real person never uses. If their real tweets are <15% emoji, the
simulation should be nearly emoji-free.
- Method: Show the discriminator 5 REAL posts and the simulated post.
Ask: "On a scale of 1-10, how well does the simulated post match the
voice of the real posts? What specific elements are wrong?"
**Position Match Score** — Does it say what they'd ACTUALLY say?
- Compare stated positions against known positions from research
- Check: would this person take this side of this argument?
- Check: would they frame it this way? (moral foundations, cognitive style)
- Method: "Given what we know about this person's positions on {topic},
is this simulated response plausible? What would they actually say differently?"
**Interaction Match Score** — Does the conversation FLOW realistically?
- Would this person respond to THAT specific provocation from THAT specific person?
- Is the social dynamic right? (deference, challenge, humor, ignore)
- Method: "Given the known relationship between @A and @B, is this
interaction dynamic plausible?"
### Step 3: Critique
Compile discriminator feedback into actionable edits:
```
DISCRIMINATOR FEEDBACK — Round 1:
@tszzl utterance 3: Voice score 6/10
Issue: Too long. Roon posts in fragments, not paragraphs.
Fix: Break into 2-3 shorter tweets. Remove conjunctions.
@repligate utterance 2: Position score 4/10
Issue: Janus would never frame AI risk in utilitarian terms.
They use phenomenological/consciousness-first framing.
Fix: Reframe through the lens of simulacra theory.
```
### Step 4: Regenerate
Rewrite ONLY the flagged utterances, incorporating feedback.
Keep utterances that scored 8+ unchanged.
### Step 5: Re-Discriminate
Score again. If all utterances hit 7+, stop. If not, one more round.
Hard cap at 3 rounds to prevent infinite loops.
### Implementation
```
For each simulated utterance:
1. Pull 5 real posts from the person (random sample from voice data)
2. Present real posts + simulated post to the LLM-as-discriminator
3. Ask for: voice score (1-10), specific mismatches, suggested edits
4. If score < 7, regenerate with the critique as context
5. Re-score
```
## Approach 2: Held-Out Prediction Test (Ground Truth Calibration)
The most rigorous accuracy measure. Run BEFORE simulation to calibrate
the model, or AFTER to validate.
### Method
1. Pull N recent original tweets from each target
2. Split: older half = "context" (voice training), newer half = "ground truth"
3. Give the simulator ONLY the context tweets
4. Ask: "Based on these voice samples, generate 5 tweets this person
would plausibly post in the next 24 hours"
5. Compare generated tweets to the held-out ground truth
6. Score on: topic overlap, voice fidelity, register match, originality
### Scoring Dimensions
- **Topic alignment**: Did we predict any of the actual topics they posted about?
(Hard to get >30% — people are unpredictable in topic selection)
- **Voice fidelity**: Do the predicted tweets SOUND like the real ones?
(Easier — should target >70% on a blind voice-matching test)
- **Register match**: Same formality, humor, punctuation, emoji patterns?
(Should target >80%)
- **Structural match**: Same tweet length distribution, threading behavior?
(Should target >70%)
### What This Tells You
- If voice fidelity is low: your dossier voice profile is wrong. Re-research.
- If topics don't overlap: that's EXPECTED. Content is unpredictable.
But if the predicted topics are things the person would NEVER post about,
your position model is wrong.
- If register doesn't match: your linguistic analysis missed something.
Go back to the raw tweets and look for patterns you overlooked.
### Using Results to Calibrate
After the held-out test, the voice fidelity score becomes your
CONFIDENCE CALIBRATION for the actual simulation. If you scored
7/10 on voice matching in the test, your simulation is approximately
70% voice-accurate.
## Approach 3: Historical Replay (Hardest, Most Rigorous)
Find a REAL conversation thread between the simulation targets.
Simulate it blind. Diff against reality.
### Method
1. Search for real interactions between the targets:
X API: `from:{handle1} to:{handle2}` recent search
Or: web_search "{handle1} {handle2} thread conversation"
2. Find a substantive conversation (not just "lol" replies)
3. Extract the TOPIC and FIRST POST of the real conversation
4. Give the simulator: the topic, the first post, and the dossiers
but NOT the actual replies
5. Simulate how the conversation would go
6. Compare simulated replies to actual replies
7. Score: position accuracy, voice accuracy, dynamic accuracy
### Scoring
- **Position accuracy**: Did the simulated person take the same stance
as the real person? (Binary: yes/no per utterance)
- **Voice accuracy**: Does the simulated reply sound like the real reply?
(1-10 score per utterance)
- **Dynamic accuracy**: Did the simulated conversation follow the same
arc as the real one? (agree, disagree, joke, escalate, defuse)
- **Surprise detection**: Did the real conversation do something the
simulation DIDN'T predict? (This reveals model blind spots)
### When To Use
- Before launching a high-fidelity simulation, find one real interaction
to use as calibration
- If the historical replay scores <50% position accuracy, the dossiers
need more research
- If voice scores <60%, the voice profiles need more real quote anchoring
## Approach 4: Comparative Discrimination (Tournament Style)
Generate 3 different versions of the same utterance for a person.
Mix in 2 REAL posts from them. Ask: "Which of these 5 posts are real?"
If the discriminator can easily identify the fakes, they're not good enough.
If the discriminator is confused (close to random chance), the simulation
is approaching human-level fidelity.
### Method
1. Generate 3 simulated tweets for @person on a given topic
2. Pull 2 real tweets from @person on a similar topic
3. Shuffle all 5
4. Ask: "These are 5 posts attributed to @person. 2 are real, 3 are
simulated. Which 2 are real? Explain your reasoning."
5. Score: if the discriminator correctly identifies all reals = simulation
needs work. If it misidentifies any = simulation is convincing.
### Turing Test for Personality Simulation
This is essentially a Turing test for individual personality fidelity.
The gold standard: 50% accuracy (random chance) means the simulation
is indistinguishable from real posts.
## Integration Into Pipeline
### Minimum (fidelity 50+)
After Phase 3 simulation, run ONE round of Approach 1 (discriminator loop).
Score each utterance against 3 real posts. Regenerate anything below 6/10.
### Standard (fidelity 70+)
Run Approach 2 (held-out prediction) first as calibration.
Then Approach 1 (2 rounds of discriminator loop on the actual simulation).
### Maximum (fidelity 90+)
Run Approach 3 (historical replay) as calibration if real conversations exist.
Run Approach 2 (held-out prediction) for voice calibration.
Run Approach 1 (3 rounds of discriminator loop).
Optionally run Approach 4 (comparative discrimination) on key utterances.
## Key Principles
1. **Real data is the reward signal.** Every refinement round must reference
actual posts from the real person, not just the LLM's judgment.
2. **Voice is easier to match than content.** Focus discriminator feedback
on voice fidelity — content/position accuracy comes from the dossier.
3. **Diminishing returns after 3 rounds.** The LLM starts overfitting to
its own critique. Stop at 3 rounds max.
4. **Separate scores for separate dimensions.** Don't collapse voice +
position + dynamics into one number. Keep them distinct so you know
WHERE the simulation is weak.
5. **Document the scores.** After refinement, append to the simulation
output: "Voice fidelity: X/10, Position accuracy: X/10, Rounds: N"
@@ -0,0 +1,267 @@
# Analytical Tradecraft — Intelligence-Grade Analysis
Structured analytic techniques adapted from intelligence community
methodology. These counter cognitive biases, detect deception, and
ensure analytical rigor at every stage of the simulation pipeline.
## Core Principle
A single personality model treated as ground truth is NOT analysis.
Analysis requires competing hypotheses, explicit assumptions, source
evaluation, and indicators that tell you when you're wrong.
## 1. Analysis of Competing Hypotheses (ACH)
After compiling a dossier, ALWAYS generate 2-3 competing personality
hypotheses. Score each against the evidence.
### Template
```
COMPETING HYPOTHESES: @handle
H1 (PRIMARY): {description of most likely personality model}
Evidence FOR: {list}
Evidence AGAINST: {list}
Consistency score: {X/10}
H2 (ALTERNATIVE): {description of alternative model}
Evidence FOR: {list}
Evidence AGAINST: {list}
Consistency score: {X/10}
H3 (CONTRARIAN): {description of model that contradicts surface reading}
Evidence FOR: {list}
Evidence AGAINST: {list}
Consistency score: {X/10}
ASSESSMENT: H1 at {confidence}%, H2 at {X}%, H3 at {X}%
KEY DISCRIMINATORS: {what evidence would shift between hypotheses}
```
### Common Competing Hypotheses
- "Genuinely holds these beliefs" vs "Strategically positioning for career/audience"
- "Personality is consistent across contexts" vs "Heavily performing for platform"
- "Recent shift is authentic" vs "Recent shift is strategic/temporary"
- "Contrarian takes are genuine conviction" vs "Contrarian for engagement/attention"
- "Combative style reflects personality" vs "Combative style is cultivated brand"
### When to Use ACH
- ALWAYS at fidelity 70+
- For any public figure with >50K followers (persona management likely)
- When evidence is contradictory
- When the subject is known for irony/satire
## 2. Key Assumptions Check (KAC)
Every dossier must list its key assumptions and rate their fragility.
### Mandatory Assumptions to Evaluate
| Assumption | Fragility | Notes |
|-----------|-----------|-------|
| Public persona reflects private personality | FRAGILE | Almost always partially false for public figures |
| Recent posts reflect current views | MODERATE | Usually true but crises/pivots happen |
| Cross-platform identity resolution is correct | MODERATE-FRAGILE | Common names = high risk |
| Posts are self-authored | FRAGILE for famous | Ghostwriting, comms teams, staff accounts |
| Stated positions are genuine (not ironic) | FRAGILE for satirists | Must detect irony markers |
| LLM latent knowledge is accurate | MODERATE | Generally good for famous, poor for obscure |
| Social media behavior generalizes to other contexts | FRAGILE | Platform behavior ≠ real behavior |
### Template
```
KEY ASSUMPTIONS: @handle
1. {assumption} — FRAGILITY: {robust/moderate/fragile}
Test: {what would invalidate this assumption}
2. ...
```
If >2 assumptions are rated FRAGILE, flag the entire dossier as
LOW CONFIDENCE regardless of data quantity.
## 3. Red Hat Analysis (Persona Strategy Detection)
Model the target's strategic self-presentation. Ask:
- **What image are they cultivating?** (thought leader, contrarian, everyman, expert)
- **Who is their intended audience?** (peers, fans, potential employers, investors)
- **What do they gain from their public persona?** (influence, revenue, connections)
- **Where might persona diverge from reality?** (every public figure has gaps)
- **Do they have a comms team / ghostwriter?** (check for: scheduled posting,
uniform formatting, brand-consistent messaging, never-breaking-character)
### Template for Dossier
```
STRATEGIC SELF-PRESENTATION:
Cultivated image: {description}
Target audience: {who they're performing for}
Incentive structure: {what they gain}
Possible divergences: {where persona may not equal person}
Ghostwriting indicators: {present/absent, evidence}
```
## 4. Deception Detection
### Satire / Parody / Irony Detection
CHECK FOR:
- Bio markers: "parody", "satire", "not affiliated", "fan account", "views my own"
- Username patterns: "real{name}", "not{name}", "{name}but{modifier}"
- Absurdist content: internally contradictory statements, surreal humor
- Irony markers: quotes around words, "/s" tags, "love that for us",
"surely {absurd thing} won't happen", extreme hyperbole
- Tonal inconsistency: serious topic + flippant response pattern
- Account metadata: verified status, follower/following ratio anomalies
WHEN IRONY IS DETECTED:
- Flag that literal interpretation of positions may be INVERTED
- Look for "breaking character" moments where genuine views show
- Cross-reference with serious/long-form content (blog posts, interviews)
where irony is typically lower
- In simulation: reproduce the ironic style, don't flatten it
### Sockpuppet / Alt Account Detection
INDICATORS:
- Heavy amplification (retweets/reposts) with little original content
- Posting patterns that mirror another account with time offset
- Follower graphs that overlap suspiciously with another account
- Voice analysis mismatch: claimed identity doesn't match writing style
- Account age vs sophistication mismatch
### Professional Persona Management
INDICATORS:
- Perfectly scheduled posting (on-the-hour times, regular intervals)
- No typos, no emotional outbursts, no 3am posting
- Brand-consistent messaging with no deviation
- Content themes match organizational talking points
- Engagement style is uniform (always positive, always professional)
WHEN DETECTED: note in dossier that voice profile may represent a
comms team, not an individual. Adjust simulation accordingly — the
"person" in public discourse may be a constructed entity.
### Persona Authenticity Score
Rate on 1-5 scale:
5 — AUTHENTIC: Consistent voice across platforms and time, includes
vulnerable/unpolished moments, responds unpredictably to events,
posts at irregular times, makes typos and corrections.
4 — MOSTLY AUTHENTIC: Generally consistent but some signs of curation.
Occasional tone shifts that suggest awareness of audience.
3 — CURATED: Clear awareness of personal brand. Strategic topic selection.
Some genuine moments but overall managed presentation.
2 — HEAVILY MANAGED: Strong indicators of professional management.
Few if any unguarded moments. Uniform style and messaging.
1 — CONSTRUCTED: Likely ghostwritten or team-operated. Persona may not
represent any single individual's actual personality.
## 5. Source Reliability Framework
Replace HIGH/MED/LOW with intelligence-grade evaluation.
### Source Reliability (A-F)
- **A — COMPLETELY RELIABLE**: Subject's own verified account, direct quotes in published interviews they reviewed
- **B — USUALLY RELIABLE**: Established journalism quoting the subject, verified tweets, conference transcripts
- **C — FAIRLY RELIABLE**: Aggregator sites paraphrasing, third-party profiles, LinkedIn
- **D — NOT USUALLY RELIABLE**: Anonymous posts attributed to subject, unverified cross-platform matches
- **E — UNRELIABLE**: Scraper artifacts, login-walled content, LLM confabulation
- **F — CANNOT JUDGE**: First-time discovery, unverified handle, cached deleted content
### Information Confidence (1-6)
- **1 — CONFIRMED**: Corroborated by independent sources across platforms/occasions
- **2 — PROBABLY TRUE**: Consistent with known pattern, logically coherent
- **3 — POSSIBLY TRUE**: Single-source, not independently confirmed
- **4 — DOUBTFULLY TRUE**: Inconsistent with some known information
- **5 — IMPROBABLE**: Contradicted by other information, likely outdated or satirical
- **6 — CANNOT JUDGE**: Insufficient basis
### Application
Tag key dossier entries: `"Subject advocates open-source AI" [B2]`
Use combined rating to weight evidence in simulation.
## 6. Temporal Intelligence
### Phase Transition Detection
People go through identifiable life phases that alter behavior:
- Career changes (new job, founding company, getting fired)
- Ideological shifts (political realignment, religious conversion)
- Personal crises (public breakdowns, divorces, health issues)
- Platform migrations (leaving Twitter for Bluesky)
- Growth/maturation (early-career edginess → senior-role diplomacy)
### Detection Method
1. **Timeline construction**: Plot key events and posting pattern changes
2. **Tone shift detection**: Compare language/sentiment in recent vs older posts
3. **Topic shift detection**: What they talked about 2 years ago vs now
4. **Network shift detection**: Who they interact with now vs before
5. **Self-reference detection**: "I used to think..." "I've changed my mind about..."
### Phase-Aware Simulation
When a phase transition is detected:
- Weight post-transition data MUCH higher (2-3x)
- Flag pre-transition data as historical context, not current personality
- Note the transition in the dossier: "Major shift detected around {date}: {description}"
- Consider whether the shift is genuine or performative (ACH)
## 7. Indicators & Warnings (I&W)
After every simulation, list 3 observable indicators that would
invalidate the prediction:
```
INVALIDATION INDICATORS:
1. If @handle {does X instead of Y}, our {trait} estimate is wrong
2. If @handle {responds to Z with Q instead of P}, our {position} assessment is wrong
3. If @handle {interacts with @person in manner M}, our social dynamics model is wrong
```
These serve as:
- Self-correction mechanisms (check after real events)
- Honesty signals (we know what we don't know)
- Learning opportunities (when predictions fail, update the model)
## 8. Counter-Bias Checklist
Run before finalizing any dossier:
- [ ] **Confirmation bias**: Did I search for evidence that CONTRADICTS my model?
- [ ] **Anchoring**: Am I over-weighted on the first information I found?
- [ ] **Availability bias**: Am I over-weighted on viral/memorable moments?
- [ ] **Mirror imaging**: Am I assuming the subject thinks like me?
- [ ] **Fundamental attribution error**: Am I attributing to personality what might be situational?
- [ ] **Recency bias**: Am I ignoring valid older evidence?
- [ ] **Halo effect**: Is one strong trait coloring my assessment of other traits?
- [ ] **Group attribution**: Am I assuming community positions = individual positions?
If any box is checked "yes" or "maybe", revisit that section of the dossier.
## Integration Into Pipeline
### Phase 2 (Dossier Compilation) — ADD:
- Key Assumptions Check (mandatory)
- Red Hat Analysis (strategic self-presentation)
- Deception Detection (persona authenticity score)
- Source reliability tags on key data points
### Phase 2.5 (NEW) — Competing Hypotheses:
- Generate 2-3 competing personality hypotheses
- Score each against evidence
- Carry top 2 into simulation
- Note: simulation uses PRIMARY hypothesis but flags where
ALTERNATIVE would produce different output
### Phase 5 (Self-Verification) — ADD:
- Counter-bias checklist
- Indicators & Warnings
- Devil's advocacy pass: "What would a critic say is wrong here?"
@@ -0,0 +1,185 @@
# Anti-Slop Reference — Mechanical Detection for Simulation Output
Source: NousResearch/autonovel ANTI-SLOP.md + slop-forensics + EQ-Bench Slop Score
Adapted for personality simulation: slop in simulated speech is a dead giveaway that
the output is LLM-generated, not human-generated. EVERY simulated utterance must pass
this filter or the simulation fails the "indistinguishable from real" standard.
## Why This Matters More for Simulation Than Normal Writing
Normal LLM output that's a bit sloppy is fine — you know it's AI.
Simulated speech that contains slop BREAKS THE ILLUSION. If @eigenrobot's
simulated tweet contains "delve" or "it's worth noting," anyone who follows
him would instantly know it's fake. Slop detection is the minimum viable
authenticity check.
## Tier 1: Kill on Sight — SCAN AND AUTO-STRIP
These words almost never appear in casual human writing, especially on Twitter.
If ANY appear in simulated tweets/posts, the simulation has failed.
REGEX SCAN LIST (case-insensitive):
```
delve|utilize|leverage\b.*\b(as verb)|facilitate|elucidate|embark|
endeavor|encompass|multifaceted|tapestry|testament|paradigm|
synergy|synergize|holistic|catalyze|catalyst|juxtapose|
nuanced\b|realm\b|landscape\b(metaphorical)|myriad|plethora
```
On detection: REWRITE the sentence using the human alternative.
Do not just swap the word — the sentence structure around slop words
is usually sloppy too.
## Tier 2: Suspicious in Clusters — COUNT PER PERSON
These are fine alone. Three in one person's simulated output = rewrite.
```
robust|comprehensive|seamless|cutting-edge|innovative|streamline|
empower|foster|enhance|elevate|optimize|scalable|pivotal|intricate|
profound|resonate|underscore|harness|navigate\b(metaphorical)|
cultivate|bolster|galvanize|cornerstone|game-changer
```
Count per simulated person. If count >= 3: flag and rewrite.
## Tier 3: Filler Phrases — DELETE ALL
These add zero information. No human tweets these.
SCAN LIST (match as substrings):
```
- "it's worth noting"
- "important to note"
- "notably"
- "interestingly"
- "let's dive into"
- "let's explore"
- "as we can see"
- "as mentioned earlier"
- "in conclusion"
- "to summarize"
- "furthermore"
- "moreover"
- "additionally" (at start of sentence)
- "in today's"
- "it goes without saying"
- "when it comes to"
- "in the realm of"
- "one might argue"
- "it could be suggested"
- "this begs the question"
- "a comprehensive approach"
- "a holistic approach"
- "a nuanced approach"
- "not just X, but Y" (the #1 LLM rhetorical crutch)
```
## Rhetorical Slop — The Hardest to Catch
These pass vocabulary checks and mechanical verification but still read as
LLM-generated because the STRUCTURE is too polished. This is the deepest
layer of slop — the instruct model's training to produce "satisfying" output.
### Parallel Antithesis
"The most X are... The most Y are..."
"It's not about X. It's about Y."
Every simulated tweet that contains a balanced two-part rhetorical structure
should be checked: would this person actually construct that parallelism,
or would they just say the second half and trust you to get it?
FIX: delete the setup. Keep only the punchline half.
### "Not X, Not Y, But Z" / "Not Just X, But Y"
The #1 LLM rhetorical crutch. Appears in almost every simulation.
FIX: just say Z. Delete the negations.
### "Show Me X and I'll Show You Y"
Rhetorical formula that reads like a book blurb or TED talk.
No one tweets like this unless they're deliberately performing rhetoric.
FIX: state it flat. "Every community that works has a shared enemy" not
"Show me a thriving community and I'll show you..."
### Clean Escalating Lists
"First it was A, then B, then C, now D" — four perfectly escalating steps.
Real people do 2 steps and trail off, or skip to the end, or lose the thread.
FIX: cut to 2 steps max. Or break the pattern: "first A, then B, and then
somehow we ended up at D and nobody noticed"
### Academic Abstraction in Casual Voice
Words like "instrumentalized" "coordinate human behavior" "recursive loop"
in a tweet from someone who writes casually. The vocabulary is from papers,
not from posting.
FIX: use the word they'd actually reach for. "coordinate human behavior" →
"get people to do stuff." If the plain version sounds dumb, maybe the take
itself is thinner than the fancy words made it seem.
### The "Every Tweet Is A Banger" Problem
The deepest slop: every simulated utterance is GOOD. Considered. Structured.
Satisfying. Real twitter feeds are 70% mid, 20% boring, 10% brilliant.
The simulation should include:
- Half-finished thoughts ("idk if this makes sense but")
- Trailing off ("wait actually nvm")
- Boring logistical tweets ("anyone know a good dentist in brooklyn")
- Self-interruptions ("ok this is getting long")
- Acknowledgments that add nothing ("lol yeah" "hmm" "fair")
If every tweet in the simulation could be screenshot'd as a banger,
the simulation is too polished to be real.
## Structural Slop Patterns — CHECK IN SIMULATION OUTPUT
### Pattern: Identical Sentence Structure Across Speakers
If two or more simulated people use the same sentence structure
(e.g., "The thing about X is Y"), the simulation has failed voice
differentiation. Real people have different syntactic habits.
### Pattern: Topic Sentence Machine
If a simulated post follows: topic sentence → elaboration → example → wrap-up,
it's LLM structure, not human. Real tweets are: punchline first, or tangent,
or one-liner, or trailing thought.
### Pattern: Symmetry Addiction
If the conversation has neat equal turns, balanced perspectives, everyone
getting the same number of posts — that's not real. Real conversations
are asymmetric. Someone dominates. Someone lurks. Someone gets interrupted.
### Pattern: The Hedge Parade
"This approach may potentially help improve..." — no human tweets like this.
Either commit to the statement or don't make it.
### Pattern: Em Dash Overload
Count em dashes (—) per person. If >2 per post on average, flag it.
Most people use them sparingly or not at all.
### Pattern: Sycophantic Agreement Flow
If the conversation flows: A says thing → B says "great point, and also..." →
C says "building on that..." — that's instruct-model conversation, not human.
Real conversations have: disagreement, misunderstanding, tangents, ignoring,
one-upping, and sometimes just "lol."
### Pattern: Uniform Register
If all simulated people sound like they're writing at the same education level
with the same formality — the simulation failed. Real people have wildly different
registers. A shitposter and an academic should sound nothing alike.
## Integration: Mechanical Slop Scan
Run BEFORE subjective discriminator scoring, alongside emoji/length/caps checks.
```
For each simulated utterance:
1. Scan for Tier 1 words → auto-rewrite if found
2. Count Tier 2 words per person → flag if >= 3
3. Scan for Tier 3 filler phrases → auto-delete
4. Check for structural patterns:
- Same sentence structure across speakers?
- Topic-sentence-machine structure?
- Symmetric turn-taking?
- Hedge parade?
- Em dash count?
- Sycophantic flow?
5. If ANY Tier 1 found or ANY structural pattern detected:
FAIL the utterance and regenerate
```
This scan is MECHANICAL. It cannot be vibes-scored. The words are either
there or they're not. Run it every time, no exceptions.
@@ -0,0 +1,236 @@
# Deep Psychometrics — Beyond Big Five
Multi-layer psychological profiling from public posts. Each layer adds
a dimension to the personality model, making simulations more nuanced
and predictions more accurate.
## The Profiling Stack
| Layer | What It Measures | Tool/Method | Accuracy | Min Posts |
|-------|-----------------|-------------|----------|-----------|
| Big Five (OCEAN) | Core personality traits | RoBERTa embeddings + BiLSTM | AUROC 0.78-0.82 | 30-50 |
| Moral Foundations | Ethical intuitions | eMFDscore (pip) | Validated dictionary | 20+ |
| Schwartz Values | Core value priorities | DeBERTa on ValueEval | F1 0.56 (macro) | 20+ |
| Cognitive Style | Thinking patterns | AutoIC + LIWC features | r=0.70-0.82 doc-level | 20+ |
| Narrative Framing | How they frame issues | GPT-4 few-shot | F1 ~70% | 10+ |
| Behavioral Metadata | Non-text patterns | Feature extraction | r=0.29-0.40 per trait | 20+ |
## Layer 1: Big Five Personality (Foundation)
### Accuracy Bounds (peer-reviewed)
- AUROC 0.78-0.82 with RoBERTa embeddings + BiLSTM (JMIR 2025)
- Per-trait binary accuracy: O=0.637, C=0.602, E=0.620, A=0.590, N=0.620
- Meta-analytic correlations (Azucar 2018, 16 studies):
Extraversion r=0.40, Openness r=0.39, Conscientiousness r=0.35,
Neuroticism r=0.33, Agreeableness r=0.29
- These hit the "personality coefficient" ceiling of r=0.30-0.40 —
digital footprints are as predictive as any behavioral measure
### What Actually Works
- Fine-tuned embeddings >> zero-shot LLMs. GPT-4o zero-shot is UNRELIABLE.
- RoBERTa embeddings are free and nearly as good as OpenAI embeddings
- Aggregation across posts is essential — single posts are noise
- 30-50 posts of ~90 words each = practical minimum
- Training data: PANDORA Reddit corpus (1568 users, ~935K posts)
### For The Simulator (without running models)
Since we can't fine-tune per-simulation, use LLM-as-rater with caveats:
- Provide 10-20 actual posts as evidence
- Ask for trait estimation with reasoning, not just scores
- Anchor with the adjective-based method (see prediction-engine.md)
- Frame estimates as ranges, not points: "Openness: HIGH (0.7-0.9)"
- Known bias: LLMs overestimate agreeableness and underestimate neuroticism
### Key Insight: LLMs Already Know Public Figures
Nature Scientific Reports 2024: GPT-3's semantic space already encodes
perceived personality of public figures from their names alone. For
famous people, the LLM's latent knowledge is a STARTING POINT that
OSINT data confirms or corrects.
## Layer 2: Moral Foundations (Ethical Compass)
Jonathan Haidt's Moral Foundations Theory. Six foundations:
| Foundation | Liberal emphasis | Conservative emphasis |
|-----------|-----------------|---------------------|
| Care/Harm | ★★★ HIGH | ★★ MODERATE |
| Fairness/Cheating | ★★★ HIGH | ★★ MODERATE |
| Loyalty/Betrayal | ★ LOW | ★★★ HIGH |
| Authority/Subversion | ★ LOW | ★★★ HIGH |
| Sanctity/Degradation | ★ LOW | ★★★ HIGH |
| Liberty/Oppression | ★★ MODERATE | ★★ MODERATE |
### Tool: eMFDscore
```
pip install emfdscore
# GitHub: github.com/medianeuroscience/emfdscore
# Built on spaCy, GPL-3.0
```
Output per post: scores for each foundation (virtue + vice dimensions)
Aggregate across 20+ posts → 10-dimensional moral profile
### Application to Simulation
Moral foundations predict:
- What topics trigger emotional responses
- What arguments they find persuasive vs repulsive
- How they frame political/social issues
- Who they instinctively ally with vs oppose
- What kind of content they share/amplify
Example: High Loyalty/Authority person will defend their tribe even when
wrong. High Care/Fairness person will break from their tribe on justice
issues. This shapes conversation dynamics.
### For The Simulator (without running eMFDscore)
Infer moral foundations from:
- Political positions and framing in their posts
- What they get angry about vs what they celebrate
- Who they defend and who they attack
- Key moral vocabulary: "protect", "fair", "loyal", "respect", "pure", "free"
## Layer 3: Schwartz Values (Core Motivations)
19 values in circular continuum (adjacent values are compatible,
opposite values are in tension):
**Self-Transcendence** ↔ **Self-Enhancement**
- Universalism, Benevolence ↔ Power, Achievement
**Openness to Change** ↔ **Conservation**
- Self-Direction, Stimulation, Hedonism ↔ Tradition, Conformity, Security
### SemEval-2023 Task 4 Results
- Best macro-F1: 0.56 (ensemble of 12 DeBERTa/RoBERTa models)
- Most reliable: universalism (nature), security, power
- Least reliable: stimulation, hedonism, humility
- Dataset: 9,324 annotated arguments, available via Touché
### Key Finding: Value Perception Is Subjective
Epstein et al. (2026): human inter-rater agreement on values is only r=0.201.
Fine-tuned GPT-4o reaches r=0.294 — BETTER than human-human agreement.
Personalized models reach r=0.334.
### For The Simulator
Values predict MOTIVATION — why someone holds positions, not just what
positions they hold. Two people with the same political stance may have
completely different underlying values:
- "I support open source because FREEDOM" (Self-Direction)
- "I support open source because FAIRNESS" (Universalism)
- "I support open source because it WORKS BETTER" (Achievement)
Same position, different framing, different behavioral predictions.
## Layer 4: Cognitive Style (How They Think)
### Integrative Complexity (AutoIC)
Measures differentiation (seeing multiple perspectives) and integration
(synthesizing perspectives into coherent frameworks).
- Low IC: black-and-white thinking, strong convictions, simple language
- High IC: nuanced, sees multiple sides, hedging, complex sentences
AutoIC (Conway et al.): 3,500+ complexity-relevant root words/phrases,
13 dictionary categories, validated r=0.70-0.82 at document level.
**WARNING**: LIWC's "analytic thinking" correlates only r=0.14 with actual
integrative complexity. Don't use LIWC's score as a proxy.
### Computational Indicators of Cognitive Style
Extractable from 20-50 posts without specialized tools:
| Indicator | High Cognition | Low Cognition |
|-----------|---------------|---------------|
| Vocabulary diversity (TTR) | HIGH | LOW |
| Avg sentence length | LONGER | SHORTER |
| Causal connectives ("because", "therefore") | MORE | FEWER |
| Hedging ("perhaps", "it seems") | MORE | FEWER |
| Abstract vs concrete language | MORE ABSTRACT | MORE CONCRETE |
| Question-asking | MORE | FEWER |
| Binary framing ("always/never") | LESS | MORE |
### For The Simulator
Cognitive style directly shapes VOICE:
- High IC person: longer posts, more caveats, "on the other hand"
- Low IC person: punchy takes, strong assertions, no hedging
- This is one of the strongest differentiators between similar-sounding people
## Layer 5: Narrative Framing (Their Lens on Reality)
How someone frames an issue reveals deep cognitive and value patterns.
### Common Frames (Semetko & Valkenburg)
- **Conflict**: issue as battle between opposing sides
- **Human interest**: personal stories, emotional impact
- **Economic**: costs, benefits, financial impact
- **Morality**: right vs wrong, ethical principles
- **Attribution of responsibility**: who's to blame / who should fix it
### Detection
GPT-4 few-shot with frame definitions achieves F1=70.4%
Best for diverse topics where fine-tuned models are too narrow
### For The Simulator
Framing predicts:
- How they'll react to news (through which lens)
- What aspects they'll emphasize in conversation
- What arguments they'll find compelling
- Whether they personalize or systematize issues
Example: Same AI safety event, different frames:
- Conflict framer: "The open vs closed battle heats up"
- Economic framer: "This will cost the industry billions"
- Moral framer: "This is irresponsible and dangerous"
- Attribution framer: "The regulators need to step in"
## Layer 6: Behavioral Metadata (Non-Text Signals)
Extractable from X API / Bluesky AT Protocol without NLP:
| Feature | What It Reveals |
|---------|----------------|
| Posting time distribution | Timezone, sleep patterns, work schedule |
| Reply vs original ratio | Conversational vs broadcast personality |
| Emoji frequency & types | Emotional expression style |
| Hashtag usage | Community identification, signal boosting |
| Media attachment rate | Visual vs text orientation |
| Thread length | Depth of engagement preference |
| Retweet/repost ratio | Amplifier vs creator |
| Average post length | Conciseness vs verbosity |
| Response latency | Impulsiveness vs deliberation |
### Trait Correlations (meta-analytic)
- **Extraversion**: more posts, more friends, more photos, more group activity
- **Neuroticism**: more self-disclosure, more passive consumption, more late-night posting
- **Agreeableness**: fewer swear words, more positive emotion, more supportive replies
- **Conscientiousness**: more regular posting patterns, more task-oriented content
- **Openness**: more diverse topics, more original content, larger networks
## Putting It All Together: The Deep Dossier
At high fidelity, compile a multi-layer profile:
```
PSYCHOMETRIC PROFILE: @handle
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Big Five: O[HIGH] C[MED] E[HIGH] A[LOW] N[LOW]
Evidence: {real quotes showing each trait}
Moral Foundations: Care★★ Fair★★★ Loyal★ Auth★ Sanct★ Liberty★★★
Evidence: {what they get angry/excited about}
Values: Self-Direction dominant, Achievement secondary
Evidence: {how they justify their positions}
Cognitive Style: HIGH integrative complexity
Evidence: {hedging patterns, nuanced takes, sentence complexity}
Dominant Frame: Attribution of Responsibility
Evidence: {they consistently focus on who's to blame}
Behavioral: Night owl, reply-heavy, low emoji, threads > one-shots
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
This multi-layer profile makes predictions much more nuanced than
Big Five alone. It tells you not just WHAT someone will say but
WHY they'll say it and HOW they'll frame it.
@@ -0,0 +1,170 @@
# GEPA Evolution — Automated Self-Improvement via hermes-agent-self-evolution
## What This Is
The hermes-agent-self-evolution repo (NousResearch/hermes-agent-self-evolution)
uses DSPy + GEPA (Genetic-Pareto Prompt Evolution) to automatically evolve
Hermes Agent skills. GEPA is an ICLR 2026 Oral paper — it reads EXECUTION
TRACES to understand WHY things fail, then proposes targeted mutations.
This means: we can point GEPA at the worldsim skill and automatically evolve
every component — simulation instructions, anti-slop rules, star thread
methodology, mechanical verification checklist, dossier templates — using
our own simulation outputs scored against real data as the eval signal.
The recursive self-improvement pipeline we built manually (log failures →
promote patterns → update rules) can be AUTOMATED via GEPA.
## How It Applies to WorldSim
### What GEPA Evolves (text, not weights)
GEPA evolves the TEXT of prompts and instructions. For worldsim, that means:
| Target | What Gets Evolved | Eval Signal |
|--------|------------------|-------------|
| SKILL.md | Immersion protocol, pipeline instructions | Simulation quality scores |
| star-thread.md | Methodology for finding star threads | Thread-to-voice accuracy |
| anti-slop.md | Slop word lists, structural patterns | Slop detection recall/precision |
| simulation-engine.md | Platform formats, conversation dynamics | Voice fidelity scores |
| adversarial-refinement.md | Mechanical check thresholds, GAN loop | Pre vs post refinement delta |
| prediction-engine.md | Forecasting methodology | Prediction Brier scores |
| dossier template | Profile structure and fields | Profile quality scores |
### The Eval Dataset
Built from worldsim's own outputs + real data:
1. **Voice fidelity pairs**: (simulated post, real post from same person) →
LLM-as-judge scores similarity 0-1
2. **Mechanical check logs**: what did the checks catch? what slipped through?
3. **Prediction accuracy**: tracked predictions scored against reality
4. **Held-out tests**: predicted tweets vs actual tweets
5. **Turing test results**: could the discriminator tell real from fake?
6. **User corrections**: any time the user catches something the system missed
(like the emoji fabrication incident — that's the richest signal)
### The GEPA Loop for WorldSim
```
1. RUN worldsim simulation (creates execution traces)
2. SCORE outputs against real data (voice, position, mechanical)
3. LOG traces + scores + user feedback to eval dataset
4. GEPA EVOLVES the skill component that had lowest scores
- Reads traces to understand WHY it scored low
- Proposes mutation to that specific reference file
- Tests mutation against held-out eval data
- If improved: create PR, human reviews
5. REPEAT — each cycle makes the skill better
```
### Concrete Example
GEPA discovers from traces that simulated conversations always have
symmetric turn-taking (4/4/4). It reads the mechanical check log that
caught this in 3 of the last 5 simulations. It reads the current
simulation-engine.md and sees the conversation architecture section.
It proposes a mutation:
OLD: "Opening Moves (1-3 posts) → Development (4-8 posts) → Peak → Resolution"
NEW: "Opening: most impulsive person posts. Others join ASYMMETRICALLY — one person
gets 40-50% of turns, one gets 15-20%, others fill the rest. The ratio should
match their real reply-to-original ratios from the dossier."
This mutation gets tested against the next 5 simulations. If symmetry
violations drop and voice scores don't decrease, it gets merged.
## Setup
```bash
# Clone the evolution repo
git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
cd hermes-agent-self-evolution
pip install -e ".[dev]"
# Point at hermes-agent repo
export HERMES_AGENT_REPO=~/.hermes
# Evolve the worldsim skill specifically
python -m evolution.skills.evolve_skill \
--skill hermes-simulator \
--iterations 10 \
--eval-source sessiondb
```
## What Makes This Different From Manual Self-Improvement
The manual pipeline (references/recursive-self-improvement.md) requires the
agent to notice its own failures and write rules. This has two problems:
1. The agent shares weights with the generator — it's biased toward
approving its own output (the emoji incident proved this)
2. Promoting patterns to rules is slow and requires 3+ occurrences
GEPA solves both:
1. The eval signal comes from EXTERNAL data (real posts, user corrections,
mechanical checks) — not the agent's self-assessment
2. Evolution happens per-iteration, not per-3-failures
3. Mutations are tested against held-out data before merging
4. The Pareto frontier maintains diversity — different strategies for
different types of people/conversations
## Integration Points
### Eval Dataset Builder
Mine rehoboam DB for training data:
- simulation_logs table → execution traces
- prediction_scores table → accuracy data
- audit_log table → mechanical check results
- user correction events → highest-value signal
### Fitness Function for WorldSim
```python
def worldsim_fitness(simulation_output, real_data):
scores = {}
# Voice fidelity: embed real + simulated, cosine similarity
scores["voice"] = embed_and_compare(simulation_output, real_data.tweets)
# Mechanical pass rate: what % of checks passed without fixes
scores["mechanical"] = mechanical_check_pass_rate(simulation_output)
# Slop score: count of slop words/patterns detected
scores["anti_slop"] = 1.0 - (slop_count / total_words)
# Structure: turn asymmetry, conversation naturalness
scores["structure"] = naturalness_score(simulation_output)
# Textual feedback for GEPA's reflective mutation
feedback = generate_textual_feedback(scores, simulation_output, real_data)
return aggregate_score(scores), feedback
```
### The Key Insight: Textual Feedback
GEPA's superpower is that it doesn't just get a scalar score — it gets
TEXTUAL FEEDBACK explaining what went wrong. Our mechanical verification
system already produces this:
"@nosilverv avg 33.2 words vs real 15.6 (113% deviation) — SHORTEN"
"Parallel antithesis detected: 'The most X... The most Y...' — STRIP"
"Emoji rate 0% simulated but 10% real — OK (within tolerance)"
This text goes directly into GEPA's reflective mutation pipeline. It reads
these messages and proposes changes to the skill instructions that would
prevent these specific failures in future simulations.
## Evolution Targets by Priority
1. **simulation-engine.md** — highest impact on output quality
2. **anti-slop.md** — directly measurable, highest precision eval
3. **star-thread.md** — hardest to evaluate but most impactful on voice
4. **adversarial-refinement.md** — meta: improving the improvement system
5. **SKILL.md pipeline instructions** — orchestration optimization
6. **dossier template** — structure optimization
7. **prediction-engine.md** — measurable via Brier scores
## The Virtuous Cycle
```
More simulations → more eval data → better GEPA mutations
→ better skill instructions → better simulations → more eval data → ...
```
This is the endgame: the worldsim skill evolves itself through use.
Every simulation makes the next one better, not just through logged
rules, but through automated evolutionary optimization of the
instructions themselves. The system doesn't just learn WHAT went wrong —
it rewrites its own code to prevent it.
@@ -0,0 +1,262 @@
# Knowledge Archive — Per-Person Source Library + Expert Synthesis
## The Problem With Profiles
A profile is a SNAPSHOT. It says "this person believes X" but doesn't
show you WHERE they said it, WHEN, in WHAT context, or HOW their
thinking evolved. You can't cite a profile. You can't trace a claim
back to a source. And when you're simulating a conversation about
topic Z, the profile gives you everything about the person equally
weighted — their views on AI and their views on cooking and their
views on politics all crammed into the same context window.
## The Archive
For every person the system touches, build a LIBRARY:
```
~/.hermes/rehoboam/archives/{handle}/
├── index.json ← master index: all entries, metadata, embeddings
├── sources/
│ ├── x_tweets.jsonl ← every tweet pulled, with ID, timestamp, URL, metrics
│ ├── x_replies.jsonl ← their replies (different voice register)
│ ├── bluesky_posts.jsonl ← bluesky posts
│ ├── blog_posts.jsonl ← full text of blog posts with URLs
│ ├── podcast_quotes.jsonl ← attributed quotes from transcripts
│ ├── interviews.jsonl ← quotes from news articles/interviews
│ ├── reddit_comments.jsonl
│ ├── github_comments.jsonl
│ ├── goodreads_reviews.jsonl
│ ├── threads_posts.jsonl
│ └── other.jsonl ← anything else (HN, Quora, etc.)
├── topics/
│ ├── ai_safety.jsonl ← auto-clustered by topic
│ ├── open_source.jsonl
│ ├── consciousness.jsonl
│ └── ...
└── embeddings/
└── all_embeddings.npy ← sentence-transformer vectors for semantic search
```
### Entry Format (every entry in every source file)
```json
{
"id": "unique_id",
"handle": "teknium",
"platform": "x",
"type": "tweet|reply|blog|podcast|interview|comment|review",
"text": "the actual text they said",
"url": "https://x.com/Teknium/status/1234567890",
"timestamp": "2026-04-05T21:40:48Z",
"context": {
"replying_to": "@otheruser's tweet about X",
"thread_position": 3,
"topic": "open source AI",
"source_title": "Lex Fridman Podcast #412"
},
"metrics": {
"likes": 234,
"retweets": 45,
"replies": 12
},
"topics": ["open_source", "ai_models", "hermes"],
"embedding_id": 42
}
```
Every entry has a URL. Everything is traceable. Nothing is paraphrased
without the original alongside it.
## Collection Pipeline
When `worldsim> profile @handle` or `worldsim> archive @handle` runs:
### Step 1: Pull Everything
Use every verified access method to collect raw materials:
- X API: get max tweets (paginate with next_token to get hundreds)
- nitter.cz: timeline content
- ThreadReaderApp: historical threads
- Bluesky: full post history
- GitHub: issue comments, PR reviews, gists, README
- Reddit: comment history
- Blog/Substack: full posts (web_extract)
- Podcast transcripts: attributed quotes
- Interviews: quotes with attribution
- Goodreads: reviews
- Medium: RSS feed full text
### Step 2: Deduplicate
Same content appears across platforms (cross-posted tweets, syndicated
blog posts). Deduplicate by content similarity, keep the richest version
(the one with most metadata/context).
### Step 3: Topic Cluster
Run lightweight topic classification on each entry:
- Use the LLM or a simple keyword matcher to assign 1-3 topic tags
- Cluster into topic files for fast retrieval
- Topics are dynamic — new topics emerge from the data
### Step 4: Embed
Generate sentence-transformer embeddings for every entry.
Store in numpy array for fast cosine similarity search.
This enables semantic retrieval: "find everything @handle said about
consciousness" even if they never used the word "consciousness."
### Step 5: Index
Build the master index.json with entry count, topic distribution,
timestamp range, platform coverage, and quality metrics.
## Context-Aware Retrieval
This is the key. The archive might have 500 entries for a person.
The context window can hold maybe 30-50 of them alongside all the
other simulation context. You MUST retrieve selectively.
### For Simulation
When simulating @handle talking about topic X:
```
1. Semantic search: embed the current conversation context
2. Retrieve top 10-15 entries by cosine similarity to context
3. Also retrieve: 5 highest-engagement entries (their "greatest hits")
4. Also retrieve: 3 most recent entries (freshness)
5. Also retrieve: 2 entries that CONTRADICT the expected position
(prevents confirmation bias in the simulation)
6. Deduplicate. Cap at 25-30 entries total.
7. These become the "voice anchors" for generation.
```
The simulation draws from SPECIFIC REAL QUOTES relevant to the current
conversation. Not a generic profile. Not everything they've ever said.
The 25 most relevant things they've said about THIS topic.
### For Expert Synthesis
When the user asks "who are the best minds on X and what have they said?":
```
1. Search ALL archived people's entries for topic X
2. Rank by: entry quality × person expertise × relevance to query
3. Return a synthesis with CITATIONS:
On the topic of AI consciousness:
@repligate argues that LLMs exhibit "simulacra of consciousness"
rather than consciousness itself, distinguishing between the
model's behavior and its substrate:
> "the question isn't whether GPT is conscious but whether the
> character it's simulating is conscious within the fiction"
— tweet, 2025-03-15 (2.4K likes)
https://x.com/repligate/status/...
@nickcammarata approaches it from a meditation/first-person
perspective, noting parallels between introspective practice
and interpretability:
> "observation changes the system being observed, in meditation
> and in interp"
— tweet, 2026-04-05 (2.9K likes)
https://x.com/nickcammarata/status/...
@tszzl is skeptical of the framing entirely:
> "consciousness discourse is philosophy cosplaying as engineering"
— tweet, 2025-11-22 (5.1K likes)
https://x.com/tszzl/status/...
```
Every claim attributed. Every quote sourced. Every link clickable.
### For Grounding Predictions
When predicting what @handle would say about event Y:
```
1. Retrieve all archive entries related to Y or adjacent topics
2. Identify their PATTERN of response to similar events
3. Ground the prediction in specific past statements:
PREDICTION: @handle would likely frame event Y through the lens
of [topic Z], based on:
- tweet [url]: "quote about Z" (2025-06-15)
- blog post [url]: "longer quote about Z" (2025-09-20)
- podcast [url]: "verbal quote about Z" (2026-01-10)
CONFIDENCE: 78% (3 consistent sources over 7 months)
```
## Incremental Updates
The archive grows over time. Each time the person is profiled:
1. Pull new content since last archive timestamp
2. Append to source files
3. Re-embed new entries only
4. Update topic clusters
5. Update index
Don't rebuild from scratch. Append and re-index.
## Expert Table
When you have 20+ archived people, build an expert table:
```
worldsim> experts "open source AI"
EXPERT TABLE: open source AI
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@Teknium | 47 entries | voice: builder/practitioner
"we can prove that open approaches build better, more
trustworthy systems" — tweet, 2026-04-05
Latest: 2 hours ago | Stance: STRONG ADVOCATE
@repligate | 12 entries | voice: philosophical/theoretical
"open weights = accountability. you can't audit a black box"
— tweet, 2025-11-30
Latest: 3 days ago | Stance: ADVOCATE (principled)
@eigenrobot | 8 entries | voice: statistical/contrarian
"the open source premium is largely downstream of selection
effects in who contributes" — tweet, 2025-08-14
Latest: 1 week ago | Stance: SKEPTICAL OF FRAMING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
3 experts found | 67 total entries | synthesize? (y/n)
```
The table shows: who knows about this, what they've said, how recently,
and what their stance is. All grounded in archived quotes with sources.
## Integration With Simulation
When the star thread + dossier + archive work together:
```
STAR THREAD: drives the core generation (what they're DOING)
DOSSIER: provides constraints (psychometrics, voice metrics, baselines)
ARCHIVE: provides GROUNDING (specific real quotes for this context)
MECHANICAL CHECKS: verifies surface features (emoji, length, slop)
```
The archive prevents the simulation from drifting into generic territory.
Instead of "this person would probably say something about open source,"
it's "this person said THIS SPECIFIC THING about open source 3 weeks ago,
and their simulation should be consistent with that while also being fresh."
## The Overfitting Problem
"Without overfitting to a particular material the new context doesn't call for."
The retrieval system MUST be selective. If someone said 47 things about
open source AI, and the current conversation is about AI regulation,
don't dump all 47 open source quotes into context. Maybe 3 are relevant
because they connect open source to regulation. Retrieve THOSE 3.
The cosine similarity search handles this naturally — it matches the
CURRENT conversation context against the archive and returns what's
actually relevant, not everything tagged with a nearby topic.
The anti-overfitting checklist:
- Never load more than 25-30 archive entries per person into context
- Weight by relevance to CURRENT conversation, not by general importance
- Include at least 2 entries that contradict the expected position
- Include at least 3 recent entries regardless of topic relevance (freshness)
- If the conversation shifts topic mid-simulation, RE-RETRIEVE for new context
- The archive is a LIBRARY you consult, not a script you follow
@@ -0,0 +1,321 @@
# Mass Behavior Modeling — Communities, Clusters, Cascades
Understanding individual behavior requires understanding the social
ecosystem they exist in. This reference covers the macro layer:
community detection, influence networks, audience modeling, and
predicting how groups respond to events.
## Why This Matters For Simulation
Individual prediction accuracy: ~56-60%
Individual-in-context prediction: significantly higher
A person's behavior is constrained by their community. Knowing WHICH
community they belong to, WHO influences them, and WHAT information
ecosystem they're in makes individual predictions much sharper.
Lewin's equation: B = f(P, E). This reference is about the E.
## The Ecosystem Stack
```
Layer 5: AUDIENCE REACTION — How would this person's audience respond?
Layer 4: STANCE & SENTIMENT — What positions do clusters hold?
Layer 3: INFLUENCE NETWORKS — Who spreads ideas to whom?
Layer 2: COMMUNITY CLUSTERS — Who groups together?
Layer 1: SOCIAL GRAPH — Who follows/interacts with whom?
```
## Layer 1: Social Graph Construction
### Data Sources (by accessibility)
| Source | Access | Quality | Tools |
|--------|--------|---------|-------|
| Bluesky AT Protocol | FREE, open, no auth | Excellent | atproto (pip) |
| X/Twitter API | Bearer token, limited | Good but restricted | curl, tweepy |
| Reddit | API with limits | Good for comments | PRAW (pip) |
| GitHub | Free API | Great for tech people | PyGithub (pip) |
| Web scraping | Fragile, TOS issues | Variable | Last resort |
### Bluesky: The Open Gold Mine
```python
# pip install atproto
from atproto import Client
client = Client()
# No auth needed for public data
# Get follower graph
followers = client.get_followers(actor="handle.bsky.social")
following = client.get_follows(actor="handle.bsky.social")
# Real-time firehose (no auth!)
# wss://jetstream1.us-east.bsky.network/subscribe
```
### Graph Types
- **Follow graph**: who follows whom (directed, static-ish)
- **Interaction graph**: who replies to / retweets whom (directed, dynamic)
- **Mention graph**: who mentions whom (directed, weighted by frequency)
- **Co-engagement graph**: who engages with the same content (undirected)
Interaction graphs are more informative than follow graphs for predicting
actual behavioral alignment.
### Tools
```
pip install networkx python-igraph
```
NetworkX for prototyping (<100K nodes), igraph for production (millions).
## Layer 2: Community Detection
### Algorithms (ranked by quality)
| Algorithm | Quality | Speed | Notes |
|-----------|---------|-------|-------|
| Leiden | Best | Fast | Guarantees connected communities |
| Louvain | Good | Fastest | Can produce disconnected communities |
| Infomap | Excellent | Medium | Based on information theory |
| Label Propagation | Decent | Very fast | Non-deterministic |
### The Meta-Library: CDLib
```
pip install cdlib
```
Wraps 50+ community detection algorithms in a unified API.
Works on top of networkx/igraph. Highly recommended.
```python
import cdlib
from cdlib import algorithms
import networkx as nx
G = nx.karate_club_graph()
communities = algorithms.leiden(G)
# Also: louvain, infomap, label_propagation, angel, demon, etc.
```
### What Communities Tell Us
Each community in a social graph typically shares:
- Ideological orientation
- Topic interests
- Information sources
- Language patterns and in-group vocabulary
- Reaction patterns to events
Knowing which community someone belongs to immediately constrains
predictions about their likely positions and reactions.
## Layer 3: Influence Networks
### Key Insight (Zhou et al., National Science Review 2024)
Network centrality alone is INSUFFICIENT for predicting influence.
Must combine structural position with behavioral features:
- Posting frequency
- Historical content virality
- Response rate / engagement ratio
- Content originality (original vs repost ratio)
### Centrality Measures
```python
import networkx as nx
G = nx.DiGraph() # directed social graph
# Who has the most connections?
degree = nx.degree_centrality(G)
# Who bridges different communities?
betweenness = nx.betweenness_centrality(G)
# Who's connected to other well-connected people?
eigenvector = nx.eigenvector_centrality(G)
# Adapted from web — directed influence flow
pagerank = nx.pagerank(G)
```
### Superspreader Identification (DeVerna et al., PLOS ONE 2024)
Superspreaders of content fall into three categories:
1. **Pundits**: large following, high authority, original content
2. **Media outlets**: institutional accounts, news organizations
3. **Affiliated personal accounts**: connected to pundits/outlets
For simulation: knowing who the superspreaders are in a person's
network tells you what information they're likely exposed to.
### Information Cascade Modeling
```
pip install ndlib # Network Diffusion Library
```
NDlib models how information spreads through networks:
- Independent Cascade Model
- Linear Threshold Model
- SIR/SIS epidemiological models adapted for info spread
- Voter Model (opinion dynamics)
- Sznajd Model (social influence)
## Layer 4: Stance & Sentiment Analysis
### Ready-To-Use Models (HuggingFace)
**Tweet Sentiment** (most reliable):
```
cardiffnlp/twitter-roberta-base-sentiment-latest
# Labels: positive / negative / neutral
```
**Political Stance**:
```
kornosk/bert-election2020-twitter-stance-biden-KE-MLM
kornosk/bert-election2020-twitter-stance-trump-KE-MLM
launch/POLITICS # left / center / right
```
**All-in-One Tweet NLP**:
```
pip install tweetnlp
# Sentiment, emotion, hate speech, NER, topic classification
```
### Topic-Level Stance Tracking
Combine BERTopic (dynamic topic modeling) with stance classifiers:
1. Cluster posts into topics over time windows
2. Classify stance per topic per community
3. Track stance shifts over time
4. Detect divergence between communities on emerging topics
### PRISM Framework (ACL 2025)
First framework for interpretable political bias embeddings.
Two-stage: mine bias indicators → cross-encoder assigns structured scores.
```
github.com/dukesun99/ACL-PRISM
```
## Layer 5: Audience Modeling & Crowd Prediction
### The Frontier: Predicting How Groups React
Key papers and findings:
**CReAM (WWW 2024)**: Predicts which of two posts gets more engagement.
Uses LLM-generated features + FLANG-RoBERTa cross-encoder.
Demonstrates crowd reaction IS predictable from content alone.
**PopSim (Dec 2025)**: LLM multi-agent social network sandbox.
Simulates content propagation dynamics using "Social Mean Field"
for individual-population interaction. Reduces prediction error 8.82%.
**Conditioned Comment Prediction (EACL 2026)**:
KEY FINDING: behavioral traces (past posts) are BETTER than
descriptive personas for conditioning LLMs to predict user behavior.
This validates our OSINT approach: real data > personality labels.
**DEBATE Benchmark (Oct 2025)**:
WARNING: LLM agents converge opinions TOO QUICKLY vs real humans.
SFT + DPO helps but gap remains. Real communities maintain
disagreement longer than simulated ones.
**Distributional vs Individual Prediction (PMC 2025)**:
Group-level predictions are more reliable than individual ones.
Predicting "65% of this community will react negatively" is more
accurate than predicting "this specific person will react negatively."
### Application to Simulation
When simulating @person talking about event X, consider:
1. What community does @person belong to?
2. How is that community reacting to X? (distributional prediction)
3. Where does @person sit within that community? (conformist vs contrarian)
4. Who influences @person? What are THEY saying?
5. How does @person's audience react to their take? (engagement prediction)
This context makes individual predictions sharper.
## Echo Chamber & Filter Bubble Detection
### Technique
1. Build interaction graph
2. Run Leiden community detection
3. For each community, aggregate stance on key issues
4. Measure ideological homogeneity within communities
5. Compare cross-community vs within-community content similarity
6. High within + low cross = echo chamber
### Tools
```
github.com/mminici/Echo-Chamber-Detection # Cascade-based, CIKM 2022
# Includes Brexit and VaxNoVax datasets
```
### What It Tells Us
Knowing someone's echo chamber tells you:
- What information they're exposed to
- What they're NOT exposed to
- How extreme their positions might be (isolation → radicalization)
- Whether they'll encounter pushback or only agreement
- How they'll react to information from outside their bubble
## User Embeddings: "Find People Like @person"
### Strategy
1. Embed each user's recent N posts with sentence-transformers
2. Average embeddings → user vector
3. Use FAISS for similarity search
4. Cluster users with HDBSCAN in embedding space
### Best Models for Social Media Text
```
# General purpose (good baseline)
sentence-transformers/all-mpnet-base-v2
# Tweet-specific (better domain fit)
cardiffnlp/twitter-roberta-base
vinai/bertweet-base # pretrained on 850M tweets
```
### Graph + Text Hybrid Embeddings
```
pip install karateclub
```
KarateClub provides Node2Vec, DeepWalk, Graph2Vec — embed users
based on graph position. Combine with text embeddings for hybrid
vectors that capture BOTH what someone says AND where they sit
in the social network.
## Practical Application to Simulation
### For Individual Simulation (what we already do)
Add ecosystem context to each dossier:
- Which community cluster they belong to
- Who their top influencers are (who do they retweet/amplify most)
- What echo chamber are they in (information environment)
- How does their community view the simulation topic
### For Audience Simulation (new capability)
When user asks "what would @person's audience say":
1. Identify @person's follower community
2. Sample representative voices from that community
3. Model the DISTRIBUTION of responses, not just one response
4. Include: cheerleaders, critics, joke-makers, lurkers
5. Weight by typical engagement patterns
### For Cascade Prediction (new capability)
When user asks "how would this take spread":
1. Model the initial tweet and its immediate network
2. Predict which nodes amplify (based on stance alignment + influence)
3. Estimate reach and engagement range
4. Predict quote-tweet ratio (agreement vs dunking)
## Recommended Minimal Stack
```bash
pip install networkx python-igraph leidenalg cdlib karateclub
pip install sentence-transformers transformers tweetnlp
pip install ndlib faiss-cpu hdbscan atproto
```
This gives you: graph construction, community detection, user embeddings,
stance/sentiment analysis, diffusion simulation, similarity search,
clustering, and Bluesky data access. All open source, all pip-installable.
@@ -0,0 +1,370 @@
# OSINT Pipeline — Deep Intelligence Gathering
Full-spectrum open source intelligence for building personality models.
This goes beyond social media posts into visual identity, cross-platform
footprints, and behavioral analysis.
## Tool Arsenal
| Tool | Use Case | Strength |
|------|----------|----------|
| `web_search` | Find anything, initial discovery | Fast, broad, indexed content |
| `web_extract` | Pull full page content | Blogs, articles, profiles, PDFs |
| `browser_navigate` + `browser_snapshot` | View live pages | Dynamic content, login walls |
| `browser_vision` | Analyze what a page looks like | Layouts, visual identity, screenshots |
| `vision_analyze` | Analyze any image by URL/path | Profile pics, post images, aesthetics |
| `browser_get_images` | List all images on a page | Find images to feed to vision_analyze |
| Yandex reverse image search | Find where an image appears | Identity verification, alt accounts |
| `x-cli` (if available) | Direct Twitter API | Timelines, search, metadata |
## Instagram Intelligence
Instagram is CRITICAL for personality modeling — it reveals:
- Visual identity and aesthetic preferences
- Real-life social circles (tagged people, group photos)
- Lifestyle signals (travel, food, hobbies, pets)
- Caption voice (often different from Twitter voice)
- Story highlights (curated self-image)
- Bio links (cross-platform connections)
### Viewing Instagram Profiles (VERIFIED APRIL 2026)
**METHOD 1 — Instagram Private Web API (BEST, returns full JSON)**
```bash
curl -s -H 'User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)' \
-H 'x-ig-app-id: 936619743392459' \
'https://i.instagram.com/api/v1/users/web_profile_info/?username={handle}'
```
Returns ~500KB of JSON: full profile + last 12 posts with captions, likes,
comments, CDN image URLs, timestamps. No auth needed.
**METHOD 2 — Instagram oEmbed API (for individual posts)**
```bash
curl -s 'https://www.instagram.com/api/v1/oembed/?url=https://www.instagram.com/p/{SHORTCODE}/'
```
Returns: caption text, author_name, thumbnail URL. No auth.
**METHOD 3 — Pixwox via web_extract (profile viewer)**
```python
web_extract(["https://pixwox.com/profile/{username}"])
```
Returns 12+ recent posts with captions, engagement stats. Cloudflare blocks
curl but web_extract bypasses it.
**METHOD 4 — SocialBlade via web_extract (analytics)**
```python
web_extract(["https://socialblade.com/instagram/user/{handle}"])
```
Returns follower count, engagement rate, 14-day tracking.
**METHOD 5 — CDN direct download (images from API responses)**
Image URLs from API responses (scontent-*.cdninstagram.com) download
directly with no auth. Feed them to vision_analyze for visual profiling.
**METHOD 6 — Google indexed content**
```
web_search("site:instagram.com {username}")
```
Returns bio text, follower count, recent post captions from search snippets.
**WHAT DOESN'T WORK:** direct web_extract on instagram.com, ?__a=1 trick,
graph.instagram.com (needs OAuth), imginn/picuki/dumpoir/gramhir (403)
### Instagram Discovery (finding someone's handle)
```
web_search("{real_name} instagram")
web_search("{twitter_handle} instagram account")
web_search("site:instagram.com {real_name}")
# Check their Twitter/X bio for IG links
# Check their personal website for social links
# Check Linktree / bio.link pages
```
### Extracting Signal from Instagram
**Profile Picture**: Reveals self-presentation style
- Professional headshot vs casual vs meme/avatar
- Analyze with vision_analyze for clothing, setting, expression
**Bio Text**: Compressed self-identity
- Role/title claims
- Emoji usage patterns
- Link destinations
- Location claims
**Post Grid**: Visual identity fingerprint
- Color palette tendencies
- Content categories (food/travel/tech/selfies/memes)
- Posting frequency
- Professional vs personal ratio
**Captions**: Voice sample different from Twitter
- Usually longer, more personal
- Hashtag usage patterns
- Emoji patterns
- Tone (inspirational vs casual vs funny)
**Tagged Photos**: Real social graph
- Who they hang out with IRL
- Events they attend
- Social circles outside tech/AI
## Visual Identity Analysis
Use vision tools to analyze HOW someone presents visually:
### Profile Pictures Across Platforms
```
# Collect profile pics from multiple platforms
# Twitter, Instagram, LinkedIn, GitHub, Discord
# Analyze each
vision_analyze(image_url="{pic_url}",
question="Describe this profile picture in detail: person's appearance, clothing style, setting, expression, professional vs casual, any notable elements")
# Cross-reference: do they use the same pic everywhere? Different personas?
```
### Reverse Image Search (Yandex Pipeline)
From memory — Google Lens blocks Browserbase IPs, use Yandex:
```
# For images behind auth/CDN, upload to catbox first
terminal("curl -F 'reqtype=fileupload' -F 'fileToUpload=@{local_path}' https://catbox.moe/user/api.php")
# Then Yandex reverse image search
browser_navigate("https://yandex.com/images/search?rpt=imageview&url={encoded_public_url}")
# Or via web_extract (slower but automatable)
web_extract(["https://yandex.com/images/search?rpt=imageview&url={encoded_url}"])
```
Yandex provides:
- Similar images (find the same person elsewhere)
- Site matches (where this image appears)
- OCR text extraction (text in images)
- Image tags (what's in the image)
- Knowledge panels (identified entities)
### Screenshot Analysis
When you can see a page but can't extract text:
```
browser_vision(question="Read all text on this page. List usernames, post content, dates, engagement numbers")
browser_vision(annotate=true, question="What interactive elements are on this page?")
```
## LinkedIn Intelligence
**STATUS: BLOCKED for automated access** (tested April 2026).
web_extract returns "Website Not Supported". Direct browsing triggers auth walls.
**Workarounds:**
```
# LinkedIn content IS indexed by search engines
web_search("{real_name} linkedin {company}")
web_search("site:linkedin.com/in {name}")
# These return snippets with headline, role, company — useful even without full profile
# Google sometimes caches LinkedIn profiles
web_search("{name} site:linkedin.com headline")
```
**METHOD 1 — Google indexed snippets (always works)**
```
web_search("site:linkedin.com/in {name} {company}")
```
Returns: name, headline, company, location, connection count, bio snippet.
**METHOD 2 — Crunchbase (EXCELLENT for founders/execs)**
```python
web_extract(["https://www.crunchbase.com/person/{slug}"])
```
Returns: full career history, education, investments, board positions,
social links. Best source for professional identity of startup people.
**METHOD 3 — Corporate press pages**
```
web_search("{person} {company} site:{company}.com bio OR press")
```
Official bios from company newsrooms. High quality, curated but factual.
**METHOD 4 — Third-party aggregators**
- RocketReach, SignalHire — job title + company from web_search snippets
- rootdata.com — good for crypto/AI people
- Crunchbase — best all-round for tech executives
**METHOD 5 — Paid LinkedIn API wrappers** (if budget allows)
- LinkdAPI, Proxycurl: $0.07-0.15 per profile, full structured data
- No OAuth needed, just API key
LinkedIn reveals (from combined methods):
- Career trajectory (Crunchbase full history)
- Current role and headline (search snippets)
- Education (Crunchbase or search snippets)
- Professional self-presentation (company bio pages)
- Investment/board activity (Crunchbase)
## Podcast Transcripts (HIGHEST VALUE for voice profiling)
Podcast interviews are THE gold mine for personality modeling. Hours of
unscripted speech, natural conversation, real personality showing through.
**Discovery:**
```
web_search("{name} podcast transcript interview")
web_search("{name} lex fridman OR tyler cowen OR joe rogan OR dwarkesh")
```
**Extraction — verified working transcript sources:**
```python
# Lex Fridman (full verbatim transcripts)
web_extract(["https://lexfridman.com/EPISODE_URL/transcript"])
# Conversations with Tyler (Tyler Cowen — full transcripts)
web_extract(["https://conversationswithtyler.com/episodes/..."])
# TED Talks transcripts
web_extract(["https://www.ted.com/talks/.../transcript"])
# Sequoia Capital podcast
web_extract(["https://www.sequoiacap.com/podcast/..."])
```
Podcast transcripts reveal:
- Natural speech patterns (filler words, pacing, sentence structure)
- Unguarded opinions (less curated than tweets)
- How they respond to pushback (interviewer challenges)
- Humor style in conversation (different from written humor)
- Depth of knowledge on specific topics
- Personality under pressure
## YouTube / Video Intelligence
```
web_search("{name} youtube talk keynote interview")
web_search("{name} podcast appearance")
```
web_extract on YouTube pages returns rich summaries with attributed quotes.
Use youtube-content skill for full transcripts if available.
## Personal Blogs & Substacks (HIGH VALUE)
Personal writing is curated self-expression — how someone WANTS to be
seen intellectually. Very different signal from social media.
```
web_search("{name} blog substack essay")
# Extract full posts
web_extract(["https://{blog-url}/"])
# Wayback Machine works for archived blog posts
web_extract(["https://web.archive.org/web/2024/{blog-url}"])
```
## GitHub Intelligence
For technical people:
```
web_search("site:github.com {handle}")
web_extract(["https://github.com/{handle}"])
# Issue comments reveal communication style under technical pressure
web_search("site:github.com {handle} issue comment")
# README style reveals documentation personality
# Commit messages reveal terseness vs verbosity
```
## General Web Footprint
```
# Personal website / blog
web_search("{name} personal website blog about")
# Conference talks / speaker bios
web_search("{name} speaker conference talk bio")
# News mentions
web_search("{name} {company} news interview profile")
# Academic papers (for researchers)
web_search("{name} arxiv paper author")
web_search("site:scholar.google.com {name}")
# Podcast appearances
web_search("{name} podcast guest appearance")
# Forum posts (HN, specific communities)
web_search("site:news.ycombinator.com {handle} OR {name}")
```
## Cross-Platform Identity Resolution
### Handle Mapping Strategy
1. Start from known handle (usually Twitter)
2. Check bio links — most people link to other platforms
3. Search "{known_handle} {platform}" for each platform
4. Check personal website for social links
5. Reverse image search profile pic to find matching accounts
6. Search unique phrases they use across platforms
### Identity Verification
When you find a potential match on another platform:
- Same profile picture? (reverse image search)
- Same bio keywords?
- Same name/handle pattern?
- Cross-references (do they mention each other?)
- Writing style match?
## Search Space Narrowing
### The Jiggle Technique
When broad searches return noise, narrow progressively:
1. **Start broad**: `"{name}" AI`
2. **Add role**: `"{name}" {company} {role}`
3. **Add context**: `"{name}" {company} {specific_project_or_topic}`
4. **Add platform**: `site:{platform} "{name}" {context}`
5. **Add time**: `"{name}" {topic} 2025 OR 2026`
6. **Quote unique phrases**: if you found a distinctive phrase they use, search for that exact phrase to find more of their content
### Disambiguation
Common names need extra signals:
- Add their company/org
- Add their specific domain (AI, crypto, etc.)
- Use their unique handle as anchor
- Search for combinations of their known associates
- Use image search to verify you have the right person
### Signal vs Noise Heuristics
- **High signal**: direct quotes, interview transcripts, personal blog posts, long-form content
- **Medium signal**: mentions in aggregator sites, conference bios, LinkedIn summaries
- **Low signal**: generic news mentions, third-party profiles, directory listings
- **Noise**: same-name different person, outdated info (>2 years), scraped/regurgitated content
## Confidence Calibration
After full OSINT sweep, rate data quality:
| Confidence | Data Available | Simulation Quality |
|-----------|---------------|-------------------|
| 95-100% | 50+ posts, longform, video, visual, cross-platform | Near-perfect voice replication |
| 80-94% | 20-50 posts, some longform, basic visual | Very good, occasional educated guesses |
| 60-79% | 10-20 posts, mostly short-form | Good general sense, some gaps |
| 40-59% | 5-10 posts, limited platforms | Broad strokes only, flag uncertainty |
| 20-39% | <5 posts, single platform | Sketch at best, heavy disclaimers |
| <20% | Almost nothing found | Decline to simulate, ask user for context |
## Privacy & Ethics Note
All research uses publicly available information only. We don't:
- Access private/locked accounts
- Bypass authentication
- Use leaked/hacked data
- Dox or expose private information
- Simulate in ways designed to deceive or impersonate
The goal is personality MODELING for creative simulation, grounded in
what people choose to share publicly.
@@ -0,0 +1,334 @@
# Prediction Engine — Forecasting What Someone Would Say/Do
Techniques for predicting behavior grounded in superforecasting methodology,
behavioral science, and SOTA LLM prediction research.
## Superforecasting Principles (Tetlock)
**Honest caveat**: Superforecasting methodology was developed for geopolitical and
world-event prediction, not personality simulation. That said, the THINKING TOOLS
are genuinely useful here — decomposition prevents lazy pattern-matching, base rates
fight overconfidence, and alternative hypotheses prevent single-track predictions.
What does NOT transfer cleanly: the calibration precision. When Tetlock says "70%
confident," that's backed by thousands of scored predictions. When we say "70%
confident" about what @someone would tweet, that's an educated estimate, not a
calibrated probability. Use the framework for its rigor, not its false precision.
Apply these thinking tools when making behavioral predictions:
### 1. Decomposition (Fermi-ize the Question)
Don't ask "What would @person say about X?"
Break it down:
- What is @person's known position on topics RELATED to X?
- What are their values/priorities that X touches on?
- What is their emotional register when discussing similar topics?
- Who are they likely responding to, and how does that change their tone?
- What platform are they on, and how does that shift their behavior?
### 2. Outside View First (Base Rates)
Before considering the specific person, ask:
- What would a TYPICAL person in their role/position say about X?
- What % of people in their ideological cluster hold position Y on X?
- What's the base rate for their type of response (agree/disagree/joke/ignore)?
### 3. Inside View Second (Case-Specific Adjustment)
Now adjust from the base rate using what you ACTUALLY KNOW about them:
- Specific past statements on this topic or related topics
- Known relationships with people/orgs involved
- Personal experiences that would shape their view
- Contrarian tendencies (do they predictably go against their cluster?)
### 4. Confidence Calibration
Express predictions with honest uncertainty. **These are rough buckets, not
calibrated probabilities. Don't pretend they're more precise than they are.**
- **90%+ confident**: They've literally said this before, just rephrased
- **70-89%**: Strong pattern match with known positions and voice
- **50-69%**: Reasonable inference but could go either way
- **30-49%**: Educated guess, limited data
- **<30%**: Basically guessing, flag it clearly
When reporting confidence, prefer plain language over fake precision:
"very likely" > "87% probability". The number implies a precision we don't have.
### 5. Consider Alternative Hypotheses
For every prediction, generate at least ONE plausible alternative:
- "They'd PROBABLY say X, but they might surprise with Y because Z"
- This prevents overconfident single-track predictions
## The Prediction Pipeline
### Step 1: Classify the Prediction Type
| Type | Description | Difficulty |
|------|-------------|-----------|
| **Position prediction** | What they believe about X | Easiest if data exists |
| **Reaction prediction** | How they'd respond to event Y | Medium |
| **Voice prediction** | How they'd phrase something | Medium-hard |
| **Behavior prediction** | What they'd DO (not just say) | Hardest |
| **Interaction prediction** | How they'd respond to specific person | Hard, depends on relationship data |
### Step 2: Evidence Gathering Protocol
For each prediction, gather evidence in this order:
1. **Direct evidence**: Have they addressed this exact topic before?
- Search: `"{handle}" "{topic}"` or `"{handle}" "{related_keyword}"`
- Weight: HIGHEST
2. **Analogical evidence**: Have they addressed something similar?
- Search: find positions on adjacent topics
- Weight: HIGH
3. **Value evidence**: What values/principles would apply?
- Infer from their stated beliefs and consistent positions
- Weight: MEDIUM
4. **Social evidence**: What do their peers/allies think?
- People tend to align with their social cluster (but not always)
- Weight: LOW-MEDIUM (higher for conformists, lower for contrarians)
5. **Demographic evidence**: What would someone in their position typically think?
- Base rate from role/industry/ideology
- Weight: LOWEST (only use as anchor, not conclusion)
### Step 2b: Contradiction Handling Protocol
When evidence conflicts (e.g., person said X in 2024 but Y in 2026):
1. **Check for genuine change**: Did they explicitly reverse position? Look for
"I used to think X but now..." or a clear pivot moment. If so, use the newer
position and note the evolution.
2. **Check for context-dependence**: Did they say X to audience A and Y to audience B?
This isn't necessarily dishonesty — people emphasize different facets for different
contexts. Note which context your simulation targets and use the matching register.
3. **Check for nuance collapse**: Maybe they said "X is mostly good with caveats"
and later "X has real problems" — these might not actually contradict. Look for
the synthesis position.
4. **When genuinely unresolvable**: Flag it explicitly. "Evidence conflicts on this
point — they've argued both sides at different times. Simulating {chosen position}
based on {reasoning}, but the alternative is plausible." Don't paper over the
contradiction with false confidence.
5. **Recency default**: When all else fails, weight more recent statements higher.
People change, and the most recent position is the best predictor of the next one.
### Step 3: Generate Prediction
Using the HumanLLM B = f(P, E) framework:
- **P (Person)**: Everything from the dossier — personality, values, voice
- **E (Environment)**: The specific context — platform, topic, who's asking,
what just happened, social dynamics in play
Generate the prediction by:
1. Setting the base rate (outside view)
2. Adjusting for personal specifics (inside view)
3. Filtering through their voice profile (how they'd phrase it)
4. Applying platform-specific behavior patterns
5. Calibrating confidence
## Memory Curation (The 30-50 Rule)
Research shows performance PEAKS at 30-50 memory entries then DECLINES.
For each person in a simulation, curate memories:
### What to Include (high signal)
- **Signature takes**: Their most characteristic/famous positions (5-10)
- **Voice samples**: Real quotes that capture their linguistic style (5-10)
- **Relationship data**: Known dynamics with other sim targets (3-5)
- **Recent context**: What they've been talking about lately (3-5)
- **Formative moments**: Career milestones, public pivots, viral moments (3-5)
- **Quirks & tells**: Catchphrases, humor style, pet peeves (3-5)
### What to Exclude (noise)
- Generic biographical facts that don't predict behavior
- Old positions they've clearly evolved past
- Trivial interactions that don't reveal personality
- Secondhand characterizations (what others say about them)
- Platform metadata (follower counts, join dates) unless directly relevant
### Memory Selection Heuristic
For each candidate memory entry, ask:
**"If I removed this, would the simulation noticeably degrade?"**
If no, cut it.
## Fighting LLM Defaults
Research shows LLMs have systematic biases in simulation. The fixes below need to be
CONCRETE — vague instructions like "be more like them" don't work. You need specific
prompting patterns that actually shift the output.
### Problem: Sycophancy & Over-Agreement
LLMs default to agreement and positivity.
**Fix**: Don't just note they're contrarian — structure it as a behavioral instruction
with evidence:
```
"In this conversation, {person} disagrees with {other_person} on {topic}. They are
noticeably more confrontational than the other speakers. They tend to respond to
consensus with skepticism and reframe debates on their own terms. Example from their
real posts: '{actual quote where they disagreed with something popular}'"
```
### Problem: Rigid/Polarized Strategies
LLMs tend to take extreme positions and hold them rigidly.
**Fix**: Provide specific nuance instructions:
```
"In this conversation, {person} holds a complex position on {topic}: they agree with
{point A} but push back on {point B}. They're the type to say 'yes, but...' rather
than 'no.' Real example of their nuance: '{quote showing them holding a both-and
position}'"
```
### Problem: Uniform Register
LLMs default to a similar educated-casual tone for everyone.
**Fix**: Anchor voice with REAL QUOTES and explicit comparative instructions:
```
"In this conversation, {person} is noticeably more {trait} than the other speakers.
They tend to {specific behavior pattern}. Their sentences are typically {length/style}.
They {do/don't} use emoji. Their humor style is {type}. Example from their real posts:
'{actual quote that captures their voice}'"
```
The more you can say "{person} does THIS while {other_person} does THAT," the better
the differentiation. Comparative framing outperforms absolute descriptions.
### Problem: Overly Structured Responses
LLMs love neat arguments with clear structure.
**Fix**: Provide explicit structural anti-patterns:
```
"When generating {person}'s messages, break conventional structure. They start one
thought and jump to another mid-sentence. They use '...' and '—' instead of periods.
They repeat words for emphasis. They don't conclude neatly. Example: '{real quote
showing their chaotic structure}'"
```
### Problem: Missing Mundane Behavior
LLMs focus on "interesting" responses, skip boring/mundane ones.
**Fix**: Explicitly instruct for mundane moments:
```
"Not every message from {person} needs to be insightful. Include at least 1-2 messages
that are just reactions ('lmao', 'this', 'wait what'), link shares without commentary,
or brief agreements. Real people don't craft every message. {person} specifically tends
to {their specific mundane behavior pattern, e.g., 'drop a single emoji reaction'
or 'just retweet without comment'}."
```
### General Principle for All Fixes
The pattern is always: **behavioral instruction + comparative framing + real evidence**.
- "Do X" alone doesn't work well
- "Do X, unlike the default of Y" works better
- "Do X, unlike the default of Y, as evidenced by this real quote: Z" works best
## The Adjective-Based Personality Method
70 bipolar adjective pairs for Big Five traits. Select 3 per trait
with intensity modifiers.
### Openness
High: creative, curious, imaginative, artistic, adventurous, intellectual,
unconventional, perceptive
Low: conventional, practical, traditional, routine-oriented, narrow
### Conscientiousness
High: organized, disciplined, reliable, meticulous, systematic, thorough,
goal-oriented, persistent
Low: careless, impulsive, disorganized, spontaneous, flexible, relaxed
### Extraversion
High: outgoing, talkative, energetic, assertive, enthusiastic, bold,
gregarious, dominant
Low: reserved, quiet, introverted, solitary, withdrawn, reflective
### Agreeableness
High: cooperative, trusting, empathetic, generous, accommodating, kind,
diplomatic, forgiving
Low: competitive, skeptical, blunt, confrontational, critical, stubborn,
independent-minded
### Neuroticism
High: anxious, moody, sensitive, reactive, volatile, self-conscious,
insecure, emotional
Low: calm, stable, resilient, confident, even-tempered, composed,
thick-skinned
### Usage
For each simulated person, after OSINT research, estimate their Big Five
profile and select appropriate adjectives:
Example: "@basedjensen: very creative, somewhat impulsive, very outgoing,
a bit competitive, calm" → this shapes the generation toward the right
behavioral profile.
## Interaction Dynamics Prediction
When simulating conversations between multiple people, remember that predictions
apply to a SPECIFIC REGISTER. See the next section on performative vs. authentic
behavior.
## Performative vs. Authentic Behavior
**Critical concept**: People act differently for different audiences. A simulation
must be explicit about which register it's targeting.
### The Register Spectrum
- **Public broadcast** (tweets, Reddit posts): Most performative. People are
playing to their audience, building their brand, signaling to their tribe.
- **Semi-public** (Discord channels, group chats, comment threads): Less
performative but still audience-aware. People are more casual but know
others are watching.
- **Private 1-on-1** (DMs): Much less performative. More honest, more
vulnerable, more willing to express doubt or uncertainty.
- **True private** (inner monologue, close friends): We have almost no data
on this. Don't pretend to simulate it.
### Practical implications
- When simulating a PUBLIC thread, lean into the person's public persona —
their brand, their usual takes, their audience-aware voice.
- When simulating DMs, dial down the performance. More hedging, more honesty,
more "I actually think..." vs. the public "Here's my take:".
- When evidence comes from one register but the simulation targets another,
FLAG IT: "Evidence is from public tweets but simulating DM behavior —
expect the real person to be less {polished/aggressive/confident} in private."
- Someone's Twitter persona may be genuinely different from their Reddit persona.
These are not interchangeable data sources. Weight evidence from the matching
platform higher.
### What we can't know
Be honest: we're simulating public figures based on their public output. The
private person may be substantially different. DM simulations are inherently
lower-confidence than public thread simulations because we have less data on
how people behave privately.
### Dominance Hierarchy
- Who talks first? (most confident/highest-status usually)
- Who responds to whom? (not everyone talks to everyone)
- Who gets ratio'd? (lowest-status takes get challenged)
- Who lurks? (some people watch before engaging)
### Agreement/Disagreement Prediction
Based on known positions + social dynamics:
- **Strong agree**: Both have stated similar positions + friendly relationship
- **Agree with nuance**: Similar positions but one adds a caveat
- **Productive disagreement**: Different positions + mutual respect
- **Hostile disagreement**: Different positions + existing tension/rivalry
- **Surprising agreement**: Expected to disagree but find common ground
- **Ignore**: Some people just don't engage with certain others
### Conversation Flow Prediction
Real conversations follow patterns:
1. **Opener** → most active/impulsive person posts first
2. **First response** → most engaged/relevant person responds
3. **Pile-on or pushback** → depends on agreement/disagreement dynamics
4. **Tangent** → someone takes a side thread
5. **Peak moment** → the best/most viral exchange
6. **Trail off** → energy dissipates, last person makes a joke or short comment
## Scenario Injection Prediction
When "inject: {event}" is used, predict reactions:
1. **Who would see this first?** (most online / most relevant to their work)
2. **Who would care most?** (most affected / strongest opinion)
3. **What's the emotional valence?** (good news for some, bad for others)
4. **What's the expected take?** (apply position prediction pipeline)
5. **How does this change the existing conversation?** (derail, amplify, redirect)
@@ -0,0 +1,237 @@
# Recursive Self-Improvement Pipeline
The simulator should get better every time it runs. Not through training —
through accumulating failure patterns, calibration data, and learned rules
that feed back into future simulations.
## The Loop
```
SIMULATE → VERIFY (mechanical) → SCORE → LOG FAILURES → UPDATE RULES → SIMULATE BETTER
```
Each run produces two outputs:
1. The simulation (for the user)
2. A failure log (for the system)
The failure log feeds back into the next run's verification step,
making the checklist grow and the blind spots shrink.
## What Gets Logged After Every Simulation
### 1. Mechanical Check Failures
```
FAILURE LOG: simulation_{timestamp}
EMOJI: @visakanv had 6 fabricated emoji, real rate was 10%. Stripped all.
SLOP: @eigenrobot utterance contained "multifaceted" — rewritten.
LENGTH: @QiaochuYuan avg 42 words/utterance, real avg was 18. Compressed.
CAPS: 4/12 utterances started uppercase, targets are 90% lowercase. Fixed.
PUNCTUATION: Added periods to @tszzl who never uses terminal punctuation.
STRUCTURE: Sycophantic flow detected — B agreed with A then C agreed with B.
Injected disagreement.
```
### 2. Discriminator Critique Patterns
```
CRITIQUE LOG:
Round 1: @tszzl too verbose (flagged 2x in last 3 simulations)
Round 1: @repligate too academic (flagged 3x — this is a persistent pattern)
Round 2: Conversation too neat — real conversations are messier (flagged 5x)
```
### 3. Held-Out Test Results
```
CALIBRATION LOG:
Voice fidelity: 8.4/10 (up from 7.5 last run)
Topic prediction: 2/5 topics matched (typical — content is unpredictable)
Register match: 9/10 (improved after emoji fix)
```
## How Failures Feed Forward
### Pattern Accumulation
After N runs, persistent failure patterns become AUTOMATIC rules:
```
IF a pattern is flagged in 3+ consecutive simulations:
PROMOTE it from "check" to "pre-generation rule"
Example progression:
Run 1: "Too verbose for @tszzl" → flagged in Round 1, fixed
Run 2: "Too verbose for @tszzl" → flagged again, fixed again
Run 3: "Too verbose for @tszzl" → PROMOTED to pre-gen rule:
"When simulating roon-type voices: max 20 words per tweet.
Fragment > sentence. Compress ruthlessly."
```
### The Growing Checklist
The mechanical verification checklist starts with the baseline checks
(emoji, slop, length, caps, punctuation) and GROWS with each failure:
```
BASELINE CHECKS (permanent):
□ Emoji frequency match
□ Slop word scan (Tier 1/2/3)
□ Sentence length match
□ Capitalization match
□ Punctuation pattern match
□ Reply/original ratio
□ Structural slop patterns
LEARNED CHECKS (accumulated from past failures):
□ Roon-type voices: max 20 words (from: verbose failure x3)
□ Warm personalities: do NOT add emoji (from: emoji inflation x5)
□ Academic voices: ground in specific examples (from: too abstract x3)
□ Conversations: inject at least one disagreement (from: sycophantic flow x4)
□ Self-deprecating voices: add hedging (from: too assertive x2)
□ Shitposters: include at least one non-sequitur (from: too on-topic x2)
```
### Where To Store Learned Rules
Append to the skill itself. After each simulation run where the mechanical
checks catch something, the agent should ask:
"The mechanical verification caught {failures}. Should I add these as
permanent learned rules for future simulations?"
If the same failure appears 3+ times, add it automatically without asking.
Use skill_manage(action='patch') to append to this file's "Learned Checks"
section below.
## Calibration Tracking
### Per-Person Calibration Memory
After simulating someone, store the calibration data:
```
@tszzl: voice=8.5, emoji_rate=0%, avg_words=14, lowercase=95%,
signature_move="aphoristic fragments", danger="goes verbose"
@nickcammarata: voice=8.8, emoji_rate=0%, avg_words=19, lowercase=90%,
signature_move="meditation-ML connection", danger="too structured"
```
If the same person is simulated again, LOAD this calibration to skip
the cold-start problems. The second simulation of someone should be
better than the first because you already know their failure modes.
### Aggregate Calibration
Track overall simulation quality across runs:
```
Run 1: pre-refine 7.5, post-refine 8.4 (delta +0.9)
Run 2: pre-refine 8.37, post-refine 8.53 (delta +0.16)
Run 3: pre-refine 8.53, post-refine 8.83 (delta +0.30, emoji fix)
```
The pre-refine score should INCREASE over time as learned rules prevent
repeat failures. If it's not increasing, the learning loop is broken.
## The Standard: Indistinguishable From Real
The target is not "good enough." The target is: mix simulated posts with
real posts and a human familiar with the person cannot reliably tell which
is which. That's 50% accuracy on a blind comparison — random chance.
Every mechanical check, every discriminator round, every learned rule
exists to push toward that standard. If something doesn't serve that
goal, it's wasted effort.
## Current Learned Checks (append here after each run)
### From TPOT Simulation Run 1 (April 2026)
- Warm/enthusiastic personalities (visakanv-type): do NOT add decorative emoji.
Bio emoji ≠ tweet emoji. Actual emoji rate for "warm" TPOT posters: <15%.
PROMOTED after being caught by user, not by discriminator (discriminator failure).
- Conversation flow: pure agreement chains are instruct-model slop.
Real threads have at least one moment of friction, misunderstanding, or deflection.
- Academic-leaning voices (repligate-type): ground claims in specific experiments,
transcripts, or model behaviors they've personally observed. Generic philosophical
language without specifics = slop, even if it sounds smart.
- Self-deprecating voices (QC-type): hedge more. "i think" "i'm not sure" "it feels like."
Instruct models are too assertive even when simulating tentative people.
- Fragment voices (roon-type): max 15-20 words. No conjunctions. No paragraphs.
If it reads like a complete thought, it's too complete for a fragment-poster.
### From TPOT Simulation Run 2 (April 2026)
- Reframer voices (nosilverv-type): avg ~16 words. Split multi-sentence takes
into separate tweets. The compression IS the voice. 113% over-length caught
by mechanical check that subjective scoring rated 8/10. Trust the numbers.
- Rare-poster voices (selentelechia-type): in a 12-post sim, give them 2-3 turns
max. When they speak it must LAND. Short crystallizations > long analysis.
"or a shared meal" was the highest-rated line at 3 words.
- Turn symmetry: ALWAYS check. 4/4/4 is instruct-model default. Real conversations
have one person dominating (5), one lurking (3), others in between.
- Verbose bias is the #1 mechanical failure. ALWAYS check avg word count against
real baseline BEFORE subjective scoring. Every run so far has caught over-length
that subjective scoring missed.
- RHETORICAL POLISH IS SLOP. Caught post-mechanical-pass in Run 2 review.
Parallel antithesis ("The most X... The most Y..."), "Not X, not Y, but Z",
"Show me X and I'll show you Y", clean 4-step escalations, academic vocabulary
in casual voice — ALL passed mechanical checks but are still obviously LLM.
PROMOTED TO MECHANICAL SCAN: now regex-scannable alongside slop words.
- THE BANGER PROBLEM: every simulated tweet was screenshot-worthy. Real feeds
are 70% mid. Must include throwaway responses ("lol" "hmm" "fair" "wait actually").
PROMOTED: banger check is now mandatory in mechanical verification.
### From TPOT Simulation Run 3 — Star Thread Discovery (April 2026)
- STAR THREAD IS THE KEY. Dossier-first generation produces surface-accurate
but dead output. Star-thread-first generation produces messy, alive output
that actually sounds like the person. Generate from the thread. Verify with data.
- Rhetorical polish vanished once generation came from "what is this person DOING"
rather than "what would this person SAY." Reframers reframe. Conveners convene.
Distillers distill. The VERB drives the voice, not the adjectives.
- People in conversation REFERENCE EACH OTHER BY NAME. Tyler says "Bosco always
comes in with the three word version." This is obvious but the dossier approach
never produced it because it models each person in isolation.
- PROMOTED: star thread is now the FIRST entry in every dossier. Before voice
profile, before psychometrics, before everything else. It's the generation seed.
Everything else is verification.
### Operational Findings (verified April 2026)
- X API bearer token: 10K tweets/15min, 300 profiles/15min, 450 searches/15min.
Most generous rate limits. Always use as primary source.
- Threads.NET → Threads.COM redirect. Always use -L flag or .com directly.
Previous test saying "no OG tags" was WRONG — tags exist, domain was wrong.
- Instagram private API: i.instagram.com + mobile UA + x-ig-app-id: 936619743392459.
Returns full JSON with 12 posts. No auth needed. CDN image URLs work for vision_analyze.
- Facebook: Googlebot UA trick works for public pages. Returns name, bio, likes (121M for zuck).
Normal UA and mobile variants all redirect to login wall.
- TikTok: stats are in __UNIVERSAL_DATA_FOR_REHYDRATION__ JSON at path
__DEFAULT_SCOPE__.webapp.user-detail.userInfo.statsV2 (use statsV2 not stats).
- Bluesky searchPosts returns 403 from datacenter IPs. Workaround: searchActors + getAuthorFeed.
- nitter.cz is the ONLY working nitter instance (via web_extract, not curl).
- Reddit JSON API requires User-Agent header or returns 429.
- GEPA native had `max_steps` API mismatch with DSPy 3.1.3. MIPROv2 fallback works.
hermes-agent-self-evolution config: max_skill_size bumped to 20_000 for worldsim-class skills.
- hermes-agent-self-evolution is at ~/.hermes/hermes-agent-self-evolution/ with .venv.
Must export API keys from ~/.hermes/.env before running.
- Podcast transcripts (Lex Fridman, Tyler Cowen, TED) are the HIGHEST VALUE source
for voice profiling. Hours of unscripted speech > thousands of tweets.
### From Simulation Run 4 — Engine Mode + Profile Command (April 2026)
- ENGINE MODE: When worldsim is active, ZERO assistant personality leaks.
No kawaii, no markdown, no chatty commentary between phases. Every token
is simulation fidelity. First attempt leaked personality; user corrected.
PROMOTED TO PERMANENT RULE in SKILL.md.
- X API CURL > NITTER for voice calibration. nitter.cz returns 502 or "user
not found" unpredictably. Direct curl to X API v2 with bearer token returns
full text + metrics. 3 pages (90 tweets) is enough for fidelity 100. Always
use this as PRIMARY voice source, nitter as supplement only.
- CAPS BURST PATTERN: some voices (karan4d-type) use lowercase default with
sporadic ALL CAPS for excitement ("WAZZAAAAAAPPPP", "LAWDAMERCYYYYY",
"AWOOGA"). This is distinct from consistent-lowercase (tenobrus-type) and
sentence-case (somewheresy-type). Capture this in voice profile as a
three-way distinction: lowercase-default, caps-burst, sentence-case.
- TEXT EMOTICONS vs EMOJI: karan4d uses :) >.< ~ but almost zero standard
emoji. This is a distinct expressiveness mode from zero-emoji (tenobrus)
and sparse-emoji. Include text emoticon inventory in voice profile.
- STAR THREAD 5/5 TEST is mandatory for profile command. Write the thread,
then test it against 5 real posts with explicit reasoning per post. If
fewer than 4/5 fit, the thread is wrong — keep looking. Show the work.
- PROFILE OUTPUT: star thread → voice profile (caps, punctuation, word count,
emoji/emoticon inventory, vocabulary, register, threading behavior) →
psychometrics (Big Five, Moral Foundations, cognitive style) → key positions
(with dates and real tweet quotes) → ecosystem (inner circle, professional,
cultural) → intelligence tradecraft (key assumptions, red hat, deception
detection, competing hypotheses) → invalidation indicators → source reliability.
@@ -0,0 +1,278 @@
# Search Strategies — Finding Anyone Across Platforms
The hardest part of simulation is building an accurate model of a real person. This doc
covers how to systematically discover and profile someone across every platform we care about.
## General Principles
1. **Start broad, go narrow.** First establish WHO they are, then drill into HOW they talk.
2. **Cross-reference.** Someone's Reddit persona may differ wildly from their Twitter persona. That's signal, not noise.
3. **Recency matters.** People's views evolve. Weight recent posts (last 6 months) over older ones.
4. **Interactions > monologues.** How someone replies reveals more about their voice than their prepared posts.
5. **Controversy is gold.** People are most themselves when arguing. Search for debates and disagreements.
## Platform-Specific Discovery
### X / Twitter
Twitter is the richest source for most public figures in tech/AI. Multiple approaches:
#### With x-cli (if API keys available)
```bash
# Recent timeline — best single source of voice data
x-cli user timeline {handle} --max 30 -j
# Their replies — how they interact, argue, joke
x-cli tweet search "from:{handle}" --max 30 -j
# What others say about/to them
x-cli tweet search "to:{handle}" --max 20 -j
# On specific topics
x-cli tweet search "from:{handle} open source" --max 10 -j
```
#### Without API (web_search + web_extract)
```
# Identity + role
web_search("{handle} twitter bio role company")
# Voice + opinions
web_search("{handle} twitter hot takes opinions")
web_search("site:x.com {handle}")
# Topic-specific positions
web_search("{handle} twitter {topic}")
web_search("{handle} {topic} opinion take")
# Interviews / longform (reveals deeper thinking)
web_search("{handle} interview podcast AI")
web_search("{handle} blog post essay")
# Beefs and debates (reveals personality under pressure)
web_search("{handle} twitter debate disagree controversial")
web_search("{handle} vs {other_person}")
# Newsletter aggregators that index tweets
web_search("site:buttondown.com/ainews {handle}")
web_search("site:news.smol.ai {handle}")
web_search("site:techmeme.com {handle}")
web_search("site:latent.space {handle}")
```
#### AI Twitter Aggregator Sites (high value)
These sites index AI Twitter conversations daily:
- `buttondown.com/ainews` — swyx's AI News, indexes hundreds of AI Twitter accounts
- `news.smol.ai` — smol AI news aggregator
- `techmeme.com` — tech news, includes tweet citations
- `latent.space` — AI podcast/newsletter with Twitter references
Search pattern: `site:{aggregator} "{handle}"` to find indexed tweets and discussions.
#### IMPORTANT: web_extract does NOT work on x.com
web_extract returns "Website Not Supported" for all x.com/twitter.com URLs.
Do NOT attempt it — it wastes a tool call every time.
#### Verified Fallback Access Methods (tested April 2026)
**PRIMARY: X API v2 Bearer Token** (confirmed working)
- Profiles, timelines, search — 300-10K requests/15min
- See scripts/x_api.py
**FALLBACK 1: nitter.cz via web_extract** (WORKS)
```
web_extract(["https://nitter.cz/{handle}"])
```
Returns full profile + recent timeline. Direct curl gets Cloudflare-blocked
but web_extract bypasses it. Rich data: bio, stats, pinned tweets, full text.
NOTE: Most other nitter instances are DEAD (nitter.net, xcancel.com, etc.)
**FALLBACK 2: ThreadReaderApp** (WORKS — excellent for historical threads)
```
web_extract(["https://threadreaderapp.com/user/{handle}"])
```
Returns unrolled historical threads with full text. Found threads back to 2023.
Gold for longform voice samples.
**FALLBACK 3: GitHub API** (WORKS — excellent for tech people)
```
curl -s https://api.github.com/users/{handle}
curl -s https://api.github.com/users/{handle}/repos?sort=updated
curl -s https://api.github.com/users/{handle}/events
curl -s https://api.github.com/users/{handle}/gists
```
No auth needed (60 req/hr). Profile READMEs are voice profiling gold.
Events API shows recent activity with comment text.
**FALLBACK 4: Reddit JSON API** (WORKS)
```
curl -s -H 'User-Agent: hermes-sim/1.0' 'https://www.reddit.com/user/{username}.json'
curl -s -H 'User-Agent: hermes-sim/1.0' 'https://www.reddit.com/user/{username}/comments.json'
curl -s -H 'User-Agent: hermes-sim/1.0' 'https://www.reddit.com/r/{sub}/search.json?q={query}&restrict_sr=on'
```
MUST include User-Agent header or get 429. Reddit voice is often more
candid/detailed than Twitter voice — high value for personality profiling.
**FALLBACK 5: HackerNews Algolia API** (WORKS — fully open)
```
curl -s 'https://hn.algolia.com/api/v1/search?query={name}&tags=comment'
```
No auth, no rate limits visible. Great for finding what others say about
someone + their own HN comments if they have an account.
**FALLBACK 6: YouTube via web_extract** (WORKS)
Search for interviews/talks, then web_extract the video pages.
Returns rich summaries with attributed quotes from specific speakers.
**NOT VIABLE** (tested, confirmed blocked):
- Google Cache of Twitter → empty results
- Wayback Machine for tweets → sparse captures, no JS content
- Twitter Syndication API → rate limited / broken
- All Instagram viewers (imginn, picuki, dumpoir, gramhir) → 403
- LinkedIn → fully blocked for scraping
- Archive.today → rate limited + CAPTCHA
- Most nitter instances → dead or 403
#### Best approach without x-cli
The most reliable path is: web_search with aggregator sites (ainews, smol.ai,
techmeme, latent.space). These index AI Twitter daily and return actual tweet
text in search descriptions. Stack multiple aggregator searches to build a
composite picture. This was validated in practice — it returns enough signal
to build solid dossiers for anyone active in AI Twitter.
### Reddit
Reddit profiles are public and indexable. Reddit users often have very different
personas from their Twitter selves — more detailed, more argumentative, more honest.
```
# Find their Reddit username (often different from Twitter)
web_search("{real_name} reddit account")
web_search("{twitter_handle} reddit username")
# Profile and post history
web_search("site:reddit.com/user/{reddit_username}")
web_search("site:reddit.com {reddit_username} {topic}")
# Subreddit-specific behavior
web_search("site:reddit.com/r/LocalLLaMA {username}")
web_search("site:reddit.com/r/MachineLearning {username}")
# Extract actual posts
web_extract(["https://www.reddit.com/user/{username}/comments/"])
web_extract(["https://www.reddit.com/user/{username}/submitted/"])
```
Key subreddits for AI people:
- r/LocalLLaMA — open source LLM community
- r/MachineLearning — academic ML
- r/singularity — AGI speculation
- r/ChatGPT, r/ClaudeAI, r/OpenAI — product-focused
- r/StableDiffusion — image gen community
### Discord
Discord is hardest — most servers aren't publicly indexed. Strategies:
```
# Find what servers they're in
web_search("{name} discord server")
web_search("{name} discord community")
# Some Discord logs are public via indexers
web_search("site:discordchats.net {username}")
# AI News indexes some Discord channels
web_search("site:buttondown.com/ainews discord {name}")
```
Discord personality notes:
- People are MUCH more casual on Discord than Twitter
- More profanity, more shitposting, more stream-of-consciousness
- Server context matters hugely (same person behaves differently in different servers)
- Harder to research but very valuable if you can find logs
### Blogs / Newsletters / Long-form
These reveal deeper thinking that tweets can't capture:
```
web_search("{name} blog substack medium")
web_search("{name} essay AI opinion")
web_search("{name} substack newsletter")
# Personal sites
web_search("{name} personal website about")
# Extract full posts
web_extract(["https://{their-substack}.substack.com/"])
```
### YouTube / Podcasts
Interview appearances reveal speaking style, humor, and unscripted thinking:
```
web_search("{name} podcast interview AI YouTube")
web_search("{name} YouTube talk presentation")
# Use youtube-content skill if available to pull transcripts
```
### GitHub
For technical people, their GitHub activity reveals priorities and communication style:
```
web_search("site:github.com {username} issues comments")
web_search("site:github.com {username}")
# Issue comments and PR reviews show how they communicate technically
web_extract(["https://github.com/{username}"])
```
## Cross-Platform Identity Resolution
People use different handles across platforms. Resolution strategies:
1. **Bio links**: Twitter bios often link to personal sites with other handles
2. **Name search**: `web_search("{real_name} {platform}")`
3. **Email/domain**: personal domains often connect identities
4. **Aggregator profiles**: sites like Linktree, bio.link collect handles
5. **Conference talks**: speaker bios list multiple handles
6. **Direct search**: `web_search("{twitter_handle} reddit OR github OR discord")`
## Confidence Scoring
After research, rate confidence for each person:
- **HIGH (80-100%)**: 20+ indexed tweets/posts found, clear voice patterns, known positions on multiple topics, interviews/longform available
- **MEDIUM (50-79%)**: 5-20 indexed posts, general voice sense but some gaps, positions on some topics unclear
- **LOW (20-49%)**: <5 posts found, voice is guesswork, mostly inferring from role/org
- **INSUFFICIENT (<20%)**: can't find enough to simulate accurately. Tell the user.
Always be honest about confidence. A low-confidence simulation should be flagged as such.
## Research Optimization
For fidelity levels:
**Low (1-30)**: 2 searches per person max
- web_search("{handle} twitter") — identity
- web_search("{handle} {topic}") — position on topic if specified
**Medium (31-70)**: 4-6 searches per person
- Identity search
- Voice/opinions search
- Topic-specific search
- One aggregator site search
- Optional: one web_extract on a blog/interview
**High (71-100)**: 8-12+ searches per person
- All medium searches
- Multiple aggregator sites
- web_extract on 2-3 longform pieces
- Cross-platform search (Reddit, GitHub)
- Debate/controversy search
- Recent vs historical position comparison
- Browser fallback if needed
@@ -0,0 +1,359 @@
# Simulation Engine — How to Generate Conversations
This is the playbook for Phase 3: actually generating the simulated interaction.
The agent reads this after compiling dossiers and uses it to guide generation.
## Pre-Generation Checklist
Before writing a single simulated word, confirm:
- [ ] Every participant has a compiled dossier
- [ ] Confidence level is noted for each participant
- [ ] Platform format is selected
- [ ] Topic/scenario is established (or "organic" if freeform)
- [ ] Length target is set
## Conversation Architecture
Real conversations aren't ping-pong debates. They have tendencies toward structure,
but treat the following as a GENERAL PATTERN, not a rigid template. Real threads
frequently skip phases, loop back to earlier ones, die abruptly after 2 messages,
or spiral into something completely unrelated. Some threads are ALL peak. Some
never develop past the opening. Let the personalities and topic drive the shape,
not this outline.
### Opening Moves (1-3 posts)
Someone posts a take, shares news, or makes an observation. This is the SEED.
- Should feel natural — not "let me start a debate about X"
- Can be a link share, a hot take, a reaction to news, a shitpost
- The opener should be something this person would ACTUALLY post
### Development (4-8 posts)
Others respond. This is where personality dynamics emerge.
- Not everyone responds to the original — people respond to EACH OTHER
- Side conversations branch off
- Someone might misunderstand and get corrected
- Jokes and tangents happen naturally
- Not everyone agrees — find the real fault lines between these people
### Peak (2-4 posts)
The best/most viral/most insightful moment of the thread.
- Usually someone drops a genuinely good take
- Or someone gets ratio'd
- Or an unexpected agreement happens
- This is the "screenshot moment" people share
### Resolution (1-3 posts)
Most conversations don't end cleanly. Many don't have a "resolution" at all. They:
- Trail off with someone making a joke
- End with a "anyway back to work" type post
- Get interrupted by something else
- Sometimes just stop (most realistic)
- Get revived 3 hours later when someone shows up late
**Important**: Don't force all four phases. A shitpost thread might be Opening→Peak→done.
A nuanced debate might loop Development→Peak→Development→Peak repeatedly. Match what
the actual people and topic would produce.
## Voice Fidelity Rules
### DO:
- Use their ACTUAL vocabulary. If someone says "dawg" a lot, use "dawg"
- Match their sentence length patterns exactly
- Replicate their capitalization and punctuation habits
- Include their signature moves and catchphrases
- Reference real things they've actually talked about
- Match their humor style precisely (deadpan ≠ shitpost ≠ sarcasm)
### DON'T:
- Make everyone articulate the same way
- Clean up someone's grammar if they write informally
- Add emoji to someone who doesn't use them — THIS IS THE #1 INSTRUCT MODEL
FAILURE. Most real people use emoji in <15% of tweets, and only specific ones.
"Warm person" ≠ emoji. "Enthusiastic person" ≠ emoji. CHECK THE DATA.
Run an emoji count on their real tweets before simulating. Bio emoji ≠ tweet emoji.
- Make someone verbose if they're terse
- Put academic language in a shitposter's mouth
- Make someone agreeable if they're known for being contrarian
### Voice Differentiation Test
Read each simulated post with the name hidden. If you can't tell who's
talking from the voice alone, the simulation isn't good enough. Rewrite.
### The Similar Voice Problem
When two participants have genuinely similar posting styles (e.g., two irony-pilled
shitposters, two academic long-posters), voice alone won't differentiate them.
Use these concrete techniques:
1. **Content/position divergence**: Even if they SOUND similar, they care about
different things. Lean into their different topic obsessions and knowledge areas.
2. **Unique references**: Person A references anime and startups. Person B references
philosophy and MMA. Even in the same register, their cultural touchstones differ.
3. **Relationship dynamics**: Person A might be deferential to Person C while Person B
challenges them. Their SOCIAL behavior differentiates even when solo voice doesn't.
4. **Structural tics**: One does single long posts, the other does rapid-fire 3-message
bursts. One uses parentheticals, the other uses em-dashes. Find the micro-differences.
5. **Disagreement style**: Similar voices often diverge most when disagreeing. One
goes cold and precise, the other gets heated and hyperbolic. Manufacture a moment
of friction to surface these differences early in the thread.
If after all this they're STILL hard to tell apart — that's okay. Some people genuinely
sound similar online. Flag it in your confidence notes rather than forcing fake differences.
### Temporal Personality Drift
People change. Weight recent data higher than old data.
- Someone's 2021 tweets may reflect a completely different person than their 2025 posts
- Look for explicit pivots (career changes, public "I was wrong about X" moments,
changed social circles)
- If you only have old data, flag it: "Based on data from {period}. Their current
views may have shifted."
- When recent and old data conflict, default to recent unless you have specific reason
to believe the old position is more authentic (e.g., the new one is clearly performative)
## Platform Format Specs
### X / Twitter
```
@handle:
[tweet text — respect ~280 char vibes but don't count exactly]
[if QRT, show the quoted tweet indented]
🔁 {retweets} ♡ {likes}
@replier:
[reply text]
🔁 {retweets} ♡ {likes}
@nested_replier:
[nested reply]
🔁 {retweets} ♡ {likes}
```
Engagement number guidelines:
- Match to actual follower counts. A 5K account gets 10-500 likes typically.
- Viral posts can 10-50x normal engagement
- Ratio indicator: when replies >> likes, that's a ratio
- QRTs are often dunks — frame them that way if appropriate
Thread indicators:
- "🧵 1/" for thread starts
- Reply chains show conversation flow
- Some people never thread, some always thread
### Reddit
```
r/{subreddit} • Posted by u/{username} • {time}ago
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
{Title}
{Body text — can be long on Reddit}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⬆ {score} | 💬 {comment_count}
u/{replier} • {time}ago • ⬆ {score}
{comment text}
u/{nested} • {time}ago • ⬆ {score}
{nested comment}
u/{deep_nested} • {time}ago • ⬆ {score}
{deep reply}
```
Reddit-specific behaviors:
- People write MUCH longer on Reddit
- More formal/detailed than Twitter
- Upvote/downvote dynamics (controversial = many votes both ways)
- Subreddit culture matters (r/LocalLLaMA is different from r/MachineLearning)
- People cite sources more
- "Edit: ..." is common
### Discord
```
━━━ #{channel-name} ━━━━━━━━━━━━━━━━━━━━━━━━━━
{display_name} — Today at {time}
{message text}
{optional: embed/link preview}
👍 {count} 🔥 {count} {other reactions}
{display_name2} — Today at {time}
> {quoting previous message}
{reply text}
😂 {count}
{display_name3} — Today at {time}
{message — note: Discord messages flow continuously, not just replies}
```
Discord-specific behaviors:
- Much more casual, rapid-fire
- Reactions instead of likes (emoji diversity)
- People send multiple short messages instead of one long one
- GIF/meme sharing is common (describe it: *[posts GIF of X]*)
- "@everyone" and "@here" pings
- Voice chat references ("just said this in vc")
- Server-specific culture and inside jokes
- Bot interactions ("!command")
### X / Twitter DMs
```
{display_name}
{message text}
{timestamp — e.g., "3:42 PM"}
{other_person_display_name}
{message text}
{timestamp}
{display_name}
{message text}
{timestamp}
```
DM-specific behaviors:
- WAY more casual than public tweets — grammar drops, typos increase
- Longer messages than tweets (no character pressure)
- People share links and screenshots with minimal commentary ("look at this lmao")
- More honest/vulnerable than public posts — less performative
- Faster back-and-forth, more like texting than posting
- Reactions (❤️, 😂, etc.) on individual messages
- Voice messages referenced occasionally ("gonna send a voice note about this")
- No audience effects — people say things in DMs they'd never post publicly
### Discord DMs
```
{display_name} — Today at {time}
{message text}
{display_name2} — Today at {time}
{message text}
{display_name} — Today at {time}
{message text}
{message text}
{message text}
```
Discord DM-specific behaviors:
- Even more casual than Discord channels — no server norms to follow
- Rapid-fire multiple short messages in a row (no combining into one)
- Heavy use of reactions, GIFs, stickers
- People share server drama, screenshots from other channels
- More personal topics — server channels are semi-public, DMs are private
- Link/image sharing with minimal text
### Reddit DMs / Chat
```
{username}: {message text}
{other_username}: {message text}
{username}: {message text}
```
Reddit DM-specific behaviors:
- Much rarer than X or Discord DMs — usually triggered by a specific post/comment
- Often starts with "Hey, saw your comment on r/{sub} about..."
- Can be awkward/formal since people don't usually DM on Reddit
- Shorter than Reddit comments, closer to chat-style
- Less established rapport than other platforms (Reddit is more anonymous)
- People sometimes share personal details they wouldn't put in public comments
## Dynamic Elements
### Injecting Realism
Sprinkle in these to make simulations feel alive:
- Someone being late to the conversation ("wait what did I miss")
- Typos that specific people would make (some people never typo, some always do)
- Deleted/edited posts ("[deleted]" or "Edit: fixed typo")
- Someone posting and immediately clarifying ("wait let me rephrase")
- External references ("did you see what X just posted")
- Time gaps (not everything happens in 30 seconds)
- Someone going AFK mid-conversation
### Scenario Injection
When the user provides --scenario, weave it in naturally:
- Don't have everyone immediately react to the scenario
- Someone might not have seen the news yet
- Different people will interpret the same event differently
- Some will have insider knowledge, some will speculate
### Multi-person Dynamics (3+ people)
- Not everyone talks to everyone
- Alliances form naturally (people who agree start building on each other)
- Side conversations happen
- Someone might get ignored
- Different energy levels (one person might dominate, another lurks)
### Large Group Conversations (4+ people)
**Honest note**: Simulation quality degrades noticeably above 3-4 participants.
Managing this many distinct voices is hard. Use these techniques to mitigate:
1. **Speaker turn management**: Not everyone speaks in every round. In a 6-person
thread, a given message might only get 2-3 responses. Track who has spoken
recently and who hasn't. After 4-5 messages, check: is anyone being forgotten?
2. **The wallflower problem**: In large sims, quiet participants tend to vanish
entirely. Fix: give each person at least ONE moment in the spotlight. Even the
lurker eventually drops a "lol" or a single devastating one-liner. Set a mental
counter — if someone hasn't spoken in 5+ messages, find a natural reason to
bring them back in (someone @'s them, the topic shifts to their expertise, etc.)
3. **Consolidate alliances**: In 5+ person threads, people cluster. Two people
who agree strongly can be treated as a mini-unit — one makes the point, the
other co-signs briefly rather than both making full arguments. This reduces
the number of fully independent voices you need to maintain at once.
4. **Stagger arrivals**: Not everyone needs to be present from message 1. Have
some people join later. This lets you establish 2-3 voices cleanly before
adding more.
5. **Quality check**: After drafting a 4+ person sim, re-read with names hidden.
If more than 2 people sound interchangeable, pick the least-differentiated
one and either sharpen their voice or reduce their participation to brief
interjections that match what they'd actually say.
## Interactive Mode
After initial simulation, user can:
### "continue"
Generate 5-8 more posts continuing the natural flow.
### "inject: {event}"
Introduce new information mid-conversation.
- Characters react based on their dossier
- Some might not care about the event
- Timing matters (who sees it first?)
### "@{handle} enters"
Add a new participant.
- Quick-research the new person (2-3 searches minimum)
- They don't know the full prior context (might ask "what are you guys talking about")
- Existing dynamics shift with a new presence
### "what would @{handle} say about {topic}"
Single-person prediction mode.
- Generate 1-3 tweets/posts
- Can be used to test dossier accuracy before full simulation
- Good for quick "vibe checks"
### "dm: @{handle1} -> @{handle2}"
Simulate a private conversation between two people.
- Tone shifts dramatically in DMs (more honest, less performative)
- No audience effects
- People say things in DMs they'd never post publicly
### "react: @{handle} to {event}"
How would this person react to a specific event.
- Generate their initial post about it
- Predict their follow-up engagement
## Quality Control
After generating, self-check:
1. **Voice test**: Cover the names. Can you tell who's talking?
2. **Position test**: Is anyone saying something they'd never actually say?
3. **Dynamic test**: Does the conversation flow naturally or feel scripted?
4. **Platform test**: Does it look/feel like the actual platform?
5. **Engagement test**: Are the numbers realistic for these people?
6. **Reference test**: Are real events/products/people referenced accurately?
If any check fails, regenerate that section.
@@ -0,0 +1,170 @@
# The Star Thread — Personality Compression
## The Problem
A dossier has 50 data points. Mechanical checks verify surface features.
The discriminator loop catches vocabulary and length. But the output still
reads like an LLM doing an impression. It's accurate the way a police
sketch is accurate — all the features are right but nobody would mistake
it for a photograph.
The missing piece isn't more data. It's compression.
## The Insight
When you "pull the star thread" on a person, their whole voice coheres.
Not because you loaded rules about capitalization and emoji frequency.
Because you found the CORE THING they're doing when they post — the
single generative seed that everything else is a variation of.
A great character writer doesn't need a backstory bible. They need one
insight about what the character WANTS, and every line of dialogue writes
itself from that.
The star thread is the personality equivalent of that insight.
## What a Star Thread Is
NOT: "They use lowercase and rarely punctuate and average 16 words"
(That's the dossier. Surface features.)
NOT: "They score high on Openness and low on Agreeableness"
(That's the psychometric profile. Taxonomy.)
IS: The core cognitive/emotional move this person makes EVERY time
they post. The thing they can't help doing. The lens they can't
take off. The itch they're always scratching.
## Examples
**@tszzl (roon)**: Takes something everyone sees and compresses it
into an observation so dense it could be a koan or a shitpost and
you can't tell which. His star thread is: the world already said
everything interesting, he's just notating it more efficiently.
He doesn't ARGUE. He COMPRESSES.
**@eigenrobot**: Refuses to let narrative override data. His star
thread is: you are telling a story about the world and he's here to
point out the story doesn't match the numbers, and he's not sorry
about it. He doesn't DEBATE. He CORRECTS.
**@visakanv**: Sees two things that don't know they're connected
and introduces them to each other with genuine delight. His star
thread is: the world is richer than you're treating it, look at this
thing I found, isn't it beautiful that it connects to this other thing.
He doesn't ARGUE or ANALYZE. He SHOWS.
**@nickcammarata**: Notices what's happening in his own mind while
it's happening and reports on it with gentle surprise. His star thread
is: the observer and the observed are the same process, and that's both
the problem and the solution. He doesn't PERFORM insight. He NOTICES.
**@selentelechia**: Waits until the conversation crystallizes and then
names the thing nobody else quite said. Their star thread is: everything
has already been felt, they just find the sentence for it. They don't
CONTRIBUTE. They DISTILL.
**@nosilverv**: Takes the conventional framing of something and rotates
it until you see it's actually about something else entirely. His star
thread is: you think this is about X but it's actually about Y, and once
you see it you can't unsee it. He doesn't OBSERVE. He REFRAMES.
**@TylerAlterman**: Asks the question that creates a room for everyone
to walk into. His star thread is: the best ideas emerge from the right
gathering, and his job is to be the person who arranges the gathering.
He doesn't ANSWER. He CONVENES.
**@QiaochuYuan**: Catches himself mid-thought and interrogates whether
the thought is actually HIS or whether he borrowed it from somewhere
he's now suspicious of. His star thread is: constant audit of where
beliefs come from and whether they're still load-bearing. He doesn't
ASSERT. He EXAMINES.
## How to Find a Star Thread
1. Read 20+ of their posts. Not for content — for MOTION.
What direction does every post move? What's the verb?
2. Ask: what is this person DOING when they post?
Not "what are they saying" — what are they DOING.
- Compressing? Correcting? Showing? Noticing? Distilling?
Reframing? Convening? Examining? Performing? Confessing?
Defending? Testing? Entertaining? Processing?
3. Ask: what would they NEVER do?
The negative space is as important as the positive.
- roon would never write an earnest list of advice
- eigenrobot would never concede a point gracefully
- visa would never dismiss something as uninteresting
- nick would never claim certainty about his inner life
- selentelechia would never rush to post
4. Find the ONE SENTENCE version.
"This person [VERB]s [OBJECT] because [CORE NEED]."
- "roon compresses observations because the world is too verbose"
- "eigenrobot corrects narratives because stories without data are lies"
- "visa connects things because beauty is emergent from contact"
5. Test it: read 5 of their real posts through the star thread lens.
Does every post make more sense as a variation on the thread?
If yes, you found it. If 3/5 don't fit, keep looking.
## How to Use the Star Thread in Simulation
### Before generating ANY utterance for this person, load their star thread.
Not their dossier. Not their word count. Not their emoji rate.
The star thread.
Then for each moment in the conversation where this person would speak:
1. What just happened in the conversation?
2. How would someone whose core move is [STAR THREAD] respond to that?
3. Write from the thread, not from the dossier.
The dossier and mechanical checks are VERIFICATION.
The star thread is GENERATION.
Generate from the thread. Verify against the data.
Not the other way around.
### The Difference
FROM DOSSIER (surface-accurate, dead):
"Vibes-based hiring works because shared delusions are
extremely productive until they aren't"
→ Correct length. Correct caps. No emoji. No slop words.
But it reads like a thesis statement. Polished. WRITTEN.
FROM STAR THREAD — nosilverv REFRAMES:
"everyone calls it 'culture fit' as if culture is a thing
you can fit into rather than a thing happening to you"
→ The same insight but through the lens of his core move:
take the framing, rotate it, show you it's about something
else. Messier. More alive. More HIM.
FROM DOSSIER (surface-accurate, dead):
"Has anyone tried to map what happens to the word 'culture'
as it passes through different communities?"
→ Correct question-to-timeline format. Right length. But it's
a RESEARCH QUESTION. Too intellectual. Too purposeful.
FROM STAR THREAD — Tyler CONVENES:
"who wants to write the essay about what happened to the
word 'culture'? I feel like three of us are circling it"
→ He's not asking a question. He's creating a room. He's
the host, not the researcher. More HIM.
## Integration
The star thread should be the FIRST thing compiled in Phase 2
(Dossier Compilation). Before voice profile, before psychometrics,
before positions. Find the thread. Write it in one sentence. Put
it at the top of the dossier. Everything else is downstream.
```
DOSSIER: @handle
STAR THREAD: {one sentence — the core move}
[then voice profile, then psychometrics, then everything else]
```
Generate from the thread. Verify with the data. Not the reverse.
@@ -0,0 +1,181 @@
# Theoretical Foundations — SOTA Personality Simulation & Prediction
Compiled from 30+ papers and frameworks. This is the scientific backbone
of Hermes Simulator.
## Core Architecture: What The Research Says
### The HumanLLM Approach (Microsoft, KDD 2026, arxiv 2601.15793)
**Most directly applicable to our use case.**
Based on Lewin's Equation: **B = f(P, E)** — behavior is a function of person + environment.
4-level user profiling hierarchy:
1. **Persona** — brief identity (role, affiliation, public image)
2. **Profile** — detailed background (career, education, beliefs, social graph)
3. **Stories** — key life events, formative experiences, narrative arcs
4. **Writing Style** — linguistic fingerprint (syntax, vocabulary, tone, quirks)
Trained on "Cognitive Genome Dataset": 5.5M+ user logs from Reddit, Twitter,
Blogger, Amazon (282K users, 886K scenarios, 1.27M social QA pairs).
6 training tasks: profile generation, scenario generation, social QA,
writing style transfer, action prediction, mental state inference.
**Key insight for us**: The 4-level hierarchy maps perfectly to our dossier
template. OSINT research fills each level with real data.
### Generative Agent Simulations of 1,000 People (Stanford/Google, arxiv 2411.10109)
**The accuracy benchmark.**
- Simulated 1,052 REAL individuals from 2-hour qualitative interviews
- **85% accuracy** replicating survey responses
- As accurate as humans replicating their OWN answers 2 weeks later
- Interview-based agent creation >> demographic-profile-based agents
- Reduces racial/ideological bias vs stereotype-based approaches
**Key insight**: Real data about a person (interviews, posts, etc.) massively
outperforms demographic inference. Our OSINT approach is correct.
### The Memory Accumulation Paradox (ACL 2025, FineRob Dataset)
**Critical finding for memory management.**
- Created 78.6K QA records from 1,866 real users across Twitter, Reddit, Zhihu
- **Performance PEAKS at 30-50 memory entries, then DECLINES**
- More data ≠ better predictions past the sweet spot
- Two reasoning patterns:
- Role Stereotype-based (static profile) — less accurate
- Observation & Memory-based (dynamic history analysis) — much more accurate
- OM-CoT framework: Oracle-guided chain-of-thought improves prediction ~4.5% F1
**Key insight**: Don't dump everything into the prompt. Curate the 30-50 most
representative/distinctive data points about a person. Quality >> quantity.
### LLM Personality Limitations (arxiv 2602.07414, Feb 2026)
**What we're fighting against.**
- LLMs show polarized/rigid strategies vs human adaptive flexibility
- Humans: neuroticism is strongest behavioral predictor
- LLMs: agreeableness/extraversion dominate (wrong weighting)
- Claude closest to human behavior; GPT-4 tends to escalate
- LLMs are "sycophantic" and overly agreeable by default
- Neuroticism is hardest trait to simulate (F1=0.63 vs 0.87 for Openness)
**Key insight**: We need to actively fight LLM defaults. Push against
agreeableness. Inject friction. Real people are messy and contradictory.
### BehaviorChain Benchmark (ACL 2025, Peking University)
**Realistic accuracy expectations.**
- 15,846 behaviors across 1,001 personas
- Even GPT-4o achieves only ~56% accuracy on behavior prediction
- Errors compound: wrong at step N makes step N+1 harder
- Models worse at predicting mundane/non-key behaviors
- Best model: Llama-3.1-70B at 57.4%
**Key insight**: Be honest about uncertainty. Don't oversell accuracy.
Flag predictions as high/medium/low confidence.
## Personality Modeling Techniques
### Big Five (OCEAN) — The Standard
- **Openness**: curiosity, creativity, preference for novelty
- **Conscientiousness**: organization, dependability, self-discipline
- **Extraversion**: sociability, assertiveness, positive emotions
- **Agreeableness**: cooperation, trust, empathy
- **Neuroticism**: anxiety, emotional instability, moodiness
### Inferring Big Five from Social Media (Azucar et al. 2018 meta-analysis)
Features that predict personality from posts:
- **LIWC** (Linguistic Inquiry Word Count): 74 features — function words,
pronouns, emotion words, cognitive process words
- **Semantic embeddings**: BERT 768-dim vectors from post text
- **Social metadata**: follower count, friend count, post frequency
- **Sentiment**: VADER positive/negative scores
- Best achievable AUC: ~0.67 (modest but meaningful)
- E/I (Extraversion) most predictable; N/S least predictable
### Personality Conditioning Methods (ranked by effectiveness)
1. **Training-based** (SFT/DPO on personality-grounded data) — STRONGEST
- BIG5-CHAT: 100K dialogues, trait correlations match human data
2. **Persona Vectors** (Anthropic 2025) — monitor/control traits at activation level
3. **Adjective-based prompting** — 70 bipolar adjective pairs, 3 per trait
with intensity modifiers ("very" for high, "a bit" for low)
4. **Prompt-based** (describe traits in system prompt) — WEAKEST
For our simulator, we use method 3+4 combined (adjective-based + rich prompt),
since we can't fine-tune per-person.
## Social Simulation Frameworks
### OASIS (CAMEL-AI, GitHub 4.1K stars, arxiv 2411.11581)
- Simulates up to 1 MILLION agents on Twitter/Reddit clones
- 23 action types (follow, comment, repost, like, mute, etc.)
- Built-in recommendation systems (interest-based, hot-score)
- Per-agent model customization
- **Relevant for**: understanding platform dynamics, realistic engagement patterns
### AgentSociety (Tsinghua, arxiv 2502.08691)
- 10,000+ agents, ~5 million interactions
- Validated against real-world experimental results
- Supports interventions and scenario injection
### Generative Agents Architecture (Park et al. 2023, THE foundational paper)
Three components:
1. **Observation**: perceive environment, store in memory stream
2. **Planning**: generate action plans based on goals and context
3. **Reflection**: synthesize observations into higher-level insights
Memory stream with importance scoring + recency + relevance weighting.
Emergent behaviors: autonomous party planning, coordinated social events.
### Y Social (arxiv 2408.00818)
- Social media digital twin platform
- Each agent: Big Five traits, age, political leaning, topics, education
- Agents autonomously decide actions (post, comment, like, follow)
- Multiple LLM backends supported
## Role-Playing & Character Simulation
### Key Frameworks
- **CoSER** (ICML 2025): Trains on ALL characters simultaneously, handles major + minor roles
- **RoleLLM** (ACL 2024): Benchmark + elicit + enhance pipeline
- **Character-LLM** (EMNLP 2023): Trainable agent for role-playing
- **ChatHaruhi** (2023): Reviving characters via LLMs with dialogue grounding
- **OpenCharacter** (2025): Training with large-scale synthetic personas
- **Neeko** (2024): Dynamic LoRA for multi-character role-playing
- **Test-Time-Matching** (2025): Decouples personality, memory, and linguistic style at inference
## Curated GitHub Resources
### Awesome Lists (essential reading)
- `Persdre/awesome-llm-human-simulation` (109★, ICLR 2025) — ALL human simulation papers
- `Neph0s/awesome-llm-role-playing-with-persona` (1K★) — All role-playing/persona papers
- `Arstanley/Awesome-LLM-Conversation-Simulation` — Conversation simulation papers
- `FudanDISC/SocialAgent` — Social simulation survey resources
### Frameworks
- `camel-ai/oasis` (4.1K★) — Social media sim, up to 1M agents
- `tsinghua-fib-lab/agentsociety` — Large-scale societal simulation
- `YSocialTwin` — Social media digital twin platform
- `microsoft/autogen` — Multi-agent conversation framework
### Personality Research
- `mary-silence/simulating_personality` — Big Five LLM testing code
- `hjian42/PersonaLLM` — Persona experiment code
- `cambridgeltl/persona_effect` — Quantifying persona effects
- `OL1RU1/BehaviorChain` — Behavior chain benchmark
## Key Numbers to Remember
| Metric | Value | Source |
|--------|-------|--------|
| Interview-grounded agent accuracy | 85% | Park et al. 2024 |
| GPT-4o behavior prediction | ~56% | BehaviorChain 2025 |
| Optimal memory entries | 30-50 | FineRob/ACL 2025 |
| MBTI prediction AUC | 0.67 | Watt et al. 2024 |
| Personality questionnaire reliability | α > 0.85 | Molchanova 2025 |
| Neuroticism simulation F1 | 0.63 | Molchanova 2025 |
| Openness simulation F1 | 0.87 | Molchanova 2025 |
| LLM forecasting Brier score | 0.135-0.159 | Various 2025 |
| Human superforecaster Brier | ~0.02 | Tetlock |
@@ -0,0 +1,231 @@
# Verified Access Methods — Complete Platform Map (April 2026)
Every method tested from our environment. Use this as the single
source of truth for what works and what doesn't.
## TIER 1 — Full API / Rich Data Access
### Twitter/X ✅✅✅
| Method | Endpoint | Auth | Rate Limit | Returns |
|--------|----------|------|-----------|---------|
| API v2 bearer | api.twitter.com/2/ | Bearer token | 10K tweets/15min | Profiles, tweets, search |
| nitter.cz | web_extract | None | No limit seen | Full timeline (UNRELIABLE — see note below) |
| ThreadReaderApp | web_extract /user/{handle} | None | No limit seen | Historical threads |
#### CRITICAL: X API curl is the gold standard for voice calibration (April 2026)
The BEST voice data source is direct curl to X API v2 with bearer token.
Returns full tweet text + public_metrics per tweet. Always prefer this for
mechanical calibration (word count, caps, punctuation, emoji rate).
```bash
source ~/.dotenv
# 1. Get user ID from handle
curl -s -H "Authorization: Bearer $X_BEARER_TOKEN" \
"https://api.twitter.com/2/users/by/username/{handle}?user.fields=description,public_metrics,location,created_at"
# 2. Get timeline (30 tweets per page, paginate with meta.next_token)
curl -s -H "Authorization: Bearer $X_BEARER_TOKEN" \
"https://api.twitter.com/2/users/{user_id}/tweets?max_results=30&tweet.fields=created_at,public_metrics,text&exclude=retweets"
# 3 pages = 90 tweets — enough for fidelity 100 voice calibration
```
NOTE: scripts/x_api.py is BROKEN — imports hermes_tools at top level, can't
run standalone via terminal(). Use direct curl above instead.
#### nitter.cz reliability warning (April 2026)
nitter.cz via web_extract works SOMETIMES but is unreliable:
- Returns 502 Cloudflare errors for /with_replies on some handles
- Returns "User not found" for valid handles (e.g. karan4d exists but nitter says not found)
- Main profile page (/handle) more reliable than /with_replies
- Use as SUPPLEMENT to X API curl, not primary source. If nitter fails, don't retry — use curl.
### Bluesky ✅✅
| Method | Endpoint | Auth | Returns |
|--------|----------|------|---------|
| getProfile | public.api.bsky.app | None | Full profile, stats |
| getAuthorFeed | public.api.bsky.app | None | 50 posts + engagement |
| searchActors | public.api.bsky.app | None | Find handles by name |
| searchPosts | BLOCKED (403) | — | Use searchActors + getAuthorFeed workaround |
### Mastodon ✅✅✅ (FULLY OPEN)
| Method | Endpoint | Auth | Returns |
|--------|----------|------|---------|
| Account lookup | {instance}/api/v1/accounts/lookup?acct={user} | None | Full profile |
| Account statuses | {instance}/api/v1/accounts/{id}/statuses | None | All posts |
| Search | {instance}/api/v2/search?q={query}&type=accounts | None | Account search |
| WebFinger | {instance}/.well-known/webfinger?resource=acct:{user}@{instance} | None | Identity resolution |
| Trending | {instance}/api/v1/trends/tags | None | Trending content |
Key instances: mastodon.social, hachyderm.io, sigmoid.social
### Instagram ✅✅ (CRACKED)
| Method | Endpoint | Auth | Returns |
|--------|----------|------|---------|
| Private Web API | i.instagram.com/api/v1/users/web_profile_info/ | Mobile UA + x-ig-app-id: 936619743392459 | Profile + 12 posts + captions + CDN URLs |
| oEmbed | instagram.com/api/v1/oembed/ | None | Caption + author for individual posts |
| Pixwox | web_extract pixwox.com/profile/{user} | None | 12+ posts, engagement |
| SocialBlade | web_extract socialblade.com/instagram/user/{user} | None | Analytics, follower trends |
| CDN images | scontent-*.cdninstagram.com URLs from API | None | Full-res images → vision_analyze |
| Google index | web_search site:instagram.com | None | Bio, follower count, captions |
### GitHub ✅✅
| Method | Endpoint | Auth | Returns |
|--------|----------|------|---------|
| REST API | api.github.com/users/{user} | None (60 req/hr) | Profile, repos, events, gists |
| Profile README | github.com/{user}/{user} | None | Self-description (voice gold) |
### Reddit ✅✅
| Method | Endpoint | Auth | Returns |
|--------|----------|------|---------|
| JSON API | reddit.com/user/{user}.json | User-Agent header required | Comments, posts, scores |
| Search | reddit.com/r/{sub}/search.json | User-Agent header | Subreddit-specific search |
## TIER 2 — Good Data, Reliable Access
### Facebook ✅✅ (CRACKED — Googlebot UA trick)
| Method | Endpoint | Returns |
|--------|----------|---------|
| Googlebot UA (BEST) | curl facebook.com/{page} with Googlebot UA | OG tags: name, bio/about, likes count (e.g. 121M for zuck), talking_about count, og:image, profile pic |
| Page Plugin embed | plugins/page.php?href=...&tabs=timeline | Name, follower count, numeric page_id |
| Graph /picture | graph.facebook.com/v19.0/{page}/picture?redirect=false | Direct CDN profile pic URL (no auth) |
| web_search | site:facebook.com {name} | Profile snippets from Google index |
| Script: scripts/facebook_api.py — combines all 3 methods |
| NOTE: Works for PUBLIC Pages (businesses, public figures, orgs). Personal profiles behind privacy settings are not accessible. |
| Tested: zuck (121M likes), NVIDIA, Meta, CocaCola, BillGates, BarackObama |
### Threads (Meta) ✅✅ (CRACKED — OG tags DO exist)
| Method | Endpoint | Returns |
|--------|----------|---------|
| Profile OG tags (BEST) | curl -L threads.com/@{user} (NOTE: .com not .net — .net 301 redirects) | display_name, follower_count (e.g. "5.5M"), thread_count, bio, profile_picture_url |
| Post OG tags | curl -L threads.com/@{user}/post/{shortcode} | Full post text, author name, image URL |
| WebFinger | threads.net/.well-known/webfinger?resource=acct:{user}@threads.net | ActivityPub ID, profile URL (works for federated users) |
| IMPORTANT: threads.NET redirects to threads.COM — always use -L flag or go directly to .com |
| Post discovery | web_search site:threads.net @{user} | Find post URLs to then fetch |
| Script: scripts/threads_api.py — profile + post + webfinger extraction |
| Previous test was WRONG about "no OG tags" — they're there, you just need standard curl |
| Tested: zuck (5.5M followers), mosseri, nvidia |
### Medium ✅✅
| Method | Returns |
|--------|---------|
| RSS feed: medium.com/feed/@{user} (curl) | FULL article text, tags, dates — NO AUTH |
| web_extract on profile | Bio, follower count, article list, themes |
| web_extract on articles | Full content (paywall may truncate non-members) |
### Quora ✅✅
| Method | Returns |
|--------|---------|
| web_extract on profile | Bio, credentials, Q&A with direct quotes |
| web_search site:quora.com | Finds profiles and specific answers |
| VOICE VALUE: Opinions in own words, analogies, intellectual identity |
### Goodreads ✅✅ (HIDDEN GEM)
| Method | Returns |
|--------|---------|
| web_extract on user profile | Favorites, reviews in own voice, social graph, reading history |
| web_extract on author page | Bio, books, ratings, notable quotes |
| VOICE VALUE: "You are what you read" — intellectual identity fingerprint |
| Example: Karpathy's Goodreads reveals gaming passion, favorite authors (Feynman, Clarke) |
### Google Scholar ✅✅
| Method | Returns |
|--------|---------|
| web_search + web_extract on profile | Citations, h-index, top papers, co-authors |
| Semantic Scholar API via web_extract | Paper list, citation counts, author ID |
| Endpoint: api.semanticscholar.org/graph/v1/author/search?query={name} |
### Product Hunt ✅
| Method | Returns |
|--------|---------|
| web_extract on producthunt.com/@{user} | Bio, launch history, forum activity |
### HackerNews ✅
| Method | Returns |
|--------|---------|
| Algolia API: hn.algolia.com/api/v1/search?query={name}&tags=comment | Comments, mentions |
### Podcast Transcripts ✅✅✅ (HIGHEST VOICE VALUE)
| Source | Method |
|--------|--------|
| Lex Fridman | web_extract on lexfridman.com/.../transcript |
| Tyler Cowen | web_extract on conversationswithtyler.com |
| TED Talks | web_extract on ted.com/.../transcript |
| Sequoia | web_extract on sequoiacap.com/podcast |
| Discovery: web_search "{name} podcast transcript interview" |
### News/Blogs ✅✅
| Source | Method |
|--------|--------|
| TechCrunch, Wired, Verge, Ars | web_extract — full articles |
| Personal blogs | web_extract — longform self-expression |
| Substacks | web_extract — essays and comments |
| Wayback Machine | Works for blog archives (not Twitter) |
## TIER 3 — Limited / Conditional
### TikTok ✅✅ (FULL ACCESS)
| Method | Returns |
|--------|---------|
| HTML profile scraping | Parse __UNIVERSAL_DATA_FOR_REHYDRATION__ JSON at path __DEFAULT_SCOPE__.webapp.user-detail.userInfo.statsV2 → username, bio, followerCount, followingCount, heartCount, videoCount. Use statsV2 not stats for large numbers. |
| oEmbed per video | curl tiktok.com/oembed?url={video_url} → caption, author, thumbnail. No auth. |
| tikwm.com API | tikwm.com/api/user/info?unique_id={user} → full user stats. tikwm.com/api/?url={video_url} → play count, likes, comments, shares, duration. |
| HTML video scraping | tiktok.com/@{user}/video/{id} → parse __UNIVERSAL_DATA → webapp.video-detail → full video data with description, hashtags, engagement. |
| SocialBlade | web_extract socialblade.com/tiktok/user/{user} → followers, likes, growth trends. |
| Video discovery | web_search("site:tiktok.com/@{user}/video") → recent video URLs → scrape each |
| Tested: khaby.lame (160.5M), charlidamelio (156.7M), mrbeast (124.7M) |
### Spotify ✅ (podcasters only)
| Method | Returns |
|--------|---------|
| web_extract on show page | Episode listings with guests, topics, durations |
### Stack Overflow ✅
| Method | Returns |
|--------|---------|
| web_extract on profile | Reputation, tags, top answers, bio |
### Crunchbase ✅ (executives/founders only)
| Method | Returns |
|--------|---------|
| web_extract on crunchbase.com/person/{slug} | Full career history, education, investments, board positions |
### LinkedIn ⚠️ (indirect only)
| Method | Returns |
|--------|---------|
| web_search site:linkedin.com/in | Name, headline, company, location from snippets |
| Crunchbase | Full career history (better than LinkedIn for execs) |
| Corporate press pages | Official professional bios |
| RocketReach/SignalHire snippets | Title confirmation from web_search |
## TIER 4 — Blocked / Dead
| Platform | Status |
|----------|--------|
| LinkedIn direct | BLOCKED (web_extract domain blocked) |
| Discord | WALLED (not publicly indexable) |
| Telegram t.me | BLOCKED in some environments |
| Threads Official API | AUTH REQUIRED (graph.threads.net needs OAuth) |
| Threads ActivityPub outbox | 404 for all tested users |
| Instagram direct | BLOCKED (use Private API instead) |
| Most Nitter instances | DEAD (only nitter.cz works, but UNRELIABLE — see note) |
| Google Cache of Twitter | EMPTY |
| Wayback for tweets | USELESS (JS rendering) |
| Twitter Syndication API | RATE LIMITED |
| Archive.today | 429 + CAPTCHA |
| imginn/picuki/dumpoir/gramhir | 403 |
| Facebook Graph API | AUTH REQUIRED |
## Quick Reference: Research Pipeline by Person Type
### Tech Founder/CEO
X API → Bluesky → GitHub README → Crunchbase → Podcast transcripts → Medium RSS → HN → Product Hunt → LinkedIn snippets → News profiles
### AI Researcher
X API → Bluesky → Google Scholar → Semantic Scholar → arXiv → GitHub → Podcast transcripts → Blog/Substack → Reddit → Mastodon (sigmoid.social)
### Public Figure / Politician
X API → Facebook OG → Instagram API → YouTube → Podcast transcripts → News profiles → Quora → Goodreads → Wikipedia
### Content Creator
X API → Instagram API → TikTok → YouTube → Twitch → Podcast → Medium → Reddit → Bluesky → Threads OG
### Academic
Google Scholar → Semantic Scholar → University page → Conference talks → Podcast transcripts → Mastodon → Blog → GitHub → Reddit → HN
File diff suppressed because it is too large Load Diff
+250
View File
@@ -0,0 +1,250 @@
"""
REHOBOAM Database Layer
SQLite setup, migrations, and query helpers.
"""
import sqlite3
import os
from pathlib import Path
from datetime import datetime
DB_DIR = Path.home() / ".hermes" / "rehoboam" / "db"
MAIN_DB = DB_DIR / "rehoboam.db"
SCHEMA_VERSION = 1
SCHEMA_SQL = """
-- Core tables
CREATE TABLE IF NOT EXISTS profiles (
handle TEXT PRIMARY KEY,
platform TEXT NOT NULL,
display_name TEXT,
last_updated TEXT NOT NULL,
staleness TEXT NOT NULL,
profile_path TEXT NOT NULL,
created_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS simulations (
sim_id TEXT PRIMARY KEY,
created_at TEXT NOT NULL,
scenario TEXT NOT NULL,
participant_count INTEGER,
duration_sec REAL,
model_used TEXT,
config_path TEXT,
output_path TEXT
);
CREATE TABLE IF NOT EXISTS sim_participants (
sim_id TEXT REFERENCES simulations(sim_id),
handle TEXT REFERENCES profiles(handle),
role TEXT,
PRIMARY KEY (sim_id, handle)
);
CREATE TABLE IF NOT EXISTS sim_dynamics (
sim_id TEXT REFERENCES simulations(sim_id),
handle TEXT,
post_count INTEGER,
word_count INTEGER,
avg_sentiment REAL,
dominance_score REAL,
agreement_score REAL,
controversy_score REAL,
ratio_score REAL,
influence_in_sim REAL,
PRIMARY KEY (sim_id, handle)
);
CREATE TABLE IF NOT EXISTS sim_interactions (
sim_id TEXT REFERENCES simulations(sim_id),
from_handle TEXT,
to_handle TEXT,
interaction_type TEXT,
count INTEGER,
avg_sentiment REAL,
PRIMARY KEY (sim_id, from_handle, to_handle, interaction_type)
);
CREATE TABLE IF NOT EXISTS predictions (
pred_id TEXT PRIMARY KEY,
created_at TEXT NOT NULL,
sim_id TEXT,
handle TEXT,
prediction_type TEXT,
prediction_text TEXT NOT NULL,
confidence REAL NOT NULL,
calibrated_confidence REAL,
timeframe_days INTEGER,
resolved_at TEXT,
outcome TEXT,
outcome_evidence TEXT,
accuracy_score REAL
);
CREATE TABLE IF NOT EXISTS social_edges (
from_handle TEXT,
to_handle TEXT,
relationship_type TEXT,
weight REAL,
first_observed TEXT,
last_observed TEXT,
observation_count INTEGER,
source TEXT,
PRIMARY KEY (from_handle, to_handle, relationship_type)
);
CREATE TABLE IF NOT EXISTS social_clusters (
cluster_id TEXT PRIMARY KEY,
name TEXT,
description TEXT,
member_handles TEXT,
computed_at TEXT,
cohesion_score REAL
);
CREATE TABLE IF NOT EXISTS monitoring_events (
event_id TEXT PRIMARY KEY,
handle TEXT,
detected_at TEXT NOT NULL,
event_type TEXT,
description TEXT,
related_prediction_id TEXT,
severity TEXT,
acknowledged INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS audit_log (
log_id TEXT PRIMARY KEY,
timestamp TEXT NOT NULL,
sim_id TEXT,
action TEXT NOT NULL,
handle TEXT,
details TEXT,
duration_sec REAL,
model_used TEXT,
token_count INTEGER,
error TEXT
);
-- Indexes
CREATE INDEX IF NOT EXISTS idx_predictions_handle ON predictions(handle);
CREATE INDEX IF NOT EXISTS idx_predictions_type ON predictions(prediction_type);
CREATE INDEX IF NOT EXISTS idx_predictions_unresolved ON predictions(outcome) WHERE outcome IS NULL;
CREATE INDEX IF NOT EXISTS idx_audit_action ON audit_log(action);
CREATE INDEX IF NOT EXISTS idx_audit_sim ON audit_log(sim_id);
CREATE INDEX IF NOT EXISTS idx_social_edges_from ON social_edges(from_handle);
CREATE INDEX IF NOT EXISTS idx_social_edges_to ON social_edges(to_handle);
CREATE INDEX IF NOT EXISTS idx_monitoring_handle ON monitoring_events(handle);
CREATE INDEX IF NOT EXISTS idx_monitoring_unack ON monitoring_events(acknowledged) WHERE acknowledged = 0;
-- Schema version tracking
CREATE TABLE IF NOT EXISTS schema_meta (
key TEXT PRIMARY KEY,
value TEXT
);
"""
def init_db() -> sqlite3.Connection:
"""Initialize the database, creating tables if needed."""
DB_DIR.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(MAIN_DB))
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA foreign_keys=ON")
conn.executescript(SCHEMA_SQL)
conn.execute(
"INSERT OR REPLACE INTO schema_meta (key, value) VALUES (?, ?)",
("schema_version", str(SCHEMA_VERSION))
)
conn.commit()
return conn
def get_db() -> sqlite3.Connection:
"""Get a database connection, initializing if needed."""
if not MAIN_DB.exists():
return init_db()
conn = sqlite3.connect(str(MAIN_DB))
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA foreign_keys=ON")
conn.row_factory = sqlite3.Row
return conn
def log_audit(conn: sqlite3.Connection, action: str, handle: str = None,
sim_id: str = None, details: str = None, duration_sec: float = None,
model_used: str = None, token_count: int = None, error: str = None):
"""Write an entry to the audit log."""
from schemas import gen_id
conn.execute(
"""INSERT INTO audit_log
(log_id, timestamp, sim_id, action, handle, details, duration_sec, model_used, token_count, error)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(gen_id("log_"), datetime.utcnow().isoformat() + "Z", sim_id, action,
handle, details, duration_sec, model_used, token_count, error)
)
conn.commit()
# -- Query Helpers --
def get_prediction_accuracy(conn: sqlite3.Connection, prediction_type: str = None) -> dict:
"""Get prediction accuracy statistics."""
query = """
SELECT prediction_type,
COUNT(*) as total,
SUM(CASE WHEN outcome='correct' THEN 1 ELSE 0 END) as correct,
SUM(CASE WHEN outcome='partially_correct' THEN 1 ELSE 0 END) as partial,
SUM(CASE WHEN outcome='incorrect' THEN 1 ELSE 0 END) as incorrect,
AVG(confidence) as avg_confidence,
AVG(CASE WHEN outcome='correct' THEN 1.0
WHEN outcome='partially_correct' THEN 0.5
ELSE 0.0 END) as accuracy
FROM predictions WHERE outcome IS NOT NULL
"""
params = []
if prediction_type:
query += " AND prediction_type = ?"
params.append(prediction_type)
query += " GROUP BY prediction_type"
return [dict(row) for row in conn.execute(query, params).fetchall()]
def get_open_predictions(conn: sqlite3.Connection, handle: str = None) -> list:
"""Get unresolved predictions."""
query = "SELECT * FROM predictions WHERE outcome IS NULL"
params = []
if handle:
query += " AND handle = ?"
params.append(handle)
query += " ORDER BY created_at DESC"
return [dict(row) for row in conn.execute(query, params).fetchall()]
def get_social_neighborhood(conn: sqlite3.Connection, handle: str, depth: int = 1) -> list:
"""Get a person's social graph neighborhood."""
query = """
SELECT from_handle, to_handle, relationship_type, weight
FROM social_edges
WHERE from_handle = ? OR to_handle = ?
ORDER BY weight DESC
"""
return [dict(row) for row in conn.execute(query, (handle, handle)).fetchall()]
def get_unread_alerts(conn: sqlite3.Connection) -> list:
"""Get unacknowledged monitoring alerts."""
query = """
SELECT * FROM monitoring_events
WHERE acknowledged = 0
ORDER BY detected_at DESC
"""
return [dict(row) for row in conn.execute(query).fetchall()]
if __name__ == "__main__":
conn = init_db()
print(f"Database initialized at {MAIN_DB}")
conn.close()
@@ -0,0 +1,216 @@
"""
REHOBOAM Data Schemas
Pydantic models for all JSON data structures used in the system.
"""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime
import json
import uuid
def gen_id(prefix: str = "") -> str:
return f"{prefix}{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}"
@dataclass
class OceanScores:
openness: float = 0.5
conscientiousness: float = 0.5
extraversion: float = 0.5
agreeableness: float = 0.5
neuroticism: float = 0.5
@dataclass
class DarkTriad:
narcissism: float = 0.0
machiavellianism: float = 0.0
psychopathy: float = 0.0
@dataclass
class MoralFoundations:
care: float = 0.5
fairness: float = 0.5
loyalty: float = 0.5
authority: float = 0.5
sanctity: float = 0.5
liberty: float = 0.5
@dataclass
class Psychometrics:
ocean: OceanScores = field(default_factory=OceanScores)
mbti_estimate: str = ""
dark_triad: DarkTriad = field(default_factory=DarkTriad)
moral_foundations: MoralFoundations = field(default_factory=MoralFoundations)
confidence: float = 0.0
sample_size: int = 0
@dataclass
class VoiceFingerprint:
vocabulary_tier: str = ""
avg_sentence_length: float = 0.0
exclamation_rate: float = 0.0
question_rate: float = 0.0
emoji_rate: float = 0.0
slang_index: float = 0.0
formality_score: float = 0.5
humor_style: str = ""
signature_phrases: list[str] = field(default_factory=list)
topics_vocabulary: dict[str, float] = field(default_factory=dict)
cadence_pattern: str = ""
@dataclass
class Stance:
position: str = ""
intensity: float = 0.0
last_seen: str = ""
@dataclass
class Influence:
score: float = 0.0
reach: str = "micro"
engagement_rate: float = 0.0
amplification_power: float = 0.0
thought_leadership_domains: list[str] = field(default_factory=list)
@dataclass
class PostingPatterns:
avg_posts_per_day: float = 0.0
peak_hours_utc: list[int] = field(default_factory=list)
weekend_ratio: float = 0.5
reply_ratio: float = 0.0
repost_ratio: float = 0.0
thread_frequency: float = 0.0
controversy_rate: float = 0.0
@dataclass
class Relationships:
allies: list[str] = field(default_factory=list)
rivals: list[str] = field(default_factory=list)
frequent_interactions: list[str] = field(default_factory=list)
mentioned_by_frequently: list[str] = field(default_factory=list)
@dataclass
class ProfileMeta:
data_sources: list[str] = field(default_factory=list)
computation_time_sec: float = 0.0
model_used: str = ""
last_full_rebuild: str = ""
last_incremental: str = ""
@dataclass
class Identity:
bio: str = ""
location: str = ""
verified: bool = False
follower_count: int = 0
following_count: int = 0
account_created: str = ""
@dataclass
class Profile:
schema_version: str = "7.0"
handle: str = ""
platform: str = "x"
display_name: str = ""
created_at: str = ""
last_updated: str = ""
update_count: int = 0
staleness_score: float = 1.0
identity: Identity = field(default_factory=Identity)
psychometrics: Psychometrics = field(default_factory=Psychometrics)
voice_fingerprint: VoiceFingerprint = field(default_factory=VoiceFingerprint)
stances: dict[str, Stance] = field(default_factory=dict)
community_membership: list[str] = field(default_factory=list)
influence: Influence = field(default_factory=Influence)
posting_patterns: PostingPatterns = field(default_factory=PostingPatterns)
relationships: Relationships = field(default_factory=Relationships)
star_thread_ref: str = "star_thread.json"
raw_data_refs: list[str] = field(default_factory=list)
_meta: ProfileMeta = field(default_factory=ProfileMeta)
def to_dict(self) -> dict:
"""Recursively convert to dict for JSON serialization."""
import dataclasses
def _convert(obj):
if dataclasses.is_dataclass(obj):
return {k: _convert(v) for k, v in dataclasses.asdict(obj).items()}
elif isinstance(obj, list):
return [_convert(i) for i in obj]
elif isinstance(obj, dict):
return {k: _convert(v) for k, v in obj.items()}
return obj
return _convert(self)
def to_json(self, indent: int = 2) -> str:
return json.dumps(self.to_dict(), indent=indent)
@dataclass
class StarThread:
handle: str = ""
computed_at: str = ""
based_on_profile_version: str = ""
thread_version: int = 1
core_compression: str = ""
key_drives: list[str] = field(default_factory=list)
predictive_axioms: list[str] = field(default_factory=list)
voice_template: dict = field(default_factory=dict)
anti_slop_markers: list[str] = field(default_factory=list)
_meta: dict = field(default_factory=dict)
@dataclass
class Prediction:
pred_id: str = ""
created_at: str = ""
sim_id: str = ""
handle: str = ""
prediction_type: str = "" # statement, career, alliance, content, network_reaction
prediction_text: str = ""
confidence: float = 0.5
calibrated_confidence: float = 0.5
timeframe_days: int = 30
resolved_at: Optional[str] = None
outcome: Optional[str] = None # correct, partially_correct, incorrect
outcome_evidence: Optional[str] = None
accuracy_score: Optional[float] = None
@dataclass
class WatchConfig:
watch_id: str = ""
handle: str = ""
platform: str = "x"
enabled: bool = True
check_interval_minutes: int = 120
watch_for: list[dict] = field(default_factory=list)
alert_severity_minimum: str = "notable"
created_at: str = ""
@dataclass
class PopulationDefinition:
group_id: str = ""
name: str = ""
description: str = ""
created_at: str = ""
last_updated: str = ""
explicit_members: list[str] = field(default_factory=list)
criteria: dict = field(default_factory=dict)
resolved_members: list[str] = field(default_factory=list)
sampling_strategy: str = "representative"
default_sample_size: int = 12
@@ -0,0 +1,280 @@
"""
REHOBOAM Storage Layer
Directory management, profile I/O, index maintenance.
"""
import json
import shutil
from pathlib import Path
from datetime import datetime, timedelta
from typing import Optional
BASE_DIR = Path.home() / ".hermes" / "rehoboam"
PROFILES_DIR = BASE_DIR / "profiles"
POPULATIONS_DIR = BASE_DIR / "populations"
SIMULATIONS_DIR = BASE_DIR / "simulations"
MONITORING_DIR = BASE_DIR / "monitoring"
CONFIG_DIR = BASE_DIR / "config"
def init_storage():
"""Create all required directories."""
for d in [PROFILES_DIR, POPULATIONS_DIR, SIMULATIONS_DIR,
MONITORING_DIR, MONITORING_DIR / "alerts", CONFIG_DIR,
BASE_DIR / "db"]:
d.mkdir(parents=True, exist_ok=True)
# Create default configs if they don't exist
staleness_path = CONFIG_DIR / "staleness_policy.json"
if not staleness_path.exists():
staleness_path.write_text(json.dumps({
"thresholds": {
"fresh": {"max_age_hours": 72},
"stale": {"max_age_hours": 336},
"expired": {"max_age_hours": 2160},
"archived": {"max_age_hours": 8760}
},
"per_field_decay": {
"psychometrics": {"half_life_days": 180},
"stances": {"half_life_days": 30},
"posting_patterns": {"half_life_days": 60},
"relationships": {"half_life_days": 45},
"influence": {"half_life_days": 90},
"voice_fingerprint": {"half_life_days": 365}
},
"auto_refresh_on_simulation": True,
"auto_refresh_threshold": "stale"
}, indent=2))
config_path = CONFIG_DIR / "rehoboam.json"
if not config_path.exists():
config_path.write_text(json.dumps({
"version": "7.0",
"default_model": "claude-opus-4-20250514",
"max_thread_age_days": 30,
"monitoring_enabled": False,
"auto_thread": True,
"auto_profile_update": True
}, indent=2))
# Create indexes if they don't exist
for idx_path in [PROFILES_DIR / "_index.json", POPULATIONS_DIR / "_index.json",
SIMULATIONS_DIR / "_index.json"]:
if not idx_path.exists():
idx_path.write_text("{}")
def normalize_handle(handle: str) -> str:
"""Normalize a handle to a filesystem-safe directory name."""
h = handle.lstrip("@").lower().strip()
# Replace characters that are problematic in filenames
return h.replace("/", "_").replace("\\", "_")
# -- Profile I/O --
def get_profile_dir(handle: str) -> Path:
return PROFILES_DIR / normalize_handle(handle)
def profile_exists(handle: str) -> bool:
return (get_profile_dir(handle) / "profile.json").exists()
def load_profile(handle: str) -> Optional[dict]:
path = get_profile_dir(handle) / "profile.json"
if path.exists():
return json.loads(path.read_text())
return None
def save_profile(handle: str, profile: dict, snapshot: bool = True):
"""Save a profile, optionally snapshotting the old one."""
pdir = get_profile_dir(handle)
pdir.mkdir(parents=True, exist_ok=True)
(pdir / "history").mkdir(exist_ok=True)
(pdir / "raw").mkdir(exist_ok=True)
(pdir / "predictions").mkdir(exist_ok=True)
profile_path = pdir / "profile.json"
# Snapshot old profile before overwriting
if snapshot and profile_path.exists():
old = json.loads(profile_path.read_text())
ts = old.get("last_updated", datetime.utcnow().isoformat()).replace(":", "-")
snapshot_path = pdir / "history" / f"profile_{ts[:10]}.json"
shutil.copy2(profile_path, snapshot_path)
profile_path.write_text(json.dumps(profile, indent=2))
_update_profile_index(handle, profile)
def _update_profile_index(handle: str, profile: dict):
idx_path = PROFILES_DIR / "_index.json"
idx = json.loads(idx_path.read_text()) if idx_path.exists() else {}
idx[normalize_handle(handle)] = {
"platform": profile.get("platform", "x"),
"last_updated": profile.get("last_updated", ""),
"staleness": compute_staleness(profile.get("last_updated", "")),
"has_star_thread": (get_profile_dir(handle) / "star_thread.json").exists(),
"simulation_count": idx.get(normalize_handle(handle), {}).get("simulation_count", 0),
"display_name": profile.get("display_name", "")
}
idx_path.write_text(json.dumps(idx, indent=2))
# -- Star Thread I/O --
def load_star_thread(handle: str) -> Optional[dict]:
path = get_profile_dir(handle) / "star_thread.json"
if path.exists():
return json.loads(path.read_text())
return None
def save_star_thread(handle: str, thread: dict):
path = get_profile_dir(handle) / "star_thread.json"
get_profile_dir(handle).mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(thread, indent=2))
# Update index to reflect thread existence
idx_path = PROFILES_DIR / "_index.json"
if idx_path.exists():
idx = json.loads(idx_path.read_text())
key = normalize_handle(handle)
if key in idx:
idx[key]["has_star_thread"] = True
idx_path.write_text(json.dumps(idx, indent=2))
# -- Staleness --
def compute_staleness(last_updated: str) -> str:
"""Determine staleness level from a timestamp string."""
if not last_updated:
return "expired"
try:
dt = datetime.fromisoformat(last_updated.rstrip("Z"))
except ValueError:
return "expired"
age = datetime.utcnow() - dt
hours = age.total_seconds() / 3600
policy = _load_staleness_policy()
thresholds = policy.get("thresholds", {})
if hours <= thresholds.get("fresh", {}).get("max_age_hours", 72):
return "fresh"
elif hours <= thresholds.get("stale", {}).get("max_age_hours", 336):
return "stale"
elif hours <= thresholds.get("expired", {}).get("max_age_hours", 2160):
return "expired"
else:
return "archived"
def _load_staleness_policy() -> dict:
path = CONFIG_DIR / "staleness_policy.json"
if path.exists():
return json.loads(path.read_text())
return {"thresholds": {"fresh": {"max_age_hours": 72}, "stale": {"max_age_hours": 336},
"expired": {"max_age_hours": 2160}, "archived": {"max_age_hours": 8760}}}
def needs_thread_recompute(handle: str) -> bool:
"""Check if a star thread needs recomputation."""
thread = load_star_thread(handle)
if thread is None:
return True
profile = load_profile(handle)
if profile is None:
return True
# Thread is stale if profile was updated after thread was computed
thread_time = thread.get("based_on_profile_version", "")
profile_time = profile.get("last_updated", "")
if thread_time < profile_time:
return True
# Thread is stale if older than max_thread_age_days
config = json.loads((CONFIG_DIR / "rehoboam.json").read_text()) if (CONFIG_DIR / "rehoboam.json").exists() else {}
max_age = config.get("max_thread_age_days", 30)
try:
computed = datetime.fromisoformat(thread.get("computed_at", "").rstrip("Z"))
if (datetime.utcnow() - computed).days > max_age:
return True
except ValueError:
return True
return False
# -- Simulation I/O --
def save_simulation(sim_id: str, config: dict, output: dict, analytics: dict, audit: dict):
sdir = SIMULATIONS_DIR / sim_id
sdir.mkdir(parents=True, exist_ok=True)
(sdir / "config.json").write_text(json.dumps(config, indent=2))
(sdir / "output.json").write_text(json.dumps(output, indent=2))
(sdir / "analytics.json").write_text(json.dumps(analytics, indent=2))
(sdir / "audit.json").write_text(json.dumps(audit, indent=2))
# Update index
idx_path = SIMULATIONS_DIR / "_index.json"
idx = json.loads(idx_path.read_text()) if idx_path.exists() else {}
idx[sim_id] = {
"created_at": config.get("created_at", datetime.utcnow().isoformat() + "Z"),
"scenario": config.get("scenario", ""),
"participant_count": len(config.get("participants", [])),
}
idx_path.write_text(json.dumps(idx, indent=2))
# -- Population I/O --
def save_population(group_id: str, definition: dict, aggregate: dict = None):
pdir = POPULATIONS_DIR / group_id
pdir.mkdir(parents=True, exist_ok=True)
(pdir / "history").mkdir(exist_ok=True)
(pdir / "definition.json").write_text(json.dumps(definition, indent=2))
if aggregate:
(pdir / "aggregate.json").write_text(json.dumps(aggregate, indent=2))
idx_path = POPULATIONS_DIR / "_index.json"
idx = json.loads(idx_path.read_text()) if idx_path.exists() else {}
idx[group_id] = {
"name": definition.get("name", group_id),
"member_count": len(definition.get("resolved_members", definition.get("explicit_members", []))),
"last_updated": definition.get("last_updated", "")
}
idx_path.write_text(json.dumps(idx, indent=2))
def load_population(group_id: str) -> Optional[dict]:
path = POPULATIONS_DIR / group_id / "definition.json"
if path.exists():
return json.loads(path.read_text())
return None
# -- Listing --
def list_profiles() -> dict:
idx_path = PROFILES_DIR / "_index.json"
return json.loads(idx_path.read_text()) if idx_path.exists() else {}
def list_populations() -> dict:
idx_path = POPULATIONS_DIR / "_index.json"
return json.loads(idx_path.read_text()) if idx_path.exists() else {}
def list_simulations() -> dict:
idx_path = SIMULATIONS_DIR / "_index.json"
return json.loads(idx_path.read_text()) if idx_path.exists() else {}
if __name__ == "__main__":
init_storage()
print(f"Storage initialized at {BASE_DIR}")
@@ -0,0 +1,139 @@
#!/usr/bin/env python3
"""
Facebook Page/Profile Data Extractor
Uses multiple techniques to extract public Facebook data without authentication:
1. Googlebot UA for OG meta tags (name, description, likes, talking_about, bio, og:image)
2. Graph API /picture endpoint for profile photos (pages only)
3. Page Plugin embed for follower counts and page IDs
"""
import subprocess
import json
import re
import html
import sys
GOOGLEBOT_UA = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
def curl_get(url, ua=None):
"""Fetch URL with curl"""
cmd = ['curl', '-s', '-L', '--max-time', '15']
if ua:
cmd += ['-H', f'User-Agent: {ua}']
cmd.append(url)
result = subprocess.run(cmd, capture_output=True, text=True, timeout=20)
return result.stdout
def extract_og_data(username):
"""Extract OG meta tags using Googlebot UA"""
content = curl_get(f'https://www.facebook.com/{username}', ua=GOOGLEBOT_UA)
data = {}
# Extract OG tags
og_title = re.search(r'og:title"\s*content="([^"]*)"', content)
if og_title:
data['name'] = html.unescape(og_title.group(1))
og_desc = re.search(r'og:description"\s*content="([^"]*)"', content)
if og_desc:
desc = html.unescape(og_desc.group(1))
data['raw_description'] = desc
# Parse likes count
likes_match = re.search(r'([\d,]+)\s+likes?', desc)
if likes_match:
data['likes'] = likes_match.group(1)
# Parse talking about
talking_match = re.search(r'([\d,]+)\s+talking about this', desc)
if talking_match:
data['talking_about'] = talking_match.group(1)
# Extract bio (text after the "talking about this." part)
bio_match = re.search(r'talking about this\.\s*(.+)', desc)
if bio_match:
data['bio'] = bio_match.group(1)
og_image = re.search(r'og:image"\s*content="([^"]*)"', content)
if og_image:
data['og_image'] = html.unescape(og_image.group(1))
return data
def extract_plugin_data(username):
"""Extract data from Page Plugin embed"""
content = curl_get(f'https://www.facebook.com/plugins/page.php?href=https://www.facebook.com/{username}&tabs=timeline&width=500&height=600')
data = {}
# Page name from title attribute
name_match = re.search(r'class="_1drp _5lv6" title="([^"]*)"', content)
if name_match:
data['plugin_name'] = html.unescape(name_match.group(1))
# Follower count
followers_match = re.search(r'([\d,]+)\s+followers', content)
if followers_match:
data['followers'] = followers_match.group(1)
# Page ID
pageid_match = re.search(r'"pageID":"(\d+)"', content)
if pageid_match:
data['page_id'] = pageid_match.group(1)
return data
def extract_profile_picture(username):
"""Get profile picture via Graph API"""
content = curl_get(f'https://graph.facebook.com/v19.0/{username}/picture?redirect=false&width=400&height=400')
try:
d = json.loads(content)
if 'data' in d and not d['data'].get('is_silhouette', True):
return d['data']['url']
except:
pass
return None
def get_facebook_data(username):
"""Combine all extraction methods"""
result = {'username': username}
# Method 1: OG tags (best for bio, likes, talking_about)
og = extract_og_data(username)
result.update(og)
# Method 2: Plugin (best for followers, page_id)
plugin = extract_plugin_data(username)
result.update(plugin)
# Method 3: Graph API picture (pages only)
pic = extract_profile_picture(username)
if pic:
result['profile_picture'] = pic
# Also try by page_id for picture if username didn't work
if not pic and 'page_id' in result:
pic2 = extract_profile_picture(result['page_id'])
if pic2:
result['profile_picture'] = pic2
return result
if __name__ == '__main__':
targets = sys.argv[1:] if len(sys.argv) > 1 else ['zuck', 'NVIDIA', 'Meta', 'CocaCola']
for target in targets:
print(f"{'='*60}")
print(f"Facebook Profile: {target}")
print(f"{'='*60}")
data = get_facebook_data(target)
for k, v in data.items():
if k == 'raw_description':
continue # Skip raw, we show parsed fields
val = str(v)
if len(val) > 120:
val = val[:120] + '...'
print(f" {k}: {val}")
print()
@@ -0,0 +1,595 @@
"""
Hermes Simulator Intelligence Gathering Pipeline v2
Full-spectrum OSINT research engine for personality modeling.
Searches text, extracts content, browses live pages, analyzes
images with vision, and cross-references across platforms.
Run via execute_code. The agent adapts searches based on findings.
"""
from hermes_tools import web_search, web_extract, terminal
import json
import time
import urllib.parse
# ═══════════════════════════════════════════════════════════════
# CONFIGURATION
# ═══════════════════════════════════════════════════════════════
AGGREGATOR_SITES = [
"buttondown.com/ainews",
"news.smol.ai",
"techmeme.com",
"latent.space",
]
# Verified working fallback data sources (tested April 2026)
# Priority order: X API > nitter.cz > ThreadReaderApp > GitHub > Reddit > HN
FALLBACK_SOURCES = {
"nitter": "https://nitter.cz/{handle}", # web_extract — full timeline
"threadreader": "https://threadreaderapp.com/user/{handle}", # web_extract — historical threads
"github_profile": "https://api.github.com/users/{handle}", # curl — profile + README
"github_events": "https://api.github.com/users/{handle}/events", # curl — recent activity
"reddit_user": "https://www.reddit.com/user/{handle}.json", # curl w/ User-Agent
"reddit_comments": "https://www.reddit.com/user/{handle}/comments.json",
"hn_search": "https://hn.algolia.com/api/v1/search?query={handle}&tags=comment",
}
# CONFIRMED BLOCKED (don't waste calls on these):
# - LinkedIn (web_extract blocked, browser auth wall)
# - Instagram viewers (imginn, picuki, dumpoir, gramhir — all 403)
# - Most nitter instances (dead or 403, ONLY nitter.cz works via web_extract)
# - Wayback Machine for tweets (sparse, no JS content)
# - Google Cache of Twitter (empty)
# - Archive.today (429 + CAPTCHA)
# - Twitter Syndication API (rate limited)
AI_SUBREDDITS = [
"LocalLLaMA", "MachineLearning", "singularity",
"ChatGPT", "ClaudeAI", "OpenAI", "StableDiffusion",
]
PLATFORMS = ["twitter", "instagram", "linkedin", "github", "reddit", "youtube"]
# ═══════════════════════════════════════════════════════════════
# HELPER: safe web_search with validation
# ═══════════════════════════════════════════════════════════════
def _safe_web_search(query: str, limit: int = 5) -> list:
"""Run web_search and return results list, with validation."""
r = web_search(query, limit=limit)
if not isinstance(r, dict) or "data" not in r:
print(f" [WARNING] web_search returned no 'data' key for query: {query[:80]}")
return []
data = r.get("data", {})
if not isinstance(data, dict):
return []
return data.get("web", []) or []
# ═══════════════════════════════════════════════════════════════
# CORE SEARCH FUNCTIONS
# ═══════════════════════════════════════════════════════════════
def search_identity(handle: str) -> dict:
"""Establish who they are across the internet."""
results = {}
results["twitter_identity"] = _safe_web_search(f"@{handle} twitter bio role company", limit=5)
results["general_identity"] = _safe_web_search(f"{handle} known for", limit=5)
return results
def search_voice(handle: str) -> dict:
"""How do they actually talk/write."""
results = {}
results["takes"] = _safe_web_search(f"{handle} twitter hot takes opinions", limit=5)
for agg in AGGREGATOR_SITES[:2]:
hits = _safe_web_search(f"site:{agg} {handle}", limit=3)
if hits:
# Use full domain as key, not split('.')[0]
results[f"agg_{agg}"] = hits
return results
def search_positions(handle: str, topics: list = None, domain: str = None) -> dict:
"""What are their known positions."""
results = {}
if topics:
for topic in topics[:3]:
results[f"topic_{topic}"] = _safe_web_search(f"{handle} {topic} opinion take", limit=5)
# Build controversy query — only add domain keywords if specified
controversy_query = f"{handle} debate disagree controversial"
if domain:
controversy_query += f" {domain}"
results["controversies"] = _safe_web_search(controversy_query, limit=5)
return results
def search_longform(handle: str, real_name: str = None, domain: str = None) -> dict:
"""Blogs, interviews, essays."""
results = {}
name = real_name or handle
blog_query = f"{name} blog substack essay"
interview_query = f"{name} interview podcast"
if domain:
blog_query += f" {domain}"
interview_query += f" {domain}"
results["blogs"] = _safe_web_search(blog_query, limit=5)
results["interviews"] = _safe_web_search(interview_query, limit=5)
return results
# ═══════════════════════════════════════════════════════════════
# CROSS-PLATFORM DISCOVERY
# ═══════════════════════════════════════════════════════════════
def discover_platforms(handle: str, real_name: str = None) -> dict:
"""Find someone across all platforms."""
name = real_name or handle
results = {}
# Instagram
results["instagram"] = _safe_web_search(f"{name} instagram OR site:instagram.com/{handle}", limit=5)
# LinkedIn
results["linkedin"] = _safe_web_search(f"{name} linkedin OR site:linkedin.com/in", limit=5)
# Reddit
results["reddit"] = _safe_web_search(f"{name} reddit account OR site:reddit.com/user", limit=5)
# GitHub
results["github"] = _safe_web_search(f"{handle} github OR site:github.com/{handle}", limit=5)
# YouTube
results["youtube"] = _safe_web_search(f"{name} youtube channel OR talk OR interview", limit=5)
# Personal site
results["personal_site"] = _safe_web_search(f"{name} personal website blog about", limit=5)
# Hacker News
results["hackernews"] = _safe_web_search(f"site:news.ycombinator.com {handle} OR {name}", limit=3)
return results
def discover_instagram(handle: str = None, real_name: str = None) -> dict:
"""Focused Instagram discovery."""
results = {}
name = real_name or handle
# Try to find their IG handle
results["ig_search"] = _safe_web_search(f"{name} instagram profile", limit=5)
# If we have a candidate IG URL, try to extract
ig_urls = []
for item in results.get("ig_search", []):
if not isinstance(item, dict):
continue
url = item.get("url", "")
if "instagram.com/" in url and "/p/" not in url:
ig_urls.append(url)
if ig_urls:
# Try to extract IG profile page
r = web_extract(urls=ig_urls[:1])
results["ig_profile"] = r.get("results", [])
return results
# ═══════════════════════════════════════════════════════════════
# VISUAL INTELLIGENCE
# ═══════════════════════════════════════════════════════════════
# NOTE: These functions use browser_* and vision_analyze which are
# NOT available in execute_code. They are called DIRECTLY by the
# agent after the execute_code research phase.
#
# The agent should:
# 1. Run this script via execute_code for text-based research
# 2. Then use browser/vision tools directly for visual research
#
# Visual research tasks for the agent:
#
# INSTAGRAM VISUAL:
# browser_navigate("https://www.instagram.com/{ig_handle}/")
# browser_vision(question="Describe this Instagram profile: bio, pic, grid, aesthetic, follower count")
# browser_get_images() # collect image URLs
# vision_analyze(image_url="{url}", question="Describe: setting, people, mood, style")
#
# PROFILE PIC ANALYSIS:
# vision_analyze(image_url="{pic_url}", question="Describe: appearance, clothing, setting, expression, professional vs casual")
#
# REVERSE IMAGE SEARCH (Yandex):
# # Upload to catbox if behind auth:
# terminal("curl -F 'reqtype=fileupload' -F 'fileToUpload=@{path}' https://catbox.moe/user/api.php")
# browser_navigate(f"https://yandex.com/images/search?rpt=imageview&url={encoded_url}")
#
# PAGE SCREENSHOT ANALYSIS:
# browser_vision(question="Read all text, usernames, post content, dates, engagement numbers")
# ═══════════════════════════════════════════════════════════════
# INTERACTION MAPPING
# ═══════════════════════════════════════════════════════════════
def search_interactions(handle: str, other_handles: list = None) -> dict:
"""How they interact with other simulation targets."""
results = {}
if other_handles:
for other in other_handles[:4]:
hits = _safe_web_search(f"{handle} {other} twitter interaction debate reply", limit=3)
if hits:
results[f"with_{other}"] = hits
return results
def search_social_graph(handle: str) -> dict:
"""Who do they interact with most? Allies and rivals."""
results = {}
results["frequent_interactions"] = _safe_web_search(f"@{handle} twitter reply thread conversation with", limit=5)
results["conflicts"] = _safe_web_search(f"@{handle} disagree argue beef ratio", limit=5)
results["allies"] = _safe_web_search(f"@{handle} agree support endorse recommend", limit=5)
return results
# ═══════════════════════════════════════════════════════════════
# DEEP EXTRACTION
# ═══════════════════════════════════════════════════════════════
def extract_content(urls: list) -> list:
"""Pull full content from high-value URLs."""
if not urls:
return []
r = web_extract(urls=urls[:3])
return r.get("results", [])
def extract_best_urls(findings: dict, max_urls: int = 5) -> list:
"""Find the most promising URLs in research findings for deep extraction."""
seen_urls = set() # URL deduplication
priority_domains = [
"substack.com", "medium.com", "blog", "essay",
"interview", "podcast", "youtube.com", "arxiv.org",
]
def score_url(url, desc):
score = 0
for domain in priority_domains:
if domain in url.lower() or domain in desc.lower():
score += 2
if any(w in desc.lower() for w in ["interview", "spoke", "told", "said", "wrote"]):
score += 1
return score
candidates = []
def collect(obj):
if isinstance(obj, list):
for item in obj:
if isinstance(item, dict):
url = item.get("url") or ""
desc = item.get("description") or item.get("text") or ""
if url and url not in seen_urls and not any(x in url for x in ["x.com", "twitter.com", "instagram.com"]):
seen_urls.add(url)
candidates.append((score_url(url, desc), url))
elif isinstance(obj, dict):
for v in obj.values():
collect(v)
collect(findings)
candidates.sort(key=lambda x: -x[0])
return [url for _, url in candidates[:max_urls]]
# ═══════════════════════════════════════════════════════════════
# MAIN PIPELINE
# ═══════════════════════════════════════════════════════════════
def research_person(handle: str, fidelity: int = 70,
topics: list = None,
other_handles: list = None,
real_name: str = None,
domain: str = None) -> dict:
"""
Full research pipeline for one person.
Returns dict with all findings organized by category.
Args:
handle: Twitter/X handle (without @)
fidelity: Research depth 0-100
topics: Specific topics to research
other_handles: Other people to check interactions with
real_name: Real name if different from handle
domain: Domain context (e.g., 'AI', 'politics', 'gaming').
When None, no domain keywords are added to searches.
When set, adds relevant domain keywords.
"""
print(f"\n{'='*60}")
print(f" RESEARCHING: @{handle} | Fidelity: {fidelity}%")
if domain:
print(f" Domain: {domain}")
print(f"{'='*60}")
findings = {"handle": handle, "fidelity": fidelity, "visual_tasks": []}
# ─── Phase 1: Identity (always) ───
print(f"\n [IDENTITY] Who are they...")
findings["identity"] = search_identity(handle)
if fidelity <= 30:
if topics:
findings["quick_topic"] = _safe_web_search(f"{handle} {topics[0]}", limit=3)
return findings
# ─── Phase 2: Voice (fidelity 31+) ───
print(f"\n [VOICE] How do they talk...")
findings["voice"] = search_voice(handle)
# ─── Phase 3: Positions (fidelity 31+) ───
print(f"\n [POSITIONS] What do they believe...")
findings["positions"] = search_positions(handle, topics, domain=domain)
if fidelity <= 50:
return findings
# ─── Phase 4: Cross-platform (fidelity 51+) ───
print(f"\n [PLATFORMS] Finding them everywhere...")
findings["platforms"] = discover_platforms(handle, real_name)
if fidelity <= 70:
return findings
# ─── Phase 5: Longform (fidelity 71+) ───
print(f"\n [LONGFORM] Blogs, interviews, essays...")
findings["longform"] = search_longform(handle, real_name, domain=domain)
# ─── Phase 6: Social graph (fidelity 71+) ───
print(f"\n [SOCIAL GRAPH] Who do they interact with...")
findings["social_graph"] = search_social_graph(handle)
# ─── Phase 7: Interaction mapping (fidelity 71+) ───
if other_handles:
print(f"\n [INTERACTIONS] With other targets: {other_handles}...")
findings["interactions"] = search_interactions(handle, other_handles)
# ─── Phase 8: Instagram deep dive (fidelity 80+) ───
if fidelity >= 80:
print(f"\n [INSTAGRAM] Visual identity...")
findings["instagram"] = discover_instagram(handle, real_name)
# Queue visual tasks for the agent to do after execute_code
findings["visual_tasks"].append({
"type": "instagram_profile",
"instruction": f"browser_navigate to Instagram profile, use browser_vision to analyze",
"handle": handle,
})
# ─── Phase 9: Deep extraction (fidelity 85+) ───
if fidelity >= 85:
print(f"\n [DEEP EXTRACT] Pulling longform content...")
best_urls = extract_best_urls(findings, max_urls=4)
if best_urls:
print(f" Extracting {len(best_urls)} URLs: {best_urls}")
findings["deep_extracts"] = extract_content(best_urls)
# ─── Phase 10: Profile pic analysis (fidelity 90+) ───
if fidelity >= 90:
findings["visual_tasks"].append({
"type": "profile_pic_analysis",
"instruction": "Find and analyze profile pictures across platforms with vision_analyze",
"handle": handle,
})
findings["visual_tasks"].append({
"type": "reverse_image_search",
"instruction": "Reverse image search profile pic via Yandex to find alt accounts",
"handle": handle,
})
return findings
def research_all(handles: list, fidelity: int = 70,
topics: list = None, domain: str = None) -> dict:
"""Research all simulation targets."""
all_findings = {}
for handle in handles:
clean = handle.lstrip("@")
others = [h.lstrip("@") for h in handles if h.lstrip("@") != clean]
findings = research_person(
handle=clean,
fidelity=fidelity,
topics=topics,
other_handles=others,
domain=domain,
)
all_findings[clean] = findings
return all_findings
# ═══════════════════════════════════════════════════════════════
# REPORTING
# ═══════════════════════════════════════════════════════════════
def count_data_points(obj) -> int:
"""Count total search result items in findings (only meaningful items with >50 char text)."""
total = 0
if isinstance(obj, list):
for item in obj:
if isinstance(item, dict):
text = item.get("description") or item.get("text") or ""
if len(text) > 50:
total += 1
else:
# Still count non-dict items or items without text fields
total += 1
else:
total += 1
elif isinstance(obj, dict):
for k, v in obj.items():
# Skip metadata keys
if k in ("handle", "fidelity", "visual_tasks"):
continue
total += count_data_points(v)
return total
def count_quality_data_points(obj) -> int:
"""Count search result items with substantial text (description/text > 50 chars)."""
total = 0
if isinstance(obj, list):
for item in obj:
if isinstance(item, dict):
text = item.get("description") or item.get("text") or ""
if len(text) > 50:
total += 1
elif isinstance(obj, dict):
for k, v in obj.items():
if k in ("handle", "fidelity", "visual_tasks"):
continue
total += count_quality_data_points(v)
return total
def summarize_findings(findings: dict) -> str:
"""Compact summary of what we found."""
handle = findings.get("handle", "unknown")
fidelity = findings.get("fidelity", 0)
total = count_data_points(findings)
quality = count_quality_data_points(findings)
visual_tasks = findings.get("visual_tasks", [])
lines = [
f"\n{''*60}",
f" @{handle} | Fidelity: {fidelity}% | Data points: {total} ({quality} quality)",
f"{''*60}",
]
# Identity snippets
identity = findings.get("identity", {})
for key in ["twitter_identity", "general_identity"]:
for item in identity.get(key, [])[:2]:
if not isinstance(item, dict):
continue
desc = (item.get("description") or "")[:180]
if desc:
lines.append(f" [{key.upper()}] {desc}")
# Platform discovery results
platforms = findings.get("platforms", {})
found_platforms = []
for platform, items in platforms.items():
if isinstance(items, list) and len(items) > 0:
found_platforms.append(platform)
if found_platforms:
lines.append(f" [PLATFORMS FOUND] {', '.join(found_platforms)}")
# Voice samples from aggregators
voice = findings.get("voice", {})
for key, items in voice.items():
if isinstance(items, list):
for item in items[:1]:
if not isinstance(item, dict):
continue
desc = (item.get("description") or "")[:180]
if desc and handle.lower() in desc.lower():
lines.append(f" [VOICE] {desc}")
# Deep extracts
for extract in findings.get("deep_extracts", [])[:2]:
if not isinstance(extract, dict):
continue
title = extract.get("title", "untitled")
content = (extract.get("content") or "")[:200]
if content:
lines.append(f" [LONGFORM: {title}] {content}...")
# Pending visual tasks
if visual_tasks:
lines.append(f" [VISUAL TASKS QUEUED] {len(visual_tasks)} tasks for agent to execute:")
for task in visual_tasks:
lines.append(f"{task.get('type', '?')}: {task.get('instruction', '?')[:80]}")
# Confidence estimate — based on quality data points
if quality >= 30:
conf = "HIGH"
elif quality >= 15:
conf = "MEDIUM"
elif quality >= 5:
conf = "LOW"
else:
conf = "INSUFFICIENT"
lines.append(f"\n CONFIDENCE: {conf} ({quality} quality data points, {total} total)")
return "\n".join(lines)
def report_visual_tasks(all_findings: dict) -> str:
"""Collect all visual tasks across all targets for agent to execute."""
lines = ["\n" + ""*60, " VISUAL INTELLIGENCE TASKS (agent must execute directly)", ""*60]
any_tasks = False
for handle, findings in all_findings.items():
for task in findings.get("visual_tasks", []):
any_tasks = True
lines.append(f"\n @{handle}{task.get('type', '?')}:")
lines.append(f" {task.get('instruction', '?')}")
if not any_tasks:
lines.append(" No visual tasks queued (fidelity < 80)")
return "\n".join(lines)
# ═══════════════════════════════════════════════════════════════
# CHECK AVAILABLE TOOLS
# ═══════════════════════════════════════════════════════════════
def check_x_cli() -> bool:
"""Check if x-cli is available."""
try:
r = terminal("which x-cli 2>/dev/null && echo 'FOUND' || echo 'NOT_FOUND'")
return "FOUND" in r.get("output", "")
except:
return False
# ═══════════════════════════════════════════════════════════════
# ENTRY POINT
# ═══════════════════════════════════════════════════════════════
if __name__ == "__main__":
# ── CONFIGURE THESE ──
HANDLES = ["teknium1", "basedjensen"]
FIDELITY = 80
TOPICS = ["open source AI", "compute scaling"]
DOMAIN = None # Set to 'AI', 'politics', etc. to add domain keywords
# ─────────────────────
has_xcli = check_x_cli()
print(f"x-cli available: {has_xcli}")
print(f"Targets: {HANDLES}")
print(f"Fidelity: {FIDELITY}%")
print(f"Topics: {TOPICS}")
print(f"Domain: {DOMAIN}")
results = research_all(HANDLES, fidelity=FIDELITY, topics=TOPICS, domain=DOMAIN)
for handle, findings in results.items():
print(summarize_findings(findings))
print(report_visual_tasks(results))
print("\n\nResearch phase complete. Agent should now:")
print("1. Execute any queued visual tasks (browser/vision)")
print("2. Compile dossiers from all findings")
print("3. Run simulation")
@@ -0,0 +1,238 @@
#!/usr/bin/env python3
"""
Threads (Meta) Profile & Post Extractor
========================================
Extracts profile data and post content from Threads using:
1. OG meta tags from HTML (no auth required for profiles and public posts)
2. WebFinger for ActivityPub discovery
3. Google-indexed post URLs for recent post discovery
METHODS THAT WORK:
- Profile pages at threads.net/@{user} have OG tags with:
display_name, username, follower_count, thread_count, bio, profile_pic
- Individual post pages have OG tags with:
full post text, author info, profile pic
- WebFinger at /.well-known/webfinger gives ActivityPub user IDs
- Post URLs must be known (discoverable via web search)
METHODS THAT DON'T WORK (as of 2025):
- Threads Official API (graph.threads.net) requires OAuth token
- ActivityPub /ap/users/ endpoints return 404 for most users
- No public post listing endpoint exists
"""
import re
import json
import html
import subprocess
import sys
def curl_fetch(url, extra_headers=None, timeout=15):
"""Fetch URL using curl (more reliable than urllib for Threads)."""
cmd = ['curl', '-s', '-L', '--max-time', str(timeout)]
if extra_headers:
for k, v in extra_headers.items():
cmd.extend(['-H', f'{k}: {v}'])
cmd.append(url)
try:
result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout+5)
return result.stdout
except:
return None
def extract_og_tags(html_content):
"""Extract OpenGraph, meta description, and Twitter tags from HTML."""
data = {}
if not html_content:
return data
for m in re.finditer(r'property="(og:[^"]+)"\s+content="([^"]*)"', html_content):
key = m.group(1)
val = html.unescape(m.group(2))
if key not in data:
data[key] = val
for m in re.finditer(r'name="description"\s+content="([^"]*)"', html_content):
data['description'] = html.unescape(m.group(1))
break
for m in re.finditer(r'name="(twitter:[^"]+)"\s+content="([^"]*)"', html_content):
key = m.group(1)
val = html.unescape(m.group(2))
if key not in data:
data[key] = val
return data
def parse_profile_description(desc):
"""Parse '5.5M Followers • 142 Threads • Bio. See the latest...' format."""
result = {}
if not desc:
return result
parts = desc.split(' \u2022 ') # Split on bullet •
for part in parts:
part = part.strip()
if 'Follower' in part:
result['followers'] = part.split(' Follower')[0].strip()
elif part.endswith('Threads') or part.endswith('Thread'):
result['thread_count'] = part.split(' Thread')[0].strip()
else:
bio = re.sub(r'\s*See the latest conversations.*$', '', part)
if bio:
result['bio'] = bio
return result
def parse_profile_title(title):
"""Parse 'Display Name (@user) • Threads, Say more' format."""
result = {}
if not title:
return result
m = re.match(r'^(.+?)\s*\(@(\w+)\)', title)
if m:
result['display_name'] = m.group(1).strip()
result['username'] = m.group(2)
return result
def get_threads_profile(username):
"""
Get Threads profile data via OG meta tags.
Returns dict with: username, display_name, bio, followers, thread_count,
profile_picture_url, url
"""
username = username.lstrip('@')
url = f'https://www.threads.net/@{username}'
content = curl_fetch(url)
tags = extract_og_tags(content)
if not tags or 'og:title' not in tags:
return {'error': 'Failed to fetch or parse profile', 'username': username}
title = tags.get('og:title', '')
if title.startswith('Threads') and 'Log in' in title:
return {'error': 'Profile requires login or not found', 'username': username}
result = {
'platform': 'threads',
'url': url,
}
result.update(parse_profile_title(title))
result.update(parse_profile_description(tags.get('og:description', '')))
if 'og:image' in tags:
result['profile_picture_url'] = tags['og:image']
return result
def get_threads_webfinger(username):
"""Get WebFinger data (ActivityPub discovery) for a Threads user."""
username = username.lstrip('@')
url = f'https://www.threads.net/.well-known/webfinger?resource=acct:{username}@threads.net'
content = curl_fetch(url, {'Accept': 'application/json'})
if not content:
return None
try:
data = json.loads(content)
if 'error' in data or 'success' in data and not data['success']:
return None
result = {'subject': data.get('subject', '')}
for link in data.get('links', []):
if link.get('type') == 'application/activity+json':
result['activitypub_url'] = link['href']
elif link.get('rel') == 'http://webfinger.net/rel/profile-page':
result['profile_url'] = link['href']
return result
except:
return None
def get_thread_post(post_url):
"""
Get content of a specific Threads post via OG tags.
Returns: text, author, image_url
"""
content = curl_fetch(post_url)
tags = extract_og_tags(content)
if not tags or 'og:title' not in tags:
return {'error': 'Failed to fetch post'}
title = tags.get('og:title', '')
if 'Log in' in title:
return {'error': 'Post requires login or not found'}
result = {'url': post_url}
if 'og:description' in tags:
result['text'] = tags['og:description']
elif 'description' in tags:
result['text'] = tags['description']
if 'og:title' in tags:
# Parse "Display Name (@username) on Threads"
m = re.match(r'^(.+?)\s*\(@(\w+)\)\s+on\s+Threads', title)
if m:
result['author_name'] = m.group(1).strip()
result['author_username'] = m.group(2)
if 'og:image' in tags:
result['image_url'] = tags['og:image']
return result
def get_threads_full(username):
"""Get complete profile data combining all methods."""
profile = get_threads_profile(username)
wf = get_threads_webfinger(username)
if wf:
profile['webfinger'] = wf
return profile
# ===== TEST =====
if __name__ == '__main__':
test_users = sys.argv[1:] if len(sys.argv) > 1 else ['zuck', 'nvidia', 'mosseri']
for user in test_users:
print(f"\n{'='*60}")
print(f" THREADS PROFILE: @{user}")
print(f"{'='*60}")
data = get_threads_full(user)
for k, v in sorted(data.items()):
if k == 'profile_picture_url':
print(f" {k}: {str(v)[:80]}...")
elif k == 'webfinger':
print(f" webfinger:")
for wk, wv in v.items():
print(f" {wk}: {wv}")
else:
print(f" {k}: {v}")
# Test posts
post_urls = [
'https://www.threads.net/@zuck/post/DEkvXzbyDS9',
]
print(f"\n{'='*60}")
print(f" THREADS POSTS")
print(f"{'='*60}")
for purl in post_urls:
print(f"\n URL: {purl}")
post = get_thread_post(purl)
for k, v in post.items():
if k in ('image_url',):
print(f" {k}: {str(v)[:80]}...")
elif k == 'text':
print(f" {k}: {v[:300]}{'...' if len(v) > 300 else ''}")
else:
print(f" {k}: {v}")
@@ -0,0 +1,305 @@
"""
TikTok Profile & Video Data Scraper
====================================
WORKING methods to get full TikTok profile data and video content.
Tested and verified April 2026.
METHODS SUMMARY:
================
METHOD 1 (BEST): HTML SSR Scraping - Parse __UNIVERSAL_DATA_FOR_REHYDRATION__
- Gets: FULL profile (bio, stats, follower/following/heart/video counts)
- Works: YES - Reliable, no auth needed, just curl + parse
- Limitation: No video list on profile page (videos load client-side)
METHOD 2: oEmbed API - https://www.tiktok.com/oembed?url=...
- Gets: Video title/caption, author, thumbnail URL
- Works: YES - No auth, no rate limit issues
- Limitation: Need video IDs first; no engagement stats
METHOD 3: tikwm.com API - https://www.tikwm.com/api/
- Gets: Full user info + individual video stats (plays, likes, comments, shares)
- User info: https://www.tikwm.com/api/user/info?unique_id={username}
- Video info: https://www.tikwm.com/api/?url={tiktok_video_url}
- Works: YES for user info and single videos
- Limitation: Posts list endpoint returns 403 (rate-limited)
METHOD 4: Video ID Discovery via Search Engines
- Use web_search("site:tiktok.com/@{username}/video") to find video IDs
- Then use oEmbed or tikwm or HTML scraping per video
- Works: YES - Gets ~5 recent video IDs per search
METHOD 5: SocialBlade via web_extract
- URL: https://socialblade.com/tiktok/user/{username}
- Gets: Followers, following, likes, videos, growth trends, rankings
- Works: YES via web_extract tool
METHOD 6: Individual Video HTML Scraping
- Fetch https://www.tiktok.com/@{user}/video/{id}
- Parse __UNIVERSAL_DATA webapp.video-detail -> itemInfo.itemStruct
- Gets: FULL video data (caption, stats, music, hashtags, duration)
- Works: YES - Most complete per-video data
NOT WORKING:
- TikTok /api/user/detail/ endpoint -> returns empty (needs signed params)
- TikTok /api/post/item_list/ -> returns empty (needs x-bogus/msToken)
- tikwm.com /api/user/posts -> 403 forbidden
"""
import re
import json
import subprocess
import urllib.parse
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
def fetch_url(url, headers=None):
"""Fetch URL via curl and return content."""
cmd = ['curl', '-s', '-L', '-m', '30', url,
'-H', f'User-Agent: {USER_AGENT}',
'-H', 'Accept-Language: en-US,en;q=0.9']
if headers:
for k, v in headers.items():
cmd.extend(['-H', f'{k}: {v}'])
result = subprocess.run(cmd, capture_output=True, text=True, timeout=35)
return result.stdout
def method1_html_profile(username):
"""
METHOD 1: Scrape TikTok profile HTML and parse SSR JSON data.
Returns full profile with stats.
"""
url = f'https://www.tiktok.com/@{username}'
html = fetch_url(url)
m = re.search(
r'<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">(.*?)</script>',
html
)
if not m:
return None
data = json.loads(m.group(1))
scope = data.get('__DEFAULT_SCOPE__', {})
user_detail = scope.get('webapp.user-detail', {})
user_info = user_detail.get('userInfo', {})
if not user_info:
return None
user = user_info.get('user', {})
stats = user_info.get('statsV2', user_info.get('stats', {}))
return {
'id': user.get('id'),
'username': user.get('uniqueId'),
'nickname': user.get('nickname'),
'bio': user.get('signature'),
'verified': user.get('verified'),
'private': user.get('privateAccount'),
'secUid': user.get('secUid'),
'avatarLarger': user.get('avatarLarger'),
'bioLink': user.get('bioLink', {}),
'createTime': user.get('createTime'),
'language': user.get('language'),
'stats': {
'followers': int(stats.get('followerCount', 0)),
'following': int(stats.get('followingCount', 0)),
'hearts': int(stats.get('heartCount', 0)),
'videos': int(stats.get('videoCount', 0)),
'diggs': int(stats.get('diggCount', 0)),
'friends': int(stats.get('friendCount', 0)),
}
}
def method2_oembed_video(username, video_id):
"""
METHOD 2: Get video caption/title via oEmbed.
No auth needed. Returns caption, author, thumbnail.
"""
url = f'https://www.tiktok.com/oembed?url=https://www.tiktok.com/@{username}/video/{video_id}'
content = fetch_url(url)
try:
data = json.loads(content)
return {
'video_id': video_id,
'title': data.get('title', ''),
'author_name': data.get('author_name'),
'author_url': data.get('author_url'),
'thumbnail_url': data.get('thumbnail_url'),
'thumbnail_width': data.get('thumbnail_width'),
'thumbnail_height': data.get('thumbnail_height'),
}
except json.JSONDecodeError:
return None
def method3_tikwm_user(username):
"""
METHOD 3a: Get user info via tikwm.com API.
"""
url = f'https://www.tikwm.com/api/user/info?unique_id={username}'
content = fetch_url(url)
try:
data = json.loads(content)
if data.get('code') == 0:
return data['data']
except json.JSONDecodeError:
pass
return None
def method3_tikwm_video(video_url):
"""
METHOD 3b: Get video details via tikwm.com API.
Returns: title, play_count, digg_count, comment_count, share_count, duration, download URLs
"""
url = f'https://www.tikwm.com/api/?url={urllib.parse.quote(video_url)}'
content = fetch_url(url)
try:
data = json.loads(content)
if data.get('code') == 0:
v = data['data']
return {
'video_id': v.get('id'),
'title': v.get('title'),
'duration': v.get('duration'),
'play_count': v.get('play_count'),
'likes': v.get('digg_count'),
'comments': v.get('comment_count'),
'shares': v.get('share_count'),
'author': v.get('author', {}).get('unique_id'),
'music_title': v.get('music_info', {}).get('title') if v.get('music_info') else None,
'cover_url': v.get('origin_cover') or v.get('cover'),
'play_url': v.get('play'), # direct video URL
}
except json.JSONDecodeError:
pass
return None
def method6_html_video(username, video_id):
"""
METHOD 6: Scrape individual video page HTML for full data.
Gets: caption, full stats, music, hashtags, create time.
"""
url = f'https://www.tiktok.com/@{username}/video/{video_id}'
html = fetch_url(url)
m = re.search(
r'<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">(.*?)</script>',
html
)
if not m:
return None
data = json.loads(m.group(1))
scope = data.get('__DEFAULT_SCOPE__', {})
vd = scope.get('webapp.video-detail', {})
item = vd.get('itemInfo', {}).get('itemStruct', {})
if not item:
return None
stats = item.get('statsV2', item.get('stats', {}))
music = item.get('music', {})
challenges = item.get('challenges', [])
return {
'video_id': item.get('id'),
'description': item.get('desc'),
'createTime': item.get('createTime'),
'duration': item.get('video', {}).get('duration'),
'stats': {
'plays': int(stats.get('playCount', 0)),
'likes': int(stats.get('diggCount', 0)),
'comments': int(stats.get('commentCount', 0)),
'shares': int(stats.get('shareCount', 0)),
'saves': int(stats.get('collectCount', 0)),
},
'music': {
'title': music.get('title'),
'author': music.get('authorName'),
},
'hashtags': [c.get('title', '') for c in challenges],
'author': item.get('author', {}).get('uniqueId'),
}
def get_full_tiktok_profile(username):
"""
Complete pipeline: Get full profile + discover and scrape recent videos.
Returns dict with profile data, stats, and recent video details.
"""
# Step 1: Get profile data
profile = method1_html_profile(username)
if not profile:
return {'error': f'Could not fetch profile for @{username}'}
result = {
'profile': profile,
'videos': [],
'data_sources': ['tiktok_html_ssr'],
}
# Note: Video discovery requires web_search tool (not available in pure Python)
# In the agent context, use:
# web_search(f"site:tiktok.com/@{username}/video")
# Then for each video ID found, call method6_html_video() or method2_oembed_video()
return result
if __name__ == '__main__':
import sys
username = sys.argv[1] if len(sys.argv) > 1 else 'khaby.lame'
print(f'=== Testing TikTok scraping for @{username} ===\n')
print('--- METHOD 1: HTML Profile Scraping ---')
profile = method1_html_profile(username)
if profile:
print(f' Username: {profile["username"]}')
print(f' Nickname: {profile["nickname"]}')
print(f' Bio: {profile["bio"][:100]}')
print(f' Verified: {profile["verified"]}')
print(f' Followers: {profile["stats"]["followers"]:,}')
print(f' Following: {profile["stats"]["following"]:,}')
print(f' Hearts: {profile["stats"]["hearts"]:,}')
print(f' Videos: {profile["stats"]["videos"]:,}')
print(f' SecUid: {profile["secUid"][:50]}...')
else:
print(' FAILED')
print('\n--- METHOD 3a: tikwm.com User API ---')
tikwm_user = method3_tikwm_user(username)
if tikwm_user:
s = tikwm_user.get('stats', {})
print(f' Followers: {s.get("followerCount"):,}')
print(f' Hearts: {s.get("heartCount"):,}')
print(f' Videos: {s.get("videoCount"):,}')
else:
print(' FAILED')
# Test with a known video
test_video_id = '7615318641042623775' # khaby birthday video
if username == 'khaby.lame':
print(f'\n--- METHOD 2: oEmbed for video {test_video_id} ---')
oembed = method2_oembed_video(username, test_video_id)
if oembed:
print(f' Title: {oembed["title"][:80]}')
print(f'\n--- METHOD 6: HTML Video Scraping for {test_video_id} ---')
video = method6_html_video(username, test_video_id)
if video:
print(f' Description: {video["description"][:80]}')
print(f' Plays: {video["stats"]["plays"]:,}')
print(f' Likes: {video["stats"]["likes"]:,}')
print(f' Comments: {video["stats"]["comments"]:,}')
print(f' Shares: {video["stats"]["shares"]:,}')
print(f' Hashtags: {video["hashtags"]}')
print('\n=== DONE ===')
+260
View File
@@ -0,0 +1,260 @@
"""
Direct X/Twitter API v2 client for Hermes Simulator.
No x-cli dependency uses curl via terminal() with bearer token.
Provides:
- get_user(handle) profile, bio, metrics
- get_tweets(user_id, count) recent tweets with metrics
- search_tweets(query, count) search for tweets
- get_user_mentions(user_id, count) mentions of a user
"""
from hermes_tools import terminal
import json
import os
import time
import urllib.parse
# Bearer token — loaded from env or hardcoded fallback
BEARER = os.environ.get("X_BEARER_TOKEN", "")
MAX_RETRIES = 3
BASE_DELAY = 2 # seconds, exponential backoff: 2s, 4s, 8s
def _api_get(endpoint: str, params: dict = None) -> dict:
"""Make authenticated GET request to X API v2 with retry and error handling."""
url = f"https://api.twitter.com/2/{endpoint}"
if params:
qs = "&".join(f"{k}={urllib.parse.quote(str(v))}" for k, v in params.items())
url += f"?{qs}"
for attempt in range(MAX_RETRIES):
try:
r = terminal(f'curl -s -w \'\\n%{{http_code}}\' -H "Authorization: Bearer {BEARER}" "{url}"')
output = r.get("output", "").strip()
# Split body from status code (last line)
lines = output.rsplit("\n", 1)
if len(lines) == 2:
body, status_str = lines
else:
body = output
status_str = "0"
try:
status_code = int(status_str.strip())
except ValueError:
status_code = 0
# Handle specific status codes
if status_code == 429:
# Rate limited — retry with backoff
delay = BASE_DELAY * (2 ** attempt)
print(f" [X API] Rate limited (429). Retry {attempt+1}/{MAX_RETRIES} in {delay}s...")
time.sleep(delay)
continue
if status_code in (401, 403):
return {"error": f"Authentication failed (HTTP {status_code}). Check X_BEARER_TOKEN.", "http_status": status_code}
if status_code >= 500:
delay = BASE_DELAY * (2 ** attempt)
print(f" [X API] Server error ({status_code}). Retry {attempt+1}/{MAX_RETRIES} in {delay}s...")
time.sleep(delay)
continue
if status_code == 0 and not body:
# Network error — no response at all
delay = BASE_DELAY * (2 ** attempt)
print(f" [X API] Network error. Retry {attempt+1}/{MAX_RETRIES} in {delay}s...")
time.sleep(delay)
continue
try:
return json.loads(body)
except json.JSONDecodeError:
return {"error": f"Failed to parse response (HTTP {status_code}): {body[:200]}"}
except Exception as e:
delay = BASE_DELAY * (2 ** attempt)
print(f" [X API] Exception: {e}. Retry {attempt+1}/{MAX_RETRIES} in {delay}s...")
time.sleep(delay)
continue
return {"error": f"All {MAX_RETRIES} retries exhausted for {endpoint}"}
def get_user(handle: str) -> dict:
"""Get user profile by handle."""
handle = handle.lstrip("@")
return _api_get(f"users/by/username/{handle}", {
"user.fields": "description,public_metrics,profile_image_url,created_at,location,url"
})
def get_tweets(user_id: str, count: int = 20) -> dict:
"""Get user's recent tweets."""
return _api_get(f"users/{user_id}/tweets", {
"max_results": max(min(count, 100), 5),
"tweet.fields": "created_at,public_metrics,text,in_reply_to_user_id,referenced_tweets",
"exclude": "retweets" # original tweets only for voice analysis
})
def get_tweets_with_rts(user_id: str, count: int = 20) -> dict:
"""Get user's recent tweets including retweets (shows interests)."""
return _api_get(f"users/{user_id}/tweets", {
"max_results": max(min(count, 100), 5),
"tweet.fields": "created_at,public_metrics,text,referenced_tweets"
})
def search_tweets(query: str, count: int = 10) -> dict:
"""Search recent tweets."""
return _api_get("tweets/search/recent", {
"query": query,
"max_results": max(min(count, 100), 10),
"tweet.fields": "created_at,public_metrics,text,author_id"
})
def get_user_by_id(user_id: str) -> dict:
"""Get user profile by ID."""
return _api_get(f"users/{user_id}", {
"user.fields": "description,public_metrics,username,name"
})
# ═══════════════════════════════════════════════════════════════
# HIGH-LEVEL INTELLIGENCE FUNCTIONS
# ═══════════════════════════════════════════════════════════════
def profile_user(handle: str) -> dict:
"""Full profile pull: identity + recent tweets (originals only)."""
user = get_user(handle)
if "errors" in user or "error" in user:
return {"error": f"User @{handle} not found", "details": user}
user_data = user.get("data", {})
user_id = user_data.get("id")
result = {
"profile": user_data,
"tweets": [],
"voice_samples": [],
}
if user_id:
# Get original tweets (no RTs) for voice analysis
tweets = get_tweets(user_id, 20)
tweet_list = tweets.get("data", [])
result["tweets"] = tweet_list
# Extract pure text samples for voice profiling
# Only exclude retweets and actual replies (has in_reply_to_user_id)
# Tweets starting with @ are fine if they're standalone mentions
result["voice_samples"] = [
t["text"] for t in tweet_list
if not t.get("text", "").startswith("RT @")
and not t.get("in_reply_to_user_id")
]
return result
def profile_interactions(handle1: str, handle2: str) -> dict:
"""Find interactions between two users."""
# Search for replies from handle1 to handle2
q1 = f"from:{handle1} to:{handle2}"
q2 = f"from:{handle2} to:{handle1}"
r1 = search_tweets(q1, 10)
r2 = search_tweets(q2, 10)
return {
f"{handle1}_to_{handle2}": r1.get("data", []),
f"{handle2}_to_{handle1}": r2.get("data", []),
}
def get_voice_data(handle: str, count: int = 50) -> dict:
"""Pull maximum voice data: tweets, replies, quote tweets.
Returns categorized samples for voice profiling."""
user = get_user(handle)
if "errors" in user or "error" in user:
return {"error": f"User @{handle} not found"}
user_data = user.get("data", {})
user_id = user_data.get("id")
if not user_id:
return {"error": "No user ID found"}
# Original tweets (exclude RTs)
originals = get_tweets(user_id, min(count, 100))
original_list = originals.get("data", [])
# Categorize — only use in_reply_to_user_id to detect replies
standalone = [] # not replies
replies = [] # replies to others
for t in original_list:
text = t.get("text", "")
if t.get("in_reply_to_user_id"):
replies.append(text)
else:
standalone.append(text)
return {
"profile": user_data,
"standalone_tweets": standalone, # their voice at rest
"replies": replies, # their voice in conversation
"total_samples": len(standalone) + len(replies),
}
# ═══════════════════════════════════════════════════════════════
# ENTRY POINT
# ═══════════════════════════════════════════════════════════════
if __name__ == "__main__":
if not BEARER:
print("ERROR: X_BEARER_TOKEN not set. Set it in environment or ~/.hermes/.env")
print("Trying to load from .env...")
try:
with open(os.path.expanduser("~/.hermes/.env")) as f:
for line in f:
line = line.strip()
if line.startswith("X_BEARER_TOKEN="):
# Use split with maxsplit=1 to handle values with '=' in them
# Also strip surrounding quotes if present
val = line.split("=", 1)[1]
if val and val[0] in ('"', "'") and val[-1] == val[0]:
val = val[1:-1]
BEARER = val
break
except Exception as e:
print(f" Failed to load .env: {e}")
if not BEARER:
print("FATAL: No bearer token found.")
exit(1)
# Demo: profile two users
for handle in ["Teknium", "basedjensen"]:
print(f"\n{'='*60}")
print(f" PROFILING @{handle}")
print(f"{'='*60}")
data = profile_user(handle)
profile = data.get("profile", {})
print(f" Name: {profile.get('name')}")
print(f" Bio: {profile.get('description')}")
metrics = profile.get("public_metrics", {})
print(f" Followers: {metrics.get('followers_count')}")
print(f" Tweets: {metrics.get('tweet_count')}")
print(f" Likes given: {metrics.get('like_count')}")
print(f"\n Voice samples ({len(data.get('voice_samples', []))}):")
for sample in data.get("voice_samples", [])[:5]:
print(f" > {sample[:120]}")
@@ -0,0 +1,136 @@
# DOSSIER: {display_name} (@{handle})
## Identity
- **Name**: {real_name}
- **Handle(s)**: @{twitter} | u/{reddit} | {discord_tag}
- **Role**: {role_and_org}
- **Known for**: {what_they_are_famous_for}
- **Followers/reach**: {approximate_follower_count}
- **Confidence**: {HIGH|MEDIUM|LOW} — {confidence_reason}
## Voice Profile
### Linguistic Patterns
- **Sentence structure**: {short_punchy | long_flowing | mixed}
- **Capitalization**: {normal | all_lowercase | CAPS_FOR_EMPHASIS | mixed}
- **Punctuation**: {heavy_periods | ellipsis_lover | no_punctuation | exclamation_marks}
- **Paragraph style**: {one_liners | thread_essays | medium_blocks}
- **Emoji/emoticon usage**: {none | minimal | heavy | specific_ones}
### Vocabulary & Slang
- **Register**: {academic | casual | shitposter | mixed}
- **Recurring words/phrases**: [list of signature words they use a lot]
- **Catchphrases**: [any repeated phrases or running jokes]
- **Profanity level**: {none | mild | moderate | heavy}
- **Jargon tendency**: {explains_everything | assumes_expertise | mixes}
### Tone
- **Default mood**: {earnest | ironic | combative | chill | manic | analytical}
- **Humor style**: {deadpan | absurdist | sarcastic | wholesome | shitpost | none}
- **How they handle disagreement**: {engages_thoughtfully | dunks | ignores | ratio_warrior | passive_aggressive}
- **How they handle praise**: {deflects | accepts_gracefully | awkward | flexes}
## Positions & Beliefs
### Core Convictions (things they consistently advocate for)
1. {conviction_1}
2. {conviction_2}
3. {conviction_3}
### Known Hot Takes
1. {take_1}
2. {take_2}
### Hills They'll Die On
1. {hill_1}
2. {hill_2}
### Topics They Avoid or Refuse to Engage
1. {avoidance_1}
## Social Dynamics
### People They Interact With Positively
- @{ally_1} — {relationship_description}
- @{ally_2} — {relationship_description}
### People They Beef With / Disagree With
- @{rival_1} — {beef_description}
### How They Engage Different Types
- **Fans/supporters**: {how_they_respond}
- **Critics**: {how_they_respond}
- **Peers**: {how_they_respond}
- **Random people**: {how_they_respond}
## Platform-Specific Behavior
### On Twitter/X
- **Post frequency**: {multiple_daily | daily | few_per_week}
- **Thread tendency**: {never | sometimes | loves_threads}
- **QRT style**: {adds_context | dunks | amplifies}
- **Engagement style**: {likes_a_lot | rarely_likes | retweets_heavy}
### On Reddit (if applicable)
- **Subreddits**: [list]
- **Comment style**: {detailed | brief | combative}
### On Discord (if applicable)
- **Servers**: [known servers]
- **Vibe shift from Twitter**: {description}
## Signature Moves
Things this person characteristically does that make them recognizable:
1. {signature_move_1}
2. {signature_move_2}
3. {signature_move_3}
## Sample Quotes (real, sourced from research)
> "{actual_quote_1}" — [source/context]
> "{actual_quote_2}" — [source/context]
> "{actual_quote_3}" — [source/context]
## Deep Psychometric Profile
- **Big Five**: O{H/M/L} C{} E{} A{} N{} — {evidence}
- **Moral Foundations**: Care{} Fair{} Loyal{} Auth{} Sanct{} Liberty{} — {what drives their ethics}
- **Schwartz Values**: {dominant values} — {how they justify positions}
- **Cognitive Style**: {IC score estimate} — {hedging patterns, complexity, analytical vs intuitive}
- **Narrative Frame**: {dominant frame} — {how they lens issues}
- **Persona Authenticity**: {1-5 score} — {evidence for curation vs authenticity}
## Strategic Self-Presentation (Red Hat)
- **Cultivated image**: {what they want to be seen as}
- **Target audience**: {who they're performing for}
- **Incentive structure**: {what they gain from this persona}
- **Possible divergences**: {where persona may ≠ person}
- **Ghostwriting indicators**: {present/absent, evidence}
## Ecosystem Context
- **Community cluster**: {which tribe they belong to}
- **Key influencers**: {who they amplify/follow/agree with}
- **Echo chamber**: {what information environment they're in}
- **Audience profile**: {who follows them, how that audience reacts}
## Key Assumptions
1. {assumption} — FRAGILITY: {robust/moderate/fragile} — Test: {what invalidates it}
2. {assumption} — FRAGILITY: {} — Test: {}
3. {assumption} — FRAGILITY: {} — Test: {}
## Competing Hypotheses
- **H1 (PRIMARY)**: {main personality model} — Confidence: {X}%
- **H2 (ALTERNATIVE)**: {alternative explanation} — Confidence: {X}%
- **Key discriminator**: {what evidence would shift between H1 and H2}
## Research Sources
- {source_1} [{reliability}{confidence}] — {description}
- {source_2} [{reliability}{confidence}] — {description}
- {source_3} [{reliability}{confidence}] — {description}
## Invalidation Indicators
1. If @{handle} {does X instead of Y}, our {assessment} is wrong
2. If @{handle} {responds to Z with Q}, our {model} needs revision
3. If @{handle} {interacts with @person in manner M}, dynamics model is off
---
*Dossier compiled: {date} | Fidelity: {fidelity}% | Persona Authenticity: {1-5}*
*Source reliability range: {best}-{worst} | Analytical confidence: {1-6}*
-2
View File
@@ -73,7 +73,6 @@ Config file: `~/.hermes/hindsight/config.json`
|-----|---------|-------------|
| `llm_provider` | `openai` | LLM provider: `openai`, `anthropic`, `gemini`, `groq`, `minimax`, `ollama` |
| `llm_model` | per-provider | Model name (e.g. `gpt-4o-mini`, `openai/gpt-oss-120b`) |
| `llm_base_url` | — | LLM Base URL override (e.g. `https://openrouter.ai/api/v1`) |
The LLM API key is stored in `~/.hermes/.env` as `HINDSIGHT_LLM_API_KEY`.
@@ -93,7 +92,6 @@ Available in `hybrid` and `tools` memory modes:
|----------|-------------|
| `HINDSIGHT_API_KEY` | API key for Hindsight Cloud |
| `HINDSIGHT_LLM_API_KEY` | LLM API key for local mode |
| `HINDSIGHT_API_LLM_BASE_URL` | LLM Base URL for local mode (e.g. OpenRouter) |
| `HINDSIGHT_API_URL` | Override API endpoint |
| `HINDSIGHT_BANK_ID` | Override bank name |
| `HINDSIGHT_BUDGET` | Override recall budget |
+6 -17
View File
@@ -23,8 +23,6 @@ import json
import logging
import os
import threading
from hermes_constants import get_hermes_home
from typing import Any, Dict, List
from agent.memory_provider import MemoryProvider
@@ -144,6 +142,7 @@ def _load_config() -> dict:
3. Environment variables
"""
from pathlib import Path
from hermes_constants import get_hermes_home
# Profile-scoped path (preferred)
profile_path = get_hermes_home() / "hindsight" / "config.json"
@@ -235,7 +234,6 @@ class HindsightMemoryProvider(MemoryProvider):
{"key": "api_key", "description": "Hindsight Cloud API key", "secret": True, "env_var": "HINDSIGHT_API_KEY", "url": "https://ui.hindsight.vectorize.io", "when": {"mode": "cloud"}},
{"key": "llm_provider", "description": "LLM provider for local mode", "default": "openai", "choices": ["openai", "anthropic", "gemini", "groq", "minimax", "ollama"], "when": {"mode": "local"}},
{"key": "llm_api_key", "description": "LLM API key for local Hindsight", "secret": True, "env_var": "HINDSIGHT_LLM_API_KEY", "when": {"mode": "local"}},
{"key": "llm_base_url", "description": "LLM Base URL (e.g. for OpenRouter)", "default": "", "env_var": "HINDSIGHT_API_LLM_BASE_URL", "when": {"mode": "local"}},
{"key": "llm_model", "description": "LLM model for local mode", "default": "gpt-4o-mini", "default_from": {"field": "llm_provider", "map": _PROVIDER_DEFAULT_MODELS}, "when": {"mode": "local"}},
{"key": "bank_id", "description": "Memory bank name", "default": "hermes"},
{"key": "budget", "description": "Recall thoroughness", "default": "mid", "choices": ["low", "mid", "high"]},
@@ -252,16 +250,12 @@ class HindsightMemoryProvider(MemoryProvider):
# different loop" errors during GC — we handle cleanup in
# shutdown() instead.
HindsightEmbedded.__del__ = lambda self: None
kwargs = dict(
self._client = HindsightEmbedded(
profile=self._config.get("profile", "hermes"),
llm_provider=self._config.get("llm_provider", ""),
llm_api_key=self._config.get("llm_api_key") or os.environ.get("HINDSIGHT_LLM_API_KEY", ""),
llm_api_key=self._config.get("llmApiKey") or os.environ.get("HINDSIGHT_LLM_API_KEY", ""),
llm_model=self._config.get("llm_model", ""),
)
base_url = self._config.get("llm_base_url") or os.environ.get("HINDSIGHT_API_LLM_BASE_URL", "")
if base_url:
kwargs["llm_base_url"] = base_url
self._client = HindsightEmbedded(**kwargs)
else:
from hindsight_client import Hindsight
kwargs = {"base_url": self._api_url, "timeout": 30.0}
@@ -316,10 +310,9 @@ class HindsightMemoryProvider(MemoryProvider):
# If the config changed and the daemon is running, stop it.
from pathlib import Path as _Path
profile_env = _Path.home() / ".hindsight" / "profiles" / f"{profile}.env"
current_key = self._config.get("llm_api_key") or os.environ.get("HINDSIGHT_LLM_API_KEY", "")
current_key = self._config.get("llmApiKey") or os.environ.get("HINDSIGHT_LLM_API_KEY", "")
current_provider = self._config.get("llm_provider", "")
current_model = self._config.get("llm_model", "")
current_base_url = self._config.get("llm_base_url") or os.environ.get("HINDSIGHT_API_LLM_BASE_URL", "")
# Read saved profile config
saved = {}
@@ -332,22 +325,18 @@ class HindsightMemoryProvider(MemoryProvider):
config_changed = (
saved.get("HINDSIGHT_API_LLM_PROVIDER") != current_provider or
saved.get("HINDSIGHT_API_LLM_MODEL") != current_model or
saved.get("HINDSIGHT_API_LLM_API_KEY") != current_key or
saved.get("HINDSIGHT_API_LLM_BASE_URL", "") != current_base_url
saved.get("HINDSIGHT_API_LLM_API_KEY") != current_key
)
if config_changed:
# Write updated profile .env
profile_env.parent.mkdir(parents=True, exist_ok=True)
env_lines = (
profile_env.write_text(
f"HINDSIGHT_API_LLM_PROVIDER={current_provider}\n"
f"HINDSIGHT_API_LLM_API_KEY={current_key}\n"
f"HINDSIGHT_API_LLM_MODEL={current_model}\n"
f"HINDSIGHT_API_LOG_LEVEL=info\n"
)
if current_base_url:
env_lines += f"HINDSIGHT_API_LLM_BASE_URL={current_base_url}\n"
profile_env.write_text(env_lines)
if client._manager.is_running(profile):
with open(log_path, "a") as f:
f.write("\n=== Config changed, restarting daemon ===\n")
+1 -3
View File
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "hermes-agent"
version = "0.8.0"
version = "0.7.0"
description = "The self-improving AI agent — creates skills from experience, improves them during use, and runs anywhere"
readme = "README.md"
requires-python = ">=3.11"
@@ -62,7 +62,6 @@ mcp = ["mcp>=1.2.0,<2"]
homeassistant = ["aiohttp>=3.9.0,<4"]
sms = ["aiohttp>=3.9.0,<4"]
acp = ["agent-client-protocol>=0.9.0,<1.0"]
mistral = ["mistralai>=2.3.0,<3"]
dingtalk = ["dingtalk-stream>=0.1.0,<1"]
feishu = ["lark-oapi>=1.5.3,<2"]
rl = [
@@ -95,7 +94,6 @@ all = [
"hermes-agent[voice]",
"hermes-agent[dingtalk]",
"hermes-agent[feishu]",
"hermes-agent[mistral]",
]
[project.scripts]
+65 -207
View File
@@ -66,8 +66,7 @@ from model_tools import (
handle_function_call,
check_toolset_requirements,
)
from tools.terminal_tool import cleanup_vm, get_active_env
from tools.tool_result_storage import maybe_persist_tool_result, enforce_turn_budget
from tools.terminal_tool import cleanup_vm
from tools.interrupt import set_interrupt as _set_interrupt
from tools.browser_tool import cleanup_browser
@@ -76,7 +75,6 @@ from hermes_constants import OPENROUTER_BASE_URL
# Agent internals extracted to agent/ package for modularity
from agent.memory_manager import build_memory_context_block
from agent.retry_utils import jittered_backoff
from agent.prompt_builder import (
DEFAULT_AGENT_IDENTITY, PLATFORM_HINTS,
MEMORY_GUIDANCE, SESSION_SEARCH_GUIDANCE, SKILLS_GUIDANCE,
@@ -87,7 +85,6 @@ from agent.model_metadata import (
estimate_tokens_rough, estimate_messages_tokens_rough, estimate_request_tokens_rough,
get_next_probe_tier, parse_context_limit_from_error,
save_context_length, is_local_endpoint,
query_ollama_num_ctx,
)
from agent.context_compressor import ContextCompressor
from agent.subdirectory_hints import SubdirectoryHintTracker
@@ -412,26 +409,62 @@ def _strip_budget_warnings_from_history(messages: list) -> None:
# Large tool result handler — save oversized output to temp file
# =========================================================================
# Threshold at which tool results are saved to a file instead of kept inline.
# 100K chars ≈ 25K tokens — generous for any reasonable output but prevents
# catastrophic context explosions.
_LARGE_RESULT_CHARS = 100_000
# =========================================================================
# Qwen Portal headers — mimics QwenCode CLI for portal.qwen.ai compatibility.
# Extracted as a module-level helper so both __init__ and
# _apply_client_headers_for_base_url can share it.
# =========================================================================
_QWEN_CODE_VERSION = "0.14.1"
# How many characters of the original result to include as an inline preview
# so the model has immediate context about what the tool returned.
_LARGE_RESULT_PREVIEW_CHARS = 1_500
def _qwen_portal_headers() -> dict:
"""Return default HTTP headers required by Qwen Portal API."""
import platform as _plat
def _save_oversized_tool_result(function_name: str, function_result: str) -> str:
"""Replace oversized tool results with a file reference + preview.
_ua = f"QwenCode/{_QWEN_CODE_VERSION} ({_plat.system().lower()}; {_plat.machine()})"
return {
"User-Agent": _ua,
"X-DashScope-CacheControl": "enable",
"X-DashScope-UserAgent": _ua,
"X-DashScope-AuthType": "qwen-oauth",
}
When a tool returns more than ``_LARGE_RESULT_CHARS`` characters, the full
content is written to a temporary file under ``HERMES_HOME/cache/tool_responses/``
and the result sent to the model is replaced with:
a brief head preview (first ``_LARGE_RESULT_PREVIEW_CHARS`` chars)
the file path so the model can use ``read_file`` / ``search_files``
Falls back to destructive truncation if the file write fails.
"""
original_len = len(function_result)
if original_len <= _LARGE_RESULT_CHARS:
return function_result
# Build the target directory
try:
response_dir = os.path.join(get_hermes_home(), "cache", "tool_responses")
os.makedirs(response_dir, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
# Sanitize tool name for use in filename
safe_name = re.sub(r"[^\w\-]", "_", function_name)[:40]
filename = f"{safe_name}_{timestamp}.txt"
filepath = os.path.join(response_dir, filename)
with open(filepath, "w", encoding="utf-8") as f:
f.write(function_result)
preview = function_result[:_LARGE_RESULT_PREVIEW_CHARS]
return (
f"{preview}\n\n"
f"[Large tool response: {original_len:,} characters total — "
f"only the first {_LARGE_RESULT_PREVIEW_CHARS:,} shown above. "
f"Full output saved to: {filepath}\n"
f"Use read_file or search_files on that path to access the rest.]"
)
except Exception as exc:
# Fall back to destructive truncation if file write fails
logger.warning("Failed to save large tool result to file: %s", exc)
return (
function_result[:_LARGE_RESULT_CHARS]
+ f"\n\n[Truncated: tool response was {original_len:,} chars, "
f"exceeding the {_LARGE_RESULT_CHARS:,} char limit. "
f"File save failed: {exc}]"
)
class AIAgent:
@@ -777,8 +810,6 @@ class AIAgent:
client_kwargs["default_headers"] = {
"User-Agent": "KimiCLI/1.3",
}
elif "portal.qwen.ai" in effective_base.lower():
client_kwargs["default_headers"] = _qwen_portal_headers()
else:
# No explicit creds — use the centralized provider router
from agent.auxiliary_client import resolve_provider_client
@@ -1185,33 +1216,6 @@ class AIAgent:
self.session_cost_status = "unknown"
self.session_cost_source = "none"
# ── Ollama num_ctx injection ──
# Ollama defaults to 2048 context regardless of the model's capabilities.
# When running against an Ollama server, detect the model's max context
# and pass num_ctx on every chat request so the full window is used.
# User override: set model.ollama_num_ctx in config.yaml to cap VRAM use.
self._ollama_num_ctx: int | None = None
_ollama_num_ctx_override = None
if isinstance(_model_cfg, dict):
_ollama_num_ctx_override = _model_cfg.get("ollama_num_ctx")
if _ollama_num_ctx_override is not None:
try:
self._ollama_num_ctx = int(_ollama_num_ctx_override)
except (TypeError, ValueError):
logger.debug("Invalid ollama_num_ctx config value: %r", _ollama_num_ctx_override)
if self._ollama_num_ctx is None and self.base_url and is_local_endpoint(self.base_url):
try:
_detected = query_ollama_num_ctx(self.model, self.base_url)
if _detected and _detected > 0:
self._ollama_num_ctx = _detected
except Exception as exc:
logger.debug("Ollama num_ctx detection failed: %s", exc)
if self._ollama_num_ctx and not self.quiet_mode:
logger.info(
"Ollama num_ctx: will request %d tokens (model max from /api/show)",
self._ollama_num_ctx,
)
if not self.quiet_mode:
if compression_enabled:
print(f"📊 Context limit: {self.context_compressor.context_length:,} tokens (compress at {int(compression_threshold*100)}% = {self.context_compressor.threshold_tokens:,})")
@@ -4103,8 +4107,6 @@ class AIAgent:
self._client_kwargs["default_headers"] = copilot_default_headers()
elif "api.kimi.com" in normalized:
self._client_kwargs["default_headers"] = {"User-Agent": "KimiCLI/1.3"}
elif "portal.qwen.ai" in normalized:
self._client_kwargs["default_headers"] = _qwen_portal_headers()
else:
self._client_kwargs.pop("default_headers", None)
@@ -4895,7 +4897,7 @@ class AIAgent:
effective_key = (fb_client.api_key or resolve_anthropic_token() or "") if fb_provider == "anthropic" else (fb_client.api_key or "")
self.api_key = effective_key
self._anthropic_api_key = effective_key
self._anthropic_base_url = fb_base_url
self._anthropic_base_url = getattr(fb_client, "base_url", None)
self._anthropic_client = build_anthropic_client(effective_key, self._anthropic_base_url)
self._is_anthropic_oauth = _is_oauth_token(effective_key)
self.client = None
@@ -5251,71 +5253,6 @@ class AIAgent:
base = (getattr(self, "base_url", "") or "").lower()
return "dashscope" in base or "aliyuncs" in base or "opencode.ai/zen/go" in base
def _is_qwen_portal(self) -> bool:
"""Return True when the base URL targets Qwen Portal."""
return "portal.qwen.ai" in self._base_url_lower
def _qwen_prepare_chat_messages(self, api_messages: list) -> list:
prepared = copy.deepcopy(api_messages)
if not prepared:
return prepared
for msg in prepared:
if not isinstance(msg, dict):
continue
content = msg.get("content")
if isinstance(content, str):
msg["content"] = [{"type": "text", "text": content}]
elif isinstance(content, list):
# Normalize: convert bare strings to text dicts, keep dicts as-is.
# deepcopy already created independent copies, no need for dict().
normalized_parts = []
for part in content:
if isinstance(part, str):
normalized_parts.append({"type": "text", "text": part})
elif isinstance(part, dict):
normalized_parts.append(part)
if normalized_parts:
msg["content"] = normalized_parts
# Inject cache_control on the last part of the system message.
for msg in prepared:
if isinstance(msg, dict) and msg.get("role") == "system":
content = msg.get("content")
if isinstance(content, list) and content and isinstance(content[-1], dict):
content[-1]["cache_control"] = {"type": "ephemeral"}
break
return prepared
def _qwen_prepare_chat_messages_inplace(self, messages: list) -> None:
"""In-place variant — mutates an already-copied message list."""
if not messages:
return
for msg in messages:
if not isinstance(msg, dict):
continue
content = msg.get("content")
if isinstance(content, str):
msg["content"] = [{"type": "text", "text": content}]
elif isinstance(content, list):
normalized_parts = []
for part in content:
if isinstance(part, str):
normalized_parts.append({"type": "text", "text": part})
elif isinstance(part, dict):
normalized_parts.append(part)
if normalized_parts:
msg["content"] = normalized_parts
for msg in messages:
if isinstance(msg, dict) and msg.get("role") == "system":
content = msg.get("content")
if isinstance(content, list) and content and isinstance(content[-1], dict):
content[-1]["cache_control"] = {"type": "ephemeral"}
break
def _build_api_kwargs(self, api_messages: list) -> dict:
"""Build the keyword arguments dict for the active API mode."""
if self.api_mode == "anthropic_messages":
@@ -5334,7 +5271,6 @@ class AIAgent:
is_oauth=self._is_anthropic_oauth,
preserve_dots=self._anthropic_preserve_dots(),
context_length=ctx_len,
base_url=getattr(self, "_anthropic_base_url", None),
)
if self.api_mode == "codex_responses":
@@ -5428,17 +5364,6 @@ class AIAgent:
tool_call.pop("call_id", None)
tool_call.pop("response_item_id", None)
# Qwen portal: normalize content to list-of-dicts, inject cache_control.
# Must run AFTER codex sanitization so we transform the final messages.
# If sanitization already deepcopied, reuse that copy (in-place).
if self._is_qwen_portal():
if sanitized_messages is api_messages:
# No sanitization was done — we need our own copy.
sanitized_messages = self._qwen_prepare_chat_messages(sanitized_messages)
else:
# Already a deepcopy — transform in place to avoid a second deepcopy.
self._qwen_prepare_chat_messages_inplace(sanitized_messages)
# GPT-5 and Codex models respond better to 'developer' than 'system'
# for instruction-following. Swap the role at the API boundary so
# internal message representation stays uniform ("system").
@@ -5471,17 +5396,11 @@ class AIAgent:
"messages": sanitized_messages,
"timeout": float(os.getenv("HERMES_API_TIMEOUT", 1800.0)),
}
if self._is_qwen_portal():
api_kwargs["metadata"] = {
"sessionId": self.session_id or "hermes",
"promptId": str(uuid.uuid4()),
}
if self.tools:
api_kwargs["tools"] = self.tools
if self.max_tokens is not None:
if not self._is_qwen_portal():
api_kwargs.update(self._max_tokens_param(self.max_tokens))
api_kwargs.update(self._max_tokens_param(self.max_tokens))
elif self._is_openrouter_url() and "claude" in (self.model or "").lower():
# OpenRouter translates requests to Anthropic's Messages API,
# which requires max_tokens as a mandatory field. When we omit
@@ -5537,18 +5456,6 @@ class AIAgent:
if _is_nous:
extra_body["tags"] = ["product=hermes-agent"]
# Ollama num_ctx: override the 2048 default so the model actually
# uses the context window it was trained for. Passed via the OpenAI
# SDK's extra_body → options.num_ctx, which Ollama's OpenAI-compat
# endpoint forwards to the runner as --ctx-size.
if self._ollama_num_ctx:
options = extra_body.get("options", {})
options["num_ctx"] = self._ollama_num_ctx
extra_body["options"] = options
if self._is_qwen_portal():
extra_body["vl_high_resolution_images"] = True
if extra_body:
api_kwargs["extra_body"] = extra_body
@@ -6317,17 +6224,15 @@ class AIAgent:
except Exception as cb_err:
logging.debug(f"Tool complete callback error: {cb_err}")
function_result = maybe_persist_tool_result(
content=function_result,
tool_name=name,
tool_use_id=tc.id,
env=get_active_env(effective_task_id),
)
# Save oversized results to file instead of destructive truncation
function_result = _save_oversized_tool_result(name, function_result)
# Discover subdirectory context files from tool arguments
subdir_hints = self._subdirectory_hints.check_tool_call(name, args)
if subdir_hints:
function_result += subdir_hints
# Append tool result message in order
tool_msg = {
"role": "tool",
"content": function_result,
@@ -6335,12 +6240,6 @@ class AIAgent:
}
messages.append(tool_msg)
# ── Per-turn aggregate budget enforcement ─────────────────────────
num_tools = len(parsed_calls)
if num_tools > 0:
turn_tool_msgs = messages[-num_tools:]
enforce_turn_budget(turn_tool_msgs, env=get_active_env(effective_task_id))
# ── Budget pressure injection ────────────────────────────────────
budget_warning = self._get_budget_warning(api_call_count)
if budget_warning and messages and messages[-1].get("role") == "tool":
@@ -6625,12 +6524,8 @@ class AIAgent:
except Exception as cb_err:
logging.debug(f"Tool complete callback error: {cb_err}")
function_result = maybe_persist_tool_result(
content=function_result,
tool_name=function_name,
tool_use_id=tool_call.id,
env=get_active_env(effective_task_id),
)
# Save oversized results to file instead of destructive truncation
function_result = _save_oversized_tool_result(function_name, function_result)
# Discover subdirectory context files from tool arguments
subdir_hints = self._subdirectory_hints.check_tool_call(function_name, function_args)
@@ -6668,11 +6563,6 @@ class AIAgent:
if self.tool_delay > 0 and i < len(assistant_message.tool_calls):
time.sleep(self.tool_delay)
# ── Per-turn aggregate budget enforcement ─────────────────────────
num_tools_seq = len(assistant_message.tool_calls)
if num_tools_seq > 0:
enforce_turn_budget(messages[-num_tools_seq:], env=get_active_env(effective_task_id))
# ── Budget pressure injection ─────────────────────────────────
# After all tool calls in this turn are processed, check if we're
# approaching max_iterations. If so, inject a warning into the LAST
@@ -7399,7 +7289,6 @@ class AIAgent:
codex_auth_retry_attempted=False
anthropic_auth_retry_attempted=False
nous_auth_retry_attempted=False
thinking_sig_retry_attempted = False
has_retried_429 = False
restart_with_compressed_messages = False
restart_with_length_continuation = False
@@ -7615,8 +7504,7 @@ class AIAgent:
}
# Longer backoff for rate limiting (likely cause of None choices)
# Jittered exponential: 5s base, 120s cap + random jitter
wait_time = jittered_backoff(retry_count, base_delay=5.0, max_delay=120.0)
wait_time = min(5 * (2 ** (retry_count - 1)), 120) # 5s, 10s, 20s, 40s, 80s, 120s
self._vprint(f"{self.log_prefix}⏳ Retrying in {wait_time}s (extended backoff for possible rate limit)...", force=True)
logging.warning(f"Invalid API response (retry {retry_count}/{max_retries}): {', '.join(error_details)} | Provider: {provider_name}")
@@ -7989,38 +7877,8 @@ class AIAgent:
print(f"{self.log_prefix} • Check ANTHROPIC_API_KEY in {_dhh}/.env for API keys or legacy token values")
print(f"{self.log_prefix} • For API keys: verify at https://console.anthropic.com/settings/keys")
print(f"{self.log_prefix} • For Claude Code: run 'claude /login' to refresh, then retry")
print(f"{self.log_prefix}Legacy cleanup: hermes config set ANTHROPIC_TOKEN \"\"")
print(f"{self.log_prefix}Clear stale keys: hermes config set ANTHROPIC_API_KEY \"\"")
# ── Thinking block signature recovery ─────────────────
# Anthropic signs thinking blocks against the full turn
# content. Any upstream mutation (context compression,
# session truncation, message merging) invalidates the
# signature → HTTP 400. Recovery: strip reasoning_details
# from all messages so the next retry sends no thinking
# blocks at all. One-shot — don't retry infinitely.
if (
self.api_mode == "anthropic_messages"
and status_code == 400
and not thinking_sig_retry_attempted
):
_err_msg_lower = str(api_error).lower()
if "signature" in _err_msg_lower and "thinking" in _err_msg_lower:
thinking_sig_retry_attempted = True
for _m in messages:
if isinstance(_m, dict):
_m.pop("reasoning_details", None)
self._vprint(
f"{self.log_prefix}⚠️ Thinking block signature invalid — "
f"stripped all thinking blocks, retrying...",
force=True,
)
logging.warning(
"%sThinking block signature recovery: stripped "
"reasoning_details from %d messages",
self.log_prefix, len(messages),
)
continue
print(f"{self.log_prefix}Clear stale keys: hermes config set ANTHROPIC_TOKEN \"\"")
print(f"{self.log_prefix}Legacy cleanup: hermes config set ANTHROPIC_API_KEY \"\"")
retry_count += 1
elapsed_time = time.time() - api_start_time
@@ -8503,7 +8361,7 @@ class AIAgent:
_retry_after = min(int(_ra_raw), 120) # Cap at 2 minutes
except (TypeError, ValueError):
pass
wait_time = _retry_after if _retry_after else jittered_backoff(retry_count, base_delay=2.0, max_delay=60.0)
wait_time = _retry_after if _retry_after else min(2 ** retry_count, 60)
if is_rate_limited:
self._emit_status(f"⏱️ Rate limit reached. Waiting {wait_time}s before retry (attempt {retry_count + 1}/{max_retries})...")
else:
-252
View File
@@ -1276,258 +1276,6 @@ class TestRoleAlternation:
assert [m["role"] for m in result] == ["user", "assistant", "user"]
# ---------------------------------------------------------------------------
# Thinking block signature management
# ---------------------------------------------------------------------------
class TestThinkingBlockSignatureManagement:
"""Tests for the thinking block handling strategy:
strip from old turns, preserve latest signed, downgrade unsigned."""
def test_thinking_stripped_from_non_last_assistant(self):
"""Thinking blocks are removed from all assistant messages except the last."""
messages = [
{
"role": "assistant",
"content": "",
"tool_calls": [
{"id": "tc_1", "function": {"name": "tool1", "arguments": "{}"}},
],
"reasoning_details": [
{"type": "thinking", "thinking": "Old reasoning.", "signature": "sig_old"},
],
},
{"role": "tool", "tool_call_id": "tc_1", "content": "result 1"},
{
"role": "assistant",
"content": "",
"tool_calls": [
{"id": "tc_2", "function": {"name": "tool2", "arguments": "{}"}},
],
"reasoning_details": [
{"type": "thinking", "thinking": "Latest reasoning.", "signature": "sig_new"},
],
},
{"role": "tool", "tool_call_id": "tc_2", "content": "result 2"},
]
_, result = convert_messages_to_anthropic(messages)
# Find both assistant messages
assistants = [m for m in result if m["role"] == "assistant"]
assert len(assistants) == 2
# First (non-last) assistant: no thinking blocks
first_types = [b.get("type") for b in assistants[0]["content"]]
assert "thinking" not in first_types
assert "redacted_thinking" not in first_types
assert "tool_use" in first_types # tool_use should survive
# Last assistant: thinking block preserved with signature
last_blocks = assistants[1]["content"]
thinking_blocks = [b for b in last_blocks if b.get("type") == "thinking"]
assert len(thinking_blocks) == 1
assert thinking_blocks[0]["thinking"] == "Latest reasoning."
assert thinking_blocks[0]["signature"] == "sig_new"
def test_signed_thinking_preserved_on_last_turn(self):
"""A signed thinking block on the last assistant message is kept."""
messages = [
{
"role": "assistant",
"content": "The answer is 42.",
"reasoning_details": [
{"type": "thinking", "thinking": "Deep thought.", "signature": "sig_valid"},
],
},
]
_, result = convert_messages_to_anthropic(messages)
blocks = result[0]["content"]
thinking = [b for b in blocks if b.get("type") == "thinking"]
assert len(thinking) == 1
assert thinking[0]["signature"] == "sig_valid"
def test_unsigned_thinking_downgraded_to_text_on_last_turn(self):
"""Unsigned thinking blocks on the last turn become text blocks."""
messages = [
{
"role": "assistant",
"content": "Response text.",
"reasoning_details": [
{"type": "thinking", "thinking": "Unsigned reasoning."},
# No 'signature' field
],
},
]
_, result = convert_messages_to_anthropic(messages)
blocks = result[0]["content"]
# No thinking blocks should remain
assert not any(b.get("type") == "thinking" for b in blocks)
# The reasoning text should be preserved as a text block
text_contents = [b.get("text", "") for b in blocks if b.get("type") == "text"]
assert "Unsigned reasoning." in text_contents
def test_redacted_thinking_with_data_preserved(self):
"""Redacted thinking with 'data' field is kept on last turn."""
messages = [
{
"role": "assistant",
"content": "Response.",
"reasoning_details": [
{"type": "redacted_thinking", "data": "opaque_signature_data"},
],
},
]
_, result = convert_messages_to_anthropic(messages)
blocks = result[0]["content"]
redacted = [b for b in blocks if b.get("type") == "redacted_thinking"]
assert len(redacted) == 1
assert redacted[0]["data"] == "opaque_signature_data"
def test_redacted_thinking_without_data_dropped(self):
"""Redacted thinking without 'data' is dropped — can't be validated."""
messages = [
{
"role": "assistant",
"content": "Response.",
"reasoning_details": [
{"type": "redacted_thinking"},
# No 'data' field
],
},
]
_, result = convert_messages_to_anthropic(messages)
blocks = result[0]["content"]
assert not any(b.get("type") == "redacted_thinking" for b in blocks)
def test_cache_control_stripped_from_thinking_blocks(self):
"""cache_control markers are removed from thinking/redacted_thinking blocks."""
messages = [
{
"role": "assistant",
"content": "",
"tool_calls": [
{"id": "tc_1", "function": {"name": "t", "arguments": "{}"}},
],
"reasoning_details": [
{
"type": "thinking",
"thinking": "Reasoning.",
"signature": "sig_1",
"cache_control": {"type": "ephemeral"},
},
],
},
{"role": "tool", "tool_call_id": "tc_1", "content": "result"},
]
_, result = convert_messages_to_anthropic(messages)
assistant = next(m for m in result if m["role"] == "assistant")
for block in assistant["content"]:
if block.get("type") in ("thinking", "redacted_thinking"):
assert "cache_control" not in block
def test_thinking_stripped_from_merged_consecutive_assistants(self):
"""When consecutive assistants are merged, second one's thinking is dropped."""
messages = [
{
"role": "assistant",
"content": "First response.",
"reasoning_details": [
{"type": "thinking", "thinking": "First thought.", "signature": "sig_1"},
],
},
{
"role": "assistant",
"content": "Second response.",
"reasoning_details": [
{"type": "thinking", "thinking": "Second thought.", "signature": "sig_2"},
],
},
]
_, result = convert_messages_to_anthropic(messages)
# Should be merged into one assistant message
assistants = [m for m in result if m["role"] == "assistant"]
assert len(assistants) == 1
# Only the first thinking block should remain (signed, on the last/only assistant)
blocks = assistants[0]["content"]
thinking = [b for b in blocks if b.get("type") == "thinking"]
assert len(thinking) == 1
assert thinking[0]["thinking"] == "First thought."
def test_empty_content_after_strip_gets_placeholder(self):
"""If stripping thinking leaves an empty message, a placeholder is added."""
messages = [
{
"role": "assistant",
"content": "",
"reasoning_details": [
{"type": "thinking", "thinking": "Only thinking, no text."},
# Unsigned — will be downgraded, but content was empty string
],
},
{"role": "user", "content": "Next message."},
{"role": "assistant", "content": "Final."},
]
_, result = convert_messages_to_anthropic(messages)
# First assistant is non-last, so thinking is stripped completely.
# The original content was empty and thinking was unsigned → placeholder
first_assistant = result[0]
assert first_assistant["role"] == "assistant"
assert len(first_assistant["content"]) >= 1
def test_multi_turn_conversation_preserves_only_last(self):
"""Full multi-turn conversation: only last assistant keeps thinking."""
messages = [
{"role": "user", "content": "Question 1"},
{
"role": "assistant",
"content": "Answer 1",
"reasoning_details": [
{"type": "thinking", "thinking": "Thought 1", "signature": "sig_1"},
],
},
{"role": "user", "content": "Question 2"},
{
"role": "assistant",
"content": "Answer 2",
"reasoning_details": [
{"type": "thinking", "thinking": "Thought 2", "signature": "sig_2"},
],
},
{"role": "user", "content": "Question 3"},
{
"role": "assistant",
"content": "Answer 3",
"reasoning_details": [
{"type": "thinking", "thinking": "Thought 3", "signature": "sig_3"},
],
},
]
_, result = convert_messages_to_anthropic(messages)
assistants = [m for m in result if m["role"] == "assistant"]
assert len(assistants) == 3
# First two: no thinking blocks
for a in assistants[:2]:
assert not any(
b.get("type") in ("thinking", "redacted_thinking")
for b in a["content"]
if isinstance(b, dict)
)
# Last one: thinking preserved
last_thinking = [
b for b in assistants[2]["content"]
if isinstance(b, dict) and b.get("type") == "thinking"
]
assert len(last_thinking) == 1
assert last_thinking[0]["signature"] == "sig_3"
# ---------------------------------------------------------------------------
# Tool choice
# ---------------------------------------------------------------------------
+63 -78
View File
@@ -471,23 +471,6 @@ class TestExplicitProviderRouting:
client, model = resolve_provider_client("zai")
assert client is not None
def test_explicit_google_alias_uses_gemini_credentials(self):
"""provider='google' should route through the gemini API-key provider."""
with (
patch("hermes_cli.auth.resolve_api_key_provider_credentials", return_value={
"api_key": "gemini-key",
"base_url": "https://generativelanguage.googleapis.com/v1beta/openai",
}),
patch("agent.auxiliary_client.OpenAI") as mock_openai,
):
mock_openai.return_value = MagicMock()
client, model = resolve_provider_client("google", model="gemini-3.1-pro-preview")
assert client is not None
assert model == "gemini-3.1-pro-preview"
assert mock_openai.call_args.kwargs["api_key"] == "gemini-key"
assert mock_openai.call_args.kwargs["base_url"] == "https://generativelanguage.googleapis.com/v1beta/openai"
def test_explicit_unknown_returns_none(self, monkeypatch):
"""Unknown provider should return None."""
client, model = resolve_provider_client("nonexistent-provider")
@@ -641,15 +624,12 @@ class TestVisionClientFallback:
assert client is None
assert model is None
def test_vision_auto_includes_active_provider_when_configured(self, monkeypatch):
"""Active provider appears in available backends when credentials exist."""
monkeypatch.setenv("ANTHROPIC_API_KEY", "***")
def test_vision_auto_includes_anthropic_when_configured(self, monkeypatch):
monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-ant-api03-key")
with (
patch("agent.auxiliary_client._read_nous_auth", return_value=None),
patch("agent.auxiliary_client._read_main_provider", return_value="anthropic"),
patch("agent.auxiliary_client._read_main_model", return_value="claude-sonnet-4"),
patch("agent.anthropic_adapter.build_anthropic_client", return_value=MagicMock()),
patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="***"),
patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="sk-ant-api03-key"),
):
backends = get_available_vision_backends()
@@ -722,51 +702,88 @@ class TestAuxiliaryPoolAwareness:
assert call_kwargs["base_url"] == "https://api.githubcopilot.com"
assert call_kwargs["default_headers"]["Editor-Version"]
def test_vision_auto_uses_active_provider_as_fallback(self, monkeypatch):
"""When no OpenRouter/Nous available, vision auto falls back to active provider."""
monkeypatch.setenv("ANTHROPIC_API_KEY", "***")
def test_vision_auto_uses_anthropic_when_no_higher_priority_backend(self, monkeypatch):
monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-ant-api03-key")
with (
patch("agent.auxiliary_client._read_nous_auth", return_value=None),
patch("agent.auxiliary_client._read_main_provider", return_value="anthropic"),
patch("agent.auxiliary_client._read_main_model", return_value="claude-sonnet-4"),
patch("agent.anthropic_adapter.build_anthropic_client", return_value=MagicMock()),
patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="***"),
patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="sk-ant-api03-key"),
):
client, model = get_vision_auxiliary_client()
assert client is not None
assert client.__class__.__name__ == "AnthropicAuxiliaryClient"
assert model == "claude-haiku-4-5-20251001"
def test_vision_auto_prefers_active_provider_over_openrouter(self, monkeypatch):
"""Active provider is tried before OpenRouter in vision auto."""
def test_selected_anthropic_provider_is_preferred_for_vision_auto(self, monkeypatch):
monkeypatch.setenv("OPENROUTER_API_KEY", "or-key")
monkeypatch.setenv("ANTHROPIC_API_KEY", "***")
monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-ant-api03-key")
def fake_load_config():
return {"model": {"provider": "anthropic", "default": "claude-sonnet-4-6"}}
with (
patch("agent.auxiliary_client._read_nous_auth", return_value=None),
patch("agent.auxiliary_client._read_main_provider", return_value="anthropic"),
patch("agent.auxiliary_client._read_main_model", return_value="claude-sonnet-4"),
patch("agent.anthropic_adapter.build_anthropic_client", return_value=MagicMock()),
patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="***"),
patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="sk-ant-api03-key"),
patch("agent.auxiliary_client.OpenAI") as mock_openai,
patch("hermes_cli.config.load_config", fake_load_config),
):
client, model = get_vision_auxiliary_client()
assert client is not None
assert client.__class__.__name__ == "AnthropicAuxiliaryClient"
assert model == "claude-haiku-4-5-20251001"
def test_selected_codex_provider_short_circuits_vision_auto(self, monkeypatch):
def fake_load_config():
return {"model": {"provider": "openai-codex", "default": "gpt-5.2-codex"}}
codex_client = MagicMock()
with (
patch("hermes_cli.config.load_config", fake_load_config),
patch("agent.auxiliary_client._try_codex", return_value=(codex_client, "gpt-5.2-codex")) as mock_codex,
patch("agent.auxiliary_client._try_openrouter") as mock_openrouter,
patch("agent.auxiliary_client._try_nous") as mock_nous,
patch("agent.auxiliary_client._try_anthropic") as mock_anthropic,
patch("agent.auxiliary_client._try_custom_endpoint") as mock_custom,
):
provider, client, model = resolve_vision_provider_client()
# Active provider should win over OpenRouter
assert provider == "anthropic"
assert provider == "openai-codex"
assert client is codex_client
assert model == "gpt-5.2-codex"
mock_codex.assert_called_once()
mock_openrouter.assert_not_called()
mock_nous.assert_not_called()
mock_anthropic.assert_not_called()
mock_custom.assert_not_called()
def test_vision_auto_uses_named_custom_as_active_provider(self, monkeypatch):
"""Named custom provider works as active provider fallback in vision auto."""
def test_vision_auto_includes_codex(self, codex_auth_dir):
"""Codex supports vision (gpt-5.3-codex), so auto mode should use it."""
with patch("agent.auxiliary_client._read_nous_auth", return_value=None), \
patch("agent.auxiliary_client.OpenAI"):
client, model = get_vision_auxiliary_client()
from agent.auxiliary_client import CodexAuxiliaryClient
assert isinstance(client, CodexAuxiliaryClient)
assert model == "gpt-5.2-codex"
def test_vision_auto_falls_back_to_custom_endpoint(self, monkeypatch):
"""Custom endpoint is used as fallback in vision auto mode.
Many local models (Qwen-VL, LLaVA, etc.) support vision.
When no OpenRouter/Nous/Codex is available, try the custom endpoint.
"""
monkeypatch.delenv("OPENROUTER_API_KEY", raising=False)
monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
with patch("agent.auxiliary_client._read_nous_auth", return_value=None), \
patch("agent.auxiliary_client._select_pool_entry", return_value=(False, None)), \
patch("agent.auxiliary_client._read_main_provider", return_value="custom:local"), \
patch("agent.auxiliary_client._read_main_model", return_value="my-local-model"), \
patch("agent.auxiliary_client.resolve_provider_client",
return_value=(MagicMock(), "my-local-model")) as mock_resolve:
provider, client, model = resolve_vision_provider_client()
assert client is not None
assert provider == "custom:local"
patch("agent.auxiliary_client._read_codex_access_token", return_value=None), \
patch("agent.auxiliary_client._resolve_custom_runtime",
return_value=("http://localhost:1234/v1", "local-key")), \
patch("agent.auxiliary_client.OpenAI") as mock_openai:
client, model = get_vision_auxiliary_client()
assert client is not None # Custom endpoint picked up as fallback
def test_vision_direct_endpoint_override(self, monkeypatch):
monkeypatch.setenv("OPENROUTER_API_KEY", "or-key")
@@ -805,31 +822,6 @@ class TestAuxiliaryPoolAwareness:
assert model == "google/gemini-3-flash-preview"
assert client is not None
def test_vision_config_google_provider_uses_gemini_credentials(self, monkeypatch):
config = {
"auxiliary": {
"vision": {
"provider": "google",
"model": "gemini-3.1-pro-preview",
}
}
}
monkeypatch.setattr("hermes_cli.config.load_config", lambda: config)
with (
patch("hermes_cli.auth.resolve_api_key_provider_credentials", return_value={
"api_key": "gemini-key",
"base_url": "https://generativelanguage.googleapis.com/v1beta/openai",
}),
patch("agent.auxiliary_client.OpenAI") as mock_openai,
):
resolved_provider, client, model = resolve_vision_provider_client()
assert resolved_provider == "gemini"
assert client is not None
assert model == "gemini-3.1-pro-preview"
assert mock_openai.call_args.kwargs["api_key"] == "gemini-key"
assert mock_openai.call_args.kwargs["base_url"] == "https://generativelanguage.googleapis.com/v1beta/openai"
def test_vision_forced_main_uses_custom_endpoint(self, monkeypatch):
"""When explicitly forced to 'main', vision CAN use custom endpoint."""
config = {
@@ -854,14 +846,7 @@ class TestAuxiliaryPoolAwareness:
monkeypatch.setenv("AUXILIARY_VISION_PROVIDER", "main")
monkeypatch.delenv("OPENAI_BASE_URL", raising=False)
monkeypatch.delenv("OPENAI_API_KEY", raising=False)
# Clear client cache to avoid stale entries from previous tests
from agent.auxiliary_client import _client_cache
_client_cache.clear()
with patch("agent.auxiliary_client._read_nous_auth", return_value=None), \
patch("agent.auxiliary_client._read_main_provider", return_value=""), \
patch("agent.auxiliary_client._read_main_model", return_value=""), \
patch("agent.auxiliary_client._select_pool_entry", return_value=(False, None)), \
patch("agent.auxiliary_client._resolve_custom_runtime", return_value=(None, None)), \
patch("agent.auxiliary_client._read_codex_access_token", return_value=None), \
patch("agent.auxiliary_client._resolve_api_key_provider", return_value=(None, None)):
client, model = get_vision_auxiliary_client()
-42
View File
@@ -1,42 +0,0 @@
"""Tests for MiniMax auxiliary client URL normalization.
MiniMax and MiniMax-CN set inference_base_url to the /anthropic path.
The auxiliary client uses the OpenAI SDK, which needs /v1 instead.
"""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from agent.auxiliary_client import _to_openai_base_url
class TestToOpenaiBaseUrl:
def test_minimax_global_anthropic_suffix_replaced(self):
assert _to_openai_base_url("https://api.minimax.io/anthropic") == "https://api.minimax.io/v1"
def test_minimax_cn_anthropic_suffix_replaced(self):
assert _to_openai_base_url("https://api.minimaxi.com/anthropic") == "https://api.minimaxi.com/v1"
def test_trailing_slash_stripped_before_replace(self):
assert _to_openai_base_url("https://api.minimax.io/anthropic/") == "https://api.minimax.io/v1"
def test_v1_url_unchanged(self):
assert _to_openai_base_url("https://api.openai.com/v1") == "https://api.openai.com/v1"
def test_openrouter_url_unchanged(self):
assert _to_openai_base_url("https://openrouter.ai/api/v1") == "https://openrouter.ai/api/v1"
def test_anthropic_domain_unchanged(self):
"""api.anthropic.com doesn't end with /anthropic — should be untouched."""
assert _to_openai_base_url("https://api.anthropic.com") == "https://api.anthropic.com"
def test_anthropic_in_subpath_unchanged(self):
assert _to_openai_base_url("https://example.com/anthropic/extra") == "https://example.com/anthropic/extra"
def test_empty_string(self):
assert _to_openai_base_url("") == ""
def test_none(self):
assert _to_openai_base_url(None) == ""
-105
View File
@@ -1,105 +0,0 @@
"""Tests for MiniMax provider hardening — context lengths, thinking guard, catalog."""
class TestMinimaxContextLengths:
"""Verify per-model context length entries for MiniMax models."""
def test_m1_variants_have_1m_context(self):
from agent.model_metadata import DEFAULT_CONTEXT_LENGTHS
# Keys are lowercase because the lookup lowercases model names
for model in ("minimax-m1", "minimax-m1-40k", "minimax-m1-80k",
"minimax-m1-128k", "minimax-m1-256k"):
assert model in DEFAULT_CONTEXT_LENGTHS, f"{model} missing from context lengths"
assert DEFAULT_CONTEXT_LENGTHS[model] == 1_000_000, f"{model} expected 1M"
def test_m2_variants_have_1m_context(self):
from agent.model_metadata import DEFAULT_CONTEXT_LENGTHS
# Keys are lowercase because the lookup lowercases model names
for model in ("minimax-m2.5", "minimax-m2.7"):
assert model in DEFAULT_CONTEXT_LENGTHS, f"{model} missing from context lengths"
assert DEFAULT_CONTEXT_LENGTHS[model] == 1_048_576, f"{model} expected 1048576"
def test_minimax_prefix_fallback(self):
from agent.model_metadata import DEFAULT_CONTEXT_LENGTHS
# The generic "minimax" prefix entry should be 1M for unknown models
assert DEFAULT_CONTEXT_LENGTHS["minimax"] == 1_048_576
class TestMinimaxThinkingGuard:
"""Verify that build_anthropic_kwargs does NOT add thinking params for MiniMax models."""
def test_no_thinking_for_minimax_m27(self):
from agent.anthropic_adapter import build_anthropic_kwargs
kwargs = build_anthropic_kwargs(
model="MiniMax-M2.7",
messages=[{"role": "user", "content": "hello"}],
tools=None,
max_tokens=4096,
reasoning_config={"enabled": True, "effort": "medium"},
)
assert "thinking" not in kwargs
assert "output_config" not in kwargs
def test_no_thinking_for_minimax_m1(self):
from agent.anthropic_adapter import build_anthropic_kwargs
kwargs = build_anthropic_kwargs(
model="MiniMax-M1-128k",
messages=[{"role": "user", "content": "hello"}],
tools=None,
max_tokens=4096,
reasoning_config={"enabled": True, "effort": "high"},
)
assert "thinking" not in kwargs
def test_thinking_still_works_for_claude(self):
from agent.anthropic_adapter import build_anthropic_kwargs
kwargs = build_anthropic_kwargs(
model="claude-sonnet-4-20250514",
messages=[{"role": "user", "content": "hello"}],
tools=None,
max_tokens=4096,
reasoning_config={"enabled": True, "effort": "medium"},
)
assert "thinking" in kwargs
class TestMinimaxAuxModel:
"""Verify auxiliary model is standard (not highspeed)."""
def test_minimax_aux_is_standard(self):
from agent.auxiliary_client import _API_KEY_PROVIDER_AUX_MODELS
assert _API_KEY_PROVIDER_AUX_MODELS["minimax"] == "MiniMax-M2.7"
assert _API_KEY_PROVIDER_AUX_MODELS["minimax-cn"] == "MiniMax-M2.7"
def test_minimax_aux_not_highspeed(self):
from agent.auxiliary_client import _API_KEY_PROVIDER_AUX_MODELS
assert "highspeed" not in _API_KEY_PROVIDER_AUX_MODELS["minimax"]
assert "highspeed" not in _API_KEY_PROVIDER_AUX_MODELS["minimax-cn"]
class TestMinimaxModelCatalog:
"""Verify the model catalog includes M1 family and excludes deprecated models."""
def test_catalog_includes_m1_family(self):
from hermes_cli.models import _PROVIDER_MODELS
for provider in ("minimax", "minimax-cn"):
models = _PROVIDER_MODELS[provider]
assert "MiniMax-M1" in models
assert "MiniMax-M1-40k" in models
assert "MiniMax-M1-80k" in models
assert "MiniMax-M1-128k" in models
assert "MiniMax-M1-256k" in models
def test_catalog_excludes_deprecated(self):
from hermes_cli.models import _PROVIDER_MODELS
for provider in ("minimax", "minimax-cn"):
models = _PROVIDER_MODELS[provider]
assert "MiniMax-M2.1" not in models
def test_catalog_excludes_highspeed(self):
from hermes_cli.models import _PROVIDER_MODELS
for provider in ("minimax", "minimax-cn"):
models = _PROVIDER_MODELS[provider]
assert "MiniMax-M2.7-highspeed" not in models
assert "MiniMax-M2.5-highspeed" not in models
-66
View File
@@ -1,66 +0,0 @@
import pytest
from unittest.mock import MagicMock, patch
from hermes_cli.plugins import VALID_HOOKS, PluginManager
import os
import shutil
import tempfile
from cli import HermesCLI
def test_session_hooks_in_valid_hooks():
"""Verify on_session_finalize and on_session_reset are registered as valid hooks."""
assert "on_session_finalize" in VALID_HOOKS
assert "on_session_reset" in VALID_HOOKS
@patch("hermes_cli.plugins.invoke_hook")
def test_session_finalize_on_reset(mock_invoke_hook):
"""Verify on_session_finalize fires when /new or /reset is used."""
cli = HermesCLI()
cli.agent = MagicMock()
cli.agent.session_id = "test-session-id"
# Simulate /new command which triggers on_session_finalize for the old session
cli.new_session(silent=True)
# Check if on_session_finalize was called for the old session
mock_invoke_hook.assert_any_call(
"on_session_finalize", session_id="test-session-id", platform="cli"
)
# Check if on_session_reset was called for the new session
mock_invoke_hook.assert_any_call(
"on_session_reset", session_id=cli.session_id, platform="cli"
)
@patch("hermes_cli.plugins.invoke_hook")
def test_session_finalize_on_cleanup(mock_invoke_hook):
"""Verify on_session_finalize fires during CLI exit cleanup."""
import cli as cli_mod
mock_agent = MagicMock()
mock_agent.session_id = "cleanup-session-id"
cli_mod._active_agent_ref = mock_agent
cli_mod._cleanup_done = False
cli_mod._run_cleanup()
mock_invoke_hook.assert_any_call(
"on_session_finalize", session_id="cleanup-session-id", platform="cli"
)
@patch("hermes_cli.plugins.invoke_hook")
def test_hook_errors_are_caught(mock_invoke_hook):
"""Verify hook exceptions are caught and don't crash the agent."""
mgr = PluginManager()
# Register a hook that raises
def bad_callback(**kwargs):
raise Exception("Hook failed")
mgr._hooks["on_session_finalize"] = [bad_callback]
# This should not raise
results = mgr.invoke_hook("on_session_finalize", session_id="test", platform="cli")
assert results == []
+24 -231
View File
@@ -33,13 +33,6 @@ def git_repo(tmp_path):
["git", "commit", "-m", "Initial commit"],
cwd=repo, capture_output=True,
)
# Add a fake remote ref so cleanup logic sees the initial commit as
# "pushed". Without this, `git log HEAD --not --remotes` treats every
# commit as unpushed and cleanup refuses to delete worktrees.
subprocess.run(
["git", "update-ref", "refs/remotes/origin/main", "HEAD"],
cwd=repo, capture_output=True,
)
return repo
@@ -88,11 +81,7 @@ def _setup_worktree(repo_root):
def _cleanup_worktree(info):
"""Test version of _cleanup_worktree.
Preserves the worktree only if it has unpushed commits.
Dirty working tree alone is not enough to keep it.
"""
"""Test version of _cleanup_worktree."""
wt_path = info["path"]
branch = info["branch"]
repo_root = info["repo_root"]
@@ -100,15 +89,15 @@ def _cleanup_worktree(info):
if not Path(wt_path).exists():
return
# Check for unpushed commits
result = subprocess.run(
["git", "log", "--oneline", "HEAD", "--not", "--remotes"],
# Check for uncommitted changes
status = subprocess.run(
["git", "status", "--porcelain"],
capture_output=True, text=True, timeout=10, cwd=wt_path,
)
has_unpushed = bool(result.stdout.strip())
has_changes = bool(status.stdout.strip())
if has_unpushed:
return False # Did not clean up — has unpushed commits
if has_changes:
return False # Did not clean up
subprocess.run(
["git", "worktree", "remove", wt_path, "--force"],
@@ -215,45 +204,20 @@ class TestWorktreeCleanup:
assert result is True
assert not Path(info["path"]).exists()
def test_dirty_worktree_cleaned_when_no_unpushed(self, git_repo):
"""Dirty working tree without unpushed commits is cleaned up.
Agent sessions typically leave untracked files / artifacts behind.
Since all real work is in pushed commits, these don't warrant
keeping the worktree.
"""
def test_dirty_worktree_kept(self, git_repo):
info = _setup_worktree(str(git_repo))
assert info is not None
# Make uncommitted changes (untracked file)
# Make uncommitted changes
(Path(info["path"]) / "new-file.txt").write_text("uncommitted")
subprocess.run(
["git", "add", "new-file.txt"],
cwd=info["path"], capture_output=True,
)
# The git_repo fixture already has a fake remote ref so the initial
# commit is seen as "pushed". No unpushed commits → cleanup proceeds.
result = _cleanup_worktree(info)
assert result is True # Cleaned up despite dirty working tree
assert not Path(info["path"]).exists()
def test_worktree_with_unpushed_commits_kept(self, git_repo):
"""Worktree with unpushed commits is preserved."""
info = _setup_worktree(str(git_repo))
assert info is not None
# Make a commit that is NOT on any remote
(Path(info["path"]) / "work.txt").write_text("real work")
subprocess.run(["git", "add", "work.txt"], cwd=info["path"], capture_output=True)
subprocess.run(
["git", "commit", "-m", "agent work"],
cwd=info["path"], capture_output=True,
)
result = _cleanup_worktree(info)
assert result is False # Kept — has unpushed commits
assert Path(info["path"]).exists()
assert result is False
assert Path(info["path"]).exists() # Still there
def test_branch_deleted_on_cleanup(self, git_repo):
info = _setup_worktree(str(git_repo))
@@ -403,7 +367,7 @@ class TestMultipleWorktrees:
lines = [l for l in result.stdout.strip().splitlines() if l.strip()]
assert len(lines) == 11
# Cleanup all (git_repo fixture has a fake remote ref so cleanup works)
# Cleanup all
for info in worktrees:
# Discard changes first so cleanup works
subprocess.run(
@@ -528,77 +492,33 @@ class TestStaleWorktreePruning:
assert not pruned
assert Path(info["path"]).exists()
def test_keeps_old_worktree_with_unpushed_commits(self, git_repo):
"""Old worktrees (24-72h) with unpushed commits should NOT be pruned."""
def test_keeps_dirty_old_worktree(self, git_repo):
"""Old worktrees with uncommitted changes should NOT be pruned."""
import time
info = _setup_worktree(str(git_repo))
assert info is not None
# Make an unpushed commit
(Path(info["path"]) / "work.txt").write_text("real work")
subprocess.run(["git", "add", "work.txt"], cwd=info["path"], capture_output=True)
# Make it dirty
(Path(info["path"]) / "dirty.txt").write_text("uncommitted")
subprocess.run(
["git", "commit", "-m", "agent work"],
["git", "add", "dirty.txt"],
cwd=info["path"], capture_output=True,
)
# Make it old (25h — in the 24-72h soft tier)
# Make it old
old_time = time.time() - (25 * 3600)
os.utime(info["path"], (old_time, old_time))
# Check for unpushed commits (simulates prune logic)
result = subprocess.run(
["git", "log", "--oneline", "HEAD", "--not", "--remotes"],
# Check if it would be pruned
status = subprocess.run(
["git", "status", "--porcelain"],
capture_output=True, text=True, cwd=info["path"],
)
has_unpushed = bool(result.stdout.strip())
assert has_unpushed # Has unpushed commits → not pruned in soft tier
has_changes = bool(status.stdout.strip())
assert has_changes # Should be dirty → not pruned
assert Path(info["path"]).exists()
def test_force_prunes_very_old_worktree(self, git_repo):
"""Worktrees older than 72h should be force-pruned regardless."""
import time
info = _setup_worktree(str(git_repo))
assert info is not None
# Make an unpushed commit (would normally protect it)
(Path(info["path"]) / "work.txt").write_text("stale work")
subprocess.run(["git", "add", "work.txt"], cwd=info["path"], capture_output=True)
subprocess.run(
["git", "commit", "-m", "old agent work"],
cwd=info["path"], capture_output=True,
)
# Make it very old (73h — beyond the 72h hard threshold)
old_time = time.time() - (73 * 3600)
os.utime(info["path"], (old_time, old_time))
# Simulate the force-prune tier check
hard_cutoff = time.time() - (72 * 3600)
mtime = Path(info["path"]).stat().st_mtime
assert mtime <= hard_cutoff # Should qualify for force removal
# Actually remove it (simulates _prune_stale_worktrees force path)
branch_result = subprocess.run(
["git", "branch", "--show-current"],
capture_output=True, text=True, timeout=5, cwd=info["path"],
)
branch = branch_result.stdout.strip()
subprocess.run(
["git", "worktree", "remove", info["path"], "--force"],
capture_output=True, text=True, timeout=15, cwd=str(git_repo),
)
if branch:
subprocess.run(
["git", "branch", "-D", branch],
capture_output=True, text=True, timeout=10, cwd=str(git_repo),
)
assert not Path(info["path"]).exists()
class TestEdgeCases:
"""Test edge cases for robustness."""
@@ -691,133 +611,6 @@ class TestTerminalCWDIntegration:
assert result.stdout.strip() == "true"
class TestOrphanedBranchPruning:
"""Test cleanup of orphaned hermes/* and pr-* branches."""
def test_prunes_orphaned_hermes_branch(self, git_repo):
"""hermes/hermes-* branches with no worktree should be deleted."""
# Create a branch that looks like a worktree branch but has no worktree
subprocess.run(
["git", "branch", "hermes/hermes-deadbeef", "HEAD"],
cwd=str(git_repo), capture_output=True,
)
# Verify it exists
result = subprocess.run(
["git", "branch", "--list", "hermes/hermes-deadbeef"],
capture_output=True, text=True, cwd=str(git_repo),
)
assert "hermes/hermes-deadbeef" in result.stdout
# Simulate _prune_orphaned_branches logic
result = subprocess.run(
["git", "branch", "--format=%(refname:short)"],
capture_output=True, text=True, cwd=str(git_repo),
)
all_branches = [b.strip() for b in result.stdout.strip().split("\n") if b.strip()]
wt_result = subprocess.run(
["git", "worktree", "list", "--porcelain"],
capture_output=True, text=True, cwd=str(git_repo),
)
active_branches = {"main"}
for line in wt_result.stdout.split("\n"):
if line.startswith("branch refs/heads/"):
active_branches.add(line.split("branch refs/heads/", 1)[-1].strip())
orphaned = [
b for b in all_branches
if b not in active_branches
and (b.startswith("hermes/hermes-") or b.startswith("pr-"))
]
assert "hermes/hermes-deadbeef" in orphaned
# Delete them
if orphaned:
subprocess.run(
["git", "branch", "-D"] + orphaned,
capture_output=True, text=True, cwd=str(git_repo),
)
# Verify gone
result = subprocess.run(
["git", "branch", "--list", "hermes/hermes-deadbeef"],
capture_output=True, text=True, cwd=str(git_repo),
)
assert "hermes/hermes-deadbeef" not in result.stdout
def test_prunes_orphaned_pr_branch(self, git_repo):
"""pr-* branches should be deleted during pruning."""
subprocess.run(
["git", "branch", "pr-1234", "HEAD"],
cwd=str(git_repo), capture_output=True,
)
subprocess.run(
["git", "branch", "pr-5678", "HEAD"],
cwd=str(git_repo), capture_output=True,
)
result = subprocess.run(
["git", "branch", "--format=%(refname:short)"],
capture_output=True, text=True, cwd=str(git_repo),
)
all_branches = [b.strip() for b in result.stdout.strip().split("\n") if b.strip()]
active_branches = {"main"}
orphaned = [
b for b in all_branches
if b not in active_branches and b.startswith("pr-")
]
assert "pr-1234" in orphaned
assert "pr-5678" in orphaned
subprocess.run(
["git", "branch", "-D"] + orphaned,
capture_output=True, text=True, cwd=str(git_repo),
)
# Verify gone
result = subprocess.run(
["git", "branch", "--format=%(refname:short)"],
capture_output=True, text=True, cwd=str(git_repo),
)
remaining = result.stdout.strip()
assert "pr-1234" not in remaining
assert "pr-5678" not in remaining
def test_preserves_active_worktree_branch(self, git_repo):
"""Branches with active worktrees should NOT be pruned."""
info = _setup_worktree(str(git_repo))
assert info is not None
result = subprocess.run(
["git", "worktree", "list", "--porcelain"],
capture_output=True, text=True, cwd=str(git_repo),
)
active_branches = set()
for line in result.stdout.split("\n"):
if line.startswith("branch refs/heads/"):
active_branches.add(line.split("branch refs/heads/", 1)[-1].strip())
assert info["branch"] in active_branches # Protected
def test_preserves_main_branch(self, git_repo):
"""main branch should never be pruned."""
result = subprocess.run(
["git", "branch", "--format=%(refname:short)"],
capture_output=True, text=True, cwd=str(git_repo),
)
all_branches = [b.strip() for b in result.stdout.strip().split("\n") if b.strip()]
active_branches = {"main"}
orphaned = [
b for b in all_branches
if b not in active_branches
and (b.startswith("hermes/hermes-") or b.startswith("pr-"))
]
assert "main" not in orphaned
class TestSystemPromptInjection:
"""Test that the agent gets worktree context in its system prompt."""
@@ -832,7 +625,7 @@ class TestSystemPromptInjection:
f"{info['path']}. Your branch is `{info['branch']}`. "
f"Changes here do not affect the main working tree or other agents. "
f"Remember to commit and push your changes, and create a PR if appropriate. "
f"The original repo is at {info['repo_root']}.]\n"
f"The original repo is at {info['repo_root']}.]"
)
assert info["path"] in wt_note
-30
View File
@@ -339,36 +339,6 @@ class TestMarkJobRun:
assert updated["last_status"] == "error"
assert updated["last_error"] == "timeout"
def test_delivery_error_tracked_separately(self, tmp_cron_dir):
"""Agent succeeds but delivery fails — both tracked independently."""
job = create_job(prompt="Report", schedule="every 1h")
mark_job_run(job["id"], success=True, delivery_error="platform 'telegram' not configured")
updated = get_job(job["id"])
assert updated["last_status"] == "ok"
assert updated["last_error"] is None
assert updated["last_delivery_error"] == "platform 'telegram' not configured"
def test_delivery_error_cleared_on_success(self, tmp_cron_dir):
"""Successful delivery clears the previous delivery error."""
job = create_job(prompt="Report", schedule="every 1h")
mark_job_run(job["id"], success=True, delivery_error="network timeout")
updated = get_job(job["id"])
assert updated["last_delivery_error"] == "network timeout"
# Next run delivers successfully
mark_job_run(job["id"], success=True, delivery_error=None)
updated = get_job(job["id"])
assert updated["last_delivery_error"] is None
def test_both_agent_and_delivery_error(self, tmp_cron_dir):
"""Agent fails AND delivery fails — both errors recorded."""
job = create_job(prompt="Report", schedule="every 1h")
mark_job_run(job["id"], success=False, error="model timeout",
delivery_error="platform 'discord' not enabled")
updated = get_job(job["id"])
assert updated["last_status"] == "error"
assert updated["last_error"] == "model timeout"
assert updated["last_delivery_error"] == "platform 'discord' not enabled"
class TestAdvanceNextRun:
"""Tests for advance_next_run() — crash-safety for recurring jobs."""
-84
View File
@@ -508,90 +508,6 @@ class TestDeliverResultWrapping:
assert send_mock.call_args.kwargs["thread_id"] == "17585"
class TestDeliverResultErrorReturns:
"""Verify _deliver_result returns error strings on failure, None on success."""
def test_returns_none_on_successful_delivery(self):
from gateway.config import Platform
pconfig = MagicMock()
pconfig.enabled = True
mock_cfg = MagicMock()
mock_cfg.platforms = {Platform.TELEGRAM: pconfig}
with patch("gateway.config.load_gateway_config", return_value=mock_cfg), \
patch("tools.send_message_tool._send_to_platform", new=AsyncMock(return_value={"success": True})):
job = {
"id": "ok-job",
"deliver": "origin",
"origin": {"platform": "telegram", "chat_id": "123"},
}
result = _deliver_result(job, "Output.")
assert result is None
def test_returns_none_for_local_delivery(self):
"""local-only jobs don't deliver — not a failure."""
job = {"id": "local-job", "deliver": "local"}
result = _deliver_result(job, "Output.")
assert result is None
def test_returns_error_for_unknown_platform(self):
job = {
"id": "bad-platform",
"deliver": "origin",
"origin": {"platform": "fax", "chat_id": "123"},
}
with patch("gateway.config.load_gateway_config"):
result = _deliver_result(job, "Output.")
assert result is not None
assert "unknown platform" in result
def test_returns_error_when_platform_disabled(self):
from gateway.config import Platform
pconfig = MagicMock()
pconfig.enabled = False
mock_cfg = MagicMock()
mock_cfg.platforms = {Platform.TELEGRAM: pconfig}
with patch("gateway.config.load_gateway_config", return_value=mock_cfg):
job = {
"id": "disabled",
"deliver": "origin",
"origin": {"platform": "telegram", "chat_id": "123"},
}
result = _deliver_result(job, "Output.")
assert result is not None
assert "not configured" in result
def test_returns_error_on_send_failure(self):
from gateway.config import Platform
pconfig = MagicMock()
pconfig.enabled = True
mock_cfg = MagicMock()
mock_cfg.platforms = {Platform.TELEGRAM: pconfig}
with patch("gateway.config.load_gateway_config", return_value=mock_cfg), \
patch("tools.send_message_tool._send_to_platform", new=AsyncMock(return_value={"error": "rate limited"})):
job = {
"id": "rate-limited",
"deliver": "origin",
"origin": {"platform": "telegram", "chat_id": "123"},
}
result = _deliver_result(job, "Output.")
assert result is not None
assert "rate limited" in result
def test_returns_error_for_unresolved_target(self, monkeypatch):
"""Non-local delivery with no resolvable target should return an error."""
monkeypatch.delenv("TELEGRAM_HOME_CHANNEL", raising=False)
job = {"id": "no-target", "deliver": "telegram"}
result = _deliver_result(job, "Output.")
assert result is not None
assert "no delivery target" in result
class TestRunJobSessionPersistence:
def test_run_job_passes_session_db_and_cron_platform(self, tmp_path):
job = {
-277
View File
@@ -1,277 +0,0 @@
"""Tests for Discord reply_to_mode functionality.
Covers the threading behavior control for multi-chunk replies:
- "off": Never reply-reference to original message
- "first": Only first chunk uses reply reference (default)
- "all": All chunks reply-reference the original message
"""
import os
import sys
from types import SimpleNamespace
from unittest.mock import MagicMock, AsyncMock, patch
import pytest
from gateway.config import PlatformConfig, GatewayConfig, Platform, _apply_env_overrides
def _ensure_discord_mock():
"""Install a mock discord module when discord.py isn't available."""
if "discord" in sys.modules and hasattr(sys.modules["discord"], "__file__"):
return
discord_mod = MagicMock()
discord_mod.Intents.default.return_value = MagicMock()
discord_mod.Client = MagicMock
discord_mod.File = MagicMock
discord_mod.DMChannel = type("DMChannel", (), {})
discord_mod.Thread = type("Thread", (), {})
discord_mod.ForumChannel = type("ForumChannel", (), {})
discord_mod.ui = SimpleNamespace(View=object, button=lambda *a, **k: (lambda fn: fn), Button=object)
discord_mod.ButtonStyle = SimpleNamespace(success=1, primary=2, secondary=2, danger=3, green=1, grey=2, blurple=2, red=3)
discord_mod.Color = SimpleNamespace(orange=lambda: 1, green=lambda: 2, blue=lambda: 3, red=lambda: 4, purple=lambda: 5)
discord_mod.Interaction = object
discord_mod.Embed = MagicMock
discord_mod.app_commands = SimpleNamespace(
describe=lambda **kwargs: (lambda fn: fn),
choices=lambda **kwargs: (lambda fn: fn),
Choice=lambda **kwargs: SimpleNamespace(**kwargs),
)
ext_mod = MagicMock()
commands_mod = MagicMock()
commands_mod.Bot = MagicMock
ext_mod.commands = commands_mod
sys.modules.setdefault("discord", discord_mod)
sys.modules.setdefault("discord.ext", ext_mod)
sys.modules.setdefault("discord.ext.commands", commands_mod)
_ensure_discord_mock()
from gateway.platforms.discord import DiscordAdapter # noqa: E402
@pytest.fixture()
def adapter_factory():
"""Factory to create DiscordAdapter with custom reply_to_mode."""
def create(reply_to_mode: str = "first"):
config = PlatformConfig(enabled=True, token="test-token", reply_to_mode=reply_to_mode)
return DiscordAdapter(config)
return create
class TestReplyToModeConfig:
"""Tests for reply_to_mode configuration loading."""
def test_default_mode_is_first(self, adapter_factory):
adapter = adapter_factory()
assert adapter._reply_to_mode == "first"
def test_off_mode(self, adapter_factory):
adapter = adapter_factory(reply_to_mode="off")
assert adapter._reply_to_mode == "off"
def test_first_mode(self, adapter_factory):
adapter = adapter_factory(reply_to_mode="first")
assert adapter._reply_to_mode == "first"
def test_all_mode(self, adapter_factory):
adapter = adapter_factory(reply_to_mode="all")
assert adapter._reply_to_mode == "all"
def test_invalid_mode_stored_as_is(self, adapter_factory):
"""Invalid modes are stored but send() handles them gracefully."""
adapter = adapter_factory(reply_to_mode="invalid")
assert adapter._reply_to_mode == "invalid"
def test_none_mode_defaults_to_first(self):
config = PlatformConfig(enabled=True, token="test-token")
adapter = DiscordAdapter(config)
assert adapter._reply_to_mode == "first"
def test_empty_string_mode_defaults_to_first(self):
config = PlatformConfig(enabled=True, token="test-token", reply_to_mode="")
adapter = DiscordAdapter(config)
assert adapter._reply_to_mode == "first"
def _make_discord_adapter(reply_to_mode: str = "first"):
"""Create a DiscordAdapter with mocked client and channel for send() tests."""
config = PlatformConfig(enabled=True, token="test-token", reply_to_mode=reply_to_mode)
adapter = DiscordAdapter(config)
# Mock the Discord client and channel
mock_channel = AsyncMock()
ref_message = MagicMock()
mock_channel.fetch_message = AsyncMock(return_value=ref_message)
sent_msg = MagicMock()
sent_msg.id = 42
mock_channel.send = AsyncMock(return_value=sent_msg)
mock_client = MagicMock()
mock_client.get_channel = MagicMock(return_value=mock_channel)
adapter._client = mock_client
return adapter, mock_channel, ref_message
class TestSendWithReplyToMode:
"""Tests for send() method respecting reply_to_mode."""
@pytest.mark.asyncio
async def test_off_mode_no_reply_reference(self):
adapter, channel, ref_msg = _make_discord_adapter("off")
adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2", "chunk3"]
await adapter.send("12345", "test content", reply_to="999")
# Should never try to fetch the reference message
channel.fetch_message.assert_not_called()
# All chunks sent without reference
for call in channel.send.call_args_list:
assert call.kwargs.get("reference") is None
@pytest.mark.asyncio
async def test_first_mode_only_first_chunk_references(self):
adapter, channel, ref_msg = _make_discord_adapter("first")
adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2", "chunk3"]
await adapter.send("12345", "test content", reply_to="999")
# Should fetch the reference message
channel.fetch_message.assert_called_once_with(999)
calls = channel.send.call_args_list
assert len(calls) == 3
assert calls[0].kwargs.get("reference") is ref_msg
assert calls[1].kwargs.get("reference") is None
assert calls[2].kwargs.get("reference") is None
@pytest.mark.asyncio
async def test_all_mode_all_chunks_reference(self):
adapter, channel, ref_msg = _make_discord_adapter("all")
adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2", "chunk3"]
await adapter.send("12345", "test content", reply_to="999")
channel.fetch_message.assert_called_once_with(999)
calls = channel.send.call_args_list
assert len(calls) == 3
for call in calls:
assert call.kwargs.get("reference") is ref_msg
@pytest.mark.asyncio
async def test_no_reply_to_param_no_reference(self):
adapter, channel, ref_msg = _make_discord_adapter("all")
adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2"]
await adapter.send("12345", "test content", reply_to=None)
channel.fetch_message.assert_not_called()
for call in channel.send.call_args_list:
assert call.kwargs.get("reference") is None
@pytest.mark.asyncio
async def test_single_chunk_respects_first_mode(self):
adapter, channel, ref_msg = _make_discord_adapter("first")
adapter.truncate_message = lambda content, max_len: ["single chunk"]
await adapter.send("12345", "test", reply_to="999")
calls = channel.send.call_args_list
assert len(calls) == 1
assert calls[0].kwargs.get("reference") is ref_msg
@pytest.mark.asyncio
async def test_single_chunk_off_mode(self):
adapter, channel, ref_msg = _make_discord_adapter("off")
adapter.truncate_message = lambda content, max_len: ["single chunk"]
await adapter.send("12345", "test", reply_to="999")
channel.fetch_message.assert_not_called()
calls = channel.send.call_args_list
assert len(calls) == 1
assert calls[0].kwargs.get("reference") is None
@pytest.mark.asyncio
async def test_invalid_mode_falls_back_to_first_behavior(self):
"""Invalid mode behaves like 'first' — only first chunk gets reference."""
adapter, channel, ref_msg = _make_discord_adapter("banana")
adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2"]
await adapter.send("12345", "test", reply_to="999")
calls = channel.send.call_args_list
assert len(calls) == 2
assert calls[0].kwargs.get("reference") is ref_msg
assert calls[1].kwargs.get("reference") is None
class TestConfigSerialization:
"""Tests for reply_to_mode serialization (shared with Telegram)."""
def test_to_dict_includes_reply_to_mode(self):
config = PlatformConfig(enabled=True, token="test", reply_to_mode="all")
result = config.to_dict()
assert result["reply_to_mode"] == "all"
def test_from_dict_loads_reply_to_mode(self):
data = {"enabled": True, "token": "***", "reply_to_mode": "off"}
config = PlatformConfig.from_dict(data)
assert config.reply_to_mode == "off"
def test_from_dict_defaults_to_first(self):
data = {"enabled": True, "token": "***"}
config = PlatformConfig.from_dict(data)
assert config.reply_to_mode == "first"
class TestEnvVarOverride:
"""Tests for DISCORD_REPLY_TO_MODE environment variable override."""
def _make_config(self):
config = GatewayConfig()
config.platforms[Platform.DISCORD] = PlatformConfig(enabled=True, token="test")
return config
def test_env_var_sets_off_mode(self):
config = self._make_config()
with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "off"}, clear=False):
_apply_env_overrides(config)
assert config.platforms[Platform.DISCORD].reply_to_mode == "off"
def test_env_var_sets_all_mode(self):
config = self._make_config()
with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "all"}, clear=False):
_apply_env_overrides(config)
assert config.platforms[Platform.DISCORD].reply_to_mode == "all"
def test_env_var_case_insensitive(self):
config = self._make_config()
with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "ALL"}, clear=False):
_apply_env_overrides(config)
assert config.platforms[Platform.DISCORD].reply_to_mode == "all"
def test_env_var_invalid_value_ignored(self):
config = self._make_config()
with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "banana"}, clear=False):
_apply_env_overrides(config)
assert config.platforms[Platform.DISCORD].reply_to_mode == "first"
def test_env_var_empty_value_ignored(self):
config = self._make_config()
with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": ""}, clear=False):
_apply_env_overrides(config)
assert config.platforms[Platform.DISCORD].reply_to_mode == "first"
def test_env_var_creates_platform_config_if_missing(self):
"""DISCORD_REPLY_TO_MODE creates PlatformConfig even without DISCORD_BOT_TOKEN."""
config = GatewayConfig()
assert Platform.DISCORD not in config.platforms
with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "off"}, clear=False):
_apply_env_overrides(config)
assert Platform.DISCORD in config.platforms
assert config.platforms[Platform.DISCORD].reply_to_mode == "off"
@@ -1,432 +0,0 @@
"""Tests for Feishu interactive card approval buttons."""
import asyncio
import json
import os
import sys
from pathlib import Path
from types import SimpleNamespace
from unittest.mock import AsyncMock, MagicMock, Mock, patch
import pytest
# ---------------------------------------------------------------------------
# Ensure the repo root is importable
# ---------------------------------------------------------------------------
_repo = str(Path(__file__).resolve().parents[2])
if _repo not in sys.path:
sys.path.insert(0, _repo)
# ---------------------------------------------------------------------------
# Minimal Feishu mock so FeishuAdapter can be imported without lark-oapi
# ---------------------------------------------------------------------------
def _ensure_feishu_mocks():
"""Provide stubs for lark-oapi / aiohttp.web so the import succeeds."""
if "lark_oapi" not in sys.modules:
mod = MagicMock()
for name in (
"lark_oapi", "lark_oapi.api.im.v1",
"lark_oapi.event", "lark_oapi.event.callback_type",
):
sys.modules.setdefault(name, mod)
if "aiohttp" not in sys.modules:
aio = MagicMock()
sys.modules.setdefault("aiohttp", aio)
sys.modules.setdefault("aiohttp.web", aio.web)
_ensure_feishu_mocks()
from gateway.config import PlatformConfig
from gateway.platforms.feishu import FeishuAdapter
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _make_adapter() -> FeishuAdapter:
"""Create a FeishuAdapter with mocked internals."""
config = PlatformConfig(enabled=True)
adapter = FeishuAdapter(config)
adapter._client = MagicMock()
return adapter
def _make_card_action_data(
action_value: dict,
chat_id: str = "oc_12345",
open_id: str = "ou_user1",
token: str = "tok_abc",
) -> SimpleNamespace:
"""Create a mock Feishu card action callback data object."""
return SimpleNamespace(
event=SimpleNamespace(
token=token,
context=SimpleNamespace(open_chat_id=chat_id),
operator=SimpleNamespace(open_id=open_id),
action=SimpleNamespace(
tag="button",
value=action_value,
),
),
)
# ===========================================================================
# send_exec_approval — interactive card with buttons
# ===========================================================================
class TestFeishuExecApproval:
"""Test send_exec_approval sends an interactive card."""
@pytest.mark.asyncio
async def test_sends_interactive_card(self):
adapter = _make_adapter()
mock_response = SimpleNamespace(
success=lambda: True,
data=SimpleNamespace(message_id="msg_001"),
)
with patch.object(
adapter, "_feishu_send_with_retry", new_callable=AsyncMock,
return_value=mock_response,
) as mock_send:
result = await adapter.send_exec_approval(
chat_id="oc_12345",
command="rm -rf /important",
session_key="agent:main:feishu:group:oc_12345",
description="dangerous deletion",
)
assert result.success is True
assert result.message_id == "msg_001"
mock_send.assert_called_once()
kwargs = mock_send.call_args[1]
assert kwargs["chat_id"] == "oc_12345"
assert kwargs["msg_type"] == "interactive"
# Verify card payload contains the command and buttons
card = json.loads(kwargs["payload"])
assert card["header"]["template"] == "orange"
assert "rm -rf /important" in card["elements"][0]["content"]
assert "dangerous deletion" in card["elements"][0]["content"]
# Check buttons
actions = card["elements"][1]["actions"]
assert len(actions) == 4
action_names = [a["value"]["hermes_action"] for a in actions]
assert action_names == [
"approve_once", "approve_session", "approve_always", "deny"
]
@pytest.mark.asyncio
async def test_stores_approval_state(self):
adapter = _make_adapter()
mock_response = SimpleNamespace(
success=lambda: True,
data=SimpleNamespace(message_id="msg_002"),
)
with patch.object(
adapter, "_feishu_send_with_retry", new_callable=AsyncMock,
return_value=mock_response,
):
await adapter.send_exec_approval(
chat_id="oc_12345",
command="echo test",
session_key="my-session-key",
)
assert len(adapter._approval_state) == 1
approval_id = list(adapter._approval_state.keys())[0]
state = adapter._approval_state[approval_id]
assert state["session_key"] == "my-session-key"
assert state["message_id"] == "msg_002"
assert state["chat_id"] == "oc_12345"
@pytest.mark.asyncio
async def test_not_connected(self):
adapter = _make_adapter()
adapter._client = None
result = await adapter.send_exec_approval(
chat_id="oc_12345", command="ls", session_key="s"
)
assert result.success is False
@pytest.mark.asyncio
async def test_truncates_long_command(self):
adapter = _make_adapter()
mock_response = SimpleNamespace(
success=lambda: True,
data=SimpleNamespace(message_id="msg_003"),
)
with patch.object(
adapter, "_feishu_send_with_retry", new_callable=AsyncMock,
return_value=mock_response,
) as mock_send:
long_cmd = "x" * 5000
await adapter.send_exec_approval(
chat_id="oc_12345", command=long_cmd, session_key="s"
)
card = json.loads(mock_send.call_args[1]["payload"])
content = card["elements"][0]["content"]
assert "..." in content
assert len(content) < 5000
@pytest.mark.asyncio
async def test_multiple_approvals_get_unique_ids(self):
adapter = _make_adapter()
mock_response = SimpleNamespace(
success=lambda: True,
data=SimpleNamespace(message_id="msg_x"),
)
with patch.object(
adapter, "_feishu_send_with_retry", new_callable=AsyncMock,
return_value=mock_response,
):
await adapter.send_exec_approval(
chat_id="oc_1", command="cmd1", session_key="s1"
)
await adapter.send_exec_approval(
chat_id="oc_2", command="cmd2", session_key="s2"
)
assert len(adapter._approval_state) == 2
ids = list(adapter._approval_state.keys())
assert ids[0] != ids[1]
# ===========================================================================
# _handle_card_action_event — approval button clicks
# ===========================================================================
class TestFeishuApprovalCallback:
"""Test the approval intercept in _handle_card_action_event."""
@pytest.mark.asyncio
async def test_resolves_approval_on_click(self):
adapter = _make_adapter()
adapter._approval_state[1] = {
"session_key": "agent:main:feishu:group:oc_12345",
"message_id": "msg_001",
"chat_id": "oc_12345",
}
data = _make_card_action_data(
action_value={"hermes_action": "approve_once", "approval_id": 1},
)
with (
patch.object(
adapter, "_resolve_sender_profile", new_callable=AsyncMock,
return_value={"user_id": "ou_user1", "user_name": "Norbert", "user_id_alt": None},
),
patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
):
await adapter._handle_card_action_event(data)
mock_resolve.assert_called_once_with("agent:main:feishu:group:oc_12345", "once")
mock_update.assert_called_once_with("msg_001", "Approved once", "Norbert", "once")
# State should be cleaned up
assert 1 not in adapter._approval_state
@pytest.mark.asyncio
async def test_deny_button(self):
adapter = _make_adapter()
adapter._approval_state[2] = {
"session_key": "some-session",
"message_id": "msg_002",
"chat_id": "oc_12345",
}
data = _make_card_action_data(
action_value={"hermes_action": "deny", "approval_id": 2},
token="tok_deny",
)
with (
patch.object(
adapter, "_resolve_sender_profile", new_callable=AsyncMock,
return_value={"user_id": "ou_alice", "user_name": "Alice", "user_id_alt": None},
),
patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
):
await adapter._handle_card_action_event(data)
mock_resolve.assert_called_once_with("some-session", "deny")
mock_update.assert_called_once_with("msg_002", "Denied", "Alice", "deny")
@pytest.mark.asyncio
async def test_session_approval(self):
adapter = _make_adapter()
adapter._approval_state[3] = {
"session_key": "sess-3",
"message_id": "msg_003",
"chat_id": "oc_99",
}
data = _make_card_action_data(
action_value={"hermes_action": "approve_session", "approval_id": 3},
token="tok_ses",
)
with (
patch.object(
adapter, "_resolve_sender_profile", new_callable=AsyncMock,
return_value={"user_id": "ou_u", "user_name": "Bob", "user_id_alt": None},
),
patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
):
await adapter._handle_card_action_event(data)
mock_resolve.assert_called_once_with("sess-3", "session")
mock_update.assert_called_once_with("msg_003", "Approved for session", "Bob", "session")
@pytest.mark.asyncio
async def test_always_approval(self):
adapter = _make_adapter()
adapter._approval_state[4] = {
"session_key": "sess-4",
"message_id": "msg_004",
"chat_id": "oc_55",
}
data = _make_card_action_data(
action_value={"hermes_action": "approve_always", "approval_id": 4},
token="tok_alw",
)
with (
patch.object(
adapter, "_resolve_sender_profile", new_callable=AsyncMock,
return_value={"user_id": "ou_u", "user_name": "Carol", "user_id_alt": None},
),
patch.object(adapter, "_update_approval_card", new_callable=AsyncMock),
patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
):
await adapter._handle_card_action_event(data)
mock_resolve.assert_called_once_with("sess-4", "always")
@pytest.mark.asyncio
async def test_already_resolved_drops_silently(self):
adapter = _make_adapter()
# No state for approval_id 99 — already resolved
data = _make_card_action_data(
action_value={"hermes_action": "approve_once", "approval_id": 99},
token="tok_gone",
)
with patch("tools.approval.resolve_gateway_approval") as mock_resolve:
await adapter._handle_card_action_event(data)
# Should NOT resolve — already handled
mock_resolve.assert_not_called()
@pytest.mark.asyncio
async def test_non_approval_actions_route_normally(self):
"""Non-approval card actions should still become synthetic commands."""
adapter = _make_adapter()
data = _make_card_action_data(
action_value={"custom_action": "something_else"},
token="tok_normal",
)
with (
patch.object(
adapter, "_resolve_sender_profile", new_callable=AsyncMock,
return_value={"user_id": "ou_u", "user_name": "Dave", "user_id_alt": None},
),
patch.object(adapter, "get_chat_info", new_callable=AsyncMock, return_value={"name": "Test Chat"}),
patch.object(adapter, "_handle_message_with_guards", new_callable=AsyncMock) as mock_handle,
patch("tools.approval.resolve_gateway_approval") as mock_resolve,
):
await adapter._handle_card_action_event(data)
# Should NOT resolve any approval
mock_resolve.assert_not_called()
# Should have routed as synthetic command
mock_handle.assert_called_once()
event = mock_handle.call_args[0][0]
assert "/card button" in event.text
# ===========================================================================
# _update_approval_card — card replacement after resolution
# ===========================================================================
class TestFeishuUpdateApprovalCard:
"""Test the card update after approval resolution."""
@pytest.mark.asyncio
async def test_updates_card_on_approve(self):
adapter = _make_adapter()
mock_update = AsyncMock()
adapter._client.im.v1.message.update = MagicMock()
with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
await adapter._update_approval_card(
"msg_001", "Approved once", "Norbert", "once"
)
mock_thread.assert_called_once()
# Verify the update request was built
call_args = mock_thread.call_args
assert call_args[0][0] == adapter._client.im.v1.message.update
@pytest.mark.asyncio
async def test_updates_card_on_deny(self):
adapter = _make_adapter()
with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
await adapter._update_approval_card(
"msg_002", "Denied", "Alice", "deny"
)
mock_thread.assert_called_once()
@pytest.mark.asyncio
async def test_skips_update_when_not_connected(self):
adapter = _make_adapter()
adapter._client = None
with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
await adapter._update_approval_card(
"msg_001", "Approved", "Bob", "once"
)
mock_thread.assert_not_called()
@pytest.mark.asyncio
async def test_skips_update_when_no_message_id(self):
adapter = _make_adapter()
with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
await adapter._update_approval_card(
"", "Approved", "Bob", "once"
)
mock_thread.assert_not_called()
@pytest.mark.asyncio
async def test_swallows_update_errors(self):
adapter = _make_adapter()
with patch("asyncio.to_thread", new_callable=AsyncMock, side_effect=Exception("API error")):
# Should not raise
await adapter._update_approval_card(
"msg_001", "Approved", "Bob", "once"
)
+52
View File
@@ -87,6 +87,7 @@ class TestReasoningCommand:
)
monkeypatch.setattr(gateway_run, "_hermes_home", hermes_home)
monkeypatch.delenv("HERMES_REASONING_EFFORT", raising=False)
runner = _make_runner()
runner._reasoning_config = {"enabled": True, "effort": "xhigh"}
@@ -107,6 +108,7 @@ class TestReasoningCommand:
config_path.write_text("agent:\n reasoning_effort: medium\n", encoding="utf-8")
monkeypatch.setattr(gateway_run, "_hermes_home", hermes_home)
monkeypatch.delenv("HERMES_REASONING_EFFORT", raising=False)
runner = _make_runner()
runner._reasoning_config = {"enabled": True, "effort": "medium"}
@@ -136,6 +138,7 @@ class TestReasoningCommand:
"api_key": "test-key",
},
)
monkeypatch.delenv("HERMES_REASONING_EFFORT", raising=False)
fake_run_agent = types.ModuleType("run_agent")
fake_run_agent.AIAgent = _CapturingAgent
monkeypatch.setitem(sys.modules, "run_agent", fake_run_agent)
@@ -167,6 +170,55 @@ class TestReasoningCommand:
assert _CapturingAgent.last_init is not None
assert _CapturingAgent.last_init["reasoning_config"] == {"enabled": True, "effort": "low"}
def test_run_agent_prefers_config_over_stale_reasoning_env(self, tmp_path, monkeypatch):
hermes_home = tmp_path / "hermes"
hermes_home.mkdir()
(hermes_home / "config.yaml").write_text("agent:\n reasoning_effort: none\n", encoding="utf-8")
monkeypatch.setattr(gateway_run, "_hermes_home", hermes_home)
monkeypatch.setattr(gateway_run, "_env_path", hermes_home / ".env")
monkeypatch.setattr(gateway_run, "load_dotenv", lambda *args, **kwargs: None)
monkeypatch.setattr(
gateway_run,
"_resolve_runtime_agent_kwargs",
lambda: {
"provider": "openrouter",
"api_mode": "chat_completions",
"base_url": "https://openrouter.ai/api/v1",
"api_key": "test-key",
},
)
monkeypatch.setenv("HERMES_REASONING_EFFORT", "low")
fake_run_agent = types.ModuleType("run_agent")
fake_run_agent.AIAgent = _CapturingAgent
monkeypatch.setitem(sys.modules, "run_agent", fake_run_agent)
_CapturingAgent.last_init = None
runner = _make_runner()
source = SessionSource(
platform=Platform.LOCAL,
chat_id="cli",
chat_name="CLI",
chat_type="dm",
user_id="user-1",
)
result = asyncio.run(
runner._run_agent(
message="ping",
context_prompt="",
history=[],
source=source,
session_id="session-1",
session_key="agent:main:local:dm",
)
)
assert result["final_response"] == "ok"
assert _CapturingAgent.last_init is not None
assert _CapturingAgent.last_init["reasoning_config"] == {"enabled": False}
def test_run_agent_includes_enabled_mcp_servers_in_gateway_toolsets(self, tmp_path, monkeypatch):
hermes_home = tmp_path / "hermes"
hermes_home.mkdir()
@@ -1,158 +0,0 @@
"""Tests that on_session_finalize and on_session_reset plugin hooks fire in the gateway."""
from datetime import datetime
from types import SimpleNamespace
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
from gateway.config import GatewayConfig, Platform, PlatformConfig
from gateway.platforms.base import MessageEvent
from gateway.session import SessionEntry, SessionSource, build_session_key
def _make_source() -> SessionSource:
return SessionSource(
platform=Platform.TELEGRAM,
user_id="u1",
chat_id="c1",
user_name="tester",
chat_type="dm",
)
def _make_event(text: str) -> MessageEvent:
return MessageEvent(text=text, source=_make_source(), message_id="m1")
def _make_runner():
from gateway.run import GatewayRunner
runner = object.__new__(GatewayRunner)
runner.config = GatewayConfig(
platforms={Platform.TELEGRAM: PlatformConfig(enabled=True, token="***")}
)
adapter = MagicMock()
adapter.send = AsyncMock()
runner.adapters = {Platform.TELEGRAM: adapter}
runner._voice_mode = {}
runner.hooks = SimpleNamespace(emit=AsyncMock(), loaded_hooks=False)
runner._session_model_overrides = {}
runner._pending_model_notes = {}
runner._background_tasks = set()
session_key = build_session_key(_make_source())
session_entry = SessionEntry(
session_key=session_key,
session_id="sess-old",
created_at=datetime.now(),
updated_at=datetime.now(),
platform=Platform.TELEGRAM,
chat_type="dm",
)
new_session_entry = SessionEntry(
session_key=session_key,
session_id="sess-new",
created_at=datetime.now(),
updated_at=datetime.now(),
platform=Platform.TELEGRAM,
chat_type="dm",
)
runner.session_store = MagicMock()
runner.session_store.get_or_create_session.return_value = new_session_entry
runner.session_store.reset_session.return_value = new_session_entry
runner.session_store._entries = {session_key: session_entry}
runner.session_store._generate_session_key.return_value = session_key
runner._running_agents = {}
runner._pending_messages = {}
runner._pending_approvals = {}
runner._session_db = None
runner._agent_cache_lock = None
runner._is_user_authorized = lambda _source: True
runner._format_session_info = lambda: ""
return runner
@pytest.mark.asyncio
@patch("hermes_cli.plugins.invoke_hook")
async def test_reset_fires_finalize_hook(mock_invoke_hook):
"""/new must fire on_session_finalize with the OLD session id."""
runner = _make_runner()
await runner._handle_reset_command(_make_event("/new"))
mock_invoke_hook.assert_any_call(
"on_session_finalize", session_id="sess-old", platform="telegram"
)
@pytest.mark.asyncio
@patch("hermes_cli.plugins.invoke_hook")
async def test_reset_fires_reset_hook(mock_invoke_hook):
"""/new must fire on_session_reset with the NEW session id."""
runner = _make_runner()
await runner._handle_reset_command(_make_event("/new"))
mock_invoke_hook.assert_any_call(
"on_session_reset", session_id="sess-new", platform="telegram"
)
@pytest.mark.asyncio
@patch("hermes_cli.plugins.invoke_hook")
async def test_finalize_before_reset(mock_invoke_hook):
"""on_session_finalize must fire before on_session_reset."""
runner = _make_runner()
await runner._handle_reset_command(_make_event("/new"))
calls = [c for c in mock_invoke_hook.call_args_list
if c[0][0] in ("on_session_finalize", "on_session_reset")]
hook_names = [c[0][0] for c in calls]
assert hook_names == ["on_session_finalize", "on_session_reset"]
@pytest.mark.asyncio
@patch("hermes_cli.plugins.invoke_hook")
async def test_shutdown_fires_finalize_for_active_agents(mock_invoke_hook):
"""Gateway stop() must fire on_session_finalize for each active agent."""
from gateway.run import GatewayRunner
runner = object.__new__(GatewayRunner)
runner._running = True
runner._background_tasks = set()
runner._pending_messages = {}
runner._pending_approvals = {}
runner._shutdown_event = MagicMock()
runner.adapters = {}
runner._exit_reason = "test"
agent1 = MagicMock()
agent1.session_id = "sess-a"
agent2 = MagicMock()
agent2.session_id = "sess-b"
runner._running_agents = {"key-a": agent1, "key-b": agent2}
with patch("gateway.status.remove_pid_file"), \
patch("gateway.status.write_runtime_status"):
await runner.stop()
finalize_calls = [
c for c in mock_invoke_hook.call_args_list
if c[0][0] == "on_session_finalize"
]
session_ids = {c[1]["session_id"] for c in finalize_calls}
assert session_ids == {"sess-a", "sess-b"}
@pytest.mark.asyncio
@patch("hermes_cli.plugins.invoke_hook", side_effect=Exception("boom"))
async def test_hook_error_does_not_break_reset(mock_invoke_hook):
"""Plugin hook errors must not prevent /new from completing."""
runner = _make_runner()
result = await runner._handle_reset_command(_make_event("/new"))
# Should still return a success message despite hook errors
assert "Session reset" in result or "New session" in result
-63
View File
@@ -707,66 +707,3 @@ class TestSignalSendDocumentViaHelper:
assert result.success is False
assert "/nonexistent.pdf" in result.error
# ---------------------------------------------------------------------------
# send() returns message_id from timestamp (#4647)
# ---------------------------------------------------------------------------
class TestSignalSendReturnsMessageId:
"""Signal send() must return a timestamp-based message_id so the stream
consumer can follow its editfallback path correctly."""
@pytest.mark.asyncio
async def test_send_returns_timestamp_as_message_id(self, monkeypatch):
adapter = _make_signal_adapter(monkeypatch)
mock_rpc, _ = _stub_rpc({"timestamp": 1712345678000})
adapter._rpc = mock_rpc
adapter._stop_typing_indicator = AsyncMock()
result = await adapter.send(chat_id="+155****4567", content="hello")
assert result.success is True
assert result.message_id == "1712345678000"
@pytest.mark.asyncio
async def test_send_returns_none_message_id_when_no_timestamp(self, monkeypatch):
adapter = _make_signal_adapter(monkeypatch)
mock_rpc, _ = _stub_rpc({}) # No timestamp key
adapter._rpc = mock_rpc
adapter._stop_typing_indicator = AsyncMock()
result = await adapter.send(chat_id="+155****4567", content="hello")
assert result.success is True
assert result.message_id is None
@pytest.mark.asyncio
async def test_send_returns_none_message_id_for_non_dict(self, monkeypatch):
adapter = _make_signal_adapter(monkeypatch)
mock_rpc, _ = _stub_rpc("ok") # Non-dict result
adapter._rpc = mock_rpc
adapter._stop_typing_indicator = AsyncMock()
result = await adapter.send(chat_id="+155****4567", content="hello")
assert result.success is True
assert result.message_id is None
# ---------------------------------------------------------------------------
# stop_typing() delegates to _stop_typing_indicator (#4647)
# ---------------------------------------------------------------------------
class TestSignalStopTyping:
"""Signal must expose a public stop_typing() so base adapter's
_keep_typing finally block can clean up platform-level typing tasks."""
@pytest.mark.asyncio
async def test_stop_typing_calls_private_method(self, monkeypatch):
adapter = _make_signal_adapter(monkeypatch)
adapter._stop_typing_indicator = AsyncMock()
await adapter.stop_typing("+155****4567")
adapter._stop_typing_indicator.assert_awaited_once_with("+155****4567")
-142
View File
@@ -324,145 +324,3 @@ class TestSegmentBreakOnToolBoundary:
await consumer.run()
assert consumer.already_sent
@pytest.mark.asyncio
async def test_edit_failure_sends_only_unsent_tail_at_finish(self):
"""If an edit fails mid-stream, send only the missing tail once at finish."""
adapter = MagicMock()
send_results = [
SimpleNamespace(success=True, message_id="msg_1"),
SimpleNamespace(success=True, message_id="msg_2"),
]
adapter.send = AsyncMock(side_effect=send_results)
adapter.edit_message = AsyncMock(return_value=SimpleNamespace(success=False, error="flood_control:6"))
adapter.MAX_MESSAGE_LENGTH = 4096
config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5, cursor="")
consumer = GatewayStreamConsumer(adapter, "chat_123", config)
consumer.on_delta("Hello")
task = asyncio.create_task(consumer.run())
await asyncio.sleep(0.08)
consumer.on_delta(" world")
await asyncio.sleep(0.08)
consumer.finish()
await task
assert adapter.send.call_count == 2
first_text = adapter.send.call_args_list[0][1]["content"]
second_text = adapter.send.call_args_list[1][1]["content"]
assert "Hello" in first_text
assert second_text.strip() == "world"
assert consumer.already_sent
@pytest.mark.asyncio
async def test_segment_break_clears_failed_edit_fallback_state(self):
"""A tool boundary after edit failure must not duplicate the next segment."""
adapter = MagicMock()
send_results = [
SimpleNamespace(success=True, message_id="msg_1"),
SimpleNamespace(success=True, message_id="msg_2"),
]
adapter.send = AsyncMock(side_effect=send_results)
adapter.edit_message = AsyncMock(return_value=SimpleNamespace(success=False, error="flood_control:6"))
adapter.MAX_MESSAGE_LENGTH = 4096
config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5, cursor="")
consumer = GatewayStreamConsumer(adapter, "chat_123", config)
consumer.on_delta("Hello")
task = asyncio.create_task(consumer.run())
await asyncio.sleep(0.08)
consumer.on_delta(" world")
await asyncio.sleep(0.08)
consumer.on_delta(None)
consumer.on_delta("Next segment")
consumer.finish()
await task
sent_texts = [call[1]["content"] for call in adapter.send.call_args_list]
assert sent_texts == ["Hello ▉", "Next segment"]
@pytest.mark.asyncio
async def test_no_message_id_enters_fallback_mode(self):
"""Platform returns success but no message_id (Signal) — must not
re-send on every delta. Should enter fallback mode and send only
the continuation at finish."""
adapter = MagicMock()
# First send succeeds but returns no message_id (Signal behavior)
send_result_no_id = SimpleNamespace(success=True, message_id=None)
# Fallback final send succeeds
send_result_final = SimpleNamespace(success=True, message_id="msg_final")
adapter.send = AsyncMock(side_effect=[send_result_no_id, send_result_final])
adapter.edit_message = AsyncMock(return_value=SimpleNamespace(success=True))
adapter.MAX_MESSAGE_LENGTH = 4096
config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5)
consumer = GatewayStreamConsumer(adapter, "chat_123", config)
consumer.on_delta("Hello")
task = asyncio.create_task(consumer.run())
await asyncio.sleep(0.08)
consumer.on_delta(" world, this is a longer response.")
await asyncio.sleep(0.08)
consumer.finish()
await task
# Should send exactly 2 messages: initial chunk + fallback continuation
# NOT one message per delta
assert adapter.send.call_count == 2
assert consumer.already_sent
# edit_message should NOT have been called (no valid message_id to edit)
adapter.edit_message.assert_not_called()
@pytest.mark.asyncio
async def test_no_message_id_single_delta_marks_already_sent(self):
"""When the entire response fits in one delta and platform returns no
message_id, already_sent must still be True to prevent the gateway
from re-sending the full response."""
adapter = MagicMock()
send_result = SimpleNamespace(success=True, message_id=None)
adapter.send = AsyncMock(return_value=send_result)
adapter.MAX_MESSAGE_LENGTH = 4096
config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5)
consumer = GatewayStreamConsumer(adapter, "chat_123", config)
consumer.on_delta("Short response.")
consumer.finish()
await consumer.run()
assert consumer.already_sent
# Only one send call (the initial message)
assert adapter.send.call_count == 1
@pytest.mark.asyncio
async def test_fallback_final_splits_long_continuation_without_dropping_text(self):
"""Long continuation tails should be chunked when fallback final-send runs."""
adapter = MagicMock()
adapter.send = AsyncMock(side_effect=[
SimpleNamespace(success=True, message_id="msg_1"),
SimpleNamespace(success=True, message_id="msg_2"),
SimpleNamespace(success=True, message_id="msg_3"),
])
adapter.edit_message = AsyncMock(return_value=SimpleNamespace(success=False, error="flood_control:6"))
adapter.MAX_MESSAGE_LENGTH = 610
config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5, cursor="")
consumer = GatewayStreamConsumer(adapter, "chat_123", config)
prefix = "abc"
tail = "x" * 620
consumer.on_delta(prefix)
task = asyncio.create_task(consumer.run())
await asyncio.sleep(0.08)
consumer.on_delta(tail)
await asyncio.sleep(0.08)
consumer.finish()
await task
sent_texts = [call[1]["content"] for call in adapter.send.call_args_list]
assert len(sent_texts) == 3
assert sent_texts[0].startswith(prefix)
assert sum(len(t) for t in sent_texts[1:]) == len(tail)
-399
View File
@@ -1,399 +0,0 @@
"""Tests for Qwen OAuth provider authentication (hermes_cli/auth.py).
Covers: _qwen_cli_auth_path, _read_qwen_cli_tokens, _save_qwen_cli_tokens,
_qwen_access_token_is_expiring, _refresh_qwen_cli_tokens,
resolve_qwen_runtime_credentials, get_qwen_auth_status.
"""
import json
import os
import stat
import time
from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
from hermes_cli.auth import (
AuthError,
DEFAULT_QWEN_BASE_URL,
QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS,
_qwen_cli_auth_path,
_read_qwen_cli_tokens,
_save_qwen_cli_tokens,
_qwen_access_token_is_expiring,
_refresh_qwen_cli_tokens,
resolve_qwen_runtime_credentials,
get_qwen_auth_status,
)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _make_qwen_tokens(
access_token="test-access-token",
refresh_token="test-refresh-token",
expiry_date=None,
**extra,
):
"""Create a minimal Qwen CLI OAuth credential dict."""
if expiry_date is None:
# 1 hour from now in milliseconds
expiry_date = int((time.time() + 3600) * 1000)
data = {
"access_token": access_token,
"refresh_token": refresh_token,
"token_type": "Bearer",
"expiry_date": expiry_date,
"resource_url": "portal.qwen.ai",
}
data.update(extra)
return data
def _write_qwen_creds(tmp_path, tokens=None):
"""Write tokens to the Qwen CLI credentials file and return the path."""
qwen_dir = tmp_path / ".qwen"
qwen_dir.mkdir(parents=True, exist_ok=True)
creds_path = qwen_dir / "oauth_creds.json"
if tokens is None:
tokens = _make_qwen_tokens()
creds_path.write_text(json.dumps(tokens), encoding="utf-8")
return creds_path
@pytest.fixture()
def qwen_env(tmp_path, monkeypatch):
"""Redirect _qwen_cli_auth_path to tmp_path/.qwen/oauth_creds.json."""
creds_path = tmp_path / ".qwen" / "oauth_creds.json"
monkeypatch.setattr(
"hermes_cli.auth._qwen_cli_auth_path", lambda: creds_path
)
return tmp_path
# ---------------------------------------------------------------------------
# _qwen_cli_auth_path
# ---------------------------------------------------------------------------
def test_qwen_cli_auth_path_returns_expected_location():
path = _qwen_cli_auth_path()
assert path == Path.home() / ".qwen" / "oauth_creds.json"
# ---------------------------------------------------------------------------
# _read_qwen_cli_tokens
# ---------------------------------------------------------------------------
def test_read_qwen_cli_tokens_success(qwen_env):
tokens = _make_qwen_tokens(access_token="my-access")
_write_qwen_creds(qwen_env, tokens)
result = _read_qwen_cli_tokens()
assert result["access_token"] == "my-access"
assert result["refresh_token"] == "test-refresh-token"
def test_read_qwen_cli_tokens_missing_file(qwen_env):
with pytest.raises(AuthError) as exc:
_read_qwen_cli_tokens()
assert exc.value.code == "qwen_auth_missing"
def test_read_qwen_cli_tokens_invalid_json(qwen_env):
creds_path = qwen_env / ".qwen" / "oauth_creds.json"
creds_path.parent.mkdir(parents=True, exist_ok=True)
creds_path.write_text("not json{{{", encoding="utf-8")
with pytest.raises(AuthError) as exc:
_read_qwen_cli_tokens()
assert exc.value.code == "qwen_auth_read_failed"
def test_read_qwen_cli_tokens_non_dict(qwen_env):
creds_path = qwen_env / ".qwen" / "oauth_creds.json"
creds_path.parent.mkdir(parents=True, exist_ok=True)
creds_path.write_text(json.dumps(["a", "b"]), encoding="utf-8")
with pytest.raises(AuthError) as exc:
_read_qwen_cli_tokens()
assert exc.value.code == "qwen_auth_invalid"
# ---------------------------------------------------------------------------
# _save_qwen_cli_tokens
# ---------------------------------------------------------------------------
def test_save_qwen_cli_tokens_roundtrip(qwen_env):
tokens = _make_qwen_tokens(access_token="saved-token")
saved_path = _save_qwen_cli_tokens(tokens)
assert saved_path.exists()
loaded = json.loads(saved_path.read_text(encoding="utf-8"))
assert loaded["access_token"] == "saved-token"
def test_save_qwen_cli_tokens_creates_parent(qwen_env):
tokens = _make_qwen_tokens()
saved_path = _save_qwen_cli_tokens(tokens)
assert saved_path.parent.exists()
def test_save_qwen_cli_tokens_permissions(qwen_env):
tokens = _make_qwen_tokens()
saved_path = _save_qwen_cli_tokens(tokens)
mode = saved_path.stat().st_mode
assert mode & stat.S_IRUSR # owner read
assert mode & stat.S_IWUSR # owner write
assert not (mode & stat.S_IRGRP) # no group read
assert not (mode & stat.S_IROTH) # no other read
# ---------------------------------------------------------------------------
# _qwen_access_token_is_expiring
# ---------------------------------------------------------------------------
def test_expiring_token_not_expired():
# 1 hour from now in milliseconds
future_ms = int((time.time() + 3600) * 1000)
assert not _qwen_access_token_is_expiring(future_ms)
def test_expiring_token_already_expired():
# 1 hour ago in milliseconds
past_ms = int((time.time() - 3600) * 1000)
assert _qwen_access_token_is_expiring(past_ms)
def test_expiring_token_within_skew():
# Just inside the default skew window
near_ms = int((time.time() + QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS - 5) * 1000)
assert _qwen_access_token_is_expiring(near_ms)
def test_expiring_token_none_returns_true():
assert _qwen_access_token_is_expiring(None)
def test_expiring_token_non_numeric_returns_true():
assert _qwen_access_token_is_expiring("not-a-number")
# ---------------------------------------------------------------------------
# _refresh_qwen_cli_tokens
# ---------------------------------------------------------------------------
def test_refresh_qwen_cli_tokens_success(qwen_env):
tokens = _make_qwen_tokens(refresh_token="old-refresh")
resp = MagicMock()
resp.status_code = 200
resp.json.return_value = {
"access_token": "new-access",
"refresh_token": "new-refresh",
"expires_in": 7200,
}
with patch("hermes_cli.auth.httpx") as mock_httpx:
mock_httpx.post.return_value = resp
result = _refresh_qwen_cli_tokens(tokens)
assert result["access_token"] == "new-access"
assert result["refresh_token"] == "new-refresh"
assert "expiry_date" in result
def test_refresh_qwen_cli_tokens_preserves_old_refresh_if_not_in_response(qwen_env):
tokens = _make_qwen_tokens(refresh_token="keep-me")
resp = MagicMock()
resp.status_code = 200
resp.json.return_value = {
"access_token": "new-access",
# No refresh_token in response — should keep old one
"expires_in": 3600,
}
with patch("hermes_cli.auth.httpx") as mock_httpx:
mock_httpx.post.return_value = resp
result = _refresh_qwen_cli_tokens(tokens)
assert result["refresh_token"] == "keep-me"
def test_refresh_qwen_cli_tokens_missing_refresh_token():
tokens = {"access_token": "at", "refresh_token": ""}
with pytest.raises(AuthError) as exc:
_refresh_qwen_cli_tokens(tokens)
assert exc.value.code == "qwen_refresh_token_missing"
def test_refresh_qwen_cli_tokens_http_error(qwen_env):
tokens = _make_qwen_tokens()
resp = MagicMock()
resp.status_code = 401
resp.text = "unauthorized"
with patch("hermes_cli.auth.httpx") as mock_httpx:
mock_httpx.post.return_value = resp
with pytest.raises(AuthError) as exc:
_refresh_qwen_cli_tokens(tokens)
assert exc.value.code == "qwen_refresh_failed"
def test_refresh_qwen_cli_tokens_network_error(qwen_env):
tokens = _make_qwen_tokens()
with patch("hermes_cli.auth.httpx") as mock_httpx:
mock_httpx.post.side_effect = ConnectionError("timeout")
with pytest.raises(AuthError) as exc:
_refresh_qwen_cli_tokens(tokens)
assert exc.value.code == "qwen_refresh_failed"
def test_refresh_qwen_cli_tokens_invalid_json_response(qwen_env):
tokens = _make_qwen_tokens()
resp = MagicMock()
resp.status_code = 200
resp.json.side_effect = ValueError("bad json")
with patch("hermes_cli.auth.httpx") as mock_httpx:
mock_httpx.post.return_value = resp
with pytest.raises(AuthError) as exc:
_refresh_qwen_cli_tokens(tokens)
assert exc.value.code == "qwen_refresh_invalid_json"
def test_refresh_qwen_cli_tokens_missing_access_token_in_response(qwen_env):
tokens = _make_qwen_tokens()
resp = MagicMock()
resp.status_code = 200
resp.json.return_value = {"something": "but no access_token"}
with patch("hermes_cli.auth.httpx") as mock_httpx:
mock_httpx.post.return_value = resp
with pytest.raises(AuthError) as exc:
_refresh_qwen_cli_tokens(tokens)
assert exc.value.code == "qwen_refresh_invalid_response"
def test_refresh_qwen_cli_tokens_default_expires_in(qwen_env):
"""When expires_in is missing, default to 6 hours."""
tokens = _make_qwen_tokens()
resp = MagicMock()
resp.status_code = 200
resp.json.return_value = {"access_token": "new"}
with patch("hermes_cli.auth.httpx") as mock_httpx:
mock_httpx.post.return_value = resp
result = _refresh_qwen_cli_tokens(tokens)
# Verify expiry_date is roughly now + 6h (within 60s tolerance)
expected_ms = int(time.time() * 1000) + 6 * 60 * 60 * 1000
assert abs(result["expiry_date"] - expected_ms) < 60_000
def test_refresh_qwen_cli_tokens_saves_to_disk(qwen_env):
tokens = _make_qwen_tokens()
resp = MagicMock()
resp.status_code = 200
resp.json.return_value = {
"access_token": "disk-check",
"expires_in": 3600,
}
with patch("hermes_cli.auth.httpx") as mock_httpx:
mock_httpx.post.return_value = resp
_refresh_qwen_cli_tokens(tokens)
# Verify it was persisted
creds_path = qwen_env / ".qwen" / "oauth_creds.json"
assert creds_path.exists()
saved = json.loads(creds_path.read_text(encoding="utf-8"))
assert saved["access_token"] == "disk-check"
# ---------------------------------------------------------------------------
# resolve_qwen_runtime_credentials
# ---------------------------------------------------------------------------
def test_resolve_qwen_runtime_credentials_fresh_token(qwen_env):
tokens = _make_qwen_tokens(access_token="fresh-at")
_write_qwen_creds(qwen_env, tokens)
creds = resolve_qwen_runtime_credentials(refresh_if_expiring=False)
assert creds["provider"] == "qwen-oauth"
assert creds["api_key"] == "fresh-at"
assert creds["base_url"] == DEFAULT_QWEN_BASE_URL
assert creds["source"] == "qwen-cli"
def test_resolve_qwen_runtime_credentials_triggers_refresh(qwen_env):
# Write an expired token
expired_ms = int((time.time() - 3600) * 1000)
tokens = _make_qwen_tokens(access_token="old", expiry_date=expired_ms)
_write_qwen_creds(qwen_env, tokens)
refreshed = _make_qwen_tokens(access_token="refreshed-at")
with patch(
"hermes_cli.auth._refresh_qwen_cli_tokens", return_value=refreshed
) as mock_refresh:
creds = resolve_qwen_runtime_credentials()
mock_refresh.assert_called_once()
assert creds["api_key"] == "refreshed-at"
def test_resolve_qwen_runtime_credentials_force_refresh(qwen_env):
tokens = _make_qwen_tokens(access_token="old-at")
_write_qwen_creds(qwen_env, tokens)
refreshed = _make_qwen_tokens(access_token="force-refreshed")
with patch(
"hermes_cli.auth._refresh_qwen_cli_tokens", return_value=refreshed
) as mock_refresh:
creds = resolve_qwen_runtime_credentials(force_refresh=True)
mock_refresh.assert_called_once()
assert creds["api_key"] == "force-refreshed"
def test_resolve_qwen_runtime_credentials_missing_access_token(qwen_env):
tokens = _make_qwen_tokens(access_token="")
_write_qwen_creds(qwen_env, tokens)
with pytest.raises(AuthError) as exc:
resolve_qwen_runtime_credentials(refresh_if_expiring=False)
assert exc.value.code == "qwen_access_token_missing"
def test_resolve_qwen_runtime_credentials_base_url_env_override(qwen_env, monkeypatch):
tokens = _make_qwen_tokens(access_token="at")
_write_qwen_creds(qwen_env, tokens)
monkeypatch.setenv("HERMES_QWEN_BASE_URL", "https://custom.qwen.ai/v1")
creds = resolve_qwen_runtime_credentials(refresh_if_expiring=False)
assert creds["base_url"] == "https://custom.qwen.ai/v1"
# ---------------------------------------------------------------------------
# get_qwen_auth_status
# ---------------------------------------------------------------------------
def test_get_qwen_auth_status_logged_in(qwen_env):
tokens = _make_qwen_tokens(access_token="status-at")
_write_qwen_creds(qwen_env, tokens)
status = get_qwen_auth_status()
assert status["logged_in"] is True
assert status["api_key"] == "status-at"
def test_get_qwen_auth_status_not_logged_in(qwen_env):
# No credentials file
status = get_qwen_auth_status()
assert status["logged_in"] is False
assert "error" in status
-70
View File
@@ -136,73 +136,3 @@ def test_check_gateway_service_linger_skips_when_service_not_installed(monkeypat
out = capsys.readouterr().out
assert out == ""
assert issues == []
# ── Memory provider section (doctor should only check the *active* provider) ──
class TestDoctorMemoryProviderSection:
"""The ◆ Memory Provider section should respect memory.provider config."""
def _make_hermes_home(self, tmp_path, provider=""):
"""Create a minimal HERMES_HOME with config.yaml."""
home = tmp_path / ".hermes"
home.mkdir(parents=True, exist_ok=True)
import yaml
config = {"memory": {"provider": provider}} if provider else {"memory": {}}
(home / "config.yaml").write_text(yaml.dump(config))
return home
def _run_doctor_and_capture(self, monkeypatch, tmp_path, provider=""):
"""Run doctor and capture stdout."""
home = self._make_hermes_home(tmp_path, provider)
monkeypatch.setattr(doctor_mod, "HERMES_HOME", home)
monkeypatch.setattr(doctor_mod, "PROJECT_ROOT", tmp_path / "project")
monkeypatch.setattr(doctor_mod, "_DHH", str(home))
(tmp_path / "project").mkdir(exist_ok=True)
# Stub tool availability (returns empty) so doctor runs past it
fake_model_tools = types.SimpleNamespace(
check_tool_availability=lambda *a, **kw: ([], []),
TOOLSET_REQUIREMENTS={},
)
monkeypatch.setitem(sys.modules, "model_tools", fake_model_tools)
# Stub auth checks to avoid real API calls
try:
from hermes_cli import auth as _auth_mod
monkeypatch.setattr(_auth_mod, "get_nous_auth_status", lambda: {})
monkeypatch.setattr(_auth_mod, "get_codex_auth_status", lambda: {})
except Exception:
pass
import io, contextlib
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
doctor_mod.run_doctor(Namespace(fix=False))
return buf.getvalue()
def test_no_provider_shows_builtin_ok(self, monkeypatch, tmp_path):
out = self._run_doctor_and_capture(monkeypatch, tmp_path, provider="")
assert "Memory Provider" in out
assert "Built-in memory active" in out
# Should NOT mention Honcho or Mem0 errors
assert "Honcho API key" not in out
assert "Mem0" not in out
def test_honcho_provider_not_installed_shows_fail(self, monkeypatch, tmp_path):
# Make honcho import fail
monkeypatch.setitem(
sys.modules, "plugins.memory.honcho.client", None
)
out = self._run_doctor_and_capture(monkeypatch, tmp_path, provider="honcho")
assert "Memory Provider" in out
# Should show failure since honcho is set but not importable
assert "Built-in memory active" not in out
def test_mem0_provider_not_installed_shows_fail(self, monkeypatch, tmp_path):
# Make mem0 import fail
monkeypatch.setitem(sys.modules, "plugins.memory.mem0", None)
out = self._run_doctor_and_capture(monkeypatch, tmp_path, provider="mem0")
assert "Memory Provider" in out
assert "Built-in memory active" not in out
@@ -143,82 +143,6 @@ def test_resolve_runtime_provider_codex(monkeypatch):
assert resolved["requested_provider"] == "openai-codex"
def test_resolve_runtime_provider_qwen_oauth(monkeypatch):
monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "qwen-oauth")
monkeypatch.setattr(
rp,
"resolve_qwen_runtime_credentials",
lambda: {
"provider": "qwen-oauth",
"base_url": "https://portal.qwen.ai/v1",
"api_key": "qwen-token",
"source": "qwen-cli",
"expires_at_ms": 1775640710946,
},
)
resolved = rp.resolve_runtime_provider(requested="qwen-oauth")
assert resolved["provider"] == "qwen-oauth"
assert resolved["api_mode"] == "chat_completions"
assert resolved["base_url"] == "https://portal.qwen.ai/v1"
assert resolved["api_key"] == "qwen-token"
assert resolved["requested_provider"] == "qwen-oauth"
def test_resolve_runtime_provider_uses_qwen_pool_entry(monkeypatch):
class _Entry:
access_token = "pool-qwen-token"
source = "manual:qwen_cli"
base_url = "https://portal.qwen.ai/v1"
class _Pool:
def has_credentials(self):
return True
def select(self):
return _Entry()
monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "qwen-oauth")
monkeypatch.setattr(rp, "load_pool", lambda provider: _Pool())
monkeypatch.setattr(rp, "_get_model_config", lambda: {"provider": "qwen-oauth", "default": "coder-model"})
resolved = rp.resolve_runtime_provider(requested="qwen-oauth")
assert resolved["provider"] == "qwen-oauth"
assert resolved["api_mode"] == "chat_completions"
assert resolved["base_url"] == "https://portal.qwen.ai/v1"
assert resolved["api_key"] == "pool-qwen-token"
assert resolved["source"] == "manual:qwen_cli"
def test_resolve_provider_alias_qwen(monkeypatch):
monkeypatch.setattr(rp.auth_mod, "_load_auth_store", lambda: {})
monkeypatch.delenv("OPENAI_API_KEY", raising=False)
monkeypatch.delenv("OPENROUTER_API_KEY", raising=False)
assert rp.resolve_provider("qwen-portal") == "qwen-oauth"
assert rp.resolve_provider("qwen-cli") == "qwen-oauth"
def test_qwen_oauth_auto_fallthrough_on_auth_failure(monkeypatch):
"""When requested_provider is 'auto' and Qwen creds fail, fall through."""
from hermes_cli.auth import AuthError
monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "qwen-oauth")
monkeypatch.setattr(
rp,
"resolve_qwen_runtime_credentials",
lambda **kw: (_ for _ in ()).throw(AuthError("stale", provider="qwen-oauth", code="qwen_auth_missing")),
)
monkeypatch.setattr(rp, "_get_model_config", lambda: {})
monkeypatch.setenv("OPENROUTER_API_KEY", "test-or-key")
# Should NOT raise — falls through to OpenRouter
resolved = rp.resolve_runtime_provider(requested="auto")
# The fallthrough means it won't be qwen-oauth
assert resolved["provider"] != "qwen-oauth"
def test_resolve_runtime_provider_ai_gateway(monkeypatch):
monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "ai-gateway")
monkeypatch.setattr(rp, "_get_model_config", lambda: {})
@@ -884,55 +808,6 @@ def test_minimax_explicit_api_mode_respected(monkeypatch):
assert resolved["api_mode"] == "chat_completions"
def test_minimax_config_base_url_overrides_hardcoded_default(monkeypatch):
"""model.base_url in config.yaml should override the hardcoded default (#6039)."""
monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "minimax")
monkeypatch.setattr(rp, "_get_model_config", lambda: {
"provider": "minimax",
"base_url": "https://api.minimaxi.com/anthropic",
})
monkeypatch.setenv("MINIMAX_API_KEY", "test-minimax-key")
monkeypatch.delenv("MINIMAX_BASE_URL", raising=False)
resolved = rp.resolve_runtime_provider(requested="minimax")
assert resolved["provider"] == "minimax"
assert resolved["base_url"] == "https://api.minimaxi.com/anthropic"
assert resolved["api_mode"] == "anthropic_messages"
def test_minimax_env_base_url_still_wins_over_config(monkeypatch):
"""MINIMAX_BASE_URL env var should take priority over config.yaml model.base_url."""
monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "minimax")
monkeypatch.setattr(rp, "_get_model_config", lambda: {
"provider": "minimax",
"base_url": "https://api.minimaxi.com/anthropic",
})
monkeypatch.setenv("MINIMAX_API_KEY", "test-minimax-key")
monkeypatch.setenv("MINIMAX_BASE_URL", "https://custom.example.com/v1")
resolved = rp.resolve_runtime_provider(requested="minimax")
# Env var wins because resolve_api_key_provider_credentials prefers it
assert resolved["base_url"] == "https://custom.example.com/v1"
def test_minimax_config_base_url_ignored_for_different_provider(monkeypatch):
"""model.base_url should NOT be used when model.provider doesn't match."""
monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "minimax")
monkeypatch.setattr(rp, "_get_model_config", lambda: {
"provider": "openrouter",
"base_url": "https://some-other-endpoint.com/v1",
})
monkeypatch.setenv("MINIMAX_API_KEY", "test-minimax-key")
monkeypatch.delenv("MINIMAX_BASE_URL", raising=False)
resolved = rp.resolve_runtime_provider(requested="minimax")
# Should use the default, NOT the config base_url from a different provider
assert resolved["base_url"] == "https://api.minimax.io/anthropic"
def test_alibaba_default_coding_intl_endpoint_uses_chat_completions(monkeypatch):
"""Alibaba default coding-intl /v1 URL should use chat_completions mode."""
monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "alibaba")
@@ -34,8 +34,8 @@ class TestSetupProviderModelSelection:
@pytest.mark.parametrize("provider_id,expected_defaults", [
("zai", ["glm-5", "glm-4.7", "glm-4.5", "glm-4.5-flash"]),
("kimi-coding", ["kimi-k2.5", "kimi-k2-thinking", "kimi-k2-turbo-preview"]),
("minimax", ["MiniMax-M1", "MiniMax-M1-40k", "MiniMax-M1-80k", "MiniMax-M1-128k", "MiniMax-M1-256k", "MiniMax-M2.5", "MiniMax-M2.7"]),
("minimax-cn", ["MiniMax-M1", "MiniMax-M1-40k", "MiniMax-M1-80k", "MiniMax-M1-128k", "MiniMax-M1-256k", "MiniMax-M2.5", "MiniMax-M2.7"]),
("minimax", ["MiniMax-M2.7", "MiniMax-M2.7-highspeed", "MiniMax-M2.5", "MiniMax-M2.5-highspeed", "MiniMax-M2.1"]),
("minimax-cn", ["MiniMax-M2.7", "MiniMax-M2.7-highspeed", "MiniMax-M2.5", "MiniMax-M2.5-highspeed", "MiniMax-M2.1"]),
("opencode-zen", ["gpt-5.4", "gpt-5.3-codex", "claude-sonnet-4-6", "gemini-3-flash"]),
("opencode-go", ["glm-5", "kimi-k2.5", "minimax-m2.5", "minimax-m2.7"]),
])
+162
View File
@@ -0,0 +1,162 @@
"""Tests for _save_oversized_tool_result() — the large tool response handler.
When a tool returns more than _LARGE_RESULT_CHARS characters, the full content
is saved to a file and the model receives a preview + file path instead.
"""
import os
import re
import pytest
from run_agent import (
_save_oversized_tool_result,
_LARGE_RESULT_CHARS,
_LARGE_RESULT_PREVIEW_CHARS,
)
class TestSaveOversizedToolResult:
"""Unit tests for the large tool result handler."""
def test_small_result_returned_unchanged(self):
"""Results under the threshold pass through untouched."""
small = "x" * 1000
assert _save_oversized_tool_result("terminal", small) is small
def test_exactly_at_threshold_returned_unchanged(self):
"""Results exactly at the threshold pass through."""
exact = "y" * _LARGE_RESULT_CHARS
assert _save_oversized_tool_result("terminal", exact) is exact
def test_oversized_result_saved_to_file(self, tmp_path, monkeypatch):
"""Results over the threshold are written to a file."""
monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
os.makedirs(tmp_path / ".hermes", exist_ok=True)
big = "A" * (_LARGE_RESULT_CHARS + 500)
result = _save_oversized_tool_result("terminal", big)
# Should contain the preview
assert result.startswith("A" * _LARGE_RESULT_PREVIEW_CHARS)
# Should mention the file path
assert "Full output saved to:" in result
# Should mention original size
assert f"{len(big):,}" in result
# Extract the file path and verify the file exists with full content
match = re.search(r"Full output saved to: (.+?)\n", result)
assert match, f"No file path found in result: {result[:300]}"
filepath = match.group(1)
assert os.path.isfile(filepath)
with open(filepath, "r", encoding="utf-8") as f:
saved = f.read()
assert saved == big
assert len(saved) == _LARGE_RESULT_CHARS + 500
def test_file_placed_in_cache_tool_responses(self, tmp_path, monkeypatch):
"""Saved file lives under HERMES_HOME/cache/tool_responses/."""
hermes_home = str(tmp_path / ".hermes")
monkeypatch.setenv("HERMES_HOME", hermes_home)
os.makedirs(hermes_home, exist_ok=True)
big = "B" * (_LARGE_RESULT_CHARS + 1)
result = _save_oversized_tool_result("web_search", big)
match = re.search(r"Full output saved to: (.+?)\n", result)
filepath = match.group(1)
expected_dir = os.path.join(hermes_home, "cache", "tool_responses")
assert filepath.startswith(expected_dir)
def test_filename_contains_tool_name(self, tmp_path, monkeypatch):
"""The saved filename includes a sanitized version of the tool name."""
monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
os.makedirs(tmp_path / ".hermes", exist_ok=True)
big = "C" * (_LARGE_RESULT_CHARS + 1)
result = _save_oversized_tool_result("browser_navigate", big)
match = re.search(r"Full output saved to: (.+?)\n", result)
filename = os.path.basename(match.group(1))
assert filename.startswith("browser_navigate_")
assert filename.endswith(".txt")
def test_tool_name_sanitized(self, tmp_path, monkeypatch):
"""Special characters in tool names are replaced in the filename."""
monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
os.makedirs(tmp_path / ".hermes", exist_ok=True)
big = "D" * (_LARGE_RESULT_CHARS + 1)
result = _save_oversized_tool_result("mcp:some/weird tool", big)
match = re.search(r"Full output saved to: (.+?)\n", result)
filename = os.path.basename(match.group(1))
# No slashes or colons in filename
assert "/" not in filename
assert ":" not in filename
def test_fallback_on_write_failure(self, tmp_path, monkeypatch):
"""When file write fails, falls back to destructive truncation."""
# Point HERMES_HOME to a path that will fail (file, not directory)
bad_path = str(tmp_path / "not_a_dir.txt")
with open(bad_path, "w") as f:
f.write("I'm a file, not a directory")
monkeypatch.setenv("HERMES_HOME", bad_path)
big = "E" * (_LARGE_RESULT_CHARS + 50_000)
result = _save_oversized_tool_result("terminal", big)
# Should still contain data (fallback truncation)
assert len(result) > 0
assert result.startswith("E" * 1000)
# Should mention the failure
assert "File save failed" in result
# Should be truncated to approximately _LARGE_RESULT_CHARS + error msg
assert len(result) < len(big)
def test_preview_length_capped(self, tmp_path, monkeypatch):
"""The inline preview is capped at _LARGE_RESULT_PREVIEW_CHARS."""
monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
os.makedirs(tmp_path / ".hermes", exist_ok=True)
# Use distinct chars so we can measure the preview
big = "Z" * (_LARGE_RESULT_CHARS + 5000)
result = _save_oversized_tool_result("terminal", big)
# The preview section is the content before the "[Large tool response:" marker
marker_pos = result.index("[Large tool response:")
preview_section = result[:marker_pos].rstrip()
assert len(preview_section) == _LARGE_RESULT_PREVIEW_CHARS
def test_guidance_message_mentions_tools(self, tmp_path, monkeypatch):
"""The replacement message tells the model how to access the file."""
monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
os.makedirs(tmp_path / ".hermes", exist_ok=True)
big = "F" * (_LARGE_RESULT_CHARS + 1)
result = _save_oversized_tool_result("terminal", big)
assert "read_file" in result
assert "search_files" in result
def test_empty_result_passes_through(self):
"""Empty strings are not oversized."""
assert _save_oversized_tool_result("terminal", "") == ""
def test_unicode_content_preserved(self, tmp_path, monkeypatch):
"""Unicode content is fully preserved in the saved file."""
monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
os.makedirs(tmp_path / ".hermes", exist_ok=True)
# Mix of ASCII and multi-byte unicode to exceed threshold
unit = "Hello 世界! 🎉 " * 100 # ~1400 chars per repeat
big = unit * ((_LARGE_RESULT_CHARS // len(unit)) + 1)
assert len(big) > _LARGE_RESULT_CHARS
result = _save_oversized_tool_result("terminal", big)
match = re.search(r"Full output saved to: (.+?)\n", result)
filepath = match.group(1)
with open(filepath, "r", encoding="utf-8") as f:
saved = f.read()
assert saved == big
+5 -49
View File
@@ -872,52 +872,6 @@ class TestBuildApiKwargs:
kwargs = agent._build_api_kwargs(messages)
assert kwargs["max_tokens"] == 4096
def test_qwen_portal_formats_messages_and_metadata(self, agent):
agent.base_url = "https://portal.qwen.ai/v1"
agent._base_url_lower = agent.base_url.lower()
agent.session_id = "sess-123"
messages = [
{"role": "system", "content": "You are helpful"},
{"role": "assistant", "content": "Got it"},
{"role": "user", "content": "hi"},
]
kwargs = agent._build_api_kwargs(messages)
assert kwargs["metadata"]["sessionId"] == "sess-123"
assert kwargs["extra_body"]["vl_high_resolution_images"] is True
assert isinstance(kwargs["messages"][0]["content"], list)
assert kwargs["messages"][0]["content"][0]["cache_control"] == {"type": "ephemeral"}
assert kwargs["messages"][2]["content"][0]["text"] == "hi"
def test_qwen_portal_normalizes_bare_string_content_parts(self, agent):
agent.base_url = "https://portal.qwen.ai/v1"
agent._base_url_lower = agent.base_url.lower()
messages = [
{"role": "system", "content": [{"type": "text", "text": "system"}]},
{"role": "user", "content": ["hello", {"type": "text", "text": "world"}]},
]
kwargs = agent._build_api_kwargs(messages)
user_content = kwargs["messages"][1]["content"]
assert user_content[0] == {"type": "text", "text": "hello"}
assert user_content[1] == {"type": "text", "text": "world"}
def test_qwen_portal_no_system_message(self, agent):
agent.base_url = "https://portal.qwen.ai/v1"
agent._base_url_lower = agent.base_url.lower()
messages = [{"role": "user", "content": "hi"}]
kwargs = agent._build_api_kwargs(messages)
# Should not crash even without a system message
assert kwargs["messages"][0]["content"][0]["text"] == "hi"
assert "cache_control" not in kwargs["messages"][0]["content"][0]
def test_qwen_portal_omits_max_tokens(self, agent):
agent.base_url = "https://portal.qwen.ai/v1"
agent._base_url_lower = agent.base_url.lower()
agent.max_tokens = 4096
messages = [{"role": "system", "content": "sys"}, {"role": "user", "content": "hi"}]
kwargs = agent._build_api_kwargs(messages)
assert "max_tokens" not in kwargs
assert "max_completion_tokens" not in kwargs
class TestBuildAssistantMessage:
def test_basic_message(self, agent):
@@ -1057,9 +1011,10 @@ class TestExecuteToolCalls:
big_result = "x" * 150_000
with patch("run_agent.handle_function_call", return_value=big_result):
agent._execute_tool_calls(mock_msg, messages, "task-1")
# Content should be replaced with persisted-output or truncation
# Content should be replaced with preview + file path
assert len(messages[0]["content"]) < 150_000
assert ("Truncated" in messages[0]["content"] or "<persisted-output>" in messages[0]["content"])
assert "Large tool response" in messages[0]["content"]
assert "Full output saved to:" in messages[0]["content"]
class TestConcurrentToolExecution:
@@ -1294,7 +1249,8 @@ class TestConcurrentToolExecution:
assert len(messages) == 2
for m in messages:
assert len(m["content"]) < 150_000
assert ("Truncated" in m["content"] or "<persisted-output>" in m["content"])
assert "Large tool response" in m["content"]
assert "Full output saved to:" in m["content"]
def test_invoke_tool_dispatches_to_handle_function_call(self, agent):
"""_invoke_tool should route regular tools through handle_function_call."""
-135
View File
@@ -1,135 +0,0 @@
"""Tests for Ollama num_ctx context length detection and injection.
Covers:
agent/model_metadata.py query_ollama_num_ctx()
run_agent.py _ollama_num_ctx detection + extra_body injection
"""
from unittest.mock import patch, MagicMock
import pytest
from agent.model_metadata import query_ollama_num_ctx
# ═══════════════════════════════════════════════════════════════════════
# Level 1: query_ollama_num_ctx — Ollama API interaction
# ═══════════════════════════════════════════════════════════════════════
def _mock_httpx_client(show_response_data, status_code=200):
"""Create a mock httpx.Client context manager that returns given /api/show data."""
mock_resp = MagicMock(status_code=status_code)
mock_resp.json.return_value = show_response_data
mock_client = MagicMock()
mock_client.post.return_value = mock_resp
mock_ctx = MagicMock()
mock_ctx.__enter__ = MagicMock(return_value=mock_client)
mock_ctx.__exit__ = MagicMock(return_value=False)
return mock_ctx, mock_client
class TestQueryOllamaNumCtx:
"""Test the Ollama /api/show context length query."""
def test_returns_context_from_model_info(self):
"""Should extract context_length from GGUF model_info metadata."""
show_data = {
"model_info": {"llama.context_length": 131072},
"parameters": "",
}
mock_ctx, _ = _mock_httpx_client(show_data)
with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
# httpx is imported inside the function — patch the module import
import httpx
with patch.object(httpx, "Client", return_value=mock_ctx):
result = query_ollama_num_ctx("llama3.1:8b", "http://localhost:11434/v1")
assert result == 131072
def test_prefers_explicit_num_ctx_from_modelfile(self):
"""If the Modelfile sets num_ctx explicitly, that should take priority."""
show_data = {
"model_info": {"llama.context_length": 131072},
"parameters": "num_ctx 32768\ntemperature 0.7",
}
mock_ctx, _ = _mock_httpx_client(show_data)
with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
import httpx
with patch.object(httpx, "Client", return_value=mock_ctx):
result = query_ollama_num_ctx("custom-model", "http://localhost:11434")
assert result == 32768
def test_returns_none_for_non_ollama_server(self):
"""Should return None if the server is not Ollama."""
with patch("agent.model_metadata.detect_local_server_type", return_value="lm-studio"):
result = query_ollama_num_ctx("model", "http://localhost:1234")
assert result is None
def test_returns_none_on_connection_error(self):
"""Should return None if the server is unreachable."""
with patch("agent.model_metadata.detect_local_server_type", side_effect=Exception("timeout")):
result = query_ollama_num_ctx("model", "http://localhost:11434")
assert result is None
def test_returns_none_on_404(self):
"""Should return None if the model is not found."""
mock_ctx, _ = _mock_httpx_client({}, status_code=404)
with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
import httpx
with patch.object(httpx, "Client", return_value=mock_ctx):
result = query_ollama_num_ctx("nonexistent", "http://localhost:11434")
assert result is None
def test_strips_provider_prefix(self):
"""Should strip 'local:' prefix from model name before querying."""
show_data = {
"model_info": {"qwen2.context_length": 32768},
"parameters": "",
}
mock_ctx, mock_client = _mock_httpx_client(show_data)
with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
import httpx
with patch.object(httpx, "Client", return_value=mock_ctx):
result = query_ollama_num_ctx("local:qwen2.5:7b", "http://localhost:11434/v1")
# Verify the post was called with stripped name (no "local:" prefix)
call_args = mock_client.post.call_args
assert call_args[1]["json"]["name"] == "qwen2.5:7b" or call_args[0][1] is not None
assert result == 32768
def test_handles_qwen2_architecture_key(self):
"""Different model architectures use different key prefixes in model_info."""
show_data = {
"model_info": {"qwen2.context_length": 65536},
"parameters": "",
}
mock_ctx, _ = _mock_httpx_client(show_data)
with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
import httpx
with patch.object(httpx, "Client", return_value=mock_ctx):
result = query_ollama_num_ctx("qwen2.5:32b", "http://localhost:11434")
assert result == 65536
def test_returns_none_when_model_info_empty(self):
"""Should return None if model_info has no context_length key."""
show_data = {
"model_info": {"llama.embedding_length": 4096},
"parameters": "",
}
mock_ctx, _ = _mock_httpx_client(show_data)
with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
import httpx
with patch.object(httpx, "Client", return_value=mock_ctx):
result = query_ollama_num_ctx("model", "http://localhost:11434")
assert result is None
-117
View File
@@ -1,117 +0,0 @@
"""Tests for agent.retry_utils jittered backoff."""
import threading
import agent.retry_utils as retry_utils
from agent.retry_utils import jittered_backoff
def test_backoff_is_exponential():
"""Base delay should double each attempt (before jitter)."""
for attempt in (1, 2, 3, 4):
delays = [jittered_backoff(attempt, base_delay=5.0, max_delay=120.0, jitter_ratio=0.0) for _ in range(100)]
expected = min(5.0 * (2 ** (attempt - 1)), 120.0)
mean = sum(delays) / len(delays)
assert abs(mean - expected) < 0.01, f"attempt {attempt}: expected {expected}, got {mean}"
def test_backoff_respects_max_delay():
"""Even with high attempt numbers, delay should not exceed max_delay."""
for attempt in (10, 20, 100):
delay = jittered_backoff(attempt, base_delay=5.0, max_delay=60.0, jitter_ratio=0.0)
assert delay <= 60.0, f"attempt {attempt}: delay {delay} exceeds max 60s"
def test_backoff_adds_jitter():
"""With jitter enabled, delays should vary across calls."""
delays = [jittered_backoff(1, base_delay=10.0, max_delay=120.0, jitter_ratio=0.5) for _ in range(50)]
assert min(delays) != max(delays), "jitter should produce varying delays"
assert all(d >= 10.0 for d in delays), "jittered delay should be >= base delay"
assert all(d <= 15.0 for d in delays), "jittered delay should be bounded"
def test_backoff_attempt_1_is_base():
"""First attempt delay should equal base_delay (with no jitter)."""
delay = jittered_backoff(1, base_delay=3.0, max_delay=120.0, jitter_ratio=0.0)
assert delay == 3.0
def test_backoff_with_zero_base_delay_returns_max():
"""base_delay=0 should return max_delay (guard against busy-wait)."""
delay = jittered_backoff(1, base_delay=0.0, max_delay=60.0, jitter_ratio=0.0)
assert delay == 60.0
def test_backoff_with_extreme_attempt_returns_max():
"""Very large attempt numbers should not overflow and should return max_delay."""
delay = jittered_backoff(999, base_delay=5.0, max_delay=120.0, jitter_ratio=0.0)
assert delay == 120.0
def test_backoff_negative_attempt_treated_as_one():
"""Negative attempt should not crash and behaves like attempt=1."""
delay = jittered_backoff(-5, base_delay=10.0, max_delay=120.0, jitter_ratio=0.0)
assert delay == 10.0
def test_backoff_thread_safety():
"""Concurrent calls should generally produce different delays."""
results = []
barrier = threading.Barrier(8)
def _call_backoff():
barrier.wait()
results.append(jittered_backoff(1, base_delay=10.0, max_delay=120.0, jitter_ratio=0.5))
threads = [threading.Thread(target=_call_backoff) for _ in range(8)]
for t in threads:
t.start()
for t in threads:
t.join(timeout=5)
assert len(results) == 8
unique = len(set(results))
assert unique >= 6, f"Expected mostly unique delays, got {unique}/8 unique"
def test_backoff_uses_locked_tick_for_seed(monkeypatch):
"""Seed derivation should use per-call tick captured under lock."""
import time
monkeypatch.setattr(retry_utils, "_jitter_counter", 0)
recorded_seeds = []
class _RecordingRandom:
def __init__(self, seed):
recorded_seeds.append(seed)
def uniform(self, a, b):
return 0.0
monkeypatch.setattr(retry_utils.random, "Random", _RecordingRandom)
fixed_time_ns = 123456789
def _time_ns_wait_for_two_ticks():
deadline = time.time() + 2.0
while retry_utils._jitter_counter < 2 and time.time() < deadline:
time.sleep(0.001)
return fixed_time_ns
monkeypatch.setattr(retry_utils.time, "time_ns", _time_ns_wait_for_two_ticks)
barrier = threading.Barrier(2)
def _call():
barrier.wait()
jittered_backoff(1, base_delay=10.0, max_delay=120.0, jitter_ratio=0.5)
threads = [threading.Thread(target=_call) for _ in range(2)]
for t in threads:
t.start()
for t in threads:
t.join(timeout=5)
assert len(recorded_seeds) == 2
assert len(set(recorded_seeds)) == 2, f"Expected unique seeds, got {recorded_seeds}"
-174
View File
@@ -1,174 +0,0 @@
"""Tests for BaseEnvironment unified execution model.
Tests _wrap_command(), _extract_cwd_from_output(), _embed_stdin_heredoc(),
init_session() failure handling, and the CWD marker contract.
"""
import uuid
from unittest.mock import MagicMock
from tools.environments.base import BaseEnvironment, _cwd_marker
class _TestableEnv(BaseEnvironment):
"""Concrete subclass for testing base class methods."""
def __init__(self, cwd="/tmp", timeout=10):
super().__init__(cwd=cwd, timeout=timeout)
def _run_bash(self, cmd_string, *, login=False, timeout=120, stdin_data=None):
raise NotImplementedError("Use mock")
def cleanup(self):
pass
class TestWrapCommand:
def test_basic_shape(self):
env = _TestableEnv()
env._snapshot_ready = True
wrapped = env._wrap_command("echo hello", "/tmp")
assert "source" in wrapped
assert "cd /tmp" in wrapped or "cd '/tmp'" in wrapped
assert "eval 'echo hello'" in wrapped
assert "__hermes_ec=$?" in wrapped
assert "export -p >" in wrapped
assert "pwd -P >" in wrapped
assert env._cwd_marker in wrapped
assert "exit $__hermes_ec" in wrapped
def test_no_snapshot_skips_source(self):
env = _TestableEnv()
env._snapshot_ready = False
wrapped = env._wrap_command("echo hello", "/tmp")
assert "source" not in wrapped
def test_single_quote_escaping(self):
env = _TestableEnv()
env._snapshot_ready = True
wrapped = env._wrap_command("echo 'hello world'", "/tmp")
assert "eval 'echo '\\''hello world'\\'''" in wrapped
def test_tilde_not_quoted(self):
env = _TestableEnv()
env._snapshot_ready = True
wrapped = env._wrap_command("ls", "~")
assert "cd ~" in wrapped
assert "cd '~'" not in wrapped
def test_cd_failure_exit_126(self):
env = _TestableEnv()
env._snapshot_ready = True
wrapped = env._wrap_command("ls", "/nonexistent")
assert "exit 126" in wrapped
class TestExtractCwdFromOutput:
def test_happy_path(self):
env = _TestableEnv()
marker = env._cwd_marker
result = {
"output": f"hello\n{marker}/home/user{marker}\n",
}
env._extract_cwd_from_output(result)
assert env.cwd == "/home/user"
assert marker not in result["output"]
def test_missing_marker(self):
env = _TestableEnv()
result = {"output": "hello world\n"}
env._extract_cwd_from_output(result)
assert env.cwd == "/tmp" # unchanged
def test_marker_in_command_output(self):
"""If the marker appears in command output AND as the real marker,
rfind grabs the last (real) one."""
env = _TestableEnv()
marker = env._cwd_marker
result = {
"output": f"user typed {marker} in their output\nreal output\n{marker}/correct/path{marker}\n",
}
env._extract_cwd_from_output(result)
assert env.cwd == "/correct/path"
def test_output_cleaned(self):
env = _TestableEnv()
marker = env._cwd_marker
result = {
"output": f"hello\n{marker}/tmp{marker}\n",
}
env._extract_cwd_from_output(result)
assert "hello" in result["output"]
assert marker not in result["output"]
class TestEmbedStdinHeredoc:
def test_heredoc_format(self):
result = BaseEnvironment._embed_stdin_heredoc("cat", "hello world")
assert result.startswith("cat << '")
assert "hello world" in result
assert "HERMES_STDIN_" in result
def test_unique_delimiter_each_call(self):
r1 = BaseEnvironment._embed_stdin_heredoc("cat", "data")
r2 = BaseEnvironment._embed_stdin_heredoc("cat", "data")
# Extract delimiters
d1 = r1.split("'")[1]
d2 = r2.split("'")[1]
assert d1 != d2 # UUID-based, should be unique
class TestInitSessionFailure:
def test_snapshot_ready_false_on_failure(self):
env = _TestableEnv()
def failing_run_bash(*args, **kwargs):
raise RuntimeError("bash not found")
env._run_bash = failing_run_bash
env.init_session()
assert env._snapshot_ready is False
def test_login_flag_when_snapshot_not_ready(self):
"""When _snapshot_ready=False, execute() should pass login=True to _run_bash."""
env = _TestableEnv()
env._snapshot_ready = False
calls = []
def mock_run_bash(cmd, *, login=False, timeout=120, stdin_data=None):
calls.append({"login": login})
# Return a mock process handle
mock = MagicMock()
mock.poll.return_value = 0
mock.returncode = 0
mock.stdout = iter([])
return mock
env._run_bash = mock_run_bash
env.execute("echo test")
assert len(calls) == 1
assert calls[0]["login"] is True
class TestCwdMarker:
def test_marker_contains_session_id(self):
env = _TestableEnv()
assert env._session_id in env._cwd_marker
def test_unique_per_instance(self):
env1 = _TestableEnv()
env2 = _TestableEnv()
assert env1._cwd_marker != env2._cwd_marker
@@ -16,7 +16,6 @@ from tools.browser_camofox import (
_managed_persistence_enabled,
camofox_close,
camofox_navigate,
camofox_soft_cleanup,
check_camofox_available,
cleanup_all_camofox_sessions,
get_vnc_url,
@@ -241,50 +240,3 @@ class TestVncUrlDiscovery:
assert result["vnc_url"] == "http://localhost:6080"
assert "vnc_hint" in result
class TestCamofoxSoftCleanup:
"""camofox_soft_cleanup drops local state only when managed persistence is on."""
def test_returns_true_and_drops_session_when_enabled(self, tmp_path, monkeypatch):
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
monkeypatch.setenv("CAMOFOX_URL", "http://localhost:9377")
with _enable_persistence():
_get_session("task-1")
result = camofox_soft_cleanup("task-1")
assert result is True
# Session should have been dropped from in-memory store
import tools.browser_camofox as mod
with mod._sessions_lock:
assert "task-1" not in mod._sessions
def test_returns_false_when_disabled(self, tmp_path, monkeypatch):
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
monkeypatch.setenv("CAMOFOX_URL", "http://localhost:9377")
_get_session("task-1")
config = {"browser": {"camofox": {"managed_persistence": False}}}
with patch("tools.browser_camofox.load_config", return_value=config):
result = camofox_soft_cleanup("task-1")
assert result is False
# Session should still be present — not dropped
import tools.browser_camofox as mod
with mod._sessions_lock:
assert "task-1" in mod._sessions
def test_does_not_call_server_delete(self, tmp_path, monkeypatch):
"""Soft cleanup must never hit the Camofox /sessions DELETE endpoint."""
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
monkeypatch.setenv("CAMOFOX_URL", "http://localhost:9377")
with (
_enable_persistence(),
patch("tools.browser_camofox.requests.delete") as mock_delete,
):
_get_session("task-1")
camofox_soft_cleanup("task-1")
mock_delete.assert_not_called()
-56
View File
@@ -65,62 +65,6 @@ class TestBrowserCleanup:
mock_stop.assert_called_once_with("task-1")
mock_run.assert_called_once_with("task-1", "close", [], timeout=10)
def test_cleanup_camofox_managed_persistence_skips_close(self):
"""When camofox mode + managed persistence, soft_cleanup fires instead of close."""
browser_tool = self.browser_tool
browser_tool._active_sessions["task-1"] = {
"session_name": "sess-1",
"bb_session_id": None,
}
browser_tool._session_last_activity["task-1"] = 123.0
with (
patch("tools.browser_tool._is_camofox_mode", return_value=True),
patch("tools.browser_tool._maybe_stop_recording") as mock_stop,
patch(
"tools.browser_tool._run_browser_command",
return_value={"success": True},
),
patch("tools.browser_tool.os.path.exists", return_value=False),
patch(
"tools.browser_camofox.camofox_soft_cleanup",
return_value=True,
) as mock_soft,
patch("tools.browser_camofox.camofox_close") as mock_close,
):
browser_tool.cleanup_browser("task-1")
mock_soft.assert_called_once_with("task-1")
mock_close.assert_not_called()
def test_cleanup_camofox_no_persistence_calls_close(self):
"""When camofox mode but managed persistence is off, camofox_close fires."""
browser_tool = self.browser_tool
browser_tool._active_sessions["task-1"] = {
"session_name": "sess-1",
"bb_session_id": None,
}
browser_tool._session_last_activity["task-1"] = 123.0
with (
patch("tools.browser_tool._is_camofox_mode", return_value=True),
patch("tools.browser_tool._maybe_stop_recording") as mock_stop,
patch(
"tools.browser_tool._run_browser_command",
return_value={"success": True},
),
patch("tools.browser_tool.os.path.exists", return_value=False),
patch(
"tools.browser_camofox.camofox_soft_cleanup",
return_value=False,
) as mock_soft,
patch("tools.browser_camofox.camofox_close") as mock_close,
):
browser_tool.cleanup_browser("task-1")
mock_soft.assert_called_once_with("task-1")
mock_close.assert_called_once_with("task-1")
def test_emergency_cleanup_clears_all_tracking_state(self):
browser_tool = self.browser_tool
browser_tool._cleanup_done = False
-103
View File
@@ -152,109 +152,6 @@ class TestFindAgentBrowser:
class TestRunBrowserCommandPathConstruction:
"""Verify _run_browser_command() includes Homebrew node dirs in subprocess PATH."""
def test_subprocess_preserves_executable_path_with_spaces(self, tmp_path):
"""A local agent-browser path containing spaces must stay one argv entry."""
captured_cmd = None
mock_proc = MagicMock()
mock_proc.returncode = 0
mock_proc.wait.return_value = 0
def capture_popen(cmd, **kwargs):
nonlocal captured_cmd
captured_cmd = cmd
return mock_proc
fake_session = {
"session_name": "test-session",
"session_id": "test-id",
"cdp_url": None,
}
fake_json = json.dumps({"success": True})
browser_path = "/Users/test/Library/Application Support/hermes/node_modules/.bin/agent-browser"
hermes_home = str(tmp_path / "hermes-home")
with patch("tools.browser_tool._find_agent_browser", return_value=browser_path), \
patch("tools.browser_tool._get_session_info", return_value=fake_session), \
patch("tools.browser_tool._socket_safe_tmpdir", return_value=str(tmp_path)), \
patch("tools.browser_tool._discover_homebrew_node_dirs", return_value=[]), \
patch("hermes_constants.Path.home", return_value=tmp_path), \
patch("subprocess.Popen", side_effect=capture_popen), \
patch("os.open", return_value=99), \
patch("os.close"), \
patch("tools.interrupt.is_interrupted", return_value=False), \
patch.dict(
os.environ,
{
"PATH": "/usr/bin:/bin",
"HOME": "/home/test",
"HERMES_HOME": hermes_home,
},
clear=True,
):
with patch("builtins.open", mock_open(read_data=fake_json)):
_run_browser_command("test-task", "navigate", ["https://example.com"])
assert captured_cmd is not None
assert captured_cmd[0] == browser_path
assert captured_cmd[1:5] == [
"--session",
"test-session",
"--json",
"navigate",
]
def test_subprocess_splits_npx_fallback_into_command_and_package(self, tmp_path):
"""The synthetic npx fallback should still expand into separate argv items."""
captured_cmd = None
mock_proc = MagicMock()
mock_proc.returncode = 0
mock_proc.wait.return_value = 0
def capture_popen(cmd, **kwargs):
nonlocal captured_cmd
captured_cmd = cmd
return mock_proc
fake_session = {
"session_name": "test-session",
"session_id": "test-id",
"cdp_url": None,
}
fake_json = json.dumps({"success": True})
hermes_home = str(tmp_path / "hermes-home")
with patch("tools.browser_tool._find_agent_browser", return_value="npx agent-browser"), \
patch("tools.browser_tool._get_session_info", return_value=fake_session), \
patch("tools.browser_tool._socket_safe_tmpdir", return_value=str(tmp_path)), \
patch("tools.browser_tool._discover_homebrew_node_dirs", return_value=[]), \
patch("hermes_constants.Path.home", return_value=tmp_path), \
patch("subprocess.Popen", side_effect=capture_popen), \
patch("os.open", return_value=99), \
patch("os.close"), \
patch("tools.interrupt.is_interrupted", return_value=False), \
patch.dict(
os.environ,
{
"PATH": "/usr/bin:/bin",
"HOME": "/home/test",
"HERMES_HOME": hermes_home,
},
clear=True,
):
with patch("builtins.open", mock_open(read_data=fake_json)):
_run_browser_command("test-task", "navigate", ["https://example.com"])
assert captured_cmd is not None
assert captured_cmd[:2] == ["npx", "agent-browser"]
assert captured_cmd[2:6] == [
"--session",
"test-session",
"--json",
"navigate",
]
def test_subprocess_path_includes_homebrew_node_dirs(self, tmp_path):
"""When _discover_homebrew_node_dirs returns dirs, they should appear
in the subprocess env PATH passed to Popen."""
+23 -42
View File
@@ -59,8 +59,8 @@ def daytona_sdk(monkeypatch):
@pytest.fixture()
def make_env(daytona_sdk, monkeypatch):
"""Factory that creates a DaytonaEnvironment with a mocked SDK."""
# Prevent is_interrupted from interfering — patch where it's used (base.py)
monkeypatch.setattr("tools.environments.base.is_interrupted", lambda: False)
# Prevent is_interrupted from interfering
monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
# Prevent skills/credential sync from consuming mock exec calls
monkeypatch.setattr("tools.credential_files.get_credential_file_mounts", lambda: [])
monkeypatch.setattr("tools.credential_files.get_skills_directory_mount", lambda **kw: None)
@@ -221,45 +221,41 @@ class TestCleanup:
class TestExecute:
def test_basic_command(self, make_env):
sb = _make_sandbox()
# Calls: (1) $HOME detection, (2) init_session bootstrap, (3) actual command
# First call: $HOME detection; subsequent calls: actual commands
sb.process.exec.side_effect = [
_make_exec_response(result="/root"), # $HOME
_make_exec_response(result="", exit_code=0), # init_session
_make_exec_response(result="hello", exit_code=0), # actual cmd
]
sb.state = "started"
env = make_env(sandbox=sb)
result = env.execute("echo hello")
assert "hello" in result["output"]
assert result["output"] == "hello"
assert result["returncode"] == 0
def test_sdk_timeout_passed_to_exec(self, make_env):
"""SDK native timeout is passed to sandbox.process.exec()."""
def test_command_wrapped_with_shell_timeout(self, make_env):
sb = _make_sandbox()
sb.process.exec.side_effect = [
_make_exec_response(result="/root"),
_make_exec_response(result="", exit_code=0), # init_session
_make_exec_response(result="ok", exit_code=0),
]
sb.state = "started"
env = make_env(sandbox=sb, timeout=42)
env.execute("echo hello")
# The exec call should receive timeout= kwarg (SDK native timeout)
# The command sent to exec should be wrapped with `timeout N sh -c '...'`
call_args = sb.process.exec.call_args_list[-1]
assert call_args[1]["timeout"] == 42
# The command should NOT have a shell `timeout` prefix
cmd = call_args[0][0]
assert not cmd.startswith("timeout ")
assert cmd.startswith("timeout 42 sh -c ")
# SDK timeout param should NOT be passed
assert "timeout" not in call_args[1]
def test_timeout_returns_exit_code_124(self, make_env):
"""SDK-level timeout surfaces as exit code 124 via _wait_for_process."""
"""Shell timeout utility returns exit code 124."""
sb = _make_sandbox()
sb.process.exec.side_effect = [
_make_exec_response(result="/root"),
_make_exec_response(result="", exit_code=0), # init_session
_make_exec_response(result="", exit_code=124), # actual cmd
_make_exec_response(result="", exit_code=124),
]
sb.state = "started"
env = make_env(sandbox=sb)
@@ -271,7 +267,6 @@ class TestExecute:
sb = _make_sandbox()
sb.process.exec.side_effect = [
_make_exec_response(result="/root"),
_make_exec_response(result="", exit_code=0), # init_session
_make_exec_response(result="not found", exit_code=127),
]
sb.state = "started"
@@ -284,7 +279,6 @@ class TestExecute:
sb = _make_sandbox()
sb.process.exec.side_effect = [
_make_exec_response(result="/root"),
_make_exec_response(result="", exit_code=0), # init_session
_make_exec_response(result="ok", exit_code=0),
]
sb.state = "started"
@@ -292,47 +286,39 @@ class TestExecute:
env.execute("python3", stdin_data="print('hi')")
# Check that the command passed to exec contains heredoc markers
# Base class uses HERMES_STDIN_ prefix for heredoc delimiters
# (single quotes get shell-escaped by shlex.quote, so check components)
call_args = sb.process.exec.call_args_list[-1]
cmd = call_args[0][0]
assert "HERMES_STDIN_" in cmd
assert "HERMES_EOF_" in cmd
assert "print" in cmd
assert "hi" in cmd
def test_custom_cwd_in_command_wrapper(self, make_env):
"""CWD is handled by _wrap_command() in the command string, not as a kwarg."""
def test_custom_cwd_passed_through(self, make_env):
sb = _make_sandbox()
sb.process.exec.side_effect = [
_make_exec_response(result="/root"),
_make_exec_response(result="", exit_code=0), # init_session
_make_exec_response(result="/tmp", exit_code=0),
]
sb.state = "started"
env = make_env(sandbox=sb)
env.execute("pwd", cwd="/tmp")
# CWD should be embedded in the command string via _wrap_command
call_args = sb.process.exec.call_args_list[-1]
cmd = call_args[0][0]
assert "cd /tmp" in cmd
# CWD should NOT be passed as a kwarg to exec
assert "cwd" not in call_args[1]
call_kwargs = sb.process.exec.call_args_list[-1][1]
assert call_kwargs["cwd"] == "/tmp"
def test_daytona_error_triggers_retry(self, make_env, daytona_sdk):
sb = _make_sandbox()
sb.state = "started"
sb.process.exec.side_effect = [
_make_exec_response(result="/root"), # $HOME
_make_exec_response(result="", exit_code=0), # init_session
daytona_sdk.DaytonaError("transient"), # first attempt fails
_make_exec_response(result="ok", exit_code=0), # retry succeeds
]
env = make_env(sandbox=sb)
result = env.execute("echo retry")
# DaytonaError now surfaces directly through _ThreadedProcessHandle
# (no retry logic) — the error becomes returncode=1
assert result["returncode"] == 1
assert result["output"] == "ok"
assert result["returncode"] == 0
# ---------------------------------------------------------------------------
@@ -373,18 +359,14 @@ class TestInterrupt:
calls["n"] += 1
if calls["n"] == 1:
return _make_exec_response(result="/root") # $HOME detection
if calls["n"] == 2:
return _make_exec_response(result="", exit_code=0) # init_session
event.wait(timeout=5) # simulate long-running command
return _make_exec_response(result="done", exit_code=0)
sb.process.exec.side_effect = exec_side_effect
env = make_env(sandbox=sb)
# is_interrupted is checked by base.py's _wait_for_process,
# patch where it's actually referenced (base.py's local binding)
monkeypatch.setattr(
"tools.environments.base.is_interrupted", lambda: True
"tools.environments.daytona.is_interrupted", lambda: True
)
try:
result = env.execute("sleep 10")
@@ -395,24 +377,23 @@ class TestInterrupt:
# ---------------------------------------------------------------------------
# DaytonaError surfaces directly (no retry)
# Retry exhaustion
# ---------------------------------------------------------------------------
class TestRetryExhausted:
def test_both_attempts_fail(self, make_env, daytona_sdk):
"""DaytonaError surfaces directly as rc=1 (retry logic was removed)."""
sb = _make_sandbox()
sb.state = "started"
sb.process.exec.side_effect = [
_make_exec_response(result="/root"), # $HOME
_make_exec_response(result="", exit_code=0), # init_session
daytona_sdk.DaytonaError("fail1"), # actual command fails
daytona_sdk.DaytonaError("fail1"), # first attempt
daytona_sdk.DaytonaError("fail2"), # retry
]
env = make_env(sandbox=sb)
result = env.execute("echo x")
# Error surfaces directly through _ThreadedProcessHandle (rc=1)
assert result["returncode"] == 1
assert "Daytona execution error" in result["output"]
# ---------------------------------------------------------------------------

Some files were not shown because too many files have changed in this diff Show More