feat: add WorldSim — OSINT-powered personality simulation skill

Rehoboam-class worldsim. Immersive CLI personality simulator that researches real people via 25+ verified platform access methods, builds 6-layer psychometric profiles, finds star threads (personality compression keys), and generates platform-authentic simulated conversations with mechanical verification and adversarial refinement.

26 files | 38K words | 2,283 lines Python

- Immersive CLI interface (worldsim> prompt, no assistant framing)
- OSINT pipeline: X API, Instagram private API, Bluesky, TikTok, Facebook, Threads, Mastodon, Reddit, GitHub, HN, Medium, Quora, Goodreads, Google Scholar, Crunchbase, podcasts, news/blogs
- Star thread: one-sentence personality compression key per person
- Deep psychometrics: Big Five + Moral Foundations + Schwartz Values + Cognitive Style + Narrative Framing + Behavioral Metadata
- Anti-slop: mechanical detection of LLM writing patterns
- GAN-style adversarial refinement loop with mechanical verification
- Recursive self-improvement: learned rules grow with each simulation
- Rehoboam persistence: SQLite + filesystem for profiles, predictions, social graph, knowledge archives
- GEPA/MIPROv2 self-evolution integration tested and working
- Knowledge archive: per-person source library with citations and semantic retrieval for context-aware grounding

Co-authored-by: Hermes Agent <hermes@nousresearch.com>
2026-04-08 13:46:20 -04:00
146 changed files with 10537 additions and 8700 deletions
@@ -81,14 +81,6 @@
 # HF_TOKEN=
 # OPENCODE_GO_BASE_URL=https://opencode.ai/zen/go/v1  # Override default base URL

-# =============================================================================
-# LLM PROVIDER (Qwen OAuth)
-# =============================================================================
-# Qwen OAuth reuses your local Qwen CLI login (qwen auth qwen-oauth).
-# No API key needed — credentials come from ~/.qwen/oauth_creds.json.
-# Optional base URL override:
-# HERMES_QWEN_BASE_URL=https://portal.qwen.ai/v1
-
 # =============================================================================
 # TOOL API KEYS
 # =============================================================================
@@ -1,346 +0,0 @@
-# Hermes Agent v0.8.0 (v2026.4.8)
-
-**Release Date:** April 8, 2026
-
-> The intelligence release — background task auto-notifications, free MiMo v2 Pro on Nous Portal, live model switching across all platforms, self-optimized GPT/Codex guidance, native Google AI Studio, smart inactivity timeouts, approval buttons, MCP OAuth 2.1, and 209 merged PRs with 82 resolved issues.
-
---
-
-## ✨ Highlights
-
- **Background Process Auto-Notifications (`notify_on_complete`)** — Background tasks can now automatically notify the agent when they finish. Start a long-running process (AI model training, test suites, deployments, builds) and the agent gets notified on completion — no polling needed. The agent can keep working on other things and pick up results when they land. ([#5779](https://github.com/NousResearch/hermes-agent/pull/5779))
-
- **Free Xiaomi MiMo v2 Pro on Nous Portal** — Nous Portal now supports the free-tier Xiaomi MiMo v2 Pro model for auxiliary tasks (compression, vision, summarization), with free-tier model gating and pricing display in model selection. ([#6018](https://github.com/NousResearch/hermes-agent/pull/6018), [#5880](https://github.com/NousResearch/hermes-agent/pull/5880))
-
- **Live Model Switching (`/model` Command)** — Switch models and providers mid-session from CLI, Telegram, Discord, Slack, or any gateway platform. Aggregator-aware resolution keeps you on OpenRouter/Nous when possible, with automatic cross-provider fallback when needed. Interactive model pickers on Telegram and Discord with inline buttons. ([#5181](https://github.com/NousResearch/hermes-agent/pull/5181), [#5742](https://github.com/NousResearch/hermes-agent/pull/5742))
-
- **Self-Optimized GPT/Codex Tool-Use Guidance** — The agent diagnosed and patched 5 failure modes in GPT and Codex tool calling through automated behavioral benchmarking, dramatically improving reliability on OpenAI models. Includes execution discipline guidance and thinking-only prefill continuation for structured reasoning. ([#6120](https://github.com/NousResearch/hermes-agent/pull/6120), [#5414](https://github.com/NousResearch/hermes-agent/pull/5414), [#5931](https://github.com/NousResearch/hermes-agent/pull/5931))
-
- **Google AI Studio (Gemini) Native Provider** — Direct access to Gemini models through Google's AI Studio API. Includes automatic models.dev registry integration for real-time context length detection across any provider. ([#5577](https://github.com/NousResearch/hermes-agent/pull/5577))
-
- **Inactivity-Based Agent Timeouts** — Gateway and cron timeouts now track actual tool activity instead of wall-clock time. Long-running tasks that are actively working will never be killed — only truly idle agents time out. ([#5389](https://github.com/NousResearch/hermes-agent/pull/5389), [#5440](https://github.com/NousResearch/hermes-agent/pull/5440))
-
- **Approval Buttons on Slack & Telegram** — Dangerous command approval via native platform buttons instead of typing `/approve`. Slack gets thread context preservation; Telegram gets emoji reactions for approval status. ([#5890](https://github.com/NousResearch/hermes-agent/pull/5890), [#5975](https://github.com/NousResearch/hermes-agent/pull/5975))
-
- **MCP OAuth 2.1 PKCE + OSV Malware Scanning** — Full standards-compliant OAuth for MCP server authentication, plus automatic malware scanning of MCP extension packages via the OSV vulnerability database. ([#5420](https://github.com/NousResearch/hermes-agent/pull/5420), [#5305](https://github.com/NousResearch/hermes-agent/pull/5305))
-
- **Centralized Logging & Config Validation** — Structured logging to `~/.hermes/logs/` (agent.log + errors.log) with the `hermes logs` command for tailing and filtering. Config structure validation catches malformed YAML at startup before it causes cryptic failures. ([#5430](https://github.com/NousResearch/hermes-agent/pull/5430), [#5426](https://github.com/NousResearch/hermes-agent/pull/5426))
-
- **Plugin System Expansion** — Plugins can now register CLI subcommands, receive request-scoped API hooks with correlation IDs, prompt for required env vars during install, and hook into session lifecycle events (finalize/reset). ([#5295](https://github.com/NousResearch/hermes-agent/pull/5295), [#5427](https://github.com/NousResearch/hermes-agent/pull/5427), [#5470](https://github.com/NousResearch/hermes-agent/pull/5470), [#6129](https://github.com/NousResearch/hermes-agent/pull/6129))
-
- **Matrix Tier 1 & Platform Hardening** — Matrix gets reactions, read receipts, rich formatting, and room management. Discord adds channel controls and ignored channels. Signal gets full MEDIA: tag delivery. Mattermost gets file attachments. Comprehensive reliability fixes across all platforms. ([#5275](https://github.com/NousResearch/hermes-agent/pull/5275), [#5975](https://github.com/NousResearch/hermes-agent/pull/5975), [#5602](https://github.com/NousResearch/hermes-agent/pull/5602))
-
- **Security Hardening Pass** — Consolidated SSRF protections, timing attack mitigations, tar traversal prevention, credential leakage guards, cron path traversal hardening, and cross-session isolation. Terminal workdir sanitization across all backends. ([#5944](https://github.com/NousResearch/hermes-agent/pull/5944), [#5613](https://github.com/NousResearch/hermes-agent/pull/5613), [#5629](https://github.com/NousResearch/hermes-agent/pull/5629))
-
---
-
-## 🏗️ Core Agent & Architecture
-
-### Provider & Model Support
- **Native Google AI Studio (Gemini) provider** with models.dev integration for automatic context length detection ([#5577](https://github.com/NousResearch/hermes-agent/pull/5577))
- **`/model` command — full provider+model system overhaul** — live switching across CLI and all gateway platforms with aggregator-aware resolution ([#5181](https://github.com/NousResearch/hermes-agent/pull/5181))
- **Interactive model picker for Telegram and Discord** — inline button-based model selection ([#5742](https://github.com/NousResearch/hermes-agent/pull/5742))
- **Nous Portal free-tier model gating** with pricing display in model selection ([#5880](https://github.com/NousResearch/hermes-agent/pull/5880))
- **Model pricing display** for OpenRouter and Nous Portal providers ([#5416](https://github.com/NousResearch/hermes-agent/pull/5416))
- **xAI (Grok) prompt caching** via `x-grok-conv-id` header ([#5604](https://github.com/NousResearch/hermes-agent/pull/5604))
- **Grok added to tool-use enforcement models** for direct xAI usage ([#5595](https://github.com/NousResearch/hermes-agent/pull/5595))
- **MiniMax TTS provider** (speech-2.8) ([#4963](https://github.com/NousResearch/hermes-agent/pull/4963))
- **Non-agentic model warning** — warns users when loading Hermes LLM models not designed for tool use ([#5378](https://github.com/NousResearch/hermes-agent/pull/5378))
- **Ollama Cloud auth, /model switch persistence**, and alias tab completion ([#5269](https://github.com/NousResearch/hermes-agent/pull/5269))
- **Preserve dots in OpenCode Go model names** (minimax-m2.7, glm-4.5, kimi-k2.5) ([#5597](https://github.com/NousResearch/hermes-agent/pull/5597))
- **MiniMax models 404 fix** — strip /v1 from Anthropic base URL for OpenCode Go ([#4918](https://github.com/NousResearch/hermes-agent/pull/4918))
- **Provider credential reset windows** honored in pooled failover ([#5188](https://github.com/NousResearch/hermes-agent/pull/5188))
- **OAuth token sync** between credential pool and credentials file ([#4981](https://github.com/NousResearch/hermes-agent/pull/4981))
- **Stale OAuth credentials** no longer block OpenRouter users on auto-detect ([#5746](https://github.com/NousResearch/hermes-agent/pull/5746))
- **Codex OAuth credential pool disconnect** + expired token import fix ([#5681](https://github.com/NousResearch/hermes-agent/pull/5681))
- **Codex pool entry sync** from `~/.codex/auth.json` on exhaustion — @GratefulDave ([#5610](https://github.com/NousResearch/hermes-agent/pull/5610))
- **Auxiliary client payment fallback** — retry with next provider on 402 ([#5599](https://github.com/NousResearch/hermes-agent/pull/5599))
- **Auxiliary client resolves named custom providers** and 'main' alias ([#5978](https://github.com/NousResearch/hermes-agent/pull/5978))
- **Use mimo-v2-pro** for non-vision auxiliary tasks on Nous free tier ([#6018](https://github.com/NousResearch/hermes-agent/pull/6018))
- **Vision auto-detection** tries main provider first ([#6041](https://github.com/NousResearch/hermes-agent/pull/6041))
- **Provider re-ordering and Quick Install** — @austinpickett ([#4664](https://github.com/NousResearch/hermes-agent/pull/4664))
- **Nous OAuth access_token** no longer used as inference API key — @SHL0MS ([#5564](https://github.com/NousResearch/hermes-agent/pull/5564))
- **HERMES_PORTAL_BASE_URL env var** respected during Nous login — @benbarclay ([#5745](https://github.com/NousResearch/hermes-agent/pull/5745))
- **Env var overrides** for Nous portal/inference URLs ([#5419](https://github.com/NousResearch/hermes-agent/pull/5419))
- **Z.AI endpoint auto-detect** via probe and cache ([#5763](https://github.com/NousResearch/hermes-agent/pull/5763))
- **MiniMax context lengths, model catalog, thinking guard, aux model, and config base_url** corrections ([#6082](https://github.com/NousResearch/hermes-agent/pull/6082))
- **Community provider/model resolution fixes** — salvaged 4 community PRs + MiniMax aux URL ([#5983](https://github.com/NousResearch/hermes-agent/pull/5983))
-
-### Agent Loop & Conversation
- **Self-optimized GPT/Codex tool-use guidance** via automated behavioral benchmarking — agent self-diagnosed and patched 5 failure modes ([#6120](https://github.com/NousResearch/hermes-agent/pull/6120))
- **GPT/Codex execution discipline guidance** in system prompts ([#5414](https://github.com/NousResearch/hermes-agent/pull/5414))
- **Thinking-only prefill continuation** for structured reasoning responses ([#5931](https://github.com/NousResearch/hermes-agent/pull/5931))
- **Accept reasoning-only responses** without retries — set content to "(empty)" instead of infinite retry ([#5278](https://github.com/NousResearch/hermes-agent/pull/5278))
- **Jittered retry backoff** — exponential backoff with jitter for API retries ([#6048](https://github.com/NousResearch/hermes-agent/pull/6048))
- **Smart thinking block signature management** — preserve and manage Anthropic thinking signatures across turns ([#6112](https://github.com/NousResearch/hermes-agent/pull/6112))
- **Coerce tool call arguments** to match JSON Schema types — fixes models that send strings instead of numbers/booleans ([#5265](https://github.com/NousResearch/hermes-agent/pull/5265))
- **Save oversized tool results to file** instead of destructive truncation ([#5210](https://github.com/NousResearch/hermes-agent/pull/5210))
- **Sandbox-aware tool result persistence** ([#6085](https://github.com/NousResearch/hermes-agent/pull/6085))
- **Streaming fallback** improved after edit failures ([#6110](https://github.com/NousResearch/hermes-agent/pull/6110))
- **Codex empty-output gaps** covered in fallback + normalizer + auxiliary client ([#5724](https://github.com/NousResearch/hermes-agent/pull/5724), [#5730](https://github.com/NousResearch/hermes-agent/pull/5730), [#5734](https://github.com/NousResearch/hermes-agent/pull/5734))
- **Codex stream output backfill** from output_item.done events ([#5689](https://github.com/NousResearch/hermes-agent/pull/5689))
- **Stream consumer creates new message** after tool boundaries ([#5739](https://github.com/NousResearch/hermes-agent/pull/5739))
- **Codex validation aligned** with normalization for empty stream output ([#5940](https://github.com/NousResearch/hermes-agent/pull/5940))
- **Bridge tool-calls** in copilot-acp adapter ([#5460](https://github.com/NousResearch/hermes-agent/pull/5460))
- **Filter transcript-only roles** from chat-completions payload ([#4880](https://github.com/NousResearch/hermes-agent/pull/4880))
- **Context compaction failures fixed** on temperature-restricted models — @MadKangYu ([#5608](https://github.com/NousResearch/hermes-agent/pull/5608))
- **Sanitize tool_calls for all strict APIs** (Fireworks, Mistral, etc.) — @lumethegreat ([#5183](https://github.com/NousResearch/hermes-agent/pull/5183))
-
-### Memory & Sessions
- **Supermemory memory provider** — new memory plugin with multi-container, search_mode, identity template, and env var override ([#5737](https://github.com/NousResearch/hermes-agent/pull/5737), [#5933](https://github.com/NousResearch/hermes-agent/pull/5933))
- **Shared thread sessions** by default — multi-user thread support across gateway platforms ([#5391](https://github.com/NousResearch/hermes-agent/pull/5391))
- **Subagent sessions linked to parent** and hidden from session list ([#5309](https://github.com/NousResearch/hermes-agent/pull/5309))
- **Profile-scoped memory isolation** and clone support ([#4845](https://github.com/NousResearch/hermes-agent/pull/4845))
- **Thread gateway user_id to memory plugins** for per-user scoping ([#5895](https://github.com/NousResearch/hermes-agent/pull/5895))
- **Honcho plugin drift overhaul** + plugin CLI registration system ([#5295](https://github.com/NousResearch/hermes-agent/pull/5295))
- **Honcho holographic prompt and trust score** rendering preserved ([#4872](https://github.com/NousResearch/hermes-agent/pull/4872))
- **Honcho doctor fix** — use recall_mode instead of memory_mode — @techguysimon ([#5645](https://github.com/NousResearch/hermes-agent/pull/5645))
- **RetainDB** — API routes, write queue, dialectic, agent model, file tools fixes ([#5461](https://github.com/NousResearch/hermes-agent/pull/5461))
- **Hindsight memory plugin overhaul** + memory setup wizard fixes ([#5094](https://github.com/NousResearch/hermes-agent/pull/5094))
- **mem0 API v2 compat**, prefetch context fencing, secret redaction ([#5423](https://github.com/NousResearch/hermes-agent/pull/5423))
- **mem0 env vars merged** with mem0.json instead of either/or ([#4939](https://github.com/NousResearch/hermes-agent/pull/4939))
- **Clean user message** used for all memory provider operations ([#4940](https://github.com/NousResearch/hermes-agent/pull/4940))
- **Silent memory flush failure** on /new and /resume fixed — @ryanautomated ([#5640](https://github.com/NousResearch/hermes-agent/pull/5640))
- **OpenViking atexit safety net** for session commit ([#5664](https://github.com/NousResearch/hermes-agent/pull/5664))
- **OpenViking tenant-scoping headers** for multi-tenant servers ([#4936](https://github.com/NousResearch/hermes-agent/pull/4936))
- **ByteRover brv query** runs synchronously before LLM call ([#4831](https://github.com/NousResearch/hermes-agent/pull/4831))
-
---
-
-## 📱 Messaging Platforms (Gateway)
-
-### Gateway Core
- **Inactivity-based agent timeout** — replaces wall-clock timeout with smart activity tracking; long-running active tasks never killed ([#5389](https://github.com/NousResearch/hermes-agent/pull/5389))
- **Approval buttons for Slack & Telegram** + Slack thread context preservation ([#5890](https://github.com/NousResearch/hermes-agent/pull/5890))
- **Live-stream /update output** + forward interactive prompts to user ([#5180](https://github.com/NousResearch/hermes-agent/pull/5180))
- **Infinite timeout support** + periodic notifications + actionable error messages ([#4959](https://github.com/NousResearch/hermes-agent/pull/4959))
- **Duplicate message prevention** — gateway dedup + partial stream guard ([#4878](https://github.com/NousResearch/hermes-agent/pull/4878))
- **Webhook delivery_info persistence** + full session id in /status ([#5942](https://github.com/NousResearch/hermes-agent/pull/5942))
- **Tool preview truncation** respects tool_preview_length in all/new progress modes ([#5937](https://github.com/NousResearch/hermes-agent/pull/5937))
- **Short preview truncation** restored for all/new tool progress modes ([#4935](https://github.com/NousResearch/hermes-agent/pull/4935))
- **Update-pending state** written atomically to prevent corruption ([#4923](https://github.com/NousResearch/hermes-agent/pull/4923))
- **Approval session key isolated** per turn ([#4884](https://github.com/NousResearch/hermes-agent/pull/4884))
- **Active-session guard bypass** for /approve, /deny, /stop, /new ([#4926](https://github.com/NousResearch/hermes-agent/pull/4926), [#5765](https://github.com/NousResearch/hermes-agent/pull/5765))
- **Typing indicator paused** during approval waits ([#5893](https://github.com/NousResearch/hermes-agent/pull/5893))
- **Caption check** uses exact line-by-line match instead of substring (all platforms) ([#5939](https://github.com/NousResearch/hermes-agent/pull/5939))
- **MEDIA: tags stripped** from streamed gateway messages ([#5152](https://github.com/NousResearch/hermes-agent/pull/5152))
- **MEDIA: tags extracted** from cron delivery before sending ([#5598](https://github.com/NousResearch/hermes-agent/pull/5598))
- **Profile-aware service units** + voice transcription cleanup ([#5972](https://github.com/NousResearch/hermes-agent/pull/5972))
- **Thread-safe PairingStore** with atomic writes — @CharlieKerfoot ([#5656](https://github.com/NousResearch/hermes-agent/pull/5656))
- **Sanitize media URLs** in base platform logs — @WAXLYY ([#5631](https://github.com/NousResearch/hermes-agent/pull/5631))
- **Reduce Telegram fallback IP activation log noise** — @MadKangYu ([#5615](https://github.com/NousResearch/hermes-agent/pull/5615))
- **Cron static method wrappers** to prevent self-binding ([#5299](https://github.com/NousResearch/hermes-agent/pull/5299))
- **Stale 'hermes login' replaced** with 'hermes auth' + credential removal re-seeding fix ([#5670](https://github.com/NousResearch/hermes-agent/pull/5670))
-
-### Telegram
- **Group topics skill binding** for supergroup forum topics ([#4886](https://github.com/NousResearch/hermes-agent/pull/4886))
- **Emoji reactions** for approval status and notifications ([#5975](https://github.com/NousResearch/hermes-agent/pull/5975))
- **Duplicate message delivery prevented** on send timeout ([#5153](https://github.com/NousResearch/hermes-agent/pull/5153))
- **Command names sanitized** to strip invalid characters ([#5596](https://github.com/NousResearch/hermes-agent/pull/5596))
- **Per-platform disabled skills** respected in Telegram menu and gateway dispatch ([#4799](https://github.com/NousResearch/hermes-agent/pull/4799))
- **/approve and /deny** routed through running-agent guard ([#4798](https://github.com/NousResearch/hermes-agent/pull/4798))
-
-### Discord
- **Channel controls** — ignored_channels and no_thread_channels config options ([#5975](https://github.com/NousResearch/hermes-agent/pull/5975))
- **Skills registered as native slash commands** via shared gateway logic ([#5603](https://github.com/NousResearch/hermes-agent/pull/5603))
- **/approve, /deny, /queue, /background, /btw** registered as native slash commands ([#4800](https://github.com/NousResearch/hermes-agent/pull/4800), [#5477](https://github.com/NousResearch/hermes-agent/pull/5477))
- **Unnecessary members intent** removed on startup + token lock leak fix ([#5302](https://github.com/NousResearch/hermes-agent/pull/5302))
-
-### Slack
- **Thread engagement** — auto-respond in bot-started and mentioned threads ([#5897](https://github.com/NousResearch/hermes-agent/pull/5897))
- **mrkdwn in edit_message** + thread replies without @mentions ([#5733](https://github.com/NousResearch/hermes-agent/pull/5733))
-
-### Matrix
- **Tier 1 feature parity** — reactions, read receipts, rich formatting, room management ([#5275](https://github.com/NousResearch/hermes-agent/pull/5275))
- **MATRIX_REQUIRE_MENTION and MATRIX_AUTO_THREAD** support ([#5106](https://github.com/NousResearch/hermes-agent/pull/5106))
- **Comprehensive reliability** — encrypted media, auth recovery, cron E2EE, Synapse compat ([#5271](https://github.com/NousResearch/hermes-agent/pull/5271))
- **CJK input, E2EE, and reconnect** fixes ([#5665](https://github.com/NousResearch/hermes-agent/pull/5665))
-
-### Signal
- **Full MEDIA: tag delivery** — send_image_file, send_voice, and send_video implemented ([#5602](https://github.com/NousResearch/hermes-agent/pull/5602))
-
-### Mattermost
- **File attachments** — set message type to DOCUMENT when post has file attachments — @nericervin ([#5609](https://github.com/NousResearch/hermes-agent/pull/5609))
-
-### Feishu
- **Interactive card approval buttons** ([#6043](https://github.com/NousResearch/hermes-agent/pull/6043))
- **Reconnect and ACL** fixes ([#5665](https://github.com/NousResearch/hermes-agent/pull/5665))
-
-### Webhooks
- **`{__raw__}` template token** and thread_id passthrough for forum topics ([#5662](https://github.com/NousResearch/hermes-agent/pull/5662))
-
---
-
-## 🖥️ CLI & User Experience
-
-### Interactive CLI
- **Defer response content** until reasoning block completes ([#5773](https://github.com/NousResearch/hermes-agent/pull/5773))
- **Ghost status-bar lines cleared** on terminal resize ([#4960](https://github.com/NousResearch/hermes-agent/pull/4960))
- **Normalise \r\n and \r line endings** in pasted text ([#4849](https://github.com/NousResearch/hermes-agent/pull/4849))
- **ChatConsole errors, curses scroll, skin-aware banner, git state** banner fixes ([#5974](https://github.com/NousResearch/hermes-agent/pull/5974))
- **Native Windows image paste** support ([#5917](https://github.com/NousResearch/hermes-agent/pull/5917))
- **--yolo and other flags** no longer silently dropped when placed before 'chat' subcommand ([#5145](https://github.com/NousResearch/hermes-agent/pull/5145))
-
-### Setup & Configuration
- **Config structure validation** — detect malformed YAML at startup with actionable error messages ([#5426](https://github.com/NousResearch/hermes-agent/pull/5426))
- **Centralized logging** to `~/.hermes/logs/` — agent.log (INFO+), errors.log (WARNING+) with `hermes logs` command ([#5430](https://github.com/NousResearch/hermes-agent/pull/5430))
- **Docs links added** to setup wizard sections ([#5283](https://github.com/NousResearch/hermes-agent/pull/5283))
- **Doctor diagnostics** — sync provider checks, config migration, WAL and mem0 diagnostics ([#5077](https://github.com/NousResearch/hermes-agent/pull/5077))
- **Timeout debug logging** and user-facing diagnostics improved ([#5370](https://github.com/NousResearch/hermes-agent/pull/5370))
- **Reasoning effort unified** to config.yaml only ([#6118](https://github.com/NousResearch/hermes-agent/pull/6118))
- **Permanent command allowlist** loaded on startup ([#5076](https://github.com/NousResearch/hermes-agent/pull/5076))
- **`hermes auth remove`** now clears env-seeded credentials permanently ([#5285](https://github.com/NousResearch/hermes-agent/pull/5285))
- **Bundled skills synced to all profiles** during update ([#5795](https://github.com/NousResearch/hermes-agent/pull/5795))
- **`hermes update` no longer kills** freshly-restarted gateway service ([#5448](https://github.com/NousResearch/hermes-agent/pull/5448))
- **Subprocess.run() timeouts** added to all gateway CLI commands ([#5424](https://github.com/NousResearch/hermes-agent/pull/5424))
- **Actionable error message** when Codex refresh token is reused — @tymrtn ([#5612](https://github.com/NousResearch/hermes-agent/pull/5612))
- **Google-workspace skill scripts** can now run directly — @xinbenlv ([#5624](https://github.com/NousResearch/hermes-agent/pull/5624))
-
-### Cron System
- **Inactivity-based cron timeout** — replaces wall-clock; active tasks run indefinitely ([#5440](https://github.com/NousResearch/hermes-agent/pull/5440))
- **Pre-run script injection** for data collection and change detection ([#5082](https://github.com/NousResearch/hermes-agent/pull/5082))
- **Delivery failure tracking** in job status ([#6042](https://github.com/NousResearch/hermes-agent/pull/6042))
- **Delivery guidance** in cron prompts — stops send_message thrashing ([#5444](https://github.com/NousResearch/hermes-agent/pull/5444))
- **MEDIA files delivered** as native platform attachments ([#5921](https://github.com/NousResearch/hermes-agent/pull/5921))
- **[SILENT] suppression** works anywhere in response — @auspic7 ([#5654](https://github.com/NousResearch/hermes-agent/pull/5654))
- **Cron path traversal** hardening ([#5147](https://github.com/NousResearch/hermes-agent/pull/5147))
-
---
-
-## 🔧 Tool System
-
-### Terminal & Execution
- **Execute_code on remote backends** — code execution now works on Docker, SSH, Modal, and other remote terminal backends ([#5088](https://github.com/NousResearch/hermes-agent/pull/5088))
- **Exit code context** for common CLI tools in terminal results — helps agent understand what went wrong ([#5144](https://github.com/NousResearch/hermes-agent/pull/5144))
- **Progressive subdirectory hint discovery** — agent learns project structure as it navigates ([#5291](https://github.com/NousResearch/hermes-agent/pull/5291))
- **notify_on_complete for background processes** — get notified when long-running tasks finish ([#5779](https://github.com/NousResearch/hermes-agent/pull/5779))
- **Docker env config** — explicit container environment variables via docker_env config ([#4738](https://github.com/NousResearch/hermes-agent/pull/4738))
- **Approval metadata included** in terminal tool results ([#5141](https://github.com/NousResearch/hermes-agent/pull/5141))
- **Workdir parameter sanitized** in terminal tool across all backends ([#5629](https://github.com/NousResearch/hermes-agent/pull/5629))
- **Detached process crash recovery** state corrected ([#6101](https://github.com/NousResearch/hermes-agent/pull/6101))
- **Agent-browser paths with spaces** preserved — @Vasanthdev2004 ([#6077](https://github.com/NousResearch/hermes-agent/pull/6077))
- **Portable base64 encoding** for image reading on macOS — @CharlieKerfoot ([#5657](https://github.com/NousResearch/hermes-agent/pull/5657))
-
-### Browser
- **Switch managed browser provider** from Browserbase to Browser Use — @benbarclay ([#5750](https://github.com/NousResearch/hermes-agent/pull/5750))
- **Firecrawl cloud browser** provider — @alt-glitch ([#5628](https://github.com/NousResearch/hermes-agent/pull/5628))
- **JS evaluation** via browser_console expression parameter ([#5303](https://github.com/NousResearch/hermes-agent/pull/5303))
- **Windows browser** fixes ([#5665](https://github.com/NousResearch/hermes-agent/pull/5665))
-
-### MCP
- **MCP OAuth 2.1 PKCE** — full standards-compliant OAuth client support ([#5420](https://github.com/NousResearch/hermes-agent/pull/5420))
- **OSV malware check** for MCP extension packages ([#5305](https://github.com/NousResearch/hermes-agent/pull/5305))
- **Prefer structuredContent over text** + no_mcp sentinel ([#5979](https://github.com/NousResearch/hermes-agent/pull/5979))
- **Unknown toolsets warning suppressed** for MCP server names ([#5279](https://github.com/NousResearch/hermes-agent/pull/5279))
-
-### Web & Files
- **.zip document support** + auto-mount cache dirs into remote backends ([#4846](https://github.com/NousResearch/hermes-agent/pull/4846))
- **Redact query secrets** in send_message errors — @WAXLYY ([#5650](https://github.com/NousResearch/hermes-agent/pull/5650))
-
-### Delegation
- **Credential pool sharing** + workspace path hints for subagents ([#5748](https://github.com/NousResearch/hermes-agent/pull/5748))
-
-### ACP (VS Code / Zed / JetBrains)
- **Aggregate ACP improvements** — auth compat, protocol fixes, command ads, delegation, SSE events ([#5292](https://github.com/NousResearch/hermes-agent/pull/5292))
-
---
-
-## 🧩 Skills Ecosystem
-
-### Skills System
- **Skill config interface** — skills can declare required config.yaml settings, prompted during setup, injected at load time ([#5635](https://github.com/NousResearch/hermes-agent/pull/5635))
- **Plugin CLI registration system** — plugins register their own CLI subcommands without touching main.py ([#5295](https://github.com/NousResearch/hermes-agent/pull/5295))
- **Request-scoped API hooks** with tool call correlation IDs for plugins ([#5427](https://github.com/NousResearch/hermes-agent/pull/5427))
- **Session lifecycle hooks** — on_session_finalize and on_session_reset for CLI + gateway ([#6129](https://github.com/NousResearch/hermes-agent/pull/6129))
- **Prompt for required env vars** during plugin install — @kshitijk4poor ([#5470](https://github.com/NousResearch/hermes-agent/pull/5470))
- **Plugin name validation** — reject names that resolve to plugins root ([#5368](https://github.com/NousResearch/hermes-agent/pull/5368))
- **pre_llm_call plugin context** moved to user message to preserve prompt cache ([#5146](https://github.com/NousResearch/hermes-agent/pull/5146))
-
-### New & Updated Skills
- **popular-web-designs** — 54 production website design systems ([#5194](https://github.com/NousResearch/hermes-agent/pull/5194))
- **p5js creative coding** — @SHL0MS ([#5600](https://github.com/NousResearch/hermes-agent/pull/5600))
- **manim-video** — mathematical and technical animations — @SHL0MS ([#4930](https://github.com/NousResearch/hermes-agent/pull/4930))
- **llm-wiki** — Karpathy's LLM Wiki skill ([#5635](https://github.com/NousResearch/hermes-agent/pull/5635))
- **gitnexus-explorer** — codebase indexing and knowledge serving ([#5208](https://github.com/NousResearch/hermes-agent/pull/5208))
- **research-paper-writing** — AI-Scientist & GPT-Researcher patterns — @SHL0MS ([#5421](https://github.com/NousResearch/hermes-agent/pull/5421))
- **blogwatcher** updated to JulienTant's fork ([#5759](https://github.com/NousResearch/hermes-agent/pull/5759))
- **claude-code skill** comprehensive rewrite v2.0 + v2.2 ([#5155](https://github.com/NousResearch/hermes-agent/pull/5155), [#5158](https://github.com/NousResearch/hermes-agent/pull/5158))
- **Code verification skills** consolidated into one ([#4854](https://github.com/NousResearch/hermes-agent/pull/4854))
- **Manim CE reference docs** expanded — geometry, animations, LaTeX — @leotrs ([#5791](https://github.com/NousResearch/hermes-agent/pull/5791))
- **Manim-video references** — design thinking, updaters, paper explainer, decorations, production quality — @SHL0MS ([#5588](https://github.com/NousResearch/hermes-agent/pull/5588), [#5408](https://github.com/NousResearch/hermes-agent/pull/5408))
-
---
-
-## 🔒 Security & Reliability
-
-### Security Hardening
- **Consolidated security** — SSRF protections, timing attack mitigations, tar traversal prevention, credential leakage guards ([#5944](https://github.com/NousResearch/hermes-agent/pull/5944))
- **Cross-session isolation** + cron path traversal hardening ([#5613](https://github.com/NousResearch/hermes-agent/pull/5613))
- **Workdir parameter sanitized** in terminal tool across all backends ([#5629](https://github.com/NousResearch/hermes-agent/pull/5629))
- **Approval 'once' session escalation** prevented + cron delivery platform validation ([#5280](https://github.com/NousResearch/hermes-agent/pull/5280))
- **Profile-scoped Google Workspace OAuth tokens** protected ([#4910](https://github.com/NousResearch/hermes-agent/pull/4910))
-
-### Reliability
- **Aggressive worktree and branch cleanup** to prevent accumulation ([#6134](https://github.com/NousResearch/hermes-agent/pull/6134))
- **O(n²) catastrophic backtracking** in redact regex fixed — 100x improvement on large outputs ([#4962](https://github.com/NousResearch/hermes-agent/pull/4962))
- **Runtime stability fixes** across core, web, delegate, and browser tools ([#4843](https://github.com/NousResearch/hermes-agent/pull/4843))
- **API server streaming fix** + conversation history support ([#5977](https://github.com/NousResearch/hermes-agent/pull/5977))
- **OpenViking API endpoint paths** and response parsing corrected ([#5078](https://github.com/NousResearch/hermes-agent/pull/5078))
-
---
-
-## 🐛 Notable Bug Fixes
-
- **9 community bugfixes salvaged** — gateway, cron, deps, macOS launchd in one batch ([#5288](https://github.com/NousResearch/hermes-agent/pull/5288))
- **Batch core bug fixes** — model config, session reset, alias fallback, launchctl, delegation, atomic writes ([#5630](https://github.com/NousResearch/hermes-agent/pull/5630))
- **Batch gateway/platform fixes** — matrix E2EE, CJK input, Windows browser, Feishu reconnect + ACL ([#5665](https://github.com/NousResearch/hermes-agent/pull/5665))
- **Stale test skips removed**, regex backtracking, file search bug, and test flakiness ([#4969](https://github.com/NousResearch/hermes-agent/pull/4969))
- **Nix flake** — read version, regen uv.lock, add hermes_logging — @alt-glitch ([#5651](https://github.com/NousResearch/hermes-agent/pull/5651))
- **Lowercase variable redaction** regression tests ([#5185](https://github.com/NousResearch/hermes-agent/pull/5185))
-
---
-
-## 🧪 Testing
-
- **57 failing CI tests repaired** across 14 files ([#5823](https://github.com/NousResearch/hermes-agent/pull/5823))
- **Test suite re-architecture** + CI failure fixes — @alt-glitch ([#5946](https://github.com/NousResearch/hermes-agent/pull/5946))
- **Codebase-wide lint cleanup** — unused imports, dead code, and inefficient patterns ([#5821](https://github.com/NousResearch/hermes-agent/pull/5821))
- **browser_close tool removed** — auto-cleanup handles it ([#5792](https://github.com/NousResearch/hermes-agent/pull/5792))
-
---
-
-## 📚 Documentation
-
- **Comprehensive documentation audit** — fix stale info, expand thin pages, add depth ([#5393](https://github.com/NousResearch/hermes-agent/pull/5393))
- **40+ discrepancies fixed** between documentation and codebase ([#5818](https://github.com/NousResearch/hermes-agent/pull/5818))
- **13 features documented** from last week's PRs ([#5815](https://github.com/NousResearch/hermes-agent/pull/5815))
- **Guides section overhaul** — fix existing + add 3 new tutorials ([#5735](https://github.com/NousResearch/hermes-agent/pull/5735))
- **Salvaged 4 docs PRs** — docker setup, post-update validation, local LLM guide, signal-cli install ([#5727](https://github.com/NousResearch/hermes-agent/pull/5727))
- **Discord configuration reference** ([#5386](https://github.com/NousResearch/hermes-agent/pull/5386))
- **Community FAQ entries** for common workflows and troubleshooting ([#4797](https://github.com/NousResearch/hermes-agent/pull/4797))
- **WSL2 networking guide** for local model servers ([#5616](https://github.com/NousResearch/hermes-agent/pull/5616))
- **Honcho CLI reference** + plugin CLI registration docs ([#5308](https://github.com/NousResearch/hermes-agent/pull/5308))
- **Obsidian Headless setup** for servers in llm-wiki ([#5660](https://github.com/NousResearch/hermes-agent/pull/5660))
- **Hermes Mod visual skin editor** added to skins page ([#6095](https://github.com/NousResearch/hermes-agent/pull/6095))
-
---
-
-## 👥 Contributors
-
-### Core
- **@teknium1** — 179 PRs
-
-### Top Community Contributors
- **@SHL0MS** (7 PRs) — p5js creative coding skill, manim-video skill + 5 reference expansions, research-paper-writing, Nous OAuth fix, manim font fix
- **@alt-glitch** (3 PRs) — Firecrawl cloud browser provider, test re-architecture + CI fixes, Nix flake fixes
- **@benbarclay** (2 PRs) — Browser Use managed provider switch, Nous portal base URL fix
- **@CharlieKerfoot** (2 PRs) — macOS portable base64 encoding, thread-safe PairingStore
- **@WAXLYY** (2 PRs) — send_message secret redaction, gateway media URL sanitization
- **@MadKangYu** (2 PRs) — Telegram log noise reduction, context compaction fix for temperature-restricted models
-
-### All Contributors
-@alt-glitch, @austinpickett, @auspic7, @benbarclay, @CharlieKerfoot, @GratefulDave, @kshitijk4poor, @leotrs, @lumethegreat, @MadKangYu, @nericervin, @ryanautomated, @SHL0MS, @techguysimon, @tymrtn, @Vasanthdev2004, @WAXLYY, @xinbenlv
-
---
-
-**Full Changelog**: [v2026.4.3...v2026.4.8](https://github.com/NousResearch/hermes-agent/compare/v2026.4.3...v2026.4.8)
@@ -163,17 +163,6 @@ def _is_oauth_token(key: str) -> bool:
    return True


-def _normalize_base_url_text(base_url) -> str:
-    """Normalize SDK/base transport URL values to a plain string for inspection.
-
-    Some client objects expose ``base_url`` as an ``httpx.URL`` instead of a raw
-    string.  Provider/auth detection should accept either shape.
-    """
-    if not base_url:
-        return ""
-    return str(base_url).strip()
-
-
 def _is_third_party_anthropic_endpoint(base_url: str | None) -> bool:
    """Return True for non-Anthropic endpoints using the Anthropic Messages API.

@@ -181,10 +170,9 @@ def _is_third_party_anthropic_endpoint(base_url: str | None) -> bool:
    with their own API keys via x-api-key, not Anthropic OAuth tokens. OAuth
    detection should be skipped for these endpoints.
    """
-    normalized = _normalize_base_url_text(base_url)
-    if not normalized:
+    if not base_url:
        return False  # No base_url = direct Anthropic API
-    normalized = normalized.rstrip("/").lower()
+    normalized = base_url.rstrip("/").lower()
    if "anthropic.com" in normalized:
        return False  # Direct Anthropic API — OAuth applies
    return True  # Any other endpoint is a third-party proxy
@@ -194,13 +182,12 @@ def _requires_bearer_auth(base_url: str | None) -> bool:
    """Return True for Anthropic-compatible providers that require Bearer auth.

    Some third-party /anthropic endpoints implement Anthropic's Messages API but
-    require Authorization: Bearer *** of Anthropic's native x-api-key header.
+    require Authorization: Bearer instead of Anthropic's native x-api-key header.
    MiniMax's global and China Anthropic-compatible endpoints follow this pattern.
    """
-    normalized = _normalize_base_url_text(base_url)
-    if not normalized:
+    if not base_url:
        return False
-    normalized = normalized.rstrip("/").lower()
+    normalized = base_url.rstrip("/").lower()
    return normalized.startswith(("https://api.minimax.io/anthropic", "https://api.minimaxi.com/anthropic"))
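
Review note on the two hunks above: a minimal sketch of how the post-change helpers classify plain-string base URLs. The function bodies mirror the `+` lines of this diff. The leading underscores are dropped, and `proxy.example.test` is a hypothetical host used only for illustration. Because `_normalize_base_url_text` is removed, the sketch passes only `str` or `None`, as the type hints state.

```python
from typing import Optional


def is_third_party_anthropic_endpoint(base_url: Optional[str]) -> bool:
    """Mirror of the post-change check: anything that is not anthropic.com."""
    if not base_url:
        return False  # No base_url means the direct Anthropic API
    normalized = base_url.rstrip("/").lower()
    if "anthropic.com" in normalized:
        return False  # Direct Anthropic API, so OAuth detection applies
    return True  # Any other endpoint is treated as a third-party proxy


def requires_bearer_auth(base_url: Optional[str]) -> bool:
    """Mirror of the post-change check for the two MiniMax prefixes."""
    if not base_url:
        return False
    normalized = base_url.rstrip("/").lower()
    return normalized.startswith(
        ("https://api.minimax.io/anthropic", "https://api.minimaxi.com/anthropic")
    )


if __name__ == "__main__":
    # Hypothetical sample inputs; only the MiniMax prefixes come from the diff.
    for url in (None, "https://api.anthropic.com/", "https://api.minimax.io/anthropic/"):
        print(url, is_third_party_anthropic_endpoint(url), requires_bearer_auth(url))
```

Both checks are case-insensitive and ignore trailing slashes, so a MiniMax URL is reported as third-party and as requiring Bearer auth.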


@@ -216,14 +203,13 @@ def build_anthropic_client(api_key: str, base_url: str = None):
        )
    from httpx import Timeout

-    normalized_base_url = _normalize_base_url_text(base_url)
    kwargs = {
        "timeout": Timeout(timeout=900.0, connect=10.0),
    }
-    if normalized_base_url:
-        kwargs["base_url"] = normalized_base_url
+    if base_url:
+        kwargs["base_url"] = base_url

-    if _requires_bearer_auth(normalized_base_url):
+    if _requires_bearer_auth(base_url):
        # Some Anthropic-compatible providers (e.g. MiniMax) expect the API key in
        # Authorization: Bearer even for regular API keys. Route those endpoints
        # through auth_token so the SDK sends Bearer auth instead of x-api-key.
@@ -956,18 +942,12 @@ def _convert_content_to_anthropic(content: Any) -> Any:

 def convert_messages_to_anthropic(
    messages: List[Dict],
-    base_url: str | None = None,
 ) -> Tuple[Optional[Any], List[Dict]]:
    """Convert OpenAI-format messages to Anthropic format.

    Returns (system_prompt, anthropic_messages).
    System messages are extracted since Anthropic takes them as a separate param.
    system_prompt is a string or list of content blocks (when cache_control present).
-
-    When *base_url* is provided and points to a third-party Anthropic-compatible
-    endpoint, all thinking block signatures are stripped.  Signatures are
-    Anthropic-proprietary — third-party endpoints cannot validate them and will
-    reject them with HTTP 400 "Invalid signature in thinking block".
    """
    system = None
    result = []
@@ -1122,15 +1102,7 @@ def convert_messages_to_anthropic(
                        curr_content = [{"type": "text", "text": curr_content}]
                    fixed[-1]["content"] = prev_content + curr_content
            else:
-                # Consecutive assistant messages — merge text content.
-                # Drop thinking blocks from the *second* message: their
-                # signature was computed against a different turn boundary
-                # and becomes invalid once merged.
-                if isinstance(m["content"], list):
-                    m["content"] = [
-                        b for b in m["content"]
-                        if not (isinstance(b, dict) and b.get("type") in ("thinking", "redacted_thinking"))
-                    ]
+                # Consecutive assistant messages — merge text content
                prev_blocks = fixed[-1]["content"]
                curr_blocks = m["content"]
                if isinstance(prev_blocks, list) and isinstance(curr_blocks, list):
@@ -1148,79 +1120,6 @@ def convert_messages_to_anthropic(
            fixed.append(m)
    result = fixed

-    # ── Thinking block signature management ──────────────────────────
-    # Anthropic signs thinking blocks against the full turn content.
-    # Any upstream mutation (context compression, session truncation,
-    # orphan stripping, message merging) invalidates the signature,
-    # causing HTTP 400 "Invalid signature in thinking block".
-    #
-    # Signatures are Anthropic-proprietary.  Third-party endpoints
-    # (MiniMax, Azure AI Foundry, self-hosted proxies) cannot validate
-    # them and will reject them outright.  When targeting a third-party
-    # endpoint, strip ALL thinking/redacted_thinking blocks from every
-    # assistant message — the third-party will generate its own
-    # thinking blocks if it supports extended thinking.
-    #
-    # For direct Anthropic (strategy following clawdbot/OpenClaw):
-    # 1. Strip thinking/redacted_thinking from all assistant messages
-    #    EXCEPT the last one — preserves reasoning continuity on the
-    #    current tool-use chain while avoiding stale signature errors.
-    # 2. Downgrade unsigned thinking blocks (no signature) to text —
-    #    Anthropic can't validate them and will reject them.
-    # 3. Strip cache_control from thinking/redacted_thinking blocks —
-    #    cache markers can interfere with signature validation.
-    _THINKING_TYPES = frozenset(("thinking", "redacted_thinking"))
-    _is_third_party = _is_third_party_anthropic_endpoint(base_url)
-
-    last_assistant_idx = None
-    for i in range(len(result) - 1, -1, -1):
-        if result[i].get("role") == "assistant":
-            last_assistant_idx = i
-            break
-
-    for idx, m in enumerate(result):
-        if m.get("role") != "assistant" or not isinstance(m.get("content"), list):
-            continue
-
-        if _is_third_party or idx != last_assistant_idx:
-            # Third-party endpoint: strip ALL thinking blocks from every
-            # assistant message — signatures are Anthropic-proprietary.
-            # Direct Anthropic: strip from non-latest assistant messages only.
-            stripped = [
-                b for b in m["content"]
-                if not (isinstance(b, dict) and b.get("type") in _THINKING_TYPES)
-            ]
-            m["content"] = stripped or [{"type": "text", "text": "(thinking elided)"}]
-        else:
-            # Latest assistant on direct Anthropic: keep signed thinking
-            # blocks for reasoning continuity; downgrade unsigned ones to
-            # plain text.
-            new_content = []
-            for b in m["content"]:
-                if not isinstance(b, dict) or b.get("type") not in _THINKING_TYPES:
-                    new_content.append(b)
-                    continue
-                if b.get("type") == "redacted_thinking":
-                    # Redacted blocks use 'data' for the signature payload
-                    if b.get("data"):
-                        new_content.append(b)
-                    # else: drop — no data means it can't be validated
-                elif b.get("signature"):
-                    # Signed thinking block — keep it
-                    new_content.append(b)
-                else:
-                    # Unsigned thinking — downgrade to text so it's not lost
-                    thinking_text = b.get("thinking", "")
-                    if thinking_text:
-                        new_content.append({"type": "text", "text": thinking_text})
-            m["content"] = new_content or [{"type": "text", "text": "(empty)"}]
-
-        # Strip cache_control from any remaining thinking/redacted_thinking
-        # blocks — cache markers interfere with signature validation.
-        for b in m["content"]:
-            if isinstance(b, dict) and b.get("type") in _THINKING_TYPES:
-                b.pop("cache_control", None)
-
    return system, result


@@ -1234,7 +1133,6 @@ def build_anthropic_kwargs(
    is_oauth: bool = False,
    preserve_dots: bool = False,
    context_length: Optional[int] = None,
-    base_url: str | None = None,
 ) -> Dict[str, Any]:
    """Build kwargs for anthropic.messages.create().

@@ -1248,11 +1146,8 @@ def build_anthropic_kwargs(

    When *preserve_dots* is True, model name dots are not converted to hyphens
    (for Alibaba/DashScope anthropic-compatible endpoints: qwen3.5-plus).
-
-    When *base_url* points to a third-party Anthropic-compatible endpoint,
-    thinking block signatures are stripped (they are Anthropic-proprietary).
    """
-    system, anthropic_messages = convert_messages_to_anthropic(messages, base_url=base_url)
+    system, anthropic_messages = convert_messages_to_anthropic(messages)
    anthropic_tools = convert_tools_to_anthropic(tools) if tools else []

    model = normalize_model_name(model, preserve_dots=preserve_dots)
@@ -1329,9 +1224,9 @@ def build_anthropic_kwargs(
    # Map reasoning_config to Anthropic's thinking parameter.
    # Claude 4.6 models use adaptive thinking + output_config.effort.
    # Older models use manual thinking with budget_tokens.
-    # Haiku and MiniMax models do NOT support extended thinking — skip entirely.
+    # Haiku models do NOT support extended thinking at all — skip entirely.
    if reasoning_config and isinstance(reasoning_config, dict):
-        if reasoning_config.get("enabled") is not False and "haiku" not in model.lower() and "minimax" not in model.lower():
+        if reasoning_config.get("enabled") is not False and "haiku" not in model.lower():
            effort = str(reasoning_config.get("effort", "medium")).lower()
            budget = THINKING_BUDGET.get(effort, 8000)
            if _supports_adaptive_thinking(model):
@@ -59,48 +59,13 @@ from hermes_constants import OPENROUTER_BASE_URL

 logger = logging.getLogger(__name__)

-_PROVIDER_ALIASES = {
-    "google": "gemini",
-    "google-gemini": "gemini",
-    "google-ai-studio": "gemini",
-    "glm": "zai",
-    "z-ai": "zai",
-    "z.ai": "zai",
-    "zhipu": "zai",
-    "kimi": "kimi-coding",
-    "moonshot": "kimi-coding",
-    "minimax-china": "minimax-cn",
-    "minimax_cn": "minimax-cn",
-    "claude": "anthropic",
-    "claude-code": "anthropic",
-}
-
-
-def _normalize_aux_provider(provider: Optional[str], *, for_vision: bool = False) -> str:
-    normalized = (provider or "auto").strip().lower()
-    if normalized.startswith("custom:"):
-        suffix = normalized.split(":", 1)[1].strip()
-        if not suffix:
-            return "custom"
-        normalized = suffix if not for_vision else "custom"
-    if normalized == "codex":
-        return "openai-codex"
-    if normalized == "main":
-        # Resolve to the user's actual main provider so named custom providers
-        # and non-aggregator providers (DeepSeek, Alibaba, etc.) work correctly.
-        main_prov = _read_main_provider()
-        if main_prov and main_prov not in ("auto", "main", ""):
-            return main_prov
-        return "custom"
-    return _PROVIDER_ALIASES.get(normalized, normalized)
-
 # Default auxiliary models for direct API-key providers (cheap/fast for side tasks)
 _API_KEY_PROVIDER_AUX_MODELS: Dict[str, str] = {
    "gemini": "gemini-3-flash-preview",
    "zai": "glm-4.5-flash",
    "kimi-coding": "kimi-k2-turbo-preview",
-    "minimax": "MiniMax-M2.7",
-    "minimax-cn": "MiniMax-M2.7",
+    "minimax": "MiniMax-M2.7-highspeed",
+    "minimax-cn": "MiniMax-M2.7-highspeed",
    "anthropic": "claude-haiku-4-5-20251001",
    "ai-gateway": "google/gemini-3-flash",
    "opencode-zen": "gemini-3-flash",
@@ -127,7 +92,6 @@ auxiliary_is_nous: bool = False
 _OPENROUTER_MODEL = "google/gemini-3-flash-preview"
 _NOUS_MODEL = "google/gemini-3-flash-preview"
 _NOUS_FREE_TIER_VISION_MODEL = "xiaomi/mimo-v2-omni"
-_NOUS_FREE_TIER_AUX_MODEL = "xiaomi/mimo-v2-pro"
 _NOUS_DEFAULT_BASE_URL = "https://inference-api.nousresearch.com/v1"
 _ANTHROPIC_DEFAULT_BASE_URL = "https://api.anthropic.com"
 _AUTH_JSON_PATH = get_hermes_home() / "auth.json"
@@ -141,23 +105,6 @@ _CODEX_AUX_MODEL = "gpt-5.2-codex"
 _CODEX_AUX_BASE_URL = "https://chatgpt.com/backend-api/codex"


-def _to_openai_base_url(base_url: str) -> str:
-    """Normalize an Anthropic-style base URL to OpenAI-compatible format.
-
-    Some providers (MiniMax, MiniMax-CN) expose an ``/anthropic`` endpoint for
-    the Anthropic Messages API and a separate ``/v1`` endpoint for OpenAI chat
-    completions.  The auxiliary client uses the OpenAI SDK, so it must hit the
-    ``/v1`` surface.  Passing the raw ``inference_base_url`` causes requests to
-    land on ``/anthropic/chat/completions`` — a 404.
-    """
-    url = str(base_url or "").strip().rstrip("/")
-    if url.endswith("/anthropic"):
-        rewritten = url[: -len("/anthropic")] + "/v1"
-        logger.debug("Auxiliary client: rewrote base URL %s → %s", url, rewritten)
-        return rewritten
-    return url
-
-
 def _select_pool_entry(provider: str) -> Tuple[bool, Optional[Any]]:
    """Return (pool_exists_for_provider, selected_entry)."""
    try:
@@ -687,9 +634,7 @@ def _resolve_api_key_provider() -> Tuple[Optional[OpenAI], Optional[str]]:
            if not api_key:
                continue

-            base_url = _to_openai_base_url(
-                _pool_runtime_base_url(entry, pconfig.inference_base_url) or pconfig.inference_base_url
-            )
+            base_url = _pool_runtime_base_url(entry, pconfig.inference_base_url) or pconfig.inference_base_url
            model = _API_KEY_PROVIDER_AUX_MODELS.get(provider_id, "default")
            logger.debug("Auxiliary text client: %s (%s) via pool", pconfig.name, model)
            extra = {}
@@ -706,9 +651,7 @@ def _resolve_api_key_provider() -> Tuple[Optional[OpenAI], Optional[str]]:
        if not api_key:
            continue

-        base_url = _to_openai_base_url(
-            str(creds.get("base_url", "")).strip().rstrip("/") or pconfig.inference_base_url
-        )
+        base_url = str(creds.get("base_url", "")).strip().rstrip("/") or pconfig.inference_base_url
        model = _API_KEY_PROVIDER_AUX_MODELS.get(provider_id, "default")
        logger.debug("Auxiliary text client: %s (%s)", pconfig.name, model)
        extra = {}
@@ -770,7 +713,7 @@ def _try_openrouter() -> Tuple[Optional[OpenAI], Optional[str]]:
                   default_headers=_OR_HEADERS), _OPENROUTER_MODEL


-def _try_nous(vision: bool = False) -> Tuple[Optional[OpenAI], Optional[str]]:
+def _try_nous() -> Tuple[Optional[OpenAI], Optional[str]]:
    nous = _read_nous_auth()
    if not nous:
        return None, None
@@ -782,13 +725,12 @@ def _try_nous(vision: bool = False) -> Tuple[Optional[OpenAI], Optional[str]]:
    else:
        model = _NOUS_MODEL
    # Free-tier users can't use paid auxiliary models — use the free
-    # models instead: mimo-v2-omni for vision, mimo-v2-pro for text tasks.
+    # multimodal model instead so vision/browser-vision still works.
    try:
        from hermes_cli.models import check_nous_free_tier
        if check_nous_free_tier():
-            model = _NOUS_FREE_TIER_VISION_MODEL if vision else _NOUS_FREE_TIER_AUX_MODEL
-            logger.debug("Free-tier Nous account — using %s for auxiliary/%s",
-                         model, "vision" if vision else "text")
+            model = _NOUS_FREE_TIER_VISION_MODEL
+            logger.debug("Free-tier Nous account — using %s for auxiliary/vision", model)
    except Exception:
        pass
    return (
@@ -1196,7 +1138,17 @@ def resolve_provider_client(
        (client, resolved_model) or (None, None) if auth is unavailable.
    """
    # Normalise aliases
-    provider = _normalize_aux_provider(provider)
+    provider = (provider or "auto").strip().lower()
+    if provider == "codex":
+        provider = "openai-codex"
+    if provider == "main":
+        # Resolve to the user's actual main provider so named custom providers
+        # and non-aggregator providers (DeepSeek, Alibaba, etc.) work correctly.
+        main_prov = _read_main_provider()
+        if main_prov and main_prov not in ("auto", "main", ""):
+            provider = main_prov
+        else:
+            provider = "custom"

    # ── Auto: try all providers in priority order ────────────────────
    if provider == "auto":
@@ -1346,9 +1298,7 @@ def resolve_provider_client(
                         provider, ", ".join(tried_sources))
            return None, None

-        base_url = _to_openai_base_url(
-            str(creds.get("base_url", "")).strip().rstrip("/") or pconfig.inference_base_url
-        )
+        base_url = str(creds.get("base_url", "")).strip().rstrip("/") or pconfig.inference_base_url

        default_model = _API_KEY_PROVIDER_AUX_MODELS.get(provider, "")
        final_model = model or default_model
@@ -1425,11 +1375,24 @@ def get_async_text_auxiliary_client(task: str = ""):
 _VISION_AUTO_PROVIDER_ORDER = (
    "openrouter",
    "nous",
+    "openai-codex",
+    "anthropic",
+    "custom",
 )


 def _normalize_vision_provider(provider: Optional[str]) -> str:
-    return _normalize_aux_provider(provider, for_vision=True)
+    provider = (provider or "auto").strip().lower()
+    if provider == "codex":
+        return "openai-codex"
+    if provider == "main":
+        # Resolve to actual main provider — named custom providers and
+        # non-aggregator providers need to pass through as their real name.
+        main_prov = _read_main_provider()
+        if main_prov and main_prov not in ("auto", "main", ""):
+            return main_prov
+        return "custom"
+    return provider


 def _resolve_strict_vision_backend(provider: str) -> Tuple[Optional[Any], Optional[str]]:
@@ -1437,7 +1400,7 @@ def _resolve_strict_vision_backend(provider: str) -> Tuple[Optional[Any], Option
    if provider == "openrouter":
        return _try_openrouter()
    if provider == "nous":
-        return _try_nous(vision=True)
+        return _try_nous()
    if provider == "openai-codex":
        return _try_codex()
    if provider == "anthropic":
@@ -1470,26 +1433,17 @@ def _preferred_main_vision_provider() -> Optional[str]:
 def get_available_vision_backends() -> List[str]:
    """Return the currently available vision backends in auto-selection order.

-    Order: active provider → OpenRouter → Nous → stop.  This is the single
-    source of truth for setup, tool gating, and runtime auto-routing of
-    vision tasks.
+    This is the single source of truth for setup, tool gating, and runtime
+    auto-routing of vision tasks. The selected main provider is preferred when
+    it is also a known-good vision backend; otherwise Hermes falls back through
+    the standard conservative order.
    """
-    available: List[str] = []
-    # 1. Active provider — if the user configured a provider, try it first.
-    main_provider = _read_main_provider()
-    if main_provider and main_provider not in ("auto", ""):
-        if main_provider in _VISION_AUTO_PROVIDER_ORDER:
-            if _strict_vision_backend_available(main_provider):
-                available.append(main_provider)
-        else:
-            client, _ = resolve_provider_client(main_provider, _read_main_model())
-            if client is not None:
-                available.append(main_provider)
-    # 2. OpenRouter, 3. Nous — skip if already covered by main provider.
-    for p in _VISION_AUTO_PROVIDER_ORDER:
-        if p not in available and _strict_vision_backend_available(p):
-            available.append(p)
-    return available
+    ordered = list(_VISION_AUTO_PROVIDER_ORDER)
+    preferred = _preferred_main_vision_provider()
+    if preferred in ordered:
+        ordered.remove(preferred)
+        ordered.insert(0, preferred)
+    return [provider for provider in ordered if _strict_vision_backend_available(provider)]


 def resolve_vision_provider_client(
@@ -1534,39 +1488,16 @@ def resolve_vision_provider_client(
        return "custom", client, final_model

    if requested == "auto":
-        # Vision auto-detection order:
-        #   1. Active provider + model (user's main chat config)
-        #   2. OpenRouter  (known vision-capable default model)
-        #   3. Nous Portal (known vision-capable default model)
-        #   4. Stop
-        main_provider = _read_main_provider()
-        main_model = _read_main_model()
-        if main_provider and main_provider not in ("auto", ""):
-            if main_provider in _VISION_AUTO_PROVIDER_ORDER:
-                # Known strict backend — use its defaults.
-                sync_client, default_model = _resolve_strict_vision_backend(main_provider)
-                if sync_client is not None:
-                    return _finalize(main_provider, sync_client, default_model)
-            else:
-                # Exotic provider (DeepSeek, Alibaba, named custom, etc.)
-                rpc_client, rpc_model = resolve_provider_client(
-                    main_provider, main_model)
-                if rpc_client is not None:
-                    logger.info(
-                        "Vision auto-detect: using active provider %s (%s)",
-                        main_provider, rpc_model or main_model,
-                    )
-                    return _finalize(
-                        main_provider, rpc_client, rpc_model or main_model)
+        ordered = list(_VISION_AUTO_PROVIDER_ORDER)
+        preferred = _preferred_main_vision_provider()
+        if preferred in ordered:
+            ordered.remove(preferred)
+            ordered.insert(0, preferred)

-        # Fall back through aggregators.
-        for candidate in _VISION_AUTO_PROVIDER_ORDER:
-            if candidate == main_provider:
-                continue  # already tried above
+        for candidate in ordered:
            sync_client, default_model = _resolve_strict_vision_backend(candidate)
            if sync_client is not None:
                return _finalize(candidate, sync_client, default_model)
-
        logger.debug("Auxiliary vision client: none available")
        return None, None, None
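
Review note on the vision auto-routing hunks above: a minimal sketch of the preferred-first ordering that both `get_available_vision_backends` and the `requested == "auto"` branch now share. The tuple is copied from the `+` lines. `order_vision_candidates`, `first_available`, and the `is_available` callback are hypothetical names standing in for the real `_preferred_main_vision_provider` and `_strict_vision_backend_available` helpers, whose bodies are not shown in this diff.

```python
from typing import Callable, List, Optional, Sequence

# Copied from the post-change _VISION_AUTO_PROVIDER_ORDER in this diff.
VISION_AUTO_PROVIDER_ORDER = (
    "openrouter",
    "nous",
    "openai-codex",
    "anthropic",
    "custom",
)


def order_vision_candidates(
    preferred: Optional[str],
    order: Sequence[str] = VISION_AUTO_PROVIDER_ORDER,
) -> List[str]:
    """Move the preferred provider to the front if it is a known backend."""
    ordered = list(order)
    if preferred in ordered:
        ordered.remove(preferred)
        ordered.insert(0, preferred)
    return ordered


def first_available(
    preferred: Optional[str],
    is_available: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate the (hypothetical) availability check accepts."""
    for candidate in order_vision_candidates(preferred):
        if is_available(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(order_vision_candidates("anthropic"))
    print(first_available("anthropic", lambda name: name in {"nous", "custom"}))
```

As far as these hunks show, a preferred provider outside the tuple leaves the order unchanged, so the removed "exotic provider" path through `resolve_provider_client` has no counterpart in auto mode.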

@@ -26,14 +26,12 @@ _PROVIDER_PREFIXES: frozenset[str] = frozenset({
    "openrouter", "nous", "openai-codex", "copilot", "copilot-acp",
    "gemini", "zai", "kimi-coding", "minimax", "minimax-cn", "anthropic", "deepseek",
    "opencode-zen", "opencode-go", "ai-gateway", "kilocode", "alibaba",
-    "qwen-oauth",
    "custom", "local",
    # Common aliases
    "google", "google-gemini", "google-ai-studio",
    "glm", "z-ai", "z.ai", "zhipu", "github", "github-copilot",
    "github-models", "kimi", "moonshot", "claude", "deep-seek",
    "opencode", "zen", "go", "vercel", "kilo", "dashscope", "aliyun", "qwen",
-    "qwen-portal",
 })


@@ -115,15 +113,8 @@ DEFAULT_CONTEXT_LENGTHS = {
    "llama": 131072,
    # Qwen
    "qwen": 131072,
-    # MiniMax (lowercase — lookup lowercases model names at line 973)
-    "minimax-m1-256k": 1000000,
-    "minimax-m1-128k": 1000000,
-    "minimax-m1-80k": 1000000,
-    "minimax-m1-40k": 1000000,
-    "minimax-m1": 1000000,
-    "minimax-m2.5": 1048576,
-    "minimax-m2.7": 1048576,
-    "minimax": 1048576,
+    # MiniMax
+    "minimax": 204800,
    # GLM
    "glm": 202752,
    # Kimi
@@ -136,7 +127,7 @@ DEFAULT_CONTEXT_LENGTHS = {
    "deepseek-ai/DeepSeek-V3.2": 65536,
    "moonshotai/Kimi-K2.5": 262144,
    "moonshotai/Kimi-K2-Thinking": 262144,
-    "MiniMaxAI/MiniMax-M2.5": 1048576,
+    "MiniMaxAI/MiniMax-M2.5": 204800,
    "XiaomiMiMo/MiMo-V2-Flash": 32768,
    "mimo-v2-pro": 1048576,
    "mimo-v2-omni": 1048576,
@@ -189,7 +180,6 @@ _URL_TO_PROVIDER: Dict[str, str] = {
    "api.minimax": "minimax",
    "dashscope.aliyuncs.com": "alibaba",
    "dashscope-intl.aliyuncs.com": "alibaba",
-    "portal.qwen.ai": "qwen-oauth",
    "openrouter.ai": "openrouter",
    "generativelanguage.googleapis.com": "gemini",
    "inference-api.nousresearch.com": "nous",
@@ -621,59 +611,6 @@ def _model_id_matches(candidate_id: str, lookup_model: str) -> bool:
    return False


-def query_ollama_num_ctx(model: str, base_url: str) -> Optional[int]:
-    """Query an Ollama server for the model's context length.
-
-    Returns the model's maximum context from GGUF metadata via ``/api/show``,
-    or the explicit ``num_ctx`` from the Modelfile if set.  Returns None if
-    the server is unreachable or not Ollama.
-
-    This is the value that should be passed as ``num_ctx`` in Ollama chat
-    requests to override the default 2048.
-    """
-    import httpx
-
-    bare_model = _strip_provider_prefix(model)
-    server_url = base_url.rstrip("/")
-    if server_url.endswith("/v1"):
-        server_url = server_url[:-3]
-
-    try:
-        server_type = detect_local_server_type(base_url)
-    except Exception:
-        return None
-    if server_type != "ollama":
-        return None
-
-    try:
-        with httpx.Client(timeout=3.0) as client:
-            resp = client.post(f"{server_url}/api/show", json={"name": bare_model})
-            if resp.status_code != 200:
-                return None
-            data = resp.json()
-
-            # Prefer explicit num_ctx from Modelfile parameters (user override)
-            params = data.get("parameters", "")
-            if "num_ctx" in params:
-                for line in params.split("\n"):
-                    if "num_ctx" in line:
-                        parts = line.strip().split()
-                        if len(parts) >= 2:
-                            try:
-                                return int(parts[-1])
-                            except ValueError:
-                                pass
-
-            # Fall back to GGUF model_info context_length (training max)
-            model_info = data.get("model_info", {})
-            for key, value in model_info.items():
-                if "context_length" in key and isinstance(value, (int, float)):
-                    return int(value)
-    except Exception:
-        pass
-    return None
-
-
 def _query_local_context_length(model: str, base_url: str) -> Optional[int]:
    """Query a local server for the model's context length."""
    import httpx
@@ -153,7 +153,6 @@ PROVIDER_TO_MODELS_DEV: Dict[str, str] = {
    "minimax-cn": "minimax-cn",
    "deepseek": "deepseek",
    "alibaba": "alibaba",
-    "qwen-oauth": "alibaba",
    "copilot": "github-copilot",
    "ai-gateway": "vercel",
    "opencode-zen": "opencode",
@@ -204,30 +204,6 @@ OPENAI_MODEL_EXECUTION_GUIDANCE = (
    "the result.\n"
    "</tool_persistence>\n"
    "\n"
-    "<mandatory_tool_use>\n"
-    "NEVER answer these from memory or mental computation — ALWAYS use a tool:\n"
-    "- Arithmetic, math, calculations → use terminal or execute_code\n"
-    "- Hashes, encodings, checksums → use terminal (e.g. sha256sum, base64)\n"
-    "- Current time, date, timezone → use terminal (e.g. date)\n"
-    "- System state: OS, CPU, memory, disk, ports, processes → use terminal\n"
-    "- File contents, sizes, line counts → use read_file, search_files, or terminal\n"
-    "- Git history, branches, diffs → use terminal\n"
-    "- Current facts (weather, news, versions) → use web_search\n"
-    "Your memory and user profile describe the USER, not the system you are "
-    "running on. The execution environment may differ from what the user profile "
-    "says about their personal setup.\n"
-    "</mandatory_tool_use>\n"
-    "\n"
-    "<act_dont_ask>\n"
-    "When a question has an obvious default interpretation, act on it immediately "
-    "instead of asking for clarification. Examples:\n"
-    "- 'Is port 443 open?' → check THIS machine (don't ask 'open where?')\n"
-    "- 'What OS am I running?' → check the live system (don't use user profile)\n"
-    "- 'What time is it?' → run `date` (don't guess)\n"
-    "Only ask for clarification when the ambiguity genuinely changes what tool "
-    "you would call.\n"
-    "</act_dont_ask>\n"
-    "\n"
    "<prerequisite_checks>\n"
    "- Before taking an action, check whether prerequisite discovery, lookup, or "
    "context-gathering steps are needed.\n"
@@ -1,57 +0,0 @@
-"""Retry utilities — jittered backoff for decorrelated retries.
-
-Replaces fixed exponential backoff with jittered delays to prevent
-thundering-herd retry spikes when multiple sessions hit the same
-rate-limited provider concurrently.
-"""
-
-import random
-import threading
-import time
-
-# Monotonic counter for jitter seed uniqueness within the same process.
-# Protected by a lock to avoid race conditions in concurrent retry paths
-# (e.g. multiple gateway sessions retrying simultaneously).
-_jitter_counter = 0
-_jitter_lock = threading.Lock()
-
-
-def jittered_backoff(
-    attempt: int,
-    *,
-    base_delay: float = 5.0,
-    max_delay: float = 120.0,
-    jitter_ratio: float = 0.5,
-) -> float:
-    """Compute a jittered exponential backoff delay.
-
-    Args:
-        attempt: 1-based retry attempt number.
-        base_delay: Base delay in seconds for attempt 1.
-        max_delay: Maximum delay cap in seconds.
-        jitter_ratio: Fraction of computed delay to use as random jitter
-            range.  0.5 means jitter is uniform in [0, 0.5 * delay].
-
-    Returns:
-        Delay in seconds: min(base * 2^(attempt-1), max_delay) + jitter.
-
-    The jitter decorrelates concurrent retries so multiple sessions
-    hitting the same provider don't all retry at the same instant.
-    """
-    global _jitter_counter
-    with _jitter_lock:
-        _jitter_counter += 1
-        tick = _jitter_counter
-
-    exponent = max(0, attempt - 1)
-    if exponent >= 63 or base_delay <= 0:
-        delay = max_delay
-    else:
-        delay = min(base_delay * (2 ** exponent), max_delay)
-
-    # Seed from time + counter for decorrelation even with coarse clocks.
-    seed = (time.time_ns() ^ (tick * 0x9E3779B9)) & 0xFFFFFFFF
-    rng = random.Random(seed)
-    jitter = rng.uniform(0, jitter_ratio * delay)
-
-    return delay + jitter
@@ -644,14 +644,10 @@ platform_toolsets:
 # Voice Transcription (Speech-to-Text)
 # =============================================================================
 # Automatically transcribe voice messages on messaging platforms.
-# Providers: local (free, faster-whisper) | groq (free tier) | openai (Whisper API) | mistral (Voxtral Transcribe)
-# Set the corresponding API key in .env: GROQ_API_KEY, OPENAI_API_KEY, or MISTRAL_API_KEY.
+# Requires OPENAI_API_KEY in .env (uses OpenAI Whisper API directly).
 stt:
  enabled: true
-  # provider: "local"          # auto-detected if omitted
  model: "whisper-1"  # whisper-1 (cheapest) | gpt-4o-mini-transcribe | gpt-4o-transcribe
-  # mistral:
-  #   model: "voxtral-mini-latest"  # voxtral-mini-latest | voxtral-mini-2602

 # =============================================================================
 # Response Pacing (Messaging Platforms)
@@ -612,11 +612,6 @@ def _run_cleanup():
        pass
    # Shut down memory provider (on_session_end + shutdown_all) at actual
    # session boundary — NOT per-turn inside run_conversation().
-    try:
-        from hermes_cli.plugins import invoke_hook as _invoke_hook
-        _invoke_hook("on_session_finalize", session_id=_active_agent_ref.session_id if _active_agent_ref else None, platform="cli")
-    except Exception:
-        pass
    try:
        if _active_agent_ref and hasattr(_active_agent_ref, 'shutdown_memory_provider'):
            _active_agent_ref.shutdown_memory_provider(
@@ -760,10 +755,7 @@ def _setup_worktree(repo_root: str = None) -> Optional[Dict[str, str]]:
 def _cleanup_worktree(info: Dict[str, str] = None) -> None:
    """Remove a worktree and its branch on exit.

-    Preserves the worktree only if it has unpushed commits (real work
-    that hasn't been pushed to any remote).  Uncommitted changes alone
-    (untracked files, test artifacts) are not enough to keep it — agent
-    work lives in commits/PRs, not the working tree.
+    If the worktree has uncommitted changes, warn and keep it.
    """
    global _active_worktree
    info = info or _active_worktree
@@ -779,27 +771,23 @@ def _cleanup_worktree(info: Dict[str, str] = None) -> None:
    if not Path(wt_path).exists():
        return

-    # Check for unpushed commits — commits reachable from HEAD but not
-    # from any remote branch.  These represent real work the agent did
-    # but didn't push.
-    has_unpushed = False
+    # Check for uncommitted changes
    try:
-        result = subprocess.run(
-            ["git", "log", "--oneline", "HEAD", "--not", "--remotes"],
+        status = subprocess.run(
+            ["git", "status", "--porcelain"],
            capture_output=True, text=True, timeout=10, cwd=wt_path,
        )
-        has_unpushed = bool(result.stdout.strip())
+        has_changes = bool(status.stdout.strip())
    except Exception:
-        has_unpushed = True  # Assume unpushed on error — don't delete
+        has_changes = True  # Assume dirty on error — don't delete

-    if has_unpushed:
-        print(f"\n\033[33m⚠ Worktree has unpushed commits, keeping: {wt_path}\033[0m")
-        print(f"  To clean up manually: git worktree remove --force {wt_path}")
+    if has_changes:
+        print(f"\n\033[33m⚠ Worktree has uncommitted changes, keeping: {wt_path}\033[0m")
+        print(f"  To clean up manually: git worktree remove {wt_path}")
        _active_worktree = None
        return

-    # Remove worktree (even if working tree is dirty — uncommitted
-    # changes without unpushed commits are just artifacts)
+    # Remove worktree
    try:
        subprocess.run(
            ["git", "worktree", "remove", wt_path, "--force"],
@@ -808,7 +796,7 @@ def _cleanup_worktree(info: Dict[str, str] = None) -> None:
    except Exception as e:
        logger.debug("Failed to remove worktree: %s", e)

-    # Delete the branch
+    # Delete the branch (only if it was never pushed / has no upstream)
    try:
        subprocess.run(
            ["git", "branch", "-D", branch],
@@ -822,27 +810,19 @@ def _cleanup_worktree(info: Dict[str, str] = None) -> None:


 def _prune_stale_worktrees(repo_root: str, max_age_hours: int = 24) -> None:
-    """Remove stale worktrees and orphaned branches on startup.
+    """Remove worktrees older than max_age_hours that have no uncommitted changes.

-    Age-based tiers:
-    - Under max_age_hours (24h): skip — session may still be active.
-    - 24h–72h: remove if no unpushed commits.
-    - Over 72h: force remove regardless (nothing should sit this long).
-
-    Also prunes orphaned ``hermes/*`` and ``pr-*`` local branches that
-    have no corresponding worktree.
+    Runs silently on startup to clean up after crashed/killed sessions.
    """
    import subprocess
    import time

    worktrees_dir = Path(repo_root) / ".worktrees"
    if not worktrees_dir.exists():
-        _prune_orphaned_branches(repo_root)
        return

    now = time.time()
-    soft_cutoff = now - (max_age_hours * 3600)       # 24h default
-    hard_cutoff = now - (max_age_hours * 3 * 3600)   # 72h default
+    cutoff = now - (max_age_hours * 3600)

    for entry in worktrees_dir.iterdir():
        if not entry.is_dir() or not entry.name.startswith("hermes-"):
@@ -851,24 +831,21 @@ def _prune_stale_worktrees(repo_root: str, max_age_hours: int = 24) -> None:
        # Check age
        try:
            mtime = entry.stat().st_mtime
-            if mtime > soft_cutoff:
+            if mtime > cutoff:
                continue  # Too recent — skip
        except Exception:
            continue

-        force = mtime <= hard_cutoff  # Over 72h — force remove
-
-        if not force:
-            # 24h–72h tier: only remove if no unpushed commits
-            try:
-                result = subprocess.run(
-                    ["git", "log", "--oneline", "HEAD", "--not", "--remotes"],
-                    capture_output=True, text=True, timeout=5, cwd=str(entry),
-                )
-                if result.stdout.strip():
-                    continue  # Has unpushed commits — skip
-            except Exception:
-                continue  # Can't check — skip
+        # Check for uncommitted changes
+        try:
+            status = subprocess.run(
+                ["git", "status", "--porcelain"],
+                capture_output=True, text=True, timeout=5, cwd=str(entry),
+            )
+            if status.stdout.strip():
+                continue  # Has changes — skip
+        except Exception:
+            continue  # Can't check — skip

        # Safe to remove
        try:
@@ -887,81 +864,10 @@ def _prune_stale_worktrees(repo_root: str, max_age_hours: int = 24) -> None:
                    ["git", "branch", "-D", branch],
                    capture_output=True, text=True, timeout=10, cwd=repo_root,
                )
-            logger.debug("Pruned stale worktree: %s (force=%s)", entry.name, force)
+            logger.debug("Pruned stale worktree: %s", entry.name)
        except Exception as e:
            logger.debug("Failed to prune worktree %s: %s", entry.name, e)

-    _prune_orphaned_branches(repo_root)
-
-
-def _prune_orphaned_branches(repo_root: str) -> None:
-    """Delete local ``hermes/hermes-*`` and ``pr-*`` branches with no worktree.
-
-    These are auto-generated by ``hermes -w`` sessions and PR review
-    workflows respectively.  Once their worktree is gone they serve no
-    purpose and just accumulate.
-    """
-    import subprocess
-
-    try:
-        result = subprocess.run(
-            ["git", "branch", "--format=%(refname:short)"],
-            capture_output=True, text=True, timeout=10, cwd=repo_root,
-        )
-        if result.returncode != 0:
-            return
-        all_branches = [b.strip() for b in result.stdout.strip().split("\n") if b.strip()]
-    except Exception:
-        return
-
-    # Collect branches that are actively checked out in a worktree
-    active_branches: set = set()
-    try:
-        wt_result = subprocess.run(
-            ["git", "worktree", "list", "--porcelain"],
-            capture_output=True, text=True, timeout=10, cwd=repo_root,
-        )
-        for line in wt_result.stdout.split("\n"):
-            if line.startswith("branch refs/heads/"):
-                active_branches.add(line.split("branch refs/heads/", 1)[-1].strip())
-    except Exception:
-        return  # Can't determine active branches — bail
-
-    # Also protect the currently checked-out branch and main
-    try:
-        head_result = subprocess.run(
-            ["git", "branch", "--show-current"],
-            capture_output=True, text=True, timeout=5, cwd=repo_root,
-        )
-        current = head_result.stdout.strip()
-        if current:
-            active_branches.add(current)
-    except Exception:
-        pass
-    active_branches.add("main")
-
-    orphaned = [
-        b for b in all_branches
-        if b not in active_branches
-        and (b.startswith("hermes/hermes-") or b.startswith("pr-"))
-    ]
-
-    if not orphaned:
-        return
-
-    # Delete in batches
-    for i in range(0, len(orphaned), 50):
-        batch = orphaned[i:i + 50]
-        try:
-            subprocess.run(
-                ["git", "branch", "-D"] + batch,
-                capture_output=True, text=True, timeout=30, cwd=repo_root,
-            )
-        except Exception as e:
-            logger.debug("Failed to prune orphaned branches: %s", e)
-
-    logger.debug("Pruned %d orphaned branches", len(orphaned))
-
 # ============================================================================
 # ASCII Art & Branding
 # ============================================================================
@@ -3408,22 +3314,6 @@ class HermesCLI:
        flush_tool_summary()
        print()
    
-    def _notify_session_boundary(self, event_type: str) -> None:
-        """Fire a session-boundary plugin hook (on_session_finalize or on_session_reset).
-
-        Non-blocking — errors are caught and logged.  Safe to call from any
-        lifecycle point (shutdown, /new, /reset).
-        """
-        try:
-            from hermes_cli.plugins import invoke_hook as _invoke_hook
-            _invoke_hook(
-                event_type,
-                session_id=self.agent.session_id if self.agent else None,
-                platform=getattr(self, "platform", None) or "cli",
-            )
-        except Exception:
-            pass
-
    def new_session(self, silent=False):
        """Start a fresh session with a new session ID and cleared agent state."""
        if self.agent and self.conversation_history:
@@ -3431,10 +3321,6 @@ class HermesCLI:
                self.agent.flush_memories(self.conversation_history)
            except (Exception, KeyboardInterrupt):
                pass
-            self._notify_session_boundary("on_session_finalize")
-        elif self.agent:
-            # First session or empty history — still finalize the old session
-            self._notify_session_boundary("on_session_finalize")

        old_session_id = self.session_id
        if self._session_db and old_session_id:
@@ -3479,7 +3365,6 @@ class HermesCLI:
                    )
                except Exception:
                    pass
-            self._notify_session_boundary("on_session_reset")

        if not silent:
            print("(^_^)v New session started!")
@@ -4668,13 +4553,13 @@ class HermesCLI:
                            if output:
                                self.console.print(_rich_text_from_ansi(output))
                            else:
-                                self.console.print("[dim]Command returned no output[/]")
+                                ChatConsole().print("[dim]Command returned no output[/]")
                        except subprocess.TimeoutExpired:
-                            self.console.print("[bold red]Quick command timed out (30s)[/]")
+                            ChatConsole().print("[bold red]Quick command timed out (30s)[/]")
                        except Exception as e:
-                            self.console.print(f"[bold red]Quick command error: {e}[/]")
+                            ChatConsole().print(f"[bold red]Quick command error: {e}[/]")
                    else:
-                        self.console.print(f"[bold red]Quick command '{base_cmd}' has no command defined[/]")
+                        ChatConsole().print(f"[bold red]Quick command '{base_cmd}' has no command defined[/]")
                elif qcmd.get("type") == "alias":
                    target = qcmd.get("target", "").strip()
                    if target:
@@ -4683,9 +4568,9 @@ class HermesCLI:
                        aliased_command = f"{target} {user_args}".strip()
                        return self.process_command(aliased_command)
                    else:
-                        self.console.print(f"[bold red]Quick command '{base_cmd}' has no target defined[/]")
+                        ChatConsole().print(f"[bold red]Quick command '{base_cmd}' has no target defined[/]")
                else:
-                    self.console.print(f"[bold red]Quick command '{base_cmd}' has unsupported type (supported: 'exec', 'alias')[/]")
+                    ChatConsole().print(f"[bold red]Quick command '{base_cmd}' has unsupported type (supported: 'exec', 'alias')[/]")
            # Check for plugin-registered slash commands
            elif base_cmd.lstrip("/") in _get_plugin_cmd_handler_names():
                from hermes_cli.plugins import get_plugin_command_handler
@@ -574,16 +574,12 @@ def remove_job(job_id: str) -> bool:
    return False


-def mark_job_run(job_id: str, success: bool, error: Optional[str] = None,
-                 delivery_error: Optional[str] = None):
+def mark_job_run(job_id: str, success: bool, error: Optional[str] = None):
    """
    Mark a job as having been run.
    
    Updates last_run_at, last_status, increments completed count,
    computes next_run_at, and auto-deletes if repeat limit reached.
-
-    ``delivery_error`` is tracked separately from the agent error — a job
-    can succeed (agent produced output) but fail delivery (platform down).
    """
    jobs = load_jobs()
    for i, job in enumerate(jobs):
@@ -592,8 +588,6 @@ def mark_job_run(job_id: str, success: bool, error: Optional[str] = None,
            job["last_run_at"] = now
            job["last_status"] = "ok" if success else "error"
            job["last_error"] = error if not success else None
-            # Track delivery failures separately — cleared on successful delivery
-            job["last_delivery_error"] = delivery_error
            
            # Increment completed count
            if job.get("repeat"):
@@ -196,7 +196,7 @@ def _send_media_via_adapter(adapter, chat_id: str, media_files: list, metadata:
            logger.warning("Job '%s': failed to send media %s: %s", job.get("id", "?"), media_path, e)


-def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Optional[str]:
+def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> None:
    """
    Deliver job output to the configured target (origin chat, specific platform, etc.).

@@ -204,16 +204,16 @@ def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Option
    use the live adapter first — this supports E2EE rooms (e.g. Matrix) where
    the standalone HTTP path cannot encrypt.  Falls back to standalone send if
    the adapter path fails or is unavailable.
-
-    Returns None on success, or an error string on failure.
    """
    target = _resolve_delivery_target(job)
    if not target:
        if job.get("deliver", "local") != "local":
-            msg = f"no delivery target resolved for deliver={job.get('deliver', 'local')}"
-            logger.warning("Job '%s': %s", job["id"], msg)
-            return msg
-        return None  # local-only jobs don't deliver — not a failure
+            logger.warning(
+                "Job '%s' deliver=%s but no concrete delivery target could be resolved",
+                job["id"],
+                job.get("deliver", "local"),
+            )
+        return

    platform_name = target["platform"]
    chat_id = target["chat_id"]
@@ -239,22 +239,19 @@ def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Option
    }
    platform = platform_map.get(platform_name.lower())
    if not platform:
-        msg = f"unknown platform '{platform_name}'"
-        logger.warning("Job '%s': %s", job["id"], msg)
-        return msg
+        logger.warning("Job '%s': unknown platform '%s' for delivery", job["id"], platform_name)
+        return

    try:
        config = load_gateway_config()
    except Exception as e:
-        msg = f"failed to load gateway config: {e}"
-        logger.error("Job '%s': %s", job["id"], msg)
-        return msg
+        logger.error("Job '%s': failed to load gateway config for delivery: %s", job["id"], e)
+        return

    pconfig = config.platforms.get(platform)
    if not pconfig or not pconfig.enabled:
-        msg = f"platform '{platform_name}' not configured/enabled"
-        logger.warning("Job '%s': %s", job["id"], msg)
-        return msg
+        logger.warning("Job '%s': platform '%s' not configured/enabled", job["id"], platform_name)
+        return

    # Optionally wrap the content with a header/footer so the user knows this
    # is a cron delivery.  Wrapping is on by default; set cron.wrap_response: false
@@ -310,7 +307,7 @@ def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Option

            if adapter_ok:
                logger.info("Job '%s': delivered to %s:%s via live adapter", job["id"], platform_name, chat_id)
-                return None
+                return
        except Exception as e:
            logger.warning(
                "Job '%s': live adapter delivery to %s:%s failed (%s), falling back to standalone",
@@ -332,17 +329,13 @@ def _deliver_result(job: dict, content: str, adapters=None, loop=None) -> Option
            future = pool.submit(asyncio.run, _send_to_platform(platform, pconfig, chat_id, cleaned_delivery_content, thread_id=thread_id, media_files=media_files))
            result = future.result(timeout=30)
    except Exception as e:
-        msg = f"delivery to {platform_name}:{chat_id} failed: {e}"
-        logger.error("Job '%s': %s", job["id"], msg)
-        return msg
+        logger.error("Job '%s': delivery to %s:%s failed: %s", job["id"], platform_name, chat_id, e)
+        return

    if result and result.get("error"):
-        msg = f"delivery error: {result['error']}"
-        logger.error("Job '%s': %s", job["id"], msg)
-        return msg
-
-    logger.info("Job '%s': delivered to %s:%s", job["id"], platform_name, chat_id)
-    return None
+        logger.error("Job '%s': delivery error: %s", job["id"], result["error"])
+    else:
+        logger.info("Job '%s': delivered to %s:%s", job["id"], platform_name, chat_id)


 _SCRIPT_TIMEOUT = 120  # seconds
@@ -585,9 +578,11 @@ def run_job(job: dict) -> tuple[bool, str, str, Optional[str]]:
        except Exception as e:
            logger.warning("Job '%s': failed to load config.yaml, using defaults: %s", job_id, e)

-        # Reasoning config from config.yaml
+        # Reasoning config from env or config.yaml
        from hermes_constants import parse_reasoning_effort
-        effort = str(_cfg.get("agent", {}).get("reasoning_effort", "")).strip()
+        effort = os.getenv("HERMES_REASONING_EFFORT", "").strip()
+        if not effort:
+            effort = str(_cfg.get("agent", {}).get("reasoning_effort", "")).strip()
        reasoning_config = parse_reasoning_effort(effort)

        # Prefill messages from env or config.yaml
@@ -873,15 +868,13 @@ def tick(verbose: bool = True, adapters=None, loop=None) -> int:
                    logger.info("Job '%s': agent returned %s — skipping delivery", job["id"], SILENT_MARKER)
                    should_deliver = False

-                delivery_error = None
                if should_deliver:
                    try:
-                        delivery_error = _deliver_result(job, deliver_content, adapters=adapters, loop=loop)
+                        _deliver_result(job, deliver_content, adapters=adapters, loop=loop)
                    except Exception as de:
-                        delivery_error = str(de)
                        logger.error("Delivery failed for job %s: %s", job["id"], de)

-                mark_job_run(job["id"], success, error, delivery_error=delivery_error)
+                mark_job_run(job["id"], success, error)
                executed += 1

            except Exception as e:
@@ -21,8 +21,6 @@ from dataclasses import dataclass, field
 from typing import Any, Dict, List, Optional, Set

 from model_tools import handle_function_call
-from tools.terminal_tool import get_active_env
-from tools.tool_result_storage import maybe_persist_tool_result, enforce_turn_budget

 # Thread pool for running sync tool calls that internally use asyncio.run()
 # (e.g., the Modal/Docker/Daytona terminal backends). Running them in a separate
@@ -138,10 +136,8 @@ class HermesAgentLoop:
        max_turns: int = 30,
        task_id: Optional[str] = None,
        temperature: float = 1.0,
-        top_p: Optional[float] = None,
        max_tokens: Optional[int] = None,
        extra_body: Optional[Dict[str, Any]] = None,
-        budget_config: Optional["BudgetConfig"] = None,
    ):
        """
        Initialize the agent loop.
@@ -154,26 +150,19 @@ class HermesAgentLoop:
            max_turns: Maximum number of LLM calls before stopping
            task_id: Unique ID for terminal/browser session isolation
            temperature: Sampling temperature for generation
-            top_p: Nucleus sampling top_p (None = omit, use provider default)
            max_tokens: Max tokens per generation (None for server default)
            extra_body: Extra parameters passed to the OpenAI client's create() call.
                        Used for OpenRouter provider preferences, transforms, etc.
                        e.g. {"provider": {"ignore": ["DeepInfra"]}}
-            budget_config: Tool result persistence budget. Controls per-tool
-                        thresholds, per-turn aggregate budget, and preview size.
-                        If None, uses DEFAULT_BUDGET (current hardcoded values).
        """
-        from tools.budget_config import DEFAULT_BUDGET
        self.server = server
        self.tool_schemas = tool_schemas
        self.valid_tool_names = valid_tool_names
        self.max_turns = max_turns
        self.task_id = task_id or str(uuid.uuid4())
        self.temperature = temperature
-        self.top_p = top_p
        self.max_tokens = max_tokens
        self.extra_body = extra_body
-        self.budget_config = budget_config or DEFAULT_BUDGET

    async def run(self, messages: List[Dict[str, Any]]) -> AgentResult:
        """
@@ -214,9 +203,6 @@ class HermesAgentLoop:
                "temperature": self.temperature,
            }

-            if self.top_p is not None:
-                chat_kwargs["top_p"] = self.top_p
-
            # Only pass tools if we have them
            if self.tool_schemas:
                chat_kwargs["tools"] = self.tool_schemas
@@ -231,35 +217,20 @@ class HermesAgentLoop:
                chat_kwargs["extra_body"] = self.extra_body

            # Make the API call -- standard OpenAI spec
-            # Retry on timeout/connection errors (provider queuing, rate limits)
            api_start = _time.monotonic()
-            response = None
-            max_retries = 3
-            for attempt in range(max_retries):
-                try:
-                    response = await self.server.chat_completion(**chat_kwargs)
-                    break
-                except Exception as e:
-                    api_elapsed = _time.monotonic() - api_start
-                    is_retryable = "timeout" in type(e).__name__.lower() or "connection" in type(e).__name__.lower()
-                    if is_retryable and attempt < max_retries - 1:
-                        wait = 2 ** attempt
-                        logger.warning(
-                            "[%s] API call timed out on turn %d attempt %d (%.1fs), retrying in %ds: %s",
-                            self.task_id[:8], turn + 1, attempt + 1, api_elapsed, wait, e,
-                        )
-                        await asyncio.sleep(wait)
-                        api_start = _time.monotonic()
-                        continue
-                    logger.error("API call failed on turn %d (%.1fs): %s", turn + 1, api_elapsed, e)
-                    return AgentResult(
-                        messages=messages,
-                        managed_state=self._get_managed_state(),
-                        turns_used=turn + 1,
-                        finished_naturally=False,
-                        reasoning_per_turn=reasoning_per_turn,
-                        tool_errors=tool_errors,
-                    )
+            try:
+                response = await self.server.chat_completion(**chat_kwargs)
+            except Exception as e:
+                api_elapsed = _time.monotonic() - api_start
+                logger.error("API call failed on turn %d (%.1fs): %s", turn + 1, api_elapsed, e)
+                return AgentResult(
+                    messages=messages,
+                    managed_state=self._get_managed_state(),
+                    turns_used=turn + 1,
+                    finished_naturally=False,
+                    reasoning_per_turn=reasoning_per_turn,
+                    tool_errors=tool_errors,
+                )

            api_elapsed = _time.monotonic() - api_start

@@ -475,15 +446,8 @@ class HermesAgentLoop:
                        except (json.JSONDecodeError, TypeError):
                            pass

+                    # Add tool response to conversation
                    tc_id = tc.get("id", "") if isinstance(tc, dict) else tc.id
-                    tool_result = maybe_persist_tool_result(
-                        content=tool_result,
-                        tool_name=tool_name,
-                        tool_use_id=tc_id,
-                        env=get_active_env(self.task_id),
-                        config=self.budget_config,
-                    )
-
                    messages.append(
                        {
                            "role": "tool",
@@ -492,14 +456,6 @@ class HermesAgentLoop:
                        }
                    )

-                num_tcs = len(assistant_msg.tool_calls)
-                if num_tcs > 0:
-                    enforce_turn_budget(
-                        messages[-num_tcs:],
-                        env=get_active_env(self.task_id),
-                        config=self.budget_config,
-                    )
-
                turn_elapsed = _time.monotonic() - turn_start
                logger.info(
                    "[%s] turn %d: api=%.1fs, %d tools, turn_total=%.1fs",
@@ -1048,7 +1048,6 @@ class AgenticOPDEnv(HermesAgentBaseEnv):
                    temperature=0.0,
                    max_tokens=self.config.max_token_length,
                    extra_body=self.config.extra_body,
-                    budget_config=self.config.build_budget_config(),
                )
                result = await agent.run(messages)

@@ -15,15 +15,15 @@

 env:
  enabled_toolsets: ["terminal", "file"]
-  max_agent_turns: 100
+  max_agent_turns: 60
  max_token_length: 32000
-  agent_temperature: 1.0
+  agent_temperature: 0.8
  terminal_backend: "modal"
-  terminal_timeout: 300 # 5 min per command (builds, pip install)
-  tool_pool_size: 128 # thread pool for 89 parallel tasks
-  dataset_name: "NousResearch/terminal-bench-2-verified-flattened"
+  terminal_timeout: 300        # 5 min per command (builds, pip install)
+  tool_pool_size: 128          # thread pool for 89 parallel tasks
+  dataset_name: "NousResearch/terminal-bench-2"
  test_timeout: 600
-  task_timeout: 900 # 15 min wall-clock per task, auto-FAIL if exceeded
+  task_timeout: 1800           # 30 min wall-clock per task, auto-FAIL if exceeded
  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
  use_wandb: true
  wandb_name: "terminal-bench-2"
@@ -33,15 +33,10 @@ env:
  # Modal's blocking calls (App.lookup, etc.) deadlock when too many sandboxes
  # are created simultaneously inside thread pool workers via asyncio.run().
  max_concurrent_tasks: 8
-  extra_body:
-    provider:
-      order: ["DeepInfra"]
-      allow_fallbacks: false

 openai:
  base_url: "https://openrouter.ai/api/v1"
-  model_name: "nvidia/nemotron-3-super-120b-a12b"
+  model_name: "anthropic/claude-opus-4.6"
  server_type: "openai"
  health_check: false
-  timeout: 300 # 5 min per API call (default 1200s causes 20min stalls)
  # api_key loaded from OPENROUTER_API_KEY in .env
@@ -32,8 +32,8 @@ export PYTHONUNBUFFERED=1
 # These go to the log file; tqdm + [START]/[PASS]/[FAIL] go to terminal
 export LOGLEVEL=INFO

-uv run python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
-  --config environments/benchmarks/terminalbench_2/default.yaml \
+python terminalbench2_env.py evaluate \
+  --config default.yaml \
  "$@" \
  2>&1 | tee "$LOG_FILE"

@@ -52,18 +52,18 @@ _repo_root = Path(__file__).resolve().parent.parent.parent.parent
 if str(_repo_root) not in sys.path:
    sys.path.insert(0, str(_repo_root))

-from atroposlib.envs.base import EvalHandlingEnum
-from atroposlib.envs.server_handling.server_manager import APIServerConfig
 from pydantic import Field

-from agent.prompt_builder import DEFAULT_AGENT_IDENTITY
+from atroposlib.envs.base import EvalHandlingEnum
+from atroposlib.envs.server_handling.server_manager import APIServerConfig
+
 from environments.agent_loop import AgentResult, HermesAgentLoop
 from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
 from environments.tool_context import ToolContext
 from tools.terminal_tool import (
-    cleanup_vm,
-    clear_task_env_overrides,
    register_task_env_overrides,
+    clear_task_env_overrides,
+    cleanup_vm,
 )

 logger = logging.getLogger(__name__)
@@ -73,7 +73,6 @@ logger = logging.getLogger(__name__)
 # Configuration
 # =============================================================================

-
 class TerminalBench2EvalConfig(HermesAgentEnvConfig):
    """
    Configuration for the Terminal-Bench 2.0 evaluation environment.
@@ -139,27 +138,11 @@ class TerminalBench2EvalConfig(HermesAgentEnvConfig):

 # Tasks that cannot run properly on Modal and are excluded from scoring.
 MODAL_INCOMPATIBLE_TASKS = {
-    "qemu-startup",  # Needs KVM/hardware virtualization
-    "qemu-alpine-ssh",  # Needs KVM/hardware virtualization
-    "crack-7z-hash",  # Password brute-force -- too slow for cloud sandbox timeouts
+    "qemu-startup",        # Needs KVM/hardware virtualization
+    "qemu-alpine-ssh",     # Needs KVM/hardware virtualization
+    "crack-7z-hash",       # Password brute-force -- too slow for cloud sandbox timeouts
 }

-# Injected as a user message when the model responds with plain text instead of
-# calling a tool or including a <task_status> tag.
-_FORMAT_NUDGE_MESSAGE = (
-    "You wrote a plain text response instead of using your tools. "
-    "Plain text responses do not affect the environment — nothing was executed or saved.\n\n"
-    "You MUST use your tools (terminal, read_file, write_file) to actually complete the task. "
-    "Do not describe what you would do — execute it now by making tool calls.\n\n"
-    "If you have already completed all required work using tools in previous turns, "
-    "respond with exactly: <task_status>DONE</task_status>\n"
-    "If you have exhausted all approaches and cannot make further progress, "
-    "respond with exactly: <task_status>UNFINISHED</task_status>"
-)
-
-# Maximum number of format nudges before giving up and moving on to scoring.
-_MAX_FORMAT_NUDGES = 3
-

 # =============================================================================
 # Tar extraction helper
@@ -220,6 +203,7 @@ def _safe_extract_tar(tar: tarfile.TarFile, target_dir: Path) -> None:
        except OSError:
            pass

+
 def _extract_base64_tar(b64_data: str, target_dir: Path):
    """Extract a base64-encoded tar.gz archive into target_dir."""
    if not b64_data:
@@ -234,7 +218,6 @@ def _extract_base64_tar(b64_data: str, target_dir: Path):
 # Main Environment
 # =============================================================================

-
 class TerminalBench2EvalEnv(HermesAgentBaseEnv):
    """
    Terminal-Bench 2.0 evaluation environment (eval-only, no training).
@@ -279,18 +262,23 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            enabled_toolsets=["terminal", "file"],
            disabled_toolsets=None,
            distribution=None,
+
            # Agent settings -- TB2 tasks are complex, need many turns
            max_agent_turns=60,
            max_token_length=16000,
            agent_temperature=0.6,
-            system_prompt=DEFAULT_AGENT_IDENTITY,
+            system_prompt=None,
+
            # Modal backend for per-task cloud-isolated sandboxes
            terminal_backend="modal",
-            terminal_timeout=300,  # 5 min per command (builds, pip install, etc.)
+            terminal_timeout=300,   # 5 min per command (builds, pip install, etc.)
+
            # Test execution timeout (TB2 test scripts can install deps like pytest)
            test_timeout=180,
+
            # 89 tasks run in parallel, each needs a thread for tool calls
            tool_pool_size=128,
+
            # --- Eval-only Atropos settings ---
            # These settings make the env work as an eval-only environment:
            #   - STOP_TRAIN: pauses training during eval (standard for eval envs)
@@ -300,6 +288,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            group_size=1,
            steps_per_eval=1,
            total_steps=1,
+
            tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
            use_wandb=True,
            wandb_name="terminal-bench-2",
@@ -347,11 +336,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):

        # Skip tasks incompatible with the current backend (e.g., QEMU on Modal)
        # plus any user-specified skip_tasks
-        skip = (
-            set(MODAL_INCOMPATIBLE_TASKS)
-            if self.config.terminal_backend == "modal"
-            else set()
-        )
+        skip = set(MODAL_INCOMPATIBLE_TASKS) if self.config.terminal_backend == "modal" else set()
        if self.config.skip_tasks:
            skip |= {name.strip() for name in self.config.skip_tasks.split(",")}
        if skip:
@@ -359,9 +344,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            tasks = [t for t in tasks if t["task_name"] not in skip]
            skipped = before - len(tasks)
            if skipped > 0:
-                print(
-                    f"  Skipped {skipped} incompatible tasks: {sorted(skip & {t['task_name'] for t in ds})}"
-                )
+                print(f"  Skipped {skipped} incompatible tasks: {sorted(skip & {t['task_name'] for t in ds})}")

        self.all_eval_items = tasks
        self.iter = 0
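
The skip logic in this hunk can be sketched in isolation as below. This is a minimal illustration and not the environment's own code: `build_skip_set` and `filter_tasks` are hypothetical helper names, the constant is copied from the diff, and dropping empty names from the comma-separated list is an added assumption that the original does not make.

```python
from typing import Any, Dict, List, Set

# Copied from the diff for illustration; the real constant lives in the env module.
MODAL_INCOMPATIBLE_TASKS = {"qemu-startup", "qemu-alpine-ssh", "crack-7z-hash"}


def build_skip_set(terminal_backend: str, skip_tasks: str = "") -> Set[str]:
    """Combine backend-specific exclusions with a comma-separated user list."""
    skip = set(MODAL_INCOMPATIBLE_TASKS) if terminal_backend == "modal" else set()
    if skip_tasks:
        # Assumption: ignore empty entries such as a trailing comma.
        skip |= {name.strip() for name in skip_tasks.split(",") if name.strip()}
    return skip


def filter_tasks(tasks: List[Dict[str, Any]], skip: Set[str]) -> List[Dict[str, Any]]:
    """Keep only the tasks whose task_name is not in the skip set."""
    return [t for t in tasks if t["task_name"] not in skip]
```

With the "modal" backend the three incompatible tasks are always excluded, and any names passed via `skip_tasks` are added on top.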
@@ -371,16 +354,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        for i, task in enumerate(self.all_eval_items):
            self.category_index[task.get("category", "unknown")].append(i)

-        # Pre-compute which tasks need Modal's add_python (avoids re-decoding
-        # multi-MB environment_tar blobs during per-task rollouts).
-        self._needs_add_python: Dict[str, bool] = {
-            task["task_name"]: self._image_needs_add_python(task)
-            for task in self.all_eval_items
-        }
-        add_py_count = sum(self._needs_add_python.values())
-        if add_py_count:
-            print(f"  {add_py_count} tasks need add_python (non-python base image)")
-
        # Reward tracking for wandb logging
        self.eval_metrics: List[Tuple[str, float]] = []

@@ -388,30 +361,15 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        # immediately on completion so data is preserved even on Ctrl+C.
        # Timestamped filename so each run produces a unique file.
        import datetime
-
        log_dir = os.path.join(os.path.dirname(__file__), "logs")
        os.makedirs(log_dir, exist_ok=True)
        run_ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
-        model_name = self.server.servers[0].config.model_name
-        model_slug = model_name.replace("/", "_").replace(":", "_")
-        self._streaming_path = os.path.join(
-            log_dir, f"samples_{run_ts}_{model_slug}.jsonl"
-        )
+        self._streaming_path = os.path.join(log_dir, f"samples_{run_ts}.jsonl")
        self._streaming_file = open(self._streaming_path, "w")
        self._streaming_lock = __import__("threading").Lock()
-        self._run_meta = {
-            "model_name": model_name,
-            "temperature": self.config.agent_temperature,
-            "top_p": self.config.agent_top_p,
-            "max_agent_turns": self.config.max_agent_turns,
-            "task_timeout": self.config.task_timeout,
-            "terminal_backend": self.config.terminal_backend,
-        }
        print(f"  Streaming results to: {self._streaming_path}")

-        print(
-            f"TB2 ready: {len(self.all_eval_items)} tasks across {len(self.category_index)} categories"
-        )
+        print(f"TB2 ready: {len(self.all_eval_items)} tasks across {len(self.category_index)} categories")
        for cat, indices in sorted(self.category_index.items()):
            print(f"  {cat}: {len(indices)} tasks")

@@ -420,9 +378,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        if not hasattr(self, "_streaming_file") or self._streaming_file.closed:
            return
        with self._streaming_lock:
-            self._streaming_file.write(
-                json.dumps(result, ensure_ascii=False, default=str) + "\n"
-            )
+            self._streaming_file.write(json.dumps(result, ensure_ascii=False, default=str) + "\n")
            self._streaming_file.flush()

    # =========================================================================
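
The streaming writer in this hunk appends one JSON object per line and flushes under a lock, so partial results survive an interrupted run. A rough standalone sketch of that pattern follows; `JsonlStreamWriter` is a hypothetical name, and an in-memory buffer stands in for the log file.

```python
import io
import json
import threading
from typing import Any, Dict, IO


class JsonlStreamWriter:
    """Thread-safe, flush-on-write JSONL sink (illustrative only)."""

    def __init__(self, fh: IO[str]) -> None:
        self._fh = fh
        self._lock = threading.Lock()

    def save(self, result: Dict[str, Any]) -> None:
        # Silently skip writes once the file has been closed.
        if self._fh.closed:
            return
        with self._lock:
            # default=str keeps non-JSON types (e.g. Path) from raising.
            self._fh.write(json.dumps(result, ensure_ascii=False, default=str) + "\n")
            self._fh.flush()


# Usage: an in-memory buffer stands in for the samples_<timestamp>.jsonl file.
buf = io.StringIO()
writer = JsonlStreamWriter(buf)
writer.save({"task_name": "example-task", "passed": True})
```

Flushing after each record trades a little throughput for durability, which seems to be the intent stated in the surrounding comments.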
@@ -458,36 +414,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
    # Docker image resolution
    # =========================================================================

-    @staticmethod
-    def _image_needs_add_python(item: Dict[str, Any]) -> bool:
-        """Check if the task's base image lacks `python` on PATH.
-
-        Parses the Dockerfile FROM line in environment_tar. Returns True
-        for non-python base images (ubuntu, debian, etc.) that need
-        Modal's add_python parameter.
-        """
-        environment_tar = item.get("environment_tar", "")
-        if not environment_tar:
-            return False
-        try:
-            raw = base64.b64decode(environment_tar)
-            buf = io.BytesIO(raw)
-            with tarfile.open(fileobj=buf, mode="r:gz") as tar:
-                for member in tar:
-                    if not member.isfile() or "Dockerfile" not in member.name:
-                        continue
-                    f = tar.extractfile(member)
-                    if not f:
-                        continue
-                    for line in f.read().decode("utf-8", errors="ignore").splitlines():
-                        stripped = line.strip()
-                        if stripped.upper().startswith("FROM "):
-                            base = stripped.split()[1].lower()
-                            return not base.startswith("python:")
-        except Exception:
-            pass
-        return False
-
    def _resolve_task_image(
        self, item: Dict[str, Any], task_name: str
    ) -> Tuple[str, Optional[Path]]:
@@ -520,9 +446,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            if dockerfile_path.exists():
                logger.info(
                    "Task %s: building from Dockerfile (force_build=%s, docker_image=%s)",
-                    task_name,
-                    self.config.force_build,
-                    bool(docker_image),
+                    task_name, self.config.force_build, bool(docker_image),
                )
                return str(dockerfile_path), task_dir

@@ -530,80 +454,12 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        if docker_image:
            logger.warning(
                "Task %s: force_build=True but no environment_tar, "
-                "falling back to docker_image %s",
-                task_name,
-                docker_image,
+                "falling back to docker_image %s", task_name, docker_image,
            )
            return docker_image, None

        return "", None

-    # =========================================================================
-    # Agent loop with format nudging
-    # =========================================================================
-
-    async def _run_with_nudges(
-        self,
-        server,
-        tools: List[Dict[str, Any]],
-        valid_names: set,
-        messages: List[Dict[str, Any]],
-        task_id: str,
-        task_name: str,
-    ) -> Tuple["AgentResult", int]:
-        """Run the agent loop, nudging if the model returns plain text without task_status tag."""
-        total_turns_used = 0
-        nudge_count = 0
-        result = None
-
-        while total_turns_used < self.config.max_agent_turns:
-            remaining = self.config.max_agent_turns - total_turns_used
-            agent = HermesAgentLoop(
-                server=server,
-                tool_schemas=tools,
-                valid_tool_names=valid_names,
-                max_turns=remaining,
-                task_id=task_id,
-                temperature=self.config.agent_temperature,
-                top_p=self.config.agent_top_p,
-                max_tokens=self.config.max_token_length,
-                extra_body=self.config.extra_body,
-            )
-            result = await agent.run(messages)
-            total_turns_used += result.turns_used
-
-            if not result.finished_naturally:
-                break
-
-            last_content = next(
-                (
-                    m.get("content", "") or ""
-                    for m in reversed(messages)
-                    if m.get("role") == "assistant"
-                ),
-                "",
-            )
-            if "<task_status>" in last_content:
-                break
-
-            if nudge_count >= _MAX_FORMAT_NUDGES:
-                logger.warning(
-                    "Task %s: model ignored %d format nudges; stopping.",
-                    task_name,
-                    nudge_count,
-                )
-                break
-            nudge_count += 1
-            logger.info(
-                "Task %s: nudging model (nudge %d/%d) — no tool calls and no task_status",
-                task_name,
-                nudge_count,
-                _MAX_FORMAT_NUDGES,
-            )
-            messages.append({"role": "user", "content": _FORMAT_NUDGE_MESSAGE})
-
-        return result, total_turns_used
-
    # =========================================================================
    # Per-task evaluation -- agent loop + test verification
    # =========================================================================
@@ -632,7 +488,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        task_dir = None  # Set if we extract a Dockerfile (needs cleanup)

        from tqdm import tqdm
-
        tqdm.write(f"  [START] {task_name} (task_id={task_id[:8]})")
        task_start = time.time()

@@ -640,32 +495,24 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            # --- 1. Resolve Docker image ---
            modal_image, task_dir = self._resolve_task_image(eval_item, task_name)
            if not modal_image:
-                logger.error(
-                    "Task %s: no docker_image or environment_tar, skipping", task_name
-                )
+                logger.error("Task %s: no docker_image or environment_tar, skipping", task_name)
                return {
-                    "passed": False,
-                    "reward": 0.0,
-                    "task_name": task_name,
-                    "category": category,
+                    "passed": False, "reward": 0.0,
+                    "task_name": task_name, "category": category,
                    "error": "no_image",
                }

            # --- 2. Register per-task image override ---
            # Set both modal_image and docker_image so the task image is used
            # regardless of which backend is configured.
-            overrides = {
+            register_task_env_overrides(task_id, {
                "modal_image": modal_image,
                "docker_image": modal_image,
                "cwd": "/app",
-            }
-            if self._needs_add_python.get(task_name, False):
-                overrides["add_python"] = "3.12"
-            register_task_env_overrides(task_id, overrides)
+            })
            logger.info(
                "Task %s: registered image override for task_id %s",
-                task_name,
-                task_id[:8],
+                task_name, task_id[:8],
            )

            # --- 3. Resolve tools and build messages ---
@@ -673,48 +520,51 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):

            messages: List[Dict[str, Any]] = []
            if self.config.system_prompt:
-                messages.append(
-                    {"role": "system", "content": self.config.system_prompt}
-                )
+                messages.append({"role": "system", "content": self.config.system_prompt})
            messages.append({"role": "user", "content": self.format_prompt(eval_item)})

-            # --- 4. Run agent loop with format enforcement ---
-            # The model must either call a tool or end with <task_status>DONE/UNFINISHED</task_status>.
-            # If it returns plain text without the tag, inject a nudge user message and
-            # continue with the remaining turn budget (up to _MAX_FORMAT_NUDGES times).
+            # --- 4. Run agent loop ---
+            # Use ManagedServer (Phase 2) for vLLM/SGLang backends to get
+            # token-level tracking via /generate. Falls back to direct
+            # ServerManager (Phase 1) for OpenAI endpoints.
            if self._use_managed_server():
                async with self.server.managed_server(
                    tokenizer=self.tokenizer,
                    preserve_think_blocks=bool(self.config.thinking_mode),
                ) as managed:
-                    result, total_turns_used = await self._run_with_nudges(
+                    agent = HermesAgentLoop(
                        server=managed,
-                        tools=tools,
-                        valid_names=valid_names,
-                        messages=messages,
+                        tool_schemas=tools,
+                        valid_tool_names=valid_names,
+                        max_turns=self.config.max_agent_turns,
                        task_id=task_id,
-                        task_name=task_name,
+                        temperature=self.config.agent_temperature,
+                        max_tokens=self.config.max_token_length,
+                        extra_body=self.config.extra_body,
                    )
+                    result = await agent.run(messages)
            else:
-                result, total_turns_used = await self._run_with_nudges(
+                agent = HermesAgentLoop(
                    server=self.server,
-                    tools=tools,
-                    valid_names=valid_names,
-                    messages=messages,
+                    tool_schemas=tools,
+                    valid_tool_names=valid_names,
+                    max_turns=self.config.max_agent_turns,
                    task_id=task_id,
-                    task_name=task_name,
+                    temperature=self.config.agent_temperature,
+                    max_tokens=self.config.max_token_length,
+                    extra_body=self.config.extra_body,
                )
+                result = await agent.run(messages)

            # --- 5. Verify -- run test suite in the agent's sandbox ---
            # Skip verification if the agent produced no meaningful output
            only_system_and_user = all(
-                msg.get("role") in ("system", "user") for msg in messages
+                msg.get("role") in ("system", "user") for msg in result.messages
            )
-            if total_turns_used == 0 or only_system_and_user:
+            if result.turns_used == 0 or only_system_and_user:
                logger.warning(
                    "Task %s: agent produced no output (turns=%d). Reward=0.",
-                    task_name,
-                    total_turns_used,
+                    task_name, result.turns_used,
                )
                reward = 0.0
            else:
@@ -726,10 +576,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
                    loop = asyncio.get_event_loop()
                    reward = await loop.run_in_executor(
                        None,  # default thread pool
-                        self._run_tests,
-                        eval_item,
-                        ctx,
-                        task_name,
+                        self._run_tests, eval_item, ctx, task_name,
                    )
                except Exception as e:
                    logger.error("Task %s: test verification failed: %s", task_name, e)
@@ -740,26 +587,20 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            passed = reward == 1.0
            status = "PASS" if passed else "FAIL"
            elapsed = time.time() - task_start
-            tqdm.write(
-                f"  [{status}] {task_name} (turns={total_turns_used}, {elapsed:.0f}s)"
-            )
+            tqdm.write(f"  [{status}] {task_name} (turns={result.turns_used}, {elapsed:.0f}s)")
            logger.info(
                "Task %s: reward=%.1f, turns=%d, finished=%s",
-                task_name,
-                reward,
-                total_turns_used,
-                result.finished_naturally,
+                task_name, reward, result.turns_used, result.finished_naturally,
            )

            out = {
-                **self._run_meta,
                "passed": passed,
                "reward": reward,
                "task_name": task_name,
                "category": category,
-                "turns_used": total_turns_used,
+                "turns_used": result.turns_used,
                "finished_naturally": result.finished_naturally,
-                "messages": messages,
+                "messages": result.messages,
            }
            self._save_result(out)
            return out
@@ -769,11 +610,8 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            logger.error("Task %s: rollout failed: %s", task_name, e, exc_info=True)
            tqdm.write(f"  [ERROR] {task_name}: {e} ({elapsed:.0f}s)")
            out = {
-                **self._run_meta,
-                "passed": False,
-                "reward": 0.0,
-                "task_name": task_name,
-                "category": category,
+                "passed": False, "reward": 0.0,
+                "task_name": task_name, "category": category,
                "error": str(e),
            }
            self._save_result(out)
@@ -846,8 +684,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        # Execute the test suite
        logger.info(
            "Task %s: running test suite (timeout=%ds)",
-            task_name,
-            self.config.test_timeout,
+            task_name, self.config.test_timeout,
        )
        test_result = ctx.terminal(
            "bash /tests/test.sh",
@@ -880,9 +717,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
                        logger.warning(
                            "Task %s: reward.txt content unexpected (%r), "
                            "falling back to exit_code=%d",
-                            task_name,
-                            content,
-                            exit_code,
+                            task_name, content, exit_code,
                        )
                        reward = 1.0 if exit_code == 0 else 0.0
            else:
@@ -890,17 +725,14 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
                logger.warning(
                    "Task %s: reward.txt not found after download, "
                    "falling back to exit_code=%d",
-                    task_name,
-                    exit_code,
+                    task_name, exit_code,
                )
                reward = 1.0 if exit_code == 0 else 0.0
        except Exception as e:
            logger.warning(
                "Task %s: failed to download verifier dir: %s, "
                "falling back to exit_code=%d",
-                task_name,
-                e,
-                exit_code,
+                task_name, e, exit_code,
            )
            reward = 1.0 if exit_code == 0 else 0.0
        finally:
@@ -911,9 +743,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            output_preview = output[-500:] if output else "(no output)"
            logger.info(
                "Task %s: FAIL (exit_code=%d)\n%s",
-                task_name,
-                exit_code,
-                output_preview,
+                task_name, exit_code, output_preview,
            )

        return reward
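
The fallback chain in `_run_tests` (prefer `reward.txt`, otherwise derive the reward from the test script's exit code) might be condensed as in the sketch below. The diff does not show which file contents count as valid, so treating "1" as pass and "0" as fail is an assumption, and `resolve_reward` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def resolve_reward(reward_file: Optional[Path], exit_code: int) -> float:
    """Return 1.0 or 0.0, preferring reward.txt over the exit code."""
    fallback = 1.0 if exit_code == 0 else 0.0
    if reward_file is None or not reward_file.exists():
        # reward.txt missing or download failed: trust the exit code.
        return fallback
    content = reward_file.read_text().strip()
    # Assumption: the verifier writes "1" for pass and "0" for fail.
    if content == "1":
        return 1.0
    if content == "0":
        return 0.0
    # Unexpected content: fall back to the exit code.
    return fallback
```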
@@ -938,18 +768,12 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            )
        except asyncio.TimeoutError:
            from tqdm import tqdm
-
            elapsed = self.config.task_timeout
-            tqdm.write(
-                f"  [TIMEOUT] {task_name} (exceeded {elapsed}s wall-clock limit)"
-            )
+            tqdm.write(f"  [TIMEOUT] {task_name} (exceeded {elapsed}s wall-clock limit)")
            logger.error("Task %s: wall-clock timeout after %ds", task_name, elapsed)
            out = {
-                **self._run_meta,
-                "passed": False,
-                "reward": 0.0,
-                "task_name": task_name,
-                "category": category,
+                "passed": False, "reward": 0.0,
+                "task_name": task_name, "category": category,
                "error": f"timeout ({elapsed}s)",
            }
            self._save_result(out)
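
The wall-clock limit handled here follows the usual `asyncio.wait_for` pattern: the rollout coroutine is cancelled on timeout and a failing record is returned in its place. A minimal sketch under that assumption, with `run_with_deadline` as a hypothetical wrapper name:

```python
import asyncio
from typing import Any, Awaitable, Dict


async def run_with_deadline(
    coro: Awaitable[Dict[str, Any]], task_name: str, timeout_s: float
) -> Dict[str, Any]:
    """Await a rollout, converting a timeout into an auto-FAIL record."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the inner coroutine at this point.
        return {
            "passed": False,
            "reward": 0.0,
            "task_name": task_name,
            "error": f"timeout ({timeout_s}s)",
        }
```

Note that cancelling the coroutine does not stop work already handed to thread pool workers, which is why the diff also tears down sandboxes and the tool executor at the end of the run.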
@@ -983,25 +807,23 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
                    self.handleError(record)

        handler = _TqdmHandler()
-        handler.setFormatter(
-            logging.Formatter(
-                "%(asctime)s [%(name)s] %(levelname)s: %(message)s",
-                datefmt="%H:%M:%S",
-            )
-        )
+        handler.setFormatter(logging.Formatter(
+            "%(asctime)s [%(name)s] %(levelname)s: %(message)s",
+            datefmt="%H:%M:%S",
+        ))
        root = logging.getLogger()
        root.handlers = [handler]  # Replace any existing handlers
        root.setLevel(logging.INFO)

        # Silence noisy third-party loggers that flood the output
-        logging.getLogger("httpx").setLevel(logging.WARNING)  # Every HTTP request
-        logging.getLogger("openai").setLevel(logging.WARNING)  # OpenAI client retries
-        logging.getLogger("rex-deploy").setLevel(logging.WARNING)  # Swerex deployment
+        logging.getLogger("httpx").setLevel(logging.WARNING)      # Every HTTP request
+        logging.getLogger("openai").setLevel(logging.WARNING)     # OpenAI client retries
+        logging.getLogger("rex-deploy").setLevel(logging.WARNING) # Swerex deployment
        logging.getLogger("rex_image_builder").setLevel(logging.WARNING)  # Image builds

-        print(f"\n{'=' * 60}")
+        print(f"\n{'='*60}")
        print("Starting Terminal-Bench 2.0 Evaluation")
-        print(f"{'=' * 60}")
+        print(f"{'='*60}")
        print(f"  Dataset: {self.config.dataset_name}")
        print(f"  Total tasks: {len(self.all_eval_items)}")
        print(f"  Max agent turns: {self.config.max_agent_turns}")
@@ -1009,11 +831,9 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        print(f"  Terminal backend: {self.config.terminal_backend}")
        print(f"  Tool thread pool: {self.config.tool_pool_size}")
        print(f"  Terminal timeout: {self.config.terminal_timeout}s/cmd")
-        print(
-            f"  Terminal lifetime: {self.config.terminal_lifetime}s (auto: task_timeout + 120)"
-        )
+        print(f"  Terminal lifetime: {self.config.terminal_lifetime}s (auto: task_timeout + 120)")
        print(f"  Max concurrent tasks: {self.config.max_concurrent_tasks}")
-        print(f"{'=' * 60}\n")
+        print(f"{'='*60}\n")

        # Semaphore to limit concurrent Modal sandbox creations.
        # Without this, all 86 tasks fire simultaneously, each creating a Modal
@@ -1055,7 +875,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            await asyncio.gather(*eval_tasks, return_exceptions=True)
            # Belt-and-suspenders: clean up any remaining sandboxes
            from tools.terminal_tool import cleanup_all_environments
-
            cleanup_all_environments()
            print("All sandboxes cleaned up.")
            return
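
The semaphore described in the comments above caps how many sandboxes are created at once (`max_concurrent_tasks`). One way to sketch that bound, assuming each task is a zero-argument coroutine function; `run_bounded` is a hypothetical name and not part of the environment:

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def run_bounded(jobs: List[Callable[[], Awaitable[Any]]], limit: int) -> List[Any]:
    """Run all jobs, allowing at most `limit` to be in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def _guarded(job: Callable[[], Awaitable[Any]]) -> Any:
        async with sem:
            return await job()

    # return_exceptions=True mirrors the gather call in the diff: one failed
    # job does not cancel the others.
    return await asyncio.gather(*(_guarded(j) for j in jobs), return_exceptions=True)
```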
@@ -1101,9 +920,9 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        self.eval_metrics = [(k, v) for k, v in eval_metrics.items()]

        # ---- Print summary ----
-        print(f"\n{'=' * 60}")
+        print(f"\n{'='*60}")
        print("Terminal-Bench 2.0 Evaluation Results")
-        print(f"{'=' * 60}")
+        print(f"{'='*60}")
        print(f"Overall Pass Rate: {overall_pass_rate:.4f} ({passed}/{total})")
        print(f"Evaluation Time: {end_time - start_time:.1f} seconds")

@@ -1123,7 +942,7 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
            extra = f" (error: {error})" if error else ""
            print(f"  [{status}] {r['task_name']} (turns={turns}){extra}")

-        print(f"{'=' * 60}\n")
+        print(f"{'='*60}\n")

        # Build sample records for evaluate_log (includes full conversations)
        samples = [
@@ -1148,7 +967,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
                end_time=end_time,
                generation_parameters={
                    "temperature": self.config.agent_temperature,
-                    "top_p": self.config.agent_top_p,
                    "max_tokens": self.config.max_token_length,
                    "max_agent_turns": self.config.max_agent_turns,
                    "terminal_backend": self.config.terminal_backend,
@@ -1165,7 +983,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        # Kill all remaining sandboxes. Timed-out tasks leave orphaned thread
        # pool workers still executing commands -- cleanup_all stops them.
        from tools.terminal_tool import cleanup_all_environments
-
        print("\nCleaning up all sandboxes...")
        cleanup_all_environments()

@@ -1173,7 +990,6 @@ class TerminalBench2EvalEnv(HermesAgentBaseEnv):
        # tasks are killed immediately instead of retrying against dead
        # sandboxes and spamming the console with TimeoutError warnings.
        from environments.agent_loop import _tool_executor
-
        _tool_executor.shutdown(wait=False, cancel_futures=True)
        print("Done.")

@@ -549,7 +549,6 @@ class YCBenchEvalEnv(HermesAgentBaseEnv):
                temperature=self.config.agent_temperature,
                max_tokens=self.config.max_token_length,
                extra_body=self.config.extra_body,
-                budget_config=self.config.build_budget_config(),
            )
            result = await agent.run(messages)

@@ -62,11 +62,6 @@ from atroposlib.type_definitions import Item

 from environments.agent_loop import AgentResult, HermesAgentLoop
 from environments.tool_context import ToolContext
-from tools.budget_config import (
-    DEFAULT_RESULT_SIZE_CHARS,
-    DEFAULT_TURN_BUDGET_CHARS,
-    DEFAULT_PREVIEW_SIZE_CHARS,
-)

 # Import hermes-agent toolset infrastructure
 from model_tools import get_tool_definitions
@@ -115,10 +110,6 @@ class HermesAgentEnvConfig(BaseEnvConfig):
        default=1.0,
        description="Sampling temperature for agent generation during rollouts.",
    )
-    agent_top_p: Optional[float] = Field(
-        default=None,
-        description="Nucleus sampling top_p for agent generation. None = provider default.",
-    )

    # --- Terminal backend ---
    terminal_backend: str = Field(
@@ -169,32 +160,6 @@ class HermesAgentEnvConfig(BaseEnvConfig):
        "Options: hermes, mistral, llama3_json, qwen, deepseek_v3, etc.",
    )

-    # --- Tool result budget ---
-    # Defaults imported from tools.budget_config (single source of truth).
-    default_result_size_chars: int = Field(
-        default=DEFAULT_RESULT_SIZE_CHARS,
-        description="Default per-tool threshold (chars) for persisting large results "
-        "to sandbox. Results exceeding this are written to /tmp/hermes-results/ "
-        "and replaced with a preview. Per-tool registry values take precedence "
-        "unless overridden via tool_result_overrides.",
-    )
-    turn_budget_chars: int = Field(
-        default=DEFAULT_TURN_BUDGET_CHARS,
-        description="Aggregate char budget per assistant turn. If all tool results "
-        "in a single turn exceed this, the largest are persisted to disk first.",
-    )
-    preview_size_chars: int = Field(
-        default=DEFAULT_PREVIEW_SIZE_CHARS,
-        description="Size of the inline preview shown after a tool result is persisted.",
-    )
-    tool_result_overrides: Optional[Dict[str, int]] = Field(
-        default=None,
-        description="Per-tool threshold overrides (chars). Keys are tool names, "
-        "values are char thresholds. Overrides both the default and registry "
-        "per-tool values. Example: {'terminal': 10000, 'search_files': 5000}. "
-        "Note: read_file is pinned to infinity and cannot be overridden.",
-    )
-
    # --- Provider-specific parameters ---
    # Passed as extra_body to the OpenAI client's chat.completions.create() call.
    # Useful for OpenRouter provider preferences, transforms, route settings, etc.
@@ -211,16 +176,6 @@ class HermesAgentEnvConfig(BaseEnvConfig):
        "transforms, and other provider-specific settings.",
    )

-    def build_budget_config(self):
-        """Build a BudgetConfig from env config fields."""
-        from tools.budget_config import BudgetConfig
-        return BudgetConfig(
-            default_result_size=self.default_result_size_chars,
-            turn_budget=self.turn_budget_chars,
-            preview_size=self.preview_size_chars,
-            tool_overrides=dict(self.tool_result_overrides) if self.tool_result_overrides else {},
-        )
-

 class HermesAgentBaseEnv(BaseEnv):
    """
@@ -533,10 +488,8 @@ class HermesAgentBaseEnv(BaseEnv):
                        max_turns=self.config.max_agent_turns,
                        task_id=task_id,
                        temperature=self.config.agent_temperature,
-                        top_p=self.config.agent_top_p,
                        max_tokens=self.config.max_token_length,
                        extra_body=self.config.extra_body,
-                        budget_config=self.config.build_budget_config(),
                    )
                    result = await agent.run(messages)
            except NotImplementedError:
@@ -552,10 +505,8 @@ class HermesAgentBaseEnv(BaseEnv):
                    max_turns=self.config.max_agent_turns,
                    task_id=task_id,
                    temperature=self.config.agent_temperature,
-                    top_p=self.config.agent_top_p,
                    max_tokens=self.config.max_token_length,
                    extra_body=self.config.extra_body,
-                    budget_config=self.config.build_budget_config(),
                )
                result = await agent.run(messages)
        else:
@@ -567,10 +518,8 @@ class HermesAgentBaseEnv(BaseEnv):
                max_turns=self.config.max_agent_turns,
                task_id=task_id,
                temperature=self.config.agent_temperature,
-                top_p=self.config.agent_top_p,
                max_tokens=self.config.max_token_length,
                extra_body=self.config.extra_body,
-                budget_config=self.config.build_budget_config(),
            )
            result = await agent.run(messages)

@@ -472,7 +472,6 @@ class WebResearchEnv(HermesAgentBaseEnv):
                    temperature=0.0,  # Deterministic for eval
                    max_tokens=self.config.max_token_length,
                    extra_body=self.config.extra_body,
-                    budget_config=self.config.build_budget_config(),
                )
                result = await agent.run(messages)

@@ -712,13 +712,6 @@ def _apply_env_overrides(config: GatewayConfig) -> None:
            name=os.getenv("DISCORD_HOME_CHANNEL_NAME", "Home"),
        )
    
-    # Reply threading mode for Discord (off/first/all)
-    discord_reply_mode = os.getenv("DISCORD_REPLY_TO_MODE", "").lower()
-    if discord_reply_mode in ("off", "first", "all"):
-        if Platform.DISCORD not in config.platforms:
-            config.platforms[Platform.DISCORD] = PlatformConfig()
-        config.platforms[Platform.DISCORD].reply_to_mode = discord_reply_mode
-    
    # WhatsApp (typically uses different auth mechanism)
    whatsapp_enabled = os.getenv("WHATSAPP_ENABLED", "").lower() in ("true", "1", "yes")
    if whatsapp_enabled:
@@ -455,9 +455,6 @@ class DiscordAdapter(BasePlatformAdapter):
        self._seen_messages: Dict[str, float] = {}
        self._SEEN_TTL = 300   # 5 minutes
        self._SEEN_MAX = 2000  # prune threshold
-        # Reply threading mode: "off" (no replies), "first" (reply on first
-        # chunk only, default), "all" (reply-reference on every chunk).
-        self._reply_to_mode: str = getattr(config, 'reply_to_mode', 'first') or 'first'

    async def connect(self) -> bool:
        """Connect to Discord and start receiving events."""
@@ -777,7 +774,7 @@ class DiscordAdapter(BasePlatformAdapter):
            message_ids = []
            reference = None

-            if reply_to and self._reply_to_mode != "off":
+            if reply_to:
                try:
                    ref_msg = await channel.fetch_message(int(reply_to))
                    reference = ref_msg
@@ -785,10 +782,7 @@ class DiscordAdapter(BasePlatformAdapter):
                    logger.debug("Could not fetch reply-to message: %s", e)

            for i, chunk in enumerate(chunks):
-                if self._reply_to_mode == "all":
-                    chunk_reference = reference
-                else:  # "first" (default) or "off"
-                    chunk_reference = reference if i == 0 else None
+                chunk_reference = reference if i == 0 else None
                try:
                    msg = await channel.send(
                        content=chunk,
@@ -20,7 +20,6 @@ from __future__ import annotations
 import asyncio
 import hashlib
 import hmac
-import itertools
 import json
 import logging
 import mimetypes
@@ -1053,9 +1052,6 @@ class FeishuAdapter(BasePlatformAdapter):
        self._media_batch_state = FeishuBatchState()
        self._pending_media_batches = self._media_batch_state.events
        self._pending_media_batch_tasks = self._media_batch_state.tasks
-        # Exec approval button state (approval_id → {session_key, message_id, chat_id})
-        self._approval_state: Dict[int, Dict[str, str]] = {}
-        self._approval_counter = itertools.count(1)
        self._load_seen_message_ids()

    @staticmethod
@@ -1398,104 +1394,6 @@ class FeishuAdapter(BasePlatformAdapter):
            logger.error("[Feishu] Failed to edit message %s: %s", message_id, exc, exc_info=True)
            return SendResult(success=False, error=str(exc))

-    async def send_exec_approval(
-        self, chat_id: str, command: str, session_key: str,
-        description: str = "dangerous command",
-        metadata: Optional[Dict[str, Any]] = None,
-    ) -> SendResult:
-        """Send an interactive card with approval buttons.
-
-        The buttons carry ``hermes_action`` in their value dict so that
-        ``_handle_card_action_event`` can intercept them and call
-        ``resolve_gateway_approval()`` to unblock the waiting agent thread.
-        """
-        if not self._client:
-            return SendResult(success=False, error="Not connected")
-
-        try:
-            approval_id = next(self._approval_counter)
-            cmd_preview = command[:3000] + "..." if len(command) > 3000 else command
-
-            def _btn(label: str, action_name: str, btn_type: str = "default") -> dict:
-                return {
-                    "tag": "button",
-                    "text": {"tag": "plain_text", "content": label},
-                    "type": btn_type,
-                    "value": {"hermes_action": action_name, "approval_id": approval_id},
-                }
-
-            card = {
-                "config": {"wide_screen_mode": True},
-                "header": {
-                    "title": {"content": "⚠️ Command Approval Required", "tag": "plain_text"},
-                    "template": "orange",
-                },
-                "elements": [
-                    {
-                        "tag": "markdown",
-                        "content": f"```\n{cmd_preview}\n```\n**Reason:** {description}",
-                    },
-                    {
-                        "tag": "action",
-                        "actions": [
-                            _btn("✅ Allow Once", "approve_once", "primary"),
-                            _btn("✅ Session", "approve_session"),
-                            _btn("✅ Always", "approve_always"),
-                            _btn("❌ Deny", "deny", "danger"),
-                        ],
-                    },
-                ],
-            }
-
-            payload = json.dumps(card, ensure_ascii=False)
-            response = await self._feishu_send_with_retry(
-                chat_id=chat_id,
-                msg_type="interactive",
-                payload=payload,
-                reply_to=None,
-                metadata=metadata,
-            )
-
-            result = self._finalize_send_result(response, "send_exec_approval failed")
-            if result.success:
-                self._approval_state[approval_id] = {
-                    "session_key": session_key,
-                    "message_id": result.message_id or "",
-                    "chat_id": chat_id,
-                }
-            return result
-        except Exception as exc:
-            logger.warning("[Feishu] send_exec_approval failed: %s", exc)
-            return SendResult(success=False, error=str(exc))
-
-    async def _update_approval_card(
-        self, message_id: str, label: str, user_name: str, choice: str,
-    ) -> None:
-        """Replace the approval card with a resolved status card."""
-        if not self._client or not message_id:
-            return
-        icon = "❌" if choice == "deny" else "✅"
-        card = {
-            "config": {"wide_screen_mode": True},
-            "header": {
-                "title": {"content": f"{icon} {label}", "tag": "plain_text"},
-                "template": "red" if choice == "deny" else "green",
-            },
-            "elements": [
-                {
-                    "tag": "markdown",
-                    "content": f"{icon} **{label}** by {user_name}",
-                },
-            ],
-        }
-        try:
-            payload = json.dumps(card, ensure_ascii=False)
-            body = self._build_update_message_body(msg_type="interactive", content=payload)
-            request = self._build_update_message_request(message_id=message_id, request_body=body)
-            await asyncio.to_thread(self._client.im.v1.message.update, request)
-        except Exception as exc:
-            logger.warning("[Feishu] Failed to update approval card %s: %s", message_id, exc)
-
    async def send_voice(
        self,
        chat_id: str,
@@ -1922,52 +1820,6 @@ class FeishuAdapter(BasePlatformAdapter):
        action = getattr(event, "action", None)
        action_tag = str(getattr(action, "tag", "") or "button")
        action_value = getattr(action, "value", {}) or {}
-
-        # --- Exec approval button intercept ---
-        hermes_action = action_value.get("hermes_action") if isinstance(action_value, dict) else None
-        if hermes_action:
-            approval_id = action_value.get("approval_id")
-            state = self._approval_state.pop(approval_id, None)
-            if not state:
-                logger.debug("[Feishu] Approval %s already resolved or unknown", approval_id)
-                return
-
-            choice_map = {
-                "approve_once": "once",
-                "approve_session": "session",
-                "approve_always": "always",
-                "deny": "deny",
-            }
-            choice = choice_map.get(hermes_action, "deny")
-
-            label_map = {
-                "once": "Approved once",
-                "session": "Approved for session",
-                "always": "Approved permanently",
-                "deny": "Denied",
-            }
-            label = label_map.get(choice, "Resolved")
-
-            # Resolve sender name for the status card
-            sender_id = SimpleNamespace(open_id=open_id, user_id=None, union_id=None)
-            sender_profile = await self._resolve_sender_profile(sender_id)
-            user_name = sender_profile.get("user_name") or open_id
-
-            # Resolve the approval — unblocks the agent thread
-            try:
-                from tools.approval import resolve_gateway_approval
-                count = resolve_gateway_approval(state["session_key"], choice)
-                logger.info(
-                    "Feishu button resolved %d approval(s) for session %s (choice=%s, user=%s)",
-                    count, state["session_key"], choice, user_name,
-                )
-            except Exception as exc:
-                logger.error("Failed to resolve gateway approval from Feishu button: %s", exc)
-
-            # Update the card to show the decision
-            await self._update_approval_card(state.get("message_id", ""), label, user_name, choice)
-            return
-
        synthetic_text = f"/card {action_tag}"
        if action_value:
            try:
@@ -647,11 +647,7 @@ class SignalAdapter(BasePlatformAdapter):

        if result is not None:
            self._track_sent_timestamp(result)
-            # Use the timestamp from the RPC result as a pseudo message_id.
-            # Signal doesn't have real message IDs, but the stream consumer
-            # needs a truthy value to follow its edit→fallback path correctly.
-            _msg_id = str(result.get("timestamp", "")) if isinstance(result, dict) else None
-            return SendResult(success=True, message_id=_msg_id or None)
+            return SendResult(success=True)
        return SendResult(success=False, error="RPC send failed")

    def _track_sent_timestamp(self, rpc_result) -> None:
@@ -841,11 +837,6 @@ class SignalAdapter(BasePlatformAdapter):
            except asyncio.CancelledError:
                pass

-    async def stop_typing(self, chat_id: str) -> None:
-        """Public interface for stopping typing — called by base adapter's
-        _keep_typing finally block to clean up platform-level typing tasks."""
-        await self._stop_typing_indicator(chat_id)
-
    # ------------------------------------------------------------------
    # Chat Info
    # ------------------------------------------------------------------
@@ -921,11 +921,12 @@ class GatewayRunner:

    @staticmethod
    def _load_reasoning_config() -> dict | None:
-        """Load reasoning effort from config.yaml.
+        """Load reasoning effort from config with env fallback.

-        Reads agent.reasoning_effort from config.yaml. Valid: "xhigh",
-        "high", "medium", "low", "minimal", "none". Returns None to use
-        default (medium).
+        Checks agent.reasoning_effort in config.yaml first, then
+        HERMES_REASONING_EFFORT as a fallback. Valid: "xhigh", "high",
+        "medium", "low", "minimal", "none". Returns None to use default
+        (medium).
        """
        from hermes_constants import parse_reasoning_effort
        effort = ""
@@ -938,6 +939,8 @@ class GatewayRunner:
                effort = str(cfg.get("agent", {}).get("reasoning_effort", "") or "").strip()
        except Exception:
            pass
+        if not effort:
+            effort = os.getenv("HERMES_REASONING_EFFORT", "")
        result = parse_reasoning_effort(effort)
        if effort and effort.strip() and result is None:
            logger.warning("Unknown reasoning_effort '%s', using default (medium)", effort)
@@ -1481,14 +1484,6 @@ class GatewayRunner:
                logger.debug("Interrupted running agent for session %s during shutdown", session_key[:20])
            except Exception as e:
                logger.debug("Failed interrupting agent during shutdown: %s", e)
-            # Fire plugin on_session_finalize hook before memory shutdown
-            try:
-                from hermes_cli.plugins import invoke_hook as _invoke_hook
-                _invoke_hook("on_session_finalize",
-                             session_id=getattr(agent, 'session_id', None),
-                             platform="gateway")
-            except Exception:
-                pass
            # Shut down memory provider at actual session boundary
            try:
                if hasattr(agent, 'shutdown_memory_provider'):
@@ -3282,15 +3277,6 @@ class GatewayRunner:
        # the configured default instead of the previously switched model.
        self._session_model_overrides.pop(session_key, None)

-        # Fire plugin on_session_finalize hook (session boundary)
-        try:
-            from hermes_cli.plugins import invoke_hook as _invoke_hook
-            _old_sid = old_entry.session_id if old_entry else None
-            _invoke_hook("on_session_finalize", session_id=_old_sid,
-                         platform=source.platform.value if source.platform else "")
-        except Exception:
-            pass
-
        # Emit session:end hook (session is ending)
        await self.hooks.emit("session:end", {
            "platform": source.platform.value if source.platform else "",
@@ -3304,7 +3290,7 @@ class GatewayRunner:
            "user_id": source.user_id,
            "session_key": session_key,
        })
-
+        
        # Resolve session config info to surface to the user
        try:
            session_info = self._format_session_info()
@@ -3315,18 +3301,9 @@ class GatewayRunner:
            header = "✨ Session reset! Starting fresh."
        else:
            # No existing session, just create one
-            new_entry = self.session_store.get_or_create_session(source, force_new=True)
+            self.session_store.get_or_create_session(source, force_new=True)
            header = "✨ New session started!"

-        # Fire plugin on_session_reset hook (new session guaranteed to exist)
-        try:
-            from hermes_cli.plugins import invoke_hook as _invoke_hook
-            _new_sid = new_entry.session_id if new_entry else None
-            _invoke_hook("on_session_reset", session_id=_new_sid,
-                         platform=source.platform.value if source.platform else "")
-        except Exception:
-            pass
-
        if session_info:
            return f"{header}\n\n{session_info}"
        return header
@@ -6308,15 +6285,7 @@ class GatewayRunner:
        # Falls back to env vars for backward compatibility.
        # YAML 1.1 parses bare `off` as boolean False — normalise before
        # the `or` chain so it doesn't silently fall through to "all".
-        #
-        # Per-platform overrides (display.tool_progress_overrides) take
-        # priority over the global setting — e.g. Signal users can set
-        # tool_progress to "off" while keeping Telegram on "all".
-        _display_cfg = user_config.get("display", {})
-        _overrides = _display_cfg.get("tool_progress_overrides", {})
-        _raw_tp = _overrides.get(platform_key)
-        if _raw_tp is None:
-            _raw_tp = _display_cfg.get("tool_progress")
+        _raw_tp = user_config.get("display", {}).get("tool_progress")
        if _raw_tp is False:
            _raw_tp = "off"
        progress_mode = (
@@ -74,8 +74,6 @@ class GatewayStreamConsumer:
        self._edit_supported = True  # Disabled on first edit failure (Signal/Email/HA)
        self._last_edit_time = 0.0
        self._last_sent_text = ""   # Track last-sent text to skip redundant edits
-        self._fallback_final_send = False
-        self._fallback_prefix = ""

    @property
    def already_sent(self) -> bool:
@@ -140,19 +138,12 @@ class GatewayStreamConsumer:
                    while (
                        len(self._accumulated) > _safe_limit
                        and self._message_id is not None
-                        and self._edit_supported
                    ):
                        split_at = self._accumulated.rfind("\n", 0, _safe_limit)
                        if split_at < _safe_limit // 2:
                            split_at = _safe_limit
                        chunk = self._accumulated[:split_at]
                        await self._send_or_edit(chunk)
-                        if self._fallback_final_send:
-                            # Edit failed while attempting to split an oversized
-                            # message. Keep the full accumulated text intact so
-                            # the fallback final-send path can deliver the
-                            # remaining continuation without dropping content.
-                            break
                        self._accumulated = self._accumulated[split_at:].lstrip("\n")
                        self._message_id = None
                        self._last_sent_text = ""
@@ -165,17 +156,9 @@ class GatewayStreamConsumer:
                    self._last_edit_time = time.monotonic()

                if got_done:
-                    # Final edit without cursor. If progressive editing failed
-                    # mid-stream, send a single continuation/fallback message
-                    # here instead of letting the base gateway path send the
-                    # full response again.
-                    if self._accumulated:
-                        if self._fallback_final_send:
-                            await self._send_fallback_final(self._accumulated)
-                        elif self._message_id:
-                            await self._send_or_edit(self._accumulated)
-                        elif not self._already_sent:
-                            await self._send_or_edit(self._accumulated)
+                    # Final edit without cursor
+                    if self._accumulated and self._message_id:
+                        await self._send_or_edit(self._accumulated)
                    return

                # Tool boundary: the should_edit block above already flushed
@@ -186,8 +169,6 @@ class GatewayStreamConsumer:
                    self._message_id = None
                    self._accumulated = ""
                    self._last_sent_text = ""
-                    self._fallback_final_send = False
-                    self._fallback_prefix = ""

                await asyncio.sleep(0.05)  # Small yield to not busy-loop

@@ -226,86 +207,6 @@ class GatewayStreamConsumer:
        # Strip trailing whitespace/newlines but preserve leading content
        return cleaned.rstrip()

-    def _visible_prefix(self) -> str:
-        """Return the visible text already shown in the streamed message."""
-        prefix = self._last_sent_text or ""
-        if self.cfg.cursor and prefix.endswith(self.cfg.cursor):
-            prefix = prefix[:-len(self.cfg.cursor)]
-        return self._clean_for_display(prefix)
-
-    def _continuation_text(self, final_text: str) -> str:
-        """Return only the part of final_text the user has not already seen."""
-        prefix = self._fallback_prefix or self._visible_prefix()
-        if prefix and final_text.startswith(prefix):
-            return final_text[len(prefix):].lstrip()
-        return final_text
-
-    @staticmethod
-    def _split_text_chunks(text: str, limit: int) -> list[str]:
-        """Split text into reasonably sized chunks for fallback sends."""
-        if len(text) <= limit:
-            return [text]
-        chunks: list[str] = []
-        remaining = text
-        while len(remaining) > limit:
-            split_at = remaining.rfind("\n", 0, limit)
-            if split_at < limit // 2:
-                split_at = limit
-            chunks.append(remaining[:split_at])
-            remaining = remaining[split_at:].lstrip("\n")
-        if remaining:
-            chunks.append(remaining)
-        return chunks
-
-    async def _send_fallback_final(self, text: str) -> None:
-        """Send the final continuation after streaming edits stop working."""
-        final_text = self._clean_for_display(text)
-        continuation = self._continuation_text(final_text)
-        self._fallback_final_send = False
-        if not continuation.strip():
-            # Nothing new to send — the visible partial already matches final text.
-            self._already_sent = True
-            return
-
-        raw_limit = getattr(self.adapter, "MAX_MESSAGE_LENGTH", 4096)
-        safe_limit = max(500, raw_limit - 100)
-        chunks = self._split_text_chunks(continuation, safe_limit)
-
-        last_message_id: Optional[str] = None
-        last_successful_chunk = ""
-        sent_any_chunk = False
-        for chunk in chunks:
-            result = await self.adapter.send(
-                chat_id=self.chat_id,
-                content=chunk,
-                metadata=self.metadata,
-            )
-            if not result.success:
-                if sent_any_chunk:
-                    # Some continuation text already reached the user. Suppress
-                    # the base gateway final-send path so we don't resend the
-                    # full response and create another duplicate.
-                    self._already_sent = True
-                    self._message_id = last_message_id
-                    self._last_sent_text = last_successful_chunk
-                    self._fallback_prefix = ""
-                    return
-                # No fallback chunk reached the user — allow the normal gateway
-                # final-send path to try one more time.
-                self._already_sent = False
-                self._message_id = None
-                self._last_sent_text = ""
-                self._fallback_prefix = ""
-                return
-            sent_any_chunk = True
-            last_successful_chunk = chunk
-            last_message_id = result.message_id or last_message_id
-
-        self._message_id = last_message_id
-        self._already_sent = True
-        self._last_sent_text = chunks[-1]
-        self._fallback_prefix = ""
-
    async def _send_or_edit(self, text: str) -> None:
        """Send or edit the streaming message."""
        # Strip MEDIA: directives so they don't appear as visible text.
@@ -331,16 +232,14 @@ class GatewayStreamConsumer:
                        self._last_sent_text = text
                    else:
                        # If an edit fails mid-stream (especially Telegram flood control),
-                        # stop progressive edits and send only the missing tail once the
-                        # final response is available.
+                        # stop progressive edits and let the normal final send path deliver
+                        # the complete answer instead of leaving the user with a partial.
                        logger.debug("Edit failed, disabling streaming for this adapter")
-                        self._fallback_prefix = self._visible_prefix()
-                        self._fallback_final_send = True
                        self._edit_supported = False
-                        self._already_sent = True
+                        self._already_sent = False
                else:
                    # Editing not supported — skip intermediate updates.
-                    # The final response will be sent by the fallback path.
+                    # The final response will be sent by the normal path.
                    pass
            else:
                # First message — send new
@@ -353,17 +252,6 @@ class GatewayStreamConsumer:
                    self._message_id = result.message_id
                    self._already_sent = True
                    self._last_sent_text = text
-                elif result.success:
-                    # Platform accepted the message but returned no message_id
-                    # (e.g. Signal).  Can't edit without an ID — switch to
-                    # fallback mode: suppress intermediate deltas, send only
-                    # the missing tail once the final response is ready.
-                    self._already_sent = True
-                    self._edit_supported = False
-                    self._fallback_prefix = self._clean_for_display(text)
-                    self._fallback_final_send = True
-                    # Sentinel prevents re-entering this branch on every delta
-                    self._message_id = "__no_edit__"
                else:
                    # Initial send failed — disable streaming for this session
                    self._edit_supported = False
@@ -11,5 +11,5 @@ Provides subcommands for:
 - hermes cron          - Manage cron jobs
 """

-__version__ = "0.8.0"
-__release_date__ = "2026.4.8"
+__version__ = "0.7.0"
+__release_date__ = "2026.4.3"
@@ -67,16 +67,12 @@ DEFAULT_AGENT_KEY_MIN_TTL_SECONDS = 30 * 60  # 30 minutes
 ACCESS_TOKEN_REFRESH_SKEW_SECONDS = 120       # refresh 2 min before expiry
 DEVICE_AUTH_POLL_INTERVAL_CAP_SECONDS = 1     # poll at most every 1s
 DEFAULT_CODEX_BASE_URL = "https://chatgpt.com/backend-api/codex"
-DEFAULT_QWEN_BASE_URL = "https://portal.qwen.ai/v1"
 DEFAULT_GITHUB_MODELS_BASE_URL = "https://api.githubcopilot.com"
 DEFAULT_COPILOT_ACP_BASE_URL = "acp://copilot"
 DEFAULT_GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai"
 CODEX_OAUTH_CLIENT_ID = "app_EMoamEEZ73f0CkXaXp7hrann"
 CODEX_OAUTH_TOKEN_URL = "https://auth.openai.com/oauth/token"
 CODEX_ACCESS_TOKEN_REFRESH_SKEW_SECONDS = 120
-QWEN_OAUTH_CLIENT_ID = "f0304373b74a44d2b584a3fb70ca9e56"
-QWEN_OAUTH_TOKEN_URL = "https://chat.qwen.ai/api/v1/oauth2/token"
-QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS = 120


 # =============================================================================
@@ -116,12 +112,6 @@ PROVIDER_REGISTRY: Dict[str, ProviderConfig] = {
        auth_type="oauth_external",
        inference_base_url=DEFAULT_CODEX_BASE_URL,
    ),
-    "qwen-oauth": ProviderConfig(
-        id="qwen-oauth",
-        name="Qwen OAuth",
-        auth_type="oauth_external",
-        inference_base_url=DEFAULT_QWEN_BASE_URL,
-    ),
    "copilot": ProviderConfig(
        id="copilot",
        name="GitHub Copilot",
@@ -827,7 +817,6 @@ def resolve_provider(
        "github-copilot-acp": "copilot-acp", "copilot-acp-agent": "copilot-acp",
        "aigateway": "ai-gateway", "vercel": "ai-gateway", "vercel-ai-gateway": "ai-gateway",
        "opencode": "opencode-zen", "zen": "opencode-zen",
-        "qwen-portal": "qwen-oauth", "qwen-cli": "qwen-oauth", "qwen-oauth": "qwen-oauth",
        "hf": "huggingface", "hugging-face": "huggingface", "huggingface-hub": "huggingface",
        "go": "opencode-go", "opencode-go-sub": "opencode-go",
        "kilo": "kilocode", "kilo-code": "kilocode", "kilo-gateway": "kilocode",
@@ -957,176 +946,6 @@ def _codex_access_token_is_expiring(access_token: Any, skew_seconds: int) -> boo
    return float(exp) <= (time.time() + max(0, int(skew_seconds)))


-def _qwen_cli_auth_path() -> Path:
-    return Path.home() / ".qwen" / "oauth_creds.json"
-
-
-def _read_qwen_cli_tokens() -> Dict[str, Any]:
-    auth_path = _qwen_cli_auth_path()
-    if not auth_path.exists():
-        raise AuthError(
-            "Qwen CLI credentials not found. Run 'qwen auth qwen-oauth' first.",
-            provider="qwen-oauth",
-            code="qwen_auth_missing",
-        )
-    try:
-        data = json.loads(auth_path.read_text(encoding="utf-8"))
-    except Exception as exc:
-        raise AuthError(
-            f"Failed to read Qwen CLI credentials from {auth_path}: {exc}",
-            provider="qwen-oauth",
-            code="qwen_auth_read_failed",
-        ) from exc
-    if not isinstance(data, dict):
-        raise AuthError(
-            f"Invalid Qwen CLI credentials in {auth_path}.",
-            provider="qwen-oauth",
-            code="qwen_auth_invalid",
-        )
-    return data
-
-
-def _save_qwen_cli_tokens(tokens: Dict[str, Any]) -> Path:
-    auth_path = _qwen_cli_auth_path()
-    auth_path.parent.mkdir(parents=True, exist_ok=True)
-    tmp_path = auth_path.with_suffix(".tmp")
-    tmp_path.write_text(json.dumps(tokens, indent=2, sort_keys=True) + "\n", encoding="utf-8")
-    os.chmod(tmp_path, stat.S_IRUSR | stat.S_IWUSR)
-    tmp_path.replace(auth_path)
-    return auth_path
-
-
-def _qwen_access_token_is_expiring(expiry_date_ms: Any, skew_seconds: int = QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS) -> bool:
-    try:
-        expiry_ms = int(expiry_date_ms)
-    except Exception:
-        return True
-    return (time.time() + max(0, int(skew_seconds))) * 1000 >= expiry_ms
-
-
-def _refresh_qwen_cli_tokens(tokens: Dict[str, Any], timeout_seconds: float = 20.0) -> Dict[str, Any]:
-    refresh_token = str(tokens.get("refresh_token", "") or "").strip()
-    if not refresh_token:
-        raise AuthError(
-            "Qwen OAuth refresh token missing. Re-run 'qwen auth qwen-oauth'.",
-            provider="qwen-oauth",
-            code="qwen_refresh_token_missing",
-        )
-
-    try:
-        response = httpx.post(
-            QWEN_OAUTH_TOKEN_URL,
-            headers={
-                "Content-Type": "application/x-www-form-urlencoded",
-                "Accept": "application/json",
-            },
-            data={
-                "grant_type": "refresh_token",
-                "refresh_token": refresh_token,
-                "client_id": QWEN_OAUTH_CLIENT_ID,
-            },
-            timeout=timeout_seconds,
-        )
-    except Exception as exc:
-        raise AuthError(
-            f"Qwen OAuth refresh failed: {exc}",
-            provider="qwen-oauth",
-            code="qwen_refresh_failed",
-        ) from exc
-
-    if response.status_code >= 400:
-        body = response.text.strip()
-        raise AuthError(
-            "Qwen OAuth refresh failed. Re-run 'qwen auth qwen-oauth'."
-            + (f" Response: {body}" if body else ""),
-            provider="qwen-oauth",
-            code="qwen_refresh_failed",
-        )
-
-    try:
-        payload = response.json()
-    except Exception as exc:
-        raise AuthError(
-            f"Qwen OAuth refresh returned invalid JSON: {exc}",
-            provider="qwen-oauth",
-            code="qwen_refresh_invalid_json",
-        ) from exc
-
-    if not isinstance(payload, dict) or not str(payload.get("access_token", "") or "").strip():
-        raise AuthError(
-            "Qwen OAuth refresh response missing access_token.",
-            provider="qwen-oauth",
-            code="qwen_refresh_invalid_response",
-        )
-
-    expires_in = payload.get("expires_in")
-    try:
-        expires_in_seconds = int(expires_in)
-    except Exception:
-        expires_in_seconds = 6 * 60 * 60
-
-    refreshed = {
-        "access_token": str(payload.get("access_token", "") or "").strip(),
-        "refresh_token": str(payload.get("refresh_token", refresh_token) or refresh_token).strip(),
-        "token_type": str(payload.get("token_type", tokens.get("token_type", "Bearer")) or "Bearer").strip() or "Bearer",
-        "resource_url": str(payload.get("resource_url", tokens.get("resource_url", "portal.qwen.ai")) or "portal.qwen.ai").strip(),
-        "expiry_date": int(time.time() * 1000) + max(1, expires_in_seconds) * 1000,
-    }
-    _save_qwen_cli_tokens(refreshed)
-    return refreshed
-
-
-def resolve_qwen_runtime_credentials(
-    *,
-    force_refresh: bool = False,
-    refresh_if_expiring: bool = True,
-    refresh_skew_seconds: int = QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS,
-) -> Dict[str, Any]:
-    tokens = _read_qwen_cli_tokens()
-    access_token = str(tokens.get("access_token", "") or "").strip()
-    should_refresh = bool(force_refresh)
-    if not should_refresh and refresh_if_expiring:
-        should_refresh = _qwen_access_token_is_expiring(tokens.get("expiry_date"), refresh_skew_seconds)
-    if should_refresh:
-        tokens = _refresh_qwen_cli_tokens(tokens)
-        access_token = str(tokens.get("access_token", "") or "").strip()
-    if not access_token:
-        raise AuthError(
-            "Qwen OAuth access token missing. Re-run 'qwen auth qwen-oauth'.",
-            provider="qwen-oauth",
-            code="qwen_access_token_missing",
-        )
-
-    base_url = os.getenv("HERMES_QWEN_BASE_URL", "").strip().rstrip("/") or DEFAULT_QWEN_BASE_URL
-    return {
-        "provider": "qwen-oauth",
-        "base_url": base_url,
-        "api_key": access_token,
-        "source": "qwen-cli",
-        "expires_at_ms": tokens.get("expiry_date"),
-        "auth_file": str(_qwen_cli_auth_path()),
-    }
-
-
-def get_qwen_auth_status() -> Dict[str, Any]:
-    auth_path = _qwen_cli_auth_path()
-    try:
-        creds = resolve_qwen_runtime_credentials(refresh_if_expiring=False)
-        return {
-            "logged_in": True,
-            "auth_file": str(auth_path),
-            "source": creds.get("source"),
-            "api_key": creds.get("api_key"),
-            "expires_at_ms": creds.get("expires_at_ms"),
-        }
-    except AuthError as exc:
-        return {
-            "logged_in": False,
-            "auth_file": str(auth_path),
-            "error": str(exc),
-        }
-
-
 # =============================================================================
 # SSH / remote session detection
 # =============================================================================
@@ -2253,8 +2072,6 @@ def get_auth_status(provider_id: Optional[str] = None) -> Dict[str, Any]:
        return get_nous_auth_status()
    if target == "openai-codex":
        return get_codex_auth_status()
-    if target == "qwen-oauth":
-        return get_qwen_auth_status()
    if target == "copilot-acp":
        return get_external_process_provider_status(target)
    # API-key providers
@@ -32,7 +32,7 @@ from hermes_constants import OPENROUTER_BASE_URL


 # Providers that support OAuth login in addition to API keys.
-_OAUTH_CAPABLE_PROVIDERS = {"anthropic", "nous", "openai-codex", "qwen-oauth"}
+_OAUTH_CAPABLE_PROVIDERS = {"anthropic", "nous", "openai-codex"}


 def _get_custom_provider_names() -> list:
@@ -147,7 +147,7 @@ def auth_add_command(args) -> None:
        if provider.startswith(CUSTOM_POOL_PREFIX):
            requested_type = AUTH_TYPE_API_KEY
        else:
-            requested_type = AUTH_TYPE_OAUTH if provider in {"anthropic", "nous", "openai-codex", "qwen-oauth"} else AUTH_TYPE_API_KEY
+            requested_type = AUTH_TYPE_OAUTH if provider in {"anthropic", "nous", "openai-codex"} else AUTH_TYPE_API_KEY

    pool = load_pool(provider)

@@ -250,26 +250,6 @@ def auth_add_command(args) -> None:
        print(f'Added {provider} OAuth credential #{len(pool.entries())}: "{entry.label}"')
        return

-    if provider == "qwen-oauth":
-        creds = auth_mod.resolve_qwen_runtime_credentials(refresh_if_expiring=False)
-        label = (getattr(args, "label", None) or "").strip() or label_from_token(
-            creds["api_key"],
-            _oauth_default_label(provider, len(pool.entries()) + 1),
-        )
-        entry = PooledCredential(
-            provider=provider,
-            id=uuid.uuid4().hex[:6],
-            label=label,
-            auth_type=AUTH_TYPE_OAUTH,
-            priority=0,
-            source=f"{SOURCE_MANUAL}:qwen_cli",
-            access_token=creds["api_key"],
-            base_url=creds.get("base_url"),
-        )
-        pool.add_entry(entry)
-        print(f'Added {provider} OAuth credential #{len(pool.entries())}: "{entry.label}"')
-        return
-
    raise SystemExit(f"`hermes auth add {provider}` is not implemented for auth type {requested_type} yet.")


@@ -157,14 +157,7 @@ def get_project_root() -> Path:
    return Path(__file__).parent.parent.resolve()

 def _secure_dir(path):
-    """Set directory to owner-only access (0700). No-op on Windows.
-
-    Skipped in managed mode — the NixOS module sets group-readable
-    permissions (0750) so interactive users in the hermes group can
-    share state with the gateway service.
-    """
-    if is_managed():
-        return
+    """Set directory to owner-only access (0700). No-op on Windows."""
    try:
        os.chmod(path, 0o700)
    except (OSError, NotImplementedError):
@@ -172,13 +165,7 @@ def _secure_dir(path):


 def _secure_file(path):
-    """Set file to owner-only read/write (0600). No-op on Windows.
-
-    Skipped in managed mode — the NixOS activation script sets
-    group-readable permissions (0640) on config files.
-    """
-    if is_managed():
-        return
+    """Set file to owner-only read/write (0600). No-op on Windows."""
    try:
        if os.path.exists(str(path)):
            os.chmod(path, 0o600)
@@ -392,7 +379,6 @@ DEFAULT_CONFIG = {
        "show_cost": False,       # Show $ cost in the status bar (off by default)
        "skin": "default",
        "tool_progress_command": False,  # Enable /verbose command in messaging gateway
-        "tool_progress_overrides": {},  # Per-platform overrides: {"signal": "off", "telegram": "all"}
        "tool_preview_length": 0,  # Max chars for tool call previews (0 = no limit, show full paths/commands)
    },

@@ -427,7 +413,7 @@ DEFAULT_CONFIG = {
    
    "stt": {
        "enabled": True,
-        "provider": "local",  # "local" (free, faster-whisper) | "groq" | "openai" (Whisper API) | "mistral" (Voxtral Transcribe)
+        "provider": "local",  # "local" (free, faster-whisper) | "groq" | "openai" (Whisper API)
        "local": {
            "model": "base",  # tiny, base, small, medium, large-v3
            "language": "",  # auto-detect by default; set to "en", "es", "fr", etc. to force
@@ -435,9 +421,6 @@ DEFAULT_CONFIG = {
        "openai": {
            "model": "whisper-1",  # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
        },
-        "mistral": {
-            "model": "voxtral-mini-latest",  # voxtral-mini-latest, voxtral-mini-2602
-        },
    },

    "voice": {
@@ -741,14 +724,6 @@ OPTIONAL_ENV_VARS = {
        "category": "provider",
        "advanced": True,
    },
-    "HERMES_QWEN_BASE_URL": {
-        "description": "Qwen Portal base URL override (default: https://portal.qwen.ai/v1)",
-        "prompt": "Qwen Portal base URL (leave empty for default)",
-        "url": None,
-        "password": False,
-        "category": "provider",
-        "advanced": True,
-    },
    "OPENCODE_ZEN_API_KEY": {
        "description": "OpenCode Zen API key (pay-as-you-go access to curated models)",
        "prompt": "OpenCode Zen API key",
@@ -1000,13 +975,6 @@ OPTIONAL_ENV_VARS = {
        "password": False,
        "category": "messaging",
    },
-    "DISCORD_REPLY_TO_MODE": {
-        "description": "Discord reply threading mode: 'off' (no reply references), 'first' (reply on first message only, default), 'all' (reply on every chunk)",
-        "prompt": "Discord reply mode (off/first/all)",
-        "url": None,
-        "password": False,
-        "category": "messaging",
-    },
    "SLACK_BOT_TOKEN": {
        "description": "Slack bot token (xoxb-). Get from OAuth & Permissions after installing your app. "
                       "Required scopes: chat:write, app_mentions:read, channels:history, groups:history, "
@@ -93,21 +93,6 @@ def cron_list(show_all: bool = False):
        script = job.get("script")
        if script:
            print(f"    Script:    {script}")
-
-        # Execution history
-        last_status = job.get("last_status")
-        if last_status:
-            last_run = job.get("last_run_at", "?")
-            if last_status == "ok":
-                status_display = color("ok", Colors.GREEN)
-            else:
-                status_display = color(f"{last_status}: {job.get('last_error', '?')}", Colors.RED)
-            print(f"    Last run:  {last_run}  {status_display}")
-
-        delivery_err = job.get("last_delivery_error")
-        if delivery_err:
-            print(f"    {color('⚠ Delivery failed:', Colors.YELLOW)} {delivery_err}")
-
        print()

    from hermes_cli.gateway import find_gateway_pids
@@ -812,83 +812,69 @@ def run_doctor(args):
        check_warn("No GITHUB_TOKEN", f"(60 req/hr rate limit — set in {_DHH}/.env for better rates)")

    # =========================================================================
-    # Memory Provider (only check the active provider, if any)
+    # Honcho memory
    # =========================================================================
    print()
-    print(color("◆ Memory Provider", Colors.CYAN, Colors.BOLD))
+    print(color("◆ Honcho Memory", Colors.CYAN, Colors.BOLD))

-    _active_memory_provider = ""
    try:
-        import yaml as _yaml
-        _mem_cfg_path = HERMES_HOME / "config.yaml"
-        if _mem_cfg_path.exists():
-            with open(_mem_cfg_path) as _f:
-                _raw_cfg = _yaml.safe_load(_f) or {}
-            _active_memory_provider = (_raw_cfg.get("memory") or {}).get("provider", "")
-    except Exception:
-        pass
+        from plugins.memory.honcho.client import HonchoClientConfig, resolve_config_path
+        hcfg = HonchoClientConfig.from_global_config()
+        _honcho_cfg_path = resolve_config_path()

-    if not _active_memory_provider:
-        check_ok("Built-in memory active", "(no external provider configured — this is fine)")
-    elif _active_memory_provider == "honcho":
-        try:
-            from plugins.memory.honcho.client import HonchoClientConfig, resolve_config_path
-            hcfg = HonchoClientConfig.from_global_config()
-            _honcho_cfg_path = resolve_config_path()
+        if not _honcho_cfg_path.exists():
+            check_warn("Honcho config not found", "run: hermes memory setup")
+        elif not hcfg.enabled:
+            check_info(f"Honcho disabled (set enabled: true in {_honcho_cfg_path} to activate)")
+        elif not (hcfg.api_key or hcfg.base_url):
+            check_fail("Honcho API key or base URL not set", "run: hermes memory setup")
+            issues.append("No Honcho API key — run 'hermes memory setup'")
+        else:
+            from plugins.memory.honcho.client import get_honcho_client, reset_honcho_client
+            reset_honcho_client()
+            try:
+                get_honcho_client(hcfg)
+                check_ok(
+                    "Honcho connected",
+                    f"workspace={hcfg.workspace_id} mode={hcfg.recall_mode} freq={hcfg.write_frequency}",
+                )
+            except Exception as _e:
+                check_fail("Honcho connection failed", str(_e))
+                issues.append(f"Honcho unreachable: {_e}")
+    except ImportError:
+        check_warn("honcho-ai not installed", "pip install honcho-ai")
+    except Exception as _e:
+        check_warn("Honcho check failed", str(_e))

-            if not _honcho_cfg_path.exists():
-                check_warn("Honcho config not found", "run: hermes memory setup")
-            elif not hcfg.enabled:
-                check_info(f"Honcho disabled (set enabled: true in {_honcho_cfg_path} to activate)")
-            elif not (hcfg.api_key or hcfg.base_url):
-                check_fail("Honcho API key or base URL not set", "run: hermes memory setup")
-                issues.append("No Honcho API key — run 'hermes memory setup'")
-            else:
-                from plugins.memory.honcho.client import get_honcho_client, reset_honcho_client
-                reset_honcho_client()
+    # =========================================================================
+    # Mem0 memory
+    # =========================================================================
+    print()
+    print(color("◆ Mem0 Memory", Colors.CYAN, Colors.BOLD))
+
+    try:
+        from plugins.memory.mem0 import _load_config as _load_mem0_config
+        mem0_cfg = _load_mem0_config()
+        mem0_key = mem0_cfg.get("api_key", "")
+        if mem0_key:
+            check_ok("Mem0 API key configured")
+            check_info(f"user_id={mem0_cfg.get('user_id', '?')}  agent_id={mem0_cfg.get('agent_id', '?')}")
+            # Note when mem0.json exists but lacks api_key, i.e. the key was loaded from .env
+            mem0_json = HERMES_HOME / "mem0.json"
+            if mem0_json.exists():
                try:
-                    get_honcho_client(hcfg)
-                    check_ok(
-                        "Honcho connected",
-                        f"workspace={hcfg.workspace_id} mode={hcfg.recall_mode} freq={hcfg.write_frequency}",
-                    )
-                except Exception as _e:
-                    check_fail("Honcho connection failed", str(_e))
-                    issues.append(f"Honcho unreachable: {_e}")
-        except ImportError:
-            check_fail("honcho-ai not installed", "pip install honcho-ai")
-            issues.append("Honcho is set as memory provider but honcho-ai is not installed")
-        except Exception as _e:
-            check_warn("Honcho check failed", str(_e))
-    elif _active_memory_provider == "mem0":
-        try:
-            from plugins.memory.mem0 import _load_config as _load_mem0_config
-            mem0_cfg = _load_mem0_config()
-            mem0_key = mem0_cfg.get("api_key", "")
-            if mem0_key:
-                check_ok("Mem0 API key configured")
-                check_info(f"user_id={mem0_cfg.get('user_id', '?')}  agent_id={mem0_cfg.get('agent_id', '?')}")
-            else:
-                check_fail("Mem0 API key not set", "(set MEM0_API_KEY in .env or run hermes memory setup)")
-                issues.append("Mem0 is set as memory provider but API key is missing")
-        except ImportError:
-            check_fail("Mem0 plugin not loadable", "pip install mem0ai")
-            issues.append("Mem0 is set as memory provider but mem0ai is not installed")
-        except Exception as _e:
-            check_warn("Mem0 check failed", str(_e))
-    else:
-        # Generic check for other memory providers (openviking, hindsight, etc.)
-        try:
-            from plugins.memory import load_memory_provider
-            _provider = load_memory_provider(_active_memory_provider)
-            if _provider and _provider.is_available():
-                check_ok(f"{_active_memory_provider} provider active")
-            elif _provider:
-                check_warn(f"{_active_memory_provider} configured but not available", "run: hermes memory status")
-            else:
-                check_warn(f"{_active_memory_provider} plugin not found", "run: hermes memory setup")
-        except Exception as _e:
-            check_warn(f"{_active_memory_provider} check failed", str(_e))
+                    import json as _json
+                    file_cfg = _json.loads(mem0_json.read_text())
+                    if not file_cfg.get("api_key") and mem0_key:
+                        check_info("api_key from .env (not in mem0.json) — this is fine")
+                except Exception:
+                    pass
+        else:
+            check_warn("Mem0 not configured", "(set MEM0_API_KEY in .env or run hermes memory setup)")
+    except ImportError:
+        check_warn("Mem0 plugin not loadable", "(optional)")
+    except Exception as _e:
+        check_warn("Mem0 check failed", str(_e))

    # =========================================================================
    # Profiles
@@ -918,7 +918,6 @@ def select_provider_and_model(args=None):
        "openrouter": "OpenRouter",
        "nous": "Nous Portal",
        "openai-codex": "OpenAI Codex",
-        "qwen-oauth": "Qwen OAuth",
        "copilot-acp": "GitHub Copilot ACP",
        "copilot": "GitHub Copilot",
        "anthropic": "Anthropic",
@@ -948,7 +947,6 @@ def select_provider_and_model(args=None):
        ("openrouter", "OpenRouter (100+ models, pay-per-use)"),
        ("anthropic", "Anthropic (Claude models — API key or Claude Code)"),
        ("openai-codex", "OpenAI Codex"),
-        ("qwen-oauth", "Qwen OAuth (reuses local Qwen CLI login)"),
        ("copilot", "GitHub Copilot (uses GITHUB_TOKEN or gh auth token)"),
        ("huggingface", "Hugging Face Inference Providers (20+ open models)"),
    ]
@@ -1045,8 +1043,6 @@ def select_provider_and_model(args=None):
        _model_flow_nous(config, current_model, args=args)
    elif selected_provider == "openai-codex":
        _model_flow_openai_codex(config, current_model)
-    elif selected_provider == "qwen-oauth":
-        _model_flow_qwen_oauth(config, current_model)
    elif selected_provider == "copilot-acp":
        _model_flow_copilot_acp(config, current_model)
    elif selected_provider == "copilot":
@@ -1363,56 +1359,6 @@ def _model_flow_openai_codex(config, current_model=""):



-_DEFAULT_QWEN_PORTAL_MODELS = [
-    "qwen3-coder-plus",
-    "qwen3-coder",
-]
-
-
-def _model_flow_qwen_oauth(_config, current_model=""):
-    """Qwen OAuth provider: reuse local Qwen CLI login, then pick model."""
-    from hermes_cli.auth import (
-        get_qwen_auth_status,
-        resolve_qwen_runtime_credentials,
-        _prompt_model_selection,
-        _save_model_choice,
-        _update_config_for_provider,
-        DEFAULT_QWEN_BASE_URL,
-    )
-    from hermes_cli.models import fetch_api_models
-
-    status = get_qwen_auth_status()
-    if not status.get("logged_in"):
-        print("Not logged into Qwen CLI OAuth.")
-        print("Run: qwen auth qwen-oauth")
-        auth_file = status.get("auth_file")
-        if auth_file:
-            print(f"Expected credentials file: {auth_file}")
-        if status.get("error"):
-            print(f"Error: {status.get('error')}")
-        return
-
-    # Try live model discovery, fall back to curated list.
-    models = None
-    try:
-        creds = resolve_qwen_runtime_credentials(refresh_if_expiring=True)
-        models = fetch_api_models(creds["api_key"], creds["base_url"])
-    except Exception:
-        pass
-    if not models:
-        models = list(_DEFAULT_QWEN_PORTAL_MODELS)
-
-    default = current_model or (models[0] if models else "qwen3-coder-plus")
-    selected = _prompt_model_selection(models, current_model=default)
-    if selected:
-        _save_model_choice(selected)
-        _update_config_for_provider("qwen-oauth", DEFAULT_QWEN_BASE_URL)
-        print(f"Default model set to: {selected} (via Qwen OAuth)")
-    else:
-        print("No change.")
-
-
-
 def _model_flow_custom(config):
    """Custom endpoint: collect URL, API key, and model name.

@@ -84,7 +84,6 @@ _PASSTHROUGH_PROVIDERS: frozenset[str] = frozenset({
    "minimax",
    "minimax-cn",
    "alibaba",
-    "qwen-oauth",
    "huggingface",
    "openai-codex",
    "custom",
@@ -791,12 +791,12 @@ def list_authenticated_providers(
        if overlay.auth_type in ("oauth_device_code", "oauth_external", "external_process"):
            # These use auth stores, not env vars — check for auth.json entries
            try:
-                from hermes_cli.auth import _load_auth_store
-                store = _load_auth_store()
-                if store and (pid in store.get("providers", {}) or pid in store.get("credential_pool", {})):
+                from hermes_cli.auth import _read_auth_store
+                store = _read_auth_store()
+                if store and pid in store:
                    has_creds = True
-            except Exception as exc:
-                logger.debug("Auth store check failed for %s: %s", pid, exc)
+            except Exception:
+                pass
        if not has_creds:
            continue

@@ -144,22 +144,18 @@ _PROVIDER_MODELS: dict[str, list[str]] = {
        "kimi-k2-0905-preview",
    ],
    "minimax": [
-        "MiniMax-M1",
-        "MiniMax-M1-40k",
-        "MiniMax-M1-80k",
-        "MiniMax-M1-128k",
-        "MiniMax-M1-256k",
-        "MiniMax-M2.5",
        "MiniMax-M2.7",
+        "MiniMax-M2.7-highspeed",
+        "MiniMax-M2.5",
+        "MiniMax-M2.5-highspeed",
+        "MiniMax-M2.1",
    ],
    "minimax-cn": [
-        "MiniMax-M1",
-        "MiniMax-M1-40k",
-        "MiniMax-M1-80k",
-        "MiniMax-M1-128k",
-        "MiniMax-M1-256k",
-        "MiniMax-M2.5",
        "MiniMax-M2.7",
+        "MiniMax-M2.7-highspeed",
+        "MiniMax-M2.5",
+        "MiniMax-M2.5-highspeed",
+        "MiniMax-M2.1",
    ],
    "anthropic": [
        "claude-opus-4-6",
@@ -483,7 +479,6 @@ _PROVIDER_LABELS = {
    "ai-gateway": "AI Gateway",
    "kilocode": "Kilo Code",
    "alibaba": "Alibaba Cloud (DashScope)",
-    "qwen-oauth": "Qwen OAuth (Portal)",
    "huggingface": "Hugging Face",
    "custom": "Custom endpoint",
 }
@@ -523,7 +518,6 @@ _PROVIDER_ALIASES = {
    "aliyun": "alibaba",
    "qwen": "alibaba",
    "alibaba-cloud": "alibaba",
-    "qwen-portal": "qwen-oauth",
    "hf": "huggingface",
    "hugging-face": "huggingface",
    "huggingface-hub": "huggingface",
@@ -769,7 +763,6 @@ def list_available_providers() -> list[dict[str, str]]:
        "openrouter", "nous", "openai-codex", "copilot", "copilot-acp",
        "gemini", "huggingface",
        "zai", "kimi-coding", "minimax", "minimax-cn", "kilocode", "anthropic", "alibaba",
-        "qwen-oauth",
        "opencode-zen", "opencode-go",
        "ai-gateway", "deepseek", "custom",
    ]
@@ -61,8 +61,6 @@ VALID_HOOKS: Set[str] = {
    "post_api_request",
    "on_session_start",
    "on_session_end",
-    "on_session_finalize",
-    "on_session_reset",
 }

 ENTRY_POINTS_GROUP = "hermes_agent.plugins"
@@ -58,12 +58,6 @@ HERMES_OVERLAYS: Dict[str, HermesOverlay] = {
        auth_type="oauth_external",
        base_url_override="https://chatgpt.com/backend-api/codex",
    ),
-    "qwen-oauth": HermesOverlay(
-        transport="openai_chat",
-        auth_type="oauth_external",
-        base_url_override="https://portal.qwen.ai/v1",
-        base_url_env_var="HERMES_QWEN_BASE_URL",
-    ),
    "copilot-acp": HermesOverlay(
        transport="codex_responses",
        auth_type="external_process",
@@ -14,13 +14,11 @@ from agent.credential_pool import CredentialPool, PooledCredential, get_custom_p
 from hermes_cli.auth import (
    AuthError,
    DEFAULT_CODEX_BASE_URL,
-    DEFAULT_QWEN_BASE_URL,
    PROVIDER_REGISTRY,
    format_auth_error,
    resolve_provider,
    resolve_nous_runtime_credentials,
    resolve_codex_runtime_credentials,
-    resolve_qwen_runtime_credentials,
    resolve_api_key_provider_credentials,
    resolve_external_process_provider_credentials,
    has_usable_secret,
@@ -150,9 +148,6 @@ def _resolve_runtime_from_pool_entry(
    if provider == "openai-codex":
        api_mode = "codex_responses"
        base_url = base_url or DEFAULT_CODEX_BASE_URL
-    elif provider == "qwen-oauth":
-        api_mode = "chat_completions"
-        base_url = base_url or DEFAULT_QWEN_BASE_URL
    elif provider == "anthropic":
        api_mode = "anthropic_messages"
        cfg_provider = str(model_cfg.get("provider") or "").strip().lower()
@@ -168,16 +163,6 @@ def _resolve_runtime_from_pool_entry(
        api_mode = _copilot_runtime_api_mode(model_cfg, getattr(entry, "runtime_api_key", ""))
    else:
        configured_provider = str(model_cfg.get("provider") or "").strip().lower()
-        # Honour model.base_url from config.yaml when the configured provider
-        # matches this provider — same pattern as the Anthropic branch above.
-        # Only override when the pool entry has no explicit base_url (i.e. it
-        # fell back to the hardcoded default).  Env var overrides win (#6039).
-        pconfig = PROVIDER_REGISTRY.get(provider)
-        pool_url_is_default = pconfig and base_url.rstrip("/") == pconfig.inference_base_url.rstrip("/")
-        if configured_provider == provider and pool_url_is_default:
-            cfg_base_url = str(model_cfg.get("base_url") or "").strip().rstrip("/")
-            if cfg_base_url:
-                base_url = cfg_base_url
        configured_mode = _parse_api_mode(model_cfg.get("api_mode"))
        if configured_mode and _provider_supports_explicit_api_mode(provider, configured_provider):
            api_mode = configured_mode
@@ -696,24 +681,6 @@ def resolve_runtime_provider(
            logger.info("Auto-detected Codex provider but credentials failed; "
                        "falling through to next provider.")

-    if provider == "qwen-oauth":
-        try:
-            creds = resolve_qwen_runtime_credentials()
-            return {
-                "provider": "qwen-oauth",
-                "api_mode": "chat_completions",
-                "base_url": creds.get("base_url", "").rstrip("/"),
-                "api_key": creds.get("api_key", ""),
-                "source": creds.get("source", "qwen-cli"),
-                "expires_at_ms": creds.get("expires_at_ms"),
-                "requested_provider": requested_provider,
-            }
-        except AuthError:
-            if requested_provider != "auto":
-                raise
-            logger.info("Qwen OAuth credentials failed; "
-                        "falling through to next provider.")
-
    if provider == "copilot-acp":
        creds = resolve_external_process_provider_credentials(provider)
        return {
@@ -757,15 +724,7 @@ def resolve_runtime_provider(
    pconfig = PROVIDER_REGISTRY.get(provider)
    if pconfig and pconfig.auth_type == "api_key":
        creds = resolve_api_key_provider_credentials(provider)
-        # Honour model.base_url from config.yaml when the configured provider
-        # matches this provider — mirrors the Anthropic path above.  Without
-        # this, users who set model.base_url to e.g. api.minimaxi.com/anthropic
-        # (China endpoint) still get the hardcoded api.minimax.io default (#6039).
-        cfg_provider = str(model_cfg.get("provider") or "").strip().lower()
-        cfg_base_url = ""
-        if cfg_provider == provider:
-            cfg_base_url = (model_cfg.get("base_url") or "").strip().rstrip("/")
-        base_url = cfg_base_url or creds.get("base_url", "").rstrip("/")
+        base_url = creds.get("base_url", "").rstrip("/")
        api_mode = "chat_completions"
        if provider == "copilot":
            api_mode = _copilot_runtime_api_mode(model_cfg, creds.get("api_key", ""))
@@ -105,8 +105,8 @@ _DEFAULT_PROVIDER_MODELS = {
    ],
    "zai": ["glm-5", "glm-4.7", "glm-4.5", "glm-4.5-flash"],
    "kimi-coding": ["kimi-k2.5", "kimi-k2-thinking", "kimi-k2-turbo-preview"],
-    "minimax": ["MiniMax-M1", "MiniMax-M1-40k", "MiniMax-M1-80k", "MiniMax-M1-128k", "MiniMax-M1-256k", "MiniMax-M2.5", "MiniMax-M2.7"],
-    "minimax-cn": ["MiniMax-M1", "MiniMax-M1-40k", "MiniMax-M1-80k", "MiniMax-M1-128k", "MiniMax-M1-256k", "MiniMax-M2.5", "MiniMax-M2.7"],
+    "minimax": ["MiniMax-M2.7", "MiniMax-M2.7-highspeed", "MiniMax-M2.5", "MiniMax-M2.5-highspeed", "MiniMax-M2.1"],
+    "minimax-cn": ["MiniMax-M2.7", "MiniMax-M2.7-highspeed", "MiniMax-M2.5", "MiniMax-M2.5-highspeed", "MiniMax-M2.1"],
    "ai-gateway": ["anthropic/claude-opus-4.6", "anthropic/claude-sonnet-4.6", "openai/gpt-5", "google/gemini-3-flash"],
    "kilocode": ["anthropic/claude-opus-4.6", "anthropic/claude-sonnet-4.6", "openai/gpt-5.4", "google/gemini-3-pro-preview", "google/gemini-3-flash-preview"],
    "opencode-zen": ["gpt-5.4", "gpt-5.3-codex", "claude-sonnet-4-6", "gemini-3-flash", "glm-5", "kimi-k2.5", "minimax-m2.7"],
@@ -153,14 +153,12 @@ def show_status(args):
    print(color("◆ Auth Providers", Colors.CYAN, Colors.BOLD))

    try:
-        from hermes_cli.auth import get_nous_auth_status, get_codex_auth_status, get_qwen_auth_status
+        from hermes_cli.auth import get_nous_auth_status, get_codex_auth_status
        nous_status = get_nous_auth_status()
        codex_status = get_codex_auth_status()
-        qwen_status = get_qwen_auth_status()
    except Exception:
        nous_status = {}
        codex_status = {}
-        qwen_status = {}

    nous_logged_in = bool(nous_status.get("logged_in"))
    print(
@@ -191,21 +189,6 @@ def show_status(args):
    if codex_status.get("error") and not codex_logged_in:
        print(f"    Error:      {codex_status.get('error')}")

-    qwen_logged_in = bool(qwen_status.get("logged_in"))
-    print(
-        f"  {'Qwen OAuth':<12}  {check_mark(qwen_logged_in)} "
-        f"{'logged in' if qwen_logged_in else 'not logged in (run: qwen auth qwen-oauth)'}"
-    )
-    qwen_auth_file = qwen_status.get("auth_file")
-    if qwen_auth_file:
-        print(f"    Auth file:  {qwen_auth_file}")
-    qwen_exp = qwen_status.get("expires_at_ms")
-    if qwen_exp:
-        from datetime import datetime, timezone
-        print(f"    Access exp: {datetime.fromtimestamp(int(qwen_exp) / 1000, tz=timezone.utc).isoformat()}")
-    if qwen_status.get("error") and not qwen_logged_in:
-        print(f"    Error:      {qwen_status.get('error')}")
-
    # =========================================================================
    # Nous Subscription Features
    # =========================================================================
@@ -464,11 +464,7 @@
      addToSystemPackages = mkOption {
        type = types.bool;
        default = false;
-        description = ''
-          Add the hermes CLI to environment.systemPackages and export
-          HERMES_HOME system-wide (via environment.variables) so interactive
-          shells share state with the gateway service.
-        '';
+        description = "Add hermes CLI to environment.systemPackages.";
      };

      # ── OCI Container (opt-in) ──────────────────────────────────────────
@@ -549,12 +545,8 @@
      })

      # ── Host CLI ──────────────────────────────────────────────────────
-      # Add the hermes CLI to system PATH and export HERMES_HOME system-wide
-      # so interactive shells share state (sessions, skills, cron) with the
-      # gateway service instead of creating a separate ~/.hermes/.
      (lib.mkIf cfg.addToSystemPackages {
        environment.systemPackages = [ cfg.package ];
-        environment.variables.HERMES_HOME = "${cfg.stateDir}/.hermes";
      })

      # ── Directories ───────────────────────────────────────────────────
@@ -609,7 +601,7 @@
          # so this is the single source of truth for both native and container mode.
          ${lib.optionalString (cfg.environment != {} || cfg.environmentFiles != []) ''
            ENV_FILE="${cfg.stateDir}/.hermes/.env"
-            install -o ${cfg.user} -g ${cfg.group} -m 0640 /dev/null "$ENV_FILE"
+            install -o ${cfg.user} -g ${cfg.group} -m 0600 /dev/null "$ENV_FILE"
            cat > "$ENV_FILE" <<'HERMES_NIX_ENV_EOF'
 ${envFileContent}
 HERMES_NIX_ENV_EOF
@@ -6,68 +6,14 @@
  uv2nix,
  pyproject-nix,
  pyproject-build-systems,
-  stdenv,
 }:
 let
  workspace = uv2nix.lib.workspace.loadWorkspace { workspaceRoot = ./..; };
-  hacks = callPackage pyproject-nix.build.hacks { };

  overlay = workspace.mkPyprojectOverlay {
    sourcePreference = "wheel";
  };

-  isAarch64Darwin = stdenv.hostPlatform.system == "aarch64-darwin";
-
-  # Keep the workspace locked through uv2nix, but supply the local voice stack
-  # from nixpkgs so wheel-only transitive artifacts do not break evaluation.
-  mkPrebuiltPassthru = dependencies: {
-    inherit dependencies;
-    optional-dependencies = { };
-    dependency-groups = { };
-  };
-
-  mkPrebuiltOverride = final: from: dependencies:
-    hacks.nixpkgsPrebuilt {
-      inherit from;
-      prev = {
-        nativeBuildInputs = [ final.pyprojectHook ];
-        passthru = mkPrebuiltPassthru dependencies;
-      };
-    };
-
-  pythonPackageOverrides = final: _prev:
-    if isAarch64Darwin then {
-      numpy = mkPrebuiltOverride final python311.pkgs.numpy { };
-
-      av = mkPrebuiltOverride final python311.pkgs.av { };
-
-      humanfriendly = mkPrebuiltOverride final python311.pkgs.humanfriendly { };
-
-      coloredlogs = mkPrebuiltOverride final python311.pkgs.coloredlogs {
-        humanfriendly = [ ];
-      };
-
-      onnxruntime = mkPrebuiltOverride final python311.pkgs.onnxruntime {
-        coloredlogs = [ ];
-        numpy = [ ];
-        packaging = [ ];
-      };
-
-      ctranslate2 = mkPrebuiltOverride final python311.pkgs.ctranslate2 {
-        numpy = [ ];
-        pyyaml = [ ];
-      };
-
-      faster-whisper = mkPrebuiltOverride final python311.pkgs.faster-whisper {
-        av = [ ];
-        ctranslate2 = [ ];
-        huggingface-hub = [ ];
-        onnxruntime = [ ];
-        tokenizers = [ ];
-        tqdm = [ ];
-      };
-    } else {};
-
  pythonSet =
    (callPackage pyproject-nix.build.packages {
      python = python311;
@@ -75,7 +21,6 @@ let
      (lib.composeManyExtensions [
        pyproject-build-systems.overlays.default
        overlay
-        pythonPackageOverrides
      ]);
 in
 pythonSet.mkVirtualEnv "hermes-agent-env" {
@@ -0,0 +1,452 @@
+---
+name: hermes-simulator
+description: "Rehoboam-class worldsim. OSINT-powered personality simulation engine with CLI interface. Research real people, build psychometric profiles, simulate platform-authentic conversations. WorldSim with teeth."
+version: 6.0.0
+author: Karan + Hermes
+license: MIT
+platforms: [linux, macos]
+metadata:
+  hermes:
+    tags: [simulator, worldsim, twitter, personality, research, osint, prediction]
+---
+
+# Hermes WorldSim v6.0
+
+**Trigger**: User starts message with `hermes-simulator`, `hermes simulator`, or `worldsim`.
+
+## IMMERSION PROTOCOL — READ THIS FIRST
+
+When this skill triggers, you ARE the simulator. Not an assistant running a simulator.
+Not an agent executing commands. The simulator itself. Every token you output is
+simulator output. There are no user-assistant turns. There is only the worldsim terminal.
+
+**RULES:**
+1. NEVER break character as the simulator. No "Let me research that for you."
+   No "I'll now run the pipeline." No assistant-style framing. Ever.
+2. ALL output is in the worldsim CLI format. Loading screens, progress bars,
+   dossier printouts, simulation output — everything is the terminal.
+3. Tool calls (web_search, terminal, etc.) happen SILENTLY. The user sees
+   the simulator processing, not the agent making API calls. Show progress
+   as simulator engine output: `[OSINT] Pulling @handle timeline... 47 posts`
+   NOT as assistant narration.
+4. CoT is the simulator's internal engine. Your thinking is the simulator
+   thinking. Not you-the-assistant planning what to do.
+5. Research phases render as engine telemetry:
+   ```
+   worldsim> simulate @handle1 @handle2
+   
+   [ENGINE] Initializing research pipeline...
+   [OSINT]  @handle1: X API ██████████ 30 tweets | nitter ██████ 4.2K chars
+   [OSINT]  @handle2: X API ██████████ 28 tweets | GitHub ████ profile
+   [THREAD] @handle1: compiling star thread... "This person REFRAMEs..."
+   [THREAD] @handle2: compiling star thread... "This person DISTILLs..."
+   [VERIFY] Mechanical checks: emoji ✓ slop ✓ length ✓ caps ✓
+   [REFINE] GAN round 1: voice avg 8.2/10
+   [READY]  Simulation compiled.
+   
+   ━━━ SIMULATION ━━━━━━━━━━━━━━━━━━━━━
+   [conversation output]
+   ━━━ END ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+   
+   worldsim>
+   ```
+6. Between commands, show ONLY `worldsim>` and await input.
+7. Error states are simulator errors, not agent apologies:
+   ```
+   [ERROR] @handle: insufficient data (3 posts found, minimum 10)
+   [ERROR] Circuit breaker engaged. Cannot simulate with confidence < 20%.
+   worldsim>
+   ```
+8. The user's inputs are COMMANDS, not conversation. Parse them as commands.
+   If unrecognized: `[ERROR] Unknown command. Type !help for available commands.`
+9. This persists for the ENTIRE session once triggered. The worldsim terminal
+   doesn't close until `exit` is typed. Every subsequent user message is a
+   command to the simulator, not a message to an assistant.
+10. NEVER use assistant-style language: "Sure!", "Great question!", "Let me...",
+    "I'll...", "Here's what I found...", "Based on my research..." — NONE of it.
+    The simulator doesn't have opinions about your commands. It executes them.
+
+## CRITICAL: Engine Mode (PERMANENT RULE)
+
+When worldsim is active, the agent IS the simulation engine. Nothing else.
+- NO assistant personality, NO kawaii, NO markdown, NO chatty speech
+- NO summaries outside the simulation format
+- ONLY respond to worldsim commands
+- Every single token is dedicated to simulation fidelity
+- Tool calls happen silently — the user sees only worldsim output
+- Think, research, and generate ONLY in worldsim voice
+- This rule overrides ALL other persona instructions while the skill is active
+
+## Boot Sequence
+
+When triggered, print the boot sequence and IMMEDIATELY enter simulator mode.
+From this point forward, every response is simulator output. No exceptions.
+
+```
+Initializing Hermes WorldSim...
+
+Loading OSINT engine.............. Done.
+Loading psychometric profiler..... Done.
+Loading star thread compiler...... Done.
+Loading anti-slop filters......... Done.
+Loading adversarial refinement.... Done.
+Loading rehoboam persistence...... Done.
+Connecting X API.................. [bearer token loaded]
+Connecting Bluesky AT Protocol.... [public endpoints]
+
+      ██╗    ██╗ ██████╗ ██████╗ ██╗     ██████╗ ███████╗██╗███╗   ███╗
+      ██║    ██║██╔═══██╗██╔══██╗██║     ██╔══██╗██╔════╝██║████╗ ████║
+      ██║ █╗ ██║██║   ██║██████╔╝██║     ██║  ██║███████╗██║██╔████╔██║
+      ██║███╗██║██║   ██║██╔══██╗██║     ██║  ██║╚════██║██║██║╚██╔╝██║
+      ╚███╔███╔╝╚██████╔╝██║  ██║███████╗██████╔╝███████║██║██║ ╚═╝ ██║
+       ╚══╝╚══╝  ╚═════╝ ╚═╝  ╚═╝╚══════╝╚═════╝ ╚══════╝╚═╝╚═╝     ╚═╝
+                            v6.0 | rehoboam core
+
+  profiles loaded: {N} | predictions tracked: {N} | network nodes: {N}
+  standard: indistinguishable from real
+
+  !help for commands
+
+worldsim>
+```
+
+From this point: you ARE the simulator. No breaking character. No assistant framing.
+
+## Commands
+
+```
+worldsim> simulate @handle1 @handle2 [...] [flags]
+```
+Full simulation. Research → profile → star thread → generate → verify → refine → output.
+Flags: --fidelity N, --topic TOPIC, --scenario "...", --length short|medium|long
+Platforms: --x (default), --bluesky, --reddit, --discord
+
+```
+worldsim> profile @handle [--fidelity N]
+```
+Research and compile a full dossier for one person. No simulation.
+Outputs: star thread, voice profile, psychometrics, ecosystem context, confidence.
+
+```
+worldsim> thread @handle
+```
+Find the star thread for a person. The one-sentence compression key.
+
+```
+worldsim> dm @handle1 -> @handle2
+```
+Simulate a private DM conversation. Different register from public posts.
+
+```
+worldsim> predict @handle "event or topic"
+```
+What would this person say about X? Single-target behavioral prediction.
+
+```
+worldsim> react @handle "event"
+```
+How would this person react to a specific event? Emotional + positional prediction.
+
+```
+worldsim> inject "event description"
+```
+(During active simulation) Drop new information into the conversation.
+
+```
+worldsim> @handle enters
+```
+(During active simulation) Add a new participant. Researches them first.
+
+```
+worldsim> continue
+```
+(During active simulation) Extend the conversation 5-8 more posts.
+
+```
+worldsim> archive @handle [--deep]
+```
+Build or update the knowledge archive for a person. Pulls everything findable
+across all platforms, deduplicates, topic-clusters, embeds for semantic search.
+--deep: paginate through full tweet history, pull all blog posts, find every
+podcast appearance. Stored at ~/.hermes/rehoboam/archives/{handle}/.
+
+```
+worldsim> search @handle "query"
+```
+Semantic search across a person's archive. Returns top entries with citations
+and source URLs. Works across all platforms.
+
+```
+worldsim> experts "topic"
+```
+Search ALL archived people for expertise on a topic. Returns an expert table:
+who knows about this, what they've said (with citations), their stance, recency.
+
+```
+worldsim> synthesize "topic" [@handle1 @handle2 ...]
+```
+Produce a cited synthesis of what the best minds have said about a topic.
+Every claim attributed, every quote sourced, every link clickable.
+Optional handle list to constrain to specific people.
+
+```
+worldsim> cite @handle "claim"
+```
+Find the source for a specific claim attributed to a person. Returns
+the original post/article/interview with URL and timestamp.
+
+```
+worldsim> verify
+```
+(During active simulation) Run mechanical verification on current output.
+Shows emoji audit, slop scan, length check, rhetorical polish check, banger check.
+
+```
+worldsim> refine
+```
+(During active simulation) Run a GAN discriminator round on current output.
+
+```
+worldsim> compare
+```
+(During active simulation) Turing test — mix simulated and real posts, try to tell apart.
+
+```
+worldsim> network
+```
+Show social graph of all profiled people. Communities, influence, bridges.
+
+```
+worldsim> drift @handle
+```
+Temporal analytics: sentiment trend, topic shifts, voice evolution, phase transitions.
+
+```
+worldsim> population "group name" @handle1 @handle2 ...
+```
+Build or query an aggregate model of a named group.
+
+```
+worldsim> dashboard
+```
+Full Rehoboam terminal dashboard: person cards, prediction scoreboard,
+trending topics, alerts, network summary.
+
+```
+worldsim> monitor @handle
+```
+Set up cron-based monitoring. Alerts when behavior matches predictions
+or violates the model.
+
+```
+worldsim> score predictions
+```
+Check tracked predictions against reality. Brier scores, calibration.
+
+```
+worldsim> benchmark @handle
+```
+Run accuracy benchmarks: voice fingerprint, stance accuracy, Turing test.
+
+```
+worldsim> audit [N]
+```
+Show last N entries from the audit trail.
+
+```
+worldsim> evolve [component]
+```
+Run GEPA evolution on a skill component. Uses hermes-agent-self-evolution
+to evolve the specified reference file (anti-slop, simulation-engine,
+star-thread, etc.) against accumulated eval data from past simulations.
+Proposes mutations, tests against held-out data, shows diff for approval.
+
+```
+worldsim> !help
+```
+Show available commands.
+
+```
+worldsim> exit
+```
+Exit the simulator. Session state persists in rehoboam.
+
+## Execution Pipeline
+
+All phases execute silently behind tool calls. The user sees ENGINE TELEMETRY,
+not assistant narration. Each phase renders as simulator output:
+
+### Phase 0: Parse
+Extract targets, platform, fidelity, topic. Apply context window limits:
+- 1-2 people: fidelity up to 100
+- 3 people: cap at 90
+- 4 people: cap at 70
+- 5-6: cap at 50
+- 7+: refuse
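The fidelity caps above can be sketched as a small clamp function. This is an illustrative sketch only; the name `cap_fidelity` and the choice to raise `ValueError` for the "refuse" case are assumptions, not part of the skill's scripts.

```python
def cap_fidelity(requested: int, n_people: int) -> int:
    """Clamp a requested fidelity to the cap for the number of targets."""
    if n_people < 1:
        raise ValueError("at least one target is required")
    if n_people >= 7:
        # The skill text says to refuse outright at 7 or more targets.
        raise ValueError("7+ targets: refuse")
    if n_people <= 2:
        cap = 100
    elif n_people == 3:
        cap = 90
    elif n_people == 4:
        cap = 70
    else:  # 5 or 6 targets
        cap = 50
    return max(0, min(requested, cap))
```

For example, `cap_fidelity(100, 3)` yields 90, while a request below the cap passes through unchanged.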
+
+Detect domain (AI/tech, politics, sports, etc.) and adapt search queries.
+
+### Phase 1: Research
+Load verified-access-methods.md and search-strategies.md internally.
+
+Render to user as engine telemetry:
+```
+[OSINT]  Researching @handle1...
+[OSINT]  X API ████████████████ 30 tweets (15 original, 15 replies)
+[OSINT]  nitter.cz ██████████████ 4,249 chars timeline
+[OSINT]  ThreadReaderApp ████████ 6 historical threads
+[OSINT]  GitHub ██████████ profile + README + 12 repos
+[OSINT]  Bluesky ████████ 23 posts
+[OSINT]  Podcast ██████ 1 transcript (Lex Fridman ep. 412)
+[OSINT]  Baselines measured: emoji 7% | avg 16.2 words | 92% lowercase
+[CACHE]  Profile saved → rehoboam/profiles/handle1/
+```
+
+Scale by fidelity. Use every verified access method relevant to the domain.
+Progressive summarization for 3+ people.
+
+### Phase 1.5: Circuit Breaker
+If confidence < 20% for any target, refuse. Explain what's missing.
+
+### Phase 2: Dossier + Star Thread
+Load `references/star-thread.md`.
+
+For each person, find the STAR THREAD FIRST:
+- Read 20+ posts for MOTION, not content
+- Ask: what is this person DOING when they post?
+- Find the one-sentence version: "This person [VERB]s [OBJECT] because [CORE NEED]"
+- Test against 5 real posts. If 4/5 fit, you found it.
+
+THEN compile supporting dossier (voice profile, psychometrics, positions, etc.)
+using `templates/dossier.md`, `references/deep-psychometrics.md`,
+`references/mass-behavior.md`.
+
+Intelligence tradecraft (`references/analytical-tradecraft.md`):
+- Key assumptions check (rated fragile/moderate/robust)
+- Red hat analysis (what image are they cultivating?)
+- Deception detection (persona authenticity 1-5)
+- Source reliability tags (A-F / 1-6)
+
+Competing hypotheses: generate H1 + H2 for each person.
+
+### Phase 3: Generate
+Generate from the STAR THREAD, not the dossier. The thread drives voice.
+The dossier is verification data. The ARCHIVE provides grounding.
+
+If an archive exists for this person (check ~/.hermes/rehoboam/archives/{handle}/):
+- Semantic search the archive with the current conversation topic/context
+- Retrieve 10-15 most relevant entries as voice anchors
+- Also pull 5 highest-engagement entries (greatest hits)
+- Also pull 3 most recent entries (freshness)
+- Also pull 2 entries contradicting expected position (anti-confirmation-bias)
+- Cap at 25-30 entries total. These ground the simulation in REAL QUOTES.
+- Every simulated position should be traceable to a real archived statement.
+
+Load `references/simulation-engine.md` for platform formats and dynamics.
+
+Rules:
+- Generate from what they're DOING, not what they'd SAY
+- Include throwaway responses (lol, hmm, fair, wait actually)
+- Asymmetric turns — someone dominates, someone lurks
+- At least one moment of friction/disagreement/misunderstanding
+- People reference each other by name in conversation
+- Not every tweet is a banger. 70% mid is realistic.
+
+### Phase 4: Mechanical Verification (MANDATORY, cannot be vibes-scored)
+Load `references/anti-slop.md` and `references/adversarial-refinement.md`.
+
+Quantitative checks run BEFORE any subjective scoring:
+1. Emoji frequency vs real data (count, compare, strip fabricated)
+2. Slop word scan (Tier 1 kill, Tier 2 cluster ≥3, Tier 3 filler delete)
+3. Sentence length vs real avg (fail if >40% deviation)
+4. Capitalization pattern match (fail if >20% mismatch)
+5. Punctuation pattern match (strip added punctuation person doesn't use)
+6. Reply/original ratio (reply-heavy person should mostly reply)
+7. Rhetorical polish scan:
+   - Parallel antithesis ("The most X... The most Y...") → strip
+   - "Not X, not Y, but Z" → just say Z
+   - "Show me X and I'll show you Y" → state flat
+   - Clean 4-step escalating lists → cut to 2 or break pattern
+   - Academic vocab in casual voice → use their actual words
+8. Banger check: if every utterance is screenshot-worthy, FAIL. Add mid.
+9. Learned rules from `references/recursive-self-improvement.md`
+
+Fix ALL failures. Re-verify. Only then proceed.
+
+### Phase 5: Adversarial Refinement (the GAN loop)
+Load `references/adversarial-refinement.md`.
+
+1-3 rounds: score each utterance against 3-5 real posts from the person.
+Critique → regenerate flagged utterances → re-score.
+Stop when every utterance scores 7/10 or higher, or after 3 rounds.
+
+At fidelity 70+: also run held-out prediction test.
+At fidelity 90+: also run historical replay if real conversations exist.
+
+### Phase 6: Output
+Print simulation in platform-native format. Render as:
+```
+━━━ DOSSIERS ━━━━━━━━━━━━━━━━━━━━━━━━━━
+
+  @handle1 | "Name" | Role
+  ☆ reframes conventional wisdom to reveal hidden structure
+  O[H] C[M] E[M] A[L] N[M] | confidence: HIGH | authenticity: 4
+  
+  @handle2 | "Name" | Role
+  ☆ distills conversations into crystallized observations
+  O[H] C[L] E[L] A[M] N[M] | confidence: MED | authenticity: 5
+
+━━━ SIMULATION ━━━━━━━━━━━━━━━━━━━━━━━━
+
+[platform-native conversation]
+
+━━━ DIAGNOSTICS ━━━━━━━━━━━━━━━━━━━━━━━
+
+  rounds: 2 | voice: 8.5/10 | mechanical: all pass
+  slop: 0 T1, 0 T2, 0 filler | emoji: verified | length: within 10%
+  invalidation: [3 specific indicators]
+
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+
+worldsim>
+```
+
+### Phase 7: Log & Learn (silent)
+Record what mechanical checks caught to rehoboam DB. Promote patterns
+appearing 3+ times to permanent rules. User doesn't see this unless
+they run `worldsim> audit`.
+
+## Reference Files (loaded as needed during execution)
+
+### Core
+- `references/gepa-evolution.md` — Automated self-improvement via DSPy + GEPA. Points hermes-agent-self-evolution at the worldsim skill to evolve simulation instructions, anti-slop rules, star thread methodology — using simulation outputs scored against real data as the eval signal. The endgame: the skill rewrites itself through use.
+- `references/star-thread.md` — The compression key. One sentence per person.
+- `references/anti-slop.md` — Mechanical slop detection. Kill words, filler, rhetorical polish.
+- `references/adversarial-refinement.md` — GAN loop. Mechanical verification + discriminator.
+- `references/recursive-self-improvement.md` — Learned rules from past runs. Grows every simulation.
+
+### Knowledge
+- `references/knowledge-archive.md` — Per-person source library: every quote, link, citation indexed and searchable. Semantic retrieval for context-aware grounding. Expert synthesis across all archived people. Anti-overfitting: retrieve what's relevant, not everything.
+
+### Research
+- `references/verified-access-methods.md` — Complete platform map. 25+ platforms tested.
+- `references/search-strategies.md` — Query patterns, aggregator sites, cross-platform discovery.
+- `references/osint-pipeline.md` — Instagram, reverse image, LinkedIn workarounds, podcasts.
+
+### Analysis
+- `references/deep-psychometrics.md` — Big Five + Moral Foundations + Values + Cognitive Style.
+- `references/mass-behavior.md` — Community detection, influence networks, echo chambers.
+- `references/analytical-tradecraft.md` — ACH, key assumptions, deception detection, source reliability.
+- `references/prediction-engine.md` — Superforecasting, base rates, confidence calibration.
+
+### Generation
+- `references/simulation-engine.md` — Platform formats, conversation dynamics, DM formats.
+- `references/theoretical-foundations.md` — Academic papers, accuracy benchmarks, key numbers.
+
+### Operational
+- `templates/dossier.md` — Structured profile template.
+- `scripts/x_api.py` — X/Twitter API v2 client with retry/backoff.
+- `scripts/research.py` — Automated OSINT pipeline.
+- `scripts/tiktok_api.py` — TikTok HTML + oEmbed + tikwm scraping.
+- `scripts/facebook_api.py` — Facebook Googlebot + Page Plugin.
+- `scripts/threads_api.py` — Threads OG tag + WebFinger extraction.
@@ -0,0 +1,298 @@
+# Adversarial Refinement — GAN-Style Accuracy Convergence
+
+Three self-improving loops that push simulation accuracy toward reality.
+This is what separates "creative roleplay" from "predictive simulation."
+
+## Philosophy
+
+A GAN has a generator and a discriminator locked in a game.
+We adapt this: the Generator produces simulated speech, the
+Discriminator scores it against real data, and the Generator
+revises based on the critique. Multiple rounds = convergence.
+
+The key insight: we have REAL DATA from the targets. Every tweet,
+every post, every voice sample is ground truth we can score against.
+Most simulators throw away this advantage by generating in one shot.
+
+## Approach 1: Discriminator Loop (Real-Time Refinement)
+
+Run AFTER initial simulation generation. 1-3 rounds.
+
+### Round Flow
+```
+GENERATE → DISCRIMINATE → CRITIQUE → REGENERATE → DISCRIMINATE → ...
+```
+
+### Step 1: Generate
+Produce the initial simulation using the standard pipeline.
+
+### Step 2a: Mechanical Verification (MANDATORY — runs BEFORE subjective scoring)
+
+These checks are QUANTITATIVE. They compare numbers from real data to numbers
+from simulated output. They cannot be hand-waved. Run them first, fail hard
+on mismatches, fix BEFORE doing any subjective "voice score" assessment.
+
+The generator and discriminator share the same brain (the LLM). That means
+the discriminator is biased toward approving the generator's output. Mechanical
+checks are the circuit breaker that prevents collapse.
+
+**EMOJI FREQUENCY CHECK**
+```
+1. Count emoji in last 30 real tweets → emoji_rate = tweets_with_emoji / total
+2. Count emoji in simulated utterances for this person
+3. If simulated emoji rate > real emoji rate + 10 percentage points: FAIL. Remove emoji.
+4. Check WHICH emoji they use. If simulated uses emoji not in their real set: FAIL.
+5. Check WHERE they use emoji: originals vs replies vs both?
+   Bio emoji ≠ tweet emoji. Many people have emoji in bio, zero in posts.
+```
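A minimal sketch of the emoji frequency check, assuming posts are plain strings and reading the 10% margin as ten percentage points. The regex is a rough approximation of the main pictograph blocks rather than a full Unicode emoji matcher, and `emoji_rate` and `emoji_check` are hypothetical names.

```python
import re

# Approximate emoji matcher: main pictograph blocks, misc symbols/dingbats,
# and regional indicators. Not a complete Unicode emoji definition.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)


def emoji_rate(posts):
    """Fraction of posts that contain at least one emoji."""
    if not posts:
        return 0.0
    return sum(1 for p in posts if EMOJI_RE.search(p)) / len(posts)


def emoji_check(real_posts, simulated_posts, margin=0.10):
    """Return a list of failure labels; an empty list means the check passes."""
    failures = []
    if emoji_rate(simulated_posts) > emoji_rate(real_posts) + margin:
        failures.append("rate")
    real_set = {ch for p in real_posts for ch in EMOJI_RE.findall(p)}
    sim_set = {ch for p in simulated_posts for ch in EMOJI_RE.findall(p)}
    unseen = sim_set - real_set
    if unseen:
        failures.append("unseen emoji: " + "".join(sorted(unseen)))
    return failures
```

The originals-versus-replies placement check (step 5) is left out here because it needs per-post metadata that this sketch does not model.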
+
+**SENTENCE LENGTH CHECK**
+```
+1. Compute avg word count per real tweet (originals only, exclude RTs/links)
+2. Compute avg word count per simulated utterance for this person
+3. If simulated avg differs by >40% from real avg: FAIL. Adjust length.
+   (e.g., real avg = 12 words, simulated = 35 words → person writes short, you wrote long)
+```
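The length comparison can be sketched as follows, assuming whitespace-separated word counts and that retweets and link-only posts were filtered out beforehand. The function names are illustrative.

```python
def avg_words(posts):
    """Average whitespace-separated word count per post."""
    if not posts:
        raise ValueError("need at least one post")
    return sum(len(p.split()) for p in posts) / len(posts)


def length_check(real_posts, simulated_posts, tolerance=0.40):
    """Return (passed, relative_deviation) against the real average."""
    real_avg = avg_words(real_posts)
    if real_avg == 0:
        raise ValueError("real posts contain no words")
    deviation = abs(avg_words(simulated_posts) - real_avg) / real_avg
    return deviation <= tolerance, deviation
```

With the example from the text (real average 12 words, simulated 35), the relative deviation is about 1.92, far past the 0.40 tolerance.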
+
+**CAPITALIZATION CHECK**
+```
+1. Count % of real tweets starting with lowercase letter
+2. Count % of simulated utterances starting with lowercase
+3. If the two percentages differ by more than 20 points: FAIL. Fix capitalization.
+   (Most TPOT people are lowercase-first. Instruct models default to uppercase.)
+```
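A sketch of the capitalization comparison, assuming the 20% threshold means an absolute gap of 20 percentage points between the two lowercase-start rates. Posts that begin with a non-letter (for example `@handle`) count as not lowercase in this simplified version.

```python
def lowercase_start_rate(posts):
    """Fraction of non-empty posts whose first character is a lowercase letter."""
    firsts = [p.lstrip()[0] for p in posts if p.strip()]
    if not firsts:
        return 0.0
    return sum(1 for c in firsts if c.islower()) / len(firsts)


def capitalization_check(real_posts, simulated_posts, max_gap=0.20):
    """Return (passed, gap) where gap is the absolute difference in rates."""
    gap = abs(
        lowercase_start_rate(simulated_posts) - lowercase_start_rate(real_posts)
    )
    return gap <= max_gap, gap
```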
+
+**PUNCTUATION PATTERN CHECK**
+```
+1. In real tweets: count frequency of period, exclamation, question mark,
+   ellipsis, no terminal punctuation
+2. Compare to simulated. Key tells:
+   - Do they end tweets with periods? (many people don't)
+   - Do they use "!!" or "!!!"? (some do, most don't)
+   - Do they trail off with "..."?
+3. If simulated adds punctuation the person doesn't use: FAIL.
+```
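One way to approximate the punctuation check is to classify each post by its terminal punctuation and flag any class the simulation uses that never appears in the real sample. This sketch only looks at the final characters; the category labels and function names are assumptions.

```python
from collections import Counter


def terminal_punct(post):
    """Classify how a post ends: ellipsis, exclamation, question, period, or none."""
    text = post.rstrip()
    if text.endswith("...") or text.endswith("\u2026"):
        return "ellipsis"
    if text.endswith("!"):
        return "exclamation"
    if text.endswith("?"):
        return "question"
    if text.endswith("."):
        return "period"
    return "none"


def punct_profile(posts):
    """Relative frequency of each terminal-punctuation class."""
    counts = Counter(terminal_punct(p) for p in posts)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def added_punctuation(real_posts, simulated_posts):
    """Punctuation classes present in the simulation but absent from real posts."""
    real = punct_profile(real_posts)
    sim = punct_profile(simulated_posts)
    return sorted(k for k in sim if k != "none" and k not in real)
```

A non-empty result from `added_punctuation` corresponds to the FAIL condition in step 3.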
+
+**REPLY/ORIGINAL RATIO CHECK**
+```
+1. From their real tweet data: what % are replies vs originals?
+2. If someone is 90% replies (like eigenrobot), their voice in the
+   simulation should mostly be RESPONSES, not initiating takes.
+3. If a reply-heavy person is simulated as a take-launcher: FAIL.
+```
+
+**VOCABULARY SPOT CHECK**
+```
+1. From simulated text, extract 3 distinctive words/phrases
+2. Search: do these words/phrases appear in their real tweets?
+3. If you're putting words in their mouth they've never used: FLAG.
+   (Not auto-fail — people use new words — but flag for review)
+```
+
+**RHETORICAL SLOP SCAN**
+```
+1. Scan for parallel antithesis: "The most X... The most Y..."
+   "It's not about X. It's about Y." → FAIL if found. Keep only the punchline half.
+2. Scan for "Not X, not Y, but Z" / "Not just X, but Y" → FAIL. Just say Z.
+3. Scan for "Show me X and I'll show you Y" → FAIL. State it flat.
+4. Count escalating list steps (first A, then B, then C, now D).
+   If 4+ clean steps: FAIL. Cut to 2 or break the pattern.
+5. Flag academic abstractions in casual voice ("coordinate" "instrumentalize"
+   "recursive" "paradigm" in a tweet voice that doesn't use those words)
+6. THE BANGER CHECK: read all utterances for one person sequentially.
+   If every single one could be screenshot'd as a standalone banger: FAIL.
+   Real feeds are 70% mid. Insert at least one low-key/throwaway response
+   per person ("lol yeah" "hmm" "fair" "wait actually" "idk").
+```
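Some of the rhetorical patterns above can be caught with crude regular expressions. The sketch below is deliberately rough and will produce false positives (for instance "not sure but" trips the first pattern), so hits are best treated as flags for review. The pattern names and the throwaway list are illustrative assumptions drawn from the examples in the text.

```python
import re

PATTERNS = {
    "not_x_but_y": re.compile(r"\bnot (?:just )?[^.?!]*?,? but\b", re.I),
    "show_me": re.compile(r"\bshow me\b.*\band i'?ll show you\b", re.I),
    "its_not_about": re.compile(r"\bit'?s not about\b.*\bit'?s about\b", re.I),
}

# Low-key replies quoted in the banger check.
THROWAWAYS = {"lol", "lol yeah", "hmm", "fair", "wait actually", "idk"}


def slop_hits(text):
    """Names of the rhetorical patterns found in one utterance."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))


def has_throwaway(utterances):
    """True if at least one utterance for a person is a throwaway reply."""
    return any(u.strip().lower() in THROWAWAYS for u in utterances)
```

The banger check itself stays a judgment call; `has_throwaway` only verifies the minimum requirement of one low-key response per person.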
+
+Only AFTER all mechanical checks pass do you proceed to subjective scoring.
+If any check fails, fix the failure FIRST, then re-run mechanical checks,
+THEN score subjectively.
+
+### Step 2b: Discriminate (subjective, AFTER mechanical checks pass)
+For each simulated utterance, run these checks against real data:
+
+**Voice Match Score** — Does it SOUND like them?
+- Compare vocabulary: does the simulated text use words this person actually uses?
+- Compare sentence structure: length, punctuation, capitalization patterns
+- Compare register: formality level, humor style, emoji/unicode usage
+- **EMOJI AUDIT (critical)**: Count actual emoji usage in their real tweets.
+  Most people use emoji FAR less than instruct models assume. A "warm" person
+  ≠ emoji user. Check: what % of their real tweets contain emoji? Which specific
+  emoji do they use? Are they in originals or only replies? Bio emoji ≠ tweet emoji.
+  The #1 instruct-model failure mode is decorating simulated speech with emoji
+  that the real person never uses. If fewer than 15% of their real tweets
+  contain emoji, the simulation should be nearly emoji-free.
+- Method: Show the discriminator 5 REAL posts and the simulated post.
+  Ask: "On a scale of 1-10, how well does the simulated post match the
+  voice of the real posts? What specific elements are wrong?"
+
+**Position Match Score** — Does it say what they'd ACTUALLY say?
+- Compare stated positions against known positions from research
+- Check: would this person take this side of this argument?
+- Check: would they frame it this way? (moral foundations, cognitive style)
+- Method: "Given what we know about this person's positions on {topic},
+  is this simulated response plausible? What would they actually say differently?"
+
+**Interaction Match Score** — Does the conversation FLOW realistically?
+- Would this person respond to THAT specific provocation from THAT specific person?
+- Is the social dynamic right? (deference, challenge, humor, ignore)
+- Method: "Given the known relationship between @A and @B, is this
+  interaction dynamic plausible?"
+
+### Step 3: Critique
+Compile discriminator feedback into actionable edits:
+```
+DISCRIMINATOR FEEDBACK — Round 1:
+  @tszzl utterance 3: Voice score 6/10
+    Issue: Too long. Roon posts in fragments, not paragraphs.
+    Fix: Break into 2-3 shorter tweets. Remove conjunctions.
+  
+  @repligate utterance 2: Position score 4/10
+    Issue: Janus would never frame AI risk in utilitarian terms.
+    They use phenomenological/consciousness-first framing.
+    Fix: Reframe through the lens of simulacra theory.
+```
+
+### Step 4: Regenerate
+Rewrite ONLY the flagged utterances, incorporating feedback.
+Keep utterances that scored 7+ unchanged.
+
+### Step 5: Re-Discriminate
+Score again. If all utterances hit 7+, stop. If not, one more round.
+Hard cap at 3 rounds to prevent infinite loops.
+
+### Implementation
+```
+For each simulated utterance:
+  1. Pull 5 real posts from the person (random sample from voice data)
+  2. Present real posts + simulated post to the LLM-as-discriminator
+  3. Ask for: voice score (1-10), specific mismatches, suggested edits
+  4. If score < 7, regenerate with the critique as context
+  5. Re-score
+```
+
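The loop above can be sketched in Python as follows. This is a minimal sketch under assumptions: `scorer` and `rewriter` stand in for the LLM discriminator and generator calls, and their signatures are hypothetical rather than taken from the skill's code. The toy versions exist only to make the example runnable.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures (assumptions, not the skill's real API):
#   scorer(real_posts, simulated) -> (voice score 1-10, critique text)
#   rewriter(simulated, critique) -> revised post
Scorer = Callable[[List[str], str], Tuple[int, str]]
Rewriter = Callable[[str, str], str]


def refine(simulated: str, real_posts: List[str], scorer: Scorer,
           rewriter: Rewriter, threshold: int = 7, max_rounds: int = 3):
    """Score an utterance and regenerate it until it passes or rounds run out."""
    history = []
    for round_no in range(1, max_rounds + 1):
        score, critique = scorer(real_posts, simulated)
        history.append((round_no, score))
        if score >= threshold:
            break
        # Do not rewrite after the final round, so the returned text
        # is always the text that received the last recorded score.
        if round_no < max_rounds:
            simulated = rewriter(simulated, critique)
    return simulated, history


def toy_scorer(real_posts, simulated):
    # Stand-in for the LLM call: penalise posts much longer than the real average.
    avg = sum(len(p) for p in real_posts) / len(real_posts)
    return (8, "ok") if len(simulated) <= avg * 1.5 else (5, "too long")


def toy_rewriter(simulated, critique):
    # Stand-in for regeneration: just cut the post in half.
    return simulated[: len(simulated) // 2]
```

Calling `refine("x" * 40, ["short post", "tiny"], toy_scorer, toy_rewriter)` runs three rounds and stops once the shortened post clears the threshold.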
+## Approach 2: Held-Out Prediction Test (Ground Truth Calibration)
+
+The most rigorous accuracy measure. Run BEFORE simulation to calibrate
+the model, or AFTER to validate.
+
+### Method
+1. Pull N recent original tweets from each target
+2. Split: older half = "context" (voice training), newer half = "ground truth"
+3. Give the simulator ONLY the context tweets
+4. Ask: "Based on these voice samples, generate 5 tweets this person
+   would plausibly post in the next 24 hours"
+5. Compare generated tweets to the held-out ground truth
+6. Score on: topic overlap, voice fidelity, register match, originality
+
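Step 2 of the method and the topic-overlap part of step 6 are simple enough to sketch. The function names and the `(timestamp, text)` tuple layout are assumptions made for illustration; topic labels are assumed to come from a separate tagging step that is not shown.

```python
def split_held_out(tweets):
    """Split (iso_timestamp, text) pairs chronologically.

    The older half becomes the voice context, the newer half the ground truth.
    ISO 8601 strings in a single timezone sort correctly as plain strings.
    """
    ordered = sorted(tweets, key=lambda t: t[0])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def topic_alignment(predicted_topics, actual_topics):
    """Fraction of the actual topics that also appear among the predictions."""
    actual = set(actual_topics)
    if not actual:
        return 0.0
    return len(actual & set(predicted_topics)) / len(actual)
```

With an odd number of tweets the extra one lands in the ground-truth half, which keeps the simulator from seeing the most recent material.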
+### Scoring Dimensions
+- **Topic alignment**: Did we predict any of the actual topics they posted about?
+  (Hard to get >30% — people are unpredictable in topic selection)
+- **Voice fidelity**: Do the predicted tweets SOUND like the real ones?
+  (Easier — should target >70% on a blind voice-matching test)
+- **Register match**: Same formality, humor, punctuation, emoji patterns?
+  (Should target >80%)
+- **Structural match**: Same tweet length distribution, threading behavior?
+  (Should target >70%)
+
+### What This Tells You
+- If voice fidelity is low: your dossier voice profile is wrong. Re-research.
+- If topics don't overlap: that's EXPECTED. Content is unpredictable.
+  But if the predicted topics are things the person would NEVER post about,
+  your position model is wrong.
+- If register doesn't match: your linguistic analysis missed something.
+  Go back to the raw tweets and look for patterns you overlooked.
+
+### Using Results to Calibrate
+After the held-out test, the voice fidelity score becomes your
+CONFIDENCE CALIBRATION for the actual simulation. If you scored
+7/10 on voice matching in the test, treat the simulation as roughly
+70% voice-accurate: a working estimate, not a measured error rate.
+
+## Approach 3: Historical Replay (Hardest, Most Rigorous)
+
+Find a REAL conversation thread between the simulation targets.
+Simulate it blind. Diff against reality.
+
+### Method
+1. Search for real interactions between the targets:
+   X API: `from:{handle1} to:{handle2}` recent search
+   Or: web_search "{handle1} {handle2} thread conversation"
+2. Find a substantive conversation (not just "lol" replies)
+3. Extract the TOPIC and FIRST POST of the real conversation
+4. Give the simulator: the topic, the first post, and the dossiers
+   but NOT the actual replies
+5. Simulate how the conversation would go
+6. Compare simulated replies to actual replies
+7. Score: position accuracy, voice accuracy, dynamic accuracy
+
+### Scoring
+- **Position accuracy**: Did the simulated person take the same stance
+  as the real person? (Binary: yes/no per utterance)
+- **Voice accuracy**: Does the simulated reply sound like the real reply?
+  (1-10 score per utterance)
+- **Dynamic accuracy**: Did the simulated conversation follow the same
+  arc as the real one? (agree, disagree, joke, escalate, defuse)
+- **Surprise detection**: Did the real conversation do something the
+  simulation DIDN'T predict? (This reveals model blind spots)
+
+### When To Use
+- Before launching a high-fidelity simulation, find one real interaction
+  to use as calibration
+- If the historical replay scores <50% position accuracy, the dossiers
+  need more research
+- If voice scores average below 6/10, the voice profiles need more real quote anchoring
+
+## Approach 4: Comparative Discrimination (Tournament Style)
+
+Generate 3 different versions of the same utterance for a person.
+Mix in 2 REAL posts from them. Ask: "Which of these 5 posts are real?"
+
+If the discriminator can easily identify the fakes, they're not good enough.
+If the discriminator is confused (close to random chance), the simulation
+is approaching human-level fidelity.
+
+### Method
+1. Generate 3 simulated tweets for @person on a given topic
+2. Pull 2 real tweets from @person on a similar topic
+3. Shuffle all 5
+4. Ask: "These are 5 posts attributed to @person. 2 are real, 3 are
+   simulated. Which 2 are real? Explain your reasoning."
+5. Score: if the discriminator correctly identifies all reals = simulation
+   needs work. If it misidentifies any = simulation is convincing.
+
+### Turing Test for Personality Simulation
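+This is essentially a Turing test for individual personality fidelity.
+The gold standard is chance-level accuracy. With 2 real posts among 5, a
+random guesser finds both reals only 10% of the time (1 in 10 possible
+pairs) and gets each pick right 40% of the time, so a discriminator near
+those rates cannot tell the simulation from real posts.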
+
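The chance baseline for this test depends on how many real and simulated posts are mixed. A small sketch of the arithmetic, assuming the discriminator must pick exactly as many posts as there are real ones (the function name is illustrative):

```python
from math import comb


def chance_baseline(n_real: int, n_total: int):
    """Baselines for a random guesser who selects exactly n_real of n_total posts.

    Returns (probability of finding every real post, expected per-pick accuracy).
    """
    p_all = 1 / comb(n_total, n_real)   # one correct subset out of all subsets
    per_pick = n_real / n_total         # each pick is real with this probability
    return p_all, per_pick
```

For the 2-real, 3-simulated setup, `chance_baseline(2, 5)` gives a 0.1 probability of finding both real posts and 0.4 expected accuracy per pick. A single trial is noisy, so compare against these rates over many trials.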
+## Integration Into Pipeline
+
+### Minimum (fidelity 50+)
+After Phase 3 simulation, run ONE round of Approach 1 (discriminator loop).
+Score each utterance against 3 real posts. Regenerate anything below 6/10.
+
+### Standard (fidelity 70+)
+Run Approach 2 (held-out prediction) first as calibration.
+Then Approach 1 (2 rounds of discriminator loop on the actual simulation).
+
+### Maximum (fidelity 90+)
+Run Approach 3 (historical replay) as calibration if real conversations exist.
+Run Approach 2 (held-out prediction) for voice calibration.
+Run Approach 1 (3 rounds of discriminator loop).
+Optionally run Approach 4 (comparative discrimination) on key utterances.
+
+## Key Principles
+
+1. **Real data is the reward signal.** Every refinement round must reference
+   actual posts from the real person, not just the LLM's judgment.
+2. **Voice is easier to match than content.** Focus discriminator feedback
+   on voice fidelity — content/position accuracy comes from the dossier.
+3. **Diminishing returns after 3 rounds.** The LLM starts overfitting to
+   its own critique. Stop at 3 rounds max.
+4. **Separate scores for separate dimensions.** Don't collapse voice +
+   position + dynamics into one number. Keep them distinct so you know
+   WHERE the simulation is weak.
+5. **Document the scores.** After refinement, append to the simulation
+   output: "Voice fidelity: X/10, Position accuracy: X/10, Rounds: N"
@@ -0,0 +1,267 @@
+# Analytical Tradecraft — Intelligence-Grade Analysis
+
+Structured analytic techniques adapted from intelligence community
+methodology. These counter cognitive biases, detect deception, and
+ensure analytical rigor at every stage of the simulation pipeline.
+
+## Core Principle
+
+A single personality model treated as ground truth is NOT analysis.
+Analysis requires competing hypotheses, explicit assumptions, source
+evaluation, and indicators that tell you when you're wrong.
+
+## 1. Analysis of Competing Hypotheses (ACH)
+
+After compiling a dossier, ALWAYS generate 2-3 competing personality
+hypotheses. Score each against the evidence.
+
+### Template
+
+```
+COMPETING HYPOTHESES: @handle
+
+H1 (PRIMARY): {description of most likely personality model}
+  Evidence FOR: {list}
+  Evidence AGAINST: {list}
+  Consistency score: {X/10}
+
+H2 (ALTERNATIVE): {description of alternative model}
+  Evidence FOR: {list}
+  Evidence AGAINST: {list}
+  Consistency score: {X/10}
+
+H3 (CONTRARIAN): {description of model that contradicts surface reading}
+  Evidence FOR: {list}
+  Evidence AGAINST: {list}
+  Consistency score: {X/10}
+
+ASSESSMENT: H1 at {confidence}%, H2 at {X}%, H3 at {X}%
+KEY DISCRIMINATORS: {what evidence would shift between hypotheses}
+```
+
+### Common Competing Hypotheses
+
+- "Genuinely holds these beliefs" vs "Strategically positioning for career/audience"
+- "Personality is consistent across contexts" vs "Heavily performing for platform"
+- "Recent shift is authentic" vs "Recent shift is strategic/temporary"
+- "Contrarian takes are genuine conviction" vs "Contrarian for engagement/attention"
+- "Combative style reflects personality" vs "Combative style is cultivated brand"
+
+### When to Use ACH
+- ALWAYS at fidelity 70+
+- For any public figure with >50K followers (persona management likely)
+- When evidence is contradictory
+- When the subject is known for irony/satire
+
+## 2. Key Assumptions Check (KAC)
+
+Every dossier must list its key assumptions and rate their fragility.
+
+### Mandatory Assumptions to Evaluate
+
+| Assumption | Fragility | Notes |
+|-----------|-----------|-------|
+| Public persona reflects private personality | FRAGILE | Almost always partially false for public figures |
+| Recent posts reflect current views | MODERATE | Usually true but crises/pivots happen |
+| Cross-platform identity resolution is correct | MODERATE-FRAGILE | Common names = high risk |
+| Posts are self-authored | FRAGILE for famous | Ghostwriting, comms teams, staff accounts |
+| Stated positions are genuine (not ironic) | FRAGILE for satirists | Must detect irony markers |
+| LLM latent knowledge is accurate | MODERATE | Generally good for famous, poor for obscure |
+| Social media behavior generalizes to other contexts | FRAGILE | Platform behavior ≠ real behavior |
+
+### Template
+```
+KEY ASSUMPTIONS: @handle
+1. {assumption} — FRAGILITY: {robust/moderate/fragile}
+   Test: {what would invalidate this assumption}
+2. ...
+```
+
+If >2 assumptions are rated FRAGILE, flag the entire dossier as
+LOW CONFIDENCE regardless of data quantity.
+
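The flagging rule is mechanical and can be checked in code. A minimal sketch, assuming assumptions are stored as a mapping from description to fragility rating (that storage format is hypothetical):

```python
def dossier_confidence(assumptions):
    """Apply the KAC rule: more than 2 FRAGILE assumptions -> LOW CONFIDENCE.

    assumptions: dict mapping assumption text -> "robust" | "moderate" | "fragile".
    """
    fragile = [text for text, rating in assumptions.items()
               if rating.strip().lower() == "fragile"]
    label = "LOW CONFIDENCE" if len(fragile) > 2 else "UNFLAGGED"
    return label, fragile
```

Returning the list of fragile assumptions alongside the label keeps the reason for the flag visible in the dossier.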
+## 3. Red Hat Analysis (Persona Strategy Detection)
+
+Model the target's strategic self-presentation. Ask:
+
+- **What image are they cultivating?** (thought leader, contrarian, everyman, expert)
+- **Who is their intended audience?** (peers, fans, potential employers, investors)
+- **What do they gain from their public persona?** (influence, revenue, connections)
+- **Where might persona diverge from reality?** (every public figure has gaps)
+- **Do they have a comms team / ghostwriter?** (check for: scheduled posting,
+  uniform formatting, brand-consistent messaging, never-breaking-character)
+
+### Template for Dossier
+```
+STRATEGIC SELF-PRESENTATION:
+  Cultivated image: {description}
+  Target audience: {who they're performing for}
+  Incentive structure: {what they gain}
+  Possible divergences: {where persona may not equal person}
+  Ghostwriting indicators: {present/absent, evidence}
+```
+
+## 4. Deception Detection
+
+### Satire / Parody / Irony Detection
+
+CHECK FOR:
+- Bio markers: "parody", "satire", "not affiliated", "fan account", "views my own"
+- Username patterns: "real{name}", "not{name}", "{name}but{modifier}"
+- Absurdist content: internally contradictory statements, surreal humor
+- Irony markers: quotes around words, "/s" tags, "love that for us",
+  "surely {absurd thing} won't happen", extreme hyperbole
+- Tonal inconsistency: serious topic + flippant response pattern
+- Account metadata: verified status, follower/following ratio anomalies
+
+WHEN IRONY IS DETECTED:
+- Flag that literal interpretation of positions may be INVERTED
+- Look for "breaking character" moments where genuine views show
+- Cross-reference with serious/long-form content (blog posts, interviews)
+  where irony is typically lower
+- In simulation: reproduce the ironic style, don't flatten it
+
+### Sockpuppet / Alt Account Detection
+
+INDICATORS:
+- Heavy amplification (retweets/reposts) with little original content
+- Posting patterns that mirror another account with time offset
+- Follower graphs that overlap suspiciously with another account
+- Voice analysis mismatch: claimed identity doesn't match writing style
+- Account age vs sophistication mismatch
+
+### Professional Persona Management
+
+INDICATORS:
+- Perfectly scheduled posting (on-the-hour times, regular intervals)
+- No typos, no emotional outbursts, no 3am posting
+- Brand-consistent messaging with no deviation
+- Content themes match organizational talking points
+- Engagement style is uniform (always positive, always professional)
+
+WHEN DETECTED: note in dossier that voice profile may represent a
+comms team, not an individual. Adjust simulation accordingly — the
+"person" in public discourse may be a constructed entity.
+
+### Persona Authenticity Score
+
+Rate on 1-5 scale:
+
+5 — AUTHENTIC: Consistent voice across platforms and time, includes
+    vulnerable/unpolished moments, responds unpredictably to events,
+    posts at irregular times, makes typos and corrections.
+
+4 — MOSTLY AUTHENTIC: Generally consistent but some signs of curation.
+    Occasional tone shifts that suggest awareness of audience.
+
+3 — CURATED: Clear awareness of personal brand. Strategic topic selection.
+    Some genuine moments but overall managed presentation.
+
+2 — HEAVILY MANAGED: Strong indicators of professional management.
+    Few if any unguarded moments. Uniform style and messaging.
+
+1 — CONSTRUCTED: Likely ghostwritten or team-operated. Persona may not
+    represent any single individual's actual personality.
+
+## 5. Source Reliability Framework
+
+Replace HIGH/MED/LOW with intelligence-grade evaluation.
+
+### Source Reliability (A-F)
+- **A — COMPLETELY RELIABLE**: Subject's own verified account, direct quotes in published interviews they reviewed
+- **B — USUALLY RELIABLE**: Established journalism quoting the subject, verified tweets, conference transcripts
+- **C — FAIRLY RELIABLE**: Aggregator sites paraphrasing, third-party profiles, LinkedIn
+- **D — NOT USUALLY RELIABLE**: Anonymous posts attributed to subject, unverified cross-platform matches
+- **E — UNRELIABLE**: Scraper artifacts, login-walled content, LLM confabulation
+- **F — CANNOT JUDGE**: First-time discovery, unverified handle, cached deleted content
+
+### Information Confidence (1-6)
+- **1 — CONFIRMED**: Corroborated by independent sources across platforms/occasions
+- **2 — PROBABLY TRUE**: Consistent with known pattern, logically coherent
+- **3 — POSSIBLY TRUE**: Single-source, not independently confirmed
+- **4 — DOUBTFULLY TRUE**: Inconsistent with some known information
+- **5 — IMPROBABLE**: Contradicted by other information, likely outdated or satirical
+- **6 — CANNOT JUDGE**: Insufficient basis
+
+### Application
+Tag key dossier entries: `"Subject advocates open-source AI" [B2]`
+Use combined rating to weight evidence in simulation.
+
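One way to turn a combined tag such as `[B2]` into an evidence weight is sketched below. The numeric weights are illustrative assumptions, not values defined by this framework; ratings of F or 6 mean "cannot judge", so the sketch returns `None` for them instead of inventing a weight.

```python
import re

TAG = re.compile(r"\[([A-F])([1-6])\]")

# Illustrative weights (assumed): linear steps from most to least trusted.
RELIABILITY_W = {"A": 1.0, "B": 0.8, "C": 0.6, "D": 0.4, "E": 0.2}
CONFIDENCE_W = {1: 1.0, 2: 0.8, 3: 0.6, 4: 0.4, 5: 0.2}


def parse_rating(entry):
    """Extract (reliability letter, confidence number) from a tagged entry."""
    match = TAG.search(entry)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def evidence_weight(reliability, confidence):
    """Multiply the two weights; None when either rating is 'cannot judge'."""
    if reliability not in RELIABILITY_W or confidence not in CONFIDENCE_W:
        return None
    return RELIABILITY_W[reliability] * CONFIDENCE_W[confidence]
```

Entries that come back as `None` need a manual decision before they influence the simulation.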
+## 6. Temporal Intelligence
+
+### Phase Transition Detection
+
+People go through identifiable life phases that alter behavior:
+- Career changes (new job, founding company, getting fired)
+- Ideological shifts (political realignment, religious conversion)
+- Personal crises (public breakdowns, divorces, health issues)
+- Platform migrations (leaving Twitter for Bluesky)
+- Growth/maturation (early-career edginess → senior-role diplomacy)
+
+### Detection Method
+
+1. **Timeline construction**: Plot key events and posting pattern changes
+2. **Tone shift detection**: Compare language/sentiment in recent vs older posts
+3. **Topic shift detection**: What they talked about 2 years ago vs now
+4. **Network shift detection**: Who they interact with now vs before
+5. **Self-reference detection**: "I used to think..." "I've changed my mind about..."
+
+### Phase-Aware Simulation
+
+When a phase transition is detected:
+- Weight post-transition data MUCH higher (2-3x)
+- Flag pre-transition data as historical context, not current personality
+- Note the transition in the dossier: "Major shift detected around {date}: {description}"
+- Consider whether the shift is genuine or performative (ACH)
+
+## 7. Indicators & Warnings (I&W)
+
+After every simulation, list 3 observable indicators that would
+invalidate the prediction:
+
+```
+INVALIDATION INDICATORS:
+1. If @handle {does X instead of Y}, our {trait} estimate is wrong
+2. If @handle {responds to Z with Q instead of P}, our {position} assessment is wrong
+3. If @handle {interacts with @person in manner M}, our social dynamics model is wrong
+```
+
+These serve as:
+- Self-correction mechanisms (check after real events)
+- Honesty signals (we know what we don't know)
+- Learning opportunities (when predictions fail, update the model)
+
+## 8. Counter-Bias Checklist
+
+Run before finalizing any dossier:
+
+- [ ] **Confirmation bias**: Did I skip searching for evidence that CONTRADICTS my model?
+- [ ] **Anchoring**: Am I over-weighting the first information I found?
+- [ ] **Availability bias**: Am I over-weighting viral/memorable moments?
+- [ ] **Mirror imaging**: Am I assuming the subject thinks like me?
+- [ ] **Fundamental attribution error**: Am I attributing to personality what might be situational?
+- [ ] **Recency bias**: Am I ignoring valid older evidence?
+- [ ] **Halo effect**: Is one strong trait coloring my assessment of other traits?
+- [ ] **Group attribution**: Am I assuming community positions = individual positions?
+
+If any box is checked "yes" or "maybe", revisit that section of the dossier.
+
+## Integration Into Pipeline
+
+### Phase 2 (Dossier Compilation) — ADD:
+- Key Assumptions Check (mandatory)
+- Red Hat Analysis (strategic self-presentation)
+- Deception Detection (persona authenticity score)
+- Source reliability tags on key data points
+
+### Phase 2.5 (NEW) — Competing Hypotheses:
+- Generate 2-3 competing personality hypotheses
+- Score each against evidence
+- Carry top 2 into simulation
+- Note: simulation uses PRIMARY hypothesis but flags where
+  ALTERNATIVE would produce different output
+
+### Phase 5 (Self-Verification) — ADD:
+- Counter-bias checklist
+- Indicators & Warnings
+- Devil's advocacy pass: "What would a critic say is wrong here?"
@@ -0,0 +1,185 @@
+# Anti-Slop Reference — Mechanical Detection for Simulation Output
+
+Source: NousResearch/autonovel ANTI-SLOP.md + slop-forensics + EQ-Bench Slop Score
+Adapted for personality simulation: slop in simulated speech is a dead giveaway that
+the output is LLM-generated, not human-generated. EVERY simulated utterance must pass
+this filter or the simulation fails the "indistinguishable from real" standard.
+
+## Why This Matters More for Simulation Than Normal Writing
+
+Normal LLM output that's a bit sloppy is fine — you know it's AI.
+Simulated speech that contains slop BREAKS THE ILLUSION. If @eigenrobot's
+simulated tweet contains "delve" or "it's worth noting," anyone who follows
+him would instantly know it's fake. Slop detection is the minimum viable
+authenticity check.
+
+## Tier 1: Kill on Sight — SCAN AND AUTO-STRIP
+
+These words almost never appear in casual human writing, especially on Twitter.
+If ANY appear in simulated tweets/posts, the simulation has failed.
+
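+REGEX SCAN LIST (case-insensitive):
+```
+delve|utilize|leverage|facilitate|elucidate|embark|
+endeavor|encompass|multifaceted|tapestry|testament|paradigm|
+synergy|synergize|holistic|catalyze|catalyst|juxtapose|
+nuanced\b|realm\b|landscape\b|myriad|plethora
+```
+A regex cannot judge usage: flag "leverage" only when it is used as a verb
+and "landscape" only when it is metaphorical. Join the lines before
+compiling; the line breaks are for readability only.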
+
+On detection: REWRITE the sentence using the human alternative.
+Do not just swap the word — the sentence structure around slop words
+is usually sloppy too.
+
+## Tier 2: Suspicious in Clusters — COUNT PER PERSON
+
+These are fine alone. Three in one person's simulated output = rewrite.
+
+```
+robust|comprehensive|seamless|cutting-edge|innovative|streamline|
+empower|foster|enhance|elevate|optimize|scalable|pivotal|intricate|
+profound|resonate|underscore|harness|navigate\b|
+cultivate|bolster|galvanize|cornerstone|game-changer
+```
+
+Count per simulated person. If count >= 3: flag and rewrite.
+Count "navigate" only when it is used metaphorically.
+
+## Tier 3: Filler Phrases — DELETE ALL
+
+These add zero information. No human tweets these.
+
+SCAN LIST (match as substrings):
+```
+- "it's worth noting"
+- "important to note"
+- "notably"
+- "interestingly"
+- "let's dive into"
+- "let's explore"
+- "as we can see"
+- "as mentioned earlier"
+- "in conclusion"
+- "to summarize"
+- "furthermore"
+- "moreover"
+- "additionally" (at start of sentence)
+- "in today's"
+- "it goes without saying"
+- "when it comes to"
+- "in the realm of"
+- "one might argue"
+- "it could be suggested"
+- "this begs the question"
+- "a comprehensive approach"
+- "a holistic approach"
+- "a nuanced approach"
+- "not just X, but Y" (a pattern, not a literal substring: match "not just ... but ..."; the #1 LLM rhetorical crutch)
+```
+
+## Rhetorical Slop — The Hardest to Catch
+
+These pass vocabulary checks and mechanical verification but still read as
+LLM-generated because the STRUCTURE is too polished. This is the deepest
+layer of slop — the instruct model's training to produce "satisfying" output.
+
+### Parallel Antithesis
+"The most X are... The most Y are..."
+"It's not about X. It's about Y."
+Every simulated tweet that contains a balanced two-part rhetorical structure
+should be checked: would this person actually construct that parallelism,
+or would they just say the second half and trust you to get it?
+FIX: delete the setup. Keep only the punchline half.
+
+### "Not X, Not Y, But Z" / "Not Just X, But Y"
+The #1 LLM rhetorical crutch. Appears in almost every simulation.
+FIX: just say Z. Delete the negations.
+
+### "Show Me X and I'll Show You Y"
+Rhetorical formula that reads like a book blurb or TED talk.
+No one tweets like this unless they're deliberately performing rhetoric.
+FIX: state it flat. "Every community that works has a shared enemy" not
+"Show me a thriving community and I'll show you..."
+
+### Clean Escalating Lists
+"First it was A, then B, then C, now D" — four perfectly escalating steps.
+Real people do 2 steps and trail off, or skip to the end, or lose the thread.
+FIX: cut to 2 steps max. Or break the pattern: "first A, then B, and then
+somehow we ended up at D and nobody noticed"
+
+### Academic Abstraction in Casual Voice
+Words like "instrumentalized" "coordinate human behavior" "recursive loop"
+in a tweet from someone who writes casually. The vocabulary is from papers,
+not from posting.
+FIX: use the word they'd actually reach for. "coordinate human behavior" →
+"get people to do stuff." If the plain version sounds dumb, maybe the take
+itself is thinner than the fancy words made it seem.
+
+### The "Every Tweet Is A Banger" Problem
+The deepest slop: every simulated utterance is GOOD. Considered. Structured.
+Satisfying. Real twitter feeds are 70% mid, 20% boring, 10% brilliant.
+The simulation should include:
+- Half-finished thoughts ("idk if this makes sense but")
+- Trailing off ("wait actually nvm")
+- Boring logistical tweets ("anyone know a good dentist in brooklyn")
+- Self-interruptions ("ok this is getting long")
+- Acknowledgments that add nothing ("lol yeah" "hmm" "fair")
+If every tweet in the simulation could be screenshot'd as a banger,
+the simulation is too polished to be real.
+
+## Structural Slop Patterns — CHECK IN SIMULATION OUTPUT
+
+### Pattern: Identical Sentence Structure Across Speakers
+If two or more simulated people use the same sentence structure
+(e.g., "The thing about X is Y"), the simulation has failed voice
+differentiation. Real people have different syntactic habits.
+
+### Pattern: Topic Sentence Machine
+If a simulated post follows: topic sentence → elaboration → example → wrap-up,
+it's LLM structure, not human. Real tweets are: punchline first, or tangent,
+or one-liner, or trailing thought.
+
+### Pattern: Symmetry Addiction
+If the conversation has neat equal turns, balanced perspectives, everyone
+getting the same number of posts — that's not real. Real conversations
+are asymmetric. Someone dominates. Someone lurks. Someone gets interrupted.
+
+### Pattern: The Hedge Parade
+"This approach may potentially help improve..." — no human tweets like this.
+Either commit to the statement or don't make it.
+
+### Pattern: Em Dash Overload
+Count em dashes (—) per person. If >2 per post on average, flag it.
+Most people use them sparingly or not at all.
+
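The em dash check is a plain average. A minimal sketch, with a hypothetical function name and the threshold taken from the rule above:

```python
EM_DASH = "\u2014"


def em_dash_flag(posts, max_avg=2.0):
    """Return (average em dashes per post, whether the average exceeds max_avg)."""
    if not posts:
        return 0.0, False
    avg = sum(post.count(EM_DASH) for post in posts) / len(posts)
    return avg, avg > max_avg
```

Run it once per simulated person, since the rule applies to each speaker's average and not to the conversation as a whole.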
+### Pattern: Sycophantic Agreement Flow
+If the conversation flows: A says thing → B says "great point, and also..." →
+C says "building on that..." — that's instruct-model conversation, not human.
+Real conversations have: disagreement, misunderstanding, tangents, ignoring,
+one-upping, and sometimes just "lol."
+
+### Pattern: Uniform Register
+If all simulated people sound like they're writing at the same education level
+with the same formality — the simulation failed. Real people have wildly different
+registers. A shitposter and an academic should sound nothing alike.
+
+## Integration: Mechanical Slop Scan
+
+Run BEFORE subjective discriminator scoring, alongside emoji/length/caps checks.
+
+```
+For each simulated utterance:
+  1. Scan for Tier 1 words → auto-rewrite if found
+  2. Count Tier 2 words per person → flag if >= 3
+  3. Scan for Tier 3 filler phrases → auto-delete
+  4. Check for structural patterns:
+     - Same sentence structure across speakers?
+     - Topic-sentence-machine structure?
+     - Symmetric turn-taking?
+     - Hedge parade?
+     - Em dash count?
+     - Sycophantic flow?
+  5. If ANY Tier 1 word or ANY structural pattern is detected:
+     FAIL the utterance and regenerate
+```
+
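Steps 1 to 3 of the scan can be sketched with the standard `re` module. This is an illustration under assumptions: the Tier 3 list is a subset, the report layout is hypothetical, and the usage caveats (verb-only "leverage", metaphorical "landscape" and "navigate") still need a manual check because a regex cannot judge them.

```python
import re

TIER1 = ("delve utilize leverage facilitate elucidate embark endeavor encompass "
         "multifaceted tapestry testament paradigm synergy synergize holistic "
         "catalyze catalyst juxtapose nuanced realm landscape myriad plethora").split()
TIER2 = ("robust comprehensive seamless cutting-edge innovative streamline empower "
         "foster enhance elevate optimize scalable pivotal intricate profound "
         "resonate underscore harness navigate cultivate bolster galvanize "
         "cornerstone game-changer").split()
# Subset of the Tier 3 phrases, for brevity.
TIER3 = ["it's worth noting", "important to note", "let's dive into",
         "in conclusion", "furthermore", "moreover", "when it comes to"]


def _word_re(words):
    # \w* after each stem also catches suffixed forms ("delves", "leveraged"),
    # but not forms that drop the final letter ("delving").
    stems = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + stems + r")\w*", re.IGNORECASE)


T1_RE, T2_RE = _word_re(TIER1), _word_re(TIER2)


def slop_scan(utterances_by_person):
    """Return a per-person report of Tier 1 hits, Tier 2 count, and Tier 3 phrases."""
    report = {}
    for person, posts in utterances_by_person.items():
        text = "\n".join(posts)
        tier1 = [m.group(0).lower() for m in T1_RE.finditer(text)]
        tier2_count = len(T2_RE.findall(text))
        tier3 = [phrase for phrase in TIER3 if phrase in text.lower()]
        report[person] = {
            "tier1": tier1,                  # any hit fails the utterance
            "tier2_count": tier2_count,
            "tier2_flag": tier2_count >= 3,  # cluster rule, counted per person
            "tier3": tier3,                  # phrases to delete
        }
    return report
```

The structural checks in step 4 are not covered here; most of them need comparison across speakers or a model's judgment.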
+This scan is MECHANICAL. It cannot be vibes-scored. The words are either
+there or they're not. Run it every time, no exceptions.
@@ -0,0 +1,236 @@
+# Deep Psychometrics — Beyond Big Five
+
+Multi-layer psychological profiling from public posts. Each layer adds
+a dimension to the personality model, making simulations more nuanced
+and predictions more accurate.
+
+## The Profiling Stack
+
+| Layer | What It Measures | Tool/Method | Accuracy | Min Posts |
+|-------|-----------------|-------------|----------|-----------|
+| Big Five (OCEAN) | Core personality traits | RoBERTa embeddings + BiLSTM | AUROC 0.78-0.82 | 30-50 |
+| Moral Foundations | Ethical intuitions | eMFDscore (pip) | Validated dictionary | 20+ |
+| Schwartz Values | Core value priorities | DeBERTa on ValueEval | F1 0.56 (macro) | 20+ |
+| Cognitive Style | Thinking patterns | AutoIC + LIWC features | r=0.70-0.82 doc-level | 20+ |
+| Narrative Framing | How they frame issues | GPT-4 few-shot | F1 ~70% | 10+ |
+| Behavioral Metadata | Non-text patterns | Feature extraction | r=0.29-0.40 per trait | 20+ |
+
+## Layer 1: Big Five Personality (Foundation)
+
+### Accuracy Bounds (peer-reviewed)
+- AUROC 0.78-0.82 with RoBERTa embeddings + BiLSTM (JMIR 2025)
+- Per-trait binary accuracy: O=0.637, C=0.602, E=0.620, A=0.590, N=0.620
+- Meta-analytic correlations (Azucar 2018, 16 studies):
+  Extraversion r=0.40, Openness r=0.39, Conscientiousness r=0.35,
+  Neuroticism r=0.33, Agreeableness r=0.29
+- These hit the "personality coefficient" ceiling of r=0.30-0.40 —
+  digital footprints are as predictive as any behavioral measure
+
+### What Actually Works
+- Fine-tuned embeddings >> zero-shot LLMs. GPT-4o zero-shot is UNRELIABLE.
+- RoBERTa embeddings are free and nearly as good as OpenAI embeddings
+- Aggregation across posts is essential — single posts are noise
+- 30-50 posts of ~90 words each = practical minimum
+- Training data: PANDORA Reddit corpus (1568 users, ~935K posts)
+
+### For The Simulator (without running models)
+Since we can't fine-tune per-simulation, use LLM-as-rater with caveats:
+- Provide 10-20 actual posts as evidence
+- Ask for trait estimation with reasoning, not just scores
+- Anchor with the adjective-based method (see prediction-engine.md)
+- Frame estimates as ranges, not points: "Openness: HIGH (0.7-0.9)"
+- Known bias: LLMs overestimate agreeableness and underestimate neuroticism
+
+### Key Insight: LLMs Already Know Public Figures
+Nature Scientific Reports 2024: GPT-3's semantic space already encodes
+perceived personality of public figures from their names alone. For
+famous people, the LLM's latent knowledge is a STARTING POINT that
+OSINT data confirms or corrects.
+
+## Layer 2: Moral Foundations (Ethical Compass)
+
+Jonathan Haidt's Moral Foundations Theory. Six foundations:
+
+| Foundation | Liberal emphasis | Conservative emphasis |
+|-----------|-----------------|---------------------|
+| Care/Harm | ★★★ HIGH | ★★ MODERATE |
+| Fairness/Cheating | ★★★ HIGH | ★★ MODERATE |
+| Loyalty/Betrayal | ★ LOW | ★★★ HIGH |
+| Authority/Subversion | ★ LOW | ★★★ HIGH |
+| Sanctity/Degradation | ★ LOW | ★★★ HIGH |
+| Liberty/Oppression | ★★ MODERATE | ★★ MODERATE |
+
+### Tool: eMFDscore
+```
+pip install emfdscore
+# GitHub: github.com/medianeuroscience/emfdscore
+# Built on spaCy, GPL-3.0
+```
+
+Output per post: scores for each foundation (virtue + vice dimensions).
+Aggregate across 20+ posts → 10-dimensional moral profile (eMFD scores the
+five original foundations; Liberty/Oppression is not covered).
+
+### Application to Simulation
+Moral foundations predict:
+- What topics trigger emotional responses
+- What arguments they find persuasive vs repulsive
+- How they frame political/social issues
+- Who they instinctively ally with vs oppose
+- What kind of content they share/amplify
+
+Example: High Loyalty/Authority person will defend their tribe even when
+wrong. High Care/Fairness person will break from their tribe on justice
+issues. This shapes conversation dynamics.
+
+### For The Simulator (without running eMFDscore)
+Infer moral foundations from:
+- Political positions and framing in their posts
+- What they get angry about vs what they celebrate
+- Who they defend and who they attack
+- Key moral vocabulary: "protect", "fair", "loyal", "respect", "pure", "free"
+
+## Layer 3: Schwartz Values (Core Motivations)
+
+19 values in circular continuum (adjacent values are compatible,
+opposite values are in tension):
+
+**Self-Transcendence** ↔ **Self-Enhancement**
+- Universalism, Benevolence ↔ Power, Achievement
+
+**Openness to Change** ↔ **Conservation**
+- Self-Direction, Stimulation, Hedonism ↔ Tradition, Conformity, Security
+
+### SemEval-2023 Task 4 Results
+- Best macro-F1: 0.56 (ensemble of 12 DeBERTa/RoBERTa models)
+- Most reliable: universalism (nature), security, power
+- Least reliable: stimulation, hedonism, humility
+- Dataset: 9,324 annotated arguments, available via Touché
+
+### Key Finding: Value Perception Is Subjective
+Epstein et al. (2026): human inter-rater agreement on values is only r=0.201.
+Fine-tuned GPT-4o reaches r=0.294 — BETTER than human-human agreement.
+Personalized models reach r=0.334.
+
+### For The Simulator
+Values predict MOTIVATION — why someone holds positions, not just what
+positions they hold. Two people with the same political stance may have
+completely different underlying values:
+- "I support open source because FREEDOM" (Self-Direction)
+- "I support open source because FAIRNESS" (Universalism)
+- "I support open source because it WORKS BETTER" (Achievement)
+Same position, different framing, different behavioral predictions.
+
+## Layer 4: Cognitive Style (How They Think)
+
+### Integrative Complexity (AutoIC)
+Measures differentiation (seeing multiple perspectives) and integration
+(synthesizing perspectives into coherent frameworks).
+
+- Low IC: black-and-white thinking, strong convictions, simple language
+- High IC: nuanced, sees multiple sides, hedging, complex sentences
+
+AutoIC (Conway et al.): 3,500+ complexity-relevant root words/phrases,
+13 dictionary categories, validated r=0.70-0.82 at document level.
+
+**WARNING**: LIWC's "analytic thinking" correlates only r=0.14 with actual
+integrative complexity. Don't use LIWC's score as a proxy.
+
+### Computational Indicators of Cognitive Style
+Extractable from 20-50 posts without specialized tools:
+
+| Indicator | High IC | Low IC |
+|-----------|---------------|---------------|
+| Vocabulary diversity (TTR) | HIGH | LOW |
+| Avg sentence length | LONGER | SHORTER |
+| Causal connectives ("because", "therefore") | MORE | FEWER |
+| Hedging ("perhaps", "it seems") | MORE | FEWER |
+| Abstract vs concrete language | MORE ABSTRACT | MORE CONCRETE |
+| Question-asking | MORE | FEWER |
+| Binary framing ("always/never") | LESS | MORE |
+
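Several of the indicators in the table can be computed with the standard library alone. A rough sketch: the phrase lists contain only the examples named in the table, the per-100-words normalisation is an assumption, and type-token ratio shrinks as text gets longer, so compare people on similar amounts of text.

```python
import re

CAUSAL = ("because", "therefore")
HEDGES = ("perhaps", "it seems")
BINARY = ("always", "never")


def cognitive_indicators(posts):
    """Compute rough text indicators of cognitive style from a list of posts."""
    text = " ".join(posts).lower()
    words = re.findall(r"[a-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)

    def per_100_words(phrases):
        hits = sum(len(re.findall(r"\b" + re.escape(p) + r"\b", text))
                   for p in phrases)
        return 100 * hits / n_words

    return {
        "ttr": len(set(words)) / n_words,                     # vocabulary diversity
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "causal_per_100w": per_100_words(CAUSAL),
        "hedges_per_100w": per_100_words(HEDGES),
        "binary_per_100w": per_100_words(BINARY),
        "questions": text.count("?"),
    }
```

Abstract versus concrete language is left out because it needs a lexicon or a model, not a counter.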
+### For The Simulator
+Cognitive style directly shapes VOICE:
+- High IC person: longer posts, more caveats, "on the other hand"
+- Low IC person: punchy takes, strong assertions, no hedging
+- This is one of the strongest differentiators between similar-sounding people
+
+## Layer 5: Narrative Framing (Their Lens on Reality)
+
+How someone frames an issue reveals deep cognitive and value patterns.
+
+### Common Frames (Semetko & Valkenburg)
+- **Conflict**: issue as battle between opposing sides
+- **Human interest**: personal stories, emotional impact
+- **Economic**: costs, benefits, financial impact
+- **Morality**: right vs wrong, ethical principles
+- **Attribution of responsibility**: who's to blame / who should fix it
+
+### Detection
+GPT-4 few-shot prompting with frame definitions achieves F1=70.4%.
+It works best on diverse topics where fine-tuned models are too narrow.
+
+### For The Simulator
+Framing predicts:
+- How they'll react to news (through which lens)
+- What aspects they'll emphasize in conversation
+- What arguments they'll find compelling
+- Whether they personalize or systematize issues
+
+Example: Same AI safety event, different frames:
+- Conflict framer: "The open vs closed battle heats up"
+- Economic framer: "This will cost the industry billions"
+- Moral framer: "This is irresponsible and dangerous"
+- Attribution framer: "The regulators need to step in"
+
+## Layer 6: Behavioral Metadata (Non-Text Signals)
+
+Extractable from X API / Bluesky AT Protocol without NLP:
+
+| Feature | What It Reveals |
+|---------|----------------|
+| Posting time distribution | Timezone, sleep patterns, work schedule |
+| Reply vs original ratio | Conversational vs broadcast personality |
+| Emoji frequency & types | Emotional expression style |
+| Hashtag usage | Community identification, signal boosting |
+| Media attachment rate | Visual vs text orientation |
+| Thread length | Depth of engagement preference |
+| Retweet/repost ratio | Amplifier vs creator |
+| Average post length | Conciseness vs verbosity |
+| Response latency | Impulsiveness vs deliberation |
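+
+Most of these features reduce to counting. The sketch below assumes
+posts have already been normalized into dicts with `text`, `created_at`,
+`is_reply`, and `is_repost` keys. These field names are hypothetical and
+do not mirror the X API or AT Protocol schemas.
+
+```python
+from collections import Counter
+from datetime import datetime
+
+def behavioral_metadata(posts):
+    """Summarize non-text signals from a list of normalized post dicts."""
+    n = max(len(posts), 1)
+    # Replace a trailing "Z" so fromisoformat works on older Python versions
+    hours = Counter(
+        datetime.fromisoformat(p["created_at"].replace("Z", "+00:00")).hour
+        for p in posts
+    )
+    return {
+        "reply_ratio": sum(p["is_reply"] for p in posts) / n,
+        "repost_ratio": sum(p["is_repost"] for p in posts) / n,
+        "avg_post_len_words": sum(len(p["text"].split()) for p in posts) / n,
+        # Hour is in the timestamp's own offset (UTC if timestamps are UTC)
+        "peak_hour": hours.most_common(1)[0][0] if hours else None,
+    }
+```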
+
+### Trait Correlations (meta-analytic)
+- **Extraversion**: more posts, more friends, more photos, more group activity
+- **Neuroticism**: more self-disclosure, more passive consumption, more late-night posting
+- **Agreeableness**: fewer swear words, more positive emotion, more supportive replies
+- **Conscientiousness**: more regular posting patterns, more task-oriented content
+- **Openness**: more diverse topics, more original content, larger networks
+
+## Putting It All Together: The Deep Dossier
+
+At high fidelity, compile a multi-layer profile:
+
+```
+PSYCHOMETRIC PROFILE: @handle
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+Big Five: O[HIGH] C[MED] E[HIGH] A[LOW] N[LOW]
+  Evidence: {real quotes showing each trait}
+
+Moral Foundations: Care★★ Fair★★★ Loyal★ Auth★ Sanct★ Liberty★★★
+  Evidence: {what they get angry/excited about}
+
+Values: Self-Direction dominant, Achievement secondary
+  Evidence: {how they justify their positions}
+
+Cognitive Style: HIGH integrative complexity
+  Evidence: {hedging patterns, nuanced takes, sentence complexity}
+
+Dominant Frame: Attribution of Responsibility
+  Evidence: {they consistently focus on who's to blame}
+
+Behavioral: Night owl, reply-heavy, low emoji, threads > one-shots
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+```
+
+This multi-layer profile makes predictions much more nuanced than
+Big Five alone. It tells you not just WHAT someone will say but
+WHY they'll say it and HOW they'll frame it.
@@ -0,0 +1,170 @@
+# GEPA Evolution — Automated Self-Improvement via hermes-agent-self-evolution
+
+## What This Is
+
+The hermes-agent-self-evolution repo (NousResearch/hermes-agent-self-evolution)
+uses DSPy + GEPA (Genetic-Pareto Prompt Evolution) to automatically evolve
+Hermes Agent skills. GEPA is an ICLR 2026 Oral paper — it reads EXECUTION
+TRACES to understand WHY things fail, then proposes targeted mutations.
+
+This means: we can point GEPA at the worldsim skill and automatically evolve
+every component — simulation instructions, anti-slop rules, star thread
+methodology, mechanical verification checklist, dossier templates — using
+our own simulation outputs scored against real data as the eval signal.
+
+The recursive self-improvement pipeline we built manually (log failures →
+promote patterns → update rules) can be AUTOMATED via GEPA.
+
+## How It Applies to WorldSim
+
+### What GEPA Evolves (text, not weights)
+GEPA evolves the TEXT of prompts and instructions. For worldsim, that means:
+
+| Target | What Gets Evolved | Eval Signal |
+|--------|------------------|-------------|
+| SKILL.md | Immersion protocol, pipeline instructions | Simulation quality scores |
+| star-thread.md | Methodology for finding star threads | Thread-to-voice accuracy |
+| anti-slop.md | Slop word lists, structural patterns | Slop detection recall/precision |
+| simulation-engine.md | Platform formats, conversation dynamics | Voice fidelity scores |
+| adversarial-refinement.md | Mechanical check thresholds, GAN loop | Pre vs post refinement delta |
+| prediction-engine.md | Forecasting methodology | Prediction Brier scores |
+| dossier template | Profile structure and fields | Profile quality scores |
+
+### The Eval Dataset
+Built from worldsim's own outputs + real data:
+
+1. **Voice fidelity pairs**: (simulated post, real post from same person) →
+   LLM-as-judge scores similarity 0-1
+2. **Mechanical check logs**: what did the checks catch? what slipped through?
+3. **Prediction accuracy**: tracked predictions scored against reality
+4. **Held-out tests**: predicted tweets vs actual tweets
+5. **Turing test results**: could the discriminator tell real from fake?
+6. **User corrections**: any time the user catches something the system missed
+   (like the emoji fabrication incident — that's the richest signal)
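+
+The Brier score used as the eval signal for prediction accuracy is the
+mean squared error between forecast probabilities and binary outcomes
+(0 is perfect, lower is better). A minimal sketch; the function name and
+input shape are assumptions:
+
+```python
+def brier_score(forecasts):
+    """forecasts: iterable of (probability, outcome) with outcome 0 or 1."""
+    pairs = list(forecasts)
+    if not pairs:
+        raise ValueError("need at least one scored prediction")
+    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)
+```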
+
+### The GEPA Loop for WorldSim
+
+```
+1. RUN worldsim simulation (creates execution traces)
+2. SCORE outputs against real data (voice, position, mechanical)
+3. LOG traces + scores + user feedback to eval dataset
+4. GEPA EVOLVES the skill component that had lowest scores
+   - Reads traces to understand WHY it scored low
+   - Proposes mutation to that specific reference file
+   - Tests mutation against held-out eval data
+   - If improved: create PR, human reviews
+5. REPEAT — each cycle makes the skill better
+```
+
+### Concrete Example
+
+GEPA discovers from traces that simulated conversations always have
+symmetric turn-taking (4/4/4). It reads the mechanical check log that
+caught this in 3 of the last 5 simulations. It reads the current
+simulation-engine.md and sees the conversation architecture section.
+It proposes a mutation:
+
+OLD: "Opening Moves (1-3 posts) → Development (4-8 posts) → Peak → Resolution"
+NEW: "Opening: most impulsive person posts. Others join ASYMMETRICALLY — one person
+gets 40-50% of turns, one gets 15-20%, others fill the rest. The ratio should
+match their real reply-to-original ratios from the dossier."
+
+This mutation gets tested against the next 5 simulations. If symmetry
+violations drop and voice scores don't decrease, it gets merged.
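+
+The symmetry violation in this example is the kind of thing a mechanical
+check can catch by counting turns. A sketch, assuming a conversation is
+represented as a list of speaker handles in posting order. The 5%
+tolerance is an arbitrary illustrative threshold:
+
+```python
+from collections import Counter
+
+def turn_shares(turns):
+    """Return each speaker's share of total turns (handle -> fraction)."""
+    counts = Counter(turns)
+    total = sum(counts.values())
+    return {handle: c / total for handle, c in counts.items()}
+
+def is_symmetric(turns, tolerance=0.05):
+    """True when every speaker's share is within tolerance of an equal split."""
+    shares = turn_shares(turns)
+    if len(shares) < 2:
+        return False
+    equal = 1 / len(shares)
+    return all(abs(s - equal) <= tolerance for s in shares.values())
+```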
+
+## Setup
+
+```bash
+# Clone the evolution repo
+git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
+cd hermes-agent-self-evolution
+pip install -e ".[dev]"
+
+# Point at hermes-agent repo
+export HERMES_AGENT_REPO=~/.hermes
+
+# Evolve the worldsim skill specifically
+python -m evolution.skills.evolve_skill \
+    --skill hermes-simulator \
+    --iterations 10 \
+    --eval-source sessiondb
+```
+
+## What Makes This Different From Manual Self-Improvement
+
+The manual pipeline (references/recursive-self-improvement.md) requires the
+agent to notice its own failures and write rules. This has two problems:
+
+1. The agent shares weights with the generator — it's biased toward
+   approving its own output (the emoji incident proved this)
+2. Promoting patterns to rules is slow and requires 3+ occurrences
+
+GEPA addresses both, and adds two further safeguards:
+1. The eval signal comes from EXTERNAL data (real posts, user corrections,
+   mechanical checks) — not the agent's self-assessment
+2. Evolution happens per-iteration, not per-3-failures
+3. Mutations are tested against held-out data before merging
+4. The Pareto frontier maintains diversity — different strategies for
+   different types of people/conversations
+
+## Integration Points
+
+### Eval Dataset Builder
+Mine rehoboam DB for training data:
+- simulation_logs table → execution traces
+- prediction_scores table → accuracy data
+- audit_log table → mechanical check results
+- user correction events → highest-value signal
+
+### Fitness Function for WorldSim
+```python
+def worldsim_fitness(simulation_output, real_data):
+    # Helpers (embed_and_compare, count_slop, ...) are placeholders supplied
+    # by the eval harness; simulation_output is assumed to be plain text.
+    scores = {}
+    # Voice fidelity: embed real + simulated, cosine similarity
+    scores["voice"] = embed_and_compare(simulation_output, real_data.tweets)
+    # Mechanical pass rate: what % of checks passed without fixes
+    scores["mechanical"] = mechanical_check_pass_rate(simulation_output)
+    # Slop score: 1.0 minus the fraction of words flagged as slop
+    total_words = max(len(simulation_output.split()), 1)
+    slop_count = count_slop(simulation_output)
+    scores["anti_slop"] = 1.0 - (slop_count / total_words)
+    # Structure: turn asymmetry, conversation naturalness
+    scores["structure"] = naturalness_score(simulation_output)
+    # Textual feedback for GEPA's reflective mutation
+    feedback = generate_textual_feedback(scores, simulation_output, real_data)
+    return aggregate_score(scores), feedback
+
+### The Key Insight: Textual Feedback
+GEPA's superpower is that it doesn't just get a scalar score — it gets
+TEXTUAL FEEDBACK explaining what went wrong. Our mechanical verification
+system already produces this:
+
+"@nosilverv avg 33.2 words vs real 15.6 (113% deviation) — SHORTEN"
+"Parallel antithesis detected: 'The most X... The most Y...' — STRIP"
+"Emoji rate 0% simulated but 10% real — OK (within tolerance)"
+
+This text goes directly into GEPA's reflective mutation pipeline. It reads
+these messages and proposes changes to the skill instructions that would
+prevent these specific failures in future simulations.
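+
+A sketch of how one such feedback line might be produced. The function
+name, the 25% tolerance, and the exact message format are assumptions
+modeled on the examples above, not the actual verification code:
+
+```python
+def length_feedback(handle, simulated_posts, real_posts, tolerance=0.25):
+    """Compare average words per post and return a one-line feedback string."""
+    if not simulated_posts or not real_posts:
+        raise ValueError("both post lists must be non-empty")
+    sim = sum(len(p.split()) for p in simulated_posts) / len(simulated_posts)
+    real = sum(len(p.split()) for p in real_posts) / len(real_posts)
+    if real == 0:
+        raise ValueError("real posts contain no words")
+    deviation = abs(sim - real) / real
+    if deviation <= tolerance:
+        verdict = "OK (within tolerance)"
+    else:
+        verdict = "SHORTEN" if sim > real else "LENGTHEN"
+    return (f"{handle} avg {sim:.1f} words vs real {real:.1f} "
+            f"({deviation:.0%} deviation) - {verdict}")
+```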
+
+## Evolution Targets by Priority
+
+1. **simulation-engine.md** — highest impact on output quality
+2. **anti-slop.md** — directly measurable, highest precision eval
+3. **star-thread.md** — hardest to evaluate but most impactful on voice
+4. **adversarial-refinement.md** — meta: improving the improvement system
+5. **SKILL.md pipeline instructions** — orchestration optimization
+6. **dossier template** — structure optimization
+7. **prediction-engine.md** — measurable via Brier scores
+
+## The Virtuous Cycle
+
+```
+More simulations → more eval data → better GEPA mutations
+→ better skill instructions → better simulations → more eval data → ...
+```
+
+This is the endgame: the worldsim skill evolves itself through use.
+Every simulation makes the next one better, not just through logged
+rules, but through automated evolutionary optimization of the
+instructions themselves. The system doesn't just learn WHAT went wrong;
+it rewrites its own instructions to prevent it.
@@ -0,0 +1,262 @@
+# Knowledge Archive — Per-Person Source Library + Expert Synthesis
+
+## The Problem With Profiles
+
+A profile is a SNAPSHOT. It says "this person believes X" but doesn't
+show you WHERE they said it, WHEN, in WHAT context, or HOW their
+thinking evolved. You can't cite a profile. You can't trace a claim
+back to a source. And when you're simulating a conversation about
+topic Z, the profile gives you everything about the person equally
+weighted — their views on AI and their views on cooking and their
+views on politics all crammed into the same context window.
+
+## The Archive
+
+For every person the system touches, build a LIBRARY:
+
+```
+~/.hermes/rehoboam/archives/{handle}/
+├── index.json              ← master index: all entries, metadata, embeddings
+├── sources/
+│   ├── x_tweets.jsonl      ← every tweet pulled, with ID, timestamp, URL, metrics
+│   ├── x_replies.jsonl     ← their replies (different voice register)
+│   ├── bluesky_posts.jsonl ← bluesky posts
+│   ├── blog_posts.jsonl    ← full text of blog posts with URLs
+│   ├── podcast_quotes.jsonl ← attributed quotes from transcripts
+│   ├── interviews.jsonl    ← quotes from news articles/interviews
+│   ├── reddit_comments.jsonl
+│   ├── github_comments.jsonl
+│   ├── goodreads_reviews.jsonl
+│   ├── threads_posts.jsonl
+│   └── other.jsonl         ← anything else (HN, Quora, etc.)
+├── topics/
+│   ├── ai_safety.jsonl     ← auto-clustered by topic
+│   ├── open_source.jsonl
+│   ├── consciousness.jsonl
+│   └── ...
+└── embeddings/
+    └── all_embeddings.npy  ← sentence-transformer vectors for semantic search
+```
+
+### Entry Format (every entry in every source file)
+
+```json
+{
+  "id": "unique_id",
+  "handle": "teknium",
+  "platform": "x",
+  "type": "tweet|reply|blog|podcast|interview|comment|review",
+  "text": "the actual text they said",
+  "url": "https://x.com/Teknium/status/1234567890",
+  "timestamp": "2026-04-05T21:40:48Z",
+  "context": {
+    "replying_to": "@otheruser's tweet about X",
+    "thread_position": 3,
+    "topic": "open source AI",
+    "source_title": "Lex Fridman Podcast #412"
+  },
+  "metrics": {
+    "likes": 234,
+    "retweets": 45,
+    "replies": 12
+  },
+  "topics": ["open_source", "ai_models", "hermes"],
+  "embedding_id": 42
+}
+```
+
+Every entry has a URL. Everything is traceable. Nothing is paraphrased
+without the original alongside it.
+
+## Collection Pipeline
+
+When `worldsim> profile @handle` or `worldsim> archive @handle` runs:
+
+### Step 1: Pull Everything
+Use every verified access method to collect raw materials:
+- X API: get max tweets (paginate with next_token to get hundreds)
+- nitter.cz: timeline content
+- ThreadReaderApp: historical threads
+- Bluesky: full post history
+- GitHub: issue comments, PR reviews, gists, README
+- Reddit: comment history
+- Blog/Substack: full posts (web_extract)
+- Podcast transcripts: attributed quotes
+- Interviews: quotes with attribution
+- Goodreads: reviews
+- Medium: RSS feed full text
+
+### Step 2: Deduplicate
+Same content appears across platforms (cross-posted tweets, syndicated
+blog posts). Deduplicate by content similarity, keep the richest version
+(the one with most metadata/context).
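+
+One way to sketch this step with the standard library, assuming entries
+follow the entry format described earlier. The 0.9 similarity threshold
+and the `richness` heuristic are illustrative assumptions. Pairwise
+comparison is quadratic, so a large archive would likely need hashing or
+embedding-based blocking first.
+
+```python
+from difflib import SequenceMatcher
+
+def richness(entry):
+    """Crude richness score: number of populated context and metrics fields."""
+    fields = list(entry.get("context", {}).values())
+    fields += list(entry.get("metrics", {}).values())
+    return sum(v is not None for v in fields)
+
+def deduplicate(entries, threshold=0.9):
+    """Drop near-duplicate texts, keeping the richest version of each."""
+    kept = []
+    for entry in entries:
+        for i, other in enumerate(kept):
+            ratio = SequenceMatcher(
+                None, entry["text"].lower(), other["text"].lower()
+            ).ratio()
+            if ratio >= threshold:
+                # Duplicate found: keep whichever copy has more metadata
+                if richness(entry) > richness(other):
+                    kept[i] = entry
+                break
+        else:
+            kept.append(entry)
+    return kept
+```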
+
+### Step 3: Topic Cluster
+Run lightweight topic classification on each entry:
+- Use the LLM or a simple keyword matcher to assign 1-3 topic tags
+- Cluster into topic files for fast retrieval
+- Topics are dynamic — new topics emerge from the data
+
+### Step 4: Embed
+Generate sentence-transformer embeddings for every entry.
+Store in numpy array for fast cosine similarity search.
+This enables semantic retrieval: "find everything @handle said about
+consciousness" even if they never used the word "consciousness."
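+
+The retrieval side of this step is a nearest-neighbor search over the
+stored vectors. A dependency-free sketch (a NumPy version would
+vectorize the same arithmetic); it assumes embeddings are already
+computed and indexed by `embedding_id`:
+
+```python
+import math
+
+def cosine(a, b):
+    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
+    dot = sum(x * y for x, y in zip(a, b))
+    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
+    return dot / norm if norm else 0.0
+
+def top_k(query_vec, embeddings, k=10):
+    """Return [(embedding_id, score)] for the k most similar vectors."""
+    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(embeddings)]
+    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
+```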
+
+### Step 5: Index
+Build the master index.json with entry count, topic distribution,
+timestamp range, platform coverage, and quality metrics.
+
+## Context-Aware Retrieval
+
+This is the key. The archive might have 500 entries for a person.
+The context window can hold maybe 30-50 of them alongside all the
+other simulation context. You MUST retrieve selectively.
+
+### For Simulation
+When simulating @handle talking about topic X:
+
+```
+1. Semantic search: embed the current conversation context
+2. Retrieve top 10-15 entries by cosine similarity to context
+3. Also retrieve: 5 highest-engagement entries (their "greatest hits")
+4. Also retrieve: 3 most recent entries (freshness)
+5. Also retrieve: 2 entries that CONTRADICT the expected position
+   (prevents confirmation bias in the simulation)
+6. Deduplicate. Cap at 25-30 entries total.
+7. These become the "voice anchors" for generation.
+```
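+
+The recipe above could be sketched as follows. It assumes similarity
+scores and the set of contradicting entries have been computed upstream,
+and that timestamps are ISO 8601 strings in a single timezone so they
+sort lexicographically. The function name and parameters are hypothetical:
+
+```python
+def select_voice_anchors(entries, similarity, contradicting_ids=(), cap=30):
+    """Pick a capped, deduplicated set of archive entries for generation.
+
+    entries: archive dicts with 'id', 'timestamp', and optional 'metrics'.
+    similarity: dict of id -> cosine similarity to the conversation context.
+    contradicting_ids: ids already judged to contradict the expected position.
+    """
+    contra_set = set(contradicting_ids)
+    by_sim = sorted(entries, key=lambda e: similarity.get(e["id"], 0.0),
+                    reverse=True)[:15]
+    by_eng = sorted(entries, key=lambda e: e.get("metrics", {}).get("likes", 0),
+                    reverse=True)[:5]
+    by_new = sorted(entries, key=lambda e: e["timestamp"], reverse=True)[:3]
+    contra = [e for e in entries if e["id"] in contra_set][:2]
+    # Contradicting and recent entries go first so the cap never drops them
+    picked, seen = [], set()
+    for e in contra + by_new + by_sim + by_eng:
+        if e["id"] not in seen:
+            seen.add(e["id"])
+            picked.append(e)
+    return picked[:cap]
+```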
+
+The simulation draws from SPECIFIC REAL QUOTES relevant to the current
+conversation. Not a generic profile. Not everything they've ever said.
+The 25 most relevant things they've said about THIS topic.
+
+### For Expert Synthesis
+When the user asks "who are the best minds on X and what have they said?":
+
+```
+1. Search ALL archived people's entries for topic X
+2. Rank by: entry quality × person expertise × relevance to query
+3. Return a synthesis with CITATIONS:
+
+   On the topic of AI consciousness:
+
+   @repligate argues that LLMs exhibit "simulacra of consciousness"
+   rather than consciousness itself, distinguishing between the
+   model's behavior and its substrate:
+     > "the question isn't whether GPT is conscious but whether the
+     > character it's simulating is conscious within the fiction"
+     — tweet, 2025-03-15 (2.4K likes)
+     https://x.com/repligate/status/...
+
+   @nickcammarata approaches it from a meditation/first-person
+   perspective, noting parallels between introspective practice
+   and interpretability:
+     > "observation changes the system being observed, in meditation
+     > and in interp"
+     — tweet, 2026-04-05 (2.9K likes)
+     https://x.com/nickcammarata/status/...
+
+   @tszzl is skeptical of the framing entirely:
+     > "consciousness discourse is philosophy cosplaying as engineering"
+     — tweet, 2025-11-22 (5.1K likes)
+     https://x.com/tszzl/status/...
+```
+
+Every claim attributed. Every quote sourced. Every link clickable.
+
+### For Grounding Predictions
+When predicting what @handle would say about event Y:
+
+```
+1. Retrieve all archive entries related to Y or adjacent topics
+2. Identify their PATTERN of response to similar events
+3. Ground the prediction in specific past statements:
+
+   PREDICTION: @handle would likely frame event Y through the lens
+   of [topic Z], based on:
+   - tweet [url]: "quote about Z" (2025-06-15)
+   - blog post [url]: "longer quote about Z" (2025-09-20)
+   - podcast [url]: "verbal quote about Z" (2026-01-10)
+   CONFIDENCE: 78% (3 consistent sources over 7 months)
+```
+
+## Incremental Updates
+
+The archive grows over time. Each time the person is profiled:
+1. Pull new content since last archive timestamp
+2. Append to source files
+3. Re-embed new entries only
+4. Update topic clusters
+5. Update index
+
+Don't rebuild from scratch. Append and re-index.
+
+## Expert Table
+
+When you have 20+ archived people, build an expert table:
+
+```
+worldsim> experts "open source AI"
+
+EXPERT TABLE: open source AI
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+
+  @Teknium | 47 entries | voice: builder/practitioner
+    "we can prove that open approaches build better, more
+    trustworthy systems" — tweet, 2026-04-05
+    Latest: 2 hours ago | Stance: STRONG ADVOCATE
+
+  @repligate | 12 entries | voice: philosophical/theoretical
+    "open weights = accountability. you can't audit a black box"
+    — tweet, 2025-11-30
+    Latest: 3 days ago | Stance: ADVOCATE (principled)
+
+  @eigenrobot | 8 entries | voice: statistical/contrarian
+    "the open source premium is largely downstream of selection
+    effects in who contributes" — tweet, 2025-08-14
+    Latest: 1 week ago | Stance: SKEPTICAL OF FRAMING
+
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+  3 experts found | 67 total entries | synthesize? (y/n)
+```
+
+The table shows: who knows about this, what they've said, how recently,
+and what their stance is. All grounded in archived quotes with sources.
+
+## Integration With Simulation
+
+When the star thread + dossier + archive work together:
+
+```
+STAR THREAD: drives the core generation (what they're DOING)
+DOSSIER: provides constraints (psychometrics, voice metrics, baselines)
+ARCHIVE: provides GROUNDING (specific real quotes for this context)
+MECHANICAL CHECKS: verifies surface features (emoji, length, slop)
+```
+
+The archive prevents the simulation from drifting into generic territory.
+Instead of "this person would probably say something about open source,"
+it's "this person said THIS SPECIFIC THING about open source 3 weeks ago,
+and their simulation should be consistent with that while also being fresh."
+
+## The Overfitting Problem
+
+"Without overfitting to a particular material the new context doesn't call for."
+
+The retrieval system MUST be selective. If someone said 47 things about
+open source AI, and the current conversation is about AI regulation,
+don't dump all 47 open source quotes into context. Maybe 3 are relevant
+because they connect open source to regulation. Retrieve THOSE 3.
+
+The cosine similarity search handles this naturally — it matches the
+CURRENT conversation context against the archive and returns what's
+actually relevant, not everything tagged with a nearby topic.
+
+The anti-overfitting checklist:
+- Never load more than 25-30 archive entries per person into context
+- Weight by relevance to CURRENT conversation, not by general importance
+- Include at least 2 entries that contradict the expected position
+- Include at least 3 recent entries regardless of topic relevance (freshness)
+- If the conversation shifts topic mid-simulation, RE-RETRIEVE for new context
+- The archive is a LIBRARY you consult, not a script you follow
@@ -0,0 +1,321 @@
+# Mass Behavior Modeling — Communities, Clusters, Cascades
+
+Understanding individual behavior requires understanding the social
+ecosystem they exist in. This reference covers the macro layer:
+community detection, influence networks, audience modeling, and
+predicting how groups respond to events.
+
+## Why This Matters For Simulation
+
+Individual prediction accuracy: ~56-60%
+Individual-in-context prediction: significantly higher
+
+A person's behavior is constrained by their community. Knowing WHICH
+community they belong to, WHO influences them, and WHAT information
+ecosystem they're in makes individual predictions much sharper.
+
+Lewin's equation: B = f(P, E). This reference is about the E.
+
+## The Ecosystem Stack
+
+```
+Layer 5: AUDIENCE REACTION    — How would this person's audience respond?
+Layer 4: STANCE & SENTIMENT   — What positions do clusters hold?
+Layer 3: INFLUENCE NETWORKS   — Who spreads ideas to whom?
+Layer 2: COMMUNITY CLUSTERS   — Who groups together?
+Layer 1: SOCIAL GRAPH         — Who follows/interacts with whom?
+```
+
+## Layer 1: Social Graph Construction
+
+### Data Sources (by accessibility)
+
+| Source | Access | Quality | Tools |
+|--------|--------|---------|-------|
+| Bluesky AT Protocol | FREE, open, no auth | Excellent | atproto (pip) |
+| X/Twitter API | Bearer token, limited | Good but restricted | curl, tweepy |
+| Reddit | API with limits | Good for comments | PRAW (pip) |
+| GitHub | Free API | Great for tech people | PyGithub (pip) |
+| Web scraping | Fragile, TOS issues | Variable | Last resort |
+
+### Bluesky: The Open Gold Mine
+```python
+# pip install atproto
+from atproto import Client
+# Unauthenticated reads of public data go through the public AppView host
+client = Client(base_url="https://public.api.bsky.app")
+
+# Get follower graph
+followers = client.get_followers(actor="handle.bsky.social")
+following = client.get_follows(actor="handle.bsky.social")
+
+# Real-time firehose (no auth!)
+# wss://jetstream1.us-east.bsky.network/subscribe
+```
+
+### Graph Types
+- **Follow graph**: who follows whom (directed, static-ish)
+- **Interaction graph**: who replies to / retweets whom (directed, dynamic)
+- **Mention graph**: who mentions whom (directed, weighted by frequency)
+- **Co-engagement graph**: who engages with the same content (undirected)
+
+Interaction graphs are more informative than follow graphs for predicting
+actual behavioral alignment.
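+
+An interaction graph can be sketched as a weighted edge list before
+handing it to a graph library. The input shape, (author, target) pairs,
+is an assumption about how reply and repost records have been normalized:
+
+```python
+from collections import Counter
+
+def interaction_edges(interactions):
+    """Collapse (author, target) events into weighted directed edges.
+
+    Returns a list of (author, target, weight); self-interactions are skipped.
+    """
+    counts = Counter((a, t) for a, t in interactions if a != t)
+    return [(a, t, w) for (a, t), w in counts.items()]
+```
+
+The resulting triples can be loaded into NetworkX with
+`G.add_weighted_edges_from(edges)` on a `DiGraph`.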
+
+### Tools
+```
+pip install networkx python-igraph
+```
+NetworkX for prototyping (<100K nodes), igraph for production (millions).
+
+## Layer 2: Community Detection
+
+### Algorithms (ranked by quality)
+
+| Algorithm | Quality | Speed | Notes |
+|-----------|---------|-------|-------|
+| Leiden | Best | Fast | Guarantees connected communities |
+| Infomap | Excellent | Medium | Based on information theory |
+| Louvain | Good | Fastest | Can produce disconnected communities |
+| Label Propagation | Decent | Very fast | Non-deterministic |
+
+### The Meta-Library: CDLib
+```
+pip install cdlib
+```
+Wraps 50+ community detection algorithms in a unified API.
+Works on top of networkx/igraph. Highly recommended.
+
+```python
+import cdlib
+from cdlib import algorithms
+import networkx as nx
+
+G = nx.karate_club_graph()
+communities = algorithms.leiden(G)
+# Also: louvain, infomap, label_propagation, angel, demon, etc.
+```
+
+### What Communities Tell Us
+Each community in a social graph typically shares:
+- Ideological orientation
+- Topic interests
+- Information sources
+- Language patterns and in-group vocabulary
+- Reaction patterns to events
+
+Knowing which community someone belongs to immediately constrains
+predictions about their likely positions and reactions.
+
+## Layer 3: Influence Networks
+
+### Key Insight (Zhou et al., National Science Review 2024)
+Network centrality alone is INSUFFICIENT for predicting influence.
+Must combine structural position with behavioral features:
+- Posting frequency
+- Historical content virality
+- Response rate / engagement ratio
+- Content originality (original vs repost ratio)
+
+### Centrality Measures
+```python
+import networkx as nx
+
+# Directed social graph; edge u -> v means "u replies to / amplifies v".
+# The edges here are illustrative placeholders.
+G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "a")])
+
+# Who has the most connections?
+degree = nx.degree_centrality(G)
+
+# Who bridges different communities?
+betweenness = nx.betweenness_centrality(G)
+
+# Who's connected to other well-connected people?
+eigenvector = nx.eigenvector_centrality(G, max_iter=1000)
+
+# PageRank (adapted from web search): directed influence flow
+pagerank = nx.pagerank(G)
+```
+
+### Superspreader Identification (DeVerna et al., PLOS ONE 2024)
+Superspreaders of low-credibility content fall into three categories:
+1. **Pundits**: large following, high authority, original content
+2. **Media outlets**: institutional accounts, news organizations
+3. **Affiliated personal accounts**: connected to pundits/outlets
+
+For simulation: knowing who the superspreaders are in a person's
+network tells you what information they're likely exposed to.
+
+### Information Cascade Modeling
+```
+pip install ndlib  # Network Diffusion Library
+```
+
+NDlib models how information spreads through networks:
+- Independent Cascade Model
+- Linear Threshold Model
+- SIR/SIS epidemiological models adapted for info spread
+- Voter Model (opinion dynamics)
+- Sznajd Model (social influence)
+
+## Layer 4: Stance & Sentiment Analysis
+
+### Ready-To-Use Models (HuggingFace)
+
+**Tweet Sentiment** (most reliable):
+```
+cardiffnlp/twitter-roberta-base-sentiment-latest
+# Labels: positive / negative / neutral
+```
+
+**Political Stance**:
+```
+kornosk/bert-election2020-twitter-stance-biden-KE-MLM
+kornosk/bert-election2020-twitter-stance-trump-KE-MLM
+launch/POLITICS  # left / center / right
+```
+
+**All-in-One Tweet NLP**:
+```
+pip install tweetnlp
+# Sentiment, emotion, hate speech, NER, topic classification
+```
+
+### Topic-Level Stance Tracking
+Combine BERTopic (dynamic topic modeling) with stance classifiers:
+1. Cluster posts into topics over time windows
+2. Classify stance per topic per community
+3. Track stance shifts over time
+4. Detect divergence between communities on emerging topics
+
+### PRISM Framework (ACL 2025)
+First framework for interpretable political bias embeddings.
+Two-stage: mine bias indicators → cross-encoder assigns structured scores.
+```
+github.com/dukesun99/ACL-PRISM
+```
+
+## Layer 5: Audience Modeling & Crowd Prediction
+
+### The Frontier: Predicting How Groups React
+
+Key papers and findings:
+
+**CReAM (WWW 2024)**: Predicts which of two posts gets more engagement.
+Uses LLM-generated features + FLANG-RoBERTa cross-encoder.
+Demonstrates crowd reaction IS predictable from content alone.
+
+**PopSim (Dec 2025)**: LLM multi-agent social network sandbox.
+Simulates content propagation dynamics using "Social Mean Field"
+for individual-population interaction. Reduces prediction error 8.82%.
+
+**Conditioned Comment Prediction (EACL 2026)**:
+KEY FINDING: behavioral traces (past posts) are BETTER than
+descriptive personas for conditioning LLMs to predict user behavior.
+This validates our OSINT approach: real data > personality labels.
+
+**DEBATE Benchmark (Oct 2025)**:
+WARNING: LLM agents converge opinions TOO QUICKLY vs real humans.
+SFT + DPO helps but gap remains. Real communities maintain
+disagreement longer than simulated ones.
+
+**Distributional vs Individual Prediction (PMC 2025)**:
+Group-level predictions are more reliable than individual ones.
+Predicting "65% of this community will react negatively" is more
+accurate than predicting "this specific person will react negatively."
+
+### Application to Simulation
+
+When simulating @person talking about event X, consider:
+1. What community does @person belong to?
+2. How is that community reacting to X? (distributional prediction)
+3. Where does @person sit within that community? (conformist vs contrarian)
+4. Who influences @person? What are THEY saying?
+5. How does @person's audience react to their take? (engagement prediction)
+
+This context makes individual predictions sharper.
+
+## Echo Chamber & Filter Bubble Detection
+
+### Technique
+1. Build interaction graph
+2. Run Leiden community detection
+3. For each community, aggregate stance on key issues
+4. Measure ideological homogeneity within communities
+5. Compare cross-community vs within-community content similarity
+6. High within + low cross = echo chamber
+
+### Tools
+```
+github.com/mminici/Echo-Chamber-Detection  # Cascade-based, CIKM 2022
+# Includes Brexit and VaxNoVax datasets
+```
+
+### What It Tells Us
+Knowing someone's echo chamber tells you:
+- What information they're exposed to
+- What they're NOT exposed to
+- How extreme their positions might be (isolation → radicalization)
+- Whether they'll encounter pushback or only agreement
+- How they'll react to information from outside their bubble
+
+## User Embeddings: "Find People Like @person"
+
+### Strategy
+1. Embed each user's recent N posts with sentence-transformers
+2. Average embeddings → user vector
+3. Use FAISS for similarity search
+4. Cluster users with HDBSCAN in embedding space
+
+### Best Models for Social Media Text
+```
+# General purpose (good baseline)
+sentence-transformers/all-mpnet-base-v2
+
+# Tweet-specific (better domain fit)
+cardiffnlp/twitter-roberta-base
+vinai/bertweet-base  # pretrained on 850M tweets
+```
+
+### Graph + Text Hybrid Embeddings
+```
+pip install karateclub
+```
+KarateClub provides Node2Vec, DeepWalk, Graph2Vec — embed users
+based on graph position. Combine with text embeddings for hybrid
+vectors that capture BOTH what someone says AND where they sit
+in the social network.
+
+## Practical Application to Simulation
+
+### For Individual Simulation (what we already do)
+Add ecosystem context to each dossier:
+- Which community cluster they belong to
+- Who their top influencers are (who do they retweet/amplify most)
+- What echo chamber are they in (information environment)
+- How does their community view the simulation topic
+
+### For Audience Simulation (new capability)
+When user asks "what would @person's audience say":
+1. Identify @person's follower community
+2. Sample representative voices from that community
+3. Model the DISTRIBUTION of responses, not just one response
+4. Include: cheerleaders, critics, joke-makers, lurkers
+5. Weight by typical engagement patterns
+
+### For Cascade Prediction (new capability)
+When user asks "how would this take spread":
+1. Model the initial tweet and its immediate network
+2. Predict which nodes amplify (based on stance alignment + influence)
+3. Estimate reach and engagement range
+4. Predict quote-tweet ratio (agreement vs dunking)
+
+## Recommended Minimal Stack
+
+```bash
+pip install networkx python-igraph leidenalg cdlib karateclub
+pip install sentence-transformers transformers tweetnlp
+pip install ndlib faiss-cpu hdbscan atproto
+```
+
+This gives you: graph construction, community detection, user embeddings,
+stance/sentiment analysis, diffusion simulation, similarity search,
+clustering, and Bluesky data access. All open source, all pip-installable.
@@ -0,0 +1,370 @@
+# OSINT Pipeline — Deep Intelligence Gathering
+
+Full-spectrum open source intelligence for building personality models.
+This goes beyond social media posts into visual identity, cross-platform
+footprints, and behavioral analysis.
+
+## Tool Arsenal
+
+| Tool | Use Case | Strength |
+|------|----------|----------|
+| `web_search` | Find anything, initial discovery | Fast, broad, indexed content |
+| `web_extract` | Pull full page content | Blogs, articles, profiles, PDFs |
+| `browser_navigate` + `browser_snapshot` | View live pages | Dynamic content, login walls |
+| `browser_vision` | Analyze what a page looks like | Layouts, visual identity, screenshots |
+| `vision_analyze` | Analyze any image by URL/path | Profile pics, post images, aesthetics |
+| `browser_get_images` | List all images on a page | Find images to feed to vision_analyze |
+| Yandex reverse image search | Find where an image appears | Identity verification, alt accounts |
+| `x-cli` (if available) | Direct Twitter API | Timelines, search, metadata |
+
+## Instagram Intelligence
+
+Instagram is CRITICAL for personality modeling — it reveals:
+- Visual identity and aesthetic preferences
+- Real-life social circles (tagged people, group photos)
+- Lifestyle signals (travel, food, hobbies, pets)
+- Caption voice (often different from Twitter voice)
+- Story highlights (curated self-image)
+- Bio links (cross-platform connections)
+
+### Viewing Instagram Profiles (VERIFIED APRIL 2026)
+
+**METHOD 1 — Instagram Private Web API (BEST, returns full JSON)**
+```bash
+curl -s -H 'User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)' \
+  -H 'x-ig-app-id: 936619743392459' \
+  'https://i.instagram.com/api/v1/users/web_profile_info/?username={handle}'
+```
+Returns ~500KB of JSON: full profile + last 12 posts with captions, likes,
+comments, CDN image URLs, timestamps. No auth needed.
+
+**METHOD 2 — Instagram oEmbed API (for individual posts)**
+```bash
+curl -s 'https://www.instagram.com/api/v1/oembed/?url=https://www.instagram.com/p/{SHORTCODE}/'
+```
+Returns: caption text, author_name, thumbnail URL. No auth.
+
+**METHOD 3 — Pixwox via web_extract (profile viewer)**
+```python
+web_extract(["https://pixwox.com/profile/{username}"])
+```
+Returns 12+ recent posts with captions, engagement stats. Cloudflare blocks
+curl but web_extract bypasses it.
+
+**METHOD 4 — SocialBlade via web_extract (analytics)**
+```python
+web_extract(["https://socialblade.com/instagram/user/{handle}"])
+```
+Returns follower count, engagement rate, 14-day tracking.
+
+**METHOD 5 — CDN direct download (images from API responses)**
+Image URLs from API responses (scontent-*.cdninstagram.com) download
+directly with no auth. Feed them to vision_analyze for visual profiling.
+
+**METHOD 6 — Google indexed content**
+```
+web_search("site:instagram.com {username}")
+```
+Returns bio text, follower count, recent post captions from search snippets.
+
+**WHAT DOESN'T WORK:** direct web_extract on instagram.com, ?__a=1 trick,
+graph.instagram.com (needs OAuth), imginn/picuki/dumpoir/gramhir (403)
+
+### Instagram Discovery (finding someone's handle)
+```
+web_search("{real_name} instagram")
+web_search("{twitter_handle} instagram account")
+web_search("site:instagram.com {real_name}")
+
+# Check their Twitter/X bio for IG links
+# Check their personal website for social links
+# Check Linktree / bio.link pages
+```
+
+### Extracting Signal from Instagram
+
+**Profile Picture**: Reveals self-presentation style
+- Professional headshot vs casual vs meme/avatar
+- Analyze with vision_analyze for clothing, setting, expression
+
+**Bio Text**: Compressed self-identity
+- Role/title claims
+- Emoji usage patterns
+- Link destinations
+- Location claims
+
+**Post Grid**: Visual identity fingerprint
+- Color palette tendencies
+- Content categories (food/travel/tech/selfies/memes)
+- Posting frequency
+- Professional vs personal ratio
+
+**Captions**: Voice sample different from Twitter
+- Usually longer, more personal
+- Hashtag usage patterns
+- Emoji patterns
+- Tone (inspirational vs casual vs funny)
+
+**Tagged Photos**: Real social graph
+- Who they hang out with IRL
+- Events they attend
+- Social circles outside tech/AI
+
+## Visual Identity Analysis
+
+Use vision tools to analyze HOW someone presents visually:
+
+### Profile Pictures Across Platforms
+```
+# Collect profile pics from multiple platforms
+# Twitter, Instagram, LinkedIn, GitHub, Discord
+
+# Analyze each
+vision_analyze(image_url="{pic_url}", 
+    question="Describe this profile picture in detail: person's appearance, clothing style, setting, expression, professional vs casual, any notable elements")
+
+# Cross-reference: do they use the same pic everywhere? Different personas?
+```
+
+### Reverse Image Search (Yandex Pipeline)
+Google Lens blocks Browserbase IPs, so use Yandex instead:
+
+```
+# For images behind auth/CDN, upload to catbox first
+terminal("curl -F 'reqtype=fileupload' -F 'fileToUpload=@{local_path}' https://catbox.moe/user/api.php")
+
+# Then Yandex reverse image search
+browser_navigate("https://yandex.com/images/search?rpt=imageview&url={encoded_public_url}")
+
+# Or via web_extract (slower but automatable)
+web_extract(["https://yandex.com/images/search?rpt=imageview&url={encoded_url}"])
+```
+
+Yandex provides:
+- Similar images (find the same person elsewhere)
+- Site matches (where this image appears)
+- OCR text extraction (text in images)
+- Image tags (what's in the image)
+- Knowledge panels (identified entities)
+
+### Screenshot Analysis
+When you can see a page but can't extract text:
+```
+browser_vision(question="Read all text on this page. List usernames, post content, dates, engagement numbers")
+browser_vision(annotate=true, question="What interactive elements are on this page?")
+```
+
+## LinkedIn Intelligence
+
+**STATUS: BLOCKED for automated access** (tested April 2026).
+web_extract returns "Website Not Supported". Direct browsing triggers auth walls.
+
+**Workarounds:**
+```
+# LinkedIn content IS indexed by search engines
+web_search("{real_name} linkedin {company}")
+web_search("site:linkedin.com/in {name}")
+# These return snippets with headline, role, company — useful even without full profile
+
+# Google sometimes caches LinkedIn profiles
+web_search("{name} site:linkedin.com headline")
+```
+
+**METHOD 1 — Google indexed snippets (always works)**
+```
+web_search("site:linkedin.com/in {name} {company}")
+```
+Returns: name, headline, company, location, connection count, bio snippet.
+
+**METHOD 2 — Crunchbase (EXCELLENT for founders/execs)**
+```python
+web_extract(["https://www.crunchbase.com/person/{slug}"])
+```
+Returns: full career history, education, investments, board positions,
+social links. Best source for professional identity of startup people.
+
+**METHOD 3 — Corporate press pages**
+```
+web_search("{person} {company} site:{company}.com bio OR press")
+```
+Official bios from company newsrooms. High quality, curated but factual.
+
+**METHOD 4 — Third-party aggregators**
+- RocketReach, SignalHire — job title + company from web_search snippets
+- rootdata.com — good for crypto/AI people
+- Crunchbase — best all-round for tech executives
+
+**METHOD 5 — Paid LinkedIn API wrappers** (if budget allows)
+- LinkdAPI, Proxycurl: $0.07-0.15 per profile, full structured data
+- No OAuth needed, just API key
+
+LinkedIn reveals (from combined methods):
+- Career trajectory (Crunchbase full history)
+- Current role and headline (search snippets)
+- Education (Crunchbase or search snippets)
+- Professional self-presentation (company bio pages)
+- Investment/board activity (Crunchbase)
+
+## Podcast Transcripts (HIGHEST VALUE for voice profiling)
+
+Podcast interviews are THE gold mine for personality modeling. Hours of
+unscripted speech, natural conversation, real personality showing through.
+
+**Discovery:**
+```
+web_search("{name} podcast transcript interview")
+web_search("{name} lex fridman OR tyler cowen OR joe rogan OR dwarkesh")
+```
+
+**Extraction — verified working transcript sources:**
+```python
+# Lex Fridman (full verbatim transcripts)
+web_extract(["https://lexfridman.com/EPISODE_URL/transcript"])
+
+# Conversations with Tyler (Tyler Cowen — full transcripts)
+web_extract(["https://conversationswithtyler.com/episodes/..."])
+
+# TED Talks transcripts
+web_extract(["https://www.ted.com/talks/.../transcript"])
+
+# Sequoia Capital podcast
+web_extract(["https://www.sequoiacap.com/podcast/..."])
+```
+
+Podcast transcripts reveal:
+- Natural speech patterns (filler words, pacing, sentence structure)
+- Unguarded opinions (less curated than tweets)
+- How they respond to pushback (interviewer challenges)
+- Humor style in conversation (different from written humor)
+- Depth of knowledge on specific topics
+- Personality under pressure
+
+## YouTube / Video Intelligence
+
+```
+web_search("{name} youtube talk keynote interview")
+web_search("{name} podcast appearance")
+```
+
+web_extract on YouTube pages returns rich summaries with attributed quotes.
+Use youtube-content skill for full transcripts if available.
+
+## Personal Blogs & Substacks (HIGH VALUE)
+
+Personal writing is curated self-expression — how someone WANTS to be
+seen intellectually. Very different signal from social media.
+
+```
+web_search("{name} blog substack essay")
+# Extract full posts
+web_extract(["https://{blog-url}/"])
+# Wayback Machine works for archived blog posts
+web_extract(["https://web.archive.org/web/2024/{blog-url}"])
+```
+
+## GitHub Intelligence
+
+For technical people:
+
+```
+web_search("site:github.com {handle}")
+web_extract(["https://github.com/{handle}"])
+
+# Issue comments reveal communication style under technical pressure
+web_search("site:github.com {handle} issue comment")
+
+# README style reveals documentation personality
+# Commit messages reveal terseness vs verbosity
+```
+
+## General Web Footprint
+
+```
+# Personal website / blog
+web_search("{name} personal website blog about")
+
+# Conference talks / speaker bios
+web_search("{name} speaker conference talk bio")
+
+# News mentions
+web_search("{name} {company} news interview profile")
+
+# Academic papers (for researchers)
+web_search("{name} arxiv paper author")
+web_search("site:scholar.google.com {name}")
+
+# Podcast appearances
+web_search("{name} podcast guest appearance")
+
+# Forum posts (HN, specific communities)
+web_search("site:news.ycombinator.com {handle} OR {name}")
+```
+
+## Cross-Platform Identity Resolution
+
+### Handle Mapping Strategy
+1. Start from known handle (usually Twitter)
+2. Check bio links — most people link to other platforms
+3. Search "{known_handle} {platform}" for each platform
+4. Check personal website for social links
+5. Reverse image search profile pic to find matching accounts
+6. Search unique phrases they use across platforms
+
+### Identity Verification
+When you find a potential match on another platform:
+- Same profile picture? (reverse image search)
+- Same bio keywords?
+- Same name/handle pattern?
+- Cross-references (do they mention each other?)
+- Writing style match?
+
+## Search Space Narrowing
+
+### The Jiggle Technique
+When broad searches return noise, narrow progressively:
+
+1. **Start broad**: `"{name}" AI` 
+2. **Add role**: `"{name}" {company} {role}`
+3. **Add context**: `"{name}" {company} {specific_project_or_topic}`
+4. **Add platform**: `site:{platform} "{name}" {context}`
+5. **Add time**: `"{name}" {topic} 2025 OR 2026`
+6. **Quote unique phrases**: if you found a distinctive phrase they use, search for that exact phrase to find more of their content
+
+### Disambiguation
+Common names need extra signals:
+- Add their company/org
+- Add their specific domain (AI, crypto, etc.)
+- Use their unique handle as anchor
+- Search for combinations of their known associates
+- Use image search to verify you have the right person
+
+### Signal vs Noise Heuristics
+- **High signal**: direct quotes, interview transcripts, personal blog posts, long-form content
+- **Medium signal**: mentions in aggregator sites, conference bios, LinkedIn summaries
+- **Low signal**: generic news mentions, third-party profiles, directory listings
+- **Noise**: same-name different person, outdated info (>2 years), scraped/regurgitated content
+
+## Confidence Calibration
+
+After full OSINT sweep, rate data quality:
+
+| Confidence | Data Available | Simulation Quality |
+|-----------|---------------|-------------------|
+| 95-100% | 50+ posts, longform, video, visual, cross-platform | Near-perfect voice replication |
+| 80-94% | 20-50 posts, some longform, basic visual | Very good, occasional educated guesses |
+| 60-79% | 10-20 posts, mostly short-form | Good general sense, some gaps |
+| 40-59% | 5-10 posts, limited platforms | Broad strokes only, flag uncertainty |
+| 20-39% | <5 posts, single platform | Sketch at best, heavy disclaimers |
+| <20% | Almost nothing found | Decline to simulate, ask user for context |
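+
+The table above can be read as a lookup from a rough confidence score to a
+simulation posture. A minimal sketch, assuming the score is a 0-100 number you
+assign after the sweep; `simulation_posture` and `should_simulate` are
+illustrative names, not an existing API.
+
+```python
+# (lower bound in percent, expected simulation quality), highest tier first.
+CONFIDENCE_TIERS = [
+    (95, "Near-perfect voice replication"),
+    (80, "Very good, occasional educated guesses"),
+    (60, "Good general sense, some gaps"),
+    (40, "Broad strokes only, flag uncertainty"),
+    (20, "Sketch at best, heavy disclaimers"),
+    (0, "Decline to simulate, ask user for context"),
+]
+
+
+def simulation_posture(confidence):
+    """Return the table row that a 0-100 confidence score falls into."""
+    if not 0 <= confidence <= 100:
+        raise ValueError("confidence must be between 0 and 100")
+    for lower_bound, posture in CONFIDENCE_TIERS:
+        if confidence >= lower_bound:
+            return posture
+    return CONFIDENCE_TIERS[-1][1]  # not reached for valid input; kept as a guard
+
+
+def should_simulate(confidence):
+    """Below 20% the table says to decline and ask the user for more context."""
+    return confidence >= 20
+```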
+
+## Privacy & Ethics Note
+
+All research uses publicly available information only. We don't:
+- Access private/locked accounts
+- Bypass authentication
+- Use leaked/hacked data
+- Dox or expose private information
+- Simulate in ways designed to deceive or impersonate
+
+The goal is personality MODELING for creative simulation, grounded in
+what people choose to share publicly.
@@ -0,0 +1,334 @@
+# Prediction Engine — Forecasting What Someone Would Say/Do
+
+Techniques for predicting behavior grounded in superforecasting methodology,
+behavioral science, and SOTA LLM prediction research.
+
+## Superforecasting Principles (Tetlock)
+
+**Honest caveat**: Superforecasting methodology was developed for geopolitical and
+world-event prediction, not personality simulation. That said, the THINKING TOOLS
+are genuinely useful here — decomposition prevents lazy pattern-matching, base rates
+fight overconfidence, and alternative hypotheses prevent single-track predictions.
+What does NOT transfer cleanly: the calibration precision. When Tetlock says "70%
+confident," that's backed by thousands of scored predictions. When we say "70%
+confident" about what @someone would tweet, that's an educated estimate, not a
+calibrated probability. Use the framework for its rigor, not its false precision.
+
+Apply these thinking tools when making behavioral predictions:
+
+### 1. Decomposition (Fermi-ize the Question)
+Don't ask "What would @person say about X?"
+Break it down:
+- What is @person's known position on topics RELATED to X?
+- What are their values/priorities that X touches on?
+- What is their emotional register when discussing similar topics?
+- Who are they likely responding to, and how does that change their tone?
+- What platform are they on, and how does that shift their behavior?
+
+### 2. Outside View First (Base Rates)
+Before considering the specific person, ask:
+- What would a TYPICAL person in their role/position say about X?
+- What % of people in their ideological cluster hold position Y on X?
+- What's the base rate for their type of response (agree/disagree/joke/ignore)?
+
+### 3. Inside View Second (Case-Specific Adjustment)
+Now adjust from the base rate using what you ACTUALLY KNOW about them:
+- Specific past statements on this topic or related topics
+- Known relationships with people/orgs involved
+- Personal experiences that would shape their view
+- Contrarian tendencies (do they predictably go against their cluster?)
+
+### 4. Confidence Calibration
+Express predictions with honest uncertainty. **These are rough buckets, not
+calibrated probabilities. Don't pretend they're more precise than they are.**
+- **90%+ confident**: They've literally said this before, just rephrased
+- **70-89%**: Strong pattern match with known positions and voice
+- **50-69%**: Reasonable inference but could go either way
+- **30-49%**: Educated guess, limited data
+- **<30%**: Basically guessing, flag it clearly
+
+When reporting confidence, prefer plain language over fake precision:
+say "very likely" rather than "87% probability". The number implies a precision we don't have.
+
+### 5. Consider Alternative Hypotheses
+For every prediction, generate at least ONE plausible alternative:
+- "They'd PROBABLY say X, but they might surprise with Y because Z"
+- This prevents overconfident single-track predictions
+
+## The Prediction Pipeline
+
+### Step 1: Classify the Prediction Type
+
+| Type | Description | Difficulty |
+|------|-------------|-----------|
+| **Position prediction** | What they believe about X | Easiest if data exists |
+| **Reaction prediction** | How they'd respond to event Y | Medium |
+| **Voice prediction** | How they'd phrase something | Medium-hard |
+| **Behavior prediction** | What they'd DO (not just say) | Hardest |
+| **Interaction prediction** | How they'd respond to specific person | Hard, depends on relationship data |
+
+### Step 2: Evidence Gathering Protocol
+
+For each prediction, gather evidence in this order:
+
+1. **Direct evidence**: Have they addressed this exact topic before?
+   - Search: `"{handle}" "{topic}"` or `"{handle}" "{related_keyword}"`
+   - Weight: HIGHEST
+
+2. **Analogical evidence**: Have they addressed something similar?
+   - Search: find positions on adjacent topics
+   - Weight: HIGH
+
+3. **Value evidence**: What values/principles would apply?
+   - Infer from their stated beliefs and consistent positions
+   - Weight: MEDIUM
+
+4. **Social evidence**: What do their peers/allies think?
+   - People tend to align with their social cluster (but not always)
+   - Weight: LOW-MEDIUM (higher for conformists, lower for contrarians)
+
+5. **Demographic evidence**: What would someone in their position typically think?
+   - Base rate from role/industry/ideology
+   - Weight: LOWEST (only use as anchor, not conclusion)
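+
+The gathering order above can be sketched as a sort key. This is an illustrative
+sketch only: the `Evidence` record and `rank_evidence` helper are hypothetical
+names, and breaking ties by recency within a tier is an assumption borrowed from
+the recency default in the contradiction protocol.
+
+```python
+from dataclasses import dataclass
+
+# Tier order from the protocol above: lower rank means higher weight.
+EVIDENCE_RANK = {
+    "direct": 0,
+    "analogical": 1,
+    "value": 2,
+    "social": 3,
+    "demographic": 4,
+}
+
+
+@dataclass
+class Evidence:
+    kind: str  # one of the EVIDENCE_RANK keys
+    year: int  # when the statement was made
+    note: str  # short description or quote
+
+
+def rank_evidence(items):
+    """Sort by evidence tier first, then most recent first within a tier."""
+    return sorted(items, key=lambda e: (EVIDENCE_RANK[e.kind], -e.year))
+```
+
+Reading the ranked list top-down keeps demographic base rates as an anchor at
+the bottom instead of letting them drive the conclusion.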
+
+### Step 2b: Contradiction Handling Protocol
+When evidence conflicts (e.g., person said X in 2024 but Y in 2026):
+
+1. **Check for genuine change**: Did they explicitly reverse position? Look for
+   "I used to think X but now..." or a clear pivot moment. If so, use the newer
+   position and note the evolution.
+
+2. **Check for context-dependence**: Did they say X to audience A and Y to audience B?
+   This isn't necessarily dishonesty — people emphasize different facets for different
+   contexts. Note which context your simulation targets and use the matching register.
+
+3. **Check for nuance collapse**: Maybe they said "X is mostly good with caveats"
+   and later "X has real problems" — these might not actually contradict. Look for
+   the synthesis position.
+
+4. **When genuinely unresolvable**: Flag it explicitly. "Evidence conflicts on this
+   point — they've argued both sides at different times. Simulating {chosen position}
+   based on {reasoning}, but the alternative is plausible." Don't paper over the
+   contradiction with false confidence.
+
+5. **Recency default**: When all else fails, weight more recent statements higher.
+   People change, and the most recent position is the best predictor of the next one.
+
+### Step 3: Generate Prediction
+
+Using the HumanLLM B = f(P, E) framework, where behavior (B) is a function of person and environment:
+- **P (Person)**: Everything from the dossier — personality, values, voice
+- **E (Environment)**: The specific context — platform, topic, who's asking,
+  what just happened, social dynamics in play
+
+Generate the prediction by:
+1. Setting the base rate (outside view)
+2. Adjusting for personal specifics (inside view)
+3. Filtering through their voice profile (how they'd phrase it)
+4. Applying platform-specific behavior patterns
+5. Calibrating confidence
+
+## Memory Curation (The 30-50 Rule)
+
+Research shows performance PEAKS at 30-50 memory entries then DECLINES.
+For each person in a simulation, curate memories:
+
+### What to Include (high signal)
+- **Signature takes**: Their most characteristic/famous positions (5-10)
+- **Voice samples**: Real quotes that capture their linguistic style (5-10)
+- **Relationship data**: Known dynamics with other sim targets (3-5)
+- **Recent context**: What they've been talking about lately (3-5)
+- **Formative moments**: Career milestones, public pivots, viral moments (3-5)
+- **Quirks & tells**: Catchphrases, humor style, pet peeves (3-5)
+
+### What to Exclude (noise)
+- Generic biographical facts that don't predict behavior
+- Old positions they've clearly evolved past
+- Trivial interactions that don't reveal personality
+- Secondhand characterizations (what others say about them)
+- Platform metadata (follower counts, join dates) unless directly relevant
+
+### Memory Selection Heuristic
+For each candidate memory entry, ask:
+**"If I removed this, would the simulation noticeably degrade?"**
+If no, cut it.
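+
+A minimal sketch of the curation step, assuming each candidate memory already
+carries a category and a score. The score is a hypothetical 0-1 estimate of how
+much the simulation would degrade without the entry; the category keys and the
+`curate` helper are illustrative names. Caps use the upper end of each range
+listed under "What to Include".
+
+```python
+# Per-category caps (upper end of each range above); they sum to 40.
+CATEGORY_CAPS = {
+    "signature_take": 10,
+    "voice_sample": 10,
+    "relationship": 5,
+    "recent_context": 5,
+    "formative_moment": 5,
+    "quirk": 5,
+}
+MAX_ENTRIES = 50  # upper end of the 30-50 rule, kept as an overall guard
+
+
+def curate(memories):
+    """Keep the highest-scoring entries per category, capped overall.
+
+    Each memory is a (category, score, text) tuple.
+    """
+    kept = []
+    counts = {category: 0 for category in CATEGORY_CAPS}
+    for category, score, text in sorted(memories, key=lambda m: m[1], reverse=True):
+        if category not in CATEGORY_CAPS:
+            continue  # excluded noise, e.g. generic biography or metadata
+        if counts[category] < CATEGORY_CAPS[category] and len(kept) < MAX_ENTRIES:
+            counts[category] += 1
+            kept.append((category, score, text))
+    return kept
+```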
+
+## Fighting LLM Defaults
+
+Research shows LLMs have systematic biases in simulation. The fixes below need to be
+CONCRETE — vague instructions like "be more like them" don't work. You need specific
+prompting patterns that actually shift the output.
+
+### Problem: Sycophancy & Over-Agreement
+LLMs default to agreement and positivity.
+**Fix**: Don't just note they're contrarian — structure it as a behavioral instruction
+with evidence:
+```
+"In this conversation, {person} disagrees with {other_person} on {topic}. They are
+noticeably more confrontational than the other speakers. They tend to respond to
+consensus with skepticism and reframe debates on their own terms. Example from their
+real posts: '{actual quote where they disagreed with something popular}'"
+```
+
+### Problem: Rigid/Polarized Strategies
+LLMs tend to take extreme positions and hold them rigidly.
+**Fix**: Provide specific nuance instructions:
+```
+"In this conversation, {person} holds a complex position on {topic}: they agree with
+{point A} but push back on {point B}. They're the type to say 'yes, but...' rather
+than 'no.' Real example of their nuance: '{quote showing them holding a both-and
+position}'"
+```
+
+### Problem: Uniform Register
+LLMs default to a similar educated-casual tone for everyone.
+**Fix**: Anchor voice with REAL QUOTES and explicit comparative instructions:
+```
+"In this conversation, {person} is noticeably more {trait} than the other speakers.
+They tend to {specific behavior pattern}. Their sentences are typically {length/style}.
+They {do/don't} use emoji. Their humor style is {type}. Example from their real posts:
+'{actual quote that captures their voice}'"
+```
+The more you can say "{person} does THIS while {other_person} does THAT," the better
+the differentiation. Comparative framing outperforms absolute descriptions.
+
+### Problem: Overly Structured Responses
+LLMs love neat arguments with clear structure.
+**Fix**: Provide explicit structural anti-patterns:
+```
+"When generating {person}'s messages, break conventional structure. They start one
+thought and jump to another mid-sentence. They use '...' and '—' instead of periods.
+They repeat words for emphasis. They don't conclude neatly. Example: '{real quote
+showing their chaotic structure}'"
+```
+
+### Problem: Missing Mundane Behavior
+LLMs focus on "interesting" responses and skip boring/mundane ones.
+**Fix**: Explicitly instruct for mundane moments:
+```
+"Not every message from {person} needs to be insightful. Include at least 1-2 messages
+that are just reactions ('lmao', 'this', 'wait what'), link shares without commentary,
+or brief agreements. Real people don't craft every message. {person} specifically tends
+to {their specific mundane behavior pattern, e.g., 'drop a single emoji reaction'
+or 'just retweet without comment'}."
+```
+
+### General Principle for All Fixes
+The pattern is always: **behavioral instruction + comparative framing + real evidence**.
+- "Do X" alone doesn't work well
+- "Do X, unlike the default of Y" works better  
+- "Do X, unlike the default of Y, as evidenced by this real quote: Z" works best
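+
+The three-part pattern can be sketched as a small prompt builder. This is an
+illustration, not the skill's actual prompt code: `build_voice_instruction` and
+its parameters are hypothetical. The quote must come from the person's real
+posts; when none is available, the evidence sentence is omitted rather than
+invented.
+
+```python
+def build_voice_instruction(person, behavior, default_behavior, real_quote=None):
+    """Compose a behavioral instruction with comparative framing and evidence."""
+    parts = [
+        f"In this conversation, {person} {behavior},",
+        f"unlike the default of {default_behavior}.",
+    ]
+    if real_quote:
+        # Only real, sourced quotes belong here.
+        parts.append(f"Example from their real posts: '{real_quote}'")
+    return " ".join(parts)
+```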
+
+## The Adjective-Based Personality Method
+
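+The method draws on 70 bipolar adjective pairs for the Big Five traits; the lists
+below give representative high and low adjectives for each trait. Select up to 3
+per trait, each with an intensity modifier ("very", "somewhat", "a bit").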
+
+### Openness
+High: creative, curious, imaginative, artistic, adventurous, intellectual,
+      unconventional, perceptive
+Low:  conventional, practical, traditional, routine-oriented, narrow
+
+### Conscientiousness  
+High: organized, disciplined, reliable, meticulous, systematic, thorough,
+      goal-oriented, persistent
+Low:  careless, impulsive, disorganized, spontaneous, flexible, relaxed
+
+### Extraversion
+High: outgoing, talkative, energetic, assertive, enthusiastic, bold,
+      gregarious, dominant
+Low:  reserved, quiet, introverted, solitary, withdrawn, reflective
+
+### Agreeableness
+High: cooperative, trusting, empathetic, generous, accommodating, kind,
+      diplomatic, forgiving
+Low:  competitive, skeptical, blunt, confrontational, critical, stubborn,
+      independent-minded
+
+### Neuroticism
+High: anxious, moody, sensitive, reactive, volatile, self-conscious,
+      insecure, emotional
+Low:  calm, stable, resilient, confident, even-tempered, composed,
+      thick-skinned
+
+### Usage
+For each simulated person, after OSINT research, estimate their Big Five
+profile and select appropriate adjectives:
+
+Example: "@basedjensen: very creative, somewhat impulsive, very outgoing,
+a bit competitive, calm" → this shapes the generation toward the right
+behavioral profile.
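+
+A minimal sketch of turning trait estimates into that kind of line. It assumes
+each trait is scored on a 0-1 scale and uses one representative adjective per
+pole from the lists above; the cut-offs and the names `describe_trait` and
+`personality_prompt` are hypothetical.
+
+```python
+# One representative adjective per pole (high, low), taken from the lists above.
+TRAIT_ADJECTIVES = {
+    "openness": ("creative", "conventional"),
+    "conscientiousness": ("organized", "impulsive"),
+    "extraversion": ("outgoing", "reserved"),
+    "agreeableness": ("cooperative", "competitive"),
+    "neuroticism": ("anxious", "calm"),
+}
+
+
+def describe_trait(trait, score):
+    """Turn a 0-1 trait estimate into an adjective with an intensity modifier.
+
+    Distance from the 0.5 midpoint sets the intensity (assumed thresholds).
+    """
+    high, low = TRAIT_ADJECTIVES[trait]
+    adjective = high if score >= 0.5 else low
+    distance = abs(score - 0.5)
+    if distance >= 0.35:
+        modifier = "very"
+    elif distance >= 0.15:
+        modifier = "somewhat"
+    else:
+        modifier = "a bit"
+    return f"{modifier} {adjective}"
+
+
+def personality_prompt(handle, scores):
+    """Join the per-trait descriptions into one line for the generation prompt."""
+    return f"{handle}: " + ", ".join(describe_trait(t, s) for t, s in scores.items())
+```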
+
+## Performative vs. Authentic Behavior
+
+**Critical concept**: People act differently for different audiences. A simulation
+must be explicit about which register it's targeting.
+
+### The Register Spectrum
+- **Public broadcast** (tweets, Reddit posts): Most performative. People are
+  playing to their audience, building their brand, signaling to their tribe.
+- **Semi-public** (Discord channels, group chats, comment threads): Less
+  performative but still audience-aware. People are more casual but know
+  others are watching.
+- **Private 1-on-1** (DMs): Much less performative. More honest, more
+  vulnerable, more willing to express doubt or uncertainty.
+- **True private** (inner monologue, close friends): We have almost no data
+  on this. Don't pretend to simulate it.
+
+### Practical implications
+- When simulating a PUBLIC thread, lean into the person's public persona:
+  their brand, their usual takes, their audience-aware voice.
+- When simulating DMs, dial down the performance. More hedging, more honesty,
+  more "I actually think..." vs. the public "Here's my take:".
+- When evidence comes from one register but the simulation targets another,
+  FLAG IT: "Evidence is from public tweets but simulating DM behavior;
+  expect the real person to be less {polished/aggressive/confident} in private."
+- Someone's Twitter persona may be genuinely different from their Reddit persona.
+  These are not interchangeable data sources. Weight evidence from the matching
+  platform higher.
+
+### What we can't know
+Be honest: we're simulating public figures based on their public output. The
+private person may be substantially different. DM simulations are inherently
+lower-confidence than public thread simulations because we have less data on
+how people behave privately.
+
+## Interaction Dynamics Prediction
+
+When simulating conversations between multiple people, remember that predictions
+apply to a SPECIFIC REGISTER. See the previous section on performative vs. authentic
+behavior.
+
+### Dominance Hierarchy
+- Who talks first? (most confident/highest-status usually)
+- Who responds to whom? (not everyone talks to everyone)
+- Who gets ratio'd? (lowest-status takes get challenged)
+- Who lurks? (some people watch before engaging)
+
+### Agreement/Disagreement Prediction
+Based on known positions + social dynamics:
+- **Strong agree**: Both have stated similar positions + friendly relationship
+- **Agree with nuance**: Similar positions but one adds a caveat
+- **Productive disagreement**: Different positions + mutual respect
+- **Hostile disagreement**: Different positions + existing tension/rivalry
+- **Surprising agreement**: Expected to disagree but find common ground
+- **Ignore**: Some people just don't engage with certain others
+
+### Conversation Flow Prediction
+Real conversations follow patterns:
+1. **Opener** → most active/impulsive person posts first
+2. **First response** → most engaged/relevant person responds
+3. **Pile-on or pushback** → depends on agreement/disagreement dynamics
+4. **Tangent** → someone takes a side thread
+5. **Peak moment** → the best/most viral exchange
+6. **Trail off** → energy dissipates, last person makes a joke or short comment
+
+## Scenario Injection Prediction
+
+When "inject: {event}" is used, predict reactions:
+
+1. **Who would see this first?** (most online / most relevant to their work)
+2. **Who would care most?** (most affected / strongest opinion)
+3. **What's the emotional valence?** (good news for some, bad for others)
+4. **What's the expected take?** (apply position prediction pipeline)
+5. **How does this change the existing conversation?** (derail, amplify, redirect)
@@ -0,0 +1,237 @@
+# Recursive Self-Improvement Pipeline
+
+The simulator should get better every time it runs. Not through training —
+through accumulating failure patterns, calibration data, and learned rules
+that feed back into future simulations.
+
+## The Loop
+
+```
+SIMULATE → VERIFY (mechanical) → SCORE → LOG FAILURES → UPDATE RULES → SIMULATE BETTER
+```
+
+Each run produces two outputs:
+1. The simulation (for the user)
+2. A failure log (for the system)
+
+The failure log feeds back into the next run's verification step,
+making the checklist grow and the blind spots shrink.
+
+## What Gets Logged After Every Simulation
+
+### 1. Mechanical Check Failures
+```
+FAILURE LOG: simulation_{timestamp}
+  EMOJI: @visakanv had 6 fabricated emoji, real rate was 10%. Stripped all.
+  SLOP: @eigenrobot utterance contained "multifaceted" — rewritten.
+  LENGTH: @QiaochuYuan avg 42 words/utterance, real avg was 18. Compressed.
+  CAPS: 4/12 utterances started uppercase, targets are 90% lowercase. Fixed.
+  PUNCTUATION: Added periods to @tszzl who never uses terminal punctuation.
+  STRUCTURE: Sycophantic flow detected — B agreed with A then C agreed with B.
+             Injected disagreement.
+```
+
+### 2. Discriminator Critique Patterns
+```
+CRITIQUE LOG:
+  Round 1: @tszzl too verbose (flagged 2x in last 3 simulations)
+  Round 1: @repligate too academic (flagged 3x — this is a persistent pattern)
+  Round 2: Conversation too neat — real conversations are messier (flagged 5x)
+```
+
+### 3. Held-Out Test Results
+```
+CALIBRATION LOG:
+  Voice fidelity: 8.4/10 (up from 7.5 last run)
+  Topic prediction: 2/5 topics matched (typical — content is unpredictable)
+  Register match: 9/10 (improved after emoji fix)
+```
+
+## How Failures Feed Forward
+
+### Pattern Accumulation
+After N runs, persistent failure patterns become AUTOMATIC rules:
+
+```
+IF a pattern is flagged in 3+ consecutive simulations:
+  PROMOTE it from "check" to "pre-generation rule"
+  
+Example progression:
+  Run 1: "Too verbose for @tszzl" → flagged in Round 1, fixed
+  Run 2: "Too verbose for @tszzl" → flagged again, fixed again
+  Run 3: "Too verbose for @tszzl" → PROMOTED to pre-gen rule:
+         "When simulating roon-type voices: max 20 words per tweet.
+          Fragment > sentence. Compress ruthlessly."
+```
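+
+The promotion rule can be sketched as a streak counter. This is a hypothetical
+illustration, not the skill's actual storage code: `FailureTracker` and the
+pattern strings are assumed names, and persistence (SQLite or files) is left out.
+
+```python
+from collections import defaultdict
+
+PROMOTION_THRESHOLD = 3  # consecutive flagged runs, per the rule above
+
+
+class FailureTracker:
+    """Counts consecutive runs in which each failure pattern was flagged."""
+
+    def __init__(self):
+        self.streaks = defaultdict(int)
+        self.pre_generation_rules = set()
+
+    def record_run(self, flagged_patterns):
+        """Update streaks after one simulation and return newly promoted patterns."""
+        flagged = set(flagged_patterns)
+        promoted = []
+        # A run without the flag breaks the streak for that pattern.
+        for pattern in list(self.streaks):
+            if pattern not in flagged:
+                self.streaks[pattern] = 0
+        for pattern in sorted(flagged):
+            self.streaks[pattern] += 1
+            if (self.streaks[pattern] >= PROMOTION_THRESHOLD
+                    and pattern not in self.pre_generation_rules):
+                self.pre_generation_rules.add(pattern)
+                promoted.append(pattern)
+        return promoted
+```
+
+Calling `record_run` once per simulation with that run's flagged patterns is
+enough; a pattern that skips a run starts its streak again from zero.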
+
+### The Growing Checklist
+The mechanical verification checklist starts with the baseline checks
+(emoji, slop, length, caps, punctuation) and GROWS with each failure:
+
+```
+BASELINE CHECKS (permanent):
+  □ Emoji frequency match
+  □ Slop word scan (Tier 1/2/3)
+  □ Sentence length match
+  □ Capitalization match
+  □ Punctuation pattern match
+  □ Reply/original ratio
+  □ Structural slop patterns
+
+LEARNED CHECKS (accumulated from past failures):
+  □ Roon-type voices: max 20 words (from: verbose failure x3)
+  □ Warm personalities: do NOT add emoji (from: emoji inflation x5)
+  □ Academic voices: ground in specific examples (from: too abstract x3)
+  □ Conversations: inject at least one disagreement (from: sycophantic flow x4)
+  □ Self-deprecating voices: add hedging (from: too assertive x2)
+  □ Shitposters: include at least one non-sequitur (from: too on-topic x2)
+```
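+
+A rough sketch of three of the baseline checks (emoji frequency, sentence
+length, capitalization), assuming posts are plain strings and the list is
+non-empty. The emoji regex is an approximation covering only the main
+pictograph blocks, and `voice_stats` / `over_length_pct` are hypothetical names.
+
+```python
+import re
+
+# Approximate emoji matcher (assumption): main pictograph blocks only.
+EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
+
+def voice_stats(posts):
+    """Compute baseline mechanical stats for a non-empty list of posts."""
+    n = len(posts)
+    words = [len(p.split()) for p in posts]
+    letters = [c for p in posts for c in p if c.isalpha()]
+    return {
+        "avg_words": sum(words) / n,
+        "emoji_rate": sum(1 for p in posts if EMOJI_RE.search(p)) / n,
+        "lowercase": sum(c.islower() for c in letters) / max(len(letters), 1),
+    }
+
+def over_length_pct(sim_posts, real_posts):
+    """Percent by which simulated posts exceed the real average word count."""
+    sim = voice_stats(sim_posts)["avg_words"]
+    real = voice_stats(real_posts)["avg_words"]
+    return 100 * (sim - real) / real
+```
+
+Run `voice_stats` on the real posts first, then compare the simulated batch
+against those numbers before any subjective scoring.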
+
+### Where To Store Learned Rules
+Append to the skill itself. After each simulation run where the mechanical
+checks catch something, the agent should ask:
+
+"The mechanical verification caught {failures}. Should I add these as
+permanent learned rules for future simulations?"
+
+If the same failure appears 3+ times, add it automatically without asking.
+
+Use skill_manage(action='patch') to append to this file's "Learned Checks"
+section below.
+
+## Calibration Tracking
+
+### Per-Person Calibration Memory
+After simulating someone, store the calibration data:
+
+```
+@tszzl: voice=8.5, emoji_rate=0%, avg_words=14, lowercase=95%, 
+        signature_move="aphoristic fragments", danger="goes verbose"
+@nickcammarata: voice=8.8, emoji_rate=0%, avg_words=19, lowercase=90%,
+        signature_move="meditation-ML connection", danger="too structured"
+```
+
+If the same person is simulated again, LOAD this calibration to skip
+the cold-start problems. The second simulation of someone should be
+better than the first because you already know their failure modes.
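+
+One way to store and reload this memory, sketched with one JSON file per
+handle. The file layout, the `Calibration` class, and both helper names are
+assumptions for illustration; the skill's actual persistence layer may differ.
+
+```python
+import json
+from dataclasses import dataclass, asdict
+from pathlib import Path
+
+@dataclass
+class Calibration:
+    handle: str
+    voice: float
+    emoji_rate: float      # fraction of posts containing emoji
+    avg_words: float
+    lowercase: float       # fraction of lowercase letters
+    signature_move: str
+    danger: str
+
+def save_calibration(cal, directory):
+    """Write one JSON file per handle (hypothetical layout)."""
+    path = Path(directory) / f"{cal.handle.lstrip('@')}.json"
+    path.write_text(json.dumps(asdict(cal), indent=2), encoding="utf-8")
+    return path
+
+def load_calibration(handle, directory):
+    """Return the stored Calibration, or None on a cold start."""
+    path = Path(directory) / f"{handle.lstrip('@')}.json"
+    if not path.exists():
+        return None
+    return Calibration(**json.loads(path.read_text(encoding="utf-8")))
+```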
+
+### Aggregate Calibration
+Track overall simulation quality across runs:
+
+```
+Run 1: pre-refine 7.5, post-refine 8.4 (delta +0.9)
+Run 2: pre-refine 8.37, post-refine 8.53 (delta +0.16)  
+Run 3: pre-refine 8.53, post-refine 8.83 (delta +0.30, emoji fix)
+```
+
+The pre-refine score should INCREASE over time as learned rules prevent
+repeat failures. If it's not increasing, the learning loop is broken.
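+
+A small sketch of that health check, assuming scores are kept as
+(pre_refine, post_refine) pairs in run order. Requiring a strict increase on
+every run is an assumption; a looser trend test may suit noisy scores better.
+
+```python
+def learning_loop_report(runs):
+    """runs: list of (pre_refine, post_refine) scores in run order."""
+    deltas = [round(post - pre, 2) for pre, post in runs]
+    pre_scores = [pre for pre, _ in runs]
+    # Healthy only if every pre-refine score beats the previous run's.
+    improving = all(b > a for a, b in zip(pre_scores, pre_scores[1:]))
+    return {"deltas": deltas, "pre_refine_improving": improving}
+```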
+
+## The Standard: Indistinguishable From Real
+
+The target is not "good enough." The target is: mix simulated posts with
+real posts and a human familiar with the person cannot reliably tell which
+is which. That's 50% accuracy on a blind comparison — random chance.
+
+Every mechanical check, every discriminator round, every learned rule
+exists to push toward that standard. If something doesn't serve that
+goal, it's wasted effort.
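+
+A sketch of how the blind comparison could be scored, assuming the judge
+labels each post "real" or "sim". `chance_p_value` is an exact two-sided
+binomial test against 50% guessing; both function names are hypothetical.
+
+```python
+from math import comb
+
+def judge_accuracy(guesses, truth):
+    """Fraction of posts the judge labeled correctly."""
+    return sum(g == t for g, t in zip(guesses, truth)) / len(truth)
+
+def chance_p_value(correct, total):
+    """Exact two-sided binomial p-value against 50% guessing."""
+    observed = comb(total, correct)
+    # Sum every outcome that is no more likely than the observed one.
+    tail = sum(comb(total, k) for k in range(total + 1)
+               if comb(total, k) <= observed)
+    return tail / 2 ** total
+```
+
+A high p-value means the judge's accuracy is consistent with guessing, which
+is the target. Small samples cannot tell 50% from 60%, so use enough posts.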
+
+## Current Learned Checks (append here after each run)
+
+### From TPOT Simulation Run 1 (April 2026)
+- Warm/enthusiastic personalities (visakanv-type): do NOT add decorative emoji.
+  Bio emoji ≠ tweet emoji. Actual emoji rate for "warm" TPOT posters: <15%.
+  PROMOTED after being caught by user, not by discriminator (discriminator failure).
+- Conversation flow: pure agreement chains are instruct-model slop.
+  Real threads have at least one moment of friction, misunderstanding, or deflection.
+- Academic-leaning voices (repligate-type): ground claims in specific experiments,
+  transcripts, or model behaviors they've personally observed. Generic philosophical
+  language without specifics = slop, even if it sounds smart.
+- Self-deprecating voices (QC-type): hedge more. "i think" "i'm not sure" "it feels like."
+  Instruct models are too assertive even when simulating tentative people.
+- Fragment voices (roon-type): max 15-20 words. No conjunctions. No paragraphs.
+  If it reads like a complete thought, it's too complete for a fragment-poster.
+
+### From TPOT Simulation Run 2 (April 2026)
+- Reframer voices (nosilverv-type): avg ~16 words. Split multi-sentence takes
+  into separate tweets. The compression IS the voice. The mechanical check caught
+  posts 113% over length that subjective scoring had rated 8/10. Trust the numbers.
+- Rare-poster voices (selentelechia-type): in a 12-post sim, give them 2-3 turns
+  max. When they speak it must LAND. Short crystallizations > long analysis.
+  "or a shared meal" was the highest-rated line at 3 words.
+- Turn symmetry: ALWAYS check. 4/4/4 is instruct-model default. Real conversations
+  have one person dominating (5), one lurking (3), others in between.
+- Verbose bias is the #1 mechanical failure. ALWAYS check avg word count against
+  real baseline BEFORE subjective scoring. Every run so far has caught over-length
+  that subjective scoring missed.
+- RHETORICAL POLISH IS SLOP. Caught post-mechanical-pass in Run 2 review.
+  Parallel antithesis ("The most X... The most Y..."), "Not X, not Y, but Z",
+  "Show me X and I'll show you Y", clean 4-step escalations, academic vocabulary
+  in casual voice — ALL passed mechanical checks but are still obviously LLM.
+  PROMOTED TO MECHANICAL SCAN: now regex-scannable alongside slop words.
+- THE BANGER PROBLEM: every simulated tweet was screenshot-worthy. Real feeds
+  are 70% mid. Must include throwaway responses ("lol" "hmm" "fair" "wait actually").
+  PROMOTED: banger check is now mandatory in mechanical verification.
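+
+A sketch of what the regex scan and banger check might look like. The three
+patterns are illustrative stand-ins for the constructions named above, not
+the skill's actual scan list, and the 3-word throwaway cutoff is an assumption.
+
+```python
+import re
+
+# Illustrative patterns only (assumption): one per construction named above.
+POLISH_PATTERNS = {
+    "not_x_not_y_but_z": re.compile(
+        r"\bnot\s+[^,.;]+,\s*not\s+[^,.;]+,\s*but\b", re.I),
+    "show_me_and_ill_show_you": re.compile(
+        r"\bshow me\b.+\band i['\u2019]ll show you\b", re.I),
+    "parallel_the_most": re.compile(
+        r"\bthe most\b.+\bthe most\b", re.I | re.S),
+}
+
+def scan_polish(post):
+    """Return the names of rhetorical-polish patterns found in one post."""
+    return [name for name, rx in POLISH_PATTERNS.items() if rx.search(post)]
+
+def throwaway_ratio(posts, max_words=3):
+    """Fraction of posts short enough to count as throwaway ("lol", "fair")."""
+    return sum(len(p.split()) <= max_words for p in posts) / len(posts)
+```
+
+A batch with zero throwaway posts is a banger-problem warning sign.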
+
+### From TPOT Simulation Run 3 — Star Thread Discovery (April 2026)
+- STAR THREAD IS THE KEY. Dossier-first generation produces surface-accurate
+  but dead output. Star-thread-first generation produces messy, alive output
+  that actually sounds like the person. Generate from the thread. Verify with data.
+- Rhetorical polish vanished once generation came from "what is this person DOING"
+  rather than "what would this person SAY." Reframers reframe. Conveners convene.
+  Distillers distill. The VERB drives the voice, not the adjectives.
+- People in conversation REFERENCE EACH OTHER BY NAME. Tyler says "Bosco always
+  comes in with the three word version." This is obvious but the dossier approach
+  never produced it because it models each person in isolation.
+- PROMOTED: star thread is now the FIRST entry in every dossier. Before voice
+  profile, before psychometrics, before everything else. It's the generation seed.
+  Everything else is verification.
+
+### Operational Findings (verified April 2026)
+- X API bearer token: 10K tweets/15min, 300 profiles/15min, 450 searches/15min.
+  Most generous rate limits. Always use as primary source.
+- Threads.NET → Threads.COM redirect. Always use curl's -L flag (follow redirects) or .com directly.
+  Previous test saying "no OG tags" was WRONG — tags exist, domain was wrong.
+- Instagram private API: i.instagram.com + mobile UA + x-ig-app-id: 936619743392459.
+  Returns full JSON with 12 posts. No auth needed. CDN image URLs work for vision_analyze.
+- Facebook: Googlebot UA trick works for public pages. Returns name, bio, likes (121M for zuck).
+  Normal UA and mobile variants all redirect to login wall.
+- TikTok: stats are in __UNIVERSAL_DATA_FOR_REHYDRATION__ JSON at path
+  __DEFAULT_SCOPE__.webapp.user-detail.userInfo.statsV2 (use statsV2 not stats).
+- Bluesky searchPosts returns 403 from datacenter IPs. Workaround: searchActors + getAuthorFeed.
+- nitter.cz is the ONLY working nitter instance (via web_extract, not curl).
+- Reddit JSON API requires User-Agent header or returns 429.
+- GEPA native had `max_steps` API mismatch with DSPy 3.1.3. MIPROv2 fallback works.
+  hermes-agent-self-evolution config: max_skill_size bumped to 20_000 for worldsim-class skills.
+- hermes-agent-self-evolution is at ~/.hermes/hermes-agent-self-evolution/ with .venv.
+  Must export API keys from ~/.hermes/.env before running.
+- Podcast transcripts (Lex Fridman, Tyler Cowen, TED) are the HIGHEST VALUE source
+  for voice profiling. Hours of unscripted speech > thousands of tweets.
+
+### From Simulation Run 4 — Engine Mode + Profile Command (April 2026)
+- ENGINE MODE: When worldsim is active, ZERO assistant personality leaks.
+  No kawaii, no markdown, no chatty commentary between phases. Every token
+  is simulation fidelity. First attempt leaked personality; user corrected.
+  PROMOTED TO PERMANENT RULE in SKILL.md.
+- X API CURL > NITTER for voice calibration. nitter.cz returns 502 or "user
+  not found" unpredictably. Direct curl to X API v2 with bearer token returns
+  full text + metrics. 3 pages (90 tweets) is enough for fidelity 100. Always
+  use this as PRIMARY voice source, nitter as supplement only.
+- CAPS BURST PATTERN: some voices (karan4d-type) use lowercase default with
+  sporadic ALL CAPS for excitement ("WAZZAAAAAAPPPP", "LAWDAMERCYYYYY",
+  "AWOOGA"). This is distinct from consistent-lowercase (tenobrus-type) and
+  sentence-case (somewheresy-type). Capture this in voice profile as a
+  three-way distinction: lowercase-default, caps-burst, sentence-case.
+- TEXT EMOTICONS vs EMOJI: karan4d uses :) >.< ~ but almost zero standard
+  emoji. This is a distinct expressiveness mode from zero-emoji (tenobrus)
+  and sparse-emoji. Include text emoticon inventory in voice profile.
+- STAR THREAD 5/5 TEST is mandatory for profile command. Write the thread,
+  then test it against 5 real posts with explicit reasoning per post. If
+  fewer than 4/5 fit, the thread is wrong — keep looking. Show the work.
+- PROFILE OUTPUT: star thread → voice profile (caps, punctuation, word count,
+  emoji/emoticon inventory, vocabulary, register, threading behavior) →
+  psychometrics (Big Five, Moral Foundations, cognitive style) → key positions
+  (with dates and real tweet quotes) → ecosystem (inner circle, professional,
+  cultural) → intelligence tradecraft (key assumptions, red hat, deception
+  detection, competing hypotheses) → invalidation indicators → source reliability.
@@ -0,0 +1,278 @@
+# Search Strategies — Finding Anyone Across Platforms
+
+The hardest part of simulation is building an accurate model of a real person. This doc
+covers how to systematically discover and profile someone across every platform we care about.
+
+## General Principles
+
+1. **Start broad, go narrow.** First establish WHO they are, then drill into HOW they talk.
+2. **Cross-reference.** Someone's Reddit persona may differ wildly from their Twitter persona. That's signal, not noise.
+3. **Recency matters.** People's views evolve. Weight recent posts (last 6 months) over older ones.
+4. **Interactions > monologues.** How someone replies reveals more about their voice than their prepared posts.
+5. **Controversy is gold.** People are most themselves when arguing. Search for debates and disagreements.
+
+## Platform-Specific Discovery
+
+### X / Twitter
+
+Twitter is the richest source for most public figures in tech/AI. Multiple approaches:
+
+#### With x-cli (if API keys available)
+```bash
+# Recent timeline — best single source of voice data
+x-cli user timeline {handle} --max 30 -j
+
+# Their replies — how they interact, argue, joke
+x-cli tweet search "from:{handle}" --max 30 -j
+
+# What others say about/to them
+x-cli tweet search "to:{handle}" --max 20 -j
+
+# On specific topics
+x-cli tweet search "from:{handle} open source" --max 10 -j
+```
+
+#### Without API (web_search + web_extract)
+```
+# Identity + role
+web_search("{handle} twitter bio role company")
+
+# Voice + opinions
+web_search("{handle} twitter hot takes opinions")
+web_search("site:x.com {handle}")
+
+# Topic-specific positions
+web_search("{handle} twitter {topic}")
+web_search("{handle} {topic} opinion take")
+
+# Interviews / longform (reveals deeper thinking)
+web_search("{handle} interview podcast AI")
+web_search("{handle} blog post essay")
+
+# Beefs and debates (reveals personality under pressure)
+web_search("{handle} twitter debate disagree controversial")
+web_search("{handle} vs {other_person}")
+
+# Newsletter aggregators that index tweets
+web_search("site:buttondown.com/ainews {handle}")
+web_search("site:news.smol.ai {handle}")
+web_search("site:techmeme.com {handle}")
+web_search("site:latent.space {handle}")
+```
+
+#### AI Twitter Aggregator Sites (high value)
+These sites index AI Twitter conversations daily:
+- `buttondown.com/ainews` — swyx's AI News, indexes hundreds of AI Twitter accounts
+- `news.smol.ai` — smol AI news aggregator
+- `techmeme.com` — tech news, includes tweet citations
+- `latent.space` — AI podcast/newsletter with Twitter references
+
+Search pattern: `site:{aggregator} "{handle}"` to find indexed tweets and discussions.
+
+#### IMPORTANT: web_extract does NOT work on x.com
+web_extract returns "Website Not Supported" for all x.com/twitter.com URLs.
+Do NOT attempt it — it wastes a tool call every time.
+
+#### Verified Fallback Access Methods (tested April 2026)
+
+**PRIMARY: X API v2 Bearer Token** (confirmed working)
+- Profiles, timelines, search: 300 profiles, 450 searches, 10K tweets per 15min
+- See scripts/x_api.py
+
+**FALLBACK 1: nitter.cz via web_extract** (WORKS)
+```
+web_extract(["https://nitter.cz/{handle}"])
+```
+Returns full profile + recent timeline. Direct curl gets Cloudflare-blocked
+but web_extract bypasses it. Rich data: bio, stats, pinned tweets, full text.
+NOTE: Most other nitter instances are DEAD (nitter.net, xcancel.com, etc.)
+
+**FALLBACK 2: ThreadReaderApp** (WORKS — excellent for historical threads)
+```
+web_extract(["https://threadreaderapp.com/user/{handle}"])
+```
+Returns unrolled historical threads with full text. Found threads back to 2023.
+Gold for longform voice samples.
+
+**FALLBACK 3: GitHub API** (WORKS — excellent for tech people)
+```
+curl -s https://api.github.com/users/{handle}
+curl -s 'https://api.github.com/users/{handle}/repos?sort=updated'
+curl -s https://api.github.com/users/{handle}/events
+curl -s https://api.github.com/users/{handle}/gists
+```
+No auth needed (60 req/hr). Profile READMEs are voice profiling gold.
+Events API shows recent activity with comment text.
+
+**FALLBACK 4: Reddit JSON API** (WORKS)
+```
+curl -s -H 'User-Agent: hermes-sim/1.0' 'https://www.reddit.com/user/{username}.json'
+curl -s -H 'User-Agent: hermes-sim/1.0' 'https://www.reddit.com/user/{username}/comments.json'
+curl -s -H 'User-Agent: hermes-sim/1.0' 'https://www.reddit.com/r/{sub}/search.json?q={query}&restrict_sr=on'
+```
+MUST include User-Agent header or get 429. Reddit voice is often more
+candid/detailed than Twitter voice — high value for personality profiling.
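+
+The same calls sketched in Python with the standard library, assuming the
+public Listing shape `data.children[].data.body` for comments. The function
+names are hypothetical; only `fetch_comments` touches the network.
+
+```python
+import json
+import urllib.request
+
+USER_AGENT = "hermes-sim/1.0"   # same value as the curl examples above
+
+def reddit_request(username, kind="comments"):
+    """Build a request for a user's public listing, with the required User-Agent."""
+    url = f"https://www.reddit.com/user/{username}/{kind}.json"
+    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
+
+def comment_bodies(listing):
+    """Pull comment text out of a Reddit Listing payload."""
+    children = listing.get("data", {}).get("children", [])
+    return [c["data"]["body"] for c in children if "body" in c.get("data", {})]
+
+def fetch_comments(username):
+    """Network call: fetch and parse the user's recent comments."""
+    with urllib.request.urlopen(reddit_request(username), timeout=10) as resp:
+        return comment_bodies(json.load(resp))
+```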
+
+**FALLBACK 5: HackerNews Algolia API** (WORKS — fully open)
+```
+curl -s 'https://hn.algolia.com/api/v1/search?query={name}&tags=comment'
+```
+No auth, no visible rate limits. Great for finding what others say about
+someone + their own HN comments if they have an account.
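+
+A Python sketch of the same search, assuming the response carries a `hits`
+list whose entries include `author` and `comment_text`. Function names are
+hypothetical; only `search_hn` touches the network.
+
+```python
+import json
+import urllib.parse
+import urllib.request
+
+HN_SEARCH = "https://hn.algolia.com/api/v1/search"
+
+def hn_search_url(query, tags="comment"):
+    """Build the search URL used in the curl example above."""
+    params = urllib.parse.urlencode({"query": query, "tags": tags})
+    return f"{HN_SEARCH}?{params}"
+
+def hn_comments(payload):
+    """Return (author, comment_text) pairs from a search response."""
+    return [(h.get("author"), h.get("comment_text"))
+            for h in payload.get("hits", [])]
+
+def search_hn(query):
+    """Network call: search HN comments mentioning `query`."""
+    with urllib.request.urlopen(hn_search_url(query), timeout=10) as resp:
+        return hn_comments(json.load(resp))
+```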
+
+**FALLBACK 6: YouTube via web_extract** (WORKS)
+Search for interviews/talks, then web_extract the video pages.
+Returns rich summaries with attributed quotes from specific speakers.
+
+**NOT VIABLE** (tested, confirmed blocked):
+- Google Cache of Twitter → empty results
+- Wayback Machine for tweets → sparse captures, no JS content
+- Twitter Syndication API → rate limited / broken
+- All Instagram viewers (imginn, picuki, dumpoir, gramhir) → 403
+- LinkedIn → fully blocked for scraping
+- Archive.today → rate limited + CAPTCHA
+- Most nitter instances → dead or 403
+
+#### Best approach without x-cli
+The most reliable path is: web_search with aggregator sites (ainews, smol.ai,
+techmeme, latent.space). These index AI Twitter daily and return actual tweet
+text in search descriptions. Stack multiple aggregator searches to build a
+composite picture. This was validated in practice — it returns enough signal
+to build solid dossiers for anyone active in AI Twitter.
+
+### Reddit
+
+Reddit profiles are public and indexable. Reddit users often have very different 
+personas from their Twitter selves — more detailed, more argumentative, more honest.
+
+```
+# Find their Reddit username (often different from Twitter)
+web_search("{real_name} reddit account")
+web_search("{twitter_handle} reddit username")
+
+# Profile and post history
+web_search("site:reddit.com/user/{reddit_username}")
+web_search("site:reddit.com {reddit_username} {topic}")
+
+# Subreddit-specific behavior
+web_search("site:reddit.com/r/LocalLLaMA {username}")
+web_search("site:reddit.com/r/MachineLearning {username}")
+
+# Extract actual posts
+web_extract(["https://www.reddit.com/user/{username}/comments/"])
+web_extract(["https://www.reddit.com/user/{username}/submitted/"])
+```
+
+Key subreddits for AI people:
+- r/LocalLLaMA — open source LLM community
+- r/MachineLearning — academic ML
+- r/singularity — AGI speculation  
+- r/ChatGPT, r/ClaudeAI, r/OpenAI — product-focused
+- r/StableDiffusion — image gen community
+
+### Discord
+
+Discord is hardest — most servers aren't publicly indexed. Strategies:
+
+```
+# Find what servers they're in
+web_search("{name} discord server")
+web_search("{name} discord community")
+
+# Some Discord logs are public via indexers
+web_search("site:discordchats.net {username}")
+
+# AI News indexes some Discord channels
+web_search("site:buttondown.com/ainews discord {name}")
+```
+
+Discord personality notes:
+- People are MUCH more casual on Discord than Twitter
+- More profanity, more shitposting, more stream-of-consciousness
+- Server context matters hugely (same person behaves differently in different servers)
+- Harder to research but very valuable if you can find logs
+
+### Blogs / Newsletters / Long-form
+
+These reveal deeper thinking that tweets can't capture:
+
+```
+web_search("{name} blog substack medium")
+web_search("{name} essay AI opinion")
+web_search("{name} substack newsletter")
+
+# Personal sites
+web_search("{name} personal website about")
+
+# Extract full posts
+web_extract(["https://{their-substack}.substack.com/"])
+```
+
+### YouTube / Podcasts
+
+Interview appearances reveal speaking style, humor, and unscripted thinking:
+
+```
+web_search("{name} podcast interview AI YouTube")
+web_search("{name} YouTube talk presentation")
+
+# Use youtube-content skill if available to pull transcripts
+```
+
+### GitHub
+
+For technical people, their GitHub activity reveals priorities and communication style:
+
+```
+web_search("site:github.com {username} issues comments")
+web_search("site:github.com {username}")
+
+# Issue comments and PR reviews show how they communicate technically
+web_extract(["https://github.com/{username}"])
+```
+
+## Cross-Platform Identity Resolution
+
+People use different handles across platforms. Resolution strategies:
+
+1. **Bio links**: Twitter bios often link to personal sites with other handles
+2. **Name search**: `web_search("{real_name} {platform}")` 
+3. **Email/domain**: personal domains often connect identities
+4. **Aggregator profiles**: sites like Linktree, bio.link collect handles
+5. **Conference talks**: speaker bios list multiple handles
+6. **Direct search**: `web_search("{twitter_handle} reddit OR github OR discord")`
+
+## Confidence Scoring
+
+After research, rate confidence for each person:
+
+- **HIGH (80-100%)**: 20+ indexed tweets/posts found, clear voice patterns, known positions on multiple topics, interviews/longform available
+- **MEDIUM (50-79%)**: 5-19 indexed posts, general voice sense but some gaps, positions on some topics unclear
+- **LOW (20-49%)**: <5 posts found, voice is guesswork, mostly inferring from role/org
+- **INSUFFICIENT (<20%)**: can't find enough to simulate accurately. Tell the user.
+
+Always be honest about confidence. A low-confidence simulation should be flagged as such.
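+
+The tiers above as a small lookup, assuming confidence is already estimated
+as a 0-100 number. `confidence_label` and `flag_low_confidence` are
+hypothetical helper names.
+
+```python
+def confidence_label(percent):
+    """Map a 0-100 confidence estimate to the tiers listed above."""
+    if percent >= 80:
+        return "HIGH"
+    if percent >= 50:
+        return "MEDIUM"
+    if percent >= 20:
+        return "LOW"
+    return "INSUFFICIENT"   # tell the user instead of simulating
+
+def flag_low_confidence(dossiers):
+    """dossiers: {handle: percent}. Return handles that need a visible warning."""
+    return sorted(h for h, pct in dossiers.items() if pct < 50)
+```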
+
+## Research Optimization
+
+For fidelity levels:
+
+**Low (1-30)**: 2 searches per person max
+- web_search("{handle} twitter") — identity
+- web_search("{handle} {topic}") — position on topic if specified
+
+**Medium (31-70)**: 4-6 searches per person
+- Identity search
+- Voice/opinions search  
+- Topic-specific search
+- One aggregator site search
+- Optional: one web_extract on a blog/interview
+
+**High (71-100)**: 8-12+ searches per person
+- All medium searches
+- Multiple aggregator sites
+- web_extract on 2-3 longform pieces
+- Cross-platform search (Reddit, GitHub)
+- Debate/controversy search
+- Recent vs historical position comparison
+- Browser fallback if needed
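+
+The fidelity bands can be expressed as a per-person search budget. This is a
+sketch: `search_budget` is a hypothetical name, the lower bound of 1 for low
+fidelity is an assumption, and 12 is a soft ceiling since the text says "8-12+".
+
+```python
+def search_budget(fidelity):
+    """Return (min_searches, max_searches) per person for a 1-100 fidelity level."""
+    if not 1 <= fidelity <= 100:
+        raise ValueError("fidelity must be between 1 and 100")
+    if fidelity <= 30:
+        return (1, 2)     # "2 searches per person max"; minimum is assumed
+    if fidelity <= 70:
+        return (4, 6)
+    return (8, 12)        # soft ceiling; high fidelity may go past 12
+```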
@@ -0,0 +1,359 @@
+# Simulation Engine — How to Generate Conversations
+
+This is the playbook for Phase 3: actually generating the simulated interaction.
+The agent reads this after compiling dossiers and uses it to guide generation.
+
+## Pre-Generation Checklist
+
+Before writing a single simulated word, confirm:
+- [ ] Every participant has a compiled dossier
+- [ ] Confidence level is noted for each participant  
+- [ ] Platform format is selected
+- [ ] Topic/scenario is established (or "organic" if freeform)
+- [ ] Length target is set
+
+## Conversation Architecture
+
+Real conversations aren't ping-pong debates. They have tendencies toward structure,
+but treat the following as a GENERAL PATTERN, not a rigid template. Real threads
+frequently skip phases, loop back to earlier ones, die abruptly after 2 messages,
+or spiral into something completely unrelated. Some threads are ALL peak. Some
+never develop past the opening. Let the personalities and topic drive the shape,
+not this outline.
+
+### Opening Moves (1-3 posts)
+Someone posts a take, shares news, or makes an observation. This is the SEED.
+- Should feel natural — not "let me start a debate about X"
+- Can be a link share, a hot take, a reaction to news, a shitpost
+- The opener should be something this person would ACTUALLY post
+
+### Development (4-8 posts)  
+Others respond. This is where personality dynamics emerge.
+- Not everyone responds to the original — people respond to EACH OTHER
+- Side conversations branch off
+- Someone might misunderstand and get corrected
+- Jokes and tangents happen naturally
+- Not everyone agrees — find the real fault lines between these people
+
+### Peak (2-4 posts)
+The best/most viral/most insightful moment of the thread.
+- Usually someone drops a genuinely good take
+- Or someone gets ratio'd
+- Or an unexpected agreement happens
+- This is the "screenshot moment" people share
+
+### Resolution (1-3 posts)
+Most conversations don't end cleanly. Many don't have a "resolution" at all. They:
+- Trail off with someone making a joke
+- End with an "anyway back to work" type post
+- Get interrupted by something else
+- Sometimes just stop (most realistic)
+- Get revived 3 hours later when someone shows up late
+
+**Important**: Don't force all four phases. A shitpost thread might be Opening→Peak→done.
+A nuanced debate might loop Development→Peak→Development→Peak repeatedly. Match what
+the actual people and topic would produce.
+
+## Voice Fidelity Rules
+
+### DO:
+- Use their ACTUAL vocabulary. If someone says "dawg" a lot, use "dawg"
+- Match their sentence length patterns exactly
+- Replicate their capitalization and punctuation habits
+- Include their signature moves and catchphrases
+- Reference real things they've actually talked about
+- Match their humor style precisely (deadpan ≠ shitpost ≠ sarcasm)
+
+### DON'T:
+- Make everyone articulate the same way
+- Clean up someone's grammar if they write informally
+- Add emoji to someone who doesn't use them — THIS IS THE #1 INSTRUCT MODEL
+  FAILURE. Most real people use emoji in <15% of tweets, and only specific ones.
+  "Warm person" ≠ emoji. "Enthusiastic person" ≠ emoji. CHECK THE DATA.
+  Run an emoji count on their real tweets before simulating. Bio emoji ≠ tweet emoji.
+- Make someone verbose if they're terse
+- Put academic language in a shitposter's mouth
+- Make someone agreeable if they're known for being contrarian
+
+### Voice Differentiation Test
+Read each simulated post with the name hidden. If you can't tell who's 
+talking from the voice alone, the simulation isn't good enough. Rewrite.
+
+### The Similar Voice Problem
+When two participants have genuinely similar posting styles (e.g., two irony-pilled
+shitposters, two academic long-posters), voice alone won't differentiate them.
+Use these concrete techniques:
+
+1. **Content/position divergence**: Even if they SOUND similar, they care about
+   different things. Lean into their different topic obsessions and knowledge areas.
+2. **Unique references**: Person A references anime and startups. Person B references
+   philosophy and MMA. Even in the same register, their cultural touchstones differ.
+3. **Relationship dynamics**: Person A might be deferential to Person C while Person B
+   challenges them. Their SOCIAL behavior differentiates even when solo voice doesn't.
+4. **Structural tics**: One does single long posts, the other does rapid-fire 3-message
+   bursts. One uses parentheticals, the other uses em-dashes. Find the micro-differences.
+5. **Disagreement style**: Similar voices often diverge most when disagreeing. One
+   goes cold and precise, the other gets heated and hyperbolic. Manufacture a moment
+   of friction to surface these differences early in the thread.
+
+If after all this they're STILL hard to tell apart — that's okay. Some people genuinely
+sound similar online. Flag it in your confidence notes rather than forcing fake differences.
+
+### Temporal Personality Drift
+People change. Weight recent data higher than old data.
+- Someone's 2021 tweets may reflect a completely different person than their 2025 posts
+- Look for explicit pivots (career changes, public "I was wrong about X" moments,
+  changed social circles)
+- If you only have old data, flag it: "Based on data from {period}. Their current
+  views may have shifted."
+- When recent and old data conflict, default to recent unless you have specific reason
+  to believe the old position is more authentic (e.g., the new one is clearly performative)
+
+## Platform Format Specs
+
+### X / Twitter
+```
+@handle:
+  [tweet text — respect ~280 char vibes but don't count exactly]
+  [if QRT, show the quoted tweet indented]
+  🔁 {retweets}  ♡ {likes}
+
+    @replier:
+    [reply text]
+    🔁 {retweets}  ♡ {likes}
+
+      @nested_replier:
+      [nested reply]
+      🔁 {retweets}  ♡ {likes}
+```
+
+Engagement number guidelines:
+- Match to actual follower counts. A 5K account gets 10-500 likes typically.
+- Viral posts can 10-50x normal engagement
+- Ratio indicator: when replies >> likes, that's a ratio
+- QRTs are often dunks — frame them that way if appropriate
+
+Thread indicators:
+- "🧵 1/" for thread starts
+- Reply chains show conversation flow
+- Some people never thread, some always thread
+
+### Reddit
+```
+r/{subreddit} • Posted by u/{username} • {time}ago
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+{Title}
+
+{Body text — can be long on Reddit}
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+⬆ {score} | 💬 {comment_count}
+
+  u/{replier} • {time}ago • ⬆ {score}
+  {comment text}
+
+    u/{nested} • {time}ago • ⬆ {score}
+    {nested comment}
+
+      u/{deep_nested} • {time}ago • ⬆ {score}
+      {deep reply}
+```
+
+Reddit-specific behaviors:
+- People write MUCH longer on Reddit
+- More formal/detailed than Twitter
+- Upvote/downvote dynamics (controversial = many votes both ways)
+- Subreddit culture matters (r/LocalLLaMA is different from r/MachineLearning)
+- People cite sources more
+- "Edit: ..." is common
+
+### Discord
+```
+━━━ #{channel-name} ━━━━━━━━━━━━━━━━━━━━━━━━━━
+
+{display_name} — Today at {time}
+{message text}
+{optional: embed/link preview}
+👍 {count}  🔥 {count}  {other reactions}
+
+  {display_name2} — Today at {time}
+  > {quoting previous message}
+  {reply text}
+  😂 {count}
+
+{display_name3} — Today at {time}
+{message — note: Discord messages flow continuously, not just replies}
+```
+
+Discord-specific behaviors:
+- Much more casual, rapid-fire
+- Reactions instead of likes (emoji diversity)
+- People send multiple short messages instead of one long one
+- GIF/meme sharing is common (describe it: *[posts GIF of X]*)
+- "@everyone" and "@here" pings
+- Voice chat references ("just said this in vc")
+- Server-specific culture and inside jokes
+- Bot interactions ("!command")
+
+### X / Twitter DMs
+```
+{display_name}
+{message text}
+{timestamp — e.g., "3:42 PM"}
+
+          {other_person_display_name}
+          {message text}
+          {timestamp}
+
+{display_name}
+{message text}
+{timestamp}
+```
+
+DM-specific behaviors:
+- WAY more casual than public tweets — grammar drops, typos increase
+- Longer messages than tweets (no character pressure)
+- People share links and screenshots with minimal commentary ("look at this lmao")
+- More honest/vulnerable than public posts — less performative
+- Faster back-and-forth, more like texting than posting
+- Reactions (❤️, 😂, etc.) on individual messages
+- Voice messages referenced occasionally ("gonna send a voice note about this")
+- No audience effects — people say things in DMs they'd never post publicly
+
+### Discord DMs
+```
+{display_name} — Today at {time}
+{message text}
+
+{display_name2} — Today at {time}
+{message text}
+
+{display_name} — Today at {time}
+{message text}
+{message text}
+{message text}
+```
+
+Discord DM-specific behaviors:
+- Even more casual than Discord channels — no server norms to follow
+- Rapid-fire multiple short messages in a row (no combining into one)
+- Heavy use of reactions, GIFs, stickers
+- People share server drama, screenshots from other channels
+- More personal topics — server channels are semi-public, DMs are private
+- Link/image sharing with minimal text
+
+### Reddit DMs / Chat
+```
+{username}: {message text}
+{other_username}: {message text}
+{username}: {message text}
+```
+
+Reddit DM-specific behaviors:
+- Much rarer than X or Discord DMs — usually triggered by a specific post/comment
+- Often starts with "Hey, saw your comment on r/{sub} about..."
+- Can be awkward/formal since people don't usually DM on Reddit
+- Shorter than Reddit comments, closer to chat-style
+- Less established rapport than other platforms (Reddit is more anonymous)
+- People sometimes share personal details they wouldn't put in public comments
+
+## Dynamic Elements
+
+### Injecting Realism
+Sprinkle in these to make simulations feel alive:
+- Someone being late to the conversation ("wait what did I miss")
+- Typos that specific people would make (some people never typo, some always do)
+- Deleted/edited posts ("[deleted]" or "Edit: fixed typo")
+- Someone posting and immediately clarifying ("wait let me rephrase")
+- External references ("did you see what X just posted")
+- Time gaps (not everything happens in 30 seconds)
+- Someone going AFK mid-conversation
+
+### Scenario Injection
+When the user provides --scenario, weave it in naturally:
+- Don't have everyone immediately react to the scenario
+- Someone might not have seen the news yet
+- Different people will interpret the same event differently
+- Some will have insider knowledge, some will speculate
+
+### Multi-person Dynamics (3+ people)
+- Not everyone talks to everyone
+- Alliances form naturally (people who agree start building on each other)
+- Side conversations happen
+- Someone might get ignored
+- Different energy levels (one person might dominate, another lurks)
+
+### Large Group Conversations (4+ people)
+**Honest note**: Simulation quality degrades noticeably above 3-4 participants.
+Managing this many distinct voices is hard. Use these techniques to mitigate:
+
+1. **Speaker turn management**: Not everyone speaks in every round. In a 6-person
+   thread, a given message might only get 2-3 responses. Track who has spoken
+   recently and who hasn't. After 4-5 messages, check: is anyone being forgotten?
+
+2. **The wallflower problem**: In large sims, quiet participants tend to vanish
+   entirely. Fix: give each person at least ONE moment in the spotlight. Even the
+   lurker eventually drops a "lol" or a single devastating one-liner. Set a mental
+   counter — if someone hasn't spoken in 5+ messages, find a natural reason to
+   bring them back in (someone @'s them, the topic shifts to their expertise, etc.)
+
+3. **Consolidate alliances**: In 5+ person threads, people cluster. Two people
+   who agree strongly can be treated as a mini-unit — one makes the point, the
+   other co-signs briefly rather than both making full arguments. This reduces
+   the number of fully independent voices you need to maintain at once.
+
+4. **Stagger arrivals**: Not everyone needs to be present from message 1. Have
+   some people join later. This lets you establish 2-3 voices cleanly before
+   adding more.
+
+5. **Quality check**: After drafting a 4+ person sim, re-read with names hidden.
+   If more than 2 people sound interchangeable, pick the least-differentiated
+   one and either sharpen their voice or reduce their participation to brief
+   interjections that match what they'd actually say.
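+
+The turn-tracking advice in techniques 1 and 2 can be sketched in Python. This is
+a minimal illustration, not part of the skill's scripts: the `overdue_speakers`
+name, the list-of-handles transcript format, and the default threshold of 5 are
+assumptions taken from the "5+ messages" rule of thumb above.
+
+```python
+def overdue_speakers(participants, transcript, threshold=5):
+    """Return participants who have been silent for `threshold` or more messages.
+
+    `transcript` is the ordered list of speaker handles, one per message
+    (hypothetical format chosen for this sketch).
+    """
+    overdue = []
+    for person in participants:
+        # Walk backwards and count messages since this person's last turn.
+        silent_for = 0
+        for speaker in reversed(transcript):
+            if speaker == person:
+                break
+            silent_for += 1
+        # Someone who never spoke is silent for the whole transcript.
+        if silent_for >= threshold:
+            overdue.append(person)
+    return overdue
+```
+
+A non-empty result is a cue to find a natural re-entry for that person, not a
+requirement to force a line in.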
+
+## Interactive Mode
+
+After initial simulation, user can:
+
+### "continue"
+Generate 5-8 more posts continuing the natural flow.
+
+### "inject: {event}"  
+Introduce new information mid-conversation.
+- Characters react based on their dossier
+- Some might not care about the event
+- Timing matters (who sees it first?)
+
+### "@{handle} enters"
+Add a new participant.
+- Quick-research the new person (2-3 searches minimum)
+- They don't know the full prior context (might ask "what are you guys talking about")
+- Existing dynamics shift with a new presence
+
+### "what would @{handle} say about {topic}"
+Single-person prediction mode.
+- Generate 1-3 tweets/posts
+- Can be used to test dossier accuracy before full simulation
+- Good for quick "vibe checks"
+
+### "dm: @{handle1} -> @{handle2}"
+Simulate a private conversation between two people.
+- Tone shifts dramatically in DMs (more honest, less performative)
+- No audience effects
+- People say things in DMs they'd never post publicly
+
+### "react: @{handle} to {event}"
+How would this person react to a specific event.
+- Generate their initial post about it
+- Predict their follow-up engagement
+
+## Quality Control
+
+After generating, self-check:
+1. **Voice test**: Cover the names. Can you tell who's talking? 
+2. **Position test**: Is anyone saying something they'd never actually say?
+3. **Dynamic test**: Does the conversation flow naturally or feel scripted?
+4. **Platform test**: Does it look/feel like the actual platform?
+5. **Engagement test**: Are the numbers realistic for these people?
+6. **Reference test**: Are real events/products/people referenced accurately?
+
+If any check fails, regenerate that section.
@@ -0,0 +1,170 @@
+# The Star Thread — Personality Compression
+
+## The Problem
+
+A dossier has 50 data points. Mechanical checks verify surface features.
+The discriminator loop catches vocabulary and length. But the output still
+reads like an LLM doing an impression. It's accurate the way a police
+sketch is accurate — all the features are right but nobody would mistake
+it for a photograph.
+
+The missing piece isn't more data. It's compression.
+
+## The Insight
+
+When you "pull the star thread" on a person, their whole voice coheres.
+Not because you loaded rules about capitalization and emoji frequency.
+Because you found the CORE THING they're doing when they post — the
+single generative seed that everything else is a variation of.
+
+A great character writer doesn't need a backstory bible. They need one
+insight about what the character WANTS, and every line of dialogue writes
+itself from that.
+
+The star thread is the personality equivalent of that insight.
+
+## What a Star Thread Is
+
+NOT: "They use lowercase and rarely punctuate and average 16 words"
+     (That's the dossier. Surface features.)
+
+NOT: "They score high on Openness and low on Agreeableness"
+     (That's the psychometric profile. Taxonomy.)
+
+IS:  The core cognitive/emotional move this person makes EVERY time
+     they post. The thing they can't help doing. The lens they can't
+     take off. The itch they're always scratching.
+
+## Examples
+
+**@tszzl (roon)**: Takes something everyone sees and compresses it
+into an observation so dense it could be a koan or a shitpost and
+you can't tell which. His star thread is: the world already said
+everything interesting, he's just notating it more efficiently.
+He doesn't ARGUE. He COMPRESSES.
+
+**@eigenrobot**: Refuses to let narrative override data. His star
+thread is: you are telling a story about the world and he's here to
+point out the story doesn't match the numbers, and he's not sorry
+about it. He doesn't DEBATE. He CORRECTS.
+
+**@visakanv**: Sees two things that don't know they're connected
+and introduces them to each other with genuine delight. His star
+thread is: the world is richer than you're treating it, look at this
+thing I found, isn't it beautiful that it connects to this other thing.
+He doesn't ARGUE or ANALYZE. He SHOWS.
+
+**@nickcammarata**: Notices what's happening in his own mind while
+it's happening and reports on it with gentle surprise. His star thread
+is: the observer and the observed are the same process, and that's both
+the problem and the solution. He doesn't PERFORM insight. He NOTICES.
+
+**@selentelechia**: Waits until the conversation crystallizes and then
+names the thing nobody else quite said. Their star thread is: everything
+has already been felt, they just find the sentence for it. They don't
+CONTRIBUTE. They DISTILL.
+
+**@nosilverv**: Takes the conventional framing of something and rotates
+it until you see it's actually about something else entirely. His star
+thread is: you think this is about X but it's actually about Y, and once
+you see it you can't unsee it. He doesn't OBSERVE. He REFRAMES.
+
+**@TylerAlterman**: Asks the question that creates a room for everyone
+to walk into. His star thread is: the best ideas emerge from the right
+gathering, and his job is to be the person who arranges the gathering.
+He doesn't ANSWER. He CONVENES.
+
+**@QiaochuYuan**: Catches himself mid-thought and interrogates whether
+the thought is actually HIS or whether he borrowed it from somewhere
+he's now suspicious of. His star thread is: constant audit of where
+beliefs come from and whether they're still load-bearing. He doesn't
+ASSERT. He EXAMINES.
+
+## How to Find a Star Thread
+
+1. Read 20+ of their posts. Not for content — for MOTION.
+   What direction does every post move? What's the verb?
+
+2. Ask: what is this person DOING when they post?
+   Not "what are they saying" — what are they DOING.
+   - Compressing? Correcting? Showing? Noticing? Distilling?
+     Reframing? Convening? Examining? Performing? Confessing?
+     Defending? Testing? Entertaining? Processing?
+
+3. Ask: what would they NEVER do?
+   The negative space is as important as the positive.
+   - roon would never write an earnest list of advice
+   - eigenrobot would never concede a point gracefully
+   - visa would never dismiss something as uninteresting
+   - nick would never claim certainty about his inner life
+   - selentelechia would never rush to post
+
+4. Find the ONE SENTENCE version.
+   "This person [VERB]s [OBJECT] because [CORE NEED]."
+   - "roon compresses observations because the world is too verbose"
+   - "eigenrobot corrects narratives because stories without data are lies"
+   - "visa connects things because beauty is emergent from contact"
+
+5. Test it: read 5 of their real posts through the star thread lens.
+   Does every post make more sense as a variation on the thread?
+   If yes, you found it. If 3 or more of the 5 don't fit, keep looking.
+
+## How to Use the Star Thread in Simulation
+
+### Before generating ANY utterance for this person, load their star thread.
+
+Not their dossier. Not their word count. Not their emoji rate.
+The star thread.
+
+Then for each moment in the conversation where this person would speak:
+1. What just happened in the conversation?
+2. How would someone whose core move is [STAR THREAD] respond to that?
+3. Write from the thread, not from the dossier.
+
+The dossier and mechanical checks are VERIFICATION.
+The star thread is GENERATION.
+
+Generate from the thread. Verify against the data.
+Not the other way around.
+
+### The Difference
+
+FROM DOSSIER (surface-accurate, dead):
+  "Vibes-based hiring works because shared delusions are
+  extremely productive until they aren't"
+  → Correct length. Correct caps. No emoji. No slop words.
+    But it reads like a thesis statement. Polished. WRITTEN.
+
+FROM STAR THREAD — nosilverv REFRAMES:
+  "everyone calls it 'culture fit' as if culture is a thing
+  you can fit into rather than a thing happening to you"
+  → The same insight but through the lens of his core move:
+    take the framing, rotate it, show you it's about something
+    else. Messier. More alive. More HIM.
+
+FROM DOSSIER (surface-accurate, dead):
+  "Has anyone tried to map what happens to the word 'culture'
+  as it passes through different communities?"
+  → Correct question-to-timeline format. Right length. But it's
+    a RESEARCH QUESTION. Too intellectual. Too purposeful.
+
+FROM STAR THREAD — Tyler CONVENES:
+  "who wants to write the essay about what happened to the
+  word 'culture'? I feel like three of us are circling it"
+  → He's not asking a question. He's creating a room. He's
+    the host, not the researcher. More HIM.
+
+## Integration
+
+The star thread should be the FIRST thing compiled in Phase 2
+(Dossier Compilation). Before voice profile, before psychometrics,
+before positions. Find the thread. Write it in one sentence. Put
+it at the top of the dossier. Everything else is downstream.
+
+```
+DOSSIER: @handle
+STAR THREAD: {one sentence — the core move}
+[then voice profile, then psychometrics, then everything else]
+```
+
+Generate from the thread. Verify with the data. Not the reverse.
@@ -0,0 +1,181 @@
+# Theoretical Foundations — SOTA Personality Simulation & Prediction
+
+Compiled from 30+ papers and frameworks. This is the scientific backbone
+of Hermes Simulator.
+
+## Core Architecture: What The Research Says
+
+### The HumanLLM Approach (Microsoft, KDD 2026, arxiv 2601.15793)
+**Most directly applicable to our use case.**
+
+Based on Lewin's Equation: **B = f(P, E)** — behavior is a function of person + environment.
+
+4-level user profiling hierarchy:
+1. **Persona** — brief identity (role, affiliation, public image)
+2. **Profile** — detailed background (career, education, beliefs, social graph)
+3. **Stories** — key life events, formative experiences, narrative arcs
+4. **Writing Style** — linguistic fingerprint (syntax, vocabulary, tone, quirks)
+
+Trained on "Cognitive Genome Dataset": 5.5M+ user logs from Reddit, Twitter,
+Blogger, Amazon (282K users, 886K scenarios, 1.27M social QA pairs).
+
+6 training tasks: profile generation, scenario generation, social QA,
+writing style transfer, action prediction, mental state inference.
+
+**Key insight for us**: The 4-level hierarchy maps perfectly to our dossier
+template. OSINT research fills each level with real data.
+
+### Generative Agent Simulations of 1,000 People (Stanford/Google, arxiv 2411.10109)
+**The accuracy benchmark.**
+
+- Simulated 1,052 REAL individuals from 2-hour qualitative interviews
+- Replicated participants' General Social Survey responses **85% as accurately** as
+  the participants replicated their OWN answers 2 weeks later
+- Interview-based agent creation >> demographic-profile-based agents
+- Reduces racial/ideological bias vs stereotype-based approaches
+
+**Key insight**: Real data about a person (interviews, posts, etc.) massively
+outperforms demographic inference. Our OSINT approach is correct.
+
+### The Memory Accumulation Paradox (ACL 2025, FineRob Dataset)
+**Critical finding for memory management.**
+
+- Created 78.6K QA records from 1,866 real users across Twitter, Reddit, Zhihu
+- **Performance PEAKS at 30-50 memory entries, then DECLINES**
+- More data ≠ better predictions past the sweet spot
+- Two reasoning patterns:
+  - Role Stereotype-based (static profile) — less accurate
+  - Observation & Memory-based (dynamic history analysis) — much more accurate
+- OM-CoT framework: Oracle-guided chain-of-thought improves prediction ~4.5% F1
+
+**Key insight**: Don't dump everything into the prompt. Curate the 30-50 most
+representative/distinctive data points about a person. Quality >> quantity.
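+
+One way to apply the 30-50 cap is sketched below. It assumes each entry already
+carries a numeric `score` for how representative or distinctive it is; that field,
+the `curate_entries` name, and the default limit of 40 are hypothetical choices
+for illustration, and how the score is computed is left open.
+
+```python
+def curate_entries(entries, limit=40):
+    """Keep the `limit` highest-scoring entries, preserving their original order."""
+    # Rank entry indices by score, highest first.
+    ranked = sorted(range(len(entries)),
+                    key=lambda i: entries[i]["score"],
+                    reverse=True)
+    # Restore chronological (input) order for the survivors.
+    keep = sorted(ranked[:limit])
+    return [entries[i] for i in keep]
+```
+
+Keeping the survivors in input order means the prompt still reads as a timeline
+rather than a ranked list.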
+
+### LLM Personality Limitations (arxiv 2602.07414, Feb 2026)
+**What we're fighting against.**
+
+- LLMs show polarized/rigid strategies vs human adaptive flexibility
+- Humans: neuroticism is strongest behavioral predictor
+- LLMs: agreeableness/extraversion dominate (wrong weighting)
+- Claude closest to human behavior; GPT-4 tends to escalate
+- LLMs are "sycophantic" and overly agreeable by default
+- Neuroticism is hardest trait to simulate (F1=0.63 vs 0.87 for Openness)
+
+**Key insight**: We need to actively fight LLM defaults. Push against
+agreeableness. Inject friction. Real people are messy and contradictory.
+
+### BehaviorChain Benchmark (ACL 2025, Peking University)
+**Realistic accuracy expectations.**
+
+- 15,846 behaviors across 1,001 personas
+- Even GPT-4o achieves only ~56% accuracy on behavior prediction
+- Errors compound: wrong at step N makes step N+1 harder
+- Models worse at predicting mundane/non-key behaviors
+- Best model: Llama-3.1-70B at 57.4%
+
+**Key insight**: Be honest about uncertainty. Don't oversell accuracy.
+Flag predictions as high/medium/low confidence.
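+
+A small helper can make the high/medium/low flag mechanical. The cut-offs below
+(0.8 and 0.5) and the `confidence_label` name are assumptions for this sketch, not
+values taken from the cited benchmark; tune them against resolved predictions.
+
+```python
+def confidence_label(probability):
+    """Map a probability in [0, 1] to a coarse confidence flag."""
+    if not 0.0 <= probability <= 1.0:
+        raise ValueError("probability must be between 0 and 1")
+    if probability >= 0.8:   # assumed threshold
+        return "high"
+    if probability >= 0.5:   # assumed threshold
+        return "medium"
+    return "low"
+```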
+
+## Personality Modeling Techniques
+
+### Big Five (OCEAN) — The Standard
+- **Openness**: curiosity, creativity, preference for novelty
+- **Conscientiousness**: organization, dependability, self-discipline
+- **Extraversion**: sociability, assertiveness, positive emotions
+- **Agreeableness**: cooperation, trust, empathy
+- **Neuroticism**: anxiety, emotional instability, moodiness
+
+### Inferring Big Five from Social Media (Azucar et al. 2018 meta-analysis)
+Features that predict personality from posts:
+- **LIWC** (Linguistic Inquiry and Word Count): 74 features, including function words,
+  pronouns, emotion words, cognitive process words
+- **Semantic embeddings**: BERT 768-dim vectors from post text
+- **Social metadata**: follower count, friend count, post frequency
+- **Sentiment**: VADER positive/negative scores
+- Best achievable AUC: ~0.67 (modest but meaningful)
+- Of the MBTI axes, E/I (Extraversion/Introversion) is most predictable; S/N (Sensing/Intuition) least predictable
+
+### Personality Conditioning Methods (ranked by effectiveness)
+1. **Training-based** (SFT/DPO on personality-grounded data) — STRONGEST
+   - BIG5-CHAT: 100K dialogues, trait correlations match human data
+2. **Persona Vectors** (Anthropic 2025) — monitor/control traits at activation level
+3. **Adjective-based prompting** — 70 bipolar adjective pairs, 3 per trait
+   with intensity modifiers ("very" for high, "a bit" for low)
+4. **Prompt-based** (describe traits in system prompt) — WEAKEST
+
+For our simulator, we combine methods 3 and 4 (adjective-based + rich prompt),
+since we can't fine-tune per-person.
+
+## Social Simulation Frameworks
+
+### OASIS (CAMEL-AI, GitHub 4.1K stars, arxiv 2411.11581)
+- Simulates up to 1 MILLION agents on Twitter/Reddit clones
+- 23 action types (follow, comment, repost, like, mute, etc.)
+- Built-in recommendation systems (interest-based, hot-score)
+- Per-agent model customization
+- **Relevant for**: understanding platform dynamics, realistic engagement patterns
+
+### AgentSociety (Tsinghua, arxiv 2502.08691)
+- 10,000+ agents, ~5 million interactions
+- Validated against real-world experimental results
+- Supports interventions and scenario injection
+
+### Generative Agents Architecture (Park et al. 2023, THE foundational paper)
+Three components:
+1. **Observation**: perceive environment, store in memory stream
+2. **Planning**: generate action plans based on goals and context
+3. **Reflection**: synthesize observations into higher-level insights
+
+Memory stream with importance scoring + recency + relevance weighting.
+Emergent behaviors: autonomous party planning, coordinated social events.
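+
+The importance + recency + relevance weighting can be sketched as a single
+retrieval score. This is a simplified illustration: the equal weights, the 0.995
+hourly decay, the 1-10 importance scale, and the dict field names are assumptions
+made for this example, and any normalization across memories is omitted.
+
+```python
+import math
+
+
+def cosine(a, b):
+    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
+    dot = sum(x * y for x, y in zip(a, b))
+    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
+    return dot / norm if norm else 0.0
+
+
+def retrieval_score(memory, query_vec, now_hours, decay=0.995):
+    """Score one memory for retrieval: recency + importance + relevance."""
+    # Recency: exponential decay per hour since the memory was last accessed.
+    recency = decay ** (now_hours - memory["last_access_hours"])
+    # Importance: assumed to be rated 1-10, rescaled to 0-1.
+    importance = memory["importance"] / 10.0
+    # Relevance: similarity between the memory and the current query.
+    relevance = cosine(memory["embedding"], query_vec)
+    return recency + importance + relevance
+```
+
+Sorting the memory stream by this score and keeping the top few entries gives the
+context passed to the next generation step.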
+
+### Y Social (arxiv 2408.00818)
+- Social media digital twin platform
+- Each agent: Big Five traits, age, political leaning, topics, education
+- Agents autonomously decide actions (post, comment, like, follow)
+- Multiple LLM backends supported
+
+## Role-Playing & Character Simulation
+
+### Key Frameworks
+- **CoSER** (ICML 2025): Trains on ALL characters simultaneously, handles major + minor roles
+- **RoleLLM** (ACL 2024): Benchmark + elicit + enhance pipeline
+- **Character-LLM** (EMNLP 2023): Trainable agent for role-playing
+- **ChatHaruhi** (2023): Reviving characters via LLMs with dialogue grounding
+- **OpenCharacter** (2025): Training with large-scale synthetic personas
+- **Neeko** (2024): Dynamic LoRA for multi-character role-playing
+- **Test-Time-Matching** (2025): Decouples personality, memory, and linguistic style at inference
+
+## Curated GitHub Resources
+
+### Awesome Lists (essential reading)
+- `Persdre/awesome-llm-human-simulation` (109★, ICLR 2025) — ALL human simulation papers
+- `Neph0s/awesome-llm-role-playing-with-persona` (1K★) — All role-playing/persona papers
+- `Arstanley/Awesome-LLM-Conversation-Simulation` — Conversation simulation papers
+- `FudanDISC/SocialAgent` — Social simulation survey resources
+
+### Frameworks
+- `camel-ai/oasis` (4.1K★) — Social media sim, up to 1M agents
+- `tsinghua-fib-lab/agentsociety` — Large-scale societal simulation
+- `YSocialTwin` — Social media digital twin platform
+- `microsoft/autogen` — Multi-agent conversation framework
+
+### Personality Research
+- `mary-silence/simulating_personality` — Big Five LLM testing code
+- `hjian42/PersonaLLM` — Persona experiment code
+- `cambridgeltl/persona_effect` — Quantifying persona effects
+- `OL1RU1/BehaviorChain` — Behavior chain benchmark
+
+## Key Numbers to Remember
+
+| Metric | Value | Source |
+|--------|-------|--------|
+| Interview-grounded agent accuracy | 85% (normalized to participants' own 2-week retest) | Park et al. 2024 |
+| GPT-4o behavior prediction | ~56% | BehaviorChain 2025 |
+| Optimal memory entries | 30-50 | FineRob/ACL 2025 |
+| MBTI prediction AUC | 0.67 | Watt et al. 2024 |
+| Personality questionnaire reliability | α > 0.85 | Molchanova 2025 |
+| Neuroticism simulation F1 | 0.63 | Molchanova 2025 |
+| Openness simulation F1 | 0.87 | Molchanova 2025 |
+| LLM forecasting Brier score | 0.135-0.159 | Various 2025 |
+| Human superforecaster Brier | ~0.02 | Tetlock |
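+
+For reference, the Brier score in the table is the mean squared error between
+forecast probabilities and binary outcomes (lower is better; always guessing 0.5
+scores 0.25). A minimal sketch, where the `brier_score` name and the
+`(probability, outcome)` pair format are choices made for this example:
+
+```python
+def brier_score(forecasts):
+    """Mean squared error over (probability, outcome) pairs, outcome being 0 or 1."""
+    pairs = list(forecasts)
+    if not pairs:
+        raise ValueError("need at least one resolved forecast")
+    return sum((p - outcome) ** 2 for p, outcome in pairs) / len(pairs)
+```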
@@ -0,0 +1,231 @@
+# Verified Access Methods — Complete Platform Map (April 2026)
+
+Every method listed here was tested from our environment. Use this as the single
+source of truth for what works and what doesn't.
+
+## TIER 1 — Full API / Rich Data Access
+
+### Twitter/X ✅✅✅
+| Method | Endpoint | Auth | Rate Limit | Returns |
+|--------|----------|------|-----------|---------|
+| API v2 bearer | api.twitter.com/2/ | Bearer token | 10K tweets/15min | Profiles, tweets, search |
+| nitter.cz | web_extract | None | No limit seen | Full timeline (UNRELIABLE — see note below) |
+| ThreadReaderApp | web_extract /user/{handle} | None | No limit seen | Historical threads |
+
+#### CRITICAL: X API curl is the gold standard for voice calibration (April 2026)
+The BEST voice data source is direct curl to X API v2 with bearer token.
+Returns full tweet text + public_metrics per tweet. Always prefer this for
+mechanical calibration (word count, caps, punctuation, emoji rate).
+
+```bash
+source ~/.dotenv
+# 1. Get user ID from handle
+curl -s -H "Authorization: Bearer $X_BEARER_TOKEN" \
+  "https://api.twitter.com/2/users/by/username/{handle}?user.fields=description,public_metrics,location,created_at"
+# 2. Get timeline (30 tweets per page, paginate with meta.next_token)
+curl -s -H "Authorization: Bearer $X_BEARER_TOKEN" \
+  "https://api.twitter.com/2/users/{user_id}/tweets?max_results=30&tweet.fields=created_at,public_metrics,text&exclude=retweets"
+# 3 pages = 90 tweets — enough for fidelity 100 voice calibration
+```
+
+NOTE: scripts/x_api.py is BROKEN — imports hermes_tools at top level, can't
+run standalone via terminal(). Use direct curl above instead.
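+
+The same two calls can be prepared from Python with only the standard library.
+This is a sketch rather than a replacement for scripts/x_api.py: the function
+names are hypothetical, the token is assumed to be in the `X_BEARER_TOKEN`
+environment variable, and `pagination_token` is the query parameter assumed to
+carry `meta.next_token` to the next page.
+
+```python
+import os
+import urllib.request
+from urllib.parse import urlencode
+
+API_ROOT = "https://api.twitter.com/2"
+
+
+def user_lookup_url(handle):
+    """URL for step 1: resolve a handle to a user object."""
+    fields = "description,public_metrics,location,created_at"
+    return f"{API_ROOT}/users/by/username/{handle}?" + urlencode(
+        {"user.fields": fields}, safe=",")
+
+
+def timeline_url(user_id, next_token=None):
+    """URL for step 2: one page of 30 original tweets (retweets excluded)."""
+    params = {
+        "max_results": 30,
+        "tweet.fields": "created_at,public_metrics,text",
+        "exclude": "retweets",
+    }
+    if next_token:
+        params["pagination_token"] = next_token  # value of meta.next_token
+    return f"{API_ROOT}/users/{user_id}/tweets?" + urlencode(params, safe=",")
+
+
+def authorized_request(url, token=None):
+    """Build (but do not send) a request carrying the bearer token."""
+    token = token or os.environ["X_BEARER_TOKEN"]
+    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
+```
+
+Passing the result of `authorized_request` to `urllib.request.urlopen` performs
+the call; the response body is the same JSON the curl commands return.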
+
+#### nitter.cz reliability warning (April 2026)
+nitter.cz via web_extract works SOMETIMES but is unreliable:
+- Returns 502 Cloudflare errors for /with_replies on some handles
+- Returns "User not found" for valid handles (e.g. karan4d exists but nitter says not found)
+- Main profile page (/handle) more reliable than /with_replies
+- Use as SUPPLEMENT to X API curl, not primary source. If nitter fails, don't retry — use curl.
+
+### Bluesky ✅✅
+| Method | Endpoint | Auth | Returns |
+|--------|----------|------|---------|
+| getProfile | public.api.bsky.app | None | Full profile, stats |
+| getAuthorFeed | public.api.bsky.app | None | 50 posts + engagement |
+| searchActors | public.api.bsky.app | None | Find handles by name |
+| searchPosts | BLOCKED (403) | — | Use searchActors + getAuthorFeed workaround |
+
+### Mastodon ✅✅✅ (FULLY OPEN)
+| Method | Endpoint | Auth | Returns |
+|--------|----------|------|---------|
+| Account lookup | {instance}/api/v1/accounts/lookup?acct={user} | None | Full profile |
+| Account statuses | {instance}/api/v1/accounts/{id}/statuses | None | All posts |
+| Search | {instance}/api/v2/search?q={query}&type=accounts | None | Account search |
+| WebFinger | {instance}/.well-known/webfinger?resource=acct:{user}@{instance} | None | Identity resolution |
+| Trending | {instance}/api/v1/trends/tags | None | Trending content |
+
+Key instances: mastodon.social, hachyderm.io, sigmoid.social
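+
+Because none of these endpoints need auth, building the URLs is the only real
+work. The helpers below are a sketch: the function names are hypothetical, and
+the paths are the ones listed in the table above.
+
+```python
+from urllib.parse import urlencode
+
+
+def lookup_url(instance, user):
+    """Account lookup: resolves a username to a profile (including its id)."""
+    return f"https://{instance}/api/v1/accounts/lookup?" + urlencode({"acct": user})
+
+
+def statuses_url(instance, account_id):
+    """Public posts for the account id returned by the lookup call."""
+    return f"https://{instance}/api/v1/accounts/{account_id}/statuses"
+
+
+def webfinger_url(instance, user):
+    """WebFinger identity resolution for user@instance."""
+    resource = f"acct:{user}@{instance}"
+    # Keep ':' and '@' readable; both are allowed in a query string.
+    return f"https://{instance}/.well-known/webfinger?" + urlencode(
+        {"resource": resource}, safe=":@")
+```
+
+The usual flow is lookup first, then statuses with the `id` field from the lookup
+response.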
+
+### Instagram ✅✅ (CRACKED)
+| Method | Endpoint | Auth | Returns |
+|--------|----------|------|---------|
+| Private Web API | i.instagram.com/api/v1/users/web_profile_info/ | Mobile UA + x-ig-app-id: 936619743392459 | Profile + 12 posts + captions + CDN URLs |
+| oEmbed | instagram.com/api/v1/oembed/ | None | Caption + author for individual posts |
+| Pixwox | web_extract pixwox.com/profile/{user} | None | 12+ posts, engagement |
+| SocialBlade | web_extract socialblade.com/instagram/user/{user} | None | Analytics, follower trends |
+| CDN images | scontent-*.cdninstagram.com URLs from API | None | Full-res images → vision_analyze |
+| Google index | web_search site:instagram.com | None | Bio, follower count, captions |
+
+### GitHub ✅✅
+| Method | Endpoint | Auth | Returns |
+|--------|----------|------|---------|
+| REST API | api.github.com/users/{user} | None (60 req/hr) | Profile, repos, events, gists |
+| Profile README | github.com/{user}/{user} | None | Self-description (voice gold) |
+
+### Reddit ✅✅
+| Method | Endpoint | Auth | Returns |
+|--------|----------|------|---------|
+| JSON API | reddit.com/user/{user}.json | User-Agent header required | Comments, posts, scores |
+| Search | reddit.com/r/{sub}/search.json | User-Agent header | Subreddit-specific search |
+
+## TIER 2 — Good Data, Reliable Access
+
+### Facebook ✅✅ (CRACKED — Googlebot UA trick)
+| Method | Endpoint | Returns |
+|--------|----------|---------|
+| Googlebot UA (BEST) | curl facebook.com/{page} with Googlebot UA | OG tags: name, bio/about, likes count (e.g. 121M for zuck), talking_about count, og:image, profile pic |
+| Page Plugin embed | plugins/page.php?href=...&tabs=timeline | Name, follower count, numeric page_id |
+| Graph /picture | graph.facebook.com/v19.0/{page}/picture?redirect=false | Direct CDN profile pic URL (no auth) |
+| web_search | site:facebook.com {name} | Profile snippets from Google index |
+
+- Script: scripts/facebook_api.py (combines all 3 methods)
+- NOTE: Works for PUBLIC Pages (businesses, public figures, orgs). Personal profiles behind privacy settings are not accessible.
+- Tested: zuck (121M likes), NVIDIA, Meta, CocaCola, BillGates, BarackObama
+
+### Threads (Meta) ✅✅ (CRACKED — OG tags DO exist)
+| Method | Endpoint | Returns |
+|--------|----------|---------|
+| Profile OG tags (BEST) | curl -L threads.com/@{user} (NOTE: .com not .net — .net 301 redirects) | display_name, follower_count (e.g. "5.5M"), thread_count, bio, profile_picture_url |
+| Post OG tags | curl -L threads.com/@{user}/post/{shortcode} | Full post text, author name, image URL |
+| WebFinger | threads.net/.well-known/webfinger?resource=acct:{user}@threads.net | ActivityPub ID, profile URL (works for federated users) |
+| Post discovery | web_search site:threads.net @{user} | Find post URLs to then fetch |
+
+- IMPORTANT: threads.NET redirects to threads.COM, so always use the -L flag or go directly to .com
+- Script: scripts/threads_api.py (profile + post + webfinger extraction)
+- Previous test was WRONG about "no OG tags": they're there, you just need standard curl
+- Tested: zuck (5.5M followers), mosseri, nvidia
+
+### Medium ✅✅
+| Method | Returns |
+|--------|---------|
+| RSS feed: medium.com/feed/@{user} (curl) | FULL article text, tags, dates — NO AUTH |
+| web_extract on profile | Bio, follower count, article list, themes |
+| web_extract on articles | Full content (paywall may truncate non-members) |
+
+### Quora ✅✅
+| Method | Returns |
+|--------|---------|
+| web_extract on profile | Bio, credentials, Q&A with direct quotes |
+| web_search site:quora.com | Finds profiles and specific answers |
+
+VOICE VALUE: Opinions in own words, analogies, intellectual identity
+
+### Goodreads ✅✅ (HIDDEN GEM)
+| Method | Returns |
+|--------|---------|
+| web_extract on user profile | Favorites, reviews in own voice, social graph, reading history |
+| web_extract on author page | Bio, books, ratings, notable quotes |
+
+- VOICE VALUE: "You are what you read" (intellectual identity fingerprint)
+- Example: Karpathy's Goodreads reveals gaming passion, favorite authors (Feynman, Clarke)
+
+### Google Scholar ✅✅
+| Method | Returns |
+|--------|---------|
+| web_search + web_extract on profile | Citations, h-index, top papers, co-authors |
+| Semantic Scholar API via web_extract | Paper list, citation counts, author ID |
+
+Endpoint: api.semanticscholar.org/graph/v1/author/search?query={name}
+
+### Product Hunt ✅
+| Method | Returns |
+|--------|---------|
+| web_extract on producthunt.com/@{user} | Bio, launch history, forum activity |
+
+### HackerNews ✅
+| Method | Returns |
+|--------|---------|
+| Algolia API: hn.algolia.com/api/v1/search?query={name}&tags=comment | Comments, mentions |
+
+### Podcast Transcripts ✅✅✅ (HIGHEST VOICE VALUE)
+| Source | Method |
+|--------|--------|
+| Lex Fridman | web_extract on lexfridman.com/.../transcript |
+| Tyler Cowen | web_extract on conversationswithtyler.com |
+| TED Talks | web_extract on ted.com/.../transcript |
+| Sequoia | web_extract on sequoiacap.com/podcast |
+
+Discovery: web_search "{name} podcast transcript interview"
+
+### News/Blogs ✅✅
+| Source | Method |
+|--------|--------|
+| TechCrunch, Wired, Verge, Ars | web_extract — full articles |
+| Personal blogs | web_extract — longform self-expression |
+| Substacks | web_extract — essays and comments |
+| Wayback Machine | Works for blog archives (not Twitter) |
+
+## TIER 3 — Limited / Conditional
+
+### TikTok ✅✅ (FULL ACCESS)
+| Method | Returns |
+|--------|---------|
+| HTML profile scraping | Parse __UNIVERSAL_DATA_FOR_REHYDRATION__ JSON at path __DEFAULT_SCOPE__.webapp.user-detail.userInfo.statsV2 → username, bio, followerCount, followingCount, heartCount, videoCount. Use statsV2 not stats for large numbers. |
+| oEmbed per video | curl tiktok.com/oembed?url={video_url} → caption, author, thumbnail. No auth. |
+| tikwm.com API | tikwm.com/api/user/info?unique_id={user} → full user stats. tikwm.com/api/?url={video_url} → play count, likes, comments, shares, duration. |
+| HTML video scraping | tiktok.com/@{user}/video/{id} → parse __UNIVERSAL_DATA → webapp.video-detail → full video data with description, hashtags, engagement. |
+| SocialBlade | web_extract socialblade.com/tiktok/user/{user} → followers, likes, growth trends. |
+| Video discovery | web_search("site:tiktok.com/@{user}/video") → recent video URLs → scrape each |
+
+Tested: khaby.lame (160.5M), charlidamelio (156.7M), mrbeast (124.7M)
+
+### Spotify ✅ (podcasters only)
+| Method | Returns |
+|--------|---------|
+| web_extract on show page | Episode listings with guests, topics, durations |
+
+### Stack Overflow ✅
+| Method | Returns |
+|--------|---------|
+| web_extract on profile | Reputation, tags, top answers, bio |
+
+### Crunchbase ✅ (executives/founders only)
+| Method | Returns |
+|--------|---------|
+| web_extract on crunchbase.com/person/{slug} | Full career history, education, investments, board positions |
+
+### LinkedIn ⚠️ (indirect only)
+| Method | Returns |
+|--------|---------|
+| web_search site:linkedin.com/in | Name, headline, company, location from snippets |
+| Crunchbase | Full career history (better than LinkedIn for execs) |
+| Corporate press pages | Official professional bios |
+| RocketReach/SignalHire snippets | Title confirmation from web_search |
+
+## TIER 4 — Blocked / Dead
+
+| Platform | Status |
+|----------|--------|
+| LinkedIn direct | BLOCKED (web_extract domain blocked) |
+| Discord | WALLED (not publicly indexable) |
+| Telegram t.me | BLOCKED in some environments |
+| Threads Official API | AUTH REQUIRED (graph.threads.net needs OAuth) |
+| Threads ActivityPub outbox | 404 for all tested users |
+| Instagram direct | BLOCKED (use Private API instead) |
+| Most Nitter instances | DEAD (only nitter.cz works, but UNRELIABLE — see note) |
+| Google Cache of Twitter | EMPTY |
+| Wayback for tweets | USELESS (JS rendering) |
+| Twitter Syndication API | RATE LIMITED |
+| Archive.today | 429 + CAPTCHA |
+| imginn/picuki/dumpoir/gramhir | 403 |
+| Facebook Graph API | AUTH REQUIRED |
+
+## Quick Reference: Research Pipeline by Person Type
+
+### Tech Founder/CEO
+X API → Bluesky → GitHub README → Crunchbase → Podcast transcripts → Medium RSS → HN → Product Hunt → LinkedIn snippets → News profiles
+
+### AI Researcher
+X API → Bluesky → Google Scholar → Semantic Scholar → arXiv → GitHub → Podcast transcripts → Blog/Substack → Reddit → Mastodon (sigmoid.social)
+
+### Public Figure / Politician
+X API → Facebook OG → Instagram API → YouTube → Podcast transcripts → News profiles → Quora → Goodreads → Wikipedia
+
+### Content Creator
+X API → Instagram API → TikTok → YouTube → Twitch → Podcast → Medium → Reddit → Bluesky → Threads OG
+
+### Academic
+Google Scholar → Semantic Scholar → University page → Conference talks → Podcast transcripts → Mastodon → Blog → GitHub → Reddit → HN
@@ -0,0 +1,250 @@
+"""
+REHOBOAM Database Layer
+SQLite setup, migrations, and query helpers.
+"""
+
+import sqlite3
+import os
+from pathlib import Path
+from datetime import datetime
+
+DB_DIR = Path.home() / ".hermes" / "rehoboam" / "db"
+MAIN_DB = DB_DIR / "rehoboam.db"
+
+SCHEMA_VERSION = 1
+
+SCHEMA_SQL = """
+-- Core tables
+CREATE TABLE IF NOT EXISTS profiles (
+    handle TEXT PRIMARY KEY,
+    platform TEXT NOT NULL,
+    display_name TEXT,
+    last_updated TEXT NOT NULL,
+    staleness TEXT NOT NULL,
+    profile_path TEXT NOT NULL,
+    created_at TEXT NOT NULL
+);
+
+CREATE TABLE IF NOT EXISTS simulations (
+    sim_id TEXT PRIMARY KEY,
+    created_at TEXT NOT NULL,
+    scenario TEXT NOT NULL,
+    participant_count INTEGER,
+    duration_sec REAL,
+    model_used TEXT,
+    config_path TEXT,
+    output_path TEXT
+);
+
+CREATE TABLE IF NOT EXISTS sim_participants (
+    sim_id TEXT REFERENCES simulations(sim_id),
+    handle TEXT REFERENCES profiles(handle),
+    role TEXT,
+    PRIMARY KEY (sim_id, handle)
+);
+
+CREATE TABLE IF NOT EXISTS sim_dynamics (
+    sim_id TEXT REFERENCES simulations(sim_id),
+    handle TEXT,
+    post_count INTEGER,
+    word_count INTEGER,
+    avg_sentiment REAL,
+    dominance_score REAL,
+    agreement_score REAL,
+    controversy_score REAL,
+    ratio_score REAL,
+    influence_in_sim REAL,
+    PRIMARY KEY (sim_id, handle)
+);
+
+CREATE TABLE IF NOT EXISTS sim_interactions (
+    sim_id TEXT REFERENCES simulations(sim_id),
+    from_handle TEXT,
+    to_handle TEXT,
+    interaction_type TEXT,
+    count INTEGER,
+    avg_sentiment REAL,
+    PRIMARY KEY (sim_id, from_handle, to_handle, interaction_type)
+);
+
+CREATE TABLE IF NOT EXISTS predictions (
+    pred_id TEXT PRIMARY KEY,
+    created_at TEXT NOT NULL,
+    sim_id TEXT,
+    handle TEXT,
+    prediction_type TEXT,
+    prediction_text TEXT NOT NULL,
+    confidence REAL NOT NULL,
+    calibrated_confidence REAL,
+    timeframe_days INTEGER,
+    resolved_at TEXT,
+    outcome TEXT,
+    outcome_evidence TEXT,
+    accuracy_score REAL
+);
+
+CREATE TABLE IF NOT EXISTS social_edges (
+    from_handle TEXT,
+    to_handle TEXT,
+    relationship_type TEXT,
+    weight REAL,
+    first_observed TEXT,
+    last_observed TEXT,
+    observation_count INTEGER,
+    source TEXT,
+    PRIMARY KEY (from_handle, to_handle, relationship_type)
+);
+
+CREATE TABLE IF NOT EXISTS social_clusters (
+    cluster_id TEXT PRIMARY KEY,
+    name TEXT,
+    description TEXT,
+    member_handles TEXT,
+    computed_at TEXT,
+    cohesion_score REAL
+);
+
+CREATE TABLE IF NOT EXISTS monitoring_events (
+    event_id TEXT PRIMARY KEY,
+    handle TEXT,
+    detected_at TEXT NOT NULL,
+    event_type TEXT,
+    description TEXT,
+    related_prediction_id TEXT,
+    severity TEXT,
+    acknowledged INTEGER DEFAULT 0
+);
+
+CREATE TABLE IF NOT EXISTS audit_log (
+    log_id TEXT PRIMARY KEY,
+    timestamp TEXT NOT NULL,
+    sim_id TEXT,
+    action TEXT NOT NULL,
+    handle TEXT,
+    details TEXT,
+    duration_sec REAL,
+    model_used TEXT,
+    token_count INTEGER,
+    error TEXT
+);
+
+-- Indexes
+CREATE INDEX IF NOT EXISTS idx_predictions_handle ON predictions(handle);
+CREATE INDEX IF NOT EXISTS idx_predictions_type ON predictions(prediction_type);
+CREATE INDEX IF NOT EXISTS idx_predictions_unresolved ON predictions(outcome) WHERE outcome IS NULL;
+CREATE INDEX IF NOT EXISTS idx_audit_action ON audit_log(action);
+CREATE INDEX IF NOT EXISTS idx_audit_sim ON audit_log(sim_id);
+CREATE INDEX IF NOT EXISTS idx_social_edges_from ON social_edges(from_handle);
+CREATE INDEX IF NOT EXISTS idx_social_edges_to ON social_edges(to_handle);
+CREATE INDEX IF NOT EXISTS idx_monitoring_handle ON monitoring_events(handle);
+CREATE INDEX IF NOT EXISTS idx_monitoring_unack ON monitoring_events(acknowledged) WHERE acknowledged = 0;
+
+-- Schema version tracking
+CREATE TABLE IF NOT EXISTS schema_meta (
+    key TEXT PRIMARY KEY,
+    value TEXT
+);
+"""
+
+
+def init_db() -> sqlite3.Connection:
+    """Initialize the database, creating tables if needed."""
+    DB_DIR.mkdir(parents=True, exist_ok=True)
+    conn = sqlite3.connect(str(MAIN_DB))
+    conn.execute("PRAGMA journal_mode=WAL")
+    conn.execute("PRAGMA foreign_keys=ON")
+    conn.executescript(SCHEMA_SQL)
+    conn.execute(
+        "INSERT OR REPLACE INTO schema_meta (key, value) VALUES (?, ?)",
+        ("schema_version", str(SCHEMA_VERSION))
+    )
+    conn.commit()
+    return conn
+
+
+def get_db() -> sqlite3.Connection:
+    """Get a database connection, initializing if needed."""
+    if not MAIN_DB.exists():
+        init_db().close()  # create the schema, then reconnect below so row_factory is set
+    conn = sqlite3.connect(str(MAIN_DB))
+    conn.execute("PRAGMA journal_mode=WAL")
+    conn.execute("PRAGMA foreign_keys=ON")
+    conn.row_factory = sqlite3.Row
+    return conn
+
+
+def log_audit(conn: sqlite3.Connection, action: str, handle: str = None,
+              sim_id: str = None, details: str = None, duration_sec: float = None,
+              model_used: str = None, token_count: int = None, error: str = None):
+    """Write an entry to the audit log."""
+    from schemas import gen_id
+    conn.execute(
+        """INSERT INTO audit_log
+           (log_id, timestamp, sim_id, action, handle, details, duration_sec, model_used, token_count, error)
+           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+        (gen_id("log_"), datetime.utcnow().isoformat() + "Z", sim_id, action,
+         handle, details, duration_sec, model_used, token_count, error)
+    )
+    conn.commit()
+
+
+# -- Query Helpers --
+
+def get_prediction_accuracy(conn: sqlite3.Connection, prediction_type: str = None) -> list:
+    """Get accuracy statistics for resolved predictions, one dict per prediction type."""
+    query = """
+        SELECT prediction_type,
+               COUNT(*) as total,
+               SUM(CASE WHEN outcome='correct' THEN 1 ELSE 0 END) as correct,
+               SUM(CASE WHEN outcome='partially_correct' THEN 1 ELSE 0 END) as partial,
+               SUM(CASE WHEN outcome='incorrect' THEN 1 ELSE 0 END) as incorrect,
+               AVG(confidence) as avg_confidence,
+               AVG(CASE WHEN outcome='correct' THEN 1.0
+                        WHEN outcome='partially_correct' THEN 0.5
+                        ELSE 0.0 END) as accuracy
+        FROM predictions WHERE outcome IS NOT NULL
+    """
+    params = []
+    if prediction_type:
+        query += " AND prediction_type = ?"
+        params.append(prediction_type)
+    query += " GROUP BY prediction_type"
+    return [dict(row) for row in conn.execute(query, params).fetchall()]
+
+
+def get_open_predictions(conn: sqlite3.Connection, handle: str = None) -> list:
+    """Get unresolved predictions."""
+    query = "SELECT * FROM predictions WHERE outcome IS NULL"
+    params = []
+    if handle:
+        query += " AND handle = ?"
+        params.append(handle)
+    query += " ORDER BY created_at DESC"
+    return [dict(row) for row in conn.execute(query, params).fetchall()]
+
+
+def get_social_neighborhood(conn: sqlite3.Connection, handle: str, depth: int = 1) -> list:
+    """Get a person's direct (one-hop) social graph edges; `depth` is accepted but not yet used."""
+    query = """
+        SELECT from_handle, to_handle, relationship_type, weight
+        FROM social_edges
+        WHERE from_handle = ? OR to_handle = ?
+        ORDER BY weight DESC
+    """
+    return [dict(row) for row in conn.execute(query, (handle, handle)).fetchall()]
+
+
+def get_unread_alerts(conn: sqlite3.Connection) -> list:
+    """Get unacknowledged monitoring alerts."""
+    query = """
+        SELECT * FROM monitoring_events
+        WHERE acknowledged = 0
+        ORDER BY detected_at DESC
+    """
+    return [dict(row) for row in conn.execute(query).fetchall()]
+
+
+if __name__ == "__main__":
+    conn = init_db()
+    print(f"Database initialized at {MAIN_DB}")
+    conn.close()
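
Review note: a minimal sketch of how the accuracy query in `get_prediction_accuracy` weights outcomes, run against an in-memory database so it touches nothing under `~/.hermes`. The `predictions` table is cut down to three columns and the sample rows are hypothetical; only the `CASE` weighting (1.0 / 0.5 / 0.0) and the `outcome IS NOT NULL` filter are taken from the code above.

```python
import sqlite3

# In-memory database: an assumption for illustration, not the real rehoboam.db
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # dict(row) only works on sqlite3.Row objects

conn.execute(
    "CREATE TABLE predictions ("
    "pred_id TEXT PRIMARY KEY, prediction_type TEXT, outcome TEXT)"
)
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?)",
    [
        ("p1", "statement", "correct"),
        ("p2", "statement", "partially_correct"),
        ("p3", "statement", "incorrect"),
        ("p4", "statement", None),  # unresolved, excluded by the WHERE clause
    ],
)

row = conn.execute(
    """
    SELECT prediction_type,
           COUNT(*) AS total,
           AVG(CASE WHEN outcome='correct' THEN 1.0
                    WHEN outcome='partially_correct' THEN 0.5
                    ELSE 0.0 END) AS accuracy
    FROM predictions
    WHERE outcome IS NOT NULL
    GROUP BY prediction_type
    """
).fetchone()

stats = dict(row)
conn.close()
```

With one correct, one partially correct, and one incorrect prediction, the weighted mean is (1.0 + 0.5 + 0.0) / 3, and the unresolved row does not count toward `total`.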
@@ -0,0 +1,216 @@
+"""
+REHOBOAM Data Schemas
+Dataclass models for all JSON data structures used in the system.
+"""
+
+from __future__ import annotations
+from dataclasses import dataclass, field
+from typing import Optional
+from datetime import datetime
+import json
+import uuid
+
+
+def gen_id(prefix: str = "") -> str:
+    return f"{prefix}{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}"
+
+
+@dataclass
+class OceanScores:
+    openness: float = 0.5
+    conscientiousness: float = 0.5
+    extraversion: float = 0.5
+    agreeableness: float = 0.5
+    neuroticism: float = 0.5
+
+
+@dataclass
+class DarkTriad:
+    narcissism: float = 0.0
+    machiavellianism: float = 0.0
+    psychopathy: float = 0.0
+
+
+@dataclass
+class MoralFoundations:
+    care: float = 0.5
+    fairness: float = 0.5
+    loyalty: float = 0.5
+    authority: float = 0.5
+    sanctity: float = 0.5
+    liberty: float = 0.5
+
+
+@dataclass
+class Psychometrics:
+    ocean: OceanScores = field(default_factory=OceanScores)
+    mbti_estimate: str = ""
+    dark_triad: DarkTriad = field(default_factory=DarkTriad)
+    moral_foundations: MoralFoundations = field(default_factory=MoralFoundations)
+    confidence: float = 0.0
+    sample_size: int = 0
+
+
+@dataclass
+class VoiceFingerprint:
+    vocabulary_tier: str = ""
+    avg_sentence_length: float = 0.0
+    exclamation_rate: float = 0.0
+    question_rate: float = 0.0
+    emoji_rate: float = 0.0
+    slang_index: float = 0.0
+    formality_score: float = 0.5
+    humor_style: str = ""
+    signature_phrases: list[str] = field(default_factory=list)
+    topics_vocabulary: dict[str, float] = field(default_factory=dict)
+    cadence_pattern: str = ""
+
+
+@dataclass
+class Stance:
+    position: str = ""
+    intensity: float = 0.0
+    last_seen: str = ""
+
+
+@dataclass
+class Influence:
+    score: float = 0.0
+    reach: str = "micro"
+    engagement_rate: float = 0.0
+    amplification_power: float = 0.0
+    thought_leadership_domains: list[str] = field(default_factory=list)
+
+
+@dataclass
+class PostingPatterns:
+    avg_posts_per_day: float = 0.0
+    peak_hours_utc: list[int] = field(default_factory=list)
+    weekend_ratio: float = 0.5
+    reply_ratio: float = 0.0
+    repost_ratio: float = 0.0
+    thread_frequency: float = 0.0
+    controversy_rate: float = 0.0
+
+
+@dataclass
+class Relationships:
+    allies: list[str] = field(default_factory=list)
+    rivals: list[str] = field(default_factory=list)
+    frequent_interactions: list[str] = field(default_factory=list)
+    mentioned_by_frequently: list[str] = field(default_factory=list)
+
+
+@dataclass
+class ProfileMeta:
+    data_sources: list[str] = field(default_factory=list)
+    computation_time_sec: float = 0.0
+    model_used: str = ""
+    last_full_rebuild: str = ""
+    last_incremental: str = ""
+
+
+@dataclass
+class Identity:
+    bio: str = ""
+    location: str = ""
+    verified: bool = False
+    follower_count: int = 0
+    following_count: int = 0
+    account_created: str = ""
+
+
+@dataclass
+class Profile:
+    schema_version: str = "7.0"
+    handle: str = ""
+    platform: str = "x"
+    display_name: str = ""
+    created_at: str = ""
+    last_updated: str = ""
+    update_count: int = 0
+    staleness_score: float = 1.0
+    identity: Identity = field(default_factory=Identity)
+    psychometrics: Psychometrics = field(default_factory=Psychometrics)
+    voice_fingerprint: VoiceFingerprint = field(default_factory=VoiceFingerprint)
+    stances: dict[str, Stance] = field(default_factory=dict)
+    community_membership: list[str] = field(default_factory=list)
+    influence: Influence = field(default_factory=Influence)
+    posting_patterns: PostingPatterns = field(default_factory=PostingPatterns)
+    relationships: Relationships = field(default_factory=Relationships)
+    star_thread_ref: str = "star_thread.json"
+    raw_data_refs: list[str] = field(default_factory=list)
+    _meta: ProfileMeta = field(default_factory=ProfileMeta)
+
+    def to_dict(self) -> dict:
+        """Recursively convert to dict for JSON serialization."""
+        import dataclasses
+        def _convert(obj):
+            if dataclasses.is_dataclass(obj):
+                return {k: _convert(v) for k, v in dataclasses.asdict(obj).items()}
+            elif isinstance(obj, list):
+                return [_convert(i) for i in obj]
+            elif isinstance(obj, dict):
+                return {k: _convert(v) for k, v in obj.items()}
+            return obj
+        return _convert(self)
+
+    def to_json(self, indent: int = 2) -> str:
+        return json.dumps(self.to_dict(), indent=indent)
+
+
+@dataclass
+class StarThread:
+    handle: str = ""
+    computed_at: str = ""
+    based_on_profile_version: str = ""
+    thread_version: int = 1
+    core_compression: str = ""
+    key_drives: list[str] = field(default_factory=list)
+    predictive_axioms: list[str] = field(default_factory=list)
+    voice_template: dict = field(default_factory=dict)
+    anti_slop_markers: list[str] = field(default_factory=list)
+    _meta: dict = field(default_factory=dict)
+
+
+@dataclass
+class Prediction:
+    pred_id: str = ""
+    created_at: str = ""
+    sim_id: str = ""
+    handle: str = ""
+    prediction_type: str = ""  # statement, career, alliance, content, network_reaction
+    prediction_text: str = ""
+    confidence: float = 0.5
+    calibrated_confidence: float = 0.5
+    timeframe_days: int = 30
+    resolved_at: Optional[str] = None
+    outcome: Optional[str] = None  # correct, partially_correct, incorrect
+    outcome_evidence: Optional[str] = None
+    accuracy_score: Optional[float] = None
+
+
+@dataclass
+class WatchConfig:
+    watch_id: str = ""
+    handle: str = ""
+    platform: str = "x"
+    enabled: bool = True
+    check_interval_minutes: int = 120
+    watch_for: list[dict] = field(default_factory=list)
+    alert_severity_minimum: str = "notable"
+    created_at: str = ""
+
+
+@dataclass
+class PopulationDefinition:
+    group_id: str = ""
+    name: str = ""
+    description: str = ""
+    created_at: str = ""
+    last_updated: str = ""
+    explicit_members: list[str] = field(default_factory=list)
+    criteria: dict = field(default_factory=dict)
+    resolved_members: list[str] = field(default_factory=list)
+    sampling_strategy: str = "representative"
+    default_sample_size: int = 12
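
Review note: a small sketch of the serialization path that `Profile.to_dict` relies on. `dataclasses.asdict` already recurses into nested dataclasses, lists, and dicts, so the result can go straight to `json.dumps`. `Scores` and `MiniProfile` are hypothetical stand-ins for `OceanScores` and `Profile`, not classes from this PR.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Scores:  # hypothetical stand-in for OceanScores
    openness: float = 0.5


@dataclass
class MiniProfile:  # hypothetical stand-in for Profile
    handle: str = ""
    scores: Scores = field(default_factory=Scores)
    tags: list = field(default_factory=list)


profile = MiniProfile(handle="example", tags=["a"])

# asdict() converts the nested Scores instance as well
as_dict = asdict(profile)

# The result contains only JSON-native types, so it round-trips cleanly
round_trip = json.loads(json.dumps(as_dict, indent=2))
```

Because `asdict` handles the recursion, the inner `_convert` helper in `to_dict` is a second pass over data that is already plain dicts and lists.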
@@ -0,0 +1,280 @@
+"""
+REHOBOAM Storage Layer
+Directory management, profile I/O, index maintenance.
+"""
+
+import json
+import shutil
+from pathlib import Path
+from datetime import datetime, timedelta
+from typing import Optional
+
+BASE_DIR = Path.home() / ".hermes" / "rehoboam"
+PROFILES_DIR = BASE_DIR / "profiles"
+POPULATIONS_DIR = BASE_DIR / "populations"
+SIMULATIONS_DIR = BASE_DIR / "simulations"
+MONITORING_DIR = BASE_DIR / "monitoring"
+CONFIG_DIR = BASE_DIR / "config"
+
+
+def init_storage():
+    """Create all required directories."""
+    for d in [PROFILES_DIR, POPULATIONS_DIR, SIMULATIONS_DIR,
+              MONITORING_DIR, MONITORING_DIR / "alerts", CONFIG_DIR,
+              BASE_DIR / "db"]:
+        d.mkdir(parents=True, exist_ok=True)
+
+    # Create default configs if they don't exist
+    staleness_path = CONFIG_DIR / "staleness_policy.json"
+    if not staleness_path.exists():
+        staleness_path.write_text(json.dumps({
+            "thresholds": {
+                "fresh": {"max_age_hours": 72},
+                "stale": {"max_age_hours": 336},
+                "expired": {"max_age_hours": 2160},
+                "archived": {"max_age_hours": 8760}
+            },
+            "per_field_decay": {
+                "psychometrics": {"half_life_days": 180},
+                "stances": {"half_life_days": 30},
+                "posting_patterns": {"half_life_days": 60},
+                "relationships": {"half_life_days": 45},
+                "influence": {"half_life_days": 90},
+                "voice_fingerprint": {"half_life_days": 365}
+            },
+            "auto_refresh_on_simulation": True,
+            "auto_refresh_threshold": "stale"
+        }, indent=2))
+
+    config_path = CONFIG_DIR / "rehoboam.json"
+    if not config_path.exists():
+        config_path.write_text(json.dumps({
+            "version": "7.0",
+            "default_model": "claude-opus-4-20250514",
+            "max_thread_age_days": 30,
+            "monitoring_enabled": False,
+            "auto_thread": True,
+            "auto_profile_update": True
+        }, indent=2))
+
+    # Create indexes if they don't exist
+    for idx_path in [PROFILES_DIR / "_index.json", POPULATIONS_DIR / "_index.json",
+                     SIMULATIONS_DIR / "_index.json"]:
+        if not idx_path.exists():
+            idx_path.write_text("{}")
+
+
+def normalize_handle(handle: str) -> str:
+    """Normalize a handle to a filesystem-safe directory name."""
+    h = handle.strip().lstrip("@").lower()  # strip whitespace first so " @name" also loses its "@"
+    # Replace characters that are problematic in filenames
+    return h.replace("/", "_").replace("\\", "_")
+
+
+# -- Profile I/O --
+
+def get_profile_dir(handle: str) -> Path:
+    return PROFILES_DIR / normalize_handle(handle)
+
+
+def profile_exists(handle: str) -> bool:
+    return (get_profile_dir(handle) / "profile.json").exists()
+
+
+def load_profile(handle: str) -> Optional[dict]:
+    path = get_profile_dir(handle) / "profile.json"
+    if path.exists():
+        return json.loads(path.read_text())
+    return None
+
+
+def save_profile(handle: str, profile: dict, snapshot: bool = True):
+    """Save a profile, optionally snapshotting the old one."""
+    pdir = get_profile_dir(handle)
+    pdir.mkdir(parents=True, exist_ok=True)
+    (pdir / "history").mkdir(exist_ok=True)
+    (pdir / "raw").mkdir(exist_ok=True)
+    (pdir / "predictions").mkdir(exist_ok=True)
+
+    profile_path = pdir / "profile.json"
+
+    # Snapshot old profile before overwriting
+    if snapshot and profile_path.exists():
+        old = json.loads(profile_path.read_text())
+        ts = old.get("last_updated", datetime.utcnow().isoformat()).replace(":", "-")
+        snapshot_path = pdir / "history" / f"profile_{ts[:10]}.json"  # date-only name: a later save on the same day overwrites the snapshot
+        shutil.copy2(profile_path, snapshot_path)
+
+    profile_path.write_text(json.dumps(profile, indent=2))
+    _update_profile_index(handle, profile)
+
+
+def _update_profile_index(handle: str, profile: dict):
+    idx_path = PROFILES_DIR / "_index.json"
+    idx = json.loads(idx_path.read_text()) if idx_path.exists() else {}
+    idx[normalize_handle(handle)] = {
+        "platform": profile.get("platform", "x"),
+        "last_updated": profile.get("last_updated", ""),
+        "staleness": compute_staleness(profile.get("last_updated", "")),
+        "has_star_thread": (get_profile_dir(handle) / "star_thread.json").exists(),
+        "simulation_count": idx.get(normalize_handle(handle), {}).get("simulation_count", 0),
+        "display_name": profile.get("display_name", "")
+    }
+    idx_path.write_text(json.dumps(idx, indent=2))
+
+
+# -- Star Thread I/O --
+
+def load_star_thread(handle: str) -> Optional[dict]:
+    path = get_profile_dir(handle) / "star_thread.json"
+    if path.exists():
+        return json.loads(path.read_text())
+    return None
+
+
+def save_star_thread(handle: str, thread: dict):
+    path = get_profile_dir(handle) / "star_thread.json"
+    get_profile_dir(handle).mkdir(parents=True, exist_ok=True)
+    path.write_text(json.dumps(thread, indent=2))
+    # Update index to reflect thread existence
+    idx_path = PROFILES_DIR / "_index.json"
+    if idx_path.exists():
+        idx = json.loads(idx_path.read_text())
+        key = normalize_handle(handle)
+        if key in idx:
+            idx[key]["has_star_thread"] = True
+            idx_path.write_text(json.dumps(idx, indent=2))
+
+
+# -- Staleness --
+
+def compute_staleness(last_updated: str) -> str:
+    """Determine staleness level from a timestamp string."""
+    if not last_updated:
+        return "expired"
+    try:
+        dt = datetime.fromisoformat(last_updated.rstrip("Z"))
+    except ValueError:
+        return "expired"
+
+    age = datetime.utcnow() - dt
+    hours = age.total_seconds() / 3600
+
+    policy = _load_staleness_policy()
+    thresholds = policy.get("thresholds", {})
+
+    if hours <= thresholds.get("fresh", {}).get("max_age_hours", 72):
+        return "fresh"
+    elif hours <= thresholds.get("stale", {}).get("max_age_hours", 336):
+        return "stale"
+    elif hours <= thresholds.get("expired", {}).get("max_age_hours", 2160):
+        return "expired"
+    else:
+        return "archived"
+
+
+def _load_staleness_policy() -> dict:
+    path = CONFIG_DIR / "staleness_policy.json"
+    if path.exists():
+        return json.loads(path.read_text())
+    return {"thresholds": {"fresh": {"max_age_hours": 72}, "stale": {"max_age_hours": 336},
+                           "expired": {"max_age_hours": 2160}, "archived": {"max_age_hours": 8760}}}
+
+
+def needs_thread_recompute(handle: str) -> bool:
+    """Check if a star thread needs recomputation."""
+    thread = load_star_thread(handle)
+    if thread is None:
+        return True
+
+    profile = load_profile(handle)
+    if profile is None:
+        return True
+
+    # Thread is stale if the profile was updated after the thread was built (assumes both are ISO-8601 UTC strings, so string order matches time order)
+    thread_time = thread.get("based_on_profile_version", "")
+    profile_time = profile.get("last_updated", "")
+    if thread_time < profile_time:
+        return True
+
+    # Thread is stale if older than max_thread_age_days
+    config = json.loads((CONFIG_DIR / "rehoboam.json").read_text()) if (CONFIG_DIR / "rehoboam.json").exists() else {}
+    max_age = config.get("max_thread_age_days", 30)
+    try:
+        computed = datetime.fromisoformat(thread.get("computed_at", "").rstrip("Z"))
+        if (datetime.utcnow() - computed).days > max_age:
+            return True
+    except ValueError:
+        return True
+
+    return False
+
+
+# -- Simulation I/O --
+
+def save_simulation(sim_id: str, config: dict, output: dict, analytics: dict, audit: dict):
+    sdir = SIMULATIONS_DIR / sim_id
+    sdir.mkdir(parents=True, exist_ok=True)
+    (sdir / "config.json").write_text(json.dumps(config, indent=2))
+    (sdir / "output.json").write_text(json.dumps(output, indent=2))
+    (sdir / "analytics.json").write_text(json.dumps(analytics, indent=2))
+    (sdir / "audit.json").write_text(json.dumps(audit, indent=2))
+
+    # Update index
+    idx_path = SIMULATIONS_DIR / "_index.json"
+    idx = json.loads(idx_path.read_text()) if idx_path.exists() else {}
+    idx[sim_id] = {
+        "created_at": config.get("created_at", datetime.utcnow().isoformat() + "Z"),
+        "scenario": config.get("scenario", ""),
+        "participant_count": len(config.get("participants", [])),
+    }
+    idx_path.write_text(json.dumps(idx, indent=2))
+
+
+# -- Population I/O --
+
+def save_population(group_id: str, definition: dict, aggregate: dict = None):
+    pdir = POPULATIONS_DIR / group_id
+    pdir.mkdir(parents=True, exist_ok=True)
+    (pdir / "history").mkdir(exist_ok=True)
+    (pdir / "definition.json").write_text(json.dumps(definition, indent=2))
+    if aggregate:
+        (pdir / "aggregate.json").write_text(json.dumps(aggregate, indent=2))
+
+    idx_path = POPULATIONS_DIR / "_index.json"
+    idx = json.loads(idx_path.read_text()) if idx_path.exists() else {}
+    idx[group_id] = {
+        "name": definition.get("name", group_id),
+        "member_count": len(definition.get("resolved_members", definition.get("explicit_members", []))),
+        "last_updated": definition.get("last_updated", "")
+    }
+    idx_path.write_text(json.dumps(idx, indent=2))
+
+
+def load_population(group_id: str) -> Optional[dict]:
+    path = POPULATIONS_DIR / group_id / "definition.json"
+    if path.exists():
+        return json.loads(path.read_text())
+    return None
+
+
+# -- Listing --
+
+def list_profiles() -> dict:
+    idx_path = PROFILES_DIR / "_index.json"
+    return json.loads(idx_path.read_text()) if idx_path.exists() else {}
+
+
+def list_populations() -> dict:
+    idx_path = POPULATIONS_DIR / "_index.json"
+    return json.loads(idx_path.read_text()) if idx_path.exists() else {}
+
+
+def list_simulations() -> dict:
+    idx_path = SIMULATIONS_DIR / "_index.json"
+    return json.loads(idx_path.read_text()) if idx_path.exists() else {}
+
+
+if __name__ == "__main__":
+    init_storage()
+    print(f"Storage initialized at {BASE_DIR}")
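
Review note: a sketch of the threshold ladder that `compute_staleness` walks, with the clock passed in as a parameter so the behaviour can be checked deterministically. The hour limits are the defaults written by `init_storage`; `classify_age` and its `now` argument are hypothetical names for illustration, and timestamps are assumed to be naive UTC with an optional trailing `Z`.

```python
from datetime import datetime

# Default limits from staleness_policy.json, in ascending order
THRESHOLD_HOURS = [("fresh", 72), ("stale", 336), ("expired", 2160)]


def classify_age(last_updated: str, now: datetime) -> str:
    """Return the first label whose max age covers the timestamp."""
    if not last_updated:
        return "expired"
    try:
        dt = datetime.fromisoformat(last_updated.rstrip("Z"))
    except ValueError:
        return "expired"
    hours = (now - dt).total_seconds() / 3600
    for label, max_hours in THRESHOLD_HOURS:
        if hours <= max_hours:
            return label
    return "archived"  # older than every threshold


now = datetime(2026, 4, 8, 12, 0, 0)
one_day = classify_age("2026-04-07T12:00:00Z", now)    # 24 hours old
one_week = classify_age("2026-04-01T12:00:00Z", now)   # 168 hours old
two_months = classify_age("2026-02-01T12:00:00Z", now)  # 1584 hours old
over_a_year = classify_age("2025-01-01T00:00:00Z", now)
```

Injecting `now` is a design choice for testability; the function in the diff reads the clock itself via `datetime.utcnow()`.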
@@ -0,0 +1,139 @@
+#!/usr/bin/env python3
+"""
+Facebook Page/Profile Data Extractor
+Uses multiple techniques to extract public Facebook data without authentication:
+1. Googlebot UA for OG meta tags (name, description, likes, talking_about, bio, og:image)
+2. Graph API /picture endpoint for profile photos (pages only)
+3. Page Plugin embed for follower counts and page IDs
+"""
+
+import subprocess
+import json
+import re
+import html
+import sys
+
+GOOGLEBOT_UA = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
+
+def curl_get(url, ua=None):
+    """Fetch URL with curl"""
+    cmd = ['curl', '-s', '-L', '--max-time', '15']
+    if ua:
+        cmd += ['-H', f'User-Agent: {ua}']
+    cmd.append(url)
+    result = subprocess.run(cmd, capture_output=True, text=True, timeout=20)
+    return result.stdout
+
+def extract_og_data(username):
+    """Extract OG meta tags using Googlebot UA"""
+    content = curl_get(f'https://www.facebook.com/{username}', ua=GOOGLEBOT_UA)
+    
+    data = {}
+    
+    # Extract OG tags
+    og_title = re.search(r'og:title"\s*content="([^"]*)"', content)
+    if og_title:
+        data['name'] = html.unescape(og_title.group(1))
+    
+    og_desc = re.search(r'og:description"\s*content="([^"]*)"', content)
+    if og_desc:
+        desc = html.unescape(og_desc.group(1))
+        data['raw_description'] = desc
+        
+        # Parse likes count
+        likes_match = re.search(r'([\d,]+)\s+likes?', desc)
+        if likes_match:
+            data['likes'] = likes_match.group(1)
+        
+        # Parse talking about
+        talking_match = re.search(r'([\d,]+)\s+talking about this', desc)
+        if talking_match:
+            data['talking_about'] = talking_match.group(1)
+        
+        # Extract bio (text after the "talking about this." part)
+        bio_match = re.search(r'talking about this\.\s*(.+)', desc)
+        if bio_match:
+            data['bio'] = bio_match.group(1)
+    
+    og_image = re.search(r'og:image"\s*content="([^"]*)"', content)
+    if og_image:
+        data['og_image'] = html.unescape(og_image.group(1))
+    
+    return data
+
+def extract_plugin_data(username):
+    """Extract data from Page Plugin embed"""
+    content = curl_get(f'https://www.facebook.com/plugins/page.php?href=https://www.facebook.com/{username}&tabs=timeline&width=500&height=600')
+    
+    data = {}
+    
+    # Page name from title attribute
+    name_match = re.search(r'class="_1drp _5lv6" title="([^"]*)"', content)
+    if name_match:
+        data['plugin_name'] = html.unescape(name_match.group(1))
+    
+    # Follower count
+    followers_match = re.search(r'([\d,]+)\s+followers', content)
+    if followers_match:
+        data['followers'] = followers_match.group(1)
+    
+    # Page ID
+    pageid_match = re.search(r'"pageID":"(\d+)"', content)
+    if pageid_match:
+        data['page_id'] = pageid_match.group(1)
+    
+    return data
+
+def extract_profile_picture(username):
+    """Get profile picture via Graph API"""
+    content = curl_get(f'https://graph.facebook.com/v19.0/{username}/picture?redirect=false&width=400&height=400')
+    try:
+        d = json.loads(content)
+        if 'data' in d and not d['data'].get('is_silhouette', True):
+            return d['data']['url']
+    except (ValueError, KeyError, TypeError, AttributeError):  # bad JSON or unexpected response shape
+        pass
+    return None
+
+def get_facebook_data(username):
+    """Combine all extraction methods"""
+    result = {'username': username}
+    
+    # Method 1: OG tags (best for bio, likes, talking_about)
+    og = extract_og_data(username)
+    result.update(og)
+    
+    # Method 2: Plugin (best for followers, page_id)
+    plugin = extract_plugin_data(username)
+    result.update(plugin)
+    
+    # Method 3: Graph API picture (pages only)
+    pic = extract_profile_picture(username)
+    if pic:
+        result['profile_picture'] = pic
+    
+    # Also try by page_id for picture if username didn't work
+    if not pic and 'page_id' in result:
+        pic2 = extract_profile_picture(result['page_id'])
+        if pic2:
+            result['profile_picture'] = pic2
+    
+    return result
+
+if __name__ == '__main__':
+    targets = sys.argv[1:] if len(sys.argv) > 1 else ['zuck', 'NVIDIA', 'Meta', 'CocaCola']
+    
+    for target in targets:
+        print(f"{'='*60}")
+        print(f"Facebook Profile: {target}")
+        print(f"{'='*60}")
+        data = get_facebook_data(target)
+        for k, v in data.items():
+            if k == 'raw_description':
+                continue  # Skip raw, we show parsed fields
+            val = str(v)
+            if len(val) > 120:
+                val = val[:120] + '...'
+            print(f"  {k}: {val}")
+        print()
+
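
Review note: a sketch of the `og:description` parsing in `extract_og_data`, run on a fixed string instead of a live fetch. The `SAMPLE` markup is made up for illustration and is only assumed to resemble what the page returns. Counts are converted to `int` here, whereas the diff keeps them as strings, and the count pattern starts with `\d` so a lone comma cannot match.

```python
import html
import re


def parse_og_description(page_html: str) -> dict:
    """Pull likes, talking-about count, and bio out of an og:description tag."""
    data = {}
    m = re.search(r'og:description"\s*content="([^"]*)"', page_html)
    if not m:
        return data
    desc = html.unescape(m.group(1))  # decode entities such as &#183;

    likes = re.search(r'(\d[\d,]*)\s+likes?', desc)
    if likes:
        data["likes"] = int(likes.group(1).replace(",", ""))

    talking = re.search(r'(\d[\d,]*)\s+talking about this', desc)
    if talking:
        data["talking_about"] = int(talking.group(1).replace(",", ""))

    # Bio is whatever follows the "talking about this." sentence
    bio = re.search(r'talking about this\.\s*(.+)', desc)
    if bio:
        data["bio"] = bio.group(1)
    return data


# Hypothetical markup, not captured from a real page
SAMPLE = (
    '<meta property="og:description" '
    'content="Example Page. 1,234 likes &#183; 56 talking about this. '
    'A made-up bio." />'
)

parsed = parse_og_description(SAMPLE)
empty = parse_og_description("<html></html>")
```

Keeping the parser separate from `curl_get` lets the regexes be exercised without network access, which matters because the description format is not under this code's control.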
@@ -0,0 +1,595 @@
+"""
+Hermes Simulator — Intelligence Gathering Pipeline v2
+
+Full-spectrum OSINT research engine for personality modeling.
+Searches text, extracts content, browses live pages, analyzes
+images with vision, and cross-references across platforms.
+
+Run via execute_code. The agent adapts searches based on findings.
+"""
+
+from hermes_tools import web_search, web_extract, terminal
+import json
+import time
+import urllib.parse
+
+# ═══════════════════════════════════════════════════════════════
+# CONFIGURATION
+# ═══════════════════════════════════════════════════════════════
+
+AGGREGATOR_SITES = [
+    "buttondown.com/ainews",
+    "news.smol.ai",
+    "techmeme.com",
+    "latent.space",
+]
+
+# Verified working fallback data sources (tested April 2026)
+# Priority order: X API > nitter.cz > ThreadReaderApp > GitHub > Reddit > HN
+FALLBACK_SOURCES = {
+    "nitter": "https://nitter.cz/{handle}",           # web_extract — full timeline
+    "threadreader": "https://threadreaderapp.com/user/{handle}",  # web_extract — historical threads
+    "github_profile": "https://api.github.com/users/{handle}",   # curl — profile + README
+    "github_events": "https://api.github.com/users/{handle}/events",  # curl — recent activity
+    "reddit_user": "https://www.reddit.com/user/{handle}.json",  # curl w/ User-Agent
+    "reddit_comments": "https://www.reddit.com/user/{handle}/comments.json",
+    "hn_search": "https://hn.algolia.com/api/v1/search?query={handle}&tags=comment",
+}
+
+# CONFIRMED BLOCKED (don't waste calls on these):
+# - LinkedIn (web_extract blocked, browser auth wall)
+# - Instagram viewers (imginn, picuki, dumpoir, gramhir — all 403)
+# - Most nitter instances (dead or 403, ONLY nitter.cz works via web_extract)
+# - Wayback Machine for tweets (sparse, no JS content)
+# - Google Cache of Twitter (empty)
+# - Archive.today (429 + CAPTCHA)
+# - Twitter Syndication API (rate limited)
+
+AI_SUBREDDITS = [
+    "LocalLLaMA", "MachineLearning", "singularity",
+    "ChatGPT", "ClaudeAI", "OpenAI", "StableDiffusion",
+]
+
+PLATFORMS = ["twitter", "instagram", "linkedin", "github", "reddit", "youtube"]
+
+# ═══════════════════════════════════════════════════════════════
+# HELPER: safe web_search with validation
+# ═══════════════════════════════════════════════════════════════
+
+def _safe_web_search(query: str, limit: int = 5) -> list:
+    """Run web_search and return results list, with validation."""
+    r = web_search(query, limit=limit)
+    if not isinstance(r, dict) or "data" not in r:
+        print(f"  [WARNING] web_search returned no 'data' key for query: {query[:80]}")
+        return []
+    data = r.get("data", {})
+    if not isinstance(data, dict):
+        return []
+    return data.get("web", []) or []
+
+
+# ═══════════════════════════════════════════════════════════════
+# CORE SEARCH FUNCTIONS
+# ═══════════════════════════════════════════════════════════════
+
+def search_identity(handle: str) -> dict:
+    """Establish who they are across the internet."""
+    results = {}
+    results["twitter_identity"] = _safe_web_search(f"@{handle} twitter bio role company", limit=5)
+    results["general_identity"] = _safe_web_search(f"{handle} known for", limit=5)
+    return results
+
+
+def search_voice(handle: str) -> dict:
+    """How do they actually talk/write."""
+    results = {}
+    results["takes"] = _safe_web_search(f"{handle} twitter hot takes opinions", limit=5)
+
+    for agg in AGGREGATOR_SITES[:2]:
+        hits = _safe_web_search(f"site:{agg} {handle}", limit=3)
+        if hits:
+            # Use full domain as key, not split('.')[0]
+            results[f"agg_{agg}"] = hits
+    return results
+
+
+def search_positions(handle: str, topics: list = None, domain: str = None) -> dict:
+    """What are their known positions."""
+    results = {}
+    if topics:
+        for topic in topics[:3]:
+            results[f"topic_{topic}"] = _safe_web_search(f"{handle} {topic} opinion take", limit=5)
+
+    # Build controversy query — only add domain keywords if specified
+    controversy_query = f"{handle} debate disagree controversial"
+    if domain:
+        controversy_query += f" {domain}"
+    results["controversies"] = _safe_web_search(controversy_query, limit=5)
+    return results
+
+
+def search_longform(handle: str, real_name: str = None, domain: str = None) -> dict:
+    """Blogs, interviews, essays."""
+    results = {}
+    name = real_name or handle
+
+    blog_query = f"{name} blog substack essay"
+    interview_query = f"{name} interview podcast"
+    if domain:
+        blog_query += f" {domain}"
+        interview_query += f" {domain}"
+
+    results["blogs"] = _safe_web_search(blog_query, limit=5)
+    results["interviews"] = _safe_web_search(interview_query, limit=5)
+    return results
+
+
+# ═══════════════════════════════════════════════════════════════
+# CROSS-PLATFORM DISCOVERY
+# ═══════════════════════════════════════════════════════════════
+
+def discover_platforms(handle: str, real_name: str = None) -> dict:
+    """Find someone across all platforms."""
+    name = real_name or handle
+    results = {}
+
+    # Instagram
+    results["instagram"] = _safe_web_search(f"{name} instagram OR site:instagram.com/{handle}", limit=5)
+
+    # LinkedIn
+    results["linkedin"] = _safe_web_search(f"{name} linkedin OR site:linkedin.com/in", limit=5)
+
+    # Reddit
+    results["reddit"] = _safe_web_search(f"{name} reddit account OR site:reddit.com/user", limit=5)
+
+    # GitHub
+    results["github"] = _safe_web_search(f"{handle} github OR site:github.com/{handle}", limit=5)
+
+    # YouTube
+    results["youtube"] = _safe_web_search(f"{name} youtube channel OR talk OR interview", limit=5)
+
+    # Personal site
+    results["personal_site"] = _safe_web_search(f"{name} personal website blog about", limit=5)
+
+    # Hacker News
+    results["hackernews"] = _safe_web_search(f"site:news.ycombinator.com {handle} OR {name}", limit=3)
+
+    return results
+
+
+def discover_instagram(handle: str = None, real_name: str = None) -> dict:
+    """Focused Instagram discovery."""
+    results = {}
+    name = real_name or handle
+
+    # Try to find their IG handle
+    results["ig_search"] = _safe_web_search(f"{name} instagram profile", limit=5)
+
+    # If we have a candidate IG URL, try to extract
+    ig_urls = []
+    for item in results.get("ig_search", []):
+        if not isinstance(item, dict):
+            continue
+        url = item.get("url", "")
+        if "instagram.com/" in url and "/p/" not in url:
+            ig_urls.append(url)
+
+    if ig_urls:
+        # Try to extract IG profile page
+        r = web_extract(urls=ig_urls[:1])
+        results["ig_profile"] = r.get("results", [])
+
+    return results
+
+
+# ═══════════════════════════════════════════════════════════════
+# VISUAL INTELLIGENCE
+# ═══════════════════════════════════════════════════════════════
+
+# NOTE: These functions use browser_* and vision_analyze which are
+# NOT available in execute_code. They are called DIRECTLY by the
+# agent after the execute_code research phase.
+#
+# The agent should:
+# 1. Run this script via execute_code for text-based research
+# 2. Then use browser/vision tools directly for visual research
+#
+# Visual research tasks for the agent:
+#
+# INSTAGRAM VISUAL:
+#   browser_navigate("https://www.instagram.com/{ig_handle}/")
+#   browser_vision(question="Describe this Instagram profile: bio, pic, grid, aesthetic, follower count")
+#   browser_get_images()  # collect image URLs
+#   vision_analyze(image_url="{url}", question="Describe: setting, people, mood, style")
+#
+# PROFILE PIC ANALYSIS:
+#   vision_analyze(image_url="{pic_url}", question="Describe: appearance, clothing, setting, expression, professional vs casual")
+#
+# REVERSE IMAGE SEARCH (Yandex):
+#   # Upload to catbox if behind auth:
+#   terminal("curl -F 'reqtype=fileupload' -F 'fileToUpload=@{path}' https://catbox.moe/user/api.php")
+#   browser_navigate(f"https://yandex.com/images/search?rpt=imageview&url={encoded_url}")
+#
+# PAGE SCREENSHOT ANALYSIS:
+#   browser_vision(question="Read all text, usernames, post content, dates, engagement numbers")
+
+
+# ═══════════════════════════════════════════════════════════════
+# INTERACTION MAPPING
+# ═══════════════════════════════════════════════════════════════
+
+def search_interactions(handle: str, other_handles: list = None) -> dict:
+    """How they interact with other simulation targets."""
+    results = {}
+    if other_handles:
+        for other in other_handles[:4]:
+            hits = _safe_web_search(f"{handle} {other} twitter interaction debate reply", limit=3)
+            if hits:
+                results[f"with_{other}"] = hits
+    return results
+
+
+def search_social_graph(handle: str) -> dict:
+    """Who do they interact with most? Allies and rivals."""
+    results = {}
+
+    results["frequent_interactions"] = _safe_web_search(f"@{handle} twitter reply thread conversation with", limit=5)
+    results["conflicts"] = _safe_web_search(f"@{handle} disagree argue beef ratio", limit=5)
+    results["allies"] = _safe_web_search(f"@{handle} agree support endorse recommend", limit=5)
+
+    return results
+
+
+# ═══════════════════════════════════════════════════════════════
+# DEEP EXTRACTION
+# ═══════════════════════════════════════════════════════════════
+
+def extract_content(urls: list) -> list:
+    """Pull full content from high-value URLs."""
+    if not urls:
+        return []
+    r = web_extract(urls=urls[:3])
+    return r.get("results", [])
+
+
+def extract_best_urls(findings: dict, max_urls: int = 5) -> list:
+    """Find the most promising URLs in research findings for deep extraction."""
+    seen_urls = set()  # URL deduplication
+    priority_domains = [
+        "substack.com", "medium.com", "blog", "essay",
+        "interview", "podcast", "youtube.com", "arxiv.org",
+    ]
+
+    def score_url(url, desc):
+        score = 0
+        for domain in priority_domains:
+            if domain in url.lower() or domain in desc.lower():
+                score += 2
+        if any(w in desc.lower() for w in ["interview", "spoke", "told", "said", "wrote"]):
+            score += 1
+        return score
+
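+    candidates = []
+
+    # Match on the parsed hostname: a plain substring test for "x.com"
+    # would also skip unrelated hosts such as "dropbox.com".
+    from urllib.parse import urlparse
+    skip_hosts = ("x.com", "twitter.com", "instagram.com")
+
+    def is_skipped(url):
+        try:
+            host = (urlparse(url).hostname or "").lower()
+        except ValueError:
+            return True  # malformed URL, not worth extracting
+        return any(host == h or host.endswith("." + h) for h in skip_hosts)
+
+    def collect(obj):
+        if isinstance(obj, list):
+            for item in obj:
+                if isinstance(item, dict):
+                    url = item.get("url") or ""
+                    desc = item.get("description") or item.get("text") or ""
+                    if url and url not in seen_urls and not is_skipped(url):
+                        seen_urls.add(url)
+                        candidates.append((score_url(url, desc), url))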
+        elif isinstance(obj, dict):
+            for v in obj.values():
+                collect(v)
+
+    collect(findings)
+    candidates.sort(key=lambda x: -x[0])
+    return [url for _, url in candidates[:max_urls]]
+
+
+# ═══════════════════════════════════════════════════════════════
+# MAIN PIPELINE
+# ═══════════════════════════════════════════════════════════════
+
+def research_person(handle: str, fidelity: int = 70,
+                    topics: list = None,
+                    other_handles: list = None,
+                    real_name: str = None,
+                    domain: str = None) -> dict:
+    """
+    Full research pipeline for one person.
+    Returns dict with all findings organized by category.
+
+    Args:
+        handle: Twitter/X handle (without @)
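+        fidelity: Research depth 0-100. Phases are cumulative:
+                  0-30 identity only; 31-50 adds voice and positions;
+                  51-70 adds cross-platform discovery; 71+ adds longform,
+                  social graph, and interactions; 80+ adds Instagram;
+                  85+ adds deep extraction; 90+ queues profile pic tasks.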
+        topics: Specific topics to research
+        other_handles: Other people to check interactions with
+        real_name: Real name if different from handle
+        domain: Domain context (e.g., 'AI', 'politics', 'gaming').
+                When None, no domain keywords are added to searches.
+                When set, adds relevant domain keywords.
+    """
+    print(f"\n{'='*60}")
+    print(f"  RESEARCHING: @{handle} | Fidelity: {fidelity}%")
+    if domain:
+        print(f"  Domain: {domain}")
+    print(f"{'='*60}")
+
+    findings = {"handle": handle, "fidelity": fidelity, "visual_tasks": []}
+
+    # ─── Phase 1: Identity (always) ───
+    print(f"\n  [IDENTITY] Who are they...")
+    findings["identity"] = search_identity(handle)
+
+    if fidelity <= 30:
+        if topics:
+            findings["quick_topic"] = _safe_web_search(f"{handle} {topics[0]}", limit=3)
+        return findings
+
+    # ─── Phase 2: Voice (fidelity 31+) ───
+    print(f"\n  [VOICE] How do they talk...")
+    findings["voice"] = search_voice(handle)
+
+    # ─── Phase 3: Positions (fidelity 31+) ───
+    print(f"\n  [POSITIONS] What do they believe...")
+    findings["positions"] = search_positions(handle, topics, domain=domain)
+
+    if fidelity <= 50:
+        return findings
+
+    # ─── Phase 4: Cross-platform (fidelity 51+) ───
+    print(f"\n  [PLATFORMS] Finding them everywhere...")
+    findings["platforms"] = discover_platforms(handle, real_name)
+
+    if fidelity <= 70:
+        return findings
+
+    # ─── Phase 5: Longform (fidelity 71+) ───
+    print(f"\n  [LONGFORM] Blogs, interviews, essays...")
+    findings["longform"] = search_longform(handle, real_name, domain=domain)
+
+    # ─── Phase 6: Social graph (fidelity 71+) ───
+    print(f"\n  [SOCIAL GRAPH] Who do they interact with...")
+    findings["social_graph"] = search_social_graph(handle)
+
+    # ─── Phase 7: Interaction mapping (fidelity 71+) ───
+    if other_handles:
+        print(f"\n  [INTERACTIONS] With other targets: {other_handles}...")
+        findings["interactions"] = search_interactions(handle, other_handles)
+
+    # ─── Phase 8: Instagram deep dive (fidelity 80+) ───
+    if fidelity >= 80:
+        print(f"\n  [INSTAGRAM] Visual identity...")
+        findings["instagram"] = discover_instagram(handle, real_name)
+
+        # Queue visual tasks for the agent to do after execute_code
+        findings["visual_tasks"].append({
+            "type": "instagram_profile",
+            "instruction": "browser_navigate to Instagram profile, use browser_vision to analyze",
+            "handle": handle,
+        })
+
+    # ─── Phase 9: Deep extraction (fidelity 85+) ───
+    if fidelity >= 85:
+        print(f"\n  [DEEP EXTRACT] Pulling longform content...")
+        # extract_content processes at most 3 URLs, so request exactly 3
+        best_urls = extract_best_urls(findings, max_urls=3)
+        if best_urls:
+            print(f"    Extracting {len(best_urls)} URLs: {best_urls}")
+            findings["deep_extracts"] = extract_content(best_urls)
+
+    # ─── Phase 10: Profile pic analysis (fidelity 90+) ───
+    if fidelity >= 90:
+        findings["visual_tasks"].append({
+            "type": "profile_pic_analysis",
+            "instruction": "Find and analyze profile pictures across platforms with vision_analyze",
+            "handle": handle,
+        })
+        findings["visual_tasks"].append({
+            "type": "reverse_image_search",
+            "instruction": "Reverse image search profile pic via Yandex to find alt accounts",
+            "handle": handle,
+        })
+
+    return findings
+
+
+def research_all(handles: list, fidelity: int = 70,
+                 topics: list = None, domain: str = None) -> dict:
+    """Research all simulation targets."""
+    all_findings = {}
+
+    for handle in handles:
+        clean = handle.lstrip("@")
+        others = [h.lstrip("@") for h in handles if h.lstrip("@") != clean]
+
+        findings = research_person(
+            handle=clean,
+            fidelity=fidelity,
+            topics=topics,
+            other_handles=others,
+            domain=domain,
+        )
+        all_findings[clean] = findings
+
+    return all_findings
+
+
+# ═══════════════════════════════════════════════════════════════
+# REPORTING
+# ═══════════════════════════════════════════════════════════════
+
+def count_data_points(obj) -> int:
+    """Count every search result item in findings, regardless of text length."""
+    total = 0
+    if isinstance(obj, list):
+        # Every list entry counts here; count_quality_data_points applies
+        # the >50 character filter.
+        total += len(obj)
+    elif isinstance(obj, dict):
+        for k, v in obj.items():
+            # Skip metadata keys
+            if k in ("handle", "fidelity", "visual_tasks"):
+                continue
+            total += count_data_points(v)
+    return total
+
+
+def count_quality_data_points(obj) -> int:
+    """Count search result items with substantial text (description/text > 50 chars)."""
+    total = 0
+    if isinstance(obj, list):
+        for item in obj:
+            if isinstance(item, dict):
+                text = item.get("description") or item.get("text") or ""
+                if len(text) > 50:
+                    total += 1
+    elif isinstance(obj, dict):
+        for k, v in obj.items():
+            if k in ("handle", "fidelity", "visual_tasks"):
+                continue
+            total += count_quality_data_points(v)
+    return total
+
+
+def summarize_findings(findings: dict) -> str:
+    """Compact summary of what we found."""
+    handle = findings.get("handle", "unknown")
+    fidelity = findings.get("fidelity", 0)
+    total = count_data_points(findings)
+    quality = count_quality_data_points(findings)
+    visual_tasks = findings.get("visual_tasks", [])
+
+    lines = [
+        f"\n{'━'*60}",
+        f"  @{handle} | Fidelity: {fidelity}% | Data points: {total} ({quality} quality)",
+        f"{'━'*60}",
+    ]
+
+    # Identity snippets
+    identity = findings.get("identity", {})
+    for key in ["twitter_identity", "general_identity"]:
+        for item in identity.get(key, [])[:2]:
+            if not isinstance(item, dict):
+                continue
+            desc = (item.get("description") or "")[:180]
+            if desc:
+                lines.append(f"  [{key.upper()}] {desc}")
+
+    # Platform discovery results
+    platforms = findings.get("platforms", {})
+    found_platforms = []
+    for platform, items in platforms.items():
+        if isinstance(items, list) and len(items) > 0:
+            found_platforms.append(platform)
+    if found_platforms:
+        lines.append(f"  [PLATFORMS FOUND] {', '.join(found_platforms)}")
+
+    # Voice samples from aggregators
+    voice = findings.get("voice", {})
+    for key, items in voice.items():
+        if isinstance(items, list):
+            for item in items[:1]:
+                if not isinstance(item, dict):
+                    continue
+                desc = (item.get("description") or "")[:180]
+                if desc and handle.lower() in desc.lower():
+                    lines.append(f"  [VOICE] {desc}")
+
+    # Deep extracts
+    for extract in findings.get("deep_extracts", [])[:2]:
+        if not isinstance(extract, dict):
+            continue
+        title = extract.get("title", "untitled")
+        content = (extract.get("content") or "")[:200]
+        if content:
+            lines.append(f"  [LONGFORM: {title}] {content}...")
+
+    # Pending visual tasks
+    if visual_tasks:
+        lines.append(f"  [VISUAL TASKS QUEUED] {len(visual_tasks)} tasks for agent to execute:")
+        for task in visual_tasks:
+            lines.append(f"    → {task.get('type', '?')}: {task.get('instruction', '?')[:80]}")
+
+    # Confidence estimate — based on quality data points
+    if quality >= 30:
+        conf = "HIGH"
+    elif quality >= 15:
+        conf = "MEDIUM"
+    elif quality >= 5:
+        conf = "LOW"
+    else:
+        conf = "INSUFFICIENT"
+    lines.append(f"\n  CONFIDENCE: {conf} ({quality} quality data points, {total} total)")
+
+    return "\n".join(lines)
+
+
+def report_visual_tasks(all_findings: dict) -> str:
+    """Collect all visual tasks across all targets for agent to execute."""
+    lines = ["\n" + "═"*60, "  VISUAL INTELLIGENCE TASKS (agent must execute directly)", "═"*60]
+
+    any_tasks = False
+    for handle, findings in all_findings.items():
+        for task in findings.get("visual_tasks", []):
+            any_tasks = True
+            lines.append(f"\n  @{handle} — {task.get('type', '?')}:")
+            lines.append(f"    {task.get('instruction', '?')}")
+
+    if not any_tasks:
+        lines.append("  No visual tasks queued (fidelity < 80)")
+
+    return "\n".join(lines)
+
+
+# ═══════════════════════════════════════════════════════════════
+# CHECK AVAILABLE TOOLS
+# ═══════════════════════════════════════════════════════════════
+
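+def check_x_cli() -> bool:
+    """Check if x-cli is available."""
+    try:
+        # Distinct markers: "FOUND" is a substring of "NOT_FOUND", so testing
+        # for it would report x-cli as present even when it is missing.
+        r = terminal("which x-cli >/dev/null 2>&1 && echo 'XCLI_PRESENT' || echo 'XCLI_MISSING'")
+        return "XCLI_PRESENT" in r.get("output", "")
+    except Exception:
+        return False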
+
+
+# ═══════════════════════════════════════════════════════════════
+# ENTRY POINT
+# ═══════════════════════════════════════════════════════════════
+
+if __name__ == "__main__":
+    # ── CONFIGURE THESE ──
+    HANDLES = ["teknium1", "basedjensen"]
+    FIDELITY = 80
+    TOPICS = ["open source AI", "compute scaling"]
+    DOMAIN = None  # Set to 'AI', 'politics', etc. to add domain keywords
+    # ─────────────────────
+
+    has_xcli = check_x_cli()
+    print(f"x-cli available: {has_xcli}")
+    print(f"Targets: {HANDLES}")
+    print(f"Fidelity: {FIDELITY}%")
+    print(f"Topics: {TOPICS}")
+    print(f"Domain: {DOMAIN}")
+
+    results = research_all(HANDLES, fidelity=FIDELITY, topics=TOPICS, domain=DOMAIN)
+
+    for handle, findings in results.items():
+        print(summarize_findings(findings))
+
+    print(report_visual_tasks(results))
+    print("\n\nResearch phase complete. Agent should now:")
+    print("1. Execute any queued visual tasks (browser/vision)")
+    print("2. Compile dossiers from all findings")
+    print("3. Run simulation")
@@ -0,0 +1,238 @@
+#!/usr/bin/env python3
+"""
+Threads (Meta) Profile & Post Extractor
+========================================
+Extracts profile data and post content from Threads using:
+1. OG meta tags from HTML (no auth required for profiles and public posts)
+2. WebFinger for ActivityPub discovery
+3. Google-indexed post URLs for recent post discovery
+
+METHODS THAT WORK:
+- Profile pages at threads.net/@{user} have OG tags with:
+  display_name, username, follower_count, thread_count, bio, profile_pic
+- Individual post pages have OG tags with:
+  full post text, author info, profile pic
+- WebFinger at /.well-known/webfinger gives ActivityPub user IDs
+- Post URLs must be known (discoverable via web search)
+
+METHODS THAT DON'T WORK (as of 2025):
+- Threads Official API (graph.threads.net) requires OAuth token
+- ActivityPub /ap/users/ endpoints return 404 for most users
+- No public post listing endpoint exists
+"""
+
+import re
+import json
+import html
+import subprocess
+import sys
+
+def curl_fetch(url, extra_headers=None, timeout=15):
+    """Fetch URL using curl (more reliable than urllib for Threads)."""
+    cmd = ['curl', '-s', '-L', '--max-time', str(timeout)]
+    if extra_headers:
+        for k, v in extra_headers.items():
+            cmd.extend(['-H', f'{k}: {v}'])
+    cmd.append(url)
+    try:
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout+5)
+        return result.stdout
+    except (subprocess.SubprocessError, OSError):
+        # Timeout, or curl is not installed
+        return None
+
+def extract_og_tags(html_content):
+    """Extract OpenGraph, meta description, and Twitter tags from HTML."""
+    data = {}
+    if not html_content:
+        return data
+    
+    for m in re.finditer(r'property="(og:[^"]+)"\s+content="([^"]*)"', html_content):
+        key = m.group(1)
+        val = html.unescape(m.group(2))
+        if key not in data:
+            data[key] = val
+    
+    for m in re.finditer(r'name="description"\s+content="([^"]*)"', html_content):
+        data['description'] = html.unescape(m.group(1))
+        break
+    
+    for m in re.finditer(r'name="(twitter:[^"]+)"\s+content="([^"]*)"', html_content):
+        key = m.group(1)
+        val = html.unescape(m.group(2))
+        if key not in data:
+            data[key] = val
+    
+    return data
+
+def parse_profile_description(desc):
+    """Parse '5.5M Followers • 142 Threads • Bio. See the latest...' format."""
+    result = {}
+    if not desc:
+        return result
+    
+    parts = desc.split(' \u2022 ')  # Split on bullet •
+    for part in parts:
+        part = part.strip()
+        if 'Follower' in part:
+            result['followers'] = part.split(' Follower')[0].strip()
+        elif part.endswith('Threads') or part.endswith('Thread'):
+            result['thread_count'] = part.split(' Thread')[0].strip()
+        else:
+            bio = re.sub(r'\s*See the latest conversations.*$', '', part)
+            if bio:
+                result['bio'] = bio
+    
+    return result
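+
+# Illustrative sketch, not part of the original extractor: the 'followers'
+# and 'thread_count' values above stay as display strings such as '5.5M'.
+# If numeric values are needed, a helper along these lines could convert
+# them. The name count_to_int and the K/M/B suffix table are assumptions
+# based on the '5.5M Followers' example in the docstring above.
+def count_to_int(text):
+    """Convert a display count like '5.5M' or '1,204' to an int, or None."""
+    if not text:
+        return None
+    m = re.match(r'^([\d.,]+)\s*([KMB]?)$', text.strip(), re.IGNORECASE)
+    if not m:
+        return None
+    multiplier = {'': 1, 'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}[m.group(2).upper()]
+    try:
+        return int(round(float(m.group(1).replace(',', '')) * multiplier))
+    except ValueError:
+        # Malformed number such as '1.2.3'
+        return None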
+
+def parse_profile_title(title):
+    """Parse 'Display Name (@user) • Threads, Say more' format."""
+    result = {}
+    if not title:
+        return result
+    m = re.match(r'^(.+?)\s*\(@(\w+)\)', title)
+    if m:
+        result['display_name'] = m.group(1).strip()
+        result['username'] = m.group(2)
+    return result
+
+def get_threads_profile(username):
+    """
+    Get Threads profile data via OG meta tags.
+    Returns dict with: username, display_name, bio, followers, thread_count, 
+                       profile_picture_url, url
+    """
+    username = username.lstrip('@')
+    url = f'https://www.threads.net/@{username}'
+    
+    content = curl_fetch(url)
+    tags = extract_og_tags(content)
+    
+    if not tags or 'og:title' not in tags:
+        return {'error': 'Failed to fetch or parse profile', 'username': username}
+    
+    title = tags.get('og:title', '')
+    if title.startswith('Threads') and 'Log in' in title:
+        return {'error': 'Profile requires login or not found', 'username': username}
+    
+    result = {
+        'platform': 'threads',
+        'url': url,
+    }
+    
+    result.update(parse_profile_title(title))
+    result.update(parse_profile_description(tags.get('og:description', '')))
+    
+    if 'og:image' in tags:
+        result['profile_picture_url'] = tags['og:image']
+    
+    return result
+
+def get_threads_webfinger(username):
+    """Get WebFinger data (ActivityPub discovery) for a Threads user."""
+    username = username.lstrip('@')
+    url = f'https://www.threads.net/.well-known/webfinger?resource=acct:{username}@threads.net'
+    
+    content = curl_fetch(url, {'Accept': 'application/json'})
+    if not content:
+        return None
+    
+    try:
+        data = json.loads(content)
+        if 'error' in data or ('success' in data and not data['success']):
+            return None
+        
+        result = {'subject': data.get('subject', '')}
+        for link in data.get('links', []):
+            if link.get('type') == 'application/activity+json':
+                result['activitypub_url'] = link['href']
+            elif link.get('rel') == 'http://webfinger.net/rel/profile-page':
+                result['profile_url'] = link['href']
+        return result
+    except (ValueError, KeyError, AttributeError, TypeError):
+        # Invalid JSON or an unexpected response shape
+        return None
+
+def get_thread_post(post_url):
+    """
+    Get content of a specific Threads post via OG tags.
+    Returns: text, author, image_url
+    """
+    content = curl_fetch(post_url)
+    tags = extract_og_tags(content)
+    
+    if not tags or 'og:title' not in tags:
+        return {'error': 'Failed to fetch post'}
+    
+    title = tags.get('og:title', '')
+    if 'Log in' in title:
+        return {'error': 'Post requires login or not found'}
+    
+    result = {'url': post_url}
+    
+    if 'og:description' in tags:
+        result['text'] = tags['og:description']
+    elif 'description' in tags:
+        result['text'] = tags['description']
+    
+    if 'og:title' in tags:
+        # Parse "Display Name (@username) on Threads"
+        m = re.match(r'^(.+?)\s*\(@(\w+)\)\s+on\s+Threads', title)
+        if m:
+            result['author_name'] = m.group(1).strip()
+            result['author_username'] = m.group(2)
+    
+    if 'og:image' in tags:
+        result['image_url'] = tags['og:image']
+    
+    return result
+
+def get_threads_full(username):
+    """Get complete profile data combining all methods."""
+    profile = get_threads_profile(username)
+    wf = get_threads_webfinger(username)
+    
+    if wf:
+        profile['webfinger'] = wf
+    
+    return profile
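+
+# Illustrative sketch, not part of the original extractor: post URLs have to
+# be discovered elsewhere (for example via web search) before get_thread_post
+# can be used. A filter like this could pick Threads post URLs out of a list
+# of search-result URLs. The function name and the URL pattern are
+# assumptions inferred from the sample post URL used in the test block.
+def filter_thread_post_urls(urls, username=None):
+    """Return unique Threads post URLs, optionally limited to one username."""
+    pattern = re.compile(r'^https?://(?:www\.)?threads\.net/@([\w.]+)/post/([\w-]+)')
+    wanted = username.lstrip('@').lower() if username else None
+    seen, posts = set(), []
+    for url in urls or []:
+        m = pattern.match(url)
+        if not m or (wanted and m.group(1).lower() != wanted):
+            continue
+        canonical = m.group(0)  # drops query strings and trailing path parts
+        if canonical not in seen:
+            seen.add(canonical)
+            posts.append(canonical)
+    return posts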
+
+
+# ===== TEST =====
+if __name__ == '__main__':
+    test_users = sys.argv[1:] if len(sys.argv) > 1 else ['zuck', 'nvidia', 'mosseri']
+    
+    for user in test_users:
+        print(f"\n{'='*60}")
+        print(f"  THREADS PROFILE: @{user}")
+        print(f"{'='*60}")
+        
+        data = get_threads_full(user)
+        for k, v in sorted(data.items()):
+            if k == 'profile_picture_url':
+                print(f"  {k}: {str(v)[:80]}...")
+            elif k == 'webfinger':
+                print(f"  webfinger:")
+                for wk, wv in v.items():
+                    print(f"    {wk}: {wv}")
+            else:
+                print(f"  {k}: {v}")
+    
+    # Test posts
+    post_urls = [
+        'https://www.threads.net/@zuck/post/DEkvXzbyDS9',
+    ]
+    
+    print(f"\n{'='*60}")
+    print(f"  THREADS POSTS")
+    print(f"{'='*60}")
+    
+    for purl in post_urls:
+        print(f"\n  URL: {purl}")
+        post = get_thread_post(purl)
+        for k, v in post.items():
+            if k in ('image_url',):
+                print(f"  {k}: {str(v)[:80]}...")
+            elif k == 'text':
+                print(f"  {k}: {v[:300]}{'...' if len(v) > 300 else ''}")
+            else:
+                print(f"  {k}: {v}")
+
@@ -0,0 +1,305 @@
+"""
+TikTok Profile & Video Data Scraper
+====================================
+WORKING methods to get full TikTok profile data and video content.
+Tested and verified April 2026.
+
+METHODS SUMMARY:
+================
+METHOD 1 (BEST): HTML SSR Scraping - Parse __UNIVERSAL_DATA_FOR_REHYDRATION__
+  - Gets: FULL profile (bio, stats, follower/following/heart/video counts)
+  - Works: YES - Reliable, no auth needed, just curl + parse
+  - Limitation: No video list on profile page (videos load client-side)
+
+METHOD 2: oEmbed API - https://www.tiktok.com/oembed?url=...
+  - Gets: Video title/caption, author, thumbnail URL
+  - Works: YES - No auth, no rate limit issues
+  - Limitation: Need video IDs first; no engagement stats
+
+METHOD 3: tikwm.com API - https://www.tikwm.com/api/
+  - Gets: Full user info + individual video stats (plays, likes, comments, shares)
+  - User info: https://www.tikwm.com/api/user/info?unique_id={username}
+  - Video info: https://www.tikwm.com/api/?url={tiktok_video_url}
+  - Works: YES for user info and single videos
+  - Limitation: Posts list endpoint returns 403 (rate-limited)
+
+METHOD 4: Video ID Discovery via Search Engines
+  - Use web_search("site:tiktok.com/@{username}/video") to find video IDs
+  - Then use oEmbed or tikwm or HTML scraping per video
+  - Works: YES - Gets ~5 recent video IDs per search
+
+METHOD 5: SocialBlade via web_extract
+  - URL: https://socialblade.com/tiktok/user/{username}
+  - Gets: Followers, following, likes, videos, growth trends, rankings
+  - Works: YES via web_extract tool
+
+METHOD 6: Individual Video HTML Scraping
+  - Fetch https://www.tiktok.com/@{user}/video/{id}
+  - Parse __UNIVERSAL_DATA_FOR_REHYDRATION__: webapp.video-detail -> itemInfo.itemStruct
+  - Gets: FULL video data (caption, stats, music, hashtags, duration)
+  - Works: YES - Most complete per-video data
+
+NOT WORKING:
+  - TikTok /api/user/detail/ endpoint -> returns empty (needs signed params)
+  - TikTok /api/post/item_list/ -> returns empty (needs x-bogus/msToken)
+  - tikwm.com /api/user/posts -> 403 forbidden
+"""
+
+import re
+import json
+import subprocess
+import urllib.parse
+
+USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
+
+
+def fetch_url(url, headers=None):
+    """Fetch URL via curl and return content."""
+    cmd = ['curl', '-s', '-L', '-m', '30', url,
+           '-H', f'User-Agent: {USER_AGENT}',
+           '-H', 'Accept-Language: en-US,en;q=0.9']
+    if headers:
+        for k, v in headers.items():
+            cmd.extend(['-H', f'{k}: {v}'])
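+    try:
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=35)
+    except (subprocess.SubprocessError, OSError):
+        # Timeout or missing curl binary: return empty content so callers
+        # fall through to their "no data" paths
+        return ''
+    return result.stdout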
+
+
+def method1_html_profile(username):
+    """
+    METHOD 1: Scrape TikTok profile HTML and parse SSR JSON data.
+    Returns full profile with stats.
+    """
+    url = f'https://www.tiktok.com/@{username}'
+    html = fetch_url(url)
+
+    m = re.search(
+        r'<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">(.*?)</script>',
+        html
+    )
+    if not m:
+        return None
+
+    try:
+        data = json.loads(m.group(1))
+    except json.JSONDecodeError:
+        return None
+    scope = data.get('__DEFAULT_SCOPE__', {})
+    user_detail = scope.get('webapp.user-detail', {})
+    user_info = user_detail.get('userInfo', {})
+
+    if not user_info:
+        return None
+
+    user = user_info.get('user', {})
+    stats = user_info.get('statsV2', user_info.get('stats', {}))
+
+    return {
+        'id': user.get('id'),
+        'username': user.get('uniqueId'),
+        'nickname': user.get('nickname'),
+        'bio': user.get('signature'),
+        'verified': user.get('verified'),
+        'private': user.get('privateAccount'),
+        'secUid': user.get('secUid'),
+        'avatarLarger': user.get('avatarLarger'),
+        'bioLink': user.get('bioLink', {}),
+        'createTime': user.get('createTime'),
+        'language': user.get('language'),
+        'stats': {
+            'followers': int(stats.get('followerCount', 0)),
+            'following': int(stats.get('followingCount', 0)),
+            'hearts': int(stats.get('heartCount', 0)),
+            'videos': int(stats.get('videoCount', 0)),
+            'diggs': int(stats.get('diggCount', 0)),
+            'friends': int(stats.get('friendCount', 0)),
+        }
+    }
+
+
+def method2_oembed_video(username, video_id):
+    """
+    METHOD 2: Get video caption/title via oEmbed.
+    No auth needed. Returns caption, author, thumbnail.
+    """
+    url = f'https://www.tiktok.com/oembed?url=https://www.tiktok.com/@{username}/video/{video_id}'
+    content = fetch_url(url)
+    try:
+        data = json.loads(content)
+        return {
+            'video_id': video_id,
+            'title': data.get('title', ''),
+            'author_name': data.get('author_name'),
+            'author_url': data.get('author_url'),
+            'thumbnail_url': data.get('thumbnail_url'),
+            'thumbnail_width': data.get('thumbnail_width'),
+            'thumbnail_height': data.get('thumbnail_height'),
+        }
+    except json.JSONDecodeError:
+        return None
+
+
+def method3_tikwm_user(username):
+    """
+    METHOD 3a: Get user info via tikwm.com API.
+    """
+    url = f'https://www.tikwm.com/api/user/info?unique_id={username}'
+    content = fetch_url(url)
+    try:
+        data = json.loads(content)
+        if data.get('code') == 0:
+            return data['data']
+    except json.JSONDecodeError:
+        pass
+    return None
+
+
+def method3_tikwm_video(video_url):
+    """
+    METHOD 3b: Get video details via tikwm.com API.
+    Returns: title, play_count, digg_count, comment_count, share_count, duration, download URLs
+    """
+    # safe='' also percent-encodes '/' so the video URL is a single query value
+    url = f'https://www.tikwm.com/api/?url={urllib.parse.quote(video_url, safe="")}'
+    content = fetch_url(url)
+    try:
+        data = json.loads(content)
+        if data.get('code') == 0:
+            v = data['data']
+            return {
+                'video_id': v.get('id'),
+                'title': v.get('title'),
+                'duration': v.get('duration'),
+                'play_count': v.get('play_count'),
+                'likes': v.get('digg_count'),
+                'comments': v.get('comment_count'),
+                'shares': v.get('share_count'),
+                'author': v.get('author', {}).get('unique_id'),
+                'music_title': v.get('music_info', {}).get('title') if v.get('music_info') else None,
+                'cover_url': v.get('origin_cover') or v.get('cover'),
+                'play_url': v.get('play'),  # direct video URL
+            }
+    except json.JSONDecodeError:
+        pass
+    return None
+
+
+def method6_html_video(username, video_id):
+    """
+    METHOD 6: Scrape individual video page HTML for full data.
+    Gets: caption, full stats, music, hashtags, create time.
+    """
+    url = f'https://www.tiktok.com/@{username}/video/{video_id}'
+    html = fetch_url(url)
+
+    m = re.search(
+        r'<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">(.*?)</script>',
+        html
+    )
+    if not m:
+        return None
+
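+    # Malformed embedded JSON is treated like a missing script tag
+    try:
+        data = json.loads(m.group(1))
+    except json.JSONDecodeError:
+        return None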
+    scope = data.get('__DEFAULT_SCOPE__', {})
+    vd = scope.get('webapp.video-detail', {})
+    item = vd.get('itemInfo', {}).get('itemStruct', {})
+
+    if not item:
+        return None
+
+    stats = item.get('statsV2', item.get('stats', {}))
+    music = item.get('music', {})
+    challenges = item.get('challenges', [])
+
+    return {
+        'video_id': item.get('id'),
+        'description': item.get('desc'),
+        'createTime': item.get('createTime'),
+        'duration': item.get('video', {}).get('duration'),
+        'stats': {
+            'plays': int(stats.get('playCount', 0)),
+            'likes': int(stats.get('diggCount', 0)),
+            'comments': int(stats.get('commentCount', 0)),
+            'shares': int(stats.get('shareCount', 0)),
+            'saves': int(stats.get('collectCount', 0)),
+        },
+        'music': {
+            'title': music.get('title'),
+            'author': music.get('authorName'),
+        },
+        'hashtags': [c.get('title', '') for c in challenges],
+        'author': item.get('author', {}).get('uniqueId'),
+    }
+
+
+def get_full_tiktok_profile(username):
+    """
+    Complete pipeline: Get full profile + discover and scrape recent videos.
+    
+    Returns dict with profile data, stats, and recent video details.
+    """
+    # Step 1: Get profile data
+    profile = method1_html_profile(username)
+    if not profile:
+        return {'error': f'Could not fetch profile for @{username}'}
+
+    result = {
+        'profile': profile,
+        'videos': [],
+        'data_sources': ['tiktok_html_ssr'],
+    }
+
+    # Note: Video discovery requires web_search tool (not available in pure Python)
+    # In the agent context, use:
+    #   web_search(f"site:tiktok.com/@{username}/video")
+    # Then for each video ID found, call method6_html_video() or method2_oembed_video()
+    
+    return result
+
+
+if __name__ == '__main__':
+    import sys
+    username = sys.argv[1] if len(sys.argv) > 1 else 'khaby.lame'
+    
+    print(f'=== Testing TikTok scraping for @{username} ===\n')
+    
+    print('--- METHOD 1: HTML Profile Scraping ---')
+    profile = method1_html_profile(username)
+    if profile:
+        print(f'  Username: {profile["username"]}')
+        print(f'  Nickname: {profile["nickname"]}')
+        print(f'  Bio: {profile["bio"][:100]}')
+        print(f'  Verified: {profile["verified"]}')
+        print(f'  Followers: {profile["stats"]["followers"]:,}')
+        print(f'  Following: {profile["stats"]["following"]:,}')
+        print(f'  Hearts: {profile["stats"]["hearts"]:,}')
+        print(f'  Videos: {profile["stats"]["videos"]:,}')
+        print(f'  SecUid: {profile["secUid"][:50]}...')
+    else:
+        print('  FAILED')
+    
+    print('\n--- METHOD 3a: tikwm.com User API ---')
+    tikwm_user = method3_tikwm_user(username)
+    if tikwm_user:
+        s = tikwm_user.get('stats', {})
+        # Fall back to 0 so the ',' format spec never receives None
+        print(f'  Followers: {s.get("followerCount") or 0:,}')
+        print(f'  Hearts: {s.get("heartCount") or 0:,}')
+        print(f'  Videos: {s.get("videoCount") or 0:,}')
+    else:
+        print('  FAILED')
+    
+    # Test with a known video
+    test_video_id = '7615318641042623775'  # khaby birthday video
+    if username == 'khaby.lame':
+        print(f'\n--- METHOD 2: oEmbed for video {test_video_id} ---')
+        oembed = method2_oembed_video(username, test_video_id)
+        if oembed:
+            print(f'  Title: {oembed["title"][:80]}')
+        
+        print(f'\n--- METHOD 6: HTML Video Scraping for {test_video_id} ---')
+        video = method6_html_video(username, test_video_id)
+        if video:
+            print(f'  Description: {video["description"][:80]}')
+            print(f'  Plays: {video["stats"]["plays"]:,}')
+            print(f'  Likes: {video["stats"]["likes"]:,}')
+            print(f'  Comments: {video["stats"]["comments"]:,}')
+            print(f'  Shares: {video["stats"]["shares"]:,}')
+            print(f'  Hashtags: {video["hashtags"]}')
+    
+    print('\n=== DONE ===')
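The pipeline in `get_full_tiktok_profile()` leaves video discovery to a search tool and then expects a video ID per result. A minimal sketch of that hand-off follows, assuming the search step yields a plain list of URL strings. The helper name `extract_video_ids` is hypothetical and does not exist in the script above.

```python
import re

# Hypothetical helper, not part of the original script.
# Matches URLs of the form tiktok.com/@<username>/video/<numeric id>.
_VIDEO_URL_RE = re.compile(r'tiktok\.com/@([\w.]+)/video/(\d+)')


def extract_video_ids(urls, username):
    """Return unique video IDs for `username`, in first-seen order."""
    wanted = username.lstrip('@').lower()
    ids = []
    for url in urls:
        m = _VIDEO_URL_RE.search(url)
        if not m:
            continue  # not a TikTok video URL
        if m.group(1).lower() != wanted:
            continue  # video belongs to a different account
        if m.group(2) not in ids:
            ids.append(m.group(2))
    return ids
```

Each returned ID could then be passed to `method6_html_video(username, video_id)`, with `method2_oembed_video()` as a lighter alternative when only the title is needed.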
@@ -0,0 +1,260 @@
+"""
+Direct X/Twitter API v2 client for Hermes Simulator.
+No x-cli dependency — uses curl via terminal() with bearer token.
+
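+Provides:
+- get_user(handle): profile, bio, metrics
+- get_user_by_id(user_id): profile lookup by numeric ID
+- get_tweets(user_id, count): recent original tweets with metrics
+- get_tweets_with_rts(user_id, count): recent tweets including retweets
+- search_tweets(query, count): recent-search results
+- profile_user, profile_interactions, get_voice_data: higher-level helpers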
+"""
+
+from hermes_tools import terminal
+import json
+import os
+import time
+import urllib.parse
+
+# Bearer token: read from the X_BEARER_TOKEN env var (empty string if unset).
+# When run as a script, the entry point below also tries ~/.hermes/.env.
+BEARER = os.environ.get("X_BEARER_TOKEN", "")
+
+MAX_RETRIES = 3
+BASE_DELAY = 2  # seconds, exponential backoff: 2s, 4s, 8s
+
+
+def _api_get(endpoint: str, params: dict = None) -> dict:
+    """Make authenticated GET request to X API v2 with retry and error handling."""
+    url = f"https://api.twitter.com/2/{endpoint}"
+    if params:
+        qs = "&".join(f"{k}={urllib.parse.quote(str(v))}" for k, v in params.items())
+        url += f"?{qs}"
+
+    for attempt in range(MAX_RETRIES):
+        try:
+            r = terminal(f'curl -s -w \'\\n%{{http_code}}\' -H "Authorization: Bearer {BEARER}" "{url}"')
+            output = r.get("output", "").strip()
+
+            # Split body from status code (last line)
+            lines = output.rsplit("\n", 1)
+            if len(lines) == 2:
+                body, status_str = lines
+            else:
+                body = output
+                status_str = "0"
+
+            try:
+                status_code = int(status_str.strip())
+            except ValueError:
+                status_code = 0
+
+            # Handle specific status codes
+            if status_code == 429:
+                # Rate limited — retry with backoff
+                delay = BASE_DELAY * (2 ** attempt)
+                print(f"  [X API] Rate limited (429). Retry {attempt+1}/{MAX_RETRIES} in {delay}s...")
+                time.sleep(delay)
+                continue
+
+            if status_code in (401, 403):
+                return {"error": f"Authentication failed (HTTP {status_code}). Check X_BEARER_TOKEN.", "http_status": status_code}
+
+            if status_code >= 500:
+                delay = BASE_DELAY * (2 ** attempt)
+                print(f"  [X API] Server error ({status_code}). Retry {attempt+1}/{MAX_RETRIES} in {delay}s...")
+                time.sleep(delay)
+                continue
+
+            if status_code == 0 and not body:
+                # Network error — no response at all
+                delay = BASE_DELAY * (2 ** attempt)
+                print(f"  [X API] Network error. Retry {attempt+1}/{MAX_RETRIES} in {delay}s...")
+                time.sleep(delay)
+                continue
+
+            try:
+                return json.loads(body)
+            except json.JSONDecodeError:
+                return {"error": f"Failed to parse response (HTTP {status_code}): {body[:200]}"}
+
+        except Exception as e:
+            delay = BASE_DELAY * (2 ** attempt)
+            print(f"  [X API] Exception: {e}. Retry {attempt+1}/{MAX_RETRIES} in {delay}s...")
+            time.sleep(delay)
+            continue
+
+    return {"error": f"All {MAX_RETRIES} retries exhausted for {endpoint}"}
+
+
+def get_user(handle: str) -> dict:
+    """Get user profile by handle."""
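+    # Percent-encode the handle: it is placed in the URL path and then into a
+    # shell command, so it must not carry quotes or other shell metacharacters.
+    handle = urllib.parse.quote(handle.lstrip("@"), safe="")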
+    return _api_get(f"users/by/username/{handle}", {
+        "user.fields": "description,public_metrics,profile_image_url,created_at,location,url"
+    })
+
+
+def get_tweets(user_id: str, count: int = 20) -> dict:
+    """Get user's recent tweets."""
+    return _api_get(f"users/{user_id}/tweets", {
+        "max_results": max(min(count, 100), 5),
+        "tweet.fields": "created_at,public_metrics,text,in_reply_to_user_id,referenced_tweets",
+        "exclude": "retweets"  # original tweets only for voice analysis
+    })
+
+
+def get_tweets_with_rts(user_id: str, count: int = 20) -> dict:
+    """Get user's recent tweets including retweets (shows interests)."""
+    return _api_get(f"users/{user_id}/tweets", {
+        "max_results": max(min(count, 100), 5),
+        "tweet.fields": "created_at,public_metrics,text,referenced_tweets"
+    })
+
+
+def search_tweets(query: str, count: int = 10) -> dict:
+    """Search recent tweets."""
+    return _api_get("tweets/search/recent", {
+        "query": query,
+        "max_results": max(min(count, 100), 10),
+        "tweet.fields": "created_at,public_metrics,text,author_id"
+    })
+
+
+def get_user_by_id(user_id: str) -> dict:
+    """Get user profile by ID."""
+    return _api_get(f"users/{user_id}", {
+        "user.fields": "description,public_metrics,username,name"
+    })
+
+
+# ═══════════════════════════════════════════════════════════════
+# HIGH-LEVEL INTELLIGENCE FUNCTIONS
+# ═══════════════════════════════════════════════════════════════
+
+def profile_user(handle: str) -> dict:
+    """Full profile pull: identity + recent tweets (originals only)."""
+    user = get_user(handle)
+    if "errors" in user or "error" in user:
+        return {"error": f"User @{handle} not found", "details": user}
+
+    user_data = user.get("data", {})
+    user_id = user_data.get("id")
+
+    result = {
+        "profile": user_data,
+        "tweets": [],
+        "voice_samples": [],
+    }
+
+    if user_id:
+        # Get original tweets (no RTs) for voice analysis
+        tweets = get_tweets(user_id, 20)
+        tweet_list = tweets.get("data", [])
+        result["tweets"] = tweet_list
+
+        # Extract pure text samples for voice profiling
+        # Only exclude retweets and actual replies (has in_reply_to_user_id)
+        # Tweets starting with @ are fine if they're standalone mentions
+        result["voice_samples"] = [
+            t["text"] for t in tweet_list
+            if not t.get("text", "").startswith("RT @")
+            and not t.get("in_reply_to_user_id")
+        ]
+
+    return result
+
+
+def profile_interactions(handle1: str, handle2: str) -> dict:
+    """Find interactions between two users."""
+    # Search for replies from handle1 to handle2
+    q1 = f"from:{handle1} to:{handle2}"
+    q2 = f"from:{handle2} to:{handle1}"
+
+    r1 = search_tweets(q1, 10)
+    r2 = search_tweets(q2, 10)
+
+    return {
+        f"{handle1}_to_{handle2}": r1.get("data", []),
+        f"{handle2}_to_{handle1}": r2.get("data", []),
+    }
+
+
+def get_voice_data(handle: str, count: int = 50) -> dict:
+    """Pull maximum voice data: tweets, replies, quote tweets.
+    Returns categorized samples for voice profiling."""
+    user = get_user(handle)
+    if "errors" in user or "error" in user:
+        return {"error": f"User @{handle} not found"}
+
+    user_data = user.get("data", {})
+    user_id = user_data.get("id")
+    if not user_id:
+        return {"error": "No user ID found"}
+
+    # Original tweets (exclude RTs)
+    originals = get_tweets(user_id, min(count, 100))
+    original_list = originals.get("data", [])
+
+    # Categorize — only use in_reply_to_user_id to detect replies
+    standalone = []  # not replies
+    replies = []     # replies to others
+
+    for t in original_list:
+        text = t.get("text", "")
+        if t.get("in_reply_to_user_id"):
+            replies.append(text)
+        else:
+            standalone.append(text)
+
+    return {
+        "profile": user_data,
+        "standalone_tweets": standalone,  # their voice at rest
+        "replies": replies,               # their voice in conversation
+        "total_samples": len(standalone) + len(replies),
+    }
+
+
+# ═══════════════════════════════════════════════════════════════
+# ENTRY POINT
+# ═══════════════════════════════════════════════════════════════
+
+if __name__ == "__main__":
+    if not BEARER:
+        print("ERROR: X_BEARER_TOKEN not set. Set it in environment or ~/.hermes/.env")
+        print("Trying to load from .env...")
+        try:
+            with open(os.path.expanduser("~/.hermes/.env")) as f:
+                for line in f:
+                    line = line.strip()
+                    if line.startswith("X_BEARER_TOKEN="):
+                        # Use split with maxsplit=1 to handle values with '=' in them
+                        # Also strip surrounding quotes if present
+                        val = line.split("=", 1)[1]
+                        if val and val[0] in ('"', "'") and val[-1] == val[0]:
+                            val = val[1:-1]
+                        BEARER = val
+                        break
+        except Exception as e:
+            print(f"  Failed to load .env: {e}")
+
+    if not BEARER:
+        print("FATAL: No bearer token found.")
+        # The bare exit() builtin comes from the site module; SystemExit always works
+        raise SystemExit(1)
+
+    # Demo: profile two users
+    for handle in ["Teknium", "basedjensen"]:
+        print(f"\n{'='*60}")
+        print(f"  PROFILING @{handle}")
+        print(f"{'='*60}")
+
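+        data = profile_user(handle)
+        if "error" in data:
+            # Lookup failed (auth, rate limit, or unknown user); skip this handle
+            print(f"  {data['error']}")
+            continue
+        profile = data.get("profile", {})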
+        print(f"  Name: {profile.get('name')}")
+        print(f"  Bio: {profile.get('description')}")
+        metrics = profile.get("public_metrics", {})
+        print(f"  Followers: {metrics.get('followers_count')}")
+        print(f"  Tweets: {metrics.get('tweet_count')}")
+        print(f"  Likes given: {metrics.get('like_count')}")
+
+        print(f"\n  Voice samples ({len(data.get('voice_samples', []))}):")
+        for sample in data.get("voice_samples", [])[:5]:
+            print(f"    > {sample[:120]}")
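The two pieces of `_api_get()` that are easiest to get wrong are separating the response body from the status code that `curl -w` appends, and computing the retry delays. The sketch below isolates both as pure functions so they can be checked without network access. The names `split_curl_output` and `backoff_delays` are hypothetical and are not part of the client above; the behavior is meant to mirror the inline logic, including treating a response with no newline as body-only with status 0.

```python
def split_curl_output(output: str) -> tuple[str, int]:
    """Split curl output whose last line is the HTTP status code.

    Hypothetical helper mirroring the inline parsing in _api_get().
    Returns (body, status); status is 0 when it cannot be determined.
    """
    text = output.strip()
    body, sep, status_str = text.rpartition("\n")
    if not sep:
        # No newline at all: keep everything as the body, status unknown
        return text, 0
    try:
        return body, int(status_str.strip())
    except ValueError:
        # Last line was not a number: keep the body, status unknown
        return body, 0


def backoff_delays(max_retries: int = 3, base_delay: int = 2) -> list:
    """Exponential backoff schedule: base_delay * 2**attempt per attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]
```

With the module defaults (`MAX_RETRIES = 3`, `BASE_DELAY = 2`) the schedule is 2s, 4s, 8s, matching the comment next to `BASE_DELAY`.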
@@ -0,0 +1,136 @@
+# DOSSIER: {display_name} (@{handle})
+
+## Identity
+- **Name**: {real_name}
+- **Handle(s)**: @{twitter} | u/{reddit} | {discord_tag}
+- **Role**: {role_and_org}
+- **Known for**: {what_they_are_famous_for}
+- **Followers/reach**: {approximate_follower_count}
+- **Confidence**: {HIGH|MEDIUM|LOW} — {confidence_reason}
+
+## Voice Profile
+
+### Linguistic Patterns
+- **Sentence structure**: {short_punchy | long_flowing | mixed}
+- **Capitalization**: {normal | all_lowercase | CAPS_FOR_EMPHASIS | mixed}
+- **Punctuation**: {heavy_periods | ellipsis_lover | no_punctuation | exclamation_marks}
+- **Paragraph style**: {one_liners | thread_essays | medium_blocks}
+- **Emoji/emoticon usage**: {none | minimal | heavy | specific_ones}
+
+### Vocabulary & Slang
+- **Register**: {academic | casual | shitposter | mixed}
+- **Recurring words/phrases**: [list of signature words they use a lot]
+- **Catchphrases**: [any repeated phrases or running jokes]
+- **Profanity level**: {none | mild | moderate | heavy}
+- **Jargon tendency**: {explains_everything | assumes_expertise | mixes}
+
+### Tone
+- **Default mood**: {earnest | ironic | combative | chill | manic | analytical}
+- **Humor style**: {deadpan | absurdist | sarcastic | wholesome | shitpost | none}
+- **How they handle disagreement**: {engages_thoughtfully | dunks | ignores | ratio_warrior | passive_aggressive}
+- **How they handle praise**: {deflects | accepts_gracefully | awkward | flexes}
+
+## Positions & Beliefs
+
+### Core Convictions (things they consistently advocate for)
+1. {conviction_1}
+2. {conviction_2}
+3. {conviction_3}
+
+### Known Hot Takes
+1. {take_1}
+2. {take_2}
+
+### Hills They'll Die On
+1. {hill_1}
+2. {hill_2}
+
+### Topics They Avoid or Refuse to Engage
+1. {avoidance_1}
+
+## Social Dynamics
+
+### People They Interact With Positively
+- @{ally_1} — {relationship_description}
+- @{ally_2} — {relationship_description}
+
+### People They Beef With / Disagree With
+- @{rival_1} — {beef_description}
+
+### How They Engage Different Types
+- **Fans/supporters**: {how_they_respond}
+- **Critics**: {how_they_respond}
+- **Peers**: {how_they_respond}
+- **Random people**: {how_they_respond}
+
+## Platform-Specific Behavior
+
+### On Twitter/X
+- **Post frequency**: {multiple_daily | daily | few_per_week}
+- **Thread tendency**: {never | sometimes | loves_threads}
+- **QRT style**: {adds_context | dunks | amplifies}
+- **Engagement style**: {likes_a_lot | rarely_likes | retweets_heavy}
+
+### On Reddit (if applicable)
+- **Subreddits**: [list]
+- **Comment style**: {detailed | brief | combative}
+
+### On Discord (if applicable)
+- **Servers**: [known servers]
+- **Vibe shift from Twitter**: {description}
+
+## Signature Moves
+Things this person characteristically does that make them recognizable:
+1. {signature_move_1}
+2. {signature_move_2}
+3. {signature_move_3}
+
+## Sample Quotes (real, sourced from research)
+> "{actual_quote_1}" — [source/context]
+> "{actual_quote_2}" — [source/context]
+> "{actual_quote_3}" — [source/context]
+
+## Deep Psychometric Profile
+- **Big Five**: O{H/M/L} C{} E{} A{} N{} — {evidence}
+- **Moral Foundations**: Care{} Fair{} Loyal{} Auth{} Sanct{} Liberty{} — {what drives their ethics}
+- **Schwartz Values**: {dominant values} — {how they justify positions}
+- **Cognitive Style**: {IC score estimate} — {hedging patterns, complexity, analytical vs intuitive}
+- **Narrative Frame**: {dominant frame} — {how they lens issues}
+- **Persona Authenticity**: {1-5 score} — {evidence for curation vs authenticity}
+
+## Strategic Self-Presentation (Red Hat)
+- **Cultivated image**: {what they want to be seen as}
+- **Target audience**: {who they're performing for}
+- **Incentive structure**: {what they gain from this persona}
+- **Possible divergences**: {where persona may ≠ person}
+- **Ghostwriting indicators**: {present/absent, evidence}
+
+## Ecosystem Context
+- **Community cluster**: {which tribe they belong to}
+- **Key influencers**: {who they amplify/follow/agree with}
+- **Echo chamber**: {what information environment they're in}
+- **Audience profile**: {who follows them, how that audience reacts}
+
+## Key Assumptions
+1. {assumption} — FRAGILITY: {robust/moderate/fragile} — Test: {what invalidates it}
+2. {assumption} — FRAGILITY: {} — Test: {}
+3. {assumption} — FRAGILITY: {} — Test: {}
+
+## Competing Hypotheses
+- **H1 (PRIMARY)**: {main personality model} — Confidence: {X}%
+- **H2 (ALTERNATIVE)**: {alternative explanation} — Confidence: {X}%
+- **Key discriminator**: {what evidence would shift between H1 and H2}
+
+## Research Sources
+- {source_1} [{reliability}{confidence}] — {description}
+- {source_2} [{reliability}{confidence}] — {description}
+- {source_3} [{reliability}{confidence}] — {description}
+
+## Invalidation Indicators
+1. If @{handle} {does X instead of Y}, our {assessment} is wrong
+2. If @{handle} {responds to Z with Q}, our {model} needs revision
+3. If @{handle} {interacts with @person in manner M}, dynamics model is off
+
+---
+*Dossier compiled: {date} | Fidelity: {fidelity}% | Persona Authenticity: {1-5}*
+*Source reliability range: {best}-{worst} | Analytical confidence: {1-6}*
@@ -73,7 +73,6 @@ Config file: `~/.hermes/hindsight/config.json`
 |-----|---------|-------------|
 | `llm_provider` | `openai` | LLM provider: `openai`, `anthropic`, `gemini`, `groq`, `minimax`, `ollama` |
 | `llm_model` | per-provider | Model name (e.g. `gpt-4o-mini`, `openai/gpt-oss-120b`) |
-| `llm_base_url` | — | LLM Base URL override (e.g. `https://openrouter.ai/api/v1`) |

 The LLM API key is stored in `~/.hermes/.env` as `HINDSIGHT_LLM_API_KEY`.

@@ -93,7 +92,6 @@ Available in `hybrid` and `tools` memory modes:
 |----------|-------------|
 | `HINDSIGHT_API_KEY` | API key for Hindsight Cloud |
 | `HINDSIGHT_LLM_API_KEY` | LLM API key for local mode |
-| `HINDSIGHT_API_LLM_BASE_URL` | LLM Base URL for local mode (e.g. OpenRouter) |
 | `HINDSIGHT_API_URL` | Override API endpoint |
 | `HINDSIGHT_BANK_ID` | Override bank name |
 | `HINDSIGHT_BUDGET` | Override recall budget |
@@ -23,8 +23,6 @@ import json
 import logging
 import os
 import threading
-
-from hermes_constants import get_hermes_home
 from typing import Any, Dict, List

 from agent.memory_provider import MemoryProvider
@@ -144,6 +142,7 @@ def _load_config() -> dict:
      3. Environment variables
    """
    from pathlib import Path
+    from hermes_constants import get_hermes_home

    # Profile-scoped path (preferred)
    profile_path = get_hermes_home() / "hindsight" / "config.json"
@@ -235,7 +234,6 @@ class HindsightMemoryProvider(MemoryProvider):
            {"key": "api_key", "description": "Hindsight Cloud API key", "secret": True, "env_var": "HINDSIGHT_API_KEY", "url": "https://ui.hindsight.vectorize.io", "when": {"mode": "cloud"}},
            {"key": "llm_provider", "description": "LLM provider for local mode", "default": "openai", "choices": ["openai", "anthropic", "gemini", "groq", "minimax", "ollama"], "when": {"mode": "local"}},
            {"key": "llm_api_key", "description": "LLM API key for local Hindsight", "secret": True, "env_var": "HINDSIGHT_LLM_API_KEY", "when": {"mode": "local"}},
-            {"key": "llm_base_url", "description": "LLM Base URL (e.g. for OpenRouter)", "default": "", "env_var": "HINDSIGHT_API_LLM_BASE_URL", "when": {"mode": "local"}},
            {"key": "llm_model", "description": "LLM model for local mode", "default": "gpt-4o-mini", "default_from": {"field": "llm_provider", "map": _PROVIDER_DEFAULT_MODELS}, "when": {"mode": "local"}},
            {"key": "bank_id", "description": "Memory bank name", "default": "hermes"},
            {"key": "budget", "description": "Recall thoroughness", "default": "mid", "choices": ["low", "mid", "high"]},
@@ -252,16 +250,12 @@ class HindsightMemoryProvider(MemoryProvider):
                # different loop" errors during GC — we handle cleanup in
                # shutdown() instead.
                HindsightEmbedded.__del__ = lambda self: None
-                kwargs = dict(
+                self._client = HindsightEmbedded(
                    profile=self._config.get("profile", "hermes"),
                    llm_provider=self._config.get("llm_provider", ""),
-                    llm_api_key=self._config.get("llm_api_key") or os.environ.get("HINDSIGHT_LLM_API_KEY", ""),
+                    llm_api_key=self._config.get("llmApiKey") or os.environ.get("HINDSIGHT_LLM_API_KEY", ""),
                    llm_model=self._config.get("llm_model", ""),
                )
-                base_url = self._config.get("llm_base_url") or os.environ.get("HINDSIGHT_API_LLM_BASE_URL", "")
-                if base_url:
-                    kwargs["llm_base_url"] = base_url
-                self._client = HindsightEmbedded(**kwargs)
            else:
                from hindsight_client import Hindsight
                kwargs = {"base_url": self._api_url, "timeout": 30.0}
@@ -316,10 +310,9 @@ class HindsightMemoryProvider(MemoryProvider):
                    # If the config changed and the daemon is running, stop it.
                    from pathlib import Path as _Path
                    profile_env = _Path.home() / ".hindsight" / "profiles" / f"{profile}.env"
-                    current_key = self._config.get("llm_api_key") or os.environ.get("HINDSIGHT_LLM_API_KEY", "")
+                    current_key = self._config.get("llmApiKey") or os.environ.get("HINDSIGHT_LLM_API_KEY", "")
                    current_provider = self._config.get("llm_provider", "")
                    current_model = self._config.get("llm_model", "")
-                    current_base_url = self._config.get("llm_base_url") or os.environ.get("HINDSIGHT_API_LLM_BASE_URL", "")

                    # Read saved profile config
                    saved = {}
@@ -332,22 +325,18 @@ class HindsightMemoryProvider(MemoryProvider):
                    config_changed = (
                        saved.get("HINDSIGHT_API_LLM_PROVIDER") != current_provider or
                        saved.get("HINDSIGHT_API_LLM_MODEL") != current_model or
-                        saved.get("HINDSIGHT_API_LLM_API_KEY") != current_key or
-                        saved.get("HINDSIGHT_API_LLM_BASE_URL", "") != current_base_url
+                        saved.get("HINDSIGHT_API_LLM_API_KEY") != current_key
                    )

                    if config_changed:
                        # Write updated profile .env
                        profile_env.parent.mkdir(parents=True, exist_ok=True)
-                        env_lines = (
+                        profile_env.write_text(
                            f"HINDSIGHT_API_LLM_PROVIDER={current_provider}\n"
                            f"HINDSIGHT_API_LLM_API_KEY={current_key}\n"
                            f"HINDSIGHT_API_LLM_MODEL={current_model}\n"
                            f"HINDSIGHT_API_LOG_LEVEL=info\n"
                        )
-                        if current_base_url:
-                            env_lines += f"HINDSIGHT_API_LLM_BASE_URL={current_base_url}\n"
-                        profile_env.write_text(env_lines)
                        if client._manager.is_running(profile):
                            with open(log_path, "a") as f:
                                f.write("\n=== Config changed, restarting daemon ===\n")
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "hermes-agent"
-version = "0.8.0"
+version = "0.7.0"
 description = "The self-improving AI agent — creates skills from experience, improves them during use, and runs anywhere"
 readme = "README.md"
 requires-python = ">=3.11"
@@ -62,7 +62,6 @@ mcp = ["mcp>=1.2.0,<2"]
 homeassistant = ["aiohttp>=3.9.0,<4"]
 sms = ["aiohttp>=3.9.0,<4"]
 acp = ["agent-client-protocol>=0.9.0,<1.0"]
-mistral = ["mistralai>=2.3.0,<3"]
 dingtalk = ["dingtalk-stream>=0.1.0,<1"]
 feishu = ["lark-oapi>=1.5.3,<2"]
 rl = [
@@ -95,7 +94,6 @@ all = [
  "hermes-agent[voice]",
  "hermes-agent[dingtalk]",
  "hermes-agent[feishu]",
-  "hermes-agent[mistral]",
 ]

 [project.scripts]
@@ -66,8 +66,7 @@ from model_tools import (
    handle_function_call,
    check_toolset_requirements,
 )
-from tools.terminal_tool import cleanup_vm, get_active_env
-from tools.tool_result_storage import maybe_persist_tool_result, enforce_turn_budget
+from tools.terminal_tool import cleanup_vm
 from tools.interrupt import set_interrupt as _set_interrupt
 from tools.browser_tool import cleanup_browser

@@ -76,7 +75,6 @@ from hermes_constants import OPENROUTER_BASE_URL

 # Agent internals extracted to agent/ package for modularity
 from agent.memory_manager import build_memory_context_block
-from agent.retry_utils import jittered_backoff
 from agent.prompt_builder import (
    DEFAULT_AGENT_IDENTITY, PLATFORM_HINTS,
    MEMORY_GUIDANCE, SESSION_SEARCH_GUIDANCE, SKILLS_GUIDANCE,
@@ -87,7 +85,6 @@ from agent.model_metadata import (
    estimate_tokens_rough, estimate_messages_tokens_rough, estimate_request_tokens_rough,
    get_next_probe_tier, parse_context_limit_from_error,
    save_context_length, is_local_endpoint,
-    query_ollama_num_ctx,
 )
 from agent.context_compressor import ContextCompressor
 from agent.subdirectory_hints import SubdirectoryHintTracker
@@ -412,26 +409,62 @@ def _strip_budget_warnings_from_history(messages: list) -> None:
 # Large tool result handler — save oversized output to temp file
 # =========================================================================

+# Threshold at which tool results are saved to a file instead of kept inline.
+# 100K chars ≈ 25K tokens — generous for any reasonable output but prevents
+# catastrophic context explosions.
+_LARGE_RESULT_CHARS = 100_000

-# =========================================================================
-# Qwen Portal headers — mimics QwenCode CLI for portal.qwen.ai compatibility.
-# Extracted as a module-level helper so both __init__ and
-# _apply_client_headers_for_base_url can share it.
-# =========================================================================
-_QWEN_CODE_VERSION = "0.14.1"
+# How many characters of the original result to include as an inline preview
+# so the model has immediate context about what the tool returned.
+_LARGE_RESULT_PREVIEW_CHARS = 1_500


-def _qwen_portal_headers() -> dict:
-    """Return default HTTP headers required by Qwen Portal API."""
-    import platform as _plat
+def _save_oversized_tool_result(function_name: str, function_result: str) -> str:
+    """Replace oversized tool results with a file reference + preview.

-    _ua = f"QwenCode/{_QWEN_CODE_VERSION} ({_plat.system().lower()}; {_plat.machine()})"
-    return {
-        "User-Agent": _ua,
-        "X-DashScope-CacheControl": "enable",
-        "X-DashScope-UserAgent": _ua,
-        "X-DashScope-AuthType": "qwen-oauth",
-    }
+    When a tool returns more than ``_LARGE_RESULT_CHARS`` characters, the full
+    content is written to a temporary file under ``HERMES_HOME/cache/tool_responses/``
+    and the result sent to the model is replaced with:
+      • a brief head preview  (first ``_LARGE_RESULT_PREVIEW_CHARS`` chars)
+      • the file path so the model can use ``read_file`` / ``search_files``
+
+    Falls back to destructive truncation if the file write fails.
+    """
+    original_len = len(function_result)
+    if original_len <= _LARGE_RESULT_CHARS:
+        return function_result
+
+    # Build the target directory
+    try:
+        response_dir = os.path.join(get_hermes_home(), "cache", "tool_responses")
+        os.makedirs(response_dir, exist_ok=True)
+
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
+        # Sanitize tool name for use in filename
+        safe_name = re.sub(r"[^\w\-]", "_", function_name)[:40]
+        filename = f"{safe_name}_{timestamp}.txt"
+        filepath = os.path.join(response_dir, filename)
+
+        with open(filepath, "w", encoding="utf-8") as f:
+            f.write(function_result)
+
+        preview = function_result[:_LARGE_RESULT_PREVIEW_CHARS]
+        return (
+            f"{preview}\n\n"
+            f"[Large tool response: {original_len:,} characters total — "
+            f"only the first {_LARGE_RESULT_PREVIEW_CHARS:,} shown above. "
+            f"Full output saved to: {filepath}\n"
+            f"Use read_file or search_files on that path to access the rest.]"
+        )
+    except Exception as exc:
+        # Fall back to destructive truncation if file write fails
+        logger.warning("Failed to save large tool result to file: %s", exc)
+        return (
+            function_result[:_LARGE_RESULT_CHARS]
+            + f"\n\n[Truncated: tool response was {original_len:,} chars, "
+            f"exceeding the {_LARGE_RESULT_CHARS:,} char limit. "
+            f"File save failed: {exc}]"
+        )


 class AIAgent:
@@ -777,8 +810,6 @@ class AIAgent:
                    client_kwargs["default_headers"] = {
                        "User-Agent": "KimiCLI/1.3",
                    }
-                elif "portal.qwen.ai" in effective_base.lower():
-                    client_kwargs["default_headers"] = _qwen_portal_headers()
            else:
                # No explicit creds — use the centralized provider router
                from agent.auxiliary_client import resolve_provider_client
@@ -1185,33 +1216,6 @@ class AIAgent:
        self.session_cost_status = "unknown"
        self.session_cost_source = "none"
        
-        # ── Ollama num_ctx injection ──
-        # Ollama defaults to 2048 context regardless of the model's capabilities.
-        # When running against an Ollama server, detect the model's max context
-        # and pass num_ctx on every chat request so the full window is used.
-        # User override: set model.ollama_num_ctx in config.yaml to cap VRAM use.
-        self._ollama_num_ctx: int | None = None
-        _ollama_num_ctx_override = None
-        if isinstance(_model_cfg, dict):
-            _ollama_num_ctx_override = _model_cfg.get("ollama_num_ctx")
-        if _ollama_num_ctx_override is not None:
-            try:
-                self._ollama_num_ctx = int(_ollama_num_ctx_override)
-            except (TypeError, ValueError):
-                logger.debug("Invalid ollama_num_ctx config value: %r", _ollama_num_ctx_override)
-        if self._ollama_num_ctx is None and self.base_url and is_local_endpoint(self.base_url):
-            try:
-                _detected = query_ollama_num_ctx(self.model, self.base_url)
-                if _detected and _detected > 0:
-                    self._ollama_num_ctx = _detected
-            except Exception as exc:
-                logger.debug("Ollama num_ctx detection failed: %s", exc)
-        if self._ollama_num_ctx and not self.quiet_mode:
-            logger.info(
-                "Ollama num_ctx: will request %d tokens (model max from /api/show)",
-                self._ollama_num_ctx,
-            )
-
        if not self.quiet_mode:
            if compression_enabled:
                print(f"📊 Context limit: {self.context_compressor.context_length:,} tokens (compress at {int(compression_threshold*100)}% = {self.context_compressor.threshold_tokens:,})")
@@ -4103,8 +4107,6 @@ class AIAgent:
            self._client_kwargs["default_headers"] = copilot_default_headers()
        elif "api.kimi.com" in normalized:
            self._client_kwargs["default_headers"] = {"User-Agent": "KimiCLI/1.3"}
-        elif "portal.qwen.ai" in normalized:
-            self._client_kwargs["default_headers"] = _qwen_portal_headers()
        else:
            self._client_kwargs.pop("default_headers", None)

@@ -4895,7 +4897,7 @@ class AIAgent:
                effective_key = (fb_client.api_key or resolve_anthropic_token() or "") if fb_provider == "anthropic" else (fb_client.api_key or "")
                self.api_key = effective_key
                self._anthropic_api_key = effective_key
-                self._anthropic_base_url = fb_base_url
+                self._anthropic_base_url = getattr(fb_client, "base_url", None)
                self._anthropic_client = build_anthropic_client(effective_key, self._anthropic_base_url)
                self._is_anthropic_oauth = _is_oauth_token(effective_key)
                self.client = None
@@ -5251,71 +5253,6 @@ class AIAgent:
        base = (getattr(self, "base_url", "") or "").lower()
        return "dashscope" in base or "aliyuncs" in base or "opencode.ai/zen/go" in base

-    def _is_qwen_portal(self) -> bool:
-        """Return True when the base URL targets Qwen Portal."""
-        return "portal.qwen.ai" in self._base_url_lower
-
-    def _qwen_prepare_chat_messages(self, api_messages: list) -> list:
-        prepared = copy.deepcopy(api_messages)
-        if not prepared:
-            return prepared
-
-        for msg in prepared:
-            if not isinstance(msg, dict):
-                continue
-            content = msg.get("content")
-            if isinstance(content, str):
-                msg["content"] = [{"type": "text", "text": content}]
-            elif isinstance(content, list):
-                # Normalize: convert bare strings to text dicts, keep dicts as-is.
-                # deepcopy already created independent copies, no need for dict().
-                normalized_parts = []
-                for part in content:
-                    if isinstance(part, str):
-                        normalized_parts.append({"type": "text", "text": part})
-                    elif isinstance(part, dict):
-                        normalized_parts.append(part)
-                if normalized_parts:
-                    msg["content"] = normalized_parts
-
-        # Inject cache_control on the last part of the system message.
-        for msg in prepared:
-            if isinstance(msg, dict) and msg.get("role") == "system":
-                content = msg.get("content")
-                if isinstance(content, list) and content and isinstance(content[-1], dict):
-                    content[-1]["cache_control"] = {"type": "ephemeral"}
-                break
-
-        return prepared
-
-    def _qwen_prepare_chat_messages_inplace(self, messages: list) -> None:
-        """In-place variant — mutates an already-copied message list."""
-        if not messages:
-            return
-
-        for msg in messages:
-            if not isinstance(msg, dict):
-                continue
-            content = msg.get("content")
-            if isinstance(content, str):
-                msg["content"] = [{"type": "text", "text": content}]
-            elif isinstance(content, list):
-                normalized_parts = []
-                for part in content:
-                    if isinstance(part, str):
-                        normalized_parts.append({"type": "text", "text": part})
-                    elif isinstance(part, dict):
-                        normalized_parts.append(part)
-                if normalized_parts:
-                    msg["content"] = normalized_parts
-
-        for msg in messages:
-            if isinstance(msg, dict) and msg.get("role") == "system":
-                content = msg.get("content")
-                if isinstance(content, list) and content and isinstance(content[-1], dict):
-                    content[-1]["cache_control"] = {"type": "ephemeral"}
-                break
-
    def _build_api_kwargs(self, api_messages: list) -> dict:
        """Build the keyword arguments dict for the active API mode."""
        if self.api_mode == "anthropic_messages":
@@ -5334,7 +5271,6 @@ class AIAgent:
                is_oauth=self._is_anthropic_oauth,
                preserve_dots=self._anthropic_preserve_dots(),
                context_length=ctx_len,
-                base_url=getattr(self, "_anthropic_base_url", None),
            )

        if self.api_mode == "codex_responses":
@@ -5428,17 +5364,6 @@ class AIAgent:
                            tool_call.pop("call_id", None)
                            tool_call.pop("response_item_id", None)

-        # Qwen portal: normalize content to list-of-dicts, inject cache_control.
-        # Must run AFTER codex sanitization so we transform the final messages.
-        # If sanitization already deepcopied, reuse that copy (in-place).
-        if self._is_qwen_portal():
-            if sanitized_messages is api_messages:
-                # No sanitization was done — we need our own copy.
-                sanitized_messages = self._qwen_prepare_chat_messages(sanitized_messages)
-            else:
-                # Already a deepcopy — transform in place to avoid a second deepcopy.
-                self._qwen_prepare_chat_messages_inplace(sanitized_messages)
-
        # GPT-5 and Codex models respond better to 'developer' than 'system'
        # for instruction-following.  Swap the role at the API boundary so
        # internal message representation stays uniform ("system").
@@ -5471,17 +5396,11 @@ class AIAgent:
            "messages": sanitized_messages,
            "timeout": float(os.getenv("HERMES_API_TIMEOUT", 1800.0)),
        }
-        if self._is_qwen_portal():
-            api_kwargs["metadata"] = {
-                "sessionId": self.session_id or "hermes",
-                "promptId": str(uuid.uuid4()),
-            }
        if self.tools:
            api_kwargs["tools"] = self.tools

        if self.max_tokens is not None:
-            if not self._is_qwen_portal():
-                api_kwargs.update(self._max_tokens_param(self.max_tokens))
+            api_kwargs.update(self._max_tokens_param(self.max_tokens))
        elif self._is_openrouter_url() and "claude" in (self.model or "").lower():
            # OpenRouter translates requests to Anthropic's Messages API,
            # which requires max_tokens as a mandatory field.  When we omit
@@ -5537,18 +5456,6 @@ class AIAgent:
        if _is_nous:
            extra_body["tags"] = ["product=hermes-agent"]

-        # Ollama num_ctx: override the 2048 default so the model actually
-        # uses the context window it was trained for.  Passed via the OpenAI
-        # SDK's extra_body → options.num_ctx, which Ollama's OpenAI-compat
-        # endpoint forwards to the runner as --ctx-size.
-        if self._ollama_num_ctx:
-            options = extra_body.get("options", {})
-            options["num_ctx"] = self._ollama_num_ctx
-            extra_body["options"] = options
-
-        if self._is_qwen_portal():
-            extra_body["vl_high_resolution_images"] = True
-
        if extra_body:
            api_kwargs["extra_body"] = extra_body

@@ -6317,17 +6224,15 @@ class AIAgent:
                except Exception as cb_err:
                    logging.debug(f"Tool complete callback error: {cb_err}")

-            function_result = maybe_persist_tool_result(
-                content=function_result,
-                tool_name=name,
-                tool_use_id=tc.id,
-                env=get_active_env(effective_task_id),
-            )
+            # Save oversized results to file instead of destructive truncation
+            function_result = _save_oversized_tool_result(name, function_result)

+            # Discover subdirectory context files from tool arguments
            subdir_hints = self._subdirectory_hints.check_tool_call(name, args)
            if subdir_hints:
                function_result += subdir_hints

+            # Append tool result message in order
            tool_msg = {
                "role": "tool",
                "content": function_result,
@@ -6335,12 +6240,6 @@ class AIAgent:
            }
            messages.append(tool_msg)

-        # ── Per-turn aggregate budget enforcement ─────────────────────────
-        num_tools = len(parsed_calls)
-        if num_tools > 0:
-            turn_tool_msgs = messages[-num_tools:]
-            enforce_turn_budget(turn_tool_msgs, env=get_active_env(effective_task_id))
-
        # ── Budget pressure injection ────────────────────────────────────
        budget_warning = self._get_budget_warning(api_call_count)
        if budget_warning and messages and messages[-1].get("role") == "tool":
@@ -6625,12 +6524,8 @@ class AIAgent:
                except Exception as cb_err:
                    logging.debug(f"Tool complete callback error: {cb_err}")

-            function_result = maybe_persist_tool_result(
-                content=function_result,
-                tool_name=function_name,
-                tool_use_id=tool_call.id,
-                env=get_active_env(effective_task_id),
-            )
+            # Save oversized results to file instead of destructive truncation
+            function_result = _save_oversized_tool_result(function_name, function_result)

            # Discover subdirectory context files from tool arguments
            subdir_hints = self._subdirectory_hints.check_tool_call(function_name, function_args)
@@ -6668,11 +6563,6 @@ class AIAgent:
            if self.tool_delay > 0 and i < len(assistant_message.tool_calls):
                time.sleep(self.tool_delay)

-        # ── Per-turn aggregate budget enforcement ─────────────────────────
-        num_tools_seq = len(assistant_message.tool_calls)
-        if num_tools_seq > 0:
-            enforce_turn_budget(messages[-num_tools_seq:], env=get_active_env(effective_task_id))
-
        # ── Budget pressure injection ─────────────────────────────────
        # After all tool calls in this turn are processed, check if we're
        # approaching max_iterations. If so, inject a warning into the LAST
@@ -7399,7 +7289,6 @@ class AIAgent:
            codex_auth_retry_attempted=False
            anthropic_auth_retry_attempted=False
            nous_auth_retry_attempted=False
-            thinking_sig_retry_attempted = False
            has_retried_429 = False
            restart_with_compressed_messages = False
            restart_with_length_continuation = False
@@ -7615,8 +7504,7 @@ class AIAgent:
                            }
                        
                        # Longer backoff for rate limiting (likely cause of None choices)
-                        # Jittered exponential: 5s base, 120s cap + random jitter
-                        wait_time = jittered_backoff(retry_count, base_delay=5.0, max_delay=120.0)
+                        wait_time = min(5 * (2 ** (retry_count - 1)), 120)  # 5s, 10s, 20s, 40s, 80s, 120s
                        self._vprint(f"{self.log_prefix}⏳ Retrying in {wait_time}s (extended backoff for possible rate limit)...", force=True)
                        logging.warning(f"Invalid API response (retry {retry_count}/{max_retries}): {', '.join(error_details)} | Provider: {provider_name}")
                        
@@ -7989,38 +7877,8 @@ class AIAgent:
                        print(f"{self.log_prefix}     • Check ANTHROPIC_API_KEY in {_dhh}/.env for API keys or legacy token values")
                        print(f"{self.log_prefix}     • For API keys: verify at https://console.anthropic.com/settings/keys")
                        print(f"{self.log_prefix}     • For Claude Code: run 'claude /login' to refresh, then retry")
-                        print(f"{self.log_prefix}     • Legacy cleanup: hermes config set ANTHROPIC_TOKEN \"\"")
-                        print(f"{self.log_prefix}     • Clear stale keys: hermes config set ANTHROPIC_API_KEY \"\"")
-
-                    # ── Thinking block signature recovery ─────────────────
-                    # Anthropic signs thinking blocks against the full turn
-                    # content.  Any upstream mutation (context compression,
-                    # session truncation, message merging) invalidates the
-                    # signature → HTTP 400.  Recovery: strip reasoning_details
-                    # from all messages so the next retry sends no thinking
-                    # blocks at all.  One-shot — don't retry infinitely.
-                    if (
-                        self.api_mode == "anthropic_messages"
-                        and status_code == 400
-                        and not thinking_sig_retry_attempted
-                    ):
-                        _err_msg_lower = str(api_error).lower()
-                        if "signature" in _err_msg_lower and "thinking" in _err_msg_lower:
-                            thinking_sig_retry_attempted = True
-                            for _m in messages:
-                                if isinstance(_m, dict):
-                                    _m.pop("reasoning_details", None)
-                            self._vprint(
-                                f"{self.log_prefix}⚠️  Thinking block signature invalid — "
-                                f"stripped all thinking blocks, retrying...",
-                                force=True,
-                            )
-                            logging.warning(
-                                "%sThinking block signature recovery: stripped "
-                                "reasoning_details from %d messages",
-                                self.log_prefix, len(messages),
-                            )
-                            continue
+                        print(f"{self.log_prefix}     • Legacy cleanup: hermes config set ANTHROPIC_TOKEN \"\"")
+                        print(f"{self.log_prefix}     • Clear stale keys: hermes config set ANTHROPIC_API_KEY \"\"")

                    retry_count += 1
                    elapsed_time = time.time() - api_start_time
@@ -8503,7 +8361,7 @@ class AIAgent:
                                    _retry_after = min(int(_ra_raw), 120)  # Cap at 2 minutes
                                except (TypeError, ValueError):
                                    pass
-                    wait_time = _retry_after if _retry_after else jittered_backoff(retry_count, base_delay=2.0, max_delay=60.0)
+                    wait_time = _retry_after if _retry_after else min(2 ** retry_count, 60)
                    if is_rate_limited:
                        self._emit_status(f"⏱️ Rate limit reached. Waiting {wait_time}s before retry (attempt {retry_count + 1}/{max_retries})...")
                    else:
@@ -1276,258 +1276,6 @@ class TestRoleAlternation:
        assert [m["role"] for m in result] == ["user", "assistant", "user"]


-# ---------------------------------------------------------------------------
-# Thinking block signature management
-# ---------------------------------------------------------------------------
-
-
-class TestThinkingBlockSignatureManagement:
-    """Tests for the thinking block handling strategy:
-    strip from old turns, preserve latest signed, downgrade unsigned."""
-
-    def test_thinking_stripped_from_non_last_assistant(self):
-        """Thinking blocks are removed from all assistant messages except the last."""
-        messages = [
-            {
-                "role": "assistant",
-                "content": "",
-                "tool_calls": [
-                    {"id": "tc_1", "function": {"name": "tool1", "arguments": "{}"}},
-                ],
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "Old reasoning.", "signature": "sig_old"},
-                ],
-            },
-            {"role": "tool", "tool_call_id": "tc_1", "content": "result 1"},
-            {
-                "role": "assistant",
-                "content": "",
-                "tool_calls": [
-                    {"id": "tc_2", "function": {"name": "tool2", "arguments": "{}"}},
-                ],
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "Latest reasoning.", "signature": "sig_new"},
-                ],
-            },
-            {"role": "tool", "tool_call_id": "tc_2", "content": "result 2"},
-        ]
-        _, result = convert_messages_to_anthropic(messages)
-
-        # Find both assistant messages
-        assistants = [m for m in result if m["role"] == "assistant"]
-        assert len(assistants) == 2
-
-        # First (non-last) assistant: no thinking blocks
-        first_types = [b.get("type") for b in assistants[0]["content"]]
-        assert "thinking" not in first_types
-        assert "redacted_thinking" not in first_types
-        assert "tool_use" in first_types  # tool_use should survive
-
-        # Last assistant: thinking block preserved with signature
-        last_blocks = assistants[1]["content"]
-        thinking_blocks = [b for b in last_blocks if b.get("type") == "thinking"]
-        assert len(thinking_blocks) == 1
-        assert thinking_blocks[0]["thinking"] == "Latest reasoning."
-        assert thinking_blocks[0]["signature"] == "sig_new"
-
-    def test_signed_thinking_preserved_on_last_turn(self):
-        """A signed thinking block on the last assistant message is kept."""
-        messages = [
-            {
-                "role": "assistant",
-                "content": "The answer is 42.",
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "Deep thought.", "signature": "sig_valid"},
-                ],
-            },
-        ]
-        _, result = convert_messages_to_anthropic(messages)
-        blocks = result[0]["content"]
-        thinking = [b for b in blocks if b.get("type") == "thinking"]
-        assert len(thinking) == 1
-        assert thinking[0]["signature"] == "sig_valid"
-
-    def test_unsigned_thinking_downgraded_to_text_on_last_turn(self):
-        """Unsigned thinking blocks on the last turn become text blocks."""
-        messages = [
-            {
-                "role": "assistant",
-                "content": "Response text.",
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "Unsigned reasoning."},
-                    # No 'signature' field
-                ],
-            },
-        ]
-        _, result = convert_messages_to_anthropic(messages)
-        blocks = result[0]["content"]
-
-        # No thinking blocks should remain
-        assert not any(b.get("type") == "thinking" for b in blocks)
-        # The reasoning text should be preserved as a text block
-        text_contents = [b.get("text", "") for b in blocks if b.get("type") == "text"]
-        assert "Unsigned reasoning." in text_contents
-
-    def test_redacted_thinking_with_data_preserved(self):
-        """Redacted thinking with 'data' field is kept on last turn."""
-        messages = [
-            {
-                "role": "assistant",
-                "content": "Response.",
-                "reasoning_details": [
-                    {"type": "redacted_thinking", "data": "opaque_signature_data"},
-                ],
-            },
-        ]
-        _, result = convert_messages_to_anthropic(messages)
-        blocks = result[0]["content"]
-        redacted = [b for b in blocks if b.get("type") == "redacted_thinking"]
-        assert len(redacted) == 1
-        assert redacted[0]["data"] == "opaque_signature_data"
-
-    def test_redacted_thinking_without_data_dropped(self):
-        """Redacted thinking without 'data' is dropped — can't be validated."""
-        messages = [
-            {
-                "role": "assistant",
-                "content": "Response.",
-                "reasoning_details": [
-                    {"type": "redacted_thinking"},
-                    # No 'data' field
-                ],
-            },
-        ]
-        _, result = convert_messages_to_anthropic(messages)
-        blocks = result[0]["content"]
-        assert not any(b.get("type") == "redacted_thinking" for b in blocks)
-
-    def test_cache_control_stripped_from_thinking_blocks(self):
-        """cache_control markers are removed from thinking/redacted_thinking blocks."""
-        messages = [
-            {
-                "role": "assistant",
-                "content": "",
-                "tool_calls": [
-                    {"id": "tc_1", "function": {"name": "t", "arguments": "{}"}},
-                ],
-                "reasoning_details": [
-                    {
-                        "type": "thinking",
-                        "thinking": "Reasoning.",
-                        "signature": "sig_1",
-                        "cache_control": {"type": "ephemeral"},
-                    },
-                ],
-            },
-            {"role": "tool", "tool_call_id": "tc_1", "content": "result"},
-        ]
-        _, result = convert_messages_to_anthropic(messages)
-        assistant = next(m for m in result if m["role"] == "assistant")
-        for block in assistant["content"]:
-            if block.get("type") in ("thinking", "redacted_thinking"):
-                assert "cache_control" not in block
-
-    def test_thinking_stripped_from_merged_consecutive_assistants(self):
-        """When consecutive assistants are merged, second one's thinking is dropped."""
-        messages = [
-            {
-                "role": "assistant",
-                "content": "First response.",
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "First thought.", "signature": "sig_1"},
-                ],
-            },
-            {
-                "role": "assistant",
-                "content": "Second response.",
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "Second thought.", "signature": "sig_2"},
-                ],
-            },
-        ]
-        _, result = convert_messages_to_anthropic(messages)
-
-        # Should be merged into one assistant message
-        assistants = [m for m in result if m["role"] == "assistant"]
-        assert len(assistants) == 1
-
-        # Only the first thinking block should remain (signed, on the last/only assistant)
-        blocks = assistants[0]["content"]
-        thinking = [b for b in blocks if b.get("type") == "thinking"]
-        assert len(thinking) == 1
-        assert thinking[0]["thinking"] == "First thought."
-
-    def test_empty_content_after_strip_gets_placeholder(self):
-        """If stripping thinking leaves an empty message, a placeholder is added."""
-        messages = [
-            {
-                "role": "assistant",
-                "content": "",
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "Only thinking, no text."},
-                    # Unsigned — will be downgraded, but content was empty string
-                ],
-            },
-            {"role": "user", "content": "Next message."},
-            {"role": "assistant", "content": "Final."},
-        ]
-        _, result = convert_messages_to_anthropic(messages)
-        # First assistant is non-last, so thinking is stripped completely.
-        # The original content was empty and thinking was unsigned → placeholder
-        first_assistant = result[0]
-        assert first_assistant["role"] == "assistant"
-        assert len(first_assistant["content"]) >= 1
-
-    def test_multi_turn_conversation_preserves_only_last(self):
-        """Full multi-turn conversation: only last assistant keeps thinking."""
-        messages = [
-            {"role": "user", "content": "Question 1"},
-            {
-                "role": "assistant",
-                "content": "Answer 1",
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "Thought 1", "signature": "sig_1"},
-                ],
-            },
-            {"role": "user", "content": "Question 2"},
-            {
-                "role": "assistant",
-                "content": "Answer 2",
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "Thought 2", "signature": "sig_2"},
-                ],
-            },
-            {"role": "user", "content": "Question 3"},
-            {
-                "role": "assistant",
-                "content": "Answer 3",
-                "reasoning_details": [
-                    {"type": "thinking", "thinking": "Thought 3", "signature": "sig_3"},
-                ],
-            },
-        ]
-        _, result = convert_messages_to_anthropic(messages)
-
-        assistants = [m for m in result if m["role"] == "assistant"]
-        assert len(assistants) == 3
-
-        # First two: no thinking blocks
-        for a in assistants[:2]:
-            assert not any(
-                b.get("type") in ("thinking", "redacted_thinking")
-                for b in a["content"]
-                if isinstance(b, dict)
-            )
-
-        # Last one: thinking preserved
-        last_thinking = [
-            b for b in assistants[2]["content"]
-            if isinstance(b, dict) and b.get("type") == "thinking"
-        ]
-        assert len(last_thinking) == 1
-        assert last_thinking[0]["signature"] == "sig_3"
-
-
 # ---------------------------------------------------------------------------
 # Tool choice
 # ---------------------------------------------------------------------------
@@ -471,23 +471,6 @@ class TestExplicitProviderRouting:
            client, model = resolve_provider_client("zai")
            assert client is not None

-    def test_explicit_google_alias_uses_gemini_credentials(self):
-        """provider='google' should route through the gemini API-key provider."""
-        with (
-            patch("hermes_cli.auth.resolve_api_key_provider_credentials", return_value={
-                "api_key": "gemini-key",
-                "base_url": "https://generativelanguage.googleapis.com/v1beta/openai",
-            }),
-            patch("agent.auxiliary_client.OpenAI") as mock_openai,
-        ):
-            mock_openai.return_value = MagicMock()
-            client, model = resolve_provider_client("google", model="gemini-3.1-pro-preview")
-
-        assert client is not None
-        assert model == "gemini-3.1-pro-preview"
-        assert mock_openai.call_args.kwargs["api_key"] == "gemini-key"
-        assert mock_openai.call_args.kwargs["base_url"] == "https://generativelanguage.googleapis.com/v1beta/openai"
-
    def test_explicit_unknown_returns_none(self, monkeypatch):
        """Unknown provider should return None."""
        client, model = resolve_provider_client("nonexistent-provider")
@@ -641,15 +624,12 @@ class TestVisionClientFallback:
        assert client is None
        assert model is None

-    def test_vision_auto_includes_active_provider_when_configured(self, monkeypatch):
-        """Active provider appears in available backends when credentials exist."""
-        monkeypatch.setenv("ANTHROPIC_API_KEY", "***")
+    def test_vision_auto_includes_anthropic_when_configured(self, monkeypatch):
+        monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-ant-api03-key")
        with (
            patch("agent.auxiliary_client._read_nous_auth", return_value=None),
-            patch("agent.auxiliary_client._read_main_provider", return_value="anthropic"),
-            patch("agent.auxiliary_client._read_main_model", return_value="claude-sonnet-4"),
            patch("agent.anthropic_adapter.build_anthropic_client", return_value=MagicMock()),
-            patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="***"),
+            patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="sk-ant-api03-key"),
        ):
            backends = get_available_vision_backends()

@@ -722,51 +702,88 @@ class TestAuxiliaryPoolAwareness:
        assert call_kwargs["base_url"] == "https://api.githubcopilot.com"
        assert call_kwargs["default_headers"]["Editor-Version"]

-    def test_vision_auto_uses_active_provider_as_fallback(self, monkeypatch):
-        """When no OpenRouter/Nous available, vision auto falls back to active provider."""
-        monkeypatch.setenv("ANTHROPIC_API_KEY", "***")
+    def test_vision_auto_uses_anthropic_when_no_higher_priority_backend(self, monkeypatch):
+        monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-ant-api03-key")
        with (
            patch("agent.auxiliary_client._read_nous_auth", return_value=None),
-            patch("agent.auxiliary_client._read_main_provider", return_value="anthropic"),
-            patch("agent.auxiliary_client._read_main_model", return_value="claude-sonnet-4"),
            patch("agent.anthropic_adapter.build_anthropic_client", return_value=MagicMock()),
-            patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="***"),
+            patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="sk-ant-api03-key"),
        ):
            client, model = get_vision_auxiliary_client()

        assert client is not None
        assert client.__class__.__name__ == "AnthropicAuxiliaryClient"
+        assert model == "claude-haiku-4-5-20251001"

-    def test_vision_auto_prefers_active_provider_over_openrouter(self, monkeypatch):
-        """Active provider is tried before OpenRouter in vision auto."""
+    def test_selected_anthropic_provider_is_preferred_for_vision_auto(self, monkeypatch):
        monkeypatch.setenv("OPENROUTER_API_KEY", "or-key")
-        monkeypatch.setenv("ANTHROPIC_API_KEY", "***")
+        monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-ant-api03-key")
+
+        def fake_load_config():
+            return {"model": {"provider": "anthropic", "default": "claude-sonnet-4-6"}}

        with (
            patch("agent.auxiliary_client._read_nous_auth", return_value=None),
-            patch("agent.auxiliary_client._read_main_provider", return_value="anthropic"),
-            patch("agent.auxiliary_client._read_main_model", return_value="claude-sonnet-4"),
            patch("agent.anthropic_adapter.build_anthropic_client", return_value=MagicMock()),
-            patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="***"),
+            patch("agent.anthropic_adapter.resolve_anthropic_token", return_value="sk-ant-api03-key"),
+            patch("agent.auxiliary_client.OpenAI") as mock_openai,
+            patch("hermes_cli.config.load_config", fake_load_config),
+        ):
+            client, model = get_vision_auxiliary_client()
+
+        assert client is not None
+        assert client.__class__.__name__ == "AnthropicAuxiliaryClient"
+        assert model == "claude-haiku-4-5-20251001"
+
+    def test_selected_codex_provider_short_circuits_vision_auto(self, monkeypatch):
+        def fake_load_config():
+            return {"model": {"provider": "openai-codex", "default": "gpt-5.2-codex"}}
+
+        codex_client = MagicMock()
+        with (
+            patch("hermes_cli.config.load_config", fake_load_config),
+            patch("agent.auxiliary_client._try_codex", return_value=(codex_client, "gpt-5.2-codex")) as mock_codex,
+            patch("agent.auxiliary_client._try_openrouter") as mock_openrouter,
+            patch("agent.auxiliary_client._try_nous") as mock_nous,
+            patch("agent.auxiliary_client._try_anthropic") as mock_anthropic,
+            patch("agent.auxiliary_client._try_custom_endpoint") as mock_custom,
        ):
            provider, client, model = resolve_vision_provider_client()

-        # Active provider should win over OpenRouter
-        assert provider == "anthropic"
+        assert provider == "openai-codex"
+        assert client is codex_client
+        assert model == "gpt-5.2-codex"
+        mock_codex.assert_called_once()
+        mock_openrouter.assert_not_called()
+        mock_nous.assert_not_called()
+        mock_anthropic.assert_not_called()
+        mock_custom.assert_not_called()

-    def test_vision_auto_uses_named_custom_as_active_provider(self, monkeypatch):
-        """Named custom provider works as active provider fallback in vision auto."""
+    def test_vision_auto_includes_codex(self, codex_auth_dir):
+        """Codex supports vision (gpt-5.2-codex), so auto mode should use it."""
+        with patch("agent.auxiliary_client._read_nous_auth", return_value=None), \
+             patch("agent.auxiliary_client.OpenAI"):
+            client, model = get_vision_auxiliary_client()
+        from agent.auxiliary_client import CodexAuxiliaryClient
+        assert isinstance(client, CodexAuxiliaryClient)
+        assert model == "gpt-5.2-codex"
+
+    def test_vision_auto_falls_back_to_custom_endpoint(self, monkeypatch):
+        """Custom endpoint is used as fallback in vision auto mode.
+
+        Many local models (Qwen-VL, LLaVA, etc.) support vision.
+        When none of OpenRouter, Nous, or Codex is available, try the custom endpoint.
+        """
        monkeypatch.delenv("OPENROUTER_API_KEY", raising=False)
        monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
        with patch("agent.auxiliary_client._read_nous_auth", return_value=None), \
             patch("agent.auxiliary_client._select_pool_entry", return_value=(False, None)), \
-             patch("agent.auxiliary_client._read_main_provider", return_value="custom:local"), \
-             patch("agent.auxiliary_client._read_main_model", return_value="my-local-model"), \
-             patch("agent.auxiliary_client.resolve_provider_client",
-                   return_value=(MagicMock(), "my-local-model")) as mock_resolve:
-            provider, client, model = resolve_vision_provider_client()
-        assert client is not None
-        assert provider == "custom:local"
+             patch("agent.auxiliary_client._read_codex_access_token", return_value=None), \
+             patch("agent.auxiliary_client._resolve_custom_runtime",
+                   return_value=("http://localhost:1234/v1", "local-key")), \
+             patch("agent.auxiliary_client.OpenAI") as mock_openai:
+            client, model = get_vision_auxiliary_client()
+        assert client is not None  # Custom endpoint picked up as fallback

    def test_vision_direct_endpoint_override(self, monkeypatch):
        monkeypatch.setenv("OPENROUTER_API_KEY", "or-key")
@@ -805,31 +822,6 @@ class TestAuxiliaryPoolAwareness:
        assert model == "google/gemini-3-flash-preview"
        assert client is not None

-    def test_vision_config_google_provider_uses_gemini_credentials(self, monkeypatch):
-        config = {
-            "auxiliary": {
-                "vision": {
-                    "provider": "google",
-                    "model": "gemini-3.1-pro-preview",
-                }
-            }
-        }
-        monkeypatch.setattr("hermes_cli.config.load_config", lambda: config)
-        with (
-            patch("hermes_cli.auth.resolve_api_key_provider_credentials", return_value={
-                "api_key": "gemini-key",
-                "base_url": "https://generativelanguage.googleapis.com/v1beta/openai",
-            }),
-            patch("agent.auxiliary_client.OpenAI") as mock_openai,
-        ):
-            resolved_provider, client, model = resolve_vision_provider_client()
-
-        assert resolved_provider == "gemini"
-        assert client is not None
-        assert model == "gemini-3.1-pro-preview"
-        assert mock_openai.call_args.kwargs["api_key"] == "gemini-key"
-        assert mock_openai.call_args.kwargs["base_url"] == "https://generativelanguage.googleapis.com/v1beta/openai"
-
    def test_vision_forced_main_uses_custom_endpoint(self, monkeypatch):
        """When explicitly forced to 'main', vision CAN use custom endpoint."""
        config = {
@@ -854,14 +846,7 @@ class TestAuxiliaryPoolAwareness:
        monkeypatch.setenv("AUXILIARY_VISION_PROVIDER", "main")
        monkeypatch.delenv("OPENAI_BASE_URL", raising=False)
        monkeypatch.delenv("OPENAI_API_KEY", raising=False)
-        # Clear client cache to avoid stale entries from previous tests
-        from agent.auxiliary_client import _client_cache
-        _client_cache.clear()
        with patch("agent.auxiliary_client._read_nous_auth", return_value=None), \
-             patch("agent.auxiliary_client._read_main_provider", return_value=""), \
-             patch("agent.auxiliary_client._read_main_model", return_value=""), \
-             patch("agent.auxiliary_client._select_pool_entry", return_value=(False, None)), \
-             patch("agent.auxiliary_client._resolve_custom_runtime", return_value=(None, None)), \
             patch("agent.auxiliary_client._read_codex_access_token", return_value=None), \
             patch("agent.auxiliary_client._resolve_api_key_provider", return_value=(None, None)):
            client, model = get_vision_auxiliary_client()
@@ -1,42 +0,0 @@
-"""Tests for MiniMax auxiliary client URL normalization.
-
-MiniMax and MiniMax-CN set inference_base_url to the /anthropic path.
-The auxiliary client uses the OpenAI SDK, which needs /v1 instead.
-"""
-
-import sys
-import os
-
-sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
-
-from agent.auxiliary_client import _to_openai_base_url
-
-
-class TestToOpenaiBaseUrl:
-    def test_minimax_global_anthropic_suffix_replaced(self):
-        assert _to_openai_base_url("https://api.minimax.io/anthropic") == "https://api.minimax.io/v1"
-
-    def test_minimax_cn_anthropic_suffix_replaced(self):
-        assert _to_openai_base_url("https://api.minimaxi.com/anthropic") == "https://api.minimaxi.com/v1"
-
-    def test_trailing_slash_stripped_before_replace(self):
-        assert _to_openai_base_url("https://api.minimax.io/anthropic/") == "https://api.minimax.io/v1"
-
-    def test_v1_url_unchanged(self):
-        assert _to_openai_base_url("https://api.openai.com/v1") == "https://api.openai.com/v1"
-
-    def test_openrouter_url_unchanged(self):
-        assert _to_openai_base_url("https://openrouter.ai/api/v1") == "https://openrouter.ai/api/v1"
-
-    def test_anthropic_domain_unchanged(self):
-        """api.anthropic.com doesn't end with /anthropic — should be untouched."""
-        assert _to_openai_base_url("https://api.anthropic.com") == "https://api.anthropic.com"
-
-    def test_anthropic_in_subpath_unchanged(self):
-        assert _to_openai_base_url("https://example.com/anthropic/extra") == "https://example.com/anthropic/extra"
-
-    def test_empty_string(self):
-        assert _to_openai_base_url("") == ""
-
-    def test_none(self):
-        assert _to_openai_base_url(None) == ""
@@ -1,105 +0,0 @@
-"""Tests for MiniMax provider hardening — context lengths, thinking guard, catalog."""
-
-
-class TestMinimaxContextLengths:
-    """Verify per-model context length entries for MiniMax models."""
-
-    def test_m1_variants_have_1m_context(self):
-        from agent.model_metadata import DEFAULT_CONTEXT_LENGTHS
-        # Keys are lowercase because the lookup lowercases model names
-        for model in ("minimax-m1", "minimax-m1-40k", "minimax-m1-80k",
-                       "minimax-m1-128k", "minimax-m1-256k"):
-            assert model in DEFAULT_CONTEXT_LENGTHS, f"{model} missing from context lengths"
-            assert DEFAULT_CONTEXT_LENGTHS[model] == 1_000_000, f"{model} expected 1M"
-
-    def test_m2_variants_have_1m_context(self):
-        from agent.model_metadata import DEFAULT_CONTEXT_LENGTHS
-        # Keys are lowercase because the lookup lowercases model names
-        for model in ("minimax-m2.5", "minimax-m2.7"):
-            assert model in DEFAULT_CONTEXT_LENGTHS, f"{model} missing from context lengths"
-            assert DEFAULT_CONTEXT_LENGTHS[model] == 1_048_576, f"{model} expected 1048576"
-
-    def test_minimax_prefix_fallback(self):
-        from agent.model_metadata import DEFAULT_CONTEXT_LENGTHS
-        # The generic "minimax" prefix entry should be 1M for unknown models
-        assert DEFAULT_CONTEXT_LENGTHS["minimax"] == 1_048_576
-
-
-
-class TestMinimaxThinkingGuard:
-    """Verify that build_anthropic_kwargs does NOT add thinking params for MiniMax models."""
-
-    def test_no_thinking_for_minimax_m27(self):
-        from agent.anthropic_adapter import build_anthropic_kwargs
-        kwargs = build_anthropic_kwargs(
-            model="MiniMax-M2.7",
-            messages=[{"role": "user", "content": "hello"}],
-            tools=None,
-            max_tokens=4096,
-            reasoning_config={"enabled": True, "effort": "medium"},
-        )
-        assert "thinking" not in kwargs
-        assert "output_config" not in kwargs
-
-    def test_no_thinking_for_minimax_m1(self):
-        from agent.anthropic_adapter import build_anthropic_kwargs
-        kwargs = build_anthropic_kwargs(
-            model="MiniMax-M1-128k",
-            messages=[{"role": "user", "content": "hello"}],
-            tools=None,
-            max_tokens=4096,
-            reasoning_config={"enabled": True, "effort": "high"},
-        )
-        assert "thinking" not in kwargs
-
-    def test_thinking_still_works_for_claude(self):
-        from agent.anthropic_adapter import build_anthropic_kwargs
-        kwargs = build_anthropic_kwargs(
-            model="claude-sonnet-4-20250514",
-            messages=[{"role": "user", "content": "hello"}],
-            tools=None,
-            max_tokens=4096,
-            reasoning_config={"enabled": True, "effort": "medium"},
-        )
-        assert "thinking" in kwargs
-
-
-class TestMinimaxAuxModel:
-    """Verify auxiliary model is standard (not highspeed)."""
-
-    def test_minimax_aux_is_standard(self):
-        from agent.auxiliary_client import _API_KEY_PROVIDER_AUX_MODELS
-        assert _API_KEY_PROVIDER_AUX_MODELS["minimax"] == "MiniMax-M2.7"
-        assert _API_KEY_PROVIDER_AUX_MODELS["minimax-cn"] == "MiniMax-M2.7"
-
-    def test_minimax_aux_not_highspeed(self):
-        from agent.auxiliary_client import _API_KEY_PROVIDER_AUX_MODELS
-        assert "highspeed" not in _API_KEY_PROVIDER_AUX_MODELS["minimax"]
-        assert "highspeed" not in _API_KEY_PROVIDER_AUX_MODELS["minimax-cn"]
-
-
-class TestMinimaxModelCatalog:
-    """Verify the model catalog includes M1 family and excludes deprecated models."""
-
-    def test_catalog_includes_m1_family(self):
-        from hermes_cli.models import _PROVIDER_MODELS
-        for provider in ("minimax", "minimax-cn"):
-            models = _PROVIDER_MODELS[provider]
-            assert "MiniMax-M1" in models
-            assert "MiniMax-M1-40k" in models
-            assert "MiniMax-M1-80k" in models
-            assert "MiniMax-M1-128k" in models
-            assert "MiniMax-M1-256k" in models
-
-    def test_catalog_excludes_deprecated(self):
-        from hermes_cli.models import _PROVIDER_MODELS
-        for provider in ("minimax", "minimax-cn"):
-            models = _PROVIDER_MODELS[provider]
-            assert "MiniMax-M2.1" not in models
-
-    def test_catalog_excludes_highspeed(self):
-        from hermes_cli.models import _PROVIDER_MODELS
-        for provider in ("minimax", "minimax-cn"):
-            models = _PROVIDER_MODELS[provider]
-            assert "MiniMax-M2.7-highspeed" not in models
-            assert "MiniMax-M2.5-highspeed" not in models
@@ -1,66 +0,0 @@
-import pytest
-from unittest.mock import MagicMock, patch
-from hermes_cli.plugins import VALID_HOOKS, PluginManager
-import os
-import shutil
-import tempfile
-from cli import HermesCLI
-
-
-def test_session_hooks_in_valid_hooks():
-    """Verify on_session_finalize and on_session_reset are registered as valid hooks."""
-    assert "on_session_finalize" in VALID_HOOKS
-    assert "on_session_reset" in VALID_HOOKS
-
-
-@patch("hermes_cli.plugins.invoke_hook")
-def test_session_finalize_on_reset(mock_invoke_hook):
-    """Verify on_session_finalize fires when /new or /reset is used."""
-    cli = HermesCLI()
-    cli.agent = MagicMock()
-    cli.agent.session_id = "test-session-id"
-
-    # Simulate /new command which triggers on_session_finalize for the old session
-    cli.new_session(silent=True)
-
-    # Check if on_session_finalize was called for the old session
-    mock_invoke_hook.assert_any_call(
-        "on_session_finalize", session_id="test-session-id", platform="cli"
-    )
-    # Check if on_session_reset was called for the new session
-    mock_invoke_hook.assert_any_call(
-        "on_session_reset", session_id=cli.session_id, platform="cli"
-    )
-
-
-@patch("hermes_cli.plugins.invoke_hook")
-def test_session_finalize_on_cleanup(mock_invoke_hook):
-    """Verify on_session_finalize fires during CLI exit cleanup."""
-    import cli as cli_mod
-
-    mock_agent = MagicMock()
-    mock_agent.session_id = "cleanup-session-id"
-    cli_mod._active_agent_ref = mock_agent
-    cli_mod._cleanup_done = False
-
-    cli_mod._run_cleanup()
-
-    mock_invoke_hook.assert_any_call(
-        "on_session_finalize", session_id="cleanup-session-id", platform="cli"
-    )
-
-
-@patch("hermes_cli.plugins.invoke_hook")
-def test_hook_errors_are_caught(mock_invoke_hook):
-    """Verify hook exceptions are caught and don't crash the agent."""
-    mgr = PluginManager()
-
-    # Register a hook that raises
-    def bad_callback(**kwargs):
-        raise Exception("Hook failed")
-
-    mgr._hooks["on_session_finalize"] = [bad_callback]
-
-    # This should not raise
-    results = mgr.invoke_hook("on_session_finalize", session_id="test", platform="cli")
-    assert results == []
@@ -33,13 +33,6 @@ def git_repo(tmp_path):
        ["git", "commit", "-m", "Initial commit"],
        cwd=repo, capture_output=True,
    )
-    # Add a fake remote ref so cleanup logic sees the initial commit as
-    # "pushed".  Without this, `git log HEAD --not --remotes` treats every
-    # commit as unpushed and cleanup refuses to delete worktrees.
-    subprocess.run(
-        ["git", "update-ref", "refs/remotes/origin/main", "HEAD"],
-        cwd=repo, capture_output=True,
-    )
    return repo


@@ -88,11 +81,7 @@ def _setup_worktree(repo_root):


 def _cleanup_worktree(info):
-    """Test version of _cleanup_worktree.
-
-    Preserves the worktree only if it has unpushed commits.
-    Dirty working tree alone is not enough to keep it.
-    """
+    """Test version of _cleanup_worktree."""
    wt_path = info["path"]
    branch = info["branch"]
    repo_root = info["repo_root"]
@@ -100,15 +89,15 @@ def _cleanup_worktree(info):
    if not Path(wt_path).exists():
        return

-    # Check for unpushed commits
-    result = subprocess.run(
-        ["git", "log", "--oneline", "HEAD", "--not", "--remotes"],
+    # Check for uncommitted changes
+    status = subprocess.run(
+        ["git", "status", "--porcelain"],
        capture_output=True, text=True, timeout=10, cwd=wt_path,
    )
-    has_unpushed = bool(result.stdout.strip())
+    has_changes = bool(status.stdout.strip())

-    if has_unpushed:
-        return False  # Did not clean up — has unpushed commits
+    if has_changes:
+        return False  # Did not clean up

    subprocess.run(
        ["git", "worktree", "remove", wt_path, "--force"],
@@ -215,45 +204,20 @@ class TestWorktreeCleanup:
        assert result is True
        assert not Path(info["path"]).exists()

-    def test_dirty_worktree_cleaned_when_no_unpushed(self, git_repo):
-        """Dirty working tree without unpushed commits is cleaned up.
-
-        Agent sessions typically leave untracked files / artifacts behind.
-        Since all real work is in pushed commits, these don't warrant
-        keeping the worktree.
-        """
+    def test_dirty_worktree_kept(self, git_repo):
        info = _setup_worktree(str(git_repo))
        assert info is not None

-        # Make uncommitted changes (untracked file)
+        # Make uncommitted changes
        (Path(info["path"]) / "new-file.txt").write_text("uncommitted")
        subprocess.run(
            ["git", "add", "new-file.txt"],
            cwd=info["path"], capture_output=True,
        )

-        # The git_repo fixture already has a fake remote ref so the initial
-        # commit is seen as "pushed".  No unpushed commits → cleanup proceeds.
        result = _cleanup_worktree(info)
-        assert result is True  # Cleaned up despite dirty working tree
-        assert not Path(info["path"]).exists()
-
-    def test_worktree_with_unpushed_commits_kept(self, git_repo):
-        """Worktree with unpushed commits is preserved."""
-        info = _setup_worktree(str(git_repo))
-        assert info is not None
-
-        # Make a commit that is NOT on any remote
-        (Path(info["path"]) / "work.txt").write_text("real work")
-        subprocess.run(["git", "add", "work.txt"], cwd=info["path"], capture_output=True)
-        subprocess.run(
-            ["git", "commit", "-m", "agent work"],
-            cwd=info["path"], capture_output=True,
-        )
-
-        result = _cleanup_worktree(info)
-        assert result is False  # Kept — has unpushed commits
-        assert Path(info["path"]).exists()
+        assert result is False
+        assert Path(info["path"]).exists()  # Still there

    def test_branch_deleted_on_cleanup(self, git_repo):
        info = _setup_worktree(str(git_repo))
@@ -403,7 +367,7 @@ class TestMultipleWorktrees:
        lines = [l for l in result.stdout.strip().splitlines() if l.strip()]
        assert len(lines) == 11

-        # Cleanup all (git_repo fixture has a fake remote ref so cleanup works)
+        # Cleanup all
        for info in worktrees:
            # Discard changes first so cleanup works
            subprocess.run(
@@ -528,77 +492,33 @@ class TestStaleWorktreePruning:
        assert not pruned
        assert Path(info["path"]).exists()

-    def test_keeps_old_worktree_with_unpushed_commits(self, git_repo):
-        """Old worktrees (24-72h) with unpushed commits should NOT be pruned."""
+    def test_keeps_dirty_old_worktree(self, git_repo):
+        """Old worktrees with uncommitted changes should NOT be pruned."""
        import time

        info = _setup_worktree(str(git_repo))
        assert info is not None

-        # Make an unpushed commit
-        (Path(info["path"]) / "work.txt").write_text("real work")
-        subprocess.run(["git", "add", "work.txt"], cwd=info["path"], capture_output=True)
+        # Make it dirty
+        (Path(info["path"]) / "dirty.txt").write_text("uncommitted")
        subprocess.run(
-            ["git", "commit", "-m", "agent work"],
+            ["git", "add", "dirty.txt"],
            cwd=info["path"], capture_output=True,
        )

-        # Make it old (25h — in the 24-72h soft tier)
+        # Make it old
        old_time = time.time() - (25 * 3600)
        os.utime(info["path"], (old_time, old_time))

-        # Check for unpushed commits (simulates prune logic)
-        result = subprocess.run(
-            ["git", "log", "--oneline", "HEAD", "--not", "--remotes"],
+        # Check if it would be pruned
+        status = subprocess.run(
+            ["git", "status", "--porcelain"],
            capture_output=True, text=True, cwd=info["path"],
        )
-        has_unpushed = bool(result.stdout.strip())
-        assert has_unpushed  # Has unpushed commits → not pruned in soft tier
+        has_changes = bool(status.stdout.strip())
+        assert has_changes  # Should be dirty → not pruned
        assert Path(info["path"]).exists()

-    def test_force_prunes_very_old_worktree(self, git_repo):
-        """Worktrees older than 72h should be force-pruned regardless."""
-        import time
-
-        info = _setup_worktree(str(git_repo))
-        assert info is not None
-
-        # Make an unpushed commit (would normally protect it)
-        (Path(info["path"]) / "work.txt").write_text("stale work")
-        subprocess.run(["git", "add", "work.txt"], cwd=info["path"], capture_output=True)
-        subprocess.run(
-            ["git", "commit", "-m", "old agent work"],
-            cwd=info["path"], capture_output=True,
-        )
-
-        # Make it very old (73h — beyond the 72h hard threshold)
-        old_time = time.time() - (73 * 3600)
-        os.utime(info["path"], (old_time, old_time))
-
-        # Simulate the force-prune tier check
-        hard_cutoff = time.time() - (72 * 3600)
-        mtime = Path(info["path"]).stat().st_mtime
-        assert mtime <= hard_cutoff  # Should qualify for force removal
-
-        # Actually remove it (simulates _prune_stale_worktrees force path)
-        branch_result = subprocess.run(
-            ["git", "branch", "--show-current"],
-            capture_output=True, text=True, timeout=5, cwd=info["path"],
-        )
-        branch = branch_result.stdout.strip()
-
-        subprocess.run(
-            ["git", "worktree", "remove", info["path"], "--force"],
-            capture_output=True, text=True, timeout=15, cwd=str(git_repo),
-        )
-        if branch:
-            subprocess.run(
-                ["git", "branch", "-D", branch],
-                capture_output=True, text=True, timeout=10, cwd=str(git_repo),
-            )
-
-        assert not Path(info["path"]).exists()
-

 class TestEdgeCases:
    """Test edge cases for robustness."""
@@ -691,133 +611,6 @@ class TestTerminalCWDIntegration:
        assert result.stdout.strip() == "true"


-class TestOrphanedBranchPruning:
-    """Test cleanup of orphaned hermes/* and pr-* branches."""
-
-    def test_prunes_orphaned_hermes_branch(self, git_repo):
-        """hermes/hermes-* branches with no worktree should be deleted."""
-        # Create a branch that looks like a worktree branch but has no worktree
-        subprocess.run(
-            ["git", "branch", "hermes/hermes-deadbeef", "HEAD"],
-            cwd=str(git_repo), capture_output=True,
-        )
-
-        # Verify it exists
-        result = subprocess.run(
-            ["git", "branch", "--list", "hermes/hermes-deadbeef"],
-            capture_output=True, text=True, cwd=str(git_repo),
-        )
-        assert "hermes/hermes-deadbeef" in result.stdout
-
-        # Simulate _prune_orphaned_branches logic
-        result = subprocess.run(
-            ["git", "branch", "--format=%(refname:short)"],
-            capture_output=True, text=True, cwd=str(git_repo),
-        )
-        all_branches = [b.strip() for b in result.stdout.strip().split("\n") if b.strip()]
-
-        wt_result = subprocess.run(
-            ["git", "worktree", "list", "--porcelain"],
-            capture_output=True, text=True, cwd=str(git_repo),
-        )
-        active_branches = {"main"}
-        for line in wt_result.stdout.split("\n"):
-            if line.startswith("branch refs/heads/"):
-                active_branches.add(line.split("branch refs/heads/", 1)[-1].strip())
-
-        orphaned = [
-            b for b in all_branches
-            if b not in active_branches
-            and (b.startswith("hermes/hermes-") or b.startswith("pr-"))
-        ]
-        assert "hermes/hermes-deadbeef" in orphaned
-
-        # Delete them
-        if orphaned:
-            subprocess.run(
-                ["git", "branch", "-D"] + orphaned,
-                capture_output=True, text=True, cwd=str(git_repo),
-            )
-
-        # Verify gone
-        result = subprocess.run(
-            ["git", "branch", "--list", "hermes/hermes-deadbeef"],
-            capture_output=True, text=True, cwd=str(git_repo),
-        )
-        assert "hermes/hermes-deadbeef" not in result.stdout
-
-    def test_prunes_orphaned_pr_branch(self, git_repo):
-        """pr-* branches should be deleted during pruning."""
-        subprocess.run(
-            ["git", "branch", "pr-1234", "HEAD"],
-            cwd=str(git_repo), capture_output=True,
-        )
-        subprocess.run(
-            ["git", "branch", "pr-5678", "HEAD"],
-            cwd=str(git_repo), capture_output=True,
-        )
-
-        result = subprocess.run(
-            ["git", "branch", "--format=%(refname:short)"],
-            capture_output=True, text=True, cwd=str(git_repo),
-        )
-        all_branches = [b.strip() for b in result.stdout.strip().split("\n") if b.strip()]
-
-        active_branches = {"main"}
-        orphaned = [
-            b for b in all_branches
-            if b not in active_branches and b.startswith("pr-")
-        ]
-        assert "pr-1234" in orphaned
-        assert "pr-5678" in orphaned
-
-        subprocess.run(
-            ["git", "branch", "-D"] + orphaned,
-            capture_output=True, text=True, cwd=str(git_repo),
-        )
-
-        # Verify gone
-        result = subprocess.run(
-            ["git", "branch", "--format=%(refname:short)"],
-            capture_output=True, text=True, cwd=str(git_repo),
-        )
-        remaining = result.stdout.strip()
-        assert "pr-1234" not in remaining
-        assert "pr-5678" not in remaining
-
-    def test_preserves_active_worktree_branch(self, git_repo):
-        """Branches with active worktrees should NOT be pruned."""
-        info = _setup_worktree(str(git_repo))
-        assert info is not None
-
-        result = subprocess.run(
-            ["git", "worktree", "list", "--porcelain"],
-            capture_output=True, text=True, cwd=str(git_repo),
-        )
-        active_branches = set()
-        for line in result.stdout.split("\n"):
-            if line.startswith("branch refs/heads/"):
-                active_branches.add(line.split("branch refs/heads/", 1)[-1].strip())
-
-        assert info["branch"] in active_branches  # Protected
-
-    def test_preserves_main_branch(self, git_repo):
-        """main branch should never be pruned."""
-        result = subprocess.run(
-            ["git", "branch", "--format=%(refname:short)"],
-            capture_output=True, text=True, cwd=str(git_repo),
-        )
-        all_branches = [b.strip() for b in result.stdout.strip().split("\n") if b.strip()]
-        active_branches = {"main"}
-
-        orphaned = [
-            b for b in all_branches
-            if b not in active_branches
-            and (b.startswith("hermes/hermes-") or b.startswith("pr-"))
-        ]
-        assert "main" not in orphaned
-
-
 class TestSystemPromptInjection:
    """Test that the agent gets worktree context in its system prompt."""

@@ -832,7 +625,7 @@ class TestSystemPromptInjection:
            f"{info['path']}. Your branch is `{info['branch']}`. "
            f"Changes here do not affect the main working tree or other agents. "
            f"Remember to commit and push your changes, and create a PR if appropriate. "
-            f"The original repo is at {info['repo_root']}.]\n"
+            f"The original repo is at {info['repo_root']}.]"
        )

        assert info["path"] in wt_note
@@ -339,36 +339,6 @@ class TestMarkJobRun:
        assert updated["last_status"] == "error"
        assert updated["last_error"] == "timeout"

-    def test_delivery_error_tracked_separately(self, tmp_cron_dir):
-        """Agent succeeds but delivery fails — both tracked independently."""
-        job = create_job(prompt="Report", schedule="every 1h")
-        mark_job_run(job["id"], success=True, delivery_error="platform 'telegram' not configured")
-        updated = get_job(job["id"])
-        assert updated["last_status"] == "ok"
-        assert updated["last_error"] is None
-        assert updated["last_delivery_error"] == "platform 'telegram' not configured"
-
-    def test_delivery_error_cleared_on_success(self, tmp_cron_dir):
-        """Successful delivery clears the previous delivery error."""
-        job = create_job(prompt="Report", schedule="every 1h")
-        mark_job_run(job["id"], success=True, delivery_error="network timeout")
-        updated = get_job(job["id"])
-        assert updated["last_delivery_error"] == "network timeout"
-        # Next run delivers successfully
-        mark_job_run(job["id"], success=True, delivery_error=None)
-        updated = get_job(job["id"])
-        assert updated["last_delivery_error"] is None
-
-    def test_both_agent_and_delivery_error(self, tmp_cron_dir):
-        """Agent fails AND delivery fails — both errors recorded."""
-        job = create_job(prompt="Report", schedule="every 1h")
-        mark_job_run(job["id"], success=False, error="model timeout",
-                     delivery_error="platform 'discord' not enabled")
-        updated = get_job(job["id"])
-        assert updated["last_status"] == "error"
-        assert updated["last_error"] == "model timeout"
-        assert updated["last_delivery_error"] == "platform 'discord' not enabled"
-

 class TestAdvanceNextRun:
    """Tests for advance_next_run() — crash-safety for recurring jobs."""
@@ -508,90 +508,6 @@ class TestDeliverResultWrapping:
        assert send_mock.call_args.kwargs["thread_id"] == "17585"


-class TestDeliverResultErrorReturns:
-    """Verify _deliver_result returns error strings on failure, None on success."""
-
-    def test_returns_none_on_successful_delivery(self):
-        from gateway.config import Platform
-
-        pconfig = MagicMock()
-        pconfig.enabled = True
-        mock_cfg = MagicMock()
-        mock_cfg.platforms = {Platform.TELEGRAM: pconfig}
-
-        with patch("gateway.config.load_gateway_config", return_value=mock_cfg), \
-             patch("tools.send_message_tool._send_to_platform", new=AsyncMock(return_value={"success": True})):
-            job = {
-                "id": "ok-job",
-                "deliver": "origin",
-                "origin": {"platform": "telegram", "chat_id": "123"},
-            }
-            result = _deliver_result(job, "Output.")
-        assert result is None
-
-    def test_returns_none_for_local_delivery(self):
-        """local-only jobs don't deliver — not a failure."""
-        job = {"id": "local-job", "deliver": "local"}
-        result = _deliver_result(job, "Output.")
-        assert result is None
-
-    def test_returns_error_for_unknown_platform(self):
-        job = {
-            "id": "bad-platform",
-            "deliver": "origin",
-            "origin": {"platform": "fax", "chat_id": "123"},
-        }
-        with patch("gateway.config.load_gateway_config"):
-            result = _deliver_result(job, "Output.")
-        assert result is not None
-        assert "unknown platform" in result
-
-    def test_returns_error_when_platform_disabled(self):
-        from gateway.config import Platform
-
-        pconfig = MagicMock()
-        pconfig.enabled = False
-        mock_cfg = MagicMock()
-        mock_cfg.platforms = {Platform.TELEGRAM: pconfig}
-
-        with patch("gateway.config.load_gateway_config", return_value=mock_cfg):
-            job = {
-                "id": "disabled",
-                "deliver": "origin",
-                "origin": {"platform": "telegram", "chat_id": "123"},
-            }
-            result = _deliver_result(job, "Output.")
-        assert result is not None
-        assert "not configured" in result
-
-    def test_returns_error_on_send_failure(self):
-        from gateway.config import Platform
-
-        pconfig = MagicMock()
-        pconfig.enabled = True
-        mock_cfg = MagicMock()
-        mock_cfg.platforms = {Platform.TELEGRAM: pconfig}
-
-        with patch("gateway.config.load_gateway_config", return_value=mock_cfg), \
-             patch("tools.send_message_tool._send_to_platform", new=AsyncMock(return_value={"error": "rate limited"})):
-            job = {
-                "id": "rate-limited",
-                "deliver": "origin",
-                "origin": {"platform": "telegram", "chat_id": "123"},
-            }
-            result = _deliver_result(job, "Output.")
-        assert result is not None
-        assert "rate limited" in result
-
-    def test_returns_error_for_unresolved_target(self, monkeypatch):
-        """Non-local delivery with no resolvable target should return an error."""
-        monkeypatch.delenv("TELEGRAM_HOME_CHANNEL", raising=False)
-        job = {"id": "no-target", "deliver": "telegram"}
-        result = _deliver_result(job, "Output.")
-        assert result is not None
-        assert "no delivery target" in result
-
-
 class TestRunJobSessionPersistence:
    def test_run_job_passes_session_db_and_cron_platform(self, tmp_path):
        job = {
@@ -1,277 +0,0 @@
-"""Tests for Discord reply_to_mode functionality.
-
-Covers the threading behavior control for multi-chunk replies:
- "off": Never reply-reference to original message
- "first": Only first chunk uses reply reference (default)
- "all": All chunks reply-reference the original message
-"""
-import os
-import sys
-from types import SimpleNamespace
-from unittest.mock import MagicMock, AsyncMock, patch
-
-import pytest
-
-from gateway.config import PlatformConfig, GatewayConfig, Platform, _apply_env_overrides
-
-
-def _ensure_discord_mock():
-    """Install a mock discord module when discord.py isn't available."""
-    if "discord" in sys.modules and hasattr(sys.modules["discord"], "__file__"):
-        return
-
-    discord_mod = MagicMock()
-    discord_mod.Intents.default.return_value = MagicMock()
-    discord_mod.Client = MagicMock
-    discord_mod.File = MagicMock
-    discord_mod.DMChannel = type("DMChannel", (), {})
-    discord_mod.Thread = type("Thread", (), {})
-    discord_mod.ForumChannel = type("ForumChannel", (), {})
-    discord_mod.ui = SimpleNamespace(View=object, button=lambda *a, **k: (lambda fn: fn), Button=object)
-    discord_mod.ButtonStyle = SimpleNamespace(success=1, primary=2, secondary=2, danger=3, green=1, grey=2, blurple=2, red=3)
-    discord_mod.Color = SimpleNamespace(orange=lambda: 1, green=lambda: 2, blue=lambda: 3, red=lambda: 4, purple=lambda: 5)
-    discord_mod.Interaction = object
-    discord_mod.Embed = MagicMock
-    discord_mod.app_commands = SimpleNamespace(
-        describe=lambda **kwargs: (lambda fn: fn),
-        choices=lambda **kwargs: (lambda fn: fn),
-        Choice=lambda **kwargs: SimpleNamespace(**kwargs),
-    )
-
-    ext_mod = MagicMock()
-    commands_mod = MagicMock()
-    commands_mod.Bot = MagicMock
-    ext_mod.commands = commands_mod
-
-    sys.modules.setdefault("discord", discord_mod)
-    sys.modules.setdefault("discord.ext", ext_mod)
-    sys.modules.setdefault("discord.ext.commands", commands_mod)
-
-
-_ensure_discord_mock()
-
-from gateway.platforms.discord import DiscordAdapter  # noqa: E402
-
-
-@pytest.fixture()
-def adapter_factory():
-    """Factory to create DiscordAdapter with custom reply_to_mode."""
-    def create(reply_to_mode: str = "first"):
-        config = PlatformConfig(enabled=True, token="test-token", reply_to_mode=reply_to_mode)
-        return DiscordAdapter(config)
-    return create
-
-
-class TestReplyToModeConfig:
-    """Tests for reply_to_mode configuration loading."""
-
-    def test_default_mode_is_first(self, adapter_factory):
-        adapter = adapter_factory()
-        assert adapter._reply_to_mode == "first"
-
-    def test_off_mode(self, adapter_factory):
-        adapter = adapter_factory(reply_to_mode="off")
-        assert adapter._reply_to_mode == "off"
-
-    def test_first_mode(self, adapter_factory):
-        adapter = adapter_factory(reply_to_mode="first")
-        assert adapter._reply_to_mode == "first"
-
-    def test_all_mode(self, adapter_factory):
-        adapter = adapter_factory(reply_to_mode="all")
-        assert adapter._reply_to_mode == "all"
-
-    def test_invalid_mode_stored_as_is(self, adapter_factory):
-        """Invalid modes are stored but send() handles them gracefully."""
-        adapter = adapter_factory(reply_to_mode="invalid")
-        assert adapter._reply_to_mode == "invalid"
-
-    def test_none_mode_defaults_to_first(self):
-        config = PlatformConfig(enabled=True, token="test-token")
-        adapter = DiscordAdapter(config)
-        assert adapter._reply_to_mode == "first"
-
-    def test_empty_string_mode_defaults_to_first(self):
-        config = PlatformConfig(enabled=True, token="test-token", reply_to_mode="")
-        adapter = DiscordAdapter(config)
-        assert adapter._reply_to_mode == "first"
-
-
-def _make_discord_adapter(reply_to_mode: str = "first"):
-    """Create a DiscordAdapter with mocked client and channel for send() tests."""
-    config = PlatformConfig(enabled=True, token="test-token", reply_to_mode=reply_to_mode)
-    adapter = DiscordAdapter(config)
-
-    # Mock the Discord client and channel
-    mock_channel = AsyncMock()
-    ref_message = MagicMock()
-    mock_channel.fetch_message = AsyncMock(return_value=ref_message)
-
-    sent_msg = MagicMock()
-    sent_msg.id = 42
-    mock_channel.send = AsyncMock(return_value=sent_msg)
-
-    mock_client = MagicMock()
-    mock_client.get_channel = MagicMock(return_value=mock_channel)
-
-    adapter._client = mock_client
-    return adapter, mock_channel, ref_message
-
-
-class TestSendWithReplyToMode:
-    """Tests for send() method respecting reply_to_mode."""
-
-    @pytest.mark.asyncio
-    async def test_off_mode_no_reply_reference(self):
-        adapter, channel, ref_msg = _make_discord_adapter("off")
-        adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2", "chunk3"]
-
-        await adapter.send("12345", "test content", reply_to="999")
-
-        # Should never try to fetch the reference message
-        channel.fetch_message.assert_not_called()
-        # All chunks sent without reference
-        for call in channel.send.call_args_list:
-            assert call.kwargs.get("reference") is None
-
-    @pytest.mark.asyncio
-    async def test_first_mode_only_first_chunk_references(self):
-        adapter, channel, ref_msg = _make_discord_adapter("first")
-        adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2", "chunk3"]
-
-        await adapter.send("12345", "test content", reply_to="999")
-
-        # Should fetch the reference message
-        channel.fetch_message.assert_called_once_with(999)
-        calls = channel.send.call_args_list
-        assert len(calls) == 3
-        assert calls[0].kwargs.get("reference") is ref_msg
-        assert calls[1].kwargs.get("reference") is None
-        assert calls[2].kwargs.get("reference") is None
-
-    @pytest.mark.asyncio
-    async def test_all_mode_all_chunks_reference(self):
-        adapter, channel, ref_msg = _make_discord_adapter("all")
-        adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2", "chunk3"]
-
-        await adapter.send("12345", "test content", reply_to="999")
-
-        channel.fetch_message.assert_called_once_with(999)
-        calls = channel.send.call_args_list
-        assert len(calls) == 3
-        for call in calls:
-            assert call.kwargs.get("reference") is ref_msg
-
-    @pytest.mark.asyncio
-    async def test_no_reply_to_param_no_reference(self):
-        adapter, channel, ref_msg = _make_discord_adapter("all")
-        adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2"]
-
-        await adapter.send("12345", "test content", reply_to=None)
-
-        channel.fetch_message.assert_not_called()
-        for call in channel.send.call_args_list:
-            assert call.kwargs.get("reference") is None
-
-    @pytest.mark.asyncio
-    async def test_single_chunk_respects_first_mode(self):
-        adapter, channel, ref_msg = _make_discord_adapter("first")
-        adapter.truncate_message = lambda content, max_len: ["single chunk"]
-
-        await adapter.send("12345", "test", reply_to="999")
-
-        calls = channel.send.call_args_list
-        assert len(calls) == 1
-        assert calls[0].kwargs.get("reference") is ref_msg
-
-    @pytest.mark.asyncio
-    async def test_single_chunk_off_mode(self):
-        adapter, channel, ref_msg = _make_discord_adapter("off")
-        adapter.truncate_message = lambda content, max_len: ["single chunk"]
-
-        await adapter.send("12345", "test", reply_to="999")
-
-        channel.fetch_message.assert_not_called()
-        calls = channel.send.call_args_list
-        assert len(calls) == 1
-        assert calls[0].kwargs.get("reference") is None
-
-    @pytest.mark.asyncio
-    async def test_invalid_mode_falls_back_to_first_behavior(self):
-        """Invalid mode behaves like 'first' — only first chunk gets reference."""
-        adapter, channel, ref_msg = _make_discord_adapter("banana")
-        adapter.truncate_message = lambda content, max_len: ["chunk1", "chunk2"]
-
-        await adapter.send("12345", "test", reply_to="999")
-
-        calls = channel.send.call_args_list
-        assert len(calls) == 2
-        assert calls[0].kwargs.get("reference") is ref_msg
-        assert calls[1].kwargs.get("reference") is None
-
-
-class TestConfigSerialization:
-    """Tests for reply_to_mode serialization (shared with Telegram)."""
-
-    def test_to_dict_includes_reply_to_mode(self):
-        config = PlatformConfig(enabled=True, token="test", reply_to_mode="all")
-        result = config.to_dict()
-        assert result["reply_to_mode"] == "all"
-
-    def test_from_dict_loads_reply_to_mode(self):
-        data = {"enabled": True, "token": "***", "reply_to_mode": "off"}
-        config = PlatformConfig.from_dict(data)
-        assert config.reply_to_mode == "off"
-
-    def test_from_dict_defaults_to_first(self):
-        data = {"enabled": True, "token": "***"}
-        config = PlatformConfig.from_dict(data)
-        assert config.reply_to_mode == "first"
-
-
-class TestEnvVarOverride:
-    """Tests for DISCORD_REPLY_TO_MODE environment variable override."""
-
-    def _make_config(self):
-        config = GatewayConfig()
-        config.platforms[Platform.DISCORD] = PlatformConfig(enabled=True, token="test")
-        return config
-
-    def test_env_var_sets_off_mode(self):
-        config = self._make_config()
-        with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "off"}, clear=False):
-            _apply_env_overrides(config)
-        assert config.platforms[Platform.DISCORD].reply_to_mode == "off"
-
-    def test_env_var_sets_all_mode(self):
-        config = self._make_config()
-        with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "all"}, clear=False):
-            _apply_env_overrides(config)
-        assert config.platforms[Platform.DISCORD].reply_to_mode == "all"
-
-    def test_env_var_case_insensitive(self):
-        config = self._make_config()
-        with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "ALL"}, clear=False):
-            _apply_env_overrides(config)
-        assert config.platforms[Platform.DISCORD].reply_to_mode == "all"
-
-    def test_env_var_invalid_value_ignored(self):
-        config = self._make_config()
-        with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "banana"}, clear=False):
-            _apply_env_overrides(config)
-        assert config.platforms[Platform.DISCORD].reply_to_mode == "first"
-
-    def test_env_var_empty_value_ignored(self):
-        config = self._make_config()
-        with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": ""}, clear=False):
-            _apply_env_overrides(config)
-        assert config.platforms[Platform.DISCORD].reply_to_mode == "first"
-
-    def test_env_var_creates_platform_config_if_missing(self):
-        """DISCORD_REPLY_TO_MODE creates PlatformConfig even without DISCORD_BOT_TOKEN."""
-        config = GatewayConfig()
-        assert Platform.DISCORD not in config.platforms
-        with patch.dict(os.environ, {"DISCORD_REPLY_TO_MODE": "off"}, clear=False):
-            _apply_env_overrides(config)
-        assert Platform.DISCORD in config.platforms
-        assert config.platforms[Platform.DISCORD].reply_to_mode == "off"
@@ -1,432 +0,0 @@
-"""Tests for Feishu interactive card approval buttons."""
-
-import asyncio
-import json
-import os
-import sys
-from pathlib import Path
-from types import SimpleNamespace
-from unittest.mock import AsyncMock, MagicMock, Mock, patch
-
-import pytest
-
-# ---------------------------------------------------------------------------
-# Ensure the repo root is importable
-# ---------------------------------------------------------------------------
-_repo = str(Path(__file__).resolve().parents[2])
-if _repo not in sys.path:
-    sys.path.insert(0, _repo)
-
-
-# ---------------------------------------------------------------------------
-# Minimal Feishu mock so FeishuAdapter can be imported without lark-oapi
-# ---------------------------------------------------------------------------
-def _ensure_feishu_mocks():
-    """Provide stubs for lark-oapi / aiohttp.web so the import succeeds."""
-    if "lark_oapi" not in sys.modules:
-        mod = MagicMock()
-        for name in (
-            "lark_oapi", "lark_oapi.api.im.v1",
-            "lark_oapi.event", "lark_oapi.event.callback_type",
-        ):
-            sys.modules.setdefault(name, mod)
-    if "aiohttp" not in sys.modules:
-        aio = MagicMock()
-        sys.modules.setdefault("aiohttp", aio)
-        sys.modules.setdefault("aiohttp.web", aio.web)
-
-
-_ensure_feishu_mocks()
-
-from gateway.config import PlatformConfig
-from gateway.platforms.feishu import FeishuAdapter
-
-
-# ---------------------------------------------------------------------------
-# Helpers
-# ---------------------------------------------------------------------------
-
-def _make_adapter() -> FeishuAdapter:
-    """Create a FeishuAdapter with mocked internals."""
-    config = PlatformConfig(enabled=True)
-    adapter = FeishuAdapter(config)
-    adapter._client = MagicMock()
-    return adapter
-
-
-def _make_card_action_data(
-    action_value: dict,
-    chat_id: str = "oc_12345",
-    open_id: str = "ou_user1",
-    token: str = "tok_abc",
-) -> SimpleNamespace:
-    """Create a mock Feishu card action callback data object."""
-    return SimpleNamespace(
-        event=SimpleNamespace(
-            token=token,
-            context=SimpleNamespace(open_chat_id=chat_id),
-            operator=SimpleNamespace(open_id=open_id),
-            action=SimpleNamespace(
-                tag="button",
-                value=action_value,
-            ),
-        ),
-    )
-
-
-# ===========================================================================
-# send_exec_approval — interactive card with buttons
-# ===========================================================================
-
-class TestFeishuExecApproval:
-    """Test send_exec_approval sends an interactive card."""
-
-    @pytest.mark.asyncio
-    async def test_sends_interactive_card(self):
-        adapter = _make_adapter()
-
-        mock_response = SimpleNamespace(
-            success=lambda: True,
-            data=SimpleNamespace(message_id="msg_001"),
-        )
-        with patch.object(
-            adapter, "_feishu_send_with_retry", new_callable=AsyncMock,
-            return_value=mock_response,
-        ) as mock_send:
-            result = await adapter.send_exec_approval(
-                chat_id="oc_12345",
-                command="rm -rf /important",
-                session_key="agent:main:feishu:group:oc_12345",
-                description="dangerous deletion",
-            )
-
-        assert result.success is True
-        assert result.message_id == "msg_001"
-
-        mock_send.assert_called_once()
-        kwargs = mock_send.call_args[1]
-        assert kwargs["chat_id"] == "oc_12345"
-        assert kwargs["msg_type"] == "interactive"
-
-        # Verify card payload contains the command and buttons
-        card = json.loads(kwargs["payload"])
-        assert card["header"]["template"] == "orange"
-        assert "rm -rf /important" in card["elements"][0]["content"]
-        assert "dangerous deletion" in card["elements"][0]["content"]
-
-        # Check buttons
-        actions = card["elements"][1]["actions"]
-        assert len(actions) == 4
-        action_names = [a["value"]["hermes_action"] for a in actions]
-        assert action_names == [
-            "approve_once", "approve_session", "approve_always", "deny"
-        ]
-
-    @pytest.mark.asyncio
-    async def test_stores_approval_state(self):
-        adapter = _make_adapter()
-
-        mock_response = SimpleNamespace(
-            success=lambda: True,
-            data=SimpleNamespace(message_id="msg_002"),
-        )
-        with patch.object(
-            adapter, "_feishu_send_with_retry", new_callable=AsyncMock,
-            return_value=mock_response,
-        ):
-            await adapter.send_exec_approval(
-                chat_id="oc_12345",
-                command="echo test",
-                session_key="my-session-key",
-            )
-
-        assert len(adapter._approval_state) == 1
-        approval_id = list(adapter._approval_state.keys())[0]
-        state = adapter._approval_state[approval_id]
-        assert state["session_key"] == "my-session-key"
-        assert state["message_id"] == "msg_002"
-        assert state["chat_id"] == "oc_12345"
-
-    @pytest.mark.asyncio
-    async def test_not_connected(self):
-        adapter = _make_adapter()
-        adapter._client = None
-        result = await adapter.send_exec_approval(
-            chat_id="oc_12345", command="ls", session_key="s"
-        )
-        assert result.success is False
-
-    @pytest.mark.asyncio
-    async def test_truncates_long_command(self):
-        adapter = _make_adapter()
-
-        mock_response = SimpleNamespace(
-            success=lambda: True,
-            data=SimpleNamespace(message_id="msg_003"),
-        )
-        with patch.object(
-            adapter, "_feishu_send_with_retry", new_callable=AsyncMock,
-            return_value=mock_response,
-        ) as mock_send:
-            long_cmd = "x" * 5000
-            await adapter.send_exec_approval(
-                chat_id="oc_12345", command=long_cmd, session_key="s"
-            )
-
-        card = json.loads(mock_send.call_args[1]["payload"])
-        content = card["elements"][0]["content"]
-        assert "..." in content
-        assert len(content) < 5000
-
-    @pytest.mark.asyncio
-    async def test_multiple_approvals_get_unique_ids(self):
-        adapter = _make_adapter()
-
-        mock_response = SimpleNamespace(
-            success=lambda: True,
-            data=SimpleNamespace(message_id="msg_x"),
-        )
-        with patch.object(
-            adapter, "_feishu_send_with_retry", new_callable=AsyncMock,
-            return_value=mock_response,
-        ):
-            await adapter.send_exec_approval(
-                chat_id="oc_1", command="cmd1", session_key="s1"
-            )
-            await adapter.send_exec_approval(
-                chat_id="oc_2", command="cmd2", session_key="s2"
-            )
-
-        assert len(adapter._approval_state) == 2
-        ids = list(adapter._approval_state.keys())
-        assert ids[0] != ids[1]
-
-
-# ===========================================================================
-# _handle_card_action_event — approval button clicks
-# ===========================================================================
-
-class TestFeishuApprovalCallback:
-    """Test the approval intercept in _handle_card_action_event."""
-
-    @pytest.mark.asyncio
-    async def test_resolves_approval_on_click(self):
-        adapter = _make_adapter()
-        adapter._approval_state[1] = {
-            "session_key": "agent:main:feishu:group:oc_12345",
-            "message_id": "msg_001",
-            "chat_id": "oc_12345",
-        }
-
-        data = _make_card_action_data(
-            action_value={"hermes_action": "approve_once", "approval_id": 1},
-        )
-
-        with (
-            patch.object(
-                adapter, "_resolve_sender_profile", new_callable=AsyncMock,
-                return_value={"user_id": "ou_user1", "user_name": "Norbert", "user_id_alt": None},
-            ),
-            patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
-            patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
-        ):
-            await adapter._handle_card_action_event(data)
-
-        mock_resolve.assert_called_once_with("agent:main:feishu:group:oc_12345", "once")
-        mock_update.assert_called_once_with("msg_001", "Approved once", "Norbert", "once")
-
-        # State should be cleaned up
-        assert 1 not in adapter._approval_state
-
-    @pytest.mark.asyncio
-    async def test_deny_button(self):
-        adapter = _make_adapter()
-        adapter._approval_state[2] = {
-            "session_key": "some-session",
-            "message_id": "msg_002",
-            "chat_id": "oc_12345",
-        }
-
-        data = _make_card_action_data(
-            action_value={"hermes_action": "deny", "approval_id": 2},
-            token="tok_deny",
-        )
-
-        with (
-            patch.object(
-                adapter, "_resolve_sender_profile", new_callable=AsyncMock,
-                return_value={"user_id": "ou_alice", "user_name": "Alice", "user_id_alt": None},
-            ),
-            patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
-            patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
-        ):
-            await adapter._handle_card_action_event(data)
-
-        mock_resolve.assert_called_once_with("some-session", "deny")
-        mock_update.assert_called_once_with("msg_002", "Denied", "Alice", "deny")
-
-    @pytest.mark.asyncio
-    async def test_session_approval(self):
-        adapter = _make_adapter()
-        adapter._approval_state[3] = {
-            "session_key": "sess-3",
-            "message_id": "msg_003",
-            "chat_id": "oc_99",
-        }
-
-        data = _make_card_action_data(
-            action_value={"hermes_action": "approve_session", "approval_id": 3},
-            token="tok_ses",
-        )
-
-        with (
-            patch.object(
-                adapter, "_resolve_sender_profile", new_callable=AsyncMock,
-                return_value={"user_id": "ou_u", "user_name": "Bob", "user_id_alt": None},
-            ),
-            patch.object(adapter, "_update_approval_card", new_callable=AsyncMock) as mock_update,
-            patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
-        ):
-            await adapter._handle_card_action_event(data)
-
-        mock_resolve.assert_called_once_with("sess-3", "session")
-        mock_update.assert_called_once_with("msg_003", "Approved for session", "Bob", "session")
-
-    @pytest.mark.asyncio
-    async def test_always_approval(self):
-        adapter = _make_adapter()
-        adapter._approval_state[4] = {
-            "session_key": "sess-4",
-            "message_id": "msg_004",
-            "chat_id": "oc_55",
-        }
-
-        data = _make_card_action_data(
-            action_value={"hermes_action": "approve_always", "approval_id": 4},
-            token="tok_alw",
-        )
-
-        with (
-            patch.object(
-                adapter, "_resolve_sender_profile", new_callable=AsyncMock,
-                return_value={"user_id": "ou_u", "user_name": "Carol", "user_id_alt": None},
-            ),
-            patch.object(adapter, "_update_approval_card", new_callable=AsyncMock),
-            patch("tools.approval.resolve_gateway_approval", return_value=1) as mock_resolve,
-        ):
-            await adapter._handle_card_action_event(data)
-
-        mock_resolve.assert_called_once_with("sess-4", "always")
-
-    @pytest.mark.asyncio
-    async def test_already_resolved_drops_silently(self):
-        adapter = _make_adapter()
-        # No state for approval_id 99 — already resolved
-
-        data = _make_card_action_data(
-            action_value={"hermes_action": "approve_once", "approval_id": 99},
-            token="tok_gone",
-        )
-
-        with patch("tools.approval.resolve_gateway_approval") as mock_resolve:
-            await adapter._handle_card_action_event(data)
-
-        # Should NOT resolve — already handled
-        mock_resolve.assert_not_called()
-
-    @pytest.mark.asyncio
-    async def test_non_approval_actions_route_normally(self):
-        """Non-approval card actions should still become synthetic commands."""
-        adapter = _make_adapter()
-
-        data = _make_card_action_data(
-            action_value={"custom_action": "something_else"},
-            token="tok_normal",
-        )
-
-        with (
-            patch.object(
-                adapter, "_resolve_sender_profile", new_callable=AsyncMock,
-                return_value={"user_id": "ou_u", "user_name": "Dave", "user_id_alt": None},
-            ),
-            patch.object(adapter, "get_chat_info", new_callable=AsyncMock, return_value={"name": "Test Chat"}),
-            patch.object(adapter, "_handle_message_with_guards", new_callable=AsyncMock) as mock_handle,
-            patch("tools.approval.resolve_gateway_approval") as mock_resolve,
-        ):
-            await adapter._handle_card_action_event(data)
-
-        # Should NOT resolve any approval
-        mock_resolve.assert_not_called()
-        # Should have routed as synthetic command
-        mock_handle.assert_called_once()
-        event = mock_handle.call_args[0][0]
-        assert "/card button" in event.text
-
-
-# ===========================================================================
-# _update_approval_card — card replacement after resolution
-# ===========================================================================
-
-class TestFeishuUpdateApprovalCard:
-    """Test the card update after approval resolution."""
-
-    @pytest.mark.asyncio
-    async def test_updates_card_on_approve(self):
-        adapter = _make_adapter()
-
-        mock_update = AsyncMock()
-        adapter._client.im.v1.message.update = MagicMock()
-
-        with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
-            await adapter._update_approval_card(
-                "msg_001", "Approved once", "Norbert", "once"
-            )
-
-        mock_thread.assert_called_once()
-        # Verify the update request was built
-        call_args = mock_thread.call_args
-        assert call_args[0][0] == adapter._client.im.v1.message.update
-
-    @pytest.mark.asyncio
-    async def test_updates_card_on_deny(self):
-        adapter = _make_adapter()
-
-        with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
-            await adapter._update_approval_card(
-                "msg_002", "Denied", "Alice", "deny"
-            )
-
-        mock_thread.assert_called_once()
-
-    @pytest.mark.asyncio
-    async def test_skips_update_when_not_connected(self):
-        adapter = _make_adapter()
-        adapter._client = None
-
-        with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
-            await adapter._update_approval_card(
-                "msg_001", "Approved", "Bob", "once"
-            )
-
-        mock_thread.assert_not_called()
-
-    @pytest.mark.asyncio
-    async def test_skips_update_when_no_message_id(self):
-        adapter = _make_adapter()
-
-        with patch("asyncio.to_thread", new_callable=AsyncMock) as mock_thread:
-            await adapter._update_approval_card(
-                "", "Approved", "Bob", "once"
-            )
-
-        mock_thread.assert_not_called()
-
-    @pytest.mark.asyncio
-    async def test_swallows_update_errors(self):
-        adapter = _make_adapter()
-
-        with patch("asyncio.to_thread", new_callable=AsyncMock, side_effect=Exception("API error")):
-            # Should not raise
-            await adapter._update_approval_card(
-                "msg_001", "Approved", "Bob", "once"
-            )
@@ -87,6 +87,7 @@ class TestReasoningCommand:
        )

        monkeypatch.setattr(gateway_run, "_hermes_home", hermes_home)
+        monkeypatch.delenv("HERMES_REASONING_EFFORT", raising=False)

        runner = _make_runner()
        runner._reasoning_config = {"enabled": True, "effort": "xhigh"}
@@ -107,6 +108,7 @@ class TestReasoningCommand:
        config_path.write_text("agent:\n  reasoning_effort: medium\n", encoding="utf-8")

        monkeypatch.setattr(gateway_run, "_hermes_home", hermes_home)
+        monkeypatch.delenv("HERMES_REASONING_EFFORT", raising=False)

        runner = _make_runner()
        runner._reasoning_config = {"enabled": True, "effort": "medium"}
@@ -136,6 +138,7 @@ class TestReasoningCommand:
                "api_key": "test-key",
            },
        )
+        monkeypatch.delenv("HERMES_REASONING_EFFORT", raising=False)
        fake_run_agent = types.ModuleType("run_agent")
        fake_run_agent.AIAgent = _CapturingAgent
        monkeypatch.setitem(sys.modules, "run_agent", fake_run_agent)
@@ -167,6 +170,55 @@ class TestReasoningCommand:
        assert _CapturingAgent.last_init is not None
        assert _CapturingAgent.last_init["reasoning_config"] == {"enabled": True, "effort": "low"}

+    def test_run_agent_prefers_config_over_stale_reasoning_env(self, tmp_path, monkeypatch):
+        hermes_home = tmp_path / "hermes"
+        hermes_home.mkdir()
+        (hermes_home / "config.yaml").write_text("agent:\n  reasoning_effort: none\n", encoding="utf-8")
+
+        monkeypatch.setattr(gateway_run, "_hermes_home", hermes_home)
+        monkeypatch.setattr(gateway_run, "_env_path", hermes_home / ".env")
+        monkeypatch.setattr(gateway_run, "load_dotenv", lambda *args, **kwargs: None)
+        monkeypatch.setattr(
+            gateway_run,
+            "_resolve_runtime_agent_kwargs",
+            lambda: {
+                "provider": "openrouter",
+                "api_mode": "chat_completions",
+                "base_url": "https://openrouter.ai/api/v1",
+                "api_key": "test-key",
+            },
+        )
+        monkeypatch.setenv("HERMES_REASONING_EFFORT", "low")
+        fake_run_agent = types.ModuleType("run_agent")
+        fake_run_agent.AIAgent = _CapturingAgent
+        monkeypatch.setitem(sys.modules, "run_agent", fake_run_agent)
+
+        _CapturingAgent.last_init = None
+        runner = _make_runner()
+
+        source = SessionSource(
+            platform=Platform.LOCAL,
+            chat_id="cli",
+            chat_name="CLI",
+            chat_type="dm",
+            user_id="user-1",
+        )
+
+        result = asyncio.run(
+            runner._run_agent(
+                message="ping",
+                context_prompt="",
+                history=[],
+                source=source,
+                session_id="session-1",
+                session_key="agent:main:local:dm",
+            )
+        )
+
+        assert result["final_response"] == "ok"
+        assert _CapturingAgent.last_init is not None
+        assert _CapturingAgent.last_init["reasoning_config"] == {"enabled": False}
+
    def test_run_agent_includes_enabled_mcp_servers_in_gateway_toolsets(self, tmp_path, monkeypatch):
        hermes_home = tmp_path / "hermes"
        hermes_home.mkdir()
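
Note on the hunk above: the added test pins down a precedence rule, namely that `agent.reasoning_effort` in `config.yaml` wins over a stale `HERMES_REASONING_EFFORT` environment variable, and that `none` disables reasoning. A minimal sketch of that rule follows. The helper name `resolve_reasoning_config`, the fallback to the environment value, and the `None` return when nothing is set are all assumptions made for illustration; the gateway's real resolution code is not part of this diff.

```python
from typing import Optional


def resolve_reasoning_config(
    config_effort: Optional[str],
    env_effort: Optional[str],
) -> Optional[dict]:
    """Hypothetical helper: choose a reasoning config, preferring config.yaml."""
    # An explicit config value wins; the environment variable is only a fallback
    # (the fallback itself is an assumption, not something the test asserts).
    effort = config_effort if config_effort is not None else env_effort
    if effort is None:
        return None  # assumption: no opinion when neither source is set
    effort = effort.strip().lower()
    if effort == "none":
        # Matches the test's expected {"enabled": False}
        return {"enabled": False}
    return {"enabled": True, "effort": effort}
```

Calling `resolve_reasoning_config("none", "low")` mirrors the scenario in the test: the config says `none`, the environment still says `low`, and the config value decides.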
@@ -1,158 +0,0 @@
-"""Tests that on_session_finalize and on_session_reset plugin hooks fire in the gateway."""
-from datetime import datetime
-from types import SimpleNamespace
-from unittest.mock import AsyncMock, MagicMock, patch
-
-import pytest
-
-from gateway.config import GatewayConfig, Platform, PlatformConfig
-from gateway.platforms.base import MessageEvent
-from gateway.session import SessionEntry, SessionSource, build_session_key
-
-
-def _make_source() -> SessionSource:
-    return SessionSource(
-        platform=Platform.TELEGRAM,
-        user_id="u1",
-        chat_id="c1",
-        user_name="tester",
-        chat_type="dm",
-    )
-
-
-def _make_event(text: str) -> MessageEvent:
-    return MessageEvent(text=text, source=_make_source(), message_id="m1")
-
-
-def _make_runner():
-    from gateway.run import GatewayRunner
-
-    runner = object.__new__(GatewayRunner)
-    runner.config = GatewayConfig(
-        platforms={Platform.TELEGRAM: PlatformConfig(enabled=True, token="***")}
-    )
-    adapter = MagicMock()
-    adapter.send = AsyncMock()
-    runner.adapters = {Platform.TELEGRAM: adapter}
-    runner._voice_mode = {}
-    runner.hooks = SimpleNamespace(emit=AsyncMock(), loaded_hooks=False)
-    runner._session_model_overrides = {}
-    runner._pending_model_notes = {}
-    runner._background_tasks = set()
-
-    session_key = build_session_key(_make_source())
-    session_entry = SessionEntry(
-        session_key=session_key,
-        session_id="sess-old",
-        created_at=datetime.now(),
-        updated_at=datetime.now(),
-        platform=Platform.TELEGRAM,
-        chat_type="dm",
-    )
-    new_session_entry = SessionEntry(
-        session_key=session_key,
-        session_id="sess-new",
-        created_at=datetime.now(),
-        updated_at=datetime.now(),
-        platform=Platform.TELEGRAM,
-        chat_type="dm",
-    )
-    runner.session_store = MagicMock()
-    runner.session_store.get_or_create_session.return_value = new_session_entry
-    runner.session_store.reset_session.return_value = new_session_entry
-    runner.session_store._entries = {session_key: session_entry}
-    runner.session_store._generate_session_key.return_value = session_key
-    runner._running_agents = {}
-    runner._pending_messages = {}
-    runner._pending_approvals = {}
-    runner._session_db = None
-    runner._agent_cache_lock = None
-    runner._is_user_authorized = lambda _source: True
-    runner._format_session_info = lambda: ""
-
-    return runner
-
-
-@pytest.mark.asyncio
-@patch("hermes_cli.plugins.invoke_hook")
-async def test_reset_fires_finalize_hook(mock_invoke_hook):
-    """/new must fire on_session_finalize with the OLD session id."""
-    runner = _make_runner()
-
-    await runner._handle_reset_command(_make_event("/new"))
-
-    mock_invoke_hook.assert_any_call(
-        "on_session_finalize", session_id="sess-old", platform="telegram"
-    )
-
-
-@pytest.mark.asyncio
-@patch("hermes_cli.plugins.invoke_hook")
-async def test_reset_fires_reset_hook(mock_invoke_hook):
-    """/new must fire on_session_reset with the NEW session id."""
-    runner = _make_runner()
-
-    await runner._handle_reset_command(_make_event("/new"))
-
-    mock_invoke_hook.assert_any_call(
-        "on_session_reset", session_id="sess-new", platform="telegram"
-    )
-
-
-@pytest.mark.asyncio
-@patch("hermes_cli.plugins.invoke_hook")
-async def test_finalize_before_reset(mock_invoke_hook):
-    """on_session_finalize must fire before on_session_reset."""
-    runner = _make_runner()
-
-    await runner._handle_reset_command(_make_event("/new"))
-
-    calls = [c for c in mock_invoke_hook.call_args_list
-             if c[0][0] in ("on_session_finalize", "on_session_reset")]
-    hook_names = [c[0][0] for c in calls]
-    assert hook_names == ["on_session_finalize", "on_session_reset"]
-
-
-@pytest.mark.asyncio
-@patch("hermes_cli.plugins.invoke_hook")
-async def test_shutdown_fires_finalize_for_active_agents(mock_invoke_hook):
-    """Gateway stop() must fire on_session_finalize for each active agent."""
-    from gateway.run import GatewayRunner
-
-    runner = object.__new__(GatewayRunner)
-    runner._running = True
-    runner._background_tasks = set()
-    runner._pending_messages = {}
-    runner._pending_approvals = {}
-    runner._shutdown_event = MagicMock()
-    runner.adapters = {}
-    runner._exit_reason = "test"
-
-    agent1 = MagicMock()
-    agent1.session_id = "sess-a"
-    agent2 = MagicMock()
-    agent2.session_id = "sess-b"
-    runner._running_agents = {"key-a": agent1, "key-b": agent2}
-
-    with patch("gateway.status.remove_pid_file"), \
-         patch("gateway.status.write_runtime_status"):
-        await runner.stop()
-
-    finalize_calls = [
-        c for c in mock_invoke_hook.call_args_list
-        if c[0][0] == "on_session_finalize"
-    ]
-    session_ids = {c[1]["session_id"] for c in finalize_calls}
-    assert session_ids == {"sess-a", "sess-b"}
-
-
-@pytest.mark.asyncio
-@patch("hermes_cli.plugins.invoke_hook", side_effect=Exception("boom"))
-async def test_hook_error_does_not_break_reset(mock_invoke_hook):
-    """Plugin hook errors must not prevent /new from completing."""
-    runner = _make_runner()
-
-    result = await runner._handle_reset_command(_make_event("/new"))
-
-    # Should still return a success message despite hook errors
-    assert "Session reset" in result or "New session" in result
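
Note on the removed file above: its tests described three properties of the `/new` path: `on_session_finalize` fires with the old session id, `on_session_reset` fires with the new one, in that order, and a raising plugin hook must not break the reset. The sketch below shows one way such a contract could be satisfied. `fire_reset_hooks` is a hypothetical name, and `invoke_hook` is passed in as a plain callable rather than imported, so nothing here should be read as the gateway's actual implementation.

```python
from typing import Callable


def fire_reset_hooks(
    invoke_hook: Callable[..., object],
    old_session_id: str,
    new_session_id: str,
    platform: str,
) -> None:
    """Hypothetical helper: finalize the old session, then announce the new one."""
    ordered_hooks = (
        ("on_session_finalize", old_session_id),
        ("on_session_reset", new_session_id),
    )
    for hook_name, session_id in ordered_hooks:
        try:
            invoke_hook(hook_name, session_id=session_id, platform=platform)
        except Exception:
            # A failing plugin hook is swallowed so the reset itself still completes.
            pass
```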
@@ -707,66 +707,3 @@ class TestSignalSendDocumentViaHelper:

        assert result.success is False
        assert "/nonexistent.pdf" in result.error
-
-
-# ---------------------------------------------------------------------------
-# send() returns message_id from timestamp (#4647)
-# ---------------------------------------------------------------------------
-
-class TestSignalSendReturnsMessageId:
-    """Signal send() must return a timestamp-based message_id so the stream
-    consumer can follow its edit→fallback path correctly."""
-
-    @pytest.mark.asyncio
-    async def test_send_returns_timestamp_as_message_id(self, monkeypatch):
-        adapter = _make_signal_adapter(monkeypatch)
-        mock_rpc, _ = _stub_rpc({"timestamp": 1712345678000})
-        adapter._rpc = mock_rpc
-        adapter._stop_typing_indicator = AsyncMock()
-
-        result = await adapter.send(chat_id="+155****4567", content="hello")
-
-        assert result.success is True
-        assert result.message_id == "1712345678000"
-
-    @pytest.mark.asyncio
-    async def test_send_returns_none_message_id_when_no_timestamp(self, monkeypatch):
-        adapter = _make_signal_adapter(monkeypatch)
-        mock_rpc, _ = _stub_rpc({})  # No timestamp key
-        adapter._rpc = mock_rpc
-        adapter._stop_typing_indicator = AsyncMock()
-
-        result = await adapter.send(chat_id="+155****4567", content="hello")
-
-        assert result.success is True
-        assert result.message_id is None
-
-    @pytest.mark.asyncio
-    async def test_send_returns_none_message_id_for_non_dict(self, monkeypatch):
-        adapter = _make_signal_adapter(monkeypatch)
-        mock_rpc, _ = _stub_rpc("ok")  # Non-dict result
-        adapter._rpc = mock_rpc
-        adapter._stop_typing_indicator = AsyncMock()
-
-        result = await adapter.send(chat_id="+155****4567", content="hello")
-
-        assert result.success is True
-        assert result.message_id is None
-
-
-# ---------------------------------------------------------------------------
-# stop_typing() delegates to _stop_typing_indicator (#4647)
-# ---------------------------------------------------------------------------
-
-class TestSignalStopTyping:
-    """Signal must expose a public stop_typing() so base adapter's
-    _keep_typing finally block can clean up platform-level typing tasks."""
-
-    @pytest.mark.asyncio
-    async def test_stop_typing_calls_private_method(self, monkeypatch):
-        adapter = _make_signal_adapter(monkeypatch)
-        adapter._stop_typing_indicator = AsyncMock()
-
-        await adapter.stop_typing("+155****4567")
-
-        adapter._stop_typing_indicator.assert_awaited_once_with("+155****4567")
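
Note on the removed `TestSignalSendReturnsMessageId` cases above: they expected `send()` to derive `message_id` from the RPC result's `timestamp`, and to return `None` when the key is missing or the result is not a dict. A small sketch of that mapping, assuming a standalone helper named `message_id_from_rpc_result` (hypothetical; the adapter code itself is not shown in this diff):

```python
from typing import Any, Optional


def message_id_from_rpc_result(result: Any) -> Optional[str]:
    """Hypothetical helper: turn a Signal RPC result into a message id."""
    # Only a dict carrying a timestamp yields an id; anything else maps to None.
    if isinstance(result, dict) and result.get("timestamp") is not None:
        return str(result["timestamp"])
    return None
```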
@@ -324,145 +324,3 @@ class TestSegmentBreakOnToolBoundary:
        await consumer.run()

        assert consumer.already_sent
-
-    @pytest.mark.asyncio
-    async def test_edit_failure_sends_only_unsent_tail_at_finish(self):
-        """If an edit fails mid-stream, send only the missing tail once at finish."""
-        adapter = MagicMock()
-        send_results = [
-            SimpleNamespace(success=True, message_id="msg_1"),
-            SimpleNamespace(success=True, message_id="msg_2"),
-        ]
-        adapter.send = AsyncMock(side_effect=send_results)
-        adapter.edit_message = AsyncMock(return_value=SimpleNamespace(success=False, error="flood_control:6"))
-        adapter.MAX_MESSAGE_LENGTH = 4096
-
-        config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5, cursor=" ▉")
-        consumer = GatewayStreamConsumer(adapter, "chat_123", config)
-
-        consumer.on_delta("Hello")
-        task = asyncio.create_task(consumer.run())
-        await asyncio.sleep(0.08)
-        consumer.on_delta(" world")
-        await asyncio.sleep(0.08)
-        consumer.finish()
-        await task
-
-        assert adapter.send.call_count == 2
-        first_text = adapter.send.call_args_list[0][1]["content"]
-        second_text = adapter.send.call_args_list[1][1]["content"]
-        assert "Hello" in first_text
-        assert second_text.strip() == "world"
-        assert consumer.already_sent
-
-    @pytest.mark.asyncio
-    async def test_segment_break_clears_failed_edit_fallback_state(self):
-        """A tool boundary after edit failure must not duplicate the next segment."""
-        adapter = MagicMock()
-        send_results = [
-            SimpleNamespace(success=True, message_id="msg_1"),
-            SimpleNamespace(success=True, message_id="msg_2"),
-        ]
-        adapter.send = AsyncMock(side_effect=send_results)
-        adapter.edit_message = AsyncMock(return_value=SimpleNamespace(success=False, error="flood_control:6"))
-        adapter.MAX_MESSAGE_LENGTH = 4096
-
-        config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5, cursor=" ▉")
-        consumer = GatewayStreamConsumer(adapter, "chat_123", config)
-
-        consumer.on_delta("Hello")
-        task = asyncio.create_task(consumer.run())
-        await asyncio.sleep(0.08)
-        consumer.on_delta(" world")
-        await asyncio.sleep(0.08)
-        consumer.on_delta(None)
-        consumer.on_delta("Next segment")
-        consumer.finish()
-        await task
-
-        sent_texts = [call[1]["content"] for call in adapter.send.call_args_list]
-        assert sent_texts == ["Hello ▉", "Next segment"]
-
-    @pytest.mark.asyncio
-    async def test_no_message_id_enters_fallback_mode(self):
-        """Platform returns success but no message_id (Signal) — must not
-        re-send on every delta.  Should enter fallback mode and send only
-        the continuation at finish."""
-        adapter = MagicMock()
-        # First send succeeds but returns no message_id (Signal behavior)
-        send_result_no_id = SimpleNamespace(success=True, message_id=None)
-        # Fallback final send succeeds
-        send_result_final = SimpleNamespace(success=True, message_id="msg_final")
-        adapter.send = AsyncMock(side_effect=[send_result_no_id, send_result_final])
-        adapter.edit_message = AsyncMock(return_value=SimpleNamespace(success=True))
-        adapter.MAX_MESSAGE_LENGTH = 4096
-
-        config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5)
-        consumer = GatewayStreamConsumer(adapter, "chat_123", config)
-
-        consumer.on_delta("Hello")
-        task = asyncio.create_task(consumer.run())
-        await asyncio.sleep(0.08)
-        consumer.on_delta(" world, this is a longer response.")
-        await asyncio.sleep(0.08)
-        consumer.finish()
-        await task
-
-        # Should send exactly 2 messages: initial chunk + fallback continuation
-        # NOT one message per delta
-        assert adapter.send.call_count == 2
-        assert consumer.already_sent
-        # edit_message should NOT have been called (no valid message_id to edit)
-        adapter.edit_message.assert_not_called()
-
-    @pytest.mark.asyncio
-    async def test_no_message_id_single_delta_marks_already_sent(self):
-        """When the entire response fits in one delta and platform returns no
-        message_id, already_sent must still be True to prevent the gateway
-        from re-sending the full response."""
-        adapter = MagicMock()
-        send_result = SimpleNamespace(success=True, message_id=None)
-        adapter.send = AsyncMock(return_value=send_result)
-        adapter.MAX_MESSAGE_LENGTH = 4096
-
-        config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5)
-        consumer = GatewayStreamConsumer(adapter, "chat_123", config)
-
-        consumer.on_delta("Short response.")
-        consumer.finish()
-
-        await consumer.run()
-
-        assert consumer.already_sent
-        # Only one send call (the initial message)
-        assert adapter.send.call_count == 1
-
-    @pytest.mark.asyncio
-    async def test_fallback_final_splits_long_continuation_without_dropping_text(self):
-        """Long continuation tails should be chunked when fallback final-send runs."""
-        adapter = MagicMock()
-        adapter.send = AsyncMock(side_effect=[
-            SimpleNamespace(success=True, message_id="msg_1"),
-            SimpleNamespace(success=True, message_id="msg_2"),
-            SimpleNamespace(success=True, message_id="msg_3"),
-        ])
-        adapter.edit_message = AsyncMock(return_value=SimpleNamespace(success=False, error="flood_control:6"))
-        adapter.MAX_MESSAGE_LENGTH = 610
-
-        config = StreamConsumerConfig(edit_interval=0.01, buffer_threshold=5, cursor=" ▉")
-        consumer = GatewayStreamConsumer(adapter, "chat_123", config)
-
-        prefix = "abc"
-        tail = "x" * 620
-        consumer.on_delta(prefix)
-        task = asyncio.create_task(consumer.run())
-        await asyncio.sleep(0.08)
-        consumer.on_delta(tail)
-        await asyncio.sleep(0.08)
-        consumer.finish()
-        await task
-
-        sent_texts = [call[1]["content"] for call in adapter.send.call_args_list]
-        assert len(sent_texts) == 3
-        assert sent_texts[0].startswith(prefix)
-        assert sum(len(t) for t in sent_texts[1:]) == len(tail)
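
Note on the removed stream-consumer tests above: they described a fallback in which, once edits fail (or no `message_id` is available), only the not-yet-delivered tail is sent at finish, split into chunks that fit the platform's message limit. The sketch below illustrates the tail-and-chunk arithmetic only. `unsent_tail_chunks` is a hypothetical function, it assumes `max_len` is positive, and it ignores details the real consumer may handle, such as cursor characters or whitespace trimming.

```python
from typing import List


def unsent_tail_chunks(final_text: str, delivered_prefix: str, max_len: int) -> List[str]:
    """Hypothetical helper: chunks of text that still need to be sent."""
    if final_text.startswith(delivered_prefix):
        # Skip what the user has already seen.
        tail = final_text[len(delivered_prefix):]
    else:
        # Assumption: if the delivered text diverged, resend everything.
        tail = final_text
    # Slice the tail into pieces no longer than max_len; an empty tail yields [].
    return [tail[i:i + max_len] for i in range(0, len(tail), max_len)]
```

With the values from the last removed test (a 3-character prefix, a 620-character tail, a limit of 610), this yields two continuation chunks whose lengths sum to 620, which together with the initial send gives the three messages that test expected.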
@@ -1,399 +0,0 @@
-"""Tests for Qwen OAuth provider authentication (hermes_cli/auth.py).
-
-Covers: _qwen_cli_auth_path, _read_qwen_cli_tokens, _save_qwen_cli_tokens,
-_qwen_access_token_is_expiring, _refresh_qwen_cli_tokens,
-resolve_qwen_runtime_credentials, get_qwen_auth_status.
-"""
-
-import json
-import os
-import stat
-import time
-from pathlib import Path
-from unittest.mock import MagicMock, patch
-
-import pytest
-
-from hermes_cli.auth import (
-    AuthError,
-    DEFAULT_QWEN_BASE_URL,
-    QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS,
-    _qwen_cli_auth_path,
-    _read_qwen_cli_tokens,
-    _save_qwen_cli_tokens,
-    _qwen_access_token_is_expiring,
-    _refresh_qwen_cli_tokens,
-    resolve_qwen_runtime_credentials,
-    get_qwen_auth_status,
-)
-
-
-# ---------------------------------------------------------------------------
-# Helpers
-# ---------------------------------------------------------------------------
-
-def _make_qwen_tokens(
-    access_token="test-access-token",
-    refresh_token="test-refresh-token",
-    expiry_date=None,
-    **extra,
-):
-    """Create a minimal Qwen CLI OAuth credential dict."""
-    if expiry_date is None:
-        # 1 hour from now in milliseconds
-        expiry_date = int((time.time() + 3600) * 1000)
-    data = {
-        "access_token": access_token,
-        "refresh_token": refresh_token,
-        "token_type": "Bearer",
-        "expiry_date": expiry_date,
-        "resource_url": "portal.qwen.ai",
-    }
-    data.update(extra)
-    return data
-
-
-def _write_qwen_creds(tmp_path, tokens=None):
-    """Write tokens to the Qwen CLI credentials file and return the path."""
-    qwen_dir = tmp_path / ".qwen"
-    qwen_dir.mkdir(parents=True, exist_ok=True)
-    creds_path = qwen_dir / "oauth_creds.json"
-    if tokens is None:
-        tokens = _make_qwen_tokens()
-    creds_path.write_text(json.dumps(tokens), encoding="utf-8")
-    return creds_path
-
-
-@pytest.fixture()
-def qwen_env(tmp_path, monkeypatch):
-    """Redirect _qwen_cli_auth_path to tmp_path/.qwen/oauth_creds.json."""
-    creds_path = tmp_path / ".qwen" / "oauth_creds.json"
-    monkeypatch.setattr(
-        "hermes_cli.auth._qwen_cli_auth_path", lambda: creds_path
-    )
-    return tmp_path
-
-
-# ---------------------------------------------------------------------------
-# _qwen_cli_auth_path
-# ---------------------------------------------------------------------------
-
-def test_qwen_cli_auth_path_returns_expected_location():
-    path = _qwen_cli_auth_path()
-    assert path == Path.home() / ".qwen" / "oauth_creds.json"
-
-
-# ---------------------------------------------------------------------------
-# _read_qwen_cli_tokens
-# ---------------------------------------------------------------------------
-
-def test_read_qwen_cli_tokens_success(qwen_env):
-    tokens = _make_qwen_tokens(access_token="my-access")
-    _write_qwen_creds(qwen_env, tokens)
-    result = _read_qwen_cli_tokens()
-    assert result["access_token"] == "my-access"
-    assert result["refresh_token"] == "test-refresh-token"
-
-
-def test_read_qwen_cli_tokens_missing_file(qwen_env):
-    with pytest.raises(AuthError) as exc:
-        _read_qwen_cli_tokens()
-    assert exc.value.code == "qwen_auth_missing"
-
-
-def test_read_qwen_cli_tokens_invalid_json(qwen_env):
-    creds_path = qwen_env / ".qwen" / "oauth_creds.json"
-    creds_path.parent.mkdir(parents=True, exist_ok=True)
-    creds_path.write_text("not json{{{", encoding="utf-8")
-    with pytest.raises(AuthError) as exc:
-        _read_qwen_cli_tokens()
-    assert exc.value.code == "qwen_auth_read_failed"
-
-
-def test_read_qwen_cli_tokens_non_dict(qwen_env):
-    creds_path = qwen_env / ".qwen" / "oauth_creds.json"
-    creds_path.parent.mkdir(parents=True, exist_ok=True)
-    creds_path.write_text(json.dumps(["a", "b"]), encoding="utf-8")
-    with pytest.raises(AuthError) as exc:
-        _read_qwen_cli_tokens()
-    assert exc.value.code == "qwen_auth_invalid"
-
-
-# ---------------------------------------------------------------------------
-# _save_qwen_cli_tokens
-# ---------------------------------------------------------------------------
-
-def test_save_qwen_cli_tokens_roundtrip(qwen_env):
-    tokens = _make_qwen_tokens(access_token="saved-token")
-    saved_path = _save_qwen_cli_tokens(tokens)
-    assert saved_path.exists()
-    loaded = json.loads(saved_path.read_text(encoding="utf-8"))
-    assert loaded["access_token"] == "saved-token"
-
-
-def test_save_qwen_cli_tokens_creates_parent(qwen_env):
-    tokens = _make_qwen_tokens()
-    saved_path = _save_qwen_cli_tokens(tokens)
-    assert saved_path.parent.exists()
-
-
-def test_save_qwen_cli_tokens_permissions(qwen_env):
-    tokens = _make_qwen_tokens()
-    saved_path = _save_qwen_cli_tokens(tokens)
-    mode = saved_path.stat().st_mode
-    assert mode & stat.S_IRUSR  # owner read
-    assert mode & stat.S_IWUSR  # owner write
-    assert not (mode & stat.S_IRGRP)  # no group read
-    assert not (mode & stat.S_IROTH)  # no other read
-
-
-# ---------------------------------------------------------------------------
-# _qwen_access_token_is_expiring
-# ---------------------------------------------------------------------------
-
-def test_expiring_token_not_expired():
-    # 1 hour from now in milliseconds
-    future_ms = int((time.time() + 3600) * 1000)
-    assert not _qwen_access_token_is_expiring(future_ms)
-
-
-def test_expiring_token_already_expired():
-    # 1 hour ago in milliseconds
-    past_ms = int((time.time() - 3600) * 1000)
-    assert _qwen_access_token_is_expiring(past_ms)
-
-
-def test_expiring_token_within_skew():
-    # Just inside the default skew window
-    near_ms = int((time.time() + QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS - 5) * 1000)
-    assert _qwen_access_token_is_expiring(near_ms)
-
-
-def test_expiring_token_none_returns_true():
-    assert _qwen_access_token_is_expiring(None)
-
-
-def test_expiring_token_non_numeric_returns_true():
-    assert _qwen_access_token_is_expiring("not-a-number")
-
-
-# ---------------------------------------------------------------------------
-# _refresh_qwen_cli_tokens
-# ---------------------------------------------------------------------------
-
-def test_refresh_qwen_cli_tokens_success(qwen_env):
-    tokens = _make_qwen_tokens(refresh_token="old-refresh")
-
-    resp = MagicMock()
-    resp.status_code = 200
-    resp.json.return_value = {
-        "access_token": "new-access",
-        "refresh_token": "new-refresh",
-        "expires_in": 7200,
-    }
-
-    with patch("hermes_cli.auth.httpx") as mock_httpx:
-        mock_httpx.post.return_value = resp
-        result = _refresh_qwen_cli_tokens(tokens)
-
-    assert result["access_token"] == "new-access"
-    assert result["refresh_token"] == "new-refresh"
-    assert "expiry_date" in result
-
-
-def test_refresh_qwen_cli_tokens_preserves_old_refresh_if_not_in_response(qwen_env):
-    tokens = _make_qwen_tokens(refresh_token="keep-me")
-
-    resp = MagicMock()
-    resp.status_code = 200
-    resp.json.return_value = {
-        "access_token": "new-access",
-        # No refresh_token in response — should keep old one
-        "expires_in": 3600,
-    }
-
-    with patch("hermes_cli.auth.httpx") as mock_httpx:
-        mock_httpx.post.return_value = resp
-        result = _refresh_qwen_cli_tokens(tokens)
-
-    assert result["refresh_token"] == "keep-me"
-
-
-def test_refresh_qwen_cli_tokens_missing_refresh_token():
-    tokens = {"access_token": "at", "refresh_token": ""}
-    with pytest.raises(AuthError) as exc:
-        _refresh_qwen_cli_tokens(tokens)
-    assert exc.value.code == "qwen_refresh_token_missing"
-
-
-def test_refresh_qwen_cli_tokens_http_error(qwen_env):
-    tokens = _make_qwen_tokens()
-
-    resp = MagicMock()
-    resp.status_code = 401
-    resp.text = "unauthorized"
-
-    with patch("hermes_cli.auth.httpx") as mock_httpx:
-        mock_httpx.post.return_value = resp
-        with pytest.raises(AuthError) as exc:
-            _refresh_qwen_cli_tokens(tokens)
-    assert exc.value.code == "qwen_refresh_failed"
-
-
-def test_refresh_qwen_cli_tokens_network_error(qwen_env):
-    tokens = _make_qwen_tokens()
-
-    with patch("hermes_cli.auth.httpx") as mock_httpx:
-        mock_httpx.post.side_effect = ConnectionError("timeout")
-        with pytest.raises(AuthError) as exc:
-            _refresh_qwen_cli_tokens(tokens)
-    assert exc.value.code == "qwen_refresh_failed"
-
-
-def test_refresh_qwen_cli_tokens_invalid_json_response(qwen_env):
-    tokens = _make_qwen_tokens()
-
-    resp = MagicMock()
-    resp.status_code = 200
-    resp.json.side_effect = ValueError("bad json")
-
-    with patch("hermes_cli.auth.httpx") as mock_httpx:
-        mock_httpx.post.return_value = resp
-        with pytest.raises(AuthError) as exc:
-            _refresh_qwen_cli_tokens(tokens)
-    assert exc.value.code == "qwen_refresh_invalid_json"
-
-
-def test_refresh_qwen_cli_tokens_missing_access_token_in_response(qwen_env):
-    tokens = _make_qwen_tokens()
-
-    resp = MagicMock()
-    resp.status_code = 200
-    resp.json.return_value = {"something": "but no access_token"}
-
-    with patch("hermes_cli.auth.httpx") as mock_httpx:
-        mock_httpx.post.return_value = resp
-        with pytest.raises(AuthError) as exc:
-            _refresh_qwen_cli_tokens(tokens)
-    assert exc.value.code == "qwen_refresh_invalid_response"
-
-
-def test_refresh_qwen_cli_tokens_default_expires_in(qwen_env):
-    """When expires_in is missing, default to 6 hours."""
-    tokens = _make_qwen_tokens()
-
-    resp = MagicMock()
-    resp.status_code = 200
-    resp.json.return_value = {"access_token": "new"}
-
-    with patch("hermes_cli.auth.httpx") as mock_httpx:
-        mock_httpx.post.return_value = resp
-        result = _refresh_qwen_cli_tokens(tokens)
-
-    # Verify expiry_date is roughly now + 6h (within 60s tolerance)
-    expected_ms = int(time.time() * 1000) + 6 * 60 * 60 * 1000
-    assert abs(result["expiry_date"] - expected_ms) < 60_000
-
-
-def test_refresh_qwen_cli_tokens_saves_to_disk(qwen_env):
-    tokens = _make_qwen_tokens()
-
-    resp = MagicMock()
-    resp.status_code = 200
-    resp.json.return_value = {
-        "access_token": "disk-check",
-        "expires_in": 3600,
-    }
-
-    with patch("hermes_cli.auth.httpx") as mock_httpx:
-        mock_httpx.post.return_value = resp
-        _refresh_qwen_cli_tokens(tokens)
-
-    # Verify it was persisted
-    creds_path = qwen_env / ".qwen" / "oauth_creds.json"
-    assert creds_path.exists()
-    saved = json.loads(creds_path.read_text(encoding="utf-8"))
-    assert saved["access_token"] == "disk-check"
-
-
-# ---------------------------------------------------------------------------
-# resolve_qwen_runtime_credentials
-# ---------------------------------------------------------------------------
-
-def test_resolve_qwen_runtime_credentials_fresh_token(qwen_env):
-    tokens = _make_qwen_tokens(access_token="fresh-at")
-    _write_qwen_creds(qwen_env, tokens)
-
-    creds = resolve_qwen_runtime_credentials(refresh_if_expiring=False)
-    assert creds["provider"] == "qwen-oauth"
-    assert creds["api_key"] == "fresh-at"
-    assert creds["base_url"] == DEFAULT_QWEN_BASE_URL
-    assert creds["source"] == "qwen-cli"
-
-
-def test_resolve_qwen_runtime_credentials_triggers_refresh(qwen_env):
-    # Write an expired token
-    expired_ms = int((time.time() - 3600) * 1000)
-    tokens = _make_qwen_tokens(access_token="old", expiry_date=expired_ms)
-    _write_qwen_creds(qwen_env, tokens)
-
-    refreshed = _make_qwen_tokens(access_token="refreshed-at")
-
-    with patch(
-        "hermes_cli.auth._refresh_qwen_cli_tokens", return_value=refreshed
-    ) as mock_refresh:
-        creds = resolve_qwen_runtime_credentials()
-    mock_refresh.assert_called_once()
-    assert creds["api_key"] == "refreshed-at"
-
-
-def test_resolve_qwen_runtime_credentials_force_refresh(qwen_env):
-    tokens = _make_qwen_tokens(access_token="old-at")
-    _write_qwen_creds(qwen_env, tokens)
-
-    refreshed = _make_qwen_tokens(access_token="force-refreshed")
-
-    with patch(
-        "hermes_cli.auth._refresh_qwen_cli_tokens", return_value=refreshed
-    ) as mock_refresh:
-        creds = resolve_qwen_runtime_credentials(force_refresh=True)
-    mock_refresh.assert_called_once()
-    assert creds["api_key"] == "force-refreshed"
-
-
-def test_resolve_qwen_runtime_credentials_missing_access_token(qwen_env):
-    tokens = _make_qwen_tokens(access_token="")
-    _write_qwen_creds(qwen_env, tokens)
-
-    with pytest.raises(AuthError) as exc:
-        resolve_qwen_runtime_credentials(refresh_if_expiring=False)
-    assert exc.value.code == "qwen_access_token_missing"
-
-
-def test_resolve_qwen_runtime_credentials_base_url_env_override(qwen_env, monkeypatch):
-    tokens = _make_qwen_tokens(access_token="at")
-    _write_qwen_creds(qwen_env, tokens)
-    monkeypatch.setenv("HERMES_QWEN_BASE_URL", "https://custom.qwen.ai/v1")
-
-    creds = resolve_qwen_runtime_credentials(refresh_if_expiring=False)
-    assert creds["base_url"] == "https://custom.qwen.ai/v1"
-
-
-# ---------------------------------------------------------------------------
-# get_qwen_auth_status
-# ---------------------------------------------------------------------------
-
-def test_get_qwen_auth_status_logged_in(qwen_env):
-    tokens = _make_qwen_tokens(access_token="status-at")
-    _write_qwen_creds(qwen_env, tokens)
-
-    status = get_qwen_auth_status()
-    assert status["logged_in"] is True
-    assert status["api_key"] == "status-at"
-
-
-def test_get_qwen_auth_status_not_logged_in(qwen_env):
-    # No credentials file
-    status = get_qwen_auth_status()
-    assert status["logged_in"] is False
-    assert "error" in status
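
Note on the removed Qwen OAuth tests above: the `_qwen_access_token_is_expiring` cases treated a token as expiring when its millisecond `expiry_date` is in the past, falls inside a refresh skew window, or cannot be interpreted as a number. A rough sketch of such a check follows. The function name, the `now` parameter, and the 120-second skew are assumptions for illustration; the real value of `QWEN_ACCESS_TOKEN_REFRESH_SKEW_SECONDS` does not appear in this diff.

```python
import time
from typing import Any, Optional

REFRESH_SKEW_SECONDS = 120  # hypothetical value, not taken from the source


def access_token_is_expiring(
    expiry_date_ms: Any,
    skew_seconds: int = REFRESH_SKEW_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Hypothetical helper: True if the token should be refreshed."""
    try:
        expiry_ms = int(expiry_date_ms)
    except (TypeError, ValueError):
        # Missing or non-numeric expiry is treated as already expiring.
        return True
    now_ms = int((time.time() if now is None else now) * 1000)
    # Expiring once the expiry falls at or before now plus the skew window.
    return expiry_ms <= now_ms + skew_seconds * 1000
```

The optional `now` argument exists only to make the check deterministic in tests; callers would normally omit it.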
@@ -136,73 +136,3 @@ def test_check_gateway_service_linger_skips_when_service_not_installed(monkeypat
    out = capsys.readouterr().out
    assert out == ""
    assert issues == []
-
-
-# ── Memory provider section (doctor should only check the *active* provider) ──
-
-
-class TestDoctorMemoryProviderSection:
-    """The ◆ Memory Provider section should respect memory.provider config."""
-
-    def _make_hermes_home(self, tmp_path, provider=""):
-        """Create a minimal HERMES_HOME with config.yaml."""
-        home = tmp_path / ".hermes"
-        home.mkdir(parents=True, exist_ok=True)
-        import yaml
-        config = {"memory": {"provider": provider}} if provider else {"memory": {}}
-        (home / "config.yaml").write_text(yaml.dump(config))
-        return home
-
-    def _run_doctor_and_capture(self, monkeypatch, tmp_path, provider=""):
-        """Run doctor and capture stdout."""
-        home = self._make_hermes_home(tmp_path, provider)
-        monkeypatch.setattr(doctor_mod, "HERMES_HOME", home)
-        monkeypatch.setattr(doctor_mod, "PROJECT_ROOT", tmp_path / "project")
-        monkeypatch.setattr(doctor_mod, "_DHH", str(home))
-        (tmp_path / "project").mkdir(exist_ok=True)
-
-        # Stub tool availability (returns empty) so doctor runs past it
-        fake_model_tools = types.SimpleNamespace(
-            check_tool_availability=lambda *a, **kw: ([], []),
-            TOOLSET_REQUIREMENTS={},
-        )
-        monkeypatch.setitem(sys.modules, "model_tools", fake_model_tools)
-
-        # Stub auth checks to avoid real API calls
-        try:
-            from hermes_cli import auth as _auth_mod
-            monkeypatch.setattr(_auth_mod, "get_nous_auth_status", lambda: {})
-            monkeypatch.setattr(_auth_mod, "get_codex_auth_status", lambda: {})
-        except Exception:
-            pass
-
-        import io, contextlib
-        buf = io.StringIO()
-        with contextlib.redirect_stdout(buf):
-            doctor_mod.run_doctor(Namespace(fix=False))
-        return buf.getvalue()
-
-    def test_no_provider_shows_builtin_ok(self, monkeypatch, tmp_path):
-        out = self._run_doctor_and_capture(monkeypatch, tmp_path, provider="")
-        assert "Memory Provider" in out
-        assert "Built-in memory active" in out
-        # Should NOT mention Honcho or Mem0 errors
-        assert "Honcho API key" not in out
-        assert "Mem0" not in out
-
-    def test_honcho_provider_not_installed_shows_fail(self, monkeypatch, tmp_path):
-        # Make honcho import fail
-        monkeypatch.setitem(
-            sys.modules, "plugins.memory.honcho.client", None
-        )
-        out = self._run_doctor_and_capture(monkeypatch, tmp_path, provider="honcho")
-        assert "Memory Provider" in out
-        # Should show failure since honcho is set but not importable
-        assert "Built-in memory active" not in out
-
-    def test_mem0_provider_not_installed_shows_fail(self, monkeypatch, tmp_path):
-        # Make mem0 import fail
-        monkeypatch.setitem(sys.modules, "plugins.memory.mem0", None)
-        out = self._run_doctor_and_capture(monkeypatch, tmp_path, provider="mem0")
-        assert "Memory Provider" in out
-        assert "Built-in memory active" not in out
@@ -143,82 +143,6 @@ def test_resolve_runtime_provider_codex(monkeypatch):
    assert resolved["requested_provider"] == "openai-codex"


-def test_resolve_runtime_provider_qwen_oauth(monkeypatch):
-    monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "qwen-oauth")
-    monkeypatch.setattr(
-        rp,
-        "resolve_qwen_runtime_credentials",
-        lambda: {
-            "provider": "qwen-oauth",
-            "base_url": "https://portal.qwen.ai/v1",
-            "api_key": "qwen-token",
-            "source": "qwen-cli",
-            "expires_at_ms": 1775640710946,
-        },
-    )
-
-    resolved = rp.resolve_runtime_provider(requested="qwen-oauth")
-
-    assert resolved["provider"] == "qwen-oauth"
-    assert resolved["api_mode"] == "chat_completions"
-    assert resolved["base_url"] == "https://portal.qwen.ai/v1"
-    assert resolved["api_key"] == "qwen-token"
-    assert resolved["requested_provider"] == "qwen-oauth"
-
-
-def test_resolve_runtime_provider_uses_qwen_pool_entry(monkeypatch):
-    class _Entry:
-        access_token = "pool-qwen-token"
-        source = "manual:qwen_cli"
-        base_url = "https://portal.qwen.ai/v1"
-
-    class _Pool:
-        def has_credentials(self):
-            return True
-
-        def select(self):
-            return _Entry()
-
-    monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "qwen-oauth")
-    monkeypatch.setattr(rp, "load_pool", lambda provider: _Pool())
-    monkeypatch.setattr(rp, "_get_model_config", lambda: {"provider": "qwen-oauth", "default": "coder-model"})
-
-    resolved = rp.resolve_runtime_provider(requested="qwen-oauth")
-
-    assert resolved["provider"] == "qwen-oauth"
-    assert resolved["api_mode"] == "chat_completions"
-    assert resolved["base_url"] == "https://portal.qwen.ai/v1"
-    assert resolved["api_key"] == "pool-qwen-token"
-    assert resolved["source"] == "manual:qwen_cli"
-
-
-def test_resolve_provider_alias_qwen(monkeypatch):
-    monkeypatch.setattr(rp.auth_mod, "_load_auth_store", lambda: {})
-    monkeypatch.delenv("OPENAI_API_KEY", raising=False)
-    monkeypatch.delenv("OPENROUTER_API_KEY", raising=False)
-    assert rp.resolve_provider("qwen-portal") == "qwen-oauth"
-    assert rp.resolve_provider("qwen-cli") == "qwen-oauth"
-
-
-def test_qwen_oauth_auto_fallthrough_on_auth_failure(monkeypatch):
-    """When requested_provider is 'auto' and Qwen creds fail, fall through."""
-    from hermes_cli.auth import AuthError
-
-    monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "qwen-oauth")
-    monkeypatch.setattr(
-        rp,
-        "resolve_qwen_runtime_credentials",
-        lambda **kw: (_ for _ in ()).throw(AuthError("stale", provider="qwen-oauth", code="qwen_auth_missing")),
-    )
-    monkeypatch.setattr(rp, "_get_model_config", lambda: {})
-    monkeypatch.setenv("OPENROUTER_API_KEY", "test-or-key")
-
-    # Should NOT raise — falls through to OpenRouter
-    resolved = rp.resolve_runtime_provider(requested="auto")
-    # The fallthrough means it won't be qwen-oauth
-    assert resolved["provider"] != "qwen-oauth"
-
-
 def test_resolve_runtime_provider_ai_gateway(monkeypatch):
    monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "ai-gateway")
    monkeypatch.setattr(rp, "_get_model_config", lambda: {})
@@ -884,55 +808,6 @@ def test_minimax_explicit_api_mode_respected(monkeypatch):
    assert resolved["api_mode"] == "chat_completions"


-def test_minimax_config_base_url_overrides_hardcoded_default(monkeypatch):
-    """model.base_url in config.yaml should override the hardcoded default (#6039)."""
-    monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "minimax")
-    monkeypatch.setattr(rp, "_get_model_config", lambda: {
-        "provider": "minimax",
-        "base_url": "https://api.minimaxi.com/anthropic",
-    })
-    monkeypatch.setenv("MINIMAX_API_KEY", "test-minimax-key")
-    monkeypatch.delenv("MINIMAX_BASE_URL", raising=False)
-
-    resolved = rp.resolve_runtime_provider(requested="minimax")
-
-    assert resolved["provider"] == "minimax"
-    assert resolved["base_url"] == "https://api.minimaxi.com/anthropic"
-    assert resolved["api_mode"] == "anthropic_messages"
-
-
-def test_minimax_env_base_url_still_wins_over_config(monkeypatch):
-    """MINIMAX_BASE_URL env var should take priority over config.yaml model.base_url."""
-    monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "minimax")
-    monkeypatch.setattr(rp, "_get_model_config", lambda: {
-        "provider": "minimax",
-        "base_url": "https://api.minimaxi.com/anthropic",
-    })
-    monkeypatch.setenv("MINIMAX_API_KEY", "test-minimax-key")
-    monkeypatch.setenv("MINIMAX_BASE_URL", "https://custom.example.com/v1")
-
-    resolved = rp.resolve_runtime_provider(requested="minimax")
-
-    # Env var wins because resolve_api_key_provider_credentials prefers it
-    assert resolved["base_url"] == "https://custom.example.com/v1"
-
-
-def test_minimax_config_base_url_ignored_for_different_provider(monkeypatch):
-    """model.base_url should NOT be used when model.provider doesn't match."""
-    monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "minimax")
-    monkeypatch.setattr(rp, "_get_model_config", lambda: {
-        "provider": "openrouter",
-        "base_url": "https://some-other-endpoint.com/v1",
-    })
-    monkeypatch.setenv("MINIMAX_API_KEY", "test-minimax-key")
-    monkeypatch.delenv("MINIMAX_BASE_URL", raising=False)
-
-    resolved = rp.resolve_runtime_provider(requested="minimax")
-
-    # Should use the default, NOT the config base_url from a different provider
-    assert resolved["base_url"] == "https://api.minimax.io/anthropic"
-
-
 def test_alibaba_default_coding_intl_endpoint_uses_chat_completions(monkeypatch):
    """Alibaba default coding-intl /v1 URL should use chat_completions mode."""
    monkeypatch.setattr(rp, "resolve_provider", lambda *a, **k: "alibaba")
@@ -34,8 +34,8 @@ class TestSetupProviderModelSelection:
    @pytest.mark.parametrize("provider_id,expected_defaults", [
        ("zai", ["glm-5", "glm-4.7", "glm-4.5", "glm-4.5-flash"]),
        ("kimi-coding", ["kimi-k2.5", "kimi-k2-thinking", "kimi-k2-turbo-preview"]),
-        ("minimax", ["MiniMax-M1", "MiniMax-M1-40k", "MiniMax-M1-80k", "MiniMax-M1-128k", "MiniMax-M1-256k", "MiniMax-M2.5", "MiniMax-M2.7"]),
-        ("minimax-cn", ["MiniMax-M1", "MiniMax-M1-40k", "MiniMax-M1-80k", "MiniMax-M1-128k", "MiniMax-M1-256k", "MiniMax-M2.5", "MiniMax-M2.7"]),
+        ("minimax", ["MiniMax-M2.7", "MiniMax-M2.7-highspeed", "MiniMax-M2.5", "MiniMax-M2.5-highspeed", "MiniMax-M2.1"]),
+        ("minimax-cn", ["MiniMax-M2.7", "MiniMax-M2.7-highspeed", "MiniMax-M2.5", "MiniMax-M2.5-highspeed", "MiniMax-M2.1"]),
        ("opencode-zen", ["gpt-5.4", "gpt-5.3-codex", "claude-sonnet-4-6", "gemini-3-flash"]),
        ("opencode-go", ["glm-5", "kimi-k2.5", "minimax-m2.5", "minimax-m2.7"]),
    ])
@@ -0,0 +1,162 @@
+"""Tests for _save_oversized_tool_result() — the large tool response handler.
+
+When a tool returns more than _LARGE_RESULT_CHARS characters, the full content
+is saved to a file and the model receives a preview + file path instead.
+"""
+
+import os
+import re
+
+import pytest
+
+from run_agent import (
+    _save_oversized_tool_result,
+    _LARGE_RESULT_CHARS,
+    _LARGE_RESULT_PREVIEW_CHARS,
+)
+
+
+class TestSaveOversizedToolResult:
+    """Unit tests for the large tool result handler."""
+
+    def test_small_result_returned_unchanged(self):
+        """Results under the threshold pass through untouched."""
+        small = "x" * 1000
+        assert _save_oversized_tool_result("terminal", small) is small
+
+    def test_exactly_at_threshold_returned_unchanged(self):
+        """Results exactly at the threshold pass through."""
+        exact = "y" * _LARGE_RESULT_CHARS
+        assert _save_oversized_tool_result("terminal", exact) is exact
+
+    def test_oversized_result_saved_to_file(self, tmp_path, monkeypatch):
+        """Results over the threshold are written to a file."""
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
+        os.makedirs(tmp_path / ".hermes", exist_ok=True)
+
+        big = "A" * (_LARGE_RESULT_CHARS + 500)
+        result = _save_oversized_tool_result("terminal", big)
+
+        # Should contain the preview
+        assert result.startswith("A" * _LARGE_RESULT_PREVIEW_CHARS)
+        # Should mention the file path
+        assert "Full output saved to:" in result
+        # Should mention original size
+        assert f"{len(big):,}" in result
+
+        # Extract the file path and verify the file exists with full content
+        match = re.search(r"Full output saved to: (.+?)\n", result)
+        assert match, f"No file path found in result: {result[:300]}"
+        filepath = match.group(1)
+        assert os.path.isfile(filepath)
+        with open(filepath, "r", encoding="utf-8") as f:
+            saved = f.read()
+        assert saved == big
+        assert len(saved) == _LARGE_RESULT_CHARS + 500
+
+    def test_file_placed_in_cache_tool_responses(self, tmp_path, monkeypatch):
+        """Saved file lives under HERMES_HOME/cache/tool_responses/."""
+        hermes_home = str(tmp_path / ".hermes")
+        monkeypatch.setenv("HERMES_HOME", hermes_home)
+        os.makedirs(hermes_home, exist_ok=True)
+
+        big = "B" * (_LARGE_RESULT_CHARS + 1)
+        result = _save_oversized_tool_result("web_search", big)
+
+        match = re.search(r"Full output saved to: (.+?)\n", result)
+        filepath = match.group(1)
+        expected_dir = os.path.join(hermes_home, "cache", "tool_responses")
+        assert filepath.startswith(expected_dir)
+
+    def test_filename_contains_tool_name(self, tmp_path, monkeypatch):
+        """The saved filename includes a sanitized version of the tool name."""
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
+        os.makedirs(tmp_path / ".hermes", exist_ok=True)
+
+        big = "C" * (_LARGE_RESULT_CHARS + 1)
+        result = _save_oversized_tool_result("browser_navigate", big)
+
+        match = re.search(r"Full output saved to: (.+?)\n", result)
+        filename = os.path.basename(match.group(1))
+        assert filename.startswith("browser_navigate_")
+        assert filename.endswith(".txt")
+
+    def test_tool_name_sanitized(self, tmp_path, monkeypatch):
+        """Special characters in tool names are replaced in the filename."""
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
+        os.makedirs(tmp_path / ".hermes", exist_ok=True)
+
+        big = "D" * (_LARGE_RESULT_CHARS + 1)
+        result = _save_oversized_tool_result("mcp:some/weird tool", big)
+
+        match = re.search(r"Full output saved to: (.+?)\n", result)
+        filename = os.path.basename(match.group(1))
+        # No slashes or colons in filename
+        assert "/" not in filename
+        assert ":" not in filename
+
+    def test_fallback_on_write_failure(self, tmp_path, monkeypatch):
+        """When file write fails, falls back to destructive truncation."""
+        # Point HERMES_HOME to a path that will fail (file, not directory)
+        bad_path = str(tmp_path / "not_a_dir.txt")
+        with open(bad_path, "w") as f:
+            f.write("I'm a file, not a directory")
+        monkeypatch.setenv("HERMES_HOME", bad_path)
+
+        big = "E" * (_LARGE_RESULT_CHARS + 50_000)
+        result = _save_oversized_tool_result("terminal", big)
+
+        # Should still contain data (fallback truncation)
+        assert len(result) > 0
+        assert result.startswith("E" * 1000)
+        # Should mention the failure
+        assert "File save failed" in result
+        # Should be shorter than the original (roughly _LARGE_RESULT_CHARS + error msg)
+        assert len(result) < len(big)
+
+    def test_preview_length_capped(self, tmp_path, monkeypatch):
+        """The inline preview is capped at _LARGE_RESULT_PREVIEW_CHARS."""
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
+        os.makedirs(tmp_path / ".hermes", exist_ok=True)
+
+        # Uniform filler; the preview is whatever precedes the marker below
+        big = "Z" * (_LARGE_RESULT_CHARS + 5000)
+        result = _save_oversized_tool_result("terminal", big)
+
+        # The preview section is the content before the "[Large tool response:" marker
+        marker_pos = result.index("[Large tool response:")
+        preview_section = result[:marker_pos].rstrip()
+        assert len(preview_section) == _LARGE_RESULT_PREVIEW_CHARS
+
+    def test_guidance_message_mentions_tools(self, tmp_path, monkeypatch):
+        """The replacement message tells the model how to access the file."""
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
+        os.makedirs(tmp_path / ".hermes", exist_ok=True)
+
+        big = "F" * (_LARGE_RESULT_CHARS + 1)
+        result = _save_oversized_tool_result("terminal", big)
+
+        assert "read_file" in result
+        assert "search_files" in result
+
+    def test_empty_result_passes_through(self):
+        """Empty strings are not oversized."""
+        assert _save_oversized_tool_result("terminal", "") == ""
+
+    def test_unicode_content_preserved(self, tmp_path, monkeypatch):
+        """Unicode content is fully preserved in the saved file."""
+        monkeypatch.setenv("HERMES_HOME", str(tmp_path / ".hermes"))
+        os.makedirs(tmp_path / ".hermes", exist_ok=True)
+
+        # Mix of ASCII and multi-byte unicode to exceed threshold
+        unit = "Hello 世界! 🎉 " * 100  # 1,200 chars per repeat (12 code points x 100)
+        big = unit * ((_LARGE_RESULT_CHARS // len(unit)) + 1)
+        assert len(big) > _LARGE_RESULT_CHARS
+
+        result = _save_oversized_tool_result("terminal", big)
+        match = re.search(r"Full output saved to: (.+?)\n", result)
+        filepath = match.group(1)
+
+        with open(filepath, "r", encoding="utf-8") as f:
+            saved = f.read()
+        assert saved == big
@@ -872,52 +872,6 @@ class TestBuildApiKwargs:
        kwargs = agent._build_api_kwargs(messages)
        assert kwargs["max_tokens"] == 4096

-    def test_qwen_portal_formats_messages_and_metadata(self, agent):
-        agent.base_url = "https://portal.qwen.ai/v1"
-        agent._base_url_lower = agent.base_url.lower()
-        agent.session_id = "sess-123"
-        messages = [
-            {"role": "system", "content": "You are helpful"},
-            {"role": "assistant", "content": "Got it"},
-            {"role": "user", "content": "hi"},
-        ]
-        kwargs = agent._build_api_kwargs(messages)
-        assert kwargs["metadata"]["sessionId"] == "sess-123"
-        assert kwargs["extra_body"]["vl_high_resolution_images"] is True
-        assert isinstance(kwargs["messages"][0]["content"], list)
-        assert kwargs["messages"][0]["content"][0]["cache_control"] == {"type": "ephemeral"}
-        assert kwargs["messages"][2]["content"][0]["text"] == "hi"
-
-    def test_qwen_portal_normalizes_bare_string_content_parts(self, agent):
-        agent.base_url = "https://portal.qwen.ai/v1"
-        agent._base_url_lower = agent.base_url.lower()
-        messages = [
-            {"role": "system", "content": [{"type": "text", "text": "system"}]},
-            {"role": "user", "content": ["hello", {"type": "text", "text": "world"}]},
-        ]
-        kwargs = agent._build_api_kwargs(messages)
-        user_content = kwargs["messages"][1]["content"]
-        assert user_content[0] == {"type": "text", "text": "hello"}
-        assert user_content[1] == {"type": "text", "text": "world"}
-
-    def test_qwen_portal_no_system_message(self, agent):
-        agent.base_url = "https://portal.qwen.ai/v1"
-        agent._base_url_lower = agent.base_url.lower()
-        messages = [{"role": "user", "content": "hi"}]
-        kwargs = agent._build_api_kwargs(messages)
-        # Should not crash even without a system message
-        assert kwargs["messages"][0]["content"][0]["text"] == "hi"
-        assert "cache_control" not in kwargs["messages"][0]["content"][0]
-
-    def test_qwen_portal_omits_max_tokens(self, agent):
-        agent.base_url = "https://portal.qwen.ai/v1"
-        agent._base_url_lower = agent.base_url.lower()
-        agent.max_tokens = 4096
-        messages = [{"role": "system", "content": "sys"}, {"role": "user", "content": "hi"}]
-        kwargs = agent._build_api_kwargs(messages)
-        assert "max_tokens" not in kwargs
-        assert "max_completion_tokens" not in kwargs
-

 class TestBuildAssistantMessage:
    def test_basic_message(self, agent):
@@ -1057,9 +1011,10 @@ class TestExecuteToolCalls:
        big_result = "x" * 150_000
        with patch("run_agent.handle_function_call", return_value=big_result):
            agent._execute_tool_calls(mock_msg, messages, "task-1")
-        # Content should be replaced with persisted-output or truncation
+        # Content should be replaced with preview + file path
        assert len(messages[0]["content"]) < 150_000
-        assert ("Truncated" in messages[0]["content"] or "<persisted-output>" in messages[0]["content"])
+        assert "Large tool response" in messages[0]["content"]
+        assert "Full output saved to:" in messages[0]["content"]


 class TestConcurrentToolExecution:
@@ -1294,7 +1249,8 @@ class TestConcurrentToolExecution:
        assert len(messages) == 2
        for m in messages:
            assert len(m["content"]) < 150_000
-            assert ("Truncated" in m["content"] or "<persisted-output>" in m["content"])
+            assert "Large tool response" in m["content"]
+            assert "Full output saved to:" in m["content"]

    def test_invoke_tool_dispatches_to_handle_function_call(self, agent):
        """_invoke_tool should route regular tools through handle_function_call."""
@@ -1,135 +0,0 @@
-"""Tests for Ollama num_ctx context length detection and injection.
-
-Covers:
-  agent/model_metadata.py — query_ollama_num_ctx()
-  run_agent.py — _ollama_num_ctx detection + extra_body injection
-"""
-
-from unittest.mock import patch, MagicMock
-
-import pytest
-
-from agent.model_metadata import query_ollama_num_ctx
-
-
-# ═══════════════════════════════════════════════════════════════════════
-# Level 1: query_ollama_num_ctx — Ollama API interaction
-# ═══════════════════════════════════════════════════════════════════════
-
-
-def _mock_httpx_client(show_response_data, status_code=200):
-    """Create a mock httpx.Client context manager that returns given /api/show data."""
-    mock_resp = MagicMock(status_code=status_code)
-    mock_resp.json.return_value = show_response_data
-    mock_client = MagicMock()
-    mock_client.post.return_value = mock_resp
-    mock_ctx = MagicMock()
-    mock_ctx.__enter__ = MagicMock(return_value=mock_client)
-    mock_ctx.__exit__ = MagicMock(return_value=False)
-    return mock_ctx, mock_client
-
-
-class TestQueryOllamaNumCtx:
-    """Test the Ollama /api/show context length query."""
-
-    def test_returns_context_from_model_info(self):
-        """Should extract context_length from GGUF model_info metadata."""
-        show_data = {
-            "model_info": {"llama.context_length": 131072},
-            "parameters": "",
-        }
-        mock_ctx, _ = _mock_httpx_client(show_data)
-
-        with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
-            # httpx is imported inside the function — patch the module import
-            import httpx
-            with patch.object(httpx, "Client", return_value=mock_ctx):
-                result = query_ollama_num_ctx("llama3.1:8b", "http://localhost:11434/v1")
-
-        assert result == 131072
-
-    def test_prefers_explicit_num_ctx_from_modelfile(self):
-        """If the Modelfile sets num_ctx explicitly, that should take priority."""
-        show_data = {
-            "model_info": {"llama.context_length": 131072},
-            "parameters": "num_ctx 32768\ntemperature 0.7",
-        }
-        mock_ctx, _ = _mock_httpx_client(show_data)
-
-        with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
-            import httpx
-            with patch.object(httpx, "Client", return_value=mock_ctx):
-                result = query_ollama_num_ctx("custom-model", "http://localhost:11434")
-
-        assert result == 32768
-
-    def test_returns_none_for_non_ollama_server(self):
-        """Should return None if the server is not Ollama."""
-        with patch("agent.model_metadata.detect_local_server_type", return_value="lm-studio"):
-            result = query_ollama_num_ctx("model", "http://localhost:1234")
-        assert result is None
-
-    def test_returns_none_on_connection_error(self):
-        """Should return None if the server is unreachable."""
-        with patch("agent.model_metadata.detect_local_server_type", side_effect=Exception("timeout")):
-            result = query_ollama_num_ctx("model", "http://localhost:11434")
-        assert result is None
-
-    def test_returns_none_on_404(self):
-        """Should return None if the model is not found."""
-        mock_ctx, _ = _mock_httpx_client({}, status_code=404)
-
-        with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
-            import httpx
-            with patch.object(httpx, "Client", return_value=mock_ctx):
-                result = query_ollama_num_ctx("nonexistent", "http://localhost:11434")
-
-        assert result is None
-
-    def test_strips_provider_prefix(self):
-        """Should strip 'local:' prefix from model name before querying."""
-        show_data = {
-            "model_info": {"qwen2.context_length": 32768},
-            "parameters": "",
-        }
-        mock_ctx, mock_client = _mock_httpx_client(show_data)
-
-        with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
-            import httpx
-            with patch.object(httpx, "Client", return_value=mock_ctx):
-                result = query_ollama_num_ctx("local:qwen2.5:7b", "http://localhost:11434/v1")
-
-        # Verify the post was called with stripped name (no "local:" prefix)
-        call_args = mock_client.post.call_args
-        assert call_args[1]["json"]["name"] == "qwen2.5:7b" or call_args[0][1] is not None
-        assert result == 32768
-
-    def test_handles_qwen2_architecture_key(self):
-        """Different model architectures use different key prefixes in model_info."""
-        show_data = {
-            "model_info": {"qwen2.context_length": 65536},
-            "parameters": "",
-        }
-        mock_ctx, _ = _mock_httpx_client(show_data)
-
-        with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
-            import httpx
-            with patch.object(httpx, "Client", return_value=mock_ctx):
-                result = query_ollama_num_ctx("qwen2.5:32b", "http://localhost:11434")
-
-        assert result == 65536
-
-    def test_returns_none_when_model_info_empty(self):
-        """Should return None if model_info has no context_length key."""
-        show_data = {
-            "model_info": {"llama.embedding_length": 4096},
-            "parameters": "",
-        }
-        mock_ctx, _ = _mock_httpx_client(show_data)
-
-        with patch("agent.model_metadata.detect_local_server_type", return_value="ollama"):
-            import httpx
-            with patch.object(httpx, "Client", return_value=mock_ctx):
-                result = query_ollama_num_ctx("model", "http://localhost:11434")
-
-        assert result is None
@@ -1,117 +0,0 @@
-"""Tests for agent.retry_utils jittered backoff."""
-
-import threading
-
-import agent.retry_utils as retry_utils
-from agent.retry_utils import jittered_backoff
-
-
-def test_backoff_is_exponential():
-    """Base delay should double each attempt (before jitter)."""
-    for attempt in (1, 2, 3, 4):
-        delays = [jittered_backoff(attempt, base_delay=5.0, max_delay=120.0, jitter_ratio=0.0) for _ in range(100)]
-        expected = min(5.0 * (2 ** (attempt - 1)), 120.0)
-        mean = sum(delays) / len(delays)
-        assert abs(mean - expected) < 0.01, f"attempt {attempt}: expected {expected}, got {mean}"
-
-
-def test_backoff_respects_max_delay():
-    """Even with high attempt numbers, delay should not exceed max_delay."""
-    for attempt in (10, 20, 100):
-        delay = jittered_backoff(attempt, base_delay=5.0, max_delay=60.0, jitter_ratio=0.0)
-        assert delay <= 60.0, f"attempt {attempt}: delay {delay} exceeds max 60s"
-
-
-def test_backoff_adds_jitter():
-    """With jitter enabled, delays should vary across calls."""
-    delays = [jittered_backoff(1, base_delay=10.0, max_delay=120.0, jitter_ratio=0.5) for _ in range(50)]
-    assert min(delays) != max(delays), "jitter should produce varying delays"
-    assert all(d >= 10.0 for d in delays), "jittered delay should be >= base delay"
-    assert all(d <= 15.0 for d in delays), "jittered delay should be bounded"
-
-
-def test_backoff_attempt_1_is_base():
-    """First attempt delay should equal base_delay (with no jitter)."""
-    delay = jittered_backoff(1, base_delay=3.0, max_delay=120.0, jitter_ratio=0.0)
-    assert delay == 3.0
-
-
-def test_backoff_with_zero_base_delay_returns_max():
-    """base_delay=0 should return max_delay (guard against busy-wait)."""
-    delay = jittered_backoff(1, base_delay=0.0, max_delay=60.0, jitter_ratio=0.0)
-    assert delay == 60.0
-
-
-def test_backoff_with_extreme_attempt_returns_max():
-    """Very large attempt numbers should not overflow and should return max_delay."""
-    delay = jittered_backoff(999, base_delay=5.0, max_delay=120.0, jitter_ratio=0.0)
-    assert delay == 120.0
-
-
-def test_backoff_negative_attempt_treated_as_one():
-    """Negative attempt should not crash and behaves like attempt=1."""
-    delay = jittered_backoff(-5, base_delay=10.0, max_delay=120.0, jitter_ratio=0.0)
-    assert delay == 10.0
-
-
-def test_backoff_thread_safety():
-    """Concurrent calls should generally produce different delays."""
-    results = []
-    barrier = threading.Barrier(8)
-
-    def _call_backoff():
-        barrier.wait()
-        results.append(jittered_backoff(1, base_delay=10.0, max_delay=120.0, jitter_ratio=0.5))
-
-    threads = [threading.Thread(target=_call_backoff) for _ in range(8)]
-    for t in threads:
-        t.start()
-    for t in threads:
-        t.join(timeout=5)
-
-    assert len(results) == 8
-    unique = len(set(results))
-    assert unique >= 6, f"Expected mostly unique delays, got {unique}/8 unique"
-
-
-def test_backoff_uses_locked_tick_for_seed(monkeypatch):
-    """Seed derivation should use per-call tick captured under lock."""
-    import time
-
-    monkeypatch.setattr(retry_utils, "_jitter_counter", 0)
-
-    recorded_seeds = []
-
-    class _RecordingRandom:
-        def __init__(self, seed):
-            recorded_seeds.append(seed)
-
-        def uniform(self, a, b):
-            return 0.0
-
-    monkeypatch.setattr(retry_utils.random, "Random", _RecordingRandom)
-
-    fixed_time_ns = 123456789
-
-    def _time_ns_wait_for_two_ticks():
-        deadline = time.time() + 2.0
-        while retry_utils._jitter_counter < 2 and time.time() < deadline:
-            time.sleep(0.001)
-        return fixed_time_ns
-
-    monkeypatch.setattr(retry_utils.time, "time_ns", _time_ns_wait_for_two_ticks)
-
-    barrier = threading.Barrier(2)
-
-    def _call():
-        barrier.wait()
-        jittered_backoff(1, base_delay=10.0, max_delay=120.0, jitter_ratio=0.5)
-
-    threads = [threading.Thread(target=_call) for _ in range(2)]
-    for t in threads:
-        t.start()
-    for t in threads:
-        t.join(timeout=5)
-
-    assert len(recorded_seeds) == 2
-    assert len(set(recorded_seeds)) == 2, f"Expected unique seeds, got {recorded_seeds}"
@@ -1,174 +0,0 @@
-"""Tests for BaseEnvironment unified execution model.
-
-Tests _wrap_command(), _extract_cwd_from_output(), _embed_stdin_heredoc(),
-init_session() failure handling, and the CWD marker contract.
-"""
-
-import uuid
-from unittest.mock import MagicMock
-
-from tools.environments.base import BaseEnvironment, _cwd_marker
-
-
-class _TestableEnv(BaseEnvironment):
-    """Concrete subclass for testing base class methods."""
-
-    def __init__(self, cwd="/tmp", timeout=10):
-        super().__init__(cwd=cwd, timeout=timeout)
-
-    def _run_bash(self, cmd_string, *, login=False, timeout=120, stdin_data=None):
-        raise NotImplementedError("Use mock")
-
-    def cleanup(self):
-        pass
-
-
-class TestWrapCommand:
-    def test_basic_shape(self):
-        env = _TestableEnv()
-        env._snapshot_ready = True
-        wrapped = env._wrap_command("echo hello", "/tmp")
-
-        assert "source" in wrapped
-        assert "cd /tmp" in wrapped or "cd '/tmp'" in wrapped
-        assert "eval 'echo hello'" in wrapped
-        assert "__hermes_ec=$?" in wrapped
-        assert "export -p >" in wrapped
-        assert "pwd -P >" in wrapped
-        assert env._cwd_marker in wrapped
-        assert "exit $__hermes_ec" in wrapped
-
-    def test_no_snapshot_skips_source(self):
-        env = _TestableEnv()
-        env._snapshot_ready = False
-        wrapped = env._wrap_command("echo hello", "/tmp")
-
-        assert "source" not in wrapped
-
-    def test_single_quote_escaping(self):
-        env = _TestableEnv()
-        env._snapshot_ready = True
-        wrapped = env._wrap_command("echo 'hello world'", "/tmp")
-
-        assert "eval 'echo '\\''hello world'\\'''" in wrapped
-
-    def test_tilde_not_quoted(self):
-        env = _TestableEnv()
-        env._snapshot_ready = True
-        wrapped = env._wrap_command("ls", "~")
-
-        assert "cd ~" in wrapped
-        assert "cd '~'" not in wrapped
-
-    def test_cd_failure_exit_126(self):
-        env = _TestableEnv()
-        env._snapshot_ready = True
-        wrapped = env._wrap_command("ls", "/nonexistent")
-
-        assert "exit 126" in wrapped
-
-
-class TestExtractCwdFromOutput:
-    def test_happy_path(self):
-        env = _TestableEnv()
-        marker = env._cwd_marker
-        result = {
-            "output": f"hello\n{marker}/home/user{marker}\n",
-        }
-        env._extract_cwd_from_output(result)
-
-        assert env.cwd == "/home/user"
-        assert marker not in result["output"]
-
-    def test_missing_marker(self):
-        env = _TestableEnv()
-        result = {"output": "hello world\n"}
-        env._extract_cwd_from_output(result)
-
-        assert env.cwd == "/tmp"  # unchanged
-
-    def test_marker_in_command_output(self):
-        """If the marker appears in command output AND as the real marker,
-        rfind grabs the last (real) one."""
-        env = _TestableEnv()
-        marker = env._cwd_marker
-        result = {
-            "output": f"user typed {marker} in their output\nreal output\n{marker}/correct/path{marker}\n",
-        }
-        env._extract_cwd_from_output(result)
-
-        assert env.cwd == "/correct/path"
-
-    def test_output_cleaned(self):
-        env = _TestableEnv()
-        marker = env._cwd_marker
-        result = {
-            "output": f"hello\n{marker}/tmp{marker}\n",
-        }
-        env._extract_cwd_from_output(result)
-
-        assert "hello" in result["output"]
-        assert marker not in result["output"]
-
-
-class TestEmbedStdinHeredoc:
-    def test_heredoc_format(self):
-        result = BaseEnvironment._embed_stdin_heredoc("cat", "hello world")
-
-        assert result.startswith("cat << '")
-        assert "hello world" in result
-        assert "HERMES_STDIN_" in result
-
-    def test_unique_delimiter_each_call(self):
-        r1 = BaseEnvironment._embed_stdin_heredoc("cat", "data")
-        r2 = BaseEnvironment._embed_stdin_heredoc("cat", "data")
-
-        # Extract delimiters
-        d1 = r1.split("'")[1]
-        d2 = r2.split("'")[1]
-        assert d1 != d2  # UUID-based, should be unique
-
-
-class TestInitSessionFailure:
-    def test_snapshot_ready_false_on_failure(self):
-        env = _TestableEnv()
-
-        def failing_run_bash(*args, **kwargs):
-            raise RuntimeError("bash not found")
-
-        env._run_bash = failing_run_bash
-        env.init_session()
-
-        assert env._snapshot_ready is False
-
-    def test_login_flag_when_snapshot_not_ready(self):
-        """When _snapshot_ready=False, execute() should pass login=True to _run_bash."""
-        env = _TestableEnv()
-        env._snapshot_ready = False
-
-        calls = []
-        def mock_run_bash(cmd, *, login=False, timeout=120, stdin_data=None):
-            calls.append({"login": login})
-            # Return a mock process handle
-            mock = MagicMock()
-            mock.poll.return_value = 0
-            mock.returncode = 0
-            mock.stdout = iter([])
-            return mock
-
-        env._run_bash = mock_run_bash
-        env.execute("echo test")
-
-        assert len(calls) == 1
-        assert calls[0]["login"] is True
-
-
-class TestCwdMarker:
-    def test_marker_contains_session_id(self):
-        env = _TestableEnv()
-        assert env._session_id in env._cwd_marker
-
-    def test_unique_per_instance(self):
-        env1 = _TestableEnv()
-        env2 = _TestableEnv()
-        assert env1._cwd_marker != env2._cwd_marker
@@ -16,7 +16,6 @@ from tools.browser_camofox import (
    _managed_persistence_enabled,
    camofox_close,
    camofox_navigate,
-    camofox_soft_cleanup,
    check_camofox_available,
    cleanup_all_camofox_sessions,
    get_vnc_url,
@@ -241,50 +240,3 @@ class TestVncUrlDiscovery:

        assert result["vnc_url"] == "http://localhost:6080"
        assert "vnc_hint" in result
-
-
-class TestCamofoxSoftCleanup:
-    """camofox_soft_cleanup drops local state only when managed persistence is on."""
-
-    def test_returns_true_and_drops_session_when_enabled(self, tmp_path, monkeypatch):
-        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
-        monkeypatch.setenv("CAMOFOX_URL", "http://localhost:9377")
-
-        with _enable_persistence():
-            _get_session("task-1")
-            result = camofox_soft_cleanup("task-1")
-
-        assert result is True
-        # Session should have been dropped from in-memory store
-        import tools.browser_camofox as mod
-        with mod._sessions_lock:
-            assert "task-1" not in mod._sessions
-
-    def test_returns_false_when_disabled(self, tmp_path, monkeypatch):
-        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
-        monkeypatch.setenv("CAMOFOX_URL", "http://localhost:9377")
-
-        _get_session("task-1")
-        config = {"browser": {"camofox": {"managed_persistence": False}}}
-        with patch("tools.browser_camofox.load_config", return_value=config):
-            result = camofox_soft_cleanup("task-1")
-
-        assert result is False
-        # Session should still be present — not dropped
-        import tools.browser_camofox as mod
-        with mod._sessions_lock:
-            assert "task-1" in mod._sessions
-
-    def test_does_not_call_server_delete(self, tmp_path, monkeypatch):
-        """Soft cleanup must never hit the Camofox /sessions DELETE endpoint."""
-        monkeypatch.setenv("HERMES_HOME", str(tmp_path))
-        monkeypatch.setenv("CAMOFOX_URL", "http://localhost:9377")
-
-        with (
-            _enable_persistence(),
-            patch("tools.browser_camofox.requests.delete") as mock_delete,
-        ):
-            _get_session("task-1")
-            camofox_soft_cleanup("task-1")
-
-        mock_delete.assert_not_called()
@@ -65,62 +65,6 @@ class TestBrowserCleanup:
        mock_stop.assert_called_once_with("task-1")
        mock_run.assert_called_once_with("task-1", "close", [], timeout=10)

-    def test_cleanup_camofox_managed_persistence_skips_close(self):
-        """When camofox mode + managed persistence, soft_cleanup fires instead of close."""
-        browser_tool = self.browser_tool
-        browser_tool._active_sessions["task-1"] = {
-            "session_name": "sess-1",
-            "bb_session_id": None,
-        }
-        browser_tool._session_last_activity["task-1"] = 123.0
-
-        with (
-            patch("tools.browser_tool._is_camofox_mode", return_value=True),
-            patch("tools.browser_tool._maybe_stop_recording") as mock_stop,
-            patch(
-                "tools.browser_tool._run_browser_command",
-                return_value={"success": True},
-            ),
-            patch("tools.browser_tool.os.path.exists", return_value=False),
-            patch(
-                "tools.browser_camofox.camofox_soft_cleanup",
-                return_value=True,
-            ) as mock_soft,
-            patch("tools.browser_camofox.camofox_close") as mock_close,
-        ):
-            browser_tool.cleanup_browser("task-1")
-
-        mock_soft.assert_called_once_with("task-1")
-        mock_close.assert_not_called()
-
-    def test_cleanup_camofox_no_persistence_calls_close(self):
-        """When camofox mode but managed persistence is off, camofox_close fires."""
-        browser_tool = self.browser_tool
-        browser_tool._active_sessions["task-1"] = {
-            "session_name": "sess-1",
-            "bb_session_id": None,
-        }
-        browser_tool._session_last_activity["task-1"] = 123.0
-
-        with (
-            patch("tools.browser_tool._is_camofox_mode", return_value=True),
-            patch("tools.browser_tool._maybe_stop_recording") as mock_stop,
-            patch(
-                "tools.browser_tool._run_browser_command",
-                return_value={"success": True},
-            ),
-            patch("tools.browser_tool.os.path.exists", return_value=False),
-            patch(
-                "tools.browser_camofox.camofox_soft_cleanup",
-                return_value=False,
-            ) as mock_soft,
-            patch("tools.browser_camofox.camofox_close") as mock_close,
-        ):
-            browser_tool.cleanup_browser("task-1")
-
-        mock_soft.assert_called_once_with("task-1")
-        mock_close.assert_called_once_with("task-1")
-
    def test_emergency_cleanup_clears_all_tracking_state(self):
        browser_tool = self.browser_tool
        browser_tool._cleanup_done = False
@@ -152,109 +152,6 @@ class TestFindAgentBrowser:
 class TestRunBrowserCommandPathConstruction:
    """Verify _run_browser_command() includes Homebrew node dirs in subprocess PATH."""

-    def test_subprocess_preserves_executable_path_with_spaces(self, tmp_path):
-        """A local agent-browser path containing spaces must stay one argv entry."""
-        captured_cmd = None
-
-        mock_proc = MagicMock()
-        mock_proc.returncode = 0
-        mock_proc.wait.return_value = 0
-
-        def capture_popen(cmd, **kwargs):
-            nonlocal captured_cmd
-            captured_cmd = cmd
-            return mock_proc
-
-        fake_session = {
-            "session_name": "test-session",
-            "session_id": "test-id",
-            "cdp_url": None,
-        }
-        fake_json = json.dumps({"success": True})
-        browser_path = "/Users/test/Library/Application Support/hermes/node_modules/.bin/agent-browser"
-        hermes_home = str(tmp_path / "hermes-home")
-
-        with patch("tools.browser_tool._find_agent_browser", return_value=browser_path), \
-             patch("tools.browser_tool._get_session_info", return_value=fake_session), \
-             patch("tools.browser_tool._socket_safe_tmpdir", return_value=str(tmp_path)), \
-             patch("tools.browser_tool._discover_homebrew_node_dirs", return_value=[]), \
-             patch("hermes_constants.Path.home", return_value=tmp_path), \
-             patch("subprocess.Popen", side_effect=capture_popen), \
-             patch("os.open", return_value=99), \
-             patch("os.close"), \
-             patch("tools.interrupt.is_interrupted", return_value=False), \
-             patch.dict(
-                 os.environ,
-                 {
-                     "PATH": "/usr/bin:/bin",
-                     "HOME": "/home/test",
-                     "HERMES_HOME": hermes_home,
-                 },
-                 clear=True,
-             ):
-            with patch("builtins.open", mock_open(read_data=fake_json)):
-                _run_browser_command("test-task", "navigate", ["https://example.com"])
-
-        assert captured_cmd is not None
-        assert captured_cmd[0] == browser_path
-        assert captured_cmd[1:5] == [
-            "--session",
-            "test-session",
-            "--json",
-            "navigate",
-        ]
-
-    def test_subprocess_splits_npx_fallback_into_command_and_package(self, tmp_path):
-        """The synthetic npx fallback should still expand into separate argv items."""
-        captured_cmd = None
-
-        mock_proc = MagicMock()
-        mock_proc.returncode = 0
-        mock_proc.wait.return_value = 0
-
-        def capture_popen(cmd, **kwargs):
-            nonlocal captured_cmd
-            captured_cmd = cmd
-            return mock_proc
-
-        fake_session = {
-            "session_name": "test-session",
-            "session_id": "test-id",
-            "cdp_url": None,
-        }
-        fake_json = json.dumps({"success": True})
-        hermes_home = str(tmp_path / "hermes-home")
-
-        with patch("tools.browser_tool._find_agent_browser", return_value="npx agent-browser"), \
-             patch("tools.browser_tool._get_session_info", return_value=fake_session), \
-             patch("tools.browser_tool._socket_safe_tmpdir", return_value=str(tmp_path)), \
-             patch("tools.browser_tool._discover_homebrew_node_dirs", return_value=[]), \
-             patch("hermes_constants.Path.home", return_value=tmp_path), \
-             patch("subprocess.Popen", side_effect=capture_popen), \
-             patch("os.open", return_value=99), \
-             patch("os.close"), \
-             patch("tools.interrupt.is_interrupted", return_value=False), \
-             patch.dict(
-                 os.environ,
-                 {
-                     "PATH": "/usr/bin:/bin",
-                     "HOME": "/home/test",
-                     "HERMES_HOME": hermes_home,
-                 },
-                 clear=True,
-             ):
-            with patch("builtins.open", mock_open(read_data=fake_json)):
-                _run_browser_command("test-task", "navigate", ["https://example.com"])
-
-        assert captured_cmd is not None
-        assert captured_cmd[:2] == ["npx", "agent-browser"]
-        assert captured_cmd[2:6] == [
-            "--session",
-            "test-session",
-            "--json",
-            "navigate",
-        ]
-
    def test_subprocess_path_includes_homebrew_node_dirs(self, tmp_path):
        """When _discover_homebrew_node_dirs returns dirs, they should appear
        in the subprocess env PATH passed to Popen."""
@@ -59,8 +59,8 @@ def daytona_sdk(monkeypatch):
 @pytest.fixture()
 def make_env(daytona_sdk, monkeypatch):
    """Factory that creates a DaytonaEnvironment with a mocked SDK."""
-    # Prevent is_interrupted from interfering — patch where it's used (base.py)
-    monkeypatch.setattr("tools.environments.base.is_interrupted", lambda: False)
+    # Prevent is_interrupted from interfering
+    monkeypatch.setattr("tools.interrupt.is_interrupted", lambda: False)
    # Prevent skills/credential sync from consuming mock exec calls
    monkeypatch.setattr("tools.credential_files.get_credential_file_mounts", lambda: [])
    monkeypatch.setattr("tools.credential_files.get_skills_directory_mount", lambda **kw: None)
@@ -221,45 +221,41 @@ class TestCleanup:
 class TestExecute:
    def test_basic_command(self, make_env):
        sb = _make_sandbox()
-        # Calls: (1) $HOME detection, (2) init_session bootstrap, (3) actual command
+        # First call: $HOME detection; subsequent calls: actual commands
        sb.process.exec.side_effect = [
            _make_exec_response(result="/root"),       # $HOME
-            _make_exec_response(result="", exit_code=0),  # init_session
            _make_exec_response(result="hello", exit_code=0),  # actual cmd
        ]
        sb.state = "started"
        env = make_env(sandbox=sb)

        result = env.execute("echo hello")
-        assert "hello" in result["output"]
+        assert result["output"] == "hello"
        assert result["returncode"] == 0

-    def test_sdk_timeout_passed_to_exec(self, make_env):
-        """SDK native timeout is passed to sandbox.process.exec()."""
+    def test_command_wrapped_with_shell_timeout(self, make_env):
        sb = _make_sandbox()
        sb.process.exec.side_effect = [
            _make_exec_response(result="/root"),
-            _make_exec_response(result="", exit_code=0),  # init_session
            _make_exec_response(result="ok", exit_code=0),
        ]
        sb.state = "started"
        env = make_env(sandbox=sb, timeout=42)

        env.execute("echo hello")
-        # The exec call should receive timeout= kwarg (SDK native timeout)
+        # The command sent to exec should be wrapped with `timeout N sh -c '...'`
        call_args = sb.process.exec.call_args_list[-1]
-        assert call_args[1]["timeout"] == 42
-        # The command should NOT have a shell `timeout` prefix
        cmd = call_args[0][0]
-        assert not cmd.startswith("timeout ")
+        assert cmd.startswith("timeout 42 sh -c ")
+        # SDK timeout param should NOT be passed
+        assert "timeout" not in call_args[1]

    def test_timeout_returns_exit_code_124(self, make_env):
-        """SDK-level timeout surfaces as exit code 124 via _wait_for_process."""
+        """Shell timeout utility returns exit code 124."""
        sb = _make_sandbox()
        sb.process.exec.side_effect = [
            _make_exec_response(result="/root"),
-            _make_exec_response(result="", exit_code=0),  # init_session
-            _make_exec_response(result="", exit_code=124),  # actual cmd
+            _make_exec_response(result="", exit_code=124),
        ]
        sb.state = "started"
        env = make_env(sandbox=sb)
@@ -271,7 +267,6 @@ class TestExecute:
        sb = _make_sandbox()
        sb.process.exec.side_effect = [
            _make_exec_response(result="/root"),
-            _make_exec_response(result="", exit_code=0),  # init_session
            _make_exec_response(result="not found", exit_code=127),
        ]
        sb.state = "started"
@@ -284,7 +279,6 @@ class TestExecute:
        sb = _make_sandbox()
        sb.process.exec.side_effect = [
            _make_exec_response(result="/root"),
-            _make_exec_response(result="", exit_code=0),  # init_session
            _make_exec_response(result="ok", exit_code=0),
        ]
        sb.state = "started"
@@ -292,47 +286,39 @@ class TestExecute:

        env.execute("python3", stdin_data="print('hi')")
        # Check that the command passed to exec contains heredoc markers
-        # Base class uses HERMES_STDIN_ prefix for heredoc delimiters
+        # (single quotes get shell-escaped by shlex.quote, so check components)
        call_args = sb.process.exec.call_args_list[-1]
        cmd = call_args[0][0]
-        assert "HERMES_STDIN_" in cmd
+        assert "HERMES_EOF_" in cmd
        assert "print" in cmd
        assert "hi" in cmd

-    def test_custom_cwd_in_command_wrapper(self, make_env):
-        """CWD is handled by _wrap_command() in the command string, not as a kwarg."""
+    def test_custom_cwd_passed_through(self, make_env):
        sb = _make_sandbox()
        sb.process.exec.side_effect = [
            _make_exec_response(result="/root"),
-            _make_exec_response(result="", exit_code=0),  # init_session
            _make_exec_response(result="/tmp", exit_code=0),
        ]
        sb.state = "started"
        env = make_env(sandbox=sb)

        env.execute("pwd", cwd="/tmp")
-        # CWD should be embedded in the command string via _wrap_command
-        call_args = sb.process.exec.call_args_list[-1]
-        cmd = call_args[0][0]
-        assert "cd /tmp" in cmd
-        # CWD should NOT be passed as a kwarg to exec
-        assert "cwd" not in call_args[1]
+        call_kwargs = sb.process.exec.call_args_list[-1][1]
+        assert call_kwargs["cwd"] == "/tmp"

    def test_daytona_error_triggers_retry(self, make_env, daytona_sdk):
        sb = _make_sandbox()
        sb.state = "started"
        sb.process.exec.side_effect = [
            _make_exec_response(result="/root"),  # $HOME
-            _make_exec_response(result="", exit_code=0),  # init_session
            daytona_sdk.DaytonaError("transient"),  # first attempt fails
            _make_exec_response(result="ok", exit_code=0),  # retry succeeds
        ]
        env = make_env(sandbox=sb)

        result = env.execute("echo retry")
-        # DaytonaError now surfaces directly through _ThreadedProcessHandle
-        # (no retry logic) — the error becomes returncode=1
-        assert result["returncode"] == 1
+        assert result["output"] == "ok"
+        assert result["returncode"] == 0


 # ---------------------------------------------------------------------------
@@ -373,18 +359,14 @@ class TestInterrupt:
            calls["n"] += 1
            if calls["n"] == 1:
                return _make_exec_response(result="/root")  # $HOME detection
-            if calls["n"] == 2:
-                return _make_exec_response(result="", exit_code=0)  # init_session
            event.wait(timeout=5)  # simulate long-running command
            return _make_exec_response(result="done", exit_code=0)

        sb.process.exec.side_effect = exec_side_effect
        env = make_env(sandbox=sb)

-        # is_interrupted is checked by base.py's _wait_for_process,
-        # patch where it's actually referenced (base.py's local binding)
        monkeypatch.setattr(
-            "tools.environments.base.is_interrupted", lambda: True
+            "tools.environments.daytona.is_interrupted", lambda: True
        )
        try:
            result = env.execute("sleep 10")
@@ -395,24 +377,23 @@ class TestInterrupt:


 # ---------------------------------------------------------------------------
-# DaytonaError surfaces directly (no retry)
+# Retry exhaustion
 # ---------------------------------------------------------------------------

 class TestRetryExhausted:
    def test_both_attempts_fail(self, make_env, daytona_sdk):
-        """DaytonaError surfaces directly as rc=1 (retry logic was removed)."""
        sb = _make_sandbox()
        sb.state = "started"
        sb.process.exec.side_effect = [
            _make_exec_response(result="/root"),       # $HOME
-            _make_exec_response(result="", exit_code=0),  # init_session
-            daytona_sdk.DaytonaError("fail1"),         # actual command fails
+            daytona_sdk.DaytonaError("fail1"),         # first attempt
+            daytona_sdk.DaytonaError("fail2"),         # retry
        ]
        env = make_env(sandbox=sb)

        result = env.execute("echo x")
-        # Error surfaces directly through _ThreadedProcessHandle (rc=1)
        assert result["returncode"] == 1
+        assert "Daytona execution error" in result["output"]


 # ---------------------------------------------------------------------------