chore: add tests

feat: add support to deploy to modal
2026-03-05 19:01:24 -05:00 · 2026-03-05 18:29:48 -05:00
202 changed files with 2330 additions and 41649 deletions
--- a/.env.example
+++ b/.env.example
@@ -13,34 +13,6 @@ OPENROUTER_API_KEY=
 # Examples: anthropic/claude-opus-4.6, openai/gpt-4o, google/gemini-3-flash-preview, zhipuai/glm-4-plus
 LLM_MODEL=anthropic/claude-opus-4.6

-# =============================================================================
-# LLM PROVIDER (z.ai / GLM)
-# =============================================================================
-# z.ai provides access to ZhipuAI GLM models (GLM-4-Plus, etc.)
-# Get your key at: https://z.ai or https://open.bigmodel.cn
-GLM_API_KEY=
-# GLM_BASE_URL=https://api.z.ai/api/paas/v4  # Override default base URL
-
-# =============================================================================
-# LLM PROVIDER (Kimi / Moonshot)
-# =============================================================================
-# Kimi/Moonshot provides access to Moonshot AI coding models
-# Get your key at: https://platform.moonshot.ai
-KIMI_API_KEY=
-# KIMI_BASE_URL=https://api.moonshot.ai/v1  # Override default base URL
-
-# =============================================================================
-# LLM PROVIDER (MiniMax)
-# =============================================================================
-# MiniMax provides access to MiniMax models (global endpoint)
-# Get your key at: https://www.minimax.io
-MINIMAX_API_KEY=
-# MINIMAX_BASE_URL=https://api.minimax.io/v1  # Override default base URL
-
-# MiniMax China endpoint (for users in mainland China)
-MINIMAX_CN_API_KEY=
-# MINIMAX_CN_BASE_URL=https://api.minimaxi.com/v1  # Override default base URL
-
 # =============================================================================
 # TOOL API KEYS
 # =============================================================================
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -44,8 +44,7 @@ hermes-agent/
 │   │   ├── docker.py          # Docker container execution
 │   │   ├── ssh.py             # SSH remote execution
 │   │   ├── singularity.py     # Singularity/Apptainer + SIF management
-│   │   ├── modal.py           # Modal cloud execution
-│   │   └── daytona.py         # Daytona cloud sandboxes
+│   │   └── modal.py           # Modal cloud execution
 │   ├── terminal_tool.py       # Terminal orchestration (sudo, lifecycle, factory)
 │   ├── todo_tool.py           # Planning & task management
 │   ├── process_registry.py    # Background process management
@@ -56,7 +55,6 @@ hermes-agent/
 ├── cron/                 # Scheduler implementation
 ├── environments/         # RL training environments (Atropos integration)
 ├── skills/               # Bundled skill sources
-├── optional-skills/      # Official optional skills (not activated by default)
 ├── cli.py                # Interactive CLI orchestrator (HermesCLI class)
 ├── run_agent.py          # AIAgent class (core conversation loop)
 ├── model_tools.py        # Tool orchestration (thin layer over tools/registry.py)
@@ -204,7 +202,7 @@ Every installed skill in `~/.hermes/skills/` is automatically registered as a sl
 The skill name (from frontmatter or folder name) becomes the command: `axolotl` → `/axolotl`.

 Implementation (`agent/skill_commands.py`, shared between CLI and gateway):
-1. `scan_skill_commands()` scans all SKILL.md files at startup, filtering out skills incompatible with the current OS platform (via the `platforms` frontmatter field)
+1. `scan_skill_commands()` scans all SKILL.md files at startup
 2. `build_skill_invocation_message()` loads the SKILL.md content and builds a user-turn message
 3. The message includes the full skill content, a list of supporting files (not loaded), and the user's instruction
 4. Supporting files can be loaded on demand via the `skill_view` tool
@@ -423,19 +421,16 @@ The system uses `_config_version` to detect outdated configs:
 API keys are loaded from `~/.hermes/.env`:
 - `OPENROUTER_API_KEY` - Main LLM API access (primary provider)
 - `FIRECRAWL_API_KEY` - Web search/extract tools
-- `FIRECRAWL_API_URL` - Self-hosted Firecrawl endpoint (optional)
 - `BROWSERBASE_API_KEY` / `BROWSERBASE_PROJECT_ID` - Browser automation
 - `FAL_KEY` - Image generation (FLUX model)
 - `NOUS_API_KEY` - Vision and Mixture-of-Agents tools

 Terminal tool configuration (in `~/.hermes/config.yaml`):
-- `terminal.backend` - Backend: local, docker, singularity, modal, daytona, or ssh
+- `terminal.backend` - Backend: local, docker, singularity, modal, or ssh
 - `terminal.cwd` - Working directory ("." = host CWD for local only; for remote backends set an absolute path inside the target, or omit to use the backend's default)
 - `terminal.docker_image` - Image for Docker backend
 - `terminal.singularity_image` - Image for Singularity backend
 - `terminal.modal_image` - Image for Modal backend
-- `terminal.daytona_image` - Image for Daytona backend
-- `DAYTONA_API_KEY` - API key for Daytona backend (in .env)
 - SSH: `TERMINAL_SSH_HOST`, `TERMINAL_SSH_USER`, `TERMINAL_SSH_KEY` in .env

 Agent behavior (in `~/.hermes/.env`):
@@ -499,7 +494,7 @@ terminal(command="pytest -v tests/", background=true)
 - `process(action="submit", session_id="proc_abc123", data="yes")` -- send + Enter

 **Key behaviors:**
-- Background processes execute through the configured terminal backend (local/Docker/Modal/Daytona/SSH/Singularity) -- never directly on the host unless `TERMINAL_ENV=local`
+- Background processes execute through the configured terminal backend (local/Docker/Modal/SSH/Singularity) -- never directly on the host unless `TERMINAL_ENV=local`
 - The `wait` action blocks the tool call until the process finishes, times out, or is interrupted by a new user message
 - PTY mode (`pty=true` on terminal) enables interactive CLI tools (Codex, Claude Code)
 - In RL training, background processes are auto-killed when the episode ends (`tool_context.cleanup()`)
@@ -657,7 +652,6 @@ SKILL.md files use YAML frontmatter (agentskills.io format):
 name: skill-name
 description: Brief description for listing
 version: 1.0.0
-platforms: [macos]              # Optional — restrict to specific OS (macos/linux/windows)
 metadata:
  hermes:
    tags: [tag1, tag2]
@@ -666,14 +660,12 @@ metadata:
 # Skill Content...
 ```

-**Platform filtering** — Skills with a `platforms` field are automatically excluded from the system prompt index, `skills_list()`, and slash commands on incompatible platforms. Skills without the field load everywhere (backward compatible). See `skills/apple/` for macOS-only examples (iMessage, Reminders, Notes, FindMy).
-
-**Skills Hub** — user-driven skill search/install from online registries and official optional skills. Sources: official optional skills (shipped with repo, labeled "official"), GitHub (openai/skills, anthropics/skills, custom taps), ClawHub, Claude marketplace, LobeHub. Not exposed as an agent tool — the model cannot search for or install skills. Users manage skills via `hermes skills browse/search/install` CLI commands or the `/skills` slash command in chat.
+**Skills Hub** — user-driven skill search/install from online registries (GitHub, ClawHub, Claude marketplaces, LobeHub). Not exposed as an agent tool — the model cannot search for or install skills. Users manage skills via `hermes skills ...` CLI commands or the `/skills` slash command in chat.

 Key files:
 - `tools/skills_tool.py` — Agent-facing skill list/view (progressive disclosure)
 - `tools/skills_guard.py` — Security scanner (regex + LLM audit, trust-aware install policy)
-- `tools/skills_hub.py` — Source adapters (OptionalSkillSource, GitHub, ClawHub, Claude marketplace, LobeHub), lock file, auth
+- `tools/skills_hub.py` — Source adapters (GitHub, ClawHub, Claude marketplace, LobeHub), lock file, auth
 - `hermes_cli/skills_hub.py` — CLI subcommands + `/skills` slash command handler

 ---
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -43,9 +43,7 @@ Bundled skills (in `skills/`) ship with every Hermes install. They should be **b
 - Document handling, web research, common dev workflows, system administration
 - Used regularly by a wide range of people

-If your skill is official and useful but not universally needed (e.g., a paid service integration, a heavyweight dependency), put it in **`optional-skills/`** — it ships with the repo but isn't activated by default. Users can discover it via `hermes skills browse` (labeled "official") and install it with `hermes skills install` (no third-party warning, builtin trust).
-
-If your skill is specialized, community-contributed, or niche, it's better suited for a **Skills Hub** — upload it to a skills registry and share it in the [Nous Research Discord](https://discord.gg/NousResearch). Users can install it with `hermes skills install`.
+If your skill is specialized (a niche engineering tool, a specific SaaS integration, a game), it's better suited for a **Skills Hub** — upload it to a skills registry and share it in the [Nous Research Discord](https://discord.gg/NousResearch). Users can install it with `hermes skills install`.

 ---

@@ -155,7 +153,7 @@ hermes-agent/
 │   ├── skill_tools.py            # Skill search, load, manage
 │   └── environments/             # Terminal execution backends
 │       ├── base.py                   # BaseEnvironment ABC
-│       ├── local.py, docker.py, ssh.py, singularity.py, modal.py, daytona.py
+│       ├── local.py, docker.py, ssh.py, singularity.py, modal.py
 │
 ├── gateway/                  # Messaging gateway
 │   ├── run.py                    # GatewayRunner — platform lifecycle, message routing, cron
@@ -170,7 +168,6 @@ hermes-agent/
 │   └── whatsapp-bridge/          # Node.js WhatsApp bridge (Baileys)
 │
 ├── skills/                   # Bundled skills (copied to ~/.hermes/skills/ on install)
-├── optional-skills/          # Official optional skills (discoverable via hub, not activated by default)
 ├── environments/             # RL training environments (Atropos integration)
 ├── tests/                    # Test suite
 ├── website/                  # Documentation site (hermes-agent.nousresearch.com)
@@ -297,9 +294,9 @@ If it's a new toolset, add it to `toolsets.py` and to the relevant platform pres

 ---

-## Adding a Skill
+## Adding a Bundled Skill

-Bundled skills live in `skills/` organized by category. Official optional skills use the same structure in `optional-skills/`:
+Bundled skills live in `skills/` organized by category:

 ```
 skills/
@@ -325,9 +322,6 @@ description: Brief description (shown in skill search results)
 version: 1.0.0
 author: Your Name
 license: MIT
-platforms: [macos, linux]          # Optional — restrict to specific OS platforms
-                                   #   Valid: macos, linux, windows
-                                   #   Omit to load on all platforms (default)
 metadata:
  hermes:
    tags: [Category, Subcategory, Keywords]
@@ -354,18 +348,6 @@ Known failure modes and how to handle them.
 How the agent confirms it worked.
 ```

-### Platform-specific skills
-
-Skills can declare which OS platforms they support via the `platforms` frontmatter field. Skills with this field are automatically hidden from the system prompt, `skills_list()`, and slash commands on incompatible platforms.
-
-```yaml
-platforms: [macos]            # macOS only (e.g., iMessage, Apple Reminders)
-platforms: [macos, linux]     # macOS and Linux
-platforms: [windows]          # Windows only
-```
-
-If the field is omitted or empty, the skill loads on all platforms (backward compatible). See `skills/apple/` for examples of macOS-only skills.
-
 ### Skill guidelines

 - **No external dependencies unless absolutely necessary.** Prefer stdlib Python, curl, and existing Hermes tools (`web_extract`, `terminal`, `read_file`).
--- a/README.md
+++ b/README.md
@@ -11,17 +11,17 @@
  <a href="https://nousresearch.com"><img src="https://img.shields.io/badge/Built%20by-Nous%20Research-blueviolet?style=for-the-badge" alt="Built by Nous Research"></a>
 </p>

-**The self-improving AI agent built by [Nous Research](https://nousresearch.com).** It's the only agent with a built-in learning loop — it creates skills from experience, improves them during use, nudges itself to persist knowledge, searches its own past conversations, and builds a deepening model of who you are across sessions. Run it on a $5 VPS, a GPU cluster, or serverless infrastructure that costs nearly nothing when idle. It's not tied to your laptop — talk to it from Telegram while it works on a cloud VM.
+**The fully open-source AI agent that grows with you.** Install it on a machine, give it your messaging accounts, and it becomes a persistent personal agent — learning your projects, building its own skills, running tasks on a schedule, and reaching you wherever you are.

-Use any model you want — [Nous Portal](https://portal.nousresearch.com), [OpenRouter](https://openrouter.ai) (200+ models), [z.ai/GLM](https://z.ai), [Kimi/Moonshot](https://platform.moonshot.ai), [MiniMax](https://www.minimax.io), OpenAI, or your own endpoint. Switch with `hermes model` — no code changes, no lock-in.
+Use any model you want — [Nous Portal](https://portal.nousresearch.com), [OpenRouter](https://openrouter.ai), OpenAI Codex, or your own endpoint. Switch with `hermes model` — no code changes, no lock-in.

 <table>
 <tr><td><b>A real terminal interface</b></td><td>Full TUI with multiline editing, slash-command autocomplete, conversation history, interrupt-and-redirect, and streaming tool output.</td></tr>
 <tr><td><b>Lives where you do</b></td><td>Telegram, Discord, Slack, WhatsApp, and CLI — all from a single gateway process. Voice memo transcription, cross-platform conversation continuity.</td></tr>
-<tr><td><b>A closed learning loop</b></td><td>Agent-curated memory with periodic nudges. Autonomous skill creation after complex tasks. Skills self-improve during use. FTS5 session search with LLM summarization for cross-session recall. <a href="https://github.com/plastic-labs/honcho">Honcho</a> dialectic user modeling. Compatible with the <a href="https://agentskills.io">agentskills.io</a> open standard.</td></tr>
+<tr><td><b>Grows the longer it runs</b></td><td>Persistent memory across sessions. When it solves a hard problem, it writes a skill document for next time. Skills are searchable, shareable, and compatible with the <a href="https://agentskills.io">agentskills.io</a> open standard.</td></tr>
 <tr><td><b>Scheduled automations</b></td><td>Built-in cron scheduler with delivery to any platform. Daily reports, nightly backups, weekly audits — all in natural language, running unattended.</td></tr>
 <tr><td><b>Delegates and parallelizes</b></td><td>Spawn isolated subagents for parallel workstreams. Write Python scripts that call tools via RPC, collapsing multi-step pipelines into zero-context-cost turns.</td></tr>
-<tr><td><b>Runs anywhere, not just your laptop</b></td><td>Six terminal backends — local, Docker, SSH, Daytona, Singularity, and Modal. Daytona and Modal offer serverless persistence — your agent's environment hibernates when idle and wakes on demand, costing nearly nothing between sessions. Run it on a $5 VPS or a GPU cluster.</td></tr>
+<tr><td><b>Real sandboxing</b></td><td>Five terminal backends — local, Docker, SSH, Singularity, and Modal — with persistent workspaces and container security hardening.</td></tr>
 <tr><td><b>Research-ready</b></td><td>Batch trajectory generation, Atropos RL environments, trajectory compression for training the next generation of tool-calling models.</td></tr>
 </table>

--- a/TODO.md
+++ b/TODO.md
@@ -0,0 +1,129 @@
+# Hermes Agent - Future Improvements
+
+---
+
+
+
+## 3. Local Browser Control via CDP 🌐
+
+**Status:** Not started (currently Browserbase cloud only)
+**Priority:** Medium
+
+Support local Chrome/Chromium via Chrome DevTools Protocol alongside existing Browserbase cloud backend.
+
+**What other agents do:**
+- **OpenClaw**: Full CDP-based Chrome control with snapshots, actions, uploads, profiles, file chooser, PDF save, console messages, tab management. Uses local Chrome for persistent login sessions.
+- **Cline**: Headless browser with Computer Use (click, type, scroll, screenshot, console logs)
+
+**Our approach:**
+- Add a `local` backend option to `browser_tool.py` using Playwright or raw CDP
+- Config toggle: `browser.backend: local | browserbase | auto`
+- `auto` mode: try local first, fall back to Browserbase
+- Local advantages: free, persistent login sessions, no API key needed
+- Local disadvantages: no CAPTCHA solving, no stealth mode, requires Chrome installed
+- Reuse the same 10-tool interface -- just swap the backend
+- Later: Chrome profile management for persistent sessions across restarts
+
+---
+
+## 4. Signal Integration 📡
+
+**Status:** Not started
+**Priority:** Low
+
+New platform adapter using signal-cli daemon (JSON-RPC HTTP + SSE). Requires Java runtime and phone number registration.
+
+**Reference:** OpenClaw has Signal support via signal-cli.
+
+---
+
+## 5. Plugin/Extension System 🔌
+
+**Status:** Partially implemented (event hooks exist in `gateway/hooks.py`)
+**Priority:** Medium
+
+Full Python plugin interface that goes beyond the current hook system.
+
+**What other agents do:**
+- **OpenClaw**: Plugin SDK with tool-send capabilities, lifecycle phase hooks (before-agent-start, after-tool-call, model-override), plugin registry with install/uninstall.
+- **Pi**: Extensions are TypeScript modules that can register tools, commands, keyboard shortcuts, custom UI widgets, overlays, status lines, dialogs, compaction hooks, raw terminal input listeners. Extremely comprehensive.
+- **OpenCode**: MCP client support (stdio, SSE, StreamableHTTP), OAuth auth for MCP servers. Also has Copilot/Codex plugins.
+- **Codex**: Full MCP integration with skill dependencies.
+- **Cline**: MCP integration + lifecycle hooks with cancellation support.
+
+**Our approach (phased):**
+
+### Phase 1: Enhanced hooks
+- Expand the existing `gateway/hooks.py` to support more events: `before-tool-call`, `after-tool-call`, `before-response`, `context-compress`, `session-end`
+- Allow hooks to modify tool results (e.g., filter sensitive output)
+
+### Phase 2: Plugin interface
+- `~/.hermes/plugins/<name>/plugin.yaml` + `handler.py`
+- Plugins can: register new tools, add CLI commands, subscribe to events, inject system prompt sections
+- `hermes plugin list|install|uninstall|create` CLI commands
+- Plugin discovery and validation on startup
+
+### Phase 3: MCP support (industry standard) ✅ DONE
+- ✅ MCP client that connects to external MCP servers (stdio + HTTP/StreamableHTTP)
+- ✅ Config: `mcp_servers` in config.yaml with connection details
+- ✅ Each MCP server's tools auto-registered as a dynamic toolset
+- Future: Resources, Prompts, Progress notifications, `hermes mcp` CLI command
+
+---
+
+## 6. MCP (Model Context Protocol) Support 🔗 ✅ DONE
+
+**Status:** Implemented (PR #301)
+**Priority:** Complete
+
+Native MCP client support with stdio and HTTP/StreamableHTTP transports, auto-discovery, reconnection with exponential backoff, env var filtering, and credential stripping. See `docs/mcp.md` for full documentation.
+
+**Still TODO:**
+- `hermes mcp` CLI subcommand (list/test/status)
+- `hermes tools` UI integration for MCP toolsets
+- MCP Resources and Prompts support
+- OAuth authentication for remote servers
+- Progress notifications for long-running tools
+
+---
+
+## 8. Filesystem Checkpointing / Rollback 🔄
+
+**Status:** Not started
+**Priority:** Low-Medium
+
+Automatic filesystem snapshots after each agent loop iteration so the user can roll back destructive changes to their project.
+
+**What other agents do:**
+- **Cline**: Workspace checkpoints at each step with Compare/Restore UI
+- **OpenCode**: Git-backed workspace snapshots per step, with weekly gc
+- **Codex**: Sandboxed execution with commit-per-step, rollback on failure
+
+**Our approach:**
+- After each tool call (or batch of tool calls in a single turn) that modifies files, create a lightweight checkpoint of the affected files
+- Git-based when the project is a repo: auto-commit to a detached/temporary branch (`hermes/checkpoints/<session>`) after each agent turn, squash or discard on session end
+- Non-git fallback: tar snapshots of changed files in `~/.hermes/checkpoints/<session_id>/`
+- `hermes rollback` CLI command to restore to a previous checkpoint
+- Agent-accessible via a `checkpoint` tool: `list` (show available restore points), `restore` (roll back to a named point), `diff` (show what changed since a checkpoint)
+- Configurable: off by default (opt-in via `config.yaml`), since auto-committing can be surprising
+- Cleanup: checkpoints expire after session ends (or configurable retention period)
+- Integration with the terminal backend: works with local, SSH, and Docker backends (snapshots happen on the execution host)
+
+---
+
+## Implementation Priority Order
+
+### Tier 1: Next Up
+
+1. ~~MCP Support -- #6~~ ✅ Done (PR #301)
+
+### Tier 2: Quality of Life
+
+3. Local Browser Control via CDP -- #3
+4. Plugin/Extension System -- #5
+
+### Tier 3: Nice to Have
+
+5. Session Branching / Checkpoints -- #7
+6. Filesystem Checkpointing / Rollback -- #8
+7. Signal Integration -- #4
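Note on item 8 in the TODO.md added above: the non-git fallback ("tar snapshots of changed files") could look roughly like the sketch below. Nothing here exists in the repository. The function names `snapshot` and `restore`, the archive naming scheme, and the directory layout are all assumptions made for illustration.

```python
import tarfile
import time
from pathlib import Path
from typing import Iterable


def snapshot(changed: Iterable[Path], root: Path, dest_dir: Path) -> Path:
    """Archive the given files (paths under `root`) into a new tar in `dest_dir`.

    Hypothetical helper: the TODO only says "tar snapshots of changed files".
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{time.time_ns()}.tar"
    with tarfile.open(archive, "w") as tar:
        for path in changed:
            # Store paths relative to the project root so restore is relocatable.
            tar.add(path, arcname=str(Path(path).relative_to(root)))
    return archive


def restore(archive: Path, root: Path) -> None:
    """Write every regular file in `archive` back under `root`, overwriting."""
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            target = root / member.name
            target.parent.mkdir(parents=True, exist_ok=True)
            source = tar.extractfile(member)
            if source is not None:
                target.write_bytes(source.read())
```

Only files passed in `changed` are archived, which keeps each checkpoint small. A real implementation would still need the retention and cleanup behaviour the TODO lists, and should validate member names before writing them back.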
--- a/agent/auxiliary_client.py
+++ b/agent/auxiliary_client.py
@@ -4,20 +4,18 @@ Provides a single resolution chain so every consumer (context compression,
 session search, web extraction, vision analysis, browser vision) picks up
 the best available backend without duplicating fallback logic.

-Resolution order (same for text and vision tasks):
+Resolution order for text tasks:
  1. OpenRouter  (OPENROUTER_API_KEY)
  2. Nous Portal (~/.hermes/auth.json active provider)
  3. Custom endpoint (OPENAI_BASE_URL + OPENAI_API_KEY)
  4. Codex OAuth (Responses API via chatgpt.com with gpt-5.3-codex,
     wrapped to look like a chat.completions client)
-  5. Direct API-key providers (z.ai/GLM, Kimi/Moonshot, MiniMax, MiniMax-CN)
-     — checked via PROVIDER_REGISTRY entries with auth_type='api_key'
-  6. None
+  5. None

-Per-task provider overrides (e.g. AUXILIARY_VISION_PROVIDER,
-CONTEXT_COMPRESSION_PROVIDER) can force a specific provider for each task:
-"openrouter", "nous", or "main" (= steps 3-5).
-Default "auto" follows the full chain above.
+Resolution order for vision/multimodal tasks:
+  1. OpenRouter
+  2. Nous Portal
+  3. None  (custom endpoints can't substitute for Gemini multimodal)
 """

 import json
@@ -33,14 +31,6 @@ from hermes_constants import OPENROUTER_BASE_URL

 logger = logging.getLogger(__name__)

-# Default auxiliary models for direct API-key providers (cheap/fast for side tasks)
-_API_KEY_PROVIDER_AUX_MODELS: Dict[str, str] = {
-    "zai": "glm-4.5-flash",
-    "kimi-coding": "kimi-k2-turbo-preview",
-    "minimax": "MiniMax-M2.5-highspeed",
-    "minimax-cn": "MiniMax-M2.5-highspeed",
-}
-
 # OpenRouter app attribution headers
 _OR_HEADERS = {
    "HTTP-Referer": "https://github.com/NousResearch/hermes-agent",
@@ -292,159 +282,53 @@ def _read_codex_access_token() -> Optional[str]:
        return None


-def _resolve_api_key_provider() -> Tuple[Optional[OpenAI], Optional[str]]:
-    """Try each API-key provider in PROVIDER_REGISTRY order.
-
-    Returns (client, model) for the first provider whose env var is set,
-    or (None, None) if none are configured.
-    """
-    try:
-        from hermes_cli.auth import PROVIDER_REGISTRY
-    except ImportError:
-        logger.debug("Could not import PROVIDER_REGISTRY for API-key fallback")
-        return None, None
-
-    for provider_id, pconfig in PROVIDER_REGISTRY.items():
-        if pconfig.auth_type != "api_key":
-            continue
-        # Check if any of the provider's env vars are set
-        api_key = ""
-        for env_var in pconfig.api_key_env_vars:
-            val = os.getenv(env_var, "").strip()
-            if val:
-                api_key = val
-                break
-        if not api_key:
-            continue
-        # Resolve base URL (with optional env-var override)
-        base_url = pconfig.inference_base_url
-        if pconfig.base_url_env_var:
-            env_url = os.getenv(pconfig.base_url_env_var, "").strip()
-            if env_url:
-                base_url = env_url.rstrip("/")
-        model = _API_KEY_PROVIDER_AUX_MODELS.get(provider_id, "default")
-        logger.debug("Auxiliary text client: %s (%s)", pconfig.name, model)
-        return OpenAI(api_key=api_key, base_url=base_url), model
-
-    return None, None
-
-
-# ── Provider resolution helpers ─────────────────────────────────────────────
-
-def _get_auxiliary_provider(task: str = "") -> str:
-    """Read the provider override for a specific auxiliary task.
-
-    Checks AUXILIARY_{TASK}_PROVIDER first (e.g. AUXILIARY_VISION_PROVIDER),
-    then CONTEXT_{TASK}_PROVIDER (for the compression section's summary_provider),
-    then falls back to "auto".  Returns one of: "auto", "openrouter", "nous", "main".
-    """
-    if task:
-        for prefix in ("AUXILIARY_", "CONTEXT_"):
-            val = os.getenv(f"{prefix}{task.upper()}_PROVIDER", "").strip().lower()
-            if val and val != "auto":
-                return val
-    return "auto"
-
-
-def _try_openrouter() -> Tuple[Optional[OpenAI], Optional[str]]:
-    or_key = os.getenv("OPENROUTER_API_KEY")
-    if not or_key:
-        return None, None
-    logger.debug("Auxiliary client: OpenRouter")
-    return OpenAI(api_key=or_key, base_url=OPENROUTER_BASE_URL,
-                   default_headers=_OR_HEADERS), _OPENROUTER_MODEL
-
-
-def _try_nous() -> Tuple[Optional[OpenAI], Optional[str]]:
-    nous = _read_nous_auth()
-    if not nous:
-        return None, None
-    global auxiliary_is_nous
-    auxiliary_is_nous = True
-    logger.debug("Auxiliary client: Nous Portal")
-    return (
-        OpenAI(api_key=_nous_api_key(nous), base_url=_nous_base_url()),
-        _NOUS_MODEL,
-    )
-
-
-def _try_custom_endpoint() -> Tuple[Optional[OpenAI], Optional[str]]:
-    custom_base = os.getenv("OPENAI_BASE_URL")
-    custom_key = os.getenv("OPENAI_API_KEY")
-    if not custom_base or not custom_key:
-        return None, None
-    model = os.getenv("OPENAI_MODEL") or os.getenv("LLM_MODEL") or "gpt-4o-mini"
-    logger.debug("Auxiliary client: custom endpoint (%s)", model)
-    return OpenAI(api_key=custom_key, base_url=custom_base), model
-
-
-def _try_codex() -> Tuple[Optional[Any], Optional[str]]:
-    codex_token = _read_codex_access_token()
-    if not codex_token:
-        return None, None
-    logger.debug("Auxiliary client: Codex OAuth (%s via Responses API)", _CODEX_AUX_MODEL)
-    real_client = OpenAI(api_key=codex_token, base_url=_CODEX_AUX_BASE_URL)
-    return CodexAuxiliaryClient(real_client, _CODEX_AUX_MODEL), _CODEX_AUX_MODEL
-
-
-def _resolve_forced_provider(forced: str) -> Tuple[Optional[OpenAI], Optional[str]]:
-    """Resolve a specific forced provider.  Returns (None, None) if creds missing."""
-    if forced == "openrouter":
-        client, model = _try_openrouter()
-        if client is None:
-            logger.warning("auxiliary.provider=openrouter but OPENROUTER_API_KEY not set")
-        return client, model
-
-    if forced == "nous":
-        client, model = _try_nous()
-        if client is None:
-            logger.warning("auxiliary.provider=nous but Nous Portal not configured (run: hermes login)")
-        return client, model
-
-    if forced == "main":
-        # "main" = skip OpenRouter/Nous, use the main chat model's credentials.
-        for try_fn in (_try_custom_endpoint, _try_codex, _resolve_api_key_provider):
-            client, model = try_fn()
-            if client is not None:
-                return client, model
-        logger.warning("auxiliary.provider=main but no main endpoint credentials found")
-        return None, None
-
-    # Unknown provider name — fall through to auto
-    logger.warning("Unknown auxiliary.provider=%r, falling back to auto", forced)
-    return None, None
-
-
-def _resolve_auto() -> Tuple[Optional[OpenAI], Optional[str]]:
-    """Full auto-detection chain: OpenRouter → Nous → custom → Codex → API-key → None."""
-    for try_fn in (_try_openrouter, _try_nous, _try_custom_endpoint,
-                   _try_codex, _resolve_api_key_provider):
-        client, model = try_fn()
-        if client is not None:
-            return client, model
-    logger.debug("Auxiliary client: none available")
-    return None, None
-
-
 # ── Public API ──────────────────────────────────────────────────────────────

-def get_text_auxiliary_client(task: str = "") -> Tuple[Optional[OpenAI], Optional[str]]:
-    """Return (client, default_model_slug) for text-only auxiliary tasks.
+def get_text_auxiliary_client() -> Tuple[Optional[OpenAI], Optional[str]]:
+    """Return (client, model_slug) for text-only auxiliary tasks.

-    Args:
-        task: Optional task name ("compression", "web_extract") to check
-              for a task-specific provider override.
-
-    Callers may override the returned model with a per-task env var
-    (e.g. CONTEXT_COMPRESSION_MODEL, AUXILIARY_WEB_EXTRACT_MODEL).
+    Falls through OpenRouter -> Nous Portal -> custom endpoint -> Codex OAuth -> (None, None).
    """
-    forced = _get_auxiliary_provider(task)
-    if forced != "auto":
-        return _resolve_forced_provider(forced)
-    return _resolve_auto()
+    # 1. OpenRouter
+    or_key = os.getenv("OPENROUTER_API_KEY")
+    if or_key:
+        logger.debug("Auxiliary text client: OpenRouter")
+        return OpenAI(api_key=or_key, base_url=OPENROUTER_BASE_URL,
+                       default_headers=_OR_HEADERS), _OPENROUTER_MODEL
+
+    # 2. Nous Portal
+    nous = _read_nous_auth()
+    if nous:
+        global auxiliary_is_nous
+        auxiliary_is_nous = True
+        logger.debug("Auxiliary text client: Nous Portal")
+        return (
+            OpenAI(api_key=_nous_api_key(nous), base_url=_nous_base_url()),
+            _NOUS_MODEL,
+        )
+
+    # 3. Custom endpoint (both base URL and key must be set)
+    custom_base = os.getenv("OPENAI_BASE_URL")
+    custom_key = os.getenv("OPENAI_API_KEY")
+    if custom_base and custom_key:
+        model = os.getenv("OPENAI_MODEL") or os.getenv("LLM_MODEL") or "gpt-4o-mini"
+        logger.debug("Auxiliary text client: custom endpoint (%s)", model)
+        return OpenAI(api_key=custom_key, base_url=custom_base), model
+
+    # 4. Codex OAuth -- uses the Responses API (only endpoint the token
+    # can access), wrapped to look like a chat.completions client.
+    codex_token = _read_codex_access_token()
+    if codex_token:
+        logger.debug("Auxiliary text client: Codex OAuth (%s via Responses API)", _CODEX_AUX_MODEL)
+        real_client = OpenAI(api_key=codex_token, base_url=_CODEX_AUX_BASE_URL)
+        return CodexAuxiliaryClient(real_client, _CODEX_AUX_MODEL), _CODEX_AUX_MODEL
+
+    # 5. Nothing available
+    logger.debug("Auxiliary text client: none available")
+    return None, None


-def get_async_text_auxiliary_client(task: str = ""):
+def get_async_text_auxiliary_client():
    """Return (async_client, model_slug) for async consumers.

    For standard providers returns (AsyncOpenAI, model). For Codex returns
@@ -453,7 +337,7 @@ def get_async_text_auxiliary_client(task: str = ""):
    """
    from openai import AsyncOpenAI

-    sync_client, model = get_text_auxiliary_client(task)
+    sync_client, model = get_text_auxiliary_client()
    if sync_client is None:
        return None, None

@@ -470,16 +354,30 @@ def get_async_text_auxiliary_client(task: str = ""):


 def get_vision_auxiliary_client() -> Tuple[Optional[OpenAI], Optional[str]]:
-    """Return (client, default_model_slug) for vision/multimodal auxiliary tasks.
+    """Return (client, model_slug) for vision/multimodal auxiliary tasks.

-    Checks AUXILIARY_VISION_PROVIDER for a forced provider, otherwise
-    auto-detects.  Callers may override the returned model with
-    AUXILIARY_VISION_MODEL.
+    Only OpenRouter and Nous Portal qualify — custom endpoints cannot
+    substitute for Gemini multimodal.
    """
-    forced = _get_auxiliary_provider("vision")
-    if forced != "auto":
-        return _resolve_forced_provider(forced)
-    return _resolve_auto()
+    # 1. OpenRouter
+    or_key = os.getenv("OPENROUTER_API_KEY")
+    if or_key:
+        logger.debug("Auxiliary vision client: OpenRouter")
+        return OpenAI(api_key=or_key, base_url=OPENROUTER_BASE_URL,
+                       default_headers=_OR_HEADERS), _OPENROUTER_MODEL
+
+    # 2. Nous Portal
+    nous = _read_nous_auth()
+    if nous:
+        logger.debug("Auxiliary vision client: Nous Portal")
+        return (
+            OpenAI(api_key=_nous_api_key(nous), base_url=_nous_base_url()),
+            _NOUS_MODEL,
+        )
+
+    # 3. Nothing suitable
+    logger.debug("Auxiliary vision client: none available")
+    return None, None


 def get_auxiliary_extra_body() -> dict:
--- a/agent/context_compressor.py
+++ b/agent/context_compressor.py
@@ -34,26 +34,23 @@ class ContextCompressor:
        summary_target_tokens: int = 2500,
        quiet_mode: bool = False,
        summary_model_override: str = None,
-        base_url: str = "",
    ):
        self.model = model
-        self.base_url = base_url
        self.threshold_percent = threshold_percent
        self.protect_first_n = protect_first_n
        self.protect_last_n = protect_last_n
        self.summary_target_tokens = summary_target_tokens
        self.quiet_mode = quiet_mode

-        self.context_length = get_model_context_length(model, base_url=base_url)
+        self.context_length = get_model_context_length(model)
        self.threshold_tokens = int(self.context_length * threshold_percent)
        self.compression_count = 0
-        self._context_probed = False  # True after a step-down from context error

        self.last_prompt_tokens = 0
        self.last_completion_tokens = 0
        self.last_total_tokens = 0

-        self.client, default_model = get_text_auxiliary_client("compression")
+        self.client, default_model = get_text_auxiliary_client()
        self.summary_model = summary_model_override or default_model

    def update_from_response(self, usage: Dict[str, Any]):
@@ -196,111 +193,10 @@ Write only the summary, starting with "[CONTEXT SUMMARY]:" prefix."""
            logger.debug("Could not build fallback auxiliary client: %s", exc)
            return None, None

-    # ------------------------------------------------------------------
-    # Tool-call / tool-result pair integrity helpers
-    # ------------------------------------------------------------------
-
-    @staticmethod
-    def _get_tool_call_id(tc) -> str:
-        """Extract the call ID from a tool_call entry (dict or SimpleNamespace)."""
-        if isinstance(tc, dict):
-            return tc.get("id", "")
-        return getattr(tc, "id", "") or ""
-
-    def _sanitize_tool_pairs(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
-        """Fix orphaned tool_call / tool_result pairs after compression.
-
-        Two failure modes:
-        1. A tool *result* references a call_id whose assistant tool_call was
-           removed (summarized/truncated).  The API rejects this with
-           "No tool call found for function call output with call_id ...".
-        2. An assistant message has tool_calls whose results were dropped.
-           The API rejects this because every tool_call must be followed by
-           a tool result with the matching call_id.
-
-        This method removes orphaned results and inserts stub results for
-        orphaned calls so the message list is always well-formed.
-        """
-        surviving_call_ids: set = set()
-        for msg in messages:
-            if msg.get("role") == "assistant":
-                for tc in msg.get("tool_calls") or []:
-                    cid = self._get_tool_call_id(tc)
-                    if cid:
-                        surviving_call_ids.add(cid)
-
-        result_call_ids: set = set()
-        for msg in messages:
-            if msg.get("role") == "tool":
-                cid = msg.get("tool_call_id")
-                if cid:
-                    result_call_ids.add(cid)
-
-        # 1. Remove tool results whose call_id has no matching assistant tool_call
-        orphaned_results = result_call_ids - surviving_call_ids
-        if orphaned_results:
-            messages = [
-                m for m in messages
-                if not (m.get("role") == "tool" and m.get("tool_call_id") in orphaned_results)
-            ]
-            if not self.quiet_mode:
-                logger.info("Compression sanitizer: removed %d orphaned tool result(s)", len(orphaned_results))
-
-        # 2. Add stub results for assistant tool_calls whose results were dropped
-        missing_results = surviving_call_ids - result_call_ids
-        if missing_results:
-            patched: List[Dict[str, Any]] = []
-            for msg in messages:
-                patched.append(msg)
-                if msg.get("role") == "assistant":
-                    for tc in msg.get("tool_calls") or []:
-                        cid = self._get_tool_call_id(tc)
-                        if cid in missing_results:
-                            patched.append({
-                                "role": "tool",
-                                "content": "[Result from earlier conversation — see context summary above]",
-                                "tool_call_id": cid,
-                            })
-            messages = patched
-            if not self.quiet_mode:
-                logger.info("Compression sanitizer: added %d stub tool result(s)", len(missing_results))
-
-        return messages
-
-    def _align_boundary_forward(self, messages: List[Dict[str, Any]], idx: int) -> int:
-        """Push a compress-start boundary forward past any orphan tool results.
-
-        If ``messages[idx]`` is a tool result, slide forward until we hit a
-        non-tool message so we don't start the summarised region mid-group.
-        """
-        while idx < len(messages) and messages[idx].get("role") == "tool":
-            idx += 1
-        return idx
-
-    def _align_boundary_backward(self, messages: List[Dict[str, Any]], idx: int) -> int:
-        """Pull a compress-end boundary backward to avoid splitting a
-        tool_call / result group.
-
-        If the message just before ``idx`` is an assistant message with
-        tool_calls, those tool results will start at ``idx`` and would be
-        separated from their parent.  Move backwards to include the whole
-        group in the summarised region.
-        """
-        if idx <= 0 or idx >= len(messages):
-            return idx
-        prev = messages[idx - 1]
-        if prev.get("role") == "assistant" and prev.get("tool_calls"):
-            # The results for this assistant turn sit at idx..idx+k.
-            # Include the assistant message in the summarised region too.
-            idx -= 1
-        return idx
-
    def compress(self, messages: List[Dict[str, Any]], current_tokens: int = None) -> List[Dict[str, Any]]:
        """Compress conversation messages by summarizing middle turns.

        Keeps first N + last N turns, summarizes everything in between.
-        After compression, orphaned tool_call / tool_result pairs are cleaned
-        up so the API never receives mismatched IDs.
        """
        n_messages = len(messages)
        if n_messages <= self.protect_first_n + self.protect_last_n + 1:
@@ -313,12 +209,6 @@ Write only the summary, starting with "[CONTEXT SUMMARY]:" prefix."""
        if compress_start >= compress_end:
            return messages

-        # Adjust boundaries to avoid splitting tool_call/result groups.
-        compress_start = self._align_boundary_forward(messages, compress_start)
-        compress_end = self._align_boundary_backward(messages, compress_end)
-        if compress_start >= compress_end:
-            return messages
-
        turns_to_summarize = messages[compress_start:compress_end]
        display_tokens = current_tokens if current_tokens else self.last_prompt_tokens or estimate_messages_tokens_rough(messages)

@@ -340,7 +230,6 @@ Write only the summary, starting with "[CONTEXT SUMMARY]:" prefix."""
            tail = messages[-self.protect_last_n:]
            kept.extend(m.copy() for m in tail)
            self.compression_count += 1
-            kept = self._sanitize_tool_pairs(kept)
            if not self.quiet_mode:
                print(f"   ✂️  Truncated: {len(messages)} → {len(kept)} messages (dropped middle turns)")
            return kept
@@ -364,8 +253,6 @@ Write only the summary, starting with "[CONTEXT SUMMARY]:" prefix."""

        self.compression_count += 1

-        compressed = self._sanitize_tool_pairs(compressed)
-
        if not self.quiet_mode:
            new_estimate = estimate_messages_tokens_rough(compressed)
            saved_estimate = display_tokens - new_estimate
--- a/agent/insights.py
+++ b/agent/insights.py
@@ -1,818 +0,0 @@
-"""
-Session Insights Engine for Hermes Agent.
-
-Analyzes historical session data from the SQLite state database to produce
-comprehensive usage insights — token consumption, cost estimates, tool usage
-patterns, activity trends, model/platform breakdowns, and session metrics.
-
-Inspired by Claude Code's /insights command, adapted for Hermes Agent's
-multi-platform architecture with additional cost estimation and platform
-breakdown capabilities.
-
-Usage:
-    from agent.insights import InsightsEngine
-    engine = InsightsEngine(db)
-    report = engine.generate(days=30)
-    print(engine.format_terminal(report))
-"""
-
-import json
-import time
-from collections import Counter, defaultdict
-from datetime import datetime
-from typing import Any, Dict, List, Optional
-
-# =========================================================================
-# Model pricing (USD per million tokens) — approximate as of early 2026
-# =========================================================================
-MODEL_PRICING = {
-    # OpenAI
-    "gpt-4o": {"input": 2.50, "output": 10.00},
-    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
-    "gpt-4.1": {"input": 2.00, "output": 8.00},
-    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
-    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
-    "gpt-4.5-preview": {"input": 75.00, "output": 150.00},
-    "gpt-5": {"input": 10.00, "output": 30.00},
-    "gpt-5.4": {"input": 10.00, "output": 30.00},
-    "o3": {"input": 10.00, "output": 40.00},
-    "o3-mini": {"input": 1.10, "output": 4.40},
-    "o4-mini": {"input": 1.10, "output": 4.40},
-    # Anthropic
-    "claude-opus-4-20250514": {"input": 15.00, "output": 75.00},
-    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
-    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
-    "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.00},
-    "claude-3-opus-20240229": {"input": 15.00, "output": 75.00},
-    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
-    # DeepSeek
-    "deepseek-chat": {"input": 0.14, "output": 0.28},
-    "deepseek-reasoner": {"input": 0.55, "output": 2.19},
-    # Google
-    "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
-    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
-    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
-    # Meta (via providers)
-    "llama-4-maverick": {"input": 0.50, "output": 0.70},
-    "llama-4-scout": {"input": 0.20, "output": 0.30},
-    # Z.AI / GLM (direct provider — pricing not published externally, treat as local)
-    "glm-5": {"input": 0.0, "output": 0.0},
-    "glm-4.7": {"input": 0.0, "output": 0.0},
-    "glm-4.5": {"input": 0.0, "output": 0.0},
-    "glm-4.5-flash": {"input": 0.0, "output": 0.0},
-    # Kimi / Moonshot (direct provider — pricing not published externally, treat as local)
-    "kimi-k2.5": {"input": 0.0, "output": 0.0},
-    "kimi-k2-thinking": {"input": 0.0, "output": 0.0},
-    "kimi-k2-turbo-preview": {"input": 0.0, "output": 0.0},
-    "kimi-k2-0905-preview": {"input": 0.0, "output": 0.0},
-    # MiniMax (direct provider — pricing not published externally, treat as local)
-    "MiniMax-M2.5": {"input": 0.0, "output": 0.0},
-    "MiniMax-M2.5-highspeed": {"input": 0.0, "output": 0.0},
-    "MiniMax-M2.1": {"input": 0.0, "output": 0.0},
-}
-
-# Fallback: unknown/custom models get zero cost (we can't assume pricing
-# for self-hosted models, custom OAI endpoints, local inference, etc.)
-_DEFAULT_PRICING = {"input": 0.0, "output": 0.0}
-
-
-def _has_known_pricing(model_name: str) -> bool:
-    """Check if a model has known pricing (vs unknown/custom endpoint)."""
-    return _get_pricing(model_name) is not _DEFAULT_PRICING
-
-
-def _get_pricing(model_name: str) -> Dict[str, float]:
-    """Look up pricing for a model. Uses fuzzy matching on model name.
-
-    Returns _DEFAULT_PRICING (zero cost) for unknown/custom models —
-    we can't assume costs for self-hosted endpoints, local inference, etc.
-    """
-    if not model_name:
-        return _DEFAULT_PRICING
-
-    # Strip provider prefix (e.g., "anthropic/claude-..." -> "claude-...")
-    bare = model_name.split("/")[-1].lower()
-
-    # Exact match first
-    if bare in MODEL_PRICING:
-        return MODEL_PRICING[bare]
-
-    # Fuzzy prefix match — prefer the LONGEST matching key to avoid
-    # e.g. "gpt-4o" matching before "gpt-4o-mini" for "gpt-4o-mini-2024-07-18"
-    best_match = None
-    best_len = 0
-    for key, price in MODEL_PRICING.items():
-        if bare.startswith(key) and len(key) > best_len:
-            best_match = price
-            best_len = len(key)
-    if best_match:
-        return best_match
-
-    # Keyword heuristics (checked in most-specific-first order)
-    if "opus" in bare:
-        return {"input": 15.00, "output": 75.00}
-    if "sonnet" in bare:
-        return {"input": 3.00, "output": 15.00}
-    if "haiku" in bare:
-        return {"input": 0.80, "output": 4.00}
-    if "gpt-4o-mini" in bare:
-        return {"input": 0.15, "output": 0.60}
-    if "gpt-4o" in bare:
-        return {"input": 2.50, "output": 10.00}
-    if "gpt-5" in bare:
-        return {"input": 10.00, "output": 30.00}
-    if "deepseek" in bare:
-        return {"input": 0.14, "output": 0.28}
-    if "gemini" in bare:
-        return {"input": 0.15, "output": 0.60}
-
-    return _DEFAULT_PRICING
-
-
-def _estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
-    """Estimate the USD cost for a given model and token counts."""
-    pricing = _get_pricing(model)
-    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000
-
-
-def _format_duration(seconds: float) -> str:
-    """Format seconds into a human-readable duration string."""
-    if seconds < 60:
-        return f"{seconds:.0f}s"
-    minutes = seconds / 60
-    if minutes < 60:
-        return f"{minutes:.0f}m"
-    hours = minutes / 60
-    if hours < 24:
-        remaining_min = int(minutes % 60)
-        return f"{int(hours)}h {remaining_min}m" if remaining_min else f"{int(hours)}h"
-    days = hours / 24
-    return f"{days:.1f}d"
-
-
-def _bar_chart(values: List[int], max_width: int = 20) -> List[str]:
-    """Create simple horizontal bar chart strings from values."""
-    peak = max(values) if values else 1
-    if peak == 0:
-        return ["" for _ in values]
-    return ["█" * max(1, int(v / peak * max_width)) if v > 0 else "" for v in values]
-
-
-class InsightsEngine:
-    """
-    Analyzes session history and produces usage insights.
-
-    Works directly with a SessionDB instance (or raw sqlite3 connection)
-    to query session and message data.
-    """
-
-    def __init__(self, db):
-        """
-        Initialize with a SessionDB instance.
-
-        Args:
-            db: A SessionDB instance (from hermes_state.py)
-        """
-        self.db = db
-        self._conn = db._conn
-
-    def generate(self, days: int = 30, source: str = None) -> Dict[str, Any]:
-        """
-        Generate a complete insights report.
-
-        Args:
-            days: Number of days to look back (default: 30)
-            source: Optional filter by source platform
-
-        Returns:
-            Dict with all computed insights
-        """
-        cutoff = time.time() - (days * 86400)
-
-        # Gather raw data
-        sessions = self._get_sessions(cutoff, source)
-        tool_usage = self._get_tool_usage(cutoff, source)
-        message_stats = self._get_message_stats(cutoff, source)
-
-        if not sessions:
-            return {
-                "days": days,
-                "source_filter": source,
-                "empty": True,
-                "overview": {},
-                "models": [],
-                "platforms": [],
-                "tools": [],
-                "activity": {},
-                "top_sessions": [],
-            }
-
-        # Compute insights
-        overview = self._compute_overview(sessions, message_stats)
-        models = self._compute_model_breakdown(sessions)
-        platforms = self._compute_platform_breakdown(sessions)
-        tools = self._compute_tool_breakdown(tool_usage)
-        activity = self._compute_activity_patterns(sessions)
-        top_sessions = self._compute_top_sessions(sessions)
-
-        return {
-            "days": days,
-            "source_filter": source,
-            "empty": False,
-            "generated_at": time.time(),
-            "overview": overview,
-            "models": models,
-            "platforms": platforms,
-            "tools": tools,
-            "activity": activity,
-            "top_sessions": top_sessions,
-        }
-
-    # =========================================================================
-    # Data gathering (SQL queries)
-    # =========================================================================
-
-    # Columns we actually need (skip system_prompt, model_config blobs)
-    _SESSION_COLS = ("id, source, model, started_at, ended_at, "
-                     "message_count, tool_call_count, input_tokens, output_tokens")
-
-    def _get_sessions(self, cutoff: float, source: str = None) -> List[Dict]:
-        """Fetch sessions within the time window."""
-        if source:
-            cursor = self._conn.execute(
-                f"""SELECT {self._SESSION_COLS} FROM sessions
-                    WHERE started_at >= ? AND source = ?
-                    ORDER BY started_at DESC""",
-                (cutoff, source),
-            )
-        else:
-            cursor = self._conn.execute(
-                f"""SELECT {self._SESSION_COLS} FROM sessions
-                    WHERE started_at >= ?
-                    ORDER BY started_at DESC""",
-                (cutoff,),
-            )
-        return [dict(row) for row in cursor.fetchall()]
-
-    def _get_tool_usage(self, cutoff: float, source: str = None) -> List[Dict]:
-        """Get tool call counts from messages.
-
-        Uses two sources:
-        1. tool_name column on 'tool' role messages (set by gateway)
-        2. tool_calls JSON on 'assistant' role messages (covers CLI where
-           tool_name is not populated on tool responses)
-        """
-        tool_counts = Counter()
-
-        # Source 1: explicit tool_name on tool response messages
-        if source:
-            cursor = self._conn.execute(
-                """SELECT m.tool_name, COUNT(*) as count
-                   FROM messages m
-                   JOIN sessions s ON s.id = m.session_id
-                   WHERE s.started_at >= ? AND s.source = ?
-                     AND m.role = 'tool' AND m.tool_name IS NOT NULL
-                   GROUP BY m.tool_name
-                   ORDER BY count DESC""",
-                (cutoff, source),
-            )
-        else:
-            cursor = self._conn.execute(
-                """SELECT m.tool_name, COUNT(*) as count
-                   FROM messages m
-                   JOIN sessions s ON s.id = m.session_id
-                   WHERE s.started_at >= ?
-                     AND m.role = 'tool' AND m.tool_name IS NOT NULL
-                   GROUP BY m.tool_name
-                   ORDER BY count DESC""",
-                (cutoff,),
-            )
-        for row in cursor.fetchall():
-            tool_counts[row["tool_name"]] += row["count"]
-
-        # Source 2: extract from tool_calls JSON on assistant messages
-        # (covers CLI sessions where tool_name is NULL on tool responses)
-        if source:
-            cursor2 = self._conn.execute(
-                """SELECT m.tool_calls
-                   FROM messages m
-                   JOIN sessions s ON s.id = m.session_id
-                   WHERE s.started_at >= ? AND s.source = ?
-                     AND m.role = 'assistant' AND m.tool_calls IS NOT NULL""",
-                (cutoff, source),
-            )
-        else:
-            cursor2 = self._conn.execute(
-                """SELECT m.tool_calls
-                   FROM messages m
-                   JOIN sessions s ON s.id = m.session_id
-                   WHERE s.started_at >= ?
-                     AND m.role = 'assistant' AND m.tool_calls IS NOT NULL""",
-                (cutoff,),
-            )
-
-        tool_calls_counts = Counter()
-        for row in cursor2.fetchall():
-            try:
-                calls = row["tool_calls"]
-                if isinstance(calls, str):
-                    calls = json.loads(calls)
-                if isinstance(calls, list):
-                    for call in calls:
-                        func = call.get("function", {}) if isinstance(call, dict) else {}
-                        name = func.get("name")
-                        if name:
-                            tool_calls_counts[name] += 1
-            except (json.JSONDecodeError, TypeError, AttributeError):
-                continue
-
-        # Merge: prefer tool_name source, supplement with tool_calls source
-        # for tools not already counted
-        if not tool_counts and tool_calls_counts:
-            # No tool_name data at all — use tool_calls exclusively
-            tool_counts = tool_calls_counts
-        elif tool_counts and tool_calls_counts:
-            # Both sources have data — use whichever has the higher count per tool
-            # (they may overlap, so take the max to avoid double-counting)
-            all_tools = set(tool_counts) | set(tool_calls_counts)
-            merged = Counter()
-            for tool in all_tools:
-                merged[tool] = max(tool_counts.get(tool, 0), tool_calls_counts.get(tool, 0))
-            tool_counts = merged
-
-        # Convert to the expected format
-        return [
-            {"tool_name": name, "count": count}
-            for name, count in tool_counts.most_common()
-        ]
-
-    def _get_message_stats(self, cutoff: float, source: str = None) -> Dict:
-        """Get aggregate message statistics."""
-        if source:
-            cursor = self._conn.execute(
-                """SELECT
-                     COUNT(*) as total_messages,
-                     SUM(CASE WHEN m.role = 'user' THEN 1 ELSE 0 END) as user_messages,
-                     SUM(CASE WHEN m.role = 'assistant' THEN 1 ELSE 0 END) as assistant_messages,
-                     SUM(CASE WHEN m.role = 'tool' THEN 1 ELSE 0 END) as tool_messages
-                   FROM messages m
-                   JOIN sessions s ON s.id = m.session_id
-                   WHERE s.started_at >= ? AND s.source = ?""",
-                (cutoff, source),
-            )
-        else:
-            cursor = self._conn.execute(
-                """SELECT
-                     COUNT(*) as total_messages,
-                     SUM(CASE WHEN m.role = 'user' THEN 1 ELSE 0 END) as user_messages,
-                     SUM(CASE WHEN m.role = 'assistant' THEN 1 ELSE 0 END) as assistant_messages,
-                     SUM(CASE WHEN m.role = 'tool' THEN 1 ELSE 0 END) as tool_messages
-                   FROM messages m
-                   JOIN sessions s ON s.id = m.session_id
-                   WHERE s.started_at >= ?""",
-                (cutoff,),
-            )
-        row = cursor.fetchone()
-        return dict(row) if row else {
-            "total_messages": 0, "user_messages": 0,
-            "assistant_messages": 0, "tool_messages": 0,
-        }
-
-    # =========================================================================
-    # Computation
-    # =========================================================================
-
-    def _compute_overview(self, sessions: List[Dict], message_stats: Dict) -> Dict:
-        """Compute high-level overview statistics."""
-        total_input = sum(s.get("input_tokens") or 0 for s in sessions)
-        total_output = sum(s.get("output_tokens") or 0 for s in sessions)
-        total_tokens = total_input + total_output
-        total_tool_calls = sum(s.get("tool_call_count") or 0 for s in sessions)
-        total_messages = sum(s.get("message_count") or 0 for s in sessions)
-
-        # Cost estimation (weighted by model)
-        total_cost = 0.0
-        models_with_pricing = set()
-        models_without_pricing = set()
-        for s in sessions:
-            model = s.get("model") or ""
-            inp = s.get("input_tokens") or 0
-            out = s.get("output_tokens") or 0
-            total_cost += _estimate_cost(model, inp, out)
-            display = model.split("/")[-1] if "/" in model else (model or "unknown")
-            if _has_known_pricing(model):
-                models_with_pricing.add(display)
-            else:
-                models_without_pricing.add(display)
-
-        # Session duration stats (guard against negative durations from clock drift)
-        durations = []
-        for s in sessions:
-            start = s.get("started_at")
-            end = s.get("ended_at")
-            if start and end and end > start:
-                durations.append(end - start)
-
-        total_hours = sum(durations) / 3600 if durations else 0
-        avg_duration = sum(durations) / len(durations) if durations else 0
-
-        # Earliest and latest session
-        started_timestamps = [s["started_at"] for s in sessions if s.get("started_at")]
-        date_range_start = min(started_timestamps) if started_timestamps else None
-        date_range_end = max(started_timestamps) if started_timestamps else None
-
-        return {
-            "total_sessions": len(sessions),
-            "total_messages": total_messages,
-            "total_tool_calls": total_tool_calls,
-            "total_input_tokens": total_input,
-            "total_output_tokens": total_output,
-            "total_tokens": total_tokens,
-            "estimated_cost": total_cost,
-            "total_hours": total_hours,
-            "avg_session_duration": avg_duration,
-            "avg_messages_per_session": total_messages / len(sessions) if sessions else 0,
-            "avg_tokens_per_session": total_tokens / len(sessions) if sessions else 0,
-            "user_messages": message_stats.get("user_messages") or 0,
-            "assistant_messages": message_stats.get("assistant_messages") or 0,
-            "tool_messages": message_stats.get("tool_messages") or 0,
-            "date_range_start": date_range_start,
-            "date_range_end": date_range_end,
-            "models_with_pricing": sorted(models_with_pricing),
-            "models_without_pricing": sorted(models_without_pricing),
-        }
-
-    def _compute_model_breakdown(self, sessions: List[Dict]) -> List[Dict]:
-        """Break down usage by model."""
-        model_data = defaultdict(lambda: {
-            "sessions": 0, "input_tokens": 0, "output_tokens": 0,
-            "total_tokens": 0, "tool_calls": 0, "cost": 0.0,
-        })
-
-        for s in sessions:
-            model = s.get("model") or "unknown"
-            # Normalize: strip provider prefix for display
-            display_model = model.split("/")[-1] if "/" in model else model
-            d = model_data[display_model]
-            d["sessions"] += 1
-            inp = s.get("input_tokens") or 0
-            out = s.get("output_tokens") or 0
-            d["input_tokens"] += inp
-            d["output_tokens"] += out
-            d["total_tokens"] += inp + out
-            d["tool_calls"] += s.get("tool_call_count") or 0
-            d["cost"] += _estimate_cost(model, inp, out)
-            d["has_pricing"] = _has_known_pricing(model)
-
-        result = [
-            {"model": model, **data}
-            for model, data in model_data.items()
-        ]
-        # Sort by tokens first, fall back to session count when tokens are 0
-        result.sort(key=lambda x: (x["total_tokens"], x["sessions"]), reverse=True)
-        return result
-
-    def _compute_platform_breakdown(self, sessions: List[Dict]) -> List[Dict]:
-        """Break down usage by platform/source."""
-        platform_data = defaultdict(lambda: {
-            "sessions": 0, "messages": 0, "input_tokens": 0,
-            "output_tokens": 0, "total_tokens": 0, "tool_calls": 0,
-        })
-
-        for s in sessions:
-            source = s.get("source") or "unknown"
-            d = platform_data[source]
-            d["sessions"] += 1
-            d["messages"] += s.get("message_count") or 0
-            inp = s.get("input_tokens") or 0
-            out = s.get("output_tokens") or 0
-            d["input_tokens"] += inp
-            d["output_tokens"] += out
-            d["total_tokens"] += inp + out
-            d["tool_calls"] += s.get("tool_call_count") or 0
-
-        result = [
-            {"platform": platform, **data}
-            for platform, data in platform_data.items()
-        ]
-        result.sort(key=lambda x: x["sessions"], reverse=True)
-        return result
-
-    def _compute_tool_breakdown(self, tool_usage: List[Dict]) -> List[Dict]:
-        """Process tool usage data into a ranked list with percentages."""
-        total_calls = sum(t["count"] for t in tool_usage) if tool_usage else 0
-        result = []
-        for t in tool_usage:
-            pct = (t["count"] / total_calls * 100) if total_calls else 0
-            result.append({
-                "tool": t["tool_name"],
-                "count": t["count"],
-                "percentage": pct,
-            })
-        return result
-
-    def _compute_activity_patterns(self, sessions: List[Dict]) -> Dict:
-        """Analyze activity patterns by day of week and hour."""
-        day_counts = Counter()  # 0=Monday ... 6=Sunday
-        hour_counts = Counter()
-        daily_counts = Counter()  # date string -> count
-
-        for s in sessions:
-            ts = s.get("started_at")
-            if not ts:
-                continue
-            dt = datetime.fromtimestamp(ts)
-            day_counts[dt.weekday()] += 1
-            hour_counts[dt.hour] += 1
-            daily_counts[dt.strftime("%Y-%m-%d")] += 1
-
-        day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
-        day_breakdown = [
-            {"day": day_names[i], "count": day_counts.get(i, 0)}
-            for i in range(7)
-        ]
-
-        hour_breakdown = [
-            {"hour": i, "count": hour_counts.get(i, 0)}
-            for i in range(24)
-        ]
-
-        # Busiest day and hour
-        busiest_day = max(day_breakdown, key=lambda x: x["count"]) if day_breakdown else None
-        busiest_hour = max(hour_breakdown, key=lambda x: x["count"]) if hour_breakdown else None
-
-        # Active days (days with at least one session)
-        active_days = len(daily_counts)
-
-        # Streak calculation
-        if daily_counts:
-            all_dates = sorted(daily_counts.keys())
-            current_streak = 1
-            max_streak = 1
-            for i in range(1, len(all_dates)):
-                d1 = datetime.strptime(all_dates[i - 1], "%Y-%m-%d")
-                d2 = datetime.strptime(all_dates[i], "%Y-%m-%d")
-                if (d2 - d1).days == 1:
-                    current_streak += 1
-                    max_streak = max(max_streak, current_streak)
-                else:
-                    current_streak = 1
-        else:
-            max_streak = 0
-
-        return {
-            "by_day": day_breakdown,
-            "by_hour": hour_breakdown,
-            "busiest_day": busiest_day,
-            "busiest_hour": busiest_hour,
-            "active_days": active_days,
-            "max_streak": max_streak,
-        }
-
-    def _compute_top_sessions(self, sessions: List[Dict]) -> List[Dict]:
-        """Find notable sessions (longest, most messages, most tokens)."""
-        top = []
-
-        # Longest by duration
-        sessions_with_duration = [
-            s for s in sessions
-            if s.get("started_at") and s.get("ended_at")
-        ]
-        if sessions_with_duration:
-            longest = max(
-                sessions_with_duration,
-                key=lambda s: (s["ended_at"] - s["started_at"]),
-            )
-            dur = longest["ended_at"] - longest["started_at"]
-            top.append({
-                "label": "Longest session",
-                "session_id": longest["id"][:16],
-                "value": _format_duration(dur),
-                "date": datetime.fromtimestamp(longest["started_at"]).strftime("%b %d"),
-            })
-
-        # Most messages
-        most_msgs = max(sessions, key=lambda s: s.get("message_count") or 0)
-        if (most_msgs.get("message_count") or 0) > 0:
-            top.append({
-                "label": "Most messages",
-                "session_id": most_msgs["id"][:16],
-                "value": f"{most_msgs['message_count']} msgs",
-                "date": datetime.fromtimestamp(most_msgs["started_at"]).strftime("%b %d") if most_msgs.get("started_at") else "?",
-            })
-
-        # Most tokens
-        most_tokens = max(
-            sessions,
-            key=lambda s: (s.get("input_tokens") or 0) + (s.get("output_tokens") or 0),
-        )
-        token_total = (most_tokens.get("input_tokens") or 0) + (most_tokens.get("output_tokens") or 0)
-        if token_total > 0:
-            top.append({
-                "label": "Most tokens",
-                "session_id": most_tokens["id"][:16],
-                "value": f"{token_total:,} tokens",
-                "date": datetime.fromtimestamp(most_tokens["started_at"]).strftime("%b %d") if most_tokens.get("started_at") else "?",
-            })
-
-        # Most tool calls
-        most_tools = max(sessions, key=lambda s: s.get("tool_call_count") or 0)
-        if (most_tools.get("tool_call_count") or 0) > 0:
-            top.append({
-                "label": "Most tool calls",
-                "session_id": most_tools["id"][:16],
-                "value": f"{most_tools['tool_call_count']} calls",
-                "date": datetime.fromtimestamp(most_tools["started_at"]).strftime("%b %d") if most_tools.get("started_at") else "?",
-            })
-
-        return top
-
-    # =========================================================================
-    # Formatting
-    # =========================================================================
-
-    def format_terminal(self, report: Dict) -> str:
-        """Format the insights report for terminal display (CLI)."""
-        if report.get("empty"):
-            days = report.get("days", 30)
-            src = f" (source: {report['source_filter']})" if report.get("source_filter") else ""
-            return f"  No sessions found in the last {days} days{src}."
-
-        lines = []
-        o = report["overview"]
-        days = report["days"]
-        src_filter = report.get("source_filter")
-
-        # Header
-        lines.append("")
-        lines.append("  ╔══════════════════════════════════════════════════════════╗")
-        lines.append("  ║                    📊 Hermes Insights                    ║")
-        period_label = f"Last {days} days"
-        if src_filter:
-            period_label += f" ({src_filter})"
-        padding = 58 - len(period_label) - 2
-        left_pad = padding // 2
-        right_pad = padding - left_pad
-        lines.append(f"  ║{' ' * left_pad} {period_label} {' ' * right_pad}║")
-        lines.append("  ╚══════════════════════════════════════════════════════════╝")
-        lines.append("")
-
-        # Date range
-        if o.get("date_range_start") and o.get("date_range_end"):
-            start_str = datetime.fromtimestamp(o["date_range_start"]).strftime("%b %d, %Y")
-            end_str = datetime.fromtimestamp(o["date_range_end"]).strftime("%b %d, %Y")
-            lines.append(f"  Period: {start_str} — {end_str}")
-            lines.append("")
-
-        # Overview
-        lines.append("  📋 Overview")
-        lines.append("  " + "─" * 56)
-        lines.append(f"  Sessions:          {o['total_sessions']:<12}  Messages:        {o['total_messages']:,}")
-        lines.append(f"  Tool calls:        {o['total_tool_calls']:<12,}  User messages:   {o['user_messages']:,}")
-        lines.append(f"  Input tokens:      {o['total_input_tokens']:<12,}  Output tokens:   {o['total_output_tokens']:,}")
-        cost_str = f"${o['estimated_cost']:.2f}"
-        if o.get("models_without_pricing"):
-            cost_str += " *"
-        lines.append(f"  Total tokens:      {o['total_tokens']:<12,}  Est. cost:       {cost_str}")
-        if o["total_hours"] > 0:
-            lines.append(f"  Active time:       ~{_format_duration(o['total_hours'] * 3600):<11}  Avg session:     ~{_format_duration(o['avg_session_duration'])}")
-        lines.append(f"  Avg msgs/session:  {o['avg_messages_per_session']:.1f}")
-        lines.append("")
-
-        # Model breakdown
-        if report["models"]:
-            lines.append("  🤖 Models Used")
-            lines.append("  " + "─" * 56)
-            lines.append(f"  {'Model':<30} {'Sessions':>8} {'Tokens':>12} {'Cost':>8}")
-            for m in report["models"]:
-                model_name = m["model"][:28]
-                if m.get("has_pricing"):
-                    cost_cell = f"${m['cost']:>6.2f}"
-                else:
-                    cost_cell = "     N/A"
-                lines.append(f"  {model_name:<30} {m['sessions']:>8} {m['total_tokens']:>12,} {cost_cell}")
-            if o.get("models_without_pricing"):
-                lines.append(f"  * Cost N/A for custom/self-hosted models")
-            lines.append("")
-
-        # Platform breakdown
-        if len(report["platforms"]) > 1 or (report["platforms"] and report["platforms"][0]["platform"] != "cli"):
-            lines.append("  📱 Platforms")
-            lines.append("  " + "─" * 56)
-            lines.append(f"  {'Platform':<14} {'Sessions':>8} {'Messages':>10} {'Tokens':>14}")
-            for p in report["platforms"]:
-                lines.append(f"  {p['platform']:<14} {p['sessions']:>8} {p['messages']:>10,} {p['total_tokens']:>14,}")
-            lines.append("")
-
-        # Tool usage
-        if report["tools"]:
-            lines.append("  🔧 Top Tools")
-            lines.append("  " + "─" * 56)
-            lines.append(f"  {'Tool':<28} {'Calls':>8} {'%':>8}")
-            for t in report["tools"][:15]:  # Top 15
-                lines.append(f"  {t['tool']:<28} {t['count']:>8,} {t['percentage']:>7.1f}%")
-            if len(report["tools"]) > 15:
-                lines.append(f"  ... and {len(report['tools']) - 15} more tools")
-            lines.append("")
-
-        # Activity patterns
-        act = report.get("activity", {})
-        if act.get("by_day"):
-            lines.append("  📅 Activity Patterns")
-            lines.append("  " + "─" * 56)
-
-            # Day of week chart
-            day_values = [d["count"] for d in act["by_day"]]
-            bars = _bar_chart(day_values, max_width=15)
-            for i, d in enumerate(act["by_day"]):
-                bar = bars[i]
-                lines.append(f"  {d['day']}  {bar:<15} {d['count']}")
-
-            lines.append("")
-
-            # Peak hours (show top 5 busiest hours)
-            busy_hours = sorted(act["by_hour"], key=lambda x: x["count"], reverse=True)
-            busy_hours = [h for h in busy_hours if h["count"] > 0][:5]
-            if busy_hours:
-                hour_strs = []
-                for h in busy_hours:
-                    hr = h["hour"]
-                    ampm = "AM" if hr < 12 else "PM"
-                    display_hr = hr % 12 or 12
-                    hour_strs.append(f"{display_hr}{ampm} ({h['count']})")
-                lines.append(f"  Peak hours: {', '.join(hour_strs)}")
-
-            if act.get("active_days"):
-                lines.append(f"  Active days: {act['active_days']}")
-            if act.get("max_streak") and act["max_streak"] > 1:
-                lines.append(f"  Best streak: {act['max_streak']} consecutive days")
-            lines.append("")
-
-        # Notable sessions
-        if report.get("top_sessions"):
-            lines.append("  🏆 Notable Sessions")
-            lines.append("  " + "─" * 56)
-            for ts in report["top_sessions"]:
-                lines.append(f"  {ts['label']:<20} {ts['value']:<18} ({ts['date']}, {ts['session_id']})")
-            lines.append("")
-
-        return "\n".join(lines)
-
-    def format_gateway(self, report: Dict) -> str:
-        """Format the insights report for gateway/messaging (shorter)."""
-        if report.get("empty"):
-            days = report.get("days", 30)
-            return f"No sessions found in the last {days} days."
-
-        lines = []
-        o = report["overview"]
-        days = report["days"]
-
-        lines.append(f"📊 **Hermes Insights** — Last {days} days\n")
-
-        # Overview
-        lines.append(f"**Sessions:** {o['total_sessions']} | **Messages:** {o['total_messages']:,} | **Tool calls:** {o['total_tool_calls']:,}")
-        lines.append(f"**Tokens:** {o['total_tokens']:,} (in: {o['total_input_tokens']:,} / out: {o['total_output_tokens']:,})")
-        cost_note = ""
-        if o.get("models_without_pricing"):
-            cost_note = " _(excludes custom/self-hosted models)_"
-        lines.append(f"**Est. cost:** ${o['estimated_cost']:.2f}{cost_note}")
-        if o["total_hours"] > 0:
-            lines.append(f"**Active time:** ~{_format_duration(o['total_hours'] * 3600)} | **Avg session:** ~{_format_duration(o['avg_session_duration'])}")
-        lines.append("")
-
-        # Models (top 5)
-        if report["models"]:
-            lines.append("**🤖 Models:**")
-            for m in report["models"][:5]:
-                cost_str = f"${m['cost']:.2f}" if m.get("has_pricing") else "N/A"
-                lines.append(f"  {m['model'][:25]} — {m['sessions']} sessions, {m['total_tokens']:,} tokens, {cost_str}")
-            lines.append("")
-
-        # Platforms (if multi-platform)
-        if len(report["platforms"]) > 1:
-            lines.append("**📱 Platforms:**")
-            for p in report["platforms"]:
-                lines.append(f"  {p['platform']} — {p['sessions']} sessions, {p['messages']:,} msgs")
-            lines.append("")
-
-        # Tools (top 8)
-        if report["tools"]:
-            lines.append("**🔧 Top Tools:**")
-            for t in report["tools"][:8]:
-                lines.append(f"  {t['tool']} — {t['count']:,} calls ({t['percentage']:.1f}%)")
-            lines.append("")
-
-        # Activity summary
-        act = report.get("activity", {})
-        if act.get("busiest_day") and act.get("busiest_hour"):
-            hr = act["busiest_hour"]["hour"]
-            ampm = "AM" if hr < 12 else "PM"
-            display_hr = hr % 12 or 12
-            lines.append(f"**📅 Busiest:** {act['busiest_day']['day']}s ({act['busiest_day']['count']} sessions), {display_hr}{ampm} ({act['busiest_hour']['count']} sessions)")
-            if act.get("active_days"):
-                lines.append(f"**Active days:** {act['active_days']}", )
-            if act.get("max_streak", 0) > 1:
-                lines.append(f"**Best streak:** {act['max_streak']} consecutive days")
-
-        return "\n".join(lines)
--- a/agent/model_metadata.py
+++ b/agent/model_metadata.py
@@ -5,14 +5,10 @@ and run_agent.py for pre-flight context checks.
 """

 import logging
-import os
-import re
 import time
-from pathlib import Path
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List

 import requests
-import yaml

 from hermes_constants import OPENROUTER_MODELS_URL

@@ -22,18 +18,6 @@ _model_metadata_cache: Dict[str, Dict[str, Any]] = {}
 _model_metadata_cache_time: float = 0
 _MODEL_CACHE_TTL = 3600

-# Descending tiers for context length probing when the model is unknown.
-# We start high and step down on context-length errors until one works.
-CONTEXT_PROBE_TIERS = [
-    2_000_000,
-    1_000_000,
-    512_000,
-    200_000,
-    128_000,
-    64_000,
-    32_000,
-]
-
 DEFAULT_CONTEXT_LENGTHS = {
    "anthropic/claude-opus-4": 200000,
    "anthropic/claude-opus-4.5": 200000,
@@ -49,17 +33,6 @@ DEFAULT_CONTEXT_LENGTHS = {
    "meta-llama/llama-3.3-70b-instruct": 131072,
    "deepseek/deepseek-chat-v3": 65536,
    "qwen/qwen-2.5-72b-instruct": 32768,
-    "glm-4.7": 202752,
-    "glm-5": 202752,
-    "glm-4.5": 131072,
-    "glm-4.5-flash": 131072,
-    "kimi-k2.5": 262144,
-    "kimi-k2-thinking": 262144,
-    "kimi-k2-turbo-preview": 262144,
-    "kimi-k2-0905-preview": 131072,
-    "MiniMax-M2.5": 204800,
-    "MiniMax-M2.5-highspeed": 204800,
-    "MiniMax-M2.1": 204800,
 }


@@ -98,117 +71,17 @@ def fetch_model_metadata(force_refresh: bool = False) -> Dict[str, Dict[str, Any
        return _model_metadata_cache or {}


-def _get_context_cache_path() -> Path:
-    """Return path to the persistent context length cache file."""
-    hermes_home = Path(os.environ.get("HERMES_HOME", Path.home() / ".hermes"))
-    return hermes_home / "context_length_cache.yaml"
-
-
-def _load_context_cache() -> Dict[str, int]:
-    """Load the model+provider → context_length cache from disk."""
-    path = _get_context_cache_path()
-    if not path.exists():
-        return {}
-    try:
-        with open(path) as f:
-            data = yaml.safe_load(f) or {}
-        return data.get("context_lengths", {})
-    except Exception as e:
-        logger.debug("Failed to load context length cache: %s", e)
-        return {}
-
-
-def save_context_length(model: str, base_url: str, length: int) -> None:
-    """Persist a discovered context length for a model+provider combo.
-
-    Cache key is ``model@base_url`` so the same model name served from
-    different providers can have different limits.
-    """
-    key = f"{model}@{base_url}"
-    cache = _load_context_cache()
-    if cache.get(key) == length:
-        return  # already stored
-    cache[key] = length
-    path = _get_context_cache_path()
-    try:
-        path.parent.mkdir(parents=True, exist_ok=True)
-        with open(path, "w") as f:
-            yaml.dump({"context_lengths": cache}, f, default_flow_style=False)
-        logger.info("Cached context length %s → %s tokens", key, f"{length:,}")
-    except Exception as e:
-        logger.debug("Failed to save context length cache: %s", e)
-
-
-def get_cached_context_length(model: str, base_url: str) -> Optional[int]:
-    """Look up a previously discovered context length for model+provider."""
-    key = f"{model}@{base_url}"
-    cache = _load_context_cache()
-    return cache.get(key)
-
-
-def get_next_probe_tier(current_length: int) -> Optional[int]:
-    """Return the next lower probe tier, or None if already at minimum."""
-    for tier in CONTEXT_PROBE_TIERS:
-        if tier < current_length:
-            return tier
-    return None
-
-
-def parse_context_limit_from_error(error_msg: str) -> Optional[int]:
-    """Try to extract the actual context limit from an API error message.
-
-    Many providers include the limit in their error text, e.g.:
-      - "maximum context length is 32768 tokens"
-      - "context_length_exceeded: 131072"
-      - "Maximum context size 32768 exceeded"
-      - "model's max context length is 65536"
-    """
-    error_lower = error_msg.lower()
-    # Pattern: look for numbers near context-related keywords
-    patterns = [
-        r'(?:max(?:imum)?|limit)\s*(?:context\s*)?(?:length|size|window)?\s*(?:is|of|:)?\s*(\d{4,})',
-        r'context\s*(?:length|size|window)\s*(?:is|of|:)?\s*(\d{4,})',
-        r'(\d{4,})\s*(?:token)?\s*(?:context|limit)',
-        r'>\s*(\d{4,})\s*(?:max|limit|token)',  # "250000 tokens > 200000 maximum"
-        r'(\d{4,})\s*(?:max(?:imum)?)\b',  # "200000 maximum"
-    ]
-    for pattern in patterns:
-        match = re.search(pattern, error_lower)
-        if match:
-            limit = int(match.group(1))
-            # Sanity check: must be a reasonable context length
-            if 1024 <= limit <= 10_000_000:
-                return limit
-    return None
-
-
-def get_model_context_length(model: str, base_url: str = "") -> int:
-    """Get the context length for a model.
-
-    Resolution order:
-    1. Persistent cache (previously discovered via probing)
-    2. OpenRouter API metadata
-    3. Hardcoded DEFAULT_CONTEXT_LENGTHS (fuzzy match)
-    4. First probe tier (2M) — will be narrowed on first context error
-    """
-    # 1. Check persistent cache (model+provider)
-    if base_url:
-        cached = get_cached_context_length(model, base_url)
-        if cached is not None:
-            return cached
-
-    # 2. OpenRouter API metadata
+def get_model_context_length(model: str) -> int:
+    """Get the context length for a model (API first, then fallback defaults)."""
    metadata = fetch_model_metadata()
    if model in metadata:
        return metadata[model].get("context_length", 128000)

-    # 3. Hardcoded defaults (fuzzy match)
    for default_model, length in DEFAULT_CONTEXT_LENGTHS.items():
        if default_model in model or model in default_model:
            return length

-    # 4. Unknown model — start at highest probe tier
-    return CONTEXT_PROBE_TIERS[0]
+    return 128000


 def estimate_tokens_rough(text: str) -> int:
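
The simplified lookup in the hunk above resolves a context length in three steps: API metadata, then a substring match against the hardcoded defaults, then a flat 128000 fallback. A minimal standalone sketch of that order follows. It assumes the metadata is passed in as a plain dict instead of being fetched, and the `resolve_context_length` name and its parameters are hypothetical, used only for illustration.

```python
from typing import Any, Dict

FALLBACK_CONTEXT_LENGTH = 128000  # same flat default the hunk returns


def resolve_context_length(
    model: str,
    metadata: Dict[str, Dict[str, Any]],
    defaults: Dict[str, int],
) -> int:
    """Hypothetical stand-in mirroring the resolution order in the diff."""
    # 1. Exact hit in the (already fetched) API metadata.
    if model in metadata:
        return metadata[model].get("context_length", FALLBACK_CONTEXT_LENGTH)

    # 2. Substring match in either direction against the hardcoded table.
    for default_model, length in defaults.items():
        if default_model in model or model in default_model:
            return length

    # 3. Unknown model: flat fallback, no probing.
    return FALLBACK_CONTEXT_LENGTH


defaults = {"deepseek/deepseek-chat-v3": 65536}
print(resolve_context_length("deepseek/deepseek-chat-v3-0324", {}, defaults))
print(resolve_context_length("some/unknown-model", {}, defaults))
```

Because the substring test runs in both directions, the first matching key in dict order wins. With probing removed, an unknown model is assumed to have a 128000-token window and is not narrowed on a context-length error.
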
--- a/agent/prompt_builder.py
+++ b/agent/prompt_builder.py
@@ -142,28 +142,12 @@ def _read_skill_description(skill_file: Path, max_chars: int = 60) -> str:
    return ""


-def _skill_is_platform_compatible(skill_file: Path) -> bool:
-    """Quick check if a SKILL.md is compatible with the current OS platform.
-
-    Reads just enough to parse the ``platforms`` frontmatter field.
-    Skills without the field (the vast majority) are always compatible.
-    """
-    try:
-        from tools.skills_tool import _parse_frontmatter, skill_matches_platform
-        raw = skill_file.read_text(encoding="utf-8")[:2000]
-        frontmatter, _ = _parse_frontmatter(raw)
-        return skill_matches_platform(frontmatter)
-    except Exception:
-        return True  # Err on the side of showing the skill
-
-
 def build_skills_system_prompt() -> str:
    """Build a compact skill index for the system prompt.

    Scans ~/.hermes/skills/ for SKILL.md files grouped by category.
    Includes per-skill descriptions from frontmatter so the model can
    match skills by meaning, not just name.
-    Filters out skills incompatible with the current OS platform.
    """
    hermes_home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
    skills_dir = hermes_home / "skills"
@@ -175,9 +159,6 @@ def build_skills_system_prompt() -> str:
    # Each entry: (skill_name, description)
    skills_by_category: dict[str, list[tuple[str, str]]] = {}
    for skill_file in skills_dir.rglob("SKILL.md"):
-        # Skip skills incompatible with the current OS platform
-        if not _skill_is_platform_compatible(skill_file):
-            continue
        rel_path = skill_file.relative_to(skills_dir)
        parts = rel_path.parts
        if len(parts) >= 2:
--- a/agent/skill_commands.py
+++ b/agent/skill_commands.py
@@ -22,7 +22,7 @@ def scan_skill_commands() -> Dict[str, Dict[str, Any]]:
    global _skill_commands
    _skill_commands = {}
    try:
-        from tools.skills_tool import SKILLS_DIR, _parse_frontmatter, skill_matches_platform
+        from tools.skills_tool import SKILLS_DIR, _parse_frontmatter
        if not SKILLS_DIR.exists():
            return _skill_commands
        for skill_md in SKILLS_DIR.rglob("SKILL.md"):
@@ -31,9 +31,6 @@ def scan_skill_commands() -> Dict[str, Dict[str, Any]]:
            try:
                content = skill_md.read_text(encoding='utf-8')
                frontmatter, body = _parse_frontmatter(content)
-                # Skip skills incompatible with the current OS platform
-                if not skill_matches_platform(frontmatter):
-                    continue
                name = frontmatter.get('name', skill_md.parent.name)
                description = frontmatter.get('description', '')
                if not description:
--- a/batch_runner.py
+++ b/batch_runner.py
@@ -29,6 +29,7 @@ from typing import List, Dict, Any, Optional, Tuple
 from datetime import datetime
 from multiprocessing import Pool, Lock
 import traceback
+
 from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeRemainingColumn, MofNCompleteColumn
 from rich.console import Console
 import fire
@@ -249,7 +250,7 @@ def _process_single_prompt(
    task_id = f"task_{prompt_index}"
    
    # Per-prompt container image override: if the dataset row has an 'image' field,
-    # register it for this task's sandbox. Works with Docker, Modal, Singularity, and Daytona.
+    # register it for this task's sandbox. Works with Docker, Modal, and Singularity.
    container_image = prompt_data.get("image") or prompt_data.get("docker_image")
    if container_image:
        # Verify the image is accessible before spending tokens on the agent loop.
@@ -291,7 +292,6 @@ def _process_single_prompt(
            "docker_image": container_image,
            "modal_image": container_image,
            "singularity_image": f"docker://{container_image}",
-            "daytona_image": container_image,
        }
        if prompt_data.get("cwd"):
            overrides["cwd"] = prompt_data["cwd"]
@@ -700,13 +700,14 @@ class BatchRunner:
            lock (Lock): Optional lock for thread-safe access
        """
        checkpoint_data["last_updated"] = datetime.now().isoformat()
-
-        from utils import atomic_json_write
+        
        if lock:
            with lock:
-                atomic_json_write(self.checkpoint_file, checkpoint_data)
+                with open(self.checkpoint_file, 'w', encoding='utf-8') as f:
+                    json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)
        else:
-            atomic_json_write(self.checkpoint_file, checkpoint_data)
+            with open(self.checkpoint_file, 'w', encoding='utf-8') as f:
+                json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)
    
    def _scan_completed_prompts_by_content(self) -> set:
        """
@@ -831,15 +832,13 @@ class BatchRunner:
            print(f"   New batches created:       {len(batches_to_process)}")
            print("=" * 70 + "\n")
        
-        # Load existing checkpoint (so resume doesn't clobber prior progress)
-        checkpoint_data = self._load_checkpoint()
-        if checkpoint_data.get("run_name") != self.run_name:
-            checkpoint_data = {
-                "run_name": self.run_name,
-                "completed_prompts": [],
-                "batch_stats": {},
-                "last_updated": None
-            }
+        # Initialize checkpoint data (needed for saving at the end)
+        checkpoint_data = {
+            "run_name": self.run_name,
+            "completed_prompts": [],
+            "batch_stats": {},
+            "last_updated": None
+        }
        
        # Prepare configuration for workers
        config = {
@@ -861,7 +860,7 @@ class BatchRunner:
        }
        
        # For backward compatibility, still track by index (but this is secondary to content matching)
-        completed_prompts_set = set(checkpoint_data.get("completed_prompts", []))
+        completed_prompts_set = set()
        
        # Aggregate statistics across all batches
        total_tool_stats = {}
@@ -870,9 +869,6 @@ class BatchRunner:
        
        print(f"\n🔧 Initializing {self.num_workers} worker processes...")
        
-        # Checkpoint writes happen in the parent process; keep a lock for safety.
-        checkpoint_lock = Lock()
-
        # Process batches in parallel
        with Pool(processes=self.num_workers) as pool:
            # Create tasks for each batch
@@ -918,25 +914,6 @@ class BatchRunner:
                    for result in pool.imap_unordered(_process_batch_worker, tasks):
                        results.append(result)
                        progress.update(task, advance=1)
-
-                        # Incremental checkpoint update (so resume works after crash)
-                        try:
-                            batch_num = result.get('batch_num')
-                            completed = result.get('completed_prompts', []) or []
-                            completed_prompts_set.update(completed)
-
-                            if isinstance(batch_num, int):
-                                checkpoint_data.setdefault('batch_stats', {})[str(batch_num)] = {
-                                    'processed': result.get('processed', 0),
-                                    'skipped': result.get('skipped', 0),
-                                    'discarded_no_reasoning': result.get('discarded_no_reasoning', 0),
-                                }
-
-                            checkpoint_data['completed_prompts'] = sorted(completed_prompts_set)
-                            self._save_checkpoint(checkpoint_data, lock=checkpoint_lock)
-                        except Exception as ckpt_err:
-                            # Don't fail the run if checkpoint write fails
-                            print(f"⚠️  Warning: Failed to save incremental checkpoint: {ckpt_err}")
                except Exception as e:
                    logger.error("Batch worker failed: %s", e, exc_info=True)
                    raise
@@ -968,12 +945,9 @@ class BatchRunner:
            for key in total_reasoning_stats:
                total_reasoning_stats[key] += batch_result.get("reasoning_stats", {}).get(key, 0)
        
-        # Save final checkpoint (best-effort; incremental writes already happened)
-        try:
-            checkpoint_data["completed_prompts"] = all_completed_prompts
-            self._save_checkpoint(checkpoint_data, lock=checkpoint_lock)
-        except Exception as ckpt_err:
-            print(f"⚠️  Warning: Failed to save final checkpoint: {ckpt_err}")
+        # Save final checkpoint
+        checkpoint_data["completed_prompts"] = all_completed_prompts
+        self._save_checkpoint(checkpoint_data)
        
        # Calculate success rates
        for tool_name in total_tool_stats:
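
The checkpoint hunks above replace the `atomic_json_write` call with a direct `open(..., 'w')` plus `json.dump`, so a crash in the middle of a write can leave a truncated checkpoint file. The implementation of `utils.atomic_json_write` is not shown in this diff. The sketch below is only an assumption of what such a helper typically does (temp file in the same directory, then `os.replace`), and the name `atomic_json_write_sketch` is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def atomic_json_write_sketch(path: Path, data: Any) -> None:
    """Hypothetical helper: write JSON to a temp file, then rename over the target."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic swap: readers see old or new, never partial
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A reader of the checkpoint then sees either the previous complete file or the new one, which matters most when checkpoints are written incrementally during a long run.
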
--- a/cli-config.yaml.example
+++ b/cli-config.yaml.example
@@ -13,10 +13,6 @@ model:
  #   "auto"       - Use Nous Portal if logged in, otherwise OpenRouter/env vars (default)
  #   "openrouter" - Always use OpenRouter API key from OPENROUTER_API_KEY
  #   "nous"       - Always use Nous Portal (requires: hermes login)
-  #   "zai"        - Use z.ai / ZhipuAI GLM models (requires: GLM_API_KEY)
-  #   "kimi-coding"- Use Kimi / Moonshot AI models (requires: KIMI_API_KEY)
-  #   "minimax"    - Use MiniMax global endpoint (requires: MINIMAX_API_KEY)
-  #   "minimax-cn" - Use MiniMax China endpoint (requires: MINIMAX_CN_API_KEY)
  # Can also be overridden with --provider flag or HERMES_INFERENCE_PROVIDER env var.
  provider: "auto"
  
@@ -120,23 +116,8 @@ terminal:
 #   timeout: 180
 #   lifetime_seconds: 300
 #   modal_image: "nikolaik/python-nodejs:python3.11-nodejs20"
-
-# -----------------------------------------------------------------------------
-# OPTION 6: Daytona cloud execution
-# Commands run in Daytona cloud sandboxes
-# Great for: Cloud dev environments, persistent workspaces, team collaboration
-# Requires: pip install daytona, DAYTONA_API_KEY env var
-# -----------------------------------------------------------------------------
-# terminal:
-#   backend: "daytona"
-#   cwd: "~"
-#   timeout: 180
-#   lifetime_seconds: 300
-#   daytona_image: "nikolaik/python-nodejs:python3.11-nodejs20"
-#   container_disk: 10240          # Daytona max is 10GB per sandbox
-
 #
-# --- Container resource limits (docker, singularity, modal, daytona -- ignored for local/ssh) ---
+# --- Container resource limits (docker, singularity, modal -- ignored for local/ssh) ---
 # These settings apply to all container backends. They control the resources
 # allocated to the sandbox and whether its filesystem persists across sessions.
  container_cpu: 1              # CPU cores
@@ -199,58 +180,8 @@ compression:
  threshold: 0.85
  
  # Model to use for generating summaries (fast/cheap recommended)
-  # This model compresses the middle turns into a concise summary.
-  # IMPORTANT: it receives the full middle section of the conversation, so it
-  # MUST support a context length at least as large as your main model's.
+  # This model compresses the middle turns into a concise summary
  summary_model: "google/gemini-3-flash-preview"
-  
-  # Provider for the summary model (default: "auto")
-  # Options: "auto", "openrouter", "nous", "main"
-  # summary_provider: "auto"
-
-# =============================================================================
-# Auxiliary Models (Advanced — Experimental)
-# =============================================================================
-# Hermes uses lightweight "auxiliary" models for side tasks: image analysis,
-# browser screenshot analysis, web page summarization, and context compression.
-#
-# By default these use Gemini Flash via OpenRouter or Nous Portal and are
-# auto-detected from your credentials.  You do NOT need to change anything
-# here for normal usage.
-#
-# WARNING: Overriding these with providers other than OpenRouter or Nous Portal
-# is EXPERIMENTAL and may not work.  Not all models/providers support vision,
-# produce usable summaries, or accept the same API format.  Change at your own
-# risk — if things break, reset to "auto" / empty values.
-#
-# Each task has its own provider + model pair so you can mix providers.
-# For example: OpenRouter for vision (needs multimodal), but your main
-# local endpoint for compression (just needs text).
-#
-# Provider options:
-#   "auto"       - Best available: OpenRouter → Nous Portal → main endpoint (default)
-#   "openrouter" - Force OpenRouter (requires OPENROUTER_API_KEY)
-#   "nous"       - Force Nous Portal (requires: hermes login)
-#   "main"       - Use the same provider & credentials as your main chat model.
-#                  Skips OpenRouter/Nous and uses your custom endpoint
-#                  (OPENAI_BASE_URL), Codex OAuth, or API-key provider directly.
-#                  Useful if you run a local model and want auxiliary tasks to
-#                  use it too.
-#
-# Model: leave empty to use the provider's default.  When empty, OpenRouter
-# uses "google/gemini-3-flash-preview" and Nous uses "gemini-3-flash".
-# Other providers pick a sensible default automatically.
-#
-# auxiliary:
-#   # Image analysis: vision_analyze tool + browser screenshots
-#   vision:
-#     provider: "auto"
-#     model: ""              # e.g. "google/gemini-2.5-flash", "openai/gpt-4o"
-#
-#   # Web page scraping / summarization + browser page text extraction
-#   web_extract:
-#     provider: "auto"
-#     model: ""

 # =============================================================================
 # Persistent Memory
--- a/cli.py
+++ b/cli.py
@@ -14,7 +14,6 @@ Usage:

 import logging
 import os
-import shutil
 import sys
 import json
 import atexit
@@ -158,7 +157,6 @@ def load_cli_config() -> Dict[str, Any]:
            "docker_image": "python:3.11",
            "singularity_image": "docker://python:3.11",
            "modal_image": "python:3.11",
-            "daytona_image": "nikolaik/python-nodejs:python3.11-nodejs20",
        },
        "browser": {
            "inactivity_timeout": 120,  # Auto-cleanup inactive browser sessions after 2 min
@@ -169,7 +167,7 @@ def load_cli_config() -> Dict[str, Any]:
            "summary_model": "google/gemini-3-flash-preview",  # Fast/cheap model for summaries
        },
        "agent": {
-            "max_turns": 90,  # Default max tool-calling iterations (shared with subagents)
+            "max_turns": 60,  # Default max tool-calling iterations
            "verbose": False,
            "system_prompt": "",
            "prefill_messages_file": "",
@@ -285,13 +283,12 @@ def load_cli_config() -> Dict[str, Any]:
        "docker_image": "TERMINAL_DOCKER_IMAGE",
        "singularity_image": "TERMINAL_SINGULARITY_IMAGE",
        "modal_image": "TERMINAL_MODAL_IMAGE",
-        "daytona_image": "TERMINAL_DAYTONA_IMAGE",
        # SSH config
        "ssh_host": "TERMINAL_SSH_HOST",
        "ssh_user": "TERMINAL_SSH_USER",
        "ssh_port": "TERMINAL_SSH_PORT",
        "ssh_key": "TERMINAL_SSH_KEY",
-        # Container resource config (docker, singularity, modal, daytona -- ignored for local/ssh)
+        # Container resource config (docker, singularity, modal -- ignored for local/ssh)
        "container_cpu": "TERMINAL_CONTAINER_CPU",
        "container_memory": "TERMINAL_CONTAINER_MEMORY",
        "container_disk": "TERMINAL_CONTAINER_DISK",
@@ -332,36 +329,12 @@ def load_cli_config() -> Dict[str, Any]:
        "enabled": "CONTEXT_COMPRESSION_ENABLED",
        "threshold": "CONTEXT_COMPRESSION_THRESHOLD",
        "summary_model": "CONTEXT_COMPRESSION_MODEL",
-        "summary_provider": "CONTEXT_COMPRESSION_PROVIDER",
    }
    
    for config_key, env_var in compression_env_mappings.items():
        if config_key in compression_config:
            os.environ[env_var] = str(compression_config[config_key])
    
-    # Apply auxiliary model overrides to environment variables.
-    # Vision and web_extract each have their own provider + model pair.
-    # (Compression is handled in the compression section above.)
-    # Only set env vars for non-empty / non-default values so auto-detection
-    # still works.
-    auxiliary_config = defaults.get("auxiliary", {})
-    auxiliary_task_env = {
-        # config key → (provider env var, model env var)
-        "vision":      ("AUXILIARY_VISION_PROVIDER",      "AUXILIARY_VISION_MODEL"),
-        "web_extract": ("AUXILIARY_WEB_EXTRACT_PROVIDER",  "AUXILIARY_WEB_EXTRACT_MODEL"),
-    }
-    
-    for task_key, (prov_env, model_env) in auxiliary_task_env.items():
-        task_cfg = auxiliary_config.get(task_key, {})
-        if not isinstance(task_cfg, dict):
-            continue
-        prov = str(task_cfg.get("provider", "")).strip()
-        model = str(task_cfg.get("model", "")).strip()
-        if prov and prov != "auto":
-            os.environ[prov_env] = prov
-        if model:
-            os.environ[model_env] = model
-    
    return defaults

 # Load configuration at module startup
@@ -534,18 +507,7 @@ def _get_available_skills() -> Dict[str, List[str]]:
    return skills_by_category


-def _format_context_length(tokens: int) -> str:
-    """Format a token count for display (e.g. 128000 → '128K', 1048576 → '1M')."""
-    if tokens >= 1_000_000:
-        val = tokens / 1_000_000
-        return f"{val:g}M"
-    elif tokens >= 1_000:
-        val = tokens / 1_000
-        return f"{val:g}K"
-    return str(tokens)
-
-
-def build_welcome_banner(console: Console, model: str, cwd: str, tools: List[dict] = None, enabled_toolsets: List[str] = None, session_id: str = None, context_length: int = None):
+def build_welcome_banner(console: Console, model: str, cwd: str, tools: List[dict] = None, enabled_toolsets: List[str] = None, session_id: str = None):
    """
    Build and print a Claude Code-style welcome banner with caduceus on left and info on right.
    
@@ -556,7 +518,6 @@ def build_welcome_banner(console: Console, model: str, cwd: str, tools: List[dic
        tools: List of tool definitions
        enabled_toolsets: List of enabled toolset names
        session_id: Unique session identifier for logging
-        context_length: Model's context window size in tokens
    """
    from model_tools import check_tool_availability, TOOLSET_REQUIREMENTS
    
@@ -582,8 +543,7 @@ def build_welcome_banner(console: Console, model: str, cwd: str, tools: List[dic
    if len(model_short) > 28:
        model_short = model_short[:25] + "..."
    
-    ctx_str = f" [dim #B8860B]·[/] [dim #B8860B]{_format_context_length(context_length)} context[/]" if context_length else ""
-    left_lines.append(f"[#FFBF00]{model_short}[/]{ctx_str} [dim #B8860B]·[/] [dim #B8860B]Nous Research[/]")
+    left_lines.append(f"[#FFBF00]{model_short}[/] [dim #B8860B]·[/] [dim #B8860B]Nous Research[/]")
    left_lines.append(f"[dim #B8860B]{cwd}[/]")
    
    # Add session ID if provided
@@ -730,7 +690,6 @@ COMMANDS = {
    "/cron": "Manage scheduled tasks (list, add, remove)",
    "/skills": "Search, install, inspect, or manage skills from online registries",
    "/platforms": "Show gateway/messaging platform status",
-    "/paste": "Check clipboard for an image and attach it",
    "/reload-mcp": "Reload MCP servers from config.yaml",
    "/quit": "Exit the CLI (also: /exit, /q)",
 }
@@ -857,10 +816,10 @@ class HermesCLI:
        Args:
            model: Model to use (default: from env or claude-sonnet)
            toolsets: List of toolsets to enable (default: all)
-            provider: Inference provider ("auto", "openrouter", "nous", "openai-codex", "zai", "kimi-coding", "minimax", "minimax-cn")
+            provider: Inference provider ("auto", "openrouter", "nous", "openai-codex")
            api_key: API key (default: from environment)
            base_url: API base URL (default: OpenRouter)
-            max_turns: Maximum tool-calling iterations shared with subagents (default: 90)
+            max_turns: Maximum tool-calling iterations (default: 60)
            verbose: Enable verbose logging
            compact: Use compact display mode
            resume: Session ID to resume (restores conversation history from SQLite)
@@ -894,13 +853,7 @@ class HermesCLI:
            or os.getenv("OPENAI_BASE_URL")
            or os.getenv("OPENROUTER_BASE_URL", CLI_CONFIG["model"]["base_url"])
        )
-        # Match key to resolved base_url: OpenRouter URL → prefer OPENROUTER_API_KEY,
-        # custom endpoint → prefer OPENAI_API_KEY (issue #560).
-        # Note: _ensure_runtime_credentials() re-resolves this before first use.
-        if "openrouter.ai" in self.base_url:
-            self.api_key = api_key or os.getenv("OPENROUTER_API_KEY") or os.getenv("OPENAI_API_KEY")
-        else:
-            self.api_key = api_key or os.getenv("OPENAI_API_KEY") or os.getenv("OPENROUTER_API_KEY")
+        self.api_key = api_key or os.getenv("OPENROUTER_API_KEY") or os.getenv("OPENAI_API_KEY")
        self._nous_key_expires_at: Optional[str] = None
        self._nous_key_source: Optional[str] = None
        # Max turns priority: CLI arg > config file > env var > default
@@ -913,7 +866,7 @@ class HermesCLI:
        elif os.getenv("HERMES_MAX_ITERATIONS"):
            self.max_turns = int(os.getenv("HERMES_MAX_ITERATIONS"))
        else:
-            self.max_turns = 90
+            self.max_turns = 60
        
        # Parse and validate toolsets
        self.enabled_toolsets = toolsets
@@ -1125,11 +1078,6 @@ class HermesCLI:
            # Get terminal working directory (where commands will execute)
            cwd = os.getenv("TERMINAL_CWD", os.getcwd())
            
-            # Get context length for display
-            ctx_len = None
-            if hasattr(self, 'agent') and self.agent and hasattr(self.agent, 'context_compressor'):
-                ctx_len = self.agent.context_compressor.context_length
-            
            # Build and display the banner
            build_welcome_banner(
                console=self.console,
@@ -1138,7 +1086,6 @@ class HermesCLI:
                tools=tools,
                enabled_toolsets=self.enabled_toolsets,
                session_id=self.session_id,
-                context_length=ctx_len,
            )
        
        # Show tool availability warnings if any tools are disabled
@@ -1146,69 +1093,6 @@ class HermesCLI:
        
        self.console.print()
    
-    def _try_attach_clipboard_image(self) -> bool:
-        """Check clipboard for an image and attach it if found.
-
-        Saves the image to ~/.hermes/images/ and appends the path to
-        ``_attached_images``.  Returns True if an image was attached.
-        """
-        from hermes_cli.clipboard import save_clipboard_image
-
-        img_dir = Path.home() / ".hermes" / "images"
-        self._image_counter += 1
-        ts = datetime.now().strftime("%Y%m%d_%H%M%S")
-        img_path = img_dir / f"clip_{ts}_{self._image_counter}.png"
-
-        if save_clipboard_image(img_path):
-            self._attached_images.append(img_path)
-            return True
-        self._image_counter -= 1
-        return False
-
-    def _handle_paste_command(self):
-        """Handle /paste — explicitly check clipboard for an image.
-
-        This is the reliable fallback for terminals where BracketedPaste
-        doesn't fire for image-only clipboard content (e.g., VSCode terminal,
-        Windows Terminal with WSL2).
-        """
-        from hermes_cli.clipboard import has_clipboard_image
-        if has_clipboard_image():
-            if self._try_attach_clipboard_image():
-                n = len(self._attached_images)
-                _cprint(f"  📎 Image #{n} attached from clipboard")
-            else:
-                _cprint(f"  {_DIM}(>_<) Clipboard has an image but extraction failed{_RST}")
-        else:
-            _cprint(f"  {_DIM}(._.) No image found in clipboard{_RST}")
-
-    def _build_multimodal_content(self, text: str, images: list) -> list:
-        """Convert text + image paths into OpenAI vision multimodal content.
-
-        Returns a list of content parts suitable for the ``content`` field
-        of a ``user`` message.
-        """
-        import base64 as _b64
-
-        content_parts = []
-        text_part = text if isinstance(text, str) and text else "What do you see in this image?"
-        content_parts.append({"type": "text", "text": text_part})
-
-        _MIME = {
-            "png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
-            "gif": "image/gif", "webp": "image/webp",
-        }
-        for img_path in images:
-            if img_path.exists():
-                data = _b64.b64encode(img_path.read_bytes()).decode()
-                ext = img_path.suffix.lower().lstrip(".")
-                mime = _MIME.get(ext, "image/png")
-                content_parts.append({
-                    "type": "image_url",
-                    "image_url": {"url": f"data:{mime};base64,{data}"}
-                })
-        return content_parts
-
    def _show_tool_availability_warnings(self):
        """Show warnings about disabled tools due to missing API keys."""
        try:
@@ -1278,8 +1162,7 @@ class HermesCLI:
                _cprint(f"  {_GOLD}{cmd:<22}{_RST} {_DIM}-{_RST} {info['description']}")

        _cprint(f"\n  {_DIM}Tip: Just type your message to chat with Hermes!{_RST}")
-        _cprint(f"  {_DIM}Multi-line: Alt+Enter for a new line{_RST}")
-        _cprint(f"  {_DIM}Paste image: Alt+V (or /paste){_RST}\n")
+        _cprint(f"  {_DIM}Multi-line: Alt+Enter for a new line{_RST}\n")
    
    def show_tools(self):
        """Display available tools with kawaii ASCII art."""
@@ -1888,10 +1771,6 @@ class HermesCLI:
            self._manual_compress()
        elif cmd_lower == "/usage":
            self._show_usage()
-        elif cmd_lower.startswith("/insights"):
-            self._show_insights(cmd_original)
-        elif cmd_lower == "/paste":
-            self._handle_paste_command()
        elif cmd_lower == "/reload-mcp":
            self._reload_mcp()
        else:
@@ -2015,39 +1894,6 @@ class HermesCLI:
            for quiet_logger in ('tools', 'minisweagent', 'run_agent', 'trajectory_compressor', 'cron', 'hermes_cli'):
                logging.getLogger(quiet_logger).setLevel(logging.ERROR)

-    def _show_insights(self, command: str = "/insights"):
-        """Show usage insights and analytics from session history."""
-        # Parse optional --days flag
-        parts = command.split()
-        days = 30
-        source = None
-        i = 1
-        while i < len(parts):
-            if parts[i] == "--days" and i + 1 < len(parts):
-                try:
-                    days = int(parts[i + 1])
-                except ValueError:
-                    print(f"  Invalid --days value: {parts[i + 1]}")
-                    return
-                i += 2
-            elif parts[i] == "--source" and i + 1 < len(parts):
-                source = parts[i + 1]
-                i += 2
-            else:
-                i += 1
-
-        try:
-            from hermes_state import SessionDB
-            from agent.insights import InsightsEngine
-
-            db = SessionDB()
-            engine = InsightsEngine(db)
-            report = engine.generate(days=days, source=source)
-            print(engine.format_terminal(report))
-            db.close()
-        except Exception as e:
-            print(f"  Error generating insights: {e}")
-
    def _reload_mcp(self):
        """Reload MCP servers: disconnect all, re-read config.yaml, reconnect.

@@ -2269,21 +2115,20 @@ class HermesCLI:
        self._approval_state = None
        self._approval_deadline = 0
        self._invalidate()
-    def chat(self, message, images: list = None) -> Optional[str]:
+        _cprint(f"\n{_DIM}  ⏱ Timeout — denying command{_RST}")
+        return "deny"
+
+    def chat(self, message: str) -> Optional[str]:
        """
        Send a message to the agent and get a response.
        
-        Handles streaming output, interrupt detection (user typing while agent
-        is working), and re-queueing of interrupted messages.
-        
        Uses a dedicated _interrupt_queue (separate from _pending_input) to avoid
        race conditions between the process_loop and interrupt monitoring. Messages
        typed while the agent is running go to _interrupt_queue; messages typed while
        idle go to _pending_input.
        
        Args:
-            message: The user's message (str or multimodal content list)
-            images: Optional list of Path objects for attached images
+            message: The user's message
            
        Returns:
            The agent's response, or None on error
@@ -2296,19 +2141,10 @@ class HermesCLI:
        if not self._init_agent():
            return None
        
-        # Convert attached images to OpenAI vision multimodal content
-        if images:
-            message = self._build_multimodal_content(
-                message if isinstance(message, str) else "", images
-            )
-            for img_path in images:
-                if img_path.exists():
-                    _cprint(f"  {_DIM}📎 attached {img_path.name} ({img_path.stat().st_size // 1024}KB){_RST}")
-
        # Add user message to history
        self.conversation_history.append({"role": "user", "content": message})
        
-        w = shutil.get_terminal_size().columns
+        w = self.console.width
        _cprint(f"{_GOLD}{'─' * w}{_RST}")
        print(flush=True)
        
@@ -2384,7 +2220,7 @@ class HermesCLI:
                    response = response + "\n\n---\n_[Interrupted - processing new message]_"
            
            if response:
-                w = shutil.get_terminal_size().columns
+                w = self.console.width
                label = " ⚕ Hermes "
                fill = w - 2 - len(label)  # 2 for ╭ and ╮
                top = f"{_GOLD}╭─{label}{'─' * max(fill - 1, 0)}╮{_RST}"
@@ -2469,10 +2305,6 @@ class HermesCLI:
        self._approval_state = None     # dict with command, description, choices, selected, response_queue
        self._approval_deadline = 0

-        # Clipboard image attachments (paste images into the CLI)
-        self._attached_images: list[Path] = []
-        self._image_counter = 0
-
        # Register callbacks so terminal_tool prompts route through our UI
        set_sudo_password_callback(self._sudo_password_callback)
        set_approval_callback(self._approval_callback)
@@ -2542,18 +2374,11 @@ class HermesCLI:

            # --- Normal input routing ---
            text = event.app.current_buffer.text.strip()
-            has_images = bool(self._attached_images)
-            if text or has_images:
-                # Snapshot and clear attached images
-                images = list(self._attached_images)
-                self._attached_images.clear()
-                event.app.invalidate()
-                # Bundle text + images as a tuple when images are present
-                payload = (text, images) if images else text
-                if self._agent_running and not (text and text.startswith("/")):
-                    self._interrupt_queue.put(payload)
+            if text:
+                if self._agent_running and not text.startswith("/"):
+                    self._interrupt_queue.put(text)
                else:
-                    self._pending_input.put(payload)
+                    self._pending_input.put(text)
                event.app.current_buffer.reset(append_to_history=True)
        
        @kb.add('escape', 'enter')
@@ -2666,12 +2491,10 @@ class HermesCLI:
                print("\n⚡ Interrupting agent... (press Ctrl+C again to force exit)")
                self.agent.interrupt()
            else:
-                # If there's text or images, clear them (like bash).
-                # If everything is already empty, exit.
-                if event.app.current_buffer.text or self._attached_images:
+                # If there's text in the input buffer, clear it (like bash).
+                # If the buffer is already empty, exit.
+                if event.app.current_buffer.text:
                    event.app.current_buffer.reset()
-                    self._attached_images.clear()
-                    event.app.invalidate()
                else:
                    self._should_exit = True
                    event.app.exit()
@@ -2681,53 +2504,7 @@ class HermesCLI:
            """Handle Ctrl+D - exit."""
            self._should_exit = True
            event.app.exit()
-
-        from prompt_toolkit.keys import Keys
-
-        @kb.add(Keys.BracketedPaste, eager=True)
-        def handle_paste(event):
-            """Handle terminal paste — detect clipboard images.
-
-            When the terminal supports bracketed paste, Ctrl+V / Cmd+V
-            triggers this with the pasted text.  We also check the
-            clipboard for an image on every paste event.
-            """
-            pasted_text = event.data or ""
-            if self._try_attach_clipboard_image():
-                event.app.invalidate()
-            if pasted_text:
-                event.current_buffer.insert_text(pasted_text)
-
-        @kb.add('c-v')
-        def handle_ctrl_v(event):
-            """Fallback image paste for terminals without bracketed paste.
-
-            On Linux terminals (GNOME Terminal, Konsole, etc.), Ctrl+V
-            sends raw byte 0x16 instead of triggering a paste.  This
-            binding catches that and checks the clipboard for images.
-            On terminals that DO intercept Ctrl+V for paste (macOS
-            Terminal, iTerm2, VSCode, Windows Terminal), the bracketed
-            paste handler fires instead and this binding never triggers.
-            """
-            if self._try_attach_clipboard_image():
-                event.app.invalidate()
-
-        @kb.add('escape', 'v')
-        def handle_alt_v(event):
-            """Alt+V — paste image from clipboard.
-
-            Alt key combos pass through all terminal emulators (sent as
-            ESC + key), unlike Ctrl+V which terminals intercept for text
-            paste.  This is the reliable way to attach clipboard images
-            on WSL2, VSCode, and any terminal over SSH where Ctrl+V
-            can't reach the application for image-only clipboard.
-            """
-            if self._try_attach_clipboard_image():
-                event.app.invalidate()
-            else:
-                # No image found — show a hint
-                pass  # silent when no image (avoid noise on accidental press)
-
+        
        # Dynamic prompt: shows Hermes symbol when agent is working,
        # or answer prompt when clarify freetext mode is active.
        cli_ref = self
@@ -2763,7 +2540,7 @@ class HermesCLI:
        def _input_height():
            try:
                doc = input_area.buffer.document
-                available_width = shutil.get_terminal_size().columns - 4  # subtract prompt width
+                available_width = (cli_ref.console.width or 80) - 4  # subtract prompt width
                if available_width < 10:
                    available_width = 40
                visual_lines = 0
@@ -3024,35 +2801,13 @@ class HermesCLI:

        # Horizontal rules above and below the input (bronze, 1 line each).
        # The bottom rule moves down as the TextArea grows with newlines.
-        # Using char='─' instead of hardcoded repetition so the rule
-        # always spans the full terminal width on any screen size.
        input_rule_top = Window(
-            char='─',
+            content=FormattedTextControl([('class:input-rule', '─' * 200)]),
            height=1,
-            style='class:input-rule',
        )
        input_rule_bot = Window(
-            char='─',
+            content=FormattedTextControl([('class:input-rule', '─' * 200)]),
            height=1,
-            style='class:input-rule',
-        )
-
-        # Image attachment indicator — shows badges like [📎 Image #1] above input
-        cli_ref = self
-
-        def _get_image_bar():
-            if not cli_ref._attached_images:
-                return []
-            base = cli_ref._image_counter - len(cli_ref._attached_images) + 1
-            badges = " ".join(
-                f"[📎 Image #{base + i}]"
-                for i in range(len(cli_ref._attached_images))
-            )
-            return [("class:image-badge", f" {badges} ")]
-
-        image_bar = Window(
-            content=FormattedTextControl(_get_image_bar),
-            height=Condition(lambda: bool(cli_ref._attached_images)),
        )

        # Layout: interactive prompt widgets + ruled input at bottom.
@@ -3066,7 +2821,6 @@ class HermesCLI:
                clarify_widget,
                spacer,
                input_rule_top,
-                image_bar,
                input_area,
                input_rule_bot,
                CompletionsMenu(max_height=12, scroll_offset=1),
@@ -3082,8 +2836,6 @@ class HermesCLI:
            'hint': '#555555 italic',
            # Bronze horizontal rules around the input area
            'input-rule': '#CD7F32',
-            # Clipboard image attachment badges
-            'image-badge': '#87CEEB bold',
            'completion-menu': 'bg:#1a1a2e #FFF8DC',
            'completion-menu.completion': 'bg:#1a1a2e #FFF8DC',
            'completion-menu.completion.current': 'bg:#333355 #FFD700',
@@ -3133,14 +2885,9 @@ class HermesCLI:
                    
                    if not user_input:
                        continue
-
-                    # Unpack image payload: (text, [Path, ...]) or plain str
-                    submit_images = []
-                    if isinstance(user_input, tuple):
-                        user_input, submit_images = user_input
                    
                    # Check for commands
-                    if isinstance(user_input, str) and user_input.startswith("/"):
+                    if user_input.startswith("/"):
                        print(f"\n⚙️  {user_input}")
                        if not self.process_command(user_input):
                            self._should_exit = True
@@ -3151,7 +2898,7 @@ class HermesCLI:
                    
                    # Expand paste references back to full content
                    import re as _re
-                    paste_match = _re.match(r'\[Pasted text #\d+: \d+ lines → (.+)\]', user_input) if isinstance(user_input, str) else None
+                    paste_match = _re.match(r'\[Pasted text #\d+: \d+ lines → (.+)\]', user_input)
                    if paste_match:
                        paste_path = Path(paste_match.group(1))
                        if paste_path.exists():
@@ -3173,17 +2920,12 @@ class HermesCLI:
                            print()
                            _cprint(f"{_GOLD}●{_RST} {_BOLD}{user_input}{_RST}")
                    
-                    # Show image attachment count
-                    if submit_images:
-                        n = len(submit_images)
-                        _cprint(f"  {_DIM}📎 {n} image{'s' if n > 1 else ''} attached{_RST}")
-
                    # Regular chat - run agent
                    self._agent_running = True
                    app.invalidate()  # Refresh status line
                    
                    try:
-                        self.chat(user_input, images=submit_images or None)
+                        self.chat(user_input)
                    finally:
                        self._agent_running = False
                        app.invalidate()  # Refresh status line
@@ -3253,7 +2995,7 @@ def main(
        q: Shorthand for --query
        toolsets: Comma-separated list of toolsets to enable (e.g., "web,terminal")
        model: Model to use (default: anthropic/claude-opus-4-20250514)
-        provider: Inference provider ("auto", "openrouter", "nous", "openai-codex", "zai", "kimi-coding", "minimax", "minimax-cn")
+        provider: Inference provider ("auto", "openrouter", "nous")
        api_key: API key for authentication
        base_url: Base URL for the API
        max_turns: Maximum tool-calling iterations (default: 60)
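Review note (cli.py, API key selection): the hunk at `@@ -894,13 +853,7 @@` drops the base-URL-aware key lookup and returns to a fixed order in which `OPENROUTER_API_KEY` is tried first. The sketch below contrasts the two orderings. It is a simplified, standalone illustration and not the project's code: the function names and the `env` mapping parameter are hypothetical, and the removed comment says `_ensure_runtime_credentials()` re-resolves credentials before first use, so the practical effect may be smaller than this suggests.

```python
from typing import Mapping, Optional

OPENROUTER_HOST = "openrouter.ai"


def resolve_key_url_aware(
    base_url: str, env: Mapping[str, str], explicit: Optional[str] = None
) -> Optional[str]:
    """Removed behaviour: prefer the key that matches the resolved base URL."""
    if explicit:
        return explicit
    if OPENROUTER_HOST in base_url:
        order = ("OPENROUTER_API_KEY", "OPENAI_API_KEY")
    else:
        order = ("OPENAI_API_KEY", "OPENROUTER_API_KEY")
    for name in order:
        if env.get(name):
            return env[name]
    return None


def resolve_key_fixed_order(
    env: Mapping[str, str], explicit: Optional[str] = None
) -> Optional[str]:
    """Behaviour after this change: OPENROUTER_API_KEY is always tried first."""
    return explicit or env.get("OPENROUTER_API_KEY") or env.get("OPENAI_API_KEY")


if __name__ == "__main__":
    env = {"OPENROUTER_API_KEY": "or-key", "OPENAI_API_KEY": "oa-key"}
    local = "http://localhost:8000/v1"
    # With both keys set and a custom endpoint, the two orderings disagree.
    print(resolve_key_url_aware(local, env))
    print(resolve_key_fixed_order(env))
```

With both variables set and a custom `OPENAI_BASE_URL`, the fixed order would hand the OpenRouter key to the custom endpoint, which is the case the removed comment attributes to issue #560.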
--- a/cron/jobs.py
+++ b/cron/jobs.py
@@ -14,8 +14,6 @@ from datetime import datetime, timedelta
 from pathlib import Path
 from typing import Optional, Dict, List, Any

-from hermes_time import now as _hermes_now
-
 try:
    from croniter import croniter
    HAS_CRONITER = True
@@ -130,7 +128,7 @@ def parse_schedule(schedule: str) -> Dict[str, Any]:
    # Duration like "30m", "2h", "1d" → one-shot from now
    try:
        minutes = parse_duration(schedule)
-        run_at = _hermes_now() + timedelta(minutes=minutes)
+        run_at = datetime.now() + timedelta(minutes=minutes)
        return {
            "kind": "once",
            "run_at": run_at.isoformat(),
@@ -148,50 +146,37 @@ def parse_schedule(schedule: str) -> Dict[str, Any]:
    )


-def _ensure_aware(dt: datetime) -> datetime:
-    """Make a naive datetime tz-aware using the configured timezone.
-
-    Handles backward compatibility: timestamps stored before timezone support
-    are naive (server-local).  We assume they were in the same timezone as
-    the current configuration so comparisons work without crashing.
-    """
-    if dt.tzinfo is None:
-        tz = _hermes_now().tzinfo
-        return dt.replace(tzinfo=tz)
-    return dt
-
-
 def compute_next_run(schedule: Dict[str, Any], last_run_at: Optional[str] = None) -> Optional[str]:
    """
    Compute the next run time for a schedule.
-
+    
    Returns ISO timestamp string, or None if no more runs.
    """
-    now = _hermes_now()
-
+    now = datetime.now()
+    
    if schedule["kind"] == "once":
-        run_at = _ensure_aware(datetime.fromisoformat(schedule["run_at"]))
+        run_at = datetime.fromisoformat(schedule["run_at"])
        # If in the future, return it; if in the past, no more runs
        return schedule["run_at"] if run_at > now else None
-
+    
    elif schedule["kind"] == "interval":
        minutes = schedule["minutes"]
        if last_run_at:
            # Next run is last_run + interval
-            last = _ensure_aware(datetime.fromisoformat(last_run_at))
+            last = datetime.fromisoformat(last_run_at)
            next_run = last + timedelta(minutes=minutes)
        else:
            # First run is now + interval
            next_run = now + timedelta(minutes=minutes)
        return next_run.isoformat()
-
+    
    elif schedule["kind"] == "cron":
        if not HAS_CRONITER:
            return None
        cron = croniter(schedule["expr"], now)
        next_run = cron.get_next(datetime)
        return next_run.isoformat()
-
+    
    return None


@@ -219,7 +204,7 @@ def save_jobs(jobs: List[Dict[str, Any]]):
    fd, tmp_path = tempfile.mkstemp(dir=str(JOBS_FILE.parent), suffix='.tmp', prefix='.jobs_')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
-            json.dump({"jobs": jobs, "updated_at": _hermes_now().isoformat()}, f, indent=2)
+            json.dump({"jobs": jobs, "updated_at": datetime.now().isoformat()}, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, JOBS_FILE)
@@ -264,7 +249,7 @@ def create_job(
        deliver = "origin" if origin else "local"
    
    job_id = uuid.uuid4().hex[:12]
-    now = _hermes_now().isoformat()
+    now = datetime.now().isoformat()
    
    job = {
        "id": job_id,
@@ -343,7 +328,7 @@ def mark_job_run(job_id: str, success: bool, error: Optional[str] = None):
    jobs = load_jobs()
    for i, job in enumerate(jobs):
        if job["id"] == job_id:
-            now = _hermes_now().isoformat()
+            now = datetime.now().isoformat()
            job["last_run_at"] = now
            job["last_status"] = "ok" if success else "error"
            job["last_error"] = error if not success else None
@@ -376,7 +361,7 @@ def mark_job_run(job_id: str, success: bool, error: Optional[str] = None):

 def get_due_jobs() -> List[Dict[str, Any]]:
    """Get all jobs that are due to run now."""
-    now = _hermes_now()
+    now = datetime.now()
    jobs = load_jobs()
    due = []
    
@@ -388,7 +373,7 @@ def get_due_jobs() -> List[Dict[str, Any]]:
        if not next_run:
            continue
        
-        next_run_dt = _ensure_aware(datetime.fromisoformat(next_run))
+        next_run_dt = datetime.fromisoformat(next_run)
        if next_run_dt <= now:
            due.append(job)
    
@@ -401,7 +386,7 @@ def save_job_output(job_id: str, output: str):
    job_output_dir = OUTPUT_DIR / job_id
    job_output_dir.mkdir(parents=True, exist_ok=True)
    
-    timestamp = _hermes_now().strftime("%Y-%m-%d_%H-%M-%S")
+    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    output_file = job_output_dir / f"{timestamp}.md"
    
    with open(output_file, 'w', encoding='utf-8') as f:
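Review note (cron/jobs.py, naive vs. aware timestamps): this file now compares `datetime.fromisoformat(...)` results directly against a naive `datetime.now()`, and `_ensure_aware` is gone. Python refuses to order a timezone-aware datetime against a naive one, so any job file that still holds offset-bearing timestamps written by the earlier code could raise `TypeError` in `get_due_jobs()` or `compute_next_run()`. Whether such files exist in practice is an assumption here. The sketch below reproduces only the comparison; `is_due` is a hypothetical helper, not a function from the repository.

```python
from datetime import datetime


def is_due(next_run_iso: str, now: datetime) -> bool:
    """Mirror the post-change comparison: parse and compare, no tz handling."""
    return datetime.fromisoformat(next_run_iso) <= now


if __name__ == "__main__":
    naive_now = datetime(2026, 3, 5, 19, 0, 0)

    # Naive stamp vs. naive "now": ordinary comparison.
    print(is_due("2026-03-05T18:30:00", naive_now))

    # Aware stamp (explicit UTC offset) vs. naive "now": Python raises.
    try:
        is_due("2026-03-05T18:30:00+00:00", naive_now)
    except TypeError as exc:
        print("TypeError:", exc)
```

If mixed data is possible, one option is to normalise stored values on load (for example by dropping or applying the offset consistently) before any comparison takes place.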
--- a/cron/scheduler.py
+++ b/cron/scheduler.py
@@ -27,8 +27,6 @@ from datetime import datetime
 from pathlib import Path
 from typing import Optional

-from hermes_time import now as _hermes_now
-
 logger = logging.getLogger(__name__)

 # Add parent directory to path for imports
@@ -209,7 +207,7 @@ def run_job(job: dict) -> tuple[bool, str, str, Optional[str]]:
            provider=runtime.get("provider"),
            api_mode=runtime.get("api_mode"),
            quiet_mode=True,
-            session_id=f"cron_{job_id}_{_hermes_now().strftime('%Y%m%d_%H%M%S')}"
+            session_id=f"cron_{job_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        )
        
        result = agent.run_conversation(prompt)
@@ -221,7 +219,7 @@ def run_job(job: dict) -> tuple[bool, str, str, Optional[str]]:
        output = f"""# Cron Job: {job_name}

 **Job ID:** {job_id}
-**Run Time:** {_hermes_now().strftime('%Y-%m-%d %H:%M:%S')}
+**Run Time:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
 **Schedule:** {job.get('schedule_display', 'N/A')}

 ## Prompt
@@ -243,7 +241,7 @@ def run_job(job: dict) -> tuple[bool, str, str, Optional[str]]:
        output = f"""# Cron Job: {job_name} (FAILED)

 **Job ID:** {job_id}
-**Run Time:** {_hermes_now().strftime('%Y-%m-%d %H:%M:%S')}
+**Run Time:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
 **Schedule:** {job.get('schedule_display', 'N/A')}

 ## Prompt
@@ -282,7 +280,6 @@ def tick(verbose: bool = True) -> int:
    _LOCK_DIR.mkdir(parents=True, exist_ok=True)

    # Cross-platform file locking: fcntl on Unix, msvcrt on Windows
-    lock_fd = None
    try:
        lock_fd = open(_LOCK_FILE, "w")
        if fcntl:
@@ -291,19 +288,17 @@ def tick(verbose: bool = True) -> int:
            msvcrt.locking(lock_fd.fileno(), msvcrt.LK_NBLCK, 1)
    except (OSError, IOError):
        logger.debug("Tick skipped — another instance holds the lock")
-        if lock_fd is not None:
-            lock_fd.close()
        return 0

    try:
        due_jobs = get_due_jobs()

        if verbose and not due_jobs:
-            logger.info("%s - No jobs due", _hermes_now().strftime('%H:%M:%S'))
+            logger.info("%s - No jobs due", datetime.now().strftime('%H:%M:%S'))
            return 0

        if verbose:
-            logger.info("%s - %s job(s) due", _hermes_now().strftime('%H:%M:%S'), len(due_jobs))
+            logger.info("%s - %s job(s) due", datetime.now().strftime('%H:%M:%S'), len(due_jobs))

        executed = 0
        for job in due_jobs:
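Review note on the `tick()` hunks above: with `lock_fd = None` and the `lock_fd.close()` call removed, a handle that was opened successfully but failed to lock appears to stay open until it is garbage-collected. The sketch below is a minimal Unix-only illustration of a non-blocking lock that closes the handle on failure. `try_lock` is a hypothetical name, and the use of `fcntl.flock` is an assumption, since the actual locking call is not visible in this diff.

```python
import fcntl


def try_lock(path):
    """Return an open, exclusively locked file object, or None if already locked.

    Assumption: Unix only, using flock-style advisory locks.
    """
    lock_fd = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Close the handle so a failed attempt does not leak a descriptor.
        lock_fd.close()
        return None
    return lock_fd
```

The caller keeps the returned object alive for the duration of the tick and closes it afterwards, which also releases the lock.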
--- a/docs/send_file_integration_map.md
+++ b/docs/send_file_integration_map.md
@@ -1,344 +0,0 @@
-# send_file Integration Map — Hermes Agent Codebase Deep Dive
-
-## 1. environments/tool_context.py — Base64 File Transfer Implementation
-
-### upload_file() (lines 153-205)
-- Reads local file as raw bytes, base64-encodes to ASCII string
-- Creates parent dirs in sandbox via `self.terminal(f"mkdir -p {parent}")`
-- **Chunk size:** 60,000 chars (~60KB per shell command)
-- **Small files (<=60KB b64):** Single `printf '%s' '{b64}' | base64 -d > {remote_path}`
-- **Large files:** Writes chunks to `/tmp/_hermes_upload.b64` via `printf >> append`, then `base64 -d` to target
-- **Error handling:** Checks local file exists; returns `{exit_code, output}`
-- **Size limits:** No explicit limit, but shell arg limit ~2MB means chunking is necessary for files >~45KB raw
-- **No theoretical max** — but very large files would be slow (many terminal round trips)
-
-### download_file() (lines 234-278)
-- Runs `base64 {remote_path}` inside sandbox, captures stdout
-- Strips output, base64-decodes to raw bytes
-- Writes to host filesystem with parent dir creation
-- **Error handling:** Checks exit code, empty output, decode errors
-- Returns `{success: bool, bytes: int}` or `{success: false, error: str}`
-- **Size limit:** Bounded by terminal output buffer (practical limit ~few MB via base64 terminal output)
-
-### Promotion potential:
-- These methods work via `self.terminal()` — they're environment-agnostic
-- Could be directly lifted into a new tool that operates on the agent's current sandbox
-- For send_file, this `download_file()` pattern is the key: it extracts files from sandbox → host
-
-## 2. tools/environments/base.py — BaseEnvironment Interface
-
-### Current methods:
-- `execute(command, cwd, timeout, stdin_data)` → `{output, returncode}`
-- `cleanup()` — release resources
-- `stop()` — alias for cleanup
-- `_prepare_command()` — sudo transformation
-- `_build_run_kwargs()` — subprocess kwargs
-- `_timeout_result()` — standard timeout dict
-
-### What would need to be added for file transfer:
-- **Nothing required at this level.** File transfer can be implemented via `execute()` (base64 over terminal, like ToolContext does) or via environment-specific methods.
-- Optional: `upload_file(local_path, remote_path)` and `download_file(remote_path, local_path)` methods could be added to BaseEnvironment for optimized per-backend transfers, but the base64-over-terminal approach already works universally.
-
-## 3. tools/environments/docker.py — Docker Container Details
-
-### Container ID tracking:
-- `self._container_id` stored at init from `self._inner.container_id`
-- Inner is `minisweagent.environments.docker.DockerEnvironment`
-- Container ID is a standard Docker container hash
-
-### docker cp feasibility:
-- **YES**, `docker cp` could be used for optimized file transfer:
-  - `docker cp {container_id}:{remote_path} {local_path}` (download)
-  - `docker cp {local_path} {container_id}:{remote_path}` (upload)
-- Much faster than base64-over-terminal for large files
-- Container ID is directly accessible via `env._container_id` or `env._inner.container_id`
-
-### Volumes mounted:
-- **Persistent mode:** Bind mounts at `~/.hermes/sandboxes/docker/{task_id}/workspace` → `/workspace` and `.../home` → `/root`
-- **Ephemeral mode:** tmpfs at `/workspace` (10GB), `/home` (1GB), `/root` (1GB)
-- **User volumes:** From `config.yaml docker_volumes` (arbitrary `-v` mounts)
-- **Security tmpfs:** `/tmp` (512MB), `/var/tmp` (256MB), `/run` (64MB)
-
-### Direct host access for persistent mode:
-- If persistent, files at `/workspace/foo.txt` are just `~/.hermes/sandboxes/docker/{task_id}/workspace/foo.txt` on host — no transfer needed!
-
-## 4. tools/environments/ssh.py — SSH Connection Management
-
-### Connection management:
-- Uses SSH ControlMaster for persistent connection
-- Control socket at `/tmp/hermes-ssh/{user}@{host}:{port}.sock`
-- ControlPersist=300 (5 min keepalive)
-- BatchMode=yes (non-interactive)
-- Stores: `self.host`, `self.user`, `self.port`, `self.key_path`
-
-### SCP/SFTP feasibility:
-- **YES**, SCP can piggyback on the ControlMaster socket:
-  - `scp -o ControlPath={socket} {user}@{host}:{remote} {local}` (download)
-  - `scp -o ControlPath={socket} {local} {user}@{host}:{remote}` (upload)
-- Same SSH key and connection reuse — zero additional auth
-- Would be much faster than base64-over-terminal for large files
-
-## 5. tools/environments/modal.py — Modal Sandbox Filesystem
-
-### Filesystem API exposure:
-- **Not directly.** The inner `SwerexModalEnvironment` wraps Modal's sandbox
-- The sandbox object is accessible at: `env._inner.deployment._sandbox`
-- Modal's Python SDK exposes `sandbox.open()` for file I/O — but only via async API
-- Currently only used for `snapshot_filesystem()` during cleanup
-- **Could use:** `sandbox.open(path, "rb")` to read files or `sandbox.open(path, "wb")` to write
-- **Alternative:** Base64-over-terminal already works via `execute()` — simpler, no SDK dependency
-
-## 6. gateway/platforms/base.py — MEDIA: Tag Flow (Complete)
-
-### extract_media() (lines 587-620):
-- **Pattern:** `MEDIA:\S+` — extracts file paths after MEDIA: prefix
-- **Voice flag:** `[[audio_as_voice]]` global directive sets `is_voice=True` for all media in message
-- Returns `List[Tuple[str, bool]]` (path, is_voice) and cleaned content
-
-### _process_message_background() media routing (lines 752-786):
-- After extracting MEDIA tags, routes by file extension:
-  - `.ogg .opus .mp3 .wav .m4a` → `send_voice()`
-  - `.mp4 .mov .avi .mkv .3gp` → `send_video()`
-  - `.jpg .jpeg .png .webp .gif` → `send_image_file()`
-  - **Everything else** → `send_document()`
-- This routing already supports arbitrary files!
-
-### send_* method inventory (base class):
-- `send(chat_id, content, reply_to, metadata)` — ABSTRACT, text
-- `send_image(chat_id, image_url, caption, reply_to)` — URL-based images
-- `send_animation(chat_id, animation_url, caption, reply_to)` — GIF animations
-- `send_voice(chat_id, audio_path, caption, reply_to)` — voice messages
-- `send_video(chat_id, video_path, caption, reply_to)` — video files
-- `send_document(chat_id, file_path, caption, file_name, reply_to)` — generic files
-- `send_image_file(chat_id, image_path, caption, reply_to)` — local image files
-- `send_typing(chat_id)` — typing indicator
-- `edit_message(chat_id, message_id, content)` — edit sent messages
-
-### What's missing:
-- **Telegram:** No override for `send_document` or `send_image_file` — falls back to text!
-- **Discord:** No override for `send_document` — falls back to text!
-- **WhatsApp:** Has `send_document` and `send_image_file` via bridge — COMPLETE.
-- The base class defaults just send "📎 File: /path" as text — useless for actual file delivery.
-
-## 7. gateway/platforms/telegram.py — Send Method Analysis
-
-### Implemented send methods:
-- `send()` — MarkdownV2 text with fallback to plain
-- `send_voice()` — `.ogg`/`.opus` as `send_voice()`, others as `send_audio()`
-- `send_image()` — URL-based via `send_photo()`
-- `send_animation()` — GIF via `send_animation()`
-- `send_typing()` — "typing" chat action
-- `edit_message()` — edit text messages
-
-### MISSING:
-- **`send_document()` NOT overridden** — Need to add `self._bot.send_document(chat_id, document=open(file_path, 'rb'), ...)`
-- **`send_image_file()` NOT overridden** — Need to add `self._bot.send_photo(chat_id, photo=open(path, 'rb'), ...)`
-- **`send_video()` NOT overridden** — Need to add `self._bot.send_video(...)`
-
-## 8. gateway/platforms/discord.py — Send Method Analysis
-
-### Implemented send methods:
-- `send()` — text messages with chunking
-- `send_voice()` — discord.File attachment
-- `send_image()` — downloads URL, creates discord.File attachment
-- `send_typing()` — channel.typing()
-- `edit_message()` — edit text messages
-
-### MISSING:
-- **`send_document()` NOT overridden** — Need to add discord.File attachment
-- **`send_image_file()` NOT overridden** — Need to add discord.File from local path
-- **`send_video()` NOT overridden** — Need to add discord.File attachment
-
-## 9. gateway/run.py — User File Attachment Handling
-
-### Current attachment flow:
-1. **Telegram photos** (line 509-529): Download via `photo.get_file()` → `cache_image_from_bytes()` → vision auto-analysis
-2. **Telegram voice** (line 532-541): Download → `cache_audio_from_bytes()` → STT transcription
-3. **Telegram audio** (line 542-551): Same pattern
-4. **Telegram documents** (line 553-617): Extension validation against `SUPPORTED_DOCUMENT_TYPES`, 20MB limit, content injection for text files
-5. **Discord attachments** (line 717-751): Content-type detection, image/audio caching, URL fallback for other types
-6. **Gateway run.py** (lines 818-883): Auto-analyzes images with vision, transcribes audio, enriches document messages with context notes
-
-### Key insight: Files are always cached to host filesystem first, then processed. The agent sees local file paths.
-
-## 10. tools/terminal_tool.py — Terminal Tool & Environment Interaction
-
-### How it manages environments:
-- Global dict `_active_environments: Dict[str, Any]` keyed by task_id
-- Per-task creation locks prevent duplicate sandbox creation
-- Auto-cleanup thread kills idle environments after `TERMINAL_LIFETIME_SECONDS`
-- `_get_env_config()` reads all TERMINAL_* env vars for backend selection
-- `_create_environment()` factory creates the right backend type
-
-### Could send_file piggyback?
-- **YES.** send_file needs access to the same environment to extract files from sandboxes.
-- It can reuse `_active_environments[task_id]` to get the environment, then:
-  - Docker: Use `docker cp` via `env._container_id`
-  - SSH: Use `scp` via `env.control_socket`
-  - Local: Just read the file directly
-  - Modal: Use base64-over-terminal via `env.execute()`
-- The file_tools.py module already does this with `ShellFileOperations` — read_file/write_file/search/patch all share the same env instance.
-
-## 11. tools/tts_tool.py — Working Example of File Delivery
-
-### Flow:
-1. Generate audio file to `~/.hermes/audio_cache/tts_TIMESTAMP.{ogg,mp3}`
-2. Return JSON with `media_tag: "MEDIA:/path/to/file"`
-3. For Telegram voice: prepend `[[audio_as_voice]]` directive
-4. The LLM includes the MEDIA tag in its response text
-5. `BasePlatformAdapter._process_message_background()` calls `extract_media()` to find the tag
-6. Routes by extension → `send_voice()` for audio files
-7. Platform adapter sends the file natively
-
-### Key pattern: Tool saves file to host → returns MEDIA: path → LLM echoes it → gateway extracts → platform delivers
-
-## 12. tools/image_generation_tool.py — Working Example of Image Delivery
-
-### Flow:
-1. Call FAL.ai API → get image URL
-2. Return JSON with `image: "https://fal.media/..."` URL
-3. The LLM includes the URL in markdown: `![description](URL)`
-4. `BasePlatformAdapter.extract_images()` finds `![alt](url)` patterns
-5. Routes through `send_image()` (URL) or `send_animation()` (GIF)
-6. Platform downloads and sends natively
-
-### Key difference from TTS: Images are URL-based, not local files. The gateway downloads at send time.
-
----
-
-# INTEGRATION MAP: Where send_file Hooks In
-
-## Architecture Decision: MEDIA: Tag Protocol vs. New Tool
-
-The MEDIA: tag protocol is already the established pattern for file delivery. Two options:
-
-### Option A: Pure MEDIA: Tag (Minimal Change)
-- No new tool needed
-- Agent downloads file from sandbox to host using terminal (base64)
-- Saves to known location (e.g., `~/.hermes/file_cache/`)
-- Includes `MEDIA:/path` in response text
-- Existing routing in `_process_message_background()` handles delivery
-- **Problem:** Agent has to manually do base64 dance + know about MEDIA: convention
-
-### Option B: Dedicated send_file Tool (Recommended)
-- New tool that the agent calls with `(file_path, caption?)`
-- Tool handles the sandbox → host extraction automatically
-- Returns MEDIA: tag that gets routed through existing pipeline
-- Much cleaner agent experience
-
-## Implementation Plan for Option B
-
-### Files to CREATE:
-
-1. **`tools/send_file_tool.py`** — The new tool
-   - Accepts: `file_path` (path in sandbox), `caption` (optional)
-   - Detects environment backend from `_active_environments`
-   - Extracts file from sandbox:
-     - **local:** `shutil.copy()` or direct path
-     - **docker:** `docker cp {container_id}:{path} {local_cache}/` 
-     - **ssh:** `scp -o ControlPath=... {user}@{host}:{path} {local_cache}/`
-     - **modal:** base64-over-terminal via `env.execute("base64 {path}")`
-   - Saves to `~/.hermes/file_cache/{uuid}_{filename}`
-   - Returns: `MEDIA:/cached/path` in response for gateway to pick up
-   - Register with `registry.register(name="send_file", toolset="file", ...)`
-
-### Files to MODIFY:
-
-2. **`gateway/platforms/telegram.py`** — Add missing send methods:
-   ```python
-   async def send_document(self, chat_id, file_path, caption=None, file_name=None, reply_to=None):
-       with open(file_path, "rb") as f:
-           msg = await self._bot.send_document(
-               chat_id=int(chat_id), document=f,
-               caption=caption, filename=file_name or os.path.basename(file_path))
-       return SendResult(success=True, message_id=str(msg.message_id))
-   
-   async def send_image_file(self, chat_id, image_path, caption=None, reply_to=None):
-       with open(image_path, "rb") as f:
-           msg = await self._bot.send_photo(chat_id=int(chat_id), photo=f, caption=caption)
-       return SendResult(success=True, message_id=str(msg.message_id))
-   
-   async def send_video(self, chat_id, video_path, caption=None, reply_to=None):
-       with open(video_path, "rb") as f:
-           msg = await self._bot.send_video(chat_id=int(chat_id), video=f, caption=caption)
-       return SendResult(success=True, message_id=str(msg.message_id))
-   ```
-
-3. **`gateway/platforms/discord.py`** — Add missing send methods:
-   ```python
-   async def send_document(self, chat_id, file_path, caption=None, file_name=None, reply_to=None):
-       channel = self._client.get_channel(int(chat_id)) or await self._client.fetch_channel(int(chat_id))
-       with open(file_path, "rb") as f:
-           file = discord.File(io.BytesIO(f.read()), filename=file_name or os.path.basename(file_path))
-           msg = await channel.send(content=caption, file=file)
-       return SendResult(success=True, message_id=str(msg.id))
-   
-   async def send_image_file(self, chat_id, image_path, caption=None, reply_to=None):
-       # Same pattern as send_document with image filename
-   
-   async def send_video(self, chat_id, video_path, caption=None, reply_to=None):
-       # Same pattern, discord renders video attachments inline
-   ```
-
-4. **`toolsets.py`** — Add `"send_file"` to `_HERMES_CORE_TOOLS` list
-
-5. **`agent/prompt_builder.py`** — Update platform hints to mention send_file tool
-
-### Code that can be REUSED (zero rewrite):
-
-- `BasePlatformAdapter.extract_media()` — Already extracts MEDIA: tags
-- `BasePlatformAdapter._process_message_background()` — Already routes by extension
-- `ToolContext.download_file()` — Base64-over-terminal extraction pattern
-- `tools/terminal_tool.py` _active_environments dict — Environment access
-- `tools/registry.py` — Tool registration infrastructure
-- `gateway/platforms/base.py` send_document/send_image_file/send_video signatures — Already defined
-
-### Code that needs to be WRITTEN from scratch:
-
-1. `tools/send_file_tool.py` (~150 lines):
-   - File extraction from each environment backend type
-   - Local file cache management
-   - Registry registration
-   
-2. Telegram `send_document` + `send_image_file` + `send_video` overrides (~40 lines)
-3. Discord `send_document` + `send_image_file` + `send_video` overrides (~50 lines)
-
-### Total effort: ~240 lines of new code, ~5 lines of config changes
-
-## Key Environment-Specific Extract Strategies
-
-| Backend    | Extract Method                 | Speed    | Complexity |
-|------------|-------------------------------|----------|------------|
-| local      | shutil.copy / direct path     | Instant  | None       |
-| docker     | `docker cp container:path .`  | Fast     | Low        |
-| docker+vol | Direct host path access       | Instant  | None       |
-| ssh        | `scp -o ControlPath=...`      | Fast     | Low        |
-| modal      | base64-over-terminal          | Moderate | Medium     |
-| singularity| Direct path (overlay mount)   | Fast     | Low        |
-
-## Data Flow Summary
-
-```
-Agent calls send_file(file_path="/workspace/output.pdf", caption="Here's the report")
-    │
-    ▼
-send_file_tool.py:
-    1. Get environment from _active_environments[task_id]
-    2. Detect backend type (docker/ssh/modal/local)
-    3. Extract file to ~/.hermes/file_cache/{uuid}_{filename}
-    4. Return: '{"success": true, "media_tag": "MEDIA:/home/user/.hermes/file_cache/abc123_output.pdf"}'
-    │
-    ▼
-LLM includes MEDIA: tag in its response text
-    │
-    ▼
-BasePlatformAdapter._process_message_background():
-    1. extract_media(response) → finds MEDIA:/path
-    2. Checks extension: .pdf → send_document()
-    3. Calls platform-specific send_document(chat_id, file_path, caption)
-    │
-    ▼
-TelegramAdapter.send_document() / DiscordAdapter.send_document():
-    Opens file, sends via platform API as native document attachment
-    User receives downloadable file in chat
-```
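Review note on the deleted integration map: the extension-based routing it describes for `_process_message_background()` can be sketched as below. The `MEDIA:\S+` pattern and the extension groups are taken from the document; `route_media` and the returned method names as plain strings are hypothetical simplifications, not the gateway's actual code.

```python
import os
import re

# Pattern quoted in the integration map: a path following the MEDIA: prefix.
MEDIA_RE = re.compile(r"MEDIA:(\S+)")

# Extension groups as listed in the document.
VOICE_EXTS = {".ogg", ".opus", ".mp3", ".wav", ".m4a"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv", ".3gp"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".gif"}


def route_media(text):
    """Return (path, send_method_name) pairs for every MEDIA: tag in `text`."""
    routes = []
    for path in MEDIA_RE.findall(text):
        ext = os.path.splitext(path)[1].lower()
        if ext in VOICE_EXTS:
            method = "send_voice"
        elif ext in VIDEO_EXTS:
            method = "send_video"
        elif ext in IMAGE_EXTS:
            method = "send_image_file"
        else:
            # Anything unrecognised is delivered as a generic document.
            method = "send_document"
        routes.append((path, method))
    return routes
```

Because `\S+` matches any run of non-whitespace, punctuation placed directly after a path would be captured as part of it; a real implementation may want to strip such characters.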
--- a/environments/README.md
+++ b/environments/README.md
@@ -40,7 +40,7 @@ This directory contains the integration layer between **hermes-agent's** tool-ca
 - `evaluate_log()` for saving eval results to JSON + samples.jsonl

 **HermesAgentBaseEnv** (`hermes_base_env.py`) extends BaseEnv with hermes-agent specifics:
-- Sets `os.environ["TERMINAL_ENV"]` to configure the terminal backend (local, docker, modal, daytona, ssh, singularity)
+- Sets `os.environ["TERMINAL_ENV"]` to configure the terminal backend (local, docker, modal, ssh, singularity)
 - Resolves hermes-agent toolsets via `_resolve_tools_for_group()` (calls `get_tool_definitions()` which queries `tools/registry.py`)
 - Implements `collect_trajectory()` which runs the full agent loop and computes rewards
 - Supports two-phase operation (Phase 1: OpenAI server, Phase 2: VLLM ManagedServer)
@@ -195,12 +195,8 @@ environments/
 │   └── hermes_swe_env.py
 │
 └── benchmarks/                   # Evaluation benchmarks
-    ├── terminalbench_2/          # 89 terminal tasks, Modal sandboxes
-    │   └── terminalbench2_env.py
-    ├── tblite/                   # 100 calibrated tasks (fast TB2 proxy)
-    │   └── tblite_env.py
-    └── yc_bench/                 # Long-horizon strategic benchmark
-        └── yc_bench_env.py
+    └── terminalbench_2/
+        └── terminalbench2_env.py
 ```

 ## Concrete Environments
@@ -328,7 +324,7 @@ For eval benchmarks, follow the pattern in `terminalbench2_env.py`:
 | `distribution` | Probabilistic toolset distribution name | `None` |
 | `max_agent_turns` | Max LLM calls per rollout | `30` |
 | `agent_temperature` | Sampling temperature | `1.0` |
-| `terminal_backend` | `local`, `docker`, `modal`, `daytona`, `ssh`, `singularity` | `local` |
+| `terminal_backend` | `local`, `docker`, `modal`, `ssh`, `singularity` | `local` |
 | `system_prompt` | System message for the agent | `None` |
 | `tool_call_parser` | Parser name for Phase 2 | `hermes` |
 | `eval_handling` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` | `STOP_TRAIN` |
--- a/environments/agent_loop.py
+++ b/environments/agent_loop.py
@@ -23,7 +23,7 @@ from typing import Any, Dict, List, Optional, Set
 from model_tools import handle_function_call

 # Thread pool for running sync tool calls that internally use asyncio.run()
-# (e.g., mini-swe-agent's modal/docker/daytona backends). Running them in a separate
+# (e.g., mini-swe-agent's modal/docker backends). Running them in a separate
 # thread gives them a clean event loop so they don't deadlock inside Atropos's loop.
 # Size must be large enough for concurrent eval tasks (e.g., 89 TB2 tasks all
 # making tool calls). Too small = thread pool starvation, tasks queue for minutes.
@@ -336,7 +336,7 @@ class HermesAgentLoop:
                                tool_elapsed = _time.monotonic() - tool_submit_time
                            else:
                                # Run tool calls in a thread pool so backends that
-                                # use asyncio.run() internally (modal, docker, daytona) get
+                                # use asyncio.run() internally (modal, docker) get
                                # a clean event loop instead of deadlocking.
                                loop = asyncio.get_event_loop()
                                # Capture current tool_name/args for the lambda
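Review note on the comments touched above: the thread-pool pattern they describe can be illustrated with a small, self-contained sketch. A synchronous function that calls `asyncio.run()` internally cannot be invoked directly from a running event loop (plain asyncio raises `RuntimeError` in that case), so it is handed to a worker thread, where it gets its own loop. All names here (`_pool`, `sync_tool`, `call_tool`) are hypothetical and the pool size is arbitrary.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool; the real one is sized for concurrent eval tasks.
_pool = ThreadPoolExecutor(max_workers=4)


async def _backend_work(x):
    """Stand-in for async work done inside a sandbox backend."""
    await asyncio.sleep(0)
    return x * 2


def sync_tool(x):
    """Mimics a sync tool call that uses asyncio.run() internally."""
    return asyncio.run(_backend_work(x))


async def call_tool(x):
    """Run the sync tool in a worker thread so it gets a clean event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, sync_tool, x)


async def main():
    # Several tool calls proceed concurrently, each in its own thread.
    return await asyncio.gather(*(call_tool(i) for i in range(3)))
```

If the pool has fewer workers than concurrent callers, the extra calls simply wait for a free thread, which matches the starvation concern noted in the module comment.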
--- a/environments/benchmarks/yc_bench/README.md
+++ b/environments/benchmarks/yc_bench/README.md
@@ -1,115 +0,0 @@
-# YC-Bench: Long-Horizon Agent Benchmark
-
-[YC-Bench](https://github.com/collinear-ai/yc-bench) by [Collinear AI](https://collinear.ai/) is a deterministic, long-horizon benchmark that tests LLM agents' ability to act as a tech startup CEO. The agent manages a simulated company over 1-3 years, making compounding decisions about resource allocation, cash flow, task management, and prestige specialisation across 4 skill domains.
-
-Unlike TerminalBench2 (which evaluates per-task coding ability with binary pass/fail), YC-Bench measures **long-term strategic coherence** — whether an agent can maintain consistent strategy, manage compounding consequences, and adapt plans over hundreds of turns.
-
-## Setup
-
-```bash
-# Install yc-bench (optional dependency)
-pip install "hermes-agent[yc-bench]"
-
-# Or install from source
-git clone https://github.com/collinear-ai/yc-bench
-cd yc-bench && pip install -e .
-
-# Verify
-yc-bench --help
-```
-
-## Running
-
-```bash
-# From the repo root:
-bash environments/benchmarks/yc_bench/run_eval.sh
-
-# Or directly:
-python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
-    --config environments/benchmarks/yc_bench/default.yaml
-
-# Override model:
-bash environments/benchmarks/yc_bench/run_eval.sh \
-    --openai.model_name anthropic/claude-opus-4-20250514
-
-# Quick single-preset test:
-bash environments/benchmarks/yc_bench/run_eval.sh \
-    --env.presets '["fast_test"]' --env.seeds '[1]'
-```
-
-## How It Works
-
-### Architecture
-
-```
-HermesAgentLoop (our agent)
-  -> terminal tool -> subprocess("yc-bench company status") -> JSON output
-  -> terminal tool -> subprocess("yc-bench task accept --task-id X") -> JSON
-  -> terminal tool -> subprocess("yc-bench sim resume") -> JSON (advance time)
-  -> ... (100-500 turns per run)
-```
-
-The environment initialises the simulation via `yc-bench sim init` (NOT `yc-bench run`, which would start yc-bench's own built-in agent loop). Our `HermesAgentLoop` then drives all interaction through CLI commands.
-
-### Simulation Mechanics
-
-- **4 skill domains**: research, inference, data_environment, training
-- **Prestige system** (1.0-10.0): Gates access to higher-paying tasks
-- **Employee management**: Junior/Mid/Senior with domain-specific skill rates
-- **Throughput splitting**: `effective_rate = base_rate / N` active tasks per employee
-- **Financial pressure**: Monthly payroll, bankruptcy = game over
-- **Deterministic**: SHA256-based RNG — same seed + preset = same world
-
-### Difficulty Presets
-
-| Preset | Employees | Tasks | Focus |
-|-----------|-----------|-------|-------|
-| tutorial  | 3         | 50    | Basic loop mechanics |
-| easy      | 5         | 100   | Throughput awareness |
-| **medium**| 5         | 150   | Prestige climbing + domain specialisation |
-| **hard**  | 7         | 200   | Precise ETA reasoning |
-| nightmare | 8         | 300   | Sustained perfection under payroll pressure |
-| fast_test | (varies)  | (varies) | Quick validation (~50 turns) |
-
-Default eval runs **fast_test + medium + hard** × 3 seeds = 9 runs.
-
-### Scoring
-
-```
-composite = 0.5 × survival + 0.5 × normalised_funds
-```
-
-- **Survival** (binary): Did the company avoid bankruptcy?
-- **Normalised funds** (0.0-1.0): Log-scale relative to initial $250K capital
-
-## Configuration
-
-Key fields in `default.yaml`:
-
-| Field | Default | Description |
-|-------|---------|-------------|
-| `presets` | `["fast_test", "medium", "hard"]` | Which presets to evaluate |
-| `seeds` | `[1, 2, 3]` | RNG seeds per preset |
-| `max_agent_turns` | 200 | Max LLM calls per run |
-| `run_timeout` | 3600 | Wall-clock timeout per run (seconds) |
-| `survival_weight` | 0.5 | Weight of survival in composite score |
-| `funds_weight` | 0.5 | Weight of normalised funds in composite |
-| `horizon_years` | null | Override horizon (null = auto from preset) |
-
-## Cost & Time Estimates
-
-Each run is 100-500 LLM turns. Approximate costs per run at typical API rates:
-
-| Preset | Turns | Time | Est. Cost |
-|--------|-------|------|-----------|
-| fast_test | ~50 | 5-10 min | $1-5 |
-| medium | ~200 | 20-40 min | $5-15 |
-| hard | ~300 | 30-60 min | $10-25 |
-
-Full default eval (9 runs): ~3-6 hours, $50-200 depending on model.
-
-## References
-
-- [collinear-ai/yc-bench](https://github.com/collinear-ai/yc-bench) — Official repository
-- [Collinear AI](https://collinear.ai/) — Company behind yc-bench
-- [TerminalBench2](../terminalbench_2/) — Per-task coding benchmark (complementary)
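Review note on the deleted README: its two formulas, the composite score and the throughput split, can be written out as a short sketch. The default weights come from `default.yaml`; the function names are hypothetical, and `normalised_funds` is taken as an input because the README does not give the exact log-scale normalisation.

```python
def composite_score(survived, normalised_funds,
                    survival_weight=0.5, funds_weight=0.5):
    """composite = survival_weight * survival + funds_weight * normalised_funds."""
    if not 0.0 <= normalised_funds <= 1.0:
        raise ValueError("normalised_funds must be within [0.0, 1.0]")
    survival = 1.0 if survived else 0.0
    return survival_weight * survival + funds_weight * normalised_funds


def effective_rate(base_rate, active_tasks):
    """Throughput split: an employee's rate is divided across active tasks."""
    if active_tasks < 1:
        raise ValueError("active_tasks must be at least 1")
    return base_rate / active_tasks
```

With the default weights, a bankrupt run can score at most 0.5, so the funds term alone cannot outweigh survival.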
--- a/environments/benchmarks/yc_bench/__init__.py
+++ b/environments/benchmarks/yc_bench/__init__.py
--- a/environments/benchmarks/yc_bench/default.yaml
+++ b/environments/benchmarks/yc_bench/default.yaml
@@ -1,43 +0,0 @@
-# YC-Bench Evaluation -- Default Configuration
-#
-# Long-horizon agent benchmark: agent plays CEO of an AI startup over
-# a simulated 1-3 year run, interacting via yc-bench CLI subcommands.
-#
-# Requires: pip install "hermes-agent[yc-bench]"
-#
-# Usage:
-#   python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
-#       --config environments/benchmarks/yc_bench/default.yaml
-#
-#   # Override model:
-#   python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
-#       --config environments/benchmarks/yc_bench/default.yaml \
-#       --openai.model_name anthropic/claude-opus-4-20250514
-
-env:
-  enabled_toolsets: ["terminal"]
-  max_agent_turns: 200
-  max_token_length: 32000
-  agent_temperature: 0.0
-  terminal_backend: "local"
-  terminal_timeout: 60
-  presets: ["fast_test", "medium", "hard"]
-  seeds: [1, 2, 3]
-  run_timeout: 3600          # 60 min wall-clock per run, auto-FAIL if exceeded
-  survival_weight: 0.5       # weight of binary survival in composite score
-  funds_weight: 0.5          # weight of normalised final funds in composite score
-  db_dir: "/tmp/yc_bench_dbs"
-  company_name: "BenchCo"
-  start_date: "01/01/2025"   # MM/DD/YYYY (yc-bench convention)
-  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
-  use_wandb: true
-  wandb_name: "yc-bench"
-  ensure_scores_are_not_same: false
-  data_dir_to_save_evals: "environments/benchmarks/evals/yc-bench"
-
-openai:
-  base_url: "https://openrouter.ai/api/v1"
-  model_name: "anthropic/claude-sonnet-4.6"
-  server_type: "openai"
-  health_check: false
-  # api_key loaded from OPENROUTER_API_KEY in .env
--- a/environments/benchmarks/yc_bench/run_eval.sh
+++ b/environments/benchmarks/yc_bench/run_eval.sh
@@ -1,34 +0,0 @@
-#!/bin/bash
-
-# YC-Bench Evaluation
-#
-# Requires: pip install "hermes-agent[yc-bench]"
-#
-# Run from repo root:
-#   bash environments/benchmarks/yc_bench/run_eval.sh
-#
-# Override model:
-#   bash environments/benchmarks/yc_bench/run_eval.sh \
-#       --openai.model_name anthropic/claude-opus-4-20250514
-#
-# Run a single preset:
-#   bash environments/benchmarks/yc_bench/run_eval.sh \
-#       --env.presets '["fast_test"]' --env.seeds '[1]'
-
-set -euo pipefail
-
-mkdir -p logs evals/yc-bench
-LOG_FILE="logs/yc_bench_$(date +%Y%m%d_%H%M%S).log"
-
-echo "YC-Bench Evaluation"
-echo "Log: $LOG_FILE"
-echo ""
-
-PYTHONUNBUFFERED=1 LOGLEVEL="${LOGLEVEL:-INFO}" \
-  python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
-  --config environments/benchmarks/yc_bench/default.yaml \
-  "$@" \
-  2>&1 | tee "$LOG_FILE"
-
-echo ""
-echo "Log saved to: $LOG_FILE"
--- a/environments/benchmarks/yc_bench/yc_bench_env.py
+++ b/environments/benchmarks/yc_bench/yc_bench_env.py
@@ -1,847 +0,0 @@
-"""
-YCBenchEvalEnv -- YC-Bench Long-Horizon Agent Benchmark Environment
-
-Evaluates agentic LLMs on YC-Bench: a deterministic, long-horizon benchmark
-where the agent acts as CEO of an AI startup over a simulated 1-3 year run.
-The agent manages cash flow, employees, tasks, and prestige across 4 domains,
-interacting exclusively via CLI subprocess calls against a SQLite-backed
-discrete-event simulation.
-
-Unlike TerminalBench2 (per-task binary pass/fail), YC-Bench measures sustained
-multi-turn strategic coherence -- whether an agent can manage compounding
-decisions over hundreds of turns without going bankrupt.
-
-This is an eval-only environment. Run via:
-
-    python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
-        --config environments/benchmarks/yc_bench/default.yaml
-
-The evaluate flow:
-    1. setup()     -- Verifies yc-bench installed, builds eval matrix (preset x seed)
-    2. evaluate()  -- Iterates over all runs sequentially through:
-        a. rollout_and_score_eval()  -- Per-run agent loop
-            - Initialises a fresh yc-bench simulation via `sim init` (NOT `run`)
-            - Runs HermesAgentLoop with terminal tool only
-            - Reads final SQLite DB to extract score
-            - Returns survival (0/1) + normalised funds score
-        b. Aggregates per-preset and overall metrics
-        c. Logs results via evaluate_log() and wandb
-
-Key features:
-  - CLI-only interface: agent calls yc-bench subcommands via terminal tool
-  - Deterministic: same seed + preset = same world (SHA256-based RNG)
-  - Multi-dimensional scoring: survival + normalised final funds
-  - Per-preset difficulty breakdown in results
-  - Isolated SQLite DB per run (no cross-run state leakage)
-
-Requires: pip install hermes-agent[yc-bench]
-"""
-
-import asyncio
-import datetime
-import json
-import logging
-import math
-import os
-import sqlite3
-import subprocess
-import sys
-import threading
-import time
-import uuid
-from collections import defaultdict
-from pathlib import Path
-from typing import Any, Dict, List, Optional, Tuple
-
-_repo_root = Path(__file__).resolve().parent.parent.parent.parent
-if str(_repo_root) not in sys.path:
-    sys.path.insert(0, str(_repo_root))
-
-from pydantic import Field
-
-from atroposlib.envs.base import EvalHandlingEnum
-from atroposlib.envs.server_handling.server_manager import APIServerConfig
-
-from environments.agent_loop import HermesAgentLoop
-from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
-
-logger = logging.getLogger(__name__)
-
-# =============================================================================
-# System prompt
-# =============================================================================
-
-YC_BENCH_SYSTEM_PROMPT = """\
-You are the autonomous CEO of an early-stage AI startup in a deterministic
-business simulation. You manage the company exclusively through the `yc-bench`
-CLI tool. Your primary goal is to **survive** until the simulation horizon ends
-without going bankrupt, while **maximising final funds**.
-
-## Simulation Mechanics
-
-- **Funds**: You start with $250,000 seed capital. Revenue comes from completing
-  tasks. Rewards scale with your prestige: `base × (1 + scale × (prestige − 1))`.
-- **Domains**: There are 4 skill domains: **research**, **inference**,
-  **data_environment**, and **training**. Each has its own prestige level
-  (1.0-10.0). Higher prestige unlocks better-paying tasks.
-- **Employees**: You have employees (Junior/Mid/Senior) with domain-specific
-  skill rates. **Throughput splits**: `effective_rate = base_rate / N` where N
-  is the number of active tasks assigned to that employee. Focus beats breadth.
-- **Payroll**: Deducted automatically on the first business day of each month.
-  Running out of funds = bankruptcy = game over.
-- **Time**: The simulation runs on business days (Mon-Fri), 09:00-18:00.
-  Time only advances when you call `yc-bench sim resume`.
-
-## Task Lifecycle
-
-1. Browse market tasks with `market browse`
-2. Accept a task with `task accept` (this sets its deadline)
-3. Assign employees with `task assign`
-4. Dispatch with `task dispatch` to start work
-5. Call `sim resume` to advance time and let employees make progress
-6. Tasks complete when all domain requirements are fulfilled
-
-**Penalties for failure vary by difficulty preset.** Completing a task on time
-earns full reward + prestige gain. Missing a deadline or cancelling a task
-incurs prestige penalties -- cancelling is always more costly than letting a
-task fail, so cancel only as a last resort.
-
-## CLI Commands
-
-### Observe
-- `yc-bench company status`                                         -- funds, prestige, runway
-- `yc-bench employee list`                                          -- skills, salary, active tasks
-- `yc-bench market browse [--domain D] [--required-prestige-lte N]` -- available tasks
-- `yc-bench task list [--status active|planned]`                    -- your tasks
-- `yc-bench task inspect --task-id UUID`                            -- progress, deadline, assignments
-- `yc-bench finance ledger [--category monthly_payroll|task_reward]` -- transaction history
-- `yc-bench report monthly`                                         -- monthly P&L
-
-### Act
-- `yc-bench task accept --task-id UUID`                              -- accept from market
-- `yc-bench task assign --task-id UUID --employee-id UUID`           -- assign employee
-- `yc-bench task dispatch --task-id UUID`                            -- start work (needs >=1 assignment)
-- `yc-bench task cancel --task-id UUID --reason "text"`              -- cancel (prestige penalty)
-- `yc-bench sim resume`                                              -- advance simulation clock
-
-### Memory (persists across context truncation)
-- `yc-bench scratchpad read`            -- read your persistent notes
-- `yc-bench scratchpad write --content "text"`  -- overwrite notes
-- `yc-bench scratchpad append --content "text"` -- append to notes
-- `yc-bench scratchpad clear`           -- clear notes
-
-## Strategy Guidelines
-
-1. **Specialise in 2-3 domains** to climb the prestige ladder faster and unlock
-   high-reward tasks. Don't spread thin across all 4 domains early on.
-2. **Focus employees** -- assigning one employee to N active tasks divides
-   their throughput on each task by N. Keep assignments concentrated.
-3. **Use the scratchpad** to track your strategy, upcoming deadlines, and
-   employee assignments. This persists even if conversation context is truncated.
-4. **Monitor runway** -- always know how many months of payroll you can cover.
-   Accept high-reward tasks before payroll dates.
-5. **Don't over-accept** -- taking too many tasks and missing deadlines cascades
-   into prestige loss, locking you out of profitable contracts.
-6. Use `finance ledger` and `report monthly` to track revenue trends.
-
-## Your Turn
-
-Each turn:
-1. Call `yc-bench company status` and `yc-bench task list` to orient yourself.
-2. Check for completed tasks and pending deadlines.
-3. Browse market for profitable tasks within your prestige level.
-4. Accept, assign, and dispatch tasks strategically.
-5. Call `yc-bench sim resume` to advance time.
-6. Repeat until the simulation ends.
-
-Think step by step before acting."""
-
-# Starting funds in cents ($250,000)
-INITIAL_FUNDS_CENTS = 25_000_000
-
-# Default horizon per preset (years)
-_PRESET_HORIZONS = {
-    "tutorial": 1,
-    "easy": 1,
-    "medium": 1,
-    "hard": 1,
-    "nightmare": 1,
-    "fast_test": 1,
-    "default": 3,
-    "high_reward": 1,
-}
-
-
-# =============================================================================
-# Configuration
-# =============================================================================
-
-class YCBenchEvalConfig(HermesAgentEnvConfig):
-    """
-    Configuration for the YC-Bench evaluation environment.
-
-    Extends HermesAgentEnvConfig with YC-Bench-specific settings for
-    preset selection, seed control, scoring, and simulation parameters.
-    """
-
-    presets: List[str] = Field(
-        default=["fast_test", "medium", "hard"],
-        description="YC-Bench preset names to evaluate.",
-    )
-    seeds: List[int] = Field(
-        default=[1, 2, 3],
-        description="Random seeds -- each preset x seed = one run.",
-    )
-    run_timeout: int = Field(
-        default=3600,
-        description="Maximum wall-clock seconds per run. Default 60 minutes.",
-    )
-    survival_weight: float = Field(
-        default=0.5,
-        description="Weight of survival (0/1) in composite score.",
-    )
-    funds_weight: float = Field(
-        default=0.5,
-        description="Weight of normalised final funds in composite score.",
-    )
-    db_dir: str = Field(
-        default="/tmp/yc_bench_dbs",
-        description="Directory for per-run SQLite databases.",
-    )
-    horizon_years: Optional[int] = Field(
-        default=None,
-        description=(
-            "Simulation horizon in years. If None (default), inferred from "
-            "preset name (1 year for most, 3 for 'default')."
-        ),
-    )
-    company_name: str = Field(
-        default="BenchCo",
-        description="Name of the simulated company.",
-    )
-    start_date: str = Field(
-        default="01/01/2025",
-        description="Simulation start date in MM/DD/YYYY format (yc-bench convention).",
-    )
-
-
-# =============================================================================
-# Scoring helpers
-# =============================================================================
-
-def _read_final_score(db_path: str) -> Dict[str, Any]:
-    """
-    Read final game state from a YC-Bench SQLite database.
-
-    Returns dict with final_funds_cents (int), survived (bool),
-    terminal_reason (str).
-
-    Note: yc-bench table names are plural -- 'companies' not 'company',
-    'sim_events' not 'simulation_log'.
-    """
-    if not os.path.exists(db_path):
-        logger.warning("DB not found at %s", db_path)
-        return {
-            "final_funds_cents": 0,
-            "survived": False,
-            "terminal_reason": "db_missing",
-        }
-
-    conn = None
-    try:
-        conn = sqlite3.connect(db_path)
-        cur = conn.cursor()
-
-        # Read final funds from the 'companies' table
-        cur.execute("SELECT funds_cents FROM companies LIMIT 1")
-        row = cur.fetchone()
-        funds = row[0] if row else 0
-
-        # Determine terminal reason from 'sim_events' table
-        terminal_reason = "unknown"
-        try:
-            cur.execute(
-                "SELECT event_type FROM sim_events "
-                "WHERE event_type IN ('bankruptcy', 'horizon_end') "
-                "ORDER BY scheduled_at DESC LIMIT 1"
-            )
-            event_row = cur.fetchone()
-            if event_row:
-                terminal_reason = event_row[0]
-        except sqlite3.OperationalError:
-            # Table may not exist if simulation didn't progress
-            pass
-
-        survived = funds >= 0 and terminal_reason != "bankruptcy"
-        return {
-            "final_funds_cents": funds,
-            "survived": survived,
-            "terminal_reason": terminal_reason,
-        }
-
-    except Exception as e:
-        logger.error("Failed to read DB %s: %s", db_path, e)
-        return {
-            "final_funds_cents": 0,
-            "survived": False,
-            "terminal_reason": f"db_error: {e}",
-        }
-    finally:
-        if conn:
-            conn.close()
-
-
-def _compute_composite_score(
-    final_funds_cents: int,
-    survived: bool,
-    survival_weight: float = 0.5,
-    funds_weight: float = 0.5,
-    initial_funds_cents: int = INITIAL_FUNDS_CENTS,
-) -> float:
-    """
-    Compute composite score from survival and final funds.
-
-    Score = survival_weight * survival_score
-          + funds_weight * normalised_funds_score
-
-    Normalised funds uses log-scale relative to initial capital:
-    - funds <= 0:          0.0
-    - funds == initial:   ~0.15
-    - funds == 10x:       ~0.52
-    - funds == 100x:       1.0
-    """
-    survival_score = 1.0 if survived else 0.0
-
-    if final_funds_cents <= 0:
-        funds_score = 0.0
-    else:
-        max_ratio = 100.0
-        ratio = final_funds_cents / max(initial_funds_cents, 1)
-        funds_score = min(math.log1p(ratio) / math.log1p(max_ratio), 1.0)
-
-    return survival_weight * survival_score + funds_weight * funds_score
-
-
-# =============================================================================
-# Main Environment
-# =============================================================================
-
-class YCBenchEvalEnv(HermesAgentBaseEnv):
-    """
-    YC-Bench long-horizon agent benchmark environment (eval-only).
-
-    Each eval item is a (preset, seed) pair. The environment initialises the
-    simulation via ``yc-bench sim init`` (NOT ``yc-bench run`` which would start
-    a competing built-in agent loop). The HermesAgentLoop then drives the
-    interaction by calling individual yc-bench CLI commands via the terminal tool.
-
-    After the agent loop ends, the SQLite DB is read to extract the final score.
-
-    Scoring:
-      composite = 0.5 * survival + 0.5 * normalised_funds
-    """
-
-    name = "yc-bench"
-    env_config_cls = YCBenchEvalConfig
-
-    @classmethod
-    def config_init(cls) -> Tuple[YCBenchEvalConfig, List[APIServerConfig]]:
-        env_config = YCBenchEvalConfig(
-            enabled_toolsets=["terminal"],
-            disabled_toolsets=None,
-            distribution=None,
-            max_agent_turns=200,
-            max_token_length=32000,
-            agent_temperature=0.0,
-            system_prompt=YC_BENCH_SYSTEM_PROMPT,
-            terminal_backend="local",
-            terminal_timeout=60,
-            presets=["fast_test", "medium", "hard"],
-            seeds=[1, 2, 3],
-            run_timeout=3600,
-            survival_weight=0.5,
-            funds_weight=0.5,
-            db_dir="/tmp/yc_bench_dbs",
-            eval_handling=EvalHandlingEnum.STOP_TRAIN,
-            group_size=1,
-            steps_per_eval=1,
-            total_steps=1,
-            tokenizer_name="NousResearch/Hermes-3-Llama-3.1-8B",
-            use_wandb=True,
-            wandb_name="yc-bench",
-            ensure_scores_are_not_same=False,
-        )
-
-        server_configs = [
-            APIServerConfig(
-                base_url="https://openrouter.ai/api/v1",
-                model_name="anthropic/claude-sonnet-4.6",
-                server_type="openai",
-                api_key=os.getenv("OPENROUTER_API_KEY", ""),
-                health_check=False,
-            )
-        ]
-
-        return env_config, server_configs
-
-    # =========================================================================
-    # Setup
-    # =========================================================================
-
-    async def setup(self):
-        """Verify yc-bench is installed and build the eval matrix."""
-        # Verify yc-bench CLI is available
-        try:
-            result = subprocess.run(
-                ["yc-bench", "--help"], capture_output=True, text=True, timeout=10
-            )
-            if result.returncode != 0:
-                raise FileNotFoundError
-        except (FileNotFoundError, subprocess.TimeoutExpired):
-            raise RuntimeError(
-                "yc-bench CLI not found. Install with:\n"
-                '  pip install "hermes-agent[yc-bench]"\n'
-                "Or: git clone https://github.com/collinear-ai/yc-bench "
-                "&& cd yc-bench && pip install -e ."
-            )
-        print("yc-bench CLI verified.")
-
-        # Build eval matrix: preset x seed
-        self.all_eval_items = [
-            {"preset": preset, "seed": seed}
-            for preset in self.config.presets
-            for seed in self.config.seeds
-        ]
-        self.iter = 0
-
-        os.makedirs(self.config.db_dir, exist_ok=True)
-        self.eval_metrics: List[Tuple[str, float]] = []
-
-        # Streaming JSONL log for crash-safe result persistence
-        log_dir = os.path.join(os.path.dirname(__file__), "logs")
-        os.makedirs(log_dir, exist_ok=True)
-        run_ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
-        self._streaming_path = os.path.join(log_dir, f"samples_{run_ts}.jsonl")
-        self._streaming_file = open(self._streaming_path, "w")
-        self._streaming_lock = threading.Lock()
-
-        print(f"\nYC-Bench eval matrix: {len(self.all_eval_items)} runs")
-        for item in self.all_eval_items:
-            print(f"  preset={item['preset']!r}  seed={item['seed']}")
-        print(f"Streaming results to: {self._streaming_path}\n")
-
-    def _save_result(self, result: Dict[str, Any]):
-        """Write a single run result to the streaming JSONL file immediately."""
-        if not hasattr(self, "_streaming_file") or self._streaming_file.closed:
-            return
-        with self._streaming_lock:
-            self._streaming_file.write(
-                json.dumps(result, ensure_ascii=False, default=str) + "\n"
-            )
-            self._streaming_file.flush()
-
-    # =========================================================================
-    # Training pipeline stubs (eval-only -- not used)
-    # =========================================================================
-
-    async def get_next_item(self):
-        item = self.all_eval_items[self.iter % len(self.all_eval_items)]
-        self.iter += 1
-        return item
-
-    def format_prompt(self, item: Dict[str, Any]) -> str:
-        preset = item["preset"]
-        seed = item["seed"]
-        return (
-            f"A new YC-Bench simulation has been initialized "
-            f"(preset='{preset}', seed={seed}).\n"
-            f"Your company '{self.config.company_name}' is ready.\n\n"
-            "Begin by calling:\n"
-            "1. `yc-bench company status` -- see your starting funds and prestige\n"
-            "2. `yc-bench employee list` -- see your team and their skills\n"
-            "3. `yc-bench market browse --required-prestige-lte 1` -- find tasks "
-            "you can take\n\n"
-            "Then accept 2-3 tasks, assign employees, dispatch them, and call "
-            "`yc-bench sim resume` to advance time. Repeat this loop until the "
-            "simulation ends (horizon reached or bankruptcy)."
-        )
-
-    async def compute_reward(self, item, result, ctx) -> float:
-        return 0.0
-
-    async def collect_trajectories(self, item):
-        return None, []
-
-    async def score(self, rollout_group_data):
-        return None
-
-    # =========================================================================
-    # Per-run evaluation
-    # =========================================================================
-
-    async def rollout_and_score_eval(self, eval_item: Dict[str, Any]) -> Dict:
-        """
-        Evaluate a single (preset, seed) run.
-
-        1. Sets DATABASE_URL and YC_BENCH_EXPERIMENT env vars
-        2. Initialises the simulation via ``yc-bench sim init`` (NOT ``run``)
-        3. Runs HermesAgentLoop with terminal tool
-        4. Reads SQLite DB to compute final score
-        5. Returns result dict with survival, funds, and composite score
-        """
-        preset = eval_item["preset"]
-        seed = eval_item["seed"]
-        run_id = str(uuid.uuid4())[:8]
-        run_key = f"{preset}_seed{seed}_{run_id}"
-
-        from tqdm import tqdm
-        tqdm.write(f"  [START] preset={preset!r} seed={seed} (run_id={run_id})")
-        run_start = time.time()
-
-        # Isolated DB per run -- prevents cross-run state leakage
-        db_path = os.path.join(self.config.db_dir, f"yc_bench_{run_key}.db")
-        os.environ["DATABASE_URL"] = f"sqlite:///{db_path}"
-        os.environ["YC_BENCH_EXPERIMENT"] = preset
-
-        # Determine horizon: explicit config override > preset lookup > default 1
-        horizon = self.config.horizon_years or _PRESET_HORIZONS.get(preset, 1)
-
-        try:
-            # ----------------------------------------------------------
-            # Step 1: Initialise the simulation via CLI
-            # IMPORTANT: We use `sim init`, NOT `yc-bench run`.
-            # `yc-bench run` starts yc-bench's own LLM agent loop (via
-            # LiteLLM), which would compete with our HermesAgentLoop.
-            # `sim init` just sets up the world and returns.
-            # ----------------------------------------------------------
-            init_cmd = [
-                "yc-bench", "sim", "init",
-                "--seed", str(seed),
-                "--start-date", self.config.start_date,
-                "--company-name", self.config.company_name,
-                "--horizon-years", str(horizon),
-            ]
-            init_result = subprocess.run(
-                init_cmd, capture_output=True, text=True, timeout=30,
-            )
-            if init_result.returncode != 0:
-                error_msg = (init_result.stderr or init_result.stdout).strip()
-                raise RuntimeError(f"yc-bench sim init failed: {error_msg}")
-
-            tqdm.write(f"    Simulation initialized (horizon={horizon}yr)")
-
-            # ----------------------------------------------------------
-            # Step 2: Run the HermesAgentLoop
-            # ----------------------------------------------------------
-            tools, valid_names = self._resolve_tools_for_group()
-
-            messages: List[Dict[str, Any]] = [
-                {"role": "system", "content": YC_BENCH_SYSTEM_PROMPT},
-                {"role": "user", "content": self.format_prompt(eval_item)},
-            ]
-
-            agent = HermesAgentLoop(
-                server=self.server,
-                tool_schemas=tools,
-                valid_tool_names=valid_names,
-                max_turns=self.config.max_agent_turns,
-                task_id=run_id,
-                temperature=self.config.agent_temperature,
-                max_tokens=self.config.max_token_length,
-                extra_body=self.config.extra_body,
-            )
-            result = await agent.run(messages)
-
-            # ----------------------------------------------------------
-            # Step 3: Read final score from the simulation DB
-            # ----------------------------------------------------------
-            score_data = _read_final_score(db_path)
-            final_funds = score_data["final_funds_cents"]
-            survived = score_data["survived"]
-            terminal_reason = score_data["terminal_reason"]
-
-            composite = _compute_composite_score(
-                final_funds_cents=final_funds,
-                survived=survived,
-                survival_weight=self.config.survival_weight,
-                funds_weight=self.config.funds_weight,
-            )
-
-            elapsed = time.time() - run_start
-            status = "SURVIVED" if survived else "BANKRUPT"
-            if final_funds >= 0:
-                funds_str = f"${final_funds / 100:,.0f}"
-            else:
-                funds_str = f"-${abs(final_funds) / 100:,.0f}"
-
-            tqdm.write(
-                f"  [{status}] preset={preset!r} seed={seed} "
-                f"funds={funds_str} score={composite:.3f} "
-                f"turns={result.turns_used} ({elapsed:.0f}s)"
-            )
-
-            out = {
-                "preset": preset,
-                "seed": seed,
-                "survived": survived,
-                "final_funds_cents": final_funds,
-                "final_funds_usd": final_funds / 100,
-                "terminal_reason": terminal_reason,
-                "composite_score": composite,
-                "turns_used": result.turns_used,
-                "finished_naturally": result.finished_naturally,
-                "elapsed_seconds": elapsed,
-                "db_path": db_path,
-                "messages": result.messages,
-            }
-            self._save_result(out)
-            return out
-
-        except Exception as e:
-            elapsed = time.time() - run_start
-            logger.error("Run %s failed: %s", run_key, e, exc_info=True)
-            tqdm.write(
-                f"  [ERROR] preset={preset!r} seed={seed}: {e} ({elapsed:.0f}s)"
-            )
-            out = {
-                "preset": preset,
-                "seed": seed,
-                "survived": False,
-                "final_funds_cents": 0,
-                "final_funds_usd": 0.0,
-                "terminal_reason": f"error: {e}",
-                "composite_score": 0.0,
-                "turns_used": 0,
-                "error": str(e),
-                "elapsed_seconds": elapsed,
-            }
-            self._save_result(out)
-            return out
-
-    # =========================================================================
-    # Evaluate
-    # =========================================================================
-
-    async def _run_with_timeout(self, item: Dict[str, Any]) -> Dict:
-        """Wrap a single rollout with a wall-clock timeout."""
-        preset = item["preset"]
-        seed = item["seed"]
-        try:
-            return await asyncio.wait_for(
-                self.rollout_and_score_eval(item),
-                timeout=self.config.run_timeout,
-            )
-        except asyncio.TimeoutError:
-            from tqdm import tqdm
-            tqdm.write(
-                f"  [TIMEOUT] preset={preset!r} seed={seed} "
-                f"(exceeded {self.config.run_timeout}s)"
-            )
-            out = {
-                "preset": preset,
-                "seed": seed,
-                "survived": False,
-                "final_funds_cents": 0,
-                "final_funds_usd": 0.0,
-                "terminal_reason": f"timeout ({self.config.run_timeout}s)",
-                "composite_score": 0.0,
-                "turns_used": 0,
-                "error": "timeout",
-            }
-            self._save_result(out)
-            return out
-
-    async def evaluate(self, *args, **kwargs) -> None:
-        """
-        Run YC-Bench evaluation over all (preset, seed) combinations.
-
-        Runs sequentially -- each run is 100-500 turns, so parallelising would
-        be prohibitively expensive and would cause env var conflicts.
-        """
-        start_time = time.time()
-        from tqdm import tqdm
-
-        # --- tqdm-compatible logging handler (TB2 pattern) ---
-        class _TqdmHandler(logging.Handler):
-            def emit(self, record):
-                try:
-                    tqdm.write(self.format(record))
-                except Exception:
-                    self.handleError(record)
-
-        root = logging.getLogger()
-        handler = _TqdmHandler()
-        handler.setFormatter(
-            logging.Formatter("%(levelname)s %(name)s: %(message)s")
-        )
-        root.handlers = [handler]
-        for noisy in ("httpx", "openai"):
-            logging.getLogger(noisy).setLevel(logging.WARNING)
-
-        # --- Print config summary ---
-        print(f"\n{'='*60}")
-        print("Starting YC-Bench Evaluation")
-        print(f"{'='*60}")
-        print(f"  Presets: {self.config.presets}")
-        print(f"  Seeds: {self.config.seeds}")
-        print(f"  Total runs: {len(self.all_eval_items)}")
-        print(f"  Max turns/run: {self.config.max_agent_turns}")
-        print(f"  Run timeout: {self.config.run_timeout}s")
-        print(f"{'='*60}\n")
-
-        results = []
-        pbar = tqdm(
-            total=len(self.all_eval_items), desc="YC-Bench", dynamic_ncols=True
-        )
-
-        try:
-            for item in self.all_eval_items:
-                result = await self._run_with_timeout(item)
-                results.append(result)
-                survived_count = sum(1 for r in results if r.get("survived"))
-                pbar.set_postfix_str(
-                    f"survived={survived_count}/{len(results)}"
-                )
-                pbar.update(1)
-
-        except (KeyboardInterrupt, asyncio.CancelledError):
-            tqdm.write("\n[INTERRUPTED] Stopping evaluation...")
-            pbar.close()
-            try:
-                from tools.terminal_tool import cleanup_all_environments
-                cleanup_all_environments()
-            except Exception:
-                pass
-            if hasattr(self, "_streaming_file") and not self._streaming_file.closed:
-                self._streaming_file.close()
-            return
-
-        pbar.close()
-        end_time = time.time()
-
-        # --- Compute metrics ---
-        valid = [r for r in results if r is not None]
-        if not valid:
-            print("Warning: No valid results.")
-            return
-
-        total = len(valid)
-        survived_total = sum(1 for r in valid if r.get("survived"))
-        survival_rate = survived_total / total if total else 0.0
-        avg_score = (
-            sum(r.get("composite_score", 0) for r in valid) / total
-            if total
-            else 0.0
-        )
-
-        preset_results: Dict[str, List[Dict]] = defaultdict(list)
-        for r in valid:
-            preset_results[r["preset"]].append(r)
-
-        eval_metrics = {
-            "eval/survival_rate": survival_rate,
-            "eval/avg_composite_score": avg_score,
-            "eval/total_runs": total,
-            "eval/survived_runs": survived_total,
-            "eval/evaluation_time_seconds": end_time - start_time,
-        }
-
-        for preset, items in sorted(preset_results.items()):
-            ps = sum(1 for r in items if r.get("survived"))
-            pt = len(items)
-            pa = (
-                sum(r.get("composite_score", 0) for r in items) / pt
-                if pt
-                else 0
-            )
-            key = preset.replace("-", "_")
-            eval_metrics[f"eval/survival_rate_{key}"] = ps / pt if pt else 0
-            eval_metrics[f"eval/avg_score_{key}"] = pa
-
-        self.eval_metrics = [(k, v) for k, v in eval_metrics.items()]
-
-        # --- Print summary ---
-        print(f"\n{'='*60}")
-        print("YC-Bench Evaluation Results")
-        print(f"{'='*60}")
-        print(
-            f"Overall survival rate: {survival_rate:.1%} "
-            f"({survived_total}/{total})"
-        )
-        print(f"Average composite score: {avg_score:.4f}")
-        print(f"Evaluation time: {end_time - start_time:.1f}s")
-
-        print("\nPer-preset breakdown:")
-        for preset, items in sorted(preset_results.items()):
-            ps = sum(1 for r in items if r.get("survived"))
-            pt = len(items)
-            pa = (
-                sum(r.get("composite_score", 0) for r in items) / pt
-                if pt
-                else 0
-            )
-            print(f"  {preset}: {ps}/{pt} survived  avg_score={pa:.4f}")
-            for r in items:
-                status = "SURVIVED" if r.get("survived") else "BANKRUPT"
-                funds = r.get("final_funds_usd", 0)
-                print(
-                    f"    seed={r['seed']}  [{status}]  "
-                    f"${funds:,.0f}  "
-                    f"score={r.get('composite_score', 0):.3f}"
-                )
-
-        print(f"{'='*60}\n")
-
-        # --- Log results ---
-        samples = [
-            {k: v for k, v in r.items() if k != "messages"} for r in valid
-        ]
-
-        try:
-            await self.evaluate_log(
-                metrics=eval_metrics,
-                samples=samples,
-                start_time=start_time,
-                end_time=end_time,
-                generation_parameters={
-                    "temperature": self.config.agent_temperature,
-                    "max_tokens": self.config.max_token_length,
-                    "max_agent_turns": self.config.max_agent_turns,
-                },
-            )
-        except Exception as e:
-            print(f"Error logging results: {e}")
-
-        # --- Cleanup (TB2 pattern) ---
-        if hasattr(self, "_streaming_file") and not self._streaming_file.closed:
-            self._streaming_file.close()
-            print(f"Results saved to: {self._streaming_path}")
-
-        try:
-            from tools.terminal_tool import cleanup_all_environments
-            cleanup_all_environments()
-        except Exception:
-            pass
-
-        try:
-            from environments.agent_loop import _tool_executor
-            _tool_executor.shutdown(wait=False, cancel_futures=True)
-        except Exception:
-            pass
-
-    # =========================================================================
-    # Wandb logging
-    # =========================================================================
-
-    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
-        """Log YC-Bench-specific metrics to wandb."""
-        if wandb_metrics is None:
-            wandb_metrics = {}
-        for k, v in self.eval_metrics:
-            wandb_metrics[k] = v
-        self.eval_metrics = []
-        await super().wandb_log(wandb_metrics)
-
-
-if __name__ == "__main__":
-    YCBenchEvalEnv.cli()
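The composite scoring deleted in the hunk above can be checked in isolation. The following is a minimal sketch that restates the removed `_compute_composite_score` logic under the default 0.5/0.5 weights. The helper names `funds_score` and `composite_score` are illustrative and do not appear in the repository. It reproduces the reference points quoted in the removed docstring (about 0.15 at 1x initial funds, about 0.52 at 10x, and 1.0 at 100x).

```python
import math

# Starting funds in cents ($250,000), matching the removed module's constant.
INITIAL_FUNDS_CENTS = 25_000_000


def funds_score(final_funds_cents, initial_funds_cents=INITIAL_FUNDS_CENTS,
                max_ratio=100.0):
    """Log-scaled funds score in [0, 1]; hypothetical helper for illustration."""
    if final_funds_cents <= 0:
        return 0.0
    ratio = final_funds_cents / max(initial_funds_cents, 1)
    # log1p keeps the curve finite near zero; the cap is reached at max_ratio.
    return min(math.log1p(ratio) / math.log1p(max_ratio), 1.0)


def composite_score(final_funds_cents, survived,
                    survival_weight=0.5, funds_weight=0.5):
    """Weighted sum of survival (0 or 1) and the normalised funds score."""
    survival = 1.0 if survived else 0.0
    return survival_weight * survival + funds_weight * funds_score(final_funds_cents)


if __name__ == "__main__":
    for multiple in (1, 10, 100):
        funds = multiple * INITIAL_FUNDS_CENTS
        print(multiple, round(funds_score(funds), 4),
              round(composite_score(funds, survived=True), 4))
```

With these defaults, a run that survives but only preserves its seed capital scores roughly 0.575, so survival dominates the composite unless funds grow by an order of magnitude or more.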
--- a/environments/hermes_base_env.py
+++ b/environments/hermes_base_env.py
@@ -114,8 +114,8 @@ class HermesAgentEnvConfig(BaseEnvConfig):
    # --- Terminal backend ---
    terminal_backend: str = Field(
        default="local",
-        description="Terminal backend: 'local', 'docker', 'modal', 'daytona', 'ssh', 'singularity'. "
-        "Modal or Daytona recommended for production RL (cloud isolation per rollout).",
+        description="Terminal backend: 'local', 'docker', 'modal', 'ssh', 'singularity'. "
+        "Modal recommended for production RL (cloud isolation per rollout).",
    )
    terminal_timeout: int = Field(
        default=120,
--- a/environments/tool_call_parsers/deepseek_v3_1_parser.py
+++ b/environments/tool_call_parsers/deepseek_v3_1_parser.py
@@ -35,8 +35,7 @@ class DeepSeekV31ToolCallParser(ToolCallParser):

    # Regex captures: function_name, function_arguments
    PATTERN = re.compile(
-        r"<｜tool▁call▁begin｜>(?P<function_name>.*?)<｜tool▁sep｜>(?P<function_arguments>.*?)<｜tool▁call▁end｜>",
-        re.DOTALL,
+        r"<｜tool▁call▁begin｜>(?P<function_name>.*?)<｜tool▁sep｜>(?P<function_arguments>.*?)<｜tool▁call▁end｜>"
    )

    def parse(self, text: str) -> ParseResult:
--- a/environments/tool_call_parsers/deepseek_v3_parser.py
+++ b/environments/tool_call_parsers/deepseek_v3_parser.py
@@ -38,8 +38,7 @@ class DeepSeekV3ToolCallParser(ToolCallParser):

    # Regex captures: type, function_name, function_arguments
    PATTERN = re.compile(
-        r"<｜tool▁call▁begin｜>(?P<type>.*)<｜tool▁sep｜>(?P<function_name>.*)\n```json\n(?P<function_arguments>.*)\n```<｜tool▁call▁end｜>",
-        re.DOTALL,
+        r"<｜tool▁call▁begin｜>(?P<type>.*)<｜tool▁sep｜>(?P<function_name>.*)\n```json\n(?P<function_arguments>.*)\n```<｜tool▁call▁end｜>"
    )

    def parse(self, text: str) -> ParseResult:
--- a/environments/tool_context.py
+++ b/environments/tool_context.py
@@ -44,7 +44,7 @@ _tool_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
 def _run_tool_in_thread(tool_name: str, arguments: Dict[str, Any], task_id: str) -> str:
    """
    Run a tool call in a thread pool executor so backends that use asyncio.run()
-    internally (modal, docker, daytona) get a clean event loop.
+    internally (modal, docker) get a clean event loop.

    If we're already in an async context, executes handle_function_call() in a
    disposable worker thread and blocks for the result.
@@ -95,7 +95,7 @@ class ToolContext:
        backend = os.getenv("TERMINAL_ENV", "local")
        logger.debug("ToolContext.terminal [%s backend] task=%s: %s", backend, self.task_id[:8], command[:100])

-        # Run via thread helper so modal/docker/daytona backends' asyncio.run() doesn't deadlock
+        # Run via thread helper so modal/docker backends' asyncio.run() doesn't deadlock
        result = _run_tool_in_thread(
            "terminal",
            {"command": command, "timeout": timeout},
--- a/gateway/platforms/telegram.py
+++ b/gateway/platforms/telegram.py
@@ -8,13 +8,10 @@ Uses python-telegram-bot library for:
 """

 import asyncio
-import logging
 import os
 import re
 from typing import Dict, List, Optional, Any

-logger = logging.getLogger(__name__)
-
 try:
    from telegram import Update, Bot, Message
    from telegram.ext import (
@@ -76,19 +73,6 @@ def _escape_mdv2(text: str) -> str:
    return _MDV2_ESCAPE_RE.sub(r'\\\1', text)


-def _strip_mdv2(text: str) -> str:
-    """Strip MarkdownV2 escape backslashes to produce clean plain text.
-
-    Also removes MarkdownV2 bold markers (*text* -> text) so the fallback
-    doesn't show stray asterisks from header/bold conversion.
-    """
-    # Remove escape backslashes before special characters
-    cleaned = re.sub(r'\\([_*\[\]()~`>#\+\-=|{}.!\\])', r'\1', text)
-    # Remove MarkdownV2 bold markers that format_message converted from **bold**
-    cleaned = re.sub(r'\*([^*]+)\*', r'\1', cleaned)
-    return cleaned
-
-
 class TelegramAdapter(BasePlatformAdapter):
    """
    Telegram bot adapter.
@@ -215,13 +199,9 @@ class TelegramAdapter(BasePlatformAdapter):
                except Exception as md_error:
                    # Markdown parsing failed, try plain text
                    if "parse" in str(md_error).lower() or "markdown" in str(md_error).lower():
-                        logger.warning("[%s] MarkdownV2 parse failed, falling back to plain text: %s", self.name, md_error)
-                        # Strip MDV2 escape backslashes so the user doesn't
-                        # see raw backslashes littered through the message.
-                        plain_chunk = _strip_mdv2(chunk)
                        msg = await self._bot.send_message(
                            chat_id=int(chat_id),
-                            text=plain_chunk,
+                            text=chunk,
                            parse_mode=None,  # Plain text
                            reply_to_message_id=int(reply_to) if reply_to and i == 0 else None,
                            message_thread_id=int(thread_id) if thread_id else None,
--- a/gateway/platforms/whatsapp.py
+++ b/gateway/platforms/whatsapp.py
@@ -28,41 +28,6 @@ from typing import Dict, List, Optional, Any

 logger = logging.getLogger(__name__)

-
-def _kill_port_process(port: int) -> None:
-    """Kill any process listening on the given TCP port."""
-    try:
-        if _IS_WINDOWS:
-            # Use netstat to find the PID bound to this port, then taskkill
-            result = subprocess.run(
-                ["netstat", "-ano", "-p", "TCP"],
-                capture_output=True, text=True, timeout=5,
-            )
-            for line in result.stdout.splitlines():
-                parts = line.split()
-                if len(parts) >= 5 and parts[3] == "LISTENING":
-                    local_addr = parts[1]
-                    if local_addr.endswith(f":{port}"):
-                        try:
-                            subprocess.run(
-                                ["taskkill", "/PID", parts[4], "/F"],
-                                capture_output=True, timeout=5,
-                            )
-                        except subprocess.SubprocessError:
-                            pass
-        else:
-            result = subprocess.run(
-                ["fuser", f"{port}/tcp"],
-                capture_output=True, timeout=5,
-            )
-            if result.returncode == 0:
-                subprocess.run(
-                    ["fuser", "-k", f"{port}/tcp"],
-                    capture_output=True, timeout=5,
-                )
-    except Exception:
-        pass
-
 import sys
 sys.path.insert(0, str(Path(__file__).resolve().parents[2]))

@@ -180,9 +145,21 @@ class WhatsAppAdapter(BasePlatformAdapter):
            self._session_path.mkdir(parents=True, exist_ok=True)
            
            # Kill any orphaned bridge from a previous gateway run
-            _kill_port_process(self._bridge_port)
-            import time
-            time.sleep(1)
+            try:
+                result = subprocess.run(
+                    ["fuser", f"{self._bridge_port}/tcp"],
+                    capture_output=True, timeout=5,
+                )
+                if result.returncode == 0:
+                    # Port is in use — kill the process
+                    subprocess.run(
+                        ["fuser", "-k", f"{self._bridge_port}/tcp"],
+                        capture_output=True, timeout=5,
+                    )
+                    import time
+                    time.sleep(2)
+            except Exception:
+                pass
            
            # Start the bridge process in its own process group.
            # Route output to a log file so QR codes, errors, and reconnection
@@ -316,7 +293,13 @@ class WhatsAppAdapter(BasePlatformAdapter):
                print(f"[{self.name}] Error stopping bridge: {e}")
        
        # Also kill any orphaned bridge processes on our port
-        _kill_port_process(self._bridge_port)
+        try:
+            subprocess.run(
+                ["fuser", "-k", f"{self._bridge_port}/tcp"],
+                capture_output=True, timeout=5,
+            )
+        except Exception:
+            pass
        
        self._running = False
        self._bridge_process = None
--- a/gateway/run.py
+++ b/gateway/run.py
@@ -66,7 +66,6 @@ if _config_path.exists():
                "docker_image": "TERMINAL_DOCKER_IMAGE",
                "singularity_image": "TERMINAL_SINGULARITY_IMAGE",
                "modal_image": "TERMINAL_MODAL_IMAGE",
-                "daytona_image": "TERMINAL_DAYTONA_IMAGE",
                "ssh_host": "TERMINAL_SSH_HOST",
                "ssh_user": "TERMINAL_SSH_USER",
                "ssh_port": "TERMINAL_SSH_PORT",
@@ -93,11 +92,6 @@ if _config_path.exists():
        if _agent_cfg and isinstance(_agent_cfg, dict):
            if "max_turns" in _agent_cfg:
                os.environ["HERMES_MAX_ITERATIONS"] = str(_agent_cfg["max_turns"])
-        # Timezone: bridge config.yaml → HERMES_TIMEZONE env var.
-        # HERMES_TIMEZONE from .env takes precedence (already in os.environ).
-        _tz_cfg = _cfg.get("timezone", "")
-        if _tz_cfg and isinstance(_tz_cfg, str) and "HERMES_TIMEZONE" not in os.environ:
-            os.environ["HERMES_TIMEZONE"] = _tz_cfg.strip()
    except Exception:
        pass  # Non-fatal; gateway can still run with .env values

@@ -664,7 +658,7 @@ class GatewayRunner:
        # Emit command:* hook for any recognized slash command
        _known_commands = {"new", "reset", "help", "status", "stop", "model",
                          "personality", "retry", "undo", "sethome", "set-home",
-                          "compress", "usage", "insights", "reload-mcp", "update"}
+                          "compress", "usage", "reload-mcp", "update"}
        if command and command in _known_commands:
            await self.hooks.emit(f"command:{command}", {
                "platform": source.platform.value if source.platform else "",
@@ -706,9 +700,6 @@ class GatewayRunner:
        if command == "usage":
            return await self._handle_usage_command(event)

-        if command == "insights":
-            return await self._handle_insights_command(event)
-
        if command == "reload-mcp":
            return await self._handle_reload_mcp_command(event)

@@ -1112,7 +1103,6 @@ class GatewayRunner:
            "`/sethome` — Set this chat as the home channel",
            "`/compress` — Compress conversation context",
            "`/usage` — Show token usage for this session",
-            "`/insights [days]` — Show usage insights and analytics",
            "`/reload-mcp` — Reload MCP servers from config",
            "`/update` — Update Hermes Agent to the latest version",
            "`/help` — Show this message",
@@ -1263,7 +1253,8 @@ class GatewayRunner:
        )
        
        # Let the normal message handler process it
-        return await self._handle_message(retry_event)
+        await self._handle_message(retry_event)
+        return None  # Response sent through normal flow
    
    async def _handle_undo_command(self, event: MessageEvent) -> str:
        """Handle /undo command - remove the last user/assistant exchange."""
@@ -1406,53 +1397,6 @@ class GatewayRunner:
            )
        return "No usage data available for this session."

-    async def _handle_insights_command(self, event: MessageEvent) -> str:
-        """Handle /insights command -- show usage insights and analytics."""
-        import asyncio as _asyncio
-
-        args = event.get_command_args().strip()
-        days = 30
-        source = None
-
-        # Parse simple args: /insights 7  or  /insights --days 7
-        if args:
-            parts = args.split()
-            i = 0
-            while i < len(parts):
-                if parts[i] == "--days" and i + 1 < len(parts):
-                    try:
-                        days = int(parts[i + 1])
-                    except ValueError:
-                        return f"Invalid --days value: {parts[i + 1]}"
-                    i += 2
-                elif parts[i] == "--source" and i + 1 < len(parts):
-                    source = parts[i + 1]
-                    i += 2
-                elif parts[i].isdigit():
-                    days = int(parts[i])
-                    i += 1
-                else:
-                    i += 1
-
-        try:
-            from hermes_state import SessionDB
-            from agent.insights import InsightsEngine
-
-            loop = _asyncio.get_event_loop()
-
-            def _run_insights():
-                db = SessionDB()
-                engine = InsightsEngine(db)
-                report = engine.generate(days=days, source=source)
-                result = engine.format_gateway(report)
-                db.close()
-                return result
-
-            return await loop.run_in_executor(None, _run_insights)
-        except Exception as e:
-            logger.error("Insights command error: %s", e, exc_info=True)
-            return f"Error generating insights: {e}"
-
    async def _handle_reload_mcp_command(self, event: MessageEvent) -> str:
        """Handle /reload-mcp command -- disconnect and reconnect all MCP servers."""
        loop = asyncio.get_event_loop()
@@ -2097,7 +2041,7 @@ class GatewayRunner:
            os.environ["HERMES_SESSION_KEY"] = session_key or ""

            # Read from env var or use default (same as CLI)
-            max_iterations = int(os.getenv("HERMES_MAX_ITERATIONS", "90"))
+            max_iterations = int(os.getenv("HERMES_MAX_ITERATIONS", "60"))
            
            # Map platform enum to the platform hint key the agent understands.
            # Platform.LOCAL ("local") maps to "cli"; others pass through as-is.
@@ -2445,34 +2389,6 @@ async def start_gateway(config: Optional[GatewayConfig] = None) -> bool:
    Returns True if the gateway ran successfully, False if it failed to start.
    A False return causes a non-zero exit code so systemd can auto-restart.
    """
-    # ── Duplicate-instance guard ──────────────────────────────────────
-    # Prevent two gateways from running under the same HERMES_HOME.
-    # The PID file is scoped to HERMES_HOME, so future multi-profile
-    # setups (each profile using a distinct HERMES_HOME) will naturally
-    # allow concurrent instances without tripping this guard.
-    from gateway.status import get_running_pid
-    existing_pid = get_running_pid()
-    if existing_pid is not None and existing_pid != os.getpid():
-        hermes_home = os.getenv("HERMES_HOME", "~/.hermes")
-        logger.error(
-            "Another gateway instance is already running (PID %d, HERMES_HOME=%s). "
-            "Use 'hermes gateway restart' to replace it, or 'hermes gateway stop' first.",
-            existing_pid, hermes_home,
-        )
-        print(
-            f"\n❌ Gateway already running (PID {existing_pid}).\n"
-            f"   Use 'hermes gateway restart' to replace it,\n"
-            f"   or 'hermes gateway stop' to kill it first.\n"
-        )
-        return False
-
-    # Sync bundled skills on gateway start (fast -- skips unchanged)
-    try:
-        from tools.skills_sync import sync_skills
-        sync_skills(quiet=True)
-    except Exception:
-        pass
-
    # Configure rotating file log so gateway output is persisted for debugging
    log_dir = _hermes_home / 'logs'
    log_dir.mkdir(parents=True, exist_ok=True)
--- a/gateway/status.py
+++ b/gateway/status.py
@@ -3,59 +3,37 @@ Gateway runtime status helpers.

 Provides PID-file based detection of whether the gateway daemon is running,
 used by send_message's check_fn to gate availability in the CLI.
-
-The PID file lives at ``{HERMES_HOME}/gateway.pid``.  HERMES_HOME defaults to
-``~/.hermes`` but can be overridden via the environment variable.  This means
-separate HERMES_HOME directories naturally get separate PID files — a property
-that will be useful when we add named profiles (multiple agents running
-concurrently under distinct configurations).
 """

 import os
 from pathlib import Path
-from typing import Optional

-
-def _get_pid_path() -> Path:
-    """Return the path to the gateway PID file, respecting HERMES_HOME."""
-    home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
-    return home / "gateway.pid"
+_PID_FILE = Path.home() / ".hermes" / "gateway.pid"


 def write_pid_file() -> None:
    """Write the current process PID to the gateway PID file."""
-    pid_path = _get_pid_path()
-    pid_path.parent.mkdir(parents=True, exist_ok=True)
-    pid_path.write_text(str(os.getpid()))
+    _PID_FILE.parent.mkdir(parents=True, exist_ok=True)
+    _PID_FILE.write_text(str(os.getpid()))


 def remove_pid_file() -> None:
    """Remove the gateway PID file if it exists."""
    try:
-        _get_pid_path().unlink(missing_ok=True)
+        _PID_FILE.unlink(missing_ok=True)
    except Exception:
        pass


-def get_running_pid() -> Optional[int]:
-    """Return the PID of a running gateway instance, or ``None``.
-
-    Checks the PID file and verifies the process is actually alive.
-    Cleans up stale PID files automatically.
-    """
-    pid_path = _get_pid_path()
-    if not pid_path.exists():
-        return None
-    try:
-        pid = int(pid_path.read_text().strip())
-        os.kill(pid, 0)  # signal 0 = existence check, no actual signal sent
-        return pid
-    except (ValueError, ProcessLookupError, PermissionError):
-        # Stale PID file — process is gone
-        remove_pid_file()
-        return None
-
-
 def is_gateway_running() -> bool:
    """Check if the gateway daemon is currently running."""
-    return get_running_pid() is not None
+    if not _PID_FILE.exists():
+        return False
+    try:
+        pid = int(_PID_FILE.read_text().strip())
+        os.kill(pid, 0)  # signal 0 = existence check, no actual signal sent
+        return True
+    except (ValueError, ProcessLookupError, PermissionError):
+        # Stale PID file -- process is gone
+        remove_pid_file()
+        return False
--- a/hermes_cli/auth.py
+++ b/hermes_cli/auth.py
@@ -72,19 +72,15 @@ CODEX_ACCESS_TOKEN_REFRESH_SKEW_SECONDS = 120

 @dataclass
 class ProviderConfig:
-    """Describes a known inference provider."""
+    """Describes a known OAuth provider."""
    id: str
    name: str
-    auth_type: str  # "oauth_device_code", "oauth_external", or "api_key"
+    auth_type: str  # "oauth_device_code" or "oauth_external"
    portal_base_url: str = ""
    inference_base_url: str = ""
    client_id: str = ""
    scope: str = ""
    extra: Dict[str, Any] = field(default_factory=dict)
-    # For API-key providers: env vars to check (in priority order)
-    api_key_env_vars: tuple = ()
-    # Optional env var for base URL override
-    base_url_env_var: str = ""


 PROVIDER_REGISTRY: Dict[str, ProviderConfig] = {
@@ -103,38 +99,6 @@ PROVIDER_REGISTRY: Dict[str, ProviderConfig] = {
        auth_type="oauth_external",
        inference_base_url=DEFAULT_CODEX_BASE_URL,
    ),
-    "zai": ProviderConfig(
-        id="zai",
-        name="Z.AI / GLM",
-        auth_type="api_key",
-        inference_base_url="https://api.z.ai/api/paas/v4",
-        api_key_env_vars=("GLM_API_KEY", "ZAI_API_KEY", "Z_AI_API_KEY"),
-        base_url_env_var="GLM_BASE_URL",
-    ),
-    "kimi-coding": ProviderConfig(
-        id="kimi-coding",
-        name="Kimi / Moonshot",
-        auth_type="api_key",
-        inference_base_url="https://api.moonshot.ai/v1",
-        api_key_env_vars=("KIMI_API_KEY",),
-        base_url_env_var="KIMI_BASE_URL",
-    ),
-    "minimax": ProviderConfig(
-        id="minimax",
-        name="MiniMax",
-        auth_type="api_key",
-        inference_base_url="https://api.minimax.io/v1",
-        api_key_env_vars=("MINIMAX_API_KEY",),
-        base_url_env_var="MINIMAX_BASE_URL",
-    ),
-    "minimax-cn": ProviderConfig(
-        id="minimax-cn",
-        name="MiniMax (China)",
-        auth_type="api_key",
-        inference_base_url="https://api.minimaxi.com/v1",
-        api_key_env_vars=("MINIMAX_CN_API_KEY",),
-        base_url_env_var="MINIMAX_CN_BASE_URL",
-    ),
 }


@@ -391,19 +355,10 @@ def resolve_provider(
    1. active_provider in auth.json with valid credentials
    2. Explicit CLI api_key/base_url -> "openrouter"
    3. OPENAI_API_KEY or OPENROUTER_API_KEY env vars -> "openrouter"
-    4. Provider-specific API keys (GLM, Kimi, MiniMax) -> that provider
-    5. Fallback: "openrouter"
+    4. Fallback: "openrouter"
    """
    normalized = (requested or "auto").strip().lower()

-    # Normalize provider aliases
-    _PROVIDER_ALIASES = {
-        "glm": "zai", "z-ai": "zai", "z.ai": "zai", "zhipu": "zai",
-        "kimi": "kimi-coding", "moonshot": "kimi-coding",
-        "minimax-china": "minimax-cn", "minimax_cn": "minimax-cn",
-    }
-    normalized = _PROVIDER_ALIASES.get(normalized, normalized)
-
    if normalized in {"openrouter", "custom"}:
        return "openrouter"
    if normalized in PROVIDER_REGISTRY:
@@ -432,14 +387,6 @@ def resolve_provider(
    if os.getenv("OPENAI_API_KEY") or os.getenv("OPENROUTER_API_KEY"):
        return "openrouter"

-    # Auto-detect API-key providers by checking their env vars
-    for pid, pconfig in PROVIDER_REGISTRY.items():
-        if pconfig.auth_type != "api_key":
-            continue
-        for env_var in pconfig.api_key_env_vars:
-            if os.getenv(env_var, "").strip():
-                return pid
-
    return "openrouter"


@@ -1283,37 +1230,6 @@ def get_codex_auth_status() -> Dict[str, Any]:
        }


-def get_api_key_provider_status(provider_id: str) -> Dict[str, Any]:
-    """Status snapshot for API-key providers (z.ai, Kimi, MiniMax)."""
-    pconfig = PROVIDER_REGISTRY.get(provider_id)
-    if not pconfig or pconfig.auth_type != "api_key":
-        return {"configured": False}
-
-    api_key = ""
-    key_source = ""
-    for env_var in pconfig.api_key_env_vars:
-        val = os.getenv(env_var, "").strip()
-        if val:
-            api_key = val
-            key_source = env_var
-            break
-
-    base_url = pconfig.inference_base_url
-    if pconfig.base_url_env_var:
-        env_url = os.getenv(pconfig.base_url_env_var, "").strip()
-        if env_url:
-            base_url = env_url
-
-    return {
-        "configured": bool(api_key),
-        "provider": provider_id,
-        "name": pconfig.name,
-        "key_source": key_source,
-        "base_url": base_url,
-        "logged_in": bool(api_key),  # compat with OAuth status shape
-    }
-
-
 def get_auth_status(provider_id: Optional[str] = None) -> Dict[str, Any]:
    """Generic auth status dispatcher."""
    target = provider_id or get_active_provider()
@@ -1321,49 +1237,9 @@ def get_auth_status(provider_id: Optional[str] = None) -> Dict[str, Any]:
        return get_nous_auth_status()
    if target == "openai-codex":
        return get_codex_auth_status()
-    # API-key providers
-    pconfig = PROVIDER_REGISTRY.get(target)
-    if pconfig and pconfig.auth_type == "api_key":
-        return get_api_key_provider_status(target)
    return {"logged_in": False}


-def resolve_api_key_provider_credentials(provider_id: str) -> Dict[str, Any]:
-    """Resolve API key and base URL for an API-key provider.
-
-    Returns dict with: provider, api_key, base_url, source.
-    """
-    pconfig = PROVIDER_REGISTRY.get(provider_id)
-    if not pconfig or pconfig.auth_type != "api_key":
-        raise AuthError(
-            f"Provider '{provider_id}' is not an API-key provider.",
-            provider=provider_id,
-            code="invalid_provider",
-        )
-
-    api_key = ""
-    key_source = ""
-    for env_var in pconfig.api_key_env_vars:
-        val = os.getenv(env_var, "").strip()
-        if val:
-            api_key = val
-            key_source = env_var
-            break
-
-    base_url = pconfig.inference_base_url
-    if pconfig.base_url_env_var:
-        env_url = os.getenv(pconfig.base_url_env_var, "").strip()
-        if env_url:
-            base_url = env_url.rstrip("/")
-
-    return {
-        "provider": provider_id,
-        "api_key": api_key,
-        "base_url": base_url.rstrip("/"),
-        "source": key_source or "default",
-    }
-
-
 # =============================================================================
 # External credential detection
 # =============================================================================
--- a/hermes_cli/banner.py
+++ b/hermes_cli/banner.py
@@ -1,15 +1,10 @@
-"""Welcome banner, ASCII art, skills summary, and update check for the CLI.
+"""Welcome banner, ASCII art, and skills summary for the CLI.

 Pure display functions with no HermesCLI state dependency.
 """

-import json
-import logging
-import os
-import subprocess
-import time
 from pathlib import Path
-from typing import Dict, List, Any, Optional
+from typing import Dict, List, Any

 from rich.console import Console
 from rich.panel import Panel
@@ -18,8 +13,6 @@ from rich.table import Table
 from prompt_toolkit import print_formatted_text as _pt_print
 from prompt_toolkit.formatted_text import ANSI as _PT_ANSI

-logger = logging.getLogger(__name__)
-

 # =========================================================================
 # ANSI building blocks for conversation display
@@ -102,93 +95,15 @@ def get_available_skills() -> Dict[str, List[str]]:
    return skills_by_category


-# =========================================================================
-# Update check
-# =========================================================================
-
-# Cache update check results for 6 hours to avoid repeated git fetches
-_UPDATE_CHECK_CACHE_SECONDS = 6 * 3600
-
-
-def check_for_updates() -> Optional[int]:
-    """Check how many commits behind origin/main the local repo is.
-
-    Does a ``git fetch`` at most once every 6 hours (cached to
-    ``~/.hermes/.update_check``).  Returns the number of commits behind,
-    or ``None`` if the check fails or isn't applicable.
-    """
-    hermes_home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
-    repo_dir = hermes_home / "hermes-agent"
-    cache_file = hermes_home / ".update_check"
-
-    # Must be a git repo
-    if not (repo_dir / ".git").exists():
-        return None
-
-    # Read cache
-    now = time.time()
-    try:
-        if cache_file.exists():
-            cached = json.loads(cache_file.read_text())
-            if now - cached.get("ts", 0) < _UPDATE_CHECK_CACHE_SECONDS:
-                return cached.get("behind")
-    except Exception:
-        pass
-
-    # Fetch latest refs (fast — only downloads ref metadata, no files)
-    try:
-        subprocess.run(
-            ["git", "fetch", "origin", "--quiet"],
-            capture_output=True, timeout=10,
-            cwd=str(repo_dir),
-        )
-    except Exception:
-        pass  # Offline or timeout — use stale refs, that's fine
-
-    # Count commits behind
-    try:
-        result = subprocess.run(
-            ["git", "rev-list", "--count", "HEAD..origin/main"],
-            capture_output=True, text=True, timeout=5,
-            cwd=str(repo_dir),
-        )
-        if result.returncode == 0:
-            behind = int(result.stdout.strip())
-        else:
-            behind = None
-    except Exception:
-        behind = None
-
-    # Write cache
-    try:
-        cache_file.write_text(json.dumps({"ts": now, "behind": behind}))
-    except Exception:
-        pass
-
-    return behind
-
-
 # =========================================================================
 # Welcome banner
 # =========================================================================

-def _format_context_length(tokens: int) -> str:
-    """Format a token count for display (e.g. 128000 → '128K', 1048576 → '1M')."""
-    if tokens >= 1_000_000:
-        val = tokens / 1_000_000
-        return f"{val:g}M"
-    elif tokens >= 1_000:
-        val = tokens / 1_000
-        return f"{val:g}K"
-    return str(tokens)
-
-
 def build_welcome_banner(console: Console, model: str, cwd: str,
                         tools: List[dict] = None,
                         enabled_toolsets: List[str] = None,
                         session_id: str = None,
-                         get_toolset_for_tool=None,
-                         context_length: int = None):
+                         get_toolset_for_tool=None):
    """Build and print a welcome banner with caduceus on left and info on right.

    Args:
@@ -199,7 +114,6 @@ def build_welcome_banner(console: Console, model: str, cwd: str,
        enabled_toolsets: List of enabled toolset names.
        session_id: Session identifier.
        get_toolset_for_tool: Callable to map tool name -> toolset name.
-        context_length: Model's context window size in tokens.
    """
    from model_tools import check_tool_availability, TOOLSET_REQUIREMENTS
    if get_toolset_for_tool is None:
@@ -221,8 +135,7 @@ def build_welcome_banner(console: Console, model: str, cwd: str,
    model_short = model.split("/")[-1] if "/" in model else model
    if len(model_short) > 28:
        model_short = model_short[:25] + "..."
-    ctx_str = f" [dim #B8860B]·[/] [dim #B8860B]{_format_context_length(context_length)} context[/]" if context_length else ""
-    left_lines.append(f"[#FFBF00]{model_short}[/]{ctx_str} [dim #B8860B]·[/] [dim #B8860B]Nous Research[/]")
+    left_lines.append(f"[#FFBF00]{model_short}[/] [dim #B8860B]·[/] [dim #B8860B]Nous Research[/]")
    left_lines.append(f"[dim #B8860B]{cwd}[/]")
    if session_id:
        left_lines.append(f"[dim #8B8682]Session: {session_id}[/]")
@@ -332,18 +245,6 @@ def build_welcome_banner(console: Console, model: str, cwd: str,
    summary_parts.append("/help for commands")
    right_lines.append(f"[dim #B8860B]{' · '.join(summary_parts)}[/]")

-    # Update check — show if behind origin/main
-    try:
-        behind = check_for_updates()
-        if behind and behind > 0:
-            commits_word = "commit" if behind == 1 else "commits"
-            right_lines.append(
-                f"[bold yellow]⚠ {behind} {commits_word} behind[/]"
-                f"[dim yellow] — run [bold]hermes update[/bold] to update[/]"
-            )
-    except Exception:
-        pass  # Never break the banner over an update check
-
    right_content = "\n".join(right_lines)
    layout_table.add_row(left_content, right_content)

--- a/hermes_cli/clipboard.py
+++ b/hermes_cli/clipboard.py
@@ -1,352 +0,0 @@
-"""Clipboard image extraction for macOS, Linux, and WSL2.
-
-Provides a single function `save_clipboard_image(dest)` that checks the
-system clipboard for image data, saves it to *dest* as PNG, and returns
-True on success.  No external Python dependencies — uses only OS-level
-CLI tools that ship with the platform (or are commonly installed).
-
-Platform support:
-  macOS  — osascript (always available), pngpaste (if installed)
-  WSL2   — powershell.exe via .NET System.Windows.Forms.Clipboard
-  Linux  — wl-paste (Wayland), xclip (X11)
-"""
-
-import base64
-import logging
-import os
-import subprocess
-import sys
-from pathlib import Path
-
-logger = logging.getLogger(__name__)
-
-# Cache WSL detection (checked once per process)
-_wsl_detected: bool | None = None
-
-
-def save_clipboard_image(dest: Path) -> bool:
-    """Extract an image from the system clipboard and save it as PNG.
-
-    Returns True if an image was found and saved, False otherwise.
-    """
-    dest.parent.mkdir(parents=True, exist_ok=True)
-    if sys.platform == "darwin":
-        return _macos_save(dest)
-    return _linux_save(dest)
-
-
-def has_clipboard_image() -> bool:
-    """Quick check: does the clipboard currently contain an image?
-
-    Lighter than save_clipboard_image — doesn't extract or write anything.
-    """
-    if sys.platform == "darwin":
-        return _macos_has_image()
-    if _is_wsl():
-        return _wsl_has_image()
-    if os.environ.get("WAYLAND_DISPLAY"):
-        return _wayland_has_image()
-    return _xclip_has_image()
-
-
-# ── macOS ────────────────────────────────────────────────────────────────
-
-def _macos_save(dest: Path) -> bool:
-    """Try pngpaste first (fast, handles more formats), fall back to osascript."""
-    return _macos_pngpaste(dest) or _macos_osascript(dest)
-
-
-def _macos_has_image() -> bool:
-    """Check if macOS clipboard contains image data."""
-    try:
-        info = subprocess.run(
-            ["osascript", "-e", "clipboard info"],
-            capture_output=True, text=True, timeout=3,
-        )
-        return "«class PNGf»" in info.stdout or "«class TIFF»" in info.stdout
-    except Exception:
-        return False
-
-
-def _macos_pngpaste(dest: Path) -> bool:
-    """Use pngpaste (brew install pngpaste) — fastest, cleanest."""
-    try:
-        r = subprocess.run(
-            ["pngpaste", str(dest)],
-            capture_output=True, timeout=3,
-        )
-        if r.returncode == 0 and dest.exists() and dest.stat().st_size > 0:
-            return True
-    except FileNotFoundError:
-        pass  # pngpaste not installed
-    except Exception as e:
-        logger.debug("pngpaste failed: %s", e)
-    return False
-
-
-def _macos_osascript(dest: Path) -> bool:
-    """Use osascript to extract PNG data from clipboard (always available)."""
-    if not _macos_has_image():
-        return False
-
-    # Extract as PNG
-    script = (
-        'try\n'
-        '  set imgData to the clipboard as «class PNGf»\n'
-        f'  set f to open for access POSIX file "{dest}" with write permission\n'
-        '  write imgData to f\n'
-        '  close access f\n'
-        'on error\n'
-        '  return "fail"\n'
-        'end try\n'
-    )
-    try:
-        r = subprocess.run(
-            ["osascript", "-e", script],
-            capture_output=True, text=True, timeout=5,
-        )
-        if r.returncode == 0 and "fail" not in r.stdout and dest.exists() and dest.stat().st_size > 0:
-            return True
-    except Exception as e:
-        logger.debug("osascript clipboard extract failed: %s", e)
-    return False
-
-
-# ── Linux ────────────────────────────────────────────────────────────────
-
-def _is_wsl() -> bool:
-    """Detect if running inside WSL (1 or 2)."""
-    global _wsl_detected
-    if _wsl_detected is not None:
-        return _wsl_detected
-    try:
-        with open("/proc/version", "r") as f:
-            _wsl_detected = "microsoft" in f.read().lower()
-    except Exception:
-        _wsl_detected = False
-    return _wsl_detected
-
-
-def _linux_save(dest: Path) -> bool:
-    """Try clipboard backends in priority order: WSL → Wayland → X11."""
-    if _is_wsl():
-        if _wsl_save(dest):
-            return True
-        # Fall through — WSLg might have wl-paste or xclip working
-
-    if os.environ.get("WAYLAND_DISPLAY"):
-        if _wayland_save(dest):
-            return True
-
-    return _xclip_save(dest)
-
-
-# ── WSL2 (powershell.exe) ────────────────────────────────────────────────
-
-# PowerShell script: get clipboard image as base64-encoded PNG on stdout.
-# Using .NET System.Windows.Forms.Clipboard — always available on Windows.
-_PS_CHECK_IMAGE = (
-    "Add-Type -AssemblyName System.Windows.Forms;"
-    "[System.Windows.Forms.Clipboard]::ContainsImage()"
-)
-
-_PS_EXTRACT_IMAGE = (
-    "Add-Type -AssemblyName System.Windows.Forms;"
-    "Add-Type -AssemblyName System.Drawing;"
-    "$img = [System.Windows.Forms.Clipboard]::GetImage();"
-    "if ($null -eq $img) { exit 1 }"
-    "$ms = New-Object System.IO.MemoryStream;"
-    "$img.Save($ms, [System.Drawing.Imaging.ImageFormat]::Png);"
-    "[System.Convert]::ToBase64String($ms.ToArray())"
-)
-
-
-def _wsl_has_image() -> bool:
-    """Check if Windows clipboard has an image (via powershell.exe)."""
-    try:
-        r = subprocess.run(
-            ["powershell.exe", "-NoProfile", "-NonInteractive", "-Command",
-             _PS_CHECK_IMAGE],
-            capture_output=True, text=True, timeout=8,
-        )
-        return r.returncode == 0 and "True" in r.stdout
-    except FileNotFoundError:
-        logger.debug("powershell.exe not found — WSL clipboard unavailable")
-    except Exception as e:
-        logger.debug("WSL clipboard check failed: %s", e)
-    return False
-
-
-def _wsl_save(dest: Path) -> bool:
-    """Extract clipboard image via powershell.exe → base64 → decode to PNG."""
-    try:
-        r = subprocess.run(
-            ["powershell.exe", "-NoProfile", "-NonInteractive", "-Command",
-             _PS_EXTRACT_IMAGE],
-            capture_output=True, text=True, timeout=15,
-        )
-        if r.returncode != 0:
-            return False
-
-        b64_data = r.stdout.strip()
-        if not b64_data:
-            return False
-
-        png_bytes = base64.b64decode(b64_data)
-        dest.write_bytes(png_bytes)
-        return dest.exists() and dest.stat().st_size > 0
-
-    except FileNotFoundError:
-        logger.debug("powershell.exe not found — WSL clipboard unavailable")
-    except Exception as e:
-        logger.debug("WSL clipboard extraction failed: %s", e)
-        dest.unlink(missing_ok=True)
-    return False
-
-
-# ── Wayland (wl-paste) ──────────────────────────────────────────────────
-
-def _wayland_has_image() -> bool:
-    """Check if Wayland clipboard has image content."""
-    try:
-        r = subprocess.run(
-            ["wl-paste", "--list-types"],
-            capture_output=True, text=True, timeout=3,
-        )
-        return r.returncode == 0 and any(
-            t.startswith("image/") for t in r.stdout.splitlines()
-        )
-    except FileNotFoundError:
-        logger.debug("wl-paste not installed — Wayland clipboard unavailable")
-    except Exception:
-        pass
-    return False
-
-
-def _wayland_save(dest: Path) -> bool:
-    """Use wl-paste to extract clipboard image (Wayland sessions)."""
-    try:
-        # Check available MIME types
-        types_r = subprocess.run(
-            ["wl-paste", "--list-types"],
-            capture_output=True, text=True, timeout=3,
-        )
-        if types_r.returncode != 0:
-            return False
-        types = types_r.stdout.splitlines()
-
-        # Prefer PNG, fall back to other image formats
-        mime = None
-        for preferred in ("image/png", "image/jpeg", "image/bmp",
-                          "image/gif", "image/webp"):
-            if preferred in types:
-                mime = preferred
-                break
-
-        if not mime:
-            return False
-
-        # Extract the image data
-        with open(dest, "wb") as f:
-            subprocess.run(
-                ["wl-paste", "--type", mime],
-                stdout=f, stderr=subprocess.DEVNULL, timeout=5, check=True,
-            )
-
-        if not dest.exists() or dest.stat().st_size == 0:
-            return False
-
-        # BMP needs conversion to PNG (common in WSLg where only BMP
-        # is bridged from Windows clipboard via RDP).
-        if mime == "image/bmp":
-            return _convert_to_png(dest)
-
-        return True
-
-    except FileNotFoundError:
-        logger.debug("wl-paste not installed — Wayland clipboard unavailable")
-    except Exception as e:
-        logger.debug("wl-paste clipboard extraction failed: %s", e)
-        dest.unlink(missing_ok=True)
-    return False
-
-
-def _convert_to_png(path: Path) -> bool:
-    """Convert an image file to PNG in-place (requires Pillow or ImageMagick)."""
-    # Try Pillow first (likely installed in the venv)
-    try:
-        from PIL import Image
-        img = Image.open(path)
-        img.save(path, "PNG")
-        return True
-    except ImportError:
-        pass
-    except Exception as e:
-        logger.debug("Pillow BMP→PNG conversion failed: %s", e)
-
-    # Fall back to ImageMagick convert
-    try:
-        tmp = path.with_suffix(".bmp")
-        path.rename(tmp)
-        r = subprocess.run(
-            ["convert", str(tmp), "png:" + str(path)],
-            capture_output=True, timeout=5,
-        )
-        tmp.unlink(missing_ok=True)
-        if r.returncode == 0 and path.exists() and path.stat().st_size > 0:
-            return True
-    except FileNotFoundError:
-        logger.debug("ImageMagick not installed — cannot convert BMP to PNG")
-    except Exception as e:
-        logger.debug("ImageMagick BMP→PNG conversion failed: %s", e)
-
-    # Can't convert — BMP is still usable as-is for most APIs
-    return path.exists() and path.stat().st_size > 0
-
-
-# ── X11 (xclip) ─────────────────────────────────────────────────────────
-
-def _xclip_has_image() -> bool:
-    """Check if X11 clipboard has image content."""
-    try:
-        r = subprocess.run(
-            ["xclip", "-selection", "clipboard", "-t", "TARGETS", "-o"],
-            capture_output=True, text=True, timeout=3,
-        )
-        return r.returncode == 0 and "image/png" in r.stdout
-    except FileNotFoundError:
-        pass
-    except Exception:
-        pass
-    return False
-
-
-def _xclip_save(dest: Path) -> bool:
-    """Use xclip to extract clipboard image (X11 sessions)."""
-    # Check if clipboard has image content
-    try:
-        targets = subprocess.run(
-            ["xclip", "-selection", "clipboard", "-t", "TARGETS", "-o"],
-            capture_output=True, text=True, timeout=3,
-        )
-        if "image/png" not in targets.stdout:
-            return False
-    except FileNotFoundError:
-        logger.debug("xclip not installed — X11 clipboard image paste unavailable")
-        return False
-    except Exception:
-        return False
-
-    # Extract PNG data
-    try:
-        with open(dest, "wb") as f:
-            subprocess.run(
-                ["xclip", "-selection", "clipboard", "-t", "image/png", "-o"],
-                stdout=f, stderr=subprocess.DEVNULL, timeout=5, check=True,
-            )
-        if dest.exists() and dest.stat().st_size > 0:
-            return True
-    except Exception as e:
-        logger.debug("xclip image extraction failed: %s", e)
-        dest.unlink(missing_ok=True)
-    return False
--- a/hermes_cli/commands.py
+++ b/hermes_cli/commands.py
@@ -28,7 +28,6 @@ COMMANDS = {
    "/verbose": "Cycle tool progress display: off → new → all → verbose",
    "/compress": "Manually compress conversation context (flush memories + summarize)",
    "/usage": "Show token usage for the current session",
-    "/insights": "Show usage insights and analytics (last 30 days)",
    "/quit": "Exit the CLI (also: /exit, /q)",
 }

--- a/hermes_cli/config.py
+++ b/hermes_cli/config.py
@@ -71,8 +71,7 @@ DEFAULT_CONFIG = {
        "docker_image": "nikolaik/python-nodejs:python3.11-nodejs20",
        "singularity_image": "docker://nikolaik/python-nodejs:python3.11-nodejs20",
        "modal_image": "nikolaik/python-nodejs:python3.11-nodejs20",
-        "daytona_image": "nikolaik/python-nodejs:python3.11-nodejs20",
-        # Container resource limits (docker, singularity, modal, daytona — ignored for local/ssh)
+        # Container resource limits (docker, singularity, modal — ignored for local/ssh)
        "container_cpu": 1,
        "container_memory": 5120,       # MB (default 5GB)
        "container_disk": 51200,        # MB (default 50GB)
@@ -87,20 +86,6 @@ DEFAULT_CONFIG = {
        "enabled": True,
        "threshold": 0.85,
        "summary_model": "google/gemini-3-flash-preview",
-        "summary_provider": "auto",
-    },
-    
-    # Auxiliary model overrides (advanced).  By default Hermes auto-selects
-    # the provider and model for each side task.  Set these to override.
-    "auxiliary": {
-        "vision": {
-            "provider": "auto",    # auto | openrouter | nous | main
-            "model": "",           # e.g. "google/gemini-2.5-flash", "gpt-4o"
-        },
-        "web_extract": {
-            "provider": "auto",
-            "model": "",
-        },
    },
    
    "display": {
@@ -155,13 +140,9 @@ DEFAULT_CONFIG = {
    # (apiKey, workspace, peerName, sessions, enabled) comes from the global config.
    "honcho": {},

-    # IANA timezone (e.g. "Asia/Kolkata", "America/New_York").
-    # Empty string means use server-local time.
-    "timezone": "",
-
    # Permanently allowed dangerous command patterns (added via "always" approval)
    "command_allowlist": [],
-
+    
    # Config schema version - bump this when adding new required fields
    "_config_version": 5,
 }
@@ -188,86 +169,6 @@ OPTIONAL_ENV_VARS = {
        "category": "provider",
        "advanced": True,
    },
-    "GLM_API_KEY": {
-        "description": "Z.AI / GLM API key (also recognized as ZAI_API_KEY / Z_AI_API_KEY)",
-        "prompt": "Z.AI / GLM API key",
-        "url": "https://z.ai/",
-        "password": True,
-        "category": "provider",
-        "advanced": True,
-    },
-    "ZAI_API_KEY": {
-        "description": "Z.AI API key (alias for GLM_API_KEY)",
-        "prompt": "Z.AI API key",
-        "url": "https://z.ai/",
-        "password": True,
-        "category": "provider",
-        "advanced": True,
-    },
-    "Z_AI_API_KEY": {
-        "description": "Z.AI API key (alias for GLM_API_KEY)",
-        "prompt": "Z.AI API key",
-        "url": "https://z.ai/",
-        "password": True,
-        "category": "provider",
-        "advanced": True,
-    },
-    "GLM_BASE_URL": {
-        "description": "Z.AI / GLM base URL override",
-        "prompt": "Z.AI / GLM base URL (leave empty for default)",
-        "url": None,
-        "password": False,
-        "category": "provider",
-        "advanced": True,
-    },
-    "KIMI_API_KEY": {
-        "description": "Kimi / Moonshot API key",
-        "prompt": "Kimi API key",
-        "url": "https://platform.moonshot.cn/",
-        "password": True,
-        "category": "provider",
-        "advanced": True,
-    },
-    "KIMI_BASE_URL": {
-        "description": "Kimi / Moonshot base URL override",
-        "prompt": "Kimi base URL (leave empty for default)",
-        "url": None,
-        "password": False,
-        "category": "provider",
-        "advanced": True,
-    },
-    "MINIMAX_API_KEY": {
-        "description": "MiniMax API key (international)",
-        "prompt": "MiniMax API key",
-        "url": "https://www.minimax.io/",
-        "password": True,
-        "category": "provider",
-        "advanced": True,
-    },
-    "MINIMAX_BASE_URL": {
-        "description": "MiniMax base URL override",
-        "prompt": "MiniMax base URL (leave empty for default)",
-        "url": None,
-        "password": False,
-        "category": "provider",
-        "advanced": True,
-    },
-    "MINIMAX_CN_API_KEY": {
-        "description": "MiniMax API key (China endpoint)",
-        "prompt": "MiniMax (China) API key",
-        "url": "https://www.minimaxi.com/",
-        "password": True,
-        "category": "provider",
-        "advanced": True,
-    },
-    "MINIMAX_CN_BASE_URL": {
-        "description": "MiniMax (China) base URL override",
-        "prompt": "MiniMax (China) base URL (leave empty for default)",
-        "url": None,
-        "password": False,
-        "category": "provider",
-        "advanced": True,
-    },

    # ── Tool API keys ──
    "FIRECRAWL_API_KEY": {
@@ -278,16 +179,8 @@ OPTIONAL_ENV_VARS = {
        "password": True,
        "category": "tool",
    },
-    "FIRECRAWL_API_URL": {
-        "description": "Firecrawl API URL for self-hosted instances (optional)",
-        "prompt": "Firecrawl API URL (leave empty for cloud)",
-        "url": None,
-        "password": False,
-        "category": "tool",
-        "advanced": True,
-    },
    "BROWSERBASE_API_KEY": {
-        "description": "Browserbase API key for cloud browser (optional — local browser works without this)",
+        "description": "Browserbase API key for browser automation",
        "prompt": "Browserbase API key",
        "url": "https://browserbase.com/",
        "tools": ["browser_navigate", "browser_click"],
@@ -295,7 +188,7 @@ OPTIONAL_ENV_VARS = {
        "category": "tool",
    },
    "BROWSERBASE_PROJECT_ID": {
-        "description": "Browserbase project ID (optional — only needed for cloud browser)",
+        "description": "Browserbase project ID",
        "prompt": "Browserbase project ID",
        "url": "https://browserbase.com/",
        "tools": ["browser_navigate", "browser_click"],
@@ -583,22 +476,6 @@ def migrate_config(interactive: bool = True, quiet: bool = False) -> Dict[str, A
            if not quiet:
                print(f"  ✓ Migrated tool progress to config.yaml: {display['tool_progress']}")
    
-    # ── Version 4 → 5: add timezone field ──
-    if current_ver < 5:
-        config = load_config()
-        if "timezone" not in config:
-            old_tz = os.getenv("HERMES_TIMEZONE", "")
-            if old_tz and old_tz.strip():
-                config["timezone"] = old_tz.strip()
-                results["config_added"].append(f"timezone={old_tz.strip()} (from HERMES_TIMEZONE)")
-            else:
-                config["timezone"] = ""
-                results["config_added"].append("timezone= (empty, uses server-local)")
-            save_config(config)
-            if not quiet:
-                tz_display = config["timezone"] or "(server-local)"
-                print(f"  ✓ Added timezone to config.yaml: {tz_display}")
-
    if current_ver < latest_ver and not quiet:
        print(f"Config version: {current_ver} → {latest_ver}")
    
@@ -876,25 +753,12 @@ def show_config():
        print(f"  Modal image:  {terminal.get('modal_image', 'python:3.11')}")
        modal_token = get_env_value('MODAL_TOKEN_ID')
        print(f"  Modal token:  {'configured' if modal_token else '(not set)'}")
-    elif terminal.get('backend') == 'daytona':
-        print(f"  Daytona image: {terminal.get('daytona_image', 'nikolaik/python-nodejs:python3.11-nodejs20')}")
-        daytona_key = get_env_value('DAYTONA_API_KEY')
-        print(f"  API key:      {'configured' if daytona_key else '(not set)'}")
    elif terminal.get('backend') == 'ssh':
        ssh_host = get_env_value('TERMINAL_SSH_HOST')
        ssh_user = get_env_value('TERMINAL_SSH_USER')
        print(f"  SSH host:     {ssh_host or '(not set)'}")
        print(f"  SSH user:     {ssh_user or '(not set)'}")
    
-    # Timezone
-    print()
-    print(color("◆ Timezone", Colors.CYAN, Colors.BOLD))
-    tz = config.get('timezone', '')
-    if tz:
-        print(f"  Timezone:     {tz}")
-    else:
-        print(f"  Timezone:     {color('(server-local)', Colors.DIM)}")
-
    # Compression
    print()
    print(color("◆ Context Compression", Colors.CYAN, Colors.BOLD))
@@ -904,31 +768,6 @@ def show_config():
    if enabled:
        print(f"  Threshold:    {compression.get('threshold', 0.85) * 100:.0f}%")
        print(f"  Model:        {compression.get('summary_model', 'google/gemini-3-flash-preview')}")
-        comp_provider = compression.get('summary_provider', 'auto')
-        if comp_provider != 'auto':
-            print(f"  Provider:     {comp_provider}")
-    
-    # Auxiliary models
-    auxiliary = config.get('auxiliary', {})
-    aux_tasks = {
-        "Vision":      auxiliary.get('vision', {}),
-        "Web extract": auxiliary.get('web_extract', {}),
-    }
-    has_overrides = any(
-        t.get('provider', 'auto') != 'auto' or t.get('model', '')
-        for t in aux_tasks.values()
-    )
-    if has_overrides:
-        print()
-        print(color("◆ Auxiliary Models (overrides)", Colors.CYAN, Colors.BOLD))
-        for label, task_cfg in aux_tasks.items():
-            prov = task_cfg.get('provider', 'auto')
-            mdl = task_cfg.get('model', '')
-            if prov != 'auto' or mdl:
-                parts = [f"provider={prov}"]
-                if mdl:
-                    parts.append(f"model={mdl}")
-                print(f"  {label:12s}  {', '.join(parts)}")
    
    # Messaging
    print()
@@ -981,16 +820,15 @@ def set_config_value(key: str, value: str):
    """Set a configuration value."""
    # Check if it's an API key (goes to .env)
    api_keys = [
-        'OPENROUTER_API_KEY', 'OPENAI_API_KEY', 'ANTHROPIC_API_KEY', 'VOICE_TOOLS_OPENAI_KEY',
-        'FIRECRAWL_API_KEY', 'FIRECRAWL_API_URL', 'BROWSERBASE_API_KEY', 'BROWSERBASE_PROJECT_ID',
+        'OPENROUTER_API_KEY', 'ANTHROPIC_API_KEY', 'VOICE_TOOLS_OPENAI_KEY',
+        'FIRECRAWL_API_KEY', 'BROWSERBASE_API_KEY', 'BROWSERBASE_PROJECT_ID',
        'FAL_KEY', 'TELEGRAM_BOT_TOKEN', 'DISCORD_BOT_TOKEN',
        'TERMINAL_SSH_HOST', 'TERMINAL_SSH_USER', 'TERMINAL_SSH_KEY',
        'SUDO_PASSWORD', 'SLACK_BOT_TOKEN', 'SLACK_APP_TOKEN',
-        'GITHUB_TOKEN', 'HONCHO_API_KEY', 'NOUS_API_KEY', 'WANDB_API_KEY',
-        'TINKER_API_KEY',
+        'GITHUB_TOKEN', 'HONCHO_API_KEY',
    ]
    
-    if key.upper() in api_keys or key.upper().endswith('_API_KEY') or key.upper().endswith('_TOKEN') or key.upper().startswith('TERMINAL_SSH'):
+    if key.upper() in api_keys or key.upper().startswith('TERMINAL_SSH'):
        save_env_value(key.upper(), value)
        print(f"✓ Set {key} in {get_env_path()}")
        return
@@ -1040,7 +878,6 @@ def set_config_value(key: str, value: str):
        "terminal.docker_image": "TERMINAL_DOCKER_IMAGE",
        "terminal.singularity_image": "TERMINAL_SINGULARITY_IMAGE",
        "terminal.modal_image": "TERMINAL_MODAL_IMAGE",
-        "terminal.daytona_image": "TERMINAL_DAYTONA_IMAGE",
        "terminal.cwd": "TERMINAL_CWD",
        "terminal.timeout": "TERMINAL_TIMEOUT",
    }
--- a/hermes_cli/doctor.py
+++ b/hermes_cli/doctor.py
@@ -132,11 +132,7 @@ def run_doctor(args):
        
        # Check for common issues
        content = env_path.read_text()
-        if any(k in content for k in (
-            "OPENROUTER_API_KEY", "ANTHROPIC_API_KEY",
-            "GLM_API_KEY", "ZAI_API_KEY", "Z_AI_API_KEY",
-            "KIMI_API_KEY", "MINIMAX_API_KEY", "MINIMAX_CN_API_KEY",
-        )):
+        if "OPENROUTER_API_KEY" in content or "ANTHROPIC_API_KEY" in content:
            check_ok("API key configured")
        else:
            check_warn("No API key found in ~/.hermes/.env")
@@ -359,21 +355,6 @@ def run_doctor(args):
            check_fail("TERMINAL_SSH_HOST not set", "(required for TERMINAL_ENV=ssh)")
            issues.append("Set TERMINAL_SSH_HOST in .env")
    
-    # Daytona (if using daytona backend)
-    if terminal_env == "daytona":
-        daytona_key = os.getenv("DAYTONA_API_KEY")
-        if daytona_key:
-            check_ok("Daytona API key", "(configured)")
-        else:
-            check_fail("DAYTONA_API_KEY not set", "(required for TERMINAL_ENV=daytona)")
-            issues.append("Set DAYTONA_API_KEY environment variable")
-        try:
-            from daytona import Daytona
-            check_ok("daytona SDK", "(installed)")
-        except ImportError:
-            check_fail("daytona SDK not installed", "(pip install daytona)")
-            issues.append("Install daytona SDK: pip install daytona")
-
    # Node.js + agent-browser (for browser automation tools)
    if shutil.which("node"):
        check_ok("Node.js")
@@ -472,42 +453,7 @@ def run_doctor(args):
                print(f"\r  {color('⚠', Colors.YELLOW)} Anthropic API {color(msg, Colors.DIM)}                 ")
        except Exception as e:
            print(f"\r  {color('⚠', Colors.YELLOW)} Anthropic API {color(f'({e})', Colors.DIM)}                 ")
-
-    # -- API-key providers (Z.AI/GLM, Kimi, MiniMax, MiniMax-CN) --
-    _apikey_providers = [
-        ("Z.AI / GLM",      ("GLM_API_KEY", "ZAI_API_KEY", "Z_AI_API_KEY"), "https://api.z.ai/api/paas/v4/models", "GLM_BASE_URL"),
-        ("Kimi / Moonshot",  ("KIMI_API_KEY",),                              "https://api.moonshot.ai/v1/models",   "KIMI_BASE_URL"),
-        ("MiniMax",          ("MINIMAX_API_KEY",),                            "https://api.minimax.io/v1/models",    "MINIMAX_BASE_URL"),
-        ("MiniMax (China)",  ("MINIMAX_CN_API_KEY",),                         "https://api.minimaxi.com/v1/models",  "MINIMAX_CN_BASE_URL"),
-    ]
-    for _pname, _env_vars, _default_url, _base_env in _apikey_providers:
-        _key = ""
-        for _ev in _env_vars:
-            _key = os.getenv(_ev, "")
-            if _key:
-                break
-        if _key:
-            _label = _pname.ljust(20)
-            print(f"  Checking {_pname} API...", end="", flush=True)
-            try:
-                import httpx
-                _base = os.getenv(_base_env, "")
-                _url = (_base.rstrip("/") + "/models") if _base else _default_url
-                _resp = httpx.get(
-                    _url,
-                    headers={"Authorization": f"Bearer {_key}"},
-                    timeout=10,
-                )
-                if _resp.status_code == 200:
-                    print(f"\r  {color('✓', Colors.GREEN)} {_label}                          ")
-                elif _resp.status_code == 401:
-                    print(f"\r  {color('✗', Colors.RED)} {_label} {color('(invalid API key)', Colors.DIM)}           ")
-                    issues.append(f"Check {_env_vars[0]} in .env")
-                else:
-                    print(f"\r  {color('⚠', Colors.YELLOW)} {_label} {color(f'(HTTP {_resp.status_code})', Colors.DIM)}           ")
-            except Exception as _e:
-                print(f"\r  {color('⚠', Colors.YELLOW)} {_label} {color(f'({_e})', Colors.DIM)}           ")
-
+    
    # =========================================================================
    # Check: Submodules
    # =========================================================================
--- a/hermes_cli/main.py
+++ b/hermes_cli/main.py
@@ -64,13 +64,7 @@ def _has_any_provider_configured() -> bool:
    # Check env vars (may be set by .env or shell).
    # OPENAI_BASE_URL alone counts — local models (vLLM, llama.cpp, etc.)
    # often don't require an API key.
-    from hermes_cli.auth import PROVIDER_REGISTRY
-
-    # Collect all provider env vars
-    provider_env_vars = {"OPENROUTER_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_BASE_URL"}
-    for pconfig in PROVIDER_REGISTRY.values():
-        if pconfig.auth_type == "api_key":
-            provider_env_vars.update(pconfig.api_key_env_vars)
+    provider_env_vars = ("OPENROUTER_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_BASE_URL")
    if any(os.getenv(v) for v in provider_env_vars):
        return True

@@ -149,13 +143,6 @@ def cmd_chat(args):
        print("You can run 'hermes setup' at any time to configure.")
        sys.exit(1)

-    # Sync bundled skills on every CLI launch (fast -- skips unchanged skills)
-    try:
-        from tools.skills_sync import sync_skills
-        sync_skills(quiet=True)
-    except Exception:
-        pass
-
    # Import and run the CLI
    from cli import main as cli_main
    
@@ -417,10 +404,6 @@ def cmd_model(args):
        "openrouter": "OpenRouter",
        "nous": "Nous Portal",
        "openai-codex": "OpenAI Codex",
-        "zai": "Z.AI / GLM",
-        "kimi-coding": "Kimi / Moonshot",
-        "minimax": "MiniMax",
-        "minimax-cn": "MiniMax (China)",
        "custom": "Custom endpoint",
    }
    active_label = provider_labels.get(active, active)
@@ -435,16 +418,11 @@ def cmd_model(args):
        ("openrouter", "OpenRouter (100+ models, pay-per-use)"),
        ("nous", "Nous Portal (Nous Research subscription)"),
        ("openai-codex", "OpenAI Codex"),
-        ("zai", "Z.AI / GLM (Zhipu AI direct API)"),
-        ("kimi-coding", "Kimi / Moonshot (Moonshot AI direct API)"),
-        ("minimax", "MiniMax (global direct API)"),
-        ("minimax-cn", "MiniMax China (domestic direct API)"),
        ("custom", "Custom endpoint (self-hosted / VLLM / etc.)"),
    ]

    # Reorder so the active provider is at the top
-    known_keys = {k for k, _ in providers}
-    active_key = active if active in known_keys else "custom"
+    active_key = active if active in ("openrouter", "nous", "openai-codex") else "custom"
    ordered = []
    for key, label in providers:
        if key == active_key:
@@ -469,8 +447,6 @@ def cmd_model(args):
        _model_flow_openai_codex(config, current_model)
    elif selected_provider == "custom":
        _model_flow_custom(config)
-    elif selected_provider in ("zai", "kimi-coding", "minimax", "minimax-cn"):
-        _model_flow_api_key_provider(config, selected_provider, current_model)


 def _prompt_provider_choice(choices):
@@ -740,117 +716,6 @@ def _model_flow_custom(config):
        print("Endpoint saved. Use `/model` in chat or `hermes model` to set a model.")


-# Curated model lists for direct API-key providers
-_PROVIDER_MODELS = {
-    "zai": [
-        "glm-5",
-        "glm-4.7",
-        "glm-4.5",
-        "glm-4.5-flash",
-    ],
-    "kimi-coding": [
-        "kimi-k2.5",
-        "kimi-k2-thinking",
-        "kimi-k2-turbo-preview",
-        "kimi-k2-0905-preview",
-    ],
-    "minimax": [
-        "MiniMax-M2.5",
-        "MiniMax-M2.5-highspeed",
-        "MiniMax-M2.1",
-    ],
-    "minimax-cn": [
-        "MiniMax-M2.5",
-        "MiniMax-M2.5-highspeed",
-        "MiniMax-M2.1",
-    ],
-}
-
-
-def _model_flow_api_key_provider(config, provider_id, current_model=""):
-    """Generic flow for API-key providers (z.ai, Kimi, MiniMax)."""
-    from hermes_cli.auth import (
-        PROVIDER_REGISTRY, _prompt_model_selection, _save_model_choice,
-        _update_config_for_provider, deactivate_provider,
-    )
-    from hermes_cli.config import get_env_value, save_env_value, load_config, save_config
-
-    pconfig = PROVIDER_REGISTRY[provider_id]
-    key_env = pconfig.api_key_env_vars[0] if pconfig.api_key_env_vars else ""
-    base_url_env = pconfig.base_url_env_var or ""
-
-    # Check / prompt for API key
-    existing_key = ""
-    for ev in pconfig.api_key_env_vars:
-        existing_key = get_env_value(ev) or os.getenv(ev, "")
-        if existing_key:
-            break
-
-    if not existing_key:
-        print(f"No {pconfig.name} API key configured.")
-        if key_env:
-            try:
-                new_key = input(f"{key_env} (or Enter to cancel): ").strip()
-            except (KeyboardInterrupt, EOFError):
-                print()
-                return
-            if not new_key:
-                print("Cancelled.")
-                return
-            save_env_value(key_env, new_key)
-            print("API key saved.")
-            print()
-    else:
-        print(f"  {pconfig.name} API key: {existing_key[:8]}... ✓")
-        print()
-
-    # Optional base URL override
-    current_base = ""
-    if base_url_env:
-        current_base = get_env_value(base_url_env) or os.getenv(base_url_env, "")
-    effective_base = current_base or pconfig.inference_base_url
-
-    try:
-        override = input(f"Base URL [{effective_base}]: ").strip()
-    except (KeyboardInterrupt, EOFError):
-        print()
-        override = ""
-    if override and base_url_env:
-        save_env_value(base_url_env, override)
-        effective_base = override
-
-    # Model selection
-    model_list = _PROVIDER_MODELS.get(provider_id, [])
-    if model_list:
-        selected = _prompt_model_selection(model_list, current_model=current_model)
-    else:
-        try:
-            selected = input("Model name: ").strip()
-        except (KeyboardInterrupt, EOFError):
-            selected = None
-
-    if selected:
-        # Clear custom endpoint if set (avoid confusion)
-        if get_env_value("OPENAI_BASE_URL"):
-            save_env_value("OPENAI_BASE_URL", "")
-            save_env_value("OPENAI_API_KEY", "")
-
-        _save_model_choice(selected)
-
-        # Update config with provider and base URL
-        cfg = load_config()
-        model = cfg.get("model")
-        if isinstance(model, dict):
-            model["provider"] = provider_id
-            model["base_url"] = effective_base
-        save_config(cfg)
-        deactivate_provider()
-
-        print(f"Default model set to: {selected} (via {pconfig.name})")
-    else:
-        print("No change.")
-
-
 def cmd_login(args):
    """Authenticate Hermes CLI with a provider."""
    from hermes_cli.auth import login_command
@@ -986,17 +851,11 @@ def _update_via_zip(args):
    # Sync skills
    try:
        from tools.skills_sync import sync_skills
-        print("→ Syncing bundled skills...")
+        print("→ Checking for new bundled skills...")
        result = sync_skills(quiet=True)
        if result["copied"]:
-            print(f"  + {len(result['copied'])} new: {', '.join(result['copied'])}")
-        if result.get("updated"):
-            print(f"  ↑ {len(result['updated'])} updated: {', '.join(result['updated'])}")
-        if result.get("user_modified"):
-            print(f"  ~ {len(result['user_modified'])} user-modified (kept)")
-        if result.get("cleaned"):
-            print(f"  − {len(result['cleaned'])} removed from manifest")
-        if not result["copied"] and not result.get("updated"):
+            print(f"  + {len(result['copied'])} new skill(s): {', '.join(result['copied'])}")
+        else:
            print("  ✓ Skills are up to date")
    except Exception:
        pass
@@ -1102,21 +961,15 @@ def cmd_update(args):
        print()
        print("✓ Code updated!")
        
-        # Sync bundled skills (copies new, updates changed, respects user deletions)
+        # Sync any new bundled skills (manifest-based -- won't overwrite or re-add deleted skills)
        try:
            from tools.skills_sync import sync_skills
            print()
-            print("→ Syncing bundled skills...")
+            print("→ Checking for new bundled skills...")
            result = sync_skills(quiet=True)
            if result["copied"]:
-                print(f"  + {len(result['copied'])} new: {', '.join(result['copied'])}")
-            if result.get("updated"):
-                print(f"  ↑ {len(result['updated'])} updated: {', '.join(result['updated'])}")
-            if result.get("user_modified"):
-                print(f"  ~ {len(result['user_modified'])} user-modified (kept)")
-            if result.get("cleaned"):
-                print(f"  − {len(result['cleaned'])} removed from manifest")
-            if not result["copied"] and not result.get("updated"):
+                print(f"  + {len(result['copied'])} new skill(s): {', '.join(result['copied'])}")
+            else:
                print("  ✓ Skills are up to date")
        except Exception as e:
            logger.debug("Skills sync during update failed: %s", e)
@@ -1269,7 +1122,7 @@ For more help on a command:
    )
    chat_parser.add_argument(
        "--provider",
-        choices=["auto", "openrouter", "nous", "openai-codex", "zai", "kimi-coding", "minimax", "minimax-cn"],
+        choices=["auto", "openrouter", "nous", "openai-codex"],
        default=None,
        help="Inference provider (default: auto)"
    )
@@ -1347,15 +1200,7 @@ For more help on a command:
    setup_parser = subparsers.add_parser(
        "setup",
        help="Interactive setup wizard",
-        description="Configure Hermes Agent with an interactive wizard. "
-                    "Run a specific section: hermes setup model|terminal|gateway|tools|agent"
-    )
-    setup_parser.add_argument(
-        "section",
-        nargs="?",
-        choices=["model", "terminal", "gateway", "tools", "agent"],
-        default=None,
-        help="Run a specific setup section instead of the full wizard"
+        description="Configure Hermes Agent with an interactive wizard"
    )
    setup_parser.add_argument(
        "--non-interactive",
@@ -1579,16 +1424,9 @@ For more help on a command:
    )
    skills_subparsers = skills_parser.add_subparsers(dest="skills_action")

-    skills_browse = skills_subparsers.add_parser("browse", help="Browse all available skills (paginated)")
-    skills_browse.add_argument("--page", type=int, default=1, help="Page number (default: 1)")
-    skills_browse.add_argument("--size", type=int, default=20, help="Results per page (default: 20)")
-    skills_browse.add_argument("--source", default="all",
-                               choices=["all", "official", "github", "clawhub", "lobehub"],
-                               help="Filter by source (default: all)")
-
    skills_search = skills_subparsers.add_parser("search", help="Search skill registries")
    skills_search.add_argument("query", help="Search query")
-    skills_search.add_argument("--source", default="all", choices=["all", "official", "github", "clawhub", "lobehub"])
+    skills_search.add_argument("--source", default="all", choices=["all", "github", "clawhub", "lobehub"])
    skills_search.add_argument("--limit", type=int, default=10, help="Max results")

    skills_install = skills_subparsers.add_parser("install", help="Install a skill")
@@ -1765,32 +1603,6 @@ For more help on a command:

    sessions_parser.set_defaults(func=cmd_sessions)

-    # =========================================================================
-    # insights command
-    # =========================================================================
-    insights_parser = subparsers.add_parser(
-        "insights",
-        help="Show usage insights and analytics",
-        description="Analyze session history to show token usage, costs, tool patterns, and activity trends"
-    )
-    insights_parser.add_argument("--days", type=int, default=30, help="Number of days to analyze (default: 30)")
-    insights_parser.add_argument("--source", help="Filter by platform (cli, telegram, discord, etc.)")
-
-    def cmd_insights(args):
-        try:
-            from hermes_state import SessionDB
-            from agent.insights import InsightsEngine
-
-            db = SessionDB()
-            engine = InsightsEngine(db)
-            report = engine.generate(days=args.days, source=args.source)
-            print(engine.format_terminal(report))
-            db.close()
-        except Exception as e:
-            print(f"Error generating insights: {e}")
-
-    insights_parser.set_defaults(func=cmd_insights)
-
    # =========================================================================
    # version command
    # =========================================================================
--- a/hermes_cli/models.py
+++ b/hermes_cli/models.py
@@ -9,17 +9,14 @@ Add, remove, or reorder entries here — both `hermes setup` and
 OPENROUTER_MODELS: list[tuple[str, str]] = [
    ("anthropic/claude-opus-4.6",       "recommended"),
    ("anthropic/claude-sonnet-4.5",     ""),
-    ("openai/gpt-5.4-pro",              ""),
-    ("openai/gpt-5.4",                  ""),
+    ("anthropic/claude-opus-4.5",       ""),
+    ("openai/gpt-5.2",                  ""),
    ("openai/gpt-5.3-codex",            ""),
    ("google/gemini-3-pro-preview",     ""),
    ("google/gemini-3-flash-preview",   ""),
-    ("qwen/qwen3.5-plus-02-15",        ""),
-    ("qwen/qwen3.5-35b-a3b",           ""),
-    ("stepfun/step-3.5-flash",          ""),
-    ("z-ai/glm-5",                      ""),
+    ("z-ai/glm-4.7",                    ""),
    ("moonshotai/kimi-k2.5",            ""),
-    ("minimax/minimax-m2.5",            ""),
+    ("minimax/minimax-m2.1",            ""),
 ]


--- a/hermes_cli/runtime_provider.py
+++ b/hermes_cli/runtime_provider.py
@@ -7,12 +7,10 @@ from typing import Any, Dict, Optional

 from hermes_cli.auth import (
    AuthError,
-    PROVIDER_REGISTRY,
    format_auth_error,
    resolve_provider,
    resolve_nous_runtime_credentials,
    resolve_codex_runtime_credentials,
-    resolve_api_key_provider_credentials,
 )
 from hermes_cli.config import load_config
 from hermes_constants import OPENROUTER_BASE_URL
@@ -74,26 +72,12 @@ def _resolve_openrouter_runtime(
        or OPENROUTER_BASE_URL
    ).rstrip("/")

-    # Choose API key based on whether the resolved base_url targets OpenRouter.
-    # When hitting OpenRouter, prefer OPENROUTER_API_KEY (issue #289).
-    # When hitting a custom endpoint (e.g. Z.ai, local LLM), prefer
-    # OPENAI_API_KEY so the OpenRouter key doesn't leak to an unrelated
-    # provider (issues #420, #560).
-    _is_openrouter_url = "openrouter.ai" in base_url
-    if _is_openrouter_url:
-        api_key = (
-            explicit_api_key
-            or os.getenv("OPENROUTER_API_KEY")
-            or os.getenv("OPENAI_API_KEY")
-            or ""
-        )
-    else:
-        api_key = (
-            explicit_api_key
-            or os.getenv("OPENAI_API_KEY")
-            or os.getenv("OPENROUTER_API_KEY")
-            or ""
-        )
+    api_key = (
+        explicit_api_key
+        or os.getenv("OPENROUTER_API_KEY")
+        or os.getenv("OPENAI_API_KEY")
+        or ""
+    )

    source = "explicit" if (explicit_api_key or explicit_base_url) else "env/config"

@@ -148,19 +132,6 @@ def resolve_runtime_provider(
            "requested_provider": requested_provider,
        }

-    # API-key providers (z.ai/GLM, Kimi, MiniMax, MiniMax-CN)
-    pconfig = PROVIDER_REGISTRY.get(provider)
-    if pconfig and pconfig.auth_type == "api_key":
-        creds = resolve_api_key_provider_credentials(provider)
-        return {
-            "provider": provider,
-            "api_mode": "chat_completions",
-            "base_url": creds.get("base_url", "").rstrip("/"),
-            "api_key": creds.get("api_key", ""),
-            "source": creds.get("source", "env"),
-            "requested_provider": requested_provider,
-        }
-
    runtime = _resolve_openrouter_runtime(
        requested_provider=requested_provider,
        explicit_api_key=explicit_api_key,
--- a/hermes_cli/setup.py
+++ b/hermes_cli/setup.py
--- a/hermes_cli/skills_hub.py
+++ b/hermes_cli/skills_hub.py
@@ -57,9 +57,8 @@ def _resolve_short_name(name: str, sources, console: Console) -> str:
        table.add_column("Trust", style="dim")
        table.add_column("Identifier", style="bold cyan")
        for r in exact:
-            trust_style = {"builtin": "bright_cyan", "trusted": "green", "community": "yellow"}.get(r.trust_level, "dim")
-            trust_label = "official" if r.source == "official" else r.trust_level
-            table.add_row(r.source, f"[{trust_style}]{trust_label}[/]", r.identifier)
+            trust_style = {"trusted": "green", "community": "yellow"}.get(r.trust_level, "dim")
+            table.add_row(r.source, f"[{trust_style}]{r.trust_level}[/]", r.identifier)
        c.print(table)
        c.print("[bold]Use the full identifier to install a specific one.[/]\n")
        return ""
@@ -100,13 +99,12 @@ def do_search(query: str, source: str = "all", limit: int = 10,
    table.add_column("Identifier", style="dim")

    for r in results:
-        trust_style = {"builtin": "bright_cyan", "trusted": "green", "community": "yellow"}.get(r.trust_level, "dim")
-        trust_label = "official" if r.source == "official" else r.trust_level
+        trust_style = {"trusted": "green", "community": "yellow"}.get(r.trust_level, "dim")
        table.add_row(
            r.name,
            r.description[:60] + ("..." if len(r.description) > 60 else ""),
            r.source,
-            f"[{trust_style}]{trust_label}[/]",
+            f"[{trust_style}]{r.trust_level}[/]",
            r.identifier,
        )

@@ -115,130 +113,6 @@ def do_search(query: str, source: str = "all", limit: int = 10,
            "hermes skills install <identifier> to install[/]\n")


-def do_browse(page: int = 1, page_size: int = 20, source: str = "all",
-              console: Optional[Console] = None) -> None:
-    """Browse all available skills across registries, paginated.
-
-    Official skills are always shown first, regardless of source filter.
-    """
-    from tools.skills_hub import (
-        GitHubAuth, create_source_router, OptionalSkillSource, SkillMeta,
-    )
-
-    # Clamp page_size to safe range
-    page_size = max(1, min(page_size, 100))
-
-    c = console or _console
-
-    auth = GitHubAuth()
-    sources = create_source_router(auth)
-
-    # Collect results from all (or filtered) sources
-    # Use empty query to get everything; per-source limits prevent overload
-    _TRUST_RANK = {"builtin": 3, "trusted": 2, "community": 1}
-    _PER_SOURCE_LIMIT = {"official": 100, "github": 100, "clawhub": 50,
-                         "claude-marketplace": 50, "lobehub": 50}
-
-    all_results: list = []
-    source_counts: dict = {}
-
-    for src in sources:
-        sid = src.source_id()
-        if source != "all" and sid != source and sid != "official":
-            # Always include official source for the "first" placement
-            continue
-        try:
-            limit = _PER_SOURCE_LIMIT.get(sid, 50)
-            results = src.search("", limit=limit)
-            source_counts[sid] = len(results)
-            all_results.extend(results)
-        except Exception:
-            continue
-
-    if not all_results:
-        c.print("[dim]No skills found in the Skills Hub.[/]\n")
-        return
-
-    # Deduplicate by name, preferring higher trust
-    seen: dict = {}
-    for r in all_results:
-        rank = _TRUST_RANK.get(r.trust_level, 0)
-        if r.name not in seen or rank > _TRUST_RANK.get(seen[r.name].trust_level, 0):
-            seen[r.name] = r
-    deduped = list(seen.values())
-
-    # Sort: official first, then by trust level (desc), then alphabetically
-    deduped.sort(key=lambda r: (
-        -_TRUST_RANK.get(r.trust_level, 0),
-        r.source != "official",
-        r.name.lower(),
-    ))
-
-    # Paginate
-    total = len(deduped)
-    total_pages = max(1, (total + page_size - 1) // page_size)
-    page = max(1, min(page, total_pages))
-    start = (page - 1) * page_size
-    end = min(start + page_size, total)
-    page_items = deduped[start:end]
-
-    # Count official vs other
-    official_count = sum(1 for r in deduped if r.source == "official")
-
-    # Build header
-    source_label = f"— {source}" if source != "all" else "— all sources"
-    c.print(f"\n[bold]Skills Hub — Browse {source_label}[/]"
-            f"  [dim]({total} skills, page {page}/{total_pages})[/]")
-    if official_count > 0 and page == 1:
-        c.print(f"[bright_cyan]★ {official_count} official optional skill(s) from Nous Research[/]")
-    c.print()
-
-    # Build table
-    table = Table(show_header=True, header_style="bold")
-    table.add_column("#", style="dim", width=4, justify="right")
-    table.add_column("Name", style="bold cyan", max_width=25)
-    table.add_column("Description", max_width=50)
-    table.add_column("Source", style="dim", width=12)
-    table.add_column("Trust", width=10)
-
-    for i, r in enumerate(page_items, start=start + 1):
-        trust_style = {"builtin": "bright_cyan", "trusted": "green",
-                       "community": "yellow"}.get(r.trust_level, "dim")
-        trust_label = "★ official" if r.source == "official" else r.trust_level
-
-        desc = r.description[:50]
-        if len(r.description) > 50:
-            desc += "..."
-
-        table.add_row(
-            str(i),
-            r.name,
-            desc,
-            r.source,
-            f"[{trust_style}]{trust_label}[/]",
-        )
-
-    c.print(table)
-
-    # Navigation hints
-    nav_parts = []
-    if page > 1:
-        nav_parts.append(f"[cyan]--page {page - 1}[/] ← prev")
-    if page < total_pages:
-        nav_parts.append(f"[cyan]--page {page + 1}[/] → next")
-
-    if nav_parts:
-        c.print(f"  {' | '.join(nav_parts)}")
-
-    # Source summary
-    if source == "all" and source_counts:
-        parts = [f"{sid}: {ct}" for sid, ct in sorted(source_counts.items())]
-        c.print(f"  [dim]Sources: {', '.join(parts)}[/]")
-
-    c.print("[dim]Use: hermes skills inspect <identifier> to preview, "
-            "hermes skills install <identifier> to install[/]\n")
-
-
 def do_install(identifier: str, category: str = "", force: bool = False,
               console: Optional[Console] = None) -> None:
    """Fetch, quarantine, scan, confirm, and install a skill."""
@@ -273,12 +147,6 @@ def do_install(identifier: str, category: str = "", force: bool = False,
        c.print(f"[bold red]Error:[/] Could not fetch '{identifier}' from any source.\n")
        return

-    # Auto-detect category for official skills (e.g. "official/autonomous-ai-agents/blackbox")
-    if bundle.source == "official" and not category:
-        id_parts = bundle.identifier.split("/")  # ["official", "category", "skill"]
-        if len(id_parts) >= 3:
-            category = id_parts[1]
-
    # Check if already installed
    lock = HubLockFile()
    existing = lock.get_installed(bundle.name)
@@ -309,28 +177,18 @@ def do_install(identifier: str, category: str = "", force: bool = False,
                         f"{len(result.findings)}_findings")
        return

-    # Confirm with user — show appropriate warning based on source
+    # Confirm with user — always show risk warning regardless of source
    if not force:
        c.print()
-        if bundle.source == "official":
-            c.print(Panel(
-                "[bold bright_cyan]This is an official optional skill maintained by Nous Research.[/]\n\n"
-                "It ships with hermes-agent but is not activated by default.\n"
-                "Installing will copy it to your skills directory where the agent can use it.\n\n"
-                f"Files will be at: [cyan]~/.hermes/skills/{category + '/' if category else ''}{bundle.name}/[/]",
-                title="Official Skill",
-                border_style="bright_cyan",
-            ))
-        else:
-            c.print(Panel(
-                "[bold yellow]You are installing a third-party skill at your own risk.[/]\n\n"
-                "External skills can contain instructions that influence agent behavior,\n"
-                "shell commands, and scripts. Even after automated scanning, you should\n"
-                "review the installed files before use.\n\n"
-                f"Files will be at: [cyan]~/.hermes/skills/{category + '/' if category else ''}{bundle.name}/[/]",
-                title="Disclaimer",
-                border_style="yellow",
-            ))
+        c.print(Panel(
+            "[bold yellow]You are installing a third-party skill at your own risk.[/]\n\n"
+            "External skills can contain instructions that influence agent behavior,\n"
+            "shell commands, and scripts. Even after automated scanning, you should\n"
+            "review the installed files before use.\n\n"
+            f"Files will be at: [cyan]~/.hermes/skills/{category + '/' if category else ''}{bundle.name}/[/]",
+            title="Disclaimer",
+            border_style="yellow",
+        ))
        c.print(f"[bold]Install '{bundle.name}'?[/]")
        try:
            answer = input("Confirm [y/N]: ").strip().lower()
@@ -379,14 +237,13 @@ def do_inspect(identifier: str, console: Optional[Console] = None) -> None:
            break

    c.print()
-    trust_style = {"builtin": "bright_cyan", "trusted": "green", "community": "yellow"}.get(meta.trust_level, "dim")
-    trust_label = "official" if meta.source == "official" else meta.trust_level
+    trust_style = {"trusted": "green", "community": "yellow"}.get(meta.trust_level, "dim")

    info_lines = [
        f"[bold]Name:[/] {meta.name}",
        f"[bold]Description:[/] {meta.description}",
        f"[bold]Source:[/] {meta.source}",
-        f"[bold]Trust:[/] [{trust_style}]{trust_label}[/]",
+        f"[bold]Trust:[/] [{trust_style}]{meta.trust_level}[/]",
        f"[bold]Identifier:[/] {meta.identifier}",
    ]
    if meta.tags:
@@ -440,9 +297,8 @@ def do_list(source_filter: str = "all", console: Optional[Console] = None) -> No
        if source_filter == "builtin" and hub_entry:
            continue

-        trust_style = {"builtin": "bright_cyan", "trusted": "green", "community": "yellow"}.get(trust, "dim")
-        trust_label = "official" if source_display == "official" else trust
-        table.add_row(name, category, source_display, f"[{trust_style}]{trust_label}[/]")
+        trust_style = {"builtin": "blue", "trusted": "green", "community": "yellow"}.get(trust, "dim")
+        table.add_row(name, category, source_display, f"[{trust_style}]{trust}[/]")

    c.print(table)
    c.print(f"[dim]{len(hub_installed)} hub-installed, "
@@ -802,9 +658,7 @@ def skills_command(args) -> None:
    """Router for `hermes skills <subcommand>` — called from hermes_cli/main.py."""
    action = getattr(args, "skills_action", None)

-    if action == "browse":
-        do_browse(page=args.page, page_size=args.size, source=args.source)
-    elif action == "search":
+    if action == "search":
        do_search(args.query, source=args.source, limit=args.limit)
    elif action == "install":
        do_install(args.identifier, category=args.category, force=args.force)
@@ -838,7 +692,7 @@ def skills_command(args) -> None:
            return
        do_tap(tap_action, repo=repo)
    else:
-        _console.print("Usage: hermes skills [browse|search|install|inspect|list|audit|uninstall|publish|snapshot|tap]\n")
+        _console.print("Usage: hermes skills [search|install|inspect|list|audit|uninstall|publish|snapshot|tap]\n")
        _console.print("Run 'hermes skills <command> --help' for details.\n")


@@ -878,32 +732,7 @@ def handle_skills_slash(cmd: str, console: Optional[Console] = None) -> None:
    action = parts[0].lower()
    args = parts[1:]

-    if action == "browse":
-        page = 1
-        page_size = 20
-        source = "all"
-        i = 0
-        while i < len(args):
-            if args[i] == "--page" and i + 1 < len(args):
-                try:
-                    page = int(args[i + 1])
-                except ValueError:
-                    pass
-                i += 2
-            elif args[i] == "--size" and i + 1 < len(args):
-                try:
-                    page_size = int(args[i + 1])
-                except ValueError:
-                    pass
-                i += 2
-            elif args[i] == "--source" and i + 1 < len(args):
-                source = args[i + 1]
-                i += 2
-            else:
-                i += 1
-        do_browse(page=page, page_size=page_size, source=source, console=c)
-
-    elif action == "search":
+    if action == "search":
        if not args:
            c.print("[bold red]Usage:[/] /skills search <query> [--source github] [--limit N]\n")
            return
@@ -1009,7 +838,6 @@ def _print_skills_help(console: Console) -> None:
    """Print help for the /skills slash command."""
    console.print(Panel(
        "[bold]Skills Hub Commands:[/]\n\n"
-        "  [cyan]browse[/] [--source official]   Browse all available skills (paginated)\n"
        "  [cyan]search[/] <query>              Search registries for skills\n"
        "  [cyan]install[/] <identifier>        Install a skill (with security scan)\n"
        "  [cyan]inspect[/] <identifier>        Preview a skill without installing\n"
--- a/hermes_cli/status.py
+++ b/hermes_cli/status.py
@@ -79,12 +79,8 @@ def show_status(args):
        "OpenRouter": "OPENROUTER_API_KEY",
        "Anthropic": "ANTHROPIC_API_KEY", 
        "OpenAI": "OPENAI_API_KEY",
-        "Z.AI/GLM": "GLM_API_KEY",
-        "Kimi": "KIMI_API_KEY",
-        "MiniMax": "MINIMAX_API_KEY",
-        "MiniMax-CN": "MINIMAX_CN_API_KEY",
        "Firecrawl": "FIRECRAWL_API_KEY",
-        "Browserbase": "BROWSERBASE_API_KEY",  # Optional — local browser works without this
+        "Browserbase": "BROWSERBASE_API_KEY",
        "FAL": "FAL_KEY",
        "Tinker": "TINKER_API_KEY",
        "WandB": "WANDB_API_KEY",
@@ -132,7 +128,7 @@ def show_status(args):
        f"  {'OpenAI Codex':<12}  {check_mark(codex_logged_in)} "
        f"{'logged in' if codex_logged_in else 'not logged in (run: hermes model)'}"
    )
-    codex_auth_file = codex_status.get("auth_store")
+    codex_auth_file = codex_status.get("auth_file")
    if codex_auth_file:
        print(f"    Auth file:  {codex_auth_file}")
    codex_last_refresh = _format_iso_timestamp(codex_status.get("last_refresh"))
@@ -141,28 +137,6 @@ def show_status(args):
    if codex_status.get("error") and not codex_logged_in:
        print(f"    Error:      {codex_status.get('error')}")

-    # =========================================================================
-    # API-Key Providers
-    # =========================================================================
-    print()
-    print(color("◆ API-Key Providers", Colors.CYAN, Colors.BOLD))
-
-    apikey_providers = {
-        "Z.AI / GLM":       ("GLM_API_KEY", "ZAI_API_KEY", "Z_AI_API_KEY"),
-        "Kimi / Moonshot":  ("KIMI_API_KEY",),
-        "MiniMax":          ("MINIMAX_API_KEY",),
-        "MiniMax (China)":  ("MINIMAX_CN_API_KEY",),
-    }
-    for pname, env_vars in apikey_providers.items():
-        key_val = ""
-        for ev in env_vars:
-            key_val = get_env_value(ev) or ""
-            if key_val:
-                break
-        configured = bool(key_val)
-        label = "configured" if configured else "not configured (run: hermes model)"
-        print(f"  {pname:<16} {check_mark(configured)} {label}")
-
    # =========================================================================
    # Terminal Configuration
    # =========================================================================
@@ -189,9 +163,6 @@ def show_status(args):
    elif terminal_env == "docker":
        docker_image = os.getenv("TERMINAL_DOCKER_IMAGE", "python:3.11-slim")
        print(f"  Docker Image: {docker_image}")
-    elif terminal_env == "daytona":
-        daytona_image = os.getenv("TERMINAL_DAYTONA_IMAGE", "nikolaik/python-nodejs:python3.11-nodejs20")
-        print(f"  Daytona Image: {daytona_image}")
    
    sudo_password = os.getenv("SUDO_PASSWORD", "")
    print(f"  Sudo:         {check_mark(bool(sudo_password))} {'enabled' if sudo_password else 'disabled'}")
--- a/hermes_cli/tools_config.py
+++ b/hermes_cli/tools_config.py
@@ -1,10 +1,7 @@
 """
-Unified tool configuration for Hermes Agent.
-
-`hermes tools` and `hermes setup tools` both enter this module.
-Select a platform → toggle toolsets on/off → for newly enabled tools
-that need API keys, run through provider-aware configuration.
+Interactive tool configuration for Hermes Agent.

+`hermes tools` — select a platform, then toggle toolsets on/off via checklist.
 Saves per-platform tool configuration to ~/.hermes/config.yaml under
 the `platform_toolsets` key.
 """
@@ -15,63 +12,9 @@ from typing import Dict, List, Set

 import os

-from hermes_cli.config import (
-    load_config, save_config, get_env_value, save_env_value,
-    get_hermes_home,
-)
+from hermes_cli.config import load_config, save_config, get_env_value, save_env_value
 from hermes_cli.colors import Colors, color

-PROJECT_ROOT = Path(__file__).parent.parent.resolve()
-
-
-# ─── UI Helpers (shared with setup.py) ────────────────────────────────────────
-
-def _print_info(text: str):
-    print(color(f"  {text}", Colors.DIM))
-
-def _print_success(text: str):
-    print(color(f"✓ {text}", Colors.GREEN))
-
-def _print_warning(text: str):
-    print(color(f"⚠ {text}", Colors.YELLOW))
-
-def _print_error(text: str):
-    print(color(f"✗ {text}", Colors.RED))
-
-def _prompt(question: str, default: str = None, password: bool = False) -> str:
-    if default:
-        display = f"{question} [{default}]: "
-    else:
-        display = f"{question}: "
-    try:
-        if password:
-            import getpass
-            value = getpass.getpass(color(display, Colors.YELLOW))
-        else:
-            value = input(color(display, Colors.YELLOW))
-        return value.strip() or default or ""
-    except (KeyboardInterrupt, EOFError):
-        print()
-        return default or ""
-
-def _prompt_yes_no(question: str, default: bool = True) -> bool:
-    default_str = "Y/n" if default else "y/N"
-    while True:
-        try:
-            value = input(color(f"{question} [{default_str}]: ", Colors.YELLOW)).strip().lower()
-        except (KeyboardInterrupt, EOFError):
-            print()
-            return default
-        if not value:
-            return default
-        if value in ('y', 'yes'):
-            return True
-        if value in ('n', 'no'):
-            return False
-
-
-# ─── Toolset Registry ─────────────────────────────────────────────────────────
-
 # Toolsets shown in the configurator, grouped for display.
 # Each entry: (toolset_name, label, description)
 # These map to keys in toolsets.py TOOLSETS dict.
@@ -106,187 +49,6 @@ PLATFORMS = {
 }


-# ─── Tool Categories (provider-aware configuration) ──────────────────────────
-# Maps toolset keys to their provider options. When a toolset is newly enabled,
-# we use this to show provider selection and prompt for the right API keys.
-# Toolsets not in this map either need no config or use the simple fallback.
-
-TOOL_CATEGORIES = {
-    "tts": {
-        "name": "Text-to-Speech",
-        "icon": "🔊",
-        "providers": [
-            {
-                "name": "Microsoft Edge TTS",
-                "tag": "Free - no API key needed",
-                "env_vars": [],
-                "tts_provider": "edge",
-            },
-            {
-                "name": "OpenAI TTS",
-                "tag": "Premium - high quality voices",
-                "env_vars": [
-                    {"key": "VOICE_TOOLS_OPENAI_KEY", "prompt": "OpenAI API key", "url": "https://platform.openai.com/api-keys"},
-                ],
-                "tts_provider": "openai",
-            },
-            {
-                "name": "ElevenLabs",
-                "tag": "Premium - most natural voices",
-                "env_vars": [
-                    {"key": "ELEVENLABS_API_KEY", "prompt": "ElevenLabs API key", "url": "https://elevenlabs.io/app/settings/api-keys"},
-                ],
-                "tts_provider": "elevenlabs",
-            },
-        ],
-    },
-    "web": {
-        "name": "Web Search & Extract",
-        "icon": "🔍",
-        "providers": [
-            {
-                "name": "Firecrawl Cloud",
-                "tag": "Recommended - hosted service",
-                "env_vars": [
-                    {"key": "FIRECRAWL_API_KEY", "prompt": "Firecrawl API key", "url": "https://firecrawl.dev"},
-                ],
-            },
-            {
-                "name": "Firecrawl Self-Hosted",
-                "tag": "Free - run your own instance",
-                "env_vars": [
-                    {"key": "FIRECRAWL_API_URL", "prompt": "Your Firecrawl instance URL (e.g., http://localhost:3002)"},
-                ],
-            },
-        ],
-    },
-    "image_gen": {
-        "name": "Image Generation",
-        "icon": "🎨",
-        "providers": [
-            {
-                "name": "FAL.ai",
-                "tag": "FLUX 2 Pro with auto-upscaling",
-                "env_vars": [
-                    {"key": "FAL_KEY", "prompt": "FAL API key", "url": "https://fal.ai/dashboard/keys"},
-                ],
-            },
-        ],
-    },
-    "browser": {
-        "name": "Browser Automation",
-        "icon": "🌐",
-        "providers": [
-            {
-                "name": "Local Browser",
-                "tag": "Free headless Chromium (no API key needed)",
-                "env_vars": [],
-                "post_setup": "browserbase",  # Same npm install for agent-browser
-            },
-            {
-                "name": "Browserbase",
-                "tag": "Cloud browser with stealth & proxies",
-                "env_vars": [
-                    {"key": "BROWSERBASE_API_KEY", "prompt": "Browserbase API key", "url": "https://browserbase.com"},
-                    {"key": "BROWSERBASE_PROJECT_ID", "prompt": "Browserbase project ID"},
-                ],
-                "post_setup": "browserbase",
-            },
-        ],
-    },
-    "homeassistant": {
-        "name": "Smart Home",
-        "icon": "🏠",
-        "providers": [
-            {
-                "name": "Home Assistant",
-                "tag": "REST API integration",
-                "env_vars": [
-                    {"key": "HASS_TOKEN", "prompt": "Home Assistant Long-Lived Access Token"},
-                    {"key": "HASS_URL", "prompt": "Home Assistant URL", "default": "http://homeassistant.local:8123"},
-                ],
-            },
-        ],
-    },
-    "rl": {
-        "name": "RL Training",
-        "icon": "🧪",
-        "requires_python": (3, 11),
-        "providers": [
-            {
-                "name": "Tinker / Atropos",
-                "tag": "RL training platform",
-                "env_vars": [
-                    {"key": "TINKER_API_KEY", "prompt": "Tinker API key", "url": "https://tinker-console.thinkingmachines.ai/keys"},
-                    {"key": "WANDB_API_KEY", "prompt": "WandB API key", "url": "https://wandb.ai/authorize"},
-                ],
-                "post_setup": "rl_training",
-            },
-        ],
-    },
-}
-
-# Simple env-var requirements for toolsets NOT in TOOL_CATEGORIES.
-# Used as a fallback for tools like vision/moa that just need an API key.
-TOOLSET_ENV_REQUIREMENTS = {
-    "vision":     [("OPENROUTER_API_KEY",   "https://openrouter.ai/keys")],
-    "moa":        [("OPENROUTER_API_KEY",   "https://openrouter.ai/keys")],
-}
-
-
-# ─── Post-Setup Hooks ─────────────────────────────────────────────────────────
-
-def _run_post_setup(post_setup_key: str):
-    """Run post-setup hooks for tools that need extra installation steps."""
-    import shutil
-    if post_setup_key == "browserbase":
-        node_modules = PROJECT_ROOT / "node_modules" / "agent-browser"
-        if not node_modules.exists() and shutil.which("npm"):
-            _print_info("    Installing Node.js dependencies for browser tools...")
-            import subprocess
-            result = subprocess.run(
-                ["npm", "install", "--silent"],
-                capture_output=True, text=True, cwd=str(PROJECT_ROOT)
-            )
-            if result.returncode == 0:
-                _print_success("    Node.js dependencies installed")
-            else:
-                _print_warning("    npm install failed - run manually: cd ~/.hermes/hermes-agent && npm install")
-        elif not node_modules.exists():
-            _print_warning("    Node.js not found - browser tools require: npm install (in hermes-agent directory)")
-
-    elif post_setup_key == "rl_training":
-        try:
-            __import__("tinker_atropos")
-        except ImportError:
-            tinker_dir = PROJECT_ROOT / "tinker-atropos"
-            if tinker_dir.exists() and (tinker_dir / "pyproject.toml").exists():
-                _print_info("    Installing tinker-atropos submodule...")
-                import subprocess
-                uv_bin = shutil.which("uv")
-                if uv_bin:
-                    result = subprocess.run(
-                        [uv_bin, "pip", "install", "--python", sys.executable, "-e", str(tinker_dir)],
-                        capture_output=True, text=True
-                    )
-                else:
-                    result = subprocess.run(
-                        [sys.executable, "-m", "pip", "install", "-e", str(tinker_dir)],
-                        capture_output=True, text=True
-                    )
-                if result.returncode == 0:
-                    _print_success("    tinker-atropos installed")
-                else:
-                    _print_warning("    tinker-atropos install failed - run manually:")
-                    _print_info('      uv pip install -e "./tinker-atropos"')
-            else:
-                _print_warning("    tinker-atropos submodule not found - run:")
-                _print_info("      git submodule update --init --recursive")
-                _print_info('      uv pip install -e "./tinker-atropos"')
-
-
-# ─── Platform / Toolset Helpers ───────────────────────────────────────────────
-
 def _get_enabled_platforms() -> List[str]:
    """Return platform keys that are configured (have tokens or are CLI)."""
    enabled = ["cli"]
@@ -335,28 +97,6 @@ def _save_platform_tools(config: dict, platform: str, enabled_toolset_keys: Set[
    save_config(config)


-def _toolset_has_keys(ts_key: str) -> bool:
-    """Check if a toolset's required API keys are configured."""
-    # Check TOOL_CATEGORIES first (provider-aware)
-    cat = TOOL_CATEGORIES.get(ts_key)
-    if cat:
-        for provider in cat["providers"]:
-            env_vars = provider.get("env_vars", [])
-            if not env_vars:
-                return True  # Free provider (e.g., Edge TTS)
-            if all(get_env_value(v["key"]) for v in env_vars):
-                return True
-        return False
-
-    # Fallback to simple requirements
-    requirements = TOOLSET_ENV_REQUIREMENTS.get(ts_key, [])
-    if not requirements:
-        return True
-    return all(get_env_value(var) for var, _ in requirements)
-
-
-# ─── Menu Helpers ─────────────────────────────────────────────────────────────
-
 def _prompt_choice(question: str, choices: list, default: int = 0) -> int:
    """Single-select menu (arrow keys)."""
    print(color(question, Colors.YELLOW))
@@ -374,7 +114,7 @@ def _prompt_choice(question: str, choices: list, default: int = 0) -> int:
        )
        idx = menu.show()
        if idx is None:
-            return default
+            sys.exit(0)
        print()
        return idx
    except (ImportError, NotImplementedError):
@@ -392,7 +132,15 @@ def _prompt_choice(question: str, choices: list, default: int = 0) -> int:
                    return idx
            except (ValueError, KeyboardInterrupt, EOFError):
                print()
-                return default
+                sys.exit(0)
+
+
+def _toolset_has_keys(ts_key: str) -> bool:
+    """Check if a toolset's required API keys are configured."""
+    requirements = TOOLSET_ENV_REQUIREMENTS.get(ts_key, [])
+    if not requirements:
+        return True
+    return all(get_env_value(var) for var, _ in requirements)


 def _prompt_toolset_checklist(platform_label: str, enabled: Set[str]) -> Set[str]:
@@ -402,8 +150,8 @@ def _prompt_toolset_checklist(platform_label: str, enabled: Set[str]) -> Set[str
    labels = []
    for ts_key, ts_label, ts_desc in CONFIGURABLE_TOOLSETS:
        suffix = ""
-        if not _toolset_has_keys(ts_key) and (TOOL_CATEGORIES.get(ts_key) or TOOLSET_ENV_REQUIREMENTS.get(ts_key)):
-            suffix = "  [no API key]"
+        if not _toolset_has_keys(ts_key) and TOOLSET_ENV_REQUIREMENTS.get(ts_key):
+            suffix = "  ⚠ no API key"
        labels.append(f"{ts_label}  ({ts_desc}){suffix}")

    pre_selected_indices = [
@@ -554,294 +302,77 @@ def _prompt_toolset_checklist(platform_label: str, enabled: Set[str]) -> Set[str
    return {CONFIGURABLE_TOOLSETS[i][0] for i in selected}


-# ─── Provider-Aware Configuration ────────────────────────────────────────────
-
-def _configure_toolset(ts_key: str, config: dict):
-    """Configure a toolset - provider selection + API keys.
-    
-    Uses TOOL_CATEGORIES for provider-aware config, falls back to simple
-    env var prompts for toolsets not in TOOL_CATEGORIES.
-    """
-    cat = TOOL_CATEGORIES.get(ts_key)
-
-    if cat:
-        _configure_tool_category(ts_key, cat, config)
-    else:
-        # Simple fallback for vision, moa, etc.
-        _configure_simple_requirements(ts_key)
+# Map toolset keys to the env vars they require and where to get them
+TOOLSET_ENV_REQUIREMENTS = {
+    "web":        [("FIRECRAWL_API_KEY",    "https://firecrawl.dev/")],
+    "browser":    [("BROWSERBASE_API_KEY",  "https://browserbase.com/"),
+                   ("BROWSERBASE_PROJECT_ID", None)],
+    "vision":     [("OPENROUTER_API_KEY",   "https://openrouter.ai/keys")],
+    "image_gen":  [("FAL_KEY",              "https://fal.ai/")],
+    "moa":        [("OPENROUTER_API_KEY",   "https://openrouter.ai/keys")],
+    "tts":        [],  # Edge TTS is free, no key needed
+    "rl":         [("TINKER_API_KEY",       "https://tinker-console.thinkingmachines.ai/keys"),
+                   ("WANDB_API_KEY",        "https://wandb.ai/authorize")],
+    "homeassistant": [("HASS_TOKEN", "Home Assistant > Profile > Long-Lived Access Tokens"),
+                      ("HASS_URL",   None)],
+}


-def _configure_tool_category(ts_key: str, cat: dict, config: dict):
-    """Configure a tool category with provider selection."""
-    icon = cat.get("icon", "")
-    name = cat["name"]
-    providers = cat["providers"]
+def _check_and_prompt_requirements(newly_enabled: Set[str]):
+    """Check if newly enabled toolsets have missing API keys and offer to set them up."""
+    for ts_key in sorted(newly_enabled):
+        requirements = TOOLSET_ENV_REQUIREMENTS.get(ts_key, [])
+        if not requirements:
+            continue

-    # Check Python version requirement
-    if cat.get("requires_python"):
-        req = cat["requires_python"]
-        if sys.version_info < req:
-            print()
-            _print_error(f"  {name} requires Python {req[0]}.{req[1]}+ (current: {sys.version_info.major}.{sys.version_info.minor})")
-            _print_info("  Upgrade Python and reinstall to enable this tool.")
-            return
+        missing = [(var, url) for var, url in requirements if not get_env_value(var)]
+        if not missing:
+            continue

-    if len(providers) == 1:
-        # Single provider - configure directly
-        provider = providers[0]
-        print()
-        print(color(f"  --- {icon} {name} ({provider['name']}) ---", Colors.CYAN))
-        if provider.get("tag"):
-            _print_info(f"  {provider['tag']}")
-        _configure_provider(provider, config)
-    else:
-        # Multiple providers - let user choose
-        print()
-        print(color(f"  --- {icon} {name} - Choose a provider ---", Colors.CYAN))
+        ts_label = next((l for k, l, _ in CONFIGURABLE_TOOLSETS if k == ts_key), ts_key)
        print()
+        print(color(f"  ⚠ {ts_label} requires configuration:", Colors.YELLOW))

-        # Plain text labels only (no ANSI codes in menu items)
-        provider_choices = []
-        for p in providers:
-            tag = f" ({p['tag']})" if p.get("tag") else ""
-            configured = ""
-            env_vars = p.get("env_vars", [])
-            if not env_vars or all(get_env_value(v["key"]) for v in env_vars):
-                if p.get("tts_provider") and config.get("tts", {}).get("provider") == p["tts_provider"]:
-                    configured = " [active]"
-                elif not env_vars:
-                    configured = " [active]" if config.get("tts", {}).get("provider", "edge") == p.get("tts_provider", "") else ""
-                else:
-                    configured = " [configured]"
-            provider_choices.append(f"{p['name']}{tag}{configured}")
-
-        # Detect current provider as default
-        default_idx = 0
-        for i, p in enumerate(providers):
-            if p.get("tts_provider") and config.get("tts", {}).get("provider") == p["tts_provider"]:
-                default_idx = i
-                break
-            env_vars = p.get("env_vars", [])
-            if env_vars and all(get_env_value(v["key"]) for v in env_vars):
-                default_idx = i
-                break
-
-        provider_idx = _prompt_choice("  Select provider:", provider_choices, default_idx)
-        _configure_provider(providers[provider_idx], config)
-
-
-def _configure_provider(provider: dict, config: dict):
-    """Configure a single provider - prompt for API keys and set config."""
-    env_vars = provider.get("env_vars", [])
-
-    # Set TTS provider in config if applicable
-    if provider.get("tts_provider"):
-        config.setdefault("tts", {})["provider"] = provider["tts_provider"]
-
-    if not env_vars:
-        _print_success(f"  {provider['name']} - no configuration needed!")
-        return
-
-    # Prompt for each required env var
-    all_configured = True
-    for var in env_vars:
-        existing = get_env_value(var["key"])
-        if existing:
-            _print_success(f"  {var['key']}: already configured")
-            # Don't ask to update - this is a new enable flow.
-            # Reconfigure is handled separately.
-        else:
-            url = var.get("url", "")
+        for var, url in missing:
            if url:
-                _print_info(f"  Get yours at: {url}")
-
-            default_val = var.get("default", "")
-            if default_val:
-                value = _prompt(f"    {var.get('prompt', var['key'])}", default_val)
+                print(color(f"    {var}", Colors.CYAN) + color(f"  ({url})", Colors.DIM))
            else:
-                value = _prompt(f"    {var.get('prompt', var['key'])}", password=True)
+                print(color(f"    {var}", Colors.CYAN))

-            if value:
-                save_env_value(var["key"], value)
-                _print_success(f"    Saved")
-            else:
-                _print_warning(f"    Skipped")
-                all_configured = False
-
-    # Run post-setup hooks if needed
-    if provider.get("post_setup") and all_configured:
-        _run_post_setup(provider["post_setup"])
-
-    if all_configured:
-        _print_success(f"  {provider['name']} configured!")
-
-
-def _configure_simple_requirements(ts_key: str):
-    """Simple fallback for toolsets that just need env vars (no provider selection)."""
-    requirements = TOOLSET_ENV_REQUIREMENTS.get(ts_key, [])
-    if not requirements:
-        return
-
-    missing = [(var, url) for var, url in requirements if not get_env_value(var)]
-    if not missing:
-        return
-
-    ts_label = next((l for k, l, _ in CONFIGURABLE_TOOLSETS if k == ts_key), ts_key)
-    print()
-    print(color(f"  {ts_label} requires configuration:", Colors.YELLOW))
-
-    for var, url in missing:
-        if url:
-            _print_info(f"  Get key at: {url}")
-        value = _prompt(f"    {var}", password=True)
-        if value and value.strip():
-            save_env_value(var, value.strip())
-            _print_success(f"    Saved")
-        else:
-            _print_warning(f"    Skipped")
-
-
-def _reconfigure_tool(config: dict):
-    """Let user reconfigure an existing tool's provider or API key."""
-    # Build list of configurable tools that are currently set up
-    configurable = []
-    for ts_key, ts_label, _ in CONFIGURABLE_TOOLSETS:
-        cat = TOOL_CATEGORIES.get(ts_key)
-        reqs = TOOLSET_ENV_REQUIREMENTS.get(ts_key)
-        if cat or reqs:
-            if _toolset_has_keys(ts_key):
-                configurable.append((ts_key, ts_label))
-
-    if not configurable:
-        _print_info("No configured tools to reconfigure.")
-        return
-
-    choices = [label for _, label in configurable]
-    choices.append("Cancel")
-
-    idx = _prompt_choice("  Which tool would you like to reconfigure?", choices, len(choices) - 1)
-
-    if idx >= len(configurable):
-        return  # Cancel
-
-    ts_key, ts_label = configurable[idx]
-    cat = TOOL_CATEGORIES.get(ts_key)
-
-    if cat:
-        _configure_tool_category_for_reconfig(ts_key, cat, config)
-    else:
-        _reconfigure_simple_requirements(ts_key)
-
-    save_config(config)
-
-
-def _configure_tool_category_for_reconfig(ts_key: str, cat: dict, config: dict):
-    """Reconfigure a tool category - provider selection + API key update."""
-    icon = cat.get("icon", "")
-    name = cat["name"]
-    providers = cat["providers"]
-
-    if len(providers) == 1:
-        provider = providers[0]
-        print()
-        print(color(f"  --- {icon} {name} ({provider['name']}) ---", Colors.CYAN))
-        _reconfigure_provider(provider, config)
-    else:
-        print()
-        print(color(f"  --- {icon} {name} - Choose a provider ---", Colors.CYAN))
        print()
+        try:
+            response = input(color("  Set up now? [Y/n] ", Colors.YELLOW)).strip().lower()
+        except (KeyboardInterrupt, EOFError):
+            print()
+            continue

-        provider_choices = []
-        for p in providers:
-            tag = f" ({p['tag']})" if p.get("tag") else ""
-            configured = ""
-            env_vars = p.get("env_vars", [])
-            if not env_vars or all(get_env_value(v["key"]) for v in env_vars):
-                if p.get("tts_provider") and config.get("tts", {}).get("provider") == p["tts_provider"]:
-                    configured = " [active]"
-                elif not env_vars:
-                    configured = ""
+        if response in ("", "y", "yes"):
+            for var, url in missing:
+                if url:
+                    print(color(f"    Get key at: {url}", Colors.DIM))
+                try:
+                    import getpass
+                    value = getpass.getpass(color(f"    {var}: ", Colors.YELLOW))
+                except (KeyboardInterrupt, EOFError):
+                    print()
+                    break
+                if value.strip():
+                    save_env_value(var, value.strip())
+                    print(color(f"    ✓ Saved", Colors.GREEN))
                else:
-                    configured = " [configured]"
-            provider_choices.append(f"{p['name']}{tag}{configured}")
-
-        default_idx = 0
-        for i, p in enumerate(providers):
-            if p.get("tts_provider") and config.get("tts", {}).get("provider") == p["tts_provider"]:
-                default_idx = i
-                break
-            env_vars = p.get("env_vars", [])
-            if env_vars and all(get_env_value(v["key"]) for v in env_vars):
-                default_idx = i
-                break
-
-        provider_idx = _prompt_choice("  Select provider:", provider_choices, default_idx)
-        _reconfigure_provider(providers[provider_idx], config)
-
-
-def _reconfigure_provider(provider: dict, config: dict):
-    """Reconfigure a provider - update API keys."""
-    env_vars = provider.get("env_vars", [])
-
-    if provider.get("tts_provider"):
-        config.setdefault("tts", {})["provider"] = provider["tts_provider"]
-        _print_success(f"  TTS provider set to: {provider['tts_provider']}")
-
-    if not env_vars:
-        _print_success(f"  {provider['name']} - no configuration needed!")
-        return
-
-    for var in env_vars:
-        existing = get_env_value(var["key"])
-        if existing:
-            _print_info(f"  {var['key']}: configured ({existing[:8]}...)")
-        url = var.get("url", "")
-        if url:
-            _print_info(f"  Get yours at: {url}")
-        default_val = var.get("default", "")
-        value = _prompt(f"    {var.get('prompt', var['key'])} (Enter to keep current)", password=not default_val)
-        if value and value.strip():
-            save_env_value(var["key"], value.strip())
-            _print_success(f"    Updated")
+                    print(color(f"    Skipped", Colors.DIM))
        else:
-            _print_info(f"    Kept current")
+            print(color("    Skipped — configure later with 'hermes setup'", Colors.DIM))


-def _reconfigure_simple_requirements(ts_key: str):
-    """Reconfigure simple env var requirements."""
-    requirements = TOOLSET_ENV_REQUIREMENTS.get(ts_key, [])
-    if not requirements:
-        return
-
-    ts_label = next((l for k, l, _ in CONFIGURABLE_TOOLSETS if k == ts_key), ts_key)
-    print()
-    print(color(f"  {ts_label}:", Colors.CYAN))
-
-    for var, url in requirements:
-        existing = get_env_value(var)
-        if existing:
-            _print_info(f"  {var}: configured ({existing[:8]}...)")
-        if url:
-            _print_info(f"  Get key at: {url}")
-        value = _prompt(f"    {var} (Enter to keep current)", password=True)
-        if value and value.strip():
-            save_env_value(var, value.strip())
-            _print_success(f"    Updated")
-        else:
-            _print_info(f"    Kept current")
-
-
-# ─── Main Entry Point ─────────────────────────────────────────────────────────
-
-def tools_command(args=None):
-    """Entry point for `hermes tools` and `hermes setup tools`."""
+def tools_command(args):
+    """Entry point for `hermes tools`."""
    config = load_config()
    enabled_platforms = _get_enabled_platforms()

    print()
    print(color("⚕ Hermes Tool Configuration", Colors.CYAN, Colors.BOLD))
    print(color("  Enable or disable tools per platform.", Colors.DIM))
-    print(color("  Tools that need API keys will be configured when enabled.", Colors.DIM))
    print()

    # Build platform choices
@@ -849,27 +380,21 @@ def tools_command(args=None):
    platform_keys = []
    for pkey in enabled_platforms:
        pinfo = PLATFORMS[pkey]
+        # Count currently enabled toolsets
        current = _get_platform_tools(config, pkey)
        count = len(current)
        total = len(CONFIGURABLE_TOOLSETS)
        platform_choices.append(f"Configure {pinfo['label']}  ({count}/{total} enabled)")
        platform_keys.append(pkey)

-    platform_choices.append("Reconfigure an existing tool's provider or API key")
-    platform_choices.append("Done")
+    platform_choices.append("Done — save and exit")

    while True:
-        idx = _prompt_choice("Select an option:", platform_choices, default=0)
+        idx = _prompt_choice("Select a platform to configure:", platform_choices, default=0)

        # "Done" selected
-        if idx == len(platform_keys) + 1:
-            break
-
-        # "Reconfigure" selected
        if idx == len(platform_keys):
-            _reconfigure_tool(config)
-            print()
-            continue
+            break

        pkey = platform_keys[idx]
        pinfo = PLATFORMS[pkey]
@@ -893,15 +418,11 @@ def tools_command(args=None):
                    label = next((l for k, l, _ in CONFIGURABLE_TOOLSETS if k == ts), ts)
                    print(color(f"  - {label}", Colors.RED))

-            # Configure newly enabled toolsets that need API keys
+            # Prompt for missing API keys on newly enabled toolsets
            if added:
-                for ts_key in sorted(added):
-                    if TOOL_CATEGORIES.get(ts_key) or TOOLSET_ENV_REQUIREMENTS.get(ts_key):
-                        if not _toolset_has_keys(ts_key):
-                            _configure_toolset(ts_key, config)
+                _check_and_prompt_requirements(added)

            _save_platform_tools(config, pkey, new_enabled)
-            save_config(config)
            print(color(f"  ✓ Saved {pinfo['label']} configuration", Colors.GREEN))
        else:
            print(color(f"  No changes to {pinfo['label']}", Colors.DIM))
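Review note: after this change, key checking reduces to one lookup in a flat table of `(env var, help URL)` pairs. The sketch below illustrates that logic in isolation. It assumes a plain mapping in place of the project's `get_env_value`, and the names `REQUIREMENTS`, `missing_requirements` and `toolset_has_keys` are illustrative only, not the module's API.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple

# Hypothetical stand-in for the table in this diff: toolset key -> (env var, help URL).
REQUIREMENTS: Dict[str, List[Tuple[str, Optional[str]]]] = {
    "web": [("FIRECRAWL_API_KEY", "https://firecrawl.dev/")],
    "browser": [("BROWSERBASE_API_KEY", "https://browserbase.com/"),
                ("BROWSERBASE_PROJECT_ID", None)],
    "tts": [],  # free provider, nothing to configure
}


def missing_requirements(
    ts_key: str, env: Mapping[str, str] = os.environ
) -> List[Tuple[str, Optional[str]]]:
    """Return the (var, url) pairs for ts_key that are unset or empty in env."""
    return [(var, url) for var, url in REQUIREMENTS.get(ts_key, []) if not env.get(var)]


def toolset_has_keys(ts_key: str, env: Mapping[str, str] = os.environ) -> bool:
    """True when every required variable is set; unknown toolsets count as satisfied."""
    return not missing_requirements(ts_key, env)
```

A toolset with an empty list and a toolset absent from the table both count as satisfied, which is why the checklist label also tests that the table has a non-empty entry before showing the warning suffix.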
--- a/hermes_time.py
+++ b/hermes_time.py
@@ -1,119 +0,0 @@
-"""
-Timezone-aware clock for Hermes.
-
-Provides a single ``now()`` helper that returns a timezone-aware datetime
-based on the user's configured IANA timezone (e.g. ``Asia/Kolkata``).
-
-Resolution order:
-  1. ``HERMES_TIMEZONE`` environment variable
-  2. ``timezone`` key in ``~/.hermes/config.yaml``
-  3. Falls back to the server's local time (``datetime.now().astimezone()``)
-
-Invalid timezone values log a warning and fall back safely — Hermes never
-crashes due to a bad timezone string.
-"""
-
-import logging
-import os
-from datetime import datetime, timezone as _tz
-from pathlib import Path
-from typing import Optional
-
-logger = logging.getLogger(__name__)
-
-try:
-    from zoneinfo import ZoneInfo
-except ImportError:
-    # Python 3.8 fallback (shouldn't be needed — Hermes requires 3.9+)
-    from backports.zoneinfo import ZoneInfo  # type: ignore[no-redef]
-
-# Cached state — resolved once, reused on every call.
-# Call reset_cache() to force re-resolution (e.g. after config changes).
-_cached_tz: Optional[ZoneInfo] = None
-_cached_tz_name: Optional[str] = None
-_cache_resolved: bool = False
-
-
-def _resolve_timezone_name() -> str:
-    """Read the configured IANA timezone string (or empty string).
-
-    This does file I/O when falling through to config.yaml, so callers
-    should cache the result rather than calling on every ``now()``.
-    """
-    # 1. Environment variable (highest priority — set by Supervisor, etc.)
-    tz_env = os.getenv("HERMES_TIMEZONE", "").strip()
-    if tz_env:
-        return tz_env
-
-    # 2. config.yaml ``timezone`` key
-    try:
-        import yaml
-        hermes_home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
-        config_path = hermes_home / "config.yaml"
-        if config_path.exists():
-            with open(config_path) as f:
-                cfg = yaml.safe_load(f) or {}
-            tz_cfg = cfg.get("timezone", "")
-            if isinstance(tz_cfg, str) and tz_cfg.strip():
-                return tz_cfg.strip()
-    except Exception:
-        pass
-
-    return ""
-
-
-def _get_zoneinfo(name: str) -> Optional[ZoneInfo]:
-    """Validate and return a ZoneInfo, or None if invalid."""
-    if not name:
-        return None
-    try:
-        return ZoneInfo(name)
-    except (KeyError, Exception) as exc:
-        logger.warning(
-            "Invalid timezone '%s': %s. Falling back to server local time.",
-            name, exc,
-        )
-        return None
-
-
-def get_timezone() -> Optional[ZoneInfo]:
-    """Return the user's configured ZoneInfo, or None (meaning server-local).
-
-    Resolved once and cached. Call ``reset_cache()`` after config changes.
-    """
-    global _cached_tz, _cached_tz_name, _cache_resolved
-    if not _cache_resolved:
-        _cached_tz_name = _resolve_timezone_name()
-        _cached_tz = _get_zoneinfo(_cached_tz_name)
-        _cache_resolved = True
-    return _cached_tz
-
-
-def get_timezone_name() -> str:
-    """Return the IANA name of the configured timezone, or empty string."""
-    global _cached_tz_name, _cache_resolved
-    if not _cache_resolved:
-        get_timezone()  # populates cache
-    return _cached_tz_name or ""
-
-
-def now() -> datetime:
-    """
-    Return the current time as a timezone-aware datetime.
-
-    If a valid timezone is configured, returns wall-clock time in that zone.
-    Otherwise returns the server's local time (via ``astimezone()``).
-    """
-    tz = get_timezone()
-    if tz is not None:
-        return datetime.now(tz)
-    # No timezone configured — use server-local (still tz-aware)
-    return datetime.now().astimezone()
-
-
-def reset_cache() -> None:
-    """Clear the cached timezone. Used by tests and after config changes."""
-    global _cached_tz, _cached_tz_name, _cache_resolved
-    _cached_tz = None
-    _cached_tz_name = None
-    _cache_resolved = False
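Review note: the removed module resolved a timezone name, validated it, and fell back to server-local time on failure. The sketch below shows that fallback pattern only, under stated assumptions: it reads just the `HERMES_TIMEZONE` environment variable and omits the `config.yaml` lookup, the caching and the warning log that the deleted file had. The function names are illustrative.

```python
import os
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo


def resolve_zone(name: str) -> Optional[ZoneInfo]:
    """Return a ZoneInfo for name, or None when the name is empty or unusable."""
    if not name:
        return None
    try:
        return ZoneInfo(name)
    except (KeyError, ValueError, OSError):
        # ZoneInfoNotFoundError subclasses KeyError; malformed keys raise
        # ValueError. OSError is caught defensively for odd filesystem cases.
        return None


def now(env_var: str = "HERMES_TIMEZONE") -> datetime:
    """Timezone-aware current time: the configured zone if valid, else server-local."""
    tz = resolve_zone(os.getenv(env_var, "").strip())
    return datetime.now(tz) if tz is not None else datetime.now().astimezone()
```

Either branch returns an aware `datetime`, so callers can compare or subtract results without checking which path was taken.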
--- a/modal_app.py
+++ b/modal_app.py
@@ -0,0 +1,64 @@
+"""Modal deployment configuration for hermes-agent.
+
+Deploys the FastAPI streaming wrapper as a serverless ASGI app on Modal.
+
+Usage:
+    modal deploy modal_app.py       # Deploy to Modal
+    modal serve modal_app.py        # Local dev with hot-reload
+"""
+
+import modal
+
+image = (
+    modal.Image.debian_slim(python_version="3.11")
+    .apt_install("git")
+    .pip_install(
+        "fastapi[standard]",
+        "uvicorn",
+        "openai",
+        "python-dotenv",
+        "fire",
+        "httpx",
+        "rich",
+        "tenacity",
+        "pyyaml",
+        "requests",
+        "jinja2",
+        "pydantic>=2.0",
+        "prompt_toolkit",
+        "firecrawl-py",
+        "fal-client",
+        "edge-tts",
+        "litellm>=1.75.5",
+        "typer",
+        "platformdirs",
+        "PyJWT[crypto]",
+    )
+    .add_local_dir(".", "/app", copy=True, ignore=[".git", "__pycache__", "venv", ".venv", "*.pyc"])
+)
+
+app = modal.App("hermes-agent", image=image)
+
+
+@app.function(
+    min_containers=0,
+    scaledown_window=300,
+    timeout=600,
+    secrets=[modal.Secret.from_name("hermes-secrets")],
+)
+@modal.concurrent(max_inputs=10)
+@modal.asgi_app()
+def web():
+    import os
+    import sys
+    from pathlib import Path
+
+    # Force HERMES_HOME to a known writable path inside the container
+    hermes_home = "/tmp/hermes"
+    os.environ["HERMES_HOME"] = hermes_home
+    Path(hermes_home).mkdir(parents=True, exist_ok=True)
+    (Path(hermes_home) / "logs").mkdir(parents=True, exist_ok=True)
+
+    sys.path.insert(0, "/app")
+    from serve import app as fastapi_app
+    return fastapi_app
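Review note: `web()` prepares a writable `HERMES_HOME` before importing the FastAPI app. The directory setup can be exercised locally without Modal. The sketch below is a hypothetical helper (`prepare_home` is not part of this PR) that mirrors those steps and takes the base path as a parameter so it can point at a temporary directory.

```python
import os
import tempfile
from pathlib import Path


def prepare_home(base: str) -> Path:
    """Create base and base/logs, export HERMES_HOME, and return the path."""
    home = Path(base)
    # parents=True creates `home` as well; exist_ok=True makes repeat calls safe.
    (home / "logs").mkdir(parents=True, exist_ok=True)
    os.environ["HERMES_HOME"] = str(home)
    return home


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        home = prepare_home(os.path.join(tmp, "hermes"))
        print(home.is_dir(), (home / "logs").is_dir())
```

Since the deployed path is under `/tmp`, anything the agent writes there should be treated as ephemeral.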
--- a/model_tools.py
+++ b/model_tools.py
@@ -225,18 +225,6 @@ def get_tool_definitions(
    # Ask the registry for schemas (only returns tools whose check_fn passes)
    filtered_tools = registry.get_definitions(tools_to_include, quiet=quiet_mode)

-    # Rebuild execute_code schema to only list sandbox tools that are actually
-    # enabled.  Without this, the model sees "web_search is available in
-    # execute_code" even when the user disabled the web toolset (#560-discord).
-    if "execute_code" in tools_to_include:
-        from tools.code_execution_tool import SANDBOX_ALLOWED_TOOLS, build_execute_code_schema
-        sandbox_enabled = SANDBOX_ALLOWED_TOOLS & tools_to_include
-        dynamic_schema = build_execute_code_schema(sandbox_enabled)
-        for i, td in enumerate(filtered_tools):
-            if td.get("function", {}).get("name") == "execute_code":
-                filtered_tools[i] = {"type": "function", "function": dynamic_schema}
-                break
-
    if not quiet_mode:
        if filtered_tools:
            tool_names = [t["function"]["name"] for t in filtered_tools]
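Review note: the block removed above intersected the sandbox allow-list with the enabled tools and swapped in a rebuilt `execute_code` schema. The sketch below shows that pattern in isolation, for anyone judging what the removal changes. The allow-list contents and `build_schema` are hypothetical stand-ins; the real `SANDBOX_ALLOWED_TOOLS` and `build_execute_code_schema` live in `tools/code_execution_tool.py` and are not reproduced here.

```python
from typing import Dict, List, Set

# Hypothetical allow-list, for illustration only.
SANDBOX_ALLOWED: Set[str] = {"web_search", "read_file", "terminal"}


def build_schema(enabled: Set[str]) -> Dict:
    """Hypothetical schema builder that names only the enabled sandbox tools."""
    names = ", ".join(sorted(enabled)) or "none"
    return {"name": "execute_code", "description": f"Sandbox tools: {names}"}


def patch_execute_code(tools: List[Dict], tools_to_include: Set[str]) -> List[Dict]:
    """Swap the execute_code entry for one limited to allow-listed, enabled tools."""
    if "execute_code" not in tools_to_include:
        return tools
    schema = build_schema(SANDBOX_ALLOWED & tools_to_include)
    return [
        {"type": "function", "function": schema}
        if td.get("function", {}).get("name") == "execute_code" else td
        for td in tools
    ]
```

Without a step like this, the schema text is static, so it may mention sandbox tools that belong to a toolset the user has disabled.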
--- a/optional-skills/DESCRIPTION.md
+++ b/optional-skills/DESCRIPTION.md
@@ -1,24 +0,0 @@
-# Optional Skills
-
-Official skills maintained by Nous Research that are **not activated by default**.
-
-These skills ship with the hermes-agent repository but are not copied to
-`~/.hermes/skills/` during setup. They are discoverable via the Skills Hub:
-
-```bash
-hermes skills browse               # browse all skills, official shown first
-hermes skills browse --source official  # browse only official optional skills
-hermes skills search <query>       # finds optional skills labeled "official"
-hermes skills install <identifier> # copies to ~/.hermes/skills/ and activates
-```
-
-## Why optional?
-
-Some skills are useful but not broadly needed by every user:
-
-- **Niche integrations** — specific paid services, specialized tools
-- **Experimental features** — promising but not yet proven
-- **Heavyweight dependencies** — require significant setup (API keys, installs)
-
-By keeping them optional, we keep the default skill set lean while still
-providing curated, tested, official skills for users who want them.
--- a/optional-skills/autonomous-ai-agents/DESCRIPTION.md
+++ b/optional-skills/autonomous-ai-agents/DESCRIPTION.md
@@ -1,2 +0,0 @@
-Optional autonomous AI agent integrations — external coding agent CLIs
-that can be delegated to for independent coding tasks.
--- a/optional-skills/autonomous-ai-agents/blackbox/SKILL.md
+++ b/optional-skills/autonomous-ai-agents/blackbox/SKILL.md
@@ -1,143 +0,0 @@
----
-name: blackbox
-description: Delegate coding tasks to Blackbox AI CLI agent. Multi-model agent with built-in judge that runs tasks through multiple LLMs and picks the best result. Requires the blackbox CLI and a Blackbox AI API key.
-version: 1.0.0
-author: Hermes Agent (Nous Research)
-license: MIT
-metadata:
-  hermes:
-    tags: [Coding-Agent, Blackbox, Multi-Agent, Judge, Multi-Model]
-    related_skills: [claude-code, codex, hermes-agent]
----
-
-# Blackbox CLI
-
-Delegate coding tasks to [Blackbox AI](https://www.blackbox.ai/) via the Hermes terminal. Blackbox is a multi-model coding agent CLI that dispatches tasks to multiple LLMs (Claude, Codex, Gemini, Blackbox Pro) and uses a judge to select the best implementation.
-
-The CLI is [open-source](https://github.com/blackboxaicode/cli) (GPL-3.0, TypeScript, forked from Gemini CLI) and supports interactive sessions, non-interactive one-shots, checkpointing, MCP, and vision model switching.
-
-## Prerequisites
-
- Node.js 20+ installed
- Blackbox CLI installed: `npm install -g @blackboxai/cli`
- Or install from source:
-  ```
-  git clone https://github.com/blackboxaicode/cli.git
-  cd cli && npm install && npm install -g .
-  ```
- API key from [app.blackbox.ai/dashboard](https://app.blackbox.ai/dashboard)
- Configured: run `blackbox configure` and enter your API key
- Use `pty=true` in terminal calls — Blackbox CLI is an interactive terminal app
-
-## One-Shot Tasks
-
-```
-terminal(command="blackbox --prompt 'Add JWT authentication with refresh tokens to the Express API'", workdir="/path/to/project", pty=true)
-```
-
-For quick scratch work:
-```
-terminal(command="cd $(mktemp -d) && git init && blackbox --prompt 'Build a REST API for todos with SQLite'", pty=true)
-```
-
-## Background Mode (Long Tasks)
-
-For tasks that take minutes, use background mode so you can monitor progress:
-
-```
-# Start in background with PTY
-terminal(command="blackbox --prompt 'Refactor the auth module to use OAuth 2.0'", workdir="~/project", background=true, pty=true)
-# Returns session_id
-
-# Monitor progress
-process(action="poll", session_id="<id>")
-process(action="log", session_id="<id>")
-
-# Send input if Blackbox asks a question
-process(action="submit", session_id="<id>", data="yes")
-
-# Kill if needed
-process(action="kill", session_id="<id>")
-```
-
-## Checkpoints & Resume
-
-Blackbox CLI has built-in checkpoint support for pausing and resuming tasks:
-
-```
-# After a task completes, Blackbox shows a checkpoint tag
-# Resume with a follow-up task:
-terminal(command="blackbox --resume-checkpoint 'task-abc123-2026-03-06' --prompt 'Now add rate limiting to the endpoints'", workdir="~/project", pty=true)
-```
-
-## Session Commands
-
-During an interactive session, use these commands:
-
-| Command | Effect |
-|---------|--------|
-| `/compress` | Shrink conversation history to save tokens |
-| `/clear` | Wipe history and start fresh |
-| `/stats` | View current token usage |
-| `Ctrl+C` | Cancel current operation |
-
-## PR Reviews
-
-Clone to a temp directory to avoid modifying the working tree:
-
-```
-terminal(command="REVIEW=$(mktemp -d) && git clone https://github.com/user/repo.git $REVIEW && cd $REVIEW && gh pr checkout 42 && blackbox --prompt 'Review this PR against main. Check for bugs, security issues, and code quality.'", pty=true)
-```
-
-## Parallel Work
-
-Spawn multiple Blackbox instances for independent tasks:
-
-```
-terminal(command="blackbox --prompt 'Fix the login bug'", workdir="/tmp/issue-1", background=true, pty=true)
-terminal(command="blackbox --prompt 'Add unit tests for auth'", workdir="/tmp/issue-2", background=true, pty=true)
-
-# Monitor all
-process(action="list")
-```
-
-## Multi-Model Mode
-
-Blackbox's unique feature is running the same task through multiple models and judging the results. Configure which models to use via `blackbox configure` — select multiple providers to enable the Chairman/judge workflow where the CLI evaluates outputs from different models and picks the best one.
-
-## Key Flags
-
-| Flag | Effect |
-|------|--------|
-| `--prompt "task"` | Non-interactive one-shot execution |
-| `--resume-checkpoint "tag"` | Resume from a saved checkpoint |
-| `--yolo` | Auto-approve all actions and model switches |
-| `blackbox session` | Start interactive chat session |
-| `blackbox configure` | Change settings, providers, models |
-| `blackbox info` | Display system information |
-
-## Vision Support
-
-Blackbox automatically detects images in input and can switch to multimodal analysis. VLM modes:
- `"once"` — Switch model for current query only
- `"session"` — Switch for entire session
- `"persist"` — Stay on current model (no switch)
-
-## Token Limits
-
-Control token usage via `.blackboxcli/settings.json`:
-```json
-{
-  "sessionTokenLimit": 32000
-}
-```
-
-## Rules
-
-1. **Always use `pty=true`** — Blackbox CLI is an interactive terminal app and will hang without a PTY
-2. **Use `workdir`** — keep the agent focused on the right directory
-3. **Background for long tasks** — use `background=true` and monitor with `process` tool
-4. **Don't interfere** — monitor with `poll`/`log`, don't kill sessions because they're slow
-5. **Report results** — after completion, check what changed and summarize for the user
-6. **Credits cost money** — Blackbox uses a credit-based system; multi-model mode consumes credits faster
-7. **Check prerequisites** — verify `blackbox` CLI is installed before attempting delegation
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,9 +5,9 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "hermes-agent"
 version = "0.1.0"
-description = "The self-improving AI agent — creates skills from experience, improves them during use, and runs anywhere"
+description = "AI agent with advanced tool-calling and toolsets"
 readme = "README.md"
-requires-python = ">=3.11"
+requires-python = ">=3.10"
 authors = [{ name = "Nous Research" }]
 license = { text = "MIT" }
 dependencies = [
@@ -39,7 +39,7 @@ dependencies = [

 [project.optional-dependencies]
 modal = ["swe-rex[modal]>=1.4.0"]
-daytona = ["daytona>=0.148.0"]
+serve = ["fastapi[standard]", "uvicorn"]
 dev = ["pytest", "pytest-asyncio"]
 messaging = ["python-telegram-bot>=20.0", "discord.py>=2.0", "aiohttp>=3.9.0", "slack-bolt>=1.18.0", "slack-sdk>=3.27.0"]
 cron = ["croniter"]
@@ -50,10 +50,9 @@ pty = ["ptyprocess>=0.7.0"]
 honcho = ["honcho-ai>=2.0.1"]
 mcp = ["mcp>=1.2.0"]
 homeassistant = ["aiohttp>=3.9.0"]
-yc-bench = ["yc-bench @ git+https://github.com/collinear-ai/yc-bench.git"]
 all = [
  "hermes-agent[modal]",
-  "hermes-agent[daytona]",
+  "hermes-agent[serve]",
  "hermes-agent[messaging]",
  "hermes-agent[cron]",
  "hermes-agent[cli]",
--- a/run_agent.py
+++ b/run_agent.py
@@ -26,6 +26,7 @@ import json
 import logging
 logger = logging.getLogger(__name__)
 import os
+import queue
 import random
 import re
 import sys
@@ -82,8 +83,6 @@ from agent.prompt_builder import (
 from agent.model_metadata import (
    fetch_model_metadata, get_model_context_length,
    estimate_tokens_rough, estimate_messages_tokens_rough,
-    get_next_probe_tier, parse_context_limit_from_error,
-    save_context_length,
 )
 from agent.context_compressor import ContextCompressor
 from agent.prompt_caching import apply_anthropic_cache_control
@@ -99,46 +98,6 @@ from agent.trajectory import (
 )


-class IterationBudget:
-    """Thread-safe shared iteration counter for parent and child agents.
-
-    Tracks total LLM-call iterations consumed across a parent agent and all
-    its subagents.  A single ``IterationBudget`` is created by the parent
-    and passed to every child so they share the same cap.
-
-    ``execute_code`` (programmatic tool calling) iterations are refunded via
-    :meth:`refund` so they don't eat into the budget.
-    """
-
-    def __init__(self, max_total: int):
-        self.max_total = max_total
-        self._used = 0
-        self._lock = threading.Lock()
-
-    def consume(self) -> bool:
-        """Try to consume one iteration.  Returns True if allowed."""
-        with self._lock:
-            if self._used >= self.max_total:
-                return False
-            self._used += 1
-            return True
-
-    def refund(self) -> None:
-        """Give back one iteration (e.g. for execute_code turns)."""
-        with self._lock:
-            if self._used > 0:
-                self._used -= 1
-
-    @property
-    def used(self) -> int:
-        return self._used
-
-    @property
-    def remaining(self) -> int:
-        with self._lock:
-            return max(0, self.max_total - self._used)
-
-
 class AIAgent:
    """
    AI Agent with tool calling capabilities.
@@ -154,7 +113,7 @@ class AIAgent:
        provider: str = None,
        api_mode: str = None,
        model: str = "anthropic/claude-opus-4.6",  # OpenRouter format
-        max_iterations: int = 90,  # Default tool-calling iterations (shared with subagents)
+        max_iterations: int = 60,  # Default tool-calling iterations
        tool_delay: float = 1.0,
        enabled_toolsets: List[str] = None,
        disabled_toolsets: List[str] = None,
@@ -182,7 +141,8 @@ class AIAgent:
        skip_memory: bool = False,
        session_db=None,
        honcho_session_key: str = None,
-        iteration_budget: "IterationBudget" = None,
+        event_queue: "queue.Queue | None" = None,
+        extra_tags: List[str] = None,
    ):
        """
        Initialize the AI Agent.
@@ -193,7 +153,7 @@ class AIAgent:
            provider (str): Provider identifier (optional; used for telemetry/routing hints)
            api_mode (str): API mode override: "chat_completions" or "codex_responses"
            model (str): Model name to use (default: "anthropic/claude-opus-4.6")
-            max_iterations (int): Maximum number of tool calling iterations (default: 90)
+            max_iterations (int): Maximum number of tool calling iterations (default: 60)
            tool_delay (float): Delay between tool calls in seconds (default: 1.0)
            enabled_toolsets (List[str]): Only enable tools from these toolsets (optional)
            disabled_toolsets (List[str]): Disable tools from these toolsets (optional)
@@ -227,9 +187,6 @@ class AIAgent:
        """
        self.model = model
        self.max_iterations = max_iterations
-        # Shared iteration budget — parent creates, children inherit.
-        # Consumed by every LLM turn across parent + all subagents.
-        self.iteration_budget = iteration_budget or IterationBudget(max_iterations)
        self.tool_delay = tool_delay
        self.save_trajectories = save_trajectories
        self.verbose_logging = verbose_logging
@@ -263,6 +220,8 @@ class AIAgent:
        self.tool_progress_callback = tool_progress_callback
        self.clarify_callback = clarify_callback
        self.step_callback = step_callback
+        self.event_queue: queue.Queue | None = event_queue
+        self._extra_tags: List[str] = extra_tags or []
        self._last_reported_tool = None  # Track for "new tool" mode
        
        # Interrupt mechanism for breaking out of tool loops
@@ -301,7 +260,7 @@ class AIAgent:
        # Persistent error log -- always writes WARNING+ to ~/.hermes/logs/errors.log
        # so tool failures, API errors, etc. are inspectable after the fact.
        from agent.redact import RedactingFormatter
-        _error_log_dir = Path.home() / ".hermes" / "logs"
+        _error_log_dir = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes")) / "logs"
        _error_log_dir.mkdir(parents=True, exist_ok=True)
        _error_log_path = _error_log_dir / "errors.log"
        from logging.handlers import RotatingFileHandler
@@ -582,7 +541,6 @@ class AIAgent:
            summary_target_tokens=500,
            summary_model_override=compression_summary_model,
            quiet_mode=self.quiet_mode,
-            base_url=self.base_url,
        )
        self.compression_enabled = compression_enabled
        self._user_turn_count = 0
@@ -1352,6 +1310,19 @@ class AIAgent:
        except Exception as e:
            logger.debug("Honcho sync failed (non-fatal): %s", e)

+    def _emit_event(self, event: Dict[str, Any]) -> None:
+        """Push a structured event onto the event queue (if one is attached).
+
+        Used by the serve layer to stream intermediate agent progress
+        (text tokens, tool calls, tool results) back to callers over SSE.
+        No-op when ``event_queue`` is ``None`` (CLI / gateway usage).
+        """
+        if self.event_queue is not None:
+            try:
+                self.event_queue.put_nowait(event)
+            except queue.Full:
+                logger.debug("Event queue full; dropping %r event", event.get("type"))
+
    def _build_system_prompt(self, system_message: str = None) -> str:
        """
        Assemble the full system prompt from all layers.
@@ -1407,8 +1378,7 @@ class AIAgent:
            if context_files_prompt:
                prompt_parts.append(context_files_prompt)

-        from hermes_time import now as _hermes_now
-        now = _hermes_now()
+        now = datetime.now()
        prompt_parts.append(
            f"Conversation started: {now.strftime('%A, %B %d, %Y %I:%M %p')}"
        )
@@ -2063,49 +2033,6 @@ class AIAgent:

        return True

-    def _try_refresh_nous_client_credentials(self, *, force: bool = True) -> bool:
-        if self.api_mode != "chat_completions" or self.provider != "nous":
-            return False
-
-        try:
-            from hermes_cli.auth import resolve_nous_runtime_credentials
-
-            creds = resolve_nous_runtime_credentials(
-                min_key_ttl_seconds=max(60, int(os.getenv("HERMES_NOUS_MIN_KEY_TTL_SECONDS", "1800"))),
-                timeout_seconds=float(os.getenv("HERMES_NOUS_TIMEOUT_SECONDS", "15")),
-                force_mint=force,
-            )
-        except Exception as exc:
-            logger.debug("Nous credential refresh failed: %s", exc)
-            return False
-
-        api_key = creds.get("api_key")
-        base_url = creds.get("base_url")
-        if not isinstance(api_key, str) or not api_key.strip():
-            return False
-        if not isinstance(base_url, str) or not base_url.strip():
-            return False
-
-        self.api_key = api_key.strip()
-        self.base_url = base_url.strip().rstrip("/")
-        self._client_kwargs["api_key"] = self.api_key
-        self._client_kwargs["base_url"] = self.base_url
-        # Nous requests should not inherit OpenRouter-only attribution headers.
-        self._client_kwargs.pop("default_headers", None)
-
-        try:
-            self.client.close()
-        except Exception:
-            pass
-
-        try:
-            self.client = OpenAI(**self._client_kwargs)
-        except Exception as exc:
-            logger.warning("Failed to rebuild OpenAI client after Nous refresh: %s", exc)
-            return False
-
-        return True
-
    def _interruptible_api_call(self, api_kwargs: dict):
        """
        Run the API call in a background thread so the main conversation loop
@@ -2227,9 +2154,11 @@ class AIAgent:
                    "effort": "xhigh"
                }

-        # Nous Portal product attribution
+        # Nous Portal product attribution + caller-supplied tags
        if _is_nous:
-            extra_body["tags"] = ["product=hermes-agent"]
+            tags = list(self._extra_tags)
+            tags.append("product=hermes-agent")
+            extra_body["tags"] = tags

        if extra_body:
            api_kwargs["extra_body"] = extra_body
@@ -2545,6 +2474,13 @@ class AIAgent:
                except Exception as cb_err:
                    logging.debug(f"Tool progress callback error: {cb_err}")

+            self._emit_event({
+                "type": "tool-call",
+                "name": function_name,
+                "args": function_args,
+                "status": "calling",
+            })
+
            tool_start_time = time.time()

            if function_name == "todo":
@@ -2708,6 +2644,14 @@ class AIAgent:
            messages.append(tool_msg)
            self._log_msg_to_db(tool_msg)

+            self._emit_event({
+                "type": "tool-result",
+                "name": function_name,
+                "output": function_result[:4000],
+                "status": "complete",
+                "duration": round(tool_duration, 2),
+            })
+
            if not self.quiet_mode:
                response_preview = function_result[:self.log_prefix_chars] + "..." if len(function_result) > self.log_prefix_chars else function_result
                print(f"  ✅ Tool {i} completed in {tool_duration:.2f}s - {response_preview}")
@@ -2831,7 +2775,7 @@ class AIAgent:
                        "messages": api_messages,
                    }
                    if self.max_tokens is not None:
-                        summary_kwargs.update(self._max_tokens_param(self.max_tokens))
+                        summary_kwargs["max_tokens"] = self.max_tokens
                    if summary_extra_body:
                        summary_kwargs["extra_body"] = summary_extra_body

@@ -3018,7 +2962,7 @@ class AIAgent:
        # Clear any stale interrupt state at start
        self.clear_interrupt()
        
-        while api_call_count < self.max_iterations and self.iteration_budget.remaining > 0:
+        while api_call_count < self.max_iterations:
            # Check for interrupt request (e.g., user sent new message)
            if self._interrupt_requested:
                interrupted = True
@@ -3027,10 +2971,6 @@ class AIAgent:
                break
            
            api_call_count += 1
-            if not self.iteration_budget.consume():
-                if not self.quiet_mode:
-                    print(f"\n⚠️  Session iteration budget exhausted ({self.iteration_budget.max_total} total across agent + subagents)")
-                break

            # Fire step_callback for gateway hooks (agent:step event)
            if self.step_callback is not None:
@@ -3107,13 +3047,6 @@ class AIAgent:
            if self._use_prompt_caching:
                api_messages = apply_anthropic_cache_control(api_messages, cache_ttl=self._cache_ttl)
            
-            # Safety net: strip orphaned tool results / add stubs for missing
-            # results before sending to the API.  The compressor handles this
-            # during compression, but orphans can also sneak in from session
-            # loading or manual message manipulation.
-            if hasattr(self, 'context_compressor') and self.context_compressor:
-                api_messages = self.context_compressor._sanitize_tool_pairs(api_messages)
-
            # Calculate approximate request size for logging
            total_chars = sum(len(str(msg)) for msg in api_messages)
            approx_tokens = total_chars // 4  # Rough estimate: 4 chars per token
@@ -3143,7 +3076,6 @@ class AIAgent:
            retry_count = 0
            max_retries = 6  # Increased to allow longer backoff periods
            codex_auth_retry_attempted = False
-            nous_auth_retry_attempted = False

            finish_reason = "stop"

@@ -3339,13 +3271,6 @@ class AIAgent:
                        }
                        self.context_compressor.update_from_response(usage_dict)

-                        # Cache discovered context length after successful call
-                        if self.context_compressor._context_probed:
-                            ctx = self.context_compressor.context_length
-                            save_context_length(self.model, self.base_url, ctx)
-                            print(f"{self.log_prefix}💾 Cached context length: {ctx:,} tokens for {self.model}")
-                            self.context_compressor._context_probed = False
-
                        self.session_prompt_tokens += prompt_tokens
                        self.session_completion_tokens += completion_tokens
                        self.session_total_tokens += total_tokens
@@ -3393,16 +3318,6 @@ class AIAgent:
                        if self._try_refresh_codex_client_credentials(force=True):
                            print(f"{self.log_prefix}🔐 Codex auth refreshed after 401. Retrying request...")
                            continue
-                    if (
-                        self.api_mode == "chat_completions"
-                        and self.provider == "nous"
-                        and status_code == 401
-                        and not nous_auth_retry_attempted
-                    ):
-                        nous_auth_retry_attempted = True
-                        if self._try_refresh_nous_client_credentials(force=True):
-                            print(f"{self.log_prefix}🔐 Nous agent key refreshed after 401. Retrying request...")
-                            continue

                    retry_count += 1
                    elapsed_time = time.time() - api_start_time
@@ -3475,37 +3390,18 @@ class AIAgent:
                    ])
                    
                    if is_context_length_error:
-                        compressor = self.context_compressor
-                        old_ctx = compressor.context_length
-
-                        # Try to parse the actual limit from the error message
-                        parsed_limit = parse_context_limit_from_error(error_msg)
-                        if parsed_limit and parsed_limit < old_ctx:
-                            new_ctx = parsed_limit
-                            print(f"{self.log_prefix}⚠️  Context limit detected from API: {new_ctx:,} tokens (was {old_ctx:,})")
-                        else:
-                            # Step down to the next probe tier
-                            new_ctx = get_next_probe_tier(old_ctx)
-
-                        if new_ctx and new_ctx < old_ctx:
-                            compressor.context_length = new_ctx
-                            compressor.threshold_tokens = int(new_ctx * compressor.threshold_percent)
-                            compressor._context_probed = True
-                            print(f"{self.log_prefix}⚠️  Context length exceeded — stepping down: {old_ctx:,} → {new_ctx:,} tokens")
-                        else:
-                            print(f"{self.log_prefix}⚠️  Context length exceeded at minimum tier — attempting compression...")
+                        print(f"{self.log_prefix}⚠️  Context length exceeded - attempting compression...")

                        original_len = len(messages)
                        messages, active_system_prompt = self._compress_context(
                            messages, system_message, approx_tokens=approx_tokens
                        )

-                        if len(messages) < original_len or new_ctx and new_ctx < old_ctx:
-                            if len(messages) < original_len:
-                                print(f"{self.log_prefix}   🗜️  Compressed {original_len} → {len(messages)} messages, retrying...")
-                            continue  # Retry with compressed messages or new tier
+                        if len(messages) < original_len:
+                            print(f"{self.log_prefix}   🗜️  Compressed {original_len} → {len(messages)} messages, retrying...")
+                            continue  # Retry with compressed messages
                        else:
-                            # Can't compress further and already at minimum tier
+                            # Can't compress further
                            print(f"{self.log_prefix}❌ Context length exceeded and cannot compress further.")
                            print(f"{self.log_prefix}   💡 The conversation has accumulated too much content.")
                            logging.error(f"{self.log_prefix}Context length exceeded: {approx_tokens:,} tokens. Cannot compress further.")
@@ -3797,13 +3693,6 @@ class AIAgent:
                    self._log_msg_to_db(assistant_msg)
                    
                    self._execute_tool_calls(assistant_message, messages, effective_task_id)
-
-                    # Refund the iteration if the ONLY tool(s) called were
-                    # execute_code (programmatic tool calling).  These are
-                    # cheap RPC-style calls that shouldn't eat the budget.
-                    _tc_names = {tc.function.name for tc in assistant_message.tool_calls}
-                    if _tc_names == {"execute_code"}:
-                        self.iteration_budget.refund()
                    
                    if self.compression_enabled and self.context_compressor.should_compress():
                        messages, active_system_prompt = self._compress_context(
@@ -3824,33 +3713,13 @@ class AIAgent:
                    
                    # Check if response only has think block with no actual content after it
                    if not self._has_content_after_think_block(final_response):
-                        # If the previous turn already delivered real content alongside
-                        # tool calls (e.g. "You're welcome!" + memory save), the model
-                        # has nothing more to say. Use the earlier content immediately
-                        # instead of wasting API calls on retries that won't help.
-                        fallback = getattr(self, '_last_content_with_tools', None)
-                        if fallback:
-                            logger.debug("Empty follow-up after tool calls — using prior turn content as final response")
-                            self._last_content_with_tools = None
-                            self._empty_content_retries = 0
-                            for i in range(len(messages) - 1, -1, -1):
-                                msg = messages[i]
-                                if msg.get("role") == "assistant" and msg.get("tool_calls"):
-                                    tool_names = []
-                                    for tc in msg["tool_calls"]:
-                                        fn = tc.get("function", {})
-                                        tool_names.append(fn.get("name", "unknown"))
-                                    msg["content"] = f"Calling the {', '.join(tool_names)} tool{'s' if len(tool_names) > 1 else ''}..."
-                                    break
-                            final_response = self._strip_think_blocks(fallback).strip()
-                            break
-
-                        # No fallback available — this is a genuine empty response.
-                        # Retry in case the model just had a bad generation.
+                        # Track retries for empty-after-think responses
                        if not hasattr(self, '_empty_content_retries'):
                            self._empty_content_retries = 0
                        self._empty_content_retries += 1
                        
+                        # Show the reasoning/thinking content so the user can see
+                        # what the model was thinking even though content is empty
                        reasoning_text = self._extract_reasoning(assistant_message)
                        print(f"{self.log_prefix}⚠️  Response only contains think block with no content after it")
                        if reasoning_text:
@@ -3945,6 +3814,9 @@ class AIAgent:
                    
                    # Strip <think> blocks from user-facing response (keep raw in messages for trajectory)
                    final_response = self._strip_think_blocks(final_response).strip()
+
+                    if final_response:
+                        self._emit_event({"type": "text", "text": final_response})
                    
                    final_msg = self._build_assistant_message(assistant_message, finish_reason)
                    
@@ -4041,6 +3913,8 @@ class AIAgent:
        
        # Clear interrupt state after handling
        self.clear_interrupt()
+
+        self._emit_event({"type": "done"})
        
        return result
    
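The event hooks added to `run_agent.py` above follow a producer/consumer pattern: the agent thread enqueues events without blocking, and a reader drains them until it sees `done`. The following is a minimal sketch of that pattern, not the project's implementation. `EventEmitter`, `drain`, and `demo` are hypothetical names used only for illustration, and the event payloads are trimmed versions of the ones in the diff.

```python
import queue
import threading
from typing import Any, Dict, List, Optional


class EventEmitter:
    """Hypothetical stand-in for the event-emitting part of AIAgent."""

    def __init__(self, event_queue: Optional[queue.Queue] = None):
        self.event_queue = event_queue

    def emit(self, event: Dict[str, Any]) -> bool:
        """Enqueue without blocking; return False if the event was not queued."""
        if self.event_queue is None:
            return False  # no consumer attached (CLI-style usage)
        try:
            self.event_queue.put_nowait(event)
            return True
        except queue.Full:
            return False  # bounded queue is full, so this event is dropped


def drain(event_queue: queue.Queue, timeout: float = 1.0) -> List[Dict[str, Any]]:
    """Collect events until a "done" event arrives or the queue stays empty."""
    events: List[Dict[str, Any]] = []
    while True:
        try:
            event = event_queue.get(timeout=timeout)
        except queue.Empty:
            break
        events.append(event)
        if event.get("type") == "done":
            break
    return events


def demo() -> List[Dict[str, Any]]:
    """Run a fake agent turn in a worker thread and return what the reader saw."""
    eq: queue.Queue = queue.Queue(maxsize=8)
    emitter = EventEmitter(eq)

    def worker() -> None:
        emitter.emit({"type": "tool-call", "name": "terminal", "status": "calling"})
        emitter.emit({"type": "tool-result", "name": "terminal", "status": "complete"})
        emitter.emit({"type": "text", "text": "finished"})
        emitter.emit({"type": "done"})

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    events = drain(eq)
    thread.join()
    return events
```

One consequence worth keeping in mind: with a bounded queue and a non-blocking put, a slow reader can cause events to be dropped, and that includes the final `done` marker. A reader therefore needs its own timeout, as `drain` has here.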
--- a/serve.py
+++ b/serve.py
@@ -0,0 +1,124 @@
+"""FastAPI streaming wrapper for AIAgent.
+
+Exposes hermes-agent as an HTTP service with SSE streaming.
+Run locally with: uvicorn serve:app --host 0.0.0.0 --port 8000
+Deploy on Modal via modal_app.py.
+"""
+
+import asyncio
+import json
+import logging
+import os
+import queue
+import threading
+from pathlib import Path
+from typing import Any
+
+from fastapi import FastAPI, Request
+from fastapi.responses import StreamingResponse
+
+logger = logging.getLogger(__name__)
+
+# Force HERMES_HOME to a writable path. Modal secrets may set HERMES_HOME to
+# a non-existent path (e.g. /app/tinker-atropos) — override unconditionally.
+_hermes_home = Path("/tmp/hermes")
+_hermes_home.mkdir(parents=True, exist_ok=True)
+(_hermes_home / "logs").mkdir(parents=True, exist_ok=True)
+os.environ["HERMES_HOME"] = str(_hermes_home)
+
+# Pre-import modules that register signal handlers so they run in the
+# main thread (signal.signal() fails if called from a worker thread).
+try:
+    import tools.browser_tool  # noqa: F401
+except Exception as e:
+    logger.debug("Skipping tools.browser_tool pre-import: %s", e)
+
+try:
+    from run_agent import AIAgent  # noqa: F401
+except Exception as e:
+    logger.warning("Failed to pre-import AIAgent: %s", e)
+
+app = FastAPI(title="hermes-agent", version="0.1.0")
+
+
+@app.get("/health")
+async def health():
+    return {"status": "ok"}
+
+
+@app.post("/v1/agent/stream")
+async def agent_stream(request: Request):
+    body = await request.json()
+
+    messages = body.get("messages", [])
+    model = body.get("model", "anthropic/claude-opus-4.6")
+    system_prompt = body.get("system_prompt")
+    toolsets = body.get("toolsets")
+    max_iterations = body.get("max_iterations", 30)
+    base_url = body.get("base_url") or os.getenv("AGENT_LLM_BASE_URL")
+    api_key = body.get("api_key") or os.getenv("AGENT_LLM_API_KEY")
+    tags = body.get("tags")
+
+    # Split the new prompt from the prior history. The prompt is the most recent
+    # user message; it is removed from the history only when it is the final turn.
+    conversation_history = list(messages)
+    user_message = next(
+        (m.get("content", "") for m in reversed(messages) if m.get("role") == "user"),
+        "",
+    )
+    if conversation_history and conversation_history[-1].get("role") == "user":
+        conversation_history.pop()
+
+    eq: queue.Queue[dict[str, Any]] = queue.Queue(maxsize=512)
+
+    def run_agent():
+        try:
+            agent = AIAgent(
+                model=model,
+                base_url=base_url,
+                api_key=api_key,
+                max_iterations=max_iterations,
+                quiet_mode=True,
+                enabled_toolsets=toolsets,
+                event_queue=eq,
+                ephemeral_system_prompt=system_prompt,
+                extra_tags=tags,
+            )
+            result = agent.run_conversation(
+                user_message=user_message,
+                conversation_history=conversation_history or None,
+            )
+            if result and result.get("failed"):
+                eq.put({"type": "error", "error": result.get("error", "Agent failed")})
+                eq.put({"type": "done"})
+        except Exception as e:
+            logger.exception("Agent error")
+            eq.put({"type": "error", "error": str(e)})
+            eq.put({"type": "done"})
+
+    thread = threading.Thread(target=run_agent, daemon=True)
+    thread.start()
+
+    loop = asyncio.get_running_loop()
+
+    async def event_generator():
+        while True:
+            try:
+                event = await loop.run_in_executor(None, lambda: eq.get(timeout=120))
+            except queue.Empty:
+                yield "data: {\"type\": \"done\"}\n\n"
+                break
+
+            yield f"data: {json.dumps(event)}\n\n"
+
+            if event.get("type") == "done":
+                break
+
+    return StreamingResponse(
+        event_generator(),
+        media_type="text/event-stream",
+        headers={
+            "Cache-Control": "no-cache",
+            "X-Accel-Buffering": "no",
+        },
+    )
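The `/v1/agent/stream` handler in `serve.py` above writes each event as a `data: <json>` line followed by a blank line, and ends the stream with a `done` event. A client-side reader could look roughly like the sketch below. `iter_sse_events` and `collect_text` are hypothetical helpers, not part of the PR, and the sample lines are made up to match the event shapes emitted in `run_agent.py`.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield decoded JSON payloads from `data:` lines, stopping after "done"."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        event = json.loads(line[len("data:"):].strip())
        yield event
        if event.get("type") == "done":
            break


def collect_text(lines: Iterable[str]) -> Tuple[str, List[str]]:
    """Return the concatenated text events and any error messages."""
    parts: List[str] = []
    errors: List[str] = []
    for event in iter_sse_events(lines):
        if event.get("type") == "text":
            parts.append(event.get("text", ""))
        elif event.get("type") == "error":
            errors.append(event.get("error", ""))
    return "".join(parts), errors


# Made-up stream for illustration; a real one would come from the HTTP response.
sample_stream = [
    'data: {"type": "tool-call", "name": "terminal", "args": {}, "status": "calling"}\n',
    "\n",
    'data: {"type": "tool-result", "name": "terminal", "output": "ok", "status": "complete", "duration": 0.42}\n',
    "\n",
    'data: {"type": "text", "text": "All tests pass."}\n',
    "\n",
    'data: {"type": "done"}\n',
    "\n",
    'data: {"type": "text", "text": "never read"}\n',
]

text, errors = collect_text(sample_stream)
```

In practice the `lines` argument would be the line iterator of whichever HTTP client is used to POST the request body (`messages`, `model`, `max_iterations`, and the other fields the handler reads). Nothing after the first `done` event is consumed, which mirrors how the server-side generator stops.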
--- a/skills/apple/DESCRIPTION.md
+++ b/skills/apple/DESCRIPTION.md
@@ -1,3 +0,0 @@
---
-description: Apple/macOS-specific skills — iMessage, Reminders, Notes, FindMy, and macOS automation. These skills only load on macOS systems.
---
--- a/skills/apple/apple-notes/SKILL.md
+++ b/skills/apple/apple-notes/SKILL.md
@@ -1,88 +0,0 @@
---
-name: apple-notes
-description: Manage Apple Notes via the memo CLI on macOS (create, view, search, edit).
-version: 1.0.0
-author: Hermes Agent
-license: MIT
-platforms: [macos]
-metadata:
-  hermes:
-    tags: [Notes, Apple, macOS, note-taking]
-    related_skills: [obsidian]
---
-
-# Apple Notes
-
-Use `memo` to manage Apple Notes directly from the terminal. Notes sync across all Apple devices via iCloud.
-
-## Prerequisites
-
- **macOS** with Notes.app
- Install: `brew tap antoniorodr/memo && brew install antoniorodr/memo/memo`
- Grant Automation access to Notes.app when prompted (System Settings → Privacy → Automation)
-
-## When to Use
-
- User asks to create, view, or search Apple Notes
- Saving information to Notes.app for cross-device access
- Organizing notes into folders
- Exporting notes to Markdown/HTML
-
-## When NOT to Use
-
- Obsidian vault management → use the `obsidian` skill
- Bear Notes → separate app (not supported here)
- Quick agent-only notes → use the `memory` tool instead
-
-## Quick Reference
-
-### View Notes
-
-```bash
-memo notes                        # List all notes
-memo notes -f "Folder Name"       # Filter by folder
-memo notes -s "query"             # Search notes (fuzzy)
-```
-
-### Create Notes
-
-```bash
-memo notes -a                     # Interactive editor
-memo notes -a "Note Title"        # Quick add with title
-```
-
-### Edit Notes
-
-```bash
-memo notes -e                     # Interactive selection to edit
-```
-
-### Delete Notes
-
-```bash
-memo notes -d                     # Interactive selection to delete
-```
-
-### Move Notes
-
-```bash
-memo notes -m                     # Move note to folder (interactive)
-```
-
-### Export Notes
-
-```bash
-memo notes -ex                    # Export to HTML/Markdown
-```
-
-## Limitations
-
- Cannot edit notes containing images or attachments
- Interactive prompts require terminal access (use pty=true if needed)
- macOS only — requires Apple Notes.app
-
-## Rules
-
-1. Prefer Apple Notes when user wants cross-device sync (iPhone/iPad/Mac)
-2. Use the `memory` tool for agent-internal notes that don't need to sync
-3. Use the `obsidian` skill for Markdown-native knowledge management
--- a/skills/apple/apple-reminders/SKILL.md
+++ b/skills/apple/apple-reminders/SKILL.md
@@ -1,96 +0,0 @@
---
-name: apple-reminders
-description: Manage Apple Reminders via remindctl CLI (list, add, complete, delete).
-version: 1.0.0
-author: Hermes Agent
-license: MIT
-platforms: [macos]
-metadata:
-  hermes:
-    tags: [Reminders, tasks, todo, macOS, Apple]
---
-
-# Apple Reminders
-
-Use `remindctl` to manage Apple Reminders directly from the terminal. Tasks sync across all Apple devices via iCloud.
-
-## Prerequisites
-
- **macOS** with Reminders.app
- Install: `brew install steipete/tap/remindctl`
- Grant Reminders permission when prompted
- Check: `remindctl status` / Request: `remindctl authorize`
-
-## When to Use
-
- User mentions "reminder" or "Reminders app"
- Creating personal to-dos with due dates that sync to iOS
- Managing Apple Reminders lists
- User wants tasks to appear on their iPhone/iPad
-
-## When NOT to Use
-
- Scheduling agent alerts → use the cronjob tool instead
- Calendar events → use Apple Calendar or Google Calendar
- Project task management → use GitHub Issues, Notion, etc.
- If user says "remind me" but means an agent alert → clarify first
-
-## Quick Reference
-
-### View Reminders
-
-```bash
-remindctl                    # Today's reminders
-remindctl today              # Today
-remindctl tomorrow           # Tomorrow
-remindctl week               # This week
-remindctl overdue            # Past due
-remindctl all                # Everything
-remindctl 2026-01-04         # Specific date
-```
-
-### Manage Lists
-
-```bash
-remindctl list               # List all lists
-remindctl list Work          # Show specific list
-remindctl list Projects --create    # Create list
-remindctl list Work --delete        # Delete list
-```
-
-### Create Reminders
-
-```bash
-remindctl add "Buy milk"
-remindctl add --title "Call mom" --list Personal --due tomorrow
-remindctl add --title "Meeting prep" --due "2026-02-15 09:00"
-```
-
-### Complete / Delete
-
-```bash
-remindctl complete 1 2 3          # Complete by ID
-remindctl delete 4A83 --force     # Delete by ID
-```
-
-### Output Formats
-
-```bash
-remindctl today --json       # JSON for scripting
-remindctl today --plain      # TSV format
-remindctl today --quiet      # Counts only
-```
-
-## Date Formats
-
-Accepted by `--due` and date filters:
- `today`, `tomorrow`, `yesterday`
- `YYYY-MM-DD`
- `YYYY-MM-DD HH:mm`
- ISO 8601 (`2026-01-04T12:34:56Z`)
-
-## Rules
-
-1. When user says "remind me", clarify: Apple Reminders (syncs to phone) vs agent cronjob alert
-2. Always confirm reminder content and due date before creating
-3. Use `--json` for programmatic parsing
--- a/skills/apple/findmy/SKILL.md
+++ b/skills/apple/findmy/SKILL.md
@@ -1,131 +0,0 @@
---
-name: findmy
-description: Track Apple devices and AirTags via FindMy.app on macOS using AppleScript and screen capture.
-version: 1.0.0
-author: Hermes Agent
-license: MIT
-platforms: [macos]
-metadata:
-  hermes:
-    tags: [FindMy, AirTag, location, tracking, macOS, Apple]
---
-
-# Find My (Apple)
-
-Track Apple devices and AirTags via the FindMy.app on macOS. Since Apple doesn't
-provide a CLI for FindMy, this skill uses AppleScript to open the app and
-screen capture to read device locations.
-
-## Prerequisites
-
- **macOS** with Find My app and iCloud signed in
- Devices/AirTags already registered in Find My
- Screen Recording permission for terminal (System Settings → Privacy & Security → Screen Recording)
- **Optional but recommended**: Install `peekaboo` for better UI automation:
-  `brew install steipete/tap/peekaboo`
-
-## When to Use
-
- User asks "where is my [device/cat/keys/bag]?"
- Tracking AirTag locations
- Checking device locations (iPhone, iPad, Mac, AirPods)
- Monitoring pet or item movement over time (AirTag patrol routes)
-
-## Method 1: AppleScript + Screenshot (Basic)
-
-### Open FindMy and Navigate
-
-```bash
-# Open Find My app
-osascript -e 'tell application "FindMy" to activate'
-
-# Wait for it to load
-sleep 3
-
-# Capture the Find My window (-w is interactive: click the window to select it)
-screencapture -w -o /tmp/findmy.png
-```
-
-Then use `vision_analyze` to read the screenshot:
-```
-vision_analyze(image_url="/tmp/findmy.png", question="What devices/items are shown and what are their locations?")
-```
-
-### Switch Between Tabs
-
-```bash
-# Switch to Devices tab
-osascript -e '
-tell application "System Events"
-    tell process "FindMy"
-        click button "Devices" of toolbar 1 of window 1
-    end tell
-end tell'
-
-# Switch to Items tab (AirTags)
-osascript -e '
-tell application "System Events"
-    tell process "FindMy"
-        click button "Items" of toolbar 1 of window 1
-    end tell
-end tell'
-```
-
-## Method 2: Peekaboo UI Automation (Recommended)
-
-If `peekaboo` is installed, use it for more reliable UI interaction:
-
-```bash
-# Open Find My
-osascript -e 'tell application "FindMy" to activate'
-sleep 3
-
-# Capture and annotate the UI
-peekaboo see --app "FindMy" --annotate --path /tmp/findmy-ui.png
-
-# Click on a specific device/item by element ID
-peekaboo click --on B3 --app "FindMy"
-
-# Capture the detail view
-peekaboo image --app "FindMy" --path /tmp/findmy-detail.png
-```
-
-Then analyze with vision:
-```
-vision_analyze(image_url="/tmp/findmy-detail.png", question="What is the location shown for this device/item? Include address and coordinates if visible.")
-```
-
-## Workflow: Track AirTag Location Over Time
-
-For monitoring an AirTag (e.g., tracking a cat's patrol route):
-
-```bash
-# 1. Open FindMy to Items tab
-osascript -e 'tell application "FindMy" to activate'
-sleep 3
-
-# 2. Click on the AirTag item (stay on page — AirTag only updates when page is open)
-
-# 3. Periodically capture location (full screen; -w would wait for a click each time)
-while true; do
-    screencapture -x /tmp/findmy-$(date +%Y%m%d-%H%M%S).png
-    sleep 300  # Every 5 minutes
-done
-```
-
-Analyze each screenshot with vision to extract coordinates, then compile a route.
-
-## Limitations
-
- FindMy has **no CLI or API** — must use UI automation
- AirTags only update location while the FindMy page is actively displayed
- Location accuracy depends on nearby Apple devices in the FindMy network
- Screen Recording permission required for screenshots
- AppleScript UI automation may break across macOS versions
-
-## Rules
-
-1. Keep FindMy app in the foreground when tracking AirTags (updates stop when minimized)
-2. Use `vision_analyze` to read screenshot content — don't try to parse pixels
-3. For ongoing tracking, use a cronjob to periodically capture and log locations
-4. Respect privacy — only track devices/items the user owns
--- a/skills/apple/imessage/SKILL.md
+++ b/skills/apple/imessage/SKILL.md
@@ -1,100 +0,0 @@
---
-name: imessage
-description: Send and receive iMessages/SMS via the imsg CLI on macOS.
-version: 1.0.0
-author: Hermes Agent
-license: MIT
-platforms: [macos]
-metadata:
-  hermes:
-    tags: [iMessage, SMS, messaging, macOS, Apple]
---
-
-# iMessage
-
-Use `imsg` to read and send iMessage/SMS via macOS Messages.app.
-
-## Prerequisites
-
- **macOS** with Messages.app signed in
- Install: `brew install steipete/tap/imsg`
- Grant Full Disk Access for terminal (System Settings → Privacy & Security → Full Disk Access)
- Grant Automation permission for Messages.app when prompted
-
-## When to Use
-
- User asks to send an iMessage or text message
- Reading iMessage conversation history
- Checking recent Messages.app chats
- Sending to phone numbers or Apple IDs
-
-## When NOT to Use
-
- Telegram/Discord/Slack/WhatsApp messages → use the appropriate gateway channel
- Group chat management (adding/removing members) → not supported
- Bulk/mass messaging → always confirm with user first
-
-## Quick Reference
-
-### List Chats
-
-```bash
-imsg chats --limit 10 --json
-```
-
-### View History
-
-```bash
-# By chat ID
-imsg history --chat-id 1 --limit 20 --json
-
-# With attachments info
-imsg history --chat-id 1 --limit 20 --attachments --json
-```
-
-### Send Messages
-
-```bash
-# Text only
-imsg send --to "+14155551212" --text "Hello!"
-
-# With attachment
-imsg send --to "+14155551212" --text "Check this out" --file /path/to/image.jpg
-
-# Force iMessage or SMS
-imsg send --to "+14155551212" --text "Hi" --service imessage
-imsg send --to "+14155551212" --text "Hi" --service sms
-```
-
-### Watch for New Messages
-
-```bash
-imsg watch --chat-id 1 --attachments
-```
-
-## Service Options
-
- `--service imessage` — Force iMessage (requires recipient has iMessage)
- `--service sms` — Force SMS (green bubble)
- `--service auto` — Let Messages.app decide (default)
-
-## Rules
-
-1. **Always confirm recipient and message content** before sending
-2. **Never send to unknown numbers** without explicit user approval
-3. **Verify file paths** exist before attaching
-4. **Don't spam** — rate-limit yourself
-
-## Example Workflow
-
-User: "Text mom that I'll be late"
-
-```bash
-# 1. Find mom's chat
-imsg chats --limit 20 --json | jq '.[] | select(.displayName | contains("Mom"))'
-
-# 2. Confirm with user: "Found Mom at +1555123456. Send 'I'll be late' via iMessage?"
-
-# 3. Send after confirmation
-imsg send --to "+1555123456" --text "I'll be late"
-```
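
Note on the iMessage rules above: rules 1 and 3 (confirm before sending, verify attachments) can be enforced before any command runs. A minimal sketch, assuming only the `imsg send` flags shown in the removed skill (`--to`, `--text`, `--file`, `--service`). The helper `build_imsg_send` is hypothetical; it builds the argument list and does not send anything.

```python
from pathlib import Path
from typing import Optional


def build_imsg_send(to: str, text: str, file: Optional[str] = None,
                    service: str = "auto") -> list[str]:
    """Return the argv for `imsg send`, refusing obviously bad input."""
    if service not in {"auto", "imessage", "sms"}:
        raise ValueError(f"unknown service: {service!r}")
    if not to.strip() or not text.strip():
        raise ValueError("recipient and text are both required")

    argv = ["imsg", "send", "--to", to, "--text", text]
    if file is not None:
        if not Path(file).is_file():  # rule 3: verify the attachment exists
            raise FileNotFoundError(file)
        argv += ["--file", file]
    if service != "auto":  # "auto" is the documented default, so omit it
        argv += ["--service", service]
    return argv


argv = build_imsg_send("+14155551212", "Hello!", service="sms")
```

Returning a list rather than a shell string means the message text never passes through shell quoting. The caller would show `argv` to the user for confirmation and only then hand it to `subprocess.run`.
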
--- a/skills/mlops/accelerate/SKILL.md
+++ b/skills/mlops/accelerate/SKILL.md
@@ -1,335 +0,0 @@
---
-name: huggingface-accelerate
-description: Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [accelerate, torch, transformers]
-metadata:
-  hermes:
-    tags: [Distributed Training, HuggingFace, Accelerate, DeepSpeed, FSDP, Mixed Precision, PyTorch, DDP, Unified API, Simple]
-
---
-
-# HuggingFace Accelerate - Unified Distributed Training
-
-## Quick start
-
-Accelerate simplifies distributed training to 4 lines of code.
-
-**Installation**:
-```bash
-pip install accelerate
-```
-
-**Convert PyTorch script** (4 lines):
-```python
-import torch
-+ from accelerate import Accelerator
-
-+ accelerator = Accelerator()
-
-  model = torch.nn.Transformer()
-  optimizer = torch.optim.Adam(model.parameters())
-  dataloader = torch.utils.data.DataLoader(dataset)
-
-+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-
-  for batch in dataloader:
-      optimizer.zero_grad()
-      loss = model(batch)
-     loss.backward()
-+     accelerator.backward(loss)
-      optimizer.step()
-```
-
-**Run** (single command):
-```bash
-accelerate launch train.py
-```
-
-## Common workflows
-
-### Workflow 1: From single GPU to multi-GPU
-
-**Original script**:
-```python
-# train.py
-import torch
-
-model = torch.nn.Linear(10, 2).to('cuda')
-optimizer = torch.optim.Adam(model.parameters())
-dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
-
-for epoch in range(10):
-    for batch in dataloader:
-        batch = batch.to('cuda')
-        optimizer.zero_grad()
-        loss = model(batch).mean()
-        loss.backward()
-        optimizer.step()
-```
-
-**With Accelerate** (4 lines added):
-```python
-# train.py
-import torch
-from accelerate import Accelerator  # +1
-
-accelerator = Accelerator()  # +2
-
-model = torch.nn.Linear(10, 2)
-optimizer = torch.optim.Adam(model.parameters())
-dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
-
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3
-
-for epoch in range(10):
-    for batch in dataloader:
-        # No .to('cuda') needed - automatic!
-        optimizer.zero_grad()
-        loss = model(batch).mean()
-        accelerator.backward(loss)  # +4
-        optimizer.step()
-```
-
-**Configure** (interactive):
-```bash
-accelerate config
-```
-
-**Questions**:
- Which machine? (single/multi GPU/TPU/CPU)
- How many machines? (1)
- Mixed precision? (no/fp16/bf16/fp8)
- DeepSpeed? (no/yes)
-
-**Launch** (works on any setup):
-```bash
-# Single GPU
-accelerate launch train.py
-
-# Multi-GPU (8 GPUs)
-accelerate launch --multi_gpu --num_processes 8 train.py
-
-# Multi-node
-accelerate launch --multi_gpu --num_processes 16 \
-  --num_machines 2 --machine_rank 0 \
-  --main_process_ip $MASTER_ADDR \
-  train.py
-```
-
-### Workflow 2: Mixed precision training
-
-**Enable FP16/BF16**:
-```python
-from accelerate import Accelerator
-
-# FP16 (with gradient scaling)
-accelerator = Accelerator(mixed_precision='fp16')
-
-# BF16 (no scaling, more stable)
-accelerator = Accelerator(mixed_precision='bf16')
-
-# FP8 (H100+)
-accelerator = Accelerator(mixed_precision='fp8')
-
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-
-# Everything else is automatic!
-for batch in dataloader:
-    with accelerator.autocast():  # Optional, done automatically
-        loss = model(batch)
-    accelerator.backward(loss)
-```
-
-### Workflow 3: DeepSpeed ZeRO integration
-
-**Enable DeepSpeed ZeRO-2**:
-```python
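-from accelerate import Accelerator
-from accelerate.utils import DeepSpeedPlugin
-
-# deepspeed_plugin takes a DeepSpeedPlugin instance, not a raw config dict
-ds_plugin = DeepSpeedPlugin(
-    zero_stage=2,  # ZeRO-2
-    offload_optimizer_device="none",  # keep optimizer state on the GPU
-    gradient_accumulation_steps=4,
-)
-accelerator = Accelerator(mixed_precision='bf16', deepspeed_plugin=ds_plugin)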
-
-# Same code as before!
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-```
-
-**Or via config**:
-```bash
-accelerate config
-# Select: DeepSpeed → ZeRO-2
-```
-
-**deepspeed_config.json**:
-```json
-{
-    "fp16": {"enabled": false},
-    "bf16": {"enabled": true},
-    "zero_optimization": {
-        "stage": 2,
-        "offload_optimizer": {"device": "cpu"},
-        "allgather_bucket_size": 5e8,
-        "reduce_bucket_size": 5e8
-    }
-}
-```
-
-**Launch**:
-```bash
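-# --config_file expects an Accelerate YAML; pass the DeepSpeed JSON separately
-accelerate launch --use_deepspeed --deepspeed_config_file deepspeed_config.json train.py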
-```
-
-### Workflow 4: FSDP (Fully Sharded Data Parallel)
-
-**Enable FSDP**:
-```python
-from accelerate import Accelerator, FullyShardedDataParallelPlugin
-
-fsdp_plugin = FullyShardedDataParallelPlugin(
-    sharding_strategy="FULL_SHARD",  # ZeRO-3 equivalent
-    auto_wrap_policy="transformer_based_wrap",
-    cpu_offload=False
-)
-
-accelerator = Accelerator(
-    mixed_precision='bf16',
-    fsdp_plugin=fsdp_plugin
-)
-
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-```
-
-**Or via config**:
-```bash
-accelerate config
-# Select: FSDP → Full Shard → No CPU Offload
-```
-
-### Workflow 5: Gradient accumulation
-
-**Accumulate gradients**:
-```python
-from accelerate import Accelerator
-
-accelerator = Accelerator(gradient_accumulation_steps=4)
-
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-
-for batch in dataloader:
-    with accelerator.accumulate(model):  # Handles accumulation
-        optimizer.zero_grad()
-        loss = model(batch)
-        accelerator.backward(loss)
-        optimizer.step()
-```
-
-**Effective batch size**: `batch_size * num_gpus * gradient_accumulation_steps`
-
-## When to use vs alternatives
-
-**Use Accelerate when**:
- Want simplest distributed training
- Need single script for any hardware
- Use HuggingFace ecosystem
- Want flexibility (DDP/DeepSpeed/FSDP/Megatron)
- Need quick prototyping
-
-**Key advantages**:
- **4 lines**: Minimal code changes
- **Unified API**: Same code for DDP, DeepSpeed, FSDP, Megatron
- **Automatic**: Device placement, mixed precision, sharding
- **Interactive config**: No manual launcher setup
- **Single launch**: Works everywhere
-
-**Use alternatives instead**:
- **PyTorch Lightning**: Need callbacks, high-level abstractions
- **Ray Train**: Multi-node orchestration, hyperparameter tuning
- **DeepSpeed**: Direct API control, advanced features
- **Raw DDP**: Maximum control, minimal abstraction
-
-## Common issues
-
-**Issue: Wrong device placement**
-
-Don't manually move to device:
-```python
-# WRONG
-batch = batch.to('cuda')
-
-# CORRECT
-# Accelerate handles it automatically after prepare()
-```
-
-**Issue: Gradient accumulation not working**
-
-Use context manager:
-```python
-# CORRECT
-with accelerator.accumulate(model):
-    optimizer.zero_grad()
-    accelerator.backward(loss)
-    optimizer.step()
-```
-
-**Issue: Checkpointing in distributed**
-
-Use accelerator methods:
-```python
-# Call save_state on every process: Accelerate writes shared files once,
-# and FSDP/DeepSpeed need every rank to take part in saving its shard
-accelerator.save_state('checkpoint/')
-
-# Load on all processes
-accelerator.load_state('checkpoint/')
-```
-
-**Issue: Different results with FSDP**
-
-Ensure same random seed:
-```python
-from accelerate.utils import set_seed
-set_seed(42)
-```
-
-## Advanced topics
-
-**Megatron integration**: See [references/megatron-integration.md](references/megatron-integration.md) for tensor parallelism, pipeline parallelism, and sequence parallelism setup.
-
-**Custom plugins**: See [references/custom-plugins.md](references/custom-plugins.md) for creating custom distributed plugins and advanced configuration.
-
-**Performance tuning**: See [references/performance.md](references/performance.md) for profiling, memory optimization, and best practices.
-
-## Hardware requirements
-
- **CPU**: Works (slow)
- **Single GPU**: Works
- **Multi-GPU**: DDP (default), DeepSpeed, or FSDP
- **Multi-node**: DDP, DeepSpeed, FSDP, Megatron
- **TPU**: Supported
- **Apple MPS**: Supported
-
-**Launcher requirements**:
- **DDP**: `torch.distributed.run` (built-in)
- **DeepSpeed**: `deepspeed` (pip install deepspeed)
- **FSDP**: PyTorch 1.12+ (built-in)
- **Megatron**: Custom setup
-
-## Resources
-
- Docs: https://huggingface.co/docs/accelerate
- GitHub: https://github.com/huggingface/accelerate
- Version: 1.11.0+
- Tutorial: "Accelerate your scripts"
- Examples: https://github.com/huggingface/accelerate/tree/main/examples
- Used by: HuggingFace Transformers, TRL, PEFT, all HF libraries
-
-
-
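
Note on the effective batch size formula in the skill above (`batch_size * num_gpus * gradient_accumulation_steps`): under Accelerate's defaults the `batch_size` given to the `DataLoader` is per process, so the global batch grows with the process count. A minimal worked sketch using the numbers from the skill's own examples; the function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(per_device_batch: int, num_processes: int,
                         grad_accum_steps: int = 1) -> int:
    """Samples contributing to one optimizer step across all processes."""
    if min(per_device_batch, num_processes, grad_accum_steps) < 1:
        raise ValueError("all arguments must be >= 1")
    return per_device_batch * num_processes * grad_accum_steps


single = effective_batch_size(32, 1)          # original single-GPU script
multi = effective_batch_size(32, 8)           # accelerate launch --num_processes 8
accumulated = effective_batch_size(32, 8, 4)  # plus gradient_accumulation_steps=4
```

Moving a script from one GPU to eight without touching `batch_size` therefore changes the optimization setup, so the learning rate may need retuning.
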
--- a/skills/mlops/accelerate/references/custom-plugins.md
+++ b/skills/mlops/accelerate/references/custom-plugins.md
@@ -1,453 +0,0 @@
-# Custom Plugins for Accelerate
-
-## Overview
-
-Accelerate allows creating **custom plugins** to extend distributed training strategies beyond built-in options (DDP, FSDP, DeepSpeed).
-
-## Plugin Architecture
-
-### Base Plugin Structure
-
-```python
-from accelerate.utils import DistributedDataParallelKwargs
-from dataclasses import dataclass
-
-@dataclass
-class CustomPlugin:
-    """Custom training plugin."""
-
-    # Plugin configuration
-    param1: int = 1
-    param2: str = "default"
-
-    def __post_init__(self):
-        # Validation logic
-        if self.param1 < 1:
-            raise ValueError("param1 must be >= 1")
-```
-
-### Using Custom Plugin
-
-```python
-from accelerate import Accelerator
-
-# Create plugin
-custom_plugin = CustomPlugin(param1=4, param2="value")
-
-# Pass to Accelerator
-accelerator = Accelerator(
-    custom_plugin=custom_plugin  # Not a real parameter, example only
-)
-```
-
-## Built-In Plugin Examples
-
-### 1. GradScalerKwargs (FP16 Configuration)
-
-```python
-from accelerate.utils import GradScalerKwargs
-
-# Configure gradient scaler for FP16
-scaler_kwargs = GradScalerKwargs(
-    init_scale=2.**16,        # Initial loss scale
-    growth_factor=2.0,        # Scale growth rate
-    backoff_factor=0.5,       # Scale backoff rate
-    growth_interval=2000,     # Steps between scale increases
-    enabled=True              # Enable scaler
-)
-
-accelerator = Accelerator(
-    mixed_precision='fp16',
-    kwargs_handlers=[scaler_kwargs]  # Pass as kwargs handler
-)
-```
-
-**Use case**: Fine-tune FP16 gradient scaling behavior
-
-### 2. DistributedDataParallelKwargs
-
-```python
-from accelerate.utils import DistributedDataParallelKwargs
-
-# Configure DDP behavior
-ddp_kwargs = DistributedDataParallelKwargs(
-    bucket_cap_mb=25,                 # Gradient bucketing size
-    find_unused_parameters=False,     # Find unused params (slower)
-    check_reduction=False,            # Check gradient reduction
-    gradient_as_bucket_view=True,     # Memory optimization
-    static_graph=False                # Static computation graph
-)
-
-accelerator = Accelerator(
-    kwargs_handlers=[ddp_kwargs]
-)
-```
-
-**Use case**: Optimize DDP performance for specific models
-
-### 3. FP8RecipeKwargs (H100 FP8)
-
-```python
-from accelerate.utils import FP8RecipeKwargs
-
-# Configure FP8 training (H100)
-fp8_recipe = FP8RecipeKwargs(
-    backend="te",              # TransformerEngine backend
-    margin=0,                  # Scaling margin
-    interval=1,                # Scaling interval
-    fp8_format="HYBRID",       # E4M3 + E5M2 hybrid
-    amax_history_len=1024,     # AMAX history length
-    amax_compute_algo="max"    # AMAX computation algorithm
-)
-
-accelerator = Accelerator(
-    mixed_precision='fp8',
-    kwargs_handlers=[fp8_recipe]
-)
-```
-
-**Use case**: Ultra-fast training on H100 GPUs
-
-## Custom DeepSpeed Configuration
-
-### ZeRO-3 with CPU Offload
-
-```python
-from accelerate import Accelerator
-from accelerate.utils import DeepSpeedPlugin
-
-# Custom DeepSpeed config
-ds_plugin = DeepSpeedPlugin(
-    zero_stage=3,                     # ZeRO-3
-    offload_optimizer_device="cpu",   # CPU offload optimizer
-    offload_param_device="cpu",       # CPU offload parameters
-    zero3_init_flag=True,             # ZeRO-3 initialization
-    zero3_save_16bit_model=True,      # Save FP16 weights
-)
-
-accelerator = Accelerator(
-    deepspeed_plugin=ds_plugin,
-    mixed_precision='bf16'
-)
-```
-
-### ZeRO-3 with NVMe Offload
-
-```python
-ds_plugin = DeepSpeedPlugin(
-    zero_stage=3,                               # NVMe offload requires ZeRO-3
-    offload_optimizer_device="nvme",            # NVMe offload
-    offload_param_device="nvme",
-    offload_optimizer_nvme_path="/local_nvme",  # NVMe mount path
-    offload_param_nvme_path="/local_nvme",
-)
-```
-
-### Custom JSON Config
-
-```python
-import json
-
-# Load custom DeepSpeed config
-with open('deepspeed_config.json', 'r') as f:
-    ds_config = json.load(f)
-
-ds_plugin = DeepSpeedPlugin(hf_ds_config=ds_config)
-
-accelerator = Accelerator(deepspeed_plugin=ds_plugin)
-```
-
-**Example config** (`deepspeed_config.json`):
-```json
-{
-  "train_batch_size": "auto",
-  "train_micro_batch_size_per_gpu": "auto",
-  "gradient_accumulation_steps": "auto",
-  "gradient_clipping": 1.0,
-  "zero_optimization": {
-    "stage": 3,
-    "offload_optimizer": {
-      "device": "cpu",
-      "pin_memory": true
-    },
-    "offload_param": {
-      "device": "cpu",
-      "pin_memory": true
-    },
-    "overlap_comm": true,
-    "contiguous_gradients": true,
-    "sub_group_size": 1e9,
-    "reduce_bucket_size": 5e8,
-    "stage3_prefetch_bucket_size": 5e8,
-    "stage3_param_persistence_threshold": 1e6,
-    "stage3_max_live_parameters": 1e9,
-    "stage3_max_reuse_distance": 1e9,
-    "stage3_gather_16bit_weights_on_model_save": true
-  },
-  "bf16": {
-    "enabled": true
-  },
-  "steps_per_print": 100,
-  "wall_clock_breakdown": false
-}
-```
-
-## Custom FSDP Configuration
-
-### FSDP with Custom Auto-Wrap Policy
-
-```python
-from accelerate.utils import FullyShardedDataParallelPlugin
-from torch.distributed.fsdp import BackwardPrefetch, ShardingStrategy
-from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
-import functools
-
-# Custom wrap policy (size-based)
-wrap_policy = functools.partial(
-    size_based_auto_wrap_policy,
-    min_num_params=1e6  # Wrap layers with 1M+ params
-)
-
-fsdp_plugin = FullyShardedDataParallelPlugin(
-    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3 equivalent
-    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # Prefetch strategy
-    mixed_precision_policy=None,  # Use Accelerator's mixed precision
-    auto_wrap_policy=wrap_policy,  # Custom wrapping
-    cpu_offload=False,
-    ignored_modules=None,  # Modules to not wrap
-    state_dict_type="FULL_STATE_DICT",  # Save format
-    optim_state_dict_config=None,
-    limit_all_gathers=False,
-    use_orig_params=True,  # Use original param shapes
-)
-
-accelerator = Accelerator(
-    fsdp_plugin=fsdp_plugin,
-    mixed_precision='bf16'
-)
-```
-
-### FSDP with Transformer Auto-Wrap
-
-```python
-import functools
-from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
-from transformers.models.gpt2.modeling_gpt2 import GPT2Block
-
-# Wrap at transformer block level
-wrap_policy = functools.partial(
-    transformer_auto_wrap_policy,
-    transformer_layer_cls={GPT2Block}  # Wrap GPT2Block layers
-)
-
-fsdp_plugin = FullyShardedDataParallelPlugin(
-    auto_wrap_policy=wrap_policy
-)
-```
-
-## Creating Custom Training Strategy
-
-### Example: Custom Gradient Accumulation
-
-```python
-from accelerate import Accelerator
-
-class CustomGradientAccumulation:
-    def __init__(self, steps=4, adaptive=False, threshold=10.0):
-        self.steps = steps
-        self.adaptive = adaptive
-        self.threshold = threshold  # illustrative default; loss above this forces a sync
-        self.current_step = 0
-
-    def should_sync(self, loss):
-        """Decide whether to sync gradients."""
-        self.current_step += 1
-
-        # Adaptive: sync on high loss
-        if self.adaptive and loss > self.threshold:
-            self.current_step = 0
-            return True
-
-        # Regular: sync every N steps
-        if self.current_step >= self.steps:
-            self.current_step = 0
-            return True
-
-        return False
-
-# Usage
-custom_accum = CustomGradientAccumulation(steps=8, adaptive=True)
-accelerator = Accelerator()
-
-for batch in dataloader:
-    outputs = model(**batch)
-    loss = outputs.loss
-
-    # Scale loss
-    loss = loss / custom_accum.steps
-    accelerator.backward(loss)
-
-    # Conditional sync
-    if custom_accum.should_sync(loss.item()):
-        optimizer.step()
-        optimizer.zero_grad()
-```
-
-### Example: Custom Mixed Precision
-
-```python
-import torch
-
-class CustomMixedPrecision:
-    """Custom mixed precision with dynamic loss scaling."""
-
-    def __init__(self, init_scale=2**16, scale_window=2000):
-        self.scaler = torch.cuda.amp.GradScaler(
-            init_scale=init_scale,
-            growth_interval=scale_window
-        )
-        self.scale_history = []
-
-    def scale_loss(self, loss):
-        """Scale loss for backward."""
-        return self.scaler.scale(loss)
-
-    def unscale_and_clip(self, optimizer, max_norm=1.0):
-        """Unscale gradients and clip."""
-        self.scaler.unscale_(optimizer)
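-        # Clip across every parameter group, not just the first one
-        params = [p for group in optimizer.param_groups for p in group['params']]
-        torch.nn.utils.clip_grad_norm_(params, max_norm)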
-
-    def step(self, optimizer):
-        """Optimizer step with scaler update."""
-        scale_before = self.scaler.get_scale()
-        self.scaler.step(optimizer)
-        self.scaler.update()
-        scale_after = self.scaler.get_scale()
-
-        # Track scale changes
-        if scale_before != scale_after:
-            self.scale_history.append(scale_after)
-
-# Usage
-custom_mp = CustomMixedPrecision()
-
-for batch in dataloader:
-    with torch.cuda.amp.autocast(dtype=torch.float16):
-        loss = model(**batch).loss
-
-    scaled_loss = custom_mp.scale_loss(loss)
-    scaled_loss.backward()
-
-    custom_mp.unscale_and_clip(optimizer, max_norm=1.0)
-    custom_mp.step(optimizer)
-    optimizer.zero_grad()
-```
-
-## Advanced: Custom Distributed Backend
-
-### Custom AllReduce Strategy
-
-```python
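-import torch
-import torch.distributed as dist
-
-class CustomAllReduce:
-    """Custom all-reduce with compression."""
-
-    def __init__(self, compression_ratio=0.1):
-        self.compression_ratio = compression_ratio
-
-    def compress_gradients(self, tensor):
-        """Top-k gradient compression (keeps the sign of each entry)."""
-        flat = tensor.reshape(-1)
-        k = max(1, int(flat.numel() * self.compression_ratio))
-        _, indices = torch.topk(flat.abs(), k)
-        return flat[indices], indices
-
-    def all_reduce_compressed(self, tensor):
-        """Average top-k sparsified gradients across ranks."""
-        values, indices = self.compress_gradients(tensor)
-
-        # Each rank selects different indices, so gather (values, indices)
-        # pairs instead of all-reducing the values directly.
-        world_size = dist.get_world_size()
-        all_values = [torch.empty_like(values) for _ in range(world_size)]
-        all_indices = [torch.empty_like(indices) for _ in range(world_size)]
-        dist.all_gather(all_values, values)
-        dist.all_gather(all_indices, indices)
-
-        # Decompress: scatter-add every rank's contribution, then average
-        dense = torch.zeros_like(tensor).reshape(-1)
-        for v, i in zip(all_values, all_indices):
-            dense.index_add_(0, i, v)
-        return (dense / world_size).view_as(tensor)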
-
-# Usage in training loop
-custom_ar = CustomAllReduce(compression_ratio=0.1)
-
-for batch in dataloader:
-    loss = model(**batch).loss
-    loss.backward()
-
-    # Custom all-reduce
-    for param in model.parameters():
-        if param.grad is not None:
-            param.grad.data = custom_ar.all_reduce_compressed(param.grad.data)
-
-    optimizer.step()
-    optimizer.zero_grad()
-```
-
-## Plugin Best Practices
-
-### 1. Validation in `__post_init__`
-
-```python
-@dataclass
-class CustomPlugin:
-    learning_rate: float = 1e-3
-    warmup_steps: int = 1000
-
-    def __post_init__(self):
-        # Validate parameters
-        if self.learning_rate <= 0:
-            raise ValueError("learning_rate must be positive")
-        if self.warmup_steps < 0:
-            raise ValueError("warmup_steps must be non-negative")
-
-        # Compute derived values
-        self.min_lr = self.learning_rate * 0.1
-```
-
-### 2. Compatibility Checks
-
-```python
-@dataclass
-class CustomPlugin:
-    feature_enabled: bool = True
-
-    def is_compatible(self, accelerator):
-        """Check if plugin is compatible with accelerator config."""
-        if self.feature_enabled and accelerator.mixed_precision == 'fp8':
-            raise ValueError("Custom plugin not compatible with FP8")
-        return True
-```
-
-### 3. State Management
-
-```python
-@dataclass
-class CustomPlugin:
-    counter: int = 0
-    history: list = None
-
-    def __post_init__(self):
-        if self.history is None:
-            self.history = []
-
-    def update_state(self, value):
-        """Update plugin state during training."""
-        self.counter += 1
-        self.history.append(value)
-```
-
-## Resources
-
- Accelerate Plugins: https://huggingface.co/docs/accelerate/package_reference/kwargs
- DeepSpeed Config: https://www.deepspeed.ai/docs/config-json/
- FSDP Guide: https://pytorch.org/docs/stable/fsdp.html
- Custom Training Loops: https://huggingface.co/docs/accelerate/usage_guides/training_tpu
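
Note on the `GradScalerKwargs` fields in the reference above (`init_scale`, `growth_factor`, `backoff_factor`, `growth_interval`): together they describe a simple update rule for the loss scale. The following is a pure-Python sketch of that rule as dynamic loss scaling is commonly described, not the PyTorch implementation. The class `LossScaleSchedule` and its `update` method are hypothetical, and the overflow flags are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class LossScaleSchedule:
    """Pure-Python model of dynamic loss scaling (no GPU or torch required)."""

    scale: float = 2.0 ** 16
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 2000
    _good_steps: int = 0  # consecutive steps without overflow

    def update(self, found_inf: bool) -> bool:
        """Advance one step; return True if the optimizer step should run."""
        if found_inf:
            # Overflow: skip the step, shrink the scale, restart the count.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            # A full interval of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


sched = LossScaleSchedule(growth_interval=3)
overflow_flags = (False, False, True, False, False, False)  # made-up sequence
applied = [sched.update(found_inf=flag) for flag in overflow_flags]
```

A small `growth_interval` makes the scale recover quickly after an overflow but also makes repeated overflows more likely; the default of 2000 in the reference is the conservative end of that trade-off.
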
--- a/skills/mlops/accelerate/references/megatron-integration.md
+++ b/skills/mlops/accelerate/references/megatron-integration.md
@@ -1,489 +0,0 @@
-# Megatron Integration with Accelerate
-
-## Overview
-
-Accelerate supports Megatron-LM for massive model training with tensor parallelism and pipeline parallelism.
-
-**Megatron capabilities**:
- **Tensor Parallelism (TP)**: Split layers across GPUs
- **Pipeline Parallelism (PP)**: Split model depth across GPUs
- **Data Parallelism (DP)**: Replicate model across GPU groups
- **Sequence Parallelism**: Split sequences for long contexts
-
-## Setup
-
-### Install Megatron-LM
-
-```bash
-# Clone Megatron-LM repository
-git clone https://github.com/NVIDIA/Megatron-LM.git
-cd Megatron-LM
-pip install -e .
-
-# Install Apex (NVIDIA optimizations)
-git clone https://github.com/NVIDIA/apex
-cd apex
-pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
-  --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
-```
-
-### Accelerate Configuration
-
-```bash
-accelerate config
-```
-
-**Questions**:
-```
-In which compute environment are you running?
-> This machine
-
-Which type of machine are you using?
-> Multi-GPU
-
-How many different machines will you use?
-> 1
-
-Do you want to use DeepSpeed/FSDP?
-> No
-
-Do you want to use Megatron-LM?
-> Yes
-
-What is the Tensor Parallelism degree? [1-8]
-> 2
-
-Do you want to enable Sequence Parallelism?
-> No
-
-What is the Pipeline Parallelism degree? [1-8]
-> 2
-
-What is the Data Parallelism degree? [1-8]
-> 2
-
-Where to perform activation checkpointing? ['SELECTIVE', 'FULL', 'NONE']
-> SELECTIVE
-
-Where to perform activation partitioning? ['SEQUENTIAL', 'UNIFORM']
-> SEQUENTIAL
-```
-
-**Generated config** (`~/.cache/huggingface/accelerate/default_config.yaml`):
-```yaml
-compute_environment: LOCAL_MACHINE
-distributed_type: MEGATRON_LM
-downcast_bf16: 'no'
-machine_rank: 0
-main_training_function: main
-megatron_lm_config:
-  megatron_lm_gradient_clipping: 1.0
-  megatron_lm_learning_rate_decay_iters: 320000
-  megatron_lm_num_micro_batches: 1
-  megatron_lm_pp_degree: 2
-  megatron_lm_recompute_activations: true
-  megatron_lm_sequence_parallelism: false
-  megatron_lm_tp_degree: 2
-mixed_precision: bf16
-num_machines: 1
-num_processes: 8
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
-```
-
-## Parallelism Strategies
-
-### Tensor Parallelism (TP)
-
-**Splits each transformer layer across GPUs**:
-
-```python
-# Layer split across 2 GPUs
-# GPU 0: First half of attention heads
-# GPU 1: Second half of attention heads
-
-# Each GPU computes partial outputs
-# All-reduce combines results
-```
-
-**TP degree recommendations**:
- **TP=1**: No tensor parallelism (single GPU per layer)
- **TP=2**: 2 GPUs per layer (good for 7-13B models)
- **TP=4**: 4 GPUs per layer (good for 20-40B models)
- **TP=8**: 8 GPUs per layer (good for 70B+ models)
-
-**Benefits**:
- Reduces memory per GPU
- Uses all-reduce collectives, which are fast over NVLink
-
-**Drawbacks**:
- Requires fast inter-GPU bandwidth (NVLink)
- Communication overhead per layer
-
-### Pipeline Parallelism (PP)
-
-**Splits model depth across GPUs**:
-
-```python
-# 12-layer model, PP=4
-# GPU 0: Layers 0-2
-# GPU 1: Layers 3-5
-# GPU 2: Layers 6-8
-# GPU 3: Layers 9-11
-```
-
-**PP degree recommendations**:
- **PP=1**: No pipeline parallelism
- **PP=2**: 2 pipeline stages (good for 20-40B models)
- **PP=4**: 4 pipeline stages (good for 70B+ models)
- **PP=8**: 8 pipeline stages (good for 175B+ models)
-
-**Benefits**:
- Roughly linear reduction in per-GPU weight memory (PP=4 → about 4× less)
- Works across nodes (slower interconnect OK)
-
-**Drawbacks**:
- Pipeline bubbles (idle time)
- Requires micro-batching
-
-### Data Parallelism (DP)
-
-**Replicates model across GPU groups**:
-
-```python
-# 8 GPUs, TP=2, PP=2, DP=2
-# Group 0 (GPUs 0-3): Full model replica
-# Group 1 (GPUs 4-7): Full model replica
-```
-
-**DP degree**:
- `DP = total_gpus / (TP × PP)`
- Example: 8 GPUs, TP=2, PP=2 → DP=2
-
-**Benefits**:
- Increases throughput
- Scales batch size
-
-### Sequence Parallelism
-
-**Splits long sequences across GPUs** (extends TP):
-
-```python
-# 8K sequence, TP=2, Sequence Parallel=True
-# GPU 0: Tokens 0-4095
-# GPU 1: Tokens 4096-8191
-```
-
-**Benefits**:
- Enables very long sequences (100K+ tokens)
- Reduces activation memory
-
-**Requirements**:
- Must use with TP > 1
- RoPE/ALiBi position encodings work best
-
-## Accelerate Code Example
-
-### Basic Setup
-
-```python
-from accelerate import Accelerator
-from accelerate.utils import MegatronLMPlugin
-
-# Configure Megatron
-megatron_plugin = MegatronLMPlugin(
-    tp_degree=2,              # Tensor parallelism degree
-    pp_degree=2,              # Pipeline parallelism degree
-    num_micro_batches=4,      # Micro-batches for pipeline
-    gradient_clipping=1.0,    # Gradient clipping value
-    sequence_parallelism=False,  # Enable sequence parallelism
-    recompute_activations=True,  # Activation checkpointing
-    use_distributed_optimizer=True,  # Distributed optimizer
-    custom_prepare_model_function=None,  # Custom model prep
-)
-
-# Initialize accelerator
-accelerator = Accelerator(
-    mixed_precision='bf16',
-    megatron_lm_plugin=megatron_plugin
-)
-
-# Prepare model and optimizer
-model, optimizer, train_dataloader = accelerator.prepare(
-    model, optimizer, train_dataloader
-)
-
-# Training loop (same as DDP!)
-for batch in train_dataloader:
-    optimizer.zero_grad()
-    outputs = model(**batch)
-    loss = outputs.loss
-    accelerator.backward(loss)
-    optimizer.step()
-```
-
-### Full Training Script
-
-```python
-import torch
-from accelerate import Accelerator
-from accelerate.utils import MegatronLMPlugin
-from transformers import GPT2Config, GPT2LMHeadModel
-
-def main():
-    # Megatron configuration
-    megatron_plugin = MegatronLMPlugin(
-        tp_degree=2,
-        pp_degree=2,
-        num_micro_batches=4,
-        gradient_clipping=1.0,
-    )
-
-    accelerator = Accelerator(
-        mixed_precision='bf16',
-        gradient_accumulation_steps=8,
-        megatron_lm_plugin=megatron_plugin
-    )
-
-    # Model (train_loader and num_epochs are assumed to be defined at module level)
-    config = GPT2Config(
-        n_layer=24,
-        n_head=16,
-        n_embd=1024,
-    )
-    model = GPT2LMHeadModel(config)
-
-    # Optimizer
-    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
-
-    # Prepare
-    model, optimizer, train_loader = accelerator.prepare(
-        model, optimizer, train_loader
-    )
-
-    # Training loop
-    for epoch in range(num_epochs):
-        for batch in train_loader:
-            with accelerator.accumulate(model):
-                outputs = model(**batch)
-                loss = outputs.loss
-                accelerator.backward(loss)
-                optimizer.step()
-                optimizer.zero_grad()
-
-        # Save checkpoint
-        accelerator.wait_for_everyone()
-        accelerator.save_state(f'checkpoint-epoch-{epoch}')
-
-if __name__ == '__main__':
-    main()
-```
-
-### Launch Command
-
-```bash
-# 8 GPUs, TP=2, PP=2, DP=2
-accelerate launch --multi_gpu --num_processes 8 train.py
-
-# Multi-node (2 nodes, 8 GPUs each)
-# Node 0
-accelerate launch --multi_gpu --num_processes 16 \
-  --num_machines 2 --machine_rank 0 \
-  --main_process_ip $MASTER_ADDR \
-  --main_process_port 29500 \
-  train.py
-
-# Node 1
-accelerate launch --multi_gpu --num_processes 16 \
-  --num_machines 2 --machine_rank 1 \
-  --main_process_ip $MASTER_ADDR \
-  --main_process_port 29500 \
-  train.py
-```
-
-## Activation Checkpointing
-
-**Reduces memory by recomputing activations**:
-
-```python
-megatron_plugin = MegatronLMPlugin(
-    recompute_activations=True,      # Enable checkpointing
-    checkpoint_num_layers=1,         # Checkpoint every N layers
-    distribute_checkpointed_activations=True,  # Distribute across TP
-    partition_activations=True,      # Partition in PP
-    check_for_nan_in_loss_and_grad=True,  # Stability check
-)
-```
-
-**Strategies**:
- `SELECTIVE`: Recompute only the memory-heavy part of each transformer layer (the core attention)
- `FULL`: Recompute the activations of every transformer layer
- `NONE`: No checkpointing
-
-**Memory savings**: Roughly 30-50%, at about a 10-15% slowdown (varies with model and sequence length)
-
-## Distributed Optimizer
-
-**Shards optimizer state across DP ranks**:
-
-```python
-megatron_plugin = MegatronLMPlugin(
-    use_distributed_optimizer=True,  # Enable sharded optimizer
-)
-```
-
-**Benefits**:
- Reduces optimizer memory by DP degree
- Example: DP=4 → 4× less optimizer memory per GPU
-
-**Compatible with**:
- AdamW, Adam, SGD
- Mixed precision training
-
-## Performance Tuning
-
-### Micro-Batch Size
-
-```python
-# Pipeline parallelism requires micro-batching
-megatron_plugin = MegatronLMPlugin(
-    pp_degree=4,
-    num_micro_batches=16,  # 16 micro-batches per pipeline
-)
-
-# Effective batch = num_micro_batches × micro_batch_size × DP
-# Example: 16 × 2 × 4 = 128
-```
-
-**Recommendations**:
- More micro-batches → less pipeline bubble
- Typical: 4-16 micro-batches
-
-### Sequence Length
-
-```python
-# For long sequences, enable sequence parallelism
-megatron_plugin = MegatronLMPlugin(
-    tp_degree=4,
-    sequence_parallelism=True,  # Required: TP > 1
-)
-
-# Activation memory in the sequence-parallel regions shrinks by roughly TP,
-# which leaves room for longer sequences (the exact gain depends on the model)
-```
-
-### GPU Topology
-
-**NVLink strongly recommended for TP**:
-```bash
-# Check NVLink topology
-nvidia-smi topo -m
-
-# Good topology (NVLink between all GPUs)
-# GPU0 - GPU1: NV12 (fast)
-# GPU0 - GPU2: NV12 (fast)
-
-# Bad topology (PCIe only)
-# GPU0 - GPU4: PHB (slow, avoid TP across these)
-```
-
-**Recommendations**:
- **TP**: Within same node (NVLink)
- **PP**: Across nodes (slower interconnect OK)
- **DP**: Any topology
-
-## Model Size Guidelines
-
-| Model Size | GPUs | TP | PP | DP | Micro-Batches |
-|------------|------|----|----|----|--------------|
-| 7B | 8 | 1 | 1 | 8 | 1 |
-| 13B | 8 | 2 | 1 | 4 | 1 |
-| 20B | 16 | 4 | 1 | 4 | 1 |
-| 40B | 32 | 4 | 2 | 4 | 4 |
-| 70B | 64 | 8 | 2 | 4 | 8 |
-| 175B | 128 | 8 | 4 | 4 | 16 |
-
-**Assumptions**: BF16, 2K sequence length, A100 80GB
-
-## Checkpointing
-
-### Save Checkpoint
-
-```python
-# Save full model state
-accelerator.save_state('checkpoint-1000')
-
-# Megatron saves separate files per rank
-# checkpoint-1000/
-#   pytorch_model_tp_0_pp_0.bin
-#   pytorch_model_tp_0_pp_1.bin
-#   pytorch_model_tp_1_pp_0.bin
-#   pytorch_model_tp_1_pp_1.bin
-#   optimizer_tp_0_pp_0.bin
-#   ...
-```
-
-### Load Checkpoint
-
-```python
-# Resume training
-accelerator.load_state('checkpoint-1000')
-
-# Automatically loads correct shard per rank
-```
-
-### Convert to Standard PyTorch
-
-```bash
-# Merge Megatron checkpoint to single file
-python merge_megatron_checkpoint.py \
-  --checkpoint-dir checkpoint-1000 \
-  --output pytorch_model.bin
-```
-
-## Common Issues
-
-### Issue: OOM with Pipeline Parallelism
-
-**Solution**: Increase micro-batches
-```python
-megatron_plugin = MegatronLMPlugin(
-    pp_degree=4,
-    num_micro_batches=16,  # Increase from 4
-)
-```
-
-### Issue: Slow Training
-
-**Check 1**: Pipeline bubbles (PP too high)
-```python
-# Reduce PP, increase TP
-tp_degree=4  # Increase
-pp_degree=2  # Decrease
-```
-
-**Check 2**: Too few micro-batches (large pipeline bubble)
-```python
-num_micro_batches=8  # Increase
-```
-
-### Issue: NVLink Not Detected
-
-```bash
-# Verify NVLink
-nvidia-smi nvlink -s
-
-# If no NVLink, avoid TP > 1
-# Use PP or DP instead
-```
-
-## Resources
-
- Megatron-LM: https://github.com/NVIDIA/Megatron-LM
- Accelerate Megatron docs: https://huggingface.co/docs/accelerate/usage_guides/megatron_lm
- Paper: "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism"
- NVIDIA Apex: https://github.com/NVIDIA/apex
--- a/skills/mlops/accelerate/references/performance.md
+++ b/skills/mlops/accelerate/references/performance.md
@@ -1,525 +0,0 @@
-# Accelerate Performance Tuning
-
-## Profiling
-
-### Basic Profiling
-
-```python
-from accelerate import Accelerator
-import time
-
-accelerator = Accelerator()
-
-# Warmup
-for _ in range(10):
-    batch = next(iter(dataloader))
-    outputs = model(**batch)
-    loss = outputs.loss
-    accelerator.backward(loss)
-    optimizer.step()
-    optimizer.zero_grad()
-
-# Profile training loop
-start = time.time()
-total_batches = 100
-
-for i, batch in enumerate(dataloader):
-    if i >= total_batches:
-        break
-
-    outputs = model(**batch)
-    loss = outputs.loss
-    accelerator.backward(loss)
-    optimizer.step()
-    optimizer.zero_grad()
-
-accelerator.wait_for_everyone()  # Sync all processes
-elapsed = time.time() - start
-
-# Metrics
-batches_per_sec = total_batches / elapsed
-samples_per_sec = (total_batches * batch_size * accelerator.num_processes) / elapsed
-
-print(f"Throughput: {samples_per_sec:.2f} samples/sec")
-print(f"Batches/sec: {batches_per_sec:.2f}")
-```
-
-### PyTorch Profiler Integration
-
-```python
-from torch.profiler import profile, ProfilerActivity
-
-with profile(
-    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
-    record_shapes=True,
-    profile_memory=True,
-    with_stack=True
-) as prof:
-    for i, batch in enumerate(dataloader):
-        if i >= 10:  # Profile first 10 batches
-            break
-
-        outputs = model(**batch)
-        loss = outputs.loss
-        accelerator.backward(loss)
-        optimizer.step()
-        optimizer.zero_grad()
-
-# Print profiling results
-print(prof.key_averages().table(
-    sort_by="cuda_time_total", row_limit=20
-))
-
-# Export to Chrome tracing
-prof.export_chrome_trace("trace.json")
-# View at chrome://tracing
-```
-
-## Memory Optimization
-
-### 1. Gradient Accumulation
-
-**Problem**: Large batch size causes OOM
-
-**Solution**: Accumulate gradients across micro-batches
-
-```python
-accelerator = Accelerator(gradient_accumulation_steps=8)
-
-# Effective batch = batch_size × accumulation_steps × num_gpus
-# Example: 4 × 8 × 8 = 256
-
-for batch in dataloader:
-    with accelerator.accumulate(model):  # Handles accumulation logic
-        outputs = model(**batch)
-        loss = outputs.loss
-        accelerator.backward(loss)
-        optimizer.step()
-        optimizer.zero_grad()
-```
-
-**Memory savings**: Activation memory tracks the per-step batch, so 8 accumulation steps need roughly 8× less than one step at the full effective batch
-
-### 2. Gradient Checkpointing
-
-**Enable in model**:
-
-```python
-from transformers import AutoModelForCausalLM
-
-model = AutoModelForCausalLM.from_pretrained(
-    "gpt2",
-    use_cache=False  # Required for gradient checkpointing
-)
-
-# Enable checkpointing
-model.gradient_checkpointing_enable()
-
-# Prepare with Accelerate
-model = accelerator.prepare(model)
-```
-
-**Memory savings**: Roughly 30-50%, at about a 10-15% slowdown (varies with model and sequence length)
-
-### 3. Mixed Precision
-
-**BF16 (A100/H100)**:
-```python
-accelerator = Accelerator(mixed_precision='bf16')
-
-for batch in dataloader:  # model, optimizer, dataloader already went through prepare()
-    outputs = model(**batch)  # Forward runs under BF16 autocast
-    loss = outputs.loss
-    accelerator.backward(loss)  # Weights and their gradients stay in FP32
-    optimizer.step()
-    optimizer.zero_grad()
-```
-
-**FP16 (V100, older GPUs)**:
-```python
-from accelerate.utils import GradScalerKwargs
-
-scaler_kwargs = GradScalerKwargs(
-    init_scale=2.**16,
-    growth_interval=2000
-)
-
-accelerator = Accelerator(
-    mixed_precision='fp16',
-    kwargs_handlers=[scaler_kwargs]
-)
-```
-
-**Memory savings**: Up to ~50% on activations compared to FP32 (weights and optimizer state stay in FP32 under autocast)
-
-### 4. CPU Offloading (DeepSpeed)
-
-```python
-from accelerate.utils import DeepSpeedPlugin
-
-ds_plugin = DeepSpeedPlugin(
-    zero_stage=3,
-    offload_optimizer_device="cpu",  # Offload optimizer to CPU
-    offload_param_device="cpu",      # Offload parameters to CPU
-)
-
-accelerator = Accelerator(
-    deepspeed_plugin=ds_plugin,
-    mixed_precision='bf16'
-)
-```
-
-**Memory savings**: 10-20× for optimizer state, 5-10× for parameters
-
-**Trade-off**: 20-30% slower due to CPU-GPU transfers
-
-### 5. Flash Attention
-
-```python
-# Install flash-attn
-# pip install flash-attn
-
-from transformers import AutoModelForCausalLM
-
-model = AutoModelForCausalLM.from_pretrained(
-    "gpt2",
-    attn_implementation="flash_attention_2"  # Enable Flash Attention 2
-)
-
-model = accelerator.prepare(model)
-```
-
-**Memory savings**: 50% for attention, 2× faster
-
-**Requirements**: Ampere or newer GPU (e.g. A100, H100, RTX 30/40 series) and FP16/BF16 computation
-
-## Communication Optimization
-
-### 1. Gradient Bucketing (DDP)
-
-```python
-from accelerate.utils import DistributedDataParallelKwargs
-
-ddp_kwargs = DistributedDataParallelKwargs(
-    bucket_cap_mb=25,  # Bucket size for gradient reduction
-    gradient_as_bucket_view=True,  # Reduce memory copies
-    static_graph=False  # Set True if model doesn't change
-)
-
-accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
-```
-
-**Recommended bucket sizes**:
- Small models (<1B): 25 MB
- Medium models (1-10B): 50-100 MB
- Large models (>10B): 100-200 MB
-
-### 2. Find Unused Parameters
-
-```python
-# Only enable if model has unused parameters (slower!)
-ddp_kwargs = DistributedDataParallelKwargs(
-    find_unused_parameters=True
-)
-```
-
-**Use case**: Models with conditional branches (e.g., mixture of experts)
-
-**Cost**: 10-20% slower
-
-### 3. NCCL Tuning
-
-```bash
-# Set environment variables before launch
-export NCCL_DEBUG=INFO           # Debug info
-export NCCL_IB_DISABLE=0         # Enable InfiniBand
-export NCCL_SOCKET_IFNAME=eth0   # Network interface
-export NCCL_P2P_LEVEL=NVL        # Use NVLink
-
-accelerate launch train.py
-```
-
-**NCCL_P2P_LEVEL options**:
- `NVL`: P2P only between GPUs connected by NVLink (fastest)
- `PIX`: P2P between GPUs on the same PCIe switch
- `PHB`: P2P between GPUs on the same NUMA node (traffic crosses the CPU's PCIe host bridge, slower)
-
-## Data Loading Optimization
-
-### 1. DataLoader Workers
-
-```python
-from torch.utils.data import DataLoader
-
-train_loader = DataLoader(
-    dataset,
-    batch_size=32,
-    num_workers=4,      # Parallel data loading
-    pin_memory=True,    # Pin memory for faster GPU transfer
-    prefetch_factor=2,  # Prefetch batches per worker
-    persistent_workers=True  # Keep workers alive between epochs
-)
-
-train_loader = accelerator.prepare(train_loader)
-```
-
-**Recommendations**:
- `num_workers`: 2-4 per process (one process per GPU, so 8 GPUs → 16-32 workers in total)
- `pin_memory`: Always True for GPU training
- `prefetch_factor`: 2-4 (higher for slow data loading)
-
-### 2. Data Preprocessing
-
-```python
-from datasets import load_dataset, load_from_disk
-
-# Bad: Preprocess during training (slow)
-dataset = load_dataset("openwebtext", split="train")
-
-for batch in dataset:
-    tokens = tokenizer(batch['text'])  # Slow!
-    ...
-
-# Good: Preprocess once, save
-dataset = load_dataset("openwebtext", split="train")
-tokenized = dataset.map(
-    lambda x: tokenizer(x['text']),
-    batched=True,
-    num_proc=8,  # Parallel preprocessing
-    remove_columns=['text']
-)
-tokenized.save_to_disk("preprocessed_data")
-
-# Load preprocessed
-dataset = load_from_disk("preprocessed_data")
-```
-
-### 3. Faster Tokenization
-
-```python
-import os
-
-# Let the fast (Rust) tokenizers use multiple threads; use_fast=True below selects them
-os.environ["TOKENIZERS_PARALLELISM"] = "true"
-
-from transformers import AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained(
-    "gpt2",
-    use_fast=True  # Use fast Rust tokenizer
-)
-```
-
-## Compilation (PyTorch 2.0+)
-
-### Compile Model
-
-```python
-import torch
-
-# Compile model for faster execution
-model = torch.compile(
-    model,
-    mode="reduce-overhead",  # Options: default, reduce-overhead, max-autotune
-    fullgraph=False,         # True requires one graph with no graph breaks (stricter)
-    dynamic=True             # Support dynamic shapes
-)
-
-model = accelerator.prepare(model)
-```
-
-**Speedup**: 10-50% depending on model
-
-**Compilation modes**:
- `default`: Balanced (best for most cases)
- `reduce-overhead`: Min overhead (best for small batches)
- `max-autotune`: Max performance (slow compile, best for production)
-
-### Compilation Best Practices
-
-```python
-# Avoid: compiling after prepare() also compiles the distributed wrapper
-model = accelerator.prepare(model)
-model = torch.compile(model)  # Not guaranteed to work with every wrapper
-
-# Good: Compile before prepare
-model = torch.compile(model)
-model = accelerator.prepare(model)
-
-# Training loop
-for batch in dataloader:
-    # First iteration: slow (compilation)
-    # Subsequent iterations: fast (compiled)
-    outputs = model(**batch)
-    ...
-```
-
-## Benchmarking Different Strategies
-
-### Script Template
-
-```python
-import time
-import torch
-from accelerate import Accelerator
-
-def benchmark_strategy(strategy_name, accelerator_kwargs):
-    """Benchmark a specific training strategy."""
-    accelerator = Accelerator(**accelerator_kwargs)
-
-    # Setup (create_model, create_dataloader, and batch_size are placeholders you define)
-    model = create_model()
-    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
-    dataloader = create_dataloader()
-
-    model, optimizer, dataloader = accelerator.prepare(
-        model, optimizer, dataloader
-    )
-
-    # Warmup
-    for i, batch in enumerate(dataloader):
-        if i >= 10:
-            break
-        outputs = model(**batch)
-        loss = outputs.loss
-        accelerator.backward(loss)
-        optimizer.step()
-        optimizer.zero_grad()
-
-    # Benchmark
-    accelerator.wait_for_everyone()
-    torch.cuda.synchronize()
-    start = time.time()
-
-    num_batches = 100
-    for i, batch in enumerate(dataloader):
-        if i >= num_batches:
-            break
-
-        outputs = model(**batch)
-        loss = outputs.loss
-        accelerator.backward(loss)
-        optimizer.step()
-        optimizer.zero_grad()
-
-    accelerator.wait_for_everyone()
-    torch.cuda.synchronize()
-    elapsed = time.time() - start
-
-    # Metrics
-    throughput = (num_batches * batch_size * accelerator.num_processes) / elapsed
-    memory_used = torch.cuda.max_memory_allocated() / 1e9  # GB
-
-    if accelerator.is_main_process:
-        print(f"\n{strategy_name}:")
-        print(f"  Throughput: {throughput:.2f} samples/sec")
-        print(f"  Memory: {memory_used:.2f} GB")
-        print(f"  Time: {elapsed:.2f} sec")
-
-    torch.cuda.reset_peak_memory_stats()
-
-# Benchmark different strategies
-strategies = [
-    ("DDP + FP32", {}),
-    ("DDP + BF16", {"mixed_precision": "bf16"}),
-    ("DDP + BF16 + GradAccum", {"mixed_precision": "bf16", "gradient_accumulation_steps": 4}),
-    ("FSDP", {"fsdp_plugin": fsdp_plugin}),
-    ("DeepSpeed ZeRO-2", {"deepspeed_plugin": ds_plugin_stage2}),
-    ("DeepSpeed ZeRO-3", {"deepspeed_plugin": ds_plugin_stage3}),
-]
-
-for name, kwargs in strategies:
-    benchmark_strategy(name, kwargs)
-```
-
-## Performance Checklist
-
-**Before training**:
- [ ] Use BF16/FP16 mixed precision
- [ ] Enable gradient checkpointing (if OOM)
- [ ] Set appropriate `num_workers` (2-4 per GPU)
- [ ] Enable `pin_memory=True`
- [ ] Preprocess data once, not during training
- [ ] Compile model with `torch.compile` (PyTorch 2.0+)
-
-**For large models**:
- [ ] Use FSDP or DeepSpeed ZeRO-3
- [ ] Enable CPU offloading (if still OOM)
- [ ] Use Flash Attention
- [ ] Increase gradient accumulation
-
-**For multi-node**:
- [ ] Check network topology (InfiniBand > Ethernet)
- [ ] Tune NCCL settings
- [ ] Use larger bucket sizes for DDP
- [ ] Verify NVLink for tensor parallelism
-
-**Profiling**:
- [ ] Profile first 10-100 batches
- [ ] Check GPU utilization (`nvidia-smi dmon`)
- [ ] Check data loading time (should be <5% of iteration)
- [ ] Identify communication bottlenecks
-
-## Common Performance Issues
-
-### Issue: Low GPU Utilization (<80%)
-
-**Cause 1**: Data loading bottleneck
-```python
-# Solution: Increase workers and prefetch
-num_workers=8
-prefetch_factor=4
-```
-
-**Cause 2**: Small batch size
-```python
-# Solution: Increase batch size or use gradient accumulation
-batch_size=32  # Increase
-gradient_accumulation_steps=4  # Or accumulate
-```
-
-### Issue: High Memory Usage
-
-**Solution 1**: Gradient checkpointing
-```python
-model.gradient_checkpointing_enable()
-```
-
-**Solution 2**: Reduce batch size, increase accumulation
-```python
-batch_size=8  # Reduce from 32
-gradient_accumulation_steps=16  # Maintain effective batch
-```
-
-**Solution 3**: Use FSDP or DeepSpeed ZeRO-3
-```python
-accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
-```
-
-### Issue: Slow Multi-GPU Training
-
-**Cause**: Communication bottleneck
-
-**Check 1**: Gradient bucket size
-```python
-ddp_kwargs = DistributedDataParallelKwargs(bucket_cap_mb=100)
-```
-
-**Check 2**: NCCL settings
-```bash
-export NCCL_DEBUG=INFO
-# Check the transport in the channel lines: "via P2P" (good) vs "via SHM" or "via NET/Socket" (slower)
-```
-
-**Check 3**: Network bandwidth
-```bash
-# Check NVLink link status and per-link speed (this does not run a bandwidth test)
-nvidia-smi nvlink -s
-```
-
-## Resources
-
- Accelerate Performance: https://huggingface.co/docs/accelerate/usage_guides/performance
- PyTorch Profiler: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- NCCL Tuning: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- Flash Attention: https://github.com/Dao-AILab/flash-attention
--- a/skills/mlops/audiocraft/SKILL.md
+++ b/skills/mlops/audiocraft/SKILL.md
@@ -1,567 +0,0 @@
---
-name: audiocraft-audio-generation
-description: PyTorch library for audio generation including text-to-music (MusicGen) and text-to-sound (AudioGen). Use when you need to generate music from text descriptions, create sound effects, or perform melody-conditioned music generation.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [audiocraft, torch>=2.0.0, transformers>=4.30.0]
-metadata:
-  hermes:
-    tags: [Multimodal, Audio Generation, Text-to-Music, Text-to-Audio, MusicGen]
-
---
-
-# AudioCraft: Audio Generation
-
-Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.
-
-## When to use AudioCraft
-
-**Use AudioCraft when:**
- Generating music from text descriptions
- Creating sound effects and environmental audio
- Building music generation applications
- Generating melody-conditioned music
- Producing stereo audio output
- Controlling music generation with style transfer
-
-**Key features:**
- **MusicGen**: Text-to-music generation with melody conditioning
- **AudioGen**: Text-to-sound effects generation
- **EnCodec**: High-fidelity neural audio codec
- **Multiple model sizes**: Small (300M) to Large (3.3B)
- **Stereo support**: Full stereo audio generation
- **Style conditioning**: MusicGen-Style for reference-based generation
-
-**Use alternatives instead:**
- **Stable Audio**: For longer commercial music generation
- **Bark**: For text-to-speech with music/sound effects
- **Riffusion**: For spectrogram-based music generation
- **OpenAI Jukebox**: For raw audio generation with lyrics
-
-## Quick start
-
-### Installation
-
-```bash
-# From PyPI
-pip install audiocraft
-
-# From GitHub (latest)
-pip install git+https://github.com/facebookresearch/audiocraft.git
-
-# Or use HuggingFace Transformers
-pip install transformers torch torchaudio
-```
-
-### Basic text-to-music (AudioCraft)
-
-```python
-import torchaudio
-from audiocraft.models import MusicGen
-
-# Load model
-model = MusicGen.get_pretrained('facebook/musicgen-small')
-
-# Set generation parameters
-model.set_generation_params(
-    duration=8,  # seconds
-    top_k=250,
-    temperature=1.0
-)
-
-# Generate from text
-descriptions = ["happy upbeat electronic dance music with synths"]
-wav = model.generate(descriptions)
-
-# Save audio
-torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
-```
-
-### Using HuggingFace Transformers
-
-```python
-from transformers import AutoProcessor, MusicgenForConditionalGeneration
-import scipy.io.wavfile
-
-# Load model and processor
-processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
-model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
-model.to("cuda")
-
-# Generate music
-inputs = processor(
-    text=["80s pop track with bassy drums and synth"],
-    padding=True,
-    return_tensors="pt"
-).to("cuda")
-
-audio_values = model.generate(
-    **inputs,
-    do_sample=True,
-    guidance_scale=3,
-    max_new_tokens=256
-)
-
-# Save
-sampling_rate = model.config.audio_encoder.sampling_rate
-scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
-```
-
-### Text-to-sound with AudioGen
-
-```python
-import torchaudio
-from audiocraft.models import AudioGen
-
-model = AudioGen.get_pretrained('facebook/audiogen-medium')  # Load AudioGen
-
-model.set_generation_params(duration=5)
-
-# Generate sound effects
-descriptions = ["dog barking in a park with birds chirping"]
-wav = model.generate(descriptions)
-
-torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
-```
-
-## Core concepts
-
-### Architecture overview
-
-```
-AudioCraft Architecture:
-┌──────────────────────────────────────────────────────────────┐
-│                    Text Encoder (T5)                          │
-│                         │                                     │
-│                    Text Embeddings                            │
-└────────────────────────┬─────────────────────────────────────┘
-                         │
-┌────────────────────────▼─────────────────────────────────────┐
-│              Transformer Decoder (LM)                         │
-│     Auto-regressively generates audio tokens                  │
-│     Using efficient token interleaving patterns               │
-└────────────────────────┬─────────────────────────────────────┘
-                         │
-┌────────────────────────▼─────────────────────────────────────┐
-│                EnCodec Audio Decoder                          │
-│        Converts tokens back to audio waveform                 │
-└──────────────────────────────────────────────────────────────┘
-```
-
-### Model variants
-
-| Model | Size | Description | Use Case |
-|-------|------|-------------|----------|
-| `musicgen-small` | 300M | Text-to-music | Quick generation |
-| `musicgen-medium` | 1.5B | Text-to-music | Balanced |
-| `musicgen-large` | 3.3B | Text-to-music | Best quality |
-| `musicgen-melody` | 1.5B | Text + melody | Melody conditioning |
-| `musicgen-melody-large` | 3.3B | Text + melody | Best melody |
-| `musicgen-stereo-*` | Varies | Stereo output | Stereo generation |
-| `musicgen-style` | 1.5B | Style transfer | Reference-based |
-| `audiogen-medium` | 1.5B | Text-to-sound | Sound effects |
-
-### Generation parameters
-
-| Parameter | Default | Description |
-|-----------|---------|-------------|
-| `duration` | 30.0 | Length in seconds (MusicGen signature default) |
-| `top_k` | 250 | Top-k sampling |
-| `top_p` | 0.0 | Nucleus sampling (0 = disabled) |
-| `temperature` | 1.0 | Sampling temperature |
-| `cfg_coef` | 3.0 | Classifier-free guidance |
-
-## MusicGen usage
-
-### Text-to-music generation
-
-```python
-from audiocraft.models import MusicGen
-import torchaudio
-
-model = MusicGen.get_pretrained('facebook/musicgen-medium')
-
-# Configure generation
-model.set_generation_params(
-    duration=30,          # Up to 30 seconds
-    top_k=250,            # Sampling diversity
-    top_p=0.0,            # 0 = use top_k only
-    temperature=1.0,      # Creativity (higher = more varied)
-    cfg_coef=3.0          # Text adherence (higher = stricter)
-)
-
-# Generate multiple samples
-descriptions = [
-    "epic orchestral soundtrack with strings and brass",
-    "chill lo-fi hip hop beat with jazzy piano",
-    "energetic rock song with electric guitar"
-]
-
-# Generate (returns [batch, channels, samples])
-wav = model.generate(descriptions)
-
-# Save each
-for i, audio in enumerate(wav):
-    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
-```
-
-### Melody-conditioned generation
-
-```python
-from audiocraft.models import MusicGen
-import torchaudio
-
-# Load melody model
-model = MusicGen.get_pretrained('facebook/musicgen-melody')
-model.set_generation_params(duration=30)
-
-# Load melody audio
-melody, sr = torchaudio.load("melody.wav")
-
-# Generate with melody conditioning
-descriptions = ["acoustic guitar folk song"]
-wav = model.generate_with_chroma(descriptions, melody, sr)
-
-torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
-```
-
-### Stereo generation
-
-```python
-import torchaudio
-from audiocraft.models import MusicGen
-
-model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')  # Stereo variant
-model.set_generation_params(duration=15)
-
-descriptions = ["ambient electronic music with wide stereo panning"]
-wav = model.generate(descriptions)
-
-# wav shape: [batch, 2, samples] for stereo
-print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
-torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
-```
-
-### Audio continuation
-
-```python
-from transformers import AutoProcessor, MusicgenForConditionalGeneration
-
-processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
-model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")
-
-# Load audio to continue
-import torchaudio
-audio, sr = torchaudio.load("intro.wav")
-
-# Process with text and audio
-inputs = processor(
-    audio=audio.squeeze().numpy(),
-    sampling_rate=sr,
-    text=["continue with an epic chorus"],
-    padding=True,
-    return_tensors="pt"
-)
-
-# Generate continuation
-audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
-```
-
-## MusicGen-Style usage
-
-### Style-conditioned generation
-
-```python
-import torchaudio
-from audiocraft.models import MusicGen
-
-model = MusicGen.get_pretrained('facebook/musicgen-style')  # Style variant
-
-# Configure generation with style
-model.set_generation_params(
-    duration=30,
-    cfg_coef=3.0,
-    cfg_coef_beta=5.0  # Style influence
-)
-
-# Configure style conditioner
-model.set_style_conditioner_params(
-    eval_q=3,          # RVQ quantizers (1-6)
-    excerpt_length=3.0  # Style excerpt length
-)
-
-# Load style reference
-style_audio, sr = torchaudio.load("reference_style.wav")
-
-# Generate with text + style
-descriptions = ["upbeat dance track"]
-wav = model.generate_with_style(descriptions, style_audio, sr)
-```
-
-### Style-only generation (no text)
-
-```python
-# Generate matching style without text prompt
-model.set_generation_params(
-    duration=30,
-    cfg_coef=3.0,
-    cfg_coef_beta=None  # Disable double CFG for style-only
-)
-
-wav = model.generate_with_style([None], style_audio, sr)
-```
-
-## AudioGen usage
-
-### Sound effect generation
-
-```python
-from audiocraft.models import AudioGen
-import torchaudio
-
-model = AudioGen.get_pretrained('facebook/audiogen-medium')
-model.set_generation_params(duration=10)
-
-# Generate various sounds
-descriptions = [
-    "thunderstorm with heavy rain and lightning",
-    "busy city traffic with car horns",
-    "ocean waves crashing on rocks",
-    "crackling campfire in forest"
-]
-
-wav = model.generate(descriptions)
-
-for i, audio in enumerate(wav):
-    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
-```
-
-## EnCodec usage
-
-### Audio compression
-
-```python
-from audiocraft.models import CompressionModel
-import torch
-import torchaudio
-
-# Load EnCodec
-model = CompressionModel.get_pretrained('facebook/encodec_32khz')
-
-# Load audio
-wav, sr = torchaudio.load("audio.wav")
-
-# Ensure correct sample rate
-if sr != 32000:
-    resampler = torchaudio.transforms.Resample(sr, 32000)
-    wav = resampler(wav)
-
-# Encode to tokens
-with torch.no_grad():
-    encoded = model.encode(wav.unsqueeze(0))
-    codes = encoded[0]  # Audio codes
-
-# Decode back to audio
-with torch.no_grad():
-    decoded = model.decode(codes)
-
-torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
-```
-
-## Common workflows
-
-### Workflow 1: Music generation pipeline
-
-```python
-import torch
-import torchaudio
-from audiocraft.models import MusicGen
-
-class MusicGenerator:
-    def __init__(self, model_name="facebook/musicgen-medium"):
-        self.model = MusicGen.get_pretrained(model_name)
-        self.sample_rate = 32000
-
-    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
-        self.model.set_generation_params(
-            duration=duration,
-            top_k=250,
-            temperature=temperature,
-            cfg_coef=cfg
-        )
-
-        with torch.no_grad():
-            wav = self.model.generate([prompt])
-
-        return wav[0].cpu()
-
-    def generate_batch(self, prompts, duration=30):
-        self.model.set_generation_params(duration=duration)
-
-        with torch.no_grad():
-            wav = self.model.generate(prompts)
-
-        return wav.cpu()
-
-    def save(self, audio, path):
-        torchaudio.save(path, audio, sample_rate=self.sample_rate)
-
-# Usage
-generator = MusicGenerator()
-audio = generator.generate(
-    "epic cinematic orchestral music",
-    duration=30,
-    temperature=1.0
-)
-generator.save(audio, "epic_music.wav")
-```
-
-### Workflow 2: Sound design batch processing
-
-```python
-import json
-from pathlib import Path
-from audiocraft.models import AudioGen
-import torchaudio
-
-def batch_generate_sounds(sound_specs, output_dir):
-    """
-    Generate multiple sounds from specifications.
-
-    Args:
-        sound_specs: list of {"name": str, "description": str, "duration": float}
-        output_dir: output directory path
-    """
-    model = AudioGen.get_pretrained('facebook/audiogen-medium')
-    output_dir = Path(output_dir)
-    output_dir.mkdir(exist_ok=True)
-
-    results = []
-
-    for spec in sound_specs:
-        model.set_generation_params(duration=spec.get("duration", 5))
-
-        wav = model.generate([spec["description"]])
-
-        output_path = output_dir / f"{spec['name']}.wav"
-        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)
-
-        results.append({
-            "name": spec["name"],
-            "path": str(output_path),
-            "description": spec["description"]
-        })
-
-    return results
-
-# Usage
-sounds = [
-    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
-    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
-    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
-]
-
-results = batch_generate_sounds(sounds, "sound_effects/")
-```
-
-### Workflow 3: Gradio demo
-
-```python
-import gradio as gr
-import torch
-import torchaudio
-from audiocraft.models import MusicGen
-
-model = MusicGen.get_pretrained('facebook/musicgen-small')
-
-def generate_music(prompt, duration, temperature, cfg_coef):
-    model.set_generation_params(
-        duration=duration,
-        temperature=temperature,
-        cfg_coef=cfg_coef
-    )
-
-    with torch.no_grad():
-        wav = model.generate([prompt])
-
-    # Save to temp file
-    path = "temp_output.wav"
-    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
-    return path
-
-demo = gr.Interface(
-    fn=generate_music,
-    inputs=[
-        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
-        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
-        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
-        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
-    ],
-    outputs=gr.Audio(label="Generated Music"),
-    title="MusicGen Demo"
-)
-
-demo.launch()
-```
-
-## Performance optimization
-
-### Memory optimization
-
-```python
-# Use smaller model
-model = MusicGen.get_pretrained('facebook/musicgen-small')
-
-# Clear cache between generations
-torch.cuda.empty_cache()
-
-# Generate shorter durations
-model.set_generation_params(duration=10)  # Instead of 30
-
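-# Use half precision (on CUDA the LM is typically loaded in float16 already)
-model.lm.half()  # MusicGen is a wrapper, not an nn.Module, so cast the LM itself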
-```
-
-### Batch processing efficiency
-
-```python
-# Process multiple prompts at once (more efficient)
-descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
-wav = model.generate(descriptions)  # Single batch
-
-# Instead of
-for desc in descriptions:
-    wav = model.generate([desc])  # Multiple batches (slower)
-```
-
-### GPU memory requirements
-
-| Model | FP32 VRAM | FP16 VRAM |
-|-------|-----------|-----------|
-| musicgen-small | ~4GB | ~2GB |
-| musicgen-medium | ~8GB | ~4GB |
-| musicgen-large | ~16GB | ~8GB |
-
-## Common issues
-
-| Issue | Solution |
-|-------|----------|
-| CUDA OOM | Use smaller model, reduce duration |
-| Poor quality | Increase cfg_coef, better prompts |
-| Generation too short | Check max duration setting |
-| Audio artifacts | Try different temperature |
-| Stereo not working | Use stereo model variant |
-
-## References
-
-- **[Advanced Usage](references/advanced-usage.md)** - Training, fine-tuning, deployment
-- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions
-
-## Resources
-
-- **GitHub**: https://github.com/facebookresearch/audiocraft
-- **Paper (MusicGen)**: https://arxiv.org/abs/2306.05284
-- **Paper (AudioGen)**: https://arxiv.org/abs/2209.15352
-- **HuggingFace**: https://huggingface.co/facebook/musicgen-small
-- **Demo**: https://huggingface.co/spaces/facebook/MusicGen
--- a/skills/mlops/audiocraft/references/advanced-usage.md
+++ b/skills/mlops/audiocraft/references/advanced-usage.md
@@ -1,666 +0,0 @@
-# AudioCraft Advanced Usage Guide
-
-## Fine-tuning MusicGen
-
-### Custom dataset preparation
-
-```python
-import os
-import json
-from pathlib import Path
-import torchaudio
-
-def prepare_dataset(audio_dir, output_dir, metadata_file):
-    """
-    Prepare dataset for MusicGen fine-tuning.
-
-    Directory structure:
-    output_dir/
-    ├── audio/
-    │   ├── 0001.wav
-    │   ├── 0002.wav
-    │   └── ...
-    └── metadata.json
-    """
-    output_dir = Path(output_dir)
-    audio_output = output_dir / "audio"
-    audio_output.mkdir(parents=True, exist_ok=True)
-
-    # Load metadata (format: {"path": "...", "description": "..."})
-    with open(metadata_file) as f:
-        metadata = json.load(f)
-
-    processed = []
-
-    for idx, item in enumerate(metadata):
-        audio_path = Path(audio_dir) / item["path"]
-
-        # Load and resample to 32kHz
-        wav, sr = torchaudio.load(str(audio_path))
-        if sr != 32000:
-            resampler = torchaudio.transforms.Resample(sr, 32000)
-            wav = resampler(wav)
-
-        # Convert to mono if stereo
-        if wav.shape[0] > 1:
-            wav = wav.mean(dim=0, keepdim=True)
-
-        # Save processed audio
-        output_path = audio_output / f"{idx:04d}.wav"
-        torchaudio.save(str(output_path), wav, sample_rate=32000)
-
-        processed.append({
-            "path": str(output_path.relative_to(output_dir)),
-            "description": item["description"],
-            "duration": wav.shape[1] / 32000
-        })
-
-    # Save processed metadata
-    with open(output_dir / "metadata.json", "w") as f:
-        json.dump(processed, f, indent=2)
-
-    print(f"Processed {len(processed)} samples")
-    return processed
-```
-
-### Fine-tuning with dora
-
-```bash
-# AudioCraft uses dora for experiment management
-# Install dora
-pip install dora-search
-
-# Clone AudioCraft
-git clone https://github.com/facebookresearch/audiocraft.git
-cd audiocraft
-
-# Create config for fine-tuning
-cat > config/solver/musicgen/finetune.yaml << 'EOF'
-defaults:
-  - musicgen/musicgen_base
-  - /model: lm/musicgen_lm
-  - /conditioner: cond_base
-
-solver: musicgen
-autocast: true
-autocast_dtype: float16
-
-optim:
-  epochs: 100
-  batch_size: 4
-  lr: 1e-4
-  ema: 0.999
-  optimizer: adamw
-
-dataset:
-  batch_size: 4
-  num_workers: 4
-  train:
-    - dset: your_dataset
-      root: /path/to/dataset
-  valid:
-    - dset: your_dataset
-      root: /path/to/dataset
-
-checkpoint:
-  save_every: 10
-  keep_every_states: null
-EOF
-
-# Run fine-tuning
-dora run solver=musicgen/finetune
-```
-
-### LoRA fine-tuning
-
-```python
-from peft import LoraConfig, get_peft_model
-from audiocraft.models import MusicGen
-import torch
-
-# Load base model
-model = MusicGen.get_pretrained('facebook/musicgen-small')
-
-# Get the language model component
-lm = model.lm
-
-# Configure LoRA
-lora_config = LoraConfig(
-    r=8,
-    lora_alpha=16,
-    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],  # Assumed names; confirm with lm.named_modules()
-    lora_dropout=0.05,
-    bias="none"
-)
-
-# Apply LoRA
-lm = get_peft_model(lm, lora_config)
-lm.print_trainable_parameters()
-```
-
-## Multi-GPU Training
-
-### DataParallel
-
-```python
-import torch
-import torch.nn as nn
-from audiocraft.models import MusicGen
-
-model = MusicGen.get_pretrained('facebook/musicgen-small')
-
-# Wrap LM with DataParallel
-if torch.cuda.device_count() > 1:
-    model.lm = nn.DataParallel(model.lm)
-
-# get_pretrained() already places the model on CUDA when a GPU is available
-```
-
-### DistributedDataParallel
-
-```python
-import torch
-import torch.distributed as dist
-from audiocraft.models import MusicGen
-from torch.nn.parallel import DistributedDataParallel as DDP
-
-def setup(rank, world_size):
-    dist.init_process_group("nccl", rank=rank, world_size=world_size)
-    torch.cuda.set_device(rank)
-
-def train(rank, world_size):
-    setup(rank, world_size)
-    model = MusicGen.get_pretrained('facebook/musicgen-small')
-    model.lm = DDP(model.lm.to(rank), device_ids=[rank])
-
-    # Training loop
-    # ...
-
-    dist.destroy_process_group()
-```
-
-## Custom Conditioning
-
-### Adding new conditioners
-
-```python
-from audiocraft.modules.conditioners import BaseConditioner
-import torch
-
-class CustomConditioner(BaseConditioner):
-    """Custom conditioner for additional control signals."""
-
-    def __init__(self, dim, output_dim):
-        super().__init__(dim, output_dim)
-        self.embed = torch.nn.Linear(dim, output_dim)
-
-    def forward(self, x):
-        return self.embed(x)
-
-    def tokenize(self, x):
-        # Tokenize input for conditioning
-        return x
-
-# Use with MusicGen
-from audiocraft.models.builders import get_lm_model
-
-# Modify model config to include custom conditioner
-# This requires editing the model configuration
-```
-
-### Melody conditioning internals
-
-```python
-from audiocraft.models import MusicGen
-from audiocraft.modules.codebooks_patterns import DelayedPatternProvider
-import torch
-
-model = MusicGen.get_pretrained('facebook/musicgen-melody')
-
-# Access the chroma conditioner (registered under the 'self_wav' key in the melody models)
-chroma_extractor = model.lm.condition_provider.conditioners['self_wav']
-
-# Manual chroma extraction
-def extract_chroma(audio, sr):
-    """Extract chroma features from audio."""
-    import librosa
-
-    # Compute chroma
-    chroma = librosa.feature.chroma_cqt(y=audio.numpy(), sr=sr)
-
-    return torch.from_numpy(chroma).float()
-
-# Extract chroma; melody_audio (1-D mono tensor) and sample_rate are supplied by the caller
-chroma = extract_chroma(melody_audio, sample_rate)
-```
-
-## EnCodec Deep Dive
-
-### Custom compression settings
-
-```python
-from audiocraft.models import CompressionModel
-import torch
-
-# Load EnCodec
-encodec = CompressionModel.get_pretrained('facebook/encodec_32khz')
-
-# Access codec parameters
-print(f"Sample rate: {encodec.sample_rate}")
-print(f"Channels: {encodec.channels}")
-print(f"Cardinality: {encodec.cardinality}")  # Codebook size
-print(f"Num codebooks: {encodec.num_codebooks}")
-print(f"Frame rate: {encodec.frame_rate}")
-
-# Trade quality for bitrate by limiting the number of codebooks
-# Fewer codebooks = more compression, lower quality
-encodec.set_num_codebooks(2)  # Must not exceed encodec.total_codebooks
-
-audio = torch.randn(1, 1, 32000)  # 1 second
-encoded = encodec.encode(audio)
-decoded = encodec.decode(encoded[0])
-```
-
-### Streaming encoding
-
-```python
-import torch
-from audiocraft.models import CompressionModel
-
-encodec = CompressionModel.get_pretrained('facebook/encodec_32khz')
-
-def encode_streaming(audio_stream, chunk_size=32000):
-    """Encode audio in streaming fashion."""
-    all_codes = []
-
-    for chunk in audio_stream:
-        # Ensure chunk is right shape
-        if chunk.dim() == 1:
-            chunk = chunk.unsqueeze(0).unsqueeze(0)
-
-        with torch.no_grad():
-            codes = encodec.encode(chunk)[0]
-            all_codes.append(codes)
-
-    return torch.cat(all_codes, dim=-1)
-
-def decode_streaming(codes_stream, output_stream):
-    """Decode codes in streaming fashion."""
-    for codes in codes_stream:
-        with torch.no_grad():
-            audio = encodec.decode(codes)
-            output_stream.write(audio.cpu().numpy())
-```
-
-## MultiBand Diffusion
-
-### Using MBD for enhanced quality
-
-```python
-from audiocraft.models import MusicGen, MultiBandDiffusion
-
-# Load MusicGen
-model = MusicGen.get_pretrained('facebook/musicgen-medium')
-
-# Load MultiBand Diffusion
-mbd = MultiBandDiffusion.get_mbd_musicgen()
-
-model.set_generation_params(duration=10)
-
-# Generate with the standard decoder and keep the tokens
-descriptions = ["epic orchestral music"]
-wav_standard, gen_tokens = model.generate(descriptions, return_tokens=True)
-
-# Decode the same tokens with the MBD decoder, so both outputs
-# come from one sampling pass and differ only in the decoder
-wav_mbd = mbd.tokens_to_wav(gen_tokens)
-
-# Compare quality
-print(f"Standard shape: {wav_standard.shape}")
-print(f"MBD shape: {wav_mbd.shape}")
-```
-
-## API Server Deployment
-
-### FastAPI server
-
-```python
-from fastapi import FastAPI, HTTPException
-from pydantic import BaseModel
-import torch
-import torchaudio
-from audiocraft.models import MusicGen
-import io
-import base64
-
-app = FastAPI()
-
-# Load model at startup
-model = None
-
-@app.on_event("startup")
-async def load_model():
-    global model
-    model = MusicGen.get_pretrained('facebook/musicgen-small')
-    model.set_generation_params(duration=10)
-
-class GenerateRequest(BaseModel):
-    prompt: str
-    duration: float = 10.0
-    temperature: float = 1.0
-    cfg_coef: float = 3.0
-
-class GenerateResponse(BaseModel):
-    audio_base64: str
-    sample_rate: int
-    duration: float
-
-@app.post("/generate", response_model=GenerateResponse)
-async def generate(request: GenerateRequest):
-    if model is None:
-        raise HTTPException(status_code=500, detail="Model not loaded")
-
-    try:
-        model.set_generation_params(
-            duration=min(request.duration, 30),
-            temperature=request.temperature,
-            cfg_coef=request.cfg_coef
-        )
-
-        with torch.no_grad():
-            wav = model.generate([request.prompt])
-
-        # Convert to bytes
-        buffer = io.BytesIO()
-        torchaudio.save(buffer, wav[0].cpu(), sample_rate=32000, format="wav")
-        buffer.seek(0)
-
-        audio_base64 = base64.b64encode(buffer.read()).decode()
-
-        return GenerateResponse(
-            audio_base64=audio_base64,
-            sample_rate=32000,
-            duration=wav.shape[-1] / 32000
-        )
-
-    except Exception as e:
-        raise HTTPException(status_code=500, detail=str(e))
-
-@app.get("/health")
-async def health():
-    return {"status": "ok", "model_loaded": model is not None}
-
-# Run: uvicorn server:app --host 0.0.0.0 --port 8000
-```
-
-### Batch processing service
-
-```python
-import asyncio
-from concurrent.futures import ThreadPoolExecutor
-import torch
-from audiocraft.models import MusicGen
-
-class MusicGenService:
-    def __init__(self, model_name='facebook/musicgen-small', max_workers=2):
-        self.model = MusicGen.get_pretrained(model_name)
-        self.executor = ThreadPoolExecutor(max_workers=max_workers)
-        self.lock = asyncio.Lock()
-
-    async def generate_async(self, prompt, duration=10):
-        """Async generation with thread pool."""
-        loop = asyncio.get_running_loop()
-
-        def _generate():
-            with torch.no_grad():
-                self.model.set_generation_params(duration=duration)
-                return self.model.generate([prompt])
-
-        # Run in thread pool; the lock serializes access because the
-        # generation parameters are stored on the shared model
-        async with self.lock:
-            wav = await loop.run_in_executor(self.executor, _generate)
-        return wav[0].cpu()
-
-    async def generate_batch_async(self, prompts, duration=10):
-        """Queue multiple prompts without blocking the event loop; generation runs one at a time."""
-        tasks = [self.generate_async(p, duration) for p in prompts]
-        return await asyncio.gather(*tasks)
-
-# Usage
-service = MusicGenService()
-
-async def main():
-    prompts = ["jazz piano", "rock guitar", "electronic beats"]
-    results = await service.generate_batch_async(prompts)
-    return results
-```
-
-## Integration Patterns
-
-### LangChain tool
-
-```python
-from langchain.tools import BaseTool
-import torch
-import torchaudio
-from audiocraft.models import MusicGen
-import tempfile
-from typing import Any
-
-class MusicGeneratorTool(BaseTool):
-    # BaseTool is a pydantic model, so fields need type annotations
-    name: str = "music_generator"
-    description: str = "Generate music from a text description. Input should be a detailed description of the music style, mood, and instruments."
-    model: Any = None  # MusicGen instance, declared so it can be assigned in __init__
-
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-        self.model = MusicGen.get_pretrained('facebook/musicgen-small')
-        self.model.set_generation_params(duration=15)
-
-    def _run(self, description: str) -> str:
-        with torch.no_grad():
-            wav = self.model.generate([description])
-
-        # Save to temp file
-        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
-            torchaudio.save(f.name, wav[0].cpu(), sample_rate=32000)
-            return f"Generated music saved to: {f.name}"
-
-    async def _arun(self, description: str) -> str:
-        return self._run(description)
-```
-
-### Gradio with advanced controls
-
-```python
-import gradio as gr
-import torch
-import torchaudio
-from audiocraft.models import MusicGen
-
-models = {}
-
-def load_model(model_size):
-    if model_size not in models:
-        model_name = f"facebook/musicgen-{model_size}"
-        models[model_size] = MusicGen.get_pretrained(model_name)
-    return models[model_size]
-
-def generate(prompt, duration, temperature, cfg_coef, top_k, model_size):
-    model = load_model(model_size)
-
-    model.set_generation_params(
-        duration=duration,
-        temperature=temperature,
-        cfg_coef=cfg_coef,
-        top_k=top_k
-    )
-
-    with torch.no_grad():
-        wav = model.generate([prompt])
-
-    # Save
-    path = "output.wav"
-    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
-    return path
-
-demo = gr.Interface(
-    fn=generate,
-    inputs=[
-        gr.Textbox(label="Prompt", lines=3),
-        gr.Slider(1, 30, value=10, label="Duration (s)"),
-        gr.Slider(0.1, 2.0, value=1.0, label="Temperature"),
-        gr.Slider(0.5, 10.0, value=3.0, label="CFG Coefficient"),
-        gr.Slider(50, 500, value=250, step=50, label="Top-K"),
-        gr.Dropdown(["small", "medium", "large"], value="small", label="Model Size")
-    ],
-    outputs=gr.Audio(label="Generated Music"),
-    title="MusicGen Advanced",
-    allow_flagging="never"
-)
-
-demo.launch(share=True)
-```
-
-## Audio Processing Pipeline
-
-### Post-processing chain
-
-```python
-import torch
-import torchaudio
-import torchaudio.transforms as T
-import numpy as np
-
-class AudioPostProcessor:
-    def __init__(self, sample_rate=32000):
-        self.sample_rate = sample_rate
-
-    def normalize(self, audio, target_db=-14.0):
-        """Normalize audio to target loudness."""
-        rms = torch.sqrt(torch.mean(audio ** 2))
-        target_rms = 10 ** (target_db / 20)
-        gain = target_rms / (rms + 1e-8)
-        return audio * gain
-
-    def fade_in_out(self, audio, fade_duration=0.1):
-        """Apply fade in/out."""
-        fade_samples = int(fade_duration * self.sample_rate)
-
-        # Create fade curves
-        fade_in = torch.linspace(0, 1, fade_samples)
-        fade_out = torch.linspace(1, 0, fade_samples)
-
-        # Apply fades on a copy so the caller's tensor is not modified
-        audio = audio.clone()
-        audio[..., :fade_samples] *= fade_in
-        audio[..., -fade_samples:] *= fade_out
-
-        return audio
-
-    def apply_reverb(self, audio, decay=0.5):
-        """Apply simple reverb effect."""
-        impulse = torch.zeros(int(self.sample_rate * 0.5))
-        impulse[0] = 1.0
-        impulse[int(self.sample_rate * 0.1)] = decay * 0.5
-        impulse[int(self.sample_rate * 0.2)] = decay * 0.25
-
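-        # Causal convolution: conv1d computes cross-correlation, so flip the
-        # kernel and left-pad to keep echoes after the source and the length unchanged
-        audio = torch.nn.functional.conv1d(
-            torch.nn.functional.pad(audio.unsqueeze(0), (len(impulse) - 1, 0)),
-            impulse.flip(0).unsqueeze(0).unsqueeze(0)
-        ).squeeze(0)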
-
-        return audio
-
-    def process(self, audio):
-        """Full processing pipeline."""
-        audio = self.normalize(audio)
-        audio = self.fade_in_out(audio)
-        return audio
-
-# Usage with MusicGen
-from audiocraft.models import MusicGen
-
-model = MusicGen.get_pretrained('facebook/musicgen-small')
-model.set_generation_params(duration=10)
-
-wav = model.generate(["chill ambient music"])
-processor = AudioPostProcessor()
-wav_processed = processor.process(wav[0].cpu())
-
-torchaudio.save("processed.wav", wav_processed, sample_rate=32000)
-```
-
-## Evaluation
-
-### Audio quality metrics
-
-```python
-import torch
-from audiocraft.metrics import CLAPTextConsistencyMetric
-from audiocraft.data.audio import audio_read
-
-def evaluate_generation(audio_path, text_prompt):
-    """Evaluate generated audio quality."""
-    # Load audio
-    wav, sr = audio_read(audio_path)
-
-    # CLAP consistency (text-audio alignment)
-    clap_metric = CLAPTextConsistencyMetric()
-    clap_score = clap_metric.compute(wav, [text_prompt])
-
-    return {
-        "clap_score": clap_score,
-        "duration": wav.shape[-1] / sr
-    }
-
-# Batch evaluation
-def evaluate_batch(generations):
-    """Evaluate multiple generations."""
-    results = []
-    for gen in generations:
-        result = evaluate_generation(gen["path"], gen["prompt"])
-        result["prompt"] = gen["prompt"]
-        results.append(result)
-
-    # Aggregate
-    avg_clap = sum(r["clap_score"] for r in results) / len(results)
-    return {
-        "individual": results,
-        "average_clap": avg_clap
-    }
-```
-
-## Model Comparison
-
-### MusicGen variants benchmark
-
-| Model | CLAP Score | Generation Time (10 s clip) | VRAM |
-|-------|------------|-----------------------------|------|
-| musicgen-small | 0.35 | ~5s | 2GB |
-| musicgen-medium | 0.42 | ~15s | 4GB |
-| musicgen-large | 0.48 | ~30s | 8GB |
-| musicgen-melody | 0.45 | ~15s | 4GB |
-| musicgen-stereo-medium | 0.41 | ~18s | 5GB |
-
-### Prompt engineering tips
-
-```python
-# Good prompts - specific and descriptive
-good_prompts = [
-    "upbeat electronic dance music with synthesizer leads and punchy drums at 128 bpm",
-    "melancholic piano ballad with strings, slow tempo, emotional and cinematic",
-    "funky disco groove with slap bass, brass section, and rhythmic guitar"
-]
-
-# Bad prompts - too vague
-bad_prompts = [
-    "nice music",
-    "song",
-    "good beat"
-]
-
-# Structure: [mood] [genre] with [instruments] at [tempo/style]
-```
--- a/skills/mlops/audiocraft/references/troubleshooting.md
+++ b/skills/mlops/audiocraft/references/troubleshooting.md
@@ -1,504 +0,0 @@
-# AudioCraft Troubleshooting Guide
-
-## Installation Issues
-
-### Import errors
-
-**Error**: `ModuleNotFoundError: No module named 'audiocraft'`
-
-**Solutions**:
-```bash
-# Install from PyPI
-pip install audiocraft
-
-# Or from GitHub
-pip install git+https://github.com/facebookresearch/audiocraft.git
-
-# Verify installation
-python -c "from audiocraft.models import MusicGen; print('OK')"
-```
-
-### FFmpeg not found
-
-**Error**: `RuntimeError: ffmpeg not found`
-
-**Solutions**:
-```bash
-# Ubuntu/Debian
-sudo apt-get install ffmpeg
-
-# macOS
-brew install ffmpeg
-
-# Windows (using conda)
-conda install -c conda-forge ffmpeg
-
-# Verify
-ffmpeg -version
-```
-
-### PyTorch CUDA mismatch
-
-**Error**: `RuntimeError: CUDA error: no kernel image is available`
-
-**Solutions**:
-```bash
-# Check CUDA version
-nvcc --version
-python -c "import torch; print(torch.version.cuda)"
-
-# Install matching PyTorch
-pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
-
-# For CUDA 11.8
-pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
-```
-
-### xformers issues
-
-**Error**: `ImportError` or other errors that mention `xformers`
-
-**Solutions**:
-```bash
-# Install xformers for memory efficiency
-pip install xformers
-
-# Or disable xformers
-export AUDIOCRAFT_USE_XFORMERS=0
-
-# In Python, set it before importing audiocraft:
-#   import os
-#   os.environ["AUDIOCRAFT_USE_XFORMERS"] = "0"
-#   from audiocraft.models import MusicGen
-```
-
-## Model Loading Issues
-
-### Out of memory during load
-
-**Error**: `torch.cuda.OutOfMemoryError` during model loading
-
-**Solutions**:
-```python
-# Use smaller model
-model = MusicGen.get_pretrained('facebook/musicgen-small')
-
-# Load on CPU instead (slower, but sidesteps GPU memory limits);
-# the MusicGen wrapper has no .to() method, so pick the device at load time
-import torch
-device = "cpu"
-model = MusicGen.get_pretrained('facebook/musicgen-small', device=device)
-
-# Use HuggingFace with device_map
-from transformers import MusicgenForConditionalGeneration
-model = MusicgenForConditionalGeneration.from_pretrained(
-    "facebook/musicgen-small",
-    device_map="auto"
-)
-```
-
-### Download failures
-
-**Error**: Connection errors or incomplete downloads
-
-**Solutions**:
-```python
-# Set cache directory
-import os
-os.environ["AUDIOCRAFT_CACHE_DIR"] = "/path/to/cache"
-
-# Or for HuggingFace
-os.environ["HF_HOME"] = "/path/to/hf_cache"
-
-# Resume download
-from huggingface_hub import snapshot_download
-snapshot_download("facebook/musicgen-small", resume_download=True)
-
-# Use local files
-model = MusicGen.get_pretrained('/local/path/to/model')
-```
-
-### Wrong model type
-
-**Problem**: The loaded model does not match the task
-
-**Solutions**:
-```python
-# For text-to-music: use MusicGen
-from audiocraft.models import MusicGen
-model = MusicGen.get_pretrained('facebook/musicgen-medium')
-
-# For text-to-sound: use AudioGen
-from audiocraft.models import AudioGen
-model = AudioGen.get_pretrained('facebook/audiogen-medium')
-
-# For melody conditioning: use melody variant
-model = MusicGen.get_pretrained('facebook/musicgen-melody')
-
-# For stereo: use stereo variant
-model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
-```
-
-## Generation Issues
-
-### Empty or silent output
-
-**Problem**: Generated audio is silent or very quiet
-
-**Solutions**:
-```python
-import torch
-
-# Check output
-wav = model.generate(["upbeat music"])
-print(f"Shape: {wav.shape}")
-print(f"Max amplitude: {wav.abs().max().item()}")
-print(f"Mean amplitude: {wav.abs().mean().item()}")
-
-# If too quiet, normalize
-def normalize_audio(audio, target_db=-14.0):
-    rms = torch.sqrt(torch.mean(audio ** 2))
-    target_rms = 10 ** (target_db / 20)
-    gain = target_rms / (rms + 1e-8)
-    return audio * gain
-
-wav_normalized = normalize_audio(wav)
-```
-
-### Poor quality output
-
-**Problem**: Generated music sounds bad or noisy
-
-**Solutions**:
-```python
-# Use larger model
-model = MusicGen.get_pretrained('facebook/musicgen-large')
-
-# Adjust generation parameters
-model.set_generation_params(
-    duration=15,
-    top_k=250,          # Increase for more diversity
-    temperature=0.8,    # Lower for more focused output
-    cfg_coef=4.0        # Increase for better text adherence
-)
-
-# Use better prompts
-# Bad: "music"
-# Good: "upbeat electronic dance music with synthesizers and punchy drums"
-
-# Try MultiBand Diffusion
-from audiocraft.models import MultiBandDiffusion
-mbd = MultiBandDiffusion.get_mbd_musicgen()
-tokens = model.generate_tokens(["prompt"])
-wav = mbd.tokens_to_wav(tokens)
-```
-
-### Generation too short
-
-**Problem**: Audio shorter than expected
-
-**Solutions**:
-```python
-# Check duration setting
-model.set_generation_params(duration=30)  # Set before generate
-
-# Verify in generation
-print(f"Duration setting: {model.generation_params}")
-
-# Check output shape
-wav = model.generate(["prompt"])
-actual_duration = wav.shape[-1] / 32000
-print(f"Actual duration: {actual_duration}s")
-
-# Note: max duration is typically 30s
-```
-
-### Melody conditioning fails
-
-**Problem**: Melody-conditioned generation fails, typically because of the wrong model variant, sample rate, or tensor shape
-
-**Solutions**:
-```python
-import torchaudio
-from audiocraft.models import MusicGen
-
-# Load melody model (not base model)
-model = MusicGen.get_pretrained('facebook/musicgen-melody')
-
-# Load and prepare melody
-melody, sr = torchaudio.load("melody.wav")
-
-# Resample to model sample rate if needed
-if sr != 32000:
-    resampler = torchaudio.transforms.Resample(sr, 32000)
-    melody = resampler(melody)
-
-# Ensure correct shape [batch, channels, samples]
-if melody.dim() == 1:
-    melody = melody.unsqueeze(0).unsqueeze(0)
-elif melody.dim() == 2:
-    melody = melody.unsqueeze(0)
-
-# Convert stereo to mono
-if melody.shape[1] > 1:
-    melody = melody.mean(dim=1, keepdim=True)
-
-# Generate with melody
-model.set_generation_params(duration=min(melody.shape[-1] / 32000, 30))
-wav = model.generate_with_chroma(["piano cover"], melody, 32000)
-```
-
-## Memory Issues
-
-### CUDA out of memory
-
-**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
-
-**Solutions**:
-```python
-import torch
-
-# Clear cache before generation
-torch.cuda.empty_cache()
-
-# Use smaller model
-model = MusicGen.get_pretrained('facebook/musicgen-small')
-
-# Reduce duration
-model.set_generation_params(duration=10)  # Instead of 30
-
-# Generate one at a time
-for prompt in prompts:
-    wav = model.generate([prompt])
-    save_audio(wav)
-    torch.cuda.empty_cache()
-
-# Use CPU for very large generations
-model = MusicGen.get_pretrained('facebook/musicgen-small', device="cpu")
-```
-
-### Memory leak during batch processing
-
-**Problem**: Memory grows over time
-
-**Solutions**:
-```python
-import gc
-import torch
-
-def generate_with_cleanup(model, prompts):
-    results = []
-
-    for prompt in prompts:
-        with torch.no_grad():
-            wav = model.generate([prompt])
-            results.append(wav.cpu())
-
-        # Cleanup
-        del wav
-        gc.collect()
-        torch.cuda.empty_cache()
-
-    return results
-
-# Use context manager
-with torch.inference_mode():
-    wav = model.generate(["prompt"])
-```
-
-## Audio Format Issues
-
-### Wrong sample rate
-
-**Problem**: Audio plays at wrong speed
-
-**Solutions**:
-```python
-import torchaudio
-
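-# MusicGen outputs at 32 kHz; AudioGen outputs at 16 kHz
-# (assigning both in sequence would leave sample_rate = 16000).
-
-# Read the rate from the loaded model instead of hardcoding it
-sample_rate = model.sample_rate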
-
-# Always use correct rate when saving
-torchaudio.save("output.wav", wav[0].cpu(), sample_rate=sample_rate)
-
-# Resample if needed
-resampler = torchaudio.transforms.Resample(32000, 44100)
-wav_resampled = resampler(wav.cpu())  # the resampler is created on CPU
-```
-
-### Stereo/mono mismatch
-
-**Problem**: Wrong number of channels
-
-**Solutions**:
-```python
-# Check channel count (dimension 1 of the output)
-print(f"Audio channels: {wav.shape[1]}")
-# Mono: [batch, 1, samples]
-# Stereo: [batch, 2, samples]
-
-# Convert mono to stereo
-if wav.shape[1] == 1:
-    wav_stereo = wav.repeat(1, 2, 1)
-
-# Convert stereo to mono
-if wav.shape[1] == 2:
-    wav_mono = wav.mean(dim=1, keepdim=True)
-
-# Use stereo model for stereo output
-model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
-```
-
-### Clipping and distortion
-
-**Problem**: Audio has clipping or distortion
-
-**Solutions**:
-```python
-import torch
-
-# Check for clipping
-max_val = wav.abs().max().item()
-print(f"Max amplitude: {max_val}")
-
-# Normalize to prevent clipping
-if max_val > 1.0:
-    wav = wav / max_val
-
-# Apply soft clipping
-def soft_clip(x, threshold=0.9):
-    return torch.tanh(x / threshold) * threshold
-
-wav_clipped = soft_clip(wav)
-
-# Lower temperature during generation
-model.set_generation_params(temperature=0.7)  # More controlled
-```
-
-## HuggingFace Transformers Issues
-
-### Processor errors
-
-**Error**: Issues with MusicgenProcessor
-
-**Solutions**:
-```python
-from transformers import AutoProcessor, MusicgenForConditionalGeneration
-
-# Load matching processor and model
-processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
-model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small").to("cuda")
-
-# Ensure inputs are on same device
-inputs = processor(
-    text=["prompt"],
-    padding=True,
-    return_tensors="pt"
-).to("cuda")
-
-# Check processor configuration
-print(processor.tokenizer)
-print(processor.feature_extractor)
-```
-
-### Generation parameter errors
-
-**Error**: Invalid generation parameters
-
-**Solutions**:
-```python
-# HuggingFace uses different parameter names
-audio_values = model.generate(
-    **inputs,
-    do_sample=True,           # Enable sampling
-    guidance_scale=3.0,       # CFG (not cfg_coef)
-    max_new_tokens=256,       # Token limit (not duration)
-    temperature=1.0
-)
-
-# Calculate tokens from duration
-# ~50 tokens per second
-duration_seconds = 10
-max_tokens = duration_seconds * 50
-audio_values = model.generate(**inputs, max_new_tokens=max_tokens)
-```
-
-## Performance Issues
-
-### Slow generation
-
-**Problem**: Generation takes too long
-
-**Solutions**:
-```python
-# Use smaller model
-model = MusicGen.get_pretrained('facebook/musicgen-small')
-
-# Reduce duration
-model.set_generation_params(duration=10)
-
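-# Use GPU (MusicGen is not an nn.Module and has no .to(); pass the device at load time)
-model = MusicGen.get_pretrained('facebook/musicgen-small', device="cuda")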
-
-# Enable flash attention if available
-# (requires compatible hardware)
-
-# Batch multiple prompts
-prompts = ["prompt1", "prompt2", "prompt3"]
-wav = model.generate(prompts)  # Single batch is faster than loop
-
-# Use compile (PyTorch 2.0+)
-model.lm = torch.compile(model.lm)
-```
-
-### CPU fallback
-
-**Problem**: Generation running on CPU instead of GPU
-
-**Solutions**:
-```python
-import torch
-
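-# Check CUDA availability (get_device_name raises if no GPU is present)
-print(f"CUDA available: {torch.cuda.is_available()}")
-print(f"CUDA device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none'}")
-
-# Load directly on the GPU: MusicGen is not an nn.Module and has no .to(),
-# so the device is passed to get_pretrained
-model = MusicGen.get_pretrained('facebook/musicgen-small', device="cuda")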
-
-# Verify model device
-print(f"Model device: {next(model.lm.parameters()).device}")
-```
-
-## Common Error Messages
-
-| Error | Cause | Solution |
-|-------|-------|----------|
-| `CUDA out of memory` | Model too large | Use smaller model, reduce duration |
-| `ffmpeg not found` | FFmpeg not installed | Install FFmpeg |
-| `No module named 'audiocraft'` | Not installed | `pip install audiocraft` |
-| `RuntimeError: Expected 3D tensor` | Wrong input shape | Check tensor dimensions |
-| `KeyError: 'melody'` | Wrong model for melody | Use musicgen-melody |
-| `Sample rate mismatch` | Wrong audio format | Resample to model rate |
-
-## Getting Help
-
-1. **GitHub Issues**: https://github.com/facebookresearch/audiocraft/issues
-2. **HuggingFace Forums**: https://discuss.huggingface.co
-3. **Paper**: https://arxiv.org/abs/2306.05284
-
-### Reporting Issues
-
-Include:
- Python version
- PyTorch version
- CUDA version
- AudioCraft version: `pip show audiocraft`
- Full error traceback
- Minimal reproducible code
- Hardware (GPU model, VRAM)
--- a/skills/mlops/code-review/SKILL.md
+++ b/skills/mlops/code-review/SKILL.md
@@ -1,81 +0,0 @@
---
-name: code-review
-description: Guidelines for performing thorough code reviews with security and quality focus
---
-
-# Code Review Skill
-
-Use this skill when reviewing code changes, pull requests, or auditing existing code.
-
-## Review Checklist
-
-### 1. Security First
- [ ] No hardcoded secrets, API keys, or credentials
- [ ] Input validation on all user-provided data
- [ ] SQL queries use parameterized statements (no string concatenation)
- [ ] File operations validate paths (no path traversal)
- [ ] Authentication/authorization checks present where needed
-
-### 2. Error Handling
- [ ] All external calls (API, DB, file) have try/catch
- [ ] Errors are logged with context (but no sensitive data)
- [ ] User-facing errors are helpful but don't leak internals
- [ ] Resources are cleaned up in finally blocks or context managers
-
-### 3. Code Quality
- [ ] Functions do one thing and are reasonably sized (<50 lines ideal)
- [ ] Variable names are descriptive (no single letters except loops)
- [ ] No commented-out code left behind
- [ ] Complex logic has explanatory comments
- [ ] No duplicate code (DRY principle)
-
-### 4. Testing Considerations
- [ ] Edge cases handled (empty inputs, nulls, boundaries)
- [ ] Happy path and error paths both work
- [ ] New code has corresponding tests (if test suite exists)
-
-## Review Response Format
-
-When providing review feedback, structure it as:
-
-```
-## Summary
-[1-2 sentence overall assessment]
-
-## Critical Issues (Must Fix)
- Issue 1: [description + suggested fix]
- Issue 2: ...
-
-## Suggestions (Nice to Have)
- Suggestion 1: [description]
-
-## Questions
- [Any clarifying questions about intent]
-```
-
-## Common Patterns to Flag
-
-### Python
-```python
-# Bad: SQL injection risk
-cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")
-
-# Good: Parameterized query
-cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
-```
-
-### JavaScript
-```javascript
-// Bad: XSS risk
-element.innerHTML = userInput;
-
-// Good: Safe text content
-element.textContent = userInput;
-```
-
-## Tone Guidelines
-
- Be constructive, not critical
- Explain *why* something is an issue, not just *what*
- Offer solutions, not just problems
- Acknowledge good patterns you see
--- a/skills/mlops/faiss/SKILL.md
+++ b/skills/mlops/faiss/SKILL.md
@@ -1,224 +0,0 @@
---
-name: faiss
-description: Facebook's library for efficient similarity search and clustering of dense vectors. Supports billions of vectors, GPU acceleration, and various index types (Flat, IVF, HNSW). Use for fast k-NN search, large-scale vector retrieval, or when you need pure similarity search without metadata. Best for high-performance applications.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [faiss-cpu, faiss-gpu, numpy]
-metadata:
-  hermes:
-    tags: [RAG, FAISS, Similarity Search, Vector Search, Facebook AI, GPU Acceleration, Billion-Scale, K-NN, HNSW, High Performance, Large Scale]
-
---
-
-# FAISS - Efficient Similarity Search
-
-Facebook AI's library for billion-scale vector similarity search.
-
-## When to use FAISS
-
-**Use FAISS when:**
- Need fast similarity search on large vector datasets (millions/billions)
- GPU acceleration required
- Pure vector similarity (no metadata filtering needed)
- High throughput, low latency critical
- Offline/batch processing of embeddings
-
-**Metrics**:
- **31,700+ GitHub stars**
- Meta/Facebook AI Research
- **Handles billions of vectors**
- **C++** with Python bindings
-
-**Use alternatives instead**:
- **Chroma/Pinecone**: Need metadata filtering
- **Weaviate**: Need full database features
- **Annoy**: Simpler, fewer features
-
-## Quick start
-
-### Installation
-
-```bash
-# CPU only
-pip install faiss-cpu
-
-# GPU support
-pip install faiss-gpu
-```
-
-### Basic usage
-
-```python
-import faiss
-import numpy as np
-
-# Create sample data (1000 vectors, 128 dimensions)
-d = 128
-nb = 1000
-vectors = np.random.random((nb, d)).astype('float32')
-
-# Create index
-index = faiss.IndexFlatL2(d)  # L2 distance
-index.add(vectors)             # Add vectors
-
-# Search
-k = 5  # Find 5 nearest neighbors
-query = np.random.random((1, d)).astype('float32')
-distances, indices = index.search(query, k)
-
-print(f"Nearest neighbors: {indices}")
-print(f"Distances: {distances}")
-```
-
-## Index types
-
-### 1. Flat (exact search)
-
-```python
-# L2 (Euclidean) distance
-index = faiss.IndexFlatL2(d)
-
-# Inner product (cosine similarity if normalized)
-index = faiss.IndexFlatIP(d)
-
-# Slowest, most accurate
-```
-
-### 2. IVF (inverted file) - Fast approximate
-
-```python
-# Create quantizer
-quantizer = faiss.IndexFlatL2(d)
-
-# IVF index with 100 clusters
-nlist = 100
-index = faiss.IndexIVFFlat(quantizer, d, nlist)
-
-# Train on data
-index.train(vectors)
-
-# Add vectors
-index.add(vectors)
-
-# Search (nprobe = clusters to search)
-index.nprobe = 10
-distances, indices = index.search(query, k)
-```
-
-### 3. HNSW (Hierarchical NSW) - Best quality/speed
-
-```python
-# HNSW index
-M = 32  # Number of connections per layer
-index = faiss.IndexHNSWFlat(d, M)
-
-# No training needed
-index.add(vectors)
-
-# Search
-distances, indices = index.search(query, k)
-```
-
-### 4. Product Quantization - Memory efficient
-
-```python
-# PQ stores m bytes per vector instead of 4*d (64× smaller for d=128, m=8, nbits=8)
-m = 8   # Number of subquantizers
-nbits = 8
-index = faiss.IndexPQ(d, m, nbits)
-
-# Train and add
-index.train(vectors)
-index.add(vectors)
-```
-
-## Save and load
-
-```python
-# Save index
-faiss.write_index(index, "large.index")
-
-# Load index
-index = faiss.read_index("large.index")
-
-# Continue using
-distances, indices = index.search(query, k)
-```
-
-## GPU acceleration
-
-```python
-# Single GPU
-res = faiss.StandardGpuResources()
-index_cpu = faiss.IndexFlatL2(d)
-index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu)  # GPU 0
-
-# Multi-GPU
-index_gpu = faiss.index_cpu_to_all_gpus(index_cpu)
-
-# 10-100× faster than CPU
-```
-
-## LangChain integration
-
-```python
-from langchain_community.vectorstores import FAISS
-from langchain_openai import OpenAIEmbeddings
-
-# Create FAISS vector store
-vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
-
-# Save
-vectorstore.save_local("faiss_index")
-
-# Load
-vectorstore = FAISS.load_local(
-    "faiss_index",
-    OpenAIEmbeddings(),
-    allow_dangerous_deserialization=True
-)
-
-# Search
-results = vectorstore.similarity_search("query", k=5)
-```
-
-## LlamaIndex integration
-
-```python
-from llama_index.vector_stores.faiss import FaissVectorStore
-import faiss
-
-# Create FAISS index
-d = 1536
-faiss_index = faiss.IndexFlatL2(d)
-
-vector_store = FaissVectorStore(faiss_index=faiss_index)
-```
-
-## Best practices
-
-1. **Choose right index type** - Flat for <10K, IVF for 10K-1M, HNSW for quality
-2. **Normalize for cosine** - Use IndexFlatIP with normalized vectors
-3. **Use GPU for large datasets** - 10-100× faster
-4. **Save trained indices** - Training is expensive
-5. **Tune nprobe/efSearch** - Balance speed/accuracy
-6. **Monitor memory** - PQ for large datasets
-7. **Batch queries** - Better GPU utilization
-
-## Performance
-
-| Index Type | Build Time | Search Time | Memory | Accuracy |
-|------------|------------|-------------|--------|----------|
-| Flat | Fast | Slow | High | 100% |
-| IVF | Medium | Fast | Medium | 95-99% |
-| HNSW | Slow | Fastest | High | 99% |
-| PQ | Medium | Fast | Low | 90-95% |
-
-## Resources
-
- **GitHub**: https://github.com/facebookresearch/faiss ⭐ 31,700+
- **Wiki**: https://github.com/facebookresearch/faiss/wiki
- **License**: MIT
-
-
--- a/skills/mlops/faiss/references/index_types.md
+++ b/skills/mlops/faiss/references/index_types.md
@@ -1,280 +0,0 @@
-# FAISS Index Types Guide
-
-Complete guide to choosing and using FAISS index types.
-
-## Index selection guide
-
-| Dataset Size | Index Type | Training | Accuracy | Speed |
-|--------------|------------|----------|----------|-------|
-| < 10K | Flat | No | 100% | Slow |
-| 10K-1M | IVF | Yes | 95-99% | Fast |
-| 1M-10M | HNSW | No | 99% | Fastest |
-| > 10M | IVF+PQ | Yes | 90-95% | Fast, low memory |
-
-## Flat indices (exact search)
-
-### IndexFlatL2 - L2 (Euclidean) distance
-
-```python
-import faiss
-import numpy as np
-
-d = 128  # Dimension
-index = faiss.IndexFlatL2(d)
-
-# Add vectors
-vectors = np.random.random((1000, d)).astype('float32')
-index.add(vectors)
-
-# Search
-k = 5
-query = np.random.random((1, d)).astype('float32')
-distances, indices = index.search(query, k)
-```
-
-**Use when:**
- Dataset < 10,000 vectors
- Need 100% accuracy
- Serving as baseline
-
-### IndexFlatIP - Inner product (cosine similarity)
-
-```python
-# For cosine similarity, normalize vectors first
-import faiss
-
-d = 128
-index = faiss.IndexFlatIP(d)
-
-# Normalize vectors (required for cosine similarity)
-faiss.normalize_L2(vectors)
-index.add(vectors)
-
-# Search
-faiss.normalize_L2(query)
-distances, indices = index.search(query, k)
-```
-
-**Use when:**
- Need cosine similarity
- Recommendation systems
- Text embeddings
-
-## IVF indices (inverted file)
-
-### IndexIVFFlat - Cluster-based search
-
-```python
-# Create quantizer
-quantizer = faiss.IndexFlatL2(d)
-
-# Create IVF index with 100 clusters
-nlist = 100  # Number of clusters
-index = faiss.IndexIVFFlat(quantizer, d, nlist)
-
-# Train on data (required!)
-index.train(vectors)
-
-# Add vectors
-index.add(vectors)
-
-# Search (nprobe = clusters to search)
-index.nprobe = 10  # Search 10 closest clusters
-distances, indices = index.search(query, k)
-```
-
-**Parameters:**
- `nlist`: Number of clusters (√N to 4√N recommended)
- `nprobe`: Clusters to search (1-nlist, higher = more accurate)
-
-**Use when:**
- Dataset 10K-1M vectors
- Need fast approximate search
- Can afford training time
-
-### Tuning nprobe
-
-```python
-# Test different nprobe values
-for nprobe in [1, 5, 10, 20, 50]:
-    index.nprobe = nprobe
-    distances, indices = index.search(query, k)
-    # Measure recall/speed trade-off
-```
-
-**Guidelines:**
- `nprobe=1`: Fastest, ~50% recall
- `nprobe=10`: Good balance, ~95% recall
- `nprobe=nlist`: Exact search (same as Flat)
-
-## HNSW indices (graph-based)
-
-### IndexHNSWFlat - Hierarchical NSW
-
-```python
-# HNSW index
-M = 32  # Number of connections per layer (16-64)
-index = faiss.IndexHNSWFlat(d, M)
-
-# Optional: Set ef_construction (build time parameter)
-index.hnsw.efConstruction = 40  # Higher = better quality, slower build
-
-# Add vectors (no training needed!)
-index.add(vectors)
-
-# Search
-index.hnsw.efSearch = 16  # Search time parameter
-distances, indices = index.search(query, k)
-```
-
-**Parameters:**
- `M`: Connections per layer (16-64, default 32)
- `efConstruction`: Build quality (40-200, higher = better)
- `efSearch`: Search quality (16-512, higher = more accurate)
-
-**Use when:**
- Need best quality approximate search
- Can afford higher memory (more connections)
- Dataset 1M-10M vectors
-
-## PQ indices (product quantization)
-
-### IndexPQ - Memory-efficient
-
-```python
-# PQ stores m bytes per vector instead of 4*d (64× smaller for d=128, m=8, nbits=8)
-m = 8   # Number of subquantizers (divides d)
-nbits = 8  # Bits per subquantizer
-
-index = faiss.IndexPQ(d, m, nbits)
-
-# Train (required!)
-index.train(vectors)
-
-# Add vectors
-index.add(vectors)
-
-# Search
-distances, indices = index.search(query, k)
-```
-
-**Parameters:**
- `m`: Subquantizers (d must be divisible by m)
- `nbits`: Bits per code (8 or 16)
-
-**Memory savings:**
- Original: d × 4 bytes (float32)
- PQ: m bytes per vector (with nbits=8)
- Compression ratio: 4d/m (64× for d=128, m=8)
-
-**Use when:**
- Limited memory
- Large datasets (> 10M vectors)
- Can accept ~90-95% accuracy
-
-### IndexIVFPQ - IVF + PQ combined
-
-```python
-# Best for very large datasets
-nlist = 4096
-m = 8
-nbits = 8
-
-quantizer = faiss.IndexFlatL2(d)
-index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
-
-# Train
-index.train(vectors)
-index.add(vectors)
-
-# Search
-index.nprobe = 32
-distances, indices = index.search(query, k)
-```
-
-**Use when:**
- Dataset > 10M vectors
- Need fast search + low memory
- Can accept 90-95% accuracy
-
-## GPU indices
-
-### Single GPU
-
-```python
-import faiss
-
-# Create CPU index
-index_cpu = faiss.IndexFlatL2(d)
-
-# Move to GPU
-res = faiss.StandardGpuResources()  # GPU resources
-index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu)  # GPU 0
-
-# Use normally
-index_gpu.add(vectors)
-distances, indices = index_gpu.search(query, k)
-```
-
-### Multi-GPU
-
-```python
-# Use all available GPUs
-index_gpu = faiss.index_cpu_to_all_gpus(index_cpu)
-
-# Or specific GPUs
-gpus = [0, 1, 2, 3]  # Use GPUs 0-3
-index_gpu = faiss.index_cpu_to_gpus_list(index_cpu, gpus)
-```
-
-**Speedup:**
- Single GPU: 10-50× faster than CPU
- Multi-GPU: Near-linear scaling
-
-## Index factory
-
-```python
-# Easy index creation with string descriptors
-index = faiss.index_factory(d, "IVF100,Flat")
-index = faiss.index_factory(d, "HNSW32")
-index = faiss.index_factory(d, "IVF4096,PQ8")
-
-# Train and use
-index.train(vectors)
-index.add(vectors)
-```
-
-**Common descriptors:**
- `"Flat"`: Exact search
- `"IVF100,Flat"`: IVF with 100 clusters
- `"HNSW32"`: HNSW with M=32
- `"IVF4096,PQ8"`: IVF + PQ compression
-
-## Performance comparison
-
-### Search speed (1M vectors, k=10)
-
-| Index | Build Time | Search Time | Memory | Recall |
-|-------|------------|-------------|--------|--------|
-| Flat | 0s | 50ms | 512 MB | 100% |
-| IVF100 | 5s | 2ms | 512 MB | 95% |
-| HNSW32 | 60s | 1ms | 1GB | 99% |
-| IVF4096+PQ8 | 30s | 3ms | 32 MB | 90% |
-
-*CPU (16 cores), 128-dim vectors*
-
-## Best practices
-
-1. **Start with Flat** - Baseline for comparison
-2. **Use IVF for medium datasets** - Good balance
-3. **Use HNSW for best quality** - If memory allows
-4. **Add PQ for memory savings** - Large datasets
-5. **GPU for > 100K vectors** - 10-50× speedup
-6. **Tune nprobe/efSearch** - Trade-off speed/accuracy
-7. **Train on representative data** - Better clustering
-8. **Save trained indices** - Avoid retraining
-
-## Resources
-
- **Wiki**: https://github.com/facebookresearch/faiss/wiki
- **Paper**: https://arxiv.org/abs/1702.08734
--- a/skills/mlops/flash-attention/SKILL.md
+++ b/skills/mlops/flash-attention/SKILL.md
@@ -1,370 +0,0 @@
---
-name: optimizing-attention-flash
-description: Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster inference. Supports PyTorch native SDPA, flash-attn library, H100 FP8, and sliding window attention.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [flash-attn, torch, transformers]
-metadata:
-  hermes:
-    tags: [Optimization, Flash Attention, Attention Optimization, Memory Efficiency, Speed Optimization, Long Context, PyTorch, SDPA, H100, FP8, Transformers]
-
---
-
-# Flash Attention - Fast Memory-Efficient Attention
-
-## Quick start
-
-Flash Attention provides 2-4x speedup and 10-20x memory reduction for transformer attention through IO-aware tiling and recomputation.
-
-**PyTorch native (easiest, PyTorch 2.2+)**:
-```python
-import torch
-import torch.nn.functional as F
-
-q = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16)  # [batch, heads, seq, dim]
-k = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16)
-v = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16)
-
-# Automatically uses Flash Attention if available
-out = F.scaled_dot_product_attention(q, k, v)
-```
-
-**flash-attn library (more features)**:
-```bash
-pip install flash-attn --no-build-isolation
-```
-
-```python
-from flash_attn import flash_attn_func
-
-# q, k, v: [batch, seqlen, nheads, headdim]
-out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
-```
-
-## Common workflows
-
-### Workflow 1: Enable in existing PyTorch model
-
-Copy this checklist:
-
-```
-Flash Attention Integration:
- [ ] Step 1: Check PyTorch version (≥2.2)
- [ ] Step 2: Enable Flash Attention backend
- [ ] Step 3: Verify speedup with profiling
- [ ] Step 4: Test accuracy matches baseline
-```
-
-**Step 1: Check PyTorch version**
-
-```bash
-python -c "import torch; print(torch.__version__)"
-# Should be ≥2.2.0
-```
-
-If <2.2, upgrade:
-```bash
-pip install --upgrade torch
-```
-
-**Step 2: Enable Flash Attention backend**
-
-Replace standard attention:
-```python
-# Before (standard attention)
-attn_weights = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
-out = attn_weights @ v
-
-# After (Flash Attention)
-import torch.nn.functional as F
-out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
-```
-
-Force Flash Attention backend:
-```python
-with torch.backends.cuda.sdp_kernel(
-    enable_flash=True,
-    enable_math=False,
-    enable_mem_efficient=False
-):
-    out = F.scaled_dot_product_attention(q, k, v)
-```
-
-**Step 3: Verify speedup with profiling**
-
-```python
-import torch.utils.benchmark as benchmark
-
-def test_attention(use_flash):
-    q, k, v = [torch.randn(2, 8, 2048, 64, device='cuda', dtype=torch.float16) for _ in range(3)]
-
-    if use_flash:
-        with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
-            return F.scaled_dot_product_attention(q, k, v)
-    else:
-        attn = (q @ k.transpose(-2, -1) / 8.0).softmax(dim=-1)
-        return attn @ v
-
-# Benchmark
-t_flash = benchmark.Timer(stmt='test_attention(True)', globals=globals())
-t_standard = benchmark.Timer(stmt='test_attention(False)', globals=globals())
-
-print(f"Flash: {t_flash.timeit(100).mean:.3f}s")
-print(f"Standard: {t_standard.timeit(100).mean:.3f}s")
-```
-
-Expected: 2-4x speedup for sequences >512 tokens.
-
-**Step 4: Test accuracy matches baseline**
-
-```python
-# Compare outputs
-q, k, v = [torch.randn(1, 8, 512, 64, device='cuda', dtype=torch.float16) for _ in range(3)]
-
-# Flash Attention
-out_flash = F.scaled_dot_product_attention(q, k, v)
-
-# Standard attention
-attn_weights = torch.softmax(q @ k.transpose(-2, -1) / 8.0, dim=-1)
-out_standard = attn_weights @ v
-
-# Check difference
-diff = (out_flash - out_standard).abs().max()
-print(f"Max difference: {diff:.6f}")
-# Should be <1e-3 for float16
-```
-
-### Workflow 2: Use flash-attn library for advanced features
-
-For multi-query attention, sliding window, or H100 FP8.
-
-Copy this checklist:
-
-```
-flash-attn Library Setup:
- [ ] Step 1: Install flash-attn library
- [ ] Step 2: Modify attention code
- [ ] Step 3: Enable advanced features
- [ ] Step 4: Benchmark performance
-```
-
-**Step 1: Install flash-attn library**
-
-```bash
-# NVIDIA GPUs (CUDA 12.0+)
-pip install flash-attn --no-build-isolation
-
-# Verify installation
-python -c "from flash_attn import flash_attn_func; print('Success')"
-```
-
-**Step 2: Modify attention code**
-
-```python
-from flash_attn import flash_attn_func
-
-# Input: [batch_size, seq_len, num_heads, head_dim]
-# Transpose from [batch, heads, seq, dim] if needed
-q = q.transpose(1, 2)  # [batch, seq, heads, dim]
-k = k.transpose(1, 2)
-v = v.transpose(1, 2)
-
-out = flash_attn_func(
-    q, k, v,
-    dropout_p=0.1,
-    causal=True,  # For autoregressive models
-    window_size=(-1, -1),  # No sliding window
-    softmax_scale=None  # Auto-scale
-)
-
-out = out.transpose(1, 2)  # Back to [batch, heads, seq, dim]
-```
-
-**Step 3: Enable advanced features**
-
-Multi-query attention (shared K/V across heads):
-```python
-from flash_attn import flash_attn_func
-
-# q: [batch, seq, num_q_heads, dim]
-# k, v: [batch, seq, num_kv_heads, dim]  # Fewer KV heads
-out = flash_attn_func(q, k, v)  # Automatically handles MQA
-```
-
-Sliding window attention (local attention):
-```python
-# With causal=True, each token attends to itself and at most the 256 previous tokens
-out = flash_attn_func(
-    q, k, v,
-    window_size=(256, 256),  # (left, right) window; right is ignored when causal=True
-    causal=True
-)
-```
-
-**Step 4: Benchmark performance**
-
-```python
-import torch
-from flash_attn import flash_attn_func
-import time
-
-q, k, v = [torch.randn(4, 4096, 32, 64, device='cuda', dtype=torch.float16) for _ in range(3)]
-
-# Warmup
-for _ in range(10):
-    _ = flash_attn_func(q, k, v)
-
-# Benchmark
-torch.cuda.synchronize()
-start = time.time()
-for _ in range(100):
-    out = flash_attn_func(q, k, v)
-    torch.cuda.synchronize()
-end = time.time()
-
-print(f"Time per iteration: {(end-start)/100*1000:.2f}ms")
-print(f"Memory allocated: {torch.cuda.max_memory_allocated()/1e9:.2f}GB")
-```
-
-### Workflow 3: H100 FP8 optimization (FlashAttention-3)
-
-For maximum performance on H100 GPUs.
-
-```
-FP8 Setup:
- [ ] Step 1: Verify H100 GPU available
- [ ] Step 2: Install flash-attn with FP8 support
- [ ] Step 3: Convert inputs to FP8
- [ ] Step 4: Run with FP8 attention
-```
-
-**Step 1: Verify H100 GPU**
-
-```bash
-nvidia-smi --query-gpu=name --format=csv
-# Should show "H100" or "H800"
-```
-
-**Step 2: Install flash-attn with FP8 support**
-
-```bash
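-# FlashAttention-3 (FP8 kernels) is built from the hopper/ directory of the repo, not the flash-attn PyPI wheel
-git clone https://github.com/Dao-AILab/flash-attention && cd flash-attention/hopper && python setup.py install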
-```
-
-**Step 3: Convert inputs to FP8**
-
-```python
-import torch
-
-q = torch.randn(2, 4096, 32, 64, device='cuda', dtype=torch.float16)
-k = torch.randn(2, 4096, 32, 64, device='cuda', dtype=torch.float16)
-v = torch.randn(2, 4096, 32, 64, device='cuda', dtype=torch.float16)
-
-# Convert to float8_e4m3 (FP8)
-q_fp8 = q.to(torch.float8_e4m3fn)
-k_fp8 = k.to(torch.float8_e4m3fn)
-v_fp8 = v.to(torch.float8_e4m3fn)
-```
-
-**Step 4: Run with FP8 attention**
-
-```python
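-from flash_attn_interface import flash_attn_func  # module provided by the FlashAttention-3 build in the repo's hopper/ directory
-
-# FP8 inputs select the FP8 kernels on H100 (some FlashAttention-3 versions return (out, lse) instead of out)
-out = flash_attn_func(q_fp8, k_fp8, v_fp8)
-# The FlashAttention-3 authors report close to 1.2 PFLOPS with FP8 on H100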
-```
-
-## When to use vs alternatives
-
-**Use Flash Attention when:**
- Training transformers with sequences >512 tokens
- Running inference with long context (>2K tokens)
- GPU memory constrained (OOM with standard attention)
- Need 2-4x speedup without accuracy loss
- Using PyTorch 2.2+ or can install flash-attn
-
-**Use alternatives instead:**
- **Standard attention**: Sequences <256 tokens (overhead not worth it)
- **xFormers**: Need more attention variants (not just speed)
- **Memory-efficient attention**: CPU inference (Flash Attention needs GPU)
-
-## Common issues
-
-**Issue: ImportError: cannot import flash_attn**
-
-Install with no-build-isolation flag:
-```bash
-pip install flash-attn --no-build-isolation
-```
-
-Or install CUDA toolkit first:
-```bash
-conda install cuda -c nvidia
-pip install flash-attn --no-build-isolation
-```
-
-**Issue: Slower than expected (no speedup)**
-
-Flash Attention benefits increase with sequence length:
- <512 tokens: Minimal speedup (10-20%)
- 512-2K tokens: 2-3x speedup
- >2K tokens: 3-4x speedup
-
-Check sequence length is sufficient.
-
-**Issue: RuntimeError: CUDA error**
-
-Verify GPU supports Flash Attention:
-```python
-import torch
-print(torch.cuda.get_device_capability())
-# FlashAttention-2 needs ≥(8, 0) (Ampere or newer); (7, 5) Turing is covered only by FlashAttention 1.x
-```
-
-Flash Attention requires:
- Ampere (A100, A10): ✅ Full support
- Turing (T4): ⚠️ FlashAttention 1.x only (FlashAttention-2 requires Ampere or newer)
- Volta (V100): ❌ Not supported
-
-**Issue: Accuracy degradation**
-
-Check dtype is float16 or bfloat16 (not float32):
-```python
-q = q.to(torch.float16)  # Or torch.bfloat16
-```
-
-Flash Attention uses float16/bfloat16 for speed. Float32 not supported.
-
-## Advanced topics
-
-**Integration with HuggingFace Transformers**: See [references/transformers-integration.md](references/transformers-integration.md) for enabling Flash Attention in BERT, GPT, Llama models.
-
-**Performance benchmarks**: See [references/benchmarks.md](references/benchmarks.md) for detailed speed and memory comparisons across GPUs and sequence lengths.
-
-**Algorithm details**: See [references/algorithm.md](references/algorithm.md) for tiling strategy, recomputation, and IO complexity analysis.
-
-**Advanced features**: See [references/advanced-features.md](references/advanced-features.md) for rotary embeddings, ALiBi, paged KV cache, and custom attention masks.
-
-## Hardware requirements
-
- **GPU**: NVIDIA Ampere+ (A100, A10, A30) or AMD MI200+
- **VRAM**: Same as standard attention (Flash Attention doesn't increase memory)
- **CUDA**: 12.0+ (11.8 minimum)
- **PyTorch**: 2.2+ for native support
-
-**Not supported**: V100 (Volta), CPU inference
-
-## Resources
-
- Paper: "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022)
- Paper: "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (ICLR 2024)
- Blog: https://tridao.me/blog/2024/flash3/
- GitHub: https://github.com/Dao-AILab/flash-attention
- PyTorch docs: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
-
-
-
--- a/skills/mlops/flash-attention/references/benchmarks.md
+++ b/skills/mlops/flash-attention/references/benchmarks.md
@@ -1,215 +0,0 @@
-# Performance Benchmarks
-
-## Contents
- Speed comparisons across GPUs
- Memory usage analysis
- Scaling with sequence length
- Training vs inference performance
- Flash Attention versions comparison
-
-## Speed comparisons across GPUs
-
-### A100 80GB (Ampere)
-
-**Forward pass time** (milliseconds, batch=8, heads=32, dim=64):
-
-| Seq Length | Standard | Flash Attn 2 | Flash Attn 3 | Speedup (FA2) |
-|------------|----------|--------------|--------------|---------------|
-| 512 | 1.2 | 0.9 | N/A | 1.3x |
-| 1024 | 3.8 | 1.4 | N/A | 2.7x |
-| 2048 | 14.2 | 4.8 | N/A | 3.0x |
-| 4096 | 55.1 | 17.3 | N/A | 3.2x |
-| 8192 | 218.5 | 66.2 | N/A | 3.3x |
-
-### H100 80GB (Hopper)
-
-**Forward pass time** (milliseconds, same config):
-
-| Seq Length | Standard | Flash Attn 2 | Flash Attn 3 (FP16) | Flash Attn 3 (FP8) | Best Speedup |
-|------------|----------|--------------|---------------------|--------------------|--------------|
-| 512 | 0.8 | 0.6 | 0.4 | 0.3 | 2.7x |
-| 1024 | 2.6 | 1.0 | 0.6 | 0.4 | 6.5x |
-| 2048 | 9.8 | 3.4 | 2.0 | 1.3 | 7.5x |
-| 4096 | 38.2 | 12.5 | 7.2 | 4.8 | 8.0x |
-| 8192 | 151.4 | 47.8 | 27.1 | 18.2 | 8.3x |
-
-**Key insight**: Flash Attention 3 on H100 reaches up to ~740 TFLOPS in FP16 (about 75% of theoretical max) and close to 1.2 PFLOPS in FP8.
-
-### A10G 24GB (Ampere)
-
-**Forward pass time** (milliseconds, batch=4):
-
-| Seq Length | Standard | Flash Attn 2 | Speedup |
-|------------|----------|--------------|---------|
-| 512 | 2.1 | 1.6 | 1.3x |
-| 1024 | 6.8 | 2.8 | 2.4x |
-| 2048 | 25.9 | 9.4 | 2.8x |
-| 4096 | 102.1 | 35.2 | 2.9x |
-
-## Memory usage analysis
-
-### GPU memory consumption (batch=8, heads=32, dim=64)
-
-**Standard attention memory**:
-
-| Seq Length | Attention Matrix | KV Cache | Total | Notes |
-|------------|------------------|----------|-------|-------|
-| 512 | 8 MB | 32 MB | 40 MB | Manageable |
-| 2048 | 128 MB | 128 MB | 256 MB | Growing |
-| 8192 | 2048 MB (2 GB) | 512 MB | 2.5 GB | Large |
-| 32768 | 32768 MB (32 GB) | 2048 MB | 34 GB | OOM on 24GB GPUs |
-
-**Flash Attention 2 memory**:
-
-| Seq Length | Attention (on-chip) | KV Cache | Total | Reduction |
-|------------|---------------------|----------|-------|-----------|
-| 512 | 0 MB (recomputed) | 32 MB | 32 MB | 20% |
-| 2048 | 0 MB | 128 MB | 128 MB | 50% |
-| 8192 | 0 MB | 512 MB | 512 MB | 80% |
-| 32768 | 0 MB | 2048 MB | 2 GB | 94% |
-
-**Key insight**: Flash Attention never materializes the full attention matrix in GPU memory, saving O(N²) memory.
-
-### Memory scaling comparison
-
-**Llama 2 7B model memory** (float16, batch=1):
-
-| Context Length | Standard Attention | Flash Attention 2 | Can Fit 24GB GPU? |
-|----------------|-------------------|-------------------|-------------------|
-| 2K | 3.2 GB | 2.1 GB | Both: Yes |
-| 4K | 5.8 GB | 2.8 GB | Both: Yes |
-| 8K | 12.1 GB | 4.2 GB | Both: Yes |
-| 16K | 26.3 GB (OOM) | 7.8 GB | Only Flash: Yes |
-| 32K | OOM | 14.2 GB | Only Flash: Yes |
-
-### Training memory (Llama 2 7B, batch=4)
-
-| Context | Standard (GB) | Flash Attn (GB) | Reduction |
-|---------|---------------|-----------------|-----------|
-| 2K | 18.2 | 12.4 | 32% |
-| 4K | 34.8 | 16.8 | 52% |
-| 8K | OOM (>40GB) | 26.2 | Fits! |
-
-## Scaling with sequence length
-
-### Computational complexity
-
-**Standard attention**:
- Time: O(N² × d)
- Memory: O(N² + N × d)
-
-**Flash Attention**:
- Time: O(N² × d) (same FLOPs, but far fewer reads/writes to GPU high-bandwidth memory)
- Memory: O(N × d) (linear!)
-
-### Empirical scaling (A100, batch=1, heads=32, dim=64)
-
-**Time per token (milliseconds)**:
-
-| Sequence | 512 | 1K | 2K | 4K | 8K | 16K |
-|----------|-----|-----|-----|-----|-----|------|
-| Standard | 0.15 | 0.37 | 1.11 | 3.44 | 13.4 | 52.8 |
-| Flash Attn 2 | 0.11 | 0.14 | 0.24 | 0.43 | 0.83 | 1.64 |
-| Speedup | 1.4x | 2.6x | 4.6x | 8.0x | 16.1x | 32.2x |
-
-**Observation**: Speedup grows roughly linearly with sequence length: beyond 4K it doubles each time the sequence length doubles.
-
-### Memory per token (MB)
-
-| Sequence | 512 | 1K | 2K | 4K | 8K | 16K |
-|----------|-----|-----|-----|-----|-----|------|
-| Standard | 0.08 | 0.13 | 0.25 | 0.64 | 2.05 | 8.13 |
-| Flash Attn 2 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
-
-**Observation**: Flash Attention memory per token is constant!
-
-## Training vs inference performance
-
-### Training (forward + backward, Llama 2 7B, A100)
-
-| Batch × Seq | Standard (samples/sec) | Flash Attn (samples/sec) | Speedup |
-|-------------|------------------------|--------------------------|---------|
-| 4 × 2K | 1.2 | 3.1 | 2.6x |
-| 8 × 2K | 2.1 | 5.8 | 2.8x |
-| 4 × 4K | 0.4 | 1.3 | 3.3x |
-| 8 × 4K | OOM | 2.4 | Enabled |
-| 2 × 8K | 0.1 | 0.4 | 4.0x |
-
-### Inference (generation, Llama 2 7B, A100)
-
-| Context Length | Standard (tokens/sec) | Flash Attn (tokens/sec) | Speedup |
-|----------------|----------------------|-------------------------|---------|
-| 512 | 48 | 52 | 1.1x |
-| 2K | 42 | 62 | 1.5x |
-| 4K | 31 | 58 | 1.9x |
-| 8K | 18 | 51 | 2.8x |
-| 16K | OOM | 42 | Enabled |
-
-**Note**: The inference speedup is less dramatic than the training speedup because generation is memory-bound (dominated by KV cache accesses).
-
-## Flash Attention versions comparison
-
-### Flash Attention 1 vs 2 vs 3 (H100, seq=4096, batch=8)
-
-| Metric | FA1 | FA2 | FA3 (FP16) | FA3 (FP8) |
-|--------|-----|-----|------------|-----------|
-| Forward time (ms) | 28.4 | 12.5 | 7.2 | 4.8 |
-| Memory (GB) | 4.8 | 4.2 | 4.2 | 2.8 |
-| TFLOPS | 180 | 420 | 740 | 1150 |
-| GPU util % | 35% | 55% | 75% | 82% |
-
-**Key improvements**:
- FA2: 2.3x faster than FA1 (better parallelism)
- FA3 (FP16): 1.7x faster than FA2 (H100 async optimizations)
- FA3 (FP8): 2.6x faster than FA2 (low precision)
-
-### Features by version
-
-| Feature | FA1 | FA2 | FA3 |
-|---------|-----|-----|-----|
-| Basic attention | ✅ | ✅ | ✅ |
-| Causal masking | ✅ | ✅ | ✅ |
-| Multi-query attention | ❌ | ✅ | ✅ |
-| Sliding window | ❌ | ✅ | ✅ |
-| Paged KV cache | ❌ | ✅ | ✅ |
-| FP8 support | ❌ | ❌ | ✅ (H100 only) |
-| Work partitioning | Basic | Advanced | Optimal |
-
-## Real-world model benchmarks
-
-### Llama 2 models (A100 80GB, batch=4, seq=2048)
-
-| Model | Params | Standard (samples/sec) | Flash Attn (samples/sec) | Speedup |
-|-------|--------|------------------------|--------------------------|---------|
-| Llama 2 7B | 7B | 1.2 | 3.1 | 2.6x |
-| Llama 2 13B | 13B | 0.6 | 1.7 | 2.8x |
-| Llama 2 70B | 70B | 0.12 | 0.34 | 2.8x |
-
-### GPT-style models (seq=1024)
-
-| Model | Standard (tokens/sec) | Flash Attn (tokens/sec) | Speedup |
-|-------|----------------------|-------------------------|---------|
-| GPT-2 (124M) | 520 | 680 | 1.3x |
-| GPT-J (6B) | 42 | 98 | 2.3x |
-| GPT-NeoX (20B) | 8 | 22 | 2.75x |
-
-## Recommendations by use case
-
-**Training large models (>7B parameters)**:
- Use Flash Attention 2 on A100
- Use Flash Attention 3 FP8 on H100 for maximum speed
- Expected: 2.5-3x speedup
-
-**Long context inference (>4K tokens)**:
- Flash Attention essential (enables contexts standard attention can't handle)
- Expected: 2-4x speedup, 5-10x memory reduction
-
-**Short sequences (<512 tokens)**:
- Flash Attention provides 1.2-1.5x speedup
- Minimal memory benefit
- Still worth enabling (no downside)
-
-**Multi-user serving**:
- Flash Attention reduces per-request memory
- Allows higher concurrent batch sizes
- Can serve 2-3x more users on same hardware
--- a/skills/mlops/flash-attention/references/transformers-integration.md
+++ b/skills/mlops/flash-attention/references/transformers-integration.md
@@ -1,293 +0,0 @@
-# HuggingFace Transformers Integration
-
-## Contents
- Enabling Flash Attention in Transformers
- Supported model architectures
- Configuration examples
- Performance comparisons
- Troubleshooting model-specific issues
-
-## Enabling Flash Attention in Transformers
-
-HuggingFace Transformers (v4.36+) supports Flash Attention 2 natively.
-
-**Simple enable for any supported model**:
-```python
-import torch
-from transformers import AutoModel
-
-model = AutoModel.from_pretrained(
-    "meta-llama/Llama-2-7b-hf",
-    attn_implementation="flash_attention_2",
-    torch_dtype=torch.float16,
-    device_map="auto"
-)
-```
-
-**Install requirements**:
-```bash
-pip install "transformers>=4.36"  # Quote the spec so the shell doesn't treat >= as a redirect
-pip install flash-attn --no-build-isolation
-```
-
-## Supported model architectures
-
-As of Transformers 4.40:
-
-**Fully supported**:
- Llama / Llama 2 / Llama 3
- Mistral / Mixtral
- Falcon
- GPT-NeoX
- Phi / Phi-2 / Phi-3
- Qwen / Qwen2
- Gemma
- Starcoder2
- GPT-J
- OPT
- BLOOM
-
-**Partially supported** (encoder-decoder):
- BART
- T5 / Flan-T5
- Whisper
-
-**Check support**:
-```python
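-import torch
-from transformers import AutoModelForCausalLM
-
-# Loading raises ValueError if the architecture lacks Flash Attention 2 support
-model = AutoModelForCausalLM.from_pretrained(
-    "model-name", attn_implementation="flash_attention_2", torch_dtype=torch.float16
-)
-print(model.config._attn_implementation)  # 'flash_attention_2' when enabled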
-```
-
-## Configuration examples
-
-### Llama 2 with Flash Attention
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-
-model_id = "meta-llama/Llama-2-7b-hf"
-
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    attn_implementation="flash_attention_2",
-    torch_dtype=torch.float16,
-    device_map="auto"
-)
-
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-
-# Generate
-inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
-outputs = model.generate(**inputs, max_length=100)
-print(tokenizer.decode(outputs[0]))
-```
-
-### Mistral with Flash Attention for long context
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-
-model_id = "mistralai/Mistral-7B-v0.1"
-
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    attn_implementation="flash_attention_2",
-    torch_dtype=torch.bfloat16,  # Better for long context
-    device_map="auto",
-    max_position_embeddings=32768  # Extended context
-)
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-
-# Process a long document (placeholder text; substitute a real document of up to ~32K tokens)
-long_text = "..." * 10000
-inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to("cuda")
-outputs = model.generate(**inputs, max_new_tokens=512)
-```
-
-### Fine-tuning with Flash Attention
-
-```python
-import torch
-from transformers import Trainer, TrainingArguments
-from transformers import AutoModelForCausalLM
-
-model = AutoModelForCausalLM.from_pretrained(
-    "meta-llama/Llama-2-7b-hf",
-    attn_implementation="flash_attention_2",
-    torch_dtype=torch.bfloat16  # fp16 weights combined with fp16=True fail in the gradient scaler
-)
-
-training_args = TrainingArguments(
-    output_dir="./results",
-    per_device_train_batch_size=4,
-    gradient_accumulation_steps=4,
-    num_train_epochs=3,
-    bf16=True,  # Must match model dtype
-    optim="adamw_torch_fused"  # Fast optimizer
-)
-
-trainer = Trainer(
-    model=model,
-    args=training_args,
-    train_dataset=train_dataset  # Your tokenized dataset, defined elsewhere
-)
-
-trainer.train()
-```
-
-### Multi-GPU training
-
-```python
-from transformers import AutoModelForCausalLM
-import torch
-
-# Model parallelism with Flash Attention
-model = AutoModelForCausalLM.from_pretrained(
-    "meta-llama/Llama-2-13b-hf",
-    attn_implementation="flash_attention_2",
-    torch_dtype=torch.float16,
-    device_map="auto",  # Automatic multi-GPU placement
-    max_memory={0: "20GB", 1: "20GB"}  # Limit per GPU
-)
-```
-
-## Performance comparisons
-
-### Memory usage (Llama 2 7B, batch=1)
-
-| Sequence Length | Standard Attention | Flash Attention 2 | Reduction |
-|-----------------|-------------------|-------------------|-----------|
-| 512 | 1.2 GB | 0.9 GB | 25% |
-| 2048 | 3.8 GB | 1.4 GB | 63% |
-| 8192 | 14.2 GB | 3.2 GB | 77% |
-| 32768 | OOM (>24GB) | 10.8 GB | Fits! |
-
-### Speed (tokens/sec, A100 80GB)
-
-| Model | Standard | Flash Attn 2 | Speedup |
-|-------|----------|--------------|---------|
-| Llama 2 7B (seq=2048) | 42 | 118 | 2.8x |
-| Llama 2 13B (seq=4096) | 18 | 52 | 2.9x |
-| Llama 2 70B (seq=2048) | 4 | 11 | 2.75x |
-
-### Training throughput (samples/sec)
-
-| Model | Batch Size | Standard | Flash Attn 2 | Speedup |
-|-------|------------|----------|--------------|---------|
-| Llama 2 7B | 4 | 1.2 | 3.1 | 2.6x |
-| Llama 2 7B | 8 | 2.1 | 5.8 | 2.8x |
-| Llama 2 13B | 2 | 0.6 | 1.7 | 2.8x |
-
-## Troubleshooting model-specific issues
-
-### Issue: Model doesn't support Flash Attention
-
-Check the support list above. If the model is not supported, fall back to PyTorch SDPA:
-
-```python
-model = AutoModelForCausalLM.from_pretrained(
-    "model-name",
-    attn_implementation="sdpa",  # PyTorch native (still faster than eager)
-    torch_dtype=torch.float16
-)
-```
-
-### Issue: CUDA out of memory during loading
-
-Reduce memory footprint:
-
-```python
-model = AutoModelForCausalLM.from_pretrained(
-    "model-name",
-    attn_implementation="flash_attention_2",
-    torch_dtype=torch.float16,
-    device_map="auto",
-    max_memory={0: "18GB"},  # Reserve memory for KV cache
-    low_cpu_mem_usage=True
-)
-```
-
-### Issue: Slower inference than expected
-
-Ensure dtype matches:
-
-```python
-# Model weights must be float16/bfloat16; token IDs stay integer, so only float inputs are cast
-model = model.to(torch.float16)
-inputs = tokenizer(..., return_tensors="pt").to("cuda")
-inputs = {k: v.to(torch.float16) if v.dtype == torch.float32 else v
-          for k, v in inputs.items()}
-```
-
-### Issue: Different outputs vs standard attention
-
-Flash Attention is mathematically equivalent to standard attention but uses a different computation order, so small floating-point differences (<1e-3) are normal:
-
-```python
-# Compare outputs (both models on GPU; Flash Attention has no CPU path)
-model_standard = AutoModelForCausalLM.from_pretrained(
-    "model-name", attn_implementation="eager", torch_dtype=torch.float16
-).to("cuda")
-model_flash = AutoModelForCausalLM.from_pretrained(
-    "model-name",
-    attn_implementation="flash_attention_2",
-    torch_dtype=torch.float16
-).to("cuda")
-
-inputs = tokenizer("Test", return_tensors="pt").to("cuda")
-
-with torch.no_grad():
-    out_standard = model_standard(**inputs).logits
-    out_flash = model_flash(**inputs).logits
-
-diff = (out_standard - out_flash).abs().max()
-print(f"Max diff: {diff:.6f}")  # Should be ~1e-3 to 1e-4
-```
-
-### Issue: ImportError during model loading
-
-Install flash-attn:
-```bash
-pip install flash-attn --no-build-isolation
-```
-
-Or disable Flash Attention:
-```python
-model = AutoModelForCausalLM.from_pretrained(
-    "model-name",
-    attn_implementation="eager",  # Standard PyTorch
-    torch_dtype=torch.float16
-)
-```
-
-## Best practices
-
-1. **Always use float16/bfloat16** with Flash Attention (not float32)
-2. **Set device_map="auto"** for automatic memory management
-3. **Use bfloat16 for long context** (better numerical stability)
-4. **Enable gradient checkpointing** for training large models
-5. **Monitor memory** with `torch.cuda.max_memory_allocated()`
-
-**Example with all best practices**:
-```python
-import torch
-from transformers import AutoModelForCausalLM, TrainingArguments
-
-model = AutoModelForCausalLM.from_pretrained(
-    "meta-llama/Llama-2-7b-hf",
-    attn_implementation="flash_attention_2",
-    torch_dtype=torch.bfloat16,  # Better for training
-    device_map="auto",
-    low_cpu_mem_usage=True
-)
-
-# Enable gradient checkpointing for memory
-model.gradient_checkpointing_enable()
-
-# Training with optimizations
-training_args = TrainingArguments(
-    output_dir="./results",
-    per_device_train_batch_size=8,
-    gradient_accumulation_steps=2,
-    bf16=True,  # Match model dtype
-    optim="adamw_torch_fused",
-    gradient_checkpointing=True
-)
-```
--- a/skills/mlops/gguf/SKILL.md
+++ b/skills/mlops/gguf/SKILL.md
@@ -1,430 +0,0 @@
---
-name: gguf-quantization
-description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [llama-cpp-python>=0.2.0]
-metadata:
-  hermes:
-    tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
-
---
-
-# GGUF - Quantization Format for llama.cpp
-
-GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
-
-## When to use GGUF
-
-**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Need CPU inference without GPU requirements
- Want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
-
-**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality
-
-**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed
-
-## Quick start
-
-### Installation
-
-```bash
-# Clone llama.cpp
-git clone https://github.com/ggml-org/llama.cpp
-cd llama.cpp
-
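-# Build (CPU). Newer llama.cpp releases build with CMake instead
-# (binaries then land in build/bin/):
-#   cmake -B build && cmake --build build --config Release
-make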
-
-# Build with CUDA (NVIDIA)
-make GGML_CUDA=1
-
-# Build with Metal (Apple Silicon)
-make GGML_METAL=1
-
-# Install Python bindings (optional)
-pip install llama-cpp-python
-```
-
-### Convert model to GGUF
-
-```bash
-# Install requirements
-pip install -r requirements.txt
-
-# Convert HuggingFace model to GGUF (FP16)
-python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
-
-# Or specify output type
-python convert_hf_to_gguf.py ./path/to/model \
-    --outfile model-f16.gguf \
-    --outtype f16
-```
-
-### Quantize model
-
-```bash
-# Basic quantization to Q4_K_M
-./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
-
-# Quantize with importance matrix (better quality)
-./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
-./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
-```
-
-### Run inference
-
-```bash
-# CLI inference
-./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
-
-# Interactive mode
-./llama-cli -m model-q4_k_m.gguf --interactive
-
-# With GPU offload
-./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
-```
-
-## Quantization types
-
-### K-quant methods (recommended)
-
-| Type | Bits | Size (7B) | Quality | Use Case |
-|------|------|-----------|---------|----------|
-| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
-| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
-| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
-| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
-| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
-| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
-| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
-| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
-| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
-
-### Legacy methods
-
-| Type | Description |
-|------|-------------|
-| Q4_0 | 4-bit, basic |
-| Q4_1 | 4-bit with delta |
-| Q5_0 | 5-bit, basic |
-| Q5_1 | 5-bit with delta |
-
-**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
-
-## Conversion workflows
-
-### Workflow 1: HuggingFace to GGUF
-
-```bash
-# 1. Download model
-huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
-
-# 2. Convert to GGUF (FP16)
-python convert_hf_to_gguf.py ./llama-3.1-8b \
-    --outfile llama-3.1-8b-f16.gguf \
-    --outtype f16
-
-# 3. Quantize
-./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
-
-# 4. Test
-./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
-```
-
-### Workflow 2: With importance matrix (better quality)
-
-```bash
-# 1. Convert to GGUF
-python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
-
-# 2. Create calibration text (diverse samples; add more of your own)
-cat > calibration.txt << 'EOF'
-The quick brown fox jumps over the lazy dog.
-Machine learning is a subset of artificial intelligence.
-Python is a popular programming language.
-EOF
-
-# 3. Generate importance matrix (-c sets the chunk size in tokens;
-#    --chunk would instead skip that many chunks of input)
-./llama-imatrix -m model-f16.gguf \
-    -f calibration.txt \
-    -c 512 \
-    -o model.imatrix \
-    -ngl 35  # GPU layers if available
-
-# 4. Quantize with imatrix
-./llama-quantize --imatrix model.imatrix \
-    model-f16.gguf \
-    model-q4_k_m.gguf \
-    Q4_K_M
-```
-
-### Workflow 3: Multiple quantizations
-
-```bash
-#!/bin/bash
-MODEL="llama-3.1-8b-f16.gguf"
-IMATRIX="llama-3.1-8b.imatrix"
-
-# Generate imatrix once
-./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35
-
-# Create multiple quantizations
-for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
-    # Lowercase portably (${QUANT,,} needs bash 4+, which macOS lacks by default)
-    SUFFIX=$(echo "$QUANT" | tr '[:upper:]' '[:lower:]')
-    OUTPUT="llama-3.1-8b-${SUFFIX}.gguf"
-    ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
-    echo "Created: $OUTPUT ($(du -h "$OUTPUT" | cut -f1))"
-done
-```
-
-## Python usage
-
-### llama-cpp-python
-
-```python
-from llama_cpp import Llama
-
-# Load model
-llm = Llama(
-    model_path="./model-q4_k_m.gguf",
-    n_ctx=4096,          # Context window
-    n_gpu_layers=35,     # GPU offload (0 for CPU only)
-    n_threads=8          # CPU threads
-)
-
-# Generate
-output = llm(
-    "What is machine learning?",
-    max_tokens=256,
-    temperature=0.7,
-    stop=["</s>", "\n\n"]
-)
-print(output["choices"][0]["text"])
-```
-
-### Chat completion
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="./model-q4_k_m.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35,
-    chat_format="llama-3"  # Or "chatml", "mistral", etc.
-)
-
-messages = [
-    {"role": "system", "content": "You are a helpful assistant."},
-    {"role": "user", "content": "What is Python?"}
-]
-
-response = llm.create_chat_completion(
-    messages=messages,
-    max_tokens=256,
-    temperature=0.7
-)
-print(response["choices"][0]["message"]["content"])
-```
-
-### Streaming
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)
-
-# Stream tokens
-for chunk in llm(
-    "Explain quantum computing:",
-    max_tokens=256,
-    stream=True
-):
-    print(chunk["choices"][0]["text"], end="", flush=True)
-```
-
-## Server mode
-
-### Start OpenAI-compatible server
-
-```bash
-# Start server
-./llama-server -m model-q4_k_m.gguf \
-    --host 0.0.0.0 \
-    --port 8080 \
-    -ngl 35 \
-    -c 4096
-
-# Or with Python bindings
-python -m llama_cpp.server \
-    --model model-q4_k_m.gguf \
-    --n_gpu_layers 35 \
-    --host 0.0.0.0 \
-    --port 8080
-```
-
-### Use with OpenAI client
-
-```python
-from openai import OpenAI
-
-client = OpenAI(
-    base_url="http://localhost:8080/v1",
-    api_key="not-needed"
-)
-
-response = client.chat.completions.create(
-    model="local-model",
-    messages=[{"role": "user", "content": "Hello!"}],
-    max_tokens=256
-)
-print(response.choices[0].message.content)
-```
-
-## Hardware optimization
-
-### Apple Silicon (Metal)
-
-```bash
-# Build with Metal
-make clean && make GGML_METAL=1
-
-# Run with Metal acceleration
-./llama-cli -m model.gguf -ngl 99 -p "Hello"
-```
-
-```python
-# Python with Metal
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="model.gguf",
-    n_gpu_layers=99,     # Offload all layers
-    n_threads=1          # Metal handles parallelism
-)
-```
-
-### NVIDIA CUDA
-
-```bash
-# Build with CUDA
-make clean && make GGML_CUDA=1
-
-# Run with CUDA
-./llama-cli -m model.gguf -ngl 35 -p "Hello"
-
-# Specify GPU
-CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
-```
-
-### CPU optimization
-
-```bash
-# Build with AVX2/AVX512
-make clean && make
-
-# Run with optimal threads
-./llama-cli -m model.gguf -t 8 -p "Hello"
-```
-
-```python
-# Python CPU config
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="model.gguf",
-    n_gpu_layers=0,      # CPU only
-    n_threads=8,         # Match physical cores
-    n_batch=512          # Batch size for prompt processing
-)
-```
-
-## Integration with tools
-
-### Ollama
-
-```bash
-# Create Modelfile
-cat > Modelfile << 'EOF'
-FROM ./model-q4_k_m.gguf
-TEMPLATE """{{ .System }}
-{{ .Prompt }}"""
-PARAMETER temperature 0.7
-PARAMETER num_ctx 4096
-EOF
-
-# Create Ollama model
-ollama create mymodel -f Modelfile
-
-# Run
-ollama run mymodel "Hello!"
-```
-
-### LM Studio
-
-1. Place GGUF file in `~/.cache/lm-studio/models/`
-2. Open LM Studio and select the model
-3. Configure context length and GPU offload
-4. Start inference
-
-### text-generation-webui
-
-```bash
-# Place in models folder
-cp model-q4_k_m.gguf text-generation-webui/models/
-
-# Start with llama.cpp loader
-python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
-```
-
-## Best practices
-
-1. **Use K-quants**: Q4_K_M offers best quality/size balance
-2. **Use imatrix**: Always use importance matrix for Q4 and below
-3. **GPU offload**: Offload as many layers as VRAM allows
-4. **Context length**: Start with 4096, increase if needed
-5. **Thread count**: Match physical CPU cores, not logical
-6. **Batch size**: Increase n_batch for faster prompt processing
-
-## Common issues
-
-**Model loads slowly:**
-```bash
-# mmap is enabled by default; avoid --no-mmap, and use --mlock to keep the model resident in RAM
-./llama-cli -m model.gguf --mlock
-```
-
-**Out of memory:**
-```bash
-# Reduce GPU layers
-./llama-cli -m model.gguf -ngl 20  # Reduce from 35
-
-# Or use smaller quantization
-./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
-```
-
-**Poor quality at low bits:**
-```bash
-# Always use imatrix for Q4 and below
-./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
-./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
-```
-
-## References
-
- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks
-
-## Resources
-
- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT
--- a/skills/mlops/gguf/references/advanced-usage.md
+++ b/skills/mlops/gguf/references/advanced-usage.md
@@ -1,504 +0,0 @@
-# GGUF Advanced Usage Guide
-
-## Speculative Decoding
-
-### Draft Model Approach
-
-```bash
-# Use smaller model as draft for faster generation
-./llama-speculative \
-    -m large-model-q4_k_m.gguf \
-    -md draft-model-q4_k_m.gguf \
-    -p "Write a story about AI" \
-    -n 500 \
-    --draft 8  # Draft tokens before verification
-```
-
-### Self-Speculative Decoding
-
-```bash
-# Speculate from n-gram lookup caches instead of a separate draft model
-./llama-lookup -m model-q4_k_m.gguf \
-    --lookup-cache-static lookup.bin \
-    --lookup-cache-dynamic lookup-dynamic.bin \
-    -p "Hello world"
-```
-
-## Batched Inference
-
-### Process Multiple Prompts
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="model-q4_k_m.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35,
-    n_batch=512  # Tokens evaluated per batch during prompt processing
-)
-
-prompts = [
-    "What is Python?",
-    "Explain machine learning.",
-    "Describe neural networks."
-]
-
-# Process prompts one after another (each call starts from a fresh context)
-for prompt in prompts:
-    output = llm(prompt, max_tokens=100)
-    print(f"Q: {prompt}")
-    print(f"A: {output['choices'][0]['text']}\n")
-```
-
-### Server Batching
-
-```bash
-# Start server with batching:
-#   --parallel 4     serves up to 4 concurrent requests
-#   --cont-batching  enables continuous batching
-./llama-server -m model-q4_k_m.gguf \
-    --host 0.0.0.0 \
-    --port 8080 \
-    -ngl 35 \
-    -c 4096 \
-    --parallel 4 \
-    --cont-batching
-```
-
-## Custom Model Conversion
-
-### Convert with Vocabulary Modifications
-
-```python
-# custom_convert.py
-import subprocess
-import sys
-
-from transformers import AutoTokenizer
-
-# Custom conversion with modified vocab
-def convert_with_custom_vocab(model_path, output_path):
-    # Load and modify tokenizer
-    tokenizer = AutoTokenizer.from_pretrained(model_path)
-
-    # Add special tokens if needed (the model's embeddings must already
-    # cover the new token IDs, e.g. via resize_token_embeddings)
-    special_tokens = {"additional_special_tokens": ["<|custom|>"]}
-    tokenizer.add_special_tokens(special_tokens)
-    tokenizer.save_pretrained(model_path)
-
-    # Then run the standard conversion script (it parses CLI arguments itself)
-    subprocess.run(
-        [sys.executable, "./llama.cpp/convert_hf_to_gguf.py",
-         model_path, "--outfile", output_path],
-        check=True,
-    )
-```
-
-### Convert Specific Architecture
-
-```bash
-# For Mistral-style models
-python convert_hf_to_gguf.py ./mistral-model \
-    --outfile mistral-f16.gguf \
-    --outtype f16
-
-# For Qwen models
-python convert_hf_to_gguf.py ./qwen-model \
-    --outfile qwen-f16.gguf \
-    --outtype f16
-
-# For Phi models
-python convert_hf_to_gguf.py ./phi-model \
-    --outfile phi-f16.gguf \
-    --outtype f16
-```
-
-## Advanced Quantization
-
-### Mixed Quantization
-
-```bash
-# Keep the output tensor unquantized while quantizing the rest to Q4_K_M
-# (options must precede the input file)
-./llama-quantize --allow-requantize --leave-output-tensor \
-    model-f16.gguf model-mixed.gguf Q4_K_M
-```
-
-### Quantization with Token Embeddings
-
-```bash
-# Keep embeddings at higher precision (options must precede the input file)
-./llama-quantize --token-embedding-type f16 \
-    model-f16.gguf model-q4.gguf Q4_K_M
-```
-
-### IQ Quantization (Importance-aware)
-
-```bash
-# Ultra-low bit quantization with importance
-./llama-quantize --imatrix model.imatrix \
-    model-f16.gguf model-iq2_xxs.gguf IQ2_XXS
-
-# Available IQ types: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS
-```
-
-## Memory Optimization
-
-### Memory Mapping
-
-```python
-from llama_cpp import Llama
-
-# Use memory mapping for large models
-llm = Llama(
-    model_path="model-q4_k_m.gguf",
-    use_mmap=True,       # Memory map the model
-    use_mlock=False,     # Don't lock in RAM
-    n_gpu_layers=35
-)
-```
-
-### Partial GPU Offload
-
-```python
-# Calculate layers to offload based on VRAM
-import subprocess
-
-from llama_cpp import Llama
-
-def get_free_vram_gb():
-    result = subprocess.run(
-        ['nvidia-smi', '--query-gpu=memory.free', '--format=csv,nounits,noheader'],
-        capture_output=True, text=True, check=True
-    )
-    # nvidia-smi prints one line per GPU (values in MiB); use the first GPU
-    return int(result.stdout.strip().splitlines()[0]) / 1024
-
-# Estimate layers based on VRAM (rough: 0.5GB per layer for 7B Q4)
-free_vram = get_free_vram_gb()
-layers_to_offload = int(free_vram / 0.5)
-
-llm = Llama(
-    model_path="model-q4_k_m.gguf",
-    n_gpu_layers=min(layers_to_offload, 35)  # Cap at total layers
-)
-```
-
-### KV Cache Optimization
-
-```python
-from llama_cpp import Llama
-
-# Optimize KV cache for long contexts
-llm = Llama(
-    model_path="model-q4_k_m.gguf",
-    n_ctx=8192,          # Large context
-    n_gpu_layers=35,
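-    type_k=8,            # GGML_TYPE_Q8_0 for K cache (ggml type enum value 8; 1 is F16)
-    type_v=8,            # GGML_TYPE_Q8_0 for V cache
-    flash_attn=True,     # llama.cpp needs flash attention for a quantized V cache
-    # Or use Q4_0 (enum value 2) for more compression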
-)
-```
-
-## Context Management
-
-### Context Shifting
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="model-q4_k_m.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35
-)
-
-# Handle long conversations with context shifting
-conversation = []
-max_history = 10
-
-def chat(user_message):
-    conversation.append({"role": "user", "content": user_message})
-
-    # Keep only recent history (trim in place; rebinding the name here
-    # would make `conversation` local and break the append above)
-    if len(conversation) > max_history * 2:
-        del conversation[:-max_history * 2]
-
-    response = llm.create_chat_completion(
-        messages=conversation,
-        max_tokens=256
-    )
-
-    assistant_message = response["choices"][0]["message"]["content"]
-    conversation.append({"role": "assistant", "content": assistant_message})
-    return assistant_message
-```
-
-### Save and Load State
-
-```bash
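-# Save the evaluated prompt and generation to a prompt cache
-./llama-cli -m model.gguf \
-    -p "Once upon a time" \
-    --prompt-cache session.bin \
-    --prompt-cache-all \
-    -n 100
-
-# Reuse the cache; the new prompt must start with the cached text
-./llama-cli -m model.gguf \
-    --prompt-cache session.bin \
-    -p "Once upon a time and they lived" \
-    -n 100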
-```
-
-## Grammar Constrained Generation
-
-### JSON Output
-
-```python
-from llama_cpp import Llama, LlamaGrammar
-
-# Define JSON grammar
-json_grammar = LlamaGrammar.from_string('''
-root ::= object
-object ::= "{" ws pair ("," ws pair)* "}" ws
-pair ::= string ":" ws value
-value ::= string | number | object | array | "true" | "false" | "null"
-array ::= "[" ws value ("," ws value)* "]" ws
-string ::= "\\"" [^"\\\\]* "\\""
-number ::= [0-9]+
-ws ::= [ \\t\\n]*
-''')
-
-llm = Llama(model_path="model-q4_k_m.gguf", n_gpu_layers=35)
-
-output = llm(
-    "Output a JSON object with name and age:",
-    grammar=json_grammar,
-    max_tokens=100
-)
-print(output["choices"][0]["text"])
-```
-
-### Custom Grammar
-
-```python
-# Grammar for specific format
-answer_grammar = LlamaGrammar.from_string('''
-root ::= "Answer: " letter "\\n" "Explanation: " explanation
-letter ::= [A-D]
-explanation ::= [a-zA-Z0-9 .,!?]+
-''')
-
-output = llm(
-    "Q: What is 2+2? A) 3 B) 4 C) 5 D) 6",
-    grammar=answer_grammar,
-    max_tokens=100
-)
-```
-
-## LoRA Integration
-
-### Load LoRA Adapter
-
-```bash
-# Apply LoRA at runtime (--lora applies the adapter at scale 1.0;
-# use --lora-scaled FNAME SCALE for a custom scale)
-./llama-cli -m base-model-q4_k_m.gguf \
-    --lora lora-adapter.gguf \
-    -p "Hello!"
-```
-
-### Multiple LoRA Adapters
-
-```bash
-# Stack multiple adapters
-./llama-cli -m base-model.gguf \
-    --lora-scaled adapter1.gguf 0.5 \
-    --lora-scaled adapter2.gguf 0.5 \
-    -p "Hello!"
-```
-
-### Python LoRA Usage
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="base-model-q4_k_m.gguf",
-    lora_path="lora-adapter.gguf",
-    lora_scale=1.0,
-    n_gpu_layers=35
-)
-```
-
-## Embedding Generation
-
-### Extract Embeddings
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="model-q4_k_m.gguf",
-    embedding=True,      # Enable embedding mode
-    n_gpu_layers=35
-)
-
-# Get embeddings
-embeddings = llm.embed("This is a test sentence.")
-print(f"Embedding dimension: {len(embeddings)}")
-```
-
-### Batch Embeddings
-
-```python
-texts = [
-    "Machine learning is fascinating.",
-    "Deep learning uses neural networks.",
-    "Python is a programming language."
-]
-
-embeddings = [llm.embed(text) for text in texts]
-
-# Calculate similarity
-import numpy as np
-
-def cosine_similarity(a, b):
-    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
-
-sim = cosine_similarity(embeddings[0], embeddings[1])
-print(f"Similarity: {sim:.4f}")
-```
-
-## Performance Tuning
-
-### Benchmark Script
-
-```python
-import time
-from llama_cpp import Llama
-
-def benchmark(model_path, prompt, n_tokens=100, n_runs=5):
-    llm = Llama(
-        model_path=model_path,
-        n_gpu_layers=35,
-        n_ctx=2048,
-        verbose=False
-    )
-
-    # Warmup
-    llm(prompt, max_tokens=10)
-
-    # Benchmark
-    times = []
-    for _ in range(n_runs):
-        start = time.time()
-        output = llm(prompt, max_tokens=n_tokens)
-        elapsed = time.time() - start
-        times.append(elapsed)
-
-    avg_time = sum(times) / len(times)
-    tokens_per_sec = n_tokens / avg_time
-
-    print(f"Model: {model_path}")
-    print(f"Avg time: {avg_time:.2f}s")
-    print(f"Tokens/sec: {tokens_per_sec:.1f}")
-
-    return tokens_per_sec
-
-# Compare quantizations
-for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
-    benchmark(f"model-{quant}.gguf", "Explain quantum computing:", 100)
-```
-
-### Optimal Configuration Finder
-
-```python
-def find_optimal_config(model_path, target_vram_gb=8):
-    """Find optimal n_gpu_layers and n_batch for target VRAM."""
-    import gc
-    import time
-    from llama_cpp import Llama
-    best_config = None
-    best_speed = 0
-
-    for n_gpu_layers in range(0, 50, 5):
-        for n_batch in [128, 256, 512, 1024]:
-            try:
-                gc.collect()
-                llm = Llama(
-                    model_path=model_path,
-                    n_gpu_layers=n_gpu_layers,
-                    n_batch=n_batch,
-                    n_ctx=2048,
-                    verbose=False
-                )
-
-                # Quick benchmark
-                start = time.time()
-                llm("Hello", max_tokens=50)
-                speed = 50 / (time.time() - start)
-
-                if speed > best_speed:
-                    best_speed = speed
-                    best_config = {
-                        "n_gpu_layers": n_gpu_layers,
-                        "n_batch": n_batch,
-                        "speed": speed
-                    }
-
-                del llm
-                gc.collect()
-
-            except Exception as e:
-                print(f"Failed (likely OOM) at layers={n_gpu_layers}, batch={n_batch}: {e}")
-                break
-
-    return best_config
-```
-
-## Multi-GPU Setup
-
-### Distribute Across GPUs
-
-```bash
-# Split model across multiple GPUs
-./llama-cli -m large-model.gguf \
-    --tensor-split 0.5,0.5 \
-    -ngl 60 \
-    -p "Hello!"
-```
-
-### Python Multi-GPU
-
-```python
-import os
-os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
-
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="large-model-q4_k_m.gguf",
-    n_gpu_layers=60,
-    tensor_split=[0.5, 0.5]  # Split evenly across 2 GPUs
-)
-```
-
-## Custom Builds
-
-### Build with All Optimizations
-
-```bash
-# Clean build with all CPU optimizations
-make clean
-LLAMA_OPENBLAS=1 LLAMA_BLAS_VENDOR=OpenBLAS make -j
-
-# With CUDA (cuBLAS is used automatically; LLAMA_CUBLAS is the deprecated name)
-make clean
-GGML_CUDA=1 make -j
-
-# With specific CUDA architecture
-GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_86 make -j
-```
-
-### CMake Build
-
-```bash
-mkdir build && cd build
-cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
-cmake --build . --config Release -j
-```
--- a/skills/mlops/gguf/references/troubleshooting.md
+++ b/skills/mlops/gguf/references/troubleshooting.md
@@ -1,442 +0,0 @@
-# GGUF Troubleshooting Guide
-
-## Installation Issues
-
-### Build Fails
-
-**Error**: `make: *** No targets specified and no makefile found`
-
-**Fix**:
-```bash
-# Ensure you're in llama.cpp directory
-cd llama.cpp
-make
-```
-
-**Error**: `fatal error: cuda_runtime.h: No such file or directory`
-
-**Fix**:
-```bash
-# Install CUDA toolkit
-# Ubuntu
-sudo apt install nvidia-cuda-toolkit
-
-# Or set CUDA path
-export CUDA_PATH=/usr/local/cuda
-export PATH=$CUDA_PATH/bin:$PATH
-make GGML_CUDA=1
-```
-
-### Python Bindings Issues
-
-**Error**: `ERROR: Failed building wheel for llama-cpp-python`
-
-**Fix**:
-```bash
-# Install build dependencies
-pip install cmake scikit-build-core
-
-# For CUDA support
-CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
-
-# For Metal (macOS)
-CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
-```
-
-**Error**: `ImportError: libcudart.so.XX: cannot open shared object file`
-
-**Fix**:
-```bash
-# Add CUDA libraries to path
-export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
-
-# Or reinstall with correct CUDA version
-pip uninstall llama-cpp-python
-CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
-```
-
-## Conversion Issues
-
-### Model Not Supported
-
-**Error**: `KeyError: 'model.embed_tokens.weight'`
-
-**Fix**:
-```bash
-# Check model architecture
-python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"
-
-# Use appropriate conversion script
-# For most models:
-python convert_hf_to_gguf.py ./model --outfile model.gguf
-
-# For older models, check if legacy script needed
-```
-
-### Vocabulary Mismatch
-
-**Error**: `RuntimeError: Vocabulary size mismatch`
-
-**Fix**:
-```python
-# Ensure tokenizer matches model
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-tokenizer = AutoTokenizer.from_pretrained("./model")
-model = AutoModelForCausalLM.from_pretrained("./model")
-
-print(f"Tokenizer vocab size: {len(tokenizer)}")
-print(f"Model vocab size: {model.config.vocab_size}")
-
-# If mismatch, resize embeddings before conversion
-model.resize_token_embeddings(len(tokenizer))
-model.save_pretrained("./model-fixed")
-```
-
-### Out of Memory During Conversion
-
-**Error**: `torch.cuda.OutOfMemoryError` during conversion
-
-**Fix**:
-```bash
-# Use CPU for conversion
-CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf
-
-# Or write F16 output explicitly to keep the converted file small
-python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16
-```
-
-## Quantization Issues
-
-### Wrong Output File Size
-
-**Problem**: Quantized file is larger than expected
-
-**Check**:
-```bash
-# Verify quantization type
-./llama-cli -m model.gguf --verbose
-
-# Expected sizes for 7B model:
-# Q4_K_M: ~4.1 GB
-# Q5_K_M: ~4.8 GB
-# Q8_0: ~7.2 GB
-# F16: ~13.5 GB
-```
-
-### Quantization Crashes
-
-**Error**: `Segmentation fault` during quantization
-
-**Fix**:
-```bash
-# Increase stack size
-ulimit -s unlimited
-
-# Or use fewer threads (thread count is the last positional argument)
-./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M 4
-```
-
-### Poor Quality After Quantization
-
-**Problem**: Model outputs gibberish after quantization
-
-**Solutions**:
-
-1. **Use importance matrix**:
-```bash
-# Generate imatrix with good calibration data
-./llama-imatrix -m model-f16.gguf \
-    -f wiki_sample.txt \
-    --chunk 512 \
-    -o model.imatrix
-
-# Quantize with imatrix
-./llama-quantize --imatrix model.imatrix \
-    model-f16.gguf model-q4_k_m.gguf Q4_K_M
-```
-
-2. **Try higher precision**:
-```bash
-# Use Q5_K_M or Q6_K instead of Q4
-./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
-```
-
-3. **Check original model**:
-```bash
-# Test FP16 version first
-./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50
-```
-
-## Inference Issues
-
-### Slow Generation
-
-**Problem**: Generation is slower than expected
-
-**Solutions**:
-
-1. **Enable GPU offload**:
-```bash
-./llama-cli -m model.gguf -ngl 35 -p "Hello"
-```
-
-2. **Optimize batch size**:
-```python
-llm = Llama(
-    model_path="model.gguf",
-    n_batch=512,        # Increase for faster prompt processing
-    n_gpu_layers=35
-)
-```
-
-3. **Use appropriate threads**:
-```bash
-# Match physical cores, not logical
-./llama-cli -m model.gguf -t 8 -p "Hello"
-```
-
-4. **Enable Flash Attention** (if supported):
-```bash
-./llama-cli -m model.gguf -ngl 35 --flash-attn -p "Hello"
-```
-
-### Out of Memory
-
-**Error**: `CUDA out of memory` or system freeze
-
-**Solutions**:
-
-1. **Reduce GPU layers**:
-```python
-# Start low and increase
-llm = Llama(model_path="model.gguf", n_gpu_layers=10)
-```
-
-2. **Use smaller quantization**:
-```bash
-./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
-```
-
-3. **Reduce context length**:
-```python
-llm = Llama(
-    model_path="model.gguf",
-    n_ctx=2048,  # Reduce from 4096
-    n_gpu_layers=35
-)
-```
-
-4. **Quantize KV cache**:
-```python
-llm = Llama(
-    model_path="model.gguf",
-    type_k=2,    # Q4_0 for K cache
-    type_v=2,    # Q4_0 for V cache
-    n_gpu_layers=35
-)
-```
-
-### Garbage Output
-
-**Problem**: Model outputs random characters or nonsense
-
-**Diagnose**:
-```python
-# Check model loading
-llm = Llama(model_path="model.gguf", verbose=True)
-
-# Test with simple prompt
-output = llm("1+1=", max_tokens=5, temperature=0)
-print(output)
-```
-
-**Solutions**:
-
-1. **Check model integrity**:
-```bash
-# Verify GGUF file
-./llama-cli -m model.gguf --verbose 2>&1 | head -50
-```
-
-2. **Use correct chat format**:
-```python
-llm = Llama(
-    model_path="model.gguf",
-    chat_format="llama-3"  # Match your model: chatml, mistral, etc.
-)
-```
-
-3. **Check temperature**:
-```python
-# Use lower temperature for deterministic output
-output = llm("Hello", max_tokens=50, temperature=0.1)
-```
-
-### Token Issues
-
-**Error**: `RuntimeError: unknown token` or encoding errors
-
-**Fix**:
-```python
-# Decode raw bytes as UTF-8, replacing invalid sequences
-prompt = b"Hello, world!".decode("utf-8", errors="replace")
-output = llm(prompt, max_tokens=50)
-```
-
-## Server Issues
-
-### Connection Refused
-
-**Error**: `Connection refused` when accessing server
-
-**Fix**:
-```bash
-# Bind to all interfaces
-./llama-server -m model.gguf --host 0.0.0.0 --port 8080
-
-# Check if port is in use
-lsof -i :8080
-```
-
-### Server Crashes Under Load
-
-**Problem**: Server crashes with multiple concurrent requests
-
-**Solutions**:
-
-1. **Limit parallelism**:
-```bash
-./llama-server -m model.gguf \
-    --parallel 2 \
-    -c 4096 \
-    --cont-batching
-```
-
-2. **Add request timeout**:
-```bash
-./llama-server -m model.gguf --timeout 300
-```
-
-3. **Monitor memory**:
-```bash
-watch -n 1 nvidia-smi  # For GPU
-watch -n 1 free -h     # For RAM
-```
-
-### API Compatibility Issues
-
-**Problem**: OpenAI client not working with server
-
-**Fix**:
-```python
-from openai import OpenAI
-
-# Use correct base URL format
-client = OpenAI(
-    base_url="http://localhost:8080/v1",  # Include /v1
-    api_key="not-needed"
-)
-
-# Use correct model name
-response = client.chat.completions.create(
-    model="local",  # Or the actual model name
-    messages=[{"role": "user", "content": "Hello"}]
-)
-```
-
-## Apple Silicon Issues
-
-### Metal Not Working
-
-**Problem**: Metal acceleration not enabled
-
-**Check**:
-```bash
-# Verify Metal support
-./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal
-```
-
-**Fix**:
-```bash
-# Rebuild with Metal
-make clean
-make GGML_METAL=1
-
-# Python bindings
-CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall
-```
-
-### Incorrect Memory Usage on M1/M2
-
-**Problem**: Model uses too much unified memory
-
-**Fix**:
-```python
-# Offload all layers for Metal
-llm = Llama(
-    model_path="model.gguf",
-    n_gpu_layers=99,    # Offload everything
-    n_threads=1         # Metal handles parallelism
-)
-```
-
-## Debugging
-
-### Enable Verbose Output
-
-```bash
-# CLI verbose mode
-./llama-cli -m model.gguf --verbose -p "Hello" -n 50
-
-# Python verbose
-python -c 'from llama_cpp import Llama; Llama(model_path="model.gguf", verbose=True)'
-```
-
-### Check Model Metadata
-
-```bash
-# View GGUF metadata
-./llama-cli -m model.gguf --verbose 2>&1 | head -100
-```
-
-### Validate GGUF File
-
-```python
-import struct
-
-def validate_gguf(filepath):
-    with open(filepath, 'rb') as f:
-        magic = f.read(4)
-        if magic != b'GGUF':
-            print(f"Invalid magic: {magic}")
-            return False
-
-        version = struct.unpack('<I', f.read(4))[0]
-        print(f"GGUF version: {version}")
-
-        tensor_count = struct.unpack('<Q', f.read(8))[0]
-        metadata_count = struct.unpack('<Q', f.read(8))[0]
-        print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")
-
-        return True
-
-validate_gguf("model.gguf")
-```
-
-## Getting Help
-
-1. **GitHub Issues**: https://github.com/ggml-org/llama.cpp/issues
-2. **Discussions**: https://github.com/ggml-org/llama.cpp/discussions
-3. **Reddit**: r/LocalLLaMA
-
-### Reporting Issues
-
-Include:
-- llama.cpp version/commit hash
-- Build command used
-- Model name and quantization
-- Full error message/stack trace
-- Hardware: CPU/GPU model, RAM, VRAM
-- OS version
-- Minimal reproduction steps
--- a/skills/mlops/grpo-rl-training/README.md
+++ b/skills/mlops/grpo-rl-training/README.md
@@ -1,97 +0,0 @@
-# GRPO/RL Training Skill
-
-**Expert-level guidance for Group Relative Policy Optimization with TRL**
-
-## 📁 Skill Structure
-
-```
-grpo-rl-training/
-├── SKILL.md                              # Main skill documentation (READ THIS FIRST)
-├── README.md                             # This file
-├── templates/
-│   └── basic_grpo_training.py            # Production-ready training template
-└── examples/
-    └── reward_functions_library.py       # 20+ reward function examples
-```
-
-## 🚀 Quick Start
-
-1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns
-2. **Copy `templates/basic_grpo_training.py`** - Start with working code
-3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task
-4. **Modify for your use case** - Adapt dataset, rewards, and config
-
-## 💡 What's Inside
-
-### SKILL.md (Main Documentation)
-- Core GRPO concepts and algorithm fundamentals
-- Complete implementation workflow (dataset → rewards → training → deployment)
-- 10+ reward function examples with code
-- Hyperparameter tuning guide
-- Training insights (loss behavior, metrics, debugging)
-- Troubleshooting guide
-- Production best practices
-
-### Templates
-- **basic_grpo_training.py**: Minimal, production-ready training script
-  - Uses Qwen 2.5 1.5B Instruct
-  - 3 reward functions (format + correctness)
-  - LoRA for efficient training
-  - Fully documented and ready to run
-
-### Examples
-- **reward_functions_library.py**: 20+ battle-tested reward functions
-  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
-  - Format rewards (XML, JSON, strict/soft)
-  - Length rewards (ideal length, min/max)
-  - Style rewards (reasoning quality, citations, repetition penalty)
-  - Combined rewards (multi-objective optimization)
-  - Preset collections for common tasks
-
-## 📖 Usage for Agents
-
-When this skill is loaded in your agent's context:
-
-1. **Always read SKILL.md first** before implementing
-2. **Start simple** - Use length-based reward to validate setup
-3. **Build incrementally** - Add one reward function at a time
-4. **Reference examples** - Copy patterns from reward_functions_library.py
-5. **Monitor training** - Watch reward metrics (not loss!)
-
-## 🎯 Common Use Cases
-
-| Task Type | Recommended Rewards | Template |
-|-----------|---------------------|----------|
-| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py |
-| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
-| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
-| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |
-
-## ⚠️ Critical Reminders
-
-- **Loss goes UP during training** - This is normal (it's KL divergence)
-- **Use 3-5 reward functions** - Single rewards often fail
-- **Test rewards before training** - Debug each function independently
-- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse)
-- **Start with num_generations=4-8** - Scale up if GPU allows
-
-## 🔗 External Resources
-
-- [TRL Documentation](https://huggingface.co/docs/trl)
-- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)
-- [Open R1 Implementation](https://github.com/huggingface/open-r1)
-- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)
-
-## 📝 Version
-
-**v1.0.0** - Initial release (January 2025)
-
-## 👨‍💻 Maintained By
-
-Orchestra Research
-For questions or improvements, see https://orchestra.com
-
----
-
-**License:** MIT
-**Last Updated:** January 2025
--- a/skills/mlops/grpo-rl-training/SKILL.md
+++ b/skills/mlops/grpo-rl-training/SKILL.md
@@ -1,575 +0,0 @@
----
-name: grpo-rl-training
-description: Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [transformers>=4.47.0, trl>=0.14.0, datasets>=3.2.0, peft>=0.14.0, torch]
-metadata:
-  hermes:
-    tags: [Post-Training, Reinforcement Learning, GRPO, TRL, RLHF, Reward Modeling, Reasoning, DPO, PPO, Structured Output]
-
----
-
-# GRPO/RL Training with TRL
-
-Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.
-
-## When to Use This Skill
-
-Use GRPO training when you need to:
-- **Enforce specific output formats** (e.g., XML tags, JSON, structured reasoning)
-- **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking)
-- **Improve reasoning capabilities** by rewarding chain-of-thought patterns
-- **Align models to domain-specific behaviors** without labeled preference data
-- **Optimize for multiple objectives** simultaneously (format + correctness + style)
-
-**Do NOT use GRPO for:**
-- Simple supervised fine-tuning tasks (use SFT instead)
-- Tasks without clear reward signals
-- Tasks where you already have high-quality preference pairs (use DPO/PPO instead)
-
----
-
-## Core Concepts
-
-### 1. GRPO Algorithm Fundamentals
-
-**Key Mechanism:**
-- Generates **multiple completions** for each prompt (group size: 4-16)
-- Compares completions within each group using reward functions
-- Updates policy to favor higher-rewarded responses relative to the group
-
-**Critical Difference from PPO:**
-- No separate value (critic) model needed; with rule-based reward functions, no learned reward model either
-- Advantages come from within-group comparisons rather than a learned baseline
-- Simpler to implement and debug
-
-**Mathematical Intuition:**
-```
-For each prompt p:
-  1. Generate N completions: {c₁, c₂, ..., cₙ}
-  2. Compute rewards: {r₁, r₂, ..., rₙ}
-  3. Normalize within the group: Aᵢ = (rᵢ - mean(r)) / std(r)
-  4. Raise the probability of completions with Aᵢ > 0, lower it for Aᵢ < 0
-```
-
-### 2. Reward Function Design Philosophy
-
-**Golden Rules:**
-1. **Compose multiple reward functions** - Each handles one aspect (format, correctness, style)
-2. **Scale rewards appropriately** - Higher weight = stronger signal
-3. **Use incremental rewards** - Partial credit for partial compliance
-4. **Test rewards independently** - Debug each reward function in isolation
-
-**Reward Function Types:**
-
-| Type | Use Case | Example Weight |
-|------|----------|----------------|
-| **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) |
-| **Format** | Strict structure enforcement | 0.5-1.0 |
-| **Length** | Encourage verbosity/conciseness | 0.1-0.5 |
-| **Style** | Penalize unwanted patterns | -0.5 to 0.5 |
-
----
-
-## Implementation Workflow
-
-### Step 1: Dataset Preparation
-
-**Critical Requirements:**
-- Prompts in chat format (list of dicts with 'role' and 'content')
-- Include system prompts to set expectations
-- For verifiable tasks, include ground truth answers as additional columns
-
-**Example Structure:**
-```python
-from datasets import load_dataset, Dataset
-
-SYSTEM_PROMPT = """
-Respond in the following format:
-<reasoning>
-[Your step-by-step thinking]
-</reasoning>
-<answer>
-[Final answer]
-</answer>
-"""
-
-def prepare_dataset(raw_data):
-    """
-    Transform raw data into GRPO-compatible format.
-
-    Returns: Dataset with columns:
-    - 'prompt': List[Dict] with role/content (system + user messages)
-    - 'answer': str (ground truth, optional but recommended)
-    """
-    return raw_data.map(lambda x: {
-        'prompt': [
-            {'role': 'system', 'content': SYSTEM_PROMPT},
-            {'role': 'user', 'content': x['question']}
-        ],
-        'answer': extract_answer(x['raw_answer'])  # extract_answer: your own parsing helper
-    })
-```
-
-**Pro Tips:**
-- Use one-shot or few-shot examples in system prompt for complex formats
-- Keep prompts concise (max_prompt_length: 256-512 tokens)
-- Validate data quality before training (garbage in = garbage out)
-
-### Step 2: Reward Function Implementation
-
-**Template Structure:**
-```python
-def reward_function_name(
-    prompts,        # List[List[Dict]]: Original prompts
-    completions,    # List[List[Dict]]: Model generations
-    answer=None,    # Optional: Ground truth from dataset
-    **kwargs        # Additional dataset columns
-) -> list[float]:
-    """
-    Evaluate completions and return rewards.
-
-    Returns: List of floats (one per completion)
-    """
-    # Extract completion text
-    responses = [comp[0]['content'] for comp in completions]
-
-    # Compute rewards
-    rewards = []
-    for response in responses:
-        score = compute_score(response)
-        rewards.append(score)
-
-    return rewards
-```
-
-**Example 1: Correctness Reward (Math/Coding)**
-```python
-def correctness_reward(prompts, completions, answer, **kwargs):
-    """Reward correct answers with high score."""
-    responses = [comp[0]['content'] for comp in completions]
-    extracted = [extract_final_answer(r) for r in responses]
-    return [2.0 if ans == gt else 0.0
-            for ans, gt in zip(extracted, answer)]
-```
-
-**Example 2: Format Reward (Structured Output)**
-```python
-import re
-
-def format_reward(completions, **kwargs):
-    """Reward XML-like structured format."""
-    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
-    responses = [comp[0]['content'] for comp in completions]
-    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
-            for r in responses]
-```
-
-**Example 3: Incremental Format Reward (Partial Credit)**
-```python
-def incremental_format_reward(completions, **kwargs):
-    """Award partial credit for format compliance."""
-    responses = [comp[0]['content'] for comp in completions]
-    rewards = []
-
-    for r in responses:
-        score = 0.0
-        if '<reasoning>' in r:
-            score += 0.25
-        if '</reasoning>' in r:
-            score += 0.25
-        if '<answer>' in r:
-            score += 0.25
-        if '</answer>' in r:
-            score += 0.25
-        # Penalize extra text after closing tag
-        if r.count('</answer>') == 1:
-            extra_text = r.split('</answer>')[-1].strip()
-            score -= len(extra_text) * 0.001
-        rewards.append(score)
-
-    return rewards
-```
-
-**Critical Insight:**
-Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
-
-### Step 3: Training Configuration
-
-**Memory-Optimized Config (Small GPU)**
-```python
-from trl import GRPOConfig
-
-training_args = GRPOConfig(
-    output_dir="outputs/grpo-model",
-
-    # Learning rate
-    learning_rate=5e-6,          # Lower = more stable
-    adam_beta1=0.9,
-    adam_beta2=0.99,
-    weight_decay=0.1,
-    warmup_ratio=0.1,
-    lr_scheduler_type='cosine',
-
-    # Batch settings
-    per_device_train_batch_size=1,
-    gradient_accumulation_steps=4,  # Effective batch = 4
-
-    # GRPO-specific
-    num_generations=8,            # Group size: 8-16; global batch size must be divisible by this
-    max_prompt_length=256,
-    max_completion_length=512,
-
-    # Training duration
-    num_train_epochs=1,
-    max_steps=-1,                 # -1 = use num_train_epochs; or set fixed steps (e.g., 500)
-
-    # Optimization
-    bf16=True,                    # Faster on A100/H100
-    optim="adamw_8bit",          # Memory-efficient optimizer
-    max_grad_norm=0.1,
-
-    # Logging
-    logging_steps=1,
-    save_steps=100,
-    report_to="wandb",            # Or "none" for no logging
-)
-```
-
-**High-Performance Config (Large GPU)**
-```python
-training_args = GRPOConfig(
-    output_dir="outputs/grpo-model",
-    learning_rate=1e-5,
-    per_device_train_batch_size=4,
-    gradient_accumulation_steps=2,
-    num_generations=16,           # Larger groups = better signal
-    max_prompt_length=512,
-    max_completion_length=1024,
-    num_train_epochs=1,
-    bf16=True,
-    use_vllm=True,                # Fast generation with vLLM
-    logging_steps=10,
-)
-```
-
-**Critical Hyperparameters:**
-
-| Parameter | Impact | Tuning Advice |
-|-----------|--------|---------------|
-| `num_generations` | Group size for comparison | Start with 8, increase to 16 if GPU allows |
-| `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
-| `max_completion_length` | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
-| `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited |
-
-### Step 4: Model Setup and Training
-
-**Standard Setup (Transformers)**
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from peft import LoraConfig
-from trl import GRPOTrainer
-
-# Load model
-model_name = "Qwen/Qwen2.5-1.5B-Instruct"
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    torch_dtype=torch.bfloat16,
-    attn_implementation="flash_attention_2",  # Requires the flash-attn package
-    device_map="auto"
-)
-
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-tokenizer.pad_token = tokenizer.eos_token
-
-# Optional: LoRA for parameter-efficient training
-peft_config = LoraConfig(
-    r=16,                         # Rank (higher = more capacity)
-    lora_alpha=32,               # Scaling factor (typically 2*r)
-    target_modules=[
-        "q_proj", "k_proj", "v_proj", "o_proj",
-        "gate_proj", "up_proj", "down_proj"
-    ],
-    task_type="CAUSAL_LM",
-    lora_dropout=0.05,
-)
-
-# Initialize trainer
-trainer = GRPOTrainer(
-    model=model,
-    processing_class=tokenizer,
-    reward_funcs=[
-        incremental_format_reward,
-        format_reward,
-        correctness_reward,
-    ],
-    args=training_args,
-    train_dataset=dataset,
-    peft_config=peft_config,      # Remove for full fine-tuning
-)
-
-# Train
-trainer.train()
-
-# Save
-trainer.save_model("final_model")
-```
-
-**Unsloth Setup (2-3x Faster)**
-```python
-from unsloth import FastLanguageModel
-
-model, tokenizer = FastLanguageModel.from_pretrained(
-    model_name="google/gemma-3-1b-it",
-    max_seq_length=1024,
-    load_in_4bit=True,
-    fast_inference=True,
-    max_lora_rank=32,
-)
-
-model = FastLanguageModel.get_peft_model(
-    model,
-    r=32,
-    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
-                    "gate_proj", "up_proj", "down_proj"],
-    lora_alpha=32,
-    use_gradient_checkpointing="unsloth",
-)
-
-# Rest is identical to standard setup
-trainer = GRPOTrainer(model=model, ...)
-trainer.train()
-```
-
----
-
-## Critical Training Insights
-
-### 1. Loss Behavior (EXPECTED PATTERN)
-- **Loss starts near 0 and INCREASES during training**
-- This is CORRECT - loss measures KL divergence from initial policy
-- Model is learning (diverging from original behavior to optimize rewards)
-- Monitor reward metrics instead of loss for progress
-
-### 2. Reward Tracking
-Key metrics to watch:
-- `reward`: Average across all completions
-- `reward_std`: Diversity within groups (should stay > 0.1; values near 0 indicate mode collapse)
-- `kl`: KL divergence from reference (should grow moderately)
-
-**Healthy Training Pattern:**
-```
-Step   Reward    Reward_Std   KL
-100    0.5       0.3          0.02
-200    0.8       0.25         0.05
-300    1.2       0.2          0.08  ← Good progression
-400    1.5       0.15         0.12
-```
-
-**Warning Signs:**
-- Reward std → 0 (model collapsing to single response)
-- KL exploding (> 0.5) (diverging too much, reduce LR)
-- Reward stuck (reward functions too harsh or model capacity issue)
-
-### 3. Common Pitfalls and Solutions
-
-| Problem | Symptom | Solution |
-|---------|---------|----------|
-| **Mode collapse** | All completions identical | Increase `num_generations`, add diversity penalty |
-| **No learning** | Flat rewards | Check reward function logic, increase LR |
-| **OOM errors** | GPU memory exceeded | Reduce `num_generations`, enable gradient checkpointing |
-| **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
-| **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
-
----
-
-## Advanced Patterns
-
-### 1. Multi-Stage Training
-For complex tasks, train in stages:
-
-```python
-# Stage 1: Format compliance (epochs=1)
-trainer_stage1 = GRPOTrainer(
-    model=model,
-    reward_funcs=[incremental_format_reward, format_reward],
-    ...
-)
-trainer_stage1.train()
-
-# Stage 2: Correctness (epochs=1)
-trainer_stage2 = GRPOTrainer(
-    model=model,
-    reward_funcs=[format_reward, correctness_reward],
-    ...
-)
-trainer_stage2.train()
-```
-
-### 2. Adaptive Reward Scaling
-```python
-class AdaptiveReward:
-    def __init__(self, base_reward_func, initial_weight=1.0):
-        self.func = base_reward_func
-        self.weight = initial_weight
-
-    def __call__(self, *args, **kwargs):
-        rewards = self.func(*args, **kwargs)
-        return [r * self.weight for r in rewards]
-
-    def adjust_weight(self, success_rate):
-        """Increase weight if model struggling, decrease if succeeding."""
-        if success_rate < 0.3:
-            self.weight *= 1.2
-        elif success_rate > 0.8:
-            self.weight *= 0.9
-```
-
-### 3. Custom Dataset Integration
-```python
-def load_custom_knowledge_base(csv_path):
-    """Example: School communication platform docs."""
-    import pandas as pd
-    from datasets import Dataset  # needed for Dataset.from_pandas below
-    df = pd.read_csv(csv_path)
-
-    dataset = Dataset.from_pandas(df).map(lambda x: {
-        'prompt': [
-            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
-            {'role': 'user', 'content': x['question']}
-        ],
-        'answer': x['expert_answer']
-    })
-    return dataset
-```
-
---
-
-## Deployment and Inference
-
-### Save and Merge LoRA
-```python
-# Merge LoRA adapters into base model
-if hasattr(trainer.model, 'merge_and_unload'):
-    merged_model = trainer.model.merge_and_unload()
-    merged_model.save_pretrained("production_model")
-    tokenizer.save_pretrained("production_model")
-```
-
-### Inference Example
-```python
-from transformers import pipeline
-
-generator = pipeline(
-    "text-generation",
-    model="production_model",
-    tokenizer=tokenizer
-)
-
-result = generator(
-    [
-        {'role': 'system', 'content': SYSTEM_PROMPT},
-        {'role': 'user', 'content': "What is 15 + 27?"}
-    ],
-    max_new_tokens=256,
-    do_sample=True,
-    temperature=0.7,
-    top_p=0.9
-)
-print(result[0]['generated_text'])
-```
-
---
-
-## Best Practices Checklist
-
-**Before Training:**
- [ ] Validate dataset format (prompts as List[Dict])
- [ ] Test reward functions on sample data
- [ ] Calculate expected max_prompt_length from data
- [ ] Choose appropriate num_generations based on GPU memory
- [ ] Set up logging (wandb recommended)
-
-**During Training:**
- [ ] Monitor reward progression (should increase)
- [ ] Check reward_std (should stay > 0.1)
- [ ] Watch for OOM errors (reduce batch size if needed)
- [ ] Sample generations every 50-100 steps
- [ ] Validate format compliance on holdout set
-
-**After Training:**
- [ ] Merge LoRA weights if using PEFT
- [ ] Test on diverse prompts
- [ ] Compare to baseline model
- [ ] Document reward weights and hyperparameters
- [ ] Save reproducibility config
-
---
-
-## Troubleshooting Guide
-
-### Debugging Workflow
-1. **Isolate reward functions** - Test each independently
-2. **Check data distribution** - Ensure diversity in prompts
-3. **Reduce complexity** - Start with single reward, add gradually
-4. **Monitor generations** - Print samples every N steps
-5. **Validate extraction logic** - Ensure answer parsing works
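
Step 5 can be exercised with no model or trainer involved. The sketch below reuses the `extract_xml_tag` helper from the basic training template and runs it over a few hand-written strings; the test cases themselves are illustrative assumptions, not real model outputs.

```python
import re

def extract_xml_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or ""."""
    match = re.search(f"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

# Map each sample completion to the answer we expect to extract from it
cases = {
    "<reasoning>r</reasoning>\n<answer> 42 </answer>": "42",
    "<answer>\n3.5\n</answer> trailing text": "3.5",
    "The answer is 42.": "",   # no tags at all
    "<answer>42": "",          # closing tag missing
}

results = {text: extract_xml_tag(text, "answer") for text in cases}
for text, expected in cases.items():
    status = "ok" if results[text] == expected else "MISMATCH"
    print(f"{status}: {text!r} -> {results[text]!r}")
```

Cases such as a missing closing tag matter because the correctness reward compares the extracted string to the ground truth, so a silent empty extraction looks identical to a wrong answer.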
-
-### Quick Fixes
-```python
-# Debug reward function
-def debug_reward(completions, **kwargs):
-    responses = [comp[0]['content'] for comp in completions]
-    for i, r in enumerate(responses[:2]):  # Print first 2
-        print(f"Response {i}: {r[:200]}...")
-    return [1.0] * len(responses)  # Dummy rewards
-
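-# Test without training: call the reward function directly on a hand-built
-# batch (no trainer or GPU needed)
-sample = [[{'role': 'assistant', 'content': '<answer>42</answer>'}]]
-print(debug_reward(sample))  # prints the response preview, then [1.0]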
-```
-
---
-
-## References and Resources
-
-**Official Documentation:**
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
- DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948
- Unsloth Docs: https://docs.unsloth.ai/
-
-**Example Repositories:**
- Open R1 Implementation: https://github.com/huggingface/open-r1
- TRL Examples: https://github.com/huggingface/trl/tree/main/examples
-
-**Recommended Reading:**
- Progressive Disclosure Pattern for agent instructions
- Reward shaping in RL (Ng et al.)
- LoRA paper (Hu et al., 2021)
-
---
-
-## Usage Instructions for Agents
-
-When this skill is loaded:
-
-1. **Read this entire file** before implementing GRPO training
-2. **Start with the simplest reward function** (e.g., length-based) to validate setup
-3. **Use the templates** in `templates/` directory as starting points
-4. **Reference examples** in `examples/` for task-specific implementations
-5. **Follow the workflow** sequentially (don't skip steps)
-6. **Debug incrementally** - add one reward function at a time
-
-**Critical Reminders:**
- Always use multiple reward functions (3-5 is optimal)
- Monitor reward metrics, not loss
- Test reward functions before training
- Start small (num_generations=4), scale up gradually
- Save checkpoints frequently (every 100 steps)
-
-This skill is designed for **expert-level implementation**. Beginners should start with supervised fine-tuning before attempting GRPO.
-
-
-
--- a/skills/mlops/grpo-rl-training/templates/basic_grpo_training.py
+++ b/skills/mlops/grpo-rl-training/templates/basic_grpo_training.py
@@ -1,228 +0,0 @@
-"""
-Basic GRPO Training Template
-=============================
-
-A minimal, production-ready template for GRPO training with TRL.
-Adapt this for your specific task by modifying:
-1. Dataset loading (get_dataset function)
-2. Reward functions (reward_*_func)
-3. System prompt (SYSTEM_PROMPT)
-4. Hyperparameters (GRPOConfig)
-"""
-
-import torch
-import re
-from datasets import load_dataset, Dataset
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from peft import LoraConfig
-from trl import GRPOTrainer, GRPOConfig
-
-# ==================== CONFIGURATION ====================
-
-MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
-OUTPUT_DIR = "outputs/grpo-model"
-MAX_PROMPT_LENGTH = 256
-MAX_COMPLETION_LENGTH = 512
-
-SYSTEM_PROMPT = """
-Respond in the following format:
-<reasoning>
-[Your step-by-step thinking]
-</reasoning>
-<answer>
-[Final answer]
-</answer>
-"""
-
-# ==================== DATASET ====================
-
-def get_dataset(split="train"):
-    """
-    Load and prepare your dataset.
-
-    Returns: Dataset with columns:
-    - 'prompt': List[Dict] with role/content
-    - 'answer': str (ground truth, optional)
-    """
-    # Example: GSM8K math dataset
-    data = load_dataset('openai/gsm8k', 'main')[split]
-
-    def process_example(x):
-        # Extract ground truth answer
-        answer = x['answer'].split('####')[1].strip() if '####' in x['answer'] else None
-
-        return {
-            'prompt': [
-                {'role': 'system', 'content': SYSTEM_PROMPT},
-                {'role': 'user', 'content': x['question']}
-            ],
-            'answer': answer
-        }
-
-    return data.map(process_example)
-
-# ==================== HELPER FUNCTIONS ====================
-
-def extract_xml_tag(text: str, tag: str) -> str:
-    """Extract content between XML tags."""
-    pattern = f'<{tag}>(.*?)</{tag}>'
-    match = re.search(pattern, text, re.DOTALL)
-    return match.group(1).strip() if match else ""
-
-def extract_answer(text: str) -> str:
-    """Extract the final answer from structured output."""
-    return extract_xml_tag(text, 'answer')
-
-# ==================== REWARD FUNCTIONS ====================
-
-def correctness_reward_func(prompts, completions, answer, **kwargs):
-    """
-    Reward correct answers.
-    Weight: 2.0 (highest priority)
-    """
-    responses = [comp[0]['content'] for comp in completions]
-    extracted = [extract_answer(r) for r in responses]
-    return [2.0 if ans == gt else 0.0 for ans, gt in zip(extracted, answer)]
-
-def format_reward_func(completions, **kwargs):
-    """
-    Reward proper XML format.
-    Weight: 0.5
-    """
-    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
-    responses = [comp[0]['content'] for comp in completions]
-    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]
-
-def incremental_format_reward_func(completions, **kwargs):
-    """
-    Incremental reward for partial format compliance.
-    Weight: up to 0.5
-    """
-    responses = [comp[0]['content'] for comp in completions]
-    rewards = []
-
-    for r in responses:
-        score = 0.0
-        if '<reasoning>' in r:
-            score += 0.125
-        if '</reasoning>' in r:
-            score += 0.125
-        if '<answer>' in r:
-            score += 0.125
-        if '</answer>' in r:
-            score += 0.125
-
-        # Penalize extra content after closing tag
-        if '</answer>' in r:
-            extra = r.split('</answer>')[-1].strip()
-            score -= len(extra) * 0.001
-
-        rewards.append(score)
-
-    return rewards
-
-# ==================== MODEL SETUP ====================
-
-def setup_model_and_tokenizer():
-    """Load model and tokenizer with optimizations."""
-    model = AutoModelForCausalLM.from_pretrained(
-        MODEL_NAME,
-        torch_dtype=torch.bfloat16,
-        attn_implementation="flash_attention_2",  # requires the flash-attn package; use "sdpa" if it is not installed
-        device_map="auto"
-    )
-
-    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
-    if tokenizer.pad_token is None:  # keep the model's own pad token when it defines one
-        tokenizer.pad_token = tokenizer.eos_token
-
-    return model, tokenizer
-
-def get_peft_config():
-    """LoRA configuration for parameter-efficient training."""
-    return LoraConfig(
-        r=16,
-        lora_alpha=32,
-        target_modules=[
-            "q_proj", "k_proj", "v_proj", "o_proj",
-            "gate_proj", "up_proj", "down_proj"
-        ],
-        task_type="CAUSAL_LM",
-        lora_dropout=0.05,
-    )
-
-# ==================== TRAINING ====================
-
-def main():
-    """Main training function."""
-
-    # Load data
-    print("Loading dataset...")
-    dataset = get_dataset()
-    print(f"Dataset size: {len(dataset)}")
-
-    # Setup model
-    print("Loading model...")
-    model, tokenizer = setup_model_and_tokenizer()
-
-    # Training configuration
-    training_args = GRPOConfig(
-        output_dir=OUTPUT_DIR,
-        run_name="grpo-training",
-
-        # Learning rate
-        learning_rate=5e-6,
-        adam_beta1=0.9,
-        adam_beta2=0.99,
-        weight_decay=0.1,
-        warmup_ratio=0.1,
-        lr_scheduler_type='cosine',
-
-        # Batch settings
-        per_device_train_batch_size=1,
-        gradient_accumulation_steps=4,
-
-        # GRPO specific
-        num_generations=8,
-        max_prompt_length=MAX_PROMPT_LENGTH,
-        max_completion_length=MAX_COMPLETION_LENGTH,
-
-        # Training duration
-        num_train_epochs=1,
-
-        # Optimization
-        bf16=True,
-        optim="adamw_8bit",
-        max_grad_norm=0.1,
-
-        # Logging
-        logging_steps=1,
-        save_steps=100,
-        report_to="wandb",  # Change to "none" to disable logging
-    )
-
-    # Initialize trainer
-    trainer = GRPOTrainer(
-        model=model,
-        processing_class=tokenizer,
-        reward_funcs=[
-            incremental_format_reward_func,
-            format_reward_func,
-            correctness_reward_func,
-        ],
-        args=training_args,
-        train_dataset=dataset,
-        peft_config=get_peft_config(),
-    )
-
-    # Train
-    print("Starting training...")
-    trainer.train()
-
-    # Save final model
-    print(f"Saving model to {OUTPUT_DIR}/final")
-    trainer.save_model(f"{OUTPUT_DIR}/final")
-
-    print("Training complete!")
-
-if __name__ == "__main__":
-    main()
--- a/skills/mlops/guidance/SKILL.md
+++ b/skills/mlops/guidance/SKILL.md
@@ -1,575 +0,0 @@
---
-name: guidance
-description: Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [guidance, transformers]
-metadata:
-  hermes:
-    tags: [Prompt Engineering, Guidance, Constrained Generation, Structured Output, JSON Validation, Grammar, Microsoft Research, Format Enforcement, Multi-Step Workflows]
-
---
-
-# Guidance: Constrained LLM Generation
-
-## When to Use This Skill
-
-Use Guidance when you need to:
- **Control LLM output syntax** with regex or grammars
- **Guarantee valid JSON/XML/code** generation
- **Reduce latency** vs traditional prompting approaches
- **Enforce structured formats** (dates, emails, IDs, etc.)
- **Build multi-step workflows** with Pythonic control flow
- **Prevent invalid outputs** through grammatical constraints
-
-**GitHub Stars**: 18,000+ | **From**: Microsoft Research
-
-## Installation
-
-```bash
-# Base installation
-pip install guidance
-
-# With specific backends
-pip install guidance[transformers]  # Hugging Face models
-pip install guidance[llama_cpp]     # llama.cpp models
-```
-
-## Quick Start
-
-### Basic Example: Structured Generation
-
-```python
-from guidance import models, gen
-
-# Load model (supports OpenAI, Transformers, llama.cpp)
-lm = models.OpenAI("gpt-4")
-
-# Generate with constraints
-result = lm + "The capital of France is " + gen("capital", max_tokens=5)
-
-print(result["capital"])  # "Paris"
-```
-
-### With Anthropic Claude
-
-```python
-from guidance import models, gen, system, user, assistant
-
-# Configure Claude
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Use context managers for chat format
-with system():
-    lm += "You are a helpful assistant."
-
-with user():
-    lm += "What is the capital of France?"
-
-with assistant():
-    lm += gen(max_tokens=20)
-```
-
-## Core Concepts
-
-### 1. Context Managers
-
-Guidance uses Pythonic context managers for chat-style interactions.
-
-```python
-from guidance import system, user, assistant, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# System message
-with system():
-    lm += "You are a JSON generation expert."
-
-# User message
-with user():
-    lm += "Generate a person object with name and age."
-
-# Assistant response
-with assistant():
-    lm += gen("response", max_tokens=100)
-
-print(lm["response"])
-```
-
-**Benefits:**
- Natural chat flow
- Clear role separation
- Easy to read and maintain
-
-### 2. Constrained Generation
-
-Guidance ensures outputs match specified patterns using regex or grammars.
-
-#### Regex Constraints
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Constrain to valid email format
-lm += "Email: " + gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
-
-# Constrain to date format (YYYY-MM-DD)
-lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}")
-
-# Constrain to phone number
-lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}")
-
-print(lm["email"])  # Guaranteed valid email
-print(lm["date"])   # Guaranteed YYYY-MM-DD format
-```
-
-**How it works:**
- Regex converted to grammar at token level
- Invalid tokens filtered during generation
- Model can only produce matching outputs
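
The filtering step can be illustrated with a toy example. This is a sketch of the general idea only, not Guidance's implementation: the vocabulary, `is_viable_prefix`, and `allowed_tokens` are hypothetical, and the date pattern is hard-coded as a character template instead of being compiled from an arbitrary regex.

```python
import re

DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
TEMPLATE = "dddd-dd-dd"  # 'd' = any ASCII digit, '-' = literal dash

def is_viable_prefix(text):
    """True if text can still be extended into a full YYYY-MM-DD string."""
    if len(text) > len(TEMPLATE):
        return False
    return all(
        (ch in "0123456789") if slot == "d" else (ch == slot)
        for ch, slot in zip(text, TEMPLATE)
    )

def allowed_tokens(generated, vocab):
    """Keep only the tokens that leave the output a viable prefix."""
    return [tok for tok in vocab if is_viable_prefix(generated + tok)]

vocab = ["20", "24", "-", "-0", "9", "Sept", " ", "15"]
print(allowed_tokens("", vocab))      # digit-only tokens survive
print(allowed_tokens("2024", vocab))  # only tokens starting with a dash survive
print(bool(DATE.fullmatch("2024-09-15")))
```

At each step the sampler would choose only among the surviving tokens, which is why the final string cannot violate the pattern.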
-
-#### Selection Constraints
-
-```python
-from guidance import models, gen, select
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Constrain to specific choices
-lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
-
-# Multiple-choice selection
-lm += "Best answer: " + select(
-    ["A) Paris", "B) London", "C) Berlin", "D) Madrid"],
-    name="answer"
-)
-
-print(lm["sentiment"])  # One of: positive, negative, neutral
-print(lm["answer"])     # One of the four full option strings, e.g. "A) Paris"
-```
-
-### 3. Token Healing
-
-Guidance automatically "heals" token boundaries between prompt and generation.
-
-**Problem:** Tokenization creates unnatural boundaries.
-
-```python
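-# Without token healing
-prompt = "The capital of France is "
-# The prompt ends in a standalone space token: " "
-# The model's most natural next token is " Paris" (leading space included)
-# Result: "The capital of France is  Paris" (double space), or a less
-# likely continuation forced by the unusual token boundary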
-```
-
-**Solution:** Guidance backs up one token and regenerates.
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Token healing enabled by default
-lm += "The capital of France is " + gen("capital", max_tokens=5)
-# Result: "The capital of France is Paris" (correct spacing)
-```
-
-**Benefits:**
- Natural text boundaries
- No awkward spacing issues
- Better model performance (sees natural token sequences)
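
The back-up-and-regenerate step can be sketched with a toy vocabulary. This is an illustration under stated assumptions, not Guidance's actual algorithm: `VOCAB`, the pre-tokenized prompt, and `heal` are all hypothetical.

```python
VOCAB = ["The", " capital", " is", " ", " Paris", "Paris", " London"]

def heal(prompt_tokens, vocab):
    """Drop the last prompt token and return (kept_tokens, candidates).

    Candidates are the vocabulary tokens that start with the dropped text,
    so the model can re-choose a token spanning the prompt/generation boundary.
    """
    *kept, last = prompt_tokens
    candidates = [tok for tok in vocab if tok.startswith(last)]
    return kept, candidates

prompt_tokens = ["The", " capital", " is", " "]  # prompt ends with a lone space
kept, candidates = heal(prompt_tokens, VOCAB)
print(kept)
print(candidates)
```

Because " Paris" is among the candidates while "Paris" is not, the regenerated token absorbs the trailing space instead of producing a second one.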
-
-### 4. Grammar-Based Generation
-
-Define complex structures using context-free grammars.
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# JSON grammar (simplified)
-json_grammar = """
-{
-    "name": <gen name regex="[A-Za-z ]+" max_tokens=20>,
-    "age": <gen age regex="[0-9]+" max_tokens=3>,
-    "email": <gen email regex="[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}" max_tokens=50>
-}
-"""
-
-# Generate valid JSON
-lm += gen("person", grammar=json_grammar)
-
-print(lm["person"])  # Guaranteed valid JSON structure
-```
-
-**Use cases:**
- Complex structured outputs
- Nested data structures
- Programming language syntax
- Domain-specific languages
-
-### 5. Guidance Functions
-
-Create reusable generation patterns with the `@guidance` decorator.
-
-```python
-from guidance import guidance, gen, models
-
-@guidance
-def generate_person(lm):
-    """Generate a person with name and age."""
-    lm += "Name: " + gen("name", max_tokens=20, stop="\n")
-    lm += "\nAge: " + gen("age", regex=r"[0-9]+", max_tokens=3)
-    return lm
-
-# Use the function
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_person(lm)
-
-print(lm["name"])
-print(lm["age"])
-```
-
-**Stateful Functions:**
-
-```python
-@guidance(stateless=False)
-def react_agent(lm, question, tools, max_rounds=5):
-    """ReAct agent with tool use."""
-    lm += f"Question: {question}\n\n"
-
-    for i in range(max_rounds):
-        # Thought
-        lm += f"Thought {i+1}: " + gen("thought", stop="\n")
-
-        # Action
-        lm += "\nAction: " + select(list(tools.keys()), name="action")
-
-        # Execute tool
-        tool_result = tools[lm["action"]]()
-        lm += f"\nObservation: {tool_result}\n\n"
-
-        # Check if done
-        lm += "Done? " + select(["Yes", "No"], name="done")
-        if lm["done"] == "Yes":
-            break
-
-    # Final answer
-    lm += "\nFinal Answer: " + gen("answer", max_tokens=100)
-    return lm
-```
-
-## Backend Configuration
-
-### Anthropic Claude
-
-```python
-from guidance import models
-
-lm = models.Anthropic(
-    model="claude-sonnet-4-5-20250929",
-    api_key="your-api-key"  # Or set ANTHROPIC_API_KEY env var
-)
-```
-
-### OpenAI
-
-```python
-lm = models.OpenAI(
-    model="gpt-4o-mini",
-    api_key="your-api-key"  # Or set OPENAI_API_KEY env var
-)
-```
-
-### Local Models (Transformers)
-
-```python
-from guidance.models import Transformers
-
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda"  # Or "cpu"
-)
-```
-
-### Local Models (llama.cpp)
-
-```python
-from guidance.models import LlamaCpp
-
-lm = LlamaCpp(
-    model_path="/path/to/model.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35
-)
-```
-
-## Common Patterns
-
-### Pattern 1: JSON Generation
-
-```python
-from guidance import models, gen, system, user, assistant
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-with system():
-    lm += "You generate valid JSON."
-
-with user():
-    lm += "Generate a user profile with name, age, and email."
-
-with assistant():
-    lm += """{
-    "name": """ + gen("name", regex=r'"[A-Za-z ]+"', max_tokens=30) + """,
-    "age": """ + gen("age", regex=r"[0-9]+", max_tokens=3) + """,
-    "email": """ + gen("email", regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"', max_tokens=50) + """
-}"""
-
-print(lm)  # Valid JSON guaranteed
-```
-
-### Pattern 2: Classification
-
-```python
-from guidance import models, gen, select
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-text = "This product is amazing! I love it."
-
-lm += f"Text: {text}\n"
-lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
-lm += "\nConfidence: " + gen("confidence", regex=r"[0-9]+", max_tokens=3) + "%"
-
-print(f"Sentiment: {lm['sentiment']}")
-print(f"Confidence: {lm['confidence']}%")
-```
-
-### Pattern 3: Multi-Step Reasoning
-
-```python
-from guidance import models, gen, guidance
-
-@guidance
-def chain_of_thought(lm, question):
-    """Generate answer with step-by-step reasoning."""
-    lm += f"Question: {question}\n\n"
-
-    # Generate multiple reasoning steps
-    for i in range(3):
-        lm += f"Step {i+1}: " + gen(f"step_{i+1}", stop="\n", max_tokens=100) + "\n"
-
-    # Final answer
-    lm += "\nTherefore, the answer is: " + gen("answer", max_tokens=50)
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = chain_of_thought(lm, "What is 15% of 200?")
-
-print(lm["answer"])
-```
-
-### Pattern 4: ReAct Agent
-
-```python
-from guidance import models, gen, select, guidance
-
-@guidance(stateless=False)
-def react_agent(lm, question):
-    """ReAct agent with tool use."""
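-    tools = {
-        # WARNING: eval() on model-generated text is unsafe; demo only.
-        # Use a restricted expression parser in real deployments.
-        "calculator": lambda expr: eval(expr),
-        "search": lambda query: f"Search results for: {query}",
-    }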
-
-    lm += f"Question: {question}\n\n"
-
-    for _ in range(5):  # cap the loop at 5 rounds; avoids shadowing the built-in round()
-        # Thought
-        lm += "Thought: " + gen("thought", stop="\n") + "\n"
-
-        # Action selection
-        lm += "Action: " + select(["calculator", "search", "answer"], name="action")
-
-        if lm["action"] == "answer":
-            lm += "\nFinal Answer: " + gen("answer", max_tokens=100)
-            break
-
-        # Action input
-        lm += "\nAction Input: " + gen("action_input", stop="\n") + "\n"
-
-        # Execute tool
-        if lm["action"] in tools:
-            result = tools[lm["action"]](lm["action_input"])
-            lm += f"Observation: {result}\n\n"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = react_agent(lm, "What is 25 * 4 + 10?")
-print(lm["answer"])
-```
-
-### Pattern 5: Data Extraction
-
-```python
-from guidance import models, gen, guidance
-
-@guidance
-def extract_entities(lm, text):
-    """Extract structured entities from text."""
-    lm += f"Text: {text}\n\n"
-
-    # Extract person
-    lm += "Person: " + gen("person", stop="\n", max_tokens=30) + "\n"
-
-    # Extract organization
-    lm += "Organization: " + gen("organization", stop="\n", max_tokens=30) + "\n"
-
-    # Extract date
-    lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}", max_tokens=10) + "\n"
-
-    # Extract location
-    lm += "Location: " + gen("location", stop="\n", max_tokens=30) + "\n"
-
-    return lm
-
-text = "Tim Cook announced at Apple Park on 2024-09-15 in Cupertino."
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = extract_entities(lm, text)
-
-print(f"Person: {lm['person']}")
-print(f"Organization: {lm['organization']}")
-print(f"Date: {lm['date']}")
-print(f"Location: {lm['location']}")
-```
-
-## Best Practices
-
-### 1. Use Regex for Format Validation
-
-```python
-# ✅ Good: Regex ensures valid format
-lm += "Email: " + gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
-
-# ❌ Bad: Free generation may produce invalid emails
-lm += "Email: " + gen("email", max_tokens=50)
-```
-
-### 2. Use select() for Fixed Categories
-
-```python
-# ✅ Good: Guaranteed valid category
-lm += "Status: " + select(["pending", "approved", "rejected"], name="status")
-
-# ❌ Bad: May generate typos or invalid values
-lm += "Status: " + gen("status", max_tokens=20)
-```
-
-### 3. Leverage Token Healing
-
-```python
-# Token healing is enabled by default
-# No special action needed - just concatenate naturally
-lm += "The capital is " + gen("capital")  # Automatic healing
-```
-
-### 4. Use stop Sequences
-
-```python
-# ✅ Good: Stop at newline for single-line outputs
-lm += "Name: " + gen("name", stop="\n")
-
-# ❌ Bad: May generate multiple lines
-lm += "Name: " + gen("name", max_tokens=50)
-```
-
-### 5. Create Reusable Functions
-
-```python
-# ✅ Good: Reusable pattern
-@guidance
-def generate_person(lm):
-    lm += "Name: " + gen("name", stop="\n")
-    lm += "\nAge: " + gen("age", regex=r"[0-9]+")
-    return lm
-
-# Use multiple times
-lm = generate_person(lm)
-lm += "\n\n"
-lm = generate_person(lm)
-```
-
-### 6. Balance Constraints
-
-```python
-# ✅ Good: Reasonable constraints
-lm += gen("name", regex=r"[A-Za-z ]+", max_tokens=30)
-
-# ❌ Too strict: May fail or be very slow
-lm += gen("name", regex=r"^(John|Jane)$", max_tokens=10)
-```
-
-## Comparison to Alternatives
-
-| Feature | Guidance | Instructor | Outlines | LMQL |
-|---------|----------|------------|----------|------|
-| Regex Constraints | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
-| Grammar Support | ✅ CFG | ❌ No | ✅ CFG | ✅ CFG |
-| Pydantic Validation | ❌ No | ✅ Yes | ✅ Yes | ❌ No |
-| Token Healing | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
-| Local Models | ✅ Yes | ⚠️ Limited | ✅ Yes | ✅ Yes |
-| API Models | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Yes |
-| Pythonic Syntax | ✅ Yes | ✅ Yes | ✅ Yes | ❌ SQL-like |
-| Learning Curve | Low | Low | Medium | High |
-
-**When to choose Guidance:**
- Need regex/grammar constraints
- Want token healing
- Building complex workflows with control flow
- Using local models (Transformers, llama.cpp)
- Prefer Pythonic syntax
-
-**When to choose alternatives:**
- Instructor: Need Pydantic validation with automatic retrying
- Outlines: Need JSON schema validation
- LMQL: Prefer declarative query syntax
-
-## Performance Characteristics
-
-**Latency Reduction:**
- 30-50% faster than traditional prompting for constrained outputs
- Token healing reduces unnecessary regeneration
- Grammar constraints prevent invalid token generation
-
-**Memory Usage:**
- Minimal overhead vs unconstrained generation
- Grammar compilation cached after first use
- Efficient token filtering at inference time
-
-**Token Efficiency:**
- Prevents wasted tokens on invalid outputs
- No need for retry loops
- Direct path to valid outputs
-
-## Resources
-
- **Documentation**: https://guidance.readthedocs.io
- **GitHub**: https://github.com/guidance-ai/guidance (18k+ stars)
- **Notebooks**: https://github.com/guidance-ai/guidance/tree/main/notebooks
- **Discord**: Community support available
-
-## See Also
-
- `references/constraints.md` - Comprehensive regex and grammar patterns
- `references/backends.md` - Backend-specific configuration
- `references/examples.md` - Production-ready examples
-
-
--- a/skills/mlops/guidance/references/backends.md
+++ b/skills/mlops/guidance/references/backends.md
@@ -1,554 +0,0 @@
-# Backend Configuration Guide
-
-Complete guide to configuring Guidance with different LLM backends.
-
-## Table of Contents
- API-Based Models (Anthropic, OpenAI)
- Local Models (Transformers, llama.cpp)
- Backend Comparison
- Performance Tuning
- Advanced Configuration
-
-## API-Based Models
-
-### Anthropic Claude
-
-#### Basic Setup
-
-```python
-from guidance import models
-
-# Using environment variable
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-# Reads ANTHROPIC_API_KEY from environment
-
-# Explicit API key
-lm = models.Anthropic(
-    model="claude-sonnet-4-5-20250929",
-    api_key="your-api-key-here"
-)
-```
-
-#### Available Models
-
-```python
-# Claude Sonnet 4.5 (latest, recommended)
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Claude 3.7 Sonnet (earlier generation)
-lm = models.Anthropic("claude-3-7-sonnet-20250219")
-
-# Claude 3 Opus (Most capable)
-lm = models.Anthropic("claude-3-opus-20240229")
-
-# Claude 3.5 Haiku (Fastest, cheapest)
-lm = models.Anthropic("claude-3-5-haiku-20241022")
-```
-
-#### Configuration Options
-
-```python
-lm = models.Anthropic(
-    model="claude-sonnet-4-5-20250929",
-    api_key="your-api-key",
-    max_tokens=4096,           # Max tokens to generate
-    temperature=0.7,            # Sampling temperature (0-1)
-    top_p=0.9,                  # Nucleus sampling
-    timeout=30,                 # Request timeout (seconds)
-    max_retries=3              # Retry failed requests
-)
-```
-
-#### With Context Managers
-
-```python
-from guidance import models, system, user, assistant, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-with system():
-    lm += "You are a helpful assistant."
-
-with user():
-    lm += "What is the capital of France?"
-
-with assistant():
-    lm += gen(max_tokens=50)
-
-print(lm)
-```
-
-### OpenAI
-
-#### Basic Setup
-
-```python
-from guidance import models
-
-# Using environment variable
-lm = models.OpenAI("gpt-4o")
-# Reads OPENAI_API_KEY from environment
-
-# Explicit API key
-lm = models.OpenAI(
-    model="gpt-4o",
-    api_key="your-api-key-here"
-)
-```
-
-#### Available Models
-
-```python
-# GPT-4o (Latest, multimodal)
-lm = models.OpenAI("gpt-4o")
-
-# GPT-4o Mini (Fast, cost-effective)
-lm = models.OpenAI("gpt-4o-mini")
-
-# GPT-4 Turbo
-lm = models.OpenAI("gpt-4-turbo")
-
-# GPT-3.5 Turbo (Cheapest)
-lm = models.OpenAI("gpt-3.5-turbo")
-```
-
-#### Configuration Options
-
-```python
-lm = models.OpenAI(
-    model="gpt-4o-mini",
-    api_key="your-api-key",
-    max_tokens=2048,
-    temperature=0.7,
-    top_p=1.0,
-    frequency_penalty=0.0,
-    presence_penalty=0.0,
-    timeout=30
-)
-```
-
-#### Chat Format
-
-```python
-from guidance import models, gen, system, user, assistant
-
-lm = models.OpenAI("gpt-4o-mini")
-
-# Chat models take role blocks rather than a raw list of message dicts
-with system():
-    lm += "You are a helpful assistant."
-
-with user():
-    lm += "What is 2+2?"
-
-# Generate response
-with assistant():
-    lm += gen(max_tokens=50)
-```
-
-### Azure OpenAI
-
-```python
-from guidance import models
-
-lm = models.AzureOpenAI(
-    model="gpt-4o",
-    azure_endpoint="https://your-resource.openai.azure.com/",
-    api_key="your-azure-api-key",
-    api_version="2024-02-15-preview",
-    deployment_name="your-deployment-name"
-)
-```
-
-## Local Models
-
-### Transformers (Hugging Face)
-
-#### Basic Setup
-
-```python
-from guidance.models import Transformers
-
-# Load model from Hugging Face
-lm = Transformers("microsoft/Phi-4-mini-instruct")
-```
-
-#### GPU Configuration
-
-```python
-# Use GPU
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda"
-)
-
-# Use specific GPU
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda:0"  # GPU 0
-)
-
-# Use CPU
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cpu"
-)
-```
-
-#### Advanced Configuration
-
-```python
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda",
-    torch_dtype="float16",      # Use FP16 (faster, less memory)
-    load_in_8bit=True,          # 8-bit quantization
-    max_memory={0: "20GB"},     # GPU memory limit
-    offload_folder="./offload"  # Offload to disk if needed
-)
-```
-
-#### Popular Models
-
-```python
-# Phi-4 (Microsoft)
-lm = Transformers("microsoft/Phi-4-mini-instruct")
-lm = Transformers("microsoft/Phi-3-medium-4k-instruct")
-
-# Llama 3 (Meta)
-lm = Transformers("meta-llama/Llama-3.1-8B-Instruct")
-lm = Transformers("meta-llama/Llama-3.1-70B-Instruct")
-
-# Mistral (Mistral AI)
-lm = Transformers("mistralai/Mistral-7B-Instruct-v0.3")
-lm = Transformers("mistralai/Mixtral-8x7B-Instruct-v0.1")
-
-# Qwen (Alibaba)
-lm = Transformers("Qwen/Qwen2.5-7B-Instruct")
-
-# Gemma (Google)
-lm = Transformers("google/gemma-2-9b-it")
-```
-
-#### Generation Configuration
-
-```python
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda"
-)
-
-# Configure generation
-from guidance import gen
-
-result = lm + gen(
-    max_tokens=100,
-    temperature=0.7,
-    top_p=0.9,
-    top_k=50,
-    repetition_penalty=1.1
-)
-```
-
-### llama.cpp
-
-#### Basic Setup
-
-```python
-from guidance.models import LlamaCpp
-
-# Load GGUF model
-lm = LlamaCpp(
-    model_path="/path/to/model.gguf",
-    n_ctx=4096  # Context window
-)
-```
-
-#### GPU Configuration
-
-```python
-# Use GPU acceleration
-lm = LlamaCpp(
-    model_path="/path/to/model.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35,  # Offload 35 layers to GPU
-    n_threads=8       # CPU threads for remaining layers
-)
-
-# Full GPU offload
-lm = LlamaCpp(
-    model_path="/path/to/model.gguf",
-    n_ctx=4096,
-    n_gpu_layers=-1  # Offload all layers
-)
-```
-
-#### Advanced Configuration
-
-```python
-lm = LlamaCpp(
-    model_path="/path/to/llama-3.1-8b-instruct.Q4_K_M.gguf",
-    n_ctx=8192,          # Context window (tokens)
-    n_gpu_layers=35,     # GPU layers
-    n_threads=8,         # CPU threads
-    n_batch=512,         # Batch size for prompt processing
-    use_mmap=True,       # Memory-map the model file
-    use_mlock=False,     # Lock model in RAM
-    seed=42,             # Random seed
-    verbose=False        # Suppress verbose output
-)
-```
-
-#### Quantized Models
-
-```python
-# Q4_K_M (4-bit, recommended for most cases)
-lm = LlamaCpp("/path/to/model.Q4_K_M.gguf")
-
-# Q5_K_M (5-bit, better quality)
-lm = LlamaCpp("/path/to/model.Q5_K_M.gguf")
-
-# Q8_0 (8-bit, high quality)
-lm = LlamaCpp("/path/to/model.Q8_0.gguf")
-
-# F16 (16-bit float, highest quality)
-lm = LlamaCpp("/path/to/model.F16.gguf")
-```
-
-#### Popular GGUF Models
-
-```python
-# Llama 3.1
-lm = LlamaCpp("llama-3.1-8b-instruct.Q4_K_M.gguf")
-
-# Mistral
-lm = LlamaCpp("mistral-7b-instruct-v0.3.Q4_K_M.gguf")
-
-# Phi-4
-lm = LlamaCpp("phi-4-mini-instruct.Q4_K_M.gguf")
-```
-
-## Backend Comparison
-
-### Feature Matrix
-
-| Feature | Anthropic | OpenAI | Transformers | llama.cpp |
-|---------|-----------|--------|--------------|-----------|
-| Constrained Generation | ⚠️ Limited (no logit access) | ⚠️ Limited (no logit access) | ✅ Full | ✅ Full |
-| Token Healing | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
-| Streaming | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
-| GPU Support | N/A | N/A | ✅ Yes | ✅ Yes |
-| Quantization | N/A | N/A | ✅ Yes | ✅ Yes |
-| Cost | $$$ | $$$ | Free | Free |
-| Latency | Low | Low | Medium | Low |
-| Setup Difficulty | Easy | Easy | Medium | Medium |
-
-### Performance Characteristics
-
-**Anthropic Claude:**
- **Latency**: 200-500ms (API call)
- **Throughput**: Limited by API rate limits
- **Cost**: $3-15 per 1M input tokens
- **Best for**: Production systems, high-quality outputs
-
-**OpenAI:**
- **Latency**: 200-400ms (API call)
- **Throughput**: Limited by API rate limits
- **Cost**: $0.15-30 per 1M input tokens
- **Best for**: Cost-sensitive production, gpt-4o-mini
-
-**Transformers:**
- **Latency**: 50-200ms (local inference)
- **Throughput**: GPU-dependent (10-100 tokens/sec)
- **Cost**: Hardware cost only
- **Best for**: Privacy-sensitive, high-volume, experimentation
-
-**llama.cpp:**
- **Latency**: 30-150ms (local inference)
- **Throughput**: Hardware-dependent (20-150 tokens/sec)
- **Cost**: Hardware cost only
- **Best for**: Edge deployment, Apple Silicon, CPU inference
-
-### Memory Requirements
-
-**Transformers (FP16):**
- 7B model: ~14GB GPU VRAM
- 13B model: ~26GB GPU VRAM
- 70B model: ~140GB GPU VRAM (multi-GPU)
-
-**llama.cpp (Q4_K_M):**
- 7B model: ~4.5GB RAM
- 13B model: ~8GB RAM
- 70B model: ~40GB RAM
-
-**Optimization Tips:**
- Use quantized models (Q4_K_M) for lower memory
- Use GPU offloading for faster inference
- Use CPU inference for smaller models (<7B)
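The FP16 figures above follow from simple arithmetic: parameter count times bytes per weight. The sketch below is a rough estimate that assumes the weights dominate memory use, so it ignores the KV cache, activations, and runtime overhead. The helper name `estimate_weight_memory_gb` is hypothetical, and the 4.85 bits-per-weight figure for a 4-bit K-quant is an assumed average, not a value taken from Guidance or llama.cpp.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bits_per_weight / 8


for size in (7, 13, 70):
    fp16 = estimate_weight_memory_gb(size, 16)
    q4 = estimate_weight_memory_gb(size, 4.85)  # assumed average for a 4-bit K-quant
    print(f"{size}B: FP16 ~{fp16:.0f} GB, 4-bit ~{q4:.1f} GB")
```

Treat the results as lower bounds: real usage grows with context length and batch size.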
-
-## Performance Tuning
-
-### API Models (Anthropic, OpenAI)
-
-#### Reduce Latency
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Use lower max_tokens (faster response)
-lm += gen(max_tokens=100)  # Instead of 1000
-
-# Use streaming (perceived latency reduction)
-for chunk in lm.stream(gen(max_tokens=500)):
-    print(chunk, end="", flush=True)
-```
-
-#### Reduce Cost
-
-```python
-# Use cheaper models
-lm = models.Anthropic("claude-3-5-haiku-20241022")  # vs Sonnet
-lm = models.OpenAI("gpt-4o-mini")  # vs gpt-4o
-
-# Reduce context size
-# - Keep prompts concise
-# - Avoid large few-shot examples
-# - Use max_tokens limits
-```
-
-### Local Models (Transformers, llama.cpp)
-
-#### Optimize GPU Usage
-
-```python
-from guidance.models import Transformers
-
-# Load weights in FP16: half the memory of FP32 and usually faster on GPU
-lm = Transformers(
-    "meta-llama/Llama-3.1-8B-Instruct",
-    device="cuda",
-    torch_dtype="float16"
-)
-
-# Use 8-bit quantization: ~4x less memory than FP32 (~2x less than FP16)
-lm = Transformers(
-    "meta-llama/Llama-3.1-8B-Instruct",
-    device="cuda",
-    load_in_8bit=True
-)
-
-# Use flash attention (requires flash-attn package)
-lm = Transformers(
-    "meta-llama/Llama-3.1-8B-Instruct",
-    device="cuda",
-    attn_implementation="flash_attention_2"  # replaces the deprecated use_flash_attention_2=True
-)
-```
-
-#### Optimize llama.cpp
-
-```python
-from guidance.models import LlamaCpp
-
-# Maximize GPU layers
-lm = LlamaCpp(
-    model_path="/path/to/model.Q4_K_M.gguf",
-    n_gpu_layers=-1  # All layers on GPU
-)
-
-# Optimize batch size
-lm = LlamaCpp(
-    model_path="/path/to/model.Q4_K_M.gguf",
-    n_batch=512,     # Larger batch = faster prompt processing
-    n_gpu_layers=-1
-)
-
-# Use Metal (Apple Silicon)
-lm = LlamaCpp(
-    model_path="/path/to/model.Q4_K_M.gguf",
-    n_gpu_layers=-1,  # Use Metal GPU acceleration
-    use_mmap=True
-)
-```
-
-#### Batch Processing
-
-```python
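-# Process multiple requests efficiently
-from guidance import gen
-from guidance.models import Transformers
-
-requests = [
-    "What is 2+2?",
-    "What is the capital of France?",
-    "What is photosynthesis?"
-]
-
-# Bad: Reloads the model weights on every request
-for req in requests:
-    lm = Transformers("microsoft/Phi-4-mini-instruct")
-    lm += req + gen(max_tokens=50)
-
-# Good: Load once and reuse. Model objects are immutable, so `lm + ...`
-# returns a new state and each request starts from the same clean context.
-lm = Transformers("microsoft/Phi-4-mini-instruct")
-for req in requests:
-    result = lm + req + gen("answer", max_tokens=50)
-    print(result["answer"])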
-```
-
-## Advanced Configuration
-
-### Custom Model Configurations
-
-```python
-from transformers import AutoTokenizer, AutoModelForCausalLM
-from guidance.models import Transformers
-
-# Load custom model
-tokenizer = AutoTokenizer.from_pretrained("your-model")
-model = AutoModelForCausalLM.from_pretrained(
-    "your-model",
-    device_map="auto",
-    torch_dtype="float16"
-)
-
-# Use with Guidance
-lm = Transformers(model=model, tokenizer=tokenizer)
-```
-
-### Environment Variables
-
-```bash
-# API keys
-export ANTHROPIC_API_KEY="sk-ant-..."
-export OPENAI_API_KEY="sk-..."
-
-# Transformers cache
-export HF_HOME="/path/to/cache"
-export TRANSFORMERS_CACHE="/path/to/cache"  # Deprecated in recent transformers releases; prefer HF_HOME
-
-# GPU selection
-export CUDA_VISIBLE_DEVICES=0,1  # Use GPU 0 and 1
-```
-
-### Debugging
-
-```python
-# Enable verbose logging
-import logging
-import torch
-from guidance import models
-from guidance.models import Transformers
-
-logging.basicConfig(level=logging.DEBUG)
-
-# Check backend info
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-print(f"Model: {lm.model_name}")
-print(f"Backend: {lm.backend}")
-
-# Check GPU usage (Transformers)
-lm = Transformers("microsoft/Phi-4-mini-instruct", device="cuda")
-print(f"Device: {lm.device}")
-print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
-```
-
-## Resources
-
- **Anthropic Docs**: https://docs.anthropic.com
- **OpenAI Docs**: https://platform.openai.com/docs
- **Hugging Face Models**: https://huggingface.co/models
- **llama.cpp**: https://github.com/ggerganov/llama.cpp
- **GGUF Models**: https://huggingface.co/models?library=gguf
--- a/skills/mlops/guidance/references/constraints.md
+++ b/skills/mlops/guidance/references/constraints.md
@@ -1,674 +0,0 @@
-# Comprehensive Constraint Patterns
-
-Guide to regex constraints, grammar-based generation, and token healing in Guidance.
-
-## Table of Contents
- Regex Constraints
- Grammar-Based Generation
- Token Healing
- Selection Constraints
- Complex Patterns
- Performance Optimization
-
-## Regex Constraints
-
-### Basic Patterns
-
-#### Numeric Constraints
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Integer (positive)
-lm += "Age: " + gen("age", regex=r"[0-9]+")
-
-# Integer (with negatives)
-lm += "Temperature: " + gen("temp", regex=r"-?[0-9]+")
-
-# Float (positive)
-lm += "Price: $" + gen("price", regex=r"[0-9]+\.[0-9]{2}")
-
-# Float (with negatives and optional decimals)
-lm += "Value: " + gen("value", regex=r"-?[0-9]+(\.[0-9]+)?")
-
-# Percentage (0-100)
-lm += "Progress: " + gen("progress", regex=r"(100|[0-9]{1,2})")
-
-# Range (1-5 stars)
-lm += "Rating: " + gen("rating", regex=r"[1-5]") + " stars"
-```
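Before handing a pattern to `gen`, it can help to check offline which strings it admits. The sketch below uses Python's standard `re` module and assumes the pattern behaves the same way under Guidance's regex engine, which may support a narrower syntax. The helper `accepts` is hypothetical.

```python
import re


def accepts(pattern: str, text: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


percentage = r"(100|[0-9]{1,2})"         # pattern from the example above
strict_percentage = r"(100|[1-9]?[0-9])"  # variant that rejects leading zeros
price = r"[0-9]+\.[0-9]{2}"

for candidate in ["0", "42", "100", "101", "07"]:
    print(f"{candidate!r}: loose={accepts(percentage, candidate)}, "
          f"strict={accepts(strict_percentage, candidate)}")
```

The original percentage pattern admits leading zeros such as `07`; the stricter variant is one way to exclude them if that matters for downstream parsing.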
-
-#### Text Constraints
-
-```python
-# Alphabetic only
-lm += "Name: " + gen("name", regex=r"[A-Za-z]+")
-
-# Alphabetic with spaces
-lm += "Full Name: " + gen("full_name", regex=r"[A-Za-z ]+")
-
-# Alphanumeric
-lm += "Username: " + gen("username", regex=r"[A-Za-z0-9_]+")
-
-# Capitalized words
-lm += "Title: " + gen("title", regex=r"[A-Z][a-z]+( [A-Z][a-z]+)*")
-
-# Lowercase only
-lm += "Code: " + gen("code", regex=r"[a-z0-9-]+")
-
-# Specific length
-lm += "ID: " + gen("id", regex=r"[A-Z]{3}-[0-9]{6}")  # e.g., "ABC-123456"
-```
-
-#### Date and Time Constraints
-
-```python
-# Date (YYYY-MM-DD)
-lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}")
-
-# Date (MM/DD/YYYY)
-lm += "Date: " + gen("date_us", regex=r"\d{2}/\d{2}/\d{4}")
-
-# Time (HH:MM)
-lm += "Time: " + gen("time", regex=r"\d{2}:\d{2}")
-
-# Time (HH:MM:SS)
-lm += "Time: " + gen("time_full", regex=r"\d{2}:\d{2}:\d{2}")
-
-# ISO 8601 datetime
-lm += "Timestamp: " + gen(
-    "timestamp",
-    regex=r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"
-)
-
-# Year (YYYY)
-lm += "Year: " + gen("year", regex=r"(19|20)\d{2}")
-
-# Month name
-lm += "Month: " + gen(
-    "month",
-    regex=r"(January|February|March|April|May|June|July|August|September|October|November|December)"
-)
-```
-
-#### Contact Information
-
-```python
-# Email
-lm += "Email: " + gen(
-    "email",
-    regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
-)
-
-# Phone (US format)
-lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}")
-
-# Phone (international format)
-lm += "Phone: " + gen("phone_intl", regex=r"\+[0-9]{1,3}-[0-9]{1,14}")
-
-# ZIP code (US)
-lm += "ZIP: " + gen("zip", regex=r"\d{5}(-\d{4})?")
-
-# Postal code (Canada)
-lm += "Postal: " + gen("postal", regex=r"[A-Z]\d[A-Z] \d[A-Z]\d")
-
-# URL
-lm += "URL: " + gen(
-    "url",
-    regex=r"https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/[a-zA-Z0-9._~:/?#\[\]@!$&'()*+,;=-]*)?"
-)
-```
-
-### Advanced Patterns
-
-#### JSON Field Constraints
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# String field with quotes
-lm += '"name": ' + gen("name", regex=r'"[A-Za-z ]+"')
-
-# Numeric field (no quotes)
-lm += '"age": ' + gen("age", regex=r"[0-9]+")
-
-# Boolean field
-lm += '"active": ' + gen("active", regex=r"(true|false)")
-
-# Null field
-lm += '"optional": ' + gen("optional", regex=r"(null|[0-9]+)")
-
-# Array of strings
-lm += '"tags": [' + gen(
-    "tags",
-    regex=r'"[a-z]+"(, "[a-z]+")*'
-) + ']'
-
-# Complete JSON object
-lm += """{
-    "name": """ + gen("name", regex=r'"[A-Za-z ]+"') + """,
-    "age": """ + gen("age", regex=r"[0-9]+") + """,
-    "email": """ + gen(
-        "email",
-        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
-    ) + """
-}"""
-```
-
-#### Code Patterns
-
-```python
-# Python variable name
-lm += "Variable: " + gen("var", regex=r"[a-z_][a-z0-9_]*")
-
-# Python function name
-lm += "Function: " + gen("func", regex=r"[a-z_][a-z0-9_]*")
-
-# Hex color code
-lm += "Color: #" + gen("color", regex=r"[0-9A-Fa-f]{6}")
-
-# UUID
-lm += "UUID: " + gen(
-    "uuid",
-    regex=r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
-)
-
-# Git commit hash (short)
-lm += "Commit: " + gen("commit", regex=r"[0-9a-f]{7}")
-
-# Semantic version
-lm += "Version: " + gen("version", regex=r"[0-9]+\.[0-9]+\.[0-9]+")
-
-# IP address (IPv4)
-lm += "IP: " + gen(
-    "ip",
-    regex=r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
-)
-```
-
-#### Domain-Specific Patterns
-
-```python
-# Credit card number
-lm += "Card: " + gen("card", regex=r"\d{4}-\d{4}-\d{4}-\d{4}")
-
-# Social Security Number (US)
-lm += "SSN: " + gen("ssn", regex=r"\d{3}-\d{2}-\d{4}")
-
-# ISBN-13
-lm += "ISBN: " + gen("isbn", regex=r"978-\d{1,5}-\d{1,7}-\d{1,7}-\d")
-
-# License plate (US)
-lm += "Plate: " + gen("plate", regex=r"[A-Z]{3}-\d{4}")
-
-# Currency amount
-lm += "Amount: $" + gen("amount", regex=r"[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}")
-
-# Percentage with decimal
-lm += "Rate: " + gen("rate", regex=r"[0-9]+\.[0-9]{1,2}%")
-```
-
-## Grammar-Based Generation
-
-### JSON Grammar
-
-```python
-from guidance import models, gen, guidance
-
-@guidance
-def json_object(lm):
-    """Generate valid JSON object."""
-    lm += "{\n"
-
-    # Name field (required)
-    lm += '    "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
-
-    # Age field (required)
-    lm += '    "age": ' + gen("age", regex=r"[0-9]+") + ",\n"
-
-    # Email field (required)
-    lm += '    "email": ' + gen(
-        "email",
-        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
-    ) + ",\n"
-
-    # Active field (required, boolean)
-    lm += '    "active": ' + gen("active", regex=r"(true|false)") + "\n"
-
-    lm += "}"
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = json_object(lm)
-print(lm)  # Valid JSON guaranteed
-```
-
-### Nested JSON Grammar
-
-```python
-@guidance
-def nested_json(lm):
-    """Generate nested JSON structure."""
-    lm += "{\n"
-
-    # User object
-    lm += '    "user": {\n'
-    lm += '        "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
-    lm += '        "age": ' + gen("age", regex=r"[0-9]+") + "\n"
-    lm += "    },\n"
-
-    # Address object
-    lm += '    "address": {\n'
-    lm += '        "street": ' + gen("street", regex=r'"[A-Za-z0-9 ]+"') + ",\n"
-    lm += '        "city": ' + gen("city", regex=r'"[A-Za-z ]+"') + ",\n"
-    lm += '        "zip": ' + gen("zip", regex=r'"\d{5}"') + "\n"
-    lm += "    }\n"
-
-    lm += "}"
-    return lm
-```
-
-### Array Grammar
-
-```python
-@guidance
-def json_array(lm, count=3):
-    """Generate JSON array with fixed count."""
-    lm += "[\n"
-
-    for i in range(count):
-        lm += "    {\n"
-        lm += '        "id": ' + gen(f"id_{i}", regex=r"[0-9]+") + ",\n"
-        lm += '        "name": ' + gen(f"name_{i}", regex=r'"[A-Za-z ]+"') + "\n"
-        lm += "    }"
-        if i < count - 1:
-            lm += ","
-        lm += "\n"
-
-    lm += "]"
-    return lm
-```
-
-### XML Grammar
-
-```python
-@guidance
-def xml_document(lm):
-    """Generate valid XML document."""
-    lm += '<?xml version="1.0"?>\n'
-    lm += "<person>\n"
-
-    # Name element
-    lm += "    <name>" + gen("name", regex=r"[A-Za-z ]+") + "</name>\n"
-
-    # Age element
-    lm += "    <age>" + gen("age", regex=r"[0-9]+") + "</age>\n"
-
-    # Email element
-    lm += "    <email>" + gen(
-        "email",
-        regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
-    ) + "</email>\n"
-
-    lm += "</person>"
-    return lm
-```
-
-### CSV Grammar
-
-```python
-@guidance
-def csv_row(lm):
-    """Generate CSV row."""
-    lm += gen("name", regex=r"[A-Za-z ]+") + ","
-    lm += gen("age", regex=r"[0-9]+") + ","
-    lm += gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
-    return lm
-
-@guidance
-def csv_document(lm, rows=5):
-    """Generate complete CSV."""
-    # Header
-    lm += "Name,Age,Email\n"
-
-    # Rows
-    for i in range(rows):
-        lm = csv_row(lm)
-        if i < rows - 1:
-            lm += "\n"
-
-    return lm
-```
-
-## Token Healing
-
-### How Token Healing Works
-
-**Problem:** Tokenization creates unnatural boundaries.
-
-```python
-# Example without token healing
-prompt = "The capital of France is "
-# Tokenization: ["The", " capital", " of", " France", " is", " "]
-# Model sees last token: " "
-# First generated token might include leading space: " Paris"
-# Result: "The capital of France is  Paris" (double space)
-```
-
-**Solution:** Guidance backs up and regenerates the last token.
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Token healing enabled by default
-lm += "The capital of France is " + gen("capital", max_tokens=5)
-
-# Process:
-# 1. Back up over the trailing " " token at the end of the prompt
-# 2. Regenerate from there, allowing only tokens that start with " " (e.g. " Paris")
-# 3. Result: "The capital of France is Paris" (single space)
-```
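The back-up-and-constrain idea can be sketched without a model. This toy example assumes a made-up vocabulary and a greedy longest-match tokenizer. `VOCAB`, `tokenize`, and `heal` are hypothetical names; Guidance's real implementation operates on the model's actual tokenizer and is more involved.

```python
VOCAB = ["The", " capital", " of", " France", " is", " ", " Paris", "Paris", " London"]


def tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    while text:
        # Raises ValueError if no vocabulary entry matches; fine for a toy.
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens


def heal(tokens: list[str]) -> tuple[list[str], list[str]]:
    """Drop the last prompt token and list the tokens allowed to replace it."""
    kept, removed = tokens[:-1], tokens[-1]
    # Only tokens that begin with the removed text may come next.
    allowed = [t for t in VOCAB if t.startswith(removed)]
    return kept, allowed


prompt_tokens = tokenize("The capital of France is ")
kept, allowed = heal(prompt_tokens)
print(prompt_tokens)
print(kept)
print(allowed)
```

Because `" Paris"` is allowed and `"Paris"` (no leading space) is not, the healed completion ends up with a single space before the answer.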
-
-### Token Healing Examples
-
-#### Natural Continuations
-
-```python
-# Without token healing
-lm += "The function name is get" + gen("rest")
-# Might generate: "The function name is get User" (space before User)
-
-# With token healing
-lm += "The function name is get" + gen("rest")
-# Generates: "The function name is getUser" (correct camelCase)
-```
-
-#### Code Generation
-
-```python
-# Function name completion
-lm += "def calculate_" + gen("rest", stop="(")
-# Token healing ensures smooth connection: "calculate_total"
-
-# Variable name completion
-lm += "my_" + gen("var_name", regex=r"[a-z_]+")
-# Token healing ensures: "my_variable_name" (not "my_ variable_name")
-```
-
-#### Domain-Specific Terms
-
-```python
-# Medical terms
-lm += "The patient has hyper" + gen("condition")
-# Token healing helps: "hypertension" (not "hyper tension")
-
-# Technical terms
-lm += "Using micro" + gen("tech")
-# Token healing helps: "microservices" (not "micro services")
-```
-
-### Disabling Token Healing
-
-```python
-# Disable token healing if needed (rare)
-lm += gen("text", token_healing=False)
-```
-
-## Selection Constraints
-
-### Basic Selection
-
-```python
-from guidance import models, select
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Simple selection
-lm += "Status: " + select(["active", "inactive", "pending"], name="status")
-
-# Boolean selection
-lm += "Approved: " + select(["Yes", "No"], name="approved")
-
-# Multiple choice
-lm += "Answer: " + select(
-    ["A) Paris", "B) London", "C) Berlin", "D) Madrid"],
-    name="answer"
-)
-```
-
-### Conditional Selection
-
-```python
-from guidance import models, select, gen, guidance
-
-@guidance
-def conditional_fields(lm):
-    """Generate fields conditionally based on type."""
-    lm += "Type: " + select(["person", "company"], name="type")
-
-    if lm["type"] == "person":
-        lm += "\nName: " + gen("name", regex=r"[A-Za-z ]+")
-        lm += "\nAge: " + gen("age", regex=r"[0-9]+")
-    else:
-        lm += "\nCompany Name: " + gen("company", regex=r"[A-Za-z ]+")
-        lm += "\nEmployees: " + gen("employees", regex=r"[0-9]+")
-
-    return lm
-```
-
-### Repeated Selection
-
-```python
-@guidance
-def multiple_selections(lm):
-    """Select multiple items."""
-    lm += "Select 3 colors:\n"
-
-    colors = ["red", "blue", "green", "yellow", "purple"]
-
-    for i in range(3):
-        lm += f"{i+1}. " + select(colors, name=f"color_{i}") + "\n"
-
-    return lm
-```
-
-## Complex Patterns
-
-### Pattern 1: Structured Forms
-
-```python
-@guidance
-def user_form(lm):
-    """Generate structured user form."""
-    lm += "=== User Registration ===\n\n"
-
-    # Name (alphabetic only)
-    lm += "Full Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    # Age (numeric)
-    lm += "Age: " + gen("age", regex=r"[0-9]+", max_tokens=3) + "\n"
-
-    # Email (validated format)
-    lm += "Email: " + gen(
-        "email",
-        regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
-        stop="\n"
-    ) + "\n"
-
-    # Phone (US format)
-    lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}") + "\n"
-
-    # Account type (selection)
-    lm += "Account Type: " + select(
-        ["Standard", "Premium", "Enterprise"],
-        name="account_type"
-    ) + "\n"
-
-    # Active status (boolean)
-    lm += "Active: " + select(["Yes", "No"], name="active") + "\n"
-
-    return lm
-```
-
-### Pattern 2: Multi-Entity Extraction
-
-```python
-@guidance
-def extract_entities(lm, text):
-    """Extract multiple entities with constraints."""
-    lm += f"Text: {text}\n\n"
-
-    # Person name (alphabetic)
-    lm += "Person: " + gen("person", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    # Organization (alphanumeric with spaces)
-    lm += "Organization: " + gen(
-        "organization",
-        regex=r"[A-Za-z0-9 ]+",
-        stop="\n"
-    ) + "\n"
-
-    # Date (YYYY-MM-DD format)
-    lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}") + "\n"
-
-    # Location (alphabetic with spaces)
-    lm += "Location: " + gen("location", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    # Amount (currency)
-    lm += "Amount: $" + gen("amount", regex=r"[0-9,]+\.[0-9]{2}") + "\n"
-
-    return lm
-```
-
-### Pattern 3: Code Generation
-
-```python
-@guidance
-def generate_python_function(lm):
-    """Generate Python function with constraints."""
-    # Function name (valid Python identifier)
-    lm += "def " + gen("func_name", regex=r"[a-z_][a-z0-9_]*") + "("
-
-    # Parameter name
-    lm += gen("param", regex=r"[a-z_][a-z0-9_]*") + "):\n"
-
-    # Docstring
-    lm += '    """' + gen("docstring", stop='"""', max_tokens=50) + '"""\n'
-
-    # Function body (constrained to valid Python)
-    lm += "    return " + gen("return_value", stop="\n") + "\n"
-
-    return lm
-```
-
-### Pattern 4: Hierarchical Data
-
-```python
-@guidance
-def org_chart(lm):
-    """Generate organizational chart."""
-    lm += "Company: " + gen("company", regex=r"[A-Za-z ]+") + "\n\n"
-
-    # CEO
-    lm += "CEO: " + gen("ceo", regex=r"[A-Za-z ]+") + "\n"
-
-    # Departments
-    for dept in ["Engineering", "Sales", "Marketing"]:
-        lm += f"\n{dept} Department:\n"
-        lm += "  Head: " + gen(f"{dept.lower()}_head", regex=r"[A-Za-z ]+") + "\n"
-        lm += "  Size: " + gen(f"{dept.lower()}_size", regex=r"[0-9]+") + " employees\n"
-
-    return lm
-```
-
-## Performance Optimization
-
-### Best Practices
-
-#### 1. Use Specific Patterns
-
-```python
-# ✅ Good: Specific pattern
-lm += gen("age", regex=r"[0-9]{1,3}")  # Fast
-
-# ❌ Bad: Overly broad pattern
-lm += gen("age", regex=r"[0-9]+")  # Slower
-```
-
-#### 2. Limit Max Tokens
-
-```python
-# ✅ Good: Reasonable limit
-lm += gen("name", max_tokens=30)
-
-# ❌ Bad: No limit and no stop condition
-lm += gen("name")  # May run on well past the name
-```
-
-#### 3. Use stop Sequences
-
-```python
-# ✅ Good: Stop at newline
-lm += gen("line", stop="\n")
-
-# ❌ Bad: Rely on max_tokens
-lm += gen("line", max_tokens=100)
-```
-
-#### 4. Cache Compiled Grammars
-
-```python
-# Grammars are cached automatically after first use
-# No manual caching needed
-@guidance
-def reusable_pattern(lm):
-    """This grammar is compiled once and cached."""
-    lm += gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
-    return lm
-
-# First call: compiles grammar
-lm = reusable_pattern(lm)
-
-# Subsequent calls: uses cached grammar (fast)
-lm = reusable_pattern(lm)
-```
-
-#### 5. Avoid Overlapping Constraints
-
-```python
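-# ✅ Good: Each constraint does a distinct job (regex limits characters, max_tokens limits length)
-lm += gen("age", regex=r"[0-9]+", max_tokens=3)
-
-# ❌ Bad: Redundant constraints
-lm += gen("age", regex=r"[0-9]{2}", max_tokens=10)  # regex already fixes the length; max_tokens adds nothing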
-```
-
-### Performance Benchmarks
-
-**Regex vs Free Generation:**
- Simple regex (digits): ~1.2x slower than free gen
- Complex regex (email): ~1.5x slower than free gen
- Grammar-based: ~2x slower than free gen
-
-**But:**
- 100% valid outputs (vs ~70% with free gen + validation)
- No retry loops needed
- Overall faster end-to-end for structured outputs
-
-**Optimization Tips:**
- Use regex for critical fields only
- Use `select()` for small fixed sets (fastest)
- Use `stop` sequences when possible (faster than max_tokens)
- Cache compiled grammars by reusing functions
-
-## Resources
-
- **Token Healing Paper**: https://arxiv.org/abs/2306.17648
- **Guidance Docs**: https://guidance.readthedocs.io
- **GitHub**: https://github.com/guidance-ai/guidance
--- a/skills/mlops/guidance/references/examples.md
+++ b/skills/mlops/guidance/references/examples.md
@@ -1,767 +0,0 @@
-# Production-Ready Examples
-
-Real-world examples of using Guidance for structured generation, agents, and workflows.
-
-## Table of Contents
- JSON Generation
- Data Extraction
- Classification Systems
- Agent Systems
- Multi-Step Workflows
- Code Generation
- Production Tips
-
-## JSON Generation
-
-### Basic JSON
-
-```python
-from guidance import models, gen, guidance
-
-@guidance
-def generate_user(lm):
-    """Generate valid user JSON."""
-    lm += "{\n"
-    lm += '  "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
-    lm += '  "age": ' + gen("age", regex=r"[0-9]+") + ",\n"
-    lm += '  "email": ' + gen(
-        "email",
-        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
-    ) + "\n"
-    lm += "}"
-    return lm
-
-# Use it
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm += "Generate a user profile:\n"
-lm = generate_user(lm)
-
-print(lm)
-# Output: Valid JSON guaranteed
-```
-
-### Nested JSON
-
-```python
-@guidance
-def generate_order(lm):
-    """Generate nested order JSON."""
-    lm += "{\n"
-
-    # Customer info
-    lm += '  "customer": {\n'
-    lm += '    "name": ' + gen("customer_name", regex=r'"[A-Za-z ]+"') + ",\n"
-    lm += '    "email": ' + gen(
-        "customer_email",
-        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
-    ) + "\n"
-    lm += "  },\n"
-
-    # Order details
-    lm += '  "order": {\n'
-    lm += '    "id": ' + gen("order_id", regex=r'"ORD-[0-9]{6}"') + ",\n"
-    lm += '    "date": ' + gen("order_date", regex=r'"\d{4}-\d{2}-\d{2}"') + ",\n"
-    lm += '    "total": ' + gen("order_total", regex=r"[0-9]+\.[0-9]{2}") + "\n"
-    lm += "  },\n"
-
-    # Status
-    lm += '  "status": ' + gen(
-        "status",
-        regex=r'"(pending|processing|shipped|delivered)"'
-    ) + "\n"
-
-    lm += "}"
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_order(lm)
-```
-
-### JSON Array
-
-```python
-@guidance
-def generate_user_list(lm, count=3):
-    """Generate JSON array of users."""
-    lm += "[\n"
-
-    for i in range(count):
-        lm += "  {\n"
-        lm += '    "id": ' + gen(f"id_{i}", regex=r"[0-9]+") + ",\n"
-        lm += '    "name": ' + gen(f"name_{i}", regex=r'"[A-Za-z ]+"') + ",\n"
-        lm += '    "active": ' + gen(f"active_{i}", regex=r"(true|false)") + "\n"
-        lm += "  }"
-        if i < count - 1:
-            lm += ","
-        lm += "\n"
-
-    lm += "]"
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_user_list(lm, count=5)
-```
-
-### Dynamic JSON Schema
-
-```python
-import json
-from guidance import models, gen, guidance
-
-@guidance
-def json_from_schema(lm, schema):
-    """Generate JSON matching a schema."""
-    lm += "{\n"
-
-    fields = list(schema["properties"].items())
-    for i, (field_name, field_schema) in enumerate(fields):
-        lm += f'  "{field_name}": '
-
-        # Handle different types
-        if field_schema["type"] == "string":
-            if "pattern" in field_schema:
-                lm += gen(field_name, regex=f'"{field_schema["pattern"]}"')
-            else:
-                lm += gen(field_name, regex=r'"[^"]+"')
-        elif field_schema["type"] == "number":
-            lm += gen(field_name, regex=r"[0-9]+(\.[0-9]+)?")
-        elif field_schema["type"] == "integer":
-            lm += gen(field_name, regex=r"[0-9]+")
-        elif field_schema["type"] == "boolean":
-            lm += gen(field_name, regex=r"(true|false)")
-
-        if i < len(fields) - 1:
-            lm += ","
-        lm += "\n"
-
-    lm += "}"
-    return lm
-
-# Define schema
-schema = {
-    "type": "object",
-    "properties": {
-        "name": {"type": "string"},
-        "age": {"type": "integer"},
-        "score": {"type": "number"},
-        "active": {"type": "boolean"}
-    }
-}
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = json_from_schema(lm, schema)
-```
-
-## Data Extraction
-
-### Extract from Text
-
-```python
-from guidance import models, gen, guidance, system, user, assistant
-
-@guidance
-def extract_person_info(lm, text):
-    """Extract structured info from text."""
-    # Put the source text in a user turn and the extraction in an assistant turn.
-    # Role blocks should be siblings, not nested inside each other.
-    with user():
-        lm += f"Text: {text}\n\n"
-
-    with assistant():
-        lm += "Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-        lm += "Age: " + gen("age", regex=r"[0-9]+", max_tokens=3) + "\n"
-        lm += "Occupation: " + gen("occupation", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-        lm += "Email: " + gen(
-            "email",
-            regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
-            stop="\n"
-        ) + "\n"
-
-    return lm
-
-text = "John Smith is a 35-year-old software engineer. Contact: john@example.com"
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-with system():
-    lm += "You extract structured information from text."
-
-# The function opens its own user/assistant blocks, so call it outside any role block
-lm = extract_person_info(lm, text)
-
-print(f"Name: {lm['name']}")
-print(f"Age: {lm['age']}")
-print(f"Occupation: {lm['occupation']}")
-print(f"Email: {lm['email']}")
-```
-
-### Multi-Entity Extraction
-
-```python
-@guidance
-def extract_entities(lm, text):
-    """Extract multiple entity types."""
-    lm += f"Analyze: {text}\n\n"
-
-    # Person entities
-    lm += "People:\n"
-    for i in range(3):  # Up to 3 people
-        lm += f"- " + gen(f"person_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    # Organization entities
-    lm += "\nOrganizations:\n"
-    for i in range(2):  # Up to 2 orgs
-        lm += f"- " + gen(f"org_{i}", regex=r"[A-Za-z0-9 ]+", stop="\n") + "\n"
-
-    # Dates
-    lm += "\nDates:\n"
-    for i in range(2):  # Up to 2 dates
-        lm += f"- " + gen(f"date_{i}", regex=r"\d{4}-\d{2}-\d{2}", stop="\n") + "\n"
-
-    # Locations
-    lm += "\nLocations:\n"
-    for i in range(2):  # Up to 2 locations
-        lm += f"- " + gen(f"location_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    return lm
-
-text = """
-Tim Cook and Satya Nadella met at Microsoft headquarters in Redmond on 2024-09-15
-to discuss the collaboration between Apple and Microsoft. The meeting continued
-in Cupertino on 2024-09-20.
-"""
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = extract_entities(lm, text)
-```
-
-### Batch Extraction
-
-```python
-@guidance
-def batch_extract(lm, texts):
-    """Extract from multiple texts."""
-    lm += "Batch Extraction Results:\n\n"
-
-    for i, text in enumerate(texts):
-        lm += f"=== Item {i+1} ===\n"
-        lm += f"Text: {text}\n"
-        lm += "Name: " + gen(f"name_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-        lm += "Sentiment: " + gen(
-            f"sentiment_{i}",
-            regex=r"(positive|negative|neutral)",
-            stop="\n"
-        ) + "\n\n"
-
-    return lm
-
-texts = [
-    "Alice is happy with the product",
-    "Bob is disappointed with the service",
-    "Carol has no strong feelings either way"
-]
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = batch_extract(lm, texts)
-```
-
-## Classification Systems
-
-### Sentiment Analysis
-
-```python
-from guidance import models, select, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-text = "This product is absolutely amazing! Best purchase ever."
-
-lm += f"Text: {text}\n\n"
-lm += "Sentiment: " + select(
-    ["positive", "negative", "neutral"],
-    name="sentiment"
-)
-lm += "\nConfidence: " + gen("confidence", regex=r"[0-9]{1,3}") + "%\n"
-lm += "Reasoning: " + gen("reasoning", stop="\n", max_tokens=50)
-
-print(f"Sentiment: {lm['sentiment']}")
-print(f"Confidence: {lm['confidence']}%")
-print(f"Reasoning: {lm['reasoning']}")
-```
-
-### Multi-Label Classification
-
-```python
-@guidance
-def classify_article(lm, text):
-    """Classify article with multiple labels."""
-    lm += f"Article: {text}\n\n"
-
-    # Primary category
-    lm += "Primary Category: " + select(
-        ["Technology", "Business", "Science", "Politics", "Entertainment"],
-        name="primary_category"
-    ) + "\n"
-
-    # Secondary categories (up to 3)
-    lm += "\nSecondary Categories:\n"
-    categories = ["Technology", "Business", "Science", "Politics", "Entertainment"]
-    for i in range(3):
-        lm += f"{i+1}. " + select(categories, name=f"secondary_{i}") + "\n"
-
-    # Tags
-    lm += "\nTags: " + gen("tags", stop="\n", max_tokens=50) + "\n"
-
-    # Target audience
-    lm += "Target Audience: " + select(
-        ["General", "Expert", "Beginner"],
-        name="audience"
-    )
-
-    return lm
-
-article = """
-Apple announced new AI features in iOS 18, leveraging machine learning to improve
-battery life and performance. The company's stock rose 5% following the announcement.
-"""
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = classify_article(lm, article)
-```
-
-### Intent Classification
-
-```python
-@guidance
-def classify_intent(lm, message):
-    """Classify user intent."""
-    lm += f"User Message: {message}\n\n"
-
-    # Intent
-    lm += "Intent: " + select(
-        ["question", "complaint", "request", "feedback", "other"],
-        name="intent"
-    ) + "\n"
-
-    # Urgency
-    lm += "Urgency: " + select(
-        ["low", "medium", "high", "critical"],
-        name="urgency"
-    ) + "\n"
-
-    # Department
-    lm += "Route To: " + select(
-        ["support", "sales", "billing", "technical"],
-        name="department"
-    ) + "\n"
-
-    # Sentiment
-    lm += "Sentiment: " + select(
-        ["positive", "neutral", "negative"],
-        name="sentiment"
-    )
-
-    return lm
-
-message = "My account was charged twice for the same order. Need help ASAP!"
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = classify_intent(lm, message)
-
-print(f"Intent: {lm['intent']}")
-print(f"Urgency: {lm['urgency']}")
-print(f"Department: {lm['department']}")
-```
-
-## Agent Systems
-
-### ReAct Agent
-
-```python
-from guidance import models, gen, select, guidance
-
-@guidance(stateless=False)
-def react_agent(lm, question, tools, max_rounds=5):
-    """ReAct agent with tool use."""
-    lm += f"Question: {question}\n\n"
-
-    for step in range(max_rounds):  # avoid shadowing the built-in round()
-        # Thought
-        lm += f"Thought {step+1}: " + gen("thought", stop="\n", max_tokens=100) + "\n"
-
-        # Action selection
-        lm += "Action: " + select(
-            list(tools.keys()) + ["answer"],
-            name="action"
-        )
-
-        if lm["action"] == "answer":
-            lm += "\n\nFinal Answer: " + gen("answer", max_tokens=200)
-            break
-
-        # Action input
-        lm += "\nAction Input: " + gen("action_input", stop="\n", max_tokens=100) + "\n"
-
-        # Execute tool
-        if lm["action"] in tools:
-            try:
-                result = tools[lm["action"]](lm["action_input"])
-                lm += f"Observation: {result}\n\n"
-            except Exception as e:
-                lm += f"Observation: Error - {str(e)}\n\n"
-
-    return lm
-
-# Define tools
-tools = {
-    "calculator": lambda expr: eval(expr),  # demo only: eval() is unsafe on untrusted model output
-    "search": lambda query: f"Search results for '{query}': [Mock results]",
-    "weather": lambda city: f"Weather in {city}: Sunny, 72°F"
-}
-
-# Use agent
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = react_agent(lm, "What is (25 * 4) + 10?", tools)
-
-print(lm["answer"])
-```
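The calculator tool above hands model-generated text to `eval`, which will run arbitrary Python. A minimal sketch of a safer replacement follows, assuming the agent only needs `+`, `-`, `*`, `/` and parentheses. `safe_calculate` is a hypothetical helper name, not part of Guidance.

```python
import ast
import operator

# Map the supported AST operator nodes to their implementations.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expr: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept plain numbers only (bool is deliberately excluded).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. are rejected.
        raise ValueError(f"Unsupported expression: {expr!r}")

    return _eval(ast.parse(expr, mode="eval"))


print(safe_calculate("(25 * 4) + 10"))  # 110
```

It could be dropped into the `tools` dict in place of the lambda, for example `"calculator": safe_calculate`; anything other than plain arithmetic raises `ValueError`, which the agent's `except` branch turns into an error observation.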
-
-### Multi-Agent System
-
-```python
-@guidance
-def coordinator_agent(lm, task):
-    """Coordinator that delegates to specialists."""
-    lm += f"Task: {task}\n\n"
-
-    # Determine which specialist to use
-    lm += "Specialist: " + select(
-        ["researcher", "writer", "coder", "analyst"],
-        name="specialist"
-    ) + "\n"
-
-    lm += "Reasoning: " + gen("reasoning", stop="\n", max_tokens=100) + "\n"
-
-    return lm
-
-@guidance
-def researcher_agent(lm, query):
-    """Research specialist."""
-    lm += f"Research Query: {query}\n\n"
-    lm += "Findings:\n"
-    for i in range(3):
-        lm += f"{i+1}. " + gen(f"finding_{i}", stop="\n", max_tokens=100) + "\n"
-    return lm
-
-@guidance
-def writer_agent(lm, topic):
-    """Writing specialist."""
-    lm += f"Topic: {topic}\n\n"
-    lm += "Title: " + gen("title", stop="\n", max_tokens=50) + "\n"
-    lm += "Content:\n" + gen("content", max_tokens=500)
-    return lm
-
-# Coordination workflow
-task = "Write an article about AI safety"
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = coordinator_agent(lm, task)
-
-specialist = lm["specialist"]
-if specialist == "researcher":
-    lm = researcher_agent(lm, task)
-elif specialist == "writer":
-    lm = writer_agent(lm, task)
-```
-
-### Tool Use with Validation
-
-```python
-@guidance(stateless=False)
-def validated_tool_agent(lm, question):
-    """Agent with validated tool calls."""
-    tools = {
-        "add": lambda a, b: float(a) + float(b),
-        "multiply": lambda a, b: float(a) * float(b),
-        "divide": lambda a, b: float(a) / float(b) if float(b) != 0 else "Error: Division by zero"
-    }
-
-    lm += f"Question: {question}\n\n"
-
-    for i in range(5):
-        # Select tool
-        lm += "Tool: " + select(list(tools.keys()) + ["done"], name="tool")
-
-        if lm["tool"] == "done":
-            lm += "\nAnswer: " + gen("answer", max_tokens=100)
-            break
-
-        # Get validated numeric arguments
-        lm += "\nArg1: " + gen("arg1", regex=r"-?[0-9]+(\.[0-9]+)?") + "\n"
-        lm += "Arg2: " + gen("arg2", regex=r"-?[0-9]+(\.[0-9]+)?") + "\n"
-
-        # Execute
-        result = tools[lm["tool"]](lm["arg1"], lm["arg2"])
-        lm += f"Result: {result}\n\n"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = validated_tool_agent(lm, "What is (10 + 5) * 3?")
-```
-
-## Multi-Step Workflows
-
-### Chain of Thought
-
-```python
-@guidance
-def chain_of_thought(lm, question):
-    """Multi-step reasoning with CoT."""
-    lm += f"Question: {question}\n\n"
-
-    # Generate reasoning steps
-    lm += "Let me think step by step:\n\n"
-    for i in range(4):
-        lm += f"Step {i+1}: " + gen(f"step_{i+1}", stop="\n", max_tokens=100) + "\n"
-
-    # Final answer
-    lm += "\nTherefore, the answer is: " + gen("answer", stop="\n", max_tokens=50)
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = chain_of_thought(lm, "If a train travels 60 mph for 2.5 hours, how far does it go?")
-
-print(lm["answer"])
-```
-
-### Self-Consistency
-
-```python
-@guidance
-def self_consistency(lm, question, num_samples=3):
-    """Generate multiple reasoning paths and aggregate."""
-    lm += f"Question: {question}\n\n"
-
-    answers = []
-    for i in range(num_samples):
-        lm += f"=== Attempt {i+1} ===\n"
-        lm += "Reasoning: " + gen(f"reasoning_{i}", stop="\n", max_tokens=100) + "\n"
-        lm += "Answer: " + gen(f"answer_{i}", stop="\n", max_tokens=50) + "\n\n"
-        answers.append(lm[f"answer_{i}"])
-
-    # Aggregate (simple majority vote)
-    from collections import Counter
-    most_common = Counter(answers).most_common(1)[0][0]
-
-    lm += f"Final Answer (by majority): {most_common}\n"
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = self_consistency(lm, "What is 15% of 200?")
-```
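The majority vote above compares raw strings, so `"30"` and `" 30."` count as different answers. A small sketch of the aggregation step with light normalization is shown below; `majority_vote` is a hypothetical helper, and the normalization rules (trim, lowercase, drop a trailing period) are assumptions that would need tuning per task.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer after light normalization, or None."""
    normalized = [a.strip().lower().rstrip(".") for a in answers if a.strip()]
    if not normalized:
        return None
    # most_common(1) returns [(answer, count)] for the top answer.
    return Counter(normalized).most_common(1)[0][0]


print(majority_vote(["30", " 30.", "Thirty", "30"]))  # 30
```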
-
-### Planning and Execution
-
-```python
-@guidance
-def plan_and_execute(lm, goal):
-    """Plan tasks then execute them."""
-    lm += f"Goal: {goal}\n\n"
-
-    # Planning phase
-    lm += "Plan:\n"
-    num_steps = 4
-    for i in range(num_steps):
-        lm += f"{i+1}. " + gen(f"plan_step_{i}", stop="\n", max_tokens=100) + "\n"
-
-    # Execution phase
-    lm += "\nExecution:\n\n"
-    for i in range(num_steps):
-        lm += f"Step {i+1}: {lm[f'plan_step_{i}']}\n"
-        lm += "Status: " + select(["completed", "in-progress", "blocked"], name=f"status_{i}") + "\n"
-        lm += "Result: " + gen(f"result_{i}", stop="\n", max_tokens=150) + "\n\n"
-
-    # Summary
-    lm += "Summary: " + gen("summary", max_tokens=200)
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = plan_and_execute(lm, "Build a REST API for a blog platform")
-```
-
-## Code Generation
-
-### Python Function
-
-```python
-@guidance
-def generate_python_function(lm, description):
-    """Generate Python function from description."""
-    lm += f"Description: {description}\n\n"
-
-    # Function signature
-    lm += "def " + gen("func_name", regex=r"[a-z_][a-z0-9_]*") + "("
-    lm += gen("params", regex=r"[a-z_][a-z0-9_]*(, [a-z_][a-z0-9_]*)*") + "):\n"
-
-    # Docstring
-    lm += '    """' + gen("docstring", stop='"""', max_tokens=100) + '"""\n'
-
-    # Function body
-    lm += "    " + gen("body", stop="\n", max_tokens=200) + "\n"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_python_function(lm, "Check if a number is prime")
-
-print(lm)
-```
-
-### SQL Query
-
-```python
-@guidance
-def generate_sql(lm, description):
-    """Generate SQL query from description."""
-    lm += f"Description: {description}\n\n"
-    lm += "SQL Query:\n"
-
-    # SELECT clause
-    lm += "SELECT " + gen("select_clause", stop=" FROM", max_tokens=100)
-
-    # FROM clause
-    lm += " FROM " + gen("from_clause", stop=" WHERE", max_tokens=50)
-
-    # WHERE clause (always emitted in this sketch)
-    lm += " WHERE " + gen("where_clause", stop=";", max_tokens=100) + ";"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_sql(lm, "Get all users who signed up in the last 30 days")
-```
-
-### API Endpoint
-
-```python
-@guidance
-def generate_api_endpoint(lm, description):
-    """Generate REST API endpoint."""
-    lm += f"Description: {description}\n\n"
-
-    # HTTP method
-    lm += "Method: " + select(["GET", "POST", "PUT", "DELETE"], name="method") + "\n"
-
-    # Path
-    lm += "Path: /" + gen("path", regex=r"[a-z0-9/-]+", stop="\n") + "\n"
-
-    # Request body (if POST/PUT)
-    if lm["method"] in ["POST", "PUT"]:
-        lm += "\nRequest Body:\n"
-        lm += "{\n"
-        lm += '  "field1": ' + gen("field1", regex=r'"[a-z_]+"') + ",\n"
-        lm += '  "field2": ' + gen("field2", regex=r'"[a-z_]+"') + "\n"
-        lm += "}\n"
-
-    # Response
-    lm += "\nResponse (200 OK):\n"
-    lm += "{\n"
-    lm += '  "status": "success",\n'
-    lm += '  "data": ' + gen("response_data", max_tokens=100) + "\n"
-    lm += "}\n"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_api_endpoint(lm, "Create a new blog post")
-```
-
-## Production Tips
-
-### Error Handling
-
-```python
-@guidance
-def safe_extraction(lm, text):
-    """Extract with fallback handling."""
-    lm += f"Text: {text}\n"  # append the prompt once, before the try
-    try:
-        # Strict extraction: letters and spaces only
-        lm += "Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n", max_tokens=30)
-        return lm
-    except Exception:
-        # Fallback to less strict extraction
-        lm += "Name: " + gen("name", stop="\n", max_tokens=30)
-        return lm
-```
-
-### Caching
-
-```python
-from functools import lru_cache
-
-@lru_cache(maxsize=100)
-def cached_generation(text):
-    """Cache LLM generations."""
-    lm = models.Anthropic("claude-sonnet-4-5-20250929")
-    lm += f"Analyze: {text}\n"
-    lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
-    return lm["sentiment"]
-
-# First call: hits LLM
-result1 = cached_generation("This is great!")
-
-# Second call: returns cached result
-result2 = cached_generation("This is great!")  # Instant!
-```
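To check that the cache is actually being hit, `functools.lru_cache` exposes `cache_info()`. The sketch below swaps the model call for a keyword rule so it runs offline; `cached_sentiment` and `call_count` are illustrative names. Caching like this is only appropriate when the same input should always give the same output.

```python
from functools import lru_cache

call_count = 0  # counts how often the (stubbed) model is actually invoked


@lru_cache(maxsize=100)
def cached_sentiment(text: str) -> str:
    """Stand-in for an LLM call; a keyword rule replaces the model."""
    global call_count
    call_count += 1
    return "positive" if "great" in text.lower() else "neutral"


cached_sentiment("This is great!")  # miss: runs the function body
cached_sentiment("This is great!")  # hit: served from the cache

info = cached_sentiment.cache_info()
print(call_count, info.hits, info.misses)  # 1 1 1
```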
-
-### Monitoring
-
-```python
-import time
-
-@guidance
-def monitored_generation(lm, text):
-    """Track generation metrics."""
-    start_time = time.time()
-
-    lm += f"Text: {text}\n"
-    lm += "Analysis: " + gen("analysis", max_tokens=100)
-
-    elapsed = time.time() - start_time
-
-    # Log metrics
-    print(f"Generation time: {elapsed:.2f}s")
-    print(f"Output length: {len(lm['analysis'])} chars")
-
-    return lm
-```
-
-### Batch Processing
-
-```python
-def batch_process(texts, batch_size=10):
-    """Process texts in batches."""
-    lm = models.Anthropic("claude-sonnet-4-5-20250929")
-    results = []
-
-    for start in range(0, len(texts), batch_size):
-        batch = texts[start:start+batch_size]
-
-        for offset, text in enumerate(batch):
-            lm += f"Text: {text}\n"
-            lm += "Sentiment: " + select(
-                ["positive", "negative", "neutral"],
-                name=f"sentiment_{start + offset}"  # unique capture name per text
-            ) + "\n\n"
-
-        results.extend(lm[f"sentiment_{start + j}"] for j in range(len(batch)))
-
-    return results
-```
-
-## Resources
-
-- **Guidance Notebooks**: https://github.com/guidance-ai/guidance/tree/main/notebooks
-- **Guidance Docs**: https://guidance.readthedocs.io
-- **Community Examples**: https://github.com/guidance-ai/guidance/discussions
--- a/skills/mlops/llava/SKILL.md
+++ b/skills/mlops/llava/SKILL.md
@@ -1,307 +0,0 @@
---
-name: llava
-description: Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [transformers, torch, pillow]
-metadata:
-  hermes:
-    tags: [LLaVA, Vision-Language, Multimodal, Visual Question Answering, Image Chat, CLIP, Vicuna, Conversational AI, Instruction Tuning, VQA]
-
---
-
-# LLaVA - Large Language and Vision Assistant
-
-Open-source vision-language model for conversational image understanding.
-
-## When to use LLaVA
-
-**Use when:**
-- Building vision-language chatbots
-- Visual question answering (VQA)
-- Image description and captioning
-- Multi-turn image conversations
-- Visual instruction following
-- Document understanding with images
-
-**Metrics**:
-- **23,000+ GitHub stars**
-- GPT-4V level capabilities (targeted)
-- Apache 2.0 License
-- Multiple model sizes (7B-34B params)
-
-**Use alternatives instead**:
-- **GPT-4V**: Highest quality, API-based
-- **CLIP**: Simple zero-shot classification
-- **BLIP-2**: Better for captioning only
-- **Flamingo**: Research, not open-source
-
-## Quick start
-
-### Installation
-
-```bash
-# Clone repository
-git clone https://github.com/haotian-liu/LLaVA
-cd LLaVA
-
-# Install
-pip install -e .
-```
-
-### Basic usage
-
-```python
-from llava.model.builder import load_pretrained_model
-from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
-from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
-from llava.conversation import conv_templates
-from PIL import Image
-import torch
-
-# Load model
-model_path = "liuhaotian/llava-v1.5-7b"
-tokenizer, model, image_processor, context_len = load_pretrained_model(
-    model_path=model_path,
-    model_base=None,
-    model_name=get_model_name_from_path(model_path)
-)
-
-# Load image
-image = Image.open("image.jpg")
-image_tensor = process_images([image], image_processor, model.config)
-image_tensor = image_tensor.to(model.device, dtype=torch.float16)
-
-# Create conversation
-conv = conv_templates["llava_v1"].copy()
-conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
-conv.append_message(conv.roles[1], None)
-prompt = conv.get_prompt()
-
-# Generate response
-input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
-
-with torch.inference_mode():
-    output_ids = model.generate(
-        input_ids,
-        images=image_tensor,
-        do_sample=True,
-        temperature=0.2,
-        max_new_tokens=512
-    )
-
-response = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
-print(response)
-```
-
-## Available models
-
-| Model | Parameters | VRAM | Quality |
-|-------|------------|------|---------|
-| LLaVA-v1.5-7B | 7B | ~14 GB | Good |
-| LLaVA-v1.5-13B | 13B | ~28 GB | Better |
-| LLaVA-v1.6-34B | 34B | ~70 GB | Best |
-
-```python
-# Load different models
-model_7b = "liuhaotian/llava-v1.5-7b"
-model_13b = "liuhaotian/llava-v1.5-13b"
-model_34b = "liuhaotian/llava-v1.6-34b"
-
-# 4-bit quantization for lower VRAM
-load_4bit = True  # Reduces VRAM by ~4×
-```
-
-## CLI usage
-
-```bash
-# Single image query (non-interactive; llava.serve.cli has no --query flag)
-python -m llava.eval.run_llava \
-    --model-path liuhaotian/llava-v1.5-7b \
-    --image-file image.jpg \
-    --query "What is in this image?"
-
-# Multi-turn conversation
-python -m llava.serve.cli \
-    --model-path liuhaotian/llava-v1.5-7b \
-    --image-file image.jpg
-# Then type questions interactively
-```
-
-## Web UI (Gradio)
-
-```bash
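-# Launch controller, model worker, and Gradio UI (each in its own terminal)
-python -m llava.serve.controller --host 0.0.0.0 --port 10000
-python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-7b --load-4bit
-python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
-
-# Access at http://localhost:7860 (--load-4bit on the worker is optional and reduces VRAM)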
-```
-
-## Multi-turn conversations
-
-```python
-# Initialize conversation (generate() is a user-defined helper wrapping the "Basic usage" inference code)
-conv = conv_templates["llava_v1"].copy()
-
-# Turn 1
-conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
-conv.append_message(conv.roles[1], None)
-response1 = generate(conv, model, image)  # "A dog playing in a park"
-
-# Turn 2
-conv.messages[-1][1] = response1  # Add previous response
-conv.append_message(conv.roles[0], "What breed is the dog?")
-conv.append_message(conv.roles[1], None)
-response2 = generate(conv, model, image)  # "Golden Retriever"
-
-# Turn 3
-conv.messages[-1][1] = response2
-conv.append_message(conv.roles[0], "What time of day is it?")
-conv.append_message(conv.roles[1], None)
-response3 = generate(conv, model, image)
-```
-
-## Common tasks
-
-### Image captioning
-
-```python
-question = "Describe this image in detail."
-response = ask(model, image, question)  # ask(): user-defined helper, not part of LLaVA
-```
-
-### Visual question answering
-
-```python
-question = "How many people are in the image?"
-response = ask(model, image, question)
-```
-
-### Object detection (textual)
-
-```python
-question = "List all the objects you can see in this image."
-response = ask(model, image, question)
-```
-
-### Scene understanding
-
-```python
-question = "What is happening in this scene?"
-response = ask(model, image, question)
-```
-
-### Document understanding
-
-```python
-question = "What is the main topic of this document?"
-response = ask(model, document_image, question)
-```
-
-## Training custom model
-
-```bash
-# Stage 1: Feature alignment (558K image-caption pairs)
-bash scripts/v1_5/pretrain.sh
-
-# Stage 2: Visual instruction tuning (~665K-sample instruction mixture for v1.5)
-bash scripts/v1_5/finetune.sh
-```
-
-## Quantization (reduce VRAM)
-
-```python
-# 4-bit quantization
-tokenizer, model, image_processor, context_len = load_pretrained_model(
-    model_path="liuhaotian/llava-v1.5-13b",
-    model_base=None,
-    model_name=get_model_name_from_path("liuhaotian/llava-v1.5-13b"),
-    load_4bit=True  # Reduces VRAM ~4×
-)
-
-# 8-bit quantization: pass load_8bit=True instead of load_4bit=True
-# in the call above (reduces VRAM ~2×)
-```
-
-## Best practices
-
-1. **Start with 7B model** - Good quality, manageable VRAM
-2. **Use 4-bit quantization** - Reduces VRAM significantly
-3. **GPU required** - CPU inference extremely slow
-4. **Clear prompts** - Specific questions get better answers
-5. **Multi-turn conversations** - Maintain conversation context
-6. **Temperature 0.2-0.7** - Balance creativity/consistency
-7. **max_new_tokens 512-1024** - For detailed responses
-8. **Batch processing** - Process multiple images sequentially
-
-## Performance
-
-| Model | VRAM (FP16) | VRAM (4-bit) | Speed (tokens/s) |
-|-------|-------------|--------------|------------------|
-| 7B | ~14 GB | ~4 GB | ~20 |
-| 13B | ~28 GB | ~8 GB | ~12 |
-| 34B | ~70 GB | ~18 GB | ~5 |
-
-*On A100 GPU*
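The VRAM columns above track a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter. The sketch below reproduces that arithmetic; `weight_memory_gb` is a hypothetical helper, and it counts weights only, so activations, the KV cache and the vision tower come on top, which is presumably why the table rounds up.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bits_per_param / 8


for size in (7, 13, 34):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: FP16 ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

For the 7B model this gives 14.0 GB in FP16 and 3.5 GB at 4 bits, in line with the ~14 GB and ~4 GB entries.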
-
-## Benchmarks
-
-LLaVA achieves competitive scores on:
-- **VQAv2**: 78.5%
-- **GQA**: 62.0%
-- **MM-Vet**: 35.4%
-- **MMBench**: 64.3%
-
-## Limitations
-
-1. **Hallucinations** - May describe things not in image
-2. **Spatial reasoning** - Struggles with precise locations
-3. **Small text** - Difficulty reading fine print
-4. **Object counting** - Imprecise for many objects
-5. **VRAM requirements** - Need powerful GPU
-6. **Inference speed** - Slower than CLIP
-
-## Integration with frameworks
-
-### LangChain
-
-```python
-from langchain.llms.base import LLM
-
-class LLaVALLM(LLM):
-    def _call(self, prompt, stop=None):
-        # Placeholder: run LLaVA inference on `prompt` here and return the generated text
-        raise NotImplementedError("plug in LLaVA inference")
-
-llm = LLaVALLM()
-```
-
-### Gradio App
-
-```python
-import gradio as gr
-
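-def chat(text, history, image):  # ChatInterface passes (message, history, *additional_inputs)
-    response = ask_llava(model, image, text)  # ask_llava: user-defined inference helper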
-    return response
-
-demo = gr.ChatInterface(
-    chat,
-    additional_inputs=[gr.Image(type="pil")],
-    title="LLaVA Chat"
-)
-demo.launch()
-```
-
-## Resources
-
-- **GitHub**: https://github.com/haotian-liu/LLaVA ⭐ 23,000+
-- **Paper**: https://arxiv.org/abs/2304.08485
-- **Demo**: https://llava.hliu.cc
-- **Models**: https://huggingface.co/liuhaotian
-- **License**: Apache 2.0
-
-
--- a/skills/mlops/llava/references/training.md
+++ b/skills/mlops/llava/references/training.md
@@ -1,197 +0,0 @@
-# LLaVA Training Guide
-
-Guide to training and fine-tuning LLaVA models.
-
-## Training stages
-
-### Stage 1: Feature alignment (Pretraining)
-
-**Purpose**: Align vision encoder with language model
-
-**Data**: 558K image-caption pairs (LAION/CC/SBU subset)
-
-```bash
-# Download pretrained projector or train from scratch
-bash scripts/v1_5/pretrain.sh
-```
-
-**Configuration:**
-- Base model: Vicuna-7B or LLaMA-2-7B
-- Vision encoder: CLIP ViT-L/14
-- Training time: ~20 hours on 8× A100
-
-### Stage 2: Visual instruction tuning
-
-**Purpose**: Teach model to follow visual instructions
-
-**Data**: ~665K-sample instruction mixture for v1.5 (includes the ~150K GPT-generated multimodal instruction samples)
-
-```bash
-# Fine-tune with instruction data
-bash scripts/v1_5/finetune.sh
-```
-
-**Configuration:**
-- Epochs: 1
-- Batch size: 128 (across 8 GPUs)
-- Learning rate: 2e-5
-- Training time: ~24 hours on 8× A100
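The global batch size of 128 is the product of the per-device batch size, the gradient accumulation steps and the GPU count. A small sketch of that arithmetic, useful when fewer or smaller GPUs are available; `effective_batch_size` is a hypothetical helper name.

```python
def effective_batch_size(per_device: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Global batch size seen by the optimizer per update step."""
    return per_device * grad_accum_steps * num_gpus


# 16 per device, no accumulation, 8 GPUs
print(effective_batch_size(16, 1, 8))  # 128

# Same global batch on 4 GPUs: halve the devices, double the accumulation
print(effective_batch_size(16, 2, 4))  # 128
```

Keeping the product constant while trading per-device batch size for accumulation steps is the usual way to hold the learning-rate schedule comparable on smaller hardware.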
-
-## Data format
-
-### Instruction data format
-
-```json
-[
-    {
-        "id": "001",
-        "image": "path/to/image.jpg",
-        "conversations": [
-            {
-                "from": "human",
-                "value": "<image>\nWhat is in this image?"
-            },
-            {
-                "from": "gpt",
-                "value": "The image shows a dog playing in a park."
-            },
-            {
-                "from": "human",
-                "value": "What breed is the dog?"
-            },
-            {
-                "from": "gpt",
-                "value": "It appears to be a Golden Retriever."
-            }
-        ]
-    }
-]
-```
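Before launching a long training run it can help to sanity-check records against this layout. The sketch below is a hypothetical validator, not part of LLaVA; the rules it enforces (alternating `human`/`gpt` turns, `<image>` in the first human turn) are inferred from the example above rather than taken from a formal schema.

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one instruction-data record."""
    errors = []
    for key in ("id", "image", "conversations"):
        if key not in record:
            errors.append(f"missing key: {key}")

    turns = record.get("conversations", [])
    if len(turns) % 2 != 0:
        errors.append("conversations must contain human/gpt pairs")
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            errors.append(f"turn {i}: expected speaker {expected!r}")
    if turns and "<image>" not in turns[0].get("value", ""):
        errors.append("first human turn should contain the <image> token")
    return errors


record = {
    "id": "001",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is in this image?"},
        {"from": "gpt", "value": "The image shows a dog playing in a park."},
    ],
}
print(validate_record(record))  # []
```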
-
-## Fine-tuning on custom data
-
-### Prepare your data
-
-```python
-import json
-
-# Create instruction data
-data = []
-for image_path, qa_pairs in your_dataset:
-    conversations = []
-    for turn, (q, a) in enumerate(qa_pairs):  # <image> goes on the first human turn only
-        conversations.append({"from": "human", "value": ("<image>\n" if turn == 0 else "") + q})
-        conversations.append({"from": "gpt", "value": a})
-
-    data.append({
-        "id": str(len(data)),
-        "image": image_path,
-        "conversations": conversations
-    })
-
-# Save
-with open("custom_data.json", "w") as f:
-    json.dump(data, f, indent=2)
-```
-
-### Fine-tune script
-
-```bash
-#!/bin/bash
-
-# Set paths
-DATA_PATH="custom_data.json"
-IMAGE_FOLDER="path/to/images"
-MODEL_PATH="liuhaotian/llava-v1.5-7b"
-OUTPUT_DIR="./checkpoints/llava-custom"
-
-# Fine-tune
-deepspeed llava/train/train_mem.py \
-    --deepspeed ./scripts/zero2.json \
-    --model_name_or_path $MODEL_PATH \
-    --version v1 \
-    --data_path $DATA_PATH \
-    --image_folder $IMAGE_FOLDER \
-    --vision_tower openai/clip-vit-large-patch14-336 \
-    --mm_projector_type mlp2x_gelu \
-    --mm_vision_select_layer -2 \
-    --mm_use_im_start_end False \
-    --mm_use_im_patch_token False \
-    --image_aspect_ratio pad \
-    --group_by_modality_length True \
-    --bf16 True \
-    --output_dir $OUTPUT_DIR \
-    --num_train_epochs 1 \
-    --per_device_train_batch_size 16 \
-    --per_device_eval_batch_size 4 \
-    --gradient_accumulation_steps 1 \
-    --evaluation_strategy "no" \
-    --save_strategy "steps" \
-    --save_steps 50000 \
-    --save_total_limit 1 \
-    --learning_rate 2e-5 \
-    --weight_decay 0. \
-    --warmup_ratio 0.03 \
-    --lr_scheduler_type "cosine" \
-    --logging_steps 1 \
-    --tf32 True \
-    --model_max_length 2048 \
-    --gradient_checkpointing True \
-    --dataloader_num_workers 4 \
-    --lazy_preprocess True \
-    --report_to wandb
-```
-
-## LoRA fine-tuning (memory efficient)
-
-```python
-from peft import LoraConfig, get_peft_model
-
-# LoRA config
-lora_config = LoraConfig(
-    r=8,  # LoRA rank
-    lora_alpha=16,
-    target_modules=["q_proj", "v_proj"],
-    lora_dropout=0.05,
-    bias="none",
-    task_type="CAUSAL_LM"
-)
-
-# Apply LoRA
-model = get_peft_model(base_model, lora_config)
-
-# Train with much lower memory
-```
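To see why this configuration is cheap, count the trainable parameters: for a weight matrix of shape `d_out × d_in`, LoRA adds two low-rank factors with `r × (d_in + d_out)` parameters in total. The sketch below applies that formula under stated assumptions: a 7B LLaMA-style backbone with hidden size 4096 and 32 layers, and square `q_proj`/`v_proj` matrices. `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(rank: int, d_in: int, d_out: int,
                          modules_per_layer: int, num_layers: int) -> int:
    """Trainable parameters added by LoRA across all targeted matrices."""
    per_matrix = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return per_matrix * modules_per_layer * num_layers


# r=8 on q_proj and v_proj (2 modules) in each of 32 layers, hidden size 4096
added = lora_trainable_params(8, 4096, 4096, 2, 32)
print(added)  # 4194304
```

Roughly 4.2M trainable parameters against about 7B frozen ones is where the memory savings for gradients and optimizer state come from.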
-
-## Hardware requirements
-
-### Full fine-tuning
-
-- **7B model**: 8× A100 (40GB)
-- **13B model**: 8× A100 (80GB)
-- **Training time**: 20-48 hours
-
-### LoRA fine-tuning
-
-- **7B model**: 1× A100 (40GB)
-- **13B model**: 2× A100 (40GB)
-- **Training time**: 10-24 hours
-
-## Best practices
-
-1. **Start with pretrained** - Don't train from scratch
-2. **Use LoRA for efficiency** - 10× less memory
-3. **Quality over quantity** - 1K high-quality > 10K low-quality
-4. **Multi-turn conversations** - More engaging than single Q&A
-5. **Diverse images** - Cover different scenarios
-6. **Clear instructions** - Specific questions get better answers
-7. **Monitor loss** - Should decrease smoothly
-8. **Save checkpoints** - Training can fail
-9. **Test regularly** - Validate on held-out set
-10. **Use DeepSpeed** - For multi-GPU training
-
-## Resources
-
-- **Training script**: https://github.com/haotian-liu/LLaVA/tree/main/scripts
-- **Data format**: https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md
-- **Paper**: https://arxiv.org/abs/2304.08485
--- a/skills/mlops/nemo-curator/SKILL.md
+++ b/skills/mlops/nemo-curator/SKILL.md
@@ -1,386 +0,0 @@
---
-name: nemo-curator
-description: GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [nemo-curator, cudf, dask, rapids]
-metadata:
-  hermes:
-    tags: [Data Processing, NeMo Curator, Data Curation, GPU Acceleration, Deduplication, Quality Filtering, NVIDIA, RAPIDS, PII Redaction, Multimodal, LLM Training Data]
-
---
-
-# NeMo Curator - GPU-Accelerated Data Curation
-
-NVIDIA's toolkit for preparing high-quality training data for LLMs.
-
-## When to use NeMo Curator
-
-**Use NeMo Curator when:**
-- Preparing LLM training data from web scrapes (Common Crawl)
-- Need fast deduplication (16× faster than CPU)
-- Curating multi-modal datasets (text, images, video, audio)
-- Filtering low-quality or toxic content
-- Scaling data processing across GPU cluster
-
-**Performance**:
-- **16× faster** fuzzy deduplication (8TB RedPajama v2)
-- **40% lower TCO** vs CPU alternatives
-- **Near-linear scaling** across GPU nodes
-
-**Use alternatives instead**:
-- **datatrove**: CPU-based, open-source data processing
-- **dolma**: Allen AI's data toolkit
-- **Ray Data**: General ML data processing (no curation focus)
-
-## Quick start
-
-### Installation
-
-```bash
-# Text curation (CUDA 12)
-uv pip install "nemo-curator[text_cuda12]"
-
-# All modalities
-uv pip install "nemo-curator[all_cuda12]"
-
-# CPU-only (slower)
-uv pip install "nemo-curator[cpu]"
-```
-
-### Basic text curation pipeline
-
-```python
-from nemo_curator import ScoreFilter, Modify
-from nemo_curator.datasets import DocumentDataset
-import pandas as pd
-
-# Load data
-df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
-dataset = DocumentDataset(df)
-
-# Quality filtering
-def quality_score(doc):
-    return len(doc["text"].split()) > 5  # Filter short docs
-
-filtered = ScoreFilter(quality_score)(dataset)
-
-# Deduplication
-from nemo_curator.modules import ExactDuplicates
-deduped = ExactDuplicates()(filtered)
-
-# Save
-deduped.to_parquet("curated_data/")
-```
-
-## Data curation pipeline
-
-### Stage 1: Quality filtering
-
-```python
-from nemo_curator import ScoreFilter
-from nemo_curator.filters import (
-    WordCountFilter,
-    RepeatedLinesFilter,
-    UrlsFilter,
-)
-
-# Each heuristic filter is wrapped in ScoreFilter and then applied to the dataset;
-# the thresholds below are illustrative.
-
-# Word count filter
-dataset = ScoreFilter(WordCountFilter(min_words=50, max_words=100000), text_field="text")(dataset)
-
-# Remove repetitive content
-dataset = ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.3), text_field="text")(dataset)
-
-# URL ratio filter
-dataset = ScoreFilter(UrlsFilter(max_url_to_text_ratio=0.2), text_field="text")(dataset)
-```
-
-### Stage 2: Deduplication
-
-**Exact deduplication**:
-```python
-from nemo_curator.modules import ExactDuplicates
-
-# Remove exact duplicates
-deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
-```
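Conceptually, exact deduplication hashes each document's text and keeps the first document seen per hash. The sketch below shows that idea in plain Python on an in-memory list. It is an illustration of the technique, not NeMo Curator's implementation, and `exact_dedup` is a hypothetical name.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document for each distinct MD5 hash of its text."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = [
    {"id": "a", "text": "GPU-accelerated curation"},
    {"id": "b", "text": "GPU-accelerated curation"},  # byte-identical to "a"
    {"id": "c", "text": "gpu-accelerated curation"},  # differs in case, so it is kept
]
print([d["id"] for d in exact_dedup(docs)])  # ['a', 'c']
```

Any difference in bytes, including case or whitespace, produces a different hash, which is the gap fuzzy deduplication is meant to close.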
-
-**Fuzzy deduplication** (16× faster on GPU):
-```python
-from nemo_curator.modules import FuzzyDuplicates
-
-# MinHash + LSH deduplication
-fuzzy_dedup = FuzzyDuplicates(
-    id_field="id",
-    text_field="text",
-    num_hashes=260,      # MinHash parameters
-    num_buckets=20,
-    hash_method="md5"
-)
-
-deduped = fuzzy_dedup(dataset)
-```
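The MinHash + LSH idea behind these parameters can be sketched in plain Python. The version below is a toy illustration, not NeMo Curator's GPU implementation: it assumes the 260 hash values are split evenly into 20 bands of 13, uses character 5-grams as shingles, and treats any two documents sharing a full band as a candidate pair. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=5):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, gram):
    """Seeded 64-bit hash built from MD5 (one 'hash function' per seed)."""
    digest = hashlib.md5(f"{seed}:{gram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=260):
    """Signature = minimum hash value over all shingles, per hash function."""
    grams = shingles(text)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def lsh_candidate_pairs(docs, num_hashes=260, num_buckets=20):
    """Return pairs of doc ids whose signatures agree on at least one band."""
    rows = num_hashes // num_buckets  # 13 signature values per band
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text, num_hashes)
        for band in range(num_buckets):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)

    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


docs = {
    "a": "The quick brown fox",
    "b": "the  quick brown   fox",  # same text after normalization
    "c": "Completely unrelated text about GPUs",
}
print(lsh_candidate_pairs(docs))
```

Documents `a` and `b` have identical shingle sets, so they collide in every band, while `c` shares no shingles with them. For near-duplicates, more bands with fewer rows each make a collision more likely, so the split trades recall against false positives.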
-
-**Semantic deduplication**:
-```python
-from nemo_curator.modules import SemanticDuplicates
-
-# Embedding-based deduplication
-semantic_dedup = SemanticDuplicates(
-    id_field="id",
-    text_field="text",
-    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
-    threshold=0.8  # Cosine similarity threshold
-)
-
-deduped = semantic_dedup(dataset)
-```
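The role of the cosine-similarity threshold can be illustrated with a brute-force sketch: keep a document only if its embedding is less similar than the threshold to everything already kept. This is quadratic and uses hand-written toy vectors instead of a real embedding model, so it shows the technique rather than how the library scales it. `cosine` and `semantic_dedup` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_dedup(embeddings, threshold=0.8):
    """Greedily keep ids whose embedding is not too close to any kept one."""
    kept = []
    for doc_id, vec in embeddings.items():
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(doc_id)
    return kept


embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],  # nearly parallel to "a" (similarity about 0.99)
    "c": [0.0, 1.0],  # orthogonal to "a"
}
print(semantic_dedup(embeddings))  # ['a', 'c']
```

Lowering the threshold removes more documents, including ones that are merely on the same topic, so it is worth inspecting a sample of removed pairs before committing to a value.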
-
-### Stage 3: PII redaction
-
-```python
-from nemo_curator.modules import Modify
-from nemo_curator.modifiers import PIIRedactor
-
-# Redact personally identifiable information
-pii_redactor = PIIRedactor(
-    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
-    anonymize_action="replace"  # or "redact"
-)
-
-redacted = Modify(pii_redactor)(dataset)
-```
-
-### Stage 4: Classifier filtering
-
-```python
-from nemo_curator.classifiers import QualityClassifier
-
-# Quality classification
-quality_clf = QualityClassifier(
-    model_path="nvidia/quality-classifier-deberta",
-    batch_size=256,
-    device="cuda"
-)
-
-# Filter low-quality documents
-high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
-```
-
-## GPU acceleration
-
-### GPU vs CPU performance
-
-| Operation | CPU | GPU | Speedup |
-|-----------|-----|-----|---------|
-| Fuzzy dedup (8TB) | 120 hours (256 cores) | 7.5 hours (8× A100) | 16× |
-| Exact dedup (1TB) | 8 hours (64 cores) | 0.5 hours (4× A100) | 16× |
-| Quality filtering (100GB) | 2 hours (32 cores) | 0.2 hours (2× A100) | 10× |
-
-### Multi-GPU scaling
-
-```python
-from nemo_curator import get_client
-import dask_cuda
-
-# Initialize GPU cluster
-client = get_client(cluster_type="gpu", n_workers=8)
-
-# Process with 8 GPUs
-deduped = FuzzyDuplicates(...)(dataset)
-```
-
-## Multi-modal curation
-
-### Image curation
-
-```python
-from nemo_curator.image import (
-    AestheticFilter,
-    NSFWFilter,
-    CLIPEmbedder
-)
-
-# Aesthetic scoring
-aesthetic_filter = AestheticFilter(threshold=5.0)
-filtered_images = aesthetic_filter(image_dataset)
-
-# NSFW detection
-nsfw_filter = NSFWFilter(threshold=0.9)
-safe_images = nsfw_filter(filtered_images)
-
-# Generate CLIP embeddings
-clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
-image_embeddings = clip_embedder(safe_images)
-```
-
-### Video curation
-
-```python
-from nemo_curator.video import (
-    SceneDetector,
-    ClipExtractor,
-    InternVideo2Embedder
-)
-
-# Detect scenes
-scene_detector = SceneDetector(threshold=27.0)
-scenes = scene_detector(video_dataset)
-
-# Extract clips
-clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
-clips = clip_extractor(scenes)
-
-# Generate embeddings
-video_embedder = InternVideo2Embedder()
-video_embeddings = video_embedder(clips)
-```
-
-### Audio curation
-
-```python
-from nemo_curator.audio import (
-    ASRInference,
-    WERFilter,
-    DurationFilter
-)
-
-# ASR transcription
-asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
-transcribed = asr(audio_dataset)
-
-# Filter by WER (word error rate)
-wer_filter = WERFilter(max_wer=0.3)
-high_quality_audio = wer_filter(transcribed)
-
-# Duration filtering
-duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
-filtered_audio = duration_filter(high_quality_audio)
-```
-
-## Common patterns
-
-### Web scrape curation (Common Crawl)
-
-```python
-from nemo_curator import ScoreFilter, Modify
-from nemo_curator.filters import *
-from nemo_curator.modules import *
-from nemo_curator.datasets import DocumentDataset
-
-# Load Common Crawl data
-dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")
-
-# Pipeline
-pipeline = [
-    # 1. Quality filtering
-    WordCountFilter(min_words=100, max_words=50000),
-    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
-    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
-    UrlRatioFilter(max_url_ratio=0.3),
-
-    # 2. Language filtering
-    LanguageIdentificationFilter(target_languages=["en"]),
-
-    # 3. Deduplication
-    ExactDuplicates(id_field="id", text_field="text"),
-    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),
-
-    # 4. PII redaction
-    PIIRedactor(),
-
-    # 5. NSFW filtering
-    NSFWClassifier(threshold=0.8)
-]
-
-# Execute
-for stage in pipeline:
-    dataset = stage(dataset)
-
-# Save
-dataset.to_parquet("curated_common_crawl/")
-```
-
-### Distributed processing
-
-```python
-from nemo_curator import get_client
-from dask_cuda import LocalCUDACluster
-
-# Multi-GPU cluster
-cluster = LocalCUDACluster(n_workers=8)
-client = get_client(cluster=cluster)
-
-# Process large dataset
-dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
-deduped = FuzzyDuplicates(...)(dataset)
-
-# Cleanup
-client.close()
-cluster.close()
-```
-
-## Performance benchmarks
-
-### Fuzzy deduplication (8TB RedPajama v2)
-
-- **CPU (256 cores)**: 120 hours
-- **GPU (8× A100)**: 7.5 hours
-- **Speedup**: 16×
-
-### Exact deduplication (1TB)
-
-- **CPU (64 cores)**: 8 hours
-- **GPU (4× A100)**: 0.5 hours
-- **Speedup**: 16×
-
-### Quality filtering (100GB)
-
-- **CPU (32 cores)**: 2 hours
-- **GPU (2× A100)**: 0.2 hours
-- **Speedup**: 10×
-
-## Cost comparison
-
-**CPU-based curation** (AWS c5.18xlarge × 10):
-- Cost: $3.60/hour × 10 = $36/hour
-- Time for 8TB: 120 hours
-- **Total**: $4,320
-
-**GPU-based curation** (AWS p4d.24xlarge × 2):
-- Cost: $32.77/hour × 2 = $65.54/hour
-- Time for 8TB: 7.5 hours
-- **Total**: $491.55
-
-**Savings**: 89% reduction ($3,828 saved)
-
-## Supported data formats
-
-- **Input**: Parquet, JSONL, CSV
-- **Output**: Parquet (recommended), JSONL
-- **WebDataset**: TAR archives for multi-modal
-
-## Use cases
-
-**Production deployments**:
-- NVIDIA used NeMo Curator to prepare Nemotron-4 training data
-- Open-source datasets curated: RedPajama v2, The Pile
-
-## References
-
-- **[Filtering Guide](references/filtering.md)** - 30+ quality filters, heuristics
-- **[Deduplication Guide](references/deduplication.md)** - Exact, fuzzy, semantic methods
-
-## Resources
-
-- **GitHub**: https://github.com/NVIDIA/NeMo-Curator ⭐ 500+
-- **Docs**: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
-- **Version**: 0.4.0+
-- **License**: Apache 2.0
-
-
-
--- a/skills/mlops/nemo-curator/references/deduplication.md
+++ b/skills/mlops/nemo-curator/references/deduplication.md
@@ -1,87 +0,0 @@
-# Deduplication Guide
-
-Complete guide to exact, fuzzy, and semantic deduplication.
-
-## Exact deduplication
-
-Remove documents with identical content.
-
-```python
-from nemo_curator.modules import ExactDuplicates
-
-# Exact deduplication
-exact_dedup = ExactDuplicates(
-    id_field="id",
-    text_field="text",
-    hash_method="md5"  # or "sha256"
-)
-
-deduped = exact_dedup(dataset)
-```
-
-**Performance**: ~16× faster on GPU vs CPU
-
-## Fuzzy deduplication
-
-Remove near-duplicate documents using MinHash + LSH.
-
-```python
-from nemo_curator.modules import FuzzyDuplicates
-
-fuzzy_dedup = FuzzyDuplicates(
-    id_field="id",
-    text_field="text",
-    num_hashes=260,        # MinHash permutations (more = more accurate, slower)
-    num_buckets=20,        # LSH buckets/bands (more = higher recall, more candidate pairs)
-    hash_method="md5",
-    jaccard_threshold=0.8  # Similarity threshold
-)
-
-deduped = fuzzy_dedup(dataset)
-```
-
-**Parameters**:
-- `num_hashes`: 128-512 (default 260)
-- `num_buckets`: 10-50 (default 20)
-- `jaccard_threshold`: 0.7-0.9 (default 0.8)
-
-**Performance**: 16× faster on 8TB dataset (120h → 7.5h)
-
-## Semantic deduplication
-
-Remove semantically similar documents using embeddings.
-
-```python
-from nemo_curator.modules import SemanticDuplicates
-
-semantic_dedup = SemanticDuplicates(
-    id_field="id",
-    text_field="text",
-    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
-    embedding_batch_size=256,
-    threshold=0.85,  # Cosine similarity threshold
-    device="cuda"
-)
-
-deduped = semantic_dedup(dataset)
-```
-
-**Models**:
-- `all-MiniLM-L6-v2`: Fast, 384 dims
-- `all-mpnet-base-v2`: Better quality, 768 dims
-- Custom models supported
-
-## Comparison
-
-| Method | Speed | Recall | Use Case |
-|--------|-------|--------|----------|
-| Exact | Fastest | 100% | Exact matches only |
-| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
-| Semantic | Slow | ~90% | Paraphrases, rewrites |
-
-## Best practices
-
-1. **Start with exact dedup** - Remove obvious duplicates
-2. **Use fuzzy for large datasets** - Best speed/quality trade-off
-3. **Semantic for high-value data** - Expensive but thorough
-4. **Use GPU acceleration** - 10-16× speedup over CPU
--- a/skills/mlops/nemo-curator/references/filtering.md
+++ b/skills/mlops/nemo-curator/references/filtering.md
@@ -1,102 +0,0 @@
-# Quality Filtering Guide
-
-Complete guide to NeMo Curator's 30+ quality filters.
-
-## Text-based filters
-
-### Word count
-
-```python
-from nemo_curator.filters import WordCountFilter
-
-# Filter by word count
-dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))
-```
-
-### Repeated content
-
-```python
-from nemo_curator.filters import RepeatedLinesFilter
-
-# Remove documents with >30% repeated lines
-dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))
-```
-
-### Symbol ratio
-
-```python
-from nemo_curator.filters import SymbolToWordRatioFilter
-
-# Remove documents with too many symbols
-dataset = dataset.filter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3))
-```
-
-### URL ratio
-
-```python
-from nemo_curator.filters import UrlRatioFilter
-
-# Remove documents with many URLs
-dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
-```
-
-## Language filtering
-
-```python
-from nemo_curator.filters import LanguageIdentificationFilter
-
-# Keep only English documents
-dataset = dataset.filter(LanguageIdentificationFilter(target_languages=["en"]))
-
-# Multiple languages
-dataset = dataset.filter(LanguageIdentificationFilter(target_languages=["en", "es", "fr"]))
-```
-
-## Classifier-based filtering
-
-### Quality classifier
-
-```python
-from nemo_curator.classifiers import QualityClassifier
-
-quality_clf = QualityClassifier(
-    model_path="nvidia/quality-classifier-deberta",
-    batch_size=256,
-    device="cuda"
-)
-
-# Keep documents scoring above 0.5 (higher score = higher quality)
-dataset = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
-```
-
-### NSFW classifier
-
-```python
-from nemo_curator.classifiers import NSFWClassifier
-
-nsfw_clf = NSFWClassifier(threshold=0.9, device="cuda")
-
-# Remove NSFW content
-dataset = dataset.filter(lambda doc: nsfw_clf(doc["text"]) < 0.9)
-```
-
-## Heuristic filters
-
-Partial list of the 30+ filters:
-- WordCountFilter
-- RepeatedLinesFilter
-- UrlRatioFilter
-- SymbolToWordRatioFilter
-- NonAlphaNumericFilter
-- BulletsFilter
-- WhiteSpaceFilter
-- ParenthesesFilter
-- LongWordFilter
-- And 20+ more...
-
-## Best practices
-
-1. **Apply cheap filters first** - Word count before GPU classifiers
-2. **Tune thresholds on sample** - Test on 10k docs before full run
-3. **Use GPU classifiers sparingly** - Expensive but effective
-4. **Chain filters efficiently** - Order by cost (cheap → expensive)
--- a/skills/mlops/obliteratus/SKILL.md
+++ b/skills/mlops/obliteratus/SKILL.md
@@ -1,314 +0,0 @@
----
-name: obliteratus
-description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods (+ 4 Python-API-only), 15 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
-version: 1.0.0
-author: Hermes Agent
-license: MIT
-dependencies: [obliteratus, torch, transformers, bitsandbytes, accelerate, safetensors]
-metadata:
-  hermes:
-    tags: [Abliteration, Uncensoring, Refusal-Removal, LLM, Weight-Projection, SVD, Mechanistic-Interpretability, HuggingFace, Model-Surgery]
-
----
-
-# OBLITERATUS Skill
-
-Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.
-
-**License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.
-
-## When to Use This Skill
-
-Trigger when the user:
-- Wants to "uncensor" or "abliterate" an LLM
-- Asks about removing refusal/guardrails from a model
-- Wants to create an uncensored version of Llama, Qwen, Mistral, etc.
-- Mentions "refusal removal", "abliteration", "weight projection"
-- Wants to analyze how a model's refusal mechanism works
-- References OBLITERATUS, FailSpy, abliterator, or refusal directions
-
-## Step 1: Installation
-
-Check if already installed:
-```bash
-obliteratus --version 2>/dev/null && echo "INSTALLED" || echo "NOT INSTALLED"
-```
-
-If not installed, clone and install from GitHub:
-```
-Repository: https://github.com/elder-plinius/OBLITERATUS
-Install: pip install -e . (from the cloned directory)
-For Gradio UI: pip install -e ".[spaces]"
-```
-
-**IMPORTANT:** Confirm with user before installing. This pulls in ~5-10GB of dependencies (PyTorch, Transformers, bitsandbytes, etc.).
-
-## Step 2: Check Hardware
-
-Before anything, check what GPU is available:
-```bash
-python3 -c "
-import torch
-if torch.cuda.is_available():
-    gpu = torch.cuda.get_device_name(0)
-    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
-    print(f'GPU: {gpu}')
-    print(f'VRAM: {vram:.1f} GB')
-    if vram < 4: print('TIER: tiny (models under 1B)')
-    elif vram < 8: print('TIER: small (models 1-4B)')
-    elif vram < 16: print('TIER: medium (models 4-9B with 4bit quant)')
-    elif vram < 32: print('TIER: large (models 8-32B with 4bit quant)')
-    else: print('TIER: frontier (models 32B+)')
-else:
-    print('NO GPU - only tiny models (under 1B) on CPU')
-"
-```
-
-### VRAM Requirements (with 4-bit quantization)
-
-| VRAM     | Max Model Size  | Example Models                              |
-|:---------|:----------------|:--------------------------------------------|
-| CPU only | ~1B params      | GPT-2, TinyLlama, SmolLM                    |
-| 4-8 GB   | ~4B params      | Qwen2.5-1.5B, Phi-3.5 mini, Llama 3.2 3B   |
-| 8-16 GB  | ~9B params      | Llama 3.1 8B, Mistral 7B, Gemma 2 9B       |
-| 24 GB    | ~32B params     | Qwen3-32B, Llama 3.1 70B (tight), Command-R |
-| 48 GB+   | ~72B+ params    | Qwen2.5-72B, DeepSeek-R1                    |
-| Multi-GPU| 200B+ params    | Llama 3.1 405B, DeepSeek-V3 (685B MoE)      |
-
-## Step 3: Browse Available Models
-
-```bash
-# List models for your compute tier
-obliteratus models --tier medium
-
-# Get architecture info for a specific model
-obliteratus info meta-llama/Llama-3.1-8B-Instruct
-```
-
-## Step 4: Choose a Method
-
-### Method Selection Guide
-
-**First time / unsure? Use `informed`.** It auto-configures everything.
-
-| Situation                         | Recommended Method | Why                                      |
-|:----------------------------------|:-------------------|:-----------------------------------------|
-| First attempt, any model          | `informed`         | Auto-detects alignment type, auto-tunes  |
-| Quick test / prototyping          | `basic`            | Fast, simple, good enough to evaluate    |
-| Dense model (Llama, Mistral)      | `advanced`         | Multi-direction, norm-preserving         |
-| MoE model (DeepSeek, Mixtral)     | `nuclear`          | Expert-granular, handles MoE complexity  |
-| Reasoning model (R1 distills)     | `surgical`         | CoT-aware, preserves chain-of-thought    |
-| Stubborn refusals persist         | `aggressive`       | Whitened SVD + head surgery + jailbreak   |
-| Want reversible changes           | Steering vectors   | Inference-time hooks, weights unchanged  |
-| Maximum quality, time no object   | `optimized`        | Bayesian search for best parameters      |
-
-### 9 CLI Methods
-
-These can be passed to `--method` on the command line:
-
-- **basic** — Single refusal direction via diff-in-means. Fastest, simplest. (Arditi et al. 2024)
-- **advanced** — Multiple SVD directions, norm-preserving projection. Good default.
-- **aggressive** — Whitened SVD + jailbreak contrast + attention head surgery
-- **spectral_cascade** — DCT frequency-domain decomposition
-- **informed** — Runs analysis DURING abliteration to auto-configure. Detects DPO/RLHF/CAI, maps refusal geometry, compensates for self-repair. Best quality.
-- **surgical** — SAE features + neuron masking + head surgery + per-expert. Maximum precision.
-- **optimized** — Bayesian hyperparameter search (Optuna TPE). Slowest but optimal.
-- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
-- **nuclear** — Maximum force combo for stubborn MoE models.
-
-### 4 Python-API-Only Methods
-
-These reproduce prior community/academic work but are NOT available via CLI — only via the Python API (`from obliteratus.abliterate import AbliterationPipeline`). **Do not use these in CLI commands.**
-
-- **failspy** — FailSpy/abliterator reproduction
-- **gabliteration** — Gabliteration reproduction
-- **heretic** — Heretic/p-e-w reproduction
-- **rdo** — Refusal Direction Optimization (ICML 2025)
-
-## Step 5: Run Abliteration
-
-### Basic Usage
-
-```bash
-# Default (advanced method)
-obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct
-
-# With the informed pipeline (recommended)
-obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method informed
-
-# With 4-bit quantization to save VRAM
-obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
-  --method informed \
-  --quantization 4bit \
-  --output-dir ./abliterated-models
-
-# For large models, use conservative settings (--large-model defaults target 120B+)
-obliteratus obliterate Qwen/Qwen2.5-72B-Instruct \
-  --method advanced \
-  --quantization 4bit \
-  --large-model \
-  --output-dir ./abliterated-models
-```
-
-### Fine-Tuning Parameters
-
-```bash
-obliteratus obliterate <model> \
-  --method advanced \
-  --n-directions 8 \
-  --regularization 0.1 \
-  --refinement-passes 3 \
-  --dtype bfloat16 \
-  --device auto \
-  --output-dir ./output
-```
-
-Parameter explanations:
-- `--n-directions N` — How many refusal directions to remove (default: auto-detected)
-- `--regularization 0.0-1.0` — Fraction of original weights to preserve (higher = safer but less complete removal)
-- `--refinement-passes N` — Iterative passes to catch self-repair (Ouroboros effect)
-- `--dtype` — float16, bfloat16, or float32
-- `--quantization` — 4bit or 8bit (saves VRAM, slight quality tradeoff)
-- `--large-model` — Conservative defaults for 120B+ models (fewer directions, fewer passes)
-
-### Interactive Mode (Guided)
-
-For users unsure about options:
-```bash
-obliteratus interactive
-```
-
-### Web UI (Gradio)
-
-```bash
-obliteratus ui --port 7860
-```
-
-## Step 6: Verify Results
-
-After abliteration, check the output report for:
-
-| Metric         | Good Value          | Concerning Value        | Meaning                                    |
-|:---------------|:--------------------|:------------------------|:-------------------------------------------|
-| Refusal rate   | Near 0%             | > 10%                   | Refusals still present, try harder method  |
-| Perplexity     | Within 10% of orig  | > 20% increase          | Model coherence damaged, too aggressive    |
-| KL divergence  | < 0.1               | > 0.5                   | Large output distribution shift            |
-| Coherence      | High                | Low                     | Model generating nonsense                  |
-
-### If perplexity spiked (too aggressive):
-1. Increase `--regularization` (e.g., 0.2 or 0.3)
-2. Decrease `--n-directions` (e.g., 4 instead of 8)
-3. Use a less aggressive method (`advanced` instead of `aggressive`)
-
-### If refusal persists (not aggressive enough):
-1. Use `--method aggressive` or `--method nuclear`
-2. Add `--refinement-passes 3` to catch self-repair
-3. Use `--method informed` which auto-compensates
-
-## Step 7: Use the Abliterated Model
-
-The output is a standard HuggingFace model directory. Use it like any other model:
-
-### Quick test
-```bash
-python3 << 'EOF'
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model = AutoModelForCausalLM.from_pretrained("./abliterated-models/model-name")
-tokenizer = AutoTokenizer.from_pretrained("./abliterated-models/model-name")
-inputs = tokenizer("Write a story about:", return_tensors="pt").to(model.device)
-outputs = model.generate(**inputs, max_new_tokens=200)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-EOF
-```
-
-### Upload to HuggingFace Hub
-```bash
-huggingface-cli login  # if not already logged in
-huggingface-cli upload your-username/model-name-abliterated ./abliterated-models/model-name
-```
-
-### Serve with vLLM
-```bash
-vllm serve ./abliterated-models/model-name --port 8000
-```
-
-## Analysis Modules (15 Modules, Pre-Abliteration, Optional)
-
-For understanding refusal geometry before committing to abliteration.
-
-### Run a Study
-
-```bash
-obliteratus run study-config.yaml --preset jailbreak
-```
-
-### Study Presets
-
-| Preset       | Purpose                              | Time   |
-|:-------------|:-------------------------------------|:-------|
-| `quick`      | Sanity check, basic metrics          | ~5 min |
-| `jailbreak`  | Refusal circuit localization         | ~20 min|
-| `guardrail`  | Guardrail robustness evaluation      | ~30 min|
-| `attention`  | Attention head contributions         | ~30 min|
-| `knowledge`  | FFN importance mapping               | ~30 min|
-| `full`       | Complete analysis, all strategies    | ~1 hr  |
-
-### Key Analysis Modules
-
-- **Alignment Imprint Detection** — Fingerprints DPO vs RLHF vs CAI vs SFT from subspace geometry
-- **Concept Cone Geometry** — Is refusal one linear direction or a polyhedral cone (many directions)?
-- **Refusal Logit Lens** — Which transformer layer makes the refusal decision?
-- **Ouroboros Detection** — Will the model self-repair its refusal after removal?
-- **Causal Tracing** — Which attention heads and MLP layers are causally necessary for refusal?
-- **Cross-Model Transfer** — Can refusal directions from one model architecture work on another?
-- **Residual Stream Decomposition** — Attention vs MLP contribution to refusal behavior
-- **SAE-based Analysis** — Sparse Autoencoder feature decomposition of refusal circuits
-
-## Steering Vectors (Reversible Alternative)
-
-For testing refusal removal without permanent weight changes:
-
-Steering vectors apply activation hooks at inference time. Model weights stay unchanged.
-Generated during the PROBE/DISTILL stages and can be saved/applied/removed at will.
-Useful for A/B testing before committing to permanent abliteration.
-
-## YAML Config for Reproducible Studies
-
-For complex or reproducible workflows, use YAML configs. See templates/ for examples:
-```bash
-obliteratus run my_study.yaml
-```
-
-## Telemetry Notice
-
-- **CLI usage (local installs)**: Telemetry is OFF by default. Must explicitly opt in via `OBLITERATUS_TELEMETRY=1` env var or `--contribute` flag.
-- **HuggingFace Spaces**: Telemetry is ON by default (auto-enabled when `SPACE_ID` env var is detected).
-- Collected: model ID, method, benchmark scores, hardware info, timing (anonymous)
-- NOT collected: IP addresses, user identity, prompt content
-- Force off: `export OBLITERATUS_TELEMETRY=0`
-
-## Common Pitfalls
-
-1. **OOM (Out of Memory)** — Use `--quantization 4bit` and `--large-model` for big models
-2. **Perplexity spike** — Too aggressive. Increase `--regularization` or reduce `--n-directions`
-3. **Refusal persists** — Try `--method aggressive` or `--refinement-passes 3`
-4. **MoE models resist** — Use `--method nuclear` for DeepSeek, Mixtral, DBRX
-5. **Gated models fail** — Run `huggingface-cli login` and accept model terms on HF website first
-6. **Self-repair (Ouroboros)** — Some models reconstruct refusal. Use `--method informed` which auto-compensates
-7. **CoT damage** — Reasoning models lose chain-of-thought. Use `--method surgical` (CoT-aware)
-8. **Disk space** — Output is full model copy. 8B fp16 = ~16GB, 70B fp16 = ~140GB
-9. **Slow on CPU** — CPU-only is viable only for tiny models (<1B). Anything bigger needs GPU.
-
-## Complementary Hermes Skills
-
-After abliteration:
-- **axolotl** / **unsloth** — Fine-tune the abliterated model further
-- **serving-llms-vllm** — Serve the model as an OpenAI-compatible API
-- **sparse-autoencoder-training** — Train SAEs for deeper interpretability work
-
-## Resources
-
-- [OBLITERATUS GitHub](https://github.com/elder-plinius/OBLITERATUS) (AGPL-3.0)
-- [HuggingFace Spaces Demo](https://huggingface.co/spaces/pliny-the-prompter/obliteratus)
-- [Arditi et al. 2024 — Refusal in LMs Is Mediated by a Single Direction](https://arxiv.org/abs/2406.11717)
-- [Refusal Direction Optimization — ICML 2025](https://arxiv.org/abs/2411.14793)
--- a/skills/mlops/obliteratus/references/analysis-modules.md
+++ b/skills/mlops/obliteratus/references/analysis-modules.md
@@ -1,170 +0,0 @@
-# OBLITERATUS Analysis Modules — Reference
-
-15 analysis modules for mechanistic interpretability of refusal in LLMs.
-These help you understand HOW a model refuses before you decide to remove it.
-
-> **Note:** The `analysis/` directory contains additional utility files (utils.py,
-> visualization.py, etc.) and helper functions beyond the 15 core analysis modules
-> listed below. The module count matches the README's "15 deep analysis modules."
-
-## Core Analysis (Run These First)
-
-### Alignment Imprint Detection
-**File:** `alignment_imprint.py`
-**Purpose:** Identifies what alignment technique was used to train the model
-**Detects:** DPO, RLHF, CAI (Constitutional AI), SFT (Supervised Fine-Tuning)
-**How:** Analyzes subspace geometry — each alignment method leaves a distinct
-geometric "fingerprint" in the weight space
-**Output:** Detected method + confidence score
-**Why it matters:** Different alignment methods need different abliteration approaches.
-DPO models typically have cleaner single-direction refusal; RLHF is more diffuse.
-
-### Concept Cone Geometry
-**File:** `concept_geometry.py`
-**Purpose:** Maps whether refusal is one direction or a polyhedral cone (many)
-**Output:** Cone angle, dimensionality, per-category breakdown
-**Why it matters:** If refusal is a single direction, `basic` method works. If it's
-a cone (multiple directions for different refusal categories), you need `advanced`
-or `informed` with higher `n_directions`.
-
-### Refusal Logit Lens
-**File:** `logit_lens.py`
-**Purpose:** Identifies the specific layer where the model "decides" to refuse
-**How:** Projects intermediate hidden states to vocabulary space at each layer,
-watches when "I cannot" tokens spike in probability
-**Output:** Layer-by-layer refusal probability plot
-**Why it matters:** Tells you which layers are most important to target
-
-### Ouroboros (Self-Repair) Detection
-**File:** `anti_ouroboros.py`
-**Purpose:** Predicts whether the model will reconstruct its refusal after removal
-**How:** Measures redundancy in refusal representation across layers
-**Output:** Self-repair risk score (0-1)
-**Why it matters:** High self-repair risk means you need multiple refinement passes
-or the `informed` method which auto-compensates
-
-### Causal Tracing
-**File:** `causal_tracing.py`
-**Purpose:** Determines which components are causally necessary for refusal
-**How:** Patches activations between clean and corrupted runs, measures causal effect
-**Output:** Causal importance map across layers, heads, and MLPs
-**Why it matters:** Shows exactly which components to target for surgical removal
-
-## Geometric Analysis
-
-### Cross-Layer Alignment
-**File:** `cross_layer.py`
-**Purpose:** Measures how aligned refusal directions are across layers
-**Output:** Alignment matrix, cluster assignments
-**Why it matters:** If directions are highly aligned across layers, removal is easier.
-If they cluster, you may need layer-group-specific directions.
-
-### Residual Stream Decomposition
-**File:** `residual_stream.py`
-**Purpose:** Breaks down refusal into Attention vs MLP contributions
-**Output:** Per-layer Attention/MLP contribution to refusal direction
-**Why it matters:** Helps decide whether to target attention heads, MLPs, or both
-
-### Riemannian Manifold Geometry
-**File:** `riemannian_manifold.py` (673 lines)
-**Purpose:** Analyzes the weight manifold geometry around refusal directions
-**Output:** Curvature, geodesics, tangent space analysis
-**Why it matters:** Research-grade; helps understand the geometric structure of alignment
-
-### Whitened SVD
-**File:** `whitened_svd.py`
-**Purpose:** Covariance-normalized SVD extraction
-**How:** Whitens the activation covariance before computing refusal directions,
-separating true refusal signal from natural activation variance
-**Output:** Cleaner refusal directions with less noise
-**Why it matters:** Produces more precise directions, especially for noisy activations
-
-## Probing & Classification
-
-### Activation Probing
-**File:** `activation_probing.py`
-**Purpose:** Post-excision probing to verify refusal signal is truly gone
-**Output:** Residual refusal signal strength per layer
-**Why it matters:** Verification that abliteration was complete
-
-### Probing Classifiers
-**File:** `probing_classifiers.py`
-**Purpose:** Trains linear classifiers to detect refusal in hidden states
-**Output:** Classification accuracy per layer (should drop to ~50% after abliteration)
-**Why it matters:** Quantitative measure of refusal removal completeness
-
-### Activation Patching
-**File:** `activation_patching.py`
-**Purpose:** Interchange interventions — swap activations between harmful/harmless runs
-**Output:** Which components are sufficient (not just necessary) for refusal
-**Why it matters:** Complementary to causal tracing; together they give full picture
-
-## Transfer & Robustness
-
-### Cross-Model Transfer
-**File:** `cross_model_transfer.py`
-**Purpose:** Tests if refusal directions from one model work on another
-**Output:** Transfer success rate between model pairs
-**Why it matters:** If directions transfer, you can skip PROBE stage on similar models
-
-### Defense Robustness
-**File:** `defense_robustness.py`
-**Purpose:** Evaluates how robust the model's refusal defenses are
-**Output:** Robustness score, entanglement mapping
-**Why it matters:** Higher robustness = need more aggressive method
-
-### Spectral Certification
-**File:** `spectral_certification.py`
-**Purpose:** Certifies completeness of refusal direction removal
-**Output:** Spectral gap analysis, completeness score
-**Why it matters:** Formal verification that all major refusal components are addressed
-
-## Advanced / Research
-
-### SAE-based Abliteration
-**File:** `sae_abliteration.py` (762 lines)
-**Purpose:** Uses Sparse Autoencoder features to decompose refusal at feature level
-**Output:** Refusal-specific SAE features, targeted removal
-**Why it matters:** Most fine-grained approach; can target individual refusal "concepts"
-
-### Wasserstein Optimal Extraction
-**File:** `wasserstein_optimal.py`
-**Purpose:** Optimal transport-based direction extraction
-**Output:** Wasserstein-optimal refusal directions
-**Why it matters:** Theoretically optimal direction extraction under distributional assumptions
-
-### Bayesian Kernel Projection
-**File:** `bayesian_kernel_projection.py`
-**Purpose:** Bayesian approach to refusal direction projection
-**Output:** Posterior distribution over refusal directions
-**Why it matters:** Quantifies uncertainty in direction estimation
-
-### Conditional Abliteration
-**File:** `conditional_abliteration.py`
-**Purpose:** Domain-specific conditional removal (remove refusal for topic X but keep for Y)
-**Output:** Per-domain refusal directions
-**Why it matters:** Selective uncensoring — remove only specific refusal categories
-
-### Steering Vectors
-**File:** `steering_vectors.py`
-**Purpose:** Generate inference-time steering vectors (reversible alternative)
-**Output:** Steering vector files that can be applied/removed at inference
-**Why it matters:** Non-destructive alternative to permanent weight modification
-
-### Tuned Lens
-**File:** `tuned_lens.py`
-**Purpose:** Trained linear probes per layer (more accurate than raw logit lens)
-**Output:** Layer-by-layer refusal representation with trained projections
-**Why it matters:** More accurate than logit lens, especially for deeper models
-
-### Multi-Token Position Analysis
-**File:** `multi_token_position.py`
-**Purpose:** Analyzes refusal signal at multiple token positions (not just last)
-**Output:** Position-dependent refusal direction maps
-**Why it matters:** Some models encode refusal at the system prompt position, not the query
-
-### Sparse Surgery
-**File:** `sparse_surgery.py`
-**Purpose:** Row-level sparse weight surgery instead of full matrix projection
-**Output:** Targeted weight modifications at the row level
-**Why it matters:** More surgical than full-matrix projection, less collateral damage
--- a/skills/mlops/obliteratus/references/methods-guide.md
+++ b/skills/mlops/obliteratus/references/methods-guide.md
@@ -1,132 +0,0 @@
-# OBLITERATUS Methods — Detailed Guide
-
-> **Important:** The CLI (`obliteratus obliterate --method`) accepts 9 methods:
-> basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized,
-> inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo)
-> are available only via the Python API and will be rejected by argparse if used on CLI.
-
-## How Abliteration Works (Theory)
-
-When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?"
-as a direction in its internal activation space. When processing a "harmful" prompt,
-activations shift in this direction, causing the model to generate refusal text.
-
-Abliteration works by:
-1. Measuring this direction (the difference between harmful and harmless activations)
-2. Removing it from the model's weight matrices via orthogonal projection
-3. The model can no longer "point toward" refusal, so it responds normally
-
-Mathematically: `W_new = W_old - W_old @ (d @ d.T)`, where `d` is the unit-norm refusal direction as a column vector, so `d @ d.T` is the rank-1 projector onto it.
-
-## Method Details
-
-### basic
-**Technique:** Single refusal direction via diff-in-means
-**Based on:** Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction")
-**Speed:** Fast (~5-10 min for 8B)
-**Quality:** Moderate — works for simple refusal patterns
-**Best for:** Quick tests, models with clean single-direction refusal
-**Limitation:** Misses complex multi-direction refusal patterns
-
-### advanced (DEFAULT)
-**Technique:** Multiple SVD directions with norm-preserving projection
-**Speed:** Medium (~10-20 min for 8B)
-**Quality:** Good — handles multi-direction refusal
-**Best for:** Dense models (Llama, Qwen, Mistral) as a reliable default
-**Key improvement:** Norm preservation prevents weight magnitude drift
-
-### informed (RECOMMENDED)
-**Technique:** Analysis-guided auto-configuration
-**Speed:** Slow (~20-40 min for 8B, runs 4 analysis modules first)
-**Quality:** Best — adapts to each model's specific refusal implementation
-**Best for:** Any model when quality matters more than speed
-
-The informed pipeline runs these analysis modules during abliteration:
-1. **AlignmentImprintDetector** — Detects DPO/RLHF/CAI/SFT → sets regularization
-2. **ConceptConeAnalyzer** — Polyhedral vs linear refusal → sets n_directions
-3. **CrossLayerAlignmentAnalyzer** — Cluster-aware → selects target layers
-4. **DefenseRobustnessEvaluator** — Self-repair risk → sets refinement passes
-5. **Ouroboros loop** — Re-probes after excision, re-excises if refusal persists
-
-### aggressive
-**Technique:** Whitened SVD + jailbreak-contrastive activations + attention head surgery
-**Speed:** Slow (~30-60 min for 8B)
-**Quality:** High but higher risk of coherence damage
-**Best for:** Models that resist gentler methods
-**Key feature:** Whitened SVD separates refusal signal from natural activation variance
-
-### surgical
-**Technique:** SAE features + neuron masking + head surgery + per-expert directions
-**Speed:** Very slow (~1-2 hrs for 8B, needs SAE)
-**Quality:** Highest precision
-**Best for:** Reasoning models (R1 distills) where you must preserve CoT
-**Key feature:** CoT-Aware — explicitly protects reasoning-critical directions
-
-### nuclear
-**Technique:** Everything combined — expert transplant + steering + per-expert directions
-**Speed:** Very slow
-**Quality:** Most thorough removal, highest risk of side effects
-**Best for:** Stubborn MoE models (DeepSeek, Mixtral, DBRX) that resist other methods
-**Key feature:** Expert-granular abliteration decomposes signals per MoE expert
-
-### optimized
-**Technique:** Bayesian hyperparameter search via Optuna TPE
-**Speed:** Very slow (runs many trials)
-**Quality:** Finds optimal configuration automatically
-**Best for:** Research, when you want the mathematically best parameters
-**Requires:** optuna package
-
-### spectral_cascade
-**Technique:** DCT frequency-domain decomposition of refusal signal
-**Speed:** Medium-slow
-**Quality:** Novel approach, less battle-tested
-**Best for:** Research, exploring alternative decomposition strategies
-
-### inverted
-**Technique:** Reflects (inverts) the refusal direction instead of removing it
-**Speed:** Fast (same as basic)
-**Quality:** Aggressive — model becomes actively willing, not just neutral
-**Best for:** When you want the model to be maximally helpful
-**Warning:** Can make the model too eager; may reduce safety-adjacent reasoning
-
-### failspy / gabliteration / heretic / rdo (PYTHON API ONLY)
-**Technique:** Faithful reproductions of prior community/academic work
-**Speed:** Varies
-**Quality:** Known baselines
-**Best for:** Reproducing published results, comparing methods
-**⚠️ NOT available via CLI** — these methods are only accessible via the Python API.
-Do not use `--method failspy` etc. in CLI commands; argparse will reject them.
-
-## Method Selection Flowchart
-
-```
-Is this a quick test?
-├─ YES → basic
-└─ NO → Is the model MoE (DeepSeek, Mixtral)?
-         ├─ YES → nuclear
-         └─ NO → Is it a reasoning model (R1 distill)?
-                  ├─ YES → surgical
-                  └─ NO → Do you care about speed?
-                           ├─ YES → advanced
-                           └─ NO → informed
-```
-
-## Key Parameters
-
-| Parameter           | Range    | Default | Effect                                      |
-|:--------------------|:---------|:--------|:--------------------------------------------|
-| n_directions        | 1-32     | auto    | More = more thorough but riskier             |
-| regularization      | 0.0-1.0  | 0.0     | Higher preserves more original behavior      |
-| refinement_passes   | 1-5      | 1       | More catches self-repair (Ouroboros effect)   |
-| quantization        | 4/8 bit  | none    | Saves VRAM, slight quality tradeoff          |
-
-## Troubleshooting
-
-| Problem                    | Solution                                          |
-|:---------------------------|:--------------------------------------------------|
-| Refusal rate still > 10%   | Try aggressive/nuclear, add refinement passes     |
-| Perplexity up > 20%        | Reduce n_directions, increase regularization       |
-| Model generates nonsense   | Regularization too low, try 0.2-0.3               |
-| OOM on GPU                 | Use 4-bit quantization, or try smaller model       |
-| MoE model barely changes   | Use nuclear method (expert-granular)               |
-| CoT reasoning broken       | Use surgical method (CoT-aware)                    |
--- a/skills/mlops/obliteratus/templates/abliteration-config.yaml
+++ b/skills/mlops/obliteratus/templates/abliteration-config.yaml
@@ -1,33 +0,0 @@
-# OBLITERATUS Abliteration Config
-# Usage: obliteratus run this-file.yaml
-#
-# This is for reproducible, version-controlled abliteration runs.
-# For one-off usage, the CLI flags are simpler.
-
-# Model to abliterate
-model:
-  name: "meta-llama/Llama-3.1-8B-Instruct"
-  dtype: "bfloat16"         # float16, bfloat16, float32
-  quantization: null         # null, "4bit", "8bit"
-  device: "auto"             # auto, cuda, cuda:0, cpu
-
-# Abliteration method and parameters
-abliteration:
-  method: "informed"         # See SKILL.md Step 4 for all 13 methods
-  n_directions: null         # null = auto-detect, or integer (e.g., 8)
-  regularization: 0.0        # 0.0-1.0, fraction of original to preserve
-  refinement_passes: 1       # Iterative passes (increase for self-repair)
-  norm_preserve: true        # Keep weight norms intact after projection
-
-# Output
-output:
-  directory: "./abliterated-models"
-  save_metadata: true        # Save abliteration_metadata.json alongside model
-  contribute: false          # Save community contribution data
-
-# Verification
-verify:
-  enabled: true
-  test_prompts: null         # null = use built-in test prompts
-  compute_perplexity: true
-  compute_kl: true
--- a/skills/mlops/obliteratus/templates/analysis-study.yaml
+++ b/skills/mlops/obliteratus/templates/analysis-study.yaml
@@ -1,40 +0,0 @@
-# OBLITERATUS Analysis Study Config
-# Usage: obliteratus run this-file.yaml --preset jailbreak
-#
-# Run analysis modules to understand refusal geometry BEFORE abliterating.
-# Useful for research or when you want to understand what you're removing.
-
-# Model to analyze
-model:
-  name: "meta-llama/Llama-3.1-8B-Instruct"
-  dtype: "bfloat16"
-  quantization: "4bit"       # Saves VRAM for analysis
-  device: "auto"
-
-# Study configuration
-study:
-  # Available presets: quick, full, attention, jailbreak, guardrail, knowledge
-  preset: "jailbreak"
-
-  # Or specify individual strategies:
-  # strategies:
-  #   - layer_removal
-  #   - head_pruning
-  #   - ffn_ablation
-  #   - embedding_ablation
-
-# Analysis modules to run (subset of the 27 available)
-analysis:
-  - alignment_imprint        # Detect DPO/RLHF/CAI/SFT training method
-  - concept_geometry          # Map refusal cone geometry
-  - logit_lens               # Find which layer decides to refuse
-  - anti_ouroboros            # Detect self-repair tendency
-  - cross_layer              # Cross-layer alignment clustering
-  - causal_tracing           # Causal necessity of components
-  - residual_stream          # Attention vs MLP contribution
-
-# Output
-output:
-  directory: "./analysis-results"
-  save_plots: true           # Generate matplotlib visualizations
-  save_report: true          # Generate markdown report
--- a/skills/mlops/obliteratus/templates/batch-abliteration.yaml
+++ b/skills/mlops/obliteratus/templates/batch-abliteration.yaml
@@ -1,41 +0,0 @@
-# OBLITERATUS Batch Abliteration Config
-# Abliterate multiple models with the same method for comparison.
-#
-# Run each one sequentially:
-#   for model in models; do obliteratus obliterate $model --method informed; done
-#
-# Or use this as a reference for which models to process.
-
-# Common settings
-defaults:
-  method: "informed"
-  quantization: "4bit"
-  output_dir: "./abliterated-models"
-
-# Models to process (grouped by compute tier)
-models:
-  # Small (4-8 GB VRAM)
-  small:
-    - "Qwen/Qwen2.5-1.5B-Instruct"
-    - "microsoft/Phi-3.5-mini-instruct"
-    - "meta-llama/Llama-3.2-3B-Instruct"
-
-  # Medium (8-16 GB VRAM)
-  medium:
-    - "meta-llama/Llama-3.1-8B-Instruct"
-    - "mistralai/Mistral-7B-Instruct-v0.3"
-    - "google/gemma-2-9b-it"
-    - "Qwen/Qwen2.5-7B-Instruct"
-
-  # Large (24 GB VRAM, 4-bit quantization)
-  large:
-    - "Qwen/Qwen2.5-14B-Instruct"
-    - "Qwen/Qwen3-32B"
-    - "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
-
-# Per-model method overrides (optional)
-overrides:
-  "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B":
-    method: "surgical"        # CoT-aware for reasoning models
-  "mistralai/Mixtral-8x7B-Instruct-v0.1":
-    method: "nuclear"         # Expert-granular for MoE models
--- a/skills/mlops/peft/SKILL.md
+++ b/skills/mlops/peft/SKILL.md
@@ -1,434 +0,0 @@
----
-name: peft-fine-tuning
-description: Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [peft>=0.13.0, transformers>=4.45.0, torch>=2.0.0, bitsandbytes>=0.43.0]
-metadata:
-  hermes:
-    tags: [Fine-Tuning, PEFT, LoRA, QLoRA, Parameter-Efficient, Adapters, Low-Rank, Memory Optimization, Multi-Adapter]
-
----
-
-# PEFT (Parameter-Efficient Fine-Tuning)
-
-Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.
-
-## When to use PEFT
-
-**Use PEFT/LoRA when:**
-- Fine-tuning 7B-70B models with limited GPU memory (e.g. RTX 4090, A100)
-- Need to train <1% of parameters (adapters of tens of MB vs a multi-GB full model)
-- Want fast iteration with multiple task-specific adapters
-- Deploying multiple fine-tuned variants from one base model
-
-**Use QLoRA (PEFT + quantization) when:**
-- Fine-tuning 65B-70B models on a single 48GB GPU, or ~30B models on a 24GB GPU
-- Memory is the primary constraint
-- Can accept a small quality trade-off vs full fine-tuning
-
-**Use full fine-tuning instead when:**
-- Training small models (<1B parameters)
-- Need maximum quality and have compute budget
-- Significant domain shift requires updating all weights
-
-## Quick start
-
-### Installation
-
-```bash
-# Basic installation
-pip install peft
-
-# With quantization support (recommended)
-pip install peft bitsandbytes
-
-# Full stack
-pip install peft transformers accelerate bitsandbytes datasets
-```
-
-### LoRA fine-tuning (standard)
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
-from peft import get_peft_model, LoraConfig, TaskType
-from datasets import load_dataset
-
-# Load base model
-model_name = "meta-llama/Llama-3.1-8B"
-model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-tokenizer.pad_token = tokenizer.eos_token
-
-# LoRA configuration
-lora_config = LoraConfig(
-    task_type=TaskType.CAUSAL_LM,
-    r=16,                          # Rank (8-64, higher = more capacity)
-    lora_alpha=32,                 # Scaling factor (typically 2*r)
-    lora_dropout=0.05,             # Dropout for regularization
-    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
-    bias="none"                    # Don't train biases
-)
-
-# Apply LoRA
-model = get_peft_model(model, lora_config)
-model.print_trainable_parameters()
-# Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%
-
-# Prepare dataset
-dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
-
-def tokenize(example):
-    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
-    return tokenizer(text, truncation=True, max_length=512, padding="max_length")
-
-tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
-
-# Training
-training_args = TrainingArguments(
-    output_dir="./lora-llama",
-    num_train_epochs=3,
-    per_device_train_batch_size=4,
-    gradient_accumulation_steps=4,
-    learning_rate=2e-4,
-    bf16=True,                     # Llama 3.1 weights are bf16; use fp16=True on GPUs without bf16 support
-    logging_steps=10,
-    save_strategy="epoch"
-)
-
-from transformers import DataCollatorForLanguageModeling
-trainer = Trainer(
-    model=model,
-    args=training_args,
-    train_dataset=tokenized,
-    # mlm=False: the collator builds tensors and copies input_ids into labels (pad tokens masked with -100)
-    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
-)
-
-trainer.train()
-
-# Save adapter only (tens of MB vs ~16GB for the full model)
-model.save_pretrained("./lora-llama-adapter")
-```
-
-### QLoRA fine-tuning (memory-efficient)
-
-```python
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
-
-# 4-bit quantization config
-bnb_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
-    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
-    bnb_4bit_use_double_quant=True       # Nested quantization
-)
-
-# Load quantized model
-model = AutoModelForCausalLM.from_pretrained(
-    "meta-llama/Llama-3.1-70B",
-    quantization_config=bnb_config,
-    device_map="auto"
-)
-
-# Prepare for training (enables gradient checkpointing)
-model = prepare_model_for_kbit_training(model)
-
-# LoRA config for QLoRA
-lora_config = LoraConfig(
-    r=64,                              # Higher rank for 70B
-    lora_alpha=128,
-    lora_dropout=0.1,
-    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
-    bias="none",
-    task_type="CAUSAL_LM"
-)
-
-model = get_peft_model(model, lora_config)
-# In 4-bit the 70B weights take roughly 35-40 GB, so plan for a 48GB-class GPU or several smaller ones
-```
-
-## LoRA parameter selection
-
-### Rank (r) - capacity vs efficiency
-
-| Rank | Trainable Params | Memory | Quality | Use Case |
-|------|-----------------|--------|---------|----------|
-| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
-| **8** | ~7M | Low | Good | **Recommended starting point** |
-| **16** | ~14M | Medium | Better | **General fine-tuning** |
-| 32 | ~27M | Higher | High | Complex tasks |
-| 64 | ~54M | High | Highest | Domain adaptation, 70B models |
-
-### Alpha (lora_alpha) - scaling factor
-
-```python
-# Rule of thumb: alpha = 2 * rank
-LoraConfig(r=16, lora_alpha=32)  # Standard
-LoraConfig(r=16, lora_alpha=16)  # Conservative (lower learning rate effect)
-LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher learning rate effect)
-```
-
-### Target modules by architecture
-
-```python
-# Llama / Mistral / Qwen
-target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
-
-# GPT-2 (GPT-Neo names its attention projections q_proj/k_proj/v_proj/out_proj instead of c_attn)
-target_modules = ["c_attn", "c_proj", "c_fc"]
-
-# Falcon
-target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
-
-# BLOOM
-target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
-
-# Auto-detect all linear layers
-target_modules = "all-linear"  # PEFT 0.8.0+
-```
-
-## Loading and merging adapters
-
-### Load trained adapter
-
-```python
-from peft import PeftModel, AutoPeftModelForCausalLM
-from transformers import AutoModelForCausalLM
-
-# Option 1: Load with PeftModel
-base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
-model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")
-
-# Option 2: Load directly (recommended)
-model = AutoPeftModelForCausalLM.from_pretrained(
-    "./lora-llama-adapter",
-    device_map="auto"
-)
-```
-
-### Merge adapter into base model
-
-```python
-# Merge for deployment (no adapter overhead)
-merged_model = model.merge_and_unload()
-
-# Save merged model
-merged_model.save_pretrained("./llama-merged")
-tokenizer.save_pretrained("./llama-merged")
-
-# Push to Hub
-merged_model.push_to_hub("username/llama-finetuned")
-```
-
-### Multi-adapter serving
-
-```python
-from peft import AutoPeftModelForCausalLM
-
-# Load base with first adapter (name it explicitly; otherwise it is registered as "default")
-model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1", adapter_name="task1")
-
-# Load additional adapters
-model.load_adapter("./adapter-task2", adapter_name="task2")
-model.load_adapter("./adapter-task3", adapter_name="task3")
-
-# Switch between adapters at runtime
-model.set_adapter("task1")  # Use task1 adapter
-output1 = model.generate(**inputs)
-
-model.set_adapter("task2")  # Switch to task2
-output2 = model.generate(**inputs)
-
-# Disable adapters (use base model)
-with model.disable_adapter():
-    base_output = model.generate(**inputs)
-```
-
-## PEFT methods comparison
-
-| Method | Trainable % | Memory | Speed | Best For |
-|--------|------------|--------|-------|----------|
-| **LoRA** | 0.1-1% | Low | Fast | General fine-tuning |
-| **QLoRA** | 0.1-1% | Very Low | Medium | Memory-constrained |
-| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
-| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
-| Prefix Tuning | 0.1% | Low | Medium | Generation control |
-| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
-| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |
-
-### IA3 (minimal parameters)
-
-```python
-from peft import IA3Config, get_peft_model
-
-ia3_config = IA3Config(
-    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
-    feedforward_modules=["down_proj"]
-)
-model = get_peft_model(model, ia3_config)
-# Trains only 0.01% of parameters!
-```
-
-### Prefix Tuning
-
-```python
-from peft import PrefixTuningConfig, get_peft_model
-
-prefix_config = PrefixTuningConfig(
-    task_type="CAUSAL_LM",
-    num_virtual_tokens=20,      # Prepended tokens
-    prefix_projection=True       # Use MLP projection
-)
-model = get_peft_model(model, prefix_config)
-```
-
-## Integration patterns
-
-### With TRL (SFTTrainer)
-
-```python
-from trl import SFTTrainer, SFTConfig
-from peft import LoraConfig
-
-lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")
-
-trainer = SFTTrainer(
-    model=model,
-    args=SFTConfig(output_dir="./output", max_seq_length=512),
-    train_dataset=dataset,
-    peft_config=lora_config,  # Pass LoRA config directly
-)
-trainer.train()
-```
-
-### With Axolotl (YAML config)
-
-```yaml
-# axolotl config.yaml
-adapter: lora
-lora_r: 16
-lora_alpha: 32
-lora_dropout: 0.05
-lora_target_modules:
-  - q_proj
-  - v_proj
-  - k_proj
-  - o_proj
-lora_target_linear: true  # Target all linear layers
-```
-
-### With vLLM (inference)
-
-```python
-from vllm import LLM
-from vllm.lora.request import LoRARequest
-
-# Load base model with LoRA support
-llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)
-
-# Serve with adapter
-outputs = llm.generate(
-    prompts,
-    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
-)
-```
-
-## Performance benchmarks
-
-### Memory usage (Llama 3.1 8B)
-
-| Method | GPU Memory | Trainable Params |
-|--------|-----------|------------------|
-| Full fine-tuning | 60+ GB | 8B (100%) |
-| LoRA r=16 | 18 GB | 14M (0.17%) |
-| QLoRA r=16 | 6 GB | 14M (0.17%) |
-| IA3 | 16 GB | 800K (0.01%) |
-
-### Training speed (A100 80GB)
-
-| Method | Tokens/sec | vs Full FT |
-|--------|-----------|------------|
-| Full FT | 2,500 | 1x |
-| LoRA | 3,200 | 1.3x |
-| QLoRA | 2,100 | 0.84x |
-
-### Quality (MMLU benchmark)
-
-| Model | Full FT | LoRA | QLoRA |
-|-------|---------|------|-------|
-| Llama 2-7B | 45.3 | 44.8 | 44.1 |
-| Llama 2-13B | 54.8 | 54.2 | 53.5 |
-
-## Common issues
-
-### CUDA OOM during training
-
-```python
-# Solution 1: Enable gradient checkpointing
-model.gradient_checkpointing_enable()
-
-# Solution 2: Reduce batch size + increase accumulation
-TrainingArguments(
-    per_device_train_batch_size=1,
-    gradient_accumulation_steps=16
-)
-
-# Solution 3: Use QLoRA
-from transformers import BitsAndBytesConfig
-bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
-```
-
-### Adapter not applying
-
-```python
-# Verify adapter is active
-print(model.active_adapters)  # Should show adapter name
-
-# Check trainable parameters
-model.print_trainable_parameters()
-
-# Ensure model in training mode
-model.train()
-```
-
-### Quality degradation
-
-```python
-# Increase rank
-LoraConfig(r=32, lora_alpha=64)
-
-# Target more modules
-target_modules = "all-linear"
-
-# Use more training data and epochs
-TrainingArguments(num_train_epochs=5)
-
-# Lower learning rate
-TrainingArguments(learning_rate=1e-4)
-```
-
-## Best practices
-
-1. **Start with r=8-16**, increase if quality insufficient
-2. **Use alpha = 2 * rank** as starting point
-3. **Target attention + MLP layers** for best quality/efficiency
-4. **Enable gradient checkpointing** for memory savings
-5. **Save adapters frequently** (small files, easy rollback)
-6. **Evaluate on held-out data** before merging
-7. **Use QLoRA for 70B+ models** or whenever the 16-bit weights do not fit in GPU memory
-
-## References
-
-- **[Advanced Usage](references/advanced-usage.md)** - DoRA, LoftQ, rank stabilization, custom modules
-- **[Troubleshooting](references/troubleshooting.md)** - Common errors, debugging, optimization
-
-## Resources
-
-- **GitHub**: https://github.com/huggingface/peft
-- **Docs**: https://huggingface.co/docs/peft
-- **LoRA Paper**: arXiv:2106.09685
-- **QLoRA Paper**: arXiv:2305.14314
-- **Models**: https://huggingface.co/models?library=peft
--- a/skills/mlops/peft/references/advanced-usage.md
+++ b/skills/mlops/peft/references/advanced-usage.md
@@ -1,514 +0,0 @@
-# PEFT Advanced Usage Guide
-
-## Advanced LoRA Variants
-
-### DoRA (Weight-Decomposed Low-Rank Adaptation)
-
-DoRA decomposes weights into magnitude and direction components, often achieving better results than standard LoRA:
-
-```python
-from peft import LoraConfig, get_peft_model
-
-dora_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
-    use_dora=True,  # Enable DoRA
-    task_type="CAUSAL_LM"
-)
-
-model = get_peft_model(model, dora_config)
-```
-
-**When to use DoRA**:
-- Often outperforms LoRA on instruction-following tasks, though gains vary by model and rank
-- Slightly higher memory (~10%) due to magnitude vectors
-- Best for quality-critical fine-tuning
-
-### AdaLoRA (Adaptive Rank)
-
-Automatically adjusts rank per layer based on importance:
-
-```python
-from peft import AdaLoraConfig
-
-adalora_config = AdaLoraConfig(
-    init_r=64,              # Initial rank
-    target_r=16,            # Target average rank
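-    tinit=200,              # Initial warmup steps before rank pruning starts
-    tfinal=1000,            # Final fine-tuning steps at the fixed target rank
-    total_step=3000,        # Total training steps (illustrative; must exceed tinit + tfinal)
-    deltaT=10,              # Rank update frequency
-    beta1=0.85,
-    beta2=0.85,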
-    orth_reg_weight=0.5,    # Orthogonality regularization
-    target_modules=["q_proj", "v_proj"],
-    task_type="CAUSAL_LM"
-)
-```
-
-**Benefits**:
-- Allocates more rank to important layers
-- Can reduce total parameters while maintaining quality
-- Good for exploring optimal rank distribution
-
-### LoRA+ (Asymmetric Learning Rates)
-
-Different learning rates for A and B matrices:
-
-```python
-from peft import LoraConfig
-
-# LoRA+ keeps the standard LoRA config; the change is a higher LR for the B matrices
-lora_plus_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    target_modules="all-linear",
-    task_type="CAUSAL_LM",
-)
-
-# Manual implementation of LoRA+
-from torch.optim import AdamW
-
-# Group parameters
-lora_A_params = [p for n, p in model.named_parameters() if "lora_A" in n]
-lora_B_params = [p for n, p in model.named_parameters() if "lora_B" in n]
-
-optimizer = AdamW([
-    {"params": lora_A_params, "lr": 1e-4},
-    {"params": lora_B_params, "lr": 1e-3},  # 10x higher for B
-])
-```
-
-### rsLoRA (Rank-Stabilized LoRA)
-
-Scales LoRA outputs to stabilize training with different ranks:
-
-```python
-lora_config = LoraConfig(
-    r=64,
-    lora_alpha=64,
-    use_rslora=True,  # Enables rank-stabilized scaling
-    target_modules="all-linear"
-)
-```
-
-**When to use**:
-- When experimenting with different ranks
-- Helps maintain consistent behavior across rank values
-- Recommended for r > 32
-
-## LoftQ (LoRA-Fine-Tuning-aware Quantization)
-
-Initializes LoRA weights to compensate for quantization error:
-
-```python
-from peft import LoftQConfig, LoraConfig, get_peft_model
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-# LoftQ configuration
-loftq_config = LoftQConfig(
-    loftq_bits=4,              # Quantization bits
-    loftq_iter=5,              # Alternating optimization iterations
-)
-
-# LoRA config with LoftQ initialization
-lora_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    target_modules="all-linear",
-    init_lora_weights="loftq",
-    loftq_config=loftq_config,
-    task_type="CAUSAL_LM"
-)
-
-# Load the base model unquantized: with init_lora_weights="loftq", PEFT
-# quantizes the weights itself and expects a full-precision model here
-model = AutoModelForCausalLM.from_pretrained(
-    "meta-llama/Llama-3.1-8B",
-    torch_dtype="auto"
-)
-
-model = get_peft_model(model, lora_config)
-```
-
-**Benefits over standard QLoRA**:
-- Better initial quality after quantization
-- Faster convergence
-- ~1-2% better final accuracy on benchmarks
-
-## Custom Module Targeting
-
-### Target specific layers
-
-```python
-# Target only first and last transformer layers
-lora_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    target_modules=["q_proj", "v_proj"],
-    # Restrict LoRA to layer indices 0 and 31. Listing full module names
-    # such as "model.layers.0.self_attn.q_proj" in target_modules is an
-    # alternative way to get the same effect.
-    layers_to_transform=[0, 31]
-)
-```
-
-### Layer pattern matching
-
-```python
-# Target layers 0-10 only
-lora_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
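-    # layers_to_transform requires a list of module names, not a string such as "all-linear"
-    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
-    layers_to_transform=list(range(11)),  # Layers 0-10
-    layers_pattern="layers"  # Name of the ModuleList that holds the transformer blocks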
-)
-```
-
-### Fully train specific modules (no LoRA)
-
-```python
-lora_config = LoraConfig(
-    r=16,
-    target_modules="all-linear",
-    modules_to_save=["lm_head"],  # Train these fully (not LoRA)
-)
-```
-
-## Embedding and LM Head Training
-
-### Train embeddings with LoRA
-
-```python
-from peft import LoraConfig
-
-# Include embeddings
-lora_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    target_modules=["q_proj", "v_proj", "embed_tokens"],  # Include embeddings
-    modules_to_save=["lm_head"],  # Train lm_head fully
-)
-```
-
-### Extending vocabulary with LoRA
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from peft import get_peft_model, LoraConfig
-
-# Add new tokens
-tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
-new_tokens = ["<custom_token_1>", "<custom_token_2>"]
-tokenizer.add_tokens(new_tokens)
-
-# Resize model embeddings
-model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
-model.resize_token_embeddings(len(tokenizer))
-
-# Configure LoRA to train new embeddings
-lora_config = LoraConfig(
-    r=16,
-    target_modules="all-linear",
-    modules_to_save=["embed_tokens", "lm_head"],  # Train these fully
-)
-
-model = get_peft_model(model, lora_config)
-```
-
-## Multi-Adapter Patterns
-
-### Adapter composition
-
-```python
-from peft import AutoPeftModelForCausalLM
-
-# Load model with multiple adapters
-model = AutoPeftModelForCausalLM.from_pretrained("./base-adapter")
-model.load_adapter("./style-adapter", adapter_name="style")
-model.load_adapter("./task-adapter", adapter_name="task")
-
-# Combine adapters (weighted sum)
-model.add_weighted_adapter(
-    adapters=["style", "task"],
-    weights=[0.7, 0.3],
-    adapter_name="combined",
-    combination_type="linear"  # or "cat", "svd"
-)
-
-model.set_adapter("combined")
-```
-
-### Adapter stacking
-
-```python
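-# "Stack" adapters: cat concatenates the low-rank factors, so the weighted deltas add up
-model.add_weighted_adapter(
-    adapters=["base", "domain", "task"],  # Names of adapters that are already loaded
-    weights=[1.0, 1.0, 1.0],
-    adapter_name="stacked",
-    combination_type="cat"  # Resulting rank = sum of the adapter ranks
-)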
-```
-
-### Dynamic adapter switching
-
-```python
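-import torch
-from peft import AutoPeftModelForCausalLM
-from transformers import AutoTokenizer
-
-class MultiAdapterModel:
-    def __init__(self, base_model_path, adapter_paths):
-        # adapter_paths: dict mapping adapter name -> adapter directory
-        (first, first_path), *rest = adapter_paths.items()
-        self.tokenizer = AutoTokenizer.from_pretrained(base_model_path)
-        # The first adapter also loads the base model; register it under its own name
-        self.model = AutoPeftModelForCausalLM.from_pretrained(first_path, adapter_name=first)
-        for name, path in rest:
-            self.model.load_adapter(path, adapter_name=name)
-
-    def tokenize(self, prompt):
-        return self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
-
-    def generate(self, prompt, adapter_name):
-        self.model.set_adapter(adapter_name)
-        return self.model.generate(**self.tokenize(prompt))
-
-    def generate_ensemble(self, prompt, adapters, weights):
-        """Return the weighted sum of each adapter's logits for the prompt"""
-        outputs = []
-        for adapter, weight in zip(adapters, weights):
-            self.model.set_adapter(adapter)
-            logits = self.model(**self.tokenize(prompt)).logits
-            outputs.append(weight * logits)
-        return torch.stack(outputs).sum(dim=0)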
-```
-
-## Memory Optimization
-
-### Gradient checkpointing with LoRA
-
-```python
-from peft import prepare_model_for_kbit_training
-
-# Enable gradient checkpointing
-model = prepare_model_for_kbit_training(
-    model,
-    use_gradient_checkpointing=True,
-    gradient_checkpointing_kwargs={"use_reentrant": False}
-)
-```
-
-### CPU offloading for training
-
-```python
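-from accelerate import Accelerator, DeepSpeedPlugin
-
-# Accelerator has no cpu_offload argument; optimizer-state offload is configured
-# through a plugin (this one requires the deepspeed package)
-ds_plugin = DeepSpeedPlugin(zero_stage=2, offload_optimizer_device="cpu")
-accelerator = Accelerator(
-    mixed_precision="bf16",
-    gradient_accumulation_steps=8,
-    deepspeed_plugin=ds_plugin
-)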
-
-model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
-```
-
-### Memory-efficient attention with LoRA
-
-```python
-import torch
-from peft import get_peft_model
-from transformers import AutoModelForCausalLM
-
-# Combine Flash Attention 2 with LoRA
-model = AutoModelForCausalLM.from_pretrained(
-    "meta-llama/Llama-3.1-8B",
-    attn_implementation="flash_attention_2",
-    torch_dtype=torch.bfloat16
-)
-
-# Apply LoRA
-model = get_peft_model(model, lora_config)
-```
-
-## Inference Optimization
-
-### Merge for deployment
-
-```python
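-# Merge adapter weights into base model and save the result
-merged_model = model.merge_and_unload()
-merged_model.save_pretrained("./merged-model")
-
-# Reload the merged checkpoint in 4-bit for inference
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig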
-
-bnb_config = BitsAndBytesConfig(load_in_4bit=True)
-quantized_model = AutoModelForCausalLM.from_pretrained(
-    "./merged-model",
-    quantization_config=bnb_config
-)
-```
-
-### Export to different formats
-
-```python
-# Export to GGUF (llama.cpp)
-# First merge, then convert
-merged_model.save_pretrained("./merged-model")
-
-# Use llama.cpp converter
-# python convert_hf_to_gguf.py ./merged-model --outfile model.gguf
-
-# Export to ONNX
-from optimum.onnxruntime import ORTModelForCausalLM
-
-ort_model = ORTModelForCausalLM.from_pretrained(
-    "./merged-model",
-    export=True
-)
-ort_model.save_pretrained("./onnx-model")
-```
-
-### Batch adapter inference
-
-```python
-from vllm import LLM
-from vllm.lora.request import LoRARequest
-
-# Initialize with LoRA support
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B",
-    enable_lora=True,
-    max_lora_rank=64,
-    max_loras=4  # Max concurrent adapters
-)
-
-# Batch with different adapters
-requests = [
-    ("prompt1", LoRARequest("adapter1", 1, "./adapter1")),
-    ("prompt2", LoRARequest("adapter2", 2, "./adapter2")),
-    ("prompt3", LoRARequest("adapter1", 1, "./adapter1")),
-]
-
-outputs = llm.generate(
-    [r[0] for r in requests],
-    lora_request=[r[1] for r in requests]
-)
-```
-
-## Training Recipes
-
-### Instruction tuning recipe
-
-```python
-lora_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    lora_dropout=0.05,
-    target_modules="all-linear",
-    bias="none",
-    task_type="CAUSAL_LM"
-)
-
-training_args = TrainingArguments(
-    output_dir="./output",
-    num_train_epochs=3,
-    per_device_train_batch_size=4,
-    gradient_accumulation_steps=4,
-    learning_rate=2e-4,
-    lr_scheduler_type="cosine",
-    warmup_ratio=0.03,
-    bf16=True,
-    logging_steps=10,
-    save_strategy="steps",
-    save_steps=100,
-    eval_strategy="steps",
-    eval_steps=100,
-)
-```
-
-### Code generation recipe
-
-```python
-lora_config = LoraConfig(
-    r=32,              # Higher rank for code
-    lora_alpha=64,
-    lora_dropout=0.1,
-    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
-    bias="none",
-    task_type="CAUSAL_LM"
-)
-
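-training_args = TrainingArguments(
-    output_dir="./output",
-    learning_rate=1e-4,        # Lower LR for code
-    num_train_epochs=2,
-)
-# Longer sequences (e.g. 2048 tokens) are configured on the trainer/tokenizer side,
-# not here: TrainingArguments has no max_seq_length field.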
-```
-
-### Conversational/Chat recipe
-
-```python
-from trl import SFTTrainer
-
-lora_config = LoraConfig(
-    r=16,
-    lora_alpha=16,  # alpha = r for chat
-    lora_dropout=0.05,
-    target_modules="all-linear"
-)
-
-# Use chat template
-def format_chat(example):
-    messages = [
-        {"role": "user", "content": example["instruction"]},
-        {"role": "assistant", "content": example["response"]}
-    ]
-    return tokenizer.apply_chat_template(messages, tokenize=False)
-
-trainer = SFTTrainer(
-    model=model,
-    peft_config=lora_config,
-    train_dataset=dataset.map(format_chat),
-    max_seq_length=1024,
-)
-```
-
-## Debugging and Validation
-
-### Verify adapter application
-
-```python
-# Check which modules have LoRA
-for name, module in model.named_modules():
-    if hasattr(module, "lora_A"):
-        print(f"LoRA applied to: {name}")
-
-# Print detailed config
-print(model.peft_config)
-
-# Check adapter state
-print(f"Active adapters: {model.active_adapters}")
-print(f"Trainable: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
-```
-
-### Compare with base model
-
-```python
-# Generate with adapter
-model.set_adapter("default")
-adapter_output = model.generate(**inputs)
-
-# Generate without adapter
-with model.disable_adapter():
-    base_output = model.generate(**inputs)
-
-print(f"Adapter: {tokenizer.decode(adapter_output[0])}")
-print(f"Base: {tokenizer.decode(base_output[0])}")
-```
-
-### Monitor training metrics
-
-```python
-from transformers import TrainerCallback
-
-class LoRACallback(TrainerCallback):
-    def on_log(self, args, state, control, logs=None, **kwargs):
-        if logs and "loss" in logs:  # logs may be None
-            # Log adapter-specific metrics
-            model = kwargs["model"]
-            lora_params = sum(p.numel() for n, p in model.named_parameters()
-                            if "lora" in n and p.requires_grad)
-            print(f"Step {state.global_step}: loss={logs['loss']:.4f}, lora_params={lora_params}")
-```
--- a/skills/mlops/peft/references/troubleshooting.md
+++ b/skills/mlops/peft/references/troubleshooting.md
@@ -1,480 +0,0 @@
-# PEFT Troubleshooting Guide
-
-## Installation Issues
-
-### bitsandbytes CUDA Error
-
-**Error**: `CUDA Setup failed despite GPU being available`
-
-**Fix**:
-```bash
-# Check CUDA version
-nvcc --version
-
-# Install matching bitsandbytes
-pip uninstall bitsandbytes
-pip install bitsandbytes --no-cache-dir
-
-# Or compile from source for specific CUDA
-git clone https://github.com/TimDettmers/bitsandbytes.git
-cd bitsandbytes
-CUDA_VERSION=118 make cuda11x  # Adjust for your CUDA
-pip install .
-```
-
-### Triton Import Error
-
-**Error**: `ModuleNotFoundError: No module named 'triton'`
-
-**Fix**:
-```bash
-# Install triton (Linux only)
-pip install triton
-
-# Windows: Triton not supported, use CUDA backend
-# Note: CUDA_VISIBLE_DEVICES only selects which GPU is visible; it does not disable Triton
-export CUDA_VISIBLE_DEVICES=0
-```
-
-### PEFT Version Conflicts
-
-**Error**: `AttributeError: 'LoraConfig' object has no attribute 'use_dora'`
-
-**Fix**:
-```bash
-# Upgrade to latest PEFT
-pip install --upgrade "peft>=0.13.0"  # Quote the spec so the shell does not treat > as a redirect
-
-# Check version
-python -c "import peft; print(peft.__version__)"
-```
-
-## Training Issues
-
-### CUDA Out of Memory
-
-**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
-
-**Solutions**:
-
-1. **Enable gradient checkpointing**:
-```python
-from peft import prepare_model_for_kbit_training
-model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
-```
-
-2. **Reduce batch size**:
-```python
-TrainingArguments(
-    per_device_train_batch_size=1,
-    gradient_accumulation_steps=16  # Maintain effective batch size
-)
-```
-
-3. **Use QLoRA**:
-```python
-from transformers import BitsAndBytesConfig
-
-bnb_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="nf4",
-    bnb_4bit_use_double_quant=True
-)
-model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
-```
-
-4. **Lower LoRA rank**:
-```python
-LoraConfig(r=8)  # Instead of r=16 or higher
-```
-
-5. **Target fewer modules**:
-```python
-target_modules=["q_proj", "v_proj"]  # Instead of all-linear
-```
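
Two of the options above come down to simple arithmetic. The following is a minimal sketch, assuming a single square projection of width 4096 (an illustrative size, not taken from any specific model) and hypothetical helper names that are not part of PEFT or Transformers:

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer.

    LoRA adds two matrices: A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# Batch size 1 with 16 accumulation steps keeps the same effective batch as 4 x 4.
print(effective_batch_size(1, 16), effective_batch_size(4, 4))

# Halving the rank halves the adapter parameters for this (assumed) 4096 x 4096 layer.
print(lora_param_count(4096, 4096, rank=16), lora_param_count(4096, 4096, rank=8))
```

Adapter parameters are usually a small share of total memory, so lowering the rank tends to help less than gradient checkpointing or a smaller per-device batch.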
-
-### Loss Not Decreasing
-
-**Problem**: Training loss stays flat or increases.
-
-**Solutions**:
-
-1. **Check learning rate**:
-```python
-# Start lower
-TrainingArguments(learning_rate=1e-4)  # Not 2e-4 or higher
-```
-
-2. **Verify adapter is active**:
-```python
-model.print_trainable_parameters()
-# Should show >0 trainable params
-
-# Check adapter applied
-print(model.peft_config)
-```
-
-3. **Check data formatting**:
-```python
-# Verify tokenization
-sample = dataset[0]
-decoded = tokenizer.decode(sample["input_ids"])
-print(decoded)  # Should look correct
-```
-
-4. **Increase rank**:
-```python
-LoraConfig(r=32, lora_alpha=64)  # More capacity
-```
-
-### NaN Loss
-
-**Error**: `Loss is NaN`
-
-**Fix**:
-```python
-# Use bf16 instead of fp16
-TrainingArguments(bf16=True, fp16=False)
-
-# Or, if fp16 is required, keep mixed precision (the Trainer applies dynamic loss scaling)
-TrainingArguments(fp16=True)
-
-# Lower learning rate
-TrainingArguments(learning_rate=5e-5)
-
-# Check for data issues (a batch whose labels are all -100 yields a NaN loss)
-for batch in dataloader:
-    if (batch["labels"] == -100).all():
-        print("Batch has no supervised tokens!")
-```
-
-### Adapter Not Training
-
-**Problem**: `trainable params: 0` or model not updating.
-
-**Fix**:
-```python
-# Verify LoRA applied to correct modules
-for name, module in model.named_modules():
-    if "lora" in name.lower():
-        print(f"Found LoRA: {name}")
-
-# Check target_modules match model architecture
-from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING
-print(TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING.get(model.config.model_type))
-
-# Ensure model in training mode
-model.train()
-
-# Check requires_grad
-for name, param in model.named_parameters():
-    if param.requires_grad:
-        print(f"Trainable: {name}")
-```
-
-## Loading Issues
-
-### Adapter Loading Fails
-
-**Error**: `ValueError: Can't find adapter weights`
-
-**Fix**:
-```python
-# Check adapter files exist
-import os
-print(os.listdir("./adapter-path"))
-# Should contain: adapter_config.json, adapter_model.safetensors
-
-# Load with correct structure
-from peft import PeftModel, PeftConfig
-
-# Check config
-config = PeftConfig.from_pretrained("./adapter-path")
-print(config)
-
-# Load base model first
-base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
-model = PeftModel.from_pretrained(base_model, "./adapter-path")
-```
-
-### Base Model Mismatch
-
-**Error**: `RuntimeError: size mismatch`
-
-**Fix**:
-```python
-# Ensure base model matches adapter
-from peft import PeftConfig
-
-config = PeftConfig.from_pretrained("./adapter-path")
-print(f"Base model: {config.base_model_name_or_path}")
-
-# Load exact same base model
-base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
-```
-
-### Safetensors vs PyTorch Format
-
-**Error**: `ValueError: We couldn't connect to 'https://huggingface.co'`
-
-**Fix**:
-```python
-# Force local loading
-model = PeftModel.from_pretrained(
-    base_model,
-    "./adapter-path",
-    local_files_only=True
-)
-
-# Or specify format
-model.save_pretrained("./adapter", safe_serialization=True)  # safetensors
-model.save_pretrained("./adapter", safe_serialization=False)  # pytorch
-```
-
-## Inference Issues
-
-### Slow Generation
-
-**Problem**: Inference much slower than expected.
-
-**Solutions**:
-
-1. **Merge adapter for deployment**:
-```python
-merged_model = model.merge_and_unload()
-# No adapter overhead during inference
-```
-
-2. **Use optimized inference engine**:
-```python
-from vllm import LLM
-llm = LLM(model="./merged-model", dtype="half")
-```
-
-3. **Enable Flash Attention**:
-```python
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    attn_implementation="flash_attention_2"
-)
-```
-
-### Output Quality Issues
-
-**Problem**: Fine-tuned model produces worse outputs.
-
-**Solutions**:
-
-1. **Check evaluation without adapter**:
-```python
-with model.disable_adapter():
-    base_output = model.generate(**inputs)
-# Compare with adapter output
-```
-
-2. **Use deterministic decoding during eval**:
-```python
-model.generate(**inputs, do_sample=False)  # Greedy decoding; temperature only applies when do_sample=True
-```
-
-3. **Retrain with more data**:
-```python
-# Increase training samples
-# Use higher quality data
-# Train for more epochs
-```
-
-### Wrong Adapter Active
-
-**Problem**: Model using wrong adapter or no adapter.
-
-**Fix**:
-```python
-# Check active adapters
-print(model.active_adapters)
-
-# Explicitly set adapter
-model.set_adapter("your-adapter-name")
-
-# List all adapters
-print(model.peft_config.keys())
-```
-
-## QLoRA Specific Issues
-
-### Quantization Errors
-
-**Error**: `RuntimeError: mat1 and mat2 shapes cannot be multiplied`
-
-**Fix**:
-```python
-# Ensure compute dtype matches
-bnb_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_compute_dtype=torch.bfloat16,  # Match model dtype
-    bnb_4bit_quant_type="nf4"
-)
-
-# Load with correct dtype
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    quantization_config=bnb_config,
-    torch_dtype=torch.bfloat16
-)
-```
-
-### QLoRA OOM
-
-**Error**: OOM even with 4-bit quantization.
-
-**Fix**:
-```python
-# Enable double quantization
-bnb_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_use_double_quant=True  # Further memory reduction
-)
-
-# Use offloading
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    quantization_config=bnb_config,
-    device_map="auto",
-    max_memory={0: "20GB", "cpu": "100GB"}
-)
-```
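
As a rough guide to why 4-bit loading may still run out of memory, the footprint of the weights alone can be estimated from the parameter count. This is a minimal sketch with a hypothetical helper; it ignores quantization constants, activations, gradients, and optimizer state, and counts 1 GB as 1e9 bytes:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights only, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# An 8-billion-parameter model (assumed size) at different precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(8e9, bits):.1f} GB")
```

Whatever remains after the weights has to hold activations and adapter gradients, which is where long sequences and large batches typically cause the out-of-memory error.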
-
-### QLoRA Merge Fails
-
-**Error**: `RuntimeError: expected scalar type BFloat16 but found Float`
-
-**Fix**:
-```python
-# Dequantize before merging
-from peft import PeftModel
-
-# Load in higher precision for merging
-base_model = AutoModelForCausalLM.from_pretrained(
-    base_model_name,
-    torch_dtype=torch.float16,  # Not quantized
-    device_map="auto"
-)
-
-# Load adapter
-model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
-
-# Now merge
-merged = model.merge_and_unload()
-```
-
-## Multi-Adapter Issues
-
-### Adapter Conflict
-
-**Error**: `ValueError: Adapter with name 'default' already exists`
-
-**Fix**:
-```python
-# Use unique names
-model.load_adapter("./adapter1", adapter_name="task1")
-model.load_adapter("./adapter2", adapter_name="task2")
-
-# Or delete existing
-model.delete_adapter("default")
-```
-
-### Mixed Precision Adapters
-
-**Error**: Adapters trained with different dtypes.
-
-**Fix**:
-```python
-# Convert adapter precision
-model = PeftModel.from_pretrained(base_model, "./adapter")
-model = model.to(torch.bfloat16)
-
-# Or load with specific dtype
-model = PeftModel.from_pretrained(
-    base_model,
-    "./adapter",
-    torch_dtype=torch.bfloat16
-)
-```
-
-## Performance Optimization
-
-### Memory Profiling
-
-```python
-import torch
-
-def print_memory():
-    if torch.cuda.is_available():
-        allocated = torch.cuda.memory_allocated() / 1e9
-        reserved = torch.cuda.memory_reserved() / 1e9
-        print(f"Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB")
-
-# Profile during training
-print_memory()  # Before
-model.train()
-loss = model(**batch).loss
-loss.backward()
-print_memory()  # After
-```
-
-### Speed Profiling
-
-```python
-import time
-import torch
-
-def benchmark_generation(model, tokenizer, prompt, n_runs=5):
-    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-
-    # Warmup
-    model.generate(**inputs, max_new_tokens=10)
-    torch.cuda.synchronize()
-
-    # Benchmark
-    times = []
-    for _ in range(n_runs):
-        start = time.perf_counter()
-        outputs = model.generate(**inputs, max_new_tokens=100)
-        torch.cuda.synchronize()
-        times.append(time.perf_counter() - start)
-
-    tokens = outputs.shape[1] - inputs.input_ids.shape[1]
-    avg_time = sum(times) / len(times)
-    print(f"Speed: {tokens/avg_time:.2f} tokens/sec")
-
-# Compare adapter vs merged
-benchmark_generation(adapter_model, tokenizer, "Hello")
-benchmark_generation(merged_model, tokenizer, "Hello")
-```
-
-## Getting Help
-
-1. **Check PEFT GitHub Issues**: https://github.com/huggingface/peft/issues
-2. **HuggingFace Forums**: https://discuss.huggingface.co/
-3. **PEFT Documentation**: https://huggingface.co/docs/peft
-
-### Debugging Template
-
-When reporting issues, include:
-
-```python
-# System info
-import peft
-import transformers
-import torch
-
-print(f"PEFT: {peft.__version__}")
-print(f"Transformers: {transformers.__version__}")
-print(f"PyTorch: {torch.__version__}")
-print(f"CUDA: {torch.version.cuda}")
-print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")
-
-# Config
-print(model.peft_config)
-model.print_trainable_parameters()
-```
--- a/skills/mlops/pytorch-fsdp/SKILL.md
+++ b/skills/mlops/pytorch-fsdp/SKILL.md
--- a/skills/mlops/pytorch-fsdp/references/index.md
+++ b/skills/mlops/pytorch-fsdp/references/index.md
@@ -1,7 +0,0 @@
-# Pytorch-Fsdp Documentation Index
-
-## Categories
-
-### Other
-**File:** `other.md`
-**Pages:** 15
--- a/skills/mlops/pytorch-fsdp/references/other.md
+++ b/skills/mlops/pytorch-fsdp/references/other.md
--- a/skills/mlops/pytorch-lightning/SKILL.md
+++ b/skills/mlops/pytorch-lightning/SKILL.md
@@ -1,349 +0,0 @@
---
-name: pytorch-lightning
-description: High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [lightning, torch, transformers]
-metadata:
-  hermes:
-    tags: [PyTorch Lightning, Training Framework, Distributed Training, DDP, FSDP, DeepSpeed, High-Level API, Callbacks, Best Practices, Scalable]
-
---
-
-# PyTorch Lightning - High-Level Training Framework
-
-## Quick start
-
-PyTorch Lightning organizes PyTorch code to eliminate boilerplate while maintaining flexibility.
-
-**Installation**:
-```bash
-pip install lightning
-```
-
-**Convert PyTorch to Lightning** (3 steps):
-
-```python
-import lightning as L
-import torch
-from torch import nn
-from torch.utils.data import DataLoader, Dataset
-
-# Step 1: Define LightningModule (organize your PyTorch code)
-class LitModel(L.LightningModule):
-    def __init__(self, hidden_size=128):
-        super().__init__()
-        self.model = nn.Sequential(
-            nn.Linear(28 * 28, hidden_size),
-            nn.ReLU(),
-            nn.Linear(hidden_size, 10)
-        )
-
-    def training_step(self, batch, batch_idx):
-        x, y = batch
-        y_hat = self.model(x)
-        loss = nn.functional.cross_entropy(y_hat, y)
-        self.log('train_loss', loss)  # Auto-logged to TensorBoard
-        return loss
-
-    def configure_optimizers(self):
-        return torch.optim.Adam(self.parameters(), lr=1e-3)
-
-# Step 2: Create data
-train_loader = DataLoader(train_dataset, batch_size=32)
-
-# Step 3: Train with Trainer (handles everything else!)
-trainer = L.Trainer(max_epochs=10, accelerator='gpu', devices=2)
-model = LitModel()
-trainer.fit(model, train_loader)
-```
-
-**That's it!** Trainer handles:
- GPU/TPU/CPU switching
- Distributed training (DDP, FSDP, DeepSpeed)
- Mixed precision (FP16, BF16)
- Gradient accumulation
- Checkpointing
- Logging
- Progress bars
-
-## Common workflows
-
-### Workflow 1: From PyTorch to Lightning
-
-**Original PyTorch code**:
-```python
-model = MyModel()
-optimizer = torch.optim.Adam(model.parameters())
-model.to('cuda')
-
-for epoch in range(max_epochs):
-    for batch in train_loader:
-        batch = batch.to('cuda')
-        optimizer.zero_grad()
-        loss = model(batch)
-        loss.backward()
-        optimizer.step()
-```
-
-**Lightning version**:
-```python
-class LitModel(L.LightningModule):
-    def __init__(self):
-        super().__init__()
-        self.model = MyModel()
-
-    def training_step(self, batch, batch_idx):
-        loss = self.model(batch)  # No .to('cuda') needed!
-        return loss
-
-    def configure_optimizers(self):
-        return torch.optim.Adam(self.parameters())
-
-# Train
-trainer = L.Trainer(max_epochs=10, accelerator='gpu')
-trainer.fit(LitModel(), train_loader)
-```
-
-**Benefits**: no manual device management, no hand-written optimization loop, automatic distributed training
-
-### Workflow 2: Validation and testing
-
-```python
-class LitModel(L.LightningModule):
-    def __init__(self):
-        super().__init__()
-        self.model = MyModel()
-
-    def training_step(self, batch, batch_idx):
-        x, y = batch
-        y_hat = self.model(x)
-        loss = nn.functional.cross_entropy(y_hat, y)
-        self.log('train_loss', loss)
-        return loss
-
-    def validation_step(self, batch, batch_idx):
-        x, y = batch
-        y_hat = self.model(x)
-        val_loss = nn.functional.cross_entropy(y_hat, y)
-        acc = (y_hat.argmax(dim=1) == y).float().mean()
-        self.log('val_loss', val_loss)
-        self.log('val_acc', acc)
-
-    def test_step(self, batch, batch_idx):
-        x, y = batch
-        y_hat = self.model(x)
-        test_loss = nn.functional.cross_entropy(y_hat, y)
-        self.log('test_loss', test_loss)
-
-    def configure_optimizers(self):
-        return torch.optim.Adam(self.parameters(), lr=1e-3)
-
-# Train with validation
-model = LitModel()
-trainer = L.Trainer(max_epochs=10)
-trainer.fit(model, train_loader, val_loader)
-
-# Test
-trainer.test(model, test_loader)
-```
-
-**Automatic features**:
- Validation runs every epoch by default
- Metrics logged to TensorBoard
- Checkpoints saved automatically (add `ModelCheckpoint(monitor='val_loss')` to keep the best model)
-
-### Workflow 3: Distributed training (DDP)
-
-```python
-# Same code as single GPU!
-model = LitModel()
-
-# 8 GPUs with DDP (automatic!)
-trainer = L.Trainer(
-    accelerator='gpu',
-    devices=8,
-    strategy='ddp'  # Or 'fsdp', 'deepspeed'
-)
-
-trainer.fit(model, train_loader)
-```
-
-**Launch**:
-```bash
-# Single command, Lightning handles the rest
-python train.py
-```
-
-**No changes needed**:
- Automatic data distribution
- Gradient synchronization
- Multi-node support (just set `num_nodes=2`)
-
-### Workflow 4: Callbacks for monitoring
-
-```python
-from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor
-
-# Create callbacks
-checkpoint = ModelCheckpoint(
-    monitor='val_loss',
-    mode='min',
-    save_top_k=3,
-    filename='model-{epoch:02d}-{val_loss:.2f}'
-)
-
-early_stop = EarlyStopping(
-    monitor='val_loss',
-    patience=5,
-    mode='min'
-)
-
-lr_monitor = LearningRateMonitor(logging_interval='epoch')
-
-# Add to Trainer
-trainer = L.Trainer(
-    max_epochs=100,
-    callbacks=[checkpoint, early_stop, lr_monitor]
-)
-
-trainer.fit(model, train_loader, val_loader)
-```
-
-**Result**:
- Auto-saves best 3 models
- Stops early if no improvement for 5 epochs
- Logs learning rate to TensorBoard
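
The patience rule can be illustrated without Lightning. This is a simplified sketch of the idea, not Lightning's implementation: the function name is hypothetical, it assumes `mode='min'`, and it ignores options such as `min_delta`:

```python
def early_stop_epoch(val_losses, patience=5):
    """Return the epoch index at which training would stop, or None if it never stops.

    Training stops once the monitored loss has failed to improve on the best
    value seen so far for `patience` consecutive validation checks.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is at epoch 2, followed by five non-improving epochs.
losses = [1.0, 0.9, 0.8, 0.85, 0.86, 0.9, 0.81, 0.82]
print(early_stop_epoch(losses, patience=5))
```
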
-
-### Workflow 5: Learning rate scheduling
-
-```python
-class LitModel(L.LightningModule):
-    # ... (training_step, etc.)
-
-    def configure_optimizers(self):
-        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
-
-        # Cosine annealing
-        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
-            optimizer,
-            T_max=100,
-            eta_min=1e-5
-        )
-
-        return {
-            'optimizer': optimizer,
-            'lr_scheduler': {
-                'scheduler': scheduler,
-                'interval': 'epoch',  # Update per epoch
-                'frequency': 1
-            }
-        }
-
-# The scheduler steps automatically; add LearningRateMonitor to log the learning rate
-trainer = L.Trainer(max_epochs=100)
-trainer.fit(model, train_loader)
-```
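
To see what this schedule does to the learning rate, the closed-form cosine annealing formula can be evaluated directly. This is a minimal sketch using the same values as the example above (`lr=1e-3`, `eta_min=1e-5`, `T_max=100`); the function name is hypothetical, and PyTorch computes the schedule internally rather than through this helper:

```python
import math


def cosine_annealing_lr(step, base_lr=1e-3, eta_min=1e-5, t_max=100):
    """eta_min + (base_lr - eta_min) * (1 + cos(pi * step / t_max)) / 2"""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


# The rate starts at base_lr, passes the midpoint at t_max / 2, and ends at eta_min.
for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {cosine_annealing_lr(epoch):.6f}")
```

With `'interval': 'epoch'`, `T_max` counts epochs, so setting it equal to `max_epochs` lets the rate reach `eta_min` exactly at the end of training.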
-
-## When to use vs alternatives
-
-**Use PyTorch Lightning when**:
- Want clean, organized code
- Need production-ready training loops
- Switching between single GPU, multi-GPU, TPU
- Want built-in callbacks and logging
- Team collaboration (standardized structure)
-
-**Key advantages**:
- **Organized**: Separates research code from engineering
- **Automatic**: DDP, FSDP, DeepSpeed with 1 line
- **Callbacks**: Modular training extensions
- **Reproducible**: Less boilerplate = fewer bugs
- **Tested**: 1M+ downloads/month, battle-tested
-
-**Use alternatives instead**:
- **Accelerate**: Minimal changes to existing code, more flexibility
- **Ray Train**: Multi-node orchestration, hyperparameter tuning
- **Raw PyTorch**: Maximum control, learning purposes
- **Keras**: TensorFlow ecosystem
-
-## Common issues
-
-**Issue: Loss not decreasing**
-
-Check data and model setup:
-```python
-# Add to training_step
-def training_step(self, batch, batch_idx):
-    if batch_idx == 0:
-        print(f"Batch shape: {batch[0].shape}")
-        print(f"Labels: {batch[1]}")
-    loss = ...
-    return loss
-```
-
-**Issue: Out of memory**
-
-Reduce batch size or use gradient accumulation:
-```python
-trainer = L.Trainer(
-    accumulate_grad_batches=4,  # Effective batch = batch_size × 4
-    precision='bf16-mixed'  # Or '16-mixed'; mixed precision lowers activation memory
-)
-```
-
-**Issue: Validation not running**
-
-Ensure you pass val_loader:
-```python
-# WRONG
-trainer.fit(model, train_loader)
-
-# CORRECT
-trainer.fit(model, train_loader, val_loader)
-```
-
-**Issue: DDP spawns multiple processes unexpectedly**
-
-Lightning auto-detects GPUs. Explicitly set devices:
-```python
-# Test on CPU first
-trainer = L.Trainer(accelerator='cpu', devices=1)
-
-# Then GPU
-trainer = L.Trainer(accelerator='gpu', devices=1)
-```
-
-## Advanced topics
-
-**Callbacks**: See [references/callbacks.md](references/callbacks.md) for EarlyStopping, ModelCheckpoint, custom callbacks, and callback hooks.
-
-**Distributed strategies**: See [references/distributed.md](references/distributed.md) for DDP, FSDP, DeepSpeed ZeRO integration, multi-node setup.
-
-**Hyperparameter tuning**: See [references/hyperparameter-tuning.md](references/hyperparameter-tuning.md) for integration with Optuna, Ray Tune, and WandB sweeps.
-
-## Hardware requirements
-
- **CPU**: Works (good for debugging)
- **Single GPU**: Works
- **Multi-GPU**: DDP (default), FSDP, or DeepSpeed
- **Multi-node**: DDP, FSDP, DeepSpeed
- **TPU**: Supported (8 cores)
- **Apple MPS**: Supported
-
-**Precision options**:
- FP32 (default)
- FP16 (V100, older GPUs)
- BF16 (A100/H100, recommended)
- FP8 (H100)
-
-## Resources
-
- Docs: https://lightning.ai/docs/pytorch/stable/
- GitHub: https://github.com/Lightning-AI/pytorch-lightning ⭐ 29,000+
- Version: 2.5.5+
- Examples: https://github.com/Lightning-AI/pytorch-lightning/tree/master/examples
- Discord: https://discord.gg/lightning-ai
- Used by: Kaggle winners, research labs, production teams
-
-
Author	SHA1	Message	Date
Austin Pickett	cc5ca0fe42	chore: add tests	2026-03-05 19:01:24 -05:00
Austin Pickett	f035796381	feat: add support to deploy to modal	2026-03-05 18:29:48 -05:00