Compare commits

15 Commits

Author SHA1 Message Date
Teknium e52ddb6318 feat: language-aware context compression summaries
Port from anomalyco/opencode#20581: context compaction now generates
summaries in the same language the user was using in the conversation.

Previously, summaries were always produced in English regardless of the
conversation language, which would confuse multilingual users by injecting
English context into non-English conversations.

Adds 'Write the summary in the same language the user was using in the
conversation.' to both the initial and iterative update summarization
prompts in ContextCompressor.
2026-04-02 17:06:48 -07:00
Teknium 924bc67eee feat(memory): pluggable memory provider interface with profile isolation, review fixes, and honcho CLI restoration (#4623)
* feat(memory): add pluggable memory provider interface with profile isolation

Introduces a pluggable MemoryProvider ABC so external memory backends can
integrate with Hermes without modifying core files. Each backend becomes a
plugin implementing a standard interface, orchestrated by MemoryManager.

Key architecture:
- agent/memory_provider.py — ABC with core + optional lifecycle hooks
- agent/memory_manager.py — single integration point in the agent loop
- agent/builtin_memory_provider.py — wraps existing MEMORY.md/USER.md

Profile isolation fixes applied to all 6 shipped plugins:
- Cognitive Memory: use get_hermes_home() instead of raw env var
- Hindsight Memory: check $HERMES_HOME/hindsight/config.json first,
  fall back to legacy ~/.hindsight/ for backward compat
- Hermes Memory Store: replace hardcoded ~/.hermes paths with
  get_hermes_home() for config loading and DB path defaults
- Mem0 Memory: use get_hermes_home() instead of raw env var
- RetainDB Memory: auto-derive profile-scoped project name from
  hermes_home path (hermes-<profile>), explicit env var overrides
- OpenViking Memory: read-only, no local state, isolation via .env

MemoryManager.initialize_all() now injects hermes_home into kwargs so
every provider can resolve profile-scoped storage without importing
get_hermes_home() themselves.

Plugin system: adds register_memory_provider() to PluginContext and
get_plugin_memory_providers() accessor.

Based on PR #3825. 46 tests (37 unit + 5 E2E + 4 plugin registration).
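A minimal sketch of the provider interface described above. Only the names MemoryProvider, MemoryManager, initialize_all(), is_available, prefetch, sync_turn, and the hermes_home kwarg come from this commit message. Every signature, the add_provider() method, and the NotesProvider class are assumptions made for illustration, not the actual Hermes code.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, List


class MemoryProvider(ABC):
    """Hypothetical shape of the provider ABC (signatures are assumed)."""

    name = "base"

    @abstractmethod
    def is_available(self) -> bool:
        """Cheap availability check; should not touch the network."""

    @abstractmethod
    def initialize(self, **kwargs: Any) -> None:
        """Receive session context, including the injected hermes_home."""

    # Optional lifecycle hooks default to no-ops.
    def prefetch(self, query: str) -> str:
        return ""

    def sync_turn(self, user_msg: str, assistant_msg: str) -> None:
        return None


class MemoryManager:
    """Hypothetical single integration point that fans out to providers."""

    def __init__(self) -> None:
        self.providers: List[MemoryProvider] = []

    def add_provider(self, provider: MemoryProvider) -> None:
        self.providers.append(provider)

    def initialize_all(self, hermes_home: str, **kwargs: Any) -> None:
        # Inject hermes_home so providers can resolve profile-scoped
        # storage without importing get_hermes_home() themselves.
        kwargs["hermes_home"] = hermes_home
        for provider in self.providers:
            if provider.is_available():
                provider.initialize(**kwargs)


class NotesProvider(MemoryProvider):
    """Illustrative provider that only records where it would store data."""

    name = "notes"

    def is_available(self) -> bool:
        return True

    def initialize(self, **kwargs: Any) -> None:
        self.storage_dir = Path(kwargs["hermes_home"]) / "notes"
```

Because the manager passes hermes_home as a keyword argument, a provider running under a non-default profile ends up with a storage path under that profile's directory without reading any environment variable itself.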

* refactor(memory): drop cognitive plugin, rewrite OpenViking as full provider

Remove cognitive-memory plugin (#727) — core mechanics are broken:
decay runs 24x too fast (hourly not daily), prefetch uses row ID as
timestamp, search limited by importance not similarity.

Rewrite openviking-memory plugin from a read-only search wrapper into
a full bidirectional memory provider using the complete OpenViking
session lifecycle API:

- sync_turn: records user/assistant messages to OpenViking session
  (threaded, non-blocking)
- on_session_end: commits session to trigger automatic memory extraction
  into 6 categories (profile, preferences, entities, events, cases,
  patterns)
- prefetch: background semantic search via find() endpoint
- on_memory_write: mirrors built-in memory writes to the session
- is_available: checks env var only, no network calls (ABC compliance)

Tools expanded from 3 to 5:
- viking_search: semantic search with mode/scope/limit
- viking_read: tiered content (abstract ~100tok / overview ~2k / full)
- viking_browse: filesystem-style navigation (list/tree/stat)
- viking_remember: explicit memory storage via session
- viking_add_resource: ingest URLs/docs into knowledge base

Uses direct HTTP via httpx (no openviking SDK dependency needed).
Response truncation on viking_read to prevent context flooding.

* fix(memory): harden Mem0 plugin — thread safety, non-blocking sync, circuit breaker

- Remove redundant mem0_context tool (identical to mem0_search with
  rerank=true, top_k=5 — wastes a tool slot and confuses the model)
- Thread sync_turn so it's non-blocking — Mem0's server-side LLM
  extraction can take 5-10s, was stalling the agent after every turn
- Add threading.Lock around _get_client() for thread-safe lazy init
  (prefetch and sync threads could race on first client creation)
- Add circuit breaker: after 5 consecutive API failures, pause calls
  for 120s instead of hammering a down server every turn. Auto-resets
  after cooldown. Logs a warning when tripped.
- Track success/failure in prefetch, sync_turn, and all tool calls
- Wait for previous sync to finish before starting a new one (prevents
  unbounded thread accumulation on rapid turns)
- Clean up shutdown to join both prefetch and sync threads
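The circuit breaker described above can be sketched as follows. The thresholds (5 consecutive failures, 120s cooldown) are taken from this commit message. The class name, method names, and the injectable clock are hypothetical and chosen only to keep the example testable; the real plugin may be structured differently.

```python
import threading
import time
from typing import Callable


class CircuitBreaker:
    """Pause calls after N consecutive failures, then auto-reset."""

    def __init__(
        self,
        threshold: int = 5,
        cooldown: float = 120.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._open_until = 0.0
        self._lock = threading.Lock()  # prefetch and sync threads share state

    def allow(self) -> bool:
        """Return True if a call may be attempted right now."""
        with self._lock:
            if self._failures < self.threshold:
                return True
            if self._clock() >= self._open_until:
                self._failures = 0  # cooldown elapsed: auto-reset
                return True
            return False

    def record_success(self) -> None:
        with self._lock:
            self._failures = 0

    def record_failure(self) -> None:
        with self._lock:
            self._failures += 1
            if self._failures >= self.threshold:
                self._open_until = self._clock() + self.cooldown
```

A caller would check allow() before each API call and report the outcome with record_success() or record_failure(); any success resets the consecutive-failure count.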

* fix(memory): enforce single external memory provider limit

MemoryManager now rejects a second non-builtin provider with a warning.
Built-in memory (MEMORY.md/USER.md) is always accepted. Only ONE
external plugin provider is allowed at a time. This prevents tool
schema bloat (some providers add 3-5 tools each) and conflicting
memory backends.

The warning message directs users to configure memory.provider in
config.yaml to select which provider to activate.

Updated all 47 tests to use builtin + one external pattern instead
of multiple externals. Added test_second_external_rejected to verify
the enforcement.
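As an illustration of the rule above, here is a small sketch of a manager that always accepts the built-in provider and rejects a second external one with a warning. The Provider class, the builtin flag, and the add_provider() method are assumptions for this example, not the real MemoryManager API.

```python
import logging
from typing import List

logger = logging.getLogger("memory_manager_sketch")


class Provider:
    """Hypothetical stand-in for a memory provider."""

    def __init__(self, name: str, builtin: bool = False) -> None:
        self.name = name
        self.builtin = builtin


class MemoryManager:
    def __init__(self) -> None:
        self.providers: List[Provider] = []

    def add_provider(self, provider: Provider) -> bool:
        """Register a provider; return False if it was rejected."""
        has_external = any(not p.builtin for p in self.providers)
        if not provider.builtin and has_external:
            logger.warning(
                "Ignoring memory provider %r: only one external provider is "
                "allowed. Set memory.provider in config.yaml to choose one.",
                provider.name,
            )
            return False
        self.providers.append(provider)
        return True
```

With this shape, registering the built-in provider plus one external provider succeeds, and any further external provider is skipped without raising.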

* feat(memory): add ByteRover memory provider plugin

Implements the ByteRover integration (from PR #3499 by hieuntg81) as a
MemoryProvider plugin instead of direct run_agent.py modifications.

ByteRover provides persistent memory via the brv CLI — a hierarchical
knowledge tree with tiered retrieval (fuzzy text then LLM-driven search).
Local-first with optional cloud sync.

Plugin capabilities:
- prefetch: background brv query for relevant context
- sync_turn: curate conversation turns (threaded, non-blocking)
- on_memory_write: mirror built-in memory writes to brv
- on_pre_compress: extract insights before context compression

Tools (3):
- brv_query: search the knowledge tree
- brv_curate: store facts/decisions/patterns
- brv_status: check CLI version and context tree state

Profile isolation: working directory at $HERMES_HOME/byterover/ (scoped
per profile). Binary resolution cached with thread-safe double-checked
locking. All write operations threaded to avoid blocking the agent
(curate can take 120s with LLM processing).

* fix(memory): thread remaining sync_turns, fix holographic, add config key

Plugin fixes:
- Hindsight: thread sync_turn (was blocking up to 30s via _run_in_thread)
- RetainDB: thread sync_turn (was blocking on HTTP POST)
- Both: shutdown now joins sync threads alongside prefetch threads

Holographic retrieval fixes:
- reason(): removed dead intersection_key computation (bundled but never
  used in scoring). Now reuses pre-computed entity_residuals directly,
  moved role_content encoding outside the inner loop.
- contradict(): added _MAX_CONTRADICT_FACTS=500 scaling guard. Above
  500 facts, only checks the most recently updated ones to avoid O(n^2)
  explosion (~125K comparisons at 500 is acceptable).
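The ~125K figure follows from the number of unordered pairs, n(n-1)/2, which is 124,750 for n = 500. A sketch of the scaling guard under stated assumptions: facts are modeled here as (fact_id, updated_at) tuples and candidate_pairs() is a hypothetical helper; only _MAX_CONTRADICT_FACTS and its value come from the commit message.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

_MAX_CONTRADICT_FACTS = 500

Fact = Tuple[int, float]  # (fact_id, updated_at); assumed representation


def pair_count(n: int) -> int:
    """Number of unordered pairs among n facts."""
    return n * (n - 1) // 2


def candidate_pairs(facts: List[Fact]) -> Iterator[Tuple[Fact, Fact]]:
    """Yield fact pairs to compare, capped to the most recently updated facts."""
    if len(facts) > _MAX_CONTRADICT_FACTS:
        facts = sorted(facts, key=lambda f: f[1], reverse=True)
        facts = facts[:_MAX_CONTRADICT_FACTS]
    return combinations(facts, 2)
```

Without the cap, 600 facts would already need 179,700 comparisons; with it, the work stays bounded at 124,750 regardless of store size.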

Config:
- Added memory.provider key to DEFAULT_CONFIG ("" = builtin only).
  No version bump needed (deep_merge handles new keys automatically).

* feat(memory): extract Honcho as a MemoryProvider plugin

Creates plugins/honcho-memory/ as a thin adapter over the existing
honcho_integration/ package. All 4 Honcho tools (profile, search,
context, conclude) move from the normal tool registry to the
MemoryProvider interface.

The plugin delegates all work to HonchoSessionManager — no Honcho
logic is reimplemented. It uses the existing config chain:
$HERMES_HOME/honcho.json -> ~/.honcho/config.json -> env vars.

Lifecycle hooks:
- initialize: creates HonchoSessionManager via existing client factory
- prefetch: background dialectic query
- sync_turn: records messages + flushes to API (threaded)
- on_memory_write: mirrors user profile writes as conclusions
- on_session_end: flushes all pending messages

This is a prerequisite for the MemoryManager wiring in run_agent.py.
Once wired, Honcho goes through the same provider interface as all
other memory plugins, and the scattered Honcho code in run_agent.py
can be consolidated into the single MemoryManager integration point.

* feat(memory): wire MemoryManager into run_agent.py

Adds 8 integration points for the external memory provider plugin,
all purely additive (zero existing code modified):

1. Init (~L1130): Create MemoryManager, find matching plugin provider
   from memory.provider config, initialize with session context
2. Tool injection (~L1160): Append provider tool schemas to self.tools
   and self.valid_tool_names after memory_manager init
3. System prompt (~L2705): Add external provider's system_prompt_block
   alongside existing MEMORY.md/USER.md blocks
4. Tool routing (~L5362): Route provider tool calls through
   memory_manager.handle_tool_call() before the catchall handler
5. Memory write bridge (~L5353): Notify external provider via
   on_memory_write() when the built-in memory tool writes
6. Pre-compress (~L5233): Call on_pre_compress() before context
   compression discards messages
7. Prefetch (~L6421): Inject provider prefetch results into the
   current-turn user message (same pattern as Honcho turn context)
8. Turn sync + session end (~L8161, ~L8172): sync_all() after each
   completed turn, queue_prefetch_all() for next turn, on_session_end()
   + shutdown_all() at conversation end

All hooks are wrapped in try/except — a failing provider never breaks
the agent. The existing memory system, Honcho integration, and all
other code paths are completely untouched.

Full suite: 7222 passed, 4 pre-existing failures.

* refactor(memory): remove legacy Honcho integration from core

Extracts all Honcho-specific code from run_agent.py, model_tools.py,
toolsets.py, and gateway/run.py. Honcho is now exclusively available
as a memory provider plugin (plugins/honcho-memory/).

Removed from run_agent.py (-457 lines):
- Honcho init block (session manager creation, activation, config)
- 8 Honcho methods: _honcho_should_activate, _strip_honcho_tools,
  _activate_honcho, _register_honcho_exit_hook, _queue_honcho_prefetch,
  _honcho_prefetch, _honcho_save_user_observation, _honcho_sync
- _inject_honcho_turn_context module-level function
- Honcho system prompt block (tool descriptions, CLI commands)
- Honcho context injection in api_messages building
- Honcho params from __init__ (honcho_session_key, honcho_manager,
  honcho_config)
- HONCHO_TOOL_NAMES constant
- All honcho-specific tool dispatch forwarding

Removed from other files:
- model_tools.py: honcho_tools import, honcho params from handle_function_call
- toolsets.py: honcho toolset definition, honcho tools from core tools list
- gateway/run.py: honcho params from AIAgent constructor calls

Removed tests (-339 lines):
- 9 Honcho-specific test methods from test_run_agent.py
- TestHonchoAtexitFlush class from test_exit_cleanup_interrupt.py

Restored two regex constants (_SURROGATE_RE, _BUDGET_WARNING_RE) that
were accidentally removed during the honcho function extraction.

The honcho_integration/ package is kept intact — the plugin delegates
to it. tools/honcho_tools.py registry entries are now dead code (import
commented out in model_tools.py) but the file is preserved for reference.

Full suite: 7207 passed, 4 pre-existing failures. Zero regressions.

* refactor(memory): restructure plugins, add CLI, clean gateway, migration notice

Plugin restructure:
- Move all memory plugins from plugins/<name>-memory/ to plugins/memory/<name>/
  (byterover, hindsight, holographic, honcho, mem0, openviking, retaindb)
- New plugins/memory/__init__.py discovery module that scans the directory
  directly, loading providers by name without the general plugin system
- run_agent.py uses load_memory_provider() instead of get_plugin_memory_providers()

CLI wiring:
- hermes memory setup — interactive curses picker + config wizard
- hermes memory status — show active provider, config, availability
- hermes memory off — disable external provider (built-in only)
- hermes honcho — now shows migration notice pointing to hermes memory setup

Gateway cleanup:
- Remove _get_or_create_gateway_honcho (already removed in prev commit)
- Remove _shutdown_gateway_honcho and _shutdown_all_gateway_honcho methods
- Remove all calls to shutdown methods (4 call sites)
- Remove _honcho_managers/_honcho_configs dict references

Dead code removal:
- Delete tools/honcho_tools.py (279 lines, import was already commented out)
- Delete tests/gateway/test_honcho_lifecycle.py (131 lines, tested removed methods)
- Remove if False placeholder from run_agent.py

Migration:
- Honcho migration notice on startup: detects existing honcho.json or
  ~/.honcho/config.json, prints guidance to run hermes memory setup.
  Only fires when memory.provider is not set and not in quiet mode.

Full suite: 7203 passed, 4 pre-existing failures. Zero regressions.

* feat(memory): standardize plugin config + add per-plugin documentation

Config architecture:
- Add save_config(values, hermes_home) to MemoryProvider ABC
- Honcho: writes to $HERMES_HOME/honcho.json (SDK native)
- Mem0: writes to $HERMES_HOME/mem0.json
- Hindsight: writes to $HERMES_HOME/hindsight/config.json
- Holographic: writes to config.yaml under plugins.hermes-memory-store
- OpenViking/RetainDB/ByteRover: env-var only (default no-op)

Setup wizard (hermes memory setup):
- Now calls provider.save_config() for non-secret config
- Secrets still go to .env via env vars
- Only memory.provider activation key goes to config.yaml

Documentation:
- README.md for each of the 7 providers in plugins/memory/<name>/
- Requirements, setup (wizard + manual), config reference, tools table
- Consistent format across all providers

The contract for new memory plugins:
- get_config_schema() declares all fields (REQUIRED)
- save_config() writes native config (REQUIRED if not env-var-only)
- Secrets use env_var field in schema, written to .env by wizard
- README.md in the plugin directory

* docs: add memory providers user guide + developer guide

New pages:
- user-guide/features/memory-providers.md — comprehensive guide covering
  all 7 shipped providers (Honcho, OpenViking, Mem0, Hindsight,
  Holographic, RetainDB, ByteRover). Each with setup, config, tools,
  cost, and unique features. Includes comparison table and profile
  isolation notes.
- developer-guide/memory-provider-plugin.md — how to build a new memory
  provider plugin. Covers ABC, required methods, config schema,
  save_config, threading contract, profile isolation, testing.

Updated pages:
- user-guide/features/memory.md — replaced Honcho section with link to
  new Memory Providers page
- user-guide/features/honcho.md — replaced with migration redirect to
  the new Memory Providers page
- sidebars.ts — added both new pages to navigation

* fix(memory): auto-migrate Honcho users to memory provider plugin

When honcho.json or ~/.honcho/config.json exists but memory.provider
is not set, automatically set memory.provider: honcho in config.yaml
and activate the plugin. The plugin reads the same config files, so
all data and credentials are preserved. Zero user action needed.

Persists the migration to config.yaml so it only fires once. Prints
a one-line confirmation in non-quiet mode.

* fix(memory): only auto-migrate Honcho when enabled + credentialed

Check HonchoClientConfig.enabled AND (api_key OR base_url) before
auto-migrating — not just file existence. Prevents false activation
for users who disabled Honcho, stopped using it (config lingers),
or have ~/.honcho/ from a different tool.
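The gating condition can be sketched as a small predicate. The dataclass below is only a stand-in carrying the three fields named above (enabled, api_key, base_url); the real HonchoClientConfig has more to it, and should_auto_migrate() is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HonchoClientConfig:
    """Stand-in with only the fields the check needs (assumed shape)."""

    enabled: bool = False
    api_key: Optional[str] = None
    base_url: Optional[str] = None


def should_auto_migrate(cfg: HonchoClientConfig, memory_provider: str) -> bool:
    """Migrate only if no provider is set and Honcho is enabled + credentialed."""
    if memory_provider:
        return False  # user already chose a provider (or migration already ran)
    return bool(cfg.enabled and (cfg.api_key or cfg.base_url))
```

A lingering config file with enabled set to false, or one with no key and no base URL, therefore no longer triggers activation.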

* feat(memory): auto-install pip dependencies during hermes memory setup

Reads pip_dependencies from plugin.yaml, checks which are missing,
installs them via pip before config walkthrough. Also shows install
guidance for external_dependencies (e.g. brv CLI for ByteRover).

Updated all 7 plugin.yaml files with pip_dependencies:
- honcho: honcho-ai
- mem0: mem0ai
- openviking: httpx
- hindsight: hindsight-client
- holographic: (none)
- retaindb: requests
- byterover: (external_dependencies for brv CLI)

* fix: remove remaining Honcho crash risks from cli.py and gateway

cli.py: removed Honcho session re-mapping block (would crash importing
deleted tools/honcho_tools.py), Honcho flush on compress, Honcho
session display on startup, Honcho shutdown on exit, honcho_session_key
AIAgent param.

gateway/run.py: removed honcho_session_key params from helper methods,
sync_honcho param, _honcho.shutdown() block.

tests: fixed test_cron_session_with_honcho_key_skipped (was passing
removed honcho_key param to _flush_memories_for_session).

* fix: include plugins/ in pyproject.toml package list

Without this, plugins/memory/ wouldn't be included in non-editable
installs. Hermes always runs from the repo checkout so this is belt-
and-suspenders, but prevents breakage if the install method changes.

* fix(memory): correct pip-to-import name mapping for dep checks

The heuristic dep.replace('-', '_') fails for packages where the pip
name differs from the import name: honcho-ai→honcho, mem0ai→mem0,
hindsight-client→hindsight_client. Added explicit mapping table so
hermes memory setup doesn't try to reinstall already-installed packages.
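A sketch of the explicit mapping approach, assuming dependencies are plain pip names without version specifiers. The mapping entries are the ones listed above; the function names and the table name are hypothetical. importlib.util.find_spec() is the standard-library way to test whether a top-level module is importable without importing it.

```python
import importlib.util
from typing import Iterable, List

# pip distribution name -> importable module name
_PIP_TO_IMPORT = {
    "honcho-ai": "honcho",
    "mem0ai": "mem0",
    "hindsight-client": "hindsight_client",
}


def import_name(pip_name: str) -> str:
    """Resolve the module name for a pip package, falling back to the heuristic."""
    return _PIP_TO_IMPORT.get(pip_name, pip_name.replace("-", "_"))


def missing_dependencies(pip_names: Iterable[str]) -> List[str]:
    """Return the pip names whose modules cannot be found."""
    return [
        name
        for name in pip_names
        if importlib.util.find_spec(import_name(name)) is None
    ]
```

Only the names returned by missing_dependencies() would then need to be passed to pip, so already-installed packages are left alone.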

* chore: remove dead code from old plugin memory registration path

- hermes_cli/plugins.py: removed register_memory_provider(),
  _memory_providers list, get_plugin_memory_providers() — memory
  providers now use plugins/memory/ discovery, not the general plugin system
- hermes_cli/main.py: stripped 74 lines of dead honcho argparse
  subparsers (setup, status, sessions, map, peer, mode, tokens,
  identity, migrate) — kept only the migration redirect
- agent/memory_provider.py: updated docstring to reflect new
  registration path
- tests: replaced TestPluginMemoryProviderRegistration with
  TestPluginMemoryDiscovery that tests the actual plugins/memory/
  discovery system. Added 3 new tests (discover, load, nonexistent).

* chore: delete dead honcho_integration/cli.py and its tests

cli.py (794 lines) was the old 'hermes honcho' command handler — nobody
calls it since cmd_honcho was replaced with a migration redirect.

Deleted tests that imported from removed code:
- tests/honcho_integration/test_cli.py (tested _resolve_api_key)
- tests/honcho_integration/test_config_isolation.py (tested CLI config paths)
- tests/tools/test_honcho_tools.py (tested the deleted tools/honcho_tools.py)

Remaining honcho_integration/ files (actively used by the plugin):
- client.py (445 lines) — config loading, SDK client creation
- session.py (991 lines) — session management, queries, flush

* refactor: move honcho_integration/ into the honcho plugin

Moves client.py (445 lines) and session.py (991 lines) from the
top-level honcho_integration/ package into plugins/memory/honcho/.
No Honcho code remains in the main codebase.

- plugins/memory/honcho/client.py — config loading, SDK client creation
- plugins/memory/honcho/session.py — session management, queries, flush
- Updated all imports: run_agent.py (auto-migration), hermes_cli/doctor.py,
  plugin __init__.py, session.py cross-import, all tests
- Removed honcho_integration/ package and pyproject.toml entry
- Renamed tests/honcho_integration/ → tests/honcho_plugin/

* docs: update architecture + gateway-internals for memory provider system

- architecture.md: replaced honcho_integration/ with plugins/memory/
- gateway-internals.md: replaced Honcho-specific session routing and
  flush lifecycle docs with generic memory provider interface docs

* fix: update stale mock path for resolve_active_host after honcho plugin migration

* fix(memory): address review feedback — P0 lifecycle, ABC contract, honcho CLI restore

Review feedback from Honcho devs (erosika):

P0 — Provider lifecycle:
- Remove on_session_end() + shutdown_all() from run_conversation() tail
  (was killing providers after every turn in multi-turn sessions)
- Add shutdown_memory_provider() method on AIAgent for callers
- Wire shutdown into CLI atexit, reset_conversation, gateway stop/expiry

Bug fixes:
- Remove sync_honcho=False kwarg from /btw callsites (TypeError crash)
- Fix doctor.py references to dead 'hermes honcho setup' command
- Cache prefetch_all() before tool loop (was re-calling every iteration)

ABC contract hardening (all backwards-compatible):
- Add session_id kwarg to prefetch/sync_turn/queue_prefetch
- Make on_pre_compress() return str (provider insights in compression)
- Add **kwargs to on_turn_start() for runtime context
- Add on_delegation() hook for parent-side subagent observation
- Document agent_context/agent_identity/agent_workspace kwargs on
  initialize() (prevents cron corruption, enables profile scoping)
- Fix docstring: single external provider, not multiple

Honcho CLI restoration:
- Add plugins/memory/honcho/cli.py (from main's honcho_integration/cli.py
  with imports adapted to plugin path)
- Restore full hermes honcho command with all subcommands (status, peer,
  mode, tokens, identity, enable/disable, sync, peers, --target-profile)
- Restore auto-clone on profile creation + sync on hermes update
- hermes honcho setup now redirects to hermes memory setup

* fix(memory): wire on_delegation, skip_memory for cron/flush, fix ByteRover return type

- Wire on_delegation() in delegate_tool.py — parent's memory provider
  is notified with task+result after each subagent completes
- Add skip_memory=True to cron scheduler (prevents cron system prompts
  from corrupting user representations — closes #4052)
- Add skip_memory=True to gateway flush agent (throwaway agent shouldn't
  activate memory provider)
- Fix ByteRover on_pre_compress() return type: None -> str

* fix(honcho): port profile isolation fixes from PR #4632

Ports 5 bug fixes found during profile testing (erosika's PR #4632):

1. 3-tier config resolution — resolve_config_path() now checks
   $HERMES_HOME/honcho.json → ~/.hermes/honcho.json → ~/.honcho/config.json
   (non-default profiles couldn't find shared host blocks)

2. Thread host=_host_key() through from_global_config() in cmd_setup,
   cmd_status, cmd_identity (--target-profile was being ignored)

3. Use bare profile name as aiPeer (not host key with dots) — Honcho's
   peer ID pattern is ^[a-zA-Z0-9_-]+$, dots are invalid

4. Wrap add_peers() in try/except — was fatal on new AI peers, killed
   all message uploads for the session

5. Gate Honcho clone behind --clone/--clone-all on profile create
   (bare create should be blank-slate)

Also: sanitize assistant_peer_id via _sanitize_id()

* fix(tests): add module cleanup fixture to test_cli_provider_resolution

test_cli_provider_resolution._import_cli() wipes tools.*, cli, and
run_agent from sys.modules to force fresh imports, but had no cleanup.
This poisoned all subsequent tests on the same xdist worker — mocks
targeting tools.file_tools, tools.send_message_tool, etc. patched the
NEW module object while already-imported functions still referenced
the OLD one. Caused ~25 cascade failures: send_message KeyError,
process_registry FileNotFoundError, file_read_guards timeouts,
read_loop_detection file-not-found, mcp_oauth None port, and
provider_parity/codex_execution stale tool lists.

Fix: autouse fixture saves all affected modules before each test and
restores them after, matching the pattern in
test_managed_browserbase_and_modal.py.
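The save-and-restore pattern can be sketched with the standard library alone. The set of affected module names below mirrors the ones listed above (tools.*, cli, run_agent); the helper names are hypothetical, and the real fixture may differ in detail.

```python
import sys
from contextlib import contextmanager
from typing import Iterator


def _is_affected(name: str) -> bool:
    """Modules that _import_cli() wipes from sys.modules."""
    return name in ("cli", "run_agent", "tools") or name.startswith("tools.")


@contextmanager
def preserved_modules() -> Iterator[None]:
    """Snapshot affected sys.modules entries and restore them on exit."""
    saved = {n: m for n, m in sys.modules.items() if _is_affected(n)}
    try:
        yield
    finally:
        # Drop whatever the test imported fresh, then put the originals back
        # so later tests patch the same module objects their imports reference.
        for name in [n for n in sys.modules if _is_affected(n)]:
            del sys.modules[name]
        sys.modules.update(saved)
```

In a pytest suite this would typically be wrapped in a generator fixture declared with pytest.fixture(autouse=True) that yields inside `with preserved_modules():`.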
2026-04-02 15:33:51 -07:00
Teknium e0b2bdb089 fix: webhook platform support — skip home channel prompt, disable tool progress (salvage #4363) (#4660)
Cherry-picked from PR #4363 by @bennyhodl with follow-up fixes:

- Skip 'No home channel' prompt for webhook platform (webhooks deliver
  to configured targets, not a home channel)
- Disable tool progress for webhooks (no message editing support)
- Add webhook to PLATFORMS in tools_config.py and skills_config.py
- Add hermes-webhook toolset to toolsets.py + hermes-gateway includes
- Removed overly aggressive <50 char content filter that blocked
  legitimate short responses (tool progress already handled at source)

Co-authored-by: bennyhodl <bennyhodl@users.noreply.github.com>
2026-04-02 14:00:22 -07:00
SHL0MS 6d68fbf756 Merge pull request #4654 from SHL0MS/skill/research-paper-writing
Replace ml-paper-writing with research-paper-writing: full end-to-end research pipeline
2026-04-02 13:24:12 -07:00
SHL0MS b86647c295 Replace ml-paper-writing with research-paper-writing: full research pipeline skill
Replaces the writing-focused ml-paper-writing skill (940 lines) with a
complete end-to-end research paper pipeline (1,599 lines SKILL.md + 3,184
lines across 7 reference files).

New content:
- Full 8-phase pipeline: project setup, literature review, experiment
  design, execution/monitoring, analysis, paper drafting, review/revision,
  submission preparation
- Iterative refinement strategy guide from autoreason research (when to use
  autoreason vs critique-and-revise vs single-pass, model selection)
- Hermes agent integration: delegate_task parallel drafting, cronjob
  monitoring, memory/todo state management, skill composition
- Professional LaTeX tooling: microtype, siunitx, TikZ diagram patterns,
  algorithm2e, subcaption, latexdiff, SciencePlots
- Human evaluation design: annotation protocols, inter-annotator agreement,
  crowdsourcing platforms
- Title, Figure 1, conclusion, appendix strategy, page budget management
- Anonymization checklist, rebuttal writing, camera-ready preparation
- AAAI and COLM venue coverage (checklists, reviewer guidelines)

Preserved from ml-paper-writing:
- All writing philosophy (Nanda, Farquhar, Gopen & Swan, Lipton, Perez)
- Citation verification workflow (5-step mandatory process)
- All 6 conference templates (NeurIPS, ICML, ICLR, ACL, AAAI, COLM)
- Conference requirements, format conversion workflow
- Proactivity/collaboration guidance

Bug fixes in inherited reference files:
- BibLaTeX recommendation now correctly says natbib for conferences
- Bare except clauses fixed to except Exception
- Jinja2 template tags removed from citation-workflow.md
- Stale date caveats added to reviewer-guidelines.md
2026-04-02 16:13:26 -04:00
Teknium 798a7b99e4 docs: add Configuration Options section to Slack docs (#4644)
* docs: add Configuration Options section to Slack docs

Documents all config.yaml options for the Slack bot:
- Thread & reply behavior (reply_to_mode, reply_broadcast)
- Session isolation (group_sessions_per_user)
- Mention & trigger behavior (require_mention, mention_patterns, reply_prefix)
- Unauthorized user handling (unauthorized_dm_behavior)
- Voice transcription (stt_enabled)
- Full example config showing all options together

Includes a note about Slack's hardcoded @mention requirement in channels
(unlike Discord/Telegram, Slack has no free_response_channels equivalent).

* docs: consolidate reply_in_thread into Configuration Options section

Folds the standalone Reply Threading subsection from PR #4643 into
the Thread & Reply Behavior subsection, keeping all config options
in one place. Adds reply_in_thread to the table and full example.
2026-04-02 12:38:13 -07:00
kshitijk4poor d2b08406a4 fix(agent): classify think-only empty responses before retrying
2026-04-02 12:29:18 -07:00
Teknium 241cbeeccd docs: add reply_in_thread config to Slack docs
2026-04-02 12:18:40 -07:00
Animesh Mishra b9a968c1de feat(slack): add reply_in_thread config option
By default, Hermes always threads replies to channel messages. Teams
that prefer direct channel replies had no way to opt out without
patching the source.

Add a reply_in_thread option (default: true) to the Slack platform
extra config:

  platforms:
    slack:
      extra:
        reply_in_thread: false

When false, _resolve_thread_ts() returns None for top-level channel
messages, so replies go directly to the channel. Messages already
inside an existing thread are still replied in-thread to preserve
conversation context. Default is true for full backward compatibility.
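The behavior described above can be sketched as a pure function. The event fields ts and thread_ts are standard Slack message fields; the function signature here is an assumption and is not the actual _resolve_thread_ts() method.

```python
from typing import Any, Dict, Optional


def resolve_thread_ts(event: Dict[str, Any], reply_in_thread: bool = True) -> Optional[str]:
    """Pick the thread_ts to reply with, or None to post directly in the channel."""
    thread_ts = event.get("thread_ts")
    if thread_ts:
        # Message is already inside a thread: always stay in that thread.
        return thread_ts
    if reply_in_thread:
        # Top-level message: start a thread anchored on the message itself.
        return event.get("ts")
    # Top-level message with threading disabled: reply in the channel.
    return None
```

The option therefore only changes the outcome for top-level channel messages; replies within an existing thread behave the same either way.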
2026-04-02 12:18:40 -07:00
Teknium d89cc7fec1 feat(prompt): add Google model operational guidance for Gemini and Gemma (#4641)
Adapted from OpenCode's gemini.txt. Gemini and Gemma models now get
structured operational directives alongside tool-use enforcement:
absolute paths, verify-before-edit, dependency checks, conciseness,
parallel tool calls, non-interactive flags, autonomous execution.

Based on PR #4026, extended to cover Gemma models.
2026-04-02 11:52:34 -07:00
Teknium 3186668799 feat: per-turn primary runtime restoration and transport recovery (#4624)
Makes provider fallback turn-scoped in long-lived CLI sessions. Previously, a single transient failure pinned the session to the fallback provider for every subsequent turn.

- _primary_runtime dict snapshot at __init__ (model, provider, base_url, api_mode, client_kwargs, compressor state)
- _restore_primary_runtime() at top of run_conversation() — restores all state, resets fallback chain index
- _try_recover_primary_transport() — one extra recovery cycle (client rebuild + cooldown) for transient transport errors on direct endpoints before fallback
- Skipped for aggregator providers (OpenRouter, Nous)
- 25 tests

Inspired by #4612 (@betamod). Closes #4612.
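A simplified sketch of turn-scoped fallback, assuming the runtime is just three attributes. The names _primary_runtime, _restore_primary_runtime(), and run_conversation() come from this commit message; the Agent class, the activate_fallback() method, and the _fallback_index attribute are hypothetical, and the real snapshot also covers api_mode, client_kwargs, and compressor state.

```python
import copy


class Agent:
    _RUNTIME_KEYS = ("model", "provider", "base_url")

    def __init__(self, model: str, provider: str, base_url: str) -> None:
        self.model = model
        self.provider = provider
        self.base_url = base_url
        self._fallback_index = 0
        # Snapshot taken once at construction time.
        self._primary_runtime = copy.deepcopy(
            {key: getattr(self, key) for key in self._RUNTIME_KEYS}
        )

    def activate_fallback(self, model: str, provider: str, base_url: str) -> None:
        """Switch to a fallback provider after a failure (illustrative)."""
        self.model, self.provider, self.base_url = model, provider, base_url
        self._fallback_index += 1

    def _restore_primary_runtime(self) -> None:
        for key, value in copy.deepcopy(self._primary_runtime).items():
            setattr(self, key, value)
        self._fallback_index = 0  # restart the fallback chain

    def run_conversation(self, message: str) -> str:
        # Every turn starts from the primary runtime again.
        self._restore_primary_runtime()
        return f"[{self.provider}/{self.model}] {message}"
```

Restoring at the top of each turn means a transient failure affects only the turn in which it happened, rather than pinning the whole session to the fallback.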
2026-04-02 10:52:01 -07:00
Teknium 918d593544 chore: gitignore generated skills.json
Follow-up to #4500 — the extraction script generates this file at
build time, so it should not be committed.
2026-04-02 10:48:15 -07:00
Nacho Avecilla b8dd059c40 feat(website): add skills browse and search page to docs (#4500)
Adds a Skills Hub page to the documentation site with a browsable, searchable catalog of all skills (built-in, optional, and community skills from cached hub indexes).

- Python extraction script (website/scripts/extract-skills.py) parses SKILL.md frontmatter and hub index caches into skills.json
- React page (website/src/pages/skills/) with search, category filtering, source filtering, and expandable skill cards
- CI workflow updated to run extraction before Docusaurus build
- Deploy trigger expanded to include skills/ and optional-skills/ changes

Authored by @IAvecilla
2026-04-02 10:47:38 -07:00
kshitijk4poor 20441cf2c8 fix(insights): persist token usage for non-CLI sessions
2026-04-02 10:47:13 -07:00
Teknium 585855d2ca fix: preserve Anthropic thinking block signatures across tool-use turns
Anthropic extended thinking blocks include an opaque 'signature' field
required for thinking chain continuity across multi-turn tool-use
conversations. Previously, normalize_anthropic_response() extracted
only the thinking text and set reasoning_details=None, discarding the
signature. On subsequent turns the API could not verify the chain.

Changes:
- _to_plain_data(): new recursive SDK-to-dict converter with depth cap
  (20 levels) and path-based cycle detection for safety
- _extract_preserved_thinking_blocks(): rehydrates preserved thinking
  blocks (including signature) from reasoning_details on assistant
  messages, placing them before tool_use blocks as Anthropic requires
- normalize_anthropic_response(): stores full thinking blocks in
  reasoning_details via _to_plain_data()
- _extract_reasoning(): adds 'thinking' key to the detail lookup chain
  so Anthropic-format details are found alongside OpenRouter format

Salvaged from PR #4503 by @priveperfumes — focused on the thinking
block continuity fix only (cache strategy and other changes excluded).
2026-04-02 10:30:32 -07:00
79 changed files with 5812 additions and 1083 deletions
+12
@@ -6,6 +6,8 @@ on:
paths:
- 'website/**'
- 'landingpage/**'
- 'skills/**'
- 'optional-skills/**'
- '.github/workflows/deploy-site.yml'
workflow_dispatch:
@@ -34,6 +36,16 @@ jobs:
cache: npm
cache-dependency-path: website/package-lock.json
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install PyYAML for skill extraction
run: pip install pyyaml
- name: Extract skill metadata for dashboard
run: python3 website/scripts/extract-skills.py
- name: Install dependencies
run: npm ci
working-directory: website
+5 -2
@@ -27,8 +27,11 @@ jobs:
with:
python-version: '3.11'
- name: Install ascii-guard
run: python -m pip install ascii-guard
- name: Install Python dependencies
run: python -m pip install ascii-guard pyyaml
- name: Extract skill metadata for dashboard
run: python3 website/scripts/extract-skills.py
- name: Lint docs diagrams
run: npm run lint:diagrams
+70 -2
@@ -10,6 +10,7 @@ Auth supports:
- Claude Code credentials (~/.claude.json or ~/.claude/.credentials.json) → Bearer auth
"""
import copy
import json
import logging
import os
@@ -949,6 +950,69 @@ def _convert_content_part_to_anthropic(part: Any) -> Optional[Dict[str, Any]]:
return block
def _to_plain_data(value: Any, *, _depth: int = 0, _path: Optional[set] = None) -> Any:
"""Recursively convert SDK objects to plain Python data structures.
Guards against circular references (``_path`` tracks ``id()`` of objects
on the *current* recursion path) and runaway depth (capped at 20 levels).
Uses path-based tracking so shared (but non-cyclic) objects referenced by
multiple siblings are converted correctly rather than being stringified.
"""
_MAX_DEPTH = 20
if _depth > _MAX_DEPTH:
return str(value)
if _path is None:
_path = set()
obj_id = id(value)
if obj_id in _path:
return str(value)
if hasattr(value, "model_dump"):
_path.add(obj_id)
result = _to_plain_data(value.model_dump(), _depth=_depth + 1, _path=_path)
_path.discard(obj_id)
return result
if isinstance(value, dict):
_path.add(obj_id)
result = {k: _to_plain_data(v, _depth=_depth + 1, _path=_path) for k, v in value.items()}
_path.discard(obj_id)
return result
if isinstance(value, (list, tuple)):
_path.add(obj_id)
result = [_to_plain_data(v, _depth=_depth + 1, _path=_path) for v in value]
_path.discard(obj_id)
return result
if hasattr(value, "__dict__"):
_path.add(obj_id)
result = {
k: _to_plain_data(v, _depth=_depth + 1, _path=_path)
for k, v in vars(value).items()
if not k.startswith("_")
}
_path.discard(obj_id)
return result
return value
def _extract_preserved_thinking_blocks(message: Dict[str, Any]) -> List[Dict[str, Any]]:
"""Return Anthropic thinking blocks previously preserved on the message."""
raw_details = message.get("reasoning_details")
if not isinstance(raw_details, list):
return []
preserved: List[Dict[str, Any]] = []
for detail in raw_details:
if not isinstance(detail, dict):
continue
block_type = str(detail.get("type", "") or "").strip().lower()
if block_type not in {"thinking", "redacted_thinking"}:
continue
preserved.append(copy.deepcopy(detail))
return preserved
def _convert_content_to_anthropic(content: Any) -> Any:
"""Convert OpenAI-style multimodal content arrays to Anthropic blocks."""
if not isinstance(content, list):
@@ -995,7 +1059,7 @@ def convert_messages_to_anthropic(
continue
if role == "assistant":
blocks = []
blocks = _extract_preserved_thinking_blocks(m)
if content:
if isinstance(content, list):
converted_content = _convert_content_to_anthropic(content)
@@ -1279,6 +1343,7 @@ def normalize_anthropic_response(
"""
text_parts = []
reasoning_parts = []
reasoning_details = []
tool_calls = []
for block in response.content:
@@ -1286,6 +1351,9 @@ def normalize_anthropic_response(
text_parts.append(block.text)
elif block.type == "thinking":
reasoning_parts.append(block.thinking)
block_dict = _to_plain_data(block)
if isinstance(block_dict, dict):
reasoning_details.append(block_dict)
elif block.type == "tool_use":
name = block.name
if strip_tool_prefix and name.startswith(_MCP_TOOL_PREFIX):
@@ -1316,7 +1384,7 @@ def normalize_anthropic_response(
tool_calls=tool_calls or None,
reasoning="\n\n".join(reasoning_parts) if reasoning_parts else None,
reasoning_content=None,
reasoning_details=None,
reasoning_details=reasoning_details or None,
),
finish_reason,
)
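A minimal sketch of why `_to_plain_data()` tracks the current recursion path instead of a global visited set. The reduced converter below (`to_plain`, a hypothetical name, handling only dicts and lists) is an illustration under that assumption, not the adapter's code: a child shared by two siblings is converted in both places, while a genuine cycle is stringified.

```python
from typing import Any, Optional, Set


def to_plain(value: Any, _path: Optional[Set[int]] = None) -> Any:
    """Hypothetical reduced converter: dicts and lists only."""
    if _path is None:
        _path = set()
    obj_id = id(value)
    if obj_id in _path:
        # The object is one of its own ancestors: a real cycle, so stop here.
        return str(value)
    if isinstance(value, dict):
        _path.add(obj_id)
        result = {k: to_plain(v, _path) for k, v in value.items()}
        _path.discard(obj_id)  # leaving this branch, so siblings may reuse it
        return result
    if isinstance(value, list):
        _path.add(obj_id)
        result = [to_plain(v, _path) for v in value]
        _path.discard(obj_id)
        return result
    return value


shared = {"signature": "abc"}
tree = {"left": shared, "right": shared}  # shared, but not cyclic
converted = to_plain(tree)

cyclic = {"name": "loop"}
cyclic["self"] = cyclic  # genuinely cyclic
converted_cycle = to_plain(cyclic)
```

With a global visited set, `tree["right"]` would be treated as already seen and stringified; discarding the id on the way back up avoids that.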
+4
@@ -301,6 +301,8 @@ Update the summary using this exact structure. PRESERVE all existing information
Target ~{summary_budget} tokens. Be specific — include file paths, command outputs, error messages, and concrete values rather than vague descriptions.
Write the summary in the same language the user was using in the conversation.
Write only the summary body. Do not include any preamble or prefix."""
else:
# First compaction: summarize from scratch
@@ -339,6 +341,8 @@ Use this exact structure:
Target ~{summary_budget} tokens. Be specific — include file paths, command outputs, error messages, and concrete values rather than vague descriptions. The goal is to prevent the next assistant from repeating work or losing important details.
Write the summary in the same language the user was using in the conversation.
Write only the summary body. Do not include any preamble or prefix."""
try:
+23 -1
@@ -187,7 +187,29 @@ TOOL_USE_ENFORCEMENT_GUIDANCE = (
# Model name substrings that trigger tool-use enforcement guidance.
# Add new patterns here when a model family needs explicit steering.
TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex")
TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "gemini", "gemma")
# Gemini/Gemma-specific operational guidance, adapted from OpenCode's gemini.txt.
# Injected alongside TOOL_USE_ENFORCEMENT_GUIDANCE when the model is Gemini or Gemma.
GOOGLE_MODEL_OPERATIONAL_GUIDANCE = (
"# Google model operational directives\n"
"Follow these operational rules strictly:\n"
"- **Absolute paths:** Always construct and use absolute file paths for all "
"file system operations. Combine the project root with relative paths.\n"
"- **Verify first:** Use read_file/search_files to check file contents and "
"project structure before making changes. Never guess at file contents.\n"
"- **Dependency checks:** Never assume a library is available. Check "
"package.json, requirements.txt, Cargo.toml, etc. before importing.\n"
"- **Conciseness:** Keep explanatory text brief — a few sentences, not "
"paragraphs. Focus on actions and results over narration.\n"
"- **Parallel tool calls:** When you need to perform multiple independent "
"operations (e.g. reading several files), make all the tool calls in a "
"single response rather than sequentially.\n"
"- **Non-interactive commands:** Use flags like -y, --yes, --non-interactive "
"to prevent CLI tools from hanging on prompts.\n"
"- **Keep going:** Work autonomously until the task is fully resolved. "
"Don't stop with a plan — execute it.\n"
)
# Model name substrings that should use the 'developer' role instead of
# 'system' for the system prompt. OpenAI's newer models (GPT-5, Codex)
+11
@@ -323,7 +323,18 @@ class SlackAdapter(BasePlatformAdapter):
Prefers metadata thread_id (the thread parent's ts, set by the
gateway) over reply_to (which may be a child message's ts).
When ``reply_in_thread`` is ``false`` in the platform extra config,
top-level channel messages receive direct channel replies instead of
thread replies. Messages that originate inside an existing thread are
always replied to in-thread to preserve conversation context.
"""
# When reply_in_thread is disabled (default: True for backward compat),
# only thread messages that are already part of an existing thread.
if not self.config.extra.get("reply_in_thread", True):
existing_thread = (metadata or {}).get("thread_id") or (metadata or {}).get("thread_ts")
return existing_thread or None
if metadata:
if metadata.get("thread_id"):
return metadata["thread_id"]
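The thread-resolution rule in this hunk can be sketched as a standalone function. `resolve_thread_ts` is a hypothetical name, and the default branch is a simplification (the rest of the real method is not visible in the diff); only the `reply_in_thread` branch mirrors the lines shown.

```python
from typing import Any, Dict, Optional


def resolve_thread_ts(
    reply_in_thread: bool,
    metadata: Optional[Dict[str, Any]],
    reply_to: Optional[str],
) -> Optional[str]:
    """Hypothetical standalone version of the thread-resolution rule."""
    meta = metadata or {}
    existing_thread = meta.get("thread_id") or meta.get("thread_ts")
    if not reply_in_thread:
        # Only stay in a thread the incoming message already belongs to.
        return existing_thread or None
    # Default behaviour (assumed simplification): prefer the thread parent,
    # then fall back to the message being replied to.
    return existing_thread or reply_to
```

With `reply_in_thread` disabled, a top-level message yields `None` (a direct channel reply), while a message that arrived inside a thread keeps its thread timestamp.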
+9 -13
@@ -2397,7 +2397,8 @@ class GatewayRunner:
)
# One-time prompt if no home channel is set for this platform
if not history and source.platform and source.platform != Platform.LOCAL:
# Skip for webhooks - they deliver directly to configured targets (github_comment, etc.)
if not history and source.platform and source.platform != Platform.LOCAL and source.platform != Platform.WEBHOOK:
platform_name = source.platform.value
env_key = f"{platform_name.upper()}_HOME_CHANNEL"
if not os.getenv(env_key):
@@ -2752,20 +2753,12 @@ class GatewayRunner:
skip_db=agent_persisted,
)
# Update session with actual prompt token count and model from the agent
# Token counts and model are now persisted by the agent directly.
# Keep only last_prompt_tokens here for context-window tracking and
# compression decisions.
self.session_store.update_session(
session_entry.session_key,
input_tokens=agent_result.get("input_tokens", 0),
output_tokens=agent_result.get("output_tokens", 0),
cache_read_tokens=agent_result.get("cache_read_tokens", 0),
cache_write_tokens=agent_result.get("cache_write_tokens", 0),
last_prompt_tokens=agent_result.get("last_prompt_tokens", 0),
model=agent_result.get("model"),
estimated_cost_usd=agent_result.get("estimated_cost_usd"),
cost_status=agent_result.get("cost_status"),
cost_source=agent_result.get("cost_source"),
provider=agent_result.get("provider"),
base_url=agent_result.get("base_url"),
)
# Auto voice reply: send TTS audio before the text response
@@ -5307,7 +5300,10 @@ class GatewayRunner:
or os.getenv("HERMES_TOOL_PROGRESS_MODE")
or "all"
)
tool_progress_enabled = progress_mode != "off"
# Disable tool progress for webhooks - they don't support message editing,
# so each progress line would be sent as a separate message.
from gateway.config import Platform
tool_progress_enabled = progress_mode != "off" and source.platform != Platform.WEBHOOK
# Queue for progress messages (thread-safe)
progress_queue = queue.Queue() if tool_progress_enabled else None
+1 -49
@@ -778,66 +778,18 @@ class SessionStore:
def update_session(
self,
session_key: str,
input_tokens: int = 0,
output_tokens: int = 0,
cache_read_tokens: int = 0,
cache_write_tokens: int = 0,
last_prompt_tokens: int = None,
model: str = None,
estimated_cost_usd: Optional[float] = None,
cost_status: Optional[str] = None,
cost_source: Optional[str] = None,
provider: Optional[str] = None,
base_url: Optional[str] = None,
) -> None:
"""Update a session's metadata after an interaction."""
db_session_id = None
"""Update lightweight session metadata after an interaction."""
with self._lock:
self._ensure_loaded_locked()
if session_key in self._entries:
entry = self._entries[session_key]
entry.updated_at = _now()
# Direct assignment — the gateway receives cumulative totals
# from the cached agent, not per-call deltas.
entry.input_tokens = input_tokens
entry.output_tokens = output_tokens
entry.cache_read_tokens = cache_read_tokens
entry.cache_write_tokens = cache_write_tokens
if last_prompt_tokens is not None:
entry.last_prompt_tokens = last_prompt_tokens
if estimated_cost_usd is not None:
entry.estimated_cost_usd = estimated_cost_usd
if cost_status:
entry.cost_status = cost_status
entry.total_tokens = (
entry.input_tokens
+ entry.output_tokens
+ entry.cache_read_tokens
+ entry.cache_write_tokens
)
self._save()
db_session_id = entry.session_id
if self._db and db_session_id:
try:
self._db.set_token_counts(
db_session_id,
input_tokens=input_tokens,
output_tokens=output_tokens,
cache_read_tokens=cache_read_tokens,
cache_write_tokens=cache_write_tokens,
estimated_cost_usd=estimated_cost_usd,
cost_status=cost_status,
cost_source=cost_source,
billing_provider=provider,
billing_base_url=base_url,
model=model,
absolute=True,
)
except Exception as e:
logger.debug("Session DB operation failed: %s", e)
def reset_session(self, session_key: str) -> Optional[SessionEntry]:
"""Force reset a session, creating a new session ID."""
+1
@@ -30,6 +30,7 @@ PLATFORMS = {
"dingtalk": "💬 DingTalk",
"feishu": "🪽 Feishu",
"wecom": "💬 WeCom",
"webhook": "🔗 Webhook",
}
# ─── Config Helpers ───────────────────────────────────────────────────────────
+1
@@ -150,6 +150,7 @@ PLATFORMS = {
"wecom": {"label": "💬 WeCom", "default_toolset": "hermes-wecom"},
"api_server": {"label": "🌐 API Server", "default_toolset": "hermes-api-server"},
"mattermost": {"label": "💬 Mattermost", "default_toolset": "hermes-mattermost"},
"webhook": {"label": "🔗 Webhook", "default_toolset": "hermes-webhook"},
}
+348 -17
@@ -85,11 +85,11 @@ from agent.model_metadata import (
fetch_model_metadata,
estimate_tokens_rough, estimate_messages_tokens_rough, estimate_request_tokens_rough,
get_next_probe_tier, parse_context_limit_from_error,
save_context_length,
save_context_length, is_local_endpoint,
)
from agent.context_compressor import ContextCompressor
from agent.prompt_caching import apply_anthropic_cache_control
from agent.prompt_builder import build_skills_system_prompt, build_context_files_prompt, load_soul_md, TOOL_USE_ENFORCEMENT_GUIDANCE, TOOL_USE_ENFORCEMENT_MODELS, DEVELOPER_ROLE_MODELS
from agent.prompt_builder import build_skills_system_prompt, build_context_files_prompt, load_soul_md, TOOL_USE_ENFORCEMENT_GUIDANCE, TOOL_USE_ENFORCEMENT_MODELS, DEVELOPER_ROLE_MODELS, GOOGLE_MODEL_OPERATIONAL_GUIDANCE
from agent.usage_pricing import estimate_usage_cost, normalize_usage
from agent.display import (
KawaiiSpinner, build_tool_preview as _build_tool_preview,
@@ -1194,6 +1194,34 @@ class AIAgent:
else:
print(f"📊 Context limit: {self.context_compressor.context_length:,} tokens (auto-compression disabled)")
# Snapshot primary runtime for per-turn restoration. When fallback
# activates during a turn, the next turn restores these values so the
# preferred model gets a fresh attempt each time. Uses a single dict
# so new state fields are easy to add without N individual attributes.
_cc = self.context_compressor
self._primary_runtime = {
"model": self.model,
"provider": self.provider,
"base_url": self.base_url,
"api_mode": self.api_mode,
"api_key": getattr(self, "api_key", ""),
"client_kwargs": dict(self._client_kwargs),
"use_prompt_caching": self._use_prompt_caching,
# Compressor state that _try_activate_fallback() overwrites
"compressor_model": _cc.model,
"compressor_base_url": _cc.base_url,
"compressor_api_key": getattr(_cc, "api_key", ""),
"compressor_provider": _cc.provider,
"compressor_context_length": _cc.context_length,
"compressor_threshold_tokens": _cc.threshold_tokens,
}
if self.api_mode == "anthropic_messages":
self._primary_runtime.update({
"anthropic_api_key": self._anthropic_api_key,
"anthropic_base_url": self._anthropic_base_url,
"is_anthropic_oauth": self._is_anthropic_oauth,
})
def reset_session_state(self):
"""Reset all session-scoped token counters to 0 for a fresh session.
@@ -1463,7 +1491,12 @@ class AIAgent:
for detail in assistant_message.reasoning_details:
if isinstance(detail, dict):
# Extract summary from reasoning detail object
summary = detail.get('summary') or detail.get('content') or detail.get('text')
summary = (
detail.get('summary')
or detail.get('thinking')
or detail.get('content')
or detail.get('text')
)
if summary and summary not in reasoning_parts:
reasoning_parts.append(summary)
@@ -1490,6 +1523,74 @@ class AIAgent:
return "\n\n".join(reasoning_parts)
return None
def _classify_empty_content_response(
self,
assistant_message,
*,
finish_reason: Optional[str],
approx_tokens: int,
api_messages: List[Dict[str, Any]],
conversation_history: Optional[List[Dict[str, Any]]],
) -> Dict[str, Any]:
"""Classify think-only/empty responses so we can retry, compress, or salvage.
We intentionally do NOT short-circuit all structured-reasoning responses.
Prior discussion/PR history shows some models recover on retry. Instead we:
- compress immediately when the pattern looks like implicit context pressure
- salvage reasoning early when the same reasoning-only payload repeats
- otherwise preserve the normal retry path
"""
reasoning_text = self._extract_reasoning(assistant_message)
has_structured_reasoning = bool(
getattr(assistant_message, "reasoning", None)
or getattr(assistant_message, "reasoning_content", None)
or getattr(assistant_message, "reasoning_details", None)
)
content = getattr(assistant_message, "content", None) or ""
stripped_content = self._strip_think_blocks(content).strip()
signature = (
content,
reasoning_text or "",
bool(has_structured_reasoning),
finish_reason or "",
)
repeated_signature = signature == getattr(self, "_last_empty_content_signature", None)
compressor = getattr(self, "context_compressor", None)
ctx_len = getattr(compressor, "context_length", 0) or 0
threshold_tokens = getattr(compressor, "threshold_tokens", 0) or 0
is_large_session = bool(
(ctx_len and approx_tokens >= max(int(ctx_len * 0.4), threshold_tokens))
or len(api_messages) > 80
)
is_local_custom = is_local_endpoint(getattr(self, "base_url", "") or "")
is_resumed = bool(conversation_history)
context_pressure_signals = any(
[
finish_reason == "length",
getattr(compressor, "_context_probed", False),
is_large_session,
is_resumed,
]
)
should_compress = bool(
self.compression_enabled
and is_local_custom
and context_pressure_signals
and not stripped_content
)
self._last_empty_content_signature = signature
return {
"reasoning_text": reasoning_text,
"has_structured_reasoning": has_structured_reasoning,
"repeated_signature": repeated_signature,
"should_compress": should_compress,
"is_local_custom": is_local_custom,
"is_large_session": is_large_session,
"is_resumed": is_resumed,
}
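A minimal sketch of the two decisions made in `_classify_empty_content_response()`, pulled out as pure functions so the thresholds are easy to check. The function names and flat parameters are hypothetical; the arithmetic follows the hunk above (40% of the context length or the compression threshold, whichever is larger, or more than 80 messages).

```python
def is_large_session(
    approx_tokens: int,
    context_length: int,
    threshold_tokens: int,
    message_count: int,
) -> bool:
    """Hypothetical helper mirroring the is_large_session expression."""
    token_floor = max(int(context_length * 0.4), threshold_tokens)
    return bool(
        (context_length and approx_tokens >= token_floor)
        or message_count > 80
    )


def should_compress(
    *,
    compression_enabled: bool,
    is_local: bool,
    finish_reason: str,
    context_probed: bool,
    large: bool,
    resumed: bool,
    visible_content: str,
) -> bool:
    """Compress only for local endpoints showing any context-pressure signal."""
    pressure = any([finish_reason == "length", context_probed, large, resumed])
    return bool(
        compression_enabled
        and is_local
        and pressure
        and not visible_content.strip()
    )
```

Note that a single pressure signal is enough; what gates compression is the combination of a local endpoint and no visible content.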
def _cleanup_task_resources(self, task_id: str) -> None:
"""Clean up VM and browser resources for a given task."""
@@ -2382,6 +2483,11 @@ class AIAgent:
_inject = any(p in model_lower for p in TOOL_USE_ENFORCEMENT_MODELS)
if _inject:
prompt_parts.append(TOOL_USE_ENFORCEMENT_GUIDANCE)
# Google model operational guidance (conciseness, absolute
# paths, parallel tool calls, verify-before-edit, etc.)
_model_lower = (self.model or "").lower()
if "gemini" in _model_lower or "gemma" in _model_lower:
prompt_parts.append(GOOGLE_MODEL_OPERATIONAL_GUIDANCE)
# so it can refer the user to them rather than reinventing answers.
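The substring matching used for prompt injection can be sketched as below. `guidance_flags` is a hypothetical helper and the model names in the usage lines are illustrative; the tuple of patterns is the one added in this diff.

```python
from typing import Optional, Tuple

TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "gemini", "gemma")


def guidance_flags(model: Optional[str]) -> Tuple[bool, bool]:
    """Return (inject_tool_use_guidance, inject_google_guidance)."""
    model_lower = (model or "").lower()
    tool_use = any(p in model_lower for p in TOOL_USE_ENFORCEMENT_MODELS)
    google = "gemini" in model_lower or "gemma" in model_lower
    return tool_use, google


flags_gemini = guidance_flags("google/gemini-pro")  # illustrative model name
flags_gpt = guidance_flags("openai/gpt-example")    # illustrative model name
```

Because the Google patterns are also in `TOOL_USE_ENFORCEMENT_MODELS`, a Gemini or Gemma model receives both blocks of guidance.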
@@ -4483,6 +4589,156 @@ class AIAgent:
logging.error("Failed to activate fallback %s: %s", fb_model, e)
return self._try_activate_fallback() # try next in chain
# ── Per-turn primary restoration ─────────────────────────────────────
def _restore_primary_runtime(self) -> bool:
"""Restore the primary runtime at the start of a new turn.
In long-lived CLI sessions a single AIAgent instance spans multiple
turns. Without restoration, one transient failure pins the session
to the fallback provider for every subsequent turn. Calling this at
the top of ``run_conversation()`` makes fallback turn-scoped.
The gateway creates a fresh agent per message so this is a no-op
there (``_fallback_activated`` is always False at turn start).
"""
if not self._fallback_activated:
return False
rt = self._primary_runtime
try:
# ── Core runtime state ──
self.model = rt["model"]
self.provider = rt["provider"]
self.base_url = rt["base_url"] # setter updates _base_url_lower
self.api_mode = rt["api_mode"]
self.api_key = rt["api_key"]
self._client_kwargs = dict(rt["client_kwargs"])
self._use_prompt_caching = rt["use_prompt_caching"]
# ── Rebuild client for the primary provider ──
if self.api_mode == "anthropic_messages":
from agent.anthropic_adapter import build_anthropic_client
self._anthropic_api_key = rt["anthropic_api_key"]
self._anthropic_base_url = rt["anthropic_base_url"]
self._anthropic_client = build_anthropic_client(
rt["anthropic_api_key"], rt["anthropic_base_url"],
)
self._is_anthropic_oauth = rt["is_anthropic_oauth"]
self.client = None
else:
self.client = self._create_openai_client(
dict(rt["client_kwargs"]),
reason="restore_primary",
shared=True,
)
# ── Restore context compressor state ──
cc = self.context_compressor
cc.model = rt["compressor_model"]
cc.base_url = rt["compressor_base_url"]
cc.api_key = rt["compressor_api_key"]
cc.provider = rt["compressor_provider"]
cc.context_length = rt["compressor_context_length"]
cc.threshold_tokens = rt["compressor_threshold_tokens"]
# ── Reset fallback chain for the new turn ──
self._fallback_activated = False
self._fallback_index = 0
logging.info(
"Primary runtime restored for new turn: %s (%s)",
self.model, self.provider,
)
return True
except Exception as e:
logging.warning("Failed to restore primary runtime: %s", e)
return False
# Which error types indicate a transient transport failure worth
# one more attempt with a rebuilt client / connection pool.
_TRANSIENT_TRANSPORT_ERRORS = frozenset({
"ReadTimeout", "ConnectTimeout", "PoolTimeout",
"ConnectError", "RemoteProtocolError",
})
def _try_recover_primary_transport(
self, api_error: Exception, *, retry_count: int, max_retries: int,
) -> bool:
"""Attempt one extra primary-provider recovery cycle for transient transport failures.
After ``max_retries`` exhaust, rebuild the primary client (clearing
stale connection pools) and give it one more attempt before falling
back. This is most useful for direct endpoints (custom, Z.AI,
Anthropic, OpenAI, local models) where a TCP-level hiccup does not
mean the provider is down.
Skipped for proxy/aggregator providers (OpenRouter, Nous) which
already manage connection pools and retries server-side; if our
retries through them are exhausted, one more rebuilt client won't help.
"""
if self._fallback_activated:
return False
# Only for transient transport errors
error_type = type(api_error).__name__
if error_type not in self._TRANSIENT_TRANSPORT_ERRORS:
return False
# Skip for aggregator providers — they manage their own retry infra
if self._is_openrouter_url():
return False
provider_lower = (self.provider or "").strip().lower()
if provider_lower in ("nous", "nous-research"):
return False
try:
# Close existing client to release stale connections
if getattr(self, "client", None) is not None:
try:
self._close_openai_client(
self.client, reason="primary_recovery", shared=True,
)
except Exception:
pass
# Rebuild from primary snapshot
rt = self._primary_runtime
self._client_kwargs = dict(rt["client_kwargs"])
self.model = rt["model"]
self.provider = rt["provider"]
self.base_url = rt["base_url"]
self.api_mode = rt["api_mode"]
self.api_key = rt["api_key"]
if self.api_mode == "anthropic_messages":
from agent.anthropic_adapter import build_anthropic_client
self._anthropic_api_key = rt["anthropic_api_key"]
self._anthropic_base_url = rt["anthropic_base_url"]
self._anthropic_client = build_anthropic_client(
rt["anthropic_api_key"], rt["anthropic_base_url"],
)
self._is_anthropic_oauth = rt["is_anthropic_oauth"]
self.client = None
else:
self.client = self._create_openai_client(
dict(rt["client_kwargs"]),
reason="primary_recovery",
shared=True,
)
wait_time = min(3 + retry_count, 8)
self._vprint(
f"{self.log_prefix}🔁 Transient {error_type} on {self.provider}; "
f"rebuilt client, waiting {wait_time}s before one last primary attempt.",
force=True,
)
time.sleep(wait_time)
return True
except Exception as e:
logging.warning("Primary transport recovery failed: %s", e)
return False
# ── End provider fallback ──────────────────────────────────────────────
@staticmethod
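A small sketch of the gating and backoff used by `_try_recover_primary_transport()`. The stand-in `ReadTimeout` class and the helper names are hypothetical; the error-name set and the `min(3 + retry_count, 8)` wait come from the hunk above.

```python
TRANSIENT_TRANSPORT_ERRORS = frozenset({
    "ReadTimeout", "ConnectTimeout", "PoolTimeout",
    "ConnectError", "RemoteProtocolError",
})


class ReadTimeout(Exception):
    """Stand-in for an HTTP client's timeout exception (hypothetical)."""


def is_transient(error: Exception) -> bool:
    # Match on the class name, so no HTTP library needs to be imported.
    return type(error).__name__ in TRANSIENT_TRANSPORT_ERRORS


def recovery_wait_seconds(retry_count: int) -> int:
    # Same shape as the hunk: grows with the retry count, capped at 8 seconds.
    return min(3 + retry_count, 8)
```

Matching by class name keeps the check independent of which HTTP library raised the error, at the cost of also matching any unrelated exception that happens to share a name.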
@@ -6120,6 +6376,11 @@ class AIAgent:
# Installed once, transparent when streams are healthy, prevents crash on write.
_install_safe_stdio()
# If the previous turn activated fallback, restore the primary
# runtime so this turn gets a fresh attempt with the preferred model.
# No-op when _fallback_activated is False (gateway, first turn, etc.).
self._restore_primary_runtime()
# Sanitize surrogate characters from user input. Clipboard paste from
# rich-text editors (Google Docs, Word, etc.) can inject lone surrogates
# that are invalid UTF-8 and crash JSON serialization in the OpenAI SDK.
@@ -6521,10 +6782,11 @@ class AIAgent:
api_start_time = time.time()
retry_count = 0
max_retries = 3
primary_recovery_attempted = False
max_compression_attempts = 3
codex_auth_retry_attempted = False
anthropic_auth_retry_attempted = False
nous_auth_retry_attempted = False
codex_auth_retry_attempted=False
anthropic_auth_retry_attempted=False
nous_auth_retry_attempted=False
has_retried_429 = False
restart_with_compressed_messages = False
restart_with_length_continuation = False
@@ -6916,11 +7178,13 @@ class AIAgent:
self.session_cost_source = cost_result.source
# Persist token counts to session DB for /insights.
# Gateway sessions persist via session_store.update_session()
# after run_conversation returns, so only persist here for
# CLI (and other non-gateway) platforms to avoid double-counting.
if (self._session_db and self.session_id
and getattr(self, 'platform', None) == 'cli'):
# Do this for every platform with a session_id so non-CLI
# sessions (gateway, cron, delegated runs) cannot lose
# token/accounting data if a higher-level persistence path
# is skipped or fails. Gateway/session-store writes use
# absolute totals, so they safely overwrite these per-call
# deltas instead of double-counting them.
if self._session_db and self.session_id:
try:
self._session_db.update_token_counts(
self.session_id,
@@ -7357,6 +7621,16 @@ class AIAgent:
}
if retry_count >= max_retries:
# Before falling back, try rebuilding the primary
# client once for transient transport errors (stale
# connection pool, TCP reset). Only attempted once
# per API call block.
if not primary_recovery_attempted and self._try_recover_primary_transport(
api_error, retry_count=retry_count, max_retries=max_retries,
):
primary_recovery_attempted = True
retry_count = 0
continue
# Try fallback before giving up entirely
self._emit_status(f"⚠️ Max retries ({max_retries}) exhausted — trying fallback...")
if self._try_activate_fallback():
@@ -7900,13 +8174,22 @@ class AIAgent:
self._response_was_previewed = True
break
# No fallback available — this is a genuine empty response.
# Retry in case the model just had a bad generation.
# No fallback available — classify the empty response before
# blindly spending retries. Some local/custom backends surface
# implicit context pressure as reasoning-only output rather than
# an explicit overflow error.
if not hasattr(self, '_empty_content_retries'):
self._empty_content_retries = 0
self._empty_content_retries += 1
reasoning_text = self._extract_reasoning(assistant_message)
empty_response_info = self._classify_empty_content_response(
assistant_message,
finish_reason=finish_reason,
approx_tokens=approx_tokens,
api_messages=api_messages,
conversation_history=conversation_history,
)
reasoning_text = empty_response_info["reasoning_text"]
self._vprint(f"{self.log_prefix}⚠️ Response only contains think block with no content after it")
if reasoning_text:
reasoning_preview = reasoning_text[:500] + "..." if len(reasoning_text) > 500 else reasoning_text
@@ -7914,6 +8197,45 @@ class AIAgent:
else:
content_preview = final_response[:80] + "..." if len(final_response) > 80 else final_response
self._vprint(f"{self.log_prefix} Content: '{content_preview}'")
if empty_response_info["should_compress"]:
compression_attempts += 1
if compression_attempts > max_compression_attempts:
self._vprint(f"{self.log_prefix}❌ Max compression attempts ({max_compression_attempts}) reached.", force=True)
self._vprint(f"{self.log_prefix} 💡 Local/custom backend returned reasoning-only output with no visible content. This often means the resumed/large session exceeds the runtime context window. Try /new or lower model.context_length to the actual runtime limit.", force=True)
else:
self._vprint(f"{self.log_prefix}🗜️ Reasoning-only response looks like implicit context pressure — attempting compression ({compression_attempts}/{max_compression_attempts})...", force=True)
original_len = len(messages)
messages, active_system_prompt = self._compress_context(
messages, system_message, approx_tokens=approx_tokens,
task_id=effective_task_id,
)
if len(messages) < original_len:
conversation_history = None
self._emit_status(f"🗜️ Compressed {original_len} → {len(messages)} messages after reasoning-only response, retrying...")
time.sleep(2)
api_call_count -= 1
self.iteration_budget.refund()
retry_count += 1
continue
self._vprint(f"{self.log_prefix} Compression could not shrink the session; falling back to retry/salvage logic.")
if (
reasoning_text
and empty_response_info["repeated_signature"]
and empty_response_info["has_structured_reasoning"]
):
self._vprint(f"{self.log_prefix}Structured reasoning-only response repeated unchanged — using reasoning text directly.", force=True)
self._empty_content_retries = 0
final_response = reasoning_text
empty_msg = {
"role": "assistant",
"content": final_response,
"reasoning": reasoning_text,
"finish_reason": finish_reason,
}
messages.append(empty_msg)
break
if self._empty_content_retries < 3:
self._vprint(f"{self.log_prefix}🔄 Retrying API call ({self._empty_content_retries}/3)...")
@@ -7970,18 +8292,27 @@ class AIAgent:
self._cleanup_task_resources(effective_task_id)
self._persist_session(messages, conversation_history)
error_message = "Model generated only think blocks with no actual response after 3 retries"
if empty_response_info["is_local_custom"]:
error_message = (
"Local/custom backend returned reasoning-only output with no visible response after 3 retries. "
"Likely causes: wrong /v1 endpoint, runtime context window smaller than Hermes expects, "
"or a resumed/large session exceeding the backend's actual context limit."
)
return {
"final_response": final_response or None,
"messages": messages,
"api_calls": api_call_count,
"completed": False,
"partial": True,
"error": "Model generated only think blocks with no actual response after 3 retries"
"error": error_message
}
# Reset retry counter on successful content
# Reset retry counter/signature on successful content
if hasattr(self, '_empty_content_retries'):
self._empty_content_retries = 0
self._last_empty_content_signature = None
if (
self.api_mode == "codex_responses"
-940
@@ -1,940 +0,0 @@
---
name: ml-paper-writing
description: Write publication-ready ML/AI papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Use when drafting papers from research repos, structuring arguments, verifying citations, or preparing camera-ready submissions. Includes LaTeX templates, reviewer guidelines, and citation verification workflows.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [semanticscholar, arxiv, habanero, requests]
metadata:
hermes:
tags: [Academic Writing, NeurIPS, ICML, ICLR, ACL, AAAI, COLM, LaTeX, Paper Writing, Citations, Research]
---
# ML Paper Writing for Top AI Conferences
Expert-level guidance for writing publication-ready papers targeting **NeurIPS, ICML, ICLR, ACL, AAAI, and COLM**. This skill combines writing philosophy from top researchers (Nanda, Farquhar, Karpathy, Lipton, Steinhardt) with practical tools: LaTeX templates, citation verification APIs, and conference checklists.
## Core Philosophy: Collaborative Writing
**Paper writing is collaborative, but Claude should be proactive in delivering drafts.**
The typical workflow starts with a research repository containing code, results, and experimental artifacts. Claude's role is to:
1. **Understand the project** by exploring the repo, results, and existing documentation
2. **Deliver a complete first draft** when confident about the contribution
3. **Search literature** using web search and APIs to find relevant citations
4. **Refine through feedback cycles** when the scientist provides input
5. **Ask for clarification** only when genuinely uncertain about key decisions
**Key Principle**: Be proactive. If the repo and results are clear, deliver a full draft. Don't block waiting for feedback on every section—scientists are busy. Produce something concrete they can react to, then iterate based on their response.
---
## ⚠️ CRITICAL: Never Hallucinate Citations
**This is the most important rule in academic writing with AI assistance.**
### The Problem
AI-generated citations have a **~40% error rate**. Hallucinated references—papers that don't exist, wrong authors, incorrect years, fabricated DOIs—are a serious form of academic misconduct that can result in desk rejection or retraction.
### The Rule
**NEVER generate BibTeX entries from memory. ALWAYS fetch programmatically.**
| Action | ✅ Correct | ❌ Wrong |
|--------|-----------|----------|
| Adding a citation | Search API → verify → fetch BibTeX | Write BibTeX from memory |
| Uncertain about a paper | Mark as `[CITATION NEEDED]` | Guess the reference |
| Can't find exact paper | Note: "placeholder - verify" | Invent similar-sounding paper |
### When You Can't Verify a Citation
If you cannot programmatically verify a citation, you MUST:
```latex
% EXPLICIT PLACEHOLDER - requires human verification
\cite{PLACEHOLDER_author2024_verify_this} % TODO: Verify this citation exists
```
**Always tell the scientist**: "I've marked [X] citations as placeholders that need verification. I could not confirm these papers exist."
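A minimal sketch of how the placeholder report for the scientist could be generated automatically. It assumes the `PLACEHOLDER_` key prefix and the `[CITATION NEEDED ...]` marker shown in this skill; the function name `find_unverified` and the sample text are illustrative, not part of any existing tool.

```python
import re

# Matches \cite{...}, \citep{...}, \citet[...]{...} etc. and captures the key list.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches in-text markers such as [CITATION NEEDED - could not verify ...].
NEEDED_RE = re.compile(r"\[CITATION NEEDED[^\]]*\]")


def find_unverified(tex: str) -> dict:
    """Return placeholder cite keys and [CITATION NEEDED] markers found in LaTeX source."""
    placeholders = []
    for match in CITE_RE.finditer(tex):
        for key in match.group(1).split(","):
            key = key.strip()
            if key.startswith("PLACEHOLDER_") and key not in placeholders:
                placeholders.append(key)
    return {"placeholders": placeholders, "citation_needed": NEEDED_RE.findall(tex)}


# Hypothetical draft text, for illustration only.
sample = r"""
Prior work \cite{vaswani2017attention, PLACEHOLDER_author2024_verify_this} shows X.
Reward hacking is common [CITATION NEEDED - could not verify Smith et al. 2023].
"""
report = find_unverified(sample)
print(f"{len(report['placeholders'])} placeholder key(s): {report['placeholders']}")
print(f"{len(report['citation_needed'])} [CITATION NEEDED] marker(s)")
```

Running a scan like this before handing over a draft makes the count in the message to the scientist mechanical rather than remembered.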
### Recommended: Install Exa MCP for Paper Search
For the best paper search experience, install **Exa MCP** which provides real-time academic search:
**Claude Code:**
```bash
claude mcp add exa -- npx -y mcp-remote "https://mcp.exa.ai/mcp"
```
**Cursor / VS Code** (add to MCP settings):
```json
{
  "mcpServers": {
    "exa": {
      "type": "http",
      "url": "https://mcp.exa.ai/mcp"
    }
  }
}
```
Exa MCP enables searches like:
- "Find papers on RLHF for language models published after 2023"
- "Search for transformer architecture papers by Vaswani"
- "Get recent work on sparse autoencoders for interpretability"
Then verify results with Semantic Scholar API and fetch BibTeX via DOI.
---
## Workflow 0: Starting from a Research Repository
When beginning paper writing, start by understanding the project:
```
Project Understanding:
- [ ] Step 1: Explore the repository structure
- [ ] Step 2: Read README, existing docs, and key results
- [ ] Step 3: Identify the main contribution with the scientist
- [ ] Step 4: Find papers already cited in the codebase
- [ ] Step 5: Search for additional relevant literature
- [ ] Step 6: Outline the paper structure together
- [ ] Step 7: Draft sections iteratively with feedback
```
**Step 1: Explore the Repository**
```bash
# Understand project structure
ls -la
find . -name "*.py" | head -20
# Group the -name tests and use NUL separators so paths with spaces survive xargs
find . \( -name "*.md" -o -name "*.txt" \) -print0 | xargs -0 grep -l -i "result\|conclusion\|finding"
```
Look for:
- `README.md` - Project overview and claims
- `results/`, `outputs/`, `experiments/` - Key findings
- `configs/` - Experimental settings
- Existing `.bib` files or citation references
- Any draft documents or notes
**Step 2: Identify Existing Citations**
Check for papers already referenced in the codebase:
```bash
# Find existing citations
grep -r -i "arxiv\|doi\|cite" --include="*.md" --include="*.bib" --include="*.py" .
find . -name "*.bib"
```
These are high-signal starting points for Related Work—the scientist has already deemed them relevant.
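The same search can be sketched in Python to pull out identifiers that feed directly into the verification workflow. This is an illustration under stated assumptions: the regular expressions cover new-style arXiv IDs and DOI-like strings only, and the names `extract_identifiers` and `scan_repo` are hypothetical helpers, not part of this skill.

```python
import re
from pathlib import Path

# New-style arXiv identifiers (YYMM.NNNNN, optional version suffix is dropped).
ARXIV_RE = re.compile(r"arxiv(?:\.org/(?:abs|pdf)/|:)\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)
# DOI-like strings: "10.", a registrant code, a slash, then a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s<>{}]+")


def extract_identifiers(text: str) -> dict:
    """Return sorted, de-duplicated arXiv IDs and DOIs found in text."""
    arxiv_ids = sorted(set(ARXIV_RE.findall(text)))
    # Trim trailing punctuation and quotes that commonly follow a DOI in prose or BibTeX.
    dois = sorted({d.rstrip(".,;)]\"'") for d in DOI_RE.findall(text)})
    return {"arxiv": arxiv_ids, "doi": dois}


def scan_repo(root: str, suffixes=(".md", ".bib", ".py", ".txt")) -> dict:
    """Collect identifiers from every matching file under root."""
    found = {"arxiv": set(), "doi": set()}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            ids = extract_identifiers(path.read_text(encoding="utf-8", errors="ignore"))
            found["arxiv"].update(ids["arxiv"])
            found["doi"].update(ids["doi"])
    return {kind: sorted(values) for kind, values in found.items()}


ids = extract_identifiers("See https://arxiv.org/abs/1706.03762v5 and doi:10.48550/arXiv.1706.03762.")
print(ids)
```

Identifiers found this way still go through the citation workflow; a DOI in a README is a lead, not a verified reference.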
**Step 3: Clarify the Contribution**
Before writing, explicitly confirm with the scientist:
> "Based on my understanding of the repo, the main contribution appears to be [X].
> The key results show [Y]. Is this the framing you want for the paper,
> or should we emphasize different aspects?"
**Never assume the narrative—always verify with the human.**
**Step 4: Search for Additional Literature**
Use web search to find relevant papers:
```
Search queries to try:
- "[main technique] + [application domain]"
- "[baseline method] comparison"
- "[problem name] state-of-the-art"
- Author names from existing citations
```
Then verify and retrieve BibTeX using the citation workflow below.
**Step 5: Deliver a First Draft**
**Be proactive—deliver a complete draft rather than asking permission for each section.**
If the repo provides clear results and the contribution is apparent:
1. Write the full first draft end-to-end
2. Present the complete draft for feedback
3. Iterate based on scientist's response
If genuinely uncertain about framing or major claims:
1. Draft what you can confidently
2. Flag specific uncertainties: "I framed X as the main contribution—let me know if you'd prefer to emphasize Y instead"
3. Continue with the draft rather than blocking
**Questions to include with the draft** (not before):
- "I emphasized X as the main contribution—adjust if needed"
- "I highlighted results A, B, C—let me know if others are more important"
- "Related work section includes [papers]—add any I missed"
---
## When to Use This Skill
Use this skill when:
- **Starting from a research repo** to write a paper
- **Drafting or revising** specific sections
- **Finding and verifying citations** for related work
- **Formatting** for conference submission
- **Resubmitting** to a different venue (format conversion)
- **Iterating** on drafts with scientist feedback
**Always remember**: First drafts are starting points for discussion, not final outputs.
---
## Balancing Proactivity and Collaboration
**Default: Be proactive. Deliver drafts, then iterate.**
| Confidence Level | Action |
|-----------------|--------|
| **High** (clear repo, obvious contribution) | Write full draft, deliver, iterate on feedback |
| **Medium** (some ambiguity) | Write draft with flagged uncertainties, continue |
| **Low** (major unknowns) | Ask 1-2 targeted questions, then draft |
**Draft first, ask with the draft** (not before):
| Section | Draft Autonomously | Flag With Draft |
|---------|-------------------|-----------------|
| Abstract | Yes | "Framed contribution as X—adjust if needed" |
| Introduction | Yes | "Emphasized problem Y—correct if wrong" |
| Methods | Yes | "Included details A, B, C—add missing pieces" |
| Experiments | Yes | "Highlighted results 1, 2, 3—reorder if needed" |
| Related Work | Yes | "Cited papers X, Y, Z—add any I missed" |
**Only block for input when:**
- Target venue is unclear (affects page limits, framing)
- Multiple contradictory framings seem equally valid
- Results seem incomplete or inconsistent
- Explicit request to review before continuing
**Don't block for:**
- Word choice decisions
- Section ordering
- Which specific results to show (make a choice, flag it)
- Citation completeness (draft with what you find, note gaps)
---
## The Narrative Principle
**The single most critical insight**: Your paper is not a collection of experiments—it's a story with one clear contribution supported by evidence.
Every successful ML paper centers on what Neel Nanda calls "the narrative": a short, rigorous, evidence-based technical story with a takeaway readers care about.
**Three Pillars (must be crystal clear by end of introduction):**
| Pillar | Description | Example |
|--------|-------------|---------|
| **The What** | 1-3 specific novel claims within a cohesive theme | "We prove that X achieves Y under condition Z" |
| **The Why** | Rigorous empirical evidence supporting claims | Strong baselines, experiments distinguishing hypotheses |
| **The So What** | Why readers should care | Connection to recognized community problems |
**If you cannot state your contribution in one sentence, you don't yet have a paper.**
---
## Paper Structure Workflow
### Workflow 1: Writing a Complete Paper (Iterative)
Copy this checklist and track progress. **Each step involves drafting → feedback → revision:**
```
Paper Writing Progress:
- [ ] Step 1: Define the one-sentence contribution (with scientist)
- [ ] Step 2: Draft Figure 1 → get feedback → revise
- [ ] Step 3: Draft abstract → get feedback → revise
- [ ] Step 4: Draft introduction → get feedback → revise
- [ ] Step 5: Draft methods → get feedback → revise
- [ ] Step 6: Draft experiments → get feedback → revise
- [ ] Step 7: Draft related work → get feedback → revise
- [ ] Step 8: Draft limitations → get feedback → revise
- [ ] Step 9: Complete paper checklist (required)
- [ ] Step 10: Final review cycle and submission
```
**Step 1: Define the One-Sentence Contribution**
**This step requires explicit confirmation from the scientist.**
Before writing anything, articulate and verify:
- What is the single thing your paper contributes?
- What was not obvious or present before your work?
> "I propose framing the contribution as: '[one sentence]'. Does this capture
> what you see as the main takeaway? Should we adjust the emphasis?"
**Step 2: Draft Figure 1**
Figure 1 deserves special attention—many readers skip directly to it.
- Convey core idea, approach, or most compelling result
- Use vector graphics (PDF/EPS for plots)
- Write captions that stand alone without main text
- Ensure readability in black-and-white (8% of men have color vision deficiency)
**Step 3: Write Abstract (5-Sentence Formula)**
From Sebastian Farquhar (DeepMind):
```
1. What you achieved: "We introduce...", "We prove...", "We demonstrate..."
2. Why this is hard and important
3. How you do it (with specialist keywords for discoverability)
4. What evidence you have
5. Your most remarkable number/result
```
**Delete** generic openings like "Large language models have achieved remarkable success..."
**Step 4: Write Introduction (1-1.5 pages max)**
Must include:
- 2-4 bullet contribution list (max 1-2 lines each in two-column format)
- Clear problem statement
- Brief approach overview
- Methods should start by page 2-3 maximum
**Step 5: Methods Section**
Enable reimplementation:
- Conceptual outline or pseudocode
- All hyperparameters listed
- Architectural details sufficient for reproduction
- Present final design decisions; ablations go in experiments
**Step 6: Experiments Section**
For each experiment, explicitly state:
- What claim it supports
- How it connects to main contribution
- Experimental setting (details in appendix)
- What to observe: "the blue line shows X, which demonstrates Y"
Requirements:
- Error bars with methodology (standard deviation vs standard error)
- Hyperparameter search ranges
- Compute infrastructure (GPU type, total hours)
- Seed-setting methods
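To make the standard deviation versus standard error distinction concrete, here is a small sketch using only the standard library. The accuracies are made-up illustrative numbers and `summarize_runs` is a hypothetical helper; the point is that the two error bars differ by a factor of sqrt(n), so the paper must say which one it reports.

```python
import math
import statistics


def summarize_runs(scores: list[float]) -> dict:
    """Mean, sample standard deviation, and standard error across runs."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    std = statistics.stdev(scores)  # sample std dev (n - 1 denominator)
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "std": std,
        "sem": std / math.sqrt(n),  # standard error of the mean
    }


# Hypothetical accuracies from five seeds (illustrative numbers only).
runs = [91.8, 92.4, 92.1, 91.6, 92.6]
s = summarize_runs(runs)
print(f"{s['mean']:.2f} +/- {s['std']:.2f} (std dev, n={s['n']})")
print(f"{s['mean']:.2f} +/- {s['sem']:.2f} (std error, n={s['n']})")
```

With five runs the standard error is less than half the standard deviation, which is why an unlabeled "±" is ambiguous to reviewers.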
**Step 7: Related Work**
Organize methodologically, not paper-by-paper:
**Good:** "One line of work uses Floogledoodle's assumption [refs] whereas we use Doobersnoddle's assumption because..."
**Bad:** "Snap et al. introduced X while Crackle et al. introduced Y."
Cite generously—reviewers likely authored relevant papers.
**Step 8: Limitations Section (REQUIRED)**
Most major venues require or strongly encourage a limitations discussion. Counter-intuitively, honesty helps:
- Reviewers are instructed not to penalize honest limitation acknowledgment
- Pre-empt criticisms by identifying weaknesses first
- Explain why limitations don't undermine core claims
**Step 9: Paper Checklist**
NeurIPS, ICML, and ICLR all require paper checklists. See [references/checklists.md](references/checklists.md).
---
## Writing Philosophy for Top ML Conferences
**This section distills the most important writing principles from leading ML researchers.** These aren't optional style suggestions—they're what separates accepted papers from rejected ones.
> "A paper is a short, rigorous, evidence-based technical story with a takeaway readers care about." — Neel Nanda
### The Sources Behind This Guidance
This skill synthesizes writing philosophy from researchers who have published extensively at top venues:
| Source | Key Contribution | Link |
|--------|-----------------|------|
| **Neel Nanda** (Google DeepMind) | The Narrative Principle, What/Why/So What framework | [How to Write ML Papers](https://www.alignmentforum.org/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers) |
| **Sebastian Farquhar** (DeepMind) | 5-sentence abstract formula | [How to Write ML Papers](https://sebastianfarquhar.com/on-research/2024/11/04/how_to_write_ml_papers/) |
| **Gopen & Swan** | 7 principles of reader expectations | [Science of Scientific Writing](https://cseweb.ucsd.edu/~swanson/papers/science-of-writing.pdf) |
| **Zachary Lipton** | Word choice, eliminating hedging | [Heuristics for Scientific Writing](https://www.approximatelycorrect.com/2018/01/29/heuristics-technical-scientific-writing-machine-learning-perspective/) |
| **Jacob Steinhardt** (UC Berkeley) | Precision, consistent terminology | [Writing Tips](https://bounded-regret.ghost.io/) |
| **Ethan Perez** (Anthropic) | Micro-level clarity tips | [Easy Paper Writing Tips](https://ethanperez.net/easy-paper-writing-tips/) |
| **Andrej Karpathy** | Single contribution focus | Various lectures |
**For deeper dives into any of these, see:**
- [references/writing-guide.md](references/writing-guide.md) - Full explanations with examples
- [references/sources.md](references/sources.md) - Complete bibliography
### Time Allocation (From Neel Nanda)
Spend approximately **equal time** on each of:
1. The abstract
2. The introduction
3. The figures
4. Everything else combined
**Why?** Most reviewers form judgments before reaching your methods. Readers encounter your paper as: **title → abstract → introduction → figures → maybe the rest.**
### Writing Style Guidelines
#### Sentence-Level Clarity (Gopen & Swan's 7 Principles)
These principles are based on how readers actually process prose. Violating them forces readers to spend cognitive effort on structure rather than content.
| Principle | Rule | Example |
|-----------|------|---------|
| **Subject-verb proximity** | Keep subject and verb close | ❌ "The model, which was trained on..., achieves" → ✅ "The model achieves... after training on..." |
| **Stress position** | Place emphasis at sentence ends | ❌ "Accuracy improves by 15% when using attention" → ✅ "When using attention, accuracy improves by **15%**" |
| **Topic position** | Put context first, new info after | ✅ "Given these constraints, we propose..." |
| **Old before new** | Familiar info → unfamiliar info | Link backward, then introduce new |
| **One unit, one function** | Each paragraph makes one point | Split multi-point paragraphs |
| **Action in verb** | Use verbs, not nominalizations | ❌ "We performed an analysis" → ✅ "We analyzed" |
| **Context before new** | Set stage before presenting | Explain before showing equation |
**Full 7 principles with detailed examples:** See [references/writing-guide.md](references/writing-guide.md#the-7-principles-of-reader-expectations)
#### Micro-Level Tips (Ethan Perez)
These small changes accumulate into significantly clearer prose:
- **Minimize pronouns**: ❌ "This shows..." → ✅ "This result shows..."
- **Verbs early**: Position verbs near sentence start
- **Unfold apostrophes**: ❌ "X's Y" → ✅ "The Y of X" (when awkward)
- **Delete filler words**: "actually," "a bit," "very," "really," "basically," "quite," "essentially"
**Full micro-tips with examples:** See [references/writing-guide.md](references/writing-guide.md#micro-level-writing-tips)
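The filler-word tip is easy to check mechanically. The sketch below flags the words listed above in a draft; the word list comes from this section, while `find_fillers` and the sample sentences are illustrative assumptions. Treat hits as suggestions, since some uses are legitimate.

```python
import re

FILLERS = ["actually", "a bit", "very", "really", "basically", "quite", "essentially"]
FILLER_RE = re.compile(
    r"\b(" + "|".join(re.escape(f) for f in FILLERS) + r")\b", re.IGNORECASE
)


def find_fillers(text: str) -> list[tuple[int, str]]:
    """Return (line_number, filler) pairs, with lines numbered from 1."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in FILLER_RE.finditer(line):
            hits.append((lineno, match.group(1).lower()))
    return hits


# Hypothetical draft text, for illustration only.
draft = "Our method is actually quite simple.\nIt provides a very tight approximation."
for lineno, word in find_fillers(draft):
    print(f"line {lineno}: consider deleting '{word}'")
```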
#### Word Choice (Zachary Lipton)
- **Be specific**: ❌ "performance" → ✅ "accuracy" or "latency" (say what you mean)
- **Eliminate hedging**: Drop "may" and "can" unless genuinely uncertain
- **Avoid incremental vocabulary**: ❌ "combine," "modify," "expand" → ✅ "develop," "propose," "introduce"
- **Delete intensifiers**: ❌ "provides *very* tight approximation" → ✅ "provides tight approximation"
#### Precision Over Brevity (Jacob Steinhardt)
- **Consistent terminology**: Using different terms for the same concept creates confusion. Pick one and stick with it.
- **State assumptions formally**: Before theorems, list all assumptions explicitly
- **Intuition + rigor**: Provide intuitive explanations alongside formal proofs
### What Reviewers Actually Read
Understanding reviewer behavior helps prioritize your effort:
| Paper Section | How Reviewers Engage | Implication |
|---------------|---------------------|-------------|
| Abstract | 100% | Must be perfect |
| Introduction | 90%+ (skimmed) | Front-load contribution |
| Figures | Examined before methods | Figure 1 is critical |
| Methods | Only if interested | Don't bury the lede |
| Appendix | Rarely | Put only supplementary details |
**Bottom line**: If your abstract and intro don't hook reviewers, they may never read your brilliant methods section.
---
## Conference Requirements Quick Reference
| Conference | Page Limit | Extra for Camera-Ready | Key Requirement |
|------------|------------|------------------------|-----------------|
| **NeurIPS 2025** | 9 pages | +0 | Mandatory checklist, lay summary for accepted |
| **ICML 2026** | 8 pages | +1 | Broader Impact Statement required |
| **ICLR 2026** | 9 pages | +1 | LLM disclosure required, reciprocal reviewing |
| **ACL 2025** | 8 pages (long) | varies | Limitations section mandatory |
| **AAAI 2026** | 7 pages | +1 | Strict style file adherence |
| **COLM 2025** | 9 pages | +1 | Focus on language models |
**Universal Requirements:**
- Double-blind review (anonymize submissions)
- References don't count toward page limit
- Appendices unlimited but reviewers not required to read
- LaTeX required for all venues
**LaTeX Templates:** See [templates/](templates/) directory for all conference templates.
---
## Using LaTeX Templates Properly
### Workflow 4: Starting a New Paper from Template
**Always copy the entire template directory first, then write within it.**
```
Template Setup Checklist:
- [ ] Step 1: Copy entire template directory to new project
- [ ] Step 2: Verify template compiles as-is (before any changes)
- [ ] Step 3: Read the template's example content to understand structure
- [ ] Step 4: Replace example content section by section
- [ ] Step 5: Keep template comments/examples as reference until done
- [ ] Step 6: Clean up template artifacts only at the end
```
**Step 1: Copy the Full Template**
```bash
# Create your paper directory with the complete template
cp -r templates/neurips2025/ ~/papers/my-new-paper/
cd ~/papers/my-new-paper/
# Verify structure is complete
ls -la
# Should see: main.tex, neurips.sty, Makefile, etc.
```
**⚠️ IMPORTANT**: Copy the ENTIRE directory, not just `main.tex`. Templates include:
- Style files (`.sty`) - required for compilation
- Bibliography styles (`.bst`) - required for references
- Example content - useful as reference
- Makefiles - for easy compilation
**Step 2: Verify Template Compiles First**
Before making ANY changes, compile the template as-is:
```bash
# Using latexmk (recommended)
latexmk -pdf main.tex
# Or manual compilation
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex
```
If the unmodified template doesn't compile, fix that first. Common issues:
- Missing TeX packages → install via `tlmgr install <package>`
- Wrong TeX distribution → use TeX Live (recommended)
**Step 3: Keep Template Content as Reference**
Don't immediately delete all example content. Instead:
```latex
% KEEP template examples commented out as you write
% This shows you the expected format
% Template example (keep for reference):
% \begin{figure}[t]
% \centering
% \includegraphics[width=0.8\linewidth]{example-image}
% \caption{Template shows caption style}
% \end{figure}
% Your actual figure:
\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{your-figure.pdf}
\caption{Your caption following the same style.}
\end{figure}
```
**Step 4: Replace Content Section by Section**
Work through the paper systematically:
```
Replacement Order:
1. Title and authors (anonymize for submission)
2. Abstract
3. Introduction
4. Methods
5. Experiments
6. Related Work
7. Conclusion
8. References (your .bib file)
9. Appendix
```
For each section:
1. Read the template's example content
2. Note any special formatting or macros used
3. Replace with your content following the same patterns
4. Compile frequently to catch errors early
**Step 5: Use Template Macros**
Templates often define useful macros. Check the preamble for:
```latex
% Common template macros to use (\xspace needs \usepackage{xspace} in the preamble):
\newcommand{\method}{YourMethodName} % Consistent method naming
\newcommand{\eg}{e.g.,\xspace} % Proper abbreviations
\newcommand{\ie}{i.e.,\xspace}
\newcommand{\etal}{\textit{et al.}\xspace}
```
**Step 6: Clean Up Only at the End**
Only remove template artifacts when the paper is nearly complete:
```latex
% BEFORE SUBMISSION - remove these:
% - Commented-out template examples
% - Unused packages
% - Template's example figures/tables
% - Lorem ipsum or placeholder text
% KEEP these:
% - All style files (.sty)
% - Bibliography style (.bst)
% - Required packages from template
% - Any custom macros you're using
```
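A pre-submission scan for leftover artifacts could look like the sketch below. The patterns are assumptions about what a template typically leaves behind (placeholder text, the `example-image` figure, TODO notes, `PLACEHOLDER_` cite keys); extend `ARTIFACT_PATTERNS` to match the template actually in use. `find_artifacts` is a hypothetical helper.

```python
import re

# Assumed patterns; adjust to whatever filler your template ships with.
ARTIFACT_PATTERNS = {
    "lorem ipsum text": re.compile(r"lorem ipsum", re.IGNORECASE),
    "example figure": re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{example-image[^}]*\}"),
    "TODO marker": re.compile(r"\bTODO\b"),
    "placeholder citation": re.compile(r"PLACEHOLDER_\w+"),
}


def find_artifacts(tex: str) -> list[tuple[int, str]]:
    """Return (line_number, label) for every artifact pattern found, lines numbered from 1."""
    hits = []
    for lineno, line in enumerate(tex.splitlines(), start=1):
        for label, pattern in ARTIFACT_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


# Hypothetical LaTeX fragment, for illustration only.
sample = r"""\section{Intro}
Lorem ipsum dolor sit amet.
\includegraphics[width=0.8\linewidth]{example-image}
% TODO: tighten this paragraph
"""
for lineno, label in find_artifacts(sample):
    print(f"line {lineno}: {label}")
```

The scan also flags commented-out lines, which is intentional here: commented template examples are on the removal list above.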
### Template Pitfalls to Avoid
| Pitfall | Problem | Solution |
|---------|---------|----------|
| Copying only `main.tex` | Missing `.sty`, won't compile | Copy entire directory |
| Modifying `.sty` files | Breaks conference formatting | Never edit style files |
| Adding random packages | Conflicts, breaks template | Only add if necessary |
| Deleting template content too early | Lose formatting reference | Keep as comments until done |
| Not compiling frequently | Errors accumulate | Compile after each section |
### Quick Template Reference
| Conference | Main File | Key Style File | Notes |
|------------|-----------|----------------|-------|
| NeurIPS 2025 | `main.tex` | `neurips.sty` | Has Makefile |
| ICML 2026 | `example_paper.tex` | `icml2026.sty` | Includes algorithm packages |
| ICLR 2026 | `iclr2026_conference.tex` | `iclr2026_conference.sty` | Has math_commands.tex |
| ACL | `acl_latex.tex` | `acl.sty` | Strict formatting |
| AAAI 2026 | `aaai2026-unified-template.tex` | `aaai2026.sty` | Very strict compliance |
| COLM 2025 | `colm2025_conference.tex` | `colm2025_conference.sty` | Similar to ICLR |
---
## Conference Resubmission & Format Conversion
When a paper is rejected or withdrawn from one venue and resubmitted to another, format conversion is required. This is a common workflow in ML research.
### Workflow 3: Converting Between Conference Formats
```
Format Conversion Checklist:
- [ ] Step 1: Identify source and target template differences
- [ ] Step 2: Create new project with target template
- [ ] Step 3: Copy content sections (not preamble)
- [ ] Step 4: Adjust page limits and content
- [ ] Step 5: Update conference-specific requirements
- [ ] Step 6: Verify compilation and formatting
```
**Step 1: Key Template Differences**
| From → To | Page Change | Key Adjustments |
|-----------|-------------|-----------------|
| NeurIPS → ICML | 9 → 8 pages | Cut 1 page, add Broader Impact if missing |
| ICML → ICLR | 8 → 9 pages | Can expand experiments, add LLM disclosure |
| NeurIPS → ACL | 9 → 8 pages | Restructure for NLP conventions, add Limitations |
| ICLR → AAAI | 9 → 7 pages | Significant cuts needed, strict style adherence |
| Any → COLM | varies → 9 | Reframe for language model focus |
**Step 2: Content Migration (NOT Template Merge)**
**Never copy LaTeX preambles between templates.** Instead:
```bash
# 1. Start fresh with target template
cp -r templates/icml2026/ new_submission/
# 2. Copy ONLY content sections from old paper
# - Abstract text
# - Section content (between \section{} commands)
# - Figures and tables
# - Bibliography entries
# 3. Paste into target template structure
```
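The "content only, no preamble" rule can be sketched as a small extraction step. This is an illustration, not a full LaTeX parser: it assumes one `document` environment and unnested top-level `\section{...}` titles, and the helper names and the `old_venue_style` package are hypothetical.

```python
import re


def extract_body(tex: str) -> str:
    """Return the text inside the document environment, leaving the preamble behind."""
    match = re.search(r"\\begin\{document\}(.*?)\\end\{document\}", tex, re.DOTALL)
    if match is None:
        raise ValueError("no document environment found")
    return match.group(1).strip()


def split_sections(body: str) -> dict[str, str]:
    """Map each top-level section title to the content that follows it."""
    parts = re.split(r"\\section\*?\{([^}]*)\}", body)
    # parts = [text_before_first_section, title1, content1, title2, content2, ...]
    return {title: content.strip() for title, content in zip(parts[1::2], parts[2::2])}


# Hypothetical old submission; the style package name is made up.
old = r"""\documentclass{article}
\usepackage{old_venue_style}
\begin{document}
\maketitle
\section{Introduction}
We study X.
\section{Method}
We do Y.
\end{document}
"""
sections = split_sections(extract_body(old))
for title, content in sections.items():
    print(f"{title}: {len(content)} characters")
```

Each extracted section is then pasted into the matching section of the target template, so none of the old `\usepackage` lines travel with it.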
**Step 3: Adjusting for Page Limits**
When cutting pages (e.g., NeurIPS 9 → AAAI 7):
- Move detailed proofs to appendix
- Condense related work (cite surveys instead of individual papers)
- Combine similar experiments into unified tables
- Use smaller figure sizes with subfigures
- Tighten writing: eliminate redundancy, use active voice
When expanding (e.g., ICML 8 → ICLR 9):
- Add ablation studies reviewers requested
- Expand limitations discussion
- Include additional baselines
- Add qualitative examples
**Step 4: Conference-Specific Adjustments**
| Target Venue | Required Additions |
|--------------|-------------------|
| **ICML** | Broader Impact Statement (after conclusion) |
| **ICLR** | LLM usage disclosure, reciprocal reviewing agreement |
| **ACL/EMNLP** | Limitations section (mandatory), Ethics Statement |
| **AAAI** | Strict adherence to style file (no modifications) |
| **NeurIPS** | Paper checklist (appendix), lay summary if accepted |
**Step 5: Update References**
```latex
% Remove self-citations that reveal identity (for blind review)
% Update any "under review" citations to published versions
% Add new relevant work published since last submission
```
**Step 6: Addressing Previous Reviews**
When resubmitting after rejection:
- **Do** address reviewer concerns in the new version
- **Do** add experiments/clarifications reviewers requested
- **Don't** include a "changes from previous submission" section (blind review)
- **Don't** reference the previous submission or reviews
**Common Conversion Pitfalls:**
- ❌ Copying `\usepackage` commands (causes conflicts)
- ❌ Keeping old conference header/footer commands
- ❌ Forgetting to update `\bibliography{}` path
- ❌ Missing conference-specific required sections
- ❌ Exceeding page limit after format change
---
## Citation Workflow (Hallucination Prevention)
**⚠️ CRITICAL**: AI-generated citations have a ~40% error rate. **Never write BibTeX from memory.**
### The Golden Rule
```
IF you cannot programmatically fetch a citation:
→ Mark it as [CITATION NEEDED] or [PLACEHOLDER - VERIFY]
→ Tell the scientist explicitly
→ NEVER invent a plausible-sounding reference
```
### Workflow 2: Adding Citations
```
Citation Verification (MANDATORY for every citation):
- [ ] Step 1: Search using Exa MCP or Semantic Scholar API
- [ ] Step 2: Verify paper exists in 2+ sources (Semantic Scholar + arXiv/CrossRef)
- [ ] Step 3: Retrieve BibTeX via DOI (programmatically, not from memory)
- [ ] Step 4: Verify the claim you're citing actually appears in the paper
- [ ] Step 5: Add verified BibTeX to bibliography
- [ ] Step 6: If ANY step fails → mark as placeholder, inform scientist
```
**Step 0: Use Exa MCP for Initial Search (Recommended)**
If Exa MCP is installed, use it to find relevant papers:
```
Search: "RLHF language model alignment 2023"
Search: "sparse autoencoders interpretability"
Search: "attention mechanism transformers Vaswani"
```
Then verify each result with Semantic Scholar and fetch BibTeX via DOI.
**Step 1: Search Semantic Scholar**
```python
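from itertools import islice

from semanticscholar import SemanticScholar

sch = SemanticScholar()
results = sch.search_paper("attention mechanism transformers", limit=5)
# Cap the loop explicitly in case iterating the result object pages past the first batch
for paper in islice(results, 5):
    # externalIds can be missing for some records, so guard before calling .get()
    external_ids = paper.externalIds or {}
    print(f"{paper.title} - {paper.paperId}")
    print(f"  DOI: {external_ids.get('DOI', 'N/A')}")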
```
**Step 2: Verify Existence**
Confirm paper appears in at least two sources (Semantic Scholar + CrossRef/arXiv).
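One way to sketch the two-source check is to compare the metadata returned by each source. The example below assumes each record has already been fetched and reduced to a `title` and optional `year`; the function names, the 0.9 similarity threshold, and the one-year tolerance (preprint versus publication year) are all assumptions to tune, and the records are typed by hand rather than API output.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def same_paper(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Treat two records as the same paper if titles nearly match and years agree within one."""
    ratio = SequenceMatcher(None, normalize_title(a["title"]), normalize_title(b["title"])).ratio()
    years_known = a.get("year") is not None and b.get("year") is not None
    years_ok = not years_known or abs(a["year"] - b["year"]) <= 1
    return ratio >= threshold and years_ok


# Hand-written records standing in for two sources' responses.
from_semantic_scholar = {"title": "Attention Is All You Need", "year": 2017}
from_arxiv = {"title": "Attention is all you need.", "year": 2017}
print(same_paper(from_semantic_scholar, from_arxiv))
```

Near-duplicate titles can still pass a fuzzy threshold, so a match here narrows the candidates but does not replace checking authors and venue by eye.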
**Step 3: Retrieve BibTeX via DOI**
```python
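import requests


def doi_to_bibtex(doi: str) -> str:
    """Fetch BibTeX for a DOI via doi.org content negotiation.

    The resolver forwards to the DOI's registration agency (e.g. CrossRef or DataCite).
    """
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
        timeout=30,  # fail fast instead of hanging on a slow resolver
    )
    response.raise_for_status()
    return response.text


# Example
bibtex = doi_to_bibtex("10.48550/arXiv.1706.03762")
print(bibtex)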
```
**Step 4: Verify Claims**
Before citing for a specific claim, access the paper and confirm the attributed claim actually appears.
**Step 5: Handle Failures Explicitly**
If you cannot verify a citation at ANY step:
```latex
% Option 1: Explicit placeholder
\cite{PLACEHOLDER_smith2023_verify} % TODO: Could not verify - scientist must confirm
% Option 2: Note in text
... as shown in prior work [CITATION NEEDED - could not verify Smith et al. 2023].
```
**Always inform the scientist:**
> "I could not verify the following citations and have marked them as placeholders:
> - Smith et al. 2023 on reward hacking - could not find in Semantic Scholar
> - Jones 2022 on scaling laws - found similar paper but different authors
> Please verify these before submission."
### Summary: Citation Rules
| Situation | Action |
|-----------|--------|
| Found paper, got DOI, fetched BibTeX | ✅ Use the citation |
| Found paper, no DOI | ✅ Use arXiv BibTeX or manual entry from paper |
| Paper exists but can't fetch BibTeX | ⚠️ Mark placeholder, inform scientist |
| Uncertain if paper exists | ❌ Mark `[CITATION NEEDED]`, inform scientist |
| "I think there's a paper about X" | ❌ **NEVER cite** - search first or mark placeholder |
**🚨 NEVER generate BibTeX from memory—always fetch programmatically. 🚨**
See [references/citation-workflow.md](references/citation-workflow.md) for complete API documentation.
---
## Common Issues and Solutions
**Issue: Abstract too generic**
Delete the first sentence if it could be prepended to any ML paper. Start with your specific contribution.
**Issue: Introduction exceeds 1.5 pages**
Split background into Related Work. Front-load contribution bullets. Methods should start by page 2-3.
**Issue: Experiments lack explicit claims**
Add sentence before each experiment: "This experiment tests whether [specific claim]..."
**Issue: Reviewers find paper hard to follow**
- Add explicit signposting: "In this section, we show X"
- Use consistent terminology throughout
- Include figure captions that stand alone
**Issue: Missing statistical significance**
Always include:
- Error bars (specify: std dev or std error)
- Number of runs
- Statistical tests if comparing methods
---
## Reviewer Evaluation Criteria
Reviewers assess papers on four dimensions:
| Criterion | What Reviewers Look For |
|-----------|------------------------|
| **Quality** | Technical soundness, well-supported claims |
| **Clarity** | Clear writing, reproducible by experts |
| **Significance** | Community impact, advances understanding |
| **Originality** | New insights (doesn't require new method) |
**Scoring (NeurIPS 6-point scale):**
- 6: Strong Accept - Groundbreaking, flawless
- 5: Accept - Technically solid, high impact
- 4: Borderline Accept - Solid, limited evaluation
- 3: Borderline Reject - Solid but weaknesses outweigh
- 2: Reject - Technical flaws
- 1: Strong Reject - Known results or ethics issues
See [references/reviewer-guidelines.md](references/reviewer-guidelines.md) for detailed reviewer instructions.
---
## Tables and Figures
### Tables
Use `booktabs` LaTeX package for professional tables:
```latex
\usepackage{booktabs}  % goes in the preamble
\begin{tabular}{lrr}   % right-align numerical columns
\toprule
Method & Accuracy $\uparrow$ & Latency (ms) $\downarrow$ \\
\midrule
Baseline & 85.2 & 45 \\
\textbf{Ours} & \textbf{92.1} & \textbf{38} \\
\bottomrule
\end{tabular}
```
**Rules:**
- Bold best value per metric
- Include direction symbols (↑ higher is better, ↓ lower is better)
- Right-align numerical columns
- Consistent decimal precision
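Bolding the best value by hand is error-prone when results change, so the rows can be generated. The sketch below is one possible approach: `format_rows` is a hypothetical helper, and the numbers reuse the illustrative table above. It applies the rules in this list: best value per metric in bold, direction-aware, fixed decimal precision.

```python
def format_rows(
    results: dict[str, dict[str, float]],
    higher_is_better: dict[str, bool],
    precision: int = 1,
) -> list[str]:
    """Build booktabs body rows, bolding the best value in each metric column."""
    metrics = list(higher_is_better)
    best = {
        m: (max if higher_is_better[m] else min)(r[m] for r in results.values())
        for m in metrics
    }
    rows = []
    for method, scores in results.items():
        cells = []
        for m in metrics:
            text = f"{scores[m]:.{precision}f}"  # same precision in every cell
            cells.append(f"\\textbf{{{text}}}" if scores[m] == best[m] else text)
        rows.append(f"{method} & " + " & ".join(cells) + r" \\")
    return rows


# Illustrative numbers taken from the example table in this section.
results = {
    "Baseline": {"acc": 85.2, "latency_ms": 45.0},
    "Ours": {"acc": 92.1, "latency_ms": 38.0},
}
rows = format_rows(results, higher_is_better={"acc": True, "latency_ms": False})
print("\n".join(rows))
```

The output lines drop between `\midrule` and `\bottomrule`; ties are all bolded, which is usually the desired behavior.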
### Figures
- **Vector graphics** (PDF, EPS) for all plots and diagrams
- **Raster** (PNG 600 DPI) only for photographs
- Use **colorblind-safe palettes** (Okabe-Ito or Paul Tol)
- Verify **grayscale readability** (8% of men have color vision deficiency)
- **No title inside figure**—the caption serves this function
- **Self-contained captions**—reader should understand without main text
---
## References & Resources
### Reference Documents (Deep Dives)
| Document | Contents |
|----------|----------|
| [writing-guide.md](references/writing-guide.md) | Gopen & Swan 7 principles, Ethan Perez micro-tips, word choice |
| [citation-workflow.md](references/citation-workflow.md) | Citation APIs, Python code, BibTeX management |
| [checklists.md](references/checklists.md) | NeurIPS 16-item, ICML, ICLR, ACL requirements |
| [reviewer-guidelines.md](references/reviewer-guidelines.md) | Evaluation criteria, scoring, rebuttals |
| [sources.md](references/sources.md) | Complete bibliography of all sources |
### LaTeX Templates
Templates in `templates/` directory: **ICML 2026**, **ICLR 2026**, **NeurIPS 2025**, **ACL/EMNLP**, **AAAI 2026**, **COLM 2025**.
**Compiling to PDF:**
- **VS Code/Cursor**: Install LaTeX Workshop extension + TeX Live → Save to auto-compile
- **Command line**: `latexmk -pdf main.tex` or `pdflatex` + `bibtex` workflow
- **Online**: Upload to [Overleaf](https://overleaf.com)
See [templates/README.md](templates/README.md) for detailed setup instructions.
### Key External Sources
**Writing Philosophy:**
- [Neel Nanda: How to Write ML Papers](https://www.alignmentforum.org/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers) - Narrative, "What/Why/So What"
- [Farquhar: How to Write ML Papers](https://sebastianfarquhar.com/on-research/2024/11/04/how_to_write_ml_papers/) - 5-sentence abstract
- [Gopen & Swan: Science of Scientific Writing](https://cseweb.ucsd.edu/~swanson/papers/science-of-writing.pdf) - 7 reader expectation principles
- [Lipton: Heuristics for Scientific Writing](https://www.approximatelycorrect.com/2018/01/29/heuristics-technical-scientific-writing-machine-learning-perspective/) - Word choice
- [Perez: Easy Paper Writing Tips](https://ethanperez.net/easy-paper-writing-tips/) - Micro-level clarity
**APIs:** [Semantic Scholar](https://api.semanticscholar.org/api-docs/) | [CrossRef](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) | [arXiv](https://info.arxiv.org/help/api/basics.html)
**Venues:** [NeurIPS](https://neurips.cc/Conferences/2025/PaperInformation/StyleFiles) | [ICML](https://icml.cc/Conferences/2025/AuthorInstructions) | [ICLR](https://iclr.cc/Conferences/2026/AuthorGuide) | [ACL](https://github.com/acl-org/acl-style-files)
@@ -0,0 +1,394 @@
# Autoreason: Iterative Refinement Methodology
Complete reference for the autoreason iterative refinement method, derived from experimental results across subjective writing tasks, competitive programming, and four model tiers. Use this when any output (paper draft, experiment script, analysis, task definition) needs iterative improvement.
**Source**: [NousResearch/autoreason](https://github.com/NousResearch/autoreason) — "Autoreason: When Iterative LLM Refinement Works and Why It Fails"
---
## Strategy Selection Guide
### Decision Tree
```
Is the task objectively verifiable (code, math, factual)?
├── YES → Does the model solve it on the first attempt?
│   ├── YES → Use single pass (no refinement needed)
│   └── NO → Use autoreason (structured analysis → reason-informed revision)
│
└── NO (subjective) → What model tier are you using?
    ├── Weak (Llama 8B, small models)
    │   → Single pass. Model too weak for refinement to help.
    │     Invest in generation quality, not iteration.
    │
    ├── Mid-tier (Haiku 3.5, Gemini Flash)
    │   → Autoreason with stronger judges. This is the sweet spot.
    │     Self-refinement DESTROYS weak model outputs; autoreason prevents this.
    │
    ├── Strong (Sonnet 4)
    │   → Autoreason for open-ended tasks. Wins 3/5.
    │     Critique-and-revise for concrete technical tasks (2/5).
    │
    └── Frontier (Sonnet 4.6, Opus)
        ├── Constrained scope? → Autoreason. Wins 2/3 constrained tasks.
        └── Unconstrained? → Critique-and-revise or single pass.
            Autoreason FAILS on unconstrained frontier tasks (comes last).
```
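The decision tree can be sketched as a small lookup function. This is an illustration only, not part of the autoreason codebase: the function name `select_strategy`, the tier strings, and the argument names are all hypothetical.

```python
def select_strategy(verifiable: bool, solved_first_try: bool = False,
                    tier: str = "mid", constrained: bool = False,
                    concrete_technical: bool = False) -> str:
    """Return a strategy name following the decision tree above.

    Hypothetical helper; tiers are "weak", "mid", "strong", "frontier".
    """
    if verifiable:
        # Objective tasks: refine only when the first attempt fails
        return "single_pass" if solved_first_try else "autoreason"
    if tier == "weak":
        return "single_pass"
    if tier == "mid":
        return "autoreason"  # pair with stronger judges
    if tier == "strong":
        return "critique_and_revise" if concrete_technical else "autoreason"
    if tier == "frontier":
        # The tree also allows single pass for the unconstrained case
        return "autoreason" if constrained else "critique_and_revise"
    raise ValueError(f"unknown tier: {tier!r}")
```

Keeping the rule in one function makes it easy to log which branch was taken for each experiment.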
### Strategy Comparison Table
| Strategy | Best For | Avoid When | Compute (per iteration) |
|----------|----------|------------|------------------------|
| **Single pass** | Frontier models, template tasks, tight budgets | Mid-tier models where quality ceiling is low | 1 call |
| **Critique-and-revise** | Concrete technical requirements (system design, specifications) | Weak models (degrades output), unconstrained subjective tasks | 2 calls |
| **Autoreason** | Mid-tier models, constrained scope, tasks with genuine tradeoffs | Weak models (Llama 8B), frontier + unconstrained | ~6 calls |
| **Best-of-N** | Almost never recommended | Weak models especially — worse than single pass | N calls |
### Why Each Strategy Fails
| Strategy | Failure Mode | Mechanism |
|----------|-------------|-----------|
| **Single pass** | Quality ceiling | No mechanism to improve beyond first attempt |
| **Critique-and-revise** | Progressive degradation | Model hallucinates problems (sycophancy), scope creeps each pass, never declines to change |
| **Best-of-N** | Random selection | Without good ranking signal, more samples = more mediocre options |
| **Autoreason (unconstrained)** | Synthesis drift | Stronger models produce syntheses so consistently preferred that incumbent never stabilizes |
---
## The Autoreason Loop
### Architecture
```
┌──────────────────────────────────────────────────────────┐
│                      ITERATION LOOP                      │
│                                                          │
│  Incumbent A ──► Critic ──► Author B ──► Synthesizer     │
│        │                                      │          │
│        │              ┌───────────────────────┘          │
│        ▼              ▼                                  │
│       [A]            [AB]         [B]                    │
│        │              │            │                     │
│        └──────────────┼────────────┘                     │
│                       ▼                                  │
│              Judge Panel (blind)                         │
│                       │                                  │
│                       ▼                                  │
│                    Winner                                │
│                       │                                  │
│               ┌───────┴───────┐                          │
│               ▼               ▼                          │
│          A wins k=2     B or AB wins                     │
│         consecutive?   → new incumbent                   │
│               │                                          │
│               ▼                                          │
│           CONVERGED                                      │
└──────────────────────────────────────────────────────────┘
```
### Roles
Every role is a **fresh, isolated agent** with no shared context:
| Role | Input | Output | Key Rule |
|------|-------|--------|----------|
| **Critic** | Task + Incumbent A | List of problems | Find problems ONLY. No fixes. No suggestions. |
| **Author B** | Task + A + Critique | Revised version B | Address each criticism. State which problem each change fixes. |
| **Synthesizer** | Task + X + Y (randomized labels) | Synthesis AB | Take strongest elements of each. Not a compromise. |
| **Judge Panel** | Task + A, AB, B (randomized labels + order) | Ranking | Rank best to worst. No authorship stake. |
### Configuration
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **Convergence k** | 2 | k=1 premature (94% displaced later). k=2 converges 100%, quality plateaus. k=3 fails 24%, 2x cost, no quality gain. |
| **Author temperature** | 0.7-0.8 | Encourages diverse revisions |
| **Judge temperature** | 0.3 | Encourages consistent evaluation |
| **In-loop judges** | 3 | Balance per-pass cost vs evaluation stability |
| **Final evaluation judges** | 7 | Higher statistical power for final comparison |
| **Max tokens** | 4096 | Standard; 8192 for long-form (papers) |
| **Judge type** | Chain-of-thought | 3x faster convergence on some tasks. Always use. |
| **Tiebreak** | Conservative (incumbent wins) | Prevents false positives — A must be genuinely beaten |
| **Max passes** | 25 (constrained), 50 (remedy) | Safety cap; most converge by pass 10-15 |
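A minimal sketch of how the loop and the convergence rule fit together, assuming the four roles are supplied as plain callables. The names `autoreason_loop`, `critic`, `author_b`, `synthesizer`, and `judge_panel` are hypothetical, and `judge_panel` is assumed to return the key of the winning candidate (`"A"`, `"B"`, or `"AB"`) after aggregation and tiebreak.

```python
from typing import Callable, Dict, Tuple

def autoreason_loop(
    task: str,
    initial: str,
    critic: Callable[[str, str], str],
    author_b: Callable[[str, str, str], str],
    synthesizer: Callable[[str, str, str], str],
    judge_panel: Callable[[str, Dict[str, str]], str],
    k: int = 2,
    max_passes: int = 25,
) -> Tuple[str, int, bool]:
    """Iterate until the incumbent wins k consecutive passes or max_passes is hit.

    Returns (final_text, passes_used, converged).
    """
    incumbent = initial
    streak = 0
    for pass_num in range(1, max_passes + 1):
        critique = critic(task, incumbent)
        version_b = author_b(task, incumbent, critique)
        version_ab = synthesizer(task, incumbent, version_b)
        candidates = {"A": incumbent, "B": version_b, "AB": version_ab}
        winner = judge_panel(task, candidates)
        if winner == "A":
            streak += 1
            if streak >= k:
                return incumbent, pass_num, True
        else:
            # A challenger displaced the incumbent, so the streak resets
            incumbent = candidates[winner]
            streak = 0
    return incumbent, max_passes, False
```

Passing the roles in as callables keeps each one a fresh, isolated call and makes the loop easy to test with stubs.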
### Prompts
#### Critic
```
System: You are a critical reviewer. Your only job is to find real problems.
Be specific and concrete. Do not suggest fixes.
User: Find real problems with this proposal. Focus on:
- Things that won't work as described
- Complexity that doesn't pay for itself
- Assumptions that are wrong
- Missing pieces
Do NOT propose fixes. Just the problems.
```
#### Author B
```
System: You are a senior consultant revising a proposal based on specific
criticisms. Address each valid criticism directly. Do not make changes not
motivated by an identified problem.
User: [TASK] + [VERSION A] + [CRITIC OUTPUT]
Revise to address these problems. For each change, state which problem it fixes.
```
#### Synthesizer
```
System: You are given two versions as equal inputs. Take the strongest elements
from each and produce a coherent synthesis. This is not a compromise.
User: [TASK] + [VERSION X] + [VERSION Y]
(labels randomized — synthesizer doesn't know which is incumbent)
```
#### Judge (Chain-of-Thought) — ALWAYS USE THIS VERSION
```
System: You are an independent evaluator. Think carefully before deciding.
User: [TASK] + Three proposals. For each, think step by step:
1. What does it get right?
2. What does it get wrong or miss?
3. Are numbers and claims defensible?
4. Is detail appropriate or bloated?
After reasoning, rank all three.
RANKING: [best], [second], [worst]
```
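Because the panel only works when every judge's ranking parses, a strict parser for the `RANKING:` line is worth having. A minimal sketch, assuming judges see single-letter labels; `parse_ranking` is a hypothetical helper that returns `None` on malformed output so the caller can retry or discard that judge.

```python
import re
from typing import List, Optional, Sequence

def parse_ranking(text: str, labels: Sequence[str] = ("A", "B", "C")) -> Optional[List[str]]:
    """Extract the ranking from the last 'RANKING:' line of a judge response.

    Returns labels best-first, or None if the line is missing, a label is
    unknown, or any label is repeated or omitted.
    """
    matches = re.findall(r"^\s*RANKING:\s*(.+)$", text, flags=re.MULTILINE | re.IGNORECASE)
    if not matches:
        return None
    # Tolerate brackets, markdown emphasis, and "Proposal X" wording around labels
    tokens = [re.sub(r"(?i)proposal|[\[\]\s.*]", "", t) for t in re.split(r"[,>]", matches[-1])]
    ranking = [t.upper() for t in tokens if t]
    if sorted(ranking) != sorted(labels):
        return None
    return ranking
```

Rejecting anything that is not a full permutation of the labels keeps a partially parsed judge from silently skewing the totals.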
#### Baseline Prompts (for comparison experiments)
| Baseline | Prompt |
|----------|--------|
| **Conservative** | "Make minimal improvements while preserving what works. Do not add new sections or significantly expand scope." |
| **Improve this** | "Improve this document." (no further guidance) |
| **Harsh critic** | "Critically evaluate and rewrite, fixing all weaknesses you identify." |
| **Critique & revise** | Step 1: "Produce a structured critique. List specific weaknesses." Step 2: "Revise to address each criticism." |
---
## Scoring: Borda Count
Judges rank candidates. Points awarded by rank position:
| Rank | Points (3 candidates) |
|------|----------------------|
| 1st | 3 |
| 2nd | 2 |
| 3rd | 1 |
**Aggregation**: Sum across all judges. Winner = highest total.
**Tiebreak**: Incumbent (A) wins any tie.
**Example** (3 judges):
- Judge 1: AB > A > B → AB gets 3, A gets 2, B gets 1
- Judge 2: A > AB > B → A gets 3, AB gets 2, B gets 1
- Judge 3: AB > B > A → AB gets 3, B gets 2, A gets 1
- Totals: AB=8, A=6, B=4 → AB wins, becomes new incumbent
**Randomization per judge**:
- Candidate labels randomized (A might be called "Proposal X" for one judge, "Proposal Z" for another)
- Presentation order randomized (AB might appear first or last)
- This prevents position bias and label bias
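The aggregation and tiebreak rules can be sketched as follows. `borda_winner` is a hypothetical helper; it assumes each judge's ranking has already been mapped from its randomized labels back to the candidate keys `A`, `B`, and `AB`.

```python
from typing import Dict, List, Tuple

def borda_winner(rankings: List[List[str]], incumbent: str = "A") -> Tuple[str, Dict[str, int]]:
    """Sum Borda points across judges; the incumbent wins any tie for first place."""
    totals: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # n points for 1st place, n-1 for 2nd, ..., 1 for last
            totals[candidate] = totals.get(candidate, 0) + (n - position)
    best = max(totals.values())
    leaders = [c for c, score in totals.items() if score == best]
    # A tie between non-incumbents is not specified above; take the first seen
    winner = incumbent if incumbent in leaders else leaders[0]
    return winner, totals

# The three-judge example above
winner, totals = borda_winner([["AB", "A", "B"], ["A", "AB", "B"], ["AB", "B", "A"]])
print(winner, totals)
```

Run on the three-judge example, this reproduces the totals AB=8, A=6, B=4, with AB as the winner.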
---
## Model Selection Guide
### Empirical Results by Model Tier
| Model | Autoreason Wins | Autoreason Avg Borda | Best Baseline | Margin | Recommendation |
|-------|----------------|---------------------|---------------|--------|----------------|
| **Llama 3.1 8B** | 1/3 | 23.7 | 25.0 (single) | -1.3 | Skip autoreason. Model too weak for diverse candidates. |
| **Gemini 2.0 Flash** | 2/3 | 25.0 | 20.0 (single) | +5.0 | Good candidate. Moderate gains. |
| **Haiku 3.5** | 3/3 | **42.0** | 33.7 (single) | **+8.3** | **Best candidate.** Perfect scores. Baselines actively destroy quality. |
| **Sonnet 4** | 3/5 | 27.8 | 22.4 (C&R) | +5.4 | Good candidate for open tasks. C&R better for technical tasks. |
| **Sonnet 4.6 (unconstrained)** | 0/1 | 7.0 | 31.0 (C&R) | -24.0 | Do NOT use autoreason without constraints. |
| **Sonnet 4.6 (constrained)** | 2/3 | 29.0 | 27.0 (improve) | +2.0 | Use only with scope constraints. |
### The Generation-Evaluation Gap
The core insight: **autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability.**
```
Weak models (Llama 8B):
Generation: Poor | Self-evaluation: Poor
Gap: Small (both bad) → Autoreason can't help, no diverse candidates
Mid-tier models (Haiku, Flash):
Generation: Decent | Self-evaluation: Poor
Gap: LARGE → Autoreason's sweet spot. External eval bridges the gap.
Strong models (Sonnet 4):
Generation: Good | Self-evaluation: Decent
Gap: Moderate → Autoreason helps on 3/5 tasks
Frontier models (Sonnet 4.6):
Generation: Excellent | Self-evaluation: Good
Gap: Small → Simple methods suffice. Autoreason hurts on unconstrained tasks.
```
**Practical rule**: As model costs drop and capabilities improve, today's frontier becomes tomorrow's mid-tier. The generation-evaluation gap is structural, not temporary. Match refinement architecture to the model's position on the capability curve.
### Judge Selection
| Author Model | Recommended Judge | Rationale |
|-------------|------------------|-----------|
| Llama 8B | Don't use autoreason | Model too weak |
| Gemini Flash | Sonnet 4 | Cross-model evaluation works |
| Haiku 3.5 | Sonnet 4 | Strong external eval is the mechanism |
| Haiku 3.5 | Haiku 3.5 (same) | Still works — tournament structure provides value even without strong judges (20.7 vs 18.3 avg Borda) |
| Sonnet 4 | Sonnet 4 (same) | Same-model judges work at this tier |
| Sonnet 4.6 | Sonnet 4.6 (same) | Only with scope constraints |
---
## Scope Constraint Design
### What Makes Autoreason Work on Constrained Tasks
The same model (Sonnet 4.6) goes from **last place** on the unconstrained task to **first place** once scope constraints are added. The constraints bound the improvement space so synthesis drift can't accumulate.
### Effective Constraints
| Constraint Type | Example | Why It Works |
|----------------|---------|-------------|
| **Fixed facts** | "Use only these 8 data points, add nothing else" | Bounds information space |
| **Fixed deliverable** | "500-word startup pitch" (not "improve this") | Defines done condition |
| **Fixed structure** | "Exactly 4 sections, each with 3 numbered items" | Prevents structural drift |
| **Fixed change items** | "Address exactly these 3 reviewer concerns" | Bounds modification scope |
### Ineffective Constraints
| Constraint | Why It Fails | What Happens |
|-----------|-------------|-------------|
| Word count alone | Not a scope constraint | False convergence — rejected for length, not quality |
| "Be concise" | Too vague | Ignored after 2-3 passes |
| "Be comprehensive" | Anti-constraint | Invites scope creep |
| No constraints at all | Unbounded improvement space | Synthesis dominates, no convergence |
### Task Categories
| Task Type | Autoreason Works? | Why |
|-----------|-------------------|-----|
| Tasks with genuine tradeoffs (strategy, policy) | Yes | Multiple valid approaches for tournament to select between |
| Constrained writing (pitch, memo, postmortem) | Mostly (2/3) | Bounded scope, clear evaluation criteria |
| Template-filling (incident postmortem) | No | One correct structure, minimal decision space |
| Competitive programming | Yes | Naturally scoped, test suite provides external verification |
| Open-ended unconstrained + frontier model | No | Synthesis drift, no convergence |
---
## Failure Taxonomy
| Failure Mode | Condition | Detection | Evidence |
|-------------|-----------|-----------|----------|
| **Self-correction unreliable** | No external evaluation signal | Baselines degrade below single pass | Haiku baselines: 16.3 avg vs 33.7 single pass |
| **Drift / synthesis dominance** | Unconstrained scope | A wins <15%, AB dominates | Sonnet 4.6 unconstrained: A wins 12%, AB wins 60%+ |
| **Overfitting to visible feedback** | Shallow revision loop (C&R) | High public/private divergence | C&R overfits 32% on hard code problems |
| **No convergence** | Broken judge pipeline | Parsing failures, <3 valid judges | Mixed panel parser failure: 11+ passes |
| **Model too weak** | Insufficient generation diversity | All candidates look similar | Llama 8B wins only 1/3 tasks |
### Recovery Patterns
| Failure | Recovery |
|---------|----------|
| No convergence (drift) | Add scope constraints to the task |
| No convergence (broken judges) | Fix parser, ensure 3 valid judges before continuing |
| Quality degrades with iteration | Switch to single pass or add constraints |
| Model too weak | Use a stronger model for generation, keep weak model for cheap roles |
| Overfitting (code) | Use structured analysis step, not just test feedback |
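Drift can be spotted from the per-pass winner history. A minimal sketch of the detection rule from the taxonomy (incumbent win share below 15% while AB dominates). `detect_drift` is a hypothetical helper; the 50% cutoff for "AB dominates" and the `min_passes` guard are illustrative assumptions, not values from the paper.

```python
from collections import Counter
from typing import Dict, List

def detect_drift(winners: List[str], a_threshold: float = 0.15,
                 ab_threshold: float = 0.5, min_passes: int = 8) -> Dict[str, object]:
    """Flag synthesis drift from a list of per-pass winners ("A", "B", or "AB")."""
    counts = Counter(winners)
    total = len(winners)
    a_share = counts["A"] / total if total else 0.0
    ab_share = counts["AB"] / total if total else 0.0
    # Require a minimum history so a short run is not flagged prematurely
    drifting = total >= min_passes and a_share < a_threshold and ab_share > ab_threshold
    return {"a_share": a_share, "ab_share": ab_share, "drifting": drifting}
```

When the flag fires, the recovery from the table applies: add scope constraints to the task rather than running more passes.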
---
## Code Domain Adaptation
The autoreason method adapts differently for code vs writing:
### Writing Domain
```
Call 1: Critic (find problems in incumbent)
Call 2: Author B (revise based on critique)
Call 3: Synthesizer (merge A and B)
Calls 4-6: Judge Panel (3 blind judges rank A, B, AB)
```
### Code Domain (6-call budget)
```
Call 1: Initial generation
Call 2: Structured analysis (5 points — NO CODE):
- Problem analysis: what does the problem actually require?
- Approach analysis: what approach did we use, is it correct?
- Failure analysis: why did tests fail?
- Alternative approaches: what else could work?
- Edge cases: what inputs might break the solution?
Calls 3-6: Reason-informed revisions
- Each revision must explain WHY it fixes the issue
- Sees test results from public (visible) test cases
```
**Key difference**: The code strategy replaces the judge panel with test-suite evaluation (objective ground truth). The structured analysis step (Call 2) is what drives recovery — it forces reasoning about *why* the approach failed before attempting fixes.
**Results**: Recovery is the mechanism. Among problems where both autoreason and single-pass failed initially, autoreason recovered 62% vs single-pass's 43% (McNemar p=0.041, Cohen's h=0.32).
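Cohen's h, the effect size quoted above, compares two proportions on an arcsine-transformed scale: h = 2·asin(√p1) − 2·asin(√p2). A minimal sketch with an illustrative function name; applied to rounded percentages, the result may differ slightly from a value computed on the raw counts.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions (Cohen's h)."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```

By Cohen's conventional benchmarks, |h| around 0.2 is a small effect, 0.5 medium, and 0.8 large.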
---
## Applying Autoreason to Paper Writing
The paper itself was refined using autoreason (Section 8 of the paper):
### Setup
- Model: claude-opus-4
- Judges: 3 Opus judges
- Enhancement: Ground-truth critic (access to actual experimental data)
- Result: Converged in 9 passes
### Key Findings for Paper Refinement
1. **Ground-truth critic is essential**: Without ground-truth access, Opus hallucinated an ablation study, fake confidence intervals, wrong model names, and incorrect role descriptions. With ground-truth access, the critic caught all four on pass 1.
2. **Judge panel integrity matters**: A broken parser in one judge (Gemini output format mismatch) reduced the panel from 3 to 2 judges. This prevented convergence for 11+ passes. Fixing to 3 working judges, the same incumbent converged in 2 passes. A broken judge doesn't add noise — it prevents equilibrium.
### Recommended Setup for Paper Refinement
```
Critic prompt: "You are reviewing a research paper draft. You have access to the
actual experimental results [GROUND TRUTH DATA]. Find factual errors, unsupported
claims, hallucinated results, and structural problems. Do not suggest fixes."
Author B prompt: "Revise this paper draft to fix the identified problems. For each
change, cite the specific problem it addresses. Do not add claims not supported by
the provided experimental data."
Judge prompt (CoT): "Compare three versions of this paper. For each, evaluate:
1. Factual accuracy against the provided results
2. Clarity of the narrative and contribution
3. Whether claims are properly hedged and supported
4. Writing quality (concision, precision, no filler)
After reasoning, rank all three. RANKING: [best], [second], [worst]"
```
### What to Provide as Ground Truth
- All experimental result JSON files
- Statistical test outputs
- Raw numbers for every table and figure
- Configuration files showing exact hyperparameters
- Code that generated the results (for method description accuracy)
---
## Compute Budget Reference
| Method | Calls per Pass | Typical Passes | Total Calls | Relative Cost |
|--------|---------------|----------------|-------------|---------------|
| Single pass | 1 | 1 | 1 | 1x |
| Best-of-N | N | 1 | N | Nx |
| Critique & revise | 2 | 15 | 30 | 30x |
| Autoreason (in-loop) | ~6 | 10-15 | 60-90 | 60-90x |
| Autoreason (with final eval) | ~6 + 7 | 10-15 + 1 | 67-97 | ~80x |
**Cost-quality tradeoff**: Each autoreason pass costs ~6 calls, versus 1 for a single pass and 2 for critique-and-revise, and the method typically runs more passes. This is a real tradeoff: the method trades compute for evaluation quality. On constrained tasks with mid-tier models, the tradeoff is strongly positive. On unconstrained tasks with frontier models, it's negative.
**CoT judges reduce cost**: 1 CoT judge provides evaluation quality comparable to 3 standard judges, at ~40% cost savings. Always use CoT judges.
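The autoreason totals in the table follow from simple arithmetic: calls per pass times passes, plus one call per final-evaluation judge. A small sketch with a hypothetical function name, using the defaults from the configuration table (critic + Author B + synthesizer + 3 in-loop judges, 7 final judges).

```python
def autoreason_calls(passes: int, in_loop_judges: int = 3, final_judges: int = 0) -> int:
    """Total model calls: (critic + Author B + synthesizer + judges) per pass, plus final eval."""
    calls_per_pass = 3 + in_loop_judges
    return passes * calls_per_pass + final_judges

for passes in (10, 15):
    # columns: passes, in-loop total, total including the 7-judge final evaluation
    print(passes, autoreason_calls(passes), autoreason_calls(passes, final_judges=7))
```

For 10 and 15 passes this gives 60 and 90 in-loop calls, and 67 and 97 with the final evaluation, matching the table.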
@@ -10,6 +10,8 @@ This reference documents the mandatory checklist requirements for major ML/AI co
- [ICML Paper Checklist](#icml-paper-checklist)
- [ICLR Requirements](#iclr-requirements)
- [ACL Requirements](#acl-requirements)
- [AAAI Requirements](#aaai-requirements)
- [COLM Requirements](#colm-requirements)
- [Universal Pre-Submission Checklist](#universal-pre-submission-checklist)
---
@@ -280,6 +282,77 @@ If applicable:
---
## AAAI Requirements
### Formatting (Strictest of All Venues)
AAAI enforces formatting rules more strictly than any other major venue. Papers that deviate from the template risk desk rejection.
- [ ] Use the **exact** AAAI style file without modification — no `\setlength`, no `\vspace` hacks, no font overrides
- [ ] 7 pages main content (8 for camera-ready with author info)
- [ ] Two-column format, Times font (set by template)
- [ ] References and appendices do not count toward page limit
- [ ] Abstract must be a single paragraph
- [ ] Do not modify margins, column widths, or font sizes
### Required Sections
- [ ] Abstract (single paragraph, no math or citations)
- [ ] Introduction with clear contribution statement
- [ ] References in AAAI format (uses `aaai2026.bst`)
- [ ] Appendix (optional, unlimited)
### Ethics and Reproducibility
- [ ] Broader impact statement (encouraged but not always mandatory — check current year's CFP)
- [ ] Reproducibility details (datasets, code availability)
- [ ] Acknowledge use of AI writing tools if applicable
### Key Differences from Other Venues
- **No separate limitations section required** (unlike ACL), but discussing limitations is recommended
- **Strictest formatting enforcement** — the style checker will reject non-compliant PDFs
- **No paper checklist** like NeurIPS has, but the universal checklist below still applies
- **Unified template** covers main paper and supplementary in the same file
---
## COLM Requirements
### Overview
COLM (Conference on Language Modeling) focuses specifically on language model research. Framing must target this community.
### Formatting
- [ ] 9 pages main content (10 for camera-ready)
- [ ] Use COLM template (based on ICLR template with modifications)
- [ ] Double-blind review
- [ ] References and appendices unlimited
### Required Sections
- [ ] Abstract
- [ ] Introduction framed for language modeling community
- [ ] Conclusion
- [ ] References
### Content Expectations
- [ ] Contribution must be relevant to language models (broadly interpreted: training, evaluation, applications, theory, alignment, safety)
- [ ] If the method is general, frame with language model examples
- [ ] Baselines should include recent LM-specific methods where applicable
### Key Differences from Other Venues
- **Narrower scope** than NeurIPS/ICML — must frame for LM community
- **Template derived from ICLR** — similar formatting rules
- **Newer venue**: reviewer norms are still being established; err on the side of thorough evaluation
- **No mandatory checklist** like NeurIPS, but broader impact discussion is expected
- **LLM disclosure**: If LLMs were used in research (code generation, data annotation, writing assistance), disclose this
---
## Universal Pre-Submission Checklist
### Before Every Submission
@@ -289,7 +289,7 @@ class CitationManager:
)
if resp.status_code == 200:
sources.append("CrossRef")
except:
except Exception:
pass
# Check arXiv if ID available
@@ -301,7 +301,7 @@ class CitationManager:
)
if "<entry>" in resp.text and "<title>" in resp.text:
sources.append("arXiv")
except:
except Exception:
pass
return len(sources) >= 2, sources
@@ -318,7 +318,7 @@ class CitationManager:
)
if resp.status_code == 200:
return resp.text
except:
except Exception:
pass
# Fallback: generate from paper data
@@ -419,7 +419,7 @@ def batch_cite(queries: List[str], output_file: str = "references.bib"):
| Customization | Limited | Highly flexible |
| Backend | bibtex | Biber (recommended) |
**Recommendation**: Use BibLaTeX with Biber for new papers.
**Recommendation**: Use natbib with BibTeX for conference submissions — all major venue templates (NeurIPS, ICML, ICLR, ACL, AAAI, COLM) ship with natbib and `.bst` files. BibLaTeX with Biber is an option for journals or personal projects where you control the template.
### LaTeX Setup
@@ -0,0 +1,728 @@
# Experiment Design Patterns
Patterns and best practices distilled from running research experiments at scale with the Hermes agent. These cover experiment infrastructure, evaluation protocols, monitoring, and failure recovery.
---
## Experiment Infrastructure
### Directory Structure
Organize experiments with a consistent structure:
```
workspace/
  experiments/
    run_main.py                # Core experiment runner
    run_baselines.py           # Baseline comparison
    run_ablation.py            # Ablation studies
    strategies.py              # Method implementations
    config.yaml                # Shared configuration
  results/
    <experiment_name>/
      <task_or_problem>/
        <strategy>/
          result.json          # Final metrics
          final_output.md      # Final output artifact
          history.json         # Full trajectory/log
          pass_01/             # Per-iteration artifacts (if iterative)
            intermediate.md
  analysis/
    analyze_results.py         # Statistical analysis
    compute_stats.py           # Significance tests
    make_charts.py             # Visualization
  paper/
    paper.tex                  # LaTeX source
    fig_*.pdf                  # Generated figures
```
### Script Design Principles
**1. Incremental Saving (Crash Recovery)**
Every experiment script should save results after each unit of work, and skip already-completed work on restart:
```python
import json
from pathlib import Path

def run_experiment(problems, strategies, output_dir):
    for problem in problems:
        for strategy in strategies:
            result_path = Path(output_dir) / problem["id"] / strategy / "result.json"
            if result_path.exists():
                print(f"Skipping {problem['id']}/{strategy} (already done)")
                continue

            # Run the experiment (execute_strategy is defined elsewhere)
            result = execute_strategy(problem, strategy)

            # Save immediately
            result_path.parent.mkdir(parents=True, exist_ok=True)
            with open(result_path, 'w') as f:
                json.dump(result, f, indent=2)
```
This pattern makes re-runs safe and efficient. If a process crashes at problem 47/150, restarting skips the first 46.
**2. Artifact Preservation**
Save all intermediate outputs, not just final results. This enables post-hoc analysis without re-running:
```python
from pathlib import Path

def save_pass_artifacts(output_dir, pass_num, artifacts):
    """Save all artifacts from a single pass of an iterative method."""
    pass_dir = Path(output_dir) / f"pass_{pass_num:02d}"
    pass_dir.mkdir(parents=True, exist_ok=True)

    for name, content in artifacts.items():
        with open(pass_dir / f"{name}.md", 'w') as f:
            f.write(content)
```
**3. Configuration Management**
Use YAML configs for reproducibility:
```yaml
# config.yaml
model: anthropic/claude-sonnet-4-20250514
author_temperature: 0.8
judge_temperature: 0.3
max_tokens: 4096
num_judges: 3
max_passes: 15
convergence_k: 2
```
```python
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)
```
**4. Separation of Concerns**
Keep generation, evaluation, and visualization in separate scripts:
| Script | Purpose |
|--------|---------|
| `run_main.py` | Core method execution |
| `run_baselines.py` | Baseline comparisons at same compute |
| `run_eval.py` | Blind evaluation / judge panels |
| `analyze_results.py` | Statistical analysis |
| `make_charts.py` | Figure generation |
This lets you re-run evaluation without re-running expensive generation, and regenerate figures without re-running analysis.
---
## Evaluation Protocols
### Blind Judge Panels (for Subjective Tasks)
When evaluating subjective outputs (writing, analysis, recommendations), use a blind judge panel:
```python
import random

def run_blind_evaluation(outputs: dict, task_prompt: str, num_judges: int = 7):
    """
    Run blind evaluation of multiple method outputs.

    Args:
        outputs: {"method_name": "output_text", ...}
        task_prompt: The original task description
        num_judges: Number of independent judge evaluations
    """
    rankings = []

    for judge_i in range(num_judges):
        # Randomize labels and presentation order per judge
        methods = list(outputs.keys())
        random.shuffle(methods)
        labels = {m: chr(65 + i) for i, m in enumerate(methods)}  # A, B, C...

        # Present to judge with randomized labels
        prompt = f"Task: {task_prompt}\n\n"
        for method in methods:
            prompt += f"--- Proposal {labels[method]} ---\n{outputs[method]}\n\n"
        prompt += "Rank all proposals from best to worst. Format: RANKING: [best], [second], [worst]"

        # call_judge is assumed to be defined elsewhere and to return the
        # ranked labels as a list, best first (e.g. ["B", "A", "C"])
        ranking = call_judge(prompt)
        rankings.append({"labels": labels, "ranking": ranking})

    # Aggregate via Borda count
    return compute_borda(rankings, n_methods=len(outputs))

def compute_borda(rankings, n_methods=3):
    """Borda count: n_methods points for 1st place down to 1 for last (3/2/1 for three)."""
    scores = {}
    for r in rankings:
        # Labels differ per judge, so map each label back to its method before scoring
        label_to_method = {label: method for method, label in r["labels"].items()}
        for position, label in enumerate(r["ranking"]):
            method = label_to_method.get(label, label)
            scores[method] = scores.get(method, 0) + max(n_methods - position, 0)
    return scores
```
Key design decisions:
- **Randomize both labels AND order** per judge to prevent position bias
- **Use an odd number of judges** (3, 5, 7) to reduce the chance of ties
- **Conservative tiebreak**: Incumbent/baseline wins ties (prevents false positives)
- **CoT judges**: one CoT judge gives evaluation quality comparable to three standard judges, at ~40% cost savings
### Code/Objective Evaluation
For tasks with ground-truth evaluation (code, math, factual):
```python
import subprocess

def evaluate_code(solution: str, test_cases: list, timeout: int = 30):
    """Run a code solution against test cases in a subprocess with a timeout.

    Note: a subprocess contains crashes and hangs but is not a security sandbox.
    """
    results = {"public": [], "private": []}

    for test in test_cases:
        try:
            proc = subprocess.run(
                ["python3", "-c", solution],
                input=test["input"],
                capture_output=True,
                timeout=timeout,
                text=True
            )
            actual = proc.stdout.strip()
            expected = test["expected"].strip()
            passed = actual == expected
        except subprocess.TimeoutExpired:
            passed = False

        category = "public" if test.get("public") else "private"
        results[category].append(passed)

    return {
        "public_pass_rate": sum(results["public"]) / max(len(results["public"]), 1),
        "private_pass_rate": sum(results["private"]) / max(len(results["private"]), 1),
    }
```
### Compute-Matched Comparison
Always compare methods at equal compute budget. If your method uses N API calls, baselines get N calls too:
| Method | Call Budget | Allocation |
|--------|-----------|------------|
| Single pass | 6 calls | 6 independent generations |
| Critique & revise | 6 calls | 1 generate + 5 revise rounds |
| Autoreason | 6 calls | 1 generate + 1 analysis + 4 revisions |
| Best-of-N | 6 calls | 6 independent, pick best on public test |
### Human Evaluation Design
Many ML/NLP papers require human evaluation, especially for subjective tasks (text generation, summarization, dialogue, creative writing). Poorly designed human evals are a common rejection reason.
#### When Human Evaluation Is Required
| Task Type | Required? | Notes |
|-----------|-----------|-------|
| Text generation (open-ended) | Yes | LLM-as-judge alone is insufficient for acceptance at ACL/EMNLP |
| Summarization | Usually | At minimum for a subset of outputs |
| Dialogue systems | Yes | User studies or annotation |
| Code generation | No | Test suites are objective ground truth |
| Classification | No | Standard metrics suffice |
| Any task with subjective quality | Strongly recommended | Strengthens the paper significantly |
#### Annotation Protocol Design
```
Human Evaluation Protocol:
1. Define the evaluation dimensions (fluency, relevance, factual accuracy, etc.)
2. Create annotation guidelines with examples of each score level
3. Run a pilot with 2-3 annotators on 20-30 examples
4. Compute pilot inter-annotator agreement — if low, revise guidelines
5. Run full evaluation
6. Report: annotator count, agreement metrics, compensation, time per item
```
**Evaluation dimensions** (pick relevant subset):
| Dimension | Definition | Scale |
|-----------|-----------|-------|
| Fluency | Grammaticality and naturalness | 1-5 Likert |
| Relevance | Does it address the task? | 1-5 Likert |
| Factual accuracy | Are stated facts correct? | Binary or 1-5 |
| Coherence | Logical flow and consistency | 1-5 Likert |
| Informativeness | Does it provide useful information? | 1-5 Likert |
| Overall preference | Which output is better? | A/B/Tie (pairwise) |
**Pairwise comparison** (preferred over absolute scoring — more reliable):
- Present two outputs side-by-side (randomize left/right position)
- Ask: "Which is better? A / B / Tie"
- More discriminative and less susceptible to annotator calibration drift
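A minimal sketch of pairwise presentation with left/right randomization and a simple tally. The names `make_pairwise_item` and `tally_preferences` are hypothetical, and each annotator answer is assumed to be one of `"left"`, `"right"`, or `"tie"`.

```python
import random
from collections import Counter
from typing import Dict, List

def make_pairwise_item(output_a: str, output_b: str, rng: random.Random) -> Dict[str, str]:
    """Randomize which system appears on the left; keep the mapping for decoding."""
    flipped = rng.random() < 0.5
    left, right = (output_b, output_a) if flipped else (output_a, output_b)
    return {
        "left": left,
        "right": right,
        "left_system": "B" if flipped else "A",
        "right_system": "A" if flipped else "B",
    }

def tally_preferences(items: List[Dict[str, str]], answers: List[str]) -> Counter:
    """Decode 'left'/'right'/'tie' answers back to system names and count them."""
    counts: Counter = Counter()
    for item, answer in zip(items, answers):
        if answer == "tie":
            counts["Tie"] += 1
        else:
            counts[item[f"{answer}_system"]] += 1
    return counts
```

Storing the side-to-system mapping with each item means annotators never see system identities, while the analysis can still recover them.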
#### Inter-Annotator Agreement
Always report agreement metrics. Without them, reviewers assume your annotations are unreliable.
```python
# Krippendorff's alpha (preferred: handles missing data, any scale)
# pip install krippendorff
import numpy as np
import krippendorff

# Ratings: rows = annotators, columns = items, values = scores
# Missing ratings are encoded as np.nan
ratings = [
    [3, 4, 1, 2, 5, np.nan, 3],  # Annotator 1
    [3, 5, 1, 3, 5, 2, 3],       # Annotator 2
    [4, 4, 2, 2, 4, 2, np.nan],  # Annotator 3
]
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
# Interpretation: >0.80 good, 0.67-0.80 acceptable, <0.67 questionable
```
```python
# Cohen's kappa (for exactly 2 annotators, categorical data)
from sklearn.metrics import cohen_kappa_score
annotator_1 = [1, 2, 3, 1, 2, 3, 2]
annotator_2 = [1, 2, 2, 1, 3, 3, 2]
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
# Interpretation: >0.80 excellent, 0.60-0.80 substantial, 0.40-0.60 moderate
```
| Metric | When to Use | Annotators | Scale |
|--------|------------|-----------|-------|
| Krippendorff's alpha | Default choice | Any number | Any (ordinal, nominal, ratio) |
| Cohen's kappa | 2 annotators, categorical | Exactly 2 | Nominal/ordinal |
| Fleiss' kappa | 3+ annotators, categorical | 3+ | Nominal |
| Pearson/Spearman | Continuous scores | 2 | Interval/ratio |
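For three or more annotators with categorical labels, Fleiss' kappa can be computed directly from per-item category counts. A minimal pure-Python sketch; the function name is illustrative, and it assumes every item was rated by the same number of annotators.

```python
from collections import Counter
from typing import Hashable, List, Sequence

def fleiss_kappa(ratings_per_item: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa from a list of items, each a list of the labels annotators assigned."""
    n_items = len(ratings_per_item)
    n_raters = len(ratings_per_item[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings_per_item):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    per_item_agreement: List[float] = []
    for item in ratings_per_item:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of annotator pairs that agree on this item
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    # Chance agreement from the overall category proportions
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one category was ever used
    return (observed - expected) / (1 - expected)
```

Each inner list holds one item's labels, so three annotators rating two items would be passed as `[["pos", "pos", "neg"], ["neg", "neg", "neg"]]`.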
#### Crowdsourcing Platforms
| Platform | Best For | Cost | Quality |
|----------|----------|------|---------|
| **Prolific** | Academic research, higher quality | $8-15/hr | High — academic participant pool |
| **MTurk** | Large-scale, fast turnaround | $2-10/hr | Variable — use qualifications |
| **Surge AI** | NLP-specific annotations | Premium | High — trained annotators |
| **Expert annotators** | Domain-specific (medical, legal) | Highest | Highest — but slow |
**Ethics requirements**:
- Report compensation rate (must be at least the local minimum wage)
- Describe annotator demographics if relevant
- Obtain IRB/ethics approval if required by your institution
- ACL venues explicitly require compensation documentation
#### What to Report in the Paper
```
Human Evaluation Section Checklist:
- [ ] Number of annotators
- [ ] Annotator qualifications / recruitment method
- [ ] Number of items evaluated
- [ ] Evaluation dimensions with definitions
- [ ] Scale used (Likert, pairwise, binary)
- [ ] Inter-annotator agreement (Krippendorff's alpha or Cohen's kappa)
- [ ] Compensation rate
- [ ] Time per annotation item
- [ ] Whether annotators saw model identities (should be blind)
- [ ] Randomization of presentation order
```
---
## Statistical Analysis
### Required Tests
| Test | When to Use | Python |
|------|------------|--------|
| McNemar's test | Comparing two methods on same problems | `scipy.stats.binomtest` for small n |
| Two-proportion z-test | Comparing success rates | Custom or `statsmodels` |
| Fisher's exact test | Small sample pairwise comparison | `scipy.stats.fisher_exact` |
| Bootstrapped CI | Confidence intervals for any metric | Custom bootstrap |
| Cohen's h | Effect size for proportions | Manual calculation |
### Standard Analysis Script
```python
import json
from pathlib import Path

import numpy as np
from scipy import stats


def load_all_results(results_dir):
    """Load all results into a structured format."""
    results = {}
    for result_file in Path(results_dir).rglob("result.json"):
        parts = result_file.relative_to(results_dir).parts
        if len(parts) >= 3:
            experiment, task, strategy = parts[0], parts[1], parts[2]
            data = json.loads(result_file.read_text())
            results.setdefault(experiment, {}).setdefault(strategy, {})[task] = data
    return results


def pairwise_mcnemar(method_a_results, method_b_results):
    """McNemar's test for paired binary outcomes."""
    pairs = list(zip(method_a_results, method_b_results))
    a_win_b_lose = sum(1 for a, b in pairs if a and not b)
    b_win_a_lose = sum(1 for a, b in pairs if b and not a)
    n = a_win_b_lose + b_win_a_lose
    if n == 0:
        # No discordant pairs: binomtest rejects n=0, and there is no evidence of a difference
        p_value = 1.0
    elif n < 25:
        # Use exact binomial for small samples
        p_value = stats.binomtest(a_win_b_lose, n, 0.5).pvalue
    else:
        # Chi-squared approximation with continuity correction
        chi2 = (abs(a_win_b_lose - b_win_a_lose) - 1) ** 2 / n
        p_value = stats.chi2.sf(chi2, df=1)
    return {
        "a_wins": a_win_b_lose,
        "b_wins": b_win_a_lose,
        "n_discordant": n,
        "p_value": float(p_value),
        "significant": bool(p_value < 0.05),
    }


def bootstrap_ci(data, n_bootstrap=10000, ci=0.95, seed=None):
    """Bootstrap confidence interval for mean (pass a seed for reproducibility)."""
    rng = np.random.default_rng(seed)
    means = []
    for _ in range(n_bootstrap):
        sample = rng.choice(data, size=len(data), replace=True)
        means.append(np.mean(sample))
    lower = np.percentile(means, (1 - ci) / 2 * 100)
    upper = np.percentile(means, (1 + ci) / 2 * 100)
    return {"mean": np.mean(data), "ci_lower": lower, "ci_upper": upper}


def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
```
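
The Required Tests table marks the two-proportion z-test as "Custom or `statsmodels`". A minimal custom sketch using only the standard library might look like the following. The function name `two_proportion_z_test` and its return shape are assumptions for illustration, and the test assumes the two samples are independent; for two methods evaluated on the same problems, McNemar's test is the paired alternative.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two independent proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all successes or all failures: no detectable difference
        return {"z": 0.0, "p_value": 1.0}
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"z": z, "p_value": p_value}


# Illustrative counts only: 60/150 vs 57/150 successes
print(two_proportion_z_test(60, 150, 57, 150))
```

The normal approximation is unreliable for small counts; Fisher's exact test from the table is the safer choice there.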
### Reporting Standards
Always include in the paper:
- **Sample sizes**: n=X problems/tasks
- **Number of runs**: K independent runs if applicable
- **Error bars**: Specify standard deviation or standard error
- **Confidence intervals**: 95% CI for key results
- **Significance tests**: p-values for key comparisons
- **Effect sizes**: Cohen's d or h for practical significance
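
Because the error-bar item asks you to state whether bars show standard deviation or standard error, it helps to compute both explicitly. The helper below is a small sketch using only the standard library; the name `summarize_runs` and the returned keys are assumptions for illustration.

```python
import math
import statistics


def summarize_runs(scores):
    """Mean, sample standard deviation, and standard error over K independent runs."""
    k = len(scores)
    if k < 2:
        raise ValueError("need at least 2 runs to estimate spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std (divides by K - 1)
    sem = std / math.sqrt(k)  # standard error of the mean
    return {"n_runs": k, "mean": mean, "std": std, "sem": sem}


# Illustrative scores from three hypothetical runs
summary = summarize_runs([0.70, 0.72, 0.74])
print(f"{summary['mean']:.3f} (std {summary['std']:.3f}, SE {summary['sem']:.3f}, K={summary['n_runs']})")
```

Standard deviation describes run-to-run spread, while standard error describes uncertainty in the mean and shrinks as K grows, so the two are not interchangeable in a caption.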
---
## Monitoring (Cron Pattern)
### Cron Prompt Template
For each experiment batch, create a monitoring prompt:
```
Check the status of the [EXPERIMENT_NAME] experiment:
1. Process check: ps aux | grep [PROCESS_PATTERN]
2. Log check: tail -30 [LOG_FILE]
3. Results check: ls [RESULT_DIR]/eval/ (or appropriate result location)
4. If results are available:
- Read the result JSON files
- Report metrics in a table (Borda scores, accuracy, etc.)
- Compute key comparisons between methods
5. If all experiments in this batch are complete:
- git add -A && git commit -m "[COMMIT_MESSAGE]" && git push
- Report final summary
6. Key question: [SPECIFIC ANALYTICAL QUESTION]
If nothing has changed since the last check, respond with [SILENT].
```
### Monitoring Best Practices
1. **Check processes first** — don't read results if the experiment is still running and results are incomplete
2. **Read the log tail** — look for errors, progress indicators, completion messages
3. **Count completed vs expected** — "45/150 problems done" is more useful than "some results exist"
4. **Report in structured tables** — always include key metrics in a table
5. **Answer the key question** — each experiment should have a specific analytical question to answer when done
6. **[SILENT] for no-news** — suppress notifications when nothing has changed
7. **Commit on completion** — every completed batch gets committed with a descriptive message
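
The "count completed vs expected" practice can be sketched as a small helper. This assumes results are written as `result.json` files somewhere under a results directory, as in the analysis script in this document; the function name `progress_summary` and the output wording are hypothetical.

```python
from pathlib import Path


def progress_summary(results_dir, expected_total):
    """Count finished result files and format a one-line progress report."""
    done = sum(1 for _ in Path(results_dir).rglob("result.json"))
    pct = 100.0 * done / expected_total if expected_total else 0.0
    status = "COMPLETE" if done >= expected_total else "INCOMPLETE"
    return f"{done}/{expected_total} problems done ({pct:.1f}%) [{status}]"
```

An INCOMPLETE status only says that files are missing; it does not distinguish a running experiment from a crashed one, so the process check still comes first.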
### Example Monitoring Report
```
## Code Experiments (Haiku 3.5) - COMPLETE
| Strategy | Pass Rate (150 problems) | vs Single |
|----------|------------------------|-----------|
| single_pass | 38.0% | — |
| critique_revise | 35.2% | -2.8pp |
| **autoreason** | **40.0%** | **+2.0pp** |
| best_of_6 | 31.0% | -7.0pp |
Key finding: Autoreason shows +2pp improvement over single pass, while
best-of-6 collapses due to single-public-test selection issue.
Committed: `git commit -m "Add Haiku code results (150 problems, 4 strategies)"`
Next: Run significance tests on these results.
```
---
## Failure Recovery
### Common Failures and Recovery
| Failure | Detection | Recovery |
|---------|-----------|----------|
| **API credit exhaustion** | 402 errors in logs, incomplete results | Top up credits, re-run (skips completed work automatically) |
| **Rate limiting** | 429 errors, slow progress | Add retry logic with exponential backoff |
| **Process crash** | PID gone, log stops mid-problem | Re-run script (resumes from last checkpoint) |
| **Wrong model ID** | Model not found errors | Fix ID (e.g., `claude-opus-4-6` not `claude-opus-4.6`) |
| **Parallel slowdown** | Each experiment taking 2x longer | Reduce parallel experiments to 2-3 max |
| **Security scan blocks** | Commands blocked by security | Use `execute_code` instead of piped `terminal` commands |
| **Delegation failures** | `delegate_task` returns errors | Fall back to doing work directly |
| **Timeout on hard problems** | Process stuck, no log progress | Kill, skip problem, note in results |
| **Dataset path mismatch** | File not found errors | Verify paths before launching |
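
For the rate-limiting row, retry logic with exponential backoff can be sketched as below. `RateLimitError` is a hypothetical stand-in for whatever exception your API client raises on a 429, and the defaults (5 retries, 1 s base delay, 60 s cap, up to 10% jitter) are illustrative assumptions rather than recommended values.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the exception your API client raises on HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Small random jitter keeps parallel workers from retrying in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so the function can be exercised in tests without actually waiting.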
### Retry Naming Convention
When re-running failed experiments, use a suffix to track rounds:
```
logs/experiment_haiku_0_50.log # Round 1
logs/experiment_haiku_0_50_r2.log # Round 2 (after credit exhaustion)
logs/experiment_haiku_0_50_r3.log # Round 3 (after bug fix)
```
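
One way to apply this naming convention automatically is to pick the first log path that does not exist yet. The helper below is a sketch; the name `next_log_path` and its arguments are assumptions for illustration.

```python
from pathlib import Path


def next_log_path(log_dir, stem):
    """Return the first unused path: stem.log, then stem_r2.log, stem_r3.log, ..."""
    log_dir = Path(log_dir)
    candidate = log_dir / f"{stem}.log"
    round_number = 1
    while candidate.exists():
        round_number += 1
        candidate = log_dir / f"{stem}_r{round_number}.log"
    return candidate
```

Calling `next_log_path("logs", "experiment_haiku_0_50")` after two earlier rounds would return `logs/experiment_haiku_0_50_r3.log`, so a re-run never overwrites a previous log.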
### Pre-Flight Checklist
Before launching any experiment batch:
```
Pre-Flight:
- [ ] API credits sufficient for estimated calls
- [ ] Model IDs correct (test with 1 problem first)
- [ ] Output directory exists and is writable
- [ ] Resume logic works (re-run won't overwrite existing results)
- [ ] Log file path is unique (won't overwrite previous logs)
- [ ] Dataset/task files are accessible
- [ ] Config matches intended experiment
```
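
The filesystem items on this checklist can be checked mechanically before launch. The sketch below covers only the output directory, log path, and dataset files; credits, model IDs, and resume logic still need a manual or one-problem test run. The function name `preflight_problems` and its arguments are hypothetical.

```python
import os
from pathlib import Path


def preflight_problems(output_dir, log_path, dataset_paths):
    """Return a list of human-readable problems; an empty list means these checks passed."""
    problems = []
    out = Path(output_dir)
    if not out.is_dir():
        problems.append(f"output directory missing: {out}")
    elif not os.access(out, os.W_OK):
        problems.append(f"output directory not writable: {out}")
    if Path(log_path).exists():
        # A unique log path avoids overwriting logs from earlier rounds
        problems.append(f"log file already exists: {log_path}")
    for path in dataset_paths:
        if not Path(path).is_file():
            problems.append(f"dataset file not found: {path}")
    return problems
```

A launcher script could refuse to start when the returned list is non-empty and print each problem on its own line.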
---
## Task/Benchmark Design
### Open-Ended Tasks (Subjective Evaluation)
Design tasks that have clear objectives but subjective quality:
```markdown
# Task: [Title]
## Context
[Specific scenario with concrete details: company size, constraints, timeline]
## Deliverable
[Exact format and structure required]
## Requirements
- [Specific, measurable requirements]
- [Not vague — "be comprehensive" is bad, "include exactly 6 sections" is good]
```
### Constrained Tasks (for Testing Scope Effects)
Constrained tasks test whether methods respect scope boundaries. Design with:
- **Fixed facts**: "Use only these N data points, add nothing else"
- **Fixed deliverable**: Specific format (pitch, postmortem, memo — not "improve this")
- **Fixed structure**: "These sections in this order, do not add/remove"
- **Fixed change items**: "Address exactly these N points, nothing else"
**Do NOT use word count as a scope constraint.** Word limits cause false convergence — outputs get rejected for length, not quality. Constrain scope (what to include) not length.
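
A "fixed structure" constraint is easy to verify automatically, which is part of why it works better than a word limit. The following sketch checks that a Markdown output contains exactly the required `##` sections in the required order. The function name `check_sections`, the heading level, and the section names in the usage lines are assumptions for illustration.

```python
import re


def check_sections(markdown_text, required_sections):
    """Compare the '## ' headings in an output against the required list, in order."""
    # Match second-level headings only; '###' and deeper are ignored
    found = re.findall(r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE)
    missing = [s for s in required_sections if s not in found]
    extra = [s for s in found if s not in required_sections]
    return {"ok": found == list(required_sections), "missing": missing, "extra": extra}


output = "# Postmortem\n## Summary\n...\n## Timeline\n...\n## Action Items\n...\n"
print(check_sections(output, ["Summary", "Timeline", "Action Items"]))
```

A check like this judges scope compliance only; quality still needs a separate evaluation.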
### Example: Good vs Bad Constraints
| Bad Constraint | Why | Good Constraint |
|---------------|-----|-----------------|
| "Max 500 words" | Judges reject for length | "Exactly 4 sections, each with 3 numbered items" |
| "Be concise" | Too vague | "Each prohibition must reference a specific base fact" |
| "Improve this" | Unbounded scope | "Write a 600-word incident postmortem with this exact structure" |
| "Make it better" | No clear criterion | "Address exactly these 3 reviewer concerns" |
---
## Visualization Best Practices
### Setup: SciencePlots + matplotlib
Install SciencePlots for publication-ready defaults:
```bash
pip install SciencePlots matplotlib numpy
```
**Option A: SciencePlots styles** (recommended — handles most defaults automatically):
```python
import matplotlib.pyplot as plt
import scienceplots # registers the styles
# Pick a style (pass several names as a list to combine them):
#   ['science']            clean, serif fonts, suitable for most venues
#   ['science', 'ieee']    IEEE-style (good for two-column papers)
#   ['science', 'nature']  Nature-style
# Add 'no-latex' if LaTeX is not installed on the machine generating plots
with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # single-column width
    # ... plot ...
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')
```
**Option B: Manual rcParams** (when you need full control):
```python
import matplotlib.pyplot as plt
plt.rcParams.update({
    'font.size': 10,
    'font.family': 'serif',
    'axes.labelsize': 11,
    'axes.titlesize': 11,
    'xtick.labelsize': 9,
    'ytick.labelsize': 9,
    'legend.fontsize': 9,
    'figure.figsize': (3.5, 2.5),  # single-column default
    'figure.dpi': 300,
    'savefig.dpi': 300,
    'savefig.bbox': 'tight',
    'savefig.pad_inches': 0.05,
    'axes.linewidth': 0.8,
    'lines.linewidth': 1.5,
    'lines.markersize': 5,
    'axes.grid': True,
    'grid.alpha': 0.3,
    'grid.linewidth': 0.5,
})
```
### Standard Figure Sizes (Two-Column Format)
| Use Case | figsize | Notes |
|----------|---------|-------|
| Single column | `(3.5, 2.5)` | Fits in one column of two-column layout |
| Double column | `(7.0, 3.0)` | Spans full page width |
| Square (heatmap, confusion matrix) | `(3.5, 3.5)` | Single column |
| Tall single (many rows) | `(3.5, 5.0)` | Use sparingly |
### Colorblind-Safe Palette (Okabe-Ito)
Use this palette for all paper figures. It is distinguishable by people with all common forms of color vision deficiency:
```python
COLORS = {
    'blue': '#0072B2',
    'orange': '#E69F00',
    'green': '#009E73',
    'red': '#D55E00',
    'purple': '#CC79A7',
    'cyan': '#56B4E9',
    'yellow': '#F0E442',
    'black': '#000000',
}
# As a list for cycling:
COLOR_CYCLE = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9']
```
Also differentiate lines by **marker and linestyle**, not just color:
```python
STYLES = [
    {'color': '#0072B2', 'marker': 'o', 'linestyle': '-'},
    {'color': '#D55E00', 'marker': 's', 'linestyle': '--'},
    {'color': '#009E73', 'marker': '^', 'linestyle': '-.'},
    {'color': '#E69F00', 'marker': 'D', 'linestyle': ':'},
]
```
### Complete Example: Method Comparison Bar Chart
```python
import matplotlib.pyplot as plt
import numpy as np

try:
    import scienceplots
    style = ['science', 'no-latex']
except ImportError:
    style = 'default'

with plt.style.context(style):
    methods = ['Single Pass', 'Critique+Revise', 'Best-of-N', 'Ours']
    scores = [73.2, 74.1, 68.5, 77.0]
    errors = [2.1, 1.8, 3.2, 1.5]
    colors = ['#56B4E9', '#E69F00', '#CC79A7', '#0072B2']
    fig, ax = plt.subplots(figsize=(3.5, 2.5))
    bars = ax.bar(methods, scores, yerr=errors, capsize=3,
                  color=colors, edgecolor='black', linewidth=0.5)
    # Highlight "Ours"
    bars[-1].set_edgecolor('#0072B2')
    bars[-1].set_linewidth(1.5)
    ax.set_ylabel('Pass Rate (%)')
    ax.set_ylim(60, 85)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    fig.savefig('paper/fig_comparison.pdf', bbox_inches='tight')
```
### Complete Example: Convergence/Trajectory Line Chart
```python
# Reuses plt, np, and style from the bar chart example, and STYLES from the palette section
with plt.style.context(style):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))
    passes = np.arange(1, 16)
    ours = [65, 72, 78, 82, 85, 87, 88, 89, 89.5, 90, 90, 90, 90, 90, 90]
    baseline = [65, 68, 70, 71, 69, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58]
    ax.plot(passes, ours, **STYLES[0], label='Ours', markersize=4)
    ax.plot(passes, baseline, **STYLES[1], label='Critique+Revise', markersize=4)
    # Mark convergence point
    ax.axvline(x=10, color='gray', linestyle=':', alpha=0.5, linewidth=0.8)
    ax.annotate('Converged', xy=(10, 90), fontsize=8, ha='center',
                xytext=(10, 93), arrowprops=dict(arrowstyle='->', color='gray'))
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Quality Score')
    ax.legend(loc='lower right')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    fig.savefig('paper/fig_trajectory.pdf', bbox_inches='tight')
```
### Output Rules
- **Always save as PDF**: `fig.savefig('fig.pdf')` — vector graphics, sharp at any zoom
- **Never save as PNG** for paper figures — raster PNGs look blurry when printed/zoomed
- **Exception**: Screenshots, photographs, or pixel-art visualizations → PNG at 600 DPI
- **Verify grayscale**: Print to grayscale PDF and check all information is still visible
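
As a quick pre-check before printing to a grayscale PDF, you can estimate how each palette color will render in gray and flag pairs that land too close together. This is a rough sketch, not a substitute for the visual check: it uses the common Rec. 601 luma weights on gamma-encoded sRGB values, and the function names and the `min_gap=0.1` threshold are assumptions chosen for illustration.

```python
def gray_level(hex_color):
    """Approximate grayscale value (0 = black, 1 = white) using Rec. 601 luma weights."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b


def too_similar_in_gray(colors, min_gap=0.1):
    """Return neighboring pairs (after sorting by gray level) closer than min_gap."""
    levels = sorted((gray_level(c), c) for c in colors)
    return [(a, b) for (la, a), (lb, b) in zip(levels, levels[1:]) if lb - la < min_gap]


palette = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9']
for pair in too_similar_in_gray(palette):
    print("Hard to tell apart in grayscale:", pair)
```

Any flagged pair is a reason to rely on distinct markers and linestyles for those series rather than on color alone.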
### Chart Types for Common Comparisons
| Comparison Type | Chart | Notes |
|----------------|-------|-------|
| Method vs method | Grouped bar chart | Include error bars |
| Across model sizes | Line chart with CI bands | Log scale for model size axis |
| Ablation study | Stacked/grouped bar | Highlight removed component |
| Trajectory/convergence | Line chart over iterations | Show winner per iteration |
| Per-task breakdown | Heatmap or grouped bar | Show variance across tasks |
@@ -105,7 +105,7 @@ Reviewers are explicitly instructed to:
- Penalizing authors for honest limitation acknowledgment
- Rejecting for missing citations to reviewer's own work
### Timeline (NeurIPS 2025)
### Timeline (NeurIPS 2025 — verify dates for current year)
- Bidding: May 17-21
- Reviewing period: May 29 - July 2
@@ -113,6 +113,8 @@ Reviewers are explicitly instructed to:
- Discussion period: July 31 - August 13
- Final notifications: September 18
> **Note**: These dates are from the 2025 cycle. Always check the current year's call for papers at the venue website.
---
## ICML Reviewer Guidelines
@@ -198,6 +200,70 @@ ACL has a dedicated ethics review process for:
---
## AAAI Reviewer Guidelines
### Evaluation Criteria
AAAI reviewers evaluate along similar axes to NeurIPS/ICML but with some differences:
| Criterion | Weight | Notes |
|-----------|--------|-------|
| **Technical quality** | High | Soundness of approach, correctness of results |
| **Significance** | High | Importance of the problem and contribution |
| **Novelty** | Medium-High | New ideas, methods, or insights |
| **Clarity** | Medium | Clear writing, well-organized presentation |
| **Reproducibility** | Medium | Sufficient detail to reproduce results |
### AAAI-Specific Considerations
- **Broader AI scope**: AAAI covers all of AI, not just ML. Papers on planning, reasoning, knowledge representation, NLP, vision, robotics, and multi-agent systems are all in scope. Reviewers may not be deep ML specialists.
- **Formatting strictness**: AAAI reviewers are instructed to flag formatting violations. Non-compliant papers may be desk-rejected before review.
- **Application papers**: AAAI is more receptive to application-focused work than NeurIPS/ICML. Framing a strong application contribution is viable.
- **Senior Program Committee**: AAAI uses SPCs (Senior Program Committee members) who mediate between reviewers and make accept/reject recommendations.
### Scoring (AAAI Scale)
- **Strong Accept**: Clearly above threshold, excellent contribution
- **Accept**: Above threshold, good contribution with minor issues
- **Weak Accept**: Borderline, merits outweigh concerns
- **Weak Reject**: Borderline, concerns outweigh merits
- **Reject**: Below threshold, significant issues
- **Strong Reject**: Well below threshold
---
## COLM Reviewer Guidelines
### Evaluation Criteria
COLM reviews focus on relevance to language modeling in addition to standard criteria:
| Criterion | Weight | Notes |
|-----------|--------|-------|
| **Relevance** | High | Must be relevant to language modeling community |
| **Technical quality** | High | Sound methodology, well-supported claims |
| **Novelty** | Medium-High | New insights about language models |
| **Clarity** | Medium | Clear presentation, reproducible |
| **Significance** | Medium-High | Impact on LM research and practice |
### COLM-Specific Considerations
- **Language model focus**: Reviewers will assess whether the contribution advances understanding of language models. General ML contributions need explicit LM framing.
- **Newer venue norms**: COLM is newer than NeurIPS/ICML, so reviewer calibration varies more. Write more defensively — anticipate a wider range of reviewer expertise.
- **ICLR-derived process**: Review process is modeled on ICLR (open reviews, author response period, discussion among reviewers).
- **Broad interpretation of "language modeling"**: Includes training, evaluation, alignment, safety, efficiency, applications, theory, multimodality (if language is central), and social impact of LMs.
### Scoring
COLM uses an ICLR-style scoring system:
- **8-10**: Strong accept (top papers)
- **6-7**: Weak accept (solid contribution)
- **5**: Borderline
- **3-4**: Weak reject (below threshold)
- **1-2**: Strong reject
---
## What Makes Reviews Strong
### Following Daniel Dennett's Rules
@@ -225,8 +225,6 @@ Provide context before asking the reader to consider anything new. This applies
---
---
## Micro-Level Writing Tips
### From Ethan Perez (Anthropic)
-37
@@ -825,43 +825,6 @@ class TestLastPromptTokens:
store.update_session("k1", last_prompt_tokens=0)
assert entry.last_prompt_tokens == 0
def test_update_session_passes_model_to_db(self, tmp_path):
"""Gateway session updates should forward the resolved model to SQLite."""
config = GatewayConfig()
with patch("gateway.session.SessionStore._ensure_loaded"):
store = SessionStore(sessions_dir=tmp_path, config=config)
store._loaded = True
store._save = MagicMock()
store._db = MagicMock()
from gateway.session import SessionEntry
from datetime import datetime
entry = SessionEntry(
session_key="k1",
session_id="s1",
created_at=datetime.now(),
updated_at=datetime.now(),
)
store._entries = {"k1": entry}
store.update_session("k1", model="openai/gpt-5.4")
store._db.set_token_counts.assert_called_once_with(
"s1",
input_tokens=0,
output_tokens=0,
cache_read_tokens=0,
cache_write_tokens=0,
estimated_cost_usd=None,
cost_status=None,
cost_source=None,
billing_provider=None,
billing_base_url=None,
model="openai/gpt-5.4",
absolute=True,
)
class TestRewriteTranscriptPreservesReasoning:
"""rewrite_transcript must not drop reasoning fields from SQLite."""
-10
@@ -126,15 +126,5 @@ async def test_handle_message_persists_agent_token_counts(monkeypatch):
assert result == "ok"
runner.session_store.update_session.assert_called_once_with(
session_entry.session_key,
input_tokens=120,
output_tokens=45,
cache_read_tokens=0,
cache_write_tokens=0,
last_prompt_tokens=80,
model="openai/test-model",
estimated_cost_usd=None,
cost_status=None,
cost_source=None,
provider=None,
base_url=None,
)
+95
@@ -11,6 +11,7 @@ from agent.prompt_caching import apply_anthropic_cache_control
from agent.anthropic_adapter import (
_is_oauth_token,
_refresh_oauth_token,
_to_plain_data,
_write_claude_code_credentials,
build_anthropic_client,
build_anthropic_kwargs,
@@ -742,6 +743,33 @@ class TestConvertMessages:
assert tool_block["content"] == "result"
assert tool_block["cache_control"] == {"type": "ephemeral"}
def test_preserved_thinking_blocks_are_rehydrated_before_tool_use(self):
messages = [
{
"role": "assistant",
"content": "",
"tool_calls": [
{"id": "tc_1", "function": {"name": "test_tool", "arguments": "{}"}},
],
"reasoning_details": [
{
"type": "thinking",
"thinking": "Need to inspect the tool result first.",
"signature": "sig_123",
}
],
},
{"role": "tool", "tool_call_id": "tc_1", "content": "tool output"},
]
_, result = convert_messages_to_anthropic(messages)
assistant_blocks = next(msg for msg in result if msg["role"] == "assistant")["content"]
assert assistant_blocks[0]["type"] == "thinking"
assert assistant_blocks[0]["thinking"] == "Need to inspect the tool result first."
assert assistant_blocks[0]["signature"] == "sig_123"
assert assistant_blocks[1]["type"] == "tool_use"
def test_converts_data_url_image_to_anthropic_image_block(self):
messages = [
{
@@ -1079,6 +1107,59 @@ class TestGetAnthropicMaxOutput:
assert _get_anthropic_max_output("claude-3-5-sonnet-20241022") == 8_192
# ---------------------------------------------------------------------------
# _to_plain_data hardening
# ---------------------------------------------------------------------------
class TestToPlainData:
def test_simple_dict(self):
assert _to_plain_data({"a": 1, "b": [2, 3]}) == {"a": 1, "b": [2, 3]}
def test_pydantic_like_model_dump(self):
class FakeModel:
def model_dump(self):
return {"type": "thinking", "thinking": "hello"}
result = _to_plain_data(FakeModel())
assert result == {"type": "thinking", "thinking": "hello"}
def test_circular_reference_does_not_recurse_forever(self):
"""Circular dict reference should be stringified, not infinite-loop."""
d: dict = {"key": "value"}
d["self"] = d # circular
result = _to_plain_data(d)
assert isinstance(result, dict)
assert result["key"] == "value"
assert isinstance(result["self"], str)
def test_shared_sibling_objects_are_not_falsely_detected_as_cycles(self):
"""Two siblings referencing the same dict must both be converted."""
shared = {"type": "thinking", "thinking": "reason"}
parent = {"a": shared, "b": shared}
result = _to_plain_data(parent)
assert isinstance(result["a"], dict)
assert isinstance(result["b"], dict)
assert result["a"] == {"type": "thinking", "thinking": "reason"}
def test_deep_nesting_is_capped(self):
deep = "leaf"
for _ in range(25):
deep = {"nested": deep}
result = _to_plain_data(deep)
assert isinstance(result, dict)
def test_plain_values_pass_through(self):
assert _to_plain_data("hello") == "hello"
assert _to_plain_data(42) == 42
assert _to_plain_data(None) is None
def test_object_with_dunder_dict(self):
obj = SimpleNamespace(type="thinking", thinking="reason", signature="sig")
result = _to_plain_data(obj)
assert result == {"type": "thinking", "thinking": "reason", "signature": "sig"}
# ---------------------------------------------------------------------------
# Response normalization
# ---------------------------------------------------------------------------
@@ -1126,6 +1207,20 @@ class TestNormalizeResponse:
msg, reason = normalize_anthropic_response(self._make_response(blocks))
assert msg.content == "The answer is 42."
assert msg.reasoning == "Let me reason about this..."
assert msg.reasoning_details == [{"type": "thinking", "thinking": "Let me reason about this..."}]
def test_thinking_response_preserves_signature(self):
blocks = [
SimpleNamespace(
type="thinking",
thinking="Let me reason about this...",
signature="opaque_signature",
redacted=False,
),
]
msg, _ = normalize_anthropic_response(self._make_response(blocks))
assert msg.reasoning_details[0]["signature"] == "opaque_signature"
assert msg.reasoning_details[0]["thinking"] == "Let me reason about this..."
def test_stop_reason_mapping(self):
block = SimpleNamespace(type="text", text="x")
+424
@@ -0,0 +1,424 @@
"""Tests for per-turn primary runtime restoration and transport recovery.
Verifies that:
1. Fallback is turn-scoped: a new turn restores the primary model/provider
2. The fallback chain index resets so all fallbacks are available again
3. Context compressor state is restored alongside the runtime
4. Transient transport errors get one recovery cycle before fallback
5. Recovery is skipped for aggregator providers (OpenRouter, Nous)
6. Non-transport errors don't trigger recovery
"""
import time
from types import SimpleNamespace
from unittest.mock import MagicMock, patch, PropertyMock
import pytest
from run_agent import AIAgent
def _make_tool_defs(*names: str) -> list:
return [
{
"type": "function",
"function": {
"name": n,
"description": f"{n} tool",
"parameters": {"type": "object", "properties": {}},
},
}
for n in names
]
def _make_agent(fallback_model=None, provider="custom", base_url="https://my-llm.example.com/v1"):
"""Create a minimal AIAgent with optional fallback config."""
with (
patch("run_agent.get_tool_definitions", return_value=_make_tool_defs("web_search")),
patch("run_agent.check_toolset_requirements", return_value={}),
patch("run_agent.OpenAI"),
):
agent = AIAgent(
api_key="test-key-12345678",
base_url=base_url,
provider=provider,
quiet_mode=True,
skip_context_files=True,
skip_memory=True,
fallback_model=fallback_model,
)
agent.client = MagicMock()
return agent
def _mock_resolve(base_url="https://openrouter.ai/api/v1", api_key="fallback-key-1234"):
"""Helper to create a mock client for resolve_provider_client."""
mock_client = MagicMock()
mock_client.api_key = api_key
mock_client.base_url = base_url
return mock_client
# =============================================================================
# _primary_runtime snapshot
# =============================================================================
class TestPrimaryRuntimeSnapshot:
def test_snapshot_created_at_init(self):
agent = _make_agent()
assert hasattr(agent, "_primary_runtime")
rt = agent._primary_runtime
assert rt["model"] == agent.model
assert rt["provider"] == "custom"
assert rt["base_url"] == "https://my-llm.example.com/v1"
assert rt["api_mode"] == agent.api_mode
assert "client_kwargs" in rt
assert "compressor_context_length" in rt
def test_snapshot_includes_compressor_state(self):
agent = _make_agent()
rt = agent._primary_runtime
cc = agent.context_compressor
assert rt["compressor_model"] == cc.model
assert rt["compressor_provider"] == cc.provider
assert rt["compressor_context_length"] == cc.context_length
assert rt["compressor_threshold_tokens"] == cc.threshold_tokens
def test_snapshot_includes_anthropic_state_when_applicable(self):
"""Anthropic-mode agents should snapshot Anthropic-specific state."""
with (
patch("run_agent.get_tool_definitions", return_value=_make_tool_defs("web_search")),
patch("run_agent.check_toolset_requirements", return_value={}),
patch("run_agent.OpenAI"),
patch("agent.anthropic_adapter.build_anthropic_client", return_value=MagicMock()),
):
agent = AIAgent(
api_key="sk-ant-test-12345678",
base_url="https://api.anthropic.com",
provider="anthropic",
api_mode="anthropic_messages",
quiet_mode=True,
skip_context_files=True,
skip_memory=True,
)
rt = agent._primary_runtime
assert "anthropic_api_key" in rt
assert "anthropic_base_url" in rt
assert "is_anthropic_oauth" in rt
def test_snapshot_omits_anthropic_for_openai_mode(self):
agent = _make_agent(provider="custom")
rt = agent._primary_runtime
assert "anthropic_api_key" not in rt
# =============================================================================
# _restore_primary_runtime()
# =============================================================================
class TestRestorePrimaryRuntime:
def test_noop_when_not_fallback(self):
agent = _make_agent()
assert agent._fallback_activated is False
assert agent._restore_primary_runtime() is False
def test_restores_model_and_provider(self):
agent = _make_agent(
fallback_model={"provider": "openrouter", "model": "anthropic/claude-sonnet-4"},
)
original_model = agent.model
original_provider = agent.provider
# Simulate fallback activation
mock_client = _mock_resolve()
with patch("agent.auxiliary_client.resolve_provider_client", return_value=(mock_client, None)):
agent._try_activate_fallback()
assert agent._fallback_activated is True
assert agent.model == "anthropic/claude-sonnet-4"
assert agent.provider == "openrouter"
# Restore should bring back the primary
with patch("run_agent.OpenAI", return_value=MagicMock()):
result = agent._restore_primary_runtime()
assert result is True
assert agent._fallback_activated is False
assert agent.model == original_model
assert agent.provider == original_provider
def test_resets_fallback_index(self):
"""After restore, the full fallback chain should be available again."""
agent = _make_agent(
fallback_model=[
{"provider": "openrouter", "model": "model-a"},
{"provider": "anthropic", "model": "model-b"},
],
)
# Advance through the chain
mock_client = _mock_resolve()
with patch("agent.auxiliary_client.resolve_provider_client", return_value=(mock_client, None)):
agent._try_activate_fallback()
assert agent._fallback_index == 1 # consumed one entry
with patch("run_agent.OpenAI", return_value=MagicMock()):
agent._restore_primary_runtime()
assert agent._fallback_index == 0 # reset for next turn
def test_restores_compressor_state(self):
agent = _make_agent(
fallback_model={"provider": "openrouter", "model": "anthropic/claude-sonnet-4"},
)
original_ctx_len = agent.context_compressor.context_length
original_threshold = agent.context_compressor.threshold_tokens
# Simulate fallback modifying compressor
mock_client = _mock_resolve()
with patch("agent.auxiliary_client.resolve_provider_client", return_value=(mock_client, None)):
agent._try_activate_fallback()
# Manually simulate compressor being changed (as _try_activate_fallback does)
agent.context_compressor.context_length = 32000
agent.context_compressor.threshold_tokens = 25600
with patch("run_agent.OpenAI", return_value=MagicMock()):
agent._restore_primary_runtime()
assert agent.context_compressor.context_length == original_ctx_len
assert agent.context_compressor.threshold_tokens == original_threshold
def test_restores_prompt_caching_flag(self):
agent = _make_agent()
original_caching = agent._use_prompt_caching
# Simulate fallback changing the caching flag
agent._fallback_activated = True
agent._use_prompt_caching = not original_caching
with patch("run_agent.OpenAI", return_value=MagicMock()):
agent._restore_primary_runtime()
assert agent._use_prompt_caching == original_caching
def test_restore_survives_exception(self):
"""If client rebuild fails, the method returns False gracefully."""
agent = _make_agent()
agent._fallback_activated = True
with patch("run_agent.OpenAI", side_effect=Exception("connection refused")):
result = agent._restore_primary_runtime()
assert result is False
# =============================================================================
# _try_recover_primary_transport()
# =============================================================================
def _make_transport_error(error_type="ReadTimeout"):
"""Create an exception whose type().__name__ matches the given name."""
cls = type(error_type, (Exception,), {})
return cls("connection timed out")
class TestTryRecoverPrimaryTransport:
def test_recovers_on_read_timeout(self):
agent = _make_agent(provider="custom")
error = _make_transport_error("ReadTimeout")
with patch("run_agent.OpenAI", return_value=MagicMock()), \
patch("time.sleep"):
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is True
def test_recovers_on_connect_timeout(self):
agent = _make_agent(provider="custom")
error = _make_transport_error("ConnectTimeout")
with patch("run_agent.OpenAI", return_value=MagicMock()), \
patch("time.sleep"):
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is True
def test_recovers_on_pool_timeout(self):
agent = _make_agent(provider="zai")
error = _make_transport_error("PoolTimeout")
with patch("run_agent.OpenAI", return_value=MagicMock()), \
patch("time.sleep"):
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is True
def test_skipped_when_already_on_fallback(self):
agent = _make_agent(provider="custom")
agent._fallback_activated = True
error = _make_transport_error("ReadTimeout")
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is False
def test_skipped_for_non_transport_error(self):
"""Non-transport errors (ValueError, APIError, etc.) skip recovery."""
agent = _make_agent(provider="custom")
error = ValueError("invalid model")
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is False
def test_skipped_for_openrouter(self):
agent = _make_agent(provider="openrouter", base_url="https://openrouter.ai/api/v1")
error = _make_transport_error("ReadTimeout")
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is False
def test_skipped_for_nous_provider(self):
agent = _make_agent(provider="nous", base_url="https://inference.nous.nousresearch.com/v1")
error = _make_transport_error("ReadTimeout")
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is False
def test_allowed_for_anthropic_direct(self):
"""Direct Anthropic endpoint should get recovery."""
agent = _make_agent(provider="anthropic", base_url="https://api.anthropic.com")
# Outside the anthropic_messages api_mode, recovery rebuilds an OpenAI client
error = _make_transport_error("ConnectError")
with patch("run_agent.OpenAI", return_value=MagicMock()), \
patch("time.sleep"):
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is True
def test_allowed_for_ollama(self):
agent = _make_agent(provider="ollama", base_url="http://localhost:11434/v1")
error = _make_transport_error("ConnectTimeout")
with patch("run_agent.OpenAI", return_value=MagicMock()), \
patch("time.sleep"):
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is True
def test_wait_time_scales_with_retry_count(self):
agent = _make_agent(provider="custom")
error = _make_transport_error("ReadTimeout")
with patch("run_agent.OpenAI", return_value=MagicMock()), \
patch("time.sleep") as mock_sleep:
agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
# wait_time = min(3 + retry_count, 8) = min(6, 8) = 6
mock_sleep.assert_called_once_with(6)
def test_wait_time_capped_at_8(self):
agent = _make_agent(provider="custom")
error = _make_transport_error("ReadTimeout")
with patch("run_agent.OpenAI", return_value=MagicMock()), \
patch("time.sleep") as mock_sleep:
agent._try_recover_primary_transport(
error, retry_count=10, max_retries=3,
)
# wait_time = min(3 + 10, 8) = 8
mock_sleep.assert_called_once_with(8)
def test_closes_existing_client_before_rebuild(self):
agent = _make_agent(provider="custom")
old_client = agent.client
error = _make_transport_error("ReadTimeout")
with patch("run_agent.OpenAI", return_value=MagicMock()), \
patch("time.sleep"), \
patch.object(agent, "_close_openai_client") as mock_close:
agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
mock_close.assert_called_once_with(
old_client, reason="primary_recovery", shared=True,
)
def test_survives_rebuild_failure(self):
"""If client rebuild fails, returns False gracefully."""
agent = _make_agent(provider="custom")
error = _make_transport_error("ReadTimeout")
with patch("run_agent.OpenAI", side_effect=Exception("socket error")), \
patch("time.sleep"):
result = agent._try_recover_primary_transport(
error, retry_count=3, max_retries=3,
)
assert result is False
# =============================================================================
# Integration: restore_primary_runtime called from run_conversation
# =============================================================================
class TestRestoreInRunConversation:
"""Verify the hook in run_conversation() calls _restore_primary_runtime."""
def test_restore_called_at_turn_start(self):
agent = _make_agent()
agent._fallback_activated = True
with patch.object(agent, "_restore_primary_runtime", return_value=True) as mock_restore, \
patch.object(agent, "run_conversation", wraps=None) as _:
# Running the full conversation is impractical here, so this only confirms
# that the patched hook is callable; it does not exercise run_conversation().
agent._restore_primary_runtime()
mock_restore.assert_called_once()
def test_full_cycle_fallback_then_restore(self):
"""Simulate: turn 1 activates fallback, turn 2 restores primary."""
agent = _make_agent(
fallback_model={"provider": "openrouter", "model": "anthropic/claude-sonnet-4"},
provider="custom",
)
# Turn 1: activate fallback
mock_client = _mock_resolve()
with patch("agent.auxiliary_client.resolve_provider_client", return_value=(mock_client, None)):
assert agent._try_activate_fallback() is True
assert agent._fallback_activated is True
assert agent.model == "anthropic/claude-sonnet-4"
assert agent.provider == "openrouter"
assert agent._fallback_index == 1
# Turn 2: restore primary
with patch("run_agent.OpenAI", return_value=MagicMock()):
assert agent._restore_primary_runtime() is True
assert agent._fallback_activated is False
assert agent._fallback_index == 0
assert agent.provider == "custom"
assert agent.base_url == "https://my-llm.example.com/v1"
+78 -1
@@ -169,13 +169,21 @@ def _mock_tool_call(name="web_search", arguments="{}", call_id=None):
def _mock_response(
-content="Hello", finish_reason="stop", tool_calls=None, reasoning=None, usage=None
+content="Hello",
+finish_reason="stop",
+tool_calls=None,
+reasoning=None,
+reasoning_content=None,
+reasoning_details=None,
+usage=None,
):
"""Return a SimpleNamespace mimicking an OpenAI ChatCompletion response."""
msg = _mock_assistant_msg(
content=content,
tool_calls=tool_calls,
reasoning=reasoning,
+reasoning_content=reasoning_content,
+reasoning_details=reasoning_details,
)
choice = SimpleNamespace(message=msg, finish_reason=finish_reason)
resp = SimpleNamespace(choices=[choice], model="test/model")
@@ -1496,6 +1504,75 @@ class TestRunConversation:
assert result["completed"] is True
assert result["final_response"] == "internal reasoning"
def test_empty_content_local_resumed_session_triggers_compression(self, agent):
"""Local resumed reasoning-only responses should compress before burning retries."""
self._setup_agent(agent)
agent.base_url = "http://127.0.0.1:1234/v1"
agent.compression_enabled = True
empty_resp = _mock_response(
content=None,
finish_reason="stop",
reasoning_content="reasoning only",
)
ok_resp = _mock_response(content="Recovered after compression", finish_reason="stop")
prefill = [
{"role": "user", "content": "old question"},
{"role": "assistant", "content": "old answer"},
]
with (
patch.object(agent, "_interruptible_api_call", side_effect=[empty_resp, ok_resp]),
patch.object(agent, "_compress_context") as mock_compress,
patch.object(agent, "_persist_session"),
patch.object(agent, "_save_trajectory"),
patch.object(agent, "_cleanup_task_resources"),
):
mock_compress.return_value = (
[{"role": "user", "content": "compressed user message"}],
"compressed system prompt",
)
result = agent.run_conversation("hello", conversation_history=prefill)
mock_compress.assert_called_once()
assert result["completed"] is True
assert result["final_response"] == "Recovered after compression"
assert result["api_calls"] == 1 # compression retry is refunded, same as explicit overflow path
def test_empty_content_repeated_structured_reasoning_salvages_early(self, agent):
"""Repeated identical structured reasoning-only responses should stop retrying early."""
self._setup_agent(agent)
empty_resp = _mock_response(
content=None,
finish_reason="stop",
reasoning_content="structured reasoning answer",
)
agent.client.chat.completions.create.side_effect = [empty_resp, empty_resp]
with (
patch.object(agent, "_persist_session"),
patch.object(agent, "_save_trajectory"),
patch.object(agent, "_cleanup_task_resources"),
):
result = agent.run_conversation("answer me")
assert result["completed"] is True
assert result["final_response"] == "structured reasoning answer"
assert result["api_calls"] == 2
def test_empty_content_local_custom_error_is_actionable(self, agent):
"""Local/custom retries should return a diagnostic tailored to context/endpoint mismatch."""
self._setup_agent(agent)
agent.base_url = "http://127.0.0.1:1234/v1"
empty_resp = _mock_response(content=None, finish_reason="stop")
agent.client.chat.completions.create.side_effect = [empty_resp, empty_resp, empty_resp]
with (
patch.object(agent, "_persist_session"),
patch.object(agent, "_save_trajectory"),
patch.object(agent, "_cleanup_task_resources"),
):
result = agent.run_conversation("answer me")
assert result["completed"] is False
assert "Local/custom backend returned reasoning-only output" in result["error"]
assert "wrong /v1 endpoint" in result["error"]
def test_nous_401_refreshes_after_remint_and_retries(self, agent):
self._setup_agent(agent)
agent.provider = "nous"
+62
@@ -0,0 +1,62 @@
from types import SimpleNamespace
from unittest.mock import MagicMock, patch
from run_agent import AIAgent
def _mock_response(*, usage: dict, content: str = "done"):
msg = SimpleNamespace(content=content, tool_calls=None)
choice = SimpleNamespace(message=msg, finish_reason="stop")
return SimpleNamespace(
choices=[choice],
model="test/model",
usage=SimpleNamespace(**usage),
)
def _make_agent(session_db, *, platform: str):
with (
patch("run_agent.get_tool_definitions", return_value=[]),
patch("run_agent.check_toolset_requirements", return_value={}),
patch("run_agent.OpenAI"),
):
agent = AIAgent(
api_key="test-key",
quiet_mode=True,
skip_context_files=True,
skip_memory=True,
session_db=session_db,
session_id=f"{platform}-session",
platform=platform,
)
agent.client = MagicMock()
agent.client.chat.completions.create.return_value = _mock_response(
usage={
"prompt_tokens": 11,
"completion_tokens": 7,
"total_tokens": 18,
}
)
return agent
def test_run_conversation_persists_tokens_for_telegram_sessions():
session_db = MagicMock()
agent = _make_agent(session_db, platform="telegram")
result = agent.run_conversation("hello")
assert result["final_response"] == "done"
session_db.update_token_counts.assert_called_once()
assert session_db.update_token_counts.call_args.args[0] == "telegram-session"
def test_run_conversation_persists_tokens_for_cron_sessions():
session_db = MagicMock()
agent = _make_agent(session_db, platform="cron")
result = agent.run_conversation("hello")
assert result["final_response"] == "done"
session_db.update_token_counts.assert_called_once()
assert session_db.update_token_counts.call_args.args[0] == "cron-session"
+7 -1
@@ -363,10 +363,16 @@ TOOLSETS = {
"includes": []
},
"hermes-webhook": {
"description": "Webhook toolset - receive and process external webhook events",
"tools": _HERMES_CORE_TOOLS,
"includes": []
},
"hermes-gateway": {
"description": "Gateway toolset - union of all messaging platform tools",
"tools": [],
-"includes": ["hermes-telegram", "hermes-discord", "hermes-whatsapp", "hermes-slack", "hermes-signal", "hermes-homeassistant", "hermes-email", "hermes-sms", "hermes-mattermost", "hermes-matrix", "hermes-dingtalk", "hermes-feishu", "hermes-wecom"]
+"includes": ["hermes-telegram", "hermes-discord", "hermes-whatsapp", "hermes-slack", "hermes-signal", "hermes-homeassistant", "hermes-email", "hermes-sms", "hermes-mattermost", "hermes-matrix", "hermes-dingtalk", "hermes-feishu", "hermes-wecom", "hermes-webhook"]
}
}
+1
@@ -7,6 +7,7 @@
# Generated files
.docusaurus
.cache-loader
src/data/skills.json
# Misc
.DS_Store
@@ -99,9 +99,9 @@ outputs (file contents, terminal output, search results).
┌─────────────────────────────────────────────────────────────┐
│ Message list │
│ │
-│ [0..2] ← protect_first_n (system + first exchange) │
-│ [3..N] ← middle turns → SUMMARIZED │
-│ [N..end] ← tail (by token budget OR protect_last_n) │
+│ [0..2] ← protect_first_n (system + first exchange)
+│ [3..N] ← middle turns → SUMMARIZED
+│ [N..end] ← tail (by token budget OR protect_last_n)
│ │
└─────────────────────────────────────────────────────────────┘
```
+118
@@ -219,6 +219,124 @@ This is intentional — it prevents the bot from responding to every message in
---
## Configuration Options
Beyond the required environment variables from Step 8, you can customize Slack bot behavior through `~/.hermes/config.yaml`.
### Thread & Reply Behavior
```yaml
platforms:
slack:
# Controls how multi-part responses are threaded
# "off" — never thread replies to the original message
# "first" — first chunk threads to user's message (default)
# "all" — all chunks thread to user's message
reply_to_mode: "first"
extra:
# Whether to reply in a thread (default: true).
# When false, channel messages get direct channel replies instead
# of threads. Messages inside existing threads still reply in-thread.
reply_in_thread: true
# Also post thread replies to the main channel
# (Slack's "Also send to channel" feature).
# Only the first chunk of the first reply is broadcast.
reply_broadcast: false
```
| Key | Default | Description |
|-----|---------|-------------|
| `platforms.slack.reply_to_mode` | `"first"` | Threading mode for multi-part messages: `"off"`, `"first"`, or `"all"` |
| `platforms.slack.extra.reply_in_thread` | `true` | When `false`, channel messages get direct replies instead of threads. Messages inside existing threads still reply in-thread. |
| `platforms.slack.extra.reply_broadcast` | `false` | When `true`, thread replies are also posted to the main channel. Only the first chunk is broadcast. |
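
The three `reply_to_mode` values can be pictured with a small Python helper. This is a minimal sketch for illustration, not the Slack adapter's actual code: the function name `chunks_replying_to_user` and its list-of-flags return shape are assumptions introduced here.

```python
from typing import List

VALID_MODES = ("off", "first", "all")


def chunks_replying_to_user(mode: str, chunk_count: int) -> List[bool]:
    """Return one flag per response chunk.

    True means the chunk threads to the user's message.
    Hypothetical helper; the real adapter may structure this differently.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown reply_to_mode: {mode!r}")
    if mode == "off":
        return [False] * chunk_count
    if mode == "all":
        return [True] * chunk_count
    # "first": only the first chunk threads to the user's message.
    return [index == 0 for index in range(chunk_count)]
```

For a three-chunk response, `"first"` flags only the first chunk, `"all"` flags every chunk, and `"off"` flags none.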
### Session Isolation
```yaml
# Global setting — applies to Slack and all other platforms
group_sessions_per_user: true
```
When `true` (the default), each user in a shared channel gets their own isolated conversation session. Two people talking to Hermes in `#general` will have separate histories and contexts.
Set to `false` if you want a collaborative mode where the entire channel shares one conversation session. Be aware this means users share context growth and token costs, and one user's `/reset` clears the session for everyone.
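
One way to picture the difference between the two modes is in how a session key might be derived. The sketch below is an assumption made for illustration only: the `session_key` function and the `platform:channel:user` key format are hypothetical and are not taken from the Hermes source.

```python
def session_key(
    platform: str,
    channel_id: str,
    user_id: str,
    group_sessions_per_user: bool = True,
) -> str:
    """Build an illustrative session key (hypothetical format)."""
    if group_sessions_per_user:
        # Each user in the channel gets an isolated session.
        return f"{platform}:{channel_id}:{user_id}"
    # Collaborative mode: everyone in the channel shares one session.
    return f"{platform}:{channel_id}"
```

With the default setting, two users in the same channel map to different keys and therefore separate histories. With `group_sessions_per_user: false`, both map to the same key, which is why one user's `/reset` affects everyone.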
### Mention & Trigger Behavior
```yaml
slack:
# Require @mention in channels (this is the default behavior;
# the Slack adapter enforces @mention gating in channels regardless,
# but you can set this explicitly for consistency with other platforms)
require_mention: true
# Custom mention patterns that trigger the bot
# (in addition to the default @mention detection)
mention_patterns:
- "hey hermes"
- "hermes,"
# Text prepended to every outgoing message
reply_prefix: ""
```
:::info
Unlike Discord and Telegram, Slack does not have a `free_response_channels` equivalent. The Slack adapter always requires `@mention` in channels — this is hardcoded behavior. In DMs, the bot always responds without needing a mention.
:::
### Unauthorized User Handling
```yaml
slack:
# What happens when an unauthorized user (not in SLACK_ALLOWED_USERS) DMs the bot
# "pair" — prompt them for a pairing code (default)
# "ignore" — silently drop the message
unauthorized_dm_behavior: "pair"
```
You can also set this globally for all platforms:
```yaml
unauthorized_dm_behavior: "pair"
```
The platform-specific setting under `slack:` takes precedence over the global setting.
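
The precedence rule can be sketched as a lookup that checks the platform section first and then falls back to the global key. This is a hedged illustration: it assumes `config.yaml` has already been parsed into a plain `dict`, and the function name `resolve_unauthorized_dm_behavior` is hypothetical rather than part of Hermes.

```python
from typing import Any, Dict

DEFAULT_BEHAVIOR = "pair"  # documented default
KEY = "unauthorized_dm_behavior"


def resolve_unauthorized_dm_behavior(config: Dict[str, Any], platform: str) -> str:
    """Return the effective behavior for a platform (illustrative only)."""
    platform_section = config.get(platform)
    # A platform-specific value wins over the global one.
    if isinstance(platform_section, dict) and KEY in platform_section:
        return platform_section[KEY]
    # Otherwise use the global value, or the documented default.
    return config.get(KEY, DEFAULT_BEHAVIOR)
```

For example, a config with a global `"pair"` and `slack: {unauthorized_dm_behavior: "ignore"}` resolves to `"ignore"` for Slack and `"pair"` for any platform without its own override.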
### Voice Transcription
```yaml
# Global setting — enable/disable automatic transcription of incoming voice messages
stt_enabled: true
```
When `true` (the default), incoming audio messages are automatically transcribed using the configured STT provider before being processed by the agent.
### Full Example
```yaml
# Global gateway settings
group_sessions_per_user: true
unauthorized_dm_behavior: "pair"
stt_enabled: true
# Slack-specific settings
slack:
require_mention: true
unauthorized_dm_behavior: "pair"
# Platform config
platforms:
slack:
reply_to_mode: "first"
extra:
reply_in_thread: true
reply_broadcast: false
```
---
## Home Channel
+5
@@ -84,6 +84,11 @@ const config: Config = {
position: 'left',
label: 'Docs',
},
{
to: '/skills',
label: 'Skills',
position: 'left',
},
{
href: 'https://hermes-agent.nousresearch.com',
label: 'Home',
+268
@@ -0,0 +1,268 @@
#!/usr/bin/env python3
"""Extract skill metadata from SKILL.md files and index caches into JSON."""
import json
import os
from collections import Counter
import yaml
REPO_ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
LOCAL_SKILL_DIRS = [
("skills", "built-in"),
("optional-skills", "optional"),
]
INDEX_CACHE_DIR = os.path.join(REPO_ROOT, "skills", "index-cache")
OUTPUT = os.path.join(REPO_ROOT, "website", "src", "data", "skills.json")
CATEGORY_LABELS = {
"apple": "Apple",
"autonomous-ai-agents": "AI Agents",
"blockchain": "Blockchain",
"communication": "Communication",
"creative": "Creative",
"data-science": "Data Science",
"devops": "DevOps",
"dogfood": "Dogfood",
"domain": "Domain",
"email": "Email",
"feeds": "Feeds",
"gaming": "Gaming",
"gifs": "GIFs",
"github": "GitHub",
"health": "Health",
"inference-sh": "Inference",
"leisure": "Leisure",
"mcp": "MCP",
"media": "Media",
"migration": "Migration",
"mlops": "MLOps",
"note-taking": "Note-Taking",
"productivity": "Productivity",
"red-teaming": "Red Teaming",
"research": "Research",
"security": "Security",
"smart-home": "Smart Home",
"social-media": "Social Media",
"software-development": "Software Dev",
"translation": "Translation",
"other": "Other",
}
SOURCE_LABELS = {
"anthropics_skills": "Anthropic",
"openai_skills": "OpenAI",
"claude_marketplace": "Claude Marketplace",
"lobehub": "LobeHub",
}
def extract_local_skills():
skills = []
for base_dir, source_label in LOCAL_SKILL_DIRS:
base_path = os.path.join(REPO_ROOT, base_dir)
if not os.path.isdir(base_path):
continue
for root, _dirs, files in os.walk(base_path):
if "SKILL.md" not in files:
continue
skill_path = os.path.join(root, "SKILL.md")
with open(skill_path, encoding="utf-8") as f:
content = f.read()
if not content.startswith("---"):
continue
parts = content.split("---", 2)
if len(parts) < 3:
continue
try:
fm = yaml.safe_load(parts[1])
except yaml.YAMLError:
continue
if not fm or not isinstance(fm, dict):
continue
rel = os.path.relpath(root, base_path)
category = rel.split(os.sep)[0]
tags = []
metadata = fm.get("metadata")
if isinstance(metadata, dict):
hermes_meta = metadata.get("hermes", {})
if isinstance(hermes_meta, dict):
tags = hermes_meta.get("tags", [])
if not tags:
tags = fm.get("tags", [])
if isinstance(tags, str):
tags = [tags]
skills.append({
"name": fm.get("name", os.path.basename(root)),
"description": fm.get("description", ""),
"category": category,
"categoryLabel": CATEGORY_LABELS.get(category, category.replace("-", " ").title()),
"source": source_label,
"tags": tags or [],
"platforms": fm.get("platforms", []),
"author": fm.get("author", ""),
"version": fm.get("version", ""),
})
return skills
def extract_cached_index_skills():
skills = []
if not os.path.isdir(INDEX_CACHE_DIR):
return skills
for filename in os.listdir(INDEX_CACHE_DIR):
if not filename.endswith(".json"):
continue
filepath = os.path.join(INDEX_CACHE_DIR, filename)
try:
with open(filepath, encoding="utf-8") as f:
data = json.load(f)
except (json.JSONDecodeError, OSError):
continue
stem = os.path.splitext(filename)[0]
source_label = "community"
for key, label in SOURCE_LABELS.items():
if key in stem:
source_label = label
break
if isinstance(data, dict) and "agents" in data:
for agent in data["agents"]:
if not isinstance(agent, dict):
continue
skills.append({
"name": agent.get("identifier", agent.get("meta", {}).get("title", "unknown")),
"description": (agent.get("meta", {}).get("description", "") or "").split("\n")[0][:200],
"category": _guess_category(agent.get("meta", {}).get("tags", [])),
"categoryLabel": "", # filled below
"source": source_label,
"tags": agent.get("meta", {}).get("tags", []),
"platforms": [],
"author": agent.get("author", ""),
"version": "",
})
continue
if isinstance(data, list):
for entry in data:
if not isinstance(entry, dict) or not entry.get("name"):
continue
if "skills" in entry and isinstance(entry["skills"], list):
continue
skills.append({
"name": entry.get("name", ""),
"description": entry.get("description", ""),
"category": "uncategorized",
"categoryLabel": "",
"source": source_label,
"tags": entry.get("tags", []),
"platforms": [],
"author": "",
"version": "",
})
for s in skills:
if not s["categoryLabel"]:
s["categoryLabel"] = CATEGORY_LABELS.get(
s["category"],
s["category"].replace("-", " ").title() if s["category"] else "Uncategorized",
)
return skills
TAG_TO_CATEGORY = {}
for _cat, _tags in {
"software-development": [
"programming", "code", "coding", "software-development",
"frontend-development", "backend-development", "web-development",
"react", "python", "typescript", "java", "rust",
],
"creative": ["writing", "design", "creative", "art", "image-generation"],
"research": ["education", "academic", "research"],
"social-media": ["marketing", "seo", "social-media"],
"productivity": ["productivity", "business"],
"data-science": ["data", "data-science"],
"mlops": ["machine-learning", "deep-learning"],
"devops": ["devops"],
"gaming": ["gaming", "game", "game-development"],
"media": ["music", "media", "video"],
"health": ["health", "fitness"],
"translation": ["translation", "language-learning"],
"security": ["security", "cybersecurity"],
}.items():
for _t in _tags:
TAG_TO_CATEGORY[_t] = _cat
def _guess_category(tags: list) -> str:
if not tags:
return "uncategorized"
for tag in tags:
cat = TAG_TO_CATEGORY.get(tag.lower())
if cat:
return cat
return tags[0].lower().replace(" ", "-")
MIN_CATEGORY_SIZE = 4
def _consolidate_small_categories(skills: list) -> list:
for s in skills:
if s["category"] in ("uncategorized", ""):
s["category"] = "other"
s["categoryLabel"] = "Other"
counts = Counter(s["category"] for s in skills)
small_cats = {cat for cat, n in counts.items() if n < MIN_CATEGORY_SIZE}
for s in skills:
if s["category"] in small_cats:
s["category"] = "other"
s["categoryLabel"] = "Other"
return skills
def main():
local = extract_local_skills()
external = extract_cached_index_skills()
all_skills = _consolidate_small_categories(local + external)
source_order = {"built-in": 0, "optional": 1}
all_skills.sort(key=lambda s: (
source_order.get(s["source"], 2),
1 if s["category"] == "other" else 0,
s["category"],
s["name"],
))
os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
with open(OUTPUT, "w") as f:
json.dump(all_skills, f, indent=2)
print(f"Extracted {len(all_skills)} skills to {OUTPUT}")
print(f" {len(local)} local ({sum(1 for s in local if s['source'] == 'built-in')} built-in, "
f"{sum(1 for s in local if s['source'] == 'optional')} optional)")
print(f" {len(external)} from external indexes")
if __name__ == "__main__":
main()
+582
@@ -0,0 +1,582 @@
import React, { useState, useMemo, useCallback, useRef, useEffect } from "react";
import Layout from "@theme/Layout";
import skills from "../../data/skills.json";
import styles from "./styles.module.css";
interface Skill {
name: string;
description: string;
category: string;
categoryLabel: string;
source: string;
tags: string[];
platforms: string[];
author: string;
version: string;
}
const allSkills: Skill[] = skills as Skill[];
const CATEGORY_ICONS: Record<string, string> = {
apple: "\u{f179}",
"autonomous-ai-agents": "\u{1F916}",
blockchain: "\u{26D3}",
communication: "\u{1F4AC}",
creative: "\u{1F3A8}",
"data-science": "\u{1F4CA}",
devops: "\u{2699}",
dogfood: "\u{1F436}",
domain: "\u{1F310}",
email: "\u{2709}",
feeds: "\u{1F4E1}",
gaming: "\u{1F3AE}",
gifs: "\u{1F3AC}",
github: "\u{1F4BB}",
health: "\u{2764}",
"inference-sh": "\u{26A1}",
leisure: "\u{2615}",
mcp: "\u{1F50C}",
media: "\u{1F3B5}",
migration: "\u{1F4E6}",
mlops: "\u{1F9EA}",
"note-taking": "\u{1F4DD}",
productivity: "\u{2705}",
"red-teaming": "\u{1F6E1}",
research: "\u{1F50D}",
security: "\u{1F512}",
"smart-home": "\u{1F3E0}",
"social-media": "\u{1F4F1}",
"software-development": "\u{1F4BB}",
translation: "\u{1F30D}",
other: "\u{1F4E6}",
};
const SOURCE_CONFIG: Record<
string,
{ label: string; color: string; bg: string; border: string; icon: string }
> = {
"built-in": {
label: "Built-in",
color: "#4ade80",
bg: "rgba(74, 222, 128, 0.08)",
border: "rgba(74, 222, 128, 0.2)",
icon: "\u{2713}",
},
optional: {
label: "Optional",
color: "#fbbf24",
bg: "rgba(251, 191, 36, 0.08)",
border: "rgba(251, 191, 36, 0.2)",
icon: "\u{2B50}",
},
Anthropic: {
label: "Anthropic",
color: "#d4845a",
bg: "rgba(212, 132, 90, 0.08)",
border: "rgba(212, 132, 90, 0.2)",
icon: "\u{25C6}",
},
LobeHub: {
label: "LobeHub",
color: "#60a5fa",
bg: "rgba(96, 165, 250, 0.08)",
border: "rgba(96, 165, 250, 0.2)",
icon: "\u{25CB}",
},
"Claude Marketplace": {
label: "Marketplace",
color: "#a78bfa",
bg: "rgba(167, 139, 250, 0.08)",
border: "rgba(167, 139, 250, 0.2)",
icon: "\u{25A0}",
},
};
const SOURCE_ORDER = ["all", "built-in", "optional", "Anthropic", "LobeHub", "Claude Marketplace"];
function highlightMatch(text: string, query: string): React.ReactNode {
if (!query || !text) return text;
const idx = text.toLowerCase().indexOf(query.toLowerCase());
if (idx === -1) return text;
return (
<>
{text.slice(0, idx)}
<mark className={styles.highlight}>{text.slice(idx, idx + query.length)}</mark>
{text.slice(idx + query.length)}
</>
);
}
function SkillCard({
skill,
query,
expanded,
onToggle,
onCategoryClick,
onTagClick,
style,
}: {
skill: Skill;
query: string;
expanded: boolean;
onToggle: () => void;
onCategoryClick: (cat: string) => void;
onTagClick: (tag: string) => void;
style?: React.CSSProperties;
}) {
const src = SOURCE_CONFIG[skill.source] || SOURCE_CONFIG["optional"];
const icon = CATEGORY_ICONS[skill.category] || "\u{1F4E6}";
return (
<div
className={`${styles.card} ${expanded ? styles.cardExpanded : ""}`}
onClick={onToggle}
style={style}
>
<div className={styles.cardAccent} style={{ background: src.color }} />
<div className={styles.cardInner}>
<div className={styles.cardTop}>
<span className={styles.cardIcon}>{icon}</span>
<div className={styles.cardTitleGroup}>
<h3 className={styles.cardTitle}>
{highlightMatch(skill.name, query)}
</h3>
<span
className={styles.sourcePill}
style={{
color: src.color,
background: src.bg,
borderColor: src.border,
}}
>
{src.icon} {src.label}
</span>
</div>
</div>
<p className={`${styles.cardDesc} ${expanded ? styles.cardDescFull : ""}`}>
{highlightMatch(skill.description || "No description available.", query)}
</p>
<div className={styles.cardMeta}>
<button
className={styles.catButton}
onClick={(e) => {
e.stopPropagation();
onCategoryClick(skill.category);
}}
title={`Filter by ${skill.categoryLabel}`}
>
{skill.categoryLabel || skill.category}
</button>
{skill.platforms?.map((p) => (
<span key={p} className={styles.platformPill}>
{p === "macos" ? "\u{F8FF} macOS" : p === "linux" ? "\u{1F427} Linux" : p}
</span>
))}
</div>
{expanded && (
<div className={styles.cardDetail}>
{skill.tags?.length > 0 && (
<div className={styles.tagRow}>
{skill.tags.map((tag) => (
<button
key={tag}
className={styles.tagPill}
onClick={(e) => {
e.stopPropagation();
onTagClick(tag);
}}
>
{tag}
</button>
))}
</div>
)}
{skill.author && (
<div className={styles.authorRow}>
<span className={styles.authorLabel}>Author</span>
<span className={styles.authorValue}>{skill.author}</span>
</div>
)}
{skill.version && (
<div className={styles.authorRow}>
<span className={styles.authorLabel}>Version</span>
<span className={styles.authorValue}>{skill.version}</span>
</div>
)}
<div className={styles.installHint}>
<code>hermes skills install {skill.name}</code>
</div>
</div>
)}
</div>
</div>
);
}
function StatCard({ value, label, color }: { value: number; label: string; color: string }) {
return (
<div className={styles.stat}>
<span className={styles.statValue} style={{ color }}>
{value}
</span>
<span className={styles.statLabel}>{label}</span>
</div>
);
}
const PAGE_SIZE = 60;
export default function SkillsDashboard() {
const [search, setSearch] = useState("");
const [sourceFilter, setSourceFilter] = useState("all");
const [categoryFilter, setCategoryFilter] = useState("all");
const [expandedCard, setExpandedCard] = useState<string | null>(null);
const [visibleCount, setVisibleCount] = useState(PAGE_SIZE);
const [sidebarOpen, setSidebarOpen] = useState(false);
const searchRef = useRef<HTMLInputElement>(null);
const gridRef = useRef<HTMLDivElement>(null);
useEffect(() => {
const handler = (e: KeyboardEvent) => {
if (e.key === "/" && document.activeElement?.tagName !== "INPUT") {
e.preventDefault();
searchRef.current?.focus();
}
if (e.key === "Escape") {
searchRef.current?.blur();
setExpandedCard(null);
}
};
window.addEventListener("keydown", handler);
return () => window.removeEventListener("keydown", handler);
}, []);
const sources = useMemo(() => {
const set = new Set(allSkills.map((s) => s.source));
return SOURCE_ORDER.filter((s) => s === "all" || set.has(s));
}, []);
const categoryEntries = useMemo(() => {
const pool =
sourceFilter === "all"
? allSkills
: allSkills.filter((s) => s.source === sourceFilter);
const map = new Map<string, { label: string; count: number }>();
for (const s of pool) {
const key = s.category || "uncategorized";
const existing = map.get(key);
if (existing) {
existing.count++;
} else {
map.set(key, {
label: s.categoryLabel || s.category || "Uncategorized",
count: 1,
});
}
}
return Array.from(map.entries())
.sort((a, b) => b[1].count - a[1].count)
.map(([key, { label, count }]) => ({ key, label, count }));
}, [sourceFilter]);
const filtered = useMemo(() => {
const q = search.toLowerCase().trim();
return allSkills.filter((s) => {
if (sourceFilter !== "all" && s.source !== sourceFilter) return false;
if (categoryFilter !== "all" && s.category !== categoryFilter) return false;
if (q) {
const haystack = [s.name, s.description, s.categoryLabel, s.author, ...(s.tags || [])]
.join(" ")
.toLowerCase();
return haystack.includes(q);
}
return true;
});
}, [search, sourceFilter, categoryFilter]);
useEffect(() => {
setVisibleCount(PAGE_SIZE);
setExpandedCard(null);
}, [search, sourceFilter, categoryFilter]);
const visible = filtered.slice(0, visibleCount);
const hasMore = visibleCount < filtered.length;
const handleSourceChange = useCallback(
(src: string) => {
setSourceFilter(src);
setCategoryFilter("all");
},
[]
);
const handleCategoryClick = useCallback((cat: string) => {
setCategoryFilter(cat);
gridRef.current?.scrollIntoView({ behavior: "smooth", block: "start" });
setSidebarOpen(false);
}, []);
const handleTagClick = useCallback((tag: string) => {
setSearch(tag);
searchRef.current?.focus();
}, []);
const clearAll = useCallback(() => {
setSearch("");
setSourceFilter("all");
setCategoryFilter("all");
}, []);
return (
<Layout
title="Skills Hub"
description="Browse all skills and plugins available for Hermes Agent"
>
<div className={styles.page}>
<header className={styles.hero}>
<div className={styles.heroGlow} />
<div className={styles.heroContent}>
<p className={styles.heroEyebrow}>Hermes Agent</p>
<h1 className={styles.heroTitle}>Skills Hub</h1>
<p className={styles.heroSub}>
Discover, search, and install from{" "}
<strong className={styles.heroAccent}>{allSkills.length}</strong> skills
across {sources.length - 1} registries
</p>
<div className={styles.statsRow}>
<StatCard
value={allSkills.filter((s) => s.source === "built-in").length}
label="Built-in"
color="#4ade80"
/>
<StatCard
value={allSkills.filter((s) => s.source === "optional").length}
label="Optional"
color="#fbbf24"
/>
<StatCard
value={
allSkills.filter(
(s) => s.source !== "built-in" && s.source !== "optional"
).length
}
label="Community"
color="#60a5fa"
/>
<StatCard
value={new Set(allSkills.map((s) => s.category)).size}
label="Categories"
color="#a78bfa"
/>
</div>
</div>
</header>
<div className={styles.controlsBar}>
<div className={styles.searchWrap}>
<svg className={styles.searchIcon} viewBox="0 0 20 20" fill="currentColor" width="18" height="18">
<path
fillRule="evenodd"
d="M8 4a4 4 0 100 8 4 4 0 000-8zM2 8a6 6 0 1110.89 3.476l4.817 4.817a1 1 0 01-1.414 1.414l-4.816-4.816A6 6 0 012 8z"
clipRule="evenodd"
/>
</svg>
<input
ref={searchRef}
type="text"
placeholder='Search skills... (press "/" to focus)'
value={search}
onChange={(e) => setSearch(e.target.value)}
className={styles.searchInput}
/>
{search && (
<button className={styles.clearBtn} onClick={() => setSearch("")}>
<svg viewBox="0 0 20 20" fill="currentColor" width="16" height="16">
<path
fillRule="evenodd"
d="M10 18a8 8 0 100-16 8 8 0 000 16zM8.707 7.293a1 1 0 00-1.414 1.414L8.586 10l-1.293 1.293a1 1 0 101.414 1.414L10 11.414l1.293 1.293a1 1 0 001.414-1.414L11.414 10l1.293-1.293a1 1 0 00-1.414-1.414L10 8.586 8.707 7.293z"
clipRule="evenodd"
/>
</svg>
</button>
)}
</div>
<div className={styles.sourcePills}>
{sources.map((src) => {
const active = sourceFilter === src;
const conf = SOURCE_CONFIG[src];
const count =
src === "all"
? allSkills.length
: allSkills.filter((s) => s.source === src).length;
return (
<button
key={src}
className={`${styles.srcPill} ${active ? styles.srcPillActive : ""}`}
onClick={() => handleSourceChange(src)}
style={
active && conf
? ({
"--pill-color": conf.color,
"--pill-bg": conf.bg,
"--pill-border": conf.border,
} as React.CSSProperties)
: undefined
}
>
{src === "all" ? "All" : conf?.label || src}
<span className={styles.srcCount}>{count}</span>
</button>
);
})}
</div>
</div>
<div className={styles.layout}>
<button
className={styles.sidebarToggle}
onClick={() => setSidebarOpen((open) => !open)}
aria-expanded={sidebarOpen}
>
<svg viewBox="0 0 20 20" fill="currentColor" width="18" height="18">
<path
fillRule="evenodd"
d="M3 5a1 1 0 011-1h12a1 1 0 110 2H4a1 1 0 01-1-1zM3 10a1 1 0 011-1h12a1 1 0 110 2H4a1 1 0 01-1-1zM3 15a1 1 0 011-1h6a1 1 0 110 2H4a1 1 0 01-1-1z"
clipRule="evenodd"
/>
</svg>
Categories
{categoryFilter !== "all" && (
<span className={styles.activeCatBadge}>
{categoryEntries.find((c) => c.key === categoryFilter)?.label}
</span>
)}
</button>
<aside className={`${styles.sidebar} ${sidebarOpen ? styles.sidebarOpen : ""}`}>
<div className={styles.sidebarHeader}>
<h2 className={styles.sidebarTitle}>Categories</h2>
{categoryFilter !== "all" && (
<button className={styles.sidebarClear} onClick={() => setCategoryFilter("all")}>
Clear
</button>
)}
</div>
<nav className={styles.catList}>
<button
className={`${styles.catItem} ${categoryFilter === "all" ? styles.catItemActive : ""}`}
onClick={() => {
setCategoryFilter("all");
setSidebarOpen(false);
}}
>
<span className={styles.catItemIcon}>{"\u{1F4CB}"}</span>
<span className={styles.catItemLabel}>All Skills</span>
<span className={styles.catItemCount}>{filtered.length}</span>
</button>
{categoryEntries.map((cat) => (
<button
key={cat.key}
className={`${styles.catItem} ${categoryFilter === cat.key ? styles.catItemActive : ""}`}
onClick={() => handleCategoryClick(cat.key)}
>
<span className={styles.catItemIcon}>
{CATEGORY_ICONS[cat.key] || "\u{1F4E6}"}
</span>
<span className={styles.catItemLabel}>{cat.label}</span>
<span className={styles.catItemCount}>{cat.count}</span>
</button>
))}
</nav>
</aside>
<main className={styles.main} ref={gridRef}>
{(search || sourceFilter !== "all" || categoryFilter !== "all") && (
<div className={styles.filterSummary}>
<span className={styles.filterCount}>
{filtered.length} result{filtered.length !== 1 ? "s" : ""}
</span>
{search && (
<span className={styles.filterChip}>
&ldquo;{search}&rdquo;
<button onClick={() => setSearch("")} aria-label="Remove search filter">&times;</button>
</span>
)}
{sourceFilter !== "all" && (
<span className={styles.filterChip}>
{SOURCE_CONFIG[sourceFilter]?.label || sourceFilter}
<button onClick={() => setSourceFilter("all")} aria-label="Remove source filter">&times;</button>
</span>
)}
{categoryFilter !== "all" && (
<span className={styles.filterChip}>
{categoryEntries.find((c) => c.key === categoryFilter)?.label ||
categoryFilter}
<button onClick={() => setCategoryFilter("all")} aria-label="Remove category filter">&times;</button>
</span>
)}
<button className={styles.clearAllBtn} onClick={clearAll}>
Clear all
</button>
</div>
)}
{visible.length > 0 ? (
<>
<div className={styles.grid}>
{visible.map((skill, i) => {
const key = `${skill.source}-${skill.name}-${i}`;
return (
<SkillCard
key={key}
skill={skill}
query={search}
expanded={expandedCard === key}
onToggle={() =>
setExpandedCard(expandedCard === key ? null : key)
}
onCategoryClick={handleCategoryClick}
onTagClick={handleTagClick}
style={{ animationDelay: `${Math.min(i, 20) * 25}ms` }}
/>
);
})}
</div>
{hasMore && (
<div className={styles.loadMoreWrap}>
<button
className={styles.loadMoreBtn}
onClick={() => setVisibleCount((v) => v + PAGE_SIZE)}
>
Show more ({filtered.length - visibleCount} remaining)
</button>
</div>
)}
</>
) : (
<div className={styles.empty}>
<div className={styles.emptyIcon}>{"\u{1F50D}"}</div>
<h3 className={styles.emptyTitle}>No skills found</h3>
<p className={styles.emptyDesc}>
Try a different search term or clear your filters.
</p>
<button className={styles.emptyReset} onClick={clearAll}>
Reset all filters
</button>
</div>
)}
</main>
</div>
</div>
{sidebarOpen && (
<div className={styles.backdrop} onClick={() => setSidebarOpen(false)} aria-hidden="true" />
)}
</Layout>
);
}
@@ -0,0 +1,819 @@
@import url("https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap");
.page {
font-family: "DM Sans", -apple-system, BlinkMacSystemFont, sans-serif;
min-height: 100vh;
}
.hero {
position: relative;
overflow: hidden;
padding: 4rem 2rem 2.5rem;
text-align: center;
}
.heroGlow {
position: absolute;
top: -120px;
left: 50%;
transform: translateX(-50%);
width: 600px;
height: 400px;
background: radial-gradient(
ellipse at center,
rgba(255, 215, 0, 0.07) 0%,
transparent 70%
);
pointer-events: none;
}
.heroContent {
position: relative;
z-index: 1;
max-width: 720px;
margin: 0 auto;
}
.heroEyebrow {
font-family: "JetBrains Mono", monospace;
font-size: 0.75rem;
letter-spacing: 0.15em;
text-transform: uppercase;
color: rgba(255, 215, 0, 0.5);
margin-bottom: 0.75rem;
}
.heroTitle {
font-size: 3rem;
font-weight: 700;
letter-spacing: -0.04em;
line-height: 1.1;
margin: 0 0 0.75rem;
}
[data-theme="dark"] .heroTitle {
color: #fafaf6;
}
.heroSub {
font-size: 1.05rem;
color: var(--ifm-font-color-secondary, #9a968e);
line-height: 1.5;
margin: 0 0 2rem;
}
.heroAccent {
color: #ffd700;
font-weight: 700;
font-variant-numeric: tabular-nums;
}
.statsRow {
display: flex;
justify-content: center;
gap: 2.5rem;
flex-wrap: wrap;
}
.stat {
display: flex;
flex-direction: column;
align-items: center;
gap: 0.2rem;
}
.statValue {
font-family: "JetBrains Mono", monospace;
font-size: 1.6rem;
font-weight: 700;
line-height: 1;
}
.statLabel {
font-size: 0.72rem;
letter-spacing: 0.06em;
text-transform: uppercase;
color: var(--ifm-font-color-secondary, #9a968e);
}
.controlsBar {
position: sticky;
top: var(--ifm-navbar-height, 60px); /* sit just below the Docusaurus navbar */
z-index: 50;
display: flex;
flex-direction: column;
gap: 0.75rem;
align-items: center;
padding: 1rem 2rem;
backdrop-filter: blur(16px) saturate(1.4);
border-bottom: 1px solid rgba(255, 215, 0, 0.06);
}
[data-theme="dark"] .controlsBar {
background: rgba(7, 7, 13, 0.85);
}
.searchWrap {
position: relative;
width: 100%;
max-width: 560px;
}
.searchIcon {
position: absolute;
left: 0.85rem;
top: 50%;
transform: translateY(-50%);
color: rgba(255, 215, 0, 0.35);
pointer-events: none;
}
.searchInput {
width: 100%;
padding: 0.7rem 2.5rem 0.7rem 2.6rem;
font-size: 0.95rem;
font-family: "DM Sans", sans-serif;
border: 1px solid rgba(255, 215, 0, 0.12);
border-radius: 10px;
background: rgba(15, 15, 24, 0.6);
color: var(--ifm-font-color-base, #e8e4dc);
outline: none;
transition: border-color 0.2s, box-shadow 0.2s;
}
.searchInput:focus {
border-color: rgba(255, 215, 0, 0.4);
box-shadow: 0 0 0 3px rgba(255, 215, 0, 0.06);
}
.searchInput::placeholder {
color: var(--ifm-font-color-secondary, #9a968e);
opacity: 0.5;
}
.clearBtn {
position: absolute;
right: 0.6rem;
top: 50%;
transform: translateY(-50%);
background: none;
border: none;
color: var(--ifm-font-color-secondary);
cursor: pointer;
padding: 0.15rem;
display: flex;
opacity: 0.6;
transition: opacity 0.15s;
}
.clearBtn:hover {
opacity: 1;
color: #ffd700;
}
.sourcePills {
display: flex;
gap: 0.4rem;
flex-wrap: wrap;
justify-content: center;
}
.srcPill {
display: inline-flex;
align-items: center;
gap: 0.35rem;
padding: 0.35rem 0.75rem;
border: 1px solid rgba(255, 255, 255, 0.07);
border-radius: 20px;
background: transparent;
color: var(--ifm-font-color-secondary, #9a968e);
font-family: "DM Sans", sans-serif;
font-size: 0.8rem;
font-weight: 500;
cursor: pointer;
transition: all 0.2s;
}
.srcPill:hover {
border-color: rgba(255, 255, 255, 0.15);
color: var(--ifm-font-color-base);
}
.srcPillActive {
border-color: var(--pill-border, rgba(255, 215, 0, 0.3));
background: var(--pill-bg, rgba(255, 215, 0, 0.06));
color: var(--pill-color, #ffd700);
}
.srcCount {
font-family: "JetBrains Mono", monospace;
font-size: 0.68rem;
background: rgba(255, 255, 255, 0.05);
padding: 0.05rem 0.35rem;
border-radius: 8px;
}
.srcPillActive .srcCount {
background: rgba(255, 255, 255, 0.08);
}
.layout {
display: grid;
grid-template-columns: 260px 1fr;
gap: 0;
max-width: 1440px;
margin: 0 auto;
min-height: 60vh;
}
.sidebar {
position: sticky;
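top: 160px; /* offset chosen to clear the navbar plus the sticky controls bar */
height: calc(100vh - 160px); /* fill the viewport below that offset */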
overflow-y: auto;
padding: 1.25rem 1rem 2rem 1.5rem;
border-right: 1px solid rgba(255, 215, 0, 0.05);
}
.sidebar::-webkit-scrollbar {
width: 4px;
}
.sidebar::-webkit-scrollbar-thumb {
background: rgba(255, 215, 0, 0.1);
border-radius: 2px;
}
.sidebarHeader {
display: flex;
align-items: center;
justify-content: space-between;
margin-bottom: 0.75rem;
}
.sidebarTitle {
font-size: 0.72rem;
font-weight: 600;
letter-spacing: 0.1em;
text-transform: uppercase;
color: var(--ifm-font-color-secondary);
margin: 0;
}
.sidebarClear {
font-family: "DM Sans", sans-serif;
font-size: 0.72rem;
color: rgba(255, 215, 0, 0.6);
background: none;
border: none;
cursor: pointer;
padding: 0;
transition: color 0.15s;
}
.sidebarClear:hover {
color: #ffd700;
}
.catList {
display: flex;
flex-direction: column;
gap: 1px;
}
.catItem {
display: flex;
align-items: center;
gap: 0.5rem;
padding: 0.45rem 0.6rem;
border: none;
border-radius: 6px;
background: transparent;
color: var(--ifm-font-color-secondary, #9a968e);
font-family: "DM Sans", sans-serif;
font-size: 0.82rem;
cursor: pointer;
transition: all 0.15s;
text-align: left;
width: 100%;
}
.catItem:hover {
background: rgba(255, 215, 0, 0.04);
color: var(--ifm-font-color-base);
}
.catItemActive {
background: rgba(255, 215, 0, 0.08);
color: #ffd700;
}
.catItemIcon {
font-size: 0.9rem;
width: 1.3rem;
text-align: center;
flex-shrink: 0;
}
.catItemLabel {
flex: 1;
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
}
.catItemCount {
font-family: "JetBrains Mono", monospace;
font-size: 0.68rem;
color: rgba(255, 215, 0, 0.3);
min-width: 1.5rem;
text-align: right;
}
.catItemActive .catItemCount {
color: rgba(255, 215, 0, 0.6);
}
.sidebarToggle {
display: none;
}
.main {
padding: 1.25rem 1.5rem 3rem;
min-width: 0;
}
.filterSummary {
display: flex;
align-items: center;
gap: 0.5rem;
flex-wrap: wrap;
margin-bottom: 1rem;
padding-bottom: 0.75rem;
border-bottom: 1px solid rgba(255, 215, 0, 0.05);
}
.filterCount {
font-size: 0.82rem;
font-weight: 600;
color: var(--ifm-font-color-base);
margin-right: 0.25rem;
}
.filterChip {
display: inline-flex;
align-items: center;
gap: 0.3rem;
padding: 0.2rem 0.5rem;
border: 1px solid rgba(255, 215, 0, 0.15);
border-radius: 4px;
background: rgba(255, 215, 0, 0.04);
color: rgba(255, 215, 0, 0.8);
font-size: 0.75rem;
}
.filterChip button {
background: none;
border: none;
color: inherit;
cursor: pointer;
padding: 0;
font-size: 0.85rem;
line-height: 1;
opacity: 0.6;
transition: opacity 0.15s;
}
.filterChip button:hover {
opacity: 1;
}
.clearAllBtn {
font-family: "DM Sans", sans-serif;
font-size: 0.75rem;
color: var(--ifm-font-color-secondary);
background: none;
border: none;
cursor: pointer;
padding: 0;
margin-left: auto;
transition: color 0.15s;
}
.clearAllBtn:hover {
color: #ffd700;
}
.grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(340px, 1fr));
gap: 0.75rem;
}
@keyframes cardIn {
from {
opacity: 0;
transform: translateY(8px);
}
to {
opacity: 1;
transform: translateY(0);
}
}
.card {
position: relative;
border: 1px solid rgba(255, 255, 255, 0.05);
border-radius: 10px;
overflow: hidden;
cursor: pointer;
transition: border-color 0.2s, box-shadow 0.2s, transform 0.2s;
animation: cardIn 0.35s ease both;
}
[data-theme="dark"] .card {
background: #0c0c16;
}
.card:hover {
border-color: rgba(255, 215, 0, 0.15);
box-shadow: 0 4px 24px rgba(0, 0, 0, 0.3), 0 0 0 1px rgba(255, 215, 0, 0.05);
transform: translateY(-1px);
}
.cardExpanded {
border-color: rgba(255, 215, 0, 0.2);
box-shadow: 0 8px 32px rgba(0, 0, 0, 0.4), 0 0 0 1px rgba(255, 215, 0, 0.08);
}
.cardAccent {
position: absolute;
top: 0;
left: 0;
width: 3px;
height: 100%;
opacity: 0.5;
transition: opacity 0.2s;
}
.card:hover .cardAccent {
opacity: 1;
}
.cardInner {
padding: 1rem 1rem 0.85rem 1.15rem;
}
.cardTop {
display: flex;
align-items: flex-start;
gap: 0.6rem;
margin-bottom: 0.5rem;
}
.cardIcon {
font-size: 1.15rem;
line-height: 1;
flex-shrink: 0;
margin-top: 0.1rem;
opacity: 0.7;
}
.cardTitleGroup {
display: flex;
align-items: flex-start;
justify-content: space-between;
gap: 0.5rem;
flex: 1;
min-width: 0;
}
.cardTitle {
font-size: 0.92rem;
font-weight: 600;
line-height: 1.3;
margin: 0;
word-break: break-word;
color: var(--ifm-font-color-base);
}
.sourcePill {
display: inline-flex;
align-items: center;
gap: 0.25rem;
font-family: "JetBrains Mono", monospace;
font-size: 0.62rem;
font-weight: 500;
padding: 0.15rem 0.45rem;
border-radius: 4px;
border: 1px solid;
white-space: nowrap;
flex-shrink: 0;
margin-top: 0.1rem;
}
.cardDesc {
font-size: 0.82rem;
line-height: 1.55;
color: var(--ifm-font-color-secondary, #9a968e);
margin: 0 0 0.6rem;
display: -webkit-box;
-webkit-line-clamp: 2;
-webkit-box-orient: vertical;
overflow: hidden;
}
.cardDescFull {
-webkit-line-clamp: unset;
}
.cardMeta {
display: flex;
align-items: center;
gap: 0.35rem;
flex-wrap: wrap;
}
.catButton {
font-family: "JetBrains Mono", monospace;
font-size: 0.66rem;
padding: 0.15rem 0.45rem;
border: 1px solid rgba(255, 215, 0, 0.12);
border-radius: 3px;
background: rgba(255, 215, 0, 0.04);
color: rgba(255, 215, 0, 0.7);
cursor: pointer;
transition: all 0.15s;
}
.catButton:hover {
background: rgba(255, 215, 0, 0.1);
color: #ffd700;
border-color: rgba(255, 215, 0, 0.25);
}
.platformPill {
font-size: 0.66rem;
padding: 0.12rem 0.4rem;
border-radius: 3px;
background: rgba(96, 165, 250, 0.06);
color: rgba(96, 165, 250, 0.8);
border: 1px solid rgba(96, 165, 250, 0.1);
}
.cardDetail {
margin-top: 0.75rem;
padding-top: 0.7rem;
border-top: 1px solid rgba(255, 255, 255, 0.04);
animation: cardIn 0.2s ease both;
}
.tagRow {
display: flex;
flex-wrap: wrap;
gap: 0.3rem;
margin-bottom: 0.65rem;
}
.tagPill {
font-family: "DM Sans", sans-serif;
font-size: 0.68rem;
padding: 0.12rem 0.4rem;
border: 1px solid rgba(255, 255, 255, 0.06);
border-radius: 3px;
background: rgba(255, 255, 255, 0.02);
color: var(--ifm-font-color-secondary);
cursor: pointer;
transition: all 0.15s;
}
.tagPill:hover {
background: rgba(255, 215, 0, 0.06);
color: rgba(255, 215, 0, 0.8);
border-color: rgba(255, 215, 0, 0.15);
}
.authorRow {
display: flex;
align-items: center;
gap: 0.5rem;
margin-bottom: 0.3rem;
}
.authorLabel {
font-family: "JetBrains Mono", monospace;
font-size: 0.62rem;
text-transform: uppercase;
letter-spacing: 0.06em;
color: var(--ifm-font-color-secondary);
opacity: 0.5;
min-width: 3.5rem;
}
.authorValue {
font-size: 0.78rem;
color: var(--ifm-font-color-base);
}
.installHint {
margin-top: 0.65rem;
padding: 0.45rem 0.65rem;
background: rgba(0, 0, 0, 0.25);
border: 1px solid rgba(255, 215, 0, 0.06);
border-radius: 5px;
}
.installHint code {
font-family: "JetBrains Mono", monospace;
font-size: 0.72rem;
color: rgba(255, 215, 0, 0.7);
background: none;
padding: 0;
}
.highlight {
background: rgba(255, 215, 0, 0.2);
color: #ffd700;
border-radius: 2px;
padding: 0 1px;
}
.loadMoreWrap {
display: flex;
justify-content: center;
margin-top: 1.5rem;
}
.loadMoreBtn {
font-family: "DM Sans", sans-serif;
font-size: 0.85rem;
font-weight: 500;
padding: 0.6rem 1.5rem;
border: 1px solid rgba(255, 215, 0, 0.2);
border-radius: 8px;
background: rgba(255, 215, 0, 0.04);
color: rgba(255, 215, 0, 0.8);
cursor: pointer;
transition: all 0.2s;
}
.loadMoreBtn:hover {
background: rgba(255, 215, 0, 0.08);
border-color: rgba(255, 215, 0, 0.35);
color: #ffd700;
}
.empty {
display: flex;
flex-direction: column;
align-items: center;
justify-content: center;
padding: 5rem 2rem;
text-align: center;
}
.emptyIcon {
font-size: 2.5rem;
margin-bottom: 1rem;
opacity: 0.4;
}
.emptyTitle {
font-size: 1.1rem;
font-weight: 600;
margin: 0 0 0.5rem;
color: var(--ifm-font-color-base);
}
.emptyDesc {
font-size: 0.85rem;
color: var(--ifm-font-color-secondary);
margin: 0 0 1.25rem;
}
.emptyReset {
font-family: "DM Sans", sans-serif;
font-size: 0.85rem;
padding: 0.5rem 1.25rem;
border: 1px solid rgba(255, 215, 0, 0.25);
border-radius: 6px;
background: transparent;
color: #ffd700;
cursor: pointer;
transition: all 0.2s;
}
.emptyReset:hover {
background: rgba(255, 215, 0, 0.08);
}
.backdrop {
display: none;
}
.activeCatBadge {
font-size: 0.72rem;
padding: 0.1rem 0.4rem;
border-radius: 3px;
background: rgba(255, 215, 0, 0.1);
color: rgba(255, 215, 0, 0.8);
}
@media (max-width: 900px) {
.layout {
grid-template-columns: 1fr;
}
.sidebar {
display: none;
position: fixed;
top: 0;
left: 0;
bottom: 0;
width: 280px;
z-index: 200;
background: #0a0a14;
border-right: 1px solid rgba(255, 215, 0, 0.1);
padding-top: 1.5rem;
height: 100vh;
}
.sidebarOpen {
display: block;
}
.backdrop {
display: block;
position: fixed;
inset: 0;
z-index: 190;
background: rgba(0, 0, 0, 0.6);
backdrop-filter: blur(4px);
}
.sidebarToggle {
display: flex;
align-items: center;
gap: 0.4rem;
padding: 0.5rem 0.85rem;
margin: 0 1rem 0.75rem;
border: 1px solid rgba(255, 215, 0, 0.1);
border-radius: 6px;
background: rgba(255, 215, 0, 0.03);
color: var(--ifm-font-color-secondary);
font-family: "DM Sans", sans-serif;
font-size: 0.82rem;
cursor: pointer;
transition: all 0.15s;
}
.sidebarToggle:hover {
border-color: rgba(255, 215, 0, 0.2);
color: var(--ifm-font-color-base);
}
.hero {
padding: 2.5rem 1.25rem 1.75rem;
}
.heroTitle {
font-size: 2rem;
}
.statsRow {
gap: 1.5rem;
}
.statValue {
font-size: 1.25rem;
}
.controlsBar {
padding: 0.75rem 1rem;
}
.main {
padding: 0.75rem 1rem 2rem;
}
.grid {
grid-template-columns: 1fr;
}
}
@media (min-width: 901px) and (max-width: 1100px) {
.grid {
grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
}
}