Skip to content

Configuration

All configuration is via environment variables (in .env) and the auth.json credential store. Everything is optional with sensible defaults — the router runs out of the box once it has at least one key.

hr auth add writes to auth.json — the router’s own credential store, kept next to the router. It’s git-ignored, so real keys are never committed. Codex (ChatGPT subscription) logins are stored separately under codex_accounts (via hr auth import-codex); the router refreshes their OAuth access tokens automatically.

{
"providers": {
"openrouter": ["sk-or-key1", "sk-or-key2"],
"gemini": ["AIzaSy-key"]
}
}

Keys in .env (e.g. OPENROUTER_API_KEYS=k1,k2) still work too — the router reads auth.json first, then falls back to .env. Point at a different file with ROUTER_AUTH_FILE=/path/to/auth.json.

VariableDefaultPurpose
PORT8319Port to listen on
HOST0.0.0.0Bind address. Set 127.0.0.1 to listen on localhost only (recommended on a shared/VPS host — reach it via localhost or an SSH tunnel). Keep 0.0.0.0 for Docker.
PROXY_API_KEYSsk-router-1Comma-separated keys your app uses to authenticate
ROUTER_AUTH_FILE./auth.jsonWhere keys are stored
CACHE_TTL_SECONDS300Response cache lifetime (0 disables). Entries are namespaced per API key, so different PROXY_API_KEYS never share a cached answer — safe for multi-tenant use
LOG_LEVELINFOLogging verbosity
METRICS_REQUIRE_AUTH0Require the proxy key on /metrics (1 to enable)
REASONING_TOKEN_RESERVE4096Extra output budget added for reasoning models so hidden chain-of-thought doesn’t eat the answer (0 disables)
ROTATION_MODEround-robinHow keys are picked within a provider (set via hr mode) — round-robin or sequential

Sensible defaults — most users never touch these.

VariableDefaultPurpose
MAX_REQUEST_BYTES10485760 (10 MB)Max request body size; larger requests get 413 (guards against memory exhaustion)
WORKER_THREADS16Waitress worker threads (concurrency). The HTTP connection pool scales with this
CACHE_MAX_SIZE100Max entries in the response cache (LRU eviction)
SEMANTIC_CACHE0If 1, also serve cached answers for similar prompts (needs an embedding provider; falls back to exact match otherwise)
SEMANTIC_CACHE_THRESHOLD0.95Cosine-similarity cutoff for a semantic hit (1.0 = identical; lower = looser matching)
FAST_ROUTE_THRESHOLD0If >0, requests under this many tokens prefer low-latency providers first (0 disables)
ROUTER_MODEL_IDhermes-routerThe model name clients send (the router maps it to each provider’s real model)
ROUTER_STATE_FILE./router_state.jsonWhere provider ratings/capabilities are cached between restarts (use /tmp/... on read-only hosts like HF Spaces)
ROUTER_STATE_TTL_HOURS24How long the cached probe state is trusted before re-probing (0 = re-probe every start)
BREAKER_WINDOW8Recent outcomes the circuit breaker weighs per provider
BREAKER_MIN_SAMPLES4Minimum samples before the breaker can trip
BREAKER_ERROR_RATE0.5Health-failure fraction that trips the breaker
BREAKER_COOLDOWN60Seconds the breaker stays open before re-probing

Give each PROXY_API_KEYS entry a ceiling so the router is safe to share with a team. These env vars are global defaults; set per-key overrides in auth.json with hr limit set. 0 = unlimited (the default — no enforcement). Live usage shows in /v1/status and hr status.

VariableDefaultPurpose
PROXY_LIMIT_RPM0Requests/minute per key (rolling 60s window)
PROXY_LIMIT_REQ_DAY0Requests per UTC day, per key
PROXY_LIMIT_TOKENS_DAY0Tokens per UTC day, per key
Terminal window
hr limit set sk-team-1 --rpm 60 --req-day 500 --tokens-day 100000 # per-key, written to auth.json
hr limit list # show all
hr restart # apply

Exceeding a limit returns 429 with a clear message and a Retry-After header. Per-key limits in auth.json look like:

{ "proxy_keys": { "sk-team-1": { "rpm": 60, "req_per_day": 500, "tokens_per_day": 100000 } } }

Local model (Ollama / LM Studio / llama.cpp)

Section titled “Local model (Ollama / LM Studio / llama.cpp)”

Set either of the first two to enable a local provider pointing at a model on your own machine. It’s keyless (cloud providers remain the fallback). See Providers → Local models.

VariableDefaultPurpose
LOCAL_BASE_URLhttp://localhost:11434/v1Your local server’s OpenAI-compatible endpoint (LM Studio: :1234/v1)
LOCAL_MODELllama3.1Local model id (comma-separate for multi-model failover)
LOCAL_API_KEYlocalOnly if your local server actually requires a key
LOCAL_EMBED_MODEL(unset)Optional — also serve /v1/embeddings from the local server

Send model hermes-router:fast (or header X-Hermes-Profile: fast) to prefer the local model for short/casual turns, with cloud fallback for heavier requests.

VariableDefaultPurpose
ANTHROPIC_MODELclaude-haiku-4-5-20251001Model override (set via hr model set)
CODEX_MODELgpt-5.5Codex (ChatGPT subscription) model — see providers.md
OPENAI_MODELgpt-4o-miniModel override (set via hr model set)
GEMINI_MODELgemini-2.5-flash-liteModel override (set via hr model set)
<PROVIDER>_MODEL(varies)Same pattern applies to all providers
VariableDefaultPurpose
GEMINI_EMBED_MODELgemini-embedding-001Embedding model (empty disables this provider for /v1/embeddings)
<PROVIDER>_EMBED_MODEL(gemini/mistral/cohere set)Same pattern for embeddings; set empty to disable

The router auto-probes each provider at startup, but you can force the result:

VariableDefaultPurpose
<PROVIDER>_SUPPORTS_TOOLS(auto-probed)Force tool-capability on/off (1/0)
<PROVIDER>_REASONING(auto-probed)Force reasoning-model on/off (1/0)
<PROVIDER>_SKIP_TOKENS_OVER(per provider)Skip this provider when an estimated request exceeds this many tokens (0 = never)
<PROVIDER>_MAX_OUTPUT_TOKENS(per provider)Clamp max_tokens down to this provider’s output ceiling (0 = no clamp)

Each provider has a default model that works out of the box. Switch models without editing files:

Terminal window
hr model list # see all providers and their active model
hr model set anthropic claude-sonnet-4-6 # upgrade Anthropic to Sonnet
hr model set openai gpt-4o # use full GPT-4o instead of mini
hr model set gemini gemini-2.5-pro # switch Gemini to Pro
hr model reset anthropic # revert back to the default
hr restart # apply changes

Overrides are stored as plain variables in .env (e.g. ANTHROPIC_MODEL=claude-sonnet-4-6) and active overrides are highlighted in hr model list.

A provider can use several models — just give <PROVIDER>_MODEL a comma-separated list:

Terminal window
hr model set gemini gemini-2.5-flash-lite,gemini-2.5-flash,gemini-2.0-flash
hr restart

Free-tier rate limits are almost always per-model, so each model is its own quota bucket. When the first model hits its limit (429), the router fails over to the next model on the same key before cascading to the next provider — multiplying free throughput along a new axis (keys × models × providers), with no extra signups. The first model in the list is the primary (used for routing, rating, and status); list them in preference order (cheapest/fastest first).

Keep a provider’s models the same class. Tool-calling and reasoning are detected on the primary model and applied to the whole provider, so list models that behave alike (e.g. the Gemini flash family). Don’t mix a tool-capable chat model with one that can’t.

When a provider holds several keys (or several accounts), ROTATION_MODE decides how the router picks among them:

Terminal window
hr mode # show the current mode
hr mode round-robin # default — spread requests evenly across all keys
hr mode sequential # drain one key fully before moving to the next
hr restart # apply the change
  • round-robin (default) — every request goes to the next key in turn, so all keys share the load and deplete together. Best for spreading latency and load.
  • sequential — one key is used until it hits its rate limit, then the router moves to the next, and so on. Later keys stay untouched in reserve — useful when you want to ration accounts after a quota reset instead of burning them all at once. Keys are drained in the order they appear in auth.json.

Either way, failover, per-key cooldowns, and the circuit breaker keep working — the mode only changes which ready key is preferred next. The active mode shows in hr status and at /v1/status.