Architecture — How it works
hermes-router is a single Python file (router.py) running a small Flask/Waitress server. It
accepts OpenAI- or Anthropic-format requests and forwards each one to the best available
provider in a pool, handling key rotation, failover, and format translation transparently.
The request pipeline
Section titled “The request pipeline”Every request flows through the same pipeline:
┌──────────┐ OpenAI- or Anthropic-format ┌─────────────────────────────────────┐ │ Your app │ ───────────────────────────────► │ hermes-router │ └──────────┘ Bearer / x-api-key (PROXY key) │ │ ▲ │ 1. Auth check (constant-time) │ │ │ 2. Cache lookup (per-caller) │ │ OpenAI/Anthropic response │ 3. Rate the request 1–5 │ └────────────────────────────────────────►│ 4. Order providers by fit + health │ │ 5. Try providers, rotate keys │ └───────────────┬─────────────────────┘ │ first one that succeeds ┌───────────────▼─────────────────────┐ │ Gemini · OpenRouter · Groq · Mistral │ │ Cohere · NVIDIA · Codex · Kimi (16) │ └──────────────────────────────────────┘- Authenticate — the caller’s key is compared against
PROXY_API_KEYSin constant time (hmac.compare_digest). BothAuthorization: Bearerand Anthropic’sx-api-keyare accepted. - Cache lookup — identical requests can be served from an in-memory cache, namespaced by the calling key (see Response cache).
- Rate the request — a 1–5 difficulty score is computed from length and content, with no extra API call.
- Order providers — each model is scored 1–5 for capability; the router prefers the cheapest model that can still handle the request, skips unhealthy ones, and rotates among equally-good ties.
- Try and fail over — it sends to the first provider, rotating keys; on a rate-limit or error it cascades to the next, so a single failure never reaches your app.
The moving parts
Section titled “The moving parts”Credential pool
Section titled “Credential pool”Every provider can hold many keys (from auth.json first, then .env). Keys are tracked in a
thread-safe pool with per-key cooldowns. A key that gets rate-limited (HTTP 429) is put on a
short cooldown and skipped until it recovers.
Rotation modes (set with hr mode, see Configuration):
round-robin(default) — spread requests evenly across all keys; they deplete together.sequential— drain one key fully until it rate-limits, then move to the next, keeping later keys/accounts fresh in reserve. Ideal for rationing many accounts.
Multiple models per provider. A provider’s <PROVIDER>_MODEL can be a comma-separated
list. Because free-tier rate limits are per-model, cooldowns are tracked per (key,
model) pair: when one model hits a 429, the router fails over to the next model on the same
key before cascading to the next provider. This multiplies free capacity along a third axis —
keys × models × providers — with no extra signups. See
Configuration.
Smart routing
Section titled “Smart routing”Requests are scored for difficulty and models for capability (both 1–5, lower = more capable).
The router picks the cheapest model that can handle the request. Tool requests are only sent to
providers whose model supports function calling (detected at startup). Optional fast routing
(FAST_ROUTE_THRESHOLD) sends short requests to low-latency providers first.
Local models & conversation mode. A model running on your own machine (Ollama / LM Studio /
llama.cpp) can join the pool as the local provider — free, private, fast (see
Providers). Sending the model id
hermes-router:fast (or header X-Hermes-Profile: fast) makes the router prefer that local
model for short/casual turns, with the cloud providers as automatic fallback for heavier
requests.
Failover & circuit breaker
Section titled “Failover & circuit breaker”If a provider errors or times out, the router cascades to the next automatically. A provider
that keeps failing health checks (network errors or 5xx — not rate-limits or bad requests) has
its circuit breaker tripped: it’s pulled out of rotation for a cooldown, then re-probed
(half-open). Healthy providers are always preferred. Tunable via the BREAKER_* settings.
Response cache
Section titled “Response cache”Identical requests can be served from an in-memory TTL+LRU cache, saving free-tier quota. Cache
entries are namespaced by the caller’s API key, so two different PROXY_API_KEYS never share
a cached answer for the same prompt — safe to expose to multiple users. Disable with
CACHE_TTL_SECONDS=0.
Semantic cache (opt-in, SEMANTIC_CACHE=1) goes a step further: on an exact-match miss it
embeds the prompt (reusing the router’s own embeddings pipeline) and returns a cached answer
whose stored prompt is similar above SEMANTIC_CACHE_THRESHOLD (cosine). It’s a bounded linear
scan over the LRU within the caller’s namespace, and degrades gracefully to exact-match when no
embedding provider is available — so it adds savings without changing behavior when off.
Per-key budgets & rate limits
Section titled “Per-key budgets & rate limits”Each PROXY_API_KEYS entry can carry a requests-per-minute ceiling and per-UTC-day request and
token budgets (set globally via PROXY_LIMIT_* or per key in auth.json with hr limit). A
caller over its limit gets a 429 with Retry-After before any provider is contacted; live
counters appear in /v1/status. Unset = unlimited, so single-user setups are unaffected. This
makes the router safe to share with a team. See
Configuration.
Accurate token counting
Section titled “Accurate token counting”Request size is measured with tiktoken (the o200k_base encoder, loaded lazily) for accurate
routing and large-payload skipping, with a characters ÷ 4 fallback when tiktoken is unavailable.
Capability probing
Section titled “Capability probing”At startup the router probes each provider once to learn its real model, whether it supports
function calling, and whether it’s a reasoning model. Results are cached to
router_state.json for ROUTER_STATE_TTL_HOURS (default 24h) so restarts don’t re-probe. You
can override any result with <PROVIDER>_SUPPORTS_TOOLS / <PROVIDER>_REASONING.
Reasoning models spend output tokens on hidden chain-of-thought, so the router reserves extra
output budget (REASONING_TOKEN_RESERVE) to stop a small max_tokens from yielding an empty reply.
Request guardrails
Section titled “Request guardrails”The router defends itself and avoids wasted upstream calls:
- Body-size limit — requests larger than
MAX_REQUEST_BYTES(default 10 MB) are rejected with413before any provider is contacted, so a buggy client can’t exhaust memory. - Large-payload skip — some free tiers reject big requests outright (e.g. Groq ~6K
tokens/min →
413). When a request is estimated to exceed a provider’s ceiling (<PROVIDER>_SKIP_TOKENS_OVER), that provider is skipped and the router cascades on instead of burning a guaranteed-failed attempt. - Output clamp — providers that
400whenmax_tokensexceeds their output cap have the requested output transparently clamped down to their ceiling (<PROVIDER>_MAX_OUTPUT_TOKENS), so the call still succeeds.
Concurrency
Section titled “Concurrency”The server runs on Waitress with a configurable thread pool (WORKER_THREADS, default 16). The
upstream HTTP connection pool scales with that automatically, and streaming responses close their
upstream connection cleanly when the stream ends or the client disconnects.
Protocol translation
Section titled “Protocol translation”Your app always speaks one format; the router adapts to whatever the chosen provider needs.
| Provider type | Wire format | How the router handles it |
|---|---|---|
| Most providers | OpenAI Chat Completions | Pass-through (the router’s native format) |
| Anthropic | Messages API (/v1/messages) | Two-way translation incl. tools & streaming |
| Codex (ChatGPT) | Responses API over OAuth | Two-way translation + OAuth token lifecycle |
- OpenAI ⇄ Anthropic —
/v1/messagesis accepted for Anthropic-SDK apps, translated to OpenAI format, routed through the same pipeline, and translated back (includingtool_use/tool_resultblocks and streaming). - Codex (ChatGPT subscription) — authenticates with OAuth, not an API key. Accounts are
imported with
hr auth import-codex; the router mints fresh access tokens from the refresh token, sends requests to the ChatGPT backend in Responses-API format, and translates the SSE stream back to OpenAI chunks. Multiple accounts pool naturally and pair withsequentialrotation to ration them. See Providers.
Endpoints
Section titled “Endpoints”| Endpoint | Auth | Purpose |
|---|---|---|
POST /v1/chat/completions | proxy key | OpenAI chat completions (streaming + tools) |
POST /v1/messages | proxy key | Anthropic Messages API (translated) |
POST /v1/embeddings | proxy key | OpenAI embeddings (stable provider order) |
GET /v1/models | proxy key | Advertises the hermes-router model id |
GET /v1/status | proxy key | Per-provider health, latency, keys, rotation, cache |
GET /health | none | Liveness check for uptime monitors |
GET /metrics | optional | Prometheus metrics (set METRICS_REQUIRE_AUTH=1 to lock) |
Observability
Section titled “Observability”hr status renders a live dashboard (provider health, latency, key cooldowns, cache, rotation
mode) from /v1/status. /metrics exposes Prometheus counters and gauges for Grafana — counts
and timings only, never request content. See Monitoring.
Ways to run and connect
Section titled “Ways to run and connect”The same router.py engine runs everywhere; you choose how to launch it and how to drive it.
Run it:
hrCLI (Linux/macOS/WSL) —hr setup,hr auth add,hr status,hr restart. The friendly day-to-day way to manage a local router. See Deployment.- Docker image — the prebuilt multi-arch
shafiq735/hermes-routerruns the same on Windows, macOS, and Linux:docker run -p 8319:8319 …. See Deployment. - Hugging Face Space — host it in the cloud for free. See Deployment.
Connect to it:
- Any OpenAI or Anthropic SDK — point
base_urlat the router and you’re done. See Usage. - VS Code extension — monitor the provider pool, manage the router, and use hermes-router as a model inside Copilot Chat (including agent mode). See VS Code Extension.
Design principles
Section titled “Design principles”- Self-contained — one Python file; keys live in your own
auth.json(git-ignored,0600). Nothing is installed system-wide beyond thehrsymlink. - Configured by environment — every behavior is an env var with a sensible default; see Configuration.
- Fail soft — when in doubt the router makes forward progress (e.g. if every provider’s breaker is open it probes them all) rather than hard-failing while options remain.