Skip to content

Architecture — How it works

hermes-router is a single Python file (router.py) running a small Flask/Waitress server. It accepts OpenAI- or Anthropic-format requests and forwards each one to the best available provider in a pool, handling key rotation, failover, and format translation transparently.

Every request flows through the same pipeline:

┌──────────┐ OpenAI- or Anthropic-format ┌─────────────────────────────────────┐
│ Your app │ ───────────────────────────────► │ hermes-router │
└──────────┘ Bearer / x-api-key (PROXY key) │ │
▲ │ 1. Auth check (constant-time) │
│ │ 2. Cache lookup (per-caller) │
│ OpenAI/Anthropic response │ 3. Rate the request 1–5 │
└────────────────────────────────────────►│ 4. Order providers by fit + health │
│ 5. Try providers, rotate keys │
└───────────────┬─────────────────────┘
│ first one that succeeds
┌───────────────▼─────────────────────┐
│ Gemini · OpenRouter · Groq · Mistral │
│ Cohere · NVIDIA · Codex · Kimi (16) │
└──────────────────────────────────────┘
  1. Authenticate — the caller’s key is compared against PROXY_API_KEYS in constant time (hmac.compare_digest). Both Authorization: Bearer and Anthropic’s x-api-key are accepted.
  2. Cache lookup — identical requests can be served from an in-memory cache, namespaced by the calling key (see Response cache).
  3. Rate the request — a 1–5 difficulty score is computed from length and content, with no extra API call.
  4. Order providers — each model is scored 1–5 for capability; the router prefers the cheapest model that can still handle the request, skips unhealthy ones, and rotates among equally-good ties.
  5. Try and fail over — it sends to the first provider, rotating keys; on a rate-limit or error it cascades to the next, so a single failure never reaches your app.

Every provider can hold many keys (from auth.json first, then .env). Keys are tracked in a thread-safe pool with per-key cooldowns. A key that gets rate-limited (HTTP 429) is put on a short cooldown and skipped until it recovers.

Rotation modes (set with hr mode, see Configuration):

  • round-robin (default) — spread requests evenly across all keys; they deplete together.
  • sequential — drain one key fully until it rate-limits, then move to the next, keeping later keys/accounts fresh in reserve. Ideal for rationing many accounts.

Multiple models per provider. A provider’s <PROVIDER>_MODEL can be a comma-separated list. Because free-tier rate limits are per-model, cooldowns are tracked per (key, model) pair: when one model hits a 429, the router fails over to the next model on the same key before cascading to the next provider. This multiplies free capacity along a third axis — keys × models × providers — with no extra signups. See Configuration.

Requests are scored for difficulty and models for capability (both 1–5, lower = more capable). The router picks the cheapest model that can handle the request. Tool requests are only sent to providers whose model supports function calling (detected at startup). Optional fast routing (FAST_ROUTE_THRESHOLD) sends short requests to low-latency providers first.

Local models & conversation mode. A model running on your own machine (Ollama / LM Studio / llama.cpp) can join the pool as the local provider — free, private, fast (see Providers). Sending the model id hermes-router:fast (or header X-Hermes-Profile: fast) makes the router prefer that local model for short/casual turns, with the cloud providers as automatic fallback for heavier requests.

If a provider errors or times out, the router cascades to the next automatically. A provider that keeps failing health checks (network errors or 5xx — not rate-limits or bad requests) has its circuit breaker tripped: it’s pulled out of rotation for a cooldown, then re-probed (half-open). Healthy providers are always preferred. Tunable via the BREAKER_* settings.

Identical requests can be served from an in-memory TTL+LRU cache, saving free-tier quota. Cache entries are namespaced by the caller’s API key, so two different PROXY_API_KEYS never share a cached answer for the same prompt — safe to expose to multiple users. Disable with CACHE_TTL_SECONDS=0.

Semantic cache (opt-in, SEMANTIC_CACHE=1) goes a step further: on an exact-match miss it embeds the prompt (reusing the router’s own embeddings pipeline) and returns a cached answer whose stored prompt is similar above SEMANTIC_CACHE_THRESHOLD (cosine). It’s a bounded linear scan over the LRU within the caller’s namespace, and degrades gracefully to exact-match when no embedding provider is available — so it adds savings without changing behavior when off.

Each PROXY_API_KEYS entry can carry a requests-per-minute ceiling and per-UTC-day request and token budgets (set globally via PROXY_LIMIT_* or per key in auth.json with hr limit). A caller over its limit gets a 429 with Retry-After before any provider is contacted; live counters appear in /v1/status. Unset = unlimited, so single-user setups are unaffected. This makes the router safe to share with a team. See Configuration.

Request size is measured with tiktoken (the o200k_base encoder, loaded lazily) for accurate routing and large-payload skipping, with a characters ÷ 4 fallback when tiktoken is unavailable.

At startup the router probes each provider once to learn its real model, whether it supports function calling, and whether it’s a reasoning model. Results are cached to router_state.json for ROUTER_STATE_TTL_HOURS (default 24h) so restarts don’t re-probe. You can override any result with <PROVIDER>_SUPPORTS_TOOLS / <PROVIDER>_REASONING.

Reasoning models spend output tokens on hidden chain-of-thought, so the router reserves extra output budget (REASONING_TOKEN_RESERVE) to stop a small max_tokens from yielding an empty reply.

The router defends itself and avoids wasted upstream calls:

  • Body-size limit — requests larger than MAX_REQUEST_BYTES (default 10 MB) are rejected with 413 before any provider is contacted, so a buggy client can’t exhaust memory.
  • Large-payload skip — some free tiers reject big requests outright (e.g. Groq ~6K tokens/min → 413). When a request is estimated to exceed a provider’s ceiling (<PROVIDER>_SKIP_TOKENS_OVER), that provider is skipped and the router cascades on instead of burning a guaranteed-failed attempt.
  • Output clamp — providers that 400 when max_tokens exceeds their output cap have the requested output transparently clamped down to their ceiling (<PROVIDER>_MAX_OUTPUT_TOKENS), so the call still succeeds.

The server runs on Waitress with a configurable thread pool (WORKER_THREADS, default 16). The upstream HTTP connection pool scales with that automatically, and streaming responses close their upstream connection cleanly when the stream ends or the client disconnects.

Your app always speaks one format; the router adapts to whatever the chosen provider needs.

Provider typeWire formatHow the router handles it
Most providersOpenAI Chat CompletionsPass-through (the router’s native format)
AnthropicMessages API (/v1/messages)Two-way translation incl. tools & streaming
Codex (ChatGPT)Responses API over OAuthTwo-way translation + OAuth token lifecycle
  • OpenAI ⇄ Anthropic/v1/messages is accepted for Anthropic-SDK apps, translated to OpenAI format, routed through the same pipeline, and translated back (including tool_use / tool_result blocks and streaming).
  • Codex (ChatGPT subscription) — authenticates with OAuth, not an API key. Accounts are imported with hr auth import-codex; the router mints fresh access tokens from the refresh token, sends requests to the ChatGPT backend in Responses-API format, and translates the SSE stream back to OpenAI chunks. Multiple accounts pool naturally and pair with sequential rotation to ration them. See Providers.
EndpointAuthPurpose
POST /v1/chat/completionsproxy keyOpenAI chat completions (streaming + tools)
POST /v1/messagesproxy keyAnthropic Messages API (translated)
POST /v1/embeddingsproxy keyOpenAI embeddings (stable provider order)
GET /v1/modelsproxy keyAdvertises the hermes-router model id
GET /v1/statusproxy keyPer-provider health, latency, keys, rotation, cache
GET /healthnoneLiveness check for uptime monitors
GET /metricsoptionalPrometheus metrics (set METRICS_REQUIRE_AUTH=1 to lock)

hr status renders a live dashboard (provider health, latency, key cooldowns, cache, rotation mode) from /v1/status. /metrics exposes Prometheus counters and gauges for Grafana — counts and timings only, never request content. See Monitoring.

The same router.py engine runs everywhere; you choose how to launch it and how to drive it.

Run it:

  • hr CLI (Linux/macOS/WSL)hr setup, hr auth add, hr status, hr restart. The friendly day-to-day way to manage a local router. See Deployment.
  • Docker image — the prebuilt multi-arch shafiq735/hermes-router runs the same on Windows, macOS, and Linux: docker run -p 8319:8319 …. See Deployment.
  • Hugging Face Space — host it in the cloud for free. See Deployment.

Connect to it:

  • Any OpenAI or Anthropic SDK — point base_url at the router and you’re done. See Usage.
  • VS Code extension — monitor the provider pool, manage the router, and use hermes-router as a model inside Copilot Chat (including agent mode). See VS Code Extension.
  • Self-contained — one Python file; keys live in your own auth.json (git-ignored, 0600). Nothing is installed system-wide beyond the hr symlink.
  • Configured by environment — every behavior is an env var with a sensible default; see Configuration.
  • Fail soft — when in doubt the router makes forward progress (e.g. if every provider’s breaker is open it probes them all) rather than hard-failing while options remain.