Skip to content

Providers

hermes-router routes across a pool of providers. You only need one key to start — add more (and more providers) to stay online longer. You can stack quota by creating multiple keys per provider, and by signing up with multiple Google/GitHub accounts.

Add keys with hr auth add <provider> (see configuration.md for where they’re stored).

ProviderFree tierSign up
GeminiGenerous per-minute limitsaistudio.google.com
OpenRouter50 requests/day per keyopenrouter.ai
SambaNovaFree, fast Llama modelscloud.sambanova.ai
GitHub ModelsFree with any GitHub accountgithub.com/settings/tokens
CerebrasFast inference, free tiercloud.cerebras.ai
GroqFast inference, free tierconsole.groq.com
MistralFree tierconsole.mistral.ai
Cohere1,000 calls/mo per keydashboard.cohere.com
Z.ai (GLM)~1k requests/dayz.ai
Naga AI100 requests/day per keynaga.ac
NVIDIA NIM40 requests/min per keybuild.nvidia.com
Hugging Face~$0.10/mo credit (PRO: $2/mo) — 45k+ modelshuggingface.co/settings/tokens

Hugging Face note: one token reaches 45,000+ models across many inference partners via an OpenAI-compatible endpoint. The free credit is small, so it’s best as an extra in the pool (the router fails over to other providers when it runs out). The default model uses the :cheapest suffix to stretch the credit; change it with HUGGINGFACE_MODEL.

Add your existing API key; the router handles everything else.

ProviderDefault modelAPI keys
OpenAIgpt-4o-miniplatform.openai.com
Anthropicclaude-haiku-4-5console.anthropic.com

Anthropic’s API uses a different wire format from OpenAI. hermes-router translates automatically — your app sends the same OpenAI-format request regardless of which provider handles it.

Codex lets you use your ChatGPT subscription (Plus/Pro/Go) for completions instead of a pay-per-token API key. It doesn’t use an API key — it authenticates with OAuth tokens, so setup is different:

Terminal window
codex login # one-time, with the official Codex CLI (opens browser / device flow)
hr auth import-codex # copy the login into the router (reads ~/.codex/auth.json)
hr restart

The router stores the account under codex_accounts in auth.json, refreshes the access token automatically before it expires, and translates your OpenAI-format requests to the Codex Responses API transparently. Add several accounts (run hr auth import-codex after logging into each) and pair with hr mode sequential to drain one account’s quota before the next. Override the model with CODEX_MODEL (default gpt-5.5).

⚠️ Terms of service: routing ChatGPT subscription quota through a proxy is a gray area in OpenAI’s terms and could risk your account. Use your own accounts, at your own discretion.

The Kimi coding plan (Moonshot) is a subscription, but — unlike Codex — it authenticates with a normal API key (sk-...), not OAuth. Its endpoint is OpenAI-compatible, so it adds like any other provider:

Terminal window
hr auth add kimi # paste your Kimi/Moonshot key
hr restart

Defaults to https://api.kimi.com/coding/v1 with model kimi-for-coding. Using the standard Moonshot API instead of the coding plan? Point it elsewhere with KIMI_BASE_URL (e.g. https://api.moonshot.ai/v1) and set KIMI_MODEL to a model like kimi-k2-0905-preview. Get a key at platform.kimi.ai / platform.moonshot.ai.

Local models (Ollama / LM Studio / llama.cpp)

Section titled “Local models (Ollama / LM Studio / llama.cpp)”

Run a model on your own machine and route to it — free, private, and fast, with the cloud providers as automatic fallback. Any OpenAI-compatible local server works (Ollama, LM Studio, llama.cpp’s server, vLLM…). It’s keyless, so there’s nothing to add with hr auth add — just point the router at it:

Terminal window
# e.g. with Ollama: ollama serve && ollama pull llama3.1
hr model set local llama3.1 # writes LOCAL_MODEL; enables the local provider
hr restart

Or set it directly in .env:

LOCAL_BASE_URL=http://localhost:11434/v1 # Ollama default (LM Studio: http://localhost:1234/v1)
LOCAL_MODEL=llama3.1 # comma-separate for multi-model failover
# LOCAL_EMBED_MODEL=nomic-embed-text # optional: also serve /v1/embeddings locally

The provider turns on as soon as LOCAL_BASE_URL or LOCAL_MODEL is set.

Conversation mode — send the model id hermes-router:fast (or the header X-Hermes-Profile: fast) and the router prefers your local model for short/casual turns, falling back to the cloud pool for heavier requests. Plain hermes-router keeps the normal smart routing across every provider.

Use these names with hr auth add, hr model set, and the <PROVIDER>_* environment variables:

gemini, openrouter, sambanova, github_models, cerebras, groq, mistral, cohere, zai, naga, nvidia, huggingface, kimi, openai, anthropic, codex, local.

Each provider’s model is probed at startup for function-calling and reasoning support; results show up in hr status and /v1/status. See usage.md for how those affect tool routing, and configuration.md for the override variables.