website/docs/guides/local-ollama-setup.md
Cloud LLM APIs charge per token. A heavy coding session can cost $5–20. For personal projects, learning, or privacy-sensitive work, that adds up — and you're sending every conversation to a third party.
You'll set up Hermes Agent running entirely on your own hardware, using Ollama as the model backend. No API keys, no subscriptions, no data leaving your machine. Once configured, Hermes works exactly like it does with OpenRouter or Anthropic — terminal commands, file editing, web browsing, delegation — but the model runs locally.
By the end, you'll have:
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB (for 3B models) | 32+ GB (for 27B+ models) |
| Storage | 5 GB free | 30+ GB (for multiple models) |
| CPU | 4 cores | 8+ cores (AMD EPYC, Ryzen, Intel Xeon) |
| GPU | Not required | NVIDIA GPU with 8+ GB VRAM speeds things up significantly |
:::tip CPU-only works, but expect slower responses
Ollama runs on CPU-only servers. A 9B model on a modern 8-core CPU gives ~10 tokens/sec. A 31B model on CPU is slower (~2–5 tokens/sec) — each response takes 30–120 seconds, but it works. A GPU dramatically improves this. For CPU-only setups, widen the API timeout via the env var (it's not a config.yaml key):
# ~/.hermes/.env
HERMES_API_TIMEOUT=1800 # 30 minutes — generous for slow local models
:::
curl -fsSL https://ollama.com/install.sh | sh
Verify it's running:
ollama --version
curl http://localhost:11434/api/tags # Should return {"models":[]}
Choose based on your hardware:
| Model | Size on Disk | RAM Needed | Tool Calling | Best For |
|---|---|---|---|---|
gemma4:31b | ~20 GB | 24+ GB | Yes | Best quality — strong tool use and reasoning |
gemma2:27b | ~16 GB | 20+ GB | No | Conversational tasks, no tool use |
gemma2:9b | ~5 GB | 8+ GB | No | Fast chat, Q&A — cannot call tools |
llama3.2:3b | ~2 GB | 4+ GB | No | Lightweight quick answers only |
:::warning Tool calling matters
Hermes is an agentic assistant — it edits files, runs commands, and browses the web through tool calls. Models without tool-call support can only chat; they can't take actions. For the full Hermes experience, use a model that supports tools (like gemma4:31b).
:::
Pull your chosen model:
ollama pull gemma4:31b
:::info Multiple models
You can pull several models and switch between them inside Hermes with /model. Ollama loads the active model into memory on demand and unloads idle ones automatically.
:::
Verify the model works:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:31b",
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 50
}'
You should see a JSON response with the model's reply.
Run the Hermes setup wizard:
hermes setup
When prompted for a provider, select Custom Endpoint and enter:
http://localhost:11434/v1no-key (Ollama doesn't need one)gemma4:31b (or whichever model you pulled)Alternatively, edit ~/.hermes/config.yaml directly:
model:
default: "gemma4:31b"
provider: "custom"
base_url: "http://localhost:11434/v1"
hermes
That's it. You're now running a fully local agent. Try it out:
You: List all Python files in this directory and count the lines of code in each
You: Read the README.md and summarize what this project does
You: Create a Python script that fetches the weather for Ho Chi Minh City
Hermes will use the terminal tool, file operations, and your local model — no cloud calls.
Not every task needs the biggest model. Here's a practical guide:
| Task | Recommended Model | Why |
|---|---|---|
| File edits, code, terminal commands | gemma4:31b | Only model with reliable tool calling |
| Quick Q&A (no tool use needed) | gemma2:9b | Fast responses for conversational tasks |
| Lightweight chat | llama3.2:3b | Fastest, but very limited capabilities |
:::note
For full agentic work (editing files, running commands, browsing), gemma4:31b is currently the best local option with tool-call support. Check Ollama's model library for newer models — tool-calling support is expanding rapidly.
:::
Switch models on the fly inside a session:
/model gemma2:9b
By default, Ollama uses a 2048-token context. Hermes requires at least 64,000 tokens for agentic work with tools:
# Create a Modelfile that extends context
cat > /tmp/Modelfile << 'EOF'
FROM gemma4:31b
PARAMETER num_ctx 64000
EOF
ollama create gemma4-64k -f /tmp/Modelfile
Then update your Hermes config to use gemma4-64k as the model name.
By default, Ollama unloads models after 5 minutes of inactivity. For a persistent gateway bot, keep it loaded:
# Set keep-alive to 24 hours
curl http://localhost:11434/api/generate \
-d '{"model": "gemma4:31b", "keep_alive": "24h"}'
Or set it globally in Ollama's environment:
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
If you have an NVIDIA GPU, Ollama automatically offloads layers to it. Check with:
ollama ps # Shows which model is loaded and how many GPU layers
For a 31B model on a 12 GB GPU, you'll get partial offload (~40 layers on GPU, rest on CPU), which still gives a significant speedup.
Once Hermes works locally in the CLI, you can expose it as a Telegram or Discord bot — still running entirely on your hardware.
~/.hermes/config.yaml:model:
default: "gemma4:31b"
provider: "custom"
base_url: "http://localhost:11434/v1"
platforms:
telegram:
enabled: true
token: "YOUR_TELEGRAM_BOT_TOKEN"
hermes gateway
Now message your bot on Telegram — it responds using your local model.
platforms:
discord:
enabled: true
token: "YOUR_DISCORD_BOT_TOKEN"
hermes gatewayLocal models can struggle with complex tasks. Set up a cloud fallback that only activates when the local model fails:
model:
default: "gemma4:31b"
provider: "custom"
base_url: "http://localhost:11434/v1"
fallback_providers:
- provider: openrouter
model: anthropic/claude-sonnet-4
This way, 90% of your usage is free (local), and only the hard tasks hit the paid API.
Ollama isn't running. Start it:
sudo systemctl start ollama
# or
ollama serve
ollama ps: If no GPU layers are offloaded, responses are CPU-bound. This is normal for CPU-only servers./compress regularly, or set a lower compression threshold in config.Smaller models (3B, 7B) sometimes ignore tool-call instructions and produce plain text instead of structured function calls. Solutions:
gemma4:31b or gemma2:27b handle tool calls much better than 3B/7B models.The default Ollama context (2048 tokens) is too small for agentic work. See Step 6 to increase it.
Here's what running locally saves compared to cloud APIs, based on a typical coding session (~100K tokens input, ~20K tokens output):
| Provider | Cost per Session | Monthly (daily use) |
|---|---|---|
| Anthropic Claude Sonnet | ~$0.80 | ~$24 |
| OpenRouter (GPT-4o) | ~$0.60 | ~$18 |
| Ollama (local) | $0.00 | $0.00 |
Your only cost is electricity — roughly $0.01–0.05 per session depending on hardware.
The sweet spot: use local for everyday tasks, set up a cloud fallback for the hard stuff.