Docker Model Runner lets you run AI models locally on your machine: no API keys, no recurring costs, and your data stays private. Once models are downloaded you can work offline, making local models an alternative to cloud model providers.
You need Docker Model Runner installed and running. On Linux, install it with your distribution's package manager:

```console
$ sudo apt-get install docker-model-plugin
```

or:

```console
$ sudo dnf install docker-model-plugin
```

See Get started with DMR.

Verify Docker Model Runner is available:

```console
$ docker model version
```
If the command returns version information, you're ready to use local models.
Docker Model Runner can run any compatible model. Models can come from:

- Docker Hub (`docker.io/namespace/model-name`)
- Hugging Face (`hf.co/org/model-name`)

To see models available in the local Docker catalog, run:

```console
$ docker model list --openai
```
To use a model, reference it in your configuration. DMR automatically pulls models on first use if they're not already local.
Configure your agent to use Docker Model Runner with the `dmr` provider:

```yaml
agents:
  root:
    model: dmr/ai/qwen3
    instruction: You are a helpful assistant
    toolsets:
      - type: filesystem
```
When you first run your agent, Docker Agent prompts you to pull the model if it's not already available locally:
```console
$ docker agent run agent.yaml
Model not found locally. Do you want to pull it now? ([y]es/[n]o)
```
When you configure an agent to use DMR, Docker Agent automatically connects to your local Docker Model Runner and routes inference requests to it. If a model isn't available locally, Docker Agent prompts you to pull it on first use. No API keys or authentication are required.
For more control over model behavior, define a model configuration:
```yaml
models:
  local-qwen:
    provider: dmr
    model: ai/qwen3:14B
    temperature: 0.7
    max_tokens: 8192

agents:
  root:
    model: local-qwen
    instruction: You are a helpful coding assistant
```
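The same `models` pattern works for models pulled from Hugging Face. A sketch, where the model reference is a placeholder rather than a real repository:

```yaml
models:
  hf-model:
    provider: dmr
    # Placeholder reference; substitute a real model from Hugging Face.
    model: hf.co/org/model-name
```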
Speed up model responses using speculative decoding with a smaller draft model:
```yaml
models:
  fast-qwen:
    provider: dmr
    model: ai/qwen3:14B
    provider_opts:
      speculative_draft_model: ai/qwen3:0.6B-Q4_K_M
      speculative_num_tokens: 16
      speculative_acceptance_rate: 0.8
```
The draft model generates token candidates, and the main model validates them. This can significantly improve throughput for longer responses.
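Conceptually, a speculative decoding step looks like this. The toy sketch below is an illustration of the accept/reject idea only, not DMR's inference code: the cheap draft model proposes a block of tokens, the expensive main model verifies them, and the longest agreeing prefix is kept, so several tokens can be emitted per verification pass.

```python
def speculative_step(draft_tokens, verify):
    """Keep the longest prefix of draft tokens the main model agrees with.

    draft_tokens: candidate tokens proposed by the small draft model.
    verify: stand-in for the main model; given the accepted prefix, it
    returns the token it would have produced at that position.
    """
    accepted = []
    for tok in draft_tokens:
        expected = verify(accepted)
        if tok != expected:
            # First mismatch: discard the rest of the draft and emit
            # the main model's own token instead.
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted


# Toy "main model" that deterministically continues the sequence 1, 2, 3, ...
def main_model(prefix):
    return len(prefix) + 1


# The draft proposes 4 tokens; the 4th is wrong, so 3 are accepted plus the
# main model's correction -- 4 tokens for a single verification pass.
print(speculative_step([1, 2, 3, 9], main_model))  # [1, 2, 3, 4]
```

When the draft model's guesses are usually right (a high acceptance rate), most steps emit a full block of tokens at the cost of one main-model pass, which is where the throughput gain comes from.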
Pass engine-specific flags to optimize performance:
```yaml
models:
  optimized-qwen:
    provider: dmr
    model: ai/qwen3
    provider_opts:
      runtime_flags: ["--ngl=33", "--threads=8"]
```
Common flags:
- `--ngl` - Number of GPU layers
- `--threads` - CPU thread count
- `--repeat-penalty` - Repetition penalty

Docker Model Runner supports both embeddings and reranking for RAG workflows.
Use local embeddings for indexing your knowledge base:
```yaml
rag:
  codebase:
    docs: [./src]
    strategies:
      - type: chunked-embeddings
        embedding_model: dmr/ai/embeddinggemma
        database: ./code.db
```
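The chunked-embeddings strategy amounts to embedding each document chunk as a vector and retrieving chunks whose vectors are most similar to the query's. A minimal sketch of that retrieval idea in plain Python, with hand-made toy vectors standing in for what an embedding model such as `ai/embeddinggemma` would produce:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" for three chunks and a query (illustration only).
chunks = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

# Rank chunks by similarity to the query, most similar first.
ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
print(ranked[0])  # chunk-a
```

In the real configuration, the index of chunk vectors is what gets persisted in the `database` file.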
DMR provides native reranking for improved RAG results:
```yaml
models:
  reranker:
    provider: dmr
    model: hf.co/ggml-org/qwen3-reranker-0.6b-q8_0-gguf

rag:
  docs:
    docs: [./documentation]
    strategies:
      - type: chunked-embeddings
        embedding_model: dmr/ai/embeddinggemma
        limit: 20
        results:
          reranking:
            model: reranker
            threshold: 0.5
            limit: 5
```
Native DMR reranking is the fastest option for reranking RAG results.
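The `threshold` and `limit` settings behave as a filter over the reranker's relevance scores. A toy sketch of that filtering step (the scores are hypothetical; scoring itself is done by the reranker model):

```python
def apply_reranking(scored, threshold, limit):
    """Drop results scoring below threshold, then keep the top `limit`."""
    kept = [(doc, score) for doc, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Hypothetical relevance scores for five retrieved chunks.
scored = [("a", 0.91), ("b", 0.42), ("c", 0.77), ("d", 0.55), ("e", 0.30)]
print(apply_reranking(scored, threshold=0.5, limit=2))
# [('a', 0.91), ('c', 0.77)]
```

So with the configuration above, up to 20 chunks are retrieved by embedding similarity, then rescored by the reranker, and only those scoring at least 0.5 survive, capped at the best 5.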
If Docker Agent can't find Docker Model Runner:
1. Verify Docker Model Runner status:

   ```console
   $ docker model status
   ```

2. Check available models:

   ```console
   $ docker model list
   ```

3. Check model logs for errors:

   ```console
   $ docker model logs
   ```

4. Ensure Docker Desktop has Model Runner enabled in settings (macOS/Windows).