docs/content/getting-started/troubleshooting.md
+++
disableToc = false
title = "Troubleshooting"
weight = 9
url = '/basics/troubleshooting/'
icon = "build"
+++
This guide covers common issues you may encounter when using LocalAI, organized by category. For each issue, diagnostic steps and solutions are provided.
Before diving into specific issues, run these commands to gather diagnostic information:
```bash
# Check LocalAI is running and responsive
curl http://localhost:8080/readyz

# List loaded models
curl http://localhost:8080/v1/models

# Check LocalAI version
local-ai --version

# Enable debug logging for detailed output
DEBUG=true local-ai run
# or
local-ai run --log-level=debug
```
For Docker deployments:
```bash
# View container logs
docker logs local-ai

# Check container status
docker ps -a | grep local-ai

# Test GPU access (NVIDIA)
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```
Symptoms: Permission denied or "cannot execute binary file" errors.
Solution:
```bash
chmod +x local-ai-*
./local-ai-Linux-x86_64 run
```
If you see "cannot execute binary file: Exec format error", you downloaded the wrong architecture. Verify with:
```bash
uname -m
# x86_64 → download the x86_64 binary
# aarch64 → download the arm64 binary
```
Symptoms: macOS blocks LocalAI from running because the DMG is not signed by Apple.
Solution: See GitHub issue #6268 for quarantine bypass instructions. This is tracked for resolution in issue #6244.
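Until the DMG is signed, a commonly used manual workaround is to clear the quarantine attribute that Gatekeeper sets on downloaded files. The path below is a placeholder; point it at wherever you placed the app or binary, and prefer the steps in the linked issue if they differ:

```bash
# Remove the quarantine attribute Gatekeeper adds to downloaded files
# (path is illustrative -- adjust to your LocalAI app bundle or binary)
xattr -d com.apple.quarantine /Applications/LocalAI.app
```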
Symptoms: API returns 404 or "model not found" error.
Diagnostic steps:
1. Check the model exists in your models directory:

   ```bash
   ls -la /path/to/models/
   ```

2. Verify your models path is correct:

   ```bash
   # Check what path LocalAI is using
   local-ai run --models-path /path/to/models --log-level=debug
   ```

3. Confirm the model name matches your request:

   ```bash
   # List available models
   curl http://localhost:8080/v1/models | jq '.data[].id'
   ```
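The model name in your request must match one of the ids returned above exactly; anything else results in "model not found". A quick sanity check (`my-model` is a placeholder for an id from the list):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "ping"}]}'
```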
Symptoms: Model is found but fails to load, with backend errors in the logs.
Common causes and fixes:
The required backend may not be installed. LLMs typically use `llama-cpp`, diffusion models use `diffusers`, etc. See the compatibility table for details. List the installed backends and install any that are missing:

```bash
local-ai backends list

# Install a missing backend:
local-ai backends install llama-cpp
```
Symptoms: Model loads but produces unexpected results or errors during inference.
Check your model YAML configuration:
```yaml
# Example model config
name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf  # Relative to models directory
context_size: 2048
threads: 4  # Should match physical CPU cores
```
Common mistakes:
- The `model` path must be relative to the models directory, not an absolute path.
- `threads` set higher than the number of physical CPU cores causes contention.
- A `context_size` too large for the available RAM causes OOM errors.

If LocalAI is not detecting or using your GPU, check the setup for your vendor.

NVIDIA (CUDA):
```bash
# Verify CUDA is available
nvidia-smi

# For Docker, verify GPU passthrough
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```

When working correctly, LocalAI logs should show `ggml_init_cublas: found X CUDA devices`.
Ensure you are using a CUDA-enabled container image (tags containing `cuda11`, `cuda12`, or `cuda13`). CPU-only images cannot use NVIDIA GPUs.
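For example, a CUDA 12 image can be started like this (the tag below is illustrative; use any published LocalAI tag containing `cuda12`):

```bash
docker run -p 8080:8080 --gpus all \
  -v ./models:/build/models \
  localai/localai:latest-gpu-nvidia-cuda-12
```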
AMD (ROCm):
```bash
# Verify ROCm installation
rocminfo

# Docker requires device passthrough
docker run --device=/dev/kfd --device=/dev/dri --group-add=video ...
```
If your GPU is not in the default target list, open an issue on GitHub. Supported targets include: gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942, gfx1030, gfx1031, gfx1100, gfx1101.
Intel (SYCL):
```bash
# Docker requires device passthrough
docker run --device /dev/dri ...
```
Use container images with `gpu-intel` in the tag. Known issue: SYCL hangs when `mmap: true` is set; disable it in your model config:

```yaml
mmap: false
```
Overriding backend auto-detection:
If LocalAI picks the wrong GPU backend, override it:
```bash
LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia local-ai run
# Options: default, nvidia, amd, intel
```
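In Docker Compose the same override can be passed through the environment (a minimal sketch; the image tag is illustrative):

```yaml
services:
  local-ai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    environment:
      - LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia
```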
Symptoms: Model loading fails or the process is killed by the OS.
Solutions:
- Reduce `context_size` in your model YAML.
- Add `low_vram: true` to your model config.
- Limit how many models stay loaded at the same time:

  ```bash
  local-ai run --max-active-backends=1
  ```

- Enable the idle watchdog so unused models are unloaded automatically:

  ```bash
  local-ai run --enable-watchdog-idle --watchdog-idle-timeout=10m
  ```

- Unload a model manually when you no longer need it:

  ```bash
  curl -X POST http://localhost:8080/backend/shutdown \
    -H "Content-Type: application/json" \
    -d '{"model": "model-name"}'
  ```
By default, models remain loaded in memory after first use. This can exhaust VRAM when switching between models.
Configure LRU eviction:
```bash
# Keep at most 2 models loaded; evict least recently used
local-ai run --max-active-backends=2
```
Configure watchdog auto-unload:
```bash
local-ai run \
  --enable-watchdog-idle --watchdog-idle-timeout=15m \
  --enable-watchdog-busy --watchdog-busy-timeout=5m
```
These can also be set via environment variables (`LOCALAI_WATCHDOG_IDLE=true`, `LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m`) or in the Web UI under Settings → Watchdog Settings.
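For Docker deployments, the watchdog settings can be supplied through the container environment; a minimal sketch using only the variables named above:

```yaml
services:
  local-ai:
    image: localai/localai:latest
    environment:
      - LOCALAI_WATCHDOG_IDLE=true
      - LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
```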
See the VRAM Management guide for more details.
Symptoms: `curl: (7) Failed to connect to localhost port 8080: Connection refused`
Diagnostic steps:
1. Verify LocalAI is running:

   ```bash
   # Direct install
   ps aux | grep local-ai

   # Docker
   docker ps | grep local-ai
   ```

2. Check the bind address and port:

   ```bash
   # Default is :8080. Override with:
   local-ai run --address=0.0.0.0:8080
   # or
   LOCALAI_ADDRESS=":8080" local-ai run
   ```

3. Check for port conflicts:

   ```bash
   ss -tlnp | grep 8080
   ```
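If LocalAI runs in Docker and everything looks fine inside the container, also confirm the port is published to the host (a sketch; the image tag and volume path are illustrative):

```bash
docker run -p 8080:8080 \
  -v ./models:/build/models \
  localai/localai:latest
```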
Symptoms: 401 Unauthorized response.
If API key authentication is enabled (`LOCALAI_API_KEY` or `--api-keys`), include the key in your requests:

```bash
curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Keys can also be passed via `x-api-key` or `xi-api-key` headers.
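For clients that cannot set an `Authorization` header, the alternative headers work the same way, for example:

```bash
curl http://localhost:8080/v1/models \
  -H "x-api-key: YOUR_API_KEY"
```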
Symptoms: 400 Bad Request or 422 Unprocessable Entity.
Common causes:
- Missing required fields (such as `model` or `messages`).
- Invalid or missing endpoint-specific parameters (such as `top_n` for reranking).

Enable debug logging to see the full request/response:

```bash
DEBUG=true local-ai run
```
See the API Errors reference for a complete list of error codes and their meanings.
Diagnostic steps:
1. Enable debug mode to see inference timing:

   ```bash
   DEBUG=true local-ai run
   ```

2. Use streaming to measure time-to-first-token:

   ```bash
   curl http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
   ```
Common causes and fixes:
- Disable memory mapping (`mmap: false`) to load the model entirely into RAM.
- Set `--threads` to match your physical CPU core count (not logical/hyperthreaded count).
- Disable mirostat sampling if you do not need it:

  ```yaml
  # In model config
  mirostat: 0
  ```

- Ensure `gpu_layers` is set in your model config to offload layers to GPU:

  ```yaml
  gpu_layers: 99  # Offload all layers
  ```
To reduce memory usage:

- Reduce `context_size`.
- Set `low_vram: true` in the model config.
- Disable `mmlock` (memory locking) if it's enabled.
- Use `--max-active-backends=1` to keep only one model in memory.
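A memory-conservative config might look like the sketch below; it only combines options mentioned above, and the values are illustrative:

```yaml
name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf
context_size: 1024   # smaller context uses less RAM/VRAM
low_vram: true
mmlock: false        # do not lock the model into memory
```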
If the LocalAI container fails to start, work through these diagnostic steps:

```bash
# Check container logs
docker logs local-ai

# Check if port is already in use
ss -tlnp | grep 8080

# Verify the image exists
docker images | grep localai
```
NVIDIA:
```bash
# Ensure nvidia-container-toolkit is installed, then:
docker run --gpus all ...
```
AMD:
```bash
docker run --device=/dev/kfd --device=/dev/dri --group-add=video ...
```
Intel:
```bash
docker run --device /dev/dri ...
```
Add a health check to your Docker Compose configuration:
```yaml
services:
  local-ai:
    image: localai/localai:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 30s
      timeout: 10s
      retries: 3
```
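Other services in the same Compose file can then wait for LocalAI to become healthy before starting; a minimal sketch, where `my-app` is a placeholder for your own service:

```yaml
services:
  my-app:
    image: my-app:latest  # placeholder
    depends_on:
      local-ai:
        condition: service_healthy
```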
Mount a volume for your models directory:
```yaml
services:
  local-ai:
    volumes:
      - ./models:/build/models:cached
```
Symptoms: Distributed inference setup but workers are not found.
Key requirements:

- Use `--net host` or `network_mode: host` in Docker.

Debug P2P connectivity:
```bash
LOCALAI_P2P_LOGLEVEL=debug \
LOCALAI_P2P_LIB_LOGLEVEL=debug \
LOCALAI_P2P_ENABLE_LIMITS=true \
LOCALAI_P2P_TOKEN="<TOKEN>" \
local-ai run
```
If DHT is causing issues, try disabling it to use local mDNS discovery instead:
```bash
LOCALAI_P2P_DISABLE_DHT=true local-ai run
```
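In Docker, the same debug settings can be combined with host networking (a sketch; see the Distributed Inferencing guide for the full worker and federation options):

```yaml
services:
  local-ai:
    image: localai/localai:latest
    network_mode: host
    environment:
      - LOCALAI_P2P_TOKEN=<TOKEN>
      - LOCALAI_P2P_LOGLEVEL=debug
      - LOCALAI_P2P_LIB_LOGLEVEL=debug
```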
See the Distributed Inferencing guide for full setup instructions.
If your issue isn't covered here:
- Enable debug logging with `DEBUG=true` or `--log-level=debug` and include the logs when reporting the problem.