+++
disableToc = false
title = "VRAM and Memory Management"
weight = 22
url = '/advanced/vram-management'
+++
When running multiple models in LocalAI, especially on systems with limited GPU memory (VRAM), you may encounter situations where loading a new model fails because there isn't enough available VRAM. This is a common issue when working with GPU-accelerated models, as VRAM is typically more limited than system RAM. For more context, see issues #6068, #7269, and #5352.

By default, LocalAI keeps models loaded in memory once they're first used, so VRAM usage grows with each model you request and is only released when a model is unloaded. To prevent VRAM exhaustion, LocalAI provides several mechanisms to manage model memory automatically:

- a limit on the maximum number of active backends, with LRU (Least Recently Used) eviction
- single active backend mode, which keeps only one model loaded at a time
- idle and busy watchdogs that unload inactive or stuck models
- a management API for stopping models manually
## Maximum Active Backends (LRU Eviction)

LocalAI supports limiting the maximum number of active backends (loaded models) using LRU (Least Recently Used) eviction. When the limit is reached and a new model needs to be loaded, the least recently used model is automatically unloaded to make room.
Set the maximum number of active backends using CLI flags or environment variables:
```bash
# Allow up to 3 models loaded simultaneously
./local-ai --max-active-backends=3

# Using environment variables
LOCALAI_MAX_ACTIVE_BACKENDS=3 ./local-ai
MAX_ACTIVE_BACKENDS=3 ./local-ai
```
Setting the limit to 1 is equivalent to single active backend mode (see below). Setting to 0 disables the limit (unlimited backends).
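For instance, the two special values look like this on the command line (same flag as above):

```bash
# Unlimited loaded models (LRU eviction disabled)
./local-ai --max-active-backends=0

# At most one loaded model (single active backend mode)
./local-ai --max-active-backends=1
```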
By default, LocalAI skips evicting models that have active API calls, so an in-flight request is never interrupted to make room for a new model; instead, eviction is retried until the busy model becomes idle (see the retry settings below).

You can configure this behavior via the WebUI or using the following settings:
To allow evicting models even when they have active API calls (not recommended for production):
```bash
# Via CLI
./local-ai --force-eviction-when-busy

# Via environment variable
LOCALAI_FORCE_EVICTION_WHEN_BUSY=true ./local-ai
```
Warning: Enabling force eviction can interrupt active requests and cause errors. Only use this if you understand the implications.
When models are busy and cannot be evicted, LocalAI will retry eviction with configurable settings:
```bash
# Configure maximum retries (default: 30)
./local-ai --lru-eviction-max-retries=50

# Configure retry interval (default: 1s)
./local-ai --lru-eviction-retry-interval=2s

# Using environment variables
LOCALAI_LRU_EVICTION_MAX_RETRIES=50 \
LOCALAI_LRU_EVICTION_RETRY_INTERVAL=2s \
./local-ai
```
These settings control how long the system will wait for busy models to become idle before giving up. The retry mechanism allows busy models to complete their requests before being evicted, preventing request failures. For example, with the defaults (30 retries at a 1-second interval), LocalAI waits up to roughly 30 seconds for a busy model to finish; with the values above (50 retries at 2 seconds), it waits up to roughly 100 seconds.
The following walkthrough shows LRU eviction in action (request bodies abbreviated):

```bash
# Allow 2 active backends
LOCALAI_MAX_ACTIVE_BACKENDS=2 ./local-ai

# First request - model-a is loaded (1 active)
curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}'

# Second request - model-b is loaded (2 active, at limit)
curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}'

# Third request - model-a is evicted (LRU), model-c is loaded
curl http://localhost:8080/v1/chat/completions -d '{"model": "model-c", ...}'

# Request for model-b updates its "last used" time
curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}'
```
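The request bodies in the walkthrough above are abbreviated. A complete request follows the standard OpenAI-compatible chat format; the model name and prompt below are placeholders:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-a",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```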
## Single Active Backend

The simplest approach is to ensure only one model is loaded at a time. This is now implemented as `--max-active-backends=1`. When a new model is requested, LocalAI will automatically unload the currently active model before loading the new one.
```bash
# These are equivalent:
./local-ai --max-active-backends=1
./local-ai --single-active-backend

# Using environment variables
LOCALAI_MAX_ACTIVE_BACKENDS=1 ./local-ai
LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai
```
Note: The `--single-active-backend` flag is deprecated but still supported for backward compatibility. It is recommended to use `--max-active-backends=1` instead.
## Watchdog Mechanisms

For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck.
Note: Watchdog settings can be configured via the [Runtime Settings]({{%relref "features/runtime-settings#watchdog-settings" %}}) web interface, which allows you to adjust settings without restarting the application.
### Idle Watchdog

The idle watchdog monitors models that haven't been used for a specified period and automatically unloads them to free VRAM.
Via environment variables or CLI:
```bash
# Enable the idle watchdog
LOCALAI_WATCHDOG_IDLE=true ./local-ai

# Enable with a custom timeout
LOCALAI_WATCHDOG_IDLE=true LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m ./local-ai

# Equivalent CLI flags
./local-ai --enable-watchdog-idle --watchdog-idle-timeout=10m
```
Via web UI: Navigate to Settings → Watchdog Settings and enable "Watchdog Idle Enabled" with your desired timeout.
### Busy Watchdog

The busy watchdog monitors models that have been processing requests for an unusually long time and terminates them if they exceed a threshold. This is useful for detecting and recovering from stuck or hung backends.
Via environment variables or CLI:
```bash
# Enable the busy watchdog
LOCALAI_WATCHDOG_BUSY=true ./local-ai

# Enable with a custom timeout
LOCALAI_WATCHDOG_BUSY=true LOCALAI_WATCHDOG_BUSY_TIMEOUT=10m ./local-ai

# Equivalent CLI flags
./local-ai --enable-watchdog-busy --watchdog-busy-timeout=10m
```
Via web UI: Navigate to Settings → Watchdog Settings and enable "Watchdog Busy Enabled" with your desired timeout.
### Enabling Both Watchdogs

You can enable both watchdogs simultaneously for comprehensive memory management:
```bash
LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
LOCALAI_WATCHDOG_BUSY=true \
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
./local-ai
```
Or using command line flags:
```bash
./local-ai \
  --enable-watchdog-idle --watchdog-idle-timeout=15m \
  --enable-watchdog-busy --watchdog-busy-timeout=5m
```
For example, with the configuration below, models are automatically unloaded after 10 minutes of inactivity, and backends stuck processing a request for more than 5 minutes are terminated:

```bash
# Start LocalAI with both watchdogs enabled
LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m \
LOCALAI_WATCHDOG_BUSY=true \
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
./local-ai

# Load two models; each remains in VRAM until it has been
# idle for 10 minutes, then the idle watchdog unloads it
curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}'
curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}'
```
Timeouts can be specified using Go's duration format:
- `15m` - 15 minutes
- `1h` - 1 hour
- `30s` - 30 seconds
- `2h30m` - 2 hours and 30 minutes

## Combining Mechanisms

You can combine Max Active Backends (LRU eviction) with the watchdog mechanisms for comprehensive memory management:
```bash
# Allow up to 3 active backends with idle watchdog
LOCALAI_MAX_ACTIVE_BACKENDS=3 \
LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
./local-ai
```
Or using command line flags:
```bash
./local-ai \
  --max-active-backends=3 \
  --enable-watchdog-idle --watchdog-idle-timeout=15m
```
This configuration:

- keeps at most 3 models loaded at a time, evicting the least recently used model when the limit is reached
- unloads any model that has been idle for more than 15 minutes, even if the limit has not been reached
You can also configure retry behavior when models are busy:
```bash
# Allow up to 2 active backends with custom retry settings
LOCALAI_MAX_ACTIVE_BACKENDS=2 \
LOCALAI_LRU_EVICTION_MAX_RETRIES=50 \
LOCALAI_LRU_EVICTION_RETRY_INTERVAL=2s \
./local-ai
```
Or using command line flags:
```bash
./local-ai \
  --max-active-backends=2 \
  --lru-eviction-max-retries=50 \
  --lru-eviction-retry-interval=2s
```
This configuration:

- keeps at most 2 models loaded at a time
- when the model selected for eviction is busy, retries eviction up to 50 times at 2-second intervals (about 100 seconds in total) before giving up
## Limitations

LocalAI cannot reliably estimate the VRAM a new model will require across different backends (llama.cpp, vLLM, diffusers, etc.), because each backend allocates memory differently and actual usage depends on factors such as model size, quantization, context length, and runtime buffers. The mechanisms above therefore manage memory reactively, through limits, eviction, and timeouts, rather than by predicting memory needs.
## Manual Model Management

If automatic management doesn't meet your needs, you can manually stop models using the LocalAI management API:
```bash
curl -X POST http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "model-name"}'
```
To stop all models, you'll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once.
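A small shell loop can do this from the command line; this is a minimal sketch, and the model names are placeholders for whatever you currently have loaded:

```bash
# Stop several models one by one via the management API.
# Replace the names below with your loaded models.
for model in model-a model-b model-c; do
  curl -X POST http://localhost:8080/backend/shutdown \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\"}"
done
```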
## Best Practices

- Use `nvidia-smi` (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage
- For systems with limited VRAM, `--max-active-backends=1` is often the simplest solution. For systems with more VRAM, you can increase the limit to keep more models loaded
- Use `--max-active-backends` to limit the number of loaded models, and enable the idle watchdog to unload models that haven't been used recently
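For the monitoring step, one option on NVIDIA GPUs is to sample memory usage at an interval using standard `nvidia-smi` query flags:

```bash
# Print used and total VRAM once per second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```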