docs/guides/models/deploy_local_llm.mdx
import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';
Deploy and run local models using Ollama, Xinference, vLLM, SGLang, GPUStack, or other frameworks.
RAGFlow supports deploying models locally using Ollama, Xinference, IPEX-LLM, vLLM, SGLang, GPUStack, or Jina. If you have locally deployed models to leverage, or wish to enable GPU or CUDA for inference acceleration, you can bind Ollama or Xinference into RAGFlow and use either of them as a local "server" for interacting with your local models.
RAGFlow seamlessly integrates with Ollama and Xinference, without the need for further environment configurations. You can use them to deploy two types of local models in RAGFlow: chat models and embedding models.
:::tip NOTE This user guide does not intend to cover much of the installation or configuration details of Ollama or Xinference; its focus is on configurations inside RAGFlow. For the most current information, you may need to check out the official site of Ollama or Xinference. :::
Ollama enables you to run open-source large language models locally. It bundles model weights, configurations, and data into a single package, defined by a Modelfile, and optimizes setup and configuration, including GPU usage.
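As an aside, a Modelfile also lets you derive a customized model from one you have pulled. This is not required for RAGFlow; the following is a minimal, hypothetical sketch (the model name, parameter, and system prompt are examples only):

```bash
# Hypothetical example: derive a customized model from llama3.2 using a
# Modelfile, then register it under a new name with `ollama create`.
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER temperature 0.7
SYSTEM "You are a concise technical assistant."
EOF
ollama create my-llama3.2 -f Modelfile
```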
:::note
Ollama can be installed from binaries or deployed with Docker. The instructions below deploy it with Docker:
:::
```bash
$ sudo docker run --name ollama -p 11434:11434 ollama/ollama
> time=2024-12-02T02:20:21.360Z level=INFO source=routes.go:1248 msg="Listening on [::]:11434 (version 0.4.6)"
> time=2024-12-02T02:20:21.360Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12]"
```
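If your host has an NVIDIA GPU and the NVIDIA Container Toolkit installed, you can instead expose the GPU to the container and persist downloaded models in a named volume; a sketch based on Ollama's Docker instructions:

```bash
# Optional GPU-enabled variant: requires the NVIDIA Container Toolkit.
# The named volume keeps pulled models across container restarts.
$ sudo docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```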
Ensure Ollama is listening on all IP addresses:
```bash
$ sudo ss -tunlp | grep 11434
> tcp LISTEN 0 4096 0.0.0.0:11434 0.0.0.0:* users:(("docker-proxy",pid=794507,fd=4))
> tcp LISTEN 0 4096 [::]:11434 [::]:* users:(("docker-proxy",pid=794513,fd=4))
```
Pull models as you need. We recommend that you start with llama3.2 (a 3B chat model) and bge-m3 (a 567M embedding model):
```bash
$ sudo docker exec ollama ollama pull llama3.2
> pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB
> success
$ sudo docker exec ollama ollama pull bge-m3
> pulling daec91ffb5dd... 100% ▕████████████████▏ 1.2 GB
> success
```
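Optionally, confirm that both models are now available inside the container:

```bash
# List the models known to this Ollama instance; llama3.2 and bge-m3
# should both appear.
$ sudo docker exec ollama ollama list
```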
If RAGFlow runs in Docker, `localhost` inside the RAGFlow container does not point to the host machine, so use `host.docker.internal` instead. When Ollama runs on the same host machine, the right URL to use for Ollama is `http://host.docker.internal:11434/`; check that Ollama is accessible from inside the RAGFlow container:

```bash
$ sudo docker exec -it docker-ragflow-cpu-1 bash
$ curl http://host.docker.internal:11434/
> Ollama is running
```

From the host machine itself, or from another machine, Ollama should be reachable at `http://localhost:11434/` or `http://${IP_OF_OLLAMA_MACHINE}:11434/` respectively:

```bash
$ curl http://localhost:11434/
> Ollama is running
$ curl http://${IP_OF_OLLAMA_MACHINE}:11434/
> Ollama is running
```
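As an optional sanity check, you can send a test request to the pulled chat model through Ollama's /api/generate endpoint (adjust the host to whichever base URL is reachable in your setup):

```bash
# Ask llama3.2 for a short, non-streaming completion.
$ curl http://host.docker.internal:11434/api/generate \
    -d '{"model": "llama3.2", "prompt": "Say hello in one word.", "stream": false}'
```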
In RAGFlow, click on your logo on the top right of the page > Model providers and add Ollama to RAGFlow:
In the popup window, complete basic settings for Ollama:
- Ensure that the model name and model type match the model you pulled, for example (llama3.2 and chat) or (bge-m3 and embedding).
- Set the base URL to `http://host.docker.internal:11434`, `http://localhost:11434`, or `http://${IP_OF_OLLAMA_MACHINE}:11434`, depending on where Ollama runs.

:::caution WARNING
An improper base URL setting will trigger the following error:

```
Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff98b81ff0>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
:::
Click on your logo > Model providers > System Model Settings to update your model:
Update your model(s) accordingly in Chat Configuration.
Xorbits Inference (Xinference) enables you to unleash the full potential of cutting-edge AI models.
To deploy a local model, e.g., Mistral, using Xinference:
Ensure that your host machine's firewall allows inbound connections on port 9997.
```bash
$ xinference-local --host 0.0.0.0 --port 9997
```
Launch your local model (Mistral), ensuring that you replace ${quantization} with your chosen quantization method:
```bash
$ xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}
```
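Optionally, verify that the model is being served before configuring RAGFlow. Xinference exposes an OpenAI-compatible API, so a request similar to the following should list the launched model (the host is a placeholder for your setup):

```bash
# List models served by Xinference through its OpenAI-compatible endpoint.
$ curl http://<your-xinference-endpoint-domain>:9997/v1/models
```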
In RAGFlow, click on your logo on the top right of the page > Model providers and add Xinference to RAGFlow:
Enter an accessible base URL, such as http://<your-xinference-endpoint-domain>:9997/v1.
For a rerank model, use `http://<your-xinference-endpoint-domain>:9997/v1/rerank` as the base URL.
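If you also want a local rerank model, launch one in Xinference first. A hypothetical sketch; the model name is an example, and the available models and flags depend on your Xinference version, so check the Xinference documentation:

```bash
# Example only: launch a built-in rerank model in Xinference.
$ xinference launch --model-name bge-reranker-base --model-type rerank
```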
Click on your logo > Model providers > System Model Settings to update your model.
You should now be able to find mistral in the dropdown list under Chat model.
Update your chat model accordingly in Chat Configuration:
IPEX-LLM is a PyTorch library for running LLMs on local Intel CPUs or GPUs (including iGPU or discrete GPUs like Arc, Flex, and Max) with low latency. It supports Ollama on Linux and Windows systems.
To deploy a local model, e.g., Qwen2, using IPEX-LLM-accelerated Ollama:
Ensure that your host machine's firewall allows inbound connections on port 11434. For example:
```bash
sudo ufw allow 11434/tcp
```
:::tip NOTE IPEX-LLM supports Ollama on Linux and Windows systems. :::
For detailed information about installing IPEX-LLM for Ollama, see Run llama.cpp with IPEX-LLM on Intel GPU Guide:
After the installation, you should have created a Conda environment, e.g., llm-cpp, for running Ollama commands with IPEX-LLM.
Activate the llm-cpp Conda environment and initialize Ollama:

<Tabs
  defaultValue="linux"
  values={[
    {label: 'Linux', value: 'linux'},
    {label: 'Windows', value: 'windows'},
  ]}>
<TabItem value="linux">

```bash
conda activate llm-cpp
init-ollama
```

</TabItem>
<TabItem value="windows">

Run these commands with administrator privileges in Miniforge Prompt:

```bash
conda activate llm-cpp
init-ollama.bat
```

</TabItem>
</Tabs>
If the installed ipex-llm[cpp] requires an upgrade to the Ollama binary files, remove the old binary files and reinitialize Ollama using init-ollama (Linux) or init-ollama.bat (Windows).
A symbolic link to Ollama appears in your current directory, and you can use this executable file following standard Ollama commands.
Set the environment variable OLLAMA_NUM_GPU to 999 to ensure that all layers of your model run on the Intel GPU; otherwise, some layers may default to CPU.
For optimal performance on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), set the following environment variable before launching the Ollama service:
```bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
Launch the Ollama service:
<Tabs
  defaultValue="linux"
  values={[
    {label: 'Linux', value: 'linux'},
    {label: 'Windows', value: 'windows'},
  ]}>
<TabItem value="linux">

```bash
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve
```

</TabItem>
<TabItem value="windows">

Run the following commands in Miniforge Prompt:

```bash
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve
```

</TabItem>
</Tabs>
:::tip NOTE
To enable the Ollama service to accept connections from all IP addresses, use OLLAMA_HOST=0.0.0.0 ./ollama serve rather than simply ./ollama serve.
:::
The console displays messages similar to the following:
With the Ollama service running, open a new terminal and run `./ollama pull <model_name>` (Linux) or `ollama.exe pull <model_name>` (Windows) to pull the desired model. For example, qwen2:latest:
<Tabs
  defaultValue="linux"
  values={[
    {label: 'Linux', value: 'linux'},
    {label: 'Windows', value: 'windows'},
  ]}>
<TabItem value="linux">

```bash
./ollama run qwen2:latest
```

</TabItem>
<TabItem value="windows">

```bash
ollama run qwen2:latest
```

</TabItem>
</Tabs>
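Optionally, from another terminal, confirm that the IPEX-LLM-backed Ollama service answers API requests (adjust the host and port to your setup):

```bash
# Lists the locally available models; qwen2:latest should appear once pulled.
$ curl http://localhost:11434/api/tags
```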
To use IPEX-LLM-accelerated Ollama with RAGFlow, you must also complete the configuration steps in RAGFlow. These steps are identical to those outlined in the Deploy a local model using Ollama section:
The following example uses Ubuntu 22.04/24.04. Install vLLM:

```bash
pip install vllm
```

Start the vLLM server in the background. This example serves a model stored locally at /data/Qwen3-8B on port 1025 and writes its log to /var/log/vllm_startup1.log:

```bash
nohup vllm serve /data/Qwen3-8B --served-model-name Qwen3-8B-FP8 --dtype auto --port 1025 --gpu-memory-utilization 0.90 --tool-call-parser hermes --enable-auto-tool-choice > /var/log/vllm_startup1.log 2>&1 &
```
Check the startup log:

```bash
tail -f -n 100 /var/log/vllm_startup1.log
```

When you see output similar to the following, the vLLM engine is ready for access:

```
Starting vLLM API server 0 on http://0.0.0.0:1025
Started server process [19177]
Application startup complete.
```
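Before wiring the endpoint into RAGFlow, you can optionally verify the OpenAI-compatible API from the host; the port and served model name below match the serve command above:

```bash
# List the served models.
$ curl http://localhost:1025/v1/models

# Send a minimal chat completion request.
$ curl http://localhost:1025/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen3-8B-FP8", "messages": [{"role": "user", "content": "Hello"}]}'
```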
In RAGFlow, click on your logo on the top right of the page > Model providers, search for vLLM, add it, and configure it as follows:
Then, in System Model Settings, select the vLLM chat model as the default chat model:
Finally, create a chat assistant, create a conversation, and chat with the model:
The following example uses Ubuntu 22.04/24.04. Deploy GPUStack with Docker:
```bash
sudo docker run -d --name gpustack \
    --restart unless-stopped \
    -p 80:80 \
    -p 10161:10161 \
    --volume gpustack-data:/var/lib/gpustack \
    gpustack/gpustack
```
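If the host has NVIDIA GPUs and the NVIDIA Container Toolkit installed, you will typically also expose them to the container; a sketch under that assumption (check the GPUStack documentation for the flags recommended on your platform):

```bash
# GPU-enabled variant: adds --gpus all to the command above.
sudo docker run -d --name gpustack \
    --restart unless-stopped \
    --gpus all \
    -p 80:80 \
    -p 10161:10161 \
    --volume gpustack-data:/var/lib/gpustack \
    gpustack/gpustack
```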
Check that the container is running:

```bash
docker ps
```
When you see output similar to the following, the GPUStack container is up and running:

```
root@gpustack-prod:~# docker ps
CONTAINER ID   IMAGE               COMMAND                  CREATED       STATUS       PORTS                                                                                  NAMES
abf59be84b1a   gpustack/gpustack   "/usr/bin/entrypoint…"   6 hours ago   Up 6 hours   0.0.0.0:80->80/tcp, [::]:80->80/tcp, 0.0.0.0:10161->10161/tcp, [::]:10161->10161/tcp   gpustack
```
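To log in to the GPUStack web UI at http://<your-host>:80 and deploy models there, retrieve the initial admin password; the path below follows GPUStack's quick start and may differ across versions:

```bash
# Print the auto-generated password for the initial admin account.
$ sudo docker exec gpustack cat /var/lib/gpustack/initial_admin_password
```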
In RAGFlow, click on your logo on the top right of the page > Model providers, search for GPUStack, add it, and configure it as follows:
Then select the GPUStack chat model as the default chat model: