This demo showcases a RAG implementation using the Nexa SDK.
These models are optimized for the Qualcomm NPU:

```shell
nexa pull NexaAI/embeddinggemma-300m-npu
nexa pull NexaAI/jina-v2-rerank-npu
nexa pull NexaAI/Llama3.2-3B-NPU-Turbo
```

These GGUF models are fully compatible with macOS and Windows x64; no NPU is required, as they run on CPU/GPU:

```shell
nexa pull NexaAI/Qwen3-4B-GGUF
nexa pull jinaai/jina-embeddings-v4-text-retrieval-GGUF
nexa pull jinaai/jina-reranker-v3-GGUF
```
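The three model roles above (embedding, reranker, LLM) combine in a standard retrieve-then-rerank pipeline: a cheap embedding similarity narrows the corpus to a few candidates, then the slower, more accurate reranker orders them before the LLM sees the result. A minimal sketch of that flow, with toy word-overlap scoring standing in for the real models (all function names here are illustrative, not part of the Nexa API):

```python
# Illustrative retrieve-then-rerank flow; the scoring functions are
# toy stand-ins for the embedding and reranker models pulled above.

def embed_score(query: str, doc: str) -> float:
    # Stand-in for embedding similarity: Jaccard overlap of word sets.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def rerank_score(query: str, doc: str) -> int:
    # Stand-in for a cross-encoder reranker: count query-word hits.
    return sum(doc.lower().count(w) for w in query.lower().split())

def retrieve_then_rerank(query, docs, k_retrieve=3, k_final=1):
    # Stage 1: cheap embedding similarity narrows the candidate set.
    candidates = sorted(docs, key=lambda d: embed_score(query, d),
                        reverse=True)[:k_retrieve]
    # Stage 2: the more expensive reranker orders the survivors.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k_final]

docs = [
    "Nexa serves models on Qualcomm NPU.",
    "Gradio builds web UIs for Python apps.",
    "RAG retrieves documents before generation.",
]
print(retrieve_then_rerank("how does RAG retrieve documents", docs))
# → ['RAG retrieves documents before generation.']
```

In the real demo, both stages call the models served by `nexa serve` instead of these heuristics.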
```shell
# Navigate to the example directory
cd Serve-Example

# Create a Python virtual environment
python -m venv .venv

# Activate the virtual environment
.\.venv\Scripts\activate   # Windows
source .venv/bin/activate  # macOS

# Install all required dependencies
pip install -r requirements.txt
```
First, open a new terminal window and start the Nexa server:

```shell
nexa serve
```
In a second terminal window, run either the CLI or the Gradio UI version:

```shell
# Option 1: CLI version with an interactive terminal interface
# (direct interaction with the agent through the command line)
python rag_nexa.py --data ../docs

# Option 2: Gradio UI version
# (starts a local web server with a chat interface at http://localhost:7860)
python gradio_ui.py
```
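Before the pipeline can embed anything from the `--data` directory, the documents have to be split into chunks. A minimal sketch of fixed-size chunking with overlap, a common default for RAG ingestion (illustrative only; `rag_nexa.py` may use a different chunking strategy):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Slide a window of `size` characters, stepping by size - overlap,
    # so neighbouring chunks share `overlap` characters of context.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 500, size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])
# → 3 [200, 200, 200]
```

Overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, at the cost of some duplicated embedding work.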