Documentation/deployment_guide.md
Last updated: 2025-01-07
This guide provides comprehensive instructions for deploying the RAG system using both Docker and direct development approaches.
Both deployment methods require:

- Ollama (install with `curl -fsSL https://ollama.ai/install.sh | sh`)
- Git 2.30+ (for cloning)

For Docker deployment:

- Docker Engine 24.0+
- Docker Compose 2.20+

For direct development:

- Python 3.8+
- Node.js 16+
- npm 8+
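
Before going further, it can help to confirm that the installed versions match the requirements above. A minimal sketch (run only the lines relevant to your chosen path):

```bash
# Print tool versions; compare against the prerequisites listed above
git --version            # expect 2.30+
docker --version         # Docker path: expect Engine 24.0+
docker compose version   # Docker path: expect Compose 2.20+
python3 --version        # direct path: expect 3.8+
node --version           # direct path: expect 16+
npm --version            # direct path: expect 8+
```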
Ubuntu/Debian:

```bash
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
newgrp docker

# Install Docker Compose V2
sudo apt-get update
sudo apt-get install docker-compose-plugin
```

macOS:

```bash
# Install Docker Desktop
brew install --cask docker
# Or download from: https://www.docker.com/products/docker-desktop
```

Windows:

```bash
# Install Docker Desktop with WSL2 backend
# Download from: https://www.docker.com/products/docker-desktop
```
Docker deployment:

```bash
git clone https://github.com/your-org/rag-system.git
cd rag-system

# Install Ollama (runs locally even with Docker)
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama
ollama serve

# In another terminal, install models
ollama pull qwen3:0.6b
ollama pull qwen3:8b

# Start all containers using the convenience script
./start-docker.sh

# Or manually:
docker compose --env-file docker.env up --build -d

# Check container status
docker compose ps

# Test all endpoints
curl http://localhost:3000            # Frontend
curl http://localhost:8000/health     # Backend
curl http://localhost:8001/models     # RAG API
curl http://localhost:11434/api/tags  # Ollama
```
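
On slower machines the containers and models can take a while to become ready, so the endpoint tests above may fail at first. A small polling sketch (the URL comes from the checks above; the 60-second budget is an arbitrary choice):

```bash
# Poll the backend health endpoint until it answers, up to ~60 seconds
for i in $(seq 1 30); do
  if curl -fs http://localhost:8000/health >/dev/null; then
    echo "Backend ready"
    break
  fi
  sleep 2
done
```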
Managing the Docker deployment:

```bash
# Start system
./start-docker.sh

# Stop system
./start-docker.sh stop

# View logs
./start-docker.sh logs

# Check status
./start-docker.sh status

# Manual Docker Compose commands
docker compose ps             # Check status
docker compose logs -f        # Follow logs
docker compose down           # Stop all containers
docker compose up --build -d  # Rebuild and restart

# Restart specific service
docker compose restart rag-api

# View specific service logs
docker compose logs -f backend

# Execute commands in container
docker compose exec rag-api python -c "print('Hello')"
```
Python Dependencies:

```bash
# Clone repository
git clone https://github.com/your-org/rag-system.git
cd rag-system

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Python packages
pip install -r requirements.txt
```

Node.js Dependencies:

```bash
# Install Node.js dependencies
npm install
```

Ollama setup:

```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama
ollama serve

# In another terminal, install models
ollama pull qwen3:0.6b
ollama pull qwen3:8b
```
Option A: Integrated Launcher (Recommended)

```bash
# Start all components with one command
python run_system.py
```

Option B: Manual Component Startup

```bash
# Terminal 1: RAG API
python -m rag_system.api_server

# Terminal 2: Backend
cd backend && python server.py

# Terminal 3: Frontend
npm run dev

# Access at http://localhost:3000
```
Verify the installation:

```bash
# Check system health
python system_health_check.py

# Test endpoints
curl http://localhost:3000         # Frontend
curl http://localhost:8000/health  # Backend
curl http://localhost:8001/models  # RAG API
```
Daily workflow:

```bash
# Start system
python run_system.py

# Check system health
python system_health_check.py

# Stop system
# Press Ctrl+C in terminal running run_system.py

# Start components individually
python -m rag_system.api_server  # RAG API on port 8001
cd backend && python server.py   # Backend on port 8000
npm run dev                      # Frontend on port 3000

# Development tools
npm run build                               # Build frontend for production
pip install -r requirements.txt --upgrade   # Update Python packages
```
Docker architecture:

```mermaid
graph TB
    subgraph "Docker Containers"
        Frontend[Frontend Container<br/>Next.js<br/>Port 3000]
        Backend[Backend Container<br/>Python API<br/>Port 8000]
        RAG[RAG API Container<br/>Document Processing<br/>Port 8001]
    end
    subgraph "Local System"
        Ollama[Ollama Server<br/>Port 11434]
    end
    Frontend --> Backend
    Backend --> RAG
    RAG --> Ollama
```
Direct development architecture:

```mermaid
graph TB
    subgraph "Local Processes"
        Frontend[Next.js Dev Server<br/>Port 3000]
        Backend[Python Backend<br/>Port 8000]
        RAG[RAG API<br/>Port 8001]
        Ollama[Ollama Server<br/>Port 11434]
    end
    Frontend --> Backend
    Backend --> RAG
    RAG --> Ollama
```
Docker environment (`docker.env`):

```env
# Ollama Configuration
OLLAMA_HOST=http://host.docker.internal:11434

# Service Configuration
NODE_ENV=production
RAG_API_URL=http://rag-api:8001
NEXT_PUBLIC_API_URL=http://localhost:8000
```
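
Note that `host.docker.internal` resolves automatically under Docker Desktop (macOS/Windows); on plain Linux it usually requires an `extra_hosts` entry in the compose file. A sketch of that fragment, assuming the project's compose file does not already include it:

```yaml
# Hypothetical compose fragment: lets containers reach the host's Ollama on Linux
services:
  rag-api:
    extra_hosts:
      - "host.docker.internal:host-gateway"
```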
Direct development:

```bash
# Environment variables are set automatically by run_system.py
# Override in environment if needed:
export OLLAMA_HOST=http://localhost:11434
export RAG_API_URL=http://localhost:8001
```
Model configuration:

```python
# Embedding Models
EMBEDDING_MODELS = [
    "Qwen/Qwen3-Embedding-0.6B",  # Fast, 1024 dimensions
    "Qwen/Qwen3-Embedding-4B",    # High quality, 2048 dimensions
]

# Generation Models
GENERATION_MODELS = [
    "qwen3:0.6b",  # Fast responses
    "qwen3:8b",    # High quality
]
```
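
To confirm that a generation model is pulled and answering, you can query Ollama's standard `/api/generate` endpoint directly:

```bash
# Ask qwen3:0.6b for a single non-streamed completion
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:0.6b",
  "prompt": "Reply with one word: hello",
  "stream": false
}'
```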
Memory and performance tuning:

```bash
# For Docker: Increase memory allocation
# Docker Desktop → Settings → Resources → Memory → 16GB+

# For Direct Development: Monitor with
htop  # or top on macOS
```

```python
# Batch sizes (adjust based on available RAM)
EMBEDDING_BATCH_SIZE = 50   # Reduce if OOM
ENRICHMENT_BATCH_SIZE = 25  # Reduce if OOM

# Chunk settings
CHUNK_SIZE = 512    # Text chunk size
CHUNK_OVERLAP = 64  # Overlap between chunks
```
Health checks:

```bash
# Comprehensive system check
curl -f http://localhost:3000 && echo "✅ Frontend OK"
curl -f http://localhost:8000/health && echo "✅ Backend OK"
curl -f http://localhost:8001/models && echo "✅ RAG API OK"
curl -f http://localhost:11434/api/tags && echo "✅ Ollama OK"
```
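
The same four checks can be wrapped in a loop so a single command reports on every service. A minimal sketch:

```bash
# Check each endpoint and print a pass/fail line per service
for url in http://localhost:3000 \
           http://localhost:8000/health \
           http://localhost:8001/models \
           http://localhost:11434/api/tags; do
  if curl -fs "$url" >/dev/null; then
    echo "✅ $url OK"
  else
    echo "❌ $url FAILED"
  fi
done
```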
Resource monitoring:

```bash
# Docker monitoring
docker stats

# Direct development monitoring
htop        # Overall system
nvidia-smi  # GPU usage (if available)
```
Docker logs:

```bash
# All services
docker compose logs -f

# Specific service
docker compose logs -f rag-api

# Save logs to file
docker compose logs > system.log 2>&1
```

Direct development logs:

```bash
# Logs are printed to the terminal
# Redirect to a file if needed:
python run_system.py > system.log 2>&1
```
Backup:

```bash
# Create backup directory
mkdir -p backups/$(date +%Y%m%d)

# Backup databases and indexes
cp -r backend/chat_data.db backups/$(date +%Y%m%d)/
cp -r lancedb backups/$(date +%Y%m%d)/
cp -r index_store backups/$(date +%Y%m%d)/

# For Docker: also back up volumes
docker compose down
docker run --rm -v rag_system_old_ollama_data:/data -v $(pwd)/backups:/backup \
  alpine tar czf /backup/ollama_models_$(date +%Y%m%d).tar.gz -C /data .
```
Restore:

```bash
# Stop system
./start-docker.sh stop  # Docker
# Or Ctrl+C for direct development

# Restore files
cp -r backups/YYYYMMDD/* ./

# Restart system
./start-docker.sh     # Docker
python run_system.py  # Direct development
```
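
The Ollama volume archive created in the backup step can be restored with the inverse `tar` command. A sketch assuming the same volume and archive names as above (replace `YYYYMMDD` with the backup date):

```bash
# Unpack the saved model archive back into the named Docker volume
docker run --rm -v rag_system_old_ollama_data:/data -v $(pwd)/backups:/backup \
  alpine tar xzf /backup/ollama_models_YYYYMMDD.tar.gz -C /data
```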
Port conflicts:

```bash
# Check what's using ports
lsof -i :3000 -i :8000 -i :8001 -i :11434

# For Docker: Stop conflicting containers
./start-docker.sh stop

# For Direct: Kill processes
pkill -f "npm run dev"
pkill -f "server.py"
pkill -f "api_server"
```
Docker issues:

```bash
# Docker daemon not running
docker version  # Check if daemon responds

# Restart Docker Desktop (macOS/Windows)
# Or restart docker service (Linux)
sudo systemctl restart docker

# Clear Docker cache
docker system prune -f
```
Ollama issues:

```bash
# Check Ollama status
curl http://localhost:11434/api/tags

# Restart Ollama
pkill ollama
ollama serve

# Reinstall models
ollama pull qwen3:0.6b
ollama pull qwen3:8b
```
Out-of-memory errors:

```bash
# Check memory usage
free -h       # Linux
vm_stat       # macOS
docker stats  # Docker containers

# Solutions:
# 1. Increase system RAM
# 2. Reduce batch sizes in configuration
# 3. Use smaller models (qwen3:0.6b instead of qwen3:8b)
```
Slow performance:

```bash
# Check model loading
curl http://localhost:11434/api/tags

# Monitor component response times
time curl http://localhost:8001/models

# Solutions:
# 1. Use SSD storage
# 2. Increase CPU cores
# 3. Use GPU acceleration (if available)
```
Security hardening:

- Use a reverse proxy (nginx/traefik) for production
- Enable HTTPS/TLS
- Restrict port access with a firewall (see the sketch below)
- Enable authentication in production
- Encrypt sensitive data
- Apply regular security updates
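
As one concrete reading of the firewall item, on Ubuntu you might expose only SSH and the reverse-proxied HTTPS port while leaving the internal service ports (3000, 8000, 8001, 11434) unreachable from outside. A sketch using `ufw`; the exact rule set depends on your environment:

```bash
# Deny inbound by default, then allow only SSH and HTTPS
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 443/tcp
sudo ufw enable
```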
Scaling:

- Use Docker Swarm or Kubernetes
- Load balance the frontend and backend
- Scale RAG API instances based on load (see the sketch below)
- Use dedicated GPU nodes for AI workloads
- Implement model caching
- Optimize batch processing
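
For the RAG API scaling item, Docker Compose can run several replicas of one service with `--scale`. A sketch, assuming the `rag-api` service does not bind a fixed host port (replicas would otherwise collide) and that a load balancer sits in front:

```bash
# Run three rag-api replicas
docker compose up -d --scale rag-api=3
```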
Your deployment is successful when all four endpoints respond to the health checks above (frontend, backend, RAG API, and Ollama).

Acceptable Performance: the system answers queries using the fast models (qwen3:0.6b, Qwen/Qwen3-Embedding-0.6B).

Optimal Performance: the system answers queries using the high-quality models (qwen3:8b, Qwen/Qwen3-Embedding-4B), with GPU acceleration where available.

Happy Deploying! 🚀