# System Overview

Last updated: 2025-01-09
This document provides a comprehensive overview of the Advanced Retrieval-Augmented Generation (RAG) System, covering its architecture, components, data flow, and operational characteristics.
## Architecture

The RAG system implements a microservices architecture with four service tiers backed by a shared storage layer:
```mermaid
graph TB
    subgraph "Client Layer"
        Browser[User Browser]
        UI[Next.js Frontend<br/>React/TypeScript]
        Browser --> UI
    end

    subgraph "API Gateway Layer"
        Backend[Backend Server<br/>Python HTTP Server<br/>Port 8000]
        UI -->|REST API| Backend
    end

    subgraph "Processing Layer"
        RAG[RAG API Server<br/>Document Processing<br/>Port 8001]
        Backend -->|Internal API| RAG
    end

    subgraph "LLM Service Layer"
        Ollama[Ollama Server<br/>LLM Inference<br/>Port 11434]
        RAG -->|Model Calls| Ollama
    end

    subgraph "Storage Layer"
        SQLite[("SQLite Database<br/>Sessions & Metadata")]
        LanceDB[("LanceDB<br/>Vector Embeddings")]
        FileSystem["File System<br/>Documents & Indexes"]
        Backend --> SQLite
        RAG --> LanceDB
        RAG --> FileSystem
    end
```
### Component Summary

| Component | Technology | Port | Purpose |
|---|---|---|---|
| Frontend | Next.js 15, React 19, TypeScript | 3000 | User interface, chat interactions |
| Backend | Python 3.11, HTTP Server | 8000 | API gateway, session management, routing |
| RAG API | Python 3.11, Advanced NLP | 8001 | Document processing, retrieval, generation |
| Ollama | Go-based LLM server | 11434 | Local LLM inference (embedding, generation) |
| SQLite | Embedded database | - | Sessions, messages, index metadata |
| LanceDB | Vector database | - | Document embeddings, similarity search |
## Query Routing

The system's dual-layer routing architecture balances response latency against answer quality. The first layer, in `backend/server.py`, decides whether a query needs document retrieval at all:

```text
# Example routing decisions (backend/server.py)
"Hello!"                                      → Direct LLM (greeting pattern)
"What does the document say about pricing?"   → RAG Pipeline (document keyword)
"What's 2+2?"                                 → Direct LLM (simple + short)
"Summarize the key findings from the report"  → RAG Pipeline (complex + indicators)
```
The second layer, the agent loop in `rag_system/agent/loop.py`, selects one of three handlers:

- `direct_answer`: general knowledge queries
- `rag_query`: document-specific queries requiring retrieval
- `graph_query`: entity relationship queries (future feature)

## Storage Layout

### SQLite Database (`backend/chat_data.db`)

```sql
-- Core tables
sessions        -- Chat sessions with metadata
messages        -- Individual messages and responses
indexes         -- Document index metadata
session_indexes -- Links sessions to their indexes
```
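The column-level schema isn't documented here; purely as an illustration of the relationships above, a hypothetical DDL for the first two tables (all column names are assumptions) could be:

```python
import sqlite3

# Hypothetical schema sketch -- column names are assumptions, not the actual DDL.
conn = sqlite3.connect("backend/chat_data.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS sessions (
        id         TEXT PRIMARY KEY,
        title      TEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE IF NOT EXISTS messages (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        session_id TEXT REFERENCES sessions(id),
        role       TEXT,   -- 'user' or 'assistant'
        content    TEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
""")
conn.commit()
conn.close()
```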
### LanceDB Vector Store (`./lancedb/`)

```text
tables/
├── text_pages_[uuid]   -- Document text embeddings
├── image_pages_[uuid]  -- Image embeddings (future)
└── metadata_[uuid]     -- Document metadata
```
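Similarity search against these tables uses the `lancedb` Python client. A minimal sketch, assuming the `text_pages_v3` table name from the default pipeline config (actual tables carry a UUID suffix):

```python
import lancedb

# Connect to the on-disk vector store used by the RAG API server.
db = lancedb.connect("./lancedb")
table = db.open_table("text_pages_v3")  # assumed name; actual tables are text_pages_[uuid]

# The query vector must match the embedding dimensionality (1024 for Qwen3-Embedding-0.6B).
query_vector = [0.0] * 1024  # placeholder; produced by the embedding model in practice
results = table.search(query_vector).limit(5).to_list()
for row in results:
    print(row["_distance"], str(row.get("text", ""))[:80])
```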
### Index Store (`./index_store/`)

```text
index_store/
├── overviews/ -- Document summaries for routing
├── bm25/      -- BM25 keyword indexes
└── graph/     -- Knowledge graph data
```
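The BM25 indexes complement dense retrieval with exact keyword matching. The on-disk format under `bm25/` isn't specified here, but the scoring idea can be sketched with the `rank_bm25` package:

```python
from rank_bm25 import BM25Okapi

# Toy corpus standing in for chunked document text.
corpus = [
    "pricing is listed in section three of the contract",
    "the report summarizes the quarterly findings",
    "installation requires python 3.11 and ollama",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

# Score every chunk against a keyword query and take the best match.
scores = bm25.get_scores("pricing contract".split())
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])  # -> the pricing chunk
```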
### Shared Uploads (`shared_uploads/`)

Uploaded source documents live on the file system under `shared_uploads/`.

## Models

The system supports multiple embedding and generation models with automatic switching:
```python
EXTERNAL_MODELS = {
    "embedding_model": "Qwen/Qwen3-Embedding-0.6B",             # 1024D
    "reranker_model": "answerdotai/answerai-colbert-small-v1",  # ColBERT reranker
    "vision_model": "Qwen/Qwen-VL-Chat",                        # Vision model for multimodal
    "fallback_reranker": "BAAI/bge-reranker-base",              # Backup reranker
}

OLLAMA_CONFIG = {
    "generation_model": "qwen3:8b",    # High-quality generation
    "enrichment_model": "qwen3:0.6b",  # Fast enrichment/routing
    "host": "http://localhost:11434",
}
```
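Given `OLLAMA_CONFIG`, generation requests go to Ollama's standard REST API on port 11434, for example:

```python
import requests

OLLAMA_HOST = "http://localhost:11434"

# Non-streaming generation request against the primary model.
response = requests.post(
    f"{OLLAMA_HOST}/api/generate",
    json={
        "model": "qwen3:8b",
        "prompt": "Summarize retrieval-augmented generation in one sentence.",
        "stream": False,
    },
    timeout=120,
)
print(response.json()["response"])
```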
### Supported Models

- `Qwen/Qwen3-Embedding-0.6B` (1024D) - Default embedding model; fast and high-quality
- `qwen3:8b` - Primary generation model (high quality)
- `qwen3:0.6b` - Fast enrichment and routing model
- `answerdotai/answerai-colbert-small-v1` - Primary ColBERT reranker
- `BAAI/bge-reranker-base` - Fallback cross-encoder reranker
- `Qwen/Qwen-VL-Chat` - Vision-language model for image processing

## Pipeline Configuration

```python
PIPELINE_CONFIGS = {
    "default": {
        "description": "Production-ready pipeline with hybrid search, AI reranking, and verification",
        "storage": {
            "lancedb_uri": "./lancedb",
            "text_table_name": "text_pages_v3",
            "bm25_path": "./index_store/bm25",
            "graph_path": "./index_store/graph/knowledge_graph.gml",
        },
        "retrieval": {
            "retriever": "multivector",
            "search_type": "hybrid",
            "late_chunking": {
                "enabled": True,
                "table_suffix": "_lc_v3",
            },
            "dense": {
                "enabled": True,
                "weight": 0.7,
            },
            "bm25": {
                "enabled": True,
                "index_name": "rag_bm25_index",
            },
        },
        "embedding_model_name": "Qwen/Qwen3-Embedding-0.6B",
        "reranker": {
            "enabled": True,
            "model_name": "answerdotai/answerai-colbert-small-v1",
            "top_k": 20,
        },
    },
}
```
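The codebase's exact fusion of dense and BM25 scores under `search_type: "hybrid"` isn't shown here; a common approach consistent with the 0.7 dense weight is a normalized weighted sum, sketched below (an illustration, not the actual implementation):

```python
def fuse_scores(dense: dict[str, float], bm25: dict[str, float],
                dense_weight: float = 0.7) -> dict[str, float]:
    """Weighted-sum fusion of per-chunk retrieval scores (illustrative sketch).

    Scores are min-max normalized per retriever so the weights are comparable.
    """
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    dense_n, bm25_n = normalize(dense), normalize(bm25)
    return {
        chunk_id: dense_weight * dense_n.get(chunk_id, 0.0)
                  + (1 - dense_weight) * bm25_n.get(chunk_id, 0.0)
        for chunk_id in set(dense_n) | set(bm25_n)
    }
```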
## Performance Characteristics

### Typical Latency

| Operation | Time Range | Notes |
|---|---|---|
| Simple Chat | 1-3 seconds | Direct LLM, no retrieval |
| Document Query | 5-15 seconds | Includes retrieval and reranking |
| Complex Analysis | 15-30 seconds | Multi-step reasoning |
| Document Indexing | 2-5 min/100MB | Depends on enrichment settings |
### Memory Footprint

| Component | Memory Usage | Notes |
|---|---|---|
| Embedding Model | 1-2GB | Qwen3-Embedding-0.6B |
| Generation Model | 8-16GB | qwen3:8b |
| Reranker Model | 500MB-1GB | ColBERT reranker |
| Database Cache | 500MB-2GB | LanceDB and SQLite |
## Configuration

### Model Selection

Models are configured in `rag_system/main.py`:
```python
# Embedding model configuration
EXTERNAL_MODELS = {
    "embedding_model": "Qwen/Qwen3-Embedding-0.6B",  # Your preferred model
    "reranker_model": "answerdotai/answerai-colbert-small-v1",
}

# Generation model configuration
OLLAMA_CONFIG = {
    "generation_model": "qwen3:8b",    # Your LLM model
    "enrichment_model": "qwen3:0.6b",  # Your fast model
}
```
### Processing Behavior

Retrieval and chunking behavior is configured in `PIPELINE_CONFIGS`:
```python
PIPELINE_CONFIGS = {
    "retrieval": {
        "search_type": "hybrid",
        "dense": {"weight": 0.7},
        "bm25": {"enabled": True},
    },
    "chunking": {
        "chunk_size": 512,
        "chunk_overlap": 64,
        "enable_latechunk": True,
        "enable_docling": True,
    },
}
```
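As a rough illustration of the `chunk_size`/`chunk_overlap` settings, a simplified word-based chunker (the real pipeline works on tokens and additionally supports late chunking and Docling parsing) might be:

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into overlapping windows (word-based stand-in for token chunking)."""
    words = text.split()
    step = chunk_size - chunk_overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, max(len(words) - chunk_overlap, 1), step)
    ]

# Each 512-word chunk shares 64 words with its neighbors.
chunks = chunk_text("lorem ipsum " * 600)
print(len(chunks), len(chunks[0].split()))  # -> 3 512
```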
### Frontend Settings

Frontend behavior is configured through environment variables:
```bash
NEXT_PUBLIC_API_URL=http://localhost:8000
NEXT_PUBLIC_ENABLE_STREAMING=true
NEXT_PUBLIC_MAX_FILE_SIZE=50MB
```
## Health Monitoring

All services expose a `/health` endpoint for liveness checks.
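A quick liveness sweep across the services, assuming the default ports listed earlier (using Ollama's root path as its liveness check is an assumption about its API, not something specified here):

```python
import requests

# Assumed endpoints based on the default ports above.
ENDPOINTS = {
    "backend": "http://localhost:8000/health",
    "rag_api": "http://localhost:8001/health",
    "ollama": "http://localhost:11434/",
}

for name, url in ENDPOINTS.items():
    try:
        code = requests.get(url, timeout=5).status_code
        print(f"{name}: {'ok' if code == 200 else code}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({type(exc).__name__})")
```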
"default")"fast")"bm25")"graph_rag")BaseRetriever interfaceNote: This overview reflects the current implementation as of 2025-01-09. For the latest changes, check the git history and individual component documentation.