# RAG_POC/architecture-data-model.md
This document explains the SQLite data model used to turn relational tables (e.g. MySQL posts) into a retrieval-friendly index hosted inside ProxySQL. It focuses on:

- how `rag_sources` defines explicit mapping rules (no guessing)

Relational databases are excellent for structured queries, but RAG-style retrieval needs:

- a canonical text representation (`title`, `body`) per source row
- chunk-sized retrieval units
- both keyword (FTS) and semantic (vector) search
The model below implements a canonical document layer inside ProxySQL:
## rag_sources — control plane: "what to ingest and how"

### Purpose

Defines where data comes from and how each source row is mapped into `title`, `body`, `metadata_json`, and a stable `doc_id`.

### Key columns

- `backend_*`: how to connect (v0 connects directly; later may be "via ProxySQL")
- `table_name`, `pk_column`: what to ingest
- `where_sql`: optional restriction (e.g. only questions)
- `doc_map_json`: mapping rules (required)
- `chunking_json`: chunking rules (required)
- `embedding_json`: embedding rules (optional)

Important: `rag_sources` is the only place that defines mapping logic. A general-purpose ingester must never "guess" which fields belong to `body` or `metadata`.
## rag_documents — canonical documents: "one per source row"

### Purpose

- gives every source row a stable identity (`doc_id`)
- remembers how to refetch the authoritative row (`pk_json`)
- stores the canonical text (`title`, `body`)
- stores non-text signals (`metadata_json`)

### Why store the full body here?

Keeping the full `body` in `rag_documents` means chunking can be re-run locally (e.g. with new chunk sizes) without refetching rows from the backend.

### Key columns

- `doc_id` (PK): stable across runs and machines (e.g. `"posts:12345"`)
- `source_id`: ties back to `rag_sources`
- `pk_json`: how to refetch the authoritative row later (e.g. `{"Id":12345}`)
- `title`, `body`: canonical text
- `metadata_json`: non-text signals used for filters/boosting
- `updated_at`, `deleted`: lifecycle fields for incremental sync later

## rag_chunks — retrieval units: "one or many per document"

### Purpose

Splits each document's `body` into pieces small enough to index and embed effectively.
### Why chunk at all?

Long bodies dilute both keyword and vector relevance, and embedding models have input-size limits; smaller, focused chunks retrieve better and can carry their own context (e.g. a repeated title).
### Key columns

- `chunk_id` (PK): stable, derived from `doc_id` + chunk index (e.g. `"posts:12345#0"`)
- `doc_id` (FK): parent document
- `source_id`: convenience for filtering without joining documents
- `chunk_index`: 0..N-1
- `title`, `body`: chunk text (often the title repeated for context)
- `metadata_json`: optional chunk-level metadata (offsets, "has_code", section label)
- `updated_at`, `deleted`: lifecycle for later incremental sync

## rag_fts_chunks — FTS5 index (contentless)

### Purpose

Provides keyword (full-text) search over chunk `title` and `body`.
### Design choice: contentless FTS

The FTS5 table is created contentless (`content=''`), so it stores only the inverted index rather than a second copy of the text; the canonical chunk text lives only in `rag_chunks`.

### Stored fields

- `chunk_id` (unindexed, acts like a row identifier)
- `title`, `body` (indexed)

## rag_vec_chunks — vector index (sqlite3-vec)

### Purpose

Provides semantic (embedding-based) similarity search over chunks.
### Key columns

- `embedding float[DIM]`: embedding vector (`DIM` must match your model)
- `chunk_id`: join key to `rag_chunks`
- `doc_id`, `source_id`, `updated_at`: denormalized for filtering without joins

Note: if embeddings are disabled for a source, this table simply stays empty for its chunks.
## Other objects

- `rag_chunk_view`: joins `rag_chunks` with `rag_documents` for debugging/inspection
- `rag_sync_state`: reserved for incremental sync later (not used in v0)

## Data flow

Think of this as a data pipeline graph:
```
rag_sources
  (defines mapping + chunking + embedding)
        |
        v
rag_documents    (1 row per source row)
        |
        v
rag_chunks       (1..N chunks per document)
       /  \
      v    v
 rag_fts  rag_vec
```
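The chunk and keyword layers of this pipeline can be exercised with Python's built-in `sqlite3`. This is a minimal sketch: the table definitions follow this document's column lists but are trimmed, and the real POC schema may differ.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE rag_chunks (
    chunk_id    TEXT PRIMARY KEY,   -- e.g. "posts:12345#0"
    doc_id      TEXT NOT NULL,
    source_id   INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,
    title       TEXT,
    body        TEXT
);
-- Contentless FTS5: indexes title/body but stores no copy of the text.
CREATE VIRTUAL TABLE rag_fts_chunks USING fts5(
    chunk_id UNINDEXED, title, body, content=''
);
""")

# Ingest one chunk; the FTS rowid mirrors rag_chunks.rowid so MATCH
# results can be joined back to the canonical chunk text.
db.execute("INSERT INTO rag_chunks VALUES (?,?,?,?,?,?)",
           ("posts:12345#0", "posts:12345", 1, 0,
            "How to parse JSON in MySQL 8?", "I tried JSON_EXTRACT..."))
rowid = db.execute("SELECT rowid FROM rag_chunks WHERE chunk_id=?",
                   ("posts:12345#0",)).fetchone()[0]
db.execute("INSERT INTO rag_fts_chunks(rowid, chunk_id, title, body) "
           "VALUES (?,?,?,?)",
           (rowid, "posts:12345#0",
            "How to parse JSON in MySQL 8?", "I tried JSON_EXTRACT..."))

# Keyword search: MATCH yields rowids; join back for the text.
hits = db.execute("""
    SELECT c.chunk_id, c.body
    FROM rag_fts_chunks f JOIN rag_chunks c ON c.rowid = f.rowid
    WHERE rag_fts_chunks MATCH 'json'
""").fetchall()
print(hits)
```

Because the FTS table is contentless, the join back to `rag_chunks` is mandatory: the index alone cannot return the matched text.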
### Cardinality
- `rag_sources` (1) -> `rag_documents` (N)
- `rag_documents` (1) -> `rag_chunks` (N)
- `rag_chunks` (1) -> `rag_fts_chunks` (1) (insertion done by ingester)
- `rag_chunks` (1) -> `rag_vec_chunks` (0/1+) (0 if embeddings disabled; 1 typically)

## Why doc_map_json exists

A general-purpose system cannot infer that:
- `posts.Body` should become the document body
- `posts.Title` should become the title
- `Score`, `Tags`, `CreationDate`, etc. should become metadata

Therefore, `doc_map_json` is required.
## doc_map_json structure (v0)

`doc_map_json` defines:
- `doc_id.format`: string template with `{ColumnName}` placeholders
- `title.concat`: concatenation spec
- `body.concat`: concatenation spec
- `metadata.pick`: list of column names to include in metadata JSON
- `metadata.rename`: mapping of old key -> new key (useful for typos or schema differences)

### Concatenation parts
- `{"col":"Column"}` — appends the column value (if present)
- `{"lit":"..."}` — appends a literal string

Example (posts-like):
```json
{
  "doc_id": { "format": "posts:{Id}" },
  "title": { "concat": [ { "col": "Title" } ] },
  "body": { "concat": [ { "col": "Body" } ] },
  "metadata": {
    "pick": ["Id", "PostTypeId", "Tags", "Score", "CreaionDate"],
    "rename": { "CreaionDate": "CreationDate" }
  }
}
```

(The misspelled `CreaionDate` is deliberate: picking it and then renaming it shows how a typo in the source schema can be normalized.)
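A sketch of how an ingester might apply such a spec to one source row; the helper name `apply_doc_map` is mine, not part of the POC:

```python
import json

def apply_doc_map(doc_map: dict, row: dict) -> dict:
    """Apply a doc_map_json spec (as described above) to one source row."""
    def concat(parts):
        out = []
        for p in parts:
            if "col" in p:
                v = row.get(p["col"])
                if v is not None:
                    out.append(str(v))   # {"col": ...}: column value, if present
            elif "lit" in p:
                out.append(p["lit"])     # {"lit": ...}: literal string
        return "".join(out)

    # metadata.pick selects columns; metadata.rename normalizes keys.
    meta = {k: row[k] for k in doc_map["metadata"]["pick"] if k in row}
    for old, new in doc_map["metadata"].get("rename", {}).items():
        if old in meta:
            meta[new] = meta.pop(old)

    return {
        "doc_id": doc_map["doc_id"]["format"].format(**row),
        "title": concat(doc_map["title"]["concat"]),
        "body": concat(doc_map["body"]["concat"]),
        "metadata_json": json.dumps(meta),
    }

doc_map = {
    "doc_id": {"format": "posts:{Id}"},
    "title": {"concat": [{"col": "Title"}]},
    "body": {"concat": [{"col": "Body"}]},
    "metadata": {"pick": ["Id", "Score", "CreaionDate"],
                 "rename": {"CreaionDate": "CreationDate"}},
}
row = {"Id": 12345, "Title": "How to parse JSON in MySQL 8?",
       "Body": "<p>I tried JSON_EXTRACT...</p>", "Score": 12,
       "CreaionDate": "2024-01-01"}
doc = apply_doc_map(doc_map, row)
print(doc["doc_id"])  # posts:12345
```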
## Why chunking_json exists

Different tables need different chunking:

- `Body` may be long -> chunking recommended

Thus chunking policy is stored per source in `rag_sources.chunking_json`.
## chunking_json structure (v0)

v0 supports chars-based chunking (simple, robust).
```json
{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 4000,
  "overlap": 400,
  "min_chunk_size": 800
}
```
### Behavior
- if `body.length <= chunk_size` -> one chunk
- otherwise, split into windows of `chunk_size` characters, with `overlap` characters shared between neighbors
- chunks shorter than `min_chunk_size` are avoided (e.g. a short trailing chunk is folded into the previous one)

### Why overlap matters

Without overlap, a sentence or code block that straddles a chunk boundary is split so that no single chunk contains it whole; overlap gives each boundary region at least one chunk in which it appears intact.
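The behavior above can be sketched as follows; note the `min_chunk_size` handling (folding a short trailing chunk into the previous one) is my assumed interpretation, not something the POC specifies:

```python
def chunk_chars(body: str, chunk_size: int, overlap: int,
                min_chunk_size: int) -> list[str]:
    # Single chunk if the body fits.
    if len(body) <= chunk_size:
        return [body]
    step = chunk_size - overlap
    starts = list(range(0, len(body), step))
    # Drop start positions whose window lies entirely inside the previous one.
    while len(starts) > 1 and starts[-1] + overlap >= len(body):
        starts.pop()
    chunks = [body[s:s + chunk_size] for s in starts]
    # Assumed min_chunk_size rule: fold a short trailing chunk into its
    # predecessor by extending the predecessor to the end of the body.
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk_size:
        chunks[-2] = body[starts[-2]:]
        chunks.pop()
    return chunks

chunks = chunk_chars("abcdefghijklmnopqrstuvwxy",
                     chunk_size=10, overlap=2, min_chunk_size=4)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxy']
```

Each neighboring pair shares `overlap` characters (`ij`, `qr`), which is what keeps boundary-straddling text retrievable.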
## embedding_json structure (v0)

```json
{
  "enabled": true,
  "dim": 1536,
  "model": "text-embedding-3-large",
  "input": { "concat": [
    { "col": "Title" },
    { "lit": "\nTags: " }, { "col": "Tags" },
    { "lit": "\n\n" },
    { "chunk_body": true }
  ]}
}
```
### Meaning

The embedding input for each chunk is the document `Title`, then a `Tags:` line, then a blank line, then the chunk's own text (`{"chunk_body": true}` substitutes the chunk body).
This improves semantic retrieval for question-like content without embedding numeric metadata.
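A sketch of assembling that embedding input for one chunk; the helper name is mine, and part handling mirrors the concat spec above plus `{"chunk_body": true}`:

```python
def build_embedding_input(spec: dict, row: dict, chunk_body: str) -> str:
    """Assemble the text sent to the embedding model from embedding_json.input."""
    out = []
    for part in spec["input"]["concat"]:
        if part.get("chunk_body"):
            out.append(chunk_body)          # substitute the chunk's own text
        elif "col" in part:
            v = row.get(part["col"])
            if v is not None:
                out.append(str(v))          # column value, if present
        elif "lit" in part:
            out.append(part["lit"])         # literal string
    return "".join(out)

spec = {"input": {"concat": [
    {"col": "Title"},
    {"lit": "\nTags: "}, {"col": "Tags"},
    {"lit": "\n\n"},
    {"chunk_body": True},
]}}
text = build_embedding_input(spec, {"Title": "T", "Tags": "<mysql>"}, "body text")
print(repr(text))  # 'T\nTags: <mysql>\n\nbody text'
```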
## Ingestion flow (v0)

For each enabled `rag_sources` entry:
1. Connect using `backend_*`
2. Read rows from `table_name` (applying the optional `where_sql`)
3. Parse `doc_map_json`, `chunking_json`, and `embedding_json`
4. For each row, build `doc_id` using `doc_map_json.doc_id.format`
5. Build `pk_json` from `pk_column`
6. Build `title` using `title.concat`
7. Build `body` using `body.concat`
8. Build `metadata_json` using `metadata.pick` and `metadata.rename`
9. Skip the row if its `doc_id` already exists (v0 behavior)
10. Insert the document into `rag_documents`
11. Chunk `body` using `chunking_json`
12. Insert the chunks into `rag_chunks`
13. Index each chunk's text in `rag_fts_chunks`
14. If embeddings are enabled, build the input via `embedding_json.input`, embed it, and insert into `rag_vec_chunks`

v0 is "insert-only and skip-existing."
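The "insert-only and skip-existing" semantics map naturally onto SQLite's `INSERT OR IGNORE` against the `doc_id` primary key. A minimal sketch (trimmed table, helper name mine):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE rag_documents (
    doc_id TEXT PRIMARY KEY, title TEXT, body TEXT)""")

def insert_skip_existing(doc_id: str, title: str, body: str) -> bool:
    # v0 semantics: an existing doc_id is left untouched, never updated.
    cur = db.execute(
        "INSERT OR IGNORE INTO rag_documents(doc_id, title, body) VALUES (?,?,?)",
        (doc_id, title, body))
    return cur.rowcount == 1  # True iff a new row was actually inserted

first = insert_skip_existing("posts:12345", "t", "b")
second = insert_skip_existing("posts:12345", "t2", "b2")  # skipped: exists
print(first, second)  # True False
```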
## Incremental sync (future)

Product-grade ingestion also requires detecting new, updated, and deleted source rows.

Options for detecting changes:

- timestamp-based: a `LastActivityDate` / `updated_at` column
- content-hash based: e.g. `sha256(title||body||metadata)` stored in the documents table
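The content-hash option can be sketched like this; the field separator is my choice, not something the POC specifies:

```python
import hashlib

def content_hash(title: str, body: str, metadata_json: str) -> str:
    """sha256 over title||body||metadata, with a separator between fields
    so that ("ab", "c") and ("a", "bc") cannot collide."""
    h = hashlib.sha256()
    for field in (title, body, metadata_json):
        h.update(field.encode("utf-8"))
        h.update(b"\x00")  # field separator (my choice)
    return h.hexdigest()

a = content_hash("t", "b", "{}")
b = content_hash("t", "b2", "{}")
print(a != b)  # True: any field change produces a new hash
```

Comparing the stored hash with a freshly computed one tells the ingester whether a row needs re-chunking and re-embedding.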
Detecting deletes needs:

- marking `deleted=1` when the source row is deleted

Use `rag_sync_state` to store per-source sync progress (e.g. a last-synced timestamp or high-water mark).
The current schema already includes:

- the `updated_at` and `deleted` columns
- the `rag_sync_state` placeholder

So incremental sync can be added later without breaking the data model.
## Example: posts table

Given a MySQL `posts` row:

- `Id = 12345`
- `Title = "How to parse JSON in MySQL 8?"`
- `Body = "<p>I tried JSON_EXTRACT...</p>"`
- `Tags = "<mysql><json>"`
- `Score = 12`

With the mapping above, ingestion produces:
- `doc_id = "posts:12345"`
- `title` = `Title`
- `body` = `Body`
- `metadata_json` includes `{ "Tags": "...", "Score": "12", ... }`
- chunk ids `posts:12345#0`, `posts:12345#1`, etc.

## Summary

This data model separates concerns cleanly:
- `rag_sources` defines policy (what/how to ingest)
- `rag_documents` defines canonical identity and the refetch pointer
- `rag_chunks` defines retrieval units
- `rag_fts_chunks` defines keyword search
- `rag_vec_chunks` defines semantic search

This separation makes the system easy to extend (a new source is just a new `rag_sources` row), easy to debug (via `rag_chunk_view`), and ready to grow into incremental sync without schema changes.