RAG_POC/embeddings-design.md
This document specifies how embeddings should be produced, stored, updated, and queried for chunk-level vector search in ProxySQL’s RAG index. It is intended as an implementation blueprint.
It assumes:

- Chunks are stored in a chunk table (`rag_chunks`).
- Chunk vectors are stored in a `vec0(...)` virtual table (`rag_vec_chunks`).
- Retrieval tooling is specified separately (`mcp-tools.md`).

Design goals:

- Chunk-level embeddings
- Deterministic embedding input
- Model agility
- Efficient updates
- Operational safety
Recommended embedding input per chunk: a concatenation of the title, the tags, and the chunk body.

Example embedding input template:

```
{Title}
Tags: {Tags}

{ChunkBody}
```
This typically improves semantic recall significantly for knowledge-base-like content (StackOverflow posts, docs, tickets, runbooks).
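For illustration, a hypothetical StackOverflow-style chunk (invented values) would produce an embedding input like:

```
How do I reset a forgotten MySQL root password?
Tags: mysql, authentication

Stop mysqld, restart it with --skip-grant-tables, connect as root,
then set a new password with ALTER USER.
```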
Do not embed fields like Score, ViewCount, OwnerUserId, timestamps, etc. These should remain structured and be used for filtering and ranking at query time. Embedding numeric metadata into text typically adds noise and reduces semantic quality.

If your chunk body contains HTML or code, normalize it before embedding (e.g. strip HTML tags, collapse whitespace). Normalization should be source-configurable.
Embedding input rules must be explicit and stored per source.
These rules live in `rag_sources.embedding_json`. Recommended schema:
```json
{
  "enabled": true,
  "model": "text-embedding-3-large",
  "dim": 1536,
  "input": {
    "concat": [
      {"col": "Title"},
      {"lit": "\nTags: "},
      {"col": "Tags"},
      {"lit": "\n\n"},
      {"chunk_body": true}
    ]
  },
  "normalize": {
    "strip_html": true,
    "collapse_whitespace": true
  }
}
```
Semantics:

- `enabled`: whether to compute/store embeddings for this source
- `model`: logical name (for observability and compatibility checks)
- `dim`: vector dimension
- `input.concat`: how to build the embedding input text
- `normalize`: optional normalization steps

`rag_vec_chunks` stores one embedding row per chunk, keyed by `chunk_id`, with `source_id` kept as metadata. This is appropriate for v0 when you assume a single embedding model/dimension.
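As a concrete reference for implementers, here is a minimal sketch (Python; the helper name and config shape are assumptions based on the example config above) of how an ingester might turn `embedding_json` plus a source row into the embedding input string:

```python
import html
import re

def build_embedding_input(cfg: dict, row: dict, chunk_body: str) -> str:
    """Interpret embedding_json's input.concat / normalize sections."""
    parts = []
    for step in cfg["input"]["concat"]:
        if "col" in step:                      # source column, e.g. Title, Tags
            parts.append(str(row.get(step["col"], "")))
        elif "lit" in step:                    # literal separator text
            parts.append(step["lit"])
        elif step.get("chunk_body"):           # the chunk text itself
            parts.append(chunk_body)
    text = "".join(parts)

    norm = cfg.get("normalize", {})
    if norm.get("strip_html"):
        text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    if norm.get("collapse_whitespace"):
        # Note: this also flattens the template's newlines; a real ingester
        # may prefer to normalize each field before concatenation.
        text = re.sub(r"\s+", " ", text).strip()
    return text
```

Keeping this function pure (same config + same row in, same string out) is what makes the embedding input deterministic and therefore hashable.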
In a product setting, you may want multiple embedding models (e.g. general vs code-centric).
Two ways to support this:
Option A: extend `rag_vec_chunks`. Add columns:

- `model TEXT`
- `dim INTEGER` (optional if fixed per model)

Then allow multiple rows per `chunk_id` (the unique key becomes `(chunk_id, model)`). This may require a schema change and a different vec0 design (some vec0 configurations support metadata columns, but uniqueness must be handled carefully).

Option B: one table per model. Create:

- `rag_vec_chunks_1536_v1`
- `rag_vec_chunks_1024_code_v1`
- etc.

Then MCP tools select the table based on the requested model or the default configuration, as sketched below.
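A minimal routing sketch for Option B; the model names in the mapping are hypothetical:

```python
# Hypothetical model -> table mapping; table names follow the
# per-dimension/per-variant naming convention above.
VEC_TABLES = {
    "text-embedding-3-large": "rag_vec_chunks_1536_v1",
    "code-embedder-v1":       "rag_vec_chunks_1024_code_v1",
}

def vec_table_for(requested_model: str | None, default_model: str) -> str:
    model = requested_model or default_model
    return VEC_TABLES[model]   # KeyError here means an unsupported model
```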
Recommendation: Start with Option A only if your sqlite3-vec build makes it easy to filter by model. Otherwise, Option B is operationally cleaner.
Embeddings are created during ingestion, immediately after chunk creation, if `embedding_json.enabled=true`.
This provides a simple, synchronous pipeline: build the embedding input, call the embedding provider, and insert the resulting vector into `rag_vec_chunks` (see the sketch below).
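A minimal sketch of the synchronous path, assuming a sqlite-vec `vec0` table with `chunk_id` and `source_id` columns and a hypothetical `embed()` callable:

```python
import sqlite3
import struct

def serialize_f32(vec: list[float]) -> bytes:
    # sqlite-vec accepts vectors as little-endian float32 blobs
    # (a JSON array string also works).
    return struct.pack(f"<{len(vec)}f", *vec)

def index_chunk(db: sqlite3.Connection, chunk_id: str, source_id: str,
                input_text: str, embed) -> None:
    vec = embed(input_text)            # hypothetical: text -> list[float]
    db.execute(
        "INSERT INTO rag_vec_chunks(chunk_id, source_id, embedding) "
        "VALUES (?, ?, ?)",
        (chunk_id, source_id, serialize_f32(vec)),
    )
```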
Embeddings must be recomputed if the embedding input string changes. That depends on the embedded source columns (e.g. Title, Tags), the chunk body, and the input/normalization configuration.
Therefore, update logic should be based on a content hash of the embedding input.
Without hashing, you might recompute embeddings unnecessarily, e.g. when only non-embedded metadata such as Score or ViewCount changes.
Store an `embedding_input_hash` per chunk per model.
Implementation options:

Option A: store the hash in `rag_chunks.metadata_json`. Example:
```json
{
  "chunk_index": 0,
  "embedding_hash": "sha256:...",
  "embedding_model": "text-embedding-3-large"
}
```
Pros: no schema changes.
Cons: JSON parsing overhead.
Option B: a dedicated state table. Create `rag_chunk_embedding_state`:
```sql
CREATE TABLE rag_chunk_embedding_state (
  chunk_id   TEXT NOT NULL,
  model      TEXT NOT NULL,
  dim        INTEGER NOT NULL,
  input_hash TEXT NOT NULL,
  updated_at INTEGER NOT NULL DEFAULT (unixepoch()),
  PRIMARY KEY (chunk_id, model)
);
```
Pros: fast lookups; avoids JSON parsing.
Cons: extra table.
Recommendation: Use Option B for v1.
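A sketch of the hash check and state upsert against `rag_chunk_embedding_state`, using SHA-256 over the exact embedding input string:

```python
import hashlib

def embedding_input_hash(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_embedding(db, chunk_id: str, model: str, input_text: str) -> bool:
    # Re-embed only when there is no state row or the input hash changed.
    row = db.execute(
        "SELECT input_hash FROM rag_chunk_embedding_state "
        "WHERE chunk_id = ? AND model = ?",
        (chunk_id, model),
    ).fetchone()
    return row is None or row[0] != embedding_input_hash(input_text)

def record_embedding(db, chunk_id: str, model: str, dim: int,
                     input_text: str) -> None:
    db.execute(
        "INSERT INTO rag_chunk_embedding_state(chunk_id, model, dim, input_hash) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT(chunk_id, model) DO UPDATE SET "
        "input_hash = excluded.input_hash, updated_at = unixepoch()",
        (chunk_id, model, dim, embedding_input_hash(input_text)),
    )
```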
Option A: external provider. ProxySQL calls an embedding service.

Pros: no model hosting or inference inside ProxySQL; models can be swapped or upgraded easily.

Cons: network latency, an external dependency (availability, rate limits, cost), and chunk text leaving the host.
Option B: local runtime. ProxySQL links to an embedding runtime (llama.cpp, etc.).

Pros: no network dependency; data stays local.

Cons: build and deployment complexity, model distribution, and CPU/GPU resource management inside ProxySQL.
Recommendation: Start with an external embedding provider and keep a modular interface that can be swapped later.
Vector search needs a query embedding. Do this in the MCP layer: embed `query_text` with the same model and normalization configured for the target source.

Do not mix models: a query embedded with one model cannot be meaningfully compared against chunk vectors produced by another.
Depending on the embedding model and vec search primitive, vector search may return a distance (lower is better) or a similarity (higher is better).

Recommendation: Normalize to a "higher is better" score in MCP responses:

- `score_vec = 1 / (1 + distance)` or a similar monotonic transform
- Keep the raw distance in debug fields if needed.
Filtering should be supported by `source_id` restriction. In v0, filtering by `source_id` is easiest because `rag_vec_chunks` stores `source_id` as metadata.
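A sketch of the query side, assuming sqlite-vec KNN syntax (`embedding MATCH ?` with a `k = ?` constraint; older builds use `LIMIT`), a build where `source_id` is a filterable metadata column, and `serialize_f32`/`embed` from the ingestion sketch above:

```python
def search_vector(db, query_text: str, source_id: str, embed, k: int = 10):
    qvec = serialize_f32(embed(query_text))  # same model as the indexed chunks
    rows = db.execute(
        # KNN over the vec0 table; results come back nearest-first.
        "SELECT chunk_id, distance FROM rag_vec_chunks "
        "WHERE embedding MATCH ? AND source_id = ? AND k = ?",
        (qvec, source_id, k),
    ).fetchall()
    # Normalize to "higher is better", keeping the raw distance for debugging.
    return [
        {"chunk_id": cid, "score_vec": 1.0 / (1.0 + dist), "distance": dist}
        for cid, dist in rows
    ]
```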
Embeddings are one leg of hybrid retrieval. Two recommended hybrid modes are described in `mcp-tools.md`: a fusion mode (merge FTS and vector result lists) and a candidate mode (use one retriever to produce candidates that the other re-ranks). Embeddings support both. Candidate mode is often cheaper and more precise when the query includes strong exact tokens.
Embedding generation must be bounded by timeouts, provider rate limits, and per-request size/token limits.

To improve throughput, embed in batches: send many chunk inputs per provider call instead of one call per chunk (see the sketch below).
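A batching sketch with a hypothetical `embed_batch` callable (`list[str] -> list[list[float]]`):

```python
def embed_in_batches(texts: list[str], embed_batch,
                     batch_size: int = 64) -> list[list[float]]:
    # One provider call per batch_size inputs instead of one call per chunk;
    # batch_size should stay under the provider's request/token limits.
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))
    return vectors
```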
For v1, consider decoupling embedding generation from ingestion: ingest chunks first, then let a background worker embed whatever is pending.

This allows faster ingestion, retries on provider failures, and re-embedding after model or template changes without re-ingesting.

In this design, store a state record per `(chunk_id, model)` (e.g. in `rag_chunk_embedding_state`) so workers can find outstanding work, as in the sketch below.
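A sketch of such a worker, reusing `serialize_f32` and `record_embedding` from the earlier sketches; the `embedding_input` column on `rag_chunks` is hypothetical (alternatively, rebuild the input with `build_embedding_input`):

```python
import time

def embedding_worker(db, embed_batch, model: str, dim: int,
                     batch_size: int = 64) -> None:
    while True:
        # Chunks with no state row for this model are pending.
        rows = db.execute(
            "SELECT c.chunk_id, c.source_id, c.embedding_input "
            "FROM rag_chunks c "
            "LEFT JOIN rag_chunk_embedding_state s "
            "  ON s.chunk_id = c.chunk_id AND s.model = ? "
            "WHERE s.chunk_id IS NULL LIMIT ?",
            (model, batch_size),
        ).fetchall()
        if not rows:
            time.sleep(5)              # nothing pending; poll again later
            continue
        vectors = embed_batch([text for _, _, text in rows])
        for (chunk_id, source_id, text), vec in zip(rows, vectors):
            db.execute(
                "INSERT INTO rag_vec_chunks(chunk_id, source_id, embedding) "
                "VALUES (?, ?, ?)",
                (chunk_id, source_id, serialize_f32(vec)),
            )
            record_embedding(db, chunk_id, model, dim, text)
        db.commit()
```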
Implementation checklist:

- `embedding_json` parsing in the ingester
- vector writes to `rag_vec_chunks`
- `rag.search_vector` MCP tool using a query embedding + vector SQL
- `rag_chunk_embedding_state` table
- `input_hash` per chunk per model

Configuration lives in `rag_sources.embedding_json`. Vectors live in `rag_vec_chunks` (vec0).