RAG_POC/CHUNKING_AND_EMBEDDING_GUIDE.md
## chunking_json and embedding_json

This guide explains how to configure document chunking and vector embedding generation in the ProxySQL RAG ingestion system.
The rag_sources table stores configuration for ingesting data into the RAG index. Two key JSON columns control how documents are processed:
| Column | Purpose | Required |
|---|---|---|
| `chunking_json` | Controls how documents are split into chunks | Yes |
| `embedding_json` | Controls how vector embeddings are generated | No |
Both columns accept JSON objects with specific fields that define the behavior of the ingestion pipeline.
The `chunking_json` column defines how documents are split into smaller pieces (chunks) for indexing. Chunking matters because chunks must fit within the embedding model's input limit, and smaller, focused chunks improve retrieval precision.
```sql
CREATE TABLE rag_sources (
  -- ...
  chunking_json TEXT NOT NULL,  -- REQUIRED
  -- ...
);
```
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `true` | Enable/disable chunking. When `false`, the entire document is a single chunk. |
| `unit` | string | `"chars"` | Unit of measurement. Only `"chars"` is supported in v0. |
| `chunk_size` | integer | `4000` | Maximum size of each chunk, in characters. |
| `overlap` | integer | `400` | Number of characters shared between consecutive chunks. |
| `min_chunk_size` | integer | `800` | Minimum size for the last chunk. If smaller, it is merged into the previous chunk. |
| Condition | Action |
|---|---|
| `chunk_size <= 0` | Reset to 4000 |
| `overlap < 0` | Reset to 0 |
| `overlap >= chunk_size` | Reset to `chunk_size / 4` |
| `min_chunk_size < 0` | Reset to 0 |
| `unit != "chars"` | Warning logged, falls back to `"chars"` |
The chunker uses a sliding window approach:
```
Document: "A long document text that needs to be split..."

With chunk_size=20, overlap=5:

Chunk 0: [0-19]  "A long document text"
Chunk 1: [15-34] " text that needs to "
Chunk 2: [30-45] "s to be split..."
```
Algorithm steps:
1. If `enabled=false`, return the entire document as a single chunk.
2. If the document fits within `chunk_size`, return it as a single chunk.
3. Otherwise compute `step = chunk_size - overlap` and advance the window by `step` characters, emitting a chunk at each position.
4. If the final chunk is smaller than `min_chunk_size`, append it to the previous chunk.

To disable chunking:

```json
{
  "enabled": false
}
```
Use case: Small documents (posts, comments) that don't need splitting.
```json
{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 4000,
  "overlap": 400,
  "min_chunk_size": 800
}
```
Use case: General-purpose content like articles, documentation.
```json
{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 1500,
  "overlap": 200,
  "min_chunk_size": 500
}
```
Use case: Code or technical content where smaller, more focused chunks improve retrieval.
```json
{
  "enabled": true,
  "unit": "chars",
  "chunk_size": 8000,
  "overlap": 800,
  "min_chunk_size": 2000
}
```
Use case: Books, long reports where maintaining more context per chunk is beneficial.
For a 10,000-character document with `chunk_size=4000`, `overlap=400`, `min_chunk_size=800`:

```
Chunk 0: chars 0-3999 (4000 chars)
Chunk 1: chars 3600-7599 (4000 chars, overlaps by 400)
Chunk 2: chars 7200-9999 (2800 chars - kept since > min_chunk_size)
```

Result: 3 chunks
With a 7,500-character document:

```
Chunk 0: chars 0-3999 (4000 chars)
Chunk 1: chars 3600-7499 (3900 chars, runs to end of document)
```

Result: 2 chunks (the remaining 300-character window at 7200 falls below `min_chunk_size`, so it is absorbed into Chunk 1)
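The walkthroughs above can be reproduced with a minimal Python sketch of the sliding-window algorithm, including the validation rules from the table earlier. This models the documented behavior only; it is not the actual ProxySQL chunker, and `chunk_spans` is a hypothetical helper name:

```python
def chunk_spans(n, chunk_size=4000, overlap=400, min_chunk_size=800):
    """Return (start, end) character spans for a document of length n."""
    # Parameter validation, mirroring the rules in the table above
    if chunk_size <= 0:
        chunk_size = 4000
    if overlap < 0:
        overlap = 0
    if overlap >= chunk_size:
        overlap = chunk_size // 4
    if min_chunk_size < 0:
        min_chunk_size = 0
    # Small documents are returned as a single chunk
    if n <= chunk_size:
        return [(0, n)]
    step = chunk_size - overlap
    spans, pos = [], 0
    while pos < n:
        end = min(pos + chunk_size, n)
        spans.append((pos, end))
        if end == n:
            break
        pos += step
    # Merge a too-small final chunk into the previous one
    if len(spans) > 1 and spans[-1][1] - spans[-1][0] < min_chunk_size:
        spans[-2] = (spans[-2][0], spans[-1][1])
        spans.pop()
    return spans

print(chunk_spans(10000))  # [(0, 4000), (3600, 7600), (7200, 10000)]
print(chunk_spans(7500))   # [(0, 4000), (3600, 7500)]
```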
The embedding_json column defines how vector embeddings are generated for semantic search. Embeddings convert text into numerical vectors that capture semantic meaning.
```sql
CREATE TABLE rag_sources (
  -- ...
  embedding_json TEXT,  -- OPTIONAL (can be NULL)
  -- ...
);
```
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Enable/disable embedding generation. |
| `dim` | integer | `1536` | Vector dimension (must match model output). |
| `model` | string | `"unknown"` | Model name/identifier (for observability). |
| `provider` | string | `"stub"` | Embedding service: `"openai"` or `"stub"`. |
| `api_base` | string | (empty) | API endpoint URL. |
| `api_key` | string | (empty) | API authentication key. |
| `batch_size` | integer | `16` | Number of chunks processed per API request. |
| `timeout_ms` | integer | `20000` | HTTP request timeout in milliseconds. |
| `input` | object | (see below) | Specifies how to build the embedding input text. |
| Condition | Action |
|---|---|
| `dim <= 0` | Reset to 1536 |
| `batch_size <= 0` | Reset to 16 |
| `timeout_ms <= 0` | Reset to 20000 |
### stub Provider

Generates deterministic pseudo-embeddings by hashing the input text. Used for testing without API calls.
```json
{
  "enabled": true,
  "provider": "stub",
  "dim": 1536
}
```
Benefits: deterministic output, no network dependency, and zero API cost.

Use case: Development, testing, CI/CD pipelines.
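For illustration, here is one way a deterministic hash-based stub could work in Python. The guide does not specify the actual stub provider's hashing scheme, so treat `stub_embedding` as a hypothetical sketch of the idea:

```python
import hashlib
import struct

def stub_embedding(text, dim=1536):
    # Deterministic pseudo-embedding: expand a SHA-256 hash of the text
    # into `dim` floats in [0, 1). The same input always yields the
    # same vector, so tests are reproducible without any API calls.
    vec = []
    counter = 0
    while len(vec) < dim:
        digest = hashlib.sha256(f"{counter}:{text}".encode()).digest()
        # Each 32-byte digest yields eight 4-byte unsigned integers
        for offset in range(0, len(digest) - 3, 4):
            (word,) = struct.unpack_from(">I", digest, offset)
            vec.append(word / 2**32)
            if len(vec) == dim:
                break
        counter += 1
    return vec

v = stub_embedding("hello")
assert stub_embedding("hello") == v  # same input, same vector
```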
### openai Provider

Connects to OpenAI or OpenAI-compatible APIs (e.g., Azure OpenAI, local LLM servers).
```json
{
  "enabled": true,
  "provider": "openai",
  "api_base": "https://api.openai.com/v1",
  "api_key": "sk-your-api-key",
  "model": "text-embedding-3-small",
  "dim": 1536,
  "batch_size": 16,
  "timeout_ms": 30000
}
```
Benefits: high-quality semantic vectors from a real embedding model.

Use case: Production semantic search.
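A minimal client for such an endpoint might look like the following standard-library Python sketch. It assumes the usual OpenAI embeddings request/response shape (an `input` array in the request, vectors under `data[i].embedding` in the response); `build_embeddings_request` and `embed_batch` are illustrative names, not part of the ingestion system:

```python
import json
import urllib.request

def build_embeddings_request(texts, api_base, api_key, model):
    # Build a POST request for an OpenAI-compatible /embeddings endpoint
    body = json.dumps({"model": model, "input": texts}).encode()
    return urllib.request.Request(
        f"{api_base}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def embed_batch(texts, api_base, api_key, model, timeout_ms=30000):
    req = build_embeddings_request(texts, api_base, api_key, model)
    with urllib.request.urlopen(req, timeout=timeout_ms / 1000) as resp:
        payload = json.load(resp)
    # The response carries one vector per input text, in order
    return [item["embedding"] for item in payload["data"]]
```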
### The input Field

The `input` field defines what text is sent to the embedding model. It uses a concat specification to combine:
- `{"col": "ColumnName"}` - the value of a document column
- `{"lit": "text"}` - a literal string
- `{"chunk_body": true}` - the chunk text itself

If `input` is not specified, only the chunk body is embedded.
```json
{
  "enabled": true,
  "provider": "openai",
  "dim": 1536,
  "input": {
    "concat": [
      {"col": "Title"},
      {"lit": "\nTags: "},
      {"col": "Tags"},
      {"lit": "\n\n"},
      {"chunk_body": true}
    ]
  }
}
```
Result: the embedded text is `{Title}\nTags: {Tags}\n\n{ChunkBody}`.
This typically improves semantic recall by including title and tags in the embedding.
```python
def build_embedding_input(input_spec, row, chunk_body):
    # Simplified representation of the input-building logic
    if input_spec and "concat" in input_spec:
        result = ""
        for part in input_spec["concat"]:
            if "col" in part:
                result += str(row.get(part["col"], ""))
            elif "lit" in part:
                result += part["lit"]
            elif part.get("chunk_body"):
                result += chunk_body
        return result
    return chunk_body  # fallback: embed the chunk body alone
```
Embeddings are generated in batches to reduce API calls:
```
With batch_size=16 and 100 chunks:

Without batching: 100 API calls
With batching:      7 API calls (16+16+16+16+16+16+4)
```
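The batch arithmetic can be sketched in a few lines of Python (`batches` is an illustrative helper, not part of the system):

```python
import math

def batches(items, batch_size=16):
    # Yield consecutive slices of at most batch_size items
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

chunks = [f"chunk-{i}" for i in range(100)]
sizes = [len(b) for b in batches(chunks)]
print(sizes)                        # [16, 16, 16, 16, 16, 16, 4]
print(math.ceil(len(chunks) / 16))  # 7 API calls
```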
Batch process:
1. Collect up to `batch_size` chunks.
2. Send them to the provider in a single request.
3. Store the returned vectors and continue with the next batch.

To disable embeddings:

```json
{
  "enabled": false
}
```
Or leave embedding_json as NULL. Only full-text search will be available.
```json
{
  "enabled": true,
  "provider": "stub",
  "dim": 1536
}
```
```json
{
  "enabled": true,
  "provider": "openai",
  "api_base": "https://api.openai.com/v1",
  "api_key": "sk-your-key",
  "model": "text-embedding-3-small",
  "dim": 1536
}
```
```json
{
  "enabled": true,
  "provider": "openai",
  "api_base": "https://api.openai.com/v1",
  "api_key": "sk-your-key",
  "model": "text-embedding-3-small",
  "dim": 1536,
  "batch_size": 32,
  "timeout_ms": 45000,
  "input": {
    "concat": [
      {"col": "Title"},
      {"lit": "\n"},
      {"chunk_body": true}
    ]
  }
}
```
```json
{
  "enabled": true,
  "provider": "openai",
  "api_base": "http://localhost:8080/v1",
  "api_key": "dummy",
  "model": "nomic-embed-text",
  "dim": 768,
  "batch_size": 8,
  "timeout_ms": 60000
}
```
```sql
INSERT INTO rag_sources (
  source_id, name, enabled, backend_type,
  backend_host, backend_port, backend_user, backend_pass, backend_db,
  table_name, pk_column,
  doc_map_json,
  chunking_json,
  embedding_json
) VALUES (
  1, 'stack_posts', 1, 'mysql',
  '127.0.0.1', 3306, 'root', 'root', 'stackdb',
  'posts', 'Id',
  '{
    "doc_id": {"format": "posts:{Id}"},
    "title": {"concat": [{"col": "Title"}]},
    "body": {"concat": [{"col": "Body"}]},
    "metadata": {"pick": ["Id", "Tags", "Score"]}
  }',
  '{
    "enabled": true,
    "unit": "chars",
    "chunk_size": 4000,
    "overlap": 400,
    "min_chunk_size": 800
  }',
  '{
    "enabled": true,
    "provider": "openai",
    "api_base": "https://api.openai.com/v1",
    "api_key": "sk-your-key",
    "model": "text-embedding-3-small",
    "dim": 1536,
    "batch_size": 16,
    "timeout_ms": 30000,
    "input": {
      "concat": [
        {"col": "Title"},
        {"lit": "\nTags: "},
        {"col": "Tags"},
        {"lit": "\n\n"},
        {"chunk_body": true}
      ]
    }
  }'
);
```
```sql
INSERT INTO rag_sources (
  source_id, name, enabled, backend_type,
  backend_host, backend_port, backend_user, backend_pass, backend_db,
  table_name, pk_column,
  doc_map_json,
  chunking_json,
  embedding_json
) VALUES (
  2, 'docs', 1, 'mysql',
  '127.0.0.1', 3306, 'root', 'root', 'docsdb',
  'articles', 'article_id',
  '{
    "doc_id": {"format": "docs:{article_id}"},
    "title": {"concat": [{"col": "title"}]},
    "body": {"concat": [{"col": "content"}]},
    "metadata": {"pick": ["category", "author"]}
  }',
  '{
    "enabled": true,
    "unit": "chars",
    "chunk_size": 1500,
    "overlap": 200,
    "min_chunk_size": 500
  }',
  '{
    "enabled": true,
    "provider": "stub",
    "dim": 1536
  }'
);
```
```sql
INSERT INTO rag_sources (
  source_id, name, enabled, backend_type,
  backend_host, backend_port, backend_user, backend_pass, backend_db,
  table_name, pk_column,
  doc_map_json,
  chunking_json,
  embedding_json
) VALUES (
  3, 'github_issues', 1, 'mysql',
  '127.0.0.1', 3306, 'root', 'root', 'githubdb',
  'issues', 'id',
  '{
    "doc_id": {"format": "issues:{id}"},
    "title": {"concat": [{"col": "title"}]},
    "body": {"concat": [{"col": "body"}]},
    "metadata": {"pick": ["number", "state", "labels"]}
  }',
  '{
    "enabled": false
  }',
  '{
    "enabled": true,
    "provider": "openai",
    "api_base": "https://api.openai.com/v1",
    "api_key": "sk-your-key",
    "model": "text-embedding-3-small",
    "dim": 1536,
    "input": {
      "concat": [
        {"col": "title"},
        {"lit": "\n\n"},
        {"chunk_body": true}
      ]
    }
  }'
);
```
- **Match content to chunk size.** Small documents (posts, comments) may not need splitting at all (`"enabled": false`).
- **Set `overlap` to 10-20% of chunk size** (e.g., `overlap: 400` for `chunk_size: 4000`).
- **Set `min_chunk_size` to 20-25% of chunk size** (e.g., `min_chunk_size: 800` for `chunk_size: 4000`).
- **Consider your embedding model's input limit** and make sure `chunk_size` doesn't exceed it.
- **Use the `stub` provider for development** to avoid API calls and cost.
- **Tune `batch_size` for your API** to balance throughput against rate limits.
- **Include relevant context in `input`** (titles, tags) to improve semantic recall.
- **Set appropriate timeouts** for your network and provider latency.
- **Match `dim` to the model's output dimension:**
  - `text-embedding-3-small`: 1536
  - `text-embedding-3-large`: 3072
  - `nomic-embed-text`: 768

**Symptoms:** High chunk count, many chunks under 500 characters.

**Solution:** Increase `min_chunk_size` or decrease `chunk_size`:
```json
{
  "chunk_size": 3000,
  "min_chunk_size": 1000
}
```
**Symptoms:** Search misses information that spans chunk boundaries.

**Solution:** Increase `overlap`:
```json
{
  "chunk_size": 4000,
  "overlap": 800
}
```
**Cause:** The `dim` field doesn't match the actual model output.

**Solution:** Verify the model's output dimension and update the config (`dim` must match the model):
```json
{
  "model": "text-embedding-3-small",
  "dim": 1536
}
```
**Symptoms:** Embedding requests fail after timeout.

**Solutions:**

- Increase `timeout_ms`
- Decrease `batch_size`

```json
{
  "timeout_ms": 60000,
  "batch_size": 8
}
```
**Symptoms:** HTTP 429 errors from the embedding service.

**Solutions:**

- Decrease `batch_size`

```json
{
  "batch_size": 4
}
```
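Beyond shrinking batches, a common client-side mitigation (not described in this guide) is to retry rate-limited requests with exponential backoff. A hypothetical Python sketch, where `RateLimitError` stands in for however your HTTP client signals a 429:

```python
import time

class RateLimitError(Exception):
    """Placeholder for however your HTTP client reports an HTTP 429."""

def call_with_backoff(send, max_attempts=5, base_delay=1.0):
    # Retry `send` when the service rate-limits us, doubling the wait
    # between attempts; re-raise after the final attempt fails.
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```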
**Cause:** Using the `openai` provider without setting `api_base`.

**Solution:** Set `api_base` to your endpoint:
```json
{
  "provider": "openai",
  "api_base": "https://api.openai.com/v1",
  "api_key": "sk-your-key"
}
```
```sql
-- Check current configuration
SELECT
  source_id,
  name,
  chunking_json,
  embedding_json
FROM rag_sources
WHERE enabled = 1;

-- Count chunks per document
SELECT
  d.source_id,
  AVG((SELECT COUNT(*) FROM rag_chunks c WHERE c.doc_id = d.doc_id)) AS avg_chunks
FROM rag_documents d
GROUP BY d.source_id;

-- Check vector counts
SELECT
  source_id,
  COUNT(*) AS vector_count
FROM rag_vec_chunks
GROUP BY source_id;

-- Verify dimensions match
SELECT
  source_id,
  COUNT(*) AS count,
  -- Extracting the stored vector length requires a custom query;
  -- verify that dim in rag_sources.embedding_json matches the model
  'Verify dim in rag_sources.embedding_json matches model' AS check_note
FROM rag_vec_chunks
GROUP BY source_id;
```
```sql
-- FTS only (no embeddings)
INSERT INTO rag_sources (..., chunking_json, embedding_json) VALUES (
  ...,
  '{"enabled": false}',
  NULL
);

-- Chunking + stub embeddings (testing)
INSERT INTO rag_sources (..., chunking_json, embedding_json) VALUES (
  ...,
  '{"enabled":true,"chunk_size":4000,"overlap":400}',
  '{"enabled":true,"provider":"stub","dim":1536}'
);

-- Chunking + OpenAI embeddings (production)
INSERT INTO rag_sources (..., chunking_json, embedding_json) VALUES (
  ...,
  '{"enabled":true,"chunk_size":4000,"overlap":400}',
  '{"enabled":true,"provider":"openai","api_base":"https://api.openai.com/v1","api_key":"sk-...","model":"text-embedding-3-small","dim":1536}'
);
```
| Model | Dimension |
|---|---|
| `text-embedding-3-small` | 1536 |
| `text-embedding-3-large` | 3072 |
| `text-embedding-ada-002` | 1536 |
| `nomic-embed-text` | 768 |
| `all-MiniLM-L6-v2` | 384 |