examples/text_embedding/README.md
"How does incremental processing work?" finds the right passage even when it shares no keywords — in plain async Python.
</p> <p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/text-embedding/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center"> </div>

A pile of Markdown has the answers hiding in plain sight, but locked behind exact-keyword search. This pipeline reads each file, splits it into overlapping chunks, embeds every chunk with a local sentence-transformer, and stores the vectors in Postgres + pgvector so you can search by meaning. You declare the transformation in native Python with your own types, `target_state = transformation(source_state)`, and the heavy lifting (incremental processing, change tracking, managed targets) runs in a Rust engine underneath, so editing one file re-embeds one file, not the whole folder.
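
To make "editing one file re-embeds one file" concrete, here is a minimal sketch of the general idea behind memoized processing: fingerprint the input together with the function, and skip the work when the fingerprint is unchanged. This is not how the CocoIndex engine is implemented; `fingerprint`, `MemoCache`, and `embed_file` are hypothetical names, and hashing the function's bytecode is only a crude stand-in for "the function's code".

```python
import hashlib
from typing import Callable


def fingerprint(fn: Callable[[str], None], content: str) -> str:
    """Hash the function's bytecode together with the input content."""
    h = hashlib.sha256()
    h.update(fn.__code__.co_code)      # crude proxy for "the function's code"
    h.update(content.encode("utf-8"))  # the file's content
    return h.hexdigest()


class MemoCache:
    """Hypothetical per-key cache: rerun only when the fingerprint changes."""

    def __init__(self) -> None:
        self._fingerprints: dict[str, str] = {}

    def run(self, key: str, fn: Callable[[str], None], content: str) -> bool:
        """Call fn(content) unless the same fn already saw this content for key."""
        fp = fingerprint(fn, content)
        if self._fingerprints.get(key) == fp:
            return False  # unchanged: skip the work
        fn(content)
        self._fingerprints[key] = fp
        return True


embedded: list[str] = []


def embed_file(content: str) -> None:
    embedded.append(content)  # placeholder for chunk + embed + store


cache = MemoCache()
first = cache.run("notes.md", embed_file, "hello")           # new file: runs
second = cache.run("notes.md", embed_file, "hello")          # unchanged: skipped
third = cache.run("notes.md", embed_file, "hello, edited")   # edited: runs again
```

Only the first and third calls do any work; the second is skipped because neither the content nor the function changed.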
The whole pipeline is ordinary async Python and the row type is your own dataclass:

- Walk the Markdown files with `localfs.walk_dir` (`live=True`, so it can watch for changes).
- Split each file with `RecursiveSplitter`: small, focused units, with overlap so an idea straddling a boundary still lands whole.
- Embed each chunk with `all-MiniLM-L6-v2`, a small, fast model that runs locally with no API key.

`process_file` runs once per file; `memo=True` makes it incremental: if a file's content and the function's code are unchanged, the whole file is skipped on the next run. Read it top-to-bottom in `main.py`:

```python
# Excerpt from main.py (imports, constants, and process_chunk omitted).
@dataclass
class DocEmbedding:
    id: int
    filename: str
    chunk_start: int
    chunk_end: int
    text: str
    embedding: Annotated[NDArray, EMBEDDER]  # dimension inferred from the embedder


@coco.fn(memo=True)
async def process_file(file: FileLike, table: postgres.TableTarget[DocEmbedding]) -> None:
    text = await file.read_text()
    chunks = _splitter.split(text, chunk_size=2000, chunk_overlap=500, language="markdown")
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)


@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await postgres.mount_table_target(
        PG_DB, table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(DocEmbedding, primary_key=["id"]),
        pg_schema_name=PG_SCHEMA_NAME,
    )
    target_table.declare_vector_index(column="embedding")
    files = localfs.walk_dir(sourcedir, recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]), live=True)
    await coco.mount_each(process_file, files.items(), target_table)
```
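
To see what `chunk_size` and `chunk_overlap` mean, and where values like `chunk_start` and `chunk_end` could come from, here is a rough sketch using a plain fixed-size sliding window over characters. It is deliberately simpler than a recursive, syntax-aware splitter and is not the `RecursiveSplitter` algorithm; `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(
    text: str, chunk_size: int, chunk_overlap: int
) -> list[tuple[int, int, str]]:
    """Return (start, end, chunk_text) tuples; neighbours share chunk_overlap chars."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[tuple[int, int, str]] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


demo = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
# demo == [(0, 4, "abcd"), (2, 6, "cdef"), (4, 8, "efgh"), (6, 10, "ghij")]
```

Each chunk repeats the last two characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk.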
Each row's id is derived from its chunk text, so re-running upserts only the rows that actually changed and deletes the ones whose source is gone — you never write update logic.
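
The effect of content-derived ids can be sketched in a few lines: if the id is a stable hash of the chunk text, comparing the previous and current id sets yields exactly the rows to upsert and the rows to delete. This is an illustration under that assumption, not the scheme `IdGenerator` or `mount_table_target` actually uses; `chunk_id`, `rows_for`, and `diff_rows` are hypothetical, and duplicate chunk texts would collide in this simplified version.

```python
import hashlib


def chunk_id(text: str) -> int:
    """Stable id from the chunk text, shifted to fit a signed 64-bit column."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> 1


def rows_for(chunks: list[str]) -> dict[int, str]:
    """Map each chunk's content-derived id to its text."""
    return {chunk_id(c): c for c in chunks}


def diff_rows(
    old: dict[int, str], new: dict[int, str]
) -> tuple[dict[int, str], set[int]]:
    """Rows to upsert (ids not seen before) and ids to delete (no longer present)."""
    upserts = {rid: text for rid, text in new.items() if rid not in old}
    deletes = set(old) - set(new)
    return upserts, deletes


before = rows_for(["alpha", "beta", "gamma"])
after = rows_for(["alpha", "beta v2", "gamma"])
upserts, deletes = diff_rows(before, after)
# Only the edited chunk is written; only its old row is removed.
```

Unchanged chunks keep their ids across runs, so they never appear in either set.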
Step-by-step walkthrough with the row schema, the chunk-and-embed flow, the vector index, the SQL query, and exactly what happens on each kind of change.
</p>

- `main.py`: the canonical foundation under RAG and semantic search.
- `@coco.fn(memo=True)` caches per file; edit one file and only its changed chunks re-embed, then `mount_table_target` upserts the diff and cleans up orphans, so there is no diff logic to write.
- `mount_table_target` owns the table schema, the pgvector index, idempotent upserts, and deletion when a file disappears.
- `all-MiniLM-L6-v2` via sentence-transformers; swap in any of 12k+ models. The same embedder is reused at query time so indexing and search stay consistent.
- `EMBEDDER` is declared with `detect_change=True`, so swapping the model re-embeds everything against it with no cache to clear by hand.

1. Start Postgres + pgvector:

```sh
docker compose -f ../../dev/postgres.yaml up -d
```

2. Configure & install:

```sh
cp .env.example .env  # set POSTGRES_URL (defaults to the local docker one)
pip install -e .
```

3. Build the index. The example ships a `markdown_files/` folder of sample docs:

```sh
cocoindex update main      # catch-up: scan, sync, exit
cocoindex update -L main   # live: keep watching for file changes
```

4. Search. This embeds your query with the same model and returns the nearest chunks by pgvector cosine distance:

```sh
python main.py "what is self-attention?"
```

The most semantically similar chunks come back ranked — even when they share none of the words in your query. That's the whole point of a vector index.
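
For intuition about the ranking in step 4: cosine distance is one minus the cosine similarity of two vectors, which is what pgvector's `<=>` operator computes. The sketch below ranks a few made-up three-dimensional vectors in plain Python; the filenames and numbers are illustrative only, and real embeddings come from the model and are ranked inside Postgres.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(
    query: list[float], rows: list[tuple[str, list[float]]], k: int = 2
) -> list[str]:
    """Return the names of the k rows closest to the query vector."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [name for name, _ in ranked[:k]]


# Toy "embeddings" (hypothetical values, not model output).
rows = [
    ("attention.md", [0.9, 0.1, 0.0]),
    ("postgres.md", [0.0, 0.2, 0.9]),
    ("transformers.md", [0.7, 0.3, 0.1]),
]
query = [1.0, 0.0, 0.0]
top = nearest(query, rows, k=2)
# top == ["attention.md", "transformers.md"]
```

Ranking depends only on the direction of the vectors, not on any shared words, which is why a query can match a passage with different vocabulary.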
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/text-embedding/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>