README - Cocoindex — ContextQMD

<h1 align="center">Semantic search over Markdown, stored in LanceDB.</h1>

The Semantic Search 101 pipeline pointed at LanceDB: an embedded, file-based vector store with no server to stand up and no <code>POSTGRES_URL</code>, just a directory on disk that you can copy to move the index.

Walk, chunk, embed locally, store — incrementally — in plain async Python.

Star us ❤️ → <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub">Star CocoIndex on GitHub</a> · <a href="https://cocoindex.io/docs/examples/text-embedding-lancedb/" title="Read the full walkthrough">Read the full walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord">Join the CocoIndex Discord</a>

This is Semantic Search 101 with one thing changed: the vectors land in LanceDB instead of Postgres. LanceDB is an embedded, file-based vector store — no server to stand up, just a ./lancedb_data/ directory created on first run. You declare the transformation in native Python and your own types — target_state = transformation(source_state) — and the heavy lifting (incremental processing, change tracking, managed targets) runs in a Rust engine underneath, so editing one file re-embeds one file, not the whole folder.
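To make the incremental claim concrete, here is a minimal pure-Python sketch of the idea. It is not CocoIndex's engine or API: the `fingerprint` and `sync` helpers and the in-memory `cache` are hypothetical names used only for illustration, and `str.upper` stands in for the real chunk-and-embed transformation.

```python
import hashlib
from typing import Callable


def fingerprint(content: str) -> str:
    """Stable hash of a file's content, used to detect changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def sync(
    source: dict[str, str],
    cache: dict[str, tuple[str, str]],
    transform: Callable[[str], str],
) -> tuple[dict[str, str], list[str]]:
    """Bring the target in line with the source, recomputing only changed files.

    `cache` maps path -> (fingerprint, transformed result) and persists across runs.
    Returns the target state and the list of paths that were actually reprocessed.
    """
    reprocessed = []
    for path, content in source.items():
        fp = fingerprint(content)
        cached = cache.get(path)
        if cached is None or cached[0] != fp:
            cache[path] = (fp, transform(content))  # new or edited file
            reprocessed.append(path)
    # Forget files that no longer exist in the source.
    for path in list(cache):
        if path not in source:
            del cache[path]
    target = {path: result for path, (_, result) in cache.items()}
    return target, reprocessed


cache: dict[str, tuple[str, str]] = {}
docs = {"a.md": "alpha", "b.md": "beta"}
target, first_run = sync(docs, cache, str.upper)   # both files processed
docs["b.md"] = "beta v2"                            # edit one file
target, second_run = sync(docs, cache, str.upper)  # only b.md is redone
```

The second call touches only `b.md`, which is the behaviour described above: the target is always a function of the current source, but the work done is proportional to what changed.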

How it works

The chunk-and-embed half is byte-for-byte identical to the base example: RecursiveSplitter cuts each file into overlapping Markdown chunks, and a local SentenceTransformerEmbedder (all-MiniLM-L6-v2, no API key) turns each chunk into a vector. What changes is the resource and the target: a LanceAsyncConnection instead of an asyncpg pool, and lancedb.mount_table_target instead of the Postgres one, with the same call shape and the same table.declare_row(...). Read it in main.py:

```python
@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    conn = await lancedb.connect_async(LANCEDB_URI)   # the "connection" is just a path on disk
    builder.provide(LANCE_DB, conn)
    builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
    yield

@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await lancedb.mount_table_target(
        LANCE_DB, table_name=TABLE_NAME,
        table_schema=await lancedb.TableSchema.from_class(DocEmbedding, primary_key=["id"]),
    )
    files = localfs.walk_dir(sourcedir, recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]), live=True)
    await coco.mount_each(process_file, files.items(), target_table)
```

lancedb.mount_table_target is the LanceDB counterpart to the Postgres mount_table_target: it creates and manages the table, handles idempotent upserts keyed on the primary key, and cleans up orphan rows when a file disappears. Beyond the connection resource, only the import changed.
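The guarantees of a managed target can be illustrated with a small sketch. This is an assumption-laden toy, not the CocoIndex or LanceDB implementation: the `reconcile` function is hypothetical, and a plain dict stands in for the table.

```python
def reconcile(table: dict[str, dict], declared: list[dict], key: str = "id") -> tuple[int, int]:
    """Make `table` match the declared rows. Returns (upserted, deleted) counts."""
    desired = {row[key]: row for row in declared}

    # Idempotent upsert: write a row only if it is missing or differs.
    upserted = 0
    for pk, row in desired.items():
        if table.get(pk) != row:
            table[pk] = row
            upserted += 1

    # Orphan cleanup: drop rows that nothing declares any more.
    orphans = [pk for pk in table if pk not in desired]
    for pk in orphans:
        del table[pk]
    return upserted, len(orphans)


table: dict[str, dict] = {}
rows = [{"id": "1", "text": "intro"}, {"id": "2", "text": "usage"}]
first = reconcile(table, rows)        # both rows written
second = reconcile(table, rows)       # nothing to do: the operation is idempotent
third = reconcile(table, rows[:1])    # row "2" is no longer declared, so it is removed
```

Declaring the full desired state on every run, rather than issuing inserts and deletes by hand, is what lets the same pipeline code be re-run safely.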

📘 <a href="https://cocoindex.io/docs/examples/text-embedding-lancedb/">Full Tutorial →</a>

Step-by-step walkthrough with the LanceDB connection, the managed table target, the row schema, and the async search query.

Why it's worth a star ⭐

Zero infrastructure. No database to install, no POSTGRES_URL — LanceDB writes to ./lancedb_data/, created on first run. To start fresh, delete the directory and re-run.
Portable by design. Data lives in one directory on disk; copy it to move the whole index.
Managed table target. lancedb.mount_table_target owns the schema, idempotent upserts, and orphan cleanup — the same guarantees the Postgres target gives, against a local store.
Incremental by default. @coco.fn(memo=True) skips files whose content and code are unchanged; each row's id is derived from its chunk text, so only changed rows are upserted and vanished ones are deleted.
Same flow, different store. The chunk-and-embed code is identical to the Postgres version — proof the target is a swappable detail. The same local embedder is reused at query time so indexing and search stay consistent.
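The link between content-derived ids and minimal updates can be sketched as follows. This is a simplified stand-in: the fixed-width `chunk_text` window is not the Markdown-aware RecursiveSplitter, and `chunk_id` and `row_diff` are hypothetical helpers rather than anything from the example's main.py.

```python
import hashlib


def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into character windows; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def chunk_id(chunk: str) -> str:
    """Derive a row id from the chunk text, so identical text maps to the same id."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]


def row_diff(old_text: str, new_text: str, size: int, overlap: int) -> tuple[set[str], set[str]]:
    """Return (ids to upsert, ids to delete) when a file changes from old_text to new_text."""
    old_ids = {chunk_id(c) for c in chunk_text(old_text, size, overlap)}
    new_ids = {chunk_id(c) for c in chunk_text(new_text, size, overlap)}
    return new_ids - old_ids, old_ids - new_ids


old = "aaaabbbbccccdddd"
new = "aaaabbbbccccDDDD"  # only the tail of the file was edited
chunks = chunk_text(old, size=8, overlap=4)
to_upsert, to_delete = row_diff(old, new, size=8, overlap=4)
```

Of the three chunks, only the last one changes, so one row is upserted and one stale row is deleted while the other two are left alone.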

Run it

No database to install — LanceDB is embedded and writes to ./lancedb_data/, created on first run.

1. Configure & install:

```sh
cp .env.example .env     # no required secrets; optional LANCEDB_URI override
pip install -e .
```

2. Build the index — the example ships a markdown_files/ folder of sample docs:

```sh
cocoindex update main          # catch-up: scan, sync, exit
cocoindex update -L main       # live: keep watching for file changes
```

3. Search — embeds your query with the same model and returns the nearest vectors via LanceDB's async search:

```sh
python main.py "what is self-attention?"
```

The most semantically similar chunks come back ranked — even when they share none of the words in your query.
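For intuition about what "nearest vectors" means, here is a brute-force sketch of similarity ranking. It does not call LanceDB or the embedding model: the `cosine` and `top_k` helpers are hypothetical, and the hand-written three-dimensional vectors are toy stand-ins for the embeddings the real model produces.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], rows: list[tuple[str, list[float]]], k: int = 2) -> list[tuple[float, str]]:
    """Score every stored chunk against the query and return the k best, highest first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: (chunk text, made-up embedding).
rows = [
    ("attention weights over tokens", [0.9, 0.1, 0.0]),
    ("install the package", [0.0, 0.2, 0.9]),
    ("stacked transformer layers", [0.7, 0.3, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the user's question
results = top_k(query_vec, rows, k=2)
best_texts = [text for _, text in results]
```

Ranking depends only on vector direction, not shared vocabulary, which is why a query can match chunks that contain none of its words. A vector store performs the same comparison, typically with an index instead of a full scan.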

If this gave you a portable, server-free vector index, <a href="https://github.com/cocoindex-io/cocoindex">give CocoIndex a star ⭐</a> — it helps a lot.

<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/text-embedding-lancedb/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples">See all examples →</a>