examples/text_embedding_lancedb/README.md
Walk, chunk, embed locally, store — incrementally — in plain async Python.
</p> <p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/text-embedding-lancedb/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center"> </div>

This is Semantic Search 101 with one thing changed: the vectors land in LanceDB instead of Postgres. LanceDB is an embedded, file-based vector store: there is no server to stand up, just a `./lancedb_data/` directory created on first run. You declare the transformation in native Python with your own types, `target_state = transformation(source_state)`, and the heavy lifting (incremental processing, change tracking, managed targets) runs in a Rust engine underneath, so editing one file re-embeds one file, not the whole folder.

The chunk-and-embed half is byte-for-byte the base example — RecursiveSplitter cuts each file into overlapping Markdown chunks, and a local SentenceTransformerEmbedder (all-MiniLM-L6-v2, no API key) turns each into a vector. What changes is the resource and the target: a LanceAsyncConnection instead of an asyncpg pool, and lancedb.mount_table_target instead of the Postgres one — same call shape, same table.declare_row(...). Read it in main.py:

```python
@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    conn = await lancedb.connect_async(LANCEDB_URI)  # the "connection" is just a path on disk
    builder.provide(LANCE_DB, conn)
    builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
    yield


@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await lancedb.mount_table_target(
        LANCE_DB,
        table_name=TABLE_NAME,
        table_schema=await lancedb.TableSchema.from_class(DocEmbedding, primary_key=["id"]),
    )
    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
        live=True,
    )
    await coco.mount_each(process_file, files.items(), target_table)
```

lancedb.mount_table_target is the LanceDB counterpart to the Postgres mount_table_target: it creates and manages the table, handles idempotent upserts keyed on the primary key, and cleans up orphan rows when a file disappears. Only the import changed.
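
To make "idempotent upserts keyed on the primary key" and "orphan cleanup" concrete, here is a minimal in-memory sketch of the idea. It is not how CocoIndex or LanceDB implement it: the table is a plain dict, and `row_id`, `declare`, and `reconcile` are hypothetical helpers invented for this illustration. The id scheme (a hash of file name plus chunk text) is also an assumption.

```python
import hashlib


def row_id(filename: str, chunk_text: str) -> str:
    """Hypothetical id scheme: a short hash of the file name and chunk text."""
    digest = hashlib.sha256(f"{filename}\n{chunk_text}".encode("utf-8"))
    return digest.hexdigest()[:16]


def declare(files: dict[str, list[str]]) -> dict[str, dict]:
    """Build the desired rows for a {filename: [chunk, ...]} snapshot."""
    rows: dict[str, dict] = {}
    for name, chunks in files.items():
        for text in chunks:
            rows[row_id(name, text)] = {"filename": name, "text": text}
    return rows


def reconcile(table: dict[str, dict], declared: dict[str, dict]) -> tuple[list[str], list[str]]:
    """Make `table` match `declared`; return (upserted ids, deleted ids)."""
    # Upsert only rows that are new or whose content differs.
    upserted = [rid for rid, row in declared.items() if table.get(rid) != row]
    # Orphans: rows in the table that nothing declares any more.
    deleted = [rid for rid in table if rid not in declared]
    for rid in upserted:
        table[rid] = declared[rid]
    for rid in deleted:
        del table[rid]
    return upserted, deleted


table: dict[str, dict] = {}
snapshot = {"a.md": ["alpha", "beta"], "b.md": ["gamma"]}
first = reconcile(table, declare(snapshot))                          # initial build
again = reconcile(table, declare(snapshot))                          # nothing changed
after_delete = reconcile(table, declare({"a.md": ["alpha", "beta"]}))  # b.md removed
```

Running the same snapshot twice touches no rows on the second pass, and dropping `b.md` deletes only its row. That is the behaviour the managed target is described as providing, shown here against a toy store.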
Step-by-step walkthrough with the LanceDB connection, the managed table target, the row schema, and the async search query.
</p>

- `POSTGRES_URL` is not needed: LanceDB writes to `./lancedb_data/`, created on first run. To start fresh, delete the directory and re-run.
- `lancedb.mount_table_target` owns the schema, idempotent upserts, and orphan cleanup. These are the same guarantees the Postgres target gives, against a local store.
- `@coco.fn(memo=True)` skips files whose content and code are unchanged; each row's id is derived from its chunk text, so only changed rows are upserted and vanished ones are deleted.

No database to install: LanceDB is embedded and writes to `./lancedb_data/`, created on first run.

1. Configure & install:

   ```sh
   cp .env.example .env   # no required secrets; optional LANCEDB_URI override
   pip install -e .
   ```

2. Build the index (the example ships a `markdown_files/` folder of sample docs):

   ```sh
   cocoindex update main      # catch-up: scan, sync, exit
   cocoindex update -L main   # live: keep watching for file changes
   ```

3. Search: this embeds your query with the same model and returns the nearest vectors via LanceDB's async search:

   ```sh
   python main.py "what is self-attention?"
   ```

The most semantically similar chunks come back ranked — even when they share none of the words in your query.
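
As a rough illustration of why results can match without sharing words, the sketch below ranks a few rows by vector similarity in plain Python. It is a toy, not LanceDB's search. The three-dimensional vectors are made up for the example (real `all-MiniLM-L6-v2` embeddings are much longer), cosine similarity is chosen only for illustration, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], rows: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up vectors standing in for real embeddings.
rows = [
    ("attention weighs tokens against each other", [0.9, 0.1, 0.0]),
    ("how to bake sourdough", [0.0, 0.2, 0.9]),
    ("transformers process sequences in parallel", [0.7, 0.6, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of the search query
results = top_k(query_vector, rows, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Ranking depends only on the vectors, so a chunk can score highly even when its text has no words in common with the query.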
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/text-embedding-lancedb/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>