docs/src/content/example-posts/text-embedding-lancedb.md
This is the Semantic Search 101 example with one thing changed: the vectors land in LanceDB instead of Postgres. LanceDB is an embedded, file-based vector store — no server to stand up, no POSTGRES_URL, just a directory on disk you can copy to move. Everything else — read Markdown, chunk, embed each chunk — is identical, so this post focuses on the one part that differs: the connector.
The whole pipeline is ordinary async Python and your own types. The heavy lifting — incremental processing, change tracking, managed targets — runs in a Rust engine underneath, so only what changed gets re-embedded and re-upserted.
From a high level, these are the steps:
You declare the transformation logic with native Python, without worrying about how updates propagate. Think: target_state = transformation(source_state).
The chunk-and-embed half is unchanged from the base example — RecursiveSplitter cuts each file into overlapping Markdown chunks, and a local SentenceTransformerEmbedder (all-MiniLM-L6-v2, no API key) turns each chunk into a 384-d vector. See the base walkthrough for the chunk/embed details. What changes here is the target.
LanceDB is embedded, so the "connection" is just a path on disk — the directory is created on first run. The shared resources the rest of the code builds on are the LanceDB connection and the embedding model, both provided once at startup in coco.lifespan. DocEmbedding defines one output row — each chunk becomes one row.
import cocoindex as coco
from cocoindex.connectors import lancedb, localfs
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
LANCEDB_URI = "./lancedb_data"
TABLE_NAME = "doc_embeddings"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
LANCE_DB = coco.ContextKey[lancedb.LanceAsyncConnection]("text_embedding_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)
@dataclass
class DocEmbedding:
id: int
filename: str
chunk_start: int
chunk_end: int
text: str
embedding: Annotated[NDArray, EMBEDDER]
@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
conn = await lancedb.connect_async(LANCEDB_URI)
builder.provide(LANCE_DB, conn)
builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
yield
Compared to the Postgres version, the only difference is the resource: lancedb.connect_async(LANCEDB_URI) instead of an asyncpg pool, and a LanceAsyncConnection context key instead of asyncpg.Pool. embedding: Annotated[NDArray, EMBEDDER] still ties the vector column to the embedder, so its dimensions are inferred automatically — and if you swap the model later, CocoIndex notices (detect_change=True) and re-embeds.
app_main wires the source to the target. It mounts the LanceDB table, walks the source directory, and mounts one processing component per file.
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
target_table = await lancedb.mount_table_target(
LANCE_DB,
table_name=TABLE_NAME,
table_schema=await lancedb.TableSchema.from_class(
DocEmbedding, primary_key=["id"]
),
)
files = localfs.walk_dir(
sourcedir,
recursive=True,
path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
live=True, # watch for changes; pass -L to `cocoindex update` to run live
)
await coco.mount_each(process_file, files.items(), target_table)
lancedb.mount_table_target is the LanceDB counterpart to the Postgres mount_table_target — same call shape, same managed-table guarantees: it creates and manages the table for you, handles idempotent upserts keyed on the primary key, and cleans up orphan rows when a file disappears. process_file and process_chunk take a lancedb.TableTarget[DocEmbedding] and call table.declare_row(...) exactly as before; only the import changed. live=True makes the filesystem source watch for changes, and mount_each runs one component per file.
The App binds it all together and points at the Markdown folder:
app = coco.App(
coco.AppConfig(name="TextEmbeddingLanceDBV1"),
app_main,
sourcedir=pathlib.Path("./markdown_files"),
)
No database to install — LanceDB writes to ./lancedb_data/, created on first run. Install the example's dependencies and grab a few .md files (the repo ships sample files, or drop your own into markdown_files/):
pip install -U "cocoindex[sentence_transformers,lancedb]" python-dotenv
Then build and update the index with the cocoindex CLI — catch-up (scan, sync, exit) or live (catch up, then keep watching):
# Catch-up run
cocoindex update main
# Live run: keep watching for file changes
cocoindex update -L main
Query with LanceDB's async search API, reusing the same embedder from the indexing flow so indexing and querying stay consistent.
async def query_once(conn, embedder, query_text: str, *, top_k: int = TOP_K) -> None:
query_vec = await embedder.embed(query_text)
table = await conn.open_table(TABLE_NAME)
search = await table.search(query_vec, vector_column_name="embedding")
results = await search.limit(top_k).to_list()
for r in results:
score = 1.0 - r["_distance"]
print(f"[{score:.3f}] {r['filename']}")
print(f" {r['text']}")
print("---")
table.search(...).limit(top_k) returns the nearest vectors; _distance is LanceDB's distance, which we turn into a similarity score. Run a search straight from the command line:
python main.py "what is self-attention?"
The most semantically similar chunks come back ranked — even when they share none of the words in your query.
The incremental story is identical to the base example: @coco.fn(memo=True) decides what to recompute (a file is skipped when its content and the function's code are both unchanged), and lancedb.mount_table_target decides what to write — each row's id is derived from its chunk's text, so it upserts only the rows that actually changed and deletes rows whose source is gone.
id and embedding, genuinely new chunks are embedded and inserted, and chunks that no longer exist are deleted.A catch-up run (cocoindex update main) does this once and exits; live mode (cocoindex update -L main) keeps watching and applies each change with low latency.
The full, runnable example is in the CocoIndex repo: examples/text_embedding_lancedb. For the chunk-and-embed walkthrough, see Semantic Search 101 — same flow, Postgres as the target.
If this was useful, star CocoIndex on GitHub and come say hi in our Discord community.