Code Embedding with LanceDB (Rust)

examples/rust/code_embedding_lancedb/README.md

Rust port of the Python code_embedding_lancedb example.

Walks a source tree, detects each file's language, chunks it (tree-sitter-aware), embeds the chunks, and stores them in LanceDB — then serves vector search.
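As a rough illustration of what the final "vector search" step means, the sketch below ranks stored chunks by cosine similarity to a query embedding. It is a brute-force, standard-library-only stand-in, not the LanceDB query path this example uses. `ChunkRow`, `cosine_similarity`, and `top_k` are hypothetical names, and the choice of cosine similarity as the metric is an assumption.

```rust
/// Hypothetical stored row: a chunk of source text plus its embedding.
struct ChunkRow {
    file: String,
    text: String,
    embedding: Vec<f32>,
}

/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Scores every row against the query and returns the `k` best, best first.
fn top_k<'a>(query: &[f32], rows: &'a [ChunkRow], k: usize) -> Vec<(&'a ChunkRow, f32)> {
    let mut scored: Vec<(&ChunkRow, f32)> = rows
        .iter()
        .map(|row| (row, cosine_similarity(query, &row.embedding)))
        .collect();
    // Sort by descending score; total_cmp gives a total order on f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A real vector store avoids scanning every row by using an index, but the ranking idea is the same: embed the query with the same model as the chunks, then return the nearest rows.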

Same pipeline as code_embedding, but the target is the native cocoindex::lancedb connector instead of Postgres/pgvector.

Parallel to the Python example

| Concern | Python | Rust (this example) |
| --- | --- | --- |
| Source | localfs.walk_dir | cocoindex::fs::walk |
| Per-file compute | @coco.fn(memo=True) process_file | #[cocoindex::function(memo)] process_file |
| Language detect | detect_code_language | cocoindex_ops_text::prog_langs::detect_language |
| Chunking | RecursiveSplitter (1000/300/300) | cocoindex_ops_text RecursiveChunker (1000/300/300) |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 | fastembed AllMiniLML6V2 (same model, 384-dim) |
| Target | lancedb.mount_table_target | cocoindex::lancedb::mount_table_target |
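To give a feel for the chunking parameters, here is a minimal sliding-window sketch with overlap. It is not the tree-sitter-aware RecursiveChunker: it cuts purely by character count and ignores syntax. Reading "1000/300/300" as a chunk size of 1000 with an overlap of 300 is an assumption, and the remaining 300 is not modeled here. `chunk_with_overlap` is a hypothetical helper.

```rust
/// Splits `text` into windows of at most `chunk_size` characters.
/// Each window starts `chunk_size - overlap` characters after the previous one,
/// so consecutive chunks share `overlap` characters of context.
fn chunk_with_overlap(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(
        chunk_size > 0 && overlap < chunk_size,
        "overlap must be smaller than chunk_size"
    );
    // Work on chars rather than bytes so multi-byte UTF-8 is never split.
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

The overlap means a function or comment that straddles a cut still appears whole in at least one chunk more often than it would with hard cuts, at the cost of storing some text twice.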

Incrementality: unchanged files are memo-skipped; chunks of a removed/edited file are reconciled away by the managed LanceDB TableTarget.
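The incremental behaviour described above can be pictured with a small standard-library sketch: fingerprint each file, skip files whose fingerprint is unchanged, reprocess new or edited ones, and delete rows for files that disappeared. This only illustrates the idea. The actual memoization and reconciliation are done by cocoindex and the managed TableTarget, and `fingerprint`, `Plan`, and `plan_update` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Content fingerprint used to decide whether a file changed.
fn fingerprint(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// What an incremental run would do, grouped by file path.
#[derive(Debug, Default, PartialEq)]
struct Plan {
    reprocess: Vec<String>, // new or edited files: re-chunk, re-embed, replace rows
    skip: Vec<String>,      // unchanged files: reuse the memoized result
    delete: Vec<String>,    // files that vanished: remove their rows
}

/// Compares fingerprints from the previous run with the current file contents.
fn plan_update(previous: &HashMap<String, u64>, current: &HashMap<String, String>) -> Plan {
    let mut plan = Plan::default();
    for (path, content) in current {
        match previous.get(path) {
            Some(old) if *old == fingerprint(content) => plan.skip.push(path.clone()),
            _ => plan.reprocess.push(path.clone()),
        }
    }
    for path in previous.keys() {
        if !current.contains_key(path) {
            plan.delete.push(path.clone());
        }
    }
    // HashMap iteration order is unspecified, so sort for stable output.
    plan.reprocess.sort();
    plan.skip.sort();
    plan.delete.sort();
    plan
}
```

For an edited file, "reprocess" implies replacing all of that file's old chunks, which is why stale chunks do not linger after an edit.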

Build dependency: LanceDB pulls in crates that compile .proto files, so a protoc (protobuf) compiler must be on PATH (brew install protobuf / apt-get install protobuf-compiler).

Run

```bash
cargo run -- index [SOURCE_DIR]    # default: the repository root
cargo run -- query "your query"    # LanceDB vector search

# LanceDB data dir defaults to ./lancedb_data (override with LANCEDB_URI)
```
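The two subcommands and the LANCEDB_URI override could be wired up roughly as follows. This is a sketch using only the standard library, not the example's actual `main`. `Command`, `parse_args`, and `lancedb_uri` are hypothetical names, and the default source directory is passed in as a parameter because how the example locates the repository root is not described here.

```rust
/// The two subcommands shown in the usage lines above.
#[derive(Debug, PartialEq)]
enum Command {
    Index { source_dir: String },
    Query { text: String },
}

/// Parses the arguments that follow `cargo run --`.
/// `default_source` stands in for "the repository root".
fn parse_args(args: &[String], default_source: &str) -> Result<Command, String> {
    let parts: Vec<&str> = args.iter().map(String::as_str).collect();
    match parts.as_slice() {
        ["index"] => Ok(Command::Index { source_dir: default_source.to_string() }),
        ["index", dir] => Ok(Command::Index { source_dir: dir.to_string() }),
        ["query", text] => Ok(Command::Query { text: text.to_string() }),
        _ => Err("usage: index [SOURCE_DIR] | query \"your query\"".to_string()),
    }
}

/// Resolves the LanceDB location: the override if present, else ./lancedb_data.
/// Takes the value as an argument so it is easy to test; a caller would pass
/// `std::env::var("LANCEDB_URI").ok()`.
fn lancedb_uri(env_value: Option<String>) -> String {
    env_value.unwrap_or_else(|| "./lancedb_data".to_string())
}
```

A caller would typically feed `std::env::args().skip(1).collect::<Vec<_>>()` into `parse_args`, skipping the program name.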