PDF Embedding (Rust)

examples/rust/pdf_embedding/README.md

Rust port of the Python pdf_embedding example.

Walks local PDFs, extracts their text, chunks it, embeds the chunks, and stores them in Postgres/pgvector — then serves similarity search.

Parallel to the Python example

| Concern | Python | Rust (this example) |
| --- | --- | --- |
| Source | localfs.walk_dir (**/*.pdf) | cocoindex::fs::walk (**/*.pdf) |
| PDF → text | docling (PDF → Markdown, ML pipeline) | lopdf text extraction |
| Per-file compute | @coco.fn(memo=True) process_file | #[cocoindex::function(memo)] process_file |
| Chunking | RecursiveSplitter (markdown, 2000/500) | cocoindex_ops_text RecursiveChunker (markdown, 2000/500) |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 | fastembed AllMiniLML6V2 (same model, 384-dim) |
| Target | postgres.mount_table_target | postgres::mount_table_target |
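
To make the chunking row concrete, here is a minimal sketch of overlapping chunks with character offsets. It assumes that 2000/500 means a chunk size of 2000 and an overlap of 500 characters, and it uses a plain sliding window rather than the recursive, Markdown-aware splitting that RecursiveChunker performs. `Chunk` and `chunk_with_overlap` are hypothetical names, not part of cocoindex_ops_text.

```rust
/// One chunk of text plus its character offsets in the source document.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub start: usize, // inclusive character offset
    pub end: usize,   // exclusive character offset
    pub text: String,
}

/// Split `text` into windows of at most `size` characters, where each
/// window shares `overlap` characters with the previous one.
/// This is a simple sliding window, not a recursive splitter.
pub fn chunk_with_overlap(text: &str, size: usize, overlap: usize) -> Vec<Chunk> {
    assert!(size > 0 && overlap < size, "need size > 0 and overlap < size");
    // Work on chars so that offsets never fall inside a UTF-8 sequence.
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(Chunk {
            start,
            end,
            text: chars[start..end].iter().collect(),
        });
        if end == chars.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With `chunk_with_overlap(text, 2000, 500)`, a 4500-character document yields three chunks starting at offsets 0, 1500, and 3000. The `start` and `end` fields play the same role as the chunk_start and chunk_end columns of the target table.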

Deviation from Python: Python converts PDFs to Markdown with docling (a heavy ML document-understanding pipeline) and runs it on a coco.GPU runner. There is no Rust equivalent, so this port extracts plain text with lopdf (the same Rust-native PDF approach as paper_metadata). Extraction quality varies by PDF, but everything downstream (chunking, embeddings, target, query) mirrors Python. Like Python, this example creates no pgvector index, so each query runs as a sequential cosine scan over the table.
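
For intuition about what a sequential cosine scan computes, the following in-memory sketch scores every stored vector against the query and keeps the best `k`. In the example itself this work happens inside Postgres; `cosine_similarity` and `top_k` are hypothetical helpers written only for illustration.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 when either vector has zero length (norm).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every row against `query` (no index, cost linear in the number
/// of rows) and return the `k` best matches as (row index, score),
/// ordered from most to least similar.
pub fn top_k(query: &[f32], rows: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = rows
        .iter()
        .enumerate()
        .map(|(i, row)| (i, cosine_similarity(query, row)))
        .collect();
    // Sort by score, highest first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

The cost grows linearly with the number of stored chunks, which is usually acceptable for a small example corpus. A larger table would normally want an approximate index.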

Target table is coco_examples.pdf_embeddings (id pk, filename, chunk_start, chunk_end, text, embedding vector(384)).
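
A similarity query against this table could look roughly like the sketch below. It relies on pgvector's `<=>` cosine-distance operator and the `[x,y,...]` text form of a vector. The helper names and the exact SQL are assumptions for illustration and may differ from what the example actually sends.

```rust
/// Render an embedding in pgvector's text form, for example "[0.5,-1,2.25]".
/// Hypothetical helper, not part of the example's code.
pub fn to_pgvector_literal(embedding: &[f32]) -> String {
    let parts: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    format!("[{}]", parts.join(","))
}

/// Build a top-k cosine query for `table`.
/// `<=>` returns cosine distance, so `1 - distance` is the similarity.
/// $1 is the query vector literal and $2 is the row limit.
pub fn similarity_query_sql(table: &str) -> String {
    format!(
        "SELECT filename, chunk_start, chunk_end, text, \
         1 - (embedding <=> $1::vector) AS score \
         FROM {} ORDER BY embedding <=> $1::vector LIMIT $2",
        table
    )
}
```

Ordering by distance ascending returns the closest chunks first. Without an index on the embedding column, Postgres evaluates the distance for every row.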

Run

bash
export POSTGRES_URL=postgres://cocoindex:cocoindex@localhost/cocoindex   # pgvector-enabled

cargo run -- index                 # walk ./pdf_files -> extract -> chunk -> embed -> Postgres
cargo run -- query "your query"    # cosine similarity search
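
The two subcommands above could be dispatched along these lines. This is a standard-library-only sketch: `Command` and `parse_command` are hypothetical names, and the example's real argument handling may be different.

```rust
/// The two modes shown in the Run section. Hypothetical type.
#[derive(Debug, PartialEq)]
pub enum Command {
    Index,
    Query(String),
}

/// Parse the arguments that follow `cargo run --`.
/// A caller would typically pass `std::env::args().skip(1)` collected
/// into a Vec<String> and borrowed as a slice of &str.
pub fn parse_command(args: &[&str]) -> Result<Command, String> {
    match args {
        ["index"] => Ok(Command::Index),
        // Require exactly one non-empty query string.
        ["query", text] if !text.trim().is_empty() => Ok(Command::Query(text.to_string())),
        ["query", ..] => Err("usage: query \"<text>\"".to_string()),
        _ => Err("usage: index | query \"<text>\"".to_string()),
    }
}
```

Quoting the query on the command line, as in the run example, keeps a multi-word query as a single argument.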