examples/rust/pdf_embedding/README.md
Rust port of the Python pdf_embedding example.
Walks local PDFs, extracts their text, chunks it, embeds the chunks, stores them in Postgres/pgvector, and then serves similarity search over the stored embeddings.
| Concern | Python | Rust (this example) |
|---|---|---|
| Source | `localfs.walk_dir` (`**/*.pdf`) | `cocoindex::fs::walk` (`**/*.pdf`) |
| PDF → text | docling (PDF → Markdown, ML pipeline) | `lopdf` text extraction |
| Per-file compute | `@coco.fn(memo=True)` `process_file` | `#[cocoindex::function(memo)]` `process_file` |
| Chunking | `RecursiveSplitter` (markdown, 2000/500) | `cocoindex_ops_text` `RecursiveChunker` (markdown, 2000/500) |
| Embeddings | `sentence-transformers/all-MiniLM-L6-v2` | `fastembed` `AllMiniLML6V2` (same model, 384-dim) |
| Target | `postgres.mount_table_target` | `postgres::mount_table_target` |
Deviation from Python: the Python example converts PDFs to Markdown with docling (a
heavy ML document-understanding pipeline) and runs it on a `coco.GPU` runner.
There is no Rust equivalent, so this port extracts plain text with `lopdf` (the
same Rust-native PDF approach as `paper_metadata`). Extraction quality varies by
PDF, but everything downstream (chunking, embeddings, target, query) mirrors
Python. As in the Python example, no pgvector index is created, so queries run a sequential cosine scan.
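To make the sequential cosine scan concrete, here is a minimal in-memory sketch of what such a scan computes: score every stored vector against the query, then keep the best `k`. It is an illustration only. In this example the scan happens inside Postgres, and the names `cosine_similarity` and `top_k` are hypothetical helpers, not part of this crate.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Scores every stored embedding against the query (no index involved) and
/// returns the `k` best `(row_index, similarity)` pairs, best match first.
fn top_k(query: &[f32], rows: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = rows
        .iter()
        .enumerate()
        .map(|(i, row)| (i, cosine_similarity(query, row)))
        .collect();
    // Sort by similarity, descending.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

The cost grows linearly with the number of stored chunks, which is usually acceptable for a small local PDF folder. For larger collections, a pgvector index would be the natural next step.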
The target table is `coco_examples.pdf_embeddings` (`id` primary key, `filename`,
`chunk_start`, `chunk_end`, `text`, `embedding vector(384)`).
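As a rough illustration of how the `chunk_start`, `chunk_end`, and `text` columns relate, the sketch below cuts text into overlapping windows and records each window's offsets. It assumes that 2000/500 means a chunk size of 2000 with an overlap of 500, and that offsets count characters with an exclusive end; neither is confirmed here. It is a plain fixed-window splitter, not the recursive, markdown-aware chunker the example actually uses, and `Chunk` and `sliding_chunks` are hypothetical names.

```rust
/// One chunk's data; field names follow the table columns above.
#[derive(Debug, PartialEq)]
struct Chunk {
    chunk_start: usize, // offset of the first character (inclusive)
    chunk_end: usize,   // offset one past the last character (exclusive)
    text: String,
}

/// Splits `text` into windows of at most `size` characters, where each
/// window starts `size - overlap` characters after the previous one.
/// Panics if `overlap >= size`, since the window would never advance.
fn sliding_chunks(text: &str, size: usize, overlap: usize) -> Vec<Chunk> {
    assert!(overlap < size, "overlap must be smaller than size");
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(Chunk {
            chunk_start: start,
            chunk_end: end,
            text: chars[start..end].iter().collect(),
        });
        if end == chars.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

Under these assumptions, a 5000-character document with `sliding_chunks(text, 2000, 500)` yields three chunks covering 0..2000, 1500..3500, and 3000..5000, so neighbouring chunks share 500 characters of context.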

    export POSTGRES_URL=postgres://cocoindex:cocoindex@localhost/cocoindex  # pgvector-enabled
    cargo run -- index                # walk ./pdf_files -> extract -> chunk -> embed -> Postgres
    cargo run -- query "your query"   # cosine similarity search
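For orientation, a cosine similarity search against this table could be expressed with pgvector's `<=>` operator, which returns cosine distance, so `1 - distance` gives the similarity. The sketch below only builds the SQL text. The helper name `similarity_query` and the selected columns are assumptions for illustration, and the query the example actually issues may differ.

```rust
/// Builds a top-`k` similarity query against the target table.
/// `$1` is a placeholder for the query embedding, to be bound by whatever
/// Postgres client is in use (binding is not shown here).
fn similarity_query(k: usize) -> String {
    format!(
        "SELECT filename, text, 1 - (embedding <=> $1) AS score \
         FROM coco_examples.pdf_embeddings \
         ORDER BY embedding <=> $1 \
         LIMIT {k}"
    )
}
```

Ordering by the raw distance (ascending) returns the closest chunks first. With no index on `embedding`, Postgres evaluates the distance for every row.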