examples/rust/paper_metadata/README.md
Rust port of the Python paper_metadata example.
Walks local PDFs, extracts first-page text + page count, LLM-extracts title/authors/abstract, embeds the title and abstract chunks, and stores everything across three Postgres tables — then serves similarity search.
| Concern | Python | Rust (this example) |
|---|---|---|
| Source | localfs.walk_dir (**/*.pdf, live) | cocoindex::fs::walk (**/*.pdf) |
| Per-file compute | @coco.fn(memo=True) process_file | #[cocoindex::function(memo)] process_file |
| PDF parsing | pypdf (first-page text + page count) | lopdf (first-page text + page count) |
| LLM extraction | openai chat completions (gpt-4o, JSON) | OpenAI chat completions REST (gpt-4o, JSON mode) |
| Chunking | RecursiveSplitter + custom "abstract" lang | RecursiveChunker + CustomLanguageConfig ("abstract") |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 | fastembed AllMiniLML6V2 (same model, 384-dim) |
| Targets | 3× postgres.mount_table_target | 3× postgres::mount_table_target |
Tables (schema coco_examples_v1):

- paper_metadata: filename (pk), title, authors (jsonb), abstract, num_pages
- author_papers: (author_name, filename) pk
- metadata_embeddings: id (uuid pk), filename, location, text, embedding vector(384)

Incrementality: unchanged PDFs are memo-skipped; rows of a removed/edited PDF
are reconciled away (the managed TableTargets delete orphaned rows).
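
The incremental behaviour described above can be illustrated with a small standalone sketch. It does not use the cocoindex SDK: `Action`, `fingerprint`, and `plan` are hypothetical names invented for this illustration, and the content fingerprint is an assumption about how "unchanged" might be detected. The idea is only to show the three outcomes per file: skip, reprocess, or delete orphaned rows.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Decision for one file in a run (hypothetical type, not part of the SDK).
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// New or edited file: recompute and upsert its rows.
    Process(String),
    /// Unchanged file: reuse the memoized result.
    Skip(String),
    /// File no longer present: its rows are orphaned and get deleted.
    Delete(String),
}

/// Content fingerprint. Only stable within one build of the program,
/// which is enough for this in-memory illustration.
fn fingerprint(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Compare the previous run's fingerprints with the files seen now.
/// Returns the actions to take and the fingerprint map for the next run.
pub fn plan(
    previous: &HashMap<String, u64>,
    current: &[(String, Vec<u8>)],
) -> (Vec<Action>, HashMap<String, u64>) {
    let mut next = HashMap::new();
    let mut actions = Vec::new();

    for (name, bytes) in current {
        let fp = fingerprint(bytes);
        if previous.get(name) == Some(&fp) {
            actions.push(Action::Skip(name.clone()));
        } else {
            actions.push(Action::Process(name.clone()));
        }
        next.insert(name.clone(), fp);
    }

    // Anything tracked before but missing now has orphaned rows.
    let mut removed: Vec<&String> = previous
        .keys()
        .filter(|name| !next.contains_key(*name))
        .collect();
    removed.sort(); // deterministic order for the example
    for name in removed {
        actions.push(Action::Delete(name.clone()));
    }

    (actions, next)
}
```

Calling `plan` with an empty map marks every file as `Process`; feeding the returned map into the next call yields `Skip` for untouched files and `Delete` for files that disappeared.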
Deviation from Python: embedding-row UUIDs are derived deterministically
from (filename, location, text) via the SDK's UuidGenerator (vs Python's
uuid.uuid4()), so re-runs are stable. Like Python, no pgvector index is
created — the query demo does a sequential cosine scan.
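
To show why content-derived IDs make re-runs stable, here is a minimal standalone sketch. It is not the SDK's UuidGenerator, whose algorithm is not described here. It assumes a 128-bit FNV-1a hash over the three fields and formats the result in the 8-4-4-4-12 hex layout without setting UUID version or variant bits. `stable_row_id` is a hypothetical name.

```rust
// 128-bit FNV-1a parameters.
const FNV_OFFSET: u128 = 0x6c62272e07bb014262b821756295c58d;
const FNV_PRIME: u128 = 0x0000000001000000000000000000013b;

/// Fold `bytes` into a running FNV-1a state.
fn fnv1a_128(state: u128, bytes: &[u8]) -> u128 {
    bytes
        .iter()
        .fold(state, |h, b| (h ^ u128::from(*b)).wrapping_mul(FNV_PRIME))
}

/// Derive a UUID-shaped identifier from (filename, location, text).
/// Illustration only: same inputs always give the same string.
pub fn stable_row_id(filename: &str, location: &str, text: &str) -> String {
    let mut h = FNV_OFFSET;
    for field in [filename, location, text] {
        // Length prefix so ("ab", "c") and ("a", "bc") hash differently.
        h = fnv1a_128(h, &(field.len() as u64).to_le_bytes());
        h = fnv1a_128(h, field.as_bytes());
    }
    let hex = format!("{:032x}", h);
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}
```

Because the ID depends only on its inputs, re-indexing an unchanged chunk produces the same primary key and the row is updated in place, whereas a random `uuid4()`-style ID would create a fresh row on every run.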

    export POSTGRES_URL=postgres://cocoindex:cocoindex@localhost/cocoindex  # pgvector-enabled
    export OPENAI_API_KEY=...
    cargo run -- index                        # walk ./papers -> extract -> embed -> Postgres
    cargo run -- query "attention mechanism"  # cosine similarity search
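
For intuition about what the query step computes, the following is a rough in-memory sketch of a sequential cosine scan. In the example itself the scan runs inside Postgres over the metadata_embeddings table; the functions `cosine` and `top_k` below are hypothetical and only mirror the idea of scoring every stored embedding against the query vector and keeping the best matches.

```rust
/// Cosine similarity of two vectors; returns 0.0 if either has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every row against the query (no index) and keep the `k` best.
/// Each row is (text, embedding); results are sorted by descending score.
pub fn top_k<'a>(
    query: &[f32],
    rows: &'a [(String, Vec<f32>)],
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = rows
        .iter()
        .map(|(text, embedding)| (text.as_str(), cosine(query, embedding)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A scan like this touches every row, so its cost grows linearly with the number of stored chunks. That is fine for a demo-sized corpus, which matches the choice above not to create a pgvector index.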