Back to Cocoindex

Paper Metadata (Rust)

examples/rust/paper_metadata/README.md

1.0.82.5 KB
Original Source

Paper Metadata (Rust)

Rust port of the Python paper_metadata example.

Walks local PDFs, extracts first-page text + page count, LLM-extracts title/authors/abstract, embeds the title and abstract chunks, and stores everything across three Postgres tables — then serves similarity search.

Parallel to the Python example

ConcernPythonRust (this example)
Sourcelocalfs.walk_dir (**/*.pdf, live)cocoindex::fs::walk (**/*.pdf)
Per-file compute@coco.fn(memo=True) process_file#[cocoindex::function(memo)] process_file
PDF parsingpypdf (first-page text + page count)lopdf (first-page text + page count)
LLM extractionopenai chat completions (gpt-4o, JSON)OpenAI chat completions REST (gpt-4o, JSON mode)
ChunkingRecursiveSplitter + custom "abstract" langRecursiveChunker + CustomLanguageConfig ("abstract")
Embeddingssentence-transformers/all-MiniLM-L6-v2fastembed AllMiniLML6V2 (same model, 384-dim)
Targetspostgres.mount_table_targetpostgres::mount_table_target

Tables (schema coco_examples_v1):

  • paper_metadatafilename (pk), title, authors (jsonb), abstract, num_pages
  • author_papers — (author_name, filename) pk
  • metadata_embeddingsid (uuid pk), filename, location, text, embedding vector(384)

Incrementality: unchanged PDFs are memo-skipped; rows of a removed/edited PDF are reconciled away (the managed TableTargets delete orphaned rows).

Deviation from Python: embedding-row UUIDs are derived deterministically from (filename, location, text) via the SDK's UuidGenerator (vs Python's uuid.uuid4()), so re-runs are stable. Like Python, no pgvector index is created — the query demo does a sequential cosine scan.

Run

bash
export POSTGRES_URL=postgres://cocoindex:cocoindex@localhost/cocoindex   # pgvector-enabled
export OPENAI_API_KEY=...

cargo run -- index                       # walk ./papers -> extract -> embed -> Postgres
cargo run -- query "attention mechanism" # cosine similarity search