PDF Embedding (v1)

This example builds an embedding index from local PDF files. It converts PDFs to markdown, chunks the text, embeds each chunk, and stores the results in Postgres (pgvector). It also provides a simple query demo.
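The chunking step can be pictured with a small sketch. This is not the chunker the example uses; it is a simplified illustration that assumes fixed-size character windows with overlap, and the function name and parameters (chunk_text, chunk_size, overlap) are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; the real example may split
    on markdown structure rather than raw character counts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

Overlapping windows keep a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.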

If this example is helpful, we would appreciate a star ⭐ for CocoIndex on GitHub.

Prerequisite

A running Postgres with the pgvector extension. If you don't have one, start a local instance with the compose file in this repo:

docker compose -f ../../dev/postgres.yaml up -d

Run

Install dependencies:

pip install -e .

Set the database URL in the POSTGRES_URL environment variable (or put it in a .env file):

export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
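To check that the variable is set the way you expect before building the index, a small script along these lines can help. It is a sketch that uses only the Python standard library; describe_database_url and DEFAULT_URL are hypothetical names, and the example itself may read the variable differently.

```python
import os
from urllib.parse import urlsplit

# Same value as the export command above; used only as a fallback here.
DEFAULT_URL = "postgres://cocoindex:cocoindex@localhost/cocoindex"


def describe_database_url(url: str) -> dict:
    """Break a Postgres URL into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is the default Postgres port
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    url = os.environ.get("POSTGRES_URL", DEFAULT_URL)
    print(describe_database_url(url))
```

The password is deliberately left out of the returned dictionary so it does not end up in terminal output or logs.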

Build/update the index. Either of the following works:

cocoindex update main

python main.py

Query:

python main.py query "what is attention?"

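Note: this example does not create a vector index, so every query does a sequential scan over all stored embeddings. That is fine for a small set of PDFs but gets slower as the table grows.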
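To make the cost of a sequential scan concrete, here is a pure-Python sketch of what one amounts to: the query vector is compared against every stored vector, then the results are sorted. It assumes cosine similarity as the metric, which may not match the example's query, and the names (cosine_similarity, sequential_scan, rows) are hypothetical. The real comparison runs inside Postgres, not in Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sequential_scan(query, rows, top_k=3):
    """Score every (text, vector) row against the query and keep the best top_k.

    The work grows linearly with the number of rows, which is the cost
    a vector index is meant to avoid.
    """
    scored = [(cosine_similarity(query, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
    rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    for score, text in sequential_scan([1.0, 0.0], rows, top_k=2):
        print(f"{score:.3f} {text}")
```

For larger collections, pgvector supports approximate indexes (HNSW and IVFFlat) that avoid comparing against every row.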