examples/docs_to_knowledge_graph/README.md
Turn a folder of Markdown documentation into a concept knowledge graph in
Neo4j. For each document an LLM (via
LiteLLM + instructor)
produces a short summary and a set of (subject, predicate, object) triples
about the concepts it covers — "concepts, not code" — and the triples become a
property graph.
This is the CocoIndex v1 port of the blog post Build a Knowledge Graph for Documents.
Please give CocoIndex a star on GitHub to support us and stay tuned for more updates. Thank you so much 🥥🤗.
- Document nodes: one per Markdown file, keyed by filename, with an LLM-generated title and summary
- Entity nodes: one per distinct concept named in a triple, keyed by value
- RELATIONSHIP: Entity → Entity, with the predicate stored on the edge
- MENTION: Document → Entity, recording which document named which concept

The flow watches the source folder and keeps the graph up to date incrementally.
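
To make the graph shape concrete, here is a minimal in-memory sketch of how triples could map onto Entity nodes and RELATIONSHIP / MENTION edges. It is not the code in main.py: the `Triple` dataclass, the `build_graph` helper, and the sample filenames are hypothetical, and it assumes a document mentions both the subject and the object of each of its triples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    object: str


def build_graph(docs):
    """docs maps filename -> list of Triple.

    Returns (entities, relationships, mentions) as sets, so repeated
    facts collapse into a single node or edge.
    """
    entities = set()       # Entity nodes, keyed by value
    relationships = set()  # (subject, predicate, object) Entity -> Entity edges
    mentions = set()       # (filename, entity value) Document -> Entity edges
    for filename, triples in docs.items():
        for t in triples:
            entities.update((t.subject, t.object))
            relationships.add((t.subject, t.predicate, t.object))
            # Assumption: both ends of a triple count as mentioned by the doc.
            mentions.add((filename, t.subject))
            mentions.add((filename, t.object))
    return entities, relationships, mentions


# Hypothetical sample data: two docs that share one fact.
docs = {
    "flows.md": [Triple("CocoIndex", "supports", "incremental processing")],
    "targets.md": [
        Triple("CocoIndex", "supports", "incremental processing"),
        Triple("CocoIndex", "exports to", "Neo4j"),
    ],
}
entities, relationships, mentions = build_graph(docs)
```

The shared fact yields one RELATIONSHIP edge but a separate MENTION edge per document, which is what lets the graph answer "which documents talk about this concept".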
The pipeline runs in two phases:
1. Per-document extraction: for each document, extract a DocumentSummary
(title + summary) and a list of relationship triples with LiteLLM +
instructor. The Document node is declared in this phase; the triples are
carried forward.
2. Graph pass: declare the Entity nodes
and the RELATIONSHIP / MENTION edges across all documents. Each distinct
triple is keyed by a stable hash, so re-asserting the same fact in another
doc maps to the same edge.

CocoIndex reconciles changes incrementally: re-running after editing one doc
only re-extracts that doc, and the graph pass only re-runs when the set of
triples changes. To collapse near-identical entity names (e.g. "CocoIndex" vs
"Cocoindex"), add an entity-resolution pass like the one in
meeting_notes_graph_neo4j.
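
The stable-key idea can be sketched with the standard library alone. This is an illustration, not the hashing used in main.py: SHA-256, the 16-character truncation, and the helper names `triple_key` and `canonical` are assumptions. `canonical` is only a naive stand-in for a real entity-resolution pass.

```python
import hashlib


def canonical(name: str) -> str:
    """Naive normalization: trim, collapse inner whitespace, ignore case."""
    return " ".join(name.split()).casefold()


def triple_key(subject: str, predicate: str, obj: str) -> str:
    """Deterministic key for a triple, identical across runs and documents."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join((subject, predicate, obj)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# The same fact asserted in two documents produces the same key.
key_a = triple_key("CocoIndex", "supports", "Neo4j")
key_b = triple_key("CocoIndex", "supports", "Neo4j")

# Without normalization, a casing variant produces a different key.
key_c = triple_key("Cocoindex", "supports", "Neo4j")

# Normalizing the entity names first collapses the two variants.
key_d = triple_key(canonical("CocoIndex"), "supports", canonical("Neo4j"))
key_e = triple_key(canonical("Cocoindex"), "supports", canonical("Neo4j"))
```

Python's built-in `hash()` would not work here because string hashes are randomized per process, so a content hash is the safer choice for keys that must survive re-runs.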
A running Neo4j 5.18+ instance:
docker run -d \
-p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/cocoindex \
--name cocoindex-neo4j \
neo4j:5.26-community
The browser UI is at http://localhost:7474; log in with neo4j /
cocoindex.
An LLM. Defaults to OpenAI (set OPENAI_API_KEY); set LLM_MODEL to any
model LiteLLM supports, e.g.
LLM_MODEL=ollama/llama3.2 to run the extraction locally with no API key.
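
As a rough sketch of how such a setting might be resolved at startup (main.py may do this differently), the snippet below reads LLM_MODEL and fails early when an OpenAI model is selected without a key. `DEFAULT_MODEL` and `resolve_model` are hypothetical names, and the default model string is a placeholder, not the example's actual default.

```python
import os

# Hypothetical default; the real default model is defined by the example itself.
DEFAULT_MODEL = "openai/gpt-4o-mini"


def resolve_model(env=None):
    """Return the model string to use, checking that required keys are set."""
    env = os.environ if env is None else env
    model = env.get("LLM_MODEL") or DEFAULT_MODEL
    # LiteLLM model strings are commonly "provider/model"; a bare name is
    # treated as OpenAI here (an assumption of this sketch).
    provider = model.split("/", 1)[0] if "/" in model else "openai"
    if provider == "openai" and not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is required for OpenAI models")
    return model
```

Passing a plain dict as `env` keeps the function easy to test without touching the real environment.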
Copy .env.example to .env and fill in the blanks:
cp .env.example .env
set -a && source .env && set +a
Install dependencies:
uv pip install -e .
This example ships a small markdown_files/ folder of sample concept docs so it
runs out of the box. Build/update the graph:
cocoindex update main
To index your own docs, drop .md / .mdx files into markdown_files/ (or
point sourcedir in main.py at another directory — e.g. CocoIndex's own
docs/) and re-run.
Open Neo4j Browser at http://localhost:7474, log in, and run Cypher queries:
// Everything
MATCH p=()-->() RETURN p LIMIT 200
// Concept-to-concept relationships
MATCH (a:Entity)-[r:RELATIONSHIP]->(b:Entity)
RETURN a.value, r.predicate, b.value
// Which documents mention which concepts
MATCH (d:Document)-[:MENTION]->(e:Entity)
RETURN d.filename, d.title, e.value
// Concepts mentioned in the most documents
MATCH (d:Document)-[:MENTION]->(e:Entity)
RETURN e.value, count(DISTINCT d) AS docs
ORDER BY docs DESC LIMIT 10
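
The same queries can be run from Python instead of the browser. The sketch below wraps the "most mentioned concepts" query using the official `neo4j` driver; it assumes that package is installed separately (`pip install neo4j`) and that the Bolt port and credentials match the Docker command above. The `top_concepts` and `main` helpers are hypothetical and not part of this example.

```python
TOP_CONCEPTS = """
MATCH (d:Document)-[:MENTION]->(e:Entity)
RETURN e.value AS concept, count(DISTINCT d) AS docs
ORDER BY docs DESC LIMIT $limit
"""


def top_concepts(session, limit=10):
    """Return [(concept, doc_count), ...] for the most widely mentioned concepts."""
    # Keyword arguments to session.run() are sent as Cypher parameters.
    result = session.run(TOP_CONCEPTS, limit=limit)
    return [(record["concept"], record["docs"]) for record in result]


def main():
    # Imported here so the helper above stays usable without the driver.
    from neo4j import GraphDatabase

    # Assumed connection details, taken from the docker run command above.
    uri, auth = "bolt://localhost:7687", ("neo4j", "cocoindex")
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            for concept, docs in top_concepts(session):
                print(f"{docs:>3}  {concept}")
```

Call `main()` once the graph has been built; `top_concepts` only needs an object with a `run` method, so it can also be exercised against a stub in tests.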
To wipe the graph between runs:
MATCH (n) DETACH DELETE n