examples/docs_to_knowledge_graph/README.md
Point it at a docs folder, and it re-extracts only the doc you edited, then reconciles the graph.
</p> <p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/docs-to-knowledge-graph/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center"> </div>

Documentation is a web of concepts pretending to be a list of files: "incremental processing relies on change detection", "a target receives declared target states". Every page asserts relationships, but they're locked in prose. You declare the transformation in native Python with your own types, `target_state = transformation(source_state)`, and the heavy lifting (incremental processing, change tracking, managed graph targets) runs in a Rust engine underneath. Editing one doc re-extracts one doc, and the graph reconciles itself: no orphaned nodes, no stale edges, no cleanup scripts.
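To make `target_state = transformation(source_state)` concrete, here is a minimal pure-Python sketch of the idea. It is not CocoIndex's engine or API: `GraphState`, `transformation`, and `reconcile` are hypothetical names invented for illustration, and the sample triples are made up. The point is only that when the whole desired graph is re-declared from the current docs, the writes needed fall out of a set difference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphState:
    """Hypothetical stand-in for a declared target state."""
    nodes: frozenset[str]
    edges: frozenset[tuple[str, str, str]]  # (subject, predicate, object)


def transformation(source_state: dict[str, list[tuple[str, str, str]]]) -> GraphState:
    """Derive the full desired graph from the current docs (filename -> triples)."""
    edges = frozenset(t for triples in source_state.values() for t in triples)
    nodes = frozenset(name for s, _, o in edges for name in (s, o))
    return GraphState(nodes=nodes, edges=edges)


def reconcile(old: GraphState, new: GraphState) -> dict[str, set]:
    """Diff two declared states into the writes a graph store would need."""
    return {
        "add_nodes": set(new.nodes - old.nodes),
        "remove_nodes": set(old.nodes - new.nodes),
        "add_edges": set(new.edges - old.edges),
        "remove_edges": set(old.edges - new.edges),
    }


before = transformation({
    "a.md": [("incremental processing", "relies on", "change detection")],
    "b.md": [("target", "receives", "target states")],
})
# Edit b.md so that it now asserts a different fact.
after = transformation({
    "a.md": [("incremental processing", "relies on", "change detection")],
    "b.md": [("target", "receives", "change detection")],
})
plan = reconcile(before, after)
```

In this toy run the edit to `b.md` removes the `target states` node because no remaining triple supports it, which is the "no orphaned nodes" behavior in miniature.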
Two node types, two relationship types, and the concept map falls out of the graph:
- Document nodes: one per Markdown file, keyed by filename, with an LLM-generated title and summary.
- Entity nodes: one per distinct concept named in any triple, keyed by the concept name and shared across documents.
- `RELATIONSHIP` edges: Entity → Entity, with the predicate stored as an edge property.
- `MENTION` edges: Document → Entity, recording which document named which concept.

Because entities are shared across documents, the pipeline runs in two phases. Read it top-to-bottom in `main.py`:
```python
@coco.fn(memo=True)  # Phase 1: per doc, declare the Document node and carry triples forward
async def process_file(file: localfs.File, document_table: neo4j.TableTarget[Document]) -> DocTriples:
    content = await file.read_text()
    filename = file.file_path.path.as_posix()
    summary = await extract_summary(content)
    document_table.declare_record(row=Document(filename=filename, title=summary.title, summary=summary.summary))
    triples = await extract_relationships(content)
    return DocTriples(filename=filename, triples=triples)


@coco.fn  # Phase 2: one pass owns the shared Entity nodes and both edge types
async def build_graph(docs, entity_table, relationship_rel, mention_rel) -> None:
    for doc in docs:
        for t in doc.triples:
            rel_id = await generate_id((t.subject, t.predicate, t.object))  # stable edge identity
            relationship_rel.declare_relation(from_id=t.subject, to_id=t.object,
                                              record=Relationship(id=rel_id, predicate=t.predicate))
            mention_rel.declare_relation(from_id=doc.filename, to_id=t.subject)
            ...
    for value in entities: entity_table.declare_record(row=Entity(value=value))
```
Extraction uses instructor over LiteLLM with your own Pydantic models. `MENTION` carries no payload, so the Neo4j connector derives its identity from the (document, entity) endpoints: one edge per pair.
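The two identity rules can be sketched in plain Python. This is an illustration under assumptions, not the connector's code: `stable_triple_id` is a hypothetical stand-in for `generate_id` (whose actual hashing scheme is not shown in this README), the dict and set stand in for the Neo4j targets, and the sample assertions are made up. Only the subject mention is modeled, since that is the only one visible in the excerpt.

```python
import hashlib


def stable_triple_id(subject: str, predicate: str, obj: str) -> str:
    """Hypothetical stand-in for generate_id: hash the triple into a fixed key."""
    # A unit-separator character keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = "\x1f".join((subject, predicate, obj)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# (filename, subject, predicate, object); the first fact is asserted by two docs.
assertions = [
    ("intro.md", "incremental processing", "relies on", "change detection"),
    ("guide.md", "incremental processing", "relies on", "change detection"),
    ("guide.md", "incremental processing", "uses", "memoization"),
    ("guide.md", "target", "receives", "target states"),
]

relationships: dict[str, tuple[str, str, str]] = {}  # keyed by stable triple id
mentions: set[tuple[str, str]] = set()               # keyed by (document, entity)

for filename, s, p, o in assertions:
    # Re-asserting the same triple lands on the same key, so nothing is duplicated.
    relationships[stable_triple_id(s, p, o)] = (s, p, o)
    # A payload-free edge is identified by its endpoints alone.
    mentions.add((filename, s))
```

Four assertions collapse to three relationship edges, and `guide.md` gets a single mention edge to `incremental processing` even though it names that concept twice.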
<a href="https://cocoindex.io/docs/examples/docs-to-knowledge-graph/">Step-by-step walkthrough</a> with the graph schema, the two-phase flow, the extraction models, and exactly what happens on each kind of change.
</p>

- "Incremental Processing" is one Entity node every doc can point at, not a copy per doc.
- `@coco.fn(memo=True)` caches each LLM extraction by content. Edit one doc and only that doc re-extracts; the graph then diffs, adding new nodes/edges and removing ones no longer supported anywhere. A no-change re-run makes zero LLM calls.
- `generate_id` hashes each triple, so the same (subject, predicate, object) always maps to one edge. Re-asserting a fact in another doc is a no-op, not a duplicate.
- Set `LLM_MODEL` for any LiteLLM provider (OpenAI, Ollama, …). No DSL.
- `LLM_MODEL` is declared with `detect_change=True`, so swapping the model re-extracts the whole corpus against it with no cache to clear by hand.

1. Start Neo4j:
```sh
docker run -d -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/cocoindex --name cocoindex-neo4j neo4j:5.26-community
```
2. Configure & install:
```sh
cp .env.example .env   # set OPENAI_API_KEY (or LLM_MODEL=ollama/llama3.2 to run locally)
pip install -e .
```
3. Build the graph. The example ships a `markdown_files/` folder of sample docs, so it runs out of the box:

```sh
cocoindex update main
```

To graph your own docs, drop `.md` / `.mdx` files into `markdown_files/` (or point `sourcedir` at your real docs folder) and re-run.
4. Explore the graph — open Neo4j Browser (neo4j / cocoindex) and ask:
```cypher
// How concepts relate
MATCH (a:Entity)-[r:RELATIONSHIP]->(b:Entity)
RETURN a.value, r.predicate, b.value;

// Concepts mentioned in the most documents
MATCH (d:Document)-[:MENTION]->(e:Entity)
RETURN e.value, count(DISTINCT d) AS docs
ORDER BY docs DESC LIMIT 10;
```
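Re-running `cocoindex update main` after an edit is where content-keyed memoization pays off. The sketch below is a rough model of that behavior, not how `@coco.fn(memo=True)` is implemented: `ContentMemo` and `fake_extract` are hypothetical, the docs are made up, and folding the model name into the cache key is an assumption standing in for `detect_change=True`.

```python
import hashlib
from typing import Callable


class ContentMemo:
    """Hypothetical cache keyed by (model name, hash of the doc content)."""

    def __init__(self, extract: Callable[[str, str], list[str]]) -> None:
        self._extract = extract
        self._cache: dict[tuple[str, str], list[str]] = {}
        self.calls = 0  # how many times the expensive extractor actually ran

    def run(self, model: str, content: str) -> list[str]:
        key = (model, hashlib.sha256(content.encode("utf-8")).hexdigest())
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._extract(model, content)
        return self._cache[key]


def fake_extract(model: str, content: str) -> list[str]:
    """Placeholder for the LLM call: treats capitalized words as concepts."""
    return sorted({w for w in content.split() if w[:1].isupper()})


memo = ContentMemo(fake_extract)
docs = {"a.md": "Neo4j stores the graph", "b.md": "LiteLLM routes the call"}

for text in docs.values():  # first build: one extraction per doc
    memo.run("model-a", text)
first_build = memo.calls

for text in docs.values():  # unchanged re-run: every lookup is a cache hit
    memo.run("model-a", text)
unchanged_rerun = memo.calls - first_build

docs["b.md"] = "LiteLLM routes the call to Ollama"
seen = memo.calls
for text in docs.values():  # one edited doc: only that doc is re-extracted
    memo.run("model-a", text)
after_edit = memo.calls - seen

seen = memo.calls
for text in docs.values():  # different model: every key changes
    memo.run("model-b", text)
after_model_swap = memo.calls - seen
```

Because the key is derived from the content rather than the filename or a timestamp, touching a file without changing it costs nothing in this model.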
The LLM will sometimes name the same concept two ways ("CocoIndex" vs "Cocoindex"). The meeting notes graph example adds an embedding + LLM entity-resolution pass that collapses near-duplicates — it drops into this pipeline between the two phases.
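For the specific "CocoIndex" vs "Cocoindex" case, a much cruder stand-in than embedding + LLM resolution is to fold case and whitespace before declaring entities. The sketch below shows only that simpler technique; `canonicalize` is a hypothetical helper, not part of either example, and it will not catch synonyms or abbreviations the way a real resolution pass would.

```python
def canonicalize(names: list[str]) -> dict[str, str]:
    """Map each name to the first spelling seen for its case-folded form."""
    canonical: dict[str, str] = {}  # folded key -> chosen spelling
    mapping: dict[str, str] = {}    # original name -> chosen spelling
    for name in names:
        # Collapse runs of whitespace, trim the ends, and ignore case.
        key = " ".join(name.split()).casefold()
        mapping[name] = canonical.setdefault(key, name)
    return mapping


mapping = canonicalize(["CocoIndex", "Cocoindex", "Neo4j", "cocoindex "])
```

Applying such a mapping to each triple's subject and object before phase 2 would make both spellings declare the same Entity key.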
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/docs-to-knowledge-graph/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>