Documentation is a web of concepts pretending to be a list of files. "Incremental processing relies on change detection", "a target receives declared target states" — every page asserts relationships like these, but they're locked in prose. You can search docs for keywords; you can't ask how the concepts connect.
In this tutorial, we'll build a CocoIndex pipeline that turns a folder of Markdown docs into a concept knowledge graph in Neo4j. For each document, an LLM extracts a summary plus a set of (subject, predicate, object) triples — "engine detects source changes", "triple becomes relationship in graph" — and the triples become a property graph you can query in Cypher.
The whole pipeline is ordinary async Python and your own types. The heavy lifting — incremental processing, change tracking, managed graph targets — runs in a Rust engine underneath, so editing one doc re-extracts only that doc, and the graph reconciles itself: no orphaned nodes, no stale edges, no cleanup scripts.
The graph schema is small — two node types, two relationship types:
- Document nodes: one per Markdown file, keyed by filename, with an LLM-generated title and summary.
- Entity nodes: one per distinct concept named in any triple, keyed by the concept name.
- RELATIONSHIP edges: Entity → Entity, with the predicate stored as an edge property.
- MENTION edges: Document → Entity, recording which document named which concept.

Here's the result in Neo4j Browser, built from a docs folder, with documents (cyan) at the center of the concepts (pink) they mention:
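A knowledge graph over living docs is exactly the kind of pipeline that's easy to demo and hard to keep correct: docs are edited and deleted all the time, and every change has to reach the graph without leaving orphaned nodes or stale edges behind.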
The pipeline runs in two phases:
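1. Per-file extraction: for each document, the LLM extracts a summary and a list of triples. The Document node is declared here; the triples are carried forward.
2. Graph build: a single component merges every file's triples and declares the Entity nodes and the RELATIONSHIP / MENTION edges.

You declare the transformation with native Python; CocoIndex works out what to insert, update, and delete. Think: target_state = transformation(source_state).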
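To make that equation concrete, here is a minimal, framework-free sketch of declarative reconciliation. It is not how CocoIndex is implemented: the target is a plain dict, and transformation, reconcile, and Plan are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Writes needed to move a target from its current state to the desired one."""
    inserts: dict[str, str] = field(default_factory=dict)
    updates: dict[str, str] = field(default_factory=dict)
    deletes: set[str] = field(default_factory=set)


def transformation(source_state: dict[str, str]) -> dict[str, str]:
    """Hypothetical transformation: one target record per source file."""
    return {name: text.strip().upper() for name, text in source_state.items()}


def reconcile(current: dict[str, str], desired: dict[str, str]) -> Plan:
    """Diff the desired target state against what the target currently holds."""
    plan = Plan()
    for key, value in desired.items():
        if key not in current:
            plan.inserts[key] = value
        elif current[key] != value:
            plan.updates[key] = value
    plan.deletes = set(current) - set(desired)
    return plan


# First run: the target starts empty, so every record is new.
source = {"a.md": "alpha", "b.md": "beta"}
target = transformation(source)

# Second run: a.md was edited, b.md was deleted, c.md was added.
source = {"a.md": "alpha v2", "c.md": "gamma"}
plan = reconcile(target, transformation(source))
print(plan)
```

The pipeline code only ever describes the desired state; the insert, update, and delete sets fall out of the diff.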
Nodes and edges are plain dataclasses. Each becomes a Neo4j label (or relationship type), with one field as the primary key:
```python
@dataclass
class Document:
    filename: str  # primary key
    title: str
    summary: str


@dataclass
class Entity:
    value: str  # primary key: the concept name


@dataclass
class Relationship:
    """RELATIONSHIP edge payload. ``id`` is a stable hash of the triple so the
    same (subject, predicate, object) always maps to a single edge; the
    ``predicate`` is stored as an edge property."""

    id: int
    predicate: str
```
MENTION carries no payload, so it gets no schema at all — the Neo4j connector derives its identity from the (document, entity) endpoints: one edge per pair.
The lifespan provides what every step needs — the Neo4j connection factory and the LLM model id — once at startup, via context keys:
```python
KG_DB = coco.ContextKey[neo4j.ConnectionFactory]("kg_db")
LLM_MODEL = coco.ContextKey[str]("llm_model", detect_change=True)


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(
        KG_DB,
        neo4j.ConnectionFactory(
            uri=os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
            auth=(
                os.environ.get("NEO4J_USER", "neo4j"),
                os.environ.get("NEO4J_PASSWORD", "cocoindex"),
            ),
            database=os.environ.get("NEO4J_DATABASE", "neo4j"),
        ),
    )
    builder.provide(LLM_MODEL, os.environ.get("LLM_MODEL", "openai/gpt-5.4"))
    yield
```
Note detect_change=True on LLM_MODEL: the model id participates in change detection. Point LLM_MODEL at a different model and CocoIndex knows every memoized extraction is stale — the whole corpus re-extracts on the next run, with no cache to clear manually. The model is any LiteLLM provider string; set LLM_MODEL=ollama/llama3.2 to run extraction locally with no API key.
Extraction is typed end to end: Pydantic models describe what we want, instructor enforces them. The field descriptions double as instructions to the model:
```python
class ExtractedRelationship(pydantic.BaseModel):
    subject: str = pydantic.Field(
        description="The concept the statement is about, e.g. 'CocoIndex'."
    )
    predicate: str = pydantic.Field(
        description="How subject relates to object, e.g. 'supports'."
    )
    object: str = pydantic.Field(
        description="The related concept, e.g. 'Incremental Processing'."
    )


class RelationshipList(pydantic.BaseModel):
    relationships: list[ExtractedRelationship] = pydantic.Field(default_factory=list)
```
Two memoized functions call the LLM — one for the summary, one for the triples:
```python
@coco.fn(memo=True)
async def extract_relationships(content: str) -> list[Triple]:
    client = instructor.from_litellm(litellm.acompletion, mode=instructor.Mode.JSON)
    result = await client.chat.completions.create(
        model=coco.use_context(LLM_MODEL),
        response_model=RelationshipList,
        messages=[
            {"role": "system", "content": RELATIONSHIP_PROMPT},
            {"role": "user", "content": content},
        ],
    )
    validated = RelationshipList.model_validate(result.model_dump())
    return [Triple(r.subject, r.predicate, r.object) for r in validated.relationships]
```
@coco.fn(memo=True) is what makes iteration affordable: the result is cached keyed by the document content (and the function's own code). Unchanged docs never hit the LLM again. The prompt steers extraction toward "concepts, not code" — salient noun-phrase subjects and objects, short verb-phrase predicates, only relationships supported by the text.
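The caching behavior is easy to picture with a toy version. The sketch below is an assumption-laden stand-in, not the engine's memoization: MemoizedExtractor and fake_extract are hypothetical, the cache lives in memory, and hashing the function's bytecode is only a crude proxy for tracking the function's own code. The model id is part of the key too, which mirrors the effect of detect_change=True on LLM_MODEL.

```python
import hashlib
from typing import Callable


class MemoizedExtractor:
    """Caches results by (function bytecode, model id, document content)."""

    def __init__(self, fn: Callable[[str], list[str]]) -> None:
        self.fn = fn
        self.cache: dict[str, list[str]] = {}
        self.calls = 0  # how many times the wrapped function actually ran

    def _key(self, content: str, model_id: str) -> str:
        h = hashlib.sha256()
        # Length-prefix each part so adjacent parts cannot blur together.
        for part in (self.fn.__code__.co_code, model_id.encode(), content.encode()):
            h.update(len(part).to_bytes(8, "big"))
            h.update(part)
        return h.hexdigest()

    def __call__(self, content: str, model_id: str) -> list[str]:
        key = self._key(content, model_id)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.fn(content)
        return self.cache[key]


def fake_extract(content: str) -> list[str]:
    """Stand-in for the LLM call."""
    return content.split()


extract = MemoizedExtractor(fake_extract)
docs = {"a.md": "engine detects changes", "b.md": "target receives states"}


def run(model_id: str) -> int:
    """Process every doc and report how many extractions ran during this pass."""
    before = extract.calls
    for text in docs.values():
        extract(text, model_id)
    return extract.calls - before


cold = run("model-a")      # nothing cached yet
warm = run("model-a")      # nothing changed
docs["a.md"] = "engine tracks changes"
edited = run("model-a")    # only a.md changed
swapped = run("model-b")   # new model id: every key is stale
print(cold, warm, edited, swapped)  # prints: 2 0 1 2
```

Only the edited document costs a call on the third pass, while changing the model id forces the whole corpus through again.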
process_file runs once per document: extract the summary, declare the Document node, extract the triples, and return them for phase 2.
```python
@coco.fn(memo=True)
async def process_file(
    file: localfs.File,
    document_table: neo4j.TableTarget[Document],
) -> DocTriples:
    content = await file.read_text()
    filename = file.file_path.path.as_posix()
    summary = await extract_summary(content)
    document_table.declare_record(
        row=Document(filename=filename, title=summary.title, summary=summary.summary)
    )
    triples = await extract_relationships(content)
    return DocTriples(filename=filename, triples=triples)
```
Each file runs as its own processing component, mounted in app_main and keyed by the file path:
```python
file_coros = []
async for path_key, file in files.items():
    file_coros.append(
        coco.use_mount(
            coco.component_subpath("file", path_key),
            process_file,
            file,
            document_table,
        )
    )
docs: list[DocTriples] = list(await asyncio.gather(*file_coros))
```
Why a component per file? Ownership. The component at ("file", path_key) owns that document's Document node — if the file disappears, so does the component, and CocoIndex deletes its node (and the MENTION edges pointing from it) automatically. coco.use_mount returns each file's triples, and asyncio.gather runs all files concurrently.
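The ownership rule can be modeled in a few lines. This is a toy sketch under stated assumptions, not CocoIndex internals: OwnershipTracker is a hypothetical class that remembers which component path declared which node keys, and treats a component that is not mounted in the current run as declaring nothing.

```python
ComponentPath = tuple[str, str]


class OwnershipTracker:
    """Toy model of per-component ownership; not CocoIndex's implementation."""

    def __init__(self) -> None:
        # component path -> primary keys of the nodes it declared last run
        self.owned: dict[ComponentPath, set[str]] = {}

    def run(self, declarations: dict[ComponentPath, set[str]]) -> set[str]:
        """Record this run's declarations; return node keys that must be deleted."""
        stale: set[str] = set()
        for path, old_keys in self.owned.items():
            # A component that was not mounted this run declares nothing.
            stale |= old_keys - declarations.get(path, set())
        # Keep any key that some component still declares.
        stale -= set().union(*declarations.values())
        self.owned = {path: set(keys) for path, keys in declarations.items()}
        return stale


tracker = OwnershipTracker()
first = tracker.run({
    ("file", "intro.md"): {"intro.md"},
    ("file", "targets.md"): {"targets.md"},
})
# targets.md was deleted from disk, so its component is not mounted again.
second = tracker.run({("file", "intro.md"): {"intro.md"}})
print(first, second)
```

Nothing in the pipeline code asks for the deletion; it follows from the component no longer existing.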
A single component takes every file's triples and declares the cross-document parts of the graph: deduplicated Entity nodes and the two edge types.
```python
@coco.fn
async def build_graph(
    docs: list[DocTriples],
    entity_table: neo4j.TableTarget[Entity],
    relationship_rel: neo4j.RelationTarget[Relationship],
    mention_rel: neo4j.RelationTarget[Any],
) -> None:
    entities: set[str] = set()
    mentions: set[tuple[str, str]] = set()  # (filename, entity value)

    for doc in docs:
        for t in doc.triples:
            entities.add(t.subject)
            entities.add(t.object)
            mentions.add((doc.filename, t.subject))
            mentions.add((doc.filename, t.object))
            rel_id = await generate_id((t.subject, t.predicate, t.object))
            relationship_rel.declare_relation(
                from_id=t.subject,
                to_id=t.object,
                record=Relationship(id=rel_id, predicate=t.predicate),
            )

    for value in entities:
        entity_table.declare_record(row=Entity(value=value))
    for filename, entity in mentions:
        mention_rel.declare_relation(from_id=filename, to_id=entity)
```
Two details carry the correctness:

- entities and mentions are sets, so a concept named by many triples, or by many documents, is declared exactly once.
- generate_id hashes the triple, so the same (subject, predicate, object) always maps to the same edge: re-asserting a fact in another doc is a no-op, not a duplicate.

This is plain Python doing set-dedup in memory, with no framework abstractions. The declarative part is only at the boundary: declare_record / declare_relation say what should exist, and the engine reconciles.
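For intuition, here is one way a stable triple id could be computed with the standard library. It is a hypothetical stand-in, not the generate_id used in the example: stable_triple_id is an invented name, and SHA-256 truncated to 63 bits is an assumption made so the value fits a signed 64-bit integer. Python's built-in hash() would not work for this, because string hashes are salted per process.

```python
import hashlib


def stable_triple_id(subject: str, predicate: str, obj: str) -> int:
    """Deterministic 63-bit id for a triple (hypothetical stand-in for generate_id)."""
    # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
    payload = b"".join(
        len(part.encode()).to_bytes(4, "big") + part.encode()
        for part in (subject, predicate, obj)
    )
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big") >> 1  # non-negative, below 2**63


# Two documents assert the same fact; b.md adds a second one.
docs = {
    "a.md": [("engine", "detects", "source changes")],
    "b.md": [
        ("engine", "detects", "source changes"),
        ("triple", "becomes", "relationship"),
    ],
}

entities: set[str] = set()
edges: dict[int, tuple[str, str, str]] = {}
for filename, triples in docs.items():
    for s, p, o in triples:
        entities.update((s, o))
        edges[stable_triple_id(s, p, o)] = (s, p, o)  # same id, same slot

print(len(entities), len(edges))  # prints: 4 2
```

Three asserted triples collapse to two edges because the repeated fact lands on the same id.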
app_main mounts the targets and runs the two phases. Node tables come first, because relation targets are declared between two node tables — that's how the connector knows each edge's endpoint labels and keys:
```python
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    document_table = await neo4j.mount_table_target(
        KG_DB,
        "Document",
        await neo4j.TableSchema.from_class(Document, primary_key="filename"),
        primary_key="filename",
    )
    entity_table = await neo4j.mount_table_target(
        KG_DB,
        "Entity",
        await neo4j.TableSchema.from_class(Entity, primary_key="value"),
        primary_key="value",
    )
    relationship_rel = await neo4j.mount_relation_target(
        KG_DB,
        "RELATIONSHIP",
        entity_table,
        entity_table,
        await neo4j.TableSchema.from_class(Relationship, primary_key="id"),
        primary_key="id",
    )
    mention_rel = await neo4j.mount_relation_target(
        KG_DB, "MENTION", document_table, entity_table
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md", "**/*.mdx"]),
    )

    # ... phase 1 fan-out (above), then:
    await coco.mount(
        coco.component_subpath("build_graph"),
        build_graph,
        docs,
        entity_table,
        relationship_rel,
        mention_rel,
    )


app = coco.App(
    coco.AppConfig(name="DocsToKnowledgeGraph"),
    app_main,
    sourcedir=pathlib.Path("./markdown_files"),
)
```
That's the entire pipeline — one file, ~200 lines.
You'll need a Neo4j instance and an LLM key. Start Neo4j with Docker:
```bash
docker run -d \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/cocoindex \
  --name cocoindex-neo4j \
  neo4j:5.26-community
```
Set up the environment and install:
```bash
cp .env.example .env   # fill in OPENAI_API_KEY (or set LLM_MODEL=ollama/llama3.2)
pip install -e .
```
The example ships a small markdown_files/ folder of sample docs so it runs out of the box. Build the graph:
```bash
cocoindex update main
```
To graph your own docs, drop .md / .mdx files into markdown_files/ — or point sourcedir at your real docs folder — and re-run.
Open Neo4j Browser (neo4j / cocoindex) and ask the graph questions:
```cypher
// Everything
MATCH p=()-->() RETURN p LIMIT 200

// How concepts relate
MATCH (a:Entity)-[r:RELATIONSHIP]->(b:Entity)
RETURN a.value, r.predicate, b.value

// Concepts mentioned in the most documents
MATCH (d:Document)-[:MENTION]->(e:Entity)
RETURN e.value, count(DISTINCT d) AS docs
ORDER BY docs DESC LIMIT 10
```
This is where the declarative model pays for itself. You never compute a diff or write cleanup logic — change something, re-run cocoindex update main, and CocoIndex works out the minimum set of LLM calls and graph writes.
Data changes:

- Edit a doc: only that doc is re-extracted. build_graph re-runs and diffs the graph: new entities and edges are inserted, ones no longer supported anywhere are deleted. Every other doc's extraction is served from the memo cache.
- Delete a doc: its Document node and MENTION edges are cleaned up automatically; concepts only that doc introduced vanish from the entity set on the next graph pass.

Logic changes are reconciled the same way:

- Switch the model: LLM_MODEL has detect_change=True, so changing the env var invalidates every memoized extraction. No cache to clear, no manual rebuild.

The full, runnable example is in the CocoIndex repo: examples/docs_to_knowledge_graph.
One natural next step: the LLM will sometimes name the same concept two ways ("CocoIndex" vs "Cocoindex"). The meeting notes graph example adds an embedding + LLM entity-resolution pass that collapses near-duplicates — it drops into this pipeline between the two phases. For a bigger end-to-end graph build (transcription, multi-entity schemas, polymorphic edges), see Turn Podcasts into a Knowledge Graph.
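As a much cruder starting point than the embedding + LLM pass mentioned above, spelling variants that differ only in case or whitespace can be collapsed with a normalization key. This sketch is an illustration only and is not taken from either example; canonical_key and resolve_entities are hypothetical helpers, and it will not merge true synonyms.

```python
import unicodedata


def canonical_key(name: str) -> str:
    """Case- and whitespace-insensitive key used only for grouping variants."""
    normalized = unicodedata.normalize("NFKC", name)
    return " ".join(normalized.casefold().split())


def resolve_entities(names: list[str]) -> dict[str, str]:
    """Map every surface form to the first-seen spelling in its group."""
    representative: dict[str, str] = {}
    mapping: dict[str, str] = {}
    for name in names:
        key = canonical_key(name)
        representative.setdefault(key, name)
        mapping[name] = representative[key]
    return mapping


mapping = resolve_entities(
    ["CocoIndex", "Cocoindex", "Neo4j", "neo4j ", "Rust engine"]
)
print(mapping)
```

In a pipeline shaped like this one, such a mapping would be applied to each triple's subject and object before the graph-building step collects them into sets.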
Got a docs folder, a wiki, or a pile of specs you want to turn into a graph? Come tell us on Discord — and if this was useful, star CocoIndex on GitHub.