
<h1 align="center">A self-updating wiki for every codebase in a folder.</h1>

An LLM reads each Python file, extracts its public classes, functions, and CocoIndex call graphs, and aggregates them into a one-pager Markdown wiki per project, all in plain async Python.

Edit a file, re-run, and only that file is re-analyzed, so the wiki stays current.

<a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub">Star us ❤️</a>  ·  <a href="https://cocoindex.io/docs/examples/multi-codebase-summarization/" title="Read the full walkthrough">Read the full walkthrough</a>  ·  <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord">Join the CocoIndex Discord</a>

Your code is the source of truth, but a hand-written wiki drifts the moment someone merges a PR. This pipeline builds your own deep wiki — a one-pager per project that's always fresh, because it's regenerated by incremental processing instead of by hand. You declare the transformation in native Python — target_state = transformation(source_state) — and the Rust engine reprocesses the minimum: switch the model or edit one file, and only what changed is re-analyzed, keeping the wikis current in production.
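To make the incremental idea concrete, here is a minimal standard-library sketch of content-keyed memoization. It is not CocoIndex's engine: `ContentMemo`, `build_targets`, and `summarize` are hypothetical names, and `summarize` stands in for the LLM call.

```python
import hashlib
from typing import Callable, Dict


def content_key(data: bytes) -> str:
    """Stable fingerprint of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


class ContentMemo:
    """Toy cache keyed by content hash (an illustration, not the real engine)."""

    def __init__(self, transform: Callable[[bytes], str]) -> None:
        self._transform = transform
        self._cache: Dict[str, str] = {}
        self.calls = 0  # how many times the expensive transform actually ran

    def __call__(self, data: bytes) -> str:
        key = content_key(data)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._transform(data)
        return self._cache[key]


def build_targets(sources: Dict[str, bytes], memo: ContentMemo) -> Dict[str, str]:
    """target_state = transformation(source_state), computed per file."""
    return {path: memo(data) for path, data in sources.items()}


def summarize(data: bytes) -> str:
    """Hypothetical stand-in for the LLM extraction step."""
    return f"{len(data.splitlines())} lines"


memo = ContentMemo(summarize)
sources = {"a.py": b"x = 1\n", "b.py": b"y = 2\nz = 3\n"}
first = build_targets(sources, memo)    # both files are analyzed
sources["a.py"] = b"x = 1\nw = 4\n"     # edit one file
second = build_targets(sources, memo)   # only a.py is analyzed again
```

After the second run `memo.calls` is 3, not 4: the unchanged `b.py` is served from the cache. A fuller version would presumably fold the model name into the key so that switching models also invalidates cached results.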

How it works

Each top-level subdirectory is treated as a project. The pipeline extracts a structured CodebaseInfo per file with an LLM, aggregates files into a project summary, and writes Markdown with Mermaid diagrams. Read it in main.py:

```python
@coco.fn(memo=True)   # per file — structured LLM extraction, cached by content
async def extract_file_info(file: FileLike) -> CodebaseInfo:
    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL, response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    return CodebaseInfo.model_validate(result.model_dump())

@coco.fn(memo=True)   # per project — extract every file, aggregate, write one Markdown page
async def process_project(project_name: str, files, output_dir: pathlib.Path) -> None:
    file_infos = await coco.map(extract_file_info, files)         # concurrent extraction
    project_info = await aggregate_project_info(project_name, file_infos)
    markdown = generate_markdown(project_name, project_info, file_infos)
    localfs.declare_file(output_dir / f"{project_name}.md", markdown, create_parent_dirs=True)

@coco.fn
async def app_main(root_dir: pathlib.Path, output_dir: pathlib.Path) -> None:
    for entry in root_dir.resolve().iterdir():
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        files = [f async for f in localfs.walk_dir(entry, recursive=True,
                 path_matcher=PatternFilePathMatcher(included_patterns=["**/*.py"],
                                                     excluded_patterns=["**/.*", "**/__pycache__"]))]
        if files:
            await coco.mount(coco.component_subpath("project", entry.name),
                             process_project, entry.name, files, output_dir)
```
Extraction uses instructor on top of LiteLLM, with the Pydantic models defined in models.py; the LLM emits Mermaid graph syntax directly (bold for @coco.fn functions, thick ==> arrows for mount/use_mount calls). Each project mounts as its own processing component, so projects run in parallel and one project finishing doesn't wait on the rest.

📘 <a href="https://cocoindex.io/docs/examples/multi-codebase-summarization/">Full Tutorial →</a>

Step-by-step walkthrough with the data models, per-project granularity, concurrent extraction, and the Markdown + Mermaid output.

Why it's worth a star ⭐

Always fresh, never by hand. The wiki is a target state regenerated from the code — edit a file and the one-pager updates itself; the docs can't drift from the source.
Incremental by default. @coco.fn(memo=True) caches each file's extraction by content, so re-running only re-analyzes changed files. Add a project and only that project is processed.
Concurrent by construction. coco.map(extract_file_info, files) fans every file out at once while staying visible to the pipeline, so the LLM calls overlap instead of running one after another.
You pick the granularity. Here it's one wiki page per project directory, but the same shape works per file, per page, or per semantic unit.
Structured outputs, your stack. One CodebaseInfo Pydantic model drives both file- and project-level extraction; swap LLM_MODEL for any LiteLLM provider.
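The concurrent fan-out can be illustrated with plain asyncio. This is only an analogy for what coco.map does, assuming a hypothetical `fake_extract` in place of the LLM call; it does not reproduce CocoIndex's tracking or caching.

```python
import asyncio
from typing import List


async def fake_extract(name: str, tracker: dict) -> str:
    """Hypothetical stand-in for one LLM call; records peak concurrency."""
    tracker["in_flight"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["in_flight"])
    await asyncio.sleep(0.01)  # simulated network latency
    tracker["in_flight"] -= 1
    return f"info:{name}"


async def extract_all(names: List[str], tracker: dict) -> List[str]:
    # gather schedules every coroutine up front and returns results in input order
    return await asyncio.gather(*(fake_extract(n, tracker) for n in names))


tracker = {"in_flight": 0, "peak": 0}
results = asyncio.run(extract_all(["a.py", "b.py", "c.py"], tracker))
```

Here `tracker["peak"]` ends at 3 because all three calls are in flight before any of them returns, and `results` keeps the order of the input list.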

Run it

1. Configure & install — the default model is gemini/gemini-2.5-flash:

```sh
cp .env.example .env     # set GEMINI_API_KEY (or LLM_MODEL=<provider/model> with its matching key)
pip install -e .
```

2. Generate the wiki — root_dir defaults to ../, so out of the box it documents the CocoIndex examples/ folder itself, writing one page per example into ./output:

```sh
cocoindex update main.py
```

To document your own code, point root_dir in main.py at a folder of project subdirectories and re-run.
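To preview which folders would count as projects before pointing root_dir at them, a rough standard-library approximation of the discovery rules can help. `discover_projects` is a hypothetical helper; it mimics the filters in app_main (skip non-directories and dot-directories, include `**/*.py`, exclude hidden paths and `__pycache__`) but is not guaranteed to match PatternFilePathMatcher exactly.

```python
import pathlib
import tempfile
from typing import Dict, List


def discover_projects(root_dir: pathlib.Path) -> Dict[str, List[pathlib.Path]]:
    """Map each top-level subdirectory to its .py files, skipping hidden and cache paths."""
    projects: Dict[str, List[pathlib.Path]] = {}
    for entry in sorted(root_dir.resolve().iterdir()):
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        files = [
            p for p in sorted(entry.rglob("*.py"))
            if not any(part.startswith(".") or part == "__pycache__"
                       for part in p.relative_to(entry).parts)
        ]
        if files:  # directories without Python files produce no page
            projects[entry.name] = files
    return projects


# Build a throwaway tree to show the rules in action.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "alpha" / "pkg").mkdir(parents=True)
    (root / "alpha" / "main.py").write_text("print('hi')\n")
    (root / "alpha" / "pkg" / "util.py").write_text("X = 1\n")
    (root / ".git").mkdir()                      # hidden: skipped
    (root / "docs").mkdir()
    (root / "docs" / "notes.txt").write_text("no python here\n")  # no .py: skipped
    found = discover_projects(root)
    summary = {name: [p.name for p in files] for name, files in found.items()}
```

Only `alpha` qualifies in this tree, with `main.py` and `util.py` as its files.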

3. Read the output:

```sh
ls output/
cat output/code_embedding.md
```

Each page has an Overview, a Components list (★ marks @coco.fn functions), a CocoIndex Pipeline Mermaid diagram where applicable, and per-file summaries for multi-file projects. Edit a .py file and re-run: only that file is re-analyzed, and every other file is served from the memo cache.
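As a rough picture of how such a page could be assembled, the sketch below renders the sections described above from a small data structure. Everything in it is an assumption for illustration: `Component`, `ProjectPage`, `render_page`, and their fields are hypothetical and use standard-library dataclasses, not the Pydantic models in models.py or the real generate_markdown.

```python
from dataclasses import dataclass, field
from typing import List

FENCE = "`" * 3  # Markdown code fence, built indirectly to keep this example tidy


@dataclass
class Component:
    name: str
    summary: str
    is_coco_fn: bool = False  # hypothetical flag for @coco.fn functions


@dataclass
class ProjectPage:
    name: str
    overview: str
    components: List[Component] = field(default_factory=list)
    mermaid: str = ""  # empty when the project has no CocoIndex pipeline


def render_page(page: ProjectPage) -> str:
    """Render Overview, Components, and an optional pipeline diagram."""
    lines = [f"# {page.name}", "", "## Overview", "", page.overview, "", "## Components", ""]
    for c in page.components:
        star = "★ " if c.is_coco_fn else ""
        lines.append(f"- {star}`{c.name}`: {c.summary}")
    if page.mermaid:  # diagram section only "where applicable"
        lines += ["", "## CocoIndex Pipeline", "", FENCE + "mermaid", page.mermaid, FENCE]
    return "\n".join(lines) + "\n"


page = ProjectPage(
    name="demo",
    overview="Toy project.",
    components=[Component("app_main", "Entry point.", True), Component("helper", "Utility.")],
)
out = render_page(page)
```

Because `page.mermaid` is empty here, the rendered page has no pipeline section, while `app_main` gets the ★ marker and `helper` does not.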

If this kept your codebase docs fresh, <a href="https://github.com/cocoindex-io/cocoindex">give CocoIndex a star ⭐</a> — it helps a lot.

<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/multi-codebase-summarization/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples">See all examples →</a>