<p align="center"> <a href="https://cocoindex.io/docs/examples/multi-codebase-summarization/" title="Generate a self-updating wiki page for every project in a folder with CocoIndex — structured LLM code analysis, Mermaid diagrams, incremental, in plain async Python"> </a> </p> <h1 align="center">A <em>self-updating</em> wiki for every codebase in a folder.</h1> <p align="center"> <b>An LLM reads each Python file, extracts its public classes, functions, and CocoIndex call graphs, and aggregates them into a one-pager Markdown wiki per project — in plain async Python.</b>

Edit a file, re-run, and only that file is re-analyzed, so the wiki stays current.

</p> <p align="center"> <strong>Star us&nbsp;❤️&nbsp;→</strong>&nbsp;<a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> &nbsp;·&nbsp; <a href="https://cocoindex.io/docs/examples/multi-codebase-summarization/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> &nbsp;·&nbsp; <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center">

</div>

Your code is the source of truth, but a hand-written wiki drifts the moment someone merges a PR. This pipeline builds your own deep wiki — a one-pager per project that's always fresh, because it's regenerated by incremental processing instead of by hand. You declare the transformation in native Python — target_state = transformation(source_state) — and the Rust engine reprocesses the minimum: switch the model or edit one file, and only what changed is re-analyzed, keeping the wikis current in production.

How it works

Each top-level subdirectory is treated as a project. The pipeline extracts a structured CodebaseInfo per file with an LLM, aggregates files into a project summary, and writes Markdown with Mermaid diagrams. Read it in main.py:

```python
@coco.fn(memo=True)   # per file — structured LLM extraction, cached by content
async def extract_file_info(file: FileLike) -> CodebaseInfo:
    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL, response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    return CodebaseInfo.model_validate(result.model_dump())

@coco.fn(memo=True)   # per project — extract every file, aggregate, write one Markdown page
async def process_project(project_name: str, files, output_dir: pathlib.Path) -> None:
    file_infos = await coco.map(extract_file_info, files)         # concurrent extraction
    project_info = await aggregate_project_info(project_name, file_infos)
    markdown = generate_markdown(project_name, project_info, file_infos)
    localfs.declare_file(output_dir / f"{project_name}.md", markdown, create_parent_dirs=True)

@coco.fn
async def app_main(root_dir: pathlib.Path, output_dir: pathlib.Path) -> None:
    for entry in root_dir.resolve().iterdir():
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        files = [f async for f in localfs.walk_dir(entry, recursive=True,
                 path_matcher=PatternFilePathMatcher(included_patterns=["**/*.py"],
                                                     excluded_patterns=["**/.*", "**/__pycache__"]))]
        if files:
            await coco.mount(coco.component_subpath("project", entry.name),
                             process_project, entry.name, files, output_dir)
```

Extraction runs through instructor on top of LiteLLM, using the Pydantic models in models.py; the LLM emits Mermaid graph syntax directly (bold for @coco.fn functions, thick ==> arrows for mount/use_mount calls). Each project mounts as its own processing component, so projects run in parallel and none waits for the others to finish.
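The project-discovery rule in app_main can be approximated with only the standard library, which is handy for previewing what the pipeline would pick up before spending any LLM calls. This is a sketch, not CocoIndex's localfs.walk_dir: discover_projects is a hypothetical helper, and the hidden-path and __pycache__ checks only stand in for the excluded_patterns shown above.

```python
import pathlib


def discover_projects(root_dir: pathlib.Path) -> dict[str, list[pathlib.Path]]:
    """Map each top-level subdirectory name to its Python files (hypothetical helper)."""
    projects: dict[str, list[pathlib.Path]] = {}
    for entry in sorted(root_dir.resolve().iterdir()):
        # Same rule as app_main: only non-hidden directories count as projects.
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        files = []
        for path in sorted(entry.rglob("*.py")):
            parts = path.relative_to(entry).parts
            # Approximates excluded_patterns: skip hidden paths and __pycache__.
            if any(p.startswith(".") or p == "__pycache__" for p in parts):
                continue
            files.append(path)
        if files:  # directories without Python files are skipped, as in app_main
            projects[entry.name] = files
    return projects
```

Calling `discover_projects(pathlib.Path(".."))` from the example directory should list roughly the same projects the pipeline would mount, one key per page it would write.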

<p align="center"> 📘 <b><a href="https://cocoindex.io/docs/examples/multi-codebase-summarization/">Full Tutorial →</a></b>

Step-by-step walkthrough with the data models, per-project granularity, concurrent extraction, and the Markdown + Mermaid output.

</p>

Why it's worth a star ⭐

  • Always fresh, never by hand. The wiki is a target state regenerated from the code — edit a file and the one-pager updates itself; the docs can't drift from the source.
  • Incremental by default. @coco.fn(memo=True) caches each file's extraction by content, so re-running only re-analyzes changed files. Add a project and only that project is processed.
  • Concurrent by construction. coco.map(extract_file_info, files) fans every file out at once while staying visible to the pipeline — far faster than sequential LLM calls.
  • You pick the granularity. Here it's one wiki page per project directory, but the same shape works per file, per page, or per semantic unit.
  • Structured outputs, your stack. One CodebaseInfo Pydantic model drives both file- and project-level extraction; swap LLM_MODEL for any LiteLLM provider.

Run it

1. Configure & install — the default model is gemini/gemini-2.5-flash:

```sh
cp .env.example .env     # set GEMINI_API_KEY (or LLM_MODEL=<provider/model> with its matching key)
pip install -e .
```

2. Generate the wiki. By default root_dir is ../, so out of the box it documents the CocoIndex examples/ folder itself, writing one page per example into ./output:

```sh
cocoindex update main.py
```

To document your own code, point root_dir in main.py at a folder of project subdirectories and re-run.

3. Read the output:

```sh
ls output/
cat output/code_embedding.md
```

Each page has an Overview, a Components list (★ marks @coco.fn functions), a CocoIndex Pipeline Mermaid diagram where applicable, and per-file summaries for multi-file projects. Edit a .py file and re-run: only that file is re-analyzed, and every other file is served from the memo cache.
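As a rough illustration of that page layout, the sketch below assembles a page from a simplified data model. The Component and ProjectSummary dataclasses, their fields, and render_page are hypothetical stand-ins for the CodebaseInfo model and generate_markdown used in the example, whose actual fields are not shown in this README; per-file summaries are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Component:                 # hypothetical stand-in for one extracted class or function
    name: str
    summary: str
    is_coco_fn: bool = False


@dataclass
class ProjectSummary:            # hypothetical, simplified stand-in for CodebaseInfo
    overview: str
    components: list[Component] = field(default_factory=list)
    mermaid: str = ""            # Mermaid graph text; empty when no pipeline was found


def render_page(name: str, info: ProjectSummary) -> str:
    """Build a one-page Markdown summary: Overview, Components, optional diagram."""
    fence = "`" * 3              # built programmatically to avoid a literal fence here
    lines = [f"# {name}", "", "## Overview", "", info.overview, "", "## Components", ""]
    for c in info.components:
        star = "★ " if c.is_coco_fn else ""   # ★ marks @coco.fn functions
        lines.append(f"- {star}`{c.name}`: {c.summary}")
    if info.mermaid:             # the diagram section appears only where applicable
        lines += ["", "## CocoIndex Pipeline", "", f"{fence}mermaid", info.mermaid, fence]
    return "\n".join(lines) + "\n"


page = render_page(
    "demo",
    ProjectSummary(
        overview="Toy project.",
        components=[Component("app_main", "Entry point.", True),
                    Component("helper", "Utility.")],
        mermaid="graph TD\n    app_main ==> process_project",
    ),
)
```

Keeping rendering as a pure function of the extracted data means the page content depends only on the analysis results, which is what lets an unchanged project produce an unchanged page.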


<p align="center"> If this kept your codebase docs fresh, <a href="https://github.com/cocoindex-io/cocoindex"><b>give CocoIndex a star ⭐</b></a> — it helps a lot.

<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/multi-codebase-summarization/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>

</p>