In this tutorial, we'll build a live semantic index over a codebase with CocoIndex. Point it at a repo, and you get a vector index you can search in natural language ("where do we embed chunks?") that updates itself as you edit — the kind of fresh, low-latency context an agent needs.
The whole pipeline is ordinary async Python and your own types. The heavy lifting — incremental processing, change tracking, managed targets — runs in a Rust engine underneath, so only what changed gets re-embedded and re-upserted.
A codebase is hard to keep indexed well, and it exercises most of what CocoIndex was built for:
With live=True on the filesystem source and cocoindex update -L, the index keeps watching and applies changes with low latency.
From a high level, these are the steps:
You declare the transformation logic with native Python, without worrying about how updates propagate. Think: target_state = transformation(source_state).
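One way to read target_state = transformation(source_state) is as a pure function that produces the desired rows, plus a separate step that diffs those rows against what is already stored. The sketch below is only an illustration with plain dicts: `transformation` and `reconcile` are hypothetical names, not CocoIndex APIs, and it recomputes everything on each run, whereas the real engine works incrementally.

```python
def transformation(source_state: dict[str, str]) -> dict[str, str]:
    """Desired target rows keyed by row id (here: one upper-cased row per file)."""
    return {f"row:{path}": text.upper() for path, text in source_state.items()}


def reconcile(
    current: dict[str, str], desired: dict[str, str]
) -> tuple[dict[str, str], set[str]]:
    """Return (rows to upsert, row ids to delete) that turn `current` into `desired`."""
    upserts = {key: value for key, value in desired.items() if current.get(key) != value}
    deletes = set(current) - set(desired)
    return upserts, deletes


# First run: two files are indexed.
stored = transformation({"a.py": "x = 1", "b.py": "y = 2"})

# Later: a.py is edited, b.py is deleted, c.py is added.
desired = transformation({"a.py": "x = 10", "c.py": "z = 3"})

upserts, deletes = reconcile(stored, desired)
print(upserts)
print(deletes)
```

The pipeline code only ever describes `desired`; which rows to write or delete falls out of the comparison.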
A running Postgres with the pgvector extension. CocoIndex supports many targets, so you can pick another store.
export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
Install CocoIndex and the dependencies this example uses:
pip install -U "cocoindex[postgres,sentence_transformers]" asyncpg pgvector numpy python-dotenv
Apps are the top-level runnable unit in CocoIndex. Before the App, we set up two things the rest of the code builds on. CodeEmbedding defines one row of the output table — each chunk of code becomes one row, with its text, location, and embedding vector. coco_lifespan provides the shared resources every step needs — the Postgres connection pool and the embedding model — once at startup.
import os
import pathlib
from dataclasses import dataclass
from typing import AsyncIterator, Annotated
import asyncpg
from numpy.typing import NDArray
import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter, detect_code_language
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from cocoindex.resources.chunk import Chunk
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.resources.id import IdGenerator
DATABASE_URL = os.getenv("POSTGRES_URL", "postgres://cocoindex:cocoindex@localhost/cocoindex")
TABLE_NAME = "code_embeddings"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
PG_DB = coco.ContextKey[asyncpg.Pool]("code_embedding_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)
_splitter = RecursiveSplitter()
@dataclass
class CodeEmbedding:
    id: int
    filename: str
    code: str
    embedding: Annotated[NDArray, EMBEDDER]
    start_line: int
    end_line: int

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        yield
embedding: Annotated[NDArray, EMBEDDER] ties the vector column to the embedder, so its dimensions are inferred automatically.
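The lifespan pattern itself is not specific to CocoIndex: resources are created once, registered under a key, used by every step, and released when the app shuts down. Here is a minimal standard-library sketch of that shape. `Resources` and `FakePool` are hypothetical stand-ins we made up for illustration; they are not the `coco.EnvironmentBuilder` or `asyncpg.Pool` APIs.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator


class Resources:
    """Hypothetical key -> value registry standing in for the builder/context."""

    def __init__(self) -> None:
        self._values: dict[str, Any] = {}

    def provide(self, key: str, value: Any) -> None:
        self._values[key] = value

    def use(self, key: str) -> Any:
        return self._values[key]


class FakePool:
    """Hypothetical stand-in for a connection pool."""

    def __init__(self) -> None:
        self.open = True

    def close(self) -> None:
        self.open = False


@asynccontextmanager
async def lifespan(resources: Resources) -> AsyncIterator[None]:
    pool = FakePool()              # created once at startup
    resources.provide("db", pool)  # shared with every step by key
    try:
        yield
    finally:
        pool.close()               # released once at shutdown


async def main() -> tuple[bool, bool]:
    resources = Resources()
    async with lifespan(resources):
        pool = resources.use("db")
        open_during_run = pool.open
    return open_during_run, pool.open


result = asyncio.run(main())
print(result)
```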
process_file runs once per file. It reads the file, detects the language so Tree-sitter can parse it, splits the code along the syntax tree, and maps each chunk to process_chunk.
@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    table: postgres.TableTarget[CodeEmbedding],
) -> None:
    text = await file.read_text()
    language = detect_code_language(filename=str(file.file_path.path.name))
    chunks = _splitter.split(
        text,
        chunk_size=1000,
        min_chunk_size=300,
        chunk_overlap=300,
        language=language,
    )
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)
CocoIndex uses Tree-sitter to chunk code along its actual syntax structure rather than arbitrary line breaks. Because each chunk is a coherent syntactic unit, retrieval returns whole functions or blocks instead of fragments cut mid-statement. All major languages are supported; unknown types fall back to plain text.
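To get a feel for what splitting along the syntax tree means, here is a small stand-in that uses Python's built-in `ast` module instead of Tree-sitter. It is not how `RecursiveSplitter` works internally: it only handles Python source, ignores chunk-size limits and overlap, and `split_python_by_definition` is a name we made up for this sketch.

```python
import ast


def split_python_by_definition(source: str) -> list[tuple[int, int, str]]:
    """Return (start_line, end_line, text) for each top-level statement."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        # Decorators sit above the `def`/`class` line, so start from the first one.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        end = node.end_lineno
        chunks.append((start, end, "\n".join(lines[start - 1:end])))
    return chunks


SAMPLE = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, item):
        self.items.append(item)
'''

chunks = split_python_by_definition(SAMPLE)
for start, end, text in chunks:
    print(f"L{start}-L{end}: {text.splitlines()[0]}")
```

Each chunk is a whole statement with its line range attached, which is the property that lets search results point at complete functions rather than arbitrary slices.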
@coco.fn with memo=True is what makes this incremental: if a file's content and this function's code are both unchanged, the whole file is skipped on the next run. coco.map fans out to one process_chunk call per chunk.
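The idea behind this kind of memoization can be sketched with a cache keyed on both the input content and a fingerprint of the function. This is an in-memory illustration under our own assumptions, not CocoIndex's implementation: `memoized` and `memo_key` are hypothetical, the bytecode fingerprint is crude (it ignores constants and helper functions), and the real engine persists its state across runs.

```python
import hashlib
from typing import Callable

_cache: dict[str, int] = {}
calls: list[str] = []  # records which inputs actually ran, for demonstration


def memo_key(fn: Callable[[str], int], content: str) -> str:
    """Key that changes when either the function's bytecode or the input changes."""
    h = hashlib.sha256()
    h.update(fn.__code__.co_code)  # crude fingerprint of the function's logic
    h.update(content.encode("utf-8"))
    return h.hexdigest()


def memoized(fn: Callable[[str], int]) -> Callable[[str], int]:
    def wrapper(content: str) -> int:
        key = memo_key(fn, content)
        if key not in _cache:
            _cache[key] = fn(content)  # only runs on a cache miss
        return _cache[key]

    return wrapper


@memoized
def process_file(content: str) -> int:
    calls.append(content)
    return len(content.splitlines())


process_file("a\nb")     # first run: computed
process_file("a\nb")     # same content and same code: skipped
process_file("a\nb\nc")  # content changed: recomputed
print(len(calls))
```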
Here is what chunking produces: each file is split into syntactic chunks, each with its own location and text.
process_chunk embeds the chunk with the shared embedder and declares the target row.
@coco.fn
async def process_chunk(
    chunk: Chunk,
    filename: pathlib.PurePath,
    id_gen: IdGenerator,
    table: postgres.TableTarget[CodeEmbedding],
) -> None:
    embedding = await coco.use_context(EMBEDDER).embed(chunk.text)
    table.declare_row(
        row=CodeEmbedding(
            id=await id_gen.next_id(chunk.text),
            filename=str(filename),
            code=chunk.text,
            embedding=embedding,
            start_line=chunk.start.line,
            end_line=chunk.end.line,
        ),
    )
We use SentenceTransformerEmbedder with all-MiniLM-L6-v2; there are 12k+ sentence-transformer models on Hugging Face, so swap in whichever you prefer. chunk.start.line and chunk.end.line carry through, so search results point straight at the lines that matched.
app_main wires the source to the target. It mounts the Postgres table (with a vector index), walks the codebase, and mounts one processing component per file.
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(
            CodeEmbedding, primary_key=["id"],
        ),
    )
    target_table.declare_vector_index(column="embedding")
    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(
            included_patterns=["**/*.py", "**/*.rs", "**/*.toml", "**/*.md", "**/*.mdx"],
            excluded_patterns=["**/.*", "**/target", "**/node_modules"],
        ),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(process_file, files.items(), target_table)
mount_table_target creates and manages the Postgres table for you: schema, the pgvector index, idempotent upserts, and orphan cleanup when a file disappears. live=True makes the filesystem source watch for changes, and mount_each runs one component per file so the engine can track and update them independently.
Bind app_main into a coco.App and point it at the codebase root.
app = coco.App(
    coco.AppConfig(name="CodeEmbeddingV1"),
    app_main,
    sourcedir=pathlib.Path(__file__).parent / ".." / "..",  # index from repo root
)
That is the entire indexing path.
Run the cocoindex CLI to set up and update the index. Choose catch-up (scan, sync, exit) or live (catch up, then keep watching):
# Catch-up run
cocoindex update main
# Live run: keep watching for file changes
cocoindex update -L main
Match user text against the index with a plain SQL query, reusing the same embedder from the indexing flow so indexing and querying stay consistent.
async def query_once(pool, embedder, query: str, *, top_k: int = 5) -> None:
    query_vec = await embedder.embed(query)
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            f"""
            SELECT filename, code, embedding <=> $1 AS distance, start_line, end_line
            FROM "{TABLE_NAME}"
            ORDER BY distance ASC
            LIMIT $2
            """,
            query_vec, top_k,
        )
    for r in rows:
        score = 1.0 - float(r["distance"])
        print(f"[{score:.3f}] {r['filename']} (L{r['start_line']}-L{r['end_line']})")
        print(f" {r['code']}")
        print("---")
The <=> operator is pgvector's cosine distance. We turn it into a similarity score and print the filename, the matched line range, and the code snippet.
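For intuition, cosine distance is 1 minus cosine similarity, so subtracting it from 1 recovers the similarity. The pure-Python sketch below reproduces that arithmetic on toy 2-D vectors; the vectors and labels are made up, and in the real query Postgres computes the distance for us.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cos(angle between a and b): 0 = same direction, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
rows = {
    "same_direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# ORDER BY distance ASC: the closest vectors come first.
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))

# score = 1 - distance, the same conversion query_once applies.
scores = {name: 1.0 - cosine_distance(query, vec) for name, vec in rows.items()}
print(ranked)
print(scores)
```

Vector length does not matter, only direction, which is why `[2.0, 0.0]` scores as a perfect match for `[1.0, 0.0]`.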
python main.py "embedding"
The search results print in the terminal:
CocoIndex keeps the index in sync with the codebase and does the minimum work to get there. You never compute a diff or write update logic: you change something, and CocoIndex works out exactly what to embed, upsert, and delete. Two pieces make this work. @coco.fn(memo=True) decides what to recompute — a file is skipped when its content and the function's code are both unchanged, and an embedding is reused when its chunk text is unchanged. mount_table_target decides what to write — each row's id is derived from its chunk's content, so it upserts only the rows that actually changed and deletes rows whose source is gone.
The same machinery covers two kinds of change: changes to your data (the code being indexed) and changes to your logic (the pipeline itself).
Data changes.
When a file is edited, it is re-read and re-chunked. Unchanged chunks keep the same id and embedding, so they are left as-is; genuinely new chunks are embedded and inserted; chunks that no longer exist are deleted. Edit one function and only that function's chunks are re-embedded, even though the whole file was re-read.
Logic changes. A pipeline change is reconciled the same way: CocoIndex compares the new output against what is already in Postgres and applies only the difference.
Change the file filters (included_patterns / excluded_patterns): files that now match are added automatically; files that no longer match have their rows deleted automatically.
Swap the embedding model: only the embedding column is updated, with no rows added or removed.
A catch-up run (cocoindex update main) does this once and exits; live mode (cocoindex update -L main) keeps watching and applies each change with low latency, so the index stays current as you code.
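To see why content-derived ids keep writes small, here is a sketch of an id generator that hashes the chunk text. `ContentIdGenerator` is a hypothetical stand-in written for this post, not the real `IdGenerator`; namespacing by file path and counting duplicate chunks are our own assumptions about how collisions might be avoided.

```python
import hashlib
from collections import Counter


class ContentIdGenerator:
    """Hypothetical: stable ids from chunk text, with a counter for duplicates."""

    def __init__(self, namespace: str) -> None:
        self._namespace = namespace
        self._seen: Counter = Counter()

    def next_id(self, text: str) -> int:
        occurrence = self._seen[text]  # 0 for the first copy, 1 for the second, ...
        self._seen[text] += 1
        payload = f"{self._namespace}\0{occurrence}\0{text}".encode("utf-8")
        digest = hashlib.sha256(payload).digest()
        return int.from_bytes(digest[:8], "big") >> 1  # fits a signed 64-bit column


def ids_for(path: str, chunks: list[str]) -> dict[int, str]:
    gen = ContentIdGenerator(path)
    return {gen.next_id(chunk): chunk for chunk in chunks}


before = ids_for("main.py", ["def a(): ...", "def b(): ...", "def c(): ..."])
# Edit only function b.
after = ids_for("main.py", ["def a(): ...", "def b(): return 1", "def c(): ..."])

unchanged = before.keys() & after.keys()  # rows left as-is
to_insert = after.keys() - before.keys()  # rows to embed and upsert
to_delete = before.keys() - after.keys()  # rows whose chunk is gone
print(len(unchanged), len(to_insert), len(to_delete))
```

Editing one function changes exactly one id: the old row is deleted, the new one is inserted, and the other two rows are never touched.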
The full, runnable example is in the CocoIndex repo: examples/code_embedding.
If you'd rather not wire the pipeline yourself, CocoIndex Code is an end-to-end implementation of exactly this indexing, packaged as a CLI and an MCP server. It does the same thing this example does (AST-aware chunking, incremental re-index on file changes, local embeddings by default), hardened for production.
You can plug it straight into your coding agent or code-review agent:
As a skill: npx skills add cocoindex-io/cocoindex-code, then invoke /ccc.
As an MCP server: claude mcp add cocoindex-code -- ccc mcp (Codex, OpenCode, Cursor, and any MCP client work the same way).
From the CLI: ccc index to build the index, ccc search "where we embed chunks" to query it.