"How did I fix the auth bug" finds the right session even with zero shared keywords — in plain async Python.
</p> <p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/entire-session-search/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center"> </div>

Entire captures every AI coding session you run as checkpoints on disk: the full transcript, the prompt you started from, an AI-written context summary, and metadata like token counts and files touched. This pipeline turns that folder into a vector index you can search in plain English. You declare the transformation in native Python and your own types, `target_state = transformation(source_state)`, and the heavy lifting (incremental processing, change tracking, managed targets) runs in a Rust engine underneath, so each new session you capture only embeds what changed.

A checkpoint folder holds four file types, and `process_file` routes on the name: `full.jsonl` is parsed into per-turn transcript chunks, `prompt.txt` is embedded whole, `context.md` is split into overlapping chunks, and `metadata.json` becomes a structured row in a second table. The transcript and context paths fan out to many rows via `coco.map(process_chunk, ...)`; the prompt is a single short string embedded inline. Read it in `main.py`:

```python
@coco.fn(memo=True)
async def process_file(file, emb_table, meta_table) -> None:
    info = extract_session_info(file)
    filename = file.file_path.path.name
    id_gen = IdGenerator()
    if filename == "full.jsonl":
        chunks = parse_transcript(await file.read_text())
        await coco.map(process_chunk,
            [ChunkInput(text=c.text, content_type="transcript", role=c.role) for c in chunks],
            info, id_gen, emb_table)
    elif filename == "prompt.txt":
        text = (await file.read_text()).strip()
        if text:
            emb_table.declare_row(row=SessionEmbeddingRow(
                id=await id_gen.next_id(text), ..., content_type="prompt", role="user",
                text=text, embedding=await coco.use_context(EMBEDDER).embed(text)))
    elif filename == "context.md":
        ...  # split into chunks, then coco.map(process_chunk, ..., content_type="context")
    elif filename == "metadata.json":
        meta = json.loads(await file.read_text())
        meta_table.declare_row(row=SessionMetadataRow(..., total_tokens=..., files_touched=...))
```

Three content types and a structured record, all from one component. Each embedding row's id is derived from its text, so a turn that survives a re-parse keeps its row.
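
To see why a text-derived id lets an unchanged turn keep its row, here is a minimal standalone sketch. It is not CocoIndex's `IdGenerator`; `stable_id`, `index_turns`, and `diff_rows` are hypothetical helpers, and hashing the session id, content type, and text with SHA-256 is an assumption made only for illustration.

```python
import hashlib


def stable_id(session_id: str, content_type: str, text: str) -> str:
    """Hypothetical helper: hash the identifying fields into a fixed-length id."""
    digest = hashlib.sha256()
    for part in (session_id, content_type, text):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return digest.hexdigest()[:16]


def index_turns(session_id: str, turns: list[str]) -> dict[str, str]:
    """Map each turn's id to its text, as one parse of a transcript would."""
    return {stable_id(session_id, "transcript", t): t for t in turns}


def diff_rows(old: dict[str, str], new: dict[str, str]) -> tuple[set[str], set[str], set[str]]:
    """Return the ids that are kept, inserted, and deleted between two parses."""
    kept = set(old.keys() & new.keys())
    inserted = set(new.keys() - old.keys())
    deleted = set(old.keys() - new.keys())
    return kept, inserted, deleted


# First parse of a session, then a re-parse where the second turn was edited.
first = index_turns("s1", ["fix the login redirect", "add a retry to the token refresh"])
second = index_turns("s1", ["fix the login redirect", "add a retry with backoff to the token refresh"])

kept, inserted, deleted = diff_rows(first, second)
print(len(kept), len(inserted), len(deleted))  # 1 1 1
```

The unchanged turn hashes to the same id on both parses, so only the edited turn shows up as one insert and one delete. A real generator also has to handle two identical texts in the same session, which this sketch would collapse into a single id.
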
The step-by-step walkthrough covers the two row shapes, the per-filename routing, the chunk fan-out, and the query.
</p>

- One `included_patterns` list pulls `full.jsonl`, `prompt.txt`, `context.md`, and `metadata.json` into the same `process_file`, which routes on the name, so there are no four separate pipelines.
- `@coco.fn(memo=True)` skips a file whose content and code are unchanged, so a finished session is never re-embedded; because each id is derived from text, only genuinely new turns are inserted and vanished turns are deleted.
- `live=True`: pass `-L` and new sessions are picked up and embedded as they're written.
- `all-MiniLM-L6-v2` embedder, no API key; swap `EMBED_MODEL` for any of the 12k+ sentence-transformer models on Hugging Face.

1. Start Postgres + pgvector (the repo ships a compose file):

```sh
docker compose -f ../../dev/postgres.yaml up -d
```

2. Configure & install:

```sh
cp .env.example .env   # set POSTGRES_URL (schema/table names are optional overrides)
pip install -e .
```

3. Check out some checkpoints from any repo where Entire is capturing sessions:

```sh
git worktree add entire_checkpoints entire/checkpoints/v1
```

Each session is laid out as `<checkpoint_id[:2]>/<checkpoint_id[2:]>/<session_idx>/` with `full.jsonl`, `prompt.txt`, `context.md`, and `metadata.json`.

4. Build the index, either catch-up (scan, sync, exit) or live (catch up, then keep watching for new sessions):

```sh
cocoindex update main       # catch-up
cocoindex update -L main    # live
```

5. Search from the command line:

```sh
python main.py "how did I fix the auth bug"
```

Results print which session and content type matched, so a transcript turn, a prompt, and a context chunk are all distinguishable. This example keeps it minimal and doesn't declare a vector index, so queries do a sequential scan, which is fine for a personal history. For a larger corpus, add `emb_table.declare_vector_index(column="embedding")` exactly as Semantic Search 101 does.
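
To make the cost of a sequential scan concrete, the sketch below ranks rows the brute-force way: score the query against every stored embedding, then sort. In the actual example the scan happens inside Postgres, so this is only an illustration. The `Row` shape, the toy three-dimensional vectors, and the choice of cosine similarity are all assumptions, not taken from `main.py`.

```python
import math
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical, trimmed-down stand-in for an embedding row.
    session_id: str
    content_type: str
    text: str
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sequential_scan(query: list[float], rows: list[Row], top_k: int = 2) -> list[tuple[float, Row]]:
    """Score every row against the query, then keep the best top_k."""
    scored = [(cosine_similarity(query, row.embedding), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: real embeddings have hundreds of dimensions.
rows = [
    Row("s1", "transcript", "rotated the expired signing key", [0.9, 0.1, 0.0]),
    Row("s2", "prompt", "rename the config module", [0.0, 1.0, 0.0]),
    Row("s1", "context", "login failures traced to token expiry", [0.5, 0.5, 0.0]),
]
query = [1.0, 0.0, 0.0]

results = sequential_scan(query, rows)
for score, row in results:
    print(f"{score:.3f}  {row.session_id}  {row.content_type}  {row.text}")
```

Every query touches every row, so the work grows linearly with the number of stored chunks. That is acceptable for one person's session history, and it is the cost a vector index is meant to avoid on a larger corpus.
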
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/entire-session-search/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>