# Entire Session Search

Semantic search over AI coding sessions captured by Entire, powered by CocoIndex.

## What it does

- Reads Entire checkpoint data (transcripts, prompts, context summaries, metadata)
- Chunks and embeds text with SentenceTransformers, stores it in Postgres with pgvector
- Provides cosine-similarity search across all your AI coding sessions

Because CocoIndex is incremental, re-running after new sessions only processes what changed.
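
Search results are ranked by cosine similarity between the query embedding and each stored chunk embedding. As a refresher, the metric is just the normalized dot product — a pure-Python sketch, not the example's actual code:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Normalized dot product of two equal-length vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```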

## Prerequisites

- Postgres with the pgvector extension
- Entire installed, with some captured sessions
- Python 3.11+

## Setup

1. Check out the Entire checkpoint data:

   ```sh
   # From any repo where Entire is capturing sessions
   git worktree add entire_checkpoints entire/checkpoints/v1
   ```

2. Install deps:

   ```sh
   pip install -e .
   ```

3. Set env vars (or edit `.env`):

   ```sh
   # .env
   COCOINDEX_DB=./cocoindex.db
   POSTGRES_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
   ```

## Run

Build the index:

```sh
cocoindex update main.py
```

Search your sessions:

```sh
python main.py "how did I fix the auth bug"
```

Or start an interactive search:

```sh
python main.py
```
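
Under the hood, a pgvector search like this boils down to a nearest-neighbor query. A sketch of what it can look like in SQL — table and schema names follow the Configuration defaults below, but the column names (`embedding`, `text`) are assumptions; check `main.py` for the real schema:

```sql
-- `<=>` is pgvector's cosine-distance operator (0 = identical direction).
-- $1 is the query embedding, produced by the same SentenceTransformer model.
SELECT text, 1 - (embedding <=> $1) AS similarity
FROM entire.session_embeddings
ORDER BY embedding <=> $1
LIMIT 10;
```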

## Configuration

| Variable | Default | Description |
|---|---|---|
| `COCOINDEX_DB` | `./cocoindex.db` | SQLite path for CocoIndex internal state |
| `POSTGRES_URL` | `postgres://cocoindex:cocoindex@localhost/cocoindex` | Postgres connection for embedding/metadata tables |
| `TABLE_EMBEDDINGS` | `session_embeddings` | Embeddings table name |
| `TABLE_METADATA` | `session_metadata` | Metadata table name |
| `PG_SCHEMA_NAME` | `entire` | Postgres schema |
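
These are plain environment variables, so loading them with their defaults is one dictionary — a sketch mirroring the table above, not necessarily the example's exact code:

```python
import os

# Defaults mirror the Configuration table; override via the environment or .env.
CONFIG = {
    "COCOINDEX_DB": os.environ.get("COCOINDEX_DB", "./cocoindex.db"),
    "POSTGRES_URL": os.environ.get(
        "POSTGRES_URL", "postgres://cocoindex:cocoindex@localhost/cocoindex"
    ),
    "TABLE_EMBEDDINGS": os.environ.get("TABLE_EMBEDDINGS", "session_embeddings"),
    "TABLE_METADATA": os.environ.get("TABLE_METADATA", "session_metadata"),
    "PG_SCHEMA_NAME": os.environ.get("PG_SCHEMA_NAME", "entire"),
}
```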

## How it works

```mermaid
graph TD
    checkpoints[Entire Checkpoints] --> walk[walk_dir]
    walk --> mount_each[mount_each]
    mount_each --> process_file[<b>process_file</b>]
    process_file -->|full.jsonl| parse[parse_transcript]
    process_file -->|prompt.txt| embed_prompt[embed directly]
    process_file -->|context.md| split[RecursiveSplitter]
    process_file -->|metadata.json| meta_table[(session_metadata)]
    parse --> embed[SentenceTransformer]
    embed_prompt --> embed
    split --> embed
    embed --> emb_table[(session_embeddings)]
```
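
The `parse_transcript` step turns `full.jsonl` (one JSON object per line) into embeddable text. A minimal sketch, assuming each line carries `role` and `content` fields — the actual field names depend on the transcript format Entire writes:

```python
import json

def parse_transcript(jsonl_text: str) -> list[str]:
    """Flatten a JSONL transcript into 'role: content' strings, skipping blanks."""
    chunks = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        # Field names are illustrative assumptions; adapt to the real schema.
        role = event.get("role", "unknown")
        content = event.get("content", "")
        if content:
            chunks.append(f"{role}: {content}")
    return chunks

sample = (
    '{"role": "user", "content": "fix the auth bug"}\n'
    '{"role": "assistant", "content": "done"}\n'
)
print(parse_transcript(sample))  # → ['user: fix the auth bug', 'assistant: done']
```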

## Entire checkpoint layout

```text
<checkpoint_id[:2]>/<checkpoint_id[2:]>/<session_idx>/
  ├── metadata.json     # token counts, files touched, timestamps
  ├── full.jsonl        # conversation transcript
  ├── prompt.txt        # user's initial prompt
  ├── context.md        # AI-generated session summary
  └── content_hash.txt  # content fingerprint (skipped)
```

Note on `metadata.json`: if one prompt spans multiple commits, each commit gets its own checkpoint carrying the same token data, so don't sum token counts across checkpoints.
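
Reconstructing a checkpoint ID from this split layout is just rejoining the first two path segments. An illustrative sketch (not the example's code):

```python
from pathlib import Path
import tempfile

def iter_checkpoints(root: Path):
    """Yield (checkpoint_id, session_dir) for the <id[:2]>/<id[2:]>/<session_idx>/ layout."""
    for prefix in sorted(p for p in root.iterdir() if p.is_dir()):
        for rest in sorted(p for p in prefix.iterdir() if p.is_dir()):
            checkpoint_id = prefix.name + rest.name  # rejoin the split ID
            for session in sorted(p for p in rest.iterdir() if p.is_dir()):
                yield checkpoint_id, session

# Demo against a throwaway layout: checkpoint "abc123", session 0.
root = Path(tempfile.mkdtemp())
(root / "ab" / "c123" / "0").mkdir(parents=True)
print(list(iter_checkpoints(root)))  # one ('abc123', .../ab/c123/0) pair
```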