Indexing

Continue uses a tagging system along with content addressing to ensure that nothing needs to be indexed twice. When you change branches, Continue will only re-index the files that are newly modified and that we don't already have a copy of. This system can be used across many different "artifacts" just by implementing the CodebaseIndex class.

artifact: something that is generated by indexing and then saved to be used later (e.g. embeddings, a full-text search index, or a table of top-level code snippets in each file)

cacheKey: a key that determines whether two files can be considered the same, so that re-indexing can be avoided (currently always a hash of the file contents)

CodebaseIndex: a class that makes it easy to use the indexing system to help you generate a new artifact
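
As a rough illustration of content addressing, a cacheKey could be derived as in the sketch below. The function name `cacheKeyFor` and the choice of SHA-256 are assumptions made for this example; the text above only states that the key is a hash of the file contents.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: SHA-256 is an assumption made for illustration.
// The only documented property is that the key is a hash of the contents.
function cacheKeyFor(contents: string): string {
  return createHash("sha256").update(contents).digest("hex");
}

// The same contents on two branches produce the same key, so the
// artifact computed for one branch can be reused on the other.
const onMain = cacheKeyFor("export const x = 1;\n");
const onFeature = cacheKeyFor("export const x = 1;\n");
const edited = cacheKeyFor("export const x = 2;\n");

console.log(onMain === onFeature); // identical contents share a key
console.log(onMain === edited); // edited contents get a different key
```

Because the key depends only on the contents, renaming a file or switching branches does not by itself force the artifact to be recomputed.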

The indexing process does the following:

Check the modified timestamps of all files in the repo (this may seem extreme, but checking timestamps is significantly faster than actually reading a file. Git does the same thing.)
Compare these to a "catalog" (stored in SQLite) of the last time that we indexed each of these files to get a list of files to "add" or "remove". If the file exists in the repo but not in the catalog, then we must "add" the file. If it exists in the catalog but not the repo, we must "remove" the file. If it exists in both and was modified after last indexed, then we must update the file. In this case we also add it to the "add" list.
For each file to "add", check whether it was indexed on another branch. Here we use a SQLite table that acts as a cache for indexed files. If we find an entry in this table for a file with the same cacheKey, then we only need to add a tag to this entry for the current branch ("addTag"). Otherwise, we must "compute" the artifact.
For each file to "remove", check whether it was indexed on another branch. If we find only one entry with the same cacheKey (presumably the entry for the current branch, or something has gone wrong), then no other branch needs the artifact, so we "delete" it. If there is more than one tag on this artifact, then we just remove the tag for this branch ("removeTag").
After having calculated these four lists of files ("compute", "delete", "addTag", "removeTag"), we pass them to the CodebaseIndex so that it can update whatever index-specific storage it might have. Many of them use SQLite and/or LanceDB. The CodebaseIndex implements a method called "update" that accepts the four lists and yields progress updates as it iterates over the lists. These progress updates are used to officially mark a file as having been indexed, so that if the extension is closed mid-indexing we don't falsely record progress.
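
The steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual implementation: the catalog and the tag cache are stored in SQLite, whereas here they are plain objects, and the names `diffAgainstCatalog`, `planLists`, `PathAndCacheKey`, and `Plan` are hypothetical.

```typescript
// Hypothetical shapes for this sketch; the real catalog and cache are SQLite tables.
type PathAndCacheKey = { path: string; cacheKey: string };
type Plan = {
  compute: PathAndCacheKey[];
  del: PathAndCacheKey[]; // the "delete" list ("delete" is a reserved word)
  addTag: PathAndCacheKey[];
  removeTag: PathAndCacheKey[];
};

// Compare modification times in the repo with last-indexed times in the catalog.
function diffAgainstCatalog(
  repo: Record<string, number>, // path -> last modified timestamp
  catalog: Record<string, number>, // path -> last indexed timestamp
): { add: string[]; remove: string[] } {
  const add: string[] = [];
  const remove: string[] = [];
  for (const path of Object.keys(repo)) {
    // New files and files modified since they were indexed both go to "add".
    if (!(path in catalog) || repo[path] > catalog[path]) add.push(path);
  }
  for (const path of Object.keys(catalog)) {
    if (!(path in repo)) remove.push(path);
  }
  return { add, remove };
}

// Split "add" and "remove" into the four lists using the tags per cacheKey.
function planLists(
  add: PathAndCacheKey[],
  remove: PathAndCacheKey[],
  tagsByCacheKey: Record<string, string[] | undefined>,
): Plan {
  const plan: Plan = { compute: [], del: [], addTag: [], removeTag: [] };
  for (const item of add) {
    const tags = tagsByCacheKey[item.cacheKey] ?? [];
    // Already indexed under some tag: reuse the artifact and only tag it.
    if (tags.length > 0) plan.addTag.push(item);
    else plan.compute.push(item);
  }
  for (const item of remove) {
    const tags = tagsByCacheKey[item.cacheKey] ?? [];
    // Other tags still need the artifact: drop only this branch's tag.
    if (tags.length > 1) plan.removeTag.push(item);
    else plan.del.push(item);
  }
  return plan;
}

// Usage: a.ts is new, b.ts was modified, d.ts and e.ts were removed.
const diff = diffAgainstCatalog(
  { "a.ts": 5, "b.ts": 20, "c.ts": 7 },
  { "b.ts": 10, "c.ts": 7, "d.ts": 3, "e.ts": 3 },
);
const cacheKeys: Record<string, string> = {
  "a.ts": "k1", "b.ts": "k2", "d.ts": "k3", "e.ts": "k4",
};
const withKey = (path: string): PathAndCacheKey => ({ path, cacheKey: cacheKeys[path] });
const plan = planLists(diff.add.map(withKey), diff.remove.map(withKey), {
  k1: ["main"], // a.ts was already indexed on another branch
  k3: ["main", "feature"], // d.ts is still needed by another branch
  k4: ["feature"], // e.ts is only tagged by the current branch
});
console.log(plan);
```

In this example, b.ts ends up in "compute", a.ts in "addTag", d.ts in "removeTag", and e.ts in "delete". Only the "compute" list requires the expensive work of generating an artifact.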

Existing `CodebaseIndex`es

All indexes must be returned by getIndexesToBuild in CodebaseIndexer.ts if they are to be used.

CodeSnippetsCodebaseIndex: uses tree-sitter queries to get a list of functions, classes, and other top-level code objects in each file
FullTextSearchCodebaseIndex: creates a full-text search index using SQLite FTS5
ChunkCodebaseIndex: chunks files recursively by code structure, for use in other embeddings providers like LanceDbIndex
LanceDbIndex: calculates embeddings for each chunk and adds them to the LanceDB vector database, with metadata going into SQLite. Note that for each branch, a unique table is created in LanceDB.

Known problems

FullTextSearchCodebaseIndex doesn't differentiate between tags (branch, repo), so results may come from any branch/repo. LanceDB does differentiate, by creating a separate table for each tag (see tableNameForTag). The chunk index does so with a second table.
