core/indexing/README.md
Continue uses a tagging system along with content addressing to ensure that nothing needs to be indexed twice. When you change branches, Continue will only re-index the files that are newly modified and that we don't already have a copy of. This system can be used across many different "artifacts" just by implementing the CodebaseIndex class.
artifact: something that is generated by indexing and then saved to be used later (e.g. embeddings, a full-text search index, or a table of top-level code snippets in each file)
cacheKey: a key that determines whether two files can be considered the same, so that re-indexing can be avoided (currently always a hash of the file contents)
CodebaseIndex: a class that makes it easy to use the indexing system to help you generate a new artifact
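To make the content-addressing idea concrete, here is a minimal sketch of how a cacheKey based on file contents lets an artifact be reused across branches. The names `cacheKeyFor`, `indexFile`, and `artifactsByCacheKey` are hypothetical, the choice of SHA-256 is an assumption, and the in-memory map only stands in for whatever storage a real index uses.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a cacheKey from file contents.
// SHA-256 is an assumption; the text only says "hash of file contents".
function cacheKeyFor(contents: string): string {
  return createHash("sha256").update(contents, "utf8").digest("hex");
}

// Hypothetical in-memory stand-in for stored artifacts:
// cacheKey -> tags (here just branch names) that reference the artifact.
const artifactsByCacheKey = new Map<string, string[]>();

// Returns true if the artifact had to be computed,
// false if an existing artifact was reused and only tagged.
function indexFile(contents: string, tag: string): boolean {
  const key = cacheKeyFor(contents);
  const tags = artifactsByCacheKey.get(key);
  if (tags) {
    if (!tags.includes(tag)) {
      tags.push(tag); // same contents seen before: only attach the new tag
    }
    return false;
  }
  artifactsByCacheKey.set(key, [tag]);
  return true;
}

// The same file on two branches is computed once; a changed file is computed again.
const computedOnMain = indexFile("export const x = 1;\n", "main");
const computedOnFeature = indexFile("export const x = 1;\n", "feature");
const computedAfterEdit = indexFile("export const x = 2;\n", "feature");
```

In this sketch, switching from `main` to `feature` costs one new computation (the edited file) rather than two, which is the behaviour the introduction describes.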
The indexing process does the following:
The resulting lists of files are passed to the CodebaseIndex so that it can update whatever index-specific storage it might have. Many indexes use SQLite and/or LanceDB. The CodebaseIndex implements a method called "update" that accepts the four lists and yields progress updates as it iterates over them. These progress updates are used to officially mark a file as having been indexed, so that if the extension is closed mid-indexing we don't falsely record progress.
CodebaseIndexes
All indexes must be returned by getIndexesToBuild in CodebaseIndexer.ts if they are to be used.
CodeSnippetsCodebaseIndex: uses tree-sitter queries to get a list of functions, classes, and other top-level code objects in each file
FullTextSearchCodebaseIndex: creates a full-text search index using SQLite FTS5
ChunkCodebaseIndex: chunks files recursively by code structure, for use by other indexes that compute embeddings, such as LanceDbIndex
LanceDbIndex: calculates embeddings for each chunk and adds them to the LanceDB vector database, with metadata going into SQLite. Note that for each branch, a unique table is created in LanceDB.
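The "update" contract described earlier (accept four lists, yield a progress update per file) can be sketched with a toy index. Everything here is an assumption for illustration: the list names (`compute`, `del`, `addTag`, `removeTag`), the `FileRef` and `ProgressUpdate` shapes, and the `ToyIndex` class are hypothetical, and a synchronous generator is used for brevity where an index doing real I/O would more likely use an async one.

```typescript
// All names below are hypothetical; the text only says that four lists exist.
interface FileRef {
  path: string;
  cacheKey: string;
}

interface RefreshLists {
  compute: FileRef[]; // no artifact exists yet, so it must be generated
  del: FileRef[]; // no tag needs the artifact any more, so delete it
  addTag: FileRef[]; // artifact already exists, so only attach the tag
  removeTag: FileRef[]; // artifact is kept for other tags, so only detach the tag
}

interface ProgressUpdate {
  path: string;
  progress: number; // fraction of files handled so far
}

class ToyIndex {
  // cacheKey -> set of tags that reference the artifact
  readonly rows = new Map<string, Set<string>>();

  *update(tag: string, lists: RefreshLists): Generator<ProgressUpdate> {
    const total =
      lists.compute.length + lists.del.length + lists.addTag.length + lists.removeTag.length;
    let done = 0;
    const report = (file: FileRef): ProgressUpdate => {
      done += 1;
      return { path: file.path, progress: done / total };
    };

    for (const file of lists.compute) {
      this.rows.set(file.cacheKey, new Set([tag]));
      yield report(file); // yielded only after the work for this file is done
    }
    for (const file of lists.addTag) {
      this.rows.get(file.cacheKey)?.add(tag);
      yield report(file);
    }
    for (const file of lists.removeTag) {
      this.rows.get(file.cacheKey)?.delete(tag);
      yield report(file);
    }
    for (const file of lists.del) {
      this.rows.delete(file.cacheKey);
      yield report(file);
    }
  }
}

// The caller records a file as indexed only once its update has been yielded.
// `stopAfter` simulates the extension being closed mid-indexing.
function runIndexing(
  index: ToyIndex,
  tag: string,
  lists: RefreshLists,
  stopAfter: number = Infinity,
): string[] {
  const markedComplete: string[] = [];
  for (const progressUpdate of index.update(tag, lists)) {
    markedComplete.push(progressUpdate.path);
    if (markedComplete.length >= stopAfter) {
      break;
    }
  }
  return markedComplete;
}

const lists: RefreshLists = {
  compute: [
    { path: "a.ts", cacheKey: "k1" },
    { path: "b.ts", cacheKey: "k2" },
  ],
  del: [],
  addTag: [],
  removeTag: [],
};
const interruptedIndex = new ToyIndex();
const interrupted = runIndexing(interruptedIndex, "main", lists, 1);
const completed = runIndexing(new ToyIndex(), "main", lists);
```

Because the generator yields after each file's work is finished, an interrupted run leaves only the already-yielded files recorded; the rest are simply picked up again on the next run.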
FullTextSearchCodebaseIndex doesn't differentiate between tags (branch, repo), so results may come from any branch/repo. LanceDbIndex does differentiate, by creating a separate table for each tag (see tableNameForTag). The chunk index does so with a second table.
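As a rough illustration of the table-per-tag approach, the sketch below keeps one in-memory "table" per tag so that a search on one branch never sees rows from another. This is not the real tableNameForTag: the `Tag` fields, `sketchTableName`, and the sanitising rule are all assumptions made for the example.

```typescript
// Hypothetical tag shape; the field names are assumptions.
interface Tag {
  directory: string;
  branch: string;
  artifactId: string;
}

// Hypothetical stand-in for a per-tag table name (not the real tableNameForTag).
// Characters outside [a-zA-Z0-9_] are replaced with underscores.
function sketchTableName(tag: Tag): string {
  const raw = `${tag.directory}::${tag.branch}::${tag.artifactId}`;
  return raw.replace(/[^a-zA-Z0-9_]/g, "_");
}

// One "table" per tag, modelled as a map from table name to rows.
const tables = new Map<string, string[]>();

function insertChunk(tag: Tag, chunk: string): void {
  const name = sketchTableName(tag);
  const rows = tables.get(name) ?? [];
  rows.push(chunk);
  tables.set(name, rows);
}

// A search only reads the table that belongs to the given tag.
function searchChunks(tag: Tag, term: string): string[] {
  const rows = tables.get(sketchTableName(tag)) ?? [];
  return rows.filter((chunk) => chunk.includes(term));
}

const mainTag: Tag = { directory: "/repo", branch: "main", artifactId: "vectordb" };
const featureTag: Tag = { directory: "/repo", branch: "feature", artifactId: "vectordb" };

insertChunk(mainTag, "function foo() {}");
insertChunk(featureTag, "function fooBar() {}");

const mainHits = searchChunks(mainTag, "foo");
const featureHits = searchChunks(featureTag, "foo");
```

One caveat of a sanitising scheme like this is that distinct tags can collide (for example, branch names `feat/x` and `feat_x` map to the same string), so a real implementation would need to account for that.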