If possible, we recommend using voyage-code-3, which will give the most accurate answers of any existing embeddings model for code. You can obtain an API key from Voyage AI. Because their API is OpenAI-compatible, you can use any OpenAI client by swapping out the URL.
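As a rough sketch of what such a request looks like, the snippet below builds a call to the embeddings endpoint using only the Python standard library. The `/v1/embeddings` path and the `input`/`model` fields follow the OpenAI embeddings convention; the helper names `build_embedding_request` and `embed` are illustrative, and the response shape is assumed to match OpenAI's, so check Voyage's documentation before relying on it.

```python
import json
import os
import urllib.request

VOYAGE_URL = "https://api.voyageai.com/v1/embeddings"


def build_embedding_request(texts, api_key, model="voyage-code-3"):
    """Build (but do not send) an OpenAI-style embeddings request."""
    body = json.dumps({"input": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        VOYAGE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def embed(texts, api_key=None):
    """Send the request and return one embedding (list of floats) per input."""
    api_key = api_key or os.environ["VOYAGE_API_KEY"]
    request = build_embedding_request(texts, api_key)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumed OpenAI-style response shape: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in payload["data"]]
```

Keeping request construction separate from sending makes it possible to inspect or test the payload without spending API credits.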
Many vector databases are available, and most of them can handle large codebases with good performance, so we recommend choosing one based on ease of setup and experimentation.
LanceDB is a good choice for this because it can run in-memory with libraries for both Python and Node.js. This means that in the beginning you can focus on writing code rather than setting up infrastructure. If you have already chosen a vector database, then using this instead of LanceDB is also a fine choice.
Most embeddings models can only handle a limited amount of text at once. To get around this, we "chunk" our code into smaller pieces.
If you use voyage-code-3, it has a maximum context length of 16,000 tokens, which is enough to fit most files. This means that in the beginning you can get away with a more naive strategy of truncating files that exceed the limit. In order of easiest to most comprehensive, 3 chunking strategies you can use are:
As usual in this guide, we recommend starting with the strategy that gives 80% of the benefit with 20% of the effort.
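The two simplest approaches might look like the sketch below: truncation, as mentioned above, and splitting a file into fixed-size windows of lines, which is assumed here as a natural next step. Both helpers are hypothetical, and they measure size in characters or lines rather than tokens, so the limits you pass in should leave headroom below the model's real token limit.

```python
def truncate(text, max_chars):
    """Keep only the first `max_chars` characters, giving one chunk per file."""
    return text[:max_chars]


def chunk_by_lines(text, max_lines=50, overlap=5):
    """Split a file into windows of at most `max_lines` lines.

    Consecutive chunks share `overlap` lines so that code near a boundary
    keeps some of its surrounding context.
    """
    if not 0 <= overlap < max_lines:
        raise ValueError("overlap must be in [0, max_lines)")
    lines = text.splitlines()
    step = max_lines - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + max_lines]))
        if start + max_lines >= len(lines):
            break  # this window already reached the end of the file
    return chunks
```

Truncation is the least effort and may be enough at first; line-based windows avoid silently dropping the tail of long files at the cost of producing more chunks to embed.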
Indexing, in which we will insert your code into the vector database in a retrievable format, happens in three steps: (1) chunking, (2) generating embeddings, and (3) inserting into the vector database.
With LanceDB, we can do steps 2 and 3 simultaneously, as demonstrated in their docs. If you are using Voyage AI, for example, it would be configured like this:
```python
import os

import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("/tmp/db")

# Voyage's API is OpenAI-compatible, so the "openai" embedding function
# can be pointed at it by overriding the base URL.
func = get_registry().get("openai").create(
    name="voyage-code-3",
    base_url="https://api.voyageai.com/v1/",
    api_key=os.environ["VOYAGE_API_KEY"],
)


class CodeChunks(LanceModel):
    filename: str
    text: str = func.SourceField()
    # 1024 is the default dimension for `voyage-code-3`: https://docs.voyageai.com/docs/embeddings#model-choices
    vector: Vector(1024) = func.VectorField()


table = db.create_table("code_chunks", schema=CodeChunks, mode="overwrite")
table.add([
    {"text": "print('hello world!')", "filename": "hello.py"},
    {"text": "print('goodbye world!')", "filename": "goodbye.py"},
])

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(CodeChunks)[0]
print(actual.text)
```
Regardless of which database or model you have chosen, your script should iterate over all of the files that you wish to index, chunk them, generate embeddings for each chunk, and then insert all of the chunks into your vector database.
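A skeleton of such a script could look like the following. It is a sketch rather than a complete indexer: `chunk_fn` and `embed_fn` are placeholders for whichever chunking strategy and embeddings client you chose, and the row layout (`filename`, `text`, `vector`) is an assumption that mirrors the table schema used earlier.

```python
from pathlib import Path


def iter_source_files(root, extensions=(".py",)):
    """Yield every file under `root` whose suffix is in `extensions`."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            yield path


def build_rows(root, chunk_fn, embed_fn, extensions=(".py",)):
    """Chunk and embed every matching file, returning rows ready for insertion."""
    rows = []
    for path in iter_source_files(root, extensions):
        text = path.read_text(encoding="utf-8", errors="ignore")
        chunks = [c for c in chunk_fn(text) if c.strip()]
        if not chunks:
            continue  # skip empty files
        vectors = embed_fn(chunks)  # expected: one vector per chunk, same order
        for chunk, vector in zip(chunks, vectors):
            rows.append({
                "filename": str(path.relative_to(root)),
                "text": chunk,
                "vector": vector,
            })
    return rows
```

With a store like LanceDB, the resulting rows could be passed to `table.add`. If the table computes embeddings itself through a configured embedding function, the `vector` field and `embed_fn` can be dropped.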
That said, we highly recommend first building and testing the pipeline before attempting this. Unless your codebase is being entirely rewritten frequently, an incremental refresh of the index is likely to be sufficient and reasonably cheap.
At this point, you've written your indexing script and tested that you can make queries from your vector database. Now, you'll want a plan for when to run the indexing script.
In the beginning, you should probably run it by hand. Once you are confident that your custom RAG is providing value and is ready for the long term, you can set up a cron job to run it periodically. Because codebases change little over short time frames, you won't need to re-index more than once a day. Once per week, or even once per month, is probably sufficient.
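One way to keep periodic runs cheap is an incremental refresh that re-embeds only the files whose contents changed since the last run. The sketch below assumes a JSON manifest of SHA-256 digests stored next to the index; the manifest format and the function names are hypothetical, not part of any particular database.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def plan_refresh(root, manifest_path, extensions=(".py",)):
    """Compare files under `root` with the saved manifest.

    Returns (changed, removed, new_manifest), where `changed` lists files that
    are new or modified and `removed` lists files that no longer exist.
    """
    manifest_path = Path(manifest_path)
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            new[str(path.relative_to(root))] = file_digest(path)
    changed = [name for name, digest in new.items() if old.get(name) != digest]
    removed = [name for name in old if name not in new]
    return changed, removed, new


def save_manifest(manifest_path, manifest):
    """Persist the manifest once the index has been updated successfully."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
```

A scheduled job would then delete the chunks belonging to `changed` and `removed` files, re-index only the `changed` ones, and call `save_manifest` at the end so that a failed run is retried next time.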
To integrate your custom RAG system with Continue, you'll create an MCP (Model Context Protocol) server. MCP provides a standardized way for AI tools to access external resources.
Here's a reference implementation using Python that queries your vector database:

```python
"""Custom RAG MCP server for code retrieval"""
import lancedb
from mcp.server.fastmcp import FastMCP
from mcp.types import TextContent

# Initialize your vector database connection
db = lancedb.connect("/path/to/your/db")
table = db.open_table("code_chunks")

# FastMCP provides the `@tool()` decorator used below
app = FastMCP("custom-rag-server")


@app.tool()
async def search_codebase(query: str, limit: int = 10) -> list[TextContent]:
    """
    Search the codebase using vector similarity.

    Args:
        query: The search query
        limit: Maximum number of results to return
    """
    # Query your vector database
    results = table.search(query).limit(limit).to_list()

    # Format results for Continue
    formatted_results = []
    for result in results:
        formatted_results.append(TextContent(
            type="text",
            text=f"File: {result['filename']}\n\n{result['text']}"
        ))
    return formatted_results


@app.tool()
async def get_file_context(filename: str) -> list[TextContent]:
    """
    Get all chunks from a specific file.

    Args:
        filename: The name of the file to retrieve
    """
    # Escape single quotes so the filename cannot break out of the SQL filter
    safe_filename = filename.replace("'", "''")
    results = table.search().where(f"filename = '{safe_filename}'").to_list()
    return [TextContent(
        type="text",
        text="\n".join([r['text'] for r in results])
    )]


if __name__ == "__main__":
    # Serve over stdio, which is FastMCP's default transport
    app.run()
```
Add your MCP server to Continue's configuration:
config.yaml:

```yaml
mcpServers:
  - name: custom-rag
    command: python
    args:
      - /path/to/your/mcp_server.py
    env:
      VOYAGE_API_KEY: ${VOYAGE_API_KEY}
```

config.json:

```json
{
  "mcpServers": [
    {
      "name": "custom-rag",
      "command": "python",
      "args": ["/path/to/your/mcp_server.py"],
      "env": {
        "VOYAGE_API_KEY": "${VOYAGE_API_KEY}"
      }
    }
  ]
}
```
If you'd like to improve the quality of your results, a great first step is to add reranking. This involves retrieving a larger initial pool of results from the vector database, and then using a reranking model to order them from most to least relevant. This works because the reranking model can perform a slightly more expensive calculation on the small set of top results, and so can give a more accurate ordering than similarity search, which has to search over all entries in the database.
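If you wish to return 10 total results for each query, for example, then you would retrieve a larger pool of candidates from the vector database using similarity search, send those candidates to the reranker along with the query to get a relevance score for each, and then return the 10 highest-scoring results.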
We recommend using the rerank-2 model from Voyage AI; see their documentation for usage examples.
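The retrieve-then-rerank flow can be sketched independently of any particular reranker. In the snippet below, `retrieve_fn` stands in for your vector search and `score_fn` for a call to the reranking model; both are hypothetical callables, the pool size of 50 is an arbitrary assumption, and `toy_overlap_score` is only a word-overlap stand-in so the example runs without an API key.

```python
def rerank_search(query, retrieve_fn, score_fn, top_k=10, pool_size=50):
    """Retrieve a large candidate pool, then keep the `top_k` best by reranker score."""
    candidates = retrieve_fn(query, pool_size)
    scores = score_fn(query, candidates)  # one relevance score per candidate
    # Sort candidate indices by score, highest first; ties keep retrieval order
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]


def toy_overlap_score(query, candidates):
    """Stand-in for a real reranker: counts words shared with the query."""
    query_words = set(query.lower().split())
    return [len(query_words & set(c.lower().split())) for c in candidates]
```

In a real setup, `score_fn` would wrap the reranker API and return its relevance scores in the same order as the candidates it was given.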