doc/development/ai_features/semantic_search.md
Semantic Search is a GitLab framework that uses vector embeddings to find semantically similar content based on meaning rather than keyword matching. This enables AI features like Duo Chat to retrieve relevant context for user queries.
Semantic Search converts text into vector embeddings and stores them in a vector store. When a user makes a query, the query is also converted to an embedding and compared against stored vectors to find the most similar results. This approach captures semantic meaning, allowing searches to find relevant content even when exact keywords don't match.
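The comparison step described above can be illustrated with a toy in-memory store. This is a minimal sketch, not the gitlab-active-context API: the `cosine_similarity` and `nearest` helpers, the item IDs, and the three-dimensional vectors are all made up for illustration, and a real setup would get its embeddings from an embedding model and delegate the comparison to the vector store.

```ruby
# Toy in-memory vector search (illustrative only, not the framework's API).
# The embeddings below are hand-written 3-dimensional vectors.

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Returns the `limit` stored items most similar to the query embedding,
# most similar first.
def nearest(store, query_embedding, limit: 2)
  store
    .map { |item| item.merge(score: cosine_similarity(item[:embedding], query_embedding)) }
    .max_by(limit) { |item| item[:score] }
end

STORE = [
  { id: "login_controller", embedding: [0.9, 0.1, 0.0] },
  { id: "password_reset",   embedding: [0.8, 0.3, 0.1] },
  { id: "csv_export",       embedding: [0.0, 0.2, 0.9] }
].freeze

# Hypothetical embedding of the query "user authentication logic".
query = [1.0, 0.2, 0.0]
results = nearest(STORE, query, limit: 2)
puts results.map { |r| r[:id] }.inspect
```

The two authentication-related items rank above `csv_export` even though no keywords are compared, which is the property the framework relies on.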
Semantic Code Search is the first implementation of the Semantic Search framework. It enables Duo Chat and other AI features to find relevant code snippets from a repository. The feature is available as an MCP tool (semantic_code_search) that can be used by GitLab Duo Agent Platform and other AI platforms.
Please refer to the Semantic Code Search Architecture for further details.
The Semantic Search framework is powered by the gitlab-active-context gem. This gem provides a translation layer for different vector stores (Elasticsearch, OpenSearch, PostgreSQL with pgvector), allowing the same code to work with any supported vector store without needing vector store-specific implementations.
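To make the idea of a translation layer concrete, the sketch below turns one store-agnostic nearest-neighbor query into three store-specific requests. The `KnnQuery` struct, the `Translators` module, the `embedding` field, and the `code_chunks` table are hypothetical names chosen for this example. The gem's real classes and generated queries may look different.

```ruby
require "json"

# Hypothetical store-agnostic description of a nearest-neighbor query.
KnnQuery = Struct.new(:field, :vector, :k, keyword_init: true)

# Each translator turns the same KnnQuery into a store-specific request.
module Translators
  # Elasticsearch: top-level `knn` search option.
  def self.elasticsearch(query)
    { knn: { field: query.field, query_vector: query.vector, k: query.k, num_candidates: query.k * 10 } }
  end

  # OpenSearch: `knn` query clause from the k-NN plugin.
  def self.opensearch(query)
    { size: query.k, query: { knn: { query.field => { vector: query.vector, k: query.k } } } }
  end

  # PostgreSQL with pgvector: `<=>` is the cosine distance operator.
  # String interpolation is acceptable only in a sketch; values are coerced
  # to numbers here, but real code should use bound parameters.
  def self.postgresql(query, table:)
    literal = "[#{query.vector.map { |v| Float(v) }.join(',')}]"
    "SELECT * FROM #{table} ORDER BY #{query.field} <=> '#{literal}' LIMIT #{Integer(query.k)}"
  end
end

query = KnnQuery.new(field: "embedding", vector: [0.1, 0.2, 0.3], k: 5)
es_body = Translators.elasticsearch(query)
os_body = Translators.opensearch(query)
pg_sql  = Translators.postgresql(query, table: "code_chunks")

puts JSON.generate(es_body)
puts JSON.generate(os_body)
puts pg_sql
```

The calling code builds a single `KnnQuery` and never needs to know which store is connected, which is the benefit the gem's adapter layer provides.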
The framework is extensible and designed to support multiple types of semantic search. Each semantic search type is implemented using:
- Collections (Ai::ActiveContext::Collections::<Type>): Define what content is indexed and how it's stored
- References (Ai::ActiveContext::References::<Type>): Track and manage embeddings for content updates
- Queries (Ai::ActiveContext::Queries::<Type>): Retrieve similar content from the vector store
- Queues (Ai::ActiveContext::Queues::<Type>): Manage asynchronous processing of embedding generation

New semantic search types can be added by implementing these components for different content types (for example, merge requests or documentation).
Embeddings are generated asynchronously through a queue system using reference classes like Ai::ActiveContext::References::Code:
Ai::ActiveContext::BulkProcessWorker

The BulkProcessWorker is a cron job that runs every minute and processes embedding references from the queue. It fetches references, generates embeddings for them, and removes the processed references from the queue. If the queue is not empty after processing, the worker re-enqueues itself to continue processing. If embedding generation fails for a reference, the reference is retried once and is then placed on a dead queue.
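The retry and dead queue behavior described above can be sketched with plain arrays. This is an illustrative model only: `process_batch`, `MAX_ATTEMPTS`, the batch size, and the in-memory queues are assumptions made for the example, not the worker's actual implementation.

```ruby
# Hypothetical in-memory stand-ins for the queue, the dead queue, and the
# embedding call. First attempt plus one retry, per the description above.
MAX_ATTEMPTS = 2

# Processes one batch and returns true if the worker should re-enqueue itself.
def process_batch(queue, dead_queue, batch_size:, &generate_embedding)
  batch = queue.shift(batch_size) # fetched references leave the queue
  batch.each do |ref|
    begin
      generate_embedding.call(ref[:id])
    rescue StandardError
      ref[:attempts] += 1
      # Retry once, then park the reference on the dead queue.
      (ref[:attempts] < MAX_ATTEMPTS ? queue : dead_queue) << ref
    end
  end
  !queue.empty?
end

queue = %w[a b bad c].map { |id| { id: id, attempts: 0 } }
dead_queue = []
processed = []
runs = 0

loop do
  runs += 1
  more = process_batch(queue, dead_queue, batch_size: 2) do |id|
    raise "embedding failed" if id == "bad"

    processed << id
  end
  break unless more
end

puts "runs=#{runs} processed=#{processed.inspect} dead=#{dead_queue.map { |r| r[:id] }.inspect}"
```

In this model the failing reference is attempted twice in total before it lands on the dead queue, and the loop stands in for the worker re-enqueuing itself while the queue is not empty.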
Currently, semantic search uses Vertex AI's text-embedding-005 model for generating embeddings. The model configuration is defined in the collection classes (for example, Ai::ActiveContext::Collections::Code).
Support for setting the embeddings model in a Self-hosted AI Gateway setup is planned in epic 20110. Once available, administrators in Self-Managed instances with a Self-hosted AI Gateway will be able to select their own embeddings model.
When a query is executed:
The Semantic Search framework uses a migration system to manage schema changes and data transformations for the connected vector store. Migrations are tracked in the database and executed asynchronously by a worker process.
Ai::ActiveContext::MigrationWorker runs as a cron job every 5 minutes to execute uncompleted migrations.
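As a rough mental model of tracked, asynchronously executed migrations, the sketch below runs one uncompleted migration per worker tick in version order. The `Migration` struct, the `run_next_migration` helper, the migration names, and the one-migration-per-tick behavior are assumptions for illustration. The real worker and its migration records may behave differently.

```ruby
# Hypothetical migration records; the framework tracks the real ones in the database.
Migration = Struct.new(:version, :completed, :operation, keyword_init: true)

# One worker tick: run the oldest uncompleted migration, if any.
# Returns the version that was run, or nil when nothing is pending.
def run_next_migration(migrations)
  pending = migrations.reject(&:completed).min_by(&:version)
  return nil unless pending

  pending.operation.call
  pending.completed = true
  pending.version
end

applied = []
migrations = [
  Migration.new(version: 2, completed: false, operation: -> { applied << "add_embedding_field" }),
  Migration.new(version: 1, completed: true,  operation: -> { applied << "create_code_collection" }),
  Migration.new(version: 3, completed: false, operation: -> { applied << "backfill_embeddings" })
]

# Each loop iteration stands in for one cron run of the worker.
ticks = []
while (version = run_next_migration(migrations))
  ticks << version
end

puts "ran versions #{ticks.inspect}, applied #{applied.inspect}"
```

Because completion is recorded on each migration, an already completed migration (version 1 here) is never executed again.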
An instance can use one of the following vector stores:
A vector store connection must be created before semantic search can be used. There are two ways to configure the connection:
Option 1: Using the GitLab UI
For Elasticsearch or OpenSearch clusters used by advanced search:
Option 2: Using Rails console
connection = Ai::ActiveContext::Connection.create!(
  name: "os",
  options: { url: ["http://localhost:9202"] },
  adapter_class: "ActiveContext::Databases::OpenSearch::Adapter"
)
connection.activate!
For PostgreSQL, use the pgvector extension:
In the PostgreSQL database, create the extension:
CREATE EXTENSION vector;
In the Rails console, create the connection:
connection = Ai::ActiveContext::Connection.create!(
  name: "postgres",
  options: { host: 'localhost', port: 5432, user: 'postgres', password: 'password' },
  adapter_class: "ActiveContext::Databases::Postgresql::Adapter"
)
connection.activate!
For more information, see the pgvector documentation.
Supported adapter classes:
- ActiveContext::Databases::Elasticsearch::Adapter
- ActiveContext::Databases::OpenSearch::Adapter
- ActiveContext::Databases::Postgresql::Adapter

The options hash should contain the connection details specific to your vector store (URL, credentials, etc.).
When the Semantic Code Search tool is invoked for a project that hasn't been indexed yet:
1. An Ai::ActiveContext::Code::Repository record is created with pending state
2. Ai::ActiveContext::Code::RepositoryIndexWorker processes the pending repository
3. Ai::ActiveContext::Code::InitialIndexingService calls the Ai::ActiveContext::Code::Indexer
4. Indexer runs the gitlab-elasticsearch-indexer to fetch the repository's files from Gitaly, chunk the code, and index the chunks in the vector store
5. InitialIndexingService enqueues the references/IDs of the indexed content for embeddings generation
6. Embeddings are generated by Ai::ActiveContext::BulkProcessWorker.
7. Ai::ActiveContext::Code::MarkRepositoryAsReadyEventWorker runs on a 10-minute cron schedule (via SchedulingService) and checks if all embeddings have been generated. Once all embeddings are ready, it marks the repository as ready

When code is merged into the default branch:

1. BranchPushService
2. Ai::ActiveContext::Code::RepositoryIndexWorker processes the ready ActiveContext repository
3. Ai::ActiveContext::Code::IncrementalIndexingService calls the Ai::ActiveContext::Code::Indexer
4. Indexer runs the gitlab-elasticsearch-indexer to fetch the changed files from Gitaly, chunk the code, and index the chunks in the vector store. It also deletes orphaned data from the vector store.
5. IncrementalIndexingService enqueues the references/IDs of the indexed content for embeddings generation
6. Embeddings are generated by Ai::ActiveContext::BulkProcessWorker.

When a namespace is no longer eligible for indexing, Ai::ActiveContext::Code::ProcessInvalidEnabledNamespaceEventWorker picks it up and deletes the EnabledNamespace record.
When a repository is no longer eligible for indexing, Ai::ActiveContext::Code::MarkRepositoryAsPendingDeletionEventWorker marks it as pending_delete. The Ai::ActiveContext::Code::RepositoryIndexWorker then processes the repository and calls the gitlab-elasticsearch-indexer to delete the project's documents from the vector store and delete the repository record.
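The repository lifecycle described in this section can be summarized as a small state machine. The sketch below is an assumption-laden model: the state names `pending`, `embedding_indexing_in_progress`, `ready`, and `pending_delete` appear in this document, but the `TRANSITIONS` table, the `RepositoryRecord` class, and the exact set of allowed transitions are hypothetical and not taken from the GitLab codebase.

```ruby
# Allowed transitions in this illustrative model (an assumption, not the
# real model's state machine).
TRANSITIONS = {
  pending: [:embedding_indexing_in_progress, :pending_delete],
  embedding_indexing_in_progress: [:ready, :pending_delete],
  ready: [:pending_delete],
  pending_delete: []
}.freeze

class RepositoryRecord
  attr_reader :state

  def initialize
    @state = :pending
  end

  def transition_to(new_state)
    unless TRANSITIONS.fetch(@state).include?(new_state)
      raise ArgumentError, "cannot move from #{@state} to #{new_state}"
    end

    @state = new_state
  end

  # Mirrors the periodic readiness check: the repository becomes ready only
  # when no embeddings are outstanding.
  def mark_ready_if_done(outstanding_embeddings)
    return false unless @state == :embedding_indexing_in_progress && outstanding_embeddings.zero?

    transition_to(:ready)
    true
  end
end

repo = RepositoryRecord.new
repo.transition_to(:embedding_indexing_in_progress) # initial indexing finished, embeddings queued
first_check  = repo.mark_ready_if_done(3)           # embeddings still outstanding
second_check = repo.mark_ready_if_done(0)           # all embeddings generated
repo.transition_to(:pending_delete)                 # repository no longer eligible
puts "first=#{first_check} second=#{second_check} final=#{repo.state}"
```

Modeling the lifecycle this way makes it easy to see why a repository can sit in an intermediate state for a while: it waits for the next scheduled readiness check after the last embedding is generated.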
gitlab-elasticsearch-indexer

The gitlab-elasticsearch-indexer Go project handles:
Chunking
The gitlab-elasticsearch-indexer uses the gitlab-code-parser library to split the code into logical chunks.
The chunking process uses a two-stage approach:
This approach ensures chunks are semantically meaningful while staying within size limits for embedding generation.
Not all namespaces are eligible for Semantic Code Search. Eligibility is managed through two workers:
Ai::ActiveContext::Code::CreateEnabledNamespaceEventWorker (runs daily via SchedulingService)
- Creates EnabledNamespace records for qualifying namespaces

On GitLab.com, a namespace is eligible if:
On self-managed instances, all top-level group namespaces are eligible if:
instance_level_ai_beta_features_enabled)

Ai::ActiveContext::Code::MarkRepositoryAsPendingDeletionEventWorker marks repositories for deletion when they no longer meet eligibility criteria.
Ai::ActiveContext::Code::ProcessInvalidEnabledNamespaceEventWorker cleans up EnabledNamespace records for namespaces that no longer meet eligibility criteria.
Semantic Code Search indexes all files in a repository. Currently, results are post-filtered to exclude files matching the project's exclusion rules. Future versions will stop indexing excluded files entirely for improved efficiency.
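Post-filtering can be pictured as dropping hits whose path matches an exclusion pattern after the vector store has returned them. The sketch below is illustrative only: the `post_filter` helper, the glob patterns, and the shape of the result hashes are assumptions, and the project's real exclusion rules may use a different syntax.

```ruby
# Hypothetical post-filter: remove search hits whose path matches any
# exclusion glob. File.fnmatch? is called without flags, so `*` also
# matches directory separators.
def post_filter(results, exclusion_globs)
  results.reject do |hit|
    exclusion_globs.any? { |glob| File.fnmatch?(glob, hit[:path]) }
  end
end

hits = [
  { path: "app/models/user.rb",     score: 0.91 },
  { path: "vendor/gems/foo.rb",     score: 0.88 },
  { path: "app/assets/app.min.js",  score: 0.80 }
]
exclusions = ["vendor/*", "*.min.js"]

visible = post_filter(hits, exclusions)
puts visible.map { |h| h[:path] }.inspect
```

One consequence of filtering after the search is that fewer results than the requested limit can be returned, which is part of why skipping excluded files at indexing time is more efficient.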
For more information about GitLab MCP implementation and available clients, see the GitLab MCP documentation and the runbook.
Currently, the Semantic Code Search tool is available in IDEs when GitLab MCP is configured. With the rollout of the mcp_client feature flag, it will be available on the web.
For detailed information on extending the Semantic Search framework, see the gitlab-active-context gem documentation.
To add a new semantic search type (for example, merge requests or documentation), implement the following components:
- Collections (Ai::ActiveContext::Collections::<Type>): Define the collection name, queue, reference class, and how to handle authorization
- References (Ai::ActiveContext::References::<Type>): Extend ActiveContext::Reference to track embeddings and define preprocessors for content and embedding generation
- Queries (Ai::ActiveContext::Queries::<Type>): Implement query logic to search the vector store
- Queues (Ai::ActiveContext::Queues::<Type>): Define the queue for managing asynchronous processing

See the Semantic Code Search implementation for a complete example of how these components work together.
Vector store connection
Test that the vector store connection is working:
ActiveContext.adapter.search(
  user: current_user,
  collection: ::Ai::ActiveContext::Collections::Code,
  query: ActiveContext::Query.all
)
This should return results without errors.
Embedding generation
Test that embedding generation is configured:
model_definition = ::Gitlab::Llm::Embeddings::ModelDefinition.for_gitlab_provided_code_embeddings(
  identifier: 'text_embedding_005_vertex'
)
Gitlab::Llm::Embeddings::CodeEmbeddings.new(
  'test',
  unit_primitive: 'generate_embeddings_codebase',
  user: User.first,
  model_definition: model_definition
).execute
This should return a vector.
Beta experiment features
Verify that beta experiment features are enabled for the namespace:
namespace.experiment_features_enabled?
This should return true.
[!warning] Disabling semantic code search can cause long database locks if there are many repository records to delete. Use with caution on production environments. Upcoming work will allow disabling safely. See issue 582787.
Delete the index and collection record:
ActiveContext.adapter.executor.drop_collection(:code)
Delete the connection and associated records:
::Ai::ActiveContext::Connection.active.destroy!
To set up the MCP server locally for development and testing, see the MCP server development guide.
Tip: Explicitly ask for the semantic_code_search tool in your prompt to ensure that the tool is used.
To invoke semantic search from your console, use the Ai::ActiveContext::Queries::Code class:
# Check if semantic code search is available
Ai::ActiveContext::Queries::Code.available?
# Perform a semantic search
result = Ai::ActiveContext::Queries::Code.new(
  search_term: "user authentication logic",
  user: current_user
).filter(
  project_id: project.id,
  path: "app/controllers/", # Optional: filter by directory
  knn_count: 10,            # Number of vectors to compare
  limit: 10                 # Number of results to return
)
View all queued items waiting to be processed:
ActiveContext::Queues.all_queued_items
Immediately process all queued items without waiting for cron workers:
ActiveContext.execute_all_queues!
Find all items in the vector store:
ActiveContext.adapter.search(
  user: current_user,
  collection: ::Ai::ActiveContext::Collections::Code,
  query: ActiveContext::Query.all
)
To start fresh with a new connection, destroy all existing data and recreate:
active_connection = ::Ai::ActiveContext::Connection.active
active_connection.migrations.destroy_all
active_connection.repositories.destroy_all
active_connection.enabled_namespaces.destroy_all
active_connection.collections.destroy_all
active_connection.destroy
Then create and activate a new connection. When creating a connection in the Rails console, remember to run:
connection.activate!
In order for a new embedding model to be supported for Semantic Search, it must be:
Each new model must be evaluated properly. GitLab may decline to support a model for legal, performance, or other reasons.
After the evaluation, you should have the following information:
gemini-embedding-001, all-MiniLM-L6-v2

See epic 17749 for further details on model evaluation.
Possible causes:
embedding_indexing_in_progress)
- Ai::ActiveContext::Code::Repository.find_by(project_id: project.id).state
- ActiveContext.execute_all_queues!
- Ai::ActiveContext::Code::EnabledNamespace.exists?(namespace_id: project.root_namespace.id)
- Ai::ActiveContext::Connection.active.present?

If embedding generation fails repeatedly, items may be placed on a dead queue. Clear them using:
# Clear all dead queue items
ActiveContext::DeadQueue.clear_tracking!
# Or clear a specific queue
Ai::ActiveContext::Queues::Code.clear_tracking!