docs/dev-guides/semantic-search/SWITCHING_PROVIDERS.md
This guide explains how to migrate from one embedding provider to another. Switching providers requires deleting the semantic index and re-ingesting all documents because different models produce vectors with incompatible dimensions.
For initial setup of semantic search (including all provider configurations), see Semantic Search Configuration.
| Provider | Model | Model Key | Dimensions |
|---|---|---|---|
| OpenAI | text-embedding-3-large | text_embedding_3_large | 3072 |
| OpenAI | text-embedding-3-small | text_embedding_3_small | 1536 |
| AWS Bedrock | cohere.embed-english-v3 | cohere_embed_v3 | 1024 |
| Cohere | embed-english-v3.0 | embed_english_v3_0 | 1024 |
Important: The model key is derived from the model name by replacing `-` and `.` with `_`. Both the ingestion connector and GMS must use the same model to ensure query embeddings match document embeddings.
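As a quick sanity check, the key derivation can be sketched in a couple of lines. (Note that the Bedrock row in the table above uses a shortened key, `cohere_embed_v3`, rather than this mechanical transformation.)

```python
def model_key(model_name: str) -> str:
    """Derive the configuration key: replace '-' and '.' with '_'."""
    return model_name.replace("-", "_").replace(".", "_")

print(model_key("text-embedding-3-large"))  # text_embedding_3_large
print(model_key("embed-english-v3.0"))      # embed_english_v3_0
```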
Stop GMS and any ingestion jobs to prevent writes during migration:
```bash
# Docker Compose
docker stop datahub-gms

# Kubernetes
kubectl scale deployment datahub-gms --replicas=0
```
Delete the existing semantic index from OpenSearch:
```bash
# Check existing semantic indices
curl -s "http://localhost:9200/_cat/indices/*semantic*?v"

# Delete the semantic index (adjust index name as needed)
curl -X DELETE "http://localhost:9200/documentindex_v2_semantic"
```
Update your configuration with the new provider settings. See Semantic Search Configuration for the full configuration options for each provider (Helm charts and environment variables).
Make sure to update:

- `EMBEDDING_PROVIDER_TYPE`
- `ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION` to match the new model

If using `application.yaml`, update the model entry to match the new provider:
```yaml
elasticsearch:
  entityIndex:
    semanticSearch:
      models:
        # Use the model key that matches your new provider
        text_embedding_3_large:
          vectorDimension: 3072 # Must match model output
          knnEngine: faiss
          spaceType: cosinesimil
          efConstruction: 128
          m: 16
```
Or via environment variable:
```bash
ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION=3072
```
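A dimension mismatch only surfaces at index-creation or query time, so a small pre-flight check can save a migration round-trip. This is an illustrative sketch using the dimensions from the provider table above; it is not part of DataHub itself:

```python
# Known output sizes, taken from the provider table above.
MODEL_DIMS = {
    "text_embedding_3_large": 3072,
    "text_embedding_3_small": 1536,
    "cohere_embed_v3": 1024,
    "embed_english_v3_0": 1024,
}

def validate_dimension(model_key: str, configured_dim: int) -> None:
    """Raise if the configured dimension disagrees with the model's output size."""
    expected = MODEL_DIMS.get(model_key)
    if expected is not None and expected != configured_dim:
        raise ValueError(
            f"{model_key} emits {expected}-dim vectors but "
            f"ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION={configured_dim}"
        )

validate_dimension("text_embedding_3_large", 3072)  # passes silently
```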
Start GMS — the system update job will automatically recreate the semantic index:
```bash
# Docker Compose
docker start datahub-gms

# Kubernetes
kubectl scale deployment datahub-gms --replicas=1
```
The system update job runs automatically on startup and recreates the semantic index with the new mapping and vector dimension.
After the index is recreated, re-ingest your documents to generate new embeddings:
```bash
datahub ingest -c your-recipe.yaml
```
Important: Make sure your ingestion recipe also uses the same embedding model. The ingestion connector generates document embeddings, while GMS generates query embeddings — both must use the same model.
```bash
# Check the index exists with correct mapping
curl -s "http://localhost:9200/documentindex_v2_semantic/_mapping?pretty" | head -50

# Check documents have embeddings
curl -s "http://localhost:9200/documentindex_v2_semantic/_search" \
  -H "Content-Type: application/json" \
  -d '{"size": 1, "_source": ["urn", "embeddings"]}' | head -30

# Test semantic search via GraphQL or the UI
```
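If you prefer to check the stored vectors programmatically, a sketch like the following parses the `_search` response from the command above and reports the embedding length. The flat-list shape of the `embeddings` field is an assumption here; adjust to your actual mapping:

```python
# Illustrative response in the shape returned by the _search request above;
# in practice you would json.loads() the curl output.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"urn": "urn:li:dataset:example", "embeddings": [0.0] * 3072}}
        ]
    }
}

def stored_dimension(response: dict) -> int:
    """Return the length of the first hit's stored embedding vector."""
    hit = response["hits"]["hits"][0]
    return len(hit["_source"]["embeddings"])

print(stored_dimension(sample_response))  # 3072
```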
**Cause:** Documents were ingested before the provider switch and have embeddings from the old model.

**Solution:** Re-run ingestion to generate new embeddings with the new provider.
**Cause:** The index was created with a different vector dimension than the new model produces.

**Solution:** Delete the semantic index and let it be recreated (Steps 2-5 above).
**Cause:** API key not set or incorrect.

**Solution:** Verify your API key is correctly set in the environment:
```bash
# Check the environment variable is set (in the container)
docker exec datahub-gms env | grep -E 'OPENAI_API_KEY|COHERE_API_KEY'
```
**Cause:** Model mismatch between ingestion and query time.

**Solution:** Ensure both the ingestion connector AND GMS use the same embedding model. Check:

- The model setting (`BEDROCK_EMBEDDING_MODEL`, `OPENAI_EMBEDDING_MODEL`, or `COHERE_EMBEDDING_MODEL`) in GMS config
- The model configured in your ingestion recipe