# Semantic Search Configuration
Semantic search lets you find DataHub entities using natural language queries like "customer churn analysis" — even when exact keywords differ.
Semantic search requires OpenSearch (for example, `opensearchproject/opensearch:2.19.3`). Elasticsearch is not supported.

## Helm

If you deploy DataHub using the DataHub Helm chart, add the following to your `values.yaml` and run `helm upgrade`.
### OpenAI

Create a Kubernetes secret, then configure:

```shell
kubectl create secret generic openai-secret --from-literal=api-key=sk-your-api-key-here
```

```yaml
global:
  datahub:
    semantic_search:
      enabled: true
      vectorDimension: 3072
      provider:
        type: "openai"
        openai:
          apiKey:
            secretRef: "openai-secret"
            secretKey: "api-key"
          model: "text-embedding-3-large"
```
### AWS Bedrock

No API key is needed: Bedrock authenticates via the AWS SDK default credential chain (IRSA, EC2/ECS instance credentials, etc.).

```yaml
global:
  datahub:
    semantic_search:
      enabled: true
      vectorDimension: 1024
      provider:
        type: "aws-bedrock"
        bedrock:
          modelId: "cohere.embed-english-v3"
          awsRegion: "us-west-2"
```
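Whichever credential source the chain resolves to, the identity GMS runs as must be allowed to invoke the embedding model. A minimal IAM policy sketch — the region and model ID are taken from the example above; adjust them to your account:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:us-west-2::foundation-model/cohere.embed-english-v3"
    }
  ]
}
```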
### Cohere

Create a Kubernetes secret, then configure:

```shell
kubectl create secret generic cohere-secret --from-literal=api-key=your-cohere-api-key
```

```yaml
global:
  datahub:
    semantic_search:
      enabled: true
      vectorDimension: 1024
      provider:
        type: "cohere"
        cohere:
          apiKey:
            secretRef: "cohere-secret"
            secretKey: "api-key"
          model: "embed-english-v3.0"
```
```shell
helm upgrade datahub datahub/datahub -f values.yaml
```
## Docker Compose and other non-Helm deployments

For Docker Compose or other non-Helm deployments, set the following environment variables on the `datahub-gms` service and restart it.
### OpenAI

```shell
ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
SEARCH_SERVICE_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
OPENAI_API_KEY=sk-your-api-key-here
```

That's it: OpenAI is the default provider, so no other variables are needed.
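If you manage GMS through a Compose file, the same variables can be attached with an override file — a sketch that assumes the service is named `datahub-gms` (match the name in your own compose file):

```yaml
# docker-compose.override.yml (sketch; service name may differ in your setup)
services:
  datahub-gms:
    environment:
      ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED: "true"
      SEARCH_SERVICE_SEMANTIC_SEARCH_ENABLED: "true"
      ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES: "document"
      OPENAI_API_KEY: "${OPENAI_API_KEY}"
```

Referencing `${OPENAI_API_KEY}` keeps the key out of the file itself; Compose reads it from the shell environment or an `.env` file.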
### AWS Bedrock

```shell
ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
SEARCH_SERVICE_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
EMBEDDING_PROVIDER_TYPE=aws-bedrock
BEDROCK_EMBEDDING_AWS_REGION=us-west-2
ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION=1024
```

Authentication uses the AWS SDK default credential chain (EC2/ECS instance credentials, `AWS_PROFILE`, or `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY`).
### Cohere

```shell
ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
SEARCH_SERVICE_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
EMBEDDING_PROVIDER_TYPE=cohere
COHERE_API_KEY=your-cohere-api-key
ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION=1024
```
## Verify the configuration

After restarting, check the GMS logs:

```shell
# Docker Compose
docker-compose logs datahub-gms | grep -i "embedding"

# Kubernetes
kubectl logs deployment/datahub-gms | grep -i "embedding"
```

You should see:

```
Creating embedding provider with type: openai
Initialized OpenAiEmbeddingProvider with model=text-embedding-3-large
```
## Generate embeddings

Once semantic search is enabled, run an ingestion source to generate embeddings for your documents:

```yaml
source:
  type: datahub-documents
  config: {}
sink:
  type: datahub-rest
  config: {}
```

This source automatically connects to DataHub, fetches your embedding configuration from the server, and processes documents in real time.

```shell
datahub ingest -c recipe.yml
```
For external document sources (Notion, Confluence, etc.), see the Notion Source and DataHub Documents Source documentation.
## Supported models

| Provider | Model | Dimensions | Notes |
|---|---|---|---|
| OpenAI | text-embedding-3-large | 3072 | Default, higher quality |
| OpenAI | text-embedding-3-small | 1536 | Fast, cost-effective |
| AWS Bedrock | cohere.embed-english-v3 | 1024 | AWS-managed |
| Cohere | embed-english-v3.0 | 1024 | English optimized |
| Cohere | embed-multilingual-v3.0 | 1024 | 100+ languages |
To use a non-default model, set the model name in your Helm values or environment variables and update `vectorDimension` / `ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION` to match.
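The model-to-dimension pairing is easy to get wrong. A small hypothetical helper — not part of DataHub, just a pre-deployment sanity check that mirrors the table above — can catch a mismatch before you restart GMS:

```python
# Hypothetical lookup mirroring the supported-models table on this page;
# use it to pick the vector dimension before editing your config.
MODEL_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "cohere.embed-english-v3": 1024,
    "embed-english-v3.0": 1024,
    "embed-multilingual-v3.0": 1024,
}

def expected_dimension(model: str) -> int:
    """Return the vector dimension to configure for a given embedding model."""
    try:
        return MODEL_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model}") from None

print(expected_dimension("text-embedding-3-large"))  # → 3072
```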
## Troubleshooting

| Symptom | Fix |
|---|---|
| "Semantic search is disabled or not configured" | Verify `ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true` and restart GMS |
| "Invalid API key provided" | Check that your API key is set correctly in the GMS environment |
| "Dimension mismatch: expected 3072, got 1024" | Update `ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION` to match your model |
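Several of these symptoms can be caught before restarting GMS at all. A hedged pre-flight sketch — the variable names are the ones documented on this page, not an exhaustive list of what GMS reads:

```python
import os

def preflight(env: dict) -> list[str]:
    """Return a list of likely semantic-search misconfigurations."""
    problems = []
    # Both enable flags must be set for semantic search to activate.
    for flag in ("ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED",
                 "SEARCH_SERVICE_SEMANTIC_SEARCH_ENABLED"):
        if env.get(flag, "").lower() != "true":
            problems.append(f"{flag} is not 'true'")
    # Each provider needs its own credential or region variable.
    provider = env.get("EMBEDDING_PROVIDER_TYPE", "openai")  # openai is the default
    if provider == "openai" and not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is missing")
    if provider == "cohere" and not env.get("COHERE_API_KEY"):
        problems.append("COHERE_API_KEY is missing")
    if provider == "aws-bedrock" and not env.get("BEDROCK_EMBEDDING_AWS_REGION"):
        problems.append("BEDROCK_EMBEDDING_AWS_REGION is missing")
    return problems

print(preflight(dict(os.environ)) or "looks OK")
```

Run it in the same environment the GMS container sees (for example via `docker-compose exec`); an empty result does not prove the key is valid, only that the variables are present.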
For advanced settings, see the `application.yaml` reference and performance tuning documentation.