docs/dev-guides/semantic-search/README.md
This directory contains documentation for DataHub's semantic search capability, which enables natural language search across metadata entities using vector embeddings.
Note: This is developer documentation for the semantic search feature. For a working example, see the smoke test at `smoke-test/tests/semantic/test_semantic_search.py`.
Traditional keyword search requires exact term matches, limiting discoverability. Semantic search uses AI-generated embeddings to understand the meaning of queries and documents, returning relevant results even when exact keywords don't match.
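The difference can be illustrated with a toy sketch. The three-dimensional vectors below are hand-written stand-ins for real embeddings (an assumption for illustration only, not output from any model), and the helper names are hypothetical. The keyword matcher finds nothing because no word is shared, while cosine similarity still ranks the relevant document first.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-made "embeddings"; real ones have hundreds of dimensions.
DOCS = {
    "Requesting permissions for datasets": [0.9, 0.1, 0.0],
    "Quarterly revenue dashboard": [0.0, 0.2, 0.9],
}
QUERY_TEXT = "how to get data access"
QUERY_VECTOR = [0.8, 0.2, 0.1]


def keyword_hits(query, docs):
    """Titles that share at least one whole word with the query."""
    words = set(query.lower().split())
    return [title for title in docs if words & set(title.lower().split())]


def semantic_ranking(query_vector, docs):
    """Titles ordered from most to least similar to the query vector."""
    return sorted(
        docs,
        key=lambda title: cosine_similarity(query_vector, docs[title]),
        reverse=True,
    )


if __name__ == "__main__":
    print("keyword:", keyword_hits(QUERY_TEXT, DOCS))
    print("semantic:", semantic_ranking(QUERY_VECTOR, DOCS))
```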
Example:
┌─────────────────────────────────────────────────────────────────────────┐
│                         DataHub Semantic Search                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────────┐ │
│  │  Ingestion   │     │     GMS      │────▶│       OpenSearch         │ │
│  │  Connector   │     │              │     │                          │ │
│  │              │     │ ┌──────────┐ │     │  ┌────────────────────┐  │ │
│  │ 1. Generate  │     │ │ Process  │ │     │  │  entityindex_v2    │  │ │
│  │    embeddings│     │ │ MCP +    │ │     │  │  (keyword search)  │  │ │
│  │              │     │ │ Write to │ │     │  └────────────────────┘  │ │
│  │ 2. Emit MCP  │────▶│ │ indices  │ │     │                          │ │
│  │    with      │     │ └──────────┘ │     │  ┌────────────────────┐  │ │
│  │    Semantic  │     │              │     │  │  entityindex_v2_   │  │ │
│  │    Embedding │     │              │     │  │  semantic          │  │ │
│  │    aspect    │     │              │     │  │  (vector search)   │  │ │
│  └──────────────┘     └──────────────┘     │  └────────────────────┘  │ │
│                                            └──────────────────────────┘ │
│                                                          │              │
│  ┌──────────────┐                                        │              │
│  │   GraphQL    │◀───────────────────────────────────────┘              │
│  │   Client     │    semanticSearchAcrossEntities()                     │
│  └──────────────┘                                                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Documents and other entities are ingested into DataHub using standard ingestion connectors. When semantic search is enabled, GMS performs a dual-write:
- `entityindex_v2`: standard keyword-searchable index
- `entityindex_v2_semantic`: vector-enabled index for semantic search

Note: The dual-index approach is transitional. The plan is to eventually retire `v2` indices and use `_semantic` indices exclusively for both keyword and semantic search. See ARCHITECTURE.md for details.
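The routing implied by the dual-write can be sketched as follows. This is a minimal illustration, not GMS code: the function name `target_indices` is hypothetical, and it assumes that `entity` in the index names above stands for the entity type and that only entity types enabled for semantic search receive a second write.

```python
def target_indices(entity_type, semantic_enabled, semantic_entities):
    """Return the index names a write for `entity_type` would be sent to.

    Assumption: index names follow the `<entity>index_v2` and
    `<entity>index_v2_semantic` pattern described in this README.
    """
    keyword_index = f"{entity_type}index_v2"
    indices = [keyword_index]  # the keyword index is always written
    if semantic_enabled and entity_type in semantic_entities:
        # Dual-write: add the vector-enabled index for this entity type.
        indices.append(f"{keyword_index}_semantic")
    return indices


if __name__ == "__main__":
    print(target_indices("document", True, {"document"}))
    print(target_indices("dataset", True, {"document"}))
```

With semantic search disabled, or for an entity type that is not listed, the sketch falls back to the keyword index alone.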
Embeddings are generated at two points:
Document Embeddings (at ingestion time):
`SemanticContent` aspect
Query Embeddings (at search time):
When a user performs a semantic search:
Set in your environment (e.g., docker/profiles/empty2.env):
ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
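As a rough sketch of how these two settings might be read, the helper below parses them from an environment mapping. The function name is hypothetical, and treating `ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES` as a comma-separated list is an assumption; it does not describe how GMS itself parses the values.

```python
import os


def load_semantic_search_config(env=None):
    """Read the two semantic search settings from an environment mapping."""
    env = os.environ if env is None else env
    raw_enabled = env.get("ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED", "false")
    raw_entities = env.get("ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES", "")
    return {
        "enabled": raw_enabled.strip().lower() == "true",
        # Assumption: multiple entity types are separated by commas.
        "entities": [e.strip() for e in raw_entities.split(",") if e.strip()],
    }


if __name__ == "__main__":
    print(load_semantic_search_config())
```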
The best way to verify semantic search is working is to run the smoke test:
cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v
This test:
query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
        ... on Document {
          info {
            title
            contents {
              text
            }
          }
        }
      }
    }
  }
}
Variables:
{
  "input": {
    "query": "how to request data access",
    "types": ["DOCUMENT"],
    "start": 0,
    "count": 10
  }
}
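The query and variables above could be sent from a script roughly as follows. This is a sketch under assumptions: the endpoint `http://localhost:8080/api/graphql` and the bearer-token header are guesses about a local deployment, the helper names are hypothetical, and the query is trimmed to `urn` and `type` for brevity.

```python
import json
import urllib.request

SEMANTIC_SEARCH_QUERY = """
query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_request(query_text, graphql_url, token=None, count=10):
    """Build (but do not send) the HTTP POST carrying the GraphQL query."""
    payload = {
        "query": SEMANTIC_SEARCH_QUERY,
        "variables": {
            "input": {
                "query": query_text,
                "types": ["DOCUMENT"],
                "start": 0,
                "count": count,
            }
        },
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the server accepts a personal access token as a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        graphql_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def semantic_search(query_text, graphql_url="http://localhost:8080/api/graphql", token=None):
    """Send the query and return the `semanticSearchAcrossEntities` payload."""
    request = build_request(query_text, graphql_url, token)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["data"]["semanticSearchAcrossEntities"]
```

Splitting request construction from sending keeps the payload easy to inspect without a running server; calling `semantic_search("how to request data access")` requires a reachable instance with semantic search enabled.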
| File | Description |
|---|---|
| README.md | This documentation - overview and quick start |
| ARCHITECTURE.md | Detailed architecture and design decisions |
| CONFIGURATION.md | Configuration options and embedding models |
| SWITCHING_PROVIDERS.md | Guide for switching between embedding providers |
For a working example of semantic search:
# Run the smoke test
cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v