docs/dev-guides/semantic-search/ARCHITECTURE.md
This document provides a detailed explanation of DataHub's semantic search architecture, design decisions, and implementation details.
Traditional keyword search has limitations: it matches literal terms, so queries phrased differently from the indexed text return poor results. Semantic search addresses this by understanding the meaning of text through vector embeddings: numerical representations that capture semantic similarity.
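To make the idea concrete, here is a minimal illustrative sketch (not DataHub code) of how similarity between two embedding vectors is scored. The toy 3-dimensional vectors stand in for the roughly 1024-dimensional vectors produced by real embedding models:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Score how close two embedding vectors are in direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; real models produce ~1024 dimensions.
query_vec = [0.10, 0.80, 0.20]
doc_vec = [0.12, 0.75, 0.25]
print(cosine_similarity(query_vec, doc_vec))  # ~0.99 -> semantically similar
```

In production this comparison is not done with a brute-force loop; OpenSearch's k-NN index (described later in this document) performs approximate nearest-neighbor search over the stored vectors.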
For each entity type enabled for semantic search, two indices exist:
```
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│ documentindex_v2                │   │ documentindex_v2_semantic       │
├─────────────────────────────────┤   ├─────────────────────────────────┤
│ Standard OpenSearch index       │   │ OpenSearch index with k-NN      │
│                                 │   │                                 │
│ Fields:                         │   │ Fields:                         │
│ - urn                           │   │ - urn                           │
│ - title (text)                  │   │ - title (text)                  │
│ - text (text)                   │   │ - text (text)                   │
│ - browsePaths                   │   │ - browsePaths                   │
│ - tags                          │   │ - tags                          │
│ - ...                           │   │ - ...                           │
│                                 │   │                                 │
│                                 │   │ + embeddings (nested object):   │
│                                 │   │   - cohere_embed_v3:            │
│                                 │   │     - model_version             │
│                                 │   │     - generated_at              │
│                                 │   │     - chunks[] (nested):        │
│                                 │   │       - position                │
│                                 │   │       - text                    │
│                                 │   │       - vector (knn_vector)     │
└─────────────────────────────────┘   └─────────────────────────────────┘
```
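You can see both halves of a pair directly in OpenSearch. The snippet below is illustrative only: the host details are assumptions, and the documentindex_v2 name is taken from the example above.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Lists the standard index and its *_semantic sibling side by side.
for idx in client.cat.indices(index="documentindex_v2*", format="json"):
    print(idx["index"], idx["docs.count"])
```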
The dual-index approach is a transitional architecture. The long-term plan is to migrate all search traffic to the _semantic indices and retire the v2 indices entirely. The benefit of the transitional approach is that the existing keyword indices keep serving search unchanged while semantic search is rolled out and validated.
Future State:
Once the transition is complete, the _semantic indices will become the primary (and only) search indices. They will support both traditional keyword search and semantic (k-NN vector) search. This unified index approach simplifies operations and reduces storage overhead.
The semantic index stores embeddings in a nested structure:
```json
{
  "urn": "urn:li:document:example-doc",
  "title": "Data Access Guide",
  "text": "How to request access to datasets...",
  "embeddings": {
    "cohere_embed_v3": {
      "model_version": "bedrock/cohere.embed-english-v3",
      "generated_at": "2024-01-15T10:30:00Z",
      "chunking_strategy": "sentence_boundary_400t",
      "total_chunks": 3,
      "total_tokens": 850,
      "chunks": [
        {
          "position": 0,
          "text": "How to request access to datasets...",
          "character_offset": 0,
          "character_length": 450,
          "token_count": 95,
          "vector": [0.023, -0.041, 0.087, ...]  // 1024 dimensions
        },
        {
          "position": 1,
          "text": "For sensitive data, additional approval...",
          "character_offset": 450,
          "character_length": 380,
          "token_count": 82,
          "vector": [0.019, -0.055, 0.091, ...]
        }
      ]
    }
  }
}
```
The embeddings structure supports multiple embedding models:
```json
{
  "embeddings": {
    "cohere_embed_v3": { ... },
    "openai_text_embedding_3": { ... },
    "custom_model": { ... }
  }
}
```
This allows embeddings from multiple models to coexist on the same document, which supports migrating between models and comparing them side by side.
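Because each model lives under its own key, the nested path used at query time selects which model's vectors are searched. A small illustrative helper (not part of DataHub) makes the pattern explicit:

```python
def chunks_path(model_key: str) -> str:
    """Nested path to a given model's chunk vectors in the semantic index."""
    return f"embeddings.{model_key}.chunks"

print(chunks_path("cohere_embed_v3"))           # embeddings.cohere_embed_v3.chunks
print(chunks_path("openai_text_embedding_3"))   # embeddings.openai_text_embedding_3.chunks
```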
The ingestion connector generates document embeddings and sends them to GMS along with the document content:
```
Ingestion Flow

┌─────────────┐
│   Source    │  1. Extract documents
│   System    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Ingestion  │  2. Generate embeddings for document content
│  Connector  │     (using connector's embedding provider)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│     GMS     │  3. Send document + embeddings to GMS
└──────┬──────┘
       │
       ▼
   OpenSearch
┌─────────────────────┐   ┌─────────────────────────────────┐
│ entityindex_v2      │   │ entityindex_v2_semantic         │
│ (keyword search)    │   │ (keyword + vector search)       │
│                     │   │                                 │
│ - urn               │   │ - urn                           │
│ - title             │   │ - title                         │
│ - text              │   │ - text                          │
│ - ...               │   │ - embeddings.model.chunks[]     │
│                     │   │     .vector                     │
└─────────────────────┘   └─────────────────────────────────┘
```
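As a sketch of step 2, the snippet below shows what calling an embedding provider from the connector might look like. It assumes AWS Bedrock with the Cohere Embed v3 model named elsewhere in this document; the actual connector goes through its own provider abstraction, so treat the client setup, region, and request shape as illustrative:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed document chunks for indexing (one ~1024-dim vector per chunk)."""
    response = bedrock.invoke_model(
        modelId="cohere.embed-english-v3",
        body=json.dumps({"texts": chunks, "input_type": "search_document"}),
    )
    return json.loads(response["body"].read())["embeddings"]
```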
Document Embeddings are generated by the ingestion connector at ingestion time and sent to GMS via MCP (Metadata Change Proposal). This keeps document embedding in the ingestion path and lets GMS persist the vectors through the same aspect-processing pipeline as any other metadata; GMS itself only embeds queries (see Query Embeddings below).
```
Ingestion Pipeline

┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐
│    Source    │───▶│  Ingestion   │───▶│     DataHub GMS      │
│    System    │    │  Connector   │    │                      │
└──────────────┘    └──────┬───────┘    └──────────┬───────────┘
                           │                       │
                           ▼                       ▼
                 ┌──────────────────┐    ┌──────────────────┐
                 │ Generate document│    │ Process MCP and  │
                 │ embeddings       │    │ write to semantic│
                 │ (in connector)   │    │ search index     │
                 └────────┬─────────┘    └──────────────────┘
                          │                        ▲
                          │    MCP with            │
                          └────SemanticContent─────┘
                               aspect
```
Embeddings are stored as a proper DataHub aspect (SemanticContent), defined in a PDL schema:
```json
{
  "entityType": "document",
  "entityUrn": "urn:li:document:my-doc",
  "aspectName": "semanticContent",
  "aspect": {
    "embeddings": {
      "cohere_embed_v3": {
        "modelVersion": "bedrock/cohere.embed-english-v3",
        "generatedAt": 1702234567890,
        "totalChunks": 2,
        "chunks": [
          { "position": 0, "vector": [...], "text": "..." },
          { "position": 1, "vector": [...], "text": "..." }
        ]
      }
    }
  }
}
```
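How a connector emits this proposal is up to its implementation. One hedged sketch using the DataHub Python emitter's raw-JSON (GenericAspect) route is shown below; in practice a generated SemanticContentClass from the PDL model would normally be used instead, and the URN, GMS address, and truncated vectors are illustrative:

```python
import json

from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GenericAspectClass,
    MetadataChangeProposalClass,
)

# Aspect payload mirroring the example above (vectors truncated for readability).
semantic_content = {
    "embeddings": {
        "cohere_embed_v3": {
            "modelVersion": "bedrock/cohere.embed-english-v3",
            "generatedAt": 1702234567890,
            "totalChunks": 2,
            "chunks": [
                {"position": 0, "vector": [0.023, -0.041], "text": "..."},
                {"position": 1, "vector": [0.019, -0.055], "text": "..."},
            ],
        }
    }
}

mcp = MetadataChangeProposalClass(
    entityType="document",
    entityUrn="urn:li:document:my-doc",
    changeType=ChangeTypeClass.UPSERT,
    aspectName="semanticContent",
    aspect=GenericAspectClass(
        contentType="application/json",
        value=json.dumps(semantic_content).encode("utf-8"),
    ),
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(mcp)
```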
The text field in each chunk is optional. This supports scenarios where storing the chunk text in the index is undesirable, for example when the content is sensitive or when index size must be kept small.
Note: Embeddings are one-way—original text cannot be reconstructed from vectors.
Query Embeddings are generated by GMS at search time using the configured embedding provider (e.g., AWS Bedrock):
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   GraphQL   │───▶│     GMS     │───▶│  Embedding  │───▶│  OpenSearch │
│   Client    │    │             │    │  Provider   │    │ k-NN Query  │
└─────────────┘    └─────────────┘    └──────┬──────┘    └─────────────┘
                                             │
                                             ▼
                                      Query embedding
                                      generated here
                                      (for search only)
```
Key Point: The GMS embedding provider is used only for query embedding, not for document embedding. The ingestion connector is responsible for document embeddings.
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   GraphQL   │───▶│     GMS     │───▶│  Embedding  │───▶│  OpenSearch │
│   Client    │    │             │    │  Provider   │    │ k-NN Query  │
└─────────────┘    └─────────────┘    └─────────────┘    └──────┬──────┘
                                                                │
  semanticSearchAcrossEntities(                                 │
    query: "how to access data"                                 │
  )                                                             │
                                                                ▼
                                    ┌─────────────────────────────┐
                                    │ Nested k-NN Query:          │
                                    │                             │
                                    │ {                           │
                                    │   "nested": {               │
                                    │     "path": "embeddings     │
                                    │       .cohere_embed_v3      │
                                    │       .chunks",             │
                                    │     "query": {              │
                                    │       "knn": {              │
                                    │         "...chunks.vector": │
                                    │         { "vector": [...],  │
                                    │           "k": 10 }         │
                                    │       }                     │
                                    │     }                       │
                                    │   }                         │
                                    │ }                           │
                                    └─────────────────────────────┘
```
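Putting the pieces together, the sketch below shows the shape of the resulting OpenSearch call: the query vector (produced by the configured embedding provider) is wrapped in the nested k-NN query from the diagram above. The client wiring, placeholder vector, and index name are illustrative; GMS performs the equivalent steps internally:

```python
from opensearchpy import OpenSearch

def semantic_query_body(query_vector: list[float], k: int = 10) -> dict:
    """Nested k-NN query against a single model's chunk vectors."""
    return {
        "query": {
            "nested": {
                "path": "embeddings.cohere_embed_v3.chunks",
                "query": {
                    "knn": {
                        "embeddings.cohere_embed_v3.chunks.vector": {
                            "vector": query_vector,
                            "k": k,
                        }
                    }
                },
            }
        }
    }

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# In GMS the vector comes from the configured embedding provider; a placeholder is used here.
query_vector = [0.0] * 1024
hits = client.search(index="documentindex_v2_semantic", body=semantic_query_body(query_vector))
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["urn"])
```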
Embedding models have token limits (512 tokens for Cohere's embed-english-v3.0). Long documents must be split into chunks:
```python
import re

def chunk_text(text, max_tokens=400, chars_per_token=4):
    """Chunk text at sentence boundaries, respecting token limits."""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):    # split text into sentences
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)                        # save chunk, start a new accumulation
            current = ""
        while len(sentence) > max_chars:                  # oversized sentence: fall back to character splitting
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = (current + " " + sentence).strip()      # accumulate sentences until approaching the limit
    if current:
        chunks.append(current)
    return chunks
```
Parameters:

- max_tokens: Target chunk size (default: 400)
- chars_per_token: Estimation ratio (default: 4 characters ≈ 1 token)

Each chunk stores metadata for debugging and analysis:
```json
{
  "position": 0,            // Order in document
  "text": "...",            // Chunk content
  "character_offset": 0,    // Start position in original
  "character_length": 450,  // Length in characters
  "token_count": 95,        // Estimated tokens
  "vector": [...]           // Embedding vector
}
```
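A hypothetical helper that assembles these per-chunk records from already-chunked text and its vectors might look like the following; the character offsets here are approximate, since the chunker normalizes whitespace at sentence joins:

```python
def build_chunk_records(chunks: list[str], vectors: list[list[float]], chars_per_token: int = 4) -> list[dict]:
    """Pair each chunk with its vector and the metadata fields shown above."""
    records, offset = [], 0
    for position, (text, vector) in enumerate(zip(chunks, vectors)):
        records.append({
            "position": position,
            "text": text,
            "character_offset": offset,   # approximate: ignores whitespace dropped while chunking
            "character_length": len(text),
            "token_count": len(text) // chars_per_token,  # same ~4 chars/token estimate as the chunker
            "vector": vector,
        })
        offset += len(text)
    return records
```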
The semantic index uses OpenSearch's k-NN plugin with FAISS engine:
```json
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "embeddings": {
        "type": "nested",
        "properties": {
          "cohere_embed_v3": {
            "type": "nested",
            "properties": {
              "chunks": {
                "type": "nested",
                "properties": {
                  "vector": {
                    "type": "knn_vector",
                    "dimension": 1024,
                    "method": {
                      "name": "hnsw",
                      "engine": "faiss",
                      "space_type": "cosinesimil",
                      "parameters": {
                        "ef_construction": 128,
                        "m": 16
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```
| Parameter | Value | Description |
|---|---|---|
| ef_construction | 128 | Build-time accuracy (higher = more accurate, slower build) |
| m | 16 | Number of connections per node (higher = more accurate, more memory) |
| space_type | cosinesimil | Similarity metric (cosine similarity) |
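For experimentation, the mapping above can be created directly with the opensearch-py client; the helper below rebuilds the same body programmatically. The index name and client setup are illustrative, and in a real deployment DataHub manages index creation itself:

```python
from opensearchpy import OpenSearch

def semantic_index_body(model_key: str = "cohere_embed_v3", dimension: int = 1024) -> dict:
    """Build the k-NN settings and mappings shown above for one embedding model."""
    vector_field = {
        "type": "knn_vector",
        "dimension": dimension,
        "method": {
            "name": "hnsw",
            "engine": "faiss",
            "space_type": "cosinesimil",
            "parameters": {"ef_construction": 128, "m": 16},
        },
    }
    chunks = {"type": "nested", "properties": {"vector": vector_field}}
    model = {"type": "nested", "properties": {"chunks": chunks}}
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "embeddings": {"type": "nested", "properties": {model_key: model}}
            }
        },
    }

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
client.indices.create(index="documentindex_v2_semantic_test", body=semantic_index_body())
```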
Semantic search respects DataHub's existing access controls; results are subject to the same permission checks as keyword search.
| Index Size | Recommendation |
|---|---|
| < 100K docs | Single node sufficient |
| 100K - 1M docs | Consider dedicated k-NN nodes |
| > 1M docs | Sharding and replicas recommended |