DataHub Semantic Search

This directory contains documentation for DataHub's semantic search capability, which enables natural language search across metadata entities using vector embeddings.

Note: This is developer documentation for the semantic search feature. For a working example, see the smoke test at smoke-test/tests/semantic/test_semantic_search.py.

Overview

Traditional keyword search requires exact term matches, limiting discoverability. Semantic search uses AI-generated embeddings to understand the meaning of queries and documents, returning relevant results even when exact keywords don't match.

Example:

Query: "how to request data access permissions"
Keyword search: ❌ No results (no exact match)
Semantic search: ✅ Returns "Data Access Request Process" document

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           DataHub Semantic Search                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────────┐ │
│  │  Ingestion   │     │     GMS      │────▶│      OpenSearch          │ │
│  │  Connector   │     │              │     │                          │ │
│  │              │     │ ┌──────────┐ │     │  ┌────────────────────┐  │ │
│  │ 1. Generate  │     │ │ Process  │ │     │  │ entityindex_v2     │  │ │
│  │    embeddings│     │ │ MCP +    │ │     │  │ (keyword search)   │  │ │
│  │              │     │ │ Write to │ │     │  └────────────────────┘  │ │
│  │ 2. Emit MCP  │────▶│ │ indices  │ │     │                          │ │
│  │    with      │     │ └──────────┘ │     │  ┌────────────────────┐  │ │
│  │    Semantic  │     │              │     │  │ entityindex_v2_    │  │ │
│  │    Embedding │     │              │     │  │ semantic           │  │ │
│  │    aspect    │     │              │     │  │ (vector search)    │  │ │
│  └──────────────┘     └──────────────┘     │  └────────────────────┘  │ │
│                                            └──────────────────────────┘ │
│                                                          │              │
│  ┌──────────────┐                                        │              │
│  │   GraphQL    │◀───────────────────────────────────────┘              │
│  │   Client     │  semanticSearchAcrossEntities()                       │
│  └──────────────┘                                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

How It Works

1. Data Ingestion

Documents and other entities are ingested into DataHub using standard ingestion connectors. When semantic search is enabled, GMS performs a dual-write:

Primary Index (entityindex_v2): Standard keyword-searchable index
Semantic Index (entityindex_v2_semantic): Vector-enabled index for semantic search

Note: The dual-index approach is transitional. The plan is to eventually retire the v2 indices and use the _semantic indices exclusively for both keyword and semantic search. See ARCHITECTURE.md for details.
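As a rough illustration of the dual-write decision described above, the sketch below picks the target indices for an entity type. It is not the GMS implementation. The helper names `parse_entities` and `target_indices` are hypothetical, and two things are assumed rather than confirmed here: that `ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES` takes a comma-separated list, and that `entityindex_v2` in the diagram stands for a per-entity name such as `documentindex_v2`.

```python
from __future__ import annotations


def parse_entities(raw: str) -> set[str]:
    """Split an entity list such as "document, dataset" into a set.

    Assumption: the setting is comma-separated.
    """
    return {name.strip() for name in raw.split(",") if name.strip()}


def target_indices(entity_type: str, enabled: bool, semantic_entities: set[str]) -> list[str]:
    """Return the index names a write for `entity_type` would go to.

    Assumption: index names follow the "<entity>index_v2" pattern.
    """
    primary = f"{entity_type}index_v2"
    indices = [primary]  # the keyword index is always written
    if enabled and entity_type in semantic_entities:
        # Dual-write: also write to the vector-enabled index.
        indices.append(f"{primary}_semantic")
    return indices


semantic_entities = parse_entities("document")
document_targets = target_indices("document", True, semantic_entities)
dataset_targets = target_indices("dataset", True, semantic_entities)
```

Under these assumptions, only entity types listed in the semantic-search setting receive the second write; every other entity type continues to use the keyword index alone.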

2. Embedding Generation

Embeddings are generated at two points:

Document Embeddings (at ingestion time):

Generated by the ingestion connector
Emitted via MCP (Metadata Change Proposal) as a SemanticContent aspect
GMS processes the MCP and writes embeddings to the semantic index
Supports privacy-sensitive use cases where only embeddings (not source text) are shared

Query Embeddings (at search time):

Generated by GMS using the configured embedding provider
Used to find similar documents via k-NN search

3. Query Processing

When a user performs a semantic search:

The query text is converted to an embedding vector using the same model that produced the document embeddings
OpenSearch performs k-NN (k-nearest neighbors) vector similarity search
Results are ranked by cosine similarity to the query embedding
Top matches are returned through the GraphQL API
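The ranking step can be illustrated with a brute-force version of k-NN search over cosine similarity. This is a minimal sketch, not what OpenSearch executes internally (a vector engine would normally use its own index structure rather than a linear scan). The three-dimensional vectors and two of the document titles are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Score every document against the query and keep the k most similar."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy embeddings (illustrative values only).
docs = {
    "Data Access Request Process": [0.9, 0.1, 0.0],
    "Quarterly Revenue Dashboard": [0.0, 0.2, 0.9],
    "Onboarding Checklist": [0.5, 0.5, 0.1],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedded query text

results = top_k(query, docs, k=2)
for doc_id, score in results:
    print(f"{score:.3f}  {doc_id}")
```

Because cosine similarity compares direction rather than magnitude, the query and document vectors only need to come from the same embedding model; their lengths do not need to match in scale.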

Quick Start

Prerequisites

DataHub running with semantic search enabled
Credentials for an embedding provider: an OpenAI API key (default), AWS credentials (for Bedrock), or a Cohere API key

1. Enable Semantic Search

Set in your environment (e.g., docker/profiles/empty2.env):

bash

ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document

2. Run the Smoke Test

To verify that semantic search is working, run the smoke test:

bash

cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v

This test:

Ingests sample documents via GraphQL
Waits for indexing (20 seconds)
Executes semantic search
Verifies results

GraphQL API

Semantic Search Query

graphql

query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
        ... on Document {
          info {
            title
            contents {
              text
            }
          }
        }
      }
    }
  }
}

Variables:

json

{
  "input": {
    "query": "how to request data access",
    "types": ["DOCUMENT"],
    "start": 0,
    "count": 10
  }
}
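For scripted access, the query and variables above can be sent as an ordinary GraphQL POST request. The sketch below uses only the Python standard library. The helper names `build_request` and `semantic_search` are hypothetical, and the endpoint URL and bearer-token header are assumptions about your deployment, so adjust both to match your environment. The selection set is trimmed to `urn` and `type` for brevity.

```python
import json
import urllib.request

SEMANTIC_SEARCH_QUERY = """
query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_request(endpoint, token, text, count=10):
    """Build (but do not send) the POST request for a semantic search."""
    payload = {
        "query": SEMANTIC_SEARCH_QUERY,
        "variables": {
            "input": {"query": text, "types": ["DOCUMENT"], "start": 0, "count": count}
        },
    }
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the deployment accepts a bearer access token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def semantic_search(endpoint, token, text, count=10):
    """Send the request and return the semanticSearchAcrossEntities payload."""
    request = build_request(endpoint, token, text, count)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["data"]["semanticSearchAcrossEntities"]
```

Calling `semantic_search("http://localhost:8080/api/graphql", token, "how to request data access")` would then return a dictionary with `total` and `searchResults`, assuming the GraphQL endpoint is reachable at that (assumed) address.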

Documentation Index

| File | Description |
| --- | --- |
| `README.md` | This documentation - overview and quick start |
| `ARCHITECTURE.md` | Detailed architecture and design decisions |
| `CONFIGURATION.md` | Configuration options and embedding models |
| `SWITCHING_PROVIDERS.md` | Guide for switching between embedding providers |

Testing

For a working example of semantic search, run the smoke test:

bash

# Run the smoke test
cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v
