DataHub Semantic Search

docs/dev-guides/semantic-search/README.md

This directory contains documentation for DataHub's semantic search capability, which enables natural language search across metadata entities using vector embeddings.

Note: This is developer documentation for the semantic search feature. For a working example, see the smoke test at smoke-test/tests/semantic/test_semantic_search.py.

Overview

Traditional keyword search requires exact term matches, limiting discoverability. Semantic search uses AI-generated embeddings to understand the meaning of queries and documents, returning relevant results even when exact keywords don't match.

Example:

  • Query: "how to request data access permissions"
  • Keyword search: ❌ No results (no exact match)
  • Semantic search: ✅ Returns "Data Access Request Process" document

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           DataHub Semantic Search                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────────┐ │
│  │  Ingestion   │     │     GMS      │────▶│      OpenSearch          │ │
│  │  Connector   │     │              │     │                          │ │
│  │              │     │ ┌──────────┐ │     │  ┌────────────────────┐  │ │
│  │ 1. Generate  │     │ │ Process  │ │     │  │ entityindex_v2     │  │ │
│  │    embeddings│     │ │ MCP +    │ │     │  │ (keyword search)   │  │ │
│  │              │     │ │ Write to │ │     │  └────────────────────┘  │ │
│  │ 2. Emit MCP  │────▶│ │ indices  │ │     │                          │ │
│  │    with      │     │ └──────────┘ │     │  ┌────────────────────┐  │ │
│  │    Semantic  │     │              │     │  │ entityindex_v2_    │  │ │
│  │    Embedding │     │              │     │  │ semantic           │  │ │
│  │    aspect    │     │              │     │  │ (vector search)    │  │ │
│  └──────────────┘     └──────────────┘     │  └────────────────────┘  │ │
│                                            └──────────────────────────┘ │
│                                                          │              │
│  ┌──────────────┐                                        │              │
│  │   GraphQL    │◀───────────────────────────────────────┘              │
│  │   Client     │  semanticSearchAcrossEntities()                       │
│  └──────────────┘                                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

How It Works

1. Data Ingestion

Documents and other entities are ingested into DataHub using standard ingestion connectors. When semantic search is enabled, GMS performs a dual-write:

  • Primary Index (entityindex_v2): Standard keyword-searchable index
  • Semantic Index (entityindex_v2_semantic): Vector-enabled index for semantic search

Note: The dual-index approach is transitional. The plan is to eventually retire the v2 indices and use the _semantic indices exclusively for both keyword and semantic search. See ARCHITECTURE.md for details.
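
The dual-write described above can be pictured with a small in-memory sketch. This is not GMS code: the `INDICES` dictionary stands in for OpenSearch, `index_names` and `dual_write` are hypothetical helpers, and the naming pattern `<entity>index_v2` / `<entity>index_v2_semantic` and the `embedding` field name are assumptions based on the index names shown in this document.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for the OpenSearch indices.
INDICES: Dict[str, Dict[str, dict]] = {}


def index_names(entity: str) -> Tuple[str, str]:
    """Return (primary, semantic) index names under the assumed naming pattern."""
    primary = f"{entity}index_v2"
    return primary, f"{primary}_semantic"


def dual_write(
    entity: str,
    urn: str,
    fields: dict,
    embedding: Optional[List[float]] = None,
    semantic_enabled: bool = True,
) -> List[str]:
    """Write to the primary index and, when enabled, to the semantic index."""
    primary, semantic = index_names(entity)

    # The primary index always receives the keyword-searchable fields.
    INDICES.setdefault(primary, {})[urn] = dict(fields)
    written = [primary]

    if semantic_enabled:
        # The semantic index receives the same fields plus the vector, if any.
        doc = dict(fields)
        if embedding is not None:
            doc["embedding"] = list(embedding)
        INDICES.setdefault(semantic, {})[urn] = doc
        written.append(semantic)

    return written
```

With semantic search disabled, only the primary index is written, which mirrors the behavior when the feature flag is off.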

2. Embedding Generation

Embeddings are generated at two points:

Document Embeddings (at ingestion time):

  • Generated by the ingestion connector
  • Emitted via MCP (Metadata Change Proposal) as a SemanticContent aspect
  • GMS processes the MCP and writes embeddings to the semantic index
  • Supports privacy-sensitive use cases where only embeddings (not source text) are shared

Query Embeddings (at search time):

  • Generated by GMS using the configured embedding provider
  • Used to find similar documents via k-NN search

3. Query Processing

When a user performs a semantic search:

  1. The query text is converted to an embedding vector using the same embedding model that produced the document embeddings
  2. OpenSearch performs k-NN (k-nearest neighbors) vector similarity search
  3. Results are ranked by cosine similarity to the query embedding
  4. Top matches are returned through the GraphQL API
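
The ranking in steps 2 and 3 can be illustrated with a brute-force sketch in plain Python. It is only a stand-in for the idea: OpenSearch runs its own k-NN implementation, and the function names and toy vectors here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def knn(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k most similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because cosine similarity compares direction rather than magnitude, query and document vectors are only comparable when they come from the same embedding model.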

Quick Start

Prerequisites

  • DataHub running with semantic search enabled
  • OpenAI API key (default), or AWS credentials (for Bedrock), or Cohere API key

1. Enable Semantic Search

Set the following in your environment (e.g., docker/profiles/empty2.env):

bash
ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
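
To double-check what a process would see before running anything, the two variables can be read back with a short script. This is a hypothetical helper, not part of DataHub; it assumes the entities value is a comma-separated list and that an unset flag means disabled.

```python
import os
from typing import List, Mapping, Tuple


def read_semantic_search_settings(
    env: Mapping[str, str] = os.environ,
) -> Tuple[bool, List[str]]:
    """Return (enabled, entities) parsed from the two variables shown above."""
    # Assumption: anything other than "true" (case-insensitive) means disabled.
    flag = env.get("ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED", "false")
    enabled = flag.strip().lower() == "true"

    # Assumption: multiple entity types are separated by commas.
    raw = env.get("ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES", "")
    entities = [name.strip() for name in raw.split(",") if name.strip()]
    return enabled, entities


if __name__ == "__main__":
    print(read_semantic_search_settings())
```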

2. Run the Smoke Test

The most direct way to verify that semantic search works is to run the smoke test:

bash
cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v

This test:

  • Ingests sample documents via GraphQL
  • Waits for indexing (20 seconds)
  • Executes semantic search
  • Verifies results

GraphQL API

Semantic Search Query

graphql
query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
        ... on Document {
          info {
            title
            contents {
              text
            }
          }
        }
      }
    }
  }
}

Variables:

json
{
  "input": {
    "query": "how to request data access",
    "types": ["DOCUMENT"],
    "start": 0,
    "count": 10
  }
}
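
A minimal client sketch for sending this query from Python, using only the standard library. The endpoint URL and the bearer-token header are assumptions about a typical deployment, and `build_payload` and `semantic_search` are hypothetical helper names; the query is a trimmed version of the one above.

```python
import json
import urllib.request
from typing import List, Optional

# Trimmed version of the query above; add the Document fields you need.
SEMANTIC_SEARCH_QUERY = """
query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_payload(query: str, types: List[str], start: int = 0, count: int = 10) -> dict:
    """Build the GraphQL request body with the same variables as the JSON above."""
    return {
        "query": SEMANTIC_SEARCH_QUERY,
        "variables": {
            "input": {"query": query, "types": list(types), "start": start, "count": count}
        },
    }


def semantic_search(graphql_url: str, payload: dict, token: Optional[str] = None) -> dict:
    """POST the payload to a GraphQL endpoint and return the decoded JSON response."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the deployment accepts a bearer token for authentication.
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(
        graphql_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call might look like `semantic_search("http://localhost:8080/api/graphql", build_payload("how to request data access", ["DOCUMENT"]))`, where the URL is a placeholder for your own GMS GraphQL endpoint.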

Documentation Index

| File | Description |
| --- | --- |
| README.md | This documentation - overview and quick start |
| ARCHITECTURE.md | Detailed architecture and design decisions |
| CONFIGURATION.md | Configuration options and embedding models |
| SWITCHING_PROVIDERS.md | Guide for switching between embedding providers |

Testing

For a working example of semantic search:

bash
# Run the smoke test
cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v

Further Reading