The DataHub MCP Server implements the Model Context Protocol (MCP), giving AI agents direct access to your DataHub metadata. Search for data assets, traverse lineage, inspect schemas, and generate SQL — all through natural language in tools like Cursor, Windsurf, Claude Desktop, and OpenAI.
Want to learn more about the motivation, architecture, and advanced use cases? Check out our deep dive blog post.
Search for Data
Find the right data by asking questions in plain English. Supports wildcard matching (revenue_*), field searches (tag:PII), and boolean logic ((sales OR revenue) AND quarterly).
Dive Deeper
Get usage stats, ownership, documentation, tags, glossary terms, and quality signals for any table, column, dashboard, & more — so agents can separate signal from noise.
Lineage & Impact Analysis
Trace data flow at table and column level, upstream or downstream, across multiple hops. Understand the origins of your data, and plan for upcoming changes.
Query Analysis & Authoring
Surface real SQL queries that reference a dataset — see join patterns, common filters, and aggregation behavior — then generate new queries grounded in actual usage.
Works Where You Work
Seamlessly integrates with Cursor, Windsurf, Claude Desktop, OpenAI, and any other MCP-compatible client.
The DataHub MCP Server provides the following tools:
search
Search DataHub using structured keyword search (/q syntax) with boolean logic, filters, pagination, and optional sorting by usage metrics.
get_lineage
Retrieve upstream or downstream lineage for any entity (datasets, columns, dashboards, etc.) with filtering, query-within-lineage, pagination, and hop control.
get_dataset_queries
Fetch real SQL queries referencing a dataset or column—manual or system-generated—to understand usage patterns, joins, filters, and aggregation behavior.
get_entities
Fetch detailed metadata for one or more entities by URN; supports batch retrieval for efficient inspection of search results.
list_schema_fields
List schema fields for a dataset with keyword filtering and pagination, useful when search results truncate fields or when exploring large schemas.
get_lineage_paths_between
Retrieve the exact lineage paths between two assets or columns, including intermediate transformations and SQL query information.
:::info
Mutation tools are available in mcp-server-datahub v0.5.0+. They are enabled via the TOOLS_IS_MUTATION_ENABLED=true environment variable.
:::
add_tags / remove_tags
Add or remove tags from entities or schema fields (columns). Supports bulk operations on multiple entities.
add_terms / remove_terms
Add or remove glossary terms from entities or schema fields. Useful for applying business definitions and data classification.
add_owners / remove_owners
Add or remove ownership assignments from entities. Supports different ownership types (technical owner, data owner, etc.).
set_domains / remove_domains
Assign or remove domain membership for entities. Each entity can belong to one domain.
update_description
Update, append to, or remove descriptions for entities or schema fields. Supports markdown formatting.
add_structured_properties / remove_structured_properties
Manage structured properties (typed metadata fields) on entities. Supports string, number, URN, date, and rich text value types.
:::info
User tools are available in mcp-server-datahub v0.5.0+. They are enabled via the TOOLS_IS_USER_ENABLED=true environment variable.
:::
get_me
Retrieve information about the currently authenticated user, including profile details and group memberships.
:::info
Document tools are available in mcp-server-datahub v0.5.0+. Document tools are automatically hidden if no documents exist in the catalog.
:::
search_documents
Search for documents using keyword search with filters for platforms, domains, tags, glossary terms, and owners.
grep_documents
Search within document content using regex patterns. Useful for finding specific information across multiple documents.
save_document
Save standalone documents (insights, decisions, FAQs, notes) to DataHub's knowledge base. Documents are organized under a configurable parent folder.
For DataHub Cloud v0.3.12+, you can connect directly to the hosted MCP server endpoint — no local installation required.
:::info
The managed MCP server endpoint is only available with DataHub Cloud v0.3.12+. For DataHub Core and older versions of DataHub Cloud, self-host the MCP server instead.
:::
:::note Streamable HTTP Only
DataHub's managed MCP server uses the streamable HTTP transport. Some older MCP clients (e.g. chatgpt.com) may only support the deprecated SSE transport; for those, use mcp-remote to bridge the gap.
:::
Given a DataHub Cloud instance at https://<tenant>.acryl.io, your managed MCP server URL is:
https://<tenant>.acryl.io/integrations/ai/mcp/
There are two ways to authenticate:
Authorization header — pass your token as a Bearer token in the Authorization header:
Authorization: Bearer <token>
Token in URL — append your token as a query parameter:
https://<tenant>.acryl.io/integrations/ai/mcp/?token=<token>
This is a convenient alternative when your MCP client doesn't support custom headers.
For on-premises DataHub Cloud, replace <tenant>.acryl.io with your DataHub FQDN, e.g. https://datahub.example.com/integrations/ai/mcp/?token=<token>.
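The two authentication options above can be sketched in a few lines of Python. This is a minimal illustration, not part of DataHub's tooling: the helper name `mcp_endpoint` and its parameters are hypothetical, and it only assembles the URL and headers described in this section without sending any request.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

# Path of the managed MCP endpoint, as given in this guide.
MCP_PATH = "/integrations/ai/mcp/"


def mcp_endpoint(base_url: str, token: str, token_in_url: bool = False):
    """Return (url, headers) for the managed MCP server.

    Hypothetical helper: base_url is the tenant URL or on-premises FQDN,
    e.g. "https://datahub.example.com".
    """
    parts = urlsplit(base_url)
    # Option 2: carry the token as a query parameter.
    query = urlencode({"token": token}) if token_in_url else ""
    url = urlunsplit((parts.scheme, parts.netloc, MCP_PATH, query, ""))
    # Option 1: carry the token as a Bearer token in the Authorization header.
    headers = {} if token_in_url else {"Authorization": f"Bearer {token}"}
    return url, headers
```

Prefer the header form where the client supports it, since a token embedded in a URL is more likely to end up in shell history or logs.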
Open your claude_desktop_config.json file. You can find it by navigating to Claude Desktop -> Settings -> Developer -> Edit Config.
Update the file to include the following content, replacing <tenant> and <token> with your own values.
{
  "mcpServers": {
    "datahub-cloud": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote",
        "https://<tenant>.acryl.io/integrations/ai/mcp/?token=<token>"
      ]
    }
  }
}
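If the config file already lists other MCP servers, pasting the snippet over it would drop them. The sketch below shows one way to merge the entry instead. The function name `add_mcp_remote_server` is hypothetical, and the code works on an in-memory dictionary, so reading and writing the actual file is left to you.

```python
import json


def add_mcp_remote_server(config: dict, name: str, url: str) -> dict:
    """Return a copy of config with an mcp-remote entry under mcpServers.

    Hypothetical helper: existing servers are preserved, and the input
    dictionary is not modified.
    """
    servers = dict(config.get("mcpServers", {}))
    servers[name] = {"command": "npx", "args": ["-y", "mcp-remote", url]}
    return {**config, "mcpServers": servers}


# Example: an existing config with one unrelated (made-up) server entry.
existing = {"mcpServers": {"other": {"command": "some-other-server"}}}
updated = add_mcp_remote_server(
    existing,
    "datahub-cloud",
    "https://<tenant>.acryl.io/integrations/ai/mcp/?token=<token>",
)
print(json.dumps(updated, indent=2))
```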
Claude Code natively supports streamable HTTP, so no proxy or additional dependencies are needed.
Run the following command, replacing <tenant> and <token> with your own values:
claude mcp add --transport http datahub-cloud "https://<tenant>.acryl.io/integrations/ai/mcp/?token=<token>"
{
  "mcpServers": {
    "datahub-cloud": {
      "url": "https://<tenant>.acryl.io/integrations/ai/mcp/?token=<token>"
    }
  }
}
Most AI tools support remote MCP servers. Provide the hosted MCP server URL:
https://<tenant>.acryl.io/integrations/ai/mcp/?token=<token>
Make sure authentication mode is not set to "OAuth" (if applicable).
For clients that don't yet support remote MCP servers, use mcp-remote:
npx -y mcp-remote "https://<tenant>.acryl.io/integrations/ai/mcp/?token=<token>"
Run the open-source MCP server locally. This works with any DataHub instance, both DataHub Core and DataHub Cloud.
Install uv:
# macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
The URL of your DataHub instance's GMS endpoint, e.g. http://localhost:8080 or https://<tenant>.acryl.io
The self-hosted server authenticates via environment variables:
- DATAHUB_GMS_URL: your DataHub GMS endpoint
- DATAHUB_GMS_TOKEN: your personal access token

These are passed to the mcp-server-datahub process at startup (see configuration examples below).
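As a rough sketch, the environment for the server process could be assembled like this. The function `build_server_env` and its parameters are hypothetical. Placing the optional `TOOLS_IS_MUTATION_ENABLED` and `TOOLS_IS_USER_ENABLED` flags in the same environment as the GMS variables is an assumption based on this guide describing them as environment variables.

```python
def build_server_env(gms_url, token, enable_mutation_tools=False, enable_user_tools=False):
    """Build the env mapping for an mcp-server-datahub process (hypothetical helper)."""
    # Both connection variables are required by the self-hosted server.
    if not gms_url or not token:
        raise ValueError("DATAHUB_GMS_URL and DATAHUB_GMS_TOKEN are both required")
    env = {"DATAHUB_GMS_URL": gms_url, "DATAHUB_GMS_TOKEN": token}
    # Optional tool groups (mcp-server-datahub v0.5.0+); assumed to be set
    # alongside the connection variables, and left out unless requested.
    if enable_mutation_tools:
        env["TOOLS_IS_MUTATION_ENABLED"] = "true"
    if enable_user_tools:
        env["TOOLS_IS_USER_ENABLED"] = "true"
    return env
```

The returned mapping matches the shape of the "env" block used in the client configurations.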
Run which uvx to find the full path to the uvx command.
Open your claude_desktop_config.json file. You can find it by navigating to Claude Desktop -> Settings -> Developer -> Edit Config.
Update the file to include the following content. Be sure to replace the placeholder values; <full-path-to-uvx> is the path printed by which uvx, e.g. /Users/hsheth/.local/bin/uvx.
{
  "mcpServers": {
    "datahub": {
      "command": "<full-path-to-uvx>",
      "args": ["mcp-server-datahub@latest"],
      "env": {
        "DATAHUB_GMS_URL": "<your-datahub-url>",
        "DATAHUB_GMS_TOKEN": "<your-datahub-token>"
      }
    }
  }
}
Run the following command, replacing the placeholder values:
claude mcp add datahub \
  -e DATAHUB_GMS_URL="<your-datahub-url>" \
  -e DATAHUB_GMS_TOKEN="<your-datahub-token>" \
  -- uvx mcp-server-datahub@latest
{
  "mcpServers": {
    "datahub": {
      "command": "uvx",
      "args": ["mcp-server-datahub@latest"],
      "env": {
        "DATAHUB_GMS_URL": "<your-datahub-url>",
        "DATAHUB_GMS_TOKEN": "<your-datahub-token>"
      }
    }
  }
}
For other AI tools, provide the following configuration:
- Command: uvx
- Args: mcp-server-datahub@latest
- Env:
  - DATAHUB_GMS_URL: <your-datahub-url>
  - DATAHUB_GMS_TOKEN: <your-datahub-token>

spawn uvx ENOENT
The full stack trace might look like this:
2025-04-08T19:58:16.593Z [datahub] [error] spawn uvx ENOENT {"stack":"Error: spawn uvx ENOENT\n at ChildProcess._handle.onexit (node:internal/child_process:285:19)\n at onErrorNT (node:internal/child_process:483:16)\n at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"}
Solution: Replace uvx in the command with the full path printed by which uvx.
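The same lookup can be done programmatically. The sketch below uses Python's standard `shutil.which`, which searches PATH in much the same way as the shell's `which`, and builds a server entry with the absolute path. The function name `datahub_server_entry` is hypothetical.

```python
import shutil


def datahub_server_entry(gms_url: str, token: str, command=None) -> dict:
    """Build an mcpServers entry that uses an absolute path to uvx.

    Hypothetical helper: pass command explicitly to skip the PATH lookup.
    """
    if command is None:
        # Resolve uvx on PATH; returns None when it is not installed.
        command = shutil.which("uvx")
    if command is None:
        raise FileNotFoundError("uvx not found on PATH; install uv first")
    return {
        "command": command,
        "args": ["mcp-server-datahub@latest"],
        "env": {"DATAHUB_GMS_URL": gms_url, "DATAHUB_GMS_TOKEN": token},
    }
```

Desktop applications often start with a narrower PATH than your terminal, which is a common reason the bare uvx command cannot be found even though it works in a shell.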