import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import YouTubeShortEmbed from '@site/src/components/YouTubeShortEmbed';
import CLIExtensionInstructions from '@site/src/components/CLIExtensionInstructions';
import GooseDesktopInstaller from '@site/src/components/GooseDesktopInstaller';
<YouTubeShortEmbed videoUrl="https://www.youtube.com/embed/VXRvHIZ3Eww?start=1878" />

This tutorial covers how to add the DataHub MCP Server as a goose extension to enable AI-powered data discovery, lineage exploration, and metadata querying across your data ecosystem.
:::tip Quick Install
<Tabs groupId="interface">
  <TabItem value="ui" label="goose Desktop" default>
  Launch the installer
  </TabItem>
  <TabItem value="cli" label="goose CLI">
  **Command**

  `uvx mcp-server-datahub@latest`
  </TabItem>
</Tabs>
:::
DataHub is an open-source metadata platform that provides a unified view of your data ecosystem, cataloging datasets, dashboards, pipelines, and more with rich metadata including ownership, lineage, usage statistics, and data quality information.
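The DataHub MCP Server enables AI agents to search the data catalog, trace upstream and downstream lineage, inspect real SQL query history, and fetch detailed metadata for datasets and other entities.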
Learn more: DataHub MCP Server Guide | GitHub Repository
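Before using the DataHub MCP Server, ensure you have a DataHub instance (DataHub Cloud or a local deployment) whose GMS URL you know, and a Personal Access Token for that instance.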
:::info
Note that you'll need uv installed on your system to run this command, as it uses uvx.
:::
<GooseDesktopInstaller
extensionId="datahub-mcp"
extensionName="DataHub"
description="Data discovery and metadata platform integration"
type="stdio"
command="uvx"
args={["mcp-server-datahub@latest"]}
timeout={300}
envVars={[
{ name: "DATAHUB_GMS_URL", label: "DataHub GMS URL (e.g., https://your-instance.acryl.io or http://localhost:8080)" },
{ name: "DATAHUB_GMS_TOKEN", label: "DataHub Personal Access Token" }
]}
apiKeyLink="https://docs.datahub.com/docs/authentication/personal-access-tokens"
apiKeyLinkText="DataHub Personal Access Token"
/>
<CLIExtensionInstructions
name="DataHub"
description="Data discovery and metadata platform integration"
type="stdio"
command="uvx mcp-server-datahub@latest"
timeout={300}
envVars={[
{ key: "DATAHUB_GMS_URL", value: "https://your-instance.acryl.io" },
{ key: "DATAHUB_GMS_TOKEN", value: "▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪" }
]}
infoNote={
<>
Get your Personal Access Token from{" "}
<a href="https://docs.datahub.com/docs/authentication/personal-access-tokens" target="_blank" rel="noopener noreferrer">
DataHub documentation
</a>. Use your DataHub GMS URL (e.g., https://your-instance.acryl.io for DataHub Cloud or http://localhost:8080 for local instances).
</>
}
/>
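Before launching the extension, it can help to sanity-check the two environment variables it reads. The following is a minimal sketch, not part of goose or DataHub: `check_datahub_env` is a hypothetical helper, and the rules it applies (an http or https URL with a host, and a non-empty token) are assumptions about what a usable configuration looks like. It does not contact DataHub or validate the token itself.

```python
from urllib.parse import urlparse


def check_datahub_env(env):
    """Return a list of human-readable problems found in the DataHub settings.

    `env` is any mapping of variable names to values, for example os.environ.
    Illustrative only: nothing here talks to a DataHub instance.
    """
    problems = []

    url = (env.get("DATAHUB_GMS_URL") or "").strip()
    if not url:
        problems.append("DATAHUB_GMS_URL is not set")
    else:
        parsed = urlparse(url)
        # Assumption: the GMS endpoint is reached over plain HTTP or HTTPS.
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            problems.append(f"DATAHUB_GMS_URL does not look like an http(s) URL: {url!r}")

    token = (env.get("DATAHUB_GMS_TOKEN") or "").strip()
    if not token:
        problems.append("DATAHUB_GMS_TOKEN is not set")

    return problems
```

Calling `check_datahub_env(os.environ)` in the shell session that starts goose returns an empty list when both variables look plausible, and one message per problem otherwise.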
Find datasets related to your project by describing what you need in natural language.
Find all datasets related to customer transactions that are owned by the analytics team
:::note Desktop
The DataHub extension will search across your data catalog and return relevant datasets along with their metadata.
:::
I want to remove the "timestamp_seconds" column from the customer_orders table. What will break?
Show me the upstream lineage for the customer_orders table
:::note Desktop
The extension will traverse the lineage graph and show any downstream assets that would be impacted by removing the column.
:::
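To make the idea of impact analysis concrete, here is a small sketch of a downstream traversal over a toy lineage graph. It is not how DataHub implements lineage (the server does the traversal for you): the graph, the asset names, and the `downstream_impact` helper are all hypothetical, and the `max_hops` parameter only mimics the idea of limiting how far the walk goes.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
# These names are made up for illustration and do not come from a real catalog.
DOWNSTREAM = {
    "customer_orders": ["orders_daily_agg", "orders_enriched"],
    "orders_daily_agg": ["revenue_dashboard"],
    "orders_enriched": ["churn_model_features"],
    "revenue_dashboard": [],
    "churn_model_features": [],
}


def downstream_impact(graph, start, max_hops=None):
    """Breadth-first walk returning {asset: hops_from_start} for downstream assets."""
    hops = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_hops is not None and depth >= max_hops:
            continue  # do not expand beyond the requested number of hops
        for child in graph.get(node, []):
            if child not in hops and child != start:
                hops[child] = depth + 1
                queue.append((child, depth + 1))
    return hops
```

With this toy graph, `downstream_impact(DOWNSTREAM, "customer_orders")` reports the two direct consumers at one hop and the dashboard and feature table at two hops, which is the kind of list an agent would summarize when asked what might break.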
How do I calculate the number of orders made in the USA last year?
What are the most common queries run against the customer_orders dataset?
:::note Desktop
The extension will retrieve SQL query history for the dataset, in addition to column names, types, descriptions, and any labels. This enables the agent to generate high-quality SQL to answer the question.
:::
Determine whether a dataset is trustworthy before using it.
Is the customer_orders table fresh and free of data quality issues?
:::note Desktop
The extension will fetch the dataset's freshness and data quality information, allowing the agent to warn the user or confirm that the data is trustworthy.
:::
The DataHub MCP Server provides the following tools:

- `search`: Search DataHub using structured keyword search (/q syntax) with boolean logic, filters, pagination, and optional sorting by usage metrics.
- `get_lineage`: Retrieve upstream or downstream lineage for any entity (datasets, columns, dashboards, etc.) with filtering, query-within-lineage, pagination, and hop control.
- `get_dataset_queries`: Fetch real SQL queries referencing a dataset or column (manual or system-generated) to understand usage patterns, joins, filters, and aggregation behavior.
- `get_entities`: Fetch detailed metadata for one or more entities by URN; supports batch retrieval for efficient inspection of search results.
- `list_schema_fields`: List schema fields for a dataset with keyword filtering and pagination, useful when search results truncate fields or when exploring large schemas.
- `get_lineage_paths_between`: Retrieve the exact lineage paths between two assets or columns, including intermediate transformations and SQL query information.
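goose issues these tool calls for you, but it can be useful to see roughly what one looks like on the wire. In the Model Context Protocol, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a message; `build_tool_call` is a hypothetical helper, and the argument names (`urn`, `upstream`) and the example URN are placeholders rather than the server's confirmed parameters, so check the tool schema the server advertises before relying on them.

```python
import json
from itertools import count

# Simple monotonically increasing request ids for this sketch.
_request_ids = count(1)


def build_tool_call(tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message (a dict)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Assumption: "urn" and "upstream" are placeholder argument names chosen for
# this sketch, and the snowflake URN is an invented example.
request = build_tool_call(
    "get_lineage",
    {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,customer_orders,PROD)",
        "upstream": False,
    },
)
print(json.dumps(request, indent=2))
```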
If you're having trouble connecting to DataHub:
1. Verify your `DATAHUB_GMS_URL` is correct:
   - DataHub Cloud: `https://your-tenant.acryl.io`
   - Local instance: `http://localhost:8080`
   - Self-hosted: `https://datahub.your-company.com`
2. Confirm your Personal Access Token is valid and has appropriate permissions
3. Check network connectivity and firewall rules
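One way to separate a network problem from a credentials problem is to test plain TCP reachability of the GMS host. The following sketch assumes nothing about DataHub's API: `gms_endpoint` and `can_connect` are hypothetical helpers that only parse the URL and attempt a socket connection, so a `True` result says the host and port are reachable, not that the URL path or token is valid.

```python
import socket
from urllib.parse import urlparse


def gms_endpoint(url):
    """Split a GMS URL into the (host, port) pair a TCP connection would target."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"not an http(s) URL: {url!r}")
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


def can_connect(url, timeout=5.0):
    """Return True if a TCP connection to the GMS host and port succeeds."""
    host, port = gms_endpoint(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `can_connect` returns `False` for your `DATAHUB_GMS_URL`, a firewall, VPN, or wrong host is the more likely cause; if it returns `True` and the extension still fails, look at the token and its permissions next.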
If uvx is not found:

1. Ensure uv is installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
2. Verify that uvx is on your PATH: `which uvx`