Enterprise-grade metadata platform enabling discovery, governance, and observability across your entire data ecosystem
<p align="center"> <a href="https://github.com/datahub-project/datahub/actions/workflows/build-and-test.yml"> </a> <a href="https://pypi.org/project/acryl-datahub/"> </a> <a href="https://pypi.org/project/acryl-datahub/"> </a> <a href="https://hub.docker.com/r/linkedin/datahub-gms"> </a> <a href="https://datahub.com/slack?utm_source=github&utm_medium=readme&utm_campaign=github_readme"> </a> <a href="https://www.youtube.com/channel/UC3qFQC5IiwR5fvWEqi_tJ5w"> </a> <a href="https://datahub.com/blog/"> </a> <a href="https://github.com/datahub-project/datahub/graphs/contributors"> </a> <a href="https://github.com/datahub-project/datahub/stargazers"> </a> <a href="https://github.com/datahub-project/datahub/blob/master/LICENSE"> </a> </p> <p align="center"> <a href="https://docs.datahub.com/docs/quickstart"><b>Quick Start</b></a> โข <a href="https://demo.datahub.com"><b>Live Demo</b></a> โข <a href="https://docs.datahub.com"><b>Documentation</b></a> โข <a href="https://feature-requests.datahubproject.io/roadmap"><b>Roadmap</b></a> โข <a href="https://datahub.com/slack"><b>Slack Community</b></a> โข <a href="https://www.youtube.com/@datahubproject"><b>YouTube</b></a> </p> <p align="center"> <i>Built with โค๏ธ by <a href="https://datahub.com">DataHub</a> and <a href="https://engineering.linkedin.com">LinkedIn</a></i> </p><i>โถ๏ธ Click to watch full demo on YouTube</i>
</p>Connect your AI coding assistants (Cursor, Claude Desktop, Cline) directly to DataHub. Query metadata with natural language: "What datasets contain PII?" or "Show me lineage for this table"
Quick setup:

```shell
npx -y @acryldata/mcp-server-datahub init
```
🔎 Finding the right DataHub? This is the open-source metadata platform at datahub.com (GitHub: datahub-project/datahub). It was previously hosted at datahubproject.io, which now redirects to datahub.com. This project is not related to datahub.io, which is a separate public dataset hosting service. See the FAQ below.
DataHub is the #1 open-source AI data catalog that enables discovery, governance, and observability across your entire data ecosystem. Originally built at LinkedIn, DataHub now powers data discovery at thousands of organizations worldwide, managing millions of data assets.
The Challenge: Modern data stacks are fragmented across dozens of tools: warehouses, lakes, BI platforms, ML systems, AI agents, orchestration engines. Finding the right data, understanding its lineage, and ensuring governance is like searching through a maze blindfolded.

The DataHub Solution: DataHub acts as the central nervous system for your data stack, connecting all your tools through real-time streaming or batch ingestion to create a unified metadata graph. Unlike static catalogs, DataHub keeps your metadata fresh and actionable, powering both human teams and AI agents.
Essential for modern data teams and reliable AI agents.
No. datahub.io is a completely separate project: a public dataset hosting service with no affiliation to this project. DataHub (this project) is an open-source metadata platform for data discovery, governance, and observability, hosted at datahub.com and developed at github.com/datahub-project/datahub.
</details>

<details>
<summary><b>What happened to datahubproject.io?</b></summary>

DataHub was previously hosted at datahubproject.io. That domain now redirects to datahub.com. All documentation has moved to docs.datahub.com. If you find references to datahubproject.io in blog posts or tutorials, they refer to this same project, just under its former domain.
Yes. DataHub was originally built at LinkedIn to manage metadata at scale across their data ecosystem. LinkedIn open-sourced DataHub in 2020. It has since grown into an independent community project under the datahub-project GitHub organization, now hosted at datahub.com.
</details>

<details>
<summary><b>How do I install the DataHub metadata platform?</b></summary>

```shell
pip install acryl-datahub
datahub docker quickstart
```

See the Quick Start section below for full instructions. The PyPI package is acryl-datahub.
<p align="center"><b>๐ Universal Search</b>
Find any data asset instantly across your entire stack</p> </td> <td width="50%">
<p align="center"><b>๐ Column-Level Lineage</b>
Trace data flow from source to consumption</p> </td>
</tr> <tr> <td width="50%"> <p align="center"><b>๐ Rich Dataset Profiles</b>
Schema, statistics, documentation, and ownership</p> </td> <td width="50%">
<p align="center"><b>๐๏ธ Governance Dashboard</b>
Manage policies, tags, and compliance</p> </td>
</tr> </table>

▶️ Watch DataHub in Action:
No installation required. Explore a fully-loaded DataHub instance with sample data instantly:
🚀 Launch Live Demo: demo.datahub.com
Get DataHub running on your machine in under 2 minutes:
```shell
# Prerequisites: Docker Desktop with 8GB+ RAM allocated

# Upgrade pip and install DataHub CLI
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub

# Launch DataHub locally via Docker
datahub docker quickstart

# Access DataHub at http://localhost:9002
# Default credentials: datahub / datahub
```
Note: You can also use uv or other Python package managers instead of pip.
What's included:
Best for advanced users who want to modify the core codebase or run directly from the repository:
```shell
# Clone the repository
git clone https://github.com/datahub-project/datahub.git
cd datahub

# Start all services with docker-compose
./docker/quickstart.sh

# Access DataHub at http://localhost:9002
# Default credentials: datahub / datahub
```
DataHub supports three deployment models:
→ See all deployment guides (AWS, Azure, GCP, environment variables)

→ Full architecture breakdown: components, storage layer, APIs, and design decisions
Use Case: Extract table metadata, column schemas, and usage statistics from Snowflake data warehouse.
Prerequisites:
```shell
pip install 'acryl-datahub[snowflake]'
```

```yaml
# snowflake_recipe.yml
source:
  type: snowflake
  config:
    # Connection details
    account_id: "xy12345.us-east-1"
    warehouse: "COMPUTE_WH"
    username: "${SNOWFLAKE_USER}"
    password: "${SNOWFLAKE_PASSWORD}"

    # Optional: Filter specific databases
    database_pattern:
      allow:
        - "ANALYTICS_DB"
        - "MARKETING_DB"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

```shell
# Run ingestion
datahub ingest -c snowflake_recipe.yml

# Expected output:
# ✓ Connecting to Snowflake...
# ✓ Discovered 150 tables in ANALYTICS_DB
# ✓ Discovered 75 tables in MARKETING_DB
# ✓ Ingesting metadata...
# ✓ Successfully ingested 225 datasets to DataHub
```
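The same recipe can also be driven from Python instead of the CLI. The sketch below builds the recipe as a plain dict using only the standard library; the `Pipeline` invocation in the comments is an assumption based on the acryl-datahub ingestion framework and should be verified against the ingestion docs.

```python
import os

# The snowflake_recipe.yml above, expressed as a Python dict.
# ${VAR} placeholders are resolved here via os.environ, mirroring
# the CLI's environment-variable expansion.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "xy12345.us-east-1",
            "warehouse": "COMPUTE_WH",
            "username": os.environ.get("SNOWFLAKE_USER", ""),
            "password": os.environ.get("SNOWFLAKE_PASSWORD", ""),
            "database_pattern": {"allow": ["ANALYTICS_DB", "MARKETING_DB"]},
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

# With acryl-datahub installed and DataHub running, the dict can be
# executed directly (assumption: same semantics as `datahub ingest`):
#
#   from datahub.ingestion.run.pipeline import Pipeline
#   pipeline = Pipeline.create(recipe)
#   pipeline.run()
#   pipeline.raise_from_status()

print(sorted(recipe))  # ['sink', 'source']
```

Keeping the recipe in Python makes it easy to parameterize ingestion (e.g., per-environment database filters) from an orchestrator such as Airflow.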
What gets ingested:
Use Case: Programmatically search DataHub catalog and retrieve dataset metadata.
Prerequisites:
```shell
pip install 'acryl-datahub[datahub-rest]'
```

```python
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Initialize DataHub client
config = DatahubClientConfig(server="http://localhost:8080")
graph = DataHubGraph(config)

# Search for datasets containing "customer"
# Returns up to 10 most relevant results
results = graph.search(
    entity="dataset",
    query="customer",
    count=10,
)

# Process and display results
for result in results:
    print(f"Found: {result.entity.urn}")
    print(f"  Name: {result.entity.name}")
    print(f"  Platform: {result.entity.platform}")
    print(f"  Description: {result.entity.properties.description}")
    print("---")

# Example output:
# Found: urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customer_profiles,PROD)
#   Name: customer_profiles
#   Platform: snowflake
#   Description: Aggregated customer data from CRM and transactions
# ---
```
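The `Found:` lines are DataHub URNs. A dataset URN is composed from a platform, a name, and an environment. Here is a stdlib-only sketch of that format; in real code, use the SDK helper `datahub.emitter.mce_builder.make_dataset_urn` instead, which this function merely mirrors.

```python
def make_dataset_urn_str(platform: str, name: str, env: str = "PROD") -> str:
    """Compose a DataHub dataset URN from its three parts.

    Mirrors the format produced by datahub's make_dataset_urn helper:
    urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


urn = make_dataset_urn_str("snowflake", "analytics.customer_profiles")
print(urn)
# urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customer_profiles,PROD)
```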
Response format: Each result contains:

- `urn`: Unique resource identifier for the dataset
- `name`: Human-readable dataset name
- `platform`: Source platform (snowflake, bigquery, etc.)
- `properties`: Additional metadata (description, tags, owners, etc.)

Use Case: Retrieve upstream and downstream dependencies for a specific dataset.
Prerequisites:
GraphQL Query:
```graphql
query GetLineage {
  dataset(
    urn: "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customer_profiles,PROD)"
  ) {
    # Get upstream dependencies (source tables)
    upstream: lineage(input: { direction: UPSTREAM }) {
      entities {
        urn
        ... on Dataset {
          name
          platform {
            name
          }
        }
      }
    }

    # Get downstream dependencies (consuming tables/dashboards)
    downstream: lineage(input: { direction: DOWNSTREAM }) {
      entities {
        urn
        type
        ... on Dataset {
          name
          platform {
            name
          }
        }
        ... on Dashboard {
          dashboardId
          tool
        }
      }
    }
  }
}
```
Execute via cURL:
```shell
curl -X POST http://localhost:8080/api/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "query GetLineage { ... }"}'
```
Response structure:

- `upstream`: Array of datasets that feed into this dataset
- `downstream`: Array of datasets, dashboards, or ML models that consume this dataset

Use Case: Programmatically add or update dataset documentation and custom properties.
Prerequisites:
```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Create emitter to send metadata to DataHub
emitter = DatahubRestEmitter("http://localhost:8080")

# Create dataset URN (unique identifier)
dataset_urn = make_dataset_urn(
    platform="snowflake",
    name="analytics.customer_profiles",
    env="PROD",
)

# Define dataset properties
properties = DatasetPropertiesClass(
    description="""
Customer profiles aggregated from CRM and transaction data.

**Update Schedule:** Updated nightly via Airflow DAG `customer_profile_etl`
**Data Retention:** 7 years for compliance
**Owner:** Data Platform Team
""",
    customProperties={
        "owner_team": "data-platform",
        "update_frequency": "daily",
        "data_sensitivity": "PII",
        "upstream_dag": "customer_profile_etl",
        "business_domain": "customer_analytics",
    },
)

# Wrap the aspect in a MetadataChangeProposal and emit it to DataHub.
# The aspect name (datasetProperties) is inferred from the aspect type.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties)
)

print(f"✓ Successfully updated documentation for {dataset_urn}")
```
What this does:
Use Case: Enable AI agents (Cursor, Claude Desktop, Cline) to query DataHub metadata directly from your IDE or development environment.
Prerequisites:
Quick Setup:
```shell
# Initialize MCP server for DataHub
npx -y @acryldata/mcp-server-datahub init

# Follow the interactive prompts to configure:
# - DataHub GMS endpoint (e.g., http://localhost:8080)
# - Authentication token (if required)
# - MCP server settings
```
Configure your AI tool:
For Claude Desktop, add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "datahub": {
      "command": "npx",
      "args": ["-y", "@acryldata/mcp-server-datahub"]
    }
  }
}
```
For Cursor, configure in Settings → Features → MCP Servers.
What you can ask your AI:
Example conversation:
You: "What datasets are owned by the data-platform team?"
AI: Based on DataHub metadata, here are the datasets owned by data-platform:
- urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customer_profiles,PROD)
Name: customer_profiles
Platform: Snowflake
Description: Aggregated customer data from CRM and transactions
- urn:li:dataset:(urn:li:dataPlatform:bigquery,marketing.campaign_performance,PROD)
Name: campaign_performance
Platform: BigQuery
Description: Marketing campaign metrics and ROI tracking
[... more results]
Benefits:
📚 Full Documentation: MCP Server for DataHub
</details>

| Use Case | Description | Learn More |
|---|---|---|
| 🔍 Data Discovery | Help users find the right data for analytics and ML | Guide |
| 📈 Impact Analysis | Understand downstream impact before making changes | Lineage Docs |
| 🏛️ Data Governance | Enforce policies, classify PII, manage access | Governance Guide |
| ✅ Data Quality | Monitor freshness, volumes, schema changes | Quality Checks |
| 📝 Documentation | Centralize data documentation and knowledge | Docs Features |
| 👥 Collaboration | Foster data culture with discussions and ownership | Collaboration |
Learn from teams using DataHub in production and get practical guidance:
<table> <tr> <td width="33%"> <h3><a href="https://datahub.com/blog/metadata-in-action-tips-and-tricks-from-the-field/">๐ Best Practices from the Field</a></h3> <p>Real-world metadata strategies from teams at Grab, Slack, and Checkout.com who manage data at scale.</p> <sub><i>Case Studies</i></sub> </td> <td width="33%"> <h3><a href="https://datahub.com/blog/the-what-why-and-how-of-data-contracts/">๐ Data Contracts: How to Use Them</a></h3> <p>Practical guide to implementing data contracts between producers and consumers for quality and accountability.</p> <sub><i>Implementation Guide</i></sub> </td> <td width="33%"> <h3><a href="https://datahub.com/blog/datahub-mcp-server-block-ai-agents-use-case/">๐ค How Block Powers AI Agents with DataHub</a></h3> <p>Real-world case study: scaling data governance and AI operations across 50+ platforms using MCP.</p> <sub><i>AI Case Study</i></sub> </td> </tr> </table> <p align="center"> <a href="https://datahub.com/blog/"><b>โ Explore all posts on our blog</b></a> </p>3,000+ organizations run DataHub in production worldwide โ across both open-source deployments and DataHub Cloud โ from hyperscale tech companies to regulated financial institutions and healthcare providers.
🛒 E-Commerce & Retail: Etsy • Experius • Klarna • LinkedIn • MediaMarkt Saturn • Uphold • Wealthsimple • Wolt

🏥 Healthcare & Life Sciences: CVS Health • IOMED • Optum

✈️ Travel & Transportation: Cabify • DFDS • Expedia Group • Hurb • Peloton • Viasat

🎓 Education & EdTech: ClassDojo • Coursera • Udemy

💰 Financial Services: Banksalad • Block • Chime • FIS • Funding Circle • GEICO • Inter&Co • N26 • Santander • Shanghai HuaRui Bank • Stash • Visa

🎮 Gaming, Entertainment & Streaming: Netflix • Razer • Showroomprive • TypeForm • UKEN Games • Zynga

💻 Technology & SaaS: Adevinta • Apple • Digital Turbine • DPG Media • Foursquare • Geotab • HashiCorp • hipages • inovex • KPN • Miro • MYOB • Notion • Okta • Rippling • Saxo Bank • Slack • ThoughtWorks • Twilio • Wikimedia • WP Engine

📊 Data & Analytics: ABLY • DefinedCrowd • Grofers • Haibo Technology • Moloco • PITS Global Data Recovery Services • SpotHero
And thousands more across DataHub Core and DataHub Cloud.
Using DataHub? Please feel free to add your organization to the list if we missed it: open a PR or let us know on Slack.
DataHub is part of a rich ecosystem of tools and integrations.
| Repository | Description | Links |
|---|---|---|
| datahub | Core platform: metadata model, services, connectors, and web UI | Docs |
| datahub-actions | Framework for responding to metadata changes in real-time | Guide |
| datahub-helm | Production-ready Helm charts for Kubernetes deployment | Charts |
| static-assets | Logos, images, and brand assets for DataHub | - |
| Project | Description | Maintainer |
|---|---|---|
| datahub-tools | Python tools for GraphQL endpoint interaction | Notion |
| dbt-impact-action | GitHub Action for dbt change impact analysis | Acryl Data |
| business-glossary-sync-action | Sync business glossary via GitHub PRs | Acryl Data |
| mcp-server-datahub | Model Context Protocol server for AI integration | Acryl Data |
| meta-world | Recipes, custom sources, and transformations | Community |
📊 BI & Analytics: Tableau • Looker • Power BI • Superset • Metabase • Mode • Redash

🗄️ Data Warehouses: Snowflake • BigQuery • Redshift • Databricks • Synapse • ClickHouse

🔄 Data Orchestration: Airflow • dbt • Dagster • Prefect • Luigi

🤖 ML Platforms: SageMaker • MLflow • Feast • Kubeflow • Weights & Biases

🔌 Data Integration: Fivetran • Airbyte • Stitch • Matillion
Join thousands of data practitioners building with DataHub!
Next Town Hall:
Last Town Hall:
| Channel | Purpose | Link |
|---|---|---|
| Slack Community | Real-time chat, questions, announcements | Join 14,000+ members |
| GitHub Discussions | Technical discussions, feature requests | Start a Discussion |
| GitHub Issues | Bug reports, feature requests | Open an Issue |
| Stack Overflow | Technical Q&A (tag: datahub) | Ask a Question |
| YouTube | Tutorials, demos, talks | Subscribe |
| LinkedIn | Company updates, blogs | Follow Us |
| Twitter/X | Quick updates, community highlights | Follow @datahubproject |
We ❤️ contributions from the community! See CONTRIBUTING.md for setup, guidelines, and ways to get involved.
Browse Good First Issues to get started!
Blog Posts & Articles:
Conference Talks:
Podcasts:
| Resource | URL |
|---|---|
| 📚 Official Documentation | https://docs.datahub.com |
| 🌐 Project Website | https://datahub.com |
| 🚀 Live Demo | https://demo.datahub.com |
| 🗺️ Roadmap | https://feature-requests.datahubproject.io/roadmap |
| 🏛️ Town Hall Schedule | https://docs.datahub.com/docs/townhalls |
| 💬 Slack Community | https://datahub.com/slack |
| 📺 YouTube Channel | https://youtube.com/@datahubproject |
| 📝 Blog | https://datahub.com/blog/ |
| 🔗 LinkedIn | https://www.linkedin.com/company/72009941 |
| 🐦 Twitter/X | https://twitter.com/datahubproject |
| 🔒 Security | https://docs.datahub.com/docs/security |
DataHub is open source software released under the Apache License 2.0.
Copyright 2015-2025 LinkedIn Corporation
Copyright 2025-Present DataHub Project Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
What this means:
Learn more: Choose a License - Apache 2.0