metadata-models/docs/entities/dataPlatform.md
A Data Platform is a metadata entity that represents a source system, technology, or tool that contains and manages data assets. Data Platforms are the foundational building blocks in DataHub's metadata model, serving as the namespace and classification system for all datasets, dashboards, charts, jobs, and other data assets.
Examples of data platforms include databases (MySQL, PostgreSQL, Oracle), data warehouses (Snowflake, BigQuery, Redshift), BI tools (Looker, Tableau, Power BI), data lakes (S3, HDFS), message brokers (Kafka), and many other systems in which data resides or through which it flows.
A Data Platform is uniquely identified by a single component, the platform name (for example, mysql or snowflake).
The URN structure for a Data Platform is:
urn:li:dataPlatform:<platformName>
Examples:
urn:li:dataPlatform:mysql
urn:li:dataPlatform:snowflake
urn:li:dataPlatform:bigquery
urn:li:dataPlatform:looker
urn:li:dataPlatform:kafka
urn:li:dataPlatform:s3
urn:li:dataPlatform:dbt
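Because the URN is just a fixed prefix followed by the platform name, building and parsing one is straightforward. The following is a minimal sketch in plain Python; `make_platform_urn` and `platform_name_from_urn` are hypothetical helper names chosen for illustration, not DataHub APIs.

```python
PLATFORM_URN_PREFIX = "urn:li:dataPlatform:"


def make_platform_urn(platform_name: str) -> str:
    """Build a Data Platform URN from a bare platform name."""
    if not platform_name:
        raise ValueError("platform name must be non-empty")
    return f"{PLATFORM_URN_PREFIX}{platform_name}"


def platform_name_from_urn(urn: str) -> str:
    """Extract the platform name from a Data Platform URN."""
    if not urn.startswith(PLATFORM_URN_PREFIX):
        raise ValueError(f"not a data platform URN: {urn!r}")
    name = urn[len(PLATFORM_URN_PREFIX):]
    if not name:
        raise ValueError("URN has an empty platform name")
    return name


print(make_platform_urn("snowflake"))
print(platform_name_from_urn("urn:li:dataPlatform:kafka"))
```

In real ingestion code you would normally rely on the URN helpers that ship with the DataHub Python package rather than hand-rolled string handling.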
Platform names follow these conventions:
The complete list of officially supported data platforms is maintained in DataHub's data-platforms.yaml bootstrap configuration.
While DataHub ships with 100+ pre-defined platforms, you can create custom platform entities for:
When creating custom platforms, follow the naming conventions above and ensure uniqueness across your DataHub instance.
The dataPlatformInfo aspect contains the core metadata about a data platform, including its name, displayName, type, datasetNameDelimiter, and logoUrl.
Data platforms are classified into categories such as FILE_SYSTEM, RELATIONAL_DB, OLAP_DATASTORE, OBJECT_STORE, MESSAGE_BROKER, and OTHERS.
The platform type indicates what kind of system the platform is and what kinds of metadata are expected from it.
The following code snippet shows you how to create a custom Data Platform.
<details>
<summary>Python SDK: Create a Data Platform</summary>

{{ inline /metadata-ingestion/examples/library/data_platform_create.py show_path_as_comment }}

</details>
The datasetNameDelimiter field is critical for understanding dataset naming conventions on each platform:
Common delimiters are the dot (for example, database.schema.table in PostgreSQL) and the slash (for example, /data/warehouse/customers in HDFS).
This delimiter helps DataHub:
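One thing a delimiter makes possible is breaking a flat dataset name into its hierarchy levels. The sketch below illustrates the idea with the two example names from this section; `split_dataset_name` is a hypothetical helper written for this page, not part of DataHub.

```python
from typing import List


def split_dataset_name(name: str, delimiter: str) -> List[str]:
    """Split a dataset name into hierarchy levels, dropping empty segments."""
    if not delimiter:
        # No delimiter configured: treat the whole name as a single level.
        return [name]
    return [part for part in name.split(delimiter) if part]


print(split_dataset_name("database.schema.table", "."))
# ['database', 'schema', 'table']
print(split_dataset_name("/data/warehouse/customers", "/"))
# ['data', 'warehouse', 'customers']
```

Dropping empty segments means a leading slash in a file path does not produce a blank first level.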
Platforms can have custom logos displayed in the DataHub UI through the logoUrl field. This helps users quickly recognize platforms visually. DataHub ships with built-in logos for all officially supported platforms, but custom platforms can specify their own logo URLs.
Data Platforms are the parent entity for all datasets. Every dataset URN includes a platform reference:
urn:li:dataset:(urn:li:dataPlatform:snowflake,database.schema.table,PROD)
This relationship enables:
Beyond datasets, platforms are referenced by many other entity types:
All of these entities include a platform reference in their URNs, creating a comprehensive technology inventory.
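To make the inventory idea concrete, here is a minimal sketch that pulls the platform name out of dataset URNs and counts assets per platform. It handles only the dataset URN layout shown above; the helper name `platform_of_dataset` and the sample URNs are made up for illustration.

```python
from collections import Counter

DATASET_PREFIX = "urn:li:dataset:("
PLATFORM_PREFIX = "urn:li:dataPlatform:"


def platform_of_dataset(dataset_urn: str) -> str:
    """Return the platform name embedded in a dataset URN."""
    if not (dataset_urn.startswith(DATASET_PREFIX) and dataset_urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {dataset_urn!r}")
    # The tuple inside the parentheses is (platform URN, dataset name, environment).
    inner = dataset_urn[len(DATASET_PREFIX):-1]
    platform_urn = inner.split(",", 1)[0]
    if not platform_urn.startswith(PLATFORM_PREFIX):
        raise ValueError(f"missing platform reference in: {dataset_urn!r}")
    return platform_urn[len(PLATFORM_PREFIX):]


# Hypothetical sample URNs.
urns = [
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,database.schema.table,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,database.schema.orders,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:kafka,clickstream,PROD)",
]

inventory = Counter(platform_of_dataset(u) for u in urns)
print(inventory.most_common())
```

Splitting on the first comma is enough here because the platform URN is always the first element of the tuple.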
For organizations running multiple instances of the same platform (e.g., multiple Snowflake accounts, multiple production BigQuery projects), DataHub provides a dataPlatformInstance entity. This allows distinguishing between:
Platform instances reference their parent platform and add deployment-specific metadata like environment, region, or account information.
<details>
<summary>Python SDK: Create a Platform Instance</summary>

{{ inline /metadata-ingestion/examples/library/platform_instance_create.py show_path_as_comment }}

</details>
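As a rough illustration of how an instance refers back to its parent platform, the sketch below builds a platform instance URN by nesting the platform URN inside it. The URN layout used here is an assumption not stated on this page, and the function name and instance names are hypothetical; check the dataPlatformInstance documentation before relying on it.

```python
def make_platform_instance_urn(platform_name: str, instance_name: str) -> str:
    """Build a platform instance URN that embeds its parent platform URN.

    Assumed layout: urn:li:dataPlatformInstance:(<platform URN>,<instance name>)
    """
    if not platform_name or not instance_name:
        raise ValueError("platform and instance names must be non-empty")
    platform_urn = f"urn:li:dataPlatform:{platform_name}"
    return f"urn:li:dataPlatformInstance:({platform_urn},{instance_name})"


# Two hypothetical Snowflake accounts that share one parent platform.
print(make_platform_instance_urn("snowflake", "prod_us_east"))
print(make_platform_instance_urn("snowflake", "prod_eu_west"))
```

Both instances resolve to the same parent platform, so they can still be grouped together as Snowflake assets.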
Data Platforms are typically created automatically through two mechanisms:
You rarely need to manually create platform entities unless you're adding a custom or in-house system not covered by the standard bootstrap list.
Platforms are fully searchable in DataHub:
This enables users to explore the data landscape from a platform-centric view.
Data Platforms can be queried through the GraphQL API to retrieve:
While platforms are rarely created via GraphQL (since they're mostly bootstrap data), the API enables programmatic access to platform information.
DataHub makes a distinction between:
Bootstrap platforms:
Custom platforms:
Once a platform is created and assets are associated with it, the platform name becomes immutable for practical purposes. Changing a platform name would require:
If you need to rename a platform, it's generally easier to:
Platform names are limited to 15 characters (enforced by @validate.strlen.max = 15). This is a legacy constraint that ensures platform names are concise and fit well in URNs and UI displays.
If your platform's natural name exceeds 15 characters, use an abbreviation or acronym:
The full name can go in the displayName field without length restrictions.
Platform names are case-sensitive in URNs but conventionally always lowercase. Avoid creating platforms that differ only in case (e.g., "MySQL" and "mysql") as this will cause confusion and potential URN conflicts.
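The length and casing rules above are easy to check before registering a custom platform. The following is a minimal pre-flight sketch; `check_platform_name` is a hypothetical helper, and the set of existing names is assumed to come from wherever you track your platforms.

```python
from typing import List, Set

MAX_PLATFORM_NAME_LENGTH = 15  # mirrors @validate.strlen.max = 15


def check_platform_name(name: str, existing: Set[str]) -> List[str]:
    """Return a list of problems with a proposed platform name (empty if none)."""
    problems = []
    if not name:
        problems.append("name is empty")
    if len(name) > MAX_PLATFORM_NAME_LENGTH:
        problems.append(
            f"name is {len(name)} characters; the limit is {MAX_PLATFORM_NAME_LENGTH}"
        )
    if name != name.lower():
        problems.append("name is not lowercase")
    # Flag names that would differ from an existing platform only in case.
    if any(other != name and other.lower() == name.lower() for other in existing):
        problems.append("name differs only in case from an existing platform")
    return problems


print(check_platform_name("mysql", {"mysql", "kafka"}))
print(check_platform_name("MySQL", {"mysql", "kafka"}))
print(check_platform_name("internaldatawarehouse", set()))
```

A name that fails the length check can be shortened for the URN while the full name is kept in displayName.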
The type field (FILE_SYSTEM, RELATIONAL_DB, etc.) is primarily for classification and display purposes. DataHub does not enforce different behaviors based on platform type. Two platforms of different types are treated the same way by the system.
However, connectors and ingestion logic may use platform type to determine:
RELATIONAL_DB platforms: MySQL, PostgreSQL, Oracle, SQL Server, MariaDB, DB2
OLAP_DATASTORE platforms: Snowflake, BigQuery, Redshift, Clickhouse, Pinot, Druid
These platforms typically have:
OTHERS platform type (BI-specific): Looker, Tableau, Power BI, Metabase, Superset, Mode
These platforms typically have:
OBJECT_STORE platforms: S3, GCS, Azure Blob Storage, MinIO
FILE_SYSTEM platforms: HDFS, NFS, Azure Data Lake
These platforms typically have:
MESSAGE_BROKER platforms: Kafka, Pulsar, Kinesis, Event Hubs, RabbitMQ
These platforms typically have:
OTHERS platform type (orchestration-specific): Airflow, Prefect, Dagster, Luigi, Argo
These platforms typically have:
OTHERS platform type (transformation-specific): dbt, Spark, Flink, Databricks
These platforms typically have:
Data Platforms enable several critical use cases in DataHub:
By establishing platforms as first-class entities, DataHub provides a comprehensive view of your organization's data technology landscape.