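DataHub's metadata model for Datasets currently supports a three-part key:

- Data platform (for example, `urn:li:dataPlatform:mysql`)
- Name
- Environment, or fabric (the `ENV` part of the URN)
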
This naming scheme unfortunately does not allow for easy representation of the multiplicity of platforms (or technologies) that might be deployed at an organization within the same environment or fabric. For example, an organization might have multiple Redshift instances in Production and would want to see all the data assets located in those instances inside the DataHub metadata repository.
Note: While platform instances provide one solution to this problem, they come with trade-offs with respect to URN immutability. DataHub also offers alternative approaches for organizing and managing multiple platform instances; see the Alternative Approaches section below for more information.
As part of the v0.8.24+ releases, we are unlocking the first phase of supporting Platform Instances in the metadata model. This is done via two main additions:
- The `dataPlatformInstance` aspect, which has been added to Datasets and allows datasets to be associated with an instance of a platform.
- Ingestion source configuration that attaches a platform instance to generated URNs, changing them from the `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,ENV)` format to the `urn:li:dataset:(urn:li:dataPlatform:<platform>,<instance.name>,ENV)` format. Sources that produce lineage to datasets in other platforms (e.g. Looker, Superset) also have specific configuration additions that allow the recipe author to specify the mapping between a platform and the instance name that it should be mapped to.

DataHub URNs are immutable identifiers that must remain unchanged once assigned to an entity. This immutability is fundamental to maintaining data integrity, lineage tracking, and consistent references throughout the system. Once a URN is created, it should never be modified, even if the underlying data asset's attributes change.
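To make the URN format change concrete, here is a minimal sketch that builds dataset URN strings with and without a platform instance. The helper `dataset_urn` and the table name `dbname.orders` are assumptions made for illustration; this is plain string formatting, not DataHub's own API.

```python
from typing import Optional


def dataset_urn(platform: str, name: str, env: str,
                platform_instance: Optional[str] = None) -> str:
    """Build a dataset URN string in the format described above.

    Hypothetical helper for illustration; not DataHub's own API.
    """
    # With an instance, the name part becomes "<instance>.<name>".
    full_name = f"{platform_instance}.{name}" if platform_instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{full_name},{env})"


without_instance = dataset_urn("mysql", "dbname.orders", "PROD")
with_instance = dataset_urn("mysql", "dbname.orders", "PROD", "primary-mysql")
# Same table, but the instance has been renamed (hypothetical new name).
renamed = dataset_urn("mysql", "dbname.orders", "PROD", "core-mysql")

print(without_instance)
print(with_instance)
# Renaming the instance yields a different URN string.
print(with_instance != renamed)
```

Because the instance name is part of the identifier, renaming an instance produces a different URN for the same physical table. This is why the choice of instance name matters for immutability.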
Many organizations face a critical challenge: URNs serve dual purposes. They are both internal system identifiers and user-facing identifiers visible in the DataHub UI. This creates a conflict when organizational taxonomy (domains, products, systems) changes: business context embedded in a URN becomes outdated, but the URN itself must remain unchanged.
When establishing platform instance naming conventions, it is crucial to choose an instance name that is understandable and will be stable for the foreseeable future. For example, `core_warehouse` or `finance_redshift` are allowed names, as are pure GUIDs like `a37dc708-c512-4fe4-9829-401cd60ed789`. Remember that whatever instance name you choose, you will need to specify it in more than one recipe to ensure that the identifiers produced by different sources line up.
To ensure URN immutability and long-term stability, platform instance names should be technical identifiers that are intrinsic to the infrastructure, not business concepts. Use DataHub's built-in features for domains, ownership, and business context.
✅ Good Examples:

- `us-east-1-cluster-1`, `eu-west-2-cluster-2`
- `primary-redshift`, `secondary-mysql`, `analytics-snowflake`
- `a37dc708-c512-4fe4-9829-401cd60ed789`
- `rds-prod-001`, `redshift-analytics-01`

❌ Avoid These Patterns:

- `company.domain.product.system` (domains, products, systems change)
- `customer_data_warehouse`, `finance_redshift` (use DataHub domains instead)
- `john_warehouse`, `sarah_analytics` (use DataHub ownership features)
- `redshift_v2`, `mysql_8_0` (use DataHub's versioning capabilities)
- `temp_warehouse`, `migration_db`
- `legacy_mysql`, `old_redshift` (use DataHub tags instead)

Key Principles:

Note: Business context like domains, ownership, data classification, and technology migration status should be managed through DataHub's dedicated features (domains, ownership, tags, etc.) rather than embedded in the platform instance name. Environment information is best handled by tags rather than by fabric type, since tags allow for promotion over time, and versioning should use DataHub's versioning capabilities.
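Some of the patterns to avoid can be caught mechanically before a name is adopted. The following is a rough sketch of such a check. The function `naming_warnings`, its word list, and its rules are assumptions made for illustration, not a DataHub feature. Business-domain and owner names (such as `finance_redshift` or `john_warehouse`) cannot be detected this way and still need human review.

```python
import re
from typing import List

# Words from the "avoid" examples that signal temporary or migration state.
TRANSIENT_WORDS = {"temp", "tmp", "migration", "legacy", "old"}


def naming_warnings(instance_name: str) -> List[str]:
    """Return heuristic warnings for a candidate platform instance name."""
    warnings = []
    tokens = re.split(r"[-_.]", instance_name.lower())

    # Dotted hierarchies such as company.domain.product.system encode taxonomy.
    if instance_name.count(".") >= 2:
        warnings.append("looks like an organizational taxonomy path")

    # A "v2"-style token, or a trailing pair of numeric tokens such as "8_0".
    has_v_token = any(re.fullmatch(r"v\d+", t) for t in tokens)
    has_numeric_pair = len(tokens) >= 2 and all(t.isdigit() for t in tokens[-2:])
    if has_v_token or has_numeric_pair:
        warnings.append("looks version-specific")

    if TRANSIENT_WORDS.intersection(tokens):
        warnings.append("looks temporary or migration-related")

    return warnings


for candidate in ["primary-redshift", "redshift_v2", "legacy_mysql"]:
    print(candidate, naming_warnings(candidate))
```

The rules are deliberately simple and can misfire, for example on a GUID whose last two groups happen to be all digits, so treat the output as a prompt for review rather than a verdict.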
Read the Ingestion source specific guides for how to enable platform instances in each of them.
The general pattern is to add an additional optional configuration parameter called platform_instance.
For example, here is how you would configure a recipe to ingest a MySQL instance that you want to call `primary-mysql`:
```yml
source:
  type: mysql
  config:
    # Coordinates
    host_port: localhost:3306
    platform_instance: primary-mysql
    database: dbname

    # Credentials
    username: root
    password: example

sink:
  # sink configs
```
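Since the same instance name has to appear in every recipe that refers to the instance, a small consistency check can catch spelling drift early. The sketch below assumes recipes have already been loaded into Python dictionaries with the same `source.config.platform_instance` layout as the recipe above. The helper `inconsistent_instance_names` and the second recipe are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def inconsistent_instance_names(recipes: List[dict]) -> Dict[str, Set[str]]:
    """Group instance names that differ only by case or separator style."""
    spellings = defaultdict(set)
    for recipe in recipes:
        config = recipe.get("source", {}).get("config", {})
        name = config.get("platform_instance")
        if name:
            # Normalize so "Primary_MySQL" and "primary-mysql" compare equal.
            key = name.lower().replace("_", "-")
            spellings[key].add(name)
    # Keep only groups where more than one spelling was seen.
    return {k: v for k, v in spellings.items() if len(v) > 1}


recipes = [
    {"source": {"type": "mysql",
                "config": {"host_port": "localhost:3306",
                           "platform_instance": "primary-mysql"}}},
    # Hypothetical second recipe with a slightly different spelling.
    {"source": {"type": "mysql",
                "config": {"host_port": "localhost:3306",
                           "platform_instance": "primary_mysql"}}},
]

print(inconsistent_instance_names(recipes))
```

Any group returned here would produce two distinct sets of URNs for what is meant to be one instance, so the spellings should be reconciled before ingestion runs.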
Instead of changing URNs when organizational taxonomy evolves, DataHub provides several alternative approaches that maintain URN immutability while enabling flexible business context management:
The most effective solution is to design your platform instance naming to be technically stable while using DataHub's metadata features for business context:
Use Stable Technical Identifiers: Design platform instance names that won't change

- ✅ `us-east-1-cluster-001`, `anomalo-prod-01`, `primary-redshift`
- ❌ `company.domain.product.system` (changes when taxonomy evolves)

Leverage DataHub's Business Context Features:

DataHub offers several organizational concepts that can complement or serve as alternatives to platform instances:
Data Products group related data assets for business purposes, following data mesh principles.
Example Data Product:
```
Customer Analytics Data Product
├── Tables from Redshift Cluster 1
├── Tables from Snowflake Analytics
├── Dashboards from Looker
└── Pipelines from Airflow
```
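The grouping above can be pictured as a named collection of asset URNs, where the collection can change without touching the identifiers it contains. This is a conceptual sketch only: `DataProductSketch` is a hypothetical stand-in, not DataHub's data product model, and the table names are invented. Dashboards and pipelines are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataProductSketch:
    """Hypothetical stand-in for a data product: a named group of asset URNs."""
    name: str
    asset_urns: List[str] = field(default_factory=list)


# Illustrative dataset URNs on two different platform instances.
redshift_table = (
    "urn:li:dataset:(urn:li:dataPlatform:redshift,"
    "primary-redshift.db.public.customers,PROD)"
)
snowflake_table = (
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,"
    "analytics-snowflake.db.public.events,PROD)"
)

product = DataProductSketch("Customer Analytics Data Product")
product.asset_urns.extend([redshift_table, snowflake_table])

# Renaming the product changes the grouping label only; asset URNs are untouched.
product.name = "Customer Insights Data Product"
print(product.asset_urns == [redshift_table, snowflake_table])
```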
DataHub offers several other ways to handle organizational context without changing URNs:
- Tags and labels such as `domain.voice`, `product.billing`, `system.anomalo`
- Custom properties such as `org_domain: "voice"` that can be updated when the domain changes

| Approach | URN Impact | Flexibility | Complexity | Best Use Case |
|---|---|---|---|---|
| Platform Instances | Changes URN | Low | Low | Technical differentiation needed in URNs |
| Data Products | No change | High | High | Business-oriented grouping across platforms |
| Tags/Labels | No change | High | Low | Flexible metadata and searchable context |
| Custom Properties | No change | Medium | Medium | Structured metadata storage |
| Glossary Terms | No change | High | Medium | Business context and domain association |
| Search Features | No change | High | Low | Discovery and organization without changes |
| Automation | No change | Medium | High | Consistent metadata management |
Platform instances and data products each address different aspects of data organization in DataHub. Platform instances modify URNs to include technical identifiers, while data products provide organizational structure without changing the physical identity of the asset. For organizations with evolving taxonomy, the key is to separate technical identifiers (in URNs) from business context (in metadata), ensuring both immutability and flexibility.
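As a closing illustration of keeping technical identifiers apart from business context, the sketch below models an asset as a fixed URN plus editable tags and custom properties, then moves it to a new domain. `AssetSketch`, `change_domain`, the table name, and the `messaging` domain are all hypothetical. Only the `domain.voice` tag style and the `org_domain` property come from the examples above.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class AssetSketch:
    """Hypothetical record: a fixed URN plus editable business context."""
    urn: str
    tags: Set[str] = field(default_factory=set)
    custom_properties: Dict[str, str] = field(default_factory=dict)


def change_domain(asset: AssetSketch, new_domain: str) -> None:
    """Update domain metadata in place; the URN is never touched."""
    old_domain = asset.custom_properties.get("org_domain")
    if old_domain is not None:
        asset.tags.discard(f"domain.{old_domain}")
    asset.tags.add(f"domain.{new_domain}")
    asset.custom_properties["org_domain"] = new_domain


asset = AssetSketch(
    urn="urn:li:dataset:(urn:li:dataPlatform:redshift,primary-redshift.db.public.calls,PROD)",
    tags={"domain.voice", "product.billing"},
    custom_properties={"org_domain": "voice"},
)
urn_before = asset.urn
change_domain(asset, "messaging")
# The taxonomy change is absorbed entirely by metadata.
print(asset.urn == urn_before)
```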