metadata-models/docs/entities/container.md
The container entity is a core entity in the metadata model that represents a grouping of related data assets. Containers provide hierarchical organization for datasets, charts, dashboards, and other containers, enabling navigation and structure discovery within data platforms.
Containers are uniquely identified by a GUID (Globally Unique Identifier) that is typically derived from a combination of attributes specific to the container type. Unlike datasets, which use platform, name, and environment, containers use a more flexible identification scheme based on their hierarchical properties.
The URN structure for a container is: urn:li:container:{guid}
The GUID is typically computed from container-specific properties such as the platform, instance, database, and schema. An example container URN:
urn:li:container:b5e95fce839e7d78151ed7e0a7420d84
The GUID is generated using the datahub_guid() function from a dictionary of properties. For example, a Snowflake schema container would be identified by:
{
  "platform": "snowflake",
  "instance": "prod_instance",
  "database": "analytics",
  "schema": "reporting"
}
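As a rough illustration of why the same property dictionary always maps to the same URN, the sketch below hashes a canonical JSON encoding of the properties with MD5. This is an assumption about the general approach, not a copy of datahub_guid(): the helper names `sketch_guid` and `container_urn` are hypothetical, and the value produced here is not guaranteed to match the GUID DataHub computes.

```python
import hashlib
import json


def sketch_guid(properties: dict) -> str:
    """Hash a canonical (sorted, compact) JSON encoding of the properties."""
    # Assumption: MD5 over sorted-key JSON; the real function may differ in details.
    canonical = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def container_urn(properties: dict) -> str:
    """Wrap the GUID in the urn:li:container:{guid} structure."""
    return f"urn:li:container:{sketch_guid(properties)}"


schema_props = {
    "platform": "snowflake",
    "instance": "prod_instance",
    "database": "analytics",
    "schema": "reporting",
}
print(container_urn(schema_props))
```

Because the keys are sorted before hashing, the order of the properties does not matter, while changing any single value yields a different GUID and therefore a different container.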
Containers represent various hierarchical structures in data platforms:
The containerProperties aspect contains metadata inherited from the source system:
The editableContainerProperties aspect allows users to override or add information via the UI:
This separation ensures that metadata from source systems doesn't conflict with user-provided annotations.
Containers support nested hierarchies through the container aspect, which links a container to its parent container. This enables multi-level organizational structures:
Platform (implicit)
└── Database Container
    └── Schema Container
        └── Dataset
For example, in Snowflake:
Snowflake Platform
└── ANALYTICS_DB (Database Container)
    └── REPORTING (Schema Container)
        ├── SALES_METRICS (Dataset)
        └── REVENUE_TABLE (Dataset)
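The parent links described above can be pictured as a child-to-parent mapping. The following sketch uses a plain dictionary as a stand-in for the container aspect to walk from a dataset up to its top-level container. The names `PARENT_OF` and `breadcrumbs` are hypothetical, and the short string keys are illustrative labels rather than real URNs.

```python
from typing import Dict, List, Optional

# Stand-in for the container aspect: each entity points at its parent container.
PARENT_OF: Dict[str, Optional[str]] = {
    "SALES_METRICS": "REPORTING",
    "REVENUE_TABLE": "REPORTING",
    "REPORTING": "ANALYTICS_DB",
    "ANALYTICS_DB": None,  # top-level container; the platform is implicit
}


def breadcrumbs(entity: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from the top-level container down to the entity."""
    path = [entity]
    parent = parent_of.get(entity)
    while parent is not None:
        path.append(parent)
        parent = parent_of.get(parent)
    return list(reversed(path))


print(" > ".join(breadcrumbs("SALES_METRICS", PARENT_OF)))
# ANALYTICS_DB > REPORTING > SALES_METRICS
```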
The subTypes aspect specifies the type of container, which helps the UI render appropriate icons and behaviors. Common subtypes include:
ML Experiment (MLAssetSubTypes.MLFLOW_EXPERIMENT): ML experiment containers that organize training runs

Machine learning experiments are modeled as containers with the MLFLOW_EXPERIMENT subtype. This pattern enables organizing related training runs (which are dataProcessInstance entities) into logical groups for comparison and tracking:
ML Experiment (Container)
├── Training Run 1 (DataProcessInstance)
├── Training Run 2 (DataProcessInstance)
└── Training Run 3 (DataProcessInstance)
Training runs belong to experiments through the container aspect. This structure mirrors common ML platform patterns (like MLflow) and enables:
For more information on ML experiments and training runs, see:
The following entity types can be contained within a container:
{{ inline /metadata-ingestion/examples/library/container_create_database.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/container_create_schema.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/container_add_metadata.py show_path_as_comment }}
Containers can be retrieved using the standard entity retrieval APIs:
<details> <summary>Fetch container entity including all aspects</summary>

curl 'http://localhost:8080/entities/urn%3Ali%3Acontainer%3Ab5e95fce839e7d78151ed7e0a7420d84'
The response will include all aspects associated with the container, including properties, ownership, tags, terms, etc.
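
The same request can be issued from Python. The sketch below uses only the standard library. The base URL matches the curl example, while the helper names `entity_url` and `fetch_container` are hypothetical. The response is returned as parsed JSON without assuming a particular schema, and authentication headers, which some deployments may require, are omitted.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

GMS_BASE = "http://localhost:8080"  # same host and port as the curl example


def entity_url(urn: str, base: str = GMS_BASE) -> str:
    """Percent-encode the URN so it can be used as a single path segment."""
    return f"{base}/entities/{quote(urn, safe='')}"


def fetch_container(urn: str) -> dict:
    """GET the entity and return the parsed JSON body (needs a running server)."""
    with urlopen(entity_url(urn)) as response:
        return json.load(response)


print(entity_url("urn:li:container:b5e95fce839e7d78151ed7e0a7420d84"))
```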
</details>

To find all entities within a container, use the relationships API:
<details> <summary>Find all entities contained within a container</summary>

curl 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3Acontainer%3Ab5e95fce839e7d78151ed7e0a7420d84&types=IsPartOf'
This returns all entities (datasets, charts, dashboards, sub-containers) that have this container as their parent.
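
When the container URN is only known at runtime, the query string is easier to build programmatically than by hand. A minimal sketch, assuming the same host and the same three query parameters as the curl call; the helper name `contained_entities_url` is hypothetical:

```python
from urllib.parse import urlencode

GMS_BASE = "http://localhost:8080"  # same host and port as the curl example


def contained_entities_url(container_urn: str, base: str = GMS_BASE) -> str:
    """Build the relationships query for entities that are part of the container."""
    query = urlencode(
        {"direction": "INCOMING", "urn": container_urn, "types": "IsPartOf"}
    )
    return f"{base}/relationships?{query}"


print(contained_entities_url("urn:li:container:b5e95fce839e7d78151ed7e0a7420d84"))
```

`urlencode` percent-encodes the colons in the URN, so the result matches the hand-encoded URL used with curl.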
</details>

Datasets are the most common entities contained within containers. The relationship is established through the container aspect on the dataset, which points to the container URN.
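from datahub.emitter.mcp_builder import SchemaKey
from datahub.sdk import Dataset

# Key of the parent schema container (assumed here; reuse the key the schema was created with)
schema_key = SchemaKey(platform="snowflake", database="analytics_db", schema="reporting")

# Dataset links to its parent container (schema)
dataset = Dataset(
    platform="snowflake",
    name="analytics_db.reporting.sales_table",
    env="PROD",
    parent_container=schema_key,  # Links to schema container
)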
Containers enable hierarchical navigation in the DataHub UI through parent-child relationships:
The container entity has specialized GraphQL resolvers:
These resolvers power the UI's hierarchical navigation and container overview pages.
Container GUIDs must remain stable across ingestion runs. Since containers are identified by GUID rather than explicit properties in the URN, changing the GUID computation will create a new container entity instead of updating the existing one.
When creating custom containers, ensure that the properties used to generate the GUID are:
While containers can contain other containers, be careful not to create circular references. The parent-child relationship should form a directed acyclic graph (DAG), not a cycle.
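One way to guard against circular references is to check the ancestor chain before assigning a parent. The sketch below works on a plain child-to-parent dictionary rather than on DataHub aspects. The function name `would_create_cycle` and the short string keys are hypothetical.

```python
from typing import Dict, Optional


def would_create_cycle(
    parent_of: Dict[str, Optional[str]], child: str, new_parent: str
) -> bool:
    """Return True if making new_parent the parent of child closes a loop."""
    seen = set()
    current: Optional[str] = new_parent
    # Walk upward from the proposed parent; stop at the root or on a repeat.
    while current is not None and current not in seen:
        if current == child:
            return True  # child is already an ancestor of new_parent
        seen.add(current)
        current = parent_of.get(current)
    return False


parent_of = {"schema": "database", "database": None}
print(would_create_cycle(parent_of, "database", "schema"))  # True
print(would_create_cycle(parent_of, "schema", "database"))  # False
```

The `seen` set also keeps the walk from looping forever if the existing data already contains a cycle.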
The env field in ContainerKey has special handling for backwards compatibility. In some sources, the platform instance was incorrectly set to the environment value. The backcompat_env_as_instance flag handles this case.
When using the env field:
instance field for multi-instance deployments

Unlike datasets, which embed the platform instance in their URN, containers associate platform instances through the dataPlatformInstance aspect. This allows containers to be associated with specific instances of a data platform while maintaining a stable GUID.
Containers support the access aspect, which can be used to model access control policies at the container level. This is particularly useful for:
Access controls set on containers can be inherited by contained entities, though this behavior depends on the specific platform's implementation.