metadata-ingestion/docs/transformer/universal_transformers.md
These transformers apply metadata across multiple entity types in a single configuration. They are the recommended way to add domains or ownership, or to modify browse paths, when your ingestion pipeline produces entities beyond just datasets.
The domain and ownership transformers on this page support the following entity types by default:
You can restrict the set of entity types via the optional `entity_types` config field.
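As a rough illustration of what restricting by entity type means, the sketch below derives the entity type from a URN and checks it against an `entity_types` list. The helper names `entity_type_of` and `should_process` are hypothetical and are not part of DataHub; the sketch assumes URNs follow the `urn:li:<entityType>:<id>` layout seen in the examples on this page.

```python
from typing import List, Optional


def entity_type_of(urn: str) -> str:
    """Return the entity type segment of a URN shaped like urn:li:<type>:<id>."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"not a urn:li URN: {urn!r}")
    return parts[2]


def should_process(urn: str, entity_types: Optional[List[str]]) -> bool:
    """Hypothetical filter: None stands in for 'no restriction' (the default)."""
    return entity_types is None or entity_type_of(urn) in entity_types


if __name__ == "__main__":
    allowed = ["dataset", "container"]
    for urn in (
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.t,PROD)",
        "urn:li:container:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        "urn:li:chart:(looker,1)",
    ):
        print(urn, should_process(urn, allowed))
```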
:::tip "Relationship to Dataset Transformers"
The domain and ownership transformers on this page are drop-in replacements for their `*_dataset_*` counterparts (e.g. `simple_add_domain` replaces `simple_add_dataset_domain`). The `*_dataset_*` variants continue to work but only process dataset entities. Use the names on this page when you need broader entity type coverage.
:::
| Aspect | Transformer |
|---|---|
| `domains` | - Simple Add Domain |
| `ownership` | - Simple Add Ownership |
| `browsePathsV2` | - Set browsePaths |

The `simple_add_domain` transformer adds a static list of domains to all supported entity types flowing through the pipeline.
| Field | Required | Type | Default | Description |
|---|---|---|---|---|
| `domains` | ✅ | list[union[urn, str]] | | List of domain URNs or simple domain names. |
| `replace_existing` | | boolean | false | Whether to remove domains from entity sent by ingestion source. |
| `semantics` | | enum | OVERWRITE | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. |
| `on_conflict` | | enum | DO_UPDATE | Whether to make changes if domains already exist on the target. |
| `entity_types` | | list[string] | all types | Restrict which entity types this transformer processes. |
Add a domain to all entities (datasets, dashboards, charts, etc.):
```yaml
transformers:
  - type: "simple_add_domain"
    config:
      domains:
        - urn:li:domain:engineering
```
Add a domain, merging with existing domains on the server:
```yaml
transformers:
  - type: "simple_add_domain"
    config:
      semantics: PATCH
      domains:
        - urn:li:domain:engineering
```
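To make the interplay of `replace_existing` and `semantics` more concrete, here is a minimal sketch of one way to read the field descriptions above. It is not the transformer's actual implementation: the function `resolve_domains`, its parameters, and the ordering of the merged result are assumptions made for illustration only.

```python
from typing import Iterable, List


def resolve_domains(
    configured: Iterable[str],
    from_source: Iterable[str],
    on_server: Iterable[str],
    replace_existing: bool = False,
    semantics: str = "OVERWRITE",
) -> List[str]:
    """Hypothetical model of the final domain list written for one entity."""
    merged: List[str] = []

    def add(values: Iterable[str]) -> None:
        # Keep first-seen order and drop duplicates.
        for value in values:
            if value not in merged:
                merged.append(value)

    if not replace_existing:
        add(from_source)  # keep what the ingestion source emitted
    add(configured)       # always apply the configured domains
    if semantics == "PATCH":
        add(on_server)    # assumption: PATCH also keeps what is already on the server
    return merged


if __name__ == "__main__":
    print(
        resolve_domains(
            configured=["urn:li:domain:engineering"],
            from_source=["urn:li:domain:sales"],
            on_server=["urn:li:domain:finance"],
            semantics="PATCH",
        )
    )
```

Under this reading, `OVERWRITE` ignores whatever is already stored on the server, while `PATCH` unions it with the new values.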
Restrict to only datasets and containers:
```yaml
transformers:
  - type: "simple_add_domain"
    config:
      entity_types:
        - dataset
        - container
      domains:
        - urn:li:domain:engineering
```
The `pattern_add_domain` transformer adds domains based on regex matching against entity URNs. It works across all supported entity types.
| Field | Required | Type | Default | Description |
|---|---|---|---|---|
| `domain_pattern` | ✅ | map[regex, list[union[urn, str]]] | | Entity URN regex and list of domain URNs to apply to matching entities. |
| `replace_existing` | | boolean | false | Whether to remove domains from entity sent by ingestion source. |
| `semantics` | | enum | OVERWRITE | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. |
| `is_container` | | bool | false | Whether to also propagate domains to parent containers via browse paths. |
| `entity_types` | | list[string] | all types | Restrict which entity types this transformer processes. |
Add domains to PowerBI dashboards and Snowflake datasets with different domain assignments:
```yaml
transformers:
  - type: "pattern_add_domain"
    config:
      domain_pattern:
        rules:
          ".*powerbi.*": ["urn:li:domain:analytics"]
          ".*snowflake.*": ["urn:li:domain:data-warehouse"]
```
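The rule lookup in the config above can be pictured with the sketch below. It is an illustration only: `domains_for` is a hypothetical helper, and the use of `re.match` (anchored at the start of the URN) is an assumption rather than a statement about how the transformer evaluates patterns. The patterns in the example begin with `.*`, so they behave the same under anchored or unanchored matching.

```python
import re
from typing import Dict, List


def domains_for(urn: str, rules: Dict[str, List[str]]) -> List[str]:
    """Collect the domains of every rule whose regex matches the entity URN."""
    matched: List[str] = []
    for pattern, domains in rules.items():
        # Assumption: patterns are matched from the start of the URN.
        if re.match(pattern, urn):
            for domain in domains:
                if domain not in matched:
                    matched.append(domain)
    return matched


if __name__ == "__main__":
    rules = {
        ".*powerbi.*": ["urn:li:domain:analytics"],
        ".*snowflake.*": ["urn:li:domain:data-warehouse"],
    }
    print(domains_for("urn:li:dashboard:(powerbi,reports.1234)", rules))
    print(domains_for("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.t,PROD)", rules))
```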
Add domains to matching entities and propagate to their parent containers:
```yaml
transformers:
  - type: "pattern_add_domain"
    config:
      is_container: true
      semantics: PATCH
      domain_pattern:
        rules:
          ".*powerbi.*": ["analytics"]
```
The `simple_add_ownership` transformer adds a static list of owners to all supported entity types.
| Field | Required | Type | Default | Description |
|---|---|---|---|---|
| `owner_urns` | ✅ | list[string] | | List of owner URNs. |
| `ownership_type` | | string | DATAOWNER | Ownership type (enum value or ownership type URN). |
| `replace_existing` | | boolean | false | Whether to remove owners from entity sent by ingestion source. |
| `semantics` | | enum | OVERWRITE | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. |
| `entity_types` | | list[string] | all types | Restrict which entity types this transformer processes. |
Add owners to all entities in the pipeline:
```yaml
transformers:
  - type: "simple_add_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:data-team"
        - "urn:li:corpGroup:platform-eng"
      ownership_type: "DATAOWNER"
```
The `pattern_add_ownership` transformer adds owners based on regex matching against entity URNs. It works across all supported entity types, including containers.
| Field | Required | Type | Default | Description |
|---|---|---|---|---|
| `owner_pattern` | ✅ | map[regex, list[urn]] | | Entity URN regex and list of owner URNs to apply to matching entities. |
| `ownership_type` | | string | DATAOWNER | Ownership type (enum value or ownership type URN). |
| `replace_existing` | | boolean | false | Whether to remove owners from entity sent by ingestion source. |
| `semantics` | | enum | OVERWRITE | Whether to OVERWRITE or PATCH the entity present on DataHub GMS. |
| `is_container` | | bool | false | Whether to also propagate ownership to parent containers via browse paths. |
| `entity_types` | | list[string] | all types | Restrict which entity types this transformer processes. |
Assign different owners to different platforms:
```yaml
transformers:
  - type: "pattern_add_ownership"
    config:
      owner_pattern:
        rules:
          ".*powerbi.*": ["urn:li:corpuser:bi-team"]
          ".*snowflake.*": ["urn:li:corpuser:data-eng"]
      ownership_type: "TECHNICAL_OWNER"
```
Assign owners and propagate to parent containers:
```yaml
transformers:
  - type: "pattern_add_ownership"
    config:
      is_container: true
      semantics: PATCH
      owner_pattern:
        rules:
          ".*powerbi.*": ["urn:li:corpuser:bi-team"]
      ownership_type: "DATAOWNER"
```
Both `pattern_add_domain` and `pattern_add_ownership` support an `is_container` flag. When enabled, the transformer looks up the browse path of each matched entity and applies the same metadata to all parent containers found in that path.
This is useful when you want containers to inherit domains or ownership from their children without configuring each container individually.
:::info
When multiple entities in the same container have different owners or domains, all values are merged additively onto the container. For example, if dataset A has owner `alice` and dataset B has owner `bob`, and both are in the same container, that container gets both `alice` and `bob` as owners.
:::
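The additive merge described above can be sketched as follows. This is a simplified model, not the transformer's code: `propagate_to_containers` is a hypothetical helper, browse paths are represented as plain lists of URN strings, and the owner and dataset URNs are made up for the example.

```python
from typing import Dict, List


def propagate_to_containers(
    owners_by_entity: Dict[str, List[str]],
    browse_path_by_entity: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Union the owners of every entity onto each container in its browse path."""
    owners_by_container: Dict[str, List[str]] = {}
    for entity, owners in owners_by_entity.items():
        for node in browse_path_by_entity.get(entity, []):
            if not node.startswith("urn:li:container:"):
                continue  # only container references receive propagated metadata
            bucket = owners_by_container.setdefault(node, [])
            for owner in owners:
                if owner not in bucket:
                    bucket.append(owner)
    return owners_by_container


if __name__ == "__main__":
    container = "urn:li:container:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
    owners = {
        "urn:li:dataset:A": ["urn:li:corpuser:alice"],
        "urn:li:dataset:B": ["urn:li:corpuser:bob"],
    }
    paths = {
        "urn:li:dataset:A": [container],
        "urn:li:dataset:B": [container],
    }
    print(propagate_to_containers(owners, paths))
```

The same shape applies to domains: replace the owner lists with domain lists and the container ends up with the union.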
The `set_browse_path` transformer operates on the `browsePathsV2` aspect. If the ingestion source does not emit this aspect, the transformer creates it. By default, the configured path is prepended to the original path (that is, added as a prefix).
| Field | Required | Type | Default | Description |
|---|---|---|---|---|
| `path` | ✅ | list[string] | | List of nodes in the new path. |
| `replace_existing` | | boolean | false | Whether to overwrite the existing browse path. If set to false, the configured path is prepended. |
In the most basic case, `path` contains a list of static strings. For example, the config below:
```yaml
transformers:
  - type: "set_browse_path"
    config:
      path:
        - abc
        - def
```
results in every entity having its path prefixed by the `abc` and `def` nodes (`def` is contained by `abc`).
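The prepend-versus-replace behavior controlled by `replace_existing` can be summarized with the small sketch below. `apply_path` is a hypothetical helper working on plain lists of node names; it only models the rule described above.

```python
from typing import List


def apply_path(configured: List[str], existing: List[str], replace_existing: bool = False) -> List[str]:
    """Return the resulting browse path for one entity."""
    if replace_existing:
        return list(configured)               # the source path is dropped
    return list(configured) + list(existing)  # the configured nodes become a prefix


if __name__ == "__main__":
    existing = ["urn:li:container:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"]
    print(apply_path(["abc", "def"], existing))
    print(apply_path(["abc", "def"], existing, replace_existing=True))
```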
The transformer supports variable substitution in the path. The list of variables is built from the entity's existing `browsePathsV2` aspect. Every node in the existing path that contains a reference to another entity (e.g. a `container` or a `dataPlatformInstance`) is stored in the list of variables. Since the browse path can hold multiple references to entities of the same type (e.g. containers), they are stored in a list-like object that preserves the original order. Consider a real-world example: a table ingested from a Snowflake source with `platform_instance` set to some value. Such a table will have a `browsePathsV2` aspect containing the references below:
- urn:li:dataPlatformInstance:(urn:li:dataPlatform:snowflake,my_platform_instance)
- urn:li:container:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
- urn:li:container:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
where `urn:li:container:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa` identifies the container for a Snowflake database and `urn:li:container:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb` identifies the container for a Snowflake schema.
This existing path is mapped into variables as shown below:
```
dataPlatformInstance[0] = "urn:li:dataPlatformInstance:(urn:li:dataPlatform:snowflake,my_platform_instance)"
container[0] = "urn:li:container:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
container[1] = "urn:li:container:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
```
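The mapping above can be reproduced with a short sketch. It assumes the existing browse path is available as a list of strings in which entity references are `urn:li:<entityType>:<id>` URNs; `build_variables` is a hypothetical name and does not refer to a DataHub function.

```python
from typing import Dict, List


def build_variables(browse_path: List[str]) -> Dict[str, List[str]]:
    """Group entity references from a browse path by entity type, keeping path order."""
    variables: Dict[str, List[str]] = {}
    for node in browse_path:
        parts = node.split(":", 3)
        # Nodes that are not entity references (plain folder names) yield no variable.
        if len(parts) == 4 and parts[0] == "urn" and parts[1] == "li":
            variables.setdefault(parts[2], []).append(node)
    return variables


if __name__ == "__main__":
    path = [
        "urn:li:dataPlatformInstance:(urn:li:dataPlatform:snowflake,my_platform_instance)",
        "urn:li:container:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        "urn:li:container:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
    ]
    for name, values in build_variables(path).items():
        for index, value in enumerate(values):
            print(f"{name}[{index}] = {value!r}")
```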
Those variables can be referred to from the config by using the `$` character, as shown below:
```yaml
transformers:
  - type: "set_browse_path"
    config:
      path:
        - $dataPlatformInstance[0]
        - $container[0]
        - $container[1]
```
Additionally, 2 more rules apply to variable resolution:
- `$variable[*]` expands the entire list of variables into multiple nodes in the path (think of it as a "flat map"). For example, the equivalent of the config above would be:
```yaml
transformers:
  - type: "set_browse_path"
    config:
      path:
        - $dataPlatformInstance[0]
        - $container[*]
```
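The sketch below shows how `$name[index]` and `$name[*]` references could be expanded against a variables mapping, and checks that the two configs above produce the same path. `resolve_path` is a hypothetical helper, and skipping references that cannot be resolved is an assumption of this sketch, not documented behavior.

```python
import re
from typing import Dict, List

# Matches "$name[3]" or "$name[*]"; anything else is treated as a static node.
_VARIABLE = re.compile(r"^\$(\w+)\[(\d+|\*)\]$")


def resolve_path(template: List[str], variables: Dict[str, List[str]]) -> List[str]:
    """Expand variable references in a configured path into concrete nodes."""
    resolved: List[str] = []
    for node in template:
        match = _VARIABLE.match(node)
        if match is None:
            resolved.append(node)  # static string, kept as is
            continue
        name, index = match.group(1), match.group(2)
        values = variables.get(name, [])
        if index == "*":
            resolved.extend(values)  # "flat map": one node per stored value
        elif int(index) < len(values):
            resolved.append(values[int(index)])
        # Assumption: a reference with no matching value is silently skipped.
    return resolved


if __name__ == "__main__":
    variables = {
        "dataPlatformInstance": [
            "urn:li:dataPlatformInstance:(urn:li:dataPlatform:snowflake,my_platform_instance)"
        ],
        "container": [
            "urn:li:container:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
            "urn:li:container:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
        ],
    }
    explicit = resolve_path(["$dataPlatformInstance[0]", "$container[0]", "$container[1]"], variables)
    wildcard = resolve_path(["$dataPlatformInstance[0]", "$container[*]"], variables)
    print(explicit == wildcard)
```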
Add (prefix) a top-level node "datahub" to paths emitted by the source:
```yaml
transformers:
  - type: "set_browse_path"
    config:
      path:
        - datahub
```
Remove the data platform instance from the path (if it was set), while retaining the container structure:

```yaml
transformers:
  - type: "set_browse_path"
    config:
      replace_existing: true
      path:
        - $container[*]
```