metadata-models/docs/entities/versionSet.md
The VersionSet entity is a core metadata model entity in DataHub that groups together related versions of other entities. Version Sets are primarily used to manage versioned entities like ML models, datasets, and other assets that evolve over time with distinct versions. They provide a structured way to organize, track, and navigate between different versions of the same logical asset.
Version Sets are identified by two pieces of information: a unique ID and the type of entity being versioned (e.g., `mlModel`, `dataset`). All entities within a single version set must be of the same type, ensuring type safety and consistency.

An example of a version set identifier is `urn:li:versionSet:(abc123def456,mlModel)`.
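As a minimal sketch, an identifier like the one above can be assembled and split with plain string handling. The helper names `make_version_set_urn` and `parse_version_set_urn` are hypothetical and not part of the DataHub SDK, and the sketch assumes that neither component contains commas or parentheses.

```python
import re
from typing import NamedTuple

# Matches urn:li:versionSet:(<id>,<entityType>). Assumes neither component
# contains commas or parentheses.
_VERSION_SET_URN = re.compile(r"^urn:li:versionSet:\(([^,()]+),([^,()]+)\)$")


class VersionSetKey(NamedTuple):
    set_id: str
    entity_type: str


def make_version_set_urn(set_id: str, entity_type: str) -> str:
    """Assemble a version set URN from its two components (hypothetical helper)."""
    return f"urn:li:versionSet:({set_id},{entity_type})"


def parse_version_set_urn(urn: str) -> VersionSetKey:
    """Split a version set URN into its ID and entity type (hypothetical helper)."""
    match = _VERSION_SET_URN.match(urn)
    if match is None:
        raise ValueError(f"not a version set URN: {urn!r}")
    return VersionSetKey(set_id=match.group(1), entity_type=match.group(2))


key = parse_version_set_urn("urn:li:versionSet:(abc123def456,mlModel)")
print(key.set_id, key.entity_type)
```

In real code, prefer whatever URN utilities your DataHub SDK version provides over hand-rolled parsing.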
The URN structure follows the pattern `urn:li:versionSet:(<id>,<entityType>)`, where:

- `<id>` is a unique identifier string, often a GUID generated from the platform and asset name
- `<entityType>` is the entity type being versioned (e.g., `mlModel`, `dataset`)

Version Sets maintain metadata about the collection of versioned entities through the `versionSetProperties` aspect. This aspect contains:
The version set automatically tracks which entity is currently the latest version. This is stored in the latest field and provides a quick reference to the most recent version without needing to query all versions.
Version Sets support different versioning schemes to accommodate various versioning strategies:
The versioning scheme is static once set and determines how versions are ordered within the set.
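To illustrate why the scheme matters for ordering, the sketch below assumes a scheme that compares version labels as plain strings. That comparison rule is an assumption made for illustration, not a statement of how any particular DataHub scheme is implemented. Under string comparison, unpadded numeric labels sort in a surprising order, which zero-padding avoids.

```python
from typing import List


def latest_by_string_order(labels: List[str]) -> str:
    """Return the label that sorts last under plain string comparison."""
    return max(labels)


unpadded = ["v1", "v2", "v10"]
padded = ["v001", "v002", "v010"]

# "v10" sorts before "v2" because the character "1" is less than "2".
print(latest_by_string_order(unpadded))  # v2
print(latest_by_string_order(padded))    # v010
```

If your labels are compared as strings, choosing a fixed-width or otherwise sortable label format up front avoids having to relabel versions later.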
Like other DataHub entities, Version Sets support custom properties for storing additional metadata specific to your use case.
<details>
<summary>Python SDK: Create a version set with properties</summary>

{{ inline /metadata-ingestion/examples/library/version_set_add_properties.py show_path_as_comment }}

</details>
Entities are linked to Version Sets through the versionProperties aspect on the versioned entity. This aspect contains:
{{ inline /metadata-ingestion/examples/library/version_set_link_entity.py show_path_as_comment }}
When creating a new version set, you typically link the first versioned entity to it. The version set can be created implicitly by linking an entity to a new version set URN.
<details>
<summary>Python SDK: Create a version set by linking the first entity</summary>

{{ inline /metadata-ingestion/examples/library/version_set_create.py show_path_as_comment }}

</details>
As you create new versions of an asset, you link each one to the same version set with a different version label. The version set automatically updates the latest pointer to the most recent version based on the versioning scheme.
{{ inline /metadata-ingestion/examples/library/version_set_link_multiple_versions.py show_path_as_comment }}
You can query version sets to retrieve information about all versions or find specific versions.
<details>
<summary>Python SDK: Query a version set and its versions</summary>

{{ inline /metadata-ingestion/examples/library/version_set_query.py show_path_as_comment }}

</details>
The standard DataHub REST APIs can be used to retrieve version set entities and their properties.
<details>
<summary>Fetch version set entity via REST API</summary>

```bash
# Fetch a version set by URN
curl 'http://localhost:8080/entities/urn%3Ali%3AversionSet%3A(abc123def456,mlModel)'

# Get all entities in a version set using relationships
curl 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3AversionSet%3A(abc123def456,mlModel)&types=VersionOf'
```

</details>
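The URNs in the requests above are percent-encoded. As a rough sketch, the same URLs can be built in Python with the standard library. The function names are hypothetical, and `http://localhost:8080` is simply the address used in the curl examples; adjust it for your deployment.

```python
from urllib.parse import quote, urlencode

GMS_URL = "http://localhost:8080"  # address taken from the curl examples


def entity_url(urn: str) -> str:
    """Build the entity lookup URL for a URN (hypothetical helper)."""
    # Leave "(", ")" and "," unescaped to match the curl examples; ":" becomes %3A.
    return f"{GMS_URL}/entities/{quote(urn, safe='(),')}"


def versions_url(version_set_urn: str) -> str:
    """Build the URL listing incoming VersionOf relationships (hypothetical helper)."""
    query = urlencode(
        {"direction": "INCOMING", "urn": version_set_urn, "types": "VersionOf"},
        safe="(),",
        quote_via=quote,
    )
    return f"{GMS_URL}/relationships?{query}"


urn = "urn:li:versionSet:(abc123def456,mlModel)"
print(entity_url(urn))
print(versions_url(urn))
```

The resulting strings can be passed to any HTTP client; add an authorization header if your deployment requires one.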
DataHub's GraphQL API provides rich querying capabilities for version sets:
<details>
<summary>GraphQL: Query version set with all versions</summary>

```graphql
query {
  versionSet(urn: "urn:li:versionSet:(abc123def456,mlModel)") {
    urn
    latestVersion {
      urn
      ... on MLModel {
        properties {
          name
          description
        }
      }
    }
    versionsSearch(input: { query: "*", start: 0, count: 10 }) {
      total
      searchResults {
        entity {
          urn
          ... on MLModel {
            versionProperties {
              version {
                versionTag
              }
              comment
              isLatest
              created {
                time
              }
            }
          }
        }
      }
    }
  }
}
```

</details>
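A query like this is usually sent as a JSON POST body. The sketch below only builds the request, using the standard library. The endpoint path `/api/graphql`, the `String!` type of the `$urn` variable, and the helper name `build_graphql_request` are assumptions; check them against your deployment and the GraphQL schema before relying on them.

```python
import json
from typing import Optional
from urllib.request import Request

# Assumed endpoint; adjust host, port, and path for your deployment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# Trimmed version of the query above, with the URN passed as a variable.
# The variable type String! is an assumption.
QUERY = """
query getVersionSet($urn: String!) {
  versionSet(urn: $urn) {
    urn
    latestVersion {
      urn
    }
  }
}
"""


def build_graphql_request(urn: str, token: Optional[str] = None) -> Request:
    """Build (but do not send) a POST request carrying the query and its variable."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"query": QUERY, "variables": {"urn": urn}}).encode("utf-8")
    return Request(GRAPHQL_URL, data=body, headers=headers, method="POST")


request = build_graphql_request("urn:li:versionSet:(abc123def456,mlModel)")
print(request.get_method(), request.full_url)
```

To execute the query, pass the request to `urllib.request.urlopen` and decode the JSON response.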
Version Sets have a specific relationship pattern with other entities:
- Versioned entities have a `VersionOf` relationship to their Version Set

Currently, DataHub supports versioning for the following entity types:
Future versions of DataHub may extend version set support to additional entity types.
Version Set functionality is controlled by the entityVersioning feature flag. This must be enabled in your DataHub deployment to use version sets:
```yaml
# In your DataHub configuration
featureFlags:
  entityVersioning: true
```
Several ingestion connectors automatically create and manage version sets:
A Version Set can only contain entities of a single type. This is enforced through the entityType field in the Version Set key. You cannot mix different entity types (e.g., datasets and ML models) in the same version set.
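Client code can check this constraint before attempting to link an entity. The following is a minimal sketch based only on the URN shapes described in this document. The helper names are hypothetical, and the entity URNs in the usage lines are made-up examples.

```python
def entity_type_of(urn: str) -> str:
    """Return the entity type segment of a URN shaped like urn:li:<entityType>:..."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"not a DataHub URN: {urn!r}")
    return parts[2]


def version_set_entity_type(version_set_urn: str) -> str:
    """Return <entityType> from urn:li:versionSet:(<id>,<entityType>)."""
    if entity_type_of(version_set_urn) != "versionSet":
        raise ValueError(f"not a version set URN: {version_set_urn!r}")
    key = version_set_urn.split(":", 3)[3]  # "(<id>,<entityType>)"
    _, separator, entity_type = key.strip("()").rpartition(",")
    if not separator or not entity_type:
        raise ValueError(f"malformed version set key: {key!r}")
    return entity_type


def can_link(version_set_urn: str, entity_urn: str) -> bool:
    """True if the entity's type matches the type the version set was created for."""
    return entity_type_of(entity_urn) == version_set_entity_type(version_set_urn)


version_set = "urn:li:versionSet:(abc123def456,mlModel)"
print(can_link(version_set, "urn:li:mlModel:(urn:li:dataPlatform:mlflow,my-model,PROD)"))
print(can_link(version_set, "urn:li:dataset:(urn:li:dataPlatform:hive,my-table,PROD)"))
```

A pre-check like this only gives earlier, clearer errors; the server-side enforcement described above still applies.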
Once a versioning scheme is set for a Version Set, it should not be changed. The sorting and ordering of versions depend on the scheme, and changing it could break the version ordering.
The isLatest flag on versioned entities is automatically maintained by DataHub's versioning service. While it's technically possible to set this field manually through the API, you should rely on the automatic maintenance through the linkAssetVersion GraphQL mutation or the Python SDK's versioning methods.
Linking or unlinking entities to/from version sets requires UPDATE permissions on both the version set and the versioned entity. Ensure proper authorization is configured for users who need to manage versions.
While version labels (the version field) should be unique within a version set, this is not strictly enforced by the system. It's the responsibility of the client code to ensure uniqueness. Having duplicate version labels can cause confusion when querying or navigating versions.
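Since uniqueness is left to the client, a simple guard is to compare a new label against the labels already in the set before linking. The sketch below assumes the existing labels have already been fetched (for example, through a version set query); the function name `duplicate_labels` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_labels(labels: Iterable[str]) -> List[str]:
    """Return the version labels that occur more than once, in sorted order."""
    counts = Counter(labels)
    return sorted(label for label, count in counts.items() if count > 1)


existing = ["1.0", "1.1", "2.0"]
new_label = "1.1"

# Refuse to link if the new label would collide with an existing one.
if duplicate_labels(existing + [new_label]):
    print(f"label {new_label!r} already exists in this version set")
```

The same function can be run periodically over a whole version set to detect duplicates that were introduced by other clients.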
When a versioned entity is deleted, it is not automatically unlinked from its version set. The relationship may become stale. Consider explicitly unlinking entities before deletion or implementing cleanup logic to handle orphaned version references.
Version Sets themselves are searchable entities in DataHub. Versioned entities can be searched by their version labels, aliases, and version set membership. Use the versionSortId field for ordering search results by version order.