metadata-models/docs/entities/mlPrimaryKey.md
MLPrimaryKey represents a primary key entity within a machine learning feature store. Primary keys uniquely identify records in feature tables and are essential for joining features with entities in online and offline feature serving. In feature stores like Feast, Tecton, or AWS SageMaker Feature Store, primary keys define the identifier columns that link features to the entities they describe (e.g., user_id, product_id, transaction_id).
MLPrimaryKeys are identified by two pieces of information: the feature namespace and the name of the primary key.
An example of an MLPrimaryKey identifier is urn:li:mlPrimaryKey:(users_feature_table,user_id).
The URN structure follows this pattern:
urn:li:mlPrimaryKey:(<feature_namespace>,<primary_key_name>)
Where:
- <feature_namespace> is the namespace, often matching the feature table name
- <primary_key_name> is the unique name of the primary key

For example:

- urn:li:mlPrimaryKey:(users_feature_table,user_id) - User ID in a user features table
- urn:li:mlPrimaryKey:(product_features,product_id) - Product ID in a product features table
- urn:li:mlPrimaryKey:(transactions,transaction_id) - Transaction ID in a transactions feature table

The core metadata about an MLPrimaryKey is stored in the mlPrimaryKeyProperties aspect. This includes:
The following code snippet shows you how to create an MLPrimaryKey with properties:
<details>
<summary>Python SDK: Create an MLPrimaryKey</summary>

{{ inline /metadata-ingestion/examples/library/mlprimarykey_create.py show_path_as_comment }}

</details>
Like other DataHub entities, MLPrimaryKeys separate ingested metadata from user-edited metadata. The editableMlPrimaryKeyProperties aspect allows users to enhance the metadata through the DataHub UI without interfering with automated ingestion:
This separation ensures that:
MLPrimaryKeys support lineage tracking through their sources field. By linking primary keys to upstream datasets, you can:
The lineage relationships created are of type DerivedFrom and explicitly marked as lineage relationships (isLineage: true), ensuring they appear in DataHub's lineage visualization.
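As a rough illustration of how the sources field carries lineage, the sketch below assembles the content of an mlPrimaryKeyProperties aspect as a plain Python dictionary. No DataHub client is involved. The helper `build_primary_key_properties`, the upstream dataset URN, the `dataType` value, and the nested `versionTag` layout are assumptions made for this example and should be checked against the aspect schema of your DataHub version.

```python
import json

# Hypothetical upstream warehouse table, written in DataHub's dataset URN layout.
UPSTREAM_DATASET = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.users,PROD)"


def build_primary_key_properties(description, data_type, sources, version=None):
    """Assemble an illustrative mlPrimaryKeyProperties payload (assumed field names)."""
    properties = {
        "description": description,
        "dataType": data_type,
        # Each entry in "sources" is an upstream dataset the key is derived from.
        "sources": list(sources),
    }
    if version is not None:
        # Assumed shape of the version field.
        properties["version"] = {"versionTag": version}
    return properties


properties = build_primary_key_properties(
    description="Unique identifier of a user",
    data_type="ORDINAL",
    sources=[UPSTREAM_DATASET],
    version="1.0",
)
print(json.dumps(properties, indent=2))
```

Every URN listed under `sources` is what would surface as an upstream node of the primary key in the lineage view.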
MLPrimaryKeys can have Tags or Terms attached to them through the globalTags and glossaryTerms aspects. This enables:
pii, sensitive)

Read this blog to understand the difference between tags and terms.
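To make the shape of a tag attachment concrete, here is a minimal sketch of a globalTags payload built as a plain dictionary. The helper `build_global_tags` is hypothetical and only formats data; it assumes the tags already exist under the `urn:li:tag:` prefix and does not send anything to DataHub.

```python
def build_global_tags(tag_names):
    """Return an illustrative globalTags payload with one association per tag name."""
    return {"tags": [{"tag": f"urn:li:tag:{name}"} for name in tag_names]}


# Tag names reused from the examples on this page.
global_tags = build_global_tags(["pii", "sensitive"])
print(global_tags)
```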
Ownership is associated with an MLPrimaryKey using the ownership aspect. Owners can be data scientists, ML engineers, or feature store administrators responsible for maintaining the primary key definition. Ownership helps with:
MLPrimaryKeys support the domains aspect, allowing them to be organized into logical business domains or data products. This helps with:
MLPrimaryKeys support the structuredProperties aspect, allowing organizations to extend the metadata model with custom fields that are validated and searchable. This enables:
{{ inline /metadata-ingestion/examples/library/mlprimarykey_create.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/mlprimarykey_read.py show_path_as_comment }}
MLPrimaryKeys are typically associated with feature tables to define how records should be uniquely identified. A feature table can have one or more primary keys (composite keys).
<details>
<summary>Python SDK: Add primary keys to a feature table</summary>

{{ inline /metadata-ingestion/examples/library/mlprimarykey_add_to_mlfeature_table.py show_path_as_comment }}

</details>
The standard REST APIs can be used to retrieve MLPrimaryKey metadata and relationships.
<details>
<summary>REST API: Fetch MLPrimaryKey entity information</summary>

{{ inline /metadata-ingestion/examples/library/mlprimarykey_query_rest.py show_path_as_comment }}

</details>
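When calling a REST endpoint by hand, the URN has to be percent-encoded because it contains colons, parentheses, and a comma. The sketch below only builds the request URL and sends nothing over the network. The server address `http://localhost:8080` and the `/entities/` path are assumptions about a local deployment, and `entity_url` is a hypothetical helper; verify both against the API documentation for your DataHub version.

```python
from urllib.parse import quote, unquote

GMS_SERVER = "http://localhost:8080"  # assumed address of a local DataHub server


def entity_url(urn: str, server: str = GMS_SERVER) -> str:
    """Build an entity URL with the URN fully percent-encoded."""
    # safe="" forces ":", "(", ")" and "," to be encoded as well.
    encoded_urn = quote(urn, safe="")
    return f"{server}/entities/{encoded_urn}"


urn = "urn:li:mlPrimaryKey:(users_feature_table,user_id)"
url = entity_url(urn)
print(url)
```

Decoding the last path segment with `unquote` returns the original URN, which is a quick way to confirm the encoding is lossless.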
The most important relationship for MLPrimaryKeys is with MLFeatureTables. Feature tables reference primary keys through the mlPrimaryKeys field of their mlFeatureTableProperties aspect, creating a KeyedBy relationship. This relationship indicates that:
This bidirectional relationship enables:
MLPrimaryKeys can be linked to Dataset entities through the sources field in mlPrimaryKeyProperties. This creates DerivedFrom lineage relationships to upstream data warehouse tables, establishing:
While not a direct relationship, MLPrimaryKeys and MLFeatures both belong to the same feature namespace (typically a feature table). Primary keys identify the entity, while features provide the attributes of that entity. Together, they form the complete feature table schema.
MLPrimaryKeys are fully indexed for search with the following capabilities:
MLPrimaryKeys support the dataPlatformInstance aspect, which is useful when:
When a feature table requires multiple columns to uniquely identify a record, it uses composite primary keys. In DataHub:
mlPrimaryKeys array in MLFeatureTableProperties

Example:
# For a feature table keyed by (user_id, date)
primary_keys = [
    "urn:li:mlPrimaryKey:(daily_user_features,user_id)",
    "urn:li:mlPrimaryKey:(daily_user_features,date)",
]
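A composite-key list like the one above can also be generated instead of typed by hand. The following is a minimal sketch using only the standard library. `make_primary_key_urn` and `parse_primary_key_urn` are hypothetical helpers written for this illustration, not DataHub SDK functions, and they assume that neither the namespace nor the key name contains a comma or a parenthesis.

```python
import re
from typing import Tuple

# Matches urn:li:mlPrimaryKey:(<feature_namespace>,<primary_key_name>)
_URN_PATTERN = re.compile(r"^urn:li:mlPrimaryKey:\(([^,()]+),([^,()]+)\)$")


def make_primary_key_urn(feature_namespace: str, primary_key_name: str) -> str:
    """Build an MLPrimaryKey URN from its two identifying parts."""
    return f"urn:li:mlPrimaryKey:({feature_namespace},{primary_key_name})"


def parse_primary_key_urn(urn: str) -> Tuple[str, str]:
    """Split an MLPrimaryKey URN back into (feature_namespace, primary_key_name)."""
    match = _URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"Not an MLPrimaryKey URN: {urn!r}")
    return match.group(1), match.group(2)


# One URN per key column of a feature table keyed by (user_id, date).
primary_keys = [
    make_primary_key_urn("daily_user_features", column)
    for column in ("user_id", "date")
]
print(primary_keys)
```

Keeping the namespace in one place makes it harder for the columns of a composite key to drift into different namespaces by accident.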
Different feature stores use different terminology:
DataHub normalizes these concepts under the mlPrimaryKey entity type. When ingesting from different platforms, connectors map these platform-specific terms to MLPrimaryKey.
In some feature stores, primary keys can also serve as features themselves (e.g., using user_id as both the key and a feature for training). In DataHub:
This dual representation accurately reflects the different roles the same data plays in the feature store.
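One way to picture the dual representation is that the same column yields two distinct URNs, one per entity type. The sketch below is illustrative only: `dual_urns` is a hypothetical helper, and it assumes the feature entity uses the `urn:li:mlFeature:(<feature_namespace>,<feature_name>)` layout with the same namespace as the key.

```python
def dual_urns(feature_namespace: str, column: str) -> dict:
    """Return the primary-key URN and the feature URN for the same column."""
    return {
        "primary_key": f"urn:li:mlPrimaryKey:({feature_namespace},{column})",
        "feature": f"urn:li:mlFeature:({feature_namespace},{column})",
    }


# user_id acts as the key of the table and is also exposed as a feature.
urns = dual_urns("users_feature_table", "user_id")
print(urns["primary_key"])
print(urns["feature"])
```

Because the entity type is part of the URN, the two entities never collide even though the namespace and name are identical.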
The feature namespace in an MLPrimaryKey URN should typically match the feature table name where it's used. However, DataHub doesn't enforce this requirement, allowing for flexibility in cases where:
Primary key data types should remain stable to avoid breaking feature serving. However, if a type change is necessary:
version field to track the schema evolution

Primary keys often contain or directly map to personally identifiable information (PII). Organizations should:
pii, gdpr_sensitive) to MLPrimaryKey entities