metadata-models/docs/entities/mlFeature.md
The ML Feature entity represents an individual input variable used by machine learning models. Features are the building blocks of feature engineering: they transform raw data into meaningful signals that ML algorithms can learn from. In modern ML systems, features are first-class citizens that can be discovered, documented, versioned, and reused across multiple models and teams.
ML Features are identified by two pieces of information:
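- A feature namespace: a logical grouping of related features, such as user_features, transaction_features, or product_features.
- A feature name: the name of the feature within that namespace, such as age, lifetime_value, or days_since_signup.

An example of an ML Feature identifier is urn:li:mlFeature:(user_features,age).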
The identity is defined by the mlFeatureKey aspect, which contains:
- featureNamespace: A string representing the logical namespace or grouping for the feature
- name: The unique name of the feature within that namespace

urn:li:mlFeature:(user_features,age)
urn:li:mlFeature:(user_features,lifetime_value)
urn:li:mlFeature:(transaction_features,amount_last_7d)
urn:li:mlFeature:(product_features,price)
urn:li:mlFeature:(product_features,category_embedding)
The namespace and name together form a globally unique identifier. Multiple features can share the same namespace (representing a logical grouping), but each feature name must be unique within its namespace.
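The URN layout above can be sketched with plain string handling. This is a minimal illustration, not the DataHub SDK: `FeatureKey`, `make_feature_urn`, and `parse_feature_urn` are hypothetical helper names, and the parser assumes that neither the namespace nor the name contains a comma or parenthesis.

```python
from typing import NamedTuple


class FeatureKey(NamedTuple):
    """Hypothetical helper type mirroring the two fields of mlFeatureKey."""

    feature_namespace: str
    name: str


PREFIX = "urn:li:mlFeature:("


def make_feature_urn(key: FeatureKey) -> str:
    """Format a key as urn:li:mlFeature:(<namespace>,<name>)."""
    return f"{PREFIX}{key.feature_namespace},{key.name})"


def parse_feature_urn(urn: str) -> FeatureKey:
    """Split a feature URN back into its namespace and name.

    Assumes neither part contains a comma or parenthesis.
    """
    if not (urn.startswith(PREFIX) and urn.endswith(")")):
        raise ValueError(f"not an mlFeature URN: {urn!r}")
    namespace, sep, name = urn[len(PREFIX):-1].partition(",")
    if not sep or not namespace or not name:
        raise ValueError(f"expected (namespace,name) in {urn!r}")
    return FeatureKey(namespace, name)


urn = make_feature_urn(FeatureKey("user_features", "age"))
print(urn)
print(parse_feature_urn(urn))
```

Because both fields are part of the identifier, two features with the same name in different namespaces yield different URNs.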
ML Features support comprehensive metadata through the mlFeatureProperties aspect. This aspect captures the essential characteristics that define a feature:
Features should have clear descriptions explaining what they represent, how they're calculated, and when they should be used. Good feature documentation is critical for:
{{ inline /metadata-ingestion/examples/library/mlfeature_create_with_description.py show_path_as_comment }}
Features have a data type specified using MLFeatureDataType that describes the nature of the feature values. Understanding data type is essential for proper feature handling, preprocessing, and model training. DataHub supports a rich taxonomy of data types:
Categorical Types:

- NOMINAL: Discrete values with no inherent order (e.g., country, product category)
- ORDINAL: Discrete values that can be ranked (e.g., education level, rating)
- BINARY: Two-category values (e.g., is_subscriber, has_clicked)

Numeric Types:

- CONTINUOUS: Real-valued numeric data (e.g., height, price, temperature)
- COUNT: Non-negative integer counts (e.g., number of purchases, page views)
- INTERVAL: Numeric data with equal spacing (e.g., percentages, scores)

Temporal:

- TIME: Time-based cyclical features (e.g., hour_of_day, day_of_week)

Unstructured:

- TEXT: Text data requiring NLP processing
- IMAGE: Image data
- VIDEO: Video data
- AUDIO: Audio data

Structured:

- MAP: Dictionary or mapping structures
- SEQUENCE: Lists, arrays, or sequences
- SET: Unordered collections
- BYTE: Binary-encoded complex objects

Special:

- USELESS: High-cardinality unique values with no predictive relationship (e.g., random IDs)
- UNKNOWN: Type is not yet determined

{{ inline /metadata-ingestion/examples/library/mlfeature_create_with_datatypes.py show_path_as_comment }}
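When preprocessing code needs to branch on the kind of feature, the taxonomy above can be held in a small lookup table. The sketch below uses only the type names listed in this section; `TYPE_GROUPS` and `group_of` are hypothetical names and are not part of the DataHub SDK.

```python
from typing import Dict, Tuple

# Groups and members copied from the taxonomy described in this section.
TYPE_GROUPS: Dict[str, Tuple[str, ...]] = {
    "Categorical": ("NOMINAL", "ORDINAL", "BINARY"),
    "Numeric": ("CONTINUOUS", "COUNT", "INTERVAL"),
    "Temporal": ("TIME",),
    "Unstructured": ("TEXT", "IMAGE", "VIDEO", "AUDIO"),
    "Structured": ("MAP", "SEQUENCE", "SET", "BYTE"),
    "Special": ("USELESS", "UNKNOWN"),
}


def group_of(data_type: str) -> str:
    """Return the group that a data type name belongs to (case-insensitive)."""
    wanted = data_type.upper()
    for group, members in TYPE_GROUPS.items():
        if wanted in members:
            return group
    raise ValueError(f"unknown feature data type: {data_type!r}")


print(group_of("COUNT"))
print(group_of("text"))
```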
One of the most powerful capabilities of ML Features in DataHub is their ability to declare source datasets through the sources property. This creates explicit "DerivedFrom" lineage relationships between features and the upstream datasets they're computed from.
Source lineage enables:
{{ inline /metadata-ingestion/examples/library/mlfeature_add_source_lineage.py show_path_as_comment }}
Features support versioning through the version property. Version information helps teams:
{{ inline /metadata-ingestion/examples/library/mlfeature_create_versioned.py show_path_as_comment }}
Features support arbitrary key-value custom properties through the customProperties field, allowing you to capture platform-specific or organization-specific metadata:
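A minimal sketch of preparing such properties, assuming that customProperties is a string-to-string map, so non-string values must be serialized first. `to_custom_properties` is a hypothetical helper, and the property names are made-up examples.

```python
import json
from typing import Any, Dict


def to_custom_properties(raw: Dict[str, Any]) -> Dict[str, str]:
    """Coerce arbitrary metadata values to strings.

    Strings pass through unchanged; other values are JSON-encoded so that
    numbers, booleans, and lists can be decoded again later.
    """
    props: Dict[str, str] = {}
    for key, value in raw.items():
        if value is None:
            continue  # skip unset values rather than storing "null"
        if isinstance(value, str):
            props[str(key)] = value
        else:
            props[str(key)] = json.dumps(value, sort_keys=True)
    return props


# Hypothetical, organization-specific metadata for one feature.
props = to_custom_properties(
    {
        "refresh_schedule": "daily",
        "null_rate": 0.02,
        "online_serving": True,
        "owners_channel": None,
    }
)
print(props)
```

JSON-encoding the non-string values keeps them machine-readable, at the cost of consumers having to decode them again.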
ML Features support tags and glossary terms for classification, discovery, and governance:
- Tags (globalTags aspect) provide lightweight categorization such as PII indicators, feature maturity levels, or domain areas
- Glossary terms (glossaryTerms aspect) link features to standardized business definitions and concepts

Read this blog to understand when to use tags vs terms.
<details>
<summary>Python SDK: Add tags and terms to a feature</summary>

{{ inline /metadata-ingestion/examples/library/mlfeature_add_tags_terms.py show_path_as_comment }}

</details>
Ownership is associated with features using the ownership aspect. Clear feature ownership is essential for:
{{ inline /metadata-ingestion/examples/library/mlfeature_add_ownership.py show_path_as_comment }}
Features can be organized into domains (via the domains aspect) to represent organizational structure or functional areas. Domain organization helps teams:
Here's a comprehensive example that creates a feature with all core metadata:
<details>
<summary>Python SDK: Create a complete ML Feature</summary>

{{ inline /metadata-ingestion/examples/library/mlfeature_create_complete.py show_path_as_comment }}

</details>
Features are typically organized into feature tables. The feature entity itself doesn't directly reference its parent table (the relationship is inverse: tables reference features), but you can discover the containing table through relationships:
<details>
<summary>Python SDK: Find feature table containing a feature</summary>

{{ inline /metadata-ingestion/examples/library/mlfeature_find_table.py show_path_as_comment }}

</details>
You can retrieve ML Feature metadata using both the Python SDK and REST API:
<details>
<summary>Python SDK: Read an ML Feature</summary>

{{ inline /metadata-ingestion/examples/library/mlfeature_read.py show_path_as_comment }}

</details>
# Get the complete entity with all aspects
curl 'http://localhost:8080/entities/urn%3Ali%3AmlFeature%3A(user_features,age)'
# Get relationships to see source datasets and consuming models
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3AmlFeature%3A(user_features,age)&types=DerivedFrom'
curl 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3AmlFeature%3A(user_features,age)&types=Consumes'
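The same request URLs can be assembled in Python using only the standard library. This is a sketch under the assumptions of the curl examples above (server at http://localhost:8080, colons percent-encoded, parentheses and commas left as-is). `entity_url` and `relationships_url` are hypothetical helper names.

```python
from urllib.parse import quote

GMS = "http://localhost:8080"  # same host as the curl examples above


def encode_urn(urn: str) -> str:
    """Percent-encode ':' but keep '(', ')' and ',' as in the curl examples."""
    return quote(urn, safe="(),")


def entity_url(urn: str, base: str = GMS) -> str:
    """URL that returns the complete entity with all aspects."""
    return f"{base}/entities/{encode_urn(urn)}"


def relationships_url(urn: str, direction: str, rel_type: str, base: str = GMS) -> str:
    """URL that lists relationships of one type in one direction."""
    if direction not in ("INCOMING", "OUTGOING"):
        raise ValueError(f"direction must be INCOMING or OUTGOING, got {direction!r}")
    return (
        f"{base}/relationships?direction={direction}"
        f"&urn={encode_urn(urn)}&types={rel_type}"
    )


FEATURE = "urn:li:mlFeature:(user_features,age)"
print(entity_url(FEATURE))
print(relationships_url(FEATURE, "OUTGOING", "DerivedFrom"))  # source datasets
print(relationships_url(FEATURE, "INCOMING", "Consumes"))  # consuming models
```

The resulting strings can be passed to any HTTP client; authentication, if your deployment requires it, is not covered here.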
When creating many features at once (e.g., from a feature store ingestion connector), batch operations improve performance:
<details>
<summary>Python SDK: Create multiple features efficiently</summary>

{{ inline /metadata-ingestion/examples/library/mlfeature_create_batch.py show_path_as_comment }}

</details>
ML Features integrate with multiple other entities in DataHub's metadata model to form a comprehensive ML metadata ecosystem:
Features declare their source datasets through the sources property in mlFeatureProperties. This creates a "DerivedFrom" lineage relationship that:
The relationship is directional: features point to their source datasets. Multiple features can derive from the same dataset, and a single feature can derive from multiple datasets if it's computed via a join or union.
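Because the edges point from feature to dataset, answering "which features depend on this dataset?" means inverting them. The sketch below does this over an in-memory mapping. The sample URNs and the `features_by_dataset` helper are illustrative assumptions, not data or functions from DataHub.

```python
from collections import defaultdict
from typing import Dict, List

USERS = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.users,PROD)"
ORDERS = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.orders,PROD)"

# Hypothetical sample: feature URN -> the datasets it is derived from.
feature_sources: Dict[str, List[str]] = {
    "urn:li:mlFeature:(user_features,age)": [USERS],
    # Computed from a join, so it has two upstream datasets.
    "urn:li:mlFeature:(user_features,lifetime_value)": [USERS, ORDERS],
}


def features_by_dataset(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert feature -> datasets into dataset -> sorted list of features."""
    downstream: Dict[str, List[str]] = defaultdict(list)
    for feature_urn, dataset_urns in sources.items():
        for dataset_urn in dataset_urns:
            downstream[dataset_urn].append(feature_urn)
    return {dataset: sorted(features) for dataset, features in downstream.items()}


impact = features_by_dataset(feature_sources)
print(impact[USERS])  # both features derive from the users dataset
print(impact[ORDERS])  # only lifetime_value derives from orders
```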
ML Models consume features through the mlFeatures property in MLModelProperties. This creates a "Consumes" lineage relationship showing:
This relationship enables critical use cases like:
{{ inline /metadata-ingestion/examples/library/mlfeature_add_to_mlmodel.py show_path_as_comment }}
Feature tables contain ML Features through the "Contains" relationship. The feature table's mlFeatures property lists the URNs of features it contains. This relationship:
While features don't explicitly store their parent table, you can discover it by querying incoming "Contains" relationships.
<details>
<summary>Python SDK: Add a feature to a feature table</summary>

{{ inline /metadata-ingestion/examples/library/mlfeature_add_to_mlfeature_table.py show_path_as_comment }}

</details>
Features are often associated with a platform through their namespace or through related entities (feature tables). While features themselves don't have a direct platform reference in their key, the namespace often encodes platform-specific organization, and related feature tables declare their platform explicitly.
Features are accessible through DataHub's GraphQL API via the MLFeatureType class. The GraphQL interface provides:
The featureNamespace in the feature key is a logical grouping concept and doesn't necessarily correspond 1:1 with feature tables:
- In the common case, feature table user_features contains features with namespace user_features.

When ingesting features, ensure namespace values match the corresponding feature table names so that relationships are established correctly.
A feature's identity (featureNamespace + name) is independent of any feature table. This means:
Most feature stores enforce 1:1 relationships between features and feature tables to avoid ambiguity.
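If you want to check that convention in your own metadata, one option is to look for feature URNs that more than one table lists. The following is a sketch over an in-memory mapping. The table URNs are sample values, and `features_in_multiple_tables` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical sample: feature table URN -> feature URNs it contains.
table_features: Dict[str, List[str]] = {
    "urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,user_features)": [
        "urn:li:mlFeature:(user_features,age)",
        "urn:li:mlFeature:(user_features,lifetime_value)",
    ],
    "urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,user_features_daily)": [
        "urn:li:mlFeature:(user_features,age)",
    ],
}


def features_in_multiple_tables(tables: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return feature URN -> sorted table URNs for features listed by 2+ tables."""
    tables_by_feature: Dict[str, Set[str]] = defaultdict(set)
    for table_urn, feature_urns in tables.items():
        for feature_urn in feature_urns:
            tables_by_feature[feature_urn].add(table_urn)
    return {
        feature: sorted(owners)
        for feature, owners in tables_by_feature.items()
        if len(owners) > 1
    }


for feature, owners in features_in_multiple_tables(table_features).items():
    print(feature, "is listed by", len(owners), "tables")
```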
There are multiple approaches to versioning features:
Option 1: Version in the URN (namespace or name)
urn:li:mlFeature:(user_features_v2,age)
urn:li:mlFeature:(user_features,age_v2)
Option 2: Version in the properties
MLFeatureProperties(
    description="User age in years",
    version=VersionTag(versionTag="2.0"),
)
Recommendation: Use the version property in mlFeatureProperties for most use cases. Only use versioned URNs when breaking changes require fully separate entities (e.g., changing data type from continuous to categorical).
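One way to encode this recommendation is a small helper that keeps the URN stable for compatible changes and creates a new name only for breaking ones. This is a sketch, not an established DataHub convention: `plan_feature_version` is a hypothetical function, and the `_v<major>` suffix is just the pattern from the Option 1 example.

```python
from typing import Tuple


def plan_feature_version(
    namespace: str, name: str, version: str, breaking_change: bool
) -> Tuple[str, str]:
    """Return (urn, version_tag) for a new version of a feature.

    Compatible changes keep the URN and only update the version tag.
    Breaking changes get a separate entity named <name>_v<major>.
    """
    if breaking_change:
        major = version.split(".")[0]
        name = f"{name}_v{major}"
    return f"urn:li:mlFeature:({namespace},{name})", version


# Compatible change: same entity, new version tag.
print(plan_feature_version("user_features", "age", "1.1", breaking_change=False))
# Breaking change (e.g., continuous to categorical): separate entity.
print(plan_feature_version("user_features", "age", "2.0", breaking_change=True))
```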
Composite features (features derived from other features) can be modeled in two ways:
Approach 1: Intermediate features as entities. Create explicit feature entities for each transformation step, with lineage between them:
raw_feature -> transformed_feature -> composite_feature
Approach 2: Direct source lineage. Skip intermediate features and link composite features directly to source datasets, documenting the transformation in the description.
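Both approaches lead back to the same source datasets; they differ in how many hops the lineage graph records. The sketch below models each approach as an upstream mapping and walks it to the roots. The node names follow the chain shown above, while the dataset URN and `root_sources` are illustrative assumptions.

```python
from typing import Dict, List, Set


def root_sources(start: str, upstream: Dict[str, List[str]]) -> Set[str]:
    """Follow upstream edges from `start` and return the nodes that have
    no upstreams of their own (the original sources)."""
    roots: Set[str] = set()
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against cycles and repeated paths
        seen.add(node)
        parents = upstream.get(node, [])
        if parents:
            stack.extend(parents)
        elif node != start:
            roots.add(node)
    return roots


DATASET = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.users,PROD)"

# Approach 1: every transformation step is its own node.
approach_1 = {
    "composite_feature": ["transformed_feature"],
    "transformed_feature": ["raw_feature"],
    "raw_feature": [DATASET],
}

# Approach 2: the composite feature links straight to the dataset.
approach_2 = {"composite_feature": [DATASET]}

print(root_sources("composite_feature", approach_1))
print(root_sources("composite_feature", approach_2))
```

With Approach 1 the intermediate nodes remain available for reuse and inspection. With Approach 2 that information lives only in the description.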
Choose Approach 1 when:
Choose Approach 2 when:
While DataHub's ML Feature entity doesn't include built-in drift monitoring aspects, you can use:
- Tags such as HIGH_DRIFT_DETECTED or MONITORING_ENABLED
- Links to external monitoring systems via institutionalMemory

Feature drift detection typically happens in runtime feature stores or model monitoring systems, with DataHub serving as the metadata catalog that links to those systems.
Features are searchable by:
The name field has the highest search boost score (8.0), making feature name the primary discovery mechanism. Ensure feature names are descriptive and follow consistent naming conventions across your organization.