metadata-models/docs/entities/mlModel.md
The ML Model entity represents trained machine learning models across various ML platforms and frameworks. ML Models can be trained using different algorithms and frameworks (TensorFlow, PyTorch, Scikit-learn, etc.) and deployed to various platforms (MLflow, SageMaker, Vertex AI, etc.).
ML Models are identified by three pieces of information:

- **The platform that the model belongs to**: the specific ML platform that hosts or produced the model, e.g., `mlflow`, `sagemaker`, `vertexai`, `databricks`, etc. See the dataPlatform entity for more details.
- **The name of the model**: the platform-specific model name, e.g., `my-recommendation-model` (MLflow), `product-recommendation-v1` (SageMaker), or `projects/123/locations/us-central1/models/456` (Vertex AI).
- **The environment or fabric** in which the model exists, e.g., `PROD`.

An example of an ML Model identifier is `urn:li:mlModel:(urn:li:dataPlatform:mlflow,my-recommendation-model,PROD)`.
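As a quick illustration, the Python SDK's `make_ml_model_urn` helper assembles this identifier from the three parts (the values below are the example ones above):

```python
from datahub.emitter.mce_builder import make_ml_model_urn

# Platform, name, and environment -- the three pieces of an ML Model's identity.
model_urn = make_ml_model_urn(
    platform="mlflow",
    model_name="my-recommendation-model",
    env="PROD",
)
print(model_urn)
# urn:li:mlModel:(urn:li:dataPlatform:mlflow,my-recommendation-model,PROD)
```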
The core information about an ML Model is captured in the `mlModelProperties` aspect. This includes the model's name and description, custom properties, hyperparameters and metrics, and references to related entities such as model groups, ML features, deployments, and training jobs. Version information is managed separately through the `versionProperties` aspect.

The following code snippet shows you how to create a basic ML Model:
<details>
<summary>Python SDK: Create an ML Model</summary>

{{ inline /metadata-ingestion/examples/library/mlmodel_create.py show_path_as_comment }}

</details>
ML Models can capture both the hyperparameters used during training and various metrics from training and production:

- **Hyperparameters** (`hyperParams`): training configuration such as learning rate, batch size, or tree depth
- **Training Metrics** (`trainingMetrics`): offline metrics computed during training and evaluation (e.g., accuracy, AUC)
- **Online Metrics** (`onlineMetrics`): metrics observed while the model serves production traffic

These are stored in the `mlModelProperties` aspect as structured lists of parameters and metrics, as shown in the following example:
{{ inline /metadata-ingestion/examples/library/mlmodel_add_metadata.py show_path_as_comment }}
DataHub supports comprehensive model documentation following ML model card best practices. These aspects help stakeholders understand the appropriate use cases and ethical implications of using the model:
- **Intended Use** (`intendedUse` aspect): Documents primary use cases, intended users, and out-of-scope applications
- **Ethical Considerations** (`mlModelEthicalConsiderations` aspect): Documents use of sensitive data, risks and harms, and mitigation strategies
- **Caveats and Recommendations** (`mlModelCaveatsAndRecommendations` aspect): Additional considerations, ideal dataset characteristics, and usage recommendations

These aspects align with responsible AI practices and help ensure models are used appropriately.
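A minimal sketch of attaching one of these aspects with the Python SDK; the URN and values are illustrative, and the other model card aspects are emitted the same way:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import IntendedUseClass, IntendedUserTypeClass

model_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,my-recommendation-model,PROD)"

# Document what the model is for, who should use it, and what it must not be used for.
intended_use = IntendedUseClass(
    primaryUses=["Ranking product recommendations on the home page"],
    primaryUsers=[IntendedUserTypeClass.ENTERPRISE],  # values come from the IntendedUserType enum
    outOfScopeUses=["Credit or lending decisions"],
)

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=model_urn, aspect=intended_use))
```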
ML Models can document their training and evaluation datasets in two complementary ways:
- **Training Data** (`mlModelTrainingData` aspect): Datasets used to train the model, including preprocessing information and the motivation for dataset selection
- **Evaluation Data** (`mlModelEvaluationData` aspect): Datasets used for model evaluation and testing

Each dataset reference includes the dataset URN, the motivation for using that dataset, and any preprocessing steps applied. This creates direct lineage relationships between models and their training data.
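A sketch of documenting training data this way, assuming the `TrainingDataClass`/`BaseDataClass` bindings generated from these aspects (the dataset URN, motivation, and preprocessing notes are illustrative; evaluation data follows the same pattern):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import BaseDataClass, TrainingDataClass

model_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,my-recommendation-model,PROD)"

training_data = TrainingDataClass(
    trainingData=[
        BaseDataClass(
            dataset="urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.purchases,PROD)",
            motivation="Two years of purchase history, matching the production traffic mix",
            preProcessing=["Deduplicated by user_id", "Normalized prices to USD"],
        )
    ]
)

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=model_urn, aspect=training_data))
```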
Training runs (dataProcessInstance entities) provide an alternative and often more detailed way to capture training lineage:
- Input datasets are linked to the run through the `dataProcessInstanceInput` aspect
- Output datasets are linked through the `dataProcessInstanceOutput` aspect
- The model references the run through the `trainingJobs` field

This creates indirect lineage: Dataset → Training Run → Model
When to use each approach:

- **Direct dataset references** (`mlModelTrainingData`/`mlModelEvaluationData`): simple, static documentation of which datasets a model depends on
- **Training runs**: detailed, per-run lineage that also captures metrics, hyperparameters, and timing
Most production ML systems should use training runs for comprehensive lineage tracking.
For detailed model analysis and performance reporting:
- **Factor Prompts** (`mlModelFactorPrompts` aspect): Factors that may affect model performance (demographic groups, environmental conditions, etc.)
- **Quantitative Analyses** (`mlModelQuantitativeAnalyses` aspect): Links to dashboards or reports showing disaggregated performance metrics across different factors
- **Metrics** (`mlModelMetrics` aspect): Detailed metrics with descriptions, beyond simple training/online metrics
- **Source Code** (`sourceCode` aspect): Links to model training code, notebooks, or repositories (GitHub, GitLab, etc.)
- **Cost** (`cost` aspect): Cost attribution information for tracking model training and inference expenses

ML Models in DataHub can be linked to their training runs and experiments, providing complete lineage from raw data through training to deployed models.
Training runs represent specific executions of model training jobs. In DataHub, training runs are modeled as dataProcessInstance entities with a specialized subtype:
- **Entity type**: `dataProcessInstance`, with the subtype `MLAssetSubTypes.MLFLOW_TRAINING_RUN`
- **`dataProcessInstanceProperties`**: Basic properties like name, timestamps, and custom properties
- **`mlTrainingRunProperties`**: ML-specific properties, including hyperparameters, training metrics, output URLs, and a link back to the run in the source platform
- **`dataProcessInstanceInput`**: Input datasets used for training
- **`dataProcessInstanceOutput`**: Output datasets (predictions, feature importance, etc.)
- **`dataProcessInstanceRunEvent`**: Start, completion, and failure events

Training runs create lineage relationships showing which datasets each run consumed, which model versions it produced, and when it ran.
Models reference their training runs through the `trainingJobs` field in `mlModelProperties`, and model groups can also reference training runs to track all training activity for a model family.
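A condensed sketch of this wiring (URNs are illustrative; note that emitting `mlModelProperties` replaces the whole aspect, so in practice you would set all of its fields together):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DataProcessInstanceInputClass,
    MLModelPropertiesClass,
)

run_urn = "urn:li:dataProcessInstance:recommender_training_run_42"  # illustrative run id
model_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,my-recommendation-model,PROD)"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.purchases,PROD)"

emitter = DatahubRestEmitter("http://localhost:8080")

# Dataset -> Training Run: record what the run consumed.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=run_urn,
        aspect=DataProcessInstanceInputClass(inputs=[dataset_urn]),
    )
)

# Training Run -> Model: the model's trainingJobs field points back at the run.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=model_urn,
        aspect=MLModelPropertiesClass(trainingJobs=[run_urn]),
    )
)
```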
Experiments organize related training runs into logical groups, typically representing a series of attempts to optimize a model or compare different approaches. In DataHub, experiments are modeled as container entities:
- **Entity type**: `container`, with the subtype `MLAssetSubTypes.MLFLOW_EXPERIMENT`

Training runs belong to experiments through the `container` aspect, creating a hierarchy:
Experiment: "Customer Churn Prediction"
├── Training Run 1: baseline model
├── Training Run 2: with feature engineering
├── Training Run 3: hyperparameter tuning
└── Training Run 4: final production model
This structure mirrors common ML platform patterns (like MLflow's experiment/run hierarchy) and enables side-by-side comparison of related training runs. The following example walks through the full experiment and training run workflow:
{{ inline /metadata-ingestion/examples/ai/dh_ai_docs_demo.py show_path_as_comment }}
ML Models support rich relationship modeling through various aspects and fields:
- **Model Groups** (via `groups` field in `mlModelProperties`): Models can belong to `mlModelGroup` entities, creating a `MemberOf` relationship. This organizes related models into logical families or collections.
- **Training Runs** (via `trainingJobs` field in `mlModelProperties`): Models reference the `dataProcessInstance` entities (with the `MLFLOW_TRAINING_RUN` subtype) that produced them, creating upstream lineage from training data through the run to the model.
- **Features** (via `mlFeatures` field in `mlModelProperties`): Models can consume `mlFeature` entities, creating a `Consumes` relationship that documents which features the model depends on.
- **Deployments** (via `deployments` field in `mlModelProperties`): Models can be deployed to `mlModelDeployment` entities, representing running model endpoints in various environments (production, staging, etc.)
- **Training Datasets** (via `mlModelTrainingData` aspect): Direct references to datasets used for training, including preprocessing information and motivation for dataset selection
- **Evaluation Datasets** (via `mlModelEvaluationData` aspect): References to datasets used for model evaluation and testing
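Several of these relationships live on a single `mlModelProperties` aspect, so they can be set together; a sketch with illustrative URNs:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import MLModelPropertiesClass

model_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,my-recommendation-model,PROD)"

properties = MLModelPropertiesClass(
    description="Recommends products based on purchase history",
    # MemberOf: the model group this model belongs to
    groups=["urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,recommendation-models,PROD)"],
    # Upstream lineage: the training run that produced this model
    trainingJobs=["urn:li:dataProcessInstance:recommender_training_run_42"],
    # Consumes: the features the model depends on
    mlFeatures=["urn:li:mlFeature:(user_features,purchase_count_30d)"],
)

DatahubRestEmitter("http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=model_urn, aspect=properties)
)
```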
These relationships create a comprehensive lineage graph:
```
Training Datasets → Training Run → ML Model → ML Model Deployment
                        ↓
                   Experiment

Feature Tables → ML Features → ML Model
ML Model Group ← ML Model
```
This enables powerful queries, such as finding every model trained on a given dataset, tracing a deployed model back to its training data, or listing all models that consume a particular feature. The following example shows how to update ML Model aspects:
{{ inline /metadata-ingestion/examples/library/mlmodel_update_aspects.py show_path_as_comment }}
Like other DataHub entities, ML Models support:

- **Tags** (`globalTags` aspect): Flexible categorization (e.g., "pii-model", "production-ready", "experimental")
- **Glossary Terms** (`glossaryTerms` aspect): Business concepts (e.g., "Customer Churn", "Fraud Detection")
- **Ownership** (`ownership` aspect): Individuals or teams responsible for the model (data scientists, ML engineers, etc.)
- **Domains** (`domains` aspect): Organizational grouping (e.g., "Recommendations", "Risk Management")

The following example demonstrates a complete ML model lifecycle in DataHub, showing how all the pieces work together:
```
1. Create Model Group
        ↓
2. Create Experiment (Container)
        ↓
3. Create Training Run (DataProcessInstance)
   ├── Link input datasets
   ├── Link output datasets
   └── Add metrics and hyperparameters
        ↓
4. Create Model
   ├── Set version and aliases
   ├── Link to model group
   ├── Link to training run
   ├── Add hyperparameters and metrics
   └── Add ownership and tags
        ↓
5. Link Training Run to Experiment
        ↓
6. Update Model properties as needed
   ├── Change version aliases (champion → challenger)
   ├── Add additional tags/terms
   └── Update metrics from production
```
This workflow creates rich lineage from raw datasets through training runs and experiments to versioned, deployed models.

See the comprehensive example in `/metadata-ingestion/examples/ai/dh_ai_docs_demo.py`, which demonstrates this full workflow.
The example shows both basic patterns for getting started and advanced patterns for production ML systems.
The standard REST APIs can be used to retrieve ML Model entities and their aspects:
<details>
<summary>Python: Query an ML Model via REST API</summary>

{{ inline /metadata-ingestion/examples/library/mlmodel_query_rest_api.py show_path_as_comment }}

</details>
ML Models integrate with several other entities in the DataHub metadata model:
- **Data Process Instances** (with the `MLFLOW_TRAINING_RUN` subtype): Specific training runs that created model versions, including metrics, hyperparameters, and lineage to input/output datasets
- **Containers** (with the `MLFLOW_EXPERIMENT` subtype): Experiments that organize related training runs for a model or project

The GraphQL API provides rich querying capabilities for ML Models through the resolvers in `datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/types/mlmodel/`, which support fetching model properties, relationships, and lineage.
Several ingestion sources, including MLflow, SageMaker, and Vertex AI, automatically extract ML Model metadata.
These sources are located in `/metadata-ingestion/src/datahub/ingestion/source/` and automatically populate model properties, relationships, and lineage.
ML Model versioning in DataHub uses the versionProperties aspect, which provides a robust framework for tracking model versions across their lifecycle. This is the standard approach demonstrated in production ML platforms.
Every ML Model should use the `versionProperties` aspect, which includes:

- **version**: A `VersionTagClass` containing the version identifier (e.g., "1", "2", "v1.0.0")
- **versionSet**: The version set that groups all versions of the same model (e.g., `urn:li:versionSet:(mlModel,mlmodel_my-model_versions)`)
- **aliases**: A list of `VersionTagClass` objects for named version references

Version aliases enable flexible model lifecycle management and A/B testing workflows. Common aliases include "champion", "challenger", and "latest".
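As a sketch of what emitting this aspect can look like with the generated Python classes (the URN, version numbers, and `sortId` scheme below are illustrative assumptions):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import VersionPropertiesClass, VersionTagClass

# Each version of the model is its own mlModel entity (illustrative URN).
model_v3_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,my-recommendation-model_3,PROD)"

version_properties = VersionPropertiesClass(
    versionSet="urn:li:versionSet:(mlModel,mlmodel_my-recommendation-model_versions)",
    version=VersionTagClass(versionTag="3"),
    aliases=[VersionTagClass(versionTag="latest")],
    sortId="0000000003",  # assumption: zero-padded so versions sort lexicographically
)

DatahubRestEmitter("http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=model_v3_urn, aspect=version_properties)
)
```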
These aliases allow you to reference models by their role rather than specific version numbers, enabling smooth model promotion workflows:
```
Model v1 (alias: "champion")    # Currently in production
Model v2 (alias: "challenger")  # Being tested in canary deployment
Model v3 (alias: "latest")      # Just completed training
```
When v2 proves superior, you can update aliases without changing infrastructure:
```
Model v1 (no alias)             # Retired
Model v2 (alias: "champion")    # Promoted to production
Model v3 (alias: "challenger")  # Now being tested
```
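Because aliases live in each version's `versionProperties` aspect, promotion is a pure metadata update; a sketch continuing the example above (same assumptions about classes and URNs):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import VersionPropertiesClass, VersionTagClass

emitter = DatahubRestEmitter("http://localhost:8080")
version_set = "urn:li:versionSet:(mlModel,mlmodel_my-recommendation-model_versions)"

# Promote v2 to champion -- no serving infrastructure changes required.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:mlModel:(urn:li:dataPlatform:mlflow,my-recommendation-model_2,PROD)",
        aspect=VersionPropertiesClass(
            versionSet=version_set,
            version=VersionTagClass(versionTag="2"),
            aliases=[VersionTagClass(versionTag="champion")],
            sortId="0000000002",
        ),
    )
)
# Re-emit v1's versionProperties without the "champion" alias to retire it,
# and v3's with "challenger" to start the next test cycle.
```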
Model groups (`mlModelGroup` entities) serve as logical containers for organizing related models. While model groups can contain multiple versions of the same model, versioning is handled through the `versionProperties` aspect on individual models, not through the group structure itself. Model groups are used for organizing related models into logical families and for tracking training activity across every version of a model.
The relationship between models and model groups is established through the `groups` field in `mlModelProperties`, creating a `MemberOf` relationship.
Different ML platforms have different naming conventions:

- **MLflow**: registered model names such as `my-recommendation-model`
- **SageMaker**: model names such as `product-recommendation-v1`
- **Vertex AI**: fully qualified resource paths such as `projects/123/locations/us-central1/models/456`
When ingesting from these platforms, connectors handle platform-specific naming and convert it to appropriate DataHub URNs.
The various aspects (`intendedUse`, `mlModelFactorPrompts`, `mlModelEthicalConsiderations`, etc.) follow the Model Cards for Model Reporting framework (Mitchell et al., 2019). While these aspects are optional, they are strongly recommended for production models to ensure responsible AI practices and transparent model documentation.