metadata-ingestion/docs/sources/vertexai/README.md
Vertex AI is Google Cloud's machine learning platform. Learn more in the official [Vertex AI documentation](https://cloud.google.com/vertex-ai/docs).
The DataHub integration for Vertex AI covers ML entities such as models, datasets, training jobs, experiments, and pipelines (see the concept mapping table below), along with related lineage metadata. Depending on the source's capabilities, it can also capture usage, profiling, ownership, tags, and stateful deletion detection.
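To ingest from Vertex AI, point a DataHub recipe at this source. Below is a minimal programmatic sketch, assuming the source type is `vertexai` and that its config accepts `project_id` and `region`; the authoritative field names and credential options are defined in the source's config reference.

```python
# A minimal ingestion sketch. The "vertexai" source type and the
# project_id/region fields are assumptions; consult the source config
# reference for the exact schema.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "vertexai",
            "config": {
                "project_id": "my-gcp-project",  # assumed: GCP project to scan
                "region": "us-central1",         # assumed: Vertex AI region
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()                # run the ingestion end to end
pipeline.raise_from_status()  # fail loudly if the run reported errors
```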
| Source Concept | DataHub Concept | Notes |
|---|---|---|
| Model | MlModelGroup | The name of the Model Group matches the Model's name. In Vertex AI, a Model serves as a container for multiple versions of the same model. |
| Model Version | MlModel | The name of a Model is `{model_name}_{model_version}` (e.g. `my_vertexai_model_1`) for a model registered to the Model Registry or deployed to an Endpoint. Each Model Version represents a specific iteration of a model with its own metadata (see the naming sketch after this table). |
| Dataset | [`Dataset`](https://docs.datahub.com/docs/generated/metamodel/entities/dataset) | A Managed Dataset resource in Vertex AI is mapped to a Dataset in DataHub. <br/> Supported dataset types: Text, Tabular, Image, Video, and Time Series. |
| Training Job | DataProcessInstance | A Training Job is mapped to a DataProcessInstance in DataHub. <br/> Supported training job types: AutoMLTextTrainingJob, AutoMLTabularTrainingJob, AutoMLImageTrainingJob, AutoMLVideoTrainingJob, AutoMLForecastingTrainingJob, CustomJob, CustomTrainingJob, CustomContainerTrainingJob, and CustomPythonPackageTrainingJob. |
| Experiment | Container | Experiments organize related runs and serve as logical groupings for model development iterations. Each Experiment is mapped to a Container in DataHub. |
| Experiment Run | DataProcessInstance | An Experiment Run represents a single execution of an ML workflow. It tracks ML parameters, metrics, artifacts, and metadata. |
| Execution | DataProcessInstance | A Metadata Execution resource in Vertex AI. An Execution is started within an Experiment Run and captures its input and output artifacts. |
| PipelineJob | DataFlow | A Vertex AI Pipeline is mapped to a stable DataFlow entity in DataHub (one per pipeline template). The compiled pipeline spec name (`pipelineInfo.name`, i.e. the `@pipeline(name="...")` argument) is used as the stable identifier; non-Kubeflow pipelines fall back to `display_name` with any timestamp suffix stripped (see the fallback sketch after this table). Each pipeline run creates a DataProcessInstance, and pipeline tasks are modeled as DataJobs nested under the parent DataFlow. This enables proper incremental lineage aggregation across multiple pipeline runs. <br/> **Breaking change (v1.4.0):** previously, each pipeline run created a separate DataFlow entity. Pipeline entities from earlier versions will therefore appear as separate entities from new ingestion runs; enable stateful ingestion with stale entity removal to clean up the old pipeline entities. |
| PipelineJob Task | DataJob | Each task within a Vertex AI pipeline is modeled as a DataJob in DataHub, nested under its parent pipeline DataFlow. Tasks represent individual steps in the pipeline workflow. |
| PipelineJob Task Run | DataProcessInstance | Each execution of a pipeline task is modeled as a DataProcessInstance, linked to its DataJob (task definition). This captures runtime metadata, inputs/outputs, and lineage for each task execution. |
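As a worked example of the Model Version naming convention above, the sketch below derives an MlModel name from a model name and version ID. It is illustrative only; the connector's actual implementation may differ.

```python
# Illustrative sketch of the {model_name}_{model_version} convention from the
# Model Version row above; not the connector's actual code.
def ml_model_name(model_name: str, version_id: str) -> str:
    return f"{model_name}_{version_id}"

assert ml_model_name("my_vertexai_model", "1") == "my_vertexai_model_1"
```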
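The PipelineJob row notes that non-Kubeflow pipelines fall back to `display_name` with any timestamp suffix stripped. Here is a hedged sketch of that fallback; the regex is a hypothetical stand-in, and the connector's actual pattern may differ.

```python
import re

# Hypothetical sketch of the stable-identifier fallback for pipelines: prefer
# the compiled spec name (pipelineInfo.name); otherwise strip a trailing
# timestamp from display_name. The regex below is an assumption, not the
# connector's actual pattern.
TIMESTAMP_SUFFIX = re.compile(r"[-_]\d{8}[-_]?\d{6}$")  # e.g. "-20240131-123456"

def stable_pipeline_id(spec_name: str | None, display_name: str) -> str:
    if spec_name:  # pipelineInfo.name from the compiled pipeline spec
        return spec_name
    return TIMESTAMP_SUFFIX.sub("", display_name)

assert stable_pipeline_id(None, "train-pipeline-20240131-123456") == "train-pipeline"
assert stable_pipeline_id("my-template", "anything") == "my-template"
```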