metadata-models/docs/entities/dataProcess.md
DEPRECATED: Do not use this entity for new implementations.
Use dataFlow and dataJob instead.
The `dataProcess` entity was an early attempt to model data processing tasks, but it has been superseded by the more robust and flexible `dataFlow` and `dataJob` entities, which better represent the hierarchical nature of modern data pipelines.
The dataProcess entity was deprecated to provide a clearer separation between:

- **DataFlow**: the overall pipeline or workflow (for example, an Airflow DAG or an Azkaban flow)
- **DataJob**: an individual task that runs within that pipeline (for example, a single Airflow task)
This two-level hierarchy better matches how modern orchestration systems organize data processing work and provides more flexibility for lineage tracking, ownership assignment, and operational monitoring.
The original dataProcess entity had several limitations:

- A single flat entity could not distinguish between a pipeline and the individual tasks within it
- Inputs and outputs were attached at the process level, preventing task-level lineage
- There was no natural place to attach task-level ownership, tags, or operational metadata
The new dataFlow and dataJob model addresses these limitations by providing a clear parent-child relationship that mirrors real-world data processing architectures.
DataProcess entities were identified by three components:

- **Name**: the name of the process (e.g., a job name)
- **Orchestrator**: the orchestration platform that executed the process (e.g., airflow, azkaban)
- **Origin**: the fabric/environment the process ran in (e.g., PROD, DEV)

The URN structure was:

`urn:li:dataProcess:(<name>,<orchestrator>,<origin>)`

Examples:

- `urn:li:dataProcess:(customer_etl_job,airflow,PROD)`
- `urn:li:dataProcess:(sales_aggregation,azkaban,DEV)`
The dataProcessInfo aspect captured the inputs and outputs of the process as lists of dataset URNs. This established basic lineage via "Consumes" relationships with the referenced datasets.
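For reference, the sketch below shows what emitting this aspect looked like with the Python SDK. It is illustrative only: it assumes the generated `DataProcessInfoClass` is available in your SDK version, and the URNs and server address are placeholders.

```python
# Illustrative sketch of the deprecated pattern; prefer the dataFlow/dataJob
# example later on this page. Assumes DataProcessInfoClass is present in
# datahub.metadata.schema_classes (generated from the dataProcessInfo aspect).
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataProcessInfoClass

process_urn = "urn:li:dataProcess:(customer_etl_job,airflow,PROD)"

# Inputs/outputs were flat lists of dataset URNs attached at the process level.
info = DataProcessInfoClass(
    inputs=["urn:li:dataset:(urn:li:dataPlatform:hive,raw.customers,PROD)"],
    outputs=["urn:li:dataset:(urn:li:dataPlatform:hive,clean.customers,PROD)"],
)

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=process_urn, aspect=info))
```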
Like other entities, dataProcess supported common metadata aspects such as ownership, tags, documentation, and soft deletion via the status aspect.
Use DataFlow when representing:

- An overall pipeline, workflow, or DAG (for example, an Airflow DAG or an Azkaban flow)

Use DataJob when representing:

- An individual task or step that runs within a pipeline (for example, a single Airflow task)

Use both together:

- Model the pipeline as a DataFlow and each of its steps as DataJobs, attaching dataset lineage to the DataJobs
| DataProcess Concept | New Model Equivalent | Notes |
|---|---|---|
| Process with tasks | DataFlow + DataJobs | Split into two entities |
| Process name | DataFlow flowId | Becomes the parent identifier |
| Single-step process | DataFlow + 1 DataJob | Still requires both entities |
| Orchestrator | DataFlow orchestrator | Same concept, better modeling |
| Origin/Fabric | DataFlow cluster | Often matches environment |
| Inputs/Outputs | DataJob dataJobInputOutput | Moved to job level for precision |
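The identity mapping in the table above can be expressed with the SDK's URN builders. A minimal sketch, assuming the `make_data_flow_urn` and `make_data_job_urn` helpers from `datahub.emitter.mce_builder`; the process name and task id are illustrative:

```python
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn

# Old identity: urn:li:dataProcess:(customer_etl_job,airflow,PROD)
name, orchestrator, origin = "customer_etl_job", "airflow", "PROD"

# The process name becomes the DataFlow flowId, and the origin/fabric maps to
# the DataFlow cluster (lowercase by convention in the new URNs).
flow_urn = make_data_flow_urn(
    orchestrator=orchestrator, flow_id=name, cluster=origin.lower()
)
# -> urn:li:dataFlow:(airflow,customer_etl_job,prod)

# Each step of the old process becomes a DataJob under that flow.
job_urn = make_data_job_urn(
    orchestrator=orchestrator,
    flow_id=name,
    job_id="extract_customers",
    cluster=origin.lower(),
)
# -> urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_etl_job,prod),extract_customers)
```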
To migrate from dataProcess to dataFlow/dataJob:

1. **Identify your process structure**: Determine whether your dataProcess represents a pipeline (multiple steps) or a single task.
2. **Create a DataFlow**: This represents the overall pipeline/workflow.
3. **Create DataJob(s)**: Create one or more jobs within the flow.
4. **Migrate lineage**: Move input/output dataset relationships from the process level to the job level.
5. **Migrate metadata**: Transfer ownership, tags, and documentation to the appropriate entity (typically the DataFlow for pipeline-level metadata, or specific DataJobs for task-level metadata).
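The sketch below illustrates steps 2–4, assuming the REST emitter and the generated aspect classes from the Python SDK; the pipeline name, dataset URNs, and server address are illustrative.

```python
from datahub.emitter.mce_builder import (
    make_data_flow_urn,
    make_data_job_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DataFlowInfoClass,
    DataJobInfoClass,
    DataJobInputOutputClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")

# Step 2: create the DataFlow that replaces the old process container.
flow_urn = make_data_flow_urn(
    orchestrator="airflow", flow_id="customer_pipeline", cluster="prod"
)
flow_info = DataFlowInfoClass(name="customer_pipeline")

# Step 3: create a DataJob within the flow for each task.
job_urn = make_data_job_urn(
    orchestrator="airflow",
    flow_id="customer_pipeline",
    job_id="extract_customers",
    cluster="prod",
)
job_info = DataJobInfoClass(name="extract_customers", type="COMMAND", flowUrn=flow_urn)

# Step 4: attach lineage at the job level instead of the process level.
job_io = DataJobInputOutputClass(
    inputDatasets=[make_dataset_urn(platform="hive", name="raw.customers", env="PROD")],
    outputDatasets=[make_dataset_urn(platform="hive", name="clean.customers", env="PROD")],
)

for urn, aspect in [(flow_urn, flow_info), (job_urn, job_info), (job_urn, job_io)]:
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=aspect))
```

Step 5 metadata (ownership, tags, documentation) can be emitted against `flow_urn` or `job_urn` in the same way, using the corresponding aspect classes.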
Example 1: Simple single-task process

Old dataProcess:

- `urn:li:dataProcess:(daily_report,airflow,PROD)`

New structure:

- DataFlow: `urn:li:dataFlow:(airflow,daily_report,prod)`
- DataJob: `urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),daily_report_task)`

Example 2: Multi-step ETL pipeline

Old dataProcess:

- `urn:li:dataProcess:(customer_pipeline,airflow,PROD)`

New structure:

- DataFlow: `urn:li:dataFlow:(airflow,customer_pipeline,prod)`
- DataJob: `urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),extract_customers)`
- DataJob: `urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),transform_customers)`
- DataJob: `urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),load_customers)`
If you need to query existing dataProcess entities for migration purposes:
<details>
<summary>Python SDK: Query a dataProcess entity</summary>

{{ inline /metadata-ingestion/examples/library/dataprocess_query_deprecated.py show_path_as_comment }}

</details>
Instead of using dataProcess, create the modern equivalent:
<details>
<summary>Python SDK: Create DataFlow and DataJob to replace dataProcess</summary>

{{ inline /metadata-ingestion/examples/library/dataprocess_migrate_to_flow_job.py show_path_as_comment }}

</details>

<details>
<summary>Python SDK: Full migration example</summary>

{{ inline /metadata-ingestion/examples/library/dataprocess_full_migration.py show_path_as_comment }}

</details>
The dataProcess entity was previously used by some early ingestion connectors. All modern DataHub connectors instead emit dataFlow and dataJob entities (for example, the Airflow plugin and the Dagster, Spark, and Fivetran sources).
Note that dataProcessInstance is NOT deprecated. It represents a specific execution/run of either:

- a dataFlow (a run of an entire pipeline), or
- a dataJob (a run of an individual task)
DataProcessInstance continues to be used for tracking pipeline run history, status, and runtime information.
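As a minimal sketch of run tracking, assuming the `DataProcessInstance` helper in `datahub.api.entities.dataprocess.dataprocess_instance` and the `DataFlow`/`DataJob` helpers in `datahub.api.entities.datajob`; the ids and server address are illustrative:

```python
import time

from datahub.api.entities.datajob import DataFlow, DataJob
from datahub.api.entities.dataprocess.dataprocess_instance import (
    DataProcessInstance,
    InstanceRunResult,
)
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter("http://localhost:8080")

# The (non-deprecated) pipeline and task definitions.
flow = DataFlow(orchestrator="airflow", id="customer_pipeline", cluster="prod")
job = DataJob(id="extract_customers", flow_urn=flow.urn)

# A dataProcessInstance represents one run of the job.
run = DataProcessInstance.from_datajob(datajob=job, id="extract_customers_2024_01_01")
run.emit_process_start(emitter, int(time.time() * 1000))
# ... the task executes ...
run.emit_process_end(emitter, int(time.time() * 1000), result=InstanceRunResult.SUCCESS)
```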
The dataProcess entity remains readable through all DataHub APIs for backward compatibility. Existing dataProcess entities in your instance will continue to function and display in the UI.
While it is technically possible to create new dataProcess entities, doing so is strongly discouraged. All new integrations should use dataFlow and dataJob.
There is no automatic migration tool. Organizations with significant dataProcess data should plan a manual migration following the steps outlined above.
The dataProcess entity is minimally exposed in the GraphQL API. Modern GraphQL queries and mutations focus on dataFlow and dataJob entities.