metadata-models/docs/entities/dataFlow.md
A DataFlow represents a high-level data processing pipeline or workflow orchestrated by systems like Apache Airflow, Azkaban, Prefect, Dagster, or similar workflow management platforms. DataFlows serve as containers for related DataJobs, representing the overall execution context and organization of data processing tasks.
DataFlows are uniquely identified by three components:
- `orchestrator`: the workflow management platform (e.g., airflow, azkaban, prefect, dagster)
- `flowId`: the identifier of the flow within the orchestrator (e.g., the DAG ID)
- `cluster`: the environment or instance where the flow runs (e.g., prod, staging, dev)

The URN structure follows this pattern:
```
urn:li:dataFlow:(<orchestrator>,<flowId>,<cluster>)
```

Apache Airflow DAG in production:

```
urn:li:dataFlow:(airflow,daily_sales_pipeline,prod)
```

Prefect flow in staging:

```
urn:li:dataFlow:(prefect,customer_analytics,staging)
```

Azkaban workflow in development:

```
urn:li:dataFlow:(azkaban,data_quality_checks,dev)
```
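For programmatic construction of these URNs, the Python emitter library provides a helper; the sketch below assumes the `make_data_flow_urn` helper from `datahub.emitter.mce_builder` and reuses the Airflow example above:

```python
from datahub.emitter.mce_builder import make_data_flow_urn

# Build the URN for the Airflow DAG running in the prod cluster
flow_urn = make_data_flow_urn(
    orchestrator="airflow",
    flow_id="daily_sales_pipeline",
    cluster="prod",
)
print(flow_urn)  # urn:li:dataFlow:(airflow,daily_sales_pipeline,prod)
```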
DataFlows maintain essential metadata about the pipeline through the dataFlowInfo aspect, including the flow's name, description, project, and an external URL pointing back to the orchestrator.
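As a rough sketch of emitting this aspect directly (assuming the generated `DataFlowInfoClass` model and a DataHub instance at `http://localhost:8080`; the field values mirror the REST example later on this page):

```python
from datahub.emitter.mce_builder import make_data_flow_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass

flow_urn = make_data_flow_urn("airflow", "daily_sales_pipeline", "prod")

# Name, description, project, and a link back to the orchestrator UI
info = DataFlowInfoClass(
    name="Daily Sales Pipeline",
    description="Processes daily sales data and updates aggregates",
    project="analytics",
    externalUrl="https://airflow.company.com/dags/daily_sales_pipeline",
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=flow_urn, aspect=info))
```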
The editableDataFlowProperties aspect allows users to modify certain properties, such as the description, through the DataHub UI without interfering with ingestion from source systems.
This separation ensures that edits made in DataHub are preserved and not overwritten by subsequent ingestion runs.
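For illustration, a hedged sketch of setting the UI-editable description from code, assuming the generated `EditableDataFlowPropertiesClass` model:

```python
from datahub.emitter.mce_builder import make_data_flow_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import EditableDataFlowPropertiesClass

flow_urn = make_data_flow_urn("airflow", "daily_sales_pipeline", "prod")

# Edits land in a separate aspect, so re-ingesting dataFlowInfo from the
# orchestrator will not overwrite this description.
edit = EditableDataFlowPropertiesClass(
    description="Curated documentation maintained by the analytics team."
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=flow_urn, aspect=edit)
)
```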
The versionInfo aspect tracks versioning details for the flow, such as the version of the pipeline definition that produced it.
This is particularly useful for tracking changes to pipeline code and correlating pipeline versions with their execution history.
DataFlows act as parent entities for DataJobs. Each DataJob's identity includes a reference to its parent DataFlow through the flow field in the DataJobKey. This creates a hierarchical relationship:
```
DataFlow (Pipeline)
 ├── DataJob (Task 1)
 ├── DataJob (Task 2)
 └── DataJob (Task 3)
```
This structure mirrors how workflow orchestrators organize tasks within DAGs or pipelines.
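A minimal sketch of this hierarchy using the high-level Python SDK (the same `datahub.sdk` classes shown later on this page; the task names are illustrative):

```python
from datahub.sdk import DataFlow, DataJob

# One flow (the DAG) acts as the parent container for its tasks
flow = DataFlow(platform="airflow", name="daily_sales_pipeline")

# Each DataJob references the flow, forming the parent-child hierarchy
extract = DataJob(name="extract_sales", flow=flow)
transform = DataJob(name="transform_sales", flow=flow)
load = DataJob(name="load_aggregates", flow=flow)
```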
The incidentsSummary aspect provides visibility into data quality or operational issues, summarizing the active and resolved incidents raised against the flow.
This enables DataHub to serve as a centralized incident management system for data pipelines.
{{ inline /metadata-ingestion/examples/library/dataflow_create.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/dataflow_comprehensive.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/dataflow_read.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/dataflow_add_tags_terms.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/dataflow_add_ownership.py show_path_as_comment }}
DataFlows can be queried using the standard DataHub REST APIs:
<details>
<summary>REST API: Fetch a DataFlow entity</summary>

```bash
# Get a complete DataFlow snapshot
curl 'http://localhost:8080/entities/urn%3Ali%3AdataFlow%3A(airflow,daily_sales_pipeline,prod)'
```
Response includes all aspects:
```json
{
  "urn": "urn:li:dataFlow:(airflow,daily_sales_pipeline,prod)",
  "aspects": {
    "dataFlowKey": {
      "orchestrator": "airflow",
      "flowId": "daily_sales_pipeline",
      "cluster": "prod"
    },
    "dataFlowInfo": {
      "name": "Daily Sales Pipeline",
      "description": "Processes daily sales data and updates aggregates",
      "project": "analytics",
      "externalUrl": "https://airflow.company.com/dags/daily_sales_pipeline"
    },
    "ownership": { ... },
    "globalTags": { ... }
  }
}
```

</details>
{{ inline /metadata-ingestion/examples/library/datajob_create_full.py show_path_as_comment }}
DataFlows have a parent-child relationship with DataJobs through the IsPartOf relationship. This is the primary integration point:
```python
from datahub.sdk import DataFlow, DataJob

# DataJob automatically links to its parent DataFlow
flow = DataFlow(platform="airflow", name="my_dag")
job = DataJob(name="extract_data", flow=flow)
```
While DataFlows don't directly reference datasets, their child DataJobs establish lineage relationships with datasets through the dataJobInputOutput aspect, which records each job's input and output datasets.
This creates indirect lineage from DataFlows to datasets through their constituent jobs.
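For example, here is a hedged sketch of attaching dataset lineage to one of a flow's child jobs via the dataJobInputOutput aspect (the dataset platform and names are illustrative assumptions):

```python
from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInputOutputClass

# Lineage is recorded on a child job of the daily_sales_pipeline flow,
# which yields indirect lineage for the parent DataFlow.
job_urn = make_data_job_urn(
    orchestrator="airflow",
    flow_id="daily_sales_pipeline",
    job_id="transform_sales",
    cluster="prod",
)
io = DataJobInputOutputClass(
    inputDatasets=[make_dataset_urn("hive", "raw.sales_events", "PROD")],
    outputDatasets=[make_dataset_urn("hive", "analytics.daily_sales", "PROD")],
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=io)
)
```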
Common orchestrators that produce DataFlow entities include Apache Airflow, Azkaban, Prefect, and Dagster.
DataFlow executions can be tracked using DataProcessInstance entities, which record individual runs along with their start times, durations, and results.
This enables tracking of pipeline run history and troubleshooting failures.
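A rough sketch of what recording one run could look like at the aspect level, assuming the generated `DataProcessInstanceRunEventClass` model and an illustrative run id (a complete example would also emit the instance's properties and its relationship to the parent flow):

```python
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DataProcessInstanceRunEventClass,
    DataProcessInstanceRunResultClass,
    DataProcessRunStatusClass,
    RunResultTypeClass,
)

# Illustrative instance URN for a single run of the flow
run_urn = "urn:li:dataProcessInstance:daily_sales_pipeline_2024_01_01"

# Record the run completing successfully
event = DataProcessInstanceRunEventClass(
    timestampMillis=int(time.time() * 1000),
    status=DataProcessRunStatusClass.COMPLETE,
    result=DataProcessInstanceRunResultClass(
        type=RunResultTypeClass.SUCCESS,
        nativeResultType="airflow",
    ),
    attempt=1,
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=run_urn, aspect=event)
)
```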
The DataFlow entity is exposed through DataHub's GraphQL API with full support for fetching, searching, and updating flows.
Key GraphQL resolvers:
- `dataFlow`: Fetch a single DataFlow by URN
- `searchAcrossEntities`: Search for DataFlows with filters
- `updateDataFlow`: Modify DataFlow properties

While DataFlows use the cluster field in their URN for identification, they also have an env field in the dataFlowInfo aspect. These serve different purposes: cluster is part of the flow's identity and names the orchestrator instance or deployment it runs on, while env categorizes the flow's environment for search and filtering.
In most cases, these should align (e.g., cluster="prod" and env="PROD"), but they can diverge when multiple production clusters exist or when representing complex deployment topologies.
Throughout the codebase, you'll see both terms used:
- `orchestrator` (used in the URN and the dataFlowKey aspect)
- `platform` (used in the Python SDK, e.g., `DataFlow(platform="airflow", ...)`)

These are synonymous and refer to the same concept: the workflow management system executing the flow.
Two APIs exist for creating DataFlows:
- Legacy API (`datahub.api.entities.datajob.DataFlow`): uses the older emitter pattern
- Modern SDK (`datahub.sdk.DataFlow`): preferred approach with cleaner interfaces

New code should use the modern SDK (imported from `datahub.sdk`), though both are maintained for backward compatibility.
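A minimal sketch of the modern SDK path (the server address and the `description` parameter are illustrative assumptions; see the inline SDK examples above for complete usage):

```python
from datahub.sdk import DataFlow, DataHubClient

# Connect to DataHub (server/token handling may differ in your deployment)
client = DataHubClient(server="http://localhost:8080")

# Declare the flow and write it to DataHub
flow = DataFlow(
    platform="airflow",
    name="daily_sales_pipeline",
    description="Processes daily sales data and updates aggregates",
)
client.entities.upsert(flow)
```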
In some contexts, you might see references to "data pipelines" in documentation or the UI. This is an informal term that refers to DataFlows. The formal entity type in the metadata model is dataFlow, not dataPipeline.