metadata-models/docs/entities/dataJob.md
Data jobs represent individual units of data processing work within a data pipeline or workflow. They are the tasks, steps, or operations that transform, move, or process data as part of a larger data flow. Examples include Airflow tasks, dbt models, Spark jobs, Databricks notebooks, and similar processing units in orchestration systems.
Data jobs are identified by two pieces of information:

- The parent data flow, referenced as a dataFlow entity. The data flow defines the orchestrator (e.g., airflow, spark, dbt), the flow ID (e.g., the DAG name or pipeline name), and the cluster where it runs.
- The job ID, which uniquely identifies the job within that flow.

The URN structure for a data job is: `urn:li:dataJob:(urn:li:dataFlow:(<orchestrator>,<flow_id>,<cluster>),<job_id>)`
Examples:

- Airflow task: `urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_dag,prod),transform_customer_data)`
- dbt model: `urn:li:dataJob:(urn:li:dataFlow:(dbt,analytics_project,prod),staging.stg_customers)`
- Spark job: `urn:li:dataJob:(urn:li:dataFlow:(spark,data_processing_pipeline,PROD),aggregate_sales_task)`
- Databricks notebook: `urn:li:dataJob:(urn:li:dataFlow:(databricks,etl_workflow,production),process_events_notebook)`
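If you construct these URNs in code, the typed URN helpers in datahub.metadata.urns avoid manual string formatting. The sketch below assumes the generated DataFlowUrn and DataJobUrn constructors and their keyword names:

```python
from datahub.metadata.urns import DataFlowUrn, DataJobUrn

# Build the parent flow URN, then the job URN that embeds it
flow_urn = DataFlowUrn(orchestrator="airflow", flow_id="daily_etl_dag", cluster="prod")
job_urn = DataJobUrn(flow=str(flow_urn), job_id="transform_customer_data")

print(job_urn)
# urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_dag,prod),transform_customer_data)
```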
The dataJobInfo aspect captures the core properties of a data job, such as its name, job type, description, and custom properties.
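As a minimal sketch of writing this aspect directly (the server address, URN, and property values below are illustrative assumptions):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import AzkabanJobTypeClass, DataJobInfoClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed local instance

job_info = DataJobInfoClass(
    name="transform_customer_data",
    type=AzkabanJobTypeClass.COMMAND,  # orchestrator-specific job type
    description="Cleans and deduplicates raw customer records.",
    customProperties={"team": "growth"},  # hypothetical property
)

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_dag,prod),transform_customer_data)",
        aspect=job_info,
    )
)
```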
The dataJobInputOutput aspect defines the data lineage relationships for the job:
- Input datasets (`inputDatasetEdges`): the datasets the job consumes
- Output datasets (`outputDatasetEdges`): the datasets the job produces
- Input data jobs (`inputDatajobEdges`): upstream jobs this job depends on

This aspect establishes the critical relationships that enable DataHub to build and visualize data lineage graphs across your entire data ecosystem.
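A minimal sketch of emitting this aspect with the raw schema classes, assuming a local DataHub instance and hypothetical Snowflake dataset URNs. Note that emitting the full aspect replaces any existing edges; the DataJobPatchBuilder example later on this page is the safer choice for incremental updates:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInputOutputClass, EdgeClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed local instance

job_urn = "urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_dag,prod),transform_customer_data)"

io = DataJobInputOutputClass(
    inputDatasets=[],  # legacy fields; the *Edges fields below are preferred
    outputDatasets=[],
    inputDatasetEdges=[
        EdgeClass(destinationUrn="urn:li:dataset:(urn:li:dataPlatform:snowflake,raw.customers,PROD)")
    ],
    outputDatasetEdges=[
        EdgeClass(destinationUrn="urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customers_clean,PROD)")
    ],
)

emitter.emit(MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=io))
```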
The editableDataJobProperties aspect stores documentation edits made through the DataHub UI, such as an edited description.
This separation ensures that manual edits in the UI are preserved and not overwritten by ingestion pipelines.
Like other entities, data jobs support ownership through the ownership aspect. Owners can be users or groups with various ownership types (DATAOWNER, PRODUCER, DEVELOPER, etc.). This helps identify who is responsible for maintaining and troubleshooting the job.
Data jobs can be tagged and associated with glossary terms:
- Tags (`globalTags` aspect): used for categorization, classification, or operational purposes (e.g., PII, critical, deprecated)
- Glossary terms (`glossaryTerms` aspect): link jobs to business terminology and concepts from your glossary

Data jobs can be organized into:

- Domains (`domains` aspect): business domains or data domains for organizational structure
- Applications (`applications` aspect): associations with specific applications or systems
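A minimal sketch of assigning a job to a domain by emitting the domains aspect directly; the server address and domain URN are hypothetical:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DomainsClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed local instance

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_dag,prod),transform_customer_data)",
        aspect=DomainsClass(domains=["urn:li:domain:marketing"]),  # hypothetical domain URN
    )
)
```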
The simplest way to create a data job is using the Python SDK v2:
<details>
<summary>Python SDK: Create a basic data job</summary>

{{ inline /metadata-ingestion/examples/library/datajob_create_basic.py show_path_as_comment }}

</details>
Common metadata can be added to data jobs to enhance discoverability and governance:
<details>
<summary>Python SDK: Add tags, terms, and ownership to a data job</summary>

{{ inline /metadata-ingestion/examples/library/datajob_add_tags_terms_ownership.py show_path_as_comment }}

</details>
You can update job properties like descriptions using the low-level APIs:
<details>
<summary>Python SDK: Update data job description</summary>

{{ inline /metadata-ingestion/examples/library/datajob_update_description.py show_path_as_comment }}

</details>
Retrieve data job information via the REST API:
<details>
<summary>REST API: Query a data job</summary>

{{ inline /metadata-ingestion/examples/library/datajob_query_rest.py show_path_as_comment }}

</details>
Data jobs are often used to define lineage relationships, as the following examples show:
<details>
<summary>Python SDK: Add lineage using DataJobPatchBuilder</summary>

{{ inline /metadata-ingestion/examples/library/datajob_add_lineage_patch.py show_path_as_comment }}

</details>

<details>
<summary>Python SDK: Emit fine-grained lineage through a data job</summary>

{{ inline /metadata-ingestion/examples/library/lineage_emitter_datajob_finegrained.py show_path_as_comment }}

</details>
Every data job belongs to exactly one dataFlow entity, which represents the parent pipeline or workflow. The data flow captures the orchestrator, the flow ID, and the cluster where the pipeline runs.
This hierarchical relationship allows DataHub to organize jobs within their workflows and understand the execution context.
Data jobs establish lineage by defining their input datasets, output datasets, and upstream data jobs through the dataJobInputOutput aspect.
These relationships are the foundation of DataHub's lineage graph. When a job processes data, it creates a connection between upstream sources and downstream outputs, enabling impact analysis and data discovery.
While dataJob represents the definition of a processing task, dataProcessInstance represents a specific execution or run of that job. Process instances capture the status, timing, and outcome (success or failure) of each run.
This separation allows you to track both the static definition of a job and its dynamic runtime behavior.
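A minimal sketch of reporting a run by emitting dataProcessInstanceRunEvent aspects; the server address, process instance URN, and timestamps are illustrative:

```python
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DataProcessInstanceRunEventClass,
    DataProcessInstanceRunResultClass,
    DataProcessRunStatusClass,
    RunResultTypeClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed local instance
run_urn = "urn:li:dataProcessInstance:abc123"  # hypothetical run identifier

# Record the start of the run
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=run_urn,
        aspect=DataProcessInstanceRunEventClass(
            timestampMillis=int(time.time() * 1000),
            status=DataProcessRunStatusClass.STARTED,
        ),
    )
)

# ... the job executes ...

# Record completion with a success result
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=run_urn,
        aspect=DataProcessInstanceRunEventClass(
            timestampMillis=int(time.time() * 1000),
            status=DataProcessRunStatusClass.COMPLETE,
            result=DataProcessInstanceRunResultClass(
                type=RunResultTypeClass.SUCCESS,
                nativeResultType="airflow",  # source-system-specific result label
            ),
        ),
    )
)
```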
The DataHub GraphQL API provides rich query capabilities for data jobs, including fetching a job by URN, searching across jobs, and traversing lineage.
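For example, a minimal sketch that posts a GraphQL query with requests; the endpoint URL and token are assumptions for a default local deployment:

```python
import requests

GRAPHQL_URL = "http://localhost:8080/api/graphql"  # assumed local endpoint

query = """
query getDataJob($urn: String!) {
  dataJob(urn: $urn) {
    urn
    properties { name description }
    dataFlow { urn }
  }
}
"""
variables = {
    "urn": "urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_dag,prod),transform_customer_data)"
}

resp = requests.post(
    GRAPHQL_URL,
    json={"query": query, "variables": variables},
    headers={"Authorization": "Bearer <personal-access-token>"},  # if auth is enabled
)
print(resp.json())
```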
Data jobs are commonly ingested via connectors for orchestration and transformation systems such as Airflow, dbt, Spark, and Databricks.
These connectors automatically extract job definitions, lineage, and metadata from the source systems.
DataHub's own ingestion pipelines are represented as data jobs with special aspects such as datahubIngestionRunSummary and datahubIngestionCheckpoint.
These aspects are specific to DataHub's internal ingestion framework and are not used for general-purpose data jobs.
The status field in dataJobInfo is deprecated in favor of the dataProcessInstance model. Instead of storing job status on the job definition itself, create separate process instance entities for each execution with their own status information. This provides a cleaner separation between job definitions and runtime execution history.
The subTypes aspect allows you to classify jobs into finer-grained categories. This helps with filtering and organizing jobs in the UI and API queries.
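A minimal sketch of setting a sub-type; the server address is assumed and the type name "ETL Task" is illustrative, not a reserved value:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import SubTypesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed local instance

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_dag,prod),transform_customer_data)",
        aspect=SubTypesClass(typeNames=["ETL Task"]),  # illustrative sub-type
    )
)
```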