metadata-ingestion-modules/airflow-plugin/README.md
See the DataHub Airflow docs for details.
The plugin supports Apache Airflow versions 2.7+ and 3.1+.
| Airflow Version | Extra to Install | Status | Notes |
|---|---|---|---|
| 2.7-2.10 | [airflow2] | ✅ Fully Supported | |
| 3.0.x | [airflow3] | ⚠️ Requires manual fix | Needs pydantic>=2.11.8 upgrade |
| 3.1+ | [airflow3] | ✅ Fully Supported | |
Note on Airflow 3.0.x: Airflow 3.0.6 pins pydantic==2.11.7, which contains a bug that prevents the DataHub plugin from importing correctly. This issue is resolved in Airflow 3.1.0+, which uses pydantic>=2.11.8. If you must use Airflow 3.0.6, you can manually upgrade pydantic to >=2.11.8, though this may conflict with Airflow's dependency constraints. We recommend upgrading to Airflow 3.1.0 or later.
Related issue: https://github.com/pydantic/pydantic/issues/10963
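As a quick illustration, the compatibility table above can be encoded as a small lookup. This is only a sketch: `recommended_extra` is a hypothetical helper, not part of the plugin, and it follows the table rows literally, so versions the table does not list (for example anything below 2.7) are rejected.

```python
def recommended_extra(airflow_version: str) -> tuple[str, bool]:
    """Map an Airflow version ("major.minor[.patch]") to (extra, needs_pydantic_fix).

    Hypothetical helper that mirrors the compatibility table; it is not
    shipped with acryl-datahub-airflow-plugin.
    """
    major, minor = (int(part) for part in airflow_version.split(".")[:2])
    if major == 2 and 7 <= minor <= 10:
        return "airflow2", False
    if major == 3 and minor == 0:
        # 3.0.x needs pydantic>=2.11.8 installed manually.
        return "airflow3", True
    if major == 3 and minor >= 1:
        return "airflow3", False
    raise ValueError(
        f"Airflow {airflow_version} is not covered by the compatibility table"
    )


if __name__ == "__main__":
    extra, needs_fix = recommended_extra("3.0.6")
    print(f"pip install 'acryl-datahub-airflow-plugin[{extra}]'")
    if needs_fix:
        print("Also upgrade pydantic to >=2.11.8")
```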
The installation command varies depending on your Airflow version due to different OpenLineage dependencies.
pip install 'acryl-datahub-airflow-plugin[airflow2]'
This installs the plugin with Legacy OpenLineage (openlineage-airflow>=1.2.0), which is required for Airflow 2.x lineage extraction.
If your Airflow 2.7+ environment rejects the Legacy OpenLineage package (e.g., due to dependency conflicts), you can use the native OpenLineage provider instead:
# Install the native Airflow provider first
pip install 'apache-airflow-providers-openlineage>=1.0.0'
# Then install the DataHub plugin without OpenLineage extras
pip install acryl-datahub-airflow-plugin
The plugin will automatically detect and use apache-airflow-providers-openlineage when available, providing the same functionality.
pip install 'acryl-datahub-airflow-plugin[airflow3]'
This installs the plugin with apache-airflow-providers-openlineage>=1.0.0, which is the native OpenLineage provider for Airflow 3.x.
Note: If using Airflow 3.0.x (3.0.6 specifically), you'll need to manually upgrade pydantic:
pip install 'acryl-datahub-airflow-plugin[airflow3]' 'pydantic>=2.11.8'
We recommend using Airflow 3.1.0+, which resolves this issue. See the Version Compatibility section above for details.
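To confirm that an environment already has a pydantic release with the fix, a check along the following lines may help. It is a minimal sketch using only the standard library; the helper names are hypothetical, and pre-release suffixes are ignored for simplicity.

```python
import re
from importlib import metadata
from typing import Optional

# Minimum pydantic release named in the note above.
MIN_PYDANTIC = (2, 11, 8)


def parse_release(version: str) -> tuple:
    """Extract the leading numeric release segment of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def pydantic_is_fixed(version: str) -> bool:
    """Return True if the given pydantic version is at least 2.11.8."""
    return parse_release(version) >= MIN_PYDANTIC


def installed_pydantic_is_fixed() -> Optional[bool]:
    """Check the installed pydantic; None means pydantic is not installed."""
    try:
        return pydantic_is_fixed(metadata.version("pydantic"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("pydantic has the fix:", installed_pydantic_is_fixed())
```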
When you install without any extras:
pip install acryl-datahub-airflow-plugin
You get:
- acryl-datahub[sql-parser,datahub-rest] - DataHub SDK with SQL parsing and REST emitter
- pydantic>=2.4.0 - Required for data validation
- apache-airflow>=2.5.0,<4.0.0 - Airflow itself

[airflow2] Extra

pip install 'acryl-datahub-airflow-plugin[airflow2]'
Adds:
- openlineage-airflow>=1.2.0 - Standalone OpenLineage package for Airflow 2.x

[airflow3] Extra

pip install 'acryl-datahub-airflow-plugin[airflow3]'
Adds:
- apache-airflow-providers-openlineage>=1.0.0 - Native OpenLineage provider for Airflow 3.x

You can combine multiple extras if needed:
# For Airflow 3.x with Kafka emitter support
pip install 'acryl-datahub-airflow-plugin[airflow3,datahub-kafka]'
# For Airflow 2.x with file emitter support
pip install 'acryl-datahub-airflow-plugin[airflow2,datahub-file]'
Available extras:
- airflow2: OpenLineage support for Airflow 2.x (adds openlineage-airflow>=1.2.0)
- airflow3: OpenLineage support for Airflow 3.x (adds apache-airflow-providers-openlineage>=1.0.0)
- datahub-kafka: Kafka-based metadata emission (adds acryl-datahub[datahub-kafka])
- datahub-file: File-based metadata emission (adds acryl-datahub[sync-file-emitter]), useful for testing

Airflow 2.x and 3.x have different OpenLineage integrations:
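- Airflow 2.x (default): Legacy OpenLineage (openlineage-airflow package)
- Airflow 2.x (alternative): OpenLineage Provider (apache-airflow-providers-openlineage)
- Airflow 3.x: OpenLineage Provider (apache-airflow-providers-openlineage)

The plugin automatically detects which OpenLineage variant is installed and uses it accordingly. This means: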
- With an extra ([airflow2] or [airflow3]): The appropriate OpenLineage dependency is installed automatically

This flexibility allows you to adapt to different Airflow environments and dependency constraints.
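To see which OpenLineage variant is present in an environment, a check like the one below may be useful. This is a sketch, not the plugin's actual detection logic: the helper names are hypothetical, and the choice to prefer the provider when both packages are installed is an assumption. Note that `find_spec` on a dotted name imports the parent package, so running this where Airflow is installed will import `airflow`.

```python
from importlib.util import find_spec
from typing import Callable, Optional

# Import paths of the two OpenLineage integrations.
LEGACY_MODULE = "openlineage.airflow"  # openlineage-airflow
PROVIDER_MODULE = "airflow.providers.openlineage"  # apache-airflow-providers-openlineage


def has_module(name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or has no usable spec.
        return False


def detect_openlineage_variant(
    is_available: Callable[[str], bool] = has_module,
) -> Optional[str]:
    """Return "provider", "legacy", or None.

    Assumption: the provider takes precedence when both are installed.
    """
    if is_available(PROVIDER_MODULE):
        return "provider"
    if is_available(LEGACY_MODULE):
        return "legacy"
    return None


if __name__ == "__main__":
    print("Detected OpenLineage variant:", detect_openlineage_variant())
```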
The plugin can be configured via airflow.cfg under the [datahub] section. Below are the key configuration options:
When enable_extractors=True (default), the DataHub plugin enhances OpenLineage extractors to provide better lineage. You can fine-tune these enhancements:
[datahub]
# Enable/disable all OpenLineage extractors
enable_extractors = True # Default: True
# Fine-grained control over DataHub's OpenLineage enhancements
# --- SQL Parsing Configuration ---
# Enable multi-statement SQL parsing (resolves temp tables, merges lineage)
enable_multi_statement_sql_parsing = False # Default: False
# --- Patches (work with both Legacy OpenLineage and OpenLineage Provider) ---
# Patch SqlExtractor to use DataHub's advanced SQL parser (enables column-level lineage)
patch_sql_parser = True # Default: True
# Patch SnowflakeExtractor to fix default schema detection
patch_snowflake_schema = True # Default: True
# --- Custom Extractors (only apply to Legacy OpenLineage) ---
# Use DataHub's custom AthenaOperatorExtractor (better Athena lineage)
extract_athena_operator = True # Default: True
# Use DataHub's custom BigQueryInsertJobOperatorExtractor (handles BQ job configuration)
extract_bigquery_insert_job_operator = True # Default: True
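The defaults listed above can be reproduced in a standalone script to check how a given [datahub] section would resolve. The sketch below uses Python's `configparser` directly; it is not how Airflow or the plugin load configuration, and `load_datahub_flags` is a hypothetical helper. Inline `#` comments are enabled explicitly here because `configparser` does not strip them by default.

```python
import configparser

# Defaults as documented in the options above.
DEFAULTS = {
    "enable_extractors": "True",
    "enable_multi_statement_sql_parsing": "False",
    "patch_sql_parser": "True",
    "patch_snowflake_schema": "True",
    "extract_athena_operator": "True",
    "extract_bigquery_insert_job_operator": "True",
}


def load_datahub_flags(cfg_text: str) -> dict:
    """Resolve the [datahub] boolean options from config text, applying defaults."""
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_dict({"datahub": DEFAULTS})  # seed documented defaults
    parser.read_string(cfg_text)  # user values override defaults
    return {key: parser.getboolean("datahub", key) for key in DEFAULTS}


if __name__ == "__main__":
    sample = """
[datahub]
enable_extractors = True
patch_sql_parser = False  # Use OpenLineage's native SQL parser
"""
    for name, value in load_datahub_flags(sample).items():
        print(f"{name} = {value}")
```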
Multi-Statement SQL Parsing:
When enable_multi_statement_sql_parsing=True, if a task executes multiple SQL statements (e.g., CREATE TEMP TABLE ...; INSERT ... FROM temp_table;), DataHub parses all statements together and resolves temporary table dependencies within that task. By default (False), only the first statement is parsed.
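The difference between the two modes can be illustrated with a toy model. The sketch below is not DataHub's parser: `StatementLineage`, `merge_statements`, and `first_statement_only` are hypothetical names, and each statement's inputs, outputs, and temporary tables are supplied by hand rather than parsed from SQL.

```python
from dataclasses import dataclass, field


@dataclass
class StatementLineage:
    """Tables read and written by one SQL statement (hand-written, not parsed)."""

    inputs: set
    outputs: set
    temp_tables: set = field(default_factory=set)  # tables created as TEMP


def merge_statements(statements: list) -> tuple:
    """Combine lineage across statements, dropping task-local temp tables."""
    temp = set().union(*(s.temp_tables for s in statements))
    inputs = set().union(*(s.inputs for s in statements)) - temp
    outputs = set().union(*(s.outputs for s in statements)) - temp
    return inputs, outputs


def first_statement_only(statements: list) -> tuple:
    """Model the default behavior: only the first statement contributes."""
    first = statements[0]
    return set(first.inputs), set(first.outputs)


if __name__ == "__main__":
    # CREATE TEMP TABLE temp_table AS SELECT ... FROM db.source;
    # INSERT INTO db.final SELECT ... FROM temp_table;
    task = [
        StatementLineage({"db.source"}, {"temp_table"}, {"temp_table"}),
        StatementLineage({"temp_table"}, {"db.final"}),
    ]
    print("merged:", merge_statements(task))
    print("first only:", first_statement_only(task))
```

In this model the merged result links db.source directly to db.final, while the first-statement-only result stops at the temporary table.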
How it works:
Patches (apply to both Legacy OpenLineage and OpenLineage Provider):
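- patch_sql_parser=True:
  - Legacy OpenLineage: patches the SqlExtractor.extract() method
  - OpenLineage Provider: patches the SQLParser.generate_openlineage_metadata_from_sql() method
- patch_snowflake_schema=True:
  - Patches the SnowflakeExtractor.default_schema property

Custom Extractors/Operator Patches:

- extract_athena_operator:
  - Legacy OpenLineage: uses DataHub's custom AthenaOperatorExtractor
  - OpenLineage Provider: patches AthenaOperator.get_openlineage_facets_on_complete()
- extract_bigquery_insert_job_operator:
  - Legacy OpenLineage: uses DataHub's custom BigQueryInsertJobOperatorExtractor
  - OpenLineage Provider: patches BigQueryInsertJobOperator.get_openlineage_facets_on_complete()

Example use cases: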
Disable DataHub's SQL parser to use OpenLineage's native parsing:
[datahub]
enable_extractors = True
patch_sql_parser = False # Use OpenLineage's native SQL parser
patch_snowflake_schema = True # Still fix Snowflake schema detection
Disable custom Athena extractor (only relevant for Legacy OpenLineage):
[datahub]
enable_extractors = True
extract_athena_operator = False # Use OpenLineage's default Athena extractor
For a complete list of configuration options, see the DataHub Airflow documentation.
See the developing docs.