metadata-ingestion/docs/sources/hive/hive_post.md
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
DataHub can extract lineage between Hive tables and their underlying storage locations (S3, Azure Blob, HDFS, GCS, etc.). This feature creates relationships showing data flow from raw storage to Hive tables.
Enable storage lineage with minimal configuration:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10000
    username: datahub_user
    password: ${HIVE_PASSWORD}

    # Enable storage lineage
    emit_storage_lineage: true
```
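With the default values of the other settings, this emits upstream lineage (storage → Hive) and includes column-level lineage from storage paths to Hive columns.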
Storage lineage behavior is controlled by four parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `emit_storage_lineage` | boolean | `false` | Master toggle to enable/disable storage lineage |
| `hive_storage_lineage_direction` | string | `"upstream"` | Direction: `"upstream"` (storage → Hive) or `"downstream"` (Hive → storage) |
| `include_column_lineage` | boolean | `true` | Enable column-level lineage from storage paths to Hive columns |
| `storage_platform_instance` | string | `None` | Platform instance for storage URNs (e.g., `"prod-s3"`, `"dev-hdfs"`) |
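To make the direction setting concrete, the following minimal Python sketch shows which side of a lineage edge each value places the storage location on. The helper `lineage_edge` and the example identifiers are hypothetical and exist only for illustration; this is not the connector's implementation.

```python
from typing import Tuple


def lineage_edge(
    hive_urn: str, storage_urn: str, direction: str = "upstream"
) -> Tuple[str, str]:
    """Return (upstream, downstream) for one Hive table and its storage location.

    Hypothetical helper that mirrors the documented meaning of
    hive_storage_lineage_direction; it is not part of DataHub.
    """
    if direction == "upstream":
        # storage -> Hive: the storage location feeds the Hive table
        return storage_urn, hive_urn
    if direction == "downstream":
        # Hive -> storage: the Hive table feeds the storage location
        return hive_urn, storage_urn
    raise ValueError(
        f"direction must be 'upstream' or 'downstream', got {direction!r}"
    )


# Illustrative identifiers only (assumptions, not real URNs)
hive_table = "hive:sales.orders"
storage_path = "s3://example-bucket/sales/orders"

print(lineage_edge(hive_table, storage_path, "upstream"))
print(lineage_edge(hive_table, storage_path, "downstream"))
```

Any value other than the two documented directions raises an error in this sketch, which reflects the two-valued setting described in the table.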
The connector automatically detects and creates lineage for:
- `s3://`, `s3a://`, `s3n://`
- `hdfs://`
- `gs://`
- `wasb://`, `wasbs://`
- `adl://`
- `abfs://`, `abfss://`
- `dbfs://`
- `file://` or absolute paths

When ingesting from multiple Hive environments (e.g., production, staging, development), use `platform_instance` to distinguish them:
```yaml
source:
  type: hive
  config:
    host_port: prod-hive.company.com:10000
    platform_instance: "prod-hive"
```
This creates URNs like:
```
urn:li:dataset:(urn:li:dataPlatform:hive,prod-hive.database.table,PROD)
```
Best Practice: Combine with `storage_platform_instance` for complete environment isolation:

```yaml
source:
  type: hive
  config:
    platform_instance: "prod-hive" # Hive environment
    storage_platform_instance: "prod-s3" # Storage environment
    emit_storage_lineage: true
```
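As a rough illustration of how a table's storage location and `storage_platform_instance` fit together, the sketch below classifies a location by its URI scheme and attaches the configured instance. The `SCHEME_TO_PLATFORM` mapping, the platform names in it, and the function `describe_location` are assumptions made for this example; the connector's own detection logic and platform names may differ.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Assumed scheme -> platform mapping, for illustration only.
SCHEME_TO_PLATFORM = {
    "s3": "s3", "s3a": "s3", "s3n": "s3",
    "hdfs": "hdfs",
    "gs": "gcs",
    "wasb": "abs", "wasbs": "abs",
    "adl": "abs",
    "abfs": "abs", "abfss": "abs",
    "dbfs": "dbfs",
    "file": "file",
}


def describe_location(
    location: str, storage_platform_instance: Optional[str] = None
) -> Tuple[str, Optional[str], str]:
    """Return (platform, platform_instance, path) for a storage location.

    Hypothetical helper; not part of DataHub.
    """
    parsed = urlparse(location)
    if not parsed.scheme:
        if location.startswith("/"):
            # No scheme but an absolute path: treat as local file system
            return "file", storage_platform_instance, location
        raise ValueError(f"cannot classify location: {location!r}")
    platform = SCHEME_TO_PLATFORM.get(parsed.scheme)
    if platform is None:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    # Bucket/container/host (if any) followed by the object path
    return platform, storage_platform_instance, f"{parsed.netloc}{parsed.path}"


print(describe_location("s3a://example-bucket/warehouse/orders", "prod-s3"))
print(describe_location("/data/warehouse/orders"))
```

Keeping the instance separate from the platform is what lets two environments that both use, say, S3 produce distinct storage identifiers.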
For Hive clusters with thousands of tables, consider:
1. Database Filtering: Limit ingestion to specific databases:

   ```yaml
   database: "production_db" # Only ingest one database
   ```

2. Incremental Ingestion: Use DataHub's stateful ingestion to only process changes:

   ```yaml
   stateful_ingestion:
     enabled: true
     remove_stale_metadata: true
   ```

3. Disable Column Lineage: If not needed, disable to improve performance:

   ```yaml
   emit_storage_lineage: true
   include_column_lineage: false # Faster ingestion
   ```

4. Connection Pooling: The connector uses a single connection by default. For better performance with large schemas, ensure your HiveServer2 has sufficient resources.
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
- Only tables with a defined storage location (`LOCATION` clause) will have storage lineage.
- Session Timeout: Long-running ingestion may hit HiveServer2 session timeouts. Configure `hive.server2.session.timeout` appropriately on the Hive side.
- Large Schemas: Tables with 1000+ columns may be slow to ingest due to schema extraction overhead.
- Case Sensitivity:
- View Lineage Parsing: Complex views using non-standard SQL may not have complete lineage extracted.