metadata-ingestion/docs/sources/hive-metastore/hive-metastore_post.md
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
Use `connection_type: thrift` when you cannot access the metastore database directly but do have access to the HMS Thrift API (typically port 9083). This is common in managed Hadoop services and locked-down environments where the backing database is not exposed.
Before using Thrift mode, ensure the HMS Thrift port is reachable from the ingestion host and, for secured clusters, that valid Kerberos credentials are available.
Verify connectivity:

```shell
# Test network connectivity to HMS
telnet hms.company.com 9083

# For Kerberos environments, verify you hold a valid ticket
klist
```
```shell
# Install with Thrift support
pip install 'acryl-datahub[hive-metastore]'

# For Kerberos authentication, also install:
pip install thrift-sasl 'pyhive[hive-pure-sasl]'
```
| Option | Type | Default | Required | Description |
|---|---|---|---|---|
| `connection_type` | string | `sql` | Yes (for Thrift) | Set to `thrift` to enable Thrift mode |
| `host_port` | string | - | Yes | HMS host and port (e.g., `hms.company.com:9083`) |
| `use_kerberos` | boolean | `false` | No | Enable Kerberos/SASL authentication |
| `kerberos_service_name` | string | `hive` | No | Kerberos service principal name |
| `kerberos_hostname_override` | string | - | No | Override hostname for the Kerberos principal (for load balancers) |
| `kerberos_qop` | string | `auth` | No | Kerberos Quality of Protection: `auth`, `auth-int`, or `auth-conf` (see below) |
| `timeout_seconds` | int | 60 | No | Connection timeout in seconds |
| `max_retries` | int | 3 | No | Maximum retry attempts for transient failures |
| `catalog_name` | string | - | No | HMS 3.x catalog name (e.g., `spark_catalog`) |
| `include_catalog_name_in_ids` | boolean | `false` | No | Include the catalog name in dataset URNs |
| `database_pattern` | AllowDeny | - | No | Filter databases by regex pattern |
| `table_pattern` | AllowDeny | - | No | Filter tables by regex pattern |
Note: The SQL WHERE clause options (`tables_where_clause_suffix`, `views_where_clause_suffix`, `schemas_where_clause_suffix`) have been deprecated for security reasons (SQL injection risk) and are no longer supported. Use `database_pattern` and `table_pattern` instead.
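For example, a filter that was previously expressed as a WHERE clause suffix can usually be rewritten as regex patterns (the database and table names below are illustrative):

```yaml
source:
  type: hive-metastore
  config:
    # Replaces e.g. a suffix like "AND tbl_name NOT LIKE 'tmp_%'"
    table_pattern:
      deny:
        - "^tmp_.*"
    database_pattern:
      allow:
        - "^analytics$"
```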
```yaml
source:
  type: hive-metastore
  config:
    connection_type: thrift
    host_port: hms.company.com:9083
```
Ensure you have a valid Kerberos ticket (`kinit -kt /path/to/keytab user@REALM`) before running ingestion:
```yaml
source:
  type: hive-metastore
  config:
    connection_type: thrift
    host_port: hms.company.com:9083
    use_kerberos: true
    kerberos_service_name: hive # Change if HMS uses a different principal
    # kerberos_hostname_override: hms-internal.company.com # If using a load balancer
    # catalog_name: spark_catalog # For HMS 3.x multi-catalog
    # kerberos_qop: auth-conf # Authentication + integrity + encryption
    database_pattern:
      allow:
        - "^prod_.*"
```
If your Hive Metastore is configured with `hadoop.rpc.protection` set to `integrity` or `privacy`, you must configure the matching QOP level:
| `hadoop.rpc.protection` | `kerberos_qop` | Description |
|---|---|---|
| `authentication` | `auth` | Authentication only (default) |
| `integrity` | `auth-int` | Authentication + integrity checking |
| `privacy` | `auth-conf` | Authentication + integrity + encryption |
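For instance, on a cluster where `hadoop.rpc.protection` is set to `privacy`, the recipe pairs Kerberos with the matching QOP level (the hostname below is illustrative):

```yaml
source:
  type: hive-metastore
  config:
    connection_type: thrift
    host_port: hms.company.com:9083
    use_kerberos: true
    kerberos_qop: auth-conf # Must match hadoop.rpc.protection: privacy
```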
The Hive Metastore connector supports the same storage lineage features as the Hive connector, with enhanced performance due to direct database access.
Enable storage lineage with minimal configuration:
```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"
    # Enable storage lineage
    emit_storage_lineage: true
```
Storage lineage is controlled by the same parameters as the Hive connector:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `emit_storage_lineage` | boolean | `false` | Master toggle to enable/disable storage lineage |
| `hive_storage_lineage_direction` | string | `"upstream"` | Direction: `"upstream"` (storage → Hive) or `"downstream"` (Hive → storage) |
| `include_column_lineage` | boolean | `true` | Enable column-level lineage from storage paths to Hive columns |
| `storage_platform_instance` | string | None | Platform instance for storage URNs (e.g., `"prod-s3"`, `"dev-hdfs"`) |
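A sketch combining all four parameters (connection details elided; the instance name is illustrative):

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    emit_storage_lineage: true
    hive_storage_lineage_direction: upstream
    include_column_lineage: true
    storage_platform_instance: "prod-s3"
```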
All storage platforms supported by the Hive connector are also supported here:
- Amazon S3 (`s3://`, `s3a://`, `s3n://`)
- HDFS (`hdfs://`)
- Google Cloud Storage (`gs://`)
- Azure Blob Storage (`wasb://`, `wasbs://`)
- Azure Data Lake Storage (`adl://`, `abfs://`, `abfss://`)
- Databricks DBFS (`dbfs://`)
- Local filesystem (`file://`)

See the sections above for complete configuration details.
A key advantage of the Hive Metastore connector is its ability to extract metadata from Presto and Trino views that are stored in the metastore.
1. **View Detection**: The connector identifies views by checking the `TABLE_PARAMS` table for Presto/Trino view definitions.
2. **View Parsing**: The Presto/Trino view JSON is parsed to extract the view's SQL definition and schema.
3. **Lineage Extraction**: The SQL is parsed using sqlglot to create table-to-view lineage.
4. **Storage Lineage Integration**: If storage lineage is enabled, the connector also creates lineage from storage → tables → views.
Presto/Trino view support is automatically enabled when ingesting from a metastore that contains Presto/Trino views. No additional configuration is required.
```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"
    # Enable storage lineage for complete lineage chain
    emit_storage_lineage: true
```
This configuration will create the complete lineage chain:

`S3 Bucket → Hive Table → Presto View`
For large metastore deployments with many databases, use filtering to limit ingestion scope:
```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    # Only ingest from specific databases
    schema_pattern:
      allow:
        - "^production_.*" # All databases starting with production_
        - "analytics" # Specific database
      deny:
        - ".*_test$" # Exclude test databases
```
For filtering by database name, use pattern-based filtering:
```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    # Filter to specific databases using regex patterns
    database_pattern:
      allow:
        - "^production_db$"
        - "^analytics_db$"
      deny:
        - "^test_.*"
        - ".*_staging$"
```
Note: The deprecated `*_where_clause_suffix` options have been removed for security reasons. Use `database_pattern` and `table_pattern` for filtering.
The Hive Metastore connector is significantly faster than the Hive connector because it reads metadata in bulk directly from the metastore database instead of issuing per-table calls through HiveServer2.
Performance Comparison (approximate):
**Database Connection Pooling**: The connector uses SQLAlchemy's default connection pooling. For very large deployments, consider tuning the pool size:
```yaml
options:
  pool_size: 10
  max_overflow: 20
```
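In a full recipe, the `options` block sits under `config`; it is passed through to SQLAlchemy engine creation (a sketch, assuming standard SQLAlchemy pool parameters; hostnames are illustrative):

```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    scheme: "postgresql+psycopg2"
    # Forwarded to the SQLAlchemy engine (pool_size/max_overflow are
    # standard SQLAlchemy pooling arguments)
    options:
      pool_size: 10
      max_overflow: 20
```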
**Schema Filtering**: Use `schema_pattern` to limit scope and reduce query time.
**Stateful Ingestion**: Enable it so that only changes are processed:

```yaml
stateful_ingestion:
  enabled: true
  remove_stale_metadata: true
```
**Disable Column Lineage**: If column-level storage lineage is not needed:

```yaml
emit_storage_lineage: true
include_column_lineage: false # Faster
```
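Putting these tuning options together, a performance-oriented recipe might look like this (a sketch; connection details and patterns are illustrative):

```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    scheme: "postgresql+psycopg2"
    # Limit scope to what you actually need
    database_pattern:
      allow:
        - "^prod_.*"
    # Only process changes between runs
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    # Table-level storage lineage only
    emit_storage_lineage: true
    include_column_lineage: false
    # Larger SQLAlchemy connection pool for big metastores
    options:
      pool_size: 10
      max_overflow: 20
```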
When ingesting from multiple metastores (e.g., different clusters or environments), use `platform_instance`:
```yaml
source:
  type: hive-metastore
  config:
    host_port: prod-metastore-db.company.com:5432
    database: metastore
    platform_instance: "prod-hive"
```
**Best Practice**: Combine with `storage_platform_instance`:
```yaml
source:
  type: hive-metastore
  config:
    platform_instance: "prod-hive" # Hive tables
    storage_platform_instance: "prod-hdfs" # Storage locations
    emit_storage_lineage: true
```
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
Same as the Hive connector:
- **Large Column Lists**: Tables with 500+ columns may be slow to process due to metastore query complexity.
- **View Definition Encoding**: Some older Hive versions store view definitions in non-UTF-8 encoding, which may cause parsing issues.
- **Case Sensitivity**: The Hive metastore stores database and table names in lowercase, so use lowercase names in `database_pattern` and `table_pattern`.
- **Concurrent Metastore Writes**: If the metastore is being actively modified during ingestion, some metadata may be inconsistent.
**Problem**: `Could not connect to metastore database`

**Solutions**:
- Verify `host_port`, `database`, and `scheme` are correct
- Test network connectivity: `telnet <host> <port>`
- For PostgreSQL, confirm `pg_hba.conf` allows connections from your IP
- For MySQL, check the `bind-address` setting in `my.cnf`

**Problem**: `Authentication failed` or `Access denied`

**Solutions**:
- Verify the username and password, and that the user has read access to the metastore database
- For Azure-hosted databases, include the `@server-name` suffix in the username

**Problem**: Not all tables appear in DataHub
**Solutions**:
- Check your `schema_pattern`, `database_pattern`, and `table_pattern` filters
- Verify the tables exist in the metastore:

```sql
SELECT d.name as db_name, t.tbl_name as table_name, t.tbl_type
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
WHERE d.name = 'your_database';
```
**Problem**: Views defined in Presto/Trino don't show up

**Solutions**:
- Verify the views are actually stored in this metastore:

```sql
SELECT d.name as db_name, t.tbl_name as view_name, tp.param_value
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
JOIN TABLE_PARAMS tp ON t.tbl_id = tp.tbl_id
WHERE t.tbl_type = 'VIRTUAL_VIEW'
  AND tp.param_key = 'presto_view'
LIMIT 10;
```
**Problem**: No storage lineage relationships visible

**Solutions**:
- Confirm `emit_storage_lineage: true` is set
- Verify tables have storage locations recorded in the metastore:

```sql
SELECT d.name as db_name, t.tbl_name as table_name, s.location
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
JOIN SDS s ON t.sd_id = s.sd_id
WHERE s.location IS NOT NULL
LIMIT 10;
```
**Problem**: Ingestion takes too long

**Solutions**:
- Limit scope with `database_pattern` and `table_pattern`
- Enable stateful ingestion so only changes are processed between runs
- Set `include_column_lineage: false` if column-level storage lineage is not needed
- Tune the SQLAlchemy connection pool via `options`