metadata-ingestion/docs/sources/hive/hive_pre.md
The Hive module ingests metadata from Hive into DataHub. It is intended for production ingestion workflows, and module-specific capabilities are documented below.

This plugin extracts databases, tables/views, and column schemas from Hive over a HiveServer2 connection. Before running ingestion, make sure the following prerequisites are in place:
- **Network Access**: Access to HiveServer2 on port 10000 (or 10001 for TLS)
- **User Account**: A Hive user with read permissions on the target databases and tables
- **Dependencies**: The Hive plugin, which provides PyHive connectivity:

```shell
pip install 'acryl-datahub[hive]'
```
The Hive user account used by DataHub needs the following permissions:
```sql
-- Grant SELECT on all databases you want to ingest
GRANT SELECT ON DATABASE <database_name> TO USER <datahub_user>;

-- Grant SELECT on tables/views for schema extraction
GRANT SELECT ON TABLE <database_name>.* TO USER <datahub_user>;
```
If you plan to enable storage lineage, the connector needs to read table location information:
```sql
-- SELECT also covers the DESCRIBE access needed to read storage locations
GRANT SELECT ON <database_name>.* TO USER <datahub_user>;
```
The DataHub user does not need INSERT, UPDATE, DELETE, or DROP privileges. You can also use the `database` config parameter to limit scope and reduce the permissions required; see the filtering sketch after the basic authentication recipe below.

The Hive connector supports multiple authentication methods through PyHive. Configure authentication using the recipe parameters described below.
The simplest authentication method uses a username and password:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10000
    username: datahub_user
    password: ${HIVE_PASSWORD} # Use environment variables for sensitive data
```
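As noted in the permissions section, you can restrict what gets ingested. The sketch below extends the basic recipe with the `database` parameter and a `table_pattern` regex filter, plus a `datahub-rest` sink; the database name, deny patterns, and sink address are placeholders to adapt to your environment.

```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10000
    username: datahub_user
    password: ${HIVE_PASSWORD}
    # Only ingest this database (placeholder name)
    database: sales
    # Skip temporary/staging tables (placeholder regexes)
    table_pattern:
      deny:
        - ".*_tmp$"
        - ".*_staging$"

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080 # assumes a locally running DataHub instance
```

Save the recipe to a file and run it with `datahub ingest -c <recipe-file>.yaml`.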
For LDAP-based authentication:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10000
    username: datahub_user
    password: ${LDAP_PASSWORD}
    options:
      connect_args:
        auth: LDAP
```
For Kerberos-secured Hive clusters:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10000
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
```
Requirements:
- A valid Kerberos ticket on the host running ingestion (run `kinit` before running ingestion)
- Kerberos client configuration available (typically `/etc/krb5.conf`, or specified via the `KRB5_CONFIG` environment variable)

For secure connections over HTTPS:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10001
    scheme: "hive+https"
    username: datahub_user
    password: ${HIVE_PASSWORD}
    options:
      connect_args:
        auth: BASIC
```
For Microsoft Azure HDInsight clusters:
```yaml
source:
  type: hive
  config:
    host_port: <cluster_name>.azurehdinsight.net:443
    scheme: "hive+https"
    username: admin
    password: ${HDINSIGHT_PASSWORD}
    options:
      connect_args:
        http_path: "/hive2"
        auth: BASIC
```
For Databricks clusters using the Hive connector:
```yaml
source:
  type: hive
  config:
    host_port: <workspace-url>:443
    scheme: "databricks+pyhive"
    username: token # or your Databricks username
    password: ${DATABRICKS_TOKEN} # Personal access token or password
    options:
      connect_args:
        http_path: "sql/protocolv1/o/xxxyyyzzzaaasa/1234-567890-hello123"
```
Note: For comprehensive Databricks support, consider using the dedicated Databricks Unity Catalog connector instead, which provides enhanced features.
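If you do evaluate that route, a minimal `unity-catalog` recipe looks roughly like the sketch below. Treat it as illustrative only: check the Unity Catalog connector's documentation for the authoritative option names, and replace the workspace URL and token with your own values.

```yaml
source:
  type: unity-catalog
  config:
    workspace_url: https://<workspace-url> # your Databricks workspace URL
    token: ${DATABRICKS_TOKEN} # Databricks personal access token
```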