The `hive-metastore` module ingests metadata from Hive Metastore into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.
The Hive Metastore connector supports two connection modes: SQL mode (the default) and Thrift mode. Choose a mode based on your environment:

| Feature | SQL Mode (default) | Thrift Mode |
|---|---|---|
| Use when | Direct database access available | Only HMS Thrift API accessible |
| Authentication | Database credentials | Kerberos/SASL or unauthenticated |
| Port | Database port (3306 for MySQL, 5432 for PostgreSQL) | Thrift port (9083) |
| Dependencies | Database drivers | pymetastore, thrift-sasl |

Requirements:

- Database Access: Direct read access to the Hive metastore database (MySQL or PostgreSQL)
- Network Access: Access to metastore database on configured port
- Database Driver: Install the appropriate Python driver:

  ```bash
  # For PostgreSQL metastore
  pip install 'acryl-datahub[hive]' psycopg2-binary

  # For MySQL metastore
  pip install 'acryl-datahub[hive]' PyMySQL
  ```

- Metastore Schema: Typically `public` (PostgreSQL) or database name (MySQL)

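
The network access requirement can be checked before running ingestion. The following is a minimal sketch using only the Python standard library; the helper name `can_reach` is hypothetical and not part of DataHub, and a successful TCP connection only shows that the port is reachable, not that the credentials work.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (replace with your own metastore database endpoint):
# can_reach("metastore-db.company.com", 5432)
```

Use port 3306 for a MySQL-backed metastore, 5432 for PostgreSQL, or 9083 when checking the Thrift endpoint.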
The database user account used by DataHub needs read-only access to the Hive metastore tables.

PostgreSQL:

```sql
-- Create a dedicated read-only user for DataHub
CREATE USER datahub_user WITH PASSWORD 'secure_password';

-- Grant connection privileges
GRANT CONNECT ON DATABASE metastore TO datahub_user;

-- Grant schema usage
GRANT USAGE ON SCHEMA public TO datahub_user;

-- Grant SELECT on metastore tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datahub_user;

-- Grant SELECT on future tables (for metastore upgrades).
-- Applies only to tables created by the role that runs this statement.
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datahub_user;
```

MySQL:

```sql
-- Create a dedicated read-only user for DataHub
CREATE USER 'datahub_user'@'%' IDENTIFIED BY 'secure_password';

-- Grant SELECT privileges on metastore database
GRANT SELECT ON metastore.* TO 'datahub_user'@'%';

-- Apply changes
FLUSH PRIVILEGES;
```

DataHub queries the following metastore tables:
| Table | Purpose |
|---|---|
| DBS | Database/schema information |
| TBLS | Table metadata |
| TABLE_PARAMS | Table properties (including view definitions) |
| SDS | Storage descriptor (location, format) |
| COLUMNS_V2 | Column metadata |
| PARTITION_KEYS | Partition information |
| SERDES | Serialization/deserialization information |

Recommendation: Grant SELECT on all metastore tables to ensure compatibility with different Hive versions and for future DataHub enhancements.
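
To confirm that the grants cover the tables listed above, one option is to probe each table with a trivial query. The sketch below is illustrative only: `unreadable_tables` is a hypothetical helper, it accepts any DB-API 2.0 connection (for example one opened with psycopg2 or PyMySQL), and it assumes the metastore tables use the uppercase names shown in the table. The `quote` argument is the identifier quote character, `"` for PostgreSQL and a backtick for MySQL. An in-memory SQLite database stands in for the metastore so the example runs anywhere.

```python
import sqlite3

# Tables listed in this document; other Hive versions may involve more.
METASTORE_TABLES = [
    "DBS", "TBLS", "TABLE_PARAMS", "SDS",
    "COLUMNS_V2", "PARTITION_KEYS", "SERDES",
]


def unreadable_tables(conn, tables=METASTORE_TABLES, quote='"'):
    """Return the tables that the given connection cannot SELECT from."""
    failed = []
    for table in tables:
        cur = conn.cursor()
        try:
            cur.execute(f"SELECT 1 FROM {quote}{table}{quote} LIMIT 1")
            cur.fetchall()
        except Exception:
            # Reset the transaction so later probes still run (needed on PostgreSQL).
            conn.rollback()
            failed.append(table)
        finally:
            cur.close()
    return failed


# Stand-in database containing only two of the tables.
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE "DBS" (DB_ID INTEGER)')
demo.execute('CREATE TABLE "TBLS" (TBL_ID INTEGER)')
demo.commit()
print(unreadable_tables(demo, tables=["DBS", "TBLS", "SDS"]))  # ['SDS']
```

An empty list means every probed table is readable by the connected user.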

Standard Connection:

```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"
```

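
The `scheme` value follows the SQLAlchemy `dialect+driver` naming. As a rough illustration of how recipe fields like these map onto a SQLAlchemy-style connection URL, here is a small sketch. The helper `sqlalchemy_url` is hypothetical, and the connector's actual URL construction may differ; the point is only that special characters in credentials need URL-encoding when such a URL is assembled by hand.

```python
from urllib.parse import quote_plus


def sqlalchemy_url(scheme, username, password, host_port, database):
    """Compose a SQLAlchemy-style URL from recipe-like fields (illustrative only)."""
    # URL-encode credentials so characters such as '@' or '/' do not break parsing.
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"{scheme}://{user}:{secret}@{host_port}/{database}"


print(
    sqlalchemy_url(
        "postgresql+psycopg2",
        "datahub_user",
        "p@ss/word",  # made-up password for demonstration
        "metastore-db.company.com:5432",
        "metastore",
    )
)
# postgresql+psycopg2://datahub_user:p%40ss%2Fword@metastore-db.company.com:5432/metastore
```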

SSL Connection:

```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"
    options:
      connect_args:
        sslmode: require
        sslrootcert: /path/to/ca-cert.pem
```

Standard Connection:

```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:3306
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "mysql+pymysql" # Default if not specified
```

SSL Connection:

```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:3306
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "mysql+pymysql"
    options:
      connect_args:
        ssl:
          ca: /path/to/ca-cert.pem
          cert: /path/to/client-cert.pem
          key: /path/to/client-key.pem
```

For AWS RDS-hosted metastore databases:

```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore.abc123.us-east-1.rds.amazonaws.com:5432
    database: metastore
    username: datahub_user
    password: ${RDS_PASSWORD}
    scheme: "postgresql+psycopg2" # or "mysql+pymysql" (port 3306, MySQL-style ssl connect_args)
    options:
      connect_args:
        sslmode: require # RDS may enforce SSL, for example when rds.force_ssl is enabled
```

For Azure Database for PostgreSQL-hosted metastore databases:

```yaml
source:
  type: hive-metastore
  config:
    host_port: metastore-server.postgres.database.azure.com:5432
    database: metastore
    username: datahub_user@metastore-server # Note: Azure Single Server expects the @server-name suffix
    password: ${AZURE_DB_PASSWORD}
    scheme: "postgresql+psycopg2"
    options:
      connect_args:
        sslmode: require
```
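
The recipes above read passwords from environment variables through `${NAME}` placeholders. A quick pre-flight check can catch a variable that was never exported. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, and it only recognizes the plain `${NAME}` form used in this document.

```python
import os
import re

# Matches ${NAME} placeholders such as ${METASTORE_PASSWORD}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(recipe_text, env=None):
    """Return names referenced as ${NAME} in recipe_text that are unset in env."""
    env = os.environ if env is None else env
    names = _PLACEHOLDER.findall(recipe_text)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return [name for name in dict.fromkeys(names) if name not in env]


recipe = "username: datahub_user\npassword: ${METASTORE_PASSWORD}\n"
print(missing_env_vars(recipe, env={}))  # ['METASTORE_PASSWORD']
```

Passing the contents of a recipe file and leaving `env` unset checks against the current process environment.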