The glue module ingests metadata from Glue into DataHub. It is intended for production ingestion workflows and module-specific capabilities are documented below.
This plugin extracts the following:
Before running ingestion, ensure network connectivity to the source, valid authentication credentials, and read permissions for metadata APIs required by this module.
For ingesting datasets, the following IAM permissions are required:
{
  "Effect": "Allow",
  "Action": [
    "glue:GetDatabases",
    "glue:GetTables"
  ],
  "Resource": [
    "arn:aws:glue:$region-id:$account-id:catalog",
    "arn:aws:glue:$region-id:$account-id:database/*",
    "arn:aws:glue:$region-id:$account-id:table/*"
  ]
}
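As a hedged illustration, the statement above can be generated for a concrete region and account rather than edited by hand. The helper name `dataset_policy_statement` and the sample region and account ID are assumptions made for this sketch; they are not part of the connector.

```python
import json


def dataset_policy_statement(region: str, account_id: str) -> dict:
    """Build the dataset-ingestion IAM statement for one region and account."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Effect": "Allow",
        "Action": ["glue:GetDatabases", "glue:GetTables"],
        "Resource": [
            f"{prefix}:catalog",
            f"{prefix}:database/*",
            f"{prefix}:table/*",
        ],
    }


if __name__ == "__main__":
    # Hypothetical region and account ID, for illustration only.
    statement = dataset_policy_statement("us-east-1", "123456789012")
    print(json.dumps(statement, indent=2))
```

The printed JSON can be pasted into the `Statement` array of an IAM policy document.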
For ingesting jobs (extract_transforms: True), the following additional permissions are required:
{
  "Effect": "Allow",
  "Action": [
    "glue:GetDataflowGraph",
    "glue:GetJobs",
    "glue:GetConnection",
    "s3:GetObject"
  ],
  "Resource": "*"
}
The glue:GetConnection permission is required when Glue jobs reference named connections (e.g. JDBC connections configured in the Glue console). If your jobs only use inline connection parameters, this permission is not needed.
For profiling datasets, the following additional permissions are required:
{
  "Effect": "Allow",
  "Action": [
    "glue:GetPartitions"
  ],
  "Resource": "*"
}
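The three permission sets above depend on which features a recipe enables. The following minimal sketch collects the required actions for a given combination; the function name `required_actions` and its `uses_named_connections` flag are hypothetical names for this example, not connector options.

```python
from typing import List


def required_actions(
    extract_transforms: bool = False,
    profiling: bool = False,
    uses_named_connections: bool = True,
) -> List[str]:
    """Return the IAM actions needed for the selected ingestion features."""
    # Dataset ingestion always needs these two actions.
    actions = {"glue:GetDatabases", "glue:GetTables"}
    if extract_transforms:
        actions |= {"glue:GetDataflowGraph", "glue:GetJobs", "s3:GetObject"}
        # Only needed when jobs reference named connections.
        if uses_named_connections:
            actions.add("glue:GetConnection")
    if profiling:
        actions.add("glue:GetPartitions")
    return sorted(actions)


if __name__ == "__main__":
    for action in required_actions(extract_transforms=True, profiling=True):
        print(action)
```

Comparing this list against the policy attached to the ingestion role is one way to spot a missing permission before a run fails.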
The Glue connector supports cross-account access via AWS STS AssumeRole. This allows DataHub running in one AWS account to ingest Glue metadata from a catalog in a different AWS account.
Setup steps:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::SOURCE-ACCOUNT-ID:role/DataHubExecutionRole"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "your-unique-external-id"
        }
      }
    }
  ]
}
Set aws_config.aws_role to the target role ARN.
Simple ARN format:
source:
  type: glue
  config:
    aws_config:
      aws_role: "arn:aws:iam::TARGET-ACCOUNT-ID:role/DataHubGlueReadRole"
With External ID (recommended for security):
source:
  type: glue
  config:
    aws_config:
      aws_role:
        RoleArn: "arn:aws:iam::TARGET-ACCOUNT-ID:role/DataHubGlueReadRole"
        ExternalId: "your-unique-external-id"
Role chaining (assume multiple roles in sequence):
source:
  type: glue
  config:
    aws_config:
      aws_role:
        - "arn:aws:iam::INTERMEDIARY-ACCOUNT-ID:role/IntermediateRole"
        - RoleArn: "arn:aws:iam::TARGET-ACCOUNT-ID:role/DataHubGlueReadRole"
          ExternalId: "your-unique-external-id"
The connector uses boto3's assume_role, so additional parameters like RoleSessionName, DurationSeconds, and Policy are also supported.
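To make the three `aws_role` shapes concrete, here is a rough sketch of how a plain ARN, a mapping, or a list of either could be turned into successive `assume_role` calls. It is not the connector's actual implementation: the helper names, the injected client factory, and the default session name `datahub-glue` are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Union

RoleSpec = Union[str, Dict[str, Any]]


def to_assume_role_kwargs(role: RoleSpec, session_name: str = "datahub-glue") -> Dict[str, Any]:
    """Normalize a recipe entry (plain ARN or mapping) into assume_role kwargs."""
    kwargs: Dict[str, Any] = {"RoleArn": role} if isinstance(role, str) else dict(role)
    # sts:AssumeRole requires RoleSessionName; this default is an assumption.
    kwargs.setdefault("RoleSessionName", session_name)
    return kwargs


def assume_role_chain(
    roles: Union[RoleSpec, List[RoleSpec]],
    make_sts_client: Callable[[Optional[Dict[str, Any]]], Any],
) -> Optional[Dict[str, Any]]:
    """Assume each role in order, using the previous role's credentials."""
    credentials = None  # None means "use the ambient credentials"
    for role in roles if isinstance(roles, list) else [roles]:
        client = make_sts_client(credentials)
        response = client.assume_role(**to_assume_role_kwargs(role))
        credentials = response["Credentials"]
    return credentials


def boto3_sts_client(credentials: Optional[Dict[str, Any]]) -> Any:
    """STS client factory backed by boto3 (imported lazily; requires boto3)."""
    import boto3

    if credentials is None:
        return boto3.client("sts")
    return boto3.client(
        "sts",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )
```

With boto3 installed and valid credentials, `assume_role_chain(roles, boto3_sts_client)` would return the temporary credentials of the last role in the chain. Passing the factory as an argument keeps the chaining logic testable without AWS access.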
Cross-account catalog access:
For accessing a specific Glue catalog in another account (without assuming a role), use the catalog_id parameter:
source:
  type: glue
  config:
    catalog_id: "123456789012"  # Target account's AWS account ID
This is useful when Account A has shared its Glue catalog with Account B. If you're running ingestion from Account B and want to access Account A's catalog, specify Account A's ID in catalog_id.
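At the API level, the Glue operations accept an optional `CatalogId` argument, which is what makes this kind of access possible. The sketch below lists tables from an explicitly chosen catalog using a boto3-style Glue client; the function name `iter_tables` is hypothetical, and the code is an illustration rather than the connector's own logic.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


def iter_tables(glue_client: Any, catalog_id: Optional[str] = None) -> Iterator[Tuple[str, str]]:
    """Yield (database, table) names, optionally from another account's catalog."""
    # CatalogId is omitted when unset, so Glue falls back to the caller's own account.
    extra: Dict[str, str] = {"CatalogId": catalog_id} if catalog_id else {}
    for db_page in glue_client.get_paginator("get_databases").paginate(**extra):
        for database in db_page["DatabaseList"]:
            pages = glue_client.get_paginator("get_tables").paginate(
                DatabaseName=database["Name"], **extra
            )
            for table_page in pages:
                for table in table_page["TableList"]:
                    yield database["Name"], table["Name"]
```

With boto3 installed, `iter_tables(boto3.client("glue"), "123456789012")` would walk Account A's shared catalog from Account B, provided the catalog has been shared and the caller holds the permissions listed earlier.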
Platform instance considerations:
If multiple recipes ingest the same tables with the same platform_instance, DataHub recognizes them as the same entities and creates a single dataset.
Using different platform_instance values creates separate dataset entities with distinct URNs, useful for tracking the same data through different access paths.
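The effect of `platform_instance` on identity can be sketched by building dataset URNs by hand. This example assumes the common DataHub convention of prefixing the dataset name with the platform instance; the helper `glue_dataset_urn` and the instance names are hypothetical, and the exact URN layout should be checked against your DataHub version.

```python
from typing import Optional


def glue_dataset_urn(
    database: str,
    table: str,
    platform_instance: Optional[str] = None,
    env: str = "PROD",
) -> str:
    """Build a dataset URN, prefixing the name with the platform instance if set."""
    name = f"{database}.{table}"
    if platform_instance:
        # Assumed convention: "<platform_instance>.<database>.<table>".
        name = f"{platform_instance}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:glue,{name},{env})"


if __name__ == "__main__":
    # Same instance from two recipes: identical URNs, so a single dataset.
    print(glue_dataset_urn("sales", "orders", "account_a"))
    print(glue_dataset_urn("sales", "orders", "account_a"))
    # Different instance: a distinct URN, so a separate dataset entity.
    print(glue_dataset_urn("sales", "orders", "account_b"))
```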