metadata-ingestion/docs/sources/s3/s3_pre.md
The s3 module ingests metadata about AWS S3 datasets into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.
The connector can map either an individual file or a folder of files to a single dataset in DataHub. See the Path Specs section for details.
:::tip
This connector can also be used to ingest local files.
Just replace s3:// in your path_specs with an absolute path to files on the machine running ingestion.
:::
Grant necessary S3 permissions to an IAM user or role:
1. Create an IAM Policy
Grant read access to the S3 bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
Permissions:
- s3:ListBucket: List objects in the bucket
- s3:GetBucketLocation: Retrieve bucket location
- s3:GetObject: Read object content (required for schema inference)

2. Attach the Policy
Attach the policy to the IAM user or role used by the S3 ingestion source.
3. Configure the Source
Use the IAM user/role credentials in your S3 ingestion recipe.
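When the same read-only policy is needed for several buckets, it can help to generate the policy document rather than edit JSON by hand. The following is a minimal Python sketch using only the standard library; `build_s3_read_policy` is a hypothetical helper name, not part of DataHub or any AWS SDK, and the optional `Sid` from the policy above is omitted.

```python
import json


def build_s3_read_policy(bucket_name: str) -> dict:
    """Return an IAM policy document with the three read actions listed above."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
                # Bucket-level actions match the bucket ARN;
                # s3:GetObject matches the object ARN (bucket ARN plus "/*").
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the IAM console or saved to a file.
    print(json.dumps(build_s3_read_policy("your-bucket-name"), indent=2))
```

Both resource entries are needed because the bucket-level and object-level actions are evaluated against different ARNs.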
The S3 connector supports cross-account access via AWS STS AssumeRole. This allows DataHub running in one AWS account to ingest S3 metadata from buckets in a different AWS account.
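Setup steps:
In the target account (where the S3 data lives), create an IAM role with S3 read permissions and a trust policy that allows the role in the source account (where DataHub runs) to assume it: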
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::SOURCE-ACCOUNT-ID:role/DataHubExecutionRole"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "your-unique-external-id"
        }
      }
    }
  ]
}
In your recipe, configure aws_config.aws_role with the target role ARN:
Simple ARN format:
source:
  type: s3
  config:
    aws_config:
      aws_role: "arn:aws:iam::TARGET-ACCOUNT-ID:role/DataHubS3ReadRole"
    path_specs:
      - include: "s3://target-account-bucket/**"
With External ID (recommended for security):
source:
  type: s3
  config:
    aws_config:
      aws_role:
        RoleArn: "arn:aws:iam::TARGET-ACCOUNT-ID:role/DataHubS3ReadRole"
        ExternalId: "your-unique-external-id"
    path_specs:
      - include: "s3://target-account-bucket/**"
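The ExternalId in the recipe is only accepted if it equals the sts:ExternalId condition in the target role's trust policy. As a rough illustration, the Python sketch below builds such a trust policy and checks a recipe's aws_role mapping against it. `build_trust_policy` and `external_id_matches` are hypothetical helpers, not DataHub or boto3 functions, and the check is deliberately simplified: it looks only at a StringEquals condition with a single string value and ignores principals and Deny statements.

```python
def build_trust_policy(source_role_arn: str, external_id: str) -> dict:
    """Trust policy letting source_role_arn assume the role when it supplies external_id."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": source_role_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


def external_id_matches(trust_policy: dict, aws_role: dict) -> bool:
    """Simplified check: does the recipe's ExternalId satisfy any AssumeRole statement?"""
    supplied = aws_role.get("ExternalId")
    for statement in trust_policy["Statement"]:
        if statement.get("Action") != "sts:AssumeRole":
            continue
        required = (
            statement.get("Condition", {})
            .get("StringEquals", {})
            .get("sts:ExternalId")
        )
        # A statement without an ExternalId condition accepts any caller value.
        if required is None or required == supplied:
            return True
    return False
```

A mismatch here corresponds to an access-denied error from STS at ingestion time, so a check like this is mainly useful for catching copy-paste errors before a run.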
Role chaining (assume multiple roles in sequence):
source:
  type: s3
  config:
    aws_config:
      aws_role:
        - "arn:aws:iam::INTERMEDIARY-ACCOUNT-ID:role/IntermediateRole"
        - RoleArn: "arn:aws:iam::TARGET-ACCOUNT-ID:role/DataHubS3ReadRole"
          ExternalId: "your-unique-external-id"
    path_specs:
      - include: "s3://target-account-bucket/**"
The connector uses boto3's assume_role, so additional parameters like RoleSessionName, DurationSeconds, and Policy are also supported.
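To make the three aws_role shapes above more concrete, here is a hedged Python sketch of how a string, a mapping, or a list of either could be turned into successive boto3 `assume_role` calls. This is not DataHub's actual implementation. The helper names (`normalize_roles`, `default_sts_factory`, `assume_role_chain`) and the default session name are assumptions made for illustration; only `boto3.client("sts", ...)` and `assume_role` with its `RoleArn`, `RoleSessionName`, `ExternalId`, `DurationSeconds`, and `Policy` parameters are real boto3 API.

```python
from typing import Any, Callable, Dict, List, Optional, Union

RoleEntry = Union[str, Dict[str, Any]]


def normalize_roles(aws_role: Union[RoleEntry, List[RoleEntry]]) -> List[Dict[str, Any]]:
    """Turn a string, a mapping, or a list of either into assume_role keyword arguments."""
    entries = aws_role if isinstance(aws_role, list) else [aws_role]
    normalized = []
    for entry in entries:
        kwargs = {"RoleArn": entry} if isinstance(entry, str) else dict(entry)
        # STS requires RoleSessionName; this default is a made-up placeholder.
        kwargs.setdefault("RoleSessionName", "datahub-s3-sketch")
        normalized.append(kwargs)
    return normalized


def default_sts_factory(credentials: Optional[Dict[str, Any]]):
    """Create an STS client from ambient credentials or from a previous hop's credentials."""
    import boto3  # imported lazily so the pure helpers work without boto3 installed

    if credentials is None:
        return boto3.client("sts")
    return boto3.client(
        "sts",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )


def assume_role_chain(
    aws_role: Union[RoleEntry, List[RoleEntry]],
    sts_factory: Callable[[Optional[Dict[str, Any]]], Any] = default_sts_factory,
) -> Optional[Dict[str, Any]]:
    """Assume each role in order, using the credentials returned by the previous hop."""
    credentials: Optional[Dict[str, Any]] = None
    for kwargs in normalize_roles(aws_role):
        response = sts_factory(credentials).assume_role(**kwargs)
        credentials = response["Credentials"]
    return credentials
```

Extra keys such as DurationSeconds or Policy in a mapping entry pass straight through to `assume_role` in this sketch, which is why they need no special handling. The `sts_factory` parameter exists only so the chain can be exercised with a stub client instead of real AWS credentials.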