The excel module ingests metadata from Excel workbooks into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.
Supported file types are as follows:
- *.xlsx
- *.xlsm

The connector attempts to identify which cells contain table data. A table is defined as a header row, which is used to derive the column names, followed by data rows. The schema is inferred from the data types present in each column.
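The idea of deriving column names from a header row and column types from the data rows can be sketched as follows. This is a minimal illustration, not the connector's actual implementation: the function names `infer_column_type` and `infer_schema`, the coarse type labels, and the rule that mixed columns fall back to `string` are all assumptions made for the example.

```python
from datetime import date, datetime
from typing import Any


def infer_column_type(values: list[Any]) -> str:
    """Return a coarse type label for one column, ignoring empty cells."""
    kinds = set()
    for value in values:
        if value is None or value == "":
            continue
        # bool is a subclass of int, so it has to be checked first.
        if isinstance(value, bool):
            kinds.add("boolean")
        elif isinstance(value, (int, float)):
            kinds.add("number")
        elif isinstance(value, (date, datetime)):
            kinds.add("date")
        else:
            kinds.add("string")
    if not kinds:
        return "null"
    # Assumption for this sketch: mixed columns are reported as strings.
    return kinds.pop() if len(kinds) == 1 else "string"


def infer_schema(rows: list[list[Any]]) -> dict[str, str]:
    """Treat rows[0] as the header row and all remaining rows as data."""
    header, *data = rows
    schema = {}
    for index, name in enumerate(header):
        column = [row[index] if index < len(row) else None for row in data]
        schema[str(name)] = infer_column_type(column)
    return schema


table = [
    ["id", "name", "signup_date", "active"],
    [1, "Ada", date(2024, 1, 5), True],
    [2, "Grace", date(2024, 2, 9), False],
    [3, None, date(2024, 3, 1), True],
]
schema = infer_schema(table)
print(schema)
```

Empty cells are skipped so that a sparsely populated column still gets the type of its non-empty values.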
Rows directly above or directly below the table in which only the first two columns have values are assumed to contain metadata. Any such rows are converted to custom properties, with the first column as the key and the second column as the value. The workbook's standard and custom properties are also imported as dataset custom properties.
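The metadata-row rule can be illustrated with a short sketch. It assumes the sheet is already available as a list of rows and that the table's first and last row indices are known. The helper names `is_metadata_row` and `collect_custom_properties` are hypothetical, and reading "directly above or below" as a contiguous run of rows adjacent to the table is an assumption; the connector's real logic may differ.

```python
from typing import Any


def _filled(cell: Any) -> bool:
    return cell is not None and cell != ""


def is_metadata_row(row: list[Any]) -> bool:
    """True when only the first two cells of the row hold values."""
    return (
        len(row) >= 2
        and _filled(row[0])
        and _filled(row[1])
        and not any(_filled(cell) for cell in row[2:])
    )


def collect_custom_properties(
    sheet: list[list[Any]], first: int, last: int
) -> dict[str, str]:
    """first and last are the indices of the header row and the final data row."""
    found = []
    # Walk upward from the row just above the header.
    i = first - 1
    while i >= 0 and is_metadata_row(sheet[i]):
        found.insert(0, sheet[i])
        i -= 1
    # Walk downward from the row just below the last data row.
    i = last + 1
    while i < len(sheet) and is_metadata_row(sheet[i]):
        found.append(sheet[i])
        i += 1
    # First column becomes the key, second column the value.
    return {str(key): str(value) for key, value, *_ in found}


sheet = [
    ["Owner", "Finance", None, None],
    ["Refreshed", "2024-03-01", None, None],
    ["id", "name", "amount", "currency"],
    [1, "A", 10.5, "USD"],
    [2, "B", 7.0, "EUR"],
    ["Source", "ERP export", None, None],
]
properties = collect_custom_properties(sheet, first=2, last=4)
print(properties)
```

Only rows outside the table range are inspected, so a two-column table does not have its own data rows mistaken for metadata.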
When configuring an S3 ingestion source to access files in an S3 bucket, the AWS account referenced in your ingestion recipe must have appropriate S3 permissions. Create a policy with the minimum required permissions by following these steps:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
Permissions Explanation:
- s3:ListBucket: Allows listing the objects in the bucket. This permission is necessary for the S3 ingestion source to know which objects are available to read.
- s3:GetBucketLocation: Allows retrieving the location of the bucket.
- s3:GetObject: Allows reading the actual content of the objects in the bucket. This is needed to infer schema from sample files.

Link Policy to Identity: Associate your newly created policy with the appropriate IAM user or role that will be used by the S3 ingestion process.
Set Up S3 Data Source: When configuring your S3 ingestion source, specify the IAM user or role to which you attached the policy in the previous step.
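When the same policy is needed for several buckets, the policy document shown above can be generated rather than edited by hand. The sketch below uses only the Python standard library; the function name `build_s3_read_policy` is hypothetical, and the bucket name is the same placeholder used in the policy above.

```python
import json


def build_s3_read_policy(bucket: str) -> dict:
    """Build the minimal read-only policy for a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                ],
                # The bucket ARN and the ARN pattern for the objects inside it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


policy = build_s3_read_policy("your-bucket-name")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor when creating the policy.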
To access files on Azure Blob Storage, you will need the following:
Azure Storage Account: A storage account that provides a unique namespace for your data in Azure.
Authentication Credentials: One of these supported authentication methods:
Container: A blob container that organizes your blobs (similar to a directory in a file system).
Access Permissions: Appropriate authorization for the authentication method: