Gcs Pre

metadata-ingestion/docs/sources/gcs/gcs_pre.md

Overview

The gcs module ingests metadata from Google Cloud Storage (GCS) into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.

This connector ingests Google Cloud Storage datasets into DataHub. It can map an individual file or a folder of files to a dataset in DataHub. To specify the group of files that form a dataset, use the path_specs configuration in the ingestion recipe. This source relies on the interoperability of GCS with S3 and uses the DataHub S3 Data Lake integration source under the hood. Refer to the Path Specs section of the S3 connector documentation for more details.
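As a rough illustration of the file-versus-folder mapping described above, the sketch below builds the source portion of a recipe as a Python dictionary. The bucket name and object paths are hypothetical placeholders, and the `{table}` token follows the S3 connector's path spec convention, so treat this as a sketch under those assumptions rather than a definitive recipe.

```python
# Hypothetical bucket and paths, used for illustration only.

# One entry that maps a single file to a dataset.
single_file_spec = {"include": "gs://my-example-bucket/reports/sales.csv"}

# One entry that maps a folder of files to a dataset. In the S3 path spec
# convention, {table} marks the folder level that names the dataset.
folder_spec = {"include": "gs://my-example-bucket/events/{table}/*.parquet"}

# Source section of an ingestion recipe, expressed as a plain dictionary.
source_config = {
    "type": "gcs",
    "config": {
        "path_specs": [single_file_spec, folder_spec],
    },
}

for spec in source_config["config"]["path_specs"]:
    print(spec["include"])
```

The same structure can be written as YAML in a recipe file; a complete recipe also needs credentials and a sink, which are not shown here.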

Prerequisites

Before running ingestion, ensure network connectivity to the source, valid authentication credentials, and read permissions for metadata APIs required by this module.

  1. Create a service account with the "Storage Object Viewer" role - https://cloud.google.com/iam/docs/service-accounts-create
  2. Make sure you meet the following requirements to generate an HMAC key - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#before-you-begin
  3. Create an HMAC key for the service account created above - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create
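Once the HMAC key exists, its access ID and secret have to reach the ingestion recipe without being hard-coded. The sketch below is one possible way to do that: it reads both values from environment variables and returns a credential block. The environment variable names and the helper function are hypothetical, and the field names `hmac_access_id` and `hmac_access_secret` are assumed to match the connector's credential configuration; verify them against the connector's config reference.

```python
import os
from typing import Dict, Mapping


def gcs_credential_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build a credential block from environment variables.

    GCS_HMAC_ACCESS_ID and GCS_HMAC_ACCESS_SECRET are hypothetical names
    chosen for this example; use whatever your deployment provides.
    """
    access_id = env.get("GCS_HMAC_ACCESS_ID")
    secret = env.get("GCS_HMAC_ACCESS_SECRET")
    if not access_id or not secret:
        # Fail early rather than starting ingestion with empty credentials.
        raise ValueError("HMAC access ID and secret must both be set")
    # Assumed field names for the connector's credential section.
    return {"hmac_access_id": access_id, "hmac_access_secret": secret}
```

The returned dictionary could be placed under the credential key of the source config when building a recipe programmatically.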