# BigQuery Prerequisites

## Overview

The `bigquery` module ingests metadata from BigQuery into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.

## Prerequisites

Familiarize yourself with the BigQuery ingestion architecture. Two key concepts:

- **Extractor Project**: the project containing the service account used to run metadata extraction queries.
- **BigQuery Projects**: the projects from which DataHub collects metadata (tables, lineage, usage, profiling). By default this includes the extractor project; set `project_ids` to list target projects explicitly, as in the sketch below.
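
For example, a minimal recipe sketch; the project IDs are placeholders, and `project_ids` is the config field referenced above:

```yml
source:
  type: bigquery
  config:
    # Hypothetical project IDs -- replace with your own. If project_ids is
    # omitted, the source falls back to the projects the service account
    # can list, including the extractor project.
    project_ids:
      - my-analytics-project
      - my-warehouse-project
```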

### Create a datahub profile in GCP

1. Create a custom role for DataHub following the BigQuery docs.
2. Grant permissions to this role on the extractor project and on all target projects (see below).
#### Basic Requirements (needed for metadata ingestion)

1. Grant the following permissions on the Extractor Project:

| Permission | Description |
| ---------- | ----------- |
| `bigquery.jobs.create` | Run jobs (e.g. queries) within the project. Only needed on the extractor project, where the service account belongs. |
| `bigquery.jobs.list` | Manage the queries that the service account has sent. Only needed on the extractor project, where the service account belongs. |
| `bigquery.readsessions.create` | Create a session for streaming large results. Only needed on the extractor project, where the service account belongs. |
| `bigquery.readsessions.getData` | Get data from the read session. Only needed on the extractor project, where the service account belongs. |

2. Grant the following permissions on all target projects for metadata extraction:

:::info

These permissions must be granted on every project you want to extract metadata from.

:::

| Permission | Description | Capability | Default GCP Role Which Contains This Permission |
| ---------- | ----------- | ---------- | ----------------------------------------------- |
| `bigquery.datasets.get` | Retrieve metadata about a dataset. | Table Metadata Extraction | roles/bigquery.metadataViewer |
| `bigquery.datasets.getIamPolicy` | Read a dataset's IAM permissions. | Table Metadata Extraction | roles/bigquery.metadataViewer |
| `bigquery.tables.list` | List BigQuery tables. | Table Metadata Extraction | roles/bigquery.metadataViewer |
| `bigquery.tables.get` | Retrieve metadata for a table. | Table Metadata Extraction | roles/bigquery.metadataViewer |
| `bigquery.routines.get` | Get routines. Needed to retrieve table metadata from system tables. | Table Metadata Extraction | roles/bigquery.metadataViewer |
| `bigquery.routines.list` | List routines. Needed to retrieve table metadata from system tables. | Table Metadata Extraction | roles/bigquery.metadataViewer |
| `resourcemanager.projects.get` | Get project metadata. | Table Metadata Extraction | roles/bigquery.metadataViewer |
| `resourcemanager.projects.list` | Search projects. Needed if `project_ids` is not set. | Table Metadata Extraction | roles/bigquery.metadataViewer |
| `bigquery.jobs.listAll` | List all jobs (queries) submitted by any user. Needed for lineage extraction. | Lineage Extraction/Usage Extraction | roles/bigquery.resourceViewer |
| `logging.logEntries.list` | Fetch log entries for lineage/usage data. Not required if `use_exported_bigquery_audit_metadata` is enabled. | Lineage Extraction/Usage Extraction | roles/logging.privateLogViewer |
| `logging.privateLogEntries.list` | Fetch log entries for lineage/usage data. Not required if `use_exported_bigquery_audit_metadata` is enabled. | Lineage Extraction/Usage Extraction | roles/logging.privateLogViewer |
| `bigquery.tables.getData` | Access table data to extract storage size, last-updated time, partition information, data profiles, etc. Required when profiling is enabled or when `use_tables_list_query_v2` is enabled; needed to query BigQuery's `__TABLES__` pseudo-table. | Profiling/Enhanced Table Metadata | |
| `datacatalog.policyTags.get` | *Optional* Get policy tags for columns with associated policy tags. Required only if `extract_policy_tags_from_catalog` is enabled. | Policy Tag Extraction | roles/datacatalog.viewer |

:::warning Important: `bigquery.tables.getData` Permission

The `bigquery.tables.getData` permission is required in the following scenarios:

- When profiling is enabled (`profiling.enabled: true`)
- When `use_tables_list_query_v2` is enabled (for enhanced table metadata extraction)

Without this permission, you'll encounter errors when the connector tries to access BigQuery's `__TABLES__` pseudo-table for detailed table information, including partition data, row counts, and storage metrics.

:::
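
One way to apply the permission tables above is a single custom-role definition file, created once and then granted on the extractor project and each target project. The sketch below uses GCP's standard YAML format for `gcloud iam roles create --file`; the role name is hypothetical, and you should trim `includedPermissions` to the capabilities you actually use (e.g. drop `bigquery.tables.getData` if you never enable profiling):

```yml
# Hypothetical role.yaml, used as:
#   gcloud iam roles create datahub_extractor --project=<PROJECT_ID> --file=role.yaml
title: DataHub Extractor
description: Custom role for DataHub BigQuery metadata ingestion
stage: GA
includedPermissions:
  # Extractor project only
  - bigquery.jobs.create
  - bigquery.jobs.list
  - bigquery.readsessions.create
  - bigquery.readsessions.getData
  # Target projects: table metadata extraction
  - bigquery.datasets.get
  - bigquery.datasets.getIamPolicy
  - bigquery.tables.list
  - bigquery.tables.get
  - bigquery.routines.get
  - bigquery.routines.list
  - resourcemanager.projects.get
  - resourcemanager.projects.list
  # Lineage/usage extraction
  - bigquery.jobs.listAll
  - logging.logEntries.list
  - logging.privateLogEntries.list
  # Profiling / enhanced table metadata
  - bigquery.tables.getData
```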

### Create a service account in the Extractor Project

1. Create a service account following the BigQuery docs and assign the custom role created above.
2. Download a service account JSON key file. Example credential file:

```json
{
  "type": "service_account",
  "project_id": "project-id-1234567",
  "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----",
  "client_email": "[email protected]",
  "client_id": "113545814931671546333",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%suppproject-id-1234567.iam.gserviceaccount.com"
}
3. To provide credentials to the source, you can either:

    Set an environment variable:

    ```sh
    $ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
    ```

    or

    Set the `credential` config in your source based on the credential JSON file. For example:

    ```yml
    credential:
      project_id: project-id-1234567
      private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
      private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
      client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
      client_id: "123456678890"
    ```
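Putting the pieces together, a minimal end-to-end recipe sketch (placeholder values carried over from the examples above; the `datahub-rest` sink and its `server` address are assumptions about your deployment):

```yml
source:
  type: bigquery
  config:
    project_ids:
      - project-id-1234567
    credential:
      project_id: project-id-1234567
      private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
      private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
      client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
      client_id: "123456678890"

# Assumed sink: adjust to wherever your DataHub instance runs.
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```
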
### Profiling Requirements

For external tables backed by Google Drive:

Grant "Viewer" access to the service account's email (client_email from credentials JSON) on the Google Drive documents:

  1. Find the source document: BigQuery Console → Table → Details → "Source" field
  2. Share the document: Open document → Share → Add service account email with "Viewer" access
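
If you enable profiling, the flags discussed in the warning above live under the same source config; a sketch:

```yml
source:
  type: bigquery
  config:
    # Both settings below require bigquery.tables.getData (see the warning above).
    profiling:
      enabled: true
    use_tables_list_query_v2: true
```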