
Dataplex Pre

metadata-ingestion/docs/sources/dataplex/dataplex_pre.md


Overview

The dataplex module ingests metadata from Google Dataplex into DataHub. It is intended for production ingestion workflows; its module-specific capabilities are documented below.

The Dataplex connector extracts metadata from Google Dataplex using the Universal Catalog Entries API. This API extracts entries from system-managed entry groups for Google Cloud services and is the recommended approach for discovering resources across your GCP organization.

Supported services

  • BigQuery: datasets, tables, models, routines, connections, and linked datasets
  • Cloud SQL: instances
  • AlloyDB: instances, databases, schemas, tables, and views
  • Spanner: instances, databases, and tables
  • Pub/Sub: topics and subscriptions
  • Cloud Storage: buckets
  • Bigtable: instances, clusters, and tables
  • Vertex AI: models, datasets, and feature stores
  • Dataform: repositories and workflows
  • Dataproc Metastore: services and databases

:::note
Only BigQuery and Cloud Storage (GCS) have been thoroughly tested with this connector. Other services may work but have not been validated.
:::

Prerequisites

Refer to the Dataplex documentation for an overview of Dataplex concepts.

Authentication

The connector supports Application Default Credentials (ADC). See the GCP documentation for ADC setup.
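
For local development, ADC can be configured with the standard gcloud command (requires the gcloud CLI to be installed):

```sh
# Opens a browser login flow and writes ADC to the well-known location
# (~/.config/gcloud/application_default_credentials.json on Linux/macOS)
gcloud auth application-default login
```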

For service account authentication, follow these instructions:

Create a service account and assign roles

  1. Create a service account following GCP docs and assign the required roles

  2. Download the service account JSON keyfile

    Example credential file:

    ```json
    {
      "type": "service_account",
      "project_id": "project-id-1234567",
      "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0",
      "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----",
      "client_email": "[email protected]",
      "client_id": "113545814931671546333",
      "auth_uri": "https://accounts.google.com/o/oauth2/auth",
      "token_uri": "https://oauth2.googleapis.com/token",
      "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
      "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%40project-id-1234567.iam.gserviceaccount.com"
    }
    ```
  3. To provide credentials to the source, you can either:

    Set an environment variable:

    ```sh
    $ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
    ```

    or

    Set the credential config in your source recipe based on the credential JSON file. For example:

    ```yml
    credential:
      project_id: "project-id-1234567"
      private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
      private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
      client_email: "[email protected]"
      client_id: "123456678890"
    ```

Permissions

Grant the following permissions to the service account on all target projects.

Universal Catalog Entries API:

Default GCP Role: `roles/dataplex.catalogViewer`

| Permission | Description |
| ---------- | ----------- |
| `dataplex.entryGroups.get` | Retrieve specific entry group details |
| `dataplex.entryGroups.list` | View all entry groups in a location |
| `dataplex.entries.get` | Access entry metadata and details |
| `dataplex.entries.getData` | View data aspects within entries |
| `dataplex.entries.list` | Enumerate entries within groups |

Lineage extraction (optional, `include_lineage: true`):

Default GCP Role: `roles/datalineage.viewer`

| Permission | Description |
| ---------- | ----------- |
| `datalineage.links.get` | Allows a user to view lineage links |
| `datalineage.links.search` | Allows a user to search for lineage links |
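
The roles above can be granted with the standard gcloud IAM command. The project ID and service-account email below are placeholders taken from the earlier example credential file:

```sh
# Catalog read access (required)
gcloud projects add-iam-policy-binding project-id-1234567 \
  --member="serviceAccount:[email protected]" \
  --role="roles/dataplex.catalogViewer"

# Lineage read access (only needed when include_lineage: true)
gcloud projects add-iam-policy-binding project-id-1234567 \
  --member="serviceAccount:[email protected]" \
  --role="roles/datalineage.viewer"
```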