Dataplex Pre - Datahub

Overview

The dataplex module ingests metadata from Google Cloud Knowledge Catalog (Dataplex) into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.

The connector extracts metadata from Google Cloud Knowledge Catalog (Dataplex) using the Universal Catalog Entries API. This API returns entries from system-managed entry groups for Google Cloud services and is the recommended approach for discovering resources across your GCP organization.

Spanner entry collection behavior

Spanner entries are collected through an additional search_entries workaround after the entry-group traversal phase. Because those entries are not discovered through list_entry_groups, filter_config.entry_groups.pattern does not apply to them. Use entry-level filters (filter_config.entries.pattern and filter_config.entries.fqn_pattern) to control Spanner inclusion.
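The filtering behavior described above can be modeled in a few lines. This is a minimal sketch, not the connector's implementation: `AllowDeny`, `keep_entry`, and the `found_via_search` flag are hypothetical names introduced for illustration, the entry and entry-group values are made up, and matching regexes from the start of the string is an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class AllowDeny:
    """Simplified stand-in for an allow/deny regex filter (not the connector's class)."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, value: str) -> bool:
        # A deny match always wins; otherwise at least one allow regex must match.
        if any(re.match(p, value) for p in self.deny):
            return False
        return any(re.match(p, value) for p in self.allow)


def keep_entry(
    entry_name: str,
    entry_group: str,
    found_via_search: bool,
    entry_groups_filter: AllowDeny,
    entries_filter: AllowDeny,
) -> bool:
    """Decide whether an entry survives filtering in this simplified model."""
    # Entries found by the search workaround skip the entry-group filter entirely.
    if not found_via_search and not entry_groups_filter.allowed(entry_group):
        return False
    # The entry-level filter applies to every entry, however it was found.
    return entries_filter.allowed(entry_name)
```

In this model, denying an entry group has no effect on an entry that arrived through the search path; only an entry-level deny pattern removes it.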

Prerequisites

Refer to Google Cloud Knowledge Catalog (Dataplex) documentation for the basics.

Project Selection

The connector supports three ways to select GCP projects, evaluated in this order of precedence:

project_ids — explicit list of project IDs. When set, this overrides the other two options and no project discovery is performed.
project_labels — list of key:value labels. Projects carrying any of these labels are discovered via the Cloud Resource Manager search_projects API and then filtered through project_id_pattern.
project_id_pattern — AllowDenyPattern of regexes. When project_ids is empty, all projects visible to the credentials are returned via the Cloud Resource Manager search_projects API and filtered through this pattern.

At least one of these must be set. Auto-discovery via project_labels or project_id_pattern requires the service account to have resourcemanager.projects.get (e.g. via roles/browser) on each candidate project so the Cloud Resource Manager search_projects API can return them; no folder/organization-level grant is needed. When project_ids is set explicitly, no Resource Manager permissions are needed.
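The precedence rules above can be sketched as a small function. This is an illustrative model only: `select_projects` and the `visible_projects` mapping (project ID to labels, standing in for what `search_projects` would return) are hypothetical, and the `allow`/`deny` lists assume that `project_id_pattern` behaves as allow and deny regexes matched from the start of the project ID.

```python
import re
from typing import Dict, List, Optional, Sequence


def select_projects(
    visible_projects: Dict[str, Dict[str, str]],
    project_ids: Sequence[str] = (),
    project_labels: Sequence[str] = (),
    allow: Optional[List[str]] = None,
    deny: Optional[List[str]] = None,
) -> List[str]:
    """Return the project IDs to ingest, mimicking the documented precedence."""
    # 1. An explicit list wins and skips discovery altogether.
    if project_ids:
        return list(project_ids)

    if not project_labels and allow is None and deny is None:
        raise ValueError("set project_ids, project_labels, or project_id_pattern")

    # 2. With labels, keep only projects carrying any of the key:value pairs.
    candidates = list(visible_projects)
    if project_labels:
        wanted = {tuple(label.split(":", 1)) for label in project_labels}
        candidates = [
            pid for pid, labels in visible_projects.items()
            if wanted & set(labels.items())
        ]

    # 3. Either way, the ID pattern is the final filter (deny beats allow).
    allow_res = allow if allow is not None else [".*"]
    deny_res = deny or []
    return sorted(
        pid for pid in candidates
        if any(re.match(p, pid) for p in allow_res)
        and not any(re.match(p, pid) for p in deny_res)
    )
```

Note that in this sketch an explicit `project_ids` list is returned untouched, which reflects the statement that it overrides the other two options.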

API Enablement

Enable the following APIs on all target projects:

Dataplex API (dataplex.googleapis.com) — see Enable Knowledge Catalog
Data Lineage API (datalineage.googleapis.com) — required for lineage extraction (include_lineage: true), see Enable Data Lineage API
Cloud Resource Manager API (cloudresourcemanager.googleapis.com) — required for term-asset associations (include_glossary_term_associations: true)
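As a convenience, the list above can be turned into `gcloud services enable` commands per project. The following is a rough sketch under the assumption that only the two flags named above change which APIs are needed; `required_apis` and `enable_commands` are hypothetical helpers, not part of DataHub or gcloud.

```python
from typing import Iterable, List


def required_apis(
    *, include_lineage: bool, include_glossary_term_associations: bool
) -> List[str]:
    """List the API service names needed for the chosen features."""
    # The Dataplex API is needed for core ingestion in every case.
    apis = ["dataplex.googleapis.com"]
    if include_lineage:
        apis.append("datalineage.googleapis.com")
    if include_glossary_term_associations:
        apis.append("cloudresourcemanager.googleapis.com")
    return apis


def enable_commands(
    project_ids: Iterable[str],
    *,
    include_lineage: bool,
    include_glossary_term_associations: bool,
) -> List[str]:
    """Build one `gcloud services enable` command line per target project."""
    services = " ".join(
        required_apis(
            include_lineage=include_lineage,
            include_glossary_term_associations=include_glossary_term_associations,
        )
    )
    return [
        f"gcloud services enable {services} --project={pid}"
        for pid in project_ids
    ]
```

The function only builds the command strings; review them before running anything against a real project.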

Some asset types require additional setup. For example, Cloud SQL instances must be connected to Dataplex to enable automatic metadata harvesting (schemas, tables, and views):

gcloud sql instances patch my-cloud-sql-instance --enable-dataplex-integration --project=my-gcp-project

Authentication

The connector supports Application Default Credentials (ADC). See the GCP documentation for ADC setup.

For service account authentication, follow these instructions:

Create a service account and assign roles

Create a service account following GCP docs and assign the required roles

Download the service account JSON keyfile

Example credential file:

json

{
  "type": "service_account",
  "project_id": "project-id-1234567",
  "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----",
  "client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com",
  "client_id": "113545814931671546333",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%40suppproject-id-1234567.iam.gserviceaccount.com"
}

To provide credentials to the source, you can either:

Set an environment variable:

$ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"

Set credential config in your source based on the credential json file. For example:

yml

credential:
  project_id: "project-id-1234567"
  private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
  private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
  client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
  client_id: "123456678890"
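To avoid copying values by hand, the `credential` fields can be pulled out of the downloaded keyfile programmatically. This is a minimal sketch: `credential_from_keyfile` is a hypothetical helper, and the field list is assumed to be exactly the five keys shown in the example above.

```python
import json
from typing import Dict

# Fields copied from the keyfile into the `credential` block (taken from the example above).
CREDENTIAL_FIELDS = (
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
    "client_id",
)


def credential_from_keyfile(keyfile_text: str) -> Dict[str, str]:
    """Extract the credential fields from the text of a service account keyfile."""
    key = json.loads(keyfile_text)
    missing = [name for name in CREDENTIAL_FIELDS if name not in key]
    if missing:
        raise ValueError(f"keyfile is missing fields: {', '.join(missing)}")
    # Keep only the expected fields; everything else in the keyfile is ignored.
    return {name: key[name] for name in CREDENTIAL_FIELDS}
```

The returned dictionary can then be serialized into the recipe. Because it contains the private key, treat it like the keyfile itself and keep it out of version control.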

Permissions

Grant the following roles to the service account on all target projects.

| Feature | Required Role |
| --- | --- |
| Universal Catalog Entries API (core ingestion) | `roles/dataplex.catalogViewer` |
| Lineage extraction (`include_lineage: true`) | `roles/datalineage.viewer` |
| Business Glossary ingestion (`include_glossaries: true`) | `roles/dataplex.catalogViewer` |
| Term-asset associations (`include_glossary_term_associations: true`) | `roles/browser` on each candidate project (lighter-weight) or `roles/resourcemanager.folderViewer`; both provide `resourcemanager.projects.get`, which is required for resolving GCP project numbers |
| Project auto-discovery via `project_id_pattern` or `project_labels` | `roles/browser` on each candidate project; it provides `resourcemanager.projects.get`, which `search_projects` needs in order to return the project |

:::tip "Lineage requires the role on multiple projects"

Grant roles/datalineage.viewer on every project where the corresponding process actually runs. Note that this project may differ from the one containing the asset.

:::

Additional asset-specific viewer roles:

roles/aiplatform.viewer (Vertex AI Viewer) is required when ingesting Vertex AI assets.
roles/spanner.viewer (Cloud Spanner Viewer) is required when ingesting Cloud Spanner assets.
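The role requirements in this section can be collected into a small helper that lists the roles for a given setup and builds the matching `gcloud projects add-iam-policy-binding` commands. This is a sketch under stated assumptions: `required_roles` and `binding_commands` are hypothetical, and `auto_discover_projects`, `ingest_vertex_ai`, and `ingest_spanner` are illustrative flags rather than connector options. It picks `roles/browser` where the table offers two alternatives.

```python
from typing import List


def required_roles(
    *,
    include_lineage: bool,
    include_glossary_term_associations: bool,
    auto_discover_projects: bool,
    ingest_vertex_ai: bool,
    ingest_spanner: bool,
) -> List[str]:
    """Return the sorted set of roles implied by the enabled features."""
    # Core ingestion and glossary ingestion both rely on the catalog viewer role.
    roles = {"roles/dataplex.catalogViewer"}
    if include_lineage:
        roles.add("roles/datalineage.viewer")
    if include_glossary_term_associations or auto_discover_projects:
        roles.add("roles/browser")
    if ingest_vertex_ai:
        roles.add("roles/aiplatform.viewer")
    if ingest_spanner:
        roles.add("roles/spanner.viewer")
    return sorted(roles)


def binding_commands(project_id: str, service_account: str, roles: List[str]) -> List[str]:
    """Build one IAM binding command per role for a single project."""
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member=serviceAccount:{service_account} --role={role}"
        for role in roles
    ]
```

The commands cover one project at a time. As the tip above notes, the lineage role may also be needed on other projects where the processes run, so the same bindings may have to be repeated there.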