The dataplex module ingests metadata from Google Cloud Knowledge Catalog (Dataplex) into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.
The connector extracts metadata from Google Cloud Knowledge Catalog (Dataplex) using the Universal Catalog Entries API. This API extracts entries from system-managed entry groups for Google Cloud services and is the recommended approach for discovering resources across your GCP organization.
Spanner entries are collected through an additional search_entries workaround after the entry-group traversal phase. Because those entries are not discovered through list_entry_groups, filter_config.entry_groups.pattern does not apply to them. Use entry-level filters (filter_config.entries.pattern and filter_config.entries.fqn_pattern) to control Spanner inclusion.
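The filtering behavior described above can be sketched as follows. This is an illustrative model only, not the connector's implementation: the helper names (passes, keep_entry), the dictionary layout of an entry, the allow/deny stand-in for the pattern objects, and the assumption that both entry-level patterns must pass are all hypothetical. The sample names and FQNs in the usage example are placeholders.

```python
import re
from typing import Dict, List

# Simplified stand-in for an allow/deny regex pattern: {"allow": [...], "deny": [...]}
Pattern = Dict[str, List[str]]


def passes(pattern: Pattern, value: str) -> bool:
    """Return True if value is allowed; a deny match always wins."""
    if any(re.match(p, value) for p in pattern.get("deny", [])):
        return False
    return any(re.match(p, value) for p in pattern.get("allow", [".*"]))


def keep_entry(entry: Dict[str, str], filter_config: dict, via_search: bool) -> bool:
    """Decide whether an entry is kept.

    via_search=True models Spanner entries found through the search_entries
    workaround, which never pass through the entry-group filter.
    """
    groups = filter_config.get("entry_groups", {}).get("pattern", {})
    if not via_search and not passes(groups, entry["entry_group"]):
        return False
    entries = filter_config.get("entries", {})
    # Assumption: both entry-level patterns must accept the entry.
    return passes(entries.get("pattern", {}), entry["name"]) and passes(
        entries.get("fqn_pattern", {}), entry["fqn"]
    )
```

For example, with an entry-group pattern that denies everything, a Spanner entry found through the workaround is still kept unless an entry-level pattern rejects it:

```python
cfg = {
    "entry_groups": {"pattern": {"deny": [".*"]}},
    "entries": {"fqn_pattern": {"deny": [r"spanner:.*\.scratch_.*"]}},
}
orders = {"entry_group": "@spanner", "name": "orders", "fqn": "spanner:proj.inst.db.orders"}
keep_entry(orders, cfg, via_search=True)   # kept: entry-group filter is skipped
keep_entry(orders, cfg, via_search=False)  # dropped by the entry-group filter
```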
Refer to the Google Cloud Knowledge Catalog (Dataplex) documentation for the basics.
The connector supports three ways to select GCP projects, evaluated in this order of precedence:
1. project_ids: explicit list of project IDs. When set, this overrides the other two options and no project discovery is performed.
2. project_labels: list of key:value labels. Projects carrying any of these labels are discovered via the Cloud Resource Manager search_projects API and then filtered through project_id_pattern.
3. project_id_pattern: AllowDenyPattern of regexes. When project_ids is empty, all projects visible to the credentials are returned via the Cloud Resource Manager search_projects API and filtered through this pattern.

At least one of these must be set. Auto-discovery via project_labels or project_id_pattern requires the service account to have resourcemanager.projects.get (e.g. via roles/browser) on each candidate project so the Cloud Resource Manager search_projects API can return them; no folder/organization-level grant is needed. When project_ids is set explicitly, no Resource Manager permissions are needed.
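The precedence rules above can be modeled with a short sketch. It is a simplified illustration under stated assumptions, not the connector's code: select_projects and matches are hypothetical helpers, the visible_projects mapping stands in for whatever the search_projects API would return, and the allow/deny lists are a reduced stand-in for AllowDenyPattern. The "at least one must be set" validation is not modeled.

```python
import re
from typing import Dict, List, Sequence


def matches(allow: Sequence[str], deny: Sequence[str], project_id: str) -> bool:
    """Simplified allow/deny regex filter: a deny match always wins."""
    if any(re.match(p, project_id) for p in deny):
        return False
    return any(re.match(p, project_id) for p in allow)


def select_projects(
    project_ids: Sequence[str],
    project_labels: Sequence[str],
    allow: Sequence[str],
    deny: Sequence[str],
    visible_projects: Dict[str, Dict[str, str]],
) -> List[str]:
    """Pick target projects; visible_projects maps project ID -> its labels."""
    # 1. Explicit IDs take precedence and skip discovery entirely.
    if project_ids:
        return list(project_ids)

    candidates = visible_projects
    # 2. Labels narrow discovery to projects carrying any requested key:value pair.
    if project_labels:
        wanted = {tuple(label.split(":", 1)) for label in project_labels}
        candidates = {
            pid: labels
            for pid, labels in visible_projects.items()
            if wanted & set(labels.items())
        }

    # 3. Discovered projects are always filtered through the ID pattern.
    return sorted(pid for pid in candidates if matches(allow, deny, pid))
```

With three visible projects, explicit IDs are returned untouched, while label-based discovery is still narrowed by the pattern:

```python
visible = {
    "analytics-prod": {"env": "prod"},
    "analytics-dev": {"env": "dev"},
    "sandbox": {"env": "prod"},
}
select_projects(["p1"], ["env:prod"], [".*"], [], visible)           # ["p1"]
select_projects([], ["env:prod"], ["analytics-.*"], [], visible)     # ["analytics-prod"]
```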
Enable the following APIs on all target projects:
- Dataplex API (dataplex.googleapis.com): see Enable Knowledge Catalog
- Data Lineage API (datalineage.googleapis.com): required for lineage extraction (include_lineage: true), see Enable Data Lineage API
- Cloud Resource Manager API (cloudresourcemanager.googleapis.com): required for term-asset associations (include_glossary_term_associations: true)

Some asset types require additional setup. For example, Cloud SQL instances must be connected to Dataplex to enable automatic metadata harvesting (schemas, tables, and views):
gcloud sql instances patch my-cloud-sql-instance --enable-dataplex-integration --project=my-gcp-project
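As a rough illustration of the API requirements listed earlier in this section, the sketch below derives the set of APIs to enable from the feature flags in a recipe. The helpers required_apis and enable_commands are hypothetical, and treating a missing flag as disabled is an assumption; check the connector's actual defaults before relying on it.

```python
from typing import Dict, List


def required_apis(config: Dict[str, bool]) -> List[str]:
    """Map feature flags to the Google Cloud APIs they depend on."""
    apis = ["dataplex.googleapis.com"]  # always needed for core ingestion
    if config.get("include_lineage", False):
        apis.append("datalineage.googleapis.com")
    if config.get("include_glossary_term_associations", False):
        apis.append("cloudresourcemanager.googleapis.com")
    return apis


def enable_commands(config: Dict[str, bool], project_id: str) -> List[str]:
    """Build one 'gcloud services enable' command per required API."""
    return [
        f"gcloud services enable {api} --project={project_id}"
        for api in required_apis(config)
    ]
```

Running enable_commands once per target project yields the commands to review and execute manually; the sketch only builds strings and does not call gcloud.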
The connector supports Application Default Credentials (ADC). See the GCP documentation for ADC setup.
For service account authentication, follow these instructions:
Create a service account following GCP docs and assign the required roles
Download the service account JSON keyfile
Example credential file:
{
  "type": "service_account",
  "project_id": "project-id-1234567",
  "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----",
  "client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com",
  "client_id": "113545814931671546333",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%40suppproject-id-1234567.iam.gserviceaccount.com"
}
To provide credentials to the source, you can either:
Set an environment variable:
$ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
or
Set credential config in your source based on the credential json file. For example:
credential:
  project_id: "project-id-1234567"
  private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
  private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
  client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
  client_id: "123456678890"
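The mapping from a downloaded key file to the credential block can be sketched as below. This is a convenience illustration, not part of the connector: credential_from_keyfile is a hypothetical helper, and the list of copied fields is simply the five keys shown in the credential example above.

```python
import json
from typing import Dict

# The fields that appear in the credential config example.
CREDENTIAL_FIELDS = (
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
    "client_id",
)


def credential_from_keyfile(keyfile_json: str) -> Dict[str, str]:
    """Extract the credential config fields from service account key JSON text."""
    key = json.loads(keyfile_json)
    if key.get("type") != "service_account":
        raise ValueError("expected a service account key file")
    missing = [field for field in CREDENTIAL_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    # Drop everything else (auth_uri, token_uri, cert URLs).
    return {field: key[field] for field in CREDENTIAL_FIELDS}
```

The returned dictionary can be placed under the credential key of a recipe. Avoid committing the result to version control, since it contains the private key.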
Grant the following roles to the service account on all target projects.
| Feature | Required Role |
|---|---|
| Universal Catalog Entries API (core ingestion) | roles/dataplex.catalogViewer |
| Lineage extraction (include_lineage: true) | roles/datalineage.viewer |
| Business Glossary ingestion (include_glossaries: true) | roles/dataplex.catalogViewer |
| Term-asset associations (include_glossary_term_associations: true) | roles/browser on each candidate project (lighter-weight) or roles/resourcemanager.folderViewer; both provide resourcemanager.projects.get, required for resolving GCP project numbers |
| Project auto-discovery via project_id_pattern or project_labels | roles/browser on each candidate project; provides resourcemanager.projects.get needed for search_projects to return the project |
:::tip "Lineage requires the role on multiple projects"
Grant roles/datalineage.viewer on every project where the corresponding process actually runs. That project may differ from the project containing the asset.
:::
Additional asset-specific viewer roles:
- roles/aiplatform.viewer (Vertex AI Viewer) is required when ingesting Vertex AI assets.
- roles/spanner.viewer (Cloud Spanner Viewer) is required when ingesting Cloud Spanner assets.
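The role requirements in this section can be summarized as a small checklist helper. This is a sketch under assumptions: required_roles and its ingests_vertex_ai / ingests_spanner arguments are hypothetical, auto-discovery is inferred from an empty project_ids, and roles/browser is chosen over roles/resourcemanager.folderViewer as the lighter-weight option.

```python
from typing import Dict, List


def required_roles(
    config: Dict[str, object],
    ingests_vertex_ai: bool = False,
    ingests_spanner: bool = False,
) -> List[str]:
    """List the IAM roles the service account needs for a given recipe."""
    roles = ["roles/dataplex.catalogViewer"]  # core ingestion and glossaries
    if config.get("include_lineage"):
        roles.append("roles/datalineage.viewer")
    # Without explicit project_ids, projects are discovered via search_projects.
    auto_discovery = not config.get("project_ids")
    if config.get("include_glossary_term_associations") or auto_discovery:
        roles.append("roles/browser")
    if ingests_vertex_ai:
        roles.append("roles/aiplatform.viewer")
    if ingests_spanner:
        roles.append("roles/spanner.viewer")
    return roles
```

The helper does not capture where each role must be granted; for example, roles/datalineage.viewer is needed on every project where the lineage-producing process runs.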