The dataplex module ingests metadata from Dataplex into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.
The Dataplex connector extracts metadata from Google Dataplex using the Universal Catalog Entries API. This API exposes entries from system-managed entry groups for Google Cloud services and is the recommended approach for discovering resources across your GCP organization.
:::note
Only BigQuery and Cloud Storage (GCS) have been thoroughly tested with this connector. Other services may work but have not been validated.
:::
Refer to the Dataplex documentation for Dataplex basics.
The connector supports Application Default Credentials (ADC). See the GCP documentation for ADC setup.
For service account authentication, follow these instructions:
1. Create a service account following the GCP docs and assign the required roles.
2. Download the service account JSON keyfile.
Example credential file:
{
  "type": "service_account",
  "project_id": "project-id-1234567",
  "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----",
  "client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com",
  "client_id": "113545814931671546333",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%40suppproject-id-1234567.iam.gserviceaccount.com"
}
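Before handing a keyfile to the source, it can help to confirm that it parses and carries the fields shown in the example. The following is a minimal sketch, not part of the connector: `check_keyfile` and `EXPECTED_FIELDS` are hypothetical names, and treating every listed field as required is an assumption made for illustration.

```python
import json
from typing import List

# Fields taken from the example keyfile above. Treating all of them as
# required is an assumption of this sketch, not a rule from the connector.
EXPECTED_FIELDS = (
    "type",
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
    "client_id",
)


def check_keyfile(text: str) -> List[str]:
    """Return a list of problems found in a keyfile's JSON text (empty if none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    # Report every expected field that is absent or empty.
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if not data.get(name)]
    # A present but different "type" means this is not a service account key.
    if data.get("type") not in (None, "", "service_account"):
        problems.append(f"unexpected type: {data['type']!r}")
    return problems
```

Reading the file with `open(path).read()` and passing the text to `check_keyfile` keeps the check independent of where the keyfile is stored.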
To provide credentials to the source, you can either:
Set an environment variable:
$ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
or
Set the credential config in your source based on the credential JSON file. For example:
credential:
  project_id: "project-id-1234567"
  private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
  private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
  client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
  client_id: "123456678890"
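The two options above can be sketched in Python: one helper reads the path from `GOOGLE_APPLICATION_CREDENTIALS`, the other picks the five fields used in the `credential` example out of a keyfile. Both function names are hypothetical, and the assumption that exactly these five fields are wanted comes only from the example above, not from the connector's schema.

```python
import json
import os
from typing import Dict, Mapping, Optional

# The five keys shown in the `credential` example above.
CREDENTIAL_KEYS = (
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
    "client_id",
)


def credential_block(keyfile_text: str) -> Dict[str, str]:
    """Pick the fields used in the `credential` example out of a keyfile's JSON text."""
    data = json.loads(keyfile_text)
    missing = [key for key in CREDENTIAL_KEYS if key not in data]
    if missing:
        raise KeyError(f"keyfile is missing: {', '.join(missing)}")
    return {key: data[key] for key in CREDENTIAL_KEYS}


def keyfile_path_from_env(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the path set via GOOGLE_APPLICATION_CREDENTIALS, or None if unset or empty."""
    return environ.get("GOOGLE_APPLICATION_CREDENTIALS") or None
```

The dictionary returned by `credential_block` has the same keys as the `credential` example, so it can be serialized into the source config with whatever YAML tooling you already use.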
Grant the following permissions to the service account on all target projects.
Universal Catalog Entries API:
Default GCP Role: roles/dataplex.catalogViewer
| Permission | Description |
|---|---|
| dataplex.entryGroups.get | Retrieve specific entry group details |
| dataplex.entryGroups.list | View all entry groups in a location |
| dataplex.entries.get | Access entry metadata and details |
| dataplex.entries.getData | View data aspects within entries |
| dataplex.entries.list | Enumerate entries within groups |
Lineage extraction (optional, include_lineage: true):
Default GCP Role: roles/datalineage.viewer
| Permission | Description |
|---|---|
| datalineage.links.get | Allows a user to view lineage links |
| datalineage.links.search | Allows a user to search for lineage links |
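If you use a custom role instead of the default roles, the two permission tables can be turned into a simple checklist. The sketch below is illustrative only: `missing_permissions` is a hypothetical helper, and it assumes you have already obtained the list of permissions granted to the service account by some other means.

```python
from typing import Iterable, Set

# Permission names copied from the two tables above.
CATALOG_PERMISSIONS = frozenset({
    "dataplex.entryGroups.get",
    "dataplex.entryGroups.list",
    "dataplex.entries.get",
    "dataplex.entries.getData",
    "dataplex.entries.list",
})
LINEAGE_PERMISSIONS = frozenset({
    "datalineage.links.get",
    "datalineage.links.search",
})


def missing_permissions(granted: Iterable[str], include_lineage: bool = False) -> Set[str]:
    """Return the permissions from the tables that are absent from `granted`."""
    required = set(CATALOG_PERMISSIONS)
    # Lineage permissions only matter when include_lineage is enabled.
    if include_lineage:
        required |= LINEAGE_PERMISSIONS
    return required - set(granted)
```

An empty result means the granted set covers everything listed above for the chosen `include_lineage` setting.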