Overview

Google Cloud Knowledge Catalog (Dataplex) is a fully managed service that automates the discovery and inventory of your distributed data and AI assets. Learn more in the official Google Cloud Knowledge Catalog (Dataplex) documentation.

The DataHub integration uses Universal Catalog entries as the source of truth and maps them to DataHub datasets and containers with provider-native URNs (for example bigquery, cloudsql, spanner, pubsub, and bigtable). It also supports table-level lineage, Business Glossary ingestion, and stateful deletion detection.

Concept Mapping

The ingestion is entry-type driven: each Universal Catalog entry_type maps to a specific DataHub entity type and hierarchy behavior.

Supported entry-type mapping

| Google Cloud Knowledge Catalog (Dataplex) entry type short name | DataHub platform | Emitted entity | Parent relationship |
| --- | --- | --- | --- |
| bigquery-dataset | bigquery | Container (BigQuery Dataset) | Parent is project container |
| bigquery-table | bigquery | Dataset (Table) | Parent is BigQuery dataset container |
| bigquery-view | bigquery | Dataset (View) | Parent is BigQuery dataset container |
| cloudsql-mysql-instance | cloudsql | Container (Instance) | Parent is project container |
| cloudsql-mysql-database | cloudsql | Container (Database) | Parent is Cloud SQL instance container |
| cloudsql-mysql-table | cloudsql | Dataset (Table) | Parent is Cloud SQL database container |
| cloud-spanner-instance | spanner | Container (Instance) | Parent is project container |
| cloud-spanner-database | spanner | Container (Database) | Parent is Spanner instance container |
| cloud-spanner-table | spanner | Dataset (Table) | Parent is Spanner database container |
| cloud-spanner-graph | spanner | Dataset (Graph) | Parent is Spanner database container |
| cloud-bigtable-instance | bigtable | Container (Instance) | Parent is project container |
| cloud-bigtable-table | bigtable | Dataset (Table) | Parent is Bigtable instance container |
| pubsub-topic | pubsub | Dataset (Topic) | Parent is project container |
| vertexai-dataset | vertexai | Dataset (Table) | Parent is project container |
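
The mapping above can be read as a plain lookup from entry type short name to platform, entity kind, and parent. The sketch below is illustrative only: `EntryTypeMapping`, `ENTRY_TYPE_MAPPING`, and `resolve_entry_type` are hypothetical names chosen for this example and are not taken from the connector's code.

```python
from typing import NamedTuple, Optional


class EntryTypeMapping(NamedTuple):
    platform: str  # DataHub platform used for the provider-native URN
    entity: str    # "Container" or "Dataset"
    subtype: str   # subtype shown in parentheses in the table
    parent: str    # kind of container the entity is nested under


# Hypothetical lookup table transcribed from the mapping table above.
ENTRY_TYPE_MAPPING = {
    "bigquery-dataset": EntryTypeMapping("bigquery", "Container", "BigQuery Dataset", "project"),
    "bigquery-table": EntryTypeMapping("bigquery", "Dataset", "Table", "BigQuery dataset"),
    "bigquery-view": EntryTypeMapping("bigquery", "Dataset", "View", "BigQuery dataset"),
    "cloudsql-mysql-instance": EntryTypeMapping("cloudsql", "Container", "Instance", "project"),
    "cloudsql-mysql-database": EntryTypeMapping("cloudsql", "Container", "Database", "Cloud SQL instance"),
    "cloudsql-mysql-table": EntryTypeMapping("cloudsql", "Dataset", "Table", "Cloud SQL database"),
    "cloud-spanner-instance": EntryTypeMapping("spanner", "Container", "Instance", "project"),
    "cloud-spanner-database": EntryTypeMapping("spanner", "Container", "Database", "Spanner instance"),
    "cloud-spanner-table": EntryTypeMapping("spanner", "Dataset", "Table", "Spanner database"),
    "cloud-spanner-graph": EntryTypeMapping("spanner", "Dataset", "Graph", "Spanner database"),
    "cloud-bigtable-instance": EntryTypeMapping("bigtable", "Container", "Instance", "project"),
    "cloud-bigtable-table": EntryTypeMapping("bigtable", "Dataset", "Table", "Bigtable instance"),
    "pubsub-topic": EntryTypeMapping("pubsub", "Dataset", "Topic", "project"),
    "vertexai-dataset": EntryTypeMapping("vertexai", "Dataset", "Table", "project"),
}


def resolve_entry_type(short_name: str) -> Optional[EntryTypeMapping]:
    """Return the mapping for an entry type short name, or None if it is not in the table."""
    return ENTRY_TYPE_MAPPING.get(short_name)
```

In this sketch an entry type that is missing from the table resolves to `None`; how the real connector treats unsupported entry types is not covered here.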

Business Glossary mapping

Dataplex Business Glossaries are ingested as a three-level hierarchy of DataHub Glossary entities.

| Dataplex entity | DataHub entity | URN pattern |
| --- | --- | --- |
| Glossary | GlossaryNode | dataplex.{project_id}.{location}.{glossary_id} |
| Category | GlossaryNode | dataplex.{project_id}.{location}.{glossary_id}.{category_id} |
| Term | GlossaryTerm | dataplex.{project_id}.{location}.{glossary_id}.{term_id} |
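
As a rough illustration of the patterns above, the following sketch assembles the dotted identifiers from their parts. The function names are hypothetical, and the example values (`my-project`, `us-central1`, `finance`) are made up. Wrapping the identifier in `urn:li:glossaryNode:` or `urn:li:glossaryTerm:` follows the usual DataHub URN convention and is an assumption about how the connector uses these patterns.

```python
from typing import Optional


def glossary_node_id(
    project_id: str,
    location: str,
    glossary_id: str,
    category_id: Optional[str] = None,
) -> str:
    """Build the identifier for a Glossary, or for a Category when category_id is given."""
    parts = ["dataplex", project_id, location, glossary_id]
    if category_id is not None:
        parts.append(category_id)
    return ".".join(parts)


def glossary_term_id(project_id: str, location: str, glossary_id: str, term_id: str) -> str:
    """Build the identifier for a Term; per the table, the category is not part of it."""
    return ".".join(["dataplex", project_id, location, glossary_id, term_id])


# Assumption: the identifier becomes the id portion of a standard DataHub URN.
node_urn = "urn:li:glossaryNode:" + glossary_node_id("my-project", "us-central1", "finance")
term_urn = "urn:li:glossaryTerm:" + glossary_term_id("my-project", "us-central1", "finance", "revenue")
```

Because the Term pattern omits the category, a term and a category under the same glossary share an identifier whenever their IDs match; the entity type in the URN is then what distinguishes them.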

Terms are marked as EXTERNAL with a source_url pointing to the Dataplex console entry. When include_glossary_term_associations is enabled (default), the connector also resolves term-to-asset links via the Dataplex lookupEntryLinks API and attaches the corresponding GlossaryTerm to each linked DataHub dataset.
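
The term-to-asset step can be pictured as grouping resolved links by dataset. This is a simplified sketch, not the connector's implementation: it assumes the links have already been fetched and reduced to hypothetical `(dataset_urn, term_id)` pairs, and it does not call the Dataplex API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_terms_by_dataset(links: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (dataset_urn, term_id) pairs into a list of GlossaryTerm URNs per dataset.

    The pair format is a hypothetical simplification of resolved entry links.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for dataset_urn, term_id in links:
        term_urn = f"urn:li:glossaryTerm:{term_id}"
        # Skip duplicates so each term is attached to a dataset at most once.
        if term_urn not in grouped[dataset_urn]:
            grouped[dataset_urn].append(term_urn)
    return dict(grouped)


# Made-up example values for illustration.
orders = "urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.sales.orders,PROD)"
links = [
    (orders, "dataplex.my-project.us-central1.finance.revenue"),
    (orders, "dataplex.my-project.us-central1.finance.revenue"),
    (orders, "dataplex.my-project.us-central1.finance.margin"),
]
terms_by_dataset = group_terms_by_dataset(links)
```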