docs/automations/knowledge-catalog-metadata-sync.md
import FeatureAvailability from '@site/src/components/FeatureAvailability';
:::info
This feature is currently in Public Beta in DataHub Cloud. Reach out to your DataHub Cloud representative if you face any issues configuring or validating the capabilities outlined below.
:::
Knowledge Catalog Metadata Sync is an automation that synchronizes DataHub Tags, Glossary Terms, and Structured Properties with Google Cloud Knowledge Catalog. This enables you to manage metadata centrally in DataHub and automatically propagate it to Knowledge Catalog, where it appears as custom aspects, native Business Glossary terms, and entry links on your BigQuery assets. This automation is exclusively available in DataHub Cloud.
| DataHub Source | Knowledge Catalog Target | Sync Direction | Notes |
|---|---|---|---|
| Column Tags | Custom Aspect (datahub-tags) | DataHub → Knowledge Catalog | Stored as key-value map in a custom aspect on the Knowledge Catalog entry |
| Column Glossary Terms | Native Business Glossary Term | DataHub → Knowledge Catalog | Creates native glossary terms, categories, and entry links |
| Table Glossary Terms | Native Business Glossary Term | DataHub → Knowledge Catalog | Creates native glossary terms, categories, and entry links |
| Structured Properties | Custom Aspect (datahub) | DataHub → Knowledge Catalog | All structured properties synced as a single map aspect |
When a Tag is applied to a BigQuery column in DataHub, the automation:
- datahub-tags custom aspect, scoped to the column

When a Glossary Term is applied to a BigQuery table or column in DataHub, the automation:

- datahub by default)

When Structured Properties are added or modified on a BigQuery asset in DataHub, the automation:

- datahub custom aspect map on the Knowledge Catalog entry

Ensure your service account has the following permissions:
| Task | Required Permissions | Suggested Role |
|---|---|---|
| Knowledge Catalog Access | dataplex.entries.get<br/>dataplex.entries.update<br/>dataplex.aspectTypes.create<br/>dataplex.aspectTypes.update<br/>dataplex.aspectTypes.get | Knowledge Catalog Editor |
| Data Catalog Lookup | datacatalog.entries.get | Data Catalog Viewer |
| Business Glossary Management | dataplex.glossaries.create<br/>dataplex.glossaryTerms.create<br/>dataplex.glossaryTerms.update<br/>dataplex.glossaryCategories.create<br/>dataplex.glossaryCategories.update | Knowledge Catalog Editor |
| Entry Link Management | dataplex.entryLinks.create<br/>dataplex.entryLinks.delete | Knowledge Catalog Editor |
| Project Number Resolution | resourcemanager.projects.get | Browser |
Note: Permissions must be granted in every GCP project where metadata sync is needed. The Data Catalog Viewer role is required because the automation uses Data Catalog to discover the GCP region of BigQuery assets (BigQuery URNs don't contain region information).
Choose the types of metadata to synchronize:
| Propagation Type | Description |
|---|---|
| Tags | Sync DataHub column Tags to Knowledge Catalog custom aspects |
| Glossary Terms | Sync DataHub Glossary Terms to native Knowledge Catalog Business Glossary |
| Structured Properties | Sync DataHub Structured Properties to Knowledge Catalog custom aspects |
:::note
You can limit Tag and Glossary Term propagation to specific Tags or Terms. If none are selected, all Tags or Glossary Terms are propagated. We recommend leaving the filter empty to avoid inconsistent states.
:::
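The filter rule in the note above (an empty selection means everything is propagated) can be sketched as a small predicate. This is an illustration only: `should_propagate` and the example URNs are hypothetical names, not part of the automation's API.

```python
from typing import Collection


def should_propagate(urn: str, selected: Collection[str]) -> bool:
    """Return True if a Tag or Glossary Term should be synced.

    Hypothetical helper: an empty selection means "no filter",
    so every Tag or Term is propagated.
    """
    return not selected or urn in selected


# No filter configured: everything is propagated.
print(should_propagate("urn:li:tag:pii", set()))
# Filter configured: only the selected Tags or Terms are propagated.
print(should_propagate("urn:li:tag:internal", {"urn:li:tag:pii"}))
```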
Fill in the required fields:
- datahub)
- global)

Click Save and Run to activate the automation. The automation will:

- datahub-tags and datahub) if they don't exist

To ensure that all existing Tags, Glossary Terms, and Structured Properties are propagated to Knowledge Catalog, you can backfill historical data. Note that the initial backfilling process may take some time, depending on the number of BigQuery assets you have.
This one-time step will trigger the backfilling process. If you only want to begin propagating metadata going forward, you can skip this step.
Synced Tags and Structured Properties are visible in the Knowledge Catalog entry details in the Google Cloud console. Look for custom aspects named datahub-tags and datahub on your BigQuery entries.
Synced Glossary Terms appear in the Knowledge Catalog Business Glossary section of the Google Cloud console:
- datahub by default, contains all synced terms and categories

A: The automation supports BigQuery tables. The asset must be discoverable via Data Catalog (which is automatic for BigQuery tables). The automation uses Data Catalog to resolve the GCP region, then constructs the Knowledge Catalog entry path.
A: BigQuery URNs in DataHub don't include GCP region information (e.g., us-east1, eu). The automation uses the Data Catalog LookupEntry API to discover which region a BigQuery table is in, then constructs the Knowledge Catalog entry path from that.
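The two string manipulations around that lookup can be sketched as follows. This is a rough illustration under stated assumptions, not the automation's code: the helper names are hypothetical, the URN is assumed to carry a three-part `project.dataset.table` name, and the region is assumed to be readable from the `locations/...` segment of the entry name returned by the lookup. The lookup call itself (for example `DataCatalogClient.lookup_entry` with a `linked_resource` in the `google-cloud-datacatalog` client) is left out so the sketch runs offline.

```python
import re

# Matches a DataHub BigQuery dataset URN and captures "project.dataset.table".
_URN_RE = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:bigquery,([^,]+),[^)]+\)$"
)
# Captures the location segment of a "projects/.../locations/<loc>/..." name.
_LOCATION_RE = re.compile(r"^projects/[^/]+/locations/([^/]+)/")


def linked_resource_from_urn(urn: str) -> str:
    """Build the BigQuery linked-resource name used as the lookup key."""
    match = _URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a BigQuery dataset URN: {urn!r}")
    parts = match.group(1).split(".")
    if len(parts) != 3:
        raise ValueError(f"expected project.dataset.table, got {match.group(1)!r}")
    project, dataset, table = parts
    return (
        f"//bigquery.googleapis.com/projects/{project}"
        f"/datasets/{dataset}/tables/{table}"
    )


def location_from_entry_name(entry_name: str) -> str:
    """Extract the location segment from a returned entry name (assumed layout)."""
    match = _LOCATION_RE.match(entry_name)
    if match is None:
        raise ValueError(f"no location segment in {entry_name!r}")
    return match.group(1)


urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.sales.orders,PROD)"
print(linked_resource_from_urn(urn))
```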
A: Knowledge Catalog Business Glossary supports a maximum of 3 nested category levels. If a DataHub Glossary Term has a hierarchy deeper than 3 levels, the sync is skipped for that term and a warning is logged.
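A minimal sketch of that depth check, assuming depth is counted as the number of ancestor groups above the term in the DataHub glossary. The function name, logger name, and counting rule are illustrative assumptions; the automation's actual rule may differ.

```python
import logging
from typing import Sequence

logger = logging.getLogger("kc_sync_sketch")  # hypothetical logger name

MAX_CATEGORY_DEPTH = 3  # nesting limit stated in the answer above


def can_sync_term(term_name: str, ancestor_groups: Sequence[str]) -> bool:
    """Return False (and log a warning) when the term is nested too deeply.

    ancestor_groups lists the term's parent groups from the root down.
    """
    depth = len(ancestor_groups)
    if depth > MAX_CATEGORY_DEPTH:
        logger.warning(
            "Skipping term %r: %d category levels exceeds the limit of %d",
            term_name, depth, MAX_CATEGORY_DEPTH,
        )
        return False
    return True


# Three levels of nesting fit within the limit; a fourth does not.
print(can_sync_term("Email", ["Privacy", "PII", "Contact"]))
print(can_sync_term("Email", ["Privacy", "PII", "Contact", "Personal"]))
```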
A: Author and manage the glossary in DataHub. Glossary terms in Knowledge Catalog should be treated as a reflection of the DataHub glossary, not as the primary source of truth.
A: Knowledge Catalog resource IDs must match ^[a-z][a-z0-9-]{0,62}$. The automation automatically sanitizes DataHub names to comply: lowercasing, replacing special characters with hyphens, collapsing consecutive hyphens, and prefixing with t- if the result starts with a digit. Names are truncated to 63 characters.
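The sanitization steps listed above can be sketched in a few lines of Python. This is an approximation of the described behavior, not the automation's implementation: `sanitize_resource_id` is a hypothetical name, and stripping leading and trailing hyphens and prefixing an empty result with `t-` are assumptions added so the output always matches the pattern.

```python
import re

RESOURCE_ID_PATTERN = re.compile(r"^[a-z][a-z0-9-]{0,62}$")
MAX_RESOURCE_ID_LENGTH = 63


def sanitize_resource_id(name: str) -> str:
    """Turn a DataHub name into an ID matching RESOURCE_ID_PATTERN."""
    value = name.lower()
    # Replace every character outside [a-z0-9-] with a hyphen.
    value = re.sub(r"[^a-z0-9-]", "-", value)
    # Collapse runs of hyphens into a single hyphen.
    value = re.sub(r"-{2,}", "-", value)
    # Assumption: drop leading/trailing hyphens so the ID can start with a letter.
    value = value.strip("-")
    # Prefix with "t-" when the result starts with a digit
    # (assumption: also when nothing is left after cleaning).
    if not value or value[0].isdigit():
        value = "t-" + value
    # Truncate to the maximum allowed length.
    return value[:MAX_RESOURCE_ID_LENGTH]


for raw in ["Customer PII / Email", "2024 Revenue"]:
    print(raw, "->", sanitize_resource_id(raw))
```

Because the resource ID is derived once from the original name, two different DataHub names can sanitize to the same ID; the sketch does not attempt to detect such collisions.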
A: The automation detects display name changes and updates the corresponding Knowledge Catalog glossary term or category. The resource ID (derived from the original name) remains unchanged — only the display name and description are updated.
A: Changes are synced in real time, within a few seconds of occurring in DataHub. The automation listens for metadata change events and processes them immediately.
A: The automation has built-in error rate limiting. If more than 15 errors occur within an hour (configurable), it will temporarily stop processing events to avoid cascading failures. Transient errors (like permission denied or network issues) are logged but don't permanently block sync.
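One common way to implement this kind of limit is a sliding-window counter. The sketch below is illustrative only: the class name, its parameters, and the injectable clock are assumptions, and the automation's actual mechanism and configuration keys are not documented here.

```python
import time
from collections import deque
from typing import Callable, Deque


class ErrorRateLimiter:
    """Sliding-window error counter (hypothetical helper, not DataHub code)."""

    def __init__(
        self,
        max_errors: int = 15,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._clock = clock
        self._errors: Deque[float] = deque()

    def _prune(self) -> None:
        # Forget errors that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._errors and self._errors[0] <= cutoff:
            self._errors.popleft()

    def record_error(self) -> None:
        self._errors.append(self._clock())

    def should_pause(self) -> bool:
        # Pause only when MORE than max_errors fall inside the window.
        self._prune()
        return len(self._errors) > self.max_errors


limiter = ErrorRateLimiter()
limiter.record_error()
print(limiter.should_pause())
```

Passing the clock in as a parameter keeps the class easy to test, since a test can advance time without sleeping.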
A: Yes. Table-level tag propagation is handled by BigQuery Metadata Sync (as BigQuery Labels), while Knowledge Catalog Metadata Sync handles column-level tags (as custom aspects), glossary terms (as native Business Glossary), and structured properties. The two automations are complementary.