import FeatureAvailability from '@site/src/components/FeatureAvailability';
:::info
This feature is currently in open beta in DataHub Cloud. Reach out to your DataHub Cloud representative to get access.
:::
BigQuery Metadata Sync is an automation that synchronizes DataHub Tags, Table and Column descriptions, and Column Glossary Terms with BigQuery. This automation is exclusively available in DataHub Cloud.
| DataHub Source | BigQuery Target | Sync Direction | Notes |
|---|---|---|---|
| Table Tags | Table Labels | Bi-directional | Changes in either system reflect in both |
| Table Descriptions | Table Descriptions | Bi-directional | Changes in either system reflect in both |
| Column Descriptions | Column Descriptions | Bi-directional | Changes in either system reflect in both.<br/>The sync doesn't delete table descriptions from BigQuery. |
| Column Glossary Terms | Column Policy Tags | DataHub → BigQuery | Created under DataHub taxonomy |
Ensure your service account has the following permissions:
| Task | Required Permissions | Available Role |
|---|---|---|
| Policy Tag Management | • datacatalog.taxonomies.create<br/>• datacatalog.taxonomies.update<br/>• datacatalog.taxonomies.list<br/>• datacatalog.taxonomies.get<br/>• bigquery.tables.createTagBinding | Policy Tag Admin |
| Policy Tag Assignment | • bigquery.tables.updateTag | - |
| Description Management | • bigquery.tables.update | - |
| Label Management | • bigquery.tables.update | - |
Note: bigquery.tables permissions must be granted in every project where metadata sync is needed.
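As a quick pre-flight check, the permissions listed in the table above can be compared against the permissions your service account actually holds. The following is a minimal sketch, not part of DataHub: the `REQUIRED_PERMISSIONS` mapping is copied from the table, while the helper `missing_permissions` and the sample `granted` set are hypothetical names introduced for illustration. How you obtain the granted set (for example, from your IAM tooling) is left as an assumption.

```python
# Required permissions per task, copied from the permissions table above.
REQUIRED_PERMISSIONS = {
    "Policy Tag Management": {
        "datacatalog.taxonomies.create",
        "datacatalog.taxonomies.update",
        "datacatalog.taxonomies.list",
        "datacatalog.taxonomies.get",
        "bigquery.tables.createTagBinding",
    },
    "Policy Tag Assignment": {"bigquery.tables.updateTag"},
    "Description Management": {"bigquery.tables.update"},
    "Label Management": {"bigquery.tables.update"},
}


def missing_permissions(granted):
    """Return {task: sorted missing permissions} for tasks not fully covered."""
    granted = set(granted)
    gaps = {}
    for task, required in REQUIRED_PERMISSIONS.items():
        missing = sorted(required - granted)
        if missing:
            gaps[task] = missing
    return gaps


# Hypothetical example: a service account that can only update tables.
granted = {"bigquery.tables.update"}
gaps = missing_permissions(granted)
for task, perms in gaps.items():
    print(f"{task}: missing {', '.join(perms)}")
```

Because `bigquery.tables` permissions are granted per project, a check like this would need to be repeated for each project where metadata sync is enabled.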
Configure Automation:
| Propagation Type | DataHub Entity | BigQuery Entity | Note |
|---|---|---|---|
| Table Tags as Labels | Table Tag | BigQuery Label | - |
| Column Glossary Terms as Policy Tags | Glossary Term on Table Column | Policy Tag | <ul><li>Assigned Policy Tags are created under the DataHub taxonomy.</li><li>Only the most recently assigned Glossary Term is set as the Policy Tag, because BigQuery supports only one assigned Policy Tag per column.</li><li>Policy Tags are not synced from BigQuery to DataHub as Glossary Terms.</li></ul> |
| Table Descriptions | Table Description | Table Description | - |
| Column Descriptions | Column Description | Column Description | - |
:::note
You can limit propagation based on specific Tags and Glossary Terms. If none are selected, ALL Tags or Glossary Terms will be automatically propagated to BigQuery tables and columns. (The recommended approach is to not specify a filter to avoid inconsistent states.)
:::
To ensure that all existing table Tags and Column Glossary Terms are propagated to BigQuery, you can back-fill historical data for existing assets. Note that the initial back-filling process may take some time, depending on the number of BigQuery assets you have.
To do so, follow these steps:
This one-time step will kick off the back-filling process for existing descriptions. If you only want to begin propagating descriptions going forward, you can skip this step.
You can view propagated Tags inside the BigQuery UI to confirm the automation is working as expected.
A: The following metadata elements support bi-directional syncing:
A: No, BigQuery Policy Tags are only propagated from DataHub to BigQuery, not vice versa. This means that Policy Tags should be mastered in DataHub using the Business Glossary.
It is recommended to avoid enabling extract_policy_tags_from_catalog during
ingestion, as this will ingest policy tags as BigQuery labels. Our sync process
propagates Glossary Term assignments to BigQuery as Policy Tags.
In a future release, we plan to remove this restriction to support full bi-directional syncing.
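One way to guard against accidentally enabling this flag is to inspect an ingestion recipe before it runs. The sketch below assumes the recipe has already been loaded into a Python dictionary with a `source.type` / `source.config` layout. The helper name `policy_tag_extraction_enabled` and the sample recipe are hypothetical, and treating a missing flag as disabled is an assumption.

```python
def policy_tag_extraction_enabled(recipe):
    """Return True if a BigQuery recipe turns on extract_policy_tags_from_catalog."""
    source = recipe.get("source", {})
    if source.get("type") != "bigquery":
        return False
    config = source.get("config", {})
    # Assumption: a missing flag is treated as disabled.
    return bool(config.get("extract_policy_tags_from_catalog", False))


# Hypothetical recipe, shown as a dict rather than YAML.
recipe = {
    "source": {
        "type": "bigquery",
        "config": {
            "extract_policy_tags_from_catalog": True,
        },
    }
}

if policy_tag_extraction_enabled(recipe):
    print("Warning: disable extract_policy_tags_from_catalog when using BigQuery Metadata Sync")
```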
A: During ingestion from BigQuery:
A: The expectation is that you author and manage the glossary in DataHub. Policy tags in BigQuery should be treated as a reflection of the DataHub glossary, not as the primary source of truth.
A: Yes, BigQuery only supports one Policy Tag per column. If multiple glossary terms are assigned to a column in DataHub, only the most recently assigned term will be set as the policy tag in BigQuery. To reduce the scope of conflicts, you can set up filters in the BigQuery Metadata Sync to only synchronize terms from a specific area of the Business Glossary.
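The selection rule described here can be illustrated with a short sketch. It is not DataHub's implementation: the `TermAssignment` type, the `pick_policy_tag_term` function, and the sample term names are all hypothetical. The sketch assumes that each assignment carries a timestamp, that an empty filter means every term is eligible, and that the most recently assigned eligible term wins.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class TermAssignment:
    term: str              # glossary term identifier (illustrative)
    assigned_at: datetime  # when the term was attached to the column


def pick_policy_tag_term(
    assignments: Iterable[TermAssignment],
    allowed_terms: Optional[Set[str]] = None,
) -> Optional[str]:
    """Pick the single term that would become the column's policy tag.

    If allowed_terms is None or empty, every term is eligible.
    Among eligible terms, the most recently assigned one wins.
    """
    eligible = [
        a for a in assignments
        if not allowed_terms or a.term in allowed_terms
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda a: a.assigned_at).term


# Hypothetical column with two glossary terms assigned at different times.
assignments = [
    TermAssignment("Classification.Confidential", datetime(2024, 1, 10)),
    TermAssignment("Retention.OneYear", datetime(2024, 3, 5)),
]

unfiltered = pick_policy_tag_term(assignments)
filtered = pick_policy_tag_term(assignments, {"Classification.Confidential"})
print(unfiltered, filtered)
```

With no filter, the later `Retention.OneYear` assignment is chosen. Restricting the filter to the classification term makes the outcome independent of unrelated terms, which is the conflict-reduction effect described above.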
A: From DataHub to BigQuery, the sync happens instantly (within a few seconds) when the change occurs in DataHub.
From BigQuery to DataHub, changes are synced when ingestion runs, so the frequency depends on your ingestion schedule, which is visible on the Integrations page.
A: In case of conflicts (e.g., a tag is modified in both systems between syncs), the DataHub version will typically take precedence. However, it's best to make changes in one system consistently to avoid potential conflicts.
A: Ensure that the service account used for the automation has the necessary permissions in both DataHub and BigQuery to read and write metadata. See the required BigQuery permissions at the top of the page.
A: No, the sync can modify a table description, but it won't remove or clear a description from a table.