import FeatureAvailability from '@site/src/components/FeatureAvailability';
:::info
This feature is currently in Public Beta in DataHub Cloud. Reach out to your DataHub Cloud representative if you face any issues configuring or validating the capabilities outlined below.
:::
Databricks Metadata Sync is an automation feature that enables seamless synchronization of DataHub Tags and Descriptions with Databricks Unity Catalog. This automation ensures consistent metadata governance across both platforms, automatically propagating DataHub governance artifacts to Unity Catalog tables, columns, catalogs, and schemas. Typically, this will be used in conjunction with the Databricks ingestion source, which enables ingesting Tags and Descriptions from Databricks into DataHub.
This automation is exclusively available in DataHub Cloud.
The Databricks Metadata Sync automation provides comprehensive metadata synchronization with the following features:
A note about legacy Hive Metastore: Bi-directional sync for descriptions is supported for Hive Metastore Schemas & Tables, but Tag sync is not. This is because Databricks does not support applying Tags to these assets in the Hive Metastore.
Before enabling Databricks Metadata Sync, ensure the following permissions and configurations are in place:
Based on Unity Catalog requirements, to add tags to objects you need the `APPLY TAG` privilege on the object, along with `USE SCHEMA` on the object's parent schema and `USE CATALOG` on its parent catalog.
Note: If using governed tags, you may also need ASSIGN permission on the tag policy.
Configure the necessary permissions for your DataHub automation service principal based on your sync requirements:
For Tag sync only:

```sql
-- Basic access permissions
GRANT USE CATALOG ON CATALOG your_catalog TO `datahub-automation@your-domain.com`;
GRANT USE SCHEMA ON SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;

-- Tag application permissions
GRANT APPLY TAG ON CATALOG your_catalog TO `datahub-automation@your-domain.com`;
GRANT APPLY TAG ON SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;
GRANT APPLY TAG ON ALL TABLES IN SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;
```
For Description sync only:

```sql
-- Basic access permissions
GRANT USE CATALOG ON CATALOG your_catalog TO `datahub-automation@your-domain.com`;
GRANT USE SCHEMA ON SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;

-- Description modification permissions
GRANT MODIFY ON CATALOG your_catalog TO `datahub-automation@your-domain.com`;
GRANT MODIFY ON SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;
GRANT MODIFY ON ALL TABLES IN SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;
```
For both Tag and Description sync:

```sql
-- Basic access permissions
GRANT USE CATALOG ON CATALOG your_catalog TO `datahub-automation@your-domain.com`;
GRANT USE SCHEMA ON SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;

-- Tag application permissions
GRANT APPLY TAG ON CATALOG your_catalog TO `datahub-automation@your-domain.com`;
GRANT APPLY TAG ON SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;
GRANT APPLY TAG ON ALL TABLES IN SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;

-- Description modification permissions
GRANT MODIFY ON CATALOG your_catalog TO `datahub-automation@your-domain.com`;
GRANT MODIFY ON SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;
GRANT MODIFY ON ALL TABLES IN SCHEMA your_catalog.your_schema TO `datahub-automation@your-domain.com`;
```
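After running the grants, you can verify that they took effect by listing the privileges held by the service principal. A minimal check, using the same placeholder catalog, schema, and principal names as above:

```sql
-- List privileges granted to the automation principal on the schema
SHOW GRANTS `datahub-automation@your-domain.com` ON SCHEMA your_catalog.your_schema;
```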
Ensure your DataHub instance has:
Navigate to the Automations section in your DataHub Cloud interface:
Configure the automation:
Choose the types of information to synchronize:
Choose between:
When syncing Tags, you can choose:
Complete the Databricks connection configuration:
- Workspace URL: the URL of your Databricks workspace (e.g. https://abcsales.cloud.databricks.com)
- Access Token: a token used to authenticate to the workspace (e.g. fab3e5fg0bcbfc56)

Click Test Connection to verify your configuration before proceeding.
Provide automation metadata:
Click Save and Run to activate the automation and begin real-time synchronization.
For environments with existing DataHub metadata, you can perform a one-time backfill to ensure all current Tags and Descriptions from DataHub are propagated to Unity Catalog. Depending on the number of assets, this might take a while!
:::note Initialization Timeline
The initialization process duration depends on the volume of Unity Catalog assets in your environment. Large catalogs with extensive metadata may require significant processing time.
:::
Confirm successful metadata syncing by examining Unity Catalog objects:
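For example, synced Tags and Descriptions can be inspected directly with Databricks SQL via the Unity Catalog `information_schema` tables (using the same placeholder catalog and schema names as in the permission grants above):

```sql
-- Check Tags applied to tables in the synced schema
SELECT table_name, tag_name, tag_value
FROM system.information_schema.table_tags
WHERE catalog_name = 'your_catalog' AND schema_name = 'your_schema';

-- Check column-level Tags
SELECT table_name, column_name, tag_name, tag_value
FROM system.information_schema.column_tags
WHERE catalog_name = 'your_catalog' AND schema_name = 'your_schema';

-- Check table Descriptions (stored as comments)
SELECT table_name, comment
FROM system.information_schema.tables
WHERE table_catalog = 'your_catalog' AND table_schema = 'your_schema';
```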
For additional assistance with Databricks Metadata Sync, contact your DataHub Cloud representative.
In general, we recommend centrally authoring Tags and Descriptions within DataHub. This allows you to maintain a clear and consistent governance posture across all of your data sources and data products - there is always data outside of Databricks! Authoring this critical information in DataHub also improves the experience for your data practitioners trying to find the right data.
This automation is intended to enable this style of management, allowing you to "push down" metadata from the central catalog into Databricks, where your data is stored and queried.
During ingestion from Databricks, DataHub can ingest tags and descriptions that were originally authored within Databricks. DataHub converts key-value formatted tags in Databricks into DataHub tags of the format: key:value. For example, if you have a tag with key has_pii and value true in Databricks, this will be ingested as a single combined tag named has_pii: true in DataHub.
After ingestion into DataHub, you can apply this tag to tables or columns and sync it back to Databricks using this automation. Any tag with the format key:value that is applied on DataHub will be synced back to Databricks in proper key, value form.
If you apply a tag without a separator colon in DataHub (e.g. has_pii), it will be synced back to Databricks with the key being has_pii and value being empty.
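To illustrate the mapping, the Unity Catalog tag assignments below are what the sync produces for the two DataHub tags discussed above (the table name is a hypothetical example):

```sql
-- DataHub tag `has_pii: true` is synced back as a key/value tag
ALTER TABLE your_catalog.your_schema.orders SET TAGS ('has_pii' = 'true');

-- DataHub tag `has_pii` (no colon separator) is synced back with an empty value
ALTER TABLE your_catalog.your_schema.orders SET TAGS ('has_pii' = '');
```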
This is usually because you've already overridden the description inside DataHub for this table. DataHub assumes that it will be the source of truth for documentation, which means that any edits that have taken place in the DataHub UI (or via API) will take precedence over changes provided in Databricks. When you change the description in DataHub, the description change will overwrite the latest description in Databricks if this automation is enabled.
But fear not - you can always view the original underlying Databricks description underneath the DataHub description in the DataHub UI, even when it changes.
Currently, no. Sync back is limited to Tags, to keep the concepts aligned more simply across both platforms. Reach out if you'd benefit from this capability!