
Unity Catalog Pre

metadata-ingestion/docs/sources/databricks/unity-catalog_pre.md


Overview

The unity-catalog module ingests metadata from Databricks into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.

Prerequisites

  • Get your Databricks instance's workspace URL
  • Create a Databricks Service Principal
    • You can skip this step and use your own account to get things running quickly, but we strongly recommend creating a dedicated service principal for production use.

Authentication Options

You can authenticate with Databricks using OAuth, Azure authentication, a Personal Access Token (legacy), or Databricks unified authentication:

Option 1: OAuth

Option 2: Azure Authentication (for Azure Databricks)

  • Create an Azure Active Directory application:
  • Grant the Azure AD application access to your Databricks workspace:
    • Add the service principal to your Databricks workspace following this guide

Option 3: Personal Access Token (PAT) (legacy)

Option 4: Unified authentication

Provision your service principal:

  • To ingest your workspace's metadata and lineage, your service principal must have all of the following:
    • One of: metastore admin role, ownership of, or USE CATALOG privilege on any catalogs you want to ingest
    • One of: metastore admin role, ownership of, or USE SCHEMA privilege on any schemas you want to ingest
    • Ownership of or SELECT privilege on any tables and views you want to ingest
    • Ownership documentation
    • Privileges documentation
  • To ingest the legacy hive_metastore catalog (include_hive_metastore, enabled by default), your service principal must have all of the following:
    • READ_METADATA and USAGE privilege on hive_metastore catalog
    • READ_METADATA and USAGE privilege on schemas you want to ingest
    • READ_METADATA and USAGE privilege on tables and views you want to ingest
    • Hive Metastore Privileges documentation
  • To ingest your workspace's notebooks and respective lineage, your service principal must have CAN_READ privileges on the folders containing the notebooks you want to ingest: guide.
  • To ingest usage statistics (include_usage_statistics, enabled by default), your service principal must have one of the following:
    • CAN_MANAGE permissions on any SQL Warehouses you want to ingest: guide.
    • SELECT privilege on the system.query.history table, when usage_data_source is set to SYSTEM_TABLES or AUTO (default) with warehouse_id configured. This path performs better with large query volumes and multi-workspace setups.
  • To ingest profiling information with the default SQLAlchemy profiler (method: sqlalchemy), you need SELECT privilege on tables and views.
  • To ingest profiling information with method: ge (requires pip install 'acryl-datahub[profiling-ge]'), you need SELECT privileges on all profiled tables.
  • To ingest profiling information with method: analyze and call_analyze: true (enabled by default), your service principal must have ownership or MODIFY privilege on any tables you want to profile.
    • Alternatively, you can run ANALYZE TABLE yourself on any tables you want to profile, then set call_analyze to false. You will still need SELECT privilege on those tables to fetch the results.
  • Check the starter recipe below and replace workspace_url and either token (for PAT authentication) or azure_auth credentials (for Azure authentication) with your information from the previous steps.
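The Unity Catalog privileges listed above can be sketched as GRANT statements. The following is a minimal illustration, not an official DataHub or Databricks utility: the helper names (`quote`, `unity_catalog_grants`), the principal name `datahub-ingestion-sp`, and the catalog, schema, and table names are all hypothetical. It covers only the USE CATALOG / USE SCHEMA / SELECT path, not the metastore admin or ownership alternatives, and it does not address hive_metastore, notebook, or SQL Warehouse permissions.

```python
from typing import Iterable, List, Tuple


def quote(name: str) -> str:
    """Backtick-quote an identifier or principal, escaping embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def unity_catalog_grants(
    principal: str, tables: Iterable[Tuple[str, str, str]]
) -> List[str]:
    """Render GRANT statements for (catalog, schema, table) triples.

    Hypothetical helper: it only builds SQL text and never connects to
    Databricks. Review the output before running it in a SQL editor.
    """
    who = quote(principal)
    statements: List[str] = []
    for catalog, schema, table in tables:
        candidates = [
            f"GRANT USE CATALOG ON CATALOG {quote(catalog)} TO {who};",
            f"GRANT USE SCHEMA ON SCHEMA {quote(catalog)}.{quote(schema)} TO {who};",
            f"GRANT SELECT ON TABLE "
            f"{quote(catalog)}.{quote(schema)}.{quote(table)} TO {who};",
        ]
        # Skip statements already emitted for a shared catalog or schema.
        for statement in candidates:
            if statement not in statements:
                statements.append(statement)
    return statements


if __name__ == "__main__":
    # Assumed example objects; replace with the objects you want to ingest.
    example_tables = [
        ("main", "sales", "orders"),
        ("main", "sales", "customers"),
    ]
    for line in unity_catalog_grants("datahub-ingestion-sp", example_tables):
        print(line)
```

Generating the statements from a list of tables keeps the catalog and schema grants deduplicated, so two tables in the same schema produce four statements rather than six.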

Permissions for DataHub Cloud Assertions (Observe)

If you plan to use DataHub Cloud's Freshness, Volume, or Column Assertions on Databricks, the required Unity Catalog privileges depend on which Source you select in the assertion builder:

| Source Type | Required Privilege(s) | Notes |
| --- | --- | --- |
| Table Statistics | MODIFY (or ownership) on the target table | Runs ANALYZE TABLE ... COMPUTE STATISTICS followed by DESCRIBE TABLE EXTENDED. On Delta tables this is metadata-only (reads file-level stats from the transaction log). Tables only, not Views. Default Volume Source. |
| Information Schema | USE CATALOG + USE SCHEMA on the containing catalog/schema, plus SELECT on system.information_schema.tables | Queries the Unity Catalog information_schema.tables view. Tables only, not Views. |
| Audit Log | SELECT on system.access.audit (requires Unity Catalog system schemas to be enabled) | Reads workspace audit events. Tables only. |
| File Metadata | SELECT on the target table | Reads file-level modification time via Delta transaction log metadata. Delta tables only. |
| Query / Last Modified Column / High Watermark Column / Field Value | SELECT on the target table | Runs SQL queries against the table. Works for Tables and Views. |
| DataHub Operation / DataHub Dataset Profile | (none) | Uses DataHub metadata only, no Databricks access needed. |

In addition, the service principal used for assertion evaluation needs USE CATALOG and USE SCHEMA on the catalog and schema containing the target tables, and must be granted access to a SQL Warehouse (CAN_USE permission) to run statements.
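As a quick reference, the table and paragraph above can be condensed into a lookup. This is a sketch only: the names `SOURCE_PRIVILEGES`, `BASELINE`, and `required_privileges` are hypothetical and not part of DataHub, and the code assumes that sources which use DataHub metadata only do not need the baseline Databricks privileges either.

```python
from typing import Dict, List

# Source-specific privileges, transcribed from the table above.
SOURCE_PRIVILEGES: Dict[str, List[str]] = {
    "Table Statistics": ["MODIFY (or ownership) on the target table"],
    "Information Schema": ["SELECT on system.information_schema.tables"],
    "Audit Log": ["SELECT on system.access.audit"],
    "File Metadata": ["SELECT on the target table"],
    "Query": ["SELECT on the target table"],
    "Last Modified Column": ["SELECT on the target table"],
    "High Watermark Column": ["SELECT on the target table"],
    "Field Value": ["SELECT on the target table"],
    "DataHub Operation": [],
    "DataHub Dataset Profile": [],
}

# Privileges needed by any source that runs statements on Databricks.
BASELINE: List[str] = [
    "USE CATALOG on the catalog containing the target table",
    "USE SCHEMA on the schema containing the target table",
    "CAN_USE on a SQL Warehouse",
]


def required_privileges(source_type: str) -> List[str]:
    """Return baseline plus source-specific privileges for one source type.

    Raises KeyError for a source type that is not in the table.
    Assumption: DataHub-only sources return an empty list because they
    never touch Databricks.
    """
    specific = SOURCE_PRIVILEGES[source_type]
    if not specific:
        return []
    return BASELINE + specific


if __name__ == "__main__":
    for privilege in required_privileges("Audit Log"):
        print(privilege)
```

Note that CAN_USE on a SQL Warehouse is a workspace-level permission rather than a Unity Catalog privilege, so it is typically assigned through the warehouse's permission settings rather than a GRANT statement.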