README

metadata-ingestion/docs/sources/bigquery/README.md

Overview

BigQuery is a data platform used to store and query analytical or operational data. Learn more in the official BigQuery documentation.

The DataHub integration for BigQuery covers core metadata entities such as datasets/tables/views, schema fields, and containers. It also captures table- and column-level lineage, usage statistics, data profiling, and stateful deletion detection.

Concept Mapping

| BigQuery Concept | DataHub Entity (Subtype) | Notes |
| --- | --- | --- |
| GCP Project | Platform Instance, Container (PROJECT) | Project ID is used as both the platform instance and the top-level container. |
| Dataset | Container (DATASET) | Nested under its Project container. Includes location and labels. |
| Table | Dataset (TABLE) | Regular and partitioned tables. Schema, descriptions, PKs/FKs, partition keys, clustering columns, and labels are extracted. |
| Sharded Table | Dataset (SHARDED TABLE) | Tables with a `_yyyymmdd` suffix pattern; grouped under a shared URN prefix. |
| External Table | Dataset (EXTERNAL TABLE) | Includes source format, URIs, and compression properties. |
| Table Snapshot | Dataset (BIGQUERY TABLE SNAPSHOT) | Identified by name pattern `{table}@{timestamp_ms}`. |
| View | Dataset (VIEW) | SQL definition captured in ViewProperties. `materialized=false`. |
| Materialized View | Dataset (VIEW) | SQL definition captured. `materialized=true` in ViewProperties. |
| Column / field | SchemaField | Includes partition key, clustering key, PKs, FKs, and policy tags where available. |
| Label | Tag | Table, view, and dataset labels captured as tags when `capture_*_label_as_tag` is enabled. |
| Policy tag (Data Catalog) | Tag | Column-level policy tags extracted when `extract_policy_tags_from_catalog` is enabled. |
| Table / column lineage | Lineage edges | From view definitions and audit log query history. |
| Query operations and usage | DatasetUsageStatistics, Operation | Per-dataset and per-column access counts extracted from audit logs. |
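
The two name patterns in the mapping (the `_yyyymmdd` shard suffix and the `{table}@{timestamp_ms}` snapshot form) can be illustrated with a small Python sketch. This is not the connector's actual implementation: the regular expressions and the helper names `split_shard`, `split_snapshot`, and `group_shards` are hypothetical, and the sketch assumes the shard suffix is exactly eight digits without checking that they form a valid calendar date.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Assumed pattern: base name, underscore, then eight digits (yyyymmdd).
SHARD_RE = re.compile(r"^(?P<base>.+)_(?P<shard>\d{8})$")
# Assumed pattern: table name, "@", then a millisecond timestamp.
SNAPSHOT_RE = re.compile(r"^(?P<table>[^@]+)@(?P<ts_ms>\d+)$")


def split_shard(name: str) -> Optional[Tuple[str, str]]:
    """Return (base_name, shard_suffix) if the name looks like a shard, else None."""
    match = SHARD_RE.match(name)
    return (match.group("base"), match.group("shard")) if match else None


def split_snapshot(name: str) -> Optional[Tuple[str, int]]:
    """Return (table, timestamp_ms) if the name looks like a snapshot, else None."""
    match = SNAPSHOT_RE.match(name)
    return (match.group("table"), int(match.group("ts_ms"))) if match else None


def group_shards(names: List[str]) -> Dict[str, List[str]]:
    """Group shard suffixes under their shared base name; non-shards are skipped."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        parsed = split_shard(name)
        if parsed is not None:
            base, shard = parsed
            groups[base].append(shard)
    return {base: sorted(shards) for base, shards in groups.items()}
```

Under these assumptions, `events_20240101` and `events_20240102` collapse into a single `events` group, mirroring how sharded tables share one URN prefix, while a name such as `users` is left as a regular table.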