Back to Datahub

README

metadata-ingestion/docs/sources/snowflake/README.md

1.6.05.9 KB
Original Source

Overview

Snowflake is a data platform used to store and query analytical or operational data. Learn more in the official Snowflake documentation.

The DataHub integration for Snowflake covers core metadata entities such as datasets/tables/views, schema fields, and containers. It also captures table- and column-level lineage, usage statistics, data profiling, tags, and stateful deletion detection.

:::info Snowflake Ingestion through the UI

The following video shows you how to ingest Snowflake metadata through the UI.

<div style={{ position: "relative", paddingBottom: "56.25%", height: 0 }}> <iframe src="https://www.loom.com/embed/15d0401caa1c4aa483afef1d351760db" frameBorder={0} webkitallowfullscreen="" mozallowfullscreen="" allowFullScreen="" style={{ position: "absolute", top: 0, left: 0, width: "100%", height: "100%" }} /> </div>

Read on if you are interested in ingesting Snowflake metadata using the datahub cli, or want to learn about all the configuration parameters that are supported by the connectors. :::

Concept Mapping

Snowflake ConceptDataHub Entity (Subtype)Notes
AccountPlatform InstanceTop-level scope; all URNs include the configured platform instance.
DatabaseContainer (DATABASE)Top-level namespace. Ingested with description, tags, and Snowsight URL.
SchemaContainer (SCHEMA)Nested under its Database container.
TableDataset (TABLE)Includes regular, Iceberg, and hybrid tables. Schema, PKs/FKs, tags, and descriptions are extracted.
Dynamic TableDataset (DYNAMIC TABLE)Includes target lag, SQL definition, and lineage to source tables.
ViewDataset (VIEW)Standard, materialized, and secure views. View definition is captured.
Semantic ViewDataset (SEMANTIC VIEW)Columns classified as DIMENSION, FACT, or METRIC. Column-level lineage to physical tables is extracted.
StreamDataset (SNOWFLAKE STREAM)Change-data-capture stream. Adds METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID columns and lineage to the source table.
External TableDataset (TABLE)Lineage to the backing cloud storage location is emitted when available.
Internal StageContainer (SNOWFLAKE STAGE) + Dataset (SNOWFLAKE STAGE DATA)Emits both a Container (organizational) and a Dataset (for the resident data).
External StageContainer (SNOWFLAKE STAGE)Container only; the backing cloud storage asset (S3/GCS/Azure) is referenced via lineage.
TaskDataFlow (SNOWFLAKE TASK GROUP) + DataJob (SNOWFLAKE TASK)One DataFlow per schema; each task is a DataJob. Predecessor relationships appear as DataJob inputs.
PipeDataFlow (SNOWFLAKE PIPE GROUP) + DataJob (SNOWFLAKE PIPE)One DataFlow per schema; each pipe is a DataJob linking a stage to a target table via lineage.
Streamlit AppDashboard (STREAMLIT)App name, owner, and Snowsight URL captured as custom properties.
Column / fieldSchemaFieldColumn type, nullability, descriptions, and tags are extracted where available.
RoleCorpGroupOwnership roles are mapped to urn:li:corpGroup:{role_name}.
TagTag or Structured PropertyControlled by extract_tags config. Tags support database/schema/table/column inheritance.
Table- and column-level lineageLineage edgesExtracted from view definitions, dynamic table definitions, and SQL query history.
Query operations and usageDatasetUsageStatistics, OperationPer-dataset query counts, user access patterns, and DML operation metrics.