Snowflake Post

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Snowflake Shares

If you are using Snowflake Shares to share data across different Snowflake accounts, and you have set up DataHub recipes for ingesting metadata from all these accounts, you may end up having multiple similar dataset entities corresponding to virtual versions of the same table in different Snowflake accounts. The DataHub Snowflake connector can automatically link such tables together through Siblings and Lineage relationships if the user provides information necessary to establish the relationship using the shares configuration in the recipe.

Example

Snowflake account account1 (ingested as platform_instance instance1) owns a database db1. A share X is created in account1 that includes database db1 along with schemas and tables inside it.
Now, X is shared with Snowflake account account2 (ingested as platform_instance instance2). A database db1_from_X is created from inbound share X in account2. In this case, all tables and views included in share X will also be present in instance2.db1_from_X.

This can be represented in shares configuration section as

yaml

shares:
  X: # name of the share
    database: db1
    platform_instance: instance1
    consumers: # list of all databases created from share X
      - database: db1_from_X
        platform_instance: instance2

If share X is shared with more Snowflake accounts and a database is created from share X in those accounts, then additional entries need to be added to the consumers list for share X, one per Snowflake account. The same shares config can then be copied across recipes for all accounts.

Lineage and Usage

DataHub supports two strategies for extracting lineage and usage information from Snowflake:

New Strategy (Default - `use_queries_v2: true`)

The default and recommended approach uses an optimized query extraction method that:

Better Performance: Fetches query logs in a single optimized query instead of multiple separate queries
Enhanced Features:
- Query entities generation (include_queries)
- Query popularity statistics (include_query_usage_statistics)
- User filtering with patterns (pushdown_deny_usernames, pushdown_allow_usernames)
- Database pattern pushdown for performance (push_down_database_pattern_access_history)
- Query deduplication strategies (query_dedup_strategy)

Legacy Strategy (`use_queries_v2: false`)

The older approach that will be deprecated in future versions:

Uses separate extractors for lineage and usage
Less performant due to multiple query executions
Limited feature support compared to the new strategy

Both strategies access the same Snowflake system tables (account_usage.query_history, account_usage.access_history), but the new strategy provides significant performance improvements and additional functionality.

Snowflake Streams as Upstream Lineage Sources

DataHub extracts lineage when a query reads from a Snowflake Stream. Coverage details:

Multi-target INSERT ALL from a Stream — emits one lineage entry per downstream table, including column-level lineage. Requires use_queries_v2: true (the default).
Single-target queries reading from a Stream — fall back to SQL parsing of the query text rather than direct extraction from the audit log.
Audit log placeholder names — Snowflake occasionally emits placeholder object names ($SYS_VIEW_<id> or other $-prefixed names) for stream-driven queries. When this happens for a given row, that row falls back to SQL parsing so DataHub never builds lineage from unusable URNs.

The Stream entity itself is also extracted as a top-level dataset; the lineage above is in addition to that.

Metadata Pattern Pushdown

When ingesting metadata from large Snowflake environments, you can improve performance by pushing down pattern filters directly to Snowflake SQL queries using the push_down_metadata_patterns configuration option.

Note: This option applies only to metadata extraction queries (information_schema.databases, schemata, tables, views). For pushing down filters on lineage/usage queries (account_usage.access_history), use push_down_database_pattern_access_history instead. These two options are independent and target completely separate query paths.

Configuration

yaml

source:
  type: snowflake
  config:
    # Enable pattern pushdown for improved performance
    push_down_metadata_patterns: true

    # Your existing patterns - MUST follow Snowflake RLIKE syntax
    database_pattern:
      allow:
        - "PROD_.*" # Matches databases starting with PROD_
        - "ANALYTICS.*" # Matches databases starting with ANALYTICS
      deny:
        - ".*_TEMP$" # Excludes databases ending with _TEMP

    table_pattern:
      allow:
        - ".*" # Allow all tables
      deny:
        - ".*_BACKUP$" # Exclude tables ending with _BACKUP

View Pattern Limitation

By default, Snowflake views are fetched using SHOW VIEWS, which does not support SQL-level filtering. When push_down_metadata_patterns: true, the view_pattern is pushed down only if fetch_views_from_information_schema: true is also set. Otherwise, view filtering falls back to Python's re.match(), even with pushdown enabled.

This means if you write patterns in Snowflake RLIKE syntax (e.g., PROD.* for prefix matching), they will still work correctly with Python filtering since PROD.* is valid in both. However, patterns that rely on RLIKE's full-string matching semantics (e.g., exact match PROD_DB without .*) will behave differently — Python re.match() treats it as a prefix match.

Recommendation: If you enable push_down_metadata_patterns, also enable fetch_views_from_information_schema: true to ensure consistent behavior for view patterns.

Important: Snowflake RLIKE Syntax Differences

When push_down_metadata_patterns: true, patterns are executed in Snowflake using the RLIKE operator instead of Python's re.match(). These have different matching behaviors:

Behavior	Python `re.match()`	Snowflake `RLIKE`	Fix for Snowflake
Prefix match	`'PROD'` matches `'PROD_DB'`	`'PROD'` does NOT match `'PROD_DB'`	Use `PROD.*`
Suffix match	`'.*_BACKUP'` matches `'TABLE_BACKUP_V2'`	Does NOT match	Use `.*_BACKUP$`
Start anchor	`'^PROD'` matches `'PROD_DB'`	Does NOT match	Use `PROD.*`
Alternation	`'PROD\|DEV'` matches `'PROD_DB'`	Does NOT match	Use `(PROD\|DEV).*`
Case sensitivity	Case-insensitive by default	Case-sensitive by default	Handled automatically

Key difference: Python re.match() anchors at the START only (prefix matching), while Snowflake RLIKE requires a FULL STRING match.

Pattern Conversion Examples

Intent	Without Pushdown (Python)	With Pushdown (Snowflake RLIKE)
Starts with `PROD`	`PROD`	`PROD.*`
Ends with `_BACKUP`	`.*_BACKUP`	`.*_BACKUP$`
Contains `TEMP`	`.TEMP.`	`.TEMP.` (same)
Exact match	`PROD_DB`	`PROD_DB` (same)
Match `PROD` or `DEV` prefix	`PROD\|DEV`	`(PROD\|DEV).*`

Testing Your Patterns

Before enabling pushdown in production, test your patterns in Snowflake:

sql

-- Test prefix matching
SELECT 'PROD_DB' RLIKE 'PROD';      -- FALSE (needs .*)
SELECT 'PROD_DB' RLIKE 'PROD.*';    -- TRUE

-- Test suffix matching
SELECT 'TABLE_BACKUP_V2' RLIKE '.*_BACKUP';    -- FALSE (needs $)
SELECT 'TABLE_BACKUP' RLIKE '.*_BACKUP$';      -- TRUE

-- Test FQN with escaped dots
SELECT 'PROD_DB.PUBLIC.TABLE' RLIKE 'PROD_DB\\.PUBLIC\\..*';  -- TRUE

Semantic Views

DataHub supports ingestion of Snowflake Semantic Views, which are business-defined views that define metrics, dimensions, and relationships for consistent data modeling and AI-powered analytics.

Configuration

Semantic view ingestion is disabled by default (requires Snowflake Enterprise Edition or above). You can enable it using the following configuration options:

yaml

# Enable semantic view ingestion (requires Enterprise Edition)
semantic_views:
  enabled: true # Default: false
  column_lineage: true # Default: false - enable column-level lineage

# Filter semantic views using regex patterns
semantic_view_pattern:
  allow:
    - "ANALYTICS_DB.PUBLIC.*"
    - "SALES_DB.*"
  deny:
    - ".*_INTERNAL"

Features

Metadata Extraction: Extracts semantic view definitions (YAML), columns, comments, and timestamps
Lineage Support: Semantic views participate in lineage extraction like regular views
Tags Support: Tags applied to semantic views are extracted if extract_tags is enabled
External URLs: Direct links to Snowflake Snowsight UI for semantic views

Requirements

Semantic views require appropriate Snowflake edition and privileges
Requires REFERENCES or SELECT privileges on semantic views (they are treated as views in Snowflake's permission model)
The semantic view definition (SQL DDL) is extracted when available through the GET_DDL function

Stages, Tasks, and Pipes

DataHub supports ingestion of Snowflake Stages, Tasks, and Snowpipe objects. All three features are disabled by default and can be enabled independently.

Stages (`include_stages: true`)

Stages are ingested as containers nested under their parent schema. Internal stages additionally emit a placeholder dataset representing the staged data, which is used for pipe lineage resolution. External stages (S3, GCS, Azure) resolve their URLs to the corresponding cloud platform dataset URN.

yaml

include_stages: true
stage_pattern:
  allow:
    - "MY_DB.MY_SCHEMA.*"

Tasks (`include_tasks: true`)

Tasks are ingested as DataJob entities grouped under a per-schema DataFlow. Predecessor dependencies between tasks are captured as inputDatajobs on the DataJobInputOutput aspect, preserving the DAG structure.

yaml

include_tasks: true
task_pattern:
  allow:
    - "MY_DB.MY_SCHEMA.*"

Pipes (`include_pipes: true`)

Snowpipe objects are ingested as DataJob entities with lineage derived from parsing the COPY INTO statement. The pipe's source stage resolves to an upstream dataset (internal placeholder or external cloud URN) and the target table resolves to a downstream dataset. Enabling pipes automatically scans stages for lineage resolution, even if include_stages is false.

The parser handles the following COPY INTO patterns:

Column-list targets — COPY INTO t(a, b) FROM @stage resolves the target table correctly
Subquery sources — COPY INTO t FROM (SELECT ... FROM @s1 UNION ALL SELECT ... FROM @s2) captures all referenced stages as upstream datasets
Non-COPY pipe bodies are silently skipped; stage refs that cannot be normalized to a three-part FQN emit a warning in the ingestion report

yaml

include_pipes: true
pipe_pattern:
  allow:
    - "MY_DB.MY_SCHEMA.*"

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.

Some features require specific Snowflake editions or additional privileges. This includes dynamic tables, semantic views, advanced lineage features, and tags.
Dynamic tables require the monitor privilege for metadata extraction. Without this privilege, dynamic tables will not be visible to DataHub.
Semantic views require REFERENCES or SELECT privileges for metadata extraction. Without these privileges, semantic views will not be visible to DataHub.
The underlying Snowflake views that we use to get metadata have a latency of 45 minutes to 3 hours. So we would not be able to get very recent metadata in some cases like queries you ran within that time period etc. This is applicable particularly for lineage, usage and tags (without lineage) extraction.
If there is any ongoing Snowflake incident, we will not be able to get the metadata until that incident is resolved.
Lineage extraction, when got directly from Snowflake access history, has some limitations, as documented here, here, and here.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

Snowsight Links Point to the Wrong URL (Private Link)

By default, DataHub generates Snowsight links of the form https://app.snowflake.com/<region>/<account>/.... If Snowsight is only reachable through private link in your environment, these links will not work for your users.

To fix this, set snowsight_base_url in the recipe to your private-link Snowsight URL. Retrieve the value from Snowflake as ACCOUNTADMIN:

sql

SELECT SYSTEM$GET_PRIVATELINK_CONFIG();
-- Use either snowsight-privatelink-url or regionless-snowsight-privatelink-url.

yaml

source:
  type: snowflake
  config:
    snowsight_base_url: "https://app.us-east-1.privatelink.snowflakecomputing.com/"

Capabilities

Snowflake Shares

Example

Lineage and Usage

New Strategy (Default - use_queries_v2: true)

Legacy Strategy (use_queries_v2: false)

Snowflake Streams as Upstream Lineage Sources

Metadata Pattern Pushdown

Configuration

View Pattern Limitation

Important: Snowflake RLIKE Syntax Differences

Pattern Conversion Examples

Testing Your Patterns

Semantic Views

Configuration

Features

Requirements

Stages, Tasks, and Pipes

Stages (include_stages: true)

Tasks (include_tasks: true)

Pipes (include_pipes: true)

Limitations

Troubleshooting

Snowsight Links Point to the Wrong URL (Private Link)

New Strategy (Default - `use_queries_v2: true`)

Legacy Strategy (`use_queries_v2: false`)

Stages (`include_stages: true`)

Tasks (`include_tasks: true`)

Pipes (`include_pipes: true`)