metadata-ingestion/docs/sources/snowflake/snowflake_post.md
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
If you are using Snowflake Shares to share data across different Snowflake accounts, and you have set up DataHub recipes for ingesting metadata from all these accounts, you may end up having multiple similar dataset entities corresponding to virtual versions of the same table in different Snowflake accounts. The DataHub Snowflake connector can automatically link such tables together through Siblings and Lineage relationships if the user provides information necessary to establish the relationship using the shares configuration in the recipe.
account1 (ingested as platform_instance instance1) owns a database db1. A share X is created in account1 that includes database db1 along with schemas and tables inside it.X is shared with Snowflake account account2 (ingested as platform_instance instance2). A database db1_from_X is created from inbound share X in account2. In this case, all tables and views included in share X will also be present in instance2.db1_from_X.shares configuration section as
shares:
X: # name of the share
database: db1
platform_instance: instance1
consumers: # list of all databases created from share X
- database: db1_from_X
platform_instance: instance2
X is shared with more Snowflake accounts and a database is created from share X in those accounts, then additional entries need to be added to the consumers list for share X, one per Snowflake account. The same shares config can then be copied across recipes for all accounts.DataHub supports two strategies for extracting lineage and usage information from Snowflake:
use_queries_v2: true)The default and recommended approach uses an optimized query extraction method that:
include_queries)include_query_usage_statistics)pushdown_deny_usernames, pushdown_allow_usernames)push_down_database_pattern_access_history)query_dedup_strategy)use_queries_v2: false)The older approach that will be deprecated in future versions:
Both strategies access the same Snowflake system tables (account_usage.query_history, account_usage.access_history), but the new strategy provides significant performance improvements and additional functionality.
DataHub extracts lineage when a query reads from a Snowflake Stream. Coverage details:
INSERT ALL from a Stream — emits one lineage entry per downstream table, including column-level lineage. Requires use_queries_v2: true (the default).$SYS_VIEW_<id> or other $-prefixed names) for stream-driven queries. When this happens for a given row, that row falls back to SQL parsing so DataHub never builds lineage from unusable URNs.The Stream entity itself is also extracted as a top-level dataset; the lineage above is in addition to that.
When ingesting metadata from large Snowflake environments, you can improve performance by pushing down pattern filters directly to Snowflake SQL queries using the push_down_metadata_patterns configuration option.
Note: This option applies only to metadata extraction queries (
information_schema.databases,schemata,tables,views). For pushing down filters on lineage/usage queries (account_usage.access_history), usepush_down_database_pattern_access_historyinstead. These two options are independent and target completely separate query paths.
source:
type: snowflake
config:
# Enable pattern pushdown for improved performance
push_down_metadata_patterns: true
# Your existing patterns - MUST follow Snowflake RLIKE syntax
database_pattern:
allow:
- "PROD_.*" # Matches databases starting with PROD_
- "ANALYTICS.*" # Matches databases starting with ANALYTICS
deny:
- ".*_TEMP$" # Excludes databases ending with _TEMP
table_pattern:
allow:
- ".*" # Allow all tables
deny:
- ".*_BACKUP$" # Exclude tables ending with _BACKUP
By default, Snowflake views are fetched using SHOW VIEWS, which does not support SQL-level filtering. When push_down_metadata_patterns: true, the view_pattern is pushed down only if fetch_views_from_information_schema: true is also set. Otherwise, view filtering falls back to Python's re.match(), even with pushdown enabled.
This means if you write patterns in Snowflake RLIKE syntax (e.g., PROD.* for prefix matching), they will still work correctly with Python filtering since PROD.* is valid in both. However, patterns that rely on RLIKE's full-string matching semantics (e.g., exact match PROD_DB without .*) will behave differently — Python re.match() treats it as a prefix match.
Recommendation: If you enable push_down_metadata_patterns, also enable fetch_views_from_information_schema: true to ensure consistent behavior for view patterns.
When push_down_metadata_patterns: true, patterns are executed in Snowflake using the RLIKE operator instead of Python's re.match(). These have different matching behaviors:
| Behavior | Python re.match() | Snowflake RLIKE | Fix for Snowflake |
|---|---|---|---|
| Prefix match | 'PROD' matches 'PROD_DB' | 'PROD' does NOT match 'PROD_DB' | Use PROD.* |
| Suffix match | '.*_BACKUP' matches 'TABLE_BACKUP_V2' | Does NOT match | Use .*_BACKUP$ |
| Start anchor | '^PROD' matches 'PROD_DB' | Does NOT match | Use PROD.* |
| Alternation | 'PROD|DEV' matches 'PROD_DB' | Does NOT match | Use (PROD|DEV).* |
| Case sensitivity | Case-insensitive by default | Case-sensitive by default | Handled automatically |
Key difference: Python re.match() anchors at the START only (prefix matching), while Snowflake RLIKE requires a FULL STRING match.
| Intent | Without Pushdown (Python) | With Pushdown (Snowflake RLIKE) |
|---|---|---|
Starts with PROD | PROD | PROD.* |
Ends with _BACKUP | .*_BACKUP | .*_BACKUP$ |
Contains TEMP | .*TEMP.* | .*TEMP.* (same) |
| Exact match | PROD_DB | PROD_DB (same) |
Match PROD or DEV prefix | PROD|DEV | (PROD|DEV).* |
Before enabling pushdown in production, test your patterns in Snowflake:
-- Test prefix matching
SELECT 'PROD_DB' RLIKE 'PROD'; -- FALSE (needs .*)
SELECT 'PROD_DB' RLIKE 'PROD.*'; -- TRUE
-- Test suffix matching
SELECT 'TABLE_BACKUP_V2' RLIKE '.*_BACKUP'; -- FALSE (needs $)
SELECT 'TABLE_BACKUP' RLIKE '.*_BACKUP$'; -- TRUE
-- Test FQN with escaped dots
SELECT 'PROD_DB.PUBLIC.TABLE' RLIKE 'PROD_DB\\.PUBLIC\\..*'; -- TRUE
DataHub supports ingestion of Snowflake Semantic Views, which are business-defined views that define metrics, dimensions, and relationships for consistent data modeling and AI-powered analytics.
Semantic view ingestion is disabled by default (requires Snowflake Enterprise Edition or above). You can enable it using the following configuration options:
# Enable semantic view ingestion (requires Enterprise Edition)
semantic_views:
enabled: true # Default: false
column_lineage: true # Default: false - enable column-level lineage
# Filter semantic views using regex patterns
semantic_view_pattern:
allow:
- "ANALYTICS_DB.PUBLIC.*"
- "SALES_DB.*"
deny:
- ".*_INTERNAL"
extract_tags is enabledREFERENCES or SELECT privileges on semantic views (they are treated as views in Snowflake's permission model)GET_DDL functionDataHub supports ingestion of Snowflake Stages, Tasks, and Snowpipe objects. All three features are disabled by default and can be enabled independently.
include_stages: true)Stages are ingested as containers nested under their parent schema. Internal stages additionally emit a placeholder dataset representing the staged data, which is used for pipe lineage resolution. External stages (S3, GCS, Azure) resolve their URLs to the corresponding cloud platform dataset URN.
include_stages: true
stage_pattern:
allow:
- "MY_DB.MY_SCHEMA.*"
include_tasks: true)Tasks are ingested as DataJob entities grouped under a per-schema DataFlow. Predecessor dependencies between tasks are captured as inputDatajobs on the DataJobInputOutput aspect, preserving the DAG structure.
include_tasks: true
task_pattern:
allow:
- "MY_DB.MY_SCHEMA.*"
include_pipes: true)Snowpipe objects are ingested as DataJob entities with lineage derived from parsing the COPY INTO statement. The pipe's source stage resolves to an upstream dataset (internal placeholder or external cloud URN) and the target table resolves to a downstream dataset. Enabling pipes automatically scans stages for lineage resolution, even if include_stages is false.
The parser handles the following COPY INTO patterns:
COPY INTO t(a, b) FROM @stage resolves the target table correctlyCOPY INTO t FROM (SELECT ... FROM @s1 UNION ALL SELECT ... FROM @s2) captures all referenced stages as upstream datasetsCOPY pipe bodies are silently skipped; stage refs that cannot be normalized to a three-part FQN emit a warning in the ingestion reportinclude_pipes: true
pipe_pattern:
allow:
- "MY_DB.MY_SCHEMA.*"
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
monitor privilege for metadata extraction. Without this privilege, dynamic tables will not be visible to DataHub.REFERENCES or SELECT privileges for metadata extraction. Without these privileges, semantic views will not be visible to DataHub.If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
By default, DataHub generates Snowsight links of the form https://app.snowflake.com/<region>/<account>/.... If Snowsight is only reachable through private link in your environment, these links will not work for your users.
To fix this, set snowsight_base_url in the recipe to your private-link Snowsight URL. Retrieve the value from Snowflake as ACCOUNTADMIN:
SELECT SYSTEM$GET_PRIVATELINK_CONFIG();
-- Use either snowsight-privatelink-url or regionless-snowsight-privatelink-url.
source:
type: snowflake
config:
snowsight_base_url: "https://app.us-east-1.privatelink.snowflakecomputing.com/"