Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
BDP Console connection options (`bdp_connection`):

| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| organization_id | string | ✅ | - | Organization UUID from Console URL |
| api_key_id | string | ✅ | - | API Key ID from Console credentials |
| api_key | string | ✅ | - | API Key secret |
| console_api_url | string | | https://console.snowplowanalytics.com/api/msc/v1 | BDP Console API base URL |
| timeout_seconds | int | | 60 | Request timeout in seconds |
| max_retries | int | | 3 | Maximum retry attempts |
Iglu registry connection options (`iglu_connection`):

| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| iglu_server_url | string | ✅ | - | Iglu Server base URL |
| api_key | string | | - | API key for private Iglu registries (UUID format) |
| timeout_seconds | int | | 30 | Request timeout in seconds |
Note: Iglu-only mode uses automatic schema discovery via the `/api/schemas` endpoint (requires Iglu Server 0.6+). All schemas in the registry are discovered automatically.
Extraction options:

| Option | Type | Default | Description | Required Permission |
|---|---|---|---|---|
| extract_event_specifications | bool | true | Extract event specifications | read:event-specs |
| extract_tracking_scenarios | bool | true | Extract tracking scenarios | read:tracking-scenarios |
| extract_tracking_plans | bool | true | Extract tracking plans | read:data-products |
| extract_pipelines | bool | true | Extract pipelines as DataFlow entities | read:pipelines |
| extract_enrichments | bool | true | Extract enrichments as DataJob entities with lineage | read:enrichments |
| enrichment_owner | string | None | Default owner email for enrichment DataJobs | N/A |
| include_hidden_schemas | bool | false | Include schemas marked as hidden | N/A |
| include_version_in_urn | bool | false | Include version in dataset URN (legacy behavior) | N/A |
| extract_standard_schemas | bool | true | Extract Snowplow standard schemas from Iglu Central | N/A |
| iglu_central_url | string | http://iglucentral.com | URL for fetching standard schemas | N/A |
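For instance, a sketch combining these flags for an API key that lacks some of the permissions above (the owner email is hypothetical):

```yaml
source:
  type: snowplow
  config:
    # Skip extractions this API key has no permission for
    extract_event_specifications: false # would need read:event-specs
    extract_pipelines: false            # would need read:pipelines
    # Default owner for enrichment DataJobs (hypothetical address)
    enrichment_owner: "analytics-team@example.com"
```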
Schema extraction options:

| Option | Type | Default | Description |
|---|---|---|---|
| schema_types_to_extract | list | ["event", "entity"] | Schema types to extract |
| deployed_since | string | None | Only extract schemas deployed since this ISO 8601 timestamp |
| schema_page_size | int | 100 | Number of schemas per API page |
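As an illustration, a sketch restricting extraction to recently deployed event schemas (the timestamp is a placeholder):

```yaml
source:
  type: snowplow
  config:
    schema_types_to_extract: ["event"]     # skip entity schemas
    deployed_since: "2024-01-01T00:00:00Z" # ISO 8601 placeholder
    schema_page_size: 200                  # larger pages mean fewer API calls
```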
⚠️ Note: Disabled by default. Prefer warehouse connectors (Snowflake, BigQuery) for column-level lineage.
Warehouse lineage options (`warehouse_lineage`):

| Option | Type | Default | Description | Required Permission |
|---|---|---|---|---|
| warehouse_lineage.enabled | bool | false | Extract table-level lineage via the Data Models API | read:data-products |
| warehouse_lineage.platform_instance | string | None | Default platform instance for warehouse URNs | N/A |
| warehouse_lineage.env | string | PROD | Default environment for warehouse datasets | N/A |
| warehouse_lineage.validate_urns | bool | true | Validate that warehouse URNs exist in DataHub | DataHub Graph API access |
| warehouse_lineage.destination_mappings | list | [] | Per-destination platform instance overrides | N/A |
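The exact shape of `destination_mappings` entries is not documented on this page; as an assumption, each entry pairs a destination identifier with a platform instance override, along these lines:

```yaml
warehouse_lineage:
  enabled: true
  platform_instance: "prod_snowflake" # default instance for all destinations
  destination_mappings:
    # Assumed entry shape; field names are illustrative, not confirmed here
    - destination: "analytics-bigquery"
      platform_instance: "prod_bigquery"
```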
Field tagging options (`field_tagging`):

| Option | Type | Default | Description |
|---|---|---|---|
| field_tagging.enabled | bool | true | Enable automatic field tagging |
| field_tagging.tag_schema_version | bool | true | Tag fields with schema version |
| field_tagging.tag_event_type | bool | true | Tag fields with event type |
| field_tagging.tag_data_class | bool | true | Tag fields with data classification (PII, Sensitive) |
| field_tagging.tag_authorship | bool | true | Tag fields with authorship info |
| field_tagging.track_field_versions | bool | false | Track which version each field was added in |
| field_tagging.use_structured_properties | bool | true | Use structured properties instead of tags |
| field_tagging.emit_tags_and_structured_properties | bool | false | Emit both tags and structured properties |
| field_tagging.pii_tags_only | bool | false | Only emit tags for PII fields when emitting both |
| field_tagging.use_pii_enrichment | bool | true | Extract PII fields from the PII Pseudonymization enrichment |
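For example, a sketch that emits both tags and structured properties while limiting tags to PII fields:

```yaml
field_tagging:
  enabled: true
  use_structured_properties: true
  emit_tags_and_structured_properties: true
  pii_tags_only: true      # when emitting both, only tag PII fields
  use_pii_enrichment: true # derive PII fields from the PII Pseudonymization enrichment
```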
Performance options (`performance`):

| Option | Type | Default | Description |
|---|---|---|---|
| performance.max_concurrent_api_calls | int | 10 | Maximum concurrent API calls for deployment fetching |
| performance.enable_parallel_fetching | bool | true | Enable parallel fetching of schema deployments |
Filtering options:

| Option | Type | Default | Description |
|---|---|---|---|
| schema_pattern | AllowDenyPattern | Allow all | Filter schemas by vendor/name pattern |
| event_spec_pattern | AllowDenyPattern | Allow all | Filter event specifications by name |
| tracking_scenario_pattern | AllowDenyPattern | Allow all | Filter tracking scenarios by name |
| tracking_plan_pattern | AllowDenyPattern | Allow all | Filter tracking plans by name |
Stateful ingestion options (`stateful_ingestion`):

| Option | Type | Default | Description |
|---|---|---|---|
| stateful_ingestion.enabled | bool | false | Enable stateful ingestion for deletion detection |
| stateful_ingestion.remove_stale_metadata | bool | true | Remove schemas that no longer exist |
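A minimal sketch enabling deletion detection. Note that DataHub's stateful ingestion generally also requires a `pipeline_name` at the top level of the recipe; the name below is a placeholder:

```yaml
pipeline_name: "snowplow_prod_ingestion" # placeholder; identifies the stored state
source:
  type: snowplow
  config:
    # ... connection config ...
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true # soft-delete schemas that no longer exist
```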
Create a recipe file `snowplow_recipe.yml`:
```yaml
source:
  type: snowplow
  config:
    bdp_connection:
      organization_id: "<ORG_UUID>"
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```
Run ingestion:

```shell
datahub ingest -c snowplow_recipe.yml
```
For self-hosted Snowplow with Iglu registry (without BDP Console API):
```yaml
source:
  type: snowplow
  config:
    iglu_connection:
      iglu_server_url: "https://iglu.example.com"
      api_key: "${IGLU_API_KEY}" # Optional for private registries
    schema_types_to_extract:
      - "event"
      - "entity"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```
Important notes for Iglu-only mode:

- Schema discovery uses the `/api/schemas` endpoint (requires Iglu Server 0.6+).

For complete configuration options, see `snowplow_iglu.yml`.
⚠️ Note: This feature is disabled by default and should only be enabled in specific scenarios (see below).
Start from the baseline BDP recipe in 1. BDP Console (Managed Snowplow), then add:
```yaml
source:
  type: snowplow
  config:
    # Enable warehouse lineage via the Data Models API
    warehouse_lineage:
      enabled: true
      platform_instance: "prod_snowflake" # Optional
      env: "PROD"                         # Optional
      validate_urns: true                 # Optional
```
What this creates:

- Table-level lineage from `atomic.events` → derived tables (e.g., `derived.sessions`)

Supported warehouses: Snowflake, BigQuery, Redshift, Databricks.
✅ Enable warehouse lineage if you are not running a dedicated warehouse connector and want quick, table-level documentation of your Data Models.

❌ Don't enable it if you're already using warehouse connectors (Snowflake, BigQuery); they provide richer, column-level lineage.

Best practice: use a warehouse connector for detailed lineage, and enable this feature only for quick documentation of Data Models metadata.
Requirements: Data Models must be configured in your BDP organization.
Snowplow uses SchemaVer (semantic versioning for schemas) with the format MODEL-REVISION-ADDITION:

- MODEL: incremented for breaking schema changes
- REVISION: incremented for changes that may break some consumers
- ADDITION: incremented for fully backward-compatible additions

Example: `1-0-2` is model 1, revision 0, addition 2.
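As a worked illustration (the schema name is hypothetical), successive versions of a `com.example/page_view` schema might evolve like this:

```yaml
# SchemaVer evolution for a hypothetical com.example/page_view schema
- "1-0-0" # initial release
- "1-0-1" # ADDITION: new optional property, fully backward compatible
- "1-1-0" # REVISION: change that may break some consumers
- "2-0-0" # MODEL: breaking change
```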
In DataHub, schemas are represented as `{vendor}.{name}.{version}` (e.g., `com.example.page_view.1-0-0`).

This section explains how Snowplow concepts are modeled as DataHub entities.
| Snowplow Concept | DataHub Entity | DataHub Subtype | Description |
|---|---|---|---|
| Organization | Container | DATABASE | Top-level container for all Snowplow metadata |
| Event Schema | Dataset | snowplow_event_schema | Self-describing event definition (JSON Schema) |
| Entity Schema | Dataset | snowplow_entity_schema | Context/entity schema attached to events |
| Event Specification | Dataset | snowplow_event_spec | Tracking requirement defining what to track |
| Tracking Scenario | Container | (custom) | Logical grouping of related event specifications |
| Tracking Plan | Container | tracking_plan | Business-level tracking plan grouping |
| Pipeline | DataFlow | - | Snowplow data pipeline (Collector → Warehouse) |
| Enrichment | DataJob | - | Data transformation job within a pipeline |
| Collector | DataJob | - | HTTP endpoint receiving tracking events |
| Atomic Events | Dataset | atomic_event | Raw enriched events table in warehouse |
| Parsed Events | Dataset | event | Parsed event data combining all schemas |
Snowplow pipelines are modeled as DataFlow entities with DataJob children representing each processing stage:
```
                 Tracker SDKs (Web, Mobile, Server)
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                           Pipeline (DataFlow)                            │
│               urn:li:dataFlow:(snowplow,pipeline-id,PROD)                │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────┐                                                     │
│  │    Collector    │ ◄── Receives HTTP tracking events                   │
│  │    (DataJob)    │                                                     │
│  └────────┬────────┘                                                     │
│           │                                                              │
│           ▼                                                              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────┐   │
│  │    IP Lookup    │  │    UA Parser    │  │  PII Pseudonymization   │   │
│  │    (DataJob)    │  │    (DataJob)    │  │        (DataJob)        │   │
│  │                 │  │                 │  │                         │   │
│  │ user_ipaddress  │  │    useragent    │  │     user_id, email      │   │
│  │ → geo_*, ip_*   │  │  → br_*, os_*   │  │    → (hashed values)    │   │
│  └────────┬────────┘  └────────┬────────┘  └────────────┬────────────┘   │
│           │                    │                        │                │
│           └────────────────────┼────────────────────────┘                │
│                                ▼                                         │
└──────────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────┐
                    │ Atomic Events (Dataset) │
                    │  Enriched event stream  │
                    └────────────┬────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────┐
                    │    Warehouse Tables     │
                    │  (Snowflake, BigQuery)  │
                    └─────────────────────────┘
```
The connector creates the following lineage relationships:
Event specifications reference the schemas they require:
```
┌──────────────────────────────┐
│         Event Schema         │
│  (vendor.event_name.1-0-0)   │────┐
└──────────────────────────────┘    │    ┌─────────────────────────┐
                                    ├───▶│   Event Specification   │
┌──────────────────────────────┐    │    │ (Tracking Requirement)  │
│        Entity Schema         │────┘    └─────────────────────────┘
│    (vendor.context.1-0-0)    │
└──────────────────────────────┘
```
Enrichments transform specific fields. Example for IP Lookup:
```
┌─────────────────────────────────────────────────────────────────────┐
│                        IP Lookup Enrichment                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Input                                  Output                     │
│   ─────                                  ──────                     │
│                                    ┌─────────────────┐              │
│                                ┌──▶│   geo_country   │              │
│                                │   ├─────────────────┤              │
│   ┌─────────────────┐          │   │    geo_city     │              │
│   │ user_ipaddress  │──────────┼──▶├─────────────────┤              │
│   └─────────────────┘          │   │   geo_region    │              │
│                                │   ├─────────────────┤              │
│                                │   │  geo_latitude   │              │
│                                ├──▶├─────────────────┤              │
│                                │   │  geo_longitude  │              │
│                                │   ├─────────────────┤              │
│                                └──▶│     ip_isp      │              │
│                                    ├─────────────────┤              │
│                                    │ ip_organization │              │
│                                    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────┘
```
Supported enrichments with column-level lineage:

- IP Lookup: `user_ipaddress` → `geo_*`, `ip_*` fields
- UA Parser: `useragent` → `br_*`, `os_*` fields
- YAUAA: `useragent` → browser, OS, and device fields
- Referer Parser: `page_referrer` → `refr_*` fields
- Campaign Attribution: `page_urlquery` → `mkt_*` fields
- Event Fingerprint: → `event_fingerprint`
- IAB Spiders & Robots: `useragent` → `iab_*` classification fields

When `warehouse_lineage.enabled: true`:
```
┌─────────────────────────┐                    ┌─────────────────────────┐
│      Atomic Events      │  Data Models API   │      Derived Table      │
│ (snowplow.atomic.events)│───────────────────▶│ (warehouse.schema.table)│
└─────────────────────────┘                    └─────────────────────────┘
```
```
Organization (Container: DATABASE)
│
├── Event Schema: com.example.page_view.1-0-0 (Dataset)
├── Event Schema: com.example.checkout.1-0-0 (Dataset)
├── Entity Schema: com.example.user_context.1-0-0 (Dataset)
├── Event Specification: "Page View Tracking" (Dataset)
│
├── Tracking Scenario: "Checkout Flow" (Container)
│   ├── Event Specification: "Add to Cart" (Dataset)
│   └── Event Specification: "Purchase Complete" (Dataset)
│
└── Tracking Plan: "Web Analytics" (Container)
    ├── Event Specification (linked)
    └── Schema (linked)
```
| Entity Type | URN Format |
|---|---|
| Organization | urn:li:container:{guid} |
| Event/Entity Schema | urn:li:dataset:(urn:li:dataPlatform:snowplow,vendor.name,ENV) |
| Event Specification | urn:li:dataset:(urn:li:dataPlatform:snowplow,event_spec_id,ENV) |
| Pipeline | urn:li:dataFlow:(snowplow,pipeline-id,ENV) |
| Enrichment/DataJob | urn:li:dataJob:(urn:li:dataFlow:(...),job-id) |
| Tracking Scenario | urn:li:container:{guid} |
Each entity type includes relevant custom properties:

Event/Entity Schemas:

- `vendor`, `name`, `version` (SchemaVer format)
- `schema_type` (event/entity)
- `json_schema` (full JSON Schema definition)
- `deployed_environments` (PROD, DEV, etc.)

Event Specifications:

- `status` (draft, active, deprecated)
- `trigger_conditions`
- `referenced_schemas`

Enrichments:

- `enrichment_type`
- `input_fields`, `output_fields`
- configuration details

Group schemas by environment:
```yaml
source:
  type: snowplow
  config:
    bdp_connection:
      organization_id: "<ORG_UUID>"
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"
    platform_instance: "production"
    env: "PROD"
```
Extract only specific vendor schemas:
```yaml
source:
  type: snowplow
  config:
    # ... connection config ...
    schema_pattern:
      allow:
        - "com\\.example\\..*"       # Allow com.example schemas
        - "com\\.acme\\.events\\..*" # Allow com.acme.events schemas
      deny:
        - ".*\\.test$"               # Deny test schemas
```
Use DataHub's built-in test-connection command:

```shell
datahub check source-connection snowplow \
  --config snowplow_recipe.yml
```
This validates your credentials, permissions, and connectivity before you run a full ingestion.
BDP-specific features: event specifications, tracking scenarios, tracking plans, pipelines, and enrichments are only available through the BDP Console API.

Iglu Server requirements: automatic schema discovery relies on the `/api/schemas` endpoint (Iglu Server 0.6+).

Field tagging in Iglu-only mode: tags that depend on BDP metadata, such as PII fields derived from the PII Pseudonymization enrichment, are not available.
If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
Error: `Authentication failed: Invalid API credentials`

Solution:

- Verify that `api_key_id` and `api_key` are correct

Error: `Authentication failed: Forbidden`

Solution:

- Verify that `organization_id` matches your credentials

Error: `Permission denied for /data-structures`

Solution:

- Ensure your API key has the `read:data-structures` permission

Error: `Permission denied for /event-specs`

Solution:

- Set `extract_event_specifications: false` in the config, or
- Request the `read:event-specs` permission for your API key

Error: `Request timeout: https://console.snowplowanalytics.com`

Solution:

- Increase `timeout_seconds` in the configuration

Error: `Iglu connection failed`

Solution:

- Verify that `iglu_server_url` is correct and accessible
- Verify that `api_key` is valid
- Test the registry directly: `curl https://iglu.example.com/api/schemas`

Issue: Ingestion completes but no schemas are extracted
Solutions:

1. Check filtering patterns:

   ```yaml
   schema_pattern:
     allow: [".*"] # Allow all schemas
   ```

2. Check schema types:

   ```yaml
   schema_types_to_extract: ["event", "entity"]
   ```

3. Include hidden schemas:

   ```yaml
   include_hidden_schemas: true
   ```

4. Verify that schemas exist in the BDP Console or Iglu registry
Error: `HTTP 429: Rate limit exceeded`

Solution:

- Lower `performance.max_concurrent_api_calls` to reduce request bursts
- Allow more automatic retries by raising `max_retries`
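A sketch of the relevant knobs, with illustrative values:

```yaml
source:
  type: snowplow
  config:
    bdp_connection:
      # ... credentials ...
      max_retries: 5 # allow more automatic retries
    performance:
      max_concurrent_api_calls: 5 # halve the default concurrency
```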