Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Connection Options

BDP Console Connection

| Option | Type | Required | Default | Description |
|--------|------|----------|---------|-------------|
| organization_id | string | ✅ | | Organization UUID from Console URL |
| api_key_id | string | ✅ | | API Key ID from Console credentials |
| api_key | string | ✅ | | API Key secret |
| console_api_url | string | | https://console.snowplowanalytics.com/api/msc/v1 | BDP Console API base URL |
| timeout_seconds | int | | 60 | Request timeout in seconds |
| max_retries | int | | 3 | Maximum retry attempts |

Iglu Connection

| Option | Type | Required | Default | Description |
|--------|------|----------|---------|-------------|
| iglu_server_url | string | ✅ | | Iglu server base URL |
| api_key | string | | | API key for private Iglu registry (UUID format) |
| timeout_seconds | int | | 30 | Request timeout in seconds |

Note: Iglu-only mode uses automatic schema discovery via the /api/schemas endpoint (requires Iglu Server 0.6+). All schemas in the registry will be automatically discovered.

Feature Options

| Option | Type | Default | Description | Required Permission |
|--------|------|---------|-------------|---------------------|
| extract_event_specifications | bool | true | Extract event specifications | read:event-specs |
| extract_tracking_scenarios | bool | true | Extract tracking scenarios | read:tracking-scenarios |
| extract_tracking_plans | bool | true | Extract tracking plans | read:data-products |
| extract_pipelines | bool | true | Extract pipelines as DataFlow entities | read:pipelines |
| extract_enrichments | bool | true | Extract enrichments as DataJob entities with lineage | read:enrichments |
| enrichment_owner | string | None | Default owner email for enrichment DataJobs | N/A |
| include_hidden_schemas | bool | false | Include schemas marked as hidden | N/A |
| include_version_in_urn | bool | false | Include version in dataset URN (legacy behavior) | N/A |
| extract_standard_schemas | bool | true | Extract Snowplow standard schemas from Iglu Central | N/A |
| iglu_central_url | string | http://iglucentral.com | URL for fetching standard schemas | N/A |
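
As a hedged sketch, the following config keeps the default schema extraction but disables tracking scenarios and assigns a default owner to enrichment DataJobs; the email value is illustrative:

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...
    extract_tracking_scenarios: false
    extract_enrichments: true
    enrichment_owner: "data-platform@example.com" # illustrative owner email
```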

Schema Extraction Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| schema_types_to_extract | list | ["event", "entity"] | Schema types to extract |
| deployed_since | string | None | Only extract schemas deployed since this ISO 8601 timestamp |
| schema_page_size | int | 100 | Number of schemas per API page |
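
For example, a sketch that extracts only event schemas deployed after a cutoff date (the timestamp is illustrative):

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...
    schema_types_to_extract:
      - "event"
    deployed_since: "2024-01-01T00:00:00Z" # illustrative ISO 8601 timestamp
    schema_page_size: 200
```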

Warehouse Lineage Options (Advanced)

⚠️ Note: Disabled by default. Prefer warehouse connectors (Snowflake, BigQuery) for column-level lineage.

| Option | Type | Default | Description | Required Permission |
|--------|------|---------|-------------|---------------------|
| warehouse_lineage.enabled | bool | false | Extract table-level lineage via Data Models API | read:data-products |
| warehouse_lineage.platform_instance | string | None | Default platform instance for warehouse URNs | N/A |
| warehouse_lineage.env | string | PROD | Default environment for warehouse datasets | N/A |
| warehouse_lineage.validate_urns | bool | true | Validate warehouse URNs exist in DataHub | DataHub Graph API access |
| warehouse_lineage.destination_mappings | list | [] | Per-destination platform instance overrides | N/A |
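
A full warehouse lineage recipe appears in the Quick Start section below. The sketch here only illustrates how a per-destination override might be expressed; the entry shape under destination_mappings is not documented in this guide, so the destination and platform_instance keys are assumptions:

```yaml
warehouse_lineage:
  enabled: true
  platform_instance: "prod_snowflake" # default instance for warehouse URNs
  validate_urns: true
  destination_mappings:
    # Hypothetical entry shape for illustration only; verify against the
    # connector's configuration reference before relying on these keys.
    - destination: "snowflake-prod"
      platform_instance: "prod_snowflake_eu"
```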

Field Tagging Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| field_tagging.enabled | bool | true | Enable automatic field tagging |
| field_tagging.tag_schema_version | bool | true | Tag fields with schema version |
| field_tagging.tag_event_type | bool | true | Tag fields with event type |
| field_tagging.tag_data_class | bool | true | Tag fields with data classification (PII, Sensitive) |
| field_tagging.tag_authorship | bool | true | Tag fields with authorship info |
| field_tagging.track_field_versions | bool | false | Track which version each field was added in |
| field_tagging.use_structured_properties | bool | true | Use structured properties instead of tags |
| field_tagging.emit_tags_and_structured_properties | bool | false | Emit both tags and structured properties |
| field_tagging.pii_tags_only | bool | false | Only emit tags for PII fields when using both |
| field_tagging.use_pii_enrichment | bool | true | Extract PII fields from PII Pseudonymization enrichment |
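
For instance, a sketch that emits both tags and structured properties but restricts tags to PII fields (all keys come from the table above):

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...
    field_tagging:
      enabled: true
      tag_data_class: true
      use_structured_properties: true
      emit_tags_and_structured_properties: true
      pii_tags_only: true # tags only for PII fields; structured properties for everything else
```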

Performance Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| performance.max_concurrent_api_calls | int | 10 | Maximum concurrent API calls for deployment fetching |
| performance.enable_parallel_fetching | bool | true | Enable parallel fetching of schema deployments |
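
A sketch that lowers concurrency, for example when the Console API rate-limits aggressively (the value is illustrative):

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...
    performance:
      enable_parallel_fetching: true
      max_concurrent_api_calls: 4 # illustrative; default is 10
```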

Filtering Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| schema_pattern | AllowDenyPattern | Allow all | Filter schemas by vendor/name pattern |
| event_spec_pattern | AllowDenyPattern | Allow all | Filter event specifications by name |
| tracking_scenario_pattern | AllowDenyPattern | Allow all | Filter tracking scenarios by name |
| tracking_plan_pattern | AllowDenyPattern | Allow all | Filter tracking plans by name |
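
A schema_pattern example appears in the Schema Filtering section below; the other patterns use the same allow/deny structure. A sketch with illustrative names:

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...
    event_spec_pattern:
      allow:
        - "Checkout.*" # illustrative event specification names
    tracking_scenario_pattern:
      deny:
        - ".*Deprecated.*"
```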

Stateful Ingestion

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| stateful_ingestion.enabled | bool | false | Enable stateful ingestion for deletion detection |
| stateful_ingestion.remove_stale_metadata | bool | true | Remove schemas that no longer exist |
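
A sketch that enables deletion detection so schemas removed from the source are also removed from DataHub (both keys come from the table above):

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
```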

Quick Start

1. BDP Console (Managed Snowplow)

Create a recipe file snowplow_recipe.yml:

```yaml
source:
  type: snowplow
  config:
    bdp_connection:
      organization_id: "<ORG_UUID>"
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

Run ingestion:

```bash
datahub ingest -c snowplow_recipe.yml
```

2. Open-Source Snowplow (Iglu-Only Mode)

For self-hosted Snowplow with Iglu registry (without BDP Console API):

```yaml
source:
  type: snowplow
  config:
    iglu_connection:
      iglu_server_url: "https://iglu.example.com"
      api_key: "${IGLU_API_KEY}" # Optional for private registries

    schema_types_to_extract:
      - "event"
      - "entity"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

Important notes for Iglu-only mode:

  • Supported: Event and entity schemas with full JSON Schema definitions
  • Supported: Automatic schema discovery via /api/schemas endpoint (requires Iglu Server 0.6+)
  • ⚠️ Not supported: Event specifications (requires BDP API)
  • ⚠️ Not supported: Tracking scenarios (requires BDP API)
  • ⚠️ Not supported: Field tagging/PII detection (requires BDP deployment data)

For complete configuration options, see snowplow_iglu.yml.

3. With Warehouse Lineage (BDP Only - Advanced)

⚠️ Note: This feature is disabled by default and should only be enabled in specific scenarios (see below).

Start from the baseline BDP recipe in 1. BDP Console (Managed Snowplow), then add:

```yaml
source:
  type: snowplow
  config:
    # Enable warehouse lineage via Data Models API
    warehouse_lineage:
      enabled: true
      platform_instance: "prod_snowflake" # Optional
      env: "PROD" # Optional
      validate_urns: true # Optional
```

What this creates:

  • Table-level lineage: atomic.events → derived.sessions (or other derived tables)
  • No direct warehouse credentials needed (uses BDP API)

Supported warehouses: Snowflake, BigQuery, Redshift, Databricks

When to Enable This Feature

✅ Enable warehouse lineage if:

  • You want quick table-level lineage without configuring a warehouse connector
  • You don't have access to warehouse query logs
  • You want to document Data Models API metadata specifically

❌ Don't enable if you're using warehouse connectors:

  • Snowflake connector provides:
    • Column-level lineage by parsing SQL queries
    • Transformation logic from query history
    • Complete dependency graphs
  • BigQuery, Redshift, Databricks connectors similarly provide richer lineage

Best practice: Use warehouse connector for detailed lineage. Only enable this for quick documentation of Data Models metadata.

Requirements: Data Models must be configured in your BDP organization.

Schema Versioning

Snowplow uses SchemaVer (semantic versioning for schemas) with the format MODEL-REVISION-ADDITION:

  • MODEL (first digit): Breaking changes - data valid against earlier versions no longer validates
  • REVISION (second digit): Changes that may invalidate some existing data
  • ADDITION (third digit): Fully backward-compatible changes, such as adding optional fields

Example: 1-0-2

  • Model: 1 (major version)
  • Revision: 0 (no revisions)
  • Addition: 2 (two optional field additions)

In DataHub, schemas are represented as:

  • Dataset name: {vendor}.{name}.{version} (e.g., com.example.page_view.1-0-0)
  • Schema version: Tracked in dataset properties

Entity Mapping: Snowplow → DataHub

This section explains how Snowplow concepts are modeled as DataHub entities.

Entity Type Mapping

| Snowplow Concept | DataHub Entity | DataHub Subtype | Description |
|------------------|----------------|-----------------|-------------|
| Organization | Container | DATABASE | Top-level container for all Snowplow metadata |
| Event Schema | Dataset | snowplow_event_schema | Self-describing event definition (JSON Schema) |
| Entity Schema | Dataset | snowplow_entity_schema | Context/entity schema attached to events |
| Event Specification | Dataset | snowplow_event_spec | Tracking requirement defining what to track |
| Tracking Scenario | Container | (custom) | Logical grouping of related event specifications |
| Tracking Plan | Container | tracking_plan | Business-level tracking plan grouping |
| Pipeline | DataFlow | - | Snowplow data pipeline (Collector → Warehouse) |
| Enrichment | DataJob | - | Data transformation job within a pipeline |
| Collector | DataJob | - | HTTP endpoint receiving tracking events |
| Atomic Events | Dataset | atomic_event | Raw enriched events table in warehouse |
| Parsed Events | Dataset | event | Parsed event data combining all schemas |

Pipeline Architecture in DataHub

Snowplow pipelines are modeled as DataFlow entities with DataJob children representing each processing stage:

Tracker SDKs (Web, Mobile, Server)
            │
            ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    Pipeline (DataFlow)                                   │
│  urn:li:dataFlow:(snowplow,pipeline-id,PROD)                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────┐                                                   │
│   │    Collector    │  ◄── Receives HTTP tracking events                │
│   │    (DataJob)    │                                                   │
│   └────────┬────────┘                                                   │
│            │                                                            │
│            ▼                                                            │
│   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────┐│
│   │   IP Lookup     │  │   UA Parser     │  │  PII Pseudonymization   ││
│   │   (DataJob)     │  │   (DataJob)     │  │       (DataJob)         ││
│   │                 │  │                 │  │                         ││
│   │ user_ipaddress  │  │ useragent       │  │ user_id, email          ││
│   │  → geo_*, ip_*  │  │  → br_*, os_*   │  │  → (hashed values)      ││
│   └────────┬────────┘  └────────┬────────┘  └────────────┬────────────┘│
│            │                    │                        │              │
│            └────────────────────┼────────────────────────┘              │
│                                 ▼                                       │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
                    ┌─────────────────────────┐
                    │  Atomic Events (Dataset)│
                    │  Enriched event stream  │
                    └────────────┬────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────┐
                    │   Warehouse Tables      │
                    │ (Snowflake, BigQuery)   │
                    └─────────────────────────┘

Lineage Relationships

The connector creates the following lineage relationships:

1. Schema → Event Specification Lineage

Event specifications reference the schemas they require:

┌──────────────────────────────┐
│ Event Schema                 │
│ (vendor.event_name.1-0-0)    │────┐
└──────────────────────────────┘    │     ┌─────────────────────────┐
                                    ├────▶│   Event Specification   │
┌──────────────────────────────┐    │     │  (Tracking Requirement) │
│ Entity Schema                │────┘     └─────────────────────────┘
│ (vendor.context.1-0-0)       │
└──────────────────────────────┘
2. Enrichment Column-Level Lineage

Enrichments transform specific fields. Example for IP Lookup:

┌─────────────────────────────────────────────────────────────────────┐
│                        IP Lookup Enrichment                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Input                          Output                             │
│   ─────                          ──────                             │
│                                  ┌─────────────────┐                │
│                              ┌──▶│ geo_country     │                │
│                              │   ├─────────────────┤                │
│   ┌─────────────────┐        │   │ geo_city        │                │
│   │ user_ipaddress  │────────┼──▶├─────────────────┤                │
│   └─────────────────┘        │   │ geo_region      │                │
│                              │   ├─────────────────┤                │
│                              │   │ geo_latitude    │                │
│                              ├──▶├─────────────────┤                │
│                              │   │ geo_longitude   │                │
│                              │   ├─────────────────┤                │
│                              └──▶│ ip_isp          │                │
│                                  ├─────────────────┤                │
│                                  │ ip_organization │                │
│                                  └─────────────────┘                │
└─────────────────────────────────────────────────────────────────────┘

Supported enrichments with column-level lineage:

  • IP Lookup: user_ipaddress → geo_*, ip_* fields
  • UA Parser: useragent → br_*, os_* fields
  • YAUAA: useragent → browser, OS, device fields
  • Referer Parser: page_referrer → refr_* fields
  • Campaign Attribution: page_urlquery → mkt_* fields
  • PII Pseudonymization: configured fields → same fields (hashed)
  • Currency Conversion: currency fields → converted fields
  • Event Fingerprint: event fields → event_fingerprint
  • IAB Spiders/Robots: useragent → iab_* classification fields

3. Warehouse Lineage (Optional)

When warehouse_lineage.enabled: true:

┌─────────────────────────┐                    ┌─────────────────────────┐
│     Atomic Events       │   Data Models API  │     Derived Table       │
│ (snowplow.atomic.events)│───────────────────▶│ (warehouse.schema.table)│
└─────────────────────────┘                    └─────────────────────────┘

Container Hierarchy

Organization (Container: DATABASE)
│
├── Event Schema: com.example.page_view.1-0-0 (Dataset)
├── Event Schema: com.example.checkout.1-0-0 (Dataset)
├── Entity Schema: com.example.user_context.1-0-0 (Dataset)
├── Event Specification: "Page View Tracking" (Dataset)
│
├── Tracking Scenario: "Checkout Flow" (Container)
│   ├── Event Specification: "Add to Cart" (Dataset)
│   └── Event Specification: "Purchase Complete" (Dataset)
│
└── Tracking Plan: "Web Analytics" (Container)
    ├── Event Specification (linked)
    └── Schema (linked)

URN Formats

| Entity Type | URN Format |
|-------------|------------|
| Organization | urn:li:container:{guid} |
| Event/Entity Schema | urn:li:dataset:(urn:li:dataPlatform:snowplow,vendor.name,ENV) |
| Event Specification | urn:li:dataset:(urn:li:dataPlatform:snowplow,event_spec_id,ENV) |
| Pipeline | urn:li:dataFlow:(snowplow,pipeline-id,ENV) |
| Enrichment/DataJob | urn:li:dataJob:(urn:li:dataFlow:(...),job-id) |
| Tracking Scenario | urn:li:container:{guid} |

Custom Properties

Each entity type includes relevant custom properties:

Event/Entity Schemas:

  • vendor, name, version (SchemaVer format)
  • schema_type (event/entity)
  • json_schema (full JSON Schema definition)
  • deployed_environments (PROD, DEV, etc.)

Event Specifications:

  • status (draft, active, deprecated)
  • trigger_conditions
  • referenced_schemas

Enrichments:

  • enrichment_type
  • input_fields, output_fields
  • configuration details

Custom Platform Instance

Group schemas by environment:

```yaml
source:
  type: snowplow
  config:
    bdp_connection:
      organization_id: "<ORG_UUID>"
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"

    platform_instance: "production"
    env: "PROD"
```

Schema Filtering

Extract only specific vendor schemas:

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...

    schema_pattern:
      allow:
        - "com\\.example\\..*" # Allow com.example schemas
        - "com\\.acme\\.events\\..*" # Allow com.acme.events schemas
      deny:
        - ".*\\.test$" # Deny test schemas
```

Testing the Connection

Use DataHub's built-in test-connection command:

```bash
datahub check source-connection snowplow \
  --config snowplow_recipe.yml
```

This will:

  • Test BDP Console API authentication
  • Test Iglu registry connectivity (if configured)
  • Verify required permissions
  • Report capability availability

Limitations

  1. BDP-specific features:

    • Event specifications only available via BDP Console API
    • Tracking scenarios only available via BDP Console API
    • Tracking plans only available via BDP Console API
    • Open-source Iglu users won't have these features
  2. Iglu Server requirements:

    • Automatic schema discovery requires Iglu Server 0.6+ with /api/schemas endpoint
    • Older Iglu implementations may not support the list schemas API
  3. Field tagging in Iglu-only mode:

    • PII/sensitive field detection requires BDP deployment metadata
    • Not available when using Iglu-only mode

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

Authentication Errors

Error: Authentication failed: Invalid API credentials

Solution:

  1. Verify api_key_id and api_key are correct
  2. Check credentials are for the correct organization
  3. Ensure credentials haven't expired
  4. Generate new credentials in BDP Console if needed

Error: Authentication failed: Forbidden

Solution:

  • Check organization_id matches your credentials
  • Verify API key has required permissions
  • Contact Snowplow support if permissions are unclear

Permission Errors

Error: Permission denied for /data-structures

Solution:

  • API key missing read:data-structures permission
  • Generate new credentials with correct permissions in BDP Console → Settings → API Credentials

Error: Permission denied for /event-specs

Solution:

  • Set extract_event_specifications: false in config (see the sketch after this list), or
  • Request read:event-specs permission for your API key
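
A minimal sketch of the first option, disabling event specification extraction:

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...
    extract_event_specifications: false
```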

Connection Errors

Error: Request timeout: https://console.snowplowanalytics.com

Solution:

  • Check network connectivity to Snowplow Console
  • Increase timeout_seconds in the configuration (see the sketch after this list)
  • Verify Console URL is correct
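
For example, a bdp_connection block with a longer timeout and more retries (values are illustrative):

```yaml
source:
  type: snowplow
  config:
    bdp_connection:
      organization_id: "<ORG_UUID>"
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"
      timeout_seconds: 120 # illustrative; default is 60
      max_retries: 5 # illustrative; default is 3
```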

Error: Iglu connection failed

Solution:

  • Verify iglu_server_url is correct and accessible
  • For private registries, check api_key is valid
  • Test connectivity: curl https://iglu.example.com/api/schemas

No Schemas Found

Issue: Ingestion completes but no schemas extracted

Solutions:

  1. Check filtering patterns:

    ```yaml
    schema_pattern:
      allow: [".*"] # Allow all schemas
    ```
  2. Check schema types:

    ```yaml
    schema_types_to_extract: ["event", "entity"]
    ```
  3. Include hidden schemas:

    ```yaml
    include_hidden_schemas: true
    ```
  4. Verify schemas exist in BDP Console or Iglu registry

Rate Limiting

Error: HTTP 429: Rate limit exceeded

Solution:

  • Connector implements automatic retry with exponential backoff
  • Rate limits should be handled automatically
  • If issues persist, contact Snowplow support to increase limits