Snowplow Pre

Overview

The snowplow module ingests metadata from Snowplow into DataHub. It is intended for production ingestion workflows; its module-specific capabilities are documented below.

The Snowplow source extracts metadata from Snowplow's behavioral data platform, including:

  • Event schemas - Self-describing event definitions with properties and validation rules
  • Entity schemas - Context and entity schemas attached to events
  • Event specifications - Tracking requirements and specifications (BDP only)
  • Tracking scenarios - Groupings of related events (BDP only)
  • Organizations - Top-level containers for all schemas
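Event and entity schemas are usually referenced by an Iglu schema URI. The sketch below splits such a URI into its parts. The `iglu:vendor/name/format/version` layout is Snowplow's general convention rather than something stated on this page, and `SchemaKey` and `parse_iglu_uri` are hypothetical names used only for illustration; the connector may represent schema identity differently.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    """Parts of an Iglu schema URI (hypothetical helper type)."""
    vendor: str
    name: str
    format: str
    version: str


# ASSUMPTION: URIs follow iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>
_IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[A-Za-z0-9_.\-]+)/(?P<name>[A-Za-z0-9_\-]+)/"
    r"(?P<format>[A-Za-z0-9_\-]+)/(?P<version>\d+-\d+-\d+)$"
)


def parse_iglu_uri(uri: str) -> SchemaKey:
    """Split an Iglu schema URI into vendor, name, format, and version."""
    match = _IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid Iglu schema URI: {uri!r}")
    return SchemaKey(**match.groupdict())
```

For example, `parse_iglu_uri("iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1")` yields a key whose `vendor` is `com.snowplowanalytics.snowplow` and whose `version` is `1-0-1`.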

Snowplow is an open-source behavioral data platform that collects, validates, and models event-level data. This connector supports both:

  • Snowplow BDP (Behavioral Data Platform) - Managed Snowplow with Console API
  • Open-source Snowplow - Self-hosted with Iglu schema registry

Prerequisites

Before running ingestion, ensure network connectivity to the source, valid authentication credentials, and read permissions for metadata APIs required by this module.

For Snowplow BDP (Managed)

  1. Snowplow BDP account with Console access
  2. Organization ID - Found in Console URL: https://console.snowplowanalytics.com/{org-id}/...
  3. API credentials - Generated from Console → Settings → API Credentials:
    • API Key ID
    • API Key Secret
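If you only have a Console URL at hand, the organization ID can be pulled out of it programmatically. This is a minimal sketch that assumes the ID is always the first path segment, as in the URL pattern above; `org_id_from_console_url` is a hypothetical helper, not part of the connector.

```python
from urllib.parse import urlparse

CONSOLE_HOST = "console.snowplowanalytics.com"


def org_id_from_console_url(url: str) -> str:
    """Return the organization ID from a Snowplow Console URL.

    ASSUMPTION: the ID is the first path segment, matching
    https://console.snowplowanalytics.com/{org-id}/...
    """
    parsed = urlparse(url)
    if parsed.hostname != CONSOLE_HOST:
        raise ValueError(f"not a Snowplow Console URL: {url!r}")
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        raise ValueError(f"no organization ID found in: {url!r}")
    return segments[0]
```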

For Open-Source Snowplow

  1. Iglu Schema Registry - URL of your Iglu server
  2. API Key (optional) - Required for private Iglu registries

Python Requirements

  • Python 3.8 or newer
  • DataHub CLI installed

Snowplow BDP API Permissions

The connector requires read-only access to the BDP Console API. The permissions it needs depend on which capabilities are enabled:

Minimum Required Permissions

To extract basic schema metadata:

  • read:data-structures - Read access to data structures (event and entity schemas)
  • read:organizations - Access to organization information

Permissions by Capability

| Capability | Required Permissions | Configuration |
| --- | --- | --- |
| Schema Metadata | read:data-structures | Enabled by default |
| Event Specifications | read:event-specs | extract_event_specifications: true |
| Tracking Scenarios | read:tracking-scenarios | extract_tracking_scenarios: true |
| Tracking Plans | read:data-products | extract_tracking_plans: true |
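The mapping between configuration flags and permissions can be checked before requesting credentials. The sketch below computes the permissions a given set of flags would need, using only the permission names and flags listed in this section. The function `required_permissions` and the plain-dict config are hypothetical; the connector itself may validate permissions differently.

```python
# Permissions needed regardless of optional capabilities (from the list above).
BASE_PERMISSIONS = {"read:data-structures", "read:organizations"}

# Optional capability flag -> permission it requires (from the table above).
FLAG_PERMISSIONS = {
    "extract_event_specifications": "read:event-specs",
    "extract_tracking_scenarios": "read:tracking-scenarios",
    "extract_tracking_plans": "read:data-products",
}


def required_permissions(config: dict) -> list:
    """Return the sorted permissions needed for the enabled capabilities.

    ASSUMPTION: `config` is a plain dict of flag name -> bool; flags that
    are absent are treated as disabled in this sketch.
    """
    needed = set(BASE_PERMISSIONS)
    for flag, permission in FLAG_PERMISSIONS.items():
        if config.get(flag, False):
            needed.add(permission)
    return sorted(needed)
```

Calling `required_permissions({"extract_event_specifications": True})` returns the two base permissions plus `read:event-specs`.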

Permission Testing

Test your API credentials and permissions:

```bash
# Get JWT token
curl -X POST \
  -H "X-API-Key-ID: <API_KEY_ID>" \
  -H "X-API-Key: <API_KEY>" \
  https://console.snowplowanalytics.com/api/msc/v1/organizations/<ORG_ID>/credentials/v3/token

# List data structures
curl -H "Authorization: Bearer <JWT>" \
  https://console.snowplowanalytics.com/api/msc/v1/organizations/<ORG_ID>/data-structures/v1
```
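The same two-step check can be scripted with the Python standard library. This sketch mirrors the curl commands above: the URLs, HTTP methods, and header names are taken from them, while the function names are hypothetical and the `accessToken` field in the token response is an assumption about the response shape that you should confirm against your own output.

```python
import json
from urllib.request import Request, urlopen

CONSOLE_API = "https://console.snowplowanalytics.com/api/msc/v1"


def build_token_request(org_id: str, api_key_id: str, api_key: str) -> Request:
    """Build the token request (same method and headers as the curl call)."""
    url = f"{CONSOLE_API}/organizations/{org_id}/credentials/v3/token"
    headers = {"X-API-Key-ID": api_key_id, "X-API-Key": api_key}
    return Request(url, method="POST", headers=headers)


def build_data_structures_request(org_id: str, jwt: str) -> Request:
    """Build the request that lists data structures with a bearer token."""
    url = f"{CONSOLE_API}/organizations/{org_id}/data-structures/v1"
    return Request(url, headers={"Authorization": f"Bearer {jwt}"})


def check_permissions(org_id: str, api_key_id: str, api_key: str) -> int:
    """Run both requests and return the HTTP status of the listing call.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, which is
    how missing credentials or permissions would surface here.
    """
    with urlopen(build_token_request(org_id, api_key_id, api_key), timeout=30) as resp:
        # ASSUMPTION: the token is returned under the "accessToken" key.
        jwt = json.load(resp)["accessToken"]
    with urlopen(build_data_structures_request(org_id, jwt), timeout=30) as resp:
        return resp.status
```

Keeping request construction separate from the network calls makes it possible to inspect the exact URL and headers without contacting the Console API.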

Iglu Registry Permissions

For open-source Snowplow with Iglu:

  • Public registries: No authentication required (e.g., Iglu Central)
  • Private registries: API key with read access to schemas
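The difference between public and private registries comes down to whether an API key accompanies the request. The sketch below builds a schema lookup request for either case. It rests on assumptions not stated on this page: that schemas are served under `<registry>/schemas/<vendor>/<name>/<format>/<version>` (for an Iglu Server the base URL would typically end in `/api`), and that private registries read the key from an `apikey` header. `build_iglu_schema_request` and `iglu.example.com` are hypothetical.

```python
from typing import Optional
from urllib.request import Request


def build_iglu_schema_request(
    registry_url: str,
    vendor: str,
    name: str,
    version: str,
    api_key: Optional[str] = None,
    schema_format: str = "jsonschema",
) -> Request:
    """Build a GET request for one schema in an Iglu registry.

    ASSUMPTIONS: the path layout and the `apikey` header follow common
    Iglu conventions; verify both against your registry.
    """
    base = registry_url.rstrip("/")
    url = f"{base}/schemas/{vendor}/{name}/{schema_format}/{version}"
    # Public registries need no credentials, so the header is optional.
    headers = {"apikey": api_key} if api_key else {}
    return Request(url, headers=headers)


# Public registry: no API key is attached.
public_req = build_iglu_schema_request(
    "http://iglucentral.com", "com.snowplowanalytics.snowplow", "link_click", "1-0-1"
)

# Private registry (hypothetical host): the API key is sent as a header.
private_req = build_iglu_schema_request(
    "https://iglu.example.com/api/", "com.acme", "checkout", "1-0-0", api_key="secret"
)
```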

Configuration

See the recipe files for complete configuration examples: