# Debug Ingestion with Recording and Replay
:::note Beta Feature

Recording and replay is currently in beta. The feature is stable for debugging purposes but the archive format may change in future releases.

:::
When troubleshooting ingestion issues, it can be difficult to reproduce problems in a development environment. The recording and replay feature captures all external I/O during ingestion, allowing you to replay runs offline with full debugger support.
The recording system captures external I/O made during an ingestion run, including HTTP API calls and database queries. Recordings are stored in encrypted, compressed archives that can be replayed offline to reproduce issues exactly as they occurred.
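Conceptually, the HTTP side of this works like a VCR.py cassette: traffic is captured once and served back on later runs. Here is a minimal standalone sketch of that mechanism (an illustration only, not DataHub's actual recording layer; the URL is hypothetical):

```python
import requests
import vcr

# First run: the real HTTP call is made and saved to the cassette file.
# Subsequent runs: the response is served from the cassette, no network needed.
with vcr.use_cassette("http/cassette.yaml", record_mode="once"):
    resp = requests.get("https://api.example.com/datasets")  # hypothetical endpoint
    print(resp.status_code)
```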
Install the plugin:

```bash
pip install 'acryl-datahub[debug-recording]'

# Or with your source connectors
pip install 'acryl-datahub[looker,debug-recording]'
```
Record an ingestion run:

```bash
# Record with password protection (saves to temp directory)
datahub ingest run -c recipe.yaml --record --record-password mysecret --no-s3-upload

# Record to a specific directory
export INGESTION_ARTIFACT_DIR=/path/to/recordings
datahub ingest run -c recipe.yaml --record --record-password mysecret --no-s3-upload

# Record and upload directly to S3
datahub ingest run -c recipe.yaml --record --record-password mysecret \
  --record-output-path s3://my-bucket/recordings/my-run.zip
```
Replay a recording:

```bash
# Replay in air-gapped mode (no network required)
datahub ingest replay recording.zip --password mysecret

# Replay from S3
datahub ingest replay s3://my-bucket/recordings/my-run.zip --password mysecret

# Replay with live sink (emit to real DataHub)
datahub ingest replay recording.zip --password mysecret \
  --live-sink --server http://localhost:8080
```
Recording options:

| Option | Description |
|---|---|
| `--record` | Enable recording |
| `--record-password` | Encryption password (or use the `DATAHUB_RECORDING_PASSWORD` env var) |
| `--record-output-path` | Output path: local file or S3 URL (`s3://bucket/path/file.zip`) |
| `--no-s3-upload` | Save locally only (uses `INGESTION_ARTIFACT_DIR` or a temp dir) |
| `--no-secret-redaction` | Keep real credentials (⚠️ use only for local debugging) |
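These flags compose; for example, to keep real credentials in a local-only archive while debugging (an illustrative combination of the documented flags):

```bash
# Unredacted secrets stay in the archive -- keep this file local and delete it after use
datahub ingest run -c recipe.yaml --record --record-password mysecret \
  --no-secret-redaction --no-s3-upload
```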
Replay options:

| Option | Description |
|---|---|
| `--password` | Decryption password |
| `--live-sink` | Emit to a real GMS instead of mocking responses |
| `--server` | GMS URL for live sink mode |
| `--use-responses-lib` | Use the `responses` library for HTTP replay (for sources with VCR issues, e.g. Looker) |
You can also configure recording in your recipe file:

```yaml
source:
  type: looker
  config:
    # ... source config ...

    # Recording configuration
    recording:
      enabled: true
      password: ${DATAHUB_RECORDING_PASSWORD}
      s3_upload: true # Set to true for S3 upload
      output_path: s3://my-bucket/recordings/ # Required when s3_upload is true
```
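With the recipe-level block enabled, recording should kick in on a plain run with no `--record` flag; this assumes the recipe configuration is honored as described above, with the password supplied by the referenced environment variable:

```bash
export DATAHUB_RECORDING_PASSWORD=mysecret
datahub ingest run -c recipe.yaml
```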
Environment variables:

| Variable | Description |
|---|---|
| `DATAHUB_RECORDING_PASSWORD` | Default password for encryption/decryption |
| `INGESTION_ARTIFACT_DIR` | Directory for local recordings (when not using S3) |
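Because `DATAHUB_RECORDING_PASSWORD` doubles as the default decryption password, `--password` can be omitted on replay:

```bash
export DATAHUB_RECORDING_PASSWORD=mysecret
datahub ingest replay recording.zip
```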
Inspect an archive's metadata:

```bash
datahub recording info recording.zip --password mysecret

# Sample output:
# Recording Archive: recording.zip
# --------------------------------------------------
# Run ID: snowflake-2024-12-03-10_30_00-abc123
# Source Type: snowflake
# Sink Type: datahub-rest
# DataHub Version: 0.14.0
# Created At: 2024-12-03T10:35:00Z
# Format Version: 1.0.0
# File Count: 3
```
Use `--json` for machine-readable output.
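For example, piping the JSON output through `jq` (the exact field names in the JSON are an assumption, so check the actual output first):

```bash
# Pretty-print all metadata
datahub recording info recording.zip --password mysecret --json | jq .

# Pull a single field -- the key name (run_id) is assumed
datahub recording info recording.zip --password mysecret --json | jq -r '.run_id'
```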
Extract archive contents for manual inspection:

```bash
datahub recording extract recording.zip --password mysecret --output-dir ./extracted
```

Extracts:

- `manifest.json` - Archive metadata
- `recipe.yaml` - Redacted recipe (secrets replaced with placeholders)
- `http/cassette.yaml` - HTTP recordings
- `db/queries.jsonl` - Database query recordings
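A quick way to sanity-check the extracted files; this sketch assumes the cassette follows the standard VCR.py layout (a top-level `interactions` list) and that `queries.jsonl` holds one JSON object per line:

```python
import json

import yaml

# Count recorded HTTP interactions in the VCR cassette (assumed layout)
with open("extracted/http/cassette.yaml") as f:
    cassette = yaml.safe_load(f)
print(f"HTTP interactions: {len(cassette.get('interactions', []))}")

# Peek at the first few recorded database queries (assumed JSON Lines format)
with open("extracted/db/queries.jsonl") as f:
    for line in f.readlines()[:3]:
        print(json.loads(line))
```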
Compare recorded and replayed output to verify semantic equivalence:

```bash
# Capture output during recording
datahub ingest run -c recipe.yaml --record --record-password test --no-s3-upload \
  | tee recording_output.json

# Capture output during replay
datahub ingest replay recording.zip --password test \
  | tee replay_output.json

# Compare (ignoring timestamps and run IDs)
datahub check metadata-diff \
  --ignore-path "root['*']['systemMetadata']['lastObserved']" \
  --ignore-path "root['*']['systemMetadata']['runId']" \
  recording_output.json replay_output.json
```
A successful replay shows `PERFECT SEMANTIC MATCH`.
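The `--ignore-path` syntax suggests a deepdiff-style comparison; here is an equivalent standalone check, assuming each output file is a JSON array of records carrying a `systemMetadata` block (this is not necessarily how `metadata-diff` is implemented):

```python
import json

from deepdiff import DeepDiff

with open("recording_output.json") as f:
    recorded = json.load(f)
with open("replay_output.json") as f:
    replayed = json.load(f)

# Ignore fields that legitimately differ between runs (timestamps, run IDs)
diff = DeepDiff(
    recorded,
    replayed,
    exclude_regex_paths=[
        r"root\[\d+\]\['systemMetadata'\]\['lastObserved'\]",
        r"root\[\d+\]\['systemMetadata'\]\['runId'\]",
    ],
)
print("PERFECT SEMANTIC MATCH" if not diff else diff)
```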
Supported sources:

- **HTTP-based:** Looker, PowerBI, Tableau, Superset, Mode, Sigma, dbt Cloud, Fivetran
- **Database:** Snowflake, Redshift, Databricks, BigQuery, PostgreSQL, MySQL, MSSQL
Troubleshooting:

If the recording commands are unavailable, install the debug-recording plugin:

```bash
pip install 'acryl-datahub[debug-recording]'
```
If a replay produces incomplete output, the recording itself may be incomplete. Check for captured exceptions:

```bash
datahub recording info recording.zip --password mysecret
# Look for "has_exception: true"
```
If recorded and replayed outputs differ slightly, this is normal due to timing variations. Use `metadata-diff` to verify semantic equivalence (see above).
Some sources (e.g., Looker) use custom HTTP transport layers that conflict with VCR.py's urllib3 patching, causing errors like:

```
TypeError: super(type, obj): obj must be an instance or subtype of type
```

**Automatic fallback:** the replay command will automatically retry using the `responses` library if VCR.py fails.

**Manual override:** for known problematic sources, you can skip the VCR attempt entirely:

```bash
datahub ingest replay recording.zip --password mysecret --use-responses-lib
```
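To see why the fallback sidesteps the conflict: the `responses` library intercepts at the requests adapter layer rather than patching urllib3 internals. A self-contained sketch of that mocking style (illustrative only, not DataHub's actual replay shim; the URL and payload are made up):

```python
import requests
import responses

@responses.activate
def replay_offline():
    # Register a canned response, analogous to one entry in a recorded cassette
    responses.add(
        responses.GET,
        "https://looker.example.com/api/4.0/dashboards",  # hypothetical recorded URL
        json=[{"id": "1", "title": "Sales Overview"}],
        status=200,
    )
    resp = requests.get("https://looker.example.com/api/4.0/dashboards")
    print(resp.json())

replay_offline()
```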
For detailed technical information, see the Recording Module README.