Matillion Dpc Post - Datahub

Capabilities

OpenLineage Namespace Mapping

Optional configuration to map OpenLineage namespace URIs to DataHub platform information. Without this, the connector extracts platform type from URIs with default environment.

Fields:

platform_instance: Platform instance identifier (must match source ingestion)
database / schema: Defaults for incomplete dataset names from OpenLineage
- 3-tier platforms (Snowflake, Postgres, Redshift): database.schema.table
- 2-tier platforms (MySQL, Hive): schema.table
convert_urns_to_lowercase: Normalize URNs to lowercase (use true for Snowflake)
env: Environment tag (PROD, DEV, etc.)

Fallback behavior: Unmapped namespaces extract platform type from the URI (e.g., postgresql://... → postgres) without platform instance assignment.

SQL Parsing for Column-Level Lineage

Enable parse_sql_for_lineage: true to parse SQL queries from OpenLineage events for additional column-level lineage.

Requirements:

DataHub graph connection configured
Schema information in OpenLineage events

Platform-Specific Handling

Snowflake: Use convert_urns_to_lowercase: true in namespace mapping

BigQuery: 3-tier naming (project.dataset.table). Set database: project-id, schema: dataset-name

MySQL / 2-tier: 2-tier naming (schema.table). Set schema only

Postgres / Redshift: 3-tier naming (database.schema.table). Set both database and schema

Filtering Options

The connector supports flexible regex-based filtering to control what metadata is ingested.

Project Filtering

yaml

project_patterns:
  allow: ["^prod-.*", "^staging-.*"]
  deny: [".*-deprecated$"]

Environment Filtering

yaml

environment_patterns:
  allow: ["^production$", "^staging$"]
  deny: ["^sandbox.*"]

Pipeline Filtering

yaml

pipeline_patterns:
  allow: [".*"]
  deny: ["^test_.*", ".*_backup$"]

Streaming Pipeline Filtering

yaml

streaming_pipeline_patterns:
  allow: ["^cdc_.*"]
  deny: [".*_test$"]

All patterns are case-insensitive by default and support full regex syntax. Deny patterns take precedence over allow patterns.

Child Pipeline Dependencies

The connector automatically detects and tracks when pipelines call other pipelines (via "Run Pipeline" components). This creates step-level dependency relationships in DataHub, showing:

Which pipeline steps trigger child pipelines
Complete execution lineage across pipeline orchestrations
Cross-pipeline data flow for comprehensive impact analysis

No configuration needed — this feature is automatic when execution history is ingested.

Published vs Unpublished Pipelines

The connector can discover pipelines from two sources:

Published Pipelines — Pipelines explicitly published in Matillion DPC (fetched from /published-pipelines API)
Unpublished Pipelines — Pipelines discovered from recent execution history (fetched from /pipeline-executions API)

By default, both types are ingested. To only ingest published pipelines:

yaml

include_unpublished_pipelines: false

This is useful when:

You want to control what appears in DataHub via Matillion's publish workflow
You have many development/test pipelines that run but shouldn't be documented
You want to reduce ingestion time and API calls

Limitations

SQL parsing for column-level lineage requires a DataHub graph connection and schema information in OpenLineage events. Unsupported SQL dialects or complex queries are skipped with a warning.
Column-level lineage is only available when Matillion pipelines emit SQL via OpenLineage; transformations without SQL output will have coarse-grained lineage only.

Troubleshooting

Lineage Not Showing Up

Verify namespace mapping matches source ingestion platform instances
Check logs for Processing OpenLineage event messages
Confirm dataset names in OpenLineage match actual tables

Column-Level Lineage Missing

Enable parse_sql_for_lineage: true (requires DataHub graph connection).

Execution History Not Appearing

Adjust start_time to query further back in time if needed
Verify API permissions for Pipeline Executions API

Performance Issues

Reduce time window by adjusting start_time (e.g., only last 7 days instead of 30)
Use filtering patterns to reduce scope:
- project_patterns to filter projects
- environment_patterns to filter environments
- pipeline_patterns to filter pipelines
- streaming_pipeline_patterns to filter streaming pipelines
Disable include_streaming_pipelines if not needed
Increase api_config.request_timeout_sec if needed