Matillion DPC Pre - DataHub

Overview

The matillion-dpc module ingests metadata from Matillion Data Productivity Cloud (DPC) into DataHub. It extracts pipelines, streaming pipelines, projects, environments, execution history, and table and column-level lineage via the Matillion OpenLineage API.

Prerequisites

Obtain API Credentials

The connector uses OAuth2 client credentials and automatically handles token generation and refresh.

1. Log in to Matillion Data Productivity Cloud as a Super Admin.
2. Navigate to Profile & Account → API credentials.
3. Click Set an API Credential.
4. Provide a descriptive name (e.g., "DataHub Integration").
5. Assign an Account Role with read permissions for the required APIs.
6. Click Save and immediately copy the Client Secret (it is not shown again).
7. Note the Client ID (it remains visible).

For detailed instructions, see Matillion API Authentication.
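
The connector performs the token exchange itself, so nothing below is needed for ingestion. For checking freshly created credentials by hand, here is a minimal sketch of a generic OAuth2 client-credentials request in Python. The token URL is deliberately a placeholder and `build_token_request` is a hypothetical helper, not part of the connector; take the real endpoint and any additional parameters from the Matillion API Authentication page.

```python
import urllib.parse
import urllib.request

# Placeholder only (assumption): replace with the token endpoint
# documented by Matillion before sending anything.
TOKEN_URL = "https://example.invalid/oauth/token"


def build_token_request(client_id: str, client_secret: str,
                        token_url: str = TOKEN_URL) -> urllib.request.Request:
    """Build, but do not send, an OAuth2 client-credentials token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it; a successful response from a standard OAuth2 server is a JSON document containing an `access_token` field.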

Required Permissions

The API credentials must have an Account Role with Read permissions to:

Projects (/v1/projects)
Environments (/v1/environments)
Pipelines (/v1/pipelines)
Schedules (/v1/schedules)
Lineage Events (/v1/lineage/events)
Pipeline Executions (/v1/pipeline-executions) - optional
Streaming Pipelines (/v1/streaming-pipelines) - optional

If you use an account role other than Super Admin, also grant the project-level and environment-level roles needed to read the projects and environments you want to ingest.

See Matillion RBAC documentation for details.
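
When validating a new credential, it can help to separate failures that block ingestion from failures that only reduce what is extracted. The sketch below classifies HTTP status codes that you have already collected for each endpoint listed above; it makes no API calls itself. Treating 401 and 403 as "permission missing" follows ordinary HTTP semantics and is an assumption about how the Matillion API reports access problems. `summarize_access` is a hypothetical helper.

```python
# Endpoints from the list above, split by whether they are marked optional.
REQUIRED_PATHS = [
    "/v1/projects",
    "/v1/environments",
    "/v1/pipelines",
    "/v1/schedules",
    "/v1/lineage/events",
]
OPTIONAL_PATHS = [
    "/v1/pipeline-executions",
    "/v1/streaming-pipelines",
]

# Assumption: these status codes signal a missing or insufficient role.
DENIED = {401, 403}


def summarize_access(status_by_path: dict[str, int]) -> dict[str, list[str]]:
    """Group endpoints by how a denied response would affect ingestion."""
    def denied(paths: list[str]) -> list[str]:
        return [p for p in paths if status_by_path.get(p) in DENIED]

    return {
        "blocking": denied(REQUIRED_PATHS),   # required endpoints that were denied
        "degraded": denied(OPTIONAL_PATHS),   # optional endpoints that were denied
        "unchecked": [p for p in REQUIRED_PATHS + OPTIONAL_PATHS
                      if p not in status_by_path],
    }
```

An empty `blocking` list with a non-empty `degraded` list would suggest that core metadata can be read but execution history or streaming pipelines cannot.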

Lineage and Dependencies

The connector automatically extracts:

Table and Column-Level Lineage - From the OpenLineage Events API (/v1/lineage/events)
Operational Metadata - Pipeline execution history from the Pipeline Executions API (/v1/pipeline-executions), emitted as DataProcessInstance entities
Child Pipeline Dependencies - When a pipeline calls another pipeline, the connector records a step-to-step dependency so that orchestration across pipelines is visible
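
To make the lineage item concrete, the sketch below shows how table and column edges can be read from a single event shaped according to the OpenLineage specification (`inputs`, `outputs`, and the `columnLineage` dataset facet). It is an illustration, not the connector's code: how the Matillion API wraps or pages these events is not covered here, and the sample event is invented for the example.

```python
def lineage_edges(event: dict) -> tuple[list, list]:
    """Return (table_edges, column_edges) found in one OpenLineage event."""
    inputs = [(d["namespace"], d["name"]) for d in event.get("inputs", [])]
    table_edges, column_edges = [], []
    for out in event.get("outputs", []):
        target = (out["namespace"], out["name"])
        # Table level: every input dataset feeds every output dataset.
        table_edges.extend((src, target) for src in inputs)
        # Column level: taken from the columnLineage facet when present.
        fields = out.get("facets", {}).get("columnLineage", {}).get("fields", {})
        for column, spec in fields.items():
            for src in spec.get("inputFields", []):
                column_edges.append((
                    (src["namespace"], src["name"], src["field"]),
                    target + (column,),
                ))
    return table_edges, column_edges


# Invented sample event, for illustration only.
SAMPLE_EVENT = {
    "inputs": [{"namespace": "postgresql://host:5432", "name": "analytics.public.orders"}],
    "outputs": [{
        "namespace": "snowflake://account.snowflakecomputing.com",
        "name": "DW.PUBLIC.ORDERS",
        "facets": {"columnLineage": {"fields": {
            "ORDER_ID": {"inputFields": [{
                "namespace": "postgresql://host:5432",
                "name": "analytics.public.orders",
                "field": "id",
            }]},
        }}},
    }],
}
```

Calling `lineage_edges(SAMPLE_EVENT)` yields one table edge and one column edge. The namespace of each dataset is the value that the namespace mapping described in the next section operates on.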

OpenLineage Namespace Mapping (Optional)

Map OpenLineage namespace URIs to DataHub platform instances so that lineage connects to datasets that are already in DataHub. If no mapping is configured, the connector derives the platform type from the URI (e.g., postgresql://... → postgres) and uses the default environment (PROD).
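
The fallback behaviour can be pictured with a short Python sketch. Only the postgresql → postgres example comes from this page; the rest of the lookup table, the function name `platform_from_namespace`, and the rule of passing unknown schemes through unchanged are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Illustrative scheme-to-platform table (assumption, apart from postgresql).
SCHEME_TO_PLATFORM = {
    "postgresql": "postgres",
    "snowflake": "snowflake",
    "bigquery": "bigquery",
}


def platform_from_namespace(namespace: str,
                            default_env: str = "PROD") -> tuple[str, str]:
    """Derive (platform, env) from an OpenLineage namespace URI."""
    scheme = urlparse(namespace).scheme.lower()
    # Unknown schemes are passed through unchanged in this sketch.
    platform = SCHEME_TO_PLATFORM.get(scheme, scheme)
    return platform, default_env
```

For example, `platform_from_namespace("postgresql://host:5432")` returns `("postgres", "PROD")`. No platform instance is involved in this path, which is why an explicit mapping is needed when the target datasets were ingested with one.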

When to use: Configure this when you need lineage to connect to existing datasets with platform instances.

Example namespaces: postgresql://host:5432, snowflake://account.snowflakecomputing.com, bigquery://project

Example configuration (YAML):

namespace_to_platform_instance:
  "postgresql://prod-db.us-east-1.rds.amazonaws.com:5432":
    platform_instance: postgres_prod
    env: PROD
    database: analytics
    schema: public

  "snowflake://prod-account.snowflakecomputing.com":
    platform_instance: snowflake_prod
    env: PROD
    convert_urns_to_lowercase: true

Platform instances must match those used when ingesting the source data platforms.
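
The reason for this requirement is that the platform instance is part of the DataHub dataset URN, so a mismatch yields a different URN and the lineage edge points at a dataset that does not exist. The Python sketch below illustrates this with the Snowflake entry from the example. The URN layout follows DataHub's dataset URN convention; the function `dataset_urn` and the way it applies `convert_urns_to_lowercase` are simplifications assumed for illustration, and the `database` and `schema` options are left out.

```python
from urllib.parse import urlparse

# Mirrors the Snowflake entry of the example configuration.
MAPPING = {
    "snowflake://prod-account.snowflakecomputing.com": {
        "platform_instance": "snowflake_prod",
        "env": "PROD",
        "convert_urns_to_lowercase": True,
    },
}


def dataset_urn(namespace: str, name: str, mapping: dict) -> str:
    """Build a dataset URN for an OpenLineage dataset (simplified sketch)."""
    entry = mapping.get(namespace, {})
    # Assumption: the platform name equals the URI scheme for this example.
    platform = urlparse(namespace).scheme.lower()
    if entry.get("convert_urns_to_lowercase"):
        name = name.lower()
    instance = entry.get("platform_instance")
    qualified = f"{instance}.{name}" if instance else name
    env = entry.get("env", "PROD")
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{qualified},{env})"
```

With the mapping, `ANALYTICS.PUBLIC.ORDERS` resolves to `urn:li:dataset:(urn:li:dataPlatform:snowflake,snowflake_prod.analytics.public.orders,PROD)`. With an empty mapping the same table resolves to a URN without the `snowflake_prod.` prefix and with the original casing, which would not match a dataset ingested under that platform instance.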