metadata-ingestion/docs/sources/dataplex/dataplex_post.md
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
:::caution
The Dataplex connector will overwrite metadata from other Google Cloud source connectors (BigQuery, GCS, etc.) if they extract the same entities. If you're running multiple Google Cloud connectors, be aware that the last connector to run will determine the final metadata state for overlapping entities.
:::
Datasets discovered by Dataplex use the same URNs as native connectors (e.g., `bigquery`, `gcs`), so metadata emitted by this connector merges with, and can overwrite, metadata from those connectors for the same entities.
The connector adds the following custom properties to datasets:
- `dataplex_entry_id`: The entry identifier in Dataplex
- `dataplex_entry_group`: The entry group containing this entry
- `dataplex_fully_qualified_name`: The fully qualified name of the entry
- `dataplex_ingested`: Marker indicating the dataset was ingested via Dataplex

:::note
To access system-managed entry groups like `@bigquery`, use multi-region locations (`us`, `eu`, `asia`) via the `entries_location` config parameter. Regional locations (`us-central1`, etc.) only contain placeholder entries.
:::
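For illustration, the custom properties above might appear on a dataset roughly like this (all values below are hypothetical, not taken from a real ingestion run):

```python
# Hypothetical custom properties attached to a Dataplex-ingested dataset.
custom_properties = {
    "dataplex_entry_id": "my_dataset.my_table",
    "dataplex_entry_group": "@bigquery",
    "dataplex_fully_qualified_name": "bigquery:my-gcp-project.my_dataset.my_table",
    "dataplex_ingested": "true",
}

def is_from_dataplex(props: dict) -> bool:
    """Use the marker property to tell Dataplex-sourced datasets apart."""
    return props.get("dataplex_ingested") == "true"

print(is_from_dataplex(custom_properties))  # → True
```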
Filter which datasets to ingest using regex patterns with allow/deny lists:
Example:
```yaml
source:
  type: dataplex
  config:
    project_ids:
      - "my-gcp-project"
    filter_config:
      entries:
        dataset_pattern:
          allow:
            - "production_.*" # Only production datasets
          deny:
            - ".*_test" # Exclude test datasets
            - ".*_temp" # Exclude temporary datasets
```
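The allow/deny semantics can be sketched in Python. This is a simplified stand-in, assuming deny patterns take precedence and patterns must match the full dataset name; the connector's exact matching behavior may differ:

```python
import re

def is_allowed(name: str, allow: list[str], deny: list[str]) -> bool:
    """Simplified allow/deny filtering: a name passes if it matches
    at least one allow pattern and no deny pattern (deny wins)."""
    if any(re.fullmatch(p, name) for p in deny):
        return False
    return any(re.fullmatch(p, name) for p in allow)

allow = ["production_.*"]
deny = [".*_test", ".*_temp"]

print(is_allowed("production_sales", allow, deny))       # True
print(is_allowed("production_sales_test", allow, deny))  # False (denied)
print(is_allowed("staging_sales", allow, deny))          # False (not allowed)
```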
When include_lineage is enabled and proper permissions are granted, the connector extracts table-level lineage using the Dataplex Lineage API. Dataplex automatically tracks lineage from these Google Cloud systems:
Supported Systems:
:::note
Only BigQuery lineage has been thoroughly tested with this connector. Lineage from other systems may work but has not been validated.
:::
Not Supported:
Lineage Limitations:
For more details, see Dataplex Lineage Documentation.
Metadata Extraction:

- `include_schema` (default: `true`): Extract column metadata and types
- `include_lineage` (default: `true`): Extract table-level lineage (automatically retries transient errors)

Performance Tuning:

- `batch_size` (default: 1000): Entries per batch for memory optimization. Set to `None` to disable batching (small deployments only)

Lineage Retry Settings (optional):

- `lineage_max_retries` (default: 3, range: 1-10): Retry attempts for transient errors
- `lineage_retry_backoff_multiplier` (default: 1.0, range: 0.1-10.0): Backoff delay multiplier

Example Configuration:
```yaml
source:
  type: dataplex
  config:
    project_ids:
      - "my-gcp-project"
    # Location for entries (Universal Catalog) - defaults to "us"
    # Must be multi-region (us, eu, asia) for system entry groups like @bigquery
    entries_location: "us"
    # Metadata extraction settings
    include_schema: true # Enable schema metadata extraction (default: true)
    include_lineage: true # Enable lineage extraction with automatic retries
    # Lineage retry settings (optional, defaults shown)
    lineage_max_retries: 3 # Max retry attempts (range: 1-10)
    lineage_retry_backoff_multiplier: 1.0 # Exponential backoff multiplier (range: 0.1-10.0)
```
Configuration for Large Deployments:
For deployments with thousands of entries, memory optimization is important. The connector uses batched emission to keep memory bounded:
```yaml
source:
  type: dataplex
  config:
    project_ids:
      - "my-gcp-project"
    entries_location: "us"
    # Performance tuning
    batch_size: 1000 # Process and emit 1000 entries at a time to optimize memory usage
```
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
Automatic Retry Behavior:
The connector automatically retries transient errors when extracting lineage. After exhausting retries, it logs a warning and continues processing other entries, so you still get metadata even if lineage extraction fails for some entries.
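A minimal sketch of the retry loop described above, using the two config knobs. The names `fetch_lineage_with_retries` and `TransientError` are hypothetical, and the backoff formula (multiplier × 2^attempt) is an assumption about how the multiplier is applied; the connector's real implementation may differ:

```python
import time

class TransientError(Exception):
    """Stand-in for a transient API error (timeouts, rate limits, etc.)."""

def fetch_lineage_with_retries(fetch, max_retries: int = 3,
                               backoff_multiplier: float = 1.0):
    """Call fetch(), retrying TransientError up to max_retries attempts
    with exponential backoff; return None once retries are exhausted."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except TransientError:
            if attempt == max_retries - 1:
                # Mirrors the documented behavior: warn and move on so
                # other entries still get their metadata.
                print("warning: lineage extraction failed, continuing")
                return None
            time.sleep(backoff_multiplier * (2 ** attempt))

# A fetch that fails twice, then succeeds on the third attempt.
attempts = {"n": 0}
def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("temporary API error")
    return "lineage-graph"

print(fetch_lineage_with_retries(flaky_fetch, backoff_multiplier=0.01))  # lineage-graph
```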
Common Issues:
- Use multi-region locations (`us`, `eu`, `asia`) rather than specific regions (`us-central1`); the connector automatically uses the `entries_location` config for this.
- Ensure the `roles/datalineage.viewer` role is granted on all projects.
- Increase `lineage_retry_backoff_multiplier` to add more delay between retries, or decrease `lineage_max_retries` if you prefer faster failure.

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.