metadata-ingestion/docs/sources/dataplex/dataplex_post.md
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
:::caution
The Google Cloud Knowledge Catalog (Dataplex) connector will overwrite metadata from other Google Cloud source connectors (BigQuery, GCS, etc.) if they extract the same entities. If you're running multiple Google Cloud connectors, be aware that the last connector to run will determine the final metadata state for overlapping entities.
:::
Datasets discovered use the same URNs as native connectors (e.g., bigquery, gcs). This means:
The connector adds the following custom properties to datasets:
| Property | Always Present | Description |
|---|---|---|
| `dataplex_ingested` | Yes | Marker indicating the dataset was ingested via Google Cloud Knowledge Catalog (Dataplex) |
| `dataplex_entry_id` | Yes | The entry identifier in Google Cloud Knowledge Catalog (Dataplex) |
| `dataplex_entry_group` | Yes | The entry group containing this entry |
| `dataplex_fully_qualified_name` | Yes | The fully qualified name of the entry |
| `dataplex_entry_type` | No | The Google Cloud Knowledge Catalog (Dataplex) entry type (e.g. `bigquery-table`) |
| `dataplex_parent_entry` | No | The parent entry name, if set |
| `dataplex_source_resource` | No | The source resource identifier from the entry source |
| `dataplex_source_system` | No | The source system from the entry source |
| `dataplex_source_platform` | No | The source platform from the entry source |
| `dataplex_aspect_<aspect_type>` | No | One property per aspect attached to the entry, named after the aspect type |
Filter which datasets to ingest using regex patterns with allow/deny lists:
Example:
```yaml
source:
  type: dataplex
  config:
    project_ids:
      - "my-gcp-project"
    filter_config:
      entries:
        pattern:
          allow:
            - "production_.*" # Only production datasets
          deny:
            - ".*_test" # Exclude test datasets
            - ".*_temp" # Exclude temporary datasets
```
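To make the allow/deny semantics concrete, here is a minimal Python sketch of how such a filter could be evaluated. It is not the connector's implementation: `is_allowed` is a hypothetical helper, and it assumes that patterns are matched from the start of the name (`re.match`) and that a deny match always wins over an allow match.

```python
import re
from typing import Iterable


def is_allowed(name: str, allow: Iterable[str], deny: Iterable[str]) -> bool:
    """Return True if `name` matches some allow pattern and no deny pattern."""
    # Assumption: patterns are anchored at the start of the name (re.match),
    # and a deny match takes precedence over any allow match.
    if any(re.match(pattern, name) for pattern in deny):
        return False
    return any(re.match(pattern, name) for pattern in allow)


# Same patterns as the example configuration above.
allow = ["production_.*"]
deny = [".*_test", ".*_temp"]

names = ["production_sales", "production_sales_test", "staging_sales", "production_temp"]
kept = [name for name in names if is_allowed(name, allow, deny)]
print(kept)
```

Under these assumptions a pattern such as `.*_test` also excludes names that contain `_test` in the middle, because `re.match` does not require the pattern to consume the whole name. Append `$` to a pattern if only a suffix match is intended.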
When include_lineage is enabled and proper permissions are granted, the connector extracts table-level lineage using the Dataplex Lineage API. The connector automatically tracks lineage from these Google Cloud systems:
Supported Systems:
:::note
Only BigQuery lineage has been thoroughly tested with this connector. Lineage from other systems may work but has not been validated.
:::
Not Supported:
Lineage Limitations:
For more details, see Google Cloud Knowledge Catalog (Dataplex) Lineage Documentation.
Metadata Extraction:

- `include_schema` (default: `true`): Extract column metadata and types
- `include_lineage` (default: `true`): Extract table-level lineage (automatically retries transient errors)

Entry detail fetching and lineage lookups are parallelised using thread pools to significantly reduce wall-clock ingestion time for large deployments.
The entries stage runs in three phases:

1. `list_entry_groups` + `list_entries`: sequential listing across all project × location pairs (fast; no parallelism needed)
2. `get_entry(ALL)` calls: parallel across a flat worker pool so entries are distributed evenly regardless of how they are spread across projects
3. `search_entries`: sequential (already fully fetched, nothing to parallelise)

The lineage stage dispatches one worker per entry to fetch `search_links` results across all configured `lineage_locations`, so total API call time scales roughly with `entries / max_workers_lineage` rather than `entries × lineage_locations`.
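The flat worker pool described above can be illustrated with Python's standard `ThreadPoolExecutor`. This is only a sketch of the general pattern, not the connector's code: `fetch_entry` is a hypothetical stand-in for a single `get_entry` call, and the entry names are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fetch_entry(entry_name: str) -> Dict[str, str]:
    """Hypothetical stand-in for one get_entry API call."""
    return {"name": entry_name, "status": "fetched"}


def fetch_all(entry_names: List[str], max_workers: int = 10) -> List[Dict[str, str]]:
    # One flat pool is shared by every entry, so work is spread evenly
    # even when most entries belong to a single project.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(fetch_entry, entry_names))


# Entries from three made-up projects, flattened into a single work list.
entry_names = [f"projects/p{i % 3}/entries/table_{i}" for i in range(25)]
results = fetch_all(entry_names, max_workers=10)
print(len(results))
```

Because the calls are I/O-bound, threads are a reasonable fit here; the pool size caps the number of concurrent API requests, which is why it should be tuned against API quota.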
Two config fields control the thread pool sizes:
| Field | Default | Description |
|---|---|---|
| `max_workers_entries` | 10 | Workers for `get_entry` calls (entries stage) |
| `max_workers_lineage` | 10 | Workers for `search_links` calls (lineage stage) |
Increase these values for large deployments, subject to your GCP API quota limits.
```yaml
source:
  type: dataplex
  config:
    project_ids:
      - "my-gcp-project"
    entries_locations:
      - "us"
    # Parallel processing (tune to your deployment size and API quota)
    max_workers_entries: 20 # default: 10
    max_workers_lineage: 40 # default: 10
```
Lineage Retry Settings (optional):

- `lineage_max_retries` (default: 3, range: 1-10): Retry attempts for transient errors
- `lineage_retry_backoff_multiplier` (default: 1.0, range: 0.1-10.0): Backoff delay multiplier

Example Configuration:
```yaml
source:
  type: dataplex
  config:
    project_ids:
      - "my-gcp-project"
    # Location for entries (Universal Catalog) - defaults to ["us", "eu", "asia", "global"]
    # Must be multi-region (us, eu, asia) for system entry groups like @bigquery
    entries_locations:
      - "us"
    # Metadata extraction settings
    include_schema: true # Enable schema metadata extraction (default: true)
    include_lineage: true # Enable lineage extraction with automatic retries
    # Lineage retry settings (optional, defaults shown)
    lineage_max_retries: 3 # Max retry attempts (range: 1-10)
    lineage_retry_backoff_multiplier: 1.0 # Exponential backoff multiplier (range: 0.1-10.0)
```
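The interaction between the two retry settings can be sketched as follows. The exact delay formula used by the connector is not documented here, so this example makes two assumptions: `max_retries` counts total attempts, and the delay is `backoff_multiplier * 2 ** (attempt - 1)` seconds. `TransientError` and `call_with_retries` are hypothetical names.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable API error."""


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    backoff_multiplier: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    # Assumption: max_retries is the total number of attempts, and the
    # delay doubles after each failed attempt (exponential backoff).
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; the caller decides what to do
            sleep(backoff_multiplier * 2 ** (attempt - 1))
    raise ValueError("max_retries must be at least 1")


# Usage: a call that fails twice and succeeds on the third attempt.
# Delays are recorded instead of slept so the example runs instantly.
delays: List[float] = []
outcomes = [TransientError(), TransientError(), "lineage-ok"]


def flaky_call() -> str:
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_retries(flaky_call, max_retries=3, backoff_multiplier=1.0, sleep=delays.append)
print(result, delays)
```

With this formula, raising the multiplier stretches every delay proportionally, while lowering the retry count shortens the time before a persistent failure is reported.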
Configuration for Large Deployments:
For deployments with thousands of entries, memory optimization is important. The connector uses batched emission to keep memory bounded:
```yaml
source:
  type: dataplex
  config:
    project_ids:
      - "my-gcp-project"
    entries_locations:
      - "us"
    # Performance tuning
    batch_size: 1000 # Process and emit 1000 entries at a time to optimize memory usage
```
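Batched emission in general works by pulling a fixed number of items from a stream at a time, so that only one batch is held in memory. The sketch below shows that idea with a generic generator; `in_batches` is a hypothetical helper and does not reflect the connector's internal code.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        # islice consumes at most batch_size items without materialising the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 2500 made-up entries processed 1000 at a time.
batch_lengths = [len(batch) for batch in in_batches(range(2500), batch_size=1000)]
print(batch_lengths)
```

A smaller batch size lowers peak memory at the cost of more emit cycles; a larger one does the opposite.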
When include_glossaries is enabled (default), the connector ingests all Dataplex Business Glossaries from the configured glossary_locations (default: global) and emits the full Glossary → Category → Term hierarchy as DataHub Glossary entities.
Each term is emitted as a GlossaryTerm with:

- `term_source: EXTERNAL` and a `source_url` linking directly to the term in the Dataplex console
- `custom_properties` carrying `project_id`, `location`, `glossary_id`, and `term_id`

When `include_glossary_term_associations` is enabled (opt-in, default: `false`), the connector additionally resolves term-to-asset links using the Dataplex `lookupEntryLinks` API and attaches the corresponding terms to each linked DataHub dataset. This phase runs after entries are ingested, so only assets already discovered by the entries stage can be linked. It requires `roles/resourcemanager.projectViewer` on all configured projects.
Configuration:
| Field | Default | Description |
|---|---|---|
| `include_glossaries` | `true` | Ingest Dataplex Business Glossaries as GlossaryNode/GlossaryTerm |
| `include_glossary_term_associations` | `false` | Attach glossary terms to linked datasets via `lookupEntryLinks`. Requires `roles/resourcemanager.projectViewer` (opt-in) |
| `glossary_locations` | `[global]` | GCP locations to scan for glossaries; most glossaries live in `global` |
| `max_workers_glossary` | 10 | Parallel workers for glossary ingestion and term-association lookups |
Example:
```yaml
source:
  type: dataplex
  config:
    project_ids:
      - "my-gcp-project"
    entries_locations:
      - "us"
    # Business Glossary ingestion (enabled by default)
    include_glossaries: true
    glossary_locations:
      - "global"
    # Term-to-asset associations (opt-in; requires roles/resourcemanager.projectViewer)
    # include_glossary_term_associations: true
```
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
Please be aware of the following documented delays when using this connector. These are standard Knowledge Catalog (Dataplex) behaviors and typically do not indicate an error:
If updates exceed these windows, check your Cloud Logging for specific job errors or permission issues.
Automatic Retry Behavior:
The connector automatically retries transient errors when extracting lineage:
After exhausting retries, the connector logs a warning and continues processing other entries. You'll still get metadata even if lineage extraction fails for some entries.
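The "warn and continue" behavior can be pictured with a small Python sketch. It is illustrative only: `fetch_lineage` is a hypothetical lookup that fails for one made-up entry, standing in for a call whose retries have been exhausted.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("dataplex_lineage_sketch")


def fetch_lineage(entry: str) -> List[str]:
    """Hypothetical lineage lookup; fails for one entry as if retries ran out."""
    if entry == "orders":
        raise RuntimeError("retries exhausted")
    return [f"upstream_of_{entry}"]


def collect_lineage(entries: List[str]) -> Dict[str, List[str]]:
    lineage: Dict[str, List[str]] = {}
    for entry in entries:
        try:
            lineage[entry] = fetch_lineage(entry)
        except RuntimeError as exc:
            # A failure for one entry is logged and skipped, not propagated.
            logger.warning("Skipping lineage for %s: %s", entry, exc)
    return lineage


lineage = collect_lineage(["customers", "orders", "payments"])
print(sorted(lineage))
```

In this sketch the failed entry simply has no lineage recorded, while the remaining entries are processed as usual.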
Common Issues:

- Ensure the `roles/datalineage.viewer` role is granted on all projects.
- Increase `lineage_retry_backoff_multiplier` to add more delay between retries, or decrease `lineage_max_retries` if you prefer faster failure.

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.