docs/how/configure-cdc.md
:::info TL;DR CDC mode ensures metadata changes are processed in exact database commit order by generating MetadataChangeLogs (MCLs) from database change events instead of directly from the GMS service. This is critical for use cases requiring strict ordering guarantees. CDC uses Debezium to capture database changes and route them through Kafka to DataHub's MCE Consumer. Not enabled by default - requires explicit configuration. :::
DataHub supports CDC (Change Data Capture) mode for generating MetadataChangeLogs (MCLs). In CDC mode, MCLs are generated from database change events rather than being produced directly by GMS. This provides stronger ordering guarantees - MCLs are generated in the exact order of database transaction commits.
Use CDC mode if you need strict ordering guarantees for metadata changes.
The default deployments do not use CDC. Explicit configuration is required to switch to CDC mode.
Before diving into CDC configuration, let's define the key terms you'll encounter:
MCL (MetadataChangeLog): A versioned event that records a change to metadata in DataHub. MCLs are the source of truth for downstream consumers like search indexers, graph builders, and data lineage systems. Every metadata update generates an MCL.
MCP (MetadataChangeProposal): An incoming request to change metadata in DataHub. MCPs are submitted by users, ingestion pipelines, or APIs and are processed by GMS.
GMS (General Metadata Service): The core backend service of DataHub. GMS receives MCPs, validates them, writes changes to the database, and (in normal mode) emits MCLs.
MCE Consumer: A separate microservice that consumes metadata events from Kafka. In CDC mode, it reads CDC events from the database change stream and generates MCLs by calling EntityService.
EntityService: The internal DataHub service responsible for reading and writing entity metadata. In CDC mode, the
MCE Consumer calls EntityService.produceMCLAsync() to generate MCLs from database change events.
Kafka: A distributed event streaming platform (think: high-performance message queue). Kafka is used by DataHub to decouple metadata producers from consumers. All MCLs flow through Kafka topics.
Kafka Topic: A named channel or feed for events. DataHub uses multiple topics: one for CDC events
(datahub.metadata_aspect_v2) and one for MCLs (MetadataChangeLog_Versioned_v1).
Kafka Partition: A subdivision of a topic for parallelism and ordering. Messages with the same partition key (e.g., same URN) always go to the same partition, guaranteeing ordering within that entity.
Kafka Connect: A framework for streaming data between Kafka and external systems (like databases). Debezium runs as a Kafka Connect plugin.
Debezium: An open-source CDC platform that captures row-level changes from databases (MySQL, PostgreSQL, etc.) and streams them to Kafka. Debezium reads database transaction logs (binary logs for MySQL, WAL for PostgreSQL) to capture changes.
Use this decision tree to determine if CDC is right for your deployment:
┌─────────────────────────────────────────────────────────────┐
│ Do you need STRICT ORDERING of metadata changes? │
│ (e.g., regulatory compliance, audit trails, lineage sync) │
└───────────────┬─────────────────────────────────────────────┘
│
┌───────┴───────┐
│ YES │ NO
│ │
v v
┌───────────────┐ ┌───────────────────────────────────────┐
│ Use CDC Mode │ │ Use Normal Mode (Default) │
│ │ │ - Simpler setup │
│ Benefits: │ │ - No additional infrastructure │
│ ✅ Exact order │ │ - Lower latency (no CDC hop) │
│ ✅ Replay safe │ │ - Sufficient for most use cases │
│ │ │ │
│ Tradeoffs: │ │ Tradeoffs: │
│ ⚠️ More infra │ │ ⚠️ Race conditions possible for │
│ ⚠️ Higher │ │ rapid updates to same entity │
│ latency │ └───────────────────────────────────────┘
│ ⚠️ Operational │
│ complexity │
└───────────────┘
Key Point: GMS writes to the database AND emits the MCL in the same code path. If two updates to the same entity happen in rapid succession, their MCLs might be emitted in a different order than their database commits.
Key Point: GMS ONLY writes to the database. MCLs are generated later by the MCE Consumer reading from the database's transaction log (via Debezium). This guarantees MCLs are emitted in exact database commit order.
| Aspect | Normal Mode | CDC Mode |
|---|---|---|
| Ordering Guarantee | Best-effort (may have races) | Strict (exact DB commit order) |
| Latency | Lower (~ms) | Higher (~100ms+ with CDC hop) |
| Infrastructure | GMS + Kafka only | GMS + Kafka + Kafka Connect + Debezium |
| Database Config | Standard | Requires replication enabled (binary logs or WAL) |
| Operational Complexity | Low | Medium (more moving parts) |
| Replay Capability | Limited (MCLs stored in Kafka) | High (can replay from DB) |
| Use Case | Standard deployments | Compliance, high-frequency updates, critical lineage |
The fastest way to get CDC running for local development or testing:
:::tip Quick Start This uses auto-configuration mode where DataHub automatically sets up the database and Debezium connector. Not recommended for production. :::
Add these environment variables to your docker-compose services:
datahub-gms:
environment:
- CDC_MCL_PROCESSING_ENABLED=true
mce-consumer:
environment:
- CDC_MCL_PROCESSING_ENABLED=true
datahub-upgrade:
environment:
- CDC_MCL_PROCESSING_ENABLED=true
- CDC_CONFIGURE_SOURCE=true # Auto-configure DB and Debezium
- CDC_DB_TYPE=mysql # or "postgres"
- CDC_USER=datahub_cdc
- CDC_PASSWORD=your_secure_password
- CDC_KAFKA_CONNECT_URL=http://kafka-connect:8083
docker-compose up -d
DataHub will automatically:
# Check connector status (should show "RUNNING")
curl http://localhost:8083/connectors/datahub-cdc-connector/status | jq
# See CDC events flowing
docker exec -it broker kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic datahub.metadata_aspect_v2 \
--max-messages 5
✅ You're done! Metadata changes will now flow through CDC.
Before enabling CDC mode, ensure you have:
How to check:
# MySQL
mysql -V
# PostgreSQL
psql --version
You need a running Kafka Connect cluster with the appropriate Debezium connector plugin installed.
Docker users: Use the debezium/connect image (includes all plugins):
kafka-connect:
image: debezium/connect:2.3
ports:
- "8083:8083"
environment:
- BOOTSTRAP_SERVERS=broker:29092
- GROUP_ID=1
- CONFIG_STORAGE_TOPIC=connect_configs
- OFFSET_STORAGE_TOPIC=connect_offsets
- STATUS_STORAGE_TOPIC=connect_status
How to check if Kafka Connect is running:
curl http://localhost:8083/connectors
# Should return: [] (empty list if no connectors are configured yet)
You need the ability to:
metadata_aspect_v2 tableWho can do this:
SUPER or REPLICATION privilegesSUPERUSER or REPLICATION roleCDC adds one additional Kafka topic: datahub.metadata_aspect_v2.
Capacity planning:
CDC mode requires configuration in three layers:
| Layer | What It Does | Auto-Config (Dev) | Manual Config (Prod) |
|---|---|---|---|
| 1. Database | Enable replication, create CDC user | ✅ Automated | ❌ Manual (recommended) |
| 2. Debezium | Configure Kafka Connect connector | ✅ Automated | ❌ Manual (recommended) |
| 3. DataHub | Enable CDC processing in GMS/MCE | ✅ Environment variables | ✅ Environment variables |
For development environments: DataHub can automate database setup via the system-update service when
CDC_CONFIGURE_SOURCE=true. The system-update service will create the necessary CDC user and configure database
settings during startup.
For production environments: Manual database configuration is recommended for better control and security practices specific to your environment.
Your MySQL instance must be configured to support CDC:
Enable Binary Logging: Configure MySQL to use ROW-based binary logging with FULL row images. This allows Debezium to capture all column changes.
Configuration (my.cnf or my.ini):
[mysqld]
log-bin=mysql-bin
binlog_format=ROW
binlog_row_image=FULL
server-id=1
How to verify:
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
Create CDC User: Create a dedicated database user for CDC with the following capabilities:
CREATE USER 'datahub_cdc'@'%' IDENTIFIED BY 'your_secure_password';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'datahub_cdc'@'%';
FLUSH PRIVILEGES;
What these privileges do:
SELECT: Read access to the metadata_aspect_v2 tableREPLICATION CLIENT: Read binary logsREPLICATION SLAVE: Monitor replication statusConfigure Binary Log Retention: Set appropriate binary log retention based on your operational needs and disaster recovery requirements. Consider factors like Debezium downtime tolerance and disk space.
-- Retain binary logs for 7 days
SET GLOBAL binlog_expire_logs_seconds = 604800;
Server ID: Ensure each MySQL instance (including the CDC connector) has a unique server ID for replication topology.
Configuration:
[mysqld]
server-id=1 # Must be unique
:::info What DataHub Automates (Development Mode)
When CDC_CONFIGURE_SOURCE=true:
:::
Your PostgreSQL instance must be configured to support logical replication:
Enable Logical Replication: Configure PostgreSQL WAL (Write-Ahead Log) level to 'logical'. This is required for CDC to capture row-level changes.
Configuration (postgresql.conf):
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
⚠️ Requires restart: Changing wal_level requires a PostgreSQL restart.
How to verify:
SHOW wal_level;
-- Should return: logical
Configure Replication Slots: Allocate sufficient replication slots (at least one per CDC connector). Replication slots track the CDC consumer's position in the WAL.
Create CDC User: Create a dedicated database user for CDC with the following capabilities:
CREATE ROLE datahub_cdc WITH REPLICATION LOGIN PASSWORD 'your_secure_password';
GRANT CONNECT ON DATABASE datahub TO datahub_cdc;
GRANT SELECT ON TABLE public.metadata_aspect_v2 TO datahub_cdc;
What these privileges do:
REPLICATION: Create and read from replication slotsCONNECT: Connect to the DataHub databaseSELECT: Read the metadata_aspect_v2 tableSet Replica Identity to FULL: Configure the metadata_aspect_v2 table to use FULL replica identity. This ensures
Debezium receives complete before/after images of changed rows, which is critical for DataHub's MCL generation.
ALTER TABLE public.metadata_aspect_v2 REPLICA IDENTITY FULL;
Why this matters: FULL replica identity ensures Debezium captures the complete row state (not just the primary key) for UPDATE and DELETE operations.
Create Publication: Create a PostgreSQL publication named dbz_publication for the metadata_aspect_v2
table. Publications define which tables participate in logical replication. DataHub expects this specific publication
name when auto-configuring.
CREATE PUBLICATION dbz_publication FOR TABLE public.metadata_aspect_v2;
Create Replication Slot: When manually configuring, create a replication slot named debezium using the
pgoutput plugin. DataHub's default configuration uses this slot name.
SELECT * FROM pg_create_logical_replication_slot('debezium', 'pgoutput');
Configure WAL Senders: Ensure sufficient max_wal_senders for the number of CDC connectors and any other
replication consumers.
:::info What DataHub Automates (Development Mode)
When CDC_CONFIGURE_SOURCE=true:
dbz_publication for the tabledebezium:::
:::danger Important PostgreSQL Notes
max_slot_wal_keep_size to prevent
runaway growth.dbz_publication and slot name debezium are DataHub's defaults - if you use
different names, update your Debezium connector configuration accordingly.:::
You have two options for configuring Debezium:
| Option | Use Case | Setup Time | Control | Security |
|---|---|---|---|---|
| Auto-Configuration | Dev/Test envs | ~5 minutes | Low (DataHub decides) | Lower (auto-creates users) |
| Manual Configuration | Production envs | ~30 minutes | High (full control) | Higher (you control) |
For development and testing environments, DataHub can automatically configure Debezium via the system-update service.
Environment Variables:
# Enable CDC processing
CDC_MCL_PROCESSING_ENABLED=true
# Enable auto-configuration
CDC_CONFIGURE_SOURCE=true
# Database type (mysql or postgres)
CDC_DB_TYPE=mysql
# CDC user credentials (will be created if doesn't exist)
CDC_USER=datahub_cdc
CDC_PASSWORD=your_secure_password
# Kafka Connect endpoint
CDC_KAFKA_CONNECT_URL=http://kafka-connect:8083
# Connector name
DATAHUB_CDC_CONNECTOR_NAME=datahub-cdc-connector
# CDC topic name
CDC_TOPIC_NAME=datahub.metadata_aspect_v2
Docker Compose Example:
datahub-gms:
environment:
- CDC_MCL_PROCESSING_ENABLED=true
mce-consumer:
environment:
- CDC_MCL_PROCESSING_ENABLED=true
datahub-upgrade:
environment:
- CDC_MCL_PROCESSING_ENABLED=true
- CDC_CONFIGURE_SOURCE=true
- CDC_DB_TYPE=mysql # or postgres
- CDC_USER=datahub_cdc
- CDC_PASSWORD=your_secure_password
- CDC_KAFKA_CONNECT_URL=http://kafka-connect:8083
:::warning Auto-Config is NOT Production-Ready Auto-configuration creates database users with elevated privileges automatically. This may violate your organization's security policies. Use manual configuration for production. :::
For production environments, manually configure Debezium Kafka Connect for better control.
Key Configuration Requirements:
Regardless of database type, your Debezium connector must be configured with the following critical settings:
Table to Capture: The connector must capture changes from the metadata_aspect_v2 table only. This is the
primary storage table for DataHub's versioned aspects.
Output Topic Name: CDC events must be routed to a single consolidated topic (default:
datahub.metadata_aspect_v2). Use Debezium's routing transformation to override the default per-table topic naming.
Message Format:
StringConverter - keys will be entity URNs as plain stringsJsonConverter - CDC payloads must be in JSON formatvalue.converter.schemas.enable: false - DataHub validates aspects internally and
doesn't use Kafka schema registry for CDC eventsPartitioning Key: Messages must be partitioned by the urn column. This ensures all changes for a given entity
are processed in order by routing them to the same Kafka partition.
The following is an example configuration for MySQL similar to what DataHub auto-configures for development/test:
{
"name": "datahub-cdc-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "mysql",
"database.port": "3306",
"database.user": "datahub_cdc",
"database.password": "your_secure_password",
"database.server.id": "184001",
"database.server.name": "datahub",
"database.include.list": "datahub",
"table.include.list": "datahub.metadata_aspect_v2",
"database.history.kafka.bootstrap.servers": "broker:29092",
"database.history.kafka.topic": "datahub.schema-changes",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"message.key.columns": "datahub.metadata_aspect_v2:urn",
"transforms": "route",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": ".*metadata_aspect_v2",
"transforms.route.replacement": "datahub.metadata_aspect_v2",
"snapshot.mode": "initial",
"snapshot.locking.mode": "minimal",
"decimal.handling.mode": "string",
"bigint.unsigned.handling.mode": "long"
}
}
Key config explanations:
database.server.id: Must be unique across all MySQL replicas and CDC connectorsmessage.key.columns: Ensures messages are partitioned by URN for orderingtransforms.route: Routes all events to a single topic instead of per-table topicssnapshot.mode: initial: Captures existing data on first runThe following is an example configuration for PostgreSQL similar to what DataHub auto-configures for development/test:
{
"name": "datahub-cdc-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "datahub_cdc",
"database.password": "your_secure_password",
"database.dbname": "datahub",
"database.server.name": "datahub",
"plugin.name": "pgoutput",
"publication.name": "dbz_publication",
"publication.autocreate.mode": "disabled",
"slot.name": "debezium",
"schema.include.list": "public",
"table.include.list": "public.metadata_aspect_v2",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"message.key.columns": "public.metadata_aspect_v2:urn",
"transforms": "route",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": ".*metadata_aspect_v2",
"transforms.route.replacement": "datahub.metadata_aspect_v2",
"snapshot.mode": "initial",
"decimal.handling.mode": "string"
}
}
Key config explanations:
plugin.name: pgoutput: Uses PostgreSQL's native logical replication output pluginpublication.name: Must match the publication you created earlierslot.name: Must match the replication slot you created earlierpublication.autocreate.mode: disabled: Prevents Debezium from auto-creating publicationsOnce you've created your connector configuration JSON, deploy it to Kafka Connect:
# Create the connector
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d @connector-config.json
# Verify it's running
curl http://localhost:8083/connectors/datahub-cdc-connector/status | jq
Environment Variables for Manual Setup:
# Enable CDC processing
CDC_MCL_PROCESSING_ENABLED=true
# Disable auto-configuration
CDC_CONFIGURE_SOURCE=false
# CDC topic name (must match connector config)
CDC_TOPIC_NAME=datahub.metadata_aspect_v2
Once database and Debezium are configured, enable CDC mode in DataHub components:
datahub-gms:
environment:
# Enable CDC mode (disables direct MCL emission from GMS)
- CDC_MCL_PROCESSING_ENABLED=true
mce-consumer:
environment:
# Enable CDC mode (enables CDC message processing)
- CDC_MCL_PROCESSING_ENABLED=true
- CDC_TOPIC_NAME=datahub.metadata_aspect_v2
datahub-upgrade:
environment:
- CDC_MCL_PROCESSING_ENABLED=true
- CDC_CONFIGURE_SOURCE=false # indicates manually configured Debezium
- CDC_DB_TYPE=mysql # or postgres
- CDC_USER=datahub_cdc
- CDC_PASSWORD=your_secure_password
What each environment variable does:
CDC_MCL_PROCESSING_ENABLED=true (GMS): Disables direct MCL emission from GMS writesCDC_MCL_PROCESSING_ENABLED=true (MCE Consumer): Enables CDC event processingCDC_TOPIC_NAME: Name of the Kafka topic where CDC events are publishedCDC_CONFIGURE_SOURCE=false: Tells DataHub you've manually configured Debezium (no auto-setup)# Check connector status
curl http://kafka-connect:8083/connectors/datahub-cdc-connector/status | jq
# Expected output shows "state": "RUNNING"
:::tip Success Indicators
RUNNINGRUNNINGconnector.trace field:::
# List topics (should see datahub.metadata_aspect_v2)
kafka-topics --bootstrap-server broker:9092 --list | grep metadata_aspect_v2
# Consume CDC events
kafka-console-consumer --bootstrap-server broker:9092 \
--topic datahub.metadata_aspect_v2 \
--from-beginning --max-messages 10
What you should see:
{
"before": null,
"after": {
"urn": "urn:li:dataset:...",
"aspect": "datasetProperties",
"version": 0,
"metadata": "{...}",
"createdon": "2024-01-01T00:00:00.000Z"
},
"op": "c"
}
Make a metadata change via DataHub UI or API
Verify CDC event is captured in the CDC topic (CDC events are in JSON format):
kafka-console-consumer --bootstrap-server broker:9092 \
--topic datahub.metadata_aspect_v2 \
--from-beginning --max-messages 1
Verify MCL is generated in the MCL topic (MCL events are Avro-serialized and require schema registry access):
# Using kafka-avro-console-consumer (requires schema registry)
kafka-avro-console-consumer --bootstrap-server broker:9092 \
--topic MetadataChangeLog_Versioned_v1 \
--property schema.registry.url=http://schema-registry:8081 \
--from-beginning --max-messages 1
Alternatively, check the MCE consumer logs to confirm MCLs are being generated:
docker logs mce-consumer | grep -i "Emitting MCL"
Symptoms:
curl http://localhost:8083/connectors/datahub-cdc-connector/status
# Shows "state": "FAILED"
Common Causes & Fixes:
Database permissions missing
-- MySQL: Verify replication privileges
SHOW GRANTS FOR 'datahub_cdc'@'%';
-- PostgreSQL: Verify replication role
\du datahub_cdc
Binary logging not enabled (MySQL)
SHOW VARIABLES LIKE 'log_bin';
-- Should return: ON
Logical replication not enabled (PostgreSQL)
SHOW wal_level;
-- Should return: logical
Publication or slot missing (PostgreSQL)
-- Check publication exists
SELECT * FROM pg_publication WHERE pubname = 'dbz_publication';
-- Check replication slot exists
SELECT * FROM pg_replication_slots WHERE slot_name = 'debezium';
How to fix:
curl http://localhost:8083/connectors/datahub-cdc-connector/status | jq '.tasks[0].trace'
curl -X POST http://localhost:8083/connectors/datahub-cdc-connector/restart
Symptoms:
RUNNING but no messages in datahub.metadata_aspect_v2 topicCommon Causes:
No database changes yet: Debezium only captures changes AFTER it starts. Make a test metadata change in DataHub UI.
Wrong table filter: Check table.include.list in connector config matches metadata_aspect_v2.
Snapshot mode issue: If snapshot.mode is set to never, existing data won't be captured. Change to initial
and restart.
How to fix:
kafka-console-consumer --bootstrap-server broker:9092 \
--topic datahub.metadata_aspect_v2 \
--from-beginning --max-messages 5
Symptoms:
Cause:
CDC_MCL_PROCESSING_ENABLED is not set to true in all required containers (gms, mce-consumer, and datahub-upgrade).
Fix:
Ensure CDC_MCL_PROCESSING_ENABLED=true is set for all three containers:
datahub-gms:
environment:
- CDC_MCL_PROCESSING_ENABLED=true # Required!
mce-consumer:
environment:
- CDC_MCL_PROCESSING_ENABLED=true # Required!
datahub-upgrade:
environment:
- CDC_MCL_PROCESSING_ENABLED=true # Required!
Important: After setting these environment variables, you must restart GMS and mce-consumer for the changes to take effect:
docker-compose restart datahub-gms mce-consumer
Symptoms:
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
-- Shows large "lag" value (e.g., 10GB+)
Cause: CDC connector is stopped but replication slot still exists, preventing WAL cleanup.
Fix:
If connector is permanently stopped, drop the slot:
SELECT pg_drop_replication_slot('debezium');
If connector is temporarily stopped, restart it to resume WAL consumption.
Configure max_slot_wal_keep_size to prevent runaway growth:
# postgresql.conf
max_slot_wal_keep_size = 10GB
Symptoms:
Tuning Options:
kafka-topics --bootstrap-server broker:9092 \
--alter --topic datahub.metadata_aspect_v2 \
--partitions 9
{
"config": {
"poll.interval.ms": "100",
"max.batch.size": "2048"
}
}
If you're still stuck after trying the above:
Check DataHub logs:
docker logs datahub-gms | grep -i cdc
docker logs mce-consumer | grep -i cdc
docker logs datahub-upgrade | grep -i cdc
Check Debezium connector logs:
curl http://localhost:8083/connectors/datahub-cdc-connector/status | jq '.tasks[0].trace'
Ask in DataHub Slack: Join here and post in #troubleshooting
Additional Infrastructure:
metadata_aspect_v2 tableExpected Latency:
Key Metrics to Monitor:
Debezium Connector Health:
curl http://localhost:8083/connectors/datahub-cdc-connector/status
CDC Topic Lag:
kafka-consumer-groups --bootstrap-server broker:9092 \
--describe --group datahub-mce-consumer-job-client
Database Replication Lag (PostgreSQL):
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
Binary Log Disk Usage (MySQL):
SHOW BINARY LOGS;
Horizontal Scaling:
Vertical Scaling:
To revert from CDC mode back to normal mode:
datahub-gms:
environment:
- CDC_MCL_PROCESSING_ENABLED=false # Changed from true
mce-consumer:
environment:
- CDC_MCL_PROCESSING_ENABLED=false # Changed from true
datahub-upgrade:
environment:
- CDC_MCL_PROCESSING_ENABLED=false # Changed from true
- CDC_CONFIGURE_SOURCE=false
docker-compose restart datahub-gms mce-consumer datahub-upgrade
If you're permanently disabling CDC:
Delete Debezium Connector:
curl -X DELETE http://localhost:8083/connectors/datahub-cdc-connector
Drop Replication Slot (PostgreSQL):
SELECT pg_drop_replication_slot('debezium');
Drop Publication (PostgreSQL):
DROP PUBLICATION dbz_publication;
(Optional) Delete CDC Topic:
kafka-topics --bootstrap-server broker:9092 \
--delete --topic datahub.metadata_aspect_v2
:::warning In-Flight CDC Events If you disable CDC while there are unprocessed CDC events in the topic, those events will be ignored. Ensure the CDC topic is fully consumed (check lag) before disabling CDC to avoid losing metadata changes. :::
Set CDC_MCL_PROCESSING_ENABLED=true in your GMS, MCE Consumer, and system-update services. For development, also set
CDC_CONFIGURE_SOURCE=true to auto-configure the database and Debezium connector.
See the Quick Start section for a step-by-step guide.
CDC (Change Data Capture) mode is an alternative way to generate MetadataChangeLogs (MCLs) in DataHub. Instead of GMS emitting MCLs directly when writing to the database, CDC mode captures database changes via Debezium and generates MCLs from those change events. This guarantees strict ordering based on database transaction commit order.
Use CDC mode if you need:
Most deployments don't need CDC - normal mode is sufficient and simpler.
| Aspect | Normal Mode | CDC Mode |
|---|---|---|
| MCL Generation | GMS emits MCLs directly | MCE Consumer generates MCLs from DB changes |
| Ordering | Best-effort | Strict (DB commit order) |
| Infrastructure | GMS + Kafka | GMS + Kafka + Kafka Connect + Debezium |
| Latency | Lower (~ms) | Higher (~100ms+) |
See the comparison table for more details.
Yes! CDC mode supports both MySQL 5.7+ and PostgreSQL 10+. Configuration differs slightly between the two (binary logs for MySQL, logical replication for PostgreSQL).
See MySQL Requirements and PostgreSQL Requirements.
Yes, but with caveats:
Since there db and gms/consumer restarts are required, and to avoid any diplicate MCLs, it is recommended to do this switch with a minimal gms downtime. Stop datahub services, make configuration changes, start datahub services to run in CDC mode.
Latency: CDC adds ~50-200ms of latency compared to normal mode (database write → CDC capture → MCL emission).
Throughput: CDC can handle the same throughput as normal mode, but requires additional infrastructure (Kafka Connect).
Resources: Adds ~512MB-1GB RAM for Kafka Connect, plus CDC topic storage (~same size as your database).
See Performance & Operational Considerations for details.
Common issues and fixes:
initialCDC_MCL_PROCESSING_ENABLED=true in GMSSee the full Troubleshooting section for step-by-step fixes.
CDC mode is currently not supported in DataHub Cloud.
Set CDC_MCL_PROCESSING_ENABLED=false in all DataHub services and restart. Optionally clean up CDC infrastructure
(connector, replication slot, CDC topic).
See Disabling CDC Mode for detailed steps.