docs/content/v2024.1/explore/change-data-capture.md
In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. CDC can be used for a number of scenarios:
Microservice-oriented architectures - Some microservices require a stream of changes to the data, and using CDC in YugabyteDB can provide consumable data changes to CDC subscribers.
Asynchronous replication to remote systems - Remote systems may subscribe to a stream of data changes and then transform and consume the changes. Maintaining separate database instances for transactional and reporting purposes can be used to manage workload performance.
Multiple data center strategies - Maintaining multiple data centers enables enterprises to provide high availability (HA).
Compliance and auditing - Satisfy auditing and compliance requirements using CDC to maintain records of data changes.
YugabyteDB's CDC implementation uses PostgreSQL Logical Replication, ensuring compatibility with PostgreSQL CDC systems. Logical replication operates through a publish-subscribe model, where publications (source tables) send changes to subscribers (target systems).
CDC via logical replication is supported in YugabyteDB starting from v2024.1.1.
YugabyteDB logical replication can be used in conjunction with Apache Kafka to create a scalable, fault-tolerant, and highly available data pipeline as follows:
Using Kafka with PostgreSQL logical replication provides several benefits:
The following example uses pg_recvlogical, a command-line tool provided by PostgreSQL for interacting with the logical replication feature. It is specifically used to receive changes from the database using logical replication slots.
YugabyteDB provides the pg_recvlogical binary in the <yugabyte-db-dir>/postgres/bin/ directory, which is inherited from and based on PostgreSQL 11.2. Although PostgreSQL also offers a pg_recvlogical binary, you should use the YugabyteDB version to avoid compatibility issues.
To set up pg_recvlogical, create and start the local cluster by running the following command from your YugabyteDB home directory:
./bin/yugabyted start \
--advertise_address=127.0.0.1 \
--base_dir="${HOME}/var/node1" \
--tserver_flags="cdcsdk_publication_list_refresh_interval_secs=2"
Use ysqlsh to connect to the default yugabyte database with the default superuser yugabyte, as follows:
./bin/ysqlsh -h 127.0.0.1 -U yugabyte -d yugabyte
In the yugabyte database, create a table employees.
CREATE TABLE employees (
employee_id SERIAL PRIMARY KEY,
name VARCHAR(255),
email VARCHAR(255),
department_id INTEGER
);
Create a logical replication slot named test_logical_replication_slot using the test_decoding output plugin via the following function:
SELECT * FROM pg_create_logical_replication_slot('test_logical_replication_slot', 'test_decoding');
slot_name | lsn
-------------------------------+-----
test_logical_replication_slot | 0/2
The pg_recvlogical binary can be found under <yugabyte-db-dir>/postgres/bin/.
Open a new shell and start pg_recvlogical to connect to the yugabyte database with the superuser yugabyte and replicate changes using the replication slot you created as follows:
./pg_recvlogical -d yugabyte \
-U yugabyte \
-h 127.0.0.1 \
--slot test_logical_replication_slot \
--start \
-f -
Any changes that get replicated are printed to stdout.
For more information about pg_recvlogical configuration, refer to the PostgreSQL pg_recvlogical documentation.
Return to the shell where ysqlsh is running. Perform DMLs on the employees table.
BEGIN;
INSERT INTO employees (name, email, department_id)
VALUES ('Alice Johnson', '[email protected]', 1);
INSERT INTO employees (name, email, department_id)
VALUES ('Bob Smith', '[email protected]', 2);
COMMIT;
Expected output observed on stdout where pg_recvlogical is running:
BEGIN 2
table public.employees: INSERT: employee_id[integer]:1 name[character varying]:'Alice Johnson' email[character varying]:'[email protected]' department_id[integer]:1
table public.employees: INSERT: employee_id[integer]:2 name[character varying]:'Bob Smith' email[character varying]:'[email protected]' department_id[integer]:2
COMMIT 2
You can add a new table to the yugabyte database and any DMLs performed on the new table are also replicated to pg_recvlogical.
In the yugabyte database, create a new table projects:
CREATE TABLE projects (
project_id SERIAL PRIMARY KEY,
name VARCHAR(255),
description TEXT
);
Perform DMLs on the projects table:
INSERT INTO projects (name, description)
VALUES ('Project A', 'Description of Project A');
Expected output observed on stdout where pg_recvlogical is running:
BEGIN 3
table public.projects: INSERT: project_id[integer]:1 name[character varying]:'Project A' description[text]:'Description of Project A'
COMMIT 3
{{% explore-cleanup-local %}}