import ChangeLog from '../changelog/connector-kafka.md';
# Kafka

> Kafka source connector

## Support Those Engines

> Spark<br/>
> Flink<br/>
> SeaTunnel Zeta<br/>

## Description

Source connector for Apache Kafka.
## Dependency

In order to use the Kafka connector, the following dependencies are required. They can be downloaded via install-plugin.sh or from the Maven central repository.

| Datasource | Supported Versions | Maven |
|---|---|---|
| Kafka | Universal | Download |
## Source Options

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| topic | String | Yes | - | Topic name(s) to read data from when the table is used as a source. Multiple topics can be specified as a comma-separated list, e.g. 'topic-1,topic-2'. |
| table_list | Map | No | - | Topic list config. You can configure only one of `table_list` and `topic` at the same time. |
| bootstrap.servers | String | Yes | - | Comma-separated list of Kafka brokers. |
| pattern | Boolean | No | false | If set to true, `topic` is treated as a regular expression, and the consumer subscribes to all topics whose names match it. |
| consumer.group | String | No | SeaTunnel-Consumer-Group | Kafka consumer group id, used to distinguish different consumer groups. |
| commit_on_checkpoint | Boolean | No | true | If true, the consumer's offsets will be periodically committed in the background. |
| poll.timeout | Long | No | 10000 | The interval (in milliseconds) for polling messages. |
| kafka.config | Map | No | - | In addition to the necessary parameters above that must be specified by the Kafka consumer client, users can also specify any non-mandatory consumer client parameters, covering all consumer parameters in the official Kafka documentation. |
| schema | Config | No | - | The structure of the data, including field names and field types. For more details, please refer to Schema Feature. |
| format | String | No | json | Data format. The default format is json. Optional formats: text, canal_json, debezium_json, maxwell_json, ogg_json, avro, protobuf and native. For the text format, the default field delimiter is ','; to customize it, set the `field_delimiter` option. If you use the canal format, please refer to canal-json for details; if you use the debezium format, please refer to debezium-json. For other format details, please refer to formats. |
| format_error_handle_way | String | No | fail | How to handle data format errors. The optional values are fail and skip. With fail, a data format error blocks the job and an exception is thrown; with skip, the malformed line is skipped. |
| debezium_record_table_filter | Config | No | - | Used for filtering data in the debezium format; only takes effect when `format` is set to debezium_json. Please refer to the debezium_record_table_filter section below. |
| field_delimiter | String | No | , | Customize the field delimiter for the data format. |
| start_mode | StartMode[earliest],[group_offsets],[latest],[specific_offsets],[timestamp] | No | group_offsets | The initial consumption pattern of consumers. See the example following this table. |
| start_mode.offsets | Config | No | - | The offsets required when the consumption mode is specific_offsets. |
| start_mode.timestamp | Long | No | - | The start time required when the consumption mode is timestamp. |
| start_mode.end_timestamp | Long | No | - | The end time required when the consumption mode is timestamp in batch mode. |
| partition-discovery.interval-millis | Long | No | -1 | The interval for dynamically discovering topics and partitions. |
| ignore_no_leader_partition | Boolean | No | false | Whether to ignore partitions that have no leader. If set to true, partitions without a leader will be skipped during partition discovery. If set to false (default), the connector includes all partitions regardless of leader status. This is useful when dealing with Kafka clusters that may have temporary leadership issues. |
| common-options | | No | - | Source plugin common parameters, please refer to Source Common Options for details. |
| protobuf_message_name | String | No | - | Effective when the format is set to protobuf; specifies the Message name. |
| protobuf_schema | String | No | - | Effective when the format is set to protobuf; specifies the Schema definition. |
| strip_schema_registry_header | Boolean | No | false | Effective when the format is set to protobuf. Whether to strip the Confluent Schema Registry wire format header (magic byte, schema id and message indexes) before protobuf deserialization. This option is useful when consuming Protobuf messages that were encoded using Confluent Schema Registry. When enabled, the connector will try to detect and remove the Schema Registry header before parsing the Protobuf message. If the header is not detected, it falls back to standard Protobuf deserialization. |
| reader_cache_queue_size | Integer | No | 1024 | The reader shard cache queue is used to cache the data corresponding to the shards. The size of the shard cache depends on the number of shards obtained by each reader, rather than the amount of data in each shard. |
| is_native | Boolean | No | false | Supports retaining the source information of the record. |
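If you need to start from explicit offsets or from a point in time, configure `start_mode` together with its companion options. Below is a minimal sketch; the topic name, broker address and offset/timestamp values are illustrative:

```hocon
source {
  Kafka {
    topic = "topic_1"
    bootstrap.servers = "localhost:9092"
    # Start from explicitly specified offsets, keyed as "<topic>-<partition>"
    start_mode = "specific_offsets"
    start_mode.offsets = {
      topic_1-0 = 70
      topic_1-1 = 10
      topic_1-2 = 10
    }
    # Alternatively, start from a point in time (epoch milliseconds):
    # start_mode = "timestamp"
    # start_mode.timestamp = 1667179890315
    # In batch mode, an end time can additionally bound the read:
    # start_mode.end_timestamp = 1667179990315
  }
}
```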
### debezium_record_table_filter

We can use `debezium_record_table_filter` to filter the data in the debezium format. The configuration is as follows:

```hocon
debezium_record_table_filter {
  database_name = "test" // null if not exists
  schema_name = "public" // null if not exists
  table_name = "products"
}
```

Only the data of the `test.public.products` table will be consumed.
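For reference, a minimal sketch of the filter inside a complete source block; the topic name and broker address are illustrative, and the schema must match your Debezium payload:

```hocon
source {
  Kafka {
    topic = "debezium_topic"
    bootstrap.servers = "localhost:9092"
    format = debezium_json
    schema = {
      fields {
        id = "int"
        name = "string"
      }
    }
    # Only change events from test.public.products are emitted downstream
    debezium_record_table_filter {
      database_name = "test"
      schema_name = "public"
      table_name = "products"
    }
  }
}
```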
### Kafka Record Timestamp as EventTime

The Kafka source automatically injects ConsumerRecord.timestamp into the SeaTunnel EventTime metadata when the value is non-negative. You can expose it as a normal field through the Metadata transform for downstream SQL or partitioning.
```hocon
source {
  Kafka {
    plugin_output = "kafka_raw"
    topic = "seatunnel_topic"
    bootstrap.servers = "localhost:9092"
    format = json
  }
}

transform {
  Metadata {
    plugin_input = "kafka_raw"
    plugin_output = "kafka_with_meta"
    metadata_fields {
      EventTime = kafka_ts # kafka_ts will contain ConsumerRecord.timestamp (ms)
    }
  }
  Sql {
    plugin_input = "kafka_with_meta"
    plugin_output = "kafka_enriched"
    query = "select *, FROM_UNIXTIME(kafka_ts/1000, 'yyyy-MM-dd', 'Asia/Shanghai') as pt from kafka_with_meta where kafka_ts >= 0"
  }
}
```
## Task Example

### Simple

This example reads data from Kafka topics topic_1, topic_2 and topic_3 and prints it to the client. If you have not yet installed and deployed SeaTunnel, follow the instructions in Install SeaTunnel first, then follow the instructions in Quick Start With SeaTunnel Engine to run this job. In batch mode, the enumerator fetches the latest offset for each partition during sharding and uses it as the stopping point.
```hocon
# Defining the runtime environment
env {
  parallelism = 2
  job.mode = "BATCH"
}

source {
  Kafka {
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
    format = text
    field_delimiter = "#"
    topic = "topic_1,topic_2,topic_3"
    bootstrap.servers = "localhost:9092"
    kafka.config = {
      client.id = client_1
      max.poll.records = 500
      auto.offset.reset = "earliest"
      enable.auto.commit = "false"
    }
  }
}

sink {
  Console {}
}
```
### Regex Topic

```hocon
source {
  Kafka {
    topic = ".*seatunnel.*"
    pattern = "true"
    bootstrap.servers = "localhost:9092"
    consumer.group = "seatunnel_group"
  }
}
```
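When subscribing by pattern in a streaming job, matching topics may be created after the job starts. Below is a minimal sketch enabling dynamic discovery; the 60-second interval is an illustrative value:

```hocon
env {
  parallelism = 1
  job.mode = "STREAMING"
}

source {
  Kafka {
    topic = ".*seatunnel.*"
    pattern = "true"
    bootstrap.servers = "localhost:9092"
    consumer.group = "seatunnel_group"
    # Probe for new topics/partitions every 60 seconds (-1, the default, disables discovery)
    partition-discovery.interval-millis = 60000
  }
}
```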
### AWS MSK SASL/SCRAM

Replace the username and password in the example below with the configuration values in AWS MSK.
```hocon
source {
  Kafka {
    topic = "seatunnel"
    bootstrap.servers = "xx.amazonaws.com.cn:9096,xxx.amazonaws.com.cn:9096,xxxx.amazonaws.com.cn:9096"
    consumer.group = "seatunnel_group"
    kafka.config = {
      security.protocol=SASL_SSL
      sasl.mechanism=SCRAM-SHA-512
      sasl.jaas.config="org.apache.kafka.common.security.scram.ScramLoginModule required username=\"username\" password=\"password\";"
      #security.protocol=SASL_SSL
      #sasl.mechanism=AWS_MSK_IAM
      #sasl.jaas.config="software.amazon.msk.auth.iam.IAMLoginModule required;"
      #sasl.client.callback.handler.class="software.amazon.msk.auth.iam.IAMClientCallbackHandler"
    }
  }
}
```
### AWS MSK IAM

Download aws-msk-iam-auth-1.1.5.jar from https://github.com/aws/aws-msk-iam-auth/releases and put it in the $SEATUNNEL_HOME/plugin/kafka/lib directory.

Please ensure the IAM policy includes "kafka-cluster:Connect". Like this:

```hocon
"Effect": "Allow",
"Action": [
    "kafka-cluster:Connect",
    "kafka-cluster:AlterCluster",
    "kafka-cluster:DescribeCluster"
],
```
Source Config
```hocon
source {
  Kafka {
    topic = "seatunnel"
    bootstrap.servers = "xx.amazonaws.com.cn:9098,xxx.amazonaws.com.cn:9098,xxxx.amazonaws.com.cn:9098"
    consumer.group = "seatunnel_group"
    kafka.config = {
      #security.protocol=SASL_SSL
      #sasl.mechanism=SCRAM-SHA-512
      #sasl.jaas.config="org.apache.kafka.common.security.scram.ScramLoginModule required username=\"username\" password=\"password\";"
      security.protocol=SASL_SSL
      sasl.mechanism=AWS_MSK_IAM
      sasl.jaas.config="software.amazon.msk.auth.iam.IAMLoginModule required;"
      sasl.client.callback.handler.class="software.amazon.msk.auth.iam.IAMClientCallbackHandler"
    }
  }
}
```
### Kerberos Authentication Example

Please set the JVM parameter `java.security.krb5.conf` (e.g. `-Djava.security.krb5.conf=/etc/krb5.conf`) before starting SeaTunnel, or update the default krb5.conf in /etc/krb5.conf.
Source Config
```hocon
source {
  Kafka {
    topic = "seatunnel"
    bootstrap.servers = "127.0.0.1:9092"
    consumer.group = "seatunnel_group"
    kafka.config = {
      security.protocol=SASL_PLAINTEXT
      sasl.kerberos.service.name=kafka
      sasl.mechanism=GSSAPI
      sasl.jaas.config="com.sun.security.auth.module.Krb5LoginModule required \n useKeyTab=true \n storeKey=true \n keyTab=\"/path/to/xxx.keytab\" \n principal=\"[email protected]\";"
    }
  }
}
```
### Multiple Kafka Source

This example parses Kafka data in different formats from different topics and writes it to the same PostgreSQL table, performing upsert operations based on the id field.

> Note: Kafka is an unstructured data source; use `tables_configs`. `table_list` is deprecated and will be removed in the future.
```hocon
env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  Kafka {
    bootstrap.servers = "kafka_e2e:9092"
    tables_configs = [
      {
        topic = "^test-ogg-sou.*"
        pattern = "true"
        consumer.group = "ogg_multi_group"
        start_mode = earliest
        schema = {
          fields {
            id = "int"
            name = "string"
            description = "string"
            weight = "string"
          }
        },
        format = ogg_json
      },
      {
        topic = "test-cdc_mds"
        start_mode = earliest
        schema = {
          fields {
            id = "int"
            name = "string"
            description = "string"
            weight = "string"
          }
        },
        format = canal_json
      }
    ]
  }
}

sink {
  Jdbc {
    driver = org.postgresql.Driver
    url = "jdbc:postgresql://postgresql:5432/test?loggerLevel=OFF"
    user = test
    password = test
    generate_sink_sql = true
    database = test
    table = public.sink
    primary_keys = ["id"]
  }
}
```
The same job expressed with the deprecated `table_list` option:

```hocon
env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  Kafka {
    bootstrap.servers = "kafka_e2e:9092"
    table_list = [
      {
        topic = "^test-ogg-sou.*"
        pattern = "true"
        consumer.group = "ogg_multi_group"
        start_mode = earliest
        schema = {
          fields {
            id = "int"
            name = "string"
            description = "string"
            weight = "string"
          }
        },
        format = ogg_json
      },
      {
        topic = "test-cdc_mds"
        start_mode = earliest
        schema = {
          fields {
            id = "int"
            name = "string"
            description = "string"
            weight = "string"
          }
        },
        format = canal_json
      }
    ]
  }
}

sink {
  Jdbc {
    driver = org.postgresql.Driver
    url = "jdbc:postgresql://postgresql:5432/test?loggerLevel=OFF"
    user = test
    password = test
    generate_sink_sql = true
    database = test
    table = public.sink
    primary_keys = ["id"]
  }
}
```
### Protobuf configuration

Set `format` to protobuf and configure the `protobuf_message_name` and `protobuf_schema` parameters to describe the data structure.

Example:
```hocon
source {
  Kafka {
    topic = "test_protobuf_topic_fake_source"
    format = protobuf
    protobuf_message_name = Person
    protobuf_schema = """
        syntax = "proto3";
        package org.apache.seatunnel.format.protobuf;
        option java_outer_classname = "ProtobufE2E";
        message Person {
          int32 c_int32 = 1;
          int64 c_int64 = 2;
          float c_float = 3;
          double c_double = 4;
          bool c_bool = 5;
          string c_string = 6;
          bytes c_bytes = 7;
          message Address {
            string street = 1;
            string city = 2;
            string state = 3;
            string zip = 4;
          }
          Address address = 8;
          map<string, float> attributes = 9;
          repeated string phone_numbers = 10;
        }
    """
    bootstrap.servers = "kafkaCluster:9092"
    start_mode = "earliest"
    plugin_output = "kafka_table"
  }
}
```
### Protobuf with Confluent Schema Registry

When consuming Protobuf messages that were encoded using Confluent Schema Registry, you need to set strip_schema_registry_header to true. The connector will automatically detect and remove the Schema Registry wire format header (magic byte, schema id, and message indexes) before deserializing the Protobuf message.
Example:
```hocon
source {
  Kafka {
    topic = "test_protobuf_schema_registry_topic"
    format = protobuf
    strip_schema_registry_header = true
    protobuf_message_name = Person
    protobuf_schema = """
        syntax = "proto3";
        package org.apache.seatunnel.format.protobuf;
        option java_outer_classname = "ProtobufE2E";
        message Person {
          int32 c_int32 = 1;
          int64 c_int64 = 2;
          float c_float = 3;
          double c_double = 4;
          bool c_bool = 5;
          string c_string = 6;
          bytes c_bytes = 7;
          message Address {
            string street = 1;
            string city = 2;
            string state = 3;
            string zip = 4;
          }
          Address address = 8;
          map<string, float> attributes = 9;
          repeated string phone_numbers = 10;
        }
    """
    bootstrap.servers = "kafkaCluster:9092"
    start_mode = "earliest"
    plugin_output = "kafka_table"
  }
}
```
Note: When strip_schema_registry_header is enabled, the connector can safely handle both Schema Registry encoded messages and plain Protobuf messages. If the Schema Registry header is not detected, it will automatically fall back to standard Protobuf deserialization.
### Ignore No Leader Partition
When dealing with Kafka clusters that may have temporary leadership issues, you can configure the connector to ignore partitions without a leader:
```hocon
source {
Kafka {
topic = "test_topic"
bootstrap.servers = "localhost:9092"
consumer.group = "test_group"
ignore_no_leader_partition = true
start_mode = "earliest"
}
}
```
With ignore_no_leader_partition = true, the connector will skip any partitions that don't have a leader during partition discovery, allowing the job to continue processing other healthy partitions.
### Native Format

If you need to retain Kafka's native record information, you can refer to the following configuration.
Config Example:
```hocon
source {
  Kafka {
    topic = "test_topic_native_source"
    bootstrap.servers = "kafkaCluster:9092"
    start_mode = "earliest"
    format_error_handle_way = skip
    format = "NATIVE"
    value_converter_schema_enabled = false
    consumer.group = "native_group"
  }
}
```
The returned data is as follows:
```json
{
  "headers": {
    "header1": "header1",
    "header2": "header2"
  },
  "key": "dGVzdF9ieXRlc19kYXRh",
  "partition": 3,
  "timestamp": 1672531200000,
  "timestampType": "CREATE_TIME",
  "value": "dGVzdF9ieXRlc19kYXRh"
}
```
Note: key and value are of type byte[].

## Changelog

<ChangeLog/>