RegexExtract

RegexExtract transform plugin

Description

The RegexExtract transform plugin uses regular expressions to extract data from a specified field and outputs the extracted values to new fields. It supports capture groups in regex patterns and allows setting default values for each output field when the pattern doesn't match.

Options

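| name | type | required | default value |
|------|------|----------|---------------|
| source_field | string | yes | |
| regex_pattern | string | yes | |
| output_fields | array | yes | |
| default_values | array | no | |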

source_field [string]

The source field name to extract data from.

regex_pattern [string]

The regular expression pattern with capture groups. The number of capture groups must match the number of output fields.

output_fields [array]

The names of the output fields for extracted values. The size must match the number of capture groups in the regex pattern.
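The size constraint can be checked before a job is submitted. The snippet below is a minimal Python sketch, not SeaTunnel code: `check_group_count` is a hypothetical helper, and it assumes that only capturing groups count toward the total, which is how Python's `re` module reports them (non-capturing groups such as `(?:...)` are excluded).

```python
import re


def check_group_count(regex_pattern, output_fields):
    """Raise ValueError if capture groups and output fields differ in number.

    Hypothetical helper for illustration; not part of SeaTunnel.
    """
    # Pattern.groups is the number of capturing groups in the pattern.
    group_count = re.compile(regex_pattern).groups
    if group_count != len(output_fields):
        raise ValueError(
            f"pattern has {group_count} capture groups, "
            f"but {len(output_fields)} output fields were given"
        )


# Three groups and three output fields: passes silently.
check_group_count(r"([^@]+)@([^.]+)\.(.+)", ["username", "domain", "tld"])
```

Note that the regex is written with a single backslash in a Python raw string, whereas the job config needs `\\.` because the backslash is escaped inside the quoted config string.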

default_values [array]

Default values for output fields when the regex pattern does not match or the source field is null. If provided, the size must match the number of output fields.
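The fallback behaviour described above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the plugin's implementation: `extract_or_default` is a hypothetical function, it assumes the pattern may match anywhere in the value (the source does not say whether a full match is required), and it assumes unmatched fields become null when `default_values` is omitted.

```python
import re


def extract_or_default(value, regex_pattern, output_fields, default_values=None):
    """Return {output_field: extracted value}, falling back to defaults.

    Hypothetical helper for illustration; not part of SeaTunnel.
    """
    if default_values is not None and len(default_values) != len(output_fields):
        raise ValueError("default_values must be the same size as output_fields")
    # Assumption: without default_values, unmatched fields are filled with None.
    if default_values is None:
        fallback = [None] * len(output_fields)
    else:
        fallback = default_values
    # A null source value is treated the same way as a failed match.
    # Assumption: the pattern is searched for anywhere in the value.
    match = re.search(regex_pattern, value) if value is not None else None
    values = match.groups() if match else fallback
    return dict(zip(output_fields, values))


pattern = r"([^@]+)@([^.]+)\.(.+)"
fields = ["username", "domain", "tld"]
defaults = ["unknown", "unknown", "unknown"]

print(extract_or_default("not-an-email", pattern, fields, defaults))
print(extract_or_default(None, pattern, fields, defaults))
```

Both calls fall back to the defaults: the first value contains no `@`, so the pattern cannot match, and the second value is null.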

Example

The data read from source is a table like this:

| id | email | log_entry |
|----|-------|-----------|
| 1 | user1@example.com | 2023-12-01 10:30:45 INFO User login successful |
| 2 | admin@test.org | 2023-12-01 11:15:22 ERROR Database connection failed |
| 3 | guest@domain.net | 2023-12-01 12:00:00 WARN Memory usage high |

We want to extract username, domain, and top-level domain from the email field:

transform {
  RegexExtract {
    plugin_input = "fake"
    plugin_output = "regex_result"
    source_field = "email"
    regex_pattern = "([^@]+)@([^.]+)\\.(.+)"
    output_fields = ["username", "domain", "tld"]
    default_values = ["unknown", "unknown", "unknown"]
  }
}

Then the data in result table regex_result will be:

| id | email | log_entry | username | domain | tld |
|----|-------|-----------|----------|--------|-----|
| 1 | user1@example.com | 2023-12-01 10:30:45 INFO User login successful | user1 | example | com |
| 2 | admin@test.org | 2023-12-01 11:15:22 ERROR Database connection failed | admin | test | org |
| 3 | guest@domain.net | 2023-12-01 12:00:00 WARN Memory usage high | guest | domain | net |
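To see how the three capture groups split an address, the pattern can be tried outside SeaTunnel. The Python sketch below uses the addresses implied by the `username`, `domain`, and `tld` columns (`user1@example.com`, `admin@test.org`, `guest@domain.net`); `split_email` is a hypothetical helper, and the sketch assumes this particular pattern behaves the same in Python's `re` module as in the regex engine the plugin uses.

```python
import re

# Same pattern as in the config; one backslash here because this is a raw string.
PATTERN = re.compile(r"([^@]+)@([^.]+)\.(.+)")


def split_email(email):
    """Return (username, domain, tld), or None if the pattern does not match.

    Hypothetical helper for illustration; not part of SeaTunnel.
    """
    match = PATTERN.fullmatch(email)
    return match.groups() if match else None


for email in ["user1@example.com", "admin@test.org", "guest@domain.net"]:
    print(split_email(email))

# With more than one dot after the "@", the split happens at the first dot,
# so the last group receives everything that follows it.
print(split_email("ops@mail.example.co.uk"))  # ('ops', 'mail', 'example.co.uk')
```

Because the second group is `[^.]+`, the `tld` field only holds a true top-level domain when the part after the `@` contains exactly one dot; a stricter pattern would be needed for addresses with subdomains.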

Job Config Example

env {
  job.mode = "BATCH"
}

source {
  FakeSource {
    plugin_output = "fake"
    row.num = 100
    schema = {
      fields {
        id = "int"
        email = "string"
        log_entry = "string"
      }
    }
    rows = [
      {
        kind = INSERT,
        fields = [1, "user1@example.com", "2023-12-01 10:30:45 INFO User login successful"]
      },
      {
        kind = INSERT,
        fields = [2, "admin@test.org", "2023-12-01 11:15:22 ERROR Database connection failed"]
      },
      {
        kind = INSERT,
        fields = [3, "guest@domain.net", "2023-12-01 12:00:00 WARN Memory usage high"]
      }
    ]
  }
}

transform {
  RegexExtract {
    plugin_input = "fake"
    plugin_output = "regex_result"
    source_field = "email"
    regex_pattern = "([^@]+)@([^.]+)\\.(.+)"
    output_fields = ["username", "domain", "tld"]
    default_values = ["unknown", "unknown", "unknown"]
  }
}

sink {
  Console {
    plugin_input = "regex_result"
  }
}
