# Hudi connector
The Hudi connector enables querying Hudi tables.
To use the Hudi connector, you need a Hive metastore service (HMS) for table
metadata and one of the supported file systems.

## General configuration

To configure the Hudi connector, create a catalog properties file
`etc/catalog/example.properties` that references the `hudi` connector.
You must configure a metastore for table metadata.

You must select and configure one of the supported file systems.
```properties
connector.name=hudi
hive.metastore.uri=thrift://example.net:9083
fs.x.enabled=true
```

Replace the `fs.x.enabled` configuration property with the desired file system.
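As a sketch, a catalog that uses the native S3 file system support might look
like the following; the bucket region value is an assumption for illustration:

```properties
connector.name=hudi
hive.metastore.uri=thrift://example.net:9083
# Enable native S3 file system support instead of the fs.x.enabled placeholder
fs.native-s3.enabled=true
# Region of the S3 buckets holding the Hudi table data (example value)
s3.region=us-east-1
```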
## Configuration properties

There are {ref}`HMS configuration properties <general-metastore-properties>`
available for use with the Hudi connector. The connector recognizes Hudi tables
synced to the metastore by the Hudi sync tool.

Additionally, the following configuration properties can be set depending on
the use case:
:::{list-table} Hudi configuration properties
:widths: 30, 55, 15
:header-rows: 1

* - Property name
  - Description
  - Default
* - `hudi.columns-to-hide`
  - List of column names that are hidden from the query output. It can be used
    to hide Hudi meta fields. By default, no fields are hidden.
  -
* - `hudi.parquet.use-column-names`
  - Access Parquet columns using names from the file. If disabled, then columns
    are accessed using the index.
  - `true`
* - `hudi.split-generator-parallelism`
  - Number of threads to generate splits from partitions.
  - `4`
* - `hudi.split-loader-parallelism`
  - Number of threads to run background split loader. A single background split
    loader is needed per query.
  - `4`
* - `hudi.size-based-split-weights-enabled`
  - Unlike uniform splitting, size-based splitting ensures that each batch of
    splits has enough data to process. By default, it is enabled to improve
    performance.
  - `true`
* - `hudi.standard-split-weight-size`
  - The split size corresponding to the standard weight. A split with size less
    than the standard split size is considered a lightweight split.
  - `128MB`
* - `hudi.minimum-assigned-split-weight`
  - Minimum weight that a split can be assigned.
  - `0.05`
* - `hudi.max-splits-per-second`
  - Rate at which splits are queued for processing. The queue is throttled if
    this rate limit is breached.
  - `Integer.MAX_VALUE`
* - `hudi.max-outstanding-splits`
  - Maximum outstanding splits in a batch enqueued for processing.
  - `1000`
* - `hudi.per-transaction-metastore-cache-maximum-size`
  - Maximum number of metastore data objects per transaction in the Hive
    metastore cache.
  - `2000`
* - `hudi.query-partition-filter-required`
  - Set to `true` to force a query to use a partition column in the filter
    condition. The equivalent catalog session property is
    `query_partition_filter_required`. Enabling this property causes query
    failures if the partition column used in the filter condition doesn't
    effectively reduce the number of data files read. Example: Complex filter
    expressions such as `id = 1 OR part_key = '100'` or
    `CAST(part_key AS INTEGER) % 2 = 0` are not recognized as partition
    filters, and queries using such expressions fail if the property is set to
    `true`.
  - `false`
* - `hudi.ignore-absent-partitions`
  - Ignore partitions when the file system location does not exist rather than
    failing the query.
  - `false`
:::
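To illustrate the partition filter requirement, the following sketch assumes a
catalog named `example` and a table partitioned by a `dt` column, matching the
examples later in this document:

```sql
-- Enable the check for the current session
SET SESSION example.query_partition_filter_required = true;

-- Succeeds: the filter constrains the partition column dt
SELECT symbol FROM stock_ticks_cow WHERE dt = '2018-08-31';

-- Fails with the property enabled: no usable partition filter
SELECT symbol FROM stock_ticks_cow WHERE symbol = 'GOOG';
```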
(hudi-file-system-configuration)=
## File system access configuration

The connector supports accessing the following file systems:

- Azure Storage
- Google Cloud Storage
- S3
- HDFS

You must enable and configure the specific file system access. Legacy support
is not recommended and will be removed.
## SQL support

The connector provides read access to data in the Hudi table that has been
synced to the Hive metastore. The {ref}`globally available <sql-globally-available>`
and {ref}`read operation <sql-read-operations>` statements are supported.
### Basic usage examples

In the following example queries, `stock_ticks_cow` is the Hudi copy-on-write
table referred to in the Hudi quickstart guide.
```sql
USE example.example_schema;

SELECT symbol, max(ts)
FROM stock_ticks_cow
GROUP BY symbol
HAVING symbol = 'GOOG';
```

```text
  symbol   |        _col1         |
-----------+----------------------+
 GOOG      | 2018-08-31 10:59:00  |
(1 rows)
```
```sql
SELECT dt, symbol
FROM stock_ticks_cow
WHERE symbol = 'GOOG';
```

```text
     dt     | symbol |
------------+--------+
 2018-08-31 | GOOG   |
(1 rows)
```
```sql
SELECT dt, count(*)
FROM stock_ticks_cow
GROUP BY dt;
```

```text
     dt     | _col1 |
------------+-------+
 2018-08-31 |    99 |
(1 rows)
```
## Table types

Hudi supports two types of tables, depending on how the data is indexed and
laid out on the file system. The following table displays a support matrix of
table types and query types for the connector:

:::{list-table} Hudi table types
:widths: 45, 55
:header-rows: 1

* - Table type
  - Supported query type
* - Copy on write
  - Snapshot queries
* - Merge on read
  - Read-optimized queries
:::
## Table properties

The following table properties are available for use:

:::{list-table} Hudi table properties
:widths: 40, 60
:header-rows: 1

* - Property name
  - Description
* - `location`
  - Optionally specifies the file system location URI for the table.
* - `partitioned_by`
  - Optionally specifies table partitioning. If a table is partitioned by
    columns `c1` and `c2`, the partitioning property is
    `partitioned_by = ARRAY['c1', 'c2']`.
:::

(hudi-metadata-tables)=
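Because the connector provides read access, one way to see these table
properties in practice is to inspect an existing table. A sketch, reusing the
`stock_ticks_cow` example table from above:

```sql
-- Displays the table definition, including location and partitioned_by
SHOW CREATE TABLE stock_ticks_cow;
```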
## Metadata tables

The connector exposes a metadata table for each Hudi table. The metadata table
contains information about the internal structure of the Hudi table. You can
query each metadata table by appending the metadata table name to the table
name:

```sql
SELECT * FROM "test_table$timeline"
```

### `$timeline` table

The `$timeline` table provides a detailed view of meta-data instants in the
Hudi table. Instants are specific points in time.

You can retrieve the information about the timeline of the Hudi table
`test_table` by using the following query:

```sql
SELECT * FROM "test_table$timeline"
```
```text
      timestamp      | action  |   state
---------------------+---------+-----------
 8667764846443717831 | commit  | COMPLETED
 7860805980949777961 | commit  | COMPLETED
```
The output of the query has the following columns:
:::{list-table} Timeline columns
:widths: 20, 30, 50
:header-rows: 1

* - Name
  - Type
  - Description
* - `timestamp`
  - `VARCHAR`
  - Instant time, typically a timestamp of when the action was performed on the
    timeline.
* - `action`
  - `VARCHAR`
  - Type of action performed on the table.
* - `state`
  - `VARCHAR`
  - Current state of the instant.
:::
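The `$timeline` table can be filtered like any other table. As an illustrative
sketch, reusing the `test_table` name from the example above, the following
query keeps only completed commits:

```sql
-- List completed commit instants, newest first
SELECT timestamp, action
FROM "test_table$timeline"
WHERE state = 'COMPLETED' AND action = 'commit'
ORDER BY timestamp DESC;
```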