src/v/serde/parquet/README.md
This library writes Parquet files for Redpanda's Iceberg integration. Because Redpanda is built on Seastar, existing off-the-shelf Parquet libraries do not meet our strict requirements, which are imposed by our userland task scheduler and by our allocator, which avoids virtual memory.
Parquet shreds records into a columnar format using the algorithm published in the Dremel
paper. A very helpful step-by-step explainer of the algorithm can be found here.
Our implementation of this algorithm can be found in shredder.cc.
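
To give a feel for what shredding produces, here is a minimal sketch of Dremel-style shredding for a single column. It is not the code in shredder.cc: the schema, the `record` and `shredded_value` types, and the `shred_column` function are all hypothetical and exist only for illustration. The assumed schema is `optional group a { repeated int32 b; }`, so the column `a.b` has a maximum definition level of 2 and a maximum repetition level of 1.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// One entry in the shredded column: the value (absent for nulls) plus
// its repetition and definition levels. Hypothetical type.
struct shredded_value {
    std::optional<int32_t> value;
    uint8_t rep_level;
    uint8_t def_level;
};

// Hypothetical in-memory record for the assumed schema:
//   optional group a { repeated int32 b; }
// nullopt means `a` is missing; an empty vector means `a` is present
// but `b` has no elements.
struct record {
    std::optional<std::vector<int32_t>> a_b;
};

// Shred the single leaf column `a.b` for a batch of records.
std::vector<shredded_value> shred_column(const std::vector<record>& records) {
    std::vector<shredded_value> out;
    for (const auto& r : records) {
        if (!r.a_b.has_value()) {
            // `a` is null: nothing on the path is defined.
            out.push_back({std::nullopt, 0, 0});
        } else if (r.a_b->empty()) {
            // `a` is defined, but `b` has no values.
            out.push_back({std::nullopt, 0, 1});
        } else {
            // The first value starts a new record (rep 0); the rest
            // repeat at the level of `b` (rep 1).
            uint8_t rep = 0;
            for (int32_t v : *r.a_b) {
                out.push_back({v, rep, 2});
                rep = 1;
            }
        }
    }
    return out;
}
```

Every record contributes at least one entry, even when it holds no values. This is how a reader can reassemble record boundaries and nulls from the flat column.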
Parquet metadata is serialized using Apache Thrift's compact wire format.
In memory, our metadata types are a logical representation of only what our application needs. When serializing, we write out the full wire format, including the deprecated and legacy types, to stay compatible with legacy query systems.
The physical format of serialized Parquet metadata is documented here.
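
As a rough illustration of the compact wire format, the sketch below shows how Thrift's compact protocol encodes signed integers: the value is zigzag-encoded so that small magnitudes map to small unsigned numbers, then written as a variable-length integer with 7 bits per byte. The function names are hypothetical and this is not the serializer used in this library.

```cpp
#include <cstdint>
#include <vector>

// Zigzag-encode a signed 64-bit integer: 0 -> 0, -1 -> 1, 1 -> 2, ...
// Relies on arithmetic right shift of negative values, which is
// guaranteed since C++20 and is what mainstream compilers do anyway.
uint64_t zigzag_encode(int64_t n) {
    return (static_cast<uint64_t>(n) << 1) ^ static_cast<uint64_t>(n >> 63);
}

// Append `v` as a varint: 7 payload bits per byte, least significant
// group first, with the high bit set on every byte except the last.
void write_varint(uint64_t v, std::vector<uint8_t>& out) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Encode a signed integer the way the compact protocol encodes
// i64 values: zigzag first, then varint.
std::vector<uint8_t> encode_i64(int64_t n) {
    std::vector<uint8_t> out;
    write_varint(zigzag_encode(n), out);
    return out;
}
```

For example, `encode_i64(150)` zigzags to 300 and yields the two bytes `0xAC 0x02`, while small values such as `-1` or `1` fit in a single byte.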