crates/polars-parquet/src/arrow/read/deserialize/README.md
Let's start with the design used for non-nested arrays. The (private) entry point of this module for
non-nested arrays is simple::page_iter_to_arrays.
This function expects
PagesDataTypeand returns an iterator of Array, ArrayIter.
This design is shared among all (parquet, arrow) implemented tuples. Their main difference is
how they are deserialized, which depends on the source and target types.
When the array iterator is pulled the first time, the following happens:
Pages is pulledPageState<'a> is built from the pagePageState is consumed into a mutable array:
chunk_size is larger than the number of rows in the page, the mutable array state is
preserved and a new page is pulled and the process repeated until we fill a chunk.chunk_size is smaller than the number of rows in the page, the mutable array state is
returned and the remaining of the page is consumed into multiple mutable arrays of length
chunk_size into a FIFO queue.Subsequent pulls of arrays will first try to pull from the FIFO queue. Once the queue is empty, the a new page is pulled.
PageStateAs mentioned above, the iterator leverages the idea that we attach a state to a page. Recall that a
page is essentially [header][data]. The data part contains encoded
[rep levels][def levels][non-null values]. Some pages have an associated dictionary page, in which
case the non-null values represent the indices.
Irrespectively of the physical type, the main idea is to split the page in two iterators:
def levelsnon-null valuesand progress the iterators as needed. In particular, for non-nested types, def levels is a bitmap
with the same representation as Arrow, in which case the validity is extended directly.
The non-null values are "expanded" by filling null values with the default value of each physical
type.
For nested type with N+1 levels (1 is the primitive), we need to build the nest information of each N levels + the non-nested Arrow array.
This is done by first transversing the parquet types and using it to initialize, per chunk, the N levels.
The per-chunk execution is then similar but chunk_size only drives the number of retrieved rows
from the outermost parquet group (the field). Each of these pulls knows how many items need to be
pulled from the inner groups, all the way to the primitive type. This works because in parquet a row
cannot be split between two pages and thus each page is guaranteed to contain a full row.
The PageState of nested types is composed by 4 iterators:
rep levels and def levelsdef levelsnon-null valuesThe idea is that an iterator of rep, def contain all the information to decode the nesting
structure of an arrow array. The other two iterators are equivalent to the non-nested types with the
exception that def levels are no equivalent to arrow bitmaps.