protocol_rfcs/variant-shredding.md
Associated Github issue for discussions: https://github.com/delta-io/delta/issues/4032
This protocol change adds support for Variant shredding for the Variant data type. Shredding allows Variant data to be be more efficiently stored and queried.
New Section after the
Variant Data Typesection
This feature enables support for shredding of the Variant data type, to store and query Variant data more efficiently. Shredding a Variant value is taking paths from the Variant value, and storing them as a typed column in the file. The shredding does not duplicate data, so if a value is stored in the typed column, it is removed from the Variant binary. Storing Variant values as typed columns is faster to access, and enables data skipping with statistics.
The variantShredding feature depends on the variantType feature.
To support this feature:
variantType must exist in the table protocol's readerFeatures and writerFeatures.variantShredding must exist in the table protocol's readerFeatures and writerFeatures.Shredded Variant data is stored according to the Parquet Variant Shredding specification The shredded Variant data written to parquet files is written as a single Parquet struct, with the following fields:
| Struct field name | Parquet primitive type | Description |
|---|---|---|
| metadata | binary | (required) The binary-encoded Variant metadata, as described in Parquet Variant binary encoding |
| value | binary | (optional) The binary-encoded Variant value, as described in Parquet Variant binary encoding |
| typed_value | * | (optional) This can be any Parquet type, representing the data stored in the Variant. Details of the shredding scheme is found in the Variant Shredding specification |
When Variant Shredding is supported (writerFeatures field of a table's protocol action contains variantShredding), writers:
delta.enableVariantShredding table property configuration. If delta.enableVariantShredding=false, a column of type variant must not be written as a shredded Variant, but as an unshredded Variant. If delta.enableVariantShredding=true, the writer can choose to shred a Variant column according to the Parquet Variant Shredding specificationWhen Variant type is supported (readerFeatures field of a table's protocol action contains variantShredding), readers:
variant data type in a Delta schemametadata and value struct fields) or shredded (metadata, value, and typed_value struct fields) when reading a Variant data type from file.Update the
Per-file Statisticssection
After the description and examples starting from:
Per-column statistics record information for each column in the file and they are encoded, mirroring the schema of the actual data. For example, given the following data schema:
nullCount stat for a Variant column is a LONG representing the nullcount for the Variant column itself (nullcount stats are not captured for individual paths within the Variant).minValues and maxValues stats for a Variant column are Variant objects, where the object keys are normalized JSON path expressions, and the object values are the primitive Variant values representing the lower and upper bound for that field.minValues and maxValues stats for a Variant column are binary-encoded Variant values, concatenating the metadata and value, and serialized to strings using z85 encoding (see example below).minValues and maxValues stats for a Variant column are Parquet Variant columns, following the Parquet Variant encoding and shredding specifications.minValues and maxValues stats are allowed to be shredded, but it is not required.minValues (maxValues) value is the independently computed min (max) stat for the corresponding path in the file's Variant data, so e.g. minValues.v:a and minValues.v:b could come from different rows in the file.minValues and maxValues must be the same within any one file, but can vary from file to file.For a table with a single Variant column (varCol: variant) in its data schema, example statistics in JSON would look like:
"stats": {
"nullCount": {
"varCol": 2
}
"minValues": {
"varCol": "0S&u501fk+ze0(tB98CpzF6vU0rJl95HpNdvjbtatpi(cu0wW^cTu"
},
"maxValues": {
"varCol": "0S&u500&]LC42A9vqZe}wb#-i1}-a+cT!xdbWhT9cTx}7v<+K"
}
}
The corresponding human-readable form is:
"stats": {
"nullCount": {
"varCol": 2
}
"minValues": {
"varCol": {
"$['a']" : "min-string",
"$['b']['c']" : 1
}
},
"maxValues": {
"varCol": {
"$['a']" : "variant",
"$['b']['c']" : 100
}
}
}