docs/design_docs/json_storage.md
- Dense part: A set of "core fields" (such as primary keys and commonly used metadata) that are present in most records.
- Sparse part: Additional attributes that appear only in some records, potentially involving unstructured or dynamically extended information.
When parsing JSON, predefined dense fields are extracted and mapped to independent columns in Parquet. A method similar to Parquet Variant Shredding is used to flatten nested data.
Fields not included in the dense part are stored in a sparse data field. They are serialized using BSON (Binary JSON) format, leveraging its efficient binary representation and rich data type support, with the result stored in a Parquet BINARY type field.
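The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the function names `flatten` and `split_record`, the dotted column names, and the sample record are all assumptions made for the example, and dotted-path flattening is only a simplified stand-in for a Variant Shredding style layout. BSON serialization of the sparse part is deliberately left out.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts into {"a.b": leaf} pairs; non-dict values are leaves.

    Note: an empty nested dict produces no columns in this sketch.
    """
    if not isinstance(value, dict):
        return {prefix: value}
    flat = {}
    for key, child in value.items():
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(child, path))
    return flat


def split_record(record, dense_fields):
    """Return (dense_columns, sparse_fields) for one parsed JSON record.

    dense_fields is the predefined set of top-level keys that get their
    own columns; every other key ends up in the sparse dictionary.
    """
    dense, sparse = {}, {}
    for key, value in record.items():
        if key in dense_fields:
            dense.update(flatten(value, key))
        else:
            sparse[key] = value
    return dense, sparse


# Hypothetical record: "id" and "meta" are declared dense, the rest is sparse.
record = {
    "id": 7,
    "meta": {"source": "web", "geo": {"country": "DE"}},
    "note": "rare",
    "score": 0.5,
}
dense, sparse = split_record(record, dense_fields={"id", "meta"})
print(dense)   # one entry per independent column
print(sparse)  # what would be serialized into the sparse column
```

Each key of `dense` corresponds to one independent column, while `sparse` is the dictionary that would be handed to a BSON encoder and stored as a single binary value.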
- A dedicated field (`sparse_data`) for storing sparse data, with type set to BINARY, directly storing BSON data.
- Dense fields keep their own columns while all other fields go into the `sparse_data` field, achieving a balance between query efficiency and storage flexibility.

To accelerate BSON parsing, an inverted index stores each BSON key along with the offset and size of its value, or the value itself if it is of numeric type.
| Valid | Type | Row ID | Offset/Value |
|---|---|---|---|
| 1 bit | 4 bits | 27 bits | 16-bit offset, 16-bit size |
The column key index is optional, and can be configured at table creation time or modified later through field properties.
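The entry layout in the table adds up to 64 bits (1 + 4 + 27 + 16 + 16), so one entry fits in a single 64-bit word. The sketch below packs and unpacks such a word. The bit order (valid flag in the most significant bit, size in the lowest 16 bits), the function names, and the type code used in the example are assumptions; the document does not specify them.

```python
def pack_entry(valid, type_code, row_id, offset, size):
    """Pack one index entry into a 64-bit integer.

    Assumed layout, most significant bit first:
    valid (1) | type (4) | row id (27) | offset (16) | size (16)
    """
    fields = (
        ("type_code", type_code, 4),
        ("row_id", row_id, 27),
        ("offset", offset, 16),
        ("size", size, 16),
    )
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return (
        (int(bool(valid)) << 63)
        | (type_code << 59)
        | (row_id << 32)
        | (offset << 16)
        | size
    )


def unpack_entry(word):
    """Inverse of pack_entry: return (valid, type_code, row_id, offset, size)."""
    return (
        bool(word >> 63),
        (word >> 59) & 0xF,
        (word >> 32) & 0x7FFFFFF,  # 27-bit mask
        (word >> 16) & 0xFFFF,
        word & 0xFFFF,
    )


# Hypothetical entry: type code 2, row 5, value at byte 29 spanning 4 bytes.
word = pack_entry(True, 2, 5, 29, 4)
print(hex(word), unpack_entry(word))
```

With these widths, a row ID can address up to 2^27 rows and an offset or size can reach at most 65,535 bytes, so a larger BSON document would need a different encoding for keys located past that point.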
```json
[
  {"id": 1, "attr1": "value1", "attr2": 100},
  {"id": 2, "attr1": "value2", "attr3": true},
  {"id": 3, "attr1": "value3", "attr4": "extra", "attr5": 3.14}
]
```
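In this example, `id` is considered dense. The remaining fields are sparse:

- Record 1: `attr1`, `attr2`
- Record 2: `attr1`, `attr3`
- Record 3: `attr1`, `attr4`, `attr5`

| Column Name | Data Type | Description |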
|---|---|---|
| id | int64 | Dense column storing the integer identifier. |
| sparse_data | binary | Sparse column storing BSON-serialized data of all remaining fields. |
| sparse_index | binary | Index column storing key offsets for efficient parsing. |
Dense Column (`id`):

- 1
- 2
- 3

Sparse Column (`sparse_data`):

- `{"attr1": "value1", "attr2": 100}`
- `{"attr1": "value2", "attr3": true}`
- `{"attr1": "value3", "attr4": "extra", "attr5": 3.14}`

Sparse Index (`sparse_index`):

- Row 1: maps `attr1` and `attr2` to their respective positions in `sparse_data`.
- Row 2: maps `attr1` and `attr3`.
- Row 3: maps `attr1`, `attr4`, and `attr5`.

In an actual system, the sparse data would be serialized using a BSON library (e.g., bsoncxx) for a compact binary format. The example above demonstrates the logical mapping of JSON data to the Parquet storage format.
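To make the mapping concrete, the following sketch builds the three columns for the example records and then reads one sparse value through the index without parsing the whole document. It is an illustration under stated assumptions: `encode_with_index` is a hypothetical hand-rolled encoder that handles only flat documents with string, bool, 32-bit integer, and double values; the index is kept as a plain dictionary rather than the packed binary form; and "offset" is taken to mean the byte position of the value payload inside the BSON document. A real system would use a BSON library instead.

```python
import struct


def encode_with_index(doc):
    """Return (bson_bytes, {key: (value_offset, value_size)}) for a flat dict."""
    body = bytearray()
    index = {}
    for key, value in doc.items():
        if isinstance(value, bool):  # bool must be tested before int
            tag, payload = 0x08, bytes([value])
        elif isinstance(value, int):
            tag, payload = 0x10, struct.pack("<i", value)  # int32 only here
        elif isinstance(value, float):
            tag, payload = 0x01, struct.pack("<d", value)
        elif isinstance(value, str):
            raw = value.encode("utf-8") + b"\x00"
            tag, payload = 0x02, struct.pack("<i", len(raw)) + raw
        else:
            raise TypeError(f"unsupported type for key {key!r}")
        # Element header: type tag followed by the NUL-terminated key.
        body += bytes([tag]) + key.encode("utf-8") + b"\x00"
        # +4 accounts for the document length prefix that precedes the body.
        index[key] = (4 + len(body), len(payload))
        body += payload
    # The total length counts the 4-byte prefix and the trailing NUL byte.
    return struct.pack("<i", len(body) + 5) + bytes(body) + b"\x00", index


records = [
    {"id": 1, "attr1": "value1", "attr2": 100},
    {"id": 2, "attr1": "value2", "attr3": True},
    {"id": 3, "attr1": "value3", "attr4": "extra", "attr5": 3.14},
]

id_column, sparse_data, sparse_index = [], [], []
for rec in records:
    id_column.append(rec["id"])
    blob, idx = encode_with_index({k: v for k, v in rec.items() if k != "id"})
    sparse_data.append(blob)
    sparse_index.append(idx)

# Read attr5 of the third row directly from its recorded offset and size.
offset, size = sparse_index[2]["attr5"]
(attr5,) = struct.unpack("<d", sparse_data[2][offset:offset + size])
print(id_column, sparse_index[0], attr5)
```

The lookup at the end touches only the eight bytes of the `attr5` value, which is the kind of shortcut the key index is meant to enable.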