hudi-io/hfile_format.md
The HFile format is based on the SSTable file format, which is optimized for range scans and point lookups, and was originally designed and implemented by HBase. We use HFile version 3 as the base file format of the internal metadata table (MDT). Here we describe the parts of the HFile format that are relevant to Hudi, as not all features of HFile are used.
The HFile is structured as follows:
+----------+-----------------------+
| "Scanned | Data Block            |
| block"   +-----------------------+
| section  | ...                   |
|          +-----------------------+
|          | Data Block            |
+----------+-----------------------+
| "Non-    | Meta Block            |
| scanned  +-----------------------+
| block"   | ...                   |
| section  +-----------------------+
|          | Meta Block            |
+----------+-----------------------+
| "Load-   | Root Data Index Block |
| on-open" +-----------------------+
| section  | Meta Index Block      |
|          +-----------------------+
|          | File Info Block       |
+----------+-----------------------+
| Trailer  | Trailer, containing   |
|          | fields and            |
|          | HFile Version         |
+----------+-----------------------+
Next, we describe the block format and each block in detail.
All blocks except the Trailer share the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ Block Magic +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| On-disk Size Without Header |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Uncompressed Size Without Header |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ Previous Block Offset +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Checksum Type | Bytes Per Checksum >
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | On-disk Data Size With Header >
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | |
+-+-+-+-+-+-+-+-+ +
| |
~ Data ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Checksum ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Note that one tick mark represents one bit position.
Header:
Block Magic values for the supported block types:
- DATABLK*: DATA block type for data blocks
- METABLKc: META block type for meta blocks
- IDXROOT2: ROOT_INDEX block type for root-level index blocks
- FILEINF2: FILE_INFO block type for the file info block, a small key-value map of metadata

Data:
Checksum:
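
As a rough illustration of the block layout in the diagram above, the fixed-size header fields can be unpacked with Python's `struct` module. This is a minimal sketch, not Hudi's reader: the field widths are read off the diagram (one tick mark per bit), while the big-endian byte order, the signedness of the integers, and the names `BlockHeader` and `parse_block_header` are assumptions made for this example.

```python
import struct
from typing import NamedTuple

# Field widths taken from the diagram: 8 + 4 + 4 + 8 + 1 + 4 + 4 = 33 bytes.
# Big-endian byte order (">") is an assumption of this sketch.
BLOCK_HEADER = struct.Struct(">8siiqBii")


class BlockHeader(NamedTuple):
    magic: bytes
    on_disk_size_without_header: int
    uncompressed_size_without_header: int
    previous_block_offset: int
    checksum_type: int
    bytes_per_checksum: int
    on_disk_data_size_with_header: int


def parse_block_header(buf: bytes, offset: int = 0) -> BlockHeader:
    """Unpack the header fields of the block starting at `offset`."""
    return BlockHeader(*BLOCK_HEADER.unpack_from(buf, offset))


# Synthetic header for illustration only; all values are made up.
sample = BLOCK_HEADER.pack(b"DATABLK*", 100, 100, -1, 1, 16384, 133)
header = parse_block_header(sample)
```

The "Data" part would then start right after these header bytes, followed by the checksum bytes.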
The "Data" part of the Data Block consists of one or multiple key-value pairs, with keys sorted in lexicographical order:
+--------------------+
| Key-value Pair 0 |
+--------------------+
| Key-value Pair 1 |
+--------------------+
| ... |
+--------------------+
| Key-value Pair N-1 |
+--------------------+
Each key-value pair has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Key Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Value Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Key ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Value ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MVCC Timestamp|
+-+-+-+-+-+-+-+-+
Header:
Key Length and Value Length are 4 bytes each and give the sizes of the Key and Value parts.
Key:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Key Content Size | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
| |
~ Key Content ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Other Information ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Value:
The whole part is the value as a byte array. Its size is given by the Value Length field in the header.
MVCC Timestamp:
This is used by HBase and written to HFile. For Hudi, this field should always be zero, occupying 1 byte.
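
The key-value layout described above can be walked with a short loop. The following is a sketch under stated assumptions: integers are taken to be big-endian, the MVCC timestamp is taken to be exactly 1 byte as described for Hudi, and the helper names `iter_key_values` and `encode_key_value` are hypothetical, introduced only for this example.

```python
import struct
from typing import Iterator, Tuple


def iter_key_values(data: bytes) -> Iterator[Tuple[bytes, bytes, int]]:
    """Yield (key_content, value, mvcc_timestamp) for each pair in the "Data" part."""
    pos = 0
    while pos < len(data):
        # Header: Key Length and Value Length, 4 bytes each (big-endian assumed).
        key_len, value_len = struct.unpack_from(">ii", data, pos)
        pos += 8
        key = data[pos:pos + key_len]
        pos += key_len
        value = data[pos:pos + value_len]
        pos += value_len
        mvcc = data[pos]  # 1 byte, expected to be 0 for Hudi
        pos += 1
        # Key: 2-byte Key Content Size, then Key Content, then Other Information.
        (content_size,) = struct.unpack_from(">H", key, 0)
        yield key[2:2 + content_size], value, mvcc


def encode_key_value(key_content: bytes, value: bytes, other: bytes = b"") -> bytes:
    """Build one key-value pair in the same layout, for testing the parser."""
    key = struct.pack(">H", len(key_content)) + key_content + other
    return struct.pack(">ii", len(key), len(value)) + key + value + b"\x00"


# Two made-up pairs; the second carries 10 bytes of "Other Information".
data = encode_key_value(b"a", b"1") + encode_key_value(b"b", b"22", other=bytes(10))
pairs = list(iter_key_values(data))
```

The sketch skips the "Other Information" bytes entirely, since only the key content and value are needed to read records back.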
The "Data" part of the Meta Block contains the meta information as a byte array. The key of the meta block can be found in the Meta Index Block.
The "Data" part of the Index Block can be empty. When not empty, the "Data" part of the Index Block contains one or more block index entries organized as follows:
+-----------------------+
| Block Index Entry 0 |
+-----------------------+
| Block Index Entry 1 |
+-----------------------+
| ... |
+-----------------------+
| Block Index Entry N-1 |
+-----------------------+
Each block index entry, referencing one relevant Data or Meta Block, has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ Block Offset +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Block Size on Disk |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Key Length ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ Key +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Key:
+----------------+-----------+
| Key Bytes Size | Key Bytes |
+----------------+-----------+
For Data Index, the "Key Bytes" part has the following format (same as the key format in the Data Block):
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Key Content Size | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
| |
~ Key Content ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Other Information ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
For Meta Index, the "Key Bytes" part is the byte array of the key of the Meta Block.
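
To show how block index entries might support a point lookup, here is a small sketch that picks the data block that could contain a given key. It assumes that each data index entry carries the first key of the block it references and that keys compare as raw bytes in lexicographical order; the first assumption comes from HBase's behavior and is not stated in this document. The names `BlockIndexEntry` and `find_block` are hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class BlockIndexEntry(NamedTuple):
    first_key: bytes   # assumed: key content of the first key in the block
    offset: int        # Block Offset
    size_on_disk: int  # Block Size on Disk


def find_block(entries: List[BlockIndexEntry], lookup_key: bytes) -> Optional[BlockIndexEntry]:
    """Return the last entry whose first key is <= lookup_key, or None."""
    keys = [entry.first_key for entry in entries]
    # Python compares bytes lexicographically by unsigned byte value.
    i = bisect_right(keys, lookup_key) - 1
    return entries[i] if i >= 0 else None


# Made-up index with three data blocks.
entries = [
    BlockIndexEntry(b"a", 0, 100),
    BlockIndexEntry(b"m", 133, 100),
    BlockIndexEntry(b"t", 266, 100),
]
hit = find_block(entries, b"n")
```

A key smaller than the first entry's key yields `None`, meaning no block in the file can contain it.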
The "Data" part of the File Info Block has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| PBUF Magic |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ File Info ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The PBUF Magic is the byte sequence PBUF, indicating that the block uses Protobuf for serialization and deserialization (serde).
Here is the definition of the File Info proto, InfoProto:
message BytesBytesPair {
  required bytes first = 1;
  required bytes second = 2;
}
message InfoProto {
  repeated BytesBytesPair map_entry = 1;
}
The key and value are represented as byte arrays. When Hudi adds more key-value metadata entries to the file info, the key and value are encoded from String into byte arrays using UTF-8.
Here are the common metadata entries stored in the File Info Block:
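
As an illustration of the InfoProto definition above, the sketch below decodes the message bytes into a map without a Protobuf library, using only the standard Protobuf wire format (varint tags and length-delimited fields). It parses the message body only; whether the message is preceded by a length prefix after the PBUF Magic is not stated here, so that step is left out. The helper names and the sample entries are made up for this example.

```python
from typing import Dict, List, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a Protobuf varint at `pos`; return (value, next position)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def read_bytes_fields(buf: bytes) -> Dict[int, List[bytes]]:
    """Collect length-delimited fields as {field_number: [payload, ...]}."""
    fields: Dict[int, List[bytes]] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        if tag & 0x07 != 2:
            raise ValueError("only length-delimited fields are expected here")
        length, pos = read_varint(buf, pos)
        fields.setdefault(tag >> 3, []).append(buf[pos:pos + length])
        pos += length
    return fields


def parse_info_proto(message: bytes) -> Dict[bytes, bytes]:
    """Decode InfoProto: each map_entry (field 1) holds first (1) and second (2)."""
    info = {}
    for entry in read_bytes_fields(message).get(1, []):
        pair = read_bytes_fields(entry)
        info[pair[1][0]] = pair[2][0]
    return info


def encode_len_delimited(field_number: int, payload: bytes) -> bytes:
    """Test helper; valid only for payloads shorter than 128 bytes."""
    assert len(payload) < 128
    return bytes([(field_number << 3) | 2, len(payload)]) + payload


# Made-up entries: one byte-array value and one UTF-8 encoded string pair.
entry1 = encode_len_delimited(1, b"hfile.LASTKEY") + encode_len_delimited(2, b"key9")
entry2 = (encode_len_delimited(1, "example.key".encode("utf-8"))
          + encode_len_delimited(2, "example value".encode("utf-8")))
message = encode_len_delimited(1, entry1) + encode_len_delimited(1, entry2)
info = parse_info_proto(message)
```

String entries added by Hudi could then be read back with `info[key].decode("utf-8")`.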
- hfile.LASTKEY: The last key of the file (byte array)
- hfile.MAX_MEMSTORE_TS_KEY: Maximum MVCC timestamp of the key-value pairs in the file. In Hudi, this should always be 0.

The HFile Trailer has a fixed size of 4096 bytes. Its format differs from that of the other blocks, as follows:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ Block Magic +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Trailer Content ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Block Magic of the Trailer is TRABLK"$. The Trailer Content holds the fields of the following Protobuf message:
message TrailerProto {
  optional uint64 file_info_offset = 1;
  optional uint64 load_on_open_data_offset = 2;
  optional uint64 uncompressed_data_index_size = 3;
  optional uint64 total_uncompressed_bytes = 4;
  optional uint32 data_index_count = 5;
  optional uint32 meta_index_count = 6;
  optional uint64 entry_count = 7;
  optional uint32 num_data_index_levels = 8;
  optional uint64 first_data_block_offset = 9;
  optional uint64 last_data_block_offset = 10;
  optional string comparator_class_name = 11;
  optional uint32 compression_codec = 12;
  optional bytes encryption_key = 13;
}
Here is the meaning of each field:
- file_info_offset: File info offset
- load_on_open_data_offset: The offset of the section ("Load-on-open" section) that we need to load when opening the file
- uncompressed_data_index_size: The total uncompressed size of the whole data block index
- total_uncompressed_bytes: Total uncompressed bytes
- data_index_count: Number of data index entries
- meta_index_count: Number of meta index entries
- entry_count: Number of key-value pair entries in the file
- num_data_index_levels: The number of levels in the data block index
- first_data_block_offset: The offset of the first data block
- last_data_block_offset: The offset of the first byte after the last key-value data block
- comparator_class_name: Comparator class name (In Hudi, we always assume lexicographical order, so this is ignored)
- compression_codec: Compression codec: 0 = LZO, 1 = GZ, 2 = NONE
- encryption_key: Encryption key (not used by Hudi)

The last 4 bytes of the Trailer content contain the HFile version: the number represented by the first byte indicates the minor version, and the number represented by the last three bytes indicates the major version. In the case of Hudi, the major version should always be 3, if written by the HBase HFile writer.
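
The version rule above can be sketched as follows. The code assumes that the Trailer occupies the last 4096 bytes of the file, that the Trailer content runs to the end of the Trailer so the version sits in the file's last 4 bytes, and that the three major-version bytes are big-endian. The function names `read_trailer` and `parse_trailer_version` are hypothetical.

```python
import os
from typing import Tuple

TRAILER_SIZE = 4096
TRAILER_MAGIC = b'TRABLK"$'


def read_trailer(path: str) -> bytes:
    """Read the fixed-size Trailer from the end of an HFile."""
    with open(path, "rb") as f:
        f.seek(-TRAILER_SIZE, os.SEEK_END)
        return f.read(TRAILER_SIZE)


def parse_trailer_version(trailer: bytes) -> Tuple[int, int]:
    """Return (major, minor) from the last 4 bytes of the Trailer."""
    if len(trailer) != TRAILER_SIZE:
        raise ValueError("trailer must be exactly 4096 bytes")
    if trailer[:8] != TRAILER_MAGIC:
        raise ValueError("unexpected trailer magic")
    version = trailer[-4:]
    minor = version[0]                          # first byte
    major = int.from_bytes(version[1:], "big")  # last three bytes
    return major, minor


# Synthetic trailer: magic, zero padding, then minor = 1 and major = 3 (made up).
sample = TRAILER_MAGIC + bytes(TRAILER_SIZE - 8 - 4) + bytes([1, 0, 0, 3])
major, minor = parse_trailer_version(sample)
```

A reader would typically check the major version first and stop if it is not 3, before decoding the rest of the Trailer content.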