The main implementation change would require replacing the Avro schema references with the new type system.
This section captures additional research and design notes for supporting a VECTOR logical type in Hudi. See appendix for more details on research sources.
The initial use case we are targeting for VECTOR within Hudi
is to enable KNN-style vector search over blobs (large text, images, audio, video) alongside their generated vector embeddings.
Vector search is popular for Retrieval-Augmented Generation (RAG) applications,
which provide relevant context to an LLM in order to improve its accuracy when answering user queries.
The vector embeddings generated by frontier models are usually in the form of an array of floating point values.
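
To make the KNN-style search concrete, here is a minimal brute-force sketch over (blob, embedding) pairs. The function names, the sample rows, and the choice of cosine similarity as the metric are assumptions for illustration only; this RFC does not prescribe a distance metric or a search implementation.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, rows, k):
    """Return the k (blob, score) pairs most similar to `query`, best first.

    `rows` is a list of (blob, embedding) pairs (hypothetical layout).
    """
    scored = ((blob, cosine_similarity(query, emb)) for blob, emb in rows)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 3-dimensional embeddings; real embeddings have hundreds or thousands of dimensions.
rows = [
    ("doc about cats", [0.9, 0.1, 0.0]),
    ("doc about dogs", [0.8, 0.2, 0.1]),
    ("doc about stock prices", [0.0, 0.1, 0.9]),
]
top = knn([1.0, 0.0, 0.0], rows, k=2)
```

A brute-force scan like this reads every embedding, which is why the on-disk layout and read throughput of the vector column matter for this use case.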
Dense vector
Sparse vector
Sparse vectors become important for other types of hybrid/lexical-style retrieval, which is not targeted in the initial scope,
as it requires running different algorithms (such as TF-IDF or BM25) from those used for the initial use case of KNN-style search.
Hence this RFC separates the two into distinct types, VECTOR (dense) and SPARSE_VECTOR; for now we focus on the dense VECTOR case.
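
As a rough illustration of the dense/sparse distinction, the sketch below stores a dense vector as a plain list with one value per dimension, and a sparse vector as an index-to-value mapping that keeps only the non-zero entries. The `to_sparse` and `to_dense` helpers and the dict representation are hypothetical; they are not the SPARSE_VECTOR layout of this RFC, which is out of scope here.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def to_dense(sparse, dimension):
    """Expand an {index: value} mapping back to a full-length list."""
    out = [0.0] * dimension
    for i, v in sparse.items():
        out[i] = v
    return out


# A dense embedding: every position carries a value.
embedding = [0.12, -0.5, 0.33, 0.07]

# Lexical-style term weights: mostly zeros, so a sparse form is more compact.
term_weights = [0.0, 0.0, 2.1, 0.0, 0.0, 0.7, 0.0, 0.0]
sparse = to_sparse(term_weights)
```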
Logical level requirements:
- `elementType`: FLOAT, DOUBLE or INT8.
- `storageBacking`, which lets the writers know how to serialize the vector to disk. As an initial approach we will start with the fixed-bytes approach covered below.

See the following Avro schema model as a general example:
{
"type" : "fixed",
"name" : "vector",
"size" : 3072,
"logicalType" : "vector",
"dimension" : 768,
"elementType" : "FLOAT",
"storageBacking" : "FIXED_BYTES"
}
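
In the schema above, `size` (3072) equals `dimension` (768) times the 4-byte width of a FLOAT element. A small consistency check along those lines could look like the sketch below. The `validate_vector_schema` helper and the `ELEMENT_WIDTH` table are hypothetical names, and the 8-byte and 1-byte widths for DOUBLE and INT8 are the conventional sizes, assumed here rather than stated in this RFC.

```python
import json

# Bytes per element. FLOAT = 4 matches the example schema; DOUBLE = 8 and
# INT8 = 1 are the conventional widths and are an assumption of this sketch.
ELEMENT_WIDTH = {"FLOAT": 4, "DOUBLE": 8, "INT8": 1}


def validate_vector_schema(schema):
    """Check that a fixed-backed vector schema is internally consistent.

    Returns the expected byte size, or raises ValueError on a mismatch.
    """
    if schema.get("type") != "fixed" or schema.get("logicalType") != "vector":
        raise ValueError("not a fixed-backed vector schema")
    expected = schema["dimension"] * ELEMENT_WIDTH[schema["elementType"]]
    if schema["size"] != expected:
        raise ValueError(f"size {schema['size']} != dimension * width = {expected}")
    return expected


schema = json.loads("""
{
  "type": "fixed",
  "name": "vector",
  "size": 3072,
  "logicalType": "vector",
  "dimension": 768,
  "elementType": "FLOAT",
  "storageBacking": "FIXED_BYTES"
}
""")
size_in_bytes = validate_vector_schema(schema)
```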
Physical level requirements:
For now we will support a fixed-size packed byte representation for storing vectors on disk, as this yields the best performance (see the Parquet tuning section below for more details):

- `D * 4` bytes (IEEE-754 float32, little-endian)
- `FIXED_LEN_BYTE_ARRAY(D * 4)` with VECTOR metadata.
- `FixedSizedList`

Vector data is typically high-cardinality and not dictionary-friendly. Therefore we will be disabling dictionary encoding and column stats for vector columns.
Also, based on findings from the Parquet community, encodings such as PLAIN or BYTE_STREAM_SPLIT are useful when dealing with vectors, as is disabling compression,
since this yields the best write/read performance.
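
The packed layout (`D * 4` bytes of little-endian float32) and the idea behind BYTE_STREAM_SPLIT can be sketched with the Python standard library. The helper names are hypothetical. The split shown here operates on 4-byte elements, as the encoding is defined for FLOAT values; how a Parquet writer applies it to a `FIXED_LEN_BYTE_ARRAY` column depends on the Parquet version and is not modeled here.

```python
import struct


def pack_vector(values):
    """Serialize floats as D * 4 bytes of little-endian IEEE-754 float32."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_vector(data):
    """Inverse of pack_vector: decode D * 4 bytes back into a list of floats."""
    return list(struct.unpack(f"<{len(data) // 4}f", data))


def byte_stream_split(data, width=4):
    """Regroup bytes so that byte k of every element is contiguous.

    Stream 0 holds the first byte of each element, stream 1 the second, etc.
    This changes only the byte order, not the total size.
    """
    return b"".join(data[k::width] for k in range(width))


def byte_stream_join(data, width=4):
    """Inverse of byte_stream_split."""
    count = len(data) // width
    out = bytearray(len(data))
    for k in range(width):
        out[k::width] = data[k * count:(k + 1) * count]
    return bytes(out)


vector = [1.0, -2.5, 0.25]           # values exactly representable in float32
packed = pack_vector(vector)         # 3 * 4 = 12 bytes
split = byte_stream_split(packed)    # same 12 bytes, regrouped into 4 streams
```

Because the split only permutes bytes, it costs little on its own; any size benefit comes from a compressor finding more repetition in the regrouped streams.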
Benchmark experiment with vectors
10,000 vectors (each vector has dimension 1,536 and element type FLOAT (4 bytes), around 6 KB per record).

Physical backings tested
Summary of Results
Winner (most compact file size): Parquet LIST (byte-stream-split, ZSTD)
Currently Parquet LIST is only about 4 MB more compact than the best Parquet FIXED results (50.27 MB vs 54.35 MB).
Performance Winner (Write): Lance
Performance Winner (Read): Parquet FIXED (byte-stream-split, UNCOMPRESSED)
*Note*: Parquet FIXED (plain, UNCOMPRESSED) and Lance are close in write performance (527.87 MB/s vs 665.84 MB/s).
Detailed comparison table
| Representation | File Size | Write Speed | Read Speed | Bytes/Rec | vs Raw | vs Base |
|---|---|---|---|---|---|---|
| Parquet LIST (plain, UNCOMPRESSED) | 58.86 MB | 124.93 MB/s | 233.44 MB/s | 6,172 B | 1.00x | 1.00x |
| Parquet LIST (plain, SNAPPY) | 58.69 MB | 125.20 MB/s | 232.51 MB/s | 6,154 B | 1.00x | 1.00x |
| Parquet LIST (plain, ZSTD) | 54.35 MB | 117.66 MB/s | 206.32 MB/s | 5,698 B | 1.08x | 0.92x |
| Parquet LIST (byte-stream-split, UNCOMPRESSED) | 58.86 MB | 118.61 MB/s | 210.77 MB/s | 6,172 B | 1.00x | 1.00x |
| Parquet LIST (byte-stream-split, SNAPPY) | 53.60 MB | 111.18 MB/s | 200.66 MB/s | 5,620 B | 1.09x | 0.91x |
| Parquet LIST (byte-stream-split, ZSTD) | 50.27 MB | 101.90 MB/s | 194.02 MB/s | 5,270 B | 1.17x | 0.85x |
| Parquet FIXED (plain, UNCOMPRESSED) | 58.82 MB | 527.87 MB/s | 2253.61 MB/s | 6,167 B | 1.00x | 1.00x |
| Parquet FIXED (plain, SNAPPY) | 58.69 MB | 496.56 MB/s | 2092.63 MB/s | 6,154 B | 1.00x | 1.00x |
| Parquet FIXED (plain, ZSTD) | 54.35 MB | 430.84 MB/s | 760.96 MB/s | 5,699 B | 1.08x | 0.92x |
| Parquet FIXED (byte-stream-split, UNCOMPRESSED) | 58.82 MB | 480.28 MB/s | 2343.75 MB/s | 6,167 B | 1.00x | 1.00x |
| Parquet FIXED (byte-stream-split, SNAPPY) | 58.69 MB | 327.34 MB/s | 2020.47 MB/s | 6,154 B | 1.00x | 1.00x |
| Parquet FIXED (byte-stream-split, ZSTD) | 54.35 MB | 415.56 MB/s | 802.65 MB/s | 5,699 B | 1.08x | 0.92x |
| Lance | 58.85 MB | 665.84 MB/s | 1395.09 MB/s | 6,170 B | 1.00x | - |
Winner (most compact): Parquet LIST (byte-stream-split, ZSTD)
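
The raw data size and the "vs Raw" column can be cross-checked with a few lines of arithmetic. This is a sketch under two assumptions inferred from the table values rather than taken from the benchmark code: that "MB" means 2^20 bytes, and that "vs Raw" is the raw payload size divided by the file size.

```python
NUM_VECTORS = 10_000
DIMENSION = 1_536
BYTES_PER_ELEMENT = 4          # FLOAT
MB = 1024 * 1024               # assumption: "MB" in the tables means 2**20 bytes

raw_bytes_per_record = DIMENSION * BYTES_PER_ELEMENT       # payload only, no metadata
raw_mb = NUM_VECTORS * raw_bytes_per_record / MB


def vs_raw(file_size_mb):
    """Compression ratio relative to the raw payload (assumed definition)."""
    return raw_mb / file_size_mb


print(f"raw payload: {raw_bytes_per_record} B/record, {raw_mb:.2f} MB total")
print(f"LIST  (byte-stream-split, ZSTD): {vs_raw(50.27):.2f}x")
print(f"FIXED (plain, ZSTD):             {vs_raw(54.35):.2f}x")
```

Under these assumptions the raw payload is 6,144 bytes per record, so the roughly 6,167 to 6,172 bytes per record of the uncompressed files correspond to a few dozen bytes of per-record overhead.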
| Representation | Write Time | Write Speed | Read Time | Read Speed |
|---|---|---|---|---|
| Parquet LIST (plain, UNCOMPRESSED) | 469 ms | 124.93 MB/s | 251 ms | 233.44 MB/s |
| Parquet LIST (plain, SNAPPY) | 468 ms | 125.20 MB/s | 252 ms | 232.51 MB/s |
| Parquet LIST (plain, ZSTD) | 498 ms | 117.66 MB/s | 284 ms | 206.32 MB/s |
| Parquet LIST (byte-stream-split, UNCOMPRESSED) | 494 ms | 118.61 MB/s | 278 ms | 210.77 MB/s |
| Parquet LIST (byte-stream-split, SNAPPY) | 527 ms | 111.18 MB/s | 292 ms | 200.66 MB/s |
| Parquet LIST (byte-stream-split, ZSTD) | 575 ms | 101.90 MB/s | 302 ms | 194.02 MB/s |
| Parquet FIXED (plain, UNCOMPRESSED) | 111 ms | 527.87 MB/s | 26 ms | 2253.61 MB/s |
| Parquet FIXED (plain, SNAPPY) | 118 ms | 496.56 MB/s | 28 ms | 2092.63 MB/s |
| Parquet FIXED (plain, ZSTD) | 136 ms | 430.84 MB/s | 77 ms | 760.96 MB/s |
| Parquet FIXED (byte-stream-split, UNCOMPRESSED) | 122 ms | 480.28 MB/s | 25 ms | 2343.75 MB/s |
| Parquet FIXED (byte-stream-split, SNAPPY) | 179 ms | 327.34 MB/s | 29 ms | 2020.47 MB/s |
| Parquet FIXED (byte-stream-split, ZSTD) | 141 ms | 415.56 MB/s | 73 ms | 802.65 MB/s |
| Lance | 88 ms | 665.84 MB/s | 42 ms | 1395.09 MB/s |
Performance Winner (Write): Lance

Performance Winner (Read): Parquet FIXED (byte-stream-split, UNCOMPRESSED)
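
The speed columns appear to be derived from the raw payload size (about 58.59 MB) divided by the elapsed time, not from the on-disk file size. The sketch below reproduces a few table entries under that assumption; the formula is inferred from the reported numbers, not taken from the benchmark code, and "MB" is again assumed to mean 2^20 bytes.

```python
# Raw payload: 10,000 vectors * 1,536 dimensions * 4 bytes, in units of 2**20 bytes.
RAW_MB = 10_000 * 1_536 * 4 / (1024 * 1024)


def throughput_mb_per_s(elapsed_ms):
    """Raw payload size divided by elapsed time (assumed definition)."""
    return RAW_MB * 1000 / elapsed_ms


print(f"Parquet LIST plain write (469 ms): {throughput_mb_per_s(469):.2f} MB/s")
print(f"Lance write (88 ms):               {throughput_mb_per_s(88):.2f} MB/s")
print(f"Parquet FIXED BSS read (25 ms):    {throughput_mb_per_s(25):.2f} MB/s")
```

One consequence of this definition is that throughput is directly comparable across rows regardless of how much each configuration compresses the data.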
Parquet LIST, by encoding and compression:

| Encoding, Compression | File Size | vs Raw | Write Time | Read Time | Write MB/s |
|---|---|---|---|---|---|
| plain, UNCOMPRESSED | 58.86 MB | 1.00x | 469 ms | 251 ms | 124.93 |
| plain, SNAPPY | 58.69 MB | 1.00x | 468 ms | 252 ms | 125.20 |
| plain, ZSTD | 54.35 MB | 1.08x | 498 ms | 284 ms | 117.66 |
| byte-stream-split, UNCOMPRESSED | 58.86 MB | 1.00x | 494 ms | 278 ms | 118.61 |
| byte-stream-split, SNAPPY | 53.60 MB | 1.09x | 527 ms | 292 ms | 111.18 |
| byte-stream-split, ZSTD | 50.27 MB | 1.17x | 575 ms | 302 ms | 101.90 |
Best compression ratio: byte-stream-split, ZSTD

Fastest write: plain, SNAPPY

Fastest read: plain, UNCOMPRESSED

Parquet FIXED, by encoding and compression:

| Encoding, Compression | File Size | vs Raw | Write Time | Read Time | Write MB/s |
|---|---|---|---|---|---|
| plain, UNCOMPRESSED | 58.82 MB | 1.00x | 111 ms | 26 ms | 527.87 |
| plain, SNAPPY | 58.69 MB | 1.00x | 118 ms | 28 ms | 496.56 |
| plain, ZSTD | 54.35 MB | 1.08x | 136 ms | 77 ms | 430.84 |
| byte-stream-split, UNCOMPRESSED | 58.82 MB | 1.00x | 122 ms | 25 ms | 480.28 |
| byte-stream-split, SNAPPY | 58.69 MB | 1.00x | 179 ms | 29 ms | 327.34 |
| byte-stream-split, ZSTD | 54.35 MB | 1.08x | 141 ms | 73 ms | 415.56 |
Best compression ratio: plain, ZSTD

Fastest write: plain, UNCOMPRESSED

Fastest read: byte-stream-split, UNCOMPRESSED

Lance:

| Compression | File Size | vs Raw | Write Time | Read Time | Write MB/s |
|---|---|---|---|---|---|
| Default | 58.85 MB | 1.00x | 88 ms | 42 ms | 665.84 |
Note: Lance uses default compression settings (no variations tested)

Parquet LIST, averaged by encoding:

| Encoding | Avg Size | Size Ratio | Avg Write | Avg Read |
|---|---|---|---|---|
| plain | 57.30 MB | 1.02x | 478 ms | 262 ms |
| byte-stream-split | 54.24 MB | 1.08x | 532 ms | 290 ms |

Parquet LIST (plain), by compression:

| Compression | File Size | vs Raw |
|---|---|---|
| UNCOMPRESSED | 58.86 MB | 1.00x |
| SNAPPY | 58.69 MB | 1.00x |
| ZSTD | 54.35 MB | 1.08x |

Parquet LIST (byte-stream-split), by compression:

| Compression | File Size | vs Raw |
|---|---|---|
| UNCOMPRESSED | 58.86 MB | 1.00x |
| SNAPPY | 53.60 MB | 1.09x |
| ZSTD | 50.27 MB | 1.17x |

Parquet FIXED, averaged by encoding:

| Encoding | Avg Size | Size Ratio | Avg Write | Avg Read |
|---|---|---|---|---|
| plain | 57.29 MB | 1.02x | 121 ms | 43 ms |
| byte-stream-split | 57.29 MB | 1.02x | 147 ms | 42 ms |

Parquet FIXED (plain), by compression:

| Compression | File Size | vs Raw |
|---|---|---|
| UNCOMPRESSED | 58.82 MB | 1.00x |
| SNAPPY | 58.69 MB | 1.00x |
| ZSTD | 54.35 MB | 1.08x |

Parquet FIXED (byte-stream-split), by compression:

| Compression | File Size | vs Raw |
|---|---|---|
| UNCOMPRESSED | 58.82 MB | 1.00x |
| SNAPPY | 58.69 MB | 1.00x |
| ZSTD | 54.35 MB | 1.08x |

Lance:

| Encoding | Avg Size | Size Ratio | Avg Write | Avg Read |
|---|---|---|---|---|
| Default (Arrow IPC) | 58.85 MB | 1.00x | 88 ms | 42 ms |
Note: Lance uses Apache Arrow IPC encoding (no variations tested)