rfc/rfc-99/appendix.md
The main implementation change would require replacing the Avro schema references with the new type system.

Supporting VECTOR type in Hudi

This section captures additional research and design notes for supporting a VECTOR logical type in Hudi. See appendix for more details on research sources.

Initial scope

The initial use case we are targeting for VECTOR within Hudi is enabling KNN-style vector search over blobs (large text, images, audio, video) alongside their generated vector embeddings. Vector search is popular for Retrieval-Augmented Generation (RAG) applications, which provide relevant context to an LLM in order to improve its accuracy when answering user queries. The vector embeddings generated by frontier models are usually arrays of floating-point values.

Dense vectors vs sparse vectors

Dense vector

  • Has a value for every dimension.
  • Stored as a full length-D sequence: v = [0.12, -0.03, 0.44, ...] (length = D)
  • Even if some entries are 0, you still store them.

Sparse vector

  • Most entries are 0 / absent, so you store only the non-zero positions.
  • Stored as pairs (index, value), sometimes also with a separate nnz count:
  • [(3, 0.44), (107, 1.2), (9012, -0.7)]
  • The “dimension” is still D, but the stored length is nnz (number of non-zeros), typically nnz << D.

Sparse vectors become important for hybrid/lexical-style retrieval, which is not targeted in the initial scope, as it requires different algorithms (such as TF-IDF or BM25) than the initial KNN-style use case.
Hence this RFC separates the two into distinct types, VECTOR (dense) and SPARSE_VECTOR (sparse); for now we will focus on the dense VECTOR case.
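To make the two layouts concrete, here is a minimal sketch (hypothetical class and method names, not Hudi APIs) that converts a dense embedding into its sparse (index, value) form:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: converting a dense vector (a value for every dimension)
// into its sparse representation (only the non-zero (index, value) pairs).
public class SparseDense {

    // One (index, value) entry of a sparse vector.
    public record Entry(int index, float value) {}

    // Collect the non-zero positions of a dense vector; the stored length
    // is nnz, while the logical dimension stays dense.length (D).
    public static List<Entry> toSparse(float[] dense) {
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0f) {
                entries.add(new Entry(i, dense[i]));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        float[] dense = {0f, 0f, 0f, 0.44f, 0f, 1.2f}; // D = 6
        List<Entry> sparse = toSparse(dense);           // nnz = 2
        System.out.println("D=" + dense.length + ", nnz=" + sparse.size());
    }
}
```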

Vector Schema constraints

Logical level requirements:

  • All values within a VECTOR column must have the same dimension (i.e. the same number of elements per vector), as this is needed to compute cosine, L2, or dot-product distances correctly.
  • There must be no null elements within a vector at write time.
  • A VECTOR must have an "element type", which can be one of FLOAT, DOUBLE, or INT8.
  • We also want to keep a property such as storageBacking that tells writers how to serialize the vector to disk. As an initial approach we will start with a fixed-bytes representation, covered below.

See the following Avro schema model as a general example (here size = dimension × 4 bytes for FLOAT elements):

```json
{
  "type" : "fixed",
  "name" : "vector",
  "size" : 3072,
  "logicalType" : "vector",
  "dimension" : 768,
  "elementType" : "FLOAT",
  "storageBacking" : "FIXED_BYTES"
}
```
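A write-time validation of these logical constraints could look like the following sketch (VectorValidator and its signature are hypothetical, not existing Hudi code):

```java
import java.util.List;

// Hypothetical write-time check of the logical VECTOR constraints:
// every vector must match the declared dimension and contain no nulls.
public class VectorValidator {

    public static void validate(List<Float> vector, int expectedDimension) {
        if (vector == null || vector.size() != expectedDimension) {
            throw new IllegalArgumentException(
                "dimension mismatch: expected " + expectedDimension
                    + ", got " + (vector == null ? 0 : vector.size()));
        }
        for (Float element : vector) {
            if (element == null) {
                // Null elements would break distance computations downstream.
                throw new IllegalArgumentException("null element not allowed in VECTOR");
            }
        }
    }
}
```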

Physical level requirements:

For now we will support a fixed-size packed byte representation for storing vectors on disk, as this yields optimal performance (see the Parquet tuning section below for more details):

  • A FLOAT32 vector of dimension D is stored as exactly D * 4 bytes (IEEE-754 float32, little-endian).
  • This maps to Parquet FIXED_LEN_BYTE_ARRAY(D * 4) with VECTOR metadata.
  • For Lance, vectors are typically represented using Arrow's FixedSizeList.
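The fixed-size packing above can be sketched with just the JDK: a FLOAT32 vector of dimension D is serialized to exactly D * 4 little-endian bytes and read back losslessly.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the FIXED_BYTES physical layout: D floats <-> D * 4
// little-endian IEEE-754 bytes, suitable for FIXED_LEN_BYTE_ARRAY(D * 4).
public class VectorBytes {

    public static byte[] pack(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * 4)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.asFloatBuffer().put(vector);
        return buf.array();
    }

    public static float[] unpack(byte[] bytes) {
        float[] out = new float[bytes.length / 4];
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
                  .asFloatBuffer().get(out);
        return out;
    }
}
```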

Optimal Parquet tuning for vectors:

Vector data is typically high-cardinality and not dictionary-friendly, so we will disable dictionary encoding and column statistics for vector columns. Based on findings from the Parquet community, the PLAIN and BYTE_STREAM_SPLIT encodings work well for vectors, and disabling compression yields the best write/read performance.
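A writer-side configuration along these lines might look like the sketch below. This is illustrative only: it assumes parquet-java's AvroParquetWriter, the exact builder options vary by parquet-java version, and statistics toggles in particular are version-dependent (shown here only as a comment).

```java
// Sketch only (not Hudi code): tune Parquet writer settings for a vector column.
// Disabling per-column statistics for the vector field is version-dependent
// in parquet-java and is not shown here.
ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(new Path("/tmp/vectors.parquet")) // hypothetical path
    .withSchema(vectorSchema)                  // Avro schema containing the vector fixed field
    .withDictionaryEncoding(false)             // high-cardinality data; dictionaries don't help
    .withCompressionCodec(CompressionCodecName.UNCOMPRESSED) // fastest round trip; ZSTD when size matters
    .build();
```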

Benchmark experiment with vectors

  • The results below are from an experiment writing 10,000 vectors, where each vector has dimension 1,536 and element type FLOAT (4 bytes), i.e. around 6 KB per record.
  • We performed a full round trip, writing all vectors to a file and reading them back, using the Parquet and Lance Java file writers/readers.
  • For Parquet we tried several combinations of physical types, encodings, compression codecs, etc. for handling vectors.
  • For Lance we used vanilla settings, based on its claim of already handling vectors optimally.
  • We performed 5 warmup rounds and 10 measurement rounds and report the averages below.

Physical backings tested

  • Parquet LIST: Vectors stored as Parquet's LIST<FLOAT> type (variable-length array)
  • Parquet FIXED: Vectors stored as Parquet's FIXED_LEN_BYTE_ARRAY (fixed 6,144 bytes for 1,536 floats)
  • Lance: Vectors stored in Lance format using FixedSizeList<Float32>
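As a sanity check on the reported numbers, the raw payload of the experiment is 10,000 × 1,536 × 4 bytes ≈ 58.6 MB, which lines up with the file sizes in the tables, and dividing that payload by the measured write time reproduces the reported throughput:

```java
// Verify the benchmark arithmetic: raw payload size and derived throughput.
public class BenchMath {

    // Raw payload in MiB: records * dimension * bytes-per-element.
    public static double rawMiB(int records, int dimension, int bytesPerElement) {
        return (double) records * dimension * bytesPerElement / (1024 * 1024);
    }

    // Throughput in MiB/s from a payload size and an elapsed time in ms.
    public static double throughputMiBPerSec(double sizeMiB, long millis) {
        return sizeMiB / (millis / 1000.0);
    }

    public static void main(String[] args) {
        double raw = rawMiB(10_000, 1_536, 4); // ≈ 58.59 MiB of raw floats
        // Parquet FIXED (plain, UNCOMPRESSED) wrote in 111 ms:
        double writeSpeed = throughputMiBPerSec(raw, 111); // ≈ 527.9, matching the reported 527.87 MB/s
        System.out.printf("raw=%.2f MiB, write=%.1f MiB/s%n", raw, writeSpeed);
    }
}
```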

Summary of Results

Winner (most compact file size): Parquet LIST (byte-stream-split, ZSTD)

Currently Parquet LIST is only a couple of MB more compact than the Parquet FIXED runs.

Performance Winner (Write): Lance
Performance Winner (Read):  Parquet FIXED (byte-stream-split, UNCOMPRESSED)

*Note:* Parquet FIXED and Lance are close in write performance.

Detailed comparison table

COMPARISON SUMMARY

| Representation | File Size | Write Speed | Read Speed | Bytes/Rec | vs Raw | vs Base |
| --- | --- | --- | --- | --- | --- | --- |
| Parquet LIST (plain, UNCOMPRESSED) | 58.86 MB | 124.93 MB/s | 233.44 MB/s | 6,172 B | 1.00x | 1.00x |
| Parquet LIST (plain, SNAPPY) | 58.69 MB | 125.20 MB/s | 232.51 MB/s | 6,154 B | 1.00x | 1.00x |
| Parquet LIST (plain, ZSTD) | 54.35 MB | 117.66 MB/s | 206.32 MB/s | 5,698 B | 1.08x | 0.92x |
| Parquet LIST (byte-stream-split, UNCOMPRESSED) | 58.86 MB | 118.61 MB/s | 210.77 MB/s | 6,172 B | 1.00x | 1.00x |
| Parquet LIST (byte-stream-split, SNAPPY) | 53.60 MB | 111.18 MB/s | 200.66 MB/s | 5,620 B | 1.09x | 0.91x |
| Parquet LIST (byte-stream-split, ZSTD) | 50.27 MB | 101.90 MB/s | 194.02 MB/s | 5,270 B | 1.17x | 0.85x |
| Parquet FIXED (plain, UNCOMPRESSED) | 58.82 MB | 527.87 MB/s | 2253.61 MB/s | 6,167 B | 1.00x | 1.00x |
| Parquet FIXED (plain, SNAPPY) | 58.69 MB | 496.56 MB/s | 2092.63 MB/s | 6,154 B | 1.00x | 1.00x |
| Parquet FIXED (plain, ZSTD) | 54.35 MB | 430.84 MB/s | 760.96 MB/s | 5,699 B | 1.08x | 0.92x |
| Parquet FIXED (byte-stream-split, UNCOMPRESSED) | 58.82 MB | 480.28 MB/s | 2343.75 MB/s | 6,167 B | 1.00x | 1.00x |
| Parquet FIXED (byte-stream-split, SNAPPY) | 58.69 MB | 327.34 MB/s | 2020.47 MB/s | 6,154 B | 1.00x | 1.00x |
| Parquet FIXED (byte-stream-split, ZSTD) | 54.35 MB | 415.56 MB/s | 802.65 MB/s | 5,699 B | 1.08x | 0.92x |
| Lance | 58.85 MB | 665.84 MB/s | 1395.09 MB/s | 6,170 B | 1.00x | - |

Winner (most compact): Parquet LIST (byte-stream-split, ZSTD)

PERFORMANCE SUMMARY

| Representation | Write Time | Write Speed | Read Time | Read Speed |
| --- | --- | --- | --- | --- |
| Parquet LIST (plain, UNCOMPRESSED) | 469 ms | 124.93 MB/s | 251 ms | 233.44 MB/s |
| Parquet LIST (plain, SNAPPY) | 468 ms | 125.20 MB/s | 252 ms | 232.51 MB/s |
| Parquet LIST (plain, ZSTD) | 498 ms | 117.66 MB/s | 284 ms | 206.32 MB/s |
| Parquet LIST (byte-stream-split, UNCOMPRESSED) | 494 ms | 118.61 MB/s | 278 ms | 210.77 MB/s |
| Parquet LIST (byte-stream-split, SNAPPY) | 527 ms | 111.18 MB/s | 292 ms | 200.66 MB/s |
| Parquet LIST (byte-stream-split, ZSTD) | 575 ms | 101.90 MB/s | 302 ms | 194.02 MB/s |
| Parquet FIXED (plain, UNCOMPRESSED) | 111 ms | 527.87 MB/s | 26 ms | 2253.61 MB/s |
| Parquet FIXED (plain, SNAPPY) | 118 ms | 496.56 MB/s | 28 ms | 2092.63 MB/s |
| Parquet FIXED (plain, ZSTD) | 136 ms | 430.84 MB/s | 77 ms | 760.96 MB/s |
| Parquet FIXED (byte-stream-split, UNCOMPRESSED) | 122 ms | 480.28 MB/s | 25 ms | 2343.75 MB/s |
| Parquet FIXED (byte-stream-split, SNAPPY) | 179 ms | 327.34 MB/s | 29 ms | 2020.47 MB/s |
| Parquet FIXED (byte-stream-split, ZSTD) | 141 ms | 415.56 MB/s | 73 ms | 802.65 MB/s |
| Lance | 88 ms | 665.84 MB/s | 42 ms | 1395.09 MB/s |

Performance Winner (Write): Lance
Performance Winner (Read): Parquet FIXED (byte-stream-split, UNCOMPRESSED)

COMPRESSION CODEC ANALYSIS

Parquet LIST — Compression Comparison

| Compression | File Size | vs Raw | Write Time | Read Time | Write MB/s |
| --- | --- | --- | --- | --- | --- |
| plain, UNCOMPRESSED | 58.86 MB | 1.00x | 469 ms | 251 ms | 124.93 |
| plain, SNAPPY | 58.69 MB | 1.00x | 468 ms | 252 ms | 125.20 |
| plain, ZSTD | 54.35 MB | 1.08x | 498 ms | 284 ms | 117.66 |
| byte-stream-split, UNCOMPRESSED | 58.86 MB | 1.00x | 494 ms | 278 ms | 118.61 |
| byte-stream-split, SNAPPY | 53.60 MB | 1.09x | 527 ms | 292 ms | 111.18 |
| byte-stream-split, ZSTD | 50.27 MB | 1.17x | 575 ms | 302 ms | 101.90 |

Best compression ratio: byte-stream-split, ZSTD
Fastest write: plain, SNAPPY
Fastest read: plain, UNCOMPRESSED

Parquet FIXED — Compression Comparison

| Compression | File Size | vs Raw | Write Time | Read Time | Write MB/s |
| --- | --- | --- | --- | --- | --- |
| plain, UNCOMPRESSED | 58.82 MB | 1.00x | 111 ms | 26 ms | 527.87 |
| plain, SNAPPY | 58.69 MB | 1.00x | 118 ms | 28 ms | 496.56 |
| plain, ZSTD | 54.35 MB | 1.08x | 136 ms | 77 ms | 430.84 |
| byte-stream-split, UNCOMPRESSED | 58.82 MB | 1.00x | 122 ms | 25 ms | 480.28 |
| byte-stream-split, SNAPPY | 58.69 MB | 1.00x | 179 ms | 29 ms | 327.34 |
| byte-stream-split, ZSTD | 54.35 MB | 1.08x | 141 ms | 73 ms | 415.56 |

Best compression ratio: plain, ZSTD
Fastest write: plain, UNCOMPRESSED
Fastest read: byte-stream-split, UNCOMPRESSED

Lance — Default Compression

| Compression | File Size | vs Raw | Write Time | Read Time | Write MB/s |
| --- | --- | --- | --- | --- | --- |
| Default | 58.85 MB | 1.00x | 88 ms | 42 ms | 665.84 |

Note: Lance uses default compression settings (no variations tested)

ENCODING STRATEGY ANALYSIS

Parquet LIST — Encoding Strategy

| Encoding | Avg Size | Size Ratio | Avg Write | Avg Read |
| --- | --- | --- | --- | --- |
| plain | 57.30 MB | 1.02x | 478 ms | 262 ms |
| byte-stream-split | 54.24 MB | 1.08x | 532 ms | 290 ms |

plain — Breakdown by Compression

| Compression | File Size | vs Raw |
| --- | --- | --- |
| UNCOMPRESSED | 58.86 MB | 1.00x |
| SNAPPY | 58.69 MB | 1.00x |
| ZSTD | 54.35 MB | 1.08x |

byte-stream-split — Breakdown by Compression

| Compression | File Size | vs Raw |
| --- | --- | --- |
| UNCOMPRESSED | 58.86 MB | 1.00x |
| SNAPPY | 53.60 MB | 1.09x |
| ZSTD | 50.27 MB | 1.17x |

Parquet FIXED — Encoding Strategy

| Encoding | Avg Size | Size Ratio | Avg Write | Avg Read |
| --- | --- | --- | --- | --- |
| plain | 57.29 MB | 1.02x | 121 ms | 43 ms |
| byte-stream-split | 57.29 MB | 1.02x | 147 ms | 42 ms |

plain — Breakdown by Compression

| Compression | File Size | vs Raw |
| --- | --- | --- |
| UNCOMPRESSED | 58.82 MB | 1.00x |
| SNAPPY | 58.69 MB | 1.00x |
| ZSTD | 54.35 MB | 1.08x |

byte-stream-split — Breakdown by Compression

| Compression | File Size | vs Raw |
| --- | --- | --- |
| UNCOMPRESSED | 58.82 MB | 1.00x |
| SNAPPY | 58.69 MB | 1.00x |
| ZSTD | 54.35 MB | 1.08x |

Lance — Default Encoding

| Encoding | Avg Size | Size Ratio | Avg Write | Avg Read |
| --- | --- | --- | --- | --- |
| Default (Arrow IPC) | 58.85 MB | 1.00x | 88 ms | 42 ms |

Note: Lance uses Apache Arrow IPC encoding (no variations tested)

Vector definition in HoodieSchema:

Appendix: