rfc/rfc-99/appendix.md
The main implementation change would require replacing the Avro schema references with the new type system.

Supporting VECTOR type in Hudi

This section captures additional research and design notes for supporting a VECTOR logical type in Hudi. See appendix for more details on research sources.

Initial scope

The initial use case we are targeting for VECTOR within Hudi is enabling KNN-style vector search over blobs (large text, images, audio, video) alongside their generated vector embeddings. Vector search is popular for Retrieval-Augmented Generation (RAG) applications, which provide relevant context to an LLM in order to improve its accuracy when answering user queries. The vector embeddings generated by frontier models are usually arrays of floating-point values.

Dense vectors vs sparse vectors

Dense vector

  • Has a value for every dimension.
  • Stored as a full length-D sequence: v = [0.12, -0.03, 0.44, ...] (length = D)
  • Even if some entries are 0, you still store them.

Sparse vector

  • Most entries are 0 / absent, so you store only the non-zero positions.
  • Stored as pairs (index, value), sometimes also with a separate nnz count:
  • [(3, 0.44), (107, 1.2), (9012, -0.7)]
  • The “dimension” is still D, but the stored length is nnz (number of non-zeros), typically nnz << D.

Sparse vectors become important for hybrid/lexical-style retrieval, which is not targeted in the initial scope, as it requires different algorithms (such as TF-IDF or BM25) than the initial KNN-style use case.
Hence this RFC separates the two into distinct types, VECTOR (dense) and SPARSE_VECTOR (sparse); for now we will focus on the dense VECTOR case.
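To make the two layouts concrete, here is a minimal sketch (hypothetical class and method names, not Hudi APIs) that converts a dense embedding into its sparse (index, value) form:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: converting a dense vector (a value for every dimension)
// into its sparse representation (only the non-zero (index, value) pairs).
public class SparseDense {

    // One (index, value) entry of a sparse vector.
    public record Entry(int index, float value) {}

    // Collect the non-zero positions of a dense vector; the stored length
    // is nnz, while the logical dimension stays dense.length (D).
    public static List<Entry> toSparse(float[] dense) {
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0f) {
                entries.add(new Entry(i, dense[i]));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        float[] dense = {0f, 0f, 0f, 0.44f, 0f, 1.2f}; // D = 6
        List<Entry> sparse = toSparse(dense);           // nnz = 2
        System.out.println("D=" + dense.length + ", nnz=" + sparse.size());
    }
}
```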

Vector Schema constraints

Logical level requirements:

  • All values within a VECTOR column must have the same dimension (i.e. the same number of elements per vector), as this is needed to compute cosine, L2, or dot-product distances correctly.
  • There must be no null elements within a vector at write time.
  • A VECTOR must have an "element type", which can be one of FLOAT, DOUBLE, or INT8.
  • We also want to keep a property such as storageBacking that tells writers how to serialize the vector to disk. As an initial approach we will start with a fixed-bytes representation, covered below.

See the following Avro schema model as a general example (here size = dimension × 4 bytes for FLOAT elements):

```json
{
  "type" : "fixed",
  "name" : "vector",
  "size" : 3072,
  "logicalType" : "vector",
  "dimension" : 768,
  "elementType" : "FLOAT",
  "storageBacking" : "FIXED_BYTES"
}
```
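A write-time validation of these logical constraints could look like the following sketch (VectorValidator and its signature are hypothetical, not existing Hudi code):

```java
import java.util.List;

// Hypothetical write-time check of the logical VECTOR constraints:
// every vector must match the declared dimension and contain no nulls.
public class VectorValidator {

    public static void validate(List<Float> vector, int expectedDimension) {
        if (vector == null || vector.size() != expectedDimension) {
            throw new IllegalArgumentException(
                "dimension mismatch: expected " + expectedDimension
                    + ", got " + (vector == null ? 0 : vector.size()));
        }
        for (Float element : vector) {
            if (element == null) {
                // Null elements would break distance computations downstream.
                throw new IllegalArgumentException("null element not allowed in VECTOR");
            }
        }
    }
}
```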

Physical level requirements:

For now we will support a fixed-size packed byte representation for storing vectors on disk, as this yields optimal performance (see the Parquet tuning section below for more details):

  • A FLOAT32 vector of dimension D is stored as exactly D * 4 bytes (IEEE-754 float32, little-endian).
  • This maps to Parquet FIXED_LEN_BYTE_ARRAY(D * 4) with VECTOR metadata.
  • For Lance, vectors are typically represented using Arrow's FixedSizeList.
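The fixed-size packing above can be sketched with just the JDK: a FLOAT32 vector of dimension D is serialized to exactly D * 4 little-endian bytes and read back losslessly.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the FIXED_BYTES physical layout: D floats <-> D * 4
// little-endian IEEE-754 bytes, suitable for FIXED_LEN_BYTE_ARRAY(D * 4).
public class VectorBytes {

    public static byte[] pack(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * 4)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.asFloatBuffer().put(vector);
        return buf.array();
    }

    public static float[] unpack(byte[] bytes) {
        float[] out = new float[bytes.length / 4];
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
                  .asFloatBuffer().get(out);
        return out;
    }
}
```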

Optimal Parquet tuning for vectors:

Vector data is typically high-cardinality and not dictionary-friendly, so we will disable dictionary encoding and column statistics for vector columns. Based on findings from the Parquet community, the PLAIN and BYTE_STREAM_SPLIT encodings work well for vectors, and disabling compression yields the best write/read performance.
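A writer-side configuration along these lines might look like the sketch below. This is illustrative only: it assumes parquet-java's AvroParquetWriter, the exact builder options vary by parquet-java version, and statistics toggles in particular are version-dependent (shown here only as a comment).

```java
// Sketch only (not Hudi code): tune Parquet writer settings for a vector column.
// Disabling per-column statistics for the vector field is version-dependent
// in parquet-java and is not shown here.
ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(new Path("/tmp/vectors.parquet")) // hypothetical path
    .withSchema(vectorSchema)                  // Avro schema containing the vector fixed field
    .withDictionaryEncoding(false)             // high-cardinality data; dictionaries don't help
    .withCompressionCodec(CompressionCodecName.UNCOMPRESSED) // fastest round trip; ZSTD when size matters
    .build();
```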

Benchmark experiment with vectors

  • The results below are from an experiment writing 10,000 vectors, where each vector has dimension 1,536 and element type FLOAT (4 bytes), i.e. around 6 KB per record.
  • We performed a full round trip, writing all vectors to a file and reading them back, using the Parquet and Lance Java file writers/readers.
  • For Parquet we tried several combinations of physical types, encodings, compression codecs, etc. for handling vectors.
  • For Lance we used vanilla settings, based on its claim of already handling vectors optimally.
  • We performed 5 warmup rounds and 10 measurement rounds and report the averages below.

Physical backings tested

  • Parquet LIST: Vectors stored as Parquet's LIST<FLOAT> type (variable-length array)
  • Parquet FIXED: Vectors stored as Parquet's FIXED_LEN_BYTE_ARRAY (fixed 6,144 bytes for 1,536 floats)
  • Lance: Vectors stored in Lance format using FixedSizeList<Float32>
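As a sanity check on the reported numbers, the raw payload of the experiment is 10,000 × 1,536 × 4 bytes ≈ 58.6 MB, which lines up with the file sizes in the tables, and dividing that payload by the measured write time reproduces the reported throughput:

```java
// Verify the benchmark arithmetic: raw payload size and derived throughput.
public class BenchMath {

    // Raw payload in MiB: records * dimension * bytes-per-element.
    public static double rawMiB(int records, int dimension, int bytesPerElement) {
        return (double) records * dimension * bytesPerElement / (1024 * 1024);
    }

    // Throughput in MiB/s from a payload size and an elapsed time in ms.
    public static double throughputMiBPerSec(double sizeMiB, long millis) {
        return sizeMiB / (millis / 1000.0);
    }

    public static void main(String[] args) {
        double raw = rawMiB(10_000, 1_536, 4); // ≈ 58.59 MiB of raw floats
        // Parquet FIXED (plain, UNCOMPRESSED) wrote in 111 ms:
        double writeSpeed = throughputMiBPerSec(raw, 111); // ≈ 527.9, matching the reported 527.87 MB/s
        System.out.printf("raw=%.2f MiB, write=%.1f MiB/s%n", raw, writeSpeed);
    }
}
```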

Summary of Results

Winner (most compact file size): Parquet LIST (byte-stream-split, ZSTD)

Currently Parquet LIST is only a couple of MB more compact than the Parquet FIXED runs.

Performance Winner (Write): Lance
Performance Winner (Read):  Parquet FIXED (byte-stream-split, UNCOMPRESSED)

*Note:* Parquet FIXED and Lance are close in write performance.

Detailed comparison table

COMPARISON SUMMARY

| Representation | File Size | Write Speed | Read Speed | Bytes/Rec | vs Raw | vs Base |
| --- | --- | --- | --- | --- | --- | --- |
| Parquet LIST (plain, UNCOMPRESSED) | 58.86 MB | 124.93 MB/s | 233.44 MB/s | 6,172 B | 1.00x | 1.00x |
| Parquet LIST (plain, SNAPPY) | 58.69 MB | 125.20 MB/s | 232.51 MB/s | 6,154 B | 1.00x | 1.00x |
| Parquet LIST (plain, ZSTD) | 54.35 MB | 117.66 MB/s | 206.32 MB/s | 5,698 B | 1.08x | 0.92x |
| Parquet LIST (byte-stream-split, UNCOMPRESSED) | 58.86 MB | 118.61 MB/s | 210.77 MB/s | 6,172 B | 1.00x | 1.00x |
| Parquet LIST (byte-stream-split, SNAPPY) | 53.60 MB | 111.18 MB/s | 200.66 MB/s | 5,620 B | 1.09x | 0.91x |
| Parquet LIST (byte-stream-split, ZSTD) | 50.27 MB | 101.90 MB/s | 194.02 MB/s | 5,270 B | 1.17x | 0.85x |
| Parquet FIXED (plain, UNCOMPRESSED) | 58.82 MB | 527.87 MB/s | 2253.61 MB/s | 6,167 B | 1.00x | 1.00x |
| Parquet FIXED (plain, SNAPPY) | 58.69 MB | 496.56 MB/s | 2092.63 MB/s | 6,154 B | 1.00x | 1.00x |
| Parquet FIXED (plain, ZSTD) | 54.35 MB | 430.84 MB/s | 760.96 MB/s | 5,699 B | 1.08x | 0.92x |
| Parquet FIXED (byte-stream-split, UNCOMPRESSED) | 58.82 MB | 480.28 MB/s | 2343.75 MB/s | 6,167 B | 1.00x | 1.00x |
| Parquet FIXED (byte-stream-split, SNAPPY) | 58.69 MB | 327.34 MB/s | 2020.47 MB/s | 6,154 B | 1.00x | 1.00x |
| Parquet FIXED (byte-stream-split, ZSTD) | 54.35 MB | 415.56 MB/s | 802.65 MB/s | 5,699 B | 1.08x | 0.92x |
| Lance | 58.85 MB | 665.84 MB/s | 1395.09 MB/s | 6,170 B | 1.00x | - |

Winner (most compact): Parquet LIST (byte-stream-split, ZSTD)

PERFORMANCE SUMMARY

| Representation | Write Time | Write Speed | Read Time | Read Speed |
| --- | --- | --- | --- | --- |
| Parquet LIST (plain, UNCOMPRESSED) | 469 ms | 124.93 MB/s | 251 ms | 233.44 MB/s |
| Parquet LIST (plain, SNAPPY) | 468 ms | 125.20 MB/s | 252 ms | 232.51 MB/s |
| Parquet LIST (plain, ZSTD) | 498 ms | 117.66 MB/s | 284 ms | 206.32 MB/s |
| Parquet LIST (byte-stream-split, UNCOMPRESSED) | 494 ms | 118.61 MB/s | 278 ms | 210.77 MB/s |
| Parquet LIST (byte-stream-split, SNAPPY) | 527 ms | 111.18 MB/s | 292 ms | 200.66 MB/s |
| Parquet LIST (byte-stream-split, ZSTD) | 575 ms | 101.90 MB/s | 302 ms | 194.02 MB/s |
| Parquet FIXED (plain, UNCOMPRESSED) | 111 ms | 527.87 MB/s | 26 ms | 2253.61 MB/s |
| Parquet FIXED (plain, SNAPPY) | 118 ms | 496.56 MB/s | 28 ms | 2092.63 MB/s |
| Parquet FIXED (plain, ZSTD) | 136 ms | 430.84 MB/s | 77 ms | 760.96 MB/s |
| Parquet FIXED (byte-stream-split, UNCOMPRESSED) | 122 ms | 480.28 MB/s | 25 ms | 2343.75 MB/s |
| Parquet FIXED (byte-stream-split, SNAPPY) | 179 ms | 327.34 MB/s | 29 ms | 2020.47 MB/s |
| Parquet FIXED (byte-stream-split, ZSTD) | 141 ms | 415.56 MB/s | 73 ms | 802.65 MB/s |
| Lance | 88 ms | 665.84 MB/s | 42 ms | 1395.09 MB/s |

Performance Winner (Write): Lance
Performance Winner (Read): Parquet FIXED (byte-stream-split, UNCOMPRESSED)

COMPRESSION CODEC ANALYSIS

Parquet LIST — Compression Comparison

| Compression | File Size | vs Raw | Write Time | Read Time | Write MB/s |
| --- | --- | --- | --- | --- | --- |
| plain, UNCOMPRESSED | 58.86 MB | 1.00x | 469 ms | 251 ms | 124.93 |
| plain, SNAPPY | 58.69 MB | 1.00x | 468 ms | 252 ms | 125.20 |
| plain, ZSTD | 54.35 MB | 1.08x | 498 ms | 284 ms | 117.66 |
| byte-stream-split, UNCOMPRESSED | 58.86 MB | 1.00x | 494 ms | 278 ms | 118.61 |
| byte-stream-split, SNAPPY | 53.60 MB | 1.09x | 527 ms | 292 ms | 111.18 |
| byte-stream-split, ZSTD | 50.27 MB | 1.17x | 575 ms | 302 ms | 101.90 |

Best compression ratio: byte-stream-split, ZSTD
Fastest write: plain, SNAPPY
Fastest read: plain, UNCOMPRESSED

Parquet FIXED — Compression Comparison

| Compression | File Size | vs Raw | Write Time | Read Time | Write MB/s |
| --- | --- | --- | --- | --- | --- |
| plain, UNCOMPRESSED | 58.82 MB | 1.00x | 111 ms | 26 ms | 527.87 |
| plain, SNAPPY | 58.69 MB | 1.00x | 118 ms | 28 ms | 496.56 |
| plain, ZSTD | 54.35 MB | 1.08x | 136 ms | 77 ms | 430.84 |
| byte-stream-split, UNCOMPRESSED | 58.82 MB | 1.00x | 122 ms | 25 ms | 480.28 |
| byte-stream-split, SNAPPY | 58.69 MB | 1.00x | 179 ms | 29 ms | 327.34 |
| byte-stream-split, ZSTD | 54.35 MB | 1.08x | 141 ms | 73 ms | 415.56 |

Best compression ratio: plain, ZSTD
Fastest write: plain, UNCOMPRESSED
Fastest read: byte-stream-split, UNCOMPRESSED

Lance — Default Compression

| Compression | File Size | vs Raw | Write Time | Read Time | Write MB/s |
| --- | --- | --- | --- | --- | --- |
| Default | 58.85 MB | 1.00x | 88 ms | 42 ms | 665.84 |

Note: Lance uses default compression settings (no variations tested)

ENCODING STRATEGY ANALYSIS

Parquet LIST — Encoding Strategy

| Encoding | Avg Size | Size Ratio | Avg Write | Avg Read |
| --- | --- | --- | --- | --- |
| plain | 57.30 MB | 1.02x | 478 ms | 262 ms |
| byte-stream-split | 54.24 MB | 1.08x | 532 ms | 290 ms |

plain — Breakdown by Compression

| Compression | File Size | vs Raw |
| --- | --- | --- |
| UNCOMPRESSED | 58.86 MB | 1.00x |
| SNAPPY | 58.69 MB | 1.00x |
| ZSTD | 54.35 MB | 1.08x |

byte-stream-split — Breakdown by Compression

| Compression | File Size | vs Raw |
| --- | --- | --- |
| UNCOMPRESSED | 58.86 MB | 1.00x |
| SNAPPY | 53.60 MB | 1.09x |
| ZSTD | 50.27 MB | 1.17x |

Parquet FIXED — Encoding Strategy

| Encoding | Avg Size | Size Ratio | Avg Write | Avg Read |
| --- | --- | --- | --- | --- |
| plain | 57.29 MB | 1.02x | 121 ms | 43 ms |
| byte-stream-split | 57.29 MB | 1.02x | 147 ms | 42 ms |

plain — Breakdown by Compression

| Compression | File Size | vs Raw |
| --- | --- | --- |
| UNCOMPRESSED | 58.82 MB | 1.00x |
| SNAPPY | 58.69 MB | 1.00x |
| ZSTD | 54.35 MB | 1.08x |

byte-stream-split — Breakdown by Compression

| Compression | File Size | vs Raw |
| --- | --- | --- |
| UNCOMPRESSED | 58.82 MB | 1.00x |
| SNAPPY | 58.69 MB | 1.00x |
| ZSTD | 54.35 MB | 1.08x |

Lance — Default Encoding

| Encoding | Avg Size | Size Ratio | Avg Write | Avg Read |
| --- | --- | --- | --- | --- |
| Default (Arrow IPC) | 58.85 MB | 1.00x | 88 ms | 42 ms |

Note: Lance uses Apache Arrow IPC encoding (no variations tested)

Vector definition in HoodieSchema:

Appendix: