MV2 File Format Specification

Version 2.1

Overview

MV2 is a single-file format for AI memory storage. Everything lives in one file: header, write-ahead log, data segments, search indices, and metadata. No sidecar files.

┌─────────────────────────────────────────────────────────────┐
│                        .mv2 FILE                            │
├─────────────────────────────────────────────────────────────┤
│ Header                 │ 4 KB                               │
├─────────────────────────────────────────────────────────────┤
│ Embedded WAL           │ 1-64 MB (capacity-dependent)       │
├─────────────────────────────────────────────────────────────┤
│ Data Segments          │ Variable                           │
│   - Frame payloads                                          │
│   - Compressed content                                      │
├─────────────────────────────────────────────────────────────┤
│ Lex Index Segment      │ Tantivy index (optional)           │
├─────────────────────────────────────────────────────────────┤
│ Vec Index Segment      │ HNSW vectors (optional)            │
├─────────────────────────────────────────────────────────────┤
│ Time Index Segment     │ Chronological ordering             │
├─────────────────────────────────────────────────────────────┤
│ TOC (Footer)           │ Segment catalog + checksums        │
└─────────────────────────────────────────────────────────────┘

Header (4096 bytes)

The header occupies the first 4 KB of the file.

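| Offset | Size | Field              | Description                          |
|--------|------|--------------------|--------------------------------------|
| 0      | 4    | magic              | MV2\0 (0x4D 0x56 0x32 0x00)          |
| 4      | 2    | version            | Format version (little-endian)       |
| 6      | 1    | spec_major         | Spec major version (2)               |
| 7      | 1    | spec_minor         | Spec minor version (1)               |
| 8      | 8    | footer_offset      | Byte offset to TOC                   |
| 16     | 8    | wal_offset         | Byte offset to WAL (always 4096)     |
| 24     | 8    | wal_size           | WAL region size in bytes             |
| 32     | 8    | wal_checkpoint_pos | Last checkpointed sequence           |
| 40     | 8    | wal_sequence       | Current WAL sequence number          |
| 48     | 32   | toc_checksum       | SHA-256 of TOC segment               |
| 80     | 4016 | reserved           | Zero-filled, reserved for future use |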

All multi-byte integers are little-endian.
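
As an illustration of the layout above, the following minimal sketch reads the first 80 bytes of the header with Python's `struct` module. The names `Mv2Header` and `parse_header` are hypothetical and not part of any Memvid API; the sketch assumes the fields are packed with no padding, as the offsets in the table imply.

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 4096
MAGIC = b"MV2\x00"

# Little-endian, no padding: magic, version, spec_major, spec_minor,
# five u64 fields (footer_offset .. wal_sequence), then the 32-byte TOC checksum.
HEADER_STRUCT = struct.Struct("<4sHBB5Q32s")  # 80 bytes; the rest is reserved


class Mv2Header(NamedTuple):  # hypothetical name, for illustration only
    magic: bytes
    version: int
    spec_major: int
    spec_minor: int
    footer_offset: int
    wal_offset: int
    wal_size: int
    wal_checkpoint_pos: int
    wal_sequence: int
    toc_checksum: bytes


def parse_header(buf: bytes) -> Mv2Header:
    """Decode the fixed header fields and check the two documented constants."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("header must be 4096 bytes")
    header = Mv2Header(*HEADER_STRUCT.unpack_from(buf, 0))
    if header.magic != MAGIC:
        raise ValueError("bad magic: not an MV2 file")
    if header.wal_offset != HEADER_SIZE:
        raise ValueError("wal_offset must be 4096")
    return header
```

A caller would typically read the first 4096 bytes of the file and pass them to `parse_header` before touching any other region.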

Write-Ahead Log (WAL)

The embedded WAL provides crash recovery. It starts at byte 4096 and has a capacity determined by the file's target size:

| File Capacity | WAL Size |
|---------------|----------|
| < 100 MB      | 1 MB     |
| < 1 GB        | 4 MB     |
| < 10 GB       | 16 MB    |
| >= 10 GB      | 64 MB    |
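
The tiers above can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and it assumes binary units (1 MB = 1,048,576 bytes), which the specification does not state.

```python
MB = 1024 * 1024  # assumption: binary megabytes; the spec does not say
GB = 1024 * MB


def wal_size_for_capacity(capacity_bytes: int) -> int:
    """Return the WAL region size in bytes for a given target file capacity."""
    if capacity_bytes < 100 * MB:
        return 1 * MB
    if capacity_bytes < 1 * GB:
        return 4 * MB
    if capacity_bytes < 10 * GB:
        return 16 * MB
    return 64 * MB
```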

WAL Entry Format

┌──────────────────────────────────────┐
│ sequence    │ 8 bytes (u64 LE)       │
│ entry_type  │ 1 byte                 │
│ payload_len │ 4 bytes (u32 LE)       │
│ payload     │ variable               │
│ checksum    │ 4 bytes (CRC32)        │
└──────────────────────────────────────┘

Entry types:

  • 0x01 - Frame append
  • 0x02 - Frame update
  • 0x03 - Frame delete (tombstone)
  • 0x04 - Index update
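
A minimal sketch of encoding and decoding one WAL entry in the layout above. Two points are assumptions rather than facts from this specification: that the CRC32 is the standard one provided by `zlib.crc32`, and that it covers every byte of the entry that precedes it. The function names are hypothetical.

```python
import struct
import zlib

FRAME_APPEND, FRAME_UPDATE, FRAME_DELETE, INDEX_UPDATE = 0x01, 0x02, 0x03, 0x04

ENTRY_HEAD = struct.Struct("<QBI")  # sequence, entry_type, payload_len (13 bytes)
CRC = struct.Struct("<I")


def encode_entry(sequence: int, entry_type: int, payload: bytes) -> bytes:
    """Serialize one entry: head, payload, then CRC32 over head + payload."""
    body = ENTRY_HEAD.pack(sequence, entry_type, len(payload)) + payload
    return body + CRC.pack(zlib.crc32(body))  # assumed checksum coverage


def decode_entry(buf: bytes, offset: int = 0):
    """Return (sequence, entry_type, payload, next_offset) or raise ValueError."""
    if offset + ENTRY_HEAD.size > len(buf):
        raise ValueError("truncated entry header")
    sequence, entry_type, payload_len = ENTRY_HEAD.unpack_from(buf, offset)
    end = offset + ENTRY_HEAD.size + payload_len
    if end + CRC.size > len(buf):
        raise ValueError("truncated entry payload")
    (stored,) = CRC.unpack_from(buf, end)
    if zlib.crc32(buf[offset:end]) != stored:
        raise ValueError("checksum mismatch")
    payload = bytes(buf[offset + ENTRY_HEAD.size:end])
    return sequence, entry_type, payload, end + CRC.size
```

Returning `next_offset` lets a reader walk consecutive entries in the WAL region until a decode fails or the region ends.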

Checkpoint Behavior

  • Checkpoint triggers at 75% WAL occupancy or every 1,000 transactions
  • Checkpoint flushes WAL entries to data segments
  • seal() forces immediate checkpoint
  • Recovery replays entries with sequence > wal_checkpoint_pos
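
The trigger and replay rules above might be sketched as follows. The function names are hypothetical, and the sketch assumes "at 75%" means "at or above 75%" and that the transaction counter resets at each checkpoint.

```python
CHECKPOINT_TXN_LIMIT = 1000


def should_checkpoint(wal_used: int, wal_size: int, txns_since_checkpoint: int) -> bool:
    """True once the WAL is at least 75% full or 1,000 transactions have accumulated."""
    # Integer comparison avoids floating-point error: used / size >= 3 / 4.
    return wal_used * 4 >= wal_size * 3 or txns_since_checkpoint >= CHECKPOINT_TXN_LIMIT


def entries_to_replay(entries, wal_checkpoint_pos: int):
    """Keep only entries whose sequence is strictly greater than the checkpoint position.

    Each entry is assumed to be a tuple whose first element is the sequence number.
    """
    return [entry for entry in entries if entry[0] > wal_checkpoint_pos]
```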

Frame Structure

Each frame represents a single piece of content.

| Field            | Type                | Description                           |
|------------------|---------------------|---------------------------------------|
| frame_id         | u64                 | Unique identifier (monotonic)         |
| uri              | String              | Hierarchical path (mv2://path/to/doc) |
| title            | String?             | Optional display title                |
| created_at       | u64                 | Unix timestamp (seconds)              |
| encoding         | u8                  | Content encoding (see below)          |
| payload          | bytes               | Compressed content                    |
| payload_checksum | [u8; 32]            | SHA-256 of uncompressed payload       |
| tags             | Map<String, String> | User-defined key-value pairs          |
| status           | u8                  | 0=active, 1=tombstoned                |

Encoding Types

| Value | Name | Description           |
|-------|------|-----------------------|
| 0     | Raw  | Uncompressed bytes    |
| 1     | Zstd | Zstandard compression |
| 2     | Lz4  | LZ4 compression       |
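
To make the relationship between `encoding`, `payload`, and `payload_checksum` concrete, here is a sketch of an in-memory frame that decodes its payload and verifies the SHA-256 of the uncompressed bytes. The class and the `DECODERS` registry are hypothetical. Only the Raw encoding is wired up because Zstandard and LZ4 need third-party packages; a caller could register decoders for values 1 and 2. The on-disk serialization of a frame is not described here and is not modeled.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

ENCODING_RAW, ENCODING_ZSTD, ENCODING_LZ4 = 0, 1, 2
STATUS_ACTIVE, STATUS_TOMBSTONED = 0, 1

# Hypothetical registry mapping encoding value -> decompression function.
DECODERS: Dict[int, Callable[[bytes], bytes]] = {ENCODING_RAW: lambda data: data}


@dataclass
class Frame:
    frame_id: int
    uri: str
    created_at: int           # Unix timestamp in seconds
    encoding: int
    payload: bytes            # stored (possibly compressed) bytes
    payload_checksum: bytes   # SHA-256 of the uncompressed payload
    title: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)
    status: int = STATUS_ACTIVE

    def content(self) -> bytes:
        """Decode the payload and verify it against payload_checksum."""
        decoder = DECODERS.get(self.encoding)
        if decoder is None:
            raise ValueError(f"no decoder registered for encoding {self.encoding}")
        data = decoder(self.payload)
        if hashlib.sha256(data).digest() != self.payload_checksum:
            raise ValueError("payload checksum mismatch")
        return data
```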

Data Segments

Frames are grouped into segments for efficient storage and retrieval.

Segment Header

┌──────────────────────────────────────┐
│ magic         │ 4 bytes              │
│ version       │ 2 bytes              │
│ segment_type  │ 1 byte               │
│ frame_count   │ 4 bytes              │
│ compressed    │ 1 byte (bool)        │
│ checksum      │ 32 bytes             │
└──────────────────────────────────────┘

Segment types:

  • 0x01 - Data segment (frames)
  • 0x02 - Lex index segment
  • 0x03 - Vec index segment
  • 0x04 - Time index segment
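
A sketch of reading the 44-byte segment header shown above. It assumes the fields are packed in the listed order with no padding, that integers are little-endian as stated for the format as a whole, and that the `compressed` byte is 0 or 1. The names are hypothetical.

```python
import struct
from typing import NamedTuple

# magic(4) version(2) segment_type(1) frame_count(4) compressed(1) checksum(32)
SEGMENT_HEADER = struct.Struct("<4sHBIB32s")

SEGMENT_TYPES = {0x01: "data", 0x02: "lex", 0x03: "vec", 0x04: "time"}


class SegmentHeader(NamedTuple):  # hypothetical name, for illustration only
    magic: bytes
    version: int
    segment_type: int
    frame_count: int
    compressed: bool
    checksum: bytes


def parse_segment_header(buf: bytes, offset: int = 0) -> SegmentHeader:
    """Decode a segment header at the given offset and reject unknown types."""
    magic, version, seg_type, frame_count, compressed, checksum = (
        SEGMENT_HEADER.unpack_from(buf, offset)
    )
    if seg_type not in SEGMENT_TYPES:
        raise ValueError(f"unknown segment type 0x{seg_type:02x}")
    return SegmentHeader(magic, version, seg_type, frame_count, bool(compressed), checksum)
```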

Time Index

The time index enables chronological queries and time-travel.

Time Index Entry

| Field     | Size | Description                 |
|-----------|------|-----------------------------|
| frame_id  | 8    | Frame identifier            |
| timestamp | 8    | Unix timestamp              |
| offset    | 8    | Byte offset in data segment |

Magic: MVTI (0x4D 0x56 0x54 0x49)
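
As a sketch of how fixed-size entries support chronological queries, the code below unpacks 24-byte entries and answers a time-range query by binary search. It assumes all three fields are unsigned 64-bit little-endian integers and that entries are stored sorted by timestamp; neither is stated explicitly above. The function names are hypothetical.

```python
import struct
from bisect import bisect_left, bisect_right

TIME_ENTRY = struct.Struct("<QQQ")  # frame_id, timestamp, offset (24 bytes)


def read_entries(buf: bytes):
    """Unpack a run of time index entries; buf length must be a multiple of 24."""
    return list(TIME_ENTRY.iter_unpack(buf))


def frames_between(entries, start_ts: int, end_ts: int):
    """Return frame_ids with start_ts <= timestamp <= end_ts.

    Assumes entries are sorted by timestamp in ascending order.
    """
    timestamps = [timestamp for _, timestamp, _ in entries]
    lo = bisect_left(timestamps, start_ts)
    hi = bisect_right(timestamps, end_ts)
    return [frame_id for frame_id, _, _ in entries[lo:hi]]
```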

Lex Index (Full-Text Search)

When the lex feature is enabled, the file contains a Tantivy index segment.

Indexed fields:

  • body - Full text content
  • title - Document title
  • uri - Document URI
  • tags - Flattened tag values

Supports:

  • BM25 ranking
  • Phrase queries
  • Boolean operators
  • Date range filters

Vec Index (Vector Search)

When the vec feature is enabled, the file contains an HNSW index segment.

| Parameter       | Value             |
|-----------------|-------------------|
| Dimensions      | 384 (BGE-small)   |
| Distance        | Cosine similarity |
| M               | 16                |
| ef_construction | 200               |
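
For reference, the distance measure named in the table can be written out directly. This sketch computes cosine similarity between two 384-dimensional vectors in plain Python; it illustrates the metric only and says nothing about how the HNSW graph itself is built or searched. The function name is hypothetical.

```python
import math
from typing import Sequence

DIMENSIONS = 384


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 1.0 means same direction."""
    if len(a) != DIMENSIONS or len(b) != DIMENSIONS:
        raise ValueError("vectors must have 384 dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```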

Table of Contents (TOC)

The TOC is the final segment, pointed to by footer_offset in the header.

┌──────────────────────────────────────┐
│ magic         │ "MVTC"               │
│ version       │ 2 bytes              │
│ segment_count │ 4 bytes              │
│ segments[]    │ SegmentDescriptor[]  │
│ manifests     │ IndexManifests       │
│ checksum      │ 32 bytes             │
└──────────────────────────────────────┘

Segment Descriptor

| Field        | Size | Description           |
|--------------|------|-----------------------|
| segment_type | 1    | Type identifier       |
| offset       | 8    | Byte offset in file   |
| length       | 8    | Segment size in bytes |
| checksum     | 32   | SHA-256 of segment    |
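
A sketch of decoding segment descriptors and checking a segment against its recorded checksum. It assumes each descriptor is 49 packed bytes in the order listed, that descriptors are stored back to back, and that the SHA-256 covers exactly the `length` bytes starting at `offset`. The names are hypothetical.

```python
import hashlib
import struct
from typing import List, NamedTuple

DESCRIPTOR = struct.Struct("<BQQ32s")  # segment_type, offset, length, checksum


class SegmentDescriptor(NamedTuple):  # hypothetical name, for illustration only
    segment_type: int
    offset: int
    length: int
    checksum: bytes


def parse_descriptors(buf: bytes, count: int, offset: int = 0) -> List[SegmentDescriptor]:
    """Decode `count` consecutive descriptors starting at `offset`."""
    return [
        SegmentDescriptor(*DESCRIPTOR.unpack_from(buf, offset + i * DESCRIPTOR.size))
        for i in range(count)
    ]


def verify_segment(file_bytes: bytes, desc: SegmentDescriptor) -> bool:
    """True if the described byte range exists and hashes to the stored checksum."""
    end = desc.offset + desc.length
    if end > len(file_bytes):
        return False
    return hashlib.sha256(file_bytes[desc.offset:end]).digest() == desc.checksum
```

For large files, hashing a memory-mapped view or streaming the range in chunks would avoid loading the whole file into memory.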

URI Scheme

All content is addressable via mv2:// URIs:

mv2://[track/][path/]name

Examples:

  • mv2://meetings/2024-01-15
  • mv2://docs/api/reference.md
  • mv2://media/photo.png
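
A sketch of splitting such a URI into its leading segments and final name. Because the track and path components are both optional, a URI alone does not say which leading segment is a track, so this hypothetical helper returns the leading segments as a list and leaves that interpretation to the caller.

```python
from typing import List, Tuple

SCHEME = "mv2://"


def split_uri(uri: str) -> Tuple[List[str], str]:
    """Return (leading_segments, name) for an mv2:// URI, or raise ValueError."""
    if not uri.startswith(SCHEME):
        raise ValueError("URI must start with mv2://")
    parts = uri[len(SCHEME):].split("/")
    if any(part == "" for part in parts):
        raise ValueError("URI must not contain empty segments")
    return parts[:-1], parts[-1]
```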

Invariants

  1. Single-file guarantee: No .wal, .shm, .lock, or other sidecar files
  2. Append-only frames: Existing frames are never modified in place
  3. Determinism: The same sequence of API calls produces byte-identical files
  4. Crash safety: WAL ensures durability across unexpected termination
  5. Self-describing: TOC contains all metadata needed to parse the file

Version History

| Version | Changes                                               |
|---------|-------------------------------------------------------|
| 2.1     | Current version. Embedded WAL, temporal track support |
| 2.0     | Single-file format, removed external indices          |
| 1.x     | Legacy format (deprecated)                            |