When a Vespa document is stored or transferred from one application to another, it is serialized. The serialization format is designed for robustness and speed. The most important fields are kept in a header that is accessible at low cost; the other fields are located by table look-ups.
The purpose of the serialized format is
All fields are in network byte order.
This is the oldest version that we currently support. No known installation stores documents with a version smaller than this.
This is the description of the serialized document format.
Document serialization format
| Field | Type | Length (bytes) | Description |
|---|---|---|---|
| Version | Short integer | 2 | Version number. Current is 6. |
| Length | Integer | 4 | Total length of object (excluding this field and version). |
| Document ID | Bytes | | Unique ID for document. 0-terminated string, UTF-8 encoding. |
| Field Map | Bytes | See below | Placeholder for fields. (Note: Fieldmaps may contain other fieldmaps.) |
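As an illustration of the layout above, the following Python sketch reads the fixed part of a serialized document: the version, the length, and the 0-terminated document ID. The function name `parse_document_header`, the sample document ID, the empty field map in the sample, and the choice to read both integers as unsigned are assumptions made for this example; none of them is taken from Vespa's own code.

```python
import struct


def parse_document_header(buf: bytes):
    """Read version, length and document ID from the start of buf.

    Hypothetical helper that follows the table above, not Vespa's code.
    """
    # Version (2 bytes) and length (4 bytes), both in network byte order.
    # Reading them as unsigned values is an assumption of this sketch.
    version, length = struct.unpack_from(">HI", buf, 0)
    pos = 6
    # The document ID is a 0-terminated UTF-8 string.
    end = buf.index(0, pos)
    doc_id = buf[pos:end].decode("utf-8")
    # The field map starts right after the terminating 0 byte.
    return version, length, doc_id, end + 1


# Hand-built sample: a hypothetical ID followed by a field map that
# consists only of an inventory byte with no bits set (an assumption).
payload = b"id:ns:music::1\x00" + b"\x00"
sample = struct.pack(">HI", 6, len(payload)) + payload
print(parse_document_header(sample))
```

The returned offset is where a field map parser would continue.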
Field maps are serialized as follows:
Fieldmap serialization format
| Field | Type | Length (bytes) | Description |
|---|---|---|---|
| Inventory bit mask | Byte | 1 | Inventory bits describing the FieldMap element with data:
Bit 0 set: FieldMap has document type
Bit 1 set: FieldMap has header fields
Bit 2 set: FieldMap has body fields
Bit 3 set: FieldMap has external body fields
|
| Below section is present when bit 0 of inventory is set |
| Document Type | Bytes | | Document type. (0-terminated string, UTF-8 encoding.) |
| Version | Short integer | 2 | Document type version number. |
| Below section is present when bit 1 of inventory is set |
| Header data | Data array | See below | Header data packed in data array |
| Below section is present when bit 2 of inventory is set |
| Body data | Data array | See below | Body data packed in data array |
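The inventory byte and the optional document type section can be sketched in Python as below. This is a minimal sketch under stated assumptions: the `Inventory` class, the `parse_fieldmap_prefix` function, and the sample document type `music` are hypothetical names introduced here, and the document type version is read as an unsigned short.

```python
import struct
from enum import IntFlag


class Inventory(IntFlag):
    """Bits of the inventory byte, as listed in the table above."""
    DOCUMENT_TYPE = 1 << 0
    HEADER_FIELDS = 1 << 1
    BODY_FIELDS = 1 << 2
    EXTERNAL_BODY_FIELDS = 1 << 3


def parse_fieldmap_prefix(buf: bytes, pos: int = 0):
    """Read the inventory byte and, if bit 0 is set, the document type section."""
    inventory = Inventory(buf[pos])
    pos += 1
    doc_type = type_version = None
    if Inventory.DOCUMENT_TYPE in inventory:
        end = buf.index(0, pos)  # 0-terminated UTF-8 string
        doc_type = buf[pos:end].decode("utf-8")
        pos = end + 1
        # Document type version, 2 bytes, network byte order.
        (type_version,) = struct.unpack_from(">H", buf, pos)
        pos += 2
    # pos now points at the header data array when bit 1 is set.
    return inventory, doc_type, type_version, pos


# Sample with bits 0 and 1 set; the header data array itself is omitted.
sample = bytes([0b0011]) + b"music\x00" + struct.pack(">H", 0)
inventory, doc_type, type_version, pos = parse_fieldmap_prefix(sample)
print(inventory, doc_type, type_version, pos)
```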
Data array serialization
| Field | Type | Length (bytes) | Description |
|---|---|---|---|
| Data length | Integer_2_4_8 | 2, 4 or 8 | Length of data block (see below). Note that this length includes itself. |
| Number of fields | Integer_1_4 | 1 or 4 | Number of fields in data array. |
| Below block is repeated "Number of fields" times |
| Field ID | Integer_1_4 | 1 or 4 | ID of field. |
| Field Size | Integer_1_2_4 | 1, 2 or 4 | Length of field. |
| End of repeated block |
| Data block | Bytes | | The data block. |
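To make the data array layout concrete, here is a hedged Python sketch that walks the index and then slices the data block. Two things are assumptions of this sketch and are not stated in the table: that field values are stored back to back in the data block in the same order as the index entries, and that the hand-built sample sets the data length to the size of the whole array. The names `read_packed_int` and `parse_data_array` are hypothetical.

```python
def read_packed_int(buf: bytes, pos: int, widths: tuple):
    """Decode Integer_1_4, Integer_1_2_4 or Integer_2_4_8.

    widths is (1, 4), (1, 2, 4) or (2, 4, 8). Returns (value, new_pos).
    """
    first = buf[pos]
    if first & 0x80 == 0:
        width, flag_bits = widths[0], 0   # short form, nothing to mask
    elif len(widths) == 2:
        width, flag_bits = widths[1], 1   # Integer_1_4 long form
    elif first & 0x40 == 0:
        width, flag_bits = widths[1], 2   # middle form
    else:
        width, flag_bits = widths[2], 2   # longest form
    raw = int.from_bytes(buf[pos:pos + width], "big")
    return raw & ((1 << (8 * width - flag_bits)) - 1), pos + width


def parse_data_array(buf: bytes, pos: int = 0):
    """Return (data_length, {field_id: raw_bytes}, new_pos)."""
    data_length, pos = read_packed_int(buf, pos, (2, 4, 8))
    count, pos = read_packed_int(buf, pos, (1, 4))
    index = []
    for _ in range(count):
        field_id, pos = read_packed_int(buf, pos, (1, 4))
        size, pos = read_packed_int(buf, pos, (1, 2, 4))
        index.append((field_id, size))
    fields = {}
    # Assumption: values follow each other in index order.
    for field_id, size in index:
        fields[field_id] = buf[pos:pos + size]
        pos += size
    return data_length, fields, pos


# Two fields: ID 1 (4 bytes) and ID 2 (2 bytes). The length value 13 is
# the size of this whole sample; the parser does not rely on it.
sample = (13).to_bytes(2, "big") + bytes([2, 1, 4, 2, 2]) + b"\x00\x00\x00\x2a" + b"ab"
data_length, fields, end = parse_data_array(sample)
print(data_length, fields, end)
```

The parser returns the raw bytes of each field; interpreting them requires the field's data type from the document type definition.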
Data type serialization
| Data type | Length | Serialization |
|---|---|---|
| Integer (ID 0) | 4 | Signed integer, two's complement notation, network byte order. |
| Floating point number (ID 1) | 4 | IEEE 754, single precision, network byte order. |
| String (ID 2) | 1 + (1 or 4) + length + 1 | Strings are serialized in this format:
First byte represents coding. This has traditionally denoted the maximum number of bits per character in the UTF-8 encoded string, but has never been used in deserialization code.
Set to 32 if not used.
Set to <32 if you know the UTF-8 string uses fewer bits per character; e.g. ASCII could use 8.
Set bit 6 (decimal 64) if the string has an annotation tree.
Integer_1_4 with length of string.
The string (UTF-8 encoding), including 0-terminating byte.
An annotation tree, if bit 6 (decimal 64) of coding byte is set:
Total length of all span trees, excluding itself: uint32
Number of span trees: Integer_1_2_4
for each root node:
|
| Raw bytes (ID 3) | Length of buffer | Byte for byte copy. |
| Long integer (ID 4) | 8 | Signed integer, two's complement notation, network byte order. |
| Double floating point number (ID 5) | 8 | IEEE 754, double precision, network byte order. |
| Array (ID 6) | At least 8 bytes | Arrays of any fields are serialized like this:
Annotation tree serialization
| Data type | Length | Serialization |
|---|---|---|
| SpanNode (base class) | 1 + (1, 2 or 4) + Annotation serialization + subclass payload | |
| Annotation | 4 + (1, 2 or 4) + (possibly 4 + FieldValue serialization) | |
| Span | SpanNode serialization + (1, 2 or 4) + (1, 2 or 4) | |
| SpanList | SpanNode serialization + (1, 2 or 4) + n times SpanNode serialization | |
| AlternateSpanList | SpanNode serialization + (1, 2 or 4) + n times (8 + SpanList serialization) | |
| AnnotationRef | 1, 2 or 4 | AnnotationRef serialization |
Data types used in serialized format
| Data type | Serialization |
|---|---|
| Integer_1_4 | If bit 7 of first byte is unset, coded using 1 byte.
If bit 7 of first byte is set, coded using 4 bytes (bit 7 of first byte must be masked away).
Range: 0 - 2**31-1. |
| Integer_1_2_4 | If bit 7 of first byte is unset, coded using 1 byte.
If bit 7 of first byte is set and bit 6 of first byte is unset, coded using 2 bytes (bit 7 and 6 of first byte must be masked away).
If bit 7 and 6 of first byte are set, coded using 4 bytes (bit 7 and 6 of first byte must be masked away).
Range: 0 - 2**30-1. |
| Integer_2_4_8 | If bit 7 of first byte is unset, coded using 2 bytes.
If bit 7 of first byte is set and bit 6 of first byte is unset, coded using 4 bytes (bit 7 and 6 of first byte must be masked away).
If bit 7 and 6 of first byte are set, coded using 8 bytes (bit 7 and 6 of first byte must be masked away).
Range: 0 - 2**62-1. |
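The three variable-length integer types above can be decoded with a few bit tests on the first byte. The Python sketch below follows the table literally; the function names are hypothetical, and each function returns the decoded value together with the position of the next unread byte.

```python
def decode_int_1_4(buf: bytes, pos: int = 0):
    """Integer_1_4: 1 byte if bit 7 is unset, else 4 bytes with bit 7 masked."""
    first = buf[pos]
    if first & 0x80 == 0:
        return first, pos + 1
    value = int.from_bytes(buf[pos:pos + 4], "big") & 0x7FFFFFFF
    return value, pos + 4


def decode_int_1_2_4(buf: bytes, pos: int = 0):
    """Integer_1_2_4: 1, 2 or 4 bytes, selected by bits 7 and 6."""
    first = buf[pos]
    if first & 0x80 == 0:
        return first, pos + 1
    if first & 0x40 == 0:
        return int.from_bytes(buf[pos:pos + 2], "big") & 0x3FFF, pos + 2
    return int.from_bytes(buf[pos:pos + 4], "big") & 0x3FFFFFFF, pos + 4


def decode_int_2_4_8(buf: bytes, pos: int = 0):
    """Integer_2_4_8: 2, 4 or 8 bytes, selected by bits 7 and 6."""
    first = buf[pos]
    if first & 0x80 == 0:
        return int.from_bytes(buf[pos:pos + 2], "big"), pos + 2
    if first & 0x40 == 0:
        return int.from_bytes(buf[pos:pos + 4], "big") & 0x3FFFFFFF, pos + 4
    return int.from_bytes(buf[pos:pos + 8], "big") & 0x3FFFFFFFFFFFFFFF, pos + 8


# A short and a long form of each type.
print(decode_int_1_4(bytes([0x05])))
print(decode_int_1_4(bytes([0x80, 0x00, 0x01, 0x00])))
print(decode_int_1_2_4(bytes([0x81, 0x00])))
print(decode_int_2_4_8(bytes([0x80, 0x00, 0x80, 0x00])))
```

All multi-byte forms are read in network byte order, consistent with the rest of the format.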
This is the description of the serialized document update format.
Document update serialization format
| Field | Type | Length | Description |
|---|---|---|---|
| Document ID | Bytes | | Unique ID for document. 0-terminated string, UTF-8 encoding. |
| Content byte | Byte | 1 byte | Always set to 1. |
| Document Type | Bytes | | Document type. (0-terminated string, UTF-8 encoding.) |
| Number of fields to update | Integer | 4 bytes | The number of fields to update. |
| Serialized field updates | Field Update | | The serialized field updates. See below. |
Field update serialization format
| Field | Type | Length | Description |
|---|---|---|---|
| Field Id | Integer | 4 bytes | Field id within document type. |
| Number of value updates | Integer | 4 bytes | Number of value updates to this field. |
| Serialized field update values | Bytes | | The serialized field update values. See below. |
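The fixed leading parts of the two tables above can be read as in the following Python sketch. It is an illustration only: the helper names, the sample document ID and type, and the decision to read the 4-byte integers as signed are assumptions, and the value updates that follow each field update header are not parsed.

```python
import struct


def read_cstring(buf: bytes, pos: int):
    """Read a 0-terminated UTF-8 string; return (text, new_pos)."""
    end = buf.index(0, pos)
    return buf[pos:end].decode("utf-8"), end + 1


def parse_update_header(buf: bytes, pos: int = 0):
    """Read document ID, content byte, document type and field count."""
    doc_id, pos = read_cstring(buf, pos)
    content = buf[pos]  # the table says this is always 1
    pos += 1
    doc_type, pos = read_cstring(buf, pos)
    (num_fields,) = struct.unpack_from(">i", buf, pos)
    return doc_id, content, doc_type, num_fields, pos + 4


def parse_field_update_header(buf: bytes, pos: int):
    """Read field id and number of value updates (4 bytes each)."""
    field_id, num_values = struct.unpack_from(">ii", buf, pos)
    return field_id, num_values, pos + 8


# Hypothetical update touching one field (id 7) with two value updates;
# the value updates themselves are left out of the sample.
sample = (b"id:ns:music::1\x00" + b"\x01" + b"music\x00"
          + struct.pack(">i", 1) + struct.pack(">ii", 7, 2))
doc_id, content, doc_type, num_fields, pos = parse_update_header(sample)
print(doc_id, content, doc_type, num_fields)
print(parse_field_update_header(sample, pos))
```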
Document update value serialization format
| Field | Type | Length | Description |
| Add Value Update |
|---|
| Add Value Update ID |
| Field serialization |
| Weight |
| Arithmetic Update |
| --- |
| Arithmetic Update ID |
| Operator ID |
| Operand |
| Assign Update |
| --- |
| Assign Update ID |
| Content flag |
| Field serialization |
| Clear Update |
| --- |
| Clear Update ID |
| Map Value Update |
| --- |
| Map Value Update ID |
| Field serialization |
| Value Update |
| Remove Value Update |
| --- |
| Remove Update ID |
| Field serialization |