QuestWire Protocol (QWP) Specification

docs/qwp/wire-ingress.md

QuestWire Protocol (QWP) is a columnar binary ingestion protocol designed for high-throughput, zero-GC data streaming into QuestDB. This specification is intended to enable alternative implementations to interoperate with QuestDB.

Table of Contents

  1. Overview
  2. Transport
  3. Version Negotiation
  4. Byte Ordering
  5. Variable-Length Integer Encoding (Varint)
  6. ZigZag Encoding
  7. Message Structure
  8. Table Block Structure
  9. Schema Definition
  10. Column Types
  11. Null Handling
  12. Column Data Encoding
  13. Response Format
  14. Protocol Limits
  15. Client Operation
  16. Examples
  17. Reference Implementation
  18. Version History

1. Overview

QWP is a binary protocol for high-performance time-series data ingestion. Key features:

  • Column-oriented encoding: All values for a column are stored contiguously
  • Batch processing: Multiple tables and rows per message
  • Gorilla timestamp compression: Delta-of-delta encoding for timestamps
  • Schema references: Reference previously sent schemas by numeric ID

Magic Bytes

Every QWP message begins with a 4-byte magic value identifying the protocol.

Magic   Hex Value    Description
──────────────────────────────────────────────
QWP1    0x31505751   Standard data message

Version negotiation is handled entirely via HTTP upgrade headers (see §3), not via binary magics.

2. Transport

WebSocket

QWP uses RFC 6455 WebSocket binary frames. The client initiates an HTTP GET request to either /write/v4 or /api/v4/write with standard WebSocket upgrade headers. After the 101 Switching Protocols handshake, all communication uses binary frames.

UDP

UDP has no HTTP upgrade handshake. Each datagram is self-describing: the server inspects the version byte in the message header and processes or drops the datagram accordingly.

3. Version Negotiation

When QWP operates over WebSocket, the client and server negotiate the protocol version during the HTTP upgrade handshake.

Client Request Headers

Header             Required  Description
──────────────────────────────────────────────────────────
X-QWP-Max-Version  No        Maximum QWP version the client supports (positive integer). Defaults to 1 if absent.
X-QWP-Client-Id    No        Free-form client identifier (e.g., java/1.0.2, python/0.9.1).

Server Response Header

Header         Description
──────────────────────────────────────────────────────────
X-QWP-Version  The QWP version selected for this connection.

The server selects the version as min(clientMax, serverMax). The selected version is never higher than either side's maximum. The server may also consider the X-QWP-Client-Id when selecting the version.

Connection-Level Contract

All QWP messages on a connection must use the negotiated version in the version byte (offset 4) of the message header. The server validates every incoming message against the negotiated version and rejects any message whose version byte does not match with a parse error.

Ingress is pinned to version 1

Ingress senders advertise X-QWP-Max-Version: 1 because no v2 ingest semantics exist. The v2 bump is purely an egress addition — an unsolicited SERVER_INFO frame on the upgrade carrying server role and zone metadata for read-side routing (see wire-egress.md §11.8 and failover.md §5). Ingress clients do NOT read SERVER_INFO, ignore zone advertising, and rely on the 421 + X-QuestDB-Role upgrade-reject convention alone for primary-vs-replica routing. The zone= connect-string knob is accepted but silently ignored on ingress so a single connect string can be reused across ingress and egress clients without per-startup noise; see failover.md §1.1.

4. Byte Ordering

All multi-byte numeric values are little-endian. Variable-length integers use unsigned LEB128 (see §5).

5. Variable-Length Integer Encoding (Varint)

QWP uses unsigned LEB128 (Little Endian Base 128) encoding for variable-length integers.

Encoding Rules

  • Values are split into 7-bit groups, LSB first
  • Each byte uses the high bit (0x80) as a continuation flag
  • If high bit is set (1), more bytes follow
  • If high bit is clear (0), this is the last byte
  • Maximum: 10 bytes for 64-bit values

Encoding Algorithm

while (value & ~0x7F) != 0:
    output_byte((value & 0x7F) | 0x80)
    value >>>= 7
output_byte(value)

Decoding Algorithm

result = 0
shift = 0
do:
    byte = read_byte()
    result |= (byte & 0x7F) << shift
    shift += 7
while (byte & 0x80) != 0
return result

Examples

Value   Encoded Bytes
──────────────────────────
0       0x00
1       0x01
127     0x7F
128     0x80 0x01
255     0xFF 0x01
300     0xAC 0x02
16384   0x80 0x80 0x01
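
The two algorithms above translate almost line for line into Python. The following is a minimal sketch, not the reference implementation; the function names `encode_varint` and `decode_varint` are illustrative, and the 10-byte guard in the decoder is an added safety check based on the stated maximum length.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer below 2**64 as unsigned LEB128."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value out of range for a 64-bit varint")
    out = bytearray()
    while value & ~0x7F:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)  # last byte, continuation bit clear
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if shift >= 70:  # an 11th byte would exceed the 10-byte maximum
            raise ValueError("varint longer than 10 bytes")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos
```

Returning the next read position alongside the value lets a parser walk a buffer of consecutive varints without slicing.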

6. ZigZag Encoding

Used to map signed integers to unsigned for efficient varint encoding:

encode(n) = (n << 1) ^ (n >> 63)    // 64-bit
decode(n) = (n >>> 1) ^ -(n & 1)

 0 →  0
-1 →  1
 1 →  2
-2 →  3
 2 →  4
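
Languages with arbitrary-precision integers need an explicit 64-bit mask to reproduce the formulas above. A small sketch in Python, with illustrative function names:

```python
MASK64 = (1 << 64) - 1


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned 64-bit integer."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit integer")
    # Python's >> on a negative int is an arithmetic shift, like Java's >>.
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(u: int) -> int:
    """Inverse of zigzag_encode for an unsigned 64-bit input."""
    return (u >> 1) ^ -(u & 1)
```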

7. Message Structure

Message Header (12 bytes, fixed)

Offset  Size  Type    Field           Description
──────────────────────────────────────────────────────────
0       4     int32   magic           "QWP1" (0x31505751)
4       1     uint8   version         Protocol version (0x01)
5       1     uint8   flags           Encoding flags
6       2     uint16  table_count     Number of table blocks
8       4     uint32  payload_length  Payload size in bytes

Total message size = 12 + payload length.

Flags Byte

Bit   Mask   Name                     Description
──────────────────────────────────────────────────────────
0-1                                   Reserved (must be 0)
2     0x04   FLAG_GORILLA             Gorilla delta-of-delta encoding for timestamp columns
3     0x08   FLAG_DELTA_SYMBOL_DICT   Delta symbol dictionary mode enabled
4-7                                   Reserved (must be 0)
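
As an illustration of the fixed header and flags byte, the sketch below packs and unpacks the 12 bytes with Python's `struct` module. The helper names are hypothetical, and rejecting reserved flag bits on read is one possible reading of "must be 0", not documented server behaviour.

```python
import struct

MAGIC = b"QWP1"
# Little-endian: magic (4 bytes), version, flags, table_count, payload_length
HEADER = struct.Struct("<4sBBHI")
FLAG_GORILLA = 0x04
FLAG_DELTA_SYMBOL_DICT = 0x08
KNOWN_FLAGS = FLAG_GORILLA | FLAG_DELTA_SYMBOL_DICT


def pack_header(flags: int, table_count: int, payload_length: int,
                version: int = 1) -> bytes:
    """Build the 12-byte message header."""
    if flags & ~KNOWN_FLAGS:
        raise ValueError("reserved flag bits must be 0")
    return HEADER.pack(MAGIC, version, flags, table_count, payload_length)


def unpack_header(buf: bytes) -> dict:
    """Parse and sanity-check the 12-byte message header."""
    magic, version, flags, table_count, payload_length = HEADER.unpack_from(buf)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if flags & ~KNOWN_FLAGS:
        raise ValueError("reserved flag bits set")
    return {
        "version": version,
        "flags": flags,
        "table_count": table_count,
        "payload_length": payload_length,
    }
```

Reading the magic as a little-endian int32 yields 0x31505751, which matches the on-wire byte sequence 51 57 50 31.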

Complete Message Layout

┌─────────────────────────────────────────┐
│ Message Header (12 bytes)               │
├─────────────────────────────────────────┤
│ Payload (variable)                      │
│   ├─ [Delta Symbol Dictionary] (if 0x08)│
│   ├─ Table Block 0                      │
│   ├─ Table Block 1                      │
│   └─ ... Table Block N-1                │
└─────────────────────────────────────────┘

Delta Symbol Dictionary (optional)

Present only when FLAG_DELTA_SYMBOL_DICT (0x08) is set. Appears at the start of the payload, before any table blocks.

┌──────────────────────────────────────────────────────────────┐
│ delta_start:    varint   Starting global ID for this delta   │
│ delta_count:    varint   Number of new entries               │
│ For each new entry:                                          │
│   name_length:  varint   UTF-8 byte length                   │
│   name_bytes:   bytes    UTF-8 encoded symbol string         │
└──────────────────────────────────────────────────────────────┘

The client maintains a global symbol dictionary mapping symbol strings to sequential integer IDs (starting from 0). On each batch, only newly added symbols (the "delta") are transmitted. The server accumulates these entries across batches for the lifetime of the connection. Symbol columns in delta mode contain varint-encoded global IDs instead of per-column dictionaries.
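
A client-side dictionary that tracks what has already been transmitted might look like the sketch below. The class and method names are hypothetical. Emitting `delta_count = 0` when a batch adds no symbols is an assumption; the text above does not say what a sender writes in that case.

```python
def encode_varint(value: int) -> bytes:
    """Unsigned LEB128, as described in §5."""
    out = bytearray()
    while value & ~0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


class SymbolDictionary:
    """Connection-scoped symbol -> global ID map with delta tracking."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}  # insertion order == ID order
        self._sent = 0                  # entries already transmitted

    def intern(self, symbol: str) -> int:
        """Return the global ID for symbol, assigning the next ID if new."""
        return self._ids.setdefault(symbol, len(self._ids))

    def take_delta(self) -> bytes:
        """Encode the entries added since the previous call."""
        new = list(self._ids)[self._sent:]
        out = bytearray(encode_varint(self._sent))  # delta_start
        out += encode_varint(len(new))              # delta_count
        for name in new:
            raw = name.encode("utf-8")
            out += encode_varint(len(raw)) + raw
        self._sent = len(self._ids)
        return bytes(out)
```

On reconnect a client would discard the instance and start a fresh one, since the dictionary lives only as long as the connection.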

8. Table Block Structure

Each table block contains data for a single table.

┌─────────────────────────────────────────┐
│ Table Header (variable)                 │
├─────────────────────────────────────────┤
│ Schema Section (variable)               │
├─────────────────────────────────────────┤
│ Column Data (variable)                  │
│   ├─ Column 0 data                      │
│   ├─ Column 1 data                      │
│   └─ ... Column N-1 data                │
└─────────────────────────────────────────┘

Table Header

Field         Type    Description
──────────────────────────────────────────────────
name_length   varint  Table name length in bytes
name          UTF-8   Table name (max 127 bytes)
row_count     varint  Number of rows in this block
column_count  varint  Number of columns
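
A short sketch of the table header encoding, assuming the 127-byte limit applies to the UTF-8 byte length of the name. The function name is illustrative.

```python
def encode_varint(value: int) -> bytes:
    """Unsigned LEB128, as described in §5."""
    out = bytearray()
    while value & ~0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_table_header(name: str, row_count: int, column_count: int) -> bytes:
    """name_length, name, row_count, column_count in wire order."""
    raw = name.encode("utf-8")
    if len(raw) > 127:
        raise ValueError("table name exceeds 127 bytes")
    return (encode_varint(len(raw)) + raw
            + encode_varint(row_count) + encode_varint(column_count))
```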

9. Schema Definition

Schema Mode Byte

Value  Mode       Description
──────────────────────────────────────────────────────────
0x00   Full       Schema ID + complete column definitions inline
0x01   Reference  Schema ID only (lookup from registry)

Full Schema Mode (0x00)

Sent the first time a table's schema appears on a connection, or whenever the column set changes.

┌─────────────────────────────────────────┐
│ mode_byte: 0x00                         │
├─────────────────────────────────────────┤
│ schema_id: varint                       │
├─────────────────────────────────────────┤
│ Column Definition 0                     │
│   ├─ name_length: varint                │
│   ├─ name: UTF-8 bytes                  │
│   └─ type_code: uint8                   │
├─────────────────────────────────────────┤
│ Column Definition 1 ...                 │
└─────────────────────────────────────────┘

Schema IDs are non-negative integers assigned by the client and scoped to the lifetime of a single connection. They are global across all tables on the connection (not per-table). Clients typically assign them sequentially starting at 0, but the server does not require any particular ordering.

The type_code byte contains the column type (0x01 through 0x18).

A column with an empty name (length 0) and type TIMESTAMP denotes the designated timestamp column.

Reference Schema Mode (0x01)

Used for subsequent batches when the server has already registered the schema.

┌─────────────────────────────────────────┐
│ mode_byte: 0x01                         │
├─────────────────────────────────────────┤
│ schema_id: varint                       │
└─────────────────────────────────────────┘

The server looks up the schema by its ID in the per-connection schema registry. Full-mode schemas may arrive in any order and may re-register an existing ID; the server accepts any ID within the per-connection schema-ID limit.
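
Both schema modes can be sketched in a few lines. The helper names are hypothetical, and `TYPE_CODES` lists only three of the codes from §10 to keep the example short.

```python
def encode_varint(value: int) -> bytes:
    """Unsigned LEB128, as described in §5."""
    out = bytearray()
    while value & ~0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


TYPE_CODES = {"LONG": 0x05, "DOUBLE": 0x07, "TIMESTAMP": 0x0A}  # subset of §10


def encode_full_schema(schema_id: int, columns: list[tuple[str, str]]) -> bytes:
    """Mode byte 0x00, schema ID, then one definition per column."""
    out = bytearray([0x00])
    out += encode_varint(schema_id)
    for name, type_name in columns:
        raw = name.encode("utf-8")
        out += encode_varint(len(raw)) + raw
        out.append(TYPE_CODES[type_name])
    return bytes(out)


def encode_schema_reference(schema_id: int) -> bytes:
    """Mode byte 0x01 followed by the schema ID only."""
    return b"\x01" + encode_varint(schema_id)
```

An empty column name with type TIMESTAMP, as in `("", "TIMESTAMP")`, encodes the designated timestamp column.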

10. Column Types

Type Code Table

Code  Hex   Type             Size   Description
──────────────────────────────────────────────────────────────────────
1     0x01  BOOLEAN          1 bit  Bit-packed boolean
2     0x02  BYTE             1      Signed 8-bit integer
3     0x03  SHORT            2      Signed 16-bit integer
4     0x04  INT              4      Signed 32-bit integer
5     0x05  LONG             8      Signed 64-bit integer
6     0x06  FLOAT            4      IEEE 754 single precision
7     0x07  DOUBLE           8      IEEE 754 double precision
9     0x09  SYMBOL           var    Dictionary-encoded string
10    0x0A  TIMESTAMP        8      Microseconds since epoch
11    0x0B  DATE             8      Milliseconds since epoch
12    0x0C  UUID             16     RFC 4122 UUID
13    0x0D  LONG256          32     256-bit integer
14    0x0E  GEOHASH          var    Geospatial hash
15    0x0F  VARCHAR          var    Length-prefixed UTF-8 (aux storage)
16    0x10  TIMESTAMP_NANOS  8      Nanoseconds since epoch
17    0x11  DOUBLE_ARRAY     var    N-dimensional double array
18    0x12  LONG_ARRAY       var    N-dimensional long array
19    0x13  DECIMAL64        8      Decimal (18 digits precision)
20    0x14  DECIMAL128       16     Decimal (38 digits precision)
21    0x15  DECIMAL256       32     Decimal (77 digits precision)
22    0x16  CHAR             2      Single UTF-16 code unit
23    0x17  BINARY           var    Length-prefixed opaque bytes
24    0x18  IPv4             4      32-bit IPv4 address

Code 0x08 is unassigned. It was previously STRING, which has been removed; senders should use VARCHAR (0x0F) for text columns.

TIMESTAMP and TIMESTAMP_NANOS may use Gorilla encoding when FLAG_GORILLA is set (see Column Data Encoding).

11. Null Handling

Each column's data section begins with a 1-byte null flag. The flag tells the decoder how to interpret what follows:

  • 0x00 -- no bitmap follows. The column data contains one value per row (row_count values total). If the column has null rows, they are represented by a type-specific sentinel encoded in place.
  • Any nonzero value -- a null bitmap follows immediately after the flag byte, and the column data contains only value_count = row_count - null_count non-null values, densely packed. The bitmap identifies which row indices are null.

The choice between these two strategies is made per column by the encoder, and the decoder must support both. Sentinel mode avoids the per-row bitmap overhead, and bitmap mode avoids writing any data for null rows. The wider the column element, the more likely it is to get a more compact encoding using the null bitmap.

Sentinel mode requires the type to have a dedicated null representation available; it is not applicable to types whose full value range is meaningful payload (e.g. VARCHAR, SYMBOL).

Bitmap Format

  • Size: ceil(row_count / 8) bytes
  • Bit order: LSB first within each byte
  • Semantics: bit = 1 means row is NULL, bit = 0 means row has value

Layout

Byte 0:  [row7][row6][row5][row4][row3][row2][row1][row0]
Byte 1:  [row15][row14][row13][row12][row11][row10][row9][row8]
...

Example

For 10 rows where rows 0, 2, and 9 are null:

Byte 0: 0b00000101 = 0x05  (bits 0 and 2 set)
Byte 1: 0b00000010 = 0x02  (bit 1 set = row 9)

Accessing Null Status

byte_index = row_index / 8
bit_index = row_index % 8
is_null = (bitmap[byte_index] & (1 << bit_index)) != 0

Column Data Layout (all types)

┌──────────────────────────────────────────────────────────────┐
│ null_flag:     uint8     0 = no bitmap, nonzero = bitmap     │
│ [null bitmap:  ceil(row_count/8) bytes if flag != 0]         │
│ Column values:                                               │
│   - flag == 0 : row_count entries (null rows = sentinel)     │
│   - flag != 0 : value_count non-null entries, densely packed │
│                 (value_count = row_count - null_count)       │
└──────────────────────────────────────────────────────────────┘
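
The flag, bitmap, and dense-value rules can be sketched as follows for bitmap mode. The function names are illustrative, `None` stands in for a null row, and writing 0x01 as the nonzero flag value is an arbitrary choice, since any nonzero value signals a bitmap.

```python
def build_null_section(values: list) -> tuple[bytes, list]:
    """Return (null_flag plus optional bitmap, densely packed non-null values)."""
    if all(v is not None for v in values):
        return b"\x00", list(values)  # no nulls: flag 0, no bitmap
    bitmap = bytearray((len(values) + 7) // 8)  # ceil(row_count / 8)
    for row, v in enumerate(values):
        if v is None:
            bitmap[row // 8] |= 1 << (row % 8)  # LSB first within each byte
    dense = [v for v in values if v is not None]
    return b"\x01" + bytes(bitmap), dense


def is_null(bitmap: bytes, row_index: int) -> bool:
    """Bit = 1 means the row is NULL."""
    return bool(bitmap[row_index // 8] & (1 << (row_index % 8)))
```

For the 10-row example above (rows 0, 2, and 9 null) this produces the flag byte followed by 0x05 0x02 and seven dense values.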

Reference Implementation Null Strategy

The reference Java WebSocket client and the Go client make the same per-column choice:

Strategy  Types
──────────────────────────────────────────────────────────
Sentinel  BOOLEAN, BYTE, SHORT, CHAR, GEOHASH
Bitmap    INT, LONG, FLOAT, DOUBLE, VARCHAR, SYMBOL, TIMESTAMP, TIMESTAMP_NANOS, DATE, UUID, LONG256, DECIMAL64, DECIMAL128, DECIMAL256, DOUBLE_ARRAY, LONG_ARRAY

The reference Java UDP client additionally uses sentinel mode for LONG and DOUBLE (encoding null rows as Long.MIN_VALUE and NaN respectively).

Alternative implementations are free to make different per-column choices, as long as the null_flag value accurately describes the data that follows. A column with no null rows produces identical output under either strategy (null_flag = 0, row_count values).

Reference Sentinel Values

When the reference implementations emit sentinel mode (null_flag = 0), null rows are encoded as:

Type     Sentinel
──────────────────────────────────────────────────────────
BOOLEAN  bit 0 (false)
BYTE     0x00
SHORT    0x0000
CHAR     0x0000
GEOHASH  all-ones: int64 -1 (0xFFFF…FFFF), truncated to the column's per-value byte width ceil(precision_bits / 8)

12. Column Data Encoding

Fixed-Width Types

For BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, DATE, CHAR: values are written as contiguous arrays of their respective sizes.

┌────────────────────────────────────────────────────┐
│ [Null flag + bitmap (see §11)]                     │
├────────────────────────────────────────────────────┤
│ Values:                                            │
│   value[0], value[1], ... value[N-1]               │
│   where N = row_count                if flag == 0  │
│        or N = row_count - null_count if flag != 0  │
└────────────────────────────────────────────────────┘

The number of values depends on the null strategy chosen for the column (see §11). In sentinel mode (null_flag == 0) all row_count values are written, with the type's sentinel marking null rows. In bitmap mode (null_flag != 0) only the non-null values are written, densely packed.

The reference implementation uses sentinel mode for BYTE, SHORT, and CHAR, and bitmap mode for INT, LONG, FLOAT, DOUBLE, and DATE.

Boolean Type (0x01)

Values are bit-packed, 8 per byte, LSB-first. ceil(N/8) bytes are written, where N = row_count in sentinel mode (null_flag == 0) or N = row_count - null_count in bitmap mode. The reference implementation uses sentinel mode for BOOLEAN: null rows appear as bit 0 (false).

Byte layout for values [true, false, true, true, false, false, false, true]:
  0b10001101 = 0x8D
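
A sketch of the LSB-first bit packing, with illustrative function names:

```python
def pack_booleans(values: list[bool]) -> bytes:
    """Pack booleans 8 per byte, first value in the least significant bit."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_booleans(data: bytes, count: int) -> list[bool]:
    """Inverse of pack_booleans for the first count values."""
    return [bool(data[i // 8] & (1 << (i % 8))) for i in range(count)]
```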

VARCHAR / BINARY Type (0x0F, 0x17)

VARCHAR and BINARY share the same wire format:

┌──────────────────────────────────────────┐
│ [Null flag + bitmap (see §11)]           │
├──────────────────────────────────────────┤
│ Offset array: (value_count + 1) x 4 bytes│
│   offset[0] = 0                          │
│   offset[i+1] = end of string[i]         │
├──────────────────────────────────────────┤
│ String data: concatenated UTF-8 bytes    │
└──────────────────────────────────────────┘
  • value_count = row_count - null_count
  • Offsets are uint32
  • String i spans bytes [offset[i], offset[i+1])
  • For VARCHAR, the bytes are valid UTF-8. For BINARY, the bytes are opaque, and clients must not attempt UTF-8 interpretation.
  • The uint32 offset bounds individual values to 2^31 - 1 bytes; larger payloads must be split into multiple BINARY columns or returned via a side-channel.
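
The offset-array layout can be sketched as below. It covers only the part after the null flag and bitmap and takes the non-null values as input. The function names are illustrative; offsets are written little-endian per §4.

```python
import struct


def encode_varchar_values(values: list[str]) -> bytes:
    """Offset array (value_count + 1 uint32s) followed by concatenated UTF-8."""
    data = bytearray()
    offsets = [0]
    for s in values:
        data += s.encode("utf-8")
        offsets.append(len(data))  # offset[i+1] = end of string[i]
    return struct.pack(f"<{len(offsets)}I", *offsets) + bytes(data)


def decode_varchar_values(buf: bytes, value_count: int) -> list[str]:
    """Read value_count strings back out of the offset array and data."""
    offsets = struct.unpack_from(f"<{value_count + 1}I", buf, 0)
    base = 4 * (value_count + 1)  # string data starts after the offsets
    return [buf[base + offsets[i]:base + offsets[i + 1]].decode("utf-8")
            for i in range(value_count)]
```

For BINARY the same layout applies with raw bytes in place of the UTF-8 encode and decode steps.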

Symbol Type (0x09)

Dictionary-encoded strings for low-cardinality columns.

Per-Table Dictionary Mode (UDP)

┌─────────────────────────────────────────┐
│ [Null flag + bitmap (see §11)]          │
├─────────────────────────────────────────┤
│ dictionary_size: varint                 │
├─────────────────────────────────────────┤
│ Dictionary entries:                     │
│   For each entry:                       │
│     entry_length: varint                │
│     entry_data: UTF-8 bytes             │
├─────────────────────────────────────────┤
│ Value indices:                          │
│   For each non-null row:                │
│     dict_index: varint                  │
└─────────────────────────────────────────┘
  • Dictionary indices are 0-based
  • When a null bitmap is present, only non-null rows have indices written

Per-Table Dictionary Mode is used by UDP because datagrams cannot rely on a connection-scoped dictionary persisting across messages.

Global Delta Dictionary Mode (WebSocket, FLAG_DELTA_SYMBOL_DICT)

When the delta symbol dictionary flag is set, symbol columns use global integer IDs instead of per-table dictionaries. The dictionary entries are sent in the message-level delta dictionary section (see §7). Column data consists of varint-encoded global IDs only.

WebSocket clients set FLAG_DELTA_SYMBOL_DICT on every message and use this mode exclusively.

┌───────────────────────────────────────────┐
│ For each non-null row:                    │
│   global_id:   varint   Global symbol ID  │
└───────────────────────────────────────────┘

Timestamp Type (0x0A, 0x10)

When FLAG_GORILLA (0x04) is set in the message header flags, timestamp columns include a 1-byte encoding flag after the null bitmap. When FLAG_GORILLA is not set, there is no encoding flag -- timestamps are written as plain uncompressed int64 arrays.

Without FLAG_GORILLA (no encoding flag)

┌─────────────────────────────────────────┐
│ [Null flag + bitmap (see §11)]          │
├─────────────────────────────────────────┤
│ Timestamp values (non-null only):       │
│   value_count x int64                   │
└─────────────────────────────────────────┘

With FLAG_GORILLA (encoding flag present)

Flag  Mode          Description
──────────────────────────────────────────────────────────
0x00  Uncompressed  Array of int64 values (only non-null values)
0x01  Gorilla       Delta-of-delta compressed

Uncompressed mode (0x00):

┌─────────────────────────────────────────┐
│ [Null flag + bitmap (see §11)]          │
├─────────────────────────────────────────┤
│ encoding_flag: uint8 (0x00)             │
├─────────────────────────────────────────┤
│ Timestamp values (non-null only):       │
│   value_count x int64                   │
└─────────────────────────────────────────┘

Gorilla mode (0x01):

┌─────────────────────────────────────────┐
│ [Null flag + bitmap (see §11)]          │
├─────────────────────────────────────────┤
│ encoding_flag: uint8 (0x01)             │
├─────────────────────────────────────────┤
│ first_timestamp: int64                  │
├─────────────────────────────────────────┤
│ second_timestamp: int64                 │
├─────────────────────────────────────────┤
│ Bit-packed delta-of-deltas:             │
│   For timestamps 3..N                   │
└─────────────────────────────────────────┘

Gorilla Delta-of-Delta Encoding

delta[i] = t[i] - t[i-1]
DoD[i]   = delta[i] - delta[i-1]

Encoding buckets (bits are written LSB-first):

Condition             Prefix  Value Bits   Total Bits
──────────────────────────────────────────────────────
DoD == 0              0       0            1
DoD in [-64, 63]      10      7 (signed)   9
DoD in [-256, 255]    110     9 (signed)   12
DoD in [-2048, 2047]  1110    12 (signed)  16
Otherwise             1111    32 (signed)  36

The bit stream is padded to a byte boundary at the end. If any DoD value exceeds the 32-bit signed integer range, the encoder falls back to uncompressed mode.
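
The bucket selection and the fallback rule can be illustrated without committing to a bit-level layout. The sketch below classifies each delta-of-delta and computes the size of a Gorilla-mode section; it deliberately stops short of emitting the bit stream. The function names are hypothetical, and requiring at least two timestamps is an assumption, since this section does not describe shorter columns.

```python
def dod_bucket(dod: int):
    """Return (prefix, value_bits) for one DoD, or None if it exceeds int32."""
    if dod == 0:
        return ("0", 0)
    if -64 <= dod <= 63:
        return ("10", 7)
    if -256 <= dod <= 255:
        return ("110", 9)
    if -2048 <= dod <= 2047:
        return ("1110", 12)
    if -(1 << 31) <= dod < (1 << 31):
        return ("1111", 32)
    return None  # encoder must fall back to uncompressed mode


def gorilla_payload_size(timestamps: list[int]):
    """Bytes after encoding_flag in Gorilla mode, or None when fallback is needed."""
    if len(timestamps) < 2:
        raise ValueError("sketch assumes at least two timestamps")
    bits = 0
    prev_delta = timestamps[1] - timestamps[0]
    for i in range(2, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        bucket = dod_bucket(delta - prev_delta)
        if bucket is None:
            return None
        prefix, value_bits = bucket
        bits += len(prefix) + value_bits
        prev_delta = delta
    # two raw int64 timestamps, then the bit stream padded to a byte boundary
    return 16 + (bits + 7) // 8
```

An encoder can compare this size against `8 * value_count` to decide whether Gorilla mode is worth using for a given column.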

UUID Type (0x0C)

16 bytes per value: 8 bytes low, then 8 bytes high.

LONG256 Type (0x0D)

32 bytes per value: four int64 values, least significant first.

GeoHash Type (0x0E)

┌──────────────────────────────────────────────────┐
│ [Null flag + bitmap (see §11)]                   │
├──────────────────────────────────────────────────┤
│ precision_bits: varint (1-60)                    │
├──────────────────────────────────────────────────┤
│ Packed geohash values:                           │
│   bytes_per_value = ceil(precision/8)            │
│   total = bytes_per_value x N                    │
│     where N = row_count              if flag == 0│
│          or N = row_count - null_count if flag != 0│
└──────────────────────────────────────────────────┘

The reference implementation uses sentinel mode for GEOHASH: null rows are encoded as all-ones (int64 -1) truncated to bytes_per_value.

Array Types (0x11, 0x12)

N-dimensional arrays of DOUBLE or LONG, row-major order:

┌───────────────────────────────────────────────────────┐
│ For each row:                                         │
│   n_dims:      uint8          Number of dimensions    │
│   dim_lengths: n_dims x int32      Length per dim     │
│   values:      product(dims) x element                │
│                (float64 for DOUBLE_ARRAY,             │
│                 int64 for LONG_ARRAY)                 │
└───────────────────────────────────────────────────────┘

Decimal Types (0x13, 0x14, 0x15)

Decimal values are stored as two's complement integers. The scale (number of decimal places) is a 1-byte prefix in the column data section, shared by all values in the column.

┌─────────────────────────────────────────┐
│ [Null flag + bitmap (see §11)]          │
├─────────────────────────────────────────┤
│ scale: uint8                            │
├─────────────────────────────────────────┤
│ Unscaled values:                        │
│   DECIMAL64:  8 bytes x value_count     │
│   DECIMAL128: 16 bytes x value_count    │
│   DECIMAL256: 32 bytes x value_count    │
└─────────────────────────────────────────┘

Type        Value Size  Precision
──────────────────────────────────
DECIMAL64   8 bytes     18 digits
DECIMAL128  16 bytes    38 digits
DECIMAL256  32 bytes    77 digits
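
As a sketch of the part that follows the null flag and bitmap, the function below writes the shared scale byte and then each unscaled value as a two's complement integer. The function name is illustrative. Little-endian byte order is inferred from §4 rather than stated in this section, and the caller is assumed to supply already-unscaled integers (for example, 12.34 at scale 2 is passed as 1234).

```python
DECIMAL_WIDTHS = {"DECIMAL64": 8, "DECIMAL128": 16, "DECIMAL256": 32}


def encode_decimal_values(unscaled: list[int], scale: int,
                          type_name: str = "DECIMAL64") -> bytes:
    """Scale byte followed by fixed-width two's complement unscaled values."""
    width = DECIMAL_WIDTHS[type_name]
    out = bytearray([scale])  # one scale shared by every value in the column
    for value in unscaled:
        # to_bytes raises OverflowError if the value does not fit the width
        out += value.to_bytes(width, "little", signed=True)
    return bytes(out)
```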

13. Response Format

Every response starts with a 1-byte status code. OK and error responses include an 8-byte sequence number that correlates the response with the original request. Durable-ack responses carry only per-table upload watermarks.

OK Response (11+ bytes)

┌──────────────────────────────────────────────────────┐
│ status:      uint8   (0x00)                          │
│ sequence:    int64          Request sequence number  │
│ tableCount:  uint16         Number of table entries  │
│ ┌── repeated tableCount times ─────────────────────┐ │
│ │ nameLen:   uint16         Table name length      │ │
│ │ name:      bytes          UTF-8 table name       │ │
│ │ seqTxn:    int64          Sequencer txn for table│ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘

The per-table entries report the sequencer txn assigned to each table that committed data in the acknowledged batch. tableCount is 0 when no WAL tables committed (e.g., non-WAL tables or empty batches).

Durable-Ack Response (3+ bytes)

Emitted only when the client opted in at handshake time (see below) and only by servers where primary replication is configured. Each per-table entry reports the highest sequencer txn whose WAL segments have been durably uploaded to the configured object store. Only tables whose durable watermark advanced since the last durable-ack are included.

┌──────────────────────────────────────────────────────┐
│ status:      uint8   (0x02)                          │
│ tableCount:  uint16         Number of table entries  │
│ ┌── repeated tableCount times ─────────────────────┐ │
│ │ nameLen:   uint16         Table name length      │ │
│ │ name:      bytes          UTF-8 table name       │ │
│ │ seqTxn:    int64          Durably-uploaded seqTxn│ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘

Error Response (11 + msg_len bytes)

┌──────────────────────────────────────────────────────┐
│ status:    uint8          Status code                │
│ sequence:  int64          Request sequence number    │
│ msg_len:   uint16         Error message length       │
│ msg_bytes: bytes          UTF-8 error message        │
└──────────────────────────────────────────────────────┘

Status Codes

Code  Hex   Name             Description
──────────────────────────────────────────────────────────────────────
0     0x00  OK               Batch accepted (written to WAL)
2     0x02  DURABLE_ACK      Batch WAL uploaded to object store (opt-in)
3     0x03  SCHEMA_MISMATCH  Column type incompatible with existing table
5     0x05  PARSE_ERROR      Malformed message
6     0x06  INTERNAL_ERROR   Server-side error
8     0x08  SECURITY_ERROR   Authorization failure
9     0x09  WRITE_ERROR      Write failure (e.g., table not accepting writes)
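
The three response layouts can be told apart by the status byte alone. The parser below is a sketch with hypothetical names; it assumes little-endian fields per §4 and treats every status other than 0x00 and 0x02 as an error frame.

```python
import struct

STATUS_NAMES = {
    0x00: "OK", 0x02: "DURABLE_ACK", 0x03: "SCHEMA_MISMATCH",
    0x05: "PARSE_ERROR", 0x06: "INTERNAL_ERROR",
    0x08: "SECURITY_ERROR", 0x09: "WRITE_ERROR",
}


def _read_tables(buf: bytes, pos: int) -> dict:
    """Read tableCount followed by (nameLen, name, seqTxn) entries."""
    (count,) = struct.unpack_from("<H", buf, pos)
    pos += 2
    tables = {}
    for _ in range(count):
        (name_len,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        name = buf[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (seq_txn,) = struct.unpack_from("<q", buf, pos)
        pos += 8
        tables[name] = seq_txn
    return tables


def parse_response(buf: bytes) -> dict:
    """Decode an OK, durable-ack, or error response frame."""
    status = buf[0]
    if status == 0x02:  # durable-ack frames carry no sequence number
        return {"status": "DURABLE_ACK", "tables": _read_tables(buf, 1)}
    (sequence,) = struct.unpack_from("<q", buf, 1)
    if status == 0x00:
        return {"status": "OK", "sequence": sequence,
                "tables": _read_tables(buf, 9)}
    (msg_len,) = struct.unpack_from("<H", buf, 9)
    name = STATUS_NAMES.get(status, f"UNKNOWN_0x{status:02X}")
    return {"status": name, "sequence": sequence,
            "message": buf[11:11 + msg_len].decode("utf-8")}
```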

Durable-Upload Acknowledgment

The base OK frame confirms that a client message has been committed to the server's local WAL. It does not confirm durability beyond the primary node.

To receive a second, stronger acknowledgment after the WAL containing the commit has reached the configured object store, a client includes X-QWP-Request-Durable-Ack: true (case-insensitive) in the WebSocket upgrade request.

Server Behaviour

  • Servers without primary replication enabled silently ignore the request header and never emit STATUS_DURABLE_ACK frames.
  • Servers with primary replication that accept the opt-in echo X-QWP-Durable-Ack: enabled in the 101 upgrade response. The presence of this confirmation header is the only handshake-time signal that a given connection will receive durable-ack frames -- absence means the registry is not installed for this engine.
  • Confirming servers emit cumulative STATUS_DURABLE_ACK frames as the upload watermark advances. Delivery is piggy-backed on connection activity: frames are flushed whenever the connection next sends or receives a message, a PING, or a CLOSE. Idle connections that need prompt notification should send a WebSocket PING periodically.
  • The durable-ack watermark always trails the regular OK watermark.
  • There is no durable-failure status; persistent upload failures surface only as absence of a durable-ack frame within an expected window.
  • Empty messages (those that produced no WAL commit, e.g. only referencing materialized views) are trivially durable and their sequence advances the durable watermark as soon as all preceding messages are durable.

Client Behaviour

  • A client that opts in via the request header MUST verify the X-QWP-Durable-Ack: enabled confirmation in the 101 response, and MUST fail the connect attempt loudly when it is absent. Silently waiting for durable-ack frames against a server that will never emit them lets the client's store-and-forward log grow unbounded until disk fills.
  • A client that opted in MUST drive its store-and-forward trim from STATUS_DURABLE_ACK frames only -- regular OK frames acknowledge that the bytes are safely committed to the primary's WAL but not that they are durable beyond it, so trimming on OK in this mode would lose data on a primary failure. OK frames are still tracked (they identify the per-table seqTxns later durable acks must cover) but they do not advance the trim watermark.
  • A client that opted in SHOULD send a WebSocket PING periodically while there are pending durable confirmations and there is no organic outbound traffic. The OSS server has no background flush queue for durable-ack frames; it only flushes them when the connection's file descriptor wakes on inbound activity (binary message, PING, or CLOSE). An idle connection that has finished publishing would otherwise never see the watermark advance and its store-and-forward log would not trim. Organic frames count as inbound activity and suppress the PING; the PING is a filler, not a fixed-cadence keepalive. The reference Java client sends a 2-byte PING every durable_ack_keepalive_interval_millis (default 200 ms) while pendingDurable is non-empty AND no other frame has been sent in the interval; see sf-client.md §11.
  • Reconnects discard any in-flight durable-ack tracking. The new connection re-OKs replayed batches and the server re-emits cumulative durable-ack watermarks from scratch, so trim must restart against the new connection's wire sequencing.

Connect-string

The QuestDB ILP client exposes the opt-in via the request_durable_ack parameter. Allowed values are on and off (default off); any other value is a configuration error. The parameter is meaningful only on the WebSocket transport.

ws::addr=host:9000;sf_dir=/var/lib/qwp;request_durable_ack=on;
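
For illustration, the sketch below splits a connect string of the shape shown above and validates `request_durable_ack`. The function names are hypothetical, and the parser ignores any escaping rules the real client may apply to values, which this document does not describe.

```python
def parse_connect_string(conf: str) -> tuple[str, dict[str, str]]:
    """Split 'scheme::key=value;key=value;' into (scheme, params)."""
    scheme, sep, rest = conf.partition("::")
    if not sep:
        raise ValueError("missing '::' after the scheme")
    params: dict[str, str] = {}
    for item in rest.split(";"):
        if not item:
            continue  # tolerate the trailing ';'
        key, eq, value = item.partition("=")
        if not eq:
            raise ValueError(f"expected key=value, got {item!r}")
        params[key] = value
    return scheme, params


def durable_ack_requested(params: dict[str, str]) -> bool:
    """Only 'on' and 'off' are allowed; the default is 'off'."""
    value = params.get("request_durable_ack", "off")
    if value not in ("on", "off"):
        raise ValueError("request_durable_ack must be 'on' or 'off'")
    return value == "on"
```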

14. Protocol Limits

Limit                          Default Value
─────────────────────────────────────────────
Max batch size                 16 MB
Max tables per connection      10,000
Max rows per table             1,000,000
Max columns per table          2,048
Max table name length          127 bytes
Max column name length         127 bytes
Max in-flight batches          128
Max symbol dictionary entries  1,000,000

The header's table_count field is a uint16, so the protocol ceiling for tables per message is 65,535 regardless of the configured limit. Individual string values have no dedicated length limit; they are bounded only by the max batch size.

The symbol dictionary limit applies per column in Per-Table Dictionary Mode and per connection in Global Delta Dictionary Mode (see §12). Exceeding it causes the server to reject the message with PARSE_ERROR.

15. Client Operation

This section describes the high-level batching and registry behaviour every client implements. The full client-side substrate — on-disk Store-and-Forward storage, frame-sequence-number model, ACK-driven trim, durable-ack handshake, keepalive PING, reconnect/replay semantics, error categories and policies — is specified separately in sf-client.md. Cross-language client implementers should treat that document as normative for SF-mode behaviour; this section is a sketch.

15.1 Double-Buffered Async I/O

The client uses double-buffered microbatches:

  1. The user thread writes rows to the active buffer.
  2. When a buffer reaches its threshold (row count, byte size, or age), the client seals it and enqueues it for sending.
  3. A dedicated I/O thread sends batches over the WebSocket.
  4. The client swaps to the other buffer so writing can continue without blocking.

15.2 Auto-Flush Triggers

Trigger               Default
──────────────────────────────
Row count             1,000 rows
Byte size             disabled
Time since first row  100 ms
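
The three triggers combine as a simple "any of" check. The sketch below uses a hypothetical `AutoFlushPolicy` class with the defaults from the table; timestamps are passed in explicitly so the logic stays deterministic and easy to test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutoFlushPolicy:
    max_rows: Optional[int] = 1000            # None disables the trigger
    max_bytes: Optional[int] = None           # disabled by default
    max_age_seconds: Optional[float] = 0.1    # 100 ms since the first row

    def should_flush(self, rows: int, size_bytes: int,
                     first_row_at: float, now: float) -> bool:
        """True when any enabled trigger has fired for the active buffer."""
        if rows == 0:
            return False  # nothing buffered, nothing to send
        if self.max_rows is not None and rows >= self.max_rows:
            return True
        if self.max_bytes is not None and size_bytes >= self.max_bytes:
            return True
        if (self.max_age_seconds is not None
                and now - first_row_at >= self.max_age_seconds):
            return True
        return False
```

In a real client the two time arguments would typically come from a monotonic clock such as `time.monotonic()`.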

15.3 Schema Registry

  • First batch for a given table: full schema mode (0x00) with a new schema ID.
  • Subsequent batches with an unchanged column set: schema reference mode (0x01) with the same ID.
  • When a table gains a column, the client assigns a new schema ID and sends it in full mode.
  • Schema IDs are global per connection, not per table; the server registers them in a per-connection registry.
  • On reconnect both sides reset: the client reassigns IDs from 0 and the server clears its registry.

15.4 Symbol Dictionary Lifecycle

  • The client maintains a global symbol dictionary across all tables/columns.
  • Symbol IDs are assigned sequentially starting from 0.
  • Each batch sends only the delta (newly added symbols since the last batch).
  • The server accumulates these deltas for the lifetime of the connection.
  • Upon connection loss, both sides reset the dictionary.

15.5 Initial Connect and Failover

Ingress senders use the cursor-engine reconnect loop documented in sf-client.md §13.6, regardless of whether sf_dir is configured. The two storage modes share identical failover semantics — host tracker, equal-jitter backoff, initial_connect_retry policy, reconnect_max_duration_millis outage budget, mid-stream demote, role-reject handling, terminal classification. They differ only in where the unacked buffer lives:

  • sf_dir set (store-and-forward): segments are mmap'd files under sf_dir. Unacked data survives sender restarts and is replayed by the next sender bound to the same slot. Orphan slots from prior sender processes can be adopted (see sf-client.md §18).
  • sf_dir unset (memory-mode): segments are malloc'd in process memory. Unacked data is lost if the sender process dies. The reconnect loop still spans transient server outages such as rolling upgrades, but the RAM buffer caps how much data can pile up during the outage. Operators who want to surface a stuck server sooner should lower reconnect_max_duration_millis below the 5-minute default; senders that need durability across sender restarts must opt into SF (sf_dir=...).

Host selection consumes the shared primitives in failover.md: connect-string keys (§1.1), host-health model and (state, zone_tier) priority lattice (§2), role filter (§5), and error classification (§6). Ingress is zone-blind in both storage modes — it pins QWP v1 and never reads SERVER_INFO, so every host's zone tier is Same and selection degenerates to state-only ordering. The zone= connect-string key is accepted but silently ignored, so a connect string shared with egress clients works unchanged on ingress.

Connect-string knobs are documented at their canonical home in sf-client.md §4.2 — the same keys apply whether or not sf_dir is set:

  • reconnect_max_duration_millis (default 300_000)
  • reconnect_initial_backoff_millis (default 100)
  • reconnect_max_backoff_millis (default 5_000)
  • initial_connect_retry (default off; other accepted values: on, sync, async)
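To show how the three timing knobs interact, here is a sketch of an equal-jitter backoff schedule. It assumes the ceiling doubles after each failed attempt and that time spent inside connect attempts is ignored. Neither detail is stated in this section, and the function name is hypothetical. The canonical behaviour is defined in sf-client.md.

```python
import random


def backoff_delays_millis(initial=100, maximum=5_000, budget=300_000,
                          rng=random.random):
    """Yield sleep durations (ms) until the outage budget is spent.

    Equal jitter: half of the current ceiling is fixed and the other half
    is random, so each delay lies in [ceiling / 2, ceiling].
    """
    elapsed = 0.0
    ceiling = initial
    while elapsed < budget:
        delay = ceiling / 2 + rng() * (ceiling / 2)
        delay = min(delay, budget - elapsed)   # never overshoot the budget
        yield delay
        elapsed += delay
        ceiling = min(ceiling * 2, maximum)    # assumed doubling, capped
```

With the defaults, the ceiling reaches reconnect_max_backoff_millis after six doublings, so most of a long outage is spent retrying at 2.5 to 5 second intervals.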

Per-host upgrade-error classification follows failover.md §6: 401/403 → AuthError (terminal); 421 + X-QuestDB-Role → role reject (transient if PRIMARY_CATCHUP, topology otherwise). All other upgrade errors are transient and feed into the reconnect loop, including 404, 426, 503, generic 4xx/5xx, TCP/TLS failures, mid-stream send/recv errors, and an upgrade response that advertises a QWP version outside the client's supported range. The version-mismatch case is classified per endpoint, so a host in the middle of a rolling upgrade does not lock the client out of compatible peers.
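The HTTP-status part of this classification can be sketched as a single function. The function name and the returned labels are hypothetical. Treating a 421 without the X-QuestDB-Role header as an ordinary transient error is an assumption based on the "all other upgrade errors" rule.

```python
def classify_upgrade_error(status, role=None):
    """Classify a failed HTTP upgrade by status code.

    `role` is the value of the X-QuestDB-Role response header, if present.
    Returns one of "terminal", "transient", "topology" (hypothetical labels).
    """
    if status in (401, 403):
        return "terminal"                      # AuthError: do not retry
    if status == 421 and role is not None:
        # Role reject: a catching-up primary is expected to become usable.
        return "transient" if role == "PRIMARY_CATCHUP" else "topology"
    return "transient"                         # 404, 426, 503, other 4xx/5xx
```

Transport-level failures (TCP/TLS errors, mid-stream send/recv errors) never reach this function. Per the text above they are always transient.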

16. Examples

Example 1: Simple Message with One Table

Table: sensors, 2 rows, 3 columns: id (LONG), value (DOUBLE), ts (TIMESTAMP). No nulls.

# Header (12 bytes)
51 57 50 31  # Magic: "QWP1"
01           # Version: 1
00           # Flags: none
01 00        # Table count: 1
XX XX XX XX  # Payload length

# Table Block
07           # Table name length: 7
73 65 6E 73 6F 72 73  # "sensors" UTF-8
02           # Row count: 2
03           # Column count: 3

# Schema (full mode)
00           # Schema mode: full
00           # Schema ID: 0

# Column 0: id
02           # Name length: 2
69 64        # "id" UTF-8
05           # Type: LONG

# Column 1: value
05           # Name length: 5
76 61 6C 75 65  # "value" UTF-8
07           # Type: DOUBLE

# Column 2: ts
02           # Name length: 2
74 73        # "ts" UTF-8
0A           # Type: TIMESTAMP

# Column 0 data (LONG, no nulls, 2 values)
00                       # null_flag: 0x00 (no nulls)
01 00 00 00 00 00 00 00  # id=1
02 00 00 00 00 00 00 00  # id=2

# Column 1 data (DOUBLE, no nulls, 2 values)
00                       # null_flag: 0x00 (no nulls)
CD CC CC CC CC CC F4 3F  # value=1.3
9A 99 99 99 99 99 01 40  # value=2.2

# Column 2 data (TIMESTAMP, no nulls, uncompressed, 2 values)
00                       # null_flag: 0x00 (no nulls)
00 E4 0B 54 02 00 00 00  # ts=10000000000 microseconds
80 1A 06 00 00 00 00 00  # ts=400000 microseconds
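The message above can be reproduced with a short script, which also fills in the payload length left as XX. The sketch assumes little-endian fixed-width fields, a uint32 payload length, and unsigned LEB128 varints for lengths, counts, and the schema ID. Every such value here fits in one byte, so the output matches the dump either way. The helper names are hypothetical.

```python
import struct


def varint(n):
    """Unsigned LEB128 (assumed varint format): 7 bits per byte, low first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def name(text):
    """Length-prefixed UTF-8 string."""
    raw = text.encode("utf-8")
    return varint(len(raw)) + raw


LONG, DOUBLE, TIMESTAMP = 0x05, 0x07, 0x0A   # type codes from the example

# Table block: name, row count, column count.
payload = name("sensors") + varint(2) + varint(3)
# Schema: full mode (0x00), schema ID 0, then one entry per column.
payload += b"\x00" + varint(0)
for column, type_code in (("id", LONG), ("value", DOUBLE), ("ts", TIMESTAMP)):
    payload += name(column) + bytes([type_code])
# Column data: null_flag 0x00 followed by fixed-width values.
payload += b"\x00" + struct.pack("<2q", 1, 2)
payload += b"\x00" + struct.pack("<2d", 1.3, 2.2)
payload += b"\x00" + struct.pack("<2q", 10_000_000_000, 400_000)

# Header: magic, version 1, flags 0, table count (uint16), payload length.
header = b"QWP1" + bytes([0x01, 0x00]) + struct.pack("<HI", 1, len(payload))
message = header + payload
```

Under these assumptions the payload is 78 bytes, so the length field is 4E 00 00 00 and the whole message is 90 bytes.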

Example 2: Nullable VARCHAR Column

A nullable VARCHAR column with 4 rows, where row 1 (zero-based) is null and rows 0, 2, and 3 hold "foo", "bar", and "baz":

# Null flag + bitmap for 4 rows where row 1 is null
01           # null_flag: nonzero = bitmap follows
02           # 0b00000010 - bit 1 set

# Offset array (3 non-null values = 4 offsets)
00 00 00 00  # offset[0] = 0  (start of "foo")
03 00 00 00  # offset[1] = 3  (end of "foo", start of "bar")
06 00 00 00  # offset[2] = 6  (end of "bar", start of "baz")
09 00 00 00  # offset[3] = 9  (end of "baz")

# String data (concatenated UTF-8)
66 6F 6F     # "foo" (row 0)
62 61 72     # "bar" (row 2, since row 1 is null)
62 61 7A     # "baz" (row 3)
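The layout above can be generated with the following sketch. It assumes the bitmap is ceil(row_count / 8) bytes with the least significant bit first, and that offsets are uint32 little-endian. Both are consistent with the bytes shown. The function name is hypothetical.

```python
import struct


def encode_varchar_column(values):
    """Encode a list of str-or-None as null flag, bitmap, offsets, data."""
    non_null = [v.encode("utf-8") for v in values if v is not None]
    out = bytearray()
    if len(non_null) == len(values):
        out.append(0x00)                      # no nulls, no bitmap
    else:
        out.append(0x01)                      # nonzero: bitmap follows
        bitmap = bytearray((len(values) + 7) // 8)
        for row, value in enumerate(values):
            if value is None:
                bitmap[row // 8] |= 1 << (row % 8)   # set bit = null row
        out += bitmap
    # N non-null values need N + 1 offsets; the first is always 0.
    offset = 0
    out += struct.pack("<I", offset)
    for raw in non_null:
        offset += len(raw)
        out += struct.pack("<I", offset)
    out += b"".join(non_null)                 # concatenated UTF-8 data
    return bytes(out)


column = encode_varchar_column(["foo", None, "bar", "baz"])
```

Null rows occupy no space in the offset array or the string data. Only the bitmap records their position.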

Example 3: Symbol Column

3 rows with values: "us", "eu", "us" (per-table dictionary mode):

# Null flag
00           # null_flag: 0x00 (no nulls)

# Dictionary
02           # Dictionary size: 2 entries

02           # Entry 0 length: 2
75 73        # "us"

02           # Entry 1 length: 2
65 75        # "eu"

# Value indices
00           # Row 0: index 0 ("us")
01           # Row 1: index 1 ("eu")
00           # Row 2: index 0 ("us")
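A sketch of the same encoding for a column without nulls follows. It assumes dictionary entries are written in first-seen order and that the dictionary size, entry lengths, and indices are unsigned LEB128 varints, which are single bytes in this example. The function names are hypothetical.

```python
def varint(n):
    """Unsigned LEB128 (assumed varint format)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_symbol_column(values):
    """Per-table dictionary mode, no nulls: flag, dictionary, indices."""
    dictionary = {}
    for value in values:
        # First occurrence gets the next index; repeats keep their index.
        dictionary.setdefault(value, len(dictionary))
    out = bytearray([0x00])                   # null_flag: no nulls
    out += varint(len(dictionary))            # dictionary size
    for symbol in dictionary:                 # dicts keep insertion order
        raw = symbol.encode("utf-8")
        out += varint(len(raw)) + raw
    for value in values:
        out += varint(dictionary[value])      # one index per row
    return bytes(out)


column = encode_symbol_column(["us", "eu", "us"])
```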

Example 4: Single Table with Gorilla + Delta Symbol Dictionary

A message with 1 table ("sensors"), 2 rows, 3 columns (symbol "host", double "temp", designated timestamp):

Header (12 bytes):
  51 57 50 31   -- Magic: "QWP1"
  01            -- Version: 1
  0C            -- Flags: 0x04 (Gorilla) | 0x08 (Delta Symbol Dict)
  01 00         -- Table count: 1
  XX XX XX XX   -- Payload length (computed)

Payload:
  Delta Symbol Dictionary:
    00          -- delta_start = 0
    02          -- delta_count = 2
    07 73 65 72 76 65 72 31  -- "server1" (len=7)
    07 73 65 72 76 65 72 32  -- "server2" (len=7)

  Table Block:
    Table Header:
      07 73 65 6E 73 6F 72 73  -- table name "sensors" (len=7)
      02                       -- row_count = 2
      03                       -- column_count = 3

    Schema (full mode):
      00                       -- schema_mode = FULL
      00                       -- schema_id = 0
      04 68 6F 73 74  09       -- "host" : SYMBOL
      04 74 65 6D 70  07       -- "temp" : DOUBLE
      00              0A       -- "" : TIMESTAMP (designated)

    Column 0 (SYMBOL, global IDs):
      00                       -- null_flag: no nulls
      00                       -- row 0: global ID 0
      01                       -- row 1: global ID 1

    Column 1 (DOUBLE, 2 x 8 bytes):
      00                       -- null_flag: no nulls
      66 66 66 66 66 E6 56 40  -- 91.6
      9A 99 99 99 99 19 57 40  -- 92.4

    Column 2 (TIMESTAMP, Gorilla):
      00                       -- null_flag: no nulls
      01                       -- encoding = Gorilla
      [8 bytes: t0]
      [8 bytes: t1]
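The flags byte and the delta symbol dictionary section of this message can be reproduced as follows. The constant and function names are hypothetical. The flag values are the ones shown in the header above, and varints are assumed to be unsigned LEB128.

```python
FLAG_GORILLA = 0x04             # value taken from the example header
FLAG_DELTA_SYMBOL_DICT = 0x08   # value taken from the example header


def varint(n):
    """Unsigned LEB128 (assumed varint format)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_delta_dictionary(delta_start, symbols):
    """delta_start, delta_count, then length-prefixed UTF-8 symbols."""
    out = bytearray(varint(delta_start) + varint(len(symbols)))
    for symbol in symbols:
        raw = symbol.encode("utf-8")
        out += varint(len(raw)) + raw
    return bytes(out)


flags = FLAG_GORILLA | FLAG_DELTA_SYMBOL_DICT
section = encode_delta_dictionary(0, ["server1", "server2"])
```

A later batch that adds a third symbol would send delta_start = 2 and delta_count = 1, and its SYMBOL column would refer to the new entry as global ID 2.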

17. Reference Implementation

The authoritative implementation lives in QuestDB's Java codebase under core/src/main/java/io/questdb/cutlass/qwp/protocol/. That directory contains the header and varint parsers, the schema registry, the message and table-block cursors, and the type-specific column decoders.

18. Version History

| Version  | Description                     |
|----------|---------------------------------|
| 1 (0x01) | Initial binary protocol release |