docs/qwp/wire-egress.md
This document specifies QWP's egress mode: SQL queries in, columnar query
results out, over a dedicated WebSocket endpoint. It extends the ingress
specification (wire-ingress.md) by reusing the
header, type system, and column encodings unchanged. The deltas are limited to:
QWP egress streams SQL query results to clients using the same wire encoding QWP ingress uses for inbound data. Key properties:
- Schema mode 0x00 (full) on the first batch of a query; mode 0x01 (reference) on
  subsequent batches with the same column set.
- Byte-credit flow control: the client grants the server N bytes of result data. The server pauses production once the
  credit window is exhausted. A row floor guarantees forward progress.
- QUERY_REQUEST produces zero or more RESULT_BATCH frames followed by
  exactly one terminator (RESULT_END or QUERY_ERROR).

Egress runs over RFC 6455 WebSocket binary frames on a new endpoint:
GET /read/v1
The new endpoint exists for two reasons:
/write/v4 accepts
only DATA_BATCH, /read/v1 accepts only egress kinds.

Mixed-mode clients open one connection per direction.
Egress shares the QWP version namespace with ingress. Version and compression are negotiated at the HTTP upgrade:
| Header | Direction | Required | Description |
|---|---|---|---|
| X-QWP-Max-Version | C -> S | No | Maximum QWP version the client supports. Defaults to 1 if absent. |
| X-QWP-Client-Id | C -> S | No | Free-form client identifier (e.g., java-egress/1.0.0). |
| X-QWP-Accept-Encoding | C -> S | No | Comma-separated list of acceptable RESULT_BATCH body encodings (see below). |
| X-QWP-Max-Batch-Rows | C -> S | No | Client-preferred per-batch row cap. Decimal integer; 0 or absent = server default. The server clamps to its own hard limit, so this only ever asks for smaller batches (lower latency to first row, more per-batch overhead). |
| X-QWP-Version | S -> C | Yes | Negotiated version = min(clientMax, serverMax). |
| X-QWP-Content-Encoding | S -> C | No | Server's selected encoding from the client's accept list. Omitted means raw. |
Egress was introduced at version 1. Version 2 adds an unsolicited
SERVER_INFO frame (§11.8) delivered as the first WebSocket frame after the
upgrade response; the frame carries the server's replication role,
cluster/node identity, and a capabilities bitfield so clients can route reads
to primary vs replica. Ingest is pinned to version 1 because the v2 bump has
no ingest semantics.
The connection-level contract from the ingress spec applies: every message's header version byte must equal the negotiated version, and the server rejects mismatches with a parse error and closes the WebSocket.
X-QWP-Accept-Encoding is a comma-separated list of tokens. Each token is
name or name;param=value. First match wins. Supported names:
- raw (or identity) - no compression.
- zstd - whole-RESULT_BATCH-body zstd compression. Optional parameter
  level=N is a client-side hint; the server clamps to [1, 9] because
  levels 10+ drop to <20 MB/s compress throughput. The default level is 1
  -- the cheapest server-side CPU; raise it only when you measure a real
  ratio improvement on your payload and have the headroom.

The server echoes its choice in X-QWP-Content-Encoding (e.g.
zstd;level=1). When zstd is negotiated, individual RESULT_BATCH frames
set FLAG_ZSTD (§7) on a per-batch basis; a batch whose compressed form is
larger than its raw form ships raw. The region before the payload (msg_kind +
request_id + batch_seq) is never compressed so the client dispatcher can
route frames without paying the decompress cost.
Absent X-QWP-Accept-Encoding the server defaults to raw.
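
The selection rules above can be sketched as a small function. This is an illustration only, not the server's implementation: the function name, the case-insensitive token matching, and the fallback to raw when no token is supported are assumptions that the text does not state.

```python
from typing import Optional, Tuple

SUPPORTED = {"raw", "identity", "zstd"}
DEFAULT_ZSTD_LEVEL = 1
MIN_ZSTD_LEVEL, MAX_ZSTD_LEVEL = 1, 9


def select_encoding(accept_header: Optional[str]) -> Tuple[str, Optional[int]]:
    """Return (encoding, zstd_level) for the first supported token."""
    if not accept_header:
        return ("raw", None)  # header absent: the server defaults to raw
    for token in accept_header.split(","):
        name, _, param = token.strip().partition(";")
        name = name.strip().lower()  # assumption: names match case-insensitively
        if name not in SUPPORTED:
            continue  # unknown name: try the next token
        if name != "zstd":
            return ("raw", None)  # raw and identity both mean "no compression"
        level = DEFAULT_ZSTD_LEVEL
        key, _, value = param.partition("=")
        if key.strip().lower() == "level" and value.strip().isdigit():
            level = int(value.strip())
        # The client's level is only a hint; clamp it to [1, 9].
        level = max(MIN_ZSTD_LEVEL, min(MAX_ZSTD_LEVEL, level))
        return ("zstd", level)
    return ("raw", None)  # assumption: nothing matched, fall back to raw
```

With this sketch, `select_encoding("zstd;level=19, raw")` yields `("zstd", 9)`, which the server would echo as zstd;level=9.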
The egress header is byte-identical to the ingress header (12 bytes, little-endian throughout):
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| 0 | 4 | int32 | magic | "QWP1" (0x31505751) |
| 4 | 1 | uint8 | version | Negotiated QWP version |
| 5 | 1 | uint8 | flags | See ingress §7 |
| 6 | 2 | uint16 | table_count | 1 for RESULT_BATCH; 0 otherwise |
| 8 | 4 | uint32 | payload_length | Payload size in bytes |
The first byte of the payload is the message kind. The remaining payload
layout depends on the kind. For RESULT_BATCH the rest of the payload is the
existing ingress payload format (optional delta symbol dictionary followed
by exactly one table block); for control kinds the layout is defined per
kind in sections 6 through 11.
+-----------------------------------------+
| Header (12 bytes) |
+-----------------------------------------+
| Payload |
| +-----------------------------------+ |
| | msg_kind: uint8 | |
| | (kind-specific body) | |
| +-----------------------------------+ |
+-----------------------------------------+
Putting msg_kind in the payload (rather than the header) keeps the codec
shared with ingress: header parsing, framing, and length checks are identical.
Endpoint disambiguation is sufficient because connections are direction-pure.
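
As a rough illustration of the framing described above, the following sketch splits one message into its 12-byte header, msg_kind, and kind-specific body. The names `parse_frame` and `QwpHeader` are hypothetical, and the sketch assumes one complete QWP message per WebSocket binary frame.

```python
import struct
from typing import NamedTuple, Tuple

# magic (int32), version (uint8), flags (uint8), table_count (uint16),
# payload_length (uint32); little-endian, 12 bytes in total.
HEADER = struct.Struct("<iBBHI")
MAGIC = 0x31505751  # the bytes "QWP1" read as a little-endian int32


class QwpHeader(NamedTuple):
    version: int
    flags: int
    table_count: int
    payload_length: int


def parse_frame(frame: bytes, negotiated_version: int) -> Tuple[QwpHeader, int, bytes]:
    """Return (header, msg_kind, kind-specific body) for one message."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than the 12-byte header")
    magic, version, flags, table_count, payload_length = HEADER.unpack_from(frame, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if version != negotiated_version:
        raise ValueError("header version does not match the negotiated version")
    payload = frame[HEADER.size:HEADER.size + payload_length]
    if payload_length < 1 or len(payload) != payload_length:
        raise ValueError("payload missing or truncated")
    # The first payload byte is the message kind; the rest depends on the kind.
    return QwpHeader(version, flags, table_count, payload_length), payload[0], payload[1:]
```

A dispatcher would switch on the returned msg_kind and hand the body to the per-kind decoder.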
| Code | Name | Direction | Body |
|---|---|---|---|
| 0x00 | DATA_BATCH | C -> S | Defined in ingress spec (not used here) |
| 0x01 | RESPONSE | S -> C | Defined in ingress spec (not used here) |
| 0x10 | QUERY_REQUEST | C -> S | SQL plus bind parameters |
| 0x11 | RESULT_BATCH | S -> C | One table block of result rows |
| 0x12 | RESULT_END | S -> C | Cursor exhausted |
| 0x13 | QUERY_ERROR | S -> C | Mid-stream error |
| 0x14 | CANCEL | C -> S | Stop a running query |
| 0x15 | CREDIT | C -> S | Extend the byte window |
| 0x16 | EXEC_DONE | S -> C | Non-SELECT statement acknowledgement |
| 0x17 | CACHE_RESET | S -> C | Clear connection-scoped caches |
| 0x18 | SERVER_INFO | S -> C | Unsolicited, first frame on v2 only; carries role + cluster identity |
0x19 through 0x1F are reserved for future egress kinds (prepared
statements, transactions, server-driven keepalives). 0x20+ is reserved for
future protocol extensions.
Client to server. Initiates a new cursor.
+---------------------------------------------------------+
| msg_kind: uint8 0x10 |
| request_id: int64 Client-assigned, unique |
| within the connection |
| sql_length: varint UTF-8 byte length |
| sql_bytes: bytes SQL text |
| initial_credit: varint Bytes; 0 = unbounded |
| bind_count: varint Number of bind parameters |
| For each bind parameter (in declaration order): |
| bind_block: column_data Reuses ingress column |
| encoding with row_count = 1|
+---------------------------------------------------------+
A bind parameter is encoded exactly as a one-row column under the ingress
spec (§11 null bitmap, §12 column encoding). Each block begins with a
type_code: uint8 from the type table, followed by the standard null_flag
byte and either zero or one value.
Reusing the ingress encoding has two consequences:
- A NULL bind is type_code + null_flag = 0x01 + bitmap byte 0x01, with no value bytes following.

Server leniency note: the Phase 1 server decoder accepts a SYMBOL wire type code for a bind parameter and treats it identically to STRING (single UTF-8 value, dispatched to BindVariableService.setStr). Compliant clients should still send STRING. A future revision may tighten this to reject SYMBOL bind type codes.
request_id is a 64-bit client-assigned identifier. It is echoed back by every
server-to-client frame related to the query (RESULT_BATCH, RESULT_END,
QUERY_ERROR). The client may reuse a request_id only after it has
observed the terminator for the previous use.
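
A minimal sketch of building a QUERY_REQUEST payload with no bind parameters follows. It covers the payload only; the 12-byte header is omitted. Two assumptions are made here: that varint means unsigned LEB128 (the authoritative definition lives in the ingress spec) and that request_id is little-endian like the header fields. The function names are hypothetical.

```python
import struct

MSG_QUERY_REQUEST = 0x10


def encode_varint(value: int) -> bytes:
    """Unsigned LEB128. Assumed encoding; check the ingress spec."""
    if value < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_query_request(request_id: int, sql: str, initial_credit: int = 0) -> bytes:
    """Payload for a QUERY_REQUEST carrying zero bind parameters."""
    sql_bytes = sql.encode("utf-8")
    return b"".join([
        struct.pack("<Bq", MSG_QUERY_REQUEST, request_id),  # msg_kind + request_id
        encode_varint(len(sql_bytes)),                      # sql_length
        sql_bytes,                                          # sql_bytes
        encode_varint(initial_credit),                      # 0 = unbounded
        encode_varint(0),                                   # bind_count
    ])
```

Bind blocks are left out because their layout is defined by the ingress column encoding, which this section does not reproduce.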
The wire protocol allows a connection to have multiple in-flight queries:
the server may interleave their RESULT_BATCH frames freely and clients
demultiplex using request_id. The protocol does not constrain ordering
between requests.
Phase 1 (current implementation): a single in-flight query per connection.
The server rejects any second QUERY_REQUEST that arrives before the active
query has terminated (RESULT_END, EXEC_DONE, or QUERY_ERROR) with a
QUERY_ERROR carrying STATUS_PARSE_ERROR and a message naming the
limitation. Multi-query multiplexing requires a fair scheduler on the server
and is tracked in the Phase 2 backlog. SDK authors targeting Phase 1 should
serialise queries on a per-connection basis or open additional connections.
Server to client. Carries one table block of result rows.
+---------------------------------------------------------+
| msg_kind: uint8 0x11 |
| request_id: int64 From the originating |
| QUERY_REQUEST |
| batch_seq: varint Monotonic per request, |
| starting at 0 |
| (rest of payload is the ingress payload format: |
| optional delta symbol dictionary if FLAG_DELTA_SYMBOL_DICT
| is set in the header flags, then exactly one table block.)
+---------------------------------------------------------+
The header's table_count field is 1. The header's flags byte uses the
same bit definitions as ingress plus an egress-specific compression bit:
| Bit | Name | Meaning |
|---|---|---|
| 0x04 | FLAG_GORILLA | The batch may use per-column Gorilla delta-of-delta encoding on TIMESTAMP / TIMESTAMP_NANOS / DATE columns. |
| 0x08 | FLAG_DELTA_SYMBOL_DICT | The batch carries a connection-scoped delta symbol-dictionary section (§12). |
| 0x10 | FLAG_ZSTD | The payload region after the msg_kind / request_id / batch_seq prelude is zstd-compressed. |
FLAG_GORILLA and FLAG_DELTA_SYMBOL_DICT are always set on RESULT_BATCH
frames under Phase 1 (both features are unconditionally active). FLAG_ZSTD
is set on a per-batch basis when compression has been negotiated (§3) and
the compressed form is smaller than the raw form.
When FLAG_GORILLA is set, every TIMESTAMP / TIMESTAMP_NANOS / DATE column
carries a 1-byte encoding discriminator immediately before its value region:
0x00 = raw int64 values; 0x01 = Gorilla bitstream. The server picks
Gorilla when the column has at least three non-null values and the
delta-of-delta bitstream is smaller than nonNull * 8 bytes; unordered or
jumpy columns fall back to raw.
The schema section inside the table block follows the standard rules:
- First batch of a query: mode 0x00 (full), with a new
  schema_id assigned by the server.
- Subsequent batches with the same column set: mode 0x01
  (reference), reusing the same schema_id.

If the result set is empty, the server still sends one RESULT_BATCH with
row_count = 0 so the client receives the schema, followed by RESULT_END.
The table block's name field carries an empty string (name_length = 0).
Result sets do not have a table name; clients should ignore the field.
Server to client. Signals successful end of stream.
+---------------------------------------------------------+
| msg_kind: uint8 0x12 |
| request_id: int64 |
| final_seq: varint Sequence number of the last |
| RESULT_BATCH (or 0 if none) |
| total_rows: varint Total rows produced; 0 if not |
| tracked by the server |
+---------------------------------------------------------+
table_count in the header is 0.
After RESULT_END, the server has no further state for this request_id
and the client may reuse it.
Server to client. Signals failure at any point in the lifecycle. May arrive
before any RESULT_BATCH (parse failure, security failure) or partway
through a stream (storage failure, cancellation acknowledged, server
shutdown).
+---------------------------------------------------------+
| msg_kind: uint8 0x13 |
| request_id: int64 |
| status: uint8 See §15 |
| msg_length: uint16 UTF-8 byte length |
| msg_bytes: bytes Human-readable error message |
+---------------------------------------------------------+
table_count in the header is 0. QUERY_ERROR is terminal: the client
must not expect any further frames for this request_id.
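
Because QUERY_ERROR uses only fixed-width fields, decoding it is straightforward. The sketch below assumes the payload fields are little-endian like the header; the names `QueryError` and `decode_query_error` are hypothetical.

```python
import struct
from typing import NamedTuple

MSG_QUERY_ERROR = 0x13
# msg_kind (uint8), request_id (int64), status (uint8), msg_length (uint16)
_FIXED = struct.Struct("<BqBH")


class QueryError(NamedTuple):
    request_id: int
    status: int
    message: str


def decode_query_error(payload: bytes) -> QueryError:
    if len(payload) < _FIXED.size:
        raise ValueError("QUERY_ERROR payload truncated")
    kind, request_id, status, msg_length = _FIXED.unpack_from(payload, 0)
    if kind != MSG_QUERY_ERROR:
        raise ValueError("not a QUERY_ERROR payload")
    msg = payload[_FIXED.size:_FIXED.size + msg_length]
    if len(msg) != msg_length:
        raise ValueError("error message truncated")
    return QueryError(request_id, status, msg.decode("utf-8"))
```

The status byte is returned as-is; mapping it to a named status belongs to the status table in §15.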
Client to server. Requests termination of a running query.
+---------------------------------------------------------+
| msg_kind: uint8 0x14 |
| request_id: int64 |
+---------------------------------------------------------+
The server acknowledges by emitting either RESULT_END (if the cursor
happened to finish first) or QUERY_ERROR with status CANCELLED. The
client must continue to drain any in-flight RESULT_BATCH frames the server
sent before processing the cancel; the terminator is the synchronization
point.
If request_id does not refer to an active query the server silently drops
the cancel.
Client to server. Extends the byte-credit window.
+---------------------------------------------------------+
| msg_kind: uint8 0x15 |
| request_id: int64 |
| additional_bytes: varint Bytes to add to the window |
+---------------------------------------------------------+
See §14 for the flow-control model.
QuestDB uses sentinel values (not a separate null bitmap) for several primitive types in its internal storage. The egress wire format inherits these conventions verbatim — the per-cell value IS the sentinel, and the row appears as NULL in the null bitmap. Implementations consuming egress should treat the following as indistinguishable from explicit NULL:
| QuestDB type | NULL sentinel | Notes |
|---|---|---|
| INT, IPv4 | Numbers.INT_NULL = Integer.MIN_VALUE for INT; Numbers.IPv4_NULL = 0 for IPv4 | The address 0.0.0.0 cannot be represented as a non-null IPv4. |
| LONG, DATE, TIMESTAMP, TIMESTAMP_NANOS, DECIMAL64 | Numbers.LONG_NULL = Long.MIN_VALUE | |
| FLOAT | Float.NaN | Any NaN, including 0.0f / 0.0f, is treated as NULL. |
| DOUBLE | Double.NaN | Same as FLOAT — any NaN is NULL. |
| GEOHASH (all widths) | -1 (sign-extends across BYTE/SHORT/INT/LONG storage) | A geohash whose bit pattern is "all ones" cannot be represented as non-null. |
| UUID | both halves Numbers.LONG_NULL | A UUID with both halves Long.MIN_VALUE is NULL. |
| LONG256 | all four longs Numbers.LONG_NULL | |
| BOOLEAN, BYTE, SHORT, CHAR | (no NULL sentinel) | These types cannot carry NULL in QuestDB; INSERT NULL stores false/0 and the wire row has the null bitmap bit clear. |
Servers writing egress and clients reading it MUST agree on these sentinels. The
spec does not introduce a separate "QWP NaN" representation — a row carrying
NaN in the dense values array is simultaneously marked NULL in the null bitmap.
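
The sentinel table can be expressed as a single predicate. This is a sketch under stated assumptions: values have already been decoded into signed Python integers or floats, UUID and LONG256 arrive as tuples of signed 64-bit halves, and the type names are plain strings chosen for illustration.

```python
import math

INT_NULL = -(2 ** 31)    # Integer.MIN_VALUE
LONG_NULL = -(2 ** 63)   # Long.MIN_VALUE
IPV4_NULL = 0
GEOHASH_NULL = -1        # all bits set at any storage width

_LONG_LIKE = {"LONG", "DATE", "TIMESTAMP", "TIMESTAMP_NANOS", "DECIMAL64"}
_NEVER_NULL = {"BOOLEAN", "BYTE", "SHORT", "CHAR"}


def is_null_sentinel(type_name: str, value) -> bool:
    """True when a decoded value is the NULL sentinel for its type."""
    if type_name == "INT":
        return value == INT_NULL
    if type_name == "IPv4":
        return value == IPV4_NULL
    if type_name in _LONG_LIKE:
        return value == LONG_NULL
    if type_name in ("FLOAT", "DOUBLE"):
        return math.isnan(value)  # any NaN counts as NULL
    if type_name == "GEOHASH":
        return value == GEOHASH_NULL
    if type_name == "UUID":
        return tuple(value) == (LONG_NULL, LONG_NULL)
    if type_name == "LONG256":
        return tuple(value) == (LONG_NULL,) * 4
    if type_name in _NEVER_NULL:
        return False  # these types have no NULL sentinel
    raise ValueError(f"unknown type: {type_name}")
```

A client can use such a predicate as a consistency check against the null bitmap rather than as the primary NULL signal.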
Array column types (DOUBLE_ARRAY 0x18, LONG_ARRAY 0x12) ship element bytes
verbatim with no per-element null bitmap. Element-level NULL is encoded by
re-using the element-type's row-level sentinel from the table above:
| Array element type | NULL element value | Round-trip risk |
|---|---|---|
| DOUBLE_ARRAY element | Double.NaN | A non-null NaN element (e.g. from 0.0 / 0.0) is indistinguishable from a NULL element on the wire. |
| LONG_ARRAY element | Long.MIN_VALUE (Numbers.LONG_NULL) | A non-null Long.MIN_VALUE element cannot be represented as non-null and round-trips as NULL. |
This matches QuestDB's in-engine semantics — the engine itself treats these sentinels as NULL throughout — so the wire format does not lose information relative to the source. Clients writing array values they expect to round-trip as non-null MUST avoid the sentinel bit patterns.
The row-level null bitmap bit (set for the whole array cell) signals "the array itself is NULL", distinct from "an array of zero or more elements, some of which may be element-NULL". A non-NULL empty array is a valid value.
Server to client. Terminates a non-SELECT QUERY_REQUEST (DDL, INSERT,
UPDATE, ALTER, DROP, TRUNCATE, CREATE TABLE, CREATE MAT VIEW, and any
parse-time-executed statement). No RESULT_BATCH frames are sent for these
statements — the stream collapses to a single acknowledgement.
+---------------------------------------------------------+
| msg_kind: uint8 0x16 |
| request_id: int64 |
| op_type: uint8 CompiledQuery.TYPE_* discriminator |
| rows_affected: varint INSERT / UPDATE row count; |
| 0 for statements without a row |
| count (DDL, TRUNCATE, etc.) |
+---------------------------------------------------------+
table_count in the header is 0. EXEC_DONE is terminal: the client must
not expect any further frames for this request_id.
If the statement fails, the server sends QUERY_ERROR (§9) instead.
Server to client. Clears one or both of the connection-scoped caches
(SYMBOL delta dict; schema-fingerprint cache) when the server-side usage
crosses a configured soft cap. Emitted at a query boundary (between the
previous query's terminator and the next query's first RESULT_BATCH or
EXEC_DONE); never mid-stream, so no in-flight frame references an id the
reset would invalidate.
+---------------------------------------------------------+
| msg_kind: uint8 0x17 |
| reset_mask: uint8 Bit 0 = SYMBOL dict |
| Bit 1 = schema-fingerprint cache |
| Bits 2-7 reserved, must be zero |
+---------------------------------------------------------+
table_count in the header is 0. No request_id: the frame targets
connection state, not a specific query.
- RESET_MASK_DICT: the peer clears its connection-scoped SYMBOL
  dictionary. After the reset, the dictionary is empty (size 0, heap position
  0). The next RESULT_BATCH carrying FLAG_DELTA_SYMBOL_DICT MUST start its
  delta section at deltaStart = 0; batches with the flag and a mismatching
  deltaStart are a protocol error (clients should raise a decode failure).
  The server MUST also clear any per-column native-key -> connId caches on
  its side; failing to do so would let a scratch hand back an id the reset
  dictionary has already dropped.
- RESET_MASK_SCHEMAS: the peer clears its connection-scoped
  schema registry. Every previously assigned schema_id is discarded. The
  next RESULT_BATCH that would have shipped SCHEMA_MODE_REFERENCE (§7.1)
  for a cached shape MUST instead ship SCHEMA_MODE_FULL with a freshly
  allocated id. Schema ids allocated after the reset may collide with
  previously-used values (the server's counter restarts at 0).

Both bits MAY be set in the same frame. Reserved bits MUST be zero on transmit; recipients MUST ignore any reserved bits that are set.
Each QuestDB server enforces configurable soft caps on three metrics across the two caches:
| Cap | Default | Triggers |
|---|---|---|
| Entry count in the SYMBOL delta dict | 100,000 | RESET_MASK_DICT |
| UTF-8 heap bytes in the SYMBOL delta dict | 8 MiB | RESET_MASK_DICT |
| Distinct registered schemas | 4,096 | RESET_MASK_SCHEMAS |
Actual cap values are implementation-defined; clients MUST accept any cap
policy and MUST be prepared to receive CACHE_RESET after any query
boundary, including on otherwise-identical workloads where a previous run
did not trigger one.
Resetting the dictionary or the schema registry while a RESULT_BATCH is
in flight would invalidate ids already referenced in that batch's row
payload. The server therefore postpones the reset until a natural query
boundary. This has two knock-on effects:
RESULT_BATCH may be preceded by a CACHE_RESET that resets the
counter. Clients SHOULD treat CACHE_RESET as transparent.

Example: the server's dict has grown past the 100k-entry cap. The client's next
QUERY_REQUEST triggers:
client -> QUERY_REQUEST(request_id=42, ...)
server -> CACHE_RESET(reset_mask=0x01) // dict bit only
server -> RESULT_BATCH(request_id=42, batch_seq=0, deltaStart=0, ...)
server -> RESULT_BATCH(request_id=42, batch_seq=1, ...)
server -> RESULT_END(request_id=42, ...)
If the schema cache is also over cap, the server emits a single
CACHE_RESET(reset_mask=0x03); the client clears both caches in one hop.
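
On the client side, handling CACHE_RESET reduces to testing two mask bits. The sketch below uses a hypothetical `ConnectionCaches` holder with a list for the symbol dictionary and a dict for the schema registry; real clients will have their own structures.

```python
MSG_CACHE_RESET = 0x17
RESET_MASK_DICT = 0x01
RESET_MASK_SCHEMAS = 0x02


class ConnectionCaches:
    """Hypothetical holder for the two connection-scoped client caches."""

    def __init__(self) -> None:
        self.symbols = []   # index = conn_id, value = symbol string
        self.schemas = {}   # schema_id -> column definitions

    def apply_cache_reset(self, payload: bytes) -> None:
        """Apply a CACHE_RESET payload (msg_kind byte + reset_mask byte)."""
        if len(payload) < 2 or payload[0] != MSG_CACHE_RESET:
            raise ValueError("not a CACHE_RESET payload")
        mask = payload[1]
        if mask & RESET_MASK_DICT:
            self.symbols.clear()   # the next delta section starts at 0
        if mask & RESET_MASK_SCHEMAS:
            self.schemas.clear()   # every schema_id is forgotten
        # Reserved bits 2-7 are ignored if set, as required of recipients.
```

Applying `b"\x17\x03"` clears both caches in one step, matching the combined reset in the example above.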
Server to client. Unsolicited frame delivered as the first WebSocket frame
after the HTTP upgrade response, and only when the negotiated version is 2
or above. A v1-only client never sees it; a v2 client must consume it before
submitting the first QUERY_REQUEST.
+---------------------------------------------------------+
| msg_kind: uint8 0x18 |
| role: uint8 see role table below |
| epoch: uint64 monotonic role epoch |
| capabilities: uint32 bitfield, see below |
| server_wall_ns: int64 server wall-clock, ns since |
| 1970-01-01T00:00Z |
| cluster_id_len: uint16 UTF-8 byte length |
| cluster_id: bytes UTF-8, up to 65535 bytes |
| node_id_len: uint16 UTF-8 byte length |
| node_id: bytes UTF-8, up to 65535 bytes |
| --- present iff capabilities & CAP_ZONE --- |
| zone_id_len: uint16 UTF-8 byte length |
| zone_id: bytes UTF-8, up to 65535 bytes |
+---------------------------------------------------------+
| Value | Role | Semantics |
|---|---|---|
| 0x00 | STANDALONE | No replication configured. OSS single-node default; behaves like a primary for routing. |
| 0x01 | PRIMARY | Authoritative write node; reads see latest commits. |
| 0x02 | REPLICA | Read-only replica; reads may lag the primary by up to the replication poll interval. |
| 0x03 | PRIMARY_CATCHUP | Promotion in flight; behaves like a primary but is still uploading in-flight segments. |
The epoch field is monotonic across role transitions on the same node
(replica promoted to primary, primary demoted to replica). Clients tracking a
specific primary use it to refuse a stale reconnect that lands on a node
which no longer believes it is primary at the current cluster epoch. The
field is 0 on releases where no fencing has been wired up yet; clients should
treat it as a hint rather than a guarantee, and it is safe to ignore.
cluster_id and node_id are free-form identifiers supplied by the server
operator. Clients may surface them in diagnostics and in error messages
produced by the role filter (§"Client routing" below).
The capabilities field is a bitfield of optional fields and protocol
extensions. v2.0 servers and clients set it to zero. Defined bits:
| Bit | Name | Meaning |
|---|---|---|
| 0x00000001 | CAP_ZONE | Server appends zone_id_len + zone_id after node_id. Identifies the server's geographic / logical zone (e.g. eu-west-1a, dc-amsterdam); used by clients with zone= set on the connection string to prefer same-zone endpoints. See failover.md §2 and §5. |
Higher bits remain reserved for future protocol extensions
(freshness-watermark reads, multi-query multiplexing, etc.). Clients
encountering an unknown capability bit MUST ignore it; servers MUST
only set bits the negotiated wire revision defines. Trailing fields
gated by unset bits are absent from the frame, so a v2.0 client reading
a v2.1 server with CAP_ZONE=0 sees the same byte layout it always did.
SERVER_INFO is delivered in the same TCP/WebSocket send buffer as the 101
upgrade response, so on a healthy connection the frame is already in the
client's kernel recv buffer by the time the client parses the upgrade. If the
server negotiates v1 it omits the frame entirely and clients fall back to
the "role unknown" path (equivalent to STANDALONE for routing purposes).
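
A decoding sketch for the SERVER_INFO payload follows, including the trailing zone_id that is present only when CAP_ZONE is set. It assumes little-endian payload fields, consistent with the header; `ServerInfo` and `parse_server_info` are hypothetical names.

```python
import struct
from typing import NamedTuple, Optional, Tuple

MSG_SERVER_INFO = 0x18
CAP_ZONE = 0x00000001
# msg_kind (uint8), role (uint8), epoch (uint64), capabilities (uint32),
# server_wall_ns (int64)
_FIXED = struct.Struct("<BBQIq")


class ServerInfo(NamedTuple):
    role: int
    epoch: int
    capabilities: int
    server_wall_ns: int
    cluster_id: str
    node_id: str
    zone_id: Optional[str]


def _read_str(buf: bytes, offset: int) -> Tuple[str, int]:
    """Read a uint16-length-prefixed UTF-8 string; return (text, next offset)."""
    (length,) = struct.unpack_from("<H", buf, offset)
    start, end = offset + 2, offset + 2 + length
    if end > len(buf):
        raise ValueError("string runs past the end of the payload")
    return buf[start:end].decode("utf-8"), end


def parse_server_info(payload: bytes) -> ServerInfo:
    kind, role, epoch, caps, wall_ns = _FIXED.unpack_from(payload, 0)
    if kind != MSG_SERVER_INFO:
        raise ValueError("not a SERVER_INFO payload")
    cluster_id, off = _read_str(payload, _FIXED.size)
    node_id, off = _read_str(payload, off)
    zone_id = None
    if caps & CAP_ZONE:  # trailing field exists only when the bit is set
        zone_id, off = _read_str(payload, off)
    # Unknown capability bits are ignored, as the text requires.
    return ServerInfo(role, epoch, caps, wall_ns, cluster_id, node_id, zone_id)
```

A client that negotiated v1 never calls this and treats the role as unknown.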
Client routing (target=, zone=, failover=)

Egress clients that support v2 accept a comma-separated list of endpoints plus role and zone preferences on the connection string:
ws::addr=db-a:9000,db-b:9000,db-c:9000;target=any;zone=eu-west-1a;failover=on;
The per-Execute reconnect loop and WalkTracker helper that consume
SERVER_INFO (and the 421 + X-QuestDB-Role / X-QuestDB-Zone
upgrade-reject convention) are specified in §11.9. The shared
primitives that both consume — host-health model, backoff function,
role filter, error classification — live in
failover.md. At the wire level §11.8 only fixes:
- Role byte values inside SERVER_INFO (the role table above)
  and their mapping to the target= filter (see failover.md §5).
- CAP_ZONE capability bit and the optional zone_id field that
  follows node_id when the bit is set; clients compare zone_id
  case-insensitively against their configured zone=.
- OnFailoverReset handler callback contract: fired between a
  successful reconnect and the first replayed batch on the new node.
  batch_seq restarts at 0 after the callback returns. See §11.9.
- 421 + X-QuestDB-Role (and optional X-QuestDB-Zone)
  upgrade-reject convention shared with ingress (see failover.md §5).

failover=off restores the pre-v2 behaviour where transport failures
surface directly through onError (no automatic reconnect).
Egress clients drive a per-Execute() reconnect loop on top of the
shared primitives in failover.md. The connect-string
knobs are owned here; the host tracker (§2 of failover.md), the
backoff function (§3), the role filter (§5), and the error
classification (§6) are imported.
| Key | Type | Default | Notes |
|---|---|---|---|
| target | any \| primary \| replica | any | Server-role filter applied per-endpoint after the upgrade reads SERVER_INFO (§11.8). |
| failover | on \| off | on | Master switch for the per-Execute reconnect loop. off surfaces transport errors directly through onError. |
| failover_max_attempts | int | 8 | Cap on reconnects per Execute() (initial attempt + N-1 failovers). |
| failover_backoff_initial_ms | int | 50 | First post-failure sleep. |
| failover_backoff_max_ms | int | 1_000 | Cap on per-attempt sleep. |
| failover_max_duration_ms | int | 30_000 | Total wall-clock budget per Execute(); 0 = unbounded. |
Common keys (addr, auth_timeout_ms, zone) live in failover.md
§1.1.
on QueryClient.New(): connect once via WalkTracker (§11.9.3); fail loud if no endpoint matches

on Execute(sql, handler):
    if execute already in flight on this client: throw "one query at a time"
    tracker.BeginRound(forgetClassifications=false)
    attempt = 0
    backoffMs = failover_backoff_initial_ms
    deadline = now + failover_max_duration_ms (∞ if 0)
    loop:
        send QUERY_REQUEST(rid, sql, binds)
        drive receive loop until terminator
        on success: return
        on transport-error AND failover=on AND attempt+1 < failover_max_attempts AND now < deadline:
            tracker.RecordMidStreamFailure(activeIdx)
            sleep clamp(FullJitter(backoffMs), deadline - now)
            backoffMs = min(2 × backoffMs, failover_backoff_max_ms)
            attempt++
            ReconnectAsync(attempt) # uses WalkTracker (§11.9.3) with BeginRound(forget=false)
            handler.OnFailoverReset(serverInfo)
            continue
        else: rethrow (failover not eligible or budget exhausted)
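
The backoff arithmetic in the loop above can be worked through with the default knob values. This sketch only models the sleep bounds; it assumes FullJitter draws uniformly from [0, backoffMs), and the helper names are hypothetical.

```python
import random
from typing import Callable, List


def backoff_caps(initial_ms: int = 50, max_ms: int = 1_000, max_attempts: int = 8) -> List[int]:
    """Upper bound of each post-failure sleep, before jitter is applied."""
    caps = []
    backoff = initial_ms
    # One initial attempt plus at most max_attempts - 1 failovers.
    for _ in range(max_attempts - 1):
        caps.append(backoff)
        backoff = min(2 * backoff, max_ms)
    return caps


def full_jitter_sleep_ms(backoff_ms: float, remaining_ms: float,
                         rng: Callable[[], float] = random.random) -> float:
    """Uniform draw from [0, backoff_ms), clamped to the remaining budget."""
    return min(rng() * backoff_ms, max(remaining_ms, 0))
```

With the defaults, `backoff_caps()` returns `[50, 100, 200, 400, 800, 1000, 1000]`, so the sleeps alone add up to at most 3550 ms per Execute(); connection time comes on top of that.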
Key properties:
- Per-Execute rounds use forgetClassifications=false.
  Every Execute() enters with BeginRound(false): the
  attempted-this-round flags are cleared but topology classifications
  observed in prior Executes are preserved. A previously
  role-rejected endpoint stays at the bottom of the priority order;
  it is reconsidered once the current round walks off the end (see
  §11.9.3 fall-through reset). This is "lazy forget": the cost is one
  extra reconnect attempt the first time topology actually changes,
  the benefit is no per-Execute thundering against a known-bad host.
- Reconnects within an Execute() also use forgetClassifications=false:
  the topology classifications observed during this Execute()'s
  reconnects accumulate. Once the round exhausts within a single
  Execute(), WalkTracker calls BeginRound(forgetClassifications=true)
  once and walks the list one more time. If still no endpoint matches,
  fail with a role-mismatch error (carries the most recent observed
  SERVER_INFO).
- failover_max_duration_ms bounds failover eligibility, not total
  Execute wall-clock. The deadline check (now < deadline) gates
  whether the loop will sleep + reconnect; the sleep itself is clamped
  to deadline - now. But WalkTracker is bounded only by
  hostCount × auth_timeout_ms, with no internal deadline. With the defaults
  (failover_max_attempts = 8, auth_timeout_ms = 15_000), a
  single reconnect round can run up to ~120s after the deadline-check
  passed. Total wall-clock for a failing Execute() is therefore
  bounded by failover_max_duration_ms + (last WalkTracker round),
  not failover_max_duration_ms alone. Operators sizing the budget
  for user-visible latency should pick failover_max_duration_ms such
  that budget + worst-case walk fits the SLA.
- Backoff uses full jitter over [0, base) per failover.md
  §3.1. A query client is single-user and benefits from the lowest
  expected recovery time; thundering herd is not a concern at one
  client per workload.
- The OnFailoverReset callback fires after a successful reconnect
  but before any replayed batches arrive. Handlers reset their
  per-batch_seq state here (the new node restarts batch_seq at 0).
  Clients that omit the callback MUST instead surface the failure
  and have the user re-issue Execute from a clean accumulator.
- Authentication errors are not failed over: WalkTracker rethrows them
  without trying further hosts (failover.md §6).
- Version mismatch is handled per endpoint at upgrade time
  (failover.md §6). A host whose upgrade response advertises a
  QWP version outside [1, ClientMaxVersion] is recorded as a
  transport error and the walk continues; if every host disagrees the
  round-exhaustion error surfaces the version detail. There is no
  separate mid-stream "version mismatch": once an upgrade negotiates
  a version, it is fixed for the connection's lifetime, and a bad
  version byte appearing in a later frame is frame corruption (handled
  by the generic decode-error path).
- A client-initiated cancel (Cancel() API, out of scope for this spec)
  sends a CANCEL frame; the server replies with a QUERY_ERROR
  carrying STATUS_CANCELLED. That reply routes through the normal
  error path, not the transport-error path, so it never triggers
  failover.

Shared between initial QueryClient.New() and ReconnectAsync:
WalkTracker():
    retriedAfterReset = false
    loop:
        idx = tracker.PickNext()
        if idx < 0:
            if !retriedAfterReset:
                tracker.BeginRound(forgetClassifications=true)
                retriedAfterReset = true
                continue
            return (lastInfo, lastError, anyRoleMismatch)
        candidate = build transport for hosts[idx]
        try connect with auth_timeout_ms budget:
            ServerInfo = (negotiatedVersion >= 2) ? read SERVER_INFO frame : null
            if ServerInfo and (ServerInfo.capabilities & CAP_ZONE):
                tracker.RecordZone(idx, ServerInfo.zone_id)
        on success:
            if EndpointMatchesTarget(ServerInfo):
                RecordSuccess(idx); install transport; return (info, null, false)
            else:
                RecordRoleReject(idx, transient = (ServerInfo.Role == PRIMARY_CATCHUP))
                lastInfo = ServerInfo; anyRoleMismatch = true
                continue
        on AuthError: rethrow (do NOT continue past this host)
        on 421 + X-QuestDB-Role <known role>:
            tracker.RecordZone(idx, response.X-QuestDB-Zone) # null/empty header → no-op
            RecordRoleReject(idx, transient = (role == PRIMARY_CATCHUP))
            anyRoleMismatch = true; continue
        on other error:
            RecordTransportError(idx); continue
The fall-through reset (when PickNext returns -1 for the first
time in this walk) gives stale TransientReject / TopologyReject
hosts from prior outages another shot before declaring the entire
walk failed. Unlike the SF reconnect loop (sf-client.md
§13.6), WalkTracker is bounded by hostCount × auth_timeout_ms
rather than a wall-clock budget; only one reset, then fail.
When WalkTracker returns no transport: if any role mismatch was
observed, raise a role-mismatch error with the last SERVER_INFO
attached so callers can distinguish "no endpoint matched target=" from
"all endpoints unreachable". Otherwise raise a transport error
summarising attempts × endpoints. The reconnect path uses the same
helper and wraps the outcome with "failover exhausted after N attempts
across M endpoints".
When target=primary, RecordZone is still called per the
pseudocode, but every host's zone tier is treated as Same for
selection purposes (the master must be followed across zones). The
(state, zone) priority lattice degenerates to state-only ordering;
same as if zone= were unset. See failover.md §2 for the zone-tier
collapse rule.
The server maintains a per-connection schema registry keyed by schema_id.
The first RESULT_BATCH for a query registers a new schema in mode 0x00;
subsequent batches with the same column set reference it in mode 0x01.
A query whose column set differs from a prior registration receives a fresh
schema_id. The server may garbage-collect schema entries that no longer
correspond to any active query, but is not required to.
Connections that accumulate many distinct column shapes may cross the
server-side schema soft cap. When that happens the server emits
CACHE_RESET with RESET_MASK_SCHEMAS (§11.7) at a query boundary and
clears the registry on both sides. schema_id values after the reset may
collide with previously-used ones; clients MUST treat the pre-reset and
post-reset id spaces as independent.
On disconnect, both sides reset the registry.
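
A client-side view of the registry might look like the sketch below. The class and method names are hypothetical, and column definitions are represented as a plain list of names for brevity.

```python
SCHEMA_MODE_FULL = 0x00
SCHEMA_MODE_REFERENCE = 0x01


class SchemaRegistry:
    """Hypothetical client-side mirror of the per-connection schema registry."""

    def __init__(self) -> None:
        self._by_id = {}

    def resolve(self, mode: int, schema_id: int, columns=None):
        """Return the column definitions that apply to the current batch."""
        if mode == SCHEMA_MODE_FULL:
            if columns is None:
                raise ValueError("full mode must carry column definitions")
            # Overwrite unconditionally: ids may repeat after a reset.
            self._by_id[schema_id] = columns
            return columns
        if mode == SCHEMA_MODE_REFERENCE:
            if schema_id not in self._by_id:
                raise ValueError(f"reference to unknown schema_id {schema_id}")
            return self._by_id[schema_id]
        raise ValueError(f"unknown schema mode {mode:#x}")

    def reset(self) -> None:
        """Call on CACHE_RESET with RESET_MASK_SCHEMAS and on disconnect."""
        self._by_id.clear()
```

Treating a reference to an unknown id as a decode failure keeps pre-reset and post-reset id spaces from being confused.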
Egress uses a connection-scoped delta dictionary (the FLAG_DELTA_SYMBOL_DICT
mechanic from ingress). The server maintains a global mapping of symbol strings
to sequential integer IDs starting at 0, shared across every query on the
connection. Each RESULT_BATCH carries a delta section listing newly added
symbols, placed between the batch_seq prelude and the table block:
delta_start: varint First conn-id assigned in this batch
delta_count: varint Number of new entries (may be 0)
For each new entry (in order):
entry_length: varint
entry_bytes: UTF-8 bytes
Then, for every SYMBOL column in the table block's column-data section:
For each non-null row:
conn_id: varint Index into the connection-scoped dictionary
FLAG_DELTA_SYMBOL_DICT is always set on RESULT_BATCH in Phase 1. The
client accumulates delta entries for the lifetime of the connection. On
disconnect, both sides reset the dictionary.
Per-connection scope pays off for repeated queries (BI dashboards refreshing
the same SELECTs) at the cost of growth on long-lived connections that
surface high-cardinality symbols. The server enforces soft caps on both
the entry count and the heap bytes held in the dictionary (§16). When
either cap is crossed, the server emits CACHE_RESET with
RESET_MASK_DICT (§11.7) at a query boundary and both sides clear the
dictionary; the next RESULT_BATCH delta section starts at
delta_start = 0 and re-transmits any symbols the subsequent queries
reference.
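A client-side decoder for the delta section might look like the sketch below. It assumes that `varint` means unsigned LEB128 (consistent with the 80 80 04 encoding of 65536 in the CREDIT example) and that a non-empty delta must start exactly at the current dictionary size. The names `read_varint`, `SymbolDict`, `apply_delta`, and `reset` are illustrative only.

```python
from typing import List, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


class SymbolDict:
    """Connection-scoped symbol dictionary, indexed by conn_id (sketch)."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def apply_delta(self, buf: bytes, pos: int = 0) -> int:
        """Apply one delta section; return the position just past it."""
        delta_start, pos = read_varint(buf, pos)
        delta_count, pos = read_varint(buf, pos)
        # Assumption: new ids continue the sequence without gaps.
        if delta_count and delta_start != len(self.entries):
            raise ValueError("delta_start does not continue the dictionary")
        for _ in range(delta_count):
            length, pos = read_varint(buf, pos)
            self.entries.append(buf[pos:pos + length].decode("utf-8"))
            pos += length
        return pos

    def reset(self) -> None:
        # Called on CACHE_RESET with RESET_MASK_DICT and on disconnect.
        self.entries.clear()
```

After `reset`, the next delta is expected to start at 0 again, matching the behaviour described above.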
QUERY_REQUEST
client -----------------------------------> server
|
v
(parse, plan, open cursor)
|
v
client <---------------- RESULT_BATCH(seq=0) ----- schema mode 0x00
client <---------------- RESULT_BATCH(seq=1) ----- schema mode 0x01
client <---------------- RESULT_BATCH(seq=N) -----
|
v
client <----------------- RESULT_END -------------
Error path:
client <---------------- RESULT_BATCH(seq=K) -----
client <----------------- QUERY_ERROR ----------- (terminal)
Cancel path:
client ----------------- CANCEL ----------------->
client <------- (any in-flight RESULT_BATCH) -----
client <----------------- QUERY_ERROR ----------- status = CANCELLED
(or RESULT_END if it raced)
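The three paths above share one shape: zero or more RESULT_BATCH frames, then exactly one terminator. The sketch below checks that shape for a single request. It assumes batch_seq starts at 0 and increases by one per batch, as the diagram suggests. The `Kind` enum and `consume` function are hypothetical and do not model wire values.

```python
from enum import Enum, auto
from typing import Iterable, List, Tuple


class Kind(Enum):
    RESULT_BATCH = auto()
    RESULT_END = auto()
    QUERY_ERROR = auto()


def consume(frames: Iterable[Tuple[Kind, int]]) -> Tuple[Kind, List[int]]:
    """Return (terminator kind, batch_seq values seen) for one request.

    Each frame is (kind, value): value is batch_seq for RESULT_BATCH and is
    ignored for the terminators.
    """
    seqs: List[int] = []
    for kind, value in frames:
        if kind is Kind.RESULT_BATCH:
            if value != len(seqs):
                raise ValueError(f"expected batch_seq {len(seqs)}, got {value}")
            seqs.append(value)
        else:
            # RESULT_END or QUERY_ERROR ends the stream for this request.
            return kind, seqs
    raise ValueError("stream ended without a terminator")
```

A cancelled query goes through the same loop: in-flight batches are accepted until either terminator arrives.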
A connection-level error (parse failure on the message frame itself,
authentication failure, malformed header) closes the WebSocket. The
server's last frame before close should be a QUERY_ERROR with
request_id = -1 if the failure is not attributable to a specific request.
The server may interpose a CACHE_RESET frame (§11.7) between the
terminator of one query and the first frame of the next when a
connection-scoped cache has crossed a soft cap. The client MUST process
CACHE_RESET before assuming continuity of the delta dict or schema
registry across the boundary.
client <----------------- RESULT_END ------------- (query N)
client <----------------- CACHE_RESET ------------ (optional)
client ----------------- QUERY_REQUEST ----------> (query N+1)
client <---------------- RESULT_BATCH(seq=0) ----- delta_start=0 after reset
Egress uses byte credits with a row floor.
The client sets initial_credit in QUERY_REQUEST. A value of 0 means
"unbounded": the server streams without waiting for credit. A nonzero value
is the byte budget the server may emit before pausing.
The client sends CREDIT frames to extend the window. The server adds
additional_bytes to the remaining budget. There is no upper bound on a
single grant.
The server decrements the budget by the total wire length of each
RESULT_BATCH (header + payload). When the budget would go non-positive,
the server pauses production for that request_id.
To prevent deadlock on rows larger than the remaining window, the server is
permitted to send one additional RESULT_BATCH of at least one row even
if doing so drives the budget negative. The next batch will not be sent
until credit returns to a positive value.
The row floor guarantees forward progress for any well-formed query regardless of credit size, at the cost of the budget being a soft ceiling (the client may briefly receive more bytes than it granted). Clients should size buffers to absorb up to one extra batch.
Each request_id has its own credit accounting. Granting credit on one
request does not unblock another.
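The accounting can be sketched as a small per-request object on the server side. This takes the most permissive reading of the row floor: a batch may go out whenever the budget is still positive, so the last batch before a pause may overdraw it. A server may equally choose to pause earlier. `CreditWindow` and its method names are assumptions for illustration.

```python
class CreditWindow:
    """Per-request byte budget with a one-batch row floor (sketch)."""

    def __init__(self, initial_credit: int) -> None:
        # initial_credit == 0 means unbounded: never wait for credit.
        self.unbounded = initial_credit == 0
        self.remaining = initial_credit

    def grant(self, additional_bytes: int) -> None:
        """Apply a CREDIT frame for this request_id."""
        self.remaining += additional_bytes

    def may_send(self) -> bool:
        """True while the budget is positive (or the window is unbounded)."""
        return self.unbounded or self.remaining > 0

    def on_sent(self, wire_length: int) -> None:
        """Charge header + payload bytes; may drive the budget negative."""
        if not self.unbounded:
            self.remaining -= wire_length
```

With a budget of 100 bytes and two 60-byte batches, both batches are sent, the budget ends at -20, and production stays paused until grants total more than 20 bytes.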
QUERY_ERROR reuses the ingress status code namespace and adds two
egress-specific codes:
| Code | Hex | Name | Description |
|---|---|---|---|
| 3 | 0x03 | SCHEMA_MISMATCH | Bind parameter type incompatible with placeholder |
| 5 | 0x05 | PARSE_ERROR | Malformed message or SQL syntax error |
| 6 | 0x06 | INTERNAL_ERROR | Server-side execution failure |
| 8 | 0x08 | SECURITY_ERROR | Authorization failure |
| 10 | 0x0A | CANCELLED | Query terminated in response to CANCEL |
| 11 | 0x0B | LIMIT_EXCEEDED | A protocol limit was hit (see §16) |
OK is not used in egress; success terminates with RESULT_END, not a
status code.
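For client code, the table maps naturally onto an enum. The sketch below copies the codes listed above. The names `QueryStatus` and `describe` are illustrative, and values outside the table are reported as unknown rather than guessed.

```python
from enum import IntEnum


class QueryStatus(IntEnum):
    """Status codes that QUERY_ERROR may carry, per the table above."""
    SCHEMA_MISMATCH = 0x03
    PARSE_ERROR = 0x05
    INTERNAL_ERROR = 0x06
    SECURITY_ERROR = 0x08
    CANCELLED = 0x0A
    LIMIT_EXCEEDED = 0x0B


def describe(code: int) -> str:
    """Return the status name, or a placeholder for codes not in the table."""
    try:
        return QueryStatus(code).name
    except ValueError:
        return f"UNKNOWN({code:#04x})"
```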
Egress inherits ingress limits where they apply, with the following additions and changes:
| Limit | Default Value | Notes |
|---|---|---|
| Max in-flight queries | 1 (Phase 1) | Per connection. Wire protocol allows up to 64; the Phase 1 server rejects any second concurrent QUERY_REQUEST. See §6 Concurrency. |
| Max SQL text length | 1 MiB | UTF-8 bytes |
| Max bind parameters | 1,024 | Per QUERY_REQUEST |
| Max RESULT_BATCH wire size | 16 MiB | Same as ingress batch ceiling |
| Symbol dict soft cap — entries | 100,000 | Per connection. Exceeding triggers CACHE_RESET(RESET_MASK_DICT) at the next query boundary (§11.7). |
| Symbol dict soft cap — heap | 8 MiB | Per connection, UTF-8 bytes. Exceeding triggers CACHE_RESET(RESET_MASK_DICT). |
| Schema registry soft cap | 4,096 | Per connection. Exceeding triggers CACHE_RESET(RESET_MASK_SCHEMAS) at the next query boundary. |
| Min initial credit | 0 | 0 means unbounded |
Soft caps are implementation-defined and may be configured or tuned by the
server operator. Clients MUST tolerate any cap policy and MUST be prepared
to receive CACHE_RESET at any query boundary.
Client sends SELECT id, value FROM sensors LIMIT 2 with no bind
parameters and no credit window.
QUERY_REQUEST:
Header:
51 57 50 31 magic "QWP1"
01 version 1
00 flags
00 00 table_count = 0
XX XX XX XX payload_length
Payload:
10 msg_kind = QUERY_REQUEST
01 00 00 00 00 00 00 00 request_id = 1
25 sql_length = 37
53 45 4C 45 43 54 20 69 64 2C 20 76 61 6C 75 65
20 46 52 4F 4D 20 73 65 6E 73 6F 72 73 20 4C 49
4D 49 54 20 32 "SELECT id, value FROM sensors LIMIT 2"
00 initial_credit = 0 (unbounded)
00 bind_count = 0
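The same frame can be produced programmatically. The sketch below assumes the layout read off the dump: a 12-byte header with little-endian table_count (u16) and payload_length (u32), a signed little-endian 64-bit request_id, and unsigned LEB128 varints for sql_length, initial_credit, and bind_count. The helper names `varint` and `query_request` are hypothetical.

```python
import struct

MAGIC = b"QWP1"
MSG_QUERY_REQUEST = 0x10


def varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def query_request(request_id: int, sql: str, initial_credit: int = 0) -> bytes:
    """Build a QUERY_REQUEST frame with no bind parameters."""
    text = sql.encode("utf-8")
    payload = (
        bytes([MSG_QUERY_REQUEST])
        + struct.pack("<q", request_id)
        + varint(len(text)) + text
        + varint(initial_credit)
        + varint(0)  # bind_count = 0
    )
    # version = 1, flags = 0, table_count = 0, then payload_length.
    header = MAGIC + bytes([1, 0]) + struct.pack("<HI", 0, len(payload))
    return header + payload


frame = query_request(1, "SELECT id, value FROM sensors LIMIT 2")
```

Computing payload_length from the assembled payload avoids hand-counting the SQL text.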
Server responds:
RESULT_BATCH (seq=0):
Header:
51 57 50 31 magic "QWP1"
01 version 1
00 flags
01 00 table_count = 1
XX XX XX XX payload_length
Payload:
11 msg_kind = RESULT_BATCH
01 00 00 00 00 00 00 00 request_id = 1
00 batch_seq = 0
Table block:
00 name_length = 0 (anonymous)
02 row_count = 2
02 column_count = 2
Schema (full mode):
00 schema_mode = FULL
00 schema_id = 0
02 69 64 05 "id" : LONG
05 76 61 6C 75 65 07 "value" : DOUBLE
Column 0 (LONG):
00 null_flag = 0
01 00 00 00 00 00 00 00
02 00 00 00 00 00 00 00
Column 1 (DOUBLE):
00 null_flag = 0
CD CC CC CC CC CC F4 3F 1.3
9A 99 99 99 99 99 01 40 2.2
RESULT_END:
Header:
51 57 50 31 01 00 00 00 XX XX XX XX
Payload:
12 msg_kind = RESULT_END
01 00 00 00 00 00 00 00 request_id = 1
00 final_seq = 0
02 total_rows = 2
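The column bytes in the RESULT_BATCH above can be checked with a few lines of Python. This sketch assumes the layout the dump shows for a column without nulls: one null_flag byte followed by row_count packed little-endian values. `decode_fixed_column` is a hypothetical helper, and null bitmap handling is left out.

```python
import struct
from typing import List


def decode_fixed_column(buf: bytes, row_count: int, fmt: str) -> List:
    """Decode a non-null fixed-width column.

    fmt is a struct format character: "q" for LONG, "d" for DOUBLE.
    """
    if buf[0] != 0:
        raise NotImplementedError("null bitmap handling is omitted from this sketch")
    return list(struct.unpack_from(f"<{row_count}{fmt}", buf, 1))


ids = decode_fixed_column(
    bytes.fromhex("00 0100000000000000 0200000000000000"), 2, "q")
values = decode_fixed_column(
    bytes.fromhex("00 CDCCCCCCCCCCF43F 9A99999999990140"), 2, "d")
```

Here `ids` holds the two LONG values and `values` the two DOUBLE values from the example.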
Bind parameter ? of type LONG with value 42:
Inside QUERY_REQUEST after bind_count = 1:
05 type_code = LONG
00 null_flag = 0 (no nulls)
2A 00 00 00 00 00 00 00 value = 42 (one-row column)
A NULL LONG bind:
05 type_code = LONG
01 null_flag = nonzero (bitmap follows)
01 bitmap byte: bit 0 set = NULL
(no value bytes)
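Both bind encodings can be generated by one small function. The sketch mirrors the two dumps above for a single LONG parameter. The name `encode_long_bind` is an assumption, and other types are out of scope.

```python
import struct
from typing import Optional

TYPE_LONG = 0x05


def encode_long_bind(value: Optional[int]) -> bytes:
    """Encode one LONG bind parameter as a one-row column."""
    if value is None:
        # null_flag nonzero, then a one-byte bitmap with bit 0 set; no value bytes.
        return bytes([TYPE_LONG, 0x01, 0x01])
    # null_flag = 0, then the value as a little-endian signed 64-bit integer.
    return bytes([TYPE_LONG, 0x00]) + struct.pack("<q", value)
```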
Client opens a query with a 64 KiB initial credit:
QUERY_REQUEST: initial_credit = 65536, request_id = 7
Server emits RESULT_BATCH frames totaling 60 KiB, then pauses because the
next batch would exhaust the remaining 4 KiB of credit. Client
processes them and grants more:
CREDIT:
Payload:
15 msg_kind = CREDIT
07 00 00 00 00 00 00 00 request_id = 7
80 80 04 additional_bytes = 65536
Server resumes streaming.
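The CREDIT payload above can be reproduced as follows. The sketch assumes additional_bytes is an unsigned LEB128 varint, which matches the 80 80 04 bytes shown for 65536. The frame header is omitted, and `varint` and `credit_payload` are hypothetical helper names.

```python
import struct

MSG_CREDIT = 0x15


def varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def credit_payload(request_id: int, additional_bytes: int) -> bytes:
    """Build the payload of a CREDIT message (header not included)."""
    return (
        bytes([MSG_CREDIT])
        + struct.pack("<q", request_id)
        + varint(additional_bytes)
    )
```

A client would typically send this once it has drained enough of its receive buffer to accept another window of data.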
These belong in a follow-up RFC, not the initial implementation:

- A RESULT_KEEPALIVE kind so clients can distinguish a slow
  query from a stalled connection. WebSocket pings cover the connection
  but not per-request liveness.
- A PREPARE / EXECUTE split would let the
  server cache the parse and plan. It adds two more message kinds and a
  server-side handle scope.