docs/qwp/sf-client.md
This document specifies the client-side Store-and-Forward (SF) substrate that
sits beneath QWP ingest. It is a companion to wire-ingress.md: that
spec defines the bytes on the wire; this spec defines what every SF-capable
client must do around them — disk format, frame sequence numbering,
handshake, ACK-driven trim, durable-ack, keepalive, reconnect, replay, and
error policy.
The goal is interop. Clients written in any language must:
The SF substrate sits between the user's Sender API and the QWP wire
encoder. Encoded QWP messages ("frames") are appended to an in-memory or
on-disk ring of fixed-size segments. A dedicated I/O thread drains the ring
over a WebSocket connection. The server's ACKs feed back into the ring as a
trim watermark, freeing storage. Disconnects, server restarts, and JVM
crashes are masked from the producer; the I/O thread reconnects and replays
whatever frames remain unacked.
Two modes share the same substrate:
The producer never blocks on the wire. flush() returns once data is
published into the substrate (RAM in Memory mode, disk in SF mode); ACKs
arrive asynchronously.
The authoritative reference is the Java client at
github.com/questdb/java-questdb-client,
under core/src/main/java/io/questdb/client/cutlass/qwp/client/sf/cursor/.
This spec was extracted from commit 1125b0b (branch vi_sf,
2026-05-05). When the implementation drifts from this spec, the spec is
stale; raise it with the client maintainers and update this document.
Key files:
| Concern | File |
|---|---|
| Segment file I/O | MmapSegment.java |
| Segment ring + rotation | SegmentRing.java |
| Manager thread + cap | SegmentManager.java |
| Slot lock | SlotLock.java |
| Ack watermark (.ack-watermark) | AckWatermark.java |
| Orphan adoption | OrphanScanner.java, BackgroundDrainerPool.java, BackgroundDrainer.java |
| Engine (lifecycle, recovery) | CursorSendEngine.java |
| I/O thread (wire) | CursorWebSocketSendLoop.java |
| Wire response parser | WebSocketResponse.java (one level up) |
| Sender + connect-string parser | Sender.java, QwpWebSocketSender.java (one level up) |
| Aspect | Memory | SF |
|---|---|---|
| Storage | malloc'd ring | mmap'd files in <sf_dir>/<sender_id>/ |
| Default total cap | 128 MiB | 10 GiB |
| Survives JVM exit | No | Yes |
| Survives JVM crash | No | Yes (recovery on next start) |
| Reconnect retries | Yes | Yes |
| Concurrent multi-sender slot collision | n/a | Detected at startup via advisory exclusive lock |
| Orphan adoption (drain another sender's stale slot) | n/a | Opt-in (drain_orphans=on) |
A client implementation may choose to support Memory mode only, SF mode
only, or both. Users select Memory mode by omitting sf_dir from the
connect string, and SF mode by including it.
All options pertain to the WebSocket transport (ws:: / wss::). HTTP/TCP
ingest does not use the SF substrate. UDP has its own, separate model
(see wire-udp.md).
| Key | Type | Default | Description |
|---|---|---|---|
| sf_dir | path | unset (= Memory mode) | Group root. The slot lives at <sf_dir>/<sender_id>/. |
| sender_id | string | default | Slot subdirectory name. Two senders sharing a sender_id collide on the lock. |
| sf_max_bytes | size | 4M | Per-segment file size; rotation threshold. |
| sf_max_total_bytes | size | 128M (Memory) / 10G (SF) | Cap on total buffered bytes; backpressures producer when full. |
| sf_durability | enum | memory | memory only is implemented; flush and append are reserved (see §21). |
| sf_append_deadline_millis | int (ms) | 30000 | How long the producer's appendBlocking waits for ACK-driven trim before throwing. |
| drain_orphans | bool | off | Scan <sf_dir>/* at startup and spawn drainers for sibling slots that contain unacked data. |
| max_background_drainers | int | 4 | Cap on concurrent orphan drainers. |
Size values accept integer bytes or unit suffixes (K, M, G, T,
binary multipliers).
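
A minimal sketch of a size-value parser for the rule above. The function name `parse_size` is hypothetical, and treating suffixes as case-insensitive is an assumption; the spec only names the upper-case suffixes.

```python
# Binary multipliers for the unit suffixes named in the spec.
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text: str) -> int:
    """Parse a bare integer byte count or a suffixed value such as '4M'."""
    text = text.strip()
    if not text:
        raise ValueError("empty size value")
    suffix = text[-1].upper()  # assumption: lower-case suffixes are accepted
    if suffix in _UNITS:
        digits, multiplier = text[:-1], _UNITS[suffix]
    else:
        digits, multiplier = text, 1
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"invalid size value: {text!r}")
    return int(digits) * multiplier
```

For example, `parse_size("4M")` yields the default per-segment size in bytes and `parse_size("10G")` the default SF total cap.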
| Key | Type | Default | Description |
|---|---|---|---|
| reconnect_max_duration_millis | int (ms) | 300000 | Per-outage time budget. Resets on each successful reconnect. |
| reconnect_initial_backoff_millis | int (ms) | 100 | Initial backoff. |
| reconnect_max_backoff_millis | int (ms) | 5000 | Backoff ceiling. |
| initial_connect_retry | enum | off | off (canonical; alias false) = terminal on first failure; on (canonical; aliases sync, true) = same retry loop as reconnect, blocking; async = same retry loop in the I/O thread, non-blocking. See §13.4 (initial connect) and §13.6 (the loop pseudocode that consumes this knob). |
| close_flush_timeout_millis | int (ms) | 5000 | close() blocks up to this long waiting for ackedFsn >= publishedFsn. 0 or -1 skips the drain entirely; the pre-drain checkError() safety net still runs (it does no I/O and only rethrows when the error has not reached the user through any other channel). |
| Key | Type | Default | Description |
|---|---|---|---|
| request_durable_ack | bool | off | Opt-in via the upgrade header. Trim is then driven by STATUS_DURABLE_ACK frames only; OK frames no longer advance the trim watermark. Connect fails loudly if the server does not echo X-QWP-Durable-Ack: enabled. |
| durable_ack_keepalive_interval_millis | int (ms) | 200 | Cadence of WebSocket PING the I/O loop sends while there are pending durable confirmations. 0 or negative disables. WebSocket-only. |
| Key | Type | Default | Description |
|---|---|---|---|
| error_inbox_capacity | int (≥16) | 256 | Bounded SPSC capacity for async error notifications. Overflow drops oldest and bumps droppedErrorNotifications. |
The per-category override keys (on_server_error, on_schema_error,
on_parse_error, on_internal_error, on_security_error,
on_write_error) are reserved in the spec but are not yet recognised by
the Java parser; they are wired only via the fluent LineSenderBuilder API
today. New clients should accept them in connect strings — that's the
spec — and treat them with the precedence rules in §14.
| Key | Type | Default | Description |
|---|---|---|---|
| addr | host[:port][,host[:port]…] | required | Server endpoint(s). Comma-separated for multi-host failover; selection semantics, host-health states, and the cursor-engine reconnect loop are normatively specified in failover.md. |
| username / password | string | unset | HTTP Basic auth on the upgrade request. |
| token | string | unset | Bearer token on the upgrade request. |
| tls_verify | enum | on | on or unsafe_off. Applies on wss:: / TLS connections. |
| tls_roots | path | system trust | Path to a custom CA trust store. |
| tls_roots_password | string | unset | Trust store password. |
| auto_flush | bool | on | Master switch for auto-flush triggers. |
| auto_flush_rows | int / off | 1000 | Row-count flush trigger. |
| auto_flush_bytes | int / off | 0 (off) | Byte-size flush trigger. |
| auto_flush_interval | int (ms) / off | 100 | Time-since-first-row flush trigger. Triggered on the next at() / flush() call, not by a timer. |
| init_buf_size | size | 64K | Initial encode buffer capacity. |
| max_buf_size | size | 100M | Max encode buffer capacity. |
| max_name_len | int | 127 | Local validation cap for table/column names. |
| max_schemas_per_connection | int | 65535 | Per-connection schema-id ceiling. |
The parser MUST reject:

- sf_durability values other than memory, flush, append. flush and append parse but are rejected at build-time today (deferred).
- sender_id that is empty or contains path separators.
- request_durable_ack=on on non-WebSocket transports.

In SF mode the substrate writes all files under one slot directory:
<sf_dir>/<sender_id>/
├── .lock # advisory exclusive lock (kernel-released on process exit)
├── .lock.pid # UTF-8 text: holder PID + '\n' (diagnostic only)
├── .failed # OPTIONAL: drainer-failure sentinel (UTF-8 reason text)
├── .ack-watermark # OPTIONAL: 16-byte mmap'd durable-ack high-water mark
├── sf-<gen>.sfa # segment files (one or more)
└── sf-<gen>.sfa
.lock and .lock.pid

The .lock file is held under an advisory exclusive lock for the engine's
entire lifetime. POSIX clients use flock/fcntl; Windows clients use
LockFileEx. The lock is automatically released when the file descriptor
is closed, including on hard process exit (kernel cleanup).
A second engine pointing at the same slot dir MUST refuse to start. The
contract is: detect the collision at acquire time and fail loudly. The
error message MUST include the holder's PID by reading .lock.pid. The
PID is a separate file because Windows LockFileEx is mandatory and would
prevent reading the lock file's own bytes while it is held.
.lock.pid is overwritten on every successful acquisition with the
acquirer's PID followed by a newline, UTF-8 encoded. Absent or empty
.lock.pid is harmless; the failed acquire reports holder=unknown.
Neither .lock nor .lock.pid is removed on clean shutdown. Stale files
are harmless and are silently overwritten by the next acquirer.
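
The acquire-or-fail contract can be sketched for POSIX hosts as follows. This is an illustration, not the reference SlotLock.java: the names `acquire_slot_lock`, `read_holder`, and `SlotLockedError` are hypothetical, and it assumes `fcntl.flock` maps to the BSD-style `flock` call so that a second open of the same file conflicts.

```python
import fcntl
import os


class SlotLockedError(Exception):
    """Raised when another engine already holds the slot lock."""


def read_holder(slot_dir: str) -> str:
    """Return the PID text from .lock.pid, or 'unknown' if absent or empty."""
    try:
        with open(os.path.join(slot_dir, ".lock.pid"), encoding="utf-8") as f:
            text = f.read().strip()
    except OSError:
        return "unknown"
    return text or "unknown"


def acquire_slot_lock(slot_dir: str) -> int:
    """Take the advisory exclusive lock; return the fd that keeps it held."""
    os.makedirs(slot_dir, exist_ok=True)
    fd = os.open(os.path.join(slot_dir, ".lock"), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        holder = read_holder(slot_dir)
        raise SlotLockedError(f"slot {slot_dir} is locked, holder={holder}")
    # Overwrite .lock.pid only after the lock is held.
    with open(os.path.join(slot_dir, ".lock.pid"), "w", encoding="utf-8") as f:
        f.write(f"{os.getpid()}\n")
    return fd
```

Closing the returned descriptor releases the lock; neither file is removed, which matches the stale-file rule above.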
.failed

Present iff a previous drainer attempt gave up on the slot (reconnect
budget exhausted, auth-terminal, irrecoverable corruption, etc.). The
contents are a UTF-8 text reason for human operators; the presence, not
the contents, is the signal. While .failed exists, the slot is excluded
from auto-drain by the orphan scanner. Operators clear it manually.
Drainers MUST write .failed before releasing the slot lock and exiting.
Segment files (sf-<gen>.sfa)

<gen> is a 16-character lowercase zero-padded hexadecimal generation
counter, e.g. sf-0000000000000003.sfa. The generation is allocated by the
segment manager monotonically across the JVM lifetime; on recovery it is
seeded to max(existing-gen) + 1 to avoid collisions with recovered files.
The filename does not encode the segment's FSN range. It encodes only
the allocation order. The FSN range is read from the segment's header at
recovery time. sf-initial.sfa is a legacy name accepted by the recovery
scanner and skipped during max-gen computation.
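
A small sketch of the naming rule and the recovery-time generation seed. The helper names are hypothetical, and starting at generation 0 when no generation-named file exists is an assumption.

```python
import re

# Matches only generation-named segments; sf-initial.sfa does not match.
_SEGMENT_RE = re.compile(r"^sf-([0-9a-f]{16})\.sfa$")


def segment_name(gen: int) -> str:
    """Format a generation as a 16-character lowercase zero-padded hex name."""
    if not 0 <= gen < 1 << 64:
        raise ValueError("generation out of range")
    return f"sf-{gen:016x}.sfa"


def next_generation(filenames) -> int:
    """Seed the counter to max(existing-gen) + 1, skipping legacy names."""
    gens = []
    for name in filenames:
        match = _SEGMENT_RE.match(name)
        if match:
            gens.append(int(match.group(1), 16))
    return max(gens) + 1 if gens else 0  # assumption: a fresh slot starts at 0
```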
Empty slot directories (*.sfa-less) are NOT auto-cleaned; the substrate
never deletes the slot directory itself.
.ack-watermark

Optional 16-byte file that persists the cumulative durable-ack FSN
across process restarts. Without it, recovery seeds
ackedFsn = lowestSurvivingBaseSeq - 1 (see §6.5) — which guarantees no
data loss but cannot tell within the lowest surviving sealed segment
which frames the previous sender had already received durable acks for.
Replay therefore re-sends every frame in that segment, producing
row-level duplicates against a still-alive server unless the target
table dedupes them.
With .ack-watermark, recovery seeds
ackedFsn = max(lowestSurvivingBaseSeq - 1, watermark). The max
clamp absorbs either order of the manager's "persist then trim" tick:

- If the persisted watermark is >= lowest, the watermark is correct; max picks watermark.
- If the trim ran ahead of the persist (lowestBase is higher than watermark), the watermark is stale; max picks lowestBase - 1.

Recovery MUST additionally bound-check: if the watermark exceeds
publishedFsn of the recovered ring, treat it as corrupt and fall
back to lowestBase - 1. A correctly operating prior session cannot
produce a watermark above the on-disk frame ceiling, so an excess
value is bit-rot or filesystem corruption; trusting it would seed
ackedFsn = publishedFsn (after the ring's own clamp), positioning
the cursor past every un-acked frame and silently dropping the
un-acked tail.
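
The seed rule, including the bound check, fits in a few lines. This is a sketch; the function name and the use of `None` for "no usable watermark" are illustrative choices.

```python
def seed_acked_fsn(lowest_base_seq, published_fsn, watermark=None):
    """Return the recovery seed for ackedFsn.

    watermark is None when .ack-watermark is absent or has a bad magic.
    """
    floor = lowest_base_seq - 1
    if watermark is None:
        return floor
    if watermark > published_fsn:
        # Above the on-disk frame ceiling: treat as corrupt, ignore it.
        return floor
    return max(floor, watermark)
```

With segments starting at FSN 10 and frames on disk up to FSN 20, a watermark of 15 seeds 15, a stale watermark of 5 seeds 9, and a corrupt watermark of 99 also seeds 9.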
Format (16 bytes, little-endian, mmap'd RW for the engine's lifetime):
Offset Size Type Field Description
──────────────────────────────────────────────────────────────────
0 4 uint32 magic 0x31574B41 ("AKW1" little-endian)
4 4 uint32 reserved Must be 0
8 8 int64 fsn Cumulative durable-ack high-water mark
A magic value other than 0x31574B41 means "no usable watermark";
recovery falls back to the bare lowestBase - 1 seed. The most common
cause is a freshly created file from the previous session's first
open() where no write landed before crash — the OS zero-filled the
file at creation, so magic is 0 until the first write stamps it.
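
As an illustration of the 16-byte layout, the following sketch encodes and decodes the record with the standard `struct` module. The function names are hypothetical; a real client would store into the mapped region rather than build byte strings.

```python
import struct

WATERMARK_MAGIC = 0x31574B41  # bytes "AKW1" when written little-endian
_LAYOUT = struct.Struct("<IIq")  # magic, reserved, fsn: 16 bytes in total


def encode_watermark(fsn: int) -> bytes:
    """Build the full 16-byte record for a given durable-ack FSN."""
    return _LAYOUT.pack(WATERMARK_MAGIC, 0, fsn)


def decode_watermark(raw: bytes):
    """Return the stored FSN, or None when there is no usable watermark."""
    if len(raw) != _LAYOUT.size:
        return None
    magic, _reserved, fsn = _LAYOUT.unpack(raw)
    if magic != WATERMARK_MAGIC:
        return None  # e.g. a zero-filled file that was never written
    return fsn
```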
Write protocol (single-writer, segment-manager thread):

- Each update is a single putLong to offset 8 against the mapped region. No alloc, no syscall, no fsync. The store is atomic at the CPU level on x86_64 and arm64, and disk-atomic within one sector (the file is 16 bytes, trivially within one sector).
- If the magic is already valid at open() time, the magic store is skipped.
- Updates follow ackedFsn, typically debounced via "only write if current > lastPersisted". Frequency is bounded by the manager tick cadence, not the inbound durable-ack rate.
- fsync cadence: NOT performed. Host crash falls back to the segment-derived seed (no regression vs the no-watermark behaviour). A client MAY add periodic fsync as a quality-of-service knob; the spec permits but does not require it.
Lifecycle: opened in engine.open after the slot lock is
acquired; mmap is held for the engine's lifetime. Closed during
engine shutdown after the manager has stopped (the manager is the
sole writer). A clean shutdown that drained all segments MAY unlink
.ack-watermark; the substrate does today as part of
unlinkAllSegmentFiles semantics. Otherwise the file persists for
the next session's recovery to read.
Interop: .ack-watermark is OPTIONAL. A client that does not
implement it MUST still produce on-disk state that the reference
recovery scanner can read — i.e. absence of .ack-watermark is
explicitly handled (§6.5). A drainer (orphan-adoption path) opening a
slot from a different client that did write .ack-watermark MUST
honour it on recovery; ignoring it is allowed but defeats the
optimisation.
All multi-byte integers are little-endian. The reference implementation relies on x86/ARM native byte order; clients MUST emit little-endian explicitly to be portable.
A segment file consists of a 24-byte header followed by zero or more
frames. The file is pre-allocated at create time to its full size
(sf_max_bytes); the tail is zero-filled until written.
Pre-allocation semantics. The reference implementation reserves real
disk blocks at create time via posix_fallocate on Linux and
F_PREALLOCATE/F_ALLOCATEALL on macOS, so ENOSPC surfaces as a clean
create-time failure (the partial file is unlinked and the caller falls
back to backpressure). Setting only the logical size via ftruncate is
NOT sufficient: the resulting file is sparse and a later store into the
mmap'd region will deliver a SIGBUS that tears down the process. On
filesystems whose underlying reservation call is unsupported and the
native layer falls back to ftruncate, this SIGBUS risk re-emerges for
that filesystem only — operators running on such filesystems should
size sf_max_bytes conservatively against free space. Clients that
implement their own segment files MUST treat block reservation as a
core invariant of the create path.
Offset Size Type Field Description
──────────────────────────────────────────────────────────────────
0 4 uint32 magic 0x31304653 ("SF01" little-endian)
4 1 uint8 version 0x01
5 1 uint8 flags Reserved (must be 0)
6 2 uint16 reserved Must be 0
8 8 uint64 baseSeq FSN of the first frame in this segment
16 8 int64 createdMicros Wall-clock microseconds at creation
baseSeq MUST be non-negative when read as a signed 64-bit integer (the
reference implementation reads it that way). The recovery scanner refuses
segments with baseSeq < 0 to defend against bit-rot or malicious files.
createdMicros is informational. Implementations may use any monotonic or
wall-clock source; recovery does not validate it.
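
The header layout can be sketched with `struct` as below. The function names are hypothetical. The sketch reads baseSeq as a signed 64-bit value so that the "baseSeq < 0" rejection is expressible; that interpretation is an assumption, since the table lists the field as uint64.

```python
import struct
import time

SEGMENT_MAGIC = 0x31304653  # bytes "SF01" when written little-endian
# magic, version, flags, reserved, baseSeq, createdMicros: 24 bytes.
_HEADER = struct.Struct("<IBBHqq")


def pack_segment_header(base_seq: int, created_micros=None) -> bytes:
    """Build the 24-byte header for a new segment."""
    if base_seq < 0:
        raise ValueError("baseSeq must be non-negative")
    if created_micros is None:
        created_micros = time.time_ns() // 1000  # wall-clock microseconds
    return _HEADER.pack(SEGMENT_MAGIC, 1, 0, 0, base_seq, created_micros)


def parse_segment_header(raw: bytes):
    """Validate magic, version and baseSeq; return (baseSeq, createdMicros)."""
    if len(raw) < _HEADER.size:
        raise ValueError("file shorter than the 24-byte header")
    magic, version, _flags, _reserved, base_seq, created = _HEADER.unpack_from(raw, 0)
    if magic != SEGMENT_MAGIC:
        raise ValueError("bad magic")
    if version != 1:
        raise ValueError("unsupported version")
    if base_seq < 0:
        raise ValueError("negative baseSeq")
    return base_seq, created
```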
Frames are tightly packed starting at offset 24. Each frame is:
Offset Size Type Field Description
──────────────────────────────────────────────────────────────────
0 4 uint32 crc32c CRC32C of (payloadLen||payload), Castagnoli
4 4 int32 payloadLen Payload size in bytes
8 payloadLen bytes payload One QWP message (verbatim wire bytes)
payloadLen MUST be ≥ 0 and ≤ sizeBytes - offset - 8. A negative or
overflowing length is treated as a torn-tail boundary by the recovery
scanner.
The CRC is standard CRC-32C (Castagnoli): polynomial 0x1EDC6F41,
reflected in/out, init 0xFFFFFFFF, final XOR 0xFFFFFFFF (same
variant as iSCSI/SCTP/Btrfs). It covers payloadLen (4 bytes,
little-endian) followed by the payload bytes, in that order. Any
off-the-shelf CRC-32C library will produce identical checksums.
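
To make the frame envelope concrete, here is a pure-Python sketch of CRC-32C, frame encoding, and a torn-tail scan of the kind the recovery scanner performs. The function names are hypothetical, and the bitwise table-driven CRC is chosen for portability rather than speed; a production client would use a hardware-accelerated library.

```python
import struct


def _build_table():
    # 0x82F63B78 is the bit-reflected form of the Castagnoli polynomial.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF  # init
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF  # final XOR


def encode_frame(payload: bytes) -> bytes:
    """Return crc32c || payloadLen || payload, all little-endian."""
    length = struct.pack("<i", len(payload))
    return struct.pack("<I", crc32c(length + payload)) + length + payload


def scan_frames(segment: bytes, start: int = 24):
    """Walk frames from `start`; return (frame_count, last_good_offset)."""
    offset, count = start, 0
    while offset + 8 <= len(segment):
        crc, length = struct.unpack_from("<Ii", segment, offset)
        if length < 0 or length > len(segment) - offset - 8:
            break  # negative or overflowing length: torn tail
        if crc32c(segment[offset + 4 : offset + 8 + length]) != crc:
            break  # CRC mismatch, including the zero-filled tail
        offset += 8 + length
        count += 1
    return count, offset
```

The scan stops at the zero-filled tail because the CRC of a zero length field is not zero, so an all-zero envelope never verifies.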
The payload is the complete QWP message including its 12-byte QWP1 header
(magic QWP1, version, flags, table count, payload length). See
wire-ingress.md.
The producer is the single writer to a segment. It appends a frame with this sequence:

1. Compute total = 8 + payloadLen and check appendCursor + total ≤ sizeBytes. If not, return failure (caller rotates to the next segment).
2. Write payloadLen (4 bytes, little-endian) at mmap_addr + appendCursor + 4.
3. Copy payload to mmap_addr + appendCursor + 8.
4. Compute CRC32C over the [payloadLen, payload] bytes.
5. Write the CRC at mmap_addr + appendCursor + 0.
6. Advance appendCursor to appendCursor + total and frameCount by 1.
7. Publish publishedCursor = appendCursor.

Step 7 is the publication barrier. Until the volatile/release write of
publishedCursor retires, no consumer (I/O thread, drainer) is permitted
to read the new bytes. The CRC + zero-filled tail combination makes the
"interrupted at any point" outcome recoverable: the next recovery's CRC
verification will reject any partially-written frame.
A segment's last FSN is baseSeq + frameCount - 1. The segment becomes
trimmable once ackedFsn >= lastSeq. The ring's manager thread unlinks
trimmable segment files and decrements the totalBytes counter.

Trim is destructive: files are unlinked, not zeroed.
A new segment is rotated in when the active one fills. The manager
pre-allocates "hot spare" segments ahead of demand to keep the producer
non-blocking; rotation rebases the spare's baseSeq to the active's
lastSeq + 1 at install time. This is implementation detail; cross-client
interop only requires that the on-disk layout described in §6.1–6.3 is
correct.
For each *.sfa file in the slot:

1. Check length ≥ 24, then mmap.
2. Validate magic == 0x31304653, version == 1, baseSeq ≥ 0.
3. Walk frames from offset 24, verifying each length and CRC; stop at the first frame that fails. The offset just past the last verified frame is lastGood.
4. Set frameCount = (count of verified frames), appendCursor = lastGood, publishedCursor = lastGood.
5. If any byte in [lastGood, lastGood + min(8, fileSize - lastGood)) is non-zero, the writer attempted-but-failed a frame write. Log a WARN with the count. Clean partial fills (writer never reached the tail) leave zeros and report 0.

After scanning all files, sort segments by baseSeq ascending and verify
prev.baseSeq + prev.frameCount == curr.baseSeq for each adjacent pair.
A gap is a fatal recovery error — refuse to start. The highest-baseSeq
segment becomes active; the rest are sealed and drained in order.
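
The cross-segment contiguity check can be sketched as below. `Segment` and `order_segments` are hypothetical names; the per-file scan is assumed to have already produced each segment's baseSeq and verified frame count.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    base_seq: int
    frame_count: int


def order_segments(segments):
    """Sort by baseSeq and verify contiguity; return (sealed, active)."""
    ordered = sorted(segments, key=lambda s: s.base_seq)
    for prev, curr in zip(ordered, ordered[1:]):
        if prev.base_seq + prev.frame_count != curr.base_seq:
            # A gap or overlap is fatal: the engine must refuse to start.
            raise ValueError(
                f"FSN gap: expected baseSeq {prev.base_seq + prev.frame_count}, "
                f"found {curr.base_seq}"
            )
    if not ordered:
        return [], None
    return ordered[:-1], ordered[-1]  # highest baseSeq becomes active
```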
The substrate seeds ackedFsn = max(lowestBaseSeq − 1, watermark) where
watermark is the FSN read from .ack-watermark (§5.4) if present and
valid, or absent otherwise. lowestBaseSeq − 1 alone guarantees no data
loss; the watermark additionally suppresses re-replay of frames inside
the lowest surviving segment that the previous sender had already been
told were durable. See §5.4 for the max ordering rationale.
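Two distinct counters operate together:

- FSN (frame sequence number): slot-wide, anchored by each segment's baseSeq, never reset while the slot exists. FSN survives restarts and reconnects.
- wireSeq: per-connection, starting at 0 on every new connection.

The relationship is pinned at connect time: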
fsn = fsnAtZero + wireSeq
On a fresh connection, fsnAtZero = ackedFsn + 1 (i.e. 0 on a brand-new
slot, or wherever the trim watermark left off after a reconnect or
restart). The first frame the I/O loop sends has wireSeq = 0, regardless
of its FSN.
- The server cannot dedup on (connection, wireSeq). Replays after reconnect carry the same payload but at a fresh wireSeq window; dedup must use messageSequence inside the wire payload (which the encoder emits) to recognise duplicates across reconnects.
- The dedup window must cover at least sf_max_total_bytes worth of FSNs. Otherwise a sustained outage with a full SF cap can produce double-writes on replay.

The QWP wire encoder does NOT serialize wireSeq into the message header.
It is implicit: the server assigns wireSeq to inbound messages in receive
order, starting at 0 per connection, and echoes it back in the sequence
field of OK and error frames. The client maps the echoed value back to
FSN via fsn = fsnAtZero + sequence.
A consequence: clients MUST send frames in strict order. The server assumes "Nth frame in = wireSeq N", so any reordering breaks the FSN mapping.
Open a WebSocket on the server's /write/v4 (or /api/v4/write) path
with these request headers:
| Header | Required | Value |
|---|---|---|
| X-QWP-Max-Version | recommended | Highest QWP version supported. Defaults to 1. |
| X-QWP-Client-Id | optional | Free-form string (e.g. cpp/0.1.0). |
| X-QWP-Accept-Encoding | optional | Encoding preference; omit if not supported. |
| X-QWP-Max-Batch-Rows | optional | Per-batch row cap; omit for server default. |
| X-QWP-Request-Durable-Ack | iff request_durable_ack=on | Literal true. |
| Authorization etc. | as needed | Standard HTTP auth headers. |
On 101 the server responds with:
| Header | Meaning |
|---|---|
| X-QWP-Version | Negotiated version (defaults to 1 if absent). |
| X-QWP-Durable-Ack | Echoed enabled iff the client opted in AND the server can deliver durable-ack frames. |
If the client sent X-QWP-Request-Durable-Ack: true:

- A response carrying X-QWP-Durable-Ack: enabled confirms the server will emit STATUS_DURABLE_ACK frames. The client switches to durable-ack-driven trim (§10).
- If the header is absent, the client MUST fail the connect with a LineSenderException-equivalent error; silently waiting for ack frames that will never arrive lets the SF grow until disk fills.

Auth failures during upgrade (HTTP 401 / 403) are terminal per
failover.md §6 — they neither retry the initial connect nor trigger
reconnect, and surface immediately as a SECURITY_ERROR category error
(HALT policy). Other non-101 statuses follow the topology / transient
classification in failover.md §6 (e.g. 421 is in-round topology;
404 / 426 / 503 are transient and consume the reconnect budget).
Every server response begins with a 1-byte status. OK is 0x00.
Offset Size Field Notes
─────────────────────────────────────────────────────────
0 1 status 0x00
1 8 sequence wireSeq (little-endian int64)
9 2 tableCount uint16 LE; 0 for non-WAL or empty batches
11 variable table entries repeated tableCount times:
2 bytes nameLen (LE)
nameLen bytes name (UTF-8)
8 bytes seqTxn (LE int64)
On OK reception:

1. Compute fsn = fsnAtZero + min(sequence, nextWireSeq - 1). The min defends against malicious or buggy servers reporting a wireSeq ahead of what the client sent.
2. Call engine.acknowledge(fsn): advance the trim watermark, possibly freeing whole segments (§6.4).

Auto-trim cadence is at the discretion of the I/O loop; the reference
implementation calls acknowledge immediately on every OK.
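
A sketch of OK-frame decoding and the clamped FSN mapping. The function names are hypothetical, and reading nameLen as an unsigned 16-bit value is an assumption; the layout only says "2 bytes (LE)".

```python
import struct


def parse_ok(frame: bytes):
    """Decode an OK frame; return (sequence, {table name: seqTxn})."""
    status, sequence, table_count = struct.unpack_from("<BqH", frame, 0)
    if status != 0x00:
        raise ValueError("not an OK frame")
    offset = 11
    tables = {}
    for _ in range(table_count):
        (name_len,) = struct.unpack_from("<H", frame, offset)  # assumed unsigned
        offset += 2
        name = frame[offset : offset + name_len].decode("utf-8")
        offset += name_len
        (seq_txn,) = struct.unpack_from("<q", frame, offset)
        offset += 8
        tables[name] = seq_txn
    return sequence, tables


def acked_fsn(fsn_at_zero: int, sequence: int, next_wire_seq: int) -> int:
    """Map an echoed wireSeq to an FSN, clamped to what was actually sent."""
    return fsn_at_zero + min(sequence, next_wire_seq - 1)
```

With fsnAtZero = 100 and eight frames sent (nextWireSeq = 8), an echoed sequence of 5 maps to FSN 105, while a bogus sequence of 50 is clamped to FSN 107.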
When request_durable_ack=on was negotiated, OK frames do not advance
ackedFsn. They are stashed (with their per-table seqTxns) into a
pendingDurable queue keyed by wireSeq. Trim happens only on
STATUS_DURABLE_ACK (§10).
Empty OK frames (tableCount=0, e.g. materialized-view-only batches) are
trivially durable: their wireSeq advances ackedFsn as soon as all
preceding entries in pendingDurable have been confirmed.
Durable-ack frames carry per-table watermarks for data already uploaded from the server's WAL to the configured object store.
Offset Size Field Notes
─────────────────────────────────────────────────────────
0 1 status 0x02
1 2 tableCount uint16 LE
3 variable table entries repeated tableCount times:
2 bytes nameLen (LE)
nameLen bytes name (UTF-8)
8 bytes seqTxn (LE int64)
There is no sequence field. Durable-acks are cumulative per table:
each entry reports the highest seqTxn whose WAL segments are durable.
Maintain durableTableWatermarks: Map<tableName, int64> (monotonic, max-merge).
On each durable-ack frame:

1. For each entry, set durableTableWatermarks[name] = max(current, seqTxn).
2. Walk pendingDurable (FIFO of (wireSeq, tableSeqTxnsForThisFrame)) from the head.
3. An entry is covered when, for every (name, seqTxn) it carries, durableTableWatermarks[name] >= seqTxn. Stop at the first entry that is not covered.
4. Call engine.acknowledge(fsnAtZero + maxCoveredWireSeq) once with the highest covered wireSeq.

The watermark advances strictly monotonically; reordered durable-ack frames are tolerated by the max-merge.
On reconnect, pendingDurable is discarded. The new connection's OKs
re-stash entries; the server re-emits cumulative durable-ack watermarks
from scratch. Trim restarts against the new connection's wireSeq mapping.
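
The durable-ack bookkeeping described in this section can be sketched as a small tracker. The class and method names are hypothetical; the methods return the FSN to pass to engine.acknowledge, or None when nothing new is covered. A fresh tracker per connection models the "discard on reconnect" rule.

```python
from collections import deque


class DurableAckTracker:
    def __init__(self, fsn_at_zero: int):
        self.fsn_at_zero = fsn_at_zero
        self.watermarks = {}    # table name -> highest durable seqTxn
        self.pending = deque()  # FIFO of (wireSeq, {table name: seqTxn})

    def on_ok(self, wire_seq, table_seq_txns):
        """Stash an OK frame; an empty one may be acknowledged right away."""
        self.pending.append((wire_seq, dict(table_seq_txns)))
        return self._drain()

    def on_durable_ack(self, entries):
        """Max-merge the per-table watermarks, then drain covered entries."""
        for name, seq_txn in entries.items():
            current = self.watermarks.get(name)
            if current is None or seq_txn > current:
                self.watermarks[name] = seq_txn
        return self._drain()

    def _covered(self, tables) -> bool:
        return all(
            name in self.watermarks and self.watermarks[name] >= seq_txn
            for name, seq_txn in tables.items()
        )

    def _drain(self):
        max_covered = None
        while self.pending and self._covered(self.pending[0][1]):
            max_covered = self.pending.popleft()[0]
        if max_covered is None:
            return None
        return self.fsn_at_zero + max_covered
```

An empty OK frame carries no tables, so `_covered` is trivially true for it; it is released as soon as every entry ahead of it in the FIFO has been confirmed.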
The OSS server only flushes pending durable-ack frames during inbound recv events (binary message, ping, close). An idle client that has finished publishing would otherwise never see the durable-ack frame.
The SF I/O loop sends a WebSocket PING (any small payload; reference sends 2 bytes) when all of these hold:

- request_durable_ack=on was negotiated.
- pendingDurable is non-empty.
- durable_ack_keepalive_interval_millis > 0.
- At least durable_ack_keepalive_interval_millis have elapsed since the last sent frame or PING.

In non-durable-ack mode the loop sends NO keepalive PINGs; ordinary frame traffic keeps the connection alive on the server side.
PING is always a WebSocket-level PING (RFC 6455 opcode 0x9), not a QWP message. Servers reply with PONG; clients ignore PONG content.
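
The four conditions reduce to a single predicate. This is a sketch; the function and parameter names are hypothetical, and the caller is assumed to track the time of the last sent frame or PING on a monotonic millisecond clock.

```python
def should_send_ping(
    durable_ack_negotiated: bool,
    pending_durable_count: int,
    interval_millis: int,
    now_millis: int,
    last_send_millis: int,
) -> bool:
    """True when the I/O loop should emit a keepalive WebSocket PING."""
    return (
        durable_ack_negotiated
        and pending_durable_count > 0
        and interval_millis > 0
        and now_millis - last_send_millis >= interval_millis
    )
```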
Every frame written to the SF substrate MUST be self-sufficient: it carries the full schema for every table it touches and the full symbol-dictionary delta from id 0. Schema-by-id refs are forbidden, and incremental delta-dicts are forbidden.
Concretely, the encoder is invoked with:

- confirmedMaxId = -1 (so the symbol delta starts at id 0 every time)
- useSchemaRef = false (so column blocks always carry full schema mode)

Rationale: SF frames may be replayed against a fresh server connection weeks later: after process restart, after reconnect, or in a background drainer adopting an orphan slot. A frame that references schema id N or symbol id M is unreplayable if the new server has never seen those ids. Self-sufficiency makes every frame valid against any server. The cost is a small per-batch overhead, which is accepted.
This is mandatory for cross-client interop. A drainer written in C++ adopting a slot written by Java must be able to replay the bytes unchanged.
Any wire error — send failure, recv failure, server close, ack timeout — enters the reconnect loop. The producer thread is not notified; it keeps publishing into the substrate (subject to the cap, §16).
Backoff math is normatively specified in failover.md
§3 (ComputeBackoff / NextBackoffOrGiveUp). SF uses equal-jitter
[base, 2·base) per §3.1 — equal-jitter has a non-zero lower bound
that damps reconnect storms when multiple SF producers share a cluster.
The post-jitter sleep is clamped to the remaining outage budget so a
single sleep cannot overshoot reconnect_max_duration_millis; see
failover.md §3 for the exact deadline check and §13.6 below for how
the SF loop consumes it.
Defaults bound to the SF connect-string keys
(reconnect_initial_backoff_millis = 100ms,
reconnect_max_backoff_millis = 5_000ms,
reconnect_max_duration_millis = 300_000ms); see §4.2 for the keys
and failover.md §7 for the cross-context defaults table.
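
For orientation only, the shape of an equal-jitter sleep with a budget clamp might look like the sketch below. failover.md §3 remains the normative definition; the doubling schedule, applying the ceiling to the base rather than to the jittered value, and the name `next_sleep_millis` are all assumptions made for illustration.

```python
import random


def next_sleep_millis(
    attempt: int,
    elapsed_millis: float,
    initial: int = 100,
    max_backoff: int = 5_000,
    max_duration: int = 300_000,
    rng=random,
):
    """Return the next sleep in ms, or None when the outage budget is spent."""
    remaining = max_duration - elapsed_millis
    if remaining <= 0:
        return None  # give up: per-outage budget exhausted
    # Assumed schedule: double per attempt, capped at the backoff ceiling.
    base = min(initial * 2 ** min(attempt, 30), max_backoff)
    jittered = base + rng.random() * base  # equal-jitter: [base, 2*base)
    return min(jittered, remaining)  # never overshoot the budget
```

Passing a seeded `random.Random` instance as `rng` makes the sleeps reproducible in tests.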
Error classification is normatively specified in failover.md §6.
Mapped to the SF client's local error categories:

- Terminal: AuthError (HTTP 401 / 403 on the upgrade) surfaces as SECURITY_ERROR; server-status rejects surface in the matching status-byte category.
- Topology (421 Misdirected Request + X-QuestDB-Role): handled within the round per §13.6 (and failover.md §6 "Topology"); not terminal.
- Transient: 404, 426, 503, any other 4xx / 5xx that is not 401 / 403 / 421, an upgrade response that advertises a QWP version outside the supported range (per-endpoint mismatch; the loop walks to the next host so a rolling upgrade in flight does not block compatible peers), and generic frame-decode errors.

When the per-outage budget (reconnect_max_duration_millis) exhausts,
the loop becomes terminal and surfaces as PROTOCOL_VIOLATION with FSN
span = unacked window at giveup time.
By default (initial_connect_retry=off) any failure on the very first
connect is terminal: it is typically a misconfiguration, and retrying for
5 minutes would only hide it. Setting initial_connect_retry=on (aliases:
sync, true) reuses the same retry loop, blocking the calling thread.
async runs the retry loop in the I/O thread and lets the constructor
return; the producer then experiences backpressure if it tries to
publish before the connection comes up.
On each successful (re)connect:

1. Set fsnAtZero = engine.ackedFsn() + 1 (that is, 0 on a brand-new slot where ackedFsn is -1).
2. Set nextWireSeq = 0.
3. Position the send cursor at fsnAtZero.
4. Send frames in FSN order, incrementing nextWireSeq for each frame sent.
5. Frames past the replay boundary (fsn > replayTargetFsn) are sent as new frames; the boundary affects only the totalFramesReplayed counter, not behaviour.

Producer threads MAY append concurrently during replay; the I/O loop
uses the volatile publishedFsn() cursor to discover newly-appended
frames.
The full pseudocode that ties together failure detection (§13.1),
backoff (§13.2), terminal classification (§13.3), initial-connect
policy (§13.4), and replay (§13.5). The loop owns the host tracker
defined in failover.md §2 and consumes the backoff
function in failover.md §3.
    backoff = BackoffState()
    seenFirstConnect = false
    lastError = null
    previousIdx = -1                         # no mid-stream failure recorded yet
    loop forever:
        if previousIdx >= 0:                 # mid-stream failure path
            tracker.RecordMidStreamFailure(previousIdx)
            previousIdx = -1
        idx = tracker.PickNext()
        if idx < 0:
            # Round exhausted: every host in this round has been tried. Pay
            # one sleep, reset the round, then walk again.
            if lastError == RoleReject:
                sleep min(InitialBackoff, remaining)   # remaining = MaxOutageDuration - elapsed
                if elapsed >= MaxOutageDuration: SET TERMINAL; exit loop
                backoff.Reset()              # role-reject does not advance doubling
            else:
                sleep = NextBackoffOrGiveUp(backoff.attempt, backoff.elapsedSinceOutage)
                if sleep == GIVE_UP: SET TERMINAL; exit loop
                sleep; backoff.attempt++
            tracker.BeginRound(forgetClassifications=true)
            continue
        try connect host[idx]:
            on AuthError: SET TERMINAL; exit loop
            on RoleReject(t):
                tracker.RecordRoleReject(idx, t); lastError = RoleReject
                continue                     # no per-host sleep within the round
            on other error (incl. upgrade-time version mismatch):
                tracker.RecordTransportError(idx); lastError = TransportError
                if !seenFirstConnect && !initial_connect_retry:
                    SET TERMINAL; exit loop
                continue                     # no per-host sleep within the round
        seenFirstConnect = true
        backoff.Reset(); lastError = null
        rewind cursor to (ackedFsn + 1)      # replay first un-acked frame; see §13.5
        run send/recv pumps until either pump throws
            on server status reject / AuthError: SET TERMINAL
            on transient error (TCP/TLS, mid-stream send/recv, frame corruption, etc.):
                previousIdx = idx            # demoted on next iter via RecordMidStreamFailure
                continue                     # next iter walks the next untried host
Key properties:
PickNext == -1), the loop MUST invoke
BeginRound(forgetClassifications=true) before the next PickNext.
The current-round flags reset; classifications fade except for the
sticky-Healthy entry (failover.md §2.2).auth_timeout_ms allows. The post-error sleep (exponential
for transport, fixed InitialBackoff for role-reject) is paid once
per round, at the moment PickNext returns -1 and immediately
before BeginRound(forgetClassifications=true). The per-producer
round time stays bounded by auth_timeout_ms × hostCount plus one
backoff sleep, and equal-jitter (failover.md §3.1) spreads the
round-boundary sleeps across producers. Mirrors the egress
WalkTracker (wire-egress.md §11.9), where the
address-list walk is also un-throttled within a round.initial_connect_retry (default off):
off — first connect failure is terminal.on (alias sync) — block the constructor; enter the standard
reconnect loop until success or MaxOutageDuration exhaustion.async — return from the constructor immediately; the
background I/O thread drives the reconnect loop. append writes
accumulate in the SF dir without blocking on connect. Intended
for unattended producers where the SF dir may already carry
segments queued from a prior process and the server may come up
later.421 + role reply is interpreted as a topology hint. Within a round it
flows through the no-sleep path above (the next host gets tried
immediately). When a round exhausts with role-reject as lastError,
the round-boundary sleep is the configured InitialBackoff (no
doubling) and backoff.attempt is reset, so a long PRIMARY_CATCHUP
window doesn't blow up to reconnect_max_backoff_millis per probe.
The wall-clock outage budget (reconnect_max_duration_millis)
still bounds the loop — if every endpoint stays role-rejecting for
the full budget, the loop terminates. Operators should size
reconnect_max_duration_millis to span their largest expected
failover window (default 5 minutes).previousIdx. The next loop iteration calls
tracker.RecordMidStreamFailure(previousIdx) before the next
PickNext, so Healthy→TransportError demotion happens before host
selection runs (see failover.md §2.3 — this ordering is normative).
The next host is then tried under the same no-sleep rule
(skip-backoff-within-round), so a healthy peer takes over with only
auth_timeout_ms worth of slack. The mid-stream failure itself
does not advance backoff.attempt — the doubling counter starts
fresh from the post-success backoff.Reset() and only moves
forward at round exhaustion, so a single mid-stream blip pays no
exponential cost when at least one peer is healthy.

previousIdx. When multiple loops drive reconnects
against the same tracker (foreground SF I/O thread plus background
drainers; see §18.2), each loop MUST own its own previousIdx
slot. Sharing one slot across loops corrupts the
Healthy→TransportError demotion. The Java client expresses this as
a per-caller ReconnectFactory instance with a private
previousIdx field. See failover.md §2.3.

A non-OK status byte (other than 0x02) is an error frame.

| Offset | Size | Field | Notes |
|---|---|---|---|
| 0 | 1 | status | see table below |
| 1 | 8 | sequence | wireSeq the error applies to (LE int64) |
| 9 | 2 | msgLen | uint16 LE (≤ 1024) |
| 11 | msgLen | msg | UTF-8 error message |

| Status | Hex | Category |
|---|---|---|
| 3 | 0x03 | SCHEMA_MISMATCH |
| 5 | 0x05 | PARSE_ERROR |
| 6 | 0x06 | INTERNAL_ERROR |
| 8 | 0x08 | SECURITY_ERROR |
| 9 | 0x09 | WRITE_ERROR |
| any other (incl. 0x04, 0x07, 0xFF) | - | UNKNOWN |
| (n/a; synthesized for WS close codes) | - | PROTOCOL_VIOLATION |

| Category | Default policy | Rationale |
|---|---|---|
| SCHEMA_MISMATCH | DROP_AND_CONTINUE | Replay reproduces the same rejection; halting blocks unrelated tables. |
| PARSE_ERROR | HALT | Almost certainly a client bug; preserve disk frames for postmortem. |
| INTERNAL_ERROR | HALT | Catch-all server fault; conservative without a retryable bit. |
| SECURITY_ERROR | HALT | Misconfig; loud failure wanted. |
| WRITE_ERROR | DROP_AND_CONTINUE | "Table not accepting writes" is per-batch in character. |
| PROTOCOL_VIOLATION | HALT (forced) | Connection is gone; no choice. |
| UNKNOWN | HALT (forced) | Never silently drop something we don't understand. |
The rows not marked "(forced)" are user-overridable via the on_*_error
connect keys (when implemented; see §4.4).
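As an illustration of the error-frame layout and the default policy table above, the following is a minimal Python sketch of a decoder. The names `parse_error_frame`, `ErrorFrame`, `CATEGORY_BY_STATUS`, and `DEFAULT_POLICY` are hypothetical and belong to no client API. The sketch assumes the caller has already established that the status byte is neither OK nor 0x02, and raising `ValueError` on a malformed frame is an assumption of the sketch, not something this spec prescribes.

```python
import struct
from typing import NamedTuple

# Status byte -> category, per the status table; anything else is UNKNOWN.
CATEGORY_BY_STATUS = {
    0x03: "SCHEMA_MISMATCH",
    0x05: "PARSE_ERROR",
    0x06: "INTERNAL_ERROR",
    0x08: "SECURITY_ERROR",
    0x09: "WRITE_ERROR",
}

# Categories whose default policy is DROP_AND_CONTINUE; all others HALT.
DEFAULT_POLICY = {
    "SCHEMA_MISMATCH": "DROP_AND_CONTINUE",
    "WRITE_ERROR": "DROP_AND_CONTINUE",
}

# status (1 byte), sequence (LE int64), msgLen (LE uint16): 11 bytes, no padding.
HEADER = struct.Struct("<BqH")
MAX_MSG_LEN = 1024


class ErrorFrame(NamedTuple):
    status: int
    sequence: int  # wireSeq the error applies to
    message: str
    category: str
    policy: str


def parse_error_frame(buf: bytes) -> ErrorFrame:
    """Decode one error frame and attach its category and default policy."""
    if len(buf) < HEADER.size:
        raise ValueError("error frame shorter than the 11-byte header")
    status, sequence, msg_len = HEADER.unpack_from(buf, 0)
    if msg_len > MAX_MSG_LEN or len(buf) < HEADER.size + msg_len:
        raise ValueError("msgLen out of range or frame truncated")
    message = buf[HEADER.size:HEADER.size + msg_len].decode("utf-8")
    category = CATEGORY_BY_STATUS.get(status, "UNKNOWN")
    policy = DEFAULT_POLICY.get(category, "HALT")
    return ErrorFrame(status, sequence, message, category, policy)
```

The sketch resolves the category before the policy, so an unlisted status byte such as 0x07 falls through to UNKNOWN and therefore to HALT, matching the forced row in the table.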
SenderError and offer to the error inbox (non-blocking).

engine.acknowledge(fsnAtZero + sequence).

pendingDurable, then run the drain loop.

DROP semantically discards the data: once the server has rejected, the SF substrate is no longer the producer's safety net. Users who need a dead-letter trail must register an error handler.
SenderError and stash it as the loop's terminal error.

SenderError).

error_inbox_capacity (default 256).

droppedErrorNotifications
counter. Watermarks are monotonic, so the latest entry is always the
most informative; drops compress information rather than lose it.

WebSocket Close frames are inspected for the close code:
Surface as PROTOCOL_VIOLATION / HALT. Do NOT enter the reconnect
loop.
| Code | Name |
|---|---|
| 1002 | PROTOCOL_ERROR |
| 1003 | UNSUPPORTED_DATA |
| 1007 | INVALID_PAYLOAD_DATA |
| 1008 | POLICY_VIOLATION |
| 1009 | MESSAGE_TOO_BIG |
| 1010 | MANDATORY_EXTENSION |
The serverStatusByte is -1 and messageSequence is -1 for
PROTOCOL_VIOLATION events. The error message is
"ws-close[<code>]: <reason>". The FSN span is
[engine.ackedFsn() + 1, engine.publishedFsn()] — the unacked window at
close time.
Enter the reconnect loop (subject to per-outage cap):
| Code | Name |
|---|---|
| 1000 | NORMAL_CLOSURE |
| 1001 | GOING_AWAY |
| 1006 | ABNORMAL_CLOSURE |
| 1011 | INTERNAL_ERROR |
| any other code not listed in §15.1 | — |
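The two close-code tables and the PROTOCOL_VIOLATION field rules above can be sketched as a single decision function. This is a hedged Python illustration: `on_ws_close`, `ProtocolViolation`, and `TERMINAL_CLOSE_CODES` are hypothetical names, and returning `None` to mean "enter the reconnect loop" is a convention of the sketch only.

```python
from typing import NamedTuple, Optional

# Close codes that surface as PROTOCOL_VIOLATION / HALT (no reconnect).
TERMINAL_CLOSE_CODES = frozenset({1002, 1003, 1007, 1008, 1009, 1010})


class ProtocolViolation(NamedTuple):
    server_status_byte: int  # always -1 for synthesized events
    message_sequence: int    # always -1 for synthesized events
    message: str
    from_fsn: int            # first unacked FSN at close time
    to_fsn: int              # last published FSN at close time


def on_ws_close(code: int, reason: str,
                acked_fsn: int, published_fsn: int) -> Optional[ProtocolViolation]:
    """Return a terminal event, or None when the loop should reconnect."""
    if code not in TERMINAL_CLOSE_CODES:
        # 1000, 1001, 1006, 1011 and every unlisted code: reconnect.
        return None
    return ProtocolViolation(
        server_status_byte=-1,
        message_sequence=-1,
        message=f"ws-close[{code}]: {reason}",
        from_fsn=acked_fsn + 1,
        to_fsn=published_fsn,
    )
```

Treating the terminal set as the explicit list and everything else as reconnectable mirrors the "any other code" row: an unknown code never halts the sender by accident.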
The substrate enforces sf_max_total_bytes as a hard cap on resident
storage. When full, the producer's appendBlocking busy-spins (with
cooperative yield) for up to sf_append_deadline_millis waiting for ACK
arrival → trim → space free. If the deadline fires the call throws a
LineSenderException.
The error message MUST distinguish:
The cap covers all sealed segments + the active segment. Trim immediately
returns segment bytes to the available pool when ackedFsn crosses
baseSeq + frameCount - 1.
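A minimal Python sketch of the backpressure wait and the trim condition described above. All names are hypothetical: `try_append` stands in for a single non-blocking append attempt against the ring, and `AppendDeadlineExceeded` stands in for the LineSenderException. The sketch also assumes that "crosses" means "reaches or passes", which is an interpretation rather than wording taken from this spec.

```python
import time
from typing import Callable


class AppendDeadlineExceeded(Exception):
    """Stand-in for the LineSenderException the spec describes."""


def append_blocking(try_append: Callable[[], bool],
                    deadline_millis: int,
                    clock: Callable[[], float] = time.monotonic,
                    cooperative_yield: Callable[[], None] = lambda: time.sleep(0)) -> None:
    """Spin on try_append until it succeeds or the deadline passes."""
    deadline = clock() + deadline_millis / 1000.0
    while not try_append():
        if clock() >= deadline:
            raise AppendDeadlineExceeded(
                f"no space freed within {deadline_millis} ms")
        # Give the I/O thread a chance to process ACKs and trim.
        cooperative_yield()


def segment_fully_acked(acked_fsn: int, base_seq: int, frame_count: int) -> bool:
    """True once every frame in the segment is acked, so its bytes can be freed."""
    return acked_fsn >= base_seq + frame_count - 1
```

Injecting the clock and the yield function keeps the loop testable without real waiting; a production client would use whatever spin and yield primitive its runtime provides.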
close() semantics depend on close_flush_timeout_millis:
5000): block waiting for engine.ackedFsn() >= engine.publishedFsn()
for up to 5 seconds. If the wait succeeds, all data is acked. If the
timeout fires, log a WARN and proceed; pending data remains on disk
(SF mode) or is lost (Memory mode).

0 or -1: skip the drain entirely. Pending data is lost (Memory)
or persists for the next sender to recover (SF). Producer state is
still flushed into the engine before close() returns; only the wait
for server ACK is skipped.

Regardless of which branch above runs, close() then performs a
non-blocking safety-net checkError() and rethrows any latched
terminal error, with two guards that prevent double-signalling:
SenderErrorHandler has already been
delivered an error during this sender's lifetime (the user has
already owned the failure asynchronously).

flush() / at() /
atNow()) has already seen the latched error synchronously.

The safety net is non-blocking (a volatile read plus a throw) so it
does not violate the "fast close" intent of the opt-out branch. Its
purpose is to keep latched terminal errors visible to users who only
ever call close() and have not installed an async handler — without
it, the default no-op handler would silently swallow real failures.
Implementations MUST always release the slot lock and unmap segments on
close(), regardless of timeout outcome or whether the safety net
threw. Implementations MUST NOT skip the safety-net checkError() on
the basis of close_flush_timeout_millis alone; only the two guards
above suppress it.
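The ordering constraints above (optional drain wait, unconditional safety net with two guards, unconditional resource release) can be sketched as follows. This is a hedged Python illustration, not the reference implementation: `close_sender` and all of its parameters are hypothetical, `engine` is any object exposing `acked_fsn()` and `published_fsn()`, and the step that flushes producer state into the engine is omitted.

```python
import time
from typing import Callable, Optional


def close_sender(engine,
                 close_flush_timeout_millis: int,
                 latched_error: Optional[BaseException],
                 handler_was_notified: bool,
                 producer_saw_error: bool,
                 release_resources: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
    try:
        # Branch 1: positive timeout, so wait for ACKs up to the deadline.
        # Branch 2 (0 or -1): skip the wait entirely.
        if close_flush_timeout_millis > 0:
            deadline = clock() + close_flush_timeout_millis / 1000.0
            while engine.acked_fsn() < engine.published_fsn():
                if clock() >= deadline:
                    # A real client logs a WARN here and proceeds.
                    break
                sleep(0.001)
        # Safety net: runs on both branches. Only the two guards
        # suppress it, never the timeout setting alone.
        if (latched_error is not None
                and not handler_was_notified
                and not producer_saw_error):
            raise latched_error
    finally:
        # Slot lock release and segment unmap happen even if the
        # safety net threw or the drain timed out.
        release_resources()
```

Placing the release in `finally` is what lets the safety net throw without leaking the slot lock.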
On engine open in SF mode:

<sf_dir>/<sender_id>/.lock. Refuse to start on contention.

*.sfa files.

.ack-watermark (§5.4) if implemented. Read its FSN, or
record INVALID if the file is missing / has bad magic / fails to
map. Keep the mmap held for the engine's lifetime so the manager
can write through it without reopening.

ackedFsn = max(lowestBaseSeq − 1, watermark), with the
additional bound watermark <= publishedFsn (see §5.4). A
watermark of INVALID or one exceeding publishedFsn reduces
to the bare lowestBaseSeq − 1 seed (legacy behaviour, no
regression).

fileGeneration = max(existing-gen) + 1.

A clean shutdown that drained all data is indistinguishable from a
fresh start — no segments, no replay, just ackedFsn = -1. A stale
.ack-watermark from a prior fully-drained session is unlinked at
this point so the new session starts with the expected "no
watermark yet" state.
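The seeding rules above can be sketched in Python as below. Function and constant names are hypothetical. The sketch takes the watermark format from this spec (16 bytes, magic 0x31574B41, FSN at offset 8, little-endian) and the segment naming pattern sf-<gen:016x>.sfa, and adds three assumptions of its own: the magic is a 32-bit value at offset 0, the FSN is a signed 64-bit integer, and a directory with no segments starts at generation 0.

```python
import re
import struct
from typing import Iterable, Optional

ACK_WATERMARK_MAGIC = 0x31574B41
ACK_WATERMARK_SIZE = 16
SEGMENT_NAME = re.compile(r"^sf-([0-9a-f]{16})\.sfa$")


def read_watermark(raw: bytes) -> Optional[int]:
    """Return the stored FSN, or None (INVALID) on short file or bad magic."""
    if len(raw) < ACK_WATERMARK_SIZE:
        return None
    (magic,) = struct.unpack_from("<I", raw, 0)  # assumption: magic at offset 0
    if magic != ACK_WATERMARK_MAGIC:
        return None
    (fsn,) = struct.unpack_from("<q", raw, 8)    # FSN at offset 8, little-endian
    return fsn


def seed_acked_fsn(lowest_base_seq: int,
                   watermark: Optional[int],
                   published_fsn: int) -> int:
    """ackedFsn = max(lowestBaseSeq - 1, watermark), bounded by publishedFsn."""
    floor = lowest_base_seq - 1
    if watermark is None or watermark > published_fsn:
        return floor  # legacy seed: INVALID or out-of-range watermark is ignored
    return max(floor, watermark)


def next_file_generation(file_names: Iterable[str]) -> int:
    """fileGeneration = max(existing-gen) + 1, parsed from segment file names."""
    gens = [int(m.group(1), 16)
            for m in map(SEGMENT_NAME.match, file_names) if m]
    return max(gens, default=-1) + 1  # assumption: empty dir starts at 0
```

A watermark below `lowestBaseSeq - 1` has no effect, because the max clamp already covers frames whose segments were trimmed.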
When drain_orphans=on the foreground sender, after acquiring its own
lock, scans <sf_dir>/* for sibling slots that:
sender_id,

*.sfa file,

.failed.

Note: lock state is intentionally NOT in the candidate filter; testing flock state races with concurrent acquirers. The drainer pool tries the locks in turn.
For each candidate, up to max_background_drainers at a time, spawn a
drainer that:
.lock. Skip on contention.

ackedFsn ≥ publishedFsn and exits.

Drainer failure modes:
.failed with a UTF-8 reason,
release lock, exit.

.failed slots are excluded from auto-drain on subsequent scans; the
operator must clear the sentinel manually.
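A hedged Python sketch of the candidate scan described above. The three filter conditions are reconstructed from the surviving fragments, so treat them as assumptions: the slot is not the foreground sender's own sender_id, it holds at least one *.sfa file, and it carries no .failed sentinel. `find_orphan_candidates` is a hypothetical name. As the note above requires, lock state is deliberately not inspected here; acquiring `.lock` is left to each drainer.

```python
from pathlib import Path
from typing import List


def find_orphan_candidates(sf_dir: str, own_sender_id: str) -> List[str]:
    """List sibling slot names eligible for background draining."""
    candidates = []
    for slot in sorted(Path(sf_dir).iterdir()):
        # Skip stray files and the foreground sender's own slot.
        if not slot.is_dir() or slot.name == own_sender_id:
            continue
        # Presence (not contents) of .failed excludes the slot.
        if (slot / ".failed").exists():
            continue
        # A slot without segment files has nothing to drain.
        if not any(slot.glob("*.sfa")):
            continue
        candidates.append(slot.name)
    return candidates
```

A caller would hand the result to a pool bounded by max_background_drainers, with each drainer skipping its slot if the lock is contended.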
Mandatory for every conformant SF client:

sf-<gen:016x>.sfa. Other clients
reading the slot need this to enumerate.

.lock,
PID in .lock.pid. Cross-client lock interop requires identical lock
primitives: POSIX clients use flock/fcntl, Windows uses
LockFileEx. A POSIX client and a Windows client MUST refuse to share
a slot on a network filesystem.

.failed sentinel (§5.2): presence (not contents) is the auto-drain
exclusion signal.

.ack-watermark (§5.4): the file is OPTIONAL; a conformant client
MAY skip implementing it. When present, the format is normative (16
bytes, magic 0x31574B41, FSN at offset 8, little-endian). A drainer
adopting a slot another client populated MUST honour the watermark on
recovery (apply the max clamp); ignoring it is allowed but produces
the duplicate-row behaviour the watermark was designed to suppress.

fsn = fsnAtZero + wireSeq. Strict in-order send
on the wire.

UNKNOWN).

cpp_*).

Permitted variations:
A conformant client SHOULD expose, at minimum:
| Counter | Meaning |
|---|---|
| getTotalReconnectAttempts() | Reconnect attempt count across the lifetime of the sender. |
| getTotalReconnectsSucceeded() | Successful reconnect count. |
| getTotalFramesReplayed() | Frames re-sent after a reconnect (i.e. wireSeq < replayTargetFsn). |
| getTotalServerErrors() | Count of error frames received (any category). |
| getDroppedErrorNotifications() | Count of error notifications dropped due to inbox overflow. |
| getTotalErrorNotificationsDelivered() | Count delivered to the user handler. |
| getTotalBackpressureStalls() | Count of producer threads that hit the cap. |
| getLastTerminalError() | The latched SenderError (or null). |
| getActiveBackgroundDrainers() | Current count of running orphan-slot drainers. Same lax-cleanup race as the underlying pool snapshot: a drainer that finished moments ago may still count for a few ms. |
| getTotalBackgroundDrainersSucceeded() | Cumulative count of drainers that exited after fully draining their slot. |
| getTotalBackgroundDrainersFailed() | Cumulative count of drainers that exited by dropping a .failed sentinel. |
The default error handler MUST log every received SenderError. Silence is
forbidden by the contract: a buggy or no-op handler hides data loss
indistinguishably from a healthy connection.
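A minimal sketch of a conforming default handler, in Python for illustration. The logger name `qwp.sf.errors` and the function name `default_error_handler` are hypothetical; the only property taken from this spec is that every received error is logged rather than swallowed.

```python
import logging

log = logging.getLogger("qwp.sf.errors")  # hypothetical logger name


def default_error_handler(error) -> None:
    """Log every error notification; never discard one silently."""
    log.error("QWP sender error: %s", error)
```

Logging at ERROR level means the message passes Python's default WARNING threshold even when the application has configured no logging at all.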
Per-drainer event observation (progress watermark, durable-ack-mismatch
escalation, terminal outcome) is delivered through the existing
BackgroundDrainerListener callback at the pool, not through a
sender-level snapshot accessor. The three counters above cover
dashboards and post-startup health checks ("did my orphans get
adopted?"); the listener covers everything that wants per-drainer
event-time visibility. On-disk .failed sentinels remain the canonical
record of giveup events surviving sender restart.
Items intentionally out of scope today, tracked for future revisions:
sf_durability=flush and =append: per-batch and per-frame fsync.
Parser accepts the values; build-time rejects them with
"not yet supported". The wire-level behaviour is unchanged; only the
producer's flush path (and the sf_append_deadline_millis semantics)
need new code.

INTERNAL_ERROR. A
future server change could split status bytes or add a 1-byte field;
when that lands, the spec gains a RETRY_TRANSIENT policy.

tableName is only populated when the rejected batch had exactly one
table. The wire format would need an optional table index field for
the multi-table case.

on_*_error connect-string keys: normatively defined in §4.4 but
not yet recognised by the Java parser; only the fluent
LineSenderBuilder API exposes them today.

.ack-watermark (§5.4) already records which
prefix is safe. Options for a future revision: rewrite-and-rotate
(copy un-acked frames into a fresh segment, unlink the old),
fallocate(FALLOC_FL_PUNCH_HOLE) on Linux (no Windows equivalent),
or a <segment>.ack-offset sidecar consulted by the trim path.
Reduces effective disk consumption under slow durable-ack cadence;
does not affect correctness.

resumeAfterHalt() API: clears the latched terminal error and
restarts the I/O loop without rebuilding the sender. The reference
loop is one-shot today; users work around with close() + rebuild.

| Date | Commit | Change |
|---|---|---|
| 2026-05-05 | client 1125b0b / oss dc108e66ff | Initial spec extracted from Java reference. |
| 2026-05-08 | (pending) | Align with failover.md. §13.3 + §8.2: rewrite HTTP-status terminal classification to defer to failover.md §6 (only 401/403 are terminal; 421 is topology, handled in-round; 404/426/503 are transient and consume the reconnect budget). §4.5 + §21: drop "multi-host failover deferred" — failover.md is the normative spec. §4.2 + §13.4: clarify initial_connect_retry=on is canonical; sync and true are aliases. §13.2: replace inlined backoff math with a forward-pointer to failover.md §3 / §3.1 to avoid drift. |
| 2026-05-12 | (pending) | Add .ack-watermark (§5.4): optional mmap'd 16-byte FSN watermark that suppresses re-replay of already-durable-acked frames inside the lowest surviving sealed segment on process restart. §6.5 and §18.1 updated to seed ackedFsn = max(lowestBaseSeq - 1, watermark). §19 adds the interop bullet (file is optional but format is normative). §21 item 5 narrowed from "partial-ack tracking" to "disk reclaim inside partially-acked segments" — the re-replay half of that follow-up is now addressed by the watermark. |
| 2026-05-12 | (pending) | §20: replace the structured getBackgroundDrainers() snapshot accessor with three single-long counters — getActiveBackgroundDrainers(), getTotalBackgroundDrainersSucceeded(), getTotalBackgroundDrainersFailed(). The structured accessor allocated per call, exposed a lifecycle-bounded view through the same surface as the lifetime-cumulative counters, and overlapped with the existing BackgroundDrainerListener callback path. The simpler counters align with the rest of §20, give dashboards a "did my orphans get adopted?" signal, and leave per-drainer event-time visibility to the listener (which is the canonical extension point for drainer observation). On-disk .failed sentinels are unchanged and remain the canonical giveup record across sender restarts. |