Back to Questdb

QWP Store-and-Forward Client Specification

docs/qwp/sf-client.md

9.4.258.9 KB
Original Source

QWP Store-and-Forward Client Specification

This document specifies the client-side Store-and-Forward (SF) substrate that sits beneath QWP ingest. It is a companion to wire-ingress.md: that spec defines the bytes on the wire; this spec defines what every SF-capable client must do around them — disk format, frame sequence numbering, handshake, ACK-driven trim, durable-ack, keepalive, reconnect, replay, and error policy.

The goal is interop. Clients written in any language must:

  • accept the same connect-string options,
  • write the same on-disk format (so a slot written by one client can be drained by another),
  • observe the same wire-level invariants (so the server sees consistent behaviour from every client),
  • surface the same error categories and policies (so users get a portable contract).

Table of Contents

  1. Overview
  2. Reference Implementation
  3. Modes
  4. Connect-String Options
  5. Slot Directory Layout
  6. Segment File Format
  7. Frame-Sequence-Number (FSN) Model
  8. WebSocket Handshake
  9. OK ACK and Trim
  10. Durable-Ack ACK and Trim
  11. Keepalive PING
  12. Self-Sufficient Framing
  13. Reconnect and Replay
  14. Error Frame Handling
  15. WebSocket Close Codes
  16. Backpressure
  17. Close and Shutdown
  18. Recovery and Orphan Adoption
  19. Protocol Invariants for Cross-Client Interop
  20. Observability
  21. Deferred Items

1. Overview

The SF substrate sits between the user's Sender API and the QWP wire encoder. Encoded QWP messages ("frames") are appended to an in-memory or on-disk ring of fixed-size segments. A dedicated I/O thread drains the ring over a WebSocket connection. The server's ACKs feed back into the ring as a trim watermark, freeing storage. Disconnects, server restarts, and JVM crashes are masked from the producer; the I/O thread reconnects and replays whatever frames remain unacked.

Two modes share the same substrate:

  • Memory mode — segments in malloc'd memory; data does not survive process exit but tolerates transient network blips.
  • SF mode — segments are mmap'd files under a slot directory; data survives JVM crashes and process restarts. Recovery on next start replays unacked frames against a fresh server connection.

The producer never blocks on the wire. flush() returns once data is published into the substrate (RAM in Memory mode, disk in SF mode); ACKs arrive asynchronously.

2. Reference Implementation

The authoritative reference is the Java client at github.com/questdb/java-questdb-client, under core/src/main/java/io/questdb/client/cutlass/qwp/client/sf/cursor/. This spec was extracted from commit 1125b0b (branch vi_sf, 2026-05-05). When the implementation drifts from this spec, the spec is stale; raise it with the client maintainers and update this document.

Key files:

ConcernFile
Segment file I/OMmapSegment.java
Segment ring + rotationSegmentRing.java
Manager thread + capSegmentManager.java
Slot lockSlotLock.java
Ack watermark (.ack-watermark)AckWatermark.java
Orphan adoptionOrphanScanner.java, BackgroundDrainerPool.java, BackgroundDrainer.java
Engine (lifecycle, recovery)CursorSendEngine.java
I/O thread (wire)CursorWebSocketSendLoop.java
Wire response parserWebSocketResponse.java (one level up)
Sender + connect-string parserSender.java, QwpWebSocketSender.java (one level up)

3. Modes

AspectMemorySF
Storagemalloc'd ringmmap'd files in <sf_dir>/<sender_id>/
Default total cap128 MiB10 GiB
Survives JVM exitNoYes
Survives JVM crashNoYes (recovery on next start)
Reconnect retriesYesYes
Concurrent multi-sender slot collisionn/aDetected at startup via advisory exclusive lock
Orphan adoption (drain another sender's stale slot)n/aOpt-in (drain_orphans=on)

A client implementation may choose to support Memory mode only, SF mode only, or both. Users select Memory mode by omitting sf_dir from the connect string, and SF mode by including it.

4. Connect-String Options

All options pertain to the WebSocket transport (ws:: / wss::). HTTP/TCP ingest does not use the SF substrate. UDP has its own, separate model (see wire-udp.md).

4.1 Storage (SF mode)

KeyTypeDefaultDescription
sf_dirpathunset (= Memory mode)Group root. The slot lives at <sf_dir>/<sender_id>/.
sender_idstringdefaultSlot subdirectory name. Two senders sharing a sender_id collide on the lock.
sf_max_bytessize4MPer-segment file size; rotation threshold.
sf_max_total_bytessize128M (Memory) / 10G (SF)Cap on total buffered bytes; backpressures producer when full.
sf_durabilityenummemorymemory only is implemented; flush and append are reserved (see §21).
sf_append_deadline_millisint (ms)30000How long the producer's appendBlocking waits for ACK-driven trim before throwing.
drain_orphansbooloffScan <sf_dir>/* at startup and spawn drainers for sibling slots that contain unacked data.
max_background_drainersint4Cap on concurrent orphan drainers.

Size values accept integer bytes or unit suffixes (K, M, G, T, binary multipliers).

4.2 Reconnect / connect

KeyTypeDefaultDescription
reconnect_max_duration_millisint (ms)300000Per-outage time budget. Resets on each successful reconnect.
reconnect_initial_backoff_millisint (ms)100Initial backoff.
reconnect_max_backoff_millisint (ms)5000Backoff ceiling.
initial_connect_retryenumoffoff (canonical; alias false) = terminal on first failure; on (canonical; aliases sync, true) = same retry loop as reconnect, blocking; async = same retry loop in the I/O thread, non-blocking. See §13.4 (initial connect) and §13.6 (the loop pseudocode that consumes this knob).
close_flush_timeout_millisint (ms)5000close() blocks up to this long waiting for ackedFsn >= publishedFsn. 0 or -1 skips the drain entirely; the pre-drain checkError() safety net still runs (it does no I/O and only rethrows when the error has not reached the user through any other channel).

4.3 Durable-ack

KeyTypeDefaultDescription
request_durable_ackbooloffOpt-in via the upgrade header. Trim is then driven by STATUS_DURABLE_ACK frames only; OK frames no longer advance the trim watermark. Connect fails loudly if the server does not echo X-QWP-Durable-Ack: enabled.
durable_ack_keepalive_interval_millisint (ms)200Cadence of WebSocket PING the I/O loop sends while there are pending durable confirmations. 0 or negative disables. WebSocket-only.

4.4 Error handling

KeyTypeDefaultDescription
error_inbox_capacityint (≥16)256Bounded SPSC capacity for async error notifications. Overflow drops oldest and bumps droppedErrorNotifications.

The per-category override keys (on_server_error, on_schema_error, on_parse_error, on_internal_error, on_security_error, on_write_error) are reserved in the spec but are not yet recognised by the Java parser; they are wired only via the fluent LineSenderBuilder API today. New clients should accept them in connect strings — that's the spec — and treat them with the precedence rules in §14.

4.5 Other (WebSocket-relevant)

KeyTypeDefaultDescription
addrhost[:port][,host[:port]…]requiredServer endpoint(s). Comma-separated for multi-host failover; selection semantics, host-health states, and the cursor-engine reconnect loop are normatively specified in failover.md.
username / passwordstringunsetHTTP Basic auth on the upgrade request.
tokenstringunsetBearer token on the upgrade request.
tls_verifyenumonon or unsafe_off. Applies on wss:: / TLS connections.
tls_rootspathsystem trustPath to a custom CA trust store.
tls_roots_passwordstringunsetTrust store password.
auto_flushboolonMaster switch for auto-flush triggers.
auto_flush_rowsint / off1000Row-count flush trigger.
auto_flush_bytesint / off0 (off)Byte-size flush trigger.
auto_flush_intervalint (ms) / off100Time-since-first-row flush trigger. Triggered on the next at() / flush() call, not by a timer.
init_buf_sizesize64KInitial encode buffer capacity.
max_buf_sizesize100MMax encode buffer capacity.
max_name_lenint127Local validation cap for table/column names.
max_schemas_per_connectionint65535Per-connection schema-id ceiling.

4.6 Validation

The parser MUST reject:

  • Unknown keys (forward-compat is via the spec, not silent ignore).
  • sf_durability values other than memory, flush, append. flush and append parse but are rejected at build-time today (deferred).
  • sender_id containing path separators or empty.
  • request_durable_ack=on on non-WebSocket transports.

5. Slot Directory Layout

In SF mode the substrate writes all files under one slot directory:

<sf_dir>/<sender_id>/
├── .lock              # advisory exclusive lock (kernel-released on process exit)
├── .lock.pid          # UTF-8 text: holder PID + '\n' (diagnostic only)
├── .failed            # OPTIONAL: drainer-failure sentinel (UTF-8 reason text)
├── .ack-watermark     # OPTIONAL: 16-byte mmap'd durable-ack high-water mark
├── sf-<gen>.sfa       # segment files (one or more)
└── sf-<gen>.sfa

5.1 .lock and .lock.pid

The .lock file is held under an advisory exclusive lock for the engine's entire lifetime. POSIX clients use flock/fcntl; Windows clients use LockFileEx. The lock is automatically released when the file descriptor is closed, including on hard process exit (kernel cleanup).

A second engine pointing at the same slot dir MUST refuse to start. The contract is: detect the collision at acquire time and fail loudly. The error message MUST include the holder's PID by reading .lock.pid. The PID is a separate file because Windows LockFileEx is mandatory and would prevent reading the lock file's own bytes while it is held.

.lock.pid is overwritten on every successful acquisition with the acquirer's PID followed by a newline, UTF-8 encoded. Absent or empty .lock.pid is harmless; the failed acquire reports holder=unknown.

Neither .lock nor .lock.pid is removed on clean shutdown. Stale files are harmless and are silently overwritten by the next acquirer.

5.2 .failed

Present iff a previous drainer attempt gave up on the slot (reconnect budget exhausted, auth-terminal, irrecoverable corruption, etc.). The contents are a UTF-8 text reason for human operators; the presence, not the contents, is the signal. While .failed exists, the slot is excluded from auto-drain by the orphan scanner. Operators clear it manually.

Drainers MUST write .failed before releasing the slot lock and exiting.

5.3 Segment files (sf-<gen>.sfa)

<gen> is a 16-character lowercase zero-padded hexadecimal generation counter, e.g. sf-0000000000000003.sfa. The generation is allocated by the segment manager monotonically across the JVM lifetime; on recovery it is seeded to max(existing-gen) + 1 to avoid collisions with recovered files.

The filename does not encode the segment's FSN range. It encodes only the allocation order. The FSN range is read from the segment's header at recovery time. sf-initial.sfa is a legacy name accepted by the recovery scanner and skipped during max-gen computation.

Empty slot directories (*.sfa-less) are NOT auto-cleaned; the substrate never deletes the slot directory itself.

5.4 .ack-watermark

Optional 16-byte file that persists the cumulative durable-ack FSN across process restarts. Without it, recovery seeds ackedFsn = lowestSurvivingBaseSeq - 1 (see §6.5) — which guarantees no data loss but cannot tell within the lowest surviving sealed segment which frames the previous sender had already received durable acks for. Replay therefore re-sends every frame in that segment, producing row-level duplicates against a still-alive server unless the target table dedupes them.

With .ack-watermark, recovery seeds ackedFsn = max(lowestSurvivingBaseSeq - 1, watermark) — the max clamp absorbs either order of the manager's "persist then trim" tick:

  • Persist crashed before trim: segments still on disk are >= lowest, watermark is correct; max picks watermark.
  • Trim ran before persist: segments are gone (so lowestBase is higher than watermark), watermark is stale; max picks lowestBase - 1.

Recovery MUST additionally bound-check: if the watermark exceeds publishedFsn of the recovered ring, treat it as corrupt and fall back to lowestBase - 1. A correctly operating prior session cannot produce a watermark above the on-disk frame ceiling, so an excess value is bit-rot or filesystem corruption; trusting it would seed ackedFsn = publishedFsn (after the ring's own clamp), positioning the cursor past every un-acked frame and silently dropping the un-acked tail.

Format (16 bytes, little-endian, mmap'd RW for the engine's lifetime):

Offset  Size  Type    Field            Description
──────────────────────────────────────────────────────────────────
0       4     uint32  magic            0x31574B41 ("AKW1" little-endian)
4       4     uint32  reserved         Must be 0
8       8     int64   fsn              Cumulative durable-ack high-water mark

A magic value other than 0x31574B41 means "no usable watermark"; recovery falls back to the bare lowestBase - 1 seed. The most common cause is a freshly created file from the previous session's first open() where no write landed before crash — the OS zero-filled the file at creation, so magic is 0 until the first write stamps it.

Write protocol (single-writer, segment-manager thread):

  1. Hot-path write is a single 8-byte aligned putLong to offset 8 against the mapped region. No alloc, no syscall, no fsync. Atomic at the CPU level on x86_64 and arm64, and disk-atomic within one sector (the file is 16 bytes — trivially within one sector).
  2. On the first write of a fresh file in a session, the writer also stamps the magic at offset 0 (after the FSN store, in program order). On subsequent writes within the same session and for cross-session reopens that observed magic already set at open() time, the magic store is skipped.
  3. The writer SHOULD persist on every advance of the in-memory ackedFsn — typically debounced via "only write if current > lastPersisted". Frequency is bounded by the manager tick cadence, not the inbound durable-ack rate.

fsync cadence: NOT performed. Host crash falls back to the segment-derived seed (no regression vs the no-watermark behaviour). A client MAY add periodic fsync as a quality-of-service knob; the spec permits but does not require it.

Lifecycle: opened in engine.open after the slot lock is acquired; mmap is held for the engine's lifetime. Closed during engine shutdown after the manager has stopped (the manager is the sole writer). A clean shutdown that drained all segments MAY unlink .ack-watermark; the substrate does today as part of unlinkAllSegmentFiles semantics. Otherwise the file persists for the next session's recovery to read.

Interop: .ack-watermark is OPTIONAL. A client that does not implement it MUST still produce on-disk state that the reference recovery scanner can read — i.e. absence of .ack-watermark is explicitly handled (§6.5). A drainer (orphan-adoption path) opening a slot from a different client that did write .ack-watermark MUST honour it on recovery; ignoring it is allowed but defeats the optimisation.

6. Segment File Format

All multi-byte integers are little-endian. The reference implementation relies on x86/ARM native byte order; clients MUST emit little-endian explicitly to be portable.

A segment file consists of a 24-byte header followed by zero or more frames. The file is pre-allocated at create time to its full size (sf_max_bytes); the tail is zero-filled until written.

Pre-allocation semantics. The reference implementation reserves real disk blocks at create time via posix_fallocate on Linux and F_PREALLOCATE/F_ALLOCATEALL on macOS, so ENOSPC surfaces as a clean create-time failure (the partial file is unlinked and the caller falls back to backpressure). Setting only the logical size via ftruncate is NOT sufficient: the resulting file is sparse and a later store into the mmap'd region will deliver a SIGBUS that tears down the process. On filesystems whose underlying reservation call is unsupported and the native layer falls back to ftruncate, this SIGBUS risk re-emerges for that filesystem only — operators running on such filesystems should size sf_max_bytes conservatively against free space. Clients that implement their own segment files MUST treat block reservation as a core invariant of the create path.

6.1 Header (24 bytes, fixed)

Offset  Size  Type    Field            Description
──────────────────────────────────────────────────────────────────
0       4     uint32  magic            0x31304653 ("SF01" little-endian)
4       1     uint8   version          0x01
5       1     uint8   flags            Reserved (must be 0)
6       2     uint16  reserved         Must be 0
8       8     uint64  baseSeq          FSN of the first frame in this segment
16      8     int64   createdMicros    Wall-clock microseconds at creation

baseSeq MUST be non-negative. The recovery scanner refuses segments with baseSeq < 0 to defend against bit-rot or malicious files.

createdMicros is informational. Implementations may use any monotonic or wall-clock source; recovery does not validate it.

6.2 Frame envelope (8-byte header + payload)

Frames are tightly packed starting at offset 24. Each frame is:

Offset       Size           Type    Field        Description
──────────────────────────────────────────────────────────────────
0            4              uint32  crc32c       CRC32C of (payloadLen||payload), Castagnoli
4            4              int32   payloadLen   Payload size in bytes
8            payloadLen     bytes   payload      One QWP message (verbatim wire bytes)

payloadLen MUST be ≥ 0 and ≤ sizeBytes - offset - 8. A negative or overflowing length is treated as a torn-tail boundary by the recovery scanner.

The CRC is standard CRC-32C (Castagnoli): polynomial 0x1EDC6F41, reflected in/out, init 0xFFFFFFFF, final XOR 0xFFFFFFFF (same variant as iSCSI/SCTP/Btrfs). It covers payloadLen (4 bytes, little-endian) followed by the payload bytes, in that order. Any off-the-shelf CRC-32C library will produce identical checksums.

The payload is the complete QWP message including its 12-byte QWP1 header (magic QWP1, version, flags, table count, payload length). See wire-ingress.md.

6.3 Append protocol (single-writer)

The producer is the single writer to a segment. It appends a frame with this sequence:

  1. Compute total = 8 + payloadLen and check appendCursor + total ≤ sizeBytes. If not, return failure (caller rotates to the next segment).
  2. Write payloadLen (little-endian uint32) at mmap_addr + appendCursor + 4.
  3. Copy payload to mmap_addr + appendCursor + 8.
  4. Compute CRC32C over the just-written [payloadLen, payload] bytes.
  5. Write the CRC at mmap_addr + appendCursor + 0.
  6. Bump appendCursor to appendCursor + total and frameCount by 1.
  7. Last: publish via a release-store of publishedCursor = appendCursor.

Step 7 is the publication barrier. Until the volatile/release write of publishedCursor retires, no consumer (I/O thread, drainer) is permitted to read the new bytes. The CRC + zero-filled tail combination makes the "interrupted at any point" outcome recoverable: the next recovery's CRC verification will reject any partially-written frame.

6.4 Trim and rotation

A segment's last FSN is baseSeq + frameCount - 1. The segment becomes trimmable once ackedFsn >= lastSeq. The ring's manager thread

  1. closes the mmap and file descriptor,
  2. unlinks the file,
  3. decrements the substrate's totalBytes counter.

Trim is destructive — files are unlinked, not zeroed.

A new segment is rotated in when the active one fills. The manager pre-allocates "hot spare" segments ahead of demand to keep the producer non-blocking; rotation rebases the spare's baseSeq to the active's lastSeq + 1 at install time. This is implementation detail; cross-client interop only requires that the on-disk layout described in §6.1–6.3 is correct.

6.5 Recovery scan

For each *.sfa file in the slot:

  1. Open the file, validate length ≥ 24, mmap.
  2. Validate magic == 0x31304653, version == 1, baseSeq ≥ 0.
  3. From offset 24, walk frames forward verifying each CRC. Stop at the first frame whose declared length overruns the file or whose CRC mismatches; the byte position becomes lastGood.
  4. Set frameCount = (count of verified frames), appendCursor = lastGood, publishedCursor = lastGood.
  5. Torn-tail detection: if any byte in [lastGood, lastGood + min(8, fileSize - lastGood)) is non-zero, the writer attempted-but-failed a frame write. Log a WARN with the count. Clean partial fills (writer never reached the tail) leave zeros and report 0.

After scanning all files, sort segments by baseSeq ascending and verify prev.baseSeq + prev.frameCount == curr.baseSeq for each adjacent pair. A gap is a fatal recovery error — refuse to start. The highest-baseSeq segment becomes active; the rest are sealed and drained in order.

The substrate seeds ackedFsn = max(lowestBaseSeq − 1, watermark) where watermark is the FSN read from .ack-watermark (§5.4) if present and valid, or absent otherwise. lowestBaseSeq − 1 alone guarantees no data loss; the watermark additionally suppresses re-replay of frames inside the lowest surviving segment that the previous sender had already been told were durable. See §5.4 for the max ordering rationale.

7. Frame-Sequence-Number (FSN) Model

Two distinct counters operate together:

  • FSN (frame-sequence-number): a monotonic counter assigned at append time, persisted in segment headers via baseSeq, never reset while the slot exists. FSN survives restarts and reconnects.
  • wireSeq: the per-connection counter the server uses for deduplication, reset to 0 on every successful WebSocket upgrade.

The relationship is pinned at connect time:

fsn = fsnAtZero + wireSeq

On a fresh connection, fsnAtZero = ackedFsn + 1 (i.e. 0 on a brand-new slot, or wherever the trim watermark left off after a reconnect or restart). The first frame the I/O loop sends has wireSeq = 0, regardless of its FSN.

7.1 Server invariants the client relies on

  • The server MUST dedup by (connection, wireSeq). Replays after reconnect carry the same payload but at a fresh wireSeq window; dedup must use messageSequence inside the wire payload (which the encoder emits) to recognise duplicates across reconnects.
  • The server's dedup window MUST be ≥ a sender's sf_max_total_bytes worth of FSNs. Otherwise a sustained outage with a full SF cap can produce double-writes on replay.
  • The server's per-table seqTxn watermarks reported in OK and durable-ack frames are monotonically non-decreasing.

7.2 wireSeq is not in the wire frame

The QWP wire encoder does NOT serialize wireSeq into the message header. It is implicit: the server assigns wireSeq to inbound messages in receive order, starting at 0 per connection, and echoes it back in the sequence field of OK and error frames. The client maps the echoed value back to FSN via fsn = fsnAtZero + sequence.

A consequence: clients MUST send frames in strict order. The server assumes "Nth frame in = wireSeq N", so any reordering breaks the FSN mapping.

8. WebSocket Handshake

Open a WebSocket on the server's /write/v4 (or /api/v4/write) path with these request headers:

HeaderRequiredValue
X-QWP-Max-VersionrecommendedHighest QWP version supported. Defaults to 1.
X-QWP-Client-IdoptionalFree-form string (e.g. cpp/0.1.0).
X-QWP-Accept-EncodingoptionalEncoding preference; omit if not supported.
X-QWP-Max-Batch-RowsoptionalPer-batch row cap; omit for server default.
X-QWP-Request-Durable-Ackiff request_durable_ack=onLiteral true.
Authorization etc.as neededStandard HTTP auth headers.

On 101 the server responds with:

HeaderMeaning
X-QWP-VersionNegotiated version (defaults to 1 if absent).
X-QWP-Durable-AckEchoed enabled iff the client opted in AND the server can deliver durable-ack frames.

8.1 Durable-ack handshake

If the client sent X-QWP-Request-Durable-Ack: true:

  • A response with X-QWP-Durable-Ack: enabled confirms the server will emit STATUS_DURABLE_ACK frames. The client switches to durable-ack- driven trim (§10).
  • Absence of the confirmation header means the server cannot deliver durable-acks (OSS build, primary not initialised, registry missing, replica). The client MUST fail the connection loudly with a LineSenderException-equivalent error; silently waiting for ack frames that will never arrive lets the SF grow until disk fills.

8.2 Authentication

Auth failures during upgrade (HTTP 401 / 403) are terminal per failover.md §6 — they neither retry the initial connect nor trigger reconnect, and surface immediately as a SECURITY_ERROR category error (HALT policy). Other non-101 statuses follow the topology / transient classification in failover.md §6 (e.g. 421 is in-round topology; 404 / 426 / 503 are transient and consume the reconnect budget).

9. OK ACK and Trim

Every server response begins with a 1-byte status. OK is 0x00.

9.1 OK frame layout

Offset    Size       Field          Notes
─────────────────────────────────────────────────────────
0         1          status         0x00
1         8          sequence       wireSeq (little-endian int64)
9         2          tableCount     uint16 LE; 0 for non-WAL or empty batches
11        variable   table entries  repeated tableCount times:
                                       2 bytes nameLen (LE)
                                       nameLen bytes name (UTF-8)
                                       8 bytes seqTxn (LE int64)

9.2 Default trim (no durable-ack)

On OK reception:

  1. Compute fsn = fsnAtZero + min(sequence, nextWireSeq - 1). The min defends against malicious or buggy servers reporting a wireSeq ahead of what the client sent.
  2. engine.acknowledge(fsn): advance the trim watermark, possibly freeing whole segments (§6.4).
  3. Bump per-table seqTxn watermarks for any subsequent durability tracking (§10).

Auto-trim cadence is at the discretion of the I/O loop; the reference implementation calls acknowledge immediately on every OK.

9.3 Durable-ack mode trim

When request_durable_ack=on was negotiated, OK frames do not advance ackedFsn. They are stashed (with their per-table seqTxns) into a pendingDurable queue keyed by wireSeq. Trim happens only on STATUS_DURABLE_ACK (§10).

Empty OK frames (tableCount=0, e.g. materialized-view-only batches) are trivially durable: their wireSeq advances ackedFsn as soon as all preceding entries in pendingDurable have been confirmed.

10. Durable-Ack ACK and Trim

Durable-ack frames carry per-table watermarks for data already uploaded from the server's WAL to the configured object store.

10.1 Durable-ack frame layout

Offset    Size       Field          Notes
─────────────────────────────────────────────────────────
0         1          status         0x02
1         2          tableCount     uint16 LE
3         variable   table entries  repeated tableCount times:
                                       2 bytes nameLen (LE)
                                       nameLen bytes name (UTF-8)
                                       8 bytes seqTxn (LE int64)

There is no sequence field. Durable-acks are cumulative per table: each entry reports the highest seqTxn whose WAL segments are durable.

10.2 Drain loop

Maintain durableTableWatermarks: Map<tableName, int64> (monotonic, max-merge).

On each durable-ack frame:

  1. For each entry, set durableTableWatermarks[name] = max(current, seqTxn).
  2. Walk the head of pendingDurable (FIFO of (wireSeq, tableSeqTxnsForThisFrame)).
  3. An entry is "covered" iff for every (name, seqTxn) it carries, durableTableWatermarks[name] >= seqTxn.
  4. Pop all consecutive covered head entries; call engine.acknowledge(fsnAtZero + maxCoveredWireSeq) once with the highest covered wireSeq.

The watermark advances strictly monotonically; reordered durable-ack frames are tolerated by the max-merge.

10.3 Reconnect

On reconnect, pendingDurable is discarded. The new connection's OKs re-stash entries; the server re-emits cumulative durable-ack watermarks from scratch. Trim restarts against the new connection's wireSeq mapping.

11. Keepalive PING

The OSS server only flushes pending durable-ack frames during inbound recv events (binary message, ping, close). An idle client that has finished publishing would otherwise never see the durable-ack frame.

The SF I/O loop sends a WebSocket PING (any small payload; reference sends 2 bytes) when all of these hold:

  • request_durable_ack=on was negotiated.
  • pendingDurable is non-empty.
  • durable_ack_keepalive_interval_millis > 0.
  • At least durable_ack_keepalive_interval_millis have elapsed since the last sent frame or PING.

In non-durable-ack mode the loop sends NO keepalive PINGs; ordinary frame traffic keeps the connection alive on the server side.

PING is always a WebSocket-level PING (RFC 6455 opcode 0x9), not a QWP message. Servers reply with PONG; clients ignore PONG content.

12. Self-Sufficient Framing

Every frame written to the SF substrate MUST be self-sufficient: it carries the full schema for every table it touches and the full symbol-dictionary delta from id 0. Schema-by-id refs are forbidden, and incremental delta-dicts are forbidden.

Concretely, the encoder is invoked with:

  • confirmedMaxId = -1 (so the symbol delta starts at id 0 every time)
  • useSchemaRef = false (so column blocks always carry full schema mode)

Rationale: SF frames may be replayed against a fresh server connection weeks later — after process restart, after reconnect, or in a background-drainer adopting an orphan slot. A frame that references schema id N or symbol id M is unreplayable if the new server has never seen those ids. Self-sufficiency makes every frame valid against any server. The cost is a small per-batch overhead which is accepted.

This is mandatory for cross-client interop. A drainer written in C++ adopting a slot written by Java must be able to replay the bytes unchanged.

13. Reconnect and Replay

13.1 Failure detection

Any wire error — send failure, recv failure, server close, ack timeout — enters the reconnect loop. The producer thread is not notified; it keeps publishing into the substrate (subject to the cap, §16).

13.2 Backoff

Backoff math is normatively specified in failover.md §3 (ComputeBackoff / NextBackoffOrGiveUp). SF uses equal-jitter [base, 2·base) per §3.1 — equal-jitter has a non-zero lower bound that damps reconnect storms when multiple SF producers share a cluster. The post-jitter sleep is clamped to the remaining outage budget so a single sleep cannot overshoot reconnect_max_duration_millis; see failover.md §3 for the exact deadline check and §13.6 below for how the SF loop consumes it.

Defaults bound to the SF connect-string keys (reconnect_initial_backoff_millis = 100ms, reconnect_max_backoff_millis = 5_000ms, reconnect_max_duration_millis = 300_000ms); see §4.2 for the keys and failover.md §7 for the cross-context defaults table.

13.3 Terminal conditions

Error classification is normatively specified in failover.md §6. Mapped to the SF client's local error categories:

  • Terminal (bypass reconnect; do NOT consume the retry budget) — AuthError (HTTP 401 / 403 on the upgrade) surfaces as SECURITY_ERROR; server-status rejects surface in the matching status-byte category.
  • Topology (421 Misdirected Request + X-QuestDB-Role) — handled within the round per §13.6 (and failover.md §6 "Topology"); not terminal.
  • Transient (consume the per-outage budget) — everything else: TCP / TLS / handshake failures, mid-stream send / recv errors, 404, 426, 503, any other 4xx / 5xx that is not 401 / 403 / 421, an upgrade response that advertises a QWP version outside the supported range (per-endpoint mismatch — the loop walks to the next host so a rolling upgrade in flight does not block compatible peers), and generic frame-decode errors.

When the per-outage budget (reconnect_max_duration_millis) exhausts, the loop becomes terminal and surfaces as PROTOCOL_VIOLATION with FSN span = unacked window at giveup time.

13.4 Initial connect

By default (initial_connect_retry=off) any failure on the very first connect is terminal — typically a misconfig, retrying for 5 minutes just hides it. Setting initial_connect_retry=on (aliases: sync, true) reuses the same retry loop, blocking the calling thread. async runs the retry loop in the I/O thread and lets the constructor return; the producer then experiences backpressure if it tries to publish before the connection comes up.

13.5 Replay

On each successful (re)connect:

  1. Set fsnAtZero = engine.ackedFsn() + 1 (or 0 on a brand-new slot where ackedFsn is -1).
  2. Reset nextWireSeq = 0.
  3. Position the read cursor in the segment ring at fsnAtZero.
  4. Stream frames from disk to the wire in FSN order, one frame per WebSocket binary message, incrementing nextWireSeq.
  5. Frames originally appended after the disconnect (fsn > replayTargetFsn) are sent as new frames; the boundary affects only the totalFramesReplayed counter, not behaviour.

Producer threads MAY append concurrently during replay; the I/O loop uses the volatile publishedFsn() cursor to discover newly-appended frames.

13.6 Reconnect loop

The full pseudocode that ties together failure detection (§13.1), backoff (§13.2), terminal classification (§13.3), initial-connect policy (§13.4), and replay (§13.5). The loop owns the host tracker defined in failover.md §2 and consumes the backoff function in failover.md §3.

backoff = BackoffState()
seenFirstConnect = false
lastError = null
loop forever:
    if previousIdx >= 0:                # mid-stream failure path
        tracker.RecordMidStreamFailure(previousIdx)
        previousIdx = -1

    idx = tracker.PickNext()
    if idx < 0:
        # Round exhausted: every host in this round has been tried. Pay
        # one sleep, reset the round, then walk again.
        if lastError == RoleReject:
            sleep min(InitialBackoff, remaining)    # remaining = MaxOutageDuration - elapsed
            if elapsed >= MaxOutageDuration: SET TERMINAL; exit loop
            backoff.Reset()                          # role-reject does not advance doubling
        else:
            sleep = NextBackoffOrGiveUp(backoff.attempt, backoff.elapsedSinceOutage)
            if sleep == GIVE_UP: SET TERMINAL; exit loop
            sleep; backoff.attempt++
        tracker.BeginRound(forgetClassifications=true)
        continue

    try connect host[idx]:
        on AuthError: SET TERMINAL; exit loop
        on RoleReject(t):
            tracker.RecordRoleReject(idx, t); lastError = RoleReject
            continue                                 # no per-host sleep within the round
        on other error (incl. upgrade-time version mismatch):
            tracker.RecordTransportError(idx); lastError = TransportError
            if !seenFirstConnect && !initial_connect_retry:
                SET TERMINAL; exit loop
            continue                                 # no per-host sleep within the round

    seenFirstConnect = true
    backoff.Reset(); lastError = null
    rewind cursor to (ackedFsn + 1)         # replay first un-acked frame; see §13.5
    run send/recv pumps until either pump throws
        on server status reject / AuthError: SET TERMINAL
        on transient error (TCP/TLS, mid-stream send/recv, frame corruption, etc.):
            previousIdx = currentIdx                 # demoted on next iter via RecordMidStreamFailure
            continue                                  # next iter walks the next untried host

Key properties:

  • Single-round walk per attempt. When the tracker exhausts (PickNext == -1), the loop MUST invoke BeginRound(forgetClassifications=true) before the next PickNext. The current-round flags reset; classifications fade except for the sticky-Healthy entry (failover.md §2.2).
  • No per-host backoff within a round. While the tracker still has untried hosts in the current round, every transient error (transport or role-reject) flows directly into the next iteration with no inter-host sleep — the loop walks the full address list as fast as per-host auth_timeout_ms allows. The post-error sleep (exponential for transport, fixed InitialBackoff for role-reject) is paid once per round, at the moment PickNext returns -1 and immediately before BeginRound(forgetClassifications=true). The per-producer round time stays bounded by auth_timeout_ms × hostCount plus one backoff sleep, and equal-jitter (failover.md §3.1) spreads the round-boundary sleeps across producers. Mirrors the egress WalkTracker (wire-egress.md §11.9), where the address-list walk is also un-throttled within a round.
  • initial_connect_retry (default off):
    • off — first connect failure is terminal.
    • on (alias sync) — block the constructor; enter the standard reconnect loop until success or MaxOutageDuration exhaustion.
    • async — return from the constructor immediately; the background I/O thread drives the reconnect loop. append writes accumulate in the SF dir without blocking on connect. Intended for unattended producers where the SF dir may already carry segments queued from a prior process and the server may come up later.
  • Role-rejects don't accumulate exponential backoff. A 421 + role reply is interpreted as a topology hint. Within a round it flows through the no-sleep path above (the next host gets tried immediately). When a round exhausts with role-reject as lastError, the round-boundary sleep is the configured InitialBackoff (no doubling) and backoff.attempt is reset, so a long PRIMARY_CATCHUP window doesn't blow up to reconnect_max_backoff_millis per probe. The wall-clock outage budget (reconnect_max_duration_millis) still bounds the loop — if every endpoint stays role-rejecting for the full budget, the loop terminates. Operators should size reconnect_max_duration_millis to span their largest expected failover window (default 5 minutes).
  • Mid-stream failure (the send or receive pump throws after a successful connect): the failed host index is captured as previousIdx. The next loop iteration calls tracker.RecordMidStreamFailure(previousIdx) before the next PickNext, so Healthy→TransportError demotion happens before host selection runs (see failover.md §2.3 — this ordering is normative). The next host is then tried under the same no-sleep rule (skip-backoff-within-round), so a healthy peer takes over with only auth_timeout_ms worth of slack. The mid-stream failure itself does not advance backoff.attempt — the doubling counter starts fresh from the post-success backoff.Reset() and only moves forward at round exhaustion, so a single mid-stream blip pays no exponential cost when at least one peer is healthy.
  • Per-caller previousIdx. When multiple loops drive reconnects against the same tracker (foreground SF I/O thread plus background drainers; see §18.2), each loop MUST own its own previousIdx slot. Sharing one slot across loops corrupts the Healthy→TransportError demotion. The Java client expresses this as a per-caller ReconnectFactory instance with a private previousIdx field. See failover.md §2.3.

14. Error Frame Handling

A non-OK status byte (other than 0x02) is an error frame.

14.1 Error frame layout

Offset    Size       Field          Notes
─────────────────────────────────────────────────────────
0         1          status         see table below
1         8          sequence       wireSeq the error applies to (LE int64)
9         2          msgLen         uint16 LE (≤ 1024)
11        msgLen     msg            UTF-8 error message

14.2 Status byte → category

StatusHexCategory
30x03SCHEMA_MISMATCH
50x05PARSE_ERROR
60x06INTERNAL_ERROR
80x08SECURITY_ERROR
90x09WRITE_ERROR
any other (incl. 0x04, 0x07, 0xFF)UNKNOWN
(n/a — synthesized for WS close codes)PROTOCOL_VIOLATION

14.3 Default policy per category

CategoryDefault policyRationale
SCHEMA_MISMATCHDROP_AND_CONTINUEReplay reproduces the same rejection; halting blocks unrelated tables.
PARSE_ERRORHALTAlmost certainly a client bug; preserve disk frames for postmortem.
INTERNAL_ERRORHALTCatch-all server fault; conservative without a retryable bit.
SECURITY_ERRORHALTMisconfig; loud failure wanted.
WRITE_ERRORDROP_AND_CONTINUE"Table not accepting writes" is per-batch in character.
PROTOCOL_VIOLATIONHALT (forced)Connection is gone — no choice.
UNKNOWNHALT (forced)Never silently drop something we don't understand.

The two non-forced rows are user-overridable via the on_*_error connect keys (when implemented; see §4.4).

14.4 DROP_AND_CONTINUE flow

  1. Log a WARN with the wireSeq, category, status byte, and server message.
  2. Build a SenderError and offer to the error inbox (non-blocking).
  3. Advance the trim watermark past the rejected span:
    • Non-durable mode: engine.acknowledge(fsnAtZero + sequence).
    • Durable mode: enqueue a trivially-durable empty entry into pendingDurable, then run the drain loop.
  4. Keep draining the next frame.

DROP semantically discards the data — once the server has rejected, the SF substrate is no longer the producer's safety net. Users who need a dead-letter trail must register an error handler.

14.5 HALT flow

  1. Build a SenderError and stash it as the loop's terminal error.
  2. Offer to the error inbox.
  3. Stop the loop. The producer's next API call observes the latched error and throws (typed exception carrying the SenderError).

14.6 Error inbox and dispatcher

  • Bounded SPSC queue, capacity = error_inbox_capacity (default 256).
  • Producer is the I/O thread. Consumer is a lazy-start daemon dispatcher thread that invokes the user's error handler.
  • Overflow policy: drop the oldest entry, bump droppedErrorNotifications counter. Watermarks are monotonic, so the latest entry is always the most informative — drops compress information rather than lose it.
  • The default error handler logs ERROR for HALT and WARN for DROP. Silence is forbidden; clients MUST install a non-silent default.

15. WebSocket Close Codes

WebSocket Close frames are inspected for the close code:

15.1 Terminal (PROTOCOL_VIOLATION)

Surface as PROTOCOL_VIOLATION / HALT. Do NOT enter the reconnect loop.

CodeName
1002PROTOCOL_ERROR
1003UNSUPPORTED_DATA
1007INVALID_PAYLOAD_DATA
1008POLICY_VIOLATION
1009MESSAGE_TOO_BIG
1010MANDATORY_EXTENSION

The serverStatusByte is -1 and messageSequence is -1 for PROTOCOL_VIOLATION events. The error message is "ws-close[<code>]: <reason>". The FSN span is [engine.ackedFsn() + 1, engine.publishedFsn()] — the unacked window at close time.

15.2 Reconnect-eligible

Enter the reconnect loop (subject to per-outage cap):

CodeName
1000NORMAL_CLOSURE
1001GOING_AWAY
1006ABNORMAL_CLOSURE
1011INTERNAL_ERROR
any other code not listed in §15.1

16. Backpressure

The substrate enforces sf_max_total_bytes as a hard cap on resident storage. When full, the producer's appendBlocking busy-spins (with cooperative yield) for up to sf_append_deadline_millis waiting for ACK arrival → trim → space free. If the deadline fires the call throws a LineSenderException.

The error message MUST distinguish:

  • backpressure while wire is publishing: server is acking but slowly.
  • backpressure while reconnecting: I/O loop is in the retry loop. The message includes attempt count and outage start time.

The cap covers all sealed segments + the active segment. Trim immediately returns segment bytes to the available pool when ackedFsn crosses baseSeq + frameCount - 1.

17. Close and Shutdown

close() semantics depend on close_flush_timeout_millis:

  • Default (5000): block waiting for engine.ackedFsn() >= engine.publishedFsn() for up to 5 seconds. If the wait succeeds, all data is acked. If the timeout fires, log a WARN and proceed; pending data remains on disk (SF mode) or is lost (Memory mode).
  • 0 or -1: skip the drain entirely. Pending data is lost (Memory) or persists for the next sender to recover (SF). Producer state is still flushed into the engine before close() returns; only the wait for server ACK is skipped.
  • Any other positive value: that timeout in milliseconds.

Regardless of which branch above runs, close() then performs a non-blocking safety-net checkError() and rethrows any latched terminal error, with two guards that prevent double-signalling:

  1. Skip the rethrow if a custom SenderErrorHandler has already been delivered an error during this sender's lifetime (the user has already owned the failure asynchronously).
  2. Skip the rethrow if a producer-thread call (flush() / at() / atNow()) has already seen the latched error synchronously.

The safety net is non-blocking (a volatile read plus a throw) so it does not violate the "fast close" intent of the opt-out branch. Its purpose is to keep latched terminal errors visible to users who only ever call close() and have not installed an async handler — without it, the default no-op handler would silently swallow real failures.

Implementations MUST always release the slot lock and unmap segments on close(), regardless of timeout outcome or whether the safety net threw. Implementations MUST NOT skip the safety-net checkError() on the basis of close_flush_timeout_millis alone; only the two guards above suppress it.

18. Recovery and Orphan Adoption

18.1 Foreground recovery

On engine open in SF mode:

  1. Acquire <sf_dir>/<sender_id>/.lock. Refuse to start on contention.
  2. Run §6.5 recovery scan over the slot's *.sfa files.
  3. Open .ack-watermark (§5.4) if implemented. Read its FSN, or record INVALID if the file is missing / has bad magic / fails to map. Keep the mmap held for the engine's lifetime so the manager can write through it without reopening.
  4. Seed ackedFsn = max(lowestBaseSeq − 1, watermark), with the additional bound watermark <= publishedFsn (see §5.4). A watermark of INVALID or one exceeding publishedFsn reduces to the bare lowestBaseSeq − 1 seed (legacy behaviour, no regression).
  5. Seed segment manager fileGeneration = max(existing-gen) + 1.
  6. Bump connection generation so the I/O loop, on first connect, replays from disk against a fresh wireSeq window.

A clean shutdown that drained all data is indistinguishable from a fresh start — no segments, no replay, just ackedFsn = -1. A stale .ack-watermark from a prior fully-drained session is unlinked at this point so the new session starts with the expected "no watermark yet" state.

18.2 Orphan adoption (opt-in)

When drain_orphans=on the foreground sender, after acquiring its own lock, scans <sf_dir>/* for sibling slots that:

  • are NOT the foreground's own sender_id,
  • contain at least one *.sfa file,
  • do NOT contain .failed.

Note: lock state is intentionally NOT in the candidate filter; testing flock state races with concurrent acquirers. The drainer pool tries the locks in turn.

For each candidate, up to max_background_drainers at a time, spawn a drainer that:

  1. Tries to acquire the orphan slot's .lock. Skip on contention.
  2. Opens its own engine + WebSocket connection (separate from the foreground's).
  3. Runs §6.5 recovery and §13.5 replay over the orphan's data.
  4. Releases the lock when ackedFsn ≥ publishedFsn and exits.

Drainer failure modes:

  • Reconnect budget exhausted → drop .failed with a UTF-8 reason, release lock, exit.
  • Auth-terminal upgrade error → same.
  • Irrecoverable corruption (e.g. CRC failure with no torn-tail heuristic match) → same.

.failed slots are excluded from auto-drain on subsequent scans; the operator must clear the sentinel manually.

19. Protocol Invariants for Cross-Client Interop

Mandatory for every conformant SF client:

  • On-disk format (§6) is authoritative. Magic, header layout, CRC32C-Castagnoli, frame envelope, little-endian byte order. No per-client extensions in the segment header until a versioned format is agreed.
  • Filename pattern (§5.3): sf-<gen:016x>.sfa. Other clients reading the slot need this to enumerate.
  • Slot lock semantics (§5.1): advisory exclusive on .lock, PID in .lock.pid. Cross-client lock interop requires identical lock primitives — POSIX clients use flock/fcntl, Windows uses LockFileEx. A POSIX client and a Windows client MUST refuse to share a slot on a network filesystem.
  • .failed sentinel (§5.2): presence (not contents) is the auto-drain exclusion signal.
  • .ack-watermark (§5.4): the file is OPTIONAL — a conformant client MAY skip implementing it. When present, the format is normative (16 bytes, magic 0x31574B41, FSN at offset 8, little-endian). A drainer adopting a slot another client populated MUST honour the watermark on recovery (apply the max clamp); ignoring it is allowed but produces the duplicate-row behaviour the watermark was designed to suppress.
  • FSN model (§7): fsn = fsnAtZero + wireSeq. Strict in-order send on the wire.
  • Self-sufficient frames (§12): mandatory. Schema refs and incremental delta-dicts are forbidden in SF frames.
  • Durable-ack handshake (§8.1): if the client opts in, it MUST validate the response header and fail loudly on absence.
  • Trim driver: in non-durable mode, OK frames advance trim. In durable mode, OK frames are tracked but do NOT advance trim; only durable-ack frames do.
  • Status byte ↔ category mapping (§14.2) is normative. Clients MAY add forward-compat mapping for new server status bytes (treated as UNKNOWN).
  • WS close-code routing (§15) is normative.
  • Connect-string keys (§4) are normative names. Clients MUST accept spec-defined keys; they MAY accept implementation-specific extensions but should namespace them (e.g. cpp_*).

Permitted variations:

  • Internal threading model.
  • Manager / drainer thread allocation strategy.
  • Hot-spare provisioning policy.
  • Default error-handler formatting.

20. Observability

A conformant client SHOULD expose, at minimum:

CounterMeaning
getTotalReconnectAttempts()Reconnect attempt count across the lifetime of the sender.
getTotalReconnectsSucceeded()Successful reconnect count.
getTotalFramesReplayed()Frames re-sent after a reconnect (i.e. wireSeq < replayTargetFsn).
getTotalServerErrors()Count of error frames received (any category).
getDroppedErrorNotifications()Count of error notifications dropped due to inbox overflow.
getTotalErrorNotificationsDelivered()Count delivered to the user handler.
getTotalBackpressureStalls()Count of producer threads that hit the cap.
getLastTerminalError()The latched SenderError (or null).
getActiveBackgroundDrainers()Current count of running orphan-slot drainers. Same lax-cleanup race as the underlying pool snapshot: a drainer that finished moments ago may still count for a few ms.
getTotalBackgroundDrainersSucceeded()Cumulative count of drainers that exited after fully draining their slot.
getTotalBackgroundDrainersFailed()Cumulative count of drainers that exited by dropping a .failed sentinel.

The default error handler MUST log every received SenderError. Silence is forbidden by the contract: a buggy or no-op handler hides data loss indistinguishably from a healthy connection.

Per-drainer event observation (progress watermark, durable-ack-mismatch escalation, terminal outcome) is delivered through the existing BackgroundDrainerListener callback at the pool, not through a sender-level snapshot accessor. The three counters above cover dashboards and post-startup health checks ("did my orphans get adopted?"); the listener covers everything that wants per-drainer event-time visibility. On-disk .failed sentinels remain the canonical record of giveup events surviving sender restart.

21. Deferred Items

Items intentionally out of scope today, tracked for future revisions:

  1. sf_durability=flush and =append: per-batch and per-frame fsync. Parser accepts the values; build-time rejects them with "not yet supported". The wire-level behaviour is unchanged; only the producer's flush path (and the sf_append_deadline_millis semantics) need new code.
  2. Retryable errors: the wire format has no retryable bit; transient server faults and permanent ones both surface as INTERNAL_ERROR. A future server change could split status bytes or add a 1-byte field; when that lands, the spec gains a RETRY_TRANSIENT policy.
  3. Per-table attribution in multi-table batch errors: today tableName is only populated when the rejected batch had exactly one table. The wire format would need an optional table index field for the multi-table case.
  4. on_*_error connect-string keys: normatively defined in §4.4 but not yet recognised by the Java parser; only the fluent LineSenderBuilder API exposes them today.
  5. Disk reclaim inside partially-acked sealed segments: today the whole segment file stays on disk until every frame in it is durably acked, even though .ack-watermark (§5.4) already records which prefix is safe. Options for a future revision: rewrite-and-rotate (copy un-acked frames into a fresh segment, unlink the old), fallocate(FALLOC_FL_PUNCH_HOLE) on Linux (no Windows equivalent), or a <segment>.ack-offset sidecar consulted by the trim path. Reduces effective disk consumption under slow durable-ack cadence; does not affect correctness.
  6. resumeAfterHalt() API: clears the latched terminal error and restarts the I/O loop without rebuilding the sender. The reference loop is one-shot today; users work around with close() + rebuild.

Document History

DateCommitChange
2026-05-05client 1125b0b / oss dc108e66ffInitial spec extracted from Java reference.
2026-05-08(pending)Align with failover.md. §13.3 + §8.2: rewrite HTTP-status terminal classification to defer to failover.md §6 (only 401/403 are terminal; 421 is topology, handled in-round; 404/426/503 are transient and consume the reconnect budget). §4.5 + §21: drop "multi-host failover deferred" — failover.md is the normative spec. §4.2 + §13.4: clarify initial_connect_retry=on is canonical; sync and true are aliases. §13.2: replace inlined backoff math with a forward-pointer to failover.md §3 / §3.1 to avoid drift.
2026-05-12(pending)Add .ack-watermark (§5.4): optional mmap'd 16-byte FSN watermark that suppresses re-replay of already-durable-acked frames inside the lowest surviving sealed segment on process restart. §6.5 and §18.1 updated to seed ackedFsn = max(lowestBaseSeq - 1, watermark). §19 adds the interop bullet (file is optional but format is normative). §21 item 5 narrowed from "partial-ack tracking" to "disk reclaim inside partially-acked segments" — the re-replay half of that follow-up is now addressed by the watermark.
2026-05-12(pending)§20: replace the structured getBackgroundDrainers() snapshot accessor with three single-long counters — getActiveBackgroundDrainers(), getTotalBackgroundDrainersSucceeded(), getTotalBackgroundDrainersFailed(). The structured accessor allocated per call, exposed a lifecycle-bounded view through the same surface as the lifetime-cumulative counters, and overlapped with the existing BackgroundDrainerListener callback path. The simpler counters align with the rest of §20, give dashboards a "did my orphans get adopted?" signal, and leave per-drainer event-time visibility to the listener (which is the canonical extension point for drainer observation). On-disk .failed sentinels are unchanged and remain the canonical giveup record across sender restarts.