src/docs/rfcs/003-fdb-seq-index.md
Data Model and Index Management for _changes in FoundationDB
This document describes how to implement the by_seq index that supports the
_changes endpoints in FoundationDB. It covers the data model, index
maintenance, and access patterns.
The basic data model is one where the key is a Sequence (as defined below) and
the value is a document ID, revision, and branch count.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Versionstamp: a 12 byte, unique, monotonically (but not sequentially)
increasing value for each committed transaction. The first 8 bytes are the
committed version of the database. The next 2 bytes are monotonic in the
serialization order for transactions. The final 2 bytes are user-defined and can
be used to create multiple versionstamps in a single transaction.
Incarnation: a monotonically increasing Integer value specified for each
CouchDB database. The Incarnation starts at zero (i.e. \x14 in the tuple
layer encoding) when a database is created and is incremented by one whenever a
database is relocated to a different FoundationDB cluster. Thus the majority of
the time an Incarnation fits into a single byte, or two bytes if the database
has been moved around a small number of times.
Sequence: the combinination of the current Incarnation for the database and
the Versionstamp of the transaction. Sequences are monotonically increasing
even when a database is relocated across FoundationDB clusters.
style=all_docs: An optional query parameter to the _changes feed which
requests that all leaf revision ids are included in the response. The replicator
(one of the most frequent consumers of _changes) supplies this parameter.
The _changes feed provides a list of the documents in a given database, in the
order in which they were most recently updated. Each document shows up exactly
once in a normal response to the _changes feed.
In CouchDB 2.x and 3.x the database sequence is a composition of sequence
numbers from individual database shards. In the API this sequence is encoded as
a long Base64 string. The response to the _changes feed is not totally
ordered; the only guarantee is that a client can resume the feed from a given
sequence and be guaranteed not to miss any updates.
Future releases of CouchDB based on FoundationDB will be able to offer stronger
guarantees. The Sequence defined in the Terminology section above is totally
ordered across the entire cluster, and repeated calls to _changes on a
quiescent database will retrieve the same results in the same order. The
Sequence will still be encoded as a string, but as it's a more compact value
we propose to encode it in hexadecimal notation. These strings will sort
correctly, something that has not always been true in CouchDB 2.x.
Each database will contain a changes subspace with keys and values that take
the form
("changes", Sequence) = (SeqFormat, DocID, RevPosition, RevHash, BranchCount, NotDeleted)
where the individual elements are defined as follows:
SeqFormat: enum for the value encoding, to enable schema evolutionDocID: the document IDRevPosition: positive integer encoded using standard tuple layer encoding
(signed, variable-length, order-preserving)RevHash: 16 bytes uniquely identifying the winning revision of this documentSequence: the sequence of the last transaction that modified the document
(NB: not necessarily the transaction that produced the RevPosition-RevHash
edit).BranchCount: the number of edit branches associated with this documentNotDeleted: \x26 if the leaf of the edit branch is deleted, \x27
otherwise (following tuple encoding for booleans)A typical response to _changes includes all of this information in each row
except the internal SeqFormat and the BranchCount. The latter is used as an
optimization for the style=all_docs request; if this parameter is specified
and the BranchCount is 1 we can avoid making an extra request to the
"revisions" space to discover that there are no other revisions to include.
As discussed in RFC 001, an update attempt
always retrieves the metadata KV for the current winning branch from the
"revisions" subspace. This metadata entry includes the sequence of the last edit
to the document, which serves as the key into the index in our "changes"
subspace. The writer will use that information to clear the existing KV from the
_changes subspace as part of the transaction.
The writer also knows in all cases what the RevPosition, RevHash,
BranchCount, and NotDeleted will be following the edit, and can use the
set_versionstamped_key API to write a new KV with the correct new sequence of
the transaction into the "changes" subspace.
In short, the operations in this subspace are
When using versionstamped keys as proposed in this RFC one needs to pay
particular care to the degraded mode when FoundationDB responds to a transaction
commit with commit_unknown_result. Versionstamped keys are not idempotent, and
so a naïve retry approach could result in duplicate entries in the "changes"
subspace. The index maintenance in this subspace is "blind" (i.e. no reads in
this subspace are performed), so the risk for duplicate entries is indeed a
valid concern.
We can guard against creating duplicates in the "changes" subspace by having the
transaction that updates that subspace also insert a KV into a dedicated
"transaction ID" subspace specifically corresponding to this document update. If
the CouchDB layer receives a commit_unknown_result it can simply check for the
presence of the transaction ID in FoundationDB to determine whether the previous
transaction succeeded or failed. If the transaction ID is not present, CouchDB
can safely retry with the same transaction ID. After a successful transaction
commit, the CouchDB layer can delete the transaction ID KV asynchronously. For
example, each process could dump the transaction ID of a successful commit into
a local ets table (shared by all databases), and a process could scan that table
once every few seconds and clear the associated entries from FDB in a single
transaction.
Let's consider first the simple case where an entire response to _changes fits
within a single FoundationDB transaction (specifically the 5 second limit). In
this case a normal request to _changes can be satisfied with a single range
read from the "changes" subspace. A style=all_docs request will need to check
the BranchCount for each row; if it's larger than 1, the client will need to
do a followup range request against the "revisions" subspace to retrieve the
additional revision identifiers to include in the response. A request with
include_docs=true will need to make a separate range request to the doc
storage subspace to retrieve the body of each winning document revision.
If a normal response to _changes cannot be delivered in a single transaction
the CouchDB layer should execute multiple transactions in series and stitch the
responses together as needed. Note that this opens up a subtle behavior change
from classic CouchDB, where a single database snapshot could be held open
~indefinitely for each shard, providing a complete snapshot of the database as
it existed at the beginning of the response. While future enhancements in
FoundationDB may allow us to recover that behavior, in the current version we
may end up with duplicate entries for individual documents that are updated
during the course of streaming the _changes response. The end result will be
that each document in the database shows up at least once, and if you take the
last entry for each document that you observe in the feed, you'll have the state
of the database as it existed at the end of the response.
Finally, when a user requests _changes with feed=continuous there is no
expectation of exactly-once semantics, and in fact this is implemented using
multiple database snapshots for each shard today. The extra bit of work with
this response type is to efficiently discover when a new read of the "changes"
subspace for a given database is required in FoundationDB. A few different
options have been discussed on the mailing list:
db_updated events to couch_event, listeners use
distributed Erlang to subscribe to all nodes, similar to the classic
approach._changes subspace, scale by nominating a specific process per node
to do the polling.This RFC proposes to pursue the second approach. It preserves the goal of a stateless CouchDB layer with no coordination between instances, and has a well-known scalability and performance profile.
This design eliminates "rewinds" of the _changes feed due to cluster
membership changes, and enhances database sequences to enable relocation of
logical CouchDB databases across FoundationDB clusters without rewinds as well.
We anticipate improved throughput due to the more compact encoding of database sequences.
The new sequence format always sorts correctly, which simplifies the job of consumers tracking the sequence from which they should resume in parallel processing environments.
It will not be possible to retrieve a complete point-in-time snapshot of a large database in which each document appears exactly once. This may change with a future enhancement to the storage engine underpinning FoundationDB.
Nothing additional to report here.
TBD depending on exact code layout going forward, but this functionality cuts across several core modules of CouchDB.
None.
None.
None have been identified.
Original mailing list discussion
Detailed thread on isolation semantics for long responses
Thanks to @iilyak, @rnewson, @mikerhodes, @garrensmith and @alexmiller-apple for comments on the mailing list discussions, and to @wohali for working through the implications of the isolation changes on IRC.