src/v/pandaproxy/schema_registry/README.md
Schema registry is built in to pandaproxy.
A single-partition topic _schemas is used as a backend store.
Each node maintains an in-memory copy of all schema data, lazily updated from the topic.
Active/Active HA: any node can serve reads and writes.
User-facing REST API is compatible with other products' schema registry implementations, but internal storage format is not intended to be compatible.
Future: we should apply a sanity limit on the number of schemas, as we hold them all in memory.
tldr; Optimistic writes, first writer wins
Each write includes a sequence number in the key. This sequence number increases monotonically, and any write with an out-of-order sequence number is ignored by readers. Writers set the sequence number for their next write as the most recent message sequence they've seen, plus one.
While the sequence number is logically independent of the offset, it is very convenient to simply use the offset as the sequence number: that way, message validity can be checked by simply comparing the sequence number to the actual offset where the message was written.
Under contention, writers know whether their write was successful by comparing the offset returned by their write against the offset where they expected to write. If not equal, writer retries until they succeed.
Using the offset as the effective sequence number comes with a downside: concurrent writes from different nodes are more expensive, as otherwise-valid writes can be rendered invalid by another /invalid/ write arriving before and knocking them out of their intended offset. However, schemas are likely to be written at comparatively low rates, and usually by a single API client (e.g. a CI system deploying new schemas for a new software build). This can be improved in future by switching to a true sequence number (such that for a given sequence number under contended writes, at least one writer is guaranteed to win).
Allocation of schema IDs and version IDs is done locally by writers based on the contents of their ShardedStore. If
this turns out to be out of date, then their write will retry (allocation is re-done on retry). Changes to the
ShardedStore, including advancing the next ID to allocate, are only applied once the write persists successfully.
To avoid compaction incorrectly preserving the last contended write (contended meaning same seq number) rather than first write to a particular subject-version, the writer's node ID is also included in the key.
For permanent deletion of subjects or versions (i.e. writing tombstones), writers must remember the sequence number and node ID that originally wrote to the subject, in order to regenerate the original write's key. These permanent deletion writes are not subject to any sequence-ordering logic, and if multiple nodes perma-delete the same thing concurrently, they will both succeed.
Q: Why is the sequence number global rather than per-subject?
See worked examples further down that include sequence numbers.
Reads are served symmetrically from any node.
High traffic read endpoints are usually served from memory, falling back to read from the topic only if the requested ID is not found locally (i.e. the local ShardedStore is out of date).
These fast-read endpoints are:
GET /schemas/ids/{int: id}GET /subjects/(string: subject)/versions/(versionId: version)GET /subjects/(string: subject)/versions/(versionId: version)/schemaEndpoints that generate lists require a read from the topic to ensure that the list they are returning is complete. These are: GET /schemas/ids/{int: id}/versions (because while we might have the schema in memory, it may have new subject-version associations we haven't seen yet)
GET /subjectsGET /subjects/(string: subject)/versionPOST /subjects/(string: subject)GET /subjects/(string: subject)/versions/{versionId: version}/referencedbyWithin a single node, only one read to the topic is in flight at at time (enforced by a semaphore) -- this serialization of reads reflects the serial layout of the underlying store.
Future: use a queue-of-waiters synchronisation mechanism to reduce number of topic reads when many readers hit slow endpoints concurrently within the same node
Future: if these endpoints turn out to be higher traffic than expected, introduce a config setting to make them eventually consistent
As well as being subject to inter-node serialization via sequence numbers, writes are serialized within a node using a semaphore. This simplifies the code, to avoid having rollback logic for projecting schema+version handle cases where multiple writes are in flight & only some succeed.
*Future: if we find it necessary to
Writes are served symmetrically from any node, but it is advisable to send writes to the same node (doesn't matter which) as much as possible, to reduce the cost of synchronisation.
POST /subjects/(string: subject)/versionsSoft deletions (without permanent parameter) are normal writes, which create new sequenced keys and include a flag in
the value to indicate that the named subject or version should be marked deleted.
DELETE /subjects/(string: subject)DELETE /subjects/(string: subject)/versions/(versionId: version)Permanent deletions render schemas unreadable, and are expected to free underlying storage. For us, that means emitting tombstones to the keys used when writing the subject/version.
DELETE /subjects/(string: subject)?permanent=trueDELETE /subjects/(string: subject)/versions/(versionId: version)?permanent=trueUsing shorthand key format of sequence-node-subject-version, where nodes are like A, B, C...
# POST /subjects/foo
1-A-foo-1 {schema}
# POST /subjects/foo
2-A-foo-2 {schema}
# DELETE /subjects/foo
3-A-foo-1 {deleted=true}
4-A-foo-2 {deleted=true}
# DELETE /subjects/foo?permanent=true
1-A-foo-1 NULL
1-A-foo-2 NULL
3-A-foo-1 NULL
4-A-foo-1 NULL
# POST A/subjects/foo
1-A-foo-1 {schema}
# Node A discovers they succeeded, returns schema=1 to caller
# POST B/subjects/foo
1-B-foo-1 {schema}
# Node B re-reads, sees A's write, realizes value was already applied, returns schema=1 to caller
# POST A/subjects/foo
1-A-foo-1 {schema}
# Node A succeeds at offset 1.
# POST B/subjects/bar
1-B-foo-1 {schema}
# Node B sees that its write landed at 'wrong' offset. Re-reads. Retries write.
2-B-foo-1 {schema}
# Node B succeeds at offset 2.
# DELETE A/subjects/foo?permanent=true
1-A-foo-1 NULL
# DELETE A/subjects/foo?permanent=true
1-A-foo-1 NULL
# Both nodes 'win'.