docs/RFCS/20160706_expressive_zone_config.md
This document has been made partially obsolete by more recent changes to ZoneConfig constraints. For the latest information, please see the docs.
The current ZoneConfig format has a number of drawbacks that make it hard to specify the number of replicas and to express more complicated constraints.
Current ZoneConfig format:
```yaml
replicas:
- attrs: [comma-separated attribute list]
- attrs: [comma-separated attribute list]
- attrs: [comma-separated attribute list]
range_min_bytes: <size-in-bytes>
range_max_bytes: <size-in-bytes>
gc:
  ttlseconds: <time-in-seconds>
```
https://www.cockroachlabs.com/docs/v2.0/configure-replication-zones#replication-zone-format
Currently, the number of replicas is controlled by adding extra `- attrs: []` elements to the `replicas` array.
Comment from @bdarnell on #4866:
I'd be in favor of introducing a num_replicas field which would be used in place of len(attrs). If attrs are present they would be used to constrain the replica selection in the way that they are today, otherwise it is as if there are num_replicas empty attr entries. No need to actually address the issues around negative or diversity constraints at this point.
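As a rough sketch (purely illustrative, not a format this RFC proposes), that suggestion applied to the current format might look like:

```yaml
# Hypothetical intermediate format: num_replicas controls the replica count,
# while any attrs entries that are present still constrain store selection.
num_replicas: 3
replicas:
- attrs: [ssd]
```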
Currently it's only possible to specify positive constraints, such as requiring that a replica be placed on a store with an SSD or in `us-west-1`. Each node/store must have its list of tags provided via command-line arguments to the server. There's also no way to do wildcard matches such as `us-.*`.
Examples of constraints that users may want to express:

- Store attributes: `hdd`, `ssd`, `mem`
- Cloud provider: `gce`, `aws`
- Datacenter: `us-west-1`
- Rack: `rack-12`
- Fully qualified location: `gce-us-west-1-rack-12.ssd`
- Wildcards: `us-.*.hdd`, `us-.*`
- Required locations: `us or canada`
- Prohibited locations: `us or canada`
- Combinations: `us and on ssd`, `us and not on ssd`

Note: This is one potential design. See Alternatives and Unresolved Questions.

Proposed ZoneConfig format:
```yaml
num_replicas: <number-of-replicas>
constraints: [comma-separated constraint list]
range_min_bytes: <size-in-bytes>
range_max_bytes: <size-in-bytes>
gc:
  ttlseconds: <time-in-seconds>
```
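For illustration only, a filled-in config under this format might look something like the following (the specific values are arbitrary):

```yaml
# Five replicas, preferring SSD stores located in the US.
num_replicas: 5
constraints: [ssd, us-.*]
range_min_bytes: 1048576
range_max_bytes: 67108864
gc:
  ttlseconds: 86400
```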
The approach would be to extend the current simple tag matching with modifiers and the ability to do regex matches.
Comment from @petermattis on #4868:
For example, instead of a single expression we could imagine there are a series of expressions.
`diversity(us-.*), require(hdd), prohibit(eu-.*)` would provide diversity across nodes containing an attribute matching `us-.*`, require nodes to have the attribute `hdd` and prohibit nodes from containing the attribute `eu-.*`. To make this more terse we could provide short-hand operators: `~us-.*, +hdd, !eu.*`. We'd want some rule about how to drop constraints when they can't be achieved.
This would be implemented as a simple array of string tags (similar to the current implementation) with three types of constraints.
Positive constraints (e.g. `us-.*`) are sufficient for most use cases. The allocator would take care of automatically maximizing diversity across stores matching these constraints, so the user doesn't have to worry about ending up with multiple replicas on the same rack unless it's unavoidable. If these constraints can't be matched, the allocator will fall back to the next closest match.
Since the allocator will maximize diversity, this allows us to express "or" statements. Specifying `[ssd, mem]` will place every replica on either an SSD or an in-memory store, since no store will be both. Likewise, specifying `[us-.*, ssd]` will place all replicas in the US and on SSDs.
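As a small sketch in the proposed format, a zone spread across SSD and in-memory stores could be written as:

```yaml
# Positive constraints only; the allocator spreads replicas across the
# matching store types since no store is both ssd and mem.
num_replicas: 3
constraints: [ssd, mem]
```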
Required constraints (e.g. `+us-.*`) and prohibited constraints (e.g. `-us-.*`) are useful for legal situations where you need a certain data restriction. This may be used for cases where data has to be local to a country, or shouldn't be in certain ones. The failure mode would be blocking new replicas from being added while there isn't enough capacity that matches the constraints.
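For instance, using the `+`/`-` prefixes described above, data that must stay in the US and must never land in the EU might be expressed roughly as:

```yaml
# '+' marks a required constraint, '-' a prohibited one. If no store can
# satisfy both, up-replication blocks rather than violating the constraints.
num_replicas: 3
constraints: [+us-.*, -eu-.*]
```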
The suggested format for locality would be a set of keys and values that would then be diversified across. This could look like `--locality cloud=gce,country=us,region=west,datacenter=us-west-1,rack=12`. Constraints would be applied like `[cloud=gce, country=us]`.
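As a sketch of how locality-based constraints might combine with the proposed zone config format (the exact key=value constraint syntax is only one possible design), a zone pinned to GCE nodes in the US could look like:

```yaml
# The remaining locality tiers (region, datacenter, rack) are still used to
# maximize diversity among the stores that match these constraints.
num_replicas: 3
constraints: [cloud=gce, country=us]
```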
We would try to maximize diversity within each key=value tag. Explicit keys make it easier to tell which tags correspond to each other than relying on their index in the hierarchy list alone. We would use the order in which the tags were defined as the hierarchy that determines which levels to prioritize for diversity. If the order of tags on one node doesn't match that on the others, we would warn the user and use the order that the majority of nodes share.
If we have three datacenters and each datacenter has three racks, `datacenter=us-{1,2,3},rack={{1,2,3},{1,2,3},{1,2,3}}`, there's no way to distinguish the primary diversity factor automatically, but we can maximize diversity on both levels. In this situation, the best thing to do would be to place the three replicas at `datacenter=us-1,rack=1`, `datacenter=us-2,rack=2` and `datacenter=us-3,rack=3`.
While maximizing diversity in a mostly empty cluster is easy, as servers fill up we might be forced to put two replicas on the same rack or two replicas in the same datacenter. This means that an ordering is needed to decide which tag's diversity is more important.
Some examples:

- Prefer SSDs and do not put replicas on in-memory stores: `constraints: [ssd, -mem]`
- Require data to be stored in the US: `constraints: [+country=us]`
- Put replicas on in-memory stores, and not on AWS: `constraints: [mem, -cloud=aws]`
Remove the ability to specify per-replica attributes, as they're a poor stand-in for proper diversity constraints. With a first-class diversity system we can get rid of the attribute arrays. This also allows specifying a candidate set that is bigger than the exact number of replicas.
For example, this:

```yaml
replicas:
- attrs: [us-1]
- attrs: [eur-2]
- attrs: [asia-3]
```

becomes:

```yaml
num_replicas: 3
constraints: [us-1, eur-2, asia-3]
```
By default, the allocator will try to maximize the number of constraints it satisfies. If it can't find a perfect match it'll find the best matching store and use that one instead.
The only exception is for required (hard) and prohibited constraints: if they can't be satisfied, replication will fail and an alert will show up in the admin UI. In the future, we could try to migrate other ranges off of matching stores to free up space and allow replication to continue.
Any issues with constraints will show up in logs and in the admin UI. While typically a constraint issue shouldn't break the system, it could operate in a manner that a user doesn't expect.
A simple tool that allows users to specify constraints and view the matching stores will be added to the web UI. This will allow users to vet any changes before actually applying them. In addition, it will highlight any existing invalid configurations, such as out-of-order locality tags and failing constraints.
One drawback is that the new system prevents explicit control over the diversity of a replica set. There's no way to say "one ssd and two hdds".
It might be useful to have weak negative constraints, such as `!ssd`, that wouldn't block allocation the way a prohibited constraint does. There's currently no strong use case for this, but it should be fairly easy to add later on.
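If such a constraint were added, it might look roughly like the following; the `!` prefix just mirrors the shorthand from the comment quoted earlier and is purely illustrative:

```yaml
num_replicas: 3
constraints: ["!ssd"]  # avoid SSDs when possible, but don't block allocation on them
```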