docs/RFCS/20170921_sql_partitioning.md
Table partitioning provides row-level control over how and where data is stored.
Specifically, CREATE TABLE, ALTER TABLE, and their INDEX counterparts
learn to accept a partition specification, which maps ranges of the index to
named partitions. Zone configs learn to target these partitions so that locality
constraints and replication settings can vary per partition instead of per table
or database.
The magic of CockroachDB is its ability to stitch together physically-disjoint ranges of data into one logical table, no matter whether the servers are separated by several racks or several continents.
If the cluster operator has any of the following requirements, however, CockroachDB’s default range allocation policy might be a poor fit:
- Latency guarantees. The operator must ensure that accessing particular data from a particular part of the world will be fast. Since the default allocation policy optimizes average latency across all ranges, it might violate the latency requirements on a particular range in the service of balancing the cluster.
- Data sovereignty. The operator is subject to regulations that require certain data to live on servers in certain geographical regions. The default allocation policy will blindly move data from overfull regions to underfull regions, potentially violating these regulations.
- Cost reduction. The operator needs to store infrequently-accessed data on slower hardware (e.g. spinning disks) to save money. The default allocation policy will happily shuffle data between nodes with SSDs and nodes with HDDs.
Zone configs can help with the above, provided the constraint applies to an entire table. To meet latency guarantees, a zone config can constrain an entire table to a particular data center. To reduce costs, a zone config can constrain an entire infrequently-accessed table to nodes with slower hardware. Data sovereignty, however, requires further work.
Often, however, the requirement applies to only a subset of the table. A users
table, for example, should store its American users in North America, its
European users in Europe, etc. Maybe an articles table should store the most
recent few months of posts on SSDs, but everything else on HDDs. Creating
individual tables (users_america, users_europe, articles_recent,
articles_archive etc.) would allow today’s zone configs to solve the problem,
but would bring back some of the pain of manual sharding that CockroachDB was
designed to avoid. Clients would be responsible for reading data from and
writing data to the right shard and stitching data from multiple shards together
if necessary.
By allowing zone configs to target subsets (i.e., “partitions”) of a table, we allow operators to meet their latency, regulatory, and cost requirements while preserving the magic of CockroachDB: that is, a logical table that hides the complex physical sharding underneath.
To ensure the design proposed within can be implemented in a reasonable timeframe, the following items are considered out-of-scope:
- Improving the zone config interface. The YAML/CLI interface for specifying zone configs is clunky and will get more confusing when partitions are added, but improvements are better left to a comprehensive redesign (which is already planned for row-level TTLs, auditing, etc.). Instead, this RFC proposes the minimum changes necessary to the existing interface to support partitioning.
- Data sovereignty. The design within is necessary but not sufficient to meet data sovereignty regulations. In particular, the proposed scheme provides little protection against misconfiguration and little visibility into, at a given moment, what data is actually stored where. Additionally, we'll need a resolution to protected information leaking via keys, which make their way into various system ranges (e.g. meta2). The rest of the data sovereignty work will be developed in a followup RFC.
- Engineering work schedule. Some of this work will make 2.0 and some won’t, but determining milestones is intentionally omitted from this RFC.
- Altering the primary key columns of a table. The design below relies on careful primary key selection, which means tables created without partitioning in mind may not have a suitable schema. For now, a workaround is to export and reimport the data with a new schema. Eventually, we need to support primary key schema changes (see #19141).
Consider a globally-distributed online storefront, RoachMart, that sells and ships live cockroaches. RoachMart processes several hundred requests per second, each of which might generate several queries to RoachMart’s nine-node CockroachDB cluster. (Live cockroaches are very popular these days.) To keep these requests speedy, RoachMart’s operators want to automatically locate each user’s data in datacenters near that user.
Today, RoachMart has two data centers: one in North America and one in Australia. The company expects to open data centers in Europe, Africa, and Asia soon, so they want to ensure they can migrate users to the new data centers once they open.
In CockroachDB, partitions are cheap and easy to redefine at will, with one
caveat: partitions must be defined over columns that are a prefix of the primary
key. A users table with just a string primary key, like
```sql
CREATE TABLE users (
    email STRING PRIMARY KEY,
    continent STRING,
    ...
);
```
cannot be partitioned by continent because continent is not part of the
primary key. Instead, the table needs to be specified with a composite primary
key
```sql
CREATE TABLE users (
    email STRING,
    continent STRING,
    ...
    PRIMARY KEY (continent, email)
);
```
where the partition column continent appears first in the key.
Note that this usage of composite primary keys is somewhat unusual. See the Partitioning key selection section below for a discussion.
Since it’s not currently possible to change a table’s primary key, the partition columns included in the primary key when the table is created must be granular enough to support any partitioning scheme that might be desired in the future. Even if it were possible, changing the primary key would require rewriting all the data in the table and any tables interleaved beneath, a potentially intractable operation. See Partitioning and index columns for a more detailed discussion of this drawback.
RoachMart is worried that, someday, they’ll have more than one data center per
continent, so they decide to use country instead of continent as their
partition column. Using the PARTITION BY LIST syntax, they can group
individual countries into larger partitions, one for each of their data centers:
```sql
CREATE TABLE roachmart.users (
    email STRING,
    country STRING,
    ...,
    PRIMARY KEY (country, email)
) PARTITION BY LIST (country) (
    PARTITION australia VALUES IN ('AU', 'NZ'),
    PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
    PARTITION default VALUES IN (DEFAULT)
);
```
As requirements shift (e.g., a new data center opens, an existing data center is
running low on capacity, etc.), a country can be seamlessly migrated to a new
partition with an ALTER TABLE command that repartitions the table:
```sql
ALTER TABLE roachmart.users PARTITION BY LIST (country) (
    PARTITION australia VALUES IN ('AU', 'NZ'),
    PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
    PARTITION scandinavia VALUES IN ('DK', 'NO', 'SE'),
    PARTITION default VALUES IN (DEFAULT)
);
```
Each partition is required to have a unique name that logically identifies it
across repartitions. Above, the default partition is considered equivalent to
the original default partition, though it is missing three countries that are
now in the new scandinavia partition.
On their own, partitions are inert. The SQL schema for roachmart.users does
nothing to actually locate the australia partition in Australia. Applying
functionality to a partition requires zone configs. Just as each database and
table can be targeted by a zone config that overrides the number and location
of its replicas, each partition can be targeted by a zone config.
So, RoachMart restarts each of its CockroachDB nodes with a --locality flag
that indicates its data center:
```shell
$ ./cockroach start --locality=datacenter=au1
$ ./cockroach start --locality=datacenter=us1
```
Then they can apply a zone config to each partition that restricts it to the appropriate locality:
```shell
$ cat australia.zone.yml
constraints: [+datacenter=au1]
$ cat north_america.zone.yml
constraints: [+datacenter=us1]
$ ./cockroach zone set roachmart.users --partition=australia -f australia.zone.yml
$ ./cockroach zone set roachmart.users --partition=north_america -f north_america.zone.yml
```
The replicate queue on each node will notice the updated zone config and begin
rebalancing the cluster. Users with country = 'NZ' will live on a range that
is only replicated within the au1 data center and users with country = 'US'
will live on a range that is only replicated within the us1 data center.
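The matching step at the heart of that rebalancing can be sketched as follows. This is only an illustration of how a required constraint like `+datacenter=au1` qualifies or disqualifies a store; the real allocator weighs many more signals, and the node names and dict shapes here are invented:

```python
def satisfies(constraint, locality):
    # constraint like '+datacenter=au1'; locality as parsed from --locality.
    assert constraint.startswith("+"), "only required constraints are modeled"
    key, value = constraint[1:].split("=", 1)
    return locality.get(key) == value

nodes = {
    "n1": {"datacenter": "au1"},
    "n2": {"datacenter": "us1"},
}

# Replicas of the australia partition may only be placed on stores whose
# locality matches the partition's constraint.
eligible = [n for n, loc in nodes.items() if satisfies("+datacenter=au1", loc)]
print(eligible)  # ['n1']
```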
Data in the users table is now stored in the correct place, but the typical
RoachMart request accesses more than just the users table. In particular,
RoachMart would like to colocate users’ orders. Interleaved tables make this
simple.
```sql
CREATE TABLE roachmart.orders (
    user_country STRING,
    user_email STRING,
    id INT,
    part_id INT,
    ...
    PRIMARY KEY (user_country, user_email, id),
    FOREIGN KEY (user_country, user_email) REFERENCES users
) INTERLEAVE IN PARENT users (user_country, user_email);
```
The key encoding of interleaved tables ensures that the zone config of any top-level partitions applies to the data interleaved within.
RoachMart’s warehouse team wants to know how many orders are processed for each part they sell. They want a secondary index to make the query efficient. Since the common operation is to look at one country’s orders at a time, they partition the index on the same columns as users.
```sql
CREATE INDEX part_idx ON roachmart.orders (user_country, part_id)
    PARTITION BY LIST (user_country) (
        ...
    );
```
RoachBlog, a free weblog provider, is worried about the amount of data stored in
their articles table. The table is growing at the rate of approximately 100GB
per day. (Cockroach enthusiasts are prolific.)
An investigation of their query traffic has revealed that, save for a few
outliers, articles published more than 30 days ago receive virtually no traffic.
They’d like to move those articles to nodes with HDDs to save money on hardware.
Smartly—as if they had anticipated this eventuality—RoachBlog’s engineers
designed the table with a TIMESTAMP primary key:
```sql
CREATE TABLE roachblog.articles (
    id SERIAL,
    published TIMESTAMP,
    author_id INT,
    ...,
    PRIMARY KEY (published, id)
);
```
Listing out every published timestamp that should be considered “archived”
would be infeasible, so they use the range partitioning syntax instead:
```sql
ALTER TABLE roachblog.articles PARTITION BY RANGE (published) (
    PARTITION archived VALUES FROM (MINVALUE) TO ('2017-12-04'),
    PARTITION recent VALUES FROM ('2017-12-04') TO (MAXVALUE)
);
```
RoachBlog plans to run this statement every week with an updated partition split point. Repartitioning was designed to be quite cheap to support this very use case, and indeed no actual data needs to be rewritten.
N.B.: Each time this query is run it will produce a new split point in the table, which could leave a lot of small ranges (until we implement range merges). Documentation will need to caution users about repartitioning too often.
As with RoachMart, RoachBlog now need only launch their nodes with appropriate store attributes
```shell
$ ./cockroach start --store=path=/mnt/crdb,attrs=ssd
$ ./cockroach start --store=path=/mnt/crdb,attrs=hdd
```
and install the corresponding zone config:
```shell
$ cat recent.zone.yml
constraints: [+ssd]
$ cat archived.zone.yml
constraints: [+hdd]
$ ./cockroach zone set roachblog.articles --partition=recent -f recent.zone.yml
$ ./cockroach zone set roachblog.articles --partition=archived -f archived.zone.yml
```
RoachBlog has a page which lists all articles written by a given author. There are no sovereignty issues with this, so they create a global index to keep this page fast:
```sql
CREATE INDEX author_idx ON roachblog.articles (author_id);
```
Typically, a composite key is used when the unique identifier for a row is naturally formed from more than one column. Consider, for example, a university course catalog that spans multiple years. Every course has an ID, like CS101, but that ID is reused every year. The primary key, then, must include both the course ID and the term it was offered. In this case, you might very reasonably choose to form a composite key out of three columns:
```sql
CREATE TABLE courses (
    course_id STRING,
    term STRING,
    year INT,
    ...
    PRIMARY KEY (course_id, term, year)
);
```
This is an accepted practice in schema design. You might even choose to order
your primary key in a different order based on expected query patterns; e.g.,
(year, term, course_id) might be a better order if the application is
frequently querying for the courses in a given year.
Partitioning stretches this practice by constructing a composite key out of a
column that could uniquely identify a row on its own. Notice that in the
courses table, no proper subset of the primary key columns is sufficient to
uniquely identify a row. In roachmart.users, however, email alone is
sufficient to identify a row; the country column is superfluous and exists
only to facilitate partitioning. The biggest consequence of this oddity is the
potential for user confusion; overspecified keys cause no known technical
problems.
In fact, the guidelines for constructing a partitioning key under this scheme are quite straightforward: include everything you wish to partition by in the key, in the order you wish to nest your subpartitions, and follow it with everything that would have otherwise been in the key.
See the Partitioning and index columns section for a discussion of why this approach was chosen.
Partitions can be added to a table when it is created, or at any time afterward. (However, since the primary key of a table cannot be changed, forethought is still required.)
Partitions are defined over one or more columns. These columns must be a prefix of the primary key (or secondary index).
Indexes can also be partitioned, but are not required to be.
Each partition is required to have a name that is unique among all partitions on that table or index.
Repartitioning is relatively cheap; data is not rewritten. The table metadata is updated and the allocator is left to move the data as necessary. A partition with the same name before and after a repartitioning is considered the same partition. During a repartitioning, any partition zone config entries with no corresponding name in the new partitions are removed.
Future work to expose zone configs through SQL will likely allow for updating partitions and the corresponding zone configs together. In the meantime, users can create an empty partition, apply a zone config, then repartition.
After a table is partitioned, its partitions can be targeted by zone
constraints via the existing CLI, e.g. ./cockroach zone set database.table --partition=partition -f zone.yml.
As before, CockroachDB uses the most granular zone config available. Zone configs that target a partition are considered more granular than those that target a table, which in turn are considered more granular than those that target a database. Configs do not inherit, but when a zone override is first created, unspecified fields are copied from the parent, as is currently the case.
There exists a tradeoff between making reads/writes fast and surviving failures.
Consider a partition with three replicas of roachmart.users for Australian
users. If only one replica is pinned to an Australian datacenter, then reads may
be fast (via leases follow the sun) but writes will be slow. If two replicas
are pinned to an Australian datacenter, then reads and writes will be fast (as
long as the cross-ocean link has enough bandwidth that the third replica doesn’t
fall behind and hit the quota pool). If those two replicas are in the same
datacenter, then loss of one datacenter can lead to data unavailability, so some
deployments may want two separate Australian datacenters. If all three replicas
are in Australian datacenters, then three Australian datacenters are needed to
be resilient to a datacenter loss.
Different key encodings were evaluated at length in a prior version of this RFC. The recommended approach, entitled Index Key Prefix, is described here without discussion of the tradeoffs involved in the decision; refer to the original RFC for that discussion.
The Index Key Prefix approach simply allows part of the existing row key to be used as the partition key. For a partitioned table, the partition key can be any prefix of the table primary key. Similarly, for a partitioned index, the partition key can be any prefix of the indexed columns. Since a row’s key is encoded as the table ID, followed by the index ID, followed by each of the indexed columns, in order, the partition key appears immediately after the table and index IDs, and thus rows with the same partition key will be adjacent.
To further illustrate this, consider the key encodings for the earlier geographic partitioning example.
Primary index for roachmart.users:

```
/TableID/IndexID/<country>/<email> -> [...]
```

Global non-unique secondary index for roachblog.articles.author_id:

```
/TableID/IndexID/<author_id>/<article_id> -> null
```

Partitioned non-unique secondary index for roachmart.orders.part_id:

```
/TableID/IndexID/<user_country>/<part_id>/<user_email>/<order_id> -> null
```
Note that values need not be physically adjacent to belong to the same logical
partition. For example, 'US', 'CA' and 'MX' are interspersed with other
countries, but all map to the same north_america partition in
roachmart.users. This comes at a cost, as the north_america partition now
requires a minimum of three separate ranges instead of just one. See Range
splits below for further discussion.
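The prefix-ordering property behind this can be illustrated with a toy model. This is not CockroachDB's actual byte-level key encoding: a Python tuple stands in for an encoded key, and the table/index IDs are arbitrary stand-ins.

```python
# Model a row key as (table_id, index_id, indexed columns...): sorting
# these tuples groups rows by any key prefix, including the partition
# column. IDs are illustrative, not real descriptor IDs.
TABLE_ID, PRIMARY_INDEX_ID = 51, 1

def users_pk(country, email):
    # country is the partition column, so rows sort by country first.
    return (TABLE_ID, PRIMARY_INDEX_ID, country, email)

rows = sorted(users_pk(c, e) for c, e in [
    ("US", "bob@x"), ("AU", "ann@x"), ("MX", "eve@x"),
    ("CA", "dan@x"), ("NZ", "flo@x"), ("US", "amy@x"),
])

# The australia partition ('AU', 'NZ') occupies two disjoint key runs and
# north_america ('CA', 'MX', 'US') three, as described above: rows are
# adjacent per list value, not per partition.
print([k[2] for k in rows])  # ['AU', 'CA', 'MX', 'NZ', 'US', 'US']
```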
The PARTITION BY clause is usable in CREATE TABLE, ALTER TABLE, CREATE INDEX, ALTER INDEX and variants. See Example: geographic partitioning for
the PARTITION BY LIST syntax and Example: date partitioning for PARTITION BY RANGE.
A somewhat formal version of the syntax is presented below:
```
CREATE TABLE <table-name> ( <elements...> ) [<interleave>] [<partition-scheme>]
CREATE INDEX [<index-name>] ON <table-name> ( <col-names>... ) [<interleave>] [<partition-scheme>]
ALTER TABLE <table-name> <partition-scheme>
ALTER INDEX <index-name> <partition-scheme>

Partition scheme:
    PARTITION BY LIST ( <col-names>... ) ( <list-partition> [ , ... ] )
    PARTITION BY RANGE ( <col-names>... ) ( <range-partition> [ , ... ] )

List partition:
    PARTITION <partition-name> VALUES IN ( <list-expr>... ) [<partition-scheme>]

List expression:
    DEFAULT
    <const-expr>

Range partition:
    PARTITION <partition-name> VALUES FROM (<range-expr>) TO (<range-expr>)

Range expression:
    MINVALUE
    MAXVALUE
    <const-expr>
```
To reduce confusion, expressions in a VALUES or VALUES IN clause must be
constant, as they are only computed once when the CREATE or ALTER statement
is executed. Allowing non-constant expressions, like now() or SELECT country_name FROM countries, would suggest that the partition updates whenever
the expression changes in value.
Note that the lower bound of a range partition (FROM) is inclusive, while the
upper bound (TO) is exclusive. Note also that a NULL value in a
range-partitioned column sorts into the first range, which is consistent with
our key encoding ordering and ORDER BY behavior.
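These bound semantics can be sketched with a toy classifier for the roachblog.articles partitioning above. ISO-formatted strings stand in for real TIMESTAMP comparisons, and the NULL handling mirrors the key-encoding order:

```python
CUTOFF = "2017-12-04"  # the split point from the example above

def partition_for(published):
    if published is None:      # NULL sorts into the first range
        return "archived"
    if published < CUTOFF:     # FROM (MINVALUE) is inclusive...
        return "archived"
    return "recent"            # ...and TO ('2017-12-04') is exclusive

print(partition_for("2017-12-03 23:59:59"))  # archived
print(partition_for("2017-12-04 00:00:00"))  # recent: the lower bound is inclusive
print(partition_for(None))                   # archived: NULL sorts first
```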
To allow reads to target only selected partitions, we propose to extend table
names (not arbitrary table expressions) with a PARTITION clause. For example:
```sql
SELECT * FROM roachmart.users PARTITION (australia, north_america);

SELECT *
  FROM roachblog.articles PARTITION (recent) AS a
  JOIN roachblog.user_views PARTITION (north_america) AS b
    ON a.id = b.article_id;
```
The implementation is purely syntactic sugar and simply transforms each
PARTITION clause into an additional constraint. The join query above, for
example, would be rewritten to include WHERE user_views.country IN ('CA', 'MX', 'US') AND articles.published >= 'recent cutoff'.
This is only sugar, but it's an important one. For example, in
geopartitioning, it is expected that there will be many more countries than
partitions, so it is much easier to specify partition restrictions this way
than with the desugared WHERE clause.
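A sketch of that desugaring, using the partition definitions from the roachmart.users example. The helper name is hypothetical; the real rewrite happens inside the planner:

```python
PARTITIONS = {  # from the roachmart.users PARTITION BY LIST definition
    "australia": ["AU", "NZ"],
    "north_america": ["CA", "MX", "US"],
}

def desugar(column, partition_names):
    # Each PARTITION (...) clause becomes an IN constraint over the union
    # of the named partitions' value lists.
    values = sorted(v for p in partition_names for v in PARTITIONS[p])
    return "%s IN (%s)" % (column, ", ".join("'%s'" % v for v in values))

# SELECT * FROM roachmart.users PARTITION (north_america) gains this
# extra WHERE constraint:
print(desugar("country", ["north_america"]))  # country IN ('CA', 'MX', 'US')
```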
Note that a list partitioning without a DEFAULT is an enum. The user can use
this syntax with all partitions specified to force the planner to turn a query
like SELECT * FROM roachmart.users WHERE email = '...' (with no country
specified) from a full table scan into a point lookup per country in the
partitioning.
Subpartitioning allows partitioning along several axes simultaneously. The
PARTITION BY syntax presented above is recursive so that list partitions can
be themselves partitioned any number of times, using either list or range
partitioning. Note that the subpartition columns must be a prefix of the columns
in the primary key that have not been consumed by parent partitions.
Suppose RoachMart wanted to age out users who haven’t logged in in several months to slower hardware, while continuing to partition by country for improved latency. Subpartitioning would neatly solve their use case:
```sql
CREATE TABLE roachmart.users (
    id SERIAL,
    country STRING,
    last_seen TIMESTAMP,
    ...,
    PRIMARY KEY (country, last_seen, id)
) PARTITION BY LIST (country) (
    PARTITION australia VALUES IN ('AU', 'NZ') PARTITION BY RANGE (last_seen) (
        PARTITION australia_archived VALUES FROM (MINVALUE) TO ('2017-06-04'),
        PARTITION australia_recent VALUES FROM ('2017-06-04') TO (MAXVALUE)
    ),
    PARTITION north_america VALUES IN ('CA', 'MX', 'US') PARTITION BY RANGE (last_seen) (
        PARTITION north_america_archived VALUES FROM (MINVALUE) TO ('2017-06-04'),
        PARTITION north_america_recent VALUES FROM ('2017-06-04') TO (MAXVALUE)
    ),
    ...
);
```
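A toy classifier for this two-level scheme, mirroring the DDL above. The elided remaining partitions are omitted, and strings stand in for TIMESTAMP comparison:

```python
LIST = {  # country -> top-level list partition
    "AU": "australia", "NZ": "australia",
    "CA": "north_america", "MX": "north_america", "US": "north_america",
}
CUTOFF = "2017-06-04"  # the last_seen split point from the example

def subpartition_for(country, last_seen):
    parent = LIST[country]  # list partitioning consumes `country`...
    # ...then range partitioning consumes the next key column, last_seen.
    return "%s_%s" % (parent, "archived" if last_seen < CUTOFF else "recent")

print(subpartition_for("NZ", "2017-01-15"))  # australia_archived
print(subpartition_for("US", "2017-08-01"))  # north_america_recent
```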
Subpartition names must be unique within a table, as each table's partitions and subpartitions share a namespace.
Other databases also provide less flexible but more convenient syntaxes, like a
SUBPARTITION TEMPLATE that prevents repetition of identical subpartition
schemes. Above, a SUBPARTITION TEMPLATE could be used to describe the
last_seen partitioning scheme exactly once, instead of once for each country
partition. We propose to implement only the more general syntax and defer design
of a more convenient syntax until demand exists.
Partitioning information is added to IndexDescriptor as below. All tuples are
encoded as EncDatums using the value encoding.
```protobuf
message IndexDescriptor {
  ...
  optional PartitioningDescriptor partitioning = 13;
}

message PartitioningDescriptor {
  message List {
    optional string name = 1;
    repeated bytes values = 2;
    optional PartitioningDescriptor subpartition = 3;
  }
  message Range {
    optional string name = 1;
    optional bytes inclusive_lower_bound = 2;
    optional bytes exclusive_upper_bound = 3;
  }

  optional uint32 num_columns = 1;
  repeated List list = 2;
  repeated Range range = 3;
}
```
Zone configs are currently stored in the system.zones table, which maps
database and table IDs to ZoneConfig protobufs. We propose to adjust this
ZoneConfig protobuf to include configuration for "subzones," where a subzone
represents either an entire index, or a partition of any index. Subzones are not
applicable when the zone does not represent a table.
Ideally, we'd modify or replace system.zones with a table keyed by
(descriptor_id, index_id, partition_name), but a proposal to this effect [was
presented and rejected in a previous version of this RFC][system-subzones]. In
short, replacing system.zones requires a complex, backwards-incompatible
migration that is better left to its own RFC.
Our interim solution adds two fields to the ZoneConfig proto:
```protobuf
message ZoneConfig {
  ...
  repeated SubzoneSpan subzone_spans = 8;
  repeated SubzoneConfig subzone_configs = 9;
}

message SubzoneSpan {
  optional roachpb.Span span = 1;
  optional uint32 subzone_index = 2;
}

message SubzoneConfig {
  optional uint32 index_id = 1;
  optional string partition_name = 2;
  optional ZoneConfig config = 3;
}
```
A subzone's config is stored in the subzone_configs field as a SubzoneConfig
message, which bundles a ZoneConfig with identification of the index or
partition that the subzone represents. Every SubzoneConfig must specify an
index_id, but subzones that apply to the entire index omit the
partition_name field.
A mapping from key span to SubzoneConfig is stored in the subzone_spans
field to allow efficient lookups of the subzone override for a given key.
Entries in subzone_spans are non-overlapping and sorted by start key to allow
for binary search. If an entry's span contains the lookup key, the
subzone_index field is the index of the SubzoneConfig in the
subzone_configs field; if no span contains the lookup key, the table config
(i.e., the outer ZoneConfig) applies.
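The lookup can be sketched as follows. Plain strings stand in for encoded keys, the span contents are invented for illustration, and `'\xff'` crudely models a real prefix-end key:

```python
import bisect

# (start_key, end_key, subzone_index), sorted by start key, non-overlapping.
subzone_spans = [
    ("/51/1/AU", "/51/1/AU\xff", 0),  # 'AU' run of the australia subzone
    ("/51/1/CA", "/51/1/CA\xff", 1),  # 'CA' run of the north_america subzone
    ("/51/1/NZ", "/51/1/NZ\xff", 0),  # 'NZ' run of the australia subzone
]

def subzone_index_for(key):
    starts = [s for s, _, _ in subzone_spans]
    i = bisect.bisect_right(starts, key) - 1  # candidate: rightmost start <= key
    if i >= 0:
        start, end, idx = subzone_spans[i]
        if start <= key < end:
            return idx
    return None  # no span contains the key: the table config applies

print(subzone_index_for("/51/1/AU/ann@x"))  # 0
print(subzone_index_for("/51/1/ZZ/zed@x"))  # None
```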
We could alternatively derive subzone_spans from the corresponding
TableDescriptor on every call to GetZoneConfigForKey, but this would involve
quite a bit of encoding/decoding on a hot code path, as the table descriptor
stores the partition split points using value encoding. Updating zone configs or
partitioning, by contrast, happens infrequently; we'll simply recompute
subzone_spans whenever an index or partition zone config is updated, or when
a table's partitioning scheme is updated.
One case requires special care to handle. Suppose the archived partition of
roachblog.articles has a custom zone config, but the roachblog.articles
table itself does not have a custom zone config. The system.zones table will
necessarily have an entry for roachblog.articles to store the custom zone
config for the archived partition, but that entry will have an otherwise empty
ZoneConfig. This must not be taken to mean that the other partitions of the
table should use this empty zone config, but that the default zone config
applies. We propose to use the num_replicas field for this purpose:
num_replicas = 0 is invalid and therefore indicates that the table in
question does not have an active ZoneConfig.
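A sketch of the resulting resolution logic, with zone configs modeled as plain dicts. The dict shape and the default values are illustrative, not the real ZoneConfig representation:

```python
DEFAULT_ZONE = {"num_replicas": 3}  # stand-in for the cluster default

def effective_zone(table_zone, subzone):
    if subzone is not None:
        return subzone                 # the most granular config wins
    if table_zone is not None and table_zone.get("num_replicas", 0) != 0:
        return table_zone              # the table's own config is active
    return DEFAULT_ZONE                # placeholder entry: fall through

# roachblog.articles: a placeholder table entry (num_replicas == 0) exists
# only to carry the archived partition's subzone config.
placeholder = {"num_replicas": 0}
archived = {"num_replicas": 3, "constraints": ["+hdd"]}
print(effective_zone(placeholder, archived))  # the archived subzone applies
print(effective_zone(placeholder, None))      # the cluster default applies
```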
As mentioned in the examples, the zone config CLI will be adjusted to accept an index specifier and partitioning flag:
```shell
# Before:
$ ./cockroach zone {set|get|rm} DATABASE[.TABLE] -f zone.yml
# After:
$ ./cockroach zone {set|get|rm} DATABASE[.TABLE[@INDEX]] [--partition=PARTITION] -f zone.yml
```
Omitting the index but specifying the --partition flag (e.g., ./cockroach zone set db.tbl --partition=p0) implies a partition of the primary index. Note,
however, that ./cockroach zone set db.tbl and ./cockroach zone set db.tbl@primary are
not equivalent: the former specifies a table zone config that applies to any
secondary indices, unless more specific overrides exist, while the latter
specifies a primary index zone config that never applies to the table's
secondary indices.
One downside of this scheme is that the zone configs for all partitions of a
table's indices are stored in one row, which puts an effective limit on the
number of partitions allowed on a table. Additionally, system.zones is part of
the unsplittable, gossiped system config span, so all zone configs across all
tables must add up to less than 64MB. Some back of the envelope math suggests 60
bytes per range partition, and 60 bytes + the size of the values in a list
partition.
For roachmart.users with every country allocated between 7 partitions, this
results in 1167B. This results in an absolute max of 64MB / 60B = ~1,000,000
partitioned tables or 64MB / 1167B = ~54,000 tables if they were all partitioned
by country. The recommended max number of partitions in a table seems to range
from 100 to 1024 in other partitioning implementations, so this seems
reasonable.
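The arithmetic behind those figures, assuming decimal megabytes (which is how the quoted numbers come out):

```python
GOSSIP_BUDGET = 64 * 1000 * 1000  # system config span budget, in bytes
RANGE_PARTITION_BYTES = 60        # back-of-envelope size per range partition
ROACHMART_USERS_BYTES = 1167      # the country-partitioned users table above

print(GOSSIP_BUDGET // RANGE_PARTITION_BYTES)  # ~1,066,666 one-partition tables
print(GOSSIP_BUDGET // ROACHMART_USERS_BYTES)  # ~54,841 country-partitioned tables
```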
Similar to the way that the zone config for a table, if present, completely overrides (with no inheritance) the zone config for a database, a zone config for a partition overrides the zone config for the table, database, or cluster. The ergonomics of this will likely be sub-optimal; the user will need to maintain the denormalization of what is naturally an inheritance hierarchy of configuration. There is a larger upcoming effort to refactor zone configs which will address these issues, so the following are out of scope of this RFC:
- A SQL interface for reading and writing zone configs
- Auditing of zone config changes
- Inheritance
- Moving zone configs out of gossip
- Raising global and per-table limitations on the number of partitions
- Direct replica-level control of data placement
- Partition configs outside of the database+table scoping. This could be useful for allowing partitioning in a shared CockroachDB offering with static, preset zone configs.
- The ability to define a partition and specify a zone config for it at the same time.
The CockroachDB unit of replication and rebalancing is the range, so for a zone config to target a partition, the partition needs to live on its own range (and potentially on more than one range for list partitioning; see Range splits).
Currently, CockroachDB asynchronously splits each newly created table into its own range. This happens regardless of whether any zone configs are added to target that table specifically. Each partition could similarly be asynchronously split after being defined.
This may create extra ranges if a table is partitioned, repartitioned, and only then given zone configs (though this problem goes away when we support range merges). We address this by lazily splitting a partition only when a zone config is added for it. This deviates a bit from the table behavior and may violate a user’s expectation of partitions living on separate ranges, but the tradeoff was deemed worth it.
A table’s partitions can have a large influence on SQL planning, so any changes
should be broadcast immediately. Further, our schema changes and table leasing
require that at most two versions of a descriptor are in use at a time. It’s
therefore a natural fit for partitioning changes to run as schema changes. This
schema change will be responsible for creating the splits and can be hooked into
system.jobs to provide introspection of progress as well as cancellation.
Because of the decision to require that partitions be defined over normal, materialized columns, no correctness changes are needed in SQL query planning. Performance, however, needs some work.
Consider the following queries issued by RoachMart to the roachmart.users
table. (Ignore the poor security practices; this is meant to be illustrative.)
When a user visits RoachMart and needs to log in:
```sql
SELECT id FROM roachmart.users WHERE email = $1;
```
Whenever a new page is loaded on the web or a new screen is loaded in the native
app, the id is extracted from a cookie or passed via the RoachMart API and
used by the stateless servers to rehydrate the user information:

```sql
SELECT * FROM roachmart.users WHERE id = 12345;
```
Regardless of whether the roachmart.users table is partitioned, the first
query is kept speedy by an unpartitioned secondary index on email. This may
require a cross-ocean hop to another datacenter, but login is an infrequent
operation, so this is acceptable. (If data sovereignty of emails is a concern,
then a global index is not appropriate and either a global index on hash(email)
or a partitioned index must be used. The details are left for a later
sovereignty RFC.)
The second query is much harder. In an unpartitioned table, the primary key
would be only on (id) and so this is a point lookup. But RoachMart is a global
company and wants to keep a user’s data in the datacenter nearest them, so
they’ve partitioned roachmart.users on (country, id). This means the query
as written above will require a full table scan, which is obviously
unacceptable.
The best solution to this, and the one we will always recommend first, is for
RoachMart to also specify the user’s country in the second query. The login
query will be changed to also return the country, this will be saved alongside
id in the web cookie or native app’s local data, and passed back whenever
retrieving the user. It’s even not as onerous as it first seems since the
RoachMart API returns the user’s id as a string containing both pieces:
US|123.
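Under this recommendation, the second query simply gains an equality predicate on the partition column, restoring a point lookup (the literal values here are illustrative):

```sql
-- Both the partition column and id are specified, so this is a point
-- lookup within the north_america partition, not a full table scan.
SELECT * FROM roachmart.users WHERE country = 'US' AND id = 12345;
```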
In some cases, this will not be possible or will not be desirable, so CockroachDB has a number of pieces that can be combined to deal with this. NB: None of these really solve the problem in a satisfactory way, so as mentioned above we will very strongly urge the user to specify the entire primary key.
1. The developer could create a global index on just id, but in the common
   case this requires a cross-datacenter request for the index lookup.
2. If LIMIT 1 is added to the query and no sort is requested, the planner is
   free to return the first result it finds and cancel the rest of the work.
   If a uniqueness constraint exists on id, the LIMIT 1 can be assumed. The
   latter is an optimization that may be helpful in general.
3. A list partitioning without a DEFAULT is an enum. The user can use the
   SELECT FROM PARTITION syntax with all partitions specified. Internally,
   the planner will turn this into an AND country IN ('CA', 'MX', 'US', …)
   clause in the WHERE, which turns the query from a full table scan into a
   point lookup per country in the partitioning. If DEFAULT is present, the
   non-default cases could optimistically be checked first. If some future
   version of CockroachDB supports enum types, the user would get this
   behavior even without using the SELECT FROM PARTITION syntax.
4. This is not a full table scan in other partitioning implementations
   because they don’t require the (country, id) primary key, instead
   indexing (id) as normal inside each partition. A query on id without the
   partition information then becomes a point lookup per partition. This can
   be simulated in CockroachDB by introducing and partitioning on a derived
   partition_id column that is 1:1 with partitions. This is sufficient
   justification to prioritize building computed columns. Concretely, the
   roachmart.users table above could have PRIMARY KEY (continent, country,
   id) and PARTITION BY LIST (continent) to start, so there is only one key
   value per partition. Later, when/if it is needed, it could change to
   PARTITION BY LIST (continent, country).
5. Options 3 and 4 could be combined to allow a user to issue one query that
   optimistically tries a point lookup in the local partition before trying
   a point lookup in all partitions.
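The SELECT FROM PARTITION rewrite described above can be sketched as follows (the partition names and country values are illustrative, and the syntax is the proposal, not an existing feature):

```sql
-- A query against named partitions...
SELECT * FROM roachmart.users PARTITION (north_america, oceania)
  WHERE id = 12345;

-- ...would be planned as the equivalent IN clause over the partitions'
-- list values, i.e. one point lookup per country instead of a full scan:
SELECT * FROM roachmart.users
  WHERE country IN ('AU', 'CA', 'FJ', 'MX', 'US') AND id = 12345;
```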
Other implementations surface partitioning information via
information_schema.partitions, so we should as well. SHOW CREATE TABLE will
also need to learn to display PARTITION BY clauses.
Interleaved tables and partitioning are designed to work together. Geographic
partitioning is used to locate roachmart.users records in the nearest
datacenter and interleaved tables are used to locate the data associated with
that user (orders, etc) near it. A geographically partitioned user can be moved
with one UPDATE, and the user’s orders (plus the rest of the interleave
hierarchy) can be moved with ON UPDATE CASCADE.
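As a sketch, assuming the roachmart.users schema with PRIMARY KEY (country, id) used throughout this RFC, the interleaved child table might look like:

```sql
-- Hypothetical orders table interleaved into roachmart.users. The foreign
-- key's ON UPDATE CASCADE means changing a user's country (which moves the
-- user row between partitions) also moves that user's orders.
CREATE TABLE roachmart.orders (
  user_country STRING,
  user_id INT,
  id INT,
  total DECIMAL,
  PRIMARY KEY (user_country, user_id, id),
  FOREIGN KEY (user_country, user_id)
    REFERENCES roachmart.users (country, id) ON UPDATE CASCADE
) INTERLEAVE IN PARENT roachmart.users (user_country, user_id);
```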
Like index selection, column families, interleaved tables, and other optimization tools, partitioning will require some knowledge of the internals of the system to use effectively. See, for example, the locality–resilience tradeoff described above.
Since a table can only be partitioned by a prefix of its primary key, a table
destined for partitioning often has an “unnatural” primary key. For example, to
support partitioning the roachmart.users table by country, the table’s
natural primary key, id, must be explicitly prefixed with the partition column
to create a composite primary key of (country, id). The composite primary key
has two notable drawbacks: it does not enforce that id is globally unique, and
it does not provide fast lookups on id. If ensuring uniqueness or fast lookups
are required, the user must explicitly create a unique, unpartitioned secondary
index on id.
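For example, given the composite (country, id) primary key, such an index might be created as follows (the index name is illustrative):

```sql
-- An unpartitioned, unique secondary index: enforces global uniqueness of
-- id and provides fast lookups on id alone, at the cost of cross-region
-- coordination on writes.
CREATE UNIQUE INDEX users_id_key ON roachmart.users (id);
```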
We could automatically add this secondary index to preserve uniqueness and fast
lookups, but this would violate user expectations. First, a secondary index
might itself store sensitive information. We want to ensure that operators are
aware of the secondary index so they can specify an appropriate zone config or
even create the index on hash(id) (though hashes may not provide sufficient
masking of sensitive information, depending on the entropy of the input and the
type of hash). Second, every secondary index increases the cost of writes
(specifically for inserts and for updates that change an indexed or stored
column). For example, an unpartitioned, unique index on roachmart.users would
require cross-ocean hops for writes that would otherwise hit just one
continent.*
Similarly, we could silently prefix the specified primary key (i.e., the natural
primary key) with the partition columns, but this too would violate user
expectations. Most notably, queries that specify all columns in an index, as
specified at creation or returned by reflection like information_schema, are
expected to be point lookups.
Instead, we can aid the user with detailed guidance in the error message generated by invalid partitioning schemes.
*In general, ensuring global uniqueness requires cross-datacenter hops on
writes. In limited cases, like SERIAL columns, users can achieve both
uniqueness and fast writes without a secondary unique index by assuming
unique_rowid() collisions are sufficiently improbable. The risk with such a
scheme, of course, is that someone can manually insert a colliding value.
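As a sketch of the SERIAL case (an illustrative variant of the roachmart.users schema):

```sql
-- id defaults to unique_rowid(), whose generated values are assumed not to
-- collide, so no unpartitioned unique index is needed and writes stay
-- within the local partition. A manually INSERTed colliding id would go
-- undetected.
CREATE TABLE roachmart.users (
  country STRING,
  id SERIAL,  -- shorthand for INT DEFAULT unique_rowid()
  email STRING,
  PRIMARY KEY (country, id)
) PARTITION BY LIST (country) (
  PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
  PARTITION oceania VALUES IN ('AU', 'FJ', 'NZ')
);
```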
The unit of replication in CockroachDB is the range, so partitioning necessarily requires splitting at least one range for each partition. In the worst case, when partitioning by list, partitions containing non-adjacent values will generate an extra range for each non-adjacent value.
For example, consider the original partitioning specification from
roachmart.users, in which 'CA' and 'MX' belong to the same partition
north_america, but are bisected by a value, 'FJ', in another partition
oceania. This forces each country onto its own range. As more countries are
added, the effect is amplified.
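Sketching this with the partitioning used as a running example in this RFC (the country lists are illustrative):

```sql
-- 'FJ' sorts between 'CA' and 'MX', so north_america cannot be one
-- contiguous span of the keyspace.
PARTITION BY LIST (country) (
  PARTITION north_america VALUES IN ('CA', 'MX', 'US'),
  PARTITION oceania VALUES IN ('AU', 'FJ', 'NZ')
)
-- Per-value spans, in lexicographic key order:
--   'AU' oceania | 'CA' north_america | 'FJ' oceania |
--   'MX' north_america | 'NZ' oceania | 'US' north_america
-- Adjacent values in the same partition could share a range; here the
-- partitions alternate at every value, so each value needs its own range.
```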
This is unfortunate but should be fine in larger tables. It can be worked around
by introducing and partitioning on a partition_id column, which is derived
from country. (Computed columns are a natural choice for this.) The tradeoff
here is that repartitioning will necessitate rewriting rows instead of just
updating range metadata.
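A sketch of this workaround, assuming hypothetical computed-column syntax (computed columns are not yet implemented):

```sql
-- partition_id is derived from country, so each partition occupies a
-- single contiguous span of the keyspace regardless of how its country
-- values sort. Changing the country-to-partition mapping requires
-- rewriting rows rather than just updating range metadata.
CREATE TABLE roachmart.users (
  country STRING,
  partition_id INT AS (
    CASE WHEN country IN ('CA', 'MX', 'US') THEN 1 ELSE 2 END
  ) STORED,
  id INT,
  PRIMARY KEY (partition_id, id)
) PARTITION BY LIST (partition_id) (
  PARTITION north_america VALUES IN (1),
  PARTITION oceania VALUES IN (2)
);
```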
Note that repartitioning could make this worse, especially until we support range merges. As discussed in Range splits and schema changes, list partitioning may create more ranges than expected when partition values are not adjacent. Additionally, repartitioning may result in small or empty ranges that cannot be cleaned up until we support range merges.
Exposing partitioning information and zone configs in the admin UI is out of scope for this document. See #14113, which tracks providing broad insight into zone constraint violations.
Other PARTITION BY implementations can be used to quickly bulk load data into a table (or bulk remove it) and this appears to be a popular use. We currently allow only bulk load of an entire table at once, so this may be useful for us to consider as well, but it’s out of scope for this document.
The separation between partition specification, which happens via SQL, and zone configuration, which happens via the CLI, is unfortunate and largely historical happenstance. A table’s partitioning is rarely updated independently of its zone config, so if and when we move zone configuration to SQL, we should consider tightly coupling its interface to table definitions/partitioning.
There is unfortunately no SQL standard for partitioning. As a result, separate partitioning syntaxes have emerged:
MySQL and Oracle specify partitions inline in CREATE TABLE, like we do
above.
Microsoft SQL Server requires four steps: allocating physical storage called
“filegroups” for each partition with ALTER DATABASE… ADD FILE, followed by
CREATE PARTITION FUNCTION to define the partition split points, followed by
CREATE PARTITION SCHEME to tie the partition function output to the created
file groups, followed by CREATE TABLE... ON partition_scheme to tie the
table to the partitioning scheme.
PostgreSQL 10 takes a hybrid approach: the partition columns and scheme (i.e.,
RANGE or LIST) are specified in the CREATE TABLE statement, but the
partition split points are specified by running CREATE TABLE… PARTITION OF… FOR VALUES once for each partition.
We normally reuse PostgreSQL syntax for compatibility reasons, but we’ve deliberately rejected it here. Partitioning was not a first class feature before PostgreSQL 10, which was only released on 2017-10-05, so we shouldn’t have compatibility issues with existing ORMs and applications. The syntax being introduced in PostgreSQL 10, shown below, treats a partitioned table roughly as a collection of tables, which each have their own indexes, constraints, etc. It does not allow for a global index across all partitions of a table and the global table namespace is polluted with each partition.
We’ve instead chosen to closely follow the MySQL and Oracle syntax, since it fits with our model of partitions as mostly-invisible subdivisions of a table.
For reference, we’ve replicated the roachblog.articles example in each of the
three syntaxes.
```sql
-- MySQL / Oracle
CREATE TABLE articles (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  published DATE,
  ...
) PARTITION BY RANGE (published) (
  PARTITION archived VALUES LESS THAN ('2017-01-25'),
  PARTITION recent VALUES LESS THAN MAXVALUE
);
```
```sql
-- PostgreSQL 10
CREATE TABLE articles (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  published DATE,
  ...
) PARTITION BY RANGE (published);

CREATE TABLE articles_archived PARTITION OF articles
  FOR VALUES FROM (MINVALUE) TO ('2017-01-25');
CREATE TABLE articles_recent PARTITION OF articles
  FOR VALUES FROM ('2017-01-25') TO (MAXVALUE);
```
```sql
-- Microsoft SQL Server
CREATE PARTITION FUNCTION fun (int) AS RANGE LEFT FOR VALUES ('2017-01-25');
CREATE PARTITION SCHEME sch AS PARTITION fun TO (filegroups...);
CREATE TABLE articles (id int PRIMARY KEY, published DATE) ON sch (published);
```