docs/RFCS/20151214_sql_column_families.md
Store multiple columns for a row in a single value to amortize the overhead of keys and the fixed costs associated with values. The mapping of a column to a column family will be determined when a column is added to the table.
The SQL to KV mapping currently utilizes a key/value pair per non-primary key column. The key is structured as:
/<tableID>/<indexID>/<primaryKeyColumns...>/<columnID>
Keys also have an additional suffix imposed by the MVCC implementation: ~10 bytes of timestamp.
The value is a proto (note, this is what is stored on disk, we also include a timestamp when transmitting over the network):
message Value {
optional bytes raw_bytes;
}
The raw_bytes field currently holds a 4 byte CRC followed by the bytes
for a single "value" encoding (as opposed to the util/encoding "key"
encodings) of the <columnVal>. The value encoding is a tag byte
followed by a payload.
Consider the following table:
CREATE TABLE kv (
k INT PRIMARY KEY,
v INT
)
A row for this table logically has 16 bytes of data (INTs are 64-bit
values). The current SQL to KV mapping will create 2 key/value pairs:
a sentinel key for the row and the key/value for column v. These two
keys imply an overhead of ~28 bytes (timestamp + checksum) when stored
on disk and more than that when transmitted over the network (disk
storage takes advantage of prefix compression of the keys). We can cut
this overhead in half by storing only a single key/value for the
row. The savings are even more significant as more columns are added
to the table.
Note that there is also per key overhead at the transaction level. Every key written within a transaction has a "write intent" associated with it and these intents need to be resolved when the transaction is committed. Reducing the number of keys per row reduces this overhead.
The ColumnDescriptor proto will be extended with a FamilyID field.
FamilyIDs will be allocated per-table using a new
TableDescriptor.NextFamilyID field. FamilyID(0) will be always be
written for a row and will take the place of the sentinel key. A
FamilyDescriptor proto will be added along with a repeated field for
them in TableDescriptor. It will contain the id, a user provided or
autogenerated name, and future metadata.
The structure for table keys will be changed to:
/<tableID>/<indexID>/<primaryKeyColumns...>/<familyID>
The value proto will remain almost the same. The same 4 byte CRC will
be followed by a series of <columnID>/<columnVal> pairs where
<columnID> is varint encoded and <columnVal> is encoded using
(almost) the same value encoding as before. Unfortunately, the
bytes, string, and decimal encodings do not self-delimit their length,
so additional ones will be added with a length prefix. These new
encodings will also be used by distributed SQL, which also needs self-
delimiting value encodings. Similar to the existing convention, NULL
column values will not be present in the value.
For backwards compatibility, families with only a single column will use
exactly the old encoding. Columns without a family will use the
<columnID> as the <familyID>. Finally, the /<familyID> key suffix
will be omitted when encoding FamilyID(0) to make the encoding
identical to the previous sentinel key encoding.
When a column is added to a table it will be assigned to a family. This assignment will be done with a set of heuristics, but will be override-able by the user at column creation using a SQL syntax extension.
CREATE TABLE t (
a INT PRIMARY KEY,
b INT,
c STRING,
FAMILY primary (a, b),
FAMILY (c)
)
ALTER TABLE t ADD COLUMN d DECIMAL FAMILY d
Fixed sized columns (INT, DECIMAL with precision, FLOAT, BOOL, DATE,
TIMESTAMP, INTERVAL, STRING with length, BYTES with length) will be
packed into a family, up to a threshold (initially 256 bytes).
STRING and BYTES columns without a length restriction (and DECIMAL
without precision) get their own family.
If the declared length of a STRING or BYTES column is changed with an
ALTER COLUMN, family assignments are unaffected.
Columns in the primary index are declared as family 0, but they're stored in the key, so they don't count against the byte limit.
Family 0 (the sentinel) always has at least one column (though it might be a primary index column stored in the key). Non-nullable and fixed sized columns are preferred.
Note that these heuristics only apply at column creation, so there's no backwards compatibility issues in revisiting them at any time.
We have to support the old format for backwards compatibility. Luckily, we can re-frame most of the complexity as a one column family optimization instead of pure technical debt.
UPDATE will now have to fetch the previous values of every column
in a family being modified. However, we already have to scan before
every UPDATE so, at worst, this is just returning more fields from
the scan than before.
We could introduce a richer KV layer api that pushes knowledge of columns from the SQL layer down to the KV layer. This is not as radical as it sounds as it is essentially the Bigtable/HBase API. Specifically, we could enhance KV to know about rows and columns where columns are identified by integers but not restricted to a predefined schemas. This is actually a somewhat more limited version of the Bigtable API which allows columns to be arbitrary strings and also has the concept of column families. The upside to this approach is that the encoding of the data at the MVCC layer would be unchanged. We'd still have a key/value per column. But we'd get something akin to prefix compression at the network level where setting or retrieving multiple columns for a single row would only send the row key once. Additionally, we could likely get away with a single intent per row as opposed to an intent per column in the existing system. The downside to this approach is that it appears to be much more invasive than the column family change. Would every consumer of the KV api need to change?
Another alternative would be to omit the sentinel key when there is no non-NULL/non-primary-key column. For example, we could omit the sentinel key in the following table because we know there will always be one KV pair:
CREATE TABLE kv (
k INT PRIMARY KEY,
v INT NOT NULL
)
FamilyDescriptors could be backfilled for tables missing them on
cluster upgrade. Is there a reasonable mechanism for this? Or should
it be hidden as a TableDescriptor transformation in the
getAliasedTableLease call?Allow changing of a column's family using ALTER COLUMN and schema changes.
Add metadata to column families. @bdarnell mentions this would let us map some families to an alternative storage model (perhaps using rocksdb column families) and that this was very useful with bigtable column families/locality families. Once the schema changes are supported, it will also be needed to keep track of which column, if any, was used with the single column optimization.
Note: This is from @petermattis's original doc, so it might be out of date. The conclusion is still valid though.
For the above kv table, we can approximate the benefit of this
change for a benchmark by not writing the sentinel key for each
row. The numbers below show the delta for that change using the kv
table structure described above (instead of the 1 column table
currently used in the {Insert,Scan} benchmarks).
name old time/op new time/op delta
Insert1_Cockroach-8 983µs ± 1% 948µs ± 0% -3.53% (p=0.000 n=9+9)
Insert10_Cockroach-8 1.72ms ± 1% 1.34ms ± 0% -22.05% (p=0.000 n=10+9)
Insert100_Cockroach-8 8.52ms ± 1% 4.99ms ± 1% -41.42% (p=0.000 n=10+10)
Scan1_Cockroach-8 348µs ± 1% 345µs ± 1% -1.07% (p=0.002 n=10+10)
Scan10_Cockroach-8 464µs ± 1% 419µs ± 1% -9.68% (p=0.000 n=10+10)
Scan100_Cockroach-8 1.33ms ± 1% 0.95ms ± 1% -28.61% (p=0.000 n=10+10)
While the benefit is fairly small for single row insertions, this is only benchmarking the simplest of tables. We'd expect a bigger benefit for tables with more columns.