doc/developer/design/20241015_add_columns_to_tables.md
We want to support adding columns to relations in Materialize, both tables and sources. Concretely
for tables this means supporting Postgres’ syntax of ALTER TABLE ... ADD COLUMN ..., and for
sources supporting something like ALTER SOURCE ... REFRESH SCHEMA ... that will read the schema
from the upstream source and update the relations in Materialize accordingly.
When a column is added to a relation, it should not affect objects that depend on said relation. For example:
CREATE TABLE t1 (a int);
INSERT INTO t1 VALUES (1), (2), (3);
CREATE VIEW v1 AS SELECT * FROM t1;
ALTER TABLE t1 ADD COLUMN b text;
-- view 'v1' does not have column 'b' since it was added after 'v1' was created.
SELECT * FROM v1;
a
---
1
2
3
The specific problem we’re aiming to address in this design doc is how can we support evolving the
RelationDesc of an object, while upholding existing invariants around the
GlobalId -> RelationDesc mapping.
We have aligned on a design that allows us to evolve the RelationDesc (schema) of an object in
the Adapter, Compute, and Storage layers of Materialize. This design should either conform to
the existing Formalism,
or specifically describe how and why we will update the Formalism to support necessary changes.
ALTER SOURCE ... REFRESH SCHEMA ....
For all intents and purposes we are only concerned with adding columns to tables.GlobalIdWithin Materialize a GlobalId generally identifies a single object and is used as the primary key
in the Catalog as well as numerous internal data structures. GlobalIds are also exposed to users
via catalog tables, e.g. mz_tables,
where it is expected that they provide a stable mapping from ID to object name.
Additionally the Formalism defines GlobalIds as:
A
GlobalIdis a globally unique identifier used in Materialize. One of the things Materialize can identify with aGlobalIdis a TVC. EveryGlobalIdcorresponds to at most one TVC. This invariant holds over all wall-clock time:GlobalIds are never re-bound to different TVCs.
By changing the RelationDesc for an object you are arguably rebinding the GlobalId to a new
TVC. A number of places all across our code base rely on this mapping of GlobalId → RelationDesc
being stable, so we can’t modify the RelationDesc for a given GlobalId. But we also need to
provide a stable external mapping of object ID to object name, so we can’t modify the GlobalId
for a given object.
Within the Catalog we persist objects with their create_sql string. To track when a column was
added to a table, and what version of a table a dependent object relies on, we plan to introduce a
VERSION keyword. For example our internal create_sql persistence will look like:
CREATE TABLE t1 (a int, b text VERSION ADDED 1);
-- view 'v1' references 't1' when it had only column 'a'
CREATE VIEW v1 AS SELECT * FROM [u1 as "materialize"."public"."t1" VERSION 0];
This would allow us to track the versions of a table that exist, and what version dependent objects were initially planned against.
Introduce a new CatalogItemId that will be a stable 1:1 mapping of object name to object ID and
keep the structure of GlobalId exactly how it exists currently. When adding a column to a table
we will allocate a new GlobalId that will be a unique reference to a (CatalogItemId, VERSION).
In other words, a (CatalogItemId, VERSION) will uniquely identify a single TVC.
This new type will have two variants which are a subset of the variants of a GlobalId:
enum CatalogItemId {
// System namespace.
System(u64),
// User namespace.
User(u64),
}
This allows us to introduce the following relationships between our various types:
CatalogItemId can reference many GlobalIdsGlobalId will reference 1 (CatalogItemId, VERSION)GlobalId will reference at most 1 (ShardId, SchemaId) (Persist)CatalogItemId will reference at most 1 ShardId (Persist)Using our example from before we’ll have the following:
"materialize"."public"."t1"CatalogItemId: u1GlobalIds:
u1 → (CatalogItemId(u1), RelationDesc('a' int))u2 → (CatalogItemId(u1), RelationDesc('a' int, 'b' text))ShardId: sXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXXWhile not necessary and possibly out of scope of this design, with this new setup I begin to
imagine a GlobalId as uniquely referencing a collection; in other words, GlobalId could be
renamed to CollectionId.
Despite the text representation of both a
CatalogItemIdandGlobalIdbeingu1, they refer to different things. This discrepancy already exists in our code base, e.g.RoleIdandClusterIdboth have this same text representation but refer to different things.
GlobalIds are used all over the codebase: at the time of writing there are >2,000 matches for
“GlobalId” in Rust files. I will need to begin prototyping before I can speak to specifics of
exactly where CatalogItemIds will replace GlobalIds, but at a high level:
All current references to GlobalId in the Catalog will get replaced with CatalogItemId. The
text representation for IDs that is persisted in create_sql will get parsed as CatalogItemIds.
In planning (or possibly name resolution) is where we will convert from the CatalogItemId(u1)
and VERSION syntax in create_sql to GlobalIds.
In the durable Catalog we will reuse the existing ID allocator that currently mints GlobalIds to
mint CatalogItemIds. We also will create a new ID allocator specifically for GlobalIds that will be
initialized to the same value as the original allocator.
We need to start CatalogItemIds at the current value of the GlobalId allocator so all existing
items can continue to be identified by the same text representation of their current ID, and thus
to prevent ID re-use. For example, if a user has a table named "orders" with GlobalId::User(42),
we'll migrate that to CatalogItemId::User(42) so externally that table continues to have the ID
of 'u42'. This migration is only possible if we start CatalogItemIds at the current value of
the GlobalId allocator, so all existing IDs are considered "allocated". Additionally, resuming
GlobalId allocation from the current value prevents accidental GlobalId re-use if they are
persisted outside the Catalog. Additionally we will extend the existing
ItemValue
protobuf type to include a map of VERSION -> GlobalId. Externally to map between CatalogItemIds
and GlobalIds we’ll introduce a new Catalog table, mz_internal.mz_collection_ids.
To me this is the largest unknown. The Storage Controller operates with GlobalIds which currently
have a 1:1 mapping with Persist’s ShardIds. This design calls for many GlobalIds to be able to
reference a single ShardId which breaks the existing relationship.
The Storage Controller will need to continue to use GlobalIds for operations like rendering a
source, but it will also need to have some careful management of Persist Handles, e.g. if there are
two open WriteHandles to the same Persist Shard, writes to one of them would implicitly advance
the frontier of the other. Or, dropping a GlobalId will need to prevent finalizing the underlying
Persist Shard, if there are other GlobalIds that still reference said shard.
Our Compute layer will operate entirely on GlobalIds and require only minor refactors the Catalog
APIs our Compute layer uses. Additionally, we'll should eventually add some Notices around Index
selection for tables. If a user creates an Index on a Table, then later adds a column to said
Table, that Index will no longer get used when querying the table because it is built on a previous
version.
So far I have prototyped two alternate approaches, and am currently working on implementing the
approach that introduces a new CatalogItemId type.
RelationDesc associated with a Table and applying a projection on to
expose only the relevant columns.pr#29694, only look at last commit. Implements adding columns to tables by adding GlobalId
"aliases" to Tables so when a Table is altered we create a new GlobalId, and thus multiple
GlobalIds can be associated with a single table.If changing the shape of data in a TVC is not considered as creating a new TVC, then arguably
changing the RelationDesc of an object in Materialize would not be rebinding the GlobalId
of the object. This shrinks the theoretical scope of the problem to just constraining what columns
are used when planning objects. For example, when restarting Materialize we need to make sure when
re-planning objects, they’re planned against the same RelationDesc that was used when they were
originally created.
We can achieve this by threading through the correct RelationDesc in planning, and always
applying a projection on top of the operator that reads data. This technique has been prototyped in
pr#29694. (Note: the test failures in
this PR are related to explain plans, notably the new
alter-table.slt passes).
Practically an issue with this solution is that in a number of places within the codebase rely on
the GlobalId -> RelationDesc mapping to be stable. For example, an issue not solved in that PR is
how to handle Indexes that are created on Tables. While existing test cases pass there are plenty
more things that could break because of violating this invariant.
GlobalIdInstead of introducing a new CatalogItemId we could extend GlobalId to include version
information. For example:
// Current
enum GlobalId {
// ... snipped
User(u64),
}
// Alternate Approach
enum GlobalId {
// ... snipped
User(u64, u64),
}
Where the second u64 in GlobalId::User would contain this new version information.
Outside of tests, everything in our codebase handles GlobalIds opaquely, they don’t look at the
inner value. Just adding more to the GlobalId::User variant would be a relatively small change
compared to adding a new ID type, but it would require more logical changes in the Adapter and
Storage layers. The Adapter still needs to maintain a stable mapping from object ID to object name
and Storage probably still needs to de-duplicate between GlobalIds and Persist’s ShardIds, both
of which would require looking at the inner value of the otherwise opaque GlobalId.
GlobalId, allow multiple GlobalIds to refer to a single Table.A combination of the proposed approach and the above alternative, instead of creating a new
CatalogItemId type or modifying the existing GlobalId type, just allow multiple GlobalIds to
refer to a single object. This can be modeled as "aliases" to a single object.
This approach requires the fewest code changes, but introduces the most ambiguity into the code
base. There are existing code paths that expect a GlobalId to uniquely refer to an object, e.g.
in the Catalog when dropping an object or maintaining CriticalSinceHandles in the Storage
Controller. If we allow multiple GlobalIds to refer to a single object then the onus of making
sure we pass the right GlobalId, or don't pass multiple GlobalIds that refer to the same
object, is put on the programmer. Whereas introducing a new CatalogItemId type designs away these
invalid states.