protocol_rfcs/rejected/managed-commits.md
Associated Github issue for discussions: https://github.com/delta-io/delta/issues/2598
This RFC proposes a new table feature managedCommit which changes the way Delta Lake performs commits.
Today’s Delta commit protocol relies on the filesystem to provide commit atomicity. This feature request is to allow Delta tables which gets commit atomicity using an external commit-owner and not the filesystem (s3, abfs etc). This allows us to deal with various limitations of Delta:
Change to existing section
Delta files are stored as JSON in a directory at the root of the table named _delta_log, and together with checkpoints make up the log of all changes that have occurred to a table.
Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
<ins>They are the unit of atomicity for a table.</ins>
<ins>Note: If managed commits table feature is enabled on the table, recently committed delta files may reside in the _delta_log/_commits directory. Delta clients have to contact
the corresponding commit-owner of the table in order to find the information about the un-backfilled commits.</ins>
<ins>The delta files in _delta_log directory are named using the next available version number, zero-padded to 20 digits.</ins>
For example:
./_delta_log/00000000000000000000.json
<ins>The delta files in the _delta_log/_commits directory have a UUID embedded into them and follow the pattern <version>.<uuid>.json, where the version corresponds to the next attempt version zero-padded to 20 digits.</ins>
For example:
./_delta_log/_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
./_delta_log/_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
./_delta_log/_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
./_delta_log/_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
The _delta_log/_commits directory may contain uncommitted delta files. The commit-owner is the source of truth about which of those delta
files map to committed versions. Refer to managed commits for more details.
Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
A delta file, n.json, contains an atomic set of actions that should be applied to the previous table state, n-1.json, in order to the construct nth snapshot of the table.
An action changes one aspect of the table's state, for example, adding or removing a file.
<ins>Delta files use newline-delimited JSON format, where every action is stored as a single line JSON document.
A delta file, corresponding to version n, contains an atomic set of actions that should be applied to the previous table state, corresponding to n-1, in order to construct the nth snapshot of the table.
An action changes one aspect of the table's state, for example, adding or removing a file.</ins>
Change to existing section
cutOffCommit. A checkpoint at the cutOffCommit is ideal, but an older one will do. Lets call it cutOffCheckpoint.
We need to preserve the cutOffCheckpoint and all commits after it, because we need them to enable
time travel for commits between cutOffCheckpoint and the next available checkpoint.
cutOffCheckpoint can be found, do not proceed with metadata cleanup as there is nothing to cleanup.</ins>cutOffCheckpoint checkpoint. Also delete all the log compaction files having
startVersion <= cutOffCheckpoint's version.
cutOffCheckpoint's version from the _delta_log/_commits directory.</ins>Change to existing section
Checkpoints are also stored in the _delta_log directory, and can be created at any time, for any committed version of the table.
For performance reasons, readers should prefer to use the newest complete checkpoint possible.
<ins>Note: If managed commits table feature is enabled on the table, a checkpoint can be created only for commit versions which are backfilled. Refer to maintenance operations on managed-commit tables section for more details</ins>
Change to existing section
<ins>Note: If managed commits table feature is enabled on the table, a log compaction file for commit
range [x, y] i.e. x.y.compacted.json can be created only when commit y is already backfilled i.e. _delta_log/<y>.json must exist.
Refer to maintenance operations on managed-commit tables section
for more details.</ins>
The next set of sections will be added to the existing spec just before Iceberg Compatibility V1 section
With this feature enabled:
The following is a high-level overview of how commits work in a table with managed-commits enabled:
Essentially the managed-commits table feature defines the overall commit protocol (e.g. atomicity requirements, backfills, etc), and the commit-owner is responsible to implement that protocol.
A commit-owner is an external entity which manages the commits on a delta table. It could be backed by a database, a file system, or any other persistence mechanism. Each commit-owner has its own spec around how Delta clients should contact them, and how they perform a commit.
A commit file is a delta file that contains the actions which are committed / need to be committed.
There are two types of commit files:
Un-backfilled commit files: These reside in the _delta_log/_commits directory.
<version>.<uuid>.json. Here the uuid is a random UUID that is generated for each commit and version is the version v which is being committed, zero-padded to 20 digits._delta_log directory.Backfilled commit files: These reside in the _delta_log directory.
<version>.json. Here the version is the version v which is being committed, zero-padded to 20 digits.<version>.json file proves that the corresponding version v is committed, even for managed-commit tables. Filesystem based Delta clients can use filesystem listing operations to directly discover such commits.Without managed-commits, a delta client must always write commit files directly to the _delta_log directory, relying on filesystem atomicity
to prevent lost writes when multiple writers attempt to commit the same version at the same time.
With managed-commits, the delta client asks the commit-owner to commit the version v and the commit-owner
decides which type of commit file to write, based on the managed commit protocol.
When managed commits are enabled, a commit-owner performs commits to the table on behalf of the Delta client. A commit-owner always has a client-side component (which the Delta client interacts with directly). It may also
involve a server-side component (which the client-side component would be responsible to communicate with). The Delta client is responsible to define the client-side API that commit-owners should target, and commit-owners
are responsible to define the commit atomicity and backfill protocols which the commit-owner client should implement.
At a high level, the commit-owner needs to provide:
x with given set of actions. This is explained in detail in the commit protocol section.When a commit-owner receives a request to commit version v, it must first verify that the previous version v-1 already exists, and that version v does not yet exist. It then has following choices to publish the commit:
_delta_log/_commits directory, and atomically record that the new file now corresponds to version v._delta_log directory. Note that the commit will be considered to have succeeded as soon as the file becomes visible to
other clients in the filesystem, regardless of when or whether the originating client receives a response.
The commit-owner must track the un-backfilled commits until they are backfilled.
Even after a commit succeeds, Delta clients can only discover the commit through filesystem operations if the commit is backfilled. If the commit is not backfilled, then delta implementations
have no way to determine which file in _delta_log/_commits directory corresponds to the actual commit v.
The commit-owner is responsible to implement an API (defined by the Delta client) that Delta clients can use to retrieve information about un-backfilled commits maintained by the commit-owner. The API must also return the latest version of the table ratified by the commit-owner (if any). Providing the latest ratified table version helps address potential race conditions between listing commits and contacting the commit-owner. For example, if a client performs a listing before a recently ratified commit is backfilled, and then contacts the commit-owner after the backfill completes, the commit-owner may return an empty list of un-backfilled commits. Without knowing the latest ratified version, the client might incorrectly assume their listing was complete and read a stale snapshot.
Delta clients who are unaware of the commit-owner (or unwilling to talk to it), may not see recent un-backfilled commits and thus may encounter stale reads.
The following is an example of a possible commit-owner API which some Java-based Delta client might require commit-owner implementations to target:
interface CommitStore {
/**
* Commits the given set of `actions` to the given commit `version`.
*
* @param version The version we want to commit.
* @param actions Actions that need to be committed.
*
* @return CommitResponse which has details around the new committed delta file.
*/
def commit(
version: Long,
actions: Iterator[String]): CommitResponse
/**
* API to get the un-backfilled commits for the table represented by the given `tablePath` where
* `startVersion` <= version <= endVersion.
* If endVersion is -1, then it means that we want to get all the commits starting from `startVersion`
* till the latest version tracked by commit-owner.
* The returned commits are contiguous and in ascending version order.
* Note that the first version returned by this API may not be equal to the `startVersion`. This
* happens when few versions starting from `startVersion` are already backfilled and so
* CommitStore may have stopped tracking them.
* The returned latestTableVersion is the maximum commit version ratified by the Commit-Owner.
* Note that returning latestTableVersion as -1 is acceptable only if the commit-owner never
* ratified any version i.e. it never accepted any un-backfilled commit.
*
* @return GetCommitsResponse which contains a list of `Commit`s and the latestTableVersion
* tracked by the commit-owner.
*/
def getCommits(
startVersion: Long,
endVersion: Long): GetCommitsResponse
/**
* API to ask the commit-owner to backfill all commits <= given `version`.
*/
def backfillToVersion(version: Long): Unit
}
Backfilling is the process of copying the un-backfilled commits i.e. _delta_log/_commits/<version>.<uuid>.json to _delta_log/<version>.json.
With the help of backfilling, the delta files are visible even to the filesystem based Delta clients that do not
understand managed-commits. Backfill also allows the commit-owner to reduce the number of commits it must track internally.
Backfill must be sequential. In other words, a commit-owner must ensure that backfill of commit v-1 is complete before initiating backfill of commit v.
commit-owners are encouraged to backfill the commits frequently. This has several advantages:
commit-owner is unavailable or loses state.The commit-owner also needs to expose an API to backfill the commits. This will allow clients to ask the commit-owner to backfill the commits if needed in order to do some maintenance operations.
Since commit backfills may happen at a later point in time, so the file modification timestamp of the backfilled file might be very different than the time of actual commit. For this reason, the managed-commit feature depends on another writer feature called in-commit-timestamps to make the commit timestamps more reliable. Refer to Writer Requirements for Managed Commits section for more details.
In order for a commit-owner to successfully take over an existing filesystem-based Delta table, the following invariants must hold:
Assuming the client follows the commit-owner’s protocol for ownership changes, the commit-owner MUST NOT refuse ownership after the backfill succeeds. Otherwise, the table would become permanently unusable, because the advertised commit-owner refuses to ratify the very commits that would repair the table by removing that commit-owner.
Thus, the commit-owner and client effectively perform a two-phase commit, where the commit-owner persists its commitment to own the table, and the actual commit point is the PUT-if-absent. Notifying the commit-owner that backfill has completed becomes a post-commit cleanup operation. If the put-if-absent fails (because somebody else gets there first), the commit-owner forgets about the proposed ownership change.
Once the backfill succeeds, clients will start contacting the commit-owner for any further commits. Meanwhile, any clients who were already attempting filesystem-based commits will encounter a physical conflict, see the protocol change, and either abort the commit or route it to the new owner.
Conceptually, creating a new managed-commit table is very similar to proposing an ownership change of an existing filesystem-based table that happens to not yet contain any commits. This means that, until commit 0 has been backfilled, there is a risk of multiple clients racing to create the same table with different commit-owners (or to create a filesystem-based table).
To avoid such races, Commit-owners are encouraged to use a put-if-absent API (if available) to write the backfilled commit directly (i.e. _delta_log/00000000000000000000.json).
If such put-if-absent is not available, then it is the responsibility of commit-owners to take whatever measures they deem appropriate to avoid or respond to such races.
In order to convert a managed-commit table to a filesystem-based table, the Delta client needs to initiate a commit which tries to remove the commit-owner information from change-metadata and also removes the table feature from the protocol action. The commit-owner is not required to give up ownership, and may reject the request. If it chooses to honor such a request, it must:
With managed-commits enabled, a table could have some part of table already backfilled and some part of the table yet-to-be-backfilled.
The precise information about what are the valid un-backfilled commits is maintained by the commit-owner.
E.g.
_delta_log/00000000000000000000.json
_delta_log/00000000000000000001.json
_delta_log/00000000000000000002.json
_delta_log/00000000000000000002.checkpoint.parquet
_delta_log/00000000000000000003.json
_delta_log/00000000000000000003.00000000000000000005.compacted.json
_delta_log/00000000000000000004.json
_delta_log/00000000000000000005.json
_delta_log/00000000000000000006.json
_delta_log/00000000000000000007.json
_delta_log/_commits/00000000000000000006.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
_delta_log/_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json
_delta_log/_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
_delta_log/_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json
_delta_log/_commits/00000000000000000009.41bf693a-f5b9-4478-9434-af7475d5a9f0.json
_delta_log/_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json
_delta_log/_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json
Suppose the commit-owner is tracking:
{
6 -> "00000000000000000006.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json",
7 -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
8 -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
9 -> "00000000000000000009.41bf693a-f5b9-4478-9434-af7475d5a9f0.json"
}
Delta clients have two choices to read such tables:
_delta_log directory and reading the delta/checkpoint/log-compaction files.
Without contacting the commit owner, they cannot access recent un-backfilled commits in the _delta_log/_commits directory, and may construct a stale snapshot.
_delta_log directory and get information about un-backfilled commits (v8, v9) from the commit-owner.Checkpoints and log compaction files can only be created for commits in the _delta_log directory. In other words, in order to
checkpoint version v or produce a compacted log file for commit range x <= v <= y, _delta_log/<v>.json must exist. Otherwise, filesystem-based readers who encountered
the seemingly-extra files might think the table metadata was corrupted.
The managed-commit feature is supported and active when:
protocol action with writerFeatures containing the feature managedCommit.delta.managedCommit.commitOwner in the change-metadata's configuration.delta.managedCommit.commitOwnerConf in the change-metadata's configuration. The value of this property is a json-coded string-to-string map.
Note that a table is in invalid state if the change-metadata contains the delta.managedCommit.commitOwner property but the table does not have the managedCommit feature in the protocol action (or vice versa).
E.g.
{
"metaData":{
"id":"af23c9d7-fff1-4a5a-a2c8-55c59bd782aa",
"format":{"provider":"parquet","options":{}},
"schemaString":"...",
"partitionColumns":[],
"configuration":{
"appendOnly": "true",
"delta.managedCommit.commitOwner": "commit-owner-1",
"delta.managedCommit.commitOwnerConf":
"{\"endpoint\":\"http://sample-url.com/commit\", \"authenticationMode\":\"oauth2\"}"
}
}
}
When supported and active:
inCommitTimestamp table feature must also be supported and active._delta_log directory.Managed commits is a writer feature. So it doesn't put any restrictions on the reader.
Filesystem-based delta readers which do not understand managed commits may only be able to read the backfilled commits. They may see a stale snapshot of the table if the recent commits are not backfilled.
The managed commits aware delta readers could additionally contact the commit-owner to get the information about the recent un-backfilled commits. This allows them to get the most recent snapshot of the table.