docs/RFCS/20170601_pause_resume_cancel.md
For the 1.1 release, we're adding support for canceling SQL queries and for pausing, resuming, and canceling long-running jobs (i.e., backups, restores, and schema changes).
We draw a distinction between queries and jobs, both in internal implementations and in user-facing interfaces, because queries and jobs have vastly different expected durations. Queries are expected to run for milliseconds or seconds, or minutes in the worst case. Jobs are expected to run for minutes or hours, or days in the worst case.
This difference is more significant than in might seem on a first pass. Queries
are short-lived enough that they benefit from an RPC-based cancel mechanism.
Jobs can spare the latency of waiting for gossip or polling a system table.
CockroachDB already takes advantage of this difference: the list of running
queries exists only in memory on each node, while the list of running jobs is
stored and replicated in a proper SQL table, system.jobs.
Still, we want the syntax to cancel a query to be similar to the syntax to cancel a job because, conceptually, the operations are identical.
We propose the following syntax:
PAUSE JOB <job-id>;
RESUME JOB <job-id>;
CANCEL [QUERY|JOB] <job-or-query-id>;
This syntax is symmetric with the soon-to-be-merged syntax for listing running
queries, SHOW QUERIES, and a soon-to-be-proposed syntax for listing running
jobs, SHOW JOBS.
These statements will all complete as soon as the pause, resume, or cancel
request has been submitted. For example, when the CANCEL statement completes,
the operation in question may still be in progress. Users can detect when the
operation has fully canceled by polling SHOW QUERIES or SHOW JOBS.
This is a lot of new SQL syntax with no direct basis in another SQL dialect.
MySQL and MSSQL use a similar syntax, but with KILL in place of CANCEL:
KILL <id>;
MySQL additionally supports modifiers (KILL CONNECTION <connection-id> and
KILL QUERY <query-id>), much like we do with QUERY and JOB.
We prefer CANCEL to KILL, as "kill" implies the query or job will be
terminated immediately. This will not be true of our cancel implementations. In
particular, large schema changes may take several hours to cancel as the cluster
rolls back all backfill progress.
Postgres (ab)uses a built-in function:
SELECT pg_cancel_backend(<query-id>);
SELECT pg_cancel_backend(generate_series) FROM generate_series(1, 100000); -- Yes, this is valid.
Postgres's approach avoids introducing new syntax... by introducing the world's largest side effect. Pausing, resuming, and canceling are novel operations that seem like they should have new syntax.
Alas, we may eventually end up supporting this syntax for compatibility's sake.
Oracle uses a hybrid approach and nests new syntax under an ALTER SYSTEM
command:
ALTER SYSTEM KILL SESSION <session-id>;
We have no precedent for ALTER SYSTEM, and it doesn't seem worth introducing
ALTER SYSTEM just to avoid three new top-level constructions (PAUSE...,
RESUME..., CANCEL...). We already have precedent for new top-level syntax
for new operations, like BACKUP... and RESTORE....
Peter Mattis suggested a slight variation on
the Oracle syntax, ALTER JOB.../ALTER QUERY..., but the RFC authors feel the
proposed top-level PAUSE/RESUME/CANCEL verbs read more naturally.
None. How these operations will actually be implemented is out of scope.