Proposal: Support TTL Table

Author(s): lcwangchao
Tracking Issue: https://github.com/pingcap/tidb/issues/39262

Proposal: Support TTL Table

Introduction

The rows in a TTL table will be deleted automatically when they are expired. It is useful for some scenes, for example, delete the expired verification codes. A TTL table will have a column with the type DATE/DATETIME/TIMESTAMP which will be compared with the current time, if the interval between them exceeds some threshold, the corresponding row will be deleted.

Detailed Design

Syntax

Create TTL Table

The following example shows how to create a TTL table. The column create_at is used to specify the creation time of the rows which will be deleted 3 months after that.

sql

CREATE TABLE t1 (
    id int PRIMARY KEY,
    created_at TIMESTAMP
) TTL = `created_at` + INTERVAL 3 MONTH;

We can use another TTL_ENABLE option to disable/enable the TTL job for the table. For example:

sql

CREATE TABLE t1 (
    id int PRIMARY KEY,
    created_at TIMESTAMP
) TTL = `created_at` + INTERVAL 3 MONTH TTL_ENABLE = 'OFF';

The above table will not delete expired rows automatically because TTL_ENABLE is set to OFF. When the TTL_ENABLE is omitted, it uses value ON by default.

To make it compatible with mysql, TTL options also support the comment format. For example:

sql

CREATE TABLE t1 (
    id int PRIMARY KEY,
    created_at TIMESTAMP
) /*T![ttl] TTL = `created_at` + INTERVAL 3 MONTH TTL_ENABLE = 'OFF'*/;

Alter a Table with TTL

We can alter an exist table with TTL options, for example:

sql

ALTER TABLE t1 TTL = `created_at` + INTERVAL 3 MONTH;

TTL job is scheduled every 1 hour by default, if we want to custom the schedule interval, we can do it like this:

sql

ALTER TABLE t1 TTL_JOB_INTERVAL='1d';

This alters the TTL job's schedule interval of table t1 to 1 day.

We should allow to update the existed TTL options. When it is updated, the running background job for this table should stop or restart according to the newest settings.

Alter to a non-TTL Table

Similar with mysql syntax "ALTER TABLE t REMOVE PARTITIONING", if we want to remove a table's TTL options, we can just do:

sql

ALTER TABLE t1 REMOVE TTL;

Generated Columns

TTL table supports a generated column as the time column, but currently, there maybe a performance degradation if you use it. The main reason is that generated column does not support pushing down its expression to TiKV side and the scan phase will have more performance cost we'll discuss later. Another reason is that most date functions are not push down supported too. So, if we want to solve this problem, we must:

support pushing down generated column expression to TiKV side.
support pushing down some common date functions to TiKV side.

Constraints

TTL does NOT work on a table that is referenced by a foreign key. For example, you cannot add TTL to the parent table, because it is referenced by a foreign key in the child table and deleting a row from the parent table could violate this constraint.

TTL Job Management

We use a SQL-layer approach to delete expired rows. The "SQL-layer approach" means that the background jobs are using SQL protol to scan and delete rows. It is simple to implement and has a good compatibility with the tools such as BR and TiCDC.

In the current design, we'll schedule a job for each TTL table when needed. We will try to schedule the jobs from different tables to different TiDB nodes to reduce the performance affect. For one physical table, the job will be running in one TiDB node and the partition table will be recognized as several physical tables, so there can be multiple jobs running in different TiDB nodes at one time.

The TTL table status will be record in a new system table mysql.tidb_ttl_table_status with the definition:

sql

CREATE TABLE `tidb_ttl_table_status` (
  `table_id` bigint(64) PRIMARY KEY,
  `parent_table_id` bigint(64),
  `table_statistics` TEXT DEFAULT NULL,
  `last_job_id` varchar(64) DEFAULT NULL,
  `last_job_start_time` timestamp NULL DEFAULT NULL,
  `last_job_finish_time` timestamp NULL DEFAULT NULL,
  `last_job_ttl_expire` timestamp NULL DEFAULT NULL,
  `last_job_summary` text DEFAULT NULL,
  `current_job_id` varchar(64) DEFAULT NULL,
  `current_job_owner_id` varchar(64) DEFAULT NULL,
  `current_job_owner_addr` varchar(256) DEFAULT NULL,
  `current_job_owner_hb_time` timestamp,
  `current_job_start_time` timestamp NULL DEFAULT NULL,
  `current_job_ttl_expire` timestamp NULL DEFAULT NULL,
  `current_job_state` text DEFAULT NULL,
  `current_job_status` varchar(64) DEFAULT NULL,
  `current_job_status_update_time` timestamp NULL DEFAULT NULL
);

It stores some information for each TTL table. The fields prefix last_job_ present the information of the last job which is successfully executed, and the fields with prefix current_job_ present the current job which has not been finished yet.

The explanation of the fields:

table_id: The id of the TTL table. If the table is a partitioned table, it stands for the physical table id of each partition.
parent_table_id: If the current row is a for table partition, it is the table id of its parent table. Otherwise, it equals to table_id.
table_statistics: some statistics of the table.
last_job_id: The id of the last success job.
last_job_start_time: The start time of the last job.
last_job_finish_time: The finish time of the last job.
last_job_ttl_expire: The expired time used by the last job for TTL works.
last_job_summary: The summary info for the last job.
current_job_id: The id of the current job that is not finished. It not only includes the running job, but also includes the job that is failed or cancelled by user.
current_job_owner_id: The id of the owner (a TiDB node) that runs the job.
current_job_owner_addr: The network address of the owner that runs the job.
current_job_owner_hb_time: The owner of the job updates this field with the current timestamp periodically. If it is not updated for a long time, it means the previous owner is offline and this job should be taken over by other node later.
current_job_start_time: The start time of the current job.
current_job_ttl_expire: The expired time used by the current job for TTL works.
current_job_state: Some inner state for the current job. It can be used for the job's fail over.
current_job_status: A enum with one of values: waiting, running, cancelling, cancelled, error
current_job_status_update_time: The update time of the current status

TTL job for each table runs periodically according to the configuration of the system variable tidb_ttl_job_run_interval . For example, if we configure set @@global.tidb_ttl_job_run_interval='1h', the cluster will schedule TTL jobs for each table every one hour to delete expired rows.

If you want to cancel a running job, you can execute a statement like this:

sql

ADMIN CANCEL TTL JOB 123456789

In the above example, the status of the TTL job with ID 123456789 will first become cancelling and then finally be updated to cancelled.

TTL Job Details

TiDB schedules TTL jobs to delete expired rows. One job is related to a TTL table and runs in one TiDB node. One TiDB node servers multiple workers where tasks from the TTL jobs are running.

A running job contains two kinds of tasks: scan tasks and delete tasks. Scan tasks are used to filter out expired rows from the table and then send them to delete tasks which will do delete operations in batch. When all expired rows are deleted, the job will be finished. Let's talk about scan and delete tasks in detail.

Scan Tasks

When a job starts to run, it first splits the table to N (N >= 1) ranges according to their primary key. Each range will be assigned to a scan task and each scan task performs a range scan for the specified range. The pseudocode below shows how it works:

func doScanTask(tbl, range, expire, ch) {
    var lastRow
    for {
        selectSQL := buildSelect(tbl, range lastRow, expire, LIMIT)
        rows := execute(selectSQL)
        ch <- deleteTask{tbl, expire, rows}
        if len(rows) < LIMIT {
            break
        }
        lastRow := rows[len(rows)-1]
    }
}

As we see above, it builds some select queries in a loop. The first query is built like this:

sql

SELECT LOW_PRIORITY id FROM t1
WHERE create_time < '2022-01-01 00:00:00' AND id >= 12345 AND id < 45678
ORDER BY id ASC
LIMIT 500;

In above example, the expired time is '2022-01-01 00:00:00' which is computed when the job started and the key range for the current scan task is [12345, 45678). We also limit the max count of return rows as 500 which you can set the system variable tidb_ttl_scan_batch_size to change it.

For most cases, we cannot get all expired rows in one query. So if the return row count equals to the limit we set, that means there are still some rows not read yet. Suppose the latest id we just queried is 23456, we should schedule the next query like this:

sql

SELECT LOW_PRIORITY id FROM t1
WHERE create_time < '2022-01-01 00:00:00' AND id >= 23456 AND id < 45678
ORDER BY id ASC
LIMIT 500;

The only difference with the first query is that the second one uses 23456 as the read start to skip the rows just read. This procedure will continue until all expired records are read.

The expired rows will be wrapped as deleteTask and then be sent to the delete workers simultaneously. Before we talk about delete works, there are still some things we should mention:

As we see, the scan operation is heavy because it scans the whole table. For a large table, it is recommended to set the system variable tidb_ttl_job_run_interval to a longer value to reduce the resource cost in one day.
When a table has no primary key or its primary key is not clustered, we'll use the hidden column _tidb_rowid instead as the row id.
As we just mentioned, using a generated column as the time column is not efficient because TiDB currently cannot push down a generated column's condition to TiKV side, the TiDB has to do the filter works that requires more network traffics and CPU times.

Delete Tasks

There are several delete workers running in one TiDB node, and they consume tasks sent from the scan phase to delete expired rows. The following pseudocode shows how it works:

func doDelTask(ch) {
	for _, task := range ch {
		batches := splitRowsToDeleteBatches(task.rows)
		for _, batch := range batches {
			deleteBatch(task.tbl, task.batch, task.expire)
		}
	}
}

A delete worker receives tasks from a channel (a chan object in goland) and then splits the rows in it to several batches according to the system variable tidb_ttl_delete_batch_size. After that, each batch expired rows will be deleted by a multi row DELETE query. For example:

sql

DELETE LOW_PRIORITY FROM t
WHERE id in (1, 2, 3, ...) AND create_time < '2022-01-01 00:00:00';

Notice that we are still using the condition create_time < '2022-01-01 00:00:00' to avoid deleting some "not expired" rows by mistake. For example, a row is expired when scanning, but updated to a not expired value before deleting it.

If there is no secondary index in a TTL table, we can assume that most of the delete operations can do a commit with 1PC. That is because the incoming rows are in the same region in most cases. So if you have a big table, keep no secondary indexes in this table may achieve a better performance for TTL jobs.

Distributed Task executions

As we mentioned above, a TTL job will be split to several scan tasks. In order to maximize the use of cluster resources, we can distribute these tasks to all TiDB nodes. Every time a new job is created, some rows will be inserted to a table named mysql.tidb_ttl_task. Each row in this table stands for a scan task. Every TiDB node will scan the table periodically and once a new task is discovered, the TiDB node will set the owner of the task to itself and then execute it.

Time Zone Consideration

Currently, we support three field types as the time column for TTL: Date, DateTime and TimeStamp . However, for Date and DateTime, they are not an absolute time, and we need another "time zone" information to determine what their accurate time point is. We have two options to select the time zone:

Use the system time zone which can be fetched from global variable system_time_zone and is determined when the cluster bootstraps. Because the system_time_zone will never changes, the row will be deleted at a certain time. However, it will have some problem when the user sets the cluster time zone time_zone to a different value other than SYSTEM. For example, if the system_time_zone is +08:00 and time_zone is set to UTC, the record will be seen as expired immediately when the tidb_ttl_job_run_interval is less than 8 hours.
Use the cluster time zone time_zone. It is fine when cluster time zone time_zone does not change. But if user changes the time_zone settings, the TTL job should be aware of it in time to forbid deleting some unexpected rows. In this case, whether one row is expired or not can not be determined without knowing the time_zone at that time.

New System Variables

Some new system variables will be introduced:

tidb_ttl_job_enable
- When this variable is OFF, the cluster will stop to schedule TTL jobs and the running jobs will be cancelled.
- Scope: Global
- Values: [ON, OFF]
- Default: ON
tidb_ttl_job_schedule_window_start_time
- This variable is used to restrict the start time of the time window of scheduling the ttl jobs.
- Scope: Global
- Type: Time
- Default: 00:00 +0000
tidb_ttl_job_schedule_window_end_time
- This variable is used to restrict the end time of the time window of scheduling the ttl jobs.
- Scope: Global
- Type: Time
- Default: 23:59 +0000
tidb_ttl_scan_worker_count
- The count of the scan workers in each TiDB node
- Scope: Global
- Range: [1, 256]
- Default: 4
tidb_ttl_scan_batch_size
- The limit value of each SELECT query in scan task
- Scope: Global
- Range: [1, 10240]
- Default: 500
tidb_ttl_delete_worker_count
- The count of the delete workers in each TiDB node
- Scope: Global
- Range: [1, 256]
- Default: 4
tidb_ttl_delete_batch_size
- The batch size in one delete query when deleting expired rows.
- Scope: Global
- Range: [0, 10240]
- Default: 100
tidb_ttl_delete_rate_limit
- The rate limit of the delete operations in each TiDB node. 0 is for no limit
- Scope: Global
- Range: [0, MaxInt64]
- Default: 0
tidb_ttl_running_tasks
- The max count of running tasks in a cluster. -1 stands for auto determined by TiDB.
- Scope: Global
- Range: -1 or [1, MaxInt64]
- Default: 0

New Metrics

We'll introduce some new metrics to monitor the TTL jobs:

ttl_queries
- The total count of select queries in TTL jobs
- Type: Counter
- Labels: sql_type, result
ttl_processed_expired_rows
- The total count of expired rows processed in TTL jobs
- Type: Counter
- Labels: sql_type
ttl_query_duration
- The duration of the queries in TTL jobs
- Labels: sql_type, result
- Type: Histogram
ttl_job_status
- The status for the current TTL job. When the job is in the specified status, the value will 1, otherwise, it will be 0
- Labels: table
- Type: Gauge
ttl_phase_time
- The spent in different phases of each worker
- Labels: type, phase
- Type: Counter
ttl_insert_rows
- The total inserted rows to TTL tables
- Type: Counter

In the above metrics, the optional values for sql_type label is 'select' and 'delete' and the optional values for result label is ok and error

Known Issues

Currently, the condition of generated column cannot be pushed down to TiKV. If a table uses generated column as a TTL time column, the filter will be performed in TiDB side. It brings some necessary network traffics and makes the query slow.

Future Works

Performance Optimizations

We'll do some performance optimizations in the future to reduce the execution time and performance cost of the TTL job:

If a TTL table has some Tiflash replicas, we can scan the TiFlash instead of TiKV.
If an index with the prefix of the TTL time column exists, we can use it to query expire rows instead of scanning the full table. It will reduce the execution time of the scan tasks.
In most times, secondary indexes will not be created to avoid the hotspot of the insert. In this scene, we can also reduce some unnecessary scans by caching the statistical information. For example, we can cache the created time of the oldest row for each region after a job finished. When the next job starts, it can check the all the regions, if the data of one region do not have any updates and its cached time is not expired, just skip that region.
If a table does not have any secondary index, we can do some further optimizations. One optimization is that to push down the scan and delete to TiKV side without data exchanging between TiDB and TiKV. It is somewhat like what GCWorker dose. In this way, a new coprocessor command "TTLGC" will be introduced and when a job starts, TiDB will send "TTLGC" commands to each region and TiKV will then scan and delete the expired rows (TiKV should delete expired rows in a non-transactional way).
The resource control works in ongoing: https://github.com/pingcap/tidb/issues/38025 and the global resource control framework is still in design. We can take advantages of it in the future to maximize the resource utilization of the cluster.

More Features to Support

There are also some features we can support in the future:

Support TTL table as a parent table referred by a child table with 'ON DELETE CASCADE'. When some rows in a TTL table are deleted, the related rows in child table will be deleted too.
Support pushing down generated column condition to TiKV side. Or use the definition of generated column to construct condition directly.
Dynamically adjust the runtime settings according to the current overhead of the cluster.
Add some builtin alters for cloud environment.

Alternative Solutions

TiKV supports TTL on RawKV, so we may get question that whether we can implement TiDB TTL in same concept. IMO, this method has some downsides:

Not SQL awareness.
Customer may want to use any column as TTL Column, and may want to use Generated Column to convert other column type(JSON, varchar) to a DATETIME column as TTL Column.
No future foreign key support.
Not table awareness: TTL on RawKV is kv-level, can't easily set TTL on/off by Table
Not compatible with CDC, Backup, Secondary Index

There is another proposal to implement TTL by pushing configurations to TiKV: https://github.com/pingcap/tidb/pull/22763 . However, there are still some unresolved problems. For example, it may break the constraint of snapshot isolation.

CockroachDB also supports row-level TTL feature, and it is very similar with our design, for example:

sql

CREATE TABLE events (
  id UUID PRIMARY KEY default gen_random_uuid(),
  description TEXT,
  inserted_at TIMESTAMP default current_timestamp()
) WITH (ttl_expire_after = '3 months');

The difference is that the CockroachDB is using a hidden column to store the "expire" time instead of "create_time". So when altering a table's TTL options, it will result in some data change and affect the performance depending on the table size, see: https://www.cockroachlabs.com/docs/stable/row-level-ttl.html#add-or-update-the-row-level-ttl-for-an-existing-table

Testing Plan

Functional Test

Test DDL operations including:
- Create a table with TTL options
- Alter a non TTL table with DDL options
- Alter a TTL table with new DDL options
- Remove a table's DDL options
Test a TTL table schedule background job and delete expire rows.
Test a TTL table should not schedule job when TTL_ENABLE or @@global.tidb_ttl_job_enable is OFF
Test TTL jobs should only be scheduled in time window.

Performance Test

Set up a 10 million rows table with 10 percent expire rows. Test the time it takes to clear the table.
Start a benchmark test and then start a TTL job. Collect QPS/latency of the benchmark and compare it with that when TTL job is not running.

Proposal: Support TTL Table

Proposal: Support TTL Table

Table of Contents

Introduction

Detailed Design

Syntax

Create TTL Table

Alter a Table with TTL

Alter to a non-TTL Table

Generated Columns

Constraints

TTL Job Management

TTL Job Details

Scan Tasks

Delete Tasks

Distributed Task executions

Time Zone Consideration

New System Variables

New Metrics

Known Issues

Future Works

Performance Optimizations

More Features to Support

Alternative Solutions

Testing Plan

Functional Test

Performance Test