docs/design/2023-06-30-configurable-kv-timeout.md
TiKV requests are typically processed quickly, taking only a few milliseconds. However, if there is disk IO or network latency jitter on a given TiKV node, the duration of these requests may spike to several seconds or more. This usually happens when there is an EBS issue, for example when EBS is repairing one of its replicas.
Currently, the timeout limit for TiKV requests is fixed and cannot be adjusted. For example:

- `ReadTimeoutShort`: [30 seconds](https://github.com/tikv/client-go/blob/ce9203ef66e99dcc8b14f68777c520830ba99099/internal/client/client.go#L82), which is used for the Get RPC request.
- `ReadTimeoutMedium`: 60 seconds, which is used for the BatchGet and Cop RPC requests.

The kv-client or TiDB may wait for responses from TiKV nodes until the timeout is reached; it cannot quickly retry or redirect the requests to other nodes because the timeout configuration is usually tens of seconds. As a result, the application may experience a query latency spike when TiKV node jitter is encountered.
For latency-sensitive applications, providing predictable sub-second read latency through fast retries is valuable. For queries that are usually very fast (such as point-select queries), setting `tikv_client_read_timeout` to a short value like 500ms makes the TiDB cluster more tolerant of network or IO latency jitter on a single storage node, because retries happen more quickly. This is especially helpful when the requests can be served by more than one candidate target, for example follower reads or stale reads that can be handled by multiple available peers.
A possible improvement suggested by #44771 is to make the timeout values of specific KV requests configurable. For example:

- A new variable `tikv_client_read_timeout` is used to control the timeout for a single TiKV read RPC request. When the user sets the value of this variable, all read RPC request timeouts use this value. The default value of this variable is 0, with which the timeout of TiKV read RPC requests is still the original value of `ReadTimeoutShort` and `ReadTimeoutMedium`.
- A query hint like `SELECT /*+ set_var(tikv_client_read_timeout=500) */ * FROM t WHERE id = ?;` sets the timeout value of the KV requests of this single query to the given value.

Consider the stale read usage case:
```sql
set @@tidb_read_staleness=-5;
# The unit is milliseconds. The session variable usage.
set @@tikv_client_read_timeout=500;
select * from t where id = 1;
# The unit is milliseconds. The query hint usage.
select /*+ set_var(tikv_client_read_timeout=500) */ * from t where id = 1;
```
Choosing a proper session-level value for `tikv_client_read_timeout` may not be easy, because it affects the timeout for all TiKV read requests, such as Get, BatchGet, and Cop, in the session. A timeout of 1 second may be sufficient for Get requests but too small for Cop requests; some large Cop requests may keep timing out and never be processed successfully. If `tikv_client_read_timeout` is set too small, more retries will occur, increasing the load pressure on the TiDB cluster. In the worst case, if `tikv_client_read_timeout` is set to a value within which none of the replicas can finish processing, the query would not return until the max backoff timeout is reached. The query hint usage is more flexible and safer, as its impact is limited to a single query.
Add a `KVReadTimeout` field to the `KVSnapshot` struct like

```go
type KVSnapshot struct {
	...
	KVReadTimeout time.Duration
}
```
This change is to be done in the client-go repository.
Add a set interface `SetKVReadTimeout` and a get interface `GetKVReadTimeout` for `KVSnapshot`:

```go
func (s *KVSnapshot) SetKVReadTimeout(readTimeout time.Duration) {
	s.KVReadTimeout = readTimeout
}
```
This change is to be done both in the client-go repository and tidb repository.
Pass the timeout value from the `StmtContext` to the `KVSnapshot` structures when the `PointGet` and `BatchPointGet` executors are built.

Reuse the `Context.max_execution_duration_ms` field of the RPC `Context` like

```protobuf
message Context {
    ...
    uint64 max_execution_duration_ms = 14;
    ...
}
```
Note that by now this field is used only by write requests like Prewrite and Commit.
In TiKV, the kv_read_timeout value should be read from the `GetRequest` and `BatchGetRequest` requests and passed to the point getter. By now there is no canceling mechanism once a task is scheduled to the read pool in TiKV; a simpler way is to check the deadline before each processing step and return directly if the deadline is exceeded, but once the read task is spawned to the read pool it cannot be canceled anymore.

```rust
pub fn get(
    &self,
    mut ctx: Context,
    key: Key,
    start_ts: TimeStamp,
) -> impl Future<Output = Result<(Option<Value>, KvGetStatistics)>> {
    ...
    let res = self.read_pool.spawn_handle(
        async move {
            // TODO: Add timeout check here.
            let snapshot =
                Self::with_tls_engine(|engine| Self::snapshot(engine, snap_ctx)).await?;
            {
                // TODO: Add timeout check here.
                ...
                let result = Self::with_perf_context(CMD, || {
                    let _guard = sample.observe_cpu();
                    let snap_store = SnapshotStore::new(
                        snapshot,
                        start_ts,
                        ctx.get_isolation_level(),
                        !ctx.get_not_fill_cache(),
                        bypass_locks,
                        access_locks,
                        false,
                    );
                    // TODO: Add timeout check here.
                    snap_store
                        .get(&key, &mut statistics)
                });
                ...
            }
        },
        ...
    );
}
```
These changes need to be done in the tikv repository.
Add a `KVReadTimeout` field to the `copTask` struct like

```go
type copTask struct {
	...
	KVReadTimeout time.Duration
}
```
Reuse the `Context.max_execution_duration_ms` field:

```protobuf
message Context {
    ...
    uint64 max_execution_duration_ms = 14;
    ...
}
```
In TiKV, the kv_read_timeout value passed in is used to calculate the deadline result in `parse_and_handle_unary_request`:

```rust
fn parse_request_and_check_memory_locks(
    &self,
    mut req: coppb::Request,
    peer: Option<String>,
    is_streaming: bool,
) -> Result<(RequestHandlerBuilder<E::Snap>, ReqContext)> {
    ...
    req_ctx = ReqContext::new(
        tag,
        context,
        ranges,
        self.max_handle_duration, // Here use the specified timeout value.
        peer,
        Some(is_desc_scan),
        start_ts.into(),
        cache_match_version,
        self.perf_level,
    );
}
```
This change needs to be done in the tikv repository.
The timeout and retry strategy needs to be considered for non-leader-only snapshot read requests. The strategy is:

- If all the candidate peers have been tried and fail with timeout errors, return the timeout error to the upper layer.
- Otherwise, ignore the timeout error and retry the requests with the original default `ReadTimeoutShort` and `ReadTimeoutMedium` values.

The requests with configurable timeout values would take effect only on newer versions. These fields are expected not to take effect when downgrading the cluster, theoretically.
There’s already a query level timeout configuration max_execution_time which could be configured using
variables and query hints.
If the timeout values of RPC or TiKV requests could be derived from `max_execution_time` in an intelligent way, there would be no need to expose another variable or usage like `tikv_client_read_timeout`.
For example, consider the strategy:

- `max_execution_time` is configured to 1s on the query level.
- The first KV request uses a timeout value derived from `max_execution_time`, like 500ms.
- The retry requests use timeout values calculated from the remaining `max_execution_time`, like 400ms.

From the application's perspective, TiDB is trying to finish the requests as soon as possible considering the input `max_execution_time` value, and when TiDB cannot finish in time the timeout error is returned to the application and the query fails fast.
Though it's easier to use and would not introduce more configurations or variables, it's difficult to make the timeout calculation intelligent or automatic, considering things like different request types, node pressures and latencies, etc. A lot of experiments are needed to figure out the appropriate strategy.