docs/design/2020-05-08-standardize-error-codes-and-messages.md
This issue proposes that TiDB components maintain standard error codes and error messages according to a consistent specification.
When certain errors occur in TiDB components, users are often unaware of the meaning of the corresponding error message. We plan the following two initiatives to alleviate this problem:
To let TiUP know which errors each component may throw, component developers should keep a metafile in the code repository. The metafile should be a TOML file that looks like:
[8005]
error = '''Write Conflict, txnStartTS is stale'''
description = '''Transactions in TiDB encounter write conflicts.'''
workaround = '''
Check whether `tidb_disable_txn_auto_retry` is set to `on`. If so, set it to `off`; if it is already `off`, increase the value of `tidb_retry_limit` until the error no longer occurs.
'''
[9005]
error = '''Region is unavailable'''
description = '''
A certain Raft Group is not available, such as the number of replicas is not enough.
This error usually occurs when the TiKV server is busy or the TiKV node is down.
'''
workaround = '''Check the status, monitoring data and log of the TiKV server.'''
Benefit:
Tradeoff:
Besides TOML, we also considered JSON and Markdown formats; see the rationale section.
This section introduces two candidate formats, both of which are deprecated.
The JSON format of the metafile looks like:
[
{
"code": 8005,
"error": "Write Conflict, txnStartTS is stale",
"description": "Transactions in TiDB encounter write conflicts.",
"workaround": "Check whether `tidb_disable_txn_auto_retry` is set to `on`. If so, set it to `off`; if it is already `off`, increase the value of `tidb_retry_limit` until the error no longer occurs."
},
{
"code": 9005,
"error": "Region is unavailable",
"description": "A certain Raft Group is not available, such as the number of replicas is not enough.\nThis error usually occurs when the TiKV server is busy or the TiKV node is down.",
"workaround": "Check the status, monitoring data and log of the TiKV server."
}
]
Benefit:
Tradeoff:
The Markdown format of the metafile looks like:
## Code: 8005
### Error
Write Conflict, txnStartTS is stale
### Description
Transactions in TiDB encounter write conflicts.
### Workaround
Check whether `tidb_disable_txn_auto_retry` is set to `on`. If so, set it to `off`; if it is already `off`, increase the value of `tidb_retry_limit` until the error no longer occurs.
## Code: 9005
### Error
Region is unavailable
### Description
A certain Raft Group is not available, such as the number of replicas is not enough.
This error usually occurs when the TiKV server is busy or the TiKV node is down.
### Workaround
Check the status, monitoring data and log of the TiKV server.
Benefit:
Tradeoff:
Tradeoff Example:
## Code: 8005
### Error
Write Conflict, txnStartTS is stale
### Description
Transactions in TiDB encounter write conflicts.
### Workaround
## Code: 9005
### Error
Region is unavailable
### Description
A certain Raft Group is not available, such as the number of replicas is not enough.
This error usually occurs when the TiKV server is busy or the TiKV node is down.
### Workaround
Check the status, monitoring data and log of the TiKV server.
With the syntax above, the 9005 block is part of the workaround message of the 8005 block, so we expect its result to be the same as this TOML:
[8005]
error = '''Write Conflict, txnStartTS is stale'''
description = '''Transactions in TiDB encounter write conflicts.'''
workaround = '''
## Code: 9005
### Error
Region is unavailable
### Description
A certain Raft Group is not available, such as the number of replicas is not enough.
This error usually occurs when the TiKV server is busy or the TiKV node is down.
### Workaround
Check the status, monitoring data and log of the TiKV server.
'''
In this case, we would have to define a grammar and write a parser, which adds considerable complexity. Worse, I found no grammar that fits it.
Based on the discussion above, we recommend the TOML version of the metafile.
In addition, error codes should be append-only to avoid conflicts between versions.
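One possible reading of the append-only rule is that a code, once published, never disappears from a later version of the metafile. The sketch below checks that property on two already-parsed metafiles, represented as dictionaries keyed by code. The function names `removed_codes` and `check_append_only` are hypothetical, and this interpretation of "append only" is an assumption rather than something the proposal spells out.

```python
def removed_codes(old: dict, new: dict) -> list[str]:
    """Return codes present in the old metafile but missing from the new one."""
    return sorted(set(old) - set(new))


def check_append_only(old: dict, new: dict) -> None:
    """Raise ValueError if the new metafile drops any previously published code."""
    missing = removed_codes(old, new)
    if missing:
        raise ValueError(f"error codes removed: {', '.join(missing)}")


if __name__ == "__main__":
    previous = {"8005": {}, "9005": {}}
    current = {"8005": {}, "9005": {}, "9006": {}}
    check_append_only(previous, current)  # passes: codes were only added
```

A check like this could run in CI against the previously released metafile, although where it runs is outside the scope of this sketch.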
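In the discussion above, an error has at least 4 parts: the error code, the error field (the error message itself, like err.Error()), the description field, and the workaround field.
Besides, we can append an optional tags field to it: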
[9005]
error = ""
description = ""
workaround = ""
tags = ["tikv"]
The tags field is used to classify errors (e.g. by level of severity). At the very beginning, we can ignore it since we do not have enough errors listed. Once we have enough data, we will need to classify all errors along different dimensions, and then establish a standard for how to classify them.
The error code is a 3-tuple of abbreviated component name, error class, and inner error code, joined by colons, like {Component}:{ErrorClass}:{InnerErrorCode}.
The Component field is the abbreviated name of the component that produced the error, written in upper case. Component names are mapped as below:
The ErrorClass is the name of the ErrClass the error belongs to, which is defined by errors.RegisterErrorClass or something similar. If this is not feasible (for projects not written in Go), anything that can classify the "type" of the error (e.g. the package name) is also acceptable.
The InnerErrorCode identifies the error internally. Note that this code may be duplicated across different components or ErrorClasses. Both numeric and textual codes are acceptable, but a textual code is preferred: one or two short words in PascalCase that identify the error.
The content of ErrorClass and InnerErrorCode must match [0-9a-zA-Z]+.
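The format rules above can be sketched as a small composer and parser. This is only an illustration in Python: the helper names `is_valid_part`, `format_error_code`, and `parse_error_code` are hypothetical, and the parser handles only the full three-part form.

```python
import re

# ErrorClass and InnerErrorCode must match [0-9a-zA-Z]+ in full.
_PART = re.compile(r"[0-9a-zA-Z]+")


def is_valid_part(value: str) -> bool:
    """True if the value consists only of ASCII letters and digits."""
    return _PART.fullmatch(value) is not None


def format_error_code(component: str, error_class: str, inner_code: str) -> str:
    """Join the three parts with colons; the component is upper-cased."""
    for field, value in (("ErrorClass", error_class), ("InnerErrorCode", inner_code)):
        if not is_valid_part(value):
            raise ValueError(f"{field} must match [0-9a-zA-Z]+, got {value!r}")
    return f"{component.upper()}:{error_class}:{inner_code}"


def parse_error_code(code: str) -> tuple[str, str, str]:
    """Split a three-part code back into (component, error class, inner code)."""
    parts = code.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected Component:ErrorClass:InnerErrorCode, got {code!r}")
    component, error_class, inner_code = parts
    if not (is_valid_part(error_class) and is_valid_part(inner_code)):
        raise ValueError(f"invalid ErrorClass or InnerErrorCode in {code!r}")
    return component, error_class, inner_code


if __name__ == "__main__":
    print(format_error_code("br", "Internal", "DownloadFileFailed"))
```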
Here are some examples:
When logging, the format [ErrorCode] message should be used, for example:
[2020/07/17 18:38:06.461 +08:00] [ERROR] [import.go:259] ["failed to download file"] [error="[BR:Internal:DownloadFileFailed] failed to download foo.sst : File not found"] [errVerbose="..."]
For compatibility with the MySQL protocol, codes transmitted through the MySQL protocol should be purely numeric; other codes can be a number with a string prefix.
The code of each component looks like:
TiDB: {class}:{code}
TiKV: KV:{class}:{code}
PD: PD:{class}:{code}
TiFlash: FLASH:{class}:{code}
DM: DM:{class}:{code}
BR: BR:{class}:{code}
CDC: CDC:{class}:{code}
Lightning: LN:{class}:{code}
Dumpling: DP:{class}:{code}
{class}, {code} ~= [A-Za-z0-9]+
For MySQL protocol compatible components, the table below shows the purely numeric codes available to each component.
| MySQL error code range | TiDB Family Component |
|---|---|
| [0, 9000) | TiDB |
| [8124, 8200) | Ecosystem Productions in TiDB |
| [9000, 9010) | TiKV / PD / TiFlash |
In every build, the pipeline should fetch all these metafiles from all repositories:
mkdir -p errors
curl https://raw.githubusercontent.com/pingcap/tidb/master/errors.toml -o errors/tidb.toml
curl https://raw.githubusercontent.com/tikv/tikv/master/errors.toml -o errors/tikv.toml
curl https://raw.githubusercontent.com/tikv/pd/master/errors.toml -o errors/pd.toml
Then two tasks will be executed on the errors directory: