docs/design/2021-08-18-charsets.md
Currently, TiDB only supports the ascii, binary, latin1, utf8, and utf8mb4 character sets. This proposal adds a framework for character sets to TiDB to facilitate the support of other character sets. The design takes the gbk character set as an example.
- Character: Various letters and symbols are collectively referred to as characters. A character can be a Chinese character, an English letter, an Arabic numeral, a punctuation mark, a graphic symbol, a control symbol, etc.
- Charset: A collection of characters is called a charset. Common charsets include the ASCII, Unicode, and GBK character sets.
- Character Encoding: A character encoding can be seen as a mapping rule: according to this rule, a character is mapped to another form of data for storage and transmission in the computer. Each character set has corresponding character encoding rules; commonly used encodings include UTF-8 and GBK.
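To make the mapping rule concrete, here is a small illustration (not part of the proposal) using the golang.org/x/text package to show which bytes the same character maps to under the UTF-8 and GBK encoding rules:

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/simplifiedchinese"
)

func main() {
	s := "中"
	// Go strings are utf-8, so these are the UTF-8 encoded bytes.
	fmt.Printf("utf-8: % X\n", []byte(s)) // utf-8: E4 B8 AD
	// Map the same character to its GBK encoded bytes.
	gbkBytes, err := simplifiedchinese.GBK.NewEncoder().Bytes([]byte(s))
	if err != nil {
		panic(err)
	}
	fmt.Printf("gbk:   % X\n", gbkBytes) // gbk:   D6 D0
}
```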
MySQL supports many character sets, such as utf8mb4, gbk, gb18030, etc. However, TiDB currently only supports 5 character sets, and there are still some problems with the existing ones (for example, incorrect encoding for the latin1 character set, and the extended latin character set is not supported). It is also difficult to add a new character set to TiDB today. This proposal describes the implementation of a character set framework. Many traditional industries in China use gbk/gb18030, so this proposal uses gbk character set support as an example.
Support gbk character set related functions:

- CAST, CONVERT and other functions that convert gbk characters.
- SET CHARACTER SET GBK, SET NAMES GBK, SHOW CHARSET, etc.
- Illegal character handling: for CONVERT, an error is returned; otherwise, a warning is returned.

After receiving a request in a non-utf-8 character set, this solution converts it to the utf-8 character set, so that both the TiDB runtime layer and the storage layer compute and store data as utf-8; the result is finally converted back to the non-utf-8 character set before being returned, as the sketch below illustrates.
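The following hedged sketch shows the conversion at the protocol boundary; `executeInUTF8` and `handleGBKRequest` are hypothetical names standing in for the runtime/storage layers and the protocol layer, not functions from this proposal:

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/simplifiedchinese"
)

// executeInUTF8 is a hypothetical stand-in for TiDB's runtime and storage
// layers, which always operate on utf-8 strings under this solution.
func executeInUTF8(query string) string {
	return query // echo for demonstration
}

// handleGBKRequest sketches the boundary conversion: gbk request bytes are
// decoded to utf-8 on the way in, and the utf-8 result is encoded back to
// gbk on the way out.
func handleGBKRequest(req []byte) ([]byte, error) {
	utf8Query, err := simplifiedchinese.GBK.NewDecoder().Bytes(req)
	if err != nil {
		return nil, err
	}
	result := executeInUTF8(string(utf8Query))
	return simplifiedchinese.GBK.NewEncoder().Bytes([]byte(result))
}

func main() {
	gbkReq := []byte{0xD6, 0xD0} // "中" in gbk
	resp, err := handleGBKRequest(gbkReq)
	if err != nil {
		panic(err)
	}
	fmt.Printf("% X\n", resp) // D6 D0: converted back to gbk
}
```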
Add a repertoire field to collationInfo to facilitate automatic character set conversion in expressions, so that many errors like "illegal mix of collations" can be avoided.
The corresponding types of the Repertoire attribute are as follows:
```go
type Repertoire int

const (
	// RepertoireASCII is pure ASCII; its Unicode range is U+0000..U+007F.
	RepertoireASCII Repertoire = 1
	// RepertoireExtended is extended characters; its Unicode range is U+0080..U+FFFF.
	RepertoireExtended Repertoire = 1 << 1
	// RepertoireUnicode consists of ASCII and EXTENDED; its Unicode range is U+0000..U+FFFF.
	RepertoireUnicode Repertoire = RepertoireASCII | RepertoireExtended
)
```
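To sketch how the repertoire of a string could be derived (the helper `repertoireOf` is illustrative, not from the proposal):

```go
// repertoireOf derives the Repertoire of a utf-8 string: a pure-ASCII
// string stays in RepertoireASCII; anything containing a character above
// U+007F needs the full Unicode repertoire. This is a simplification of
// how the repertoire would be attached to string literals.
func repertoireOf(s string) Repertoire {
	for _, r := range s {
		if r > 0x7F {
			return RepertoireUnicode
		}
	}
	return RepertoireASCII
}
```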
Some of the built-in functions related to string computation require special handling after the conversion from utf-8. For string-related internal functions such as DatumsToString and strToInt, check whether the corresponding character set needs special processing.
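For instance, any function that depends on the byte length of a string must measure the gbk-encoded form rather than the internal utf-8 representation. A minimal sketch, assuming golang.org/x/text is available (the helper `gbkByteLength` is hypothetical):

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/simplifiedchinese"
)

// gbkByteLength returns the byte length of s as it would be stored in the
// gbk charset, not in TiDB's internal utf-8 representation.
func gbkByteLength(s string) (int, error) {
	gbkBytes, err := simplifiedchinese.GBK.NewEncoder().Bytes([]byte(s))
	if err != nil {
		return 0, err
	}
	return len(gbkBytes), nil
}

func main() {
	n, _ := gbkByteLength("中")
	fmt.Println(len("中"), n) // 3 bytes in utf-8, 2 bytes in gbk
}
```

A Chinese character occupies 2 bytes in gbk but 3 bytes in utf-8, so returning the internal utf-8 length would be incompatible with MySQL.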
- Add the gbk_chinese_ci and gbk_bin collations. In addition, considering performance, we can add a collation based on utf8mb4 (gbk_utf8mb4_bin).
- Whether these collations take effect depends on the new_collations_enabled_on_first_bootstrap switch. When new_collations_enabled_on_first_bootstrap is off, only gbk_utf8mb4_bin is supported, which does not need to be converted to the gbk charset before processing.
- Support specifying character sets for databases, tables and columns when creating databases, creating tables or adding columns, and support changing them through ALTER statements.
Other behaviors that need to be dealt with:

- set character set gbk, set names gbk and set character_set_client = gbk, etc. (SET NAMES gbk sets character_set_client, character_set_connection and character_set_results to gbk.)
- Illegal character related issues.
Collation: gbk_bin and gbk_chinese_ci are supported only when the config new_collations_enabled_on_first_bootstrap is enabled; otherwise, only gbk_utf8mb4_bin is supported.

After using this version of TiDB, once gbk-encoded data is stored in the database, you can only use components of this version or later.
Unit and integration tests to consider
Test the feasibility of using different character sets for mixed sorting.
Test the compatibility of some related features, such as SQL binding, SQL hints, clustered index, view, expression index, statements_summary, slow_query, explain analyze and other features.
There will be incompatibility issues when downgrading tables that use gbk encoding.
An alternative solution was considered: after receiving a gbk character set request, convert it to gbk encoding for processing, that is, data stays in the gbk character set when it is computed or stored at the TiDB runtime layer and the storage layer. This plan was finally rejected because the issues it needs to deal with are more complex and harder to control.
The first stage (mainly the development on the TiDB side)
The second stage
The third stage
Define how characters are encoded and decoded; the main work is to implement the Encoding interface:
```go
// Encoding provides encode/decode functions for a string with a specific charset.
type Encoding interface {
	// Name is the name of the encoding.
	Name() string
	// Tp is the type of the encoding.
	Tp() EncodingTp
	// Peek returns the next char.
	Peek(src []byte) []byte
	// MbLen returns the multiple-byte length; if the next character is a single byte, it returns 0.
	MbLen(string) int
	// IsValid checks whether the utf-8 bytes can be converted to a valid string in the current encoding.
	IsValid(src []byte) bool
	// Foreach iterates over the characters in the current encoding.
	Foreach(src []byte, op Op, fn func(from, to []byte, ok bool) bool)
	// Transform maps the bytes in src to dest according to Op.
	// **the caller should initialize dest if it wants to avoid a memory allocation every time, or else a new one is always made**
	// **the returned slice may alias `src`; edit the returned slice at your own risk**
	Transform(dest *bytes.Buffer, src []byte, op Op) ([]byte, error)
	// ToUpper changes a string to uppercase.
	ToUpper(src string) string
	// ToLower changes a string to lowercase.
	ToLower(src string) string
}
```
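As a sketch of how Transform could be implemented for gbk on top of golang.org/x/text (the two-valued Op here is a simplification of the proposal's Op, and gbkTransform is written as a hypothetical free function rather than a method):

```go
package charset

import (
	"bytes"

	"golang.org/x/text/encoding/simplifiedchinese"
)

// Op is a simplified direction flag for this sketch; the real Op in the
// Encoding interface above carries more information.
type Op int

const (
	OpEncode Op = iota // utf-8 -> gbk
	OpDecode           // gbk -> utf-8
)

// gbkTransform maps src between utf-8 and gbk, writing into dest so that
// the caller can reuse the buffer across calls, as the interface comment
// suggests. Replacement-character and partial-input handling are omitted.
func gbkTransform(dest *bytes.Buffer, src []byte, op Op) ([]byte, error) {
	if dest == nil {
		dest = &bytes.Buffer{}
	}
	dest.Reset()
	var out []byte
	var err error
	switch op {
	case OpEncode:
		out, err = simplifiedchinese.GBK.NewEncoder().Bytes(src)
	case OpDecode:
		out, err = simplifiedchinese.GBK.NewDecoder().Bytes(src)
	}
	if err != nil {
		return nil, err
	}
	dest.Write(out)
	return dest.Bytes(), nil
}
```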
With the default config, TiDB has no collation support and all collations are ignored. Since TiDB 4.0, the collation framework is supported and is controlled by a config option. After a new charset is added to TiDB, the related collations should also be added.
For example, the gb18030 charset has 3 different collations (shown here with the columns of SHOW COLLATION):

| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
| --- | --- | --- | --- | --- | --- | --- |
| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE |
| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 2 | PAD SPACE |
| gb18030_unicode_520_ci | gb18030 | 250 | | Yes | 8 | PAD SPACE |
We should at least implement gb18030_bin and gb18030_chinese_ci collations.
If the collation framework is not used, gb18030_bin should be the default collation; otherwise, gb18030_chinese_ci should be the default collation.
To add a new collation, implement the Collator interface:
```go
// Collator provides functionality for comparing strings for a given
// collation order.
type Collator interface {
	// Compare returns an integer comparing the two strings. The result will be 0 if a == b, -1 if a < b, and +1 if a > b.
	Compare(a, b string) int
	// Key returns the collate key for str. If the collation pads, make sure that PadLen >= len([]rune(str)) in opt.
	Key(str string) []byte
	// Pattern returns a collation-aware WildcardPattern.
	Pattern() WildcardPattern
}
```
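For illustration, a minimal binary collator with PAD SPACE semantics could look like the sketch below; it implements only Compare and Key (Pattern is omitted for brevity), and a real gbk_bin collator would compare the gbk-encoded form of the strings rather than their utf-8 bytes:

```go
package charset

import (
	"bytes"
	"strings"
)

// padBinCollator is a simplified binary collator with PAD SPACE semantics:
// trailing spaces are ignored, then the remaining bytes are compared
// directly, matching the Pad_attribute shown in the table above.
type padBinCollator struct{}

func (padBinCollator) Compare(a, b string) int {
	return bytes.Compare(
		[]byte(strings.TrimRight(a, " ")),
		[]byte(strings.TrimRight(b, " ")),
	)
}

func (padBinCollator) Key(str string) []byte {
	return []byte(strings.TrimRight(str, " "))
}
```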
Then, add the charset and related collations into the support list so that TiDB recognizes them.
Many expressions have been pushed down to TiKV, so we should also make TiKV support the charset and collations.