docs/design/2021-08-18-charsets.md
Currently, TiDB only supports the ascii, binary, latin1, utf8, and utf8mb4 character sets. This proposal adds a framework for character sets to TiDB to facilitate the support of other character sets. The design takes the gbk character set as an example.
- Character: Various letters and symbols are collectively referred to as characters. A character can be a Chinese character, an English letter, an Arabic numeral, a punctuation mark, a graphic symbol, a control symbol, etc.
- Charset: A collection of characters is called a charset. Common charsets include the ASCII, Unicode, and GBK character sets.
- Character Encoding: A character encoding can be seen as a mapping rule: according to this rule, a character is mapped to another form of data for storage and transmission in the computer. Each character set has corresponding character encoding rules; commonly used encodings include UTF-8 and GBK.
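To make the mapping rule concrete, here is a small illustration (not part of the proposal) using the golang.org/x/text package to show which bytes the same character maps to under the UTF-8 and GBK encoding rules:

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/simplifiedchinese"
)

func main() {
	s := "中"
	// Go strings are utf-8, so these are the UTF-8 encoded bytes.
	fmt.Printf("utf-8: % X\n", []byte(s)) // utf-8: E4 B8 AD
	// Map the same character to its GBK encoded bytes.
	gbkBytes, err := simplifiedchinese.GBK.NewEncoder().Bytes([]byte(s))
	if err != nil {
		panic(err)
	}
	fmt.Printf("gbk:   % X\n", gbkBytes) // gbk:   D6 D0
}
```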
MySQL supports many character sets, such as utf8mb4, gbk, gb18030, etc. However, TiDB currently only supports 5 character sets, and there are still some problems with the existing ones (for example, incorrect encoding for the latin1 character set, and the extended latin character set is not supported). It is also difficult to add a new character set to TiDB today. This proposal describes the implementation of a character set framework. Many traditional industries in China use gbk/gb18030, so this proposal uses gbk character set support as an example.
Support gbk character set related functions:

- CAST, CONVERT and other functions that convert gbk characters.
- SET CHARACTER SET GBK, SET NAMES GBK, SHOW CHARSET, etc.
- Illegal character handling: for CONVERT, an error is returned; otherwise, a warning is returned.

After receiving a request in a non-utf-8 character set, this solution converts it to the utf-8 character set, so that both the TiDB runtime layer and the storage layer compute and store data as utf-8; the result is finally converted back to the non-utf-8 character set before being returned, as the sketch below illustrates.
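The following hedged sketch shows the conversion at the protocol boundary; `executeInUTF8` and `handleGBKRequest` are hypothetical names standing in for the runtime/storage layers and the protocol layer, not functions from this proposal:

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/simplifiedchinese"
)

// executeInUTF8 is a hypothetical stand-in for TiDB's runtime and storage
// layers, which always operate on utf-8 strings under this solution.
func executeInUTF8(query string) string {
	return query // echo for demonstration
}

// handleGBKRequest sketches the boundary conversion: gbk request bytes are
// decoded to utf-8 on the way in, and the utf-8 result is encoded back to
// gbk on the way out.
func handleGBKRequest(req []byte) ([]byte, error) {
	utf8Query, err := simplifiedchinese.GBK.NewDecoder().Bytes(req)
	if err != nil {
		return nil, err
	}
	result := executeInUTF8(string(utf8Query))
	return simplifiedchinese.GBK.NewEncoder().Bytes([]byte(result))
}

func main() {
	gbkReq := []byte{0xD6, 0xD0} // "中" in gbk
	resp, err := handleGBKRequest(gbkReq)
	if err != nil {
		panic(err)
	}
	fmt.Printf("% X\n", resp) // D6 D0: converted back to gbk
}
```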
Add a repertoire field to collationInfo to facilitate automatic character set conversion in expressions, so that many errors like "illegal mix of collations" can be avoided.
The corresponding types of the Repertoire attribute are as follows:
```go
type Repertoire int

const (
	// RepertoireASCII is pure ASCII; its Unicode range is U+0000..U+007F.
	RepertoireASCII Repertoire = 1
	// RepertoireExtended is extended characters; its Unicode range is U+0080..U+FFFF.
	RepertoireExtended Repertoire = 1 << 1
	// RepertoireUnicode consists of ASCII and EXTENDED; its Unicode range is U+0000..U+FFFF.
	RepertoireUnicode Repertoire = RepertoireASCII | RepertoireExtended
)
```
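To sketch how the repertoire of a string could be derived (the helper `repertoireOf` is illustrative, not from the proposal):

```go
// repertoireOf derives the Repertoire of a utf-8 string: a pure-ASCII
// string stays in RepertoireASCII; anything containing a character above
// U+007F needs the full Unicode repertoire. This is a simplification of
// how the repertoire would be attached to string literals.
func repertoireOf(s string) Repertoire {
	for _, r := range s {
		if r > 0x7F {
			return RepertoireUnicode
		}
	}
	return RepertoireASCII
}
```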
Some of the built-in functions related to string computation require special handling after the conversion from utf-8. For string-related internal functions such as DatumsToString and strToInt, check whether the corresponding character set needs special processing.
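For instance, any function that depends on the byte length of a string must measure the gbk-encoded form rather than the internal utf-8 representation. A minimal sketch, assuming golang.org/x/text is available (the helper `gbkByteLength` is hypothetical):

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/simplifiedchinese"
)

// gbkByteLength returns the byte length of s as it would be stored in the
// gbk charset, not in TiDB's internal utf-8 representation.
func gbkByteLength(s string) (int, error) {
	gbkBytes, err := simplifiedchinese.GBK.NewEncoder().Bytes([]byte(s))
	if err != nil {
		return 0, err
	}
	return len(gbkBytes), nil
}

func main() {
	n, _ := gbkByteLength("中")
	fmt.Println(len("中"), n) // 3 bytes in utf-8, 2 bytes in gbk
}
```

A Chinese character occupies 2 bytes in gbk but 3 bytes in utf-8, so returning the internal utf-8 length would be incompatible with MySQL.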
- Add the gbk_chinese_ci and gbk_bin collations. In addition, considering performance, we can add a collation based on utf8mb4 (gbk_utf8mb4_bin).
- Whether these collations take effect depends on the new_collations_enabled_on_first_bootstrap switch. When new_collations_enabled_on_first_bootstrap is off, only gbk_utf8mb4_bin is supported, which does not need to be converted to the gbk charset before processing.
- Support specifying character sets for databases, tables and columns when creating databases, creating tables or adding columns, and support changing them through ALTER statements.
Other behaviors that need to be dealt with:

- set character set gbk, set names gbk and set character_set_client = gbk, etc. (SET NAMES gbk sets character_set_client, character_set_connection and character_set_results to gbk.)
- Illegal character related issues.
Collation: gbk_bin and gbk_chinese_ci are supported only when the config new_collations_enabled_on_first_bootstrap is enabled; otherwise, only gbk_utf8mb4_bin is supported.

After using this version of TiDB, once gbk-encoded data is stored in the database, you can only use components of this version or later.
Unit and integration tests to consider
Test the feasibility of using different character sets for mixed sorting.
Test the compatibility of some related features, such as SQL binding, SQL hints, clustered index, view, expression index, statements_summary, slow_query, explain analyze and other features.
There will be incompatibility issues when downgrading tables that use gbk encoding.
An alternative solution was considered: after receiving a gbk character set request, convert it to gbk encoding for processing, that is, data stays in the gbk character set when it is computed or stored at the TiDB runtime layer and the storage layer. This plan was finally rejected because the issues it needs to deal with are more complex and harder to control.
The first stage (mainly the development on the TiDB side)
The second stage
The third stage
Define how characters are encoded and decoded; the main work is to implement the Encoding interface:
```go
// Encoding provides encode/decode functions for a string with a specific charset.
type Encoding interface {
	// Name is the name of the encoding.
	Name() string
	// Tp is the type of the encoding.
	Tp() EncodingTp
	// Peek returns the next char.
	Peek(src []byte) []byte
	// MbLen returns the multiple-byte length; if the next character is a single byte, it returns 0.
	MbLen(string) int
	// IsValid checks whether the utf-8 bytes can be converted to a valid string in the current encoding.
	IsValid(src []byte) bool
	// Foreach iterates over the characters in the current encoding.
	Foreach(src []byte, op Op, fn func(from, to []byte, ok bool) bool)
	// Transform maps the bytes in src to dest according to Op.
	// **the caller should initialize dest if it wants to avoid a memory allocation every time, or else a new one is always made**
	// **the returned slice may alias `src`; edit the returned slice at your own risk**
	Transform(dest *bytes.Buffer, src []byte, op Op) ([]byte, error)
	// ToUpper changes a string to uppercase.
	ToUpper(src string) string
	// ToLower changes a string to lowercase.
	ToLower(src string) string
}
```
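As a sketch of how Transform could be implemented for gbk on top of golang.org/x/text (the two-valued Op here is a simplification of the proposal's Op, and gbkTransform is written as a hypothetical free function rather than a method):

```go
package charset

import (
	"bytes"

	"golang.org/x/text/encoding/simplifiedchinese"
)

// Op is a simplified direction flag for this sketch; the real Op in the
// Encoding interface above carries more information.
type Op int

const (
	OpEncode Op = iota // utf-8 -> gbk
	OpDecode           // gbk -> utf-8
)

// gbkTransform maps src between utf-8 and gbk, writing into dest so that
// the caller can reuse the buffer across calls, as the interface comment
// suggests. Replacement-character and partial-input handling are omitted.
func gbkTransform(dest *bytes.Buffer, src []byte, op Op) ([]byte, error) {
	if dest == nil {
		dest = &bytes.Buffer{}
	}
	dest.Reset()
	var out []byte
	var err error
	switch op {
	case OpEncode:
		out, err = simplifiedchinese.GBK.NewEncoder().Bytes(src)
	case OpDecode:
		out, err = simplifiedchinese.GBK.NewDecoder().Bytes(src)
	}
	if err != nil {
		return nil, err
	}
	dest.Write(out)
	return dest.Bytes(), nil
}
```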
With the default config, TiDB has no collation support and all collations are ignored. Since TiDB 4.0, the collation framework is supported and is controlled by a config option. After a new charset is added to TiDB, the related collations should also be added.
For example, the gb18030 charset has 3 different collations (shown here with the columns of SHOW COLLATION):

| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
| --- | --- | --- | --- | --- | --- | --- |
| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE |
| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 2 | PAD SPACE |
| gb18030_unicode_520_ci | gb18030 | 250 | | Yes | 8 | PAD SPACE |
We should at least implement gb18030_bin and gb18030_chinese_ci collations.
If the collation framework is not used, gb18030_bin should be the default collation; otherwise, gb18030_chinese_ci should be the default collation.
To add a new collation, implement the Collator interface:
```go
// Collator provides functionality for comparing strings for a given
// collation order.
type Collator interface {
	// Compare returns an integer comparing the two strings. The result will be 0 if a == b, -1 if a < b, and +1 if a > b.
	Compare(a, b string) int
	// Key returns the collate key for str. If the collation pads, make sure that PadLen >= len([]rune(str)) in opt.
	Key(str string) []byte
	// Pattern returns a collation-aware WildcardPattern.
	Pattern() WildcardPattern
}
```
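For illustration, a minimal binary collator with PAD SPACE semantics could look like the sketch below; it implements only Compare and Key (Pattern is omitted for brevity), and a real gbk_bin collator would compare the gbk-encoded form of the strings rather than their utf-8 bytes:

```go
package charset

import (
	"bytes"
	"strings"
)

// padBinCollator is a simplified binary collator with PAD SPACE semantics:
// trailing spaces are ignored, then the remaining bytes are compared
// directly, matching the Pad_attribute shown in the table above.
type padBinCollator struct{}

func (padBinCollator) Compare(a, b string) int {
	return bytes.Compare(
		[]byte(strings.TrimRight(a, " ")),
		[]byte(strings.TrimRight(b, " ")),
	)
}

func (padBinCollator) Key(str string) []byte {
	return []byte(strings.TrimRight(str, " "))
}
```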
Then, add the charset and related collations into the support list so that TiDB recognizes them.
Many expressions have been pushed down to TiKV, so we should also make TiKV support the charset and collations.