docs/design/2020-09-12-utf8mb4-pinyin-order.md
pinyin order for utf8mb4 charsetThis proposal proposes a new feature that supports pinyin order for chinese character.
It's unable now to order by a column based on its pinyin order. For example:
create table t(
a varchar(100)
)
charset = 'utf8mb4' collate = 'utf8mb4_zh_0900_as_cs';
# insert some data:
insert into t values ("中文"), ("啊中文");
# a query requires to order by column a in its pinyin order:
select * from t order by a;
+-----------+
| a |
+-----------+
| 啊中文 |
| 中文 |
+-----------+
2 rows in set (0.00 sec)
pinyin order for Chinese character supported by this proposal will add a new collation named utf8mb4_zh_pinyin_tidb_as_cs which is support all Unicode and sort Chinese characters correctly according to the PINYIN collation in zh.xml file of CLDR24, and only support those Chinese characters with pinyin in zh.xml currently, we support neither those CJK characters whose category defined in Unicode are Symbol with the same shape as Chinese characters nor the PINYIN characters. In utf8mb4_zh_pinyin_tidb_as_cs, utf8mb4 means charset utf8mb4, zh means Chinese language, pinyin means it has pinyin order, tidb means a special(tidb) version, and as_cs means it is accent-sensitive and case-sensitive.
It's a lot of work if we implement utf8mb4_zh_0900_as_cs. The implementation of MySQL looks complicated with weight reorders, magic numbers, and some tricks. Implementing utf8mb4_zh_pinyin_tidb_as_cs is much easier. It supports all Chinese characters and sorts Chinese characters in pinyin order. It is good enough.
It is not compatible with MySQL. MySQL does not have a collation named utf8mb4_zh_pinyin_tidb_as_cs.
Choose collation ID 2048 for utf8mb4_zh_pinyin_tidb_as_cs and add it into parser.
MySQL supports two-byte collation IDs. The range of IDs from 1024 to 2047 is reserved for user-defined collations. see also
utf8mb4_zh_pinyin_tidb_as_cs has same priority with utf8mb4_unicode_ci and utf8mb4_general_ci, which means these three collations incompatible with each other.
MySQL has a lot of language specific collations, for pinyin order, MySQL uses collation utf8mb4_zh_0900_as_cs.
There is no utf8mb4_zh_pinyin_tidb_as_cs collation in MySQL. We can comment utf8mb4_zh_pinyin_tidb_as_cs when users need to replicate their data from TiDB to MySQL.