Segmenter for gettext (.po) translation files

Inserts ZWSP between the segments of Chinese and Japanese text.

For Chinese, uses a high quality zh_segmentation model from Google: https://tfhub.dev/google/zh_segmentation/1.

For Japanese, uses Sudachi.

Pre-requisites

Python. The easiest way to install Python on any Linux system is https://github.com/asdf-vm/asdf. On Windows you can use the official installer. Note that the Python version must be supported by tensorflow (this is usually not the latest Python version).
gettext. On Windows you can use this installer.

Python packages:

shell

pip install --upgrade -r tools/segmenter/requirements.txt

To re-segment all the translation files:

shell

tools/segmenter/segment_all.py

To re-segment the Chinese translation files:

shell

tools/segmenter/segment_zh.py --input_path Translations/zh_CN.po
tools/segmenter/segment_zh.py --input_path Translations/zh_TW.po

To re-segment the Japanese translation files:

shell

tools/segmenter/segment_ja.py --input_path Translations/ja.po

Additionaly, you can provide a different separator, such as --separator='|', for debugging.

This tool performs a number of replacements to make sure interpolations are not affected etc.

You can also see the segmenter output for a given string like this:

console

tools/segmenter/segment_zh.py --debug '返回到 {:d} 层'

返回｜到 {:d} 层

When inspecting the diffs, you can use sed to display the segments, e.g.:

bash

git diff --color | sed "s/$(echo -ne '\u200B')/｜/g"