model/pretokenizer/README.md
The pretokenizer tokenizes datasets ahead of training with the epfLLM/Megatron-LLM fork.

Make sure the `model_training` module is installed: `pip install -e ..`

Also make sure the `oasst_data` module is installed: `python -m pip install ../../oasst-data/`
The datamix to process can be configured with one or multiple sections in the
`configs/pretokenize.yaml` file.
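For illustration, a datamix section might select the data while a model section selects the tokenizer. The section and key names below are assumptions sketched for this README, not the actual schema; check `configs/pretokenize.yaml` for the real structure:

```yaml
# Hypothetical sketch -- section and key names are assumptions,
# not the schema actually shipped in configs/pretokenize.yaml.
oasst_top1:
  datasets:
    - oasst_export          # assumed dataset identifier
  top_k: 1                  # keep only the top-ranked reply per turn

llama2:
  tokenizer: meta-llama/Llama-2-7b-hf   # assumed Hugging Face tokenizer name
```

Passing `--configs oasst_top1 llama2` would then apply both sections, as in the example below.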
Example usage:

```
python pretokenize.py --output_dir output --configs oasst_top1 llama2 --compress --write_json
```
Help:

```
usage: pretokenize.py [-h] --configs CONFIGS [CONFIGS ...] [--output_dir OUTPUT_DIR] [--write_json] [--compress]

Tokenize datamixes for LLama2/Falcon fine-tuning with Megatron-LLM.

options:
  -h, --help            show this help message and exit

configuration:
  --configs CONFIGS [CONFIGS ...]
                        Configurations sections to apply (read from YAML, multiple can be specified).
  --output_dir OUTPUT_DIR
                        Path to output directory
  --write_json          Generate a JSONL file with the formatted dialogues (key='text').
  --compress            Generate a .tar.gz file of the output directory.
```
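Since `--write_json` emits the formatted dialogues as JSONL with a `text` key, the output can be spot-checked from Python. This is a minimal sketch; the JSONL file name inside the output directory is an assumption:

```python
import json
from pathlib import Path

# Hypothetical file name -- check the actual output directory for the
# JSONL file produced by --write_json.
jsonl_path = Path("output") / "oasst_top1.jsonl"

# Each line is one JSON object holding a formatted dialogue under "text".
with jsonl_path.open(encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record["text"][:200])  # preview the first 200 characters
        if i == 2:                   # stop after a few samples
            break
```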