docs/user-guide/data-preparation.md
Preparing your data correctly is essential for successful training with Megatron Core.
Megatron Core expects training data in JSONL (JSON Lines) format, where each line is a standalone JSON object:

```json
{"text": "Your training text here..."}
{"text": "Another training sample..."}
{"text": "More training data..."}
```
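As a minimal sketch, a file in this format can be written with Python's standard `json` module (the file name `data.jsonl` matches the example command below; the sample texts are placeholders):

```python
import json

samples = [
    "Your training text here...",
    "Another training sample...",
    "More training data...",
]

# One JSON object per line (JSONL); ensure_ascii=False keeps non-ASCII text readable
with open("data.jsonl", "w", encoding="utf-8") as f:
    for text in samples:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```

Using `json.dumps` per line (rather than writing strings by hand) guarantees that quotes and newlines inside the text are escaped correctly.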
Use the preprocess_data.py tool to convert your JSONL data into Megatron's binary format:
```bash
python tools/preprocess_data.py \
    --input data.jsonl \
    --output-prefix processed_data \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model /path/to/tokenizer.model \
    --workers 8 \
    --append-eod
```
| Argument | Description |
|---|---|
| `--input` | Path to the input JSON/JSONL file |
| `--output-prefix` | Prefix for the output binary files (`.bin` and `.idx`) |
| `--tokenizer-type` | Tokenizer type (`HuggingFaceTokenizer`, `GPT2BPETokenizer`, etc.) |
| `--tokenizer-model` | Path to the tokenizer model file |
| `--workers` | Number of parallel workers for processing |
| `--append-eod` | Append an end-of-document token |
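Before launching a long preprocessing run, it can help to verify that every line of the input parses as JSON and carries a `"text"` field. The `validate_jsonl` helper below is a hypothetical sketch, not part of Megatron:

```python
import json

def validate_jsonl(path):
    """Return the number of valid documents; raise on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank/trailing lines
            obj = json.loads(line)  # raises json.JSONDecodeError if malformed
            if "text" not in obj:
                raise ValueError(f"line {lineno}: missing 'text' key")
            count += 1
    return count
```

Running this once on a new dataset catches malformed lines early, rather than partway through a multi-hour preprocessing job.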
Use the `--find-optimal-num-workers` flag to search for the number of workers that gives the best throughput in preprocessed documents per second. The script launches several short preprocessing runs with different worker counts and reports the fastest configuration based on the collected performance data.
```bash
python tools/preprocess_data.py \
    --input data.jsonl \
    --output-prefix processed_data \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model /path/to/tokenizer.model \
    --workers 8 \
    --find-optimal-num-workers \
    --workers-to-check 4 8 16 32 \
    --max-documents 50000
```
Required arguments
| Argument | Description |
|---|---|
| `--find-optimal-num-workers` | Activates the search for the optimal number of workers |
| `--workers-to-check` | List of worker counts to benchmark |
| `--max-documents` | Number of documents to preprocess during each trial run |
Output example
```text
-----------------------------------
Performance results (fastest → slowest):
 1. 16 workers → avg. docs/s: 9606.6476
 2. 32 workers → avg. docs/s: 9275.3284
 3. 8 workers → avg. docs/s: 9151.9280
 4. 4 workers → avg. docs/s: 6391.3819
-----------------------------------
The most optimal num of workers is 16 with avg. preprocessed docs/s: 9606.6476.
-----------------------------------
```
The preprocessing tool generates two files:

- `processed_data.bin` - Binary file containing the tokenized sequences
- `processed_data.idx` - Index file for fast random access

Reference your preprocessed data in training scripts:
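As a quick sanity check after preprocessing, you can confirm that both output files exist and are non-empty. This `check_outputs` helper is a hypothetical sketch, not part of Megatron, and assumes the outputs are named exactly `<prefix>.bin` / `<prefix>.idx`; check the actual file names your run produced, since they depend on the prefix you passed:

```python
import os

def check_outputs(prefix):
    """Report the sizes of the .bin/.idx pair produced by preprocess_data.py."""
    sizes = {}
    for ext in (".bin", ".idx"):
        path = prefix + ext
        if not os.path.exists(path):
            raise FileNotFoundError(f"expected output file missing: {path}")
        sizes[ext] = os.path.getsize(path)
    return sizes
```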
```bash
--data-path processed_data \
--split 949,50,1 \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer.model
```

The `--split` values are the relative train/validation/test weights. With `GPT2BPETokenizer`, supply the vocabulary and merges files instead of a tokenizer model:

```bash
--tokenizer-type GPT2BPETokenizer \
--vocab-file gpt2-vocab.json \
--merge-file gpt2-merges.txt
```
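The `--split` weights are normalized by their sum, so `949,50,1` corresponds to roughly 94.9% train, 5% validation, and 0.1% test. A quick check of the arithmetic:

```python
# Relative train/validation/test weights from the example above
split = [949, 50, 1]

total = sum(split)  # 1000
fractions = [w / total for w in split]
print(fractions)  # [0.949, 0.05, 0.001]
```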