# You Only Cache Once: Decoder-Decoder Architectures for Large Language Models

## Approach


## Performance

### Harness Eval

Training with 1T Tokens:

| Model | Arc-c | Arc-e | BoolQ | Hellaswag$^*$ | OBQA | PIQA | Winogrande | SciQ | Avg |
|---|---|---|---|---|---|---|---|---|---|
| OpenLLaMA-3B-v2 | 0.339 | 0.676 | 0.657 | 0.700 | 0.260 | 0.767 | 0.629 | 0.924 | 0.619 |
| StableLM-base-alpha-3B-v2 | 0.324 | 0.673 | 0.646 | 0.686 | 0.264 | 0.760 | 0.621 | 0.921 | 0.612 |
| StableLM-3B-4E1T | --- | 0.666 | --- | --- | --- | 0.768 | 0.632 | 0.914 | --- |
| YOCO-3B | 0.379 | 0.731 | 0.645 | 0.689 | 0.298 | 0.763 | 0.639 | 0.924 | 0.634 |

Training with 1.6T Tokens:

| Model | Arc-c | Arc-e | BoolQ | Hellaswag$^*$ | OBQA | PIQA | Winogrande | SciQ | Avg |
|---|---|---|---|---|---|---|---|---|---|
| StableLM-3B-4E1T | --- | 0.688 | --- | --- | --- | 0.762 | 0.627 | 0.913 | --- |
| YOCO-3B | 0.396 | 0.733 | 0.644 | 0.698 | 0.300 | 0.764 | 0.631 | 0.921 | 0.636 |
| YOCO-3B-1M | 0.413 | 0.747 | 0.638 | 0.705 | 0.300 | 0.773 | 0.651 | 0.932 | 0.645 |

### Needle In A Haystack


### Multi-Needle Eval

| Model | Size | N=1 | N=2 | N=4 | N=8 |
|---|---|---|---|---|---|
| GPT-4-128K | -- | 1.00 | 1.00 | 0.98 | 1.00 |
| MiniCPM-128K | 2.4B | 1.00 | 1.00 | 0.54 | 0.56 |
| ChatGLM3-128K | 6B | 0.94 | 0.72 | 0.52 | 0.44 |
| YaRN-Mistral-128K | 7B | 0.02 | 0.12 | 0.08 | 0.20 |
| LWM-1M-text | 7B | 1.00 | 0.90 | 0.76 | 0.62 |
| YOCO-3B-1M | 3B | 0.98 | 0.98 | 0.84 | 0.56 |

## Setup

To install the required packages, use the following command:

```bash
pip install -r requirements.txt
```

Besides the normal packages, Apex and Flash-Attention should be installed separately following their official guides.
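
As a quick sanity check before launching jobs, a short snippet like the following (illustrative, not part of the repo) confirms that the separately installed dependencies import cleanly:

```python
# Quick environment check (illustrative, not part of the repo): verify that
# torch and the separately installed dependencies are importable.
import importlib

for pkg in ("torch", "apex", "flash_attn"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: OK (version {getattr(mod, '__version__', 'unknown')})")
    except ImportError as err:
        print(f"{pkg}: MISSING ({err})")
```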

## Harness Eval

To evaluate models with Harness-Eval, use the script in `scripts/eval_task.sh`:

```bash
cd fairseq/
TASK='harness_boolq'

torchrun --master-port=29505 --nproc_per_node=1 validate.py \
    --data-dir ../harness_data/ \
    --criterion harness_eval \
    --task harness_eval \
    --batch-size 4 \
    --eval-data ${TASK} \
    --log-format simple --log-interval 10 \
    --bf16 \
    --tokenizer-pad-to-multiple 8 \
    --arch yoco_3b_new --tiktoken-model cl100k_base --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth --yoco-model /path_to_ckpt/YOCO-3B-1M --tokens-per-sample 4096
```

## Needle In A Haystack Evaluation

Our model uses city-number pairs for long-sequence evaluation. To get results at a given maximum length, use the script in `scripts/eval_needle.sh`:

```bash
cd fairseq/
torchrun --master-port=29504 --nproc_per_node=1 validate.py \
    --task pseudo \
    --criterion needle_haystack \
    --batch-size 1 \
    --max-epoch 1 \
    --no-save \
    --tiktoken-model cl100k_base \
    --bf16 \
    --arch yoco_3b_new --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth --yoco-model /path_to_ckpt/YOCO-3B-1M --tokens-per-sample 1048576 --interval 1048576
```

To run Multi-Needle experiments, replace `--criterion needle_haystack` with `--criterion multi_needle --needle-num {num}`.
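
For intuition, the sketch below shows one plausible way a city-number haystack could be constructed. The city list, filler text, needle template, and question format are invented for illustration; see the `needle_haystack` and `multi_needle` criteria in `fairseq/` for the actual implementation.

```python
import random

# Hypothetical city-number needle-in-a-haystack construction. All templates
# below are assumptions, not the repo's actual prompt format.
CITIES = ["Paris", "Tokyo", "Cairo", "Lima", "Oslo", "Quito", "Seoul", "Dakar"]
FILLER = "The grass is green. The sky is blue. The sun is bright. "

def build_haystack(n_filler_chars: int, needle_num: int, seed: int = 0):
    rng = random.Random(seed)
    # Assign a random number to each sampled city (the "needles").
    needles = {c: rng.randint(100, 999) for c in rng.sample(CITIES, needle_num)}
    text = (FILLER * (n_filler_chars // len(FILLER) + 1))[:n_filler_chars]
    # Scatter the needles at random positions inside the filler text.
    for city, number in needles.items():
        pos = rng.randrange(len(text))
        text = text[:pos] + f" The magic number of {city} is {number}. " + text[pos:]
    query = rng.choice(sorted(needles))
    return text + f"\nWhat is the magic number of {query}?", needles[query]

prompt, answer = build_haystack(n_filler_chars=2000, needle_num=4)
print(prompt[-120:], "| expected:", answer)
```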

## Pretraining From Scratch

To support distributed training, our implementation is based on infinibatch, which reads data iteratively. The overall data directory should be organized as follows:

```
Data/
├── json/
│   ├── train.json
│   ├── CC.json
│   ├── StarCoder.json
│   └── ...
└── shard/
    ├── CC/
    │   ├── 00000.jsonl
    │   ├── 00001.jsonl
    │   └── ...
    └── StarCoder/
        ├── 00000.jsonl
        ├── 00001.jsonl
        └── ...
```

We recommend that each sharded data file contain no more than 10K lines, with one JSON dict per line. Each jsonl file, such as `Data/shard/CC/00000.jsonl`, should be in the following format:

```json
{"text": "File 1 is here..."}
{"text": "File 2 is here..."}
...
```
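
A small helper along these lines (hypothetical, not part of the repo) splits a corpus into numbered shards that respect the 10K-line recommendation:

```python
import json
import os

def write_shards(texts, out_dir, max_lines=10_000):
    """Write an iterable of document strings as numbered jsonl shards,
    with one JSON dict per line and at most max_lines lines per shard."""
    os.makedirs(out_dir, exist_ok=True)
    shard, shard_idx, line_idx = None, 0, 0
    for text in texts:
        if shard is None or line_idx == max_lines:
            if shard is not None:
                shard.close()
            shard = open(os.path.join(out_dir, f"{shard_idx:05d}.jsonl"), "w")
            shard_idx, line_idx = shard_idx + 1, 0
        shard.write(json.dumps({"text": text}) + "\n")
        line_idx += 1
    if shard is not None:
        shard.close()

# Example: 25K documents end up in Data/shard/CC/00000.jsonl ... 00002.jsonl.
write_shards((f"File {i} is here..." for i in range(25_000)), "Data/shard/CC")
```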

Then, for each source, a JSON file lists the paths of all its jsonl files. Take `Data/json/CC.json` as an example:

```json
[
    "/path_to_data/Data/shard/CC/00000.jsonl",
    "/path_to_data/Data/shard/CC/00001.jsonl",
    ...
]
```
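
Such an index can be generated by globbing each shard directory; a minimal sketch (hypothetical helper, reusing the placeholder paths above):

```python
import glob
import json
import os

def write_source_index(source, data_root="/path_to_data/Data"):
    """Collect one source's shard paths into Data/json/<source>.json."""
    paths = sorted(glob.glob(os.path.join(data_root, "shard", source, "*.jsonl")))
    with open(os.path.join(data_root, "json", f"{source}.json"), "w") as f:
        json.dump(paths, f, indent=4)

write_source_index("CC")
write_source_index("StarCoder")
```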

Finally, `train.json` records each source's name and sampling weight:

```json
[
    {
        "name": "CC",
        "weight": 0.5
    },
    {
        "name": "StarCoder",
        "weight": 0.2
    },
    ...
]
```
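
The weights control how often each source is drawn during training. As a rough mental model only (the actual infinibatch-based loader also handles sharding, shuffling, and checkpoint resumption), sampling behaves like this:

```python
import random

# Toy illustration of weighted source sampling; weights act as relative
# sampling ratios across sources. The "Other" source is made up to
# complete the example.
sources = {"CC": 0.5, "StarCoder": 0.2, "Other": 0.3}
names, weights = zip(*sources.items())

rng = random.Random(0)
counts = dict.fromkeys(names, 0)
for _ in range(10_000):
    counts[rng.choices(names, weights=weights)[0]] += 1
print(counts)  # roughly {'CC': ~5000, 'StarCoder': ~2000, 'Other': ~3000}
```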

Training is then launched with `scripts/train.sh`:

```bash
cd fairseq/
torchrun --nproc-per-node=1 train.py /path_to_data \
    --save-interval-updates 5000 \
    --no-epoch-checkpoints \
    --arch yoco_base \
    --criterion cross_entropy \
    --task gpt \
    --tokens-per-sample 2048 \
    --tokenizer-pad-to-multiple 8 \
    --pad-to-max-len \
    --optimizer adam --adam-betas "(0.9, 0.95)" \
    --adam-eps 1e-06 \
    --clip-norm 2.0 \
    --lr 0.00015 \
    --lr-scheduler polynomial_decay \
    --warmup-updates 50 \
    --weight-decay 0.05 \
    --batch-size 1 \
    --model-parallel-size 1 \
    --update-freq 1 \
    --batch-read-ahead 1000 \
    --total-num-update 300000 \
    --log-format simple --log-interval 10 --disable-validation \
    --tiktoken-model cl100k_base \
    --bf16 # bf16 is encouraged in pre-training
```