examples/community/roberta/README.md
This example introduces how to pretrain RoBERTa from scratch, covering preprocessing, pretraining, and finetuning. It can help you quickly train a high-quality RoBERTa model.
First set up passwordless SSH among all hosts. In /etc/ssh/sshd_config and /etc/ssh/ssh_config, make every host expose the same SSH port on both the server and client side. If you are the root user, also set `PermitRootLogin` in /etc/ssh/sshd_config to "yes". Then generate a key pair and copy the public key to every destination host:

```bash
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub ip_destination
```
Add every host to /etc/hosts, for example:

```
192.168.2.1 GPU001
192.168.2.2 GPU002
192.168.2.3 GPU003
192.168.2.4 GPU004
192.168.2.5 GPU005
192.168.2.6 GPU006
192.168.2.7 GPU007
...
```
Finally, restart the SSH service so the configuration changes take effect:

```bash
service ssh restart
```
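As a quick sanity check (a sketch, not part of the original example), you can confirm that passwordless login works to every host; `BatchMode=yes` makes `ssh` fail immediately instead of prompting for a password:

```python
# Verify passwordless SSH to each worker listed in /etc/hosts.
import subprocess

hosts = ["GPU001", "GPU002", "GPU003"]  # extend with your full host list

for host in hosts:
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "hostname"],
        capture_output=True, text=True,
    )
    status = "ok" if result.returncode == 0 else "FAILED"
    print(f"{host}: {status} {result.stdout.strip()}")
```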
Step 1: `cd preprocessing`. Following the README.md there, preprocess the original corpus into HDF5 (h5py) files plus NumPy arrays.
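Before moving on, you can sanity-check a generated file. The snippet below is a minimal sketch: the output file name is hypothetical, and the dataset names inside depend on the preprocessing configuration:

```python
# List the datasets stored in one preprocessed HDF5 file.
import h5py

with h5py.File("output/pretrain_data_0.h5", "r") as f:  # hypothetical path
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)
```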
Step 2: `cd pretraining`. Following the README.md there, load the HDF5 files generated by the preprocessing in step 1 to pretrain the model.
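For orientation, here is a minimal sketch of how such HDF5 shards are typically fed to PyTorch; the dataset names (`input_ids`, `masked_lm_labels`) and the file layout are assumptions, so check the pretraining README for the real schema and launch commands:

```python
# Wrap one preprocessed HDF5 shard as a PyTorch Dataset.
import glob
import h5py
import torch
from torch.utils.data import DataLoader, Dataset

class PretrainShard(Dataset):
    def __init__(self, path):
        f = h5py.File(path, "r")
        self.input_ids = f["input_ids"]        # assumed dataset name
        self.labels = f["masked_lm_labels"]    # assumed dataset name

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return (torch.as_tensor(self.input_ids[idx], dtype=torch.long),
                torch.as_tensor(self.labels[idx], dtype=torch.long))

shard_paths = sorted(glob.glob("output/*.h5"))  # hypothetical location
loader = DataLoader(PretrainShard(shard_paths[0]), batch_size=32, shuffle=True)
```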
The checkpoint produced by this example can directly replace `pytorch_model.bin` of `hfl/chinese-roberta-wwm-ext-large`. You can then use Hugging Face `transformers` to finetune downstream applications.
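For example, loading the pretrained weights for a downstream classification task might look like the sketch below. The local directory is an assumption: it should contain the config and tokenizer files of `hfl/chinese-roberta-wwm-ext-large`, with `pytorch_model.bin` replaced by the checkpoint produced here:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Local copy of the hfl/chinese-roberta-wwm-ext-large directory, with
# pytorch_model.bin replaced by the checkpoint from this example.
model_dir = "./chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=2)

inputs = tokenizer("这部电影很好看", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```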
This example is contributed by the AI team at Moore Threads. If you find any problems with pretraining, please file an issue or send an email to [email protected]. Finally, any form of contribution is welcome!