minGPT-DDP

Code accompanying the tutorial at https://pytorch.org/tutorials/intermediate/ddp_series_minGPT.html for training a GPT-like model with Distributed Data Parallel (DDP) in PyTorch.

Files marked with an asterisk (*) are adapted from the minGPT repo (https://github.com/karpathy/minGPT).

  • trainer.py includes the Trainer class that runs the distributed training iterations on the model with the provided dataset.
  • model.py * defines the model architecture.
  • char_dataset.py * contains the Dataset class for a character-level dataset.
  • gpt2_train_cfg.yaml contains the configurations for data, model, optimizer and training run.
  • main.py is the entry point to the training job. It sets up the DDP process group, reads all the configurations, and runs the training job.
  • slurm/ contains files for setting up an AWS cluster and the Slurm script to run multi-node training.
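The core pattern that main.py and trainer.py implement can be sketched as follows. This is a minimal, self-contained illustration (not code from this repo): it initializes the process group, wraps a stand-in model in DistributedDataParallel, and runs one training step. For portability the sketch uses the CPU-only "gloo" backend and fills in the rendezvous environment variables that torchrun would normally provide, so it also runs as a single process.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun normally sets these; defaults let the sketch run standalone.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # "gloo" keeps this CPU-only; GPU training would typically use "nccl".
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    model = torch.nn.Linear(8, 2)      # stand-in for the GPT model
    ddp_model = DDP(model)             # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    inputs = torch.randn(4, 8)
    loss = ddp_model(inputs).sum()
    loss.backward()                    # DDP synchronizes gradients here
    optimizer.step()

    dist.destroy_process_group()
    return rank

if __name__ == "__main__":
    main()
```

In an actual multi-GPU run, a launcher such as `torchrun --standalone --nproc_per_node=4 main.py` starts one process per GPU and supplies the rank and world-size environment variables that this sketch fills in by hand.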