# Booster Plugins
Author: Hongxin Liu, Baizhou Zhang, Pengtai Xu
Prerequisite:
- Booster API
As mentioned in Booster API, we can use booster plugins to customize parallel training. In this tutorial, we will introduce how to use booster plugins.
We currently provide the following plugins:
- Torch DDP Plugin: wraps `torch.nn.parallel.DistributedDataParallel` and can be used to train models with data parallelism.
- Torch FSDP Plugin: wraps `torch.distributed.fsdp.FullyShardedDataParallel` and can be used to train models with zero-dp.
- Low Level Zero Plugin: wraps `colossalai.zero.low_level.LowLevelZeroOptimizer` and can be used to train models with zero-dp. It only supports zero stage-1 and stage-2.

More plugins are coming soon.
Generally, only one plugin is used to train a model. Our recommended use cases for each plugin are described in the sections below.
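Whichever plugin you choose, the workflow is the same: create the plugin, pass it to a `Booster`, and boost your model, optimizer and other training components before training. Below is a minimal sketch of that pattern; the model, data and hyperparameters are placeholders, and the exact arguments of `colossalai.launch_from_torch` vary across ColossalAI versions.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin
from colossalai.nn.optimizer import HybridAdam

# Initialize the distributed environment first; the exact arguments of
# launch_from_torch differ between ColossalAI versions, so check your version's docs.
colossalai.launch_from_torch()

# Placeholder model, optimizer and loss for illustration only.
model = torch.nn.Linear(16, 2).cuda()
optimizer = HybridAdam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# 1. create a plugin, 2. pass it to the Booster, 3. boost the training components
plugin = LowLevelZeroPlugin(stage=2)
booster = Booster(plugin=plugin)
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion=criterion)

# Training step: use booster.backward instead of loss.backward.
inputs = torch.randn(8, 16).cuda()
labels = torch.randint(0, 2, (8,)).cuda()
loss = criterion(model(inputs), labels)
booster.backward(loss, optimizer)
optimizer.step()
optimizer.zero_grad()
```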
## Low Level Zero Plugin

This plugin implements Zero-1 and Zero-2 (with or without CPU offload), using reduce and gather to synchronize gradients and weights.
Zero-1 can be regarded as a better substitute for Torch DDP: it is more memory-efficient and faster, and it can be easily used in hybrid parallelism.

Zero-2 does not support local gradient accumulation. You can still accumulate gradients if you insist, but doing so does not reduce communication cost, so combining Zero-2 with pipeline parallelism is not a good idea.
{{ autodoc:colossalai.booster.plugin.LowLevelZeroPlugin }}
We've tested compatibility with some well-known models; the following models may not be supported:

- `timm.models.convit_base`
- `torchrec` models

Compatibility problems will be fixed in the future.
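As a quick reference, constructing this plugin might look like the sketch below; `stage` and `precision` are arguments listed in the autodoc above, and the rest of the workflow follows the general pattern shown earlier.

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

# stage=1 selects Zero-1, stage=2 selects Zero-2; see the autodoc above for
# the full set of arguments (e.g. loss-scaling and gradient-clipping options).
plugin = LowLevelZeroPlugin(stage=2, precision="fp16")
booster = Booster(plugin=plugin)
# model, optimizer, etc. are then wrapped via booster.boost(...) as usual.
```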
## Gemini Plugin

This plugin implements Zero-3 with chunk-based and heterogeneous memory management. It can train large models without much loss in speed. It also does not support local gradient accumulation. More details can be found in Gemini Doc.
{{ autodoc:colossalai.booster.plugin.GeminiPlugin }}
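A minimal construction sketch, leaving all chunk and memory-placement options at their defaults; the autodoc above lists the tunable arguments.

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# Defaults are used here; chunk search and memory placement behaviour can be
# tuned through the constructor arguments documented above.
plugin = GeminiPlugin()
booster = Booster(plugin=plugin)
# Boost the model/optimizer with booster.boost(...) as in the general sketch.
```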
## Hybrid Parallel Plugin

This plugin implements the combination of various parallel training strategies and optimization tools. The features of HybridParallelPlugin can be generally divided into four parts:

1. Shardformer: Tensor parallelism and pipeline parallelism are implemented through Shardformer, which handles model sharding (see the compatibility note below).
2. Mixed Precision Training: Support for fp16/bf16 mixed precision training. More details about its arguments configuration can be found in Mixed Precision Training Doc.
3. Torch DDP: This plugin automatically adopts PyTorch DDP as the data parallel strategy when pipeline parallelism and Zero are not used. More details about its arguments configuration can be found in PyTorch DDP Docs.
4. Zero: This plugin can adopt Zero 1/2 as the data parallel strategy by setting the `zero_stage` argument to 1 or 2 when initializing the plugin. Zero 1 is compatible with pipeline parallelism, while Zero 2 is not. More details about its argument configuration can be found in Low Level Zero Plugin.
⚠ When using this plugin, only the subset of Huggingface transformers supported by Shardformer is compatible with tensor parallelism, pipeline parallelism, and the optimization tools. Mainstream transformers such as Llama 1, Llama 2, OPT, Bloom, Bert, and GPT2 are all supported by Shardformer.
{{ autodoc:colossalai.booster.plugin.HybridParallelPlugin }}
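For illustration, the sketch below combines 2-way tensor parallelism, 2-way pipeline parallelism and Zero-1 on the remaining data-parallel dimension; `tp_size`, `pp_size`, `zero_stage`, `precision` and `num_microbatches` are arguments from the autodoc above, while the concrete values are only examples.

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

# Example: tensor parallel size 2, pipeline parallel size 2, Zero-1 for the
# data-parallel dimension, fp16 mixed precision. When pp_size > 1, the number
# of microbatches (or a microbatch size) must be specified for the scheduler.
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=2,
    zero_stage=1,
    precision="fp16",
    num_microbatches=4,
)
booster = Booster(plugin=plugin)
# When pipeline parallelism is enabled, training steps are usually run through
# booster.execute_pipeline(...) instead of a plain forward/backward pass.
```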
## Torch DDP Plugin

More details can be found in the PyTorch DDP documentation.
{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}
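A minimal construction sketch; the constructor arguments (not shown) are forwarded to `torch.nn.parallel.DistributedDataParallel` and are listed in the autodoc above.

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

# Arguments such as broadcast_buffers or find_unused_parameters can be passed
# through the plugin constructor; defaults are used here.
plugin = TorchDDPPlugin()
booster = Booster(plugin=plugin)
```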
## Torch FSDP Plugin

⚠ This plugin is not available when the torch version is lower than 1.12.0.
⚠ This plugin does not yet support saving/loading sharded model checkpoints.
⚠ This plugin does not support optimizers that use multiple parameter groups.
More details can be found in the PyTorch FSDP documentation.
{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}
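A minimal construction sketch (requires torch >= 1.12.0, as noted above); constructor arguments are forwarded to `torch.distributed.fsdp.FullyShardedDataParallel`.

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchFSDPPlugin

# Defaults are used here; FSDP-specific options (e.g. cpu_offload or
# mixed_precision policies) can be passed through the constructor; see the
# autodoc above for the accepted arguments.
plugin = TorchFSDPPlugin()
booster = Booster(plugin=plugin)
```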
<!-- doc-test-command: echo -->