Distributed Training
====================
Sentence Transformers implements two forms of distributed training: Data Parallel (DP) and Distributed Data Parallel (DDP). Read the `Data Parallelism documentation <https://huggingface.co/docs/transformers/en/perf_train_gpu_many#data-parallelism>`_ on Hugging Face for more details on these strategies. Some of the key differences include:

- DDP is generally faster than DP because it has to communicate less between the GPUs.
- With DP, GPU 0 does the bulk of the work, while with DDP the workload is distributed more evenly across all GPUs.
- DDP allows for training across multiple machines, whereas DP is limited to a single machine.
In short, DDP is generally recommended. You can use DDP by running your normal training scripts with ``torchrun`` or ``accelerate``. For example, if you have a script called ``train_script.py``, you can run it with DDP using one of the following commands:
.. |br| raw:: html

   <div style="line-height: 0; padding: 0; margin: 0"></div>

.. tab:: Via ``torchrun``

    |br| See the `torchrun documentation <https://pytorch.org/docs/stable/elastic/run.html>`_ for more information::

        torchrun --nproc_per_node=4 train_script.py

.. tab:: Via ``accelerate``

    |br| See the `accelerate documentation <https://huggingface.co/docs/accelerate/en/index>`_ for more information::

        accelerate launch --num_processes 4 train_script.py
.. note::

    When performing distributed training, you have to wrap your code in a ``main`` function and call it under ``if __name__ == "__main__":``. This is because each process will run the entire script, so you don't want the same setup code to run multiple times. Here is an example of how to do this::

        from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments, SentenceTransformerTrainer
        # Other imports here

        def main():
            # Your training code here
            ...

        if __name__ == "__main__":
            main()
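To see why the guard matters, here is a stdlib-only sketch (no Sentence Transformers required): it writes a tiny script to a temporary file, then runs it both directly and via ``import``, emulating how a distributed worker process may re-load your script. The file name ``train_script.py`` is only illustrative.

```python
import pathlib
import subprocess
import sys
import tempfile
import textwrap

# A tiny script with top-level code plus a guarded main().
script = textwrap.dedent("""
    print("top-level code: runs in every process that loads this file")

    def main():
        print("main(): runs only when the script is launched directly")

    if __name__ == "__main__":
        main()
""")

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "train_script.py"
    path.write_text(script)

    # Direct launch, as `python train_script.py` (or one rank of torchrun):
    direct = subprocess.run(
        [sys.executable, str(path)], capture_output=True, text=True
    )
    # Import, as a spawned worker process would re-load the script:
    imported = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.path.insert(0, {tmp!r}); import train_script"],
        capture_output=True,
        text=True,
    )

print("direct:", direct.stdout.strip().splitlines())
print("import:", imported.stdout.strip().splitlines())
```

Both runs execute the top-level ``print``, but only the direct launch calls ``main()`` — which is exactly why unguarded training code would be re-executed in every worker.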
.. note::

    When using an `Evaluator <../training_overview.html#evaluator>`_, the evaluator only runs on the first device, unlike the training and evaluation datasets, which are shared across all devices.
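If you want similar one-process-only behaviour in your own code (e.g. extra logging or a custom evaluation step), you can check the rank environment variables that ``torchrun`` and ``accelerate launch`` set for each worker. A minimal sketch, assuming the standard ``RANK`` variable:

```python
import os

def is_main_process() -> bool:
    """True on the first worker; RANK is set by torchrun/accelerate launch
    and is unset when the script runs without a distributed launcher."""
    return int(os.environ.get("RANK", "0")) == 0

if is_main_process():
    # Only rank 0 performs the side effect, mirroring how the evaluator
    # only runs on the first device.
    print("rank 0: running one-off work such as logging or evaluation")
```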
The following table shows how to launch each strategy. The speedup of DDP over DP and no parallelism was benchmarked with this setup:

- **Hardware**: a ``p3.8xlarge`` AWS instance, i.e. 4x V100 GPUs
- **Model**: `microsoft/mpnet-base <https://huggingface.co/microsoft/mpnet-base>`_ (133M parameters), cf. `sentence-transformers/all-mpnet-base-v2 <https://huggingface.co/sentence-transformers/all-mpnet-base-v2>`_
- **Losses**: :class:`~sentence_transformers.losses.SoftmaxLoss` for MultiNLI and SNLI, :class:`~sentence_transformers.losses.CosineSimilarityLoss` for STSB

.. list-table::
   :header-rows: 1

   * - Strategy
     - Launcher
   * - No Parallelism
     - ``CUDA_VISIBLE_DEVICES=0 python train_script.py``
   * - Data Parallel (DP)
     - ``python train_script.py`` (DP is used by default when launching a script with ``python``)
   * - Distributed Data Parallel (DDP)
     - ``torchrun --nproc_per_node=4 train_script.py`` or ``accelerate launch --num_processes 4 train_script.py``

FSDP
----

Fully Sharded Data Parallelism (FSDP) is another distributed training strategy that is not fully supported by Sentence Transformers. It is a more advanced version of DDP that is particularly useful for very large models. Note that in the previous comparison, FSDP reaches 5782 samples per second (2.122x speedup), i.e. worse than DDP; FSDP only makes sense with very large models. If you want to use FSDP with Sentence Transformers, you have to be aware of the following limitations:
- You cannot use the ``evaluator`` functionality with FSDP.
- Saving an FSDP model requires calling ``trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")`` followed by ``trainer.save_model("output")``.
- FSDP requires ``fsdp=["full_shard", "auto_wrap"]`` and ``fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"}`` in your ``SentenceTransformerTrainingArguments``, where ``BertLayer`` is the repeated layer in the encoder that houses the multi-head attention and feed-forward layers, so e.g. ``BertLayer`` or ``MPNetLayer``.

Read the `FSDP documentation <https://huggingface.co/docs/accelerate/en/usage_guides/fsdp>`_ by Accelerate for more details.
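As a sketch, the FSDP settings above might be combined like this for an MPNet-based model. The ``output_dir`` value and the ``"MPNetLayer"`` choice are illustrative; substitute the repeated encoder layer class of your own model.

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Illustrative FSDP configuration for an MPNet-based model; swap
# "MPNetLayer" for the repeated encoder layer of your architecture
# (e.g. "BertLayer" for a BERT-based model).
args = SentenceTransformerTrainingArguments(
    output_dir="output",  # illustrative path
    fsdp=["full_shard", "auto_wrap"],
    fsdp_config={"transformer_layer_cls_to_wrap": "MPNetLayer"},
)

# After training, save the full (unsharded) weights as described above:
# trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
# trainer.save_model("output")
```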