# dist_checkpointing

A library for saving and loading distributed checkpoints.
A distributed checkpoint in Megatron Core uses the `torch_dist` format, a custom checkpointing mechanism built on top of PyTorch's native checkpointing capabilities.
A key property of distributed checkpoints is that a checkpoint saved under one parallel configuration (tensor, pipeline, or data parallelism) can be loaded under a different parallel configuration. This enables flexible scaling and resharding of models across heterogeneous training setups.
Using the library requires defining sharded `state_dict` dictionaries with functions from the `mapping` and `optimizer` modules. Those state dicts can then be saved or loaded via the `serialization` module, using strategies from the `strategies` module.
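The property that makes resharding possible can be pictured with a toy model in plain Python (this is an illustration of the idea, not the Megatron Core API): each rank saves its shard together with the shard's offset in the global tensor, so any number of ranks can reassemble the global tensor on load and cut it differently.

```python
# Toy illustration of checkpoint resharding (not the Megatron Core API).
# Each saving rank records where its shard lives in the global tensor;
# on load, the global tensor is reassembled and re-split for the new
# number of ranks.

def save_sharded(rank_shards):
    """rank_shards: list of (global_offset, values) pairs, one per rank."""
    return {offset: list(values) for offset, values in rank_shards}

def load_resharded(checkpoint, new_world_size):
    """Reassemble the global tensor, then re-split it for the new ranks."""
    flat = []
    for offset in sorted(checkpoint):
        flat.extend(checkpoint[offset])
    shard_len = len(flat) // new_world_size
    return [flat[i * shard_len:(i + 1) * shard_len]
            for i in range(new_world_size)]

# Saved by 4 ranks, loaded by 2: same global tensor, cut differently.
ckpt = save_sharded([(0, [1, 2]), (2, [3, 4]), (4, [5, 6]), (6, [7, 8])])
shards = load_resharded(ckpt, new_world_size=2)
```

The real library generalizes this with `ShardedTensor` metadata describing multi-dimensional offsets, but the offset-based reassembly is the same basic principle.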
Since PyTorch 2.6, the default behavior of `torch.load` is `weights_only=True`. This ensures that only tensors and allow-listed classes are loaded, reducing the risk of arbitrary code execution. If you encounter an error such as:

```
WeightsUnpickler error: Unsupported global: GLOBAL argparse.Namespace was not an allowed global by default.
```

you can fix it by explicitly allow-listing the missing class in your script:

```python
import argparse

import torch

torch.serialization.add_safe_globals([argparse.Namespace])
```
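The `weights_only` restriction is built on pickle's class-lookup hook. A simplified, pure-stdlib model (not torch's actual implementation) shows both why the error above occurs and why allow-listing fixes it:

```python
# Simplified model of weights_only loading (not torch's implementation):
# an Unpickler that refuses any global not on an explicit allow-list.
import argparse
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    allowed = set()  # set of (module, qualname) pairs

    def find_class(self, module, name):
        if (module, name) in self.allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"Unsupported global: {module}.{name} was not allowed")

payload = pickle.dumps(argparse.Namespace(lr=0.1))

# Without allow-listing, loading fails (analogous to the error above):
try:
    RestrictedUnpickler(io.BytesIO(payload)).load()
    failed = False
except pickle.UnpicklingError:
    failed = True

# Allow-listing the class (analogous to add_safe_globals) makes it loadable:
RestrictedUnpickler.allowed.add(("argparse", "Namespace"))
ns = RestrictedUnpickler(io.BytesIO(payload)).load()
```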
Beginning with mcore v0.14, the `flattened_range` attribute was removed from `dist_checkpointing`. As a result, legacy checkpoints whose optimizer states were saved in the old format can no longer be loaded directly; either convert the checkpoint as described below, or skip loading optimizer states with the `--no-load-optim` flag.

### Step 1: Convert the legacy checkpoint using mcore v0.15.0

Run a dummy training job with mcore v0.15.0 to re-save the checkpoint with the new optimizer state format.
```bash
MODEL_TRAIN_PARAMS=(
  # Define model architecture and training parameters here
)

OLD_CKPT=/workspace/mcore_ckpt_old
CONVERTED_CKPT=/workspace/mcore_ckpt_0.15.0

torchrun --nproc_per_node=8 /opt/megatron-lm/pretrain_gpt.py \
  --save-interval 1 \
  --eval-interval 1 \
  --exit-interval 1 \
  --eval-iters 1 \
  --use-distributed-optimizer \
  --save ${CONVERTED_CKPT} \
  --load ${OLD_CKPT} \
  --ckpt-format torch_dist \
  "${MODEL_TRAIN_PARAMS[@]}"
```
### Step 2: Load the converted checkpoint with ToT MCore

Use the converted checkpoint as the input for continued training with top-of-tree (ToT) MCore.
```bash
MODEL_TRAIN_PARAMS=(
  # Define model architecture and training parameters here
)

NEW_CKPT=/workspace/mcore_ckpt_new
CONVERTED_CKPT=/workspace/mcore_ckpt_0.15.0

torchrun --nproc_per_node=8 /opt/megatron-lm/pretrain_gpt.py \
  --use-distributed-optimizer \
  --save ${NEW_CKPT} \
  --load ${CONVERTED_CKPT} \
  --ckpt-format torch_dist \
  "${MODEL_TRAIN_PARAMS[@]}"
```
After this step, training can proceed normally using ToT MCore with fully supported optimizer state loading.
The refactor of the Distributed Optimizer introduces two checkpoint formats:

- `dp_reshardable` (the default): faster checkpointing, but the optimizer state can be resharded only across data-parallel configurations.
- A fully reshardable format, enabled with the `--dist-ckpt-optim-fully-reshardable` flag, which also supports resharding across model-parallel configurations.

You can combine formats to optimize both flexibility and performance:

1. Train using `dp_reshardable` (default) for faster checkpointing.
2. When you need to change model parallelism, enable the fully reshardable format by adding `--dist-ckpt-optim-fully-reshardable`. Save at least one checkpoint under the new model parallel configuration.
3. (Optional) To continue training with the updated model parallelism and better checkpointing performance, stop training and switch back to the `dp_reshardable` format by removing `--dist-ckpt-optim-fully-reshardable`.
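The restriction behind `dp_reshardable` can be pictured with a toy model in plain Python (an illustration, not the actual optimizer sharding code): the distributed optimizer shards a flat state buffer across data-parallel ranks, and because the buffer's layout is fixed by the model-parallel split, only the data-parallel cut can be redone on load.

```python
# Toy model of DP-only resharding (not the actual Megatron-LM code).
def reshard_dp(dp_shards, new_dp_size):
    """Re-split flat per-rank optimizer shards for a new DP size.

    Concatenating the DP shards recovers one flat buffer whose layout is
    tied to the model-parallel split, so re-cutting it along the DP
    dimension is always possible. Changing model parallelism would
    require reinterpreting the layout itself, which is what the fully
    reshardable format stores extra metadata for.
    """
    flat = [v for shard in dp_shards for v in shard]
    size = len(flat) // new_dp_size
    return [flat[i * size:(i + 1) * size] for i in range(new_dp_size)]

# Optimizer state saved with DP=4, resumed with DP=2:
new_shards = reshard_dp([[1, 2], [3, 4], [5, 6], [7, 8]], new_dp_size=2)
```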
```{toctree}
:maxdepth: 4

dist_checkpointing.strategies
```