docs/user-guide/features/fine_grained_activation_offloading.md
Memory capacity is increasingly important with the rise of extremely sparse MoE models such as DeepSeek-V3 and Qwen3-235B. Fine-grained recomputation reduces the memory footprint at the cost of extra recomputation, while offloading can use the host-device bandwidth to achieve nearly zero overhead. Fine-grained Activation Offloading offloads activations at the granularity of specific modules, so the amount of offloaded activation can be calibrated to maximize training throughput.
Currently, the supported offloading modules are "attn_norm", "core_attn", "attn_proj", "mlp_norm", "expert_fc1", and "moe_act". Together with fine-grained recomputation, they can release almost all activations of a transformer layer.
## Features

## Usage
```bash
# Enable fine-grained activation offloading
--fine-grained-activation-offloading

# Specify which modules will offload their inputs.
# Choices: "attn_norm", "core_attn", "attn_proj", "mlp_norm", "expert_fc1", "moe_act".
--offload-modules expert_fc1
```
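
For context, here is a minimal launch sketch that embeds these flags into a training command; the script name `pretrain_gpt.py` and the `torchrun` arguments are placeholders for an existing setup, not part of this feature.

```bash
# Illustrative only: replace the script name and launcher arguments with
# those of your existing training command.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --fine-grained-activation-offloading \
    --offload-modules expert_fc1
```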
### Compatible with Fine-grained Recomputation
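
Because both features select activations at the module level, offloaded and recomputed modules can be chosen so that they cover complementary parts of a transformer layer. The snippet below is a hedged sketch that assumes fine-grained recomputation is driven by `--recompute-granularity selective` and `--recompute-modules` (flag names borrowed from the companion recomputation feature; verify them against your version).

```bash
# Assumption: the recomputation flags below come from the fine-grained
# recomputation feature and may differ in your version.
--fine-grained-activation-offloading
--offload-modules expert_fc1
--recompute-granularity selective
--recompute-modules moe_act
```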