FP8 RL in verl

Last updated: 03/05/2026

verl supports two FP8 modes for accelerating RL training:

| Mode | Training Precision | Rollout Precision |
| --- | --- | --- |
| FP8 Rollout Only | BF16 | FP8 |
| FP8 End-to-End | FP8 (Megatron) | FP8 (vLLM) |

> [!TIP]
> For ready-to-run scripts, see the low-precision recipe directory.


FP8 Rollout Only

FP8 rollout-only mode keeps training in BF16 and quantizes rollout inference to FP8. This reduces GPU memory during generation and speeds up rollout without affecting training precision.

Implementation

We monkey patch several vLLM functions to enable FP8 rollout for reinforcement learning (a sketch of the patching pattern follows this list):

  1. Quantize weights: Quantize model weights on-the-fly from higher-precision formats to FP8.
  2. Process weights after loading: For vLLM, we replace the vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod.process_weights_after_loading function to handle weight processing after quantization. For SGLang, this patch is not needed as it natively supports loading quantized weights.
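
The patching itself follows the standard Python monkey-patching pattern. Below is a minimal, illustrative sketch, not verl's actual implementation: the patched attribute path comes from the list above, while the skip condition and control flow are assumptions made for illustration.

```python
import torch
from vllm.model_executor.layers.quantization import fp8 as vllm_fp8

# Keep a handle to vLLM's original implementation so we can fall back to it.
_original_process = vllm_fp8.Fp8LinearMethod.process_weights_after_loading

def _patched_process_weights_after_loading(self, layer):
    # Hypothetical condition: if the trainer already pushed FP8 weights (and
    # scales) into this layer during weight sync, skip vLLM's own
    # re-quantization pass; otherwise defer to the original logic.
    weight = getattr(layer, "weight", None)
    if weight is not None and weight.dtype == torch.float8_e4m3fn:
        return
    _original_process(self, layer)

# Install the patch before the rollout engine loads or syncs weights.
vllm_fp8.Fp8LinearMethod.process_weights_after_loading = (
    _patched_process_weights_after_loading
)
```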

Support Matrix

  • FP8 blockwise quantization for rollout
    • The same scheme used by DeepSeek: 1x128 quantization for activations and 128x128 quantization for model weights (see the sketch after this list)
  • Dense and MoE models
  • Async rollout interfaces
  • vLLM 0.10.x, 0.11, and 0.12; SGLang 0.5.5
  • FSDP and Megatron training backends
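
To make the blockwise scheme concrete, here is a minimal, illustrative sketch of 128x128 blockwise weight quantization to FP8 E4M3 with one FP32 scale per block. The helper name and padding handling are assumptions for illustration, not verl's or vLLM's actual kernel.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    """Quantize a 2D weight to FP8 E4M3 with one FP32 scale per 128x128 block."""
    out_f, in_f = w.shape
    # Pad so the tensor tiles evenly into 128x128 blocks.
    pad_r, pad_c = (-out_f) % block, (-in_f) % block
    w_pad = torch.nn.functional.pad(w.float(), (0, pad_c, 0, pad_r))
    r, c = w_pad.shape
    tiles = w_pad.view(r // block, block, c // block, block)
    # Per-block max magnitude -> per-block scale mapping onto the FP8 range.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (tiles / scale).to(torch.float8_e4m3fn)
    q = q.view(r, c)[:out_f, :in_f].contiguous()
    # Dequantization per block: w ~= q.float() * scale
    return q, scale.squeeze(-1).squeeze(1)  # FP8 weight + per-block FP32 scales
```

Activations follow the same idea with 1x128 groups along the hidden dimension, i.e. one scale per 128 consecutive activation values.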

Usage

Enable in config file:

```yaml
rollout:
  quantization: "fp8"
```

Or via command line:

```bash
actor_rollout_ref.rollout.quantization=fp8
```

Experiments and Outcomes

Qwen3-8B-Base Dense Model

Configuration

  • DAPO recipe. AIME24 online validation.
  • vLLM (FP8 SPMD rollout) + FSDP
    • Note that SPMD rollout has since been deprecated, so the FP8 SPMD rollout path has been removed.
  • Prompt batch size 32, n=16
  • Rollout batch size: 32*3*16
  • train_batch_size & ppo_mini_batch_size: 32
  • Max response length 20K
  • Token-level TIS, C=2
  • 8*H100
  • vLLM 0.10.0+CUDA 12.6 vs vLLM 0.11.0+CUDA 12.9

Accuracy

dark green: BF16, orange: FP8 rollout + token-level TIS, light green: FP8 rollout without TIS

Results and observations:

  • With token-level TIS, FP8 rollout accuracy aligns with BF16 (a sketch of the TIS correction follows this list)
  • There is a noticeable accuracy drop when TIS is not enabled
  • The rollout/training mismatch KL is higher, but it stays within an acceptable range throughout training
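
For context, token-level TIS (truncated importance sampling) reweights each sampled token by the ratio of the training-policy probability to the rollout-engine probability, truncated at a cap C (C=2 in these experiments). The following is a minimal sketch assuming per-token log-probs from both the trainer and the FP8 rollout engine; the function name and the commented loss combination are illustrative, not verl's exact code.

```python
import torch

def tis_weights(train_logprobs: torch.Tensor,
                rollout_logprobs: torch.Tensor,
                clip_c: float = 2.0) -> torch.Tensor:
    """Per-token importance ratios pi_train / pi_rollout, truncated at clip_c."""
    ratio = torch.exp(train_logprobs - rollout_logprobs)
    return torch.clamp(ratio, max=clip_c)

# The truncated ratio multiplies the per-token policy-gradient loss so that
# tokens sampled from the (FP8) rollout distribution are reweighted toward the
# training distribution, e.g. (illustrative):
# loss = -(tis_weights(train_lp, rollout_lp).detach() * advantages * train_lp).mean()
```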

Performance

green: BF16, orange: FP8 rollout + CUDA 12.6 + DeepGEMM, purple: FP8 rollout + CUDA 12.9 + DeepGEMM

Results and observations:

  • FP8 rollout yields roughly a 12% rollout speedup with CUDA 12.6 + DeepGEMM
  • Upgrading to CUDA 12.9 increases the speedup to roughly 18%

Qwen3-30B-A3B-Base MoE Model

Configuration

  • DAPO recipe. AIME24 online validation.
  • FP8 async rollout, vLLM+FSDP
  • Prompt batch size 32
  • Rollout batch size: 32*3*16
  • train_batch_size & ppo_mini_batch_size: 32
  • Max response length 20K
  • Token-level TIS, C=2
  • 2*8*H100
  • vLLM 0.10.0+CUDA 12.6

Accuracy

grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS

Results and observations:

  • The rollout/training distribution mismatch is generally higher for MoE models
  • Rollout correction is required even with BF16 rollout
  • FP8 rollout with token-level TIS aligns with BF16

Performance

grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS

Results and observations:

  • FP8 rollout: over 35% rollout speedup
  • Additional performance gains are expected with CUDA 12.9

FP8 End-to-End (Training + Rollout)

FP8 E2E applies FP8 to the entire RL pipeline: forward/backward passes via Transformer Engine, FP8 optimizer states, and FP8 rollout inference via vLLM. This maximizes memory savings and throughput.

Requirements

  • CUDA 12.9+ (required for block-wise FP8 scaling)
  • Transformer Engine with block-wise FP8 support
  • Environment variable: NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1

Key Configuration

```yaml
# FP8 training via Transformer Engine
actor_rollout_ref.actor.megatron.override_transformer_config:
  fp8: "hybrid"              # FP8 forward + backward; also supports "e4m3"
  fp8_recipe: "blockwise"    # block-wise scaling

# FP8 optimizer
actor_rollout_ref.actor.optim.override_optimizer_config:
  fp8_recipe: "blockwise"

# FP8 rollout inference (vLLM)
actor_rollout_ref.rollout:
  quantization: fp8
```

Support Matrix

  • Megatron training backend (via Megatron-Bridge)
  • Verified on Qwen3-30B-A3B and Qwen3-8B
  • Block-wise FP8 scaling (fp8_recipe: "blockwise")

Experiments and Results

Qwen3-30B-A3B MoE Model

Configuration

  • DAPO recipe. AIME24 online validation.
  • Megatron + Megatron-Bridge, FP8 async rollout with vLLM
  • MoE router in BF16 for both vLLM and Megatron-Core
  • Prompt batch size 128, n=16
  • Max response length 20K
  • Token-level TIS, C=2
  • 2*8*H100, CUDA 12.9

orange: BF16, green: FP8 E2E, red: FP8 rollout + BF16 training

Results and observations:

  • FP8 E2E achieves comparable accuracy to the BF16 baseline, with the two curves closely aligned throughout training.
  • The training/inference precision mismatch (measured by KL divergence) follows the ordering: FP8 rollout-only > FP8 E2E > BF16 E2E. This is expected, as FP8 E2E maintains consistent precision across both training and inference, resulting in lower distribution mismatch than the FP8 rollout-only setting where training remains in BF16 (a sketch of how this mismatch KL can be estimated follows this list).
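
For reference, the mismatch KL can be estimated from the same per-token log-probs used for TIS. The sketch below is illustrative and may differ from the exact estimator used in verl's logs; it is the plain Monte-Carlo estimate of KL(pi_rollout || pi_train) over tokens sampled by the rollout engine.

```python
import torch

def mismatch_kl(rollout_logprobs: torch.Tensor,
                train_logprobs: torch.Tensor,
                response_mask: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of KL(pi_rollout || pi_train) over valid response tokens."""
    per_token_kl = rollout_logprobs - train_logprobs  # log(pi_rollout / pi_train)
    return (per_token_kl * response_mask).sum() / response_mask.sum()
```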

Citation

For more extensive experiments, ablation studies, and analysis on FP8 reinforcement learning, please refer to our technical report:

```bibtex
@article{qiu2026fp8rl,
  title={FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning},
  author={Qiu, Zhaopeng and Yu, Shuang and Zhang, Jingqi and Zhang, Shuai and Huang, Xue and Yang, Jingyi and Lai, Junjie},
  journal={arXiv preprint arXiv:2601.18150},
  year={2026},
  url={https://arxiv.org/abs/2601.18150}
}
```