This model was published in HF papers on 2025-12-02 and contributed to Hugging Face Transformers on 2026-06-10.
<div style="float: right;"> <div class="flex flex-wrap space-x-1"></div> </div>
DeepSeek-V3.2-Exp is an experimental release from DeepSeek-AI that introduces DeepSeek Sparse Attention (DSA), a trainable, fine-grained sparse attention mechanism designed to improve training and inference efficiency in long-context scenarios. It is built directly on top of DeepSeek-V3.1-Terminus: the model keeps the same 685B-parameter Mixture-of-Experts (MoE) backbone and Multi-head Latent Attention (MLA), and is obtained through continued training that adds the sparse-attention indexer while deliberately aligning the training distribution with V3.1-Terminus so the two models can be compared head-to-head.
The work was later extended in the DeepSeek-V3.2 technical report, *DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models*, which pairs DSA with a scalable reinforcement-learning framework and reports gold-medal-level results on competition math (IMO) and competitive programming (IOI) benchmarks.
The abstract from the DeepSeek-V3.2-Exp release is the following:
We introduce DeepSeek-V3.2-Exp, an experimental version of our model that incorporates DeepSeek Sparse Attention (DSA) to explore and validate optimizations for training and inference efficiency in long-context scenarios. DeepSeek Sparse Attention achieves fine-grained sparse attention for the first time with minimal impact on model output quality. Built upon DeepSeek-V3.1-Terminus, DeepSeek-V3.2-Exp delivers substantially improved efficiency in both training and inference, especially in long-context settings, while maintaining virtually identical benchmark performance.
DSA reduces the quadratic cost of attention over long sequences by attending only to a selected subset of past tokens. It has two components:
1. A lightweight indexer that scores past tokens for each query. The reference implementation runs it in FP8 behind a Hadamard (`rotate_activation`) transform; because the transform is orthogonal (`Hq·Hk = q·k`) and FP8 is only a precision optimization, the transformers port computes the same scores directly in bf16/fp32, keeping the indexer cheap relative to the main attention.
2. Top-k token selection. Each query keeps only the top `index_topk` (2048 by default) tokens, and main MLA attention is then computed only over those tokens via an additive mask. This turns the per-query attention cost from O(L) to O(`index_topk`) for long sequences when using `flash_mla`, which is not supported yet 😉.

The indexer keeps its own small per-token key cache (single-head, `index_head_dim`) alongside the main K/V cache. In transformers this lives on a dedicated cache layer ([`DynamicIndexedLayer`] for growing caches and [`StaticIndexedLayer`] for static / `torch.compile` caches) and is updated through `past_key_values.update_indexer()`.
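The orthogonality argument for the indexer transform can be checked numerically. The sketch below is illustrative only: it assumes a normalized 4×4 Hadamard matrix as a stand-in orthogonal transform, and the helpers `matvec` and `dot` and the two vectors are made up for the example. It is not the kernel used by the reference implementation or by transformers.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Normalized 4x4 Hadamard matrix: entries are +-1/2, so its rows are orthonormal.
H = [[0.5 * sign for sign in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]

q = [1.0, 2.0, -1.0, 0.5]   # toy query vector (assumption)
k = [0.5, -1.0, 3.0, 2.0]   # toy key vector (assumption)

plain = dot(q, k)                           # q . k
rotated = dot(matvec(H, q), matvec(H, k))   # (Hq) . (Hk)

print(plain, rotated)  # -3.5 -3.5
```

Because both dot products agree, scoring the rotated or the unrotated vectors gives the same ranking of past tokens; only the numerical format in which the product is computed differs.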
In DeepSeek-V3.2 every layer runs its own indexer — there is no cross-layer top-k sharing.
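A minimal sketch of the selection step, assuming toy indexer scores and plain Python lists in place of tensors. The function names `topk_additive_mask` and `masked_softmax` are hypothetical and do not correspond to the transformers implementation; the sketch only shows how a per-query top-k over causal positions can become an additive mask of `0` and `-inf`.

```python
import math


def topk_additive_mask(index_scores, index_topk):
    """Build an additive mask: 0.0 for kept positions, -inf for dropped ones.

    index_scores[i][j] is the toy indexer score of query i for key j.
    Query i may only look at keys j <= i (causal attention).
    """
    mask = []
    for i, row in enumerate(index_scores):
        visible = range(i + 1)
        # Keep the index_topk highest-scoring visible positions.
        keep = set(sorted(visible, key=lambda j: row[j], reverse=True)[:index_topk])
        mask.append([0.0 if j in keep else -math.inf for j in range(len(row))])
    return mask


def masked_softmax(logits, mask_row):
    """Softmax over logits + mask; masked positions get weight 0.

    Assumes at least one position is kept, which the causal rule guarantees.
    """
    shifted = [logit + m for logit, m in zip(logits, mask_row)]
    peak = max(shifted)
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


scores = [
    [0.9, 0.0, 0.0, 0.0],
    [0.1, 0.8, 0.0, 0.0],
    [0.7, 0.2, 0.5, 0.0],
    [0.3, 0.9, 0.1, 0.6],
]
mask = topk_additive_mask(scores, index_topk=2)
weights = masked_softmax([1.0, 1.0, 1.0, 1.0], mask[3])

print(mask[3])   # [-inf, 0.0, -inf, 0.0]
print(weights)   # [0.0, 0.5, 0.0, 0.5]
```

With `index_topk=2`, the last query keeps positions 1 and 3 and, since the toy attention logits are equal, splits its weight evenly between them. At the default `index_topk` of 2048, a query in a 131,072-token sequence would keep 1/64 of the positions.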
> [!NOTE]
> The MLA query LoRA path (`q_lora_rank`) is required. The indexer scores queries from the low-rank query latent `q_a_layernorm(q_a_proj(x))` (its `wq_b` projection is sized by `q_lora_rank`), so the model always uses the LoRA query path and `q_lora_rank` must be set; the released checkpoint uses `1536`. The optional non-LoRA `q_proj` path that DeepSeek-V3 exposes for `q_lora_rank=None` is not supported here: without the query latent there is nothing for the indexer to consume.
DeepSeek-V3.2-Exp is distributed as an FP8 checkpoint. The indexer projections are kept out of FP8 quantization, since the checkpoint stores them in bf16/fp32:
```python
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "deepseek-ai/DeepSeek-V3.2-Exp"

# Leave the MoE router gates and the indexer weight projection unquantized.
quantization_config = FineGrainedFP8Config(
    modules_to_not_convert=["model.layers.*.mlp.gate.*", "*.self_attn.indexer.weights_proj.*"],
    weight_block_size=(128, 128),
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a prompt, generate a short continuation, and decode it.
inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
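To give a feel for what `weight_block_size=(128, 128)` implies, here is a rough sketch of block-wise scaling arithmetic, assuming one scale value per 128×128 tile of a 2D weight. The helper `fp8_scale_grid` and the example shape are hypothetical and only illustrate the bookkeeping; the sketch makes no claim about how the library lays out its scale tensors or whether it accepts shapes that do not divide evenly.

```python
import math


def fp8_scale_grid(weight_shape, weight_block_size=(128, 128)):
    """Return how many scale blocks tile a 2D weight along each axis.

    Hypothetical helper: one scale per (block_out x block_in) tile.
    """
    out_features, in_features = weight_shape
    block_out, block_in = weight_block_size
    return (
        math.ceil(out_features / block_out),
        math.ceil(in_features / block_in),
    )


# Illustrative shape only; 1536 matches the q_lora_rank mentioned above.
grid = fp8_scale_grid((7168, 1536))
print(grid)               # (56, 12)
print(grid[0] * grid[1])  # 672 scale values for 7168 * 1536 weights
```

Under this assumption, the scale metadata is small next to the weight itself: 672 scales for roughly 11 million FP8 values in the example shape.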
The original code can be found here.
[[autodoc]] DeepseekV32Config
[[autodoc]] DeepseekV32PreTrainedModel
    - forward

[[autodoc]] DeepseekV32Model
    - forward
[[autodoc]] DeepseekV32ForCausalLM