This model was published in HF papers on 2025-12-02 and contributed to Hugging Face Transformers on 2026-06-10.

</div>

</div>

DeepSeek-V3.2

Overview

DeepSeek-V3.2-Exp is an experimental release from DeepSeek-AI that introduces DeepSeek Sparse Attention (DSA), a trainable, fine-grained sparse attention mechanism designed to improve training and inference efficiency in long-context scenarios. It is built directly on top of DeepSeek-V3.1-Terminus: the model keeps the same 685B-parameter Mixture-of-Experts (MoE) backbone and Multi-head Latent Attention (MLA), and is obtained through continued training that adds the sparse-attention indexer while deliberately aligning the training distribution with V3.1-Terminus so the two models can be compared head-to-head.

The work was later extended in the DeepSeek-V3.2 technical report, DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models, which pairs DSA with a scalable reinforcement-learning framework and reports gold-medal-level results on competition math (IMO) and competitive programming (IOI) benchmarks.

The abstract from the DeepSeek-V3.2-Exp release is the following:

We introduce DeepSeek-V3.2-Exp, an experimental version of our model that incorporates DeepSeek Sparse Attention (DSA) to explore and validate optimizations for training and inference efficiency in long-context scenarios. DeepSeek Sparse Attention achieves fine-grained sparse attention for the first time with minimal impact on model output quality. Built upon DeepSeek-V3.1-Terminus, DeepSeek-V3.2-Exp delivers substantially improved efficiency in both training and inference, especially in long-context settings, while maintaining virtually identical benchmark performance.

DeepSeek Sparse Attention (DSA)

DSA reduces the quadratic cost of attention over long sequences by attending only to a selected subset of past tokens. It has two components:

Lightning indexer. A lightweight, low-head-count scoring module computes an index score between each query and every preceding key. The reference implementation runs it in FP8 after a Hadamard (rotate_activation) transform. Because the transform is orthogonal, (Hq)·(Hk) = q·k, and FP8 is only a precision optimization, the transformers port computes the same scores, up to numerical precision, directly in bf16/fp32. The low head count keeps the indexer cheap relative to the main attention.
Fine-grained token selection. For each query, the indexer keeps the top index_topk tokens (2048 by default), and the main MLA attention is computed only over those tokens by applying an additive mask. With a sparse kernel such as flash_mla, this reduces the per-query attention cost from O(L) to O(index_topk) for long sequences. flash_mla is not supported yet, so the port does not realize that saving today.
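
The two components can be sketched on toy data in plain Python. This is a minimal illustration, not the transformers implementation: the helper names (`hadamard`, `matvec`, `dot`, `topk_additive_mask`) are hypothetical, the index score is simplified to a single-head dot product, and a tiny `topk` of 2 stands in for the 2048 default.

```python
import math


def hadamard(n):
    """Sylvester Hadamard matrix of size n (a power of two), scaled to be orthogonal."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_additive_mask(queries, keys, topk):
    """Build a causal additive mask that keeps the `topk` best-scoring keys per query.

    Hypothetical helper with simplified single-head scoring (plain dot product).
    Kept positions get 0.0, every other position gets -inf.
    """
    mask = []
    for t, query in enumerate(queries):
        # Score only the current and preceding positions (causal).
        scores = [dot(query, keys[s]) for s in range(t + 1)]
        ranked = sorted(range(t + 1), key=lambda s: scores[s], reverse=True)
        kept = set(ranked[:topk])
        mask.append([0.0 if s in kept else -math.inf for s in range(len(keys))])
    return mask


# Component 1: an orthogonal transform leaves the index score unchanged.
H = hadamard(4)
q = [1.0, 2.0, -1.0, 0.5]
k = [0.5, -1.0, 3.0, 2.0]
score_plain = dot(q, k)
score_rotated = dot(matvec(H, q), matvec(H, k))
assert abs(score_plain - score_rotated) < 1e-9

# Component 2: keep the top-2 keys per query and mask out the rest.
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5], [0.5, 3.0]]
queries = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
mask = topk_additive_mask(queries, keys, topk=2)
for row in mask:
    print(row)
```

In a dense attention implementation, a mask like this is added to the attention logits before the softmax, so dropped positions receive zero weight. Every logit is still computed, which is why the O(index_topk) cost depends on a sparse kernel.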

The indexer keeps its own small per-token key cache (single-head, index_head_dim) alongside the main K/V cache. In transformers this lives on a dedicated cache layer — [DynamicIndexedLayer] for growing caches and [StaticIndexedLayer] for static / torch.compile caches — and is updated through past_key_values.update_indexer().
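
To make the extra cache concrete, here is a toy, dependency-free sketch of a growing per-layer indexer key cache. `ToyIndexerKeyCache` and its `update` method are hypothetical stand-ins: they do not reproduce the DynamicIndexedLayer class or the update_indexer() signature, which this page does not spell out.

```python
class ToyIndexerKeyCache:
    """Hypothetical stand-in for a growing indexer key cache (not the transformers class)."""

    def __init__(self, num_layers, index_head_dim):
        self.index_head_dim = index_head_dim
        # One independent, growing list of single-head keys per layer.
        self.keys = [[] for _ in range(num_layers)]

    def update(self, new_keys, layer_idx):
        """Append this step's indexer keys and return every key cached for the layer."""
        for key in new_keys:
            if len(key) != self.index_head_dim:
                raise ValueError(
                    f"expected keys of size {self.index_head_dim}, got {len(key)}"
                )
            self.keys[layer_idx].append(list(key))
        return self.keys[layer_idx]


cache = ToyIndexerKeyCache(num_layers=2, index_head_dim=4)

# Prefill: three prompt tokens produce three indexer keys on each layer.
prompt_keys = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
cache.update(prompt_keys, layer_idx=0)
cache.update(prompt_keys, layer_idx=1)

# Decode: one new token adds a single key; the indexer scores it against all cached keys.
all_layer0_keys = cache.update([[1.3, 1.4, 1.5, 1.6]], layer_idx=0)
print(len(all_layer0_keys), len(cache.keys[1]))
```

Only one small vector per token and per layer is stored, which is what keeps this cache small next to the main K/V cache.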

In DeepSeek-V3.2 every layer runs its own indexer — there is no cross-layer top-k sharing.

[!NOTE] The MLA query LoRA path (q_lora_rank) is required. The indexer scores queries from the low-rank query latent q_a_layernorm(q_a_proj(x)) (its wq_b projection is sized by q_lora_rank), so the model always uses the LoRA query path and q_lora_rank must be set — the released checkpoint uses 1536. The optional non-LoRA q_proj path that DeepSeek-V3 exposes for q_lora_rank=None is not supported here: without the query latent there is nothing for the indexer to consume.
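
The constraint can be sketched as a small shape check. This is an illustrative assumption, not the library's validation code: `ToyIndexerConfig`, the `index_n_heads` field, the output size `index_n_heads * index_head_dim`, and every value except `q_lora_rank=1536` are hypothetical. The (out_features, in_features) layout follows the usual `torch.nn.Linear` weight convention.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyIndexerConfig:
    """Hypothetical config holding only the fields this sketch needs."""

    q_lora_rank: Optional[int]
    index_n_heads: int  # assumed field name, illustrative value
    index_head_dim: int  # illustrative value


def indexer_wq_b_shape(config: ToyIndexerConfig) -> Tuple[int, int]:
    """Return an assumed (out_features, in_features) shape for the indexer's wq_b."""
    if config.q_lora_rank is None:
        raise ValueError(
            "q_lora_rank must be set: the indexer consumes the low-rank query latent."
        )
    # The input size is the query latent size, so wq_b cannot exist without q_lora_rank.
    return (config.index_n_heads * config.index_head_dim, config.q_lora_rank)


lora_config = ToyIndexerConfig(q_lora_rank=1536, index_n_heads=4, index_head_dim=8)
print(indexer_wq_b_shape(lora_config))

try:
    indexer_wq_b_shape(ToyIndexerConfig(q_lora_rank=None, index_n_heads=4, index_head_dim=8))
except ValueError as error:
    print(f"rejected: {error}")
```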

Usage examples

DeepSeek-V3.2-Exp is distributed as an FP8 checkpoint. The example below keeps the indexer's weights_proj projection out of FP8 quantization, since the checkpoint stores it in bf16/fp32, and likewise excludes the MoE router gates (mlp.gate):

```python
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "deepseek-ai/DeepSeek-V3.2-Exp"
quantization_config = FineGrainedFP8Config(
    modules_to_not_convert=["model.layers.*.mlp.gate.*", "*.self_attn.indexer.weights_proj.*"],
    weight_block_size=(128, 128),
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The original code can be found here.

DeepseekV32Config

[[autodoc]] DeepseekV32Config

DeepseekV32PreTrainedModel

[[autodoc]] DeepseekV32PreTrainedModel
    - forward

DeepseekV32Model

[[autodoc]] DeepseekV32Model
    - forward

DeepseekV32ForCausalLM

[[autodoc]] DeepseekV32ForCausalLM