This model was contributed to Hugging Face Transformers on 2026-06-30.
MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model developed by the Xiaomi MiMo team. It is designed to balance long-context modeling with inference efficiency, and it targets complex reasoning and agentic tasks. Trained on 27T tokens with a native 32K sequence length, MiMo-V2-Flash supports an extended 256K context window while significantly reducing KV-cache storage compared to standard global attention models.
For more details, refer to the technical report and the official repository.
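To make the KV-cache claim concrete, here is a rough back-of-the-envelope sketch. It assumes the saving comes from restricting most layers to a short sliding attention window, which is one common way to shrink the cache; the source text does not describe the mechanism. Every dimension below (layer count, KV heads, head size, window length, ratio of global to windowed layers) is a hypothetical placeholder, not a value read from the MiMo-V2-Flash configuration.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, window=None):
    """Approximate KV-cache size in bytes for a single sequence.

    window=None models global attention (every past token is cached);
    an integer window models sliding-window attention (only the most
    recent `window` tokens are cached per layer).
    """
    cached_tokens = seq_len if window is None else min(seq_len, window)
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * cached_tokens * bytes_per_elem


# Hypothetical dimensions, chosen only for illustration.
SEQ_LEN = 256 * 1024      # 256K-token context
NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128
WINDOW = 128              # assumed sliding-window length
GLOBAL_EVERY = 6          # assumed: 1 global-attention layer out of every 6

num_global = NUM_LAYERS // GLOBAL_EVERY
num_windowed = NUM_LAYERS - num_global

# Baseline: every layer uses global attention.
full_bytes = kv_cache_bytes(SEQ_LEN, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

# Hybrid: a few global layers plus many windowed layers.
hybrid_bytes = (
    kv_cache_bytes(SEQ_LEN, num_global, NUM_KV_HEADS, HEAD_DIM)
    + kv_cache_bytes(SEQ_LEN, num_windowed, NUM_KV_HEADS, HEAD_DIM, window=WINDOW)
)

print(f"global-only cache: {full_bytes / 2**30:.2f} GiB")
print(f"hybrid cache:      {hybrid_bytes / 2**30:.2f} GiB")
print(f"reduction factor:  {full_bytes / hybrid_bytes:.2f}x")
```

Under these assumptions the windowed layers contribute almost nothing at long context, so the cache size is dominated by the few global layers. The real figures depend on the released configuration and on the attention implementation in use.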
This model was contributed by casinca.
The examples below demonstrate how to generate text with [`Pipeline`], with the [`AutoModelForCausalLM`] class, and with a chat template.
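from transformers import pipeline

# Build a text-generation pipeline for the checkpoint.
pipe = pipeline(
    task="text-generation",
    model="XiaomiMiMo/MiMo-V2-Flash",
)
# The pipeline returns a list of dicts; print the generated text of the first result.
result = pipe("Explain why sparse MoE models can be efficient at inference.")
print(result[0]["generated_text"])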
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("XiaomiMiMo/MiMo-V2-Flash")
model = AutoModelForCausalLM.from_pretrained(
    "XiaomiMiMo/MiMo-V2-Flash",
    device_map="auto",
)

# The tokenizer returns a dict-like object (input_ids and attention_mask), so unpack it into generate().
inputs = tokenizer("Explain why sparse MoE models can be efficient at inference.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-V2-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are MiMo, a helpful assistant."},
    {"role": "user", "content": "Write a short summary of MiMo-V2-Flash."},
]

# return_dict=True yields input_ids and attention_mask together, so they can be unpacked into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated reply is decoded.
new_tokens = generated_ids[:, inputs["input_ids"].shape[-1]:]
response = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)
[[autodoc]] MiMoV2FlashConfig
[[autodoc]] MiMoV2FlashModel
    - forward
[[autodoc]] MiMoV2FlashForCausalLM
    - forward