docs/source/en/model_doc/qwen3_asr.md
This model was published in HF papers on 2026-01-29 and contributed to Hugging Face Transformers on 2026-06-26.
Qwen3 ASR is an automatic speech recognition model from Alibaba's Qwen team that combines a Whisper-style audio encoder with a Qwen3 language model decoder for speech-to-text transcription. The model supports automatic language detection and multilingual transcription.
A forced aligner model is also included. Given an audio clip and its transcript, it produces word-level timestamps. It uses the same audio encoder with a classification head that predicts a word's length. The aligner works with the transcript from any ASR model (see the example below with Parakeet CTC).
Available checkpoints:
The following languages are supported:
- Qwen3-ASR-1.7B and Qwen3-ASR-0.6B: Chinese (zh), English (en), Cantonese (yue), Arabic (ar), German (de), French (fr), Spanish (es), Portuguese (pt), Indonesian (id), Italian (it), Korean (ko), Russian (ru), Thai (th), Vietnamese (vi), Japanese (ja), Turkish (tr), Hindi (hi), Malay (ms), Dutch (nl), Swedish (sv), Danish (da), Finnish (fi), Polish (pl), Czech (cs), Filipino (fil), Persian (fa), Greek (el), Hungarian (hu), Macedonian (mk), Romanian (ro).
- Qwen3-ForcedAligner-0.6B: Chinese (zh), English (en), Cantonese (yue), French (fr), German (de), Italian (it), Japanese (ja), Korean (ko), Portuguese (pt), Russian (ru), Spanish (es).

See the original repository at QwenLM/Qwen3-ASR and the report for more details.
This model was contributed by Eric Bezzam and Muhammed Tariq.
The simplest way to transcribe audio is with apply_transcription_request, a convenience wrapper around apply_chat_template that handles the chat template formatting for you (see Chat template below).
from transformers import AutoProcessor, AutoModelForMultimodalLM
model_id = "Qwen/Qwen3-ASR-1.7B-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(model_id, device_map="auto")
print(f"Model loaded on {model.device} with dtype {model.dtype}")
inputs = processor.apply_transcription_request(
audio="https://huggingface.co/datasets/bezzam/audio_samples/resolve/main/librispeech_mr_quilter.wav",
).to(model.device, model.dtype)
output_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
# Raw output includes language tag and <asr_text> marker
raw = processor.decode(generated_ids)[0]
print(f"Raw: {raw}")
# Parsed output: dict with "language" and "transcription"
parsed = processor.decode(generated_ids, return_format="parsed")[0]
print(f"Parsed: {parsed}")
# Extract only the transcription text
transcription = processor.decode(generated_ids, return_format="transcription_only")[0]
print(f"Transcription: {transcription}")
"""
Raw: language English<asr_text>Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
Parsed: {'language': 'English', 'transcription': 'Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'}
Transcription: Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
"""
You can provide a language hint to guide the model.
from transformers import AutoProcessor, AutoModelForMultimodalLM
model_id = "Qwen/Qwen3-ASR-1.7B-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(model_id, device_map="auto")
# Without language hint (auto-detect)
inputs = processor.apply_transcription_request(
audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
).to(model.device, model.dtype)
output_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(f"Auto-detect: {processor.decode(generated_ids, return_format='transcription_only')[0]}")
# With language hint
inputs = processor.apply_transcription_request(
audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
language="Chinese", # or language code "zh"
).to(model.device, model.dtype)
output_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(f"With hint: {processor.decode(generated_ids, return_format='transcription_only')[0]}")
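The comment in the example above notes that both a full language name ("Chinese") and a language code ("zh") are accepted. As a rough illustration of how such hints could be validated before building a request, the sketch below maps codes to names using the language list given earlier for the ASR checkpoints. The helper `normalize_language_hint` is hypothetical and is not part of the Transformers API; the processor performs its own handling internally.

```python
from typing import Optional

# Codes and names as listed for Qwen3-ASR-1.7B and Qwen3-ASR-0.6B in this document.
LANGUAGE_NAMES = {
    "zh": "Chinese", "en": "English", "yue": "Cantonese", "ar": "Arabic",
    "de": "German", "fr": "French", "es": "Spanish", "pt": "Portuguese",
    "id": "Indonesian", "it": "Italian", "ko": "Korean", "ru": "Russian",
    "th": "Thai", "vi": "Vietnamese", "ja": "Japanese", "tr": "Turkish",
    "hi": "Hindi", "ms": "Malay", "nl": "Dutch", "sv": "Swedish",
    "da": "Danish", "fi": "Finnish", "pl": "Polish", "cs": "Czech",
    "fil": "Filipino", "fa": "Persian", "el": "Greek", "hu": "Hungarian",
    "mk": "Macedonian", "ro": "Romanian",
}


def normalize_language_hint(language: Optional[str]) -> Optional[str]:
    """Hypothetical helper: return the full language name, or None for auto-detect."""
    if language is None:
        return None  # no hint: let the model detect the language
    key = language.strip().lower()
    if key in LANGUAGE_NAMES:
        return LANGUAGE_NAMES[key]  # input was a language code
    for name in LANGUAGE_NAMES.values():
        if name.lower() == key:
            return name  # input was a full name (case-insensitive)
    raise ValueError(f"Unsupported language hint: {language!r}")


print(normalize_language_hint("zh"))
print(normalize_language_hint("chinese"))
print(normalize_language_hint(None))
```

Failing early on an unsupported hint is a design choice of this sketch; whether the processor raises or falls back to auto-detection is not covered here.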
Batch inference is possible by passing a list of audio inputs and, optionally, a list of language hints (one per audio input).
from transformers import AutoProcessor, AutoModelForMultimodalLM
model_id = "Qwen/Qwen3-ASR-1.7B-hf"
audio = [
"https://huggingface.co/datasets/bezzam/audio_samples/resolve/main/librispeech_mr_quilter.wav",
"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
]
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(model_id, device_map="auto")
inputs = processor.apply_transcription_request(
audio, language=[None, "zh"], # language codes ("zh") and full names ("Chinese") are both accepted
).to(model.device, model.dtype)
output_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
transcriptions = processor.decode(generated_ids, return_format="transcription_only")
for i, text in enumerate(transcriptions):
print(f"Audio {i + 1}: {text}")
Qwen3 ASR also accepts chat template inputs. The apply_transcription_request method used above is a convenience wrapper around apply_chat_template.
from transformers import AutoProcessor, Qwen3ASRForConditionalGeneration
model_id = "Qwen/Qwen3-ASR-1.7B-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3ASRForConditionalGeneration.from_pretrained(model_id, device_map="auto")
# With language hint as system message
chat_template = [
[
{"role": "system", "content": [{"type": "text", "text": "English"}]},
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/bezzam/audio_samples/resolve/main/librispeech_mr_quilter.wav",
},
],
},
],
[
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
},
],
},
],
]
inputs = processor.apply_chat_template(
chat_template, tokenize=True, return_dict=True,
).to(model.device, model.dtype)
output_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
transcriptions = processor.decode(generated_ids, return_format="transcription_only")
for text in transcriptions:
print(text)
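When there are many audio files, writing the nested conversation structure by hand becomes tedious. The sketch below builds the same structure as the example above from a list of audio paths and optional language hints. The helper name `build_conversations` and the placeholder file names are assumptions made for illustration; they are not part of the Transformers API.

```python
from typing import Optional


def build_conversations(
    audio_paths: list[str],
    languages: Optional[list[Optional[str]]] = None,
) -> list[list[dict]]:
    """Hypothetical helper: one conversation per audio, with an optional language hint."""
    if languages is None:
        languages = [None] * len(audio_paths)
    if len(languages) != len(audio_paths):
        raise ValueError("languages must have one entry per audio path")

    conversations = []
    for path, language in zip(audio_paths, languages):
        messages = []
        if language is not None:
            # The language hint goes in a system message, as in the example above.
            messages.append({"role": "system", "content": [{"type": "text", "text": language}]})
        messages.append({"role": "user", "content": [{"type": "audio", "path": path}]})
        conversations.append(messages)
    return conversations


conversations = build_conversations(
    ["first.wav", "second.wav"],  # placeholder paths
    languages=["English", None],  # None leaves the language to auto-detection
)
print(conversations[0][0]["role"])  # system
print(len(conversations[1]))  # 1
```

The returned list has the same shape as `chat_template` in the example above, so it could be passed to `processor.apply_chat_template(conversations, tokenize=True, return_dict=True)` in the same way.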
Qwen3 ASR can be trained with the loss returned by the model when labels are provided.
from transformers import AutoProcessor, Qwen3ASRForConditionalGeneration
model_id = "Qwen/Qwen3-ASR-1.7B-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3ASRForConditionalGeneration.from_pretrained(model_id, device_map="auto")
model.train()
chat_template = [
[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/bezzam/audio_samples/resolve/main/librispeech_mr_quilter.wav",
},
],
}
],
]
inputs = processor.apply_chat_template(
chat_template, tokenize=True, return_dict=True, output_labels=True,
).to(model.device, model.dtype)
loss = model(**inputs).loss
print("Loss:", loss.item())
loss.backward()
Use Qwen3ASRForTokenClassification to obtain word-level timestamps from a transcript. First transcribe with the ASR model, then align with the forced aligner.
The following languages are supported: Chinese (zh), English (en), Cantonese (yue), French (fr), German (de), Italian (it), Japanese (ja), Korean (ko), Portuguese (pt), Russian (ru), Spanish (es).
Japanese requires the nagisa library, while Korean requires the soynlp library:
pip install nagisa soynlp
import torch
from transformers import AutoProcessor, AutoModelForMultimodalLM, AutoModelForTokenClassification
asr_model_id = "Qwen/Qwen3-ASR-0.6B-hf"
aligner_model_id = "Qwen/Qwen3-ForcedAligner-0.6B-hf"
asr_processor = AutoProcessor.from_pretrained(asr_model_id)
asr_model = AutoModelForMultimodalLM.from_pretrained(asr_model_id, device_map="auto")
aligner_processor = AutoProcessor.from_pretrained(aligner_model_id)
aligner_model = AutoModelForTokenClassification.from_pretrained(
aligner_model_id, dtype=torch.bfloat16, device_map="auto"
)
audio_url = "https://huggingface.co/datasets/bezzam/audio_samples/resolve/main/librispeech_mr_quilter.wav"
# Step 1: Transcribe
inputs = asr_processor.apply_transcription_request(audio=audio_url)
inputs = inputs.to(asr_model.device, asr_model.dtype)
output_ids = asr_model.generate(**inputs, max_new_tokens=256)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
parsed = asr_processor.decode(generated_ids, return_format="parsed")[0]
transcript = parsed["transcription"]
language = parsed["language"] or "English"
# Step 2: Prepare alignment inputs
aligner_inputs, word_lists = aligner_processor.prepare_forced_aligner_inputs(
audio=audio_url, transcript=transcript, language=language,
)
aligner_inputs = aligner_inputs.to(aligner_model.device, aligner_model.dtype)
# Step 3: Run forced aligner
with torch.inference_mode():
outputs = aligner_model(**aligner_inputs)
# Step 4: Decode timestamps
timestamps = aligner_processor.decode_forced_alignment(
logits=outputs.logits,
input_ids=aligner_inputs["input_ids"],
word_lists=word_lists,
timestamp_token_id=aligner_model.config.timestamp_token_id,
)[0]
for item in timestamps:
print(f"{item['text']:<20} {item['start_time']:>8.3f}s → {item['end_time']:>8.3f}s")
"""
Word Start (s) End (s)
------------------------------------------
Mr 0.560 0.800
Quilter 0.800 1.280
is 1.280 1.440
the 1.440 1.520
apostle 1.520 2.080
...
"""
The forced aligner is model-agnostic: it accepts transcripts from any ASR system. Below is a batch inference example that uses NVIDIA Parakeet CTC for transcription.
import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForCTC, AutoProcessor, AutoModelForTokenClassification
parakeet_processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-1.1b")
parakeet_model = AutoModelForCTC.from_pretrained(
"nvidia/parakeet-ctc-1.1b", dtype="auto", device_map="cuda",
)
aligner_model_id = "Qwen/Qwen3-ForcedAligner-0.6B-hf"
aligner_processor = AutoProcessor.from_pretrained(aligner_model_id)
aligner_model = AutoModelForTokenClassification.from_pretrained(
aligner_model_id, dtype=torch.bfloat16, device_map="cuda",
)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=parakeet_processor.feature_extractor.sampling_rate))
audio_arrays = [ds[i]["audio"]["array"] for i in range(3)]
sr = ds[0]["audio"]["sampling_rate"]
# Batch transcribe with Parakeet
inputs = parakeet_processor(audio_arrays, sampling_rate=sr, return_tensors="pt", padding=True).to(
parakeet_model.device, dtype=parakeet_model.dtype
)
with torch.inference_mode():
outputs = parakeet_model.generate(**inputs)
transcripts = parakeet_processor.decode(outputs)
# Batch align with Qwen3 Forced Aligner
aligner_inputs, word_lists = aligner_processor.prepare_forced_aligner_inputs(
audio=audio_arrays, transcript=transcripts, language="English",
)
aligner_inputs = aligner_inputs.to(aligner_model.device, aligner_model.dtype)
with torch.inference_mode():
aligner_outputs = aligner_model(**aligner_inputs)
batch_timestamps = aligner_processor.decode_forced_alignment(
logits=aligner_outputs.logits,
input_ids=aligner_inputs["input_ids"],
word_lists=word_lists,
timestamp_token_id=aligner_model.config.timestamp_token_id,
)
for i, (transcript, timestamps) in enumerate(zip(transcripts, batch_timestamps)):
print(f"\n[Sample {i}] {transcript}")
for item in timestamps[:5]:
print(f" {item['text']:<20} {item['start_time']:>8.3f}s → {item['end_time']:>8.3f}s")
if len(timestamps) > 5:
print(f" ... ({len(timestamps) - 5} more words)")
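The per-word entries printed in the alignment examples above carry `text`, `start_time` and `end_time` (in seconds). One common use of such entries is subtitle generation. The following is a minimal sketch, assuming that dictionary layout, which groups words into fixed-size cues and writes them in SubRip (SRT) format. The function names and the `max_words` grouping rule are illustrative choices, not part of the Transformers API; the sample values are taken from the example output shown earlier.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 3) -> str:
    """Group word-level timestamps into cues of at most `max_words` words."""
    cues = []
    for index, start in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[start:start + max_words]
        text = " ".join(word["text"] for word in chunk)
        begin = format_srt_time(chunk[0]["start_time"])  # start of first word
        end = format_srt_time(chunk[-1]["end_time"])  # end of last word
        cues.append(f"{index}\n{begin} --> {end}\n{text}")
    return "\n\n".join(cues) + "\n"


# Same layout as the items returned by decode_forced_alignment in the examples above.
timestamps = [
    {"text": "Mr", "start_time": 0.560, "end_time": 0.800},
    {"text": "Quilter", "start_time": 0.800, "end_time": 1.280},
    {"text": "is", "start_time": 1.280, "end_time": 1.440},
    {"text": "the", "start_time": 1.440, "end_time": 1.520},
    {"text": "apostle", "start_time": 1.520, "end_time": 2.080},
]
srt = words_to_srt(timestamps, max_words=3)
print(srt)
```

Grouping by a fixed word count keeps the sketch short; grouping by pause length or punctuation would usually give more natural subtitle breaks.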
Both the ASR and forced aligner models support torch.compile for faster inference. The forced aligner is an especially good fit for compilation because it runs a single forward pass (no autoregressive decoding). This makes it ideal for bulk audio timestamping: transcribe with any ASR model, then batch-align with the compiled forced aligner for maximum throughput.
On an A100, we observed a speed-up of about 1.2x for a batch size of 4 (script).
import torch
from transformers import AutoProcessor, AutoModelForTokenClassification
model_id = "Qwen/Qwen3-ForcedAligner-0.6B-hf"
num_warmup = 5
batch_size = 4
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, dtype=torch.bfloat16).to("cuda")
# Prepare a batch of 4 samples
audio_url = "https://huggingface.co/datasets/bezzam/audio_samples/resolve/main/librispeech_mr_quilter.wav"
transcript = "Mr. Quilter is the apostle of the middle classes."
aligner_inputs, word_lists = processor.prepare_forced_aligner_inputs(
audio=[audio_url] * batch_size,
transcript=[transcript] * batch_size,
language=["English"] * batch_size,
)
aligner_inputs = aligner_inputs.to("cuda", torch.bfloat16)
# Warm-up and apply model
model = torch.compile(model)
with torch.no_grad():
for _ in range(num_warmup):
_ = model(**aligner_inputs)
with torch.no_grad():
_ = model(**aligner_inputs)
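Speed-up figures depend on hardware, batch size and input length, so it can be worth measuring on your own setup. The sketch below is a small, framework-independent timing harness: it runs warm-up calls first (which is where compilation happens for a compiled model) and then reports the median duration of the timed runs. The helper `time_callable` and its `sync` argument are hypothetical names introduced for this example, and the workload here is a plain Python stand-in rather than a model call.

```python
import time
from statistics import median


def time_callable(fn, num_warmup: int = 5, num_runs: int = 10, sync=None) -> float:
    """Return the median wall-clock time in seconds of fn() after warm-up runs."""
    for _ in range(num_warmup):
        fn()  # warm-up calls are not timed
    if sync is not None:
        sync()
    durations = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work (e.g. GPU kernels) to finish
        durations.append(time.perf_counter() - start)
    return median(durations)


calls = {"n": 0}


def workload():
    """Stand-in for a model forward pass."""
    calls["n"] += 1
    sum(i * i for i in range(10_000))


elapsed = time_callable(workload, num_warmup=2, num_runs=5)
print(f"median: {elapsed * 1e3:.3f} ms over 5 runs ({calls['n']} calls in total)")
```

With the compiled aligner above, the callable could be `lambda: model(**aligner_inputs)` run inside `torch.no_grad()`, with `sync=torch.cuda.synchronize` so that pending GPU work is included in each measurement. Timing the same callable before and after `torch.compile` gives the speed-up ratio.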
For autoregressive transcription, torch.compile accelerates the per-token forward passes inside generate. To enable it, pass a CompileConfig object to generate together with a static cache, as in the example below.
On an A100, we observed a speed-up of about 3.8x for a batch size of 4 (script).
import torch
from transformers import AutoProcessor, AutoModelForMultimodalLM, CompileConfig
model_id = "Qwen/Qwen3-ASR-1.7B-hf"
num_warmup = 3
max_new_tokens = 256
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(model_id, dtype=torch.bfloat16).to("cuda").eval()
audio_url = "https://huggingface.co/datasets/bezzam/audio_samples/resolve/main/librispeech_mr_quilter.wav"
inputs = processor.apply_transcription_request(
audio=[audio_url] * 4, # batch of 4
).to("cuda", torch.bfloat16)
compile_config = CompileConfig()
# Warmup
with torch.inference_mode():
for _ in range(num_warmup):
_ = model.generate(
**inputs, max_new_tokens=max_new_tokens, do_sample=False,
cache_implementation="static", compile_config=compile_config,
)
torch.cuda.synchronize()
# Apply model
with torch.inference_mode():
output_ids = model.generate(
**inputs, max_new_tokens=max_new_tokens, do_sample=False,
cache_implementation="static", compile_config=compile_config,
)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
text_compiled = processor.decode(generated_ids, return_format="transcription_only")[0]
print(f"Output: {text_compiled}")
from transformers import pipeline
model_id = "Qwen/Qwen3-ASR-1.7B-hf"
pipe = pipeline("any-to-any", model=model_id, device_map="auto")
chat_template = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/bezzam/audio_samples/resolve/main/librispeech_mr_quilter.wav",
},
],
}
]
outputs = pipe(text=chat_template, return_full_text=False)
raw_text = outputs[0]["generated_text"]
print(f"Raw: {raw_text}")
# Use processor helper to extract transcription
transcription = pipe.processor.extract_transcription(raw_text)
print(f"Transcription: {transcription}")
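The raw pipeline text follows the layout shown in the first example, `language English<asr_text>...`. To make clear what the extraction step does, the split can be sketched in plain Python. The helper `parse_raw_output` below is hypothetical and only assumes that layout; in practice, prefer the processor methods shown in this document (`decode` with `return_format="parsed"`, or `extract_transcription`).

```python
ASR_TEXT_MARKER = "<asr_text>"
LANGUAGE_PREFIX = "language "


def parse_raw_output(raw: str) -> dict:
    """Hypothetical helper: split raw model text into language and transcription."""
    head, marker, tail = raw.partition(ASR_TEXT_MARKER)
    if not marker:
        # No marker found: treat the whole string as the transcription.
        return {"language": None, "transcription": raw.strip()}
    head = head.strip()
    language = None
    if head.startswith(LANGUAGE_PREFIX):
        language = head[len(LANGUAGE_PREFIX):].strip() or None
    return {"language": language, "transcription": tail.strip()}


raw = (
    "language English<asr_text>Mr. Quilter is the apostle of the middle classes, "
    "and we are glad to welcome his gospel."
)
parsed = parse_raw_output(raw)
print(parsed["language"])
print(parsed["transcription"])
```

Returning `None` for a missing language tag is a choice made for this sketch; the actual processor may handle that case differently.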
[[autodoc]] Qwen3ASRConfig
[[autodoc]] Qwen3ASREncoderConfig
[[autodoc]] Qwen3ASRFeatureExtractor
    - __call__
[[autodoc]] Qwen3ASRProcessor
    - __call__
    - apply_transcription_request
    - prepare_forced_aligner_inputs
    - decode_forced_alignment
    - decode
[[autodoc]] Qwen3ASREncoder
[[autodoc]] Qwen3ASRModel
[[autodoc]] Qwen3ASRForConditionalGeneration
    - forward
    - get_audio_features
[[autodoc]] Qwen3ASRForTokenClassification
    - forward