docs/source/en/model_doc/nemotron3_5_asr.md
This model was contributed to Hugging Face Transformers on 2026-06-27.
Nemotron 3.5 ASR is a 600M-parameter multilingual speech recognition model from NVIDIA, built for high-quality transcription in both low-latency streaming and high-throughput batch settings, with native punctuation and capitalization. For streaming, it offers configurable chunk sizes (80 ms, 160 ms, 560 ms, and 1120 ms), letting users trade off latency against accuracy to suit their application. Its cache-aware FastConformer-RNNT architecture is central to this capability: unlike traditional buffered streaming, which repeatedly reprocesses overlapping audio windows, the model processes only each new incoming chunk and reuses cached encoder context from prior chunks. This eliminates redundant computation and reduces end-to-end delay without sacrificing accuracy, making the model well suited to real-time transcription workloads.
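To make the efficiency argument concrete, here is a toy cost model that counts how many encoder frames each strategy would process. It is only a sketch: the function names and the numbers are illustrative assumptions, not part of the Transformers API or of the model's actual implementation.

```python
def buffered_frames(total_frames: int, chunk: int, left_context: int) -> int:
    """Frames encoded when every step re-encodes its left context plus the new chunk."""
    processed = 0
    for start in range(0, total_frames, chunk):
        new = min(chunk, total_frames - start)
        context = min(left_context, start)  # no history exists before the audio starts
        processed += context + new
    return processed


def cache_aware_frames(total_frames: int, chunk: int) -> int:
    """Frames encoded when cached context is reused and only new frames are processed."""
    processed = 0
    for start in range(0, total_frames, chunk):
        processed += min(chunk, total_frames - start)
    return processed


# Illustrative numbers only: 700 frames of audio, 7 new frames per step, 70 frames of left context.
total, chunk, left = 700, 7, 70
print("buffered:", buffered_frames(total, chunk, left))
print("cache-aware:", cache_aware_frames(total, chunk))
```

In this toy setting the buffered variant encodes roughly ten times as many frames, because each step re-encodes up to 70 frames of history to gain 7 new ones.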
from transformers import pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="nvidia/nemotron-3.5-asr-streaming-0.6b",
)
out = pipe("https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3")
print(out)
</hfoption>
<hfoption id="AutoModel">

> [!NOTE]
> The pipeline uses the default language prompt (index 0, `en-US`). For explicit language conditioning or automatic detection, pass the processor's `language` argument (see the AutoModel tab).

The language prompt is created by the processor, so the language travels with the inputs into `generate`.
from transformers import AutoModelForRNNT, AutoProcessor
from transformers.audio_utils import load_audio
model_id = "nvidia/nemotron-3.5-asr-streaming-0.6b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForRNNT.from_pretrained(model_id, device_map="auto")
audio = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
    sampling_rate=processor.feature_extractor.sampling_rate,
)
# Condition on a known language ...
inputs = processor(audio, sampling_rate=processor.feature_extractor.sampling_rate, language="en-US")
inputs.to(model.device, dtype=model.dtype)
output = model.generate(**inputs, return_dict_in_generate=True)
print(processor.decode(output.sequences, skip_special_tokens=True))
# ... or let the model detect it and keep the emitted <xx-XX> language tag.
inputs = processor(audio, sampling_rate=processor.feature_extractor.sampling_rate)  # equivalent to passing language="auto"
inputs.to(model.device, dtype=model.dtype)
output = model.generate(**inputs, return_dict_in_generate=True)
print(processor.decode(output.sequences, skip_special_tokens=False))
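When the tag is kept in the decoded text, downstream code may want the detected locale and the transcript separately. The helper below is a minimal sketch of that post-processing step. It assumes the tag appears literally in the decoded string in the `<xx-XX>` form mentioned above (for example `<en-US>`); `split_language_tag` is a hypothetical name, not a Transformers function.

```python
import re
from typing import Optional, Tuple

# Assumption: the language tag is rendered literally as "<xx-XX>" in the decoded string.
LANGUAGE_TAG = re.compile(r"<([a-z]{2}-[A-Z]{2})>")


def split_language_tag(text: str) -> Tuple[Optional[str], str]:
    """Return (locale, transcript) with the first language tag removed, or (None, text)."""
    match = LANGUAGE_TAG.search(text)
    if match is None:
        return None, text.strip()
    cleaned = (text[: match.start()] + text[match.end():]).strip()
    return match.group(1), cleaned


locale, transcript = split_language_tag("<en-US> Hello there.")
print(locale, "|", transcript)
```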
> [!NOTE]
> This is an experimental feature and the API is subject to change.

For real-time transcription, split the audio into chunks and pass them to `generate` through a generator, as follows:
from threading import Thread
from transformers import AutoModelForRNNT, AutoProcessor, TextIteratorStreamer
from transformers.audio_utils import load_audio
model_id = "nvidia/nemotron-3.5-asr-streaming-0.6b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForRNNT.from_pretrained(model_id, device_map="auto")
processor.set_num_lookahead_tokens(6)
print(f"Streaming latency: {processor.streaming_latency_ms} ms")
# The language prompt rides along on every chunk; use a locale (e.g. "de-DE") or "auto".
language = "en-US"
sampling_rate = processor.feature_extractor.sampling_rate
audio = load_audio(
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
    sampling_rate=sampling_rate,
)
first_chunk_inputs = processor(
    audio[: processor.num_samples_first_audio_chunk],
    sampling_rate=sampling_rate,
    is_streaming=True,
    is_first_audio_chunk=True,
    language=language,
    return_tensors="pt",
)
first_chunk_inputs = first_chunk_inputs.to(model.device, dtype=model.dtype)
def input_features_generator():
    # First chunk: keep only the mel frames that belong to it.
    yield first_chunk_inputs.input_features[:, : processor.num_mel_frames_first_audio_chunk, :]

    mel_frame_idx = processor.num_mel_frames_first_audio_chunk
    hop_length = processor.feature_extractor.hop_length
    n_fft = processor.feature_extractor.n_fft
    # Sample offset of the next chunk: half an FFT window before the next mel frame.
    start_idx = mel_frame_idx * hop_length - n_fft // 2

    # Loop while a full chunk still ends before the end of the audio.
    while (end_idx := start_idx + processor.num_samples_per_audio_chunk) < audio.shape[0]:
        inputs = processor(
            audio[start_idx:end_idx],
            sampling_rate=sampling_rate,
            is_streaming=True,
            is_first_audio_chunk=False,
            language=language,
            return_tensors="pt",
        )
        inputs = inputs.to(model.device, dtype=model.dtype)
        yield inputs.input_features

        # Advance by one chunk's worth of mel frames and recompute the sample offset.
        mel_frame_idx += processor.num_mel_frames_per_audio_chunk
        start_idx = mel_frame_idx * hop_length - n_fft // 2
streamer = TextIteratorStreamer(processor.tokenizer, skip_special_tokens=True)
generate_kwargs = {
    **first_chunk_inputs,
    "input_features": input_features_generator(),
    "streamer": streamer,
}
thread = Thread(target=model.generate, kwargs=generate_kwargs)
thread.start()
# Iterate over the streamer to get text chunks as they are generated
print("Model output (streaming):", end=" ", flush=True)
for text_chunk in streamer:
    print(text_chunk, end="", flush=True)
thread.join()
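The index arithmetic in the generator above can be checked in isolation, without loading the model. The sketch below reproduces only the slicing logic in plain Python. All numeric values (hop length, FFT size, frames and samples per chunk) are hypothetical placeholders chosen for illustration, and `chunk_bounds` is not a Transformers function; in practice the real values come from the processor attributes used above.

```python
from typing import Iterator, Tuple


def chunk_bounds(
    num_samples: int,
    first_chunk_frames: int,
    frames_per_chunk: int,
    samples_per_chunk: int,
    hop_length: int,
    n_fft: int,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices for every chunk after the first one."""
    mel_frame_idx = first_chunk_frames
    start = mel_frame_idx * hop_length - n_fft // 2
    # Same stopping rule as the snippet above: a chunk must end before the audio does.
    while (end := start + samples_per_chunk) < num_samples:
        yield start, end
        mel_frame_idx += frames_per_chunk
        start = mel_frame_idx * hop_length - n_fft // 2


# Hypothetical values: 2 s of 16 kHz audio, 10 ms hop, 512-point FFT.
bounds = list(
    chunk_bounds(
        num_samples=32_000,
        first_chunk_frames=49,
        frames_per_chunk=56,
        samples_per_chunk=9_312,
        hop_length=160,
        n_fft=512,
    )
)
for start, end in bounds:
    print(start, end)
```

With these placeholder numbers, consecutive slices overlap by `n_fft - hop_length` samples. Any audio after the last full chunk is not yielded by this loop.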
The latency is set by `num_lookahead_tokens`, the right attention context (lookahead, in subsampled encoder frames) that each chunk waits for before it is emitted. A larger value lets each chunk see more future audio, which improves accuracy at the cost of higher latency. Inspect the supported trade-offs, select one, and read back the resulting latency:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("nvidia/nemotron-3.5-asr-streaming-0.6b")
# Each supported `num_lookahead_tokens` mapped to its streaming latency in milliseconds:
print(processor.supported_streaming_latencies_ms)
# {3: 320, 0: 80, 6: 560, 13: 1120}
# Select a right attention context (this also re-derives the streaming chunk sizes used above):
processor.set_num_lookahead_tokens(6)
# Latency of the current selection:
print(processor.streaming_latency_ms)
# 560
`set_num_lookahead_tokens` sizes the chunks the processor emits, and the matching `num_lookahead_tokens` must reach `generate`. In the snippets above it travels through `**inputs` / `**first_chunk_inputs`, which carry `num_lookahead_tokens`. Streaming `generate` raises an error if it is omitted.
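The latencies printed above follow a simple pattern: each value equals the lookahead plus one frame, multiplied by 80 ms. The sketch below checks that observation against the listed mapping. The 80 ms frame duration is an assumption inferred from those numbers, not a documented constant, and `lookahead_to_latency_ms` is a hypothetical helper rather than a processor method.

```python
# Assumption: one subsampled encoder frame spans 80 ms (inferred from the mapping above).
ENCODER_FRAME_MS = 80


def lookahead_to_latency_ms(num_lookahead_tokens: int, frame_ms: int = ENCODER_FRAME_MS) -> int:
    """Latency of a chunk: the current frame plus its lookahead frames."""
    return (num_lookahead_tokens + 1) * frame_ms


# Mapping as printed by `processor.supported_streaming_latencies_ms` in the snippet above.
listed = {3: 320, 0: 80, 6: 560, 13: 1120}
derived = {n: lookahead_to_latency_ms(n) for n in listed}
print(derived == listed)
```

For real use, prefer reading `processor.streaming_latency_ms` after calling `set_num_lookahead_tokens`, since the processor is the source of truth for these values.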
[[autodoc]] Nemotron3_5AsrConfig
[[autodoc]] Nemotron3_5AsrProcessor
[[autodoc]] Nemotron3_5AsrRNNTOutput
[[autodoc]] Nemotron3_5AsrForRNNT
    - forward
    - generate