# Moonshine Streaming
This model was released on 2024-10-21 and added to Hugging Face Transformers on 2026-02-03.
Moonshine Streaming is a streaming variant of the Moonshine speech recognition model, optimized for real-time transcription with low latency. Like the original Moonshine, it is an encoder-decoder model that uses Rotary Position Embedding (RoPE) for handling variable-length speech efficiently. The streaming architecture includes sliding window attention in the encoder and a context adapter that enables incremental processing of audio chunks.
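The sliding window attention pattern mentioned above can be illustrated with a small, self-contained sketch. This is plain NumPy for intuition only, not the model's actual implementation, and the window size is an arbitrary illustrative value:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean attention mask where each position may attend only to
    neighbors within `window` steps on either side (bidirectional,
    as in an encoder)."""
    idx = np.arange(seq_len)
    # Position i attends to position j only when |i - j| <= window
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(seq_len=6, window=1)
print(mask.astype(int))
```

Each row of the printed matrix shows which positions one timestep can attend to; limiting attention to a fixed band is what keeps per-chunk compute bounded as the audio stream grows.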
Moonshine Streaming is available in three sizes: tiny, small, and medium, offering a trade-off between speed and accuracy. It is particularly well-suited for on-device streaming transcription and voice command applications.
You can find all the original Moonshine Streaming checkpoints under the [UsefulSensors](https://huggingface.co/UsefulSensors) organization.
> [!TIP]
> Moonshine Streaming processes raw audio waveforms directly without requiring mel-spectrogram preprocessing, making it efficient for real-time applications.
The example below demonstrates how to transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class.
```py
from transformers import pipeline

pipe = pipeline(
    task="automatic-speech-recognition",
    model="UsefulSensors/moonshine-streaming-tiny",
    device=0,
)
pipe("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
```
```py
from datasets import load_dataset
from transformers import AutoProcessor, MoonshineStreamingForConditionalGeneration

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny",
    device_map="auto",
    attn_implementation="sdpa",
)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = ds[0]["audio"]

inputs = processor(
    audio_sample["array"],
    sampling_rate=audio_sample["sampling_rate"],
    return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=100)
transcription = processor.decode(generated_ids[0], skip_special_tokens=True)
print(transcription)
```
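The incremental chunk processing that the streaming architecture enables can be sketched with a simple splitting loop. This is an illustrative helper only, not part of the Transformers API; the chunk size and overlap below are arbitrary example values, not the model's actual configuration:

```python
import numpy as np

def chunk_waveform(waveform: np.ndarray, chunk_size: int, overlap: int):
    """Yield successive chunks of `chunk_size` samples, where each chunk
    shares `overlap` samples with the previous one (illustrative values)."""
    step = chunk_size - overlap
    for start in range(0, max(len(waveform) - overlap, 1), step):
        yield waveform[start:start + chunk_size]

# Simulate 1 second of 16 kHz audio split into 0.5 s chunks with 0.1 s overlap
audio = np.zeros(16_000, dtype=np.float32)
chunks = list(chunk_waveform(audio, chunk_size=8_000, overlap=1_600))
print([len(c) for c in chunks])  # three chunks: two full, one final partial
```

In a real streaming loop, each chunk would be passed through the processor and model as it arrives, with the encoder's context adapter carrying state between chunks rather than recomputing the full audio history.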
## MoonshineStreamingProcessor

[[autodoc]] MoonshineStreamingProcessor

## MoonshineStreamingEncoderConfig

[[autodoc]] MoonshineStreamingEncoderConfig

## MoonshineStreamingConfig

[[autodoc]] MoonshineStreamingConfig

## MoonshineStreamingModel

[[autodoc]] MoonshineStreamingModel
    - forward

## MoonshineStreamingForConditionalGeneration

[[autodoc]] MoonshineStreamingForConditionalGeneration
    - forward
    - generate