This model was released on 2024-10-21 and added to Hugging Face Transformers on 2026-02-03.

</div>

</div>

Moonshine Streaming

Moonshine Streaming is a streaming variant of the Moonshine speech recognition model, optimized for real-time transcription with low latency. Like the original Moonshine, it is an encoder-decoder model that uses Rotary Position Embedding (RoPE) for handling variable-length speech efficiently. The streaming architecture includes sliding window attention in the encoder and a context adapter that enables incremental processing of audio chunks.
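
To make the sliding-window idea concrete, the sketch below builds a boolean attention mask in plain Python in which each frame may attend only to a bounded number of past and future frames. The function name `sliding_window_mask` and the `left`/`right` window sizes are illustrative assumptions, not values taken from the Moonshine Streaming configuration or implementation.

```python
def sliding_window_mask(num_frames: int, left: int, right: int) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    `left` and `right` are hypothetical window sizes (frames of past and
    future context), chosen here only for illustration.
    """
    if num_frames < 0 or left < 0 or right < 0:
        raise ValueError("num_frames, left, and right must be non-negative")
    return [
        [i - left <= j <= i + right for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Print the mask for 6 frames: "x" marks an allowed position, "." a masked one.
mask = sliding_window_mask(num_frames=6, left=2, right=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row allows at most `left + right + 1` positions, so the attention cost per frame stays bounded as the audio grows, which is the property that makes chunk-by-chunk encoding practical.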

Moonshine Streaming is available in three sizes: tiny, small, and medium, offering a trade-off between speed and accuracy. It is particularly well-suited for on-device streaming transcription and voice command applications.

You can find all the original Moonshine Streaming checkpoints under the Useful Sensors organization.

> [!TIP]
> Moonshine Streaming processes raw audio waveforms directly without requiring mel-spectrogram preprocessing, making it efficient for real-time applications.
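
Because the model consumes raw waveforms, a streaming front end mainly needs to cut incoming audio into fixed-size pieces. The sketch below covers only that chunking step, in plain Python. The helper name `iter_chunks`, the 80 ms chunk length, and the 16 kHz mono input are assumptions for illustration; how each chunk is then passed to the model is not shown here.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000  # assumed: 16 kHz mono audio


def iter_chunks(
    waveform: Sequence[float], chunk_ms: int = 80, sample_rate: int = SAMPLE_RATE
) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping chunks of `chunk_ms` milliseconds.

    The final chunk may be shorter than the others. `chunk_ms=80` is a
    hypothetical default, not a value taken from the model.
    """
    chunk_size = sample_rate * chunk_ms // 1000
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for the given sample_rate")
    for start in range(0, len(waveform), chunk_size):
        yield waveform[start : start + chunk_size]


# One second of silence stands in for microphone input.
waveform = [0.0] * SAMPLE_RATE
chunks = list(iter_chunks(waveform))
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

No spectrogram is computed at any point: each chunk is just a slice of the sample buffer, so the per-chunk preprocessing cost is negligible.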

The example below demonstrates how to transcribe speech into text with [Pipeline] or the [AutoModel] class.

<hfoptions id="usage"> <hfoption id="Pipeline">

```python
from transformers import pipeline

# device=0 selects the first GPU; use device="cpu" if no GPU is available
pipe = pipeline(
    task="automatic-speech-recognition",
    model="UsefulSensors/moonshine-streaming-tiny",
    device=0,
)
pipe("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
```

</hfoption> <hfoption id="AutoModel">

```python
from datasets import load_dataset

from transformers import AutoProcessor, MoonshineStreamingForConditionalGeneration

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny",
    device_map="auto",
    attn_implementation="sdpa",
)

# Load a short sample from a small test dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = ds[0]["audio"]

# Convert the raw waveform to model inputs and move them to the model's device
inputs = processor(audio_sample["array"], return_tensors="pt").to(model.device)

# Generate token IDs and decode them to text
generated_ids = model.generate(**inputs, max_new_tokens=100)
transcription = processor.decode(generated_ids[0], skip_special_tokens=True)
print(transcription)
```

</hfoption> </hfoptions>

MoonshineStreamingProcessor

[[autodoc]] MoonshineStreamingProcessor

MoonshineStreamingEncoderConfig

[[autodoc]] MoonshineStreamingEncoderConfig

MoonshineStreamingConfig

[[autodoc]] MoonshineStreamingConfig

MoonshineStreamingModel

[[autodoc]] MoonshineStreamingModel
    - forward

MoonshineStreamingForConditionalGeneration

[[autodoc]] MoonshineStreamingForConditionalGeneration
    - forward
    - generate