
Supported functions

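| Speech recognition | Speech synthesis | Source separation |
|--------------------|------------------|-------------------|
| ✔️ | ✔️ | ✔️ |

| Speaker identification | Speaker diarization | Speaker verification |
|------------------------|---------------------|----------------------|
| ✔️ | ✔️ | ✔️ |

| Spoken Language identification | Audio tagging | Voice activity detection |
|--------------------------------|---------------|--------------------------|
| ✔️ | ✔️ | ✔️ |

| Keyword spotting | Add punctuation | Speech enhancement |
|------------------|-----------------|--------------------|
| ✔️ | ✔️ | ✔️ |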

Supported platforms

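| Architecture | Android | iOS | Windows | macOS | linux | HarmonyOS |
|--------------|---------|-----|---------|-------|-------|-----------|
| x64 | ✔️ | | ✔️ | ✔️ | ✔️ | ✔️ |
| x86 | ✔️ | | ✔️ | | | |
| arm64 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| arm32 | ✔️ | | | | ✔️ | ✔️ |
| riscv64 | | | | | ✔️ | |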

Supported programming languages

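| 1. C++ | 2. C | 3. Python | 4. JavaScript |
|--------|------|-----------|---------------|
| ✔️ | ✔️ | ✔️ | ✔️ |

| 5. Java | 6. C# | 7. Kotlin | 8. Swift |
|---------|-------|-----------|----------|
| ✔️ | ✔️ | ✔️ | ✔️ |

| 9. Go | 10. Dart | 11. Rust | 12. Pascal |
|-------|----------|----------|------------|
| ✔️ | ✔️ | ✔️ | ✔️ |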

It also supports WebAssembly.

Supported NPUs

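| 1. Rockchip NPU (RKNN) | 2. Qualcomm NPU (QNN) | 3. Ascend NPU |
|------------------------|-----------------------|---------------|
| ✔️ | ✔️ | ✔️ |

| 4. Axera NPU |
|--------------|
| ✔️ |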

Join our discord

Introduction

This repository supports running the following functions locally

  • Speech-to-text (i.e., ASR); both streaming and non-streaming are supported
  • Text-to-speech (i.e., TTS)
  • Speaker diarization
  • Speaker identification
  • Speaker verification
  • Spoken language identification
  • Audio tagging
  • VAD (e.g., silero-vad)
  • Speech enhancement (e.g., gtcrn, DPDFNet)
  • Keyword spotting
  • Source separation (e.g., spleeter, UVR)

on the platforms and operating systems listed under Supported platforms above,

with the following APIs:

  • C++, C, Python, Go, C#
  • Java, Kotlin, JavaScript
  • Swift, Rust
  • Dart, Object Pascal
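
To make the streaming versus non-streaming distinction above concrete, here is a minimal Python sketch of a streaming decode loop: audio is fed in small chunks and a partial result is read after each one. The method names (`create_stream`, `accept_waveform`, `input_finished`, `is_ready`, `decode_stream`, `get_result`) follow the pattern of the project's streaming Python examples as far as we understand them; treat them as assumptions and check the official documentation. `FakeRecognizer` and `iter_chunks` are hypothetical helpers so that the sketch runs without any model files.

```python
from typing import Iterator, List, Sequence


def iter_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of at most chunk_size samples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def decode_streaming(recognizer, samples: Sequence[float],
                     sample_rate: int, chunk_ms: int = 100) -> List[str]:
    """Feed audio chunk by chunk and collect the result after each chunk.

    `recognizer` is duck-typed; the method names are assumed to match a
    streaming recognizer object and should be verified before real use.
    """
    stream = recognizer.create_stream()
    chunk_size = sample_rate * chunk_ms // 1000
    partials = []
    for chunk in iter_chunks(samples, chunk_size):
        stream.accept_waveform(sample_rate, chunk)
        # Decode as long as enough audio has been buffered.
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
        partials.append(recognizer.get_result(stream))
    # Signal end of audio and flush whatever is left.
    stream.input_finished()
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    partials.append(recognizer.get_result(stream))
    return partials


class _FakeStream:
    """Hypothetical stand-in for a recognizer stream (no real decoding)."""

    def __init__(self) -> None:
        self.pending = 0  # samples received but not yet "decoded"
        self.decoded = 0  # samples already "decoded"

    def accept_waveform(self, sample_rate: int, samples: Sequence[float]) -> None:
        self.pending += len(samples)

    def input_finished(self) -> None:
        pass


class FakeRecognizer:
    """Hypothetical recognizer whose 'text' is the number of decoded samples."""

    def create_stream(self) -> _FakeStream:
        return _FakeStream()

    def is_ready(self, stream: _FakeStream) -> bool:
        return stream.pending > 0

    def decode_stream(self, stream: _FakeStream) -> None:
        stream.decoded += stream.pending
        stream.pending = 0

    def get_result(self, stream: _FakeStream) -> str:
        return f"{stream.decoded} samples decoded"


if __name__ == "__main__":
    audio = [0.0] * 4000  # 0.25 s of silence at 16 kHz
    for text in decode_streaming(FakeRecognizer(), audio, sample_rate=16000):
        print(text)
```

A non-streaming recognizer would instead receive the whole utterance at once and return a single final result. With a real model, the samples would typically be a float32 array rather than a Python list.
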
<details> <summary>You can visit the following Huggingface spaces to try sherpa-onnx without installing anything. All you need is a browser.</summary>
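
| Description | URL | China mirror |
|-------------|-----|--------------|
| Speaker diarization | Click me | Mirror |
| Speech recognition | Click me | Mirror |
| Speech recognition with Whisper | Click me | Mirror |
| Speech synthesis | Click me | Mirror |
| Generate subtitles | Click me | Mirror |
| Audio tagging | Click me | Mirror |
| Source separation | Click me | Mirror |
| Spoken language identification with Whisper | Click me | Mirror |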

We also have spaces built using WebAssembly. They are listed below:

| Description | Huggingface space | ModelScope space |
|-------------|-------------------|------------------|
| Voice activity detection with silero-vad | Click me | Address |
| Real-time speech recognition (Chinese + English) with Zipformer | Click me | Address |
| Real-time speech recognition (Chinese + English) with Paraformer | Click me | Address |
| Real-time speech recognition (Chinese + English + Cantonese) with Paraformer-large | Click me | Address |
| Real-time speech recognition (English) | Click me | Address |
| VAD + speech recognition (Chinese) with Zipformer CTC | Click me | Address |
| VAD + speech recognition (Chinese + English + Korean + Japanese + Cantonese) with SenseVoice | Click me | Address |
| VAD + speech recognition (English) with Whisper tiny.en | Click me | Address |
| VAD + speech recognition (English) with Moonshine tiny | Click me | Address |
| VAD + speech recognition (English) with Zipformer trained with GigaSpeech | Click me | Address |
| VAD + speech recognition (Chinese) with Zipformer trained with WenetSpeech | Click me | Address |
| VAD + speech recognition (Japanese) with Zipformer trained with ReazonSpeech | Click me | Address |
| VAD + speech recognition (Thai) with Zipformer trained with GigaSpeech2 | Click me | Address |
| VAD + speech recognition (Chinese, multiple dialects) with a TeleSpeech-ASR CTC model | Click me | Address |
| VAD + speech recognition (English + Chinese, plus multiple Chinese dialects) with Paraformer-large | Click me | Address |
| VAD + speech recognition (English + Chinese, plus multiple Chinese dialects) with Paraformer-small | Click me | Address |
| VAD + speech recognition (multilingual, plus multiple Chinese dialects) with Dolphin-base | Click me | Address |
| Speech synthesis (Piper, English) | Click me | Address |
| Speech synthesis (Piper, German) | Click me | Address |
| Speech synthesis (Matcha, Chinese) | Click me | Address |
| Speech synthesis (Matcha, English) | Click me | Address |
| Speech synthesis (Matcha, Chinese+English) | Click me | Address |
| Speaker diarization | Click me | Address |
| Voice cloning with ZipVoice (Chinese+English) | Click me | Address |
| Voice cloning with Pocket TTS (English) | Click me | Address |

</details> <details> <summary>You can find pre-built Android APKs for this repository in the following table</summary>
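
| Description | URL | For users in China |
|-------------|-----|--------------------|
| Speaker diarization | Address | Click here |
| Streaming speech recognition | Address | Click here |
| Simulated-streaming speech recognition | Address | Click here |
| Text-to-speech | Address | Click here |
| Voice activity detection (VAD) | Address | Click here |
| VAD + non-streaming speech recognition | Address | Click here |
| Two-pass speech recognition | Address | Click here |
| Audio tagging | Address | Click here |
| Audio tagging (WearOS) | Address | Click here |
| Speaker identification | Address | Click here |
| Spoken language identification | Address | Click here |
| Keyword spotting | Address | Click here |
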
</details> <details>

Real-time speech recognition

| Description | URL | For users in China |
|-------------|-----|--------------------|
| Streaming speech recognition | Address | Click here |

Text-to-speech

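| Description | URL | For users in China |
|-------------|-----|--------------------|
| Android (arm64-v8a, armeabi-v7a, x86_64) | Address | Click here |
| Linux (x64) | Address | Click here |
| macOS (x64) | Address | Click here |
| macOS (arm64) | Address | Click here |
| Windows (x64) | Address | Click here |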

Note: You need to build from source for iOS.

</details> <details>

Generating subtitles

| Description | URL | For users in China |
|-------------|-----|--------------------|
| Generate subtitles | Address | Click here |

</details> <details>

| Description | URL |
|-------------|-----|
| Speech recognition (speech to text, ASR) | Address |
| Text-to-speech (TTS) | Address |
| VAD | Address |
| Keyword spotting | Address |
| Audio tagging | Address |
| Speaker identification (Speaker ID) | Address |
| Spoken language identification (Language ID) | See multi-lingual Whisper ASR models from Speech recognition |
| Punctuation | Address |
| Speaker segmentation | Address |
| Speech enhancement | Address |
| Source separation | Address |

</details>
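
Several of the demos listed above pair VAD with a non-streaming recognizer: the VAD cuts the audio into speech segments, and each segment is decoded as a whole. The sketch below illustrates that flow in Python under stated assumptions. `energy_segments` is a toy amplitude gate, not silero-vad, and `FakeOfflineRecognizer` is a hypothetical stand-in so the sketch runs without model files. The recognizer calls (`create_stream`, `accept_waveform`, `decode_stream`, `stream.result.text`) mirror the non-streaming Python examples as far as we understand them and should be checked against the official documentation.

```python
from types import SimpleNamespace
from typing import List, Sequence, Tuple


def energy_segments(samples: Sequence[float], frame: int,
                    threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges whose mean |amplitude| exceeds threshold.

    A toy energy gate used here in place of a real VAD model.
    """
    segments = []
    start = None
    for pos in range(0, len(samples), frame):
        window = samples[pos:pos + frame]
        active = sum(abs(x) for x in window) / len(window) > threshold
        if active and start is None:
            start = pos  # speech begins
        elif not active and start is not None:
            segments.append((start, pos))  # speech ends
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


def transcribe_segments(recognizer, samples: Sequence[float], sample_rate: int,
                        frame: int = 160, threshold: float = 0.01) -> List[str]:
    """Decode each detected speech segment with a non-streaming recognizer."""
    texts = []
    for begin, end in energy_segments(samples, frame, threshold):
        stream = recognizer.create_stream()
        stream.accept_waveform(sample_rate, samples[begin:end])
        recognizer.decode_stream(stream)  # whole segment decoded at once
        texts.append(stream.result.text)
    return texts


class _FakeOfflineStream:
    """Hypothetical stand-in for a non-streaming recognizer stream."""

    def __init__(self) -> None:
        self.num_samples = 0
        self.result = SimpleNamespace(text="")

    def accept_waveform(self, sample_rate: int, samples: Sequence[float]) -> None:
        self.num_samples = len(samples)


class FakeOfflineRecognizer:
    """Hypothetical recognizer that reports the segment length as its 'text'."""

    def create_stream(self) -> _FakeOfflineStream:
        return _FakeOfflineStream()

    def decode_stream(self, stream: _FakeOfflineStream) -> None:
        stream.result.text = f"segment of {stream.num_samples} samples"


if __name__ == "__main__":
    # silence, speech, silence, speech (synthetic amplitudes)
    audio = [0.0] * 320 + [0.5] * 480 + [0.0] * 160 + [0.5] * 160
    print(energy_segments(audio, frame=160, threshold=0.01))
    print(transcribe_segments(FakeOfflineRecognizer(), audio, sample_rate=16000))
```

Because each segment is bounded by silence, a non-streaming model can be used for near-real-time transcription without needing a streaming model.
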

Some pre-trained ASR models (Streaming)

<details>

Please see

for more models. The following table lists only SOME of them.

| Name | Supported Languages | Description |
|------|---------------------|-------------|
| sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 | Chinese, English | See also |
| sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16 | Chinese, English | See also |
| sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23 | Chinese | Suitable for Cortex A7 CPU. See also |
| sherpa-onnx-streaming-zipformer-en-20M-2023-02-17 | English | Suitable for Cortex A7 CPU. See also |
| sherpa-onnx-streaming-zipformer-korean-2024-06-16 | Korean | See also |
| sherpa-onnx-streaming-zipformer-fr-2023-04-14 | French | See also |

</details>

Some pre-trained ASR models (Non-Streaming)

<details>

Please see

for more models. The following table lists only SOME of them.

| Name | Supported Languages | Description |
|------|---------------------|-------------|
| sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8 | English | It is converted from https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 |
| Whisper tiny.en | English | See also |
| Moonshine tiny | English | See also |
| sherpa-onnx-zipformer-ctc-zh-int8-2025-07-03 | Chinese | A Zipformer CTC model |
| sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 | Chinese, Cantonese, English, Korean, Japanese | Supports multiple Chinese dialects. See also |
| sherpa-onnx-paraformer-zh-2024-03-09 | Chinese, English | Also supports multiple Chinese dialects. See also |
| sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01 | Japanese | See also |
| sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24 | Russian | See also |
| sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24 | Russian | See also |
| sherpa-onnx-zipformer-ru-2024-09-18 | Russian | See also |
| sherpa-onnx-zipformer-korean-2024-06-24 | Korean | See also |
| sherpa-onnx-zipformer-thai-2024-06-20 | Thai | See also |
| sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04 | Chinese | Supports multiple dialects. See also |

</details>

How to reach us

Please see https://k2-fsa.github.io/sherpa/social-groups.html for the Next-gen Kaldi WeChat and QQ discussion groups.

Projects using sherpa-onnx

Speed of Sound

A voice-typing application for the Linux desktop (GTK4/Adwaita). It captures microphone audio, transcribes it offline using Sherpa ONNX ASR models, optionally polishes the text with an LLM, and types the result into the active window via XDG Remote Desktop Portal keyboard simulation.

VoxSherpa TTS

VoxSherpa TTS is a 100% offline Android Text-to-Speech app powered by Sherpa-ONNX. It supports Kokoro-82M, Piper, and VITS engines with multilingual support including Hindi, English, British English, Japanese, Chinese and 50+ more languages.

<div align="center">
Screenshots: Generate | Models | Library | Settings
</div>

BreezeApp from MediaTek Research

BreezeApp is a mobile AI application developed for both Android and iOS platforms. Users can download it directly from the App Store and enjoy a variety of features offline, including speech-to-text, text-to-speech, text-based chatbot interactions, and image question-answering.


Open-LLM-VTuber

Talk to any LLM with hands-free voice interaction, voice interruption, and a Live2D talking face, running locally across platforms.

See also https://github.com/t41372/Open-LLM-VTuber/pull/50

voiceapi

<details> <summary>Streaming ASR and TTS based on FastAPI</summary>

It shows how to use the ASR and TTS Python APIs with FastAPI.

</details>

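TMSpeech (腾讯会议摸鱼工具, a tool for slacking off in Tencent Meeting)

It uses streaming ASR in C# with a graphical user interface.

Video demo in Chinese: 【开源】Windows实时字幕软件(网课/开会必备) ("[Open source] Real-time caption software for Windows, a must-have for online classes and meetings")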

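lol互动助手 (LoL interactive assistant)

It uses the JavaScript API of sherpa-onnx along with Electron.

Video demo in Chinese: 爆了!炫神教你开打字挂!真正影响胜率的英雄联盟工具!英雄联盟的最后一块拼图!和游戏中的每个人无障碍沟通! ("It blew up! Xuanshen shows you a typing 'cheat'! The League of Legends tool that really affects win rate! The last piece of the League of Legends puzzle! Communicate with everyone in the game without barriers!")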

Sherpa-ONNX speech recognition server

A Node.js-based server providing a RESTful API for speech recognition.

QSmartAssistant

A modular, low-resource conversational robot/smart speaker whose whole pipeline can run offline.

It uses Qt. Both ASR and TTS are used.

Flutter-EasySpeechRecognition

It extends ./flutter-examples/streaming_asr by downloading models inside the app to reduce the size of the app.

Note: [Team B] Sherpa AI backend also uses sherpa-onnx in a Flutter APP.

sherpa-onnx-unity

sherpa-onnx in Unity. See also #1695, #1892, and #1859

xiaozhi-esp32-server

Backend service for xiaozhi-esp32 that helps you quickly build an ESP32 device control server.

See also

KaithemAutomation

Pure Python, GUI-focused home automation/consumer grade SCADA.

It uses TTS from sherpa-onnx. See also ✨ Speak command that uses the new globally configured TTS model.

Open-XiaoAI KWS

Enable custom wake words for XiaoAi speakers.

Video demo in Chinese: 小爱同学启动~˶╹ꇴ╹˶! ("XiaoAi, start up!")

C++ WebSocket ASR Server

It provides a WebSocket server based on C++ for ASR using sherpa-onnx.

Go WebSocket Server

It provides a WebSocket server based on the Go programming language for sherpa-onnx.

Making robot Paimon, Ep10 "The AI Part 1"

It is a YouTube video showing how the author tried to use AI to have a conversation with Paimon.

It uses sherpa-onnx for speech-to-text and text-to-speech.


TtsReader - Desktop application

A desktop text-to-speech application built using Kotlin Multiplatform.

MentraOS

Smart glasses OS, with dozens of built-in apps. Users get AI assistant, notifications, translation, screen mirror, captions, and more. Devs get to write 1 app that runs on any pair of smart glasses.

It uses sherpa-onnx for real-time speech recognition on iOS and Android devices. See also https://github.com/Mentra-Community/MentraOS/pull/861

It uses Swift for iOS and Java for Android.

flet_sherpa_onnx

Flet ASR/STT component based on sherpa-onnx. Example: a chat box agent.

achatbot-go

A multimodal chatbot written in Go that uses sherpa-onnx's speech library API.

fcitx5-vinput

Local offline voice input plugin for Fcitx5 (Linux input method framework). It uses C++ with offline ASR for speech recognition, supporting push-to-talk, command mode, and optional LLM post-processing.

Video demo in Chinese: fcitx5-vinput

Wake Word

A VS Code extension for hands-free voice-activated coding. It uses sherpa-onnx for real-time keyword spotting (KWS) to detect custom wake phrases and trigger VS Code commands by voice. Audio capture is handled by decibri, a cross-platform Node.js microphone streaming library with prebuilt native binaries.