README ONNX

apps/frameworks/sherpa-mnn/README_ONNX.md
Supported functions

| Speech recognition | Speech synthesis |
|---|---|
| ✔️ | ✔️ |

| Speaker identification | Speaker diarization | Speaker verification |
|---|---|---|
| ✔️ | ✔️ | ✔️ |

| Spoken language identification | Audio tagging | Voice activity detection |
|---|---|---|
| ✔️ | ✔️ | ✔️ |

| Keyword spotting | Add punctuation | Speech enhancement |
|---|---|---|
| ✔️ | ✔️ | ✔️ |

Supported platforms

| Architecture | Android | iOS | Windows | macOS | Linux | HarmonyOS |
|---|---|---|---|---|---|---|
| x64 | ✔️ | | ✔️ | ✔️ | ✔️ | ✔️ |
| x86 | ✔️ | | ✔️ | | | |
| arm64 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| arm32 | ✔️ | | | | ✔️ | ✔️ |
| riscv64 | | | | | ✔️ | |

Supported programming languages

| 1. C++ | 2. C | 3. Python | 4. JavaScript |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |

| 5. Java | 6. C# | 7. Kotlin | 8. Swift |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |

| 9. Go | 10. Dart | 11. Rust | 12. Pascal |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |

For Rust support, please see sherpa-rs.

It also supports WebAssembly.

Introduction

This repository supports running the following functions locally:

  • Speech-to-text (i.e., ASR); both streaming and non-streaming are supported
  • Text-to-speech (i.e., TTS)
  • Speaker diarization
  • Speaker identification
  • Speaker verification
  • Spoken language identification
  • Audio tagging
  • VAD (e.g., silero-vad)
  • Keyword spotting

on the platforms and operating systems listed above, with the following APIs:

  • C++, C, Python, Go, C#
  • Java, Kotlin, JavaScript
  • Swift, Rust
  • Dart, Object Pascal
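As a concrete starting point, the sketch below shows non-streaming (offline) speech-to-text with the sherpa-onnx Python API. The constructor and method names follow sherpa-onnx's published Python examples, but should be verified against your installed version; the model file names are placeholders for a downloaded transducer model.

```python
# Minimal offline ASR sketch. Assumes `pip install sherpa-onnx numpy` and a
# downloaded transducer model; all model file names below are placeholders.
import struct
import wave


def pcm16_to_float32(pcm: bytes) -> list:
    """Convert little-endian 16-bit PCM bytes to floats in [-1, 1)."""
    n = len(pcm) // 2
    return [s / 32768.0 for s in struct.unpack("<%dh" % n, pcm[: 2 * n])]


def transcribe(wav_path: str) -> str:
    import numpy as np
    import sherpa_onnx  # imported lazily so the helper above stays standalone

    # Constructor name/arguments as in sherpa-onnx's Python examples;
    # check them against the version you have installed.
    recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
        encoder="encoder.onnx",  # placeholder paths
        decoder="decoder.onnx",
        joiner="joiner.onnx",
        tokens="tokens.txt",
    )
    with wave.open(wav_path) as f:
        samples = pcm16_to_float32(f.readframes(f.getnframes()))
        rate = f.getframerate()
    stream = recognizer.create_stream()
    stream.accept_waveform(rate, np.asarray(samples, dtype=np.float32))
    recognizer.decode_stream(stream)
    return stream.result.text
```

The decoding step is a one-shot call: the whole utterance is fed to the stream before `decode_stream` runs, which is what distinguishes the non-streaming API from the streaming one.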
<details> <summary>You can visit the following Huggingface spaces to try sherpa-onnx without installing anything. All you need is a browser.</summary>
| Description | URL |
|---|---|
| Speaker diarization | Click me |
| Speech recognition | Click me |
| Speech recognition with Whisper | Click me |
| Speech synthesis | Click me |
| Generate subtitles | Click me |
| Audio tagging | Click me |
| Spoken language identification with Whisper | Click me |

We also have spaces built using WebAssembly. They are listed below:

| Description | Huggingface space | ModelScope space |
|---|---|---|
| Voice activity detection with silero-vad | Click me | Address |
| Real-time speech recognition (Chinese + English) with Zipformer | Click me | Address |
| Real-time speech recognition (Chinese + English) with Paraformer | Click me | Address |
| Real-time speech recognition (Chinese + English + Cantonese) with Paraformer-large | Click me | Address |
| Real-time speech recognition (English) | Click me | Address |
| VAD + speech recognition (Chinese + English + Korean + Japanese + Cantonese) with SenseVoice | Click me | Address |
| VAD + speech recognition (English) with Whisper tiny.en | Click me | Address |
| VAD + speech recognition (English) with Moonshine tiny | Click me | Address |
| VAD + speech recognition (English) with Zipformer trained with GigaSpeech | Click me | Address |
| VAD + speech recognition (Chinese) with Zipformer trained with WenetSpeech | Click me | Address |
| VAD + speech recognition (Japanese) with Zipformer trained with ReazonSpeech | Click me | Address |
| VAD + speech recognition (Thai) with Zipformer trained with GigaSpeech2 | Click me | Address |
| VAD + speech recognition (Chinese, various dialects) with a TeleSpeech-ASR CTC model | Click me | Address |
| VAD + speech recognition (English + Chinese, plus various Chinese dialects) with Paraformer-large | Click me | Address |
| VAD + speech recognition (English + Chinese, plus various Chinese dialects) with Paraformer-small | Click me | Address |
| Speech synthesis (English) | Click me | Address |
| Speech synthesis (German) | Click me | Address |
| Speaker diarization | Click me | Address |
</details> <details> <summary>You can find pre-built Android APKs for this repository in the following table</summary>
| Description | URL | For users in China |
|---|---|---|
| Speaker diarization | Address | Click here |
| Streaming speech recognition | Address | Click here |
| Text-to-speech | Address | Click here |
| Voice activity detection (VAD) | Address | Click here |
| VAD + non-streaming speech recognition | Address | Click here |
| Two-pass speech recognition | Address | Click here |
| Audio tagging | Address | Click here |
| Audio tagging (WearOS) | Address | Click here |
| Speaker identification | Address | Click here |
| Spoken language identification | Address | Click here |
| Keyword spotting | Address | Click here |
</details> <details>

Real-time speech recognition

| Description | URL | For users in China |
|---|---|---|
| Streaming speech recognition | Address | Click here |

Text-to-speech

| Description | URL | For users in China |
|---|---|---|
| Android (arm64-v8a, armeabi-v7a, x86_64) | Address | Click here |
| Linux (x64) | Address | Click here |
| macOS (x64) | Address | Click here |
| macOS (arm64) | Address | Click here |
| Windows (x64) | Address | Click here |

Note: You need to build from source for iOS.

</details> <details>

Generating subtitles

| Description | URL | For users in China |
|---|---|---|
| Generate subtitles | Address | Click here |
</details> <details>
| Description | URL |
|---|---|
| Speech recognition (speech to text, ASR) | Address |
| Text-to-speech (TTS) | Address |
| VAD | Address |
| Keyword spotting | Address |
| Audio tagging | Address |
| Speaker identification (Speaker ID) | Address |
| Spoken language identification (Language ID) | See multi-lingual Whisper ASR models under Speech recognition |
| Punctuation | Address |
| Speaker segmentation | Address |
| Speech enhancement | Address |
</details>

Some pre-trained ASR models (Streaming)

<details>

Please see

for more models. The following table lists only SOME of them.

| Name | Supported Languages | Description |
|---|---|---|
| sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 | Chinese, English | See also |
| sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16 | Chinese, English | See also |
| sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23 | Chinese | Suitable for Cortex A7 CPU. See also |
| sherpa-onnx-streaming-zipformer-en-20M-2023-02-17 | English | Suitable for Cortex A7 CPU. See also |
| sherpa-onnx-streaming-zipformer-korean-2024-06-16 | Korean | See also |
| sherpa-onnx-streaming-zipformer-fr-2023-04-14 | French | See also |
</details>
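Streaming models like the zipformers above are driven through the online recognizer: audio is fed in small chunks as it arrives, and partial results can be read at any time. A hedged sketch of that loop, with method names taken from sherpa-onnx's Python examples (verify against your installed version) and placeholder model paths:

```python
# Streaming ASR sketch: feed audio chunk by chunk and decode as frames become
# ready. Assumes `pip install sherpa-onnx numpy` and a streaming zipformer
# model; all model file names are placeholders.


def chunked(samples, chunk_size):
    """Split a sample list into consecutive chunks of at most chunk_size."""
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


def stream_transcribe(samples, sample_rate=16000):
    import numpy as np
    import sherpa_onnx

    recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
        tokens="tokens.txt",      # placeholder paths
        encoder="encoder.onnx",
        decoder="decoder.onnx",
        joiner="joiner.onnx",
    )
    stream = recognizer.create_stream()
    # 100 ms chunks, mimicking what a real-time capture loop would deliver.
    for chunk in chunked(samples, sample_rate // 10):
        stream.accept_waveform(sample_rate, np.asarray(chunk, dtype=np.float32))
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
    return recognizer.get_result(stream)
```

In a live application the `for` loop would be replaced by a microphone callback, and `get_result` would be polled after each chunk to display partial text.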

Some pre-trained ASR models (Non-Streaming)

<details>

Please see

for more models. The following table lists only SOME of them.

| Name | Supported Languages | Description |
|---|---|---|
| Whisper tiny.en | English | See also |
| Moonshine tiny | English | See also |
| sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 | Chinese, Cantonese, English, Korean, Japanese | Supports various Chinese dialects. See also |
| sherpa-onnx-paraformer-zh-2024-03-09 | Chinese, English | Also supports various Chinese dialects. See also |
| sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01 | Japanese | See also |
| sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24 | Russian | See also |
| sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24 | Russian | See also |
| sherpa-onnx-zipformer-ru-2024-09-18 | Russian | See also |
| sherpa-onnx-zipformer-korean-2024-06-24 | Korean | See also |
| sherpa-onnx-zipformer-thai-2024-06-20 | Thai | See also |
| sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04 | Chinese | Supports various dialects. See also |
</details>
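Many of the demos listed earlier combine silero-vad with one of these non-streaming models: the VAD cuts incoming audio into speech segments, and each segment is decoded separately. The sketch below follows that flow; the class and attribute names come from sherpa-onnx's VAD examples and should be treated as assumptions to verify, `make_recognizer` is a hypothetical helper standing in for the recognizer setup shown earlier, and the model path is a placeholder.

```python
# VAD + offline ASR sketch: silero-vad detects speech segments, then a
# non-streaming recognizer decodes each segment. Model paths are placeholders.


def segment_bounds(start_sample, num_samples, sample_rate=16000):
    """Return (start, end) times in seconds for a detected speech segment."""
    return (start_sample / sample_rate,
            (start_sample + num_samples) / sample_rate)


def vad_transcribe(samples, sample_rate=16000):
    import numpy as np
    import sherpa_onnx

    config = sherpa_onnx.VadModelConfig()
    config.silero_vad.model = "silero_vad.onnx"  # placeholder path
    config.sample_rate = sample_rate
    vad = sherpa_onnx.VoiceActivityDetector(config, buffer_size_in_seconds=30)

    # Hypothetical helper: builds an offline recognizer, e.g. via
    # sherpa_onnx.OfflineRecognizer.from_transducer(...).
    recognizer = make_recognizer()

    results = []
    window = config.silero_vad.window_size  # samples per VAD step
    data = np.asarray(samples, dtype=np.float32)
    for i in range(0, len(data), window):
        vad.accept_waveform(data[i:i + window])
        while not vad.empty():
            seg = vad.front
            start, end = segment_bounds(seg.start, len(seg.samples), sample_rate)
            s = recognizer.create_stream()
            s.accept_waveform(sample_rate, np.asarray(seg.samples, np.float32))
            recognizer.decode_stream(s)
            results.append((start, end, s.result.text))
            vad.pop()
    return results
```

This is also how the "Generate subtitles" demos work: the `(start, end, text)` tuples map directly onto subtitle cues.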

How to reach us

Please see https://k2-fsa.github.io/sherpa/social-groups.html for the Next-gen Kaldi (新一代 Kaldi) WeChat and QQ discussion groups.

Projects using sherpa-onnx

Open-LLM-VTuber

Talk to any LLM with hands-free voice interaction, voice interruption, and a Live2D talking face, running locally across platforms.

See also https://github.com/t41372/Open-LLM-VTuber/pull/50

voiceapi

<details> <summary>Streaming ASR and TTS based on FastAPI</summary>

It shows how to use the ASR and TTS Python APIs with FastAPI.

</details>

TMSpeech (腾讯会议摸鱼工具, a live-caption tool for Tencent Meeting)

Uses streaming ASR in C# with a graphical user interface.

Video demo in Chinese: [Open source] Real-time captioning software for Windows (a must-have for online classes and meetings)

lol互动助手 (LoL Interaction Assistant)

It uses the JavaScript API of sherpa-onnx along with Electron.

Video demo in Chinese (title, translated): "It's blown up! Xuanshen shows you how to run a typing assistant! The League of Legends tool that actually affects your win rate! The final piece of the League of Legends puzzle! Communicate freely with everyone in the game!"

Sherpa-ONNX speech recognition server (Sherpa-ONNX 语音识别服务器)

A Node.js-based server providing a RESTful API for speech recognition.

QSmartAssistant

A modular, fully offline-capable, low-resource-usage chatbot / smart speaker.

It uses Qt. Both ASR and TTS are used.

Flutter-EasySpeechRecognition

It extends ./flutter-examples/streaming_asr by downloading models inside the app to reduce the size of the app.

sherpa-onnx-unity

sherpa-onnx in Unity. See also #1695, #1892, and #1859