sdk/runanywhere-kotlin/modules/runanywhere-core-onnx/README.md
Speech & Audio inference backend for the RunAnywhere Kotlin SDK — powered by ONNX Runtime and Sherpa-ONNX for on-device STT, TTS, and VAD.
This module provides the Speech-to-Text (STT), Text-to-Speech (TTS), and Voice Activity Detection (VAD) backends for the SDK, so the full voice pipeline (detecting speech, transcribing it, and synthesizing a reply) runs on-device.
This module is optional. Only include it if your app needs STT, TTS, or VAD capabilities.
Add to your module's build.gradle.kts:
dependencies {
// Core SDK (required)
implementation("com.runanywhere.sdk:runanywhere-kotlin:0.1.4")
// ONNX backend (this module)
implementation("com.runanywhere.sdk:runanywhere-core-onnx:0.1.4")
}
Once included, the module automatically registers the ONNX framework with the SDK.
import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.*
// Register and download STT model
val sttModel = RunAnywhere.registerModel(
name = "Whisper Tiny",
url = "https://github.com/RunanywhereAI/sherpa-onnx/releases/download/runanywhere-models-v1/sherpa-onnx-whisper-tiny.en.tar.gz",
framework = InferenceFramework.ONNX,
modality = ModelCategory.SPEECH_RECOGNITION
)
RunAnywhere.downloadModel(sttModel.id).collect { progress ->
println("Download: ${(progress.progress * 100).toInt()}%")
}
// Load and transcribe
RunAnywhere.loadSTTModel(sttModel.id)
val text = RunAnywhere.transcribe(audioData)
println("Transcription: $text")
val output = RunAnywhere.transcribeWithOptions(
audioData = audioBytes,
options = STTOptions(
language = "en",
enablePunctuation = true,
enableTimestamps = true
)
)
println("Text: ${output.text}")
println("Confidence: ${output.confidence}")
output.wordTimestamps?.forEach { word ->
println("${word.word}: ${word.startTime}s - ${word.endTime}s")
}
RunAnywhere.transcribeStream(audioData) { partial ->
// Update UI with partial results
println("Partial: ${partial.transcript}")
}
// Register and download TTS voice
val ttsVoice = RunAnywhere.registerModel(
name = "English US Voice",
url = "https://github.com/RunanywhereAI/sherpa-onnx/releases/download/runanywhere-models-v1/vits-piper-en_US-libritts-high.tar.gz",
framework = InferenceFramework.ONNX,
modality = ModelCategory.SPEECH_SYNTHESIS
)
RunAnywhere.downloadModel(ttsVoice.id).collect { /* progress */ }
// Load and synthesize
RunAnywhere.loadTTSVoice(ttsVoice.id)
// Simple speak (handles playback)
RunAnywhere.speak("Hello, world!")
// Or get audio bytes
val output = RunAnywhere.synthesize("Welcome to RunAnywhere")
val audioBytes = output.audioData
val duration = output.duration
val output = RunAnywhere.synthesize(
text = "Hello!",
options = TTSOptions(
rate = 1.2f, // Faster speech
pitch = 1.0f,
volume = 0.8f
)
)
RunAnywhere.synthesizeStream(longText) { chunk ->
audioPlayer.play(chunk) // Play as chunks arrive
}
// Detect speech in audio
val result = RunAnywhere.detectVoiceActivity(audioData)
if (result.hasSpeech) {
println("Speech detected! Confidence: ${result.confidence}")
}
RunAnywhere.configureVAD(VADConfiguration(
threshold = 0.5f,
minSpeechDurationMs = 250,
minSilenceDurationMs = 300
))
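To illustrate how the three settings above interact, here is a minimal sketch that turns per-frame speech probabilities into speech segments. It is not the SDK's or Silero's implementation: `SpeechSegment`, `segmentSpeech`, and the fixed `frameMs` are hypothetical names introduced for illustration, and the exact smoothing inside the real VAD may differ.

```kotlin
/** A detected speech span in milliseconds (hypothetical type, for illustration only). */
data class SpeechSegment(val startMs: Int, val endMs: Int)

/**
 * Groups per-frame speech probabilities into segments.
 * - A frame counts as speech when its probability is >= threshold.
 * - A segment closes only after at least minSilenceMs of continuous non-speech.
 * - Segments shorter than minSpeechMs are discarded.
 */
fun segmentSpeech(
    probs: List<Float>,
    frameMs: Int,
    threshold: Float = 0.5f,
    minSpeechMs: Int = 250,
    minSilenceMs: Int = 300
): List<SpeechSegment> {
    val segments = mutableListOf<SpeechSegment>()
    var startMs = -1          // start of the open segment, -1 when none is open
    var lastSpeechEndMs = 0   // end time of the most recent speech frame

    for ((i, p) in probs.withIndex()) {
        val frameStart = i * frameMs
        val frameEnd = frameStart + frameMs
        if (p >= threshold) {
            if (startMs < 0) startMs = frameStart
            lastSpeechEndMs = frameEnd
        } else if (startMs >= 0 && frameEnd - lastSpeechEndMs >= minSilenceMs) {
            // Enough trailing silence: close the segment if it is long enough.
            if (lastSpeechEndMs - startMs >= minSpeechMs) {
                segments += SpeechSegment(startMs, lastSpeechEndMs)
            }
            startMs = -1
        }
    }
    // Flush a segment that is still open when the audio ends.
    if (startMs >= 0 && lastSpeechEndMs - startMs >= minSpeechMs) {
        segments += SpeechSegment(startMs, lastSpeechEndMs)
    }
    return segments
}
```

In this sketch a short pause (shorter than `minSilenceMs`) is absorbed into the surrounding segment, and a brief blip (shorter than `minSpeechMs`) is dropped, which is the usual motivation for exposing both durations alongside the threshold.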
RunAnywhere.streamVAD(audioSamplesFlow)
.collect { result ->
when {
result.hasSpeech -> println("Speaking...")
else -> println("Silence")
}
}
| Model | Size | Languages | Quality |
|---|---|---|---|
| whisper-tiny | ~75MB | 99 languages | Good for mobile |
| whisper-base | ~150MB | 99 languages | Better accuracy |
| whisper-small | ~500MB | 99 languages | High accuracy |
| Voice | Size | Language | Quality |
|---|---|---|---|
| vits-piper-en_US-libritts-high | ~100MB | English (US) | High quality |
| vits-piper-en_GB-* | ~100MB | English (UK) | High quality |
| vits-piper-de_DE-* | ~100MB | German | High quality |
| vits-piper-es_ES-* | ~100MB | Spanish | High quality |
VAD uses Silero VAD which is bundled with Sherpa-ONNX (~5MB).
┌─────────────────────────────────────────────────────────────┐
│ RunAnywhere SDK (Kotlin) │
│ │
│ RunAnywhere.transcribe() / synthesize() / detectVAD() │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ runanywhere-core-onnx │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ JNI Bridge (Kotlin ↔ C++) │ │
│ │ librac_backend_onnx_jni.so │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ librunanywhere_onnx.so │ │
│ │ RunAnywhere ONNX wrapper │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────┐ ┌─────────────────────────────────┐ │
│ │ libonnxruntime.so│ │ Sherpa-ONNX libs │ │
│ │ ONNX Runtime │ │ STT / TTS / VAD inference │ │
│ │ (~15MB) │ │ libsherpa-onnx-*.so │ │
│ └──────────────────┘ └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
This module bundles the following native libraries (~25MB total for ARM64):
| Library | Size | Description |
|---|---|---|
| librac_backend_onnx_jni.so | ~1MB | JNI bridge |
| librunanywhere_onnx.so | ~2MB | RunAnywhere ONNX wrapper |
| libonnxruntime.so | ~15MB | ONNX Runtime |
| libsherpa-onnx-c-api.so | ~2MB | Sherpa-ONNX C API |
| libsherpa-onnx-cxx-api.so | ~3MB | Sherpa-ONNX C++ API |
| libsherpa-onnx-jni.so | ~2MB | Sherpa-ONNX JNI bridge |
arm64-v8a: Primary target (modern Android devices)
Native libraries are automatically downloaded from GitHub releases:
# gradle.properties
# Download prebuilt native libraries from GitHub releases
runanywhere.useLocalNatives=false
runanywhere.coreVersion=0.1.4
For developing with local C++ builds:
# gradle.properties
# Use locally built libraries from jniLibs/
runanywhere.useLocalNatives=true
Then build the native libraries:
cd ../../ # SDK root
./scripts/build-kotlin.sh --setup
| Model | Audio Length | Processing Time | RTF |
|---|---|---|---|
| whisper-tiny | 5s | ~200ms | 0.04 |
| whisper-tiny | 30s | ~1.2s | 0.04 |
| whisper-base | 5s | ~400ms | 0.08 |
RTF = Real-Time Factor: processing time divided by audio length (lower is better; values below 1.0 are faster than real time).
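The RTF column can be reproduced from the other two columns. A minimal sketch follows; `realTimeFactor` is a hypothetical helper, not part of the SDK.

```kotlin
/**
 * Real-time factor: processing time divided by the duration of the audio processed.
 * Both arguments are in milliseconds. Values below 1.0 mean faster than real time.
 */
fun realTimeFactor(processingMs: Double, audioMs: Double): Double {
    require(audioMs > 0.0) { "audio duration must be positive" }
    return processingMs / audioMs
}
```

For the first row of the table, `realTimeFactor(200.0, 5_000.0)` evaluates to 0.04, i.e. 200 ms of processing for 5 s of audio.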
| Voice | Text Length | Synthesis Time | Audio Duration |
|---|---|---|---|
| libritts-high | 100 chars | ~100ms | ~2s |
| libritts-high | 500 chars | ~300ms | ~10s |
val options = STTOptions(
audioFormat = AudioFormat.PCM,
sampleRate = 16000
)
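If your recorder produces float samples, they need to be converted to PCM bytes before transcription. The sketch below assumes the STT backend expects mono, 16-bit, little-endian PCM at the configured sample rate; this README only states PCM and 16000 Hz, so verify the bit depth and byte order against your model. `floatsToPcm16` is a hypothetical helper, not an SDK function.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/**
 * Converts float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes.
 * Out-of-range samples are clamped rather than wrapped, to avoid loud artifacts.
 */
fun floatsToPcm16(samples: FloatArray): ByteArray {
    val buffer = ByteBuffer.allocate(samples.size * 2).order(ByteOrder.LITTLE_ENDIAN)
    for (sample in samples) {
        val clamped = sample.coerceIn(-1.0f, 1.0f)
        buffer.putShort((clamped * Short.MAX_VALUE).toInt().toShort())
    }
    return buffer.array()
}
```

The resulting `ByteArray` holds two bytes per sample, so one second of 16 kHz mono audio is 32,000 bytes.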
SDKError: MODEL_LOAD_FAILED - Invalid model format
Solution: Ensure the model is in Sherpa-ONNX format (distributed as .tar.gz or .tar.bz2 archives on the release pages).
If synthesized speech sounds low quality, try a higher-quality voice model:
// Use "high" quality variants
val ttsVoice = RunAnywhere.registerModel(
url = "...libritts-high.tar.bz2" // Not "low" or "medium"
)
If VAD misses speech or reacts slowly, adjust the threshold and minimum speech duration:
RunAnywhere.configureVAD(VADConfiguration(
threshold = 0.3f, // Lower = more sensitive (0.0 - 1.0)
minSpeechDurationMs = 100 // Shorter = faster detection
))
Ensure proper audio format matching:
val output = RunAnywhere.synthesize(text, TTSOptions(
audioFormat = AudioFormat.PCM,
sampleRate = 22050
))
// Configure AudioTrack with matching sample rate
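One way to check for a sample-rate mismatch off-device is to wrap the raw PCM in a WAV header and play the file in a desktop audio tool. The sketch below assumes the synthesized audio is raw 16-bit mono PCM, which this README does not state explicitly; `pcmToWav` is a hypothetical helper, not an SDK function.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/**
 * Prepends a standard 44-byte RIFF/WAVE header to raw PCM data.
 * The header records the sample rate, so playing the result at the wrong
 * pitch or speed indicates the declared rate does not match the data.
 */
fun pcmToWav(
    pcm: ByteArray,
    sampleRate: Int = 22050,
    channels: Int = 1,
    bitsPerSample: Int = 16
): ByteArray {
    val blockAlign = channels * bitsPerSample / 8   // bytes per sample frame
    val byteRate = sampleRate * blockAlign          // bytes per second
    val buffer = ByteBuffer.allocate(44 + pcm.size).order(ByteOrder.LITTLE_ENDIAN)

    buffer.put("RIFF".toByteArray(Charsets.US_ASCII))
    buffer.putInt(36 + pcm.size)                    // remaining file size after this field
    buffer.put("WAVE".toByteArray(Charsets.US_ASCII))

    buffer.put("fmt ".toByteArray(Charsets.US_ASCII))
    buffer.putInt(16)                               // fmt chunk size for PCM
    buffer.putShort(1.toShort())                    // audio format 1 = uncompressed PCM
    buffer.putShort(channels.toShort())
    buffer.putInt(sampleRate)
    buffer.putInt(byteRate)
    buffer.putShort(blockAlign.toShort())
    buffer.putShort(bitsPerSample.toShort())

    buffer.put("data".toByteArray(Charsets.US_ASCII))
    buffer.putInt(pcm.size)
    buffer.put(pcm)
    return buffer.array()
}
```

Writing the result with `java.io.File("out.wav").writeBytes(pcmToWav(audioBytes))` gives a file that standard players can open. If it sounds correct at 22050 Hz but not on the device, the playback configuration is the likely culprit.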
Apache 2.0. See LICENSE.
This module includes: