RunAnywhere Core ONNX Module


Speech & Audio inference backend for the RunAnywhere Kotlin SDK — powered by ONNX Runtime and Sherpa-ONNX for on-device STT, TTS, and VAD.


Features

This module provides the Speech-to-Text (STT), Text-to-Speech (TTS), and Voice Activity Detection (VAD) backends for the SDK, so a full voice pipeline can run on-device.

This module is optional. Only include it if your app needs STT, TTS, or VAD capabilities.

  • Speech-to-Text (STT) — Whisper-based transcription on-device
  • Text-to-Speech (TTS) — Neural TTS voice synthesis
  • Voice Activity Detection (VAD) — Silero VAD for speech detection
  • Streaming Support — Real-time transcription and synthesis
  • Multiple Languages — Multi-language STT and TTS support
  • ARM64 Optimized — Native ONNX Runtime for Android

Installation

Add to your module's build.gradle.kts:

kotlin
dependencies {
    // Core SDK (required)
    implementation("com.runanywhere.sdk:runanywhere-kotlin:0.1.4")

    // ONNX backend (this module)
    implementation("com.runanywhere.sdk:runanywhere-core-onnx:0.1.4")
}

Usage

Once included, the module automatically registers the ONNX framework with the SDK.

Speech-to-Text (STT)

kotlin
import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.*

// Register and download STT model
val sttModel = RunAnywhere.registerModel(
    name = "Whisper Tiny",
    url = "https://github.com/RunanywhereAI/sherpa-onnx/releases/download/runanywhere-models-v1/sherpa-onnx-whisper-tiny.en.tar.gz",
    framework = InferenceFramework.ONNX,
    modality = ModelCategory.SPEECH_RECOGNITION
)

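// downloadModel() is collected like a Flow, so call it from a coroutine
RunAnywhere.downloadModel(sttModel.id).collect { progress ->
    println("Download: ${(progress.progress * 100).toInt()}%")
}

// Load and transcribe. audioData holds 16 kHz mono 16-bit PCM
// (see Audio Format Requirements).
RunAnywhere.loadSTTModel(sttModel.id)
val text = RunAnywhere.transcribe(audioData)
println("Transcription: $text")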

Advanced STT Options

kotlin
val output = RunAnywhere.transcribeWithOptions(
    audioData = audioBytes,
    options = STTOptions(
        language = "en",
        enablePunctuation = true,
        enableTimestamps = true
    )
)

println("Text: ${output.text}")
println("Confidence: ${output.confidence}")
output.wordTimestamps?.forEach { word ->
    println("${word.word}: ${word.startTime}s - ${word.endTime}s")
}

Streaming STT

kotlin
RunAnywhere.transcribeStream(audioData) { partial ->
    // Update UI with partial results
    println("Partial: ${partial.transcript}")
}

Text-to-Speech (TTS)

kotlin
// Register and download TTS voice
val ttsVoice = RunAnywhere.registerModel(
    name = "English US Voice",
    url = "https://github.com/RunanywhereAI/sherpa-onnx/releases/download/runanywhere-models-v1/vits-piper-en_US-libritts-high.tar.gz",
    framework = InferenceFramework.ONNX,
    modality = ModelCategory.SPEECH_SYNTHESIS
)

RunAnywhere.downloadModel(ttsVoice.id).collect { /* progress */ }

// Load and synthesize
RunAnywhere.loadTTSVoice(ttsVoice.id)

// Simple speak (handles playback)
RunAnywhere.speak("Hello, world!")

// Or get audio bytes
val output = RunAnywhere.synthesize("Welcome to RunAnywhere")
val audioBytes = output.audioData
val duration = output.duration

TTS Options

kotlin
val output = RunAnywhere.synthesize(
    text = "Hello!",
    options = TTSOptions(
        rate = 1.2f,     // Faster speech
        pitch = 1.0f,
        volume = 0.8f
    )
)

Streaming TTS

kotlin
RunAnywhere.synthesizeStream(longText) { chunk ->
    audioPlayer.play(chunk)  // Play as chunks arrive
}

Voice Activity Detection (VAD)

kotlin
// Detect speech in audio
val result = RunAnywhere.detectVoiceActivity(audioData)

if (result.hasSpeech) {
    println("Speech detected! Confidence: ${result.confidence}")
}

Configure VAD

kotlin
RunAnywhere.configureVAD(VADConfiguration(
    threshold = 0.5f,
    minSpeechDurationMs = 250,
    minSilenceDurationMs = 300
))
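
The three settings interact: `threshold` classifies each frame, `minSilenceDurationMs` decides when a pause ends a segment, and `minSpeechDurationMs` discards blips that are too short. The sketch below illustrates that general gating logic on a list of per-frame speech probabilities. It is an illustration only, not the Silero VAD or Sherpa-ONNX implementation; `SpeechSegment`, `segmentSpeech`, and the 30 ms frame size are hypothetical names and values chosen for the example.

```kotlin
/** A detected speech region, in milliseconds from the start of the audio. */
data class SpeechSegment(val startMs: Int, val endMs: Int)

/**
 * Hypothetical helper: groups per-frame speech probabilities into segments.
 * Frames at or above [threshold] count as speech. A segment closes once the
 * trailing silence reaches [minSilenceDurationMs]; segments shorter than
 * [minSpeechDurationMs] are dropped.
 */
fun segmentSpeech(
    frameProbs: List<Float>,
    frameMs: Int = 30,
    threshold: Float = 0.5f,
    minSpeechDurationMs: Int = 250,
    minSilenceDurationMs: Int = 300
): List<SpeechSegment> {
    val segments = mutableListOf<SpeechSegment>()
    var startMs = -1         // start of the open segment, or -1 if none is open
    var lastSpeechEndMs = 0  // end time of the most recent speech frame

    for ((index, prob) in frameProbs.withIndex()) {
        val frameStart = index * frameMs
        val frameEnd = frameStart + frameMs
        if (prob >= threshold) {
            if (startMs < 0) startMs = frameStart
            lastSpeechEndMs = frameEnd
        } else if (startMs >= 0 && frameEnd - lastSpeechEndMs >= minSilenceDurationMs) {
            // Enough trailing silence: close the segment and keep it if long enough.
            if (lastSpeechEndMs - startMs >= minSpeechDurationMs) {
                segments += SpeechSegment(startMs, lastSpeechEndMs)
            }
            startMs = -1
        }
    }
    // Flush a segment that is still open when the audio ends.
    if (startMs >= 0 && lastSpeechEndMs - startMs >= minSpeechDurationMs) {
        segments += SpeechSegment(startMs, lastSpeechEndMs)
    }
    return segments
}
```

With these defaults, a pause shorter than 300 ms is bridged into the surrounding segment, while a 60 ms burst followed by a long silence is discarded.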

Streaming VAD

kotlin
RunAnywhere.streamVAD(audioSamplesFlow)
    .collect { result ->
        when {
            result.hasSpeech -> println("Speaking...")
            else -> println("Silence")
        }
    }

Supported Models

Speech-to-Text (Whisper)

| Model | Size | Languages | Quality |
| --- | --- | --- | --- |
| whisper-tiny | ~75MB | 99 languages | Good for mobile |
| whisper-base | ~150MB | 99 languages | Better accuracy |
| whisper-small | ~500MB | 99 languages | High accuracy |

Text-to-Speech (VITS/Piper)

| Voice | Size | Language | Quality |
| --- | --- | --- | --- |
| vits-piper-en_US-libritts-high | ~100MB | English (US) | High quality |
| vits-piper-en_GB-* | ~100MB | English (UK) | High quality |
| vits-piper-de_DE-* | ~100MB | German | High quality |
| vits-piper-es_ES-* | ~100MB | Spanish | High quality |

VAD (Built-in)

VAD uses Silero VAD, which is bundled with Sherpa-ONNX (~5MB).


Architecture

┌─────────────────────────────────────────────────────────────┐
│                  RunAnywhere SDK (Kotlin)                    │
│                                                              │
│  RunAnywhere.transcribe() / synthesize() / detectVAD()       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    runanywhere-core-onnx                     │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐ │
│  │                 JNI Bridge (Kotlin ↔ C++)               │ │
│  │             librac_backend_onnx_jni.so                  │ │
│  └────────────────────────────────────────────────────────┘ │
│                              │                               │
│  ┌────────────────────────────────────────────────────────┐ │
│  │                  librunanywhere_onnx.so                 │ │
│  │               RunAnywhere ONNX wrapper                  │ │
│  └────────────────────────────────────────────────────────┘ │
│                              │                               │
│  ┌──────────────────┐  ┌─────────────────────────────────┐ │
│  │ libonnxruntime.so│  │       Sherpa-ONNX libs          │ │
│  │   ONNX Runtime   │  │  STT / TTS / VAD inference      │ │
│  │     (~15MB)      │  │  libsherpa-onnx-*.so            │ │
│  └──────────────────┘  └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Native Libraries

This module bundles the following native libraries (~25MB total for ARM64):

| Library | Size | Description |
| --- | --- | --- |
| librac_backend_onnx_jni.so | ~1MB | JNI bridge |
| librunanywhere_onnx.so | ~2MB | RunAnywhere ONNX wrapper |
| libonnxruntime.so | ~15MB | ONNX Runtime |
| libsherpa-onnx-c-api.so | ~2MB | Sherpa-ONNX C API |
| libsherpa-onnx-cxx-api.so | ~3MB | Sherpa-ONNX C++ API |
| libsherpa-onnx-jni.so | ~2MB | Sherpa-ONNX JNI bridge |

Supported ABIs

  • arm64-v8a — Primary target (modern Android devices)
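
Since the module ships native libraries only for arm64-v8a, an app may want to restrict packaging to that ABI. The fragment below is a sketch for the app module's build.gradle.kts using the standard Android Gradle Plugin `ndk.abiFilters` setting. It is not something this module requires, and whether it suits your app depends on your other native dependencies and on whether you need x86_64 emulator builds.

```kotlin
// app/build.gradle.kts (illustrative fragment, not required by this module)
android {
    defaultConfig {
        ndk {
            // Package native libraries for arm64-v8a only
            abiFilters += "arm64-v8a"
        }
    }
}
```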

Build Configuration

Remote Mode (Default)

Native libraries are automatically downloaded from GitHub releases:

properties
# gradle.properties
# false (default): download native libraries from GitHub releases
runanywhere.useLocalNatives=false
runanywhere.coreVersion=0.1.4

Local Development

For developing with local C++ builds:

properties
# gradle.properties
# true: use locally built libraries from jniLibs/
runanywhere.useLocalNatives=true

Then build the native libraries:

bash
cd ../../  # SDK root
./scripts/build-kotlin.sh --setup

Performance

Speech-to-Text (Pixel 7, 8GB RAM)

| Model | Audio Length | Processing Time | RTF |
| --- | --- | --- | --- |
| whisper-tiny | 5s | ~200ms | 0.04 |
| whisper-tiny | 30s | ~1.2s | 0.04 |
| whisper-base | 5s | ~400ms | 0.08 |

RTF = Real-Time Factor: processing time divided by audio length (lower is better; below 1.0 is faster than real time).
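
The RTF column follows directly from the other two columns. A small sketch of the arithmetic, using a hypothetical helper name:

```kotlin
/**
 * Real-time factor: processing time divided by audio duration.
 * Values below 1.0 mean processing runs faster than real time.
 */
fun realTimeFactor(processingMs: Double, audioMs: Double): Double {
    require(audioMs > 0.0) { "audio duration must be positive" }
    return processingMs / audioMs
}
```

For the first table row, `realTimeFactor(200.0, 5_000.0)` gives 200 / 5000 = 0.04. The same calculation is useful for benchmarking a model on your own target devices.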

Text-to-Speech

| Voice | Text Length | Synthesis Time | Audio Duration |
| --- | --- | --- | --- |
| libritts-high | 100 chars | ~100ms | ~2s |
| libritts-high | 500 chars | ~300ms | ~10s |

VAD

  • Frame processing: < 5ms per 30ms frame
  • Latency: < 100ms speech detection

Audio Format Requirements

STT Input

  • Format: PCM (16-bit signed, little-endian)
  • Sample Rate: 16000 Hz (recommended)
  • Channels: Mono

kotlin
val options = STTOptions(
    audioFormat = AudioFormat.PCM,
    sampleRate = 16000
)
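
Many capture pipelines produce normalized float samples rather than 16-bit PCM. The sketch below converts such samples to the 16-bit signed little-endian layout listed above. It assumes the samples are already 16 kHz mono in the range -1.0 to 1.0 and that the SDK accepts the result as a ByteArray; `floatToPcm16Le` is a hypothetical helper, not part of the SDK.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/** Converts normalized float samples (-1.0..1.0) to 16-bit signed little-endian PCM. */
fun floatToPcm16Le(samples: FloatArray): ByteArray {
    val buffer = ByteBuffer.allocate(samples.size * 2).order(ByteOrder.LITTLE_ENDIAN)
    for (sample in samples) {
        // Clamp first so out-of-range input saturates instead of wrapping around.
        val clamped = sample.coerceIn(-1.0f, 1.0f)
        buffer.putShort((clamped * 32767f).toInt().toShort())
    }
    return buffer.array()
}
```

Resampling and stereo-to-mono downmixing are out of scope here; do those before the conversion if your source audio is not already 16 kHz mono.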

TTS Output

  • Format: PCM (16-bit signed)
  • Sample Rate: 22050 Hz (default) or 44100 Hz
  • Channels: Mono
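
Raw PCM carries no header, so a player or file consumer has to be told the sample rate and channel count. One way to make the output self-describing is to wrap it in a WAV container. The sketch below assumes the synthesized bytes are little-endian (the list above does not state the byte order, and WAV requires little-endian); `pcm16ToWav` and `pcm16DurationSeconds` are hypothetical helpers, not SDK functions.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/** Prepends a 44-byte WAV header to 16-bit PCM data (assumed little-endian). */
fun pcm16ToWav(pcm: ByteArray, sampleRate: Int = 22050, channels: Int = 1): ByteArray {
    val bitsPerSample = 16
    val blockAlign = channels * bitsPerSample / 8   // bytes per sample frame
    val byteRate = sampleRate * blockAlign          // bytes per second
    val header = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN)
    header.put("RIFF".toByteArray(Charsets.US_ASCII))
    header.putInt(36 + pcm.size)                    // size of everything after this field
    header.put("WAVE".toByteArray(Charsets.US_ASCII))
    header.put("fmt ".toByteArray(Charsets.US_ASCII))
    header.putInt(16)                               // fmt chunk size for linear PCM
    header.putShort(1.toShort())                    // audio format 1 = linear PCM
    header.putShort(channels.toShort())
    header.putInt(sampleRate)
    header.putInt(byteRate)
    header.putShort(blockAlign.toShort())
    header.putShort(bitsPerSample.toShort())
    header.put("data".toByteArray(Charsets.US_ASCII))
    header.putInt(pcm.size)
    return header.array() + pcm
}

/** Playback duration in seconds of 16-bit PCM data. */
fun pcm16DurationSeconds(byteCount: Int, sampleRate: Int = 22050, channels: Int = 1): Double =
    byteCount.toDouble() / (sampleRate * channels * 2)
```

At the default 22050 Hz mono, one second of audio is 44100 bytes, which is a quick sanity check that the sample rate you pass matches the one used for synthesis.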

Requirements

  • Android: API 24+ (Android 7.0+)
  • Architecture: ARM64 (arm64-v8a)
  • Memory: 512MB+ free RAM recommended
  • RunAnywhere SDK: 0.1.4+
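
The API level and ABI requirements can be checked at runtime before attempting to load a model, so unsupported devices fail gracefully. A minimal sketch with a hypothetical helper name; on Android the arguments would come from `Build.VERSION.SDK_INT` and `Build.SUPPORTED_ABIS.toList()`, and they are kept as parameters here so the function can be exercised on any JVM.

```kotlin
/**
 * Hypothetical helper: true if the device meets the API level and ABI
 * requirements listed above (API 24+ and arm64-v8a).
 */
fun meetsOnnxModuleRequirements(sdkInt: Int, supportedAbis: List<String>): Boolean =
    sdkInt >= 24 && "arm64-v8a" in supportedAbis
```

The free-RAM recommendation is not covered by this check.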

Troubleshooting

STT model fails to load

SDKError: MODEL_LOAD_FAILED - Invalid model format

Solution: Ensure the model is a Sherpa-ONNX model archive from the official releases. These are usually .tar.bz2 files, although the RunanywhereAI release URLs in the Usage section end in .tar.gz.

TTS voice sounds robotic

Try using a higher-quality voice model:

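kotlin
// Use a "high" quality variant rather than "low" or "medium"
val ttsVoice = RunAnywhere.registerModel(
    name = "English US Voice",
    url = "...libritts-high.tar.bz2",
    framework = InferenceFramework.ONNX,
    modality = ModelCategory.SPEECH_SYNTHESIS
)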

VAD too sensitive / not sensitive enough

Adjust the threshold:

kotlin
RunAnywhere.configureVAD(VADConfiguration(
    threshold = 0.3f,  // Range 0.0 - 1.0: lower = more sensitive, higher = fewer false triggers
    minSpeechDurationMs = 100  // Shorter = faster detection
))

Audio playback issues

Ensure proper audio format matching:

kotlin
val output = RunAnywhere.synthesize(text, TTSOptions(
    audioFormat = AudioFormat.PCM,
    sampleRate = 22050
))
// Configure AudioTrack with matching sample rate

License

Apache 2.0. See LICENSE.

This module includes:

  • ONNX Runtime — MIT License
  • Sherpa-ONNX — Apache 2.0 License
  • Silero VAD — MIT License

See Also