sdk/runanywhere-kotlin/modules/runanywhere-core-llamacpp/README.md
LLM inference backend for the RunAnywhere Kotlin SDK — powered by llama.cpp for on-device text generation.
This module provides the LLM (Large Language Model) backend, enabling on-device text generation using the industry-standard llama.cpp library. It's optimized for mobile devices with support for quantized models (GGUF format).
This module is optional. Only include it if your app needs LLM/text generation capabilities.
Add to your module's `build.gradle.kts`:

```kotlin
dependencies {
    // Core SDK (required)
    implementation("com.runanywhere.sdk:runanywhere-kotlin:0.1.4")

    // LlamaCPP backend (this module)
    implementation("com.runanywhere.sdk:runanywhere-core-llamacpp:0.1.4")
}
```
Once included, the module automatically registers the LLAMA_CPP framework with the SDK.
```kotlin
import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.*
import com.runanywhere.sdk.core.types.InferenceFramework

val model = RunAnywhere.registerModel(
    name = "Qwen 0.5B Instruct",
    url = "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf",
    framework = InferenceFramework.LLAMA_CPP
)

// Download model
RunAnywhere.downloadModel(model.id).collect { progress ->
    println("Download: ${(progress.progress * 100).toInt()}%")
}

// Load into memory
RunAnywhere.loadLLMModel(model.id)

// Simple chat
val response = RunAnywhere.chat("What is 2+2?")
println(response)

// With options
val result = RunAnywhere.generate(
    prompt = "Write a haiku about code",
    options = LLMGenerationOptions(
        maxTokens = 100,
        temperature = 0.8f
    )
)
println("Response: ${result.text}")
println("Speed: ${result.tokensPerSecond} tok/s")

RunAnywhere.generateStream("Tell me a story")
    .collect { token ->
        print(token) // Display tokens in real-time
    }
```
Any GGUF-format model compatible with llama.cpp will work. Popular options:
| Model | Size | Quantization | Use Case |
|---|---|---|---|
| Qwen2.5-0.5B | ~300MB | Q8_0 | General chat, fast inference |
| Qwen2.5-0.5B | ~200MB | Q4_0 | Memory-constrained devices |
| Qwen2.5-1.5B | ~900MB | Q8_0 | Higher quality responses |
| Llama-3.2-1B | ~600MB | Q8_0 | Meta's latest small model |
| Phi-3-mini | ~2.2GB | Q4_K_M | Microsoft's reasoning model |
| DeepSeek-R1-Distill | ~1.5GB | Q4_K_M | Reasoning/thinking model |
Models can be downloaded directly from HuggingFace using the `resolve/main` URL pattern:

```
https://huggingface.co/{org}/{repo}/resolve/main/{filename}.gguf
```
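For example, another model from the table above can be registered with the same pattern. This is a minimal sketch; the repository and GGUF file name below are illustrative, so verify them on the model's HuggingFace page before use:

```kotlin
// URL follows https://huggingface.co/{org}/{repo}/resolve/main/{filename}.gguf
// NOTE: repo and file name are examples only; check the actual file listing on HuggingFace.
val qwen15b = RunAnywhere.registerModel(
    name = "Qwen 1.5B Instruct",
    url = "https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q8_0.gguf",
    framework = InferenceFramework.LLAMA_CPP
)
```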
```
┌──────────────────────────────────────────────────────────────┐
│                   RunAnywhere SDK (Kotlin)                    │
│                                                               │
│      RunAnywhere.generate() / chat() / generateStream()       │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                  runanywhere-core-llamacpp                    │
│                                                               │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │                JNI Bridge (Kotlin ↔ C++)                  │ │
│ │             librac_backend_llamacpp_jni.so                │ │
│ └──────────────────────────────────────────────────────────┘ │
│                              │                                │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │               librunanywhere_llamacpp.so                  │ │
│ │              RunAnywhere llama.cpp wrapper                │ │
│ └──────────────────────────────────────────────────────────┘ │
│                              │                                │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │                     llama.cpp core                        │ │
│ │               libllama.so + libcommon.so                  │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
```
This module bundles the following native libraries (~34MB total for ARM64):
| Library | Size | Description |
|---|---|---|
| librac_backend_llamacpp_jni.so | ~2MB | JNI bridge |
| librunanywhere_llamacpp.so | ~15MB | RunAnywhere llama.cpp wrapper |
| libllama.so | ~15MB | llama.cpp core inference |
| libcommon.so | ~2MB | llama.cpp utilities |
arm64-v8a is the primary target ABI (modern Android devices).

Native libraries are automatically downloaded from GitHub releases:
```properties
# gradle.properties
# Set to false to download prebuilt natives from the GitHub releases
runanywhere.useLocalNatives=false
runanywhere.coreVersion=0.1.4
```
For development with local C++ builds:

```properties
# gradle.properties
# Set to true to use locally built natives from jniLibs/
runanywhere.useLocalNatives=true
```
Then build the native libraries:
```bash
cd ../../   # SDK root
./scripts/build-kotlin.sh --setup
```
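If your app only ships to 64-bit ARM devices, you can optionally restrict packaging to the ABI this module provides. A minimal sketch for an application module's `build.gradle.kts` (this is a standard Android Gradle option, not an SDK setting; omit it if you support other ABIs):

```kotlin
android {
    defaultConfig {
        ndk {
            // Package only arm64-v8a, the ABI shipped by this module
            abiFilters += "arm64-v8a"
        }
    }
}
```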
| Model | Load Time | Tokens/sec | Memory |
|---|---|---|---|
| Qwen2.5-0.5B Q8 | ~500ms | 15-25 tok/s | ~500MB |
| Qwen2.5-0.5B Q4 | ~400ms | 20-30 tok/s | ~300MB |
| Qwen2.5-1.5B Q8 | ~800ms | 10-15 tok/s | ~1.5GB |
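These numbers vary with device, quantization, and thermal state. To compare your own device, time the model load and read the reported throughput; a minimal sketch reusing the APIs shown above (the `benchmark` helper is not part of the SDK):

```kotlin
import kotlin.system.measureTimeMillis

suspend fun benchmark(modelId: String) {
    // Time how long the model takes to load into memory
    val loadMs = measureTimeMillis { RunAnywhere.loadLLMModel(modelId) }
    println("Load time: ${loadMs}ms")

    // Throughput is reported on the generation result (see the generate() example above)
    val result = RunAnywhere.generate(prompt = "Summarize llama.cpp in one sentence")
    println("Speed: ${result.tokensPerSecond} tok/s")
}
```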
Tips:

- Reduce `contextLength` for faster inference
- Call `unloadLLMModel()` to free memory when you are done with a model

`SDKError: MODEL_LOAD_FAILED` - Insufficient memory
Solution: Use a smaller quantized model (Q4 instead of Q8) or ensure sufficient free RAM.
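A minimal fallback pattern, assuming the failed load surfaces as an exception (how errors are reported may differ in your SDK version) and that you have registered both a larger and a smaller quantization of the model:

```kotlin
suspend fun loadWithFallback(primaryId: String, fallbackId: String) {
    try {
        // Try the larger (e.g. Q8) model first
        RunAnywhere.loadLLMModel(primaryId)
    } catch (e: Exception) {
        // Fall back to a smaller quantization (e.g. Q4) when memory is insufficient
        println("Load failed (${e.message}); falling back to the smaller model")
        RunAnywhere.loadLLMModel(fallbackId)
    }
}
```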
If generation is slow, check the result metrics:
```kotlin
val result = RunAnywhere.generate(prompt)
if (result.tokensPerSecond < 5) {
    // Consider a smaller model or check device state
}
```
Ensure the model is in GGUF format and the framework is set correctly:

```kotlin
RunAnywhere.registerModel(
    name = "...",
    url = "...", // must point to a .gguf file
    framework = InferenceFramework.LLAMA_CPP // Must be LLAMA_CPP for this module
)
```
Apache 2.0. See LICENSE.
This module includes:

- llama.cpp (MIT License)