RunAnywhere Core LlamaCPP Module

LLM inference backend for the RunAnywhere Kotlin SDK — powered by llama.cpp for on-device text generation.


Features

This module provides the LLM (Large Language Model) backend, enabling on-device text generation using the industry-standard llama.cpp library. It's optimized for mobile devices with support for quantized models (GGUF format).

This module is optional. Only include it if your app needs LLM/text generation capabilities.

  • On-Device LLM Inference — Run language models locally without network
  • GGUF Model Support — Compatible with quantized models from HuggingFace
  • Streaming Generation — Token-by-token output for responsive UX
  • Multiple Quantization Levels — Q4, Q5, Q8 for memory/quality tradeoffs
  • Thinking/Reasoning Models — Support for models with reasoning capabilities
  • ARM64 Optimized — Native performance on modern Android devices

Installation

Add to your module's build.gradle.kts:

kotlin
dependencies {
    // Core SDK (required)
    implementation("com.runanywhere.sdk:runanywhere-kotlin:0.1.4")

    // LlamaCPP backend (this module)
    implementation("com.runanywhere.sdk:runanywhere-core-llamacpp:0.1.4")
}

Usage

Once included, the module automatically registers the LLAMA_CPP framework with the SDK.

Register a Model

kotlin
import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.*
import com.runanywhere.sdk.core.types.InferenceFramework

val model = RunAnywhere.registerModel(
    name = "Qwen 0.5B Instruct",
    url = "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf",
    framework = InferenceFramework.LLAMA_CPP
)

Download & Load

kotlin
// Download model
RunAnywhere.downloadModel(model.id).collect { progress ->
    println("Download: ${(progress.progress * 100).toInt()}%")
}

// Load into memory
RunAnywhere.loadLLMModel(model.id)

Generate Text

kotlin
// Simple chat
val response = RunAnywhere.chat("What is 2+2?")
println(response)

// With options
val result = RunAnywhere.generate(
    prompt = "Write a haiku about code",
    options = LLMGenerationOptions(
        maxTokens = 100,
        temperature = 0.8f
    )
)
println("Response: ${result.text}")
println("Speed: ${result.tokensPerSecond} tok/s")

Streaming

kotlin
RunAnywhere.generateStream("Tell me a story")
    .collect { token ->
        print(token) // Display tokens in real-time
    }
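
The stream is consumed with collect, which is a suspending call, so it must run inside a coroutine. A minimal sketch of collecting the stream, assuming generateStream returns a Kotlin Flow of tokens as the example above implies (the coroutine scope here is illustrative, not part of the SDK):

kotlin
import kotlinx.coroutines.MainScope
import kotlinx.coroutines.launch

// Illustrative only: use viewModelScope/lifecycleScope in real Android code
val uiScope = MainScope()

uiScope.launch {
    val fullResponse = StringBuilder()
    RunAnywhere.generateStream("Tell me a story")
        .collect { token ->
            fullResponse.append(token) // accumulate tokens as they arrive
        }
    println("Final response: $fullResponse")
}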

Supported Models

Any GGUF-format model compatible with llama.cpp. Popular options:

| Model | Size | Quantization | Use Case |
|-------|------|--------------|----------|
| Qwen2.5-0.5B | ~300MB | Q8_0 | General chat, fast inference |
| Qwen2.5-0.5B | ~200MB | Q4_0 | Memory-constrained devices |
| Qwen2.5-1.5B | ~900MB | Q8_0 | Higher quality responses |
| Llama-3.2-1B | ~600MB | Q8_0 | Meta's latest small model |
| Phi-3-mini | ~2.2GB | Q4_K_M | Microsoft's reasoning model |
| DeepSeek-R1-Distill | ~1.5GB | Q4_K_M | Reasoning/thinking model |

HuggingFace URLs

Models can be downloaded directly from HuggingFace using the resolve/main URL pattern:

https://huggingface.co/{org}/{repo}/resolve/main/{filename}.gguf
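
For example, the pattern can be filled in and passed straight to registerModel. The org, repo, and filename below are placeholders, not a verified download link; substitute values from an actual GGUF repository:

kotlin
// Placeholders only: replace with a real HuggingFace org/repo/filename
val ggufUrl = "https://huggingface.co/{org}/{repo}/resolve/main/{filename}.gguf"

val model = RunAnywhere.registerModel(
    name = "My GGUF Model",
    url = ggufUrl,
    framework = InferenceFramework.LLAMA_CPP
)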

Architecture

┌─────────────────────────────────────────────────────────────┐
│                  RunAnywhere SDK (Kotlin)                    │
│                                                              │
│  RunAnywhere.generate() / chat() / generateStream()          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   runanywhere-core-llamacpp                  │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐ │
│  │                 JNI Bridge (Kotlin ↔ C++)               │ │
│  │           librac_backend_llamacpp_jni.so                │ │
│  └────────────────────────────────────────────────────────┘ │
│                              │                               │
│  ┌────────────────────────────────────────────────────────┐ │
│  │                 librunanywhere_llamacpp.so              │ │
│  │              RunAnywhere llama.cpp wrapper              │ │
│  └────────────────────────────────────────────────────────┘ │
│                              │                               │
│  ┌────────────────────────────────────────────────────────┐ │
│  │                    llama.cpp core                       │ │
│  │             libllama.so + libcommon.so                  │ │
│  └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Native Libraries

This module bundles the following native libraries (~34MB total for ARM64):

| Library | Size | Description |
|---------|------|-------------|
| librac_backend_llamacpp_jni.so | ~2MB | JNI bridge |
| librunanywhere_llamacpp.so | ~15MB | RunAnywhere llama.cpp wrapper |
| libllama.so | ~15MB | llama.cpp core inference |
| libcommon.so | ~2MB | llama.cpp utilities |

Supported ABIs

  • arm64-v8a — Primary target (modern Android devices)
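
Because only arm64-v8a is bundled, apps with no other native dependencies may want to restrict their ABI filters to match, so the build never targets ABIs this module cannot serve. A minimal sketch of the standard Android Gradle configuration (adjust if other libraries ship additional ABIs):

kotlin
// app/build.gradle.kts
android {
    defaultConfig {
        ndk {
            abiFilters += "arm64-v8a" // package only the ABI this module provides
        }
    }
}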

Build Configuration

Remote Mode (Default)

Native libraries are automatically downloaded from GitHub releases:

properties
# gradle.properties
# false = download prebuilt native libraries from GitHub releases
runanywhere.useLocalNatives=false
runanywhere.coreVersion=0.1.4

Local Development

For developing with local C++ builds:

properties
# gradle.properties
# true = use locally built native libraries from jniLibs/
runanywhere.useLocalNatives=true

Then build the native libraries:

bash
cd ../../  # SDK root
./scripts/build-kotlin.sh --setup

Performance

Typical Benchmarks (Pixel 7, 8GB RAM)

| Model | Load Time | Tokens/sec | Memory |
|-------|-----------|------------|--------|
| Qwen2.5-0.5B Q8 | ~500ms | 15-25 tok/s | ~500MB |
| Qwen2.5-0.5B Q4 | ~400ms | 20-30 tok/s | ~300MB |
| Qwen2.5-1.5B Q8 | ~800ms | 10-15 tok/s | ~1.5GB |
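
To see how a particular device compares, time the model load and read the throughput the SDK reports on each result; a rough sketch (the prompt and timing approach are illustrative, and the calls should run inside a coroutine if they are suspending):

kotlin
import kotlin.system.measureTimeMillis

// Rough on-device benchmark: time the load, then read tokens/sec from the result
val loadMs = measureTimeMillis { RunAnywhere.loadLLMModel(model.id) }
val result = RunAnywhere.generate("Summarize llama.cpp in one sentence")
println("Load: ${loadMs}ms, speed: ${result.tokensPerSecond} tok/s")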

Optimization Tips

  1. Use quantized models — Q4 uses ~40% less memory than Q8
  2. Limit context length — Reduce contextLength for faster inference
  3. Monitor thermal state — Throttle if device is hot
  4. Unload when done — Call unloadLLMModel() to free memory (see the sketch below)
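
A minimal sketch of the unload pattern from tip 4, reusing the generation options shown earlier (assumes unloadLLMModel() is safe to call from a finally block):

kotlin
// Lifecycle sketch: load, generate, and always unload to release native memory
RunAnywhere.loadLLMModel(model.id)
try {
    val result = RunAnywhere.generate(
        prompt = "Explain GGUF in one sentence",
        options = LLMGenerationOptions(maxTokens = 64, temperature = 0.7f)
    )
    println(result.text)
} finally {
    RunAnywhere.unloadLLMModel() // free the model's memory when you're done
}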

Requirements

  • Android: API 24+ (Android 7.0+)
  • Architecture: ARM64 (arm64-v8a)
  • Memory: 1GB+ free RAM recommended
  • RunAnywhere SDK: 0.1.4+

Troubleshooting

Model fails to load

SDKError: MODEL_LOAD_FAILED - Insufficient memory

Solution: Use a smaller quantized model (Q4 instead of Q8) or ensure sufficient free RAM.

Slow inference

Check the result metrics:

kotlin
val result = RunAnywhere.generate(prompt)
if (result.tokensPerSecond < 5) {
    // Consider a smaller model or check device state
}

Model not recognized

Ensure the model is GGUF format and framework is set correctly:

kotlin
RunAnywhere.registerModel(
    name = "...",                             // your model's display name
    url = "...",                              // direct .gguf download URL
    framework = InferenceFramework.LLAMA_CPP  // Must be LLAMA_CPP for this module
)

License

Apache 2.0. See LICENSE.

This module includes:

  • llama.cpp — MIT License
  • ggml — MIT License

See Also