# @runanywhere/llamacpp
LlamaCPP backend for the RunAnywhere React Native SDK. Provides on-device Large Language Model (LLM) inference and text generation with GGUF models, powered by llama.cpp.
## Installation

```bash
npm install @runanywhere/core @runanywhere/llamacpp
# or
yarn add @runanywhere/core @runanywhere/llamacpp
```

`@runanywhere/core` is a peer dependency and must be installed alongside this package.

### iOS

```bash
cd ios && pod install && cd ..
```

### Android

No additional setup required. Native libraries are downloaded automatically during the Gradle build.
## Quick Start

```typescript
import { RunAnywhere, SDKEnvironment } from '@runanywhere/core';
import { LlamaCPP } from '@runanywhere/llamacpp';

// 1. Initialize SDK
await RunAnywhere.initialize({
  environment: SDKEnvironment.Development,
});

// 2. Register LlamaCPP backend
LlamaCPP.register();

// 3. Add a model
await LlamaCPP.addModel({
  id: 'smollm2-360m-q8_0',
  name: 'SmolLM2 360M Q8_0',
  url: 'https://huggingface.co/prithivMLmods/SmolLM2-360M-GGUF/resolve/main/SmolLM2-360M.Q8_0.gguf',
  memoryRequirement: 500_000_000,
});

// 4. Download model
await RunAnywhere.downloadModel('smollm2-360m-q8_0', (progress) => {
  console.log(`Downloading: ${(progress.progress * 100).toFixed(1)}%`);
});

// 5. Load model
const modelInfo = await RunAnywhere.getModelInfo('smollm2-360m-q8_0');
await RunAnywhere.loadModel(modelInfo.localPath);

// 6. Generate text
const response = await RunAnywhere.chat('What is the capital of France?');
console.log(response);
```
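In a real app it is worth pairing load and generate with cleanup so native memory is released even when generation throws. A minimal sketch using only the calls shown above; `runPrompt` is a hypothetical helper, not part of the SDK:

```typescript
import { RunAnywhere } from '@runanywhere/core';

// Hypothetical convenience wrapper: generate one response and always
// release the model afterwards, even if generation throws.
async function runPrompt(localPath: string, prompt: string): Promise<string> {
  await RunAnywhere.loadModel(localPath);
  try {
    return await RunAnywhere.chat(prompt);
  } finally {
    // Frees the native memory held by the loaded model.
    await RunAnywhere.unloadModel();
  }
}
```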
## API Reference

```typescript
import { LlamaCPP } from '@runanywhere/llamacpp';
```

### `LlamaCPP.register()`

Registers the LlamaCPP backend with the SDK. Must be called after `RunAnywhere.initialize()` and before using any LLM features.

```typescript
LlamaCPP.register(): void
```

Example:

```typescript
await RunAnywhere.initialize({ ... });
LlamaCPP.register(); // Now LLM features are available
```
### `LlamaCPP.addModel(options)`

Adds a GGUF model to the model registry.

```typescript
await LlamaCPP.addModel(options: LlamaCPPModelOptions): Promise<ModelInfo>
```

Parameters:

```typescript
interface LlamaCPPModelOptions {
  /**
   * Unique model ID.
   * If not provided, generated from the URL filename.
   */
  id?: string;

  /** Display name for the model */
  name: string;

  /** Download URL for the model (GGUF format) */
  url: string;

  /**
   * Model category.
   * Default: ModelCategory.Language
   */
  modality?: ModelCategory;

  /**
   * Memory requirement in bytes.
   * Used for device capability checks.
   */
  memoryRequirement?: number;

  /**
   * Whether the model supports reasoning/thinking tokens.
   * If true, thinking content is extracted from responses.
   */
  supportsThinking?: boolean;
}
```
Returns: `Promise<ModelInfo>` — the registered model info.

Example:
```typescript
// Basic model
await LlamaCPP.addModel({
  id: 'smollm2-360m-q8_0',
  name: 'SmolLM2 360M Q8_0',
  url: 'https://huggingface.co/prithivMLmods/SmolLM2-360M-GGUF/resolve/main/SmolLM2-360M.Q8_0.gguf',
  memoryRequirement: 500_000_000,
});

// Larger model
await LlamaCPP.addModel({
  id: 'llama-2-7b-chat-q4_k_m',
  name: 'Llama 2 7B Chat Q4_K_M',
  url: 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf',
  memoryRequirement: 4_000_000_000,
});

// Model with thinking support (e.g., DeepSeek-R1)
await LlamaCPP.addModel({
  id: 'deepseek-r1-distill-qwen-1.5b',
  name: 'DeepSeek R1 Distill Qwen 1.5B',
  url: 'https://huggingface.co/.../deepseek-r1-distill-qwen-1.5b-q8_0.gguf',
  memoryRequirement: 2_000_000_000,
  supportsThinking: true,
});
```
### Module properties

```typescript
LlamaCPP.moduleId           // 'llamacpp'
LlamaCPP.moduleName         // 'LlamaCPP'
LlamaCPP.inferenceFramework // LLMFramework.LlamaCpp
LlamaCPP.capabilities       // ['llm']
LlamaCPP.defaultPriority    // 100
```
## Text Generation

Once a model is registered and loaded, use the RunAnywhere API for generation.

Simple chat:

```typescript
const response = await RunAnywhere.chat('Hello!');
console.log(response);
```

Generation with options:

```typescript
const result = await RunAnywhere.generate(
  'Explain machine learning in simple terms',
  {
    maxTokens: 256,
    temperature: 0.7,
    topP: 0.95,
    systemPrompt: 'You are a helpful teacher.',
    stopSequences: ['\n\n'],
  }
);

console.log('Response:', result.text);
console.log('Tokens:', result.tokensUsed);
console.log('Speed:', result.performanceMetrics.tokensPerSecond, 'tok/s');
console.log('TTFT:', result.performanceMetrics.timeToFirstTokenMs, 'ms');
```
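`RunAnywhere.chat` and `RunAnywhere.generate` take a single prompt string, so one way to get multi-turn behavior is to fold earlier turns into the prompt. A rough sketch under that assumption; `ChatTurn`, `buildPrompt`, and `chatWithHistory` are hypothetical names, and the SDK may offer richer conversation handling of its own:

```typescript
import { RunAnywhere } from '@runanywhere/core';

// Hypothetical conversation state for illustration only.
interface ChatTurn {
  role: 'user' | 'assistant';
  text: string;
}

// Flatten history into a plain transcript the model can continue.
function buildPrompt(history: ChatTurn[], userMessage: string): string {
  const lines = history.map(
    (t) => `${t.role === 'user' ? 'User' : 'Assistant'}: ${t.text}`
  );
  lines.push(`User: ${userMessage}`, 'Assistant:');
  return lines.join('\n');
}

async function chatWithHistory(history: ChatTurn[], userMessage: string): Promise<string> {
  const result = await RunAnywhere.generate(buildPrompt(history, userMessage), {
    maxTokens: 256,
    stopSequences: ['\nUser:'], // stop before the model writes the next user turn
  });
  history.push({ role: 'user', text: userMessage });
  history.push({ role: 'assistant', text: result.text.trim() });
  return result.text;
}
```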
Streaming:

```typescript
const streamResult = await RunAnywhere.generateStream(
  'Write a story about a robot',
  { maxTokens: 500 }
);

// Display tokens as they're generated
for await (const token of streamResult.stream) {
  process.stdout.write(token);
}

// Get final metrics
const result = await streamResult.result;
console.log('\nSpeed:', result.performanceMetrics.tokensPerSecond, 'tok/s');
```
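`process.stdout.write` suits Node-style scripts; in a React Native screen you would normally append tokens to state so the UI re-renders as they arrive. A sketch built on the same `generateStream` call; `useLLMStream` is a hypothetical hook, not an SDK export:

```typescript
import { useCallback, useState } from 'react';
import { RunAnywhere } from '@runanywhere/core';

// Hypothetical hook: stream tokens into component state for live display.
export function useLLMStream() {
  const [text, setText] = useState('');
  const [busy, setBusy] = useState(false);

  const run = useCallback(async (prompt: string) => {
    setText('');
    setBusy(true);
    try {
      const streamResult = await RunAnywhere.generateStream(prompt, { maxTokens: 500 });
      for await (const token of streamResult.stream) {
        setText((prev) => prev + token); // triggers a re-render per token
      }
      await streamResult.result; // wait for completion and final metrics
    } finally {
      setBusy(false);
    }
  }, []);

  return { text, busy, run };
}
```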
## Tool Calling

Register tools and let the LLM call them during generation. Tool-call parsing and prompt formatting are handled entirely in C++ for consistency across platforms.
```typescript
import { RunAnywhere } from '@runanywhere/core';

// Register a tool
RunAnywhere.registerTool(
  {
    name: 'calculate',
    description: 'Perform a math calculation',
    parameters: [
      { name: 'expression', type: 'string', description: 'Math expression', required: true },
    ],
  },
  async (args) => {
    const result = eval(args.expression as string); // simplified example
    return { result };
  }
);

// Generate with tools
const result = await RunAnywhere.generateWithTools(
  'What is 42 * 17?',
  {
    autoExecute: true,
    maxToolCalls: 3,
    temperature: 0.7,
    format: 'default', // 'default' for most models, 'lfm2' for Liquid AI models
  }
);

console.log(result.text); // "42 * 17 = 714"
```
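The `eval` call above keeps the example short, but evaluating model-produced strings verbatim is unsafe. One safer variant is to whitelist arithmetic characters before evaluating; `safeCalculate` below is a hypothetical replacement, not part of the SDK:

```typescript
import { RunAnywhere } from '@runanywhere/core';

// Hypothetical safer handler: only plain arithmetic survives the whitelist,
// so identifiers (and therefore arbitrary code) cannot be evaluated.
function safeCalculate(expression: string): number {
  if (!/^[\d\s+\-*\/().%]+$/.test(expression)) {
    throw new Error(`Unsupported expression: ${expression}`);
  }
  return Function(`"use strict"; return (${expression});`)() as number;
}

RunAnywhere.registerTool(
  {
    name: 'calculate',
    description: 'Perform a math calculation',
    parameters: [
      { name: 'expression', type: 'string', description: 'Math expression', required: true },
    ],
  },
  async (args) => ({ result: safeCalculate(args.expression as string) })
);
```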
Supported tool calling formats:

| Format | Tag pattern | Models |
|---|---|---|
| `default` | `<tool_call>{"tool":"name","arguments":{}}</tool_call>` | Llama, Qwen, Mistral, SmolLM, most GGUF models |
| `lfm2` | `<\|tool_call_start\|>[func(arg="val")]<\|tool_call_end\|>` | Liquid AI LFM2-Tool models |
## Model Management

```typescript
// Load model
await RunAnywhere.loadModel('/path/to/model.gguf');

// Check if loaded
const isLoaded = await RunAnywhere.isModelLoaded();

// Unload to free memory
await RunAnywhere.unloadModel();

// Cancel ongoing generation
await RunAnywhere.cancelGeneration();
```
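These calls compose into a common "make this model usable" routine. A sketch combining them; `ensureModelReady` is a hypothetical helper, and it assumes `localPath` on `ModelInfo` is only set once the model has been downloaded (as the Quick Start suggests):

```typescript
import { RunAnywhere } from '@runanywhere/core';

// Hypothetical helper: download the model if needed, then load it.
async function ensureModelReady(modelId: string): Promise<void> {
  let info = await RunAnywhere.getModelInfo(modelId);
  if (!info.localPath) {
    // Assumption: localPath is unset until a download has completed.
    await RunAnywhere.downloadModel(modelId, (progress) => {
      console.log(`Downloading ${modelId}: ${(progress.progress * 100).toFixed(1)}%`);
    });
    info = await RunAnywhere.getModelInfo(modelId);
  }
  await RunAnywhere.loadModel(info.localPath);
}
```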
## Recommended Models

Any GGUF-format model works with this backend. Recommended models:

### Small models

| Model | Size | Memory | Tool Calling | Description |
|---|---|---|---|---|
| SmolLM2 360M Q8_0 | ~400MB | 500MB | - | Fast, lightweight |
| Qwen 2.5 0.5B Q6_K | ~500MB | 600MB | Yes | Multilingual |
| LFM2 350M Q4_K_M | ~200MB | 250MB | Yes (`lfm2`) | Ultra-compact, Liquid AI |

### Medium models

| Model | Size | Memory | Tool Calling | Description |
|---|---|---|---|---|
| Phi-3 Mini Q4_K_M | ~2GB | 2.5GB | - | Microsoft |
| Gemma 2B Q4_K_M | ~1.5GB | 2GB | - | Google |
| LFM2 1.2B Q4_K_M | ~800MB | 1GB | Yes (`lfm2`) | Liquid AI tool-calling |
| Qwen 2.5 1.5B Instruct Q4_K_M | ~1GB | 1.5GB | Yes | Alibaba, multilingual |
| TinyLlama 1.1B Q4_K_M | ~700MB | 1GB | - | Fast chat |

### Large models

| Model | Size | Memory | Tool Calling | Description |
|---|---|---|---|---|
| Llama 3.2 3B Instruct Q4_K_M | ~2GB | 3GB | Yes | Meta, latest |
| Mistral 7B Instruct Q4_K_M | ~4GB | 5GB | Yes | Mistral AI |
| Qwen 2.5 7B Instruct Q4_K_M | ~4GB | 5GB | Yes | Alibaba |
| Llama 2 7B Chat Q4_K_M | ~4GB | 5GB | - | Meta |
For tool calling, good starting points are the LFM2 models (with `format: 'lfm2'`) or Qwen 2.5 0.5B. Use `format: 'lfm2'` only with Liquid AI LFM2-Tool models; all others use `format: 'default'`.

## Performance

Typical generation speeds:

| Device | Model | Speed |
|---|---|---|
| iPhone 15 Pro | SmolLM2 360M Q8 | 50-80 tok/s |
| iPhone 15 Pro | Llama 3.2 3B Q4 | 15-25 tok/s |
| MacBook M2 | Llama 2 7B Q4 | 20-40 tok/s |
| Pixel 8 | SmolLM2 360M Q8 | 30-50 tok/s |
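Throughput varies with device thermals, context length, and quantization, so it is worth measuring on your own hardware. The `performanceMetrics` returned by `generate` (shown earlier) are enough for a quick check; `benchmarkModel` is a hypothetical helper:

```typescript
import { RunAnywhere } from '@runanywhere/core';

// Hypothetical micro-benchmark: one fixed prompt, then report the
// metrics the SDK already collects for every generation.
async function benchmarkModel(): Promise<void> {
  const result = await RunAnywhere.generate('Summarize the history of computing.', {
    maxTokens: 128,
    temperature: 0, // keep runs comparable across devices
  });
  console.log('Speed:', result.performanceMetrics.tokensPerSecond, 'tok/s');
  console.log('TTFT:', result.performanceMetrics.timeToFirstTokenMs, 'ms');
}
```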
## Architecture

### iOS

This package uses `RABackendLLAMACPP.xcframework`, a prebuilt framework containing the native llama.cpp backend. The framework is automatically downloaded during `pod install`.

### Android

The native library `librunanywhere_llamacpp.so` provides the same backend for Android. Libraries are automatically downloaded during the Gradle build.
## Package Structure

```
packages/llamacpp/
├── src/
│   ├── index.ts                 # Package exports
│   ├── LlamaCPP.ts              # Module API (register, addModel)
│   ├── LlamaCppProvider.ts      # Service provider
│   ├── native/
│   │   └── NativeRunAnywhereLlama.ts
│   └── specs/
│       └── RunAnywhereLlama.nitro.ts
├── cpp/
│   ├── HybridRunAnywhereLlama.cpp
│   ├── HybridRunAnywhereLlama.hpp
│   └── bridges/
├── ios/
│   ├── RunAnywhereLlama.podspec
│   └── Frameworks/
│       └── RABackendLLAMACPP.xcframework
├── android/
│   ├── build.gradle
│   └── src/main/jniLibs/
│       └── arm64-v8a/
│           └── librunanywhere_llamacpp.so
└── nitrogen/
    └── generated/
```
## Troubleshooting

### Model fails to load

Symptoms: `modelLoadFailed` error

Solutions:
- Verify the device has enough free memory for the model (compare against its `memoryRequirement`)
- Re-download the model if the GGUF file may be incomplete or corrupted

### Slow generation

Symptoms: < 5 tokens/second

Solutions:
- Switch to a smaller model or a more aggressive quantization (e.g., Q4 instead of Q8)

### Crash during inference

Symptoms: App crash during inference

Solutions:
- Pick a model whose `memoryRequirement` fits the device
- Call `RunAnywhere.unloadModel()` when the model is not in use to free memory
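One way to avoid load-time failures up front is to compare a model's registered memory requirement against a budget before calling `loadModel`. A sketch under the assumption that the `ModelInfo` returned by `getModelInfo` exposes the `memoryRequirement` passed to `addModel`; `loadIfItFits` is a hypothetical guard:

```typescript
import { RunAnywhere } from '@runanywhere/core';

// Hypothetical guard: skip models that exceed an app-defined memory budget.
// Assumes ModelInfo carries the memoryRequirement registered via addModel.
async function loadIfItFits(modelId: string, budgetBytes: number): Promise<boolean> {
  const info = await RunAnywhere.getModelInfo(modelId);
  if (info.memoryRequirement && info.memoryRequirement > budgetBytes) {
    console.warn(`${modelId} needs ~${info.memoryRequirement} bytes; not loading`);
    return false;
  }
  await RunAnywhere.loadModel(info.localPath);
  return true;
}
```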
## License

MIT