# LlamaCPPRuntime
The LlamaCPPRuntime module provides large language model (LLM) text generation capabilities for the RunAnywhere Swift SDK using llama.cpp with GGUF models and Metal acceleration.
This module enables on-device text generation with support for:

- Simple chat and prompt-based generation
- Token-by-token streaming
- Structured (JSON-schema) output via `Generatable`
- Generation metrics such as token counts and tokens per second

## Requirements
| Platform | Minimum Version |
|---|---|
| iOS | 17.0+ |
| macOS | 14.0+ |
The module requires the RABackendLlamaCPP.xcframework binary, which is automatically included when you add the SDK as a dependency.
## Installation

The LlamaCPPRuntime module is included in the RunAnywhere SDK. Add it to your target in `Package.swift`:
```swift
dependencies: [
    .package(url: "https://github.com/RunanywhereAI/runanywhere-sdks", from: "0.16.0")
],
targets: [
    .target(
        name: "YourApp",
        dependencies: [
            .product(name: "RunAnywhere", package: "runanywhere-sdks"),
            .product(name: "RunAnywhereLlamaCPP", package: "runanywhere-sdks")
        ]
    )
]
```
Alternatively, in Xcode, add the package from https://github.com/RunanywhereAI/runanywhere-sdks and add the RunAnywhereLlamaCPP product to your target.

## Quick Start

Register the module at app startup, before using any LLM capabilities:
```swift
import RunAnywhere
import LlamaCPPRuntime

@main
struct MyApp: App {
    init() {
        Task { @MainActor in
            LlamaCPP.register()
            try RunAnywhere.initialize(
                apiKey: "<YOUR_API_KEY>",
                baseURL: "https://api.runanywhere.ai",
                environment: .production
            )
        }
    }

    var body: some Scene {
        WindowGroup { ContentView() }
    }
}
```
Once the SDK is initialized, load a model and generate text:

```swift
// Load a GGUF model by ID
try await RunAnywhere.loadModel("llama-3.2-1b-instruct-q4")

// Check whether a model is loaded
let isLoaded = await RunAnywhere.isModelLoaded

// Simple chat
let response = try await RunAnywhere.chat("What is the capital of France?")
print(response)
```
For finer control, pass `LLMGenerationOptions` and inspect the returned metrics:

```swift
// Generation with options and metrics
let result = try await RunAnywhere.generate(
    "Explain quantum computing in simple terms",
    options: LLMGenerationOptions(
        maxTokens: 200,
        temperature: 0.7,
        systemPrompt: "You are a helpful assistant."
    )
)
print("Response: \(result.text)")
print("Tokens used: \(result.tokensUsed)")
print("Speed: \(result.tokensPerSecond) tok/s")
```
Streaming delivers tokens as they are generated:

```swift
// Streaming generation
let result = try await RunAnywhere.generateStream(
    "Write a short poem about technology",
    options: LLMGenerationOptions(maxTokens: 150)
)

// Display tokens in real time
for try await token in result.stream {
    print(token, terminator: "")
}

// Get complete metrics after streaming finishes
let metrics = try await result.result.value
print("\nSpeed: \(metrics.tokensPerSecond) tok/s")
print("Total tokens: \(metrics.tokensUsed)")
```
Structured output decodes the model's response directly into a Swift type:

```swift
// Define a type that conforms to Generatable with a JSON schema
struct QuizQuestion: Generatable {
    let question: String
    let options: [String]
    let correctAnswer: Int

    static var jsonSchema: String {
        """
        {
          "type": "object",
          "properties": {
            "question": { "type": "string" },
            "options": { "type": "array", "items": { "type": "string" } },
            "correctAnswer": { "type": "integer" }
          },
          "required": ["question", "options", "correctAnswer"]
        }
        """
    }
}

let quiz: QuizQuestion = try await RunAnywhere.generateStructured(
    QuizQuestion.self,
    prompt: "Create a quiz question about Swift programming"
)
```
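The decoded value is an ordinary Swift struct. For example, you might render the question and mark the correct option, assuming `correctAnswer` is a zero-based index into `options`:

```swift
print(quiz.question)
for (index, option) in quiz.options.enumerated() {
    // Assumes correctAnswer is a zero-based index into `options`
    let marker = index == quiz.correctAnswer ? "✓" : " "
    print("\(marker) \(index + 1). \(option)")
}
```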
```swift
// Unload the model when you are done to free memory
try await RunAnywhere.unloadModel()
```
## API Reference

```swift
public enum LlamaCPP: RunAnywhereModule {
    /// Module identifier
    public static let moduleId = "llamacpp"

    /// Human-readable module name
    public static let moduleName = "LlamaCPP"

    /// Capabilities provided by this module
    public static let capabilities: Set<SDKComponent> = [.llm]

    /// Default registration priority
    public static let defaultPriority: Int = 100

    /// Inference framework used
    public static let inferenceFramework: InferenceFramework = .llamaCpp

    /// Module version
    public static let version = "2.0.0"

    /// Underlying llama.cpp library version
    public static let llamaCppVersion = "b7199"

    /// Register the module with the service registry
    @MainActor
    public static func register(priority: Int = 100)

    /// Unregister the module
    public static func unregister()

    /// Check if the module can handle a given model
    public static func canHandle(modelId: String?) -> Bool
}
```
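For example, you can query the module before loading a model (the model ID below is the one from the Quick Start and is only illustrative):

```swift
// Inspect module metadata
print("\(LlamaCPP.moduleName) \(LlamaCPP.version) (llama.cpp \(LlamaCPP.llamaCppVersion))")

// Check whether this module can serve a model before loading it
if LlamaCPP.canHandle(modelId: "llama-3.2-1b-instruct-q4") {
    try await RunAnywhere.loadModel("llama-3.2-1b-instruct-q4")
}
```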
## Supported Models

The LlamaCPP module handles models with the `.gguf` file extension. Any model converted to GGUF format for llama.cpp should work; the performance table below uses the Llama 3.2 family as a reference point.
## Generation Options

Key options for LLM generation:

| Option | Type | Default | Description |
|---|---|---|---|
| `maxTokens` | `Int` | `100` | Maximum tokens to generate |
| `temperature` | `Float` | `0.8` | Sampling temperature (0.0 - 2.0) |
| `topP` | `Float` | `1.0` | Top-p (nucleus) sampling parameter |
| `stopSequences` | `[String]` | `[]` | Stop generation at these sequences |
| `systemPrompt` | `String?` | `nil` | System prompt for generation |
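Combining several options looks like the sketch below; it assumes the `LLMGenerationOptions` initializer accepts these parameters in this order, matching the examples above:

```swift
// Illustrative values; assumes this memberwise initializer order
let options = LLMGenerationOptions(
    maxTokens: 256,
    temperature: 0.7,
    topP: 0.9,
    stopSequences: ["\n\n"],
    systemPrompt: "You are a concise technical assistant."
)
let summary = try await RunAnywhere.generate("Summarize what GGUF quantization does.", options: options)
print(summary.text)
```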
## Architecture

The module follows a thin wrapper pattern:

```
LlamaCPP.swift (Swift wrapper)
        |
LlamaCPPBackend (C headers)
        |
RABackendLlamaCPP.xcframework (C++ implementation)
        |
llama.cpp (Core inference engine)
```
The Swift code registers the backend with the C++ service registry, which handles all model loading and inference operations internally.
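As a rough illustration of that flow, `register()` could be imagined like the sketch below. This is not the actual source: `rab_llamacpp_register` and `ServiceRegistry` are hypothetical stand-ins for the real C entry point and registry.

```swift
// Hypothetical sketch only; the names below are illustrative stand-ins.
@MainActor
public static func register(priority: Int = defaultPriority) {
    // Hand off to the C++ backend, which registers its model-loading and
    // inference services with the native service registry.
    rab_llamacpp_register() // hypothetical C entry point

    // Advertise this module's capabilities on the Swift side so that
    // RunAnywhere.loadModel can route .gguf models here.
    ServiceRegistry.shared.register( // hypothetical registry API
        moduleId: moduleId,
        capabilities: capabilities,
        priority: priority
    )
}
```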
## Performance

Typical performance on Apple Silicon:
| Device | Model | Tokens/sec |
|---|---|---|
| iPhone 15 Pro | Llama 3.2 1B Q4 | 25-35 |
| iPhone 15 Pro | Llama 3.2 3B Q4 | 15-20 |
| M1 MacBook | Llama 3.2 1B Q4 | 40-50 |
| M1 MacBook | Llama 3.2 7B Q4 | 20-30 |
Performance varies based on model size, quantization, context length, and device thermal state.
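To measure throughput on your own device, read the metrics returned by `generate` (a minimal sketch using only the API shown in the Quick Start):

```swift
// Minimal throughput check using generation metrics
let run = try await RunAnywhere.generate(
    "Write 100 words about the history of computing.",
    options: LLMGenerationOptions(maxTokens: 200)
)
print("Generated \(run.tokensUsed) tokens at \(run.tokensPerSecond) tok/s")
```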
## Troubleshooting

If the module does not behave as expected, check that:

- The model has been downloaded: inspect `ModelInfo.isDownloaded` before calling `loadModel`.
- `register()` is called on the main actor.
- `register()` is called before `RunAnywhere.initialize()`.

## License

Copyright 2025 RunAnywhere AI. All rights reserved.