lgrammel/ai-sdk-llama-cpp is a community provider that enables local LLM inference using llama.cpp directly within Node.js via native C++ bindings.
This provider loads llama.cpp directly into Node.js memory, eliminating the need for an external server while providing native performance and GPU acceleration.
Supported capabilities include:

- Text generation with generateText and streamText
- Structured output with Output
- Embeddings with embed and embedMany

Before installing, ensure you have the following build tools (the commands below are for macOS):
# Install Xcode Command Line Tools (includes Clang)
xcode-select --install
# Install CMake via Homebrew
brew install cmake
The llama.cpp provider is available in the ai-sdk-llama-cpp module. You can install it with:
<Tabs items={['pnpm', 'npm', 'yarn', 'bun']}>
  <Tab>
    <Snippet text="pnpm add ai-sdk-llama-cpp" dark />
  </Tab>
  <Tab>
    <Snippet text="npm install ai-sdk-llama-cpp" dark />
  </Tab>
  <Tab>
    <Snippet text="yarn add ai-sdk-llama-cpp" dark />
  </Tab>
  <Tab>
    <Snippet text="bun add ai-sdk-llama-cpp" dark />
  </Tab>
</Tabs>
The installation will automatically compile llama.cpp as a static library with Metal support and build the native Node.js addon.
You can import llamaCpp from ai-sdk-llama-cpp and create a model instance:
import { llamaCpp } from 'ai-sdk-llama-cpp';
const model = llamaCpp({
modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});
You can customize the model instance with the following options:
- modelPath (string, required): Path to the GGUF model file.
- contextSize (number): Maximum context size. Default: 2048.
- gpuLayers (number): Number of layers to offload to GPU. Default: 99 (all layers). Set to 0 to disable GPU.
- threads (number): Number of CPU threads. Default: 4.
- debug (boolean): Enable verbose debug output from llama.cpp. Default: false.
- chatTemplate (string): Chat template to use for formatting messages. Default: "auto" (uses the template embedded in the GGUF model file). Available templates include llama3, chatml, gemma, mistral-v1, mistral-v3, phi3, phi4, deepseek, and more.
const model = llamaCpp({
modelPath: './models/your-model.gguf',
contextSize: 4096,
gpuLayers: 99,
threads: 8,
chatTemplate: 'llama3',
});
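If you need CPU-only inference (for example, on a machine without Metal support), the gpuLayers option described above can be set to 0 to disable GPU offload entirely. A minimal sketch:

```ts
import { llamaCpp } from 'ai-sdk-llama-cpp';

// CPU-only configuration: gpuLayers: 0 disables GPU offload,
// so inference runs entirely on the configured CPU threads.
const cpuModel = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
  gpuLayers: 0,
  threads: 8,
});
```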
You can use llama.cpp models to generate text with the generateText function:
import { generateText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';
const model = llamaCpp({
modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});
try {
const { text } = await generateText({
model,
prompt: 'Explain quantum computing in simple terms.',
});
console.log(text);
} finally {
await model.dispose();
}
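Because messages are formatted with the model's chat template (see the chatTemplate option), chat-style conversations can be passed through the standard AI SDK system and messages parameters. A hedged sketch, assuming the same model file as above:

```ts
import { generateText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});

try {
  const { text } = await generateText({
    model,
    system: 'You are a concise assistant.',
    messages: [
      { role: 'user', content: 'Name three uses for a Raspberry Pi.' },
    ],
  });
  console.log(text);
} finally {
  await model.dispose();
}
```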
The provider fully supports streaming with streamText:
import { streamText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';
const model = llamaCpp({
modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});
try {
const result = streamText({
model,
prompt: 'Write a haiku about programming.',
});
for await (const chunk of result.textStream) {
process.stdout.write(chunk);
}
} finally {
await model.dispose();
}
Generate type-safe JSON objects that conform to a schema using Output:
import { generateText, Output } from 'ai';
import { z } from 'zod';
import { llamaCpp } from 'ai-sdk-llama-cpp';
const model = llamaCpp({
modelPath: './models/your-model.gguf',
});
try {
const { output: recipe } = await generateText({
model,
output: Output.object({
schema: z.object({
name: z.string(),
ingredients: z.array(
z.object({
name: z.string(),
amount: z.string(),
}),
),
steps: z.array(z.string()),
}),
}),
prompt: 'Generate a recipe for chocolate chip cookies.',
});
console.log(recipe);
} finally {
await model.dispose();
}
The structured output feature uses GBNF grammar constraints to ensure the model generates valid JSON that conforms to your schema.
Standard AI SDK generation parameters are supported:
const { text } = await generateText({
model,
prompt: 'Hello!',
maxTokens: 256,
temperature: 0.7,
topP: 0.9,
topK: 40,
stopSequences: ['\n'],
});
You can create embedding models using the llamaCpp.embedding() factory method:
import { embed, embedMany } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';
const model = llamaCpp.embedding({
modelPath: './models/nomic-embed-text-v1.5.Q4_K_M.gguf',
});
try {
const { embedding } = await embed({
model,
value: 'Hello, world!',
});
const { embeddings } = await embedMany({
model,
values: ['Hello, world!', 'Goodbye, world!'],
});
} finally {
model.dispose();
}
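The resulting embeddings work with the AI SDK's cosineSimilarity helper, which is handy for quick relevance checks. A sketch, assuming the same embedding model as above:

```ts
import { embedMany, cosineSimilarity } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp.embedding({
  modelPath: './models/nomic-embed-text-v1.5.Q4_K_M.gguf',
});

try {
  const { embeddings } = await embedMany({
    model,
    values: ['sunny day at the beach', 'rainy afternoon in the city'],
  });
  // Cosine similarity close to 1 means the texts are semantically similar.
  console.log(cosineSimilarity(embeddings[0], embeddings[1]));
} finally {
  model.dispose();
}
```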
You'll need to download GGUF-format models separately; Hugging Face hosts many ready-to-use GGUF quantizations. Example download:
# Create models directory
mkdir -p models
# Download a model (example: Llama 3.2 1B)
wget -P models/ https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
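Since modelPath must point at a GGUF file on disk, a quick existence check before constructing the model gives a clearer error than a native load failure. A small sketch, assuming the file downloaded above:

```ts
import { existsSync } from 'node:fs';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const modelPath = './models/Llama-3.2-1B-Instruct-Q4_K_M.gguf';

// Fail early with a readable message if the model hasn't been downloaded yet.
if (!existsSync(modelPath)) {
  throw new Error(`Model file not found at ${modelPath}. Download it first.`);
}

const model = llamaCpp({ modelPath });
```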
Model instances hold native llama.cpp resources, so always call dispose() when you're done with a model:
const model = llamaCpp({
modelPath: './models/your-model.gguf',
});
try {
// Use the model...
} finally {
await model.dispose();
}
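Loading a GGUF file is relatively expensive, so it usually pays to create one instance, reuse it across calls, and dispose once at the end. A sketch of that pattern, assuming the same model file as above:

```ts
import { generateText } from 'ai';
import { llamaCpp } from 'ai-sdk-llama-cpp';

const model = llamaCpp({
  modelPath: './models/llama-3.2-1b-instruct.Q4_K_M.gguf',
});

try {
  // Reuse the same loaded model for several prompts instead of
  // paying the model-load cost on every call.
  for (const prompt of ['What is a GGUF file?', 'What is quantization?']) {
    const { text } = await generateText({ model, prompt });
    console.log(text);
  }
} finally {
  await model.dispose();
}
```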