
website/docs/references/models-http-api/llama.cpp.mdx

import Collapse from '@site/src/components/Collapse';

# llama.cpp

llama.cpp is a popular C++ library for serving GGUF-based models. It provides a server implementation that supports completion, chat, and embedding functionality over HTTP APIs.

## Chat model

llama.cpp provides an OpenAI-compatible chat API interface.

```toml
[model.chat.http]
kind = "openai/chat"
api_endpoint = "http://localhost:8888"
```
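With this kind, requests follow the standard OpenAI chat schema; llama.cpp's server exposes that API under `/v1/chat/completions`. A minimal sketch of the request body (the message content is an arbitrary example, not something Tabby sends verbatim):

```python
import json

# Sketch (not Tabby's actual client code): the JSON body an
# OpenAI-compatible chat endpoint accepts.
api_endpoint = "http://localhost:8888"  # matches api_endpoint above
url = f"{api_endpoint}/v1/chat/completions"

payload = json.dumps({
    "messages": [
        {"role": "user", "content": "Explain what a GGUF file is."}
    ],
    "stream": False,
})
```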

## Completion model

llama.cpp offers a specialized completion API interface for code completion tasks.

```toml
[model.completion.http]
kind = "llama.cpp/completion"
api_endpoint = "http://localhost:8888"
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"  # Example prompt template for the CodeLlama model series.
```
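The `{prefix}` and `{suffix}` placeholders are filled with the code before and after the cursor, producing a fill-in-the-middle (FIM) prompt. A minimal sketch of that substitution (the code snippet is an arbitrary example):

```python
# Sketch of how a FIM prompt_template is expanded: {prefix} receives the
# code before the cursor, {suffix} the code after it.
template = "<PRE> {prefix} <SUF>{suffix} <MID>"

prefix = "def add(a, b):\n    return "
suffix = "\n"
prompt = template.format(prefix=prefix, suffix=suffix)
# prompt now interleaves the CodeLlama FIM control tokens with the code.
```

Make sure the template matches the FIM tokens your model was trained with; the one above applies to the CodeLlama series only.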

## Embeddings model

llama.cpp provides embedding functionality through its HTTP API.

The llama.cpp embedding API and its response format changed in version b4356, so Tabby provides two different kinds to match the llama.cpp version you are running.

You can refer to the configuration as follows:

```toml
[model.embedding.http]
kind = "llama.cpp/embedding"
api_endpoint = "http://localhost:8888"
```

<Collapse title="For versions prior to b4356">

```toml
[model.embedding.http]
kind = "llama.cpp/before_b4356_embedding"
api_endpoint = "http://localhost:8888"
```

</Collapse>
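For orientation, a sketch of what an embedding request body can look like. Recent llama.cpp builds accept a JSON object with a `content` field on the `/embedding` route; the exact field name and route are assumptions here, so verify them against the server documentation for your llama.cpp version:

```python
import json

# Sketch of an embedding request body; the "content" field and the
# /embedding route are assumptions -- check your llama.cpp version.
api_endpoint = "http://localhost:8888"
url = f"{api_endpoint}/embedding"
payload = json.dumps({"content": "fn main() {}"})
```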