llama.cpp compatibility layer

llama/compat/README.md


This directory holds a temporary in-process compatibility layer for existing published Ollama GGUFs whose metadata or tensor layout does not yet match what llama.cpp expects directly. The layer translates those files in memory at load time so users do not need to re-pull or re-create models during the transition to llama-server.

This patching approach is intended to be short-lived. The target end state is that published models and newly created models use llama.cpp-compatible metadata and tensor layouts on disk, at which point this directory can be removed.

The layer is applied automatically at build time via CMake FetchContent's PATCH_COMMAND for normal fetched builds. If CMake is pointed at a source override through FETCHCONTENT_SOURCE_DIR_LLAMA_CPP, the same patch is applied during configure. If OLLAMA_LLAMA_CPP_SOURCE is set, the patch is intentionally skipped so a developer can iterate on a local llama.cpp tree.

Files

  • llama-ollama-compat.h, llama-ollama-compat.cpp - the compatibility entry points and per-architecture handlers.
  • llama-ollama-compat-util.h, llama-ollama-compat-util.cpp - helpers for KV edits, tensor renames, skip-prefix tracking, tensor load operations, and small tensor repacking primitives.
  • llama-cpp-hooks.patch - small additive call-site edits in llama.cpp files. It currently touches src/llama-model-loader.cpp and tools/mtmd/clip.cpp.
  • compat.cmake, apply-patch.cmake - CMake glue and an idempotent patch applier used by llama/server/CMakeLists.txt.

The compatibility source files stay in this directory and are linked into the fetched llama.cpp targets. The patch file only adds call sites.

Load-Time Hooks

The layer runs at a small set of loader hook points:

  1. Main model constructor: translate_metadata inspects the parsed metadata and mutates the in-memory gguf_context and ggml_context when a handler recognizes an existing published model format. It can also request mmap disablement when a handler needs writable backend buffers for transformed tensor data.
  2. Main model tensor indexing: should_skip_tensor hides embedded projector, vision, audio, MTP, or other tensors that the text loader should not claim.
  3. Main model tensor reads: maybe_load_text_tensor applies registered text-side load operations, such as FFN concat or dtype promotion, before the normal llama.cpp file read. This is wired into both full model loading and single-tensor reads used by tools such as llama-quantize.
  4. mtmd/clip constructor: translate_clip_metadata rewrites a clip-facing view of monolithic GGUFs into the mmproj form expected by llama.cpp.
  5. mtmd/clip tensor load loop: maybe_load_tensor applies clip-side load operations, such as F16 to F32 promotion, QKV merge, tensor repack, tensor split, or zero-fill.
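
Hook 2 in the list above can be pictured as a prefix check against a list that a handler fills in during metadata translation. The following is a minimal sketch under stated assumptions: the SkipPrefixes class, its method names, and the example prefixes "v." and "mm." are hypothetical and chosen for illustration. The actual helpers live in llama-ollama-compat-util.cpp and may be structured differently.

```cpp
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical skip-prefix tracker (not the layer's real API).
class SkipPrefixes {
public:
    // A handler that recognizes a monolithic GGUF would register the
    // prefixes of tensors the text loader must not claim.
    void add(std::string prefix) { prefixes_.push_back(std::move(prefix)); }

    // True when tensor_name starts with any registered prefix.
    bool should_skip(std::string_view tensor_name) const {
        for (const std::string & p : prefixes_) {
            if (tensor_name.substr(0, p.size()) == std::string_view(p)) {
                return true;
            }
        }
        return false;
    }

private:
    std::vector<std::string> prefixes_;
};
```

With "v." and "mm." registered, a name such as "v.blk.0.attn_q.weight" would be hidden from the text loader while "blk.0.attn_q.weight" would still be indexed. Note that an empty prefix would match every tensor, so a real implementation would likely reject it.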

Files that do not match a supported published-model marker are left unchanged. Setting OLLAMA_LLAMA_CPP_COMPAT=0 disables the hook bodies. This is intended for internal create-time validation and for models that are already known to be llama.cpp-compatible on disk.
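
One way such an opt-out could be checked is sketched below. This README only states that the value 0 disables the hook bodies. Treating an unset variable and every other value as "enabled" is an assumption made for this sketch, and the function names are hypothetical.

```cpp
#include <cstdlib>
#include <cstring>

// Pure helper so the rule can be exercised without touching the environment.
// Assumption: only the exact string "0" disables the layer.
inline bool compat_enabled_for(const char * value) {
    return value == nullptr || std::strcmp(value, "0") != 0;
}

// Reads the variable named in this README. A hook body would return early,
// leaving the file untouched, when this yields false.
inline bool compat_enabled() {
    return compat_enabled_for(std::getenv("OLLAMA_LLAMA_CPP_COMPAT"));
}
```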

Supported Transformations

This table tracks the dispatch surface. Keep it brief; the handler comments in llama-ollama-compat.cpp are the source of truth for exact KV and tensor maps.

| Internal arch / marker | Text handling | Clip/mmproj handling |
| --- | --- | --- |
| gemma3 | Normalizes Gemma 3 metadata, tokenizer fields, and embedded vision/projector tensors. | Gemma 3 projector translation. |
| gemma3 + embedding markers (embeddinggemma) | Maps to gemma-embedding metadata and fixes embedding dense/norm tensors. | n/a |
| bert + Snowflake markers (snowflake-arctic-embed2) | Fixes Snowflake Arctic Embed 2 tokenizer metadata. | n/a |
| gemma3n | Normalizes tokenizer/EOS metadata, truncates vocab-shaped tensors, and hides unused embedded vision/audio/projector tensors. | n/a |
| gemma4 | Normalizes tokenizer metadata and hides embedded audio/vision/projector tensors from the text loader. | Gemma 4 projector translation; audio remains disabled. |
| gptoss | Maps to gpt-oss, copies KVs, injects missing expert FFN metadata, and renames tensors. | n/a |
| lfm2 | Renames norm tensors and fixes feed-forward metadata. | n/a |
| olmo3 | Maps to the OLMo2-compatible loader path. | n/a |
| mistral3 | Fixes RoPE/YaRN metadata and hides embedded vision/projector tensors. | Pixtral-style projector translation. |
| qwen35, qwen35moe | Fixes Qwen3.5/Qwen3-VL-style text metadata, translates embedded MTP tensors, and hides embedded vision/projector tensors. | Qwen3-VL merger-style projector translation. |
| qwen3next | Normalizes hybrid attention KV-head metadata and renames SSM dt tensors to the names expected by llama.cpp. | n/a |
| qwen25vl | Maps to qwen2vl metadata conventions. | Qwen2.5-VL projector translation. |
| qwen3vl, qwen3vlmoe | Adds missing Qwen3-VL metadata and hides embedded vision/projector tensors. | Qwen3-VL projector translation, including QKV merge and patch-embedding split/repack. |
| deepseekocr | Maps to deepseek2-ocr, injects missing OCR/MoE metadata, and hides embedded SAM/vision/projector tensors. | DeepSeek OCR projector translation. |
| glmocr | Maps GLM OCR metadata/tensors to the llama.cpp-compatible view. | GLM OCR projector translation. |
| glm4moelite | Maps GLM-4.7 Flash MLA metadata to the deepseek2 path and fixes special-token metadata. | n/a |
| nemotron_h_moe | Fixes latent-FFN variants and hides MTP tensors. | n/a |
| nemotron_h_omni | Selects the Nemotron text loader and hides audio/vision/projector tensors from the text loader. | Nemotron V2 VL projector translation; audio remains disabled. |
| llama with Llama 3 markers | Fixes Llama 3 tokenizer metadata. | n/a |
| llama4 | Hides embedded vision/projector tensors from the text loader. | Llama 4 projector translation. |
| clip projector without clip.projector_type | n/a | Defaults LLaVA/BakLLaVA projectors to clip.projector_type=mlp. |

Usage:

```sh
llama-server --model /path/to/ollama-blob --mmproj /path/to/ollama-blob
```

Passing the same monolithic GGUF as both --model and --mmproj works because each loader applies its own translation.

Additional architectures are added by implementing a handle_<arch>() and, for vision models, handle_<arch>_clip() in llama-ollama-compat.cpp, then dispatching them from translate_metadata / translate_clip_metadata. For monolithic vision models, also update the compatClipArches allowlist in llm/llama_server.go so Ollama passes the main GGUF as --mmproj.
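
The dispatch step described above can be sketched as a lookup from the internal architecture string to a handler. This is an illustration under assumptions, not the code in llama-ollama-compat.cpp: ModelView, the handler bodies, and translate_metadata_sketch are hypothetical, and the target name "olmo2" is assumed from the table's "OLMo2-compatible loader path" wording. The real handlers operate on gguf_context and ggml_context.

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Stand-in for the in-memory state a handler would mutate.
struct ModelView {
    std::string arch;        // the architecture read from the file's metadata
    bool translated = false; // set when a handler rewrote the view
};

using Handler = std::function<void(ModelView &)>;

// Hypothetical handlers following the handle_<arch>() naming pattern.
inline void handle_olmo3(ModelView & m)  { m.arch = "olmo2";   m.translated = true; }
inline void handle_gptoss(ModelView & m) { m.arch = "gpt-oss"; m.translated = true; }

// Returns true when a handler ran; unknown architectures are left unchanged.
inline bool translate_metadata_sketch(ModelView & m) {
    static const std::unordered_map<std::string, Handler> handlers = {
        {"olmo3",  handle_olmo3},
        {"gptoss", handle_gptoss},
    };
    const auto it = handlers.find(m.arch);
    if (it == handlers.end()) {
        return false;
    }
    it->second(m);
    return true;
}
```

In this shape, supporting a new architecture amounts to writing one handler and adding one table entry. The real dispatcher may also inspect extra markers, as the "gemma3 + embedding markers" row suggests, which a plain string lookup does not capture.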

Regenerating the Patch File

After a llama.cpp bump moves the insertion points, re-apply the edits to a fresh checkout and run:

```sh
cd /path/to/llama.cpp
git diff -- \
    src/llama-model-loader.cpp \
    tools/mtmd/clip.cpp \
    > /path/to/ollama/llama/compat/llama-cpp-hooks.patch
```

Implementation Notes

The compatibility code is mostly written against public APIs (gguf.h, ggml.h, ggml-backend.h). A few operations rely on implementation details because the public API does not expose equivalent mutators:

| Dependency | Use | Replacement if needed |
| --- | --- | --- |
| Direct writes to ggml_tensor::type / ne[] / nb[] | Post-creation tensor reshape/retype for in-memory translation. | Add public tensor shape/type mutators. |
| const_cast<char *>(gguf_get_tensor_name(...)) in rename_tensor | Renames gguf tensors in place. | Add a public gguf_rename_tensor helper. |
| llama_model_loader forward declaration from src/llama-model-loader.h | Opaque key for per-loader registries. The pointer is never dereferenced. | Replace registry keys with const void *. |
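
The last row's replacement can be illustrated with a registry keyed on an opaque address. This is a sketch under assumptions: LoaderState and LoaderRegistry are hypothetical names, and the fields shown are placeholders for whatever per-loader state the layer keeps.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical per-loader state.
struct LoaderState {
    bool disable_mmap = false;
    std::size_t registered_ops = 0;
};

// Registry keyed by an opaque address. The key is only hashed and compared,
// never dereferenced, so const void * carries everything that is needed.
// Not thread-safe; a real registry would need its own locking policy.
class LoaderRegistry {
public:
    LoaderState & get_or_create(const void * key) { return states_[key]; }
    bool contains(const void * key) const { return states_.count(key) != 0; }
    void erase(const void * key) { states_.erase(key); }

private:
    std::unordered_map<const void *, LoaderState> states_;
};
```

Because any object address can serve as a key, a caller could pass its loader pointer without the compatibility code ever needing the loader's type declaration.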

Two helpers need extra context:

  • reclaim_slot_as repurposes an orphaned tensor slot as a synthesized tensor when a clip handler splits one source tensor into multiple destination tensors. This is needed because clip metadata loading allocates exactly enough tensor slots for the source file.
  • Load-op registry overrides ignore the caller-provided file_offset when a registered operation exists. The operations capture their own source offsets at translation time, before renames change tensor names.
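
The second point can be sketched as follows. The names LoadOp, LoadOpRegistry, and resolve_offset are hypothetical, and keying the registry by the final tensor name is an assumption made for illustration. The sketch only shows the rule that a captured source offset takes precedence over the caller-provided file_offset.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical load operation: remembers where its source bytes live.
struct LoadOp {
    std::size_t source_offset; // captured at translation time
};

class LoadOpRegistry {
public:
    // Registered under the post-rename name, with an offset that was looked
    // up while the original tensor name was still valid.
    void add(const std::string & final_name, std::size_t source_offset) {
        ops_[final_name] = LoadOp{source_offset};
    }

    // Offset to read from: the captured one if an operation exists,
    // otherwise the caller-provided file_offset.
    std::size_t resolve_offset(const std::string & name, std::size_t file_offset) const {
        const auto it = ops_.find(name);
        return it != ops_.end() ? it->second.source_offset : file_offset;
    }

private:
    std::unordered_map<std::string, LoadOp> ops_;
};
```

Capturing the offset up front means a later rename cannot make the operation read from the wrong place, even if the caller derives its own offset from the renamed entry.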