llama.cpp Patches for Llamafile

This directory contains patches that adapt llama.cpp for use with Llamafile and Cosmopolitan libc. These patches enable llama.cpp to run as a portable, single-file executable across Windows, macOS, Linux, and BSD without installation.

Directory Structure

llama.cpp.patches/
├── README.md              # This file
├── apply-patches.sh       # Script to apply all patches to llama.cpp submodule
├── renames.sh             # Script for file renames/moves (if any)
├── llamafile-files/       # Additional files to copy into llama.cpp
│   ├── BUILD.mk           # Makefile for building llama.cpp with cosmocc
│   ├── README.llamafile   # License and modification notes
│   └── common/
│       └── license.cpp    # Llama.cpp's license file (cmake creates this at build time)
└── patches/               # Patch files for upstream sources

Applying Patches

To apply all patches to the llama.cpp submodule:

```sh
./llama.cpp.patches/apply-patches.sh
```

To reset the submodule to its clean state:

```sh
cd llama.cpp && git reset --hard && git clean -fdx
```

Patch Index

Windows/macOS ABI Compatibility (GGML_CALL)

GPU backends (CUDA, Vulkan, Metal) are compiled as shared libraries (.dll/.so/.dylib) using native compilers, but the llamafile host binary is built with Cosmopolitan libc, which uses the System V AMD64 ABI everywhere, including on Windows. When the host calls function pointers inside backend interface structs, the calling convention must match on both sides.

The GGML_CALL macro (defined as __attribute__((__ms_abi__)) when GGML_MULTIPLATFORM is set) annotates all function pointers in the backend interface structs and their implementations, so the correct calling convention is used on every platform.
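
As a rough illustration of the idea, the sketch below annotates both a function pointer in an interface struct and its implementation with the same macro. The struct and function names (`demo_buffer_i`, `demo_get_name`, `demo_name_is`) are hypothetical, and the `__x86_64__` check is an assumption added so the sketch compiles on other architectures; the real definition lives in ggml-backend.h.

```cpp
#include <cassert>
#include <cstring>

// Assumption: simplified stand-in for the macro the patch defines.
// The attribute is only meaningful for x86-64 GCC/Clang builds.
#if defined(GGML_MULTIPLATFORM) && defined(__x86_64__)
#    define GGML_CALL __attribute__((__ms_abi__))
#else
#    define GGML_CALL
#endif

// Hypothetical one-entry interface struct; the real structs hold many pointers.
struct demo_buffer_i {
    const char * (GGML_CALL * get_name)(void * ctx);
};

// The implementation carries the same annotation as the pointer type, so
// caller and callee agree on the calling convention.
static GGML_CALL const char * demo_get_name(void * /*ctx*/) {
    return "demo";
}

// Calls through the annotated pointer, the way a host would call a backend.
static bool demo_name_is(const demo_buffer_i & iface, const char * expected) {
    return std::strcmp(iface.get_name(nullptr), expected) == 0;
}
```

If either the pointer type or the implementation lacked the annotation, the two sides could disagree on the calling convention, which is the mismatch these patches are meant to prevent.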

| Patch | Description |
| --- | --- |
| ggml_include_ggml-backend.h.patch | Defines the GGML_CALL macro; adds it to the five get_proc_address return typedefs (ggml_backend_split_buffer_type_t, ggml_backend_set_n_threads_t, ggml_backend_dev_get_extra_bufts_t, ggml_backend_set_abort_callback_t, ggml_backend_get_features_t) |
| ggml_include_ggml-cpu.h.patch | Adds GGML_CALL to declarations of ggml_backend_cpu_set_n_threads and ggml_backend_cpu_set_abort_callback (returned via get_proc_address) |
| ggml_include_ggml-cuda.h.patch | Adds GGML_CALL to declarations of ggml_backend_cuda_split_buffer_type, ggml_backend_cuda_register_host_buffer, and ggml_backend_cuda_unregister_host_buffer |
| ggml_src_ggml-backend-impl.h.patch | Adds GGML_CALL to all 49+ function pointers across the five interface structs (ggml_backend_buffer_type_i, ggml_backend_buffer_i, ggml_backend_i, ggml_backend_device_i, ggml_backend_reg_i); also adds free_struct callback (see Cross-Module Memory below) |
| ggml_src_ggml-backend.cpp.patch | Adds GGML_CALL to CPU buffer, buffer type, and multi-buffer callback implementations; also adds free_struct support (see Cross-Module Memory below) |
| ggml_src_ggml-cpu_ggml-cpu.cpp.patch | Adds GGML_CALL to all CPU backend, device, and registry callback implementations, plus get_proc_address-returned functions (set_n_threads, set_abort_callback, get_extra_buffers_type, get_features) |
| ggml_src_ggml-cpu_amx_amx.cpp.patch | Adds GGML_CALL to all AMX buffer and buffer type callback implementations (10 functions) |
| ggml_src_ggml-cpu_repack.cpp.patch | Adds GGML_CALL to CPU repack buffer and buffer type callback implementations (5 functions) |
| ggml_src_ggml-cuda_ggml-cuda.cu.patch | Adds GGML_CALL to all CUDA backend callback implementations (60+ functions); also adds free_struct and TinyBLAS BF16 guard (see below) |
| ggml_src_ggml-metal_ggml-metal.cpp.patch | Adds GGML_CALL to all Metal backend callback implementations (62 functions); also adds free_struct (see below) |
| ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch | Adds GGML_CALL to all Vulkan backend callback implementations; also adds free_struct and a heap memory underflow fix (see below) |
| ggml_src_ggml-backend-meta.cpp.patch | Adds GGML_CALL to all meta-device, meta-buffer-type, meta-buffer, and meta-backend callback implementations (the meta backend aggregates several simple backends behind one interface, so its callbacks are reached through the same function-pointer structs) |

Cross-Module Memory Management

When GPU backends (CUDA, Vulkan, Metal) are loaded as dynamic libraries, memory allocated by the DSO must be freed by the DSO's allocator, not the main executable's.

| Patch | Description |
| --- | --- |
| ggml_src_ggml-backend-impl.h.patch | Adds free_struct callback to ggml_backend_buffer_i interface for cross-module buffer cleanup |
| ggml_src_ggml-backend.cpp.patch | Implements free_struct callback support in ggml_backend_buffer_free(): calls the DSO's free_struct instead of delete when set |
| ggml_src_ggml-cuda_ggml-cuda.cu.patch | Adds free_struct implementation for CUDA buffers (regular, split, and host); sets it on fallback CPU buffers allocated within the DSO |
| ggml_src_ggml-metal_ggml-metal.cpp.patch | Adds free_struct implementation for Metal shared and private buffers |
| ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch | Adds free_struct implementation for Vulkan buffers and host buffer fallback path |
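
The pattern described above can be sketched roughly as follows. This is a simplified model with hypothetical names (`demo_buffer`, `demo_buffer_i`, `demo_buffer_free`), not the actual ggml code: the host-side free function hands the struct back to the module's own deleter when one is set, and falls back to `delete` otherwise.

```cpp
#include <cassert>

struct demo_buffer;

// Hypothetical, reduced interface; the real ggml_backend_buffer_i has more entries.
struct demo_buffer_i {
    void (*free_buffer)(demo_buffer * buffer);  // releases backend resources
    void (*free_struct)(demo_buffer * buffer);  // optional: frees the struct itself
};

struct demo_buffer {
    demo_buffer_i iface;
    void *        context;
};

// Counts how often the module-side deleter ran (for demonstration only).
static int g_module_frees = 0;

// Stand-in for a deleter compiled into the DSO, so the struct is released by
// the same allocator that created it.
static void demo_module_free_struct(demo_buffer * buffer) {
    ++g_module_frees;
    delete buffer;
}

// Host-side free: prefer the module's free_struct when set, else plain delete.
static void demo_buffer_free(demo_buffer * buffer) {
    if (buffer == nullptr) {
        return;
    }
    if (buffer->iface.free_buffer != nullptr) {
        buffer->iface.free_buffer(buffer);
    }
    if (buffer->iface.free_struct != nullptr) {
        buffer->iface.free_struct(buffer);
    } else {
        delete buffer;
    }
}
```

In this single-binary sketch both paths end in the same allocator; the distinction only matters when the struct was allocated inside a separately compiled shared library.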

Cosmopolitan Libc Compatibility

These patches address compatibility issues when building with Cosmopolitan libc (cosmocc).

| Patch | Description |
| --- | --- |
| common_arg.cpp.patch | Adds COSMOCC platform detection for PATH_MAX (includes linux/limits.h) |
| common_common.cpp.patch | Adds platform-aware cache directory detection for Cosmopolitan (checks LOCALAPPDATA, XDG_CACHE_HOME, falls back to ~/.cache/); also adds mmproj model size estimation to GPU fit params so the fit algorithm reserves enough VRAM for multimodal projectors |
| common_download.cpp.patch | Adds COSMOCC platform detection for PATH_MAX |
| common_ngram-mod.cpp.patch | Adds missing #include <algorithm> (needed for std::min/std::max) |
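
The cache directory lookup order mentioned for common_common.cpp.patch might look something like the sketch below. The function name, the injected environment lookup, and the use of HOME to expand `~` are assumptions made for illustration; the sketch also ignores path separator differences and any subdirectory the real code appends.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Environment lookup is injected so the logic can be exercised without
// touching the real process environment.
using demo_env_lookup = std::function<const char *(const char *)>;

static std::string demo_with_trailing_slash(std::string dir) {
    if (dir.empty() || dir.back() != '/') {
        dir += '/';
    }
    return dir;
}

// Hypothetical helper: returns the base cache directory, or "" if none is found.
// Order: LOCALAPPDATA, then XDG_CACHE_HOME, then $HOME/.cache/ (HOME is assumed).
static std::string demo_cache_base_dir(const demo_env_lookup & env) {
    const char * local = env("LOCALAPPDATA");
    if (local != nullptr && *local != '\0') {
        return demo_with_trailing_slash(local);
    }
    const char * xdg = env("XDG_CACHE_HOME");
    if (xdg != nullptr && *xdg != '\0') {
        return demo_with_trailing_slash(xdg);
    }
    const char * home = env("HOME");
    if (home != nullptr && *home != '\0') {
        return demo_with_trailing_slash(home) + ".cache/";
    }
    return "";
}
```

In a real program the lookup could be a lambda wrapping `std::getenv`.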

Threading and Signal Handling

Cosmopolitan libc has specific behaviors with condition variables and signals that require workarounds.

| Patch | Description |
| --- | --- |
| common_log.cpp.patch | Adds #include <csignal>; blocks SIGINT/SIGTERM on logger thread via pthread_sigmask to prevent EINTR exceptions; replaces cv.wait() with wait_for(30s) loop to work around XNU futex timeout bug (~72 minute expiry) |
| tools_server_server-models.cpp.patch | Adds #include <csignal>; blocks SIGINT/SIGTERM on stopping thread; replaces cv.wait() with wait_for(30s) loops in unload_lru, stopping_thread, and wait_until_loading_finished |
| tools_server_server-queue.cpp.patch | Adds missing includes (<cerrno>, ``, <csignal>); blocks SIGINT/SIGTERM on queue thread; replaces wait() with wait_for() loops in three locations (wait_until_no_sleep, main loop, recv) |
| vendor_cpp-httplib_httplib.cpp.patch | Fixes httplib thread pool with wait_for() instead of wait() for XNU futex compatibility |
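
The two workarounds listed above (blocking termination signals on a worker thread, and replacing an unbounded wait with a bounded `wait_for` loop) can be sketched as below. The helper names are hypothetical and this is not the patched code itself; it only uses the standard `pthread_sigmask` and `std::condition_variable::wait_for` calls.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <pthread.h>
#include <signal.h>

// Blocks SIGINT and SIGTERM on the calling thread so that another thread
// receives them. Returns pthread_sigmask's result (0 on success).
static int demo_block_termination_signals() {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    sigaddset(&mask, SIGTERM);
    return pthread_sigmask(SIG_BLOCK, &mask, nullptr);
}

// Waits until pred() is true, waking at least once per `slice` to re-check,
// instead of sleeping indefinitely in a single wait(). The caller must hold `lock`.
template <typename Pred>
static void demo_wait_bounded(std::condition_variable & cv,
                              std::unique_lock<std::mutex> & lock,
                              Pred pred,
                              std::chrono::milliseconds slice = std::chrono::seconds(30)) {
    while (!pred()) {
        cv.wait_for(lock, slice);
    }
}
```

Because the predicate is re-checked after every wakeup, a timeout only costs one extra loop iteration, and no single sleep lasts longer than the slice.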

TinyBLAS Integration

Llamafile uses TinyBLAS as a lightweight replacement for cuBLAS, enabling GPU support without CUDA SDK dependencies.

| Patch | Description |
| --- | --- |
| ggml_src_ggml-cuda_vendors_cuda.h.patch | Includes TinyBLAS headers (tinyblas.h, tinyblas-compat.h) instead of cublas_v2.h when GGML_USE_TINYBLAS is defined; guards backward-compat CUBLAS_* defines so they don't conflict with TinyBLAS's own definitions |
| ggml_src_ggml-cuda_common.cuh.patch | Disables BF16 MMA when using TinyBLAS (TinyBLAS would incorrectly interpret BF16 as FP16) |
| ggml_src_ggml-cuda_ggml-cuda.cu.patch | Disables BF16 in ggml_cuda_op_mul_mat_cublas when using TinyBLAS |

Optional IQ-Quant Exclusion (CUDA)

The IQ ("importance") quantization formats (IQ1_S, IQ2_XXS/XS/S, IQ3_S/XXS, IQ4_NL/XS) pull in a large amount of CUDA template instantiation that inflates compile time and binary size. These patches gate every IQ code path behind #ifndef GGML_CUDA_NO_IQ_QUANTS, so a build can compile them out by defining GGML_CUDA_NO_IQ_QUANTS. When the macro is undefined (the default), behavior is unchanged.

| Patch | Description |
| --- | --- |
| ggml_src_ggml-cuda_convert.cu.patch | Guards IQ dequantization cases in ggml_get_to_fp16_cuda and ggml_get_to_fp32_cuda |
| ggml_src_ggml-cuda_cpy.cu.patch | Guards the f32 → IQ4_NL copy helper and its dispatch case |
| ggml_src_ggml-cuda_mmq.cu.patch | Guards IQ cases in ggml_cuda_mul_mat_q_switch_type and in the ggml_cuda_should_use_mmq support/heuristic switches |
| ggml_src_ggml-cuda_mmq.cuh.patch | Guards the extern DECL_MMQ_CASE(...) declarations for IQ types |
| ggml_src_ggml-cuda_mmvq.cu.patch | Guards IQ cases in get_vec_dot_q_cuda and get_vdr_mmvq |
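
The guarding technique is ordinary conditional compilation around switch cases. A minimal host-side sketch, using a hypothetical enum and function rather than the real CUDA dispatchers, might look like this:

```cpp
#include <cassert>

// Hypothetical type tags; the real code switches on ggml_type values.
enum demo_type { DEMO_Q4_0, DEMO_Q8_0, DEMO_IQ4_NL, DEMO_IQ2_XXS };

// Reports whether a type has a kernel compiled in. With GGML_CUDA_NO_IQ_QUANTS
// defined, the IQ cases vanish and fall through to the default branch.
static bool demo_type_supported(demo_type type) {
    switch (type) {
        case DEMO_Q4_0:
        case DEMO_Q8_0:
            return true;
#ifndef GGML_CUDA_NO_IQ_QUANTS
        case DEMO_IQ4_NL:
        case DEMO_IQ2_XXS:
            return true;
#endif
        default:
            return false;
    }
}
```

Compiling the sketch with `-DGGML_CUDA_NO_IQ_QUANTS` makes the IQ types report as unsupported, while the default build behaves as before.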

CPU Performance Optimizations (llamafile #975)

These patches restore llamafile's optimized CPU kernels (TinyBLAS matmul, AVX-512 flash-attention helpers) on top of upstream's CPU backend, and tune CPU-only defaults. The hooks call into symbols exported from llamafile/sgemm.cpp and are compiled only when GGML_USE_LLAMAFILE is defined.

| Patch | Description |
| --- | --- |
| ggml_src_ggml-cpu_ggml-cpu.c.patch | Routes MoE matmul (ggml_compute_forward_mul_mat_id) through llamafile_mixmul / llamafile_mixmul_iqk, mirroring the dense-matmul llamafile_sgemm hook; reserves work-buffer space for the MoE kernel in ggml_graph_plan via llamafile_mixmul_needs |
| ggml_src_ggml-cpu_ops.cpp.patch | Routes flash-attention inner loops through llamafile's AVX-512 helpers (llamafile_fa_vec_dot_f16, llamafile_fa_fp16_to_fp32_row, llamafile_fa_simd_gemm) in both the one-chunk and tiled FA paths; also accumulates VKQ in f32 on CPUs lacking native f16 FMA (avoiding costly f16↔f32 round-trips per KV step) |
| src_llama-context.cpp.patch | Defaults -fa auto to off on CPU-only setups (no GPU devices), since the CPU flash-attention path is slower than the non-FA path on x86; users can still force -fa on for memory savings on long contexts |
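
The `-fa auto` default described in the last row amounts to a small decision rule. The sketch below models it with hypothetical names; what `auto` resolves to when GPU devices are present is left to upstream's existing logic and is represented here by simply returning the request unchanged.

```cpp
#include <cassert>

// Hypothetical stand-ins for the flash-attention setting.
enum demo_fa_mode { DEMO_FA_AUTO, DEMO_FA_ON, DEMO_FA_OFF };

// On CPU-only setups "auto" resolves to off; explicit on/off always wins.
// With GPUs present, the request is passed through untouched (assumption:
// upstream's own auto-detection then takes over).
static demo_fa_mode demo_resolve_fa(demo_fa_mode requested, int n_gpu_devices) {
    if (requested == DEMO_FA_AUTO && n_gpu_devices == 0) {
        return DEMO_FA_OFF;
    }
    return requested;
}
```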

Llamafile File Handling

These patches integrate llamafile's file handling APIs for loading models from bundled zip archives and .llamafile containers.

| Patch | Description |
| --- | --- |
| src_llama-mmap.h.patch | Adds has_premapped_content(), premapped_content(), and get_llamafile() methods to llama_file class |
| src_llama-mmap.cpp.patch | Under COSMOCC, redirects file open/read/seek/tell/close through llamafile API (llamafile_open_gguf, llamafile_read, etc.); adds premapped content support to llama_mmap using llamafile reference counting (llamafile_ref/llamafile_unref); skips munmap for premapped content |
| ggml_src_gguf.cpp.patch | Adds tell()/seek() to gguf_reader; under COSMOCC, adds gguf_llamafile_reader that reads via llamafile API; templatizes gguf_init_from_reader_impl so both readers work; redirects gguf_init_from_file through llamafile_open_gguf (supports /zip/ paths, .llamafile containers) |
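
The idea of templatizing a parser so that it works with more than one reader type can be illustrated with the sketch below. Both reader structs and the `demo_has_gguf_magic` function are hypothetical; the in-memory reader merely stands in for an alternative data source and does not use the llamafile API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Reader over a C FILE*.
struct demo_file_reader {
    FILE * fp;
    bool   read(void * dst, size_t n) { return std::fread(dst, 1, n, fp) == n; }
    size_t tell() const { return static_cast<size_t>(std::ftell(fp)); }
};

// Reader over an in-memory byte range (stand-in for a non-FILE source).
struct demo_memory_reader {
    const char * data;
    size_t       size;
    size_t       pos;
    bool read(void * dst, size_t n) {
        if (n > size - pos) {
            return false;
        }
        std::memcpy(dst, data + pos, n);
        pos += n;
        return true;
    }
    size_t tell() const { return pos; }
};

// One parsing routine for any type that provides read(); here it only checks
// the 4-byte "GGUF" magic at the current position.
template <typename Reader>
static bool demo_has_gguf_magic(Reader & reader) {
    char magic[4];
    return reader.read(magic, sizeof(magic)) && std::memcmp(magic, "GGUF", 4) == 0;
}
```

Because the routine depends only on the member functions it calls, a new reader type can be added without modifying the parsing code.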

Server Integration

| Patch | Description |
| --- | --- |
| tools_server_server.cpp.patch | Renames upstream's llama_server() to server_main() and adds on_ready/on_shutdown_available callbacks for combined TUI+server mode; adds Metal/GPU backend trigger before common_init(); adds Cosmopolitan-specific standalone main() with cosmo_args, verbose flag handling, and GPU pre-initialization; handles LLAMAFILE_TUI exit to avoid Metal cleanup crashes |

The web UI moved upstream from prebuilt tools/server/public/* assets to a Svelte project under tools/ui/, embedded at CMake time via tools/ui/embed.cpp. cosmocc has no JS toolchain, so apply-patches.sh (run by make setup) downloads the prebuilt Svelte bundle from the ggml-org/llama-ui Hugging Face bucket into llama.cpp/tools/ui/dist/.

At build time, llama.cpp/BUILD.mk compiles tools/ui/embed.cpp with cosmoc++ (its APE output runs on the host) and runs it against the downloaded assets to emit o/$(MODE)/llama.cpp/tools/ui/ui.{cpp,h}. The generated ui.cpp is then compiled like any other C++ source and linked into llama-server and llamafile.

If the download fails (offline, or the version is not yet on HF), apply-patches.sh warns and continues with dist/ empty. In that case embed.cpp emits a no-op llama_ui_find_asset, and server-http.cpp skips UI route registration via its LLAMA_UI_HAS_ASSETS guard, so the API still works.

Bug Fixes

| Patch | Description |
| --- | --- |
| ggml_src_ggml-backend-reg.cpp.patch | Suppresses debug log noise for non-existent backend search paths (irrelevant for llamafile's DSO loading approach) |
| ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch | Fixes unsigned integer underflow in ggml_backend_vk_get_device_memory where Vulkan's heapUsage can exceed heapBudget (clamps to zero instead of wrapping) |
| src_models_t5.cpp.patch | Forward-declares the graph<false>/graph<true> explicit specializations before build_arch_graph so clang's -std=gnu++23 doesn't reject them as specializations after implicit instantiation |
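
The Vulkan heap fix in the table above is an instance of a common unsigned-arithmetic pitfall. The sketch below, with a hypothetical function name, shows the clamping form: subtracting a larger unsigned value from a smaller one wraps around to a huge number, so the comparison has to come first.

```cpp
#include <cassert>
#include <cstdint>

// Returns the free bytes in a heap, clamped to zero when the reported usage
// exceeds the reported budget (instead of wrapping around).
static uint64_t demo_free_heap_bytes(uint64_t budget, uint64_t usage) {
    return usage >= budget ? 0 : budget - usage;
}
```

Without the clamp, a budget of 100 and a usage of 150 would yield a value close to 2^64 rather than 0.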

Creating New Patches

Files in llama.cpp are usually modified in-place for development and testing. Once they are ready to be committed, you can update all files in the llama.cpp.patches directory by running the following:

```sh
cd llama.cpp
../tools/generate-patches.sh --output-dir ../llama.cpp.patches
```

Patch filenames will automatically reflect the file path with underscores replacing slashes (e.g., common_arg.cpp.patch for common/arg.cpp).
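
The naming rule can be expressed in a few lines. This is only a sketch of the mapping with a hypothetical function name; the actual logic lives in generate-patches.sh.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Maps a source path inside llama.cpp to its patch filename by replacing
// every '/' with '_' and appending ".patch".
static std::string demo_patch_name(std::string path) {
    std::replace(path.begin(), path.end(), '/', '_');
    return path + ".patch";
}
```

For example, `demo_patch_name("common/arg.cpp")` yields `common_arg.cpp.patch`, matching the entry in the patch index.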