llama.cpp.patches/README.md
This directory contains patches that adapt llama.cpp for use with Llamafile and Cosmopolitan libc. These patches enable llama.cpp to run as a portable, single-file executable across Windows, macOS, Linux, and BSD without installation.
llama.cpp.patches/
├── README.md # This file
├── apply-patches.sh # Script to apply all patches to llama.cpp submodule
├── renames.sh # Script for file renames/moves (if any)
├── llamafile-files/ # Additional files to copy into llama.cpp
│ ├── BUILD.mk # Makefile for building llama.cpp with cosmocc
│ ├── README.llamafile # License and modification notes
│ └── common/
│ └── license.cpp # Llama.cpp's license file (cmake creates this at build time)
└── patches/ # Patch files for upstream sources
To apply all patches to the llama.cpp submodule:
./llama.cpp.patches/apply-patches.sh
To reset the submodule to its clean state:
cd llama.cpp && git reset --hard && git clean -fdx
GGML_CALL)GPU backends (CUDA, Vulkan, Metal) are compiled as shared libraries (.dll/.so/.dylib) using native compilers, but the llamafile host binary is built with Cosmopolitan libc which uses System V AMD64 ABI everywhere — including on Windows. When the host calls function pointers inside backend interface structs, the calling convention must match.
The GGML_CALL macro (defined as __attribute__((__ms_abi__)) when GGML_MULTIPLATFORM is set) annotates all function pointers in the backend interface structs and their implementations, so the correct calling convention is used on every platform.
| Patch | Description |
|---|---|
ggml_include_ggml-backend.h.patch | Defines the GGML_CALL macro; adds it to the five get_proc_address return typedefs (ggml_backend_split_buffer_type_t, ggml_backend_set_n_threads_t, ggml_backend_dev_get_extra_bufts_t, ggml_backend_set_abort_callback_t, ggml_backend_get_features_t) |
ggml_include_ggml-cpu.h.patch | Adds GGML_CALL to declarations of ggml_backend_cpu_set_n_threads and ggml_backend_cpu_set_abort_callback (returned via get_proc_address) |
ggml_include_ggml-cuda.h.patch | Adds GGML_CALL to declarations of ggml_backend_cuda_split_buffer_type, ggml_backend_cuda_register_host_buffer, and ggml_backend_cuda_unregister_host_buffer |
ggml_src_ggml-backend-impl.h.patch | Adds GGML_CALL to all 49+ function pointers across the five interface structs (ggml_backend_buffer_type_i, ggml_backend_buffer_i, ggml_backend_i, ggml_backend_device_i, ggml_backend_reg_i); also adds free_struct callback (see Cross-Module Memory below) |
ggml_src_ggml-backend.cpp.patch | Adds GGML_CALL to CPU buffer, buffer type, and multi-buffer callback implementations; also adds free_struct support (see Cross-Module Memory below) |
ggml_src_ggml-cpu_ggml-cpu.cpp.patch | Adds GGML_CALL to all CPU backend, device, and registry callback implementations, plus get_proc_address-returned functions (set_n_threads, set_abort_callback, get_extra_buffers_type, get_features) |
ggml_src_ggml-cpu_amx_amx.cpp.patch | Adds GGML_CALL to all AMX buffer and buffer type callback implementations (10 functions) |
ggml_src_ggml-cpu_repack.cpp.patch | Adds GGML_CALL to CPU repack buffer and buffer type callback implementations (5 functions) |
ggml_src_ggml-cuda_ggml-cuda.cu.patch | Adds GGML_CALL to all CUDA backend callback implementations (60+ functions); also adds free_struct and TinyBLAS BF16 guard (see below) |
ggml_src_ggml-metal_ggml-metal.cpp.patch | Adds GGML_CALL to all Metal backend callback implementations (62 functions); also adds free_struct (see below) |
ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch | Adds GGML_CALL to all Vulkan backend callback implementations; also adds free_struct and a heap memory underflow fix (see below) |
ggml_src_ggml-backend-meta.cpp.patch | Adds GGML_CALL to all meta-device, meta-buffer-type, meta-buffer, and meta-backend callback implementations (the meta backend aggregates several simple backends behind one interface, so its callbacks are reached through the same function-pointer structs) |
When GPU backends (CUDA, Vulkan, Metal) are loaded as dynamic libraries, memory allocated by the DSO must be freed by the DSO's allocator, not the main executable's.
| Patch | Description |
|---|---|
ggml_src_ggml-backend-impl.h.patch | Adds free_struct callback to ggml_backend_buffer_i interface for cross-module buffer cleanup |
ggml_src_ggml-backend.cpp.patch | Implements free_struct callback support in ggml_backend_buffer_free() — calls DSO's free_struct instead of delete when set |
ggml_src_ggml-cuda_ggml-cuda.cu.patch | Adds free_struct implementation for CUDA buffers (regular, split, and host); sets it on fallback CPU buffers allocated within the DSO |
ggml_src_ggml-metal_ggml-metal.cpp.patch | Adds free_struct implementation for Metal shared and private buffers |
ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch | Adds free_struct implementation for Vulkan buffers and host buffer fallback path |
These patches address compatibility issues when building with Cosmopolitan libc (cosmocc).
| Patch | Description |
|---|---|
common_arg.cpp.patch | Adds COSMOCC platform detection for PATH_MAX (includes linux/limits.h) |
common_common.cpp.patch | Adds platform-aware cache directory detection for Cosmopolitan (checks LOCALAPPDATA, XDG_CACHE_HOME, falls back to ~/.cache/); also adds mmproj model size estimation to GPU fit params so the fit algorithm reserves enough VRAM for multimodal projectors |
common_download.cpp.patch | Adds COSMOCC platform detection for PATH_MAX |
common_ngram-mod.cpp.patch | Adds missing #include <algorithm> (needed for std::min/std::max) |
Cosmopolitan libc has specific behaviors with condition variables and signals that require workarounds.
| Patch | Description |
|---|---|
common_log.cpp.patch | Adds #include <csignal>; blocks SIGINT/SIGTERM on logger thread via pthread_sigmask to prevent EINTR exceptions; replaces cv.wait() with wait_for(30s) loop to work around XNU futex timeout bug (~72 minute expiry) |
tools_server_server-models.cpp.patch | Adds #include <csignal>; blocks SIGINT/SIGTERM on stopping thread; replaces cv.wait() with wait_for(30s) loops in unload_lru, stopping_thread, and wait_until_loading_finished |
tools_server_server-queue.cpp.patch | Adds missing includes (<cerrno>, ``, <csignal>); blocks SIGINT/SIGTERM on queue thread; replaces wait() with wait_for() loops in three locations (wait_until_no_sleep, main loop, recv) |
vendor_cpp-httplib_httplib.cpp.patch | Fixes httplib thread pool with wait_for() instead of wait() for XNU futex compatibility |
Llamafile uses TinyBLAS as a lightweight replacement for cuBLAS, enabling GPU support without CUDA SDK dependencies.
| Patch | Description |
|---|---|
ggml_src_ggml-cuda_vendors_cuda.h.patch | Includes TinyBLAS headers (tinyblas.h, tinyblas-compat.h) instead of cublas_v2.h when GGML_USE_TINYBLAS is defined; guards backward-compat CUBLAS_* defines so they don't conflict with TinyBLAS's own definitions |
ggml_src_ggml-cuda_common.cuh.patch | Disables BF16 MMA when using TinyBLAS (TinyBLAS would incorrectly interpret BF16 as FP16) |
ggml_src_ggml-cuda_ggml-cuda.cu.patch | Disables BF16 in ggml_cuda_op_mul_mat_cublas when using TinyBLAS |
The IQ ("importance") quantization formats (IQ1_S, IQ2_XXS/XS/S, IQ3_S/XXS, IQ4_NL/XS) pull in a large amount of CUDA template instantiation that inflates compile time and binary size. These patches gate every IQ code path behind #ifndef GGML_CUDA_NO_IQ_QUANTS, so a build can compile them out by defining GGML_CUDA_NO_IQ_QUANTS. When the macro is undefined (the default), behavior is unchanged.
| Patch | Description |
|---|---|
ggml_src_ggml-cuda_convert.cu.patch | Guards IQ dequantization cases in ggml_get_to_fp16_cuda and ggml_get_to_fp32_cuda |
ggml_src_ggml-cuda_cpy.cu.patch | Guards the f32 → IQ4_NL copy helper and its dispatch case |
ggml_src_ggml-cuda_mmq.cu.patch | Guards IQ cases in ggml_cuda_mul_mat_q_switch_type and in the ggml_cuda_should_use_mmq support/heuristic switches |
ggml_src_ggml-cuda_mmq.cuh.patch | Guards the extern DECL_MMQ_CASE(...) declarations for IQ types |
ggml_src_ggml-cuda_mmvq.cu.patch | Guards IQ cases in get_vec_dot_q_cuda and get_vdr_mmvq |
These patches restore llamafile's optimized CPU kernels (TinyBLAS matmul, AVX-512 flash-attention helpers) on top of upstream's CPU backend, and tune CPU-only defaults. The hooks call into symbols exported from llamafile/sgemm.cpp and are compiled only when GGML_USE_LLAMAFILE is defined.
| Patch | Description |
|---|---|
ggml_src_ggml-cpu_ggml-cpu.c.patch | Routes MoE matmul (ggml_compute_forward_mul_mat_id) through llamafile_mixmul / llamafile_mixmul_iqk, mirroring the dense-matmul llamafile_sgemm hook; reserves work-buffer space for the MoE kernel in ggml_graph_plan via llamafile_mixmul_needs |
ggml_src_ggml-cpu_ops.cpp.patch | Routes flash-attention inner loops through llamafile's AVX-512 helpers (llamafile_fa_vec_dot_f16, llamafile_fa_fp16_to_fp32_row, llamafile_fa_simd_gemm) in both the one-chunk and tiled FA paths; also accumulates VKQ in f32 on CPUs lacking native f16 FMA (avoiding costly f16↔f32 round-trips per KV step) |
src_llama-context.cpp.patch | Defaults -fa auto to off on CPU-only setups (no GPU devices), since the CPU flash-attention path is slower than the non-FA path on x86; users can still force -fa on for memory savings on long contexts |
These patches integrate llamafile's file handling APIs for loading models from bundled zip archives and .llamafile containers.
| Patch | Description |
|---|---|
src_llama-mmap.h.patch | Adds has_premapped_content(), premapped_content(), and get_llamafile() methods to llama_file class |
src_llama-mmap.cpp.patch | Under COSMOCC, redirects file open/read/seek/tell/close through llamafile API (llamafile_open_gguf, llamafile_read, etc.); adds premapped content support to llama_mmap using llamafile reference counting (llamafile_ref/llamafile_unref); skips munmap for premapped content |
ggml_src_gguf.cpp.patch | Adds tell()/seek() to gguf_reader; under COSMOCC, adds gguf_llamafile_reader that reads via llamafile API; templatizes gguf_init_from_reader_impl so both readers work; redirects gguf_init_from_file through llamafile_open_gguf (supports /zip/ paths, .llamafile containers) |
| Patch | Description |
|---|---|
tools_server_server.cpp.patch | Renames upstream's llama_server() to server_main() and adds on_ready/on_shutdown_available callbacks for combined TUI+server mode; adds Metal/GPU backend trigger before common_init(); adds Cosmopolitan-specific standalone main() with cosmo_args, verbose flag handling, and GPU pre-initialization; handles LLAMAFILE_TUI exit to avoid Metal cleanup crashes |
The web UI moved upstream from prebuilt tools/server/public/* assets to
a Svelte project under tools/ui/, embedded at CMake time via
tools/ui/embed.cpp. cosmocc has no JS toolchain, so apply-patches.sh
(run by make setup) downloads the prebuilt Svelte bundle from the
ggml-org/llama-ui Hugging Face bucket into llama.cpp/tools/ui/dist/.
At build time, llama.cpp/BUILD.mk compiles tools/ui/embed.cpp with
cosmoc++ (its APE output runs on the host) and runs it against the
downloaded assets to emit o/$(MODE)/llama.cpp/tools/ui/ui.{cpp,h},
which is then compiled like any other C++ source and linked into
llama-server and llamafile. If the download fails (offline, version
not yet on HF) apply-patches.sh warns and continues with dist/
empty — embed.cpp then emits a no-op llama_ui_find_asset, and
server-http.cpp skips UI route registration via its
LLAMA_UI_HAS_ASSETS guard so the API still works.
| Patch | Description |
|---|---|
ggml_src_ggml-backend-reg.cpp.patch | Suppresses debug log noise for non-existent backend search paths (irrelevant for llamafile's DSO loading approach) |
ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch | Fixes unsigned integer underflow in ggml_backend_vk_get_device_memory where Vulkan's heapUsage can exceed heapBudget (clamps to zero instead of wrapping) |
src_models_t5.cpp.patch | Forward-declares the graph<false>/graph<true> explicit specializations before build_arch_graph so clang's -std=gnu++23 doesn't reject them as specializations after implicit instantiation |
Files in llama.cpp are usually modified in-place for development and testing.
Once they are ready to be committed, you can update all files in the llama.cpp.patches directory by running the following:
cd llama.cpp
../tools/generate-patches.sh --output-dir ../llama.cpp.patches
Patch filenames will automatically reflect the file path with underscores replacing slashes (e.g., common_arg.cpp.patch for common/arg.cpp).