# llama.cpp.patches
This directory contains patches that adapt llama.cpp for use with Llamafile and Cosmopolitan libc. These patches enable llama.cpp to run as a portable, single-file executable across Windows, macOS, Linux, and BSD without installation.
## Directory Layout

```
llama.cpp.patches/
├── README.md              # This file
├── apply-patches.sh       # Script to apply all patches to the llama.cpp submodule
├── renames.sh             # Script for file renames/moves (if any)
├── llamafile-files/       # Additional files to copy into llama.cpp
│   ├── BUILD.mk           # Makefile for building llama.cpp with cosmocc
│   ├── README.llamafile   # License and modification notes
│   └── common/
│       └── license.cpp    # llama.cpp's license file (cmake creates this at build time)
└── patches/               # Patch files for upstream sources
```
## Applying Patches

To apply all patches to the llama.cpp submodule:

```sh
./llama.cpp.patches/apply-patches.sh
```

To reset the submodule to its clean state:

```sh
cd llama.cpp && git reset --hard && git clean -fdx
```
## Calling Convention (GGML_CALL)

GPU backends (CUDA, Vulkan, Metal) are compiled as shared libraries (.dll/.so/.dylib) using native compilers, but the llamafile host binary is built with Cosmopolitan libc, which uses the System V AMD64 ABI everywhere, including on Windows. When the host calls function pointers inside backend interface structs, the calling convention on both sides must match.

The `GGML_CALL` macro (defined as `__attribute__((__ms_abi__))` when `GGML_MULTIPLATFORM` is set) annotates all function pointers in the backend interface structs and their implementations, so the correct calling convention is used on every platform.
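The pattern, in a minimal sketch (the struct and function names here are illustrative stand-ins for the real interfaces listed in the table below):

```cpp
// Simplified from the patched headers: under GGML_MULTIPLATFORM all
// modules converge on one calling convention for these pointers.
#ifdef GGML_MULTIPLATFORM
#    define GGML_CALL __attribute__((__ms_abi__))
#else
#    define GGML_CALL
#endif

// Every function pointer in a backend interface struct is annotated...
struct example_backend_i {
    const char * (*GGML_CALL get_name)(void * ctx);
};

// ...and so is every implementation, so the Cosmopolitan host binary and
// the natively compiled DSO agree on the ABI on all platforms.
GGML_CALL static const char * example_get_name(void * ctx) {
    (void) ctx;
    return "example";
}
```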
| Patch | Description |
|---|---|
| `ggml_include_ggml-backend.h.patch` | Defines the `GGML_CALL` macro; adds it to the five `get_proc_address` return typedefs (`ggml_backend_split_buffer_type_t`, `ggml_backend_set_n_threads_t`, `ggml_backend_dev_get_extra_bufts_t`, `ggml_backend_set_abort_callback_t`, `ggml_backend_get_features_t`) |
| `ggml_include_ggml-cpu.h.patch` | Adds `GGML_CALL` to the declarations of `ggml_backend_cpu_set_n_threads` and `ggml_backend_cpu_set_abort_callback` (returned via `get_proc_address`) |
| `ggml_include_ggml-cuda.h.patch` | Adds `GGML_CALL` to the declarations of `ggml_backend_cuda_split_buffer_type`, `ggml_backend_cuda_register_host_buffer`, and `ggml_backend_cuda_unregister_host_buffer` |
| `ggml_src_ggml-backend-impl.h.patch` | Adds `GGML_CALL` to all 49+ function pointers across the five interface structs (`ggml_backend_buffer_type_i`, `ggml_backend_buffer_i`, `ggml_backend_i`, `ggml_backend_device_i`, `ggml_backend_reg_i`); also adds the `free_struct` callback (see Cross-Module Memory below) |
| `ggml_src_ggml-backend.cpp.patch` | Adds `GGML_CALL` to CPU buffer, buffer type, and multi-buffer callback implementations; also adds `free_struct` support (see Cross-Module Memory below) |
| `ggml_src_ggml-cpu_ggml-cpu.cpp.patch` | Adds `GGML_CALL` to all CPU backend, device, and registry callback implementations, plus the `get_proc_address`-returned functions (`set_n_threads`, `set_abort_callback`, `get_extra_buffers_type`, `get_features`) |
| `ggml_src_ggml-cpu_amx_amx.cpp.patch` | Adds `GGML_CALL` to all AMX buffer and buffer type callback implementations (10 functions) |
| `ggml_src_ggml-cpu_repack.cpp.patch` | Adds `GGML_CALL` to CPU repack buffer and buffer type callback implementations (5 functions) |
| `ggml_src_ggml-cuda_ggml-cuda.cu.patch` | Adds `GGML_CALL` to all CUDA backend callback implementations (60+ functions); also adds `free_struct` and a TinyBLAS BF16 guard (see below) |
| `ggml_src_ggml-metal_ggml-metal.cpp.patch` | Adds `GGML_CALL` to all Metal backend callback implementations (62 functions); also adds `free_struct` (see below) |
| `ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch` | Adds `GGML_CALL` to all Vulkan backend callback implementations; also adds `free_struct` and a heap memory underflow fix (see below) |
## Cross-Module Memory

When GPU backends (CUDA, Vulkan, Metal) are loaded as dynamic libraries, memory allocated by the DSO must be freed by the DSO's allocator, not the main executable's. The patches thread a `free_struct` callback through the buffer interface for this purpose (a sketch follows the table).
| Patch | Description |
|---|---|
| `ggml_src_ggml-backend-impl.h.patch` | Adds a `free_struct` callback to the `ggml_backend_buffer_i` interface for cross-module buffer cleanup |
| `ggml_src_ggml-backend.cpp.patch` | Implements `free_struct` support in `ggml_backend_buffer_free()`: calls the DSO's `free_struct` instead of `delete` when set |
| `ggml_src_ggml-cuda_ggml-cuda.cu.patch` | Adds a `free_struct` implementation for CUDA buffers (regular, split, and host); sets it on fallback CPU buffers allocated within the DSO |
| `ggml_src_ggml-metal_ggml-metal.cpp.patch` | Adds a `free_struct` implementation for Metal shared and private buffers |
| `ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch` | Adds a `free_struct` implementation for Vulkan buffers and the host buffer fallback path |
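A sketch of the pattern, reusing the `GGML_CALL` macro from the previous section; only the `free_struct` name comes from the patches, the surrounding struct shapes are simplified:

```cpp
// Illustrative shape of the patched interface and free path.
typedef struct ggml_backend_buffer * ggml_backend_buffer_t;

struct ggml_backend_buffer_i {
    void (*GGML_CALL free_buffer)(ggml_backend_buffer_t buffer);
    // Set by buffers allocated inside a DSO so the struct is returned to
    // the allocator that created it; left null for host-owned buffers.
    void (*GGML_CALL free_struct)(ggml_backend_buffer_t buffer);
    // ... remaining callbacks ...
};

struct ggml_backend_buffer {
    struct ggml_backend_buffer_i iface;
    // ...
};

void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
    if (buffer->iface.free_buffer) {
        buffer->iface.free_buffer(buffer);  // release the backing memory
    }
    if (buffer->iface.free_struct) {
        buffer->iface.free_struct(buffer);  // DSO frees its own struct
    } else {
        delete buffer;                      // host allocation: default path
    }
}
```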
## Cosmopolitan Compatibility

These patches address compatibility issues when building with Cosmopolitan libc (cosmocc).
| Patch | Description |
|---|---|
| `common_arg.cpp.patch` | Adds `COSMOCC` platform detection for `PATH_MAX` (includes `linux/limits.h`) |
| `common_common.cpp.patch` | Adds platform-aware cache directory detection for Cosmopolitan (checks `LOCALAPPDATA`, then `XDG_CACHE_HOME`, then falls back to `~/.cache/`); also adds mmproj model size estimation to the GPU fit params so the fit algorithm reserves enough VRAM for multimodal projectors |
| `common_download.cpp.patch` | Adds `COSMOCC` platform detection for `PATH_MAX` |
| `common_ngram-mod.cpp.patch` | Adds a missing `#include <algorithm>` (needed for `std::min`/`std::max`) |
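The cache-directory detection follows the fallback order described in the table; a hypothetical sketch (the actual function name and path handling in the patch may differ):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper mirroring the described fallback order; not the
// exact code from common_common.cpp.patch.
static std::string get_cache_dir() {
    if (const char * p = std::getenv("LOCALAPPDATA")) {    // Windows
        return std::string(p) + "/";
    }
    if (const char * p = std::getenv("XDG_CACHE_HOME")) {  // XDG platforms
        return std::string(p) + "/";
    }
    if (const char * p = std::getenv("HOME")) {            // final fallback
        return std::string(p) + "/.cache/";
    }
    return "./.cache/";
}
```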
## Condition Variables and Signals

Cosmopolitan libc has specific behaviors around condition variables and signals that require workarounds (a sketch of the pattern follows the table).
| Patch | Description |
|---|---|
| `common_log.cpp.patch` | Adds `#include <csignal>`; blocks SIGINT/SIGTERM on the logger thread via `pthread_sigmask` to prevent EINTR exceptions; replaces `cv.wait()` with a `wait_for(30s)` loop to work around an XNU futex timeout bug (~72-minute expiry) |
| `tools_server_server-models.cpp.patch` | Adds `#include <csignal>`; blocks SIGINT/SIGTERM on the stopping thread; replaces `cv.wait()` with `wait_for(30s)` loops in `unload_lru`, `stopping_thread`, and `wait_until_loading_finished` |
| `tools_server_server-queue.cpp.patch` | Adds missing includes (including `<cerrno>` and `<csignal>`); blocks SIGINT/SIGTERM on the queue thread; replaces `wait()` with `wait_for()` loops in three locations (`wait_until_no_sleep`, the main loop, and `recv`) |
| `vendor_cpp-httplib_httplib.cpp.patch` | Fixes the httplib thread pool to use `wait_for()` instead of `wait()` for XNU futex compatibility |
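The workaround is the same in each location; a sketch assuming a worker thread with a condition variable, a mutex, and a stop flag shared with other threads:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <pthread.h>
#include <signal.h>

void worker_loop(std::condition_variable & cv, std::mutex & mtx, bool & done) {
    // Block SIGINT/SIGTERM on this thread so an interrupted wait cannot
    // surface as an EINTR-driven exception under Cosmopolitan.
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGTERM);
    pthread_sigmask(SIG_BLOCK, &set, nullptr);

    std::unique_lock<std::mutex> lock(mtx);
    while (!done) {
        // wait_for(30s) in a loop instead of an unbounded wait(): the
        // periodic wakeup sidesteps the XNU futex expiry bug while the
        // predicate re-check preserves wait() semantics.
        cv.wait_for(lock, std::chrono::seconds(30));
    }
}
```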
## TinyBLAS

Llamafile uses TinyBLAS as a lightweight replacement for cuBLAS, enabling GPU support without CUDA SDK dependencies.
| Patch | Description |
|---|---|
| `ggml_src_ggml-cuda_vendors_cuda.h.patch` | Includes the TinyBLAS headers (`tinyblas.h`, `tinyblas-compat.h`) instead of `cublas_v2.h` when `GGML_USE_TINYBLAS` is defined; guards the backward-compat `CUBLAS_*` defines so they don't conflict with TinyBLAS's own definitions |
| `ggml_src_ggml-cuda_common.cuh.patch` | Disables BF16 MMA when using TinyBLAS (TinyBLAS would incorrectly interpret BF16 as FP16) |
| `ggml_src_ggml-cuda_ggml-cuda.cu.patch` | Disables BF16 in `ggml_cuda_op_mul_mat_cublas` when using TinyBLAS |
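The BF16 guard amounts to a compile-time switch; an illustrative sketch (the helper name is hypothetical and the real condition in the patch is more involved):

```cpp
#include "ggml.h"

// Hypothetical helper: when TinyBLAS stands in for cuBLAS, the BF16 GEMM
// path is disabled outright, since TinyBLAS would read the BF16 operands
// as FP16 and produce garbage.
static bool mul_mat_may_use_bf16(const struct ggml_tensor * src0) {
#ifdef GGML_USE_TINYBLAS
    (void) src0;
    return false;
#else
    return src0->type == GGML_TYPE_BF16;
#endif
}
```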
## Llamafile File APIs

These patches integrate llamafile's file handling APIs for loading models from bundled zip archives and .llamafile containers.
| Patch | Description |
|---|---|
| `src_llama-mmap.h.patch` | Adds `has_premapped_content()`, `premapped_content()`, and `get_llamafile()` methods to the `llama_file` class |
| `src_llama-mmap.cpp.patch` | Under `COSMOCC`, redirects file open/read/seek/tell/close through the llamafile API (`llamafile_open_gguf`, `llamafile_read`, etc.); adds premapped content support to `llama_mmap` using llamafile reference counting (`llamafile_ref`/`llamafile_unref`); skips `munmap` for premapped content |
| `ggml_src_gguf.cpp.patch` | Adds `tell()`/`seek()` to `gguf_reader`; under `COSMOCC`, adds a `gguf_llamafile_reader` that reads via the llamafile API; templatizes `gguf_init_from_reader_impl` so both readers work; redirects `gguf_init_from_file` through `llamafile_open_gguf` (supports `/zip/` paths and `.llamafile` containers) |
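The shape of the `gguf_init_from_file` redirect, as a sketch; the `llamafile_*` names come from the table above, but their signatures, the mode string, and the `__COSMOCC__` gate are assumptions here:

```cpp
#ifdef __COSMOCC__  // assumed spelling of the Cosmopolitan build gate
struct gguf_context * gguf_init_from_file(const char * path, struct gguf_init_params params) {
    // Resolves plain paths, /zip/ paths inside the running executable,
    // and .llamafile containers.
    struct llamafile * file = llamafile_open_gguf(path, "rbe");
    if (!file) {
        return nullptr;
    }
    gguf_llamafile_reader reader(file);  // reads through llamafile_read()
    // The templatized impl accepts either the plain file reader or this one.
    struct gguf_context * ctx = gguf_init_from_reader_impl(reader, params);
    llamafile_unref(file);               // reference-counted release
    return ctx;
}
#endif
```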
## Server Integration

| Patch | Description |
|---|---|
| `tools_server_server.cpp.patch` | Renames `main()` to `server_main()` with `on_ready`/`on_shutdown_available` callbacks for combined TUI+server mode; adds a Metal/GPU backend trigger before `common_init()`; adds a Cosmopolitan-specific standalone `main()` with `cosmo_args`, verbose flag handling, and GPU pre-initialization; handles `LLAMAFILE_TUI` exit to avoid Metal cleanup crashes |
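How an embedding frontend might drive the renamed entry point; a sketch in which the callback signatures are assumptions, not the patch's actual prototypes:

```cpp
// Assumed signature for the renamed entry point described above.
int server_main(int argc, char ** argv,
                void (*on_ready)(void),
                void (*on_shutdown_available)(void));

static void tui_on_ready(void) {
    // e.g. flip the TUI from a "loading model" state to "serving"
}

int main(int argc, char ** argv) {
    // The patch also pre-initializes GPU/Metal backends before
    // common_init(); only the callback wiring is shown here.
    return server_main(argc, argv, tui_on_ready, /*on_shutdown_available=*/nullptr);
}
```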
## Miscellaneous Fixes

| Patch | Description |
|---|---|
| `ggml_src_ggml-backend-reg.cpp.patch` | Suppresses debug log noise for non-existent backend search paths (irrelevant for llamafile's DSO loading approach) |
| `ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch` | Fixes an unsigned integer underflow in `ggml_backend_vk_get_device_memory`, where Vulkan's `heapUsage` can exceed `heapBudget` (clamps to zero instead of wrapping) |
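The underflow fix is a plain clamp on unsigned arithmetic; a sketch assuming the standard `VkPhysicalDeviceMemoryBudgetPropertiesEXT` fields (the helper name is hypothetical):

```cpp
#include <vulkan/vulkan.h>

// heapBudget and heapUsage are unsigned VkDeviceSize values, and Vulkan
// permits usage > budget, so naive subtraction would wrap to a huge number.
static VkDeviceSize heap_free_bytes(
        const VkPhysicalDeviceMemoryBudgetPropertiesEXT & props, uint32_t heap) {
    return props.heapBudget[heap] > props.heapUsage[heap]
         ? props.heapBudget[heap] - props.heapUsage[heap]
         : 0;  // clamp to zero instead of wrapping
}
```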
## Regenerating Patches

Files in llama.cpp are usually modified in place during development and testing. Once they are ready to be committed, regenerate everything in the llama.cpp.patches directory by running:

```sh
cd llama.cpp
../tools/generate-patches.sh --output-dir ../llama.cpp.patches
```

Patch filenames automatically reflect the source file path, with underscores replacing slashes (e.g., `common_arg.cpp.patch` for `common/arg.cpp`).