docs/backend/VirtGPU.md
The GGML-VirtGPU backend enables GGML applications to run machine learning computations on host hardware while the application itself runs inside a virtual machine. It uses host-guest shared memory to efficiently share data buffers between the two sides.
This backend relies on the virtio-gpu paravirtualized device and on the VirglRenderer API Remoting (APIR) component. The backend is split into two libraries: a guest-side frontend and a host-side backend.

Current support status:
| OS | Status | Backend | CI testing | Notes |
|---|---|---|---|---|
| MacOS 14 | Supported | ggml-metal | X | Working when compiled on MacOS 14 |
| MacOS 15 | Supported | ggml-metal | X | Working when compiled on MacOS 14 or MacOS 15 |
| MacOS 26 | Not tested | | | |
| Linux | Under development | ggml-vulkan | not working | Working locally, CI running into deadlocks |
The GGML-VirtGPU backend consists of three main components:
```mermaid
graph TD
    %% Nodes
    subgraph GuestVM ["Guest VM - Frontend"]
        direction TB
        App(["GGML Application<br/>llama.cpp, etc."])
        Interface[GGML Backend Interface]
        Comm["GGML-VirtGPU<br/>(hypercalls + shared mem)"]
        App --> Interface
        Interface --> Comm
    end

    API["virtio-gpu / virglrenderer API"]

    subgraph HostSystem ["Host System - Backend"]
        direction TB
        Dispatcher[GGML-VirtGPU-Backend]
        BackendLib["GGML Backend library<br/>Metal / Vulkan / CPU / ..."]
        Dispatcher --> BackendLib
    end

    %% Connections
    Comm --> API
    API --> HostSystem
```
- **Frontend** (`ggml-virtgpu/`): Implements the GGML backend interface in the guest and forwards operations to the host
- **Backend** (`ggml-virtgpu/backend/`): Receives forwarded operations and executes them on the actual hardware backends
- **Transport** (virtio-gpu / virglrenderer): Carries the forwarded calls and shared-memory buffers between the guest and the host

The backend uses two primary communication mechanisms:
- **Hypercalls** (`DRM_IOCTL_VIRTGPU_EXECBUFFER`): Trigger remote execution from guest to host
- **Shared memory**: Passes serialized arguments and results between the two sides; each connection uses two shared memory buffers
The VirglRenderer API Remoting (APIR) protocol defines three command types:

- `HANDSHAKE`: Protocol version negotiation and capability discovery
- `LOADLIBRARY`: Dynamic loading of backend libraries on the host
- `FORWARD`: API function call forwarding

Commands and data are serialized using a custom binary protocol.
The guest-side library depends on `libdrm` for DRM/virtio-gpu communication.

Environment variables:

- `GGML_VIRTGPU_BACKEND_LIBRARY`: Path to the host-side backend library
- `GGML_VIRTGPU_DEBUG`: Enable debug logging

Build options:

- `GGML_VIRTGPU`: Enable the VirtGPU backend (`ON` or `OFF`, default: `OFF`)
- `GGML_VIRTGPU_BACKEND`: Build the host-side backend component (`ON`, `OFF` or `ONLY`, default: `OFF`)

With the libkrun hypervisor, the RAM + VRAM addressable memory is limited to 64 GB, so the maximum GPU memory is 64 GB minus the guest RAM, regardless of the hardware VRAM size. For example, a VM configured with 16 GB of RAM can use at most 48 GB of GPU memory.

This work is pending upstream changes in the VirglRenderer project.
This work is also pending changes in the VMM/hypervisor running the virtual machine, which needs to learn how to route the newly introduced APIR capset.
Until then, two environment flags control the capset routing:

- `VIRGL_ROUTE_VENUS_TO_APIR=1` allows using the Venus capset until the relevant hypervisors have been patched. However, setting this flag breaks the normal Vulkan/Venus behavior.
- `GGML_REMOTING_USE_APIR_CAPSET` tells the ggml-virtgpu backend to use the APIR capset. This will become the default once the relevant hypervisors have been patched.

This work focused on improving the performance of llama.cpp running on MacOS containers, and is mainly tested on that platform. Linux support (via krun) is in progress.