GGML-VirtGPU Backend

The GGML-VirtGPU backend enables GGML applications to run machine learning computations on host hardware while the application itself runs inside a virtual machine. It uses host-guest shared memory to efficiently share data buffers between the two sides.

This backend relies on virtio-gpu and the VirglRenderer API Remoting (APIR) component. The backend is split into two libraries:

  • a GGML backend implementation (the "remoting frontend"), running in the guest and interacting with the virtio-gpu device
  • a VirglRenderer APIR-compatible library (the "remoting backend"), running on the host and interacting with VirglRenderer and an actual GGML device backend.

OS support

| OS       | Status            | Backend     | CI testing  | Notes                                        |
|----------|-------------------|-------------|-------------|----------------------------------------------|
| MacOS 14 | Supported         | ggml-metal  | X           | Working when compiled on MacOS 14            |
| MacOS 15 | Supported         | ggml-metal  | X           | Working when compiled on MacOS 14 or MacOS 15 |
| MacOS 26 | Not tested        |             |             |                                              |
| Linux    | Under development | ggml-vulkan | not working | Working locally, CI running into deadlocks   |

Architecture Overview

The GGML-VirtGPU backend consists of three main components:

```mermaid
graph TD
    %% Nodes

    subgraph GuestVM ["Guest VM - Frontend"]
        App(["GGML Application<br/>llama.cpp, etc."])

        direction TB
        Interface[GGML Backend Interface]
        Comm["GGML-VirtGPU<br/>(hypercalls + shared mem)"]

        App --> Interface
        Interface --> Comm
    end

    API[virtio-gpu / virglrenderer API]

    subgraph HostSystem [Host System - Backend]
        direction TB
        Dispatcher[GGML-VirtGPU-Backend]
        BackendLib["GGML Backend library<br/>Metal / Vulkan / CPU / ..."]

        Dispatcher --> BackendLib
    end

    %% Connections
    Comm --> API
    API --> HostSystem
```

Key Components

  1. Guest-side Frontend (ggml-virtgpu/): Implements the GGML backend interface and forwards operations to the host
  2. Host-side Backend (ggml-virtgpu/backend/): Receives forwarded operations and executes them on actual hardware backends
  3. Communication Layer: Uses virtio-gpu hypercalls and shared memory for efficient data transfer

Features

  • Dynamic backend loading on the host side (CPU, CUDA, Metal, etc.)
  • Zero-copy data transfer via host-guest shared memory pages

Communication Protocol

Hypercalls and Shared Memory

The backend uses two primary communication mechanisms:

  1. Hypercalls (DRM_IOCTL_VIRTGPU_EXECBUFFER): Trigger remote execution from guest to host
  2. Shared Memory Pages: Zero-copy data transfer for tensors and parameters

Shared Memory Layout

Each connection uses two fixed shared memory buffers:

  • Data Buffer (24 MiB): for command/response data and tensor transfers
  • Reply Buffer (16 KiB): for command replies and status information

In addition, Data Buffers are dynamically allocated host-guest shared buffers that serve as GGML buffers.

APIR Protocol

The VirglRenderer API Remoting (APIR) protocol defines three command types:

  • HANDSHAKE: Protocol version negotiation and capability discovery
  • LOADLIBRARY: Dynamic loading of backend libraries on the host
  • FORWARD: API function call forwarding

Binary Serialization

Commands and data are serialized using a custom binary protocol with:

  • Fixed-size encoding for basic types
  • Variable-length arrays with size prefixes
  • Buffer bounds checking
  • Error recovery mechanisms

Supported Operations

Device Operations

  • Device enumeration and capability queries
  • Memory information (total/free)
  • Backend type detection

Buffer Operations

  • Buffer allocation and deallocation
  • Tensor data transfer (host ↔ guest)
  • Memory copying and clearing

Computation Operations

  • Graph execution forwarding

Build Requirements

Guest-side Dependencies

  • libdrm for DRM/virtio-gpu communication
  • C++20 compatible compiler
  • CMake 3.14+

Host-side Dependencies

  • virglrenderer with APIR support (pending upstream review)
  • Target backend libraries (libggml-metal, libggml-vulkan, etc.)

Configuration

Environment Variables

  • GGML_VIRTGPU_BACKEND_LIBRARY: Path to the host-side backend library
  • GGML_VIRTGPU_DEBUG: Enable debug logging

Build Options

  • GGML_VIRTGPU: Enable the VirtGPU backend (ON or OFF, default: OFF)
  • GGML_VIRTGPU_BACKEND: Build the host-side backend component (ON, OFF or ONLY, default: OFF)
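Putting the build options and environment variables together, a typical setup might look like the following. The CMake flags and variable names come from the lists above; the exact build directory names and the backend library path are illustrative.

```shell
# Guest side: build llama.cpp with the VirtGPU frontend enabled.
cmake -B build -DGGML_VIRTGPU=ON
cmake --build build

# Host side: build only the host-side backend component.
cmake -B build-host -DGGML_VIRTGPU_BACKEND=ONLY
cmake --build build-host

# At runtime on the host, point the dispatcher at the backend library
# and enable debug logging (the path below is an example).
export GGML_VIRTGPU_BACKEND_LIBRARY=/usr/local/lib/libggml-metal.dylib
export GGML_VIRTGPU_DEBUG=1
```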

System Requirements

  • VM with virtio-gpu support
  • VirglRenderer with APIR patches
  • Compatible backend libraries on host

Limitations

  • VM-specific: Only works in virtual machines with virtio-gpu support
  • Host dependency: Requires properly configured host-side backend
  • Latency: Small overhead from the VM exit required for each forwarded operation
  • Shared-memory size: with the libkrun hypervisor, the addressable RAM + VRAM is limited to 64 GB, so the maximum GPU memory is 64 GB minus the VM RAM, regardless of the hardware VRAM size.
  • This work is pending upstream changes in the VirglRenderer project.
  • This work is pending changes in the VMM/hypervisor running the virtual machine, which needs to know how to route the newly introduced APIR capset.
    • The environment variable VIRGL_ROUTE_VENUS_TO_APIR=1 allows using the Venus capset until the relevant hypervisors have been patched. However, setting this flag breaks the normal Vulkan/Venus behavior.
    • The environment variable GGML_REMOTING_USE_APIR_CAPSET tells the ggml-virtgpu backend to use the APIR capset. This will become the default once the relevant hypervisors have been patched.
  • This work focused on improving the performance of llama.cpp running in MacOS containers, and is mainly tested on that platform. Linux support (via krun) is in progress.

See Also