.agents/ci-caching.md
Container builds — both the root LocalAI image (Dockerfile) and the per-backend images (backend/Dockerfile.*) — share a registry-backed BuildKit cache plus a layered set of prebuilt base images. This file explains how the cache is laid out, what invalidates it, and how to bypass it.
| Workflow | Purpose | Triggers |
|---|---|---|
| `.github/workflows/backend.yml` | Backend container images on master | push to master + tags, weekly Sunday cron, `workflow_dispatch` |
| `.github/workflows/backend_pr.yml` | Backend container images on PRs | `pull_request` |
| `.github/workflows/backend_build.yml` | Reusable: builds one backend (one arch) by digest | `workflow_call` from above |
| `.github/workflows/backend_merge.yml` | Reusable: assembles per-arch digests into a multi-arch manifest list | `workflow_call` |
| `.github/workflows/backend_build_darwin.yml` | Reusable: macOS-native backend builds | `workflow_call` |
| `.github/workflows/image.yml` / `image-pr.yml` | Root LocalAI image (push / PR) | push / PR |
| `.github/workflows/image_build.yml` / `image_merge.yml` | Reusable: per-arch root-image build + merge | `workflow_call` |
| `.github/workflows/base-images.yml` | Builds the prebuilt `base-grpc-*` builder bases | Saturdays 05:00 UTC cron, `workflow_dispatch`, master push touching `Dockerfile.base-grpc-builder`, `.docker/install-base-deps.sh`, `.docker/apt-mirror.sh`, or this workflow |
The matrix that drives backend.yml / backend_pr.yml lives in .github/backend-matrix.yml (data-only YAML, not embedded in the workflow). scripts/changed-backends.js parses it, applies path-filter logic against the PR diff (PR events) or the GitHub Compare API (push events), and emits the filtered matrix plus a merge-matrix for backends with multiple per-arch entries.
The registry cache lives in `quay.io/go-skynet/ci-cache`. Cache tags are keyed on `tag-suffix` and `platform-tag`:

- Backend images (`backend_build.yml`): `cache<tag-suffix>-<platform-tag>`, for example `cache-cpu-faster-whisper-amd64`, `cache-cpu-faster-whisper-arm64`, `cache-gpu-nvidia-cuda-13-llama-cpp-amd64`
- Root image (`image_build.yml`): `cache-localai<tag-suffix>-<platform-tag>` (with a `-core` placeholder when `tag-suffix` is empty, so `cache-localai-core-amd64` for the core image)
- Base images (`base-images.yml`): `cache-base-grpc-<variant>` (one per `(BUILD_TYPE, arch)` permutation)

Cache exports use `mode=max`, so every intermediate stage is reusable, not just the final image.

The per-arch suffix exists because amd64 and arm64 builds produce different intermediate content; sharing one cache key would thrash on every cross-arch rebuild.
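The naming scheme above can be sketched as two small shell helpers. This is only an illustration: the function names are hypothetical, and the real workflows assemble these strings inside YAML expressions rather than in a script.

```sh
# Registry repository named in this document.
CACHE_REPO="quay.io/go-skynet/ci-cache"

# backend_cache_ref <tag-suffix> <platform-tag>   (hypothetical helper)
backend_cache_ref() {
    printf '%s:cache%s-%s\n' "$CACHE_REPO" "$1" "$2"
}

# root_cache_ref <tag-suffix> <platform-tag>      (hypothetical helper)
# An empty tag-suffix falls back to the "-core" placeholder.
root_cache_ref() {
    printf '%s:cache-localai%s-%s\n' "$CACHE_REPO" "${1:--core}" "$2"
}

backend_cache_ref "-cpu-faster-whisper" "arm64"
# prints quay.io/go-skynet/ci-cache:cache-cpu-faster-whisper-arm64
root_cache_ref "" "amd64"
# prints quay.io/go-skynet/ci-cache:cache-localai-core-amd64
```

Because the platform tag is part of every reference, the amd64 and arm64 builds of one backend never read or write the same cache tag.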
| Trigger | cache-from | cache-to |
|---|---|---|
| push to master / tag / cron / dispatch | yes | yes (`mode=max,ignore-error=true`) |
| `pull_request` | yes | no |
PR builds read master's warm cache but never write — this prevents PRs from polluting the shared cache with their experimental state. After merge, the master build for that matrix entry refreshes the cache.
ignore-error=true on the write side means a transient quay push failure does not fail the build; the next master push retries.
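As a rough sketch of the read/write rule, the helper below prints the buildx cache flags for one build. The function name `cache_flags` is hypothetical; the flag values follow the table above, and the real workflows set them as action inputs rather than CLI flags.

```sh
# cache_flags <event-name> <cache-ref>   (hypothetical helper)
# Always read from the cache; write only outside pull_request events.
cache_flags() {
    printf '%s\n' "--cache-from type=registry,ref=$2"
    if [ "$1" != "pull_request" ]; then
        printf '%s\n' "--cache-to type=registry,ref=$2,mode=max,ignore-error=true"
    fi
}

ref="quay.io/go-skynet/ci-cache:cache-cpu-faster-whisper-amd64"
cache_flags pull_request "$ref"   # prints only the --cache-from line
cache_flags push "$ref"           # prints --cache-from and --cache-to
```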
## Prebuilt builder bases (`base-grpc-*`)

The C++ backend Dockerfiles (`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`) compile gRPC from source. On a cold build, that takes ~25–35 min before any LocalAI source compiles. To skip that in CI, `.github/workflows/base-images.yml` builds and pushes a set of prebuilt builder bases:
| Tag | Contents |
|---|---|
| `base-grpc-amd64` / `base-grpc-arm64` | Ubuntu 24.04 + apt build deps + protoc + cmake + gRPC at `/opt/grpc` |
| `base-grpc-cuda-12-amd64` | the above + CUDA 12.8 toolkit |
| `base-grpc-cuda-13-amd64` | the above + CUDA 13.0 toolkit (Ubuntu 22.04 base) |
| `base-grpc-cuda-13-arm64` | the above + CUDA 13.0 sbsa toolkit (Ubuntu 24.04 base) |
| `base-grpc-l4t-cuda-12-arm64` | JetPack r36.4.0 base (CUDA preinstalled, `SKIP_DRIVERS=true`) + gRPC |
| `base-grpc-rocm-amd64` | `rocm/dev-ubuntu-24.04:7.2.1` base + hipblas/hipblaslt/rocblas + gRPC |
| `base-grpc-vulkan-amd64` / `base-grpc-vulkan-arm64` | Ubuntu 24.04 + Vulkan SDK 1.4.335 + gRPC |
| `base-grpc-intel-amd64` | `intel/oneapi-basekit:2025.3.2` base + gRPC |
Single source of truth: the install logic for all 10 variants lives in .docker/install-base-deps.sh. Both Dockerfile.base-grpc-builder AND each variant Dockerfile's builder-fromsource stage bind-mount and execute the same script — so the prebuilt CI base and the local from-source path are bit-equivalent by construction.
`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` are multi-target: two alternative builder stages, an aliasing stage, and a final scratch stage:

1. `builder-fromsource`: `FROM ${BASE_IMAGE}`, then runs `install-base-deps.sh` and the per-backend compile script. Used when `BUILDER_TARGET=builder-fromsource` (the default; local `make backends/<name>`).
2. `builder-prebuilt`: `FROM ${BUILDER_BASE_IMAGE}` (one of the prebuilt `base-grpc-*` tags), and runs only the per-backend compile script. Used when `BUILDER_TARGET=builder-prebuilt` (CI, when the matrix entry sets `builder-base-image`).
3. `FROM ${BUILDER_TARGET} AS builder`: an alias that resolves the ARG-selected stage to a fixed name (BuildKit doesn't allow ARG expansion in `COPY --from=`).
4. `FROM scratch` + `COPY --from=builder ...package/. ./`: emits the final scratch image with just the package contents.

BuildKit prunes the unreferenced builder stage, so each build only runs the path it needs. `backend_build.yml` derives `BUILDER_TARGET=builder-prebuilt` automatically when the matrix entry has a non-empty `builder-base-image`; otherwise it defaults to `builder-fromsource`.
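The target-selection rule can be sketched in a few lines of shell. The helper name `builder_target` and the example image reference are hypothetical; only the two stage names and the "non-empty `builder-base-image`" rule come from the text above.

```sh
# builder_target <builder-base-image>   (hypothetical helper)
# A non-empty builder-base-image selects the prebuilt path.
builder_target() {
    if [ -n "$1" ]; then
        echo "builder-prebuilt"
    else
        echo "builder-fromsource"
    fi
}

# "example.invalid/ci" is a placeholder registry path, not the real one.
builder_target "example.invalid/ci:base-grpc-amd64"   # prints builder-prebuilt
builder_target ""                                     # prints builder-fromsource
```

The result would then be passed to the build as `--build-arg BUILDER_TARGET=<value>`, which is what the aliasing stage resolves.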
The matrix (build-type, platforms) → builder-base-image mapping for llama-cpp / ik-llama-cpp / turboquant entries:
| build-type | platforms | tag |
|---|---|---|
| `''` | `linux/amd64` | `base-grpc-amd64` |
| `''` | `linux/arm64` | `base-grpc-arm64` |
| cublas cuda 12 | `linux/amd64` | `base-grpc-cuda-12-amd64` |
| cublas cuda 13 | `linux/amd64` | `base-grpc-cuda-13-amd64` |
| cublas cuda 13 | `linux/arm64` | `base-grpc-cuda-13-arm64` |
| cublas cuda 12 + JetPack base | `linux/arm64` | `base-grpc-l4t-cuda-12-arm64` |
| hipblas | `linux/amd64` | `base-grpc-rocm-amd64` |
| vulkan | `linux/amd64` | `base-grpc-vulkan-amd64` |
| vulkan | `linux/arm64` | `base-grpc-vulkan-arm64` |
| `sycl_*` | `linux/amd64` | `base-grpc-intel-amd64` |
If you add a new entry to base-images.yml's matrix, the new tag does not exist on quay until the workflow runs. To consume it from a variant entry safely, dispatch the base-images workflow on the branch first:
```sh
gh workflow run base-images.yml --ref <feature-branch>
```
Wait for the new variant to push, then merge the consumer change. Otherwise the consumer's CI fails with "image not found."
Multi-arch backends (and the core LocalAI image) build natively per arch instead of running both arches under QEMU emulation on a single x86 runner. The pattern:
1. Per-arch matrix entries share one `tag-suffix` but have distinct `platforms` + `platform-tag` + `runs-on`. Example: `-cpu-faster-whisper` has one amd64 entry on `ubuntu-latest` and one arm64 entry on `ubuntu-24.04-arm`.
2. Each per-arch build pushes by digest only, with `outputs: type=image,push-by-digest=true,name-canonical=true,push=true`. The digest is uploaded as an artifact named `digests<tag-suffix>-<platform-tag>` (or `digests-localai<...>` for root-image builds).
3. `scripts/changed-backends.js` detects the shared `tag-suffix` and emits a `merge-matrix` output. `backend.yml` / `backend_pr.yml` have a `backend-merge-jobs` job that consumes it and calls `backend_merge.yml`.
4. `backend_merge.yml` downloads all matching digest artifacts and runs `docker buildx imagetools create` to publish the final tagged manifest list pointing at both per-arch digests. It uses the same `docker/metadata-action` config as the original monolithic build, so consumers see no tag-shape change.
5. `image_merge.yml` is the equivalent for the root LocalAI image (with the `-core` placeholder when `tag-suffix` is empty, so the artifact-name glob doesn't over-match across core and gpu-vulkan).

`provenance: false` is required on multi-registry digest pushes. With the default `mode=max` provenance attestation, BuildKit bundles a per-registry attestation manifest into each registry's manifest list, so the resulting list digest diverges across registries. `steps.build.outputs.digest` then matches only one of them, and the merge step's `imagetools create <reg>@sha256:<digest>` lookup fails on the other. Setting `provenance: false` keeps the digest content-only and identical across registries.
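A minimal sketch of the merge step is shown below. It only prints the `docker buildx imagetools create` command instead of running it; the helper name, the repository path, the tag, and the shortened digests are all hypothetical placeholders.

```sh
# merge_command <final-tag> <image-repo> <digest>...   (hypothetical helper)
# Prints (does not run) the imagetools invocation that would publish one
# manifest list pointing at every per-arch digest.
merge_command() (
    tag="$1"; repo="$2"; shift 2
    cmd="docker buildx imagetools create -t $repo:$tag"
    for digest in "$@"; do
        cmd="$cmd $repo@sha256:$digest"
    done
    printf '%s\n' "$cmd"
)

# Placeholder values; real digests are 64 hex characters.
merge_command "example-tag" "registry.example/backends" "aaa111" "bbb222"
```

Each source argument is a `<repo>@sha256:<digest>` reference, which is why the digest has to be identical in every registry the build pushed to.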
Both backend.yml (push) and backend_pr.yml (PR) generate their matrix dynamically through scripts/changed-backends.js:
- PR events: `pulls/{n}/files` API → filter the matrix to entries whose `dockerfile` path prefix matches the PR diff.
- Push events: GitHub Compare API (`/repos/{owner}/{repo}/compare/{before}...{after}`) → same path-filter logic. Falls back to "run everything" on first-branch push (`event.before` zero), API truncation (≥300 changed files), missing API token, or any thrown error.
- Tag pushes: `FORCE_ALL=true` is set from the workflow side (`startsWith(github.ref, 'refs/tags/')`), so releases rebuild every backend regardless of diff.
- `workflow_dispatch`: no `event.before`, so it falls through to "run everything" automatically.

The Sunday 06:00 UTC cron on `backend.yml` exists specifically because path filtering can leave Python backends frozen on stale wheels. `DEPS_REFRESH` (below) only fires when the build actually runs, so an untouched Python backend would never re-resolve its unpinned deps. The weekly cron is the safety net.
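The path-filter idea can be illustrated with a small shell function. This is a sketch under assumptions, not the logic of `scripts/changed-backends.js`: the function name, the flat "name prefix" entry format, and the `backend/python/vllm/` path are hypothetical stand-ins for the YAML matrix.

```sh
# select_backends <force-all> <changed-files> <entries>   (hypothetical helper)
#   force-all:     "true" stands in for FORCE_ALL and the fallback cases
#   changed-files: newline-separated paths from the diff
#   entries:       newline-separated "name path-prefix" pairs
select_backends() (
    force_all="$1"; changed="$2"; entries="$3"
    printf '%s\n' "$entries" | while read -r name prefix; do
        [ -n "$name" ] || continue
        if [ "$force_all" = "true" ]; then
            echo "$name"
            continue
        fi
        # Emit the entry once if any changed path starts with its prefix.
        printf '%s\n' "$changed" | while read -r path; do
            case "$path" in
                "$prefix"*) echo "$name"; break ;;
            esac
        done
    done
)

entries='llama-cpp backend/cpp/llama-cpp/
vllm backend/python/vllm/'
changed='backend/python/vllm/requirements.txt
README.md'

select_backends false "$changed" "$entries"   # prints vllm
select_backends true  ""         "$entries"   # prints llama-cpp, then vllm
```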
## `DEPS_REFRESH` cache-buster (Python backends)

Every Python backend goes through the shared `backend/Dockerfile.python`, which ends with:

```dockerfile
ARG DEPS_REFRESH=initial
RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
```
Most Python backends ship requirements*.txt files that do not pin every transitive dep (torch, transformers, vllm, diffusers, etc. are listed without a == pin, or with >= lower bounds only). With a warm BuildKit cache, the make layer hashes only on Dockerfile instructions + COPYed source — not on what pip install resolves at runtime. So a warm cache would ship the first version of vllm ever cached and never pick up upstream releases.
DEPS_REFRESH defends against that:
- `backend_build.yml` computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W19`) before each build and passes it as a build-arg.
- The `RUN ... make` layer's BuildKit hash now includes that string, so the layer invalidates at most once per week, automatically picking up newer wheels.

This applies only to `Dockerfile.python` because:

- Go (`Dockerfile.golang`) pins versions in `go.mod` / `go.sum`.
- Rust (`Dockerfile.rust`) pins via `Cargo.lock`.
- The C++ backends pin gRPC (`v1.65.0`) and llama.cpp at a specific commit; their inputs don't drift between rebuilds.

For faster refreshes, bump the date format to daily (`+%Y-%m-%d`) or hourly (`+%Y-%m-%d-%H`). For one-shot rebuilds without changing the schedule, append a marker to the `tag-suffix` in the matrix or temporarily delete that backend's cache tag in quay.
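The three cadences can be compared side by side with a small helper. The `date` formats are the ones quoted above; the function name `deps_refresh` and its cadence keywords are hypothetical.

```sh
# deps_refresh [weekly|daily|hourly]   (hypothetical helper)
# Prints the cache-buster string for the chosen cadence, in UTC.
deps_refresh() {
    case "${1:-weekly}" in
        weekly) date -u +%Y-W%V ;;
        daily)  date -u +%Y-%m-%d ;;
        hourly) date -u +%Y-%m-%d-%H ;;
        *) echo "unknown cadence: $1" >&2; return 1 ;;
    esac
}

# The value changes once per period, so the layer hash does too.
echo "--build-arg DEPS_REFRESH=$(deps_refresh weekly)"
```

A shorter period means more cold `make` layers per week, so the weekly default trades freshness against rebuild cost.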
Dockerfile.{llama-cpp,ik-llama-cpp,turboquant} declare a BuildKit cache mount on /root/.ccache:
```dockerfile
RUN --mount=type=cache,target=/root/.ccache,id=<backend>-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
    bash /usr/local/sbin/compile.sh
```
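The compile script exports `CMAKE_C/CXX/CUDA_COMPILER_LAUNCHER=ccache` so CMake threads ccache through gcc/g++/nvcc. One caveat: BuildKit stores cache-mount contents in the builder's local state, not in the layer cache that `cache-to: type=registry,mode=max` exports, so the ccache data is reused only when that builder state persists between builds or is saved and restored by a separate step.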
On a LLAMA_VERSION bump, most translation units are byte-identical to the previous version's preprocessed source — ccache returns the previous .o and skips the real compile. Same for LocalAI source changes that don't actually touch llama.cpp's CMake inputs. Cache scope is per (TARGETARCH, BUILD_TYPE) so e.g. cublas-12 doesn't share with cublas-13 (their CUDA headers differ; cross-pollination would just be cache misses anyway).
Two composite actions handle runner-side prep:
- `.github/actions/free-disk-space/action.yml`: wraps `jlumbroso/free-disk-space@main` plus an explicit apt purge of dotnet/android/ghc/mono/etc. Reclaims ~6–10 GB on `ubuntu-latest`. No-op on self-hosted runners. Used by `backend_build.yml`, `image_build.yml`, `test.yml`, `tests-aio.yml`, etc.
- `.github/actions/setup-build-disk/action.yml`: relocates Docker's data-root to `/mnt` on hosted X64 runners. GHA hosted `ubuntu-latest` ships ~75 GB of unused space at `/mnt`; combined with the free-disk-space cleanup this gives ~100 GB of working space, enough for the ROCm dev image + vLLM torch install + flash-attn intermediate layers. No-op on self-hosted and on non-X64 hosted runners. Used by `backend_build.yml`, `image_build.yml`, `base-images.yml`.

Both actions run before any `docker buildx` step.
All backend.yml / image.yml / test.yml / etc. workflows use:
```yaml
concurrency:
  group: ci-<workflow>-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
```

- PR events: the group keys on `github.event.pull_request.number` and `cancel-in-progress` is true, so a newer run for the same PR cancels the one in flight.
- Push events: the group falls back to `github.sha`, so each master commit gets its own run and rapid-fire merges don't cancel each other (this was a real issue previously: two master pushes 11 seconds apart would cancel the first's CI).

There is no cron job that pre-warms the BuildKit cache for individual backends. The production builds are the populators. The first master build of a given matrix entry pays the cold cost; subsequent same-entry master builds reuse everything that hasn't changed (apt installs, the gRPC compile in the variant `builder-fromsource` stage or skipped entirely when consuming `base-grpc-*`, Python wheel installs, etc.). The base-images workflow's weekly cron is the closest thing to a populator, and it only refreshes the prebuilt builder bases.
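The concurrency expression shown earlier in this section can be mimicked in shell to see which runs would share a group. Both helper names and the `owner/repo` value are hypothetical; the `${2:-$3}` expansion plays the role of `pull_request.number || github.sha`.

```sh
# concurrency_group <workflow> <pr-number-or-empty> <sha> <repository>
concurrency_group() {
    printf 'ci-%s-%s-%s\n' "$1" "${2:-$3}" "$4"
}

# cancel_in_progress <event-name>
cancel_in_progress() {
    if [ "$1" = "pull_request" ]; then echo true; else echo false; fi
}

concurrency_group backend 1234 abc123 owner/repo   # prints ci-backend-1234-owner/repo
concurrency_group backend ""   abc123 owner/repo   # prints ci-backend-abc123-owner/repo
cancel_in_progress push                            # prints false
```

Two pushes to the same PR land in the same group, while two master commits never do because their SHAs differ.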
To force a fully cold build for one backend or the whole image:
```sh
# Delete a single tag (requires quay credentials with admin on the repo)
curl -X DELETE \
  -H "Authorization: Bearer ${QUAY_TOKEN}" \
  https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/cache-gpu-nvidia-cuda-12-vllm-amd64

# List all tags
curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" \
  "https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/?limit=100" | jq '.tags[].name'
```
Eviction is rarely needed in normal operation: `DEPS_REFRESH` handles weekly drift, source changes invalidate naturally, and each `mode=max` cache lives under its own per-entry, per-arch tag, so a stale tag never bleeds into a different build.
Two things are deliberately not cached:

- The `free-disk-space` and `setup-build-disk` composite actions run on every job. They reclaim runner state, not Docker layers, so BuildKit caches don't apply.
- Images from PR builds (`Build (PR)`) are not pushed anywhere; PRs only build for verification.

`backend_build_darwin.yml` runs natively on macOS-14 GitHub-hosted runners: there is no Docker, no BuildKit, and no cross-job registry cache. Instead, the reusable workflow uses `actions/cache@v4` for four native caches that mirror the spirit of the Linux cache (warm by default, weekly refresh for unpinned Python deps, PRs read-only).
| Cache | Path(s) | Key | Scope |
|---|---|---|---|
| Go modules + build | ~/go/pkg/mod, ~/Library/Caches/go-build | go.sum (managed by actions/setup-go@v5 cache: true) | All darwin jobs |
| Homebrew | ~/Library/Caches/Homebrew/downloads, selected /opt/homebrew/Cellar/* | hash of backend_build_darwin.yml | All darwin jobs |
| ccache (llama.cpp CMake) | ~/Library/Caches/ccache | pinned LLAMA_VERSION from backend/cpp/llama-cpp/Makefile | inputs.backend == 'llama-cpp' only |
| Python wheels (uv + pip) | ~/Library/Caches/pip, ~/Library/Caches/uv | inputs.backend + ISO week (+%Y-W%V) + hash of that backend's requirements*.txt | inputs.lang == 'python' only |
Read/write semantics match the BuildKit cache: actions/cache/restore runs every time, actions/cache/save is gated on github.event_name != 'pull_request'. PRs read master's warm cache but never write back.
The Python wheel cache uses the same ISO-week cache-buster as the Linux DEPS_REFRESH build-arg — same problem (unpinned torch/mlx/diffusers/transformers resolve to fresh wheels weekly), same ~one-cold-rebuild-per-week solution.
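To make the key composition concrete, here is a hedged sketch that builds a key from the three ingredients listed in the table: backend name, ISO week, and a hash of the requirements files. The function name, the `pyenv-darwin-` prefix layout, and the use of `cksum` as a stand-in for the workflow's file hashing are all assumptions.

```sh
# wheel_cache_key <backend> <backend-dir> [week]   (hypothetical helper)
wheel_cache_key() (
    backend="$1"; dir="$2"; week="${3:-$(date -u +%Y-W%V)}"
    # Shell globs expand in sorted order, so the hash input is stable.
    # cksum is only a portable stand-in for a real content hash.
    hash=$(cat "$dir"/requirements*.txt 2>/dev/null | cksum | cut -d' ' -f1)
    printf 'pyenv-darwin-%s-%s-%s\n' "$backend" "$week" "$hash"
)

# The directory path is a placeholder; a missing directory hashes as empty input.
wheel_cache_key "mlx" "backend/python/mlx" "2026-W19"
```

The key changes when the week rolls over or when any requirements file changes, and stays stable otherwise, which is what allows roughly one cold rebuild per week.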
The brew Cellar cache requires HOMEBREW_NO_AUTO_UPDATE=1 and HOMEBREW_NO_INSTALL_CLEANUP=1 (set as job-level env). Without those, brew install would mutate the very directories that were just restored, defeating the cache.
Force-link after cache restore: actions/cache restores /opt/homebrew/Cellar/* but NOT the /opt/homebrew/bin/* symlinks. After a cache hit, brew install sees the Cellar entries and decides "already installed" without re-running its link step, leaving the formulas off PATH. The Dependencies step explicitly runs brew link --overwrite for every cached formula afterwards to ensure the symlinks exist.
For ccache, the workflow exports CMAKE_ARGS=… -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache via $GITHUB_ENV before running make build-darwin-go-backend. The Makefile in backend/cpp/llama-cpp/ already forwards CMAKE_ARGS through to each variant build (fallback, grpc, rpc-server), so no script changes are needed. The three variants share most TUs, so ccache dedupes object files across them.
backend_build_darwin.yml also has a llama-cpp-specific build-step branch that runs make backends/llama-cpp-darwin (the bespoke script that compiles three CMake variants and bundles dylibs via otool), distinct from the generic make build-darwin-${lang}-backend path. This was consolidated from a previously-bespoke top-level llama-cpp-darwin job in backend.yml so llama-cpp on Darwin honors the same path filter as the other 34 Darwin backends.
GitHub Actions caches are limited to 10 GB per repo. Steady-state worst case: ~800 MB Go cache + ~2 GB brew Cellar + up to 2 GB ccache + ~1.5 GB × 5 python backends. If the cap is hit, prefer collapsing the per-backend Python keys into a shared pyenv-darwin-shared-<week> key (accepts more cross-backend churn for a smaller footprint) before reducing other caches.
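Adding up the worst-case estimates above shows why the fallback matters. The figures are the ones quoted in the paragraph; the variable names and the decimal 10 000 MB cap are simplifying assumptions.

```sh
# All sizes in MB, taken from the estimates above.
go_cache=800
brew_cellar=2000
ccache=2000
python_per_backend=1500
python_backends=5
cap=10000   # 10 GB, using decimal units for simplicity

total=$((go_cache + brew_cellar + ccache + python_per_backend * python_backends))
echo "worst case: ${total} MB of ${cap} MB"    # prints worst case: 12300 MB of 10000 MB
if [ "$total" -gt "$cap" ]; then
    echo "over cap by $((total - cap)) MB"     # prints over cap by 2300 MB
fi
```

The Python wheel caches account for 7500 MB of the total, which is why collapsing them into one shared key is the first lever to pull.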
.github/backend-matrix.yml has zero references to arc-runner-set or bigger-runner — all backends run on GHA free-tier hosted runners (ubuntu-latest for amd64, ubuntu-24.04-arm for arm64 native, macos-14 for Darwin). The migration off self-hosted relied on the per-arch native split (no QEMU emulation) plus setup-build-disk's /mnt relocation (~100 GB working space, enough for ROCm dev image + vLLM/torch installs).
One residual self-hosted reference remains in test-extra.yml (tests-vibevoice-cpp-grpc-transcription uses bigger-runner for the 30s JFK-decode timeout headroom). That's a separate concern.
When changing image_build.yml, backend_build.yml, any of the backend/Dockerfile.* files, Dockerfile.base-grpc-builder, .docker/install-base-deps.sh, .docker/<backend>-compile.sh, or scripts/changed-backends.js:
- Don't remove `DEPS_REFRESH=...` from the build-args without a replacement strategy (lockfiles, pinned requirements). Otherwise master will silently freeze on whichever versions were cached at the time.
- Keep `(tag-suffix, platform-tag)` unique per matrix entry: together they're the cache namespace. Two matrix entries sharing a key would clobber each other's cache.
- Keep `cache-to` gated on `github.event_name != 'pull_request'`. PRs must not write.
- Keep `ignore-error=true` on `cache-to`. Quay registry hiccups must not fail builds.
- Keep `provenance: false` on push-by-digest steps. Multi-registry digest divergence is the Bug We Already Fixed; reintroducing provenance attestation re-breaks the merge.
- `install-base-deps.sh` is the single source of truth for base contents. Both `Dockerfile.base-grpc-builder` (CI) and the variant Dockerfiles' `builder-fromsource` stage (local) bind-mount and execute it. If you add a package to one path, add it to the script; don't fork the logic into a Dockerfile `RUN`.
- When adding a `base-images.yml` matrix variant, run the workflow on your branch before merging consumer changes that depend on the new tag. Otherwise the consumer's CI fails with "image not found."