GHCR libs cache: warmer ↔ consumers ↔ promote ↔ retention ↔ clear

The prebuilt vendored LLVM (build/libs plus lib/llvm/src/compiler-rt/lib/builtins) is cached as a GHCR OCI artifact instead of in the GitHub Actions cache. The libs-cache scripts live together in .ci-scripts/libs-cache/: oci_libs_cache.py is the push/pull/exists primitive layer (registry v2 API); promote_libs_cache.py copies a branch-cache artifact into the main cache at the same tag (registry.copy), which the warmer uses to reuse a build instead of cold-building; resolve_libs_cache.py is the consumer/warmer orchestration the workflows call (it sequences those primitives around a make libs build command). There is no libs-ci/libs-push Make target anymore — Makefile/make.ps1 know only how to build (make libs); all CI cache logic lives in these scripts. The full package name is assembled in one place — cache_package() — as ponyc-libs-cache/<platform>-<arch>, and the tag is the hashFiles content hash:

The ponyc-libs-cache/ path namespace keeps these apart from distributable containers (matching nightly/, releases/) and fences the retention/clear globs.
<platform> is the builder image's name+date (derive_platform, for container jobs that pass the --image flag) or a literal label (the --platform flag, a bare label like x86-macos-15-intel — the script adds the namespace and arch, so workflows pass the label only).
<arch> comes from platform.machine() on the build machine, normalized to one canonical spelling per ISA by cache_arch.canonical (amd64→x86_64, aarch64→arm64; an unrecognized arch is a hard error, not a silent passthrough — add it to ARCH_ALIASES on purpose). Normalization is essential twice over: (a) the alpine and ubuntu26.04 builder images are multi-arch, built on both the x86-64 and arm64 warmer jobs — without the arch the two would push to the same package:tag and clobber each other (an arm64 consumer then pulls x86-64 libs); (b) the same ISA is spelled differently across OSes (Linux x86_64 vs FreeBSD/OpenBSD amd64), so the BSD warmer's host-side existence check (run on the Linux runner) and the in-VM push must canonicalize to the same name or the host could never see what the VM wrote. cache_arch.py is one of the shared support-library modules in .ci-scripts/libs-cache/ (with registry.py, ghpackages.py, common.py) that the thin entry scripts import — one definition, not copies to keep in sync. Changing an ISA's canonical spelling (or any package-name shape) renames the package: the old-named main-cache artifacts are orphaned, and prune (keep-N per package) never reclaims a package that stopped receiving versions, so run clear-libs-cache.yml once to delete the strays. The branch cache self-heals (age-prune); the main cache does not.
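The naming rule above can be sketched as follows. This is a minimal illustration, not the contents of cache_arch.py or the real cache_package(): the alias table holds only the two mappings named in the text plus their identities, the lowercasing step is an assumption, and the function signatures are hypothetical.

```python
import platform
from typing import Optional

NAMESPACE = "ponyc-libs-cache"

# Hypothetical table: only the aliases named in the text, plus identities.
ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def canonical(machine: str) -> str:
    """Map a platform.machine() spelling to one canonical name per ISA."""
    key = machine.lower()  # assumption: spellings are compared case-insensitively
    if key not in ARCH_ALIASES:
        # Hard error rather than a silent passthrough.
        raise ValueError(f"unrecognized arch {machine!r}; add it to ARCH_ALIASES on purpose")
    return ARCH_ALIASES[key]


def cache_package(platform_label: str, machine: Optional[str] = None) -> str:
    """Assemble <namespace>/<platform>-<arch> in one place."""
    arch = canonical(machine if machine is not None else platform.machine())
    return f"{NAMESPACE}/{platform_label}-{arch}"
```

With this shape, a host-side check that sees x86_64 and an in-VM push that sees amd64 both resolve freebsd-15.1 to the same package name, which is the property point (b) depends on.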

package:tag is keyed on the same inputs as the old actions/cache key (which lacked the arch component). Coupled invariants:

update-lib-cache.yml (the warmer) is the only writer of the main cache, on push-to-main. On a main-cache miss it first tries to promote a matching branch-cache artifact (promote_libs_cache.py, a registry copy — reusing a build a PR or an ad-hoc tier dispatch already made) and only cold-builds when there is none. Every other workflow except ponyc-tier3.yml pulls-or-builds the main cache (tier2/weekly additionally write the branch cache — see the branch-cache coupling in branch-libs-cache.md); tier3 now requires a cache hit and never cold-builds, like the stress tests (the exception below). So the warmer's jobs must cover every platform/label any consumer pulls — a consumer whose builder image or runner label the warmer doesn't build will pull a miss and cold-build forever. When you add a new libs consumer (or a new platform to an existing one), add the matching platform to the warmer, in the right stage (below).
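The warmer's hit, promote, cold-build order can be sketched as below. The four callables are hypothetical stand-ins for the exists, promote, and build+push steps, and treating a promote failure as non-fatal on every platform is an assumption (the text states it explicitly only for the BSD warmer jobs).

```python
from typing import Callable


def warm(
    main_exists: Callable[[], bool],
    branch_exists: Callable[[], bool],
    promote: Callable[[], None],
    build_and_push: Callable[[], None],
) -> str:
    """Return which path filled the main cache: 'hit', 'promoted', or 'built'."""
    if main_exists():
        return "hit"  # nothing to do, the main cache already has this package:tag
    if branch_exists():
        try:
            promote()  # registry copy of the branch artifact, no build
            return "promoted"
        except Exception as err:  # assumption: a failed promote falls through
            print(f"promote failed, falling back to a cold build: {err}")
    build_and_push()  # cold build only when nothing could be reused
    return "built"
```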

The warmer runs its build-outs in three sequential stages, so a cold push (an LLVM-input change, when every platform cold-builds LLVM at once) doesn't saturate the org-wide runner pool with rarely-used builds ahead of the most-used caches:
- Stage 1 — PR platforms (x86_64-linux-pr = ubuntu26.04 x86-64, arm64-macos, x86_64-windows): what pr.yml pulls. No gate; starts immediately.
- Stage 2 — release/nightly platforms (x86_64-linux-release, arm64-linux, x86_64-macos, arm64-windows): the rest of what release.yml/nightlies.yml pull.
- Stage 3 — everything else (x86_64-linux-other = fedora + the cross images, freebsd, openbsd, dragonflybsd): tier2/tier3/weekly-only platforms.
Each later stage needs: the prior stage's fast jobs only (Linux + macOS) and never Windows — a Windows build kicks off in its stage but must never gate the next one (it's the slowest). Stage jobs carry if: ${{ !cancelled() }}, so a prior-stage build failure delays but does not cancel the later stages. A new platform goes in the earliest stage that pulls it: stage 1 if a PR pulls it, else stage 2 if a release/nightly pulls it, else stage 3. The x86-64 Linux builds are split across three jobs (-pr/-release/-other) because needs: is job-level, not matrix-entry-level, and those builder images span all three stages — keep each image in exactly one of the three.
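The placement and gating rules above can be written down as two small functions. This is only a sketch for reasoning about where a new platform goes; the consumer labels ("pr", "release", "nightly") and the OS-family strings are hypothetical, and the real rules live in the workflow YAML.

```python
from typing import Dict, Iterable, List


def warmer_stage(pulled_by: Iterable[str]) -> int:
    """Earliest stage whose consumers pull the platform."""
    consumers = set(pulled_by)
    if "pr" in consumers:
        return 1
    if consumers & {"release", "nightly"}:
        return 2
    return 3  # tier2/tier3/weekly-only platforms


def gating_jobs(stage_jobs: Dict[str, str]) -> List[str]:
    """Jobs the next stage may list in needs: (Linux and macOS, never Windows)."""
    return sorted(job for job, os_family in stage_jobs.items() if os_family in {"linux", "macos"})
```

For stage 1, gating_jobs({"x86_64-linux-pr": "linux", "arm64-macos": "macos", "x86_64-windows": "windows"}) leaves the Windows job out, so it runs in its stage without holding up stage 2.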

Exception to the pull-or-build rule: the stress-test workflows (stress-test-*.yml) and ponyc-tier3.yml pass --require-cache-hit to resolve_libs_cache.py, so a miss never cold-builds. What a miss does depends on the trigger, selected per event by the step's LIBS_CACHE_FLAGS env var. A scheduled run passes --skip-on-miss: a miss writes the .libs-cache-miss marker and exits 0, and every build/run step gates on if: hashFiles('.libs-cache-miss') == '', so the job goes green with those steps skipped and the Send alert on failure Zulip step (gated on failure() && github.event_name == 'schedule') stays silent. This is deliberate: the continuous stress loop runs at staggered times that legitimately overlap an empty or refilling cache, so a scheduled miss is expected, not a coverage bug. A manual workflow_dispatch run passes --branch-cache instead, so on a main miss it also pulls the branch cache (the same main→branch resolution a PR consumer uses) and fails loudly only if neither cache has the libs for that target; a dev's branch check must not silently no-op. Accepted tradeoff: a permanent warmer-coverage gap that only the stress jobs would hit is no longer surfaced loudly by them (a scheduled miss just skips); it would still surface via any non-stress consumers that pull the same platforms. (Both flags are require-cache-hit modifiers; --branch-cache is accepted together with --require-cache-hit. The marker is written to GITHUB_WORKSPACE, falling back to cwd, which is the workspace mount for the arm64-linux docker-in-docker job, so the host-side hashFiles gate sees it.)
ponyc-tier3.yml uses the same per-trigger LIBS_CACHE_FLAGS and marker gate; its cross-compile legs run the resolve in the workspace like the stress jobs, but its BSD legs run the resolve inside the VM, where the marker would be invisible to the host gate — so each BSD leg makes the require-cache-hit decision as a host-side oci_libs_cache.py exists (plus branch_libs_cache.py exists on workflow_dispatch) before booting the VM, writing the same .libs-cache-miss marker, and the libs are still pulled inside the VM afterward. (tier3's Send alert on failure is left plain failure(), not scheduled-gated like the stress one — a scheduled skip never fails so it stays silent, but a manual-run failure still alerts.)
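The per-trigger miss handling described above can be modelled as a single decision function. This is a sketch under stated assumptions, not the code in resolve_libs_cache.py: the function name, its boolean parameters, the return labels, and the CacheMiss exception are all hypothetical; only the marker filename and the GITHUB_WORKSPACE-then-cwd fallback come from the text.

```python
import os
from pathlib import Path
from typing import Optional

MARKER = ".libs-cache-miss"


class CacheMiss(RuntimeError):
    """Raised when a cache hit is required and nothing can be pulled."""


def marker_dir() -> Path:
    # GITHUB_WORKSPACE when set, else the current directory.
    return Path(os.environ.get("GITHUB_WORKSPACE") or os.getcwd())


def resolve_required_hit(
    main_hit: bool,
    branch_hit: bool,
    *,
    skip_on_miss: bool = False,
    branch_cache: bool = False,
    workspace: Optional[Path] = None,
) -> str:
    """Decide the outcome of a require-cache-hit resolve; never cold-builds."""
    if main_hit:
        return "main"
    if branch_cache and branch_hit:
        return "branch"  # manual runs fall back to the branch cache
    if skip_on_miss:
        # Scheduled runs: leave a marker for the hashFiles gate and succeed.
        ((workspace or marker_dir()) / MARKER).write_text("miss\n")
        return "skipped"
    raise CacheMiss("a cache hit is required, but no libs were found for this target")
```

A scheduled run corresponds to skip_on_miss=True and a workflow_dispatch run to branch_cache=True; the BSD legs would make the same decision on the host before booting the VM.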
The container-platform name is derived from the builder image reference by IMAGE_RE (via derive_platform) in registry.py (contract: ghcr.io/ponylang/ponyc-ci-<name>:<YYYYMMDD>); a builder image whose name/tag breaks that format fails the step loudly — update IMAGE_RE if the naming convention changes.
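A regex honoring the stated contract might look like the sketch below. The exact pattern in registry.py is not shown in this document, so the character class for <name> and the choice to join name and date with a hyphen are assumptions; the example image reference is hypothetical.

```python
import re

# Contract from the text: ghcr.io/ponylang/ponyc-ci-<name>:<YYYYMMDD>
IMAGE_RE = re.compile(
    r"^ghcr\.io/ponylang/ponyc-ci-(?P<name>[a-z0-9][a-z0-9._-]*):(?P<date>\d{8})$"
)


def derive_platform(image: str) -> str:
    """Turn a builder image reference into a platform label, or fail loudly."""
    match = IMAGE_RE.match(image)
    if match is None:
        raise ValueError(f"builder image {image!r} does not match the ponyc-ci-<name>:<YYYYMMDD> contract")
    # Assumption: name and date are joined with a hyphen.
    return f"{match['name']}-{match['date']}"
```

Under these assumptions, derive_platform("ghcr.io/ponylang/ponyc-ci-alpine:20250101") yields alpine-20250101, while a :latest tag raises instead of producing a cache key that silently never matches.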
The BSD VMs have no builder image, so they use explicit per-version labels (freebsd-15.1, openbsd-7.9, dragonfly-6.4.2, …); the warmer boots those VMs to build+push and the ponyc-tier3.yml BSD jobs pull (each threads GITHUB_TOKEN into the VM over ssh, like every other consumer). The BSD VM provisioning is shared, not duplicated: both workflows call .ci-scripts/bsd/{freebsd,openbsd,dragonfly}-provision.bash (which free disk, install QEMU, download the image, boot the VM, install deps, and rsync the checkout in) from a single Provision VM step. DragonFly's provision script shells out to .ci-scripts/bsd/dfly_configure_vm.py (the QEMU sendkey console automation; reads the ssh pubkey from PUB_KEY). Change VM setup in one place, the script, not in two YAML copies. freebsd-provision.bash takes FREEBSD_VERSION; it installs doas + a doas.conf unconditionally (tier3's dtrace smoke test needs it; harmless to the warmer). The two callers differ only in how they invoke the script:
- The warmer runs a host-side oci_libs_cache.py exists check, then a host-side promote step (branch_libs_cache.py exists → promote_libs_cache.py, no VM boot) that reuses a branch artifact if one exists. It gates the Provision VM step (and the terminal build/push step) on if: steps.check.outputs.hit != 'true' && steps.promote.outputs.promoted != 'true', so a main hit or a successful promote skips the QEMU boot entirely. A promote failure is non-fatal: promoted=false falls through to the VM build. When the boot is skipped the job still succeeds (steps skipped, not failed), which keeps prune's needs: satisfied.
- tier3 also gates the QEMU boot on a host-side cache check, but inverted from the warmer: a host-side oci_libs_cache.py exists (plus branch_libs_cache.py exists on workflow_dispatch) gates Provision VM and every step after it on the .libs-cache-miss marker, so a miss skips the VM entirely instead of cold-building; the in-VM step then pulls with --require-cache-hit (no build command).
Because it no longer builds, tier3 no longer captures anything to the branch cache (see the require-cache-hit exception above). The shell scripts are file-based so Super-Linter's shellcheck covers them (the embedded run: blocks it could not); the dfly_configure_vm.py extraction has a dfly_configure_vm_test.py guarding the KEYMAP de-escaping.
The warmer's prune job (.ci-scripts/libs-cache/prune_libs_cache.py --keep 2) keeps the 2 newest versions per package; the platform lives in the package name (not the tag) so keep-N counts per platform — moving it into the tag would let keep-N delete live artifacts for other platforms.
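Why keep-N has to count per package can be seen in a small model of the selection step. This is a sketch, not prune_libs_cache.py itself: the tuple layout, the version ids, and the package names are hypothetical, and it assumes creation times are ISO 8601 strings in one format so that string order matches time order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Version = Tuple[str, int, str]  # (package, version_id, created_at)


def versions_to_delete(versions: Iterable[Version], keep: int = 2) -> Dict[str, List[int]]:
    """Per package, return the ids of everything older than the `keep` newest."""
    by_package: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for package, version_id, created_at in versions:
        by_package[package].append((created_at, version_id))
    doomed: Dict[str, List[int]] = {}
    for package, items in by_package.items():
        items.sort(reverse=True)  # newest first
        doomed[package] = [version_id for _, version_id in items[keep:]]
    return doomed
```

If three platforms shared one package and were told apart only by tag, their three live versions would compete for the two kept slots and the oldest would be deleted even though a consumer still pulls it; with the platform in the package name each platform keeps its own two.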
clear-libs-cache.yml + .ci-scripts/libs-cache/clear_libs_cache.py are the escape hatch: they whole-package-delete every ponyc-libs-cache/* package (REST API; / is %2F-encoded) and re-dispatch the warmer. Deletion is the only way to invalidate (the content-hash tag means there is no "touch to expire").
Deletion needs BOTH tokens — each can only do the half its scope allows. The org-level PONYLANG_MAIN_READ_PACKAGE_TOKEN (classic PAT, read:packages) is the only one that can enumerate the org's packages (the repo-scoped GITHUB_TOKEN gets 400 on the org package-list endpoint). But only GITHUB_TOKEN (packages: write) can delete these packages — they're repo-scoped, so the org PAT gets 404 on the delete regardless of its scopes. So clear_libs_cache.py / prune_libs_cache.py list with PONYLANG_MAIN_READ_PACKAGE_TOKEN and delete with GITHUB_TOKEN; both workflows pass both secrets. This split is also why retention is a custom script, not snok/container-retention-policy (which takes one token and can't enumerate-and-delete here). The clear workflow also needs actions: write for the re-warm dispatch.
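The list-with-one-token, delete-with-the-other split, together with the %2F encoding of the package name, can be sketched as plain request descriptions. Nothing here is taken from clear_libs_cache.py: the helper names and the dict shape are hypothetical, the org name is assumed from the ghcr.io/ponylang image path, and the URLs follow the GitHub REST API's organization package endpoints as an assumption about which endpoints the scripts call. No request is sent.

```python
from urllib.parse import quote

API = "https://api.github.com"
ORG = "ponylang"  # assumption, taken from the ghcr.io/ponylang image path


def list_request(read_token: str) -> dict:
    """Enumerating the org's container packages uses the org-level read token."""
    return {
        "method": "GET",
        "url": f"{API}/orgs/{ORG}/packages?package_type=container",
        "token": read_token,
    }


def delete_request(package: str, write_token: str) -> dict:
    """Deleting a repo-scoped package uses the workflow's GITHUB_TOKEN."""
    # safe="" makes quote() encode "/" as %2F so the name stays one path segment.
    encoded = quote(package, safe="")
    return {
        "method": "DELETE",
        "url": f"{API}/orgs/{ORG}/packages/container/{encoded}",
        "token": write_token,
    }
```

Keeping the token choice inside each request builder makes it hard to list with the wrong credential or delete with the wrong one, which are the two failure modes (400 and 404) described above.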

The llvm_tools=false flag is not part of the cache identity; see the LLVM tools flag coupling in llvm-tools-flag.md for why that is safe today and what would break it.