docs/github-actions-caching.md
This document explains the caching strategy used in the Terraform AWS Provider's GitHub Actions workflows and why it's designed this way.
The Terraform AWS Provider is a massive codebase with unique caching challenges:
internal/** in Cache Keys Doesn't Work

A common pattern is to include source code in cache keys:

    key: ${{ runner.os }}-GOCACHE-${{ hashFiles('go.sum') }}-${{ hashFiles('internal/**') }}
This creates catastrophic cache thrashing:

    8 workflows × 500 PRs × 8 GB cache = 32,000 GB demand
    GitHub limit: 10 GB (per repo)
    Result: 0.03% cache hit rate (constant misses)
Nearly every PR changes files under internal/**, so each one produces a unique cache key. With hundreds of open PRs, caches are evicted before they can be reused.
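The arithmetic behind these figures can be checked with a short shell sketch. The function names are illustrative, and the inputs are the estimates quoted above rather than measured values.

```sh
# Total cache demand in GB: workflows x open PRs x cache size per entry.
cache_demand_gb() {
  echo $(( $1 * $2 * $3 ))
}

# Share of that demand that fits within the repository limit, as a percentage.
cache_fit_percent() {
  awk -v limit="$1" -v demand="$2" 'BEGIN { printf "%.2f\n", limit / demand * 100 }'
}

demand=$(cache_demand_gb 8 500 8)
echo "demand: ${demand} GB"
echo "fits in the 10 GB limit: $(cache_fit_percent 10 "$demand")%"
```

With the numbers above this reports a demand of 32000 GB, of which about 0.03% fits in the limit.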
go.sum in Cache Keys Is Problematic

Including go.sum in the cache key seems logical but causes issues:

    key: ${{ runner.os }}-go-build-${{ hashFiles('go.sum') }}
Problems:
- A change to go.sum → new cache key → full recompile

Go's build cache is self-invalidating: it automatically detects when dependencies change and recompiles only the affected packages. Including go.sum in the key defeats this optimization.

    key: ${{ runner.os }}-go-build-${{ env.CACHE_DATE }}
    restore-keys: |
      ${{ runner.os }}-go-build-

Where CACHE_DATE=$(date +%Y-%m-%d)
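How the daily key and its prefix fallback interact can be sketched in shell. This is a simplified model of the lookup order used by actions/cache (exact key first, then the newest entry matching a restore-keys prefix), not its implementation; `pick_cache` is a hypothetical helper and the sample keys are made up.

```sh
# Pick the cache to restore, mimicking the actions/cache lookup order:
#   1. an exact match on the primary key
#   2. otherwise the newest entry whose key starts with the restore prefix
# Existing keys are read from stdin, oldest first (an assumption of this sketch).
pick_cache() {
  primary=$1
  prefix=$2
  match=""
  while IFS= read -r key; do
    if [ "$key" = "$primary" ]; then
      match=$key
      break
    fi
    case $key in
      "$prefix"*) match=$key ;;   # later lines are newer, so keep overwriting
    esac
  done
  printf '%s\n' "$match"
}

# Today's key does not exist yet, so the newest older cache is chosen.
printf '%s\n' "Linux-go-build-2025-12-13" "Linux-go-build-2025-12-14" |
  pick_cache "Linux-go-build-2025-12-15" "Linux-go-build-"
```

The first job of a new day therefore starts from yesterday's cache instead of an empty one.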
Why this works:
- restore-keys is a prefix match (i.e., ${{ runner.os }}-go-build-*), so GitHub returns the most recent match

    ┌─────────────────┐
    │  go_build job   │ ← Only job that SAVES cache
    │ (provider.yml)  │
    └────────┬────────┘
             │ saves
             ▼
        ┌─────────┐
        │  Cache  │  8GB, daily rotation
        │ Storage │  key: go-build-2025-12-15
        └────┬────┘
             │ restores (read-only)
             ▼
    ┌────────────────────────────────┐
    │ All other jobs restore cache:  │
    │  - go_generate                 │
    │  - go_test                     │
    │  - import-lint                 │
    │  - validate_sweepers           │
    │  - copyright                   │
    │  - dependencies                │
    │  - modern_go                   │
    │  - providerlint                │
    │  - pull_request_target         │
    │  - skaff                       │
    │  - smarterr                    │
    └────────────────────────────────┘

Only provider.yml's go_build job saves cache:

    - name: Save Go Build Cache
      uses: actions/cache/[email protected]
      if: always() && steps.cache-go-build.outputs.cache-hit != 'true'
      with:
        path: ${{ env.GOCACHE }}
        key: ${{ runner.os }}-go-build-${{ env.CACHE_DATE }}

All other jobs are restore-only:

    - name: Restore Go Build Cache
      uses: actions/cache/[email protected]
      with:
        path: ${{ env.GOCACHE }}
        key: ${{ runner.os }}-go-build-${{ env.CACHE_DATE }}
        restore-keys: |
          ${{ runner.os }}-go-build-

Benefits:
All jobs that use caching must set CACHE_DATE:

    - name: go env
      run: |
        echo "GOCACHE=$(go env GOCACHE)" >> "$GITHUB_ENV"
        echo "CACHE_DATE=$(date +%Y-%m-%d)" >> "$GITHUB_ENV"

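`date +%Y-%m-%d` uses the runner's local time zone. GitHub-hosted runners normally run in UTC, but a self-hosted runner may not, in which case jobs could disagree on the key around midnight. A minimal sketch that pins the date to UTC follows; `write_cache_date` is a hypothetical helper, and the `-u` flag is a suggestion rather than what the workflows currently do.

```sh
# Append CACHE_DATE to the file named by GITHUB_ENV, using UTC so that every
# runner computes the same value regardless of its local time zone.
write_cache_date() {
  echo "CACHE_DATE=$(date -u +%Y-%m-%d)" >> "$GITHUB_ENV"
}

# Outside of GitHub Actions, point GITHUB_ENV at a scratch file to try it out.
GITHUB_ENV=${GITHUB_ENV:-$(mktemp)}
write_cache_date
tail -n 1 "$GITHUB_ENV"
```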
The go_test job includes cleanup to prevent test artifacts from bloating the cache:

    - name: Cleanup Test Artifacts
      if: always()
      run: |
        if [ -d "$GOCACHE" ]; then
          # Remove test binaries - huge and rarely reused
          find "$GOCACHE" -name "*.test" -type f -delete 2>/dev/null || true
          # Remove entries older than 2 days
          find "$GOCACHE" -type f -mtime +2 -delete 2>/dev/null || true
          # Remove directories left empty by the deletions above
          find "$GOCACHE" -type d -empty -delete 2>/dev/null || true
        fi

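To see what the cleanup removes without touching a real cache, the same three `find` commands can be run against a scratch directory. This is a sketch: `cleanup_cache` is a hypothetical wrapper, and the file names are invented for the demonstration.

```sh
# Same cleanup as the workflow step, wrapped in a function that takes the
# cache directory as an argument so it can be tried on a scratch directory.
cleanup_cache() {
  dir=$1
  [ -d "$dir" ] || return 0
  find "$dir" -name "*.test" -type f -delete 2>/dev/null || true
  find "$dir" -type f -mtime +2 -delete 2>/dev/null || true
  find "$dir" -type d -empty -delete 2>/dev/null || true
}

scratch=$(mktemp -d)
mkdir -p "$scratch/aa" "$scratch/bb"
touch "$scratch/aa/pkg.test"               # test binary: removed
touch -t 200001010000 "$scratch/bb/stale"  # old modification time: removed
touch "$scratch/fresh"                     # recent entry: kept
cleanup_cache "$scratch"
find "$scratch" -mindepth 1                # lists only the "fresh" file
rm -rf "$scratch"
```

The two subdirectories disappear as well, because they are empty once their files are deleted.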
The go/pkg/mod cache uses a different strategy since dependencies are stable:

    - uses: actions/[email protected]
      with:
        path: ~/go/pkg/mod
        key: ${{ runner.os }}-go-pkg-mod-${{ hashFiles('go.sum') }}

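The effect of keying on go.sum can be illustrated locally: the key stays the same until the file's contents change. The sketch below uses a plain SHA-256 of the file, which is not claimed to match the digest that `hashFiles` produces; `module_cache_key` and the sample go.sum lines are hypothetical.

```sh
# Build a cache key from the OS name and a SHA-256 digest of a lockfile.
# Falls back to `shasum -a 256` where `sha256sum` is unavailable (e.g. macOS).
module_cache_key() {
  os=$1
  file=$2
  if command -v sha256sum >/dev/null 2>&1; then
    digest=$(sha256sum "$file" | cut -d ' ' -f 1)
  else
    digest=$(shasum -a 256 "$file" | cut -d ' ' -f 1)
  fi
  printf '%s-go-pkg-mod-%s\n' "$os" "$digest"
}

work=$(mktemp -d)
echo "example.com/mod v1.0.0 h1:abc=" > "$work/go.sum"
before=$(module_cache_key Linux "$work/go.sum")
echo "example.com/mod v1.1.0 h1:def=" >> "$work/go.sum"   # simulate a dependency bump
after=$(module_cache_key Linux "$work/go.sum")
[ "$before" != "$after" ] && echo "key changed after go.sum update"
rm -rf "$work"
```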
This cache:
- Includes go.sum in the key (dependencies change infrequently)

| Metric | Before | After |
|---|---|---|
| Cache demand | 32,000 GB | 10 GB |
| Cache hit rate | 0.03% | 80-90% |
| Build time | 10-15 min | 2-3 min |
| Cache stability | Constant thrashing | Stable |
First run of the day:
Subsequent PRs same day:
Next day:
The same strategy is used in the GNUmakefile for local testing:

    # On macOS (with CrowdStrike), uses temp cache to avoid scanning
    make test-fast

    # Automatically detects:
    # - macOS: Uses /tmp cache to avoid security software overhead
    # - Linux: Uses default cache location

See Makefile Cheat Sheet for details.
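The OS detection described above can be sketched in shell. This is an illustration of the idea, not the GNUmakefile's actual logic: `pick_gocache` is a hypothetical helper and the /tmp path is an invented example.

```sh
# Choose a GOCACHE directory from the kernel name reported by `uname -s`.
# The /tmp path below is a hypothetical example, not the GNUmakefile's value.
pick_gocache() {
  case $1 in
    Darwin) echo "/tmp/terraform-aws-go-build-cache" ;;  # macOS: temp cache
    *)      echo "" ;;                                   # elsewhere: keep Go's default
  esac
}

cache_dir=$(pick_gocache "$(uname -s)")
if [ -n "$cache_dir" ]; then
  export GOCACHE="$cache_dir"
fi
echo "GOCACHE override: ${cache_dir:-<none, using Go default>}"
```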
Monitor cache effectiveness in GitHub Actions:
If cache hit rates drop below 70%, investigate:
go_build)

Symptom: A PR shows a cache miss even though another PR ran earlier the same day.
Cause: The job ran on a different runner OS, or the cache was evicted due to size limits.
Solution: This is expected occasionally. The restore-keys prefix falls back to a recent cache.

Symptom: The cache is approaching the 10 GB limit.
Cause: Test artifacts or stale entries are accumulating.
Solution: The cleanup step in go_test should handle this. If it does not, adjust the cleanup thresholds.

Symptom: The cache is hit, but the build still takes 10+ minutes.
Cause: A major dependency update invalidated most of Go's internal cache.
Solution: This is expected after large AWS SDK updates. Subsequent builds will be fast.