GitHub Actions Caching Strategy#
This document explains the caching strategy used in the Terraform AWS Provider's GitHub Actions workflows and why it's designed this way.
The Problem#
The Terraform AWS Provider is a massive codebase with unique caching challenges:
- 261 services with complex interdependencies
- 30-50 AWS SDK package updates per week
- 500+ active pull requests at any given time
- 8+ workflows that compile Go code
- 10GB GitHub Actions cache limit for the entire repository
Why internal/** in Cache Keys Doesn't Work#
A common pattern is to include source code in cache keys:
key: ${{ runner.os }}-GOCACHE-${{ hashFiles('go.sum') }}-${{ hashFiles('internal/**') }}
This creates catastrophic cache thrashing:
8 workflows × 500 PRs × 8GB cache = 32,000 GB demand
GitHub limit: 10 GB (per repo)
Result: 0.03% cache hit rate (constant misses)
Every PR changes internal/**, creating a unique cache key. With hundreds of PRs, caches are constantly evicted before they can be reused.
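The arithmetic above can be reproduced with a short shell sketch. The function name `cache_demand` is hypothetical, and the inputs are simply the figures quoted in this section. Note that the 0.03% figure is the share of total demand that fits under the limit, which serves here as a rough proxy for the hit rate.

```shell
#!/usr/bin/env bash
# Back-of-envelope cache demand, using the figures quoted above.
# cache_demand WORKFLOWS PRS CACHE_GB LIMIT_GB
# Prints "<demand_gb> <percent_of_demand_that_fits>".
cache_demand() {
  local workflows=$1 prs=$2 cache_gb=$3 limit_gb=$4
  local demand=$(( workflows * prs * cache_gb ))
  # Shell arithmetic is integer-only, so awk formats the fractional percentage.
  awk -v d="$demand" -v l="$limit_gb" \
    'BEGIN { printf "%d %.2f\n", d, (l / d) * 100 }'
}

cache_demand 8 500 8 10   # prints: 32000 0.03
```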
Why go.sum in Cache Keys Is Problematic#
Including go.sum in the cache key seems logical but causes issues:
key: ${{ runner.os }}-go-build-${{ hashFiles('go.sum') }}
Problems:
- AWS SDK updates 30-50 packages/week
- Each update changes go.sum → new cache key → full recompile
- Wastes the 90% of packages that didn't change
Go's build cache is self-invalidating: each entry is keyed by a hash of the package's inputs, so Go detects when dependencies change and recompiles only the affected packages. Including go.sum in the key defeats this optimization.
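To make the key churn concrete, here is a minimal sketch that approximates a `hashFiles('go.sum')`-style key by hashing the file contents with `sha256sum`. The helper `gosum_key`, the `Linux` prefix, and the two-module go.sum are illustrative assumptions; the real `hashFiles` function combines per-file hashes, so the actual value differs, but the effect is the same: bumping one module yields a completely different key.

```shell
#!/usr/bin/env bash
# Approximation of a go.sum-derived cache key (hypothetical helper).
gosum_key() {
  local hash
  hash=$(sha256sum "$1" | cut -d' ' -f1)
  printf 'Linux-go-build-%s\n' "$hash"
}

workdir=$(mktemp -d)

# Two made-up modules stand in for the real dependency list.
printf 'example.com/a v1.0.0 h1:aaa=\nexample.com/b v1.0.0 h1:bbb=\n' > "$workdir/go.sum"
before=$(gosum_key "$workdir/go.sum")

# Bump only module a; module b is untouched.
printf 'example.com/a v1.0.1 h1:ccc=\nexample.com/b v1.0.0 h1:bbb=\n' > "$workdir/go.sum"
after=$(gosum_key "$workdir/go.sum")

if [ "$before" != "$after" ]; then
  echo "key changed: the whole cache misses, including module b's objects"
fi

rm -rf "$workdir"
```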
The Solution: Daily Rotation with Shared Cache#
Cache Key Strategy#
key: ${{ runner.os }}-go-build-${{ env.CACHE_DATE }}
restore-keys: |
  ${{ runner.os }}-go-build-
Where CACHE_DATE=$(date +%Y-%m-%d)
Why this works:
- One cache per day (not per PR or per commit)
- All PRs share the same cache on a given day
- Daily rotation prevents unbounded growth
- Restore-keys provide a fallback to yesterday's cache (restore-keys is a prefix match, i.e., ${{ runner.os }}-go-build-*, and GitHub returns the most recent match)
- Go's internal cache handles incremental compilation
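The fallback behaviour described above can be modelled with a small simulation. This is a sketch of the matching rule only (exact key first, otherwise the newest key with the given prefix); the function `pick_cache`, the input format, and the timestamps are assumptions for illustration, and GitHub's real lookup also takes branch scope into account.

```shell
#!/usr/bin/env bash
# pick_cache KEY PREFIX
# Reads "key created_timestamp" lines on stdin. Prints the exact match if
# one exists, otherwise the most recently created key starting with PREFIX.
# Returns non-zero when nothing matches (a cold cache).
pick_cache() {
  local key=$1 prefix=$2
  awk -v key="$key" -v prefix="$prefix" '
    $1 == key { exact = $1 }
    index($1, prefix) == 1 && ($2 + 0) > (newest_ts + 0) {
      newest = $1; newest_ts = $2
    }
    END {
      if (exact != "")       print exact
      else if (newest != "") print newest
      else                   exit 1
    }'
}

# Made-up cache listing: today (2025-12-15) has no Linux entry yet.
caches='Linux-go-build-2025-12-13 100
Linux-go-build-2025-12-14 200
macOS-go-build-2025-12-15 300'

printf '%s\n' "$caches" | pick_cache Linux-go-build-2025-12-15 Linux-go-build-
# prints: Linux-go-build-2025-12-14
```

The macOS entry is newer but does not share the prefix, so it is never considered; this is why the runner OS is part of the key.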
Cache Architecture#
┌─────────────────┐
│  go_build job   │ ← Only job that SAVES cache
│ (provider.yml)  │
└────────┬────────┘
         │ saves
         ▼
    ┌─────────┐
    │  Cache  │ 8GB, daily rotation
    │ Storage │ key: go-build-2025-12-15
    └────┬────┘
         │ restores (read-only)
         ▼
┌────────────────────────────────┐
│ All other jobs restore cache:  │
│ - go_generate                  │
│ - go_test                      │
│ - import-lint                  │
│ - validate_sweepers            │
│ - copyright                    │
│ - dependencies                 │
│ - modern_go                    │
│ - providerlint                 │
│ - pull_request_target          │
│ - skaff                        │
│ - smarterr                     │
└────────────────────────────────┘
Single Producer Pattern#
Only provider.yml's go_build job saves cache:
- name: Save Go Build Cache
  uses: actions/cache/save@v5.0.1
  if: always() && steps.cache-go-build.outputs.cache-hit != 'true'
  with:
    path: ${{ env.GOCACHE }}
    key: ${{ runner.os }}-go-build-${{ env.CACHE_DATE }}
All other jobs are restore-only:
- name: Restore Go Build Cache
  uses: actions/cache/restore@v5.0.1
  with:
    path: ${{ env.GOCACHE }}
    key: ${{ runner.os }}-go-build-${{ env.CACHE_DATE }}
    restore-keys: |
      ${{ runner.os }}-go-build-
Benefits:
- Prevents race conditions
- Ensures consistency
- Reduces cache save time
- Avoids duplicate cache entries
Implementation Details#
Setting Up Cache Date#
All jobs that use caching must set CACHE_DATE:
- name: go env
  run: |
    echo "GOCACHE=$(go env GOCACHE)" >> "$GITHUB_ENV"
    echo "CACHE_DATE=$(date +%Y-%m-%d)" >> "$GITHUB_ENV"
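As a sketch of how the daily key is composed, the helper below builds the same string outside of a workflow. The function name `daily_cache_key` is hypothetical. It also passes `-u` to `date`, which is an assumption on top of the step above: plain `date` follows the runner's local timezone, so pinning to UTC keeps every runner on the same rollover boundary.

```shell
#!/usr/bin/env bash
# daily_cache_key [OS]
# Builds "<OS>-go-build-<YYYY-MM-DD>" using the current UTC date.
daily_cache_key() {
  local os=${1:-Linux}
  printf '%s-go-build-%s\n' "$os" "$(date -u +%Y-%m-%d)"
}

daily_cache_key Linux
```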
Cache Cleanup in Tests#
The go_test job includes cleanup to prevent test artifacts from bloating the cache:
- name: Cleanup Test Artifacts
  if: always()
  run: |
    if [ -d "$GOCACHE" ]; then
      # Remove test binaries - huge and rarely reused
      find "$GOCACHE" -name "*.test" -type f -delete 2>/dev/null || true
      # Remove entries older than 2 days
      find "$GOCACHE" -type f -mtime +2 -delete 2>/dev/null || true
      # Remove directories left empty by the deletions above
      find "$GOCACHE" -type d -empty -delete 2>/dev/null || true
    fi
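Before adjusting the cleanup thresholds, it can help to preview what the age filter would remove. The following is a dry-run sketch; `stale_cache_entries` is a hypothetical helper, not part of the workflow. One detail worth knowing: `find -mtime +2` ignores fractional days, so it matches files last modified at least three full days ago.

```shell
#!/usr/bin/env bash
# stale_cache_entries DIR [DAYS]
# Lists (without deleting) the files that `find -mtime +DAYS -delete`
# would remove. DAYS defaults to 2, matching the cleanup step above.
stale_cache_entries() {
  local dir=$1 days=${2:-2}
  [ -d "$dir" ] || return 0
  find "$dir" -type f -mtime +"$days" -print
}

# Example: count the entries that would be removed from the build cache.
if [ -n "${GOCACHE:-}" ]; then
  stale_cache_entries "$GOCACHE" 2 | wc -l
fi
```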
Dependency Cache#
The go/pkg/mod cache uses a different strategy since dependencies are stable:
- uses: actions/cache@v5.0.1
  with:
    path: ~/go/pkg/mod
    key: ${{ runner.os }}-go-pkg-mod-${{ hashFiles('go.sum') }}
This cache:
- Does use go.sum in the key (dependencies change infrequently)
- Is shared across all workflows
- Typically ~2GB
- Rarely invalidated
Expected Results#
Cache Performance#
| Metric | Before | After |
|---|---|---|
| Cache demand | 32,000 GB | 10 GB |
| Cache hit rate | 0.03% | 80-90% |
| Build time | 10-15 min | 2-3 min |
| Cache stability | Constant thrashing | Stable |
Daily Workflow#
First run of the day:
- Cold cache (or restores yesterday's)
- Full compilation: ~10 minutes
- Saves new cache for the day
Subsequent PRs same day:
- Warm cache hit
- Incremental compilation: ~2-3 minutes
- No cache save (already exists)
Next day:
- New cache key (new date)
- Fresh start prevents unbounded growth
- Old caches are evicted by GitHub after 7 days without being accessed
Local Development#
The same strategy is used in the GNUmakefile for local testing:
# On macOS (with CrowdStrike), uses temp cache to avoid scanning
make test-fast
# Automatically detects:
# - macOS: Uses /tmp cache to avoid security software overhead
# - Linux: Uses default cache location
See Makefile Cheat Sheet for details.
Monitoring#
Monitor cache effectiveness in GitHub Actions:
- Cache hit rate: Check workflow logs for "Cache restored from key"
- Build times: Compare first run of day vs. subsequent runs
- Cache size: Should stay around 8-10GB total
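A rough hit rate can be estimated from downloaded job logs. This sketch assumes the restore step logs "Cache restored from key" on a hit (including restore-keys fallbacks) and "Cache not found for input keys" on a miss; the function name `cache_hit_rate` and the idea of passing log files as arguments are illustrative.

```shell
#!/usr/bin/env bash
# cache_hit_rate LOGFILE...
# Prints the integer percentage of cache restore steps that found a cache.
cache_hit_rate() {
  local hits misses total
  hits=$(cat -- "$@" | grep -c 'Cache restored from key')
  misses=$(cat -- "$@" | grep -c 'Cache not found for input keys')
  total=$(( hits + misses ))
  if [ "$total" -eq 0 ]; then
    echo "no cache restore steps found" >&2
    return 1
  fi
  echo $(( hits * 100 / total ))
}

# Usage (paths are placeholders): cache_hit_rate logs/*.txt
```

Because fallback restores count as hits here, compare the result with build times to tell warm same-day hits from older fallback caches.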
If cache hit rates drop below 70%, investigate:
- Are multiple workflows saving cache? (should only be go_build)
- Is cache size approaching 10GB limit?
- Are there new workflows not following the pattern?
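The first and last questions can be checked mechanically by listing every workflow step that is able to save a cache. This is a sketch: `list_cache_savers` is a hypothetical helper, and the default directory `.github/workflows` is the conventional location rather than something stated above. It matches both `actions/cache/save` and the combined `actions/cache` action (which saves in its post step), so the go/pkg/mod cache step will appear in the output and should be ignored when auditing the build cache.

```shell
#!/usr/bin/env bash
# list_cache_savers [DIR]
# Prints file:line:text for each step that can save a cache.
# Restore-only steps (actions/cache/restore) are not matched.
list_cache_savers() {
  local dir=${1:-.github/workflows}
  grep -rnE --include='*.yml' --include='*.yaml' \
    'uses:[[:space:]]*actions/cache(/save)?@' "$dir"
}

# Usage: run from the repository root and inspect the key of each match.
#   list_cache_savers
```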
Troubleshooting#
Cache Miss on Same Day#
Symptom: A PR shows a cache miss even though another PR ran earlier the same day.
Cause: A different runner OS, or the cache was evicted due to size limits.
Solution: This is expected occasionally. The restore-keys fall back to a recent cache.
Cache Size Growing#
Symptom: Cache approaching 10GB limit.
Cause: Test artifacts or stale entries accumulating.
Solution: The cleanup step in go_test should handle this. If not, adjust cleanup thresholds.
Slow Builds Despite Cache Hit#
Symptom: Cache hit but build still takes 10+ minutes.
Cause: Major dependency update invalidated most of Go's internal cache.
Solution: This is expected after large AWS SDK updates. Subsequent builds will be fast.