Purging CDN cache by tag on deploy

This scenario belongs to Cache Invalidation Patterns within Advanced Caching Strategies & CDN Architecture: your deploy needs to evict exactly the objects the release changed — no more, no less — from a CI job.

The symptom that brings teams here is one of two opposites. Either a deploy ships but users keep seeing yesterday's content (under-purge), or every deploy hammers the origin with a synchronized miss storm and TTFB blows past 200ms for thirty seconds (over-purge). Both are tag-design failures, and both are fixable without resorting to purge everything on every release.

Purge blast radius spectrum Under-purge leaves stale content, over-purge causes a miss storm; a tag scoped to the change sits between them. Match the tag to the change Under-purge missing tags, 429s stale content ships Scoped tag purge per-resource keys evict only changed Over-purge coarse site tag origin miss storm Verify: changed URL = MISS, untouched URL = HIT, hit ratio recovers >85%. Compute changed tags from the build manifest; batch and rate-limit calls. Reserve the build-rev key for emergencies, not every deploy.

Rapid Diagnosis: Is the Deploy Over- or Under-Purging?

Run this checklist the moment a deploy looks wrong:

  • Check the purge API response in CI logs. A 200/201 with the expected key list means the call landed. A 429 means you were rate-limited and the purge silently dropped — classic under-purge.
  • Read the edge status header on a changed URL right after deploy. curl -sI <url> | grep -i x-cache (or cf-cache-status). It should show MISS/EXPIRED once, then HIT. A persistent HIT on changed content = under-purge.
  • Watch origin RPS in the 60s after deploy. A vertical spike to the entire steady-state cache miss volume = over-purge (you evicted far more than the release touched).
  • Diff the tags you purged against the tags the build changed. If you purged layout-v3 for a one-product price change, you over-purged the whole site.
  • Target thresholds: post-deploy origin RPS within origin headroom, edge hit ratio recovering to >85% within minutes, propagation <2s.

Root Cause Analysis

1. Over-broad tags evict the whole catalog

A single shared tag like site or layout attached to every response means any deploy that purges it drops the entire edge cache. The blast radius is the union of every URL carrying the tag. This is over-purge by design: the tag granularity is coarser than the change.

2. Missing or stale tags leave changed content cached

If the origin never stamped a Surrogate-Key/Cache-Tag on a response, no purge can target it. Purging product-42 evicts nothing because no object carries that key. The deploy reports success and the content stays stale — under-purge from absent tagging.

3. Variant fragmentation — one logical page, many cache keys

A page cached under multiple keys (Vary: Accept-Encoding, A/B cookie, device class, query-string permutations) has several edge objects. A purge-by-URL hits one; the others survive. Tag purge solves this only if every variant carries the tag.

4. Rate-limited or unbatched purge calls drop silently

Firing one purge request per changed URL across hundreds of assets hits the CDN's per-second purge limit. The CDN returns 429 and the excess purges never execute. Under high deploy frequency this looks intermittent and is brutal to debug.

Step-by-Step Resolution

Fixes are ordered by impact: fix tag granularity first (root cause of over-purge), then guarantee tagging (root cause of under-purge), then batch the API calls.

Step 1 — Tag responses at the right granularity

Stamp keys that match the smallest unit you'll ever invalidate, plus a few coarse keys for intentional broad purges.

nginx
# Per-resource keys + one deliberate coarse key for full-section purges
location ~ ^/products/(\d+)$ {
  add_header Surrogate-Key "product-$1 section-catalog build-$BUILD_REV";
  proxy_pass http://backend;
}
# trade-off: build-$BUILD_REV lets you purge an entire release in one call,
# but if you purge it on EVERY deploy you've reinvented purge-everything.
# Reserve the build key for emergencies; use product-N keys for normal deploys.

Expected outcome: a routine single-product deploy now evicts one object instead of the whole catalog, holding origin RPS flat (over-purge eliminated).

Step 2 — Compute the affected tags from the build, then purge only those

Derive the changed tags from the build manifest or git diff so the purge is scoped to what actually changed.

bash
#!/usr/bin/env bash
# ci/purge-fastly.sh — purge only the surrogate keys this deploy changed
set -euo pipefail
# CHANGED_KEYS is computed from `git diff --name-only` mapped to content ids
CHANGED_KEYS="$1"   # e.g. "product-42 product-77 section-catalog"

curl -fsS -X POST "https://api.fastly.com/service/$FASTLY_SERVICE_ID/purge" \
  -H "Fastly-Key: $FASTLY_PURGE_TOKEN" \
  -H "Fastly-Soft-Purge: 1" \
  -H "surrogate-key: $CHANGED_KEYS"
# trade-off: soft purge serves one stale response per object while it
# revalidates, smoothing origin load. Do NOT soft-purge a security or
# pricing fix where a single stale hit is unacceptable — drop the
# Fastly-Soft-Purge header to force a hard purge there.

The Cloudflare Enterprise equivalent purges by Cache-Tag:

bash
# ci/purge-cloudflare.sh — tag purge, max 30 tags per request
curl -fsS -X POST "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/purge_cache" \
  -H "Authorization: Bearer $CF_PURGE_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"tags":["product-42","product-77","section-catalog"]}'
# trade-off: Cloudflare caps purge_cache at 30 tags per call, so large
# deploys must chunk the tag list. Don't fall back to {"purge_everything":true}
# when you exceed 30 — batch the tags (Step 3) to keep the blast radius small.

Expected outcome: the deploy evicts exactly the changed content; previously stale pages now show MISS then HIT, closing the under-purge gap.

Step 3 — Batch and rate-limit the purge calls

Group keys into a single request (Fastly accepts many keys in one surrogate-key header; Cloudflare allows 30 tags per call) and chunk anything larger.

bash
# Chunk a long tag list into 30-tag batches for Cloudflare
printf '%s\n' $CHANGED_KEYS | xargs -n 30 | while read -r batch; do
  tags=$(printf '"%s",' $batch | sed 's/,$//')
  curl -fsS -X POST "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/purge_cache" \
    -H "Authorization: Bearer $CF_PURGE_TOKEN" -H "Content-Type: application/json" \
    --data "{\"tags\":[$tags]}"
  sleep 1
done
# trade-off: the 1s pause keeps you under the purge rate limit but adds
# latency to large deploys. Drop it for small tag sets; keep it when a
# release can touch hundreds of tags, or the 429s return.

Expected outcome: no more silent 429 drops; every intended key is evicted even on large deploys (under-purge from rate limiting eliminated).

Step 4 — Guarantee every variant carries the tag

Apply the Surrogate-Key before any Vary-splitting or A/B logic so all variants of a page inherit the same tag, making tag purge cover what URL purge cannot. For CloudFront, which has no native tagging, fall back to path-pattern invalidations scoped to the changed prefixes rather than /*.

Expected outcome: every cached variant of a changed page is evicted by a single tag purge; no surviving stale variant.

Verification

Confirm the deploy purged correctly with a before/after check wired into CI:

bash
# Assert a changed URL is fresh, an unchanged URL is still warm
changed="https://your-domain.com/products/42"
untouched="https://your-domain.com/about/"

curl -sI "$changed"  | grep -i x-cache | grep -qiE 'miss|expired' \
  || { echo "FAIL: changed page still cached (under-purge)"; exit 1; }
curl -sI "$untouched" | grep -i x-cache | grep -qiE 'hit' \
  || { echo "FAIL: unrelated page evicted (over-purge)"; exit 1; }
# trade-off: checks one URL per class. Sample one per tag, not all URLs —
# verifying hundreds in CI becomes its own load test.

The combined assertion is the tell: the changed URL must miss, the untouched URL must still hit. If both pass, the purge was correctly scoped. In production, confirm via CDN analytics that the post-deploy edge hit ratio dips only slightly and recovers above 85% within minutes, and that origin RPS never exceeded its TTFB ≤ 200ms headroom. Tie soft purge to a stale window per Stale-While-Revalidate Implementation to keep the refill smooth.