Configuring stale-if-error for origin outages
This scenario sits under the CDN edge caching configuration guide inside Advanced Caching Strategies & CDN Architecture, and addresses one failure mode: your origin returns 5xx or times out, and the edge dutifully forwards the error to users instead of falling back to the perfectly good copy it cached minutes ago.
The fix is the RFC 5861 stale-if-error directive, usually paired with stale-while-revalidate. Configured correctly, an origin outage becomes invisible: the edge keeps serving stale HTML and JSON with a fast TTFB (≤ 200ms) while the origin recovers, and your Largest Contentful Paint stays under 2.5s instead of collapsing into an error page. The danger is the opposite failure — serving stale content forever, or serving it when you should not — so the configuration has to be precise.
Rapid diagnosis
Before changing headers, confirm the edge is actually forwarding origin errors rather than absorbing them.
- Take the origin down or point a request at a known-failing route, then curl -I the edge URL. A 502/503/504 reaching the client means no stale-if-error fallback is active.
- Inspect the cache-status header (CF-Cache-Status, X-Cache, Fastly debug headers, x-vercel-cache). During the outage you want to see a hit/stale indicator (STALE, HIT), not MISS or ERROR.
- Check the Cache-Control on the origin response in the browser Network tab. If stale-if-error is absent, the edge has nothing authorising a fallback.
- Confirm the resource was cached at all before the outage: a route with Cache-Control: no-store or a cookie-fragmented cache key has no stale copy to fall back to.
- Look at the edge TTL: if the object already expired past the stale-if-error window, the edge correctly stops serving it.
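The first two checks can be sketched as a small classifier for a response captured during the outage. This is a minimal illustration, not a CDN API: the function name and the returned messages are hypothetical, and the header names are simply the ones listed in the checklist.

```python
from typing import Mapping

# Header names taken from the checklist; which one appears depends on the CDN.
CACHE_STATUS_HEADERS = ("cf-cache-status", "x-cache", "x-vercel-cache")


def diagnose_outage_response(status: int, headers: Mapping[str, str]) -> str:
    """Classify one edge response captured while the origin is failing."""
    lowered = {name.lower(): value for name, value in headers.items()}

    # Pick the first cache-status header the CDN actually sent, if any.
    cache_status = ""
    for name in CACHE_STATUS_HEADERS:
        if name in lowered:
            cache_status = lowered[name].upper()
            break

    if status in (502, 503, 504):
        return "error forwarded: no stale-if-error fallback is active"
    if "STALE" in cache_status or "HIT" in cache_status:
        return "fallback active: edge served a cached copy"
    if cache_status:
        return f"inconclusive: cache status was {cache_status}"
    return "inconclusive: no cache-status header found"
```

Feeding it the status line and headers from curl -I output gives a quick yes/no before any header changes are made.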
Root cause analysis
1. The directive is simply missing. The most common cause: origin responses carry max-age but no stale-if-error. The edge has a fresh-or-revalidate model only, so on a 504 it has no instruction to reuse the expired object and passes the error through.
2. The grace window is too short. stale-if-error=60 only covers a one-minute blip. A real deploy failure or database incident runs for minutes to hours; once the window lapses, the edge resumes forwarding errors. The window must be sized to your realistic mean-time-to-recovery, not to a hypothetical.
3. The route was never cacheable. Set-Cookie, Vary: Cookie, or Cache-Control: private/no-store on the HTML means the edge never stored a shared copy, so there is nothing to serve stale. Personalised pages are the classic blind spot here.
4. The CDN does not honour the standard directive. Several CDNs ignore the HTTP stale-if-error directive and require their own configuration knob (Cloudflare's Always Online / Tiered Cache serve-stale, Fastly's VCL stale-if-error in vcl_fetch, Akamai's downstream caching settings). Setting the header alone does nothing on those platforms.
5. The error window is masking a real incident. A subtler failure mode is success that hides a problem. Once stale-if-error is serving day-old content, edge status codes look healthy even though the origin has been down for an hour. Teams that alarm on edge 5xx rate will see nothing fire while customers quietly view stale data. The directive is doing its job, but observability has to move to the origin tier — otherwise the safety net becomes a blindfold. This is not a reason to avoid stale-if-error; it is a reason to pair it with origin-side error-rate and saturation alarms so the outage is still visible to you even when it is invisible to users.
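To make causes 1 to 4 concrete, here is a simplified model of the decision an RFC 5861-aware cache makes for one request. It is a sketch under stated assumptions: the names are hypothetical, stale-while-revalidate is deliberately left out, and a CDN that ignores the directive is modelled by passing None for the window.

```python
from enum import Enum
from typing import Optional


class EdgeAction(Enum):
    SERVE_FRESH = "serve fresh copy"
    SERVE_STALE = "serve stale copy"
    FORWARD_ORIGIN = "forward origin response"


def edge_action(
    age_s: int,
    max_age_s: int,
    stale_if_error_s: Optional[int],
    origin_failed: bool,
    has_cached_copy: bool = True,
) -> EdgeAction:
    """Simplified per-request decision; ignores stale-while-revalidate."""
    if not has_cached_copy:
        # Cause 3: the route was never cacheable, so nothing can be reused.
        return EdgeAction.FORWARD_ORIGIN
    if age_s < max_age_s:
        return EdgeAction.SERVE_FRESH
    if not origin_failed:
        # Healthy origin: the revalidated or refetched response goes out.
        return EdgeAction.FORWARD_ORIGIN
    # The window counts staleness, i.e. time elapsed since the object expired.
    staleness_s = age_s - max_age_s
    if stale_if_error_s is not None and staleness_s <= stale_if_error_s:
        return EdgeAction.SERVE_STALE
    # Causes 1, 2 and 4: directive missing, window lapsed, or ignored.
    return EdgeAction.FORWARD_ORIGIN
```

For example, edge_action(121, 60, 60, origin_failed=True) forwards the error because the object has been stale for 61s against a 60s window, which is cause 2 in miniature.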
Step-by-step resolution
Apply these in order of impact; each lists the expected outcome.
1. Add stale-if-error to origin responses
Emit both grace directives on cacheable HTML and API payloads from the origin:
# Origin (or shielding tier) response for cacheable HTML.
location / {
    add_header Cache-Control "public, max-age=60, stale-while-revalidate=600, stale-if-error=86400";
    # trade-off: an 86400s (24h) error window can serve day-old content during a
    # long outage. Do NOT use this length for prices, stock, or auth-sensitive
    # pages; cap stale-if-error to seconds there and prefer an explicit error.
}
Expected outcome: during a 5xx/timeout, the edge serves the last good copy for up to 24h instead of an error page, holding TTFB at edge-cache speed (within the ≤ 200ms target) rather than an origin round-trip plus failure.
2. Enable serve-stale on CDNs that ignore the header
On Fastly, make the behaviour explicit in VCL so a fetch failure reuses stale:
sub vcl_fetch {
  # Reuse cached object for 24h if origin errors, refresh quietly for 10m.
  set beresp.stale_if_error = 86400s;
  set beresp.stale_while_revalidate = 600s;
  # Only real origin responses reach vcl_fetch. Timeouts and connection
  # failures surface in vcl_error and need the same stale.exists check there.
  if (beresp.status >= 500 && beresp.status < 600 && stale.exists) {
    return(deliver_stale);
  }
  # trade-off: deliver_stale masks origin 5xx from monitoring. Keep origin
  # error rate alarms on the origin tier, not on edge status codes, or a
  # multi-hour outage will look healthy from the edge.
}
On Cloudflare, enable Always Online and Tiered Cache, or set Cache-Control via a Cache Rule: the edge applies serve-stale through those settings rather than through the raw directive alone.
Expected outcome: platforms that drop the standard directive now fall back to stale, eliminating the silent gap where the header was present but inert.
3. Make the route cacheable in the first place
Strip personalisation from the cache key so a shared stale copy exists:
# Don't fragment the HTML cache on analytics/session cookies.
proxy_cache_key "$scheme$host$request_uri";
proxy_ignore_headers Set-Cookie;
proxy_hide_header Set-Cookie;
# trade-off: this is only safe for genuinely shared, anonymous HTML.
# Applying it to logged-in pages leaks one user's cached page to another —
# segment personalised routes into a separate, non-shared cache.
Expected outcome: anonymous HTML now has a shared edge object, giving stale-if-error something to serve and lifting baseline hit ratio above the 85% target for static routes.
4. Size the window to real recovery time
Set stale-if-error to comfortably exceed your p95 incident duration (commonly 1-24h for HTML, shorter for data) while keeping stale-while-revalidate short (300-600s) so healthy traffic refreshes promptly. The two are independent: stale-while-revalidate handles the normal expiry path, stale-if-error handles the failure path.
Expected outcome: routine expiries stay fresh within ~10 minutes, while a multi-hour outage is still fully absorbed.
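One way to turn "comfortably exceed your p95 incident duration" into a number is sketched below. The helper names, the 1.5x headroom factor, and the sample durations are illustrative assumptions, not recommendations from this guide; the 24h cap mirrors the upper bound mentioned above.

```python
import math
from typing import Sequence


def p95_nearest_rank(values: Sequence[int]) -> int:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def size_stale_if_error(
    incident_durations_s: Sequence[int],
    headroom: float = 1.5,   # assumed safety margin over p95
    cap_s: int = 86400,      # never exceed 24h
) -> int:
    """Suggest a stale-if-error window in seconds from past incident lengths."""
    if not incident_durations_s:
        raise ValueError("need at least one incident duration")
    return min(cap_s, math.ceil(p95_nearest_rank(incident_durations_s) * headroom))


def build_cache_control(max_age_s: int, swr_s: int, sie_s: int) -> str:
    """Assemble the header value in the same shape as the step 1 example."""
    return (
        f"public, max-age={max_age_s}, "
        f"stale-while-revalidate={swr_s}, stale-if-error={sie_s}"
    )
```

With hypothetical incidents of 300s, 900s, 1800s, 5400s and 14400s, the p95 is 14400s and the suggested window is 21600s (6h), which build_cache_control(60, 600, 21600) renders as a complete header value.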
Verification
Confirm the fallback works and that it heals after recovery:
- Before/after header diff. Compare the origin Cache-Control:
  curl -sI https://yourdomain.com/ | grep -i cache-control
  # before: cache-control: public, max-age=60
  # after: cache-control: public, max-age=60, stale-while-revalidate=600, stale-if-error=86400
- Simulate the outage. Block the origin (firewall rule or kill the upstream) and re-request the edge. Expected: 200 OK from cache with a STALE cache-status, not a 5xx.
- Confirm recovery. Restore the origin. Within the stale-while-revalidate window the edge should serve stale once more, then transparently refresh to a HIT with the new body. Verify that the response content updates.
- CI assertion. Add a synthetic check that fails the pipeline if the production Cache-Control for / lacks stale-if-error, so a future origin refactor cannot silently drop it.
- RUM field check. During the next real origin blip, watch error-page rate and TTFB p75 in RUM. Expected: client-visible 5xx rate near zero and TTFB held under 200ms, even while origin error rate spikes.
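The CI assertion could look roughly like the sketch below. It only checks a header value that the pipeline has already fetched (for example from the curl -sI output); the function names and the 3600s minimum are hypothetical, and the parser is a simplification that does not handle quoted values containing commas.

```python
from typing import Dict, List, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into lowercase directive -> argument."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def check_stale_if_error(cache_control: str, min_window_s: int = 3600) -> List[str]:
    """Return a list of problems; an empty list means the header passes."""
    directives = parse_cache_control(cache_control)
    problems: List[str] = []
    if "no-store" in directives or "private" in directives:
        problems.append("response is not shared-cacheable (no-store/private)")
    raw = directives.get("stale-if-error")
    if raw is None:
        # Covers both an absent directive and one written without a value.
        problems.append("stale-if-error is missing")
    elif not raw.isdigit():
        problems.append(f"stale-if-error has a non-numeric value: {raw}")
    elif int(raw) < min_window_s:
        problems.append(f"stale-if-error={raw} is below the {min_window_s}s minimum")
    return problems
```

A pipeline step would fail the build whenever the returned list is non-empty and print its entries as the failure reason.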
This pattern composes with the broader edge configuration and with deliberate purging — when you do need to evict the stale copy after a fix, see the cache invalidation patterns guide.
Related
- CDN edge caching configuration — the parent guide covering TTL tiers and edge rule orchestration.
- Stale-while-revalidate implementation — the companion directive for the normal expiry path.
- HTTP Cache-Control headers explained — directive precedence and syntax behind these headers.
- Cache invalidation patterns — how to purge the stale object once the origin is healthy.
- Advanced Caching Strategies & CDN Architecture — the overall caching architecture this resilience pattern belongs to.