Serving AVIF and WebP with Fallbacks: An Encoding and Negotiation Workflow

This guide extends the Image & Media Optimization discipline into the codec layer, where the biggest single byte reduction on most pages still lives. Re-encoding a JPEG hero as AVIF routinely cuts transfer weight by 40–60% at matched visual quality, and because that hero is usually the Largest Contentful Paint candidate, those bytes come straight off your LCP budget of 2.5s. On a mid-tier mobile connection (~1.6 Mbps), trimming a 180KB JPEG to a 75KB AVIF removes roughly 0.5s of pure transfer time before any other optimization. The catch is that no single modern codec is universally supported and cheaply decoded by every client, so shipping AVIF safely means shipping a fallback chain rather than picking one format and hoping.
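The transfer-time figure above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes 1KB = 1000 bytes and an idealized link with no latency or protocol overhead; the function name is illustrative.

```python
def transfer_seconds(size_bytes: int, link_mbps: float) -> float:
    """Idealized time to move size_bytes over a link of link_mbps megabits/s."""
    return (size_bytes * 8) / (link_mbps * 1_000_000)


jpeg_s = transfer_seconds(180_000, 1.6)  # 180KB JPEG hero
avif_s = transfer_seconds(75_000, 1.6)   # 75KB AVIF re-encode
saved_s = jpeg_s - avif_s                # transfer time removed by the re-encode

print(f"JPEG {jpeg_s:.3f}s, AVIF {avif_s:.3f}s, saved {saved_s:.3f}s")
```

Real connections add round trips and congestion effects, so treat the result as a lower bound on what the byte reduction is worth.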

The workflow here is mechanical: encode each master into AVIF and WebP at a defensible quality, declare the candidates in a <picture> type chain (or negotiate via the Accept header at the edge), tune the encoder knobs against a byte budget, and then verify that the byte win is not silently eaten by a decode cost on the main thread. Format choice is orthogonal to size selection — the resolution ladder from responsive images with srcset and sizes still chooses how big; this layer chooses which codec.

Figure: the picture type fallback chain. The browser evaluates source elements top to bottom and uses the first whose type attribute names a codec it can decode, falling back to the img element. Chain order: source AVIF (smallest bytes, highest decode cost), then source WebP (broad support, cheap decode), then img JPEG (universal floor, always renders). Order matters: put the smallest supported codec first; the img is the guaranteed floor. Each source carries its own srcset, so resolution selection still applies per codec. Byte savings are real, but decode time on the main thread is the hidden tax: validate that the smaller file still paints sooner, not just downloads sooner.

1. Environment Setup: Encoders and Source Assets

Settle the toolchain before you encode a single file, because the encoder you choose determines both the byte floor and how long your build takes. For AVIF the reference encoder is avifenc from libavif (which wraps the AOM aom encoder); for WebP it is cwebp from libwebp. In a Node build, sharp exposes both through one API (.avif() and .webp()), which is the path of least friction for most pipelines; squoosh is excellent for interactive, per-image tuning when you are still finding your quality knee but is awkward to run at scale in CI. Pin the encoder versions: AVIF output is not bit-stable across libaom releases, and an unpinned bump can silently change your byte budgets between builds.

Start from the highest-fidelity master you have — ideally the original capture or a lossless PNG, never a previously compressed JPEG. Encoding AVIF or WebP from an already-lossy JPEG bakes in the JPEG's block artifacts and then spends bytes preserving them, so you pay twice and still look worse. The master must also be at least as wide as the largest width in your resolution ladder; codec choice does not rescue a pixel deficit, as covered in the source-asset discussion in responsive images with srcset and sizes.

The decision that bites teams later is where in the pipeline encoding runs. There are three viable positions, each with a distinct operational profile. Build-time encoding (sharp in a CI step, or avifenc/cwebp invoked by your bundler) produces static variants you can fingerprint and cache forever; it is the right home for a fixed set of marketing and product imagery because the encode cost is paid once and the output is fully CDN-cacheable. Request-time encoding behind an image CDN transcodes on demand from a single origin master, which is the only sane option for user-generated content where you cannot enumerate the asset set ahead of time. The hybrid — build static variants for the known critical assets and let the CDN handle everything else — is what most production sites converge on. Whichever you choose, treat the encoder version, the quality settings, and the chroma mode as part of the artifact's cache key, because changing any of them must invalidate the cached output; a silent quality change that reuses the old cached bytes is a class of bug that survives every functional test.
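One way to make the encode settings part of the cache key is to hash them together with the master's digest. The sketch below is an assumption about how such a key could be built, not a description of any particular CDN or bundler; the function name, the field list, and the key layout are all hypothetical.

```python
import hashlib


def variant_cache_key(master_sha256: str, codec: str, encoder_version: str,
                      quality: int, chroma: str, width: int) -> str:
    """Derive a key that changes whenever any input to the encode changes."""
    parts = [master_sha256, codec, encoder_version, str(quality), chroma, str(width)]
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    # Keep codec and width readable; the digest suffix carries everything else.
    return f"{codec}-{width}-{digest[:16]}"


# Hypothetical values for illustration only.
master = hashlib.sha256(b"hero-master.png bytes").hexdigest()
key_q28 = variant_cache_key(master, "avif", "libaom-x.y.z", 28, "420", 1200)
key_q32 = variant_cache_key(master, "avif", "libaom-x.y.z", 32, "420", 1200)
print(key_q28, key_q32)
```

Because the quality value feeds the hash, a quality change produces a new key and cannot silently reuse bytes cached under the old setting.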

bash
# Encode one width into AVIF and WebP from a clean master
avifenc --min 0 --max 63 -a end-usage=q -a cq-level=28 \
        --speed 6 hero-master.png hero-1200.avif
cwebp -q 78 -m 4 hero-master.png -o hero-1200.webp
# trade-off: --speed 6 / -m 4 are the mid-effort defaults (cwebp -m tops out at 6).
# Dropping to avifenc --speed 2 roughly triples encode time for ~3-5% more
# savings: worth it for a static hero rebuilt rarely, wasteful for thousands
# of user-uploaded images per minute.

2. Capture a Byte and Quality Baseline

Quantify what each codec actually buys before wiring up the chain. Take one representative master and encode it three ways — JPEG at quality 72, WebP, and AVIF — targeting visually matched quality, then record transferred bytes for each in the DevTools Network panel. Matched quality is the discipline that makes the comparison honest: comparing AVIF at quality 30 against JPEG at quality 90 proves nothing. Use a perceptual metric — SSIMULACRA2 or at minimum a butteraugli/DSSIM score — rather than the encoder's internal quality integer, because the same numeric "quality" means different things across codecs.

Tabulate three columns per format: transferred bytes, the perceptual score, and the bytes-per-quality-point ratio that lets you compare encoders on equal footing. Then add the number that the byte savings can hide: decode time. In the DevTools Performance panel, record a load and find the Decode Image (or Image Decode) task for the hero; AVIF decode is meaningfully heavier than JPEG, and on a low-end phone a large AVIF can spend 30–80ms decoding on the main thread. That decode sits inside your LCP render-delay phase, so a file that is 100KB lighter but decodes 60ms slower has a smaller net LCP win than the byte chart suggests. Capturing decode alongside bytes is what separates a real improvement from a paper one.
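The baseline row and the net effect of decode cost can be captured in a small data structure. This is a sketch under stated assumptions: the class and function names are hypothetical, the sample numbers are invented for illustration rather than measured, and the link model ignores latency.

```python
from dataclasses import dataclass


@dataclass
class BaselineRow:
    codec: str
    transferred_bytes: int
    perceptual_score: float  # e.g. SSIMULACRA2, higher is better
    decode_ms: float         # Decode Image task duration from a trace

    @property
    def bytes_per_quality_point(self) -> float:
        return self.transferred_bytes / self.perceptual_score


def net_gain_ms(baseline: BaselineRow, candidate: BaselineRow, link_mbps: float) -> float:
    """Transfer time saved minus decode time added, in milliseconds."""
    bytes_saved = baseline.transferred_bytes - candidate.transferred_bytes
    transfer_saved_ms = bytes_saved * 8 / (link_mbps * 1000)
    decode_added_ms = candidate.decode_ms - baseline.decode_ms
    return transfer_saved_ms - decode_added_ms


# Illustrative numbers only: 100KB lighter, 60ms slower to decode.
jpeg = BaselineRow("jpeg", 180_000, 70.0, 10.0)
avif = BaselineRow("avif", 80_000, 70.0, 70.0)
gain = net_gain_ms(jpeg, avif, link_mbps=1.6)
print(f"net gain: {gain:.0f}ms")
```

On a fast link the transfer term shrinks while the decode term stays fixed, which is why the same pair of files can look very different at 1.6 Mbps and at 50 Mbps.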

Run the baseline on more than one image, because the codec ranking is content-dependent and a single sample will mislead you. Pick at least one image from each class you actually ship: a noisy photograph, a smooth-gradient image (a sky, a soft product backdrop), a flat illustration or chart, and a screenshot containing text. Encode each in all three formats at matched perceptual quality and watch how the AVIF lead swings — wide on the noisy photo, narrow or even negative on the flat chart where banding forces the quality back up. This per-class table is the artifact you carry into the rest of the workflow: it tells you which content classes justify AVIF, which are fine on WebP, and which (text-heavy screenshots especially) may belong in a lossless format entirely. Without it you will set one global quality and either under-compress your photos or visibly degrade your charts. Finally, normalize for size in the baseline: record bytes at a fixed display width, not at the master resolution, because a codec that wins at 1600px can lose at 320px once its fixed container overhead dominates the payload.

3. Isolate Delivery: Type Negotiation vs Accept-Header Negotiation

There are two mechanisms for getting the right codec to each client, and they fail in different ways. Client-side type negotiation uses <picture> with <source type="image/avif"> and <source type="image/webp">; the browser walks the sources top to bottom and uses the first whose type it can decode. This is fully static, CDN-cacheable under a single URL per resource, and requires no server logic — but it ships the markup for every codec to every client and locks you into emitting all variants at build time.

Server-side Accept-header negotiation inspects the request's Accept header (browsers that support AVIF send image/avif in it) and returns the best codec from a single image URL. This keeps markup minimal and lets an image CDN transcode on demand, but it makes the response vary by request, so you must set Vary: Accept or a shared cache can serve an AVIF body to a client that never advertised AVIF support. The two approaches are not mutually exclusive: a common production shape is a plain <img> whose src points at a CDN doing Accept negotiation, falling back to client-side <picture> only where you need explicit control over the fallback order.

The trade-off that decides between them is almost always cache behavior, not markup verbosity. Vary: Accept is correct but blunt: because the Accept header that browsers send for images is not perfectly uniform across versions and proxies, a naive Vary: Accept can fragment your cache into more variants than the two or three codecs you actually serve, lowering hit rate and pushing more requests to origin. Mature image CDNs sidestep this by normalizing Accept internally to a small set of canonical variants before keying the cache, so confirm your CDN does that rather than varying on the raw header. Client-side type negotiation has the opposite profile: one immutable URL per variant, a perfect cache key, and no origin involvement at request time — at the cost of emitting and storing every variant up front and shipping all the <source> markup to every client. There is also a discoverability difference that matters for the LCP image: with static <picture>, the preload scanner can see the candidate URLs in the HTML and start the fetch immediately, whereas with a single CDN URL the negotiation and any redirect add latency before the bytes flow. For the hero specifically, prefer the path that lets the scanner commit to a concrete, high-priority URL as early as possible.
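The normalization step described above can be sketched as a function that collapses any raw Accept header into one of three canonical variants before the cache key is computed. This is an illustrative assumption about the approach, not any CDN's actual implementation; the function name and the variant labels are hypothetical.

```python
def normalize_accept(accept_header: str) -> str:
    """Collapse a raw Accept header into 'avif', 'webp', or 'jpeg'."""
    accepted = set()
    for item in accept_header.split(","):
        media_type, _, params = item.strip().partition(";")
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # malformed weight: conservatively treat as refused
        if q > 0:
            accepted.add(media_type.strip().lower())
    # Only explicit mentions count; wildcards such as */* do not imply AVIF.
    if "image/avif" in accepted:
        return "avif"
    if "image/webp" in accepted:
        return "webp"
    return "jpeg"


print(normalize_accept("image/avif,image/webp,image/apng,image/*,*/*;q=0.8"))
print(normalize_accept("image/webp,image/apng,image/*,*/*;q=0.8"))
print(normalize_accept("*/*"))
```

Keying the cache on the returned label instead of the raw header keeps the number of stored variants equal to the number of codecs actually served.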

nginx
# Edge Accept-header negotiation: serve AVIF only to clients that advertise it
map $http_accept $img_ext {
    default        "jpg";
    "~*image/avif" "avif";
    "~*image/webp" "webp";
}
location ~* ^/img/(?<name>.+)\.(jpg|jpeg)$ {
    add_header Vary Accept;                 # REQUIRED or caches cross-serve codecs
    try_files /img/$name.$img_ext /img/$name.jpg =404;
}
# trade-off: Vary: Accept fragments the CDN cache across Accept variants and the
# regex match runs per request. For a small fixed asset set, static <picture>
# with build-time variants caches better under one URL and skips the server logic.

4. Apply the Fix: The picture Type Fallback Chain

For client-side delivery, the corrected markup is a <picture> whose <source> elements are ordered smallest-codec-first. The browser commits to the first <source> whose type it supports, so AVIF precedes WebP, and the inner <img> — carrying the JPEG src, alt, intrinsic width/height, and decoding/fetchpriority — is the mandatory floor that renders when no <source> matches. Each <source> keeps its own srcset and sizes, so resolution selection runs independently per codec; you are composing two orthogonal axes, format and size, in one element.

Always keep the intrinsic width and height on the <img> to reserve layout space and avoid the Cumulative Layout Shift that comes from images snapping in after decode. For an LCP hero, add fetchpriority="high" so the chosen candidate is requested early rather than waiting behind the preload scanner's default ordering — the priority discussion lives in image CDNs and fetchpriority.

html
<!-- Type fallback chain: AVIF, then WebP, then JPEG floor -->
<picture>
  <source type="image/avif"
          srcset="hero-800.avif 800w, hero-1200.avif 1200w, hero-1600.avif 1600w"
          sizes="(max-width: 600px) 100vw, 1100px">
  <source type="image/webp"
          srcset="hero-800.webp 800w, hero-1200.webp 1200w, hero-1600.webp 1600w"
          sizes="(max-width: 600px) 100vw, 1100px">
  <img src="hero-1200.jpg" width="1200" height="675"
       alt="Quarterly revenue dashboard"
       decoding="async" fetchpriority="high">
</picture>
<!-- trade-off: this triples your build outputs and cache entries per image.
     For below-the-fold or rarely-viewed images, the storage and build cost of
     three full ladders can outweigh the byte savings — ship AVIF+JPEG only,
     or skip AVIF and serve WebP+JPEG, where decode budget is tight. -->
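When the chain is generated rather than hand-written, the source order can be fixed in one place. The following sketch assumes a hypothetical file naming scheme of base-width.ext, matching the markup above; the function name and parameters are illustrative.

```python
from html import escape


def picture_markup(base, widths, sizes, alt, width, height,
                   fallback_width, is_lcp=False):
    """Build a <picture> chain with AVIF first, WebP second, JPEG <img> last."""
    def srcset(ext):
        return ", ".join(f"{base}-{w}.{ext} {w}w" for w in widths)

    lines = ["<picture>"]
    # Order is the contract: the browser takes the first source it can decode.
    for ext, mime in (("avif", "image/avif"), ("webp", "image/webp")):
        lines.append(
            f'  <source type="{mime}" srcset="{srcset(ext)}" sizes="{escape(sizes)}">'
        )
    priority = ' fetchpriority="high"' if is_lcp else ""
    lines.append(
        f'  <img src="{base}-{fallback_width}.jpg" width="{width}" height="{height}" '
        f'alt="{escape(alt)}" decoding="async"{priority}>'
    )
    lines.append("</picture>")
    return "\n".join(lines)


markup = picture_markup("hero", [800, 1200, 1600],
                        "(max-width: 600px) 100vw, 1100px",
                        "Quarterly revenue dashboard", 1200, 675,
                        fallback_width=1200, is_lcp=True)
print(markup)
```

Reserving fetchpriority for the LCP candidate via a flag avoids marking every generated image as high priority.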

Deconstructing the Encoder Knobs: Quality, Effort, and Speed

Encoder output is governed by two independent dials, and conflating them wastes either bytes or build minutes. Quality (the perceptual target) controls how aggressively detail is discarded; in avifenc this is cq-level (lower is higher quality, ~20–32 is the photographic sweet spot), and in cwebp it is -q (higher is higher quality, ~72–82 for photos). Effort/speed controls how hard the encoder searches for a smaller representation at that quality: avifenc --speed (lower is slower and smaller) and cwebp -m (higher is slower and smaller). Effort changes encode time and file size; it does not change decode time, which is a property of the bitstream and the decoder, not how long you spent encoding.

The practical consequence is that quality and effort are tuned against different budgets. Quality is tuned against a perceptual floor: encode a ladder of cq-level values, score each with SSIMULACRA2, and pick the highest cq-level (smallest file) that still clears your perceptual threshold — this is the byte/quality knee. Effort is tuned against your build budget: maximum effort on a static hero rebuilt weekly costs nothing the user sees, but maximum effort on an upload pipeline serving thousands of images per minute can blow your processing SLA for a few percent of bytes. A defensible default is mid-effort with quality chosen per content class — photographic heroes tolerate more compression than flat illustrations or text-bearing screenshots, which show banding and ringing far earlier and often justify a lower cq-level or even staying on WebP/PNG.
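Finding the byte/quality knee reduces to a simple selection once the ladder has been encoded and scored. This is a minimal sketch: the function name is hypothetical, and the ladder values below are invented for illustration, not measurements.

```python
def pick_knee(ladder, min_score):
    """Return the (cq_level, score, size_bytes) entry with the highest cq_level
    whose perceptual score still clears min_score, or None if none does."""
    passing = [entry for entry in ladder if entry[1] >= min_score]
    if not passing:
        return None
    return max(passing, key=lambda entry: entry[0])


# (cq_level, perceptual score, size in bytes): illustrative values only.
ladder = [
    (20, 82.0, 118_000),
    (24, 78.5, 96_000),
    (28, 74.1, 77_000),
    (32, 69.3, 63_000),
    (36, 62.0, 52_000),
]
knee = pick_knee(ladder, min_score=70.0)
print(knee)
```

Running the same selection with a higher threshold for flat illustrations or text-bearing screenshots yields a lower cq-level for those classes without a separate code path.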

There is a third axis on AVIF specifically: chroma subsampling. AVIF can encode 4:4:4 (full chroma) or 4:2:0 (subsampled). 4:2:0 is right for photos but smears fine colored detail such as red text on a colored background or sharp UI chrome. For those assets, forcing 4:4:4 (avifenc --yuv 444) preserves edges at a byte cost; for everything photographic, use 4:2:0 (avifenc --yuv 420). Set the mode explicitly rather than relying on the encoder default, which differs between tools and input types. Treat content class as an input to the encoder config, not a one-size pipeline.
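Treating content class as an input can be as simple as a lookup table that feeds the encoder command line. The sketch below reuses the avifenc flags from the command earlier in this guide and passes --yuv explicitly for both classes so the result does not depend on the encoder's default; the class names and cq-level values are assumptions, not recommendations from a measured baseline.

```python
# Hypothetical per-class settings; derive real values from your own baseline.
AVIF_PROFILES = {
    "photo": {"cq_level": 30, "yuv": "420"},
    "ui": {"cq_level": 22, "yuv": "444"},
}


def avifenc_command(content_class: str, src: str, dst: str, speed: int = 6) -> list:
    """Build an avifenc argument list for the given content class."""
    profile = AVIF_PROFILES[content_class]
    return [
        "avifenc", "--min", "0", "--max", "63",
        "-a", "end-usage=q",
        "-a", f"cq-level={profile['cq_level']}",
        "--yuv", profile["yuv"],
        "--speed", str(speed),
        src, dst,
    ]


photo_cmd = avifenc_command("photo", "hero-master.png", "hero-1200.avif")
ui_cmd = avifenc_command("ui", "toolbar-master.png", "toolbar-640.avif")
print(" ".join(photo_cmd))
print(" ".join(ui_cmd))
```

The list form can be handed to subprocess.run without shell quoting concerns; flag names can change between libavif releases, which is one more reason to pin the encoder version.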

Advanced Diagnostics: Decode Cost, Animation, and Format Failure Modes

The failure mode that survives a clean byte audit is decode cost on the critical path. AVIF decode is CPU-heavy; on a low-end Android device a full-bleed AVIF hero can add tens of milliseconds of decode to the LCP render-delay phase, and decoding="async" does not help the LCP element because the browser must decode it to paint it. Profile decode on a throttled CPU (6x slowdown in the Performance panel) before assuming AVIF wins for your largest image — sometimes WebP, with cheaper decode and only slightly more bytes, paints sooner end to end. This is the central trade-off explored in AVIF vs WebP: which format to serve.

A second, quieter failure mode is the progressive-rendering gap. JPEG's long history gave it progressive scans that paint a blurry-then-sharp preview as bytes arrive, which can make a slow JPEG feel faster even when its final paint is later. AVIF and WebP do not stream a progressive preview the same way; a partially downloaded AVIF shows nothing until enough of the bitstream has arrived to decode the first tiles. On a fast connection this is irrelevant, but on the slow links where you most want the byte savings, the perceived loading experience can regress even as the metric improves. The mitigation is a lightweight inline placeholder — a tiny blurred LQIP encoded as a data URI, or a CSS background color sampled from the image — so the layout slot is never visually empty while the modern-codec bytes stream in. This composes cleanly with the intrinsic width/height you already set for layout stability: the dimensions reserve the box, the placeholder fills it, and the decoded image replaces it.

A third failure mode is over-eager AVIF on assets that re-compress poorly downstream. If an image will later be screenshotted, embedded in a PDF export, or re-encoded by an email client, the artifacts that AVIF's aggressive quantization leaves can compound badly through a second lossy pass. For assets with an uncertain downstream life, a slightly larger WebP or a quality-conservative AVIF is the safer call than the smallest possible file. None of these failure modes argue against modern codecs — they argue for choosing the codec and quality per asset's role rather than running one global setting and trusting the byte chart.

Animation and transparency are their own decision points. For animation, neither codec should be your reflex: animated AVIF and animated WebP exist but are heavy to decode and awkward to author; a short, muted, playsinline MP4/WebM via <video> almost always beats an animated image format on both bytes and decode. For transparency, both AVIF and WebP carry an alpha channel and crush PNG on byte size, so a logo with transparency belongs in WebP or AVIF, not PNG — but watch for halo artifacts on hard alpha edges at aggressive quality. Finally, beware the "AVIF is always smaller" reflex on small images: codec container overhead means that below a few KB a WebP or even an optimized PNG can undercut AVIF, so size-gate your codec choice rather than blindly encoding everything to AVIF.
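Size-gating can be implemented by encoding every candidate and keeping whichever is actually smallest, instead of assuming a winner. This is a sketch with a hypothetical function name and invented byte counts.

```python
from typing import Mapping


def smallest_candidate(encoded_sizes: Mapping[str, int]) -> str:
    """Return the codec whose encoded output is smallest for this asset."""
    return min(encoded_sizes, key=lambda codec: encoded_sizes[codec])


# Illustrative byte counts: a small icon versus a photographic hero.
icon_choice = smallest_candidate({"avif": 2_900, "webp": 2_100, "png": 2_400})
hero_choice = smallest_candidate({"avif": 75_000, "webp": 110_000, "jpeg": 180_000})
print(icon_choice, hero_choice)
```

This only compares bytes at matched quality; decode cost and downstream re-compression still need the per-asset judgment described above.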

Validation and Performance Budgeting

Validation closes the loop opened by your baseline. Re-run the three-format measurement and confirm the chosen codec both transfers fewer bytes and paints no later: capture transferred bytes in the Network panel and the Decode Image task in the Performance panel, and verify the net effect on the LCP timestamp, not just on download size. The concrete budget: an above-the-fold hero should land well under a hard byte ceiling (e.g. ≤ 100KB for the LCP image) while keeping decode under ~16ms on a mid-tier device so it fits inside a single frame.
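The two-sided budget can be checked mechanically from measured values before it reaches CI. The sketch below assumes 100KB means 100,000 bytes and takes its thresholds from the figures in this guide; the function name is hypothetical, and the measurements are supplied by you from the Network and Performance panels.

```python
def check_hero_budget(transferred_bytes: int, decode_ms: float, lcp_ms: float,
                      max_bytes: int = 100_000, max_decode_ms: float = 16.0,
                      max_lcp_ms: float = 2500.0) -> list:
    """Return a list of human-readable budget violations (empty means pass)."""
    violations = []
    if transferred_bytes > max_bytes:
        violations.append(f"bytes {transferred_bytes} > {max_bytes}")
    if decode_ms > max_decode_ms:
        violations.append(f"decode {decode_ms}ms > {max_decode_ms}ms")
    if lcp_ms > max_lcp_ms:
        violations.append(f"LCP {lcp_ms}ms > {max_lcp_ms}ms")
    return violations


# Illustrative measurements only.
ok = check_hero_budget(75_000, 12.0, 2100.0)
bad = check_hero_budget(140_000, 40.0, 2100.0)
print(ok)
print(bad)
```

Reporting all violations at once, rather than failing on the first, shows whether a regression is a byte problem, a decode problem, or both.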

Enforce both halves in CI. Lighthouse's modern-image-formats audit flags any image still served as legacy JPEG/PNG where a modern codec would save meaningful bytes; assert it at minScore 1 so a regression — someone adds an image without the modern variants — fails the build. Pair it with a hard byte budget on the LCP image and an LCP assertion, since a heavy decode can pass a byte budget while still regressing the metric.

json
{
  "ci": {
    "assert": {
      "assertions": {
        "modern-image-formats": ["error", { "minScore": 1 }],
        "uses-optimized-images": ["error", { "minScore": 1 }],
        "largest-contentful-paint": ["error", { "maxNumericValue": 2500 }]
      }
    }
  }
}

Use modern-image-formats to catch any asset still shipping as legacy JPEG/PNG before merge.