The Fundamentals Behind Core Web Vitals

Back in 2020, Google defined three universally applicable performance metrics that they deemed important for user experience. Known as Core Web Vitals (CWV), the original three metrics were Cumulative Layout Shift (CLS), Largest Contentful Paint (LCP), and First Input Delay (FID). This week, a new CWV metric, Interaction to Next Paint (INP), took the place of FID.

This marks the first time that all the CWV metrics are named after parts of the rendering process. Because I like to teach things with the fundamentals in mind, I wanted to touch on how those fundamentals can be applied to understanding the CWV metrics.

A brief overview of browser rendering

If you're not familiar with how browser rendering works, here's the short version: it's a multistep process that starts as soon as the browser receives the initial HTML back from the server.

Document Object Model

Once the browser receives HTML back from the server, it begins the process of displaying the page to the user. HTML is parsed and organized into the Document Object Model (DOM). The DOM is a tree, so it might help to visualize an HTML document like this:

html
- head
  - title
  - meta
  - script
  - script
  - link
  - link
- body
  - a
  - header
    - a
      - img
    - nav
      - ul
        - li
          - a
        - li
          - a
        - li
          - a
  - main
    - div
      - ...
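If it helps to see the tree as data, here is a minimal sketch that models a few of those nodes and walks them depth-first to print the same kind of outline. The SketchNode type and the el and outline helpers are names I made up for illustration. Real DOM nodes carry far more than a tag name and children.

```typescript
// A simplified stand-in for a DOM node: a tag name plus child nodes.
// (Hypothetical type for illustration only.)
interface SketchNode {
  tag: string;
  children: SketchNode[];
}

// Small helper that keeps the tree literal readable.
function el(tag: string, ...children: SketchNode[]): SketchNode {
  return { tag, children };
}

// Depth-first walk that renders the tree as an indented outline.
function outline(node: SketchNode, depth = 0): string[] {
  const prefix = depth === 0 ? "" : `${"  ".repeat(depth - 1)}- `;
  const lines = [`${prefix}${node.tag}`];
  for (const child of node.children) {
    lines.push(...outline(child, depth + 1));
  }
  return lines;
}

// A trimmed-down version of the document above.
const page = el(
  "html",
  el("head", el("title"), el("link")),
  el("body", el("header", el("nav", el("ul", el("li", el("a")))))),
);

console.log(outline(page).join("\n"));
```

The parent-before-child order of the walk mirrors how the nodes appear in the HTML source.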

Once a resource is referenced in the HTML, or in the response headers that come back with it, the browser can discover it. For sub-resources such as images, scripts, stylesheets, fonts, videos, iframes, and the like, the browser determines the priority and order in which to download them. You can give the browser instructions that influence that priority, with mechanisms such as resource hints, 103 Early Hints, and fetchpriority.
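As one example of the response-header route, preload hints can travel in a Link header, which is also what a 103 Early Hints response carries. Here is a minimal sketch of building that header value. The PreloadHint type, the toLinkHeader function, and the file paths are all assumptions for illustration, and how you attach the header depends on your server.

```typescript
// Hypothetical shape for a preload hint; the paths used below are made up.
interface PreloadHint {
  href: string;
  as: "style" | "script" | "font" | "image";
  crossorigin?: boolean; // font preloads generally need this
}

// Serializes hints into a single Link header value, comma-separated.
function toLinkHeader(hints: PreloadHint[]): string {
  return hints
    .map((hint) => {
      const parts = [`<${hint.href}>`, "rel=preload", `as=${hint.as}`];
      if (hint.crossorigin) {
        parts.push("crossorigin");
      }
      return parts.join("; ");
    })
    .join(", ");
}

const linkHeader = toLinkHeader([
  { href: "/css/main.css", as: "style" },
  { href: "/fonts/body.woff2", as: "font", crossorigin: true },
]);

console.log(`Link: ${linkHeader}`);
```

Sending this makes the stylesheet and font discoverable before the parser reaches the elements that reference them.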

CSS Object Model

CSS has its own separate tree structure, known as the CSS Object Model (CSSOM), that helps the browser reconcile which styles get applied to each DOM node.

The DOM and CSSOM together form the render tree. The stages of the rendering process that come after this point are the ones the CWV metrics derive their names from.

Layout

The layout process is where Cumulative Layout Shift gets its name. The layout portion of the rendering process is responsible for determining how much space to allocate for each element. Text-based elements that are part of the HTML are already known entities. Images, iframes, and videos come in after the initial HTML, which is why it's important to declare heights and widths on them: the browser can then reserve the right amount of space before the content arrives.

As the name implies, CLS means that there was some initial layout performed and then an element on the page shifted because of something the browser learned later on, like an image being downloaded or an element such as an ad being injected into the page dynamically.

The CLS score is not a running total over the entire life of the page. Layout shifts are grouped into defined time windows, and the score is the accumulated total of the shifts within the worst of those windows (thanks to Barry Pollard for pointing out this nuance).
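To make the windowing concrete, here is a minimal sketch of the calculation as I understand Google's published definition: shifts less than one second apart are grouped into a window, a window is capped at five seconds, and the score is the largest window total. The ShiftSample type and the sample values are made up for illustration. In a browser, the values would come from layout-shift performance entries, skipping any that closely follow user input.

```typescript
// One observed layout shift: when it happened (ms) and its shift score.
// (Hypothetical type; sample data below is invented.)
interface ShiftSample {
  startTime: number;
  value: number;
}

const MAX_GAP_MS = 1000; // a gap this long starts a new window
const MAX_WINDOW_MS = 5000; // a window never spans more than this

// Returns the largest per-window total of layout shift scores.
function clsFromShifts(shifts: ShiftSample[]): number {
  const sorted = [...shifts].sort((a, b) => a.startTime - b.startTime);
  let windowStart = 0;
  let previousTime = 0;
  let windowTotal = 0;
  let worstTotal = 0;
  let hasWindow = false;

  for (const shift of sorted) {
    const continuesWindow =
      hasWindow &&
      shift.startTime - previousTime < MAX_GAP_MS &&
      shift.startTime - windowStart < MAX_WINDOW_MS;

    if (continuesWindow) {
      windowTotal += shift.value;
    } else {
      // Start a fresh window with this shift.
      windowStart = shift.startTime;
      windowTotal = shift.value;
      hasWindow = true;
    }
    previousTime = shift.startTime;
    worstTotal = Math.max(worstTotal, windowTotal);
  }
  return worstTotal;
}

// Two small shifts close together, then a bigger one much later.
const sampleScore = clsFromShifts([
  { startTime: 0, value: 0.05 },
  { startTime: 500, value: 0.05 },
  { startTime: 3000, value: 0.2 },
]);

console.log(sampleScore);
```

In the sample, the first two shifts share a window and the third starts its own, so the later, larger shift decides the score.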

Paint

Painting is the process of filling in pixels on a screen, and this is where LCP and INP get the "paint" part of their names.

Largest Contentful Paint

LCP is a time-based metric that measures how long it takes to display the largest element, typically an image or a block of text, that is visible on the screen. The theory with LCP is that the largest element in terms of visual hierarchy is also the most important on the page.

If you want to understand LCP, dissecting the name is very helpful. The "largest" part is determined by the layout process and "contentful" means that it conveys meaning. Remember how I said you can provide the browser with instructions to help it determine the priority of sub-resources? This is largely to help with painting.
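To make the "largest" part concrete, here is a minimal sketch of how the LCP candidate changes as content is painted: each time something bigger than the current candidate shows up, it becomes the new candidate, and the render time of the final candidate is the LCP. The PaintSample type, the element labels, and the numbers are invented for illustration, and the sketch leaves out details real browsers handle, such as stopping once the user interacts with the page.

```typescript
// One painted piece of content: a label, its rendered area in pixels,
// and when it was painted (ms). Hypothetical type with invented data.
interface PaintSample {
  element: string;
  size: number;
  renderTime: number;
}

// Walks paints in time order and keeps the largest one seen so far.
function lcpCandidate(paints: PaintSample[]): PaintSample | undefined {
  const sorted = [...paints].sort((a, b) => a.renderTime - b.renderTime);
  let largest: PaintSample | undefined;
  for (const paint of sorted) {
    if (largest === undefined || paint.size > largest.size) {
      largest = paint;
    }
  }
  return largest;
}

const candidate = lcpCandidate([
  { element: "h1", size: 20000, renderTime: 400 },
  { element: "img.hero", size: 150000, renderTime: 1800 },
  { element: "footer p", size: 5000, renderTime: 2000 },
]);

console.log(candidate?.element, candidate?.renderTime);
```

The heading is the candidate until the hero image paints, which is why a late-loading hero image tends to dominate the metric.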

Interaction to Next Paint

INP is the only CWV metric that relies on user input, and like LCP, INP is a time-based metric. INP measures the time it takes to update the user interface with a paint after an interaction occurs. INP only counts certain types of user input in its metric:

Clicking with a mouse.
Tapping on a device with a touchscreen.
Pressing a key on either a physical or onscreen keyboard.
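As a rough sketch of how those interactions turn into a single number: each interaction's latency is the time from the input to the next paint, and the reported INP is the slowest one, with one outlier ignored for every 50 interactions on busy pages, as I understand Google's definition. The Interaction type and the sample timings are made up for illustration.

```typescript
// One qualifying interaction: when the input happened and when the next
// paint landed, both in ms. Hypothetical type with invented data.
interface Interaction {
  type: "click" | "tap" | "keypress";
  inputTime: number;
  nextPaintTime: number;
}

// Returns the interaction latency that would be reported as INP,
// or undefined when the page had no qualifying interactions.
function inpFromInteractions(interactions: Interaction[]): number | undefined {
  if (interactions.length === 0) {
    return undefined;
  }
  // Latencies sorted from slowest to fastest.
  const latencies = interactions
    .map((interaction) => interaction.nextPaintTime - interaction.inputTime)
    .sort((a, b) => b - a);
  // Ignore one of the slowest interactions per 50 interactions.
  const skipped = Math.min(
    latencies.length - 1,
    Math.floor(interactions.length / 50),
  );
  return latencies[skipped];
}

const sampleInp = inpFromInteractions([
  { type: "click", inputTime: 1000, nextPaintTime: 1040 },
  { type: "keypress", inputTime: 5000, nextPaintTime: 5250 },
  { type: "tap", inputTime: 9000, nextPaintTime: 9090 },
]);

console.log(sampleInp);
```

With only three interactions nothing is skipped, so the slow keypress sets the score for the whole page visit.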

Composite

The final stage of rendering is known as composite, and while it doesn't have a CWV metric that borrows part of its name, it's mentioned here for thoroughness. When a browser paints, it does so in layers and the composite step is responsible for stitching the layers together into a cohesive interface. This is a really complex topic, and you can go incredibly deep on the subject. If you want to learn more about this process, here are some of my favorite long reads on the subject:

Conclusion

This was a big week in web performance, and INP is a step forward from the metric it replaced. Now that all the metrics are tied to parts of the browser rendering process in name, it is also an opportunity to use the fundamentals to teach both CWV and browser rendering simultaneously.