The fastest program is the program that does the least work, and the most effective way to do less work is to avoid repeating work you’ve already done. It follows that caching is probably the most fundamental optimization technique there is, applicable to everything from loading websites, compiling programs, and typesetting documents to buffering disk access and avoiding trips to main memory. One of the most important factors determining a program’s performance is how effectively the CPU’s various caches are used: while poor performance can be caused by executing too many instructions, a poor memory access pattern is often the root of the issue.
In this article, we’ll go over:
The cache hierarchy
The effect of the CPU’s data caches on performance, using a simple benchmark
Some notes on the importance of memory access patterns
I plan to cover more advanced topics such as memory fences, stalls, the instruction cache, the translation lookaside buffer (TLB), and disk access in future articles in this series. As always, code is available on my GitHub.
Let’s get started!
The Cache Hierarchy
Fast memory is (very) expensive, and memory as a whole takes up space. It follows that modern computers usually have only a small amount of the very fastest memory, located on-die right next to the CPU’s other components to ensure access takes as little time as possible: this is called the L1 cache. There is often some slightly slower memory a little further from the core, perhaps shared by multiple cores, and then, eventually, main memory itself, backed by RAM. Some systems have swap space even further down the hierarchy, stored on an SSD or even a hard disk. This system of caches layered over main memory is called the cache hierarchy, and it dramatically improves the bandwidth and latency of our computer’s memory.
When an uncached memory address is first accessed (a cache miss), the entire cache line (often 64 bytes) of memory containing that address is loaded into the L1 cache by the CPU. Subsequent accesses to memory in the L1 cache (an L1 cache hit) can take as little as a single nanosecond. Alas, the L1 cache is very small, with sizes typically in the tens of KB. As the cache fills up, older entries are evicted to higher cache levels, which may be shared among multiple cores.1 Eventually, entries are evicted from the cache altogether, and future reads of the same memory will require a trip back to main memory, which can take up to 80 times as long. Typical latencies and bandwidths for an Intel processor are:
1 ns, 1 TB/s for an L1 cache read
4 ns, 1 TB/s for an L2 cache read
40 ns, >400 GB/s for an L3 cache read
80 ns, ~100 GB/s for main memory
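To make the cache-line granularity concrete, here is a minimal Rust sketch (assuming the common 64-byte line size; the constant and helper below are purely illustrative and not part of the benchmark code we’ll write later). Two addresses fall on the same cache line exactly when they agree once the low log2(64) = 6 bits are cleared, so touching one byte pulls its 63 neighbours into L1 along with it:
const CACHE_LINE_BYTES: usize = 64;

// The base address of the cache line containing `addr`.
fn cache_line_base(addr: usize) -> usize {
    addr & !(CACHE_LINE_BYTES - 1)
}

fn main() {
    let buffer = vec![0u8; 1024];
    let first = &buffer[0] as *const u8 as usize;
    // The first 64 bytes of the allocation span at most two cache lines; how
    // they split depends on the allocation's alignment.
    let same_line = (0..CACHE_LINE_BYTES)
        .filter(|&i| cache_line_base(first + i) == cache_line_base(first))
        .count();
    println!("{same_line} of the first 64 bytes share byte 0's cache line");
}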
As an example (since I will be running my benchmarks on it), my Intel i9-9900KF CPU has2
L1 data cache: 32 KB/core
L1 instruction cache: 32 KB/core
L2 cache: 256 KB/core
L3 cache: 16 MB, shared across all cores
Temporal Locality
As we can see from the latency numbers above, maximizing the proportion of cache hits among memory accesses is essential to achieving good performance. One of the simplest ways to do this is to minimize the number of distinct memory addresses we access at any one time: this increases the probability that a given access will be in a given cache level, since more (or all) of the desired addresses will fit. Let’s write a simple benchmark to see what effect this has on performance. Our experiment will be as follows:
We’ll hash a set of n bytes distributed randomly throughout a 1GB buffer along with their indices, updating the bytes to be the first 8 bits of their hash
We’ll measure our throughput: how many objects we can hash in 1 second
We’ll start by making a new crate and adding a few dependencies and a criterion benchmark, spread_cache, via the following Cargo.toml3:
[package]
name = "spread-cache"
version = "0.1.0"
edition = "2021"
[dependencies]
fxhash = "0.2.1"
[dev-dependencies]
criterion = "0.5.1"
[[bench]]
name = "spread_cache"
harness = false
Our lib.rs will consist of a single function that does a configurable amount of work (“work_factor”) to compute a random index given the current index and a seed value, implemented as follows:
#[inline(never)]
pub fn next_index(mut current_index: usize, work_factor: usize, value: u8) -> usize {
    for i in 0..work_factor {
        current_index = fxhash::hash(&(current_index, work_factor, i, value));
    }
    current_index
}
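To get a feel for how we’ll use this function, here is a quick standalone sketch (the small buffer size and loop count are purely illustrative): the hash output is reduced modulo the buffer’s length to obtain a pseudorandom offset, which we then chase to produce the next one.
use spread_cache::next_index;

fn main() {
    // A small buffer for illustration; the benchmarks below use a 1 GB one.
    let buffer = vec![123u8; 1 << 20];
    let mut current_index = 0usize;
    for _ in 0..4 {
        // Hash the current position and the byte stored there, then wrap the
        // result back into the buffer's index range.
        current_index = next_index(current_index, 1, buffer[current_index]) % buffer.len();
        println!("next offset: {current_index}");
    }
}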
We’ll start out by simply figuring out how much time computing next_index once takes. To do so, we’ll fill in benches/spread_cache.rs as follows:
use criterion::black_box;
use criterion::BenchmarkId;
use criterion::Criterion;
use criterion::Throughput;
use criterion::{criterion_group, criterion_main};
use spread_cache::next_index;

fn spread_cache(c: &mut Criterion) {
    let mut group = c.benchmark_group("spread_cache");

    // `LOG_MAX_SPREAD`, the 1 GB `buffer`, and `current_index` are shared by
    // every benchmark in this file; they are defined in the next listing.

    group.throughput(Throughput::Elements(1));
    group.bench_with_input("work_1", &LOG_MAX_SPREAD, |b, &log_spread| {
        b.iter(|| {
            let next_index =
                next_index(current_index, 1, buffer[current_index]) % (1 << log_spread);
            black_box(next_index)
        });
    });
}

criterion_group!(benches, spread_cache);
criterion_main!(benches);
criterion_group!(benches, spread_cache);
criterion_main!(benches);
Running this, we obtain
spread_cache/work_1 time: [2.0857 ns 2.0876 ns 2.0897 ns]
thrpt: [478.54 Melem/s 479.03 Melem/s 479.44 Melem/s]
i.e., hashing one element takes about 2.09 ns, giving us a throughput of roughly 479 million elements per second.
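(The two numbers criterion reports are simply reciprocals of one another: 1 / 2.0876 ns ≈ 4.79 × 10⁸ elements per second, i.e. about 479 Melem/s.)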
We can now write our actual benchmark. One way to choose one of n random objects is to ensure that only n possible inputs may be given to next_index, since it is highly likely that each input will be mapped to a unique output, which can then be used to derive an offset into our buffer by simply taking the output modulo the buffer’s length. Since there are 2**8 possible values for next_index’s value parameter, assuming work_factor is held constant, all we need to do is take current_index modulo 2**m to guarantee that at most 2**(m + 8) output values are possible. Translated into code, this becomes
const LOG_MAX_SPREAD: u32 = 30;
const MAX_SPREAD: usize = 1 << LOG_MAX_SPREAD;

let mut buffer = vec![123u8; MAX_SPREAD];
let mut current_index = 0;

for log_spread in 8..=LOG_MAX_SPREAD {
    group.throughput(Throughput::Elements(1));
    group.bench_with_input(
        BenchmarkId::new("repeat", log_spread),
        &(log_spread - 8),
        |b, &log_index| {
            b.iter(|| {
                let next_index = next_index(
                    current_index % (1 << log_index),
                    1,
                    buffer[current_index],
                ) % MAX_SPREAD;
                buffer[current_index] = next_index as u8;
                current_index = next_index
            });
        },
    );
}
Here, log_spread corresponds to the logarithm of the number of unique inputs to next_index.
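Before looking at the results, we can sanity-check the uniqueness argument with a quick standalone sketch (not part of the benchmark; m = 4 is an arbitrary choice): restricting current_index to its low m bits and enumerating all 2**8 byte values should produce at most 2**(m + 8) distinct outputs, and with a decent hash, close to that many.
use std::collections::HashSet;
use spread_cache::next_index;

fn main() {
    let m = 4; // keep only the low 4 bits of current_index
    let mut outputs = HashSet::new();
    for current_index in 0..(1usize << m) {
        for value in 0..=255u8 {
            // work_factor is held constant at 1, as in the benchmark.
            outputs.insert(next_index(current_index, 1, value));
        }
    }
    // At most 2**(m + 8) = 4096 distinct outputs are possible.
    println!("distinct outputs: {} (max {})", outputs.len(), 1usize << (m + 8));
}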
Plotting our results, we obtain
Qualitatively, we can clearly see that the time it takes to hash a single object has an S-shaped (logistic) dependency on the logarithm of the number of unique elements: performance is high while the elements all fit in cache, drops rapidly as the number of elements outstrips the cache’s size (increasing the likelihood that a given element is not cached), and finally plateaus once almost all elements are outside the cache. Furthermore, this effect is dramatic: hashing an element outside the cache takes about 14x as long (61 ns) as hashing an element in L1 (4.25 ns)!
Zooming in, we can see a similar pattern for the very smallest numbers of elements, for which all elements presumably fit in the L1 cache; here, however, the plateau is cut short by cache misses further down the hierarchy.
As we can see, trying to fit the data we’re actively processing at a given moment into cache can lead to big performance wins. This is an example of the broader optimization principle of maximizing temporal locality: the degree to which a program accesses the same memory locations repeatedly within a short span of time.
While reducing the total amount of memory involved in a given computation can be a very useful optimization technique, a more straightforward way of increasing temporal locality is to try to sequence multiple operations on the same memory close to each other (preferably, back to back). For example, we could update each element of an array 32 times before moving on to the next element, rather than updating the entire array element-by-element 32 times. This style of programming also often takes better advantage of the ability of compilers and modern CPUs to coalesce memory operations.
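To make the two orderings concrete, here is a sketch using a made-up per-element update function (hypothetical, and much simpler than the benchmark’s hashing). Both functions compute the same result, but the second performs all 32 updates to an element while it is still hot in cache:
// Hypothetical per-element update, standing in for whatever work we do.
fn update(x: u8) -> u8 {
    x.wrapping_mul(31).wrapping_add(1)
}

// Poor temporal locality: by the time we revisit data[i] on the next pass,
// it may have been evicted (if the slice is larger than the cache).
fn thirty_two_passes(data: &mut [u8]) {
    for _ in 0..32 {
        for x in data.iter_mut() {
            *x = update(*x);
        }
    }
}

// Better temporal locality: all 32 updates to an element happen back to back,
// while it is (almost certainly) still in L1.
fn thirty_two_updates_per_element(data: &mut [u8]) {
    for x in data.iter_mut() {
        for _ in 0..32 {
            *x = update(*x);
        }
    }
}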
To demonstrate, let’s adjust our benchmark to measure how long it takes to hash the same memory 32 times: we can do this by just passing in work_factor = 32, as follows:
for log_spread in 8..=LOG_MAX_SPREAD {
    group.throughput(Throughput::Elements(32));
    group.bench_with_input(
        BenchmarkId::new("repeat_32", log_spread),
        &(log_spread - 8),
        |b, &log_index| {
            b.iter(|| {
                let next_index = next_index(
                    current_index % (1 << log_index),
                    32,
                    buffer[current_index],
                ) % MAX_SPREAD;
                buffer[current_index] = next_index as u8;
                current_index = next_index
            });
        },
    );
}
Running our benchmark, we obtain:
As we can see, we obtain exactly the same trend4, but now the worst case takes less than 2x the time of the best case; in other words, we’ve significantly reduced the relative impact of dealing with a large number of objects.
We can’t directly compare this benchmark to the previous one, since it’s not doing the same thing. Instead, it makes more sense to compare throughput, i.e., the number of objects processed per second, which we can compute by dividing the number of objects processed per iteration of the benchmark loop (here 32, versus the repeat benchmark’s 1) by the time taken per iteration. Doing so, we obtain:
As we can see, there is a significant increase in throughput even for a much larger number of distinct memory locations accessed.
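To see why per-element throughput is the right comparison (with made-up numbers, purely for illustration): if one repeat iteration took 60 ns and one repeat_32 iteration took 240 ns, repeat_32 would still come out well ahead, at 32 / 240 ns ≈ 133 Melem/s versus 1 / 60 ns ≈ 17 Melem/s, despite each iteration taking 4x as long.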
Spatial Locality
Another way to make better use of our cache is to increase spatial locality: to make it so that subsequent memory accesses are likely to be close to previous memory accesses.
Let’s modify our previous benchmark so that, instead of capping the number of distinct memory addresses we access, we instead cap the range within the buffer from which we sample.
We can do so by adding the following benchmark:
for log_spread in 0..=LOG_MAX_SPREAD {
    group.throughput(Throughput::Elements(1));
    group.bench_with_input(
        BenchmarkId::new("spread", log_spread),
        &log_spread,
        |b, &log_spread| {
            b.iter(|| {
                let next_index = next_index(
                    current_index,
                    1,
                    buffer[current_index],
                ) % (1 << log_spread);
                buffer[current_index] = next_index as u8;
                current_index = next_index
            });
        },
    );
}
Note that in addition to decreasing the total number of distinct memory addresses accessed (and hence improving temporal locality) like the repeat benchmark above, this will also increase the spatial locality of accesses by clustering objects closer together. Hence, we should expect our results to be strictly better than those of the repeat benchmark for the same value of log_spread.
Running our benchmark, we obtain the following results:
We seem to get a similar curve to repeat, with a similar 15x performance difference between the best and worst cases.
In fact, the attentive eye might notice that the two curves are offset on the x-axis by about 6: this reflects the fact that on my system a cache line is 64, or 2⁶, bytes, and when we access an element, we have to load the entire cache line containing that element into the cache. In the repeat benchmark, a 1-byte element can therefore take up to 64 bytes of cache space, whereas almost every cache line loaded in the spread benchmark has all 64 of its bytes corresponding to entries. Indeed, it turns out we can use the spread benchmark to deduce the size of our caches: we see that our runtime begins to increase at log_spread = 18 before briefly plateauing and then increasing again at log_spread = 24. This is interesting, since 2**18 == 256KB, the size of the L2 cache, and 2**24 == 16MB, the size of the L3 cache5. Presumably, each increase begins as the probability of a cache miss rises, with a plateau once almost all memory accesses lead to a cache miss, until the next level of the hierarchy’s capacity is in turn exhausted.
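To put rough numbers on that offset (a back-of-the-envelope estimate, not a measurement): with n distinct 1-byte elements in play, the repeat benchmark scatters them across the whole 1 GB buffer, so they occupy up to n separate cache lines, i.e. up to n × 64 bytes of cache, whereas the spread benchmark packs the same n elements into a contiguous n-byte range, i.e. roughly n / 64 cache lines. A given cache level can therefore hold roughly 64x (= 2**6) as many usable elements in the spread case, which accounts for the ~6-unit offset between the two curves on the logarithmic x-axis.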
Just as optimizing temporal locality directly (by re-using the same object multiple times) yielded better results than doing so indirectly (by reducing the total number of objects in use at once), we can try to optimize spatial locality directly by changing our memory access pattern: in all the previous benchmarks, we’re essentially jumping around memory at random. What if, instead, we iterate over the buffer in order, as in the following benchmark:
for log_spread in 0..=LOG_MAX_SPREAD {
    group.throughput(Throughput::Elements(1));
    group.bench_with_input(
        BenchmarkId::new("linear", log_spread),
        &log_spread,
        |b, &log_spread| {
            b.iter(|| {
                let next_index = next_index(
                    current_index,
                    1,
                    buffer[current_index],
                ) % (1 << log_spread);
                buffer[current_index] = next_index as u8;
                current_index += 1;
                current_index %= 1 << log_spread;
            });
        },
    );
}
Plotting our results, we obtain:
Quite remarkably, not only do we outperform all the previous benchmarks, finishing an iteration in around 2.7 nanoseconds, but our speed is almost independent of the total number of items.
Indeed, plotting our throughput, we obtain the following encouraging graph6:
Conclusion
In conclusion, optimizing cache usage has a dramatic effect on performance: “compute is free, moving memory is expensive.” It therefore pays dividends to organize one’s data structures so that they fit into cache and are accessed in a predictable pattern, even if it means executing (up to 15x) more instructions. We’ll go over some ways to do this in future articles!
Exercises
Try to determine your CPU’s cache hierarchy by running the benchmarks yourself. Does it agree with the results of lstopo?
Why is iterating over an IndexMap faster than iterating over a HashMap?
Why can ECS-based design lead to better cache utilization than traditional OOP?
Entries may also be evicted due to special fence/flush instructions, or due to another core accessing the same memory concurrently (potentially leading to false sharing), but we’ll cover that in another article.
Obtained using the lstopo command.
For more on what this means, check out our previous article.
We’ll ignore the anomalous data for log_spread = 23, 24… real world data is messy!
The L1 cache, which should appear around log_spread = 15, is not reflected in the benchmark; I am not sure why.
Here, spread(32) denotes the spread benchmark with work_factor = 32 and throughput set accordingly to Elements(32).