Optimizing for non-uniform memory and cache effects in Stockfish

Author
  Timothy Herchen

I've recently been contributing to the computer chess engine Stockfish! My work thus far has mostly been on small performance optimizations around the codebase, but I've also introduced some changes to the engine's search strategy. Funnily enough, the largest search patch also introduced an interesting performance problem—the topic of this blog post.

The condemned patch: Sharing correction history between search threads

Strong alpha–beta chess engines that support multi-threading generally use a parallel search strategy called Lazy SMP. In short, multiple search threads are spawned (typically one per CPU core) and begin searching from the root position. They don't coordinate with each other, except rather indirectly, through a large shared hash map called the transposition table (TT), which maps searched positions to information about that position (e.g., a prospective best move). Lazy SMP is simple and effective, even when scaling to hundreds of threads on high-end systems.
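As a sketch of the mechanism, here's a toy shared table in the spirit of Lazy SMP. The names and the packing scheme are illustrative only; Stockfish's real TT stores much more per entry (score, depth, bound type, etc.) and manages replacement carefully:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy sketch (not Stockfish's actual TT): threads share nothing but a
// fixed-size table mapping position hashes to a best-move guess. Entries
// are racy by design; a stale or clobbered entry only costs search
// quality, never correctness.
struct TTEntry {
    std::atomic<uint64_t> packed{0};
};

struct TranspositionTable {
    std::vector<TTEntry> entries;
    explicit TranspositionTable(std::size_t n) : entries(n) {}

    void store(uint64_t key, uint16_t move) {
        // Upper 48 bits of the key act as a verification tag; the low
        // 16 bits hold the move. Keys sharing upper bits can collide.
        uint64_t v = (key & ~0xFFFFULL) | move;
        entries[key % entries.size()].packed.store(v, std::memory_order_relaxed);
    }

    bool probe(uint64_t key, uint16_t& move) const {
        uint64_t v = entries[key % entries.size()].packed.load(std::memory_order_relaxed);
        if ((v & ~0xFFFFULL) != (key & ~0xFFFFULL))
            return false;  // empty slot, or a different position hashed here
        move = static_cast<uint16_t>(v & 0xFFFF);
        return true;
    }
};
```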

Meanwhile, each thread independently maintains a rich set of data gathered during search. Much of this data takes the form of histories. Histories tabulate how positions or moves with a particular property (such as a certain pawn structure) have historically performed during search. These histories allow search to adapt to statistical patterns found in the game at hand that might not apply to other games (e.g., "this pawn structure tends to be good for Black").

My straightforward idea was to share some of these histories between threads. That way, if one thread found that a particular feature tended to be good or bad, another thread could immediately use that information in its search, rather than having to rediscover it. I also decided to scale the shared history tables proportionally to the thread count, which maintains the total memory footprint while (hopefully) avoiding regressions due to excessive hash collisions.
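The scaling idea can be sketched as follows (hypothetical names, not the actual patch): a single shared table sized in proportion to the thread count, so that total memory matches N per-thread tables and the expected contention per entry stays roughly constant.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: one shared correction-history table whose size
// scales with the thread count, matching the footprint of N per-thread
// tables while keeping the hash-collision rate roughly constant.
struct SharedCorrhist {
    std::vector<int16_t> entries;

    explicit SharedCorrhist(std::size_t numThreads, std::size_t basePerThread = 65536)
        : entries(numThreads * basePerThread, 0) {}

    int16_t& probe(uint64_t pawnHash) { return entries[pawnHash % entries.size()]; }
};
```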

Engine self-play tests showed that this idea worked for a particular type of history (correction history), and further testing suggested that the positive effect increased with the number of search threads; the strength gain was somewhat larger at 64 threads than at 8 threads.

So all's good in shared history land, yes?

An emergent performance problem

Alas! The first sign of trouble came when I ran the shared history branch on all threads of an EPYC 9755 (a 128-core, 256-thread Zen 5 processor) configured with 4 NUMA1 nodes (and therefore 64 threads per NUMA node). The result was that the shared history patch made Stockfish 40% slower! Evidently, the issue was a large number of memory accesses to remote NUMA nodes, both reads of history values and writes to adjust those values.

Some context

The actual history updates look like this (adapted from history.h):

template<int D>
struct StatsEntry {
   private:
    std::atomic<int16_t> entry;

   public:
    void update(int bonus) {
        int     clampedBonus = std::clamp(bonus, -D, D);
        int16_t val          = entry.load(std::memory_order_relaxed);
        entry.store(val + clampedBonus - val * std::abs(clampedBonus) / D,
            std::memory_order_relaxed);

        assert(std::abs(entry.load(std::memory_order_relaxed)) <= D);
    }
};

using CorrhistEntry = StatsEntry<1024>;

// dummy example of how this is used
std::array<CorrhistEntry, 65536> history;
history[pawn_hash % 65536].update(100);  // give pawn structure a positive bonus

A bonus of 0 leaves the entry unchanged, a positive bonus nudges the entry higher, and a negative bonus nudges it lower. The entry itself is two bytes and is constrained to (in this case) [-1024, 1024].
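To convince yourself that the formula really keeps entries bounded, here is a standalone check of the invariant (not Stockfish code), using plain ints in place of the atomic entry:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Standalone check of the update formula's invariant: if |val| <= D,
// the updated value stays within [-D, D] for any bonus.
constexpr int D = 1024;

int updated(int val, int bonus) {
    int clamped = std::clamp(bonus, -D, D);
    return val + clamped - val * std::abs(clamped) / D;
}

void check_bounds() {
    for (int val = -D; val <= D; ++val)
        for (int bonus = -2 * D; bonus <= 2 * D; bonus += 7)
            assert(std::abs(updated(val, bonus)) <= D);
}
```

Intuitively, the `- val * |bonus| / D` term pulls the entry back toward zero in proportion to both the current value and the bonus size, so large entries resist being pushed past the cap.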

Notably, although we're using atomic loads and stores—necessary to avoid data-race undefined behavior—the updates themselves are not atomic because a thread does not hold exclusive access to the history entry for the duration of the update. Two separate threads could read the same old value and write their new value (losing one update in the process), whereas properly serialized updates would have produced a third, different result. This does mean that we'll very occasionally lose an update, but the impact on engine strength is negligible.
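The lost-update behavior is easy to reproduce outside the engine. In this sketch, threads perform the same relaxed load-then-store pattern on a shared counter:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Demonstration (not engine code) of the lost-update race: each thread does
// a relaxed load, then a relaxed store of old+1. The individual accesses are
// atomic (so there is no data-race UB), but the read-modify-write as a whole
// is not, so concurrent updates can overwrite each other and the final count
// usually falls short of nThreads * iters.
int run_contended(int nThreads, int iters) {
    std::atomic<int>         entry{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; ++t)
        threads.emplace_back([&entry, iters] {
            for (int i = 0; i < iters; ++i) {
                int v = entry.load(std::memory_order_relaxed);
                entry.store(v + 1, std::memory_order_relaxed);
            }
        });
    for (auto& th : threads)
        th.join();
    return entry.load(std::memory_order_relaxed);
}
```

Replacing the load/store pair with `entry.fetch_add(1)` would make the update atomic, at the cost of a more expensive read-modify-write instruction on the contended line.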

Of course, simply making updates non-atomic doesn't help with the issue of slow cross-node accesses.

Partitioning by NUMA node

A fair solution—easy to implement because of Stockfish's existing infrastructure—is to simply allocate one shared history table per NUMA node. This brought the speed penalty down to 6%, which is not great, but is plausibly compensated for by the strength gain from sharing.

While this worked when the system was booted with 4 NUMA nodes, after rebooting the system with 1 NUMA node, the slowdown reappeared: now only 25% instead of 40%, but still unacceptably large.

My initial suspicion was that the increased thread count was causing too much sharing, with potentially many threads hammering the same cache line. But partitioning the 256 threads into 64-thread groups only recovered part of the performance (~15% slower, from 25%). It was only when pinning the 64-thread groups to specific CPU cores (in particular, pinning them as if we had 4 NUMA nodes—to physical cores 0–31, 32–63, 64–95, 96–127) that I recovered the smaller slowdown of 6%.
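For reference, the pinning itself can be done on Linux with sched_setaffinity. This is just a sketch of the mechanism, not Stockfish's actual NUMA code:

```cpp
#include <sched.h>

// Linux-only sketch: restrict the calling thread to a contiguous range of
// cores, e.g. cores 0-31 for the first 64-thread group on the test machine.
// (Hypothetical helper; Stockfish's real binding code is more elaborate.)
bool pin_current_thread(int firstCore, int lastCore) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = firstCore; c <= lastCore; ++c)
        CPU_SET(c, &set);
    // First argument 0 means "the calling thread"; each search thread
    // would call this with its group's core range before starting work.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```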

Non-uniform cache access

AMD's recent CPUs have a chiplet-based structure, with multiple so-called core chiplet dies (CCDs), each containing 8 or 16 CPU cores, rather than one massive die containing all cores. For the aforementioned EPYC 9755, there are 16 chiplets, each with 8 physical cores. The CCDs are further subdivided into core complexes (CCXs).2 Straight from the horse's mouth:

(Figure: AMD's diagram of the CCD/CCX topology.)

Importantly, each CCX has its own level-3 cache domain; a cache line may therefore be simultaneously resident in multiple L3 caches. Cache coherency across CCDs is maintained over the slower Infinity Fabric interconnect. Indeed, when one core incurs a local L3 miss and the line is found in another CCD's L3, a remote cache access occurs over the fabric. While not quite as expensive as a full miss that must go to DRAM, these remote cache accesses have considerable latency: this Chips and Cheese article reports 200 ns on a Zen 5 desktop CPU.

This microarchitectural design accounts for why artificial subdivision into "NUMA nodes" helped: the groupings coincided precisely with CCD boundaries, and the smaller per-group tables achieve better residency in each group's local L3 caches.

Numerical evidence

I used Linux perf to record useful data from a standardized speed test in various modes. In particular, we can record the source of L1 cache fills: the core-private L2 cache, the local CCX's L3, remote CCXs, and DRAM.

At a high level, program-wide, there isn't an obvious difference between configs...

... but there is a clear impact when looking specifically at remote CCX accesses.

And here's the NPS data for the various configs:

So pinning threads to (groups of) L3 cache domains brings an average boost of ~4%. The benefit is lessened somewhat by early prefetching of the history entries (source), which helps a lot more in the no-pin case than the pinned case (seems to be around 6% vs. 1% for 16 threads, respectively).
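As a sketch of the prefetching idea (hypothetical names, not the actual Stockfish call site), a prefetch can be issued as soon as the pawn hash is known, well before the entry is read:

```cpp
#include <cstdint>

// Hypothetical sketch: issue a prefetch for the history entry as soon as
// the pawn-structure hash is available, hiding part of the latency of a
// remote-cache fill. __builtin_prefetch is a GCC/Clang builtin.
struct CorrhistTable {
    int16_t entries[65536];

    void prefetch(uint64_t pawnHash) const {
        // Arguments: address, 0 = prefetch for read, 1 = low temporal locality
        __builtin_prefetch(&entries[pawnHash % 65536], 0, 1);
    }

    int16_t probe(uint64_t pawnHash) const { return entries[pawnHash % 65536]; }
};
```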

Of course, the remote CCX fetches are only part of the performance story. As I mentioned, I suspect there's also a component of true sharing at very high thread counts. Despite the proportional scaling of history size, threads might often write to the same entry and cause a flood of RfOs3 that need to be serialized.

(The raw data are available here.)

A sensible default

This is great and all, but what default should we choose? vondele ran some many-core self-play tests (10 seconds per side, 128 threads) to estimate the strength impact of sharing:

   # PLAYER      :  RATING  ERROR   POINTS  PLAYED   (%)
   1 master32    :     0.0   ----  10369.0   20480    51
   2 master16    :    -2.8    4.1  10249.5   20480    50
   3 master08    :    -6.3    4.2  10101.5   20480    49

From these tests we inferred that more sharing was worth the small speed loss, and as a compromise, decided to bundle L3 domains to reach 32 threads per shared history table. We also avoid bundling L3 domains that are on different NUMA nodes, since the cross-node penalties would be prohibitive. Users can try different configurations at runtime with the (non-standard) NumaPolicy UCI option.
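The bundling policy can be sketched roughly as follows (hypothetical helper, not the actual Stockfish code): greedily merge consecutive L3 domains until a group reaches the target thread count, starting a fresh group whenever the next domain lives on a different NUMA node.

```cpp
#include <cstddef>
#include <set>
#include <vector>

using CpuIndex  = std::size_t;
using NumaIndex = std::size_t;

struct L3Domain {
    NumaIndex          systemNumaIndex{};
    std::set<CpuIndex> cpus{};
};

// Hypothetical sketch of the bundling policy: merge consecutive L3 domains
// into groups, closing a group once it reaches the target CPU count or the
// next domain lies on a different NUMA node.
std::vector<std::set<CpuIndex>> bundle_domains(const std::vector<L3Domain>& domains,
                                               std::size_t                  target = 32) {
    std::vector<std::set<CpuIndex>> groups;
    NumaIndex                       currentNode = 0;
    for (const L3Domain& d : domains)
    {
        if (groups.empty() || groups.back().size() >= target
            || d.systemNumaIndex != currentNode)
        {
            groups.emplace_back();
            currentNode = d.systemNumaIndex;
        }
        groups.back().insert(d.cpus.begin(), d.cpus.end());
    }
    return groups;
}
```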

Obtaining L3 information from the OS

Actually obtaining the L3 topology at runtime requires some code monkeying; even libnuma (which SF avoids using) doesn't supply this information! We need OS-specific implementations.

Linux

Here we can use the sysfs API and scan through the active CPUs for their L3 cache siblings. In particular, the list is exposed at /sys/devices/system/cpu/cpu<n>/cache/index3/shared_cpu_list.

struct L3Domain {
    NumaIndex          systemNumaIndex{};
    std::set<CpuIndex> cpus{};
};

std::vector<L3Domain> l3Domains;
std::set<CpuIndex>    seenCpus;
auto                  nextUnseenCpu = [&seenCpus]() {
    for (CpuIndex i = 0;; ++i)
        if (!seenCpus.count(i))
            return i;
};

while (true)
{
    CpuIndex next        = nextUnseenCpu();
    auto     siblingsStr = read_file_to_string("/sys/devices/system/cpu/cpu" + std::to_string(next)
                                               + "/cache/index3/shared_cpu_list");

    if (!siblingsStr.has_value() || siblingsStr->empty())
    {
        break;  // we have read all available CPUs
    }

    L3Domain domain;
    for (size_t c : indices_from_shortened_string(*siblingsStr))  // e.g. "0-5,10-15"
    {
        domain.systemNumaIndex = get_numa_index(c);
        domain.cpus.insert(c);
        seenCpus.insert(c);
    }
    if (!domain.cpus.empty())
    {
        l3Domains.emplace_back(std::move(domain));
    }
}
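For completeness, here is one possible implementation of the indices_from_shortened_string helper used above (a sketch; Stockfish's actual helper may differ):

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

using CpuIndex = std::size_t;

// Sketch: expand a sysfs CPU list such as "0-5,10-15" into the set of
// individual indices. Tokens are comma-separated, each either a single
// index or an inclusive "lo-hi" range.
std::set<CpuIndex> indices_from_shortened_string(const std::string& s) {
    std::set<CpuIndex> indices;
    std::istringstream ss(s);
    std::string        token;
    while (std::getline(ss, token, ','))
    {
        std::size_t dash = token.find('-');
        CpuIndex    lo   = std::stoull(token.substr(0, dash));
        CpuIndex    hi   = dash == std::string::npos ? lo : std::stoull(token.substr(dash + 1));
        for (CpuIndex i = lo; i <= hi; ++i)
            indices.insert(i);
    }
    return indices;
}
```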

Windows

Windows is less straightforward; we use the function GetLogicalProcessorInformationEx, which writes the processor information into a buffer of variable-length entries that we then scan through.

DWORD bufSize = 0;
GetLogicalProcessorInformationEx(RelationCache, nullptr, &bufSize);
if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
    return std::nullopt;

std::vector<char> buffer(bufSize);
auto info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer.data());
if (!GetLogicalProcessorInformationEx(RelationCache, info, &bufSize))
    return std::nullopt;

while (reinterpret_cast<char*>(info) < buffer.data() + bufSize)
{
    info = std::launder(info);
    if (info->Relationship == RelationCache && info->Cache.Level == 3)
    {
        L3Domain domain{};
        domain.cpus = readCacheMembers(info);
        if (!domain.cpus.empty())
        {
            domain.systemNumaIndex = systemConfig.nodeByCpu.at(*domain.cpus.begin());
            l3Domains.push_back(std::move(domain));
        }
    }
    // Variable length data structure, advance to next
    info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(
      reinterpret_cast<char*>(info) + info->Size);
}

readCacheMembers has two implementations, one for Windows 10 and one for Windows 11:

constexpr unsigned WIN_PROCESSOR_GROUP_SIZE = 64;

template<typename T, typename = void>
struct HasGroupCount: std::false_type {};

template<typename T>
struct HasGroupCount<T, std::void_t<decltype(std::declval<T>().Cache.GroupCount)>>
    : std::true_type {};

template<typename T, std::enable_if_t<HasGroupCount<T>::value, bool> = true>
std::set<CpuIndex> readCacheMembers(const T* info) {
    std::set<CpuIndex> cpus;
    // On Windows 10 this will read a 0 because GroupCount doesn't exist
    int groupCount = std::max(info->Cache.GroupCount, WORD(1));
    for (WORD procGroup = 0; procGroup < groupCount; ++procGroup)
    {
        for (BYTE number = 0; number < WIN_PROCESSOR_GROUP_SIZE; ++number)
        {
            WORD           groupNumber = info->Cache.GroupMasks[procGroup].Group;
            const CpuIndex c = static_cast<CpuIndex>(groupNumber) * WIN_PROCESSOR_GROUP_SIZE
                             + static_cast<CpuIndex>(number);
            if (!(info->Cache.GroupMasks[procGroup].Mask & (1ULL << number)))
                continue;
            cpus.insert(c);
        }
    }
    return cpus;
}

template<typename T, std::enable_if_t<!HasGroupCount<T>::value, bool> = true>
std::set<CpuIndex> readCacheMembers(const T* info) {
    std::set<CpuIndex> cpus;
    for (BYTE number = 0; number < WIN_PROCESSOR_GROUP_SIZE; ++number)
    {
        WORD           groupNumber = info->Cache.GroupMask.Group;
        const CpuIndex c           = static_cast<CpuIndex>(groupNumber) * WIN_PROCESSOR_GROUP_SIZE
                         + static_cast<CpuIndex>(number);
        if (!(info->Cache.GroupMask.Mask & (1ULL << number)))
            continue;
        cpus.insert(c);
    }
    return cpus;
}

The SFINAE slop is necessary because Windows 10 doesn't have the info->Cache.GroupCount member. Importantly, when a binary compiled for Windows 11 runs on Windows 10, it'll read a zero from that nonexistent field; hence we take a max with 1 so that it reads one bitmask. (GroupCount was added to allow for a cache domain with >64 logical cores to be accurately reported.)

Conclusions

  • Non-uniform cache access is pretty weird to reason about as an application programmer, but it's a very reasonable design choice for keeping L3 latency low as systems ship with huge caches and core counts.
  • Pinning threads, and keeping cache-private tables under sustained concurrent reads and writes, can significantly improve performance.
  • Prefetching may make a bigger-than-usual impact in these settings, even for working sets that ostensibly fit in L3 cache, by helping to hide the considerable latency of remote cache accesses.
  • It'd be nice to have a lightweight library to handle this stuff. I'm thinking of writing a Rust one for some upcoming projects.

AI disclosure

Generative AI was not used in the creation of this blog post, except to assist me in making the diagrams.

Footnotes

  1. A system configured with non-uniform memory access (Wikipedia) will assign different regions in memory to specific, disjoint groups of CPU cores (called NUMA nodes). For example, for the EPYC 9755 test system with 128 physical cores, configured with 4 NUMA nodes, each consecutive group of 32 cores forms its own NUMA node, i.e., 0–31, 32–63, 64–95, 96–127. Reads/writes from a CPU core from/to memory on the same NUMA node (local accesses) have significantly lower latency than accesses to a different NUMA node (remote accesses).

  2. In the case of the EPYC 9755, however, each CCD is made up of only a single CCX.

  3. Reads for ownership (Wikipedia).
