Notes on vdotq fast-forwarding on ARM processors

Looking at some of linrock's ARM optimizations for Stockfish reminded me of something I'd long-ago noticed in the Neoverse N2 optimization manual that I'd never tested. In the instruction latency/throughput tables, the entries for sdot/udot/sudot/usdot have a note:

Instr. Group	Instrs.	Latency	Throughput	Pipelines	Notes
ASIMD dot product	SDOT, UDOT	3 (1)	4	V	2*
ASIMD dot product using signed and unsigned integers	SUDOT, USDOT	3 (1)	4	V	2*

Notes *1 and *2 state:

Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar μOPs, allowing a typical sequence of integer multiply-accumulate μOPs to issue one every cycle or one every other cycle (accumulate latency shown in parentheses).
Other accumulate pipelines also support late-forwarding of accumulate operands from similar μOPs, allowing a typical sequence of such μOPs to issue one every cycle (accumulate latency shown in parentheses)

Let's quickly break this down.

The instructions

These four NEON instructions all do the same variety of operation (only differentiated by the signedness of each operand). In C pseudocode:

void sdot(int32_t acc[4], int8_t a[16], int8_t b[16]) {
    for (int i = 0; i < 4; ++i) {
        acc[i] += a[4 * i] * b[4 * i] +
            a[4 * i + 1] * b[4 * i + 1] +
            a[4 * i + 2] * b[4 * i + 2] +
            a[4 * i + 3] * b[4 * i + 3];
    }
}

void udot(uint32_t acc[4], uint8_t a[16], uint8_t b[16]);  // etc.
void usdot(int32_t acc[4], uint8_t a[16], int8_t b[16]);
void sudot(int32_t acc[4], int8_t a[16], uint8_t b[16]);

In English: for each 32-bit lane in the NEON vector, we compute the dot-product of corresponding quadruplets of 8-bit integers in a and b. The 8-bit integers are treated as signed or unsigned depending on the instruction mnemonic. We then add this value into the 32-bit accumulator acc. (You can see this nice article, which is about an equivalent x86 instruction, to learn more about this operation.)

These instructions are often used in low-precision integer inference because they efficiently perform an integer dot product, and are in the same spirit as other multiply–accumulate operations like floating-point FMA.

Late forwarding

(Disclaimer: I'm not a CPU engineer, so take my words here with a heap of salt.) Most arithmetic operations have a fixed latency with respect to each of its operands. An integer mul instruction, for example, can only begin execution when both of its operands are ready, and may take several cycles to complete (but on modern processors, it's typically pipelined, so that one can start every cycle).

A µop scheduler might treat every µop as having a fixed latency and requiring all its operands to begin execution, but this approach is suboptimal for complex operations in a pipelined processor, where some parts of the operation naturally occur later in the pipeline.

In particular, for the sdot case, the accumulation clearly happens "last", after the 8-bit integer multiplies and horizontal accumulation. Therefore, it makes sense that the last stage of the sdot pipeline can directly accept the accumulator as input, with one cycle of latency; while the "data" inputs (i.e., the 8-bit integers) have a longer latency of 3 cycles.

This optimization is obviously worth the complexity: sdot is almost always used accumulating repeatedly into the same register.

Actual data

I wrote a very scrappy program to measure the latency between operations:

#include <stdio.h>
#include <time.h>
#include <arm_neon.h>

#define ITERS 1000000000

void bench_addition(void) {
    for (int i = 0, a = 0; i < ITERS; i++)
        asm volatile ("add %w0, %w0, #1" : "+r"(a));
}

void bench_vdotq_acc(void) {
    uint8x16_t a = vdupq_n_u8(0), b = vdupq_n_u8(0), c = vdupq_n_u8(0);
    for (int i = 0; i < ITERS; i++)
        asm volatile ("sdot %0.4s, %1.16b, %2.16b" : "+x"(a) : "x"(b), "x"(c));
}

void bench_vdotq_acc_add(void) {
    uint8x16_t a = vdupq_n_u8(0), b = vdupq_n_u8(0), c = vdupq_n_u8(0);
    for (int i = 0; i < ITERS; i++)
        asm volatile ("sdot %0.4s, %1.16b, %2.16b\nadd %0.4s, %0.4s, %0.4s" : "+x"(a) : "x"(b), "x"(c));
}

void bench_vdotq_data(void) {
    uint8x16_t a = vdupq_n_u8(0), b = vdupq_n_u8(0), c = vdupq_n_u8(0);
    for (int i = 0; i < ITERS; i++)
        asm volatile ("sdot %0.4s, %1.16b, %0.16b" : "+x"(a) : "x"(b), "x"(c));
}

void bench_vdotq_data_add(void) {
    uint8x16_t a = vdupq_n_u8(0), b = vdupq_n_u8(0), c = vdupq_n_u8(0);
    for (int i = 0; i < ITERS; i++)
        asm volatile ("sdot %0.4s, %1.16b, %0.16b\nadd %0.4s, %0.4s, %0.4s" : "+x"(a) : "x"(b), "x"(c));
}

long long elapsed(void (*fn)(void)) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) * 1000000000 + (end.tv_nsec - start.tv_nsec);
}

int main(void) {
    elapsed(bench_addition); // warmup

    long long ticksAdd = elapsed(bench_addition);
    long long ticksSdot = elapsed(bench_vdotq_acc);
    long long ticksSdotAdd = elapsed(bench_vdotq_acc_add);
    long long ticksData = elapsed(bench_vdotq_data);
    long long ticksDataAdd = elapsed(bench_vdotq_data_add);

    printf("Addition: %lld ns\n", ticksAdd);
    printf("vdotq acc -> acc: %lld ns -- %f\n", ticksSdot, (double)ticksSdot / (double)ticksAdd);
    printf("vdotq acc -> acc with add: %lld ns -- %f\n", ticksSdotAdd, (double)ticksSdotAdd / (double)ticksAdd);
    printf("vdotq data-> acc: %lld ns -- %f\n", ticksData, (double)ticksData / (double)ticksAdd);
    printf("vdotq data-> acc with add: %lld ns -- %f\n", ticksDataAdd, (double)ticksDataAdd / (double)ticksAdd);

    return 0;
}

Apple M1 doesn't have the late-forwarding optimization and has a consistent 3-cycle latency:

Addition: 414759000 ns
vdotq acc -> acc: 1242760000 ns -- 2.996342
vdotq acc -> acc with add: 2068012000 ns -- 4.986057
vdotq data-> acc: 1238288000 ns -- 2.985560
vdotq data-> acc with add: 2061117000 ns -- 4.969433

but Neoverse N2 indeed has 1-cycle latency w.r.t. the accumulator:

Addition: 401643344 ns
vdotq acc -> acc: 401498879 ns -- 0.999640
vdotq acc -> acc with add: 2409335059 ns -- 5.998693
vdotq data-> acc: 1204691258 ns -- 2.999406
vdotq data-> acc with add: 2409117740 ns -- 5.998152

Both Apple M1 and Neoverse N2 have four execution units for sdot, so M1 requires 12 accumulators to fully saturate the execution units while N2 requires only 4. In general, accumulator-splitting optimizations are likely to help on Apple processors, and unlikely to help on Neoverse processors.

Specific conditions for the forwarding

The source of the accumulator has to be an sdot or similar instruction. This is proven by the "with add" test cases above, which actually take the latency from 3 to 6 cycles on Neoverse N2! Vector add itself has a 2-cycle latency; there seems to be a one cycle delay associated with forwarding the sdot output to add. Quoting from the N2 optimization guide:

4.7 Region based fast forwarding
The forwarding logic in the V pipelines is optimized to provide optimal latency for instructions which are expected to commonly forward to one another. The effective latency of FP and ASIMD instructions as described in section 3 is increased by one cycle if the producer and consumer instructions are not part of the same forwarding region. These optimized forwarding regions are defined in the following table.
Region 1: ASIMD/SVE ALU [emphasis mine], ASIMD/SVE shift, ASIMD/scalar insert and move, ASIMD/SVE abs/cmp/max/min and the ASIMD miscellaneous instructions in table 3-18. ...
The following instructions are not a part of any region:
...
ASIMD integer mul/mac

In this case, the consumer add is part of "region 1" and the producer sdot is not part of any region, so add sees an extra cycle of latency.

Meanwhile Apple M1 has no forwarding to defeat, nor has any penalty for using the result of sdot as an input to add – it's still $2+3 = 5$ cycles.

Comparisons to other instructions and processors

Arm Neoverse has this type of optimization for the majority of multiply–accumulate operations, including FP, matrix accumulate and scalar madd/msub. Apple M1's FMA is 4 cycles w.r.t. all operands. I haven't tested whether Apple does it for any other instructions, or whether later Apple silicon implements this.

Like Apple M1, x86 processors generally don't seem to have this late forwarding trick.¹ For example, with vpdpbusd on Zen 5:

$ sudo bash nanoBench.sh -asm "vpdpbusd xmm0, xmm1, xmm2"
LsNotHaltedCyc: 4.00
$ sudo bash nanoBench.sh -asm "vpdpbusd xmm0, xmm1, xmm0"
LsNotHaltedCyc: 4.00

No late forwarding in sight: vpdpbusd is 4 cycles of latency with respect to all operands. Some processors are able to reduce the latency of chained floating-point operations, but I'm not aware of any which, for example, can make an vfmadd132ps chain the same speed as an vaddps chain, in the same way that Neoverse can.

AI disclosure

Generative AI was not used in the creation of this blog post.

Footnotes

When you observe such behavior, it typically comes from splitting the instruction into multiple µops which can be scheduled independently. ↩