Relaxed Integer Dot Product instructions

What are the instructions being proposed?
============================

I propose relaxed 8-bit versions of the Dot Product instructions introduced in WebAssembly/simd#127. These instruction expose multipliplication of 8-bit (unsigned or signed) elements by 7-bit (treated as unsigned) elements with accumulation of adjacent products. These instructions are designed to expose the performance benefits of the following native instructions in a portable way:
- x86/x86-64 SSSE3 `PMADDUBSW` instruction
- x86/x86-64 VNNI (either `AVX2-VNNI` or `AVX512-VNNI`) `VPDPBUSD` instruction.
- AArch32 NEON Dot Product instructions (`VSDOT.S8` ~~and `VUDOT.U8`~~)
- AArch64 Dot Product instructions (`SDOT` ~~and `UDOT`~~)

Discussion on [Issue #9](https://github.com/WebAssembly/relaxed-simd/issues/9) goes into great length explaining the performance benefits of the above-mentioned native instructions.

I suggest `i16x8.dot_i8x16_i7x16_s`, ~~`i16x8.dot_i8x16_i7x16_u`~~, `i32x4.dot_i8x16_i7x16_add_s`, and ~~`i32x4.dot_i8x16_i7x16_add_u`~~ as the tentative names for the proposed instructions.

What are the semantics of these instructions?
============================

Both x86 and ARM provide variants of Dot Product instructions on SIMD vectors of 8-bit elements, but differ in the semantics of the input elements:
- On x86 the instructions treat one of the SIMD vectors as having signed 8-bit elements and the other as unsigned 8-bit elements.
- On ARM the input SIMD vectors are treated as either both having signed 8-bit elements, or both having unsigned 8-bit elements.

The proposed instructions resolve this incompatibility by guaranteeing the result only when elements of the second input SIMD vector have at most 7 non-zero bits, as in this case there is no difference between signed and unsigned representation.

**i16x8.dot_i8x16_i7x16_s** is a 2-element dot product instruction consuming signed 8-bit integer elements as the first input SIMD vector and 7-bit integer elements (treated as unsigned) as the second input SIMD vector and producing signed 16-bit integer output elements. The 2-element dot product never overflows as the worst case outputs fit into signed 16-bit integer:

- `-128` * `127` + `-128` * `127` = `-32512` > `INT16_MIN` = `-32768`
- `127` * `127` + `127` * `127` = `32512` < `INT16_MAX` = `32767`

~~**i16x8.dot_i8x16_i7x16_u** is a 2-element dot product instruction consuming unsigned 8-bit integer elements as the first input SIMD vector and 7-bit integer elements (treated as unsigned) as the second input SIMD vector and producing unsigned 16-bit integer output elements. The 2-element dot product never overflows as the worst case outputs fit into unsigned 16-bit integer:~~

- ~~`255` * `127` + `255` * `127` = `64770` < `UINT16_MAX` = `65536`~~

**i32x4.dot_i8x16_i7x16_add_s** is a 4-element dot product with accumulation instruction consuming signed 8-bit integer elements in the first input SIMD vector, 7-bit integer elements (treated as unsigned) in the second input SIMD vector, and 32-bit integer elements (signedness-agnostic) in the third input SIMD vector and producing (signedness-agnostic) 32-bit integer output elements. The 4-element dot product producing a 32-bit result never overflows, and the addition of the third input SIMD vector is performed in modulo arithmetics.

~~**i32x4.dot_i8x16_i7x16_add_u** is a 4-element dot product with accumulation instruction consuming unsigned 8-bit integer elements in the first input SIMD vector, 7-bit integer elements (treated as unsigned) in the second input SIMD vector, and 32-bit integer elements (signedness-agnostic) in the third input SIMD vector and producing (signedness-agnostic) 32-bit integer output elements. The 4-element dot product producing a 32-bit result never overflows, and the addition of the third input SIMD vector is performed in modulo arithmetics.~~

How will these instructions be implemented?
============================

#### x86/x86-64 processors with AVX2-VNNI or AVX512-VNNI instruction set

- **i32x4.dot_i8x16_i7x16_add_s**
  - `c = i32x4.dot_i8x16_i7x16_add_s(a, b, c)` is lowered to `VPDPBUSD xmm_c, xmm_a, xmm_b`
  - `y = i32x4.dot_i8x16_i7x16_add_s(a, b, c)` is lowered to `VMOVDQA xmm_y, xmm_c` + `VPDPBUSD xmm_c, xmm_a, xmm_b`

- ~~**i32x4.dot_i8x16_i7x16_add_u**~~
  - ~~`c = i32x4.dot_i8x16_i7x16_add_u(a, b, c)` is lowered to `VPDPBUSD xmm_c, xmm_b, xmm_a`~~
  - ~~`y = i32x4.dot_i8x16_i7x16_add_u(a, b, c)` is lowered to `VMOVDQA xmm_y, xmm_c` + `VPDPBUSD xmm_c, xmm_b, xmm_a`~~

#### x86/x86-64 processors with XOP instruction set

- **i32x4.dot_i8x16_i7x16_add_s**
  - `y = i32x4.dot_i8x16_i7x16_add_s(a, b)` is lowered to:
    - `VPMADDUBSW xmm_tmp, xmm_b, xmm_a`
    - `VPHADDWD xmm_tmp, xmm_tmp`
    - `VPADDD xmm_y, xmm_y, xmm_tmp`

- ~~**i32x4.dot_i8x16_i7x16_add_u**~~
  - ~~`y = i32x4.dot_i8x16_i7x16_add_u(a, b)` is lowered to:~~
    - ~~`VPMADDUBSW xmm_tmp, xmm_a, xmm_b`~~
    - ~~`VPHADDWD xmm_tmp, xmm_tmp`~~
    - ~~`VPADDD xmm_y, xmm_y, xmm_tmp`~~

#### x86/x86-64 processors with AVX instruction set

- **i16x8.dot_i8x16_i7x16_s**
  - `y = i16x8.dot_i8x16_i7x16_s(a, b)` is lowered to `VPMADDUBSW xmm_y, xmm_b, xmm_a`

- ~~**i16x8.dot_i8x16_i7x16_u**~~
  - ~~`y = i16x8.dot_i8x16_i7x16_u(a, b)` is lowered to `VPMADDUBSW xmm_y, xmm_a, xmm_b`~~

- **i32x4.dot_i8x16_i7x16_add_s**
  - `y = i32x4.dot_i8x16_i7x16_add_s(a, b)` is lowered to:
    - `VPMADDUBSW xmm_tmp, xmm_b, xmm_a`
    - `VPMADDWD xmm_tmp, xmm_tmp, [wasm_i16x8_splat(1)]`
    - `VPADDD xmm_y, xmm_y, xmm_tmp`

- ~~**i32x4.dot_i8x16_i7x16_add_u**~~
  - ~~`y = i32x4.dot_i8x16_i7x16_add_u(a, b)` is lowered to:~~
    - ~~`VPMADDUBSW xmm_tmp, xmm_a, xmm_b`~~
    - ~~`VPMADDWD xmm_tmp, xmm_tmp, [wasm_i16x8_splat(1)]`~~
    - ~~`VPADDD xmm_y, xmm_y, xmm_tmp`~~

#### x86/x86-64 processors with SSSE3 instruction set

- **i16x8.dot_i8x16_i7x16_s**
  - `y = i16x8.dot_i8x16_i7x16_s(a, b)` (`y` is **NOT** `a`) is lowered to `MOVDQA xmm_y, xmm_b` + `PMADDUBSW xmm_y, xmm_a`

- ~~**i16x8.dot_i8x16_i7x16_u**~~
  - ~~`y = i16x8.dot_i8x16_i7x16_u(a, b)` (`y` is **NOT** `b`) is lowered to `MOVDQA xmm_y, xmm_a` + `PMADDUBSW xmm_y, xmm_b`~~

- **i32x4.dot_i8x16_i7x16_add_s**
  - `y = i32x4.dot_i8x16_i7x16_add_s(a, b)` is lowered to:
    - `MOVDQA xmm_tmp, xmm_b`
    - `PMADDUBSW xmm_tmp, xmm_a`
    - `PMADDWD xmm_tmp, [wasm_i16x8_splat(1)]`
    - `PADDD xmm_y, xmm_tmp`

- ~~**i32x4.dot_i8x16_i7x16_add_u**~~
  - ~~`y = i32x4.dot_i8x16_i7x16_add_u(a, b)` is lowered to:~~
    - ~~`MOVDQA xmm_tmp, xmm_a`~~
    - ~~`PMADDUBSW xmm_tmp, xmm_b`~~
    - ~~`PMADDWD xmm_tmp, [wasm_i16x8_splat(1)]`~~
    - ~~`PADDD xmm_y, xmm_tmp`~~

#### ARM64 processors with Dot Product extension

- **i32x4.dot_i8x16_i7x16_add_s**
  - `y = i32x4.dot_i8x16_i7x16_add_s(a, b, c)` is lowered to `MOV Vy.16B, Vc.16B` + `SDOT Vy.4S, Va.16B, Vb.16B`
  - `c = i32x4.dot_i8x16_i7x16_add_s(a, b, c)` is lowered to `SDOT Vc.4S, Va.16B, Vb.16B`

- ~~**i32x4.dot_i8x16_i7x16_add_u**~~
  - ~~`y = i32x4.dot_i8x16_i7x16_add_u(a, b, c)` is lowered to `MOV Vy.16B, Vc.16B` + `UDOT Vy.4S, Va.16B, Vb.16B`~~
  - ~~`c = i32x4.dot_i8x16_i7x16_add_u(a, b, c)` is lowered to `UDOT Vc.4S, Va.16B, Vb.16B`~~

#### ARM64 processors

- **i16x8.dot_i8x16_i7x16_s**
  - `y = i16x8.dot_i8x16_i7x16_s(a, b)` is lowered to:
    - `SMULL Vy.8H, Va.8B, 4B.8B`
    - `SMULL2 Vtmp.8H, Va.16B, 4B.16B`
    - `ADDP Vy.8H, Vy.8H, Vtmp.8H`

- ~~**i16x8.dot_i8x16_i7x16_u**~~
  - ~~`y = i16x8.dot_i8x16_i7x16_u(a, b)` is lowered to:~~
    - ~~`UMULL Vy.8H, Va.8B, 4B.8B`~~
    - ~~`UMULL2 Vtmp.8H, Va.16B, 4B.16B`~~
    - ~~`ADDP Vy.8H, Vy.8H, Vtmp.8H`~~

- **i32x4.dot_i8x16_i7x16_add_s**
  - `y = i32x4.dot_i8x16_i7x16_add_s(a, b, c)` is lowered to:
    - `SMULL Vtmp.8H, Va.8B, 4B.8B`
    - `SMULL2 Vtmp2.8H, Va.16B, 4B.16B`
    - `ADDP Vtmp.8H, Vtmp.8H, Vtmp2.8H`
    - `SADDLP Vtmp.4S, Vtmp.8H`
    - `ADD Vy.4S, Vtmp.4S, Vc.8H`
  - `c = i32x4.dot_i8x16_i7x16_add_s(a, b, c)` is lowered to:
    - `SMULL Vtmp.8H, Va.8B, 4B.8B`
    - `SMULL2 Vtmp2.8H, Va.16B, 4B.16B`
    - `ADDP Vtmp.8H, Vtmp.8H, Vtmp2.8H`
    - `SADALP Vc.4S, Vtmp.8H`

- ~~**i32x4.dot_i8x16_i7x16_add_u**~~
  - ~~`y = i32x4.dot_i8x16_i7x16_add_s(a, b, c)` is lowered to:~~
    - ~~`UMULL Vtmp.8H, Va.8B, 4B.8B`~~
    - ~~`UMULL2 Vtmp2.8H, Va.16B, 4B.16B`~~
    - ~~`ADDP Vtmp.8H, Vtmp.8H, Vtmp2.8H`~~
    - ~~`UADDLP Vtmp.4S, Vtmp.8H`~~
    - ~~`ADD Vy.4S, Vtmp.4S, Vc.8H`~~
  - ~~`c = i32x4.dot_i8x16_i7x16_add_u(a, b, c)` is lowered to:~~
    - ~~`UMULL Vtmp.8H, Va.8B, 4B.8B`~~
    - ~~`UMULL2 Vtmp2.8H, Va.16B, 4B.16B`~~
    - ~~`ADDP Vtmp.8H, Vtmp.8H, Vtmp2.8H`~~
    - ~~`UADALP Vc.4S, Vtmp.8H`~~

#### Reference lowering through the WAsm SIMD128 instruction set

- **i16x8.dot_i8x16_i7x16_s**
  - `y = i16x8.dot_i8x16_i7x16_s(a, b)` is lowered to:
    - `const v128_t prod_low = wasm_i16x8_extmul_low_i8x16(a, b)`
    - `const v128_t prod_high = wasm_i16x8_extmul_high_i8x16(a, b)`
    - `const v128_t prod_even = wasm_v16x8_shuffle(prod_low, prod_high, 0, 2, 4, 6, 8, 10, 12, 14)`
    - `const v128_t prod_odd = wasm_v16x8_shuffle(prod_low, prod_high, 1, 3, 5, 7, 8, 11, 13, 15)`
    - `y = wasm_i16x8_add(prod_even, prod_odd)`

- ~~**i16x8.dot_i8x16_i7x16_u**~~
  - ~~`y = i16x8.dot_i8x16_i7x16_u(a, b)` is lowered to:~~
    - ~~`const v128_t prod_low = wasm_u16x8_extmul_low_u8x16(a, b)`~~
    - ~~`const v128_t prod_high = wasm_u16x8_extmul_high_u8x16(a, b)`~~
    - ~~`const v128_t prod_even = wasm_v16x8_shuffle(prod_low, prod_high, 0, 2, 4, 6, 8, 10, 12, 14)`~~
    - ~~`const v128_t prod_odd = wasm_v16x8_shuffle(prod_low, prod_high, 1, 3, 5, 7, 8, 11, 13, 15)`~~
    - ~~`y = wasm_i16x8_add(prod_even, prod_odd)`~~

- **i32x4.dot_i8x16_i7x16_add_s**
  - `y = i32x4.dot_i8x16_i7x16_add_s(a, b, c)` is lowered to:
    - `const v128_t prod_low = wasm_i16x8_extmul_low_i8x16(a, b)`
    - `const v128_t prod_high = wasm_i16x8_extmul_high_i8x16(a, b)`
    - `const v128_t psum_low = wasm_i32x4_extadd_pairwise_i16x8(prod_low)`
    - `const v128_t psum_high = wasm_i32x4_extadd_pairwise_i16x8(prod_high)`
    - `const v128_t psum_even = wasm_v32x4_shuffle(psum_low, psum_high, 0, 2, 4, 6)`
    - `const v128_t psum_odd = wasm_v32x4_shuffle(psum_low, psum_high, 1, 3, 5, 7)`
    - `const v128_t psum = wasm_i32x4_add(prod_even, prod_odd)`
    - `y = wasm_i32x4_add(psum, c)`

- ~~**i32x4.dot_i8x16_i7x16_add_u**~~
  - ~~`y = i32x4.dot_i8x16_i7x16_add_s(a, b, c)` is lowered to:~~
    - ~~`const v128_t prod_low = wasm_u16x8_extmul_low_u8x16(a, b)`~~
    - ~~`const v128_t prod_high = wasm_u16x8_extmul_high_u8x16(a, b)`~~
    - ~~`const v128_t psum_low = wasm_u32x4_extadd_pairwise_u16x8(prod_low)`~~
    - ~~`const v128_t psum_high = wasm_u32x4_extadd_pairwise_u16x8(prod_high)`~~
    - ~~`const v128_t psum_even = wasm_v32x4_shuffle(psum_low, psum_high, 0, 2, 4, 6)`~~
    - ~~`const v128_t psum_odd = wasm_v32x4_shuffle(psum_low, psum_high, 1, 3, 5, 7)`~~
    - ~~`const v128_t psum = wasm_i32x4_add(prod_even, prod_odd)`~~
    - ~~`y = wasm_i32x4_add(psum, c)`~~

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?
============================

As the native equivalents of the proposed Dot Product instructions on x86 perform signed-by-unsigned multiplication and the native equivalents on ARM perform either signed-by-signed, or unsigned-by-unsigned multiplications, it is possible to distinguish these architectures from results on out-of-bounds (when the high bit of the elements in the second input SIMD vector) inputs. x86/x86-64 can already be distinguished from ARM/ARM64 based on NaN behavior, so this aspect doesn't expose any new fingerprinting surfaces.

However, it is also possible to distinguish processors with AVX2-VNNI or AVX512-VNNI instruction sets on x86 from processors without these instruction sets by detecting saturation of intermediate results (in `PMADDUBSW` instruction), and distinguish ARM processors with Dot Product extension from ARM processors without this extension by detecting wrapping of intermediate results (in `ADDP` instructions). WebAssembly engines have three options to manage exposure of this fingerprinting surface:

1. Wait it out, as new processors tend to support the AVX2-VNNI / AVX512-VNNI extension on the x86 and the NEON Dot Product extension on ARM. 2022 processor cores from Intel (Golden Cove, Gracemont), AMD (Zen 4), ARM (Cortex-X2, Cortex-A710, Cortex-A510), and Apple (A15) all support these instruction set extensions.

2. Mimick the behaviour of the AVX2-VNNI / AVX512-VNNI `VPDPBUSD` instruction on the x86 processors without this instruction extension and the behaviour of the NEON Dot Product instructions on the ARM processors without this instruction set extension. This option comes at a performance cost on the older processors.

3. Avoid the AVX2-VNNI / AVX512-VNNI `VPDPBUSD` instruction on x86 and the NEON Dot Product instructions on ARM, and use the same instruction sequences as the older processors. This option comes at a performance cost on the newer processors.

What use cases are there?
============================

- [libvpx (video codec)](https://github.com/webmproject/libvpx/blob/705bf9de8c96cfe5301451f1d7e5c90a41c64e5f/vpx_dsp/x86/vpx_subpixel_8t_intrin_ssse3.c#L88-L89)
- [AV1 Codec Library (video codec)](https://github.com/mozilla/aom/blob/46b7bb243b39929c02d6decc1bdb2e39e11540d6/av1/common/x86/av1_convolve_ssse3.c#L152-L153)
- [OpenVINO (machine learning toolkit)](https://github.com/openvinotoolkit/openvino/blob/b6a75d7d916b7555f867e74681449aa7aa865653/src/plugins/intel_cpu/src/nodes/bin_conv.cpp#L375-L382)
- @kpu [mentioned in #9](https://github.com/WebAssembly/relaxed-simd/issues/9#issuecomment-1012378445) that access to `PMADDUBSW` instruction accelerates machine translation application compiled to WebAssembly by 310% over pure WebAssembly SIMD.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Relaxed Integer Dot Product instructions #52

What are the instructions being proposed?

What are the semantics of these instructions?

How will these instructions be implemented?

x86/x86-64 processors with AVX2-VNNI or AVX512-VNNI instruction set

x86/x86-64 processors with XOP instruction set

x86/x86-64 processors with AVX instruction set

x86/x86-64 processors with SSSE3 instruction set

ARM64 processors with Dot Product extension

ARM64 processors

Reference lowering through the WAsm SIMD128 instruction set

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

What use cases are there?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Relaxed Integer Dot Product instructions #52

Description

What are the instructions being proposed?

What are the semantics of these instructions?

How will these instructions be implemented?

x86/x86-64 processors with AVX2-VNNI or AVX512-VNNI instruction set

x86/x86-64 processors with XOP instruction set

x86/x86-64 processors with AVX instruction set

x86/x86-64 processors with SSSE3 instruction set

ARM64 processors with Dot Product extension

ARM64 processors

Reference lowering through the WAsm SIMD128 instruction set

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

What use cases are there?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions