While adding aarch64 support to simdutf8, I ran into an unexpected eightfold slowdown after hand-unrolling a loop. The cause: the compiler suddenly decided to load 128-bit uint8x16_t values with single-byte load instructions instead of a single 128-bit load.
It turns out that the vld1q_u8 intrinsic is at fault. The code generator thinks it can "optimize" the load by loading the bytes individually when a SIMD shuffle instruction follows. According to the ARM documentation, this intrinsic should always compile to a single instruction. I fixed it by performing the load similarly to how it is currently done for SSE2.
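The workaround can be sketched roughly as follows (a minimal, portable illustration, not the exact simdutf8 code): instead of going through `vld1q_u8`, read all 16 bytes with one `core::ptr::read_unaligned` of a 128-bit value, which the backend lowers to a single wide load rather than sixteen byte loads. On aarch64 the result could then be reinterpreted as `uint8x16_t` with `core::mem::transmute` inside a `#[cfg(target_arch = "aarch64")]` block; the helper name here is hypothetical.

```rust
use core::ptr;

/// Load 16 bytes as one 128-bit value. Because this is a single
/// `read_unaligned` of a `u128`, the backend emits one wide load and
/// cannot split it into per-byte loads.
#[inline(always)]
fn load_u8x16(bytes: &[u8; 16]) -> u128 {
    // Safety: `bytes` is exactly 16 bytes, and read_unaligned imposes
    // no alignment requirement on the source pointer.
    unsafe { ptr::read_unaligned(bytes.as_ptr() as *const u128) }
}

fn main() {
    let data: [u8; 16] = *b"0123456789abcdef";
    let v = load_u8x16(&data);
    // On a little-endian target the low byte of the u128 is the first
    // input byte.
    assert_eq!((v & 0xFF) as u8, b'0');
}
```

This mirrors the SSE2 path, where the unaligned load intrinsic (`_mm_loadu_si128`) reliably compiles to a single instruction.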