While adding aarch64 support to simdutf8, I ran into an unexpected eightfold slowdown after hand-unrolling a loop. The cause: the compiler suddenly decided to load 128-bit uint8x16_t values with single-byte load instructions instead of a single 128-bit load.
It turns out that the vld1q_u8 intrinsic is at fault. The code generator thinks it can "optimize" the load by loading the bytes individually when a SIMD shuffle instruction follows. According to the ARM documentation, this intrinsic should always compile to a single instruction. I fixed it by performing the load similarly to how it is currently done for SSE2.
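The workaround can be sketched roughly as follows (a minimal, portable illustration, not the exact simdutf8 code): instead of going through `vld1q_u8`, read all 16 bytes with one `core::ptr::read_unaligned` of a 128-bit value, which the backend lowers to a single wide load rather than sixteen byte loads. On aarch64 the result could then be reinterpreted as `uint8x16_t` with `core::mem::transmute` inside a `#[cfg(target_arch = "aarch64")]` block; the helper name here is hypothetical.

```rust
use core::ptr;

/// Load 16 bytes as one 128-bit value. Because this is a single
/// `read_unaligned` of a `u128`, the backend emits one wide load and
/// cannot split it into per-byte loads.
#[inline(always)]
fn load_u8x16(bytes: &[u8; 16]) -> u128 {
    // Safety: `bytes` is exactly 16 bytes, and read_unaligned imposes
    // no alignment requirement on the source pointer.
    unsafe { ptr::read_unaligned(bytes.as_ptr() as *const u128) }
}

fn main() {
    let data: [u8; 16] = *b"0123456789abcdef";
    let v = load_u8x16(&data);
    // On a little-endian target the low byte of the u128 is the first
    // input byte.
    assert_eq!((v & 0xFF) as u8, b'0');
}
```

This mirrors the SSE2 path, where the unaligned load intrinsic (`_mm_loadu_si128`) reliably compiles to a single instruction.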