
[WebAssembly] Mask undef shuffle lanes #149084

Open · wants to merge 1 commit into main

Conversation

sparker-arm (Contributor)

In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined, which allows a VM to easily see which lanes are required.
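As a sketch of the resulting code (my own folded WebAssembly text with a made-up function name, mirroring the sext_high_v4i8 test updated below):

(func $shuffle_low4 (param $v v128) (result v128)
  ;; Only the low four result bytes (input bytes 4..7) are defined; the
  ;; undef index bytes have already been lowered to 0, reusing a whole
  ;; input lane so the shuffle can be matched as an i32x4 shuffle.
  (v128.and
    (i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
      (local.get $v) (local.get $v))
    ;; New with this patch: zero the upper three i32 lanes so a VM can
    ;; see that only the bottom four bytes are required.
    (v128.const i32x4 -1 0 0 0)))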

@llvmbot (Member)

llvmbot commented Jul 16, 2025

@llvm/pr-subscribers-backend-webassembly

Author: Sam Parker (sparker-arm)

Changes

In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined, which allows a VM to easily see which lanes are required.


Patch is 81.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149084.diff

10 Files Affected:

  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp (+36-2)
  • (modified) llvm/test/CodeGen/WebAssembly/extend-shuffles.ll (+18-10)
  • (modified) llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll (+36)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-concat.ll (+6)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-conversions.ll (+38-24)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-extending-convert.ll (+4)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-extending.ll (+6)
  • (modified) llvm/test/CodeGen/WebAssembly/simd.ll (+12-4)
  • (modified) llvm/test/CodeGen/WebAssembly/vector-reduce.ll (+401-285)
  • (modified) llvm/test/CodeGen/WebAssembly/wide-simd-mul.ll (+40-32)
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
index bf2e04caa0a61..a360c592d3ecc 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
@@ -2719,18 +2719,52 @@ WebAssemblyTargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
   Ops[OpIdx++] = Op.getOperand(0);
   Ops[OpIdx++] = Op.getOperand(1);
 
+  std::bitset<16> DefinedLaneBytes = 0xFFFF;
   // Expand mask indices to byte indices and materialize them as operands
   for (int M : Mask) {
     for (size_t J = 0; J < LaneBytes; ++J) {
       // Lower undefs (represented by -1 in mask) to {0..J}, which use a
       // whole lane of vector input, to allow further reduction at VM. E.g.
       // match an 8x16 byte shuffle to an equivalent cheaper 32x4 shuffle.
+      if (M == -1) {
+        DefinedLaneBytes[OpIdx - 2] = 0;
+      }
       uint64_t ByteIndex = M == -1 ? J : (uint64_t)M * LaneBytes + J;
       Ops[OpIdx++] = DAG.getConstant(ByteIndex, DL, MVT::i32);
     }
   }
-
-  return DAG.getNode(WebAssemblyISD::SHUFFLE, DL, Op.getValueType(), Ops);
+  EVT VT = Op.getValueType();
+  SDValue Shuffle = DAG.getNode(WebAssemblyISD::SHUFFLE, DL, VT, Ops);
+
+  // If only the lower four or eight bytes are actually defined by the
+  // shuffle, insert an AND so a VM can know that it can ignore the higher,
+  // undef, lanes.
+  if (DefinedLaneBytes == 0xF) {
+    SDValue LowLaneMask[] = {
+        DAG.getConstant(uint32_t(-1), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+    };
+    SDValue UndefMask =
+        DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v4i32, LowLaneMask);
+    SDValue MaskedShuffle =
+        DAG.getNode(ISD::AND, DL, MVT::v4i32,
+                    DAG.getBitcast(MVT::v4i32, Shuffle), UndefMask);
+    return DAG.getBitcast(VT, MaskedShuffle);
+  } else if (DefinedLaneBytes == 0xFF) {
+    SDValue LowLaneMask[] = {
+        DAG.getConstant(uint64_t(-1), DL, MVT::i64),
+        DAG.getConstant(uint64_t(0), DL, MVT::i64),
+    };
+    SDValue UndefMask =
+        DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v2i64, LowLaneMask);
+    SDValue MaskedShuffle =
+        DAG.getNode(ISD::AND, DL, MVT::v2i64,
+                    DAG.getBitcast(MVT::v2i64, Shuffle), UndefMask);
+    return DAG.getBitcast(VT, MaskedShuffle);
+  }
+  return Shuffle;
 }
 
 SDValue WebAssemblyTargetLowering::LowerSETCC(SDValue Op,
diff --git a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
index 7736e78271e55..0085c6cd82797 100644
--- a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
+++ b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
@@ -10,9 +10,11 @@ define <4 x i32> @sext_high_v4i8(<8 x i8> %in) {
 ; SIMD128:         .functype sext_high_v4i8 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT:    i16x8.extend_low_i8x16_s $push1=, $pop0
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push2=, $pop1
-; SIMD128-NEXT:    return $pop2
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i16x8.extend_low_i8x16_s $push3=, $pop2
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push4=, $pop3
+; SIMD128-NEXT:    return $pop4
  %shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %res = sext <4 x i8> %shuffle to <4 x i32>
  ret <4 x i32> %res
@@ -23,9 +25,11 @@ define <4 x i32> @zext_high_v4i8(<8 x i8> %in) {
 ; SIMD128:         .functype zext_high_v4i8 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT:    i16x8.extend_low_i8x16_u $push1=, $pop0
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push2=, $pop1
-; SIMD128-NEXT:    return $pop2
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i16x8.extend_low_i8x16_u $push3=, $pop2
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push4=, $pop3
+; SIMD128-NEXT:    return $pop4
  %shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %res = zext <4 x i8> %shuffle to <4 x i32>
  ret <4 x i32> %res
@@ -58,8 +62,10 @@ define <2 x i32> @sext_high_v2i16(<4 x i16> %in) {
 ; SIMD128:         .functype sext_high_v2i16 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push1=, $pop0
-; SIMD128-NEXT:    return $pop1
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push3=, $pop2
+; SIMD128-NEXT:    return $pop3
  %shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
  %res = sext <2 x i16> %shuffle to <2 x i32>
  ret <2 x i32> %res
@@ -70,8 +76,10 @@ define <2 x i32> @zext_high_v2i16(<4 x i16> %in) {
 ; SIMD128:         .functype zext_high_v2i16 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push1=, $pop0
-; SIMD128-NEXT:    return $pop1
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push3=, $pop2
+; SIMD128-NEXT:    return $pop3
  %shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
  %res = zext <2 x i16> %shuffle to <2 x i32>
  ret <2 x i32> %res
diff --git a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
index 7190e162eb010..27b7e8c6b01cd 100644
--- a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
+++ b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
@@ -32,6 +32,8 @@ define <2 x i32> @stest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -76,6 +78,8 @@ define <2 x i32> @utest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i64>
@@ -112,6 +116,8 @@ define <2 x i32> @ustest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -301,6 +307,8 @@ define <2 x i16> @stest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -328,6 +336,8 @@ define <2 x i16> @utest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i32>
@@ -355,6 +365,8 @@ define <2 x i16> @ustest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -378,6 +390,8 @@ define <4 x i16> @stest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -399,6 +413,8 @@ define <4 x i16> @utest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <4 x float> %x to <4 x i32>
@@ -420,6 +436,8 @@ define <4 x i16> @ustest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -1484,6 +1502,8 @@ define <2 x i32> @stest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -1526,6 +1546,8 @@ define <2 x i32> @utest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i64>
@@ -1561,6 +1583,8 @@ define <2 x i32> @ustest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -1738,6 +1762,8 @@ define <2 x i16> @stest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -1763,6 +1789,8 @@ define <2 x i16> @utest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i32>
@@ -1789,6 +1817,8 @@ define <2 x i16> @ustest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -1810,6 +1840,8 @@ define <4 x i16> @stest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -1829,6 +1861,8 @@ define <4 x i16> @utest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <4 x float> %x to <4 x i32>
@@ -1849,6 +1883,8 @@ define <4 x i16> @ustest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
diff --git a/llvm/test/CodeGen/WebAssembly/simd-concat.ll b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
index 42ded8a47c199..4473f7ffc6a93 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-concat.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
@@ -24,6 +24,8 @@ define <8 x i8> @concat_v4i8(<4 x i8> %a, <4 x i8> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <4 x i8> %a, <4 x i8> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   ret <8 x i8> %v
@@ -48,6 +50,8 @@ define <4 x i8> @concat_v2i8(<2 x i8> %a, <2 x i8> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 16, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <2 x i8> %a, <2 x i8> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   ret <4 x i8> %v
@@ -60,6 +64,8 @@ define <4 x i16> @concat_v2i16(<2 x i16> %a, <2 x i16> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <2 x i16> %a, <2 x i16> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   ret <4 x i16> %v
diff --git a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
index 8459ec8101ff2..c98567eaaf7d6 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
@@ -313,14 +313,16 @@ define <4 x double> @convert_low_s_v4f64(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = sitofp <4 x i32> %v to <4 x double>
@@ -333,14 +335,16 @@ define <4 x double> @convert_low_u_v4f64(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = uitofp <4 x i32> %v to <4 x double>
@@ -354,14 +358,16 @@ define <4 x double> @convert_low_s_v4f64_2(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = sitofp <8 x i32> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -374,14 +380,16 @@ define <4 x double> @convert_low_u_v4f64_2(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = uitofp <8 x i32> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -394,14 +402,16 @@ define <4 x double> @promote_low_v4f64(<8 x float> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x float> %x, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = fpext <4 x float> %v to <4 x double>
@@ -414,14 +424,16 @@ define <4 x double> @promote_low_v4f64_2(<8 x float> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = fpext <8 x float> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -435,6 +447,8 @@ define <2 x double> @promote_mixed_v2f64(<4 x float> %x, <4 x float> %y) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 28, 29, 30, 31, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
 ; CHECK-NEXT:   ...
[truncated]

@dschuff (Member)

dschuff commented Jul 16, 2025

Can you say a little more about what the advantages of this are, i.e. what the VM does differently as a result? (And, which VMs have you tested this with?)

@sparker-arm (Contributor, Author)

I haven't tested with any VMs yet, as I doubt any of them would take advantage of this right now.

The main advantage of this change is to help identify 'narrow' shuffles that can be mapped to target instructions. Even though Wasm SIMD is 128-bit, that doesn't mean we're always operating on the full width. Imagine we're operating on an 8 x 16-bit vector and we want the result to be the even lanes: 0, 2, 4, 6. The Wasm shuffle mask will be 0, 2, 4, 6, 0, 0, 0, 0.

I've optimised the AArch64 backend in V8 so that these cases are often handled by splatting lane zero first, but this is still far from optimal.

With the undef mask, during isel and with very little overhead, the backend can recognize this as an 'unzip' operation instead of an arbitrary lane shuffle.
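As a sketch (again my own folded WebAssembly text with a made-up function name; the shuffle byte pattern matches tests such as stest_f32i16 above), the masked form makes the live half explicit, so a backend can lower it to a single unzip such as AArch64 uzp1:

(func $even_lanes (param $v v128) (result v128)
  ;; Even i16 lanes 0, 2, 4, 6 expanded to byte indices; the upper
  ;; eight result bytes were undef and reuse the bytes of lane 0.
  (v128.and
    (i8x16.shuffle 0 1 4 5 8 9 12 13 0 1 0 1 0 1 0 1
      (local.get $v) (local.get $v))
    ;; Mask inserted by this patch: only the low eight bytes are live.
    (v128.const i64x2 -1 0)))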

The extend_low operations also provide the same information as this mask but, if the shuffle has multiple users, it's unlikely to be such a simple optimisation during isel. I've written an optimisation in V8 specifically for figuring out undef lanes and it's non-trivial. This undef mask would make it much simpler for other runtimes to generate good shuffle code.

As you may have noticed, I've found WebAssembly shuffles to be a real pain! I would really like to see a revision to the spec so that these undef lanes/bytes can be explicitly encoded :)

@tlively (Collaborator)

tlively commented Jul 17, 2025

As you may have noticed, I've found WebAssembly shuffles to be a real pain! I would really like to see a revision to the spec so that these undef lanes/bytes can be explicitly encoded :)

If we did add such an extension to the shuffle instruction, we would still have to specify what value ends up in the lanes of the result. Would it be portable and fast if we specified that the "undef" lanes all end up containing zeros, for instance?

@sparker-arm (Contributor, Author)

which VMs have you tested this with?

With ~20 lines of code added to V8 to recognize the AND mask, this change gave me a ~10% speedup on my microbenchmark suite for memory interleaving.

@sparker-arm (Contributor, Author)

Would it be portable and fast if we specified that the "undef" lanes all end up containing zeros, for instance?

This is the only option that I have considered, really. It would have the same semantics as what I'm proposing here, and I would expect it to be cheap enough on any architecture.
