
[WebAssembly] Mask undef shuffle lanes #149084

Open · wants to merge 1 commit into main

Conversation

sparker-arm (Contributor)

In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined, which allows a VM to easily see which lanes are required.
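As a sketch of the resulting code (my own folded WebAssembly text with a made-up function name, mirroring the sext_high_v4i8 test updated below):

(func $shuffle_low4 (param $v v128) (result v128)
  ;; Only the low four result bytes (input bytes 4..7) are defined; the
  ;; undef index bytes have already been lowered to 0, reusing a whole
  ;; input lane so the shuffle can be matched as an i32x4 shuffle.
  (v128.and
    (i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
      (local.get $v) (local.get $v))
    ;; New with this patch: zero the upper three i32 lanes so a VM can
    ;; see that only the bottom four bytes are required.
    (v128.const i32x4 -1 0 0 0)))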

@llvmbot (Member)

llvmbot commented Jul 16, 2025

@llvm/pr-subscribers-backend-webassembly

Author: Sam Parker (sparker-arm)

Changes

In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined, which allows a VM to easily see which lanes are required.


Patch is 81.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149084.diff

10 Files Affected:

  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp (+36-2)
  • (modified) llvm/test/CodeGen/WebAssembly/extend-shuffles.ll (+18-10)
  • (modified) llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll (+36)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-concat.ll (+6)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-conversions.ll (+38-24)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-extending-convert.ll (+4)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-extending.ll (+6)
  • (modified) llvm/test/CodeGen/WebAssembly/simd.ll (+12-4)
  • (modified) llvm/test/CodeGen/WebAssembly/vector-reduce.ll (+401-285)
  • (modified) llvm/test/CodeGen/WebAssembly/wide-simd-mul.ll (+40-32)
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
index bf2e04caa0a61..a360c592d3ecc 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
@@ -2719,18 +2719,52 @@ WebAssemblyTargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
   Ops[OpIdx++] = Op.getOperand(0);
   Ops[OpIdx++] = Op.getOperand(1);
 
+  std::bitset<16> DefinedLaneBytes = 0xFFFF;
   // Expand mask indices to byte indices and materialize them as operands
   for (int M : Mask) {
     for (size_t J = 0; J < LaneBytes; ++J) {
       // Lower undefs (represented by -1 in mask) to {0..J}, which use a
       // whole lane of vector input, to allow further reduction at VM. E.g.
       // match an 8x16 byte shuffle to an equivalent cheaper 32x4 shuffle.
+      if (M == -1) {
+        DefinedLaneBytes[OpIdx - 2] = 0;
+      }
       uint64_t ByteIndex = M == -1 ? J : (uint64_t)M * LaneBytes + J;
       Ops[OpIdx++] = DAG.getConstant(ByteIndex, DL, MVT::i32);
     }
   }
-
-  return DAG.getNode(WebAssemblyISD::SHUFFLE, DL, Op.getValueType(), Ops);
+  EVT VT = Op.getValueType();
+  SDValue Shuffle = DAG.getNode(WebAssemblyISD::SHUFFLE, DL, VT, Ops);
+
+  // If only the lower four or eight bytes are actually defined by the
+  // shuffle, insert an AND so a VM can know that it can ignore the higher,
+  // undef, lanes.
+  if (DefinedLaneBytes == 0xF) {
+    SDValue LowLaneMask[] = {
+        DAG.getConstant(uint32_t(-1), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+        DAG.getConstant(uint32_t(0), DL, MVT::i32),
+    };
+    SDValue UndefMask =
+        DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v4i32, LowLaneMask);
+    SDValue MaskedShuffle =
+        DAG.getNode(ISD::AND, DL, MVT::v4i32,
+                    DAG.getBitcast(MVT::v4i32, Shuffle), UndefMask);
+    return DAG.getBitcast(VT, MaskedShuffle);
+  } else if (DefinedLaneBytes == 0xFF) {
+    SDValue LowLaneMask[] = {
+        DAG.getConstant(uint64_t(-1), DL, MVT::i64),
+        DAG.getConstant(uint64_t(0), DL, MVT::i64),
+    };
+    SDValue UndefMask =
+        DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v2i64, LowLaneMask);
+    SDValue MaskedShuffle =
+        DAG.getNode(ISD::AND, DL, MVT::v2i64,
+                    DAG.getBitcast(MVT::v2i64, Shuffle), UndefMask);
+    return DAG.getBitcast(VT, MaskedShuffle);
+  }
+  return Shuffle;
 }
 
 SDValue WebAssemblyTargetLowering::LowerSETCC(SDValue Op,
diff --git a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
index 7736e78271e55..0085c6cd82797 100644
--- a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
+++ b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
@@ -10,9 +10,11 @@ define <4 x i32> @sext_high_v4i8(<8 x i8> %in) {
 ; SIMD128:         .functype sext_high_v4i8 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT:    i16x8.extend_low_i8x16_s $push1=, $pop0
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push2=, $pop1
-; SIMD128-NEXT:    return $pop2
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i16x8.extend_low_i8x16_s $push3=, $pop2
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push4=, $pop3
+; SIMD128-NEXT:    return $pop4
  %shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %res = sext <4 x i8> %shuffle to <4 x i32>
  ret <4 x i32> %res
@@ -23,9 +25,11 @@ define <4 x i32> @zext_high_v4i8(<8 x i8> %in) {
 ; SIMD128:         .functype zext_high_v4i8 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT:    i16x8.extend_low_i8x16_u $push1=, $pop0
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push2=, $pop1
-; SIMD128-NEXT:    return $pop2
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i16x8.extend_low_i8x16_u $push3=, $pop2
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push4=, $pop3
+; SIMD128-NEXT:    return $pop4
  %shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %res = zext <4 x i8> %shuffle to <4 x i32>
  ret <4 x i32> %res
@@ -58,8 +62,10 @@ define <2 x i32> @sext_high_v2i16(<4 x i16> %in) {
 ; SIMD128:         .functype sext_high_v2i16 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push1=, $pop0
-; SIMD128-NEXT:    return $pop1
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_s $push3=, $pop2
+; SIMD128-NEXT:    return $pop3
  %shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
  %res = sext <2 x i16> %shuffle to <2 x i32>
  ret <2 x i32> %res
@@ -70,8 +76,10 @@ define <2 x i32> @zext_high_v2i16(<4 x i16> %in) {
 ; SIMD128:         .functype zext_high_v2i16 (v128) -> (v128)
 ; SIMD128-NEXT:  # %bb.0:
 ; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push1=, $pop0
-; SIMD128-NEXT:    return $pop1
+; SIMD128-NEXT:    v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT:    v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT:    i32x4.extend_low_i16x8_u $push3=, $pop2
+; SIMD128-NEXT:    return $pop3
  %shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
  %res = zext <2 x i16> %shuffle to <2 x i32>
  ret <2 x i32> %res
diff --git a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
index 7190e162eb010..27b7e8c6b01cd 100644
--- a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
+++ b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
@@ -32,6 +32,8 @@ define <2 x i32> @stest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -76,6 +78,8 @@ define <2 x i32> @utest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i64>
@@ -112,6 +116,8 @@ define <2 x i32> @ustest_f64i32(<2 x double> %x) {
 ; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -301,6 +307,8 @@ define <2 x i16> @stest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -328,6 +336,8 @@ define <2 x i16> @utest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i32>
@@ -355,6 +365,8 @@ define <2 x i16> @ustest_f64i16(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -378,6 +390,8 @@ define <4 x i16> @stest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -399,6 +413,8 @@ define <4 x i16> @utest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <4 x float> %x to <4 x i32>
@@ -420,6 +436,8 @@ define <4 x i16> @ustest_f32i16(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -1484,6 +1502,8 @@ define <2 x i32> @stest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -1526,6 +1546,8 @@ define <2 x i32> @utest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.bitselect
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i64>
@@ -1561,6 +1583,8 @@ define <2 x i32> @ustest_f64i32_mm(<2 x double> %x) {
 ; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i64>
@@ -1738,6 +1762,8 @@ define <2 x i16> @stest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -1763,6 +1789,8 @@ define <2 x i16> @utest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <2 x double> %x to <2 x i32>
@@ -1789,6 +1817,8 @@ define <2 x i16> @ustest_f64i16_mm(<2 x double> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <2 x double> %x to <2 x i32>
@@ -1810,6 +1840,8 @@ define <4 x i16> @stest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
@@ -1829,6 +1861,8 @@ define <4 x i16> @utest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.min_u
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptoui <4 x float> %x to <4 x i32>
@@ -1849,6 +1883,8 @@ define <4 x i16> @ustest_f32i16_mm(<4 x float> %x) {
 ; CHECK-NEXT:    i32x4.max_s
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
 entry:
   %conv = fptosi <4 x float> %x to <4 x i32>
diff --git a/llvm/test/CodeGen/WebAssembly/simd-concat.ll b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
index 42ded8a47c199..4473f7ffc6a93 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-concat.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
@@ -24,6 +24,8 @@ define <8 x i8> @concat_v4i8(<4 x i8> %a, <4 x i8> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <4 x i8> %a, <4 x i8> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   ret <8 x i8> %v
@@ -48,6 +50,8 @@ define <4 x i8> @concat_v2i8(<2 x i8> %a, <2 x i8> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 16, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT:    v128.const -1, 0, 0, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <2 x i8> %a, <2 x i8> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   ret <4 x i8> %v
@@ -60,6 +64,8 @@ define <4 x i16> @concat_v2i16(<2 x i16> %a, <2 x i16> %b) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <2 x i16> %a, <2 x i16> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   ret <4 x i16> %v
diff --git a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
index 8459ec8101ff2..c98567eaaf7d6 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
@@ -313,14 +313,16 @@ define <4 x double> @convert_low_s_v4f64(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = sitofp <4 x i32> %v to <4 x double>
@@ -333,14 +335,16 @@ define <4 x double> @convert_low_u_v4f64(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = uitofp <4 x i32> %v to <4 x double>
@@ -354,14 +358,16 @@ define <4 x double> @convert_low_s_v4f64_2(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_s
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = sitofp <8 x i32> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -374,14 +380,16 @@ define <4 x double> @convert_low_u_v4f64_2(<8 x i32> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.convert_low_i32x4_u
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = uitofp <8 x i32> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -394,14 +402,16 @@ define <4 x double> @promote_low_v4f64(<8 x float> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = shufflevector <8 x float> %x, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %a = fpext <4 x float> %v to <4 x double>
@@ -414,14 +424,16 @@ define <4 x double> @promote_low_v4f64_2(<8 x float> %x) {
 ; CHECK-NEXT:  # %bb.0:
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    local.get 1
-; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 16
+; CHECK-NEXT:    v128.store 0
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
-; CHECK-NEXT:    v128.store 0
+; CHECK-NEXT:    v128.store 16
 ; CHECK-NEXT:    # fallthrough-return
   %v = fpext <8 x float> %x to <8 x double>
   %a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -435,6 +447,8 @@ define <2 x double> @promote_mixed_v2f64(<4 x float> %x, <4 x float> %y) {
 ; CHECK-NEXT:    local.get 0
 ; CHECK-NEXT:    local.get 1
 ; CHECK-NEXT:    i8x16.shuffle 8, 9, 10, 11, 28, 29, 30, 31, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT:    v128.const -1, 0
+; CHECK-NEXT:    v128.and
 ; CHECK-NEXT:    f64x2.promote_low_f32x4
 ; CHECK-NEXT:   ...
[truncated]

@dschuff (Member)

dschuff commented Jul 16, 2025

Can you say a little more about what the advantages of this are, i.e. what the VM does differently as a result? (And, which VMs have you tested this with?)

@sparker-arm (Contributor, Author)

I haven't tested with any VMs yet, as I doubt any of them would take advantage of this right now.

The main advantage of this change is to help identify 'narrow' shuffles that can be mapped to target instructions. Even though Wasm SIMD is 128-bit, that doesn't mean we're always operating on the full width. Imagine we're operating on an 8 x 16-bit vector and we want the result to be the even lanes: 0, 2, 4, 6. The Wasm shuffle mask will be 0, 2, 4, 6, 0, 0, 0, 0.

I've optimised the AArch64 backend in V8 so that these cases are often handled by splatting lane zero first, but this is still far from optimal.

With the undef mask, during isel and with very little overhead, the backend can recognize this as an 'unzip' operation instead of an arbitrary lane shuffle.
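As a sketch (again my own folded WebAssembly text with a made-up function name; the shuffle byte pattern matches tests such as stest_f32i16 above), the masked form makes the live half explicit, so a backend can lower it to a single unzip such as AArch64 uzp1:

(func $even_lanes (param $v v128) (result v128)
  ;; Even i16 lanes 0, 2, 4, 6 expanded to byte indices; the upper
  ;; eight result bytes were undef and reuse the bytes of lane 0.
  (v128.and
    (i8x16.shuffle 0 1 4 5 8 9 12 13 0 1 0 1 0 1 0 1
      (local.get $v) (local.get $v))
    ;; Mask inserted by this patch: only the low eight bytes are live.
    (v128.const i64x2 -1 0)))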

The extend_low operations also provide the same information as this mask but, if the shuffle has multiple users, it's unlikely to be such a simple optimisation during isel. I've written an optimisation in V8 specifically for figuring out undef lanes and it's non-trivial. This undef mask would make it much simpler for other runtimes to generate good shuffle code.

As you may have noticed, I've found WebAssembly shuffles to be a real pain! I would really like to see a revision to the spec so that these undef lanes/bytes can be explicitly encoded :)

@tlively (Collaborator)

tlively commented Jul 17, 2025

As you may have noticed, I've found WebAssembly shuffles to be a real pain! I would really like to see a revision to the spec so that these undef lanes/bytes can be explicitly encoded :)

If we did add such an extension to the shuffle instruction, we would still have to specify what value ends up in the lanes of the result. Would it be portable and fast if we specified that the "undef" lanes all end up containing zeros, for instance?

@sparker-arm (Contributor, Author)

which VMs have you tested this with?

With ~20 lines of code added to V8 to recognize the AND mask, this change gave me a ~10% speedup on my microbenchmark suite for memory interleaving.

@sparker-arm (Contributor, Author)

Would it be portable and fast if we specified that the "undef" lanes all end up containing zeros, for instance?

This is the only option that I have considered, really. It would have the same semantics as what I'm proposing here, and I would expect it to be cheap enough on any architecture.
