[WebAssembly] Mask undef shuffle lanes #149084
Conversation
In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined, so a VM can easily see which lanes are actually required.
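For example, when only the bottom four bytes of the shuffle result are defined, the lowering now emits an i32x4 mask of -1, 0, 0, 0 and ANDs it with the shuffle. A minimal sketch of the emitted WebAssembly, in the style of the CHECK lines in the updated tests (operand setup elided):

```
# Only bytes 0-3 of the result are defined; the remaining lanes were undef.
i8x16.shuffle 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
# Zero the undef lanes so a VM knows only the low 32 bits are required.
v128.const -1, 0, 0, 0
v128.and
```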
@llvm/pr-subscribers-backend-webassembly
Author: Sam Parker (sparker-arm)
Changes: In LowerVECTOR_SHUFFLE, we already attempt to make shuffles with undef lanes friendly to VMs by trying to lower them to an i32x4 shuffle. This patch inserts an AND to mask the undef lanes when only the bottom four or eight bytes are defined, so a VM can easily see which lanes are actually required.
Patch is 81.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149084.diff
10 Files Affected:
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
index bf2e04caa0a61..a360c592d3ecc 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
@@ -2719,18 +2719,52 @@ WebAssemblyTargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
Ops[OpIdx++] = Op.getOperand(0);
Ops[OpIdx++] = Op.getOperand(1);
+ std::bitset<16> DefinedLaneBytes = 0xFFFF;
// Expand mask indices to byte indices and materialize them as operands
for (int M : Mask) {
for (size_t J = 0; J < LaneBytes; ++J) {
// Lower undefs (represented by -1 in mask) to {0..J}, which use a
// whole lane of vector input, to allow further reduction at VM. E.g.
// match an 8x16 byte shuffle to an equivalent cheaper 32x4 shuffle.
+ if (M == -1) {
+ DefinedLaneBytes[OpIdx - 2] = 0;
+ }
uint64_t ByteIndex = M == -1 ? J : (uint64_t)M * LaneBytes + J;
Ops[OpIdx++] = DAG.getConstant(ByteIndex, DL, MVT::i32);
}
}
-
- return DAG.getNode(WebAssemblyISD::SHUFFLE, DL, Op.getValueType(), Ops);
+ EVT VT = Op.getValueType();
+ SDValue Shuffle = DAG.getNode(WebAssemblyISD::SHUFFLE, DL, VT, Ops);
+
+ // If only the lower four or eight bytes are actually defined by the
+ // shuffle, insert an AND so a VM can know that it can ignore the higher,
+ // undef, lanes.
+ if (DefinedLaneBytes == 0xF) {
+ SDValue LowLaneMask[] = {
+ DAG.getConstant(uint32_t(-1), DL, MVT::i32),
+ DAG.getConstant(uint32_t(0), DL, MVT::i32),
+ DAG.getConstant(uint32_t(0), DL, MVT::i32),
+ DAG.getConstant(uint32_t(0), DL, MVT::i32),
+ };
+ SDValue UndefMask =
+ DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v4i32, LowLaneMask);
+ SDValue MaskedShuffle =
+ DAG.getNode(ISD::AND, DL, MVT::v4i32,
+ DAG.getBitcast(MVT::v4i32, Shuffle), UndefMask);
+ return DAG.getBitcast(VT, MaskedShuffle);
+ } else if (DefinedLaneBytes == 0xFF) {
+ SDValue LowLaneMask[] = {
+ DAG.getConstant(uint64_t(-1), DL, MVT::i64),
+ DAG.getConstant(uint64_t(0), DL, MVT::i64),
+ };
+ SDValue UndefMask =
+ DAG.getNode(ISD::BUILD_VECTOR, DL, MVT::v2i64, LowLaneMask);
+ SDValue MaskedShuffle =
+ DAG.getNode(ISD::AND, DL, MVT::v2i64,
+ DAG.getBitcast(MVT::v2i64, Shuffle), UndefMask);
+ return DAG.getBitcast(VT, MaskedShuffle);
+ }
+ return Shuffle;
}
SDValue WebAssemblyTargetLowering::LowerSETCC(SDValue Op,
diff --git a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
index 7736e78271e55..0085c6cd82797 100644
--- a/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
+++ b/llvm/test/CodeGen/WebAssembly/extend-shuffles.ll
@@ -10,9 +10,11 @@ define <4 x i32> @sext_high_v4i8(<8 x i8> %in) {
; SIMD128: .functype sext_high_v4i8 (v128) -> (v128)
; SIMD128-NEXT: # %bb.0:
; SIMD128-NEXT: i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT: i16x8.extend_low_i8x16_s $push1=, $pop0
-; SIMD128-NEXT: i32x4.extend_low_i16x8_s $push2=, $pop1
-; SIMD128-NEXT: return $pop2
+; SIMD128-NEXT: v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT: v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT: i16x8.extend_low_i8x16_s $push3=, $pop2
+; SIMD128-NEXT: i32x4.extend_low_i16x8_s $push4=, $pop3
+; SIMD128-NEXT: return $pop4
%shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%res = sext <4 x i8> %shuffle to <4 x i32>
ret <4 x i32> %res
@@ -23,9 +25,11 @@ define <4 x i32> @zext_high_v4i8(<8 x i8> %in) {
; SIMD128: .functype zext_high_v4i8 (v128) -> (v128)
; SIMD128-NEXT: # %bb.0:
; SIMD128-NEXT: i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-; SIMD128-NEXT: i16x8.extend_low_i8x16_u $push1=, $pop0
-; SIMD128-NEXT: i32x4.extend_low_i16x8_u $push2=, $pop1
-; SIMD128-NEXT: return $pop2
+; SIMD128-NEXT: v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT: v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT: i16x8.extend_low_i8x16_u $push3=, $pop2
+; SIMD128-NEXT: i32x4.extend_low_i16x8_u $push4=, $pop3
+; SIMD128-NEXT: return $pop4
%shuffle = shufflevector <8 x i8> %in, <8 x i8> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%res = zext <4 x i8> %shuffle to <4 x i32>
ret <4 x i32> %res
@@ -58,8 +62,10 @@ define <2 x i32> @sext_high_v2i16(<4 x i16> %in) {
; SIMD128: .functype sext_high_v2i16 (v128) -> (v128)
; SIMD128-NEXT: # %bb.0:
; SIMD128-NEXT: i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT: i32x4.extend_low_i16x8_s $push1=, $pop0
-; SIMD128-NEXT: return $pop1
+; SIMD128-NEXT: v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT: v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT: i32x4.extend_low_i16x8_s $push3=, $pop2
+; SIMD128-NEXT: return $pop3
%shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
%res = sext <2 x i16> %shuffle to <2 x i32>
ret <2 x i32> %res
@@ -70,8 +76,10 @@ define <2 x i32> @zext_high_v2i16(<4 x i16> %in) {
; SIMD128: .functype zext_high_v2i16 (v128) -> (v128)
; SIMD128-NEXT: # %bb.0:
; SIMD128-NEXT: i8x16.shuffle $push0=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT: i32x4.extend_low_i16x8_u $push1=, $pop0
-; SIMD128-NEXT: return $pop1
+; SIMD128-NEXT: v128.const $push1=, -1, 0, 0, 0
+; SIMD128-NEXT: v128.and $push2=, $pop0, $pop1
+; SIMD128-NEXT: i32x4.extend_low_i16x8_u $push3=, $pop2
+; SIMD128-NEXT: return $pop3
%shuffle = shufflevector <4 x i16> %in, <4 x i16> poison, <2 x i32> <i32 2, i32 3>
%res = zext <2 x i16> %shuffle to <2 x i32>
ret <2 x i32> %res
diff --git a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
index 7190e162eb010..27b7e8c6b01cd 100644
--- a/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
+++ b/llvm/test/CodeGen/WebAssembly/fpclamptosat_vec.ll
@@ -32,6 +32,8 @@ define <2 x i32> @stest_f64i32(<2 x double> %x) {
; CHECK-NEXT: v128.bitselect
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <2 x double> %x to <2 x i64>
@@ -76,6 +78,8 @@ define <2 x i32> @utest_f64i32(<2 x double> %x) {
; CHECK-NEXT: v128.bitselect
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptoui <2 x double> %x to <2 x i64>
@@ -112,6 +116,8 @@ define <2 x i32> @ustest_f64i32(<2 x double> %x) {
; CHECK-NEXT: v128.and
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <2 x double> %x to <2 x i64>
@@ -301,6 +307,8 @@ define <2 x i16> @stest_f64i16(<2 x double> %x) {
; CHECK-NEXT: i32x4.max_s
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0, 0, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <2 x double> %x to <2 x i32>
@@ -328,6 +336,8 @@ define <2 x i16> @utest_f64i16(<2 x double> %x) {
; CHECK-NEXT: i32x4.min_u
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0, 0, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptoui <2 x double> %x to <2 x i32>
@@ -355,6 +365,8 @@ define <2 x i16> @ustest_f64i16(<2 x double> %x) {
; CHECK-NEXT: i32x4.max_s
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0, 0, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <2 x double> %x to <2 x i32>
@@ -378,6 +390,8 @@ define <4 x i16> @stest_f32i16(<4 x float> %x) {
; CHECK-NEXT: i32x4.max_s
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <4 x float> %x to <4 x i32>
@@ -399,6 +413,8 @@ define <4 x i16> @utest_f32i16(<4 x float> %x) {
; CHECK-NEXT: i32x4.min_u
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptoui <4 x float> %x to <4 x i32>
@@ -420,6 +436,8 @@ define <4 x i16> @ustest_f32i16(<4 x float> %x) {
; CHECK-NEXT: i32x4.max_s
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <4 x float> %x to <4 x i32>
@@ -1484,6 +1502,8 @@ define <2 x i32> @stest_f64i32_mm(<2 x double> %x) {
; CHECK-NEXT: v128.bitselect
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <2 x double> %x to <2 x i64>
@@ -1526,6 +1546,8 @@ define <2 x i32> @utest_f64i32_mm(<2 x double> %x) {
; CHECK-NEXT: v128.bitselect
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptoui <2 x double> %x to <2 x i64>
@@ -1561,6 +1583,8 @@ define <2 x i32> @ustest_f64i32_mm(<2 x double> %x) {
; CHECK-NEXT: v128.and
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 2, 3, 8, 9, 10, 11, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <2 x double> %x to <2 x i64>
@@ -1738,6 +1762,8 @@ define <2 x i16> @stest_f64i16_mm(<2 x double> %x) {
; CHECK-NEXT: i32x4.max_s
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0, 0, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <2 x double> %x to <2 x i32>
@@ -1763,6 +1789,8 @@ define <2 x i16> @utest_f64i16_mm(<2 x double> %x) {
; CHECK-NEXT: i32x4.min_u
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0, 0, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptoui <2 x double> %x to <2 x i32>
@@ -1789,6 +1817,8 @@ define <2 x i16> @ustest_f64i16_mm(<2 x double> %x) {
; CHECK-NEXT: i32x4.max_s
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0, 0, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <2 x double> %x to <2 x i32>
@@ -1810,6 +1840,8 @@ define <4 x i16> @stest_f32i16_mm(<4 x float> %x) {
; CHECK-NEXT: i32x4.max_s
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <4 x float> %x to <4 x i32>
@@ -1829,6 +1861,8 @@ define <4 x i16> @utest_f32i16_mm(<4 x float> %x) {
; CHECK-NEXT: i32x4.min_u
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptoui <4 x float> %x to <4 x i32>
@@ -1849,6 +1883,8 @@ define <4 x i16> @ustest_f32i16_mm(<4 x float> %x) {
; CHECK-NEXT: i32x4.max_s
; CHECK-NEXT: local.get 0
; CHECK-NEXT: i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
entry:
%conv = fptosi <4 x float> %x to <4 x i32>
diff --git a/llvm/test/CodeGen/WebAssembly/simd-concat.ll b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
index 42ded8a47c199..4473f7ffc6a93 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-concat.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-concat.ll
@@ -24,6 +24,8 @@ define <8 x i8> @concat_v4i8(<4 x i8> %a, <4 x i8> %b) {
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
; CHECK-NEXT: i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
%v = shufflevector <4 x i8> %a, <4 x i8> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
ret <8 x i8> %v
@@ -48,6 +50,8 @@ define <4 x i8> @concat_v2i8(<2 x i8> %a, <2 x i8> %b) {
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
; CHECK-NEXT: i8x16.shuffle 0, 1, 16, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+; CHECK-NEXT: v128.const -1, 0, 0, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
%v = shufflevector <2 x i8> %a, <2 x i8> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
ret <4 x i8> %v
@@ -60,6 +64,8 @@ define <4 x i16> @concat_v2i16(<2 x i16> %a, <2 x i16> %b) {
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
; CHECK-NEXT: i8x16.shuffle 0, 1, 2, 3, 16, 17, 18, 19, 0, 1, 0, 1, 0, 1, 0, 1
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: # fallthrough-return
%v = shufflevector <2 x i16> %a, <2 x i16> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
ret <4 x i16> %v
diff --git a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
index 8459ec8101ff2..c98567eaaf7d6 100644
--- a/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
+++ b/llvm/test/CodeGen/WebAssembly/simd-conversions.ll
@@ -313,14 +313,16 @@ define <4 x double> @convert_low_s_v4f64(<8 x i32> %x) {
; CHECK-NEXT: # %bb.0:
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
-; CHECK-NEXT: local.get 1
-; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
; CHECK-NEXT: f64x2.convert_low_i32x4_s
-; CHECK-NEXT: v128.store 16
+; CHECK-NEXT: v128.store 0
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
+; CHECK-NEXT: local.get 1
+; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: f64x2.convert_low_i32x4_s
-; CHECK-NEXT: v128.store 0
+; CHECK-NEXT: v128.store 16
; CHECK-NEXT: # fallthrough-return
%v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%a = sitofp <4 x i32> %v to <4 x double>
@@ -333,14 +335,16 @@ define <4 x double> @convert_low_u_v4f64(<8 x i32> %x) {
; CHECK-NEXT: # %bb.0:
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
-; CHECK-NEXT: local.get 1
-; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
; CHECK-NEXT: f64x2.convert_low_i32x4_u
-; CHECK-NEXT: v128.store 16
+; CHECK-NEXT: v128.store 0
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
+; CHECK-NEXT: local.get 1
+; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: f64x2.convert_low_i32x4_u
-; CHECK-NEXT: v128.store 0
+; CHECK-NEXT: v128.store 16
; CHECK-NEXT: # fallthrough-return
%v = shufflevector <8 x i32> %x, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%a = uitofp <4 x i32> %v to <4 x double>
@@ -354,14 +358,16 @@ define <4 x double> @convert_low_s_v4f64_2(<8 x i32> %x) {
; CHECK-NEXT: # %bb.0:
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
-; CHECK-NEXT: local.get 1
-; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
; CHECK-NEXT: f64x2.convert_low_i32x4_s
-; CHECK-NEXT: v128.store 16
+; CHECK-NEXT: v128.store 0
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
+; CHECK-NEXT: local.get 1
+; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: f64x2.convert_low_i32x4_s
-; CHECK-NEXT: v128.store 0
+; CHECK-NEXT: v128.store 16
; CHECK-NEXT: # fallthrough-return
%v = sitofp <8 x i32> %x to <8 x double>
%a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -374,14 +380,16 @@ define <4 x double> @convert_low_u_v4f64_2(<8 x i32> %x) {
; CHECK-NEXT: # %bb.0:
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
-; CHECK-NEXT: local.get 1
-; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
; CHECK-NEXT: f64x2.convert_low_i32x4_u
-; CHECK-NEXT: v128.store 16
+; CHECK-NEXT: v128.store 0
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
+; CHECK-NEXT: local.get 1
+; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: f64x2.convert_low_i32x4_u
-; CHECK-NEXT: v128.store 0
+; CHECK-NEXT: v128.store 16
; CHECK-NEXT: # fallthrough-return
%v = uitofp <8 x i32> %x to <8 x double>
%a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -394,14 +402,16 @@ define <4 x double> @promote_low_v4f64(<8 x float> %x) {
; CHECK-NEXT: # %bb.0:
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
-; CHECK-NEXT: local.get 1
-; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
; CHECK-NEXT: f64x2.promote_low_f32x4
-; CHECK-NEXT: v128.store 16
+; CHECK-NEXT: v128.store 0
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
+; CHECK-NEXT: local.get 1
+; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: f64x2.promote_low_f32x4
-; CHECK-NEXT: v128.store 0
+; CHECK-NEXT: v128.store 16
; CHECK-NEXT: # fallthrough-return
%v = shufflevector <8 x float> %x, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%a = fpext <4 x float> %v to <4 x double>
@@ -414,14 +424,16 @@ define <4 x double> @promote_low_v4f64_2(<8 x float> %x) {
; CHECK-NEXT: # %bb.0:
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
-; CHECK-NEXT: local.get 1
-; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
; CHECK-NEXT: f64x2.promote_low_f32x4
-; CHECK-NEXT: v128.store 16
+; CHECK-NEXT: v128.store 0
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
+; CHECK-NEXT: local.get 1
+; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: f64x2.promote_low_f32x4
-; CHECK-NEXT: v128.store 0
+; CHECK-NEXT: v128.store 16
; CHECK-NEXT: # fallthrough-return
%v = fpext <8 x float> %x to <8 x double>
%a = shufflevector <8 x double> %v, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -435,6 +447,8 @@ define <2 x double> @promote_mixed_v2f64(<4 x float> %x, <4 x float> %y) {
; CHECK-NEXT: local.get 0
; CHECK-NEXT: local.get 1
; CHECK-NEXT: i8x16.shuffle 8, 9, 10, 11, 28, 29, 30, 31, 0, 1, 2, 3, 0, 1, 2, 3
+; CHECK-NEXT: v128.const -1, 0
+; CHECK-NEXT: v128.and
; CHECK-NEXT: f64x2.promote_low_f32x4
; CHECK-NEXT: ...
[truncated]
Can you say a little more about what the advantages of this are, i.e. what the VM does differently as a result? (And which VMs have you tested this with?)
I haven't tested with any VMs yet, as I doubt any of them will be taking advantage of this now. The main advantage of this change is to identify 'narrow' shuffles that can be mapped to target instructions. Even though Wasm is 128-bit, that doesn't always mean we're operating on the full width. Imagine we want a 4 x 16-bit result containing the even lanes of the input: 0, 2, 4, 6. The wasm shuffle will be 0, 2, 4, 6, 0, 0, 0, 0. I've optimised the AArch64 backend in V8 so that these cases are often handled by splatting lane zero first, but this is still far from optimal. With the undef mask, during isel and with very little overhead, the backend can recognize this as an 'unzip' operation instead of an arbitrary lane shuffle. The extend_low operations also provide the same information as this mask, but if the shuffle has multiple users it's unlikely to be such a simple optimisation during isel. I've created an optimisation in V8 specifically for figuring out undef lanes and it's non-trivial. This undef mask change would make it much simpler for other runtimes to generate good shuffle code. As you may have noticed, I've found WebAssembly shuffles to be a real pain! I would really like to see a revision to the spec so that these undef lanes/bytes can be explicitly encoded :)
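To make the even-lanes case above concrete, here is a hypothetical sketch of the sequence emitted with this patch, with the 16-bit lane shuffle 0, 2, 4, 6 written out as an i8x16 byte shuffle:

```
# Even 16-bit lanes (0, 2, 4, 6) of the input, expressed as byte indices.
i8x16.shuffle 0, 1, 4, 5, 8, 9, 12, 13, 0, 1, 0, 1, 0, 1, 0, 1
# The AND added by this patch zeroes the undef upper 8 bytes, so a VM can
# match the pair as a narrow 'unzip low' rather than a general 16-byte shuffle.
v128.const -1, 0
v128.and
```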
If we did add such an extension to the shuffle instruction, we would still have to specify what value ends up in the lanes of the result. Would it be portable and fast if we specified that the "undef" lanes all end up containing zeros, for instance?
With ~20 lines of code in V8 to notice the AND mask, this change gave me a ~10% speedup on my memory-interleaving microbenchmark suite.
This is the only option that I have considered, really. It would then have the same semantics as what I'm proposing here, and I would expect it to be cheap enough on any architecture.