-
Notifications
You must be signed in to change notification settings - Fork 18.3k
Description
The issue
gofrontend
currently generates very inefficient code for T$equal
and t1 == t2
when dealing with identity-comparable types (i.e. integers, pointers, and unpadded structs and arrays of such).
For example:
- The following code does two 4-byte compares and a jump in
T$equal
, and calls memcmp int1 == t2
. Fort1 == t2
,gofrontend
has a trick to turn this into a single 8-byte compare if the struct is aligned and size<=16 bytes, but it's not the case here.
type T struct {
a int32
b int32
}
- the following code always generates a call to memcmp:
type T struct {
a int32
b int32
c int32
d int32
}
A solution
In 2016 gcc introduced the __builtin_memcmp_eq
builtin that knows how to lower memcmp efficiently when the result is only used for equality comparison (i.e. equality with 0 instead of 3-way ordering). This is typically useful when the size is a constexpr (as is the case here).
The basic idea is to replace a larger chain of integer comparisons loaded from contiguous memory locations into a smaller chain of bigger integer comparisons. Benefits are twofold:
- There are less jumps, and therefore less opportunities for mispredictions and I-cache misses.
- The code is smaller, both because jumps are removed and because the encoding of a 2*n byte compare is smaller than that of two n-byte compares.
As a first step, I’m simply proposing to replace calls to runtime.memequal
with calls to __builtin_memcmp_eq
. This only improves the generated code.
In first second example above, this would change the generated code (gccgo -march=haswell -m64 -O3 -c test.go
) from:
t1 == t2
b: 8b 56 04 mov 0x4(%rsi),%edx
e: 8b 4f 04 mov 0x4(%rdi),%ecx
11: 31 c0 xor %eax,%eax
13: 8b 36 mov (%rsi),%esi
15: 39 37 cmp %esi,(%rdi)
17: 74 07 je 20 <go.test.DoStuff+0x20>
19: c3 retq
1a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
20: 39 d1 cmp %edx,%ecx
22: 0f 94 c0 sete %al
25: c3 retq
T$equal
7b: 48 83 ec 08 sub $0x8,%rsp
7f: ba 08 00 00 00 mov $0x8,%edx
84: e8 00 00 00 00 callq 89 <go.test.T$equal+0x19>
85: R_X86_64_PC32 runtime.memequal-0x4
89: 48 83 c4 08 add $0x8,%rsp
8d: c3 retq
To (in both cases):
7b: 48 8b 06 mov (%rsi),%rax
7e: 48 39 07 cmp %rax,(%rdi)
81: 0f 94 c0 sete %al
84: c3 retq
This is both smaller in terms of code size and much more efficient.
Going further
Simplifying gofrontend
This also allows removing any specific code for handling sizes smaller than 16 bytes since they are already handled by gcc.
More performance improvements
This should be extended to piecewise-identity-comparable structs. For example, the following structure should be compared with three builtin calls ({a}
, {c,d,e}
, and {f,g}
) and a float compare.
type T struct {
a int32
b float32 // Floats are not identity-comparable
c int32
d int32
e byte
// Implicit _ [3]byte padding
f int32
g int32
}