
Slow indirect function calls with 16-byte by-pointer enum argument #143050

@y21

Description


(The title is intentionally rather general even though this does seem somewhat specific, since I'm not 100% sure whether this is something that could be fixed on the rustc side or the LLVM side, or whether my analysis is even correct; regardless, it was a bit unexpected to me that a function taking a 16-byte enum would have any more overhead than one taking other 16-byte types, so I filed it as a bug.)

While I was profiling a program that makes somewhat heavy use of indirect function calls that can't be inlined, I noticed that a non-trivial amount of time was spent on movups/movaps instructions to copy a 16-byte enum argument that was being passed by pointer.

I believe I reduced the slowdown to a function that takes a 16-byte enum as an argument (which rustc passes by pointer) and its caller, which constructs the value on the stack.

#[inline(never)]
pub fn byptr(val: Result<u64, u32>) {
  std::hint::black_box(val); // force a load of the parameter which is passed by pointer
}

pub fn test() {
  byptr(Ok(1));
}

Godbolt

Here's a full reproducer that can be run and shows the slowdown: the call is wrapped in a tight loop and compared against another function that takes a different 16-byte type, also passed by pointer, which doesn't have this issue.
Running that locally, version 1 consistently performs about 5x worse than version 2 (375ms vs 70ms) on an AMD Ryzen 3 PRO 3200G.
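
For reference, a minimal sketch of what the linked reproducer does (the exact code is in the link; the function names, iteration count, and values here are made up for illustration):

use std::time::Instant;

#[inline(never)]
fn take_enum(val: Result<u64, u32>) {
    // 16-byte enum, passed by pointer: the caller builds it on its stack
    std::hint::black_box(val);
}

#[inline(never)]
fn take_tuple(val: (u32, u32, u32, u32)) {
    // also 16 bytes and passed by pointer, but copied with matching widths
    std::hint::black_box(val);
}

fn main() {
    const N: u64 = 100_000_000;

    let start = Instant::now();
    for i in 0..N {
        take_enum(Ok(i));
    }
    println!("enum:  {:?}", start.elapsed());

    let start = Instant::now();
    for i in 0..N {
        take_tuple((i as u32, 0, 0, 0));
    }
    println!("tuple: {:?}", start.elapsed());
}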


Now, this is more or less an educated guess at the cause (I don't have a machine at hand that can run perf with hardware counters), but looking at the asm, as mentioned above the callee uses a movups to load the 16-byte enum into xmm0:

example::byptr::hf156bd35b01c5a6e:
        movups  xmm0, xmmword ptr [rdi]

At the call site, however, to initialize that Ok(1) value on the stack, it does a 32-bit store at [rdi] for the discriminant and another 64-bit store at [rdi + 8] for the payload:

example::test::h8fec37f4e1b51032:
...
        mov     qword ptr [rsp + 16], 1
        mov     dword ptr [rsp + 8], 0
        lea     rdi, [rsp + 8]

Could the slowdown come from a failed store-to-load forward? As far as I know, this pattern of making two smaller stores to build a large value and then loading the large value all at once is a case that store-to-load forwarding can't handle, since there's no entry in the store buffer with a matching start address and a greater-or-equal size, and it results in a perf degradation.

If I use (u32, u32, u32, u32) instead of Result<u64, u32>, I see that it uses a 128-bit store/load via movups/movaps on both sides, where store-to-load forwarding presumably succeeds, and as the repro above showed, this is significantly faster.

Likewise, changing the function to explicitly take an &Result<u64, u32> and initializing the value manually with a movups makes it fast again, as does changing the function to use two separate movs for the load. So basically, making sure the stores and loads match up in size makes the perf degradation go away.

So my question would be: is there something that prevents Result<u64, u32> from just being passed by value as an i128 instead of by memory? A very quick bisection on Godbolt shows that it did do that up until 1.60 for the code above; in 1.61 it started passing it by pointer. Interestingly, Result<u64, u64> does get passed by value.
Or is this something that would be better fixed on the LLVM side, e.g. by not splitting up the stores/loads like that? Assuming that this actually is the issue.
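
For comparison, here's the Result<u64, u64> variant mentioned above, written in the same shape as the original example (byval/test_byval are just names I made up); per the same Godbolt experiment, this one does get passed by value:

#[inline(never)]
pub fn byval(val: Result<u64, u64>) {
    std::hint::black_box(val);
}

pub fn test_byval() {
    byval(Ok(1));
}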

Meta

rustc 1.89.0-nightly (cdd545be1 2025-06-07)
binary: rustc
commit-hash: cdd545be1b4f024d38360aa9f000dcb782fbc81b
commit-date: 2025-06-07
host: x86_64-unknown-linux-gnu
release: 1.89.0-nightly
LLVM version: 20.1.5

(but it reproduces with any rustc from 1.61 onwards)

    Labels

    A-ABI (Area: Concerning the application binary interface (ABI))
    C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
    I-slow (Issue: Problems and improvements with respect to performance of generated code)
    T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)
