Elide suspension points via [[clang::coro_await_suspend_destroy]] #152623

snarkmaster · 2025-08-08T01:52:18Z

Start by reading the detailed user-facing docs in AttrDocs.td.

My immediate motivation was that I noticed that short-circuiting coroutines failed to optimize well. Interact with the demo program here: https://godbolt.org/z/E3YK5c45a

If Clang on Compiler Explorer supported [[clang::coro_await_suspend_destroy]], the assembly for simple_coro would be drastically shorter, and would not contain a call to operator new.

Here are a few high-level thoughts that don't belong in the docs:

This has lit tests, but what gives me real confidence in its correctness is the integration test in coro_await_suspend_destroy_test.cpp. This caught all the interesting bugs that I had in earlier revs, and covers equivalence to the standard code path in far more scenarios.
I considered a variety of other designs. Here are some key design points:
- I considered optimizing unmodified await_suspend() methods, as long as they unconditionally end with an h.destroy() call on the current handle, or an exception. However, this would (a) force dynamic dispatch for destroy -- bloating IR & reducing optimization opportunities, (b) require far more complex, delicate, and fragile analysis, (c) retain more of the frame setup, so that e.g. h.done() works properly. The current solution shortcuts all these concerns.
- I want await_suspend_destroy to take Promise&, rather than std::coroutine_handle -- this is safer, simpler, and more efficient. Short-circuiting corotuines should not touch the handle. This decision forces this to be a class attribute. Resolving a method attribute would have required looking up overloads for both types, and choosing one, which is costly and a bad UX to boot.
- AttrDocs.td shows how to write portable code with a stub await_suspend(). This portability / compatibility solution avoids dire issues that would arise if users relied on __has_cpp_attribute() and the declaration and definition happened to use different toolchains. In particular, it will even be safe for a future compiler release to killswitch this attribute by removing its implementation and setting its version to 0.

let Spellings = [Clang<"coro_destroy_after_suspend", /*allowInC*/ 0,
                 /*Version*/ 0>];

In the docs, I mention the HasCoroSuspend path in CoroEarly.cpp as a further optimization opportunity. But, I'm sure there are higher-leverage ways of making these non-suspending coros compile better, I just don't know the coro optimization pipeline well enough to flag them.
IIUC the only interaction of this with coro_only_destroy_when_complete would be that the compiler expends fewer cycles if both are set.
I ran some benchmarks on folly::result. It's definitely faster, heap allocs are definitely elided, the compiled code looks like a function, not a coroutine, but there's still an optimization gap. On the plus side, this results in a 4x speedup (!) in optimized ASAN builds (numbers not shown for brevity).

// Simple result coroutine that adds 1 to the input
result<int> result_coro(result<int>&& r) {
  co_return co_await std::move(r) + 1;
}

// Non-coroutine equivalent using value_or_throw()
result<int> catching_result_func(result<int>&& r) {
  return result_catch_all([&]() -> result<int> {
    if (r.has_value()) {
      return r.value_or_throw() + 1;
    }
    return std::move(r).non_value();
  });
}

// Not QUITE equivalent to the coro -- lacks the exception boundary
result<int> non_catching_result_func(result<int>&& r) {
  if (r.has_value()) {
    return r.value_or_throw() + 1;
  }
  return std::move(r).non_value();
}

============================================================================
[...]lly/result/test/result_coro_bench.cpp     relative  time/iter   iters/s
============================================================================
result_coro_success                                        13.61ns    73.49M
non_catching_result_func_success                            3.39ns   295.00M
catching_result_func_success                                4.41ns   226.88M
result_coro_error                                          19.55ns    51.16M
non_catching_result_func_error                              9.15ns   109.26M
catching_result_func_error                                 10.19ns    98.10M

============================================================================
[...]lly/result/test/result_coro_bench.cpp     relative  time/iter   iters/s
============================================================================
result_coro_success                                        10.59ns    94.39M
non_catching_result_func_success                            3.39ns   295.00M
catching_result_func_success                                4.07ns   245.81M
result_coro_error                                          13.66ns    73.18M
non_catching_result_func_error                              9.00ns   111.11M
catching_result_func_error                                 10.04ns    99.63M

Demo program from the Compiler Explorer link above:

#include <coroutine>
#include <optional>

// Read this LATER -- this implementation detail isn't required to understand
// the value of [[clang::coro_await_suspend_destroy]].
//
// `optional_wrapper` exists since `get_return_object()` can't return
// `std::optional` directly. C++ coroutines have a fundamental timing mismatch
// between when the return object is created and when the value is available:
//
// 1) Early (coroutine startup): `get_return_object()` is called and must return
//    something immediately.
// 2) Later (when `co_return` executes): `return_value(T)` is called with the
//    actual value.
// 3) Issue: If `get_return_object()` returns the storage, it's empty when
//    returned, and writing to it later cannot affect the already-returned copy.
template <typename T>
struct optional_wrapper {
  std::optional<T> storage_;
  std::optional<T>*& pointer_;
  optional_wrapper(std::optional<T>*& p) : pointer_(p) {
    pointer_ = &storage_;
  }
  operator std::optional<T>() { return std::move(storage_); }
  ~optional_wrapper() {}
};

// Make `std::optional` a coroutine
template <typename T, typename... Args>
struct std::coroutine_traits<std::optional<T>, Args...> {
  struct promise_type {
    std::optional<T>* storagePtr_ = nullptr;
    promise_type() = default;
    ::optional_wrapper<T> get_return_object() {
      return ::optional_wrapper<T>(storagePtr_);
    }
    std::suspend_never initial_suspend() const noexcept { return {}; }
    std::suspend_never final_suspend() const noexcept { return {}; }
    void return_value(T&& value) { *storagePtr_ = std::move(value); }
    void unhandled_exception() {
      // Leave storage_ empty to represent error
    }
  };
};

template <typename T>
struct [[clang::coro_await_suspend_destroy]] optional_awaitable {
  std::optional<T> opt_;
  bool await_ready() const noexcept { return opt_.has_value(); }
  T await_resume() { return std::move(opt_).value(); }
  // Adding `noexcept` here makes the early IR much smaller, but the
  // optimizer is able to discard the cruft for simpler cases.
  void await_suspend_destroy(auto& promise) noexcept {
    // Assume the return object defaults to "empty"
  }
  void await_suspend(auto handle) {
    await_suspend_destroy(handle.promise());
    handle.destroy();
  }
};

template <typename T>
optional_awaitable<T> operator co_await(std::optional<T> opt) {
  return {std::move(opt)};
}

// Non-coroutine baseline -- matches the logic of `simple_coro`.
std::optional<int> simple_func(const std::optional<int>& r) {
  try {
    if (r.has_value()) {
        return r.value() + 1;
    }
  } catch (...) {}
  return std::nullopt; // return empty on empty input or error
}

// Without `coro_await_suspend_destroy`, allocates its frame on-heap.
std::optional<int> simple_coro(const std::optional<int>& r) {
  co_return co_await std::move(r) + 4;
}

// Without `co_await`, this optimizes much like `simple_func`.
// Bugs:
//  - Doesn't short-circuit when `r` is empty, but throws
//  - Lacks an exception boundary
std::optional<int> wrong_simple_coro(const std::optional<int>& r) {
  co_return r.value() + 2;
}

int main() {
  return
      simple_func(std::optional<int>{32}).value() +
      simple_coro(std::optional<int>{8}).value() +
      wrong_simple_coro(std::optional<int>{16}).value();
}

Test Plan:

For the all-important E2E test, I used this terrible cargo-culted script to run the new end-to-end test with the new compiler. (Yes, I realize I should only need 10% of those -D settings for a successful build.)

To make sure the test covered what I meant it to do:

I also added an #error in the "no attribute" branch to make sure the compiler indeed supports the attribute.
I ran it with a compiler not supporting the attribute, and that also passed.
I also tried return 1; from main() and saw the logs of the 7 successful tests running.

#!/bin/bash -uex
set -o pipefail
LLVMBASE=/path/to/source/of/llvm-project
SYSCLANG=/path/to/origianl/bin/clang

# NB Can add `--debug-output` to debug cmake...

# Bootstrap clang -- Use `RelWithDebInfo` or the next phase is too slow!
mkdir -p bootstrap
cd bootstrap
cmake "$LLVMBASE/llvm" \
    -G Ninja \
    -DBUILD_SHARED_LIBS=true \
    -DCMAKE_ASM_COMPILER="$SYSCLANG" \
    -DCMAKE_ASM_COMPILER_ID=Clang \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_CXX_COMPILER="$SYSCLANG"++ \
    -DCMAKE_C_COMPILER="$SYSCLANG" \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_BINDINGS=OFF \
    -DLLVM_ENABLE_LLD=ON \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_OPTIMIZED_TABLEGEN=true \
    -DLLVM_FORCE_ENABLE_STATS=ON \
    -DLLVM_ENABLE_DUMP=ON \
    -DCLANG_DEFAULT_PIE_ON_LINUX=OFF
ninja clang lld
ninja check-clang-codegencoroutines # Includes the new IR regression tests
cd ..

NEWCLANG="$PWD"/bootstrap/bin/clang
NEWLLD="$PWD"/bootstrap/bin/lld
# LIBCXX_INCLUDE_BENCHMARKS=OFF because google-benchmark bugs out
cmake "$LLVMBASE/runtimes" \
    -G Ninja \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \
    -DBUILD_SHARED_LIBS=true \
    -DCMAKE_ASM_COMPILER="$NEWCLANG" \
    -DCMAKE_ASM_COMPILER_ID=Clang \
    -DCMAKE_C_COMPILER="$NEWCLANG" \
    -DCMAKE_CXX_COMPILER="$NEWCLANG"++ \
    -DLLVM_FORCE_ENABLE_STATS=ON \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_LLD=ON \
    -DLIBCXX_INCLUDE_TESTS=ON \
    -DLIBCXX_INCLUDE_BENCHMARKS=OFF \
    -DLLVM_INCLUDE_TESTS=ON \
    -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

ninja cxx-test-depends

LIBCXXBUILD=$PWD
cd "$LLVMBASE"

libcxx/utils/libcxx-lit "$LIBCXXBUILD" -v \
    libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp

Start by reading the detailed user-facing docs in `AttrDocs.td`. My immediate motivation was that I noticed that short-circuiting coroutines failed to optimize well. Interact with the demo program here: https://godbolt.org/z/E3YK5c45a If Clang on Compiler Explorer supported [[clang::coro_await_suspend_destroy]], the assembly for `simple_coro` would be drastically shorter, and would not contain a call to `operator new`. Here are a few high-level thoughts that don't belong in the docs: - This has `lit` tests, but what gives me real confidence in its correctness is the integration test in `coro_await_suspend_destroy_test.cpp`. This caught all the interesting bugs that I had in earlier revs, and covers equivalence to the standard code path in far more scenarios. - I considered a variety of other designs. Here are some key design points: * I considered optimizing unmodified `await_suspend()` methods, as long as they unconditionally end with an `h.destroy()` call on the current handle, or an exception. However, this would (a) force dynamic dispatch for `destroy` -- bloating IR & reducing optimization opportunities, (b) require far more complex, delicate, and fragile analysis, (c) retain more of the frame setup, so that e.g. `h.done()` works properly. The current solution shortcuts all these concerns. * I want to `Promise&`, rather than `std::coroutine_handle` to `await_suspend_destroy` -- this is safer, simpler, and more efficient. Short-circuiting corotuines should not touch the handle. This decision forces the attribue to go on the class. Resolving a method attribute would have required looking up overloads for both types, and choosing one, which is costly and a bad UX to boot. * `AttrDocs.td` tells portable code to provide a stub `await_suspend()`. This portability / compatibility solution avoids dire issues that would arise if users relied on `__has_cpp_attribute` and the declaration and definition happened to use different toolchains. In particular, it will even be safe for a future compiler release to killswitch this attribute by removing its implementation and setting its version to 0. ``` let Spellings = [Clang<"coro_destroy_after_suspend", /*allowInC*/ 0, /*Version*/ 0>]; ``` - In the docs, I mention the `HasCoroSuspend` path in `CoroEarly.cpp` as a further optimization opportunity. But, I'm sure there are higher-leverage ways of making these non-suspending coros compile better, I just don't know the coro optimization pipeline well enough to flag them. - IIUC the only interaction of this with `coro_only_destroy_when_complete` would be that the compiler expends fewer cycles. - I ran some benchmarks on [folly::result]( https://github.com/facebook/folly/blob/main/folly/result/docs/result.md). Heap allocs are definitely elided, the compiled code looks like a function, not a coroutine, but there's still an optimization gap. On the plus side, this results in a 4x speedup (!) in optimized ASAN builds (numbers not shown for brevity. ``` // Simple result coroutine that adds 1 to the input result<int> result_coro(result<int>&& r) { co_return co_await std::move(r) + 1; } // Non-coroutine equivalent using value_or_throw() result<int> catching_result_func(result<int>&& r) { return result_catch_all([&]() -> result<int> { if (r.has_value()) { return r.value_or_throw() + 1; } return std::move(r).non_value(); }); } // Not QUITE equivalent to the coro -- lacks the exception boundary result<int> non_catching_result_func(result<int>&& r) { if (r.has_value()) { return r.value_or_throw() + 1; } return std::move(r).non_value(); } ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 13.61ns 73.49M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.41ns 226.88M result_coro_error 19.55ns 51.16M non_catching_result_func_error 9.15ns 109.26M catching_result_func_error 10.19ns 98.10M ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 10.59ns 94.39M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.07ns 245.81M result_coro_error 13.66ns 73.18M non_catching_result_func_error 9.00ns 111.11M catching_result_func_error 10.04ns 99.63M ``` Demo program from the Compiler Explorer link above: ```cpp #include <coroutine> #include <optional> // Read this LATER -- this implementation detail isn't required to understand // the value of [[clang::coro_await_suspend_destroy]]. // // `optional_wrapper` exists since `get_return_object()` can't return // `std::optional` directly. C++ coroutines have a fundamental timing mismatch // between when the return object is created and when the value is available: // // 1) Early (coroutine startup): `get_return_object()` is called and must return // something immediately. // 2) Later (when `co_return` executes): `return_value(T)` is called with the // actual value. // 3) Issue: If `get_return_object()` returns the storage, it's empty when // returned, and writing to it later cannot affect the already-returned copy. template <typename T> struct optional_wrapper { std::optional<T> storage_; std::optional<T>*& pointer_; optional_wrapper(std::optional<T>*& p) : pointer_(p) { pointer_ = &storage_; } operator std::optional<T>() { return std::move(storage_); } ~optional_wrapper() {} }; // Make `std::optional` a coroutine template <typename T, typename... Args> struct std::coroutine_traits<std::optional<T>, Args...> { struct promise_type { std::optional<T>* storagePtr_ = nullptr; promise_type() = default; ::optional_wrapper<T> get_return_object() { return ::optional_wrapper<T>(storagePtr_); } std::suspend_never initial_suspend() const noexcept { return {}; } std::suspend_never final_suspend() const noexcept { return {}; } void return_value(T&& value) { *storagePtr_ = std::move(value); } void unhandled_exception() { // Leave storage_ empty to represent error } }; }; template <typename T> struct [[clang::coro_await_suspend_destroy]] optional_awaitable { std::optional<T> opt_; bool await_ready() const noexcept { return opt_.has_value(); } T await_resume() { return std::move(opt_).value(); } // Adding `noexcept` here makes the early IR much smaller, but the // optimizer is able to discard the cruft for simpler cases. void await_suspend_destroy(auto& promise) noexcept { // Assume the return object defaults to "empty" } void await_suspend(auto handle) { await_suspend_destroy(handle.promise()); handle.destroy(); } }; template <typename T> optional_awaitable<T> operator co_await(std::optional<T> opt) { return {std::move(opt)}; } // Non-coroutine baseline -- matches the logic of `simple_coro`. std::optional<int> simple_func(const std::optional<int>& r) { try { if (r.has_value()) { return r.value() + 1; } } catch (...) {} return std::nullopt; // return empty on empty input or error } // Without `coro_await_suspend_destroy`, allocates its frame on-heap. std::optional<int> simple_coro(const std::optional<int>& r) { co_return co_await std::move(r) + 4; } // Without `co_await`, this optimizes much like `simple_func`. // Bugs: // - Doesn't short-circuit when `r` is empty, but throws // - Lacks an exception boundary std::optional<int> wrong_simple_coro(const std::optional<int>& r) { co_return r.value() + 2; } int main() { return simple_func(std::optional<int>{32}).value() + simple_coro(std::optional<int>{8}).value() + wrong_simple_coro(std::optional<int>{16}).value(); } ``` Test Plan: For the all-important E2E test, I used this terrible cargo-culted script to run the new end-to-end test with the new compiler. (Yes, I realize I should only need 10% of those `-D` settings for a successful build.) To make sure the test covered what I meant it to do: - I also added an `#error` in the "no attribute" branch to make sure the compiler indeed supports the attribute. - I ran it with a compiler not supporting the attribute, and that also passed. - I also tried `return 1;` from `main()` and saw the logs of the 7 successful tests running. ```sh #!/bin/bash -uex set -o pipefail LLVMBASE=/path/to/source/of/llvm-project SYSCLANG=/path/to/origianl/bin/clang # NB Can add `--debug-output` to debug cmake... # Bootstrap clang -- Use `RelWithDebInfo` or the next phase is too slow! mkdir -p bootstrap cd bootstrap cmake "$LLVMBASE/llvm" \ -G Ninja \ -DBUILD_SHARED_LIBS=true \ -DCMAKE_ASM_COMPILER="$SYSCLANG" \ -DCMAKE_ASM_COMPILER_ID=Clang \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DCMAKE_CXX_COMPILER="$SYSCLANG"++ \ -DCMAKE_C_COMPILER="$SYSCLANG" \ -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DLLVM_ENABLE_BINDINGS=OFF \ -DLLVM_ENABLE_LLD=ON \ -DLLVM_ENABLE_PROJECTS="clang;lld" \ -DLLVM_OPTIMIZED_TABLEGEN=true \ -DLLVM_FORCE_ENABLE_STATS=ON \ -DLLVM_ENABLE_DUMP=ON \ -DCLANG_DEFAULT_PIE_ON_LINUX=OFF ninja clang lld ninja check-clang-codegencoroutines # Includes the new IR regression tests cd .. NEWCLANG="$PWD"/bootstrap/bin/clang NEWLLD="$PWD"/bootstrap/bin/lld # LIBCXX_INCLUDE_BENCHMARKS=OFF because google-benchmark bugs out cmake "$LLVMBASE/runtimes" \ -G Ninja \ -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \ -DBUILD_SHARED_LIBS=true \ -DCMAKE_ASM_COMPILER="$NEWCLANG" \ -DCMAKE_ASM_COMPILER_ID=Clang \ -DCMAKE_C_COMPILER="$NEWCLANG" \ -DCMAKE_CXX_COMPILER="$NEWCLANG"++ \ -DLLVM_FORCE_ENABLE_STATS=ON \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DLLVM_ENABLE_LLD=ON \ -DLIBCXX_INCLUDE_TESTS=ON \ -DLIBCXX_INCLUDE_BENCHMARKS=OFF \ -DLLVM_INCLUDE_TESTS=ON \ -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ninja cxx-test-depends LIBCXXBUILD=$PWD cd "$LLVMBASE" libcxx/utils/libcxx-lit "$LIBCXXBUILD" -v \ libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp ```

github-actions · 2025-08-08T01:52:38Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2025-08-08T01:53:10Z

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-coroutines

Author: None (snarkmaster)

Changes

Start by reading the detailed user-facing docs in AttrDocs.td.

My immediate motivation was that I noticed that short-circuiting coroutines failed to optimize well. Interact with the demo program here: https://godbolt.org/z/E3YK5c45a

If Clang on Compiler Explorer supported [[clang::coro_await_suspend_destroy]], the assembly for simple_coro would be drastically shorter, and would not contain a call to operator new.

Here are a few high-level thoughts that don't belong in the docs:

This has lit tests, but what gives me real confidence in its correctness is the integration test in coro_await_suspend_destroy_test.cpp. This caught all the interesting bugs that I had in earlier revs, and covers equivalence to the standard code path in far more scenarios.
I considered a variety of other designs. Here are some key design points:
- I considered optimizing unmodified await_suspend() methods, as long as they unconditionally end with an h.destroy() call on the current handle, or an exception. However, this would (a) force dynamic dispatch for destroy -- bloating IR & reducing optimization opportunities, (b) require far more complex, delicate, and fragile analysis, (c) retain more of the frame setup, so that e.g. h.done() works properly. The current solution shortcuts all these concerns.
- I want to Promise&, rather than std::coroutine_handle to await_suspend_destroy -- this is safer, simpler, and more efficient. Short-circuiting corotuines should not touch the handle. This decision forces the attribue to go on the class. Resolving a method attribute would have required looking up overloads for both types, and choosing one, which is costly and a bad UX to boot.
- AttrDocs.td tells portable code to provide a stub await_suspend(). This portability / compatibility solution avoids dire issues that would arise if users relied on __has_cpp_attribute and the declaration and definition happened to use different toolchains. In particular, it will even be safe for a future compiler release to killswitch this attribute by removing its implementation and setting its version to 0.

let Spellings = [Clang&lt;"coro_destroy_after_suspend", /*allowInC*/ 0,
                 /*Version*/ 0&gt;];

In the docs, I mention the HasCoroSuspend path in CoroEarly.cpp as a further optimization opportunity. But, I'm sure there are higher-leverage ways of making these non-suspending coros compile better, I just don't know the coro optimization pipeline well enough to flag them.
IIUC the only interaction of this with coro_only_destroy_when_complete would be that the compiler expends fewer cycles.
I ran some benchmarks on folly::result. Heap allocs are definitely elided, the compiled code looks like a function, not a coroutine, but there's still an optimization gap. On the plus side, this results in a 4x speedup (!) in optimized ASAN builds (numbers not shown for brevity.

// Simple result coroutine that adds 1 to the input
result&lt;int&gt; result_coro(result&lt;int&gt;&amp;&amp; r) {
  co_return co_await std::move(r) + 1;
}

// Non-coroutine equivalent using value_or_throw()
result&lt;int&gt; catching_result_func(result&lt;int&gt;&amp;&amp; r) {
  return result_catch_all([&amp;]() -&gt; result&lt;int&gt; {
    if (r.has_value()) {
      return r.value_or_throw() + 1;
    }
    return std::move(r).non_value();
  });
}

// Not QUITE equivalent to the coro -- lacks the exception boundary
result&lt;int&gt; non_catching_result_func(result&lt;int&gt;&amp;&amp; r) {
  if (r.has_value()) {
    return r.value_or_throw() + 1;
  }
  return std::move(r).non_value();
}

============================================================================
[...]lly/result/test/result_coro_bench.cpp     relative  time/iter   iters/s
============================================================================
result_coro_success                                        13.61ns    73.49M
non_catching_result_func_success                            3.39ns   295.00M
catching_result_func_success                                4.41ns   226.88M
result_coro_error                                          19.55ns    51.16M
non_catching_result_func_error                              9.15ns   109.26M
catching_result_func_error                                 10.19ns    98.10M

============================================================================
[...]lly/result/test/result_coro_bench.cpp     relative  time/iter   iters/s
============================================================================
result_coro_success                                        10.59ns    94.39M
non_catching_result_func_success                            3.39ns   295.00M
catching_result_func_success                                4.07ns   245.81M
result_coro_error                                          13.66ns    73.18M
non_catching_result_func_error                              9.00ns   111.11M
catching_result_func_error                                 10.04ns    99.63M

Demo program from the Compiler Explorer link above:

#include &lt;coroutine&gt;
#include &lt;optional&gt;

// Read this LATER -- this implementation detail isn't required to understand
// the value of [[clang::coro_await_suspend_destroy]].
//
// `optional_wrapper` exists since `get_return_object()` can't return
// `std::optional` directly. C++ coroutines have a fundamental timing mismatch
// between when the return object is created and when the value is available:
//
// 1) Early (coroutine startup): `get_return_object()` is called and must return
//    something immediately.
// 2) Later (when `co_return` executes): `return_value(T)` is called with the
//    actual value.
// 3) Issue: If `get_return_object()` returns the storage, it's empty when
//    returned, and writing to it later cannot affect the already-returned copy.
template &lt;typename T&gt;
struct optional_wrapper {
  std::optional&lt;T&gt; storage_;
  std::optional&lt;T&gt;*&amp; pointer_;
  optional_wrapper(std::optional&lt;T&gt;*&amp; p) : pointer_(p) {
    pointer_ = &amp;storage_;
  }
  operator std::optional&lt;T&gt;() { return std::move(storage_); }
  ~optional_wrapper() {}
};

// Make `std::optional` a coroutine
template &lt;typename T, typename... Args&gt;
struct std::coroutine_traits&lt;std::optional&lt;T&gt;, Args...&gt; {
  struct promise_type {
    std::optional&lt;T&gt;* storagePtr_ = nullptr;
    promise_type() = default;
    ::optional_wrapper&lt;T&gt; get_return_object() {
      return ::optional_wrapper&lt;T&gt;(storagePtr_);
    }
    std::suspend_never initial_suspend() const noexcept { return {}; }
    std::suspend_never final_suspend() const noexcept { return {}; }
    void return_value(T&amp;&amp; value) { *storagePtr_ = std::move(value); }
    void unhandled_exception() {
      // Leave storage_ empty to represent error
    }
  };
};

template &lt;typename T&gt;
struct [[clang::coro_await_suspend_destroy]] optional_awaitable {
  std::optional&lt;T&gt; opt_;
  bool await_ready() const noexcept { return opt_.has_value(); }
  T await_resume() { return std::move(opt_).value(); }
  // Adding `noexcept` here makes the early IR much smaller, but the
  // optimizer is able to discard the cruft for simpler cases.
  void await_suspend_destroy(auto&amp; promise) noexcept {
    // Assume the return object defaults to "empty"
  }
  void await_suspend(auto handle) {
    await_suspend_destroy(handle.promise());
    handle.destroy();
  }
};

template &lt;typename T&gt;
optional_awaitable&lt;T&gt; operator co_await(std::optional&lt;T&gt; opt) {
  return {std::move(opt)};
}

// Non-coroutine baseline -- matches the logic of `simple_coro`.
std::optional&lt;int&gt; simple_func(const std::optional&lt;int&gt;&amp; r) {
  try {
    if (r.has_value()) {
        return r.value() + 1;
    }
  } catch (...) {}
  return std::nullopt; // return empty on empty input or error
}

// Without `coro_await_suspend_destroy`, allocates its frame on-heap.
std::optional&lt;int&gt; simple_coro(const std::optional&lt;int&gt;&amp; r) {
  co_return co_await std::move(r) + 4;
}

// Without `co_await`, this optimizes much like `simple_func`.
// Bugs:
//  - Doesn't short-circuit when `r` is empty, but throws
//  - Lacks an exception boundary
std::optional&lt;int&gt; wrong_simple_coro(const std::optional&lt;int&gt;&amp; r) {
  co_return r.value() + 2;
}

int main() {
  return
      simple_func(std::optional&lt;int&gt;{32}).value() +
      simple_coro(std::optional&lt;int&gt;{8}).value() +
      wrong_simple_coro(std::optional&lt;int&gt;{16}).value();
}

Test Plan:

For the all-important E2E test, I used this terrible cargo-culted script to run the new end-to-end test with the new compiler. (Yes, I realize I should only need 10% of those -D settings for a successful build.)

To make sure the test covered what I meant it to do:

I also added an #error in the "no attribute" branch to make sure the compiler indeed supports the attribute.
I ran it with a compiler not supporting the attribute, and that also passed.
I also tried return 1; from main() and saw the logs of the 7 successful tests running.

#!/bin/bash -uex
set -o pipefail
LLVMBASE=/path/to/source/of/llvm-project
SYSCLANG=/path/to/origianl/bin/clang

# NB Can add `--debug-output` to debug cmake...

# Bootstrap clang -- Use `RelWithDebInfo` or the next phase is too slow!
mkdir -p bootstrap
cd bootstrap
cmake "$LLVMBASE/llvm" \
    -G Ninja \
    -DBUILD_SHARED_LIBS=true \
    -DCMAKE_ASM_COMPILER="$SYSCLANG" \
    -DCMAKE_ASM_COMPILER_ID=Clang \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_CXX_COMPILER="$SYSCLANG"++ \
    -DCMAKE_C_COMPILER="$SYSCLANG" \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_BINDINGS=OFF \
    -DLLVM_ENABLE_LLD=ON \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_OPTIMIZED_TABLEGEN=true \
    -DLLVM_FORCE_ENABLE_STATS=ON \
    -DLLVM_ENABLE_DUMP=ON \
    -DCLANG_DEFAULT_PIE_ON_LINUX=OFF
ninja clang lld
ninja check-clang-codegencoroutines # Includes the new IR regression tests
cd ..

NEWCLANG="$PWD"/bootstrap/bin/clang
NEWLLD="$PWD"/bootstrap/bin/lld
# LIBCXX_INCLUDE_BENCHMARKS=OFF because google-benchmark bugs out
cmake "$LLVMBASE/runtimes" \
    -G Ninja \
    -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
    -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \
    -DBUILD_SHARED_LIBS=true \
    -DCMAKE_ASM_COMPILER="$NEWCLANG" \
    -DCMAKE_ASM_COMPILER_ID=Clang \
    -DCMAKE_C_COMPILER="$NEWCLANG" \
    -DCMAKE_CXX_COMPILER="$NEWCLANG"++ \
    -DLLVM_FORCE_ENABLE_STATS=ON \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_LLD=ON \
    -DLIBCXX_INCLUDE_TESTS=ON \
    -DLIBCXX_INCLUDE_BENCHMARKS=OFF \
    -DLLVM_INCLUDE_TESTS=ON \
    -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

ninja cxx-test-depends

LIBCXXBUILD=$PWD
cd "$LLVMBASE"

libcxx/utils/libcxx-lit "$LIBCXXBUILD" -v \
    libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp

Patch is 46.90 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152623.diff

10 Files Affected:

(modified) clang/docs/ReleaseNotes.rst (+6)
(modified) clang/include/clang/Basic/Attr.td (+8)
(modified) clang/include/clang/Basic/AttrDocs.td (+87)
(modified) clang/include/clang/Basic/DiagnosticSemaKinds.td (+3)
(modified) clang/lib/CodeGen/CGCoroutine.cpp (+160-72)
(modified) clang/lib/Sema/SemaCoroutine.cpp (+84-18)
(added) clang/test/CodeGenCoroutines/coro-await-suspend-destroy-errors.cpp (+55)
(added) clang/test/CodeGenCoroutines/coro-await-suspend-destroy.cpp (+129)
(modified) clang/test/Misc/pragma-attribute-supported-attributes-list.test (+1)
(added) libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp (+409)

diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 0e9fcaa5fac6a..41c412730b033 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -136,6 +136,12 @@ Removed Compiler Flags
 Attribute Changes in Clang
 --------------------------
 
+- Introduced a new attribute ``[[clang::coro_await_suspend_destroy]]``.  When
+  applied to a coroutine awaiter class, it causes suspensions into this awaiter
+  to use a new `await_suspend_destroy(Promise&)` method instead of the standard
+  `await_suspend(std::coroutine_handle<...>)`.  The coroutine is then destroyed.
+  This improves code speed & size for "short-circuiting" coroutines.
+
 Improvements to Clang's diagnostics
 -----------------------------------
 - Added a separate diagnostic group ``-Wfunction-effect-redeclarations``, for the more pedantic
diff --git a/clang/include/clang/Basic/Attr.td b/clang/include/clang/Basic/Attr.td
index 30efb9f39e4f4..341848be00e7d 100644
--- a/clang/include/clang/Basic/Attr.td
+++ b/clang/include/clang/Basic/Attr.td
@@ -1352,6 +1352,14 @@ def CoroAwaitElidableArgument : InheritableAttr {
   let SimpleHandler = 1;
 }
 
+def CoroAwaitSuspendDestroy: InheritableAttr {
+  let Spellings = [Clang<"coro_await_suspend_destroy">];
+  let Subjects = SubjectList<[CXXRecord]>;
+  let LangOpts = [CPlusPlus];
+  let Documentation = [CoroAwaitSuspendDestroyDoc];
+  let SimpleHandler = 1;
+}
+
 // OSObject-based attributes.
 def OSConsumed : InheritableParamAttr {
   let Spellings = [Clang<"os_consumed">];
diff --git a/clang/include/clang/Basic/AttrDocs.td b/clang/include/clang/Basic/AttrDocs.td
index 2b095ab975202..d2224d86b3900 100644
--- a/clang/include/clang/Basic/AttrDocs.td
+++ b/clang/include/clang/Basic/AttrDocs.td
@@ -9270,6 +9270,93 @@ Example:
 }];
 }
 
+def CoroAwaitSuspendDestroyDoc : Documentation {
+  let Category = DocCatDecl;
+  let Content = [{
+
+The ``[[clang::coro_await_suspend_destroy]]`` attribute may be applied to a C++
+coroutine awaiter type.  When this attribute is present, the awaiter must
+implement ``void await_suspend_destroy(Promise&)``.  If ``await_ready()``
+returns ``false`` at a suspension point, ``await_suspend_destroy`` will be
+called directly, bypassing the ``await_suspend(std::coroutine_handle<...>)``
+method.  The coroutine being suspended will then be immediately destroyed.
+
+Logically, the new behavior is equivalent to this standard code:
+
+.. code-block:: c++
+
+  void await_suspend_destroy(YourPromise&) { ... }
+  void await_suspend(auto handle) {
+    await_suspend_destroy(handle.promise());
+    handle.destroy();
+  }
+
+This enables `await_suspend_destroy()` usage in portable awaiters — just add a
+stub ``await_suspend()`` as above.  Without ``coro_await_suspend_destroy``
+support, the awaiter will behave nearly identically, with the only difference
+being heap allocation instead of stack allocation for the coroutine frame.
+
+This attribute exists to optimize short-circuiting coroutines—coroutines whose
+suspend points are either (i) trivial (like ``std::suspend_never``), or (ii)
+short-circuiting (like a ``co_await`` that can be expressed in regular control
+flow as):
+
+.. code-block:: c++
+
+  T val;
+  if (awaiter.await_ready()) {
+    val = awaiter.await_resume();
+  } else {
+    awaiter.await_suspend();
+    return /* value representing the "execution short-circuited" outcome */;
+  }
+
+The benefits of this attribute are:
+  - **Avoid heap allocations for coro frames**: Allocating short-circuiting
+    coros on the stack makes code more predictable under memory pressure.
+    Without this attribute, LLVM cannot elide heap allocation even when all
+    awaiters are short-circuiting.
+  - **Performance**: Significantly faster execution and smaller code size.
+  - **Build time**: Faster compilation due to less IR being generated.
+
+Marking your ``await_suspend_destroy`` method as ``noexcept`` can sometimes
+further improve optimization.
+
+Here is a toy example of a portable short-circuiting awaiter:
+
+.. code-block:: c++
+
+  template <typename T>
+  struct [[clang::coro_await_suspend_destroy]] optional_awaitable {
+    std::optional<T> opt_;
+    bool await_ready() const noexcept { return opt_.has_value(); }
+    T await_resume() { return std::move(opt_).value(); }
+    void await_suspend_destroy(auto& promise) {
+      // Assume the return object of the outer coro defaults to "empty".
+    }
+    // Fallback for when `coro_await_suspend_destroy` is unavailable.
+    void await_suspend(auto handle) {
+      await_suspend_destroy(handle.promise());
+      handle.destroy();
+    }
+  };
+
+If all suspension points use (i) trivial or (ii) short-circuiting awaiters,
+then the coroutine optimizes more like a plain function, with 2 caveats:
+  - **Behavior:** The coroutine promise provides an implicit exception boundary
+    (as if wrapping the function in ``try {} catch { unhandled_exception(); }``).
+    This exception handling behavior is usually desirable in robust,
+    return-value-oriented programs that need short-circuiting coroutines.
+    Otherwise, the promise can always re-throw.
+  - **Speed:** As of 2025, there is still an optimization gap between a
+    realistic short-circuiting coro, and the equivalent (but much more verbose)
+    function.  For a guesstimate, expect 4-5ns per call on x86.  One idea for
+    improvement is to also elide trivial suspends like `std::suspend_never`, in
+    order to hit the `HasCoroSuspend` path in `CoroEarly.cpp`.
+
+}];
+}
+
 def CountedByDocs : Documentation {
   let Category = DocCatField;
   let Content = [{
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 116341f4b66d5..58e7dd7db86d1 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -12504,6 +12504,9 @@ def note_coroutine_promise_call_implicitly_required : Note<
 def err_await_suspend_invalid_return_type : Error<
   "return type of 'await_suspend' is required to be 'void' or 'bool' (have %0)"
 >;
+def err_await_suspend_destroy_invalid_return_type : Error<
+  "return type of 'await_suspend_destroy' is required to be 'void' (have %0)"
+>;
 def note_await_ready_no_bool_conversion : Note<
   "return type of 'await_ready' is required to be contextually convertible to 'bool'"
 >;
diff --git a/clang/lib/CodeGen/CGCoroutine.cpp b/clang/lib/CodeGen/CGCoroutine.cpp
index 827385f9c1a1f..d74bef592aa9c 100644
--- a/clang/lib/CodeGen/CGCoroutine.cpp
+++ b/clang/lib/CodeGen/CGCoroutine.cpp
@@ -174,6 +174,66 @@ static bool StmtCanThrow(const Stmt *S) {
   return false;
 }
 
+// Check if this suspend should be calling `await_suspend_destroy`
+static bool useCoroAwaitSuspendDestroy(const CoroutineSuspendExpr &S) {
+  // This can only be an `await_suspend_destroy` suspend expression if it
+  // returns void -- `buildCoawaitCalls` in `SemaCoroutine.cpp` asserts this.
+  // Moreover, when `await_suspend` returns a handle, the outermost method call
+  // is `.address()` -- making it harder to get the actual class or method.
+  if (S.getSuspendReturnType() !=
+      CoroutineSuspendExpr::SuspendReturnType::SuspendVoid) {
+    return false;
+  }
+
+  // `CGCoroutine.cpp` & `SemaCoroutine.cpp` must agree on whether this suspend
+  // expression uses `[[clang::coro_await_suspend_destroy]]`.
+  //
+  // Any mismatch is a serious bug -- we would either double-free, or fail to
+  // destroy the promise type. For this reason, we make our decision based on
+  // the method name, and fatal outside of the happy path -- including on
+  // failure to find a method name.
+  //
+  // As a debug-only check we also try to detect the `AwaiterClass`. This is
+  // secondary, because  detection of the awaiter type can be silently broken by
+  // small `buildCoawaitCalls` AST changes.
+  StringRef SuspendMethodName;           // Primary
+  CXXRecordDecl *AwaiterClass = nullptr; // Debug-only, best-effort
+  if (auto *SuspendCall =
+          dyn_cast<CallExpr>(S.getSuspendExpr()->IgnoreImplicit())) {
+    if (auto *SuspendMember = dyn_cast<MemberExpr>(SuspendCall->getCallee())) {
+      if (auto *BaseExpr = SuspendMember->getBase()) {
+        // `IgnoreImplicitAsWritten` is critical since `await_suspend...` can be
+        // invoked on the base of the actual awaiter, and the base need not have
+        // the attribute. In such cases, the AST will show the true awaiter
+        // being upcast to the base.
+        AwaiterClass = BaseExpr->IgnoreImplicitAsWritten()
+                           ->getType()
+                           ->getAsCXXRecordDecl();
+      }
+      if (auto *SuspendMethod =
+              dyn_cast<CXXMethodDecl>(SuspendMember->getMemberDecl())) {
+        SuspendMethodName = SuspendMethod->getName();
+      }
+    }
+  }
+  if (SuspendMethodName == "await_suspend_destroy") {
+    assert(!AwaiterClass ||
+           AwaiterClass->hasAttr<CoroAwaitSuspendDestroyAttr>());
+    return true;
+  } else if (SuspendMethodName == "await_suspend") {
+    assert(!AwaiterClass ||
+           !AwaiterClass->hasAttr<CoroAwaitSuspendDestroyAttr>());
+    return false;
+  } else {
+    llvm::report_fatal_error(
+        "Wrong method in [[clang::coro_await_suspend_destroy]] check: "
+        "expected 'await_suspend' or 'await_suspend_destroy', but got '" +
+        SuspendMethodName + "'");
+  }
+
+  return false;
+}
+
 // Emit suspend expression which roughly looks like:
 //
 //   auto && x = CommonExpr();
@@ -220,6 +280,25 @@ namespace {
     RValue RV;
   };
 }
+
+// The simplified `await_suspend_destroy` path avoids suspend intrinsics.
+static void emitAwaitSuspendDestroy(CodeGenFunction &CGF, CGCoroData &Coro,
+                                    llvm::Function *SuspendWrapper,
+                                    llvm::Value *Awaiter, llvm::Value *Frame,
+                                    bool AwaitSuspendCanThrow) {
+  SmallVector<llvm::Value *, 2> DirectCallArgs;
+  DirectCallArgs.push_back(Awaiter);
+  DirectCallArgs.push_back(Frame);
+
+  if (AwaitSuspendCanThrow) {
+    CGF.EmitCallOrInvoke(SuspendWrapper, DirectCallArgs);
+  } else {
+    CGF.EmitNounwindRuntimeCall(SuspendWrapper, DirectCallArgs);
+  }
+
+  CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
+}
+
 static LValueOrRValue emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Coro,
                                     CoroutineSuspendExpr const &S,
                                     AwaitKind Kind, AggValueSlot aggSlot,
@@ -234,7 +313,6 @@ static LValueOrRValue emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Co
   auto Prefix = buildSuspendPrefixStr(Coro, Kind);
   BasicBlock *ReadyBlock = CGF.createBasicBlock(Prefix + Twine(".ready"));
   BasicBlock *SuspendBlock = CGF.createBasicBlock(Prefix + Twine(".suspend"));
-  BasicBlock *CleanupBlock = CGF.createBasicBlock(Prefix + Twine(".cleanup"));
 
   // If expression is ready, no need to suspend.
   CGF.EmitBranchOnBoolExpr(S.getReadyExpr(), ReadyBlock, SuspendBlock, 0);
@@ -243,95 +321,105 @@ static LValueOrRValue emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Co
   CGF.EmitBlock(SuspendBlock);
 
   auto &Builder = CGF.Builder;
-  llvm::Function *CoroSave = CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_save);
-  auto *NullPtr = llvm::ConstantPointerNull::get(CGF.CGM.Int8PtrTy);
-  auto *SaveCall = Builder.CreateCall(CoroSave, {NullPtr});
 
   auto SuspendWrapper = CodeGenFunction(CGF.CGM).generateAwaitSuspendWrapper(
       CGF.CurFn->getName(), Prefix, S);
 
-  CGF.CurCoro.InSuspendBlock = true;
-
   assert(CGF.CurCoro.Data && CGF.CurCoro.Data->CoroBegin &&
          "expected to be called in coroutine context");
 
-  SmallVector<llvm::Value *, 3> SuspendIntrinsicCallArgs;
-  SuspendIntrinsicCallArgs.push_back(
-      CGF.getOrCreateOpaqueLValueMapping(S.getOpaqueValue()).getPointer(CGF));
-
-  SuspendIntrinsicCallArgs.push_back(CGF.CurCoro.Data->CoroBegin);
-  SuspendIntrinsicCallArgs.push_back(SuspendWrapper);
-
-  const auto SuspendReturnType = S.getSuspendReturnType();
-  llvm::Intrinsic::ID AwaitSuspendIID;
-
-  switch (SuspendReturnType) {
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
-    AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_void;
-    break;
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendBool:
-    AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_bool;
-    break;
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle:
-    AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_handle;
-    break;
-  }
-
-  llvm::Function *AwaitSuspendIntrinsic = CGF.CGM.getIntrinsic(AwaitSuspendIID);
-
   // SuspendHandle might throw since it also resumes the returned handle.
+  const auto SuspendReturnType = S.getSuspendReturnType();
   const bool AwaitSuspendCanThrow =
       SuspendReturnType ==
           CoroutineSuspendExpr::SuspendReturnType::SuspendHandle ||
       StmtCanThrow(S.getSuspendExpr());
 
-  llvm::CallBase *SuspendRet = nullptr;
-  // FIXME: add call attributes?
-  if (AwaitSuspendCanThrow)
-    SuspendRet =
-        CGF.EmitCallOrInvoke(AwaitSuspendIntrinsic, SuspendIntrinsicCallArgs);
-  else
-    SuspendRet = CGF.EmitNounwindRuntimeCall(AwaitSuspendIntrinsic,
-                                             SuspendIntrinsicCallArgs);
+  llvm::Value *Awaiter =
+      CGF.getOrCreateOpaqueLValueMapping(S.getOpaqueValue()).getPointer(CGF);
+  llvm::Value *Frame = CGF.CurCoro.Data->CoroBegin;
 
-  assert(SuspendRet);
-  CGF.CurCoro.InSuspendBlock = false;
+  if (useCoroAwaitSuspendDestroy(S)) { // Call `await_suspend_destroy` & cleanup
+    emitAwaitSuspendDestroy(CGF, Coro, SuspendWrapper, Awaiter, Frame,
+                            AwaitSuspendCanThrow);
+  } else { // Normal suspend path -- can actually suspend, uses intrinsics
+    CGF.CurCoro.InSuspendBlock = true;
 
-  switch (SuspendReturnType) {
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
-    assert(SuspendRet->getType()->isVoidTy());
-    break;
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendBool: {
-    assert(SuspendRet->getType()->isIntegerTy());
-
-    // Veto suspension if requested by bool returning await_suspend.
-    BasicBlock *RealSuspendBlock =
-        CGF.createBasicBlock(Prefix + Twine(".suspend.bool"));
-    CGF.Builder.CreateCondBr(SuspendRet, RealSuspendBlock, ReadyBlock);
-    CGF.EmitBlock(RealSuspendBlock);
-    break;
-  }
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle: {
-    assert(SuspendRet->getType()->isVoidTy());
-    break;
-  }
-  }
+    SmallVector<llvm::Value *, 3> SuspendIntrinsicCallArgs;
+    SuspendIntrinsicCallArgs.push_back(Awaiter);
+    SuspendIntrinsicCallArgs.push_back(Frame);
+    SuspendIntrinsicCallArgs.push_back(SuspendWrapper);
+    BasicBlock *CleanupBlock = CGF.createBasicBlock(Prefix + Twine(".cleanup"));
 
-  // Emit the suspend point.
-  const bool IsFinalSuspend = (Kind == AwaitKind::Final);
-  llvm::Function *CoroSuspend =
-      CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_suspend);
-  auto *SuspendResult = Builder.CreateCall(
-      CoroSuspend, {SaveCall, Builder.getInt1(IsFinalSuspend)});
+    llvm::Function *CoroSave = CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_save);
+    auto *NullPtr = llvm::ConstantPointerNull::get(CGF.CGM.Int8PtrTy);
+    auto *SaveCall = Builder.CreateCall(CoroSave, {NullPtr});
 
-  // Create a switch capturing three possible continuations.
-  auto *Switch = Builder.CreateSwitch(SuspendResult, Coro.SuspendBB, 2);
-  Switch->addCase(Builder.getInt8(0), ReadyBlock);
-  Switch->addCase(Builder.getInt8(1), CleanupBlock);
+    llvm::Intrinsic::ID AwaitSuspendIID;
 
-  // Emit cleanup for this suspend point.
-  CGF.EmitBlock(CleanupBlock);
-  CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
+    switch (SuspendReturnType) {
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
+      AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_void;
+      break;
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendBool:
+      AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_bool;
+      break;
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle:
+      AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_handle;
+      break;
+    }
+
+    llvm::Function *AwaitSuspendIntrinsic =
+        CGF.CGM.getIntrinsic(AwaitSuspendIID);
+
+    llvm::CallBase *SuspendRet = nullptr;
+    // FIXME: add call attributes?
+    if (AwaitSuspendCanThrow)
+      SuspendRet =
+          CGF.EmitCallOrInvoke(AwaitSuspendIntrinsic, SuspendIntrinsicCallArgs);
+    else
+      SuspendRet = CGF.EmitNounwindRuntimeCall(AwaitSuspendIntrinsic,
+                                               SuspendIntrinsicCallArgs);
+
+    assert(SuspendRet);
+    CGF.CurCoro.InSuspendBlock = false;
+
+    switch (SuspendReturnType) {
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
+      assert(SuspendRet->getType()->isVoidTy());
+      break;
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendBool: {
+      assert(SuspendRet->getType()->isIntegerTy());
+
+      // Veto suspension if requested by bool returning await_suspend.
+      BasicBlock *RealSuspendBlock =
+          CGF.createBasicBlock(Prefix + Twine(".suspend.bool"));
+      CGF.Builder.CreateCondBr(SuspendRet, RealSuspendBlock, ReadyBlock);
+      CGF.EmitBlock(RealSuspendBlock);
+      break;
+    }
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle: {
+      assert(SuspendRet->getType()->isVoidTy());
+      break;
+    }
+    }
+
+    // Emit the suspend point.
+    const bool IsFinalSuspend = (Kind == AwaitKind::Final);
+    llvm::Function *CoroSuspend =
+        CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_suspend);
+    auto *SuspendResult = Builder.CreateCall(
+        CoroSuspend, {SaveCall, Builder.getInt1(IsFinalSuspend)});
+
+    // Create a switch capturing three possible continuations.
+    auto *Switch = Builder.CreateSwitch(SuspendResult, Coro.SuspendBB, 2);
+    Switch->addCase(Builder.getInt8(0), ReadyBlock);
+    Switch->addCase(Builder.getInt8(1), CleanupBlock);
+
+    // Emit cleanup for this suspend point.
+    CGF.EmitBlock(CleanupBlock);
+    CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
+  }
 
   // Emit await_resume expression.
   CGF.EmitBlock(ReadyBlock);
diff --git a/clang/lib/Sema/SemaCoroutine.cpp b/clang/lib/Sema/SemaCoroutine.cpp
index d193a33f22393..83fe7219c9997 100644
--- a/clang/lib/Sema/SemaCoroutine.cpp
+++ b/clang/lib/Sema/SemaCoroutine.cpp
@@ -289,6 +289,45 @@ static ExprResult buildCoroutineHandle(Sema &S, QualType PromiseType,
   return S.BuildCallExpr(nullptr, FromAddr.get(), Loc, FramePtr, Loc);
 }
 
+// To support [[clang::coro_await_suspend_destroy]], this builds
+//   *static_cast<Promise*>(
+//       __builtin_coro_promise(handle, alignof(Promise), false))
+static ExprResult buildPromiseRef(Sema &S, QualType PromiseType,
+                                  SourceLocation Loc) {
+  uint64_t Align =
+      S.Context.getTypeAlign(PromiseType) / S.Context.getCharWidth();
+
+  // Build the call to __builtin_coro_promise()
+  SmallVector<Expr *, 3> Args = {
+      S.BuildBuiltinCallExpr(Loc, Builtin::BI__builtin_coro_frame, {}),
+      S.ActOnIntegerConstant(Loc, Align).get(),         // alignof(Promise)
+      S.ActOnCXXBoolLiteral(Loc, tok::kw_false).get()}; // false
+  ExprResult CoroPromiseCall =
+      S.BuildBuiltinCallExpr(Loc, Builtin::BI__builtin_coro_promise, Args);
+
+  if (CoroPromiseCall.isInvalid())
+    return ExprError();
+
+  // Cast to Promise*
+  ExprResult CastExpr = S.ImpCastExprToType(
+      CoroPromiseCall.get(), S.Context.getPointerType(PromiseType), CK_BitCast);
+  if (CastExpr.isInvalid())
+    return ExprError();
+
+  // Dereference to get Promise&
+  return S.CreateBuiltinUnaryOp(Loc, UO_Deref, CastExpr.get());
+}
+
+static bool hasCoroAwaitSuspendDestroyAttr(Expr *Awaiter) {
+  QualType AwaiterType = Awaiter->getType();
+  if (auto *RD = AwaiterType->getAsCXXRecordDecl()) {
+    if (RD->hasAttr<CoroAwaitSuspendDestroyAttr>()) {
+      return true;
+    }
+  }
+  return false;
+}
+
 struct ReadySuspendResumeResult {
   enum AwaitCallType { ACT_Ready, ACT_Suspend, ACT_Resume };
   Expr *Results[3];
@@ -399,15 +438,30 @@ static ReadySuspendResumeResult buildCoawaitCalls(Sema &S, VarDecl *CoroPromise,
       Calls.Results[ACT::ACT_Ready] = S.MaybeCreateExprWithCleanups(Conv.get());
   }
 
-  ExprRes...
[truncated]

NewSigma · 2025-08-08T02:22:17Z

Given your code contains destroy in await_suspend, will the solution of #148380 work for you?

snarkmaster · 2025-08-08T02:52:43Z

@NewSigma, thanks!

I quickly ran my folly::result benchmark from the description under clang-17 rather than clang-19, and ... its performance is even worse on 17 than 19. So, I doubt that resolving the inlining-of-await_suspend issue will directly address my problem.

The fundamental problem I'm addressing isn't lack of devirtualization, but lack of heap alloc elision.

That said, if you have a PoC patch, I'm happy to try it.

ChuanqiXu9 · 2025-08-08T06:37:44Z

@NewSigma, thanks!

I quickly ran my folly::result benchmark from the description under clang-17 rather than clang-19, and ... its performance is even worse on 17 than 19. So, I doubt that resolving the inlining-of-await_suspend issue will directly address my problem.

The fundamental problem I'm addressing isn't lack of devirtualization, but lack of heap alloc elision.

That said, if you have a PoC patch, I'm happy to try it.

I didn't have time to look into details yet. But could your patch cover #148380?

And also, it is better to split changes between clang and libc++ whenever possible.

snarkmaster · 2025-08-08T07:18:47Z

@ChuanqiXu9,

(1) I'm not actually changing libc++. I just added an end-to-end test for this compiler feature. It's backwards- and forwards-compatible, so I don't think there's any risk to bundling it with this PR.

(2) My patch isn't a fully general solution to #148380. Rather, it's a (probably) better solution to the more narrow class of problems of short-circuiting suspends.

To use my thing, two things have to be true:

You actually have to opt into the new attribute and define await_suspend_destroy to get the perf benefits.
You can only do so if your await_suspend would destroy the coro unconditionally.

When both of those are true (opt-in & unconditional deestroy), it results in a much simpler compilation setup -- HALO, less IR, less optimized code, better perf.

ChuanqiXu9 · 2025-08-08T07:22:00Z

@ChuanqiXu9,

(1) I'm not actually changing libc++. I just added an end-to-end test for this compiler feature. It's backwards- and forwards-compatible, so I don't think there's any risk to bundling it with this PR.

This is not about risks. It is simply about developing processes. In another word, it will be faster to merge this if you split it.

(2) My patch isn't a fully general solution to #148380. Rather, it's a (probably) better solution to the more narrow class of problems of short-circuiting suspends.

To use my thing, two things have to be true:

You actually have to opt into the new attribute and define await_suspend_destroy to get the perf benefits.

You can only do so if your await_suspend would destroy the coro unconditionally.

When both of those are true (opt-in & unconditional deestroy), it results in a much simpler compilation setup -- HALO, less IR, less optimized code, better perf.

In another word, if users want, this patch can cover #148380, right? Could you have a test for this?

ChuanqiXu9

Did a pretty quick scanning.

It is slightly surprising to me that you didn't touch LLVM. The optimization somehow comes from skipping AwaitSuspend intrinsic, which tries to solve such patterns:

void await_suspend(...) {
     ...
     call resume conditionally directly or indirectly
     ...
     some other codes
}

Then if the coroutine is destroyed in other threads, the finally generated code is unsafe even if the code wrote by the user is safe since the middle end may perform some optimizations.

But in your case, it is fine. Since it said it will be destroyed after await suspend destroy. A valid program can't destroy a coroutine instance twice. So I like the idea.

But similarly, if you want, may be you can implement similar attributes to not invoke AwaitSuspend intrinsic in different conditions.

clang/include/clang/Basic/AttrDocs.td

ChuanqiXu9 · 2025-08-08T07:01:55Z

clang/include/clang/Basic/AttrDocs.td

+
+This attribute exists to optimize short-circuiting coroutines—coroutines whose
+suspend points are either (i) trivial (like ``std::suspend_never``), or (ii)
+short-circuiting (like a ``co_await`` that can be expressed in regular control


I feel better to define the concept short-circuiting coro more explicitly and clearly.

Next update, I will start with something like the below. I'm happy to linkify the various concepts if that's not frowned upon. Do you have any feedback?

A short-circuiting coroutine is one where every co_await or co_yield either immediately produces a value, or exits the coroutine. In other words, they use coroutine syntax to concisely branch out of a synchronous function. Here are analogies to other languages:

Rust has Result<T> and a ? operator to unpack it, while folly::result<T> is a C++ short-circuiting coroutine, where co_await acts just like ?.

Haskell has Maybe & Error monads. A short-circuiting co_await loosely corresponds to the monadic >>=, whereas a short-circuiting std::optional would be an exact analog of Maybe.

Response requested

ChuanqiXu9 · 2025-08-08T07:02:35Z

clang/include/clang/Basic/AttrDocs.td

+  }
+
+The benefits of this attribute are:
+  - **Avoid heap allocations for coro frames**: Allocating short-circuiting


Maybe it is helpful to describe the underlying reasons. I don't feel it can avoid heap allocation necessarily.

I understand the mechanism to be this:

If all co_await / co_yield awaiters use await_suspend_destroy(), that leaves the coro with just 2 suspend points: initial & final.

If those suspend points have await_ready() { return true; } like suspend_never, the remaining two suspends get cut out somewhere in the middle end.

At that point, there are no suspend intrinsics left for the escape analysis, and heap elision kicks in.

If we were able to elide suspend_never earlier, I suspect that short-circuiting coros could optimize even more like plain functions.

Can you please clarify how you'd like me to edit this portion of the docs?

My reasoning seems a bit too low-level to be user-relevant. I feel like if a user cares about allocs, they would just look at the generated code for their application...

Response requested

clang/include/clang/Basic/AttrDocs.td

clang/lib/CodeGen/CGCoroutine.cpp

ChuanqiXu9 · 2025-08-08T07:27:21Z

clang/lib/CodeGen/CGCoroutine.cpp

+    if (auto *SuspendMember = dyn_cast<MemberExpr>(SuspendCall->getCallee())) {
+      if (auto *BaseExpr = SuspendMember->getBase()) {
+        // `IgnoreImplicitAsWritten` is critical since `await_suspend...` can be
+        // invoked on the base of the actual awaiter, and the base need not have
+        // the attribute. In such cases, the AST will show the true awaiter
+        // being upcast to the base.
+        AwaiterClass = BaseExpr->IgnoreImplicitAsWritten()
+                           ->getType()
+                           ->getAsCXXRecordDecl();
+      }
+      if (auto *SuspendMethod =
+              dyn_cast<CXXMethodDecl>(SuspendMember->getMemberDecl())) {
+        SuspendMethodName = SuspendMethod->getName();
+      }
+    }
+  }
+  if (SuspendMethodName == "await_suspend_destroy") {
+    assert(!AwaiterClass ||
+           AwaiterClass->hasAttr<CoroAwaitSuspendDestroyAttr>());
+    return true;
+  } else if (SuspendMethodName == "await_suspend") {
+    assert(!AwaiterClass ||
+           !AwaiterClass->hasAttr<CoroAwaitSuspendDestroyAttr>());
+    return false;
+  } else {
+    llvm::report_fatal_error(
+        "Wrong method in [[clang::coro_await_suspend_destroy]] check: "
+        "expected 'await_suspend' or 'await_suspend_destroy', but got '" +
+        SuspendMethodName + "'");
+  }


These checks seem to be redundant. We would and should perform checks in Sema. And if Sema accepts it, we can check for the attribute only.

Yes, the "awaiter" and "method name" checks are redundant, but I did it this way for a (good?) reason.

Initially, I just looked for the attribute on the awaiter checks, and I had a nasty bug described in a comment higher in that function. Worse yet, I only found it by accident because my test happened to have a base awaiter that didn't have the attribute, and a derived one that did.

Since any class can be an awaiter, it's hard to be sure if we're looking on the same CXXRecord that Sema examined.

For that reason, I made the method name check primary. There are only two valid values here, and I can emit a fatal if I don't see either. This makes it clear that Sema alone is responsible for checking & interpreting the attribute.

The AwaiterClass attribute check is secondary, and debug-only. If you prefer, I'm happy to delete that one.

Response requested

I'd make this entire function be a getter here:

llvm-project/clang/include/clang/AST/ExprCXX.h

Line 5249 in df75b4b

class CoroutineSuspendExpr : public Expr {

I do recognize that you want to check it due to it being error prone. How about wrapping it in EXPENSIVE_CHECKS?

Thanks @yuxuanchen1997 -- but just to confirm, you're OK with the "test method name" as the primary check, right?

Make this a getter on CoroutineSuspendExpr

Ok, will do in the next rev.

Don't rely on assert() to optimize out the AwaiterClass check, gate it on #ifdef EXPENSIVE_CHECKS

Sure, will do in the next rev.

What do you mean by "primary check"?

Like @ChuanqiXu9 mentioned, I also don't find these checks necessary. I was just suggesting that you should put them behind #ifdef EXPENSIVE_CHECKS.

I also don't find these checks necessary

I'm confused. CGCoroutine.cpp has to know whether a callsite is await_suspend (emit suspend intrinsics) or await_suspend_destroy (emit direct call to suspend wrapper & clean up).

So, I have to make some kind of check in this function.

Primary check: As written, an opt build will only check the method name here to decide the code branch.

Secondary check: For paranoia, I also assert() that the attribute on the AwaiterClass matches the method name.

I hope that we're discussing the "secondary check". I believe there are two options:

(a) Delete the secondary check. I am fine with that.

(b) Gate the secondary check by #ifdef EXPENSIVE_CHECKS. I am also fine with that.

Does your prior comment simply mean (a)?

Thanks for clarification. For the "primary" check, I would store it as a bit somehow in CoroutineSuspendExpr.

It's quite common to store these bits along with the expression itself. See

llvm-project/clang/include/clang/AST/Stmt.h

Line 775 in 5165a6c

class CXXOperatorCallExprBitfields {

Sure, the "store a bit" solution sounds nicer to me. Are there ABI / layout compat considerations at play here? Or I can just follow the pattern in Stmt.h and make this change:

//===--- C++ Coroutines bitfields classes ---===// - class CoawaitExprBitfields { - friend class CoawaitExpr; + class CoroutineSuspendExprBitfields { + friend class CoroutineSuspendExpr; LLVM_PREFERRED_TYPE(ExprBitfields) unsigned : NumExprBits; + LLVM_PREFERRED_TYPE(bool) + unsigned HasAwaitSuspendDestroy : 1; + }; + enum { NumCoroutineSuspendExprBits = NumExprBits + 1 }; + + class CoawaitExprBitfields { + friend class CoawaitExpr; + + LLVM_PREFERRED_TYPE(CoroutineSuspendExprBitfields) + unsigned : NumCoroutineSuspendExprBits; + LLVM_PREFERRED_TYPE(bool) unsigned IsImplicit : 1; };

LLVM doesn't guarantee ABI stability across major versions.

Remember to touch Serialization part if you added any new bits to AST nodes

snarkmaster · 2025-08-09T00:25:23Z

Thanks for the initial review! I'm glad you like the idea. I'll finish polishing the e2e test, and address your inline comments 1-by-1 above.

split out libcxx change

Okay, I'll do that for the next update, no problem.

... can cover #148380, right?

For the trivial example of void await_suspend(auto handle) { handle.destroy(); }, yes, the user could use the new attribute.

But, I think it's possible to have coros that conditionally destroy the passed-in handle. So, if the user had something more complicated (e.g. coro that returns a handle, and sometimes destroys the current coro), then my thing wouldn't help.

Could you have a test for this?

My current IR test verifies that there's no heap alloc across a few examples. Were you thinking of a different kind of test? Can you be very specific since I'm totally new to LLVM?

Response requested ^^^

if you want, may be you can implement similar attributes to not invoke AwaitSuspend intrinsic in different conditions.

I don't know when I'll next be able to make time to work on this, but if I do, my next target would be std::suspend_never. I hope this one could be done without an attribute, though. Do you have other use-cases in mind?

snarkmaster · 2025-08-09T01:08:39Z

@ChuanqiXu9, I marked a few of the inline comments "Response requested", please take a look.

@ChuanqiXu9

…d test for [[clang::coro_await_suspend_destroy]] When reviewing llvm#152623, @@ChuanqiXu9 suggested that it's better to separate `clang` and `libcxx` changes, so I did that here. This event-replay test is compatible with compilers lacking the new attribute. The two PRs can be landed in any order. Furthermore, the test does not just verify the attribute's correct behavior. By storing "gold" event traces, it also provides some protection against future breakage in the complex control flow of the C++ coroutine implementation. Since suspension remains an area of active optimization work, such bugs are not ruled out. As a new contributor to LLVM, I myself made (and fixed) 2 control-flow bugs while working on llvm#152623. This test would've caught both.

snarkmaster · 2025-08-09T02:30:29Z

I improved the libcxx-hosted test and put it up as a separate PR: #152820

…lvm-project into coro_await_suspend_destroy

snarkmaster · 2025-08-09T07:35:38Z

I addressed most of the comments, except that the handful marked "Response requested" might still benefit from your input. But, if you are happy, I am also happy with the 2 PRs as they are now.

snarkmaster · 2025-08-10T07:54:46Z

... can cover #148380, right?

Above, I mentioned "await_suspend that destroys the coro returns a handle" as one of the cases I don't cover.

There's no logical reason this couldn't be supported, but I only needed void for short-circuiting coros.

I looked through folly/coro and found a destroying suspend that returned a handle (see DetachedBarrierTask). I then made a quick-and-dirty PoC of a compiler change to support returning handles from await_suspend_destroy. Unfortunately, it didn't make an obvious difference to the code optimization in this case.

So, I don't think the current PR should include the extra code to support await_suspend_destroy returning handles. We'd need to see some kind of use-case where it makes a meaningful difference to perf, and a reason why making that optimize the normal way is hard.

I think for short-circuiting coros, we have both the meaningful perf difference, and the reason why normal optimization is hard -- IIUC it's tricky to make the escape analysis smart enough to understand that destroying suspends can be optimized more.

Summary: Before further prioritizing `result` coro usage in `folly/coro/safe` collector code, I wanted to know if the compiler is currently allocating result coro frames on-stack. It turned out that prior to llvm/llvm-project#152623, it is not. ``` Before P152623: ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 13.61ns 73.49M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.41ns 226.88M result_coro_error 19.55ns 51.16M non_catching_result_func_error 9.15ns 109.26M catching_result_func_error 10.19ns 98.10M After P152623: ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 10.59ns 94.39M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.07ns 245.81M result_coro_error 13.66ns 73.18M non_catching_result_func_error 9.00ns 111.11M catching_result_func_error 10.04ns 99.63M ``` While it should be possible to further reduce the optimization gap between coro & non-coro, I'm not planning to work on it in the immediate future. Reviewed By: ispeters Differential Revision: D79969519 fbshipit-source-id: 0583a773965f718388715b06067bda4ca8d85737

yuxuanchen1997 · 2025-08-11T20:24:37Z

clang/test/CodeGenCoroutines/coro-await-suspend-destroy-errors.cpp

@@ -0,0 +1,55 @@
+// RUN: %clang_cc1 -std=c++20 -verify %s 


This test should belong to clang/test/SemaCXX as this doesn't verify the generated code.

yuxuanchen1997 · 2025-08-11T20:30:00Z

clang/test/CodeGenCoroutines/coro-await-suspend-destroy.cpp

+// CHECK-INITIAL-LABEL: define{{.*}} void @_Z28test_single_destroying_awaitv
+// CHECK-INITIAL: call{{.*}} @llvm.coro.alloc
+// CHECK-INITIAL: call{{.*}} @llvm.coro.begin
+
+// CHECK-OPTIMIZED-LABEL: define{{.*}} void @_Z28test_single_destroying_awaitv
+// CHECK-OPTIMIZED-NOT: call{{.*}} @llvm.coro.alloc
+// CHECK-OPTIMIZED-NOT: call{{.*}} malloc
+// CHECK-OPTIMIZED-NOT: call{{.*}} @_Znwm


These checks are probably too simple. I would prefer to use matching get the variable name and do more pattern checking.

A good example can be found here: https://github.com/llvm/llvm-project/blob/main/clang/test/CodeGenCoroutines/coro-await.cpp#L53

@yuxuanchen1997, I'm not sure I understand your meaning. Can you make it more explicit?

What is bad about these checks being simple? Are you worried about insufficient coverage, or future failures that are due to a check not being specific enough?

If you want more coverage, can you suggest more tests?

If you're want me to make the tests more specific, how can I do that?

What variable name should I be testing for?

Please make the tests more specific. If someone else's change later breaks the test, at least they understand your intention.

I would manually run the test command and read the IR. Find parts that belong to the coroutine machinary. Then add them as skeleton checks. That including finding which SSA variables are coro.id or coro.begin. When I am checking other calls to intrinsics, I can refer to them in the arguments.

For this particular function, I'd also like to check for DestroyingAwaitable{} and the call to await_suspend_destroy if possible.

Got it, thanks! So you're not specifically calling out these "no heap alloc" test lines with CHECK-OPTIMIZED-NOT. Rather, you want to see a skeleton of the coro in the IR, so the tests are a bit more self-documenting.

On the next rev, I'll give it a go, following the example you linked.

read the IR

I've done that, of course, as part of writing and debugging these tests :)

On these particular lines, I just tried to write the shortest test that shows the desired behavior change of "no alloc".

yuxuanchen1997 · 2025-08-11T20:40:46Z

clang/lib/CodeGen/CGCoroutine.cpp

+    if (auto *SuspendMember = dyn_cast<MemberExpr>(SuspendCall->getCallee())) {
+      if (auto *BaseExpr = SuspendMember->getBase()) {
+        // `IgnoreImplicitAsWritten` is critical since `await_suspend...` can be
+        // invoked on the base of the actual awaiter, and the base need not have
+        // the attribute. In such cases, the AST will show the true awaiter
+        // being upcast to the base.
+        AwaiterClass = BaseExpr->IgnoreImplicitAsWritten()
+                           ->getType()
+                           ->getAsCXXRecordDecl();
+      }
+      if (auto *SuspendMethod =
+              dyn_cast<CXXMethodDecl>(SuspendMember->getMemberDecl())) {
+        SuspendMethodName = SuspendMethod->getName();
+      }
+    }
+  }
+  if (SuspendMethodName == "await_suspend_destroy") {
+    assert(!AwaiterClass ||
+           AwaiterClass->hasAttr<CoroAwaitSuspendDestroyAttr>());
+    return true;
+  } else if (SuspendMethodName == "await_suspend") {
+    assert(!AwaiterClass ||
+           !AwaiterClass->hasAttr<CoroAwaitSuspendDestroyAttr>());
+    return false;
+  } else {
+    llvm::report_fatal_error(
+        "Wrong method in [[clang::coro_await_suspend_destroy]] check: "
+        "expected 'await_suspend' or 'await_suspend_destroy', but got '" +
+        SuspendMethodName + "'");
+  }


I'd make this entire function be a getter here:

llvm-project/clang/include/clang/AST/ExprCXX.h

Line 5249 in df75b4b

class CoroutineSuspendExpr : public Expr {

I do recognize that you want to check it due to it being error prone. How about wrapping it in EXPENSIVE_CHECKS?

github-actions · 2025-08-11T22:22:24Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff HEAD~1 HEAD --extensions cpp -- clang/test/CodeGenCoroutines/coro-await-suspend-destroy-errors.cpp clang/test/CodeGenCoroutines/coro-await-suspend-destroy.cpp clang/lib/CodeGen/CGCoroutine.cpp clang/lib/Sema/SemaCoroutine.cpp

View the diff from clang-format here.

diff --git a/clang/lib/CodeGen/CGCoroutine.cpp b/clang/lib/CodeGen/CGCoroutine.cpp
index 883a45d2a..da779f4a5 100644
--- a/clang/lib/CodeGen/CGCoroutine.cpp
+++ b/clang/lib/CodeGen/CGCoroutine.cpp
@@ -279,205 +279,208 @@ namespace {
     LValue LV;
     RValue RV;
   };
-}
+  } // namespace
 
-// The simplified `await_suspend_destroy` path avoids suspend intrinsics.
-//
-// If a coro has only `await_suspend_destroy` and trivial (`suspend_never`)
-// awaiters, then subsequent passes are able to allocate its frame on-stack.
-//
-// As of 2025, there is still an optimization gap between a realistic
-// short-circuiting coro, and the equivalent plain function.  For a
-// guesstimate, expect 4-5ns per call on x86.  One idea for improvement is to
-// also elide trivial suspends like `std::suspend_never`, in order to hit the
-// `HasCoroSuspend` path in `CoroEarly.cpp`.
-static void emitAwaitSuspendDestroy(CodeGenFunction &CGF, CGCoroData &Coro,
-                                    llvm::Function *SuspendWrapper,
-                                    llvm::Value *Awaiter, llvm::Value *Frame,
-                                    bool AwaitSuspendCanThrow) {
-  SmallVector<llvm::Value *, 2> DirectCallArgs;
-  DirectCallArgs.push_back(Awaiter);
-  DirectCallArgs.push_back(Frame);
-
-  if (AwaitSuspendCanThrow) {
-    CGF.EmitCallOrInvoke(SuspendWrapper, DirectCallArgs);
-  } else {
-    CGF.EmitNounwindRuntimeCall(SuspendWrapper, DirectCallArgs);
+  // The simplified `await_suspend_destroy` path avoids suspend intrinsics.
+  //
+  // If a coro has only `await_suspend_destroy` and trivial (`suspend_never`)
+  // awaiters, then subsequent passes are able to allocate its frame on-stack.
+  //
+  // As of 2025, there is still an optimization gap between a realistic
+  // short-circuiting coro, and the equivalent plain function.  For a
+  // guesstimate, expect 4-5ns per call on x86.  One idea for improvement is to
+  // also elide trivial suspends like `std::suspend_never`, in order to hit the
+  // `HasCoroSuspend` path in `CoroEarly.cpp`.
+  static void emitAwaitSuspendDestroy(CodeGenFunction &CGF, CGCoroData &Coro,
+                                      llvm::Function *SuspendWrapper,
+                                      llvm::Value *Awaiter, llvm::Value *Frame,
+                                      bool AwaitSuspendCanThrow) {
+    SmallVector<llvm::Value *, 2> DirectCallArgs;
+    DirectCallArgs.push_back(Awaiter);
+    DirectCallArgs.push_back(Frame);
+
+    if (AwaitSuspendCanThrow) {
+      CGF.EmitCallOrInvoke(SuspendWrapper, DirectCallArgs);
+    } else {
+      CGF.EmitNounwindRuntimeCall(SuspendWrapper, DirectCallArgs);
+    }
+
+    CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
   }
 
-  CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
-}
+  static void emitStandardAwaitSuspend(
+      CodeGenFunction &CGF, CGCoroData &Coro, CoroutineSuspendExpr const &S,
+      llvm::Function *SuspendWrapper, llvm::Value *Awaiter, llvm::Value *Frame,
+      bool AwaitSuspendCanThrow, SmallString<32> Prefix, BasicBlock *ReadyBlock,
+      AwaitKind Kind,
+      CoroutineSuspendExpr::SuspendReturnType SuspendReturnType) {
+    auto &Builder = CGF.Builder;
 
-static void emitStandardAwaitSuspend(
-    CodeGenFunction &CGF, CGCoroData &Coro, CoroutineSuspendExpr const &S,
-    llvm::Function *SuspendWrapper, llvm::Value *Awaiter, llvm::Value *Frame,
-    bool AwaitSuspendCanThrow, SmallString<32> Prefix, BasicBlock *ReadyBlock,
-    AwaitKind Kind, CoroutineSuspendExpr::SuspendReturnType SuspendReturnType) {
-  auto &Builder = CGF.Builder;
-
-  CGF.CurCoro.InSuspendBlock = true;
-
-  SmallVector<llvm::Value *, 3> SuspendIntrinsicCallArgs;
-  SuspendIntrinsicCallArgs.push_back(Awaiter);
-  SuspendIntrinsicCallArgs.push_back(Frame);
-  SuspendIntrinsicCallArgs.push_back(SuspendWrapper);
-  BasicBlock *CleanupBlock = CGF.createBasicBlock(Prefix + Twine(".cleanup"));
-
-  llvm::Function *CoroSave = CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_save);
-  auto *NullPtr = llvm::ConstantPointerNull::get(CGF.CGM.Int8PtrTy);
-  auto *SaveCall = Builder.CreateCall(CoroSave, {NullPtr});
-
-  llvm::Intrinsic::ID AwaitSuspendIID;
-  switch (SuspendReturnType) {
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
-    AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_void;
-    break;
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendBool:
-    AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_bool;
-    break;
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle:
-    AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_handle;
-    break;
-  }
+    CGF.CurCoro.InSuspendBlock = true;
 
-  llvm::Function *AwaitSuspendIntrinsic = CGF.CGM.getIntrinsic(AwaitSuspendIID);
+    SmallVector<llvm::Value *, 3> SuspendIntrinsicCallArgs;
+    SuspendIntrinsicCallArgs.push_back(Awaiter);
+    SuspendIntrinsicCallArgs.push_back(Frame);
+    SuspendIntrinsicCallArgs.push_back(SuspendWrapper);
+    BasicBlock *CleanupBlock = CGF.createBasicBlock(Prefix + Twine(".cleanup"));
 
-  llvm::CallBase *SuspendRet = nullptr;
-  // FIXME: add call attributes?
-  if (AwaitSuspendCanThrow)
-    SuspendRet =
-        CGF.EmitCallOrInvoke(AwaitSuspendIntrinsic, SuspendIntrinsicCallArgs);
-  else
-    SuspendRet = CGF.EmitNounwindRuntimeCall(AwaitSuspendIntrinsic,
-                                             SuspendIntrinsicCallArgs);
+    llvm::Function *CoroSave = CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_save);
+    auto *NullPtr = llvm::ConstantPointerNull::get(CGF.CGM.Int8PtrTy);
+    auto *SaveCall = Builder.CreateCall(CoroSave, {NullPtr});
 
-  assert(SuspendRet);
-  CGF.CurCoro.InSuspendBlock = false;
+    llvm::Intrinsic::ID AwaitSuspendIID;
+    switch (SuspendReturnType) {
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
+      AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_void;
+      break;
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendBool:
+      AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_bool;
+      break;
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle:
+      AwaitSuspendIID = llvm::Intrinsic::coro_await_suspend_handle;
+      break;
+    }
 
-  switch (SuspendReturnType) {
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
-    assert(SuspendRet->getType()->isVoidTy());
-    break;
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendBool: {
-    assert(SuspendRet->getType()->isIntegerTy());
-
-    // Veto suspension if requested by bool returning await_suspend.
-    BasicBlock *RealSuspendBlock =
-        CGF.createBasicBlock(Prefix + Twine(".suspend.bool"));
-    CGF.Builder.CreateCondBr(SuspendRet, RealSuspendBlock, ReadyBlock);
-    CGF.EmitBlock(RealSuspendBlock);
-    break;
-  }
-  case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle: {
-    assert(SuspendRet->getType()->isVoidTy());
-    break;
-  }
-  }
+    llvm::Function *AwaitSuspendIntrinsic =
+        CGF.CGM.getIntrinsic(AwaitSuspendIID);
 
-  // Emit the suspend point.
-  const bool IsFinalSuspend = (Kind == AwaitKind::Final);
-  llvm::Function *CoroSuspend =
-      CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_suspend);
-  auto *SuspendResult = Builder.CreateCall(
-      CoroSuspend, {SaveCall, Builder.getInt1(IsFinalSuspend)});
-
-  // Create a switch capturing three possible continuations.
-  auto *Switch = Builder.CreateSwitch(SuspendResult, Coro.SuspendBB, 2);
-  Switch->addCase(Builder.getInt8(0), ReadyBlock);
-  Switch->addCase(Builder.getInt8(1), CleanupBlock);
-
-  // Emit cleanup for this suspend point.
-  CGF.EmitBlock(CleanupBlock);
-  CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
-}
+    llvm::CallBase *SuspendRet = nullptr;
+    // FIXME: add call attributes?
+    if (AwaitSuspendCanThrow)
+      SuspendRet =
+          CGF.EmitCallOrInvoke(AwaitSuspendIntrinsic, SuspendIntrinsicCallArgs);
+    else
+      SuspendRet = CGF.EmitNounwindRuntimeCall(AwaitSuspendIntrinsic,
+                                               SuspendIntrinsicCallArgs);
 
-static LValueOrRValue emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Coro,
-                                    CoroutineSuspendExpr const &S,
-                                    AwaitKind Kind, AggValueSlot aggSlot,
-                                    bool ignoreResult, bool forLValue) {
-  auto *E = S.getCommonExpr();
+    assert(SuspendRet);
+    CGF.CurCoro.InSuspendBlock = false;
 
-  auto CommonBinder =
-      CodeGenFunction::OpaqueValueMappingData::bind(CGF, S.getOpaqueValue(), E);
-  auto UnbindCommonOnExit =
-      llvm::make_scope_exit([&] { CommonBinder.unbind(CGF); });
-
-  auto Prefix = buildSuspendPrefixStr(Coro, Kind);
-  BasicBlock *ReadyBlock = CGF.createBasicBlock(Prefix + Twine(".ready"));
-  BasicBlock *SuspendBlock = CGF.createBasicBlock(Prefix + Twine(".suspend"));
-
-  // If expression is ready, no need to suspend.
-  CGF.EmitBranchOnBoolExpr(S.getReadyExpr(), ReadyBlock, SuspendBlock, 0);
-
-  // Otherwise, emit suspend logic.
-  CGF.EmitBlock(SuspendBlock);
-
-  auto SuspendWrapper = CodeGenFunction(CGF.CGM).generateAwaitSuspendWrapper(
-      CGF.CurFn->getName(), Prefix, S);
-
-  assert(CGF.CurCoro.Data && CGF.CurCoro.Data->CoroBegin &&
-         "expected to be called in coroutine context");
-
-  // SuspendHandle might throw since it also resumes the returned handle.
-  const auto SuspendReturnType = S.getSuspendReturnType();
-  const bool AwaitSuspendCanThrow =
-      SuspendReturnType ==
-          CoroutineSuspendExpr::SuspendReturnType::SuspendHandle ||
-      StmtCanThrow(S.getSuspendExpr());
-
-  llvm::Value *Awaiter =
-      CGF.getOrCreateOpaqueLValueMapping(S.getOpaqueValue()).getPointer(CGF);
-  llvm::Value *Frame = CGF.CurCoro.Data->CoroBegin;
-
-  if (useCoroAwaitSuspendDestroy(S)) { // Call `await_suspend_destroy` & cleanup
-    emitAwaitSuspendDestroy(CGF, Coro, SuspendWrapper, Awaiter, Frame,
-                            AwaitSuspendCanThrow);
-  } else { // Normal suspend path -- can actually suspend, uses intrinsics
-    emitStandardAwaitSuspend(CGF, Coro, S, SuspendWrapper, Awaiter, Frame,
-                             AwaitSuspendCanThrow, Prefix, ReadyBlock, Kind,
-                             SuspendReturnType);
+    switch (SuspendReturnType) {
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendVoid:
+      assert(SuspendRet->getType()->isVoidTy());
+      break;
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendBool: {
+      assert(SuspendRet->getType()->isIntegerTy());
+
+      // Veto suspension if requested by bool returning await_suspend.
+      BasicBlock *RealSuspendBlock =
+          CGF.createBasicBlock(Prefix + Twine(".suspend.bool"));
+      CGF.Builder.CreateCondBr(SuspendRet, RealSuspendBlock, ReadyBlock);
+      CGF.EmitBlock(RealSuspendBlock);
+      break;
+    }
+    case CoroutineSuspendExpr::SuspendReturnType::SuspendHandle: {
+      assert(SuspendRet->getType()->isVoidTy());
+      break;
+    }
+    }
+
+    // Emit the suspend point.
+    const bool IsFinalSuspend = (Kind == AwaitKind::Final);
+    llvm::Function *CoroSuspend =
+        CGF.CGM.getIntrinsic(llvm::Intrinsic::coro_suspend);
+    auto *SuspendResult = Builder.CreateCall(
+        CoroSuspend, {SaveCall, Builder.getInt1(IsFinalSuspend)});
+
+    // Create a switch capturing three possible continuations.
+    auto *Switch = Builder.CreateSwitch(SuspendResult, Coro.SuspendBB, 2);
+    Switch->addCase(Builder.getInt8(0), ReadyBlock);
+    Switch->addCase(Builder.getInt8(1), CleanupBlock);
+
+    // Emit cleanup for this suspend point.
+    CGF.EmitBlock(CleanupBlock);
+    CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
   }
 
-  // Emit await_resume expression.
-  CGF.EmitBlock(ReadyBlock);
+  static LValueOrRValue emitSuspendExpression(
+      CodeGenFunction &CGF, CGCoroData &Coro, CoroutineSuspendExpr const &S,
+      AwaitKind Kind, AggValueSlot aggSlot, bool ignoreResult, bool forLValue) {
+    auto *E = S.getCommonExpr();
+
+    auto CommonBinder = CodeGenFunction::OpaqueValueMappingData::bind(
+        CGF, S.getOpaqueValue(), E);
+    auto UnbindCommonOnExit =
+        llvm::make_scope_exit([&] { CommonBinder.unbind(CGF); });
+
+    auto Prefix = buildSuspendPrefixStr(Coro, Kind);
+    BasicBlock *ReadyBlock = CGF.createBasicBlock(Prefix + Twine(".ready"));
+    BasicBlock *SuspendBlock = CGF.createBasicBlock(Prefix + Twine(".suspend"));
+
+    // If expression is ready, no need to suspend.
+    CGF.EmitBranchOnBoolExpr(S.getReadyExpr(), ReadyBlock, SuspendBlock, 0);
+
+    // Otherwise, emit suspend logic.
+    CGF.EmitBlock(SuspendBlock);
+
+    auto SuspendWrapper = CodeGenFunction(CGF.CGM).generateAwaitSuspendWrapper(
+        CGF.CurFn->getName(), Prefix, S);
+
+    assert(CGF.CurCoro.Data && CGF.CurCoro.Data->CoroBegin &&
+           "expected to be called in coroutine context");
+
+    // SuspendHandle might throw since it also resumes the returned handle.
+    const auto SuspendReturnType = S.getSuspendReturnType();
+    const bool AwaitSuspendCanThrow =
+        SuspendReturnType ==
+            CoroutineSuspendExpr::SuspendReturnType::SuspendHandle ||
+        StmtCanThrow(S.getSuspendExpr());
+
+    llvm::Value *Awaiter =
+        CGF.getOrCreateOpaqueLValueMapping(S.getOpaqueValue()).getPointer(CGF);
+    llvm::Value *Frame = CGF.CurCoro.Data->CoroBegin;
+
+    if (useCoroAwaitSuspendDestroy(
+            S)) { // Call `await_suspend_destroy` & cleanup
+      emitAwaitSuspendDestroy(CGF, Coro, SuspendWrapper, Awaiter, Frame,
+                              AwaitSuspendCanThrow);
+    } else { // Normal suspend path -- can actually suspend, uses intrinsics
+      emitStandardAwaitSuspend(CGF, Coro, S, SuspendWrapper, Awaiter, Frame,
+                               AwaitSuspendCanThrow, Prefix, ReadyBlock, Kind,
+                               SuspendReturnType);
+    }
+
+    // Emit await_resume expression.
+    CGF.EmitBlock(ReadyBlock);
+
+    // Exception handling requires additional IR. If the 'await_resume' function
+    // is marked as 'noexcept', we avoid generating this additional IR.
+    CXXTryStmt *TryStmt = nullptr;
+    if (Coro.ExceptionHandler && Kind == AwaitKind::Init &&
+        StmtCanThrow(S.getResumeExpr())) {
+      auto &Builder = CGF.Builder;
+      Coro.ResumeEHVar = CGF.CreateTempAlloca(Builder.getInt1Ty(),
+                                              Prefix + Twine("resume.eh"));
+      Builder.CreateFlagStore(true, Coro.ResumeEHVar);
+
+      auto Loc = S.getResumeExpr()->getExprLoc();
+      auto *Catch = new (CGF.getContext())
+          CXXCatchStmt(Loc, /*exDecl=*/nullptr, Coro.ExceptionHandler);
+      auto *TryBody = CompoundStmt::Create(CGF.getContext(), S.getResumeExpr(),
+                                           FPOptionsOverride(), Loc, Loc);
+      TryStmt = CXXTryStmt::Create(CGF.getContext(), Loc, TryBody, Catch);
+      CGF.EnterCXXTryStmt(*TryStmt);
+      CGF.EmitStmt(TryBody);
+      // We don't use EmitCXXTryStmt here. We need to store to ResumeEHVar that
+      // doesn't exist in the body.
+      Builder.CreateFlagStore(false, Coro.ResumeEHVar);
+      CGF.ExitCXXTryStmt(*TryStmt);
+      LValueOrRValue Res;
+      // We are not supposed to obtain the value from init suspend
+      // await_resume().
+      Res.RV = RValue::getIgnored();
+      return Res;
+    }
 
-  // Exception handling requires additional IR. If the 'await_resume' function
-  // is marked as 'noexcept', we avoid generating this additional IR.
-  CXXTryStmt *TryStmt = nullptr;
-  if (Coro.ExceptionHandler && Kind == AwaitKind::Init &&
-      StmtCanThrow(S.getResumeExpr())) {
-    auto &Builder = CGF.Builder;
-    Coro.ResumeEHVar =
-        CGF.CreateTempAlloca(Builder.getInt1Ty(), Prefix + Twine("resume.eh"));
-    Builder.CreateFlagStore(true, Coro.ResumeEHVar);
-
-    auto Loc = S.getResumeExpr()->getExprLoc();
-    auto *Catch = new (CGF.getContext())
-        CXXCatchStmt(Loc, /*exDecl=*/nullptr, Coro.ExceptionHandler);
-    auto *TryBody = CompoundStmt::Create(CGF.getContext(), S.getResumeExpr(),
-                                         FPOptionsOverride(), Loc, Loc);
-    TryStmt = CXXTryStmt::Create(CGF.getContext(), Loc, TryBody, Catch);
-    CGF.EnterCXXTryStmt(*TryStmt);
-    CGF.EmitStmt(TryBody);
-    // We don't use EmitCXXTryStmt here. We need to store to ResumeEHVar that
-    // doesn't exist in the body.
-    Builder.CreateFlagStore(false, Coro.ResumeEHVar);
-    CGF.ExitCXXTryStmt(*TryStmt);
     LValueOrRValue Res;
-    // We are not supposed to obtain the value from init suspend await_resume().
-    Res.RV = RValue::getIgnored();
+    if (forLValue)
+      Res.LV = CGF.EmitLValue(S.getResumeExpr());
+    else
+      Res.RV = CGF.EmitAnyExpr(S.getResumeExpr(), aggSlot, ignoreResult);
+
     return Res;
   }
 
-  LValueOrRValue Res;
-  if (forLValue)
-    Res.LV = CGF.EmitLValue(S.getResumeExpr());
-  else
-    Res.RV = CGF.EmitAnyExpr(S.getResumeExpr(), aggSlot, ignoreResult);
-
-  return Res;
-}
-
 RValue CodeGenFunction::EmitCoawaitExpr(const CoawaitExpr &E,
                                         AggValueSlot aggSlot,
                                         bool ignoreResult) {

snarkmaster · 2025-08-11T22:35:24Z

clang-format warning

Ok, interesting, I appear to be hitting a bug in git-clang-format. The recommended change is huge and obviously incorrect.

This command from Yuxuan gives a more reasonable diff, which I will apply in the next rev. Not sure why my pre-submit run of git-clang-format didn't apply this, maybe I needed to have re-run formatting after adding some comments. Sigh.

The good thing is that after applying the smaller change below, git-clang-format no longer acts broken.

git diff -U0 --no-color HEAD^ | python3 clang/tools/clang-format/clang-format-diff.py -i -p1

diff --git i/clang/lib/CodeGen/CGCoroutine.cpp w/clang/lib/CodeGen/CGCoroutine.cpp
index 152862d8d67c..cf806763b029 100644
--- i/clang/lib/CodeGen/CGCoroutine.cpp
+++ w/clang/lib/CodeGen/CGCoroutine.cpp
@@ -390,10 +390,10 @@ static void emitStandardAwaitSuspend(
   CGF.EmitBranchThroughCleanup(Coro.CleanupJD);
 }

-static LValueOrRValue emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Coro,
-                                    CoroutineSuspendExpr const &S,
-                                    AwaitKind Kind, AggValueSlot aggSlot,
-                                    bool ignoreResult, bool forLValue) {
+static LValueOrRValue
+emitSuspendExpression(CodeGenFunction &CGF, CGCoroData &Coro,
+                      CoroutineSuspendExpr const &S, AwaitKind Kind,
+                      AggValueSlot aggSlot, bool ignoreResult, bool forLValue) {
   auto *E = S.getCommonExpr();

   auto CommonBinder =

snarkmaster · 2025-08-11T22:38:27Z

Thanks for the review so far @ChuanqiXu9 and @yuxuanchen1997!

Just to sum up the state of the review (EDITED as of 8/11 21:50 Pacific):

Last Friday, I addressed a bunch of feedback from the first round of review. PTAL.
There remain some minor inline comments that I'll address in the next rev, once the responses converge. Those changes are at the level of "hygiene":
- make some IR tests more detailed per Yuxuan above
- add an IR test for [clang] Missed optimization regression: coro.destroy not devirtualized even when possible #148380 to show show that the destroy call is devirtualized per Chuanqi
- use new bit-field on CoroutineSuspendExpr (per above) instead of useCoroAwaitSuspendDestroy
  ~~either remove or gate the "secondary" AwaiterClass check in CGCoroutine.cpp~~
- move one test to Sema
- 2 pending discussions on the wording of the docs.
- minor formatting fix
- resubmit the end-to-end test to llvm-test-suite from [libcxx] Test await_suspend control flow & coro_await_suspend_destroy attr #152820
- update docs per Yuxuan's inline comments below; see if I can shorten anything.
Before I do the polish rev, I'd be grateful if we can get to "looks good" on the actual functionality. If larger changes are required, working on polish would be premature.

yuxuanchen1997

I gave the AttrDocs a read. Overall I think it's good design doc material but I don't know whether it succinct enough to read. Left a few comments to improve the accuracy of wording.

yuxuanchen1997 · 2025-08-11T22:44:58Z

clang/include/clang/Basic/AttrDocs.td

+stub ``await_suspend()`` as above.  Without ``coro_await_suspend_destroy``
+support, the awaiter will behave nearly identically, with the only difference
+being heap allocation instead of stack allocation for the coroutine frame.
+


The coroutine's heap allocation gets elided as a result of coro_await_suspend_destroy, so you can't say this for certain here without knowing usages of other awaitables here.

I used this phrasing for brevity. I will rephrase it to be more precise.

yuxuanchen1997 · 2025-08-11T22:48:52Z

clang/include/clang/Basic/AttrDocs.td

+support, the awaiter will behave nearly identically, with the only difference
+being heap allocation instead of stack allocation for the coroutine frame.
+
+This attribute helps optimize short-circuiting coroutines.


Would it be more clear if you distinguish what is actually short-circuiting here? In my mental model, the coroutine itself is still the same, what makes it "short-circuit" to destroy is the "short-circuiting awaiters".

yuxuanchen1997 · 2025-08-11T22:49:50Z

clang/include/clang/Basic/AttrDocs.td

+A short-circuiting coroutine is one where every ``co_await`` or ``co_yield``
+either immediately produces a value, or exits the coroutine.  In other words,


nit: mention that such an expression produces a value due to "await_ready()" being true?

yuxuanchen1997 · 2025-08-11T22:50:55Z

clang/include/clang/Basic/AttrDocs.td

+
+A short-circuiting coroutine is one where every ``co_await`` or ``co_yield``
+either immediately produces a value, or exits the coroutine.  In other words,
+they use coroutine syntax to concisely branch out of a synchronous function. 


nit: as mentioned above, you can intermix short-circuiting and non-short-circuiting awaitors across many different awaits. So, this coroutine may not be precisely a "synchronous" function.

I am trying to define a short-circuiting coro as one in which EVERY awaiter is synchronous, though. I think the definition is sound.

I can discuss "coroutines mixing short-circuiting and non-short-circuiting awaiters" separately, if you'd like.

yuxuanchen1997 · 2025-08-11T22:52:35Z

clang/include/clang/Basic/AttrDocs.td

+The C++ implementation relies on short-circuiting awaiters.  These either
+resume synchronously, or immediately destroy the awaiting coroutine and return
+control to the parent:


This sentence makes it sound like a language feature in C++. I would reword this to say: "Short-circuiting awaiters in C++ emulate such behavior by either "

Nit: I don't think you want to use the term "resume synchronously". It doesn't make a whole lot of sense. The calling coroutine is never considered to be suspended due to await_ready() being true, consequently there's no resumption.

yuxuanchen1997 · 2025-08-11T23:06:33Z

clang/include/clang/Basic/AttrDocs.td

+Marking your ``await_suspend_destroy`` method as ``noexcept`` can sometimes
+further improve optimization.


method is not a C++ term. The await_suspend_destroy is a non-static member function.

Lol sure, I'll change it, but ~everyone knows that method is short for "member function" :-P

ChuanqiXu9 · 2025-08-12T03:55:10Z

My current IR test verifies that there's no heap alloc across a few examples. Were you thinking of a different kind of test? Can you be very specific since I'm totally new to LLVM?

I mean to add a test for #148380 to show we can cover that in certain cases. Specifically, you can have a test on the IR to show it is devirtualized.

I don't know when I'll next be able to make time to work on this, but if I do, my next target would be std::suspend_never. I hope this one could be done without an attribute, though. Do you have other use-cases in mind?

e.g., add an attribute to ask the compiler don't generate await.suspend intrinsics then the compiler may have optimized the await_suspend call better by inlining. The cost is the risk if the code resume or destroy the current coroutine during await_suspend somehow. But the programmer should be able to know that.

The core idea is to avoid generating await.suspend intrinsics conditionally. I am open to the conditions.

snarkmaster · 2025-08-12T05:02:01Z

Thanks! One more question @ChuanqiXu9 -- I now have a pile of smallish fixes to make. Are you still thinking about the overall design of this PR, or should I go ahead and polish it up for merging?

test for #148380 to show we can cover that in certain cases

Ok, this sounds simple enough. I added that to my to-do list for the next revision.

The core idea is to avoid generating await.suspend intrinsics conditionally. I am open to the conditions.

Ah, now I understand. Your idea ties back to the complex (for me, I didn't spend the necessary time reading the relevant PRs and Phabricator thread!) reasons that the await.suspend.XXX intrinsics were introduced in the first place.

It sounds cool in principle, but:

That's a separate work-stream, right?
Without deeply understanding the old mis-optimization bug, I doubt I'd be good at coming up with this conditions or use-cases for this class of attributes.

If you're thinking about this actively, and you think that my knowledge of e.g. folly/coro is useful for your decision-making, then I can make some time to think about it, or VC about it, and/or you could start a Discourse thread about it?

ChuanqiXu9 · 2025-08-12T05:46:12Z

Thanks! One more question @ChuanqiXu9 -- I now have a pile of smallish fixes to make. Are you still thinking about the overall design of this PR, or should I go ahead and polish it up for merging?

I feel good with the ideas. Let's address these comments and try to merge it.

test for #148380 to show we can cover that in certain cases

Ok, this sounds simple enough. I added that to my to-do list for the next revision.

The core idea is to avoid generating await.suspend intrinsics conditionally. I am open to the conditions.

Ah, now I understand. Your idea ties back to the complex (for me, I didn't spend the necessary time reading the relevant PRs and Phabricator thread!) reasons that the await.suspend.XXX intrinsics were introduced in the first place.

It sounds cool in principle, but:

That's a separate work-stream, right?

Yes. That is a casual chat. Not a requirement for you or this PR.

Without deeply understanding the old mis-optimization bug, I doubt I'd be good at coming up with this conditions or use-cases for this class of attributes.

If you're thinking about this actively, and you think that my knowledge of e.g. folly/coro is useful for your decision-making, then I can make some time to think about it, or VC about it, and/or you could start a Discourse thread about it?

If you want to understand this fully, you can read it at: #56301 The root cause is complex as it relates to the design of LLVM Coroutine IRs and LLVM middle end optimizations. But the higher level key point is, the LLVM awaiter suspend intrinsics may affect the performance. The problem we're handling is the resumer/destroyer in the await_suspend, especially the coroutine is resumed/destroyed in another thread while the await_suspend is executing.

As far as I know, this is reported from google's internal repo, so I folly/coro may not be affected by this. But folly/coro will be a good candidate in the open source world to show the usefulness of coroutines optimization. Also I think our library (https://github.com/alibaba/async_simple) is a good candidate too as it is much simpler and cheaper.

snarkmaster requested a review from a team as a code owner August 8, 2025 01:52

snarkmaster mentioned this pull request Aug 8, 2025

Elide suspension points via [[clang::coro_destroyed_on_suspend]] #152029

Closed

yuxuanchen1997 requested a review from ChuanqiXu9 August 8, 2025 01:59

snarkmaster and others added 3 commits August 8, 2025 00:06

Fix CI

eb5557a

Improve doc formatting

5d6a06d

Merge branch 'main' into coro_await_suspend_destroy

9fe0b17

ChuanqiXu9 reviewed Aug 8, 2025

View reviewed changes

snarkmaster mentioned this pull request Aug 9, 2025

[libcxx] Test await_suspend control flow & coro_await_suspend_destroy attr #152820

Open

snarkmaster added 5 commits August 9, 2025 00:27

Rework the AttrDocs.td addition based on feedback

1dabfe6

Split out the libcxx/test change into PR llvm#152820

62789ef

Lift standard suspend flow to emitStandardAwaitSuspend; tweak comment.

4835f37

Merge branch 'llvm:main' into coro_await_suspend_destroy

3e84df1

Merge branch 'coro_await_suspend_destroy' of github.com:snarkmaster/l…

b359f5f

…lvm-project into coro_await_suspend_destroy

yuxuanchen1997 reviewed Aug 11, 2025

View reviewed changes

yuxuanchen1997 removed the request for review from a team August 11, 2025 22:00

yuxuanchen1997 reviewed Aug 11, 2025

View reviewed changes

		A short-circuiting coroutine is one where every ``co_await`` or ``co_yield``
		either immediately produces a value, or exits the coroutine. In other words,

		Marking your ``await_suspend_destroy`` method as ``noexcept`` can sometimes
		further improve optimization.

Elide suspension points via [[clang::coro_await_suspend_destroy]] #152623

Are you sure you want to change the base?

Elide suspension points via [[clang::coro_await_suspend_destroy]] #152623

Uh oh!

Conversation

snarkmaster commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Aug 8, 2025

Uh oh!

llvmbot commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NewSigma commented Aug 8, 2025

Uh oh!

snarkmaster commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ChuanqiXu9 commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

snarkmaster commented Aug 8, 2025

Uh oh!

ChuanqiXu9 commented Aug 8, 2025

Uh oh!

ChuanqiXu9 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

snarkmaster Aug 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

snarkmaster commented Aug 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

snarkmaster commented Aug 9, 2025

Uh oh!

snarkmaster commented Aug 9, 2025

Uh oh!

snarkmaster commented Aug 9, 2025

Uh oh!

snarkmaster commented Aug 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

snarkmaster commented Aug 8, 2025 •

edited

Loading

llvmbot commented Aug 8, 2025 •

edited

Loading

snarkmaster commented Aug 8, 2025 •

edited

Loading

ChuanqiXu9 commented Aug 8, 2025 •

edited

Loading

snarkmaster Aug 9, 2025 •

edited

Loading

snarkmaster commented Aug 9, 2025 •

edited

Loading

snarkmaster commented Aug 10, 2025 •

edited

Loading

snarkmaster commented Aug 11, 2025 •

edited

Loading