Move WTF-8 code from std into core and alloc #145335

clarfonthey · 2025-08-13T05:11:51Z

This is basically a small portion of #129411 with a smaller scope. It does not* affect any public APIs; this code is still internal to the standard library. It just moves the WTF-8 code into core and alloc so it can be accessed by no_std crates like backtrace.

* The only public API this affects is std::os::windows::ffi::EncodeWide, which gains a Debug implementation that was not present before. This is because core requires Debug implementations for all types, but std does not (yet) require this. Even though EncodeWide was ultimately changed to be a wrapper over the original type, not a re-export, I decided to keep the Debug implementation so it remains useful.

Like we do with ordinary strings, the tests are still located entirely in alloc, rather than splitting them into core and alloc.

Reviewer note: for ease of review, this is split into three commits:

1. Moving the original files into their new "locations".
2. Actually modifying the code to compile.
3. Removing aesthetic changes that were made so that the diff for commit 2 was readable.

You can review commits 1 and 3 to verify these claims, but commit 2 contains the majority of the changes you should care about.

rustbot · 2025-08-13T05:11:57Z

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

clarfonthey · 2025-08-13T19:10:52Z

Hmm, that appears to be a genuine rustdoc bug: if the module is marked as #[doc(hidden)], the types still show up on traits, but as broken links. I'll play around with extra attributes and then file an issue.

clarfonthey · 2025-08-13T23:07:20Z

library/alloc/src/wtf8/tests.rs

-#[test]
-fn wtf8_to_ascii_lowercase() {
-    let lowercase = Wtf8::from_str("").to_ascii_lowercase();
-    assert_eq!(lowercase.bytes, b"");
-
-    let lowercase = Wtf8::from_str("GrEeN gRaPeS! 🍇").to_ascii_lowercase();
-    assert_eq!(lowercase.bytes, b"green grapes! \xf0\x9f\x8d\x87");
-
-    let lowercase = unsafe { Wtf8::from_bytes_unchecked(b"\xED\xA0\x80").to_ascii_lowercase() };
-    assert_eq!(lowercase.bytes, b"\xED\xA0\x80");
-    assert!(!lowercase.is_known_utf8);
-}
-
-#[test]
-fn wtf8_to_ascii_uppercase() {
-    let uppercase = Wtf8::from_str("").to_ascii_uppercase();
-    assert_eq!(uppercase.bytes, b"");
-
-    let uppercase = Wtf8::from_str("GrEeN gRaPeS! 🍇").to_ascii_uppercase();
-    assert_eq!(uppercase.bytes, b"GREEN GRAPES! \xf0\x9f\x8d\x87");
-
-    let uppercase = unsafe { Wtf8::from_bytes_unchecked(b"\xED\xA0\x80").to_ascii_uppercase() };
-    assert_eq!(uppercase.bytes, b"\xED\xA0\x80");
-    assert!(!uppercase.is_known_utf8);
-}
-


I decided to get rid of these tests because they didn't add much on top of the make_ascii_*case tests, and because the #[cfg(not(test))] on the incoherent impl Wtf8 block would have required me to make separate standalone functions for these to ensure that we're testing the correct version of Wtf8Buf. Just felt easier to delete them since the tests aren't adding a whole lot.
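To illustrate why the copying tests were largely redundant, here is a minimal sketch using plain byte slices rather than the internal Wtf8 type (which is not public). The helper names `lower_copy` and `lower_in_place` are hypothetical; only the `[u8]` methods they call come from the standard library. The point is that the copying variant and the in-place variant produce the same bytes, including for the WTF-8 encoding of a lone surrogate.

```rust
/// Hypothetical helper: copying variant, analogous to `to_ascii_lowercase`.
fn lower_copy(bytes: &[u8]) -> Vec<u8> {
    bytes.to_ascii_lowercase()
}

/// Hypothetical helper: in-place variant, analogous to `make_ascii_lowercase`.
fn lower_in_place(bytes: &mut [u8]) {
    bytes.make_ascii_lowercase();
}

fn main() {
    // ASCII text, a multi-byte UTF-8 character, then the WTF-8 bytes for
    // the unpaired surrogate U+D800 (ED A0 80).
    let mut input: Vec<u8> = "GrEeN gRaPeS! 🍇".as_bytes().to_vec();
    input.extend_from_slice(b"\xED\xA0\x80");

    let copied = lower_copy(&input);
    lower_in_place(&mut input);

    // Both paths only touch ASCII letters, so they agree byte for byte.
    assert_eq!(copied, input);
    assert!(copied.ends_with(b"\xED\xA0\x80"));
}
```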

clarfonthey · 2025-08-13T23:07:44Z

library/coretests/tests/wtf8.rs

@@ -0,0 +1 @@
+// All `wtf8` tests live in library/alloctests/tests/wtf8.rs


The str module has its own version of this, so I decided to make one here too.

clarfonthey · 2025-08-13T23:08:47Z

library/alloc/src/wtf8/mod.rs

@@ -181,6 +88,7 @@ impl fmt::Display for Wtf8Buf {
    }
 }

+#[cfg_attr(test, allow(dead_code))]


Ideally, all methods would be tested, but my goal was to move the code, not improve its test suite. The code originally had allow(dead_code) in the entire module, so, strictly speaking, this is an improvement.

clarfonthey · 2025-08-13T23:10:45Z

library/alloc/src/wtf8/mod.rs

-            [.., 0xED, b2 @ 0xA0..=0xAF, b3] => Some(decode_surrogate(b2, b3)),
-            _ => None,
-        }
+#[cfg(not(test))]


All of these are due to the unfortunate way that alloctests works when testing private internals. Since we have to include a copy of the module in the tests crate, we end up having two versions of these methods implemented on the same Wtf8 type, but returning Wtf8Buf from two different crates. This is why I decided to leave standalone methods that are emitted in all cases and just wrap them outside of tests.
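The shape of that workaround can be sketched in a single file. This is only an approximation: the names `Slice`, `Buf`, `slice_to_buf`, and `to_buf` are hypothetical stand-ins, and the real code involves an incoherent impl spanning core and alloc, which a standalone example cannot reproduce. What it does show is a standalone function that exists in every build, plus a method wrapper that is only emitted outside of tests.

```rust
/// Hypothetical borrowed type, standing in for `Wtf8`.
pub struct Slice<'a>(pub &'a [u8]);

/// Hypothetical owned type, standing in for `Wtf8Buf`.
#[derive(Debug, PartialEq)]
pub struct Buf(pub Vec<u8>);

/// Standalone function: emitted in all configurations, so tests can call it
/// and get the `Buf` defined alongside it.
pub fn slice_to_buf(slice: &Slice<'_>) -> Buf {
    Buf(slice.0.to_vec())
}

/// Method wrapper: only emitted outside of tests, so a test build never sees
/// a second, conflicting definition of the same method.
#[cfg(not(test))]
impl Slice<'_> {
    pub fn to_buf(&self) -> Buf {
        slice_to_buf(self)
    }
}

fn main() {
    let slice = Slice(b"abc");
    let via_fn = slice_to_buf(&slice);
    assert_eq!(via_fn, Buf(b"abc".to_vec()));

    // The method only exists in non-test builds, so its use is gated too.
    #[cfg(not(test))]
    assert_eq!(slice.to_buf(), via_fn);
}
```

Tests written against this layout would call `slice_to_buf` directly, which is the same trade-off described above.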

clarfonthey · 2025-08-13T23:13:40Z

library/core/src/wtf8.rs

@@ -1046,21 +572,19 @@ impl Iterator for EncodeWide<'_> {
    }
 }

+impl fmt::Debug for EncodeWide<'_> {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.debug_struct("EncodeWide").finish_non_exhaustive()


Unlike Wtf8CodePoints, it's not entirely clear how you'd reconstruct the original string from EncodeWide and include the unpaired surrogate, so, I decided to not bother for now. Someone can write a better debug implementation later.
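For reference, a minimal sketch of what such an opaque Debug implementation looks like on an iterator type. `EncodeWideLike` is a hypothetical stand-in that wraps `str::encode_utf16`; it is not the real EncodeWide, which iterates over WTF-8 and can yield unpaired surrogates.

```rust
use std::fmt;
use std::str::EncodeUtf16;

/// Hypothetical stand-in for `EncodeWide`: yields UTF-16 code units.
pub struct EncodeWideLike<'a> {
    inner: EncodeUtf16<'a>,
}

impl<'a> EncodeWideLike<'a> {
    pub fn new(s: &'a str) -> Self {
        EncodeWideLike { inner: s.encode_utf16() }
    }
}

impl Iterator for EncodeWideLike<'_> {
    type Item = u16;

    fn next(&mut self) -> Option<u16> {
        self.inner.next()
    }
}

impl fmt::Debug for EncodeWideLike<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // No fields are printed; `finish_non_exhaustive` signals that the
        // output deliberately omits the internal state.
        f.debug_struct("EncodeWideLike").finish_non_exhaustive()
    }
}

fn main() {
    let units = EncodeWideLike::new("hi");
    println!("{:?}", units);
    let collected: Vec<u16> = units.collect();
    assert_eq!(collected, vec![0x68, 0x69]);
}
```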

rustbot assigned Mark-Simulacrum Aug 13, 2025

rustbot added labels Aug 13, 2025: O-windows (Operating system: Windows), S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.), T-libs (Relevant to the library team, which will review and decide on the PR/issue.)