Skip to content

Move WTF-8 code from std into core and alloc #145335

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

clarfonthey
Copy link
Contributor

@clarfonthey clarfonthey commented Aug 13, 2025

This is basically a small portion of #129411 with a smaller scope. It does not* affect any public APIs; this code is still internal to the standard library. It just moves the WTF-8 code into core and alloc so it can be accessed by no_std crates like backtrace.

* The only public API this affects is by adding a Debug implementation to std::os::windows::ffi::EncodeWide, which was not present before. This is due to the fact that core requires Debug implementations for all types, but std does not (yet) require this. Even though this was ultimately changed to be a wrapper over the original type, not a re-export, I decided to keep the Debug implementation so it remains useful.

Like we do with ordinary strings, the tests are still located entirely in alloc, rather than splitting them into core and alloc.


Reviewer note: for ease of review, this is split into three commits:

  1. Moving the original files into their new "locations"
  2. Actually modifying the code to compile.
  3. Removing aesthetic changes that were made so that the diff for commit 2 was readable.

You can review commits 1 and 3 to verify these claims, but commit 2 contains the majority of the changes you should care about.

@rustbot
Copy link
Collaborator

rustbot commented Aug 13, 2025

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added O-windows Operating system: Windows S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Aug 13, 2025
@rust-log-analyzer

This comment has been minimized.

@rustbot rustbot added A-tidy Area: The tidy tool T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Aug 13, 2025
@rust-log-analyzer

This comment has been minimized.

@clarfonthey
Copy link
Contributor Author

Hmm, that appears to be a genuine rustdoc bug: if the module is marked as #[doc(hidden)], the types still show up on traits, which are broken links. Will play around with extra attributes and then file an issue.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@clarfonthey clarfonthey force-pushed the wtf8-core-alloc branch 3 times, most recently from d0674a4 to 3fc7d04 Compare August 13, 2025 23:03
Comment on lines -602 to -627
#[test]
fn wtf8_to_ascii_lowercase() {
let lowercase = Wtf8::from_str("").to_ascii_lowercase();
assert_eq!(lowercase.bytes, b"");

let lowercase = Wtf8::from_str("GrEeN gRaPeS! 🍇").to_ascii_lowercase();
assert_eq!(lowercase.bytes, b"green grapes! \xf0\x9f\x8d\x87");

let lowercase = unsafe { Wtf8::from_bytes_unchecked(b"\xED\xA0\x80").to_ascii_lowercase() };
assert_eq!(lowercase.bytes, b"\xED\xA0\x80");
assert!(!lowercase.is_known_utf8);
}

#[test]
fn wtf8_to_ascii_uppercase() {
let uppercase = Wtf8::from_str("").to_ascii_uppercase();
assert_eq!(uppercase.bytes, b"");

let uppercase = Wtf8::from_str("GrEeN gRaPeS! 🍇").to_ascii_uppercase();
assert_eq!(uppercase.bytes, b"GREEN GRAPES! \xf0\x9f\x8d\x87");

let uppercase = unsafe { Wtf8::from_bytes_unchecked(b"\xED\xA0\x80").to_ascii_uppercase() };
assert_eq!(uppercase.bytes, b"\xED\xA0\x80");
assert!(!uppercase.is_known_utf8);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to get rid of these tests because they didn't add much on top of the make_ascii_*case tests, and because the #[cfg(not(test))] on the incoherent impl Wtf8 block would have required me to make separate standalone functions for these to ensure that we're testing the correct version of Wtf8Buf. Just felt easier to delete them since the tests aren't adding a whole lot.

@@ -0,0 +1 @@
// All `wtf8` tests live in library/alloctests/tests/wtf8.rs
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str module has its own version of this so I decided to make one here too.

@@ -181,6 +88,7 @@ impl fmt::Display for Wtf8Buf {
}
}

#[cfg_attr(test, allow(dead_code))]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, all methods would be tested, but my goal was to move the code, not improve its test suite. The code originally had allow(dead_code) in the entire module, so, strictly speaking, this is an improvement.

[.., 0xED, b2 @ 0xA0..=0xAF, b3] => Some(decode_surrogate(b2, b3)),
_ => None,
}
#[cfg(not(test))]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these are due to the unfortunate way that alloctests works when testing private internals. Since we have to include a copy of the module in the tests crate, we end up having two versions of these methods implemented on the same Wtf8 type, but returning Wtf8Buf from two different crates. This is why I decided to leave standalone methods that are emitted in all cases and just wrap them outside of tests.

@@ -1046,21 +572,19 @@ impl Iterator for EncodeWide<'_> {
}
}

impl fmt::Debug for EncodeWide<'_> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.debug_struct("EncodeWide").finish_non_exhaustive()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike Wtf8CodePoints, it's not entirely clear how you'd reconstruct the original string from EncodeWide and include the unpaired surrogate, so, I decided to not bother for now. Someone can write a better debug implementation later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tidy Area: The tidy tool O-windows Operating system: Windows S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants