Fix and clarify CR LF normalization and CR in string literals #1944

Merged
merged 1 commit into from
Jul 25, 2025
1 change: 1 addition & 0 deletions src/input-format.md
@@ -24,6 +24,7 @@ r[input.crlf]
## CRLF normalization

Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+This happens once, not repeatedly, so after the normalization, there can still exist `U+000D` (CR) immediately followed by `U+000A` (LF) in the input (e.g. if the raw input contained "CR CR LF LF").

Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
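
The single-pass behaviour described in this hunk can be illustrated with a minimal sketch. The function name `normalize_crlf` is hypothetical and this is not the compiler's implementation; it only assumes that one left-to-right, non-overlapping replacement (which is what `str::replace` performs) models the rule.

```rust
/// Hypothetical helper: replace each CR LF pair with a single LF, once.
/// `str::replace` scans left to right without revisiting its own output,
/// so pairs that only form after a replacement are left in place.
fn normalize_crlf(input: &str) -> String {
    input.replace("\r\n", "\n")
}
```

Under these assumptions, the raw input "CR CR LF LF" becomes "CR LF LF", which still contains a CR immediately followed by an LF, and a lone CR is left untouched.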

8 changes: 3 additions & 5 deletions src/tokens.md
@@ -60,8 +60,6 @@ Literals are tokens used in [literal expressions].

[^nsets]: The number of `#`s on each side of the same literal must be equivalent.

-> [!NOTE]
-> Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).

#### ASCII escapes

@@ -198,9 +196,9 @@ which must be _escaped_ by a preceding `U+005C` character (`\`).

r[lex.token.literal.str.linefeed]
Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
-The character `U+000D` (CR) may not appear in a string literal.
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.
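
As a rough illustration of the revised rule, the sketch below scans the body of a non-raw string literal (after CRLF normalization) and reports the first CR that is not covered by a string continuation escape. The helper name `first_disallowed_cr` is hypothetical, and the set of whitespace skipped after `\` LF (HT, LF, CR, space) is an assumption based on the Reference's description of string continuation escapes; this is not how the compiler's lexer is written.

```rust
/// Hypothetical helper: byte offset of the first CR in `body` that is not
/// part of a string continuation escape, or `None` if every CR is covered.
/// `body` is the text between the quotes, after CRLF normalization.
fn first_disallowed_cr(body: &str) -> Option<usize> {
    let mut chars = body.char_indices().peekable();
    while let Some((idx, ch)) = chars.next() {
        match ch {
            // An unescaped backslash: inspect the character it escapes.
            '\\' => match chars.next() {
                Some((_, '\n')) => {
                    // Continuation escape: skip the whitespace that follows
                    // (assumed set: HT, LF, CR, space).
                    while let Some(&(_, next)) = chars.peek() {
                        if matches!(next, '\t' | '\n' | '\r' | ' ') {
                            chars.next();
                        } else {
                            break;
                        }
                    }
                }
                // A backslash directly before CR is not a continuation escape.
                Some((cr_idx, '\r')) => return Some(cr_idx),
                // Any other escape (or end of input): nothing more to check.
                _ => {}
            },
            '\r' => return Some(idx),
            _ => {}
        }
    }
    None
}
```

With this sketch, a CR in the whitespace run after `\` LF is accepted, while a CR elsewhere in the body, including after an escaped backslash followed by LF, is reported.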

r[lex.token.literal.char-escape]
#### Character escapes
@@ -323,9 +321,9 @@ below.

r[lex.token.str-byte.linefeed]
Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
-The character `U+000D` (CR) may not appear in a byte string literal.
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.

r[lex.token.str-byte.escape]
Some additional _escapes_ are available in either byte or non-raw byte string
@@ -429,9 +427,9 @@ permitted within a C string.

r[lex.token.str-c.linefeed]
Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
-The character `U+000D` (CR) may not appear in a C string literal.
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.

r[lex.token.str-c.escape]
Some additional _escapes_ are available in non-raw C string literals. An escape