Skip to content

P1854R4 Making non-encodable string literals ill-formed #6317

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 16, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 12 additions & 44 deletions source/lex.tex
Original file line number Diff line number Diff line change
Expand Up @@ -1436,42 +1436,21 @@
\indextext{type!\idxcode{char32_t}}%
\indextext{wide-character}%
\indextext{type!\idxcode{wchar_t}}%
A \defnx{non-encodable character literal}{literal!character!non-encodable}
is a \grammarterm{character-literal}
whose \grammarterm{c-char-sequence} consists of a single \grammarterm{c-char}
that is not a \grammarterm{numeric-escape-sequence} and
that specifies a character
that either lacks representation in the literal's associated character encoding
or that cannot be encoded as a single code unit.
A \defnadj{multicharacter}{literal} is a \grammarterm{character-literal}
whose \grammarterm{c-char-sequence} consists of
more than one \grammarterm{c-char}.
The \grammarterm{encoding-prefix} of
a non-encodable character literal or a multicharacter literal
shall be absent.
Such \grammarterm{character-literal}s are conditionally-supported.
A multicharacter literal shall not have an \grammarterm{encoding-prefix}.
If a multicharacter literal contains a \grammarterm{c-char}
that is not encodable as a single code unit in the ordinary literal encoding,
the program is ill-formed.
Multicharacter literals are conditionally-supported.

\pnum
The kind of a \grammarterm{character-literal},
its type, and its associated character encoding\iref{lex.charset}
are determined by
its \grammarterm{encoding-prefix} and its \grammarterm{c-char-sequence}
as defined by \tref{lex.ccon.literal}.
The special cases for
non-encodable character literals and multicharacter literals
take precedence over the base kind.
\begin{note}
The associated character encoding for ordinary character literals
determines encodability,
but does not determine the value of
non-encodable ordinary character literals or
ordinary multicharacter literals.
The examples in \tref{lex.ccon.literal}
for non-encodable ordinary character literals assume that
the specified character lacks representation in
the ordinary literal encoding or
that encoding the character would require more than one code unit.
\end{note}

\begin{floattable}{Character literals}{lex.ccon.literal}
{l|l|l|l|l}
Expand All @@ -1482,15 +1461,10 @@
none &
\defnx{ordinary character literal}{literal!character!ordinary} &
\keyword{char} &
ordinary &
ordinary literal &
\tcode{'v'} \\ \cline{2-3}\cline{5-5}
&
non-encodable ordinary character literal &
\keyword{int} &
literal &
\tcode{'\textbackslash U0001F525'} \\ \cline{2-3}\cline{5-5}
&
ordinary multicharacter literal &
multicharacter literal &
\keyword{int} &
encoding &
\tcode{'abcd'} \\ \hline
Expand Down Expand Up @@ -1522,8 +1496,7 @@
the value of a \grammarterm{character-literal} is determined
using the range of representable values
of the \grammarterm{character-literal}'s type in translation phase 7.
A non-encodable character literal or a multicharacter literal
has an
A multicharacter literal has an
\impldef{value of non-encodable character literal or multicharacter literal}
value.
The value of any other kind of \grammarterm{character-literal}
Expand All @@ -1537,12 +1510,10 @@
\grammarterm{universal-character-name}
is the code unit value of the specified character
as encoded in the literal's associated character encoding.
\begin{note}
If the specified character lacks
representation in the literal's associated character encoding or
if it cannot be encoded as a single code unit,
then the literal is a non-encodable character literal.
\end{note}
then the program is ill-formed.
\item
A \grammarterm{character-literal} with
a \grammarterm{c-char-sequence} consisting of
Expand All @@ -1568,7 +1539,7 @@
$v$ does not exceed the range of representable values of the corresponding unsigned type for the underlying type of the \grammarterm{character-literal}'s type,
then the value is the unique value of the \grammarterm{character-literal}'s type \tcode{T} that is congruent to $v$ modulo $2^N$, where $N$ is the width of \tcode{T}.
\item
Otherwise, the \grammarterm{character-literal} is ill-formed.
Otherwise, the program is ill-formed.
\end{itemize}
\item
A \grammarterm{character-literal} with
Expand Down Expand Up @@ -2006,10 +1977,7 @@
is encoded to a code unit sequence
using the \grammarterm{string-literal}'s associated character encoding.
If a character lacks representation in the associated character encoding,
then the \grammarterm{string-literal} is conditionally-supported and
an
\impldef{code unit sequence for non-representable \grammarterm{string-literal}}
code unit sequence is encoded.
then the program is ill-formed.
\begin{note}
No character lacks representation in any Unicode encoding form.
\end{note}
Expand Down Expand Up @@ -2050,7 +2018,7 @@
the \grammarterm{string-literal}'s array element type \tcode{T}
that is congruent to $v$ modulo $2^N$, where $N$ is the width of \tcode{T}.
\item
Otherwise, the \grammarterm{string-literal} is ill-formed.
Otherwise, the program is ill-formed.
\end{itemize}
When encoding a stateful character encoding,
these sequences should have no effect on encoding state.
Expand Down