Commit 374bea4

jensmaurer authored and tkoeppe committed
P1949R7 C++ Identifier Syntax using Unicode Standard Annex 31
1 parent cf4ddd0 commit 374bea4

6 files changed: +206 -89 lines changed

source/back.tex

Lines changed: 6 additions & 0 deletions
@@ -23,6 +23,12 @@ \chapter{Bibliography}
2323
\doccite{Unicode Text Segmentation} [online].
2424
Edited by Mark Davis. Revision 35; issued for Unicode 12.0.0. 2019-02-15 [viewed 2020-02-23].
2525
Available from: \url{http://www.unicode.org/reports/tr29/tr29-35.html}
26+
\item
27+
The Unicode Consortium. Unicode Standard Annex, UAX \#31,
28+
\doccite{Unicode Identifier and Pattern Syntax} [online].
29+
Edited by Mark Davis. Revision 33; issued for Unicode 13.0.0.
30+
2020-02-13 [viewed 2021-06-08].
31+
Available from: \url{https://www.unicode.org/reports/tr31/tr31-33.html}
2632
\item
2733
IANA Time Zone Database.
2834
Available from: \url{https://www.iana.org/time-zones}

source/compatibility.tex

Lines changed: 22 additions & 9 deletions
@@ -1,15 +1,28 @@
11
%!TEX root = std.tex
22
\infannex{diff}{Compatibility}
33

4-
% TODO: Add this once we have differences.
5-
6-
%\rSec1[diff.cpp20]{\Cpp{} and ISO \CppXX{}}
7-
%
8-
%\pnum
9-
%\indextext{summary!compatibility with ISO \CppXX{}}%
10-
%This subclause lists the differences between \Cpp{} and
11-
%ISO \CppXX{} (ISO/IEC 14882:2020, \doccite{Programming Languages --- \Cpp{}}),
12-
%by the chapters of this document.
4+
\rSec1[diff.cpp20]{\Cpp{} and ISO \CppXX{}}
5+
6+
\rSec2[diff.cpp20.general]{General}
7+
8+
\pnum
9+
\indextext{summary!compatibility with ISO \CppXX{}}%
10+
Subclause \ref{diff.cpp20} lists the differences between \Cpp{} and
11+
ISO \CppXX{} (ISO/IEC 14882:2020, \doccite{Programming Languages --- \Cpp{}}),
12+
by the chapters of this document.
13+
14+
\rSec2[diff.cpp20.lex]{\ref{lex}: lexical conventions}
15+
16+
\diffref{lex.name}
17+
\change
18+
Previously valid identifiers containing characters
19+
not present in UAX \#44 properties XID_Start or XID_Continue, or
20+
not in Normalization Form C, are now rejected.
21+
\rationale
22+
Prevent confusing characters in identifiers.
23+
Requiring normalization of names ensures consistent linker behavior.
24+
\effect
25+
Some identifiers are no longer well-formed.
1326

1427
\rSec1[diff.cpp17]{\Cpp{} and ISO \CppXVII{}}
1528
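The compatibility entry above says that some previously valid identifiers are now rejected. A minimal Python sketch of that effect follows. It compares a few ranges copied from the removed "Ranges of characters allowed" table with the new XID-based rule. The names `OLD_ALLOWED_SAMPLE`, `was_allowed`, `is_xid_continue` and `newly_rejected` are hypothetical and not part of the commit. The sketch assumes that Python's `str.isidentifier()` applies XID_Start (plus `_`) to the first character and XID_Continue to the rest, using whatever Unicode version the interpreter ships with, which may differ from the one a given C++ implementation uses.

```python
# Sketch only: a few ranges copied from the removed "Ranges of characters
# allowed" table in lex.tex. The real table had many more entries.
OLD_ALLOWED_SAMPLE = [
    (0x00A8, 0x00A8),
    (0x00D8, 0x00F6),
    (0x200B, 0x200D),
    (0x10000, 0x1FFFD),
]


def was_allowed(cp: int) -> bool:
    """True if cp falls in one of the sampled pre-P1949 ranges."""
    return any(lo <= cp <= hi for lo, hi in OLD_ALLOWED_SAMPLE)


def is_xid_continue(cp: int) -> bool:
    """Approximate the XID_Continue test for a single code point.

    Assumption: "a" is a valid start character, so the result of
    str.isidentifier() depends only on whether chr(cp) may continue
    an identifier.
    """
    return ("a" + chr(cp)).isidentifier()


def newly_rejected(cp: int) -> bool:
    """True if cp was allowed by the sampled old ranges but is not XID_Continue."""
    return was_allowed(cp) and not is_xid_continue(cp)


if __name__ == "__main__":
    for cp in (0x00A8, 0x00E9, 0x200B, 0x1F600):
        print(f"U+{cp:04X}: newly rejected = {newly_rejected(cp)}")
```

In this sample, U+00A8 DIAERESIS, U+200B ZERO WIDTH SPACE and U+1F600 (an emoji) fall in the old ranges but lack XID_Continue, while U+00E9 remains usable.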

source/intro.tex

Lines changed: 9 additions & 0 deletions
@@ -70,6 +70,15 @@
7070
\end{footnote}
7171
Language Specification},
7272
Standard Ecma-262, third edition, 1999.
73+
\item
74+
The Unicode Consortium.
75+
Unicode Standard Annex, UAX \#44, \doccite{Unicode Character Database}.
76+
Edited by Ken Whistler and Lauren\c{t}iu Iancu.
77+
Available from: \url{http://www.unicode.org/reports/tr44/}
78+
\item
79+
The Unicode Consortium.
80+
The Unicode Standard, \doccite{Derived Core Properties}.
81+
Available from: \url{https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt}
7382
\end{itemize}
7483

7584
\pnum

source/lex.tex

Lines changed: 35 additions & 80 deletions
@@ -325,6 +325,7 @@
325325
string-literal\br
326326
user-defined-string-literal\br
327327
preprocessing-op-or-punc\br
328+
\textnormal{each} universal-character-name \textnormal{that cannot be one of the above}\br
328329
\textnormal{each non-whitespace character that cannot be one of the above}
329330
\end{bnf}
330331

@@ -340,8 +341,12 @@
340341
(\grammarterm{import-keyword}, \grammarterm{module-keyword}, and \grammarterm{export-keyword}),
341342
identifiers, preprocessing numbers, character literals (including user-defined character
342343
literals), string literals (including user-defined string literals), preprocessing
343-
operators and punctuators, and single non-whitespace characters that do not lexically
344-
match the other preprocessing token categories. If a \tcode{'} or a \tcode{"} character
344+
operators and punctuators, and single \grammarterm{universal-character-name}s and non-whitespace characters that do not lexically
345+
match the other preprocessing token categories.
346+
If a single \grammarterm{universal-character-name}
347+
does not match any of the other preprocessing token categories,
348+
the program is ill-formed.
349+
If a \tcode{'} or a \tcode{"} character
345350
matches the last category, the behavior is undefined. Preprocessing tokens can be
346351
separated by
347352
\indextext{whitespace}%
@@ -602,8 +607,7 @@
602607
\nontermdef{pp-number}\br
603608
digit\br
604609
\terminal{.} digit\br
605-
pp-number digit\br
606-
pp-number identifier-nondigit\br
610+
pp-number identifier-continue\br
607611
pp-number \terminal{'} digit\br
608612
pp-number \terminal{'} nondigit\br
609613
pp-number \terminal{e} sign\br
@@ -630,15 +634,21 @@
630634
\indextext{identifier|(}%
631635
\begin{bnf}
632636
\nontermdef{identifier}\br
633-
identifier-nondigit\br
634-
identifier identifier-nondigit\br
635-
identifier digit
637+
identifier-start\br
638+
identifier identifier-continue\br
636639
\end{bnf}
637640

638641
\begin{bnf}
639-
\nontermdef{identifier-nondigit}\br
642+
\nontermdef{identifier-start}\br
640643
nondigit\br
641-
universal-character-name
644+
universal-character-name \textnormal{of class XID_Start}
645+
\end{bnf}
646+
647+
\begin{bnf}
648+
\nontermdef{identifier-continue}\br
649+
digit\br
650+
nondigit\br
651+
universal-character-name \textnormal{of class XID_Continue}
642652
\end{bnf}
643653

644654
\begin{bnf}
@@ -657,15 +667,8 @@
657667
\pnum
658668
\indextext{name!length of}%
659669
\indextext{name}%
660-
An identifier is an arbitrarily long sequence of letters and digits.
661-
Each \grammarterm{universal-character-name} in an identifier shall designate a
662-
character whose encoding in ISO/IEC 10646 falls into one of the ranges
663-
specified in \tref{lex.name.allowed}.
664-
The initial element shall not be a \grammarterm{universal-character-name}
665-
designating a character whose encoding falls into one of the ranges
666-
specified in \tref{lex.name.disallowed}.
667-
Upper- and lower-case letters are
668-
different. All characters are significant.
670+
The character classes XID_Start and XID_Continue
671+
are Derived Core Properties as described by UAX \#44.
669672
\begin{footnote}
670673
On systems in which linkers cannot accept extended
671674
characters, an encoding of the \grammarterm{universal-character-name} can be used in
@@ -674,69 +677,21 @@
674677
\tcode{\textbackslash u} in a \grammarterm{universal-character-name}. Extended
675678
characters can produce a long external identifier, but \Cpp{} does not
676679
place a translation limit on significant characters for external
677-
identifiers. In \Cpp{}, upper- and lower-case letters are considered
678-
different for all identifiers, including external identifiers.
680+
identifiers.
679681
\end{footnote}
680-
681-
\begin{floattable}{Ranges of characters allowed}{lex.name.allowed}
682-
{lllll}
683-
\topline
684-
\tcode{00A8} &
685-
\tcode{00AA} &
686-
\tcode{00AD} &
687-
\tcode{00AF} &
688-
\tcode{00B2-00B5} \\
689-
\tcode{00B7-00BA} &
690-
\tcode{00BC-00BE} &
691-
\tcode{00C0-00D6} &
692-
\tcode{00D8-00F6} &
693-
\tcode{00F8-00FF} \\
694-
\tcode{0100-167F} &
695-
\tcode{1681-180D} &
696-
\tcode{180F-1FFF} &&\\
697-
\tcode{200B-200D} &
698-
\tcode{202A-202E} &
699-
\tcode{203F-2040} &
700-
\tcode{2054} &
701-
\tcode{2060-206F} \\
702-
\tcode{2070-218F} &
703-
\tcode{2460-24FF} &
704-
\tcode{2776-2793} &
705-
\tcode{2C00-2DFF} &
706-
\tcode{2E80-2FFF} \\
707-
\tcode{3004-3007} &
708-
\tcode{3021-302F} &
709-
\tcode{3031-D7FF} && \\
710-
\tcode{F900-FD3D} &
711-
\tcode{FD40-FDCF} &
712-
\tcode{FDF0-FE44} &
713-
\tcode{FE47-FFFD} & \\
714-
\tcode{10000-1FFFD} &
715-
\tcode{20000-2FFFD} &
716-
\tcode{30000-3FFFD} &
717-
\tcode{40000-4FFFD} &
718-
\tcode{50000-5FFFD} \\
719-
\tcode{60000-6FFFD} &
720-
\tcode{70000-7FFFD} &
721-
\tcode{80000-8FFFD} &
722-
\tcode{90000-9FFFD} &
723-
\tcode{A0000-AFFFD} \\
724-
\tcode{B0000-BFFFD} &
725-
\tcode{C0000-CFFFD} &
726-
\tcode{D0000-DFFFD} &
727-
\tcode{E0000-EFFFD} &
728-
\\
729-
\end{floattable}
730-
731-
\begin{floattable}{Ranges of characters disallowed initially (combining characters)}{lex.name.disallowed}
732-
{llll}
733-
\topline
734-
\tcode{0300-036F} &
735-
% FIXME: Unicode v7 adds 1AB0-1AFF
736-
\tcode{1DC0-1DFF} &
737-
\tcode{20D0-20FF} &
738-
\tcode{FE20-FE2F} \\
739-
\end{floattable}
682+
The program is ill-formed
683+
if an \grammarterm{identifier} does not conform to
684+
Normalization Form C as specified in ISO/IEC 10646.
685+
\begin{note}
686+
Upper- and lower-case letters are considered different for all identifiers.
687+
\end{note}
688+
\begin{note}
689+
In translation phase 4,
690+
\grammarterm{identifier} also includes
691+
those \grammarterm{preprocessing-token}s\iref{lex.pptoken}
692+
differentiated as keywords\iref{lex.key}
693+
in the later translation phase 7\iref{lex.token}.
694+
\end{note}
740695

741696
\pnum
742697
\indextext{\idxcode{import}}%
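The revised lex.name wording combines two checks: the `identifier-start` / `identifier-continue` grammar and the requirement that the identifier be in Normalization Form C. The Python sketch below illustrates both checks on identifiers written with extended characters (not spelled as `\u` escapes). The function name `is_cpp_identifier` is hypothetical. The sketch assumes that `str.isidentifier()` matches the grammar (XID_Start or `_` first, XID_Continue afterwards) and relies on `unicodedata.is_normalized`, which is available from Python 3.8. It does not distinguish keywords, in line with the note that phase 4 identifiers include tokens later treated as keywords.

```python
import unicodedata


def is_cpp_identifier(s: str) -> bool:
    """Sketch of the lex.name rules after P1949R7.

    Assumption: str.isidentifier() implements
        identifier-start:    nondigit | XID_Start
        identifier-continue: digit | nondigit | XID_Continue
    for the Unicode version bundled with the Python interpreter.
    """
    if not s.isidentifier():
        return False
    # The program is ill-formed if an identifier is not in NFC.
    return unicodedata.is_normalized("NFC", s)


if __name__ == "__main__":
    samples = ["_x1", "caf\u00e9", "cafe\u0301", "1abc", "\u0301a"]
    for name in samples:
        print(ascii(name), is_cpp_identifier(name))
```

Here `"caf\u00e9"` uses the precomposed letter and passes, whereas `"cafe\u0301"` matches the grammar (U+0301 has XID_Continue) but is not in NFC, so it is rejected instead of being silently normalized.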

source/std.tex

Lines changed: 1 addition & 0 deletions
@@ -146,6 +146,7 @@
146146
\include{limits}
147147
\include{compatibility}
148148
\include{future}
149+
\include{uax31}
149150

150151
%%--------------------------------------------------
151152
%% back matter

source/uax31.tex

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
1+
%!TEX root = std.tex
2+
\infannex{uaxid}{Conformance with UAX \#31}
3+
4+
\rSec1[uaxid.general]{General}
5+
6+
\pnum
7+
This Annex describes the choices made in application of
8+
UAX \#31 (``Unicode Identifier and Pattern Syntax'')
9+
to \Cpp{} in terms of the requirements from UAX \#31 and
10+
how they do or do not apply to \Cpp{}.
11+
In terms of UAX \#31,
12+
\Cpp{} conforms by meeting the requirements
13+
R1 ``Default Identifiers'' and
14+
R4 ``Equivalent Normalized Identifiers''.
15+
The other requirements, also listed below,
16+
are either alternatives not taken or do not apply to \Cpp{}.
17+
18+
\rSec1[uaxid.def]{R1 Default identifiers}
19+
20+
\rSec2[uaxid.def.general]{General}
21+
22+
\pnum
23+
UAX \#31 specifies a default syntax for identifiers
24+
based on properties from the Unicode Character Database, UAX \#44.
25+
The general syntax is
26+
\begin{codeblock}
27+
<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
28+
\end{codeblock}
29+
where \tcode{<Start>} has the XID_Start property,
30+
\tcode{<Continue>} has the XID_Continue property, and
31+
\tcode{<Medial>} is a list of characters permitted between continue characters.
32+
For \Cpp{} we add the character U+005F, LOW LINE, or \tcode{_},
33+
to the set of permitted \tcode{<Start>} characters,
34+
the \tcode{<Medial>} set is empty, and
35+
the \tcode{<Continue>} characters are unmodified.
36+
In the grammar used in UAX \#31, this is
37+
\begin{codeblock}
38+
<Identifier> := <Start> <Continue>*
39+
<Start> := XID_Start + U+005F
40+
<Continue> := <Start> + XID_Continue
41+
\end{codeblock}
42+
43+
\pnum
44+
This is described in the \Cpp{} grammar in \ref{lex.name},
45+
where \grammarterm{identifier} is formed from
46+
\grammarterm{identifier-start} or
47+
\grammarterm{identifier} followed by \grammarterm{identifier-continue}.
48+
49+
\rSec2[uaxid.def.rfmt]{R1a Restricted format characters}
50+
51+
\pnum
52+
If an implementation of UAX \#31 wishes to allow format characters
53+
such as ZERO WIDTH JOINER or ZERO WIDTH NON-JOINER
54+
it must define a profile allowing them, or
55+
describe precisely which combinations are permitted.
56+
57+
\pnum
58+
\Cpp{} does not allow format characters in identifiers, so this does not apply.
59+
60+
\rSec2[uaxid.def.stable]{R1b Stable identifiers}
61+
62+
\pnum
63+
An implementation of UAX \#31 may choose to guarantee
64+
that identifiers are stable across versions of the Unicode Standard.
65+
Once a string qualifies as an identifier it does so in all future versions.
66+
67+
\pnum
68+
\Cpp{} does not make this guarantee,
69+
except to the extent that UAX \#31 guarantees
70+
the stability of the XID_Start and XID_Continue properties.
71+
72+
\rSec1[uaxid.immutable]{R2 Immutable identifiers}
73+
74+
\pnum
75+
An implementation may choose to guarantee that
76+
the set of identifiers will never change
77+
by fixing the set of code points allowed in identifiers forever.
78+
79+
\pnum
80+
\Cpp{} does not choose to make this guarantee.
81+
As scripts are added to Unicode,
82+
additional characters in those scripts may become available
83+
for use in identifiers.
84+
85+
\rSec1[uaxid.pattern]{R3 Pattern_White_Space and Pattern_Syntax characters}
86+
87+
\pnum
88+
UAX \#31 describes how languages that use or interpret patterns of characters,
89+
such as regular expressions or number formats,
90+
may describe that syntax with Unicode properties.
91+
92+
\pnum
93+
\Cpp{} does not do this as part of the language,
94+
deferring to library components for such usage of patterns.
95+
This requirement does not apply to \Cpp{}.
96+
97+
\rSec1[uaxid.eqn]{R4 Equivalent normalized identifiers}
98+
99+
\pnum
100+
UAX \#31 requires that implementations describe
101+
how identifiers are compared and considered equivalent.
102+
103+
\pnum
104+
\Cpp{} requires that identifiers be in Normalization Form C and
105+
therefore identifiers that compare the same under NFC are equivalent.
106+
This is described in \ref{lex.name}.
107+
108+
\rSec1[uaxid.eqci]{R5 Equivalent case-insensitive identifiers}
109+
110+
\pnum
111+
\Cpp{} considers case to be significant in identifier comparison, and
112+
does not do any case folding.
113+
This requirement does not apply to \Cpp{}.
114+
115+
\rSec1[uaxid.filter]{R6 Filtered normalized identifiers}
116+
117+
\pnum
118+
If any characters are excluded from normalization,
119+
UAX \#31 requires a precise specification of those exclusions.
120+
121+
\pnum
122+
\Cpp{} does not make any such exclusions.
123+
124+
\rSec1[uaxid.filterci]{R7 Filtered case-insensitive identifiers}
125+
126+
\pnum
127+
\Cpp{} identifiers are case sensitive, and
128+
therefore this requirement does not apply.
129+
130+
\rSec1[uaxid.hashtag]{R8 Hashtag identifiers}
131+
132+
\pnum
133+
There are no hashtags in \Cpp{}, so this requirement does not apply.
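The R1 section of the new annex quotes the general UAX #31 grammar, `<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*`, and states that C++ leaves `<Medial>` empty and adds U+005F to `<Start>`. The Python sketch below is a small matcher for that grammar with a configurable `medial` set, to show what the empty set rules out. The name `matches_uax31` and its `medial` parameter are hypothetical. The character tests again assume that `str.isidentifier()` reflects XID_Start and XID_Continue for the interpreter's Unicode version.

```python
def matches_uax31(s: str, medial: frozenset = frozenset()) -> bool:
    """Match <Start> <Continue>* (<Medial> <Continue>+)* (sketch only)."""

    def is_start(ch: str) -> bool:
        # C++ adds U+005F LOW LINE to the permitted <Start> characters.
        return ch == "_" or ch.isidentifier()

    def is_continue(ch: str) -> bool:
        # Prefixing a known start character isolates the continue test.
        return ("a" + ch).isidentifier()

    if not s or not is_start(s[0]):
        return False

    i, n = 1, len(s)
    while i < n:
        ch = s[i]
        if is_continue(ch):
            i += 1
        elif ch in medial:
            # A medial character must be followed by at least one <Continue>.
            if i + 1 < n and is_continue(s[i + 1]):
                i += 2
            else:
                return False
        else:
            return False
    return True


if __name__ == "__main__":
    hyphen = frozenset("-")
    print(matches_uax31("a_b1"))         # C++ profile: empty <Medial>
    print(matches_uax31("a-b"))          # rejected when <Medial> is empty
    print(matches_uax31("a-b", hyphen))  # accepted by a profile with "-" as medial
```

With the default empty `medial` set the matcher reduces to `<Start> <Continue>*`, which is the form the annex gives for C++.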
