Commit 53f9db6

P1949R7 C++ Identifier Syntax using Unicode Standard Annex 31
1 parent fb3bea8 commit 53f9db6

6 files changed: +204 -89 lines changed

source/back.tex

Lines changed: 6 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -23,6 +23,12 @@ \chapter{Bibliography}
2323
\doccite{Unicode Text Segmentation} [online].
2424
Edited by Mark Davis. Revision 35; issued for Unicode 12.0.0. 2019-02-15 [viewed 2020-02-23].
2525
Available from: \url{http://www.unicode.org/reports/tr29/tr29-35.html}
26+
\item
27+
The Unicode Consortium. Unicode Standard Annex, UAX \#31,
28+
\doccite{Unicode Identifier and Pattern Syntax} [online].
29+
Edited by Mark Davis. Revision 33; issued for Unicode 13.0.0.
30+
2020-02-13 [viewed 2021-06-08].
31+
Available from: \url{https://www.unicode.org/reports/tr31/tr31-33.html}
2632
\item
2733
IANA Time Zone Database.
2834
Available from: \url{https://www.iana.org/time-zones}

source/compatibility.tex

Lines changed: 22 additions & 9 deletions
@@ -1,15 +1,28 @@
11
%!TEX root = std.tex
22
\infannex{diff}{Compatibility}
33

4-
% TODO: Add this once we have differences.
5-
6-
%\rSec1[diff.cpp20]{\Cpp{} and ISO \CppXX{}}
7-
%
8-
%\pnum
9-
%\indextext{summary!compatibility with ISO \CppXX{}}%
10-
%This subclause lists the differences between \Cpp{} and
11-
%ISO \CppXX{} (ISO/IEC 14882:2020, \doccite{Programming Languages --- \Cpp{}}),
12-
%by the chapters of this document.
4+
\rSec1[diff.cpp20]{\Cpp{} and ISO \CppXX{}}
5+
6+
\rSec2[diff.cpp20.general]{General}
7+
8+
\pnum
9+
\indextext{summary!compatibility with ISO \CppXX{}}%
10+
Subclause \ref{diff.cpp20} lists the differences between \Cpp{} and
11+
ISO \CppXX{} (ISO/IEC 14882:2020, \doccite{Programming Languages --- \Cpp{}}),
12+
by the chapters of this document.
13+
14+
\rSec2[diff.cpp20.lex]{\ref{lex}: lexical conventions}
15+
16+
\diffref{lex.name}
17+
\change
18+
Previously valid identifiers containing characters
19+
not present in UAX \#44 properties XID_Start or XID_Continue, or
20+
not in Normalization Form C, are now rejected.
21+
\rationale
22+
Prevent confusing characters in identifiers.
23+
Requiring normalization of names ensures consistent linker behavior.
24+
\effect
25+
Some identifiers are no longer well-formed.
1326

1427
\rSec1[diff.cpp17]{\Cpp{} and ISO \CppXVII{}}
1528
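
The effect described in `diff.cpp20.lex` can be illustrated with a small sketch. It is written in Python rather than C++ because Python's `str.isidentifier()` tests XID_Start (plus `_`) and XID_Continue per code point, which stands in here for a compiler's check. It relies on whatever Unicode version the Python build ships, so results for recently changed characters may differ from a given C++ implementation. The three sample ranges are copied from the removed table `lex.name.allowed` shown in the `source/lex.tex` part of this commit; the helper names are hypothetical.

```python
# Sample of ranges copied from the removed C++20 table lex.name.allowed
# (an excerpt for illustration, not the full table).
OLD_ALLOWED_SAMPLE = [(0x00A8, 0x00A8), (0x202A, 0x202E), (0x10000, 0x1FFFD)]


def allowed_before(ch: str) -> bool:
    """True if ch falls in one of the sampled C++20 ranges (hypothetical helper)."""
    return any(lo <= ord(ch) <= hi for lo, hi in OLD_ALLOWED_SAMPLE)


def allowed_now(ch: str) -> bool:
    """Approximate XID_Continue: 'a' + ch is a Python identifier only if
    ch has XID_Continue in Python's Unicode database."""
    return ("a" + ch).isidentifier()


def newly_rejected(chars):
    """Characters the sampled C++20 ranges accepted but XID_Continue does not."""
    return [f"U+{ord(c):04X}" for c in chars
            if allowed_before(c) and not allowed_now(c)]


# DIAERESIS, RIGHT-TO-LEFT OVERRIDE, an emoji, and a Deseret letter.
samples = ["\u00a8", "\u202e", "\U0001F600", "\U00010400"]
print(newly_rejected(samples))
```

The Deseret letter U+10400 is a cased letter and stays usable, while the symbol, the bidirectional override, and the emoji fall outside XID_Continue. This matches the stated rationale of keeping confusing characters out of identifiers.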

source/intro.tex

Lines changed: 9 additions & 0 deletions
@@ -70,6 +70,15 @@
7070
\end{footnote}
7171
Language Specification},
7272
Standard Ecma-262, third edition, 1999.
73+
\item
74+
The Unicode Consortium.
75+
Unicode Standard Annex, UAX \#44, \doccite{Unicode Character Database}.
76+
Edited by Ken Whistler and Lauren\c{t}iu Iancu.
77+
Available from: \url{http://www.unicode.org/reports/tr44/}
78+
\item
79+
The Unicode Consortium.
80+
The Unicode Standard, \doccite{Derived Core Properties}.
81+
Available from: \url{https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt}
7382
\end{itemize}
7483

7584
\pnum

source/lex.tex

Lines changed: 35 additions & 80 deletions
@@ -325,6 +325,7 @@
325325
string-literal\br
326326
user-defined-string-literal\br
327327
preprocessing-op-or-punc\br
328+
\textnormal{each} universal-character-name \textnormal{that cannot be one of the above}\br
328329
\textnormal{each non-whitespace character that cannot be one of the above}
329330
\end{bnf}
330331

@@ -340,8 +341,12 @@
340341
(\grammarterm{import-keyword}, \grammarterm{module-keyword}, and \grammarterm{export-keyword}),
341342
identifiers, preprocessing numbers, character literals (including user-defined character
342343
literals), string literals (including user-defined string literals), preprocessing
343-
operators and punctuators, and single non-whitespace characters that do not lexically
344-
match the other preprocessing token categories. If a \tcode{'} or a \tcode{"} character
344+
operators and punctuators, and single \grammarterm{universal-character-name}s and non-whitespace characters that do not lexically
345+
match the other preprocessing token categories.
346+
If a single \grammarterm{universal-character-name}
347+
does not match any of the other preprocessing token categories,
348+
the program is ill-formed.
349+
If a \tcode{'} or a \tcode{"} character
345350
matches the last category, the behavior is undefined. Preprocessing tokens can be
346351
separated by
347352
\indextext{whitespace}%
@@ -602,8 +607,7 @@
602607
\nontermdef{pp-number}\br
603608
digit\br
604609
\terminal{.} digit\br
605-
pp-number digit\br
606-
pp-number identifier-nondigit\br
610+
pp-number identifier-continue\br
607611
pp-number \terminal{'} digit\br
608612
pp-number \terminal{'} nondigit\br
609613
pp-number \terminal{e} sign\br
@@ -630,15 +634,21 @@
630634
\indextext{identifier|(}%
631635
\begin{bnf}
632636
\nontermdef{identifier}\br
633-
identifier-nondigit\br
634-
identifier identifier-nondigit\br
635-
identifier digit
637+
identifier-start\br
638+
identifier identifier-continue\br
636639
\end{bnf}
637640

638641
\begin{bnf}
639-
\nontermdef{identifier-nondigit}\br
642+
\nontermdef{identifier-start}\br
640643
nondigit\br
641-
universal-character-name
644+
universal-character-name \textnormal{of class XID_Start}
645+
\end{bnf}
646+
647+
\begin{bnf}
648+
\nontermdef{identifier-continue}\br
649+
digit\br
650+
nondigit\br
651+
universal-character-name \textnormal{of class XID_Continue}
642652
\end{bnf}
643653

644654
\begin{bnf}
@@ -657,15 +667,8 @@
657667
\pnum
658668
\indextext{name!length of}%
659669
\indextext{name}%
660-
An identifier is an arbitrarily long sequence of letters and digits.
661-
Each \grammarterm{universal-character-name} in an identifier shall designate a
662-
character whose encoding in ISO/IEC 10646 falls into one of the ranges
663-
specified in \tref{lex.name.allowed}.
664-
The initial element shall not be a \grammarterm{universal-character-name}
665-
designating a character whose encoding falls into one of the ranges
666-
specified in \tref{lex.name.disallowed}.
667-
Upper- and lower-case letters are
668-
different. All characters are significant.
670+
The character classes XID_Start and XID_Continue
671+
are Derived Core Properties as described by UAX \#44.
669672
\begin{footnote}
670673
On systems in which linkers cannot accept extended
671674
characters, an encoding of the \grammarterm{universal-character-name} can be used in
@@ -674,69 +677,21 @@
674677
\tcode{\textbackslash u} in a \grammarterm{universal-character-name}. Extended
675678
characters can produce a long external identifier, but \Cpp{} does not
676679
place a translation limit on significant characters for external
677-
identifiers. In \Cpp{}, upper- and lower-case letters are considered
678-
different for all identifiers, including external identifiers.
680+
identifiers.
679681
\end{footnote}
680-
681-
\begin{floattable}{Ranges of characters allowed}{lex.name.allowed}
682-
{lllll}
683-
\topline
684-
\tcode{00A8} &
685-
\tcode{00AA} &
686-
\tcode{00AD} &
687-
\tcode{00AF} &
688-
\tcode{00B2-00B5} \\
689-
\tcode{00B7-00BA} &
690-
\tcode{00BC-00BE} &
691-
\tcode{00C0-00D6} &
692-
\tcode{00D8-00F6} &
693-
\tcode{00F8-00FF} \\
694-
\tcode{0100-167F} &
695-
\tcode{1681-180D} &
696-
\tcode{180F-1FFF} &&\\
697-
\tcode{200B-200D} &
698-
\tcode{202A-202E} &
699-
\tcode{203F-2040} &
700-
\tcode{2054} &
701-
\tcode{2060-206F} \\
702-
\tcode{2070-218F} &
703-
\tcode{2460-24FF} &
704-
\tcode{2776-2793} &
705-
\tcode{2C00-2DFF} &
706-
\tcode{2E80-2FFF} \\
707-
\tcode{3004-3007} &
708-
\tcode{3021-302F} &
709-
\tcode{3031-D7FF} && \\
710-
\tcode{F900-FD3D} &
711-
\tcode{FD40-FDCF} &
712-
\tcode{FDF0-FE44} &
713-
\tcode{FE47-FFFD} & \\
714-
\tcode{10000-1FFFD} &
715-
\tcode{20000-2FFFD} &
716-
\tcode{30000-3FFFD} &
717-
\tcode{40000-4FFFD} &
718-
\tcode{50000-5FFFD} \\
719-
\tcode{60000-6FFFD} &
720-
\tcode{70000-7FFFD} &
721-
\tcode{80000-8FFFD} &
722-
\tcode{90000-9FFFD} &
723-
\tcode{A0000-AFFFD} \\
724-
\tcode{B0000-BFFFD} &
725-
\tcode{C0000-CFFFD} &
726-
\tcode{D0000-DFFFD} &
727-
\tcode{E0000-EFFFD} &
728-
\\
729-
\end{floattable}
730-
731-
\begin{floattable}{Ranges of characters disallowed initially (combining characters)}{lex.name.disallowed}
732-
{llll}
733-
\topline
734-
\tcode{0300-036F} &
735-
% FIXME: Unicode v7 adds 1AB0-1AFF
736-
\tcode{1DC0-1DFF} &
737-
\tcode{20D0-20FF} &
738-
\tcode{FE20-FE2F} \\
739-
\end{floattable}
682+
The program is ill-formed
683+
if an \grammarterm{identifier} does not conform to
684+
Normalization Form C as specified in ISO/IEC 10646.
685+
\begin{note}
686+
Upper- and lower-case letters are considered different for all identifiers.
687+
\end{note}
688+
\begin{note}
689+
In translation phase 4,
690+
\grammarterm{identifier} also includes
691+
those \grammarterm{preprocessing-token}s\iref{lex.pptoken}
692+
differentiated as keywords\iref{lex.key}
693+
in the later translation phase 7\iref{lex.token}.
694+
\end{note}
740695

741696
\pnum
742697
\indextext{\idxcode{import}}%
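
The new `identifier-start` / `identifier-continue` grammar and the Normalization Form C rule can be sketched as a checker over decoded code points. This is an approximation under stated assumptions: it works on characters rather than on `universal-character-name` spellings, it uses Python's `str.isidentifier()` as a proxy for XID_Start (plus `_`) and XID_Continue, and its function names are hypothetical.

```python
import unicodedata


def is_identifier_start(ch: str) -> bool:
    """nondigit, or a character with XID_Start (approximated via Python)."""
    return len(ch) == 1 and ch.isidentifier()


def is_identifier_continue(ch: str) -> bool:
    """digit, nondigit, or a character with XID_Continue (approximated)."""
    return len(ch) == 1 and ("a" + ch).isidentifier()


def is_wellformed_identifier(name: str) -> bool:
    """identifier-start followed by any number of identifier-continue,
    with the whole name required to be in Normalization Form C."""
    if not name or not is_identifier_start(name[0]):
        return False
    if not all(is_identifier_continue(c) for c in name[1:]):
        return False
    return unicodedata.is_normalized("NFC", name)


for candidate in ["_x1", "caf\u00e9", "cafe\u0301", "1st", "\u0301a"]:
    print(repr(candidate), is_wellformed_identifier(candidate))
```

The last candidate starts with U+0301 COMBINING ACUTE ACCENT, which lies in the removed `lex.name.disallowed` range `0300-036F`. In this sketch that table's role is taken over by the distinction between XID_Start and XID_Continue. The sketch does not model keywords or the preprocessing-token rules.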

source/std.tex

Lines changed: 1 addition & 0 deletions
@@ -146,6 +146,7 @@
146146
\include{limits}
147147
\include{compatibility}
148148
\include{future}
149+
\include{uax31}
149150

150151
%%--------------------------------------------------
151152
%% back matter

source/uax31.tex

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
1+
%!TEX root = std.tex
2+
\infannex{uaxid}{Conformance with UAX \#31}
3+
4+
\rSec1[uaxid.general]{General}
5+
6+
\pnum
7+
This Annex describes the choices made in application of
8+
UAX \#31 (``Unicode Identifier and Pattern Syntax'')
9+
to \Cpp{} in terms of the requirements from UAX \#31 and
10+
how they do or do not apply to \Cpp{}.
11+
In terms of UAX \#31,
12+
\Cpp{} conforms by meeting the requirements
13+
R1 ``Default Identifiers'' and
14+
R4 ``Equivalent Normalized Identifiers''.
15+
The other requirements, also listed below,
16+
are either alternatives not taken or do not apply to \Cpp{}.
17+
18+
\rSec1[uaxid.default]{R1 Default identifiers}
19+
20+
\pnum
21+
UAX \#31 specifies a default syntax for identifiers
22+
based on properties from the Unicode Character Database, UAX \#44.
23+
The general syntax is
24+
\begin{codeblock}
25+
<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
26+
\end{codeblock}
27+
where \tcode{<Start>} has the XID_Start property,
28+
\tcode{<Continue>} has the XID_Continue property, and
29+
\tcode{<Medial>} is a list of characters permitted between continue characters.
30+
For \Cpp{} we add the character U+005F, LOW LINE, or \tcode{_},
31+
to the set of permitted \tcode{<Start>} characters,
32+
the \tcode{<Medial>} set is empty, and
33+
the \tcode{<Continue>} characters are unmodified.
34+
In the grammar used in UAX \#31, this is
35+
\begin{codeblock}
36+
<Identifier> := <Start> <Continue>*
37+
<Start> := XID_Start + U+005F
38+
<Continue> := <Start> + XID_Continue
39+
\end{codeblock}
40+
41+
\pnum
42+
This is described in the \Cpp{} grammar in \ref{lex.name},
43+
where \grammarterm{identifier} is formed from
44+
\grammarterm{identifier-start} or
45+
\grammarterm{identifier} followed by \grammarterm{identifier-continue}.
46+
47+
\rSec2[uaxid.rfmt]{R1a Restricted format characters}
48+
49+
\pnum
50+
If an implementation of UAX \#31 wishes to allow format characters
51+
such as ZERO WIDTH JOINER or ZERO WIDTH NON-JOINER
52+
it must define a profile allowing them, or
53+
describe precisely which combinations are permitted.
54+
55+
\pnum
56+
\Cpp{} does not allow format characters in identifiers, so this does not apply.
57+
58+
\rSec2[uaxid.stable]{R1b Stable identifiers}
59+
60+
\pnum
61+
An implementation of UAX \#31 may choose to guarantee
62+
that identifiers are stable across versions of the Unicode Standard.
63+
Once a string qualifies as an identifier it does so in all future versions.
64+
65+
\pnum
66+
\Cpp{} does not make this guarantee,
67+
except to the extent that UAX \#31 guarantees
68+
the stability of the XID_Start and XID_Continue properties.
69+
70+
\rSec1[uaxid.immutable]{R2 Immutable identifiers}
71+
72+
\pnum
73+
An implementation may choose to guarantee that
74+
the set of identifiers will never change
75+
by fixing the set of code points allowed in identifiers forever.
76+
77+
\pnum
78+
\Cpp{} does not choose to make this guarantee.
79+
As scripts are added to Unicode,
80+
additional characters in those scripts may become available
81+
for use in identifiers.
82+
83+
\rSec1[uaxid.pattern]{R3 Pattern_White_Space and Pattern_Syntax characters}
84+
85+
\pnum
86+
UAX \#31 describes how languages that use or interpret patterns of characters,
87+
such as regular expressions or number formats,
88+
may describe that syntax with Unicode properties.
89+
90+
\pnum
91+
\Cpp{} does not do this as part of the language,
92+
deferring to library components for such usage of patterns.
93+
This requirement does not apply to \Cpp{}.
94+
95+
\rSec1[uaxid.eqn]{R4 Equivalent normalized identifiers}
96+
97+
\pnum
98+
UAX \#31 requires that implementations describe
99+
how identifiers are compared and considered equivalent.
100+
101+
\pnum
102+
\Cpp{} requires that identifiers be in Normalization Form C and
103+
therefore identifiers that compare the same under NFC are equivalent.
104+
This is described in \ref{lex.name}.
105+
106+
\rSec1[uaxid.eqci]{R5 Equivalent case-insensitive identifiers}
107+
108+
\pnum
109+
\Cpp{} considers case to be significant in identifier comparison, and
110+
does not do any case folding.
111+
This requirement does not apply to \Cpp{}.
112+
113+
\rSec1[uaxid.filter]{R6 Filtered normalized identifiers}
114+
115+
\pnum
116+
If any characters are excluded from normalization,
117+
UAX \#31 requires a precise specification of those exclusions.
118+
119+
\pnum
120+
\Cpp{} does not make any such exclusions.
121+
122+
\rSec1[uaxid.filterci]{R7 Filtered case-insensitive identifiers}
123+
124+
\pnum
125+
\Cpp{} identifiers are case sensitive, and
126+
therefore this requirement does not apply.
127+
128+
\rSec1[uaxid.hashtag]{R8 Hashtag identifiers}
129+
130+
\pnum
131+
There are no hashtags in \Cpp{}, so this requirement does not apply.
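
The R4 and R5 statements in this annex can be made concrete with a short sketch using Python's `unicodedata` module. The example strings and the helper `same_under_nfc` are illustrative assumptions, not part of the annex.

```python
import unicodedata

composed = "caf\u00e9"     # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def same_under_nfc(a: str, b: str) -> bool:
    """True if a and b have the same Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# R4: the spellings differ as code point sequences but share one NFC form.
print(composed == decomposed)                        # False
print(same_under_nfc(composed, decomposed))          # True
# Only the composed spelling is itself in NFC.
print(unicodedata.is_normalized("NFC", decomposed))  # False
# R5: normalization does no case folding, so case stays significant.
print(same_under_nfc("Name", "name"))                # False
```

Because `lex.name` makes a non-NFC identifier ill-formed instead of normalizing it, a conforming program can contain only the composed spelling. Comparing identifiers code point by code point then gives the same answer as comparing them under NFC.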
