From Jason Turner

[lex]

Files changed (1)
  1. tmp/tmpj6b1nb8v/{from.md → to.md} +602 -466
tmp/tmpj6b1nb8v/{from.md → to.md} RENAMED
@@ -5,11 +5,11 @@
5
  The text of the program is kept in units called *source files* in this
6
  document. A source file together with all the headers [[headers]] and
7
  source files included [[cpp.include]] via the preprocessing directive
8
  `#include`, less any source lines skipped by any of the conditional
9
  inclusion [[cpp.cond]] preprocessing directives, is called a
10
- *translation unit*.
11
 
12
  [*Note 1*: A C++ program need not all be translated at the same
13
  time. — *end note*]
14
 
15
  [*Note 2*: Previously translated translation units and instantiation
@@ -24,160 +24,282 @@ program [[basic.link]]. — *end note*]
24
  ## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
25
 
26
  The precedence among the syntax rules of translation is specified by the
27
  following phases.[^1]
28
 
29
- 1. Physical source file characters are mapped, in an
30
- *implementation-defined* manner, to the basic source character set
31
- (introducing new-line characters for end-of-line indicators) if
32
- necessary. The set of physical source file characters accepted is
33
- *implementation-defined*. Any source file character not in the basic
34
- source character set [[lex.charset]] is replaced by the
35
- *universal-character-name* that designates that character. An
36
- implementation may use any internal encoding, so long as an actual
37
- extended character encountered in the source file, and the same
38
- extended character expressed in the source file as a
39
- *universal-character-name* (e.g., using the `\uXXXX` notation), are
40
- handled equivalently except where this replacement is reverted
41
- [[lex.pptoken]] in a raw string literal.
42
- 2. Each instance of a backslash character (\\) immediately followed by a
43
- new-line character is deleted, splicing physical source lines to
44
- form logical source lines. Only the last backslash on any physical
45
- source line shall be eligible for being part of such a splice.
46
- Except for splices reverted in a raw string literal, if a splice
47
- results in a character sequence that matches the syntax of a
48
  *universal-character-name*, the behavior is undefined. A source file
49
  that is not empty and that does not end in a new-line character, or
50
- that ends in a new-line character immediately preceded by a
51
- backslash character before any such splicing takes place, shall be
52
- processed as if an additional new-line character were appended to
53
- the file.
54
  3. The source file is decomposed into preprocessing tokens
55
- [[lex.pptoken]] and sequences of white-space characters (including
56
  comments). A source file shall not end in a partial preprocessing
57
  token or in a partial comment.[^2] Each comment is replaced by one
58
  space character. New-line characters are retained. Whether each
59
- nonempty sequence of white-space characters other than new-line is
60
- retained or replaced by one space character is unspecified. The
61
- process of dividing a source file’s characters into preprocessing
62
- tokens is context-dependent. \[*Example 1*: See the handling of `<`
63
- within a `#include` preprocessing directive. *end example*]
 
 
 
 
 
 
 
64
  4. Preprocessing directives are executed, macro invocations are
65
- expanded, and `_Pragma` unary operator expressions are executed. If
66
- a character sequence that matches the syntax of a
67
- *universal-character-name* is produced by token concatenation
68
- [[cpp.concat]], the behavior is undefined. A `#include`
69
- preprocessing directive causes the named header or source file to be
70
- processed from phase 1 through phase 4, recursively. All
71
  preprocessing directives are then deleted.
72
- 5. Each basic source character set member in a *character-literal* or a
73
- *string-literal*, as well as each escape sequence and
74
- *universal-character-name* in a *character-literal* or a non-raw
75
- string literal, is converted to the corresponding member of the
76
- execution character set ([[lex.ccon]], [[lex.string]]); if there is
77
- no corresponding member, it is converted to an
78
- *implementation-defined* member other than the null (wide)
79
- character.[^3]
80
- 6. Adjacent string literal tokens are concatenated.
81
- 7. White-space characters separating tokens are no longer significant.
82
  Each preprocessing token is converted into a token [[lex.token]].
83
- The resulting tokens are syntactically and semantically analyzed and
84
- translated as a translation unit. \[*Note 1*: The process of
85
- analyzing and translating the tokens may occasionally result in one
86
- token being replaced by a sequence of other tokens
87
- [[temp.names]]. — *end note*] It is *implementation-defined*
88
- whether the sources for module units and header units on which the
89
- current translation unit has an interface dependency (
90
- [[module.unit]], [[module.import]]) are required to be available.
91
- \[*Note 2*: Source files, translation units and translated
92
- translation units need not necessarily be stored as files, nor need
93
- there be any one-to-one correspondence between these entities and
94
- any external representation. The description is conceptual only, and
95
- does not specify any particular implementation. — *end note*]
 
96
  8. Translated translation units and instantiation units are combined as
97
- follows: \[*Note 3*: Some or all of these may be supplied from a
98
  library. — *end note*] Each translated translation unit is examined
99
- to produce a list of required instantiations. \[*Note 4*: This may
100
  include instantiations which have been explicitly requested
101
  [[temp.explicit]]. — *end note*] The definitions of the required
102
  templates are located. It is *implementation-defined* whether the
103
  source of the translation units containing these definitions is
104
- required to be available. \[*Note 5*: An implementation could encode
105
- sufficient information into the translated translation unit so as to
106
- ensure the source is not required here. — *end note*] All the
107
- required instantiations are performed to produce *instantiation
108
- units*. \[*Note 6*: These are similar to translated translation
109
- units, but contain no references to uninstantiated templates and no
110
- template definitions. — *end note*] The program is ill-formed if
111
- any instantiation fails.
112
  9. All external entity references are resolved. Library components are
113
  linked to satisfy external references to entities not defined in the
114
  current translation. All such translator output is collected into a
115
  program image which contains information needed for execution in its
116
  execution environment.
117
 
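The effect of two of the phases above can be observed directly in a small translation unit. The following is a minimal sketch, assuming a C++17 or later compiler; the names `ADD` and `joined` are hypothetical and chosen only for illustration. It shows line splicing (phase 2) and concatenation of adjacent string literals (phase 6).

```cpp
#include <string_view>

// Phase 2: the backslash immediately followed by a new-line is deleted,
// so the two physical lines below form one logical line that holds the
// complete macro definition.
#define ADD(a, b) \
    ((a) + (b))

// Phase 6: adjacent string literal tokens are concatenated.
constexpr std::string_view joined = "trans" "lation";

// Both checks are evaluated at compile time.
static_assert(ADD(2, 3) == 5);
static_assert(joined == "translation");
```

The checks are written as `static_assert` so that the snippet documents its own expectations when compiled as a standalone translation unit.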
118
  ## Character sets <a id="lex.charset">[[lex.charset]]</a>
119
 
120
- The *basic source character set* consists of 96 characters: the space
121
- character, the control characters representing horizontal tab, vertical
122
- tab, form feed, and new-line, plus the following 91 graphical
123
- characters:[^4]
124
-
125
- ``` cpp
126
- a b c d e f g h i j k l m n o p q r s t u v w x y z
127
- A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
128
- 0 1 2 3 4 5 6 7 8 9
129
- _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \" '
130
- ```
131
 
132
  The *universal-character-name* construct provides a way to name other
133
  characters.
134
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
135
  ``` bnf
136
  hex-quad:
137
  hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
138
  ```
139
 
 
 
 
 
 
 
140
  ``` bnf
141
  universal-character-name:
142
  '\u' hex-quad
143
  '\U' hex-quad hex-quad
 
 
144
  ```
145
 
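As a rough illustration of the two grammar forms above, the sketch below names one character with `\u` (one *hex-quad*) and one with `\U` (two *hex-quad*s). It assumes a C++17 or later compiler; the variable names are hypothetical. UTF-32 character literals are used because their value is the ISO/IEC 10646 code point of the single character they contain.

```cpp
// '\u' followed by one hex-quad: four hexadecimal digits.
constexpr char32_t e_acute = U'\u00E9';

// '\U' followed by two hex-quads: eight hexadecimal digits.
constexpr char32_t grinning_face = U'\U0001F600';

// The literal's value is the code point named by the hexadecimal digits.
static_assert(e_acute == 0xE9);
static_assert(grinning_face == 0x1F600);
```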
146
- A *universal-character-name* designates the character in ISO/IEC 10646
147
- (if any) whose code point is the hexadecimal number represented by the
148
- sequence of *hexadecimal-digit*s in the *universal-character-name*. The
149
- program is ill-formed if that number is not a code point or if it is a
150
- surrogate code point. Noncharacter code points and reserved code points
151
- are considered to designate separate characters distinct from any
152
- ISO/IEC 10646 character. If a *universal-character-name* outside the
153
- *c-char-sequence*, *s-char-sequence*, or *r-char-sequence* of a
154
- *character-literal* or *string-literal* (in either case, including
155
- within a *user-defined-literal*) corresponds to a control character or
156
- to a character in the basic source character set, the program is
157
- ill-formed.[^5]
158
-
159
- [*Note 1*: ISO/IEC 10646 code points are integers in the range
160
- [0, 10FFFF] (hexadecimal). A surrogate code point is a value in the
161
- range [D800, DFFF] (hexadecimal). A control character is a character
162
- whose code point is in either of the ranges [0, 1F] or [7F, 9F]
163
- (hexadecimal). *end note*]
164
-
165
- The *basic execution character set* and the *basic execution
166
- wide-character set* shall each contain all the members of the basic
167
- source character set, plus control characters representing alert,
168
- backspace, and carriage return, plus a *null character* (respectively,
169
- *null wide character*), whose value is 0. For each basic execution
170
- character set, the values of the members shall be non-negative and
171
- distinct from one another. In both the source and execution basic
172
- character sets, the value of each character after `0` in the above list
173
- of decimal digits shall be one greater than the value of the previous.
174
- The *execution character set* and the *execution wide-character set* are
175
- *implementation-defined* supersets of the basic execution character set
176
- and the basic execution wide-character set, respectively. The values of
177
- the members of the execution character sets and the sets of additional
178
- members are locale-specific.
179
 
180
  ## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
181
 
182
  ``` bnf
183
  preprocessing-token:
@@ -190,48 +312,53 @@ preprocessing-token:
190
  character-literal
191
  user-defined-character-literal
192
  string-literal
193
  user-defined-string-literal
194
  preprocessing-op-or-punc
195
- each non-white-space character that cannot be one of the above
196
  ```
197
 
198
  Each preprocessing token that is converted to a token [[lex.token]]
199
  shall have the lexical form of a keyword, an identifier, a literal, or
200
  an operator or punctuator.
201
 
202
  A preprocessing token is the minimal lexical element of the language in
203
- translation phases 3 through 6. The categories of preprocessing token
204
- are: header names, placeholder tokens produced by preprocessing `import`
205
- and `module` directives (*import-keyword*, *module-keyword*, and
206
- *export-keyword*), identifiers, preprocessing numbers, character
207
- literals (including user-defined character literals), string literals
208
- (including user-defined string literals), preprocessing operators and
209
- punctuators, and single non-white-space characters that do not lexically
210
- match the other preprocessing token categories. If a `'` or a `"`
211
- character matches the last category, the behavior is undefined.
212
- Preprocessing tokens can be separated by white space; this consists of
213
- comments [[lex.comment]], or white-space characters (space, horizontal
214
- tab, new-line, vertical tab, and form-feed), or both. As described in
215
- [[cpp]], in certain circumstances during translation phase 4, white
216
- space (or the absence thereof) serves as more than preprocessing token
217
- separation. White space can appear within a preprocessing token only as
218
- part of a header name or between the quotation characters in a character
219
- literal or string literal.
 
 
 
 
 
220
 
221
  If the input stream has been parsed into preprocessing tokens up to a
222
  given character:
223
 
224
  - If the next character begins a sequence of characters that could be
225
  the prefix and initial double quote of a raw string literal, such as
226
  `R"`, the next preprocessing token shall be a raw string literal.
227
  Between the initial and final double quote characters of the raw
228
- string, any transformations performed in phases 1 and 2
229
- (*universal-character-name*s and line splicing) are reverted; this
230
- reversion shall apply before any *d-char*, *r-char*, or delimiting
231
- parenthesis is identified. The raw string literal is defined as the
232
- shortest sequence of characters that matches the raw-string pattern
233
  ``` bnf
234
  encoding-prefixₒₚₜ 'R' raw-string
235
  ```
236
  - Otherwise, if the next three characters are `<::` and the subsequent
237
  character is neither `:` nor `>`, the `<` is treated as a
@@ -262,28 +389,29 @@ by preprocessing either of the previous two directives.
262
  [*Note 1*: None has any observable spelling. — *end note*]
263
 
264
  [*Example 2*: The program fragment `0xe+foo` is parsed as a
265
  preprocessing number token (one that is not a valid *integer-literal* or
266
  *floating-point-literal* token), even though a parse as three
267
- preprocessing tokens `0xe`, `+`, and `foo` might produce a valid
268
- expression (for example, if `foo` were a macro defined as `1`).
269
- Similarly, the program fragment `1E1` is parsed as a preprocessing
270
- number (one that is a valid *floating-point-literal* token), whether or
271
- not `E` is a macro name. — *end example*]
272
 
273
  [*Example 3*: The program fragment `x+++++y` is parsed as `x
274
  ++ ++ + y`, which, if `x` and `y` have integral types, violates a
275
  constraint on increment operators, even though the parse `x ++ + ++ y`
276
- might yield a correct expression. — *end example*]
277
 
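The longest-match behavior in the two examples above can be sketched with a well-formed variant. This assumes a C++17 or later compiler, and the function name `max_munch` is hypothetical. The fragment `x+++y` is split into `x`, `++`, `+`, `y`, so it post-increments `x` and then adds `y`.

```cpp
constexpr int max_munch() {
    int x = 1;
    int y = 10;
    // Tokenized as: x ++ + y, i.e. (x++) + y, never x + (++y).
    int r = x+++y;
    // r is 1 + 10 = 11 and x is now 2; pack both into one value.
    return r * 100 + x;
}
static_assert(max_munch() == 1102);

// With white space, 0xe and 1 are separate preprocessing tokens.
// Written as 0xe+1 the characters would form a single preprocessing
// number that is not a valid literal, so the program would be ill-formed.
constexpr int spaced = 0xe + 1;
static_assert(spaced == 15);
```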
278
  ## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
279
 
280
  Alternative token representations are provided for some operators and
281
- punctuators.[^6]
282
 
283
  In all respects of the language, each alternative token behaves the
284
- same, respectively, as its primary token, except for its spelling.[^7]
 
285
  The set of alternative tokens is defined in [[lex.digraph]].
286
 
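A short sketch of alternative tokens standing in for their primary spellings, assuming a conforming C++17 or later compiler (some compilers need a strict-conformance mode to accept the keyword-like forms). The names `both` and `digraph_array` are hypothetical.

```cpp
// 'and' and 'not' behave exactly like '&&' and '!'.
constexpr bool both(bool a, bool b) { return a and b; }
static_assert(both(true, true));
static_assert(not both(true, false));

// '<:' ':>' stand for '[' ']', and '<%' '%>' stand for '{' '}'.
constexpr int digraph_array<:3:> = <%1, 2, 3%>;
static_assert(digraph_array<:1:> == 2);

// The primary spelling refers to the same entity.
static_assert(digraph_array[2] == 3);
```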
287
  ## Tokens <a id="lex.token">[[lex.token]]</a>
288
 
289
  ``` bnf
@@ -292,11 +420,12 @@ token:
292
  keyword
293
  literal
294
  operator-or-punctuator
295
  ```
296
 
297
- There are five kinds of tokens: identifiers, keywords, literals,[^8]
 
298
  operators, and other separators. Blanks, horizontal and vertical tabs,
299
  newlines, formfeeds, and comments (collectively, “whitespace”), as
300
  described below, are ignored except as they serve to separate tokens.
301
 
302
  [*Note 1*: Some whitespace is required to separate otherwise adjacent
@@ -307,11 +436,11 @@ containing alphabetic characters. — *end note*]
307
 
308
  The characters `/*` start a comment, which terminates with the
309
  characters `*/`. These comments do not nest. The characters `//` start a
310
  comment, which terminates immediately before the next new-line
311
  character. If there is a form-feed or a vertical-tab character in such a
312
- comment, only white-space characters shall appear between it and the
313
  new-line that terminates the comment; no diagnostic is required.
314
 
315
  [*Note 1*: The comment characters `//`, `/*`, and `*/` have no special
316
  meaning within a `//` comment and are treated just like other
317
  characters. Similarly, the comment characters `//` and `/*` have no
@@ -331,22 +460,22 @@ h-char-sequence:
331
  h-char-sequence h-char
332
  ```
333
 
334
  ``` bnf
335
  h-char:
336
- any member of the source character set except new-line and '>'
337
  ```
338
 
339
  ``` bnf
340
  q-char-sequence:
341
  q-char
342
  q-char-sequence q-char
343
  ```
344
 
345
  ``` bnf
346
  q-char:
347
- any member of the source character set except new-line and '"'
348
  ```
349
 
350
  [*Note 1*: Header name preprocessing tokens only appear within a
351
  `#include` preprocessing directive, a `__has_include` preprocessing
352
  expression, or after certain occurrences of an `import` token (see 
@@ -358,20 +487,19 @@ names as specified in  [[cpp.include]].
358
 
359
  The appearance of either of the characters `'` or `\` or of either of
360
  the character sequences `/*` or `//` in a *q-char-sequence* or an
361
  *h-char-sequence* is conditionally-supported with
362
  *implementation-defined* semantics, as is the appearance of the
363
- character `"` in an *h-char-sequence*.[^9]
364
 
365
  ## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
366
 
367
  ``` bnf
368
  pp-number:
369
  digit
370
  '.' digit
371
- pp-number digit
372
- pp-number identifier-nondigit
373
  pp-number ''' digit
374
  pp-number ''' nondigit
375
  pp-number 'e' sign
376
  pp-number 'E' sign
377
  pp-number 'p' sign
@@ -389,19 +517,25 @@ after a successful conversion to an *integer-literal* token or a
389
 
390
  ## Identifiers <a id="lex.name">[[lex.name]]</a>
391
 
392
  ``` bnf
393
  identifier:
394
- identifier-nondigit
395
- identifier identifier-nondigit
396
- identifier digit
397
  ```
398
 
399
  ``` bnf
400
- identifier-nondigit:
401
  nondigit
402
- universal-character-name
 
 
 
 
 
 
 
403
  ```
404
 
405
  ``` bnf
406
  nondigit: one of
407
  'a b c d e f g h i j k l m'
@@ -413,51 +547,37 @@ nondigit: one of
413
  ``` bnf
414
  digit: one of
415
  '0 1 2 3 4 5 6 7 8 9'
416
  ```
417
 
418
- An identifier is an arbitrarily long sequence of letters and digits.
419
- Each *universal-character-name* in an identifier shall designate a
420
- character whose encoding in ISO/IEC 10646 falls into one of the ranges
421
- specified in [[lex.name.allowed]]. The initial element shall not be a
422
- *universal-character-name* designating a character whose encoding falls
423
- into one of the ranges specified in [[lex.name.disallowed]]. Upper- and
424
- lower-case letters are different. All characters are significant.[^10]
425
 
426
- **Table: Ranges of characters allowed** <a id="lex.name.allowed">[lex.name.allowed]</a>
 
427
 
428
- | | | | | |
429
- | ------------- | ------------- | ------------- | ------------- | ------------- |
430
- | `00A8` | `00AA` | `00AD` | `00AF` | `00B2-00B5` |
431
- | `00B7-00BA` | `00BC-00BE` | `00C0-00D6` | `00D8-00F6` | `00F8-00FF` |
432
- | `0100-167F` | `1681-180D` | `180F-1FFF` | | |
433
- | `200B-200D` | `202A-202E` | `203F-2040` | `2054` | `2060-206F` |
434
- | `2070-218F` | `2460-24FF` | `2776-2793` | `2C00-2DFF` | `2E80-2FFF` |
435
- | `3004-3007` | `3021-302F` | `3031-D7FF` | | |
436
- | `F900-FD3D` | `FD40-FDCF` | `FDF0-FE44` | `FE47-FFFD` | |
437
- | `10000-1FFFD` | `20000-2FFFD` | `30000-3FFFD` | `40000-4FFFD` | `50000-5FFFD` |
438
- | `60000-6FFFD` | `70000-7FFFD` | `80000-8FFFD` | `90000-9FFFD` | `A0000-AFFFD` |
439
- | `B0000-BFFFD` | `C0000-CFFFD` | `D0000-DFFFD` | `E0000-EFFFD` | |
440
 
 
 
441
 
442
- **Table: Ranges of characters disallowed initially (combining characters)** <a id="lex.name.disallowed">[lex.name.disallowed]</a>
443
-
444
- | | | | |
445
- | ----------- | ---------------------------------------------- | ----------- | ----------- |
446
- | `0300-036F` | `1DC0-1DFF` | `20D0-20FF` | `FE20-FE2F` |
447
 
 
 
 
 
448
 
449
  The identifiers in [[lex.name.special]] have a special meaning when
450
  appearing in a certain context. When referred to in the grammar, these
451
  identifiers are used explicitly rather than using the *identifier*
452
  grammar production. Unless otherwise specified, any ambiguity as to
453
  whether a given *identifier* has a special meaning is resolved to
454
  interpret the token as a regular *identifier*.
455
 
456
- In addition, some identifiers are reserved for use by C++
457
- implementations and shall not be used otherwise; no diagnostic is
458
- required.
459
 
460
  - Each identifier that contains a double underscore `__` or begins with
461
  an underscore followed by an uppercase letter is reserved to the
462
  implementation for any use.
463
  - Each identifier that begins with an underscore is reserved to the
@@ -527,11 +647,11 @@ translation phase 7 [[lex.phases]].
527
 
528
  ## Literals <a id="lex.literal">[[lex.literal]]</a>
529
 
530
  ### Kinds of literals <a id="lex.literal.kinds">[[lex.literal.kinds]]</a>
531
 
532
- There are several kinds of literals.[^11]
533
 
534
  ``` bnf
535
  literal:
536
  integer-literal
537
  character-literal
@@ -540,10 +660,13 @@ literal:
540
  boolean-literal
541
  pointer-literal
542
  user-defined-literal
543
  ```
544
 
 
 
 
545
  ### Integer literals <a id="lex.icon">[[lex.icon]]</a>
546
 
547
  ``` bnf
548
  integer-literal:
549
  binary-literal integer-suffixₒₚₜ
@@ -611,12 +734,14 @@ hexadecimal-digit: one of
611
 
612
  ``` bnf
613
  integer-suffix:
614
  unsigned-suffix long-suffixₒₚₜ
615
  unsigned-suffix long-long-suffixₒₚₜ
 
616
  long-suffix unsigned-suffixₒₚₜ
617
  long-long-suffix unsigned-suffixₒₚₜ
 
618
  ```
619
 
620
  ``` bnf
621
  unsigned-suffix: one of
622
  'u U'
@@ -630,10 +755,15 @@ long-suffix: one of
630
  ``` bnf
631
  long-long-suffix: one of
632
  'll LL'
633
  ```
634
 
 
 
 
 
 
635
  In an *integer-literal*, the sequence of *binary-digit*s,
636
  *octal-digit*s, *digit*s, or *hexadecimal-digit*s is interpreted as a
637
  base N integer as shown in table [[lex.icon.base]]; the lexically first
638
  digit of the sequence of digits is the most significant.
639
 
@@ -658,16 +788,16 @@ decimal values ten through fifteen.
658
  `0x10'0000`, and `0'004'000'000` all have the same
659
  value. — *end example*]
660
 
661
  The type of an *integer-literal* is the first type in the list in
662
  [[lex.icon.type]] corresponding to its optional *integer-suffix* in
663
- which its value can be represented. An *integer-literal* is a prvalue.
664
 
665
  **Table: Types of *integer-literal*s** <a id="lex.icon.type">[lex.icon.type]</a>
666
 
667
  | *integer-suffix* | *decimal-literal* | *integer-literal* other than *decimal-literal* |
668
- | ---------------- | ------------------------ | ---------------------------------------------- |
669
  | none | `int` | `int` |
670
  | | `long int` | `unsigned int` |
671
  | | `long long int` | `long int` |
672
  | | | `unsigned long int` |
673
  | | | `long long int` |
@@ -683,10 +813,15 @@ which its value can be represented. An *integer-literal* is a prvalue.
683
  | and `l` or `L` | `unsigned long long int` | `unsigned long long int` |
684
  | `ll` or `LL` | `long long int` | `long long int` |
685
  | | | `unsigned long long int` |
686
  | Both `u` or `U` | `unsigned long long int` | `unsigned long long int` |
687
  | and `ll` or `LL` | | |
 
 
 
 
 
688
 
689
 
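The suffix rules in the table can be checked at compile time. The sketch below assumes a C++17 or later compiler and uses only values small enough to fit the first type in each list, so the results do not depend on the width of `int`. The name `one_mebi` is hypothetical.

```cpp
#include <type_traits>

// No suffix: the first candidate type is int.
static_assert(std::is_same_v<decltype(42), int>);
// 'u' or 'U': the first candidate type is unsigned int.
static_assert(std::is_same_v<decltype(42u), unsigned int>);
// 'l' or 'L': the first candidate type is long int.
static_assert(std::is_same_v<decltype(42L), long int>);
// 'll' or 'LL': the first candidate type is long long int.
static_assert(std::is_same_v<decltype(42LL), long long int>);
// Both 'u' and 'll': the only candidate is unsigned long long int.
static_assert(std::is_same_v<decltype(42uLL), unsigned long long int>);

// Digit separators are ignored; decimal, hexadecimal, and octal
// spellings of the same number have the same value.
constexpr long long one_mebi = 1'048'576;
static_assert(one_mebi == 0x10'0000);
static_assert(one_mebi == 0'004'000'000);
```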
690
  If an *integer-literal* cannot be represented by any type in its list
691
  and an extended integer type [[basic.fundamental]] can represent its
692
  value, it may have that extended integer type. If all of the types in
@@ -716,157 +851,165 @@ c-char-sequence:
716
  c-char-sequence c-char
717
  ```
718
 
719
  ``` bnf
720
  c-char:
721
- any member of the basic source character set except the single-quote ''', backslash '\', or new-line character
722
  escape-sequence
723
  universal-character-name
724
  ```
725
 
 
 
 
 
 
 
726
  ``` bnf
727
  escape-sequence:
728
  simple-escape-sequence
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
729
  octal-escape-sequence
730
  hexadecimal-escape-sequence
731
  ```
732
 
733
  ``` bnf
734
- simple-escape-sequence: one of
735
- '\'' '\"' '\?' '\\'
736
- '\a' '\b' '\f' '\n' '\r' '\t' '\v'
737
  ```
738
 
739
  ``` bnf
740
  octal-escape-sequence:
741
  '\' octal-digit
742
  '\' octal-digit octal-digit
743
  '\' octal-digit octal-digit octal-digit
 
744
  ```
745
 
746
  ``` bnf
747
  hexadecimal-escape-sequence:
748
- '\x' hexadecimal-digit
749
- hexadecimal-escape-sequence hexadecimal-digit
750
  ```
751
 
752
- A *character-literal* that does not begin with `u8`, `u`, `U`, or `L` is
753
- an *ordinary character literal*. An ordinary character literal that
754
- contains a single *c-char* representable in the execution character set
755
- has type `char`, with value equal to the numerical value of the encoding
756
- of the *c-char* in the execution character set. An ordinary character
757
- literal that contains more than one *c-char* is a
758
- *multicharacter literal*. A multicharacter literal, or an ordinary
759
- character literal containing a single *c-char* not representable in the
760
- execution character set, is conditionally-supported, has type `int`, and
761
- has an *implementation-defined* value.
762
-
763
- A *character-literal* that begins with `u8`, such as `u8'w'`, is a
764
- *character-literal* of type `char8_t`, known as a *UTF-8 character
765
- literal*. The value of a UTF-8 character literal is equal to its ISO/IEC
766
- 10646 code point value, provided that the code point value can be
767
- encoded as a single UTF-8 code unit.
768
-
769
- [*Note 1*: That is, provided the code point value is in the range
770
- [0, 7F] (hexadecimal). — *end note*]
771
-
772
- If the value is not representable with a single UTF-8 code unit, the
773
- program is ill-formed. A UTF-8 character literal containing multiple
774
- *c-char*s is ill-formed.
775
-
776
- A *character-literal* that begins with the letter `u`, such as `u'x'`,
777
- is a *character-literal* of type `char16_t`, known as a *UTF-16
778
- character literal*. The value of a UTF-16 character literal is equal to
779
- its ISO/IEC 10646 code point value, provided that the code point value
780
- is representable with a single 16-bit code unit.
781
-
782
- [*Note 2*: That is, provided the code point value is in the range
783
- [0, FFFF] (hexadecimal). — *end note*]
784
-
785
- If the value is not representable with a single 16-bit code unit, the
786
- program is ill-formed. A UTF-16 character literal containing multiple
787
- *c-char*s is ill-formed.
788
-
789
- A *character-literal* that begins with the letter `U`, such as `U'y'`,
790
- is a *character-literal* of type `char32_t`, known as a *UTF-32
791
- character literal*. The value of a UTF-32 character literal containing a
792
- single *c-char* is equal to its ISO/IEC 10646 code point value. A UTF-32
793
- character literal containing multiple *c-char*s is ill-formed.
794
-
795
- A *character-literal* that begins with the letter `L`, such as `L'z'`,
796
- is a *wide-character literal*. A wide-character literal has type
797
- `wchar_t`.[^12] The value of a wide-character literal containing a
798
- single *c-char* has value equal to the numerical value of the encoding
799
- of the *c-char* in the execution wide-character set, unless the *c-char*
800
- has no representation in the execution wide-character set, in which case
801
- the value is *implementation-defined*.
802
-
803
- [*Note 3*: The type `wchar_t` is able to represent all members of the
804
- execution wide-character set (see 
805
- [[basic.fundamental]]). — *end note*]
806
-
807
- The value of a wide-character literal containing multiple *c-char*s is
808
- *implementation-defined*.
809
-
810
- Certain non-graphic characters, the single quote `'`, the double quote
811
- `"`, the question mark `?`,[^13] and the backslash `\`, can be
812
- represented according to [[lex.ccon.esc]]. The double quote `"` and the
813
- question mark `?`, can be represented as themselves or by the escape
814
- sequences `\"` and `\?` respectively, but the single quote `'` and the
815
- backslash `\` shall be represented by the escape sequences `\'` and `\\`
816
- respectively. Escape sequences in which the character following the
817
- backslash is not listed in [[lex.ccon.esc]] are conditionally-supported,
818
- with *implementation-defined* semantics. An escape sequence specifies a
819
- single character.
820
-
821
- **Table: Escape sequences** <a id="lex.ccon.esc">[lex.ccon.esc]</a>
822
-
823
- | | | |
824
- | --------------- | -------------- | ------------------ |
825
- | new-line | NL(LF) | `\n` |
826
- | horizontal tab | HT | `\t` |
827
- | vertical tab | VT | `\v` |
828
- | backspace | BS | `\b` |
829
- | carriage return | CR | `\r` |
830
- | form feed | FF | `\f` |
831
- | alert | BEL | `\a` |
832
- | backslash | \ | `\\` |
833
- | question mark | ? | `\?` |
834
- | single quote | `'` | `\'` |
835
- | double quote | `"` | `\"` |
836
- | octal number | *ooo* | `\ooo` |
837
- | hex number | *hhh* | `\xhhh` |
838
-
839
-
840
- The escape `\ooo` consists of the backslash followed by one,
841
- two, or three octal digits that are taken to specify the value of the
842
- desired character. The escape `\xhhh` consists of the
843
- backslash followed by `x` followed by one or more hexadecimal digits
844
- that are taken to specify the value of the desired character. There is
845
- no limit to the number of digits in a hexadecimal sequence. A sequence
846
- of octal or hexadecimal digits is terminated by the first character that
847
- is not an octal digit or a hexadecimal digit, respectively. The value of
848
- a *character-literal* is *implementation-defined* if it falls outside of
849
- the *implementation-defined* range defined for `char` (for
850
- *character-literal*s with no prefix) or `wchar_t` (for
851
- *character-literal*s prefixed by `L`).
852
 
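A minimal sketch of the prefixes and numeric escapes described in this subclause, assuming a C++17 or later compiler. The names `from_octal` and `from_hex` are hypothetical. The checks avoid assuming any particular execution character set: they compare numeric escapes with each other and with plain integers, not with letters.

```cpp
#include <type_traits>

// The prefix selects the type of the character literal.
static_assert(std::is_same_v<decltype('a'), char>);
static_assert(std::is_same_v<decltype(u'x'), char16_t>);
static_assert(std::is_same_v<decltype(U'y'), char32_t>);
static_assert(std::is_same_v<decltype(L'z'), wchar_t>);

// An octal escape takes one to three octal digits; a hexadecimal
// escape takes one or more hexadecimal digits after \x.
constexpr char from_octal = '\101';  // octal 101 is 65
constexpr char from_hex = '\x41';    // hexadecimal 41 is 65
static_assert(from_octal == from_hex);
static_assert(from_octal == 65);

// The null character has value 0.
static_assert('\0' == 0);
```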
853
- [*Note 4*: If the value of a *character-literal* prefixed by `u`, `u8`,
854
- or `U` is outside the range defined for its type, the program is
855
- ill-formed. *end note*]
 
856
 
857
- A *universal-character-name* is translated to the encoding, in the
858
- appropriate execution character set, of the character named. If there is
859
- no such encoding, the *universal-character-name* is translated to an
860
- *implementation-defined* encoding.
861
 
862
- [*Note 5*: In translation phase 1, a *universal-character-name* is
863
- introduced whenever an actual extended character is encountered in the
864
- source text. Therefore, all extended characters are described in terms
865
- of *universal-character-name*s. However, the actual compiler
866
- implementation may use its own native character set, so long as the same
867
- results are obtained. — *end note*]
868
 
869
  ### Floating-point literals <a id="lex.fcon">[[lex.fcon]]</a>
870
 
871
  ``` bnf
872
  floating-point-literal:
@@ -921,23 +1064,33 @@ digit-sequence:
921
  digit-sequence '''ₒₚₜ digit
922
  ```
923
 
924
  ``` bnf
925
  floating-point-suffix: one of
926
- 'f l F L'
927
  ```
928
 
929
- The type of a *floating-point-literal* is determined by its
 
930
  *floating-point-suffix* as specified in [[lex.fcon.type]].
931
 
 
 
 
 
932
  **Table: Types of *floating-point-literal*s** <a id="lex.fcon.type">[lex.fcon.type]</a>
933
 
934
  | *floating-point-suffix* | type |
935
- | ----------------------- | --------------- |
936
  | none | `double` |
937
  | `f` or `F` | `float` |
938
  | `l` or `L` | `long double` |
 
 
 
 
 
939
 
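The mapping from suffix to type in the table can be confirmed with a few compile-time checks. This is a sketch assuming a C++17 or later compiler (hexadecimal floating-point literals require C++17); the variable names are hypothetical, and every value used is exactly representable in `double`.

```cpp
#include <type_traits>

// The suffix, or its absence, determines the type.
static_assert(std::is_same_v<decltype(1.5), double>);
static_assert(std::is_same_v<decltype(1.5f), float>);
static_assert(std::is_same_v<decltype(1.5L), long double>);

// Optional separating single quotes are ignored when determining the value.
constexpr double with_separators = 1'000.5;
static_assert(with_separators == 1000.5);

// Hexadecimal floating-point literal: significand 1 (base 16),
// scaled by 2 raised to the decimal exponent 4.
constexpr double hex_sixteen = 0x1p4;
static_assert(hex_sixteen == 16.0);
```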
940
 
941
  The *significand* of a *floating-point-literal* is the
942
  *fractional-constant* or *digit-sequence* of a
943
  *decimal-floating-point-literal* or the
@@ -946,11 +1099,11 @@ The *significand* of a *floating-point-literal* is the
946
  of *digit*s or *hexadecimal-digit*s and optional period are interpreted
947
  as a base N real number s, where N is 10 for a
948
  *decimal-floating-point-literal* and 16 for a
949
  *hexadecimal-floating-point-literal*.
950
 
951
- [*Note 1*: Any optional separating single quotes are ignored when
952
  determining the value. — *end note*]
953
 
954
  If an *exponent-part* or *binary-exponent-part* is present, the exponent
955
  e of the *floating-point-literal* is the result of interpreting the
956
  sequence of an optional *sign* and the *digit*s as a base 10 integer.
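
As a rough illustration of how the significand and exponent combine, the sketch below assumes a C++17 compiler (hexadecimal floating-point literals) and uses only values that are exactly representable in `double`. A decimal exponent scales the significand by a power of 10, and a binary exponent scales it by a power of 2.

```cpp
// Separating single quotes do not affect the value.
static_assert(1'000.5 == 1000.5);

// Decimal literal: significand 1.5, exponent 2, so the value is 1.5 * 10^2.
static_assert(1.5e2 == 150.0);

// Hexadecimal literal: significand 1.8 (base 16) is 1.5, exponent 1,
// so the value is 1.5 * 2^1.
static_assert(0x1.8p1 == 3.0);

// Hexadecimal literal with a negative exponent: 16 * 2^-2.
static_assert(0x10p-2 == 4.0);
```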
@@ -982,15 +1135,21 @@ s-char-sequence:
982
  s-char-sequence s-char
983
  ```
984
 
985
  ``` bnf
986
  s-char:
987
- any member of the basic source character set except the double-quote '"', backslash '\', or new-line character
988
  escape-sequence
989
  universal-character-name
990
  ```
991
 
 
 
 
 
 
 
992
  ``` bnf
993
  raw-string:
994
  '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
995
  ```
996
 
@@ -1000,27 +1159,43 @@ r-char-sequence:
1000
  r-char-sequence r-char
1001
  ```
1002
 
1003
  ``` bnf
1004
  r-char:
1005
- any member of the source character set, except a right parenthesis ')' followed by
1006
- the initial *d-char-sequence* (which may be empty) followed by a double quote '"'.
1007
  ```
1008
 
1009
  ``` bnf
1010
  d-char-sequence:
1011
  d-char
1012
  d-char-sequence d-char
1013
  ```
1014
 
1015
  ``` bnf
1016
  d-char:
1017
- any member of the basic source character set except:
1018
- space, the left parenthesis '(', the right parenthesis ')', the backslash '\', and the control characters
1019
- representing horizontal tab, vertical tab, form feed, and newline.
1020
  ```
1021
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1022
  A *string-literal* that has an `R` in the prefix is a *raw string
1023
  literal*. The *d-char-sequence* serves as a delimiter. The terminating
1024
  *d-char-sequence* of a *raw-string* is the same sequence of characters
1025
  as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
1026
  at most 16 characters.
@@ -1063,149 +1238,130 @@ R"(x = "\"y\"")"
1063
 
1064
  is equivalent to `"x = \"\\\"y\\\"\""`.
1065
 
1066
  — *end example*]
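
The equivalence in the example can be verified directly. This is a small sketch, assuming C++17 for `std::string_view`; the variable names and the `xyz` delimiter are illustrative only.

```cpp
#include <string_view>

// Inside a raw string literal, backslashes and double quotes are ordinary characters.
constexpr std::string_view raw = R"(x = "\"y\"")";
constexpr std::string_view escaped = "x = \"\\\"y\\\"\"";
static_assert(raw == escaped);

// A d-char-sequence (here: xyz) lets the content contain )" without ending the literal.
constexpr std::string_view delimited = R"xyz(a)"b)xyz";
static_assert(delimited == "a)\"b");
```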
1067
 
1068
- After translation phase 6, a *string-literal* that does not begin with
1069
- an *encoding-prefix* is an *ordinary string literal*. An ordinary string
1070
- literal has type “array of *n* `const char`” where *n* is the size of
1071
- the string as defined below, has static storage duration [[basic.stc]],
1072
- and is initialized with the given characters.
1073
-
1074
- A *string-literal* that begins with `u8`, such as `u8"asdf"`, is a
1075
- *UTF-8 string literal*. A UTF-8 string literal has type “array of *n*
1076
- `const char8_t`”, where *n* is the size of the string as defined below;
1077
- each successive element of the object representation [[basic.types]] has
1078
- the value of the corresponding code unit of the UTF-8 encoding of the
1079
- string.
1080
-
1081
  Ordinary string literals and UTF-8 string literals are also referred to
1082
  as narrow string literals.
1083
 
1084
- A *string-literal* that begins with `u`, such as `u"asdf"`, is a *UTF-16
1085
- string literal*. A UTF-16 string literal has type “array of *n*
1086
- `const char16_t`”, where *n* is the size of the string as defined below;
1087
- each successive element of the array has the value of the corresponding
1088
- code unit of the UTF-16 encoding of the string.
 
1089
 
1090
- [*Note 3*: A single *c-char* may produce more than one `char16_t`
1091
- character in the form of surrogate pairs. A surrogate pair is a
1092
- representation for a single code point as a sequence of two 16-bit code
1093
- units. — *end note*]
1094
-
1095
- A *string-literal* that begins with `U`, such as `U"asdf"`, is a *UTF-32
1096
- string literal*. A UTF-32 string literal has type “array of *n*
1097
- `const char32_t`”, where *n* is the size of the string as defined below;
1098
- each successive element of the array has the value of the corresponding
1099
- code unit of the UTF-32 encoding of the string.
1100
-
1101
- A *string-literal* that begins with `L`, such as `L"asdf"`, is a *wide
1102
- string literal*. A wide string literal has type “array of *n* `const
1103
- wchar_t`”, where *n* is the size of the string as defined below; it is
1104
- initialized with the given characters.
1105
 
1106
  In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
1107
- concatenated. If both *string-literal*s have the same *encoding-prefix*,
1108
- the resulting concatenated *string-literal* has that *encoding-prefix*.
1109
- If one *string-literal* has no *encoding-prefix*, it is treated as a
1110
- *string-literal* of the same *encoding-prefix* as the other operand. If
1111
- a UTF-8 string literal token is adjacent to a wide string literal token,
1112
- the program is ill-formed. Any other concatenations are
1113
- conditionally-supported with *implementation-defined* behavior.
1114
-
1115
- [*Note 4*: This concatenation is an interpretation, not a conversion.
1116
- Because the interpretation happens in translation phase 6 (after each
1117
- character from a *string-literal* has been translated into a value from
1118
- the appropriate character set), a *string-literal*’s initial rawness has
1119
- no effect on the interpretation or well-formedness of the
1120
- concatenation. — *end note*]
 
 
 
 
 
1121
 
1122
  [[lex.string.concat]] has some examples of valid concatenations.
1123
 
 
 
1124
  **Table: String literal concatenations** <a id="lex.string.concat">[lex.string.concat]</a>
1125
 
1126
  | Source | | Means | Source | | Means | Source | | Means |
  | ------ | ------ | ------- | ------ | ------ | ------- | ------ | ------ | ------- |
  | `u"a"` | `u"b"` | `u"ab"` | `U"a"` | `U"b"` | `U"ab"` | `L"a"` | `L"b"` | `L"ab"` |
  | `u"a"` | `"b"` | `u"ab"` | `U"a"` | `"b"` | `U"ab"` | `L"a"` | `"b"` | `L"ab"` |
  | `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
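
A few of the table's rows can be checked at compile time. The sketch below assumes C++17 and uses `decltype` on the concatenated literal; a string literal is an lvalue, so the deduced type is a reference to an array. The name `joined` is illustrative.

```cpp
#include <type_traits>

// Same prefix on both operands: the result keeps that prefix.
static_assert(std::is_same_v<decltype(u"a" u"b"), const char16_t (&)[3]>);

// One operand without a prefix: it is treated as having the other operand's prefix.
static_assert(std::is_same_v<decltype(U"a" "b"), const char32_t (&)[3]>);
static_assert(std::is_same_v<decltype("a" L"b"), const wchar_t (&)[3]>);

// Characters stay distinct across the join: '\xA', then 'B', then the terminating '\0'.
constexpr char joined[] = "\xA" "B";
static_assert(joined[0] == '\xA' && joined[1] == 'B' && joined[2] == '\0');
```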
1132
 
1133
 
1134
- Characters in concatenated strings are kept distinct.
1135
-
1136
- [*Example 2*:
1137
-
1138
- ``` cpp
1139
- "\xA" "B"
1140
- ```
1141
-
1142
- contains the two characters `'\xA'` and `'B'` after concatenation (and
1143
- not the single hexadecimal character `'\xAB'`).
1144
-
1145
- — *end example*]
1146
-
1147
- After any necessary concatenation, in translation phase 7
1148
- [[lex.phases]], `'\0'` is appended to every *string-literal* so that
1149
- programs that scan a string can find its end.
1150
-
1151
- Escape sequences and *universal-character-name*s in non-raw string
1152
- literals have the same meaning as in *character-literal*s [[lex.ccon]],
1153
- except that the single quote `'` is representable either by itself or by
1154
- the escape sequence `\'`, and the double quote `"` shall be preceded by
1155
- a `\`, and except that a *universal-character-name* in a UTF-16 string
1156
- literal may yield a surrogate pair. In a narrow string literal, a
1157
- *universal-character-name* may map to more than one `char` or `char8_t`
1158
- element due to *multibyte encoding*. The size of a `char32_t` or wide
1159
- string literal is the total number of escape sequences,
1160
- *universal-character-name*s, and other characters, plus one for the
1161
- terminating `U'\0'` or `L'\0'`. The size of a UTF-16 string literal is
1162
- the total number of escape sequences, *universal-character-name*s, and
1163
- other characters, plus one for each character requiring a surrogate
1164
- pair, plus one for the terminating `u'\0'`.
1165
-
1166
- [*Note 5*: The size of a `char16_t` string literal is the number of
1167
- code units, not the number of characters. — *end note*]
1168
-
1169
- [*Note 6*: Any *universal-character-name*s are required to correspond
1170
- to a code point in the range [0, D800) or [E000, 10FFFF] (hexadecimal)
1171
- [[lex.charset]]. — *end note*]
1172
-
1173
- The size of a narrow string literal is the total number of escape
1174
- sequences and other characters, plus at least one for the multibyte
1175
- encoding of each *universal-character-name*, plus one for the
1176
- terminating `'\0'`.
1177
-
1178
  Evaluating a *string-literal* results in a string literal object with
1179
- static storage duration, initialized from the given characters as
1180
- specified above. Whether all *string-literal*s are distinct (that is,
1181
- are stored in nonoverlapping objects) and whether successive evaluations
1182
- of a *string-literal* yield the same or a different object is
1183
- unspecified.
1184
-
1185
- [*Note 7*: The effect of attempting to modify a *string-literal* is
1186
- undefined. — *end note*]
1187
 
1188
  ### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
1189
 
1190
  ``` bnf
1191
  boolean-literal:
1192
  'false'
1193
  'true'
1194
  ```
1195
 
1196
  The Boolean literals are the keywords `false` and `true`. Such literals
1197
- are prvalues and have type `bool`.
1198
 
1199
  ### Pointer literals <a id="lex.nullptr">[[lex.nullptr]]</a>
1200
 
1201
  ``` bnf
1202
  pointer-literal:
1203
  'nullptr'
1204
  ```
1205
 
1206
- The pointer literal is the keyword `nullptr`. It is a prvalue of type
1207
  `std::nullptr_t`.
1208
 
1209
  [*Note 1*: `std::nullptr_t` is a distinct type that is neither a
1210
  pointer type nor a pointer-to-member type; rather, a prvalue of this
1211
  type is a null pointer constant and can be converted to a null pointer
@@ -1269,14 +1425,13 @@ The syntactic non-terminal preceding the *ud-suffix* in a
1269
  that could match that non-terminal.
1270
 
1271
  A *user-defined-literal* is treated as a call to a literal operator or
1272
  literal operator template [[over.literal]]. To determine the form of
1273
  this call for a given *user-defined-literal* *L* with *ud-suffix* *X*,
1274
- the *literal-operator-id* whose literal suffix identifier is *X* is
1275
- looked up in the context of *L* using the rules for unqualified name
1276
- lookup [[basic.lookup.unqual]]. Let *S* be the set of declarations found
1277
- by this lookup. *S* shall not be empty.
1278
 
1279
  If *L* is a *user-defined-integer-literal*, let *n* be the literal
1280
  without its *ud-suffix*. If *S* contains a literal operator with
1281
  parameter type `unsigned long long`, the literal *L* is treated as a
1282
  call of the form
@@ -1288,11 +1443,11 @@ operator "" X(nULL)
1288
  Otherwise, *S* shall contain a raw literal operator or a numeric literal
1289
  operator template [[over.literal]] but not both. If *S* contains a raw
1290
  literal operator, the literal *L* is treated as a call of the form
1291
 
1292
  ``` cpp
1293
- operator "" X("n")
1294
  ```
1295
 
1296
  Otherwise (*S* contains a numeric literal operator template), *L* is
1297
  treated as a call of the form
1298
 
@@ -1301,11 +1456,11 @@ operator "" X<'c₁', 'c₂', ... 'cₖ'>()
1301
  ```
1302
 
1303
  where *n* is the source character sequence c₁c₂...cₖ.
1304
 
1305
  [*Note 1*: The sequence c₁c₂...cₖ can only contain characters from the
1306
- basic source character set. — *end note*]
1307
 
1308
  If *L* is a *user-defined-floating-point-literal*, let *f* be the
1309
  literal without its *ud-suffix*. If *S* contains a literal operator with
1310
  parameter type `long double`, the literal *L* is treated as a call of
1311
  the form
@@ -1317,11 +1472,11 @@ operator "" X(fL)
1317
  Otherwise, *S* shall contain a raw literal operator or a numeric literal
1318
  operator template [[over.literal]] but not both. If *S* contains a raw
1319
  literal operator, the *literal* *L* is treated as a call of the form
1320
 
1321
  ``` cpp
1322
- operator "" X("f")
1323
  ```
1324
 
1325
  Otherwise (*S* contains a numeric literal operator template), *L* is
1326
  treated as a call of the form
1327
 
@@ -1330,11 +1485,11 @@ operator "" X<'c₁', 'c₂', ... 'cₖ'>()
1330
  ```
1331
 
1332
  where *f* is the source character sequence c₁c₂...cₖ.
1333
 
1334
  [*Note 2*: The sequence c₁c₂...cₖ can only contain characters from the
1335
- basic source character set. — *end note*]
1336
 
1337
  If *L* is a *user-defined-string-literal*, let *str* be the literal
1338
  without its *ud-suffix* and let *len* be the number of code units in
1339
  *str* (i.e., its length excluding the terminating null character). If
1340
  *S* contains a literal operator template with a non-type template
@@ -1388,39 +1543,43 @@ suffix is applied to the result of the concatenation.
1388
 
1389
  [*Example 3*:
1390
 
1391
  ``` cpp
1392
  int main() {
1393
- L"A" "B" "C"_x; // OK: same as L"ABC"_x
1394
  "P"_x "Q" "R"_y; // error: two different ud-suffixes
1395
  }
1396
  ```
1397
 
1398
  — *end example*]
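
The call forms described in this subclause can be exercised with a few small literal operators. This is a minimal sketch, assuming C++14 or later; the suffixes `_km`, `_len`, `_digits`, and `_x` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>

// Literal operator taking unsigned long long: 42_km is treated as operator""_km(42ULL).
constexpr unsigned long long operator""_km(unsigned long long n) { return n * 1000; }

// Raw literal operator: 12345_len is treated as operator""_len("12345").
constexpr std::size_t operator""_len(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// Numeric literal operator template: 123_digits is treated as
// operator""_digits<'1', '2', '3'>().
template <char... Cs>
constexpr std::size_t operator""_digits() { return sizeof...(Cs); }

// String literal operator: receives the concatenated string and its length
// in code units, excluding the terminating null character.
constexpr std::size_t operator""_x(const wchar_t*, std::size_t len) { return len; }

static_assert(42_km == 42000);
static_assert(12345_len == 5);
static_assert(123_digits == 3);
static_assert(L"A" "B" "C"_x == 3);  // same as L"ABC"_x
```

The raw and template forms see the spelling of the literal rather than its value, which is why `0x1F_digits` would count four characters.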
1399
 
1400
  <!-- Link reference definitions -->
 
1401
  [basic.fundamental]: basic.md#basic.fundamental
1402
  [basic.link]: basic.md#basic.link
1403
  [basic.lookup.unqual]: basic.md#basic.lookup.unqual
1404
  [basic.stc]: basic.md#basic.stc
1405
- [basic.types]: basic.md#basic.types
1406
  [conv.mem]: expr.md#conv.mem
1407
  [conv.ptr]: expr.md#conv.ptr
1408
  [cpp]: cpp.md#cpp
1409
- [cpp.concat]: cpp.md#cpp.concat
1410
  [cpp.cond]: cpp.md#cpp.cond
1411
  [cpp.import]: cpp.md#cpp.import
1412
  [cpp.include]: cpp.md#cpp.include
1413
  [cpp.module]: cpp.md#cpp.module
1414
  [cpp.stringize]: cpp.md#cpp.stringize
1415
  [dcl.attr.grammar]: dcl.md#dcl.attr.grammar
 
1416
  [headers]: library.md#headers
1417
  [lex]: #lex
1418
  [lex.bool]: #lex.bool
1419
  [lex.ccon]: #lex.ccon
1420
  [lex.ccon.esc]: #lex.ccon.esc
 
1421
  [lex.charset]: #lex.charset
 
 
1422
  [lex.comment]: #lex.comment
1423
  [lex.digraph]: #lex.digraph
1424
  [lex.ext]: #lex.ext
1425
  [lex.fcon]: #lex.fcon
1426
  [lex.fcon.type]: #lex.fcon.type
@@ -1431,83 +1590,60 @@ int main() {
1431
  [lex.key]: #lex.key
1432
  [lex.key.digraph]: #lex.key.digraph
1433
  [lex.literal]: #lex.literal
1434
  [lex.literal.kinds]: #lex.literal.kinds
1435
  [lex.name]: #lex.name
1436
- [lex.name.allowed]: #lex.name.allowed
1437
- [lex.name.disallowed]: #lex.name.disallowed
1438
  [lex.name.special]: #lex.name.special
1439
  [lex.nullptr]: #lex.nullptr
1440
  [lex.operators]: #lex.operators
1441
  [lex.phases]: #lex.phases
1442
  [lex.ppnumber]: #lex.ppnumber
1443
  [lex.pptoken]: #lex.pptoken
1444
  [lex.separate]: #lex.separate
1445
  [lex.string]: #lex.string
1446
  [lex.string.concat]: #lex.string.concat
 
1447
  [lex.token]: #lex.token
1448
  [module.import]: module.md#module.import
1449
  [module.unit]: module.md#module.unit
1450
  [over.literal]: over.md#over.literal
 
1451
  [temp.explicit]: temp.md#temp.explicit
1452
  [temp.names]: temp.md#temp.names
1453
 
1454
- [^1]: Implementations must behave as if these separate phases occur,
1455
- although in practice different phases might be folded together.
1456
 
1457
  [^2]: A partial preprocessing token would arise from a source file
1458
  ending in the first portion of a multi-character token that requires
1459
  a terminating sequence of characters, such as a *header-name* that
1460
  is missing the closing `"` or `>`. A partial comment would arise
1461
  from a source file ending with an unclosed `/*` comment.
1462
 
1463
- [^3]: An implementation need not convert all non-corresponding source
1464
- characters to the same execution character.
1465
-
1466
- [^4]: The glyphs for the members of the basic source character set are
1467
- intended to identify characters from the subset of ISO/IEC 10646
1468
- which corresponds to the ASCII character set. However, because the
1469
- mapping from source file characters to the source character set
1470
- (described in translation phase 1) is specified as
1471
- *implementation-defined*, an implementation is required to document
1472
- how the basic source characters are represented in source files.
1473
-
1474
- [^5]: A sequence of characters resembling a *universal-character-name*
1475
- in an *r-char-sequence* [[lex.string]] does not form a
1476
- *universal-character-name*.
1477
-
1478
- [^6]: These include “digraphs” and additional reserved words. The term
1479
  “digraph” (token consisting of two characters) is not perfectly
1480
  descriptive, since one of the alternative *preprocessing-token*s is
1481
  `%:%:` and of course several primary tokens contain two characters.
1482
  Nonetheless, those alternative tokens that aren’t lexical keywords
1483
  are colloquially known as “digraphs”.
1484
 
1485
- [^7]: Thus the “stringized” values [[cpp.stringize]] of `[` and `<:`
1486
  will be different, maintaining the source spelling, but the tokens
1487
  can otherwise be freely interchanged.
1488
 
1489
- [^8]: Literals include strings and character and numeric literals.
1490
 
1491
- [^9]: Thus, a sequence of characters that resembles an escape sequence
1492
- might result in an error, be interpreted as the character
1493
  corresponding to the escape sequence, or have a completely different
1494
  meaning, depending on the implementation.
1495
 
1496
- [^10]: On systems in which linkers cannot accept extended characters, an
1497
- encoding of the *universal-character-name* may be used in forming
1498
  valid external identifiers. For example, some otherwise unused
1499
- character or sequence of characters may be used to encode the `\u`
1500
- in a *universal-character-name*. Extended characters may produce a
1501
  long external identifier, but C++ does not place a translation limit
1502
- on significant characters for external identifiers. In C++, upper-
1503
- and lower-case letters are considered different for all identifiers,
1504
- including external identifiers.
1505
 
1506
- [^11]: The term “literal” generally designates, in this document, those
1507
  tokens that are called “constants” in ISO C.
1508
-
1509
- [^12]: They are intended for character sets where a character does not
1510
- fit into a single byte.
1511
-
1512
- [^13]: Using an escape sequence for a question mark is supported for
1513
- compatibility with ISO C++14 and ISO C.
 
5
  The text of the program is kept in units called *source files* in this
6
  document. A source file together with all the headers [[headers]] and
7
  source files included [[cpp.include]] via the preprocessing directive
8
  `#include`, less any source lines skipped by any of the conditional
9
  inclusion [[cpp.cond]] preprocessing directives, is called a
10
+ *preprocessing translation unit*.
11
 
12
  [*Note 1*: A C++ program need not all be translated at the same
13
  time. — *end note*]
14
 
15
  [*Note 2*: Previously translated translation units and instantiation
 
24
  ## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
25
 
26
  The precedence among the syntax rules of translation is specified by the
27
  following phases.[^1]
28
 
29
+ 1. An implementation shall support input files that are a sequence of
30
+ UTF-8 code units (UTF-8 files). It may also support an
31
+ *implementation-defined* set of other kinds of input files, and, if
32
+ so, the kind of an input file is determined in an
33
+ *implementation-defined* manner that includes a means of designating
34
+ input files as UTF-8 files, independent of their content.
35
+ \[*Note 1*: In other words, recognizing the U+feff (byte order mark)
36
+ is not sufficient. *end note*] If an input file is determined to
37
+ be a UTF-8 file, then it shall be a well-formed UTF-8 code unit
38
+ sequence and it is decoded to produce a sequence of Unicode scalar
39
+ values. A sequence of translation character set elements is then
40
+ formed by mapping each Unicode scalar value to the corresponding
41
+ translation character set element. In the resulting sequence, each
42
+ pair of characters in the input sequence consisting of
43
+ U+000d (carriage return) followed by U+000a (line feed), as well as
44
+ each U+000d (carriage return) not immediately followed by a
45
+ U+000a (line feed), is replaced by a single new-line character. For
46
+ any other kind of input file supported by the implementation,
47
+ characters are mapped, in an *implementation-defined* manner, to a
48
+ sequence of translation character set elements [[lex.charset]],
49
+ representing end-of-line indicators as new-line characters.
50
+ 2. If the first translation character is U+feff (byte order mark), it
51
+ is deleted. Each sequence of a backslash character (\\) immediately
52
+ followed by zero or more whitespace characters other than new-line
53
+ followed by a new-line character is deleted, splicing physical
54
+ source lines to form logical source lines. Only the last backslash
55
+ on any physical source line shall be eligible for being part of such
56
+ a splice. Except for splices reverted in a raw string literal, if a
57
+ splice results in a character sequence that matches the syntax of a
58
  *universal-character-name*, the behavior is undefined. A source file
59
  that is not empty and that does not end in a new-line character, or
60
+ that ends in a splice, shall be processed as if an additional
61
+ new-line character were appended to the file.
 
 
62
  3. The source file is decomposed into preprocessing tokens
63
+ [[lex.pptoken]] and sequences of whitespace characters (including
64
  comments). A source file shall not end in a partial preprocessing
65
  token or in a partial comment.[^2] Each comment is replaced by one
66
  space character. New-line characters are retained. Whether each
67
+ nonempty sequence of whitespace characters other than new-line is
68
+ retained or replaced by one space character is unspecified. As
69
+ characters from the source file are consumed to form the next
70
+ preprocessing token (i.e., not being consumed as part of a comment
71
+ or other forms of whitespace), except when matching a
72
+ *c-char-sequence*, *s-char-sequence*, *r-char-sequence*,
73
+ *h-char-sequence*, or *q-char-sequence*, *universal-character-name*s
74
+ are recognized and replaced by the designated element of the
75
+ translation character set. The process of dividing a source file’s
76
+ characters into preprocessing tokens is context-dependent.
77
+ \[*Example 1*: See the handling of `<` within a `#include`
78
+ preprocessing directive. — *end example*]
79
  4. Preprocessing directives are executed, macro invocations are
80
+ expanded, and `_Pragma` unary operator expressions are executed. A
81
+ `#include` preprocessing directive causes the named header or source
82
+ file to be processed from phase 1 through phase 4, recursively. All
 
 
 
83
  preprocessing directives are then deleted.
84
+ 5. For a sequence of two or more adjacent *string-literal* tokens, a
85
+ common *encoding-prefix* is determined as specified in
86
+ [[lex.string]]. Each such *string-literal* token is then considered
87
+ to have that common *encoding-prefix*.
88
+ 6. Adjacent *string-literal* tokens are concatenated [[lex.string]].
89
+ 7. Whitespace characters separating tokens are no longer significant.
 
 
 
 
90
  Each preprocessing token is converted into a token [[lex.token]].
91
+ The resulting tokens constitute a *translation unit* and are
92
+ syntactically and semantically analyzed and translated.
93
+ \[*Note 2*: The process of analyzing and translating the tokens can
94
+ occasionally result in one token being replaced by a sequence of
95
+ other tokens [[temp.names]]. — *end note*] It is
96
+ *implementation-defined* whether the sources for module units and
97
+ header units on which the current translation unit has an interface
98
+ dependency [[module.unit]], [[module.import]] are required to be
99
+ available. \[*Note 3*: Source files, translation units and
100
+ translated translation units need not necessarily be stored as
101
+ files, nor need there be any one-to-one correspondence between these
102
+ entities and any external representation. The description is
103
+ conceptual only, and does not specify any particular
104
+ implementation. — *end note*]
105
  8. Translated translation units and instantiation units are combined as
106
+ follows: \[*Note 4*: Some or all of these can be supplied from a
107
  library. — *end note*] Each translated translation unit is examined
108
+ to produce a list of required instantiations. \[*Note 5*: This can
109
  include instantiations which have been explicitly requested
110
  [[temp.explicit]]. — *end note*] The definitions of the required
111
  templates are located. It is *implementation-defined* whether the
112
  source of the translation units containing these definitions is
113
+ required to be available. \[*Note 6*: An implementation can choose
114
+ to encode sufficient information into the translated translation
115
+ unit so as to ensure the source is not required here. — *end note*]
116
+ All the required instantiations are performed to produce
117
+ *instantiation units*. \[*Note 7*: These are similar to translated
118
+ translation units, but contain no references to uninstantiated
119
+ templates and no template definitions. — *end note*] The program is
120
+ ill-formed if any instantiation fails.
121
  9. All external entity references are resolved. Library components are
122
  linked to satisfy external references to entities not defined in the
123
  current translation. All such translator output is collected into a
124
  program image which contains information needed for execution in its
125
  execution environment.
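
Two of the phases above are easy to observe from inside a program: line splicing (phase 2) and string literal concatenation (phase 6). The sketch below assumes C++17 for `std::string_view`; the macro `GREETING` and the variable `greeting` are illustrative names.

```cpp
#include <string_view>

// Phase 2: the backslash and the following new-line are deleted, so the
// directive below is a single logical source line.
#define GREETING "hel" \
                 "lo"

// Phase 4 expands the macro; phase 6 then concatenates the two adjacent
// string-literal tokens into one.
constexpr std::string_view greeting = GREETING;

static_assert(greeting == "hello");
static_assert(sizeof(GREETING) == 6);  // five characters plus the terminating '\0'
```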
126
 
127
  ## Character sets <a id="lex.charset">[[lex.charset]]</a>
128
 
129
+ The *translation character set* consists of the following elements:
130
+
131
+ - each abstract character assigned a code point in the Unicode
132
+ codespace, and
133
+ - a distinct character for each Unicode scalar value not assigned to an
134
+ abstract character.
135
+
136
+ [*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
137
+ (hexadecimal). A surrogate code point is a value in the range
138
+ [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
139
+ that is not a surrogate code point. — *end note*]
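
The three ranges in the note translate directly into predicates. This is a small sketch; the function names are hypothetical and not part of any library.

```cpp
// Unicode code points are the integers in [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(char32_t cp) { return cp <= 0x10FFFF; }

// Surrogate code points are the values in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(char32_t cp) { return cp >= 0xD800 && cp <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t cp) {
    return is_code_point(cp) && !is_surrogate(cp);
}

static_assert(is_scalar_value(0x41));
static_assert(is_scalar_value(0xD7FF));
static_assert(!is_scalar_value(0xD800));
static_assert(!is_scalar_value(0xDFFF));
static_assert(is_scalar_value(0xE000));
static_assert(is_scalar_value(0x10FFFF));
static_assert(!is_scalar_value(0x110000));
```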
140
+
141
+ The *basic character set* is a subset of the translation character set,
142
+ consisting of 96 characters as specified in [[lex.charset.basic]].
143
+
144
+ [*Note 2*: Unicode short names are given only as a means of identifying
145
+ the character; the numerical value has no other meaning in this
146
+ context. — *end note*]
147
+
148
+ **Table: Basic character set** <a id="lex.charset.basic">[lex.charset.basic]</a>
149
+
150
+ | character | | glyph |
151
+ | -------------------- | --------------------------- | --------------------------- |
152
+ | `U+0009` | character tabulation | |
153
+ | `U+000b` | line tabulation | |
154
+ | `U+000c` | form feed | |
155
+ | `U+0020` | space | |
156
+ | `U+000a` | line feed | new-line |
157
+ | `U+0021` | exclamation mark | `!` |
158
+ | `U+0022` | quotation mark | `"` |
159
+ | `U+0023` | number sign | `#` |
160
+ | `U+0025` | percent sign | `%` |
161
+ | `U+0026` | ampersand | `&` |
162
+ | `U+0027` | apostrophe | `'` |
163
+ | `U+0028` | left parenthesis | `(` |
164
+ | `U+0029` | right parenthesis | `)` |
165
+ | `U+002a` | asterisk | `*` |
166
+ | `U+002b` | plus sign | `+` |
167
+ | `U+002c` | comma | `,` |
168
+ | `U+002d` | hyphen-minus | `-` |
169
+ | `U+002e` | full stop | `.` |
170
+ | `U+002f` | solidus | `/` |
171
+ | `U+0030` .. `U+0039` | digit zero .. nine | `0 1 2 3 4 5 6 7 8 9` |
172
+ | `U+003a` | colon | `:` |
173
+ | `U+003b` | semicolon | `;` |
174
+ | `U+003c` | less-than sign | `<` |
175
+ | `U+003d` | equals sign | `=` |
176
+ | `U+003e` | greater-than sign | `>` |
177
+ | `U+003f` | question mark | `?` |
178
+ | `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
179
+ | | | `N O P Q R S T U V W X Y Z` |
180
+ | `U+005b` | left square bracket | `[` |
181
+ | `U+005c` | reverse solidus | `\` |
182
+ | `U+005d` | right square bracket | `]` |
183
+ | `U+005e` | circumflex accent | `^` |
184
+ | `U+005f` | low line | `_` |
185
+ | `U+0061` .. `U+007a` | latin small letter a .. z | `a b c d e f g h i j k l m` |
186
+ | | | `n o p q r s t u v w x y z` |
187
+ | `U+007b` | left curly bracket | `{` |
188
+ | `U+007c` | vertical line | `\|` |
189
+ | `U+007d` | right curly bracket | `}` |
190
+ | `U+007e` | tilde | `~` |
191
+
192
 
193
  The *universal-character-name* construct provides a way to name other
194
  characters.
195
 
196
+ ``` bnf
197
+ n-char:
198
+ any member of the translation character set except the U+007d (right curly bracket) or new-line character
199
+ ```
200
+
201
+ ``` bnf
202
+ n-char-sequence:
203
+ n-char
204
+ n-char-sequence n-char
205
+ ```
206
+
207
+ ``` bnf
208
+ named-universal-character:
209
+ '\N{' n-char-sequence '}'
210
+ ```
211
+
212
  ``` bnf
213
  hex-quad:
214
  hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
215
  ```
216
 
217
+ ``` bnf
218
+ simple-hexadecimal-digit-sequence:
219
+ hexadecimal-digit
220
+ simple-hexadecimal-digit-sequence hexadecimal-digit
221
+ ```
222
+
223
  ``` bnf
224
  universal-character-name:
225
  '\u' hex-quad
226
  '\U' hex-quad hex-quad
227
+ '\u{' simple-hexadecimal-digit-sequence '}'
228
+ named-universal-character
229
  ```
230
 
231
+ A *universal-character-name* of the form `\u` *hex-quad*, `\U`
232
+ *hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
233
+ designates the character in the translation character set whose Unicode
234
+ scalar value is the hexadecimal number represented by the sequence of
235
+ *hexadecimal-digit*s in the *universal-character-name*. The program is
236
+ ill-formed if that number is not a Unicode scalar value.
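
The hexadecimal forms can be compared at compile time. The sketch below sticks to the `\u` *hex-quad* and `\U` *hex-quad* *hex-quad* spellings, which older compilers also accept; the braced `\u{...}` and `\N{...}` forms in the grammar above need a compiler that implements them.

```cpp
// Both spellings designate the same character, U+00E9.
static_assert(U'\u00e9' == 0xE9);
static_assert(U'\u00e9' == U'\U000000e9');

// The eight-digit form reaches code points above U+FFFF.
static_assert(U'\U0001F600' == 0x1F600);

// One character can need several code units, depending on the encoding:
// U+00E9 takes two UTF-8 code units, plus one for the terminating null.
static_assert(sizeof(u8"\u00e9") == 3);

// U+1F600 takes a surrogate pair in UTF-16, plus one for the terminating null.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3);
```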
237
+
238
+ A *universal-character-name* that is a *named-universal-character*
239
+ designates the corresponding character in the Unicode Standard (chapter
240
+ 4.8 Name) if the *n-char-sequence* is equal to its character name or to
241
+ one of its character name aliases of type “control”, “correction”, or
242
+ “alternate”; otherwise, the program is ill-formed.
243
+
244
+ [*Note 3*: These aliases are listed in the Unicode Character Database’s
245
+ `NameAliases.txt`. None of these names or aliases have leading or
246
+ trailing spaces. *end note*]
247
+
248
+ If a *universal-character-name* outside the *c-char-sequence*,
249
+ *s-char-sequence*, or *r-char-sequence* of a *character-literal* or
250
+ *string-literal* (in either case, including within a
251
+ *user-defined-literal*) corresponds to a control character or to a
252
+ character in the basic character set, the program is ill-formed.
253
+
254
+ [*Note 4*: A sequence of characters resembling a
255
+ *universal-character-name* in an *r-char-sequence* [[lex.string]] does
256
+ not form a *universal-character-name*. *end note*]
257
+
258
+ The *basic literal character set* consists of all characters of the
259
+ basic character set, plus the control characters specified in
260
+ [[lex.charset.literal]].
261
+
262
+ **Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
263
+
264
+ | | |
265
+ | -------- | --------------- |
266
+ | `U+0000` | null |
267
+ | `U+0007` | alert |
268
+ | `U+0008` | backspace |
269
+ | `U+000d` | carriage return |
270
+
271
+
272
+ A *code unit* is an integer value of character type
273
+ [[basic.fundamental]]. Characters in a *character-literal* other than a
274
+ multicharacter or non-encodable character literal or in a
275
+ *string-literal* are encoded as a sequence of one or more code units, as
276
+ determined by the *encoding-prefix* [[lex.ccon]], [[lex.string]]; this
277
+ is termed the respective *literal encoding*. The
278
+ *ordinary literal encoding* is the encoding applied to an ordinary
279
+ character or string literal. The *wide literal encoding* is the encoding
280
+ applied to a wide character or string literal.
281
+
282
+ A literal encoding or a locale-specific encoding of one of the execution
283
+ character sets [[character.seq]] encodes each element of the basic
284
+ literal character set as a single code unit with non-negative value,
285
+ distinct from the code unit for any other such element.
286
+
287
+ [*Note 5*: A character not in the basic literal character set can be
288
+ encoded with more than one code unit; the value of such a code unit can
289
+ be the same as that of a code unit for an element of the basic literal
290
+ character set. — *end note*]
291
+
292
+ The U+0000 (null) character is encoded as the value `0`. No other
293
+ element of the translation character set is encoded with a code unit of
294
+ value `0`. The code unit value of each decimal digit character after the
295
+ digit `0` (`U+0030`) shall be one greater than the value of the
296
+ previous. The ordinary and wide literal encodings are otherwise
297
+ *implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
298
+ Unicode scalar value corresponding to each character of the translation
299
+ character set is encoded as specified in the Unicode Standard for the
300
+ respective Unicode encoding form.
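The two fixed points of every literal encoding (null is `0`, decimal digits are consecutive) are what make the common `c - '0'` idiom portable. A minimal sketch; `digit_value` is a hypothetical helper name, not something defined by this document:

```cpp
#include <cassert>

// U+0000 (null) is encoded as a code unit with value 0.
static_assert('\0' == 0, "null is encoded as 0");

// Each decimal digit after '0' is one greater than the previous one,
// so '0' through '9' are contiguous in any literal encoding.
static_assert('9' - '0' == 9, "decimal digits are contiguous");

// Hypothetical helper: numeric value of a decimal digit character,
// or -1 if c is not a decimal digit.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

No comparable guarantee is stated for letters, so `c - 'a'` is not portable in the same way.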
301
 
302
  ## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
303
 
304
  ``` bnf
305
  preprocessing-token:
 
312
  character-literal
313
  user-defined-character-literal
314
  string-literal
315
  user-defined-string-literal
316
  preprocessing-op-or-punc
317
+ each non-whitespace character that cannot be one of the above
318
  ```
319
 
320
  Each preprocessing token that is converted to a token [[lex.token]]
321
  shall have the lexical form of a keyword, an identifier, a literal, or
322
  an operator or punctuator.
323
 
324
  A preprocessing token is the minimal lexical element of the language in
325
+ translation phases 3 through 6. In this document, glyphs are used to
326
+ identify elements of the basic character set [[lex.charset]]. The
327
+ categories of preprocessing token are: header names, placeholder tokens
328
+ produced by preprocessing `import` and `module` directives
329
+ (*import-keyword*, *module-keyword*, and *export-keyword*), identifiers,
330
+ preprocessing numbers, character literals (including user-defined
331
+ character literals), string literals (including user-defined string
332
+ literals), preprocessing operators and punctuators, and single
333
+ non-whitespace characters that do not lexically match the other
334
+ preprocessing token categories. If a U+0027 (apostrophe) or a
335
+ U+0022 (quotation mark) character matches the last category, the
336
+ behavior is undefined. If any character not in the basic character set
337
+ matches the last category, the program is ill-formed. Preprocessing
338
+ tokens can be separated by whitespace; this consists of comments
339
+ [[lex.comment]], or whitespace characters (U+0020 (space),
340
+ U+0009 (character tabulation), new-line, U+000b (line tabulation), and
341
+ U+000c (form feed)), or both. As described in [[cpp]], in certain
342
+ circumstances during translation phase 4, whitespace (or the absence
343
+ thereof) serves as more than preprocessing token separation. Whitespace
344
+ can appear within a preprocessing token only as part of a header name or
345
+ between the quotation characters in a character literal or string
346
+ literal.
347
 
348
  If the input stream has been parsed into preprocessing tokens up to a
349
  given character:
350
 
351
  - If the next character begins a sequence of characters that could be
352
  the prefix and initial double quote of a raw string literal, such as
353
  `R"`, the next preprocessing token shall be a raw string literal.
354
  Between the initial and final double quote characters of the raw
355
+ string, any transformations performed in phase 2 (line splicing) are
356
+ reverted; this reversion shall apply before any *d-char*, *r-char*, or
357
+ delimiting parenthesis is identified. The raw string literal is
358
+ defined as the shortest sequence of characters that matches the
359
+ raw-string pattern
360
  ``` bnf
361
  encoding-prefixₒₚₜ 'R' raw-string
362
  ```
363
  - Otherwise, if the next three characters are `<::` and the subsequent
364
  character is neither `:` nor `>`, the `<` is treated as a
 
389
  [*Note 1*: None has any observable spelling. — *end note*]
390
 
391
  [*Example 2*: The program fragment `0xe+foo` is parsed as a
392
  preprocessing number token (one that is not a valid *integer-literal* or
393
  *floating-point-literal* token), even though a parse as three
394
+ preprocessing tokens `0xe`, `+`, and `foo` can produce a valid
395
+ expression (for example, if `foo` is a macro defined as `1`). Similarly,
396
+ the program fragment `1E1` is parsed as a preprocessing number (one that
397
+ is a valid *floating-point-literal* token), whether or not `E` is a
398
+ macro name. — *end example*]
399
 
400
  [*Example 3*: The program fragment `x+++++y` is parsed as `x
401
  ++ ++ + y`, which, if `x` and `y` have integral types, violates a
402
  constraint on increment operators, even though the parse `x ++ + ++ y`
403
+ can yield a correct expression. — *end example*]
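Both examples can be reproduced with ordinary code: the longest-match rule is applied before any parsing, and whitespace is the only way to select a different tokenization. A sketch, with hypothetical function names:

```cpp
#include <cassert>

// Written as 0xe+foo this would be ONE preprocessing number and would not
// compile; the spaces force the three tokens 0xe, +, and foo.
inline int hex_plus(int foo) {
    return 0xe + foo;
}

// Written as x+++++y this would lex as x ++ ++ + y and be rejected;
// the spaces select the tokenization x ++ + ++ y.
inline int post_plus_pre(int x, int y) {
    return x++ + ++y;   // old value of x plus incremented value of y
}
```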
404
 
405
  ## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
406
 
407
  Alternative token representations are provided for some operators and
408
+ punctuators.[^3]
409
 
410
  In all respects of the language, each alternative token behaves the
411
+ same, respectively, as its primary token, except for its spelling.[^4]
412
+
413
  The set of alternative tokens is defined in [[lex.digraph]].
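As an illustration, the following sketch uses only alternative spellings for the logical operators, braces, and square brackets. The function names are hypothetical, and some compilers need a conformance switch before they accept the keyword-like alternatives:

```cpp
#include <cassert>

// `and` and `not` are alternative spellings of `&&` and `!`.
inline bool only_first(bool a, bool b) {
    return a and not b;
}

// `<%` and `%>` stand for braces; `<:` and `:>` stand for square brackets.
inline int second_element() <%
    int values<:3:> = <% 10, 20, 30 %>;
    return values<:1:>;
%>
```

The two functions behave exactly as if they had been written with `&&`, `!`, `{}`, and `[]`; only the spelling differs.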
414
 
415
  ## Tokens <a id="lex.token">[[lex.token]]</a>
416
 
417
  ``` bnf
 
420
  keyword
421
  literal
422
  operator-or-punctuator
423
  ```
424
 
425
+ There are five kinds of tokens: identifiers, keywords, literals,[^5]
427
  operators, and other separators. Blanks, horizontal and vertical tabs,
428
  newlines, formfeeds, and comments (collectively, “whitespace”), as
429
  described below, are ignored except as they serve to separate tokens.
430
 
431
  [*Note 1*: Some whitespace is required to separate otherwise adjacent
 
436
 
437
  The characters `/*` start a comment, which terminates with the
438
  characters `*/`. These comments do not nest. The characters `//` start a
439
  comment, which terminates immediately before the next new-line
440
  character. If there is a form-feed or a vertical-tab character in such a
441
+ comment, only whitespace characters shall appear between it and the
442
  new-line that terminates the comment; no diagnostic is required.
443
 
444
  [*Note 1*: The comment characters `//`, `/*`, and `*/` have no special
445
  meaning within a `//` comment and are treated just like other
446
  characters. Similarly, the comment characters `//` and `/*` have no
 
460
  h-char-sequence h-char
461
  ```
462
 
463
  ``` bnf
464
  h-char:
465
+ any member of the translation character set except new-line and U+003e (greater-than sign)
466
  ```
467
 
468
  ``` bnf
469
  q-char-sequence:
470
  q-char
471
  q-char-sequence q-char
472
  ```
473
 
474
  ``` bnf
475
  q-char:
476
+ any member of the translation character set except new-line and U+0022 (quotation mark)
477
  ```
478
 
479
  [*Note 1*: Header name preprocessing tokens only appear within a
480
  `#include` preprocessing directive, a `__has_include` preprocessing
481
  expression, or after certain occurrences of an `import` token (see 
 
487
 
488
  The appearance of either of the characters `'` or `\` or of either of
489
  the character sequences `/*` or `//` in a *q-char-sequence* or an
490
  *h-char-sequence* is conditionally-supported with
491
  *implementation-defined* semantics, as is the appearance of the
492
+ character `"` in an *h-char-sequence*.[^6]
493
 
494
  ## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
495
 
496
  ``` bnf
497
  pp-number:
498
  digit
499
  '.' digit
500
+ pp-number identifier-continue
 
501
  pp-number ''' digit
502
  pp-number ''' nondigit
503
  pp-number 'e' sign
504
  pp-number 'E' sign
505
  pp-number 'p' sign
 
517
 
518
  ## Identifiers <a id="lex.name">[[lex.name]]</a>
519
 
520
  ``` bnf
521
  identifier:
522
+ identifier-start
523
+ identifier identifier-continue
 
524
  ```
525
 
526
  ``` bnf
527
+ identifier-start:
528
  nondigit
529
+ an element of the translation character set with the Unicode property XID_Start
530
+ ```
531
+
532
+ ``` bnf
533
+ identifier-continue:
534
+ digit
535
+ nondigit
536
+ an element of the translation character set with the Unicode property XID_Continue
537
  ```
538
 
539
  ``` bnf
540
  nondigit: one of
541
  'a b c d e f g h i j k l m'
 
547
  ``` bnf
548
  digit: one of
549
  '0 1 2 3 4 5 6 7 8 9'
550
  ```
551
 
552
+ [*Note 1*:
553
 
554
+ The character properties XID_Start and XID_Continue are Derived Core
555
+ Properties as described by UAX \#44 of the Unicode Standard.[^7]
556
 
557
+ — *end note*]
558
 
559
+ The program is ill-formed if an *identifier* does not conform to
560
+ Normalization Form C as specified in the Unicode Standard.
561
 
562
+ [*Note 2*: Identifiers are case-sensitive. — *end note*]
563
 
564
+ [*Note 3*: In translation phase 4, *identifier* also includes those
565
+ *preprocessing-token*s [[lex.pptoken]] differentiated as keywords
566
+ [[lex.key]] in the later translation phase 7
567
+ [[lex.token]]. — *end note*]
568
 
569
  The identifiers in [[lex.name.special]] have a special meaning when
570
  appearing in a certain context. When referred to in the grammar, these
571
  identifiers are used explicitly rather than using the *identifier*
572
  grammar production. Unless otherwise specified, any ambiguity as to
573
  whether a given *identifier* has a special meaning is resolved to
574
  interpret the token as a regular *identifier*.
575
 
576
+ In addition, some identifiers appearing as a *token* or
577
+ *preprocessing-token* are reserved for use by C++ implementations and
578
+ shall not be used otherwise; no diagnostic is required.
579
 
580
  - Each identifier that contains a double underscore `__` or begins with
581
  an underscore followed by an uppercase letter is reserved to the
582
  implementation for any use.
583
  - Each identifier that begins with an underscore is reserved to the
 
647
 
648
  ## Literals <a id="lex.literal">[[lex.literal]]</a>
649
 
650
  ### Kinds of literals <a id="lex.literal.kinds">[[lex.literal.kinds]]</a>
651
 
652
+ There are several kinds of literals.[^8]
653
 
654
  ``` bnf
655
  literal:
656
  integer-literal
657
  character-literal
 
660
  boolean-literal
661
  pointer-literal
662
  user-defined-literal
663
  ```
664
 
665
+ [*Note 1*: When appearing as an *expression*, a literal has a type and
666
+ a value category [[expr.prim.literal]]. — *end note*]
667
+
668
  ### Integer literals <a id="lex.icon">[[lex.icon]]</a>
669
 
670
  ``` bnf
671
  integer-literal:
672
  binary-literal integer-suffixₒₚₜ
 
734
 
735
  ``` bnf
736
  integer-suffix:
737
  unsigned-suffix long-suffixₒₚₜ
738
  unsigned-suffix long-long-suffixₒₚₜ
739
+ unsigned-suffix size-suffixₒₚₜ
740
  long-suffix unsigned-suffixₒₚₜ
741
  long-long-suffix unsigned-suffixₒₚₜ
742
+ size-suffix unsigned-suffixₒₚₜ
743
  ```
744
 
745
  ``` bnf
746
  unsigned-suffix: one of
747
  'u U'
 
755
  ``` bnf
756
  long-long-suffix: one of
757
  'll LL'
758
  ```
759
 
760
+ ``` bnf
761
+ size-suffix: one of
762
+ 'z Z'
763
+ ```
764
+
765
  In an *integer-literal*, the sequence of *binary-digit*s,
766
  *octal-digit*s, *digit*s, or *hexadecimal-digit*s is interpreted as a
767
  base N integer as shown in table [[lex.icon.base]]; the lexically first
768
  digit of the sequence of digits is the most significant.
769
 
 
788
  `0x10'0000`, and `0'004'000'000` all have the same
789
  value. — *end example*]
790
 
791
  The type of an *integer-literal* is the first type in the list in
792
  [[lex.icon.type]] corresponding to its optional *integer-suffix* in
793
+ which its value can be represented.
794
 
795
  **Table: Types of *integer-literal*s** <a id="lex.icon.type">[lex.icon.type]</a>
796
 
797
  | *integer-suffix* | *decimal-literal* | *integer-literal* other than *decimal-literal* |
798
+ | ---------------- | ----------------------------------------- | ---------------------------------------------- |
799
  | none | `int` | `int` |
800
  | | `long int` | `unsigned int` |
801
  | | `long long int` | `long int` |
802
  | | | `unsigned long int` |
803
  | | | `long long int` |
 
813
  | and `l` or `L` | `unsigned long long int` | `unsigned long long int` |
814
  | `ll` or `LL` | `long long int` | `long long int` |
815
  | | | `unsigned long long int` |
816
  | Both `u` or `U` | `unsigned long long int` | `unsigned long long int` |
817
  | and `ll` or `LL` | | |
818
+ | `z` or `Z` | the signed integer type corresponding to `std::size_t` [[support.types.layout]] | the signed integer type corresponding to `std::size_t` |
820
+ | | | `std::size_t` |
821
+ | Both `u` or `U` | `std::size_t` | `std::size_t` |
822
+ | and `z` or `Z` | | |
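The table can be checked mechanically with `decltype`. The sketch below assumes nothing about integer widths; the `z` checks are guarded by the feature-test macro `__cpp_size_t_suffix` so that the block also compiles where that suffix is not available. `is_unsigned_literal` is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

static_assert(std::is_same<decltype(1), int>::value, "no suffix: int is first in the list");
static_assert(std::is_same<decltype(1u), unsigned int>::value, "u");
static_assert(std::is_same<decltype(1L), long int>::value, "L");
static_assert(std::is_same<decltype(1LL), long long int>::value, "LL");
static_assert(std::is_same<decltype(1uLL), unsigned long long int>::value, "u and LL");

#if defined(__cpp_size_t_suffix) && __cpp_size_t_suffix >= 202011L
static_assert(std::is_same<decltype(1uz), std::size_t>::value, "u and z");
static_assert(std::is_signed<decltype(1z)>::value, "z alone is signed");
static_assert(sizeof(1z) == sizeof(std::size_t), "same width as std::size_t");
#endif

// Hypothetical helper: reports whether the type chosen for a literal
// is unsigned.
template <class T>
constexpr bool is_unsigned_literal(T) {
    return std::is_unsigned<T>::value;
}
```

A hexadecimal literal with all 64 bits set shows the right-hand column at work: it skips the signed types that cannot hold it and lands on an unsigned one.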
823
 
824
 
825
  If an *integer-literal* cannot be represented by any type in its list
826
  and an extended integer type [[basic.fundamental]] can represent its
827
  value, it may have that extended integer type. If all of the types in
 
851
  c-char-sequence c-char
852
  ```
853
 
854
  ``` bnf
855
  c-char:
856
+ basic-c-char
857
  escape-sequence
858
  universal-character-name
859
  ```
860
 
861
+ ``` bnf
862
+ basic-c-char:
863
+ any member of the translation character set except the U+0027 (apostrophe),
864
+ U+005c (reverse solidus), or new-line character
865
+ ```
866
+
867
  ``` bnf
868
  escape-sequence:
869
  simple-escape-sequence
870
+ numeric-escape-sequence
871
+ conditional-escape-sequence
872
+ ```
873
+
874
+ ``` bnf
875
+ simple-escape-sequence:
876
+ '\' simple-escape-sequence-char
877
+ ```
878
+
879
+ ``` bnf
880
+ simple-escape-sequence-char: one of
881
+ '' " ? \ a b f n r t v'
882
+ ```
883
+
884
+ ``` bnf
885
+ numeric-escape-sequence:
886
  octal-escape-sequence
887
  hexadecimal-escape-sequence
888
  ```
889
 
890
  ``` bnf
891
+ simple-octal-digit-sequence:
892
+ octal-digit
893
+ simple-octal-digit-sequence octal-digit
894
  ```
895
 
896
  ``` bnf
897
  octal-escape-sequence:
898
  '\' octal-digit
899
  '\' octal-digit octal-digit
900
  '\' octal-digit octal-digit octal-digit
901
+ '\o{' simple-octal-digit-sequence '}'
902
  ```
903
 
904
  ``` bnf
905
  hexadecimal-escape-sequence:
906
+ '\x' simple-hexadecimal-digit-sequence
907
+ '\x{' simple-hexadecimal-digit-sequence '}'
908
  ```
909
 
910
+ ``` bnf
911
+ conditional-escape-sequence:
912
+ '\' conditional-escape-sequence-char
913
+ ```
914
 
915
+ ``` bnf
916
+ conditional-escape-sequence-char:
917
+ any member of the basic character set that is not an octal-digit, a simple-escape-sequence-char, or the characters 'N', 'o', 'u', 'U', or 'x'
918
+ ```
919
 
920
+ A *non-encodable character literal* is a *character-literal* whose
921
+ *c-char-sequence* consists of a single *c-char* that is not a
922
+ *numeric-escape-sequence* and that specifies a character that either
923
+ lacks representation in the literal’s associated character encoding or
924
+ that cannot be encoded as a single code unit. A *multicharacter literal*
925
+ is a *character-literal* whose *c-char-sequence* consists of more than
926
+ one *c-char*. The *encoding-prefix* of a non-encodable character literal
927
+ or a multicharacter literal shall be absent. Such *character-literal*s
928
+ are conditionally-supported.
929
+
930
+ The kind of a *character-literal*, its type, and its associated
931
+ character encoding [[lex.charset]] are determined by its
932
+ *encoding-prefix* and its *c-char-sequence* as defined by
933
+ [[lex.ccon.literal]]. The special cases for non-encodable character
934
+ literals and multicharacter literals take precedence over the base kind.
935
+
936
+ [*Note 1*: The associated character encoding for ordinary character
937
+ literals determines encodability, but does not determine the value of
938
+ non-encodable ordinary character literals or ordinary multicharacter
939
+ literals. The examples in [[lex.ccon.literal]] for non-encodable
940
+ ordinary character literals assume that the specified character lacks
941
+ representation in the ordinary literal encoding or that encoding the
942
+ character would require more than one code unit. — *end note*]
943
+
944
+ **Table: Character literals** <a id="lex.ccon.literal">[lex.ccon.literal]</a>
945
+
946
+ | Encoding prefix | Kind | Type | Associated character encoding | Example |
947
+ | --------------- | -------------------------- | ---------- | ----------------------------- | ------- |
948
+ | none | ordinary character literal | `char` | ordinary literal encoding | `'v'` |
949
+ | `L` | wide character literal | `wchar_t` | wide literal encoding | `L'w'` |
951
+ | `u8` | UTF-8 character literal | `char8_t` | UTF-8 | `u8'x'` |
952
+ | `u` | UTF-16 character literal | `char16_t` | UTF-16 | `u'y'` |
953
+ | `U` | UTF-32 character literal | `char32_t` | UTF-32 | `U'z'` |
954
+
955
+
956
+ In translation phase 4, the value of a *character-literal* is determined
957
+ using the range of representable values of the *character-literal*’s
958
+ type in translation phase 7. A non-encodable character literal or a
959
+ multicharacter literal has an *implementation-defined* value. The value
960
+ of any other kind of *character-literal* is determined as follows:
961
+
962
+ - A *character-literal* with a *c-char-sequence* consisting of a single
963
+ *basic-c-char*, *simple-escape-sequence*, or
964
+ *universal-character-name* is the code unit value of the specified
965
+ character as encoded in the literal’s associated character encoding.
966
+ \[*Note 2*: If the specified character lacks representation in the
967
+ literal’s associated character encoding or if it cannot be encoded as
968
+ a single code unit, then the literal is a non-encodable character
969
+ literal. — *end note*]
970
+ - A *character-literal* with a *c-char-sequence* consisting of a single
971
+ *numeric-escape-sequence* has a value as follows:
972
+ - Let v be the integer value represented by the octal number
973
+ comprising the sequence of *octal-digit*s in an
974
+ *octal-escape-sequence* or by the hexadecimal number comprising the
975
+ sequence of *hexadecimal-digit*s in a *hexadecimal-escape-sequence*.
976
+ - If v does not exceed the range of representable values of the
977
+ *character-literal*’s type, then the value is v.
978
+ - Otherwise, if the *character-literal*’s *encoding-prefix* is absent
979
+ or `L`, and v does not exceed the range of representable values of
980
+ the corresponding unsigned type for the underlying type of the
981
+ *character-literal*’s type, then the value is the unique value of
982
+ the *character-literal*’s type `T` that is congruent to v modulo 2ᴺ,
983
+ where N is the width of `T`.
984
+ - Otherwise, the *character-literal* is ill-formed.
985
+ - A *character-literal* with a *c-char-sequence* consisting of a single
986
+ *conditional-escape-sequence* is conditionally-supported and has an
987
+ *implementation-defined* value.
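The rules for a single *numeric-escape-sequence* can be illustrated without assuming any particular ordinary literal encoding or signedness of `char`. A sketch; `code_unit_of_xff` is a hypothetical name:

```cpp
#include <cassert>

// Octal 101 and hexadecimal 41 both denote v = 65, so the two literals
// have the same value whatever the ordinary literal encoding is.
static_assert('\101' == '\x41', "same v, same value");

// v = 0xffff is representable in char16_t, so the value is v.
static_assert(u'\xffff' == 0xffff, "value is v");

// For an ordinary literal, v = 0xff can exceed the range of a signed char;
// the value is then the one congruent to v modulo 2^N. Converting to
// unsigned char recovers v in either case.
constexpr unsigned int code_unit_of_xff() {
    return static_cast<unsigned char>('\xff');
}
```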
988
+
989
+ The character specified by a *simple-escape-sequence* is specified in
990
+ [[lex.ccon.esc]].
991
+
992
+ [*Note 3*: Using an escape sequence for a question mark is supported
993
+ for compatibility with ISO C++14 and ISO C. — *end note*]
994
+
995
+ **Table: Simple escape sequences** <a id="lex.ccon.esc">[lex.ccon.esc]</a>
996
+
997
+ | character | | *simple-escape-sequence* |
998
+ | --------- | -------------------- | ------------------------ |
999
+ | `U+000a` | line feed | `\n` |
1000
+ | `U+0009` | character tabulation | `\t` |
1001
+ | `U+000b` | line tabulation | `\v` |
1002
+ | `U+0008` | backspace | `\b` |
1003
+ | `U+000d` | carriage return | `\r` |
1004
+ | `U+000c` | form feed | `\f` |
1005
+ | `U+0007` | alert | `\a` |
1006
+ | `U+005c` | reverse solidus | `\\` |
1007
+ | `U+003f` | question mark | `\?` |
1008
+ | `U+0027` | apostrophe | `\'` |
1009
+ | `U+0022` | quotation mark | `\"` |
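Each *simple-escape-sequence* in the table denotes exactly one character. The sketch below counts characters at compile time; `length_of` is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in an ordinary string
// literal, excluding the terminating null.
template <std::size_t N>
constexpr std::size_t length_of(const char (&)[N]) {
    return N - 1;
}

static_assert(length_of("\n") == 1, "one line feed, not two characters");
static_assert(length_of("\\\"") == 2, "a reverse solidus, then a quotation mark");
static_assert('\?' == '?', "the escaped and plain question mark are the same");
static_assert('\"' == '"', "a quotation mark needs no escape in a character literal");
```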
1010
 
1011
 
1012
  ### Floating-point literals <a id="lex.fcon">[[lex.fcon]]</a>
1013
 
1014
  ``` bnf
1015
  floating-point-literal:
 
1064
  digit-sequence '''ₒₚₜ digit
1065
  ```
1066
 
1067
  ``` bnf
1068
  floating-point-suffix: one of
1069
+ 'f l f16 f32 f64 f128 bf16 F L F16 F32 F64 F128 BF16'
1070
  ```
1071
 
1072
+ The type of a *floating-point-literal*
1073
+ [[basic.fundamental]], [[basic.extended.fp]] is determined by its
1074
  *floating-point-suffix* as specified in [[lex.fcon.type]].
1075
 
1076
+ [*Note 1*: The floating-point suffixes `f16`, `f32`, `f64`, `f128`,
1077
+ `bf16`, `F16`, `F32`, `F64`, `F128`, and `BF16` are
1078
+ conditionally-supported. See [[basic.extended.fp]]. — *end note*]
1079
+
1080
  **Table: Types of *floating-point-literal*s** <a id="lex.fcon.type">[lex.fcon.type]</a>
1081
 
1082
  | *floating-point-suffix* | type |
1083
+ | ----------------------- | ----------------- |
1084
  | none | `double` |
1085
  | `f` or `F` | `float` |
1086
  | `l` or `L` | `long double` |
1087
+ | `f16` or `F16` | `std::float16_t` |
1088
+ | `f32` or `F32` | `std::float32_t` |
1089
+ | `f64` or `F64` | `std::float64_t` |
1090
+ | `f128` or `F128` | `std::float128_t` |
1091
+ | `bf16` or `BF16` | `std::bfloat16_t` |
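The first three rows of the table can be verified with `decltype`. The extended suffixes are left out of this sketch because they are conditionally-supported; `is_float_literal` is a hypothetical helper:

```cpp
#include <cassert>
#include <type_traits>

static_assert(std::is_same<decltype(1.0), double>::value, "no suffix");
static_assert(std::is_same<decltype(1.0f), float>::value, "f");
static_assert(std::is_same<decltype(1.0L), long double>::value, "L");
static_assert(std::is_same<decltype(1e3), double>::value, "an exponent does not change the type");

// Hypothetical helper: true when the literal passed in has type float.
template <class T>
constexpr bool is_float_literal(T) {
    return std::is_same<T, float>::value;
}
```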
1092
 
1093
 
1094
  The *significand* of a *floating-point-literal* is the
1095
  *fractional-constant* or *digit-sequence* of a
1096
  *decimal-floating-point-literal* or the
 
1099
  of *digit*s or *hexadecimal-digit*s and optional period are interpreted
1100
  as a base N real number s, where N is 10 for a
1101
  *decimal-floating-point-literal* and 16 for a
1102
  *hexadecimal-floating-point-literal*.
1103
 
1104
+ [*Note 2*: Any optional separating single quotes are ignored when
1105
  determining the value. — *end note*]
1106
 
1107
  If an *exponent-part* or *binary-exponent-part* is present, the exponent
1108
  e of the *floating-point-literal* is the result of interpreting the
1109
  sequence of an optional *sign* and the *digit*s as a base 10 integer.
 
1135
  s-char-sequence s-char
1136
  ```
1137
 
1138
  ``` bnf
1139
  s-char:
1140
+ basic-s-char
1141
  escape-sequence
1142
  universal-character-name
1143
  ```
1144
 
1145
+ ``` bnf
1146
+ basic-s-char:
1147
+ any member of the translation character set except the U+0022 (quotation mark),
1148
+ U+005c (reverse solidus), or new-line character
1149
+ ```
1150
+
1151
  ``` bnf
1152
  raw-string:
1153
  '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
1154
  ```
1155
 
 
1159
  r-char-sequence r-char
1160
  ```
1161
 
1162
  ``` bnf
1163
  r-char:
1164
+ any member of the translation character set, except a U+0029 (right parenthesis) followed by
1165
+ the initial *d-char-sequence* (which may be empty) followed by a U+0022 (quotation mark)
1166
  ```
1167
 
1168
  ``` bnf
1169
  d-char-sequence:
1170
  d-char
1171
  d-char-sequence d-char
1172
  ```
1173
 
1174
  ``` bnf
1175
  d-char:
1176
+ any member of the basic character set except:
1177
+ U+0020 (space), U+0028 (left parenthesis), U+0029 (right parenthesis), U+005c (reverse solidus),
1178
+ U+0009 (character tabulation), U+000b (line tabulation), U+000c (form feed), and new-line
1179
  ```
1180
 
1181
+ The kind of a *string-literal*, its type, and its associated character
1182
+ encoding [[lex.charset]] are determined by its encoding prefix and
1183
+ sequence of *s-char*s or *r-char*s as defined by [[lex.string.literal]]
1184
+ where n is the number of encoded code units as described below.
1185
+
1186
+ **Table: String literals** <a id="lex.string.literal">[lex.string.literal]</a>
1187
+
1188
+ | Encoding prefix | Kind | Type | Associated character encoding | Examples |
1189
+ | ---- | ----------------------- | ----------------------------- | ------------------------- | ---------------------------------------------- |
1190
+ | none | ordinary string literal | array of $n$ `const char` | ordinary literal encoding | `"ordinary string"` `R"(ordinary raw string)"` |
1191
+ | `L` | wide string literal | array of $n$ `const wchar_t` | wide literal encoding | `L"wide string"` `LR"w(wide raw string)w"` |
1192
+ | `u8` | UTF-8 string literal | array of $n$ `const char8_t` | UTF-8 | `u8"UTF-8 string"` `u8R"x(UTF-8 raw string)x"` |
1193
+ | `u` | UTF-16 string literal | array of $n$ `const char16_t` | UTF-16 | `u"UTF-16 string"` `uR"y(UTF-16 raw string)y"` |
1194
+ | `U` | UTF-32 string literal | array of $n$ `const char32_t` | UTF-32 | `U"UTF-32 string"` `UR"z(UTF-32 raw string)z"` |
1195
+
1196
+
1197
  A *string-literal* that has an `R` in the prefix is a *raw string
1198
  literal*. The *d-char-sequence* serves as a delimiter. The terminating
1199
  *d-char-sequence* of a *raw-string* is the same sequence of characters
1200
  as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
1201
  at most 16 characters.
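A short sketch of how a raw string literal differs from an ordinary one, and of what the delimiter is for. `length_of` is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of characters before the terminating null.
template <std::size_t N>
constexpr std::size_t length_of(const char (&)[N]) {
    return N - 1;
}

// No escape processing happens in a raw string.
static_assert(length_of(R"(a\nb)") == 4, "a, reverse solidus, n, b");
static_assert(length_of("a\nb") == 3, "a, new-line, b");

// With the delimiter x, a right parenthesis followed by a quotation mark
// is ordinary content; only the sequence )x" ends the literal.
static_assert(length_of(R"x()")x") == 2, "right parenthesis, quotation mark");
```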
 
1238
 
1239
  is equivalent to `"x = \"\\\"y\\\"\""`.
1240
 
1241
  — *end example*]
1242
 
1243
  Ordinary string literals and UTF-8 string literals are also referred to
1244
  as narrow string literals.
1245
 
1246
+ The common *encoding-prefix* for a sequence of adjacent
1247
+ *string-literal*s is determined pairwise as follows: If two
1248
+ *string-literal*s have the same *encoding-prefix*, the common
1249
+ *encoding-prefix* is that *encoding-prefix*. If one *string-literal* has
1250
+ no *encoding-prefix*, the common *encoding-prefix* is that of the other
1251
+ *string-literal*. Any other combinations are ill-formed.
1252
 
1253
+ [*Note 3*: A *string-literal*’s rawness has no effect on the
1254
+ determination of the common *encoding-prefix*. — *end note*]
1255
 
1256
  In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
1257
+ concatenated. The lexical structure and grouping of the contents of the
1258
+ individual *string-literal*s is retained.
1259
+
1260
+ [*Example 2*:
1261
+
1262
+ ``` cpp
1263
+ "\xA" "B"
1264
+ ```
1265
+
1266
+ represents the code unit `'\xA'` and the character `'B'` after
1267
+ concatenation (and not the single code unit `'\xAB'`). Similarly,
1268
+
1269
+ ``` cpp
1270
+ R"(\u00)" "41"
1271
+ ```
1272
+
1273
+ represents six characters, starting with a backslash and ending with the
1274
+ digit `1` (and not the single character `'A'` specified by a
1275
+ *universal-character-name*).
1276
 
1277
  [[lex.string.concat]] has some examples of valid concatenations.
1278
 
1279
+ — *end example*]
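The two cases in the example, and the rule for the common *encoding-prefix*, can be checked at compile time. A sketch; `units` is a hypothetical helper that counts array elements, including the terminating null:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of elements in the string literal's array.
template <class C, std::size_t N>
constexpr std::size_t units(const C (&)[N]) {
    return N;
}

// The escape sequence ends with its own literal: \xA, then B, then null.
static_assert(units("\xA" "B") == 3, "two code units and the terminating null");

// Raw content is never reinterpreted after concatenation.
static_assert(units(R"(\u00)" "41") == 7, "six characters and the terminating null");

// One operand has no prefix, so the common prefix is that of the other.
static_assert(std::is_same<decltype(u"a" "b"), const char16_t (&)[3]>::value,
              "the result is a UTF-16 string literal");
```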
1280
+
1281
  **Table: String literal concatenations** <a id="lex.string.concat">[lex.string.concat]</a>
1282
 
1283
  | Source | | Means | Source | | Means | Source | | Means |
1284
  | ------ | ------ | ------- | ------ | ------ | ------- | ------ | ------ | ------- |
1286
  | `u"a"` | `u"b"` | `u"ab"` | `U"a"` | `U"b"` | `U"ab"` | `L"a"` | `L"b"` | `L"ab"` |
1287
  | `u"a"` | `"b"` | `u"ab"` | `U"a"` | `"b"` | `U"ab"` | `L"a"` | `"b"` | `L"ab"` |
1288
  | `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
1289
 
1290
 
1291
  Evaluating a *string-literal* results in a string literal object with
1292
+ static storage duration [[basic.stc]]. Whether all *string-literal*s are
1293
+ distinct (that is, are stored in nonoverlapping objects) and whether
1294
+ successive evaluations of a *string-literal* yield the same or a
1295
+ different object is unspecified.
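Because it is unspecified whether two *string-literal*s share storage, code that compares their addresses can give different answers on different implementations. A sketch of the safe alternative, with hypothetical function names:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: compares contents, which is well defined, rather
// than addresses, whose equality for two literals is unspecified.
inline bool same_text(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// The string literal object has static storage duration, so the pointer
// stays valid after the function returns.
inline const char* greeting() {
    return "hello";
}
```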
1296
+
1297
+ [*Note 4*: The effect of attempting to modify a string literal object
1298
+ is undefined. — *end note*]
1299
+
1300
+ String literal objects are initialized with the sequence of code unit
1301
+ values corresponding to the *string-literal*’s sequence of *s-char*s
1302
+ (originally from non-raw string literals) and *r-char*s (originally from
1303
+ raw string literals), plus a terminating U+0000 (null) character, in
1304
+ order as follows:
1305
+
1306
+ - The sequence of characters denoted by each contiguous sequence of
1307
+ *basic-s-char*s, *r-char*s, *simple-escape-sequence*s [[lex.ccon]],
1308
+ and *universal-character-name*s [[lex.charset]] is encoded to a code
1309
+ unit sequence using the *string-literal*’s associated character
1310
+ encoding. If a character lacks representation in the associated
1311
+ character encoding, then the *string-literal* is
1312
+ conditionally-supported and an *implementation-defined* code unit
1313
+ sequence is encoded. \[*Note 5*: No character lacks representation in
1314
+ any Unicode encoding form. — *end note*] When encoding a stateful
1315
+ character encoding, implementations should encode the first such
1316
+ sequence beginning with the initial encoding state and encode
1317
+ subsequent sequences beginning with the final encoding state of the
1318
+ prior sequence. \[*Note 6*: The encoded code unit sequence can differ
1319
+ from the sequence of code units that would be obtained by encoding
1320
+ each character independently. — *end note*]
1321
+ - Each *numeric-escape-sequence* [[lex.ccon]] contributes a single code
1322
+ unit with a value as follows:
1323
+ - Let v be the integer value represented by the octal number
1324
+ comprising the sequence of *octal-digit*s in an
1325
+ *octal-escape-sequence* or by the hexadecimal number comprising the
1326
+ sequence of *hexadecimal-digit*s in a *hexadecimal-escape-sequence*.
1327
+ - If v does not exceed the range of representable values of the
1328
+ *string-literal*’s array element type, then the value is v.
1329
+ - Otherwise, if the *string-literal*’s *encoding-prefix* is absent or
1330
+ `L`, and v does not exceed the range of representable values of the
1331
+ corresponding unsigned type for the underlying type of the
1332
+ *string-literal*’s array element type, then the value is the unique
1333
+ value of the *string-literal*’s array element type `T` that is
1334
+ congruent to v modulo 2ᴺ, where N is the width of `T`.
1335
+ - Otherwise, the *string-literal* is ill-formed.
1336
+
1337
+ When encoding a stateful character encoding, these sequences should
1338
+ have no effect on encoding state.
1339
+ - Each *conditional-escape-sequence* [[lex.ccon]] contributes an
1340
+ *implementation-defined* code unit sequence. When encoding a stateful
1341
+ character encoding, it is *implementation-defined* what effect these
1342
+ sequences have on encoding state.
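For the Unicode encodings the number of code units per character is fixed by the encoding form, so the initialization rules can be observed directly. A sketch; `code_units` is a hypothetical helper, and the characters are written as *universal-character-name*s so that the result does not depend on the source file encoding:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units, excluding the terminating null.
template <class C, std::size_t N>
constexpr std::size_t code_units(const C (&)[N]) {
    return N - 1;
}

// U+00E9 needs two code units in UTF-8 and one in UTF-16.
static_assert(code_units(u8"\u00e9") == 2, "UTF-8");
static_assert(code_units(u"\u00e9") == 1, "UTF-16");

// U+1F600 needs four code units in UTF-8, two in UTF-16, one in UTF-32.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16 surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32");

// Each numeric-escape-sequence contributes exactly one code unit.
static_assert(code_units(u8"\xC3\xA9") == 2, "two escapes, two code units");
```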
1343
 
1344
  ### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
1345
 
1346
  ``` bnf
1347
  boolean-literal:
1348
  'false'
1349
  'true'
1350
  ```
1351
 
1352
  The Boolean literals are the keywords `false` and `true`. Such literals
1353
+ have type `bool`.
1354
 
1355
  ### Pointer literals <a id="lex.nullptr">[[lex.nullptr]]</a>
1356
 
1357
  ``` bnf
1358
  pointer-literal:
1359
  'nullptr'
1360
  ```
1361
 
1362
+ The pointer literal is the keyword `nullptr`. It has type
1363
  `std::nullptr_t`.
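One practical consequence of `nullptr` having its own type is that it never selects an integer overload. A sketch, with hypothetical function names:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

static_assert(std::is_same<decltype(nullptr), std::nullptr_t>::value,
              "the pointer literal has type std::nullptr_t");
static_assert(!std::is_pointer<std::nullptr_t>::value,
              "std::nullptr_t is not itself a pointer type");

inline int which(int) { return 1; }
inline int which(const char*) { return 2; }

// nullptr converts to a null pointer value but not to int, so the
// pointer overload is the only viable candidate.
inline int pick_overload() {
    return which(nullptr);
}
```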
1364
 
1365
  [*Note 1*: `std::nullptr_t` is a distinct type that is neither a
1366
  pointer type nor a pointer-to-member type; rather, a prvalue of this
1367
  type is a null pointer constant and can be converted to a null pointer
 
1425
  that could match that non-terminal.
1426
 
1427
  A *user-defined-literal* is treated as a call to a literal operator or
1428
  literal operator template [[over.literal]]. To determine the form of
1429
  this call for a given *user-defined-literal* *L* with *ud-suffix* *X*,
1430
+ first let *S* be the set of declarations found by unqualified lookup for
1431
+ the *literal-operator-id* whose literal suffix identifier is *X*
1432
+ [[basic.lookup.unqual]]. *S* shall not be empty.
 
1433
 
1434
  If *L* is a *user-defined-integer-literal*, let *n* be the literal
1435
  without its *ud-suffix*. If *S* contains a literal operator with
1436
  parameter type `unsigned long long`, the literal *L* is treated as a
1437
  call of the form
 
1443
  Otherwise, *S* shall contain a raw literal operator or a numeric literal
1444
  operator template [[over.literal]] but not both. If *S* contains a raw
1445
  literal operator, the literal *L* is treated as a call of the form
1446
 
1447
  ``` cpp
1448
+ operator ""X("n")
1449
  ```
1450
 
1451
  Otherwise (*S* contains a numeric literal operator template), *L* is
1452
  treated as a call of the form
1453
 
 
1456
  ```
1457
 
1458
  where *n* is the source character sequence c₁c₂...cₖ.
1459
 
1460
  [*Note 1*: The sequence c₁c₂...cₖ can only contain characters from the
1461
+ basic character set. — *end note*]
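
The three call forms for a *user-defined-integer-literal* can be sketched as follows. The suffixes `_cooked`, `_raw`, and `_chars` and the helper `spelling_length` are hypothetical names chosen for this example. Each suffix is declared in only one form, so the lookup set *S* is unambiguous.

```cpp
#include <cstddef>

// Literal operator taking unsigned long long: receives the value of n.
constexpr unsigned long long operator""_cooked(unsigned long long n) { return n; }

// Helper (hypothetical): length of a null-terminated spelling.
constexpr std::size_t spelling_length(const char* s) {
    return *s ? 1 + spelling_length(s + 1) : 0;
}

// Raw literal operator: receives the spelling of n as a string, "n".
constexpr std::size_t operator""_raw(const char* s) { return spelling_length(s); }

// Numeric literal operator template: receives the characters c1, c2, ..., ck
// of the spelling as template arguments.
template <char... Cs>
constexpr std::size_t operator""_chars() { return sizeof...(Cs); }

static_assert(0x10_cooked == 16, "value of the literal");
static_assert(0x10_raw == 4, "spelling is the four characters 0x10");
static_assert(0x10_chars == 4, "four characters passed as template arguments");
```

The raw and template forms see the source spelling rather than the value, which is why `0x10` yields 4 rather than 16 for them.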
1462
 
1463
  If *L* is a *user-defined-floating-point-literal*, let *f* be the
1464
  literal without its *ud-suffix*. If *S* contains a literal operator with
1465
  parameter type `long double`, the literal *L* is treated as a call of
1466
  the form
 
1472
  Otherwise, *S* shall contain a raw literal operator or a numeric literal
1473
  operator template [[over.literal]] but not both. If *S* contains a raw
1474
  literal operator, the *literal* *L* is treated as a call of the form
1475
 
1476
  ``` cpp
1477
+ operator ""X("f")
1478
  ```
1479
 
1480
  Otherwise (*S* contains a numeric literal operator template), *L* is
1481
  treated as a call of the form
1482
 
 
1485
  ```
1486
 
1487
  where *f* is the source character sequence c₁c₂...cₖ.
1488
 
1489
  [*Note 2*: The sequence c₁c₂...cₖ can only contain characters from the
1490
+ basic character set. — *end note*]
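
A comparable sketch for a *user-defined-floating-point-literal*, again with hypothetical suffixes (`_ld`, `_dotted`, `_count`) and a hypothetical helper `has_dot`. The raw and template forms see the spelling, so they can tell `1.5` from `1e3` by their characters.

```cpp
// Literal operator taking long double: receives the value of f.
constexpr long double operator""_ld(long double f) { return f; }

// Helper (hypothetical): does the spelling contain a '.'?
constexpr bool has_dot(const char* s) {
    return *s != '\0' && (*s == '.' || has_dot(s + 1));
}

// Raw literal operator: receives the spelling of f as a string, "f".
constexpr bool operator""_dotted(const char* s) { return has_dot(s); }

// Numeric literal operator template: one template argument per character.
template <char... Cs>
constexpr unsigned operator""_count() { return sizeof...(Cs); }

static_assert(1.5_ld == 1.5L, "value of the literal");
static_assert(1.5_dotted, "spelling 1.5 contains a dot");
static_assert(!1e3_dotted, "spelling 1e3 does not");
static_assert(1e3_count == 3, "three characters: 1, e, 3");

// Note: 1_ld would be ill-formed here, because an integer literal needs an
// unsigned long long, raw, or template form, and _ld declares none of them.
```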
1491
 
1492
  If *L* is a *user-defined-string-literal*, let *str* be the literal
1493
  without its *ud-suffix* and let *len* be the number of code units in
1494
  *str* (i.e., its length excluding the terminating null character). If
1495
  *S* contains a literal operator template with a non-type template
 
1543
 
1544
  [*Example 3*:
1545
 
1546
  ``` cpp
1547
  int main() {
1548
+ L"A" "B" "C"_x; // OK, same as L"ABC"_x
1549
  "P"_x "Q" "R"_y; // error: two different ud-suffixes
1550
  }
1551
  ```
1552
 
1553
  — *end example*]
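
A compilable variant of the concatenation rule is sketched below. The suffix `_len` is a hypothetical name; its literal operators simply return *len*, the number of code units excluding the terminating null character.

```cpp
#include <cstddef>

// String literal operators for ordinary and wide string literals.
constexpr std::size_t operator""_len(const char*, std::size_t len) { return len; }
constexpr std::size_t operator""_len(const wchar_t*, std::size_t len) { return len; }

// The suffix applies to the result of the concatenation.
static_assert("A" "B" "C"_len == 3, "same as \"ABC\"_len");
static_assert(L"A" "B" "C"_len == 3, "same as L\"ABC\"_len");

// Repeating the same ud-suffix on several pieces is permitted.
static_assert("ab"_len "cd"_len == 4, "same as \"abcd\"_len");
```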
1554
 
1555
  <!-- Link reference definitions -->
1556
+ [basic.extended.fp]: basic.md#basic.extended.fp
1557
  [basic.fundamental]: basic.md#basic.fundamental
1558
  [basic.link]: basic.md#basic.link
1559
  [basic.lookup.unqual]: basic.md#basic.lookup.unqual
1560
  [basic.stc]: basic.md#basic.stc
1561
+ [character.seq]: library.md#character.seq
1562
  [conv.mem]: expr.md#conv.mem
1563
  [conv.ptr]: expr.md#conv.ptr
1564
  [cpp]: cpp.md#cpp
 
1565
  [cpp.cond]: cpp.md#cpp.cond
1566
  [cpp.import]: cpp.md#cpp.import
1567
  [cpp.include]: cpp.md#cpp.include
1568
  [cpp.module]: cpp.md#cpp.module
1569
  [cpp.stringize]: cpp.md#cpp.stringize
1570
  [dcl.attr.grammar]: dcl.md#dcl.attr.grammar
1571
+ [expr.prim.literal]: expr.md#expr.prim.literal
1572
  [headers]: library.md#headers
1573
  [lex]: #lex
1574
  [lex.bool]: #lex.bool
1575
  [lex.ccon]: #lex.ccon
1576
  [lex.ccon.esc]: #lex.ccon.esc
1577
+ [lex.ccon.literal]: #lex.ccon.literal
1578
  [lex.charset]: #lex.charset
1579
+ [lex.charset.basic]: #lex.charset.basic
1580
+ [lex.charset.literal]: #lex.charset.literal
1581
  [lex.comment]: #lex.comment
1582
  [lex.digraph]: #lex.digraph
1583
  [lex.ext]: #lex.ext
1584
  [lex.fcon]: #lex.fcon
1585
  [lex.fcon.type]: #lex.fcon.type
 
1590
  [lex.key]: #lex.key
1591
  [lex.key.digraph]: #lex.key.digraph
1592
  [lex.literal]: #lex.literal
1593
  [lex.literal.kinds]: #lex.literal.kinds
1594
  [lex.name]: #lex.name
 
 
1595
  [lex.name.special]: #lex.name.special
1596
  [lex.nullptr]: #lex.nullptr
1597
  [lex.operators]: #lex.operators
1598
  [lex.phases]: #lex.phases
1599
  [lex.ppnumber]: #lex.ppnumber
1600
  [lex.pptoken]: #lex.pptoken
1601
  [lex.separate]: #lex.separate
1602
  [lex.string]: #lex.string
1603
  [lex.string.concat]: #lex.string.concat
1604
+ [lex.string.literal]: #lex.string.literal
1605
  [lex.token]: #lex.token
1606
  [module.import]: module.md#module.import
1607
  [module.unit]: module.md#module.unit
1608
  [over.literal]: over.md#over.literal
1609
+ [support.types.layout]: support.md#support.types.layout
1610
  [temp.explicit]: temp.md#temp.explicit
1611
  [temp.names]: temp.md#temp.names
1612
 
1613
+ [^1]: Implementations behave as if these separate phases occur, although
1614
+ in practice different phases can be folded together.
1615
 
1616
  [^2]: A partial preprocessing token would arise from a source file
1617
  ending in the first portion of a multi-character token that requires
1618
  a terminating sequence of characters, such as a *header-name* that
1619
  is missing the closing `"` or `>`. A partial comment would arise
1620
  from a source file ending with an unclosed `/*` comment.
1621
 
1622
+ [^3]: These include “digraphs” and additional reserved words. The term
1623
  “digraph” (token consisting of two characters) is not perfectly
1624
  descriptive, since one of the alternative *preprocessing-token*s is
1625
  `%:%:` and of course several primary tokens contain two characters.
1626
  Nonetheless, those alternative tokens that aren’t lexical keywords
1627
  are colloquially known as “digraphs”.
1628
 
1629
+ [^4]: Thus the “stringized” values [[cpp.stringize]] of `[` and `<:`
1630
  will be different, maintaining the source spelling, but the tokens
1631
  can otherwise be freely interchanged.
1632
 
1633
+ [^5]: Literals include strings and character and numeric literals.
1634
 
1635
+ [^6]: Thus, a sequence of characters that resembles an escape sequence
1636
+ can result in an error, be interpreted as the character
1637
  corresponding to the escape sequence, or have a completely different
1638
  meaning, depending on the implementation.
1639
 
1640
+ [^7]: On systems in which linkers cannot accept extended characters, an
1641
+ encoding of the *universal-character-name* can be used in forming
1642
  valid external identifiers. For example, some otherwise unused
1643
+ character or sequence of characters can be used to encode the `\u` in
1644
+ a *universal-character-name*. Extended characters can produce a
1645
  long external identifier, but C++ does not place a translation limit
1646
+ on significant characters for external identifiers.
 
 
1647
 
1648
+ [^8]: The term “literal” generally designates, in this document, those
1649
  tokens that are called “constants” in ISO C.