From Jason Turner

[lex]

Files changed (1)
  1. tmp/tmpxaeb_ar5/{from.md → to.md} +454 -369
tmp/tmpxaeb_ar5/{from.md → to.md} RENAMED
@@ -4,24 +4,18 @@
4
 
5
  The text of the program is kept in units called *source files* in this
6
  document. A source file together with all the headers [[headers]] and
7
  source files included [[cpp.include]] via the preprocessing directive
8
  `#include`, less any source lines skipped by any of the conditional
9
- inclusion [[cpp.cond]] preprocessing directives, is called a
10
- *preprocessing translation unit*.
 
 
11
 
12
- [*Note 1*: A C++ program need not all be translated at the same
13
- time. *end note*]
14
-
15
- [*Note 2*: Previously translated translation units and instantiation
16
- units can be preserved individually or in libraries. The separate
17
- translation units of a program communicate [[basic.link]] by (for
18
- example) calls to functions whose identifiers have external or module
19
- linkage, manipulation of objects whose identifiers have external or
20
- module linkage, or manipulation of data files. Translation units can be
21
- separately translated and then later linked to produce an executable
22
- program [[basic.link]]. — *end note*]
23
 
24
  ## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
25
 
26
  The precedence among the syntax rules of translation is specified by the
27
  following phases.[^1]
@@ -33,115 +27,169 @@ following phases.[^1]
33
  *implementation-defined* manner that includes a means of designating
34
  input files as UTF-8 files, independent of their content.
35
  \[*Note 1*: In other words, recognizing the U+feff (byte order mark)
36
  is not sufficient. — *end note*] If an input file is determined to
37
  be a UTF-8 file, then it shall be a well-formed UTF-8 code unit
38
- sequence and it is decoded to produce a sequence of Unicode scalar
39
- values. A sequence of translation character set elements is then
40
- formed by mapping each Unicode scalar value to the corresponding
41
- translation character set element. In the resulting sequence, each
42
- pair of characters in the input sequence consisting of
43
- U+000d (carriage return) followed by U+000a (line feed), as well as
44
- each U+000d (carriage return) not immediately followed by a
45
- U+000a (line feed), is replaced by a single new-line character. For
46
- any other kind of input file supported by the implementation,
47
- characters are mapped, in an *implementation-defined* manner, to a
48
- sequence of translation character set elements [[lex.charset]],
49
- representing end-of-line indicators as new-line characters.
 
50
  2. If the first translation character is U+feff (byte order mark), it
51
- is deleted. Each sequence of a backslash character (\\) immediately
52
- followed by zero or more whitespace characters other than new-line
53
- followed by a new-line character is deleted, splicing physical
54
- source lines to form logical source lines. Only the last backslash
55
- on any physical source line shall be eligible for being part of such
56
- a splice. Except for splices reverted in a raw string literal, if a
57
- splice results in a character sequence that matches the syntax of a
58
- *universal-character-name*, the behavior is undefined. A source file
59
- that is not empty and that does not end in a new-line character, or
60
- that ends in a splice, shall be processed as if an additional
61
- new-line character were appended to the file.
62
  3. The source file is decomposed into preprocessing tokens
63
  [[lex.pptoken]] and sequences of whitespace characters (including
64
  comments). A source file shall not end in a partial preprocessing
65
- token or in a partial comment.[^2] Each comment is replaced by one
66
- space character. New-line characters are retained. Whether each
67
- nonempty sequence of whitespace characters other than new-line is
68
- retained or replaced by one space character is unspecified. As
69
- characters from the source file are consumed to form the next
70
- preprocessing token (i.e., not being consumed as part of a comment
71
- or other forms of whitespace), except when matching a
72
- *c-char-sequence*, *s-char-sequence*, *r-char-sequence*,
73
- *h-char-sequence*, or *q-char-sequence*, *universal-character-name*s
74
- are recognized and replaced by the designated element of the
75
- translation character set. The process of dividing a source file’s
 
76
  characters into preprocessing tokens is context-dependent.
77
  \[*Example 1*: See the handling of `<` within a `#include`
78
- preprocessing directive. — *end example*]
79
- 4. Preprocessing directives are executed, macro invocations are
80
- expanded, and `_Pragma` unary operator expressions are executed. A
81
- `#include` preprocessing directive causes the named header or source
82
- file to be processed from phase 1 through phase 4, recursively. All
83
- preprocessing directives are then deleted.
84
- 5. For a sequence of two or more adjacent *string-literal* tokens, a
85
- common *encoding-prefix* is determined as specified in
86
- [[lex.string]]. Each such *string-literal* token is then considered
87
- to have that common *encoding-prefix*.
88
- 6. Adjacent *string-literal* tokens are concatenated [[lex.string]].
89
- 7. Whitespace characters separating tokens are no longer significant.
90
- Each preprocessing token is converted into a token [[lex.token]].
 
 
 
 
 
91
  The resulting tokens constitute a *translation unit* and are
92
- syntactically and semantically analyzed and translated.
93
- \[*Note 2*: The process of analyzing and translating the tokens can
 
94
  occasionally result in one token being replaced by a sequence of
95
- other tokens [[temp.names]]. — *end note*] It is
96
- *implementation-defined* whether the sources for module units and
97
- header units on which the current translation unit has an interface
98
- dependency [[module.unit]], [[module.import]] are required to be
99
- available. \[*Note 3*: Source files, translation units and
100
- translated translation units need not necessarily be stored as
101
- files, nor need there be any one-to-one correspondence between these
102
- entities and any external representation. The description is
103
- conceptual only, and does not specify any particular
104
- implementation. — *end note*]
105
- 8. Translated translation units and instantiation units are combined as
106
- follows: \[*Note 4*: Some or all of these can be supplied from a
107
- library. *end note*] Each translated translation unit is examined
108
- to produce a list of required instantiations. \[*Note 5*: This can
109
- include instantiations which have been explicitly requested
110
- [[temp.explicit]]. — *end note*] The definitions of the required
111
- templates are located. It is *implementation-defined* whether the
112
- source of the translation units containing these definitions is
113
- required to be available. \[*Note 6*: An implementation can choose
114
- to encode sufficient information into the translated translation
115
- unit so as to ensure the source is not required here. — *end note*]
116
- All the required instantiations are performed to produce
117
- *instantiation units*. \[*Note 7*: These are similar to translated
118
- translation units, but contain no references to uninstantiated
119
- templates and no template definitions. *end note*] The program is
 
 
 
 
 
120
  ill-formed if any instantiation fails.
121
- 9. All external entity references are resolved. Library components are
122
- linked to satisfy external references to entities not defined in the
123
- current translation. All such translator output is collected into a
124
  program image which contains information needed for execution in its
125
  execution environment.
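
A few of the phases listed above can be observed directly in source code. The following is a minimal sketch, assuming a C++17 compiler; the names `SUM3`, `three`, `greeting`, and `greeting_is_joined` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstring>

// Phase 2: each backslash immediately followed by a new-line is deleted,
// so this definition is one logical source line when phase 4 processes it.
#define SUM3(a, b, c) \
    ((a) +            \
     (b) + (c))

// Phase 3: the comment is replaced by one space character,
// leaving the tokens `1`, `+`, `2`.
constexpr int three = 1/* comment */+ 2;

// Phase 6: adjacent string-literal tokens are concatenated.
constexpr const char* greeting = "Hello, " "world";

static_assert(SUM3(1, 2, 3) == 6);
static_assert(three == 3);

// Run-time check that the two literals were joined into one string.
inline bool greeting_is_joined() {
    return std::strcmp(greeting, "Hello, world") == 0;
}
```

Calling `greeting_is_joined()` from any function returns `true`; the `static_assert` lines are checked during translation.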
126
 
127
- ## Character sets <a id="lex.charset">[[lex.charset]]</a>
 
 
128
 
129
  The *translation character set* consists of the following elements:
130
 
131
- - each abstract character assigned a code point in the Unicode
132
- codespace, and
133
  - a distinct character for each Unicode scalar value not assigned to an
134
  abstract character.
135
 
136
  [*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
137
  (hexadecimal). A surrogate code point is a value in the range
138
  [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
139
  that is not a surrogate code point. — *end note*]
140
 
141
  The *basic character set* is a subset of the translation character set,
142
- consisting of 96 characters as specified in [[lex.charset.basic]].
143
 
144
  [*Note 2*: Unicode short names are given only as a means to identifying
145
  the character; the numerical value has no other meaning in this
146
  context. — *end note*]
147
 
@@ -155,10 +203,11 @@ context. — *end note*]
155
  | `U+0020` | space | |
156
  | `U+000a` | line feed | new-line |
157
  | `U+0021` | exclamation mark | `!` |
158
  | `U+0022` | quotation mark | `"` |
159
  | `U+0023` | number sign | `#` |
 
160
  | `U+0025` | percent sign | `%` |
161
  | `U+0026` | ampersand | `&` |
162
  | `U+0027` | apostrophe | `'` |
163
  | `U+0028` | left parenthesis | `(` |
164
  | `U+0029` | right parenthesis | `)` |
@@ -173,90 +222,27 @@ context. — *end note*]
173
  | `U+003b` | semicolon | `;` |
174
  | `U+003c` | less-than sign | `<` |
175
  | `U+003d` | equals sign | `=` |
176
  | `U+003e` | greater-than sign | `>` |
177
  | `U+003f` | question mark | `?` |
 
178
  | `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
179
  | | | `N O P Q R S T U V W X Y Z` |
180
  | `U+005b` | left square bracket | `[` |
181
  | `U+005c` | reverse solidus | `\` |
182
  | `U+005d` | right square bracket | `]` |
183
  | `U+005e` | circumflex accent | `^` |
184
  | `U+005f` | low line | `_` |
 
185
  | `U+0061` .. `U+007a` | latin small letter a .. z | `a b c d e f g h i j k l m` |
186
  | | | `n o p q r s t u v w x y z` |
187
  | `U+007b` | left curly bracket | `{` |
188
  | `U+007c` | vertical line | `|` |
189
  | `U+007d` | right curly bracket | `}` |
190
  | `U+007e` | tilde | `~` |
191
 
192
 
193
- The *universal-character-name* construct provides a way to name other
194
- characters.
195
-
196
- ``` bnf
197
- n-char: one of
198
- any member of the translation character set except the U+007d (right curly bracket) or new-line character
199
- ```
200
-
201
- ``` bnf
202
- n-char-sequence:
203
- n-char
204
- n-char-sequence n-char
205
- ```
206
-
207
- ``` bnf
208
- named-universal-character:
209
- '\N{' n-char-sequence '}'
210
- ```
211
-
212
- ``` bnf
213
- hex-quad:
214
- hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
215
- ```
216
-
217
- ``` bnf
218
- simple-hexadecimal-digit-sequence:
219
- hexadecimal-digit
220
- simple-hexadecimal-digit-sequence hexadecimal-digit
221
- ```
222
-
223
- ``` bnf
224
- universal-character-name:
225
- '\u' hex-quad
226
- '\U' hex-quad hex-quad
227
- '\u{' simple-hexadecimal-digit-sequence '}'
228
- named-universal-character
229
- ```
230
-
231
- A *universal-character-name* of the form `\u` *hex-quad*, `\U`
232
- *hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
233
- designates the character in the translation character set whose Unicode
234
- scalar value is the hexadecimal number represented by the sequence of
235
- *hexadecimal-digit*s in the *universal-character-name*. The program is
236
- ill-formed if that number is not a Unicode scalar value.
237
-
238
- A *universal-character-name* that is a *named-universal-character*
239
- designates the corresponding character in the Unicode Standard (chapter
240
- 4.8 Name) if the *n-char-sequence* is equal to its character name or to
241
- one of its character name aliases of type “control”, “correction”, or
242
- “alternate”; otherwise, the program is ill-formed.
243
-
244
- [*Note 3*: These aliases are listed in the Unicode Character Database’s
245
- `NameAliases.txt`. None of these names or aliases have leading or
246
- trailing spaces. — *end note*]
247
-
248
- If a *universal-character-name* outside the *c-char-sequence*,
249
- *s-char-sequence*, or *r-char-sequence* of a *character-literal* or
250
- *string-literal* (in either case, including within a
251
- *user-defined-literal*) corresponds to a control character or to a
252
- character in the basic character set, the program is ill-formed.
253
-
254
- [*Note 4*: A sequence of characters resembling a
255
- *universal-character-name* in an *r-char-sequence* [[lex.string]] does
256
- not form a *universal-character-name*. — *end note*]
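
As an illustration of the `\u` *hex-quad* and `\U` *hex-quad* *hex-quad* forms described above, the sketch below checks which scalar values they designate. It assumes a C++17 compiler; the function name `e_acute` is hypothetical.

```cpp
#include <cassert>

// \u takes exactly four hexadecimal digits, \U exactly eight.
static_assert(U'\u00E9' == 0xE9);        // LATIN SMALL LETTER E WITH ACUTE
static_assert(U'\U0001F600' == 0x1F600); // a scalar value outside the BMP

// In a UTF-16 string literal the same scalar value needs a surrogate pair:
// two code units plus the terminating null.
static_assert(sizeof(u"\U0001F600") == 3 * sizeof(char16_t));

// Hypothetical helper so the value can also be inspected at run time.
constexpr char32_t e_acute() { return U'\u00E9'; }
```

A value such as `\uD800` is a surrogate code point rather than a Unicode scalar value, so a program using it is ill-formed under the rule quoted above.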
257
-
258
  The *basic literal character set* consists of all characters of the
259
  basic character set, plus the control characters specified in
260
  [[lex.charset.literal]].
261
 
262
  **Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
@@ -282,24 +268,100 @@ applied to a wide character or string literal.
282
  A literal encoding or a locale-specific encoding of one of the execution
283
  character sets [[character.seq]] encodes each element of the basic
284
  literal character set as a single code unit with non-negative value,
285
  distinct from the code unit for any other such element.
286
 
287
- [*Note 5*: A character not in the basic literal character set can be
288
  encoded with more than one code unit; the value of such a code unit can
289
  be the same as that of a code unit for an element of the basic literal
290
  character set. — *end note*]
291
 
292
  The U+0000 (null) character is encoded as the value `0`. No other
293
  element of the translation character set is encoded with a code unit of
294
  value `0`. The code unit value of each decimal digit character after the
295
  digit `0` (`U+0030`) shall be one greater than the value of the
296
  previous. The ordinary and wide literal encodings are otherwise
297
  *implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
298
- Unicode scalar value corresponding to each character of the translation
299
- character set is encoded as specified in the Unicode Standard for the
300
- respective Unicode encoding form.
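
The encoding guarantees in this paragraph can be checked at compile time. This is a small sketch assuming a C++17 compiler; `digit_value` is a hypothetical helper, not something defined by the text.

```cpp
#include <cassert>

// U+0000 is encoded as the value 0.
static_assert('\0' == 0);

// Each decimal digit after '0' is one greater than the previous one,
// so subtracting '0' yields the digit's numeric value.
static_assert('1' - '0' == 1 && '9' - '0' == 9);

// UTF-8, UTF-16, and UTF-32 literals use the Unicode encoding forms.
static_assert(u8"A"[0] == 0x41);
static_assert(u'A' == 0x41 && U'A' == 0x41);

// U+20AC needs three UTF-8 code units (E2 82 AC) plus the terminating null.
static_assert(sizeof(u8"\u20AC") == 4);
static_assert(static_cast<unsigned char>(u8"\u20AC"[0]) == 0xE2);

// Hypothetical helper relying on the contiguous-digit guarantee.
constexpr int digit_value(char c) { return c - '0'; }
```

The ordinary and wide literal encodings are otherwise implementation-defined, so the sketch avoids asserting specific values for ordinary letters.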
301
 
302
  ## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
303
 
304
  ``` bnf
305
  preprocessing-token:
@@ -315,27 +377,22 @@ preprocessing-token:
315
  user-defined-string-literal
316
  preprocessing-op-or-punc
317
  each non-whitespace character that cannot be one of the above
318
  ```
319
 
320
- Each preprocessing token that is converted to a token [[lex.token]]
321
- shall have the lexical form of a keyword, an identifier, a literal, or
322
- an operator or punctuator.
323
-
324
  A preprocessing token is the minimal lexical element of the language in
325
  translation phases 3 through 6. In this document, glyphs are used to
326
  identify elements of the basic character set [[lex.charset]]. The
327
  categories of preprocessing token are: header names, placeholder tokens
328
  produced by preprocessing `import` and `module` directives
329
  (*import-keyword*, *module-keyword*, and *export-keyword*), identifiers,
330
  preprocessing numbers, character literals (including user-defined
331
  character literals), string literals (including user-defined string
332
  literals), preprocessing operators and punctuators, and single
333
  non-whitespace characters that do not lexically match the other
334
- preprocessing token categories. If a U+0027 (apostrophe) or a
335
- U+0022 (quotation mark) character matches the last category, the
336
- behavior is undefined. If any character not in the basic character set
337
  matches the last category, the program is ill-formed. Preprocessing
338
  tokens can be separated by whitespace; this consists of comments
339
  [[lex.comment]], or whitespace characters (U+0020 (space),
340
  U+0009 (character tabulation), new-line, U+000b (line tabulation), and
341
  U+000c (form feed)), or both. As described in [[cpp]], in certain
@@ -343,10 +400,21 @@ circumstances during translation phase 4, whitespace (or the absence
343
  thereof) serves as more than preprocessing token separation. Whitespace
344
  can appear within a preprocessing token only as part of a header name or
345
  between the quotation characters in a character literal or string
346
  literal.
347
 
 
 
 
 
 
 
 
 
 
 
 
348
  If the input stream has been parsed into preprocessing tokens up to a
349
  given character:
350
 
351
  - If the next character begins a sequence of characters that could be
352
  the prefix and initial double quote of a raw string literal, such as
@@ -362,34 +430,38 @@ given character:
362
  ```
363
  - Otherwise, if the next three characters are `<::` and the subsequent
364
  character is neither `:` nor `>`, the `<` is treated as a
365
  preprocessing token by itself and not as the first character of the
366
  alternative token `<:`.
 
 
 
 
 
367
  - Otherwise, the next preprocessing token is the longest sequence of
368
  characters that could constitute a preprocessing token, even if that
369
- would cause further lexical analysis to fail, except that a
370
- *header-name* [[lex.header]] is only formed
371
- - after the `include` or `import` preprocessing token in an `#include`
372
- [[cpp.include]] or `import` [[cpp.import]] directive, or
373
- - within a *has-include-expression*.
 
 
 
 
 
 
374
 
375
  [*Example 1*:
376
 
377
  ``` cpp
378
  #define R "x"
379
  const char* s = R"y"; // ill-formed raw string, not "x" "y"
380
  ```
381
 
382
  — *end example*]
383
 
384
- The *import-keyword* is produced by processing an `import` directive
385
- [[cpp.import]], the *module-keyword* is produced by preprocessing a
386
- `module` directive [[cpp.module]], and the *export-keyword* is produced
387
- by preprocessing either of the previous two directives.
388
-
389
- [*Note 1*: None has any observable spelling. — *end note*]
390
-
391
  [*Example 2*: The program fragment `0xe+foo` is parsed as a
392
  preprocessing number token (one that is not a valid *integer-literal* or
393
  *floating-point-literal* token), even though a parse as three
394
  preprocessing tokens `0xe`, `+`, and `foo` can produce a valid
395
  expression (for example, if `foo` is a macro defined as `1`). Similarly,
@@ -400,98 +472,57 @@ macro name. — *end example*]
400
  [*Example 3*: The program fragment `x+++++y` is parsed as `x
401
  ++ ++ + y`, which, if `x` and `y` have integral types, violates a
402
  constraint on increment operators, even though the parse `x ++ + ++ y`
403
  can yield a correct expression. — *end example*]
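
The longest-sequence rule also explains how a shorter run of `+` characters is tokenized. The following sketch assumes a C++14 or later compiler; `munch_demo` is a hypothetical name.

```cpp
#include <cassert>

constexpr int munch_demo() {
    int x = 1, y = 10;
    // Tokenized as `x ++ + y`: the longest token `++` is taken first,
    // so this is (x++) + y and never x + (++y).
    int r = x+++y;
    // r is 1 + 10 == 11, and x was post-incremented to 2.
    return r * 100 + x;
}

static_assert(munch_demo() == 1102);
```

Writing `x + ++y` with explicit whitespace selects the other parse, since whitespace separates preprocessing tokens.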
404
 
405
- ## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
406
-
407
- Alternative token representations are provided for some operators and
408
- punctuators.[^3]
409
-
410
- In all respects of the language, each alternative token behaves the
411
- same, respectively, as its primary token, except for its spelling.[^4]
412
-
413
- The set of alternative tokens is defined in [[lex.digraph]].
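
A short sketch of alternative tokens in use, assuming a conforming C++14 or later compiler (some compilers need a conformance flag before they accept the word-like tokens). The function names are hypothetical.

```cpp
#include <cassert>

// `<%` and `%>` stand for `{` and `}`; `<:` and `:>` stand for `[` and `]`.
constexpr int second_element() <%
    int a<:3:> = <% 1, 2, 3 %>;
    return a<:1:>;
%>

// `and`, `not`, `bitand`, and `bitor` behave exactly like `&&`, `!`, `&`, `|`.
constexpr bool logic() { return true and not false; }

static_assert(second_element() == 2);
static_assert((6 bitand 3) == 2 and (6 bitor 1) == 7);
```

Apart from spelling, each alternative token here is interchangeable with its primary token, which is why the assertions hold.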
414
-
415
- ## Tokens <a id="lex.token">[[lex.token]]</a>
416
-
417
- ``` bnf
418
- token:
419
- identifier
420
- keyword
421
- literal
422
- operator-or-punctuator
423
- ```
424
-
425
- There are five kinds of tokens: identifiers, keywords, literals,[^5]
426
-
427
- operators, and other separators. Blanks, horizontal and vertical tabs,
428
- newlines, formfeeds, and comments (collectively, “whitespace”), as
429
- described below, are ignored except as they serve to separate tokens.
430
-
431
- [*Note 1*: Some whitespace is required to separate otherwise adjacent
432
- identifiers, keywords, numeric literals, and alternative tokens
433
- containing alphabetic characters. — *end note*]
434
-
435
- ## Comments <a id="lex.comment">[[lex.comment]]</a>
436
-
437
- The characters `/*` start a comment, which terminates with the
438
- characters `*/`. These comments do not nest. The characters `//` start a
439
- comment, which terminates immediately before the next new-line
440
- character. If there is a form-feed or a vertical-tab character in such a
441
- comment, only whitespace characters shall appear between it and the
442
- new-line that terminates the comment; no diagnostic is required.
443
-
444
- [*Note 1*: The comment characters `//`, `/*`, and `*/` have no special
445
- meaning within a `//` comment and are treated just like other
446
- characters. Similarly, the comment characters `//` and `/*` have no
447
- special meaning within a `/*` comment. — *end note*]
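
The note about comment characters inside comments can be illustrated as follows. This is a sketch only; the variable names are hypothetical.

```cpp
#include <cassert>

// Inside a block comment, `//` has no special meaning: the comment still
// ends at the first `*/`, and `+ 2` is ordinary code.
constexpr int a = 1 /* // still inside the block comment */ + 2;

// Inside a line comment, `/*` has no special meaning, so the next line
// is not swallowed by a block comment.
constexpr int b = 4; // /* this does not open a block comment
constexpr int c = 5;

static_assert(a == 3);
static_assert(b + c == 9);
```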
448
-
449
  ## Header names <a id="lex.header">[[lex.header]]</a>
450
 
451
  ``` bnf
452
  header-name:
453
  '<' h-char-sequence '>'
454
  '"' q-char-sequence '"'
455
  ```
456
 
457
  ``` bnf
458
  h-char-sequence:
459
- h-char
460
- h-char-sequence h-char
461
  ```
462
 
463
  ``` bnf
464
  h-char:
465
  any member of the translation character set except new-line and U+003e (greater-than sign)
466
  ```
467
 
468
  ``` bnf
469
  q-char-sequence:
470
- q-char
471
- q-char-sequence q-char
472
  ```
473
 
474
  ``` bnf
475
  q-char:
476
  any member of the translation character set except new-line and U+0022 (quotation mark)
477
  ```
478
 
479
- [*Note 1*: Header name preprocessing tokens only appear within a
480
- `#include` preprocessing directive, a `__has_include` preprocessing
481
- expression, or after certain occurrences of an `import` token (see 
482
- [[lex.pptoken]]). — *end note*]
483
-
484
  The sequences in both forms of *header-name*s are mapped in an
485
  *implementation-defined* manner to headers or to external source file
486
  names as specified in  [[cpp.include]].
487
 
 
 
 
 
 
488
  The appearance of either of the characters `'` or `\` or of either of
489
  the character sequences `/*` or `//` in a *q-char-sequence* or an
490
  *h-char-sequence* is conditionally-supported with
491
  *implementation-defined* semantics, as is the appearance of the
492
- character `"` in an *h-char-sequence*.[^6]
 
 
 
 
 
493
 
494
  ## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
495
 
496
  ``` bnf
497
  pp-number:
@@ -513,10 +544,76 @@ tokens [[lex.icon]] and all *floating-point-literal* tokens
513
 
514
  A preprocessing number does not have a type or a value; it acquires both
515
  after a successful conversion to an *integer-literal* token or a
516
  *floating-point-literal* token.
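
Because a preprocessing number only acquires a type and value on conversion, fragments that are not valid literals on their own can still be pasted together in phase 4. A minimal sketch, assuming a C++17 compiler; the macro name `CAT` is hypothetical.

```cpp
#include <cassert>

// Token pasting operates on preprocessing tokens, before any conversion
// to integer-literal or floating-point-literal tokens.
#define CAT(a, b) a##b

static_assert(CAT(12, 34) == 1234);  // pp-numbers `12` and `34` form `1234`
static_assert(CAT(0x, FF) == 255);   // `0x` alone is a pp-number, not a literal
static_assert(CAT(1e, 3) == 1000.0); // `1e` and `3` form `1e3`

// With whitespace this is three tokens; written as `0xe+1` it would be
// a single pp-number that is not a valid literal.
static_assert(0xe + 1 == 15);
```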
517
 
518
  ## Identifiers <a id="lex.name">[[lex.name]]</a>
519
 
520
  ``` bnf
521
  identifier:
522
  identifier-start
@@ -549,21 +646,24 @@ digit: one of
549
  '0 1 2 3 4 5 6 7 8 9'
550
  ```
551
 
552
  [*Note 1*:
553
 
554
- The character properties XID_Start and XID_Continue are Derived Core
555
- Properties as described by UAX \#44 of the Unicode Standard.[^7]
556
 
557
  — *end note*]
558
 
559
  The program is ill-formed if an *identifier* does not conform to
560
  Normalization Form C as specified in the Unicode Standard.
561
 
562
  [*Note 2*: Identifiers are case-sensitive. — *end note*]
563
 
564
- [*Note 3*: In translation phase 4, *identifier* also includes those
 
 
 
565
  *preprocessing-token*s [[lex.pptoken]] differentiated as keywords
566
  [[lex.key]] in the later translation phase 7
567
  [[lex.token]]. — *end note*]
568
 
569
  The identifiers in [[lex.name.special]] have a special meaning when
@@ -576,12 +676,13 @@ interpret the token as a regular *identifier*.
576
  In addition, some identifiers appearing as a *token* or
577
  *preprocessing-token* are reserved for use by C++ implementations and
578
  shall not be used otherwise; no diagnostic is required.
579
 
580
  - Each identifier that contains a double underscore `__` or begins with
581
- an underscore followed by an uppercase letter is reserved to the
582
- implementation for any use.
 
583
  - Each identifier that begins with an underscore is reserved to the
584
  implementation for use as a name in the global namespace.
585
 
586
  ## Keywords <a id="lex.key">[[lex.key]]</a>
587
 
@@ -609,44 +710,10 @@ Furthermore, the alternative representations shown in
609
  | | | | | | |
610
  | -------- | -------- | -------- | ------- | -------- | ----- |
611
  | `and` | `and_eq` | `bitand` | `bitor` | `compl` | `not` |
612
  | `not_eq` | `or` | `or_eq` | `xor` | `xor_eq` | |
613
 
614
- ## Operators and punctuators <a id="lex.operators">[[lex.operators]]</a>
615
-
616
- The lexical representation of C++ programs includes a number of
617
- preprocessing tokens that are used in the syntax of the preprocessor or
618
- are converted into tokens for operators and punctuators:
619
-
620
- ``` bnf
621
- preprocessing-op-or-punc:
622
- preprocessing-operator
623
- operator-or-punctuator
624
- ```
625
-
626
- ``` bnf
627
- %% Ed. note: character protrusion would misalign various operators.
628
- preprocessing-operator: one of
629
- '# ## %: %:%:'
630
- ```
631
-
632
- ``` bnf
633
- operator-or-punctuator: one of
634
- '{ } [ ] ( )'
635
- '<: :> <% %> ; : ...'
636
- '? :: . .* -> ->* ~'
637
- '! + - * / % ^ & |'
638
- '= += -= *= /= %= ^= &= |='
639
- '== != < > <= >= <=> && ||'
640
- '<< >> <<= >>= ++ -- ,'
641
- 'and or xor not bitand bitor compl'
642
- 'and_eq or_eq xor_eq not_eq'
643
- ```
644
-
645
- Each *operator-or-punctuator* is converted to a single token in
646
- translation phase 7 [[lex.phases]].
647
-
648
  ## Literals <a id="lex.literal">[[lex.literal]]</a>
649
 
650
  ### Kinds of literals <a id="lex.literal.kinds">[[lex.literal.kinds]]</a>
651
 
652
  There are several kinds of literals.[^8]
@@ -762,12 +829,12 @@ size-suffix: one of
762
  'z Z'
763
  ```
764
 
765
  In an *integer-literal*, the sequence of *binary-digit*s,
766
  *octal-digit*s, *digit*s, or *hexadecimal-digit*s is interpreted as a
767
- base N integer as shown in table [[lex.icon.base]]; the lexically first
768
- digit of the sequence of digits is the most significant.
769
 
770
  [*Note 1*: The prefix and any optional separating single quotes are
771
  ignored when determining the value. — *end note*]
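
For instance, the same value can be written in each base, and digit separators do not change it. The sketch assumes a C++17 compiler; `answer` is a hypothetical name.

```cpp
#include <cassert>

static_assert(0b101010 == 42); // binary, prefix 0b
static_assert(052 == 42);      // octal, prefix 0
static_assert(0x2A == 42);     // hexadecimal, prefix 0x

// Separating single quotes are ignored when determining the value.
static_assert(1'000'000 == 1000000);
static_assert(0b0010'1010 == 0x2a);

// Hypothetical helper so the value can also be checked at run time.
constexpr int answer() { return 0x2A; }
```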
772
 
773
  **Table: Base of *integer-literal*s** <a id="lex.icon.base">[lex.icon.base]</a>
@@ -820,20 +887,23 @@ which its value can be represented.
820
  | | | `std::size_t` |
821
  | Both `u` or `U` | `std::size_t` | `std::size_t` |
822
  | and `z` or `Z` | | |
823
 
824
 
825
- If an *integer-literal* cannot be represented by any type in its list
 
826
  and an extended integer type [[basic.fundamental]] can represent its
827
  value, it may have that extended integer type. If all of the types in
828
  the list for the *integer-literal* are signed, the extended integer type
829
- shall be signed. If all of the types in the list for the
830
- *integer-literal* are unsigned, the extended integer type shall be
831
- unsigned. If the list contains both signed and unsigned types, the
832
- extended integer type may be signed or unsigned. A program is ill-formed
833
- if one of its translation units contains an *integer-literal* that
834
- cannot be represented by any of the allowed types.
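
The effect of a suffix on the type of an *integer-literal* can be inspected with `decltype`. This sketch assumes a C++17 compiler and only uses small values whose type does not depend on the widths of the platform's integer types; `suffix_u_is_unsigned` is a hypothetical name.

```cpp
#include <cassert>
#include <type_traits>

// No suffix, decimal: the first of int, long, long long that fits.
static_assert(std::is_same_v<decltype(1), int>);

// Suffixes restrict the list of candidate types.
static_assert(std::is_same_v<decltype(1u), unsigned int>);
static_assert(std::is_same_v<decltype(1L), long>);
static_assert(std::is_same_v<decltype(1LL), long long>);
static_assert(std::is_same_v<decltype(1ull), unsigned long long>);

// Hypothetical helper exposing one of the checks at run time.
constexpr bool suffix_u_is_unsigned() {
    return std::is_unsigned_v<decltype(1u)>;
}
```

Larger literals are deliberately left out, because which type in the list first fits them depends on the implementation.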
 
 
835
 
836
  ### Character literals <a id="lex.ccon">[[lex.ccon]]</a>
837
 
838
  ``` bnf
839
  character-literal:
@@ -845,12 +915,11 @@ encoding-prefix: one of
845
  'u8' 'u' 'U' 'L'
846
  ```
847
 
848
  ``` bnf
849
  c-char-sequence:
850
- c-char
851
- c-char-sequence c-char
852
  ```
853
 
854
  ``` bnf
855
  c-char:
856
  basic-c-char
@@ -887,12 +956,11 @@ numeric-escape-sequence:
887
  hexadecimal-escape-sequence
888
  ```
889
 
890
  ``` bnf
891
  simple-octal-digit-sequence:
892
- octal-digit
893
- simple-octal-digit-sequence octal-digit
894
  ```
895
 
896
  ``` bnf
897
  octal-escape-sequence:
898
  '\' octal-digit
@@ -915,60 +983,47 @@ conditional-escape-sequence:
915
  ``` bnf
916
  conditional-escape-sequence-char:
917
  any member of the basic character set that is not an octal-digit, a simple-escape-sequence-char, or the characters 'N', 'o', 'u', 'U', or 'x'
918
  ```
919
 
920
- A *non-encodable character literal* is a *character-literal* whose
921
- *c-char-sequence* consists of a single *c-char* that is not a
922
- *numeric-escape-sequence* and that specifies a character that either
923
- lacks representation in the literal’s associated character encoding or
924
- that cannot be encoded as a single code unit. A *multicharacter literal*
925
- is a *character-literal* whose *c-char-sequence* consists of more than
926
- one *c-char*. The *encoding-prefix* of a non-encodable character literal
927
- or a multicharacter literal shall be absent. Such *character-literal*s
928
- are conditionally-supported.
929
 
930
  The kind of a *character-literal*, its type, and its associated
931
  character encoding [[lex.charset]] are determined by its
932
  *encoding-prefix* and its *c-char-sequence* as defined by
933
- [[lex.ccon.literal]]. The special cases for non-encodable character
934
- literals and multicharacter literals take precedence over the base kind.
935
-
936
- [*Note 1*: The associated character encoding for ordinary character
937
- literals determines encodability, but does not determine the value of
938
- non-encodable ordinary character literals or ordinary multicharacter
939
- literals. The examples in [[lex.ccon.literal]] for non-encodable
940
- ordinary character literals assume that the specified character lacks
941
- representation in the ordinary literal encoding or that encoding the
942
- character would require more than one code unit. — *end note*]
943
 
944
  **Table: Character literals** <a id="lex.ccon.literal">[lex.ccon.literal]</a>
945
 
946
- | | | | | |
947
- | ---- | -------------------------- | ---------- | ------------ | ------- |
948
- | none | ordinary character literal | `char` | ordinary | `'v'` |
949
  | `L` | wide character literal | `wchar_t` | wide literal | `L'w'` |
950
  | | | | encoding | |
951
  | `u8` | UTF-8 character literal | `char8_t` | UTF-8 | `u8'x'` |
952
  | `u` | UTF-16 character literal | `char16_t` | UTF-16 | `u'y'` |
953
  | `U` | UTF-32 character literal | `char32_t` | UTF-32 | `U'z'` |
954
 
955
 
956
  In translation phase 4, the value of a *character-literal* is determined
957
  using the range of representable values of the *character-literal*’s
958
- type in translation phase 7. A non-encodable character literal or a
959
- multicharacter literal has an *implementation-defined* value. The value
960
- of any other kind of *character-literal* is determined as follows:
961
 
962
  - A *character-literal* with a *c-char-sequence* consisting of a single
963
  *basic-c-char*, *simple-escape-sequence*, or
964
  *universal-character-name* is the code unit value of the specified
965
  character as encoded in the literal’s associated character encoding.
966
- \[*Note 2*: If the specified character lacks representation in the
967
- literal’s associated character encoding or if it cannot be encoded as
968
- a single code unit, then the literal is a non-encodable character
969
- literal. — *end note*]
970
  - A *character-literal* with a *c-char-sequence* consisting of a single
971
  *numeric-escape-sequence* has a value as follows:
972
  - Let v be the integer value represented by the octal number
973
  comprising the sequence of *octal-digit*s in an
974
  *octal-escape-sequence* or by the hexadecimal number comprising the
@@ -979,20 +1034,20 @@ of any other kind of *character-literal* is determined as follows:
979
  or `L`, and v does not exceed the range of representable values of
980
  the corresponding unsigned type for the underlying type of the
981
  *character-literal*’s type, then the value is the unique value of
982
  the *character-literal*’s type `T` that is congruent to v modulo 2ᴺ,
983
  where N is the width of `T`.
984
- - Otherwise, the *character-literal* is ill-formed.
985
  - A *character-literal* with a *c-char-sequence* consisting of a single
986
  *conditional-escape-sequence* is conditionally-supported and has an
987
  *implementation-defined* value.
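
The numeric-escape rule above can be sketched with a few compile-time checks. This is an illustration only; the cast to `unsigned char` is used so the last check holds whether plain `char` is signed or unsigned on the implementation.

```cpp
// A numeric escape names a code unit value directly, whatever the encoding.
static_assert('\x41' == 0x41, "hexadecimal escape");
static_assert('\101' == 0x41, "octal escape: 0101 is 65");
static_assert(U'\x1F600' == 0x1F600, "value fits in char32_t");

// 0xFF might not fit in a signed char, but it fits in unsigned char, so the
// literal takes the char value that is congruent to 0xFF modulo 2^N.
static_assert(static_cast<unsigned char>('\xff') == 0xFF,
              "congruent to 0xFF modulo 2^N");
```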
988
 
989
  The character specified by a *simple-escape-sequence* is specified in
990
  [[lex.ccon.esc]].
991
 
992
- [*Note 3*: Using an escape sequence for a question mark is supported
993
- for compatibility with ISO C++14 and ISO C. — *end note*]
994
 
995
  **Table: Simple escape sequences** <a id="lex.ccon.esc">[lex.ccon.esc]</a>
996
 
997
  | character | | *simple-escape-sequence* |
998
  | --------- | -------------------- | ------------------------ |
@@ -1129,12 +1184,11 @@ string-literal:
1129
  encoding-prefixₒₚₜ 'R' raw-string
1130
  ```
1131
 
1132
  ``` bnf
1133
  s-char-sequence:
1134
- s-char
1135
- s-char-sequence s-char
1136
  ```
1137
 
1138
  ``` bnf
1139
  s-char:
1140
  basic-s-char
@@ -1153,24 +1207,22 @@ raw-string:
1153
  '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
1154
  ```
1155
 
1156
  ``` bnf
1157
  r-char-sequence:
1158
- r-char
1159
- r-char-sequence r-char
1160
  ```
1161
 
1162
  ``` bnf
1163
  r-char:
1164
  any member of the translation character set, except a U+0029 (right parenthesis) followed by
1165
  the initial *d-char-sequence* (which may be empty) followed by a U+0022 (quotation mark)
1166
  ```
1167
 
1168
  ``` bnf
1169
  d-char-sequence:
1170
- d-char
1171
- d-char-sequence d-char
1172
  ```
1173
 
1174
  ``` bnf
1175
  d-char:
1176
  any member of the basic character set except:
@@ -1179,16 +1231,17 @@ d-char:
1179
  ```
1180
 
1181
  The kind of a *string-literal*, its type, and its associated character
1182
  encoding [[lex.charset]] are determined by its encoding prefix and
1183
  sequence of *s-char*s or *r-char*s as defined by [[lex.string.literal]]
1184
- where n is the number of encoded code units as described below.
 
1185
 
1186
  **Table: String literals** <a id="lex.string.literal">[lex.string.literal]</a>
1187
 
1188
- | | | | | |
1189
- | ---- | ----------------------- | ----------------------------- | ------------------------- | ---------------------------------------------- |
1190
  | none | ordinary string literal | array of $n$ `const char` | ordinary literal encoding | `"ordinary string"` `R"(ordinary raw string)"` |
1191
  | `L` | wide string literal | array of $n$ `const wchar_t` | wide literal encoding | `L"wide string"` `LR"w(wide raw string)w"` |
1192
  | `u8` | UTF-8 string literal | array of $n$ `const char8_t` | UTF-8 | `u8"UTF-8 string"` `u8R"x(UTF-8 raw string)x"` |
1193
  | `u` | UTF-16 string literal | array of $n$ `const char16_t` | UTF-16 | `u"UTF-16 string"` `uR"y(UTF-16 raw string)y"` |
1194
  | `U` | UTF-32 string literal | array of $n$ `const char32_t` | UTF-32 | `U"UTF-32 string"` `UR"z(UTF-32 raw string)z"` |
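
A small sketch of the array types in the table above. It relies on the fact that n includes the terminating null code unit, so a three-character literal has n equal to 4.

```cpp
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to the array.
static_assert(std::is_same<decltype("abc"), const char (&)[4]>::value, "");
static_assert(std::is_same<decltype(L"abc"), const wchar_t (&)[4]>::value, "");
static_assert(std::is_same<decltype(u"abc"), const char16_t (&)[4]>::value, "");
static_assert(std::is_same<decltype(U"abc"), const char32_t (&)[4]>::value, "");

// In a raw string the backslash is an ordinary character: 'a', '\', 'n', null.
static_assert(sizeof(R"(a\n)") == 4, "no escape processing in raw strings");
```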
@@ -1198,12 +1251,12 @@ A *string-literal* that has an `R` in the prefix is a *raw string
1198
  literal*. The *d-char-sequence* serves as a delimiter. The terminating
1199
  *d-char-sequence* of a *raw-string* is the same sequence of characters
1200
  as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
1201
  at most 16 characters.
1202
 
1203
- [*Note 1*: The characters `'('` and `')'` are permitted in a
1204
- *raw-string*. Thus, `R"delimiter((a|b))delimiter"` is equivalent to
1205
  `"(a|b)"`. — *end note*]
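
The equivalence stated in the note can be checked at compile time. In this sketch `same_text` is a hypothetical helper written for the example, not a library function.

```cpp
// Hypothetical helper: compares two null-terminated strings at compile time.
constexpr bool same_text(const char* a, const char* b) {
  return *a == *b && (*a == '\0' || same_text(a + 1, b + 1));
}

// The delimiter lets parentheses appear freely in the body of the raw string.
static_assert(same_text(R"delimiter((a|b))delimiter", "(a|b)"), "Note 1");

// Escape sequences are not interpreted inside a raw string.
static_assert(same_text(R"(\n)", "\\n"), "the backslash stays literal");
static_assert(sizeof(R"(\n)") == 3, "backslash, n, and the terminating null");
```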
1206
 
1207
  [*Note 2*:
1208
 
1209
  A source-file new-line in a raw string literal results in a new-line in
@@ -1239,18 +1292,15 @@ R"(x = "\"y\"")"
1239
  is equivalent to `"x = \"\\\"y\\\"\""`.
1240
 
1241
  — *end example*]
1242
 
1243
  Ordinary string literals and UTF-8 string literals are also referred to
1244
- as narrow string literals.
1245
 
1246
- The common *encoding-prefix* for a sequence of adjacent
1247
- *string-literal*s is determined pairwise as follows: If two
1248
- *string-literal*s have the same *encoding-prefix*, the common
1249
- *encoding-prefix* is that *encoding-prefix*. If one *string-literal* has
1250
- no *encoding-prefix*, the common *encoding-prefix* is that of the other
1251
- *string-literal*. Any other combinations are ill-formed.
1252
 
1253
  [*Note 3*: A *string-literal*’s rawness has no effect on the
1254
  determination of the common *encoding-prefix*. — *end note*]
1255
 
1256
  In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
@@ -1287,16 +1337,17 @@ digit `1` (and not the single character `'A'` specified by a
1287
  | `u"a"` | `"b"` | `u"ab"` | `U"a"` | `"b"` | `U"ab"` | `L"a"` | `"b"` | `L"ab"` |
1288
  | `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
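
A few rows of the concatenation table, written as compile-time checks. `same_text` is a hypothetical helper introduced for this sketch; the last check restates the point made earlier that an escape sequence is not extended by the literal that follows it.

```cpp
// Hypothetical helper: compares two null-terminated strings of any code unit type.
template <typename Char>
constexpr bool same_text(const Char* a, const Char* b) {
  return *a == *b && (*a == Char() || same_text(a + 1, b + 1));
}

// An unprefixed literal adopts the prefix of its neighbour.
static_assert(same_text(u"a" "b", u"ab"), "u + none gives u");
static_assert(same_text("a" U"b", U"ab"), "none + U gives U");
static_assert(same_text(L"a" "b", L"ab"), "L + none gives L");

// Rawness has no effect on the common encoding prefix.
static_assert(same_text(LR"(a)" "b", L"ab"), "raw and non-raw literals mix");

// "\xA" "B" holds the two characters '\xA' and 'B', not the single '\xAB'.
static_assert(sizeof("\xA" "B") == 3, "two code units plus the null");
```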
1289
 
1290
 
1291
  Evaluating a *string-literal* results in a string literal object with
1292
- static storage duration [[basic.stc]]. Whether all *string-literal*s are
1293
- distinct (that is, are stored in nonoverlapping objects) and whether
1294
- successive evaluations of a *string-literal* yield the same or a
1295
- different object is unspecified.
1296
 
1297
- [*Note 4*: The effect of attempting to modify a string literal object
 
 
 
 
1298
  is undefined. — *end note*]
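
A minimal sketch of what static storage duration means in practice. The function name `greeting` is hypothetical.

```cpp
// The string literal object has static storage duration, so the returned
// pointer remains valid after the function returns.
constexpr const char* greeting() { return "hello"; }

static_assert(greeting()[0] == 'h', "the object can be read at any time");

// Whether two evaluations such as greeting() and greeting() yield the same
// address is unspecified, so code should not depend on either outcome.
// Holding the result as const char* lets the compiler reject writes, which
// would otherwise have undefined behavior.
```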
1299
 
1300
  String literal objects are initialized with the sequence of code unit
1301
  values corresponding to the *string-literal*’s sequence of *s-char*s
1302
  (originally from non-raw string literals) and *r-char*s (originally from
@@ -1306,20 +1357,19 @@ order as follows:
1306
  - The sequence of characters denoted by each contiguous sequence of
1307
  *basic-s-char*s, *r-char*s, *simple-escape-sequence*s [[lex.ccon]],
1308
  and *universal-character-name*s [[lex.charset]] is encoded to a code
1309
  unit sequence using the *string-literal*’s associated character
1310
  encoding. If a character lacks representation in the associated
1311
- character encoding, then the *string-literal* is
1312
- conditionally-supported and an *implementation-defined* code unit
1313
- sequence is encoded. \[*Note 5*: No character lacks representation in
1314
- any Unicode encoding form. *end note*] When encoding a stateful
1315
- character encoding, implementations should encode the first such
1316
- sequence beginning with the initial encoding state and encode
1317
- subsequent sequences beginning with the final encoding state of the
1318
- prior sequence. \[*Note 6*: The encoded code unit sequence can differ
1319
- from the sequence of code units that would be obtained by encoding
1320
- each character independently. — *end note*]
1321
  - Each *numeric-escape-sequence* [[lex.ccon]] contributes a single code
1322
  unit with a value as follows:
1323
  - Let v be the integer value represented by the octal number
1324
  comprising the sequence of *octal-digit*s in an
1325
  *octal-escape-sequence* or by the hexadecimal number comprising the
@@ -1330,35 +1380,53 @@ order as follows:
1330
  `L`, and v does not exceed the range of representable values of the
1331
  corresponding unsigned type for the underlying type of the
1332
  *string-literal*’s array element type, then the value is the unique
1333
  value of the *string-literal*’s array element type `T` that is
1334
  congruent to v modulo 2ᴺ, where N is the width of `T`.
1335
- - Otherwise, the *string-literal* is ill-formed.
1336
 
1337
  When encoding a stateful character encoding, these sequences should
1338
  have no effect on encoding state.
1339
  - Each *conditional-escape-sequence* [[lex.ccon]] contributes an
1340
  *implementation-defined* code unit sequence. When encoding a stateful
1341
  character encoding, it is *implementation-defined* what effect these
1342
  sequences have on encoding state.
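
The encoding rules above can be observed through `sizeof`. The following sketch uses only the Unicode encoding forms, whose results do not depend on the implementation's ordinary or wide literal encoding.

```cpp
// U+00E9 needs two UTF-8 code units; the array also holds the null.
static_assert(sizeof(u8"\u00e9") == 3, "two code units plus the null");

// U+1F600 needs a surrogate pair in UTF-16 but one code unit in UTF-32.
static_assert(sizeof(u"\U0001F600") == 3 * sizeof(char16_t), "surrogate pair");
static_assert(sizeof(U"\U0001F600") == 2 * sizeof(char32_t), "single code unit");

// Each numeric escape contributes exactly one code unit with that value.
static_assert(sizeof("\x41\102") == 3, "two escapes, two code units");
static_assert("\x41\102"[0] == 0x41 && "\x41\102"[1] == 0102, "values kept");
```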
1343
 
1344
  ### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
1345
 
1346
  ``` bnf
1347
  boolean-literal:
1348
- 'false'
1349
- 'true'
1350
  ```
1351
 
1352
  The Boolean literals are the keywords `false` and `true`. Such literals
1353
  have type `bool`.
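
A short check of the statement above, as a sketch:

```cpp
#include <type_traits>

// Both Boolean literals have type bool.
static_assert(std::is_same<decltype(true), bool>::value, "true is a bool");
static_assert(std::is_same<decltype(false), bool>::value, "false is a bool");

// Converting to int gives 1 and 0 respectively.
static_assert(static_cast<int>(true) == 1 && static_cast<int>(false) == 0, "");
```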
1354
 
1355
  ### Pointer literals <a id="lex.nullptr">[[lex.nullptr]]</a>
1356
 
1357
  ``` bnf
1358
  pointer-literal:
1359
- 'nullptr'
1360
  ```
1361
 
1362
  The pointer literal is the keyword `nullptr`. It has type
1363
  `std::nullptr_t`.
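
Because `nullptr` has its own type, it takes part in overload resolution differently from the integer `0`. In this sketch the overload set `which` is hypothetical.

```cpp
#include <cstddef>
#include <type_traits>

static_assert(std::is_same<decltype(nullptr), std::nullptr_t>::value,
              "nullptr has type std::nullptr_t");

// Hypothetical overload set used to observe which conversion is chosen.
constexpr int which(int) { return 1; }
constexpr int which(const void*) { return 2; }

static_assert(which(0) == 1, "0 is an int");
static_assert(which(nullptr) == 2, "nullptr converts to a pointer, not to int");
```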
1364
 
@@ -1490,11 +1558,11 @@ where *f* is the source character sequence c₁c₂...cₖ.
1490
  basic character set. — *end note*]
1491
 
1492
  If *L* is a *user-defined-string-literal*, let *str* be the literal
1493
  without its *ud-suffix* and let *len* be the number of code units in
1494
  *str* (i.e., its length excluding the terminating null character). If
1495
- *S* contains a literal operator template with a non-type template
1496
  parameter for which *str* is a well-formed *template-argument*, the
1497
  literal *L* is treated as a call of the form
1498
 
1499
  ``` cpp
1500
  operator ""X<str>()
@@ -1557,26 +1625,37 @@ int main() {
1557
  [basic.fundamental]: basic.md#basic.fundamental
1558
  [basic.link]: basic.md#basic.link
1559
  [basic.lookup.unqual]: basic.md#basic.lookup.unqual
1560
  [basic.stc]: basic.md#basic.stc
1561
  [character.seq]: library.md#character.seq
 
1562
  [conv.mem]: expr.md#conv.mem
1563
  [conv.ptr]: expr.md#conv.ptr
1564
  [cpp]: cpp.md#cpp
1565
  [cpp.cond]: cpp.md#cpp.cond
 
1566
  [cpp.import]: cpp.md#cpp.import
1567
  [cpp.include]: cpp.md#cpp.include
1568
  [cpp.module]: cpp.md#cpp.module
 
 
 
 
 
1569
  [cpp.stringize]: cpp.md#cpp.stringize
1570
  [dcl.attr.grammar]: dcl.md#dcl.attr.grammar
 
 
1571
  [expr.prim.literal]: expr.md#expr.prim.literal
1572
  [headers]: library.md#headers
 
1573
  [lex]: #lex
1574
  [lex.bool]: #lex.bool
1575
  [lex.ccon]: #lex.ccon
1576
  [lex.ccon.esc]: #lex.ccon.esc
1577
  [lex.ccon.literal]: #lex.ccon.literal
 
1578
  [lex.charset]: #lex.charset
1579
  [lex.charset.basic]: #lex.charset.basic
1580
  [lex.charset.literal]: #lex.charset.literal
1581
  [lex.comment]: #lex.comment
1582
  [lex.digraph]: #lex.digraph
@@ -1600,50 +1679,56 @@ int main() {
1600
  [lex.pptoken]: #lex.pptoken
1601
  [lex.separate]: #lex.separate
1602
  [lex.string]: #lex.string
1603
  [lex.string.concat]: #lex.string.concat
1604
  [lex.string.literal]: #lex.string.literal
 
1605
  [lex.token]: #lex.token
 
1606
  [module.import]: module.md#module.import
 
1607
  [module.unit]: module.md#module.unit
1608
  [over.literal]: over.md#over.literal
1609
  [support.types.layout]: support.md#support.types.layout
1610
  [temp.explicit]: temp.md#temp.explicit
 
1611
  [temp.names]: temp.md#temp.names
 
 
1612
 
1613
  [^1]: Implementations behave as if these separate phases occur, although
1614
  in practice different phases can be folded together.
1615
 
1616
- [^2]: A partial preprocessing token would arise from a source file
 
 
 
 
 
1617
  ending in the first portion of a multi-character token that requires
1618
  a terminating sequence of characters, such as a *header-name* that
1619
  is missing the closing `"` or `>`. A partial comment would arise
1620
  from a source file ending with an unclosed `/*` comment.
1621
 
1622
- [^3]: These include “digraphs” and additional reserved words. The term
1623
  “digraph” (token consisting of two characters) is not perfectly
1624
  descriptive, since one of the alternative *preprocessing-token*s is
1625
  `%:%:` and of course several primary tokens contain two characters.
1626
  Nonetheless, those alternative tokens that aren’t lexical keywords
1627
  are colloquially known as “digraphs”.
1628
 
1629
- [^4]: Thus the “stringized” values [[cpp.stringize]] of `[` and `<:`
1630
  will be different, maintaining the source spelling, but the tokens
1631
  can otherwise be freely interchanged.
1632
 
1633
- [^5]: Literals include strings and character and numeric literals.
1634
-
1635
- [^6]: Thus, a sequence of characters that resembles an escape sequence
1636
- can result in an error, be interpreted as the character
1637
- corresponding to the escape sequence, or have a completely different
1638
- meaning, depending on the implementation.
1639
 
1640
  [^7]: On systems in which linkers cannot accept extended characters, an
1641
  encoding of the *universal-character-name* can be used in forming
1642
  valid external identifiers. For example, some otherwise unused
1643
  character or sequence of characters can be used to encode the `\u` in
1644
  a *universal-character-name*. Extended characters can produce a
1645
  long external identifier, but C++ does not place a translation limit
1646
  on significant characters for external identifiers.
1647
 
1648
  [^8]: The term “literal” generally designates, in this document, those
1649
- tokens that are called “constants” in ISO C.
 
4
 
5
  The text of the program is kept in units called *source files* in this
6
  document. A source file together with all the headers [[headers]] and
7
  source files included [[cpp.include]] via the preprocessing directive
8
  `#include`, less any source lines skipped by any of the conditional
9
+ inclusion [[cpp.cond]] preprocessing directives, as modified by the
10
+ implementation-defined behavior of any
11
+ conditionally-supported-directives [[cpp.pre]] and pragmas
12
+ [[cpp.pragma]], if any, is called a *preprocessing translation unit*.
13
 
14
+ [*Note 1*: A C++ program need not all be translated at the same time.
15
+ Translation units can be separately translated and then later linked to
16
+ produce an executable program [[basic.link]]. — *end note*]
 
 
 
 
 
 
 
 
17
 
18
  ## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
19
 
20
  The precedence among the syntax rules of translation is specified by the
21
  following phases.[^1]
 
27
  *implementation-defined* manner that includes a means of designating
28
  input files as UTF-8 files, independent of their content.
29
  \[*Note 1*: In other words, recognizing the U+feff (byte order mark)
30
  is not sufficient. — *end note*] If an input file is determined to
31
  be a UTF-8 file, then it shall be a well-formed UTF-8 code unit
32
+ sequence and it is decoded to produce a sequence of Unicode[^2]
33
+ scalar values. A sequence of translation character set elements
34
+ [[lex.charset]] is then formed by mapping each Unicode scalar value
35
+ to the corresponding translation character set element. In the
36
+ resulting sequence, each pair of characters in the input sequence
37
+ consisting of U+000d (carriage return) followed by
38
+ U+000a (line feed), as well as each U+000d (carriage return) not
39
+ immediately followed by a U+000a (line feed), is replaced by a
40
+ single new-line character. For any other kind of input file
41
+ supported by the implementation, characters are mapped, in an
42
+ *implementation-defined* manner, to a sequence of translation
43
+ character set elements, representing end-of-line indicators as
44
+ new-line characters.
45
  2. If the first translation character is U+feff (byte order mark), it
46
+ is deleted. Each sequence comprising a backslash character (\\)
47
+ immediately followed by zero or more whitespace characters other
48
+ than new-line followed by a new-line character is deleted, splicing
49
+ physical source lines to form *logical source lines*. Only the last
50
+ backslash on any physical source line shall be eligible for being
51
+ part of such a splice. \[*Note 2*: Line splicing can form a
52
+ *universal-character-name* [[lex.charset]]. *end note*] A source
53
+ file that is not empty and that (after splicing) does not end in a
54
+ new-line character shall be processed as if an additional new-line
55
+ character were appended to the file.
 
56
  3. The source file is decomposed into preprocessing tokens
57
  [[lex.pptoken]] and sequences of whitespace characters (including
58
  comments). A source file shall not end in a partial preprocessing
59
+ token or in a partial comment.[^3] Each comment [[lex.comment]] is
60
+ replaced by one U+0020 (space) character. New-line characters are
61
+ retained. Whether each nonempty sequence of whitespace characters
62
+ other than new-line is retained or replaced by one U+0020 (space)
63
+ character is unspecified. As characters from the source file are
64
+ consumed to form the next preprocessing token (i.e., not being
65
+ consumed as part of a comment or other forms of whitespace), except
66
+ when matching a *c-char-sequence*, *s-char-sequence*,
67
+ *r-char-sequence*, *h-char-sequence*, or *q-char-sequence*,
68
+ *universal-character-name*s are recognized [[lex.universal.char]]
69
+ and replaced by the designated element of the translation character
70
+ set [[lex.charset]]. The process of dividing a source file’s
71
  characters into preprocessing tokens is context-dependent.
72
  \[*Example 1*: See the handling of `<` within a `#include`
73
+ preprocessing directive
74
+ [[lex.header]], [[cpp.include]]. *end example*]
75
+ 4. The source file is analyzed as a *preprocessing-file* [[cpp.pre]].
76
+ Preprocessing directives [[cpp]] are executed, macro invocations are
77
+ expanded [[cpp.replace]], and `_Pragma` unary operator expressions
78
+ are executed [[cpp.pragma.op]]. A `#include` preprocessing directive
79
+ [[cpp.include]] causes the named header or source file to be
80
+ processed from phase 1 through phase 4, recursively. All
81
+ preprocessing directives are then deleted. Whitespace characters
82
+ separating preprocessing tokens are no longer significant.
83
+ 5. For a sequence of two or more adjacent *string-literal*
84
+ preprocessing tokens, a common *encoding-prefix* is determined as
85
+ specified in [[lex.string]]. Each such *string-literal*
86
+ preprocessing token is then considered to have that common
87
+ *encoding-prefix*.
88
+ 6. Adjacent *string-literal* preprocessing tokens are concatenated
89
+ [[lex.string]].
90
+ 7. Each preprocessing token is converted into a token [[lex.token]].
91
  The resulting tokens constitute a *translation unit* and are
92
+ syntactically and semantically analyzed as a *translation-unit*
93
+ [[basic.link]] and translated.
94
+ \[*Note 3*: The process of analyzing and translating the tokens can
95
  occasionally result in one token being replaced by a sequence of
96
+ other tokens [[temp.names]]. — *end note*]
97
+ It is *implementation-defined* whether the sources for module units
98
+ and header units on which the current translation unit has an
99
+ interface dependency [[module.unit]], [[module.import]] are required
100
+ to be available.
101
+ \[*Note 4*: Source files, translation units and translated
102
+ translation units need not necessarily be stored as files, nor need
103
+ there be any one-to-one correspondence between these entities and
104
+ any external representation. The description is conceptual only, and
105
+ does not specify any particular implementation. — *end note*]
106
+ \[*Note 5*: Previously translated translation units can be preserved
107
+ individually or in libraries. The separate translation units of a
108
+ program communicate [[basic.link]] by (for example) calls to
109
+ functions whose names have external or module linkage, manipulation
110
+ of variables whose names have external or module linkage, or
111
+ manipulation of data files. — *end note*]
112
+ While the tokens constituting translation units are being analyzed
113
+ and translated, required instantiations are performed.
114
+ \[*Note 6*: This can include instantiations which have been
115
+ explicitly requested [[temp.explicit]]. *end note*]
116
+ The contexts from which instantiations may be performed are
117
+ determined by their respective points of instantiation
118
+ [[temp.point]].
119
+ \[*Note 7*: Other requirements in this document can further
120
+ constrain the context from which an instantiation can be performed.
121
+ For example, a constexpr function template specialization might have
122
+ a point of instantiation at the end of a translation unit, but its
123
+ use in certain constant expressions could require that it be
124
+ instantiated at an earlier point [[temp.inst]]. — *end note*]
125
+ Each instantiation results in new program constructs. The program is
126
  ill-formed if any instantiation fails.
127
+ During the analysis and translation of tokens, certain expressions
128
+ are evaluated [[expr.const]]. Constructs appearing at a program
129
+ point P are analyzed in a context where each side effect of
130
+ evaluating an expression E as a full-expression is complete if and
131
+ only if
132
+ - E is the expression corresponding to a
133
+ *consteval-block-declaration* [[dcl.pre]], and
134
+ - either that *consteval-block-declaration* or the template
135
+ definition from which it is instantiated is reachable from
136
+ [[module.reach]]
137
+ - P, or
138
+ - the point immediately following the *class-specifier* of the
139
+ outermost class for which P is in a complete-class context
140
+ [[class.mem.general]].
141
+
142
+ \[*Example 2*:
143
+ ``` cpp
144
+ class S {
145
+ class Incomplete;
146
+
147
+ class Inner {
148
+ void fn() {
149
+ /* p₁ */ Incomplete i; // OK
150
+ }
151
+ }; /* p₂ */
152
+
153
+ consteval {
154
+ define_aggregate(^^Incomplete, {});
155
+ }
156
+ }; /* p₃ */
157
+ ```
158
+
159
+ Constructs at p₁ are analyzed in a context where the side effect of
160
+ the call to `define_aggregate` is evaluated because
161
+ - E is the expression corresponding to a consteval block, and
162
+ - p₁ is in a complete-class context of `S` and the consteval block
163
+ is reachable from p₃.
164
+
165
+ — *end example*]
166
+ 8. Translated translation units are combined, and all external entity
167
+ references are resolved. Library components are linked to satisfy
168
+ external references to entities not defined in the current
169
+ translation. All such translator output is collected into a program
170
+ image which contains information needed for execution in its
171
  execution environment.
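
Two of the phases above have effects that are easy to observe from source code: line splicing in phase 2 and string literal concatenation in phases 5 and 6. The macro names in this sketch are hypothetical.

```cpp
// Phase 2: the backslash and new-line are deleted, so the two physical lines
// below form a single logical line and therefore a single directive.
#define SPLICED_SUM(a, b) \
  ((a) + (b))
static_assert(SPLICED_SUM(2, 3) == 5, "the macro body continued across lines");

// Phases 5 and 6 run after macro expansion in phase 4, so a literal produced
// by a macro is concatenated with its neighbour.
#define GREETING "hello, "
static_assert(sizeof(GREETING "world") == 13, "12 characters plus the null");
```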
172
 
173
+ ## Characters <a id="lex.char">[[lex.char]]</a>
174
+
175
+ ### Character sets <a id="lex.charset">[[lex.charset]]</a>
176
 
177
  The *translation character set* consists of the following elements:
178
 
179
+ - each abstract character assigned a code point in the Unicode codespace
180
+ as specified in the Unicode Standard, and
181
  - a distinct character for each Unicode scalar value not assigned to an
182
  abstract character.
183
 
184
  [*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
185
  (hexadecimal). A surrogate code point is a value in the range
186
  [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
187
  that is not a surrogate code point. — *end note*]
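
The ranges in the note translate directly into predicates. This is a minimal sketch; the function names are hypothetical.

```cpp
// Code points are the integers in [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(char32_t c) { return c <= 0x10FFFF; }

// Surrogate code points are the values in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_unicode_scalar_value(char32_t c) {
  return is_code_point(c) && !is_surrogate(c);
}

static_assert(is_unicode_scalar_value(0x41), "U+0041 is a scalar value");
static_assert(!is_unicode_scalar_value(0xD800), "surrogates are excluded");
```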
188
 
189
  The *basic character set* is a subset of the translation character set,
190
+ consisting of 99 characters as specified in [[lex.charset.basic]].
191
 
192
  [*Note 2*: Unicode short names are given only as a means to identifying
193
  the character; the numerical value has no other meaning in this
194
  context. — *end note*]
195
 
 
203
  | `U+0020` | space | |
204
  | `U+000a` | line feed | new-line |
205
  | `U+0021` | exclamation mark | `!` |
206
  | `U+0022` | quotation mark | `"` |
207
  | `U+0023` | number sign | `#` |
208
+ | `U+0024` | dollar sign | `$` |
209
  | `U+0025` | percent sign | `%` |
210
  | `U+0026` | ampersand | `&` |
211
  | `U+0027` | apostrophe | `'` |
212
  | `U+0028` | left parenthesis | `(` |
213
  | `U+0029` | right parenthesis | `)` |
 
222
  | `U+003b` | semicolon | `;` |
223
  | `U+003c` | less-than sign | `<` |
224
  | `U+003d` | equals sign | `=` |
225
  | `U+003e` | greater-than sign | `>` |
226
  | `U+003f` | question mark | `?` |
227
+ | `U+0040` | commercial at | `@` |
228
  | `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
229
  | | | `N O P Q R S T U V W X Y Z` |
230
  | `U+005b` | left square bracket | `[` |
231
  | `U+005c` | reverse solidus | `\` |
232
  | `U+005d` | right square bracket | `]` |
233
  | `U+005e` | circumflex accent | `^` |
234
  | `U+005f` | low line | `_` |
235
+ | `U+0060` | grave accent | `` ` `` |
236
  | `U+0061` .. `U+007a` | latin small letter a .. z | `a b c d e f g h i j k l m` |
237
  | | | `n o p q r s t u v w x y z` |
238
  | `U+007b` | left curly bracket | `{` |
239
  | `U+007c` | vertical line | `|` |
240
  | `U+007d` | right curly bracket | `}` |
241
  | `U+007e` | tilde | `~` |
242
 
243
 
244
  The *basic literal character set* consists of all characters of the
245
  basic character set, plus the control characters specified in
246
  [[lex.charset.literal]].
247
 
248
  **Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
 
268
  A literal encoding or a locale-specific encoding of one of the execution
269
  character sets [[character.seq]] encodes each element of the basic
270
  literal character set as a single code unit with non-negative value,
271
  distinct from the code unit for any other such element.
272
 
273
+ [*Note 3*: A character not in the basic literal character set can be
274
  encoded with more than one code unit; the value of such a code unit can
275
  be the same as that of a code unit for an element of the basic literal
276
  character set. — *end note*]
277
 
278
  The U+0000 (null) character is encoded as the value `0`. No other
279
  element of the translation character set is encoded with a code unit of
280
  value `0`. The code unit value of each decimal digit character after the
281
  digit `0` (`U+0030`) shall be one greater than the value of the
282
  previous. The ordinary and wide literal encodings are otherwise
283
  *implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
284
+ implementation shall encode the Unicode scalar value corresponding to
285
+ each character of the translation character set as specified in the
286
+ Unicode Standard for the respective Unicode encoding form.
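
The guarantees in the paragraph above are what make the usual digit arithmetic portable. In the following sketch `digit_value` is a hypothetical helper.

```cpp
// U+0000 is encoded as the value 0.
static_assert('\0' == 0 && L'\0' == 0, "null is encoded as 0");

// Each decimal digit after 0 is one greater than the previous one.
static_assert('1' == '0' + 1 && '9' == '0' + 9, "ordinary literal encoding");
static_assert(L'9' - L'0' == 9, "wide literal encoding");

// Hypothetical helper that relies on the digits being consecutive.
constexpr int digit_value(char c) { return c - '0'; }
static_assert(digit_value('7') == 7, "");

// Basic literal character set elements are non-negative and distinct.
static_assert('a' >= 0 && 'a' != 'A', "");
```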
287
+
288
+ ### Universal character names <a id="lex.universal.char">[[lex.universal.char]]</a>
289
+
290
+ ``` bnf
291
+ n-char:
292
+ any member of the translation character set except the U+007d (right curly bracket) or new-line character
293
+ ```
294
+
295
+ ``` bnf
296
+ n-char-sequence:
297
+ n-char n-char-sequenceₒₚₜ
298
+ ```
299
+
300
+ ``` bnf
301
+ named-universal-character:
302
+ '\N{' n-char-sequence '}'
303
+ ```
304
+
305
+ ``` bnf
306
+ hex-quad:
307
+ hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
308
+ ```
309
+
310
+ ``` bnf
311
+ simple-hexadecimal-digit-sequence:
312
+ hexadecimal-digit simple-hexadecimal-digit-sequenceₒₚₜ
313
+ ```
314
+
315
+ ``` bnf
316
+ universal-character-name:
317
+ '\u' hex-quad
318
+ '\U' hex-quad hex-quad
319
+ '\u{' simple-hexadecimal-digit-sequence '}'
320
+ named-universal-character
321
+ ```
322
+
323
+ The *universal-character-name* construct provides a way to name any
324
+ element in the translation character set using just the basic character
325
+ set. If a *universal-character-name* outside the *c-char-sequence*,
326
+ *s-char-sequence*, or *r-char-sequence* of a *character-literal* or
327
+ *string-literal* (in either case, including within a
328
+ *user-defined-literal*) corresponds to a control character or to a
329
+ character in the basic character set, the program is ill-formed.
330
+
331
+ [*Note 1*: A sequence of characters resembling a
332
+ *universal-character-name* in an *r-char-sequence* [[lex.string]] does
333
+ not form a *universal-character-name*. — *end note*]
334
+
335
+ A *universal-character-name* of the form `\u` *hex-quad*, `\U`
336
+ *hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
337
+ designates the character in the translation character set whose Unicode
338
+ scalar value is the hexadecimal number represented by the sequence of
339
+ *hexadecimal-digit*s in the *universal-character-name*. The program is
340
+ ill-formed if that number is not a Unicode scalar value.
341
+
342
+ A *universal-character-name* that is a *named-universal-character*
343
+ designates the corresponding character in the Unicode Standard (chapter
344
+ 4.8 Name) if the *n-char-sequence* is equal to its character name or to
345
+ one of its character name aliases of type “control”, “correction”, or
346
+ “alternate”; otherwise, the program is ill-formed.
347
+
348
+ [*Note 2*: These aliases are listed in the Unicode Character Database’s
349
+ `NameAliases.txt`. None of these names or aliases have leading or
350
+ trailing spaces. — *end note*]
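
A brief sketch of the `\u` *hex-quad* and `\U` *hex-quad* *hex-quad* forms. UTF-16 and UTF-32 character literals are used so that the results do not depend on the implementation's ordinary literal encoding. The braced and named forms are left out here because they need a compiler that already implements them.

```cpp
// The digits of the universal-character-name give the Unicode scalar value.
static_assert(U'\u00e9' == 0xE9, "four hexadecimal digits");
static_assert(U'\U0001F600' == 0x1F600, "eight hexadecimal digits");
static_assert(u'\u20AC' == 0x20AC, "fits in a single UTF-16 code unit");

// Inside a literal, even a basic character may be written this way.
static_assert(U'\u0041' == U'A', "U+0041 is latin capital letter a");
```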
351
+
352
+ ## Comments <a id="lex.comment">[[lex.comment]]</a>
353
+
354
+ The characters `/*` start a comment, which terminates with the
355
+ characters `*/`. These comments do not nest. The characters `//` start a
356
+ comment, which terminates immediately before the next new-line
357
+ character.
358
+
359
+ [*Note 1*: The comment characters `//`, `/*`, and `*/` have no special
360
+ meaning within a `//` comment and are treated just like other
361
+ characters. Similarly, the comment characters `//` and `/*` have no
362
+ special meaning within a `/*` comment. — *end note*]
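
The note can be illustrated with two declarations whose comments contain the other comment style's opening characters. The variable names are hypothetical.

```cpp
// A `//` inside a block comment does not start a line comment, so the
// expression continues after the closing `*/`.
constexpr int after_block = 1 /* a // here is just text */ + 1;

// A `/*` inside a line comment does not open a block comment, so the next
// declaration is still seen by the compiler.
constexpr int after_line = 3;  // a /* here is just text
constexpr int next_line = after_line + 1;

static_assert(after_block == 2, "the + 1 was not commented out");
static_assert(next_line == 4, "the following line was not swallowed");
```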
363
 
364
  ## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
365
 
366
  ``` bnf
367
  preprocessing-token:
 
377
  user-defined-string-literal
378
  preprocessing-op-or-punc
379
  each non-whitespace character that cannot be one of the above
380
  ```
381
 
 
 
 
 
382
  A preprocessing token is the minimal lexical element of the language in
383
  translation phases 3 through 6. In this document, glyphs are used to
384
  identify elements of the basic character set [[lex.charset]]. The
385
  categories of preprocessing token are: header names, placeholder tokens
386
  produced by preprocessing `import` and `module` directives
387
  (*import-keyword*, *module-keyword*, and *export-keyword*), identifiers,
388
  preprocessing numbers, character literals (including user-defined
389
  character literals), string literals (including user-defined string
390
  literals), preprocessing operators and punctuators, and single
391
  non-whitespace characters that do not lexically match the other
392
+ preprocessing token categories. If a U+0027 (apostrophe), a
393
+ U+0022 (quotation mark), or any character not in the basic character set
 
394
  matches the last category, the program is ill-formed. Preprocessing
395
  tokens can be separated by whitespace; this consists of comments
396
  [[lex.comment]], or whitespace characters (U+0020 (space),
397
  U+0009 (character tabulation), new-line, U+000b (line tabulation), and
398
  U+000c (form feed)), or both. As described in [[cpp]], in certain
 
400
  thereof) serves as more than preprocessing token separation. Whitespace
401
  can appear within a preprocessing token only as part of a header name or
402
  between the quotation characters in a character literal or string
403
  literal.
404
 
405
+ Each preprocessing token that is converted to a token [[lex.token]]
406
+ shall have the lexical form of a keyword, an identifier, a literal, or
407
+ an operator or punctuator.
408
+
409
+ The *import-keyword* is produced by processing an `import` directive
410
+ [[cpp.import]], the *module-keyword* is produced by preprocessing a
411
+ `module` directive [[cpp.module]], and the *export-keyword* is produced
412
+ by preprocessing either of the previous two directives.
413
+
414
+ [*Note 1*: None has any observable spelling. — *end note*]
415
+
416
  If the input stream has been parsed into preprocessing tokens up to a
417
  given character:
418
 
419
  - If the next character begins a sequence of characters that could be
420
  the prefix and initial double quote of a raw string literal, such as
 
430
  ```
431
  - Otherwise, if the next three characters are `<::` and the subsequent
432
  character is neither `:` nor `>`, the `<` is treated as a
433
  preprocessing token by itself and not as the first character of the
434
  alternative token `<:`.
435
+ - Otherwise, if the next three characters are `[::` and the subsequent
436
+ character is not `:`, or if the next three characters are `[:>`, the
437
+ `[` is treated as a preprocessing token by itself and not as the first
438
+ character of the preprocessing token `[:`. \[*Note 2*: The tokens `[:`
439
+ and `:]` cannot be composed from digraphs. — *end note*]
440
  - Otherwise, the next preprocessing token is the longest sequence of
441
  characters that could constitute a preprocessing token, even if that
442
+ would cause further lexical analysis to fail, except that
443
+ - a *string-literal* token is never formed when a *header-name* token
444
+ can be formed, and
445
+ - a *header-name* [[lex.header]] is only formed
446
+ - immediately after the `include`, `embed`, or `import`
447
+ preprocessing token in a `#include` [[cpp.include]], `#embed`
448
+ [[cpp.embed]], or `import` [[cpp.import]] directive, respectively,
449
+ or
450
+ - immediately after a preprocessing token sequence of
451
+ `__has_include` or `__has_embed` immediately followed by `(` in a
452
+ `#if`, `#elif`, or `#embed` directive [[cpp.cond]], [[cpp.embed]].
453
 
454
  [*Example 1*:
455
 
456
  ``` cpp
457
  #define R "x"
458
  const char* s = R"y"; // ill-formed raw string, not "x" "y"
459
  ```
460
 
461
  — *end example*]
462
 
463
  [*Example 2*: The program fragment `0xe+foo` is parsed as a
464
  preprocessing number token (one that is not a valid *integer-literal* or
465
  *floating-point-literal* token), even though a parse as three
466
  preprocessing tokens `0xe`, `+`, and `foo` can produce a valid
467
  expression (for example, if `foo` is a macro defined as `1`). Similarly,
 
472
  [*Example 3*: The program fragment `x+++++y` is parsed as `x
473
  ++ ++ + y`, which, if `x` and `y` have integral types, violates a
474
  constraint on increment operators, even though the parse `x ++ + ++ y`
475
  can yield a correct expression. — *end example*]
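The longest-sequence rule and the `<::` exception can both be observed in ordinary code. The following is a minimal sketch; the names `munch`, `Box`, and `ns::Tag` are illustrative and do not appear in this document.

```cpp
namespace ns { struct Tag {}; }          // hypothetical type for illustration

template <class T> struct Box {};        // hypothetical template for illustration

// `x+++y` is tokenized as `x`, `++`, `+`, `y`, that is, (x++) + y.
constexpr int munch(int x, int y) {
    int sum = x+++y;      // adds the old value of x to y, then increments x
    return sum * 10 + x;  // pack both results into one value for checking
}

static_assert(munch(1, 2) == 32, "sum is 3 and x is 2 afterwards");

// `<::` followed by a character other than `:` or `>`: the `<` is a token by
// itself, so this reads as Box< ::ns::Tag > and not as the digraph `<:`.
Box<::ns::Tag> box;
```

Without the `<::` exception, the declaration of `box` would begin with the alternative token `<:` (equivalent to `[`) and fail to parse.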
476
 
477
  ## Header names <a id="lex.header">[[lex.header]]</a>
478
 
479
  ``` bnf
480
  header-name:
481
  '<' h-char-sequence '>'
482
  '"' q-char-sequence '"'
483
  ```
484
 
485
  ``` bnf
486
  h-char-sequence:
487
+ h-char h-char-sequenceₒₚₜ
 
488
  ```
489
 
490
  ``` bnf
491
  h-char:
492
  any member of the translation character set except new-line and U+003e (greater-than sign)
493
  ```
494
 
495
  ``` bnf
496
  q-char-sequence:
497
+ q-char q-char-sequenceₒₚₜ
 
498
  ```
499
 
500
  ``` bnf
501
  q-char:
502
  any member of the translation character set except new-line and U+0022 (quotation mark)
503
  ```
504
 
 
 
 
 
 
505
  The sequences in both forms of *header-name*s are mapped in an
506
  *implementation-defined* manner to headers or to external source file
507
  names as specified in  [[cpp.include]].
508
 
509
+ [*Note 1*: Header name preprocessing tokens appear only within a
510
+ `#include` preprocessing directive, a `__has_include` preprocessing
511
+ expression, or after certain occurrences of an `import` token (see 
512
+ [[lex.pptoken]]). — *end note*]
513
+
514
  The appearance of either of the characters `'` or `\` or of either of
515
  the character sequences `/*` or `//` in a *q-char-sequence* or an
516
  *h-char-sequence* is conditionally-supported with
517
  *implementation-defined* semantics, as is the appearance of the
518
+ character `"` in an *h-char-sequence*.
519
+
520
+ [*Note 2*: Thus, a sequence of characters that resembles an escape
521
+ sequence can result in an error, be interpreted as the character
522
+ corresponding to the escape sequence, or have a completely different
523
+ meaning, depending on the implementation. — *end note*]
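As a small sketch of where a *header-name* is formed, the fragment below uses one inside a `__has_include` expression and one inside `#include`. It assumes a hosted implementation that provides `<cstddef>`; the variable name `have_cstddef` is illustrative.

```cpp
// `<cstddef>` is lexed as a single header-name only because it follows
// `__has_include (` in an `#if` directive or `include` in `#include`.
#if __has_include(<cstddef>)
#  include <cstddef>
constexpr bool have_cstddef = true;
#else
constexpr bool have_cstddef = false;
#endif

static_assert(have_cstddef, "expected on a hosted implementation");
```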
524
 
525
  ## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
526
 
527
  ``` bnf
528
  pp-number:
 
544
 
545
  A preprocessing number does not have a type or a value; it acquires both
546
  after a successful conversion to an *integer-literal* token or a
547
  *floating-point-literal* token.
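That a preprocessing number is lexed as one unit, before any conversion to a literal, can be made visible with stringizing. This is a sketch under the assumption of C++17 or later; the macros `STR`, `XSTR`, and `FOO` are hypothetical helpers, not part of this document.

```cpp
#include <string_view>

#define STR(x)  #x        // stringize the argument as written
#define XSTR(x) STR(x)    // macro-expand the argument first, then stringize
#define FOO 1             // hypothetical macro playing the role of `foo`

// On its own, FOO is an identifier and is replaced before stringizing.
static_assert(std::string_view(XSTR(FOO)) == "1");

// `0xe+FOO` is a single pp-number (the `e+` continues it), so the FOO inside
// it is never seen as a macro name and the spelling survives unchanged.
static_assert(std::string_view(XSTR(0xe+FOO)) == "0xe+FOO");

// With whitespace the pp-number ends at `0xe`, giving a valid expression.
static_assert(0xe + FOO == 15);
```

Using `0xe+FOO` directly in an expression would be ill-formed, because that pp-number cannot be converted to an *integer-literal* or a *floating-point-literal*.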
548
 
549
+ ## Operators and punctuators <a id="lex.operators">[[lex.operators]]</a>
550
+
551
+ The lexical representation of C++ programs includes a number of
552
+ preprocessing tokens that are used in the syntax of the preprocessor or
553
+ are converted into tokens for operators and punctuators:
554
+
555
+ ``` bnf
556
+ preprocessing-op-or-punc:
557
+ preprocessing-operator
558
+ operator-or-punctuator
559
+ ```
560
+
561
+ ``` bnf
564
+ preprocessing-operator: one of
565
+ '# ## %: %:%:'
566
+ ```
567
+
568
+ ``` bnf
569
+ operator-or-punctuator: one of
570
+ '{ } [ ] ( ) [: :]'
571
+ '<% %> <: :> ; : ...'
572
+ '? :: . .* -> ->* ^^ ~'
573
+ '! + - * / % ^ & |'
574
+ '= += -= *= /= %= ^= &= |='
575
+ '== != < > <= >= <=> && ||'
576
+ '<< >> <<= >>= ++ -- ,'
577
+ 'and or xor not bitand bitor compl'
578
+ 'and_eq or_eq xor_eq not_eq'
579
+ ```
580
+
581
+ Each *operator-or-punctuator* is converted to a single token in
582
+ translation phase 7 [[lex.phases]].
583
+
584
+ ## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
585
+
586
+ Alternative token representations are provided for some operators and
587
+ punctuators.[^4]
588
+
589
+ In all respects of the language, each alternative token behaves the
590
+ same, respectively, as its primary token, except for its spelling.[^5]
591
+
592
+ The set of alternative tokens is defined in [[lex.digraph]].
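A brief sketch of the equivalence follows, assuming C++17 or later. The function names and the `STR` macro are illustrative only.

```cpp
#include <string_view>

#define STR(x) #x   // hypothetical helper: stringize the argument

// Alternative spellings behave exactly like their primary tokens.
constexpr int third() {
    int a<:3:> = <% 1, 2, 3 %>;   // same as: int a[3] = { 1, 2, 3 };
    return a<:2:>;                // same as: a[2]
}
static_assert(third() == 3);

constexpr bool only_first(bool p, bool q) { return p and not q; }  // p && !q
static_assert(only_first(true, false));

// Only the spelling differs, which stringizing makes visible.
static_assert(std::string_view(STR(<:)) == "<:");
static_assert(std::string_view(STR([)) == "[");
```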
593
+
594
+ ## Tokens <a id="lex.token">[[lex.token]]</a>
595
+
596
+ ``` bnf
597
+ token:
598
+ identifier
599
+ keyword
600
+ literal
601
+ operator-or-punctuator
602
+ ```
603
+
604
+ There are five kinds of tokens: identifiers, keywords, literals,[^6]
605
+ operators, and other separators. Comments and the characters
607
+ U+0020 (space), U+0009 (character tabulation), U+000b (line tabulation),
608
+ U+000c (form feed), and new-line (collectively, “whitespace”), as
609
+ described below, are ignored except as they serve to separate tokens.
610
+
611
+ [*Note 1*: Whitespace can separate otherwise adjacent identifiers,
612
+ keywords, numeric literals, and alternative tokens containing alphabetic
613
+ characters. — *end note*]
614
+
615
  ## Identifiers <a id="lex.name">[[lex.name]]</a>
616
 
617
  ``` bnf
618
  identifier:
619
  identifier-start
 
646
  '0 1 2 3 4 5 6 7 8 9'
647
  ```
648
 
649
  [*Note 1*:
650
 
651
+ The character properties XID_Start and XID_Continue are described by UAX
652
+ \#44 of the Unicode Standard.[^7]
653
 
654
  — *end note*]
655
 
656
  The program is ill-formed if an *identifier* does not conform to
657
  Normalization Form C as specified in the Unicode Standard.
658
 
659
  [*Note 2*: Identifiers are case-sensitive. — *end note*]
660
 
661
+ [*Note 3*: [[uaxid]] compares the requirements of UAX \#31 of the
662
+ Unicode Standard with the C++ rules for identifiers. — *end note*]
663
+
664
+ [*Note 4*: In translation phase 4, *identifier* also includes those
665
  *preprocessing-token*s [[lex.pptoken]] differentiated as keywords
666
  [[lex.key]] in the later translation phase 7
667
  [[lex.token]]. — *end note*]
668
 
669
  The identifiers in [[lex.name.special]] have a special meaning when
 
676
  In addition, some identifiers appearing as a *token* or
677
  *preprocessing-token* are reserved for use by C++ implementations and
678
  shall not be used otherwise; no diagnostic is required.
679
 
680
  - Each identifier that contains a double underscore `__` or begins with
681
+ an underscore followed by an uppercase letter, other than those
682
+ specified in this document (for example, `__cplusplus`
683
+ [[cpp.predefined]]), is reserved to the implementation for any use.
684
  - Each identifier that begins with an underscore is reserved to the
685
  implementation for use as a name in the global namespace.
686
 
687
  ## Keywords <a id="lex.key">[[lex.key]]</a>
688
 
 
710
  | | | | | | |
711
  | -------- | -------- | -------- | ------- | -------- | ----- |
712
  | `and` | `and_eq` | `bitand` | `bitor` | `compl` | `not` |
713
  | `not_eq` | `or` | `or_eq` | `xor` | `xor_eq` | |
714
 
715
  ## Literals <a id="lex.literal">[[lex.literal]]</a>
716
 
717
  ### Kinds of literals <a id="lex.literal.kinds">[[lex.literal.kinds]]</a>
718
 
719
  There are several kinds of literals.[^8]
 
829
  'z Z'
830
  ```
831
 
832
  In an *integer-literal*, the sequence of *binary-digit*s,
833
  *octal-digit*s, *digit*s, or *hexadecimal-digit*s is interpreted as a
834
+ base N integer as shown in [[lex.icon.base]]; the lexically first digit
835
+ of the sequence of digits is the most significant.
836
 
837
  [*Note 1*: The prefix and any optional separating single quotes are
838
  ignored when determining the value. — *end note*]
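For illustration, the same value can be written in each base. The checks below are a minimal sketch; the variable name `ten_in_octal` is illustrative.

```cpp
// The value ten in each base. The first digit is the most significant,
// and the prefix only selects the base.
static_assert(0b1010 == 10, "binary");
static_assert(012    == 10, "octal");
static_assert(10     == 10, "decimal");
static_assert(0xA    == 10, "hexadecimal");

// Separating single quotes are ignored when determining the value.
static_assert(1'000'000 == 1000000, "separators in decimal");
static_assert(0b1111'0000 == 0xF0, "separators in binary");

constexpr int ten_in_octal = 012;  // a leading 0 makes this octal, not decimal
```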
839
 
840
  **Table: Base of *integer-literal*s** <a id="lex.icon.base">[lex.icon.base]</a>
 
887
  | | | `std::size_t` |
888
  | Both `u` or `U` and `z` or `Z` | `std::size_t` | `std::size_t` |
890
 
891
 
892
+ Except for *integer-literal*s containing a *size-suffix*, if the value
893
+ of an *integer-literal* cannot be represented by any type in its list
894
  and an extended integer type [[basic.fundamental]] can represent its
895
  value, it may have that extended integer type. If all of the types in
896
  the list for the *integer-literal* are signed, the extended integer type
897
+ is signed. If all of the types in the list for the *integer-literal* are
898
+ unsigned, the extended integer type is unsigned. If the list contains
899
+ both signed and unsigned types, the extended integer type may be signed
900
+ or unsigned. If an *integer-literal* cannot be represented by any of the
901
+ allowed types, the program is ill-formed.
902
+
903
+ [*Note 2*: An *integer-literal* with a `z` or `Z` suffix is ill-formed
904
+ if it cannot be represented by `std::size_t`. — *end note*]
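The effect of suffixes and of the value on the type of an *integer-literal* can be checked at compile time. This sketch assumes C++17 or later; the last check is guarded because it only applies where `int` is 32 bits wide.

```cpp
#include <type_traits>

// The suffix narrows the list of candidate types.
static_assert(std::is_same_v<decltype(1),    int>);
static_assert(std::is_same_v<decltype(1u),   unsigned int>);
static_assert(std::is_same_v<decltype(1l),   long>);
static_assert(std::is_same_v<decltype(1ull), unsigned long long>);

// An unsuffixed decimal literal only ever gets a signed type, even when
// its value does not fit in int.
static_assert(std::is_signed_v<decltype(2147483648)>);

// An unsuffixed hexadecimal literal may also get an unsigned type.
// Assumption: this check is meaningful only where int has 32 bits.
static_assert(sizeof(int) != 4 ||
              std::is_same_v<decltype(0x80000000), unsigned int>);
```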
905
 
906
  ### Character literals <a id="lex.ccon">[[lex.ccon]]</a>
907
 
908
  ``` bnf
909
  character-literal:
 
915
  'u8' 'u' 'U' 'L'
916
  ```
917
 
918
  ``` bnf
919
  c-char-sequence:
920
+ c-char c-char-sequenceₒₚₜ
 
921
  ```
922
 
923
  ``` bnf
924
  c-char:
925
  basic-c-char
 
956
  hexadecimal-escape-sequence
957
  ```
958
 
959
  ``` bnf
960
  simple-octal-digit-sequence:
961
+ octal-digit simple-octal-digit-sequenceₒₚₜ
 
962
  ```
963
 
964
  ``` bnf
965
  octal-escape-sequence:
966
  '\' octal-digit
 
983
  ``` bnf
984
  conditional-escape-sequence-char:
985
  any member of the basic character set that is not an octal-digit, a simple-escape-sequence-char, or the characters 'N', 'o', 'u', 'U', or 'x'
986
  ```
987
 
988
+ A *multicharacter literal* is a *character-literal* whose
989
+ *c-char-sequence* consists of more than one *c-char*. A multicharacter
990
+ literal shall not have an *encoding-prefix*. If a multicharacter literal
991
+ contains a *c-char* that is not encodable as a single code unit in the
992
+ ordinary literal encoding, the program is ill-formed. Multicharacter
993
+ literals are conditionally-supported.
 
 
 
994
 
995
  The kind of a *character-literal*, its type, and its associated
996
  character encoding [[lex.charset]] are determined by its
997
  *encoding-prefix* and its *c-char-sequence* as defined by
998
+ [[lex.ccon.literal]].
 
 
 
 
 
 
 
 
 
999
 
1000
  **Table: Character literals** <a id="lex.ccon.literal">[lex.ccon.literal]</a>
1001
 
1002
+ | Encoding prefix | Kind | Type | Associated character encoding | Example |
1003
+ | --------------- | -------------------------- | ---------- | ------------------------------- | ------- |
1004
+ | none | ordinary character literal | `char` | ordinary literal encoding | `'v'` |
1005
  | `L` | wide character literal | `wchar_t` | wide literal encoding | `L'w'` |
1007
  | `u8` | UTF-8 character literal | `char8_t` | UTF-8 | `u8'x'` |
1008
  | `u` | UTF-16 character literal | `char16_t` | UTF-16 | `u'y'` |
1009
  | `U` | UTF-32 character literal | `char32_t` | UTF-32 | `U'z'` |
1010
 
1011
 
1012
  In translation phase 4, the value of a *character-literal* is determined
1013
  using the range of representable values of the *character-literal*’s
1014
+ type in translation phase 7. A multicharacter literal has an
1015
+ *implementation-defined* value. The value of any other kind of
1016
+ *character-literal* is determined as follows:
1017
 
1018
  - A *character-literal* with a *c-char-sequence* consisting of a single
1019
  *basic-c-char*, *simple-escape-sequence*, or
1020
  *universal-character-name* is the code unit value of the specified
1021
  character as encoded in the literal’s associated character encoding.
1022
+ If the specified character lacks representation in the literal’s
1023
+ associated character encoding or if it cannot be encoded as a single
1024
+ code unit, then the program is ill-formed.
 
1025
  - A *character-literal* with a *c-char-sequence* consisting of a single
1026
  *numeric-escape-sequence* has a value as follows:
1027
  - Let v be the integer value represented by the octal number
1028
  comprising the sequence of *octal-digit*s in an
1029
  *octal-escape-sequence* or by the hexadecimal number comprising the
 
1034
  or `L`, and v does not exceed the range of representable values of
1035
  the corresponding unsigned type for the underlying type of the
1036
  *character-literal*’s type, then the value is the unique value of
1037
  the *character-literal*’s type `T` that is congruent to v modulo 2ᴺ,
1038
  where N is the width of `T`.
1039
+ - Otherwise, the program is ill-formed.
1040
  - A *character-literal* with a *c-char-sequence* consisting of a single
1041
  *conditional-escape-sequence* is conditionally-supported and has an
1042
  *implementation-defined* value.
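The type and value rules for single-character literals can be sketched as compile-time checks, assuming C++17 or later. The `u8` prefix is omitted here because its type depends on the language version in use; the name `e_acute` is illustrative.

```cpp
#include <type_traits>

// The encoding-prefix selects the type of the literal.
static_assert(std::is_same_v<decltype('v'),  char>);
static_assert(std::is_same_v<decltype(L'w'), wchar_t>);
static_assert(std::is_same_v<decltype(u'y'), char16_t>);
static_assert(std::is_same_v<decltype(U'z'), char32_t>);

// A numeric escape sequence gives the value directly, independent of the
// literal encoding: octal 101 and hexadecimal 41 are both 65.
static_assert('\101' == 65);
static_assert('\x41' == 65);

// A universal-character-name yields the code unit of that character in the
// associated encoding; for UTF-32 that is the code point itself.
constexpr char32_t e_acute = U'\u00E9';
static_assert(e_acute == 0xE9);
```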
1043
 
1044
  The character specified by a *simple-escape-sequence* is specified in
1045
  [[lex.ccon.esc]].
1046
 
1047
+ [*Note 1*: Using an escape sequence for a question mark is supported
1048
+ for compatibility with C++14 and C. — *end note*]
1049
 
1050
  **Table: Simple escape sequences** <a id="lex.ccon.esc">[lex.ccon.esc]</a>
1051
 
1052
  | character | | *simple-escape-sequence* |
1053
  | --------- | -------------------- | ------------------------ |
 
1184
  encoding-prefixₒₚₜ 'R' raw-string
1185
  ```
1186
 
1187
  ``` bnf
1188
  s-char-sequence:
1189
+ s-char s-char-sequenceₒₚₜ
 
1190
  ```
1191
 
1192
  ``` bnf
1193
  s-char:
1194
  basic-s-char
 
1207
  '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
1208
  ```
1209
 
1210
  ``` bnf
1211
  r-char-sequence:
1212
+ r-char r-char-sequenceₒₚₜ
 
1213
  ```
1214
 
1215
  ``` bnf
1216
  r-char:
1217
  any member of the translation character set, except a U+0029 (right parenthesis) followed by
1218
  the initial *d-char-sequence* (which may be empty) followed by a U+0022 (quotation mark)
1219
  ```
1220
 
1221
  ``` bnf
1222
  d-char-sequence:
1223
+ d-char d-char-sequenceₒₚₜ
 
1224
  ```
1225
 
1226
  ``` bnf
1227
  d-char:
1228
  any member of the basic character set except:
 
1231
  ```
1232
 
1233
  The kind of a *string-literal*, its type, and its associated character
1234
  encoding [[lex.charset]] are determined by its encoding prefix and
1235
  sequence of *s-char*s or *r-char*s as defined by [[lex.string.literal]]
1236
+ where n is the number of encoded code units that would result from an
1237
+ evaluation of the *string-literal* (see below).
1238
 
1239
  **Table: String literals** <a id="lex.string.literal">[lex.string.literal]</a>
1240
 
1241
+ | Encoding prefix | Kind | Type | Associated character encoding | Examples |
1242
+ | ----------------- | ----------------------- | ----------------------------- | ----------------------------- | ---------------------------------------------- |
1243
  | none | ordinary string literal | array of $n$ `const char` | ordinary literal encoding | `"ordinary string"` `R"(ordinary raw string)"` |
1244
  | `L` | wide string literal | array of $n$ `const wchar_t` | wide literal encoding | `L"wide string"` `LR"w(wide raw string)w"` |
1245
  | `u8` | UTF-8 string literal | array of $n$ `const char8_t` | UTF-8 | `u8"UTF-8 string"` `u8R"x(UTF-8 raw string)x"` |
1246
  | `u` | UTF-16 string literal | array of $n$ `const char16_t` | UTF-16 | `u"UTF-16 string"` `uR"y(UTF-16 raw string)y"` |
1247
  | `U` | UTF-32 string literal | array of $n$ `const char32_t` | UTF-32 | `U"UTF-32 string"` `UR"z(UTF-32 raw string)z"` |
 
1251
  literal*. The *d-char-sequence* serves as a delimiter. The terminating
1252
  *d-char-sequence* of a *raw-string* is the same sequence of characters
1253
  as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
1254
  at most 16 characters.
1255
 
1256
+ [*Note 1*: The characters `'('` and `')'` can appear in a *raw-string*.
1257
+ Thus, `R"delimiter((a|b))delimiter"` is equivalent to
1258
  `"(a|b)"`. — *end note*]
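The equivalence in the note can be verified directly. The sketch below assumes C++17 or later for `std::string_view`; the variable name `path` is illustrative.

```cpp
#include <string_view>

// The delimiter lets `)` appear inside the raw string.
static_assert(std::string_view(R"delimiter((a|b))delimiter") == "(a|b)");

// Backslashes and quotation marks are ordinary characters in a raw string.
constexpr std::string_view path = R"(C:\temp\"x")";
static_assert(path == "C:\\temp\\\"x\"");
static_assert(path.size() == 11);
```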
1259
 
1260
  [*Note 2*:
1261
 
1262
  A source-file new-line in a raw string literal results in a new-line in
 
1292
  is equivalent to `"x = \"\\\"y\\\"\""`.
1293
 
1294
  — *end example*]
1295
 
1296
  Ordinary string literals and UTF-8 string literals are also referred to
1297
+ as *narrow string literals*.
1298
 
1299
+ The *string-literal*s in any sequence of adjacent *string-literal*s
1300
+ shall have at most one unique *encoding-prefix* among them. The common
1301
+ *encoding-prefix* of the sequence is that *encoding-prefix*, if any.
 
 
 
1302
 
1303
  [*Note 3*: A *string-literal*’s rawness has no effect on the
1304
  determination of the common *encoding-prefix*. — *end note*]
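As a sketch of how the common *encoding-prefix* applies to a sequence of adjacent literals, assuming C++17 or later (the name `mixed` is illustrative):

```cpp
#include <string_view>

// One prefixed literal next to an unprefixed one: the prefix applies to
// the whole sequence.
static_assert(std::wstring_view(L"a" "b") == L"ab");
static_assert(std::u16string_view("a" u"b") == u"ab");

// Rawness is handled per literal: the raw part keeps `\n` as two
// characters, the ordinary part contributes a single new-line.
constexpr std::string_view mixed = R"(\n)" "\n";
static_assert(mixed.size() == 3);
static_assert(mixed == "\\n\n");
```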
1305
 
1306
  In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
 
1337
  | `u"a"` | `"b"` | `u"ab"` | `U"a"` | `"b"` | `U"ab"` | `L"a"` | `"b"` | `L"ab"` |
1338
  | `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
1339
 
1340
 
1341
  Evaluating a *string-literal* results in a string literal object with
1342
+ static storage duration [[basic.stc]].
 
 
 
1343
 
1344
+ [*Note 4*: String literal objects are potentially non-unique
1345
+ [[intro.object]]. Whether successive evaluations of a *string-literal*
1346
+ yield the same or a different object is unspecified. — *end note*]
1347
+
1348
+ [*Note 5*: The effect of attempting to modify a string literal object
1349
  is undefined. — *end note*]
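The array type of a string literal object can be inspected at compile time. This is a minimal sketch assuming C++17 or later; `greeting` is an illustrative name.

```cpp
#include <type_traits>

// The array bound includes the terminating null character.
static_assert(sizeof("abc") == 4);

// A string-literal is an lvalue, so decltype yields a reference to the array.
static_assert(std::is_same_v<decltype("abc"),  const char (&)[4]>);
static_assert(std::is_same_v<decltype(u"abc"), const char16_t (&)[4]>);

// The elements are const, so an attempt to write through this pointer
// is rejected by the compiler instead of reaching undefined behavior.
constexpr const char* greeting = "hello";
static_assert(greeting[0] == 'h');
```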
1350
 
1351
  String literal objects are initialized with the sequence of code unit
1352
  values corresponding to the *string-literal*’s sequence of *s-char*s
1353
  (originally from non-raw string literals) and *r-char*s (originally from
 
1357
  - The sequence of characters denoted by each contiguous sequence of
1358
  *basic-s-char*s, *r-char*s, *simple-escape-sequence*s [[lex.ccon]],
1359
  and *universal-character-name*s [[lex.charset]] is encoded to a code
1360
  unit sequence using the *string-literal*’s associated character
1361
  encoding. If a character lacks representation in the associated
1362
+ character encoding, then the program is ill-formed. \[*Note 6*: No
1363
+ character lacks representation in any Unicode encoding
1364
+ form. *end note*] When encoding a stateful character encoding,
1365
+ implementations should encode the first such sequence beginning with
1366
+ the initial encoding state and encode subsequent sequences beginning
1367
+ with the final encoding state of the prior sequence. \[*Note 7*: The
1368
+ encoded code unit sequence can differ from the sequence of code units
1369
+ that would be obtained by encoding each character
1370
+ independently. *end note*]
 
1371
  - Each *numeric-escape-sequence* [[lex.ccon]] contributes a single code
1372
  unit with a value as follows:
1373
  - Let v be the integer value represented by the octal number
1374
  comprising the sequence of *octal-digit*s in an
1375
  *octal-escape-sequence* or by the hexadecimal number comprising the
 
1380
  `L`, and v does not exceed the range of representable values of the
1381
  corresponding unsigned type for the underlying type of the
1382
  *string-literal*’s array element type, then the value is the unique
1383
  value of the *string-literal*’s array element type `T` that is
1384
  congruent to v modulo 2ᴺ, where N is the width of `T`.
1385
+ - Otherwise, the program is ill-formed.
1386
 
1387
  When encoding a stateful character encoding, these sequences should
1388
  have no effect on encoding state.
1389
  - Each *conditional-escape-sequence* [[lex.ccon]] contributes an
1390
  *implementation-defined* code unit sequence. When encoding a stateful
1391
  character encoding, it is *implementation-defined* what effect these
1392
  sequences have on encoding state.
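The difference between encoded characters and numeric escape sequences shows up in the number of code units. The checks below are a sketch; the array name `two` is illustrative.

```cpp
// Each numeric escape sequence contributes exactly one code unit.
constexpr char two[] = "\x41\102";   // one hexadecimal and one octal escape
static_assert(sizeof(two) == 3, "two code units plus the terminating null");
static_assert(two[0] == 65 && two[1] == 66, "values come from the escapes");

// A universal-character-name is encoded, so it can need several code units.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3,
              "UTF-16: surrogate pair plus the terminating null");
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2,
              "UTF-32: one code unit plus the terminating null");
```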
1393
 
1394
+ ### Unevaluated strings <a id="lex.string.uneval">[[lex.string.uneval]]</a>
1395
+
1396
+ ``` bnf
1397
+ unevaluated-string:
1398
+ string-literal
1399
+ ```
1400
+
1401
+ An *unevaluated-string* shall have no *encoding-prefix*.
1402
+
1403
+ Each *universal-character-name* and each *simple-escape-sequence* in an
1404
+ *unevaluated-string* is replaced by the member of the translation
1405
+ character set it denotes. An *unevaluated-string* that contains a
1406
+ *numeric-escape-sequence* or a *conditional-escape-sequence* is
1407
+ ill-formed.
1408
+
1409
+ An *unevaluated-string* is never evaluated and its interpretation
1410
+ depends on the context in which it appears.
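Contexts that take such a string include linkage specifications, some attribute arguments, and `static_assert` messages. The fragment below is a sketch of those three uses; all function names are hypothetical.

```cpp
// Linkage specification: the string names a language linkage.
extern "C" int hypothetical_c_function(int);   // declaration only, never called

// Attribute argument: the string is diagnostic text.
[[deprecated("use twice instead")]]
constexpr int dbl(int v) { return v + v; }

constexpr int twice(int v) { return 2 * v; }

// static_assert message: shown by the compiler only if the check fails.
static_assert(twice(21) == 42, "twice doubles its argument");
```

None of these strings creates a string literal object, which is why an *encoding-prefix* and numeric escape sequences have no meaning there.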
1411
+
1412
  ### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
1413
 
1414
  ``` bnf
1415
  boolean-literal:
1416
+ false
1417
+ true
1418
  ```
1419
 
1420
  The Boolean literals are the keywords `false` and `true`. Such literals
1421
  have type `bool`.
1422
 
1423
  ### Pointer literals <a id="lex.nullptr">[[lex.nullptr]]</a>
1424
 
1425
  ``` bnf
1426
  pointer-literal:
1427
+ nullptr
1428
  ```
1429
 
1430
  The pointer literal is the keyword `nullptr`. It has type
1431
  `std::nullptr_t`.
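The types of the Boolean and pointer literals can be confirmed with a few compile-time checks, assuming C++17 or later. The name `no_object` is illustrative.

```cpp
#include <cstddef>
#include <type_traits>

static_assert(std::is_same_v<decltype(true),    bool>);
static_assert(std::is_same_v<decltype(false),   bool>);
static_assert(std::is_same_v<decltype(nullptr), std::nullptr_t>);

// nullptr converts to any pointer type but is not itself an integer.
constexpr int* no_object = nullptr;
static_assert(no_object == nullptr);
static_assert(!std::is_integral_v<decltype(nullptr)>);
```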
1432
 
 
1558
  basic character set. — *end note*]
1559
 
1560
  If *L* is a *user-defined-string-literal*, let *str* be the literal
1561
  without its *ud-suffix* and let *len* be the number of code units in
1562
  *str* (i.e., its length excluding the terminating null character). If
1563
+ *S* contains a literal operator template with a constant template
1564
  parameter for which *str* is a well-formed *template-argument*, the
1565
  literal *L* is treated as a call of the form
1566
 
1567
  ``` cpp
1568
  operator ""X<str>()
 
1625
  [basic.fundamental]: basic.md#basic.fundamental
1626
  [basic.link]: basic.md#basic.link
1627
  [basic.lookup.unqual]: basic.md#basic.lookup.unqual
1628
  [basic.stc]: basic.md#basic.stc
1629
  [character.seq]: library.md#character.seq
1630
+ [class.mem.general]: class.md#class.mem.general
1631
  [conv.mem]: expr.md#conv.mem
1632
  [conv.ptr]: expr.md#conv.ptr
1633
  [cpp]: cpp.md#cpp
1634
  [cpp.cond]: cpp.md#cpp.cond
1635
+ [cpp.embed]: cpp.md#cpp.embed
1636
  [cpp.import]: cpp.md#cpp.import
1637
  [cpp.include]: cpp.md#cpp.include
1638
  [cpp.module]: cpp.md#cpp.module
1639
+ [cpp.pragma]: cpp.md#cpp.pragma
1640
+ [cpp.pragma.op]: cpp.md#cpp.pragma.op
1641
+ [cpp.pre]: cpp.md#cpp.pre
1642
+ [cpp.predefined]: cpp.md#cpp.predefined
1643
+ [cpp.replace]: cpp.md#cpp.replace
1644
  [cpp.stringize]: cpp.md#cpp.stringize
1645
  [dcl.attr.grammar]: dcl.md#dcl.attr.grammar
1646
+ [dcl.pre]: dcl.md#dcl.pre
1647
+ [expr.const]: expr.md#expr.const
1648
  [expr.prim.literal]: expr.md#expr.prim.literal
1649
  [headers]: library.md#headers
1650
+ [intro.object]: basic.md#intro.object
1651
  [lex]: #lex
1652
  [lex.bool]: #lex.bool
1653
  [lex.ccon]: #lex.ccon
1654
  [lex.ccon.esc]: #lex.ccon.esc
1655
  [lex.ccon.literal]: #lex.ccon.literal
1656
+ [lex.char]: #lex.char
1657
  [lex.charset]: #lex.charset
1658
  [lex.charset.basic]: #lex.charset.basic
1659
  [lex.charset.literal]: #lex.charset.literal
1660
  [lex.comment]: #lex.comment
1661
  [lex.digraph]: #lex.digraph
 
1679
  [lex.pptoken]: #lex.pptoken
1680
  [lex.separate]: #lex.separate
1681
  [lex.string]: #lex.string
1682
  [lex.string.concat]: #lex.string.concat
1683
  [lex.string.literal]: #lex.string.literal
1684
+ [lex.string.uneval]: #lex.string.uneval
1685
  [lex.token]: #lex.token
1686
+ [lex.universal.char]: #lex.universal.char
1687
  [module.import]: module.md#module.import
1688
+ [module.reach]: module.md#module.reach
1689
  [module.unit]: module.md#module.unit
1690
  [over.literal]: over.md#over.literal
1691
  [support.types.layout]: support.md#support.types.layout
1692
  [temp.explicit]: temp.md#temp.explicit
1693
+ [temp.inst]: temp.md#temp.inst
1694
  [temp.names]: temp.md#temp.names
1695
+ [temp.point]: temp.md#temp.point
1696
+ [uaxid]: uax31.md#uaxid
1697
 
1698
  [^1]: Implementations behave as if these separate phases occur, although
1699
  in practice different phases can be folded together.
1700
 
1701
+ [^2]: Unicode® is a registered trademark of Unicode, Inc. This
1702
+ information is given for the convenience of users of this document
1703
+ and does not constitute an endorsement by ISO or IEC of this
1704
+ product.
1705
+
1706
+ [^3]: A partial preprocessing token would arise from a source file
1707
  ending in the first portion of a multi-character token that requires
1708
  a terminating sequence of characters, such as a *header-name* that
1709
  is missing the closing `"` or `>`. A partial comment would arise
1710
  from a source file ending with an unclosed `/*` comment.
1711
 
1712
+ [^4]: These include “digraphs” and additional reserved words. The term
1713
  “digraph” (token consisting of two characters) is not perfectly
1714
  descriptive, since one of the alternative *preprocessing-token*s is
1715
  `%:%:` and of course several primary tokens contain two characters.
1716
  Nonetheless, those alternative tokens that aren’t lexical keywords
1717
  are colloquially known as “digraphs”.
1718
 
1719
+ [^5]: Thus the “stringized” values [[cpp.stringize]] of `[` and `<:`
1720
  will be different, maintaining the source spelling, but the tokens
1721
  can otherwise be freely interchanged.
1722
 
1723
+ [^6]: Literals include strings and character and numeric literals.
 
 
 
 
 
1724
 
1725
  [^7]: On systems in which linkers cannot accept extended characters, an
1726
  encoding of the *universal-character-name* can be used in forming
1727
  valid external identifiers. For example, some otherwise unused
1728
  character or sequence of characters can be used to encode the `\u` in
1729
  a *universal-character-name*. Extended characters can produce a
1730
  long external identifier, but C++ does not place a translation limit
1731
  on significant characters for external identifiers.
1732
 
1733
  [^8]: The term “literal” generally designates, in this document, those
1734
+ tokens that are called “constants” in C.