From Jason Turner

[lex]


Files changed (1)
  1. tmp/tmpjhdm7syv/{from.md → to.md} +449 -312
tmp/tmpjhdm7syv/{from.md → to.md} RENAMED
@@ -5,47 +5,49 @@
5
  The text of the program is kept in units called *source files* in this
6
  International Standard. A source file together with all the headers (
7
  [[headers]]) and source files included ([[cpp.include]]) via the
8
  preprocessing directive `#include`, less any source lines skipped by any
9
  of the conditional inclusion ([[cpp.cond]]) preprocessing directives,
10
- is called a *translation unit*. A C++ program need not all be translated
11
- at the same time.
12
 
13
- Previously translated translation units and instantiation units can be
14
- preserved individually or in libraries. The separate translation units
15
- of a program communicate ([[basic.link]]) by (for example) calls to
16
- functions whose identifiers have external linkage, manipulation of
17
- objects whose identifiers have external linkage, or manipulation of data
18
- files. Translation units can be separately translated and then later
19
- linked to produce an executable program ([[basic.link]]).
 
 
 
 
20
 
21
  ## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
22
 
23
  The precedence among the syntax rules of translation is specified by the
24
  following phases.[^1]
25
 
26
  1. Physical source file characters are mapped, in an
27
  *implementation-defined* manner, to the basic source character set
28
  (introducing new-line characters for end-of-line indicators) if
29
  necessary. The set of physical source file characters accepted is
30
- *implementation-defined*. Trigraph sequences ([[lex.trigraph]]) are
31
- replaced by corresponding single-character internal representations.
32
- Any source file character not in the basic source character set (
33
- [[lex.charset]]) is replaced by the universal-character-name that
34
- designates that character. (An implementation may use any internal
35
- encoding, so long as an actual extended character encountered in the
36
- source file, and the same extended character expressed in the source
37
- file as a universal-character-name (i.e., using the `\uXXXX`
38
- notation), are handled equivalently except where this replacement is
39
- reverted in a raw string literal.)
40
  2. Each instance of a backslash character (\\) immediately followed by a
41
  new-line character is deleted, splicing physical source lines to
42
  form logical source lines. Only the last backslash on any physical
43
  source line shall be eligible for being part of such a splice.
44
  Except for splices reverted in a raw string literal, if a splice
45
  results in a character sequence that matches the syntax of a
46
- universal-character-name, the behavior is undefined. A source file
47
  that is not empty and that does not end in a new-line character, or
48
  that ends in a new-line character immediately preceded by a
49
  backslash character before any such splicing takes place, shall be
50
  processed as if an additional new-line character were appended to
51
  the file.
@@ -55,53 +57,57 @@ following phases.[^1]
55
  token or in a partial comment.[^2] Each comment is replaced by one
56
  space character. New-line characters are retained. Whether each
57
  nonempty sequence of white-space characters other than new-line is
58
  retained or replaced by one space character is unspecified. The
59
  process of dividing a source file’s characters into preprocessing
60
- tokens is context-dependent; see the handling of `<` within a
61
- `#include` preprocessing directive.
62
  4. Preprocessing directives are executed, macro invocations are
63
  expanded, and `_Pragma` unary operator expressions are executed. If
64
  a character sequence that matches the syntax of a
65
- universal-character-name is produced by token concatenation (
66
  [[cpp.concat]]), the behavior is undefined. A `#include`
67
  preprocessing directive causes the named header or source file to be
68
  processed from phase 1 through phase 4, recursively. All
69
  preprocessing directives are then deleted.
70
  5. Each source character set member in a character literal or a string
71
  literal, as well as each escape sequence and
72
- universal-character-name in a character literal or a non-raw string
73
- literal, is converted to the corresponding member of the execution
74
- character set ([[lex.ccon]], [[lex.string]]); if there is no
75
- corresponding member, it is converted to an *implementation-defined*
76
- member other than the null (wide) character.[^3]
 
77
  6. Adjacent string literal tokens are concatenated.
78
  7. White-space characters separating tokens are no longer significant.
79
- Each preprocessing token is converted into a token (
80
- [[lex.token]]). The resulting tokens are syntactically and
81
- semantically analyzed and translated as a translation unit. The
82
- process of analyzing and translating the tokens may occasionally
83
- result in one token being replaced by a sequence of other tokens (
84
- [[temp.names]]). Source files, translation units and translated
85
- translation units need not necessarily be stored as files, nor need
86
- there be any one-to-one correspondence between these entities and
87
- any external representation. The description is conceptual only, and
88
- does not specify any particular implementation.
 
89
  8. Translated translation units and instantiation units are combined as
90
- follows: Some or all of these may be supplied from a library. Each
91
- translated translation unit is examined to produce a list of
92
- required instantiations. This may include instantiations which have
93
- been explicitly requested ([[temp.explicit]]). The definitions of
94
- the required templates are located. It is *implementation-defined*
95
- whether the source of the translation units containing these
96
- definitions is required to be available. An implementation could
97
- encode sufficient information into the translated translation unit
98
- so as to ensure the source is not required here. All the required
99
- instantiations are performed to produce *instantiation units*. These
100
- are similar to translated translation units, but contain no
101
- references to uninstantiated templates and no template definitions.
102
- The program is ill-formed if any instantiation fails.
 
 
103
  9. All external entity references are resolved. Library components are
104
  linked to satisfy external references to entities not defined in the
105
  current translation. All such translator output is collected into a
106
  program image which contains information needed for execution in its
107
  execution environment.
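
A few of the phases listed above can be observed directly in ordinary code. The following is a minimal sketch, not part of the standard text; the names `SPLICED_VALUE`, `spliced_value`, and `concatenated` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstring>

// Phase 2: the backslash/new-line pair is deleted before tokenization,
// so the digits on the two physical lines join into the single pp-number 42.
#define SPLICED_VALUE 4\
2

// Phase 3: the comment is replaced by one space, which keeps `return`
// and the macro name apart as two preprocessing tokens.
// Phase 4: the macro invocation is then expanded to 42.
inline int spliced_value() { return/* replaced by a space */SPLICED_VALUE; }

// Phase 6: adjacent string literal tokens are concatenated.
inline const char* concatenated() { return "trans" "lation"; }
```

Calling `spliced_value()` yields 42, and `concatenated()` points at the single string `"translation"`.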
@@ -113,15 +119,12 @@ character, the control characters representing horizontal tab, vertical
113
  tab, form feed, and new-line, plus the following 91 graphical
114
  characters:[^4]
115
 
116
  ``` cpp
117
  a b c d e f g h i j k l m n o p q r s t u v w x y z
118
-
119
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
120
-
121
  0 1 2 3 4 5 6 7 8 9
122
-
123
  _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
124
  ```
125
 
126
  The *universal-character-name* construct provides a way to name other
127
  characters.
@@ -135,62 +138,41 @@ hex-quad:
135
  universal-character-name:
136
  '\u' hex-quad
137
  '\U' hex-quad hex-quad
138
  ```
139
 
140
- The character designated by the universal-character-name `\UNNNNNNNN` is
141
- that character whose character short name in ISO/IEC 10646 is
142
- `NNNNNNNN`; the character designated by the universal-character-name
143
  `\uNNNN` is that character whose character short name in ISO/IEC 10646
144
- is `0000NNNN`. If the hexadecimal value for a universal-character-name
145
  corresponds to a surrogate code point (in the range 0xD800–0xDFFF,
146
  inclusive), the program is ill-formed. Additionally, if the hexadecimal
147
- value for a universal-character-name outside the *c-char-sequence*,
148
  *s-char-sequence*, or *r-char-sequence* of a character or string literal
149
  corresponds to a control character (in either of the ranges 0x00–0x1F or
150
  0x7F–0x9F, both inclusive) or to a character in the basic source
151
  character set, the program is ill-formed.[^5]
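
As an illustrative sketch of the two universal-character-name spellings described above (the variable names are hypothetical), both forms below designate the same character, U+00E9:

```cpp
#include <cassert>

// \uNNNN names the character whose short name is 0000NNNN, so the
// four-digit and eight-digit spellings below designate the same character.
constexpr char32_t short_form = U'\u00E9';
constexpr char32_t long_form  = U'\U000000E9';

static_assert(short_form == long_form, "both forms name the same character");
static_assert(short_form == 0xE9, "a char32_t literal holds the code point value");

// A universal-character-name in the surrogate range (0xD800 to 0xDFFF)
// would make the program ill-formed, so none is written here.
```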
152
 
153
  The *basic execution character set* and the *basic execution
154
  wide-character set* shall each contain all the members of the basic
155
  source character set, plus control characters representing alert,
156
  backspace, and carriage return, plus a *null character* (respectively,
157
- *null wide character*), whose representation has all zero bits. For each
158
- basic execution character set, the values of the members shall be
159
- non-negative and distinct from one another. In both the source and
160
- execution basic character sets, the value of each character after `0` in
161
- the above list of decimal digits shall be one greater than the value of
162
- the previous. The *execution character set* and the *execution
163
- wide-character set* are implementation-defined supersets of the basic
164
- execution character set and the basic execution wide-character set,
165
- respectively. The values of the members of the execution character sets
166
- and the sets of additional members are locale-specific.
167
-
168
- ## Trigraph sequences <a id="lex.trigraph">[[lex.trigraph]]</a>
169
-
170
- Before any other processing takes place, each occurrence of one of the
171
- following sequences of three characters (“*trigraph sequences*”) is
172
- replaced by the single character indicated in Table 
173
- [[tab:trigraph.sequences]].
174
-
175
- ``` cpp
176
- ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
177
- ```
178
-
179
- becomes
180
-
181
- ``` cpp
182
- #define arraycheck(a,b) a[b] || b[a]
183
- ```
184
-
185
- No other trigraph sequence exists. Each `?` that does not begin one of
186
- the trigraphs listed above is not changed.
187
 
188
  ## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
189
 
190
  ``` bnf
191
- %
192
  preprocessing-token:
193
  header-name
194
  identifier
195
  pp-number
196
  character-literal
@@ -227,43 +209,49 @@ given character:
227
 
228
  - If the next character begins a sequence of characters that could be
229
  the prefix and initial double quote of a raw string literal, such as
230
  `R"`, the next preprocessing token shall be a raw string literal.
231
  Between the initial and final double quote characters of the raw
232
- string, any transformations performed in phases 1 and 2 (trigraphs,
233
- universal-character-names, and line splicing) are reverted; this
234
  reversion shall apply before any *d-char*, *r-char*, or delimiting
235
  parenthesis is identified. The raw string literal is defined as the
236
  shortest sequence of characters that matches the raw-string pattern
237
  ``` bnf
238
  encoding-prefixₒₚₜ 'R' raw-string
239
  ```
240
  - Otherwise, if the next three characters are `<::` and the subsequent
241
- character is neither `:` nor `>`, the `<` is treated as a preprocessor
242
- token by itself and not as the first character of the alternative
243
- token `<:`.
244
  - Otherwise, the next preprocessing token is the longest sequence of
245
  characters that could constitute a preprocessing token, even if that
246
- would cause further lexical analysis to fail.
 
 
 
 
247
 
248
  ``` cpp
249
  #define R "x"
250
  const char* s = R"y"; // ill-formed raw string, not "x" "y"
251
  ```
252
 
253
- The program fragment `1Ex` is parsed as a preprocessing number token
254
- (one that is not a valid floating or integer literal token), even though
255
- a parse as the pair of preprocessing tokens `1` and `Ex` might produce a
256
- valid expression (for example, if `Ex` were a macro defined as `+1`).
257
- Similarly, the program fragment `1E1` is parsed as a preprocessing
258
- number (one that is a valid floating literal token), whether or not `E`
259
- is a macro name.
260
 
261
- The program fragment `x+++++y` is parsed as `x
 
 
 
 
 
 
 
 
262
  ++ ++ + y`, which, if `x` and `y` have integral types, violates a
263
  constraint on increment operators, even though the parse `x ++ + ++ y`
264
- might yield a correct expression.
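
The longest-token rule can also be seen in a fragment that does compile. This is a small sketch with a hypothetical helper, `munch_sum`, showing that `x+++y` is tokenized as `x ++ + y`:

```cpp
#include <cassert>

// The lexer always takes the longest sequence that can form a token, so
// `x+++y` is read as `x ++ + y` (post-increment, then addition) and never
// as `x + ++y`.
inline int munch_sum(int& x, int y) { return x+++y; }
```

With `x == 1` and `y == 10`, the call returns 11 and leaves `x == 2`; the other reading would have returned 12 and left `x` unchanged.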
265
 
266
  ## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
267
 
268
  Alternative token representations are provided for some operators and
269
  punctuators.[^6]
@@ -286,26 +274,28 @@ token:
286
 
287
  There are five kinds of tokens: identifiers, keywords, literals,[^8]
288
  operators, and other separators. Blanks, horizontal and vertical tabs,
289
  newlines, formfeeds, and comments (collectively, “white space”), as
290
  described below, are ignored except as they serve to separate tokens.
291
- Some white space is required to separate otherwise adjacent identifiers,
292
- keywords, numeric literals, and alternative tokens containing alphabetic
293
- characters.
 
294
 
295
  ## Comments <a id="lex.comment">[[lex.comment]]</a>
296
 
297
  The characters `/*` start a comment, which terminates with the
298
  characters `*/`. These comments do not nest. The characters `//` start a
299
  comment, which terminates immediately before the next new-line
300
  character. If there is a form-feed or a vertical-tab character in such a
301
  comment, only white-space characters shall appear between it and the
302
- new-line that terminates the comment; no diagnostic is required. The
303
- comment characters `//`, `/*`, and `*/` have no special meaning within a
304
- `//` comment and are treated just like other characters. Similarly, the
305
- comment characters `//` and `/*` have no special meaning within a `/*`
306
- comment.
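
A short sketch of the comment rules above, using a hypothetical function `comment_demo`: comment introducers that appear inside another comment are treated as ordinary characters.

```cpp
#include <cassert>

inline int comment_demo() {
    int value = 1;  // a /* in a line comment does not open a block comment
    value += 2;     // so this statement is still compiled
    /* a // in a block comment does not hide the terminator */ value += 4;
    return value;   // 1 + 2 + 4
}
```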
 
307
 
308
  ## Header names <a id="lex.header">[[lex.header]]</a>
309
 
310
  ``` bnf
311
  header-name:
@@ -333,21 +323,23 @@ q-char-sequence:
333
  ``` bnf
334
  q-char:
335
  any member of the source character set except new-line and '"'
336
  ```
337
 
338
- Header name preprocessing tokens shall only appear within a `#include`
339
- preprocessing directive ([[cpp.include]]). The sequences in both forms
340
- of *header-name*s are mapped in an *implementation-defined* manner to
341
- headers or to external source file names as specified in 
342
- [[cpp.include]].
 
 
343
 
344
  The appearance of either of the characters `'` or `\` or of either of
345
  the character sequences `/*` or `//` in a *q-char-sequence* or an
346
- *h-char-sequence* is conditionally-supported with implementation-defined
347
- semantics, as is the appearance of the character `"` in an
348
- *h-char-sequence*.[^9]
349
 
350
  ## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
351
 
352
  ``` bnf
353
  pp-number:
@@ -357,10 +349,12 @@ pp-number:
357
  pp-number identifier-nondigit
358
  pp-number ''' digit
359
  pp-number ''' nondigit
360
  pp-number 'e' sign
361
  pp-number 'E' sign
 
 
362
  pp-number '.'
363
  ```
364
 
365
  Preprocessing number tokens lexically include all integer literal
366
  tokens ([[lex.icon]]) and all floating literal tokens ([[lex.fcon]]).
@@ -380,11 +374,10 @@ identifier:
380
 
381
  ``` bnf
382
  identifier-nondigit:
383
  nondigit
384
  universal-character-name
385
- other implementation-defined characters
386
  ```
387
 
388
  ``` bnf
389
  nondigit: one of
390
  'a b c d e f g h i j k l m'
@@ -397,16 +390,40 @@ nondigit: one of
397
  digit: one of
398
  '0 1 2 3 4 5 6 7 8 9'
399
  ```
400
 
401
  An identifier is an arbitrarily long sequence of letters and digits.
402
- Each universal-character-name in an identifier shall designate a
403
  character whose encoding in ISO 10646 falls into one of the ranges
404
- specified in  [[charname.allowed]]. The initial element shall not be a
405
- universal-character-name designating a character whose encoding falls
406
- into one of the ranges specified in  [[charname.disallowed]]. Upper- and
407
- lower-case letters are different. All characters are significant.[^10]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
408
 
409
  The identifiers in Table  [[tab:identifiers.special]] have a special
410
  meaning when appearing in a certain context. When referred to in the
411
  grammar, these identifiers are used explicitly rather than using the
412
  *identifier* grammar production. Unless otherwise specified, any
@@ -419,19 +436,24 @@ resolved to interpret the token as a regular *identifier*.
419
  | ---------- | ------- |
420
  | `override` | `final` |
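
The context-dependent meaning of these two identifiers can be sketched as follows. The types `Shape` and `Square` and the function `special_identifier_demo` are hypothetical names used only for this illustration.

```cpp
#include <cassert>

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const { return 0; }
};

// Here `final` and `override` appear in the contexts where they have
// their special meaning.
struct Square final : Shape {
    int sides() const override { return 4; }
};

// Elsewhere they are regular identifiers and can name variables.
inline int special_identifier_demo() {
    int override = 1;
    int final = 2;
    Square square;
    const Shape& shape = square;
    return shape.sides() + override + final;  // 4 + 1 + 2
}
```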
421
 
422
 
423
  In addition, some identifiers are reserved for use by C++
424
- implementations and standard libraries ([[global.names]]) and shall not
425
- be used otherwise; no diagnostic is required.
 
 
 
 
 
 
426
 
427
  ## Keywords <a id="lex.key">[[lex.key]]</a>
428
 
429
  The identifiers shown in Table  [[tab:keywords]] are reserved for use as
430
  keywords (that is, they are unconditionally treated as keywords in phase
431
- 7) except in an *attribute-token* ([[dcl.attr.grammar]]); the `export`
432
- keyword is unused but is reserved for future use:
433
 
434
  **Table: Keywords** <a id="tab:keywords">[tab:keywords]</a>
435
 
436
  | | | | | |
437
  | ------------ | -------------- | ----------- | ------------------ | ---------- |
@@ -450,10 +472,13 @@ keyword is unused but is reserved for future use.:
450
  | `const` | `false` | `private` | `this` | `while` |
451
  | `constexpr` | `float` | `protected` | `thread_local` | |
452
  | `const_cast` | `for` | `public` | `throw` | |
453
 
454
 
 
 
 
455
  Furthermore, the alternative representations shown in Table 
456
  [[tab:alternative.representations]] for certain operators and
457
  punctuators ([[lex.digraph]]) are reserved and shall not be used
458
  otherwise:
459
 
@@ -519,13 +544,11 @@ decimal-literal:
519
  decimal-literal '''ₒₚₜ digit
520
  ```
521
 
522
  ``` bnf
523
  hexadecimal-literal:
524
- '0x' hexadecimal-digit
525
- '0X' hexadecimal-digit
526
- hexadecimal-literal '''ₒₚₜ hexadecimal-digit
527
  ```
528
 
529
  ``` bnf
530
  binary-digit:
531
  '0'
@@ -540,10 +563,21 @@ octal-digit: one of
540
  ``` bnf
541
  nonzero-digit: one of
542
  '1 2 3 4 5 6 7 8 9'
543
  ```
544
 
 
 
 
 
 
 
 
 
 
 
 
545
  ``` bnf
546
  hexadecimal-digit: one of
547
  '0 1 2 3 4 5 6 7 8 9'
548
  'a b c d e f'
549
  'A B C D E F'
@@ -574,22 +608,25 @@ long-long-suffix: one of
574
 
575
  An *integer literal* is a sequence of digits that has no period or
576
  exponent part, with optional separating single quotes that are ignored
577
  when determining its value. An integer literal may have a prefix that
578
  specifies its base and a suffix that specifies its type. The lexically
579
- first digit of the sequence of digits is the most significant. A
580
- *binary* integer literal (base two) begins with `0b` or `0B` and
581
- consists of a sequence of binary digits. An *octal* integer literal
582
- (base eight) begins with the digit `0` and consists of a sequence of
583
- octal digits.[^12] A *decimal* integer literal (base ten) begins with a
584
- digit other than `0` and consists of a sequence of decimal digits. A
585
- *hexadecimal* integer literal (base sixteen) begins with `0x` or `0X`
586
  and consists of a sequence of hexadecimal digits, which include the
587
  decimal digits and the letters `a` through `f` and `A` through `F` with
588
- decimal values ten through fifteen. The number twelve can be written
589
- `12`, `014`, `0XC`, or `0b1100`. The literals `1048576`, `1'048'576`,
590
- `0X100000`, `0x10'0000`, and `0'004'000'000` all have the same value.
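
The equalities stated above can be checked at compile time. The sketch below assumes a compiler in C++14 mode or later, since binary literals and digit separators are not available before that; the variable names are hypothetical.

```cpp
#include <cassert>

// Twelve written in decimal, octal, hexadecimal, and binary.
constexpr int twelve_decimal = 12;
constexpr int twelve_octal   = 014;
constexpr int twelve_hex     = 0XC;
constexpr int twelve_binary  = 0b1100;
static_assert(twelve_decimal == twelve_octal && twelve_octal == twelve_hex &&
              twelve_hex == twelve_binary, "all four literals denote twelve");

// Single quotes between digits are ignored when the value is determined.
static_assert(1048576 == 1'048'576, "decimal with separators");
static_assert(1048576 == 0X100000 && 1048576 == 0x10'0000, "hexadecimal");
static_assert(1048576 == 0'004'000'000, "octal: the leading 0 selects base eight");
```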
 
 
 
591
 
592
  The type of an integer literal is the first of the corresponding list in
593
  Table  [[tab:lex.type.integer.literal]] in which its value can be
594
  represented.
595
 
@@ -619,26 +656,28 @@ represented.
619
 
620
 
621
  If an integer literal cannot be represented by any type in its list and
622
  an extended integer type ([[basic.fundamental]]) can represent its
623
  value, it may have that extended integer type. If all of the types in
624
- the list for the literal are signed, the extended integer type shall be
625
- signed. If all of the types in the list for the literal are unsigned,
626
- the extended integer type shall be unsigned. If the list contains both
627
- signed and unsigned types, the extended integer type may be signed or
628
- unsigned. A program is ill-formed if one of its translation units
629
- contains an integer literal that cannot be represented by any of the
630
- allowed types.
631
 
632
  ### Character literals <a id="lex.ccon">[[lex.ccon]]</a>
633
 
634
  ``` bnf
635
  character-literal:
636
- ''' c-char-sequence '''
637
- u''' c-char-sequence '''
638
- U''' c-char-sequence '''
639
- L''' c-char-sequence '''
 
 
640
  ```
641
 
642
  ``` bnf
643
  c-char-sequence:
644
  c-char
@@ -670,46 +709,64 @@ hexadecimal-escape-sequence:
670
  '\x' hexadecimal-digit
671
  hexadecimal-escape-sequence hexadecimal-digit
672
  ```
673
 
674
  A character literal is one or more characters enclosed in single quotes,
675
- as in `'x'`, optionally preceded by one of the letters `u`, `U`, or `L`,
676
- as in `u'y'`, `U'z'`, or `L'x'`, respectively. A character literal that
677
- does not begin with `u`, `U`, or `L` is an ordinary character literal,
678
- also referred to as a narrow-character literal. An ordinary character
679
- literal that contains a single *c-char* representable in the execution
680
- character set has type `char`, with value equal to the numerical value
681
- of the encoding of the *c-char* in the execution character set. An
682
- ordinary character literal that contains more than one *c-char* is a
683
- *multicharacter literal*. A multicharacter literal, or an ordinary
684
- character literal containing a single *c-char* not representable in the
685
- execution character set, is conditionally-supported, has type `int`, and
686
- has an *implementation-defined* value.
687
 
688
- A character literal that begins with the letter `u`, such as `u'y'`, is
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
689
  a character literal of type `char16_t`. The value of a `char16_t`
690
- literal containing a single *c-char* is equal to its ISO 10646 code
691
- point value, provided that the code point is representable with a single
692
- 16-bit code unit. (That is, provided it is a basic multi-lingual plane
693
- code point.) If the value is not representable within 16 bits, the
694
- program is ill-formed. A `char16_t` literal containing multiple
695
- *c-char*s is ill-formed. A character literal that begins with the letter
696
- `U`, such as `U'z'`, is a character literal of type `char32_t`. The
697
- value of a `char32_t` literal containing a single *c-char* is equal to
698
- its ISO 10646 code point value. A `char32_t` literal containing multiple
699
- *c-char*s is ill-formed. A character literal that begins with the letter
700
- `L`, such as `L'x'`, is a wide-character literal. A wide-character
701
- literal has type `wchar_t`.[^13] The value of a wide-character literal
702
- containing a single *c-char* has value equal to the numerical value of
703
- the encoding of the *c-char* in the execution wide-character set, unless
704
- the *c-char* has no representation in the execution wide-character set,
705
- in which case the value is *implementation-defined*. The type `wchar_t`
706
- is able to represent all members of the execution wide-character set
707
- (see  [[basic.fundamental]]). The value of a wide-character literal
708
- containing multiple *c-char*s is *implementation-defined*.
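
The types and values described for the prefixed character literals can be sketched with compile-time checks. This is only an illustration: the variable names are hypothetical, and multicharacter literals are left out because they are conditionally-supported.

```cpp
#include <cassert>
#include <type_traits>

// The prefix (or its absence) selects the type of the literal.
static_assert(std::is_same<decltype('x'),  char>::value,     "ordinary character literal");
static_assert(std::is_same<decltype(u'y'), char16_t>::value, "u prefix");
static_assert(std::is_same<decltype(U'z'), char32_t>::value, "U prefix");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value,  "L prefix");

// For char16_t and char32_t literals the value is the ISO 10646 code point.
constexpr char16_t e_acute  = u'\u00E9';      // representable in one 16-bit code unit
constexpr char32_t grinning = U'\U0001F600';  // outside the basic multi-lingual plane
```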
709
 
710
- Certain nongraphic characters, the single quote `'`, the double quote
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
711
  `"`, the question mark `?`,[^14] and the backslash `\`, can be
712
  represented according to Table  [[tab:escape.sequences]]. The double
713
  quote `"` and the question mark `?`, can be represented as themselves or
714
  by the escape sequences `\"` and `\?` respectively, but the single quote
715
  `'` and the backslash `\` shall be represented by the escape sequences
@@ -744,45 +801,74 @@ backslash followed by `x` followed by one or more hexadecimal digits
744
  that are taken to specify the value of the desired character. There is
745
  no limit to the number of digits in a hexadecimal sequence. A sequence
746
  of octal or hexadecimal digits is terminated by the first character that
747
  is not an octal digit or a hexadecimal digit, respectively. The value of
748
  a character literal is *implementation-defined* if it falls outside of
749
- the implementation-defined range defined for `char` (for literals with
750
- no prefix), `char16_t` (for literals prefixed by `'u'`), `char32_t` (for
751
- literals prefixed by `'U'`), or `wchar_t` (for literals prefixed by
752
- `'L'`).
753
 
754
- A universal-character-name is translated to the encoding, in the
 
 
 
 
755
  appropriate execution character set, of the character named. If there is
756
- no such encoding, the universal-character-name is translated to an
757
- *implementation-defined* encoding. In translation phase 1, a
758
- universal-character-name is introduced whenever an actual extended
759
- character is encountered in the source text. Therefore, all extended
760
- characters are described in terms of universal-character-names. However,
761
- the actual compiler implementation may use its own native character set,
762
- so long as the same results are obtained.
 
 
763
 
764
  ### Floating literals <a id="lex.fcon">[[lex.fcon]]</a>
765
 
766
  ``` bnf
767
  floating-literal:
 
 
 
 
 
 
768
  fractional-constant exponent-partₒₚₜ floating-suffixₒₚₜ
769
  digit-sequence exponent-part floating-suffixₒₚₜ
770
  ```
771
 
 
 
 
 
 
 
772
  ``` bnf
773
  fractional-constant:
774
  digit-sequenceₒₚₜ '.' digit-sequence
775
  digit-sequence '.'
776
  ```
777
 
 
 
 
 
 
 
778
  ``` bnf
779
  exponent-part:
780
  'e' signₒₚₜ digit-sequence
781
  'E' signₒₚₜ digit-sequence
782
  ```
783
 
 
 
 
 
 
 
784
  ``` bnf
785
  sign: one of
786
  '+ -'
787
  ```
788
 
@@ -795,46 +881,55 @@ digit-sequence:
795
  ``` bnf
796
  floating-suffix: one of
797
  'f l F L'
798
  ```
799
 
800
- A floating literal consists of an integer part, a decimal point, a
801
- fraction part, an `e` or `E`, an optionally signed integer exponent, and
802
- an optional type suffix. The integer and fraction parts both consist of
803
- a sequence of decimal (base ten) digits. Optional separating single
804
- quotes in a *digit-sequence* are ignored when determining its value. The
805
- literals `1.602'176'565e-19` and `1.602176565e-19` have the same value.
806
- Either the integer part or the fraction part (not both) can be omitted;
807
- either the decimal point or the letter `e` (or `E`) and the exponent
808
- (not both) can be omitted. The integer part, the optional decimal point
809
- and the optional fraction part form the *significant part* of the
810
- floating literal. The exponent, if present, indicates the power of 10 by
811
- which the significant part is to be scaled. If the scaled value is in
812
- the range of representable values for its type, the result is the scaled
813
- value if representable, else the larger or smaller representable value
814
- nearest the scaled value, chosen in an *implementation-defined* manner.
815
- The type of a floating literal is `double` unless explicitly specified
816
- by a suffix. The suffixes `f` and `F` specify `float`, the suffixes `l`
817
- and `L` specify `long double`. If the scaled value is not in the range
818
- of representable values for its type, the program is ill-formed.
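
A brief sketch of the floating literal rules above, assuming C++14 or later for the digit separators. The variable names are hypothetical, and the values are chosen to be exactly representable so the comparisons are exact.

```cpp
#include <cassert>
#include <type_traits>

// Separating single quotes do not change the value.
static_assert(1.602'176'565e-19 == 1.602176565e-19, "same value with or without separators");

// The suffix, or its absence, selects the type.
static_assert(std::is_same<decltype(1.0),  double>::value,      "no suffix");
static_assert(std::is_same<decltype(1.0f), float>::value,       "f or F");
static_assert(std::is_same<decltype(1.0L), long double>::value, "l or L");

// Either the fraction part or the integer part may be omitted, and the
// decimal point may be omitted when an exponent is present.
constexpr double no_fraction = 1.;
constexpr double no_integer  = .5;
constexpr double no_point    = 1e3;
```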
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
819
 
820
  ### String literals <a id="lex.string">[[lex.string]]</a>
821
 
822
  ``` bnf
823
  string-literal:
824
  encoding-prefixₒₚₜ '"' s-char-sequenceₒₚₜ '"'
825
  encoding-prefixₒₚₜ 'R' raw-string
826
  ```
827
 
828
- ``` bnf
829
- encoding-prefix:
830
- 'u8'
831
- 'u'
832
- 'U'
833
- 'L'
834
- ```
835
-
836
  ``` bnf
837
  s-char-sequence:
838
  s-char
839
  s-char-sequence s-char
840
  ```
@@ -854,36 +949,43 @@ r-char-sequence:
854
  d-char-sequence:
855
  d-char
856
  d-char-sequence d-char
857
  ```
858
 
859
- A string literal is a sequence of characters (as defined in 
860
  [[lex.ccon]]) surrounded by double quotes, optionally prefixed by `R`,
861
  `u8`, `u8R`, `u`, `uR`, `U`, `UR`, `L`, or `LR`, as in `"..."`,
862
  `R"(...)"`, `u8"..."`, `u8R"**(...)**"`, `u"..."`, `uR"*~(...)*~"`,
863
  `U"..."`, `UR"zzz(...)zzz"`, `L"..."`, or `LR"(...)"`, respectively.
864
 
865
- A string literal that has an `R` in the prefix is a *raw string
866
  literal*. The *d-char-sequence* serves as a delimiter. The terminating
867
  *d-char-sequence* of a *raw-string* is the same sequence of characters
868
  as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
869
  at most 16 characters.
870
 
871
- The characters `'('` and `')'` are permitted in a *raw-string*. Thus,
872
- `R"delimiter((a|b))delimiter"` is equivalent to `"(a|b)"`.
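
The equivalence stated above can be checked at run time. In this sketch the helper names `raw_matches_ordinary` and `raw_backslash_length` are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// The parentheses inside the delimiters are part of the string contents.
inline bool raw_matches_ordinary() {
    return std::strcmp(R"delimiter((a|b))delimiter", "(a|b)") == 0;
}

// Escape sequences are not interpreted in a raw string literal, so the
// contents here are the two characters backslash and 'n'.
inline std::size_t raw_backslash_length() { return std::strlen(R"(\n)"); }
```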
 
 
 
873
 
874
  A source-file new-line in a raw string literal results in a new-line in
875
- the resulting execution *string-literal*. Assuming no whitespace at the
876
  beginning of lines in the following example, the assert will succeed:
877
 
878
  ``` cpp
879
  const char* p = R"(a\
880
  b
881
  c)";
882
  assert(std::strcmp(p, "a\\\nb\nc") == 0);
883
  ```
884
 
 
 
 
 
885
  The raw string
886
 
887
  ``` cpp
888
  R"a(
889
  )\
@@ -905,62 +1007,63 @@ R"#(
905
  )#"
906
  ```
907
 
908
  is equivalent to `"\n)\?\?=\"\n"`.
909
 
910
- After translation phase 6, a string literal that does not begin with an
911
- *encoding-prefix* is an ordinary string literal, and is initialized with
912
- the given characters.
913
 
914
- A string literal that begins with `u8`, such as `u8"asdf"`, is a UTF-8
915
- string literal.
 
 
 
 
916
 
917
  Ordinary string literals and UTF-8 string literals are also referred to
918
  as narrow string literals. A narrow string literal has type “array of
919
  *n* `const char`”, where *n* is the size of the string as defined below,
920
  and has static storage duration ([[basic.stc]]).
921
 
922
  For a UTF-8 string literal, each successive element of the object
923
  representation ([[basic.types]]) has the value of the corresponding
924
  code unit of the UTF-8 encoding of the string.
925
 
926
- A string literal that begins with `u`, such as `u"asdf"`, is a
927
  `char16_t` string literal. A `char16_t` string literal has type “array
928
  of *n* `const char16_t`”, where *n* is the size of the string as defined
929
- below; it has static storage duration and is initialized with the given
930
- characters. A single *c-char* may produce more than one `char16_t`
931
- character in the form of surrogate pairs.
932
 
933
- A string literal that begins with `U`, such as `U"asdf"`, is a
934
  `char32_t` string literal. A `char32_t` string literal has type “array
935
  of *n* `const char32_t`”, where *n* is the size of the string as defined
936
- below; it has static storage duration and is initialized with the given
937
- characters.
938
 
939
- A string literal that begins with `L`, such as `L"asdf"`, is a wide
940
- string literal. A wide string literal has type “array of *n* `const
941
- wchar_t`”, where *n* is the size of the string as defined below; it has
942
- static storage duration and is initialized with the given characters.
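
The array types described above can be sketched with compile-time checks. The alias `unref` and the variable names are hypothetical. `u8` literals are left out because their element type differs between standard revisions.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference type;
// this hypothetical alias strips the reference to expose the array type.
template <typename T>
using unref = typename std::remove_reference<T>::type;

static_assert(std::is_same<unref<decltype("asdf")>,  const char[5]>::value,     "ordinary");
static_assert(std::is_same<unref<decltype(u"asdf")>, const char16_t[5]>::value, "char16_t");
static_assert(std::is_same<unref<decltype(U"asdf")>, const char32_t[5]>::value, "char32_t");
static_assert(std::is_same<unref<decltype(L"asdf")>, const wchar_t[5]>::value,  "wide");

// U+1F600 lies outside the basic multi-lingual plane: in a char16_t string
// it becomes a surrogate pair (two elements) plus the terminating null.
constexpr std::size_t utf16_units = sizeof(u"\U0001F600") / sizeof(char16_t);
constexpr std::size_t utf32_units = sizeof(U"\U0001F600") / sizeof(char32_t);
```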
943
 
944
- Whether all string literals are distinct (that is, are stored in
945
- nonoverlapping objects) is *implementation-defined*. The effect of
946
- attempting to modify a string literal is undefined.
947
-
948
- In translation phase 6 ([[lex.phases]]), adjacent string literals are
949
- concatenated. If both string literals have the same *encoding-prefix*,
950
  the resulting concatenated string literal has that *encoding-prefix*. If
951
- one string literal has no *encoding-prefix*, it is treated as a string
952
- literal of the same *encoding-prefix* as the other operand. If a UTF-8
953
- string literal token is adjacent to a wide string literal token, the
954
- program is ill-formed. Any other concatenations are
955
- conditionally-supported with *implementation-defined* behavior. This
956
- concatenation is an interpretation, not a conversion. Because the
957
- interpretation happens in translation phase 6 (after each character from
958
- a literal has been translated into a value from the appropriate
959
- character set), a string literal’s initial rawness has no effect on the
960
- interpretation or well-formedness of the concatenation. Table 
961
- [[tab:lex.string.concat]] has some examples of valid concatenations.
 
 
 
 
962
 
963
  **Table: String literal concatenations** <a id="tab:lex.string.concat">[tab:lex.string.concat]</a>
964
 
965
  | | | | | | |
966
  | -------------------------- | ----- | -------------------------- | ----- | -------------------------- | ----- |
@@ -970,41 +1073,59 @@ interpretation or well-formedness of the concatenation. Table 
970
  | `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
971
 
972
 
973
  Characters in concatenated strings are kept distinct.
974
 
 
 
975
  ``` cpp
976
  "\xA" "B"
977
  ```
978
 
979
  contains the two characters `'\xA'` and `'B'` after concatenation (and
980
  not the single hexadecimal character `'\xAB'`).
981
 
 
 
982
  After any necessary concatenation, in translation phase 7 (
983
  [[lex.phases]]), `'\0'` is appended to every string literal so that
984
  programs that scan a string can find its end.
985
 
986
- Escape sequences and universal-character-names in non-raw string
987
  literals have the same meaning as in character literals ([[lex.ccon]]),
988
  except that the single quote `'` is representable either by itself or by
989
  the escape sequence `\'`, and the double quote `"` shall be preceded by
990
- a `\`. In a narrow string literal, a universal-character-name may map to
991
- more than one `char` element due to *multibyte encoding*. The size of a
992
- `char32_t` or wide string literal is the total number of escape
993
- sequences, universal-character-names, and other characters, plus one for
994
- the terminating `U'\0'` or `L'\0'`. The size of a `char16_t` string
995
- literal is the total number of escape sequences,
996
- universal-character-names, and other characters, plus one for each
997
- character requiring a surrogate pair, plus one for the terminating
998
- `u'\0'`. The size of a `char16_t` string literal is the number of code
999
- units, not the number of characters. Within `char32_t` and `char16_t`
1000
- literals, any universal-character-names shall be within the range `0x0`
1001
- to `0x10FFFF`. The size of a narrow string literal is the total number
1002
- of escape sequences and other characters, plus at least one for the
1003
- multibyte encoding of each universal-character-name, plus one for the
 
 
 
 
 
1004
  terminating `'\0'`.
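
The size rules above can be checked at compile time. The following is a minimal sketch, not part of the standard text: the helper `literal_size` is a hypothetical name introduced only for illustration, and the code point U+1F600 is an arbitrary example of a character that needs a surrogate pair in a `char16_t` string.

```cpp
#include <cstddef>

// Hypothetical helper: returns the array bound N of a string literal,
// which includes the terminating null element.
template <typename T, std::size_t N>
constexpr std::size_t literal_size(const T (&)[N]) { return N; }

// "\xA" "B" stays two characters after concatenation, plus the terminator.
static_assert(sizeof("\xA" "B") == 3, "two characters plus terminating null");

// A char16_t string counts code units: the surrogate pair takes two elements.
static_assert(literal_size(u"\U0001F600") == 3, "surrogate pair plus null");

// A char32_t string holds the same character in a single element.
static_assert(literal_size(U"\U0001F600") == 2, "one element plus null");
```

The narrow-string case is left out of the sketch because the number of `char` elements produced by a universal-character-name depends on the implementation's multibyte encoding.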
1005
 
 
 
 
 
 
 
 
 
 
1006
  ### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
1007
 
1008
  ``` bnf
1009
  boolean-literal:
1010
  'false'
@@ -1020,14 +1141,17 @@ are prvalues and have type `bool`.
1020
  pointer-literal:
1021
  'nullptr'
1022
  ```
1023
 
1024
  The pointer literal is the keyword `nullptr`. It is a prvalue of type
1025
- `std::nullptr_t`. `std::nullptr_t` is a distinct type that is neither a
 
 
1026
  pointer type nor a pointer to member type; rather, a prvalue of this
1027
  type is a null pointer constant and can be converted to a null pointer
1028
- value or null member pointer value. See  [[conv.ptr]] and  [[conv.mem]].
 
1029
 
1030
  ### User-defined literals <a id="lex.ext">[[lex.ext]]</a>
1031
 
1032
  ``` bnf
1033
  user-defined-literal:
@@ -1047,10 +1171,12 @@ user-defined-integer-literal:
1047
 
1048
  ``` bnf
1049
  user-defined-floating-literal:
1050
  fractional-constant exponent-partₒₚₜ ud-suffix
1051
  digit-sequence exponent-part ud-suffix
 
 
1052
  ```
1053
 
1054
  ``` bnf
1055
  user-defined-string-literal:
1056
  string-literal ud-suffix
@@ -1064,15 +1190,24 @@ user-defined-character-literal:
1064
  ``` bnf
1065
  ud-suffix:
1066
  identifier
1067
  ```
1068
 
1069
- If a token matches both *user-defined-literal* and another literal kind,
1070
- it is treated as the latter. `123_km` is a *user-defined-literal*, but
1071
- `12LL` is an *integer-literal*. The syntactic non-terminal preceding the
1072
- *ud-suffix* in a *user-defined-literal* is taken to be the longest
1073
- sequence of characters that could match that non-terminal.
 
 
 
 
 
 
 
 
 
1074
 
1075
  A *user-defined-literal* is treated as a call to a literal operator or
1076
  literal operator template ([[over.literal]]). To determine the form of
1077
  this call for a given *user-defined-literal* *L* with *ud-suffix* *X*,
1078
  the *literal-operator-id* whose literal suffix identifier is *X* is
@@ -1102,13 +1237,14 @@ a call of the form
1102
 
1103
  ``` cpp
1104
  operator "" X<'c₁', 'c₂', ... 'cₖ'>()
1105
  ```
1106
 
1107
- where *n* is the source character sequence c₁c₂...cₖ. The sequence
1108
- c₁c₂...cₖ can only contain characters from the basic source character
1109
- set.
 
1110
 
1111
  If *L* is a *user-defined-floating-literal*, let *f* be the literal
1112
  without its *ud-suffix*. If *S* contains a literal operator with
1113
  parameter type `long double`, the literal *L* is treated as a call of
1114
  the form
@@ -1130,32 +1266,35 @@ a call of the form
1130
 
1131
  ``` cpp
1132
  operator "" X<'c₁', 'c₂', ... 'cₖ'>()
1133
  ```
1134
 
1135
- where *f* is the source character sequence c₁c₂...cₖ. The sequence
1136
- c₁c₂...cₖ can only contain characters from the basic source character
1137
- set.
 
1138
 
1139
  If *L* is a *user-defined-string-literal*, let *str* be the literal
1140
  without its *ud-suffix* and let *len* be the number of code units in
1141
  *str* (i.e., its length excluding the terminating null character). The
1142
  literal *L* is treated as a call of the form
1143
 
1144
  ``` cpp
1145
- operator "" X(str{}, len{})
1146
  ```
1147
 
1148
  If *L* is a *user-defined-character-literal*, let *ch* be the literal
1149
  without its *ud-suffix*. *S* shall contain a literal operator (
1150
  [[over.literal]]) whose only parameter has the type of *ch* and the
1151
  literal *L* is treated as a call of the form
1152
 
1153
  ``` cpp
1154
- operator "" X(ch{})
1155
  ```
1156
 
 
 
1157
  ``` cpp
1158
  long double operator "" _w(long double);
1159
  std::string operator "" _w(const char16_t*, std::size_t);
1160
  unsigned operator "" _w(const char*);
1161
  int main() {
@@ -1164,48 +1303,47 @@ int main() {
1164
  12_w; // calls operator "" _w("12")
1165
  "two"_w; // error: no applicable literal operator
1166
  }
1167
  ```
1168
 
 
 
1169
  In translation phase 6 ([[lex.phases]]), adjacent string literals are
1170
  concatenated and *user-defined-string-literal*s are considered string
1171
  literals for that purpose. During concatenation, *ud-suffix*es are
1172
  removed and ignored and the concatenation process occurs as described
1173
  in  [[lex.string]]. At the end of phase 6, if a string literal is the
1174
  result of a concatenation involving at least one
1175
  *user-defined-string-literal*, all the participating
1176
  *user-defined-string-literal*s shall have the same *ud-suffix* and that
1177
  suffix is applied to the result of the concatenation.
1178
 
 
 
1179
  ``` cpp
1180
  int main() {
1181
  L"A" "B" "C"_x; // OK: same as L"ABC"_x
1182
  "P"_x "Q" "R"_y; // error: two different ud-suffixes
1183
  }
1184
  ```
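
As a complement to the declarations-only examples above, here is a small self-contained sketch of literal operators that can actually be compiled. The suffixes `_km` and `_len` are hypothetical names chosen for illustration. The code assumes a C++14 or later compiler, where `operator""_km` may be written without a space before the suffix.

```cpp
#include <cstddef>

// Hypothetical suffix: interprets an integer literal as kilometres, in metres.
constexpr unsigned long long operator""_km(unsigned long long v) {
    return v * 1000;
}

// Hypothetical suffix: receives the string and its length in code units,
// excluding the terminating null character.
constexpr std::size_t operator""_len(const char*, std::size_t n) {
    return n;
}

// 123_km is a user-defined-integer-literal: a call operator""_km(123ULL).
static_assert(123_km == 123000, "cooked integer form");

// "two"_len is a user-defined-string-literal: operator""_len("two", 3).
static_assert("two"_len == 3, "length excludes the terminating null");

// Adjacent literals are concatenated first; the suffix applies to the result.
static_assert("ab" "cd"_len == 4, "suffix applied after concatenation");
```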
1185
 
1186
- Some *identifier*s appearing as *ud-suffix*es are reserved for future
1187
- standardization ([[usrlit.suffix]]). A program containing such a
1188
- *ud-suffix* is ill-formed, no diagnostic required.
1189
 
1190
  <!-- Link reference definitions -->
1191
  [basic.fundamental]: basic.md#basic.fundamental
1192
  [basic.link]: basic.md#basic.link
1193
  [basic.lookup.unqual]: basic.md#basic.lookup.unqual
1194
  [basic.stc]: basic.md#basic.stc
1195
  [basic.types]: basic.md#basic.types
1196
- [charname.allowed]: charname.md#charname.allowed
1197
- [charname.disallowed]: charname.md#charname.disallowed
1198
  [conv.mem]: conv.md#conv.mem
1199
  [conv.ptr]: conv.md#conv.ptr
1200
  [cpp]: cpp.md#cpp
1201
  [cpp.concat]: cpp.md#cpp.concat
1202
  [cpp.cond]: cpp.md#cpp.cond
1203
  [cpp.include]: cpp.md#cpp.include
1204
  [cpp.stringize]: cpp.md#cpp.stringize
1205
  [dcl.attr.grammar]: dcl.md#dcl.attr.grammar
1206
- [global.names]: library.md#global.names
1207
  [headers]: library.md#headers
1208
  [lex]: #lex
1209
  [lex.bool]: #lex.bool
1210
  [lex.ccon]: #lex.ccon
1211
  [lex.charset]: #lex.charset
@@ -1225,23 +1363,22 @@ standardization ([[usrlit.suffix]]). A program containing such a
1225
  [lex.ppnumber]: #lex.ppnumber
1226
  [lex.pptoken]: #lex.pptoken
1227
  [lex.separate]: #lex.separate
1228
  [lex.string]: #lex.string
1229
  [lex.token]: #lex.token
1230
- [lex.trigraph]: #lex.trigraph
1231
  [over.literal]: over.md#over.literal
1232
  [tab:alternative.representations]: #tab:alternative.representations
1233
  [tab:alternative.tokens]: #tab:alternative.tokens
 
 
1234
  [tab:escape.sequences]: #tab:escape.sequences
1235
  [tab:identifiers.special]: #tab:identifiers.special
1236
  [tab:keywords]: #tab:keywords
1237
  [tab:lex.string.concat]: #tab:lex.string.concat
1238
  [tab:lex.type.integer.literal]: #tab:lex.type.integer.literal
1239
- [tab:trigraph.sequences]: #tab:trigraph.sequences
1240
  [temp.explicit]: temp.md#temp.explicit
1241
  [temp.names]: temp.md#temp.names
1242
- [usrlit.suffix]: library.md#usrlit.suffix
1243
 
1244
  [^1]: Implementations must behave as if these separate phases occur,
1245
  although in practice different phases might be folded together.
1246
 
1247
  [^2]: A partial preprocessing token would arise from a source file
@@ -1256,16 +1393,16 @@ standardization ([[usrlit.suffix]]). A program containing such a
1256
  [^4]: The glyphs for the members of the basic source character set are
1257
  intended to identify characters from the subset of ISO/IEC 10646
1258
  which corresponds to the ASCII character set. However, because the
1259
  mapping from source file characters to the source character set
1260
  (described in translation phase 1) is specified as
1261
- implementation-defined, an implementation is required to document
1262
  how the basic source characters are represented in source files.
1263
 
1264
- [^5]: A sequence of characters resembling a universal-character-name in
1265
- an *r-char-sequence* ([[lex.string]]) does not form a
1266
- universal-character-name.
1267
 
1268
  [^6]: These include “digraphs” and additional reserved words. The term
1269
  “digraph” (token consisting of two characters) is not perfectly
1270
  descriptive, since one of the alternative preprocessing-tokens is
1271
  `%:%:` and of course several primary tokens contain two characters.
@@ -1282,14 +1419,14 @@ standardization ([[usrlit.suffix]]). A program containing such a
1282
  might result in an error, be interpreted as the character
1283
  corresponding to the escape sequence, or have a completely different
1284
  meaning, depending on the implementation.
1285
 
1286
  [^10]: On systems in which linkers cannot accept extended characters, an
1287
- encoding of the universal-character-name may be used in forming
1288
  valid external identifiers. For example, some otherwise unused
1289
  character or sequence of characters may be used to encode the `\u`
1290
- in a universal-character-name. Extended characters may produce a
1291
  long external identifier, but C++ does not place a translation limit
1292
  on significant characters for external identifiers. In C++, upper-
1293
  and lower-case letters are considered different for all identifiers,
1294
  including external identifiers.
1295
 
@@ -1299,7 +1436,7 @@ standardization ([[usrlit.suffix]]). A program containing such a
1299
  [^12]: The digits `8` and `9` are not octal digits.
1300
 
1301
  [^13]: They are intended for character sets where a character does not
1302
  fit into a single byte.
1303
 
1304
- [^14]: Using an escape sequence for a question mark can avoid
1305
- accidentally creating a trigraph.
 
5
  The text of the program is kept in units called *source files* in this
6
  International Standard. A source file together with all the headers (
7
  [[headers]]) and source files included ([[cpp.include]]) via the
8
  preprocessing directive `#include`, less any source lines skipped by any
9
  of the conditional inclusion ([[cpp.cond]]) preprocessing directives,
10
+ is called a *translation unit*.
 
11
 
12
+ [*Note 1*: A C++ program need not all be translated at the same
13
+ time. *end note*]
14
+
15
+ [*Note 2*: Previously translated translation units and instantiation
16
+ units can be preserved individually or in libraries. The separate
17
+ translation units of a program communicate ([[basic.link]]) by (for
18
+ example) calls to functions whose identifiers have external linkage,
19
+ manipulation of objects whose identifiers have external linkage, or
20
+ manipulation of data files. Translation units can be separately
21
+ translated and then later linked to produce an executable program (
22
+ [[basic.link]]). — *end note*]
23
 
24
  ## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
25
 
26
  The precedence among the syntax rules of translation is specified by the
27
  following phases.[^1]
28
 
29
  1. Physical source file characters are mapped, in an
30
  *implementation-defined* manner, to the basic source character set
31
  (introducing new-line characters for end-of-line indicators) if
32
  necessary. The set of physical source file characters accepted is
33
+ *implementation-defined*. Any source file character not in the basic
34
+ source character set ([[lex.charset]]) is replaced by the
35
+ *universal-character-name* that designates that character. An
36
+ implementation may use any internal encoding, so long as an actual
37
+ extended character encountered in the source file, and the same
38
+ extended character expressed in the source file as a
39
+ *universal-character-name* (e.g., using the `\uXXXX` notation), are
40
+ handled equivalently except where this replacement is reverted (
41
+ [[lex.pptoken]]) in a raw string literal.
 
42
  2. Each instance of a backslash character (\\) immediately followed by a
43
  new-line character is deleted, splicing physical source lines to
44
  form logical source lines. Only the last backslash on any physical
45
  source line shall be eligible for being part of such a splice.
46
  Except for splices reverted in a raw string literal, if a splice
47
  results in a character sequence that matches the syntax of a
48
+ *universal-character-name*, the behavior is undefined. A source file
49
  that is not empty and that does not end in a new-line character, or
50
  that ends in a new-line character immediately preceded by a
51
  backslash character before any such splicing takes place, shall be
52
  processed as if an additional new-line character were appended to
53
  the file.
 
57
  token or in a partial comment.[^2] Each comment is replaced by one
58
  space character. New-line characters are retained. Whether each
59
  nonempty sequence of white-space characters other than new-line is
60
  retained or replaced by one space character is unspecified. The
61
  process of dividing a source file’s characters into preprocessing
62
+ tokens is context-dependent. \[*Example 1*: see the handling of `<`
63
+ within a `#include` preprocessing directive. — *end example*]
64
  4. Preprocessing directives are executed, macro invocations are
65
  expanded, and `_Pragma` unary operator expressions are executed. If
66
  a character sequence that matches the syntax of a
67
+ *universal-character-name* is produced by token concatenation (
68
  [[cpp.concat]]), the behavior is undefined. A `#include`
69
  preprocessing directive causes the named header or source file to be
70
  processed from phase 1 through phase 4, recursively. All
71
  preprocessing directives are then deleted.
72
  5. Each source character set member in a character literal or a string
73
  literal, as well as each escape sequence and
74
+ *universal-character-name* in a character literal or a non-raw
75
+ string literal, is converted to the corresponding member of the
76
+ execution character set ([[lex.ccon]], [[lex.string]]); if there is
77
+ no corresponding member, it is converted to an
78
+ *implementation-defined* member other than the null (wide)
79
+ character.[^3]
80
  6. Adjacent string literal tokens are concatenated.
81
  7. White-space characters separating tokens are no longer significant.
82
+ Each preprocessing token is converted into a token ([[lex.token]]).
83
+ The resulting tokens are syntactically and semantically analyzed and
84
+ translated as a translation unit. \[*Note 1*: The process of
85
+ analyzing and translating the tokens may occasionally result in one
86
+ token being replaced by a sequence of other tokens (
87
+ [[temp.names]]). *end note*] \[*Note 2*: Source files,
88
+ translation units and translated translation units need not
89
+ necessarily be stored as files, nor need there be any one-to-one
90
+ correspondence between these entities and any external
91
+ representation. The description is conceptual only, and does not
92
+ specify any particular implementation. — *end note*]
93
  8. Translated translation units and instantiation units are combined as
94
+ follows: \[*Note 3*: Some or all of these may be supplied from a
95
+ library. *end note*] Each translated translation unit is examined
96
+ to produce a list of required instantiations. \[*Note 4*: This may
97
+ include instantiations which have been explicitly requested (
98
+ [[temp.explicit]]). *end note*] The definitions of the required
99
+ templates are located. It is *implementation-defined* whether the
100
+ source of the translation units containing these definitions is
101
+ required to be available. \[*Note 5*: An implementation could encode
102
+ sufficient information into the translated translation unit so as to
103
+ ensure the source is not required here. — *end note*] All the
104
+ required instantiations are performed to produce *instantiation
105
+ units*. \[*Note 6*: These are similar to translated translation
106
+ units, but contain no references to uninstantiated templates and no
107
+ template definitions. — *end note*] The program is ill-formed if
108
+ any instantiation fails.
109
  9. All external entity references are resolved. Library components are
110
  linked to satisfy external references to entities not defined in the
111
  current translation. All such translator output is collected into a
112
  program image which contains information needed for execution in its
113
  execution environment.
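
Some of the phases listed above have effects that are visible in ordinary code. The following is an illustrative sketch only; the names `SPLICED`, `answer`, and `spliced_size` are hypothetical and do not come from the standard.

```cpp
#include <cstddef>

// Phase 2: the backslash and the new-line after it are deleted, so the
// macro definition continues on the next physical line.
// Phase 6: the adjacent literals "abc" and "def" become one string literal.
#define SPLICED "abc" \
                "def"

static_assert(sizeof(SPLICED) == 7, "six characters plus terminating null");

// Phase 3: the comment is replaced by one space character, which separates
// the keyword `int` from the identifier `answer`.
int/**/answer = 42;

// Hypothetical helper exposing the size of the concatenated literal.
inline std::size_t spliced_size() { return sizeof(SPLICED); }
```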
 
119
  tab, form feed, and new-line, plus the following 91 graphical
120
  characters:[^4]
121
 
122
  ``` cpp
123
  a b c d e f g h i j k l m n o p q r s t u v w x y z
 
124
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
 
125
  0 1 2 3 4 5 6 7 8 9
 
126
  _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
127
  ```
128
 
129
  The *universal-character-name* construct provides a way to name other
130
  characters.
 
138
  universal-character-name:
139
  '\u' hex-quad
140
  '\U' hex-quad hex-quad
141
  ```
142
 
143
+ The character designated by the *universal-character-name* `\UNNNNNNNN`
144
+ is that character whose character short name in ISO/IEC 10646 is
145
+ `NNNNNNNN`; the character designated by the *universal-character-name*
146
  `\uNNNN` is that character whose character short name in ISO/IEC 10646
147
+ is `0000NNNN`. If the hexadecimal value for a *universal-character-name*
148
  corresponds to a surrogate code point (in the range 0xD800–0xDFFF,
149
  inclusive), the program is ill-formed. Additionally, if the hexadecimal
150
+ value for a *universal-character-name* outside the *c-char-sequence*,
151
  *s-char-sequence*, or *r-char-sequence* of a character or string literal
152
  corresponds to a control character (in either of the ranges 0x00–0x1F or
153
  0x7F–0x9F, both inclusive) or to a character in the basic source
154
  character set, the program is ill-formed.[^5]
155
 
156
  The *basic execution character set* and the *basic execution
157
  wide-character set* shall each contain all the members of the basic
158
  source character set, plus control characters representing alert,
159
  backspace, and carriage return, plus a *null character* (respectively,
160
+ *null wide character*), whose value is 0. For each basic execution
161
+ character set, the values of the members shall be non-negative and
162
+ distinct from one another. In both the source and execution basic
163
+ character sets, the value of each character after `0` in the above list
164
+ of decimal digits shall be one greater than the value of the previous.
165
+ The *execution character set* and the *execution wide-character set* are
166
+ *implementation-defined* supersets of the basic execution character set
167
+ and the basic execution wide-character set, respectively. The values of
168
+ the members of the execution character sets and the sets of additional
169
+ members are locale-specific.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
170
 
171
  ## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
172
 
173
  ``` bnf
 
174
  preprocessing-token:
175
  header-name
176
  identifier
177
  pp-number
178
  character-literal
 
209
 
210
  - If the next character begins a sequence of characters that could be
211
  the prefix and initial double quote of a raw string literal, such as
212
  `R"`, the next preprocessing token shall be a raw string literal.
213
  Between the initial and final double quote characters of the raw
214
+ string, any transformations performed in phases 1 and 2
215
+ (*universal-character-name*s and line splicing) are reverted; this
216
  reversion shall apply before any *d-char*, *r-char*, or delimiting
217
  parenthesis is identified. The raw string literal is defined as the
218
  shortest sequence of characters that matches the raw-string pattern
219
  ``` bnf
220
  encoding-prefixₒₚₜ 'R' raw-string
221
  ```
222
  - Otherwise, if the next three characters are `<::` and the subsequent
223
+ character is neither `:` nor `>`, the `<` is treated as a
224
+ preprocessing token by itself and not as the first character of the
225
+ alternative token `<:`.
226
  - Otherwise, the next preprocessing token is the longest sequence of
227
  characters that could constitute a preprocessing token, even if that
228
+ would cause further lexical analysis to fail, except that a
229
+ *header-name* ([[lex.header]]) is only formed within a `#include`
230
+ directive ([[cpp.include]]).
231
+
232
+ [*Example 1*:
233
 
234
  ``` cpp
235
  #define R "x"
236
  const char* s = R"y"; // ill-formed raw string, not "x" "y"
237
  ```
238
 
239
+ *end example*]
 
 
 
 
 
 
240
 
241
+ [*Example 2*: The program fragment `0xe+foo` is parsed as a
242
+ preprocessing number token (one that is not a valid floating or integer
243
+ literal token), even though a parse as three preprocessing tokens `0xe`,
244
+ `+`, and `foo` might produce a valid expression (for example, if `foo`
245
+ were a macro defined as `1`). Similarly, the program fragment `1E1` is
246
+ parsed as a preprocessing number (one that is a valid floating literal
247
+ token), whether or not `E` is a macro name. — *end example*]
248
+
249
+ [*Example 3*: The program fragment `x+++++y` is parsed as `x
250
  ++ ++ + y`, which, if `x` and `y` have integral types, violates a
251
  constraint on increment operators, even though the parse `x ++ + ++ y`
252
+ might yield a correct expression. — *end example*]
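
The longest-match rule can also be seen in a well-formed case. In this hedged sketch (the function name `munch_demo` is hypothetical), `x+++y` is tokenized as `x ++ + y`, so `x` is incremented and `y` is not.

```cpp
// Hypothetical demonstration of longest-match tokenization.
// `x+++y` is read as `x ++ + y`, never as `x + ++y`.
inline int munch_demo() {
    int x = 1;
    int y = 2;
    int r = x+++y;         // r = (x++) + y = 1 + 2; x becomes 2, y stays 2
    return r * 100 + x * 10 + y;
}
```

The return value packs `r`, `x`, and `y` into one number so that a single comparison shows which operand was incremented.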
253
 
254
  ## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
255
 
256
  Alternative token representations are provided for some operators and
257
  punctuators.[^6]
 
274
 
275
  There are five kinds of tokens: identifiers, keywords, literals,[^8]
276
  operators, and other separators. Blanks, horizontal and vertical tabs,
277
  newlines, formfeeds, and comments (collectively, “white space”), as
278
  described below, are ignored except as they serve to separate tokens.
279
+
280
+ [*Note 1*: Some white space is required to separate otherwise adjacent
281
+ identifiers, keywords, numeric literals, and alternative tokens
282
+ containing alphabetic characters. — *end note*]
283
 
284
  ## Comments <a id="lex.comment">[[lex.comment]]</a>
285
 
286
  The characters `/*` start a comment, which terminates with the
287
  characters `*/`. These comments do not nest. The characters `//` start a
288
  comment, which terminates immediately before the next new-line
289
  character. If there is a form-feed or a vertical-tab character in such a
290
  comment, only white-space characters shall appear between it and the
291
+ new-line that terminates the comment; no diagnostic is required.
292
+
293
+ [*Note 1*: The comment characters `//`, `/*`, and `*/` have no special
294
+ meaning within a `//` comment and are treated just like other
295
+ characters. Similarly, the comment characters `//` and `/*` have no
296
+ special meaning within a `/*` comment. — *end note*]
297
 
298
  ## Header names <a id="lex.header">[[lex.header]]</a>
299
 
300
  ``` bnf
301
  header-name:
 
323
  ``` bnf
324
  q-char:
325
  any member of the source character set except new-line and '"'
326
  ```
327
 
328
+ [*Note 1*: Header name preprocessing tokens only appear within a
329
+ `#include` preprocessing directive (see 
330
+ [[lex.pptoken]]). *end note*]
331
+
332
+ The sequences in both forms of *header-name*s are mapped in an
333
+ *implementation-defined* manner to headers or to external source file
334
+ names as specified in  [[cpp.include]].
335
 
336
  The appearance of either of the characters `'` or `\` or of either of
337
  the character sequences `/*` or `//` in a *q-char-sequence* or an
338
+ *h-char-sequence* is conditionally-supported with
339
+ *implementation-defined* semantics, as is the appearance of the
340
+ character `"` in an *h-char-sequence*.[^9]
341
 
342
  ## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
343
 
344
  ``` bnf
345
  pp-number:
 
349
  pp-number identifier-nondigit
350
  pp-number ''' digit
351
  pp-number ''' nondigit
352
  pp-number 'e' sign
353
  pp-number 'E' sign
354
+ pp-number 'p' sign
355
+ pp-number 'P' sign
356
  pp-number '.'
357
  ```
358
 
359
  Preprocessing number tokens lexically include all integer literal
360
  tokens ([[lex.icon]]) and all floating literal tokens ([[lex.fcon]]).
 
374
 
375
  ``` bnf
376
  identifier-nondigit:
377
  nondigit
378
  universal-character-name
 
379
  ```
380
 
381
  ``` bnf
382
  nondigit: one of
383
  'a b c d e f g h i j k l m'
 
390
  digit: one of
391
  '0 1 2 3 4 5 6 7 8 9'
392
  ```
393
 
394
  An identifier is an arbitrarily long sequence of letters and digits.
395
+ Each *universal-character-name* in an identifier shall designate a
396
  character whose encoding in ISO 10646 falls into one of the ranges
397
+ specified in Table  [[tab:charname.allowed]]. The initial element shall
398
+ not be a *universal-character-name* designating a character whose
399
+ encoding falls into one of the ranges specified in Table 
400
+ [[tab:charname.disallowed]]. Upper- and lower-case letters are
401
+ different. All characters are significant.[^10]
402
+
403
+ **Table: Ranges of characters allowed** <a id="tab:charname.allowed">[tab:charname.allowed]</a>
404
+
405
+ | | | | | |
406
+ | ------------- | ------------- | ------------- | ------------- | ------------- |
407
+ | `00A8` | `00AA` | `00AD` | `00AF` | `00B2-00B5` |
408
+ | `00B7-00BA` | `00BC-00BE` | `00C0-00D6` | `00D8-00F6` | `00F8-00FF` |
409
+ | `0100-167F` | `1681-180D` | `180F-1FFF` | | |
410
+ | `200B-200D` | `202A-202E` | `203F-2040` | `2054` | `2060-206F` |
411
+ | `2070-218F` | `2460-24FF` | `2776-2793` | `2C00-2DFF` | `2E80-2FFF` |
412
+ | `3004-3007` | `3021-302F` | `3031-D7FF` | | |
413
+ | `F900-FD3D` | `FD40-FDCF` | `FDF0-FE44` | `FE47-FFFD` | |
414
+ | `10000-1FFFD` | `20000-2FFFD` | `30000-3FFFD` | `40000-4FFFD` | `50000-5FFFD` |
415
+ | `60000-6FFFD` | `70000-7FFFD` | `80000-8FFFD` | `90000-9FFFD` | `A0000-AFFFD` |
416
+ | `B0000-BFFFD` | `C0000-CFFFD` | `D0000-DFFFD` | `E0000-EFFFD` | |
417
+
418
+
419
+ **Table: Ranges of characters disallowed initially (combining characters)** <a id="tab:charname.disallowed">[tab:charname.disallowed]</a>
420
+
421
+ | | | | |
422
+ | ----------- | ---------------------------------------------- | ----------- | ----------- |
423
+ | `0300-036F` | `1DC0-1DFF` | `20D0-20FF` | `FE20-FE2F` |
424
+
425
 
426
  The identifiers in Table  [[tab:identifiers.special]] have a special
427
  meaning when appearing in a certain context. When referred to in the
428
  grammar, these identifiers are used explicitly rather than using the
429
  *identifier* grammar production. Unless otherwise specified, any
 
436
  | ---------- | ------- |
437
  | `override` | `final` |
438
 
439
 
440
  In addition, some identifiers are reserved for use by C++
441
+ implementations and shall not be used otherwise; no diagnostic is
442
+ required.
443
+
444
+ - Each identifier that contains a double underscore `__` or begins with
445
+ an underscore followed by an uppercase letter is reserved to the
446
+ implementation for any use.
447
+ - Each identifier that begins with an underscore is reserved to the
448
+ implementation for use as a name in the global namespace.
449
 
450
  ## Keywords <a id="lex.key">[[lex.key]]</a>
451
 
452
  The identifiers shown in Table  [[tab:keywords]] are reserved for use as
453
  keywords (that is, they are unconditionally treated as keywords in phase
454
+ 7) except in an *attribute-token* ([[dcl.attr.grammar]]):
 
455
 
456
  **Table: Keywords** <a id="tab:keywords">[tab:keywords]</a>
457
 
458
  | | | | | |
459
  | ------------ | -------------- | ----------- | ------------------ | ---------- |
 
472
  | `const` | `false` | `private` | `this` | `while` |
473
  | `constexpr` | `float` | `protected` | `thread_local` | |
474
  | `const_cast` | `for` | `public` | `throw` | |
475
 
476
 
477
+ [*Note 1*: The `export` and `register` keywords are unused but are
478
+ reserved for future use. — *end note*]
479
+
480
  Furthermore, the alternative representations shown in Table 
481
  [[tab:alternative.representations]] for certain operators and
482
  punctuators ([[lex.digraph]]) are reserved and shall not be used
483
  otherwise:
484
 
 
544
  decimal-literal '''ₒₚₜ digit
545
  ```
546
 
547
  ``` bnf
548
  hexadecimal-literal:
549
+ hexadecimal-prefix hexadecimal-digit-sequence
 
 
550
  ```
551
 
552
  ``` bnf
553
  binary-digit:
554
  '0'
 
563
  ``` bnf
564
  nonzero-digit: one of
565
  '1 2 3 4 5 6 7 8 9'
566
  ```
567
 
568
+ ``` bnf
569
+ hexadecimal-prefix: one of
570
+ '0x 0X'
571
+ ```
572
+
573
+ ``` bnf
574
+ hexadecimal-digit-sequence:
575
+ hexadecimal-digit
576
+ hexadecimal-digit-sequence '''ₒₚₜ hexadecimal-digit
577
+ ```
578
+
579
  ``` bnf
580
  hexadecimal-digit: one of
581
  '0 1 2 3 4 5 6 7 8 9'
582
  'a b c d e f'
583
  'A B C D E F'
 
608
 
609
  An *integer literal* is a sequence of digits that has no period or
610
  exponent part, with optional separating single quotes that are ignored
611
  when determining its value. An integer literal may have a prefix that
612
  specifies its base and a suffix that specifies its type. The lexically
613
+ first digit of the sequence of digits is the most significant. A *binary
614
+ integer literal* (base two) begins with `0b` or `0B` and consists of a
615
+ sequence of binary digits. An *octal integer literal* (base eight)
616
+ begins with the digit `0` and consists of a sequence of octal
617
+ digits.[^12] A *decimal integer literal* (base ten) begins with a digit
618
+ other than `0` and consists of a sequence of decimal digits. A
619
+ *hexadecimal integer literal* (base sixteen) begins with `0x` or `0X`
620
  and consists of a sequence of hexadecimal digits, which include the
621
  decimal digits and the letters `a` through `f` and `A` through `F` with
622
+ decimal values ten through fifteen.
623
+
624
+ [*Example 1*: The number twelve can be written `12`, `014`, `0XC`, or
625
+ `0b1100`. The integer literals `1048576`, `1'048'576`, `0X100000`,
626
+ `0x10'0000`, and `0'004'000'000` all have the same
627
+ value. — *end example*]
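
The equalities in the example above can be checked at compile time. The following is a minimal sketch, assuming a compiler that accepts C++14 binary literals and digit separators; the name `twelve_in_binary` is hypothetical and exists only for illustration.

```cpp
// Every literal in each group denotes the same value; only the base
// prefix and the (ignored) digit separators differ.
static_assert(12 == 014, "octal spelling of twelve");
static_assert(12 == 0XC, "hexadecimal spelling of twelve");
static_assert(12 == 0b1100, "binary spelling of twelve");

static_assert(1048576 == 1'048'576, "separators are ignored");
static_assert(1048576 == 0X100000, "hexadecimal");
static_assert(1048576 == 0x10'0000, "hexadecimal with a separator");
static_assert(1048576 == 0'004'000'000, "octal with separators");

// Hypothetical helper so the value can also be inspected by other code.
constexpr int twelve_in_binary() { return 0b1100; }
```

Note that `0'004'000'000` is octal because it begins with the digit `0`; the separators do not change the base.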
628
 
629
  The type of an integer literal is the first of the corresponding list in
630
  Table  [[tab:lex.type.integer.literal]] in which its value can be
631
  represented.
632
 
 
656
 
657
 
658
  If an integer literal cannot be represented by any type in its list and
659
  an extended integer type ([[basic.fundamental]]) can represent its
660
  value, it may have that extended integer type. If all of the types in
661
+ the list for the integer literal are signed, the extended integer type
662
+ shall be signed. If all of the types in the list for the integer literal
663
+ are unsigned, the extended integer type shall be unsigned. If the list
664
+ contains both signed and unsigned types, the extended integer type may
665
+ be signed or unsigned. A program is ill-formed if one of its translation
666
+ units contains an integer literal that cannot be represented by any of
667
+ the allowed types.
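
How the suffix and the base select the candidate type list can be observed with `decltype`. This is a sketch under stated assumptions: `long long` is taken to be exactly 64 bits wide (the standard only guarantees at least 64), and the helper `has_signed_type` is a hypothetical name introduced here.

```cpp
#include <type_traits>

// Hypothetical helper: reports whether the type chosen for a literal is signed.
template <class T>
constexpr bool has_signed_type(T) { return std::is_signed<T>::value; }

// The suffix selects the list of candidate types.
static_assert(std::is_same<decltype(1), int>::value, "no suffix, small value");
static_assert(std::is_same<decltype(1U), unsigned int>::value, "u suffix");
static_assert(std::is_same<decltype(1L), long>::value, "l suffix");
static_assert(std::is_same<decltype(1ULL), unsigned long long>::value, "ull suffix");

// An unsuffixed decimal literal is only ever given a signed type,
// even when it is too large for a 32-bit int.
static_assert(has_signed_type(9000000000), "decimal literals stay signed");

// An unsuffixed hexadecimal literal may fall through to an unsigned type.
// Assumption: no standard signed type is wider than 64 bits.
static_assert(!has_signed_type(0xFFFFFFFFFFFFFFFF), "hex literal became unsigned");
```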
668
 
669
  ### Character literals <a id="lex.ccon">[[lex.ccon]]</a>
670
 
671
  ``` bnf
672
  character-literal:
673
+ encoding-prefixₒₚₜ ''' c-char-sequence '''
674
+ ```
675
+
676
+ ``` bnf
677
+ encoding-prefix: one of
678
+ 'u8' 'u' 'U' 'L'
679
  ```
680
 
681
  ``` bnf
682
  c-char-sequence:
683
  c-char
 
709
  '\x' hexadecimal-digit
710
  hexadecimal-escape-sequence hexadecimal-digit
711
  ```
712
 
713
  A character literal is one or more characters enclosed in single quotes,
714
+ as in `'x'`, optionally preceded by `u8`, `u`, `U`, or `L`, as in
715
+ `u8'w'`, `u'x'`, `U'y'`, or `L'z'`, respectively.
 
 
 
 
 
 
 
 
 
 
716
 
717
+ A character literal that does not begin with `u8`, `u`, `U`, or `L` is
718
+ an *ordinary character literal*. An ordinary character literal that
719
+ contains a single *c-char* representable in the execution character set
720
+ has type `char`, with value equal to the numerical value of the encoding
721
+ of the *c-char* in the execution character set. An ordinary character
722
+ literal that contains more than one *c-char* is a *multicharacter
723
+ literal*. A multicharacter literal, or an ordinary character literal
724
+ containing a single *c-char* not representable in the execution
725
+ character set, is conditionally-supported, has type `int`, and has an
726
+ *implementation-defined* value.
727
+
728
+ A character literal that begins with `u8`, such as `u8'w'`, is a
729
+ character literal of type `char`, known as a *UTF-8 character literal*.
730
+ The value of a UTF-8 character literal is equal to its ISO 10646 code
731
+ point value, provided that the code point value is representable with a
732
+ single UTF-8 code unit (that is, provided it is in the C0 Controls and
733
+ Basic Latin Unicode block). If the value is not representable with a
734
+ single UTF-8 code unit, the program is ill-formed. A UTF-8 character
735
+ literal containing multiple *c-char*s is ill-formed.
736
+
737
+ A character literal that begins with the letter `u`, such as `u'x'`, is
738
  a character literal of type `char16_t`. The value of a `char16_t`
739
+ character literal containing a single *c-char* is equal to its ISO 10646
740
+ code point value, provided that the code point is representable with a
741
+ single 16-bit code unit. (That is, provided it is a basic multi-lingual
742
+ plane code point.) If the value is not representable within 16 bits, the
743
+ program is ill-formed. A `char16_t` character literal containing
744
+ multiple *c-char*s is ill-formed.
 
 
 
 
 
 
 
 
 
 
 
 
 
745
 
746
+ A character literal that begins with the letter `U`, such as `U'y'`, is
747
+ a character literal of type `char32_t`. The value of a `char32_t`
748
+ character literal containing a single *c-char* is equal to its ISO 10646
749
+ code point value. A `char32_t` character literal containing multiple
750
+ *c-char*s is ill-formed.
751
+
752
+ A character literal that begins with the letter `L`, such as `L'z'`, is
753
+ a *wide-character literal*. A wide-character literal has type
754
+ `wchar_t`.[^13] The value of a wide-character literal containing a
755
+ single *c-char* is equal to the numerical value of the encoding
756
+ of the *c-char* in the execution wide-character set, unless the *c-char*
757
+ has no representation in the execution wide-character set, in which case
758
+ the value is *implementation-defined*.
759
+
760
+ [*Note 1*: The type `wchar_t` is able to represent all members of the
761
+ execution wide-character set (see 
762
+ [[basic.fundamental]]). — *end note*]
763
+
764
+ The value of a wide-character literal containing multiple *c-char*s is
765
+ *implementation-defined*.
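
The types described in the preceding paragraphs can be confirmed with `decltype`. The sketch below leaves out the `u8` prefix and multicharacter literals, since their support and diagnostics vary between language modes and compilers; the constant `code_point_of_y` is a hypothetical name used only for illustration.

```cpp
#include <type_traits>

// The encoding prefix determines the type of a single-c-char literal.
static_assert(std::is_same<decltype('x'), char>::value, "ordinary character literal");
static_assert(std::is_same<decltype(u'x'), char16_t>::value, "u prefix");
static_assert(std::is_same<decltype(U'y'), char32_t>::value, "U prefix");
static_assert(std::is_same<decltype(L'z'), wchar_t>::value, "L prefix");

// For the u and U prefixes the value is the ISO 10646 code point.
static_assert(u'x' == 0x78, "U+0078");
static_assert(U'\U0001F600' == 0x1F600, "a code point outside the BMP fits in char32_t");

// Hypothetical constant exposing one of the values.
constexpr unsigned long code_point_of_y = U'y';
```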
766
+
767
+ Certain non-graphic characters, the single quote `'`, the double quote
768
  `"`, the question mark `?`,[^14] and the backslash `\`, can be
769
  represented according to Table  [[tab:escape.sequences]]. The double
770
  quote `"` and the question mark `?` can be represented as themselves or
771
  by the escape sequences `\"` and `\?` respectively, but the single quote
772
  `'` and the backslash `\` shall be represented by the escape sequences
 
801
  that are taken to specify the value of the desired character. There is
802
  no limit to the number of digits in a hexadecimal sequence. A sequence
803
  of octal or hexadecimal digits is terminated by the first character that
804
  is not an octal digit or a hexadecimal digit, respectively. The value of
805
  a character literal is *implementation-defined* if it falls outside of
806
+ the *implementation-defined* range defined for `char` (for character
807
+ literals with no prefix) or `wchar_t` (for character literals prefixed
808
+ by `L`).
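
A few properties of octal and hexadecimal escape sequences hold regardless of the execution character set and can be checked directly. This is an illustrative sketch; `hex_escape_value` is a hypothetical name.

```cpp
// Octal 101 and hexadecimal 41 both specify the value 65.
static_assert('\101' == '\x41', "octal and hexadecimal escapes agree");
static_assert(L'\101' == L'\x41', "the same holds for wide-character literals");

// The double quote and the question mark can be written either way.
static_assert('\"' == '"', "escaped and plain double quote");
static_assert('\?' == '?', "escaped and plain question mark");

// Hypothetical constant: the escape specifies the value directly.
constexpr int hex_escape_value = '\x41';
```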
 
809
 
810
+ [*Note 2*: If the value of a character literal prefixed by `u`, `u8`,
811
+ or `U` is outside the range defined for its type, the program is
812
+ ill-formed. — *end note*]
813
+
814
+ A *universal-character-name* is translated to the encoding, in the
815
  appropriate execution character set, of the character named. If there is
816
+ no such encoding, the *universal-character-name* is translated to an
817
+ *implementation-defined* encoding.
818
+
819
+ [*Note 3*: In translation phase 1, a *universal-character-name* is
820
+ introduced whenever an actual extended character is encountered in the
821
+ source text. Therefore, all extended characters are described in terms
822
+ of *universal-character-name*s. However, the actual compiler
823
+ implementation may use its own native character set, so long as the same
824
+ results are obtained. — *end note*]
825
 
826
  ### Floating literals <a id="lex.fcon">[[lex.fcon]]</a>
827
 
828
  ``` bnf
829
  floating-literal:
830
+ decimal-floating-literal
831
+ hexadecimal-floating-literal
832
+ ```
833
+
834
+ ``` bnf
835
+ decimal-floating-literal:
836
  fractional-constant exponent-partₒₚₜ floating-suffixₒₚₜ
837
  digit-sequence exponent-part floating-suffixₒₚₜ
838
  ```
839
 
840
+ ``` bnf
841
+ hexadecimal-floating-literal:
842
+ hexadecimal-prefix hexadecimal-fractional-constant binary-exponent-part floating-suffixₒₚₜ
843
+ hexadecimal-prefix hexadecimal-digit-sequence binary-exponent-part floating-suffixₒₚₜ
844
+ ```
845
+
846
  ``` bnf
847
  fractional-constant:
848
  digit-sequenceₒₚₜ '.' digit-sequence
849
  digit-sequence '.'
850
  ```
851
 
852
+ ``` bnf
853
+ hexadecimal-fractional-constant:
854
+ hexadecimal-digit-sequenceₒₚₜ '.' hexadecimal-digit-sequence
855
+ hexadecimal-digit-sequence '.'
856
+ ```
857
+
858
  ``` bnf
859
  exponent-part:
860
  'e' signₒₚₜ digit-sequence
861
  'E' signₒₚₜ digit-sequence
862
  ```
863
 
864
+ ``` bnf
865
+ binary-exponent-part:
866
+ 'p' signₒₚₜ digit-sequence
867
+ 'P' signₒₚₜ digit-sequence
868
+ ```
869
+
870
  ``` bnf
871
  sign: one of
872
  '+ -'
873
  ```
874
 
 
881
  ``` bnf
882
  floating-suffix: one of
883
  'f l F L'
884
  ```
885
 
886
+ A floating literal consists of an optional prefix specifying a base, an
887
+ integer part, a radix point, a fraction part, an `e`, `E`, `p` or `P`,
888
+ an optionally signed integer exponent, and an optional type suffix. The
889
+ integer and fraction parts both consist of a sequence of decimal (base
890
+ ten) digits if there is no prefix, or hexadecimal (base sixteen) digits
891
+ if the prefix is `0x` or `0X`. The floating literal is a *decimal
892
+ floating literal* in the former case and a *hexadecimal floating
893
+ literal* in the latter case. Optional separating single quotes in a
894
+ *digit-sequence* or *hexadecimal-digit-sequence* are ignored when
895
+ determining its value.
896
+
897
+ [*Example 1*: The floating literals `1.602'176'565e-19` and
898
+ `1.602176565e-19` have the same value. — *end example*]
899
+
900
+ Either the integer part or the fraction part (not both) can be omitted.
901
+ Either the radix point or the letter `e` or `E` and the exponent (not
902
+ both) can be omitted from a decimal floating literal. The radix point
903
+ (but not the exponent) can be omitted from a hexadecimal floating
904
+ literal. The integer part, the optional radix point, and the optional
905
+ fraction part, form the *significand* of the floating literal. In a
906
+ decimal floating literal, the exponent, if present, indicates the power
907
+ of 10 by which the significand is to be scaled. In a hexadecimal
908
+ floating literal, the exponent indicates the power of 2 by which the
909
+ significand is to be scaled.
910
+
911
+ [*Example 2*: The floating literals `49.625` and `0xC.68p+2` have the
912
+ same value. — *end example*]
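
The scaling rule can be worked through for `0xC.68p+2`: the significand is 12 + 6/16 + 8/256 = 12.40625, and the binary exponent scales it by 2² = 4, giving 49.625. A minimal compile-time check follows, assuming a compiler that accepts hexadecimal floating literals (C++17); `scaled_hex` is a hypothetical name.

```cpp
// Digit separators are ignored when determining the value.
static_assert(1.602'176'565e-19 == 1.602176565e-19, "separators are ignored");

// Decimal exponent: power of 10. Binary exponent: power of 2.
static_assert(1e3 == 1000.0, "1 scaled by 10^3");
static_assert(0x1p-3 == 0.125, "1 scaled by 2^-3");
static_assert(0xC.68p+2 == 49.625, "12.40625 scaled by 2^2");

// Either the integer part or the fraction part may be omitted.
static_assert(.5 == 0.5 && 5. == 5.0, "omitted integer or fraction part");

// Hypothetical constant holding the value from the example.
constexpr double scaled_hex = 0xC.68p+2;
```

All values used here are exactly representable in binary floating point, so the equality comparisons are safe.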
913
+
914
+ If the scaled value is in the range of representable values for its
915
+ type, the result is the scaled value if representable, else the larger
916
+ or smaller representable value nearest the scaled value, chosen in an
917
+ *implementation-defined* manner. The type of a floating literal is
918
+ `double` unless explicitly specified by a suffix. The suffixes `f` and
919
+ `F` specify `float`; the suffixes `l` and `L` specify `long double`.
920
+ If the scaled value is not in the range of representable values for its
921
+ type, the program is ill-formed.
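
The effect of the suffix on the type can be shown with `decltype`. This is a small illustrative sketch; the helper `is_double_literal` is a hypothetical name.

```cpp
#include <type_traits>

// Hypothetical helper: true when the literal's type is double.
template <class T>
constexpr bool is_double_literal(T) { return std::is_same<T, double>::value; }

static_assert(std::is_same<decltype(1.0), double>::value, "no suffix");
static_assert(std::is_same<decltype(1.0f), float>::value, "f suffix");
static_assert(std::is_same<decltype(1.0F), float>::value, "F suffix");
static_assert(std::is_same<decltype(1.0l), long double>::value, "l suffix");
static_assert(std::is_same<decltype(1.0L), long double>::value, "L suffix");
```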
922
 
923
  ### String literals <a id="lex.string">[[lex.string]]</a>
924
 
925
  ``` bnf
926
  string-literal:
927
  encoding-prefixₒₚₜ '"' s-char-sequenceₒₚₜ '"'
928
  encoding-prefixₒₚₜ 'R' raw-string
929
  ```
930
 
 
 
 
 
 
 
 
 
931
  ``` bnf
932
  s-char-sequence:
933
  s-char
934
  s-char-sequence s-char
935
  ```
 
949
  d-char-sequence:
950
  d-char
951
  d-char-sequence d-char
952
  ```
953
 
954
+ A *string-literal* is a sequence of characters (as defined in 
955
  [[lex.ccon]]) surrounded by double quotes, optionally prefixed by `R`,
956
  `u8`, `u8R`, `u`, `uR`, `U`, `UR`, `L`, or `LR`, as in `"..."`,
957
  `R"(...)"`, `u8"..."`, `u8R"**(...)**"`, `u"..."`, `uR"*~(...)*~"`,
958
  `U"..."`, `UR"zzz(...)zzz"`, `L"..."`, or `LR"(...)"`, respectively.
959
 
960
+ A *string-literal* that has an `R` in the prefix is a *raw string
961
  literal*. The *d-char-sequence* serves as a delimiter. The terminating
962
  *d-char-sequence* of a *raw-string* is the same sequence of characters
963
  as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
964
  at most 16 characters.
965
 
966
+ [*Note 1*: The characters `'('` and `')'` are permitted in a
967
+ *raw-string*. Thus, `R"delimiter((a|b))delimiter"` is equivalent to
968
+ `"(a|b)"`. — *end note*]
969
+
970
+ [*Note 2*:
971
 
972
  A source-file new-line in a raw string literal results in a new-line in
973
+ the resulting execution string literal. Assuming no whitespace at the
974
  beginning of lines in the following example, the assert will succeed:
975
 
976
  ``` cpp
977
  const char* p = R"(a\
978
  b
979
  c)";
980
  assert(std::strcmp(p, "a\\\nb\nc") == 0);
981
  ```
982
 
983
+ — *end note*]
984
+
985
+ [*Example 1*:
986
+
987
  The raw string
988
 
989
  ``` cpp
990
  R"a(
991
  )\
 
1007
  )#"
1008
  ```
1009
 
1010
  is equivalent to `"\n)\?\?=\"\n"`.
1011
 
1012
+ — *end example*]
 
 
1013
 
1014
+ After translation phase 6, a *string-literal* that does not begin with
1015
+ an *encoding-prefix* is an *ordinary string literal*, and is initialized
1016
+ with the given characters.
1017
+
1018
+ A *string-literal* that begins with `u8`, such as `u8"asdf"`, is a
1019
+ *UTF-8 string literal*.
1020
 
1021
  Ordinary string literals and UTF-8 string literals are also referred to
1022
  as narrow string literals. A narrow string literal has type “array of
1023
  *n* `const char`”, where *n* is the size of the string as defined below,
1024
  and has static storage duration ([[basic.stc]]).
1025
 
1026
  For a UTF-8 string literal, each successive element of the object
1027
  representation ([[basic.types]]) has the value of the corresponding
1028
  code unit of the UTF-8 encoding of the string.
1029
 
1030
+ A *string-literal* that begins with `u`, such as `u"asdf"`, is a
1031
  `char16_t` string literal. A `char16_t` string literal has type “array
1032
  of *n* `const char16_t`”, where *n* is the size of the string as defined
1033
+ below; it is initialized with the given characters. A single *c-char*
1034
+ may produce more than one `char16_t` character in the form of surrogate
1035
+ pairs.
1036
 
1037
+ A *string-literal* that begins with `U`, such as `U"asdf"`, is a
1038
  `char32_t` string literal. A `char32_t` string literal has type “array
1039
  of *n* `const char32_t`”, where *n* is the size of the string as defined
1040
+ below; it is initialized with the given characters.
 
1041
 
1042
+ A *string-literal* that begins with `L`, such as `L"asdf"`, is a *wide
1043
+ string literal*. A wide string literal has type “array of *n* `const
1044
+ wchar_t`”, where *n* is the size of the string as defined below; it is
1045
+ initialized with the given characters.
1046
 
1047
+ In translation phase 6 ([[lex.phases]]), adjacent *string-literal*s are
1048
+ concatenated. If both *string-literal*s have the same *encoding-prefix*,
 
 
 
 
1049
  the resulting concatenated string literal has that *encoding-prefix*. If
1050
+ one *string-literal* has no *encoding-prefix*, it is treated as a
1051
+ *string-literal* of the same *encoding-prefix* as the other operand. If
1052
+ a UTF-8 string literal token is adjacent to a wide string literal token,
1053
+ the program is ill-formed. Any other concatenations are
1054
+ conditionally-supported with *implementation-defined* behavior.
1055
+
1056
+ [*Note 3*: This concatenation is an interpretation, not a conversion.
1057
+ Because the interpretation happens in translation phase 6 (after each
1058
+ character from a string literal has been translated into a value from
1059
+ the appropriate character set), a *string-literal*’s initial rawness has
1060
+ no effect on the interpretation or well-formedness of the
1061
+ concatenation. — *end note*]
1062
+
1063
+ Table  [[tab:lex.string.concat]] has some examples of valid
1064
+ concatenations.
1065
 
1066
  **Table: String literal concatenations** <a id="tab:lex.string.concat">[tab:lex.string.concat]</a>
1067
 
1068
  | | | | | | |
1069
  | -------------------------- | ----- | -------------------------- | ----- | -------------------------- | ----- |
 
1073
  | `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
1074
 
1075
 
1076
  Characters in concatenated strings are kept distinct.
1077
 
1078
+ [*Example 2*:
1079
+
1080
  ``` cpp
1081
  "\xA" "B"
1082
  ```
1083
 
1084
  contains the two characters `'\xA'` and `'B'` after concatenation (and
1085
  not the single hexadecimal character `'\xAB'`).
1086
 
1087
+ — *end example*]
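
The example above can be verified by inspecting the concatenated array. In this sketch the array name `joined` is hypothetical.

```cpp
// Concatenation happens after escape sequences have been converted,
// so "\xA" "B" yields the elements '\xA', 'B', and the terminating '\0'.
constexpr char joined[] = "\xA" "B";

static_assert(sizeof(joined) == 3, "two characters plus the terminating null");
static_assert(joined[0] == '\xA', "first element is the escape \\xA");
static_assert(joined[1] == 'B', "second element is the letter B");
static_assert(joined[2] == '\0', "terminating null character");

// Written as one literal, the digits form a single escape sequence instead.
static_assert(sizeof("\xAB") == 2, "one character plus the terminating null");
```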
1088
+
1089
  After any necessary concatenation, in translation phase 7 (
1090
  [[lex.phases]]), `'\0'` is appended to every string literal so that
1091
  programs that scan a string can find its end.
1092
 
1093
+ Escape sequences and *universal-character-name*s in non-raw string
1094
  literals have the same meaning as in character literals ([[lex.ccon]]),
1095
  except that the single quote `'` is representable either by itself or by
1096
  the escape sequence `\'`, and the double quote `"` shall be preceded by
1097
+ a `\`, and except that a *universal-character-name* in a `char16_t`
1098
+ string literal may yield a surrogate pair. In a narrow string literal, a
1099
+ *universal-character-name* may map to more than one `char` element due
1100
+ to *multibyte encoding*. The size of a `char32_t` or wide string literal
1101
+ is the total number of escape sequences, *universal-character-name*s,
1102
+ and other characters, plus one for the terminating `U'\0'` or `L'\0'`.
1103
+ The size of a `char16_t` string literal is the total number of escape
1104
+ sequences, *universal-character-name*s, and other characters, plus one
1105
+ for each character requiring a surrogate pair, plus one for the
1106
+ terminating `u'\0'`.
1107
+
1108
+ [*Note 4*: The size of a `char16_t` string literal is the number of
1109
+ code units, not the number of characters. — *end note*]
1110
+
1111
+ Within `char32_t` and `char16_t` string literals, any
1112
+ *universal-character-name*s shall be within the range `0x0` to
1113
+ `0x10FFFF`. The size of a narrow string literal is the total number of
1114
+ escape sequences and other characters, plus at least one for the
1115
+ multibyte encoding of each *universal-character-name*, plus one for the
1116
  terminating `'\0'`.
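
The size rules can be observed by counting array elements. The following sketch uses a hypothetical helper, `code_units`, that returns the array bound of a string literal including the terminating null.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a string literal's array,
// including the terminating null character.
template <class T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) { return N; }

// Narrow string: three characters plus the terminating null.
static_assert(code_units("abc") == 4, "3 + 1");

// U+1F600 lies outside the basic multilingual plane.
static_assert(code_units(u"\U0001F600") == 3, "surrogate pair (2) + 1");
static_assert(code_units(U"\U0001F600") == 2, "one char32_t element + 1");

// U+00E9 needs two code units in UTF-8.
static_assert(code_units(u8"\u00E9") == 3, "two UTF-8 code units + 1");
```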
1117
 
1118
+ Evaluating a *string-literal* results in a string literal object with
1119
+ static storage duration, initialized from the given characters as
1120
+ specified above. Whether all string literals are distinct (that is, are
1121
+ stored in nonoverlapping objects) and whether successive evaluations of
1122
+ a *string-literal* yield the same or a different object is unspecified.
1123
+
1124
+ [*Note 5*: The effect of attempting to modify a string literal is
1125
+ undefined. — *end note*]
1126
+
1127
  ### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
1128
 
1129
  ``` bnf
1130
  boolean-literal:
1131
  'false'
 
1141
  pointer-literal:
1142
  'nullptr'
1143
  ```
1144
 
1145
  The pointer literal is the keyword `nullptr`. It is a prvalue of type
1146
+ `std::nullptr_t`.
1147
+
1148
+ [*Note 1*: `std::nullptr_t` is a distinct type that is neither a
1149
  pointer type nor a pointer to member type; rather, a prvalue of this
1150
  type is a null pointer constant and can be converted to a null pointer
1151
+ value or null member pointer value. See  [[conv.ptr]] and 
1152
+ [[conv.mem]]. — *end note*]
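
The properties stated in the note can be checked with type traits. This is an illustrative sketch; the struct `S` and the variables `p` and `pm` are hypothetical names.

```cpp
#include <cstddef>
#include <type_traits>

static_assert(std::is_same<decltype(nullptr), std::nullptr_t>::value,
              "nullptr has type std::nullptr_t");
static_assert(!std::is_pointer<std::nullptr_t>::value,
              "std::nullptr_t is not a pointer type");
static_assert(!std::is_member_pointer<std::nullptr_t>::value,
              "std::nullptr_t is not a pointer to member type");

struct S { int m; };  // hypothetical class, used only for the member pointer

// nullptr converts to a null pointer value and a null member pointer value.
constexpr int* p = nullptr;
constexpr int S::* pm = nullptr;
```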
1153
 
1154
  ### User-defined literals <a id="lex.ext">[[lex.ext]]</a>
1155
 
1156
  ``` bnf
1157
  user-defined-literal:
 
1171
 
1172
  ``` bnf
1173
  user-defined-floating-literal:
1174
  fractional-constant exponent-partₒₚₜ ud-suffix
1175
  digit-sequence exponent-part ud-suffix
1176
+ hexadecimal-prefix hexadecimal-fractional-constant binary-exponent-part ud-suffix
1177
+ hexadecimal-prefix hexadecimal-digit-sequence binary-exponent-part ud-suffix
1178
  ```
1179
 
1180
  ``` bnf
1181
  user-defined-string-literal:
1182
  string-literal ud-suffix
 
1190
  ``` bnf
1191
  ud-suffix:
1192
  identifier
1193
  ```
1194
 
1195
+ If a token matches both *user-defined-literal* and another *literal*
1196
+ kind, it is treated as the latter.
1197
+
1198
+ [*Example 1*:
1199
+
1200
+ `123_km`
1201
+
1202
+ is a *user-defined-literal*, but `12LL` is an *integer-literal*.
1203
+
1204
+ — *end example*]
1205
+
1206
+ The syntactic non-terminal preceding the *ud-suffix* in a
1207
+ *user-defined-literal* is taken to be the longest sequence of characters
1208
+ that could match that non-terminal.
1209
 
1210
  A *user-defined-literal* is treated as a call to a literal operator or
1211
  literal operator template ([[over.literal]]). To determine the form of
1212
  this call for a given *user-defined-literal* *L* with *ud-suffix* *X*,
1213
  the *literal-operator-id* whose literal suffix identifier is *X* is
 
1237
 
1238
  ``` cpp
1239
  operator "" X<'c₁', 'c₂', ... 'cₖ'>()
1240
  ```
1241
 
1242
+ where *n* is the source character sequence c₁c₂...cₖ.
1243
+
1244
+ [*Note 1*: The sequence c₁c₂...cₖ can only contain characters from the
1245
+ basic source character set. — *end note*]
1246
 
1247
  If *L* is a *user-defined-floating-literal*, let *f* be the literal
1248
  without its *ud-suffix*. If *S* contains a literal operator with
1249
  parameter type `long double`, the literal *L* is treated as a call of
1250
  the form
 
1266
 
1267
  ``` cpp
1268
  operator "" X<'c₁', 'c₂', ... 'cₖ'>()
1269
  ```
1270
 
1271
+ where *f* is the source character sequence c₁c₂...cₖ.
1272
+
1273
+ [*Note 2*: The sequence c₁c₂...cₖ can only contain characters from the
1274
+ basic source character set. — *end note*]
1275
 
1276
  If *L* is a *user-defined-string-literal*, let *str* be the literal
1277
  without its *ud-suffix* and let *len* be the number of code units in
1278
  *str* (i.e., its length excluding the terminating null character). The
1279
  literal *L* is treated as a call of the form
1280
 
1281
  ``` cpp
1282
+ operator "" X(str, len)
1283
  ```
1284
 
1285
  If *L* is a *user-defined-character-literal*, let *ch* be the literal
1286
  without its *ud-suffix*. *S* shall contain a literal operator (
1287
  [[over.literal]]) whose only parameter has the type of *ch* and the
1288
  literal *L* is treated as a call of the form
1289
 
1290
  ``` cpp
1291
+ operator "" X(ch)
1292
  ```
1293
 
1294
+ [*Example 2*:
1295
+
1296
  ``` cpp
1297
  long double operator "" _w(long double);
1298
  std::string operator "" _w(const char16_t*, std::size_t);
1299
  unsigned operator "" _w(const char*);
1300
  int main() {
 
1303
  12_w; // calls operator "" _w("12")
1304
  "two"_w; // error: no applicable literal operator
1305
  }
1306
  ```
1307
 
1308
+ — *end example*]
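
The call forms described above can also be exercised with small `constexpr` literal operators. This sketch is separate from the `_w` example; the suffixes `_km`, `_digits`, and `_len` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>

// Cooked integer literal operator: 5_km is treated as operator""_km(5ULL).
constexpr unsigned long long operator""_km(unsigned long long n) {
    return n * 1000;
}

// Literal operator template: 123_digits is treated as
// operator""_digits<'1', '2', '3'>().
template <char... Cs>
constexpr std::size_t operator""_digits() {
    return sizeof...(Cs);
}

// String literal operator: "two"_len is treated as operator""_len("two", 3).
constexpr std::size_t operator""_len(const char*, std::size_t len) {
    return len;
}

static_assert(5_km == 5000, "cooked form receives the value");
static_assert(123_digits == 3, "template form receives the source characters");
static_assert(0x1F_digits == 4, "the prefix is part of the character sequence");
static_assert("two"_len == 3, "len excludes the terminating null character");
```

The template form is useful when the spelling of the literal matters, since it sees the characters as written and not the converted value.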
1309
+
1310
  In translation phase 6 ([[lex.phases]]), adjacent string literals are
1311
  concatenated and *user-defined-string-literal*s are considered string
1312
  literals for that purpose. During concatenation, *ud-suffix*es are
1313
  removed and ignored and the concatenation process occurs as described
1314
  in  [[lex.string]]. At the end of phase 6, if a string literal is the
1315
  result of a concatenation involving at least one
1316
  *user-defined-string-literal*, all the participating
1317
  *user-defined-string-literal*s shall have the same *ud-suffix* and that
1318
  suffix is applied to the result of the concatenation.
1319
 
1320
+ [*Example 3*:
1321
+
1322
  ``` cpp
1323
  int main() {
1324
  L"A" "B" "C"_x; // OK: same as L"ABC"_x
1325
  "P"_x "Q" "R"_y; // error: two different ud-suffixes
1326
  }
1327
  ```
1328
 
1329
+ — *end example*]
 
 
1330
 
1331
  <!-- Link reference definitions -->
1332
  [basic.fundamental]: basic.md#basic.fundamental
1333
  [basic.link]: basic.md#basic.link
1334
  [basic.lookup.unqual]: basic.md#basic.lookup.unqual
1335
  [basic.stc]: basic.md#basic.stc
1336
  [basic.types]: basic.md#basic.types
 
 
1337
  [conv.mem]: conv.md#conv.mem
1338
  [conv.ptr]: conv.md#conv.ptr
1339
  [cpp]: cpp.md#cpp
1340
  [cpp.concat]: cpp.md#cpp.concat
1341
  [cpp.cond]: cpp.md#cpp.cond
1342
  [cpp.include]: cpp.md#cpp.include
1343
  [cpp.stringize]: cpp.md#cpp.stringize
1344
  [dcl.attr.grammar]: dcl.md#dcl.attr.grammar
 
1345
  [headers]: library.md#headers
1346
  [lex]: #lex
1347
  [lex.bool]: #lex.bool
1348
  [lex.ccon]: #lex.ccon
1349
  [lex.charset]: #lex.charset
 
1363
  [lex.ppnumber]: #lex.ppnumber
1364
  [lex.pptoken]: #lex.pptoken
1365
  [lex.separate]: #lex.separate
1366
  [lex.string]: #lex.string
1367
  [lex.token]: #lex.token
 
1368
  [over.literal]: over.md#over.literal
1369
  [tab:alternative.representations]: #tab:alternative.representations
1370
  [tab:alternative.tokens]: #tab:alternative.tokens
1371
+ [tab:charname.allowed]: #tab:charname.allowed
1372
+ [tab:charname.disallowed]: #tab:charname.disallowed
1373
  [tab:escape.sequences]: #tab:escape.sequences
1374
  [tab:identifiers.special]: #tab:identifiers.special
1375
  [tab:keywords]: #tab:keywords
1376
  [tab:lex.string.concat]: #tab:lex.string.concat
1377
  [tab:lex.type.integer.literal]: #tab:lex.type.integer.literal
 
1378
  [temp.explicit]: temp.md#temp.explicit
1379
  [temp.names]: temp.md#temp.names
 
1380
 
1381
  [^1]: Implementations must behave as if these separate phases occur,
1382
  although in practice different phases might be folded together.
1383
 
1384
  [^2]: A partial preprocessing token would arise from a source file
 
1393
  [^4]: The glyphs for the members of the basic source character set are
1394
  intended to identify characters from the subset of ISO/IEC 10646
1395
  which corresponds to the ASCII character set. However, because the
1396
  mapping from source file characters to the source character set
1397
  (described in translation phase 1) is specified as
1398
+ *implementation-defined*, an implementation is required to document
1399
  how the basic source characters are represented in source files.
1400
 
1401
+ [^5]: A sequence of characters resembling a *universal-character-name*
1402
+ in an *r-char-sequence* ([[lex.string]]) does not form a
1403
+ *universal-character-name*.
1404
 
1405
  [^6]: These include “digraphs” and additional reserved words. The term
1406
  “digraph” (token consisting of two characters) is not perfectly
1407
  descriptive, since one of the alternative preprocessing-tokens is
1408
  `%:%:` and of course several primary tokens contain two characters.
 
1419
  might result in an error, be interpreted as the character
1420
  corresponding to the escape sequence, or have a completely different
1421
  meaning, depending on the implementation.
1422
 
1423
  [^10]: On systems in which linkers cannot accept extended characters, an
1424
+ encoding of the *universal-character-name* may be used in forming
1425
  valid external identifiers. For example, some otherwise unused
1426
  character or sequence of characters may be used to encode the `\u`
1427
+ in a *universal-character-name*. Extended characters may produce a
1428
  long external identifier, but C++ does not place a translation limit
1429
  on significant characters for external identifiers. In C++, upper-
1430
  and lower-case letters are considered different for all identifiers,
1431
  including external identifiers.
1432
 
 
1436
  [^12]: The digits `8` and `9` are not octal digits.
1437
 
1438
  [^13]: They are intended for character sets where a character does not
1439
  fit into a single byte.
1440
 
1441
+ [^14]: Using an escape sequence for a question mark is supported for
1442
+ compatibility with ISO C++14 and ISO C.