[lex] - C++20 → C++23

tmp/tmpj6b1nb8v/{from.md → to.md} +602 -466

@@ -5,11 +5,11 @@
 The text of the program is kept in units called *source files* in this
 document. A source file together with all the headers [[headers]] and
 source files included [[cpp.include]] via the preprocessing directive
 `#include`, less any source lines skipped by any of the conditional
 inclusion [[cpp.cond]] preprocessing directives, is called a
-*translation unit*.
 [*Note 1*: A C++ program need not all be translated at the same
 time. — *end note*]
 [*Note 2*: Previously translated translation units and instantiation
@@ -24,160 +24,282 @@ program [[basic.link]]. — *end note*]
 ## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
 The precedence among the syntax rules of translation is specified by the
 following phases.[^1]
-1.  Physical source file characters are mapped, in an
-    *implementation-defined* manner, to the basic source character set
-    (introducing new-line characters for end-of-line indicators) if
-    necessary. The set of physical source file characters accepted is
-    *implementation-defined*. Any source file character not in the basic
-    source character set [[lex.charset]] is replaced by the
-    *universal-character-name* that designates that character. An
-    implementation may use any internal encoding, so long as an actual
-    extended character encountered in the source file, and the same
-    extended character expressed in the source file as a
-    *universal-character-name* (e.g., using the `\uXXXX` notation), are
-    handled equivalently except where this replacement is reverted
-    [[lex.pptoken]] in a raw string literal.
-2.  Each instance of a backslash character (`\`) immediately followed by a
-    new-line character is deleted, splicing physical source lines to
-    form logical source lines. Only the last backslash on any physical
-    source line shall be eligible for being part of such a splice.
-    Except for splices reverted in a raw string literal, if a splice
-    results in a character sequence that matches the syntax of a
     *universal-character-name*, the behavior is undefined. A source file
     that is not empty and that does not end in a new-line character, or
-    that ends in a new-line character immediately preceded by a
-    backslash character before any such splicing takes place, shall be
-    processed as if an additional new-line character were appended to
-    the file.
 3.  The source file is decomposed into preprocessing tokens
-    [[lex.pptoken]] and sequences of white-space characters (including
     comments). A source file shall not end in a partial preprocessing
     token or in a partial comment.[^2] Each comment is replaced by one
     space character. New-line characters are retained. Whether each
-    nonempty sequence of white-space characters other than new-line is
-    retained or replaced by one space character is unspecified. The
-    process of dividing a source file’s characters into preprocessing
-    tokens is context-dependent. \[*Example 1*: See the handling of `<`
-    within a `#include` preprocessing directive. — *end example*]
 4.  Preprocessing directives are executed, macro invocations are
-    expanded, and `_Pragma` unary operator expressions are executed. If
-    a character sequence that matches the syntax of a
-    *universal-character-name* is produced by token concatenation
-    [[cpp.concat]], the behavior is undefined. A `#include`
-    preprocessing directive causes the named header or source file to be
-    processed from phase 1 through phase 4, recursively. All
     preprocessing directives are then deleted.
-5.  Each basic source character set member in a *character-literal* or a
-    *string-literal*, as well as each escape sequence and
-    *universal-character-name* in a *character-literal* or a non-raw
-    string literal, is converted to the corresponding member of the
-    execution character set ([[lex.ccon]], [[lex.string]]); if there is
-    no corresponding member, it is converted to an
-    *implementation-defined* member other than the null (wide)
-    character.[^3]
-6.  Adjacent string literal tokens are concatenated.
-7.  White-space characters separating tokens are no longer significant.
     Each preprocessing token is converted into a token [[lex.token]].
-    The resulting tokens are syntactically and semantically analyzed and
-    translated as a translation unit. \[*Note 1*: The process of
-    analyzing and translating the tokens may occasionally result in one
-    token being replaced by a sequence of other tokens
-    [[temp.names]]. — *end note*] It is *implementation-defined*
-    whether the sources for module units and header units on which the
-    current translation unit has an interface dependency (
-    [[module.unit]], [[module.import]]) are required to be available.
-    \[*Note 2*: Source files, translation units and translated
-    translation units need not necessarily be stored as files, nor need
-    there be any one-to-one correspondence between these entities and
-    any external representation. The description is conceptual only, and
-    does not specify any particular implementation. — *end note*]
 8.  Translated translation units and instantiation units are combined as
-    follows: \[*Note 3*: Some or all of these may be supplied from a
     library. — *end note*] Each translated translation unit is examined
-    to produce a list of required instantiations. \[*Note 4*: This may
     include instantiations which have been explicitly requested
     [[temp.explicit]]. — *end note*] The definitions of the required
     templates are located. It is *implementation-defined* whether the
     source of the translation units containing these definitions is
-    required to be available. \[*Note 5*: An implementation could encode
-    sufficient information into the translated translation unit so as to
-    ensure the source is not required here. — *end note*] All the
-    required instantiations are performed to produce *instantiation
-    units*. \[*Note 6*: These are similar to translated translation
-    units, but contain no references to uninstantiated templates and no
-    template definitions. — *end note*] The program is ill-formed if
-    any instantiation fails.
 9.  All external entity references are resolved. Library components are
     linked to satisfy external references to entities not defined in the
     current translation. All such translator output is collected into a
     program image which contains information needed for execution in its
     execution environment.
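The effect of the early phases can be observed from ordinary code. The following is a minimal sketch, not part of the standard text: the macro `SPLICED_SUM` and the helper `spliced_sum_demo` are hypothetical names introduced only for illustration. It relies on phase 2 (line splicing) and on the phase 3 rule that each comment is replaced by one space character.

```cpp
#include <cassert>

// Phase 2: the backslash-newline pair is deleted, so the two physical
// lines below form one logical line holding the whole macro definition.
#define SPLICED_SUM(a, b) \
    ((a) + (b))

// Phase 3: the comment is replaced by one space character, so the two
// `+` characters stay separate tokens: 1 + (+2), never `1 ++ 2`.
constexpr int separated = 1 +/* comment */+ 2;

// Hypothetical helper used only to exercise the macro.
constexpr int spliced_sum_demo() { return SPLICED_SUM(2, 3); }

static_assert(spliced_sum_demo() == 5);
static_assert(separated == 3);
```

Both checks are evaluated at compile time, so a translation unit containing this sketch compiles only if the phases behave as described.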
 ## Character sets <a id="lex.charset">[[lex.charset]]</a>
-The *basic source character set* consists of 96 characters: the space
-character, the control characters representing horizontal tab, vertical
-tab, form feed, and new-line, plus the following 91 graphical
-characters:[^4]
-``` cpp
-a b c d e f g h i j k l m n o p q r s t u v w x y z
-A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
-0 1 2 3 4 5 6 7 8 9
-_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
-```
 The *universal-character-name* construct provides a way to name other
 characters.
 ``` bnf
 hex-quad:
     hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
 ```
 ``` bnf
 universal-character-name:
     '\u' hex-quad
     '\U' hex-quad hex-quad
 ```
-A *universal-character-name* designates the character in ISO/IEC 10646
-(if any) whose code point is the hexadecimal number represented by the
-sequence of *hexadecimal-digit*s in the *universal-character-name*. The
-program is ill-formed if that number is not a code point or if it is a
-surrogate code point. Noncharacter code points and reserved code points
-are considered to designate separate characters distinct from any
-ISO/IEC 10646 character. If a *universal-character-name* outside the
-*c-char-sequence*, *s-char-sequence*, or *r-char-sequence* of a
-*character-literal* or *string-literal* (in either case, including
-within a *user-defined-literal*) corresponds to a control character or
-to a character in the basic source character set, the program is
-ill-formed.[^5]
-[*Note 1*: ISO/IEC 10646 code points are integers in the range
-[0, 10FFFF] (hexadecimal). A surrogate code point is a value in the
-range [D800, DFFF] (hexadecimal). A control character is a character
-whose code point is in either of the ranges [0, 1F] or [7F, 9F]
-(hexadecimal). — *end note*]
-The *basic execution character set* and the *basic execution
-wide-character set* shall each contain all the members of the basic
-source character set, plus control characters representing alert,
-backspace, and carriage return, plus a *null character* (respectively,
-*null wide character*), whose value is 0. For each basic execution
-character set, the values of the members shall be non-negative and
-distinct from one another. In both the source and execution basic
-character sets, the value of each character after `0` in the above list
-of decimal digits shall be one greater than the value of the previous.
-The *execution character set* and the *execution wide-character set* are
-*implementation-defined* supersets of the basic execution character set
-and the basic execution wide-character set, respectively. The values of
-the members of the execution character sets and the sets of additional
-members are locale-specific.
 ## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
 ``` bnf
 preprocessing-token:
@@ -190,48 +312,53 @@ preprocessing-token:
     character-literal
     user-defined-character-literal
     string-literal
     user-defined-string-literal
     preprocessing-op-or-punc
-    each non-white-space character that cannot be one of the above
 ```
 Each preprocessing token that is converted to a token [[lex.token]]
 shall have the lexical form of a keyword, an identifier, a literal, or
 an operator or punctuator.
 A preprocessing token is the minimal lexical element of the language in
-translation phases 3 through 6. The categories of preprocessing token
-are: header names, placeholder tokens produced by preprocessing `import`
-and `module` directives (*import-keyword*, *module-keyword*, and
-*export-keyword*), identifiers, preprocessing numbers, character
-literals (including user-defined character literals), string literals
-(including user-defined string literals), preprocessing operators and
-punctuators, and single non-white-space characters that do not lexically
-match the other preprocessing token categories. If a `'` or a `"`
-character matches the last category, the behavior is undefined.
-Preprocessing tokens can be separated by white space; this consists of
-comments [[lex.comment]], or white-space characters (space, horizontal
-tab, new-line, vertical tab, and form-feed), or both. As described in
-[[cpp]], in certain circumstances during translation phase 4, white
-space (or the absence thereof) serves as more than preprocessing token
-separation. White space can appear within a preprocessing token only as
-part of a header name or between the quotation characters in a character
-literal or string literal.
 If the input stream has been parsed into preprocessing tokens up to a
 given character:
 - If the next character begins a sequence of characters that could be
   the prefix and initial double quote of a raw string literal, such as
   `R"`, the next preprocessing token shall be a raw string literal.
   Between the initial and final double quote characters of the raw
-  string, any transformations performed in phases 1 and 2
-  (*universal-character-name*s and line splicing) are reverted; this
-  reversion shall apply before any *d-char*, *r-char*, or delimiting
-  parenthesis is identified. The raw string literal is defined as the
-  shortest sequence of characters that matches the raw-string pattern
   ``` bnf
   encoding-prefixₒₚₜ 'R' raw-string
   ```
 - Otherwise, if the next three characters are `<::` and the subsequent
   character is neither `:` nor `>`, the `<` is treated as a
@@ -262,28 +389,29 @@ by preprocessing either of the previous two directives.
 [*Note 1*: None has any observable spelling. — *end note*]
 [*Example 2*: The program fragment `0xe+foo` is parsed as a
 preprocessing number token (one that is not a valid *integer-literal* or
 *floating-point-literal* token), even though a parse as three
-preprocessing tokens `0xe`, `+`, and `foo` might produce a valid
-expression (for example, if `foo` were a macro defined as `1`).
-Similarly, the program fragment `1E1` is parsed as a preprocessing
-number (one that is a valid *floating-point-literal* token), whether or
-not `E` is a macro name. — *end example*]
 [*Example 3*: The program fragment `x+++++y` is parsed as `x
 ++ ++ + y`, which, if `x` and `y` have integral types, violates a
 constraint on increment operators, even though the parse `x ++ + ++ y`
-might yield a correct expression. — *end example*]
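The longest-match behavior in Examples 2 and 3 can be checked with a shorter, well-formed fragment. This is an illustrative sketch only; `MunchResult` and `munch` are hypothetical names, not anything defined by the standard.

```cpp
#include <cassert>

// Hypothetical aggregate reporting both the sum and the final value of x.
struct MunchResult {
    int sum;
    int x_after;
};

constexpr MunchResult munch(int x, int y) {
    // The lexer takes the longest token first, so `x+++y` is tokenized
    // as `x ++ + y`, i.e. (x++) + y, and never as x + (++y).
    int sum = x+++y;
    return MunchResult{sum, x};
}

// White space is needed here: `0xe+1` would form a single
// preprocessing number that is not a valid literal.
constexpr int spaced = 0xe + 1;

static_assert(munch(1, 2).sum == 3);      // 1 + 2, using the old value of x
static_assert(munch(1, 2).x_after == 2);  // x was incremented, y was not
static_assert(spaced == 15);
```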
 ## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
 Alternative token representations are provided for some operators and
-punctuators.[^6]
 In all respects of the language, each alternative token behaves the
-same, respectively, as its primary token, except for its spelling.[^7]
 The set of alternative tokens is defined in [[lex.digraph]].
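As a small illustration of the rule that an alternative token behaves like its primary token, the sketch below writes the same logic with both spellings. The function names are hypothetical, and a conforming compiler mode is assumed (some compilers need a strict-conformance switch before they accept `and` and `not`).

```cpp
#include <cassert>

// `and` and `not` are alternative spellings of `&&` and `!`.
constexpr bool alt_spelling(bool a, bool b) { return a and not b; }
constexpr bool primary_spelling(bool a, bool b) { return a && !b; }

// Digraphs: `<:` `:>` stand for `[` `]`, and `<%` `%>` stand for `{` `}`.
constexpr int second(const int (&arr)<:3:>) <%
    return arr<:1:>;
%>

constexpr int data[3] = {10, 20, 30};

static_assert(alt_spelling(true, false) == primary_spelling(true, false));
static_assert(alt_spelling(true, true) == primary_spelling(true, true));
static_assert(second(data) == 20);
```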
 ## Tokens <a id="lex.token">[[lex.token]]</a>
 ``` bnf
@@ -292,11 +420,12 @@ token:
     keyword
     literal
     operator-or-punctuator
 ```
-There are five kinds of tokens: identifiers, keywords, literals,[^8]
 operators, and other separators. Blanks, horizontal and vertical tabs,
 newlines, formfeeds, and comments (collectively, “whitespace”), as
 described below, are ignored except as they serve to separate tokens.
 [*Note 1*: Some whitespace is required to separate otherwise adjacent
@@ -307,11 +436,11 @@ containing alphabetic characters. — *end note*]
 The characters `/*` start a comment, which terminates with the
 characters `*/`. These comments do not nest. The characters `//` start a
 comment, which terminates immediately before the next new-line
 character. If there is a form-feed or a vertical-tab character in such a
-comment, only white-space characters shall appear between it and the
 new-line that terminates the comment; no diagnostic is required.
 [*Note 1*: The comment characters `//`, `/*`, and `*/` have no special
 meaning within a `//` comment and are treated just like other
 characters. Similarly, the comment characters `//` and `/*` have no
@@ -331,22 +460,22 @@ h-char-sequence:
     h-char-sequence h-char
 ```
 ``` bnf
 h-char:
-    any member of the source character set except new-line and '>'
 ```
 ``` bnf
 q-char-sequence:
     q-char
     q-char-sequence q-char
 ```
 ``` bnf
 q-char:
-    any member of the source character set except new-line and '"'
 ```
 [*Note 1*: Header name preprocessing tokens only appear within a
 `#include` preprocessing directive, a `__has_include` preprocessing
 expression, or after certain occurrences of an `import` token (see
@@ -358,20 +487,19 @@ names as specified in  [[cpp.include]].
 The appearance of either of the characters `'` or `\` or of either of
 the character sequences `/*` or `//` in a *q-char-sequence* or an
 *h-char-sequence* is conditionally-supported with
 *implementation-defined* semantics, as is the appearance of the
-character `"` in an *h-char-sequence*.[^9]
 ## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
 ``` bnf
 pp-number:
     digit
     '.' digit
-    pp-number digit
-    pp-number identifier-nondigit
     pp-number ''' digit
     pp-number ''' nondigit
     pp-number 'e' sign
     pp-number 'E' sign
     pp-number 'p' sign
@@ -389,19 +517,25 @@ after a successful conversion to an *integer-literal* token or a
 ## Identifiers <a id="lex.name">[[lex.name]]</a>
 ``` bnf
 identifier:
-    identifier-nondigit
-    identifier identifier-nondigit
-    identifier digit
 ```
 ``` bnf
-identifier-nondigit:
     nondigit
-    universal-character-name
 ```
 ``` bnf
 nondigit: one of
     'a b c d e f g h i j k l m'
@@ -413,51 +547,37 @@ nondigit: one of
 ``` bnf
 digit: one of
     '0 1 2 3 4 5 6 7 8 9'
 ```
-An identifier is an arbitrarily long sequence of letters and digits.
-Each *universal-character-name* in an identifier shall designate a
-character whose encoding in ISO/IEC 10646 falls into one of the ranges
-specified in [[lex.name.allowed]]. The initial element shall not be a
-*universal-character-name* designating a character whose encoding falls
-into one of the ranges specified in [[lex.name.disallowed]]. Upper- and
-lower-case letters are different. All characters are significant.[^10]
-**Table: Ranges of characters allowed** <a id="lex.name.allowed">[lex.name.allowed]</a>
-|               |               |               |               |               |
-| ------------- | ------------- | ------------- | ------------- | ------------- |
-| `00A8`        | `00AA`        | `00AD`        | `00AF`        | `00B2-00B5`   |
-| `00B7-00BA`   | `00BC-00BE`   | `00C0-00D6`   | `00D8-00F6`   | `00F8-00FF`   |
-| `0100-167F`   | `1681-180D`   | `180F-1FFF`   |               |               |
-| `200B-200D`   | `202A-202E`   | `203F-2040`   | `2054`        | `2060-206F`   |
-| `2070-218F`   | `2460-24FF`   | `2776-2793`   | `2C00-2DFF`   | `2E80-2FFF`   |
-| `3004-3007`   | `3021-302F`   | `3031-D7FF`   |               |               |
-| `F900-FD3D`   | `FD40-FDCF`   | `FDF0-FE44`   | `FE47-FFFD`   |               |
-| `10000-1FFFD` | `20000-2FFFD` | `30000-3FFFD` | `40000-4FFFD` | `50000-5FFFD` |
-| `60000-6FFFD` | `70000-7FFFD` | `80000-8FFFD` | `90000-9FFFD` | `A0000-AFFFD` |
-| `B0000-BFFFD` | `C0000-CFFFD` | `D0000-DFFFD` | `E0000-EFFFD` |               |
-**Table: Ranges of characters disallowed initially (combining characters)** <a id="lex.name.disallowed">[lex.name.disallowed]</a>
-|             |             |             |             |
-| ----------- | ----------- | ----------- | ----------- |
-| `0300-036F` | `1DC0-1DFF` | `20D0-20FF` | `FE20-FE2F` |
 The identifiers in [[lex.name.special]] have a special meaning when
 appearing in a certain context. When referred to in the grammar, these
 identifiers are used explicitly rather than using the *identifier*
 grammar production. Unless otherwise specified, any ambiguity as to
 whether a given *identifier* has a special meaning is resolved to
 interpret the token as a regular *identifier*.
-In addition, some identifiers are reserved for use by C++
-implementations and shall not be used otherwise; no diagnostic is
-required.
 - Each identifier that contains a double underscore `__` or begins with
   an underscore followed by an uppercase letter is reserved to the
   implementation for any use.
 - Each identifier that begins with an underscore is reserved to the
@@ -527,11 +647,11 @@ translation phase 7 [[lex.phases]].
 ## Literals <a id="lex.literal">[[lex.literal]]</a>
 ### Kinds of literals <a id="lex.literal.kinds">[[lex.literal.kinds]]</a>
-There are several kinds of literals.[^11]
 ``` bnf
 literal:
     integer-literal
     character-literal
@@ -540,10 +660,13 @@ literal:
     boolean-literal
     pointer-literal
     user-defined-literal
 ```
 ### Integer literals <a id="lex.icon">[[lex.icon]]</a>
 ``` bnf
 integer-literal:
     binary-literal integer-suffixₒₚₜ
@@ -611,12 +734,14 @@ hexadecimal-digit: one of
 ``` bnf
 integer-suffix:
     unsigned-suffix long-suffixₒₚₜ
     unsigned-suffix long-long-suffixₒₚₜ
     long-suffix unsigned-suffixₒₚₜ
     long-long-suffix unsigned-suffixₒₚₜ
 ```
 ``` bnf
 unsigned-suffix: one of
     'u U'
@@ -630,10 +755,15 @@ long-suffix: one of
 ``` bnf
 long-long-suffix: one of
     'll LL'
 ```
 In an *integer-literal*, the sequence of *binary-digit*s,
 *octal-digit*s, *digit*s, or *hexadecimal-digit*s is interpreted as a
 base N integer as shown in table [[lex.icon.base]]; the lexically first
 digit of the sequence of digits is the most significant.
@@ -658,16 +788,16 @@ decimal values ten through fifteen.
 `0x10'0000`, and `0'004'000'000` all have the same
 value. — *end example*]
 The type of an *integer-literal* is the first type in the list in
 [[lex.icon.type]] corresponding to its optional *integer-suffix* in
-which its value can be represented. An *integer-literal* is a prvalue.
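The "first type in the list" rule and the digit-separator example can be checked with `decltype`. This is a hedged sketch: the helper `separators_agree` is a hypothetical name, and the hexadecimal check is guarded because it assumes a 32-bit `int`, which the standard does not require.

```cpp
#include <cassert>
#include <climits>
#include <type_traits>

// A suffix selects which list of candidate types is searched.
static_assert(std::is_same_v<decltype(1), int>);
static_assert(std::is_same_v<decltype(1u), unsigned int>);
static_assert(std::is_same_v<decltype(1l), long int>);
static_assert(std::is_same_v<decltype(1ull), unsigned long long int>);

// An unsuffixed decimal literal never gets an unsigned type.
static_assert(std::is_signed_v<decltype(4294967295)>);

// Assumption: only checked where int is exactly 32 bits wide. There the
// hexadecimal literal picks `unsigned int`, the second entry in its list.
static_assert(sizeof(int) * CHAR_BIT != 32 ||
              std::is_same_v<decltype(0xFFFFFFFF), unsigned int>);

// Digit separators do not affect the value.
constexpr bool separators_agree() {
    return 1'048'576 == 0x10'0000 && 0x10'0000 == 0'004'000'000;
}
static_assert(separators_agree());
```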
 **Table: Types of *integer-literal*s** <a id="lex.icon.type">[lex.icon.type]</a>
 | *integer-suffix* | *decimal-literal*                         | *integer-literal* other than *decimal-literal* |
-| ---------------- | ------------------------ | ---------------------------------------------- |
 | none             | `int`                                     | `int`                                          |
 |                  | `long int`                                | `unsigned int`                                 |
 |                  | `long long int`                           | `long int`                                     |
 |                  |                                           | `unsigned long int`                            |
 |                  |                                           | `long long int`                                |
@@ -683,10 +813,15 @@ which its value can be represented. An *integer-literal* is a prvalue.
 | and `l` or `L`   | `unsigned long long int`                  | `unsigned long long int`                       |
 | `ll` or `LL`     | `long long int`                           | `long long int`                                |
 |                  |                                           | `unsigned long long int`                       |
 | Both `u` or `U`  | `unsigned long long int`                  | `unsigned long long int`                       |
 | and `ll` or `LL` |                                           |                                                |
 If an *integer-literal* cannot be represented by any type in its list
 and an extended integer type [[basic.fundamental]] can represent its
 value, it may have that extended integer type. If all of the types in
@@ -716,157 +851,165 @@ c-char-sequence:
     c-char-sequence c-char
 ```
 ``` bnf
 c-char:
-    any member of the basic source character set except the single-quote ''', backslash '\', or new-line character
     escape-sequence
     universal-character-name
 ```
 ``` bnf
 escape-sequence:
     simple-escape-sequence
     octal-escape-sequence
     hexadecimal-escape-sequence
 ```
 ``` bnf
-simple-escape-sequence: one of
-    '\'' '\"' '\?' '\\'
-    '\a' '\b' '\f' '\n' '\r' '\t' '\v'
 ```
 ``` bnf
 octal-escape-sequence:
     '\' octal-digit
     '\' octal-digit octal-digit
     '\' octal-digit octal-digit octal-digit
 ```
 ``` bnf
 hexadecimal-escape-sequence:
-    '\x' hexadecimal-digit
-    hexadecimal-escape-sequence hexadecimal-digit
 ```
-A *character-literal* that does not begin with `u8`, `u`, `U`, or `L` is
-an *ordinary character literal*. An ordinary character literal that
-contains a single *c-char* representable in the execution character set
-has type `char`, with value equal to the numerical value of the encoding
-of the *c-char* in the execution character set. An ordinary character
-literal that contains more than one *c-char* is a
-*multicharacter literal*. A multicharacter literal, or an ordinary
-character literal containing a single *c-char* not representable in the
-execution character set, is conditionally-supported, has type `int`, and
-has an *implementation-defined* value.
-A *character-literal* that begins with `u8`, such as `u8'w'`, is a
-*character-literal* of type `char8_t`, known as a *UTF-8 character
-literal*. The value of a UTF-8 character literal is equal to its ISO/IEC
-10646 code point value, provided that the code point value can be
-encoded as a single UTF-8 code unit.
-[*Note 1*: That is, provided the code point value is in the range
-[0, 7F] (hexadecimal). — *end note*]
-If the value is not representable with a single UTF-8 code unit, the
-program is ill-formed. A UTF-8 character literal containing multiple
-*c-char*s is ill-formed.
-A *character-literal* that begins with the letter `u`, such as `u'x'`,
-is a *character-literal* of type `char16_t`, known as a *UTF-16
-character literal*. The value of a UTF-16 character literal is equal to
-its ISO/IEC 10646 code point value, provided that the code point value
-is representable with a single 16-bit code unit.
-[*Note 2*: That is, provided the code point value is in the range
-[0, FFFF] (hexadecimal). — *end note*]
-If the value is not representable with a single 16-bit code unit, the
-program is ill-formed. A UTF-16 character literal containing multiple
-*c-char*s is ill-formed.
-A *character-literal* that begins with the letter `U`, such as `U'y'`,
-is a *character-literal* of type `char32_t`, known as a *UTF-32
-character literal*. The value of a UTF-32 character literal containing a
-single *c-char* is equal to its ISO/IEC 10646 code point value. A UTF-32
-character literal containing multiple *c-char*s is ill-formed.
-A *character-literal* that begins with the letter `L`, such as `L'z'`,
-is a *wide-character literal*. A wide-character literal has type
-`wchar_t`.[^12] The value of a wide-character literal containing a
-single *c-char* has value equal to the numerical value of the encoding
-of the *c-char* in the execution wide-character set, unless the *c-char*
-has no representation in the execution wide-character set, in which case
-the value is *implementation-defined*.
-[*Note 3*: The type `wchar_t` is able to represent all members of the
-execution wide-character set (see
-[[basic.fundamental]]). — *end note*]
-The value of a wide-character literal containing multiple *c-char*s is
-*implementation-defined*.
-Certain non-graphic characters, the single quote `'`, the double quote
-`"`, the question mark `?`,[^13] and the backslash `\`, can be
-represented according to [[lex.ccon.esc]]. The double quote `"` and the
-question mark `?`, can be represented as themselves or by the escape
-sequences `\"` and `\?` respectively, but the single quote `'` and the
-backslash `\` shall be represented by the escape sequences `\'` and `\\`
-respectively. Escape sequences in which the character following the
-backslash is not listed in [[lex.ccon.esc]] are conditionally-supported,
-with *implementation-defined* semantics. An escape sequence specifies a
-single character.
-**Table: Escape sequences** <a id="lex.ccon.esc">[lex.ccon.esc]</a>
-|                 |                |                    |
-| --------------- | -------------- | ------------------ |
-| new-line        | NL(LF)         | `\n`               |
-| horizontal tab  | HT             | `\t`               |
-| vertical tab    | VT             | `\v`               |
-| backspace       | BS             | `\b`               |
-| carriage return | CR             | `\r`               |
-| form feed       | FF             | `\f`               |
-| alert           | BEL            | `\a`               |
-| backslash       | `\`            | `\\`               |
-| question mark   | ?              | `\?`               |
-| single quote    | `'`            | `\'`               |
-| double quote    | `"`            | `\"`               |
-| octal number    | *ooo*          | `\ooo`             |
-| hex number      | *hhh*          | `\xhhh`            |
-The escape `\ooo` consists of the backslash followed by one,
-two, or three octal digits that are taken to specify the value of the
-desired character. The escape `\xhhh` consists of the
-backslash followed by `x` followed by one or more hexadecimal digits
-that are taken to specify the value of the desired character. There is
-no limit to the number of digits in a hexadecimal sequence. A sequence
-of octal or hexadecimal digits is terminated by the first character that
-is not an octal digit or a hexadecimal digit, respectively. The value of
-a *character-literal* is *implementation-defined* if it falls outside of
-the *implementation-defined* range defined for `char` (for
-*character-literal*s with no prefix) or `wchar_t` (for
-*character-literal*s prefixed by `L`).
-[*Note 4*: If the value of a *character-literal* prefixed by `u`, `u8`,
-or `U` is outside the range defined for its type, the program is
-ill-formed. — *end note*]
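The termination rules for numeric escapes are easy to get wrong, so a short sketch may help. It uses only value comparisons, so it does not depend on the execution character set except where a comment says so; `octal_then_digit_size` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Octal 101 and hexadecimal 41 both denote the value 65.
static_assert('\101' == '\x41');
static_assert('\x41' == 65);

// A hexadecimal escape ends at the first character that is not a
// hexadecimal digit: `g` here, so the array is {0x41, 'g', 0}.
static_assert("\x41g"[1] == 'g');

// An octal escape takes at most three digits: "\1234" is the character
// with value 0123 (octal), then '4', then the terminating null.
constexpr std::size_t octal_then_digit_size() { return sizeof("\1234"); }
static_assert(octal_then_digit_size() == 3);

// The single quote and the backslash must be escaped in a character literal.
static_assert('\'' == "'"[0]);
static_assert('\\' == "\\"[0]);
```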
-A *universal-character-name* is translated to the encoding, in the
-appropriate execution character set, of the character named. If there is
-no such encoding, the *universal-character-name* is translated to an
-*implementation-defined* encoding.
-[*Note 5*: In translation phase 1, a *universal-character-name* is
-introduced whenever an actual extended character is encountered in the
-source text. Therefore, all extended characters are described in terms
-of *universal-character-name*s. However, the actual compiler
-implementation may use its own native character set, so long as the same
-results are obtained. — *end note*]
 ### Floating-point literals <a id="lex.fcon">[[lex.fcon]]</a>
 ``` bnf
 floating-point-literal:
@@ -921,23 +1064,33 @@ digit-sequence:
     digit-sequence '''ₒₚₜ digit
 ```
 ``` bnf
 floating-point-suffix: one of
-    'f l F L'
 ```
-The type of a *floating-point-literal* is determined by its
 *floating-point-suffix* as specified in [[lex.fcon.type]].
 **Table: Types of *floating-point-literal*s** <a id="lex.fcon.type">[lex.fcon.type]</a>
 | *floating-point-suffix* | type              |
-| ----------------------- | --------------- |
 | none                    | `double`          |
 | `f` or `F`              | `float`           |
 | `l` or `L`              | `long double`     |
 The *significand* of a *floating-point-literal* is the
 *fractional-constant* or *digit-sequence* of a
 *decimal-floating-point-literal* or the
@@ -946,11 +1099,11 @@ The *significand* of a *floating-point-literal* is the
 of *digit*s or *hexadecimal-digit*s and optional period are interpreted
 as a base N real number s, where N is 10 for a
 *decimal-floating-point-literal* and 16 for a
 *hexadecimal-floating-point-literal*.
-[*Note 1*: Any optional separating single quotes are ignored when
 determining the value. — *end note*]
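The suffix rules and the note about separating single quotes can be illustrated as follows. This is a sketch under the assumption of a C++17 or later compiler (hexadecimal floating-point literals), and `hex_sixteen` is a hypothetical name. All values used are exactly representable, so the equality comparisons are safe.

```cpp
#include <cassert>
#include <type_traits>

// The suffix alone determines the type.
static_assert(std::is_same_v<decltype(1.0), double>);
static_assert(std::is_same_v<decltype(1.0f), float>);
static_assert(std::is_same_v<decltype(1.0L), long double>);

// Separating single quotes are ignored when determining the value.
static_assert(1'000.5 == 1000.5);

// Decimal exponent: significand 1 scaled by 10 to the power 3.
static_assert(1e3 == 1000.0);

// Hexadecimal form: significand 1 (base 16) scaled by 2 to the power 4.
constexpr double hex_sixteen() { return 0x1p4; }
static_assert(hex_sixteen() == 16.0);
```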
 If an *exponent-part* or *binary-exponent-part* is present, the exponent
 e of the *floating-point-literal* is the result of interpreting the
 sequence of an optional *sign* and the *digit*s as a base 10 integer.
@@ -982,15 +1135,21 @@ s-char-sequence:
     s-char-sequence s-char
 ```
 ``` bnf
 s-char:
-    any member of the basic source character set except the double-quote '"', backslash '\', or new-line character
     escape-sequence
     universal-character-name
 ```
 ``` bnf
 raw-string:
     '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
 ```
@@ -1000,27 +1159,43 @@ r-char-sequence:
     r-char-sequence r-char
 ```
 ``` bnf
 r-char:
-    any member of the source character set, except a right parenthesis ')' followed by
-       the initial *d-char-sequence* (which may be empty) followed by a double quote '"'.
 ```
 ``` bnf
 d-char-sequence:
     d-char
     d-char-sequence d-char
 ```
 ``` bnf
 d-char:
-    any member of the basic source character set except:
-       space, the left parenthesis '(', the right parenthesis ')', the backslash '\', and the control characters
-       representing horizontal tab, vertical tab, form feed, and newline.
 ```
 A *string-literal* that has an `R` in the prefix is a *raw string
 literal*. The *d-char-sequence* serves as a delimiter. The terminating
 *d-char-sequence* of a *raw-string* is the same sequence of characters
 as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
 at most 16 characters.
@@ -1063,149 +1238,130 @@ R"(x = "\"y\"")"
 is equivalent to `"x = \"\\\"y\\\"\""`.
 — *end example*]
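The equivalence stated in the example can be confirmed directly, along with the role of the *d-char-sequence* as a delimiter. This is an illustrative sketch; the variable names and the delimiter `xy` are arbitrary choices for the example.

```cpp
#include <cassert>
#include <string_view>

// The raw form and the escaped form from the example above.
constexpr std::string_view raw = R"(x = "\"y\"")";
constexpr std::string_view escaped = "x = \"\\\"y\\\"\"";
static_assert(raw == escaped);

// With the delimiter `xy`, the sequence `)"` inside the literal does
// not end it; only `)xy"` does.
constexpr std::string_view delimited = R"xy(a)"b)xy";
static_assert(delimited == "a)\"b");
```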
-After translation phase 6, a *string-literal* that does not begin with
-an *encoding-prefix* is an *ordinary string literal*. An ordinary string
-literal has type “array of *n* `const char`” where *n* is the size of
-the string as defined below, has static storage duration [[basic.stc]],
-and is initialized with the given characters.
-A *string-literal* that begins with `u8`, such as `u8"asdf"`, is a
-*UTF-8 string literal*. A UTF-8 string literal has type “array of *n*
-`const char8_t`”, where *n* is the size of the string as defined below;
-each successive element of the object representation [[basic.types]] has
-the value of the corresponding code unit of the UTF-8 encoding of the
-string.
 Ordinary string literals and UTF-8 string literals are also referred to
 as narrow string literals.
-A *string-literal* that begins with `u`, such as `u"asdf"`, is a *UTF-16
-string literal*. A UTF-16 string literal has type “array of *n*
-`const char16_t`”, where *n* is the size of the string as defined below;
-each successive element of the array has the value of the corresponding
-code unit of the UTF-16 encoding of the string.
-[*Note 3*: A single *c-char* may produce more than one `char16_t`
-character in the form of surrogate pairs. A surrogate pair is a
-representation for a single code point as a sequence of two 16-bit code
-units. — *end note*]
-A *string-literal* that begins with `U`, such as `U"asdf"`, is a *UTF-32
-string literal*. A UTF-32 string literal has type “array of *n*
-`const char32_t`”, where *n* is the size of the string as defined below;
-each successive element of the array has the value of the corresponding
-code unit of the UTF-32 encoding of the string.
-A *string-literal* that begins with `L`, such as `L"asdf"`, is a *wide
-string literal*. A wide string literal has type “array of *n* `const
-wchar_t`”, where *n* is the size of the string as defined below; it is
-initialized with the given characters.
 In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
-concatenated. If both *string-literal*s have the same *encoding-prefix*,
-the resulting concatenated *string-literal* has that *encoding-prefix*.
-If one *string-literal* has no *encoding-prefix*, it is treated as a
-*string-literal* of the same *encoding-prefix* as the other operand. If
-a UTF-8 string literal token is adjacent to a wide string literal token,
-the program is ill-formed. Any other concatenations are
-conditionally-supported with *implementation-defined* behavior.
-[*Note 4*: This concatenation is an interpretation, not a conversion.
-Because the interpretation happens in translation phase 6 (after each
-character from a *string-literal* has been translated into a value from
-the appropriate character set), a *string-literal*’s initial rawness has
-no effect on the interpretation or well-formedness of the
-concatenation. — *end note*]
 [[lex.string.concat]] has some examples of valid concatenations.
 **Table: String literal concatenations** <a id="lex.string.concat">[lex.string.concat]</a>
 | Source |        | Means   | Source |        | Means   | Source |        | Means   |
 | ------ | ------ | ------- | ------ | ------ | ------- | ------ | ------ | ------- |
 | `u"a"` | `u"b"` | `u"ab"` | `U"a"` | `U"b"` | `U"ab"` | `L"a"` | `L"b"` | `L"ab"` |
 | `u"a"` | `"b"`  | `u"ab"` | `U"a"` | `"b"`  | `U"ab"` | `L"a"` | `"b"`  | `L"ab"` |
 | `"a"`  | `u"b"` | `u"ab"` | `"a"`  | `U"b"` | `U"ab"` | `"a"`  | `L"b"` | `L"ab"` |
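A few rows of the table can be checked with a short sketch like the following. The variable names are hypothetical and C++17 is assumed; the last lines mirror the `"\xA" "B"` case, where the concatenated pieces stay distinct.

```cpp
#include <string_view>

// An unprefixed literal adopts the encoding-prefix of its neighbour.
constexpr std::u16string_view a = u"a" "b";
static_assert(a == u"ab");
constexpr std::u32string_view b = "a" U"b";
static_assert(b == U"ab");
constexpr std::wstring_view c = L"a" "b";
static_assert(c == L"ab");

// Concatenated pieces stay distinct: '\xA' followed by 'B', not '\xAB'.
constexpr char two[] = "\xA" "B";
static_assert(sizeof(two) == 3);  // two characters plus the terminating '\0'
static_assert(two[0] == '\xA' && two[1] == 'B');
```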
-Characters in concatenated strings are kept distinct.
-[*Example 2*:
-``` cpp
-"\xA" "B"
-```
-contains the two characters `'\xA'` and `'B'` after concatenation (and
-not the single hexadecimal character `'\xAB'`).
-— *end example*]
-After any necessary concatenation, in translation phase 7
-[[lex.phases]], `'\0'` is appended to every *string-literal* so that
-programs that scan a string can find its end.
-Escape sequences and *universal-character-name*s in non-raw string
-literals have the same meaning as in *character-literal*s [[lex.ccon]],
-except that the single quote `'` is representable either by itself or by
-the escape sequence `\'`, and the double quote `"` shall be preceded by
-a `\`, and except that a *universal-character-name* in a UTF-16 string
-literal may yield a surrogate pair. In a narrow string literal, a
-*universal-character-name* may map to more than one `char` or `char8_t`
-element due to *multibyte encoding*. The size of a `char32_t` or wide
-string literal is the total number of escape sequences,
-*universal-character-name*s, and other characters, plus one for the
-terminating `U'\0'` or `L'\0'`. The size of a UTF-16 string literal is
-the total number of escape sequences, *universal-character-name*s, and
-other characters, plus one for each character requiring a surrogate
-pair, plus one for the terminating `u'\0'`.
-[*Note 5*: The size of a `char16_t` string literal is the number of
-code units, not the number of characters. — *end note*]
-[*Note 6*: Any *universal-character-name*s are required to correspond
-to a code point in the range [0, D800) or [E000, 10FFFF] (hexadecimal)
-[[lex.charset]]. — *end note*]
-The size of a narrow string literal is the total number of escape
-sequences and other characters, plus at least one for the multibyte
-encoding of each *universal-character-name*, plus one for the
-terminating `'\0'`.
 Evaluating a *string-literal* results in a string literal object with
-static storage duration, initialized from the given characters as
-specified above. Whether all *string-literal*s are distinct (that is,
-are stored in nonoverlapping objects) and whether successive evaluations
-of a *string-literal* yield the same or a different object is
-unspecified.
-[*Note 7*:  The effect of attempting to modify a *string-literal* is
-undefined. — *end note*]
 ### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
 ``` bnf
 boolean-literal:
     'false'
     'true'
 ```
 The Boolean literals are the keywords `false` and `true`. Such literals
-are prvalues and have type `bool`.
 ### Pointer literals <a id="lex.nullptr">[[lex.nullptr]]</a>
 ``` bnf
 pointer-literal:
     'nullptr'
 ```
-The pointer literal is the keyword `nullptr`. It is a prvalue of type
 `std::nullptr_t`.
 [*Note 1*: `std::nullptr_t` is a distinct type that is neither a
 pointer type nor a pointer-to-member type; rather, a prvalue of this
 type is a null pointer constant and can be converted to a null pointer
@@ -1269,14 +1425,13 @@ The syntactic non-terminal preceding the *ud-suffix* in a
 that could match that non-terminal.
 A *user-defined-literal* is treated as a call to a literal operator or
 literal operator template [[over.literal]]. To determine the form of
 this call for a given *user-defined-literal* *L* with *ud-suffix* *X*,
-the *literal-operator-id* whose literal suffix identifier is *X* is
-looked up in the context of *L* using the rules for unqualified name
-lookup [[basic.lookup.unqual]]. Let *S* be the set of declarations found
-by this lookup. *S* shall not be empty.
 If *L* is a *user-defined-integer-literal*, let *n* be the literal
 without its *ud-suffix*. If *S* contains a literal operator with
 parameter type `unsigned long long`, the literal *L* is treated as a
 call of the form
@@ -1288,11 +1443,11 @@ operator "" X(nULL)
 Otherwise, *S* shall contain a raw literal operator or a numeric literal
 operator template [[over.literal]] but not both. If *S* contains a raw
 literal operator, the literal *L* is treated as a call of the form
 ``` cpp
-operator "" X("n")
 ```
 Otherwise (*S* contains a numeric literal operator template), *L* is
 treated as a call of the form
@@ -1301,11 +1456,11 @@ operator "" X<'c₁', 'c₂', ... 'cₖ'>()
 ```
 where *n* is the source character sequence c₁c₂...cₖ.
 [*Note 1*: The sequence c₁c₂...cₖ can only contain characters from the
-basic source character set. — *end note*]
 If *L* is a *user-defined-floating-point-literal*, let *f* be the
 literal without its *ud-suffix*. If *S* contains a literal operator with
 parameter type `long double`, the literal *L* is treated as a call of
 the form
@@ -1317,11 +1472,11 @@ operator "" X(fL)
 Otherwise, *S* shall contain a raw literal operator or a numeric literal
 operator template [[over.literal]] but not both. If *S* contains a raw
 literal operator, the *literal* *L* is treated as a call of the form
 ``` cpp
-operator "" X("f")
 ```
 Otherwise (*S* contains a numeric literal operator template), *L* is
 treated as a call of the form
@@ -1330,11 +1485,11 @@ operator "" X<'c₁', 'c₂', ... 'cₖ'>()
 ```
 where *f* is the source character sequence c₁c₂...cₖ.
 [*Note 2*: The sequence c₁c₂...cₖ can only contain characters from the
-basic source character set. — *end note*]
 If *L* is a *user-defined-string-literal*, let *str* be the literal
 without its *ud-suffix* and let *len* be the number of code units in
 *str* (i.e., its length excluding the terminating null character). If
 *S* contains a literal operator template with a non-type template
@@ -1388,39 +1543,43 @@ suffix is applied to the result of the concatenation.
 [*Example 3*:
 ``` cpp
 int main() {
-  L"A" "B" "C"_x;   // OK: same as L"ABC"_x
   "P"_x "Q" "R"_y;  // error: two different ud-suffixes
 }
 ```
 — *end example*]
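The three call forms described for numeric user-defined literals can be sketched as follows. The suffixes `_cooked`, `_raw`, and `_count` are hypothetical names chosen for illustration, each declared with only one kind of literal operator so that the lookup set *S* selects exactly one form.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical suffixes _cooked, _raw and _count, chosen for illustration.

// Cooked form: 12_cooked is treated as operator""_cooked(12ULL).
constexpr unsigned long long operator""_cooked(unsigned long long n) { return n; }

// Raw form: 1234_raw is treated as operator""_raw("1234").
inline std::size_t operator""_raw(const char* spelling) { return std::strlen(spelling); }

// Numeric literal operator template: 0x1F_count is treated as
// operator""_count<'0', 'x', '1', 'F'>().
template <char... Cs>
constexpr std::size_t operator""_count() { return sizeof...(Cs); }

static_assert(12_cooked == 12);
static_assert(0x1F_count == 4);
```

Because `_count` has no `long double` overload, a floating-point spelling such as `1.5_count` also reaches the template, which then receives the characters `'1'`, `'.'`, `'5'`.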
 <!-- Link reference definitions -->
 [basic.fundamental]: basic.md#basic.fundamental
 [basic.link]: basic.md#basic.link
 [basic.lookup.unqual]: basic.md#basic.lookup.unqual
 [basic.stc]: basic.md#basic.stc
-[basic.types]: basic.md#basic.types
 [conv.mem]: expr.md#conv.mem
 [conv.ptr]: expr.md#conv.ptr
 [cpp]: cpp.md#cpp
-[cpp.concat]: cpp.md#cpp.concat
 [cpp.cond]: cpp.md#cpp.cond
 [cpp.import]: cpp.md#cpp.import
 [cpp.include]: cpp.md#cpp.include
 [cpp.module]: cpp.md#cpp.module
 [cpp.stringize]: cpp.md#cpp.stringize
 [dcl.attr.grammar]: dcl.md#dcl.attr.grammar
 [headers]: library.md#headers
 [lex]: #lex
 [lex.bool]: #lex.bool
 [lex.ccon]: #lex.ccon
 [lex.ccon.esc]: #lex.ccon.esc
 [lex.charset]: #lex.charset
 [lex.comment]: #lex.comment
 [lex.digraph]: #lex.digraph
 [lex.ext]: #lex.ext
 [lex.fcon]: #lex.fcon
 [lex.fcon.type]: #lex.fcon.type
@@ -1431,83 +1590,60 @@ int main() {
 [lex.key]: #lex.key
 [lex.key.digraph]: #lex.key.digraph
 [lex.literal]: #lex.literal
 [lex.literal.kinds]: #lex.literal.kinds
 [lex.name]: #lex.name
-[lex.name.allowed]: #lex.name.allowed
-[lex.name.disallowed]: #lex.name.disallowed
 [lex.name.special]: #lex.name.special
 [lex.nullptr]: #lex.nullptr
 [lex.operators]: #lex.operators
 [lex.phases]: #lex.phases
 [lex.ppnumber]: #lex.ppnumber
 [lex.pptoken]: #lex.pptoken
 [lex.separate]: #lex.separate
 [lex.string]: #lex.string
 [lex.string.concat]: #lex.string.concat
 [lex.token]: #lex.token
 [module.import]: module.md#module.import
 [module.unit]: module.md#module.unit
 [over.literal]: over.md#over.literal
 [temp.explicit]: temp.md#temp.explicit
 [temp.names]: temp.md#temp.names
-[^1]: Implementations must behave as if these separate phases occur,
- although in practice different phases might be folded together.
 [^2]: A partial preprocessing token would arise from a source file
     ending in the first portion of a multi-character token that requires
     a terminating sequence of characters, such as a *header-name* that
     is missing the closing `"` or `>`. A partial comment would arise
     from a source file ending with an unclosed `/*` comment.
-[^3]: An implementation need not convert all non-corresponding source
-    characters to the same execution character.
-[^4]: The glyphs for the members of the basic source character set are
-    intended to identify characters from the subset of ISO/IEC 10646
-    which corresponds to the ASCII character set. However, because the
-    mapping from source file characters to the source character set
-    (described in translation phase 1) is specified as
-    *implementation-defined*, an implementation is required to document
-    how the basic source characters are represented in source files.
-[^5]: A sequence of characters resembling a *universal-character-name*
-    in an *r-char-sequence* [[lex.string]] does not form a
-    *universal-character-name*.
-[^6]:  These include “digraphs” and additional reserved words. The term
     “digraph” (token consisting of two characters) is not perfectly
     descriptive, since one of the alternative *preprocessing-token*s is
     `%:%:` and of course several primary tokens contain two characters.
     Nonetheless, those alternative tokens that aren’t lexical keywords
     are colloquially known as “digraphs”.
-[^7]: Thus the “stringized” values [[cpp.stringize]] of `[` and `<:`
     will be different, maintaining the source spelling, but the tokens
     can otherwise be freely interchanged.
-[^8]: Literals include strings and character and numeric literals.
-[^9]: Thus, a sequence of characters that resembles an escape sequence
- might result in an error, be interpreted as the character
     corresponding to the escape sequence, or have a completely different
     meaning, depending on the implementation.
-[^10]: On systems in which linkers cannot accept extended characters, an
-    encoding of the *universal-character-name* may be used in forming
     valid external identifiers. For example, some otherwise unused
-    character or sequence of characters may be used to encode the `\u`
- in a *universal-character-name*. Extended characters may produce a
     long external identifier, but C++ does not place a translation limit
-    on significant characters for external identifiers. In C++, upper-
-    and lower-case letters are considered different for all identifiers,
-    including external identifiers.
-[^11]: The term “literal” generally designates, in this document, those
     tokens that are called “constants” in ISO C.
-[^12]: They are intended for character sets where a character does not
-    fit into a single byte.
-[^13]: Using an escape sequence for a question mark is supported for
-    compatibility with ISO C++14 and ISO C.

 The text of the program is kept in units called *source files* in this
 document. A source file together with all the headers [[headers]] and
 source files included [[cpp.include]] via the preprocessing directive
 `#include`, less any source lines skipped by any of the conditional
 inclusion [[cpp.cond]] preprocessing directives, is called a
+*preprocessing translation unit*.
 [*Note 1*: A C++ program need not all be translated at the same
 time. — *end note*]
 [*Note 2*: Previously translated translation units and instantiation
 ## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
 The precedence among the syntax rules of translation is specified by the
 following phases.[^1]
+1.  An implementation shall support input files that are a sequence of
+ UTF-8 code units (UTF-8 files). It may also support an
+ *implementation-defined* set of other kinds of input files, and, if
+ so, the kind of an input file is determined in an
+    *implementation-defined* manner that includes a means of designating
+ input files as UTF-8 files, independent of their content.
+ \[*Note 1*: In other words, recognizing the U+feff (byte order mark)
+ is not sufficient. — *end note*] If an input file is determined to
+ be a UTF-8 file, then it shall be a well-formed UTF-8 code unit
+ sequence and it is decoded to produce a sequence of Unicode scalar
+ values. A sequence of translation character set elements is then
+ formed by mapping each Unicode scalar value to the corresponding
+ translation character set element. In the resulting sequence, each
+    pair of characters in the input sequence consisting of
+ U+000d (carriage return) followed by U+000a (line feed), as well as
+ each U+000d (carriage return) not immediately followed by a
+ U+000a (line feed), is replaced by a single new-line character. For
+ any other kind of input file supported by the implementation,
+ characters are mapped, in an *implementation-defined* manner, to a
+    sequence of translation character set elements [[lex.charset]],
+    representing end-of-line indicators as new-line characters.
+2.  If the first translation character is U+feff (byte order mark), it
+    is deleted. Each sequence of a backslash character (\\) immediately
+    followed by zero or more whitespace characters other than new-line
+    followed by a new-line character is deleted, splicing physical
+    source lines to form logical source lines. Only the last backslash
+    on any physical source line shall be eligible for being part of such
+    a splice. Except for splices reverted in a raw string literal, if a
+    splice results in a character sequence that matches the syntax of a
     *universal-character-name*, the behavior is undefined. A source file
     that is not empty and that does not end in a new-line character, or
+    that ends in a splice, shall be processed as if an additional
+ new-line character were appended to the file.
 3.  The source file is decomposed into preprocessing tokens
+    [[lex.pptoken]] and sequences of whitespace characters (including
     comments). A source file shall not end in a partial preprocessing
     token or in a partial comment.[^2] Each comment is replaced by one
     space character. New-line characters are retained. Whether each
+    nonempty sequence of whitespace characters other than new-line is
+    retained or replaced by one space character is unspecified. As
+ characters from the source file are consumed to form the next
+ preprocessing token (i.e., not being consumed as part of a comment
+ or other forms of whitespace), except when matching a
+    *c-char-sequence*, *s-char-sequence*, *r-char-sequence*,
+    *h-char-sequence*, or *q-char-sequence*, *universal-character-name*s
+    are recognized and replaced by the designated element of the
+    translation character set. The process of dividing a source file’s
+    characters into preprocessing tokens is context-dependent.
+    \[*Example 1*: See the handling of `<` within a `#include`
+    preprocessing directive. — *end example*]
 4.  Preprocessing directives are executed, macro invocations are
+    expanded, and `_Pragma` unary operator expressions are executed. A
+ `#include` preprocessing directive causes the named header or source
+ file to be processed from phase 1 through phase 4, recursively. All
     preprocessing directives are then deleted.
+5.  For a sequence of two or more adjacent *string-literal* tokens, a
+ common *encoding-prefix* is determined as specified in
+ [[lex.string]]. Each such *string-literal* token is then considered
+    to have that common *encoding-prefix*.
+6.  Adjacent *string-literal* tokens are concatenated [[lex.string]].
+7.  Whitespace characters separating tokens are no longer significant.
     Each preprocessing token is converted into a token [[lex.token]].
+    The resulting tokens constitute a *translation unit* and are
+ syntactically and semantically analyzed and translated.
+ \[*Note 2*: The process of analyzing and translating the tokens can
+ occasionally result in one token being replaced by a sequence of
+ other tokens [[temp.names]]. — *end note*] It is
+ *implementation-defined* whether the sources for module units and
+ header units on which the current translation unit has an interface
+ dependency [[module.unit]], [[module.import]] are required to be
+ available. \[*Note 3*: Source files, translation units and
+ translated translation units need not necessarily be stored as
+ files, nor need there be any one-to-one correspondence between these
+ entities and any external representation. The description is
+ conceptual only, and does not specify any particular
+    implementation. — *end note*]
 8.  Translated translation units and instantiation units are combined as
+    follows: \[*Note 4*: Some or all of these can be supplied from a
     library. — *end note*] Each translated translation unit is examined
+    to produce a list of required instantiations. \[*Note 5*: This can
     include instantiations which have been explicitly requested
     [[temp.explicit]]. — *end note*] The definitions of the required
     templates are located. It is *implementation-defined* whether the
     source of the translation units containing these definitions is
+    required to be available. \[*Note 6*: An implementation can choose
+ to encode sufficient information into the translated translation
+ unit so as to ensure the source is not required here. — *end note*]
+ All the required instantiations are performed to produce
+ *instantiation units*. \[*Note 7*: These are similar to translated
+ translation units, but contain no references to uninstantiated
+ templates and no template definitions. — *end note*] The program is
+ ill-formed if any instantiation fails.
 9.  All external entity references are resolved. Library components are
     linked to satisfy external references to entities not defined in the
     current translation. All such translator output is collected into a
     program image which contains information needed for execution in its
     execution environment.
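The ordering of the phases has visible consequences. The sketch below (hypothetical names `SUM`, `NAME`, and `greeting`; C++17 assumed) relies on line splicing in phase 2 and on macro expansion in phase 4 happening before string concatenation in phase 6.

```cpp
// Phase 2: each backslash/new-line pair is deleted, so the macro
// definition below is a single logical source line.
#define SUM(a, b) \
    ((a) +        \
     (b))

// Phase 4 expands NAME to a string-literal token; phases 5 and 6 then
// concatenate it with the adjacent literal.
#define NAME "world"
constexpr char greeting[] = "Hello, " NAME;

static_assert(SUM(2, 3) == 5);
static_assert(sizeof(greeting) == 13);  // 12 characters plus '\0'
```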
 ## Character sets <a id="lex.charset">[[lex.charset]]</a>
+The *translation character set* consists of the following elements:
+- each abstract character assigned a code point in the Unicode
+  codespace, and
+- a distinct character for each Unicode scalar value not assigned to an
+  abstract character.
+[*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
+(hexadecimal). A surrogate code point is a value in the range
+[D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
+that is not a surrogate code point. — *end note*]
+The *basic character set* is a subset of the translation character set,
+consisting of 96 characters as specified in [[lex.charset.basic]].
+[*Note 2*: Unicode short names are given only as a means to identifying
+the character; the numerical value has no other meaning in this
+context. — *end note*]
+**Table: Basic character set** <a id="lex.charset.basic">[lex.charset.basic]</a>
+| character            |                             | glyph                       |
+| -------------------- | --------------------------- | --------------------------- |
+| `U+0009`             | character tabulation        |                             |
+| `U+000b`             | line tabulation             |                             |
+| `U+000c`             | form feed                   |                             |
+| `U+0020`             | space                       |                             |
+| `U+000a`             | line feed                   | new-line                    |
+| `U+0021`             | exclamation mark            | `!`                         |
+| `U+0022`             | quotation mark              | `"`                         |
+| `U+0023`             | number sign                 | `#`                         |
+| `U+0025`             | percent sign                | `%`                         |
+| `U+0026`             | ampersand                   | `&`                         |
+| `U+0027`             | apostrophe                  | `'`                         |
+| `U+0028`             | left parenthesis            | `(`                         |
+| `U+0029`             | right parenthesis           | `)`                         |
+| `U+002a`             | asterisk                    | `*`                         |
+| `U+002b`             | plus sign                   | `+`                         |
+| `U+002c`             | comma                       | `,`                         |
+| `U+002d`             | hyphen-minus                | `-`                         |
+| `U+002e`             | full stop                   | `.`                         |
+| `U+002f`             | solidus                     | `/`                         |
+| `U+0030` .. `U+0039` | digit zero .. nine          | `0 1 2 3 4 5 6 7 8 9`       |
+| `U+003a`             | colon                       | `:`                         |
+| `U+003b`             | semicolon                   | `;`                         |
+| `U+003c`             | less-than sign              | `<`                         |
+| `U+003d`             | equals sign                 | `=`                         |
+| `U+003e`             | greater-than sign           | `>`                         |
+| `U+003f`             | question mark               | `?`                         |
+| `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
+|                      |                             | `N O P Q R S T U V W X Y Z` |
+| `U+005b`             | left square bracket         | `[`                         |
+| `U+005c`             | reverse solidus             | `\`                         |
+| `U+005d`             | right square bracket        | `]`                         |
+| `U+005e`             | circumflex accent           | `^`                         |
+| `U+005f`             | low line                    | `_`                         |
+| `U+0061` .. `U+007a` | latin small letter a .. z   | `a b c d e f g h i j k l m` |
+|                      |                             | `n o p q r s t u v w x y z` |
+| `U+007b`             | left curly bracket          | `{`                         |
+| `U+007c`             | vertical line               | `\|`                        |
+| `U+007d`             | right curly bracket         | `}`                         |
+| `U+007e`             | tilde                       | `~`                         |
 The *universal-character-name* construct provides a way to name other
 characters.
+``` bnf
+n-char: one of
+     any member of the translation character set except the U+007d (right curly bracket) or new-line character
+```
+``` bnf
+n-char-sequence:
+    n-char
+    n-char-sequence n-char
+```
+``` bnf
+named-universal-character:
+    '\N{' n-char-sequence '}'
+```
 ``` bnf
 hex-quad:
     hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
 ```
+``` bnf
+simple-hexadecimal-digit-sequence:
+    hexadecimal-digit
+    simple-hexadecimal-digit-sequence hexadecimal-digit
+```
 ``` bnf
 universal-character-name:
     '\u' hex-quad
     '\U' hex-quad hex-quad
+    '\u{' simple-hexadecimal-digit-sequence '}'
+    named-universal-character
 ```
+A *universal-character-name* of the form `\u` *hex-quad*, `\U`
+*hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
+designates the character in the translation character set whose Unicode
+scalar value is the hexadecimal number represented by the sequence of
+*hexadecimal-digit*s in the *universal-character-name*. The program is
+ill-formed if that number is not a Unicode scalar value.
+A *universal-character-name* that is a *named-universal-character*
+designates the corresponding character in the Unicode Standard (chapter
+4.8 Name) if the *n-char-sequence* is equal to its character name or to
+one of its character name aliases of type “control”, “correction”, or
+“alternate”; otherwise, the program is ill-formed.
+[*Note 3*: These aliases are listed in the Unicode Character Database’s
+`NameAliases.txt`. None of these names or aliases have leading or
+trailing spaces. — *end note*]
+If a *universal-character-name* outside the *c-char-sequence*,
+*s-char-sequence*, or *r-char-sequence* of a *character-literal* or
+*string-literal* (in either case, including within a
+*user-defined-literal*) corresponds to a control character or to a
+character in the basic character set, the program is ill-formed.
+[*Note 4*: A sequence of characters resembling a
+*universal-character-name* in an *r-char-sequence* [[lex.string]] does
+not form a *universal-character-name*. — *end note*]
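The spellings of a *universal-character-name* can be compared in a small sketch. One assumption is labeled in the code: the feature-test macro `__cpp_named_character_escapes` is used as a proxy for compiler support of both the `\u{...}` and `\N{...}` forms, which is a convenience and not something the text above guarantees. The identifier `caf\u00e9` is hypothetical.

```cpp
// The spellings below all designate U+00E9 (latin small letter e with acute).
static_assert(U'\u00e9' == U'\U000000e9');
static_assert(U'\u00e9' == 0xE9);

// Assumption: the named-escape feature-test macro stands in for C++23
// support of both the \u{...} and \N{...} forms.
#ifdef __cpp_named_character_escapes
static_assert(U'\u{e9}' == U'\u00e9');
static_assert(U'\N{LATIN SMALL LETTER E WITH ACUTE}' == U'\u00e9');
#endif

// Outside literals, a universal-character-name can appear in an identifier,
// but it cannot name a member of the basic character set.
constexpr int caf\u00e9 = 1;
```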
+The *basic literal character set* consists of all characters of the
+basic character set, plus the control characters specified in
+[[lex.charset.literal]].
+**Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
+|          |                 |
+| -------- | --------------- |
+| `U+0000` | null            |
+| `U+0007` | alert           |
+| `U+0008` | backspace       |
+| `U+000d` | carriage return |
+A *code unit* is an integer value of character type
+[[basic.fundamental]]. Characters in a *character-literal* other than a
+multicharacter or non-encodable character literal or in a
+*string-literal* are encoded as a sequence of one or more code units, as
+determined by the *encoding-prefix* [[lex.ccon]], [[lex.string]]; this
+is termed the respective *literal encoding*. The
+*ordinary literal encoding* is the encoding applied to an ordinary
+character or string literal. The *wide literal encoding* is the encoding
+applied to a wide character or string literal.
+A literal encoding or a locale-specific encoding of one of the execution
+character sets [[character.seq]] encodes each element of the basic
+literal character set as a single code unit with non-negative value,
+distinct from the code unit for any other such element.
+[*Note 5*: A character not in the basic literal character set can be
+encoded with more than one code unit; the value of such a code unit can
+be the same as that of a code unit for an element of the basic literal
+character set. — *end note*]
+The U+0000 (null) character is encoded as the value `0`. No other
+element of the translation character set is encoded with a code unit of
+value `0`. The code unit value of each decimal digit character after the
+digit `0` (`U+0030`) shall be one greater than the value of the
+previous. The ordinary and wide literal encodings are otherwise
+*implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
+Unicode scalar value corresponding to each character of the translation
+character set is encoded as specified in the Unicode Standard for the
+respective Unicode encoding form.
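Several of these encoding guarantees can be observed directly. The following sketch uses only compile-time checks; the casts to `unsigned char` are there so the comparisons behave the same whether `u8` literals have element type `char` (before C++20) or `char8_t`.

```cpp
// Decimal digits are encoded with consecutive code unit values.
static_assert('1' - '0' == 1 && '9' - '0' == 9);
// U+0000 is encoded as the value 0.
static_assert('\0' == 0);

// U+00E9 takes two UTF-8 code units (0xC3 0xA9) plus the terminating null.
static_assert(sizeof(u8"\u00e9") == 3);
static_assert(static_cast<unsigned char>(u8"\u00e9"[0]) == 0xC3);
static_assert(static_cast<unsigned char>(u8"\u00e9"[1]) == 0xA9);

// U+1F600 lies outside the BMP, so UTF-16 encodes it as a surrogate pair.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3);
static_assert(u"\U0001F600"[0] == 0xD83D && u"\U0001F600"[1] == 0xDE00);

// UTF-32 uses one code unit per Unicode scalar value.
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2);
```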
 ## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
 ``` bnf
 preprocessing-token:
     character-literal
     user-defined-character-literal
     string-literal
     user-defined-string-literal
     preprocessing-op-or-punc
+    each non-whitespace character that cannot be one of the above
 ```
 Each preprocessing token that is converted to a token [[lex.token]]
 shall have the lexical form of a keyword, an identifier, a literal, or
 an operator or punctuator.
 A preprocessing token is the minimal lexical element of the language in
+translation phases 3 through 6. In this document, glyphs are used to
+identify elements of the basic character set [[lex.charset]]. The
+categories of preprocessing token are: header names, placeholder tokens
+produced by preprocessing `import` and `module` directives
+(*import-keyword*, *module-keyword*, and *export-keyword*), identifiers,
+preprocessing numbers, character literals (including user-defined
+character literals), string literals (including user-defined string
+literals), preprocessing operators and punctuators, and single
+non-whitespace characters that do not lexically match the other
+preprocessing token categories. If a U+0027 (apostrophe) or a
+U+0022 (quotation mark) character matches the last category, the
+behavior is undefined. If any character not in the basic character set
+matches the last category, the program is ill-formed. Preprocessing
+tokens can be separated by whitespace; this consists of comments
+[[lex.comment]], or whitespace characters (U+0020 (space),
+U+0009 (character tabulation), new-line, U+000b (line tabulation), and
+U+000c (form feed)), or both. As described in [[cpp]], in certain
+circumstances during translation phase 4, whitespace (or the absence
+thereof) serves as more than preprocessing token separation. Whitespace
+can appear within a preprocessing token only as part of a header name or
+between the quotation characters in a character literal or string
+literal.
 If the input stream has been parsed into preprocessing tokens up to a
 given character:
 - If the next character begins a sequence of characters that could be
   the prefix and initial double quote of a raw string literal, such as
   `R"`, the next preprocessing token shall be a raw string literal.
   Between the initial and final double quote characters of the raw
+  string, any transformations performed in phase 2 (line splicing) are
+  reverted; this reversion shall apply before any *d-char*, *r-char*, or
+  delimiting parenthesis is identified. The raw string literal is
+  defined as the shortest sequence of characters that matches the
+  raw-string pattern
   ``` bnf
   encoding-prefixₒₚₜ 'R' raw-string
   ```
 - Otherwise, if the next three characters are `<::` and the subsequent
   character is neither `:` nor `>`, the `<` is treated as a
 [*Note 1*: None has any observable spelling. — *end note*]
 [*Example 2*: The program fragment `0xe+foo` is parsed as a
 preprocessing number token (one that is not a valid *integer-literal* or
 *floating-point-literal* token), even though a parse as three
+preprocessing tokens `0xe`, `+`, and `foo` can produce a valid
+expression (for example, if `foo` is a macro defined as `1`). Similarly,
+the program fragment `1E1` is parsed as a preprocessing number (one that
+is a valid *floating-point-literal* token), whether or not `E` is a
+macro name. — *end example*]
 [*Example 3*: The program fragment `x+++++y` is parsed as `x
 ++ ++ + y`, which, if `x` and `y` have integral types, violates a
 constraint on increment operators, even though the parse `x ++ + ++ y`
+can yield a correct expression. — *end example*]
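
The longest-match rule behind the two examples above can be sketched as follows. This is only an illustration: the names `foo`, `spaced`, and `post_plus_pre` are hypothetical, and the ill-formed spelling is kept in a comment so that the block still compiles.

```cpp
// Hypothetical names throughout; a sketch of how separators change tokenization.
constexpr int foo = 1;

// With whitespace, `0xe`, `+`, and `foo` are three preprocessing tokens.
constexpr int spaced = 0xe + foo;

// Without whitespace, `0xe+foo` is a single pp-number that is not a valid
// literal, so the following line would be ill-formed:
// constexpr int fused = 0xe+foo;

// `x ++ + ++ y` has to be written with separators. The spelling `x+++++y`
// would be tokenized as `x ++ ++ + y`, which is rejected for integral operands.
constexpr int post_plus_pre(int x, int y) { return x++ + ++y; }
```

Here `post_plus_pre(1, 2)` adds the old value of `x` to the incremented value of `y`.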
 ## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
 Alternative token representations are provided for some operators and
+punctuators.[^3]
 In all respects of the language, each alternative token behaves the
+same, respectively, as its primary token, except for its spelling.[^4]
 The set of alternative tokens is defined in [[lex.digraph]].
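
As a small sketch of that equivalence, the declarations below use alternative tokens where the primary tokens would normally appear. All names (`table`, `second`, `in_range`, `masked`) are hypothetical, and the comments give the primary-token spelling.

```cpp
// Hypothetical names; each declaration is spelled with alternative tokens.
constexpr int table<:3:> = <%10, 20, 30%>;      // int table[3] = {10, 20, 30};
constexpr int second = table<:1:>;              // table[1]

// `and` and `not` behave exactly like `&&` and `!`.
constexpr bool in_range(int v) <% return v >= 10 and not (v > 30); %>

// `bitand` and `compl` behave exactly like `&` and `~`.
constexpr int masked = 0xff bitand compl 0x0f;  // 0xff & ~0x0f
```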
 ## Tokens <a id="lex.token">[[lex.token]]</a>
 ``` bnf
     keyword
     literal
     operator-or-punctuator
 ```
+There are five kinds of tokens: identifiers, keywords, literals,[^5]
 operators, and other separators. Blanks, horizontal and vertical tabs,
 newlines, formfeeds, and comments (collectively, “whitespace”), as
 described below, are ignored except as they serve to separate tokens.
 [*Note 1*: Some whitespace is required to separate otherwise adjacent
 The characters `/*` start a comment, which terminates with the
 characters `*/`. These comments do not nest. The characters `//` start a
 comment, which terminates immediately before the next new-line
 character. If there is a form-feed or a vertical-tab character in such a
+comment, only whitespace characters shall appear between it and the
 new-line that terminates the comment; no diagnostic is required.
 [*Note 1*: The comment characters `//`, `/*`, and `*/` have no special
 meaning within a `//` comment and are treated just like other
 characters. Similarly, the comment characters `//` and `/*` have no
     h-char-sequence h-char
 ```
 ``` bnf
 h-char:
+    any member of the translation character set except new-line and U+003e (greater-than sign)
 ```
 ``` bnf
 q-char-sequence:
     q-char
     q-char-sequence q-char
 ```
 ``` bnf
 q-char:
+    any member of the translation character set except new-line and U+0022 (quotation mark)
 ```
 [*Note 1*: Header name preprocessing tokens only appear within a
 `#include` preprocessing directive, a `__has_include` preprocessing
 expression, or after certain occurrences of an `import` token (see
 The appearance of either of the characters `'` or `\` or of either of
 the character sequences `/*` or `//` in a *q-char-sequence* or an
 *h-char-sequence* is conditionally-supported with
 *implementation-defined* semantics, as is the appearance of the
+character `"` in an *h-char-sequence*.[^6]
 ## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
 ``` bnf
 pp-number:
     digit
     '.' digit
+    pp-number identifier-continue
     pp-number ''' digit
     pp-number ''' nondigit
     pp-number 'e' sign
     pp-number 'E' sign
     pp-number 'p' sign
 ## Identifiers <a id="lex.name">[[lex.name]]</a>
 ``` bnf
 identifier:
+    identifier-start
+    identifier identifier-continue
 ```
 ``` bnf
+identifier-start:
     nondigit
+    an element of the translation character set with the Unicode property XID_Start
+```
+``` bnf
+identifier-continue:
+    digit
+    nondigit
+    an element of the translation character set with the Unicode property XID_Continue
 ```
 ``` bnf
 nondigit: one of
     'a b c d e f g h i j k l m'
 ``` bnf
 digit: one of
     '0 1 2 3 4 5 6 7 8 9'
 ```
+[*Note 1*:
+The character properties XID_Start and XID_Continue are Derived Core
+Properties as described by UAX \#44 of the Unicode Standard.[^7]
+— *end note*]
+The program is ill-formed if an *identifier* does not conform to
+Normalization Form C as specified in the Unicode Standard.
+[*Note 2*: Identifiers are case-sensitive. — *end note*]
+[*Note 3*: In translation phase 4, *identifier* also includes those
+*preprocessing-token*s [[lex.pptoken]] differentiated as keywords
+[[lex.key]] in the later translation phase 7
+[[lex.token]]. — *end note*]
 The identifiers in [[lex.name.special]] have a special meaning when
 appearing in a certain context. When referred to in the grammar, these
 identifiers are used explicitly rather than using the *identifier*
 grammar production. Unless otherwise specified, any ambiguity as to
 whether a given *identifier* has a special meaning is resolved to
 interpret the token as a regular *identifier*.
+In addition, some identifiers appearing as a *token* or
+*preprocessing-token* are reserved for use by C++ implementations and
+shall not be used otherwise; no diagnostic is required.
 - Each identifier that contains a double underscore `__` or begins with
   an underscore followed by an uppercase letter is reserved to the
   implementation for any use.
 - Each identifier that begins with an underscore is reserved to the
 ## Literals <a id="lex.literal">[[lex.literal]]</a>
 ### Kinds of literals <a id="lex.literal.kinds">[[lex.literal.kinds]]</a>
+There are several kinds of literals.[^8]
 ``` bnf
 literal:
     integer-literal
     character-literal
     boolean-literal
     pointer-literal
     user-defined-literal
 ```
+[*Note 1*: When appearing as an *expression*, a literal has a type and
+a value category [[expr.prim.literal]]. — *end note*]
 ### Integer literals <a id="lex.icon">[[lex.icon]]</a>
 ``` bnf
 integer-literal:
     binary-literal integer-suffixₒₚₜ
 ``` bnf
 integer-suffix:
     unsigned-suffix long-suffixₒₚₜ
     unsigned-suffix long-long-suffixₒₚₜ
+    unsigned-suffix size-suffixₒₚₜ
     long-suffix unsigned-suffixₒₚₜ
     long-long-suffix unsigned-suffixₒₚₜ
+    size-suffix unsigned-suffixₒₚₜ
 ```
 ``` bnf
 unsigned-suffix: one of
     'u U'
 ``` bnf
 long-long-suffix: one of
     'll LL'
 ```
+``` bnf
+size-suffix: one of
+    'z Z'
+```
 In an *integer-literal*, the sequence of *binary-digit*s,
 *octal-digit*s, *digit*s, or *hexadecimal-digit*s is interpreted as a
 base N integer as shown in table [[lex.icon.base]]; the lexically first
 digit of the sequence of digits is the most significant.
 `0x10'0000`, and `0'004'000'000` all have the same
 value. — *end example*]
 The type of an *integer-literal* is the first type in the list in
 [[lex.icon.type]] corresponding to its optional *integer-suffix* in
+which its value can be represented.
 **Table: Types of *integer-literal*s** <a id="lex.icon.type">[lex.icon.type]</a>
 | *integer-suffix* | *decimal-literal*                         | *integer-literal* other than *decimal-literal* |
+| ---------------- | ----------------------------------------- | ---------------------------------------------- |
 | none             | `int`                                     | `int`                                          |
 |                  | `long int`                                | `unsigned int`                                 |
 |                  | `long long int`                           | `long int`                                     |
 |                  |                                           | `unsigned long int`                            |
 |                  |                                           | `long long int`                                |
 | and `l` or `L`   | `unsigned long long int`                  | `unsigned long long int`                       |
 | `ll` or `LL`     | `long long int`                           | `long long int`                                |
 |                  |                                           | `unsigned long long int`                       |
 | Both `u` or `U`  | `unsigned long long int`                  | `unsigned long long int`                       |
 | and `ll` or `LL` |                                           |                                                |
+| `z` or `Z`       | the signed integer type corresponding     | the signed integer type                        |
+|                  | to `std::size_t` [[support.types.layout]] | corresponding to `std::size_t`                 |
+|                  |                                           | `std::size_t`                                  |
+| Both `u` or `U`  | `std::size_t`                             | `std::size_t`                                  |
+| and `z` or `Z`   |                                           |                                                |
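
The `z` rows of the table can be checked with a short sketch. It assumes a compiler that defines the `__cpp_size_t_suffix` feature-test macro when the suffix is available, and falls back to a plain `std::size_t` loop otherwise. The helper `sum_of_indices` is hypothetical.

```cpp
#include <cstddef>
#include <type_traits>

// The size suffix is new in C++23, so everything that uses it is guarded.
#if defined(__cpp_size_t_suffix) && __cpp_size_t_suffix >= 202011L
static_assert(std::is_same<decltype(1uz), std::size_t>::value,
              "u plus z selects std::size_t");
static_assert(std::is_same<decltype(1z),
                           std::make_signed<std::size_t>::type>::value,
              "z alone selects the signed type corresponding to std::size_t");
#endif

// Hypothetical helper: sums the indices 0 .. n-1.
constexpr std::size_t sum_of_indices(std::size_t n) {
    std::size_t total = 0;
#if defined(__cpp_size_t_suffix) && __cpp_size_t_suffix >= 202011L
    for (auto i = 0uz; i < n; ++i) total += i;       // i is deduced as std::size_t
#else
    for (std::size_t i = 0; i < n; ++i) total += i;  // pre-C++23 spelling
#endif
    return total;
}
```

Declaring the loop variable as `auto i = 0uz` avoids the signed/unsigned comparison that `auto i = 0` would introduce against a `std::size_t` bound.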
 If an *integer-literal* cannot be represented by any type in its list
 and an extended integer type [[basic.fundamental]] can represent its
 value, it may have that extended integer type. If all of the types in
     c-char-sequence c-char
 ```
 ``` bnf
 c-char:
+    basic-c-char
     escape-sequence
     universal-character-name
 ```
+``` bnf
+basic-c-char:
+    any member of the translation character set except the U+0027 (apostrophe),
+      U+005c (reverse solidus), or new-line character
+```
 ``` bnf
 escape-sequence:
     simple-escape-sequence
+    numeric-escape-sequence
+    conditional-escape-sequence
+```
+``` bnf
+simple-escape-sequence:
+    '\' simple-escape-sequence-char
+```
+``` bnf
+simple-escape-sequence-char: one of
+    '' " ? \ a b f n r t v'
+```
+``` bnf
+numeric-escape-sequence:
     octal-escape-sequence
     hexadecimal-escape-sequence
 ```
 ``` bnf
+simple-octal-digit-sequence:
+    octal-digit
+    simple-octal-digit-sequence octal-digit
 ```
 ``` bnf
 octal-escape-sequence:
     '\' octal-digit
     '\' octal-digit octal-digit
     '\' octal-digit octal-digit octal-digit
+    '\o{' simple-octal-digit-sequence '}'
 ```
 ``` bnf
 hexadecimal-escape-sequence:
+    '\x' simple-hexadecimal-digit-sequence
+    '\x{' simple-hexadecimal-digit-sequence '}'
 ```
+``` bnf
+conditional-escape-sequence:
+    '\' conditional-escape-sequence-char
+```
+``` bnf
+conditional-escape-sequence-char:
+    any member of the basic character set that is not an octal-digit, a simple-escape-sequence-char, or the characters 'N', 'o', 'u', 'U', or 'x'
+```
+A *non-encodable character literal* is a *character-literal* whose
+*c-char-sequence* consists of a single *c-char* that is not a
+*numeric-escape-sequence* and that specifies a character that either
+lacks representation in the literal’s associated character encoding or
+that cannot be encoded as a single code unit. A *multicharacter literal*
+is a *character-literal* whose *c-char-sequence* consists of more than
+one *c-char*. The *encoding-prefix* of a non-encodable character literal
+or a multicharacter literal shall be absent. Such *character-literal*s
+are conditionally-supported.
+The kind of a *character-literal*, its type, and its associated
+character encoding [[lex.charset]] are determined by its
+*encoding-prefix* and its *c-char-sequence* as defined by
+[[lex.ccon.literal]]. The special cases for non-encodable character
+literals and multicharacter literals take precedence over the base kind.
+[*Note 1*: The associated character encoding for ordinary character
+literals determines encodability, but does not determine the value of
+non-encodable ordinary character literals or ordinary multicharacter
+literals. The examples in [[lex.ccon.literal]] for non-encodable
+ordinary character literals assume that the specified character lacks
+representation in the ordinary literal encoding or that encoding the
+character would require more than one code unit. — *end note*]
+**Table: Character literals** <a id="lex.ccon.literal">[lex.ccon.literal]</a>
+| Encoding prefix | Kind | Type | Associated character encoding | Example |
+| ---- | -------------------------- | ---------- | ------------ | ------- |
+| none | ordinary character literal | `char`     | ordinary     | `'v'`   |
+| `L`  | wide character literal     | `wchar_t`  | wide literal | `L'w'`  |
+|      |                            |            | encoding     |         |
+| `u8` | UTF-8 character literal    | `char8_t`  | UTF-8        | `u8'x'` |
+| `u`  | UTF-16 character literal   | `char16_t` | UTF-16       | `u'y'`  |
+| `U`  | UTF-32 character literal   | `char32_t` | UTF-32       | `U'z'`  |
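
The type column of the table can be confirmed with `decltype`. This is a sketch only: the `u8` row is guarded because `char8_t` is available only from C++20, and `utf_values_ok` is a hypothetical name.

```cpp
#include <type_traits>

static_assert(std::is_same<decltype('v'), char>::value, "ordinary");
static_assert(std::is_same<decltype(L'w'), wchar_t>::value, "wide");
static_assert(std::is_same<decltype(u'y'), char16_t>::value, "UTF-16");
static_assert(std::is_same<decltype(U'z'), char32_t>::value, "UTF-32");
#if defined(__cpp_char8_t)
static_assert(std::is_same<decltype(u8'x'), char8_t>::value, "UTF-8");
#endif

// For the UTF-16 and UTF-32 rows the value is fixed by the encoding:
// 'y' is U+0079 and 'z' is U+007A.
constexpr bool utf_values_ok = (u'y' == 0x79) && (U'z' == 0x7a);
```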
+In translation phase 4, the value of a *character-literal* is determined
+using the range of representable values of the *character-literal*’s
+type in translation phase 7. A non-encodable character literal or a
+multicharacter literal has an *implementation-defined* value. The value
+of any other kind of *character-literal* is determined as follows:
+- A *character-literal* with a *c-char-sequence* consisting of a single
+  *basic-c-char*, *simple-escape-sequence*, or
+  *universal-character-name* is the code unit value of the specified
+  character as encoded in the literal’s associated character encoding.
+  \[*Note 2*: If the specified character lacks representation in the
+  literal’s associated character encoding or if it cannot be encoded as
+  a single code unit, then the literal is a non-encodable character
+  literal. — *end note*]
+- A *character-literal* with a *c-char-sequence* consisting of a single
+  *numeric-escape-sequence* has a value as follows:
+  - Let v be the integer value represented by the octal number
+    comprising the sequence of *octal-digit*s in an
+    *octal-escape-sequence* or by the hexadecimal number comprising the
+    sequence of *hexadecimal-digit*s in a *hexadecimal-escape-sequence*.
+  - If v does not exceed the range of representable values of the
+    *character-literal*’s type, then the value is v.
+  - Otherwise, if the *character-literal*’s *encoding-prefix* is absent
+    or `L`, and v does not exceed the range of representable values of
+    the corresponding unsigned type for the underlying type of the
+    *character-literal*’s type, then the value is the unique value of
+    the *character-literal*’s type `T` that is congruent to v modulo 2ᴺ,
+    where N is the width of `T`.
+  - Otherwise, the *character-literal* is ill-formed.
+- A *character-literal* with a *c-char-sequence* consisting of a single
+  *conditional-escape-sequence* is conditionally-supported and has an
+  *implementation-defined* value.
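
The rules for a single *numeric-escape-sequence* can be sketched as below. The constant names are hypothetical. The wrap-around check is written so that it holds whether plain `char` is signed or unsigned.

```cpp
// The value is v when v fits in the literal's type.
constexpr int from_hex   = '\x41';   // v = 0x41
constexpr int from_octal = '\101';   // v = 101 (octal)
// C++23 also accepts the delimited spellings '\x{41}' and '\o{101}'.

// v = 0xff may not fit in plain char, but it fits in unsigned char, so the
// value is the char value congruent to 0xff modulo 2^N.
constexpr bool wraps_modulo = ('\xff' == static_cast<char>(0xff));

// With a `u` prefix there is no wrap-around rule; 0xffff is simply in range.
constexpr bool utf16_escape = (u'\xffff' == 0xffff);
```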
+The character specified by a *simple-escape-sequence* is specified in
+[[lex.ccon.esc]].
+[*Note 3*: Using an escape sequence for a question mark is supported
+for compatibility with ISO C++14 and ISO C. — *end note*]
+**Table: Simple escape sequences** <a id="lex.ccon.esc">[lex.ccon.esc]</a>
+| character |                      | *simple-escape-sequence* |
+| --------- | -------------------- | ------------------------ |
+| `U+000a`  | line feed            | `\n`                     |
+| `U+0009`  | character tabulation | `\t`                     |
+| `U+000b`  | line tabulation      | `\v`                     |
+| `U+0008`  | backspace            | `\b`                     |
+| `U+000d`  | carriage return      | `\r`                     |
+| `U+000c`  | form feed            | `\f`                     |
+| `U+0007`  | alert                | `\a`                     |
+| `U+005c`  | reverse solidus      | `\\`                     |
+| `U+003f`  | question mark        | `\?`                     |
+| `U+0027`  | apostrophe           | `\'`                     |
+| `U+0022`  | quotation mark       | `\"`                     |
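
The table can be cross-checked in code. The sketch uses UTF-32 character literals so that each code unit value equals the listed code point regardless of the ordinary literal encoding. The name `simple_escapes_match_table` is hypothetical.

```cpp
// Each comparison pairs a simple-escape-sequence with its code point.
constexpr bool simple_escapes_match_table =
    U'\n' == 0x000a && U'\t' == 0x0009 && U'\v' == 0x000b &&
    U'\b' == 0x0008 && U'\r' == 0x000d && U'\f' == 0x000c &&
    U'\a' == 0x0007 && U'\\' == 0x005c && U'\?' == 0x003f &&
    U'\'' == 0x0027 && U'\"' == 0x0022;
```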
 ### Floating-point literals <a id="lex.fcon">[[lex.fcon]]</a>
 ``` bnf
 floating-point-literal:
     digit-sequence '''ₒₚₜ digit
 ```
 ``` bnf
 floating-point-suffix: one of
+    'f l f16 f32 f64 f128 bf16 F L F16 F32 F64 F128 BF16'
 ```
+The type of a *floating-point-literal*
+[[basic.fundamental]], [[basic.extended.fp]] is determined by its
 *floating-point-suffix* as specified in [[lex.fcon.type]].
+[*Note 1*: The floating-point suffixes `f16`, `f32`, `f64`, `f128`,
+`bf16`, `F16`, `F32`, `F64`, `F128`, and `BF16` are
+conditionally-supported. See [[basic.extended.fp]]. — *end note*]
 **Table: Types of *floating-point-literal*s** <a id="lex.fcon.type">[lex.fcon.type]</a>
 | *floating-point-suffix* | type              |
+| ----------------------- | ----------------- |
 | none                    | `double`          |
 | `f` or `F`              | `float`           |
 | `l` or `L`              | `long double`     |
+| `f16` or `F16`          | `std::float16_t`  |
+| `f32` or `F32`          | `std::float32_t`  |
+| `f64` or `F64`          | `std::float64_t`  |
+| `f128` or `F128`        | `std::float128_t` |
+| `bf16` or `BF16`        | `std::bfloat16_t` |
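
The first rows of the table can be verified with `decltype`. The extended suffixes are conditionally-supported, so the sketch below checks `f32` only when the implementation defines `__STDCPP_FLOAT32_T__` in C++23 mode. That guard is an assumption about how the toolchain advertises support. The name `same_value` is hypothetical.

```cpp
#include <type_traits>

static_assert(std::is_same<decltype(1.0), double>::value, "no suffix");
static_assert(std::is_same<decltype(1.0f), float>::value, "f suffix");
static_assert(std::is_same<decltype(1.0L), long double>::value, "L suffix");

#if __cplusplus > 202002L && defined(__STDCPP_FLOAT32_T__)
#include <stdfloat>
static_assert(std::is_same<decltype(1.0f32), std::float32_t>::value,
              "f32 suffix");
#endif

// 0.5 is exactly representable in every binary floating-point type, so the
// suffix changes the type of the literal but not its value.
constexpr bool same_value = (0.5f == 0.5) && (0.5L == 0.5);
```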
 The *significand* of a *floating-point-literal* is the
 *fractional-constant* or *digit-sequence* of a
 *decimal-floating-point-literal* or the
 of *digit*s or *hexadecimal-digit*s and optional period are interpreted
 as a base N real number s, where N is 10 for a
 *decimal-floating-point-literal* and 16 for a
 *hexadecimal-floating-point-literal*.
+[*Note 2*: Any optional separating single quotes are ignored when
 determining the value. — *end note*]
 If an *exponent-part* or *binary-exponent-part* is present, the exponent
 e of the *floating-point-literal* is the result of interpreting the
 sequence of an optional *sign* and the *digit*s as a base 10 integer.
     s-char-sequence s-char
 ```
 ``` bnf
 s-char:
+    basic-s-char
     escape-sequence
     universal-character-name
 ```
+``` bnf
+basic-s-char:
+    any member of the translation character set except the U+0022 (quotation mark),
+      U+005c (reverse solidus), or new-line character
+```
 ``` bnf
 raw-string:
     '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
 ```
     r-char-sequence r-char
 ```
 ``` bnf
 r-char:
+    any member of the translation character set, except a U+0029 (right parenthesis) followed by
+       the initial *d-char-sequence* (which may be empty) followed by a U+0022 (quotation mark)
 ```
 ``` bnf
 d-char-sequence:
     d-char
     d-char-sequence d-char
 ```
 ``` bnf
 d-char:
+    any member of the basic character set except:
+      U+0020 (space), U+0028 (left parenthesis), U+0029 (right parenthesis), U+005c (reverse solidus),
+      U+0009 (character tabulation), U+000b (line tabulation), U+000c (form feed), and new-line
 ```
+The kind of a *string-literal*, its type, and its associated character
+encoding [[lex.charset]] are determined by its encoding prefix and
+sequence of *s-char*s or *r-char*s as defined by [[lex.string.literal]]
+where n is the number of encoded code units as described below.
+**Table: String literals** <a id="lex.string.literal">[lex.string.literal]</a>
+| Encoding prefix | Kind | Type | Associated character encoding | Examples |
+| ---- | ----------------------- | ----------------------------- | ------------------------- | ---------------------------------------------- |
+| none | ordinary string literal | array of $n$ `const char`     | ordinary literal encoding | `"ordinary string"` `R"(ordinary raw string)"` |
+| `L`  | wide string literal     | array of $n$ `const wchar_t`  | wide literal encoding     | `L"wide string"` `LR"w(wide raw string)w"`     |
+| `u8` | UTF-8 string literal    | array of $n$ `const char8_t`  | UTF-8                     | `u8"UTF-8 string"` `u8R"x(UTF-8 raw string)x"` |
+| `u`  | UTF-16 string literal   | array of $n$ `const char16_t` | UTF-16                    | `u"UTF-16 string"` `uR"y(UTF-16 raw string)y"` |
+| `U`  | UTF-32 string literal   | array of $n$ `const char32_t` | UTF-32                    | `U"UTF-32 string"` `UR"z(UTF-32 raw string)z"` |
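
A brief sketch of the type column follows. The array bound n counts the terminating null code unit, and a raw string keeps backslashes as ordinary characters. The names `raw_units` and `cooked_units` are hypothetical.

```cpp
#include <cstddef>
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to the array.
static_assert(std::is_same<decltype("abc"), const char (&)[4]>::value, "ordinary");
static_assert(std::is_same<decltype(L"ab"), const wchar_t (&)[3]>::value, "wide");
static_assert(std::is_same<decltype(u"ab"), const char16_t (&)[3]>::value, "UTF-16");
static_assert(std::is_same<decltype(U"ab"), const char32_t (&)[3]>::value, "UTF-32");

// In a raw string, `\n` is two characters; in a non-raw string it is one.
constexpr std::size_t raw_units    = sizeof(R"(a\nb)");  // a, \, n, b, null
constexpr std::size_t cooked_units = sizeof("a\nb");     // a, new-line, b, null
```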
 A *string-literal* that has an `R` in the prefix is a *raw string
 literal*. The *d-char-sequence* serves as a delimiter. The terminating
 *d-char-sequence* of a *raw-string* is the same sequence of characters
 as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
 at most 16 characters.
 is equivalent to `"x = \"\\\"y\\\"\""`.
 — *end example*]
 Ordinary string literals and UTF-8 string literals are also referred to
 as narrow string literals.
+The common *encoding-prefix* for a sequence of adjacent
+*string-literal*s is determined pairwise as follows: If two
+*string-literal*s have the same *encoding-prefix*, the common
+*encoding-prefix* is that *encoding-prefix*. If one *string-literal* has
+no *encoding-prefix*, the common *encoding-prefix* is that of the other
+*string-literal*. Any other combinations are ill-formed.
+[*Note 3*: A *string-literal*’s rawness has no effect on the
+determination of the common *encoding-prefix*. — *end note*]
 In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
+concatenated. The lexical structure and grouping of the contents of the
+individual *string-literal*s is retained.
+[*Example 2*:
+``` cpp
+"\xA" "B"
+```
+represents the code unit `'\xA'` and the character `'B'` after
+concatenation (and not the single code unit `'\xAB'`). Similarly,
+``` cpp
+R"(\u00)" "41"
+```
+represents six characters, starting with a backslash and ending with the
+digit `1` (and not the single character `'A'` specified by a
+*universal-character-name*).
 [[lex.string.concat]] has some examples of valid concatenations.
+— *end example*]
 **Table: String literal concatenations** <a id="lex.string.concat">[lex.string.concat]</a>
 | Source |        | Means   | Source |        | Means   | Source |        | Means   |
 | ------ | ------ | ------- | ------ | ------ | ------- | ------ | ------ | ------- |
 | `u"a"` | `u"b"` | `u"ab"` | `U"a"` | `U"b"` | `U"ab"` | `L"a"` | `L"b"` | `L"ab"` |
 | `u"a"` | `"b"`  | `u"ab"` | `U"a"` | `"b"`  | `U"ab"` | `L"a"` | `"b"`  | `L"ab"` |
 | `"a"`  | `u"b"` | `u"ab"` | `"a"`  | `U"b"` | `U"ab"` | `"a"`  | `L"b"` | `L"ab"` |
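
The concatenation rules can be exercised at compile time. This is a sketch with a hypothetical name (`escape_not_merged`); it checks that an escape sequence is not extended by the following literal and that the common *encoding-prefix* determines the element type.

```cpp
#include <type_traits>

// "\xA" "B" holds two code units plus the terminating null,
// not the single code unit '\xAB'.
static_assert(sizeof("\xA" "B") == 3, "two code units and a null");
constexpr bool escape_not_merged =
    ("\xA" "B")[0] == '\xA' && ("\xA" "B")[1] == 'B';

// The raw part contributes the four characters \ u 0 0 as written,
// so no universal-character-name is formed with the following "41".
static_assert(sizeof(R"(\u00)" "41") == 7, "six characters and a null");

// An unprefixed literal adopts the prefix of its neighbour.
static_assert(std::is_same<decltype(u"a" "b"), const char16_t (&)[3]>::value,
              "same as u\"ab\"");
static_assert(std::is_same<decltype("a" L"b"), const wchar_t (&)[3]>::value,
              "same as L\"ab\"");
```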
 Evaluating a *string-literal* results in a string literal object with
+static storage duration [[basic.stc]]. Whether all *string-literal*s are
+distinct (that is, are stored in nonoverlapping objects) and whether
+successive evaluations of a *string-literal* yield the same or a
+different object is unspecified.
+[*Note 4*: The effect of attempting to modify a string literal object
+is undefined. — *end note*]
+String literal objects are initialized with the sequence of code unit
+values corresponding to the *string-literal*’s sequence of *s-char*s
+(originally from non-raw string literals) and *r-char*s (originally from
+raw string literals), plus a terminating U+0000 (null) character, in
+order as follows:
+- The sequence of characters denoted by each contiguous sequence of
+  *basic-s-char*s, *r-char*s, *simple-escape-sequence*s [[lex.ccon]],
+  and *universal-character-name*s [[lex.charset]] is encoded to a code
+  unit sequence using the *string-literal*’s associated character
+  encoding. If a character lacks representation in the associated
+  character encoding, then the *string-literal* is
+  conditionally-supported and an *implementation-defined* code unit
+  sequence is encoded. \[*Note 5*: No character lacks representation in
+  any Unicode encoding form. — *end note*] When encoding a stateful
+  character encoding, implementations should encode the first such
+  sequence beginning with the initial encoding state and encode
+  subsequent sequences beginning with the final encoding state of the
+  prior sequence. \[*Note 6*: The encoded code unit sequence can differ
+  from the sequence of code units that would be obtained by encoding
+  each character independently. — *end note*]
+- Each *numeric-escape-sequence* [[lex.ccon]] contributes a single code
+  unit with a value as follows:
+  - Let v be the integer value represented by the octal number
+    comprising the sequence of *octal-digit*s in an
+    *octal-escape-sequence* or by the hexadecimal number comprising the
+    sequence of *hexadecimal-digit*s in a *hexadecimal-escape-sequence*.
+  - If v does not exceed the range of representable values of the
+    *string-literal*’s array element type, then the value is v.
+  - Otherwise, if the *string-literal*’s *encoding-prefix* is absent or
+    `L`, and v does not exceed the range of representable values of the
+    corresponding unsigned type for the underlying type of the
+    *string-literal*’s array element type, then the value is the unique
+    value of the *string-literal*’s array element type `T` that is
+    congruent to v modulo 2ᴺ, where N is the width of `T`.
+  - Otherwise, the *string-literal* is ill-formed.
+  When encoding a stateful character encoding, these sequences should
+  have no effect on encoding state.
+- Each *conditional-escape-sequence* [[lex.ccon]] contributes an
+  *implementation-defined* code unit sequence. When encoding a stateful
+  character encoding, it is *implementation-defined* what effect these
+  sequences have on encoding state.
 ### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
 ``` bnf
 boolean-literal:
     'false'
     'true'
 ```
 The Boolean literals are the keywords `false` and `true`. Such literals
+have type `bool`.
 ### Pointer literals <a id="lex.nullptr">[[lex.nullptr]]</a>
 ``` bnf
 pointer-literal:
     'nullptr'
 ```
+The pointer literal is the keyword `nullptr`. It has type
 `std::nullptr_t`.
 [*Note 1*: `std::nullptr_t` is a distinct type that is neither a
 pointer type nor a pointer-to-member type; rather, a prvalue of this
 type is a null pointer constant and can be converted to a null pointer
 that could match that non-terminal.
 A *user-defined-literal* is treated as a call to a literal operator or
 literal operator template [[over.literal]]. To determine the form of
 this call for a given *user-defined-literal* *L* with *ud-suffix* *X*,
+first let *S* be the set of declarations found by unqualified lookup for
+the *literal-operator-id* whose literal suffix identifier is *X*
+[[basic.lookup.unqual]]. *S* shall not be empty.
 If *L* is a *user-defined-integer-literal*, let *n* be the literal
 without its *ud-suffix*. If *S* contains a literal operator with
 parameter type `unsigned long long`, the literal *L* is treated as a
 call of the form
 Otherwise, *S* shall contain a raw literal operator or a numeric literal
 operator template [[over.literal]] but not both. If *S* contains a raw
 literal operator, the literal *L* is treated as a call of the form
 ``` cpp
+operator ""X("n")
 ```
 Otherwise (*S* contains a numeric literal operator template), *L* is
 treated as a call of the form
 ```
 where *n* is the source character sequence c₁c₂...cₖ.
 [*Note 1*: The sequence c₁c₂...cₖ can only contain characters from the
+basic character set. — *end note*]
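
The three call forms for a *user-defined-integer-literal* can be sketched with one literal operator of each kind. The suffixes `_cooked`, `_spelled`, and `_chars` are hypothetical; each is declared in only one form, so the set *S* found by lookup is unambiguous.

```cpp
#include <cstddef>

// Literal operator taking unsigned long long: receives the value of n.
constexpr unsigned long long operator""_cooked(unsigned long long n) {
    return n;
}

// Raw literal operator: receives the spelling of n as a null-terminated string.
constexpr std::size_t operator""_spelled(const char* s) {
    std::size_t length = 0;
    while (s[length] != '\0') ++length;
    return length;
}

// Numeric literal operator template: receives the spelling of n as a pack of
// characters c1, c2, ..., ck.
template <char... Cs>
constexpr std::size_t operator""_chars() {
    return sizeof...(Cs);
}
```

With these declarations, `0x10_cooked` yields the value 16, while `0x10_spelled` and `0x10_chars` both see the four source characters `0x10`.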
 If *L* is a *user-defined-floating-point-literal*, let *f* be the
 literal without its *ud-suffix*. If *S* contains a literal operator with
 parameter type `long double`, the literal *L* is treated as a call of
 the form
 Otherwise, *S* shall contain a raw literal operator or a numeric literal
 operator template [[over.literal]] but not both. If *S* contains a raw
 literal operator, the *literal* *L* is treated as a call of the form
 ``` cpp
+operator ""X("f")
 ```
 Otherwise (*S* contains a numeric literal operator template), *L* is
 treated as a call of the form
 ```
 where *f* is the source character sequence c₁c₂...cₖ.
 [*Note 2*: The sequence c₁c₂...cₖ can only contain characters from the
+basic character set. — *end note*]
 If *L* is a *user-defined-string-literal*, let *str* be the literal
 without its *ud-suffix* and let *len* be the number of code units in
 *str* (i.e., its length excluding the terminating null character). If
 *S* contains a literal operator template with a non-type template
 [*Example 3*:
 ``` cpp
 int main() {
+  L"A" "B" "C"_x;   // OK, same as L"ABC"_x
   "P"_x "Q" "R"_y;  // error: two different ud-suffixes
 }
 ```
 — *end example*]
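
The valid line of the example can be made checkable with a concrete literal operator. In this sketch the suffix `_len` and the constants `joined` and `plain` are hypothetical; the operator simply returns the *len* argument it is called with.

```cpp
#include <cstddef>

// Hypothetical suffix `_len`, declared for wide and ordinary string literals.
constexpr std::size_t operator""_len(const wchar_t*, std::size_t len) {
    return len;
}
constexpr std::size_t operator""_len(const char*, std::size_t len) {
    return len;
}

// Adjacent literals are concatenated first, so this is L"ABC"_len.
constexpr std::size_t joined = L"A" "B" "C"_len;
constexpr std::size_t plain  = "PQ"_len;
```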
 <!-- Link reference definitions -->
+[basic.extended.fp]: basic.md#basic.extended.fp
 [basic.fundamental]: basic.md#basic.fundamental
 [basic.link]: basic.md#basic.link
 [basic.lookup.unqual]: basic.md#basic.lookup.unqual
 [basic.stc]: basic.md#basic.stc
+[character.seq]: library.md#character.seq
 [conv.mem]: expr.md#conv.mem
 [conv.ptr]: expr.md#conv.ptr
 [cpp]: cpp.md#cpp
 [cpp.cond]: cpp.md#cpp.cond
 [cpp.import]: cpp.md#cpp.import
 [cpp.include]: cpp.md#cpp.include
 [cpp.module]: cpp.md#cpp.module
 [cpp.stringize]: cpp.md#cpp.stringize
 [dcl.attr.grammar]: dcl.md#dcl.attr.grammar
+[expr.prim.literal]: expr.md#expr.prim.literal
 [headers]: library.md#headers
 [lex]: #lex
 [lex.bool]: #lex.bool
 [lex.ccon]: #lex.ccon
 [lex.ccon.esc]: #lex.ccon.esc
+[lex.ccon.literal]: #lex.ccon.literal
 [lex.charset]: #lex.charset
+[lex.charset.basic]: #lex.charset.basic
+[lex.charset.literal]: #lex.charset.literal
 [lex.comment]: #lex.comment
 [lex.digraph]: #lex.digraph
 [lex.ext]: #lex.ext
 [lex.fcon]: #lex.fcon
 [lex.fcon.type]: #lex.fcon.type
 [lex.key]: #lex.key
 [lex.key.digraph]: #lex.key.digraph
 [lex.literal]: #lex.literal
 [lex.literal.kinds]: #lex.literal.kinds
 [lex.name]: #lex.name
 [lex.name.special]: #lex.name.special
 [lex.nullptr]: #lex.nullptr
 [lex.operators]: #lex.operators
 [lex.phases]: #lex.phases
 [lex.ppnumber]: #lex.ppnumber
 [lex.pptoken]: #lex.pptoken
 [lex.separate]: #lex.separate
 [lex.string]: #lex.string
 [lex.string.concat]: #lex.string.concat
+[lex.string.literal]: #lex.string.literal
 [lex.token]: #lex.token
 [module.import]: module.md#module.import
 [module.unit]: module.md#module.unit
 [over.literal]: over.md#over.literal
+[support.types.layout]: support.md#support.types.layout
 [temp.explicit]: temp.md#temp.explicit
 [temp.names]: temp.md#temp.names
+[^1]: Implementations behave as if these separate phases occur, although
+    in practice different phases can be folded together.
 [^2]: A partial preprocessing token would arise from a source file
     ending in the first portion of a multi-character token that requires
     a terminating sequence of characters, such as a *header-name* that
     is missing the closing `"` or `>`. A partial comment would arise
     from a source file ending with an unclosed `/*` comment.
+[^3]: These include “digraphs” and additional reserved words. The term
     “digraph” (token consisting of two characters) is not perfectly
     descriptive, since one of the alternative *preprocessing-token*s is
     `%:%:` and of course several primary tokens contain two characters.
     Nonetheless, those alternative tokens that aren’t lexical keywords
     are colloquially known as “digraphs”.
+[^4]: Thus the “stringized” values [[cpp.stringize]] of `[` and `<:`
     will be different, maintaining the source spelling, but the tokens
     can otherwise be freely interchanged.
+[^5]: Literals include strings and character and numeric literals.
+[^6]: Thus, a sequence of characters that resembles an escape sequence
+    can result in an error, be interpreted as the character
     corresponding to the escape sequence, or have a completely different
     meaning, depending on the implementation.
+[^7]: On systems in which linkers cannot accept extended characters, an
+    encoding of the *universal-character-name* can be used in forming
     valid external identifiers. For example, some otherwise unused
+    character or sequence of characters can be used to encode the `\u` in
+    a *universal-character-name*. Extended characters can produce a
     long external identifier, but C++ does not place a translation limit
+    on significant characters for external identifiers.
+[^8]: The term “literal” generally designates, in this document, those
     tokens that are called “constants” in ISO C.
