[lex.string] - C++20 → C++23


tmp/tmp0ohtbg44/{from.md → to.md} +108 -105


@@ -12,15 +12,21 @@ s-char-sequence:
     s-char-sequence s-char
 ```
 ``` bnf
 s-char:
-    any member of the basic source character set except the double-quote '"', backslash '\', or new-line character
     escape-sequence
     universal-character-name
 ```
 ``` bnf
 raw-string:
     '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
 ```
@@ -30,27 +36,43 @@ r-char-sequence:
     r-char-sequence r-char
 ```
 ``` bnf
 r-char:
-    any member of the source character set, except a right parenthesis ')' followed by
-       the initial *d-char-sequence* (which may be empty) followed by a double quote '"'.
 ```
 ``` bnf
 d-char-sequence:
     d-char
     d-char-sequence d-char
 ```
 ``` bnf
 d-char:
-    any member of the basic source character set except:
-       space, the left parenthesis '(', the right parenthesis ')', the backslash '\', and the control characters
-       representing horizontal tab, vertical tab, form feed, and newline.
 ```
 A *string-literal* that has an `R` in the prefix is a *raw string
 literal*. The *d-char-sequence* serves as a delimiter. The terminating
 *d-char-sequence* of a *raw-string* is the same sequence of characters
 as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
 at most 16 characters.
@@ -93,125 +115,106 @@ R"(x = "\"y\"")"
 is equivalent to `"x = \"\\\"y\\\"\""`.
 — *end example*]
-After translation phase 6, a *string-literal* that does not begin with
-an *encoding-prefix* is an *ordinary string literal*. An ordinary string
-literal has type “array of *n* `const char`” where *n* is the size of
-the string as defined below, has static storage duration [[basic.stc]],
-and is initialized with the given characters.
-A *string-literal* that begins with `u8`, such as `u8"asdf"`, is a
-*UTF-8 string literal*. A UTF-8 string literal has type “array of *n*
-`const char8_t`”, where *n* is the size of the string as defined below;
-each successive element of the object representation [[basic.types]] has
-the value of the corresponding code unit of the UTF-8 encoding of the
-string.
 Ordinary string literals and UTF-8 string literals are also referred to
 as narrow string literals.
-A *string-literal* that begins with `u`, such as `u"asdf"`, is a *UTF-16
-string literal*. A UTF-16 string literal has type “array of *n*
-`const char16_t`”, where *n* is the size of the string as defined below;
-each successive element of the array has the value of the corresponding
-code unit of the UTF-16 encoding of the string.
-[*Note 3*: A single *c-char* may produce more than one `char16_t`
-character in the form of surrogate pairs. A surrogate pair is a
-representation for a single code point as a sequence of two 16-bit code
-units. — *end note*]
-A *string-literal* that begins with `U`, such as `U"asdf"`, is a *UTF-32
-string literal*. A UTF-32 string literal has type “array of *n*
-`const char32_t`”, where *n* is the size of the string as defined below;
-each successive element of the array has the value of the corresponding
-code unit of the UTF-32 encoding of the string.
-A *string-literal* that begins with `L`, such as `L"asdf"`, is a *wide
-string literal*. A wide string literal has type “array of *n* `const
-wchar_t`”, where *n* is the size of the string as defined below; it is
-initialized with the given characters.
 In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
-concatenated. If both *string-literal*s have the same *encoding-prefix*,
-the resulting concatenated *string-literal* has that *encoding-prefix*.
-If one *string-literal* has no *encoding-prefix*, it is treated as a
-*string-literal* of the same *encoding-prefix* as the other operand. If
-a UTF-8 string literal token is adjacent to a wide string literal token,
-the program is ill-formed. Any other concatenations are
-conditionally-supported with *implementation-defined* behavior.
-[*Note 4*: This concatenation is an interpretation, not a conversion.
-Because the interpretation happens in translation phase 6 (after each
-character from a *string-literal* has been translated into a value from
-the appropriate character set), a *string-literal*’s initial rawness has
-no effect on the interpretation or well-formedness of the
-concatenation. — *end note*]
 [[lex.string.concat]] has some examples of valid concatenations.
 **Table: String literal concatenations** <a id="lex.string.concat">[lex.string.concat]</a>
 | Source (1st) | Source (2nd) | Means   | Source (1st) | Source (2nd) | Means   | Source (1st) | Source (2nd) | Means   |
 | ------------ | ------------ | ------- | ------------ | ------------ | ------- | ------------ | ------------ | ------- |
 | `u"a"`       | `u"b"`       | `u"ab"` | `U"a"`       | `U"b"`       | `U"ab"` | `L"a"`       | `L"b"`       | `L"ab"` |
 | `u"a"`       | `"b"`        | `u"ab"` | `U"a"`       | `"b"`        | `U"ab"` | `L"a"`       | `"b"`        | `L"ab"` |
 | `"a"`        | `u"b"`       | `u"ab"` | `"a"`        | `U"b"`       | `U"ab"` | `"a"`        | `L"b"`       | `L"ab"` |
-Characters in concatenated strings are kept distinct.
-[*Example 2*:
-``` cpp
-"\xA" "B"
-```
-contains the two characters `'\xA'` and `'B'` after concatenation (and
-not the single hexadecimal character `'\xAB'`).
-— *end example*]
-After any necessary concatenation, in translation phase 7
-[[lex.phases]], `'\0'` is appended to every *string-literal* so that
-programs that scan a string can find its end.
-Escape sequences and *universal-character-name*s in non-raw string
-literals have the same meaning as in *character-literal*s [[lex.ccon]],
-except that the single quote `'` is representable either by itself or by
-the escape sequence `\'`, and the double quote `"` shall be preceded by
-a `\`, and except that a *universal-character-name* in a UTF-16 string
-literal may yield a surrogate pair. In a narrow string literal, a
-*universal-character-name* may map to more than one `char` or `char8_t`
-element due to *multibyte encoding*. The size of a `char32_t` or wide
-string literal is the total number of escape sequences,
-*universal-character-name*s, and other characters, plus one for the
-terminating `U'\0'` or `L'\0'`. The size of a UTF-16 string literal is
-the total number of escape sequences, *universal-character-name*s, and
-other characters, plus one for each character requiring a surrogate
-pair, plus one for the terminating `u'\0'`.
-[*Note 5*: The size of a `char16_t` string literal is the number of
-code units, not the number of characters. — *end note*]
-[*Note 6*: Any *universal-character-name*s are required to correspond
-to a code point in the range [0, D800) or [E000, 10FFFF] (hexadecimal)
-[[lex.charset]]. — *end note*]
-The size of a narrow string literal is the total number of escape
-sequences and other characters, plus at least one for the multibyte
-encoding of each *universal-character-name*, plus one for the
-terminating `'\0'`.
 Evaluating a *string-literal* results in a string literal object with
-static storage duration, initialized from the given characters as
-specified above. Whether all *string-literal*s are distinct (that is,
-are stored in nonoverlapping objects) and whether successive evaluations
-of a *string-literal* yield the same or a different object is
-unspecified.
-[*Note 7*:  The effect of attempting to modify a *string-literal* is
-undefined. — *end note*]

     s-char-sequence s-char
 ```
 ``` bnf
 s-char:
+    basic-s-char
     escape-sequence
     universal-character-name
 ```
+``` bnf
+basic-s-char:
+    any member of the translation character set except the U+0022 (quotation mark),
+      U+005c (reverse solidus), or new-line character
+```
 ``` bnf
 raw-string:
     '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
 ```
     r-char-sequence r-char
 ```
 ``` bnf
 r-char:
+    any member of the translation character set, except a U+0029 (right parenthesis) followed by
+       the initial *d-char-sequence* (which may be empty) followed by a U+0022 (quotation mark)
 ```
 ``` bnf
 d-char-sequence:
     d-char
     d-char-sequence d-char
 ```
 ``` bnf
 d-char:
+    any member of the basic character set except:
+      U+0020 (space), U+0028 (left parenthesis), U+0029 (right parenthesis), U+005c (reverse solidus),
+      U+0009 (character tabulation), U+000b (line tabulation), U+000c (form feed), and new-line
 ```
+The kind of a *string-literal*, its type, and its associated character
+encoding [[lex.charset]] are determined by its encoding prefix and
+sequence of *s-char*s or *r-char*s as defined by [[lex.string.literal]]
+where n is the number of encoded code units as described below.
+**Table: String literals** <a id="lex.string.literal">[lex.string.literal]</a>
+| Encoding prefix | Kind                    | Type                          | Associated character encoding | Examples                                       |
+| --------------- | ----------------------- | ----------------------------- | ----------------------------- | ---------------------------------------------- |
+| none | ordinary string literal | array of $n$ `const char`     | ordinary literal encoding | `"ordinary string"` `R"(ordinary raw string)"` |
+| `L`  | wide string literal     | array of $n$ `const wchar_t`  | wide literal encoding     | `L"wide string"` `LR"w(wide raw string)w"`     |
+| `u8` | UTF-8 string literal    | array of $n$ `const char8_t`  | UTF-8                     | `u8"UTF-8 string"` `u8R"x(UTF-8 raw string)x"` |
+| `u`  | UTF-16 string literal   | array of $n$ `const char16_t` | UTF-16                    | `u"UTF-16 string"` `uR"y(UTF-16 raw string)y"` |
+| `U`  | UTF-32 string literal   | array of $n$ `const char32_t` | UTF-32                    | `U"UTF-32 string"` `UR"z(UTF-32 raw string)z"` |
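The rows of the string literal table can be checked directly with `decltype`. The following is a minimal sketch, assuming a compiler in C++17 mode or later (the `char8_t` line is guarded because that type only exists from C++20 on); it is an illustration, not part of the standard text.

```cpp
#include <type_traits>

// A string-literal is an lvalue, so decltype yields a reference to
// "array of n const T"; n counts the terminating null element.
static_assert(std::is_same_v<decltype("abc"),  const char (&)[4]>);
static_assert(std::is_same_v<decltype(L"abc"), const wchar_t (&)[4]>);
static_assert(std::is_same_v<decltype(u"abc"), const char16_t (&)[4]>);
static_assert(std::is_same_v<decltype(U"abc"), const char32_t (&)[4]>);
#ifdef __cpp_char8_t  // char8_t exists only in C++20 and later
static_assert(std::is_same_v<decltype(u8"abc"), const char8_t (&)[4]>);
#endif

// The R prefix changes how the contents are lexed, not the type.
static_assert(std::is_same_v<decltype(R"(abc)"),    const char (&)[4]>);
static_assert(std::is_same_v<decltype(LR"w(abc)w"), const wchar_t (&)[4]>);
```

All five kinds share the same shape: the element type follows from the prefix, and the array bound follows from the encoded contents plus one.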
 A *string-literal* that has an `R` in the prefix is a *raw string
 literal*. The *d-char-sequence* serves as a delimiter. The terminating
 *d-char-sequence* of a *raw-string* is the same sequence of characters
 as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
 at most 16 characters.
 is equivalent to `"x = \"\\\"y\\\"\""`.
 — *end example*]
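The equivalence stated in the example, and the role of the delimiter, can be sketched as compile-time checks. The helper `same_text` is a hypothetical name introduced only for this illustration.

```cpp
#include <string_view>

// Hypothetical helper: compares two literals by content.
constexpr bool same_text(std::string_view a, std::string_view b) { return a == b; }

// With an empty delimiter, the contents end at the first )" pair.
static_assert(same_text(R"(x = "\"y\"")", "x = \"\\\"y\\\"\""));

// A non-empty delimiter lets the contents contain )" themselves.
static_assert(same_text(R"delim(a)"b)delim", "a)\"b"));

// Escape sequences are not interpreted: this is a backslash followed by n.
static_assert(same_text(R"(\n)", "\\n"));
static_assert(sizeof(R"(\n)") == 3);  // backslash, n, terminating null
```

A delimiter is only needed when the contents would otherwise contain the closing sequence; it is limited to 16 characters.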
 Ordinary string literals and UTF-8 string literals are also referred to
 as narrow string literals.
+The common *encoding-prefix* for a sequence of adjacent
+*string-literal*s is determined pairwise as follows: If two
+*string-literal*s have the same *encoding-prefix*, the common
+*encoding-prefix* is that *encoding-prefix*. If one *string-literal* has
+no *encoding-prefix*, the common *encoding-prefix* is that of the other
+*string-literal*. Any other combinations are ill-formed.
+[*Note 3*: A *string-literal*’s rawness has no effect on the
+determination of the common *encoding-prefix*. — *end note*]
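As an illustration of the pairwise rule, the sketch below checks the type that results from a few well-formed combinations. The ill-formed combinations appear only in a comment, since they would not compile under the C++23 wording.

```cpp
#include <type_traits>

// Same prefix on both sides: that prefix is the common one.
static_assert(std::is_same_v<decltype(u"a" u"b"), const char16_t (&)[3]>);

// One side unprefixed: the other side's prefix is the common one.
static_assert(std::is_same_v<decltype(u"a" "b"), const char16_t (&)[3]>);
static_assert(std::is_same_v<decltype("a" U"b"), const char32_t (&)[3]>);

// Rawness plays no part in choosing the common prefix.
static_assert(std::is_same_v<decltype(L"a" R"(b)"), const wchar_t (&)[3]>);

// Ill-formed under the rule above (two different prefixes):
//   u"a" U"b"
//   u8"a" L"b"
```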
 In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
+concatenated. The lexical structure and grouping of the contents of the
+individual *string-literal*s is retained.
+[*Example 2*:
+``` cpp
+"\xA" "B"
+```
+represents the code unit `'\xA'` and the character `'B'` after
+concatenation (and not the single code unit `'\xAB'`). Similarly,
+``` cpp
+R"(\u00)" "41"
+```
+represents six characters, starting with a backslash and ending with the
+digit `1` (and not the single character `'A'` specified by a
+*universal-character-name*).
 [[lex.string.concat]] has some examples of valid concatenations.
+— *end example*]
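Both halves of the example can be verified at compile time. This is a small sketch under the assumption of a C++17 or later compiler; the comparisons only restate what the example says.

```cpp
#include <string_view>

// "\xA" "B": the hexadecimal escape ends at the literal boundary.
static_assert(sizeof("\xA" "B") == 3);  // '\xA', 'B', terminating null
static_assert(("\xA" "B")[0] == '\xA');
static_assert(("\xA" "B")[1] == 'B');

// Written as one literal, the escape consumes both hexadecimal digits.
static_assert(sizeof("\xAB") == 2);     // one code unit, terminating null

// R"(\u00)" "41": the raw part stays six plain characters after joining.
static_assert(sizeof(R"(\u00)" "41") == 7);
static_assert(std::string_view{R"(\u00)" "41"} == std::string_view{"\\u0041"});
```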
 **Table: String literal concatenations** <a id="lex.string.concat">[lex.string.concat]</a>
 | Source (1st) | Source (2nd) | Means   | Source (1st) | Source (2nd) | Means   | Source (1st) | Source (2nd) | Means   |
 | ------------ | ------------ | ------- | ------------ | ------------ | ------- | ------------ | ------------ | ------- |
 | `u"a"`       | `u"b"`       | `u"ab"` | `U"a"`       | `U"b"`       | `U"ab"` | `L"a"`       | `L"b"`       | `L"ab"` |
 | `u"a"`       | `"b"`        | `u"ab"` | `U"a"`       | `"b"`        | `U"ab"` | `L"a"`       | `"b"`        | `L"ab"` |
 | `"a"`        | `u"b"`       | `u"ab"` | `"a"`        | `U"b"`       | `U"ab"` | `"a"`        | `L"b"`       | `L"ab"` |
 Evaluating a *string-literal* results in a string literal object with
+static storage duration [[basic.stc]]. Whether all *string-literal*s are
+distinct (that is, are stored in nonoverlapping objects) and whether
+successive evaluations of a *string-literal* yield the same or a
+different object is unspecified.
+[*Note 4*:  The effect of attempting to modify a string literal object
+is undefined. — *end note*]
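A short sketch of what these rules mean in practice follows. The function names `greeting` and `identical_literals_share_storage` are hypothetical and exist only for this illustration.

```cpp
#include <string_view>

// Returning a view of a string literal is safe: the string literal object
// has static storage duration, so it outlives the call.
constexpr std::string_view greeting() { return "hello"; }

// Whether two identical literals are the same object is unspecified, so
// this function may return either true or false; portable code should not
// depend on the answer.
inline bool identical_literals_share_storage() {
    const char* first = "hello";
    const char* second = "hello";
    return first == second;
}

// Writing through a cast pointer has undefined behavior and is shown only
// as a comment:
//   const_cast<char*>("hello")[0] = 'J';
```

Comparing contents (for example through `std::string_view`) rather than addresses avoids relying on the unspecified storage layout.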
+String literal objects are initialized with the sequence of code unit
+values corresponding to the *string-literal*’s sequence of *s-char*s
+(originally from non-raw string literals) and *r-char*s (originally from
+raw string literals), plus a terminating U+0000 (null) character, in
+order as follows:
+- The sequence of characters denoted by each contiguous sequence of
+  *basic-s-char*s, *r-char*s, *simple-escape-sequence*s [[lex.ccon]],
+  and *universal-character-name*s [[lex.charset]] is encoded to a code
+  unit sequence using the *string-literal*’s associated character
+  encoding. If a character lacks representation in the associated
+  character encoding, then the *string-literal* is
+  conditionally-supported and an *implementation-defined* code unit
+  sequence is encoded. \[*Note 5*: No character lacks representation in
+  any Unicode encoding form. — *end note*] When encoding a stateful
+  character encoding, implementations should encode the first such
+  sequence beginning with the initial encoding state and encode
+  subsequent sequences beginning with the final encoding state of the
+  prior sequence. \[*Note 6*: The encoded code unit sequence can differ
+  from the sequence of code units that would be obtained by encoding
+  each character independently. — *end note*]
+- Each *numeric-escape-sequence* [[lex.ccon]] contributes a single code
+  unit with a value as follows:
+  - Let v be the integer value represented by the octal number
+    comprising the sequence of *octal-digit*s in an
+    *octal-escape-sequence* or by the hexadecimal number comprising the
+    sequence of *hexadecimal-digit*s in a *hexadecimal-escape-sequence*.
+  - If v does not exceed the range of representable values of the
+    *string-literal*’s array element type, then the value is v.
+  - Otherwise, if the *string-literal*’s *encoding-prefix* is absent or
+    `L`, and v does not exceed the range of representable values of the
+    corresponding unsigned type for the underlying type of the
+    *string-literal*’s array element type, then the value is the unique
+    value of the *string-literal*’s array element type `T` that is
+    congruent to v modulo 2ᴺ, where N is the width of `T`.
+  - Otherwise, the *string-literal* is ill-formed.
+  When encoding a stateful character encoding, these sequences should
+  have no effect on encoding state.
+- Each *conditional-escape-sequence* [[lex.ccon]] contributes an
+  *implementation-defined* code unit sequence. When encoding a stateful
+  character encoding, it is *implementation-defined* what effect these
+  sequences have on encoding state.
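The initialization rules above determine the array bound *n*. The sketch below counts code units for a few literals; `code_units` is a hypothetical helper introduced for this illustration, and the checks on ordinary literals avoid any assumption about the ordinary literal encoding.

```cpp
#include <cstddef>

// Hypothetical helper: number of array elements, including the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N; }

// Characters are encoded with the associated encoding, then a null is appended.
static_assert(code_units(u8"\u00e9") == 3);      // U+00E9 is two UTF-8 code units
static_assert(code_units(u8"\U0001F600") == 5);  // U+1F600 is four UTF-8 code units
static_assert(code_units(u"\U0001F600") == 3);   // surrogate pair in UTF-16
static_assert(code_units(U"\U0001F600") == 2);   // one UTF-32 code unit

// A numeric-escape-sequence always contributes exactly one code unit.
static_assert(code_units("\xA") == 2);
static_assert(u"\xffff"[0] == 0xffff);
// For an unprefixed literal, a value that fits the corresponding unsigned
// type is reduced modulo 2^N, so the bit pattern is preserved.
static_assert(static_cast<unsigned char>("\xff"[0]) == 0xff);

// Ill-formed under the rules above: u8"\x100", because the value does not
// fit char8_t and the modulo rule applies only to an absent or L prefix.
```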
