From Jason Turner

[lex.string]

Files changed (1)
  1. tmp/tmp0ohtbg44/{from.md → to.md} +108 -105
tmp/tmp0ohtbg44/{from.md → to.md} RENAMED
@@ -12,15 +12,21 @@ s-char-sequence:
12
  s-char-sequence s-char
13
  ```
14
 
15
  ``` bnf
16
  s-char:
17
- any member of the basic source character set except the double-quote '"', backslash '\', or new-line character
18
  escape-sequence
19
  universal-character-name
20
  ```
21
 
 
 
 
 
 
 
22
  ``` bnf
23
  raw-string:
24
  '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
25
  ```
26
 
@@ -30,27 +36,43 @@ r-char-sequence:
30
  r-char-sequence r-char
31
  ```
32
 
33
  ``` bnf
34
  r-char:
35
- any member of the source character set, except a right parenthesis ')' followed by
36
- the initial *d-char-sequence* (which may be empty) followed by a double quote '"'.
37
  ```
38
 
39
  ``` bnf
40
  d-char-sequence:
41
  d-char
42
  d-char-sequence d-char
43
  ```
44
 
45
  ``` bnf
46
  d-char:
47
- any member of the basic source character set except:
48
- space, the left parenthesis '(', the right parenthesis ')', the backslash '\', and the control characters
49
- representing horizontal tab, vertical tab, form feed, and newline.
50
  ```
51
 
52
  A *string-literal* that has an `R` in the prefix is a *raw string
53
  literal*. The *d-char-sequence* serves as a delimiter. The terminating
54
  *d-char-sequence* of a *raw-string* is the same sequence of characters
55
  as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
56
  at most 16 characters.
@@ -93,125 +115,106 @@ R"(x = "\"y\"")"
93
 
94
  is equivalent to `"x = \"\\\"y\\\"\""`.
95
 
96
  — *end example*]
97
 
98
- After translation phase 6, a *string-literal* that does not begin with
99
- an *encoding-prefix* is an *ordinary string literal*. An ordinary string
100
- literal has type “array of *n* `const char`” where *n* is the size of
101
- the string as defined below, has static storage duration [[basic.stc]],
102
- and is initialized with the given characters.
103
-
104
- A *string-literal* that begins with `u8`, such as `u8"asdf"`, is a
105
- *UTF-8 string literal*. A UTF-8 string literal has type “array of *n*
106
- `const char8_t`”, where *n* is the size of the string as defined below;
107
- each successive element of the object representation [[basic.types]] has
108
- the value of the corresponding code unit of the UTF-8 encoding of the
109
- string.
110
-
111
  Ordinary string literals and UTF-8 string literals are also referred to
112
  as narrow string literals.
113
 
114
- A *string-literal* that begins with `u`, such as `u"asdf"`, is a *UTF-16
115
- string literal*. A UTF-16 string literal has type “array of *n*
116
- `const char16_t`”, where *n* is the size of the string as defined below;
117
- each successive element of the array has the value of the corresponding
118
- code unit of the UTF-16 encoding of the string.
 
119
 
120
- [*Note 3*: A single *c-char* may produce more than one `char16_t`
121
- character in the form of surrogate pairs. A surrogate pair is a
122
- representation for a single code point as a sequence of two 16-bit code
123
- units. — *end note*]
124
-
125
- A *string-literal* that begins with `U`, such as `U"asdf"`, is a *UTF-32
126
- string literal*. A UTF-32 string literal has type “array of *n*
127
- `const char32_t`”, where *n* is the size of the string as defined below;
128
- each successive element of the array has the value of the corresponding
129
- code unit of the UTF-32 encoding of the string.
130
-
131
- A *string-literal* that begins with `L`, such as `L"asdf"`, is a *wide
132
- string literal*. A wide string literal has type “array of *n* `const
133
- wchar_t`”, where *n* is the size of the string as defined below; it is
134
- initialized with the given characters.
135
 
136
  In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
137
- concatenated. If both *string-literal*s have the same *encoding-prefix*,
138
- the resulting concatenated *string-literal* has that *encoding-prefix*.
139
- If one *string-literal* has no *encoding-prefix*, it is treated as a
140
- *string-literal* of the same *encoding-prefix* as the other operand. If
141
- a UTF-8 string literal token is adjacent to a wide string literal token,
142
- the program is ill-formed. Any other concatenations are
143
- conditionally-supported with *implementation-defined* behavior.
144
-
145
- [*Note 4*: This concatenation is an interpretation, not a conversion.
146
- Because the interpretation happens in translation phase 6 (after each
147
- character from a *string-literal* has been translated into a value from
148
- the appropriate character set), a *string-literal*’s initial rawness has
149
- no effect on the interpretation or well-formedness of the
150
- concatenation. — *end note*]
 
 
 
 
 
151
 
152
  [[lex.string.concat]] has some examples of valid concatenations.
153
 
 
 
154
  **Table: String literal concatenations** <a id="lex.string.concat">[lex.string.concat]</a>
155
 
156
  | | | | | | |
157
  | -------------------------- | ----- | -------------------------- | ----- | -------------------------- | ----- |
158
  | *[spans 2 columns]* Source | Means | *[spans 2 columns]* Source | Means | *[spans 2 columns]* Source | Means |
159
  | `u"a"` | `u"b"` | `u"ab"` | `U"a"` | `U"b"` | `U"ab"` | `L"a"` | `L"b"` | `L"ab"` |
160
  | `u"a"` | `"b"` | `u"ab"` | `U"a"` | `"b"` | `U"ab"` | `L"a"` | `"b"` | `L"ab"` |
161
  | `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
162
 
163
 
164
- Characters in concatenated strings are kept distinct.
165
-
166
- [*Example 2*:
167
-
168
- ``` cpp
169
- "\xA" "B"
170
- ```
171
-
172
- contains the two characters `'\xA'` and `'B'` after concatenation (and
173
- not the single hexadecimal character `'\xAB'`).
174
-
175
- — *end example*]
176
-
177
- After any necessary concatenation, in translation phase 7
178
- [[lex.phases]], `'\0'` is appended to every *string-literal* so that
179
- programs that scan a string can find its end.
180
-
181
- Escape sequences and *universal-character-name*s in non-raw string
182
- literals have the same meaning as in *character-literal*s [[lex.ccon]],
183
- except that the single quote `'` is representable either by itself or by
184
- the escape sequence `\'`, and the double quote `"` shall be preceded by
185
- a `\`, and except that a *universal-character-name* in a UTF-16 string
186
- literal may yield a surrogate pair. In a narrow string literal, a
187
- *universal-character-name* may map to more than one `char` or `char8_t`
188
- element due to *multibyte encoding*. The size of a `char32_t` or wide
189
- string literal is the total number of escape sequences,
190
- *universal-character-name*s, and other characters, plus one for the
191
- terminating `U'\0'` or `L'\0'`. The size of a UTF-16 string literal is
192
- the total number of escape sequences, *universal-character-name*s, and
193
- other characters, plus one for each character requiring a surrogate
194
- pair, plus one for the terminating `u'\0'`.
195
-
196
- [*Note 5*: The size of a `char16_t` string literal is the number of
197
- code units, not the number of characters. — *end note*]
198
-
199
- [*Note 6*: Any *universal-character-name*s are required to correspond
200
- to a code point in the range [0, D800) or [E000, 10FFFF] (hexadecimal)
201
- [[lex.charset]]. — *end note*]
202
-
203
- The size of a narrow string literal is the total number of escape
204
- sequences and other characters, plus at least one for the multibyte
205
- encoding of each *universal-character-name*, plus one for the
206
- terminating `'\0'`.
207
-
208
  Evaluating a *string-literal* results in a string literal object with
209
- static storage duration, initialized from the given characters as
210
- specified above. Whether all *string-literal*s are distinct (that is,
211
- are stored in nonoverlapping objects) and whether successive evaluations
212
- of a *string-literal* yield the same or a different object is
213
- unspecified.
214
-
215
- [*Note 7*: The effect of attempting to modify a *string-literal* is
216
- undefined. — *end note*]
217
 
 
12
  s-char-sequence s-char
13
  ```
14
 
15
  ``` bnf
16
  s-char:
17
+ basic-s-char
18
  escape-sequence
19
  universal-character-name
20
  ```
21
 
22
+ ``` bnf
23
+ basic-s-char:
24
+ any member of the translation character set except the U+0022 (quotation mark),
25
+ U+005c (reverse solidus), or new-line character
26
+ ```
27
+
28
  ``` bnf
29
  raw-string:
30
  '"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
31
  ```
32
 
 
36
  r-char-sequence r-char
37
  ```
38
 
39
  ``` bnf
40
  r-char:
41
+ any member of the translation character set, except a U+0029 (right parenthesis) followed by
42
+ the initial *d-char-sequence* (which may be empty) followed by a U+0022 (quotation mark)
43
  ```
44
 
45
  ``` bnf
46
  d-char-sequence:
47
  d-char
48
  d-char-sequence d-char
49
  ```
50
 
51
  ``` bnf
52
  d-char:
53
+ any member of the basic character set except:
54
+ U+0020 (space), U+0028 (left parenthesis), U+0029 (right parenthesis), U+005c (reverse solidus),
55
+ U+0009 (character tabulation), U+000b (line tabulation), U+000c (form feed), and new-line
56
  ```
57
 
58
+ The kind of a *string-literal*, its type, and its associated character
59
+ encoding [[lex.charset]] are determined by its encoding prefix and
60
+ sequence of *s-char*s or *r-char*s as defined by [[lex.string.literal]]
61
+ where n is the number of encoded code units as described below.
62
+
63
+ **Table: String literals** <a id="lex.string.literal">[lex.string.literal]</a>
64
+
65
+ | | | | | |
66
+ | ---- | ----------------------- | ----------------------------- | ------------------------- | ---------------------------------------------- |
67
+ | none | ordinary string literal | array of $n$ `const char` | ordinary literal encoding | `"ordinary string"` `R"(ordinary raw string)"` |
68
+ | `L` | wide string literal | array of $n$ `const wchar_t` | wide literal encoding | `L"wide string"` `LR"w(wide raw string)w"` |
69
+ | `u8` | UTF-8 string literal | array of $n$ `const char8_t` | UTF-8 | `u8"UTF-8 string"` `u8R"x(UTF-8 raw string)x"` |
70
+ | `u` | UTF-16 string literal | array of $n$ `const char16_t` | UTF-16 | `u"UTF-16 string"` `uR"y(UTF-16 raw string)y"` |
71
+ | `U` | UTF-32 string literal | array of $n$ `const char32_t` | UTF-32 | `U"UTF-32 string"` `UR"z(UTF-32 raw string)z"` |
72
+
73
+
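The table above can be checked mechanically. The following is a minimal sketch, not part of the proposed wording: `code_units` is a hypothetical helper introduced here for illustration, and the `u8` check is guarded because `char8_t` is only available from C++20 on.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper (not from the standard): returns n, the number of
// elements in the array type of a string literal, terminator included.
template <class T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) { return N; }

// A string literal is an lvalue, so decltype yields a reference to array.
static_assert(std::is_same_v<decltype("ab"), const char (&)[3]>);
static_assert(std::is_same_v<decltype(L"ab"), const wchar_t (&)[3]>);
static_assert(std::is_same_v<decltype(u"ab"), const char16_t (&)[3]>);
static_assert(std::is_same_v<decltype(U"ab"), const char32_t (&)[3]>);
#ifdef __cpp_char8_t
static_assert(std::is_same_v<decltype(u8"ab"), const char8_t (&)[3]>);
#endif

// The R in the prefix does not change the kind or the element type.
static_assert(std::is_same_v<decltype(LR"w(ab)w"), const wchar_t (&)[3]>);

// n counts encoded code units: U+1F600 takes two UTF-16 code units
// (a surrogate pair) but only one UTF-32 code unit, plus the terminator.
static_assert(code_units(u"\U0001F600") == 3);
static_assert(code_units(U"\U0001F600") == 2);
```

Because every check is a `static_assert`, the sketch verifies itself at compile time; `code_units` can also be called at run time.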
74
  A *string-literal* that has an `R` in the prefix is a *raw string
75
  literal*. The *d-char-sequence* serves as a delimiter. The terminating
76
  *d-char-sequence* of a *raw-string* is the same sequence of characters
77
  as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
78
  at most 16 characters.
 
115
 
116
  is equivalent to `"x = \"\\\"y\\\"\""`.
117
 
118
  — *end example*]
119
 
 
 
 
 
 
 
 
 
 
 
 
 
 
120
  Ordinary string literals and UTF-8 string literals are also referred to
121
  as narrow string literals.
122
 
123
+ The common *encoding-prefix* for a sequence of adjacent
124
+ *string-literal*s is determined pairwise as follows: If two
125
+ *string-literal*s have the same *encoding-prefix*, the common
126
+ *encoding-prefix* is that *encoding-prefix*. If one *string-literal* has
127
+ no *encoding-prefix*, the common *encoding-prefix* is that of the other
128
+ *string-literal*. Any other combinations are ill-formed.
129
 
130
+ [*Note 3*: A *string-literal*’s rawness has no effect on the
131
+ determination of the common *encoding-prefix*. *end note*]
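As an illustration of the pairwise rule, here is a short sketch under the new wording. The function name `second_unit` is hypothetical, and the ill-formed combination is left commented out so the snippet still compiles.

```cpp
#include <type_traits>

// Same encoding-prefix on both literals: the common prefix is that prefix.
static_assert(std::is_same_v<decltype(u"a" u"b"), const char16_t (&)[3]>);

// One literal without a prefix: the common prefix is that of the other.
static_assert(std::is_same_v<decltype("a" L"b"), const wchar_t (&)[3]>);

// Rawness plays no part in determining the common prefix.
static_assert(std::is_same_v<decltype(UR"(a)" "b"), const char32_t (&)[3]>);

// Two different prefixes are ill-formed under the new wording:
// constexpr auto& bad = u"a" U"b";

// Hypothetical helper: the unprefixed "b" is encoded as UTF-32 as well.
constexpr char32_t second_unit() { return (U"a" "b")[1]; }
static_assert(second_unit() == U'b');
```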
132
 
133
  In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
134
+ concatenated. The lexical structure and grouping of the contents of the
135
+ individual *string-literal*s is retained.
136
+
137
+ [*Example 2*:
138
+
139
+ ``` cpp
140
+ "\xA" "B"
141
+ ```
142
+
143
+ represents the code unit `'\xA'` and the character `'B'` after
144
+ concatenation (and not the single code unit `'\xAB'`). Similarly,
145
+
146
+ ``` cpp
147
+ R"(\u00)" "41"
148
+ ```
149
+
150
+ represents six characters, starting with a backslash and ending with the
151
+ digit `1` (and not the single character `'A'` specified by a
152
+ *universal-character-name*).
153
 
154
  [[lex.string.concat]] has some examples of valid concatenations.
155
 
156
+ — *end example*]
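Both cases of Example 2 can be confirmed with compile-time checks. This is a sketch only; the variable names `joined` and `raw_joined` are chosen here for illustration.

```cpp
// "\xA" and "B" stay separate: two code units plus the terminator.
constexpr char joined[] = "\xA" "B";
static_assert(sizeof(joined) == 3);
static_assert(joined[0] == '\xA' && joined[1] == 'B');

// The raw part contributes a backslash, 'u', '0', '0' literally; the
// adjacent "41" does not complete a universal-character-name.
constexpr char raw_joined[] = R"(\u00)" "41";
static_assert(sizeof(raw_joined) == 7);  // six characters plus the terminator
static_assert(raw_joined[0] == '\\' && raw_joined[5] == '1');
```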
157
+
158
  **Table: String literal concatenations** <a id="lex.string.concat">[lex.string.concat]</a>
159
 
160
  | | | | | | |
161
  | -------------------------- | ----- | -------------------------- | ----- | -------------------------- | ----- |
162
  | *[spans 2 columns]* Source | Means | *[spans 2 columns]* Source | Means | *[spans 2 columns]* Source | Means |
163
  | `u"a"` | `u"b"` | `u"ab"` | `U"a"` | `U"b"` | `U"ab"` | `L"a"` | `L"b"` | `L"ab"` |
164
  | `u"a"` | `"b"` | `u"ab"` | `U"a"` | `"b"` | `U"ab"` | `L"a"` | `"b"` | `L"ab"` |
165
  | `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
166
 
167
 
168
  Evaluating a *string-literal* results in a string literal object with
169
+ static storage duration [[basic.stc]]. Whether all *string-literal*s are
170
+ distinct (that is, are stored in nonoverlapping objects) and whether
171
+ successive evaluations of a *string-literal* yield the same or a
172
+ different object is unspecified.
173
+
174
+ [*Note 4*: The effect of attempting to modify a string literal object
175
+ is undefined. *end note*]
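A minimal sketch of what these two properties mean in practice. The function names are hypothetical. Static storage duration makes it safe to return a pointer to the literal, and copying into an array is the well-defined way to obtain modifiable characters.

```cpp
#include <cassert>
#include <cstring>

// The string literal object has static storage duration, so the returned
// pointer remains valid after the function returns.
const char* greeting() { return "hello"; }

// Writing through a pointer to the literal object would be undefined.
// Initializing an array from the literal makes a modifiable copy instead.
bool modify_copy() {
    char copy[] = "hello";
    copy[0] = 'H';
    return std::strcmp(copy, "Hello") == 0;
}

// Whether two evaluations yield the same object is unspecified, so
// portable code compares contents, never addresses.
bool same_contents() { return std::strcmp(greeting(), greeting()) == 0; }
```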
176
+
177
+ String literal objects are initialized with the sequence of code unit
178
+ values corresponding to the *string-literal*’s sequence of *s-char*s
179
+ (originally from non-raw string literals) and *r-char*s (originally from
180
+ raw string literals), plus a terminating U+0000 (null) character, in
181
+ order as follows:
182
+
183
+ - The sequence of characters denoted by each contiguous sequence of
184
+ *basic-s-char*s, *r-char*s, *simple-escape-sequence*s [[lex.ccon]],
185
+ and *universal-character-name*s [[lex.charset]] is encoded to a code
186
+ unit sequence using the *string-literal*’s associated character
187
+ encoding. If a character lacks representation in the associated
188
+ character encoding, then the *string-literal* is
189
+ conditionally-supported and an *implementation-defined* code unit
190
+ sequence is encoded. \[*Note 5*: No character lacks representation in
191
+ any Unicode encoding form. — *end note*] When encoding a stateful
192
+ character encoding, implementations should encode the first such
193
+ sequence beginning with the initial encoding state and encode
194
+ subsequent sequences beginning with the final encoding state of the
195
+ prior sequence. \[*Note 6*: The encoded code unit sequence can differ
196
+ from the sequence of code units that would be obtained by encoding
197
+ each character independently. — *end note*]
198
+ - Each *numeric-escape-sequence* [[lex.ccon]] contributes a single code
199
+ unit with a value as follows:
200
+ - Let v be the integer value represented by the octal number
201
+ comprising the sequence of *octal-digit*s in an
202
+ *octal-escape-sequence* or by the hexadecimal number comprising the
203
+ sequence of *hexadecimal-digit*s in a *hexadecimal-escape-sequence*.
204
+ - If v does not exceed the range of representable values of the
205
+ *string-literal*’s array element type, then the value is v.
206
+ - Otherwise, if the *string-literal*’s *encoding-prefix* is absent or
207
+ `L`, and v does not exceed the range of representable values of the
208
+ corresponding unsigned type for the underlying type of the
209
+ *string-literal*’s array element type, then the value is the unique
210
+ value of the *string-literal*’s array element type `T` that is
211
+ congruent to v modulo 2ᴺ, where N is the width of `T`.
212
+ - Otherwise, the *string-literal* is ill-formed.
213
+
214
+ When encoding a stateful character encoding, these sequences should
215
+ have no effect on encoding state.
216
+ - Each *conditional-escape-sequence* [[lex.ccon]] contributes an
217
+ *implementation-defined* code unit sequence. When encoding a stateful
218
+ character encoding, it is *implementation-defined* what effect these
219
+ sequences have on encoding state.
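The numeric escape sequence rules above can be sketched as follows, assuming an implementation with 8-bit `char` and an ordinary literal with no *encoding-prefix*. The names `ff` and `pair` are hypothetical, and the ill-formed cases are left in comments.

```cpp
// v = 0xFF fits in unsigned char, so it is accepted whether char is signed
// or unsigned; the stored value is congruent to 255 modulo 2^N.
constexpr unsigned char ff = static_cast<unsigned char>("\xFF"[0]);
static_assert(ff == 0xFF);

// v fits the array element type directly.
static_assert(u"\xFFFF"[0] == 0xFFFF);

// Each numeric escape sequence contributes exactly one code unit,
// whether it is hexadecimal or octal.
constexpr char pair[] = "\x41\102";
static_assert(sizeof(pair) == 3);
static_assert(pair[0] == 0x41 && pair[1] == 0102);

// Ill-formed: v exceeds the range even of the unsigned type.
// constexpr char too_big[] = "\x1FF";        // with 8-bit char
// constexpr char16_t wide[] = u"\x10000";
```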
220