- tmp/tmpxaeb_ar5/{from.md → to.md} +454 -369
tmp/tmpxaeb_ar5/{from.md → to.md}
RENAMED
|
@@ -4,24 +4,18 @@
|
|
| 4 |
|
| 5 |
The text of the program is kept in units called *source files* in this
|
| 6 |
document. A source file together with all the headers [[headers]] and
|
| 7 |
source files included [[cpp.include]] via the preprocessing directive
|
| 8 |
`#include`, less any source lines skipped by any of the conditional
|
| 9 |
-
inclusion [[cpp.cond]] preprocessing directives,
|
| 10 |
-
|
|
|
|
|
|
|
| 11 |
|
| 12 |
-
[*Note 1*: A C++ program need not all be translated at the same
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
[*Note 2*: Previously translated translation units and instantiation
|
| 16 |
-
units can be preserved individually or in libraries. The separate
|
| 17 |
-
translation units of a program communicate [[basic.link]] by (for
|
| 18 |
-
example) calls to functions whose identifiers have external or module
|
| 19 |
-
linkage, manipulation of objects whose identifiers have external or
|
| 20 |
-
module linkage, or manipulation of data files. Translation units can be
|
| 21 |
-
separately translated and then later linked to produce an executable
|
| 22 |
-
program [[basic.link]]. — *end note*]
|
| 23 |
|
| 24 |
## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
|
| 25 |
|
| 26 |
The precedence among the syntax rules of translation is specified by the
|
| 27 |
following phases.[^1]
|
|
@@ -33,115 +27,169 @@ following phases.[^1]
|
|
| 33 |
*implementation-defined* manner that includes a means of designating
|
| 34 |
input files as UTF-8 files, independent of their content.
|
| 35 |
\[*Note 1*: In other words, recognizing the U+feff (byte order mark)
|
| 36 |
is not sufficient. — *end note*] If an input file is determined to
|
| 37 |
be a UTF-8 file, then it shall be a well-formed UTF-8 code unit
|
| 38 |
-
sequence and it is decoded to produce a sequence of Unicode
|
| 39 |
-
values. A sequence of translation character set elements
|
| 40 |
-
formed by mapping each Unicode scalar value
|
| 41 |
-
translation character set element. In the
|
| 42 |
-
pair of characters in the input sequence
|
| 43 |
-
U+000d (carriage return) followed by
|
| 44 |
-
each U+000d (carriage return) not
|
| 45 |
-
U+000a (line feed), is replaced by a
|
| 46 |
-
any other kind of input file
|
| 47 |
-
characters are mapped, in an
|
| 48 |
-
|
| 49 |
-
representing end-of-line indicators as
|
|
|
|
| 50 |
2. If the first translation character is U+feff (byte order mark), it
|
| 51 |
-
is deleted. Each sequence
|
| 52 |
-
followed by zero or more whitespace characters other
|
| 53 |
-
followed by a new-line character is deleted, splicing
|
| 54 |
-
source lines to form logical source lines. Only the last
|
| 55 |
-
on any physical source line shall be eligible for being
|
| 56 |
-
a splice.
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
new-line character were appended to the file.
|
| 62 |
3. The source file is decomposed into preprocessing tokens
|
| 63 |
[[lex.pptoken]] and sequences of whitespace characters (including
|
| 64 |
comments). A source file shall not end in a partial preprocessing
|
| 65 |
-
token or in a partial comment.[^
|
| 66 |
-
space character. New-line characters are
|
| 67 |
-
nonempty sequence of whitespace characters
|
| 68 |
-
retained or replaced by one
|
| 69 |
-
characters from the source file are
|
| 70 |
-
preprocessing token (i.e., not being
|
| 71 |
-
or other forms of whitespace), except
|
| 72 |
-
*c-char-sequence*, *s-char-sequence*,
|
| 73 |
-
*
|
| 74 |
-
are recognized
|
| 75 |
-
|
|
|
|
| 76 |
characters into preprocessing tokens is context-dependent.
|
| 77 |
\[*Example 1*: See the handling of `<` within a `#include`
|
| 78 |
-
preprocessing directive
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
The resulting tokens constitute a *translation unit* and are
|
| 92 |
-
syntactically and semantically analyzed
|
| 93 |
-
|
|
|
|
| 94 |
occasionally result in one token being replaced by a sequence of
|
| 95 |
-
other tokens [[temp.names]]. — *end note*]
|
| 96 |
-
*implementation-defined* whether the sources for module units
|
| 97 |
-
header units on which the current translation unit has an
|
| 98 |
-
dependency [[module.unit]], [[module.import]] are required
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
implementation. — *end note*]
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
ill-formed if any instantiation fails.
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
execution environment.
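The line splicing performed in phase 2 can be observed directly. The following is a minimal sketch, not text from the standard; the names `ADD`, `spliced`, and `spliced_is_abcdef` are hypothetical and chosen only for illustration.

```cpp
#include <cstring>

// Phase 2: a backslash immediately followed by a new-line is deleted,
// so the two physical lines below form one logical line.
#define ADD(a, b) \
    ((a) + (b))

// Splicing happens before tokenization (phase 3), so it also works in the
// middle of a string literal. The second physical line starts in column 0
// so that no extra spaces end up inside the literal.
const char* const spliced = "abc\
def";

static_assert(ADD(2, 3) == 5, "the macro body spans two physical lines");

// Returns true if the spliced literal is exactly "abcdef".
inline bool spliced_is_abcdef() {
    return std::strcmp(spliced, "abcdef") == 0;
}
```

Because the splice is removed before preprocessing tokens are formed, the macro definition and the string literal each behave as if they had been written on a single line.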
|
| 126 |
|
| 127 |
-
##
|
|
|
|
|
|
|
| 128 |
|
| 129 |
The *translation character set* consists of the following elements:
|
| 130 |
|
| 131 |
-
- each abstract character assigned a code point in the Unicode
|
| 132 |
-
|
| 133 |
- a distinct character for each Unicode scalar value not assigned to an
|
| 134 |
abstract character.
|
| 135 |
|
| 136 |
[*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
|
| 137 |
(hexadecimal). A surrogate code point is a value in the range
|
| 138 |
[D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
|
| 139 |
that is not a surrogate code point. — *end note*]
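The three ranges in the note above translate directly into predicates. This is an illustrative sketch; the helper names `is_code_point`, `is_surrogate`, and `is_scalar_value` are assumptions and do not come from the standard or any library.

```cpp
#include <cstdint>

// Code points are the integers in [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// Surrogate code points are the values in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x41), "U+0041 is a scalar value");
static_assert(!is_scalar_value(0xD800), "first surrogate is excluded");
static_assert(is_scalar_value(0x10FFFF), "last code point is included");
static_assert(!is_scalar_value(0x110000), "beyond the code point range");
```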
|
| 140 |
|
| 141 |
The *basic character set* is a subset of the translation character set,
|
| 142 |
-
consisting of
|
| 143 |
|
| 144 |
[*Note 2*: Unicode short names are given only as a means of identifying
|
| 145 |
the character; the numerical value has no other meaning in this
|
| 146 |
context. — *end note*]
|
| 147 |
|
|
@@ -155,10 +203,11 @@ context. — *end note*]
|
|
| 155 |
| `U+0020` | space | |
|
| 156 |
| `U+000a` | line feed | new-line |
|
| 157 |
| `U+0021` | exclamation mark | `!` |
|
| 158 |
| `U+0022` | quotation mark | `"` |
|
| 159 |
| `U+0023` | number sign | `#` |
|
|
|
|
| 160 |
| `U+0025` | percent sign | `%` |
|
| 161 |
| `U+0026` | ampersand | `&` |
|
| 162 |
| `U+0027` | apostrophe | `'` |
|
| 163 |
| `U+0028` | left parenthesis | `(` |
|
| 164 |
| `U+0029` | right parenthesis | `)` |
|
|
@@ -173,90 +222,27 @@ context. — *end note*]
|
|
| 173 |
| `U+003b` | semicolon | `;` |
|
| 174 |
| `U+003c` | less-than sign | `<` |
|
| 175 |
| `U+003d` | equals sign | `=` |
|
| 176 |
| `U+003e` | greater-than sign | `>` |
|
| 177 |
| `U+003f` | question mark | `?` |
|
|
|
|
| 178 |
| `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
|
| 179 |
| | | `N O P Q R S T U V W X Y Z` |
|
| 180 |
| `U+005b` | left square bracket | `[` |
|
| 181 |
| `U+005c` | reverse solidus | `\` |
|
| 182 |
| `U+005d` | right square bracket | `]` |
|
| 183 |
| `U+005e` | circumflex accent | `^` |
|
| 184 |
| `U+005f` | low line | `_` |
|
|
|
|
| 185 |
| `U+0061` .. `U+007a` | latin small letter a .. z | `a b c d e f g h i j k l m` |
|
| 186 |
| | | `n o p q r s t u v w x y z` |
|
| 187 |
| `U+007b` | left curly bracket | `{` |
|
| 188 |
| `U+007c` | vertical line | `|` |
|
| 189 |
| `U+007d` | right curly bracket | `}` |
|
| 190 |
| `U+007e` | tilde | `~` |
|
| 191 |
|
| 192 |
|
| 193 |
-
The *universal-character-name* construct provides a way to name other
|
| 194 |
-
characters.
|
| 195 |
-
|
| 196 |
-
``` bnf
|
| 197 |
-
n-char: one of
|
| 198 |
-
any member of the translation character set except the U+007d (right curly bracket) or new-line character
|
| 199 |
-
```
|
| 200 |
-
|
| 201 |
-
``` bnf
|
| 202 |
-
n-char-sequence:
|
| 203 |
-
n-char
|
| 204 |
-
n-char-sequence n-char
|
| 205 |
-
```
|
| 206 |
-
|
| 207 |
-
``` bnf
|
| 208 |
-
named-universal-character:
|
| 209 |
-
'\N{' n-char-sequence '}'
|
| 210 |
-
```
|
| 211 |
-
|
| 212 |
-
``` bnf
|
| 213 |
-
hex-quad:
|
| 214 |
-
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
|
| 215 |
-
```
|
| 216 |
-
|
| 217 |
-
``` bnf
|
| 218 |
-
simple-hexadecimal-digit-sequence:
|
| 219 |
-
hexadecimal-digit
|
| 220 |
-
simple-hexadecimal-digit-sequence hexadecimal-digit
|
| 221 |
-
```
|
| 222 |
-
|
| 223 |
-
``` bnf
|
| 224 |
-
universal-character-name:
|
| 225 |
-
'\u' hex-quad
|
| 226 |
-
'\U' hex-quad hex-quad
|
| 227 |
-
'\u{' simple-hexadecimal-digit-sequence '}'
|
| 228 |
-
named-universal-character
|
| 229 |
-
```
|
| 230 |
-
|
| 231 |
-
A *universal-character-name* of the form `\u` *hex-quad*, `\U`
|
| 232 |
-
*hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
|
| 233 |
-
designates the character in the translation character set whose Unicode
|
| 234 |
-
scalar value is the hexadecimal number represented by the sequence of
|
| 235 |
-
*hexadecimal-digit*s in the *universal-character-name*. The program is
|
| 236 |
-
ill-formed if that number is not a Unicode scalar value.
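As an illustration of the `\u` and `\U` forms, the sketch below checks the designated scalar values through UTF-32 character literals, whose value is the code point itself. The name `e_acute` is hypothetical; the delimited `\u{...}` and `\N{...}` forms are left out because they need a newer compiler mode than this sketch assumes.

```cpp
// \u takes exactly four hexadecimal digits, \U exactly eight.
static_assert(U'\u00e9' == 0xE9, "U+00E9, latin small letter e with acute");
static_assert(U'\U0001F600' == 0x1F600, "U+1F600, outside the basic multilingual plane");

// In UTF-16, U+1F600 needs a surrogate pair: two code units plus the
// terminating null make an array of three char16_t.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3,
              "surrogate pair plus terminating null");

constexpr char32_t e_acute = U'\u00e9';
```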
|
| 237 |
-
|
| 238 |
-
A *universal-character-name* that is a *named-universal-character*
|
| 239 |
-
designates the corresponding character in the Unicode Standard (chapter
|
| 240 |
-
4.8 Name) if the *n-char-sequence* is equal to its character name or to
|
| 241 |
-
one of its character name aliases of type “control”, “correction”, or
|
| 242 |
-
“alternate”; otherwise, the program is ill-formed.
|
| 243 |
-
|
| 244 |
-
[*Note 3*: These aliases are listed in the Unicode Character Database’s
|
| 245 |
-
`NameAliases.txt`. None of these names or aliases have leading or
|
| 246 |
-
trailing spaces. — *end note*]
|
| 247 |
-
|
| 248 |
-
If a *universal-character-name* outside the *c-char-sequence*,
|
| 249 |
-
*s-char-sequence*, or *r-char-sequence* of a *character-literal* or
|
| 250 |
-
*string-literal* (in either case, including within a
|
| 251 |
-
*user-defined-literal*) corresponds to a control character or to a
|
| 252 |
-
character in the basic character set, the program is ill-formed.
|
| 253 |
-
|
| 254 |
-
[*Note 4*: A sequence of characters resembling a
|
| 255 |
-
*universal-character-name* in an *r-char-sequence* [[lex.string]] does
|
| 256 |
-
not form a *universal-character-name*. — *end note*]
|
| 257 |
-
|
| 258 |
The *basic literal character set* consists of all characters of the
|
| 259 |
basic character set, plus the control characters specified in
|
| 260 |
[[lex.charset.literal]].
|
| 261 |
|
| 262 |
**Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
|
|
@@ -282,24 +268,100 @@ applied to a wide character or string literal.
|
|
| 282 |
A literal encoding or a locale-specific encoding of one of the execution
|
| 283 |
character sets [[character.seq]] encodes each element of the basic
|
| 284 |
literal character set as a single code unit with non-negative value,
|
| 285 |
distinct from the code unit for any other such element.
|
| 286 |
|
| 287 |
-
[*Note
|
| 288 |
encoded with more than one code unit; the value of such a code unit can
|
| 289 |
be the same as that of a code unit for an element of the basic literal
|
| 290 |
character set. — *end note*]
|
| 291 |
|
| 292 |
The U+0000 (null) character is encoded as the value `0`. No other
|
| 293 |
element of the translation character set is encoded with a code unit of
|
| 294 |
value `0`. The code unit value of each decimal digit character after the
|
| 295 |
digit `0` (`U+0030`) shall be one greater than the value of the
|
| 296 |
previous. The ordinary and wide literal encodings are otherwise
|
| 297 |
*implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
|
| 298 |
-
Unicode scalar value corresponding to
|
| 299 |
-
character
|
| 300 |
-
respective Unicode encoding form.
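The guarantee that decimal digits have consecutive code unit values is what makes the common `c - '0'` idiom portable. A minimal sketch, with the hypothetical helper name `digit_value`:

```cpp
// Converts a decimal digit character to its numeric value. This relies only
// on the rule that each digit after '0' is encoded one greater than the
// previous digit, not on any particular literal encoding.
constexpr int digit_value(char c) { return c - '0'; }

static_assert('1' == '0' + 1, "digits are consecutive");
static_assert('9' - '0' == 9, "holds across the whole range");
static_assert('\0' == 0, "U+0000 (null) is encoded as the value 0");
static_assert(digit_value('7') == 7, "idiom yields the digit's value");
```

No comparable guarantee is stated for letters, so an expression such as `c - 'a'` depends on the implementation-defined literal encoding.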
|
| 301 |
|
| 302 |
## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
|
| 303 |
|
| 304 |
``` bnf
|
| 305 |
preprocessing-token:
|
|
@@ -315,27 +377,22 @@ preprocessing-token:
|
|
| 315 |
user-defined-string-literal
|
| 316 |
preprocessing-op-or-punc
|
| 317 |
each non-whitespace character that cannot be one of the above
|
| 318 |
```
|
| 319 |
|
| 320 |
-
Each preprocessing token that is converted to a token [[lex.token]]
|
| 321 |
-
shall have the lexical form of a keyword, an identifier, a literal, or
|
| 322 |
-
an operator or punctuator.
|
| 323 |
-
|
| 324 |
A preprocessing token is the minimal lexical element of the language in
|
| 325 |
translation phases 3 through 6. In this document, glyphs are used to
|
| 326 |
identify elements of the basic character set [[lex.charset]]. The
|
| 327 |
categories of preprocessing token are: header names, placeholder tokens
|
| 328 |
produced by preprocessing `import` and `module` directives
|
| 329 |
(*import-keyword*, *module-keyword*, and *export-keyword*), identifiers,
|
| 330 |
preprocessing numbers, character literals (including user-defined
|
| 331 |
character literals), string literals (including user-defined string
|
| 332 |
literals), preprocessing operators and punctuators, and single
|
| 333 |
non-whitespace characters that do not lexically match the other
|
| 334 |
-
preprocessing token categories. If a U+0027 (apostrophe)
|
| 335 |
-
U+0022 (quotation mark) character
|
| 336 |
-
behavior is undefined. If any character not in the basic character set
|
| 337 |
matches the last category, the program is ill-formed. Preprocessing
|
| 338 |
tokens can be separated by whitespace; this consists of comments
|
| 339 |
[[lex.comment]], or whitespace characters (U+0020 (space),
|
| 340 |
U+0009 (character tabulation), new-line, U+000b (line tabulation), and
|
| 341 |
U+000c (form feed)), or both. As described in [[cpp]], in certain
|
|
@@ -343,10 +400,21 @@ circumstances during translation phase 4, whitespace (or the absence
|
|
| 343 |
thereof) serves as more than preprocessing token separation. Whitespace
|
| 344 |
can appear within a preprocessing token only as part of a header name or
|
| 345 |
between the quotation characters in a character literal or string
|
| 346 |
literal.
|
| 347 |
|
|
| 348 |
If the input stream has been parsed into preprocessing tokens up to a
|
| 349 |
given character:
|
| 350 |
|
| 351 |
- If the next character begins a sequence of characters that could be
|
| 352 |
the prefix and initial double quote of a raw string literal, such as
|
|
@@ -362,34 +430,38 @@ given character:
|
|
| 362 |
```
|
| 363 |
- Otherwise, if the next three characters are `<::` and the subsequent
|
| 364 |
character is neither `:` nor `>`, the `<` is treated as a
|
| 365 |
preprocessing token by itself and not as the first character of the
|
| 366 |
alternative token `<:`.
|
| 367 |
- Otherwise, the next preprocessing token is the longest sequence of
|
| 368 |
characters that could constitute a preprocessing token, even if that
|
| 369 |
-
would cause further lexical analysis to fail, except that
|
| 370 |
-
*
|
| 371 |
-
|
| 372 |
-
|
| 373 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 374 |
|
| 375 |
[*Example 1*:
|
| 376 |
|
| 377 |
``` cpp
|
| 378 |
#define R "x"
|
| 379 |
const char* s = R"y"; // ill-formed raw string, not "x" "y"
|
| 380 |
```
|
| 381 |
|
| 382 |
— *end example*]
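The `<::` rule in the list above matters whenever a template argument begins with the global scope operator. The sketch below uses hypothetical names (`Tag`, `Box`, `tag_value`) to show a case where `<` has to be a token by itself.

```cpp
struct Tag {};

template <class T>
struct Box {
    static constexpr int value = 42;
};

// The characters after `Box` are `<::` followed by `T`, which is neither `:`
// nor `>`. The `<` is therefore a preprocessing token by itself and `::Tag`
// is a qualified name. Without the rule, `<:` would be lexed as the
// alternative token for `[`.
constexpr int tag_value = Box<::Tag>::value;

static_assert(tag_value == 42, "Box<::Tag> names a template specialization");
```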
|
| 383 |
|
| 384 |
-
The *import-keyword* is produced by processing an `import` directive
|
| 385 |
-
[[cpp.import]], the *module-keyword* is produced by preprocessing a
|
| 386 |
-
`module` directive [[cpp.module]], and the *export-keyword* is produced
|
| 387 |
-
by preprocessing either of the previous two directives.
|
| 388 |
-
|
| 389 |
-
[*Note 1*: None has any observable spelling. — *end note*]
|
| 390 |
-
|
| 391 |
[*Example 2*: The program fragment `0xe+foo` is parsed as a
|
| 392 |
preprocessing number token (one that is not a valid *integer-literal* or
|
| 393 |
*floating-point-literal* token), even though a parse as three
|
| 394 |
preprocessing tokens `0xe`, `+`, and `foo` can produce a valid
|
| 395 |
expression (for example, if `foo` is a macro defined as `1`). Similarly,
|
|
@@ -400,98 +472,57 @@ macro name. — *end example*]
|
|
| 400 |
[*Example 3*: The program fragment `x+++++y` is parsed as `x
|
| 401 |
++ ++ + y`, which, if `x` and `y` have integral types, violates a
|
| 402 |
constraint on increment operators, even though the parse `x ++ + ++ y`
|
| 403 |
can yield a correct expression. — *end example*]
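The longest-match rule can be seen in code that compiles. In this sketch the function names are hypothetical; the first function spells out the parse that the lexer does not choose for `x+++++y`, and the second relies on the parse it does choose for `a+++b`.

```cpp
// With explicit spaces the tokens are `x ++ + ++ y`:
// post-increment x, pre-increment y, then add the two results.
inline int add_with_increments(int& x, int& y) {
    return x ++ + ++ y;
}

// Without spaces, the longest-match rule lexes `a+++b` as `a ++ + b`,
// never as `a + ++ b`.
inline int post_increment_then_add(int& a, int b) {
    return a+++b;
}
```

Calling `add_with_increments` with two distinct variables holding 1 and 2 returns 4 and leaves them at 2 and 3. Calling `post_increment_then_add` with 1 and 2 returns 3 and leaves the first variable at 2.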
|
| 404 |
|
| 405 |
-
## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
|
| 406 |
-
|
| 407 |
-
Alternative token representations are provided for some operators and
|
| 408 |
-
punctuators.[^3]
|
| 409 |
-
|
| 410 |
-
In all respects of the language, each alternative token behaves the
|
| 411 |
-
same, respectively, as its primary token, except for its spelling.[^4]
|
| 412 |
-
|
| 413 |
-
The set of alternative tokens is defined in [[lex.digraph]].
|
| 414 |
-
|
| 415 |
-
## Tokens <a id="lex.token">[[lex.token]]</a>
|
| 416 |
-
|
| 417 |
-
``` bnf
|
| 418 |
-
token:
|
| 419 |
-
identifier
|
| 420 |
-
keyword
|
| 421 |
-
literal
|
| 422 |
-
operator-or-punctuator
|
| 423 |
-
```
|
| 424 |
-
|
| 425 |
-
There are five kinds of tokens: identifiers, keywords, literals,[^5]
|
| 426 |
-
|
| 427 |
-
operators, and other separators. Blanks, horizontal and vertical tabs,
|
| 428 |
-
newlines, formfeeds, and comments (collectively, “whitespace”), as
|
| 429 |
-
described below, are ignored except as they serve to separate tokens.
|
| 430 |
-
|
| 431 |
-
[*Note 1*: Some whitespace is required to separate otherwise adjacent
|
| 432 |
-
identifiers, keywords, numeric literals, and alternative tokens
|
| 433 |
-
containing alphabetic characters. — *end note*]
|
| 434 |
-
|
| 435 |
-
## Comments <a id="lex.comment">[[lex.comment]]</a>
|
| 436 |
-
|
| 437 |
-
The characters `/*` start a comment, which terminates with the
|
| 438 |
-
characters `*/`. These comments do not nest. The characters `//` start a
|
| 439 |
-
comment, which terminates immediately before the next new-line
|
| 440 |
-
character. If there is a form-feed or a vertical-tab character in such a
|
| 441 |
-
comment, only whitespace characters shall appear between it and the
|
| 442 |
-
new-line that terminates the comment; no diagnostic is required.
|
| 443 |
-
|
| 444 |
-
[*Note 1*: The comment characters `//`, `/*`, and `*/` have no special
|
| 445 |
-
meaning within a `//` comment and are treated just like other
|
| 446 |
-
characters. Similarly, the comment characters `//` and `/*` have no
|
| 447 |
-
special meaning within a `/*` comment. — *end note*]
|
| 448 |
-
|
| 449 |
## Header names <a id="lex.header">[[lex.header]]</a>
|
| 450 |
|
| 451 |
``` bnf
|
| 452 |
header-name:
|
| 453 |
'<' h-char-sequence '>'
|
| 454 |
'"' q-char-sequence '"'
|
| 455 |
```
|
| 456 |
|
| 457 |
``` bnf
|
| 458 |
h-char-sequence:
|
| 459 |
-
h-char
|
| 460 |
-
h-char-sequence h-char
|
| 461 |
```
|
| 462 |
|
| 463 |
``` bnf
|
| 464 |
h-char:
|
| 465 |
any member of the translation character set except new-line and U+003e (greater-than sign)
|
| 466 |
```
|
| 467 |
|
| 468 |
``` bnf
|
| 469 |
q-char-sequence:
|
| 470 |
-
q-char
|
| 471 |
-
q-char-sequence q-char
|
| 472 |
```
|
| 473 |
|
| 474 |
``` bnf
|
| 475 |
q-char:
|
| 476 |
any member of the translation character set except new-line and U+0022 (quotation mark)
|
| 477 |
```
|
| 478 |
|
| 479 |
-
[*Note 1*: Header name preprocessing tokens only appear within a
|
| 480 |
-
`#include` preprocessing directive, a `__has_include` preprocessing
|
| 481 |
-
expression, or after certain occurrences of an `import` token (see
|
| 482 |
-
[[lex.pptoken]]). — *end note*]
|
| 483 |
-
|
| 484 |
The sequences in both forms of *header-name*s are mapped in an
|
| 485 |
*implementation-defined* manner to headers or to external source file
|
| 486 |
names as specified in [[cpp.include]].
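A short sketch of two contexts in which a *header-name* is formed: an `#include` directive and a `__has_include` expression. The macro `HAVE_CSTDIO` and the constant `have_cstdio` are hypothetical names, and the `#ifdef` guard assumes that the implementation may predate `__has_include`.

```cpp
#include <cstdio>  // '<' h-char-sequence '>' form of header-name

// __has_include also takes a header-name; guard it for older implementations.
#ifdef __has_include
#  if __has_include(<cstdio>)
#    define HAVE_CSTDIO 1
#  else
#    define HAVE_CSTDIO 0
#  endif
#else
#  define HAVE_CSTDIO 0  // cannot be tested without __has_include
#endif

constexpr bool have_cstdio = (HAVE_CSTDIO == 1);
```

How the character sequence between the delimiters is mapped to a header or a source file remains implementation-defined, so the sketch says nothing about search paths.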
|
| 487 |
|
|
| 488 |
The appearance of either of the characters `'` or `\` or of either of
|
| 489 |
the character sequences `/*` or `//` in a *q-char-sequence* or an
|
| 490 |
*h-char-sequence* is conditionally-supported with
|
| 491 |
*implementation-defined* semantics, as is the appearance of the
|
| 492 |
-
character `"` in an *h-char-sequence*.
|
| 493 |
|
| 494 |
## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
|
| 495 |
|
| 496 |
``` bnf
|
| 497 |
pp-number:
|
|
@@ -513,10 +544,76 @@ tokens [[lex.icon]] and all *floating-point-literal* tokens
|
|
| 513 |
|
| 514 |
A preprocessing number does not have a type or a value; it acquires both
|
| 515 |
after a successful conversion to an *integer-literal* token or a
|
| 516 |
*floating-point-literal* token.
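Token pasting shows that a preprocessing number only has to become a valid literal at the point of conversion. In this sketch, which uses the hypothetical names `PASTE` and `pasted`, neither `0x` nor `1F` is a valid *integer-literal*, yet both are preprocessing numbers and can be pasted in phase 4.

```cpp
#define PASTE(a, b) a##b

// `0x` and `1F` are each lexed as a pp-number. The ## operator joins them
// into the single pp-number `0x1F`, and only that result is converted to an
// integer-literal (acquiring a type and a value) in phase 7.
constexpr int pasted = PASTE(0x, 1F);

static_assert(pasted == 31, "0x1F is 31 in decimal");
```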
|
| 517 |
|
|
| 518 |
## Identifiers <a id="lex.name">[[lex.name]]</a>
|
| 519 |
|
| 520 |
``` bnf
|
| 521 |
identifier:
|
| 522 |
identifier-start
|
|
@@ -549,21 +646,24 @@ digit: one of
|
|
| 549 |
'0 1 2 3 4 5 6 7 8 9'
|
| 550 |
```
|
| 551 |
|
| 552 |
[*Note 1*:
|
| 553 |
|
| 554 |
-
The character properties XID_Start and XID_Continue are
|
| 555 |
-
|
| 556 |
|
| 557 |
— *end note*]
|
| 558 |
|
| 559 |
The program is ill-formed if an *identifier* does not conform to
|
| 560 |
Normalization Form C as specified in the Unicode Standard.
|
| 561 |
|
| 562 |
[*Note 2*: Identifiers are case-sensitive. — *end note*]
|
| 563 |
|
| 564 |
-
[*Note 3*:
|
|
|
|
|
|
|
|
|
|
| 565 |
*preprocessing-token*s [[lex.pptoken]] differentiated as keywords
|
| 566 |
[[lex.key]] in the later translation phase 7
|
| 567 |
[[lex.token]]. — *end note*]
|
| 568 |
|
| 569 |
The identifiers in [[lex.name.special]] have a special meaning when
|
|
@@ -576,12 +676,13 @@ interpret the token as a regular *identifier*.
|
|
| 576 |
In addition, some identifiers appearing as a *token* or
|
| 577 |
*preprocessing-token* are reserved for use by C++ implementations and
|
| 578 |
shall not be used otherwise; no diagnostic is required.
|
| 579 |
|
| 580 |
- Each identifier that contains a double underscore `__` or begins with
|
| 581 |
-
an underscore followed by an uppercase letter
|
| 582 |
-
|
|
|
|
| 583 |
- Each identifier that begins with an underscore is reserved to the
|
| 584 |
implementation for use as a name in the global namespace.
|
| 585 |
|
| 586 |
## Keywords <a id="lex.key">[[lex.key]]</a>
|
| 587 |
|
|
@@ -609,44 +710,10 @@ Furthermore, the alternative representations shown in
|
|
| 609 |
| | | | | | |
|
| 610 |
| -------- | -------- | -------- | ------- | -------- | ----- |
|
| 611 |
| `and` | `and_eq` | `bitand` | `bitor` | `compl` | `not` |
|
| 612 |
| `not_eq` | `or` | `or_eq` | `xor` | `xor_eq` | |
|
| 613 |
|
| 614 |
-
## Operators and punctuators <a id="lex.operators">[[lex.operators]]</a>
|
| 615 |
-
|
| 616 |
-
The lexical representation of C++ programs includes a number of
|
| 617 |
-
preprocessing tokens that are used in the syntax of the preprocessor or
|
| 618 |
-
are converted into tokens for operators and punctuators:
|
| 619 |
-
|
| 620 |
-
``` bnf
|
| 621 |
-
preprocessing-op-or-punc:
|
| 622 |
-
preprocessing-operator
|
| 623 |
-
operator-or-punctuator
|
| 624 |
-
```
|
| 625 |
-
|
| 626 |
-
``` bnf
|
| 627 |
-
%% Ed. note: character protrusion would misalign various operators.
|
| 628 |
-
preprocessing-operator: one of
|
| 629 |
-
'# ## %: %:%:'
|
| 630 |
-
```
|
| 631 |
-
|
| 632 |
-
``` bnf
|
| 633 |
-
operator-or-punctuator: one of
|
| 634 |
-
'{ } [ ] ( )'
|
| 635 |
-
'<: :> <% %> ; : ...'
|
| 636 |
-
'? :: . .* -> ->* ~'
|
| 637 |
-
'! + - * / % ^ & |'
|
| 638 |
-
'= += -= *= /= %= ^= &= |='
|
| 639 |
-
'== != < > <= >= <=> && ||'
|
| 640 |
-
'<< >> <<= >>= ++ -- ,'
|
| 641 |
-
'and or xor not bitand bitor compl'
|
| 642 |
-
'and_eq or_eq xor_eq not_eq'
|
| 643 |
-
```
|
| 644 |
-
|
| 645 |
-
Each *operator-or-punctuator* is converted to a single token in
|
| 646 |
-
translation phase 7 [[lex.phases]].
|
| 647 |
-
|
| 648 |
## Literals <a id="lex.literal">[[lex.literal]]</a>
|
| 649 |
|
| 650 |
### Kinds of literals <a id="lex.literal.kinds">[[lex.literal.kinds]]</a>
|
| 651 |
|
| 652 |
There are several kinds of literals.[^8]
|
|
@@ -762,12 +829,12 @@ size-suffix: one of
|
|
| 762 |
'z Z'
|
| 763 |
```
|
| 764 |
|
| 765 |
In an *integer-literal*, the sequence of *binary-digit*s,
|
| 766 |
*octal-digit*s, *digit*s, or *hexadecimal-digit*s is interpreted as a
|
| 767 |
-
base N integer as shown in
|
| 768 |
-
|
| 769 |
|
| 770 |
[*Note 1*: The prefix and any optional separating single quotes are
|
| 771 |
ignored when determining the value. — *end note*]
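The effect of the prefix and of separating single quotes can be checked at compile time. A minimal sketch, assuming a compiler mode that supports binary literals and digit separators; the name `million` is hypothetical.

```cpp
// The same value written in each base; the prefix selects the base only.
static_assert(0b1010 == 10, "binary");
static_assert(012 == 10, "octal");
static_assert(0xA == 10, "hexadecimal");

// Separating single quotes are ignored when determining the value.
static_assert(1'000'000 == 1000000, "decimal with separators");
static_assert(0x1'0000 == 65536, "hexadecimal with a separator");

constexpr int million = 1'000'000;
```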
|
| 772 |
|
| 773 |
**Table: Base of *integer-literal*s** <a id="lex.icon.base">[lex.icon.base]</a>
|
|
@@ -820,20 +887,23 @@ which its value can be represented.
|
|
| 820 |
| | | `std::size_t` |
|
| 821 |
| Both `u` or `U` | `std::size_t` | `std::size_t` |
|
| 822 |
| and `z` or `Z` | | |
|
| 823 |
|
| 824 |
|
| 825 |
-
|
|
|
|
| 826 |
and an extended integer type [[basic.fundamental]] can represent its
|
| 827 |
value, it may have that extended integer type. If all of the types in
|
| 828 |
the list for the *integer-literal* are signed, the extended integer type
|
| 829 |
-
|
| 830 |
-
|
| 831 |
-
|
| 832 |
-
|
| 833 |
-
|
| 834 |
-
|
|
|
|
|
|
|
| 835 |
|
| 836 |
### Character literals <a id="lex.ccon">[[lex.ccon]]</a>
|
| 837 |
|
| 838 |
``` bnf
|
| 839 |
character-literal:
|
|
@@ -845,12 +915,11 @@ encoding-prefix: one of
|
|
| 845 |
'u8' 'u' 'U' 'L'
|
| 846 |
```
|
| 847 |
|
| 848 |
``` bnf
|
| 849 |
c-char-sequence:
|
| 850 |
-
c-char
|
| 851 |
-
c-char-sequence c-char
|
| 852 |
```
|
| 853 |
|
| 854 |
``` bnf
|
| 855 |
c-char:
|
| 856 |
basic-c-char
|
|
@@ -887,12 +956,11 @@ numeric-escape-sequence:
|
|
| 887 |
hexadecimal-escape-sequence
|
| 888 |
```
|
| 889 |
|
| 890 |
``` bnf
|
| 891 |
simple-octal-digit-sequence:
|
| 892 |
-
octal-digit
|
| 893 |
-
simple-octal-digit-sequence octal-digit
|
| 894 |
```
|
| 895 |
|
| 896 |
``` bnf
|
| 897 |
octal-escape-sequence:
|
| 898 |
'\' octal-digit
|
|
@@ -915,60 +983,47 @@ conditional-escape-sequence:
|
|
| 915 |
``` bnf
|
| 916 |
conditional-escape-sequence-char:
|
| 917 |
any member of the basic character set that is not an octal-digit, a simple-escape-sequence-char, or the characters 'N', 'o', 'u', 'U', or 'x'
|
| 918 |
```
|
| 919 |
|
| 920 |
-
A *
|
| 921 |
-
*c-char-sequence* consists of
|
| 922 |
-
*
|
| 923 |
-
|
| 924 |
-
|
| 925 |
-
|
| 926 |
-
one *c-char*. The *encoding-prefix* of a non-encodable character literal
|
| 927 |
-
or a multicharacter literal shall be absent. Such *character-literal*s
|
| 928 |
-
are conditionally-supported.
|
| 929 |
|
| 930 |
The kind of a *character-literal*, its type, and its associated
|
| 931 |
character encoding [[lex.charset]] are determined by its
|
| 932 |
*encoding-prefix* and its *c-char-sequence* as defined by
|
| 933 |
-
[[lex.ccon.literal]].
|
| 934 |
-
literals and multicharacter literals take precedence over the base kind.
|
| 935 |
-
|
| 936 |
-
[*Note 1*: The associated character encoding for ordinary character
|
| 937 |
-
literals determines encodability, but does not determine the value of
|
| 938 |
-
non-encodable ordinary character literals or ordinary multicharacter
|
| 939 |
-
literals. The examples in [[lex.ccon.literal]] for non-encodable
|
| 940 |
-
ordinary character literals assume that the specified character lacks
|
| 941 |
-
representation in the ordinary literal encoding or that encoding the
|
| 942 |
-
character would require more than one code unit. — *end note*]
|
| 943 |
|
| 944 |
**Table: Character literals** <a id="lex.ccon.literal">[lex.ccon.literal]</a>
|
| 945 |
|
| 946 |
-
|
|
| 947 |
-
| ---- | -------------------------- | ---------- | ------------ | ------- |
|
| 948 |
-
| none
|
| 949 |
| `L` | wide character literal | `wchar_t` | wide literal | `L'w'` |
|
| 950 |
| | | | encoding | |
|
| 951 |
| `u8` | UTF-8 character literal | `char8_t` | UTF-8 | `u8'x'` |
|
| 952 |
| `u` | UTF-16 character literal | `char16_t` | UTF-16 | `u'y'` |
|
| 953 |
| `U` | UTF-32 character literal | `char32_t` | UTF-32 | `U'z'` |
|
| 954 |
|
| 955 |
|
| 956 |
In translation phase 4, the value of a *character-literal* is determined
|
| 957 |
using the range of representable values of the *character-literal*’s
|
| 958 |
-
type in translation phase 7. A
|
| 959 |
-
|
| 960 |
-
|
| 961 |
|
| 962 |
- A *character-literal* with a *c-char-sequence* consisting of a single
|
| 963 |
*basic-c-char*, *simple-escape-sequence*, or
|
| 964 |
*universal-character-name* is the code unit value of the specified
|
| 965 |
character as encoded in the literal’s associated character encoding.
|
| 966 |
-
|
| 967 |
-
|
| 968 |
-
|
| 969 |
-
literal. — *end note*]
|
| 970 |
- A *character-literal* with a *c-char-sequence* consisting of a single
|
| 971 |
*numeric-escape-sequence* has a value as follows:
|
| 972 |
- Let v be the integer value represented by the octal number
|
| 973 |
comprising the sequence of *octal-digit*s in an
|
| 974 |
*octal-escape-sequence* or by the hexadecimal number comprising the
|
|
@@ -979,20 +1034,20 @@ of any other kind of *character-literal* is determined as follows:
|
|
| 979 |
or `L`, and v does not exceed the range of representable values of
|
| 980 |
the corresponding unsigned type for the underlying type of the
|
| 981 |
*character-literal*’s type, then the value is the unique value of
|
| 982 |
the *character-literal*’s type `T` that is congruent to v modulo 2ᴺ,
|
| 983 |
where N is the width of `T`.
|
| 984 |
-
- Otherwise, the
|
| 985 |
- A *character-literal* with a *c-char-sequence* consisting of a single
|
| 986 |
*conditional-escape-sequence* is conditionally-supported and has an
|
| 987 |
*implementation-defined* value.
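The rules above for numeric escape sequences can be illustrated with a few compile-time checks. This sketch assumes an ASCII-compatible ordinary and wide literal encoding and an 8-bit `char`; the name `high_byte` is hypothetical.

```cpp
// Octal 101 and hexadecimal 41 both denote 65, the code unit of 'A' under
// the assumed ASCII-compatible encodings.
static_assert('\101' == 'A', "octal escape");
static_assert('\x41' == 'A', "hexadecimal escape");
static_assert(L'\x41' == L'A', "wide character literal");

// For '\xff', v is 255. Whether plain char is signed or unsigned, the value
// is congruent to 255 modulo 2^8, so converting to unsigned char gives 255.
constexpr unsigned char high_byte = static_cast<unsigned char>('\xff');
static_assert(high_byte == 0xFF, "congruent to v modulo 2^N");
```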
|
| 988 |
|
| 989 |
The character specified by a *simple-escape-sequence* is specified in
|
| 990 |
[[lex.ccon.esc]].
|
| 991 |
|
| 992 |
-
[*Note
|
| 993 |
-
for compatibility with
|
| 994 |
|
| 995 |
**Table: Simple escape sequences** <a id="lex.ccon.esc">[lex.ccon.esc]</a>
|
| 996 |
|
| 997 |
| character | | *simple-escape-sequence* |
|
| 998 |
| --------- | -------------------- | ------------------------ |
|
|
@@ -1129,12 +1184,11 @@ string-literal:
|
|
| 1129 |
encoding-prefixₒₚₜ 'R' raw-string
|
| 1130 |
```
|
| 1131 |
|
| 1132 |
``` bnf
|
| 1133 |
s-char-sequence:
|
| 1134 |
-
s-char
|
| 1135 |
-
s-char-sequence s-char
|
| 1136 |
```
|
| 1137 |
|
| 1138 |
``` bnf
|
| 1139 |
s-char:
|
| 1140 |
basic-s-char
|
|
@@ -1153,24 +1207,22 @@ raw-string:
|
|
| 1153 |
'"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
|
| 1154 |
```
|
| 1155 |
|
| 1156 |
``` bnf
|
| 1157 |
r-char-sequence:
|
| 1158 |
-
r-char
|
| 1159 |
-
r-char-sequence r-char
|
| 1160 |
```
|
| 1161 |
|
| 1162 |
``` bnf
|
| 1163 |
r-char:
|
| 1164 |
any member of the translation character set, except a U+0029 (right parenthesis) followed by
|
| 1165 |
the initial *d-char-sequence* (which may be empty) followed by a U+0022 (quotation mark)
|
| 1166 |
```
|
| 1167 |
|
| 1168 |
``` bnf
|
| 1169 |
d-char-sequence:
|
| 1170 |
-
d-char
|
| 1171 |
-
d-char-sequence d-char
|
| 1172 |
```
|
| 1173 |
|
| 1174 |
``` bnf
|
| 1175 |
d-char:
|
| 1176 |
any member of the basic character set except:
|
|
@@ -1179,16 +1231,17 @@ d-char:
|
|
| 1179 |
```
|
| 1180 |
|
| 1181 |
The kind of a *string-literal*, its type, and its associated character
|
| 1182 |
encoding [[lex.charset]] are determined by its encoding prefix and
|
| 1183 |
sequence of *s-char*s or *r-char*s as defined by [[lex.string.literal]]
|
| 1184 |
-
where n is the number of encoded code units
|
|
|
|
| 1185 |
|
| 1186 |
**Table: String literals** <a id="lex.string.literal">[lex.string.literal]</a>
|
| 1187 |
|
| 1188 |
-
|
|
| 1189 |
-
| ---- | ----------------------- | ----------------------------- | ------------------------- | ---------------------------------------------- |
|
| 1190 |
| none | ordinary string literal | array of $n$ `const char` | ordinary literal encoding | `"ordinary string"` `R"(ordinary raw string)"` |
|
| 1191 |
| `L` | wide string literal | array of $n$ `const wchar_t` | wide literal encoding | `L"wide string"` `LR"w(wide raw string)w"` |
|
| 1192 |
| `u8` | UTF-8 string literal | array of $n$ `const char8_t` | UTF-8 | `u8"UTF-8 string"` `u8R"x(UTF-8 raw string)x"` |
|
| 1193 |
| `u` | UTF-16 string literal | array of $n$ `const char16_t` | UTF-16 | `u"UTF-16 string"` `uR"y(UTF-16 raw string)y"` |
|
| 1194 |
| `U` | UTF-32 string literal | array of $n$ `const char32_t` | UTF-32 | `U"UTF-32 string"` `UR"z(UTF-32 raw string)z"` |
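
The type column of the table can be checked at compile time. The following is a minimal sketch, not text from the standard; the helper name `element_count` is hypothetical. It relies on a string literal being an lvalue, so `decltype` yields a reference to the array type, and the array length counts the terminating null.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to its array type.
static_assert(std::is_same_v<decltype("abc"),  const char     (&)[4]>);
static_assert(std::is_same_v<decltype(L"abc"), const wchar_t  (&)[4]>);
static_assert(std::is_same_v<decltype(u"abc"), const char16_t (&)[4]>);
static_assert(std::is_same_v<decltype(U"abc"), const char32_t (&)[4]>);

// Hypothetical helper: number of array elements, terminating null included.
template <class T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// Rawness does not change the element type or the way elements are counted.
bool literal_types_check() {
    assert(element_count(LR"w(ab)w") == 3);
    return element_count("ordinary string") == 16;
}
```

The `u8` row is left out only because its element type differs between language versions.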
|
|
@@ -1198,12 +1251,12 @@ A *string-literal* that has an `R` in the prefix is a *raw string
|
|
| 1198 |
literal*. The *d-char-sequence* serves as a delimiter. The terminating
|
| 1199 |
*d-char-sequence* of a *raw-string* is the same sequence of characters
|
| 1200 |
as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
|
| 1201 |
at most 16 characters.
|
| 1202 |
|
| 1203 |
-
[*Note 1*: The characters `'('` and `')'`
|
| 1204 |
-
|
| 1205 |
`"(a|b)"`. — *end note*]
|
| 1206 |
|
| 1207 |
[*Note 2*:
|
| 1208 |
|
| 1209 |
A source-file new-line in a raw string literal results in a new-line in
|
|
@@ -1239,18 +1292,15 @@ R"(x = "\"y\"")"
|
|
| 1239 |
is equivalent to `"x = \"\\\"y\\\"\""`.
|
| 1240 |
|
| 1241 |
— *end example*]
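
The equivalence stated in this example, and the role of the *d-char-sequence* as a delimiter, can be verified directly. This is an illustrative sketch; the variable and function names are assumptions made for the example.

```cpp
#include <cassert>
#include <string_view>

// Raw literal: quotes and backslashes in the body are taken verbatim.
constexpr std::string_view raw = R"(x = "\"y\"")";
// The same characters written as an ordinary literal with escapes.
constexpr std::string_view escaped = "x = \"\\\"y\\\"\"";
static_assert(raw == escaped);

// With the delimiter `d`, the body may contain a right parenthesis followed
// by a quotation mark without ending the literal.
constexpr std::string_view delimited = R"d(a)"b)d";

bool raw_strings_agree() {
    assert(delimited.size() == 4);
    return raw == escaped && delimited == "a)\"b";
}
```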
|
| 1242 |
|
| 1243 |
Ordinary string literals and UTF-8 string literals are also referred to
|
| 1244 |
-
as narrow string literals.
|
| 1245 |
|
| 1246 |
-
The
|
| 1247 |
-
|
| 1248 |
-
*
|
| 1249 |
-
*encoding-prefix* is that *encoding-prefix*. If one *string-literal* has
|
| 1250 |
-
no *encoding-prefix*, the common *encoding-prefix* is that of the other
|
| 1251 |
-
*string-literal*. Any other combinations are ill-formed.
|
| 1252 |
|
| 1253 |
[*Note 3*: A *string-literal*’s rawness has no effect on the
|
| 1254 |
determination of the common *encoding-prefix*. — *end note*]
|
| 1255 |
|
| 1256 |
In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
|
|
@@ -1287,16 +1337,17 @@ digit `1` (and not the single character `'A'` specified by a
|
|
| 1287 |
| `u"a"` | `"b"` | `u"ab"` | `U"a"` | `"b"` | `U"ab"` | `L"a"` | `"b"` | `L"ab"` |
|
| 1288 |
| `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
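
A few rows of the concatenation table can be exercised as follows. This is a sketch under the assumption that the implementation accepts mixing one prefixed and one unprefixed literal, as the table describes; all names are hypothetical.

```cpp
#include <cassert>
#include <string_view>

// One literal has a prefix, the other has none: the prefix applies to the result.
constexpr std::u16string_view a = u"a" "b";  // same as u"ab"
constexpr std::u32string_view b = "a" U"b";  // same as U"ab"
constexpr std::wstring_view   c = L"a" "b";  // same as L"ab"

bool concat_matches_table() {
    assert(a.size() == 2 && b.size() == 2 && c.size() == 2);
    return a == u"ab" && b == U"ab" && c == L"ab";
}
```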
|
| 1289 |
|
| 1290 |
|
| 1291 |
Evaluating a *string-literal* results in a string literal object with
|
| 1292 |
-
static storage duration [[basic.stc]].
|
| 1293 |
-
distinct (that is, are stored in nonoverlapping objects) and whether
|
| 1294 |
-
successive evaluations of a *string-literal* yield the same or a
|
| 1295 |
-
different object is unspecified.
|
| 1296 |
|
| 1297 |
-
[*Note 4*:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1298 |
is undefined. — *end note*]
|
| 1299 |
|
| 1300 |
String literal objects are initialized with the sequence of code unit
|
| 1301 |
values corresponding to the *string-literal*’s sequence of *s-char*s
|
| 1302 |
(originally from non-raw string literals) and *r-char*s (originally from
|
|
@@ -1306,20 +1357,19 @@ order as follows:
|
|
| 1306 |
- The sequence of characters denoted by each contiguous sequence of
|
| 1307 |
*basic-s-char*s, *r-char*s, *simple-escape-sequence*s [[lex.ccon]],
|
| 1308 |
and *universal-character-name*s [[lex.charset]] is encoded to a code
|
| 1309 |
unit sequence using the *string-literal*’s associated character
|
| 1310 |
encoding. If a character lacks representation in the associated
|
| 1311 |
-
character encoding, then the
|
| 1312 |
-
|
| 1313 |
-
|
| 1314 |
-
|
| 1315 |
-
|
| 1316 |
-
|
| 1317 |
-
|
| 1318 |
-
|
| 1319 |
-
|
| 1320 |
-
each character independently. — *end note*]
|
| 1321 |
- Each *numeric-escape-sequence* [[lex.ccon]] contributes a single code
|
| 1322 |
unit with a value as follows:
|
| 1323 |
- Let v be the integer value represented by the octal number
|
| 1324 |
comprising the sequence of *octal-digit*s in an
|
| 1325 |
*octal-escape-sequence* or by the hexadecimal number comprising the
|
|
@@ -1330,35 +1380,53 @@ order as follows:
|
|
| 1330 |
`L`, and v does not exceed the range of representable values of the
|
| 1331 |
corresponding unsigned type for the underlying type of the
|
| 1332 |
*string-literal*’s array element type, then the value is the unique
|
| 1333 |
value of the *string-literal*’s array element type `T` that is
|
| 1334 |
congruent to v modulo 2ᴺ, where N is the width of `T`.
|
| 1335 |
-
- Otherwise, the
|
| 1336 |
|
| 1337 |
When encoding a stateful character encoding, these sequences should
|
| 1338 |
have no effect on encoding state.
|
| 1339 |
- Each *conditional-escape-sequence* [[lex.ccon]] contributes an
|
| 1340 |
*implementation-defined* code unit sequence. When encoding a stateful
|
| 1341 |
character encoding, it is *implementation-defined* what effect these
|
| 1342 |
sequences have on encoding state.
|
| 1343 |
|
| 1344 |
### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
|
| 1345 |
|
| 1346 |
``` bnf
|
| 1347 |
boolean-literal:
|
| 1348 |
-
|
| 1349 |
-
|
| 1350 |
```
|
| 1351 |
|
| 1352 |
The Boolean literals are the keywords `false` and `true`. Such literals
|
| 1353 |
have type `bool`.
|
| 1354 |
|
| 1355 |
### Pointer literals <a id="lex.nullptr">[[lex.nullptr]]</a>
|
| 1356 |
|
| 1357 |
``` bnf
|
| 1358 |
pointer-literal:
|
| 1359 |
-
|
| 1360 |
```
|
| 1361 |
|
| 1362 |
The pointer literal is the keyword `nullptr`. It has type
|
| 1363 |
`std::nullptr_t`.
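
The types named in the two sections above can be confirmed with `decltype`. The sketch below also shows a consequence that is not stated in this text but follows from the types: in overload resolution `nullptr` selects a pointer parameter while `0` selects `int`. The overload set `pick` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

static_assert(std::is_same_v<decltype(true),    bool>);
static_assert(std::is_same_v<decltype(false),   bool>);
static_assert(std::is_same_v<decltype(nullptr), std::nullptr_t>);

// Hypothetical overload set used only to observe which literal picks which overload.
constexpr int pick(int)         { return 1; }
constexpr int pick(const char*) { return 2; }

bool literal_overloads_check() {
    assert(pick(0) == 1);
    return pick(nullptr) == 2;
}
```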
|
| 1364 |
|
|
@@ -1490,11 +1558,11 @@ where *f* is the source character sequence c₁c₂...cₖ.
|
|
| 1490 |
basic character set. — *end note*]
|
| 1491 |
|
| 1492 |
If *L* is a *user-defined-string-literal*, let *str* be the literal
|
| 1493 |
without its *ud-suffix* and let *len* be the number of code units in
|
| 1494 |
*str* (i.e., its length excluding the terminating null character). If
|
| 1495 |
-
*S* contains a literal operator template with a
|
| 1496 |
parameter for which *str* is a well-formed *template-argument*, the
|
| 1497 |
literal *L* is treated as a call of the form
|
| 1498 |
|
| 1499 |
``` cpp
|
| 1500 |
operator ""X<str>()
|
|
@@ -1557,26 +1625,37 @@ int main() {
|
|
| 1557 |
[basic.fundamental]: basic.md#basic.fundamental
|
| 1558 |
[basic.link]: basic.md#basic.link
|
| 1559 |
[basic.lookup.unqual]: basic.md#basic.lookup.unqual
|
| 1560 |
[basic.stc]: basic.md#basic.stc
|
| 1561 |
[character.seq]: library.md#character.seq
|
|
|
|
| 1562 |
[conv.mem]: expr.md#conv.mem
|
| 1563 |
[conv.ptr]: expr.md#conv.ptr
|
| 1564 |
[cpp]: cpp.md#cpp
|
| 1565 |
[cpp.cond]: cpp.md#cpp.cond
|
|
|
|
| 1566 |
[cpp.import]: cpp.md#cpp.import
|
| 1567 |
[cpp.include]: cpp.md#cpp.include
|
| 1568 |
[cpp.module]: cpp.md#cpp.module
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1569 |
[cpp.stringize]: cpp.md#cpp.stringize
|
| 1570 |
[dcl.attr.grammar]: dcl.md#dcl.attr.grammar
|
|
|
|
|
|
|
| 1571 |
[expr.prim.literal]: expr.md#expr.prim.literal
|
| 1572 |
[headers]: library.md#headers
|
|
|
|
| 1573 |
[lex]: #lex
|
| 1574 |
[lex.bool]: #lex.bool
|
| 1575 |
[lex.ccon]: #lex.ccon
|
| 1576 |
[lex.ccon.esc]: #lex.ccon.esc
|
| 1577 |
[lex.ccon.literal]: #lex.ccon.literal
|
|
|
|
| 1578 |
[lex.charset]: #lex.charset
|
| 1579 |
[lex.charset.basic]: #lex.charset.basic
|
| 1580 |
[lex.charset.literal]: #lex.charset.literal
|
| 1581 |
[lex.comment]: #lex.comment
|
| 1582 |
[lex.digraph]: #lex.digraph
|
|
@@ -1600,50 +1679,56 @@ int main() {
|
|
| 1600 |
[lex.pptoken]: #lex.pptoken
|
| 1601 |
[lex.separate]: #lex.separate
|
| 1602 |
[lex.string]: #lex.string
|
| 1603 |
[lex.string.concat]: #lex.string.concat
|
| 1604 |
[lex.string.literal]: #lex.string.literal
|
|
|
|
| 1605 |
[lex.token]: #lex.token
|
|
|
|
| 1606 |
[module.import]: module.md#module.import
|
|
|
|
| 1607 |
[module.unit]: module.md#module.unit
|
| 1608 |
[over.literal]: over.md#over.literal
|
| 1609 |
[support.types.layout]: support.md#support.types.layout
|
| 1610 |
[temp.explicit]: temp.md#temp.explicit
|
|
|
|
| 1611 |
[temp.names]: temp.md#temp.names
|
|
|
|
|
|
|
| 1612 |
|
| 1613 |
[^1]: Implementations behave as if these separate phases occur, although
|
| 1614 |
in practice different phases can be folded together.
|
| 1615 |
|
| 1616 |
-
[^2]:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1617 |
ending in the first portion of a multi-character token that requires
|
| 1618 |
a terminating sequence of characters, such as a *header-name* that
|
| 1619 |
is missing the closing `"` or `>`. A partial comment would arise
|
| 1620 |
from a source file ending with an unclosed `/*` comment.
|
| 1621 |
|
| 1622 |
-
[^
|
| 1623 |
“digraph” (token consisting of two characters) is not perfectly
|
| 1624 |
descriptive, since one of the alternative *preprocessing-token*s is
|
| 1625 |
`%:%:` and of course several primary tokens contain two characters.
|
| 1626 |
Nonetheless, those alternative tokens that aren’t lexical keywords
|
| 1627 |
are colloquially known as “digraphs”.
|
| 1628 |
|
| 1629 |
-
[^
|
| 1630 |
will be different, maintaining the source spelling, but the tokens
|
| 1631 |
can otherwise be freely interchanged.
|
| 1632 |
|
| 1633 |
-
[^
|
| 1634 |
-
|
| 1635 |
-
[^6]: Thus, a sequence of characters that resembles an escape sequence
|
| 1636 |
-
can result in an error, be interpreted as the character
|
| 1637 |
-
corresponding to the escape sequence, or have a completely different
|
| 1638 |
-
meaning, depending on the implementation.
|
| 1639 |
|
| 1640 |
[^7]: On systems in which linkers cannot accept extended characters, an
|
| 1641 |
encoding of the *universal-character-name* can be used in forming
|
| 1642 |
valid external identifiers. For example, some otherwise unused
|
| 1643 |
character or sequence of characters can be used to encode the `\u` in
|
| 1644 |
a *universal-character-name*. Extended characters can produce a
|
| 1645 |
long external identifier, but C++ does not place a translation limit
|
| 1646 |
on significant characters for external identifiers.
|
| 1647 |
|
| 1648 |
[^8]: The term “literal” generally designates, in this document, those
|
| 1649 |
-
tokens that are called “constants” in
|
|
|
|
| 4 |
|
| 5 |
The text of the program is kept in units called *source files* in this
|
| 6 |
document. A source file together with all the headers [[headers]] and
|
| 7 |
source files included [[cpp.include]] via the preprocessing directive
|
| 8 |
`#include`, less any source lines skipped by any of the conditional
|
| 9 |
+
inclusion [[cpp.cond]] preprocessing directives, as modified by the
|
| 10 |
+
implementation-defined behavior of any
|
| 11 |
+
conditionally-supported-directives [[cpp.pre]] and pragmas
|
| 12 |
+
[[cpp.pragma]], if any, is called a *preprocessing translation unit*.
|
| 13 |
|
| 14 |
+
[*Note 1*: A C++ program need not all be translated at the same time.
|
| 15 |
+
Translation units can be separately translated and then later linked to
|
| 16 |
+
produce an executable program [[basic.link]]. — *end note*]
|
| 17 |
|
| 18 |
## Phases of translation <a id="lex.phases">[[lex.phases]]</a>
|
| 19 |
|
| 20 |
The precedence among the syntax rules of translation is specified by the
|
| 21 |
following phases.[^1]
|
|
|
|
| 27 |
*implementation-defined* manner that includes a means of designating
|
| 28 |
input files as UTF-8 files, independent of their content.
|
| 29 |
\[*Note 1*: In other words, recognizing the U+feff (byte order mark)
|
| 30 |
is not sufficient. — *end note*] If an input file is determined to
|
| 31 |
be a UTF-8 file, then it shall be a well-formed UTF-8 code unit
|
| 32 |
+
sequence and it is decoded to produce a sequence of Unicode[^2]
|
| 33 |
+
scalar values. A sequence of translation character set elements
|
| 34 |
+
[[lex.charset]] is then formed by mapping each Unicode scalar value
|
| 35 |
+
to the corresponding translation character set element. In the
|
| 36 |
+
resulting sequence, each pair of characters in the input sequence
|
| 37 |
+
consisting of U+000d (carriage return) followed by
|
| 38 |
+
U+000a (line feed), as well as each U+000d (carriage return) not
|
| 39 |
+
immediately followed by a U+000a (line feed), is replaced by a
|
| 40 |
+
single new-line character. For any other kind of input file
|
| 41 |
+
supported by the implementation, characters are mapped, in an
|
| 42 |
+
*implementation-defined* manner, to a sequence of translation
|
| 43 |
+
character set elements, representing end-of-line indicators as
|
| 44 |
+
new-line characters.
|
| 45 |
2. If the first translation character is U+feff (byte order mark), it
|
| 46 |
+
is deleted. Each sequence comprising a backslash character (\\)
|
| 47 |
+
immediately followed by zero or more whitespace characters other
|
| 48 |
+
than new-line followed by a new-line character is deleted, splicing
|
| 49 |
+
physical source lines to form *logical source lines*. Only the last
|
| 50 |
+
backslash on any physical source line shall be eligible for being
|
| 51 |
+
part of such a splice. \[*Note 2*: Line splicing can form a
|
| 52 |
+
*universal-character-name* [[lex.charset]]. — *end note*] A source
|
| 53 |
+
file that is not empty and that (after splicing) does not end in a
|
| 54 |
+
new-line character shall be processed as if an additional new-line
|
| 55 |
+
character were appended to the file.
|
|
|
|
| 56 |
3. The source file is decomposed into preprocessing tokens
|
| 57 |
[[lex.pptoken]] and sequences of whitespace characters (including
|
| 58 |
comments). A source file shall not end in a partial preprocessing
|
| 59 |
+
token or in a partial comment.[^3] Each comment [[lex.comment]] is
|
| 60 |
+
replaced by one U+0020 (space) character. New-line characters are
|
| 61 |
+
retained. Whether each nonempty sequence of whitespace characters
|
| 62 |
+
other than new-line is retained or replaced by one U+0020 (space)
|
| 63 |
+
character is unspecified. As characters from the source file are
|
| 64 |
+
consumed to form the next preprocessing token (i.e., not being
|
| 65 |
+
consumed as part of a comment or other forms of whitespace), except
|
| 66 |
+
when matching a *c-char-sequence*, *s-char-sequence*,
|
| 67 |
+
*r-char-sequence*, *h-char-sequence*, or *q-char-sequence*,
|
| 68 |
+
*universal-character-name*s are recognized [[lex.universal.char]]
|
| 69 |
+
and replaced by the designated element of the translation character
|
| 70 |
+
set [[lex.charset]]. The process of dividing a source file’s
|
| 71 |
characters into preprocessing tokens is context-dependent.
|
| 72 |
\[*Example 1*: See the handling of `<` within a `#include`
|
| 73 |
+
preprocessing directive
|
| 74 |
+
[[lex.header]], [[cpp.include]]. — *end example*]
|
| 75 |
+
4. The source file is analyzed as a *preprocessing-file* [[cpp.pre]].
|
| 76 |
+
Preprocessing directives [[cpp]] are executed, macro invocations are
|
| 77 |
+
expanded [[cpp.replace]], and `_Pragma` unary operator expressions
|
| 78 |
+
are executed [[cpp.pragma.op]]. A `#include` preprocessing directive
|
| 79 |
+
[[cpp.include]] causes the named header or source file to be
|
| 80 |
+
processed from phase 1 through phase 4, recursively. All
|
| 81 |
+
preprocessing directives are then deleted. Whitespace characters
|
| 82 |
+
separating preprocessing tokens are no longer significant.
|
| 83 |
+
5. For a sequence of two or more adjacent *string-literal*
|
| 84 |
+
preprocessing tokens, a common *encoding-prefix* is determined as
|
| 85 |
+
specified in [[lex.string]]. Each such *string-literal*
|
| 86 |
+
preprocessing token is then considered to have that common
|
| 87 |
+
*encoding-prefix*.
|
| 88 |
+
6. Adjacent *string-literal* preprocessing tokens are concatenated
|
| 89 |
+
[[lex.string]].
|
| 90 |
+
7. Each preprocessing token is converted into a token [[lex.token]].
|
| 91 |
The resulting tokens constitute a *translation unit* and are
|
| 92 |
+
syntactically and semantically analyzed as a *translation-unit*
|
| 93 |
+
[[basic.link]] and translated.
|
| 94 |
+
\[*Note 3*: The process of analyzing and translating the tokens can
|
| 95 |
occasionally result in one token being replaced by a sequence of
|
| 96 |
+
other tokens [[temp.names]]. — *end note*]
|
| 97 |
+
It is *implementation-defined* whether the sources for module units
|
| 98 |
+
and header units on which the current translation unit has an
|
| 99 |
+
interface dependency [[module.unit]], [[module.import]] are required
|
| 100 |
+
to be available.
|
| 101 |
+
\[*Note 4*: Source files, translation units and translated
|
| 102 |
+
translation units need not necessarily be stored as files, nor need
|
| 103 |
+
there be any one-to-one correspondence between these entities and
|
| 104 |
+
any external representation. The description is conceptual only, and
|
| 105 |
+
does not specify any particular implementation. — *end note*]
|
| 106 |
+
\[*Note 5*: Previously translated translation units can be preserved
|
| 107 |
+
individually or in libraries. The separate translation units of a
|
| 108 |
+
program communicate [[basic.link]] by (for example) calls to
|
| 109 |
+
functions whose names have external or module linkage, manipulation
|
| 110 |
+
of variables whose names have external or module linkage, or
|
| 111 |
+
manipulation of data files. — *end note*]
|
| 112 |
+
While the tokens constituting translation units are being analyzed
|
| 113 |
+
and translated, required instantiations are performed.
|
| 114 |
+
\[*Note 6*: This can include instantiations which have been
|
| 115 |
+
explicitly requested [[temp.explicit]]. — *end note*]
|
| 116 |
+
The contexts from which instantiations may be performed are
|
| 117 |
+
determined by their respective points of instantiation
|
| 118 |
+
[[temp.point]].
|
| 119 |
+
\[*Note 7*: Other requirements in this document can further
|
| 120 |
+
constrain the context from which an instantiation can be performed.
|
| 121 |
+
For example, a constexpr function template specialization might have
|
| 122 |
+
a point of instantiation at the end of a translation unit, but its
|
| 123 |
+
use in certain constant expressions could require that it be
|
| 124 |
+
instantiated at an earlier point [[temp.inst]]. — *end note*]
|
| 125 |
+
Each instantiation results in new program constructs. The program is
|
| 126 |
ill-formed if any instantiation fails.
|
| 127 |
+
During the analysis and translation of tokens, certain expressions
|
| 128 |
+
are evaluated [[expr.const]]. Constructs appearing at a program
|
| 129 |
+
point P are analyzed in a context where each side effect of
|
| 130 |
+
evaluating an expression E as a full-expression is complete if and
|
| 131 |
+
only if
|
| 132 |
+
- E is the expression corresponding to a
|
| 133 |
+
*consteval-block-declaration* [[dcl.pre]], and
|
| 134 |
+
- either that *consteval-block-declaration* or the template
|
| 135 |
+
definition from which it is instantiated is reachable from
|
| 136 |
+
[[module.reach]]
|
| 137 |
+
- P, or
|
| 138 |
+
- the point immediately following the *class-specifier* of the
|
| 139 |
+
outermost class for which P is in a complete-class context
|
| 140 |
+
[[class.mem.general]].
|
| 141 |
+
|
| 142 |
+
\[*Example 2*:
|
| 143 |
+
``` cpp
|
| 144 |
+
class S {
|
| 145 |
+
class Incomplete;
|
| 146 |
+
|
| 147 |
+
class Inner {
|
| 148 |
+
void fn() {
|
| 149 |
+
/* p₁ */ Incomplete i; // OK
|
| 150 |
+
}
|
| 151 |
+
}; /* p₂ */
|
| 152 |
+
|
| 153 |
+
consteval {
|
| 154 |
+
define_aggregate(^^Incomplete, {});
|
| 155 |
+
}
|
| 156 |
+
}; /* p₃ */
|
| 157 |
+
```
|
| 158 |
+
|
| 159 |
+
Constructs at p₁ are analyzed in a context where the side effect of
|
| 160 |
+
the call to `define_aggregate` is evaluated because
|
| 161 |
+
- E is the expression corresponding to a consteval block, and
|
| 162 |
+
- p₁ is in a complete-class context of `S` and the consteval block
|
| 163 |
+
is reachable from p₃.
|
| 164 |
+
|
| 165 |
+
— *end example*]
|
| 166 |
+
8. Translated translation units are combined, and all external entity
|
| 167 |
+
references are resolved. Library components are linked to satisfy
|
| 168 |
+
external references to entities not defined in the current
|
| 169 |
+
translation. All such translator output is collected into a program
|
| 170 |
+
image which contains information needed for execution in its
|
| 171 |
execution environment.
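
Three of the phases above have effects that are visible in ordinary code. The following is a small sketch, assuming a conforming implementation; the identifiers are invented for illustration.

```cpp
#include <cassert>
#include <string_view>

// Phase 2: the backslash and the new-line after it are deleted, so the two
// physical lines below form one logical line declaring `spliced_name`.
constexpr int spliced_\
name = 42;

// Phase 3: each comment is replaced by one space character.
constexpr int sum = 1/**/+/**/2;

// Phase 6: adjacent string literals are concatenated.
constexpr std::string_view joined = "phase " "six";

bool phases_check() {
    assert(joined.size() == 9);
    return spliced_name == 42 && sum == 3;
}
```

The continuation line starts in the first column on purpose: any leading whitespace there would end up inside the spliced line and split the identifier.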
|
| 172 |
|
| 173 |
+
## Characters <a id="lex.char">[[lex.char]]</a>
|
| 174 |
+
|
| 175 |
+
### Character sets <a id="lex.charset">[[lex.charset]]</a>
|
| 176 |
|
| 177 |
The *translation character set* consists of the following elements:
|
| 178 |
|
| 179 |
+
- each abstract character assigned a code point in the Unicode codespace
|
| 180 |
+
as specified in the Unicode Standard, and
|
| 181 |
- a distinct character for each Unicode scalar value not assigned to an
|
| 182 |
abstract character.
|
| 183 |
|
| 184 |
[*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
|
| 185 |
(hexadecimal). A surrogate code point is a value in the range
|
| 186 |
[D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
|
| 187 |
that is not a surrogate code point. — *end note*]
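
The ranges in this note translate directly into predicates. This is a minimal sketch; the function names are hypothetical and do not come from the standard library.

```cpp
#include <cassert>
#include <cstdint>

// Code points are the integers in [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// Surrogate code points are the values in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x41));
static_assert(!is_scalar_value(0xDFFF));

bool scalar_value_check() {
    assert(is_scalar_value(0xD7FF) && is_scalar_value(0xE000));
    return !is_scalar_value(0x110000);
}
```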
|
| 188 |
|
| 189 |
The *basic character set* is a subset of the translation character set,
|
| 190 |
+
consisting of 99 characters as specified in [[lex.charset.basic]].
|
| 191 |
|
| 192 |
[*Note 2*: Unicode short names are given only as a means of identifying
|
| 193 |
the character; the numerical value has no other meaning in this
|
| 194 |
context. — *end note*]
|
| 195 |
|
|
|
|
| 203 |
| `U+0020` | space | |
|
| 204 |
| `U+000a` | line feed | new-line |
|
| 205 |
| `U+0021` | exclamation mark | `!` |
|
| 206 |
| `U+0022` | quotation mark | `"` |
|
| 207 |
| `U+0023` | number sign | `#` |
|
| 208 |
+
| `U+0024` | dollar sign | `$` |
|
| 209 |
| `U+0025` | percent sign | `%` |
|
| 210 |
| `U+0026` | ampersand | `&` |
|
| 211 |
| `U+0027` | apostrophe | `'` |
|
| 212 |
| `U+0028` | left parenthesis | `(` |
|
| 213 |
| `U+0029` | right parenthesis | `)` |
|
|
|
|
| 222 |
| `U+003b` | semicolon | `;` |
|
| 223 |
| `U+003c` | less-than sign | `<` |
|
| 224 |
| `U+003d` | equals sign | `=` |
|
| 225 |
| `U+003e` | greater-than sign | `>` |
|
| 226 |
| `U+003f` | question mark | `?` |
|
| 227 |
+
| `U+0040` | commercial at | `@` |
|
| 228 |
| `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
|
| 229 |
| | | `N O P Q R S T U V W X Y Z` |
|
| 230 |
| `U+005b` | left square bracket | `[` |
|
| 231 |
| `U+005c` | reverse solidus | `\` |
|
| 232 |
| `U+005d` | right square bracket | `]` |
|
| 233 |
| `U+005e` | circumflex accent | `^` |
|
| 234 |
| `U+005f` | low line | `_` |
|
| 235 |
+
| `U+0060` | grave accent | `` ` `` |
|
| 236 |
| `U+0061` .. `U+007a` | latin small letter a .. z | `a b c d e f g h i j k l m` |
|
| 237 |
| | | `n o p q r s t u v w x y z` |
|
| 238 |
| `U+007b` | left curly bracket | `{` |
|
| 239 |
| `U+007c` | vertical line | `|` |
|
| 240 |
| `U+007d` | right curly bracket | `}` |
|
| 241 |
| `U+007e` | tilde | `~` |
|
| 242 |
|
| 243 |
|
| 244 |
The *basic literal character set* consists of all characters of the
|
| 245 |
basic character set, plus the control characters specified in
|
| 246 |
[[lex.charset.literal]].
|
| 247 |
|
| 248 |
**Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
|
|
|
|
| 268 |
A literal encoding or a locale-specific encoding of one of the execution
|
| 269 |
character sets [[character.seq]] encodes each element of the basic
|
| 270 |
literal character set as a single code unit with non-negative value,
|
| 271 |
distinct from the code unit for any other such element.
|
| 272 |
|
| 273 |
+
[*Note 3*: A character not in the basic literal character set can be
|
| 274 |
encoded with more than one code unit; the value of such a code unit can
|
| 275 |
be the same as that of a code unit for an element of the basic literal
|
| 276 |
character set. — *end note*]
|
| 277 |
|
| 278 |
The U+0000 (null) character is encoded as the value `0`. No other
|
| 279 |
element of the translation character set is encoded with a code unit of
|
| 280 |
value `0`. The code unit value of each decimal digit character after the
|
| 281 |
digit `0` (`U+0030`) shall be one greater than the value of the
|
| 282 |
previous. The ordinary and wide literal encodings are otherwise
|
| 283 |
*implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
|
| 284 |
+
implementation shall encode the Unicode scalar value corresponding to
|
| 285 |
+
each character of the translation character set as specified in the
|
| 286 |
+
Unicode Standard for the respective Unicode encoding form.
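
The guarantees in this paragraph are what make the familiar digit-conversion idiom portable. The sketch below uses hypothetical helper names and assumes the argument is a decimal digit character.

```cpp
#include <cassert>

// Each decimal digit after `0` has a code unit value one greater than the
// previous digit, so subtracting the code unit of `0` yields the digit's value.
constexpr int digit_value(char c)         { return c - '0'; }
constexpr int wide_digit_value(wchar_t c) { return static_cast<int>(c - L'0'); }

// U+0000 (null) is encoded as the value 0.
static_assert('\0' == 0);
// In a UTF-32 literal the code unit is the Unicode scalar value itself.
static_assert(U'A' == 0x41);

bool digit_check() {
    assert(wide_digit_value(L'3') == 3);
    return digit_value('0') == 0 && digit_value('9') == 9;
}
```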
|
| 287 |
+
|
| 288 |
+
### Universal character names <a id="lex.universal.char">[[lex.universal.char]]</a>
|
| 289 |
+
|
| 290 |
+
``` bnf
|
| 291 |
+
n-char:
|
| 292 |
+
any member of the translation character set except the U+007d (right curly bracket) or new-line character
|
| 293 |
+
```
|
| 294 |
+
|
| 295 |
+
``` bnf
|
| 296 |
+
n-char-sequence:
|
| 297 |
+
n-char n-char-sequenceₒₚₜ
|
| 298 |
+
```
|
| 299 |
+
|
| 300 |
+
``` bnf
|
| 301 |
+
named-universal-character:
|
| 302 |
+
'\N{' n-char-sequence '}'
|
| 303 |
+
```
|
| 304 |
+
|
| 305 |
+
``` bnf
|
| 306 |
+
hex-quad:
|
| 307 |
+
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
|
| 308 |
+
```
|
| 309 |
+
|
| 310 |
+
``` bnf
|
| 311 |
+
simple-hexadecimal-digit-sequence:
|
| 312 |
+
hexadecimal-digit simple-hexadecimal-digit-sequenceₒₚₜ
|
| 313 |
+
```
|
| 314 |
+
|
| 315 |
+
``` bnf
|
| 316 |
+
universal-character-name:
|
| 317 |
+
'\u' hex-quad
|
| 318 |
+
'\U' hex-quad hex-quad
|
| 319 |
+
'\u{' simple-hexadecimal-digit-sequence '}'
|
| 320 |
+
named-universal-character
|
| 321 |
+
```
|
| 322 |
+
|
| 323 |
+
The *universal-character-name* construct provides a way to name any
|
| 324 |
+
element in the translation character set using just the basic character
|
| 325 |
+
set. If a *universal-character-name* outside the *c-char-sequence*,
|
| 326 |
+
*s-char-sequence*, or *r-char-sequence* of a *character-literal* or
|
| 327 |
+
*string-literal* (in either case, including within a
|
| 328 |
+
*user-defined-literal*) corresponds to a control character or to a
|
| 329 |
+
character in the basic character set, the program is ill-formed.
|
| 330 |
+
|
| 331 |
+
[*Note 1*: A sequence of characters resembling a
|
| 332 |
+
*universal-character-name* in an *r-char-sequence* [[lex.string]] does
|
| 333 |
+
not form a *universal-character-name*. — *end note*]
|
| 334 |
+
|
| 335 |
+
A *universal-character-name* of the form `\u` *hex-quad*, `\U`
|
| 336 |
+
*hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
|
| 337 |
+
designates the character in the translation character set whose Unicode
|
| 338 |
+
scalar value is the hexadecimal number represented by the sequence of
|
| 339 |
+
*hexadecimal-digit*s in the *universal-character-name*. The program is
|
| 340 |
+
ill-formed if that number is not a Unicode scalar value.
|
| 341 |
+
|
| 342 |
+
A *universal-character-name* that is a *named-universal-character*
|
| 343 |
+
designates the corresponding character in the Unicode Standard (chapter
|
| 344 |
+
4.8 Name) if the *n-char-sequence* is equal to its character name or to
|
| 345 |
+
one of its character name aliases of type “control”, “correction”, or
|
| 346 |
+
“alternate”; otherwise, the program is ill-formed.
|
| 347 |
+
|
| 348 |
+
[*Note 2*: These aliases are listed in the Unicode Character Database’s
|
| 349 |
+
`NameAliases.txt`. None of these names or aliases have leading or
|
| 350 |
+
trailing spaces. — *end note*]
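
The `\u` *hex-quad* and `\U` *hex-quad* *hex-quad* forms can be compared as follows. This is a sketch with invented names; the braced and named forms from the grammar above are omitted because they need a compiler that implements them, which is not assumed here.

```cpp
#include <cassert>
#include <string_view>

// Both spellings designate the character whose scalar value is E9 (hexadecimal).
static_assert(U'\u00e9' == 0xE9);
static_assert(U'\U000000e9' == U'\u00e9');

// A scalar value above FFFF needs two code units in a UTF-16 string literal.
constexpr std::u16string_view grin = u"\U0001F600";
static_assert(grin.size() == 2 && grin[0] == 0xD83D && grin[1] == 0xDE00);

bool same_character() {
    assert(std::u32string_view(U"\U0001F600").size() == 1);
    return std::u32string_view(U"caf\u00e9") == U"caf\U000000e9";
}
```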
|
| 351 |
+
|
| 352 |
+
## Comments <a id="lex.comment">[[lex.comment]]</a>
|
| 353 |
+
|
| 354 |
+
The characters `/*` start a comment, which terminates with the
|
| 355 |
+
characters `*/`. These comments do not nest. The characters `//` start a
|
| 356 |
+
comment, which terminates immediately before the next new-line
|
| 357 |
+
character.
|
| 358 |
+
|
| 359 |
+
[*Note 1*: The comment characters `//`, `/*`, and `*/` have no special
|
| 360 |
+
meaning within a `//` comment and are treated just like other
|
| 361 |
+
characters. Similarly, the comment characters `//` and `/*` have no
|
| 362 |
+
special meaning within a `/*` comment. — *end note*]
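
The two statements in this note can be seen in a short sketch. The variable names are hypothetical, and the values follow from which characters the comments swallow.

```cpp
#include <cassert>

// Inside a block comment, `//` does not start a line comment,
// so the `+ 1` after the closing `*/` is still part of the expression.
constexpr int a = 1 /* // still inside the block comment */ + 1;

// Inside a line comment, `/*` does not open a block comment,
// so the comment ends before the new-line and the next line is code.
constexpr int b = 2 // /* ignored: this comment ends at the new-line
    + 3;

bool comments_check() {
    assert(a == 2);
    return b == 5;
}
```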
|
| 363 |
|
| 364 |
## Preprocessing tokens <a id="lex.pptoken">[[lex.pptoken]]</a>
|
| 365 |
|
| 366 |
``` bnf
|
| 367 |
preprocessing-token:
|
|
|
|
| 377 |
user-defined-string-literal
|
| 378 |
preprocessing-op-or-punc
|
| 379 |
each non-whitespace character that cannot be one of the above
|
| 380 |
```
|
| 381 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 382 |
A preprocessing token is the minimal lexical element of the language in
|
| 383 |
translation phases 3 through 6. In this document, glyphs are used to
|
| 384 |
identify elements of the basic character set [[lex.charset]]. The
|
| 385 |
categories of preprocessing token are: header names, placeholder tokens
|
| 386 |
produced by preprocessing `import` and `module` directives
|
| 387 |
(*import-keyword*, *module-keyword*, and *export-keyword*), identifiers,
|
| 388 |
preprocessing numbers, character literals (including user-defined
|
| 389 |
character literals), string literals (including user-defined string
|
| 390 |
literals), preprocessing operators and punctuators, and single
|
| 391 |
non-whitespace characters that do not lexically match the other
|
| 392 |
+
preprocessing token categories. If a U+0027 (apostrophe), a
|
| 393 |
+
U+0022 (quotation mark), or any character not in the basic character set
|
|
|
|
| 394 |
matches the last category, the program is ill-formed. Preprocessing
|
| 395 |
tokens can be separated by whitespace; this consists of comments
|
| 396 |
[[lex.comment]], or whitespace characters (U+0020 (space),
|
| 397 |
U+0009 (character tabulation), new-line, U+000b (line tabulation), and
|
| 398 |
U+000c (form feed)), or both. As described in [[cpp]], in certain
|
|
|
|
| 400 |
thereof) serves as more than preprocessing token separation. Whitespace
|
| 401 |
can appear within a preprocessing token only as part of a header name or
|
| 402 |
between the quotation characters in a character literal or string
|
| 403 |
literal.
|
| 404 |
|
| 405 |
+
Each preprocessing token that is converted to a token [[lex.token]]
|
| 406 |
+
shall have the lexical form of a keyword, an identifier, a literal, or
|
| 407 |
+
an operator or punctuator.
|
| 408 |
+
|
| 409 |
+
The *import-keyword* is produced by processing an `import` directive
|
| 410 |
+
[[cpp.import]], the *module-keyword* is produced by preprocessing a
|
| 411 |
+
`module` directive [[cpp.module]], and the *export-keyword* is produced
|
| 412 |
+
by preprocessing either of the previous two directives.
|
| 413 |
+
|
| 414 |
+
[*Note 1*: None has any observable spelling. — *end note*]
|
| 415 |
+
|
| 416 |
If the input stream has been parsed into preprocessing tokens up to a
|
| 417 |
given character:
|
| 418 |
|
| 419 |
- If the next character begins a sequence of characters that could be
|
| 420 |
the prefix and initial double quote of a raw string literal, such as
|
|
|
|
| 430 |
```
|
| 431 |
- Otherwise, if the next three characters are `<::` and the subsequent
|
| 432 |
character is neither `:` nor `>`, the `<` is treated as a
|
| 433 |
preprocessing token by itself and not as the first character of the
|
| 434 |
alternative token `<:`.
|
| 435 |
+
- Otherwise, if the next three characters are `[::` and the subsequent
|
| 436 |
+
character is not `:`, or if the next three characters are `[:>`, the
|
| 437 |
+
`[` is treated as a preprocessing token by itself and not as the first
|
| 438 |
+
character of the preprocessing token `[:`. \[*Note 2*: The tokens `[:`
|
| 439 |
+
and `:]` cannot be composed from digraphs. — *end note*]
|
| 440 |
- Otherwise, the next preprocessing token is the longest sequence of
|
| 441 |
characters that could constitute a preprocessing token, even if that
|
| 442 |
+
would cause further lexical analysis to fail, except that
|
| 443 |
+
- a *string-literal* token is never formed when a *header-name* token
|
| 444 |
+
can be formed, and
|
| 445 |
+
- a *header-name* [[lex.header]] is only formed
|
| 446 |
+
- immediately after the `include`, `embed`, or `import`
|
| 447 |
+
preprocessing token in a `#include` [[cpp.include]], `#embed`
|
| 448 |
+
[[cpp.embed]], or `import` [[cpp.import]] directive, respectively,
|
| 449 |
+
or
|
| 450 |
+
- immediately after a preprocessing token sequence of
|
| 451 |
+
`__has_include` or `__has_embed` immediately followed by `(` in a
|
| 452 |
+
`#if`, `#elif`, or `#embed` directive [[cpp.cond]], [[cpp.embed]].
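The `<::` rule in the list above is what allows a template argument to begin with the global-scope operator. The following is a minimal sketch of that case; `Item`, `make_items`, and `item_count` are hypothetical names chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Item { int value; };   // hypothetical type declared at global scope

// "<::" is followed by 'I', which is neither ':' nor '>', so '<' is a
// preprocessing token by itself. The declaration reads vector< ::Item >
// rather than the digraph "<:" (that is, '[') followed by ':'.
std::vector<::Item> make_items() {
    return { Item{1}, Item{2}, Item{3} };
}

std::size_t item_count() {
    std::size_t n = make_items().size();
    assert(n == 3);
    return n;
}
```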
|
| 453 |
|
| 454 |
[*Example 1*:
|
| 455 |
|
| 456 |
``` cpp
|
| 457 |
#define R "x"
|
| 458 |
const char* s = R"y"; // ill-formed raw string, not "x" "y"
|
| 459 |
```
|
| 460 |
|
| 461 |
— *end example*]
|
| 462 |
|
| 463 |
[*Example 2*: The program fragment `0xe+foo` is parsed as a
|
| 464 |
preprocessing number token (one that is not a valid *integer-literal* or
|
| 465 |
*floating-point-literal* token), even though a parse as three
|
| 466 |
preprocessing tokens `0xe`, `+`, and `foo` can produce a valid
|
| 467 |
expression (for example, if `foo` is a macro defined as `1`). Similarly,
|
|
|
|
| 472 |
[*Example 3*: The program fragment `x+++++y` is parsed as `x
|
| 473 |
++ ++ + y`, which, if `x` and `y` have integral types, violates a
|
| 474 |
constraint on increment operators, even though the parse `x ++ + ++ y`
|
| 475 |
can yield a correct expression. — *end example*]
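The longest-sequence rule can also be observed on a shorter fragment that remains well-formed. The sketch below is illustrative only; `Result`, `greedy`, `spaced`, and `max_munch_demo` are hypothetical names. It contrasts `x+++y`, which is split as `x ++ + y`, with a spelling where whitespace forces a different split.

```cpp
#include <cassert>

struct Result { int sum; int x_after; int y_after; };

// "x+++y" is tokenized greedily as x ++ + y, that is, (x++) + y.
Result greedy(int x, int y) {
    int sum = x+++y;
    return Result{sum, x, y};
}

// Whitespace separates the tokens differently: x + (++y).
Result spaced(int x, int y) {
    int sum = x+ ++y;
    return Result{sum, x, y};
}

void max_munch_demo() {
    Result g = greedy(1, 2);   // x is incremented after its value is used
    assert(g.sum == 3 && g.x_after == 2 && g.y_after == 2);
    Result s = spaced(1, 2);   // y is incremented before its value is used
    assert(s.sum == 4 && s.x_after == 1 && s.y_after == 3);
}
```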
|
| 476 |
|
| 477 |
## Header names <a id="lex.header">[[lex.header]]</a>
|
| 478 |
|
| 479 |
``` bnf
|
| 480 |
header-name:
|
| 481 |
'<' h-char-sequence '>'
|
| 482 |
'"' q-char-sequence '"'
|
| 483 |
```
|
| 484 |
|
| 485 |
``` bnf
|
| 486 |
h-char-sequence:
|
| 487 |
+
h-char h-char-sequenceₒₚₜ
|
|
|
|
| 488 |
```
|
| 489 |
|
| 490 |
``` bnf
|
| 491 |
h-char:
|
| 492 |
any member of the translation character set except new-line and U+003e (greater-than sign)
|
| 493 |
```
|
| 494 |
|
| 495 |
``` bnf
|
| 496 |
q-char-sequence:
|
| 497 |
+
q-char q-char-sequenceₒₚₜ
|
|
|
|
| 498 |
```
|
| 499 |
|
| 500 |
``` bnf
|
| 501 |
q-char:
|
| 502 |
any member of the translation character set except new-line and U+0022 (quotation mark)
|
| 503 |
```
|
| 504 |
| 505 |
The sequences in both forms of *header-name*s are mapped in an
|
| 506 |
*implementation-defined* manner to headers or to external source file
|
| 507 |
names as specified in [[cpp.include]].
|
| 508 |
|
| 509 |
+
[*Note 1*: Header name preprocessing tokens appear only within a
|
| 510 |
+
`#include` preprocessing directive, a `__has_include` preprocessing
|
| 511 |
+
expression, or after certain occurrences of an `import` token (see
|
| 512 |
+
[[lex.pptoken]]). — *end note*]
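As a rough illustration of the note above, the sketch below uses both forms of *header-name* inside `__has_include` expressions. The macro names, the function names, and the path `lex_demo/no_such_header.h` are hypothetical; the path is assumed not to exist on the include search path, and the outer guard is there because `__has_include` may be unavailable on older toolchains.

```cpp
#include <cassert>

#if defined(__has_include)
#  if __has_include(<vector>)                       // h-char-sequence form
#    define LEX_DEMO_HAS_VECTOR 1
#  endif
#  if __has_include("lex_demo/no_such_header.h")    // q-char-sequence form
#    define LEX_DEMO_HAS_BOGUS 1
#  endif
#endif

#ifndef LEX_DEMO_HAS_VECTOR
#  define LEX_DEMO_HAS_VECTOR 0
#endif
#ifndef LEX_DEMO_HAS_BOGUS
#  define LEX_DEMO_HAS_BOGUS 0
#endif

// Report what the preprocessor found when this file was translated.
constexpr bool has_vector_header() { return LEX_DEMO_HAS_VECTOR != 0; }
constexpr bool has_bogus_header()  { return LEX_DEMO_HAS_BOGUS != 0; }
```

Outside these contexts, `<vector>` would be lexed as the tokens `<`, `vector`, and `>` rather than as a single header name.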
|
| 513 |
+
|
| 514 |
The appearance of either of the characters `'` or `\` or of either of
|
| 515 |
the character sequences `/*` or `//` in a *q-char-sequence* or an
|
| 516 |
*h-char-sequence* is conditionally-supported with
|
| 517 |
*implementation-defined* semantics, as is the appearance of the
|
| 518 |
+
character `"` in an *h-char-sequence*.
|
| 519 |
+
|
| 520 |
+
[*Note 2*: Thus, a sequence of characters that resembles an escape
|
| 521 |
+
sequence can result in an error, be interpreted as the character
|
| 522 |
+
corresponding to the escape sequence, or have a completely different
|
| 523 |
+
meaning, depending on the implementation. — *end note*]
|
| 524 |
|
| 525 |
## Preprocessing numbers <a id="lex.ppnumber">[[lex.ppnumber]]</a>
|
| 526 |
|
| 527 |
``` bnf
|
| 528 |
pp-number:
|
|
|
|
| 544 |
|
| 545 |
A preprocessing number does not have a type or a value; it acquires both
|
| 546 |
after a successful conversion to an *integer-literal* token or a
|
| 547 |
*floating-point-literal* token.
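A short sketch may help show that a preprocessing number is only a spelling until it is converted. The macro and variable names below (`LEX_CAT`, `LEX_STR`, `pasted`, `spaced`, `never_converted`) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstring>

#define LEX_CAT_IMPL(a, b) a##b
#define LEX_CAT(a, b) LEX_CAT_IMPL(a, b)
#define LEX_STR(x) #x

// The pp-numbers 12 and 34 are pasted into the single pp-number 1234,
// which acquires a type and a value only when converted to a token.
constexpr int pasted = LEX_CAT(12, 34);

// With whitespace, "0xe + 1" is three preprocessing tokens. Written without
// spaces, "0xe+1" would be one pp-number that is not a valid literal.
constexpr int spaced = 0xe + 1;

// A pp-number that could never become a literal is harmless as long as it
// is not converted to a token; here it is only stringized.
const char* const never_converted = LEX_STR(0xe+foo);

void pp_number_demo() {
    assert(pasted == 1234);
    assert(spaced == 15);
    assert(std::strcmp(never_converted, "0xe+foo") == 0);
}
```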
|
| 548 |
|
| 549 |
+
## Operators and punctuators <a id="lex.operators">[[lex.operators]]</a>
|
| 550 |
+
|
| 551 |
+
The lexical representation of C++ programs includes a number of
|
| 552 |
+
preprocessing tokens that are used in the syntax of the preprocessor or
|
| 553 |
+
are converted into tokens for operators and punctuators:
|
| 554 |
+
|
| 555 |
+
``` bnf
|
| 556 |
+
preprocessing-op-or-punc:
|
| 557 |
+
preprocessing-operator
|
| 558 |
+
operator-or-punctuator
|
| 559 |
+
```
|
| 560 |
+
|
| 561 |
+
``` bnf
|
| 562 |
+
%% Ed. note: character protrusion would misalign various operators.
|
| 563 |
+
|
| 564 |
+
preprocessing-operator: one of
|
| 565 |
+
'# ## %: %:%:'
|
| 566 |
+
```
|
| 567 |
+
|
| 568 |
+
``` bnf
|
| 569 |
+
operator-or-punctuator: one of
|
| 570 |
+
'{ } [ ] ( ) [: :]'
|
| 571 |
+
'<% %> <: :> ; : ...'
|
| 572 |
+
'? :: . .* -> ->* ^^ ~'
|
| 573 |
+
'! + - * / % ^ & |'
|
| 574 |
+
'= += -= *= /= %= ^= &= |='
|
| 575 |
+
'== != < > <= >= <=> && ||'
|
| 576 |
+
'<< >> <<= >>= ++ -- ,'
|
| 577 |
+
'and or xor not bitand bitor compl'
|
| 578 |
+
'and_eq or_eq xor_eq not_eq'
|
| 579 |
+
```
|
| 580 |
+
|
| 581 |
+
Each *operator-or-punctuator* is converted to a single token in
|
| 582 |
+
translation phase 7 [[lex.phases]].
|
| 583 |
+
|
| 584 |
+
## Alternative tokens <a id="lex.digraph">[[lex.digraph]]</a>
|
| 585 |
+
|
| 586 |
+
Alternative token representations are provided for some operators and
|
| 587 |
+
punctuators.[^4]
|
| 588 |
+
|
| 589 |
+
In all respects of the language, each alternative token behaves the
|
| 590 |
+
same, respectively, as its primary token, except for its spelling.[^5]
|
| 591 |
+
|
| 592 |
+
The set of alternative tokens is defined in [[lex.digraph]].
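To illustrate that alternative tokens differ from primary tokens only in spelling, the sketch below writes the same function twice. The function names are hypothetical, and the example assumes a compiler that accepts alternative tokens in its default mode (some compilers need a conformance switch for the keyword-like spellings).

```cpp
#include <cassert>

// Spelled with alternative tokens: <% %> for { }, <: :> for [ ],
// and not / and / not_eq / bitor for ! && != |.
int alt_sum(const int (&values)<:3:>) <%
    int total = 0;
    for (int i = 0; i < 3; ++i) <%
        if (not (values<:i:> < 0) and values<:i:> not_eq 0) <%
            total += values<:i:> bitor 1;
        %>
    %>
    return total;
%>

// The same function spelled with primary tokens.
int primary_sum(const int (&values)[3]) {
    int total = 0;
    for (int i = 0; i < 3; ++i) {
        if (!(values[i] < 0) && values[i] != 0) {
            total += values[i] | 1;
        }
    }
    return total;
}

void alt_token_demo() {
    const int sample[3] = {4, 6, 1};
    assert(alt_sum(sample) == primary_sum(sample));
}
```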
|
| 593 |
+
|
| 594 |
+
## Tokens <a id="lex.token">[[lex.token]]</a>
|
| 595 |
+
|
| 596 |
+
``` bnf
|
| 597 |
+
token:
|
| 598 |
+
identifier
|
| 599 |
+
keyword
|
| 600 |
+
literal
|
| 601 |
+
operator-or-punctuator
|
| 602 |
+
```
|
| 603 |
+
|
| 604 |
+
There are five kinds of tokens: identifiers, keywords, literals,[^6]
|
| 606 |
+
operators, and other separators. Comments and the characters
|
| 607 |
+
U+0020 (space), U+0009 (character tabulation), U+000b (line tabulation),
|
| 608 |
+
U+000c (form feed), and new-line (collectively, “whitespace”), as
|
| 609 |
+
described below, are ignored except as they serve to separate tokens.
|
| 610 |
+
|
| 611 |
+
[*Note 1*: Whitespace can separate otherwise adjacent identifiers,
|
| 612 |
+
keywords, numeric literals, and alternative tokens containing alphabetic
|
| 613 |
+
characters. — *end note*]
|
| 614 |
+
|
| 615 |
## Identifiers <a id="lex.name">[[lex.name]]</a>
|
| 616 |
|
| 617 |
``` bnf
|
| 618 |
identifier:
|
| 619 |
identifier-start
|
|
|
|
| 646 |
'0 1 2 3 4 5 6 7 8 9'
|
| 647 |
```
|
| 648 |
|
| 649 |
[*Note 1*:
|
| 650 |
|
| 651 |
+
The character properties XID_Start and XID_Continue are described by UAX
|
| 652 |
+
\#44 of the Unicode Standard.[^7]
|
| 653 |
|
| 654 |
— *end note*]
|
| 655 |
|
| 656 |
The program is ill-formed if an *identifier* does not conform to
|
| 657 |
Normalization Form C as specified in the Unicode Standard.
|
| 658 |
|
| 659 |
[*Note 2*: Identifiers are case-sensitive. — *end note*]
|
| 660 |
|
| 661 |
+
[*Note 3*: [[uaxid]] compares the requirements of UAX \#31 of the
|
| 662 |
+
Unicode Standard with the C++ rules for identifiers. — *end note*]
|
| 663 |
+
|
| 664 |
+
[*Note 4*: In translation phase 4, *identifier* also includes those
|
| 665 |
*preprocessing-token*s [[lex.pptoken]] differentiated as keywords
|
| 666 |
[[lex.key]] in the later translation phase 7
|
| 667 |
[[lex.token]]. — *end note*]
|
| 668 |
|
| 669 |
The identifiers in [[lex.name.special]] have a special meaning when
|
|
|
|
| 676 |
In addition, some identifiers appearing as a *token* or
|
| 677 |
*preprocessing-token* are reserved for use by C++ implementations and
|
| 678 |
shall not be used otherwise; no diagnostic is required.
|
| 679 |
|
| 680 |
- Each identifier that contains a double underscore `__` or begins with
|
| 681 |
+
an underscore followed by an uppercase letter, other than those
|
| 682 |
+
specified in this document (for example, `__cplusplus`
|
| 683 |
+
[[cpp.predefined]]), is reserved to the implementation for any use.
|
| 684 |
- Each identifier that begins with an underscore is reserved to the
|
| 685 |
implementation for use as a name in the global namespace.
|
| 686 |
|
| 687 |
## Keywords <a id="lex.key">[[lex.key]]</a>
|
| 688 |
|
|
|
|
| 710 |
| | | | | | |
|
| 711 |
| -------- | -------- | -------- | ------- | -------- | ----- |
|
| 712 |
| `and` | `and_eq` | `bitand` | `bitor` | `compl` | `not` |
|
| 713 |
| `not_eq` | `or` | `or_eq` | `xor` | `xor_eq` | |
|
| 714 |
|
| 715 |
## Literals <a id="lex.literal">[[lex.literal]]</a>
|
| 716 |
|
| 717 |
### Kinds of literals <a id="lex.literal.kinds">[[lex.literal.kinds]]</a>
|
| 718 |
|
| 719 |
There are several kinds of literals.[^8]
|
|
|
|
| 829 |
'z Z'
|
| 830 |
```
|
| 831 |
|
| 832 |
In an *integer-literal*, the sequence of *binary-digit*s,
|
| 833 |
*octal-digit*s, *digit*s, or *hexadecimal-digit*s is interpreted as a
|
| 834 |
+
base N integer as shown in [[lex.icon.base]]; the lexically first digit
|
| 835 |
+
of the sequence of digits is the most significant.
|
| 836 |
|
| 837 |
[*Note 1*: The prefix and any optional separating single quotes are
|
| 838 |
ignored when determining the value. — *end note*]
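As a small worked example of the rule and note above, the sketch below spells the same value in all four bases with digit separators. The variable names are hypothetical, and the example assumes a compiler supporting binary literals and digit separators (C++14 or later).

```cpp
#include <cassert>

// One million, written in each base. The prefix selects the base and the
// single quotes are ignored when the value is determined.
constexpr int from_decimal = 1'000'000;
constexpr int from_hex     = 0xF'4240;
constexpr int from_octal   = 03'641'100;
constexpr int from_binary  = 0b1111'0100'0010'0100'0000;

static_assert(from_decimal == 1000000, "separators do not change the value");
static_assert(from_hex == from_decimal, "hexadecimal spelling");
static_assert(from_octal == from_decimal, "octal spelling");
static_assert(from_binary == from_decimal, "binary spelling");

void integer_literal_demo() {
    assert(from_hex == 1000000);
}
```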
|
| 839 |
|
| 840 |
**Table: Base of *integer-literal*s** <a id="lex.icon.base">[lex.icon.base]</a>
|
|
|
|
| 887 |
| | | `std::size_t` |
|
| 888 |
| Both `u` or `U` | `std::size_t` | `std::size_t` |
|
| 889 |
| and `z` or `Z` | | |
|
| 890 |
|
| 891 |
|
| 892 |
+
Except for *integer-literal*s containing a *size-suffix*, if the value
|
| 893 |
+
of an *integer-literal* cannot be represented by any type in its list
|
| 894 |
and an extended integer type [[basic.fundamental]] can represent its
|
| 895 |
value, it may have that extended integer type. If all of the types in
|
| 896 |
the list for the *integer-literal* are signed, the extended integer type
|
| 897 |
+
is signed. If all of the types in the list for the *integer-literal* are
|
| 898 |
+
unsigned, the extended integer type is unsigned. If the list contains
|
| 899 |
+
both signed and unsigned types, the extended integer type may be signed
|
| 900 |
+
or unsigned. If an *integer-literal* cannot be represented by any of the
|
| 901 |
+
allowed types, the program is ill-formed.
|
| 902 |
+
|
| 903 |
+
[*Note 2*: An *integer-literal* with a `z` or `Z` suffix is ill-formed
|
| 904 |
+
if it cannot be represented by `std::size_t`. — *end note*]
|
| 905 |
|
| 906 |
### Character literals <a id="lex.ccon">[[lex.ccon]]</a>
|
| 907 |
|
| 908 |
``` bnf
|
| 909 |
character-literal:
|
|
|
|
| 915 |
'u8' 'u' 'U' 'L'
|
| 916 |
```
|
| 917 |
|
| 918 |
``` bnf
|
| 919 |
c-char-sequence:
|
| 920 |
+
c-char c-char-sequenceₒₚₜ
|
|
|
|
| 921 |
```
|
| 922 |
|
| 923 |
``` bnf
|
| 924 |
c-char:
|
| 925 |
basic-c-char
|
|
|
|
| 956 |
hexadecimal-escape-sequence
|
| 957 |
```
|
| 958 |
|
| 959 |
``` bnf
|
| 960 |
simple-octal-digit-sequence:
|
| 961 |
+
octal-digit simple-octal-digit-sequenceₒₚₜ
|
|
|
|
| 962 |
```
|
| 963 |
|
| 964 |
``` bnf
|
| 965 |
octal-escape-sequence:
|
| 966 |
'\' octal-digit
|
|
|
|
| 983 |
``` bnf
|
| 984 |
conditional-escape-sequence-char:
|
| 985 |
any member of the basic character set that is not an octal-digit, a simple-escape-sequence-char, or the characters 'N', 'o', 'u', 'U', or 'x'
|
| 986 |
```
|
| 987 |
|
| 988 |
+
A *multicharacter literal* is a *character-literal* whose
|
| 989 |
+
*c-char-sequence* consists of more than one *c-char*. A multicharacter
|
| 990 |
+
literal shall not have an *encoding-prefix*. If a multicharacter literal
|
| 991 |
+
contains a *c-char* that is not encodable as a single code unit in the
|
| 992 |
+
ordinary literal encoding, the program is ill-formed. Multicharacter
|
| 993 |
+
literals are conditionally-supported.
|
| 994 |
|
| 995 |
The kind of a *character-literal*, its type, and its associated
|
| 996 |
character encoding [[lex.charset]] are determined by its
|
| 997 |
*encoding-prefix* and its *c-char-sequence* as defined by
|
| 998 |
+
[[lex.ccon.literal]].
|
| 999 |
|
| 1000 |
**Table: Character literals** <a id="lex.ccon.literal">[lex.ccon.literal]</a>
|
| 1001 |
|
| 1002 |
+
| Encoding prefix | Kind                       | Type       | Associated character encoding   | Example |
|
| 1003 |
+
| --------------- | -------------------------- | ---------- | ------------------------------- | ------- |
|
| 1004 |
+
| none            | ordinary character literal | `char`     | ordinary literal encoding       | `'v'`   |
|
| 1005 |
| `L`             | wide character literal     | `wchar_t`  | wide literal encoding           | `L'w'`  |
|
| 1007 |
| `u8` | UTF-8 character literal | `char8_t` | UTF-8 | `u8'x'` |
|
| 1008 |
| `u` | UTF-16 character literal | `char16_t` | UTF-16 | `u'y'` |
|
| 1009 |
| `U` | UTF-32 character literal | `char32_t` | UTF-32 | `U'z'` |
|
| 1010 |
|
| 1011 |
|
| 1012 |
In translation phase 4, the value of a *character-literal* is determined
|
| 1013 |
using the range of representable values of the *character-literal*’s
|
| 1014 |
+
type in translation phase 7. A multicharacter literal has an
|
| 1015 |
+
*implementation-defined* value. The value of any other kind of
|
| 1016 |
+
*character-literal* is determined as follows:
|
| 1017 |
|
| 1018 |
- A *character-literal* with a *c-char-sequence* consisting of a single
|
| 1019 |
*basic-c-char*, *simple-escape-sequence*, or
|
| 1020 |
*universal-character-name* is the code unit value of the specified
|
| 1021 |
character as encoded in the literal’s associated character encoding.
|
| 1022 |
+
If the specified character lacks representation in the literal’s
|
| 1023 |
+
associated character encoding or if it cannot be encoded as a single
|
| 1024 |
+
code unit, then the program is ill-formed.
|
|
|
|
| 1025 |
- A *character-literal* with a *c-char-sequence* consisting of a single
|
| 1026 |
*numeric-escape-sequence* has a value as follows:
|
| 1027 |
- Let v be the integer value represented by the octal number
|
| 1028 |
comprising the sequence of *octal-digit*s in an
|
| 1029 |
*octal-escape-sequence* or by the hexadecimal number comprising the
|
|
|
|
| 1034 |
or `L`, and v does not exceed the range of representable values of
|
| 1035 |
the corresponding unsigned type for the underlying type of the
|
| 1036 |
*character-literal*’s type, then the value is the unique value of
|
| 1037 |
the *character-literal*’s type `T` that is congruent to v modulo 2ᴺ,
|
| 1038 |
where N is the width of `T`.
|
| 1039 |
+
- Otherwise, the program is ill-formed.
|
| 1040 |
- A *character-literal* with a *c-char-sequence* consisting of a single
|
| 1041 |
*conditional-escape-sequence* is conditionally-supported and has an
|
| 1042 |
*implementation-defined* value.
|
| 1043 |
|
| 1044 |
The character specified by a *simple-escape-sequence* is specified in
|
| 1045 |
[[lex.ccon.esc]].
|
| 1046 |
|
| 1047 |
+
[*Note 1*: Using an escape sequence for a question mark is supported
|
| 1048 |
+
for compatibility with C++14 and C. — *end note*]
|
| 1049 |
|
| 1050 |
**Table: Simple escape sequences** <a id="lex.ccon.esc">[lex.ccon.esc]</a>
|
| 1051 |
|
| 1052 |
| character | | *simple-escape-sequence* |
|
| 1053 |
| --------- | -------------------- | ------------------------ |
|
|
|
|
| 1184 |
encoding-prefixₒₚₜ 'R' raw-string
|
| 1185 |
```
|
| 1186 |
|
| 1187 |
``` bnf
|
| 1188 |
s-char-sequence:
|
| 1189 |
+
s-char s-char-sequenceₒₚₜ
|
|
|
|
| 1190 |
```
|
| 1191 |
|
| 1192 |
``` bnf
|
| 1193 |
s-char:
|
| 1194 |
basic-s-char
|
|
|
|
| 1207 |
'"' d-char-sequenceₒₚₜ '(' r-char-sequenceₒₚₜ ')' d-char-sequenceₒₚₜ '"'
|
| 1208 |
```
|
| 1209 |
|
| 1210 |
``` bnf
|
| 1211 |
r-char-sequence:
|
| 1212 |
+
r-char r-char-sequenceₒₚₜ
|
|
|
|
| 1213 |
```
|
| 1214 |
|
| 1215 |
``` bnf
|
| 1216 |
r-char:
|
| 1217 |
any member of the translation character set, except a U+0029 (right parenthesis) followed by
|
| 1218 |
the initial *d-char-sequence* (which may be empty) followed by a U+0022 (quotation mark)
|
| 1219 |
```
|
| 1220 |
|
| 1221 |
``` bnf
|
| 1222 |
d-char-sequence:
|
| 1223 |
+
d-char d-char-sequenceₒₚₜ
|
|
|
|
| 1224 |
```
|
| 1225 |
|
| 1226 |
``` bnf
|
| 1227 |
d-char:
|
| 1228 |
any member of the basic character set except:
|
|
|
|
| 1231 |
```
|
| 1232 |
|
| 1233 |
The kind of a *string-literal*, its type, and its associated character
|
| 1234 |
encoding [[lex.charset]] are determined by its encoding prefix and
|
| 1235 |
sequence of *s-char*s or *r-char*s as defined by [[lex.string.literal]]
|
| 1236 |
+
where n is the number of encoded code units that would result from an
|
| 1237 |
+
evaluation of the *string-literal* (see below).
|
| 1238 |
|
| 1239 |
**Table: String literals** <a id="lex.string.literal">[lex.string.literal]</a>
|
| 1240 |
|
| 1241 |
+
| Encoding prefix   | Kind                    | Type                          | Associated character encoding | Examples                                       |
|
| 1242 |
+
| ----------------- | ----------------------- | ----------------------------- | ----------------------------- | ---------------------------------------------- |
|
| 1243 |
| none | ordinary string literal | array of $n$ `const char` | ordinary literal encoding | `"ordinary string"` `R"(ordinary raw string)"` |
|
| 1244 |
| `L` | wide string literal | array of $n$ `const wchar_t` | wide literal encoding | `L"wide string"` `LR"w(wide raw string)w"` |
|
| 1245 |
| `u8` | UTF-8 string literal | array of $n$ `const char8_t` | UTF-8 | `u8"UTF-8 string"` `u8R"x(UTF-8 raw string)x"` |
|
| 1246 |
| `u` | UTF-16 string literal | array of $n$ `const char16_t` | UTF-16 | `u"UTF-16 string"` `uR"y(UTF-16 raw string)y"` |
|
| 1247 |
| `U` | UTF-32 string literal | array of $n$ `const char32_t` | UTF-32 | `U"UTF-32 string"` `UR"z(UTF-32 raw string)z"` |
|
|
|
|
| 1251 |
literal*. The *d-char-sequence* serves as a delimiter. The terminating
|
| 1252 |
*d-char-sequence* of a *raw-string* is the same sequence of characters
|
| 1253 |
as the initial *d-char-sequence*. A *d-char-sequence* shall consist of
|
| 1254 |
at most 16 characters.
|
| 1255 |
|
| 1256 |
+
[*Note 1*: The characters `'('` and `')'` can appear in a *raw-string*.
|
| 1257 |
+
Thus, `R"delimiter((a|b))delimiter"` is equivalent to
|
| 1258 |
`"(a|b)"`. — *end note*]
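The equivalence stated in the note can be checked directly. The sketch below is illustrative; the variable names are hypothetical, and the second pair shows that backslashes and quotation marks inside a raw string are taken literally.

```cpp
#include <cassert>
#include <string>

// The delimiter "delimiter" brackets the raw content (a|b).
const std::string raw_form      = R"delimiter((a|b))delimiter";
const std::string ordinary_form = "(a|b)";

// No escape processing happens inside a raw string, so the backslashes and
// the inner quotation marks below are ordinary characters.
const std::string raw_path     = R"(C:\temp\"x")";
const std::string escaped_path = "C:\\temp\\\"x\"";

void raw_string_demo() {
    assert(raw_form == ordinary_form);
    assert(raw_path == escaped_path);
}
```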
|
| 1259 |
|
| 1260 |
[*Note 2*:
|
| 1261 |
|
| 1262 |
A source-file new-line in a raw string literal results in a new-line in
|
|
|
|
| 1292 |
is equivalent to `"x = \"\\\"y\\\"\""`.
|
| 1293 |
|
| 1294 |
— *end example*]
|
| 1295 |
|
| 1296 |
Ordinary string literals and UTF-8 string literals are also referred to
|
| 1297 |
+
as *narrow string literals*.
|
| 1298 |
|
| 1299 |
+
The *string-literal*s in any sequence of adjacent *string-literal*s
|
| 1300 |
+
shall have at most one unique *encoding-prefix* among them. The common
|
| 1301 |
+
*encoding-prefix* of the sequence is that *encoding-prefix*, if any.
|
| 1302 |
|
| 1303 |
[*Note 3*: A *string-literal*’s rawness has no effect on the
|
| 1304 |
determination of the common *encoding-prefix*. — *end note*]
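A brief sketch of the common *encoding-prefix* rule follows; the variable names are hypothetical. The second declaration mixes a non-raw and a raw piece to show that rawness affects only how the characters of each piece are read, not the prefix.

```cpp
#include <cassert>
#include <string>

// One piece carries the u prefix, so the concatenation is a UTF-16 string.
const std::u16string joined = u"a" "b";

// The second piece is raw: its backslash and 'n' are two ordinary
// characters. The common prefix is still L.
const std::wstring mixed = L"a" R"(b\n)";

void concat_demo() {
    assert(joined == u"ab");
    assert(mixed == L"ab\\n");
    assert(mixed.size() == 4);
}
```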
|
| 1305 |
|
| 1306 |
In translation phase 6 [[lex.phases]], adjacent *string-literal*s are
|
|
|
|
| 1337 |
| `u"a"` | `"b"` | `u"ab"` | `U"a"` | `"b"` | `U"ab"` | `L"a"` | `"b"` | `L"ab"` |
|
| 1338 |
| `"a"` | `u"b"` | `u"ab"` | `"a"` | `U"b"` | `U"ab"` | `"a"` | `L"b"` | `L"ab"` |
|
| 1339 |
|
| 1340 |
|
| 1341 |
Evaluating a *string-literal* results in a string literal object with
|
| 1342 |
+
static storage duration [[basic.stc]].
|
| 1343 |
|
| 1344 |
+
[*Note 4*: String literal objects are potentially non-unique
|
| 1345 |
+
[[intro.object]]. Whether successive evaluations of a *string-literal*
|
| 1346 |
+
yield the same or a different object is unspecified. — *end note*]
|
| 1347 |
+
|
| 1348 |
+
[*Note 5*: The effect of attempting to modify a string literal object
|
| 1349 |
is undefined. — *end note*]
|
| 1350 |
|
| 1351 |
String literal objects are initialized with the sequence of code unit
|
| 1352 |
values corresponding to the *string-literal*’s sequence of *s-char*s
|
| 1353 |
(originally from non-raw string literals) and *r-char*s (originally from
|
|
|
|
| 1357 |
- The sequence of characters denoted by each contiguous sequence of
|
| 1358 |
*basic-s-char*s, *r-char*s, *simple-escape-sequence*s [[lex.ccon]],
|
| 1359 |
and *universal-character-name*s [[lex.charset]] is encoded to a code
|
| 1360 |
unit sequence using the *string-literal*’s associated character
|
| 1361 |
encoding. If a character lacks representation in the associated
|
| 1362 |
+
character encoding, then the program is ill-formed. \[*Note 6*: No
|
| 1363 |
+
character lacks representation in any Unicode encoding
|
| 1364 |
+
form. — *end note*] When encoding a stateful character encoding,
|
| 1365 |
+
implementations should encode the first such sequence beginning with
|
| 1366 |
+
the initial encoding state and encode subsequent sequences beginning
|
| 1367 |
+
with the final encoding state of the prior sequence. \[*Note 7*: The
|
| 1368 |
+
encoded code unit sequence can differ from the sequence of code units
|
| 1369 |
+
that would be obtained by encoding each character
|
| 1370 |
+
independently. — *end note*]
|
|
|
|
| 1371 |
- Each *numeric-escape-sequence* [[lex.ccon]] contributes a single code
|
| 1372 |
unit with a value as follows:
|
| 1373 |
- Let v be the integer value represented by the octal number
|
| 1374 |
comprising the sequence of *octal-digit*s in an
|
| 1375 |
*octal-escape-sequence* or by the hexadecimal number comprising the
|
|
|
|
| 1380 |
`L`, and v does not exceed the range of representable values of the
|
| 1381 |
corresponding unsigned type for the underlying type of the
|
| 1382 |
*string-literal*’s array element type, then the value is the unique
|
| 1383 |
value of the *string-literal*’s array element type `T` that is
|
| 1384 |
congruent to v modulo 2ᴺ, where N is the width of `T`.
|
| 1385 |
+
- Otherwise, the program is ill-formed.
|
| 1386 |
|
| 1387 |
When encoding a stateful character encoding, these sequences should
|
| 1388 |
have no effect on encoding state.
|
| 1389 |
- Each *conditional-escape-sequence* [[lex.ccon]] contributes an
|
| 1390 |
*implementation-defined* code unit sequence. When encoding a stateful
|
| 1391 |
character encoding, it is *implementation-defined* what effect these
|
| 1392 |
sequences have on encoding state.
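The counting of code units described above can be observed through `sizeof`. The following is a hedged sketch, with `first_unit` and `string_object_demo` as hypothetical names; it relies on the terminating null code unit that every string literal object carries.

```cpp
#include <cassert>

// Three code units plus the terminating null code unit.
static_assert(sizeof("abc") == 4, "ordinary string literal");
static_assert(sizeof(u"abc") == 4 * sizeof(char16_t), "UTF-16 string literal");

// Each numeric escape sequence contributes exactly one code unit.
static_assert(sizeof("\x41\102") == 3, "two escapes plus the null");

// One character can encode to several code units: U+1F600 needs a
// surrogate pair in UTF-16.
static_assert(sizeof(u"\U0001F600") == 3 * sizeof(char16_t), "pair plus null");

// Read the first code unit as an unsigned value.
constexpr unsigned char first_unit(const char* s) {
    return static_cast<unsigned char>(s[0]);
}

void string_object_demo() {
    assert(first_unit("\xff") == 255);
    assert(first_unit("\101") == 65);
}
```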
|
| 1393 |
|
| 1394 |
+
### Unevaluated strings <a id="lex.string.uneval">[[lex.string.uneval]]</a>
|
| 1395 |
+
|
| 1396 |
+
``` bnf
|
| 1397 |
+
unevaluated-string:
|
| 1398 |
+
string-literal
|
| 1399 |
+
```
|
| 1400 |
+
|
| 1401 |
+
An *unevaluated-string* shall have no *encoding-prefix*.
|
| 1402 |
+
|
| 1403 |
+
Each *universal-character-name* and each *simple-escape-sequence* in an
|
| 1404 |
+
*unevaluated-string* is replaced by the member of the translation
|
| 1405 |
+
character set it denotes. An *unevaluated-string* that contains a
|
| 1406 |
+
*numeric-escape-sequence* or a *conditional-escape-sequence* is
|
| 1407 |
+
ill-formed.
|
| 1408 |
+
|
| 1409 |
+
An *unevaluated-string* is never evaluated and its interpretation
|
| 1410 |
+
depends on the context in which it appears.
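Two familiar contexts whose string operand is never evaluated are the message of a `static_assert` and the string in a linkage specification. The sketch below assumes these are the contexts meant by the grammar above; `lex_demo_add` is a hypothetical function name.

```cpp
#include <cassert>

// The message has no encoding-prefix and never becomes a string literal
// object; it is only used for diagnostics.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

// The string in a linkage specification is interpreted by the
// implementation, not evaluated as an expression.
extern "C" int lex_demo_add(int a, int b) { return a + b; }

void unevaluated_string_demo() {
    assert(lex_demo_add(2, 3) == 5);
}
```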
|
| 1411 |
+
|
| 1412 |
### Boolean literals <a id="lex.bool">[[lex.bool]]</a>
|
| 1413 |
|
| 1414 |
``` bnf
|
| 1415 |
boolean-literal:
|
| 1416 |
+
false
|
| 1417 |
+
true
|
| 1418 |
```
|
| 1419 |
|
| 1420 |
The Boolean literals are the keywords `false` and `true`. Such literals
|
| 1421 |
have type `bool`.
|
| 1422 |
|
| 1423 |
### Pointer literals <a id="lex.nullptr">[[lex.nullptr]]</a>
|
| 1424 |
|
| 1425 |
``` bnf
|
| 1426 |
pointer-literal:
|
| 1427 |
+
nullptr
|
| 1428 |
```
|
| 1429 |
|
| 1430 |
The pointer literal is the keyword `nullptr`. It has type
|
| 1431 |
`std::nullptr_t`.
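The types of the Boolean and pointer literals can be confirmed at compile time. The sketch below also shows one practical consequence, namely that overload resolution distinguishes `nullptr` from the integer literal `0`; the overload set `which` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

static_assert(std::is_same<decltype(true),  bool>::value, "true is a bool");
static_assert(std::is_same<decltype(false), bool>::value, "false is a bool");
static_assert(std::is_same<decltype(nullptr), std::nullptr_t>::value,
              "nullptr has type std::nullptr_t");

// Hypothetical overload set used to observe which literal selects which
// parameter type.
int which(int)            { return 1; }
int which(const char*)    { return 2; }
int which(std::nullptr_t) { return 3; }

void pointer_literal_demo() {
    assert(which(0) == 1);        // integer literal
    assert(which("text") == 2);   // string literal decays to const char*
    assert(which(nullptr) == 3);  // pointer literal
}
```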
|
| 1432 |
|
|
|
|
| 1558 |
basic character set. — *end note*]
|
| 1559 |
|
| 1560 |
If *L* is a *user-defined-string-literal*, let *str* be the literal
|
| 1561 |
without its *ud-suffix* and let *len* be the number of code units in
|
| 1562 |
*str* (i.e., its length excluding the terminating null character). If
|
| 1563 |
+
*S* contains a literal operator template with a constant template
|
| 1564 |
parameter for which *str* is a well-formed *template-argument*, the
|
| 1565 |
literal *L* is treated as a call of the form
|
| 1566 |
|
| 1567 |
``` cpp
|
| 1568 |
operator ""X<str>()
|
|
|
|
| 1625 |
[basic.fundamental]: basic.md#basic.fundamental
|
| 1626 |
[basic.link]: basic.md#basic.link
|
| 1627 |
[basic.lookup.unqual]: basic.md#basic.lookup.unqual
|
| 1628 |
[basic.stc]: basic.md#basic.stc
|
| 1629 |
[character.seq]: library.md#character.seq
|
| 1630 |
+
[class.mem.general]: class.md#class.mem.general
|
| 1631 |
[conv.mem]: expr.md#conv.mem
|
| 1632 |
[conv.ptr]: expr.md#conv.ptr
|
| 1633 |
[cpp]: cpp.md#cpp
|
| 1634 |
[cpp.cond]: cpp.md#cpp.cond
|
| 1635 |
+
[cpp.embed]: cpp.md#cpp.embed
|
| 1636 |
[cpp.import]: cpp.md#cpp.import
|
| 1637 |
[cpp.include]: cpp.md#cpp.include
|
| 1638 |
[cpp.module]: cpp.md#cpp.module
|
| 1639 |
+
[cpp.pragma]: cpp.md#cpp.pragma
|
| 1640 |
+
[cpp.pragma.op]: cpp.md#cpp.pragma.op
|
| 1641 |
+
[cpp.pre]: cpp.md#cpp.pre
|
| 1642 |
+
[cpp.predefined]: cpp.md#cpp.predefined
|
| 1643 |
+
[cpp.replace]: cpp.md#cpp.replace
|
| 1644 |
[cpp.stringize]: cpp.md#cpp.stringize
|
| 1645 |
[dcl.attr.grammar]: dcl.md#dcl.attr.grammar
|
| 1646 |
+
[dcl.pre]: dcl.md#dcl.pre
|
| 1647 |
+
[expr.const]: expr.md#expr.const
|
| 1648 |
[expr.prim.literal]: expr.md#expr.prim.literal
|
| 1649 |
[headers]: library.md#headers
|
| 1650 |
+
[intro.object]: basic.md#intro.object
|
| 1651 |
[lex]: #lex
|
| 1652 |
[lex.bool]: #lex.bool
|
| 1653 |
[lex.ccon]: #lex.ccon
|
| 1654 |
[lex.ccon.esc]: #lex.ccon.esc
|
| 1655 |
[lex.ccon.literal]: #lex.ccon.literal
|
| 1656 |
+
[lex.char]: #lex.char
|
| 1657 |
[lex.charset]: #lex.charset
|
| 1658 |
[lex.charset.basic]: #lex.charset.basic
|
| 1659 |
[lex.charset.literal]: #lex.charset.literal
|
| 1660 |
[lex.comment]: #lex.comment
|
| 1661 |
[lex.digraph]: #lex.digraph
|
|
|
|
| 1679 |
[lex.pptoken]: #lex.pptoken
|
| 1680 |
[lex.separate]: #lex.separate
|
| 1681 |
[lex.string]: #lex.string
|
| 1682 |
[lex.string.concat]: #lex.string.concat
|
| 1683 |
[lex.string.literal]: #lex.string.literal
|
| 1684 |
+
[lex.string.uneval]: #lex.string.uneval
|
| 1685 |
[lex.token]: #lex.token
|
| 1686 |
+
[lex.universal.char]: #lex.universal.char
|
| 1687 |
[module.import]: module.md#module.import
|
| 1688 |
+
[module.reach]: module.md#module.reach
|
| 1689 |
[module.unit]: module.md#module.unit
|
| 1690 |
[over.literal]: over.md#over.literal
|
| 1691 |
[support.types.layout]: support.md#support.types.layout
|
| 1692 |
[temp.explicit]: temp.md#temp.explicit
|
| 1693 |
+
[temp.inst]: temp.md#temp.inst
|
| 1694 |
[temp.names]: temp.md#temp.names
|
| 1695 |
+
[temp.point]: temp.md#temp.point
|
| 1696 |
+
[uaxid]: uax31.md#uaxid
|
| 1697 |
|
| 1698 |
[^1]: Implementations behave as if these separate phases occur, although
|
| 1699 |
in practice different phases can be folded together.
|
| 1700 |
|
| 1701 |
+
[^2]: Unicode® is a registered trademark of Unicode, Inc. This
|
| 1702 |
+
information is given for the convenience of users of this document
|
| 1703 |
+
and does not constitute an endorsement by ISO or IEC of this
|
| 1704 |
+
product.
|
| 1705 |
+
|
| 1706 |
+
[^3]: A partial preprocessing token would arise from a source file
|
| 1707 |
ending in the first portion of a multi-character token that requires
|
| 1708 |
a terminating sequence of characters, such as a *header-name* that
|
| 1709 |
is missing the closing `"` or `>`. A partial comment would arise
|
| 1710 |
from a source file ending with an unclosed `/*` comment.
|
| 1711 |
|
| 1712 |
+
[^4]: These include “digraphs” and additional reserved words. The term
|
| 1713 |
“digraph” (token consisting of two characters) is not perfectly
|
| 1714 |
descriptive, since one of the alternative *preprocessing-token*s is
|
| 1715 |
`%:%:` and of course several primary tokens contain two characters.
|
| 1716 |
Nonetheless, those alternative tokens that aren’t lexical keywords
|
| 1717 |
are colloquially known as “digraphs”.
|
| 1718 |
|
| 1719 |
+
[^5]: Thus the “stringized” values [[cpp.stringize]] of `[` and `<:`
|
| 1720 |
will be different, maintaining the source spelling, but the tokens
|
| 1721 |
can otherwise be freely interchanged.
|
| 1722 |
|
| 1723 |
+
[^6]: Literals include strings and character and numeric literals.
|
| 1724 |
|
| 1725 |
[^7]: On systems in which linkers cannot accept extended characters, an
|
| 1726 |
encoding of the *universal-character-name* can be used in forming
|
| 1727 |
valid external identifiers. For example, some otherwise unused
|
| 1728 |
character or sequence of characters can be used to encode the `\u` in
|
| 1729 |
a *universal-character-name*. Extended characters can produce a
|
| 1730 |
long external identifier, but C++ does not place a translation limit
|
| 1731 |
on significant characters for external identifiers.
|
| 1732 |
|
| 1733 |
[^8]: The term “literal” generally designates, in this document, those
|
| 1734 |
+
tokens that are called “constants” in C.
|