[lex.char] - C++23 → Trunk

tmp/tmpqjlue9ms/{from.md → to.md} +179 -0

	@@ -0,0 +1,179 @@

+## Characters <a id="lex.char">[[lex.char]]</a>
+### Character sets <a id="lex.charset">[[lex.charset]]</a>
+The *translation character set* consists of the following elements:
+- each abstract character assigned a code point in the Unicode codespace
+  as specified in the Unicode Standard, and
+- a distinct character for each Unicode scalar value not assigned to an
+  abstract character.
+[*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
+(hexadecimal). A surrogate code point is a value in the range
+[D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
+that is not a surrogate code point. — *end note*]
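The ranges in Note 1 can be written down as compile-time predicates. The following is a minimal sketch, assuming C++17 or later; the helper names `is_code_point`, `is_surrogate`, and `is_scalar_value` are hypothetical and not part of the standard library.

```cpp
#include <cstdint>

// Hypothetical helpers that mirror the ranges given in Note 1.

// A Unicode code point is an integer in [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// A surrogate code point lies in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x0041));    // LATIN CAPITAL LETTER A
static_assert(!is_scalar_value(0xD800));   // first surrogate code point
static_assert(is_scalar_value(0x10FFFF));  // last code point
static_assert(!is_code_point(0x110000));   // outside the codespace
```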
+The *basic character set* is a subset of the translation character set,
+consisting of 99 characters as specified in [[lex.charset.basic]].
+[*Note 2*: Unicode short names are given only as a means of identifying
+the character; the numerical value has no other meaning in this
+context. — *end note*]
+**Table: Basic character set** <a id="lex.charset.basic">[lex.charset.basic]</a>
+| character            |                             | glyph                       |
+| -------------------- | --------------------------- | --------------------------- |
+| `U+0009`             | character tabulation        |                             |
+| `U+000b`             | line tabulation             |                             |
+| `U+000c`             | form feed                   |                             |
+| `U+0020`             | space                       |                             |
+| `U+000a`             | line feed                   | new-line                    |
+| `U+0021`             | exclamation mark            | `!`                         |
+| `U+0022`             | quotation mark              | `"`                         |
+| `U+0023`             | number sign                 | `#`                         |
+| `U+0024`             | dollar sign                 | `$`                         |
+| `U+0025`             | percent sign                | `%`                         |
+| `U+0026`             | ampersand                   | `&`                         |
+| `U+0027`             | apostrophe                  | `'`                         |
+| `U+0028`             | left parenthesis            | `(`                         |
+| `U+0029`             | right parenthesis           | `)`                         |
+| `U+002a`             | asterisk                    | `*`                         |
+| `U+002b`             | plus sign                   | `+`                         |
+| `U+002c`             | comma                       | `,`                         |
+| `U+002d`             | hyphen-minus                | `-`                         |
+| `U+002e`             | full stop                   | `.`                         |
+| `U+002f`             | solidus                     | `/`                         |
+| `U+0030` .. `U+0039` | digit zero .. nine          | `0 1 2 3 4 5 6 7 8 9`       |
+| `U+003a`             | colon                       | `:`                         |
+| `U+003b`             | semicolon                   | `;`                         |
+| `U+003c`             | less-than sign              | `<`                         |
+| `U+003d`             | equals sign                 | `=`                         |
+| `U+003e`             | greater-than sign           | `>`                         |
+| `U+003f`             | question mark               | `?`                         |
+| `U+0040`             | commercial at               | `@`                         |
+| `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
+|                      |                             | `N O P Q R S T U V W X Y Z` |
+| `U+005b`             | left square bracket         | `[`                         |
+| `U+005c`             | reverse solidus             | `\`                         |
+| `U+005d`             | right square bracket        | `]`                         |
+| `U+005e`             | circumflex accent           | `^`                         |
+| `U+005f`             | low line                    | `_`                         |
+| `U+0060`             | grave accent                | `` ` ``                     |
+| `U+0061` .. `U+007a` | latin small letter a .. z   | `a b c d e f g h i j k l m` |
+|                      |                             | `n o p q r s t u v w x y z` |
+| `U+007b`             | left curly bracket          | `{`                         |
+| `U+007c`             | vertical line               | `|`                         |
+| `U+007d`             | right curly bracket         | `}`                         |
+| `U+007e`             | tilde                       | `~`                         |
+The *basic literal character set* consists of all characters of the
+basic character set, plus the control characters specified in
+[[lex.charset.literal]].
+**Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
+|          |                 |
+| -------- | --------------- |
+| `U+0000` | null            |
+| `U+0007` | alert           |
+| `U+0008` | backspace       |
+| `U+000d` | carriage return |
+A *code unit* is an integer value of character type
+[[basic.fundamental]]. Characters in a *character-literal* other than a
+multicharacter or non-encodable character literal or in a
+*string-literal* are encoded as a sequence of one or more code units, as
+determined by the *encoding-prefix* [[lex.ccon]], [[lex.string]]; this
+is termed the respective *literal encoding*. The
+*ordinary literal encoding* is the encoding applied to an ordinary
+character or string literal. The *wide literal encoding* is the encoding
+applied to a wide character or string literal.
+A literal encoding or a locale-specific encoding of one of the execution
+character sets [[character.seq]] encodes each element of the basic
+literal character set as a single code unit with non-negative value,
+distinct from the code unit for any other such element.
+[*Note 3*: A character not in the basic literal character set can be
+encoded with more than one code unit; the value of such a code unit can
+be the same as that of a code unit for an element of the basic literal
+character set. — *end note*]
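As an illustration of Note 3, the sketch below counts code units in UTF-8, UTF-16, and UTF-32 string literals, whose encodings are fixed by the Unicode Standard. It assumes C++17 or later; `code_units` is a hypothetical helper, and the ordinary and wide literal encodings are deliberately left out because they are implementation-defined.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null code unit.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// An element of the basic literal character set is a single code unit.
static_assert(code_units(u8"A") == 1);

// U+00E9 is outside the basic literal character set; UTF-8 needs two code units.
static_assert(code_units(u8"\u00e9") == 2);

// U+1F600 needs two UTF-16 code units but only one UTF-32 code unit.
static_assert(code_units(u"\U0001F600") == 2);
static_assert(code_units(U"\U0001F600") == 1);
```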
+The U+0000 (null) character is encoded as the value `0`. No other
+element of the translation character set is encoded with a code unit of
+value `0`. The code unit value of each decimal digit character after the
+digit `0` (`U+0030`) shall be one greater than the value of the
+previous. The ordinary and wide literal encodings are otherwise
+*implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
+implementation shall encode the Unicode scalar value corresponding to
+each character of the translation character set as specified in the
+Unicode Standard for the respective Unicode encoding form.
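The guarantees in this paragraph hold for every conforming implementation, so they can be checked at compile time. The following is a sketch, assuming C++17 or later; `digit_value` is a hypothetical helper that relies only on the contiguity of the decimal digits.

```cpp
// The null character is encoded as the value 0.
static_assert('\0' == 0);
static_assert(L'\0' == 0);

// Each decimal digit after '0' is one greater than the previous one.
static_assert('1' - '0' == 1 && '9' - '0' == 9);
static_assert(L'9' - L'0' == 9);

// Hypothetical helper: converts a decimal digit to its numeric value,
// or returns -1 for any other character.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
static_assert(digit_value('7') == 7);
static_assert(digit_value('x') == -1);

// UTF-8, UTF-16, and UTF-32 literals follow the Unicode encoding forms.
static_assert(static_cast<unsigned char>(u8"\u00e9"[0]) == 0xC3);
static_assert(static_cast<unsigned char>(u8"\u00e9"[1]) == 0xA9);
static_assert(u'\u00e9' == 0x00E9);
static_assert(U'\U0001F600' == 0x1F600);
```

Nothing here relies on the ordinary literal encoding being ASCII; only the digit ordering and the null value are assumed, as stated in the text above.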
+### Universal character names <a id="lex.universal.char">[[lex.universal.char]]</a>
+``` bnf
+n-char:
+     any member of the translation character set except the U+007d (right curly bracket) or new-line character
+```
+``` bnf
+n-char-sequence:
+    n-char n-char-sequenceₒₚₜ
+```
+``` bnf
+named-universal-character:
+    '\N{' n-char-sequence '}'
+```
+``` bnf
+hex-quad:
+    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
+```
+``` bnf
+simple-hexadecimal-digit-sequence:
+    hexadecimal-digit simple-hexadecimal-digit-sequenceₒₚₜ
+```
+``` bnf
+universal-character-name:
+    '\u' hex-quad
+    '\U' hex-quad hex-quad
+    '\u{' simple-hexadecimal-digit-sequence '}'
+    named-universal-character
+```
+The *universal-character-name* construct provides a way to name any
+element in the translation character set using just the basic character
+set. If a *universal-character-name* outside the *c-char-sequence*,
+*s-char-sequence*, or *r-char-sequence* of a *character-literal* or
+*string-literal* (in either case, including within a
+*user-defined-literal*) corresponds to a control character or to a
+character in the basic character set, the program is ill-formed.
+[*Note 1*: A sequence of characters resembling a
+*universal-character-name* in an *r-char-sequence* [[lex.string]] does
+not form a *universal-character-name*. — *end note*]
+A *universal-character-name* of the form `\u` *hex-quad*, `\U`
+*hex-quad* *hex-quad*, or `\u{` *simple-hexadecimal-digit-sequence* `}`
+designates the character in the translation character set whose Unicode
+scalar value is the hexadecimal number represented by the sequence of
+*hexadecimal-digit*s in the *universal-character-name*. The program is
+ill-formed if that number is not a Unicode scalar value.
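A short sketch of these rules, assuming C++17 or later. Only the `\u` *hex-quad* and `\U` *hex-quad* *hex-quad* forms appear in live code; the delimited `\u{...}` form needs a compiler that implements it, so it is mentioned only in a comment.

```cpp
// Both fixed-width forms designate a character by its Unicode scalar value.
static_assert(U'\u00e9' == 0x00E9);
static_assert(U'\U000000e9' == U'\u00e9');
static_assert(U'\U0001F600' == 0x1F600);

// With a compiler that supports the delimited form, U'\u{1F600}' would
// designate the same character as U'\U0001F600'.

// Ill-formed: U'\uD800' names a surrogate code point, not a scalar value.

// Inside a raw string, the characters are taken literally and no
// universal-character-name is formed: six characters plus the null.
static_assert(sizeof(R"(\u00e9)") == 7);
static_assert(R"(\u00e9)"[0] == '\\');
static_assert(R"(\u00e9)"[1] == 'u');
```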
+A *universal-character-name* that is a *named-universal-character*
+designates the corresponding character in the Unicode Standard (chapter
+4.8 Name) if the *n-char-sequence* is equal to its character name or to
+one of its character name aliases of type “control”, “correction”, or
+“alternate”; otherwise, the program is ill-formed.
+[*Note 2*: These aliases are listed in the Unicode Character Database’s
+`NameAliases.txt`. None of these names or aliases have leading or
+trailing spaces. — *end note*]
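Named universal characters can be sketched as follows. This assumes C++17 or later and uses the feature-test macro `__cpp_named_character_escapes` to detect support; on compilers without it, the `#else` branch spells the same characters another way. The constant names `e_acute` and `line_feed` are hypothetical.

```cpp
#if defined(__cpp_named_character_escapes)
// Designated by character name.
constexpr char32_t e_acute = U'\N{LATIN SMALL LETTER E WITH ACUTE}';
// Designated by a name alias of type "control" from NameAliases.txt.
constexpr char32_t line_feed = U'\N{LINE FEED}';
#else
// Fallback spellings of the same characters for older compilers.
constexpr char32_t e_acute = U'\u00e9';
constexpr char32_t line_feed = U'\n';
#endif

static_assert(e_acute == 0x00E9);
static_assert(line_feed == 0x000A);
```

A name with leading or trailing spaces, or one that matches neither a character name nor an accepted alias, would make the program ill-formed.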
