[lex.charset] - C++20 → C++23

tmp/tmpe19ig0fi/{from.md → to.md} +157 -44

@@ -1,62 +1,175 @@
 ## Character sets <a id="lex.charset">[[lex.charset]]</a>
-The *basic source character set* consists of 96 characters: the space
-character, the control characters representing horizontal tab, vertical
-tab, form feed, and new-line, plus the following 91 graphical
-characters:[^4]
-``` cpp
-a b c d e f g h i j k l m n o p q r s t u v w x y z
-A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
-0 1 2 3 4 5 6 7 8 9
-_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
-```
 The *universal-character-name* construct provides a way to name other
 characters.
 ``` bnf
 hex-quad:
     hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
 ```
 ``` bnf
 universal-character-name:
     '\u' hex-quad
     '\U' hex-quad hex-quad
 ```
-A *universal-character-name* designates the character in ISO/IEC 10646
-(if any) whose code point is the hexadecimal number represented by the
-sequence of *hexadecimal-digit*s in the *universal-character-name*. The
-program is ill-formed if that number is not a code point or if it is a
-surrogate code point. Noncharacter code points and reserved code points
-are considered to designate separate characters distinct from any
-ISO/IEC 10646 character. If a *universal-character-name* outside the
-*c-char-sequence*, *s-char-sequence*, or *r-char-sequence* of a
-*character-literal* or *string-literal* (in either case, including
-within a *user-defined-literal*) corresponds to a control character or
-to a character in the basic source character set, the program is
-ill-formed.[^5]
-[*Note 1*: ISO/IEC 10646 code points are integers in the range
-[0, 10FFFF] (hexadecimal). A surrogate code point is a value in the
-range [D800, DFFF] (hexadecimal). A control character is a character
-whose code point is in either of the ranges [0, 1F] or [7F, 9F]
-(hexadecimal). — *end note*]
-The *basic execution character set* and the *basic execution
-wide-character set* shall each contain all the members of the basic
-source character set, plus control characters representing alert,
-backspace, and carriage return, plus a *null character* (respectively,
-*null wide character*), whose value is 0. For each basic execution
-character set, the values of the members shall be non-negative and
-distinct from one another. In both the source and execution basic
-character sets, the value of each character after `0` in the above list
-of decimal digits shall be one greater than the value of the previous.
-The *execution character set* and the *execution wide-character set* are
-*implementation-defined* supersets of the basic execution character set
-and the basic execution wide-character set, respectively. The values of
-the members of the execution character sets and the sets of additional
-members are locale-specific.

 ## Character sets <a id="lex.charset">[[lex.charset]]</a>
+The *translation character set* consists of the following elements:
+- each abstract character assigned a code point in the Unicode
+  codespace, and
+- a distinct character for each Unicode scalar value not assigned to an
+  abstract character.
+[*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
+(hexadecimal). A surrogate code point is a value in the range
+[D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
+that is not a surrogate code point. — *end note*]
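The ranges in Note 1 can be restated as a small predicate. This is only a sketch: the name `is_unicode_scalar_value` is hypothetical and is not a standard library facility.

```cpp
// Hypothetical helper restating Note 1; not part of the standard library.
constexpr bool is_unicode_scalar_value(char32_t cp) {
    // Code points lie in [0, 10FFFF]; surrogate code points occupy [D800, DFFF].
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

static_assert(is_unicode_scalar_value(U'A'), "U+0041 is a scalar value");
static_assert(!is_unicode_scalar_value(0xD800), "first surrogate code point");
static_assert(!is_unicode_scalar_value(0xDFFF), "last surrogate code point");
```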
+The *basic character set* is a subset of the translation character set,
+consisting of 96 characters as specified in [[lex.charset.basic]].
+[*Note 2*: Unicode short names are given only as a means of identifying
+the character; the numerical value has no other meaning in this
+context. — *end note*]
+**Table: Basic character set** <a id="lex.charset.basic">[lex.charset.basic]</a>
+| character            | name                        | glyph                       |
+| -------------------- | --------------------------- | --------------------------- |
+| `U+0009`             | character tabulation        |                             |
+| `U+000b`             | line tabulation             |                             |
+| `U+000c`             | form feed                   |                             |
+| `U+0020`             | space                       |                             |
+| `U+000a`             | line feed                   | new-line                    |
+| `U+0021`             | exclamation mark            | `!`                         |
+| `U+0022`             | quotation mark              | `"`                         |
+| `U+0023`             | number sign                 | `#`                         |
+| `U+0025`             | percent sign                | `%`                         |
+| `U+0026`             | ampersand                   | `&`                         |
+| `U+0027`             | apostrophe                  | `'`                         |
+| `U+0028`             | left parenthesis            | `(`                         |
+| `U+0029`             | right parenthesis           | `)`                         |
+| `U+002a`             | asterisk                    | `*`                         |
+| `U+002b`             | plus sign                   | `+`                         |
+| `U+002c`             | comma                       | `,`                         |
+| `U+002d`             | hyphen-minus                | `-`                         |
+| `U+002e`             | full stop                   | `.`                         |
+| `U+002f`             | solidus                     | `/`                         |
+| `U+0030` .. `U+0039` | digit zero .. nine          | `0 1 2 3 4 5 6 7 8 9`       |
+| `U+003a`             | colon                       | `:`                         |
+| `U+003b`             | semicolon                   | `;`                         |
+| `U+003c`             | less-than sign              | `<`                         |
+| `U+003d`             | equals sign                 | `=`                         |
+| `U+003e`             | greater-than sign           | `>`                         |
+| `U+003f`             | question mark               | `?`                         |
+| `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
+|                      |                             | `N O P Q R S T U V W X Y Z` |
+| `U+005b`             | left square bracket         | `[`                         |
+| `U+005c`             | reverse solidus             | `\`                         |
+| `U+005d`             | right square bracket        | `]`                         |
+| `U+005e`             | circumflex accent           | `^`                         |
+| `U+005f`             | low line                    | `_`                         |
+| `U+0061` .. `U+007a` | latin small letter a .. z   | `a b c d e f g h i j k l m` |
+|                      |                             | `n o p q r s t u v w x y z` |
+| `U+007b`             | left curly bracket          | `{`                         |
+| `U+007c`             | vertical line               | `\|`                        |
+| `U+007d`             | right curly bracket         | `}`                         |
+| `U+007e`             | tilde                       | `~`                         |
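Read as code point ranges, the table covers U+0009 through U+000C plus U+0020 through U+007E, leaving out U+0024, U+0040, and U+0060. A minimal sketch of that membership test follows; the function name `in_basic_character_set` is hypothetical.

```cpp
// Hypothetical helper derived from the table above; not a standard facility.
constexpr bool in_basic_character_set(char32_t c) {
    return (c >= 0x09 && c <= 0x0C)                    // tab, line feed, line tabulation, form feed
        || (c >= 0x20 && c <= 0x7E                     // space through tilde ...
            && c != 0x24 && c != 0x40 && c != 0x60);   // ... minus the three code points not listed
}

static_assert(in_basic_character_set(U'~'), "tilde is listed");
static_assert(!in_basic_character_set(U'$'), "U+0024 is not in the table");
static_assert(!in_basic_character_set(U'\r'), "carriage return is not in the table");
```

The four controls and the 92 remaining printable characters add up to the 96 members the text states.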
 The *universal-character-name* construct provides a way to name other
 characters.
+``` bnf
+n-char:
+     any member of the translation character set except the U+007d (right curly bracket) or new-line character
+```
+``` bnf
+n-char-sequence:
+    n-char
+    n-char-sequence n-char
+```
+``` bnf
+named-universal-character:
+    '\N{' n-char-sequence '}'
+```
 ``` bnf
 hex-quad:
     hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
 ```
+``` bnf
+simple-hexadecimal-digit-sequence:
+    hexadecimal-digit
+    simple-hexadecimal-digit-sequence hexadecimal-digit
+```
 ``` bnf
 universal-character-name:
     '\u' hex-quad
     '\U' hex-quad hex-quad
+    '\u{' simple-hexadecimal-digit-sequence '}'
+    named-universal-character
 ```
+A *universal-character-name* of the form `\u` *hex-quad*, `\U`
+*hex-quad* *hex-quad*, or `\u{`*simple-hexadecimal-digit-sequence*`}`
+designates the character in the translation character set whose Unicode
+scalar value is the hexadecimal number represented by the sequence of
+*hexadecimal-digit*s in the *universal-character-name*. The program is
+ill-formed if that number is not a Unicode scalar value.
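As an illustration of the hexadecimal forms, the following sketch checks that the four-digit and eight-digit spellings designate the same scalar values. The delimited `\u{...}` form is mentioned only in a comment, on the assumption that the compiler at hand may predate C++23.

```cpp
// The fixed-width forms each designate one Unicode scalar value.
static_assert(U'\u00E9' == 0xE9, "four-digit form");
static_assert(U'\U0001F600' == 0x1F600, "eight-digit form");
static_assert(U'\u00E9' == U'\U000000E9', "same character, two spellings");

// In C++23 the delimited form takes any number of digits, so the literals
// above could also be spelled U'\u{E9}' and U'\u{1F600}'.
// A literal such as U'\uD800' is ill-formed: D800 is a surrogate code point,
// not a Unicode scalar value.
```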
+A *universal-character-name* that is a *named-universal-character*
+designates the corresponding character in the Unicode Standard (chapter
+4.8 Name) if the *n-char-sequence* is equal to its character name or to
+one of its character name aliases of type “control”, “correction”, or
+“alternate”; otherwise, the program is ill-formed.
+[*Note 3*: These aliases are listed in the Unicode Character Database’s
+`NameAliases.txt`. None of these names or aliases have leading or
+trailing spaces. — *end note*]
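A named form can be compared against its hexadecimal spelling. The sketch below is guarded by the `__cpp_named_character_escapes` feature-test macro and falls back to `\u` where the macro is absent; the constant names `e_acute` and `line_feed` are hypothetical.

```cpp
#if defined(__cpp_named_character_escapes)
// Character name, as given in the Unicode Standard.
constexpr char32_t e_acute = U'\N{LATIN SMALL LETTER E WITH ACUTE}';
// Name alias of type "control" from NameAliases.txt.
constexpr char32_t line_feed = U'\N{LINE FEED}';
#else
// Fallback spellings for compilers without named escapes.
constexpr char32_t e_acute = U'\u00E9';
constexpr char32_t line_feed = U'\n';
#endif

static_assert(e_acute == 0xE9, "U+00E9");
static_assert(line_feed == 0x0A, "U+000A");
```

Aliases of type "abbreviation", such as `LF`, are not among the alias types the paragraph accepts, so `\N{LF}` would make the program ill-formed.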
+If a *universal-character-name* outside the *c-char-sequence*,
+*s-char-sequence*, or *r-char-sequence* of a *character-literal* or
+*string-literal* (in either case, including within a
+*user-defined-literal*) corresponds to a control character or to a
+character in the basic character set, the program is ill-formed.
+[*Note 4*: A sequence of characters resembling a
+*universal-character-name* in an *r-char-sequence* [[lex.string]] does
+not form a *universal-character-name*. — *end note*]
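Note 4 can be checked by comparing literal sizes. In the sketch below, the six source characters `\u00E9` name one character in an ordinary literal but remain six characters in a raw literal.

```cpp
// Ordinary UTF-32 string literal: one character plus the terminating null.
static_assert(sizeof(U"\u00E9") / sizeof(char32_t) == 2, "one character");

// Raw string literal: reverse solidus, u, 0, 0, E, 9, plus the terminating null.
static_assert(sizeof(R"(\u00E9)") == 7, "six characters");
static_assert(R"(\u00E9)"[0] == '\\', "starts with a reverse solidus");
static_assert(R"(\u00E9)"[1] == 'u', "followed by the letter u");
```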
+The *basic literal character set* consists of all characters of the
+basic character set, plus the control characters specified in
+[[lex.charset.literal]].
+**Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
+| character | name            |
+| --------- | --------------- |
+| `U+0000`  | null            |
+| `U+0007`  | alert           |
+| `U+0008`  | backspace       |
+| `U+000d`  | carriage return |
+A *code unit* is an integer value of character type
+[[basic.fundamental]]. Characters in a *character-literal* other than a
+multicharacter or non-encodable character literal or in a
+*string-literal* are encoded as a sequence of one or more code units, as
+determined by the *encoding-prefix* [[lex.ccon]], [[lex.string]]; this
+is termed the respective *literal encoding*. The
+*ordinary literal encoding* is the encoding applied to an ordinary
+character or string literal. The *wide literal encoding* is the encoding
+applied to a wide character or string literal.
+A literal encoding or a locale-specific encoding of one of the execution
+character sets [[character.seq]] encodes each element of the basic
+literal character set as a single code unit with non-negative value,
+distinct from the code unit for any other such element.
+[*Note 5*: A character not in the basic literal character set can be
+encoded with more than one code unit; the value of such a code unit can
+be the same as that of a code unit for an element of the basic literal
+character set. — *end note*]
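The Unicode literal encodings give a concrete view of Note 5. This sketch counts code units with `sizeof`; each count includes the terminating null.

```cpp
// A basic literal character occupies a single code unit.
static_assert(sizeof(u8"A") == 2, "one UTF-8 code unit");

// Characters outside the basic literal character set can need more.
static_assert(sizeof(u8"\u00E9") == 3, "U+00E9: two UTF-8 code units");
static_assert(sizeof(u8"\U0001F600") == 5, "U+1F600: four UTF-8 code units");
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3, "two UTF-16 code units");
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2, "one UTF-32 code unit");
```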
+The U+0000 (null) character is encoded as the value `0`. No other
+element of the translation character set is encoded with a code unit of
+value `0`. The code unit value of each decimal digit character after the
+digit `0` (`U+0030`) shall be one greater than the value of the
+previous. The ordinary and wide literal encodings are otherwise
+*implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
+Unicode scalar value corresponding to each character of the translation
+character set is encoded as specified in the Unicode Standard for the
+respective Unicode encoding form.
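The digit and null guarantees are the ones that portable code commonly leans on. A minimal sketch, with a hypothetical helper `digit_value` that assumes its argument is a decimal digit:

```cpp
// Hypothetical helper; valid only for '0' through '9'.
constexpr int digit_value(char c) {
    // Digit code units are consecutive, so the offset from '0' is the value.
    return c - '0';
}

static_assert('\0' == 0, "null is encoded as 0");
static_assert('9' - '0' == 9, "ordinary literal encoding");
static_assert(L'9' - L'0' == 9, "wide literal encoding");
static_assert(digit_value('7') == 7, "offset from the digit zero");
```

The paragraph states no comparable ordering for letters, so expressions such as `c - 'a'` depend on the implementation-defined part of the encoding.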
