[lex.charset] - C++23 → Trunk

tmp/tmp92ytj4on/{from.md → to.md} +11 -73

@@ -1,21 +1,21 @@
-## Character sets <a id="lex.charset">[[lex.charset]]</a>
 The *translation character set* consists of the following elements:
-- each abstract character assigned a code point in the Unicode
- codespace, and
 - a distinct character for each Unicode scalar value not assigned to an
   abstract character.
 [*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
 (hexadecimal). A surrogate code point is a value in the range
 [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
 that is not a surrogate code point. — *end note*]
 The *basic character set* is a subset of the translation character set,
-consisting of 96 characters as specified in [[lex.charset.basic]].
 [*Note 2*: Unicode short names are given only as a means to identifying
 the character; the numerical value has no other meaning in this
 context. — *end note*]
@@ -29,10 +29,11 @@ context. — *end note*]
 | `U+0020`             | space                       |                             |
 | `U+000a`             | line feed                   | new-line                    |
 | `U+0021`             | exclamation mark            | `!`                         |
 | `U+0022`             | quotation mark              | `"`                         |
 | `U+0023`             | number sign                 | `#`                         |
 | `U+0025`             | percent sign                | `%`                         |
 | `U+0026`             | ampersand                   | `&`                         |
 | `U+0027`             | apostrophe                  | `'`                         |
 | `U+0028`             | left parenthesis            | `(`                         |
 | `U+0029`             | right parenthesis           | `)`                         |
@@ -47,90 +48,27 @@ context. — *end note*]
 | `U+003b`             | semicolon                   | `;`                         |
 | `U+003c`             | less-than sign              | `<`                         |
 | `U+003d`             | equals sign                 | `=`                         |
 | `U+003e`             | greater-than sign           | `>`                         |
 | `U+003f`             | question mark               | `?`                         |
 | `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
 |                      |                             | `N O P Q R S T U V W X Y Z` |
 | `U+005b`             | left square bracket         | `[`                         |
 | `U+005c`             | reverse solidus             | `\`                         |
 | `U+005d`             | right square bracket        | `]`                         |
 | `U+005e`             | circumflex accent           | `^`                         |
 | `U+005f`             | low line                    | `_`                         |
 | `U+0061` .. `U+007a` | latin small letter a .. z   | `a b c d e f g h i j k l m` |
 |                      |                             | `n o p q r s t u v w x y z` |
 | `U+007b`             | left curly bracket          | `{`                         |
 | `U+007c`             | vertical line               | `|`                         |
 | `U+007d`             | right curly bracket         | `}`                         |
 | `U+007e`             | tilde                       | `~`                         |
-The *universal-character-name* construct provides a way to name other
-characters.
-``` bnf
-n-char: one of
-     any member of the translation character set except the U+007d (right curly bracket) or new-line character
-```
-``` bnf
-n-char-sequence:
-    n-char
-    n-char-sequence n-char
-```
-``` bnf
-named-universal-character:
-    '\N{' n-char-sequence '}'
-```
-``` bnf
-hex-quad:
-    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
-```
-``` bnf
-simple-hexadecimal-digit-sequence:
-    hexadecimal-digit
-    simple-hexadecimal-digit-sequence hexadecimal-digit
-```
-``` bnf
-universal-character-name:
-    '\u' hex-quad
-    '\U' hex-quad hex-quad
-    '\u{' simple-hexadecimal-digit-sequence '}'
-    named-universal-character
-```
-A *universal-character-name* of the form `\u` *hex-quad*, `\U`
-*hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
-designates the character in the translation character set whose Unicode
-scalar value is the hexadecimal number represented by the sequence of
-*hexadecimal-digit*s in the *universal-character-name*. The program is
-ill-formed if that number is not a Unicode scalar value.
-A *universal-character-name* that is a *named-universal-character*
-designates the corresponding character in the Unicode Standard (chapter
-4.8 Name) if the *n-char-sequence* is equal to its character name or to
-one of its character name aliases of type “control”, “correction”, or
-“alternate”; otherwise, the program is ill-formed.
-[*Note 3*: These aliases are listed in the Unicode Character Database’s
-`NameAliases.txt`. None of these names or aliases have leading or
-trailing spaces. — *end note*]
-If a *universal-character-name* outside the *c-char-sequence*,
-*s-char-sequence*, or *r-char-sequence* of a *character-literal* or
-*string-literal* (in either case, including within a
-*user-defined-literal*) corresponds to a control character or to a
-character in the basic character set, the program is ill-formed.
-[*Note 4*: A sequence of characters resembling a
-*universal-character-name* in an *r-char-sequence* [[lex.string]] does
-not form a *universal-character-name*. — *end note*]
 The *basic literal character set* consists of all characters of the
 basic character set, plus the control characters specified in
 [[lex.charset.literal]].
 **Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
@@ -156,20 +94,20 @@ applied to a wide character or string literal.
 A literal encoding or a locale-specific encoding of one of the execution
 character sets [[character.seq]] encodes each element of the basic
 literal character set as a single code unit with non-negative value,
 distinct from the code unit for any other such element.
-[*Note 5*: A character not in the basic literal character set can be
 encoded with more than one code unit; the value of such a code unit can
 be the same as that of a code unit for an element of the basic literal
 character set. — *end note*]
 The U+0000 (null) character is encoded as the value `0`. No other
 element of the translation character set is encoded with a code unit of
 value `0`. The code unit value of each decimal digit character after the
 digit `0` (`U+0030`) shall be one greater than the value of the
 previous. The ordinary and wide literal encodings are otherwise
 *implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
-Unicode scalar value corresponding to each character of the translation
-character set is encoded as specified in the Unicode Standard for the
-respective Unicode encoding form.
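
As a rough illustration of Note 1 and of the `\u` *hex-quad* and `\U` *hex-quad* *hex-quad* forms on the removed side of this hunk, the following C++ sketch classifies integers as Unicode scalar values. The helper names (`is_code_point`, `is_surrogate`, `is_unicode_scalar_value`) are hypothetical and do not appear in the standard.

```cpp
#include <cstdint>

// Hypothetical helpers mirroring the ranges given in Note 1.
constexpr bool is_code_point(std::uint32_t v) {
    return v <= 0x10FFFF;                  // [0, 10FFFF]
}

constexpr bool is_surrogate(std::uint32_t v) {
    return v >= 0xD800 && v <= 0xDFFF;     // [D800, DFFF]
}

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_unicode_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_unicode_scalar_value(0x00E9), "U+00E9 is a scalar value");
static_assert(!is_unicode_scalar_value(0xD800), "surrogates are excluded");
static_assert(!is_unicode_scalar_value(0x110000), "outside the codespace");

// Both spellings designate the character whose scalar value is 0xE9.
static_assert(U'\u00E9' == U'\U000000E9', "same character");
static_assert(static_cast<std::uint32_t>(U'\u00E9') == 0x00E9,
              "UTF-32 code unit equals the scalar value");
```

A *universal-character-name* whose hexadecimal number fails a check like `is_unicode_scalar_value` (for example `\uD800`) makes the program ill-formed, so it cannot appear in a compilable example.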

+### Character sets <a id="lex.charset">[[lex.charset]]</a>
 The *translation character set* consists of the following elements:
+- each abstract character assigned a code point in the Unicode codespace
+ as specified in the Unicode Standard, and
 - a distinct character for each Unicode scalar value not assigned to an
   abstract character.
 [*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
 (hexadecimal). A surrogate code point is a value in the range
 [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
 that is not a surrogate code point. — *end note*]
 The *basic character set* is a subset of the translation character set,
+consisting of 99 characters as specified in [[lex.charset.basic]].
 [*Note 2*: Unicode short names are given only as a means to identifying
 the character; the numerical value has no other meaning in this
 context. — *end note*]
 | `U+0020`             | space                       |                             |
 | `U+000a`             | line feed                   | new-line                    |
 | `U+0021`             | exclamation mark            | `!`                         |
 | `U+0022`             | quotation mark              | `"`                         |
 | `U+0023`             | number sign                 | `#`                         |
+| `U+0024`             | dollar sign                 | `$`                         |
 | `U+0025`             | percent sign                | `%`                         |
 | `U+0026`             | ampersand                   | `&`                         |
 | `U+0027`             | apostrophe                  | `'`                         |
 | `U+0028`             | left parenthesis            | `(`                         |
 | `U+0029`             | right parenthesis           | `)`                         |
 | `U+003b`             | semicolon                   | `;`                         |
 | `U+003c`             | less-than sign              | `<`                         |
 | `U+003d`             | equals sign                 | `=`                         |
 | `U+003e`             | greater-than sign           | `>`                         |
 | `U+003f`             | question mark               | `?`                         |
+| `U+0040`             | commercial at               | `@`                         |
 | `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
 |                      |                             | `N O P Q R S T U V W X Y Z` |
 | `U+005b`             | left square bracket         | `[`                         |
 | `U+005c`             | reverse solidus             | `\`                         |
 | `U+005d`             | right square bracket        | `]`                         |
 | `U+005e`             | circumflex accent           | `^`                         |
 | `U+005f`             | low line                    | `_`                         |
+| `U+0060`             | grave accent                | `` ` ``                     |
 | `U+0061` .. `U+007a` | latin small letter a .. z   | `a b c d e f g h i j k l m` |
 |                      |                             | `n o p q r s t u v w x y z` |
 | `U+007b`             | left curly bracket          | `{`                         |
 | `U+007c`             | vertical line               | `|`                         |
 | `U+007d`             | right curly bracket         | `}`                         |
 | `U+007e`             | tilde                       | `~`                         |
 The *basic literal character set* consists of all characters of the
 basic character set, plus the control characters specified in
 [[lex.charset.literal]].
 **Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
 A literal encoding or a locale-specific encoding of one of the execution
 character sets [[character.seq]] encodes each element of the basic
 literal character set as a single code unit with non-negative value,
 distinct from the code unit for any other such element.
+[*Note 3*: A character not in the basic literal character set can be
 encoded with more than one code unit; the value of such a code unit can
 be the same as that of a code unit for an element of the basic literal
 character set. — *end note*]
 The U+0000 (null) character is encoded as the value `0`. No other
 element of the translation character set is encoded with a code unit of
 value `0`. The code unit value of each decimal digit character after the
 digit `0` (`U+0030`) shall be one greater than the value of the
 previous. The ordinary and wide literal encodings are otherwise
 *implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
+implementation shall encode the Unicode scalar value corresponding to
+each character of the translation character set as specified in the
+Unicode Standard for the respective Unicode encoding form.
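
The encoding guarantees in the paragraph above can be observed at compile time. The following is a minimal sketch, not normative text: `digit_value` is a hypothetical helper, and the chosen characters (U+00E9, U+20AC, U+1F600) are arbitrary examples.

```cpp
// Decimal digits are guaranteed to have consecutive code unit values,
// so subtracting '0' yields the numeric value of a digit character.
constexpr int digit_value(char c) {   // hypothetical helper; assumes c is '0'..'9'
    return c - '0';
}

static_assert(digit_value('0') == 0 && digit_value('9') == 9,
              "digits are contiguous in the ordinary literal encoding");
static_assert(L'9' - L'0' == 9,
              "digits are contiguous in the wide literal encoding");

// U+0000 (null) is encoded as the value 0.
static_assert('\0' == 0, "null is encoded as 0");

// Elements of the basic literal character set have non-negative code units.
static_assert('~' > 0 && 'A' > 0, "non-negative code units");

// UTF-32: one code unit per character, equal to the Unicode scalar value.
static_assert(U'\u20AC' == 0x20AC, "UTF-32 encodes the scalar value directly");

// UTF-16: U+1F600 lies outside the BMP and becomes a surrogate pair.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3,
              "two code units plus the terminating null");
static_assert(u"\U0001F600"[0] == 0xD83D && u"\U0001F600"[1] == 0xDE00,
              "surrogate pair for U+1F600");

// UTF-8: U+00E9 is encoded as the two code units C3 A9.
static_assert(sizeof(u8"\u00E9") == 3, "two code units plus the terminating null");
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3 &&
              static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9,
              "UTF-8 encoding of U+00E9");
```

The casts to `unsigned char` keep the UTF-8 checks valid whether `u8` literals have type `char` (before C++20) or `char8_t`. Nothing comparable can be asserted portably about letters, since the ordinary and wide literal encodings are otherwise implementation-defined.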
