From Jason Turner

[lex.charset]

Files changed (1)
  1. tmp/tmp92ytj4on/{from.md → to.md} +11 -73
tmp/tmp92ytj4on/{from.md → to.md} RENAMED
@@ -1,21 +1,21 @@
1
- ## Character sets <a id="lex.charset">[[lex.charset]]</a>
2
 
3
  The *translation character set* consists of the following elements:
4
 
5
- - each abstract character assigned a code point in the Unicode
6
- codespace, and
7
  - a distinct character for each Unicode scalar value not assigned to an
8
  abstract character.
9
 
10
  [*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
11
  (hexadecimal). A surrogate code point is a value in the range
12
  [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
13
  that is not a surrogate code point. — *end note*]
14
 
15
  The *basic character set* is a subset of the translation character set,
16
- consisting of 96 characters as specified in [[lex.charset.basic]].
17
 
18
  [*Note 2*: Unicode short names are given only as a means to identifying
19
  the character; the numerical value has no other meaning in this
20
  context. — *end note*]
21
 
@@ -29,10 +29,11 @@ context. — *end note*]
29
  | `U+0020` | space | |
30
  | `U+000a` | line feed | new-line |
31
  | `U+0021` | exclamation mark | `!` |
32
  | `U+0022` | quotation mark | `"` |
33
  | `U+0023` | number sign | `#` |
 
34
  | `U+0025` | percent sign | `%` |
35
  | `U+0026` | ampersand | `&` |
36
  | `U+0027` | apostrophe | `'` |
37
  | `U+0028` | left parenthesis | `(` |
38
  | `U+0029` | right parenthesis | `)` |
@@ -47,90 +48,27 @@ context. — *end note*]
47
  | `U+003b` | semicolon | `;` |
48
  | `U+003c` | less-than sign | `<` |
49
  | `U+003d` | equals sign | `=` |
50
  | `U+003e` | greater-than sign | `>` |
51
  | `U+003f` | question mark | `?` |
 
52
  | `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
53
  | | | `N O P Q R S T U V W X Y Z` |
54
  | `U+005b` | left square bracket | `[` |
55
  | `U+005c` | reverse solidus | `\` |
56
  | `U+005d` | right square bracket | `]` |
57
  | `U+005e` | circumflex accent | `^` |
58
  | `U+005f` | low line | `_` |
 
59
  | `U+0061` .. `U+007a` | latin small letter a .. z | `a b c d e f g h i j k l m` |
60
  | | | `n o p q r s t u v w x y z` |
61
  | `U+007b` | left curly bracket | `{` |
62
  | `U+007c` | vertical line | `|` |
63
  | `U+007d` | right curly bracket | `}` |
64
  | `U+007e` | tilde | `~` |
65
 
66
 
67
- The *universal-character-name* construct provides a way to name other
68
- characters.
69
-
70
- ``` bnf
71
- n-char: one of
72
- any member of the translation character set except the U+007d (right curly bracket) or new-line character
73
- ```
74
-
75
- ``` bnf
76
- n-char-sequence:
77
- n-char
78
- n-char-sequence n-char
79
- ```
80
-
81
- ``` bnf
82
- named-universal-character:
83
- '\N{' n-char-sequence '}'
84
- ```
85
-
86
- ``` bnf
87
- hex-quad:
88
- hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
89
- ```
90
-
91
- ``` bnf
92
- simple-hexadecimal-digit-sequence:
93
- hexadecimal-digit
94
- simple-hexadecimal-digit-sequence hexadecimal-digit
95
- ```
96
-
97
- ``` bnf
98
- universal-character-name:
99
- '\u' hex-quad
100
- '\U' hex-quad hex-quad
101
- '\u{' simple-hexadecimal-digit-sequence '}'
102
- named-universal-character
103
- ```
104
-
105
- A *universal-character-name* of the form `\u` *hex-quad*, `\U`
106
- *hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
107
- designates the character in the translation character set whose Unicode
108
- scalar value is the hexadecimal number represented by the sequence of
109
- *hexadecimal-digit*s in the *universal-character-name*. The program is
110
- ill-formed if that number is not a Unicode scalar value.
111
-
112
- A *universal-character-name* that is a *named-universal-character*
113
- designates the corresponding character in the Unicode Standard (chapter
114
- 4.8 Name) if the *n-char-sequence* is equal to its character name or to
115
- one of its character name aliases of type “control”, “correction”, or
116
- “alternate”; otherwise, the program is ill-formed.
117
-
118
- [*Note 3*: These aliases are listed in the Unicode Character Database’s
119
- `NameAliases.txt`. None of these names or aliases have leading or
120
- trailing spaces. — *end note*]
121
-
122
- If a *universal-character-name* outside the *c-char-sequence*,
123
- *s-char-sequence*, or *r-char-sequence* of a *character-literal* or
124
- *string-literal* (in either case, including within a
125
- *user-defined-literal*) corresponds to a control character or to a
126
- character in the basic character set, the program is ill-formed.
127
-
128
- [*Note 4*: A sequence of characters resembling a
129
- *universal-character-name* in an *r-char-sequence* [[lex.string]] does
130
- not form a *universal-character-name*. — *end note*]
131
-
132
  The *basic literal character set* consists of all characters of the
133
  basic character set, plus the control characters specified in
134
  [[lex.charset.literal]].
135
 
136
  **Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
@@ -156,20 +94,20 @@ applied to a wide character or string literal.
156
  A literal encoding or a locale-specific encoding of one of the execution
157
  character sets [[character.seq]] encodes each element of the basic
158
  literal character set as a single code unit with non-negative value,
159
  distinct from the code unit for any other such element.
160
 
161
- [*Note 5*: A character not in the basic literal character set can be
162
  encoded with more than one code unit; the value of such a code unit can
163
  be the same as that of a code unit for an element of the basic literal
164
  character set. — *end note*]
165
 
166
  The U+0000 (null) character is encoded as the value `0`. No other
167
  element of the translation character set is encoded with a code unit of
168
  value `0`. The code unit value of each decimal digit character after the
169
  digit `0` (`U+0030`) shall be one greater than the value of the
170
  previous. The ordinary and wide literal encodings are otherwise
171
  *implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
172
- Unicode scalar value corresponding to each character of the translation
173
- character set is encoded as specified in the Unicode Standard for the
174
- respective Unicode encoding form.
175
 
 
1
+ ### Character sets <a id="lex.charset">[[lex.charset]]</a>
2
 
3
  The *translation character set* consists of the following elements:
4
 
5
+ - each abstract character assigned a code point in the Unicode codespace
6
+ as specified in the Unicode Standard, and
7
  - a distinct character for each Unicode scalar value not assigned to an
8
  abstract character.
9
 
10
  [*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
11
  (hexadecimal). A surrogate code point is a value in the range
12
  [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
13
  that is not a surrogate code point. — *end note*]
14
 
15
  The *basic character set* is a subset of the translation character set,
16
+ consisting of 99 characters as specified in [[lex.charset.basic]].
17
 
18
  [*Note 2*: Unicode short names are given only as a means to identifying
19
  the character; the numerical value has no other meaning in this
20
  context. — *end note*]
21
 
 
29
  | `U+0020` | space | |
30
  | `U+000a` | line feed | new-line |
31
  | `U+0021` | exclamation mark | `!` |
32
  | `U+0022` | quotation mark | `"` |
33
  | `U+0023` | number sign | `#` |
34
+ | `U+0024` | dollar sign | `$` |
35
  | `U+0025` | percent sign | `%` |
36
  | `U+0026` | ampersand | `&` |
37
  | `U+0027` | apostrophe | `'` |
38
  | `U+0028` | left parenthesis | `(` |
39
  | `U+0029` | right parenthesis | `)` |
 
48
  | `U+003b` | semicolon | `;` |
49
  | `U+003c` | less-than sign | `<` |
50
  | `U+003d` | equals sign | `=` |
51
  | `U+003e` | greater-than sign | `>` |
52
  | `U+003f` | question mark | `?` |
53
+ | `U+0040` | commercial at | `@` |
54
  | `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
55
  | | | `N O P Q R S T U V W X Y Z` |
56
  | `U+005b` | left square bracket | `[` |
57
  | `U+005c` | reverse solidus | `\` |
58
  | `U+005d` | right square bracket | `]` |
59
  | `U+005e` | circumflex accent | `^` |
60
  | `U+005f` | low line | `_` |
61
+ | `U+0060` | grave accent | `` ` `` |
62
  | `U+0061` .. `U+007a` | latin small letter a .. z | `a b c d e f g h i j k l m` |
63
  | | | `n o p q r s t u v w x y z` |
64
  | `U+007b` | left curly bracket | `{` |
65
  | `U+007c` | vertical line | `|` |
66
  | `U+007d` | right curly bracket | `}` |
67
  | `U+007e` | tilde | `~` |
68
 
69
 
70
  The *basic literal character set* consists of all characters of the
71
  basic character set, plus the control characters specified in
72
  [[lex.charset.literal]].
73
 
74
  **Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
 
94
  A literal encoding or a locale-specific encoding of one of the execution
95
  character sets [[character.seq]] encodes each element of the basic
96
  literal character set as a single code unit with non-negative value,
97
  distinct from the code unit for any other such element.
98
 
99
+ [*Note 3*: A character not in the basic literal character set can be
100
  encoded with more than one code unit; the value of such a code unit can
101
  be the same as that of a code unit for an element of the basic literal
102
  character set. — *end note*]
103
 
104
  The U+0000 (null) character is encoded as the value `0`. No other
105
  element of the translation character set is encoded with a code unit of
106
  value `0`. The code unit value of each decimal digit character after the
107
  digit `0` (`U+0030`) shall be one greater than the value of the
108
  previous. The ordinary and wide literal encodings are otherwise
109
  *implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
110
+ implementation shall encode the Unicode scalar value corresponding to
111
+ each character of the translation character set as specified in the
112
+ Unicode Standard for the respective Unicode encoding form.
113
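
Reviewer sketch (not part of the diff): the ranges in Note 1 and the encoding guarantees in the final paragraph can be checked at compile time. The helper names `is_code_point`, `is_surrogate`, and `is_scalar_value` are hypothetical and chosen only for illustration; the standard defines the terms but no such functions. The `static_assert`s on `U'$'`, `u'@'`, and ``U'`'`` assume only that UTF-16 and UTF-32 literals are encoded as the Unicode Standard specifies, which the text above requires.

```cpp
#include <cstdint>

// Hypothetical helpers mirroring Note 1; these names are illustrative only.
constexpr bool is_code_point(std::uint32_t v) {
    return v <= 0x10FFFF;                 // code points: [0, 10FFFF]
}
constexpr bool is_surrogate(std::uint32_t v) {
    return v >= 0xD800 && v <= 0xDFFF;    // surrogates: [D800, DFFF]
}
constexpr bool is_scalar_value(std::uint32_t v) {
    // A scalar value is any code point that is not a surrogate.
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x0024), "dollar sign is a scalar value");
static_assert(is_scalar_value(0x10FFFF), "top of the codespace");
static_assert(!is_scalar_value(0xD800), "surrogates are excluded");
static_assert(!is_scalar_value(0x110000), "outside the codespace");

// The three characters this change adds to the basic character set,
// checked through UTF-32 and UTF-16 literals.
static_assert(U'$' == 0x24, "U+0024 dollar sign");
static_assert(u'@' == 0x40, "U+0040 commercial at");
static_assert(U'`' == 0x60, "U+0060 grave accent");

// Encoding guarantees from the final paragraph: null is 0, and the
// decimal digits are consecutive in the ordinary literal encoding.
static_assert('\0' == 0, "null character is encoded as 0");
static_assert('1' - '0' == 1 && '9' - '0' == 9, "digits are contiguous");
```

If the file compiles, every assertion holds on that implementation. Only the last two assertions touch the ordinary literal encoding, which is otherwise implementation-defined.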