From Jason Turner

[lex.charset]

Files changed (1)
  1. tmp/tmpe19ig0fi/{from.md → to.md} +157 -44
tmp/tmpe19ig0fi/{from.md → to.md} RENAMED
@@ -1,62 +1,175 @@
1
  ## Character sets <a id="lex.charset">[[lex.charset]]</a>
2
 
3
- The *basic source character set* consists of 96 characters: the space
4
- character, the control characters representing horizontal tab, vertical
5
- tab, form feed, and new-line, plus the following 91 graphical
6
- characters:[^4]
7
-
8
- ``` cpp
9
- a b c d e f g h i j k l m n o p q r s t u v w x y z
10
- A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
11
- 0 1 2 3 4 5 6 7 8 9
12
- _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
13
- ```
14
 
15
  The *universal-character-name* construct provides a way to name other
16
  characters.
17
 
18
  ``` bnf
19
  hex-quad:
20
  hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
21
  ```
22
 
23
  ``` bnf
24
  universal-character-name:
25
  '\u' hex-quad
26
  '\U' hex-quad hex-quad
27
  ```
28
 
29
- A *universal-character-name* designates the character in ISO/IEC 10646
30
- (if any) whose code point is the hexadecimal number represented by the
31
- sequence of *hexadecimal-digit*s in the *universal-character-name*. The
32
- program is ill-formed if that number is not a code point or if it is a
33
- surrogate code point. Noncharacter code points and reserved code points
34
- are considered to designate separate characters distinct from any
35
- ISO/IEC 10646 character. If a *universal-character-name* outside the
36
- *c-char-sequence*, *s-char-sequence*, or *r-char-sequence* of a
37
- *character-literal* or *string-literal* (in either case, including
38
- within a *user-defined-literal*) corresponds to a control character or
39
- to a character in the basic source character set, the program is
40
- ill-formed.[^5]
41
-
42
- [*Note 1*: ISO/IEC 10646 code points are integers in the range
43
- [0, 10FFFF] (hexadecimal). A surrogate code point is a value in the
44
- range [D800, DFFF] (hexadecimal). A control character is a character
45
- whose code point is in either of the ranges [0, 1F] or [7F, 9F]
46
- (hexadecimal). *end note*]
47
-
48
- The *basic execution character set* and the *basic execution
49
- wide-character set* shall each contain all the members of the basic
50
- source character set, plus control characters representing alert,
51
- backspace, and carriage return, plus a *null character* (respectively,
52
- *null wide character*), whose value is 0. For each basic execution
53
- character set, the values of the members shall be non-negative and
54
- distinct from one another. In both the source and execution basic
55
- character sets, the value of each character after `0` in the above list
56
- of decimal digits shall be one greater than the value of the previous.
57
- The *execution character set* and the *execution wide-character set* are
58
- *implementation-defined* supersets of the basic execution character set
59
- and the basic execution wide-character set, respectively. The values of
60
- the members of the execution character sets and the sets of additional
61
- members are locale-specific.
62
 
 
1
  ## Character sets <a id="lex.charset">[[lex.charset]]</a>
2
 
3
+ The *translation character set* consists of the following elements:
4
+
5
+ - each abstract character assigned a code point in the Unicode
6
+ codespace, and
7
+ - a distinct character for each Unicode scalar value not assigned to an
8
+ abstract character.
9
+
10
+ [*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
11
+ (hexadecimal). A surrogate code point is a value in the range
12
+ [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
13
+ that is not a surrogate code point. — *end note*]
14
+
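Editorial sketch (not part of the diff): the ranges given in Note 1 can be written as compile-time predicates. The helper names `is_surrogate` and `is_unicode_scalar_value` are hypothetical and do not come from the standard text; this is a minimal illustration of the definitions, not a library facility.

```cpp
// Hypothetical helpers illustrating Note 1; the names are assumptions.

// A surrogate code point lies in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(char32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// A Unicode scalar value is a code point in [0, 10FFFF] (hexadecimal)
// that is not a surrogate code point.
constexpr bool is_unicode_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !is_surrogate(cp);
}

static_assert(is_unicode_scalar_value(U'A'), "U+0041 is a scalar value");
static_assert(!is_unicode_scalar_value(0xD800), "surrogates are excluded");
static_assert(!is_unicode_scalar_value(0x110000), "outside the codespace");
```

Under the new wording, every value accepted by `is_unicode_scalar_value` corresponds to exactly one element of the translation character set, whether or not Unicode has assigned an abstract character to it.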
15
+ The *basic character set* is a subset of the translation character set,
16
+ consisting of 96 characters as specified in [[lex.charset.basic]].
17
+
18
+ [*Note 2*: Unicode short names are given only as a means to identifying
19
+ the character; the numerical value has no other meaning in this
20
+ context. — *end note*]
21
+
22
+ **Table: Basic character set** <a id="lex.charset.basic">[lex.charset.basic]</a>
23
+
24
+ | character | | glyph |
25
+ | -------------------- | --------------------------- | --------------------------- |
26
+ | `U+0009` | character tabulation | |
27
+ | `U+000b` | line tabulation | |
28
+ | `U+000c` | form feed | |
29
+ | `U+0020` | space | |
30
+ | `U+000a` | line feed | new-line |
31
+ | `U+0021` | exclamation mark | `!` |
32
+ | `U+0022` | quotation mark | `"` |
33
+ | `U+0023` | number sign | `#` |
34
+ | `U+0025` | percent sign | `%` |
35
+ | `U+0026` | ampersand | `&` |
36
+ | `U+0027` | apostrophe | `'` |
37
+ | `U+0028` | left parenthesis | `(` |
38
+ | `U+0029` | right parenthesis | `)` |
39
+ | `U+002a` | asterisk | `*` |
40
+ | `U+002b` | plus sign | `+` |
41
+ | `U+002c` | comma | `,` |
42
+ | `U+002d` | hyphen-minus | `-` |
43
+ | `U+002e` | full stop | `.` |
44
+ | `U+002f` | solidus | `/` |
45
+ | `U+0030` .. `U+0039` | digit zero .. nine | `0 1 2 3 4 5 6 7 8 9` |
46
+ | `U+003a` | colon | `:` |
47
+ | `U+003b` | semicolon | `;` |
48
+ | `U+003c` | less-than sign | `<` |
49
+ | `U+003d` | equals sign | `=` |
50
+ | `U+003e` | greater-than sign | `>` |
51
+ | `U+003f` | question mark | `?` |
52
+ | `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
53
+ | | | `N O P Q R S T U V W X Y Z` |
54
+ | `U+005b` | left square bracket | `[` |
55
+ | `U+005c` | reverse solidus | `\` |
56
+ | `U+005d` | right square bracket | `]` |
57
+ | `U+005e` | circumflex accent | `^` |
58
+ | `U+005f` | low line | `_` |
59
+ | `U+0061` .. `U+007a` | latin small letter a .. z | `a b c d e f g h i j k l m` |
60
+ | | | `n o p q r s t u v w x y z` |
61
+ | `U+007b` | left curly bracket | `{` |
62
+ | `U+007c` | vertical line | `\|` |
63
+ | `U+007d` | right curly bracket | `}` |
64
+ | `U+007e` | tilde | `~` |
65
+
66
 
67
  The *universal-character-name* construct provides a way to name other
68
  characters.
69
 
70
+ ``` bnf
71
+ n-char: one of
72
+ any member of the translation character set except the U+007d (right curly bracket) or new-line character
73
+ ```
74
+
75
+ ``` bnf
76
+ n-char-sequence:
77
+ n-char
78
+ n-char-sequence n-char
79
+ ```
80
+
81
+ ``` bnf
82
+ named-universal-character:
83
+ '\N{' n-char-sequence '}'
84
+ ```
85
+
86
  ``` bnf
87
  hex-quad:
88
  hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
89
  ```
90
 
91
+ ``` bnf
92
+ simple-hexadecimal-digit-sequence:
93
+ hexadecimal-digit
94
+ simple-hexadecimal-digit-sequence hexadecimal-digit
95
+ ```
96
+
97
  ``` bnf
98
  universal-character-name:
99
  '\u' hex-quad
100
  '\U' hex-quad hex-quad
101
+ '\u{' simple-hexadecimal-digit-sequence '}'
102
+ named-universal-character
103
  ```
104
 
105
+ A *universal-character-name* of the form `\u` *hex-quad*, `\U`
106
+ *hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
107
+ designates the character in the translation character set whose Unicode
108
+ scalar value is the hexadecimal number represented by the sequence of
109
+ *hexadecimal-digit*s in the *universal-character-name*. The program is
110
+ ill-formed if that number is not a Unicode scalar value.
111
+
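Editorial sketch (not part of the diff): the two long-standing forms of *universal-character-name* can be checked against their scalar values in UTF-32 literals. The variable names are illustrative assumptions. The delimited form and the named form added by this change need a compiler that implements them, so they are described here rather than compiled: a delimited name wraps one or more hexadecimal digits in braces after the `u`, and a named one wraps a Unicode character name in braces after the `N`.

```cpp
// Illustrative names only; none of these identifiers come from the standard.

// Four-digit form: exactly one hex-quad.
constexpr char32_t e_acute = U'\u00e9';

// Eight-digit form: exactly two hex-quads.
constexpr char32_t grinning_face = U'\U0001F600';

// In a UTF-32 literal the code unit equals the Unicode scalar value.
static_assert(e_acute == 0xE9, "U+00E9");
static_assert(grinning_face == 0x1F600, "U+1F600");
```

A name whose digits denote a surrogate such as D800, or a number above 10FFFF, is not a Unicode scalar value, so a program containing it is ill-formed under the paragraph above.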
112
+ A *universal-character-name* that is a *named-universal-character*
113
+ designates the corresponding character in the Unicode Standard (chapter
114
+ 4.8 Name) if the *n-char-sequence* is equal to its character name or to
115
+ one of its character name aliases of type “control”, “correction”, or
116
+ “alternate”; otherwise, the program is ill-formed.
117
+
118
+ [*Note 3*: These aliases are listed in the Unicode Character Database’s
119
+ `NameAliases.txt`. None of these names or aliases have leading or
120
+ trailing spaces. *end note*]
121
+
122
+ If a *universal-character-name* outside the *c-char-sequence*,
123
+ *s-char-sequence*, or *r-char-sequence* of a *character-literal* or
124
+ *string-literal* (in either case, including within a
125
+ *user-defined-literal*) corresponds to a control character or to a
126
+ character in the basic character set, the program is ill-formed.
127
+
128
+ [*Note 4*: A sequence of characters resembling a
129
+ *universal-character-name* in an *r-char-sequence* [[lex.string]] does
130
+ not form a *universal-character-name*. *end note*]
131
+
132
+ The *basic literal character set* consists of all characters of the
133
+ basic character set, plus the control characters specified in
134
+ [[lex.charset.literal]].
135
+
136
+ **Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
137
+
138
+ | | |
139
+ | -------- | --------------- |
140
+ | `U+0000` | null |
141
+ | `U+0007` | alert |
142
+ | `U+0008` | backspace |
143
+ | `U+000d` | carriage return |
144
+
145
+
146
+ A *code unit* is an integer value of character type
147
+ [[basic.fundamental]]. Characters in a *character-literal* other than a
148
+ multicharacter or non-encodable character literal or in a
149
+ *string-literal* are encoded as a sequence of one or more code units, as
150
+ determined by the *encoding-prefix* [[lex.ccon]], [[lex.string]]; this
151
+ is termed the respective *literal encoding*. The
152
+ *ordinary literal encoding* is the encoding applied to an ordinary
153
+ character or string literal. The *wide literal encoding* is the encoding
154
+ applied to a wide character or string literal.
155
+
156
+ A literal encoding or a locale-specific encoding of one of the execution
157
+ character sets [[character.seq]] encodes each element of the basic
158
+ literal character set as a single code unit with non-negative value,
159
+ distinct from the code unit for any other such element.
160
+
161
+ [*Note 5*: A character not in the basic literal character set can be
162
+ encoded with more than one code unit; the value of such a code unit can
163
+ be the same as that of a code unit for an element of the basic literal
164
+ character set. — *end note*]
165
+
166
+ The U+0000 (null) character is encoded as the value `0`. No other
167
+ element of the translation character set is encoded with a code unit of
168
+ value `0`. The code unit value of each decimal digit character after the
169
+ digit `0` (`U+0030`) shall be one greater than the value of the
170
+ previous. The ordinary and wide literal encodings are otherwise
171
+ *implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
172
+ Unicode scalar value corresponding to each character of the translation
173
+ character set is encoded as specified in the Unicode Standard for the
174
+ respective Unicode encoding form.
175
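Editorial sketch (not part of the diff): the guarantees in the final paragraph, that the null character is encoded as `0` and that the decimal digits have consecutive code unit values, hold for both the ordinary and the wide literal encoding, so they can be relied on at compile time. The helper `digit_value` is a hypothetical name introduced only for this illustration.

```cpp
// Hypothetical helper: maps a decimal digit character to its numeric value,
// or returns -1 for any other character. It relies only on the guarantee
// that the code units of '0'..'9' are consecutive.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

// Consecutive digits in the ordinary and the wide literal encoding.
static_assert('1' == '0' + 1, "digits are consecutive");
static_assert('9' - '0' == 9, "ordinary literal encoding");
static_assert(L'9' - L'0' == 9, "wide literal encoding");

// U+0000 (null) is encoded as the value 0.
static_assert('\0' == 0, "null is encoded as 0");

static_assert(digit_value('7') == 7, "digit maps to its value");
```

No comparable guarantee is given for letters: the values of `'a'` through `'z'` are distinct and non-negative, but the text does not require them to be consecutive.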