From Jason Turner

[lex.char]

  1. tmp/tmpqjlue9ms/{from.md → to.md} +179 -0
@@ -0,0 +1,179 @@
+ ## Characters <a id="lex.char">[[lex.char]]</a>
+
+ ### Character sets <a id="lex.charset">[[lex.charset]]</a>
+
+ The *translation character set* consists of the following elements:
+
+ - each abstract character assigned a code point in the Unicode codespace
+ as specified in the Unicode Standard, and
+ - a distinct character for each Unicode scalar value not assigned to an
+ abstract character.
+
+ [*Note 1*: Unicode code points are integers in the range [0, 10FFFF]
+ (hexadecimal). A surrogate code point is a value in the range
+ [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point
+ that is not a surrogate code point. — *end note*]
+
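The ranges in Note 1 can be restated as compile-time predicates. This is a minimal sketch for illustration; the function names `is_code_point`, `is_surrogate`, and `is_scalar_value` are assumptions and do not come from the wording above.

```cpp
#include <cstdint>

// Hypothetical helpers, named for illustration only.

// A code point is an integer in the range [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// A surrogate code point is a value in the range [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x0041), "U+0041 is a scalar value");
static_assert(!is_scalar_value(0xD800), "surrogates are not scalar values");
static_assert(is_scalar_value(0x10FFFF), "upper bound is inclusive");
static_assert(!is_code_point(0x110000), "outside the Unicode codespace");
```

Every scalar value, assigned or not, corresponds to an element of the translation character set, so a predicate like `is_scalar_value` also describes which numbers can designate such an element.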
+ The *basic character set* is a subset of the translation character set,
+ consisting of 99 characters as specified in [[lex.charset.basic]].
+
+ [*Note 2*: Unicode short names are given only as a means of identifying
+ the character; the numerical value has no other meaning in this
+ context. — *end note*]
+
+ **Table: Basic character set** <a id="lex.charset.basic">[lex.charset.basic]</a>
+
+ | code point | character name | glyph |
+ | -------------------- | --------------------------- | --------------------------- |
+ | `U+0009` | character tabulation | |
+ | `U+000b` | line tabulation | |
+ | `U+000c` | form feed | |
+ | `U+0020` | space | |
+ | `U+000a` | line feed | new-line |
+ | `U+0021` | exclamation mark | `!` |
+ | `U+0022` | quotation mark | `"` |
+ | `U+0023` | number sign | `#` |
+ | `U+0024` | dollar sign | `$` |
+ | `U+0025` | percent sign | `%` |
+ | `U+0026` | ampersand | `&` |
+ | `U+0027` | apostrophe | `'` |
+ | `U+0028` | left parenthesis | `(` |
+ | `U+0029` | right parenthesis | `)` |
+ | `U+002a` | asterisk | `*` |
+ | `U+002b` | plus sign | `+` |
+ | `U+002c` | comma | `,` |
+ | `U+002d` | hyphen-minus | `-` |
+ | `U+002e` | full stop | `.` |
+ | `U+002f` | solidus | `/` |
+ | `U+0030` .. `U+0039` | digit zero .. nine | `0 1 2 3 4 5 6 7 8 9` |
+ | `U+003a` | colon | `:` |
+ | `U+003b` | semicolon | `;` |
+ | `U+003c` | less-than sign | `<` |
+ | `U+003d` | equals sign | `=` |
+ | `U+003e` | greater-than sign | `>` |
+ | `U+003f` | question mark | `?` |
+ | `U+0040` | commercial at | `@` |
+ | `U+0041` .. `U+005a` | latin capital letter a .. z | `A B C D E F G H I J K L M` |
+ | | | `N O P Q R S T U V W X Y Z` |
+ | `U+005b` | left square bracket | `[` |
+ | `U+005c` | reverse solidus | `\` |
+ | `U+005d` | right square bracket | `]` |
+ | `U+005e` | circumflex accent | `^` |
+ | `U+005f` | low line | `_` |
+ | `U+0060` | grave accent | `` ` `` |
+ | `U+0061` .. `U+007a` | latin small letter a .. z | `a b c d e f g h i j k l m` |
+ | | | `n o p q r s t u v w x y z` |
+ | `U+007b` | left curly bracket | `{` |
+ | `U+007c` | vertical line | `\|` |
+ | `U+007d` | right curly bracket | `}` |
+ | `U+007e` | tilde | `~` |
+
+
+ The *basic literal character set* consists of all characters of the
+ basic character set, plus the control characters specified in
+ [[lex.charset.literal]].
+
+ **Table: Additional control characters in the basic literal character set** <a id="lex.charset.literal">[lex.charset.literal]</a>
+
+ | code point | character name |
+ | -------- | --------------- |
+ | `U+0000` | null |
+ | `U+0007` | alert |
+ | `U+0008` | backspace |
+ | `U+000d` | carriage return |
+
+
+ A *code unit* is an integer value of character type
+ [[basic.fundamental]]. Characters in a *character-literal* other than a
+ multicharacter or non-encodable character literal or in a
+ *string-literal* are encoded as a sequence of one or more code units, as
+ determined by the *encoding-prefix* [[lex.ccon]], [[lex.string]]; this
+ is termed the respective *literal encoding*. The
+ *ordinary literal encoding* is the encoding applied to an ordinary
+ character or string literal. The *wide literal encoding* is the encoding
+ applied to a wide character or string literal.
+
+ A literal encoding or a locale-specific encoding of one of the execution
+ character sets [[character.seq]] encodes each element of the basic
+ literal character set as a single code unit with non-negative value,
+ distinct from the code unit for any other such element.
+
+ [*Note 3*: A character not in the basic literal character set can be
+ encoded with more than one code unit; the value of such a code unit can
+ be the same as that of a code unit for an element of the basic literal
+ character set. — *end note*]
+
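The difference between single and multiple code units can be checked at compile time. The sketch below uses the UTF-8, UTF-16, and UTF-32 prefixes because their encodings are fixed, whereas the ordinary and wide literal encodings vary by implementation. The helper `code_unit_count` is a hypothetical name introduced here for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null code unit.
template <class CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// U+20AC (euro sign) is not in the basic literal character set.
static_assert(code_unit_count(u8"\u20AC") == 3, "three UTF-8 code units");
static_assert(code_unit_count(u"\u20AC") == 1, "one UTF-16 code unit");

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(code_unit_count(u"\U0001F600") == 2, "a UTF-16 surrogate pair");
static_assert(code_unit_count(U"\U0001F600") == 1, "one UTF-32 code unit");

// Elements of the basic literal character set take exactly one code unit,
// with a non-negative value distinct from that of any other element.
static_assert(code_unit_count("A") == 1, "one ordinary code unit");
static_assert('A' > 0 && 'A' != 'B', "non-negative and distinct");
```

The second half of Note 3 (a code unit of a multi-unit character coinciding with the code unit of a basic character) cannot occur in UTF-8, so it is not demonstrated here; it depends on the implementation's ordinary or wide literal encoding.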
+ The U+0000 (null) character is encoded as the value `0`. No other
+ element of the translation character set is encoded with a code unit of
+ value `0`. The code unit value of each decimal digit character after the
+ digit `0` (`U+0030`) shall be one greater than the value of the
+ previous. The ordinary and wide literal encodings are otherwise
+ *implementation-defined*. For a UTF-8, UTF-16, or UTF-32 literal, the
+ implementation shall encode the Unicode scalar value corresponding to
+ each character of the translation character set as specified in the
+ Unicode Standard for the respective Unicode encoding form.
+
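These guarantees are what make the common `c - '0'` idiom portable. The following sketch checks the digit and null rules for ordinary and wide literals and spot-checks the Unicode encoding forms; `digit_value` is a hypothetical helper, and the expected code unit values come from the UTF-8 and UTF-16 definitions in the Unicode Standard.

```cpp
// Hypothetical helper; meaningful only when c is one of '0' through '9'.
constexpr int digit_value(char c) { return c - '0'; }

// Decimal digits have consecutive code unit values in every literal encoding.
static_assert(digit_value('7') == 7, "ordinary digits are consecutive");
static_assert(L'9' - L'0' == 9, "wide digits are consecutive");

// U+0000 (null) is always encoded as the value 0.
static_assert('\0' == 0 && L'\0' == 0, "null is encoded as 0");

// UTF-8 encodes U+00E9 as the two code units C3 A9.
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3 &&
              static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9,
              "UTF-8 encoding form");

// UTF-16 encodes U+1F600 as the surrogate pair D83D DE00.
static_assert(u"\U0001F600"[0] == 0xD83D && u"\U0001F600"[1] == 0xDE00,
              "UTF-16 encoding form");

// UTF-32 stores the scalar value directly.
static_assert(U"\U0001F600"[0] == 0x1F600, "UTF-32 encoding form");
```

No comparable guarantee exists for letters: the wording above only constrains the decimal digits, so an expression such as `c - 'a'` relies on the implementation-defined part of the encoding.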
+ ### Universal character names <a id="lex.universal.char">[[lex.universal.char]]</a>
+
+ ``` bnf
+ n-char:
+ any member of the translation character set except the U+007d (right curly bracket) or new-line character
+ ```
+
+ ``` bnf
+ n-char-sequence:
+ n-char n-char-sequenceₒₚₜ
+ ```
+
+ ``` bnf
+ named-universal-character:
+ '\N{' n-char-sequence '}'
+ ```
+
+ ``` bnf
+ hex-quad:
+ hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
+ ```
+
+ ``` bnf
+ simple-hexadecimal-digit-sequence:
+ hexadecimal-digit simple-hexadecimal-digit-sequenceₒₚₜ
+ ```
+
+ ``` bnf
+ universal-character-name:
+ '\u' hex-quad
+ '\U' hex-quad hex-quad
+ '\u{' simple-hexadecimal-digit-sequence '}'
+ named-universal-character
+ ```
+
+ The *universal-character-name* construct provides a way to name any
+ element in the translation character set using just the basic character
+ set. If a *universal-character-name* outside the *c-char-sequence*,
+ *s-char-sequence*, or *r-char-sequence* of a *character-literal* or
+ *string-literal* (in either case, including within a
+ *user-defined-literal*) corresponds to a control character or to a
+ character in the basic character set, the program is ill-formed.
+
+ [*Note 1*: A sequence of characters resembling a
+ *universal-character-name* in an *r-char-sequence* [[lex.string]] does
+ not form a *universal-character-name*. — *end note*]
+
+ A *universal-character-name* of the form `\u` *hex-quad*, `\U`
+ *hex-quad* *hex-quad*, or `\u{simple-hexadecimal-digit-sequence}`
+ designates the character in the translation character set whose Unicode
+ scalar value is the hexadecimal number represented by the sequence of
+ *hexadecimal-digit*s in the *universal-character-name*. The program is
+ ill-formed if that number is not a Unicode scalar value.
+
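As an illustration, the four-digit and eight-digit forms designate the same character when their digits represent the same number. The constant names below are assumptions made for this sketch. The delimited form `\u{E9}` would designate the same character, but it is left out so that the example also compiles with compilers that predate it, and a value in the surrogate range such as D800 is not shown because it would make the program ill-formed.

```cpp
// Both spellings name U+00E9; only the number of hexadecimal digits differs.
constexpr char32_t e_acute_short = U'\u00E9';      // one hex-quad
constexpr char32_t e_acute_long  = U'\U000000E9';  // two hex-quads

static_assert(e_acute_short == 0xE9, "value is the hexadecimal number");
static_assert(e_acute_long == e_acute_short, "both forms agree");

// The largest Unicode scalar value needs the eight-digit form.
constexpr char32_t max_scalar = U'\U0010FFFF';
static_assert(max_scalar == 0x10FFFF, "upper end of the codespace");
```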
+ A *universal-character-name* that is a *named-universal-character*
+ designates the corresponding character in the Unicode Standard (chapter
+ 4.8 Name) if the *n-char-sequence* is equal to its character name or to
+ one of its character name aliases of type “control”, “correction”, or
+ “alternate”; otherwise, the program is ill-formed.
+
+ [*Note 2*: These aliases are listed in the Unicode Character Database’s
+ `NameAliases.txt`. None of these names or aliases have leading or
+ trailing spaces. — *end note*]
+
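A named escape can be compared against its numeric equivalent. The sketch below assumes the compiler advertises support through the feature-test macro `__cpp_named_character_escapes` and falls back to the numeric form otherwise; the constant name `named_e_acute` is hypothetical.

```cpp
#if defined(__cpp_named_character_escapes)
// The n-char-sequence must match the Unicode character name exactly.
constexpr char32_t named_e_acute = U'\N{LATIN SMALL LETTER E WITH ACUTE}';
#else
// Fallback for compilers without named escapes: same character, numeric form.
constexpr char32_t named_e_acute = U'\u00E9';
#endif

static_assert(named_e_acute == 0xE9, "the name designates U+00E9");
```

Because the match is exact, a misspelled name or one padded with spaces inside the braces would make the program ill-formed rather than fall back to a nearby character.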