From Jason Turner

[basic.extended.fp]

Files changed (1)
  1. tmp/tmpdyc6ovse/{from.md → to.md} +65 -0
tmp/tmpdyc6ovse/{from.md → to.md} RENAMED
@@ -0,0 +1,65 @@
+ ### Optional extended floating-point types <a id="basic.extended.fp">[[basic.extended.fp]]</a>
+
+ If the implementation supports an extended floating-point type
+ [[basic.fundamental]] whose properties are specified by the ISO/IEC/IEEE
+ 60559 floating-point interchange format binary16, then the
+ *typedef-name* `std::float16_t` is defined in the header `<stdfloat>`
+ and names such a type, the macro `__STDCPP_FLOAT16_T__` is defined
+ [[cpp.predefined]], and the floating-point literal suffixes `f16` and
+ `F16` are supported [[lex.fcon]].
+
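The paragraph above ties three things together: the typedef, the macro, and the literal suffixes. A minimal sketch of how user code might guard on the macro before using the other two follows. `HAVE_STD_FLOAT16` and `half_round_trip` are hypothetical names introduced only for this illustration, and the extra `__has_include` check is a defensive assumption for toolchains that do not ship `<stdfloat>`.

```cpp
#if __has_include(<stdfloat>)
#  include <stdfloat>  // C++23 header; may be absent on older toolchains
#endif

// HAVE_STD_FLOAT16 is a hypothetical convenience macro, not part of the standard.
#if defined(__STDCPP_FLOAT16_T__) && __has_include(<stdfloat>)
#  define HAVE_STD_FLOAT16 1
#else
#  define HAVE_STD_FLOAT16 0
#endif

// Returns 1.5 after a round trip through std::float16_t when the type is
// available, and plain 1.5f otherwise. 1.5 is exactly representable in binary16.
inline float half_round_trip() {
#if HAVE_STD_FLOAT16
    constexpr std::float16_t h = 1.5f16;  // the f16 suffix yields std::float16_t
    return static_cast<float>(h);
#else
    return 1.5f;
#endif
}
```

Calling `half_round_trip()` gives the same value on either path, so the sketch compiles and behaves identically whether or not the implementation supports binary16.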
+ If the implementation supports an extended floating-point type whose
+ properties are specified by the ISO/IEC/IEEE 60559 floating-point
+ interchange format binary32, then the *typedef-name* `std::float32_t` is
+ defined in the header `<stdfloat>` and names such a type, the macro
+ `__STDCPP_FLOAT32_T__` is defined, and the floating-point literal
+ suffixes `f32` and `F32` are supported.
+
+ If the implementation supports an extended floating-point type whose
+ properties are specified by the ISO/IEC/IEEE 60559 floating-point
+ interchange format binary64, then the *typedef-name* `std::float64_t` is
+ defined in the header `<stdfloat>` and names such a type, the macro
+ `__STDCPP_FLOAT64_T__` is defined, and the floating-point literal
+ suffixes `f64` and `F64` are supported.
+
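Because `std::float32_t` and `std::float64_t` name extended floating-point types, they are distinct from `float` and `double` even where the underlying format is the same. The sketch below illustrates that, along with the `f64` suffix. `HAVE_STD_FLOAT32_64` and `twice_via_float64` are hypothetical names used only for this example, and the fallback branch is an assumption for implementations that do not define the macros.

```cpp
#include <type_traits>
#if __has_include(<stdfloat>)
#  include <stdfloat>  // C++23 header; may be absent on older toolchains
#endif

// HAVE_STD_FLOAT32_64 is a hypothetical convenience macro, not part of the standard.
#if defined(__STDCPP_FLOAT32_T__) && defined(__STDCPP_FLOAT64_T__) && \
    __has_include(<stdfloat>)
#  define HAVE_STD_FLOAT32_64 1
// Extended floating-point types are never the same type as a standard one.
static_assert(!std::is_same_v<std::float32_t, float>);
static_assert(!std::is_same_v<std::float64_t, double>);
#else
#  define HAVE_STD_FLOAT32_64 0
#endif

// Doubles x, routing the arithmetic through the fixed-width types when present.
inline double twice_via_float64(float x) {
#if HAVE_STD_FLOAT32_64
    const auto narrow = static_cast<std::float32_t>(x);
    const auto wide = static_cast<std::float64_t>(narrow);  // binary32 -> binary64
    return static_cast<double>(wide * 2.0f64);              // f64 literal suffix
#else
    return static_cast<double>(x) * 2.0;
#endif
}
```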
+ If the implementation supports an extended floating-point type whose
+ properties are specified by the ISO/IEC/IEEE 60559 floating-point
+ interchange format binary128, then the *typedef-name* `std::float128_t`
+ is defined in the header `<stdfloat>` and names such a type, the macro
+ `__STDCPP_FLOAT128_T__` is defined, and the floating-point literal
+ suffixes `f128` and `F128` are supported.
+
+ If the implementation supports an extended floating-point type with the
+ properties, as specified by ISO/IEC/IEEE 60559, of radix (b) of 2,
+ storage width in bits (k) of 16, precision in bits (p) of 8, maximum
+ exponent (emax) of 127, and exponent field width in bits (w) of 8, then
+ the *typedef-name* `std::bfloat16_t` is defined in the header
+ `<stdfloat>` and names such a type, the macro `__STDCPP_BFLOAT16_T__` is
+ defined, and the floating-point literal suffixes `bf16` and `BF16` are
+ supported.
+
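Each of the five types is optional and is announced by its own macro, so a program can discover which ones are present using only the preprocessor. A minimal sketch, assuming nothing beyond the five macro names given above, follows. `supported_extended_fp_names` is a hypothetical helper introduced for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: lists the <stdfloat> typedef-names whose feature
// macro is defined by this implementation. It does not need <stdfloat> itself.
inline std::vector<std::string> supported_extended_fp_names() {
    std::vector<std::string> names;
#ifdef __STDCPP_FLOAT16_T__
    names.push_back("std::float16_t");
#endif
#ifdef __STDCPP_FLOAT32_T__
    names.push_back("std::float32_t");
#endif
#ifdef __STDCPP_FLOAT64_T__
    names.push_back("std::float64_t");
#endif
#ifdef __STDCPP_FLOAT128_T__
    names.push_back("std::float128_t");
#endif
#ifdef __STDCPP_BFLOAT16_T__
    names.push_back("std::bfloat16_t");
#endif
    return names;
}
```

The result may be empty: an implementation that supports none of these formats defines none of the macros.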
+ [*Note 1*: A summary of the parameters for each type is given in
+ [[basic.extended.fp]]. The precision p includes the implicit 1 bit at
+ the beginning of the mantissa, so the storage used for the mantissa is
+ p-1 bits. ISO/IEC/IEEE 60559 does not assign a name for a type having
+ the parameters specified for `std::bfloat16_t`. — *end note*]
+
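The note's remark that the mantissa occupies p-1 stored bits can be checked against the parameters listed in this subsection. The sketch below assumes the usual ISO/IEC/IEEE 60559 binary layout of one sign bit, w exponent bits, and p-1 stored mantissa bits, together with the relation emax = 2^(w-1) - 1. Neither relation is stated in the text above, so both are assumptions of this example, and `FormatParams` and the helper functions are hypothetical names.

```cpp
// Parameters copied from this subsection's summary of named types.
struct FormatParams {
    const char* name;
    int k;     // storage width in bits
    int p;     // precision in bits, including the implicit leading 1
    int emax;  // maximum exponent
    int w;     // exponent field width in bits
};

constexpr FormatParams kFormats[] = {
    {"float16_t",   16,  11,    15,  5},
    {"float32_t",   32,  24,   127,  8},
    {"float64_t",   64,  53,  1023, 11},
    {"float128_t", 128, 113, 16383, 15},
    {"bfloat16_t",  16,   8,   127,  8},
};

// Assumed layout: 1 sign bit + w exponent bits + (p - 1) stored mantissa bits.
constexpr bool layout_fills_storage(const FormatParams& f) {
    return 1 + f.w + (f.p - 1) == f.k;
}

// Assumed relation between exponent field width and maximum exponent.
constexpr bool emax_matches_width(const FormatParams& f) {
    return f.emax == (1 << (f.w - 1)) - 1;
}

constexpr bool all_consistent() {
    for (const auto& f : kFormats) {
        if (!layout_fills_storage(f) || !emax_matches_width(f)) return false;
    }
    return true;
}

static_assert(all_consistent(), "listed parameters are mutually consistent");
```

All five rows satisfy both relations, including `bfloat16_t`, which shares w and emax with binary32 but keeps only 8 bits of precision.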
+ **Table: Properties of named extended floating-point types** <a id="basic.extended.fp">[basic.extended.fp]</a>
+
+ | Parameter | `float16_t` | `float32_t` | `float64_t` | `float128_t` | `bfloat16_t` |
+ | --------------------------------- | ----------- | ----------- | ----------- | ------------ | ------------ |
+ | ISO/IEC/IEEE 60559 name | binary16 | binary32 | binary64 | binary128 | |
+ | $k$, storage width in bits | 16 | 32 | 64 | 128 | 16 |
+ | $p$, precision in bits | 11 | 24 | 53 | 113 | 8 |
+ | $emax$, maximum exponent | 15 | 127 | 1023 | 16383 | 127 |
+ | $w$, exponent field width in bits | 5 | 8 | 11 | 15 | 8 |
+
+
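The table's p and emax rows can be compared with what `std::numeric_limits` reports. The sketch below assumes that `digits` corresponds to p and that `max_exponent` equals emax + 1, since `max_exponent` is defined as one more than the largest exponent of a finite value. `matches_table` and `HAVE_STDFLOAT_HEADER` are hypothetical names introduced for this example, and each check is compiled only when the corresponding macro is defined.

```cpp
#include <limits>
#if __has_include(<stdfloat>)
#  include <stdfloat>  // C++23 header; may be absent on older toolchains
#  define HAVE_STDFLOAT_HEADER 1  // hypothetical convenience macro
#else
#  define HAVE_STDFLOAT_HEADER 0
#endif

// Hypothetical helper: true when numeric_limits<T> agrees with the table's
// precision p and maximum exponent emax for a radix-2 type.
template <class T>
constexpr bool matches_table(int p, int emax) {
    return std::numeric_limits<T>::radix == 2 &&
           std::numeric_limits<T>::digits == p &&
           std::numeric_limits<T>::max_exponent == emax + 1;
}

#if HAVE_STDFLOAT_HEADER && defined(__STDCPP_FLOAT16_T__)
static_assert(matches_table<std::float16_t>(11, 15));
#endif
#if HAVE_STDFLOAT_HEADER && defined(__STDCPP_FLOAT32_T__)
static_assert(matches_table<std::float32_t>(24, 127));
#endif
#if HAVE_STDFLOAT_HEADER && defined(__STDCPP_FLOAT64_T__)
static_assert(matches_table<std::float64_t>(53, 1023));
#endif
#if HAVE_STDFLOAT_HEADER && defined(__STDCPP_FLOAT128_T__)
static_assert(matches_table<std::float128_t>(113, 16383));
#endif
#if HAVE_STDFLOAT_HEADER && defined(__STDCPP_BFLOAT16_T__)
static_assert(matches_table<std::bfloat16_t>(8, 127));
#endif
```

The same helper can be applied to a standard type for comparison, for example `matches_table<float>(24, 127)` on platforms where `float` uses the binary32 format.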
+ *Recommended practice:* Any names that the implementation provides for
+ the extended floating-point types described in this subsection that are
+ in addition to the names defined in the `<stdfloat>` header should be
+ chosen to increase compatibility and interoperability with the
+ interchange types `_Float16`, `_Float32`, `_Float64`, and `_Float128`
+ defined in ISO/IEC TS 18661-3 and with future versions of the C
+ standard.
+