@@ -16,15 +16,68 @@ Arduino library to implement float16 data type.
## Description

This **experimental** library defines the float16 (2 byte) data type, including conversion
- function to and from float32 type. It is definitely **work in progress**.
-
- The library implements the **Printable** interface so one can directly print the
- float16 values in any stream e.g. Serial.
+ functions to and from the float32 type.

The primary usage of the float16 data type is to efficiently store and transport
a floating point number. As it uses only 2 bytes where float and double typically use
4 and 8 bytes, gains can be made at the price of range and precision.

+ Note that float16 only has ~3 significant digits.
+
+ To print a float16, one needs to convert it with toFloat(), toDouble() or toString(decimals).
+ The latter allows concatenation and further conversion to a char array.
+
+ In versions before 0.3.0 the Printable interface was implemented, but it has been removed
+ as it caused excessive memory usage when declaring arrays of float16.
+
+
+ #### ARM alternative half-precision
+
+ - https://en.wikipedia.org/wiki/Half-precision_floating-point_format#ARM_alternative_half-precision
+
+ _ARM processors support (via a floating point control register bit)
+ an "alternative half-precision" format, which does away with the
+ special case for an exponent value of 31 (11111₂). It is almost
+ identical to the IEEE format, but there is no encoding for infinity or NaNs;
+ instead, an exponent of 31 encodes normalized numbers in the range 65536 to 131008._
+
+ Implemented in https://github.com/RobTillaart/float16ext class.
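
As a rough illustration of the quoted format, the sketch below decodes a 16 bit pattern under the alternative rules, where exponent 31 is treated as an ordinary exponent. It is a standalone example, not code taken from float16ext, and the function name `altHalfBitsToFloat` is made up for this illustration.

```cpp
#include <cmath>
#include <cstdint>

//  Hypothetical helper, not part of float16 or float16ext.
//  Decodes a 16 bit pattern using the ARM alternative half-precision rules:
//  exponent 31 is an ordinary exponent, so there is no INF or NAN encoding.
float altHalfBitsToFloat(uint16_t bits)
{
  const int sign     = (bits >> 15) & 0x0001;
  const int exponent = (bits >> 10) & 0x001F;
  const int mantissa = bits & 0x03FF;

  float value;
  if (exponent == 0)
  {
    //  zero or subnormal: mantissa * 2^-24
    value = std::ldexp((float)mantissa, -24);
  }
  else
  {
    //  normalized, including exponent 31: (1024 + mantissa) * 2^(exponent - 25)
    value = std::ldexp((float)(1024 + mantissa), exponent - 25);
  }
  return sign ? -value : value;
}
```

Under these rules `altHalfBitsToFloat(0x7C00)` yields 65536 and `altHalfBitsToFloat(0x7FFF)` yields 131008, the range quoted above. In the IEEE format the same two patterns encode infinity and a NaN.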
+
+
+ #### Difference between float16 and float16ext
+
+ The float16ext library has an extended range: the largest representable magnitude
+ grows from ±65504 to ±131008.
+
+ The float16ext does not support INF, -INF and NAN. These values are mapped to
+ the largest positive, the largest negative and the largest positive number respectively.
+
+ The -0 and 0 values will both exist.
+
+
+ Although they share a lot of code, float16 and float16ext should not be mixed.
+ In the future these libraries might merge / derive one from the other.
+
+
+ #### Breaking change 0.3.0
+
+ Version 0.3.0 has a breaking change. The **Printable** interface is removed as
+ it caused arrays of float16 to be larger than expected (see #16). On ESP8266 every
+ float16 object was 8 bytes and on AVR it was 5 bytes instead of the expected 2 bytes.
+
+ To support printing, the class added two new conversion functions:
+ ```cpp
+ f16.toFloat();             //  returns the value as a float
+ f16.toString(decimals);    //  returns the value as a String
+
+ Serial.println(f16.toFloat(), 4);
+ Serial.println(f16.toString(4));
+ ```
+ This keeps printing relatively easy.
+
+ The footprint of the library is now smaller and one can now create compact arrays
+ of float16 elements using only 2 bytes per element.
+


#### Breaking change 0.2.0
@@ -34,26 +87,28 @@ For some specific values the mantissa overflowed when the float 16 was
assigned a value to. This overflow was not detected / corrected.

During the analysis of this bug it became clear that the sub-normal numbers
- were also implemented correctly. This is fixed too in 0.2.0.
+ were also not implemented correctly. This is fixed too in 0.2.0.

- There is still an issue 0 versus -0
+ There is still an issue with 0 versus -0 (sign gets lost in conversion).

**This makes all pre-0.2.0 versions obsolete.**


## Specifications
- | attribute | value | notes |
- | :----------| :-------------| :--------|
- | size | 2 bytes | layout s eeeee mmmmmmmmmm (1,5,10)
- | sign | 1 bit |
- | exponent | 5 bit |
- | mantissa | 10 bit | ~ 3 digits
- | minimum | 5.96046 E−8 | smallest positive number.
- | | 1.0009765625 | 1 + 2^−10 = smallest number larger than 1.
- | maximum | 65504 |
- | | |
+ | Attribute   | Value           | Notes    |
+ |:------------|:----------------|:---------|
+ | size        | 2 bytes         | layout s eeeee mmmmmmmmmm (1, 5, 10) |
+ | sign        | 1 bit           |          |
+ | exponent    | 5 bits          |          |
+ | mantissa    | 10 bits         | 3 - 4 digits |
+ | minimum     | ±5.96046 E−8    | smallest non-zero magnitude (subnormal). |
+ |             | ±1.0009765625   | 1 + 2^−10 = smallest number larger than 1. |
+ | maximum     | ±65504          |          |
+ |             |                 |          |
+
+ ± = ALT 0177
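
To make the layout in the table concrete, the sketch below decodes a raw 16 bit pattern by hand. It is a standalone illustration of the IEEE 754 half-precision rules, not the library's own conversion code, and the function name `halfBitsToFloat` is invented for this example.

```cpp
#include <cmath>
#include <cstdint>

//  Hypothetical helper, not part of the float16 library.
//  Decodes an IEEE 754 half-precision bit pattern (s eeeee mmmmmmmmmm) to float.
float halfBitsToFloat(uint16_t bits)
{
  const int sign     = (bits >> 15) & 0x0001;   //  1 bit
  const int exponent = (bits >> 10) & 0x001F;   //  5 bits, bias 15
  const int mantissa = bits & 0x03FF;           //  10 bits

  float value;
  if (exponent == 0)
  {
    //  zero or subnormal: mantissa * 2^-24
    value = std::ldexp((float)mantissa, -24);
  }
  else if (exponent == 31)
  {
    //  all exponent bits set: infinity or not a number
    value = (mantissa == 0) ? INFINITY : NAN;
  }
  else
  {
    //  normalized: (1024 + mantissa) * 2^(exponent - 25)
    value = std::ldexp((float)(1024 + mantissa), exponent - 25);
  }
  return sign ? -value : value;
}
```

With this decoder `0x3C00` gives 1.0, `0x3C01` gives 1.0009765625, `0x7BFF` gives 65504 (the maximum in the table) and `0x0001` gives 2^−24, which is about 5.96046 E−8 (the minimum in the table).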

#### Example values
@@ -87,6 +142,10 @@ Source: https://en.wikipedia.org/wiki/Half-precision_floating-point_format
#### Related

- https://wokwi.com/projects/376313228108456961 (demo of its usage)
+ - https://github.com/RobTillaart/float16
+ - https://github.com/RobTillaart/float16ext
+ - https://github.com/RobTillaart/fraction
+ - https://en.wikipedia.org/wiki/Half-precision_floating-point_format


## Interface
@@ -97,28 +156,35 @@ Source: https://en.wikipedia.org/wiki/Half-precision_floating-point_format

#### Constructors

- - **float16(void)** defaults to zero.
+ - **float16(void)** value defaults to zero.
- **float16(double f)** constructor.
- **float16(const float16 &f)** copy constructor.


#### Conversion

- - **double toDouble(void)** convert to double (or float).
+ - **double toDouble(void)** converts the value to double (same as float on platforms where both are 4 bytes, e.g. UNO).
+ - **float toFloat(void)** converts the value to float.
+ - **String toString(unsigned int decimals = 2)** converts the value to a String with the given number of decimals.
+ Please note that the accuracy is only 3-4 digits for the whole number, so use decimals
+ with care.
+
+
+ #### Export and store
+
+ To serialize the internal format, e.g. to disk, two helper functions are available.
+
- **uint16_t getBinary()** get the 2 byte binary representation.
- **void setBinary(uint16_t u)** set the 2 byte binary representation.
- - **size_t printTo(Print& p) const** Printable interface.
- - **void setDecimals(uint8_t d)** idem, used for printTo.
- - **uint8_t getDecimals()** idem.
-
- Note the setDecimals takes one byte per object which is not efficient for arrays of float16.
- See array example for efficient storage using set/getBinary() functions.
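
A minimal sketch of how the 2 byte patterns returned by getBinary() could be written to and read back from a byte buffer, for example before handing the buffer to an SD card or EEPROM driver. The helper names `packBinary16` and `unpackBinary16` and the low-byte-first order are assumptions made for this illustration; they are not part of the library.

```cpp
#include <stddef.h>
#include <stdint.h>

//  Hypothetical helpers, not part of the float16 library.
//  Copies count 16 bit patterns into buffer, low byte first.
//  Returns the number of bytes written (2 per element).
size_t packBinary16(const uint16_t *raw, size_t count, uint8_t *buffer)
{
  for (size_t i = 0; i < count; i++)
  {
    buffer[2 * i]     = (uint8_t)(raw[i] & 0xFF);
    buffer[2 * i + 1] = (uint8_t)(raw[i] >> 8);
  }
  return 2 * count;
}

//  Rebuilds the 16 bit patterns from a buffer written by packBinary16().
//  Returns the number of elements restored.
size_t unpackBinary16(const uint8_t *buffer, size_t byteCount, uint16_t *raw)
{
  const size_t count = byteCount / 2;
  for (size_t i = 0; i < count; i++)
  {
    raw[i] = (uint16_t)(buffer[2 * i] | (buffer[2 * i + 1] << 8));
  }
  return count;
}
```

On the Arduino side each `raw[i]` would come from `getBinary()` and be restored with `setBinary(raw[i])`, so the stored data stays at 2 bytes per element.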

#### Compare

- Standard compare functions. Since 0.1.5 these are quite optimized,
- so it is fast to compare e.g. 2 measurements.
+ The library implements the standard compare functions.
+ These are optimized, so it is fast to compare 2 float16 values.
+
+ Note: comparison with a float or double always includes a conversion.
+ You can improve performance by converting e.g. a threshold only once before comparison.

- **bool operator == (const float16& f)**
- **bool operator != (const float16& f)**
@@ -143,20 +209,16 @@ Not planned to optimize these.
- **float16& operator \*= (const float16& f)**
- **float16& operator /= (const float16& f)**

- negation operator.
+ Negation operator.
- **float16 operator - ()** fast negation.

+ Math helpers.
- **int sign()** returns 1 == positive, 0 == zero, -1 == negative.
- **bool isZero()** returns true if zero. Slightly faster than **sign()**.
- - **bool isInf()** returns true if value is (-)infinite.
-
-
- #### Experimental 0.1.8
-
- - **bool isNaN()** returns true if value is not a number.
-
-
- ## Notes
+ - **bool isNaN()** returns true if value is not a number.
+ - **bool isInf()** returns true if value is ±infinity.
+ - **bool isPosInf()** returns true if value is +infinity.
+ - **bool isNegInf()** returns true if value is -infinity.


## Future
@@ -167,26 +229,19 @@ negation operator.

#### Should

- - unit tests of the above.
- how to handle 0 == -0 (0x0000 == 0x8000)
- - investigate ARM alternative half-precision
- _ARM processors support (via a floating point control register bit)
- an "alternative half-precision" format, which does away with the
- special case for an exponent value of 31 (11111₂). It is almost
- identical to the IEEE format, but there is no encoding for infinity or NaNs;
- instead, an exponent of 31 encodes normalized numbers in the range 65536 to 131008._
-

#### Could

- - copy constructor?
- - update documentation.
+ - unit tests.
- error handling.
  - divide by zero errors.
- look for optimizations.
- rewrite **f16tof32()** with bit magic.
- - add storage example - with SD card, FRAM or EEPROM
- - add communication example - serial or Ethernet?
+ - add examples
+   - persistent storage e.g. SD card, FRAM or EEPROM.
+   - communication e.g. Serial or Ethernet (XML, JSON)?
+   - sorting an array of float16?

#### Wont