Many embedded DSPs and low-powered processors do not support floating-point arithmetic. Instead, they support only integer and fixed-point arithmetic. Fixed-point representations make fractional arithmetic possible using only integers.
So what is fixed-point? As the name implies, it describes a number in which the decimal point sits in a fixed position; since we are dealing with binary representations, this is more precisely the binary point. For example, a 16-bit fixed-point number might use 4 bits for the integer part and 12 bits for the fractional part. In two's complement, this means that the integer part lies in the range [−8, 7] and the fractional part in the range [0, 1).
Format | Bit 15 | Bit 14 | Bit 13 | Bit 12 | Bit 11 | Bit 10 | Bit 9 | Bit 8 | Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1 | Bit 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Integer | −2¹⁵ | 2¹⁴ | 2¹³ | 2¹² | 2¹¹ | 2¹⁰ | 2⁹ | 2⁸ | 2⁷ | 2⁶ | 2⁵ | 2⁴ | 2³ | 2² | 2¹ | 2⁰ |
12 fractional bits | −2³ | 2² | 2¹ | 2⁰ | 2⁻¹ | 2⁻² | 2⁻³ | 2⁻⁴ | 2⁻⁵ | 2⁻⁶ | 2⁻⁷ | 2⁻⁸ | 2⁻⁹ | 2⁻¹⁰ | 2⁻¹¹ | 2⁻¹² |
Fixed-point numbers are stored as integers. Every integer type has a range that is determined by the number of bits used to encode it and by whether it is signed or unsigned. For example, a signed 16-bit integer has 2¹⁶ = 65536 possible values, ranging from −32768 (0b1000000000000000) to 32767 (0b0111111111111111). Moving the binary point to the left reduces this range (while increasing the precision). Note, though, that the position of the binary point is not encoded in the data itself: 0b0001000000000000 conveys 4096 as a plain integer, but 1 if there are 12 fractional bits. The position of the binary point therefore has to be conveyed in metadata, comments, or documentation. This description of the number of integer and fractional bits is known as the Q-format; for example, Q4.12 denotes 4 integer bits and 12 fractional bits.
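To make this concrete, the short C sketch below interprets the same 16-bit pattern first as a plain integer and then as a value with 12 fractional bits (Q4.12). The `q4_12_t` typedef and the `Q4_12_FRAC_BITS` constant are illustrative names for this example, not part of any standard.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative Q4.12 type: 4 integer bits, 12 fractional bits.
 * The stored value is just a plain signed 16-bit integer; the
 * position of the binary point exists only by convention. */
typedef int16_t q4_12_t;
#define Q4_12_FRAC_BITS 12

int main(void)
{
    q4_12_t raw = 0x1000;   /* 0b0001000000000000 */

    /* Interpreted as a plain integer. */
    printf("as integer: %d\n", raw);   /* prints 4096 */

    /* Interpreted as Q4.12: scale by 2^-12 to recover the value. */
    printf("as Q4.12:   %f\n", raw / (double)(1 << Q4_12_FRAC_BITS));   /* prints 1.000000 */

    return 0;
}
```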
The choice of the number of integer and fractional bits is a tradeoff: more fractional bits increase the precision whilst reducing the numeric range, and fewer fractional bits have the opposite effect. Some examples are given in the table below, followed by a short sketch that reproduces the figures. It's usually preferable to keep the integer range as small as the data allows, whilst still avoiding overflow, in order to maximise the precision (and hence minimise any numeric error).
Number of bits | Integer bits | Fractional bits | Minimum value | Maximum value | Resolution |
---|---|---|---|---|---|
16 | 16 | 0 | −32768 | 32767 | 1 |
16 | 4 | 12 | −8.0 | 7.999755859375 | 0.000244140625 |
16 | 1 | 15 | −1.0 | 0.999969482421875 | 0.000030517578125 |
16 | 12 | 4 | −2048.0 | 2047.9375 | 0.0625 |
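The figures in this table can be reproduced with a few lines of C. The sketch below assumes a signed 16-bit container, and `describe_q_format` is an illustrative helper name rather than a library function.

```c
#include <stdio.h>

/* Print the range and resolution of a signed 16-bit fixed-point
 * format with the given number of fractional bits. */
void describe_q_format(int frac_bits)
{
    double resolution = 1.0 / (double)(1 << frac_bits);   /* 2^-frac_bits */
    double min = -32768.0 * resolution;                   /* most negative code */
    double max =  32767.0 * resolution;                   /* most positive code */

    printf("Q%d.%d: min %.15g, max %.15g, resolution %.15g\n",
           16 - frac_bits, frac_bits, min, max, resolution);
}

int main(void)
{
    describe_q_format(0);    /* plain 16-bit integer */
    describe_q_format(12);   /* Q4.12 */
    describe_q_format(15);   /* Q1.15 */
    return 0;
}
```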
Conversion between integer and fractional numbers is straightforward. For a fractional number $x$ and a format with $n$ fractional bits, the stored integer is

$$i = \operatorname{round}(x \cdot 2^{n})$$

Note that multiplying by $2^{n}$ is equivalent to shifting left by $n$ bits, and that the rounding step introduces a quantisation error of at most half the resolution.

The fractional value can be recovered from the integer with the reverse operation:

$$x \approx i \cdot 2^{-n}$$
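In C this amounts to a scale and a round in one direction, and a scale back in the other. The helper names below (`float_to_fixed`, `fixed_to_float`) are chosen for this example, and overflow checking is omitted for brevity.

```c
#include <stdint.h>
#include <math.h>

/* Convert a real value to a signed 16-bit fixed-point integer with
 * frac_bits fractional bits: scale by 2^frac_bits and round. */
int16_t float_to_fixed(double value, unsigned frac_bits)
{
    return (int16_t)lround(value * (double)(1 << frac_bits));
}

/* Reverse operation: scale the stored integer by 2^-frac_bits. */
double fixed_to_float(int16_t raw, unsigned frac_bits)
{
    return (double)raw / (double)(1 << frac_bits);
}

/* Example: float_to_fixed(1.5, 12) returns 6144 (1.5 * 4096), and
 * fixed_to_float(6144, 12) returns 1.5 again. */
```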
But how is this useful for arithmetic? For addition and subtraction, provided that the operands have the same number of fractional bits, the operation is unaffected: the result will have the same number of fractional bits, and can be converted accordingly. Specifically, if two fractional operands $a$ and $b$ are stored as the integers $i_a = a \cdot 2^{n}$ and $i_b = b \cdot 2^{n}$, then

$$i_a + i_b = (a + b) \cdot 2^{n}$$

so the integer sum represents the fractional sum in the same format.
As with all arithmetic operations, care must be taken to avoid an overflow. Most processors will either saturate or wrap the result in the event of an overflow.
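A software version of a saturating fixed-point addition might look like the sketch below; real DSPs typically do the widening and saturation in hardware, and `fixed_add_sat` is an illustrative name.

```c
#include <stdint.h>

/* Add two fixed-point values that share the same number of fractional
 * bits. The format does not change, so this is ordinary integer
 * addition; the only extra work is saturating on overflow. */
int16_t fixed_add_sat(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen so the sum cannot overflow */

    if (sum > INT16_MAX) return INT16_MAX;   /* saturate high */
    if (sum < INT16_MIN) return INT16_MIN;   /* saturate low  */
    return (int16_t)sum;
}

/* Example in Q4.12: 1.5 + 2.25 -> 6144 + 9216 = 15360, i.e. 3.75. */
```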
Multiplication is slightly more complex. The operands are not required to have the same number of fractional bits, but the number of fractional bits in the result will depend on the operands. Specifically, if $a$ and $b$ are stored with $n_a$ and $n_b$ fractional bits, i.e. as $i_a = a \cdot 2^{n_a}$ and $i_b = b \cdot 2^{n_b}$, then

$$i_a \, i_b = a b \cdot 2^{n_a + n_b}$$

In other words, the result will have $n_a + n_b$ fractional bits, and must be shifted right to return it to the desired format. The full product also needs twice as many bits as the operands, so it is typically formed in a wider intermediate type.
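As a sketch, the function below multiplies two operands with possibly different numbers of fractional bits and shifts the wide product back to a requested output format. `fixed_mul` is an illustrative name; rounding and saturation are omitted, and it assumes `frac_out` does not exceed `frac_a + frac_b`.

```c
#include <stdint.h>

/* Multiply a (with frac_a fractional bits) by b (with frac_b fractional
 * bits), returning a result with frac_out fractional bits. */
int16_t fixed_mul(int16_t a, unsigned frac_a,
                  int16_t b, unsigned frac_b,
                  unsigned frac_out)
{
    /* The 32-bit product carries frac_a + frac_b fractional bits. */
    int32_t product = (int32_t)a * (int32_t)b;

    /* Discard the surplus fractional bits to reach the output format. */
    return (int16_t)(product >> (frac_a + frac_b - frac_out));
}

/* Example: 1.5 in Q4.12 (6144) times 0.5 in Q1.15 (16384) gives a
 * product with 27 fractional bits; asking for a Q4.12 result shifts
 * right by 15 and returns 3072, i.e. 0.75. */
```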
Many DSPs have native support for fixed-point types with one integer bit, i.e. numbers in the range [−1, 1). For arithmetic operations, this often includes allocating a suitably wide accumulator for the intermediate result, and then bit-shifting the accumulator output back to the original size and format.
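For instance, a saturating Q1.15 multiply can be sketched in software as below (`q15_mul` is an illustrative name, not a particular vendor intrinsic): the 32-bit product plays the role of the accumulator and is shifted back down to the original format.

```c
#include <stdint.h>

/* Q1.15 multiply: the 32-bit product has 30 fractional bits, so an
 * arithmetic right shift by 15 returns it to Q1.15. The -1 * -1 case
 * is saturated, because +1 is not representable in Q1.15. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t acc = (int32_t)a * (int32_t)b;   /* Q2.30 intermediate "accumulator" */

    acc >>= 15;                              /* back to Q1.15 (assumes arithmetic shift) */

    if (acc > INT16_MAX)                     /* only reachable for -1 * -1 */
        acc = INT16_MAX;

    return (int16_t)acc;
}

/* Example: 0.5 * -0.25 -> 16384 * -8192 = -134217728; shifting right
 * by 15 gives -4096, which is -0.125 in Q1.15. */
```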