Many embedded DSPs and low-powered processors do not support floating-point arithmetic. Instead, they support only integer and fixed-point arithmetic. Fixed-point representations make fractional arithmetic possible using only integers.
So what is fixed-point? As the name implies, it describes a number where the decimal point sits in a fixed position: in this context, a fixed position within the binary representation. For example, a 16-bit fixed-point number might use 4 bits for the integer part and 12 bits for the fractional part. In two's complement, this implies that the integer part is in the range [−8, 7], and the fractional part is in the range [0, 1).
Format | Bit values | | | | | | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Integer | \(-2^{15}\) | \(2^{14}\) | \(2^{13}\) | \(2^{12}\) | \(2^{11}\) | \(2^{10}\) | \(2^{9}\) | \(2^{8}\) | \(2^{7}\) | \(2^{6}\) | \(2^{5}\) | \(2^{4}\) | \(2^{3}\) | \(2^{2}\) | \(2^{1}\) | \(2^{0}\) |
12 fractional bits | \(-2^{3}\) | \(2^{2}\) | \(2^{1}\) | \(2^{0}\) | \(2^{-1}\) | \(2^{-2}\) | \(2^{-3}\) | \(2^{-4}\) | \(2^{-5}\) | \(2^{-6}\) | \(2^{-7}\) | \(2^{-8}\) | \(2^{-9}\) | \(2^{-10}\) | \(2^{-11}\) | \(2^{-12}\) |
Fixed-point numbers are stored as integers. All integers have a range that is determined by the number of bits used to encode the integer, and by whether the integer is signed or unsigned. For example, a signed 16-bit integer has \(2^{16} = 65536\) possible values, ranging from −32768 (0b1000000000000000) to 32767 (0b0111111111111111). If the decimal point is moved to the left, then the new range will be reduced. Note, though, that the position of the decimal point is not encoded in the data. So, for example, 0b0001000000000000 conveys 4096 as an integer, or 1 if there are 12 fractional bits. The decimal position will have to be conveyed in metadata, comments, or documentation. This description of the number of integer and fractional bits is called the Q-format (e.g. Q4.12 denotes 4 integer bits and 12 fractional bits).
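To make the two readings concrete, here is a minimal C sketch (Q4.12 assumed; the helper name `q4_12_value` is illustrative):

```c
#include <stdint.h>

/* Interpret a raw 16-bit word as a Q4.12 value. The stored bits are
 * unchanged; only the documented position of the point differs, so the
 * conversion is simply a division by 2^12. */
static double q4_12_value(int16_t raw)
{
    return raw / 4096.0; /* 4096 = 2^12 */
}

/* 0b0001000000000000 (0x1000) is 4096 as a plain integer,
 * but q4_12_value(0x1000) gives 1.0. */
```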
The choice of the number of integer and fractional bits is a tradeoff: more fractional bits increase the precision whilst reducing the numeric range. Fewer fractional bits have the opposite effect. See some examples below. It's usually preferable to keep the integer range as small as possible with respect to the data, whilst avoiding overflow, in order to maximise the precision (and hence minimise any numeric error).
Number of bits | Integer bits | Fractional bits | Minimum value | Maximum value | Resolution |
---|---|---|---|---|---|
16 | 16 | 0 | −32768 | 32767 | 1 |
16 | 4 | 12 | −8.0 | 7.999755859375 | 0.000244140625 |
16 | 1 | 15 | −1.0 | 0.999969482421875 | 0.000030517578125 |
16 | 12 | 4 | −2048.0 | 2047.9375 | 0.0625 |
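The rows above can be checked programmatically. A sketch, assuming a signed 16-bit word with \(n\) fractional bits (function names are illustrative):

```c
#include <stdint.h>

/* For a signed 16-bit fixed-point format with n fractional bits
 * (0 <= n <= 15), the minimum, maximum, and resolution follow
 * directly from the underlying integer range [-32768, 32767]. */
static double q16_min(int n)        { return -32768.0 / (double)(1 << n); }
static double q16_max(int n)        { return  32767.0 / (double)(1 << n); }
static double q16_resolution(int n) { return      1.0 / (double)(1 << n); }
```

For example, `q16_max(12)` reproduces the 7.999755859375 in the Q4.12 row.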
Conversion between integer and fractional numbers is straightforward. For a fractional number \(x\) with \(N\) fractional bits, the integer value \(y\in\mathbb{Z}\) is calculated as: \[y \approx 2^Nx\]
Note that \(y\) is necessarily rounded in order to store the value as an integer. This rounding causes an irreversible loss of precision.
The fractional value can be recovered from the integer with the reverse operation: \[x \approx \frac{y}{2^N}\]
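The two conversions might be sketched as follows (16-bit storage assumed, rounding to nearest on the way in; `float_to_fixed` and `fixed_to_float` are illustrative names):

```c
#include <stdint.h>

/* Convert a fractional value to a fixed-point integer with frac_bits
 * fractional bits, rounding to nearest. Anything finer than
 * 2^-frac_bits is lost irreversibly. Overflow is not handled here. */
static int16_t float_to_fixed(double x, int frac_bits)
{
    double scaled = x * (double)(1 << frac_bits);
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Recover the (approximate) fractional value with the reverse operation. */
static double fixed_to_float(int16_t y, int frac_bits)
{
    return (double)y / (double)(1 << frac_bits);
}
```

Round-tripping 1.5 through 12 fractional bits is exact, since it is a multiple of \(2^{-12}\); a value such as 0.0001 is not, and comes back as 0.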
But how is this useful for arithmetic? For addition and subtraction, provided that the operands have the same number of fractional bits, the operation is unaffected: the result has the same number of fractional bits, and can be converted accordingly. Specifically, if two fractional operands \(x_1\) and \(x_2\) have \(N\) fractional bits, then the result of adding them will be: \[2^Nx_1 + 2^Nx_2 \equiv 2^N(x_1 + x_2)\]
As with all arithmetic operations, care must be taken to avoid an overflow. Most processors will either saturate or wrap the result in the event of an overflow.
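A saturating add can be sketched by widening the operands so the intermediate sum cannot overflow (many DSPs provide this as a single instruction; the function name is illustrative):

```c
#include <stdint.h>

/* Add two 16-bit fixed-point values that share the same number of
 * fractional bits, clamping (saturating) at the 16-bit limits
 * rather than wrapping on overflow. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b; /* cannot overflow 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```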
Multiplication is slightly more complex. The operands are not required to have the same number of fractional bits, but the number of fractional bits in the result will depend on the operands. Specifically, if \(x_1\) has \(M\) fractional bits, and \(x_2\) has \(N\) fractional bits, then the result of multiplying them will be: \[2^Mx_1 \cdot 2^Nx_2 \equiv 2^{M + N}x_1x_2\]
In other words, the result will have \(M + N\) fractional bits. The result will also require a word size that is the sum of the operand word sizes. Depending on the precision required, it is usually possible to perform an arithmetic right shift on the result in order to store it in a smaller word.
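For example, multiplying two Q4.12 values: the 32-bit product has 24 fractional bits, and an arithmetic right shift by 12 returns it to Q4.12 (a sketch; a truncating shift is used for simplicity, and overflow of the final narrowing is not handled):

```c
#include <stdint.h>

/* Multiply two Q4.12 values. The intermediate product is Q8.24 and
 * needs a 32-bit word; shifting right by 12 restores Q4.12. */
static int16_t mul_q4_12(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b; /* 24 fractional bits */
    return (int16_t)(product >> 12);           /* back to 12, truncating */
}
```

For instance, 1.5 × 2.0 in Q4.12 is `mul_q4_12(6144, 8192)`, which yields 12288, i.e. 3.0.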
Many DSPs have native support for fixed-point types with one integer bit, i.e. numbers in the range [−1, 1). For arithmetic operations, this often includes allocating an appropriate accumulator for the result, and then bit shifting the accumulator output to the original size and format.
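In that one-integer-bit (Q1.15) case, the same pattern applies with a 15-bit shift out of the accumulator (a sketch; real DSPs typically fuse this into a single fractional-multiply instruction):

```c
#include <stdint.h>

/* Q1.15 fractional multiply: widen into a 32-bit accumulator, then
 * shift back to the original format. The result stays in [-1, 1),
 * except for the corner case -1 * -1, which is not handled here. */
static int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t acc = (int32_t)a * (int32_t)b; /* 30 fractional bits */
    return (int16_t)(acc >> 15);           /* back to Q1.15 */
}
```

For example, 0.5 × 0.5: both operands are 16384 in Q1.15, and the result is 8192, i.e. 0.25.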