Unfortunately in embedded system design it is typically not possible to avoid fixed point issues because floating point silicon requires many more transistors, particularly to implement the multiplier, and hence is more expensive and slower than equivalent fixed point implementations.
There are many different ways to represent numbers using integers including signed 2's Complement, Signed Offset Binary etc. but one of the most useful is the Q Format.
Q format numbers split the fixed point word into two components - the integer (m bits) and fractional (n bits) parts to give a number represented in the format m.n. This makes the word length of the integer equal to m+n bits.
The Q format allows the algorithm to be implemented using fixed point operations but requires the designer to manage the data both into and out of the algorithm.
A regular integer number creates overflow during the multiplication and addition operations.
Some processors provide a partial workaround for this by allowing optional use of fractional numbers for multiplication, in which case underflow occurs but this still doesn't avoid overflow with addition operations.
8 bit integer multiplication example
Treating fixed point numbers as integers means that we have the possibility of overflow.
The maximum 8 bit signed number is 127 (0x7f)
127 x 127 = 16129 which is a 16 bit number and has overflowed our 8 bit word length
8 bit fractional multiplication example
Treating fixed point numbers as fractions means that we have the possibility of underflow but not overflow.
The maximum 8 bit signed number is 127/128 = 0.9921875 (0x7f)
0.9921875 x 0.9921875 = 0.98443603515625 which has not overflowed our 8 bit word length but the lsbs will underflow to give us 126/128 = 0.984375.
8 bit (2.6) Q format example
We have 2 bits for the integer (m) and 6 bits for the fractional (n) components.
The value of 1.0 is represented by 0b01000000 (0x40)
If we now represent the following numbers in 2.6 Q format :
2.0 - 0b10000000 (0x80)
1.5 - 0b01100000 (0x60)
0.5 - 0b00100000 (0x20)
0.25 - 0b00010000 (0x10)
0.75 - 0b00110000 (0x30)
We can now perform some simple maths :
0.75 x 2 = 0x30 x 0x80 = 0x1800 (0b1100000000000)
Note the word growth, which means we need a longer word length for the partial result. We now need to scale this back to give us our 8 bit result (or we could keep this extended format if this is part of a multiply accumulate (MAC) instruction).
Scaling (right shifting) by the number of fractional bits (6) gives us 0b1100000 (0x60) or 1.5, which is the result we would expect.
We could now perform the same multiply operation with :
1.5 x 0.5 = 0x60 x 0x20 = 0xC00 (0b110000000000)
And again, if we perform the required scaling by 6 bits we get 0b110000 (0x30) = 0.75
We can see from this example that it is possible to handle numbers with both integer and fractional components using the Q format.
When implementing a regular FIR filter where |h(n)| <= 1.0 then the Q format number system allows for word growth in the accumulator.
The worst case word growth allows 2^N MAC operations before overflow, where N is the number of guard bits in the accumulator word.
Using Q format numbers allows m integer bits to be used for the guard bits hence in the above example we can implement 2^2 = 4 MAC operations and guarantee no overflow.
In more realistic FIR filters where only the central coefficients tend towards the maximum it is possible to implement larger filters before experiencing overflow but these scenarios should always be tested thoroughly.
Unlike FIR filters, where |h(n)| is typically <= 1.0, in IIR filters the coefficient magnitude can often be > 1.0. Here are the coefficients of a very simple 2nd order Low Pass Butterworth Filter :
float CoefficientArray  =
9.57767909330728970000e-002, 1.61076720320250940000e-001, 9.57767909330728970000e-002,
Implementing a filter like this in a fixed point DSP is clearly much more easy when using Q format numbers than either integer or fractional fixed point numbers.
As an example of where you might use Q format numbers is in high quality (>= 24 bit) audio. Using a 32 bit word length allows several options for the values of m and n, for example :
32 bit word using m.n = 8.24 gives
8 integer bits for word growth and 24 fractional bits
32 bit word using m.n = 5.27 gives
5 integer bits for word growth and 27 fractional bits for fractional bit growth
In these examples it is common to use a 64 bit result / accumulator register then this partial result is converted back to 32 bit Q format for further processing stages.
Please see this blog entry for C code for implementing the Q format number system : http://realgonegeek.blogspot.co.uk/2015/01/c-code-to-implement-q-format-number.html.
Further fixed point references are available at :
If you have found this solution useful then please do hit the Google (+1) button so that others may be able to find it as well.
Numerix-DSP Libraries : http://www.numerix-dsp.com/eval/