The Numerix-DSP Blog: December 2019

Introduction

There are many practical issues to consider when implementing DSP applications, including the choice between fixed-point and floating-point devices and the choice between coding in a high level language and coding in assembly.

One of the most popular techniques for developing DSP systems is to simulate the system in C on a general purpose microprocessor and then port the C code onto a DSP device. For many applications C provides perfectly acceptable performance, but to achieve this even the most modern compilers require the assistance of the programmer. The following is a list of suggestions that can make C coded real-time routines as efficient as possible.

Local Variables

C places local variables on the stack, so on many processors they are accessed indirectly and therefore slowly. It is often more efficient to place variables in statically allocated memory and there are two primary techniques for doing this. The first is to declare them as globals (outside the scope of any function) and the second is to declare the variable as static, within the function.
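A minimal sketch of the two techniques follows. The names g_scratch, running_sum and BLOCK_LEN are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

#define BLOCK_LEN 64   /* hypothetical block size, for illustration only */

/* Technique 1: file-scope (global) array, given a fixed address at link time */
float g_scratch[BLOCK_LEN];

/* Technique 2: 'static' local. It is also given a fixed address, but it is
   only visible inside the function and it keeps its value between calls. */
float running_sum(const float *src, size_t n)
{
    static float total = 0.0f;   /* not on the stack */
    size_t i;

    for (i = 0; i < n; i++) {
        total += src[i];
    }
    return total;
}
```

Note that a function holding state in a static variable is no longer re-entrant, so this approach needs care if the function can be called from both an ISR and the foreground code.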

Most compilers offer a level of optimization that will place local variables in registers; however, the compiler can often be assisted by explicitly declaring frequently used variables as 'register' types.

Re-use local variables declared as 'register' within a function, for multiple non-conflicting variables. On processors with a small number of registers or in complex functions, the benefit will be that fewer registers need to be pushed onto the stack but the down side is that the code can become less readable.

Function Calling

By default, many C compilers place all function parameters on the stack; however, many also allow this task to be optimized through an optional register based parameter model.

Declaring functions as 'inline' can completely remove the function call overhead but does increase the size of the object code. Compilers often incorporate a command line switch to enable the automatic in-lining of functions that are smaller than a given size.
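As a sketch, assuming a C99 or later compiler, a small helper can be declared 'static inline' so that the compiler is free to substitute its body at each call site. The function names here are hypothetical.

```c
/* 'static inline' lets the compiler substitute the body at each call site */
static inline int clamp_int(int x, int lo, int hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Limits every sample of a buffer to the range [lo, hi] */
void clamp_block(int *buf, int n, int lo, int hi)
{
    int i;

    for (i = 0; i < n; i++) {
        buf[i] = clamp_int(buf[i], lo, hi);   /* candidate for in-lining */
    }
}
```

The keyword is only a request. Whether the call is actually removed depends on the compiler and the optimization level in use.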

Always prototype functions because most compilers can use the information to optimize the code.

Some compilers can perform further levels of optimization if the parameters are placed in certain orders (usually by grouping pointers, floating point and fixed point variables separately).

Interrupt Service Routines

When implementing interrupt service routines (ISRs), all registers that are used within the ISR must be pushed onto the stack to prevent side effects. In some cases using higher levels of optimization and hence extra registers for interrupt service routines may actually slow them down, due to the overhead of the extra stack manipulation that is required at both the start and end of the ISR. Experiment with different levels of optimization on different sections of code by splitting them into separate source files.

Variables shared between ISRs and other functions should be declared as 'volatile'. This prevents the optimizer from removing accesses to them or from holding a stale copy in a register.
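The following sketch shows the idea with a flag and a sample shared between a hypothetical ISR and a foreground polling function. The ISR is written as an ordinary C function here; attaching it to an interrupt vector needs a compiler-specific keyword or pragma that is not shown.

```c
#include <stdint.h>

/* Shared between the ISR and the foreground code, so declared 'volatile' */
static volatile int     g_sample_ready = 0;
static volatile int16_t g_sample       = 0;

/* Hypothetical ISR body, called when a new ADC sample arrives */
void adc_isr(int16_t new_sample)
{
    g_sample       = new_sample;
    g_sample_ready = 1;
}

/* Foreground poll: returns 1 and stores the sample if one is waiting */
int poll_sample(int16_t *out)
{
    if (!g_sample_ready) {   /* re-read from memory on every call */
        return 0;
    }
    *out = g_sample;
    g_sample_ready = 0;
    return 1;
}
```

'volatile' does not make an access atomic, so data wider than the native word may still need interrupts to be masked while it is read.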

Data Types

When any DSP functions are implemented on fixed-point DSPs it is imperative that careful attention is paid to such issues as overflow and wrap-around due to the hardware numerical bounds.
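One common way to handle overflow is saturating arithmetic. The sketch below assumes 16 bit Q15 data and does the addition in a 32 bit intermediate; the function name is hypothetical, and many DSPs provide this behaviour in hardware or through an intrinsic.

```c
#include <stdint.h>

/* Adds two Q15 values in a wider type, then clips to the 16 bit range
   instead of allowing the result to wrap around */
int16_t sat_add_q15(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;

    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```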

Never use data words longer than necessary and try to ensure they can be loaded into the CPU core in a single cycle.

Loop Optimization

Most DSP algorithms, by their nature, consist of tight looped code and there are many steps that can be taken to optimize loop execution, including :

Move constant expressions outside the loop and pre-calculate the result. Modern compilers are usually able to do this automatically; however, it is often better, especially with long loops, to assist the compiler by implementing this at the source code level.
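A minimal sketch of moving a constant expression out of a loop, using a hypothetical scale_block function:

```c
#include <stddef.h>

/* Scales a buffer by gain / full_scale. The quotient does not change
   inside the loop, so it is computed once beforehand. */
void scale_block(float *buf, size_t n, float gain, float full_scale)
{
    const float k = gain / full_scale;   /* loop-invariant expression */
    size_t i;

    for (i = 0; i < n; i++) {
        buf[i] *= k;
    }
}
```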

Replace division operations with multiplication by the reciprocal and if the divisor is a constant take the reciprocal operation outside the loop. Care should be taken with this route because the numerical errors will be different for each technique.
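As an illustration, with a hypothetical normalise_block function, the division is performed once and every sample is then multiplied by the stored reciprocal:

```c
#include <stddef.h>

/* Normalises a buffer by a constant divisor using one division in total */
void normalise_block(float *buf, size_t n, float divisor)
{
    const float recip = 1.0f / divisor;   /* single division, outside the loop */
    size_t i;

    for (i = 0; i < n; i++) {
        buf[i] *= recip;   /* can differ from buf[i] / divisor by a rounding error */
    }
}
```

For divisors whose reciprocal is not exactly representable (3.0f, for example) the two forms can differ in the least significant bit, which is the numerical difference referred to above.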

Unrolling loops : The loop code should be repeated several times at the source code level, within the loop construct. On some processors this can benefit the performance by allowing parallel operations from separate iterations; however, it is important to ensure that the code section is not larger than the on-chip cache. Some DSP compilers generate single instruction looped code that is uninterruptible. Unrolling an inner loop will require more program memory, but the code will often be just as efficient and will also be interruptible.
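A sketch of a dot product unrolled by a factor of four is shown below. The unroll factor and the function name are arbitrary choices for illustration; the best factor depends on the processor and the compiler.

```c
#include <stddef.h>

/* Dot product with the loop body repeated four times per iteration.
   The second loop handles lengths that are not a multiple of 4. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc += a[i]     * b[i];
        acc += a[i + 1] * b[i + 1];
        acc += a[i + 2] * b[i + 2];
        acc += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) {   /* remaining 0 to 3 samples */
        acc += a[i] * b[i];
    }
    return acc;
}
```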

Reduce data dependencies : By separating the data used in one operation from the data used in another parallel operation, the compiler can often utilize the on-chip resources more efficiently.
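One way to reduce dependencies, sketched here under the assumption that the processor can issue two multiply-accumulate operations in parallel, is to use separate accumulators so that neither operation has to wait for the result of the other. The function name is hypothetical.

```c
#include <stddef.h>

/* Sum of squares using two independent accumulators */
float energy_two_acc(const float *x, size_t n)
{
    float acc0 = 0.0f;   /* even-indexed samples */
    float acc1 = 0.0f;   /* odd-indexed samples  */
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        acc0 += x[i]     * x[i];       /* does not depend on acc1 */
        acc1 += x[i + 1] * x[i + 1];   /* does not depend on acc0 */
    }
    if (i < n) {
        acc0 += x[i] * x[i];           /* odd length: one sample left */
    }
    return acc0 + acc1;
}
```

Because floating point addition is not associative, the result can differ slightly from a single-accumulator loop.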

Always try to avoid calling functions within loops. If a call is absolutely necessary, consider using function pointers, especially if the function that is called is data dependent.

Analyze the performance of the compiler with respect to 'while', 'do while' and 'for' loop efficiencies. For a given algorithm, the efficiency of each construct can be both compiler and algorithm dependent.

Try to avoid 'test and branch' operations inside loops because they can be time consuming and can call code that is not currently resident in the cache. This can often be achieved by splitting the loop into multiple instances, each handling a separate condition.
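As a sketch of loop splitting, consider applying one gain to the first part of a buffer and a different gain to the rest. A single loop would need an 'if (i < split)' test on every iteration, whereas two loops need none. The function name and parameters are hypothetical.

```c
#include <stddef.h>

/* Applies gain g0 to samples [0, split) and gain g1 to samples [split, n) */
void two_region_gain(float *buf, size_t n, size_t split, float g0, float g1)
{
    size_t i;

    if (split > n) {
        split = n;               /* one test, outside the loops */
    }
    for (i = 0; i < split; i++) {
        buf[i] *= g0;            /* first condition only */
    }
    for (; i < n; i++) {
        buf[i] *= g1;            /* second condition only */
    }
}
```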

Many compilers will perform better if the data is read from memory at the beginning of the loop and written back at the end.

Numerical Operations

Multiplication and division of integers by powers of 2 can usually be performed more efficiently using bit shift operations.
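A minimal sketch using unsigned operands; the function names are hypothetical:

```c
#include <stdint.h>

/* Multiply by 8 (2^3) using a left shift */
uint32_t mul8(uint32_t x)
{
    return x << 3;
}

/* Divide by 4 (2^2) using a right shift. Unsigned operands are used
   because right-shifting a negative signed value is implementation-defined
   in C and, where it is an arithmetic shift, rounds differently from '/'. */
uint32_t div4(uint32_t x)
{
    return x >> 2;
}
```

Many compilers make this substitution automatically for constant powers of 2, so it is worth checking the generated code before changing the source.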

Try to avoid calling trigonometric functions at run time by using look-up tables instead, especially in FFT routines etc.
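The sketch below uses a deliberately tiny, hypothetical 8 entry sine table to show the idea; a practical table would be much longer. The cosine is read from the same table with a quarter cycle offset.

```c
#define LUT_LEN 8   /* hypothetical table length; must be a power of 2 */

/* One full cycle of sine, sampled at LUT_LEN points */
static const float sine_lut[LUT_LEN] = {
     0.0f,  0.70710678f,  1.0f,  0.70710678f,
     0.0f, -0.70710678f, -1.0f, -0.70710678f
};

/* sin(2*pi*k/LUT_LEN) by table look-up; k wraps modulo LUT_LEN */
float lut_sin(unsigned k)
{
    return sine_lut[k & (LUT_LEN - 1u)];
}

/* cos is sin advanced by a quarter cycle, so the same table is reused */
float lut_cos(unsigned k)
{
    return sine_lut[(k + LUT_LEN / 4u) & (LUT_LEN - 1u)];
}
```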

When using a floating point device, try to use floating point data formats. This will reduce the burden on the fixed point processing units, which will probably also be required to perform the loop counting operations. Some general purpose devices can perform floating point data operations quicker than fixed point. Floating point data also has the advantage that scaling issues are less demanding and can usually be accounted for with less overhead.

Try to avoid underflows or overflows of the numerical system, unless the algorithm demands it.

Memory Accessing

Most compilers allow the use of different memory models however it is always better to use the smallest model necessary because large models often entail an overhead for manipulating memory segment or page pointers.

Most modern processors include zero overhead pointer manipulation and this can mean that using pointers to access arrays in a linear fashion is often faster than using array indexing. It should be noted that this is not always true and will vary from processor to processor.
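The two access styles can be compared with a pair of functions such as the following sketch (names are hypothetical). Which one is faster, if either, should be measured on the target.

```c
#include <stddef.h>

/* Array-index form */
float sum_indexed(const float *x, size_t n)
{
    float acc = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Pointer form: the post-increment can map onto a post-modify
   addressing mode where the processor has one */
float sum_pointer(const float *x, size_t n)
{
    float acc = 0.0f;
    const float *end = x + n;

    while (x < end) {
        acc += *x++;
    }
    return acc;
}
```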

Most DSPs incorporate functionality for zero overhead looping and bit reversed addressing and in order to use these techniques it is often necessary to correctly align the base element in a data vector. Incorrect array alignment is one of the most common reasons for DSP code not working correctly.

The CPU must access memory for loading both program instructions and data and huge benefits can be gained by analyzing the data flows and locating the heaviest loaded functions or arrays in on-chip memory. It is often a good idea to experiment with different combinations of data, stack and/or program instructions within the on-chip memory.

Always enable the caches.

Some DSPs have separate program and data memory spaces and on others they are combined. Pipeline and internal bus conflicts can mean that particular arrangements for the partitioning of program and data are more efficient than others. See the section on avoiding pipeline conflicts.

Utilize the maximum width of the external bus. Many DSPs can now load multiple parallel data words and separate them within the CPU, with no processing overhead. E.G. load two 16 bit words with one 32 bit transfer. This may require some loop unrolling.

For efficiently performing multiple accesses to arrays and complex structures, data should be loaded into temporary local (preferably 'register') variables.
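As a sketch, a hypothetical first-order IIR filter can copy its structure members into local variables before the loop and write the state back once afterwards:

```c
#include <stddef.h>

/* Hypothetical first-order IIR filter state, for illustration only */
typedef struct {
    float coeff;   /* feedback coefficient */
    float state;   /* previous output */
} iir1_t;

/* y[n] = x[n] + coeff * y[n-1], processed in place */
void iir1_process(iir1_t *f, float *buf, size_t n)
{
    register float coeff = f->coeff;   /* read the structure once */
    register float state = f->state;
    size_t i;

    for (i = 0; i < n; i++) {
        state  = buf[i] + coeff * state;
        buf[i] = state;
    }
    f->state = state;                  /* write back once at the end */
}
```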

For large data sets or large programs, it can be more efficient to store all the instructions and / or data in external memory and use the on-chip DMA controller to read the appropriate parts into internal memory when needed.

In 'paged' memory systems it is often beneficial to ensure, where possible, that individual data sets do not span page boundaries because this can cause delays to be inserted in the memory access cycle.

CPU Utilization

Use intrinsic functions. Intrinsic functions are C like functions that map directly to the low level instructions of the CPU. Often the use of these functions allows specific or more efficient variations of the standard mathematical operations (e.g. add, subtract, multiply and divide).

Avoid Pipeline Conflicts

Most DSP CPUs are fed with instructions and data in a pipeline and before attempting to obtain the maximum performance from these devices it is important to be familiar with the pipeline. The User's Guides for the DSPs usually incorporate an important chapter on this subject.

Correct partitioning of program and data across the various memory segments is critical.

Avoid internal or external memory access conflicts.

It is often only possible to access both internal and external memory within a single instruction cycle if the external memory access is initiated first.

Memory Conflicts

If you find that your otherwise perfect C code crashes for no apparent reason then the problem may relate to memory management.

When learning about writing C programs many books and classes specify that all data that is local to a function should be declared within the function. This provides for a neater programming style that reduces problems caused by coding errors. Unfortunately, this is not the whole truth because local data is placed on the stack, which is a dynamic structure that grows and shrinks as required. This application note describes how to get the most out of your compiler and linker tools to avoid unpredictable program execution.

The compiler will turn the C code into executable object code and this is passed to the linker, which will locate all of the code and data into the memory for the target processor. Locating program code is generally easy because the compiler will calculate how large the code space is and the linker will then place it correctly in memory. Data is altogether more complex because there are several different ways to access data. Although there are many different data types, they can all be split into three generic types :

Heap Space - This is data space that is dynamically allocated, for example using malloc () etc.
Global Variable Space - This is the memory space for global variables that are statically allocated by the program. I.E. this memory space is fixed in size at compile time.
Stack Space - This is a dynamic memory space that is used for function parameters and variables that are local to a function.

The C source declaration of global variables is shown in figure 1.

/* Start of source file */
/* Data and arrays allocated here will be placed in global variable space */

void function (void)
{
    /* Data and arrays allocated here will be placed on the stack unless declared 'static' */
}

Figure 1 : Memory allocation in a C source file

While the compiler and linker can calculate the size of the global variable space, memory allocation problems usually arise because of the stack and heap spaces due to the fact that they are dynamic and hence neither the compiler nor the linker can predict how large they are going to be nor where to locate them in memory.

Although there are many different allocation schemes in use, the same basic rules apply to all. Figure 2 shows a generic scheme where the stack grows from the bottom of memory up and the heap grows from the top down, with the global variable space located in the middle. Problems will occur in this scheme if either the stack or the heap spaces expand over the global variable space.

All compilers utilise pre-defined maximum heap and stack sizes that can usually be modified by the use of a command line parameter. If the dynamic memory allocation functions (e.g. malloc ()) are used and the return values checked then a lack of heap space will become obvious and can be worked around by modifying the linker options or the program code. Stack allocation, however, is not checked: nothing verifies that there is enough space for the required memory, and if too much is requested then the first symptom is typically an errant pointer and a program crash.
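Checking the return value of malloc () can be sketched as follows. The function name alloc_buffer is hypothetical.

```c
#include <stdlib.h>

/* Allocates a zeroed buffer of n floats, or returns NULL on failure */
float *alloc_buffer(size_t n)
{
    float *p = (float *)malloc(n * sizeof(float));
    size_t i;

    if (p == NULL) {
        return NULL;   /* heap exhausted: the caller must handle this */
    }
    for (i = 0; i < n; i++) {
        p[i] = 0.0f;
    }
    return p;
}
```

The caller then tests the result against NULL before using the buffer and releases it with free () when it is no longer needed.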

Typically linkers allocate less than 10 KBytes for the stack. For simple programs this is not an issue, but programs that are required to process a large amount of data must be written to take this into account. For a real time DSP application, data structures may often be larger than this, especially when floating point data is used (float is typically 4 bytes and double 8 bytes), so these must never be placed on the stack. The rules for managing the stack are therefore simple :

Limit the number of parameters that are passed to a function - i.e. use pointers where large amounts of data must be passed.
Limit the size of the local variable space - i.e. only use this for individual variables and never for arrays.
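Both rules can be sketched together. In this hypothetical example the frame is held in global variable space and only a pointer to it is passed to the function, so the stack cost of the call is one address and a few scalars.

```c
#include <stddef.h>

#define FRAME_LEN 1024   /* hypothetical frame length */

/* 4 KBytes of data when float is 4 bytes: too large to pass by value
   or to declare as a local array on a small stack */
typedef struct {
    float samples[FRAME_LEN];
} frame_t;

/* The frame itself lives in global variable space, not on the stack */
static frame_t g_frame;

/* Returns the largest absolute sample value in the frame */
float frame_peak(const frame_t *frame)
{
    float peak = 0.0f;
    size_t i;

    for (i = 0; i < FRAME_LEN; i++) {
        float v = frame->samples[i];
        if (v < 0.0f) v = -v;
        if (v > peak) peak = v;
    }
    return peak;
}
```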

Thus we have distilled the majority of problems related to memory management down to removing all local arrays from functions. We will now look at the steps we can take to do this in a structured manner.

We basically have three types of memory to manage :

Static Initialised - This is typically static look up tables that are initialised with data at compile time, for example constellation point maps for QAM transmitters.
Dynamic Initialised - This is typically look up tables that are allocated and initialised with data at run time, for example FFT twiddle factor tables that are initialised using the ANSI C sin () function.
Dynamic Uninitialised - This is typically application data that is being processed at run time.

(The fourth possible option of static uninitialised is perfectly valid, but the use of dynamically allocated data arrays leads to a more structured program.)

There are two options for statically allocating look up tables. They can either be allocated outside of any function, so that the data will be located in the global memory space, or they can be declared locally within a function with the static keyword used to specify that the data is placed in the global memory space. As a point of note, the first option can often be of most use in a DSP application because many look up tables are required in multiple functions (for example a constellation map may be required in both modulator and demodulator functions) so declaring them to be global can reduce duplication.

For dynamic memory, the solution is to use malloc to allocate the data arrays, but this is a slow function to execute and is not suitable for inclusion in a real time function. As a result it is common for all complex DSP functions to be accompanied by an initialisation function that can be called at boot time. This initialisation function will have the task of allocating memory and, if required, initialising the contents.
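The pattern can be sketched with a hypothetical moving-average filter. The initialisation function allocates and clears the delay line once, and the real time function only touches memory that already exists. The sketch assumes len is greater than zero.

```c
#include <stdlib.h>

/* Hypothetical moving-average filter with a dynamically allocated delay line */
typedef struct {
    float *delay;   /* circular delay line, allocated at initialisation */
    size_t len;     /* number of taps (assumed > 0) */
    size_t pos;     /* next write position */
    float  sum;     /* running sum of the delay line contents */
} mavg_t;

/* Called once at boot time: allocates and clears the delay line.
   Returns 0 on success or -1 if the allocation fails. */
int mavg_init(mavg_t *f, size_t len)
{
    size_t i;

    f->delay = (float *)malloc(len * sizeof(float));
    if (f->delay == NULL) {
        return -1;
    }
    for (i = 0; i < len; i++) {
        f->delay[i] = 0.0f;
    }
    f->len = len;
    f->pos = 0;
    f->sum = 0.0f;
    return 0;
}

/* Real time path: no allocation, only arithmetic on existing memory */
float mavg_process(mavg_t *f, float x)
{
    f->sum += x - f->delay[f->pos];   /* add newest, drop oldest */
    f->delay[f->pos] = x;
    if (++f->pos == f->len) {
        f->pos = 0;                   /* wrap the circular index */
    }
    return f->sum / (float)f->len;
}

/* Called at shutdown to release the delay line */
void mavg_free(mavg_t *f)
{
    free(f->delay);
    f->delay = NULL;
}
```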

If you have tried all of these suggestions and your code still crashes then it may be that you will need to use a larger memory model. This allows for a larger memory space but will often mean that the code will run slower because memory paging will be used, rather than direct memory accesses.


Numerix-DSP Libraries : http://www.numerix-dsp.com/eval/


Wednesday 4 December 2019

Real-time C Programming Guidelines
