Sunday, 24 May 2020

A Simple Google Assistant On The Raspberry Pi

Here's a fun lockdown project that explains how to integrate a Google assistant on a Raspberry Pi : https://github.com/Numerix-DSP/GoogleAssistant .


Saturday, 16 May 2020

Timing DSP Code Running On ARM Cortex Architecture

A recent project required porting some DSP algorithms to the NXP LPC55S6x ARM Cortex-M33 based microcontroller.
It was necessary to benchmark the algorithms so I wrote the following code that utilizes the Cycle Count Register, which is part of the ARM Cortex-M Data Watchpoint and Trace (DWT) unit.

The code below includes macros for accessing the DWT and also calculates the overhead of calling the functions to read the timer register, before using the same functions to time some code.
This code has been compiled and tested on the NXP LPCXpresso55S69 Development Board but should run on any ARM device that includes the DWT module.

#include <stdint.h>
#include "fsl_debug_console.h"

// Timers
// DWT (Data Watchpoint and Trace) registers, only exists on ARM Cortex with a DWT unit
#define KIN1_DWT_CONTROL          (*((volatile uint32_t*)0xE0001000)) // DWT Control register
#define KIN1_DWT_CYCCNTENA_BIT      (1UL<<0) // CYCCNTENA bit in DWT_CONTROL register
#define KIN1_DWT_CYCCNT            (*((volatile uint32_t*)0xE0001004)) // DWT Cycle Counter register
#define KIN1_DEMCR                (*((volatile uint32_t*)0xE000EDFC)) // DEMCR: Debug Exception and Monitor Control Register
#define KIN1_TRCENA_BIT          (1UL<<24) // Trace enable bit in DEMCR register

#define KIN1_InitCycleCounter() KIN1_DEMCR |= KIN1_TRCENA_BIT     // TRCENA: Enable trace and debug block in DEMCR (Debug Exception and Monitor Control Register)
#define KIN1_ResetCycleCounter() KIN1_DWT_CYCCNT = 0          // Reset cycle counter
#define KIN1_EnableCycleCounter() KIN1_DWT_CONTROL |= KIN1_DWT_CYCCNTENA_BIT  // Enable cycle counter
#define KIN1_DisableCycleCounter()  KIN1_DWT_CONTROL &= ~KIN1_DWT_CYCCNTENA_BIT // Disable cycle counter
#define KIN1_GetCycleCounter()      KIN1_DWT_CYCCNT                             // Read cycle counter register


int main(void)
{
    uint32_t start_time, end_time, overhead_time; // number of cycles

    KIN1_InitCycleCounter();   // enable DWT hardware
    KIN1_ResetCycleCounter();  // reset cycle counter
    KIN1_EnableCycleCounter(); // start counting

    // Measure the overhead of reading the cycle counter
    start_time = KIN1_GetCycleCounter(); // get cycle counter
    __asm volatile ("nop");
    end_time = KIN1_GetCycleCounter();   // get cycle counter
    overhead_time = end_time - start_time;

    PRINTF("Cycle counter timing example started.\r\n");

    // Time the code under test, then subtract the measurement overhead
    start_time = KIN1_GetCycleCounter(); // get cycle counter
    __asm volatile ("nop");
    __asm volatile ("nop");
    end_time = KIN1_GetCycleCounter();   // get cycle counter
    PRINTF("Elapsed time = %u cycles\r\n", (unsigned int)(end_time - start_time - overhead_time));

    return (0);
}

Notes
There appears to be a +/- 1 cycle jitter on the results of any code timing instance. I have not got to the bottom of exactly why but, regardless of the root cause, this is very accurate and definitely suitable for the vast majority of applications.

If you have found this solution useful then please do hit the Google (+1) button so that others may be able to find it as well.
Numerix-DSP Libraries : http://www.numerix-dsp.com/eval/

Monday, 27 April 2020

Python/Numpy : How Not To Generate A Sinusoidal Waveform

I was recently reviewing some Python/Numpy code that included a waveform generator. A simplified version of code looked like this :

import numpy as np

x = np.linspace(0,2*np.pi,8)
np.sin(x)

This generates the following :

array([ 0.00000000e+00,  7.81831482e-01,  9.74927912e-01,  4.33883739e-01,
       -4.33883739e-01, -9.74927912e-01, -7.81831482e-01, -2.44929360e-16])

Which looks like a perfect single cycle of a sinusoid. Except it isn't !
On closer inspection, the last element in the array is, to all intents and purposes, 0, which means that this isn't a perfect single cycle of a sinusoid because that final sample is actually the first sample of the next cycle.

To generate a perfect single cycle of a sinusoid using linspace you need to account for where the last sample of the sinusoid should fall, if you were to plot it on a graph.

x = np.linspace(0,2*np.pi-(2*np.pi/8),8)
np.sin(x)

This generates the following array, which is spot on :

array([ 0.00000000e+00,  7.07106781e-01,  1.00000000e+00,  7.07106781e-01,
        1.22464680e-16, -7.07106781e-01, -1.00000000e+00, -7.07106781e-01])

In thinking about this problem, it occurred to me that this is not ideal and very likely to cause confusion because it is easy to forget the required modification. The main reason for the confusion is that standard Python generates and processes data from, for example, 0 to N-1 as shown in this simple Numpy example :

np.arange(8.)

Which yields :

array([0., 1., 2., 3., 4., 5., 6., 7.])

So returning to the original problem, a far safer way of generating the sinusoid is the following code :

x = np.arange(0., 2.*np.pi, 2.*np.pi/8.)
np.sin(x)
Which generates the following array :
array([ 0.00000000e+00,  7.07106781e-01,  1.00000000e+00,  7.07106781e-01,
        1.22464680e-16, -7.07106781e-01, -1.00000000e+00, -7.07106781e-01])

Now we have the first stage, np.arange(), to generate the time index and the second stage, np.sin(), to generate the sinusoid. This is clear, precise and unlikely to cause error.

Side note : of course, it would be entirely possible to combine this into a single line instruction however I believe this creates other opportunities for introducing errors.



Saturday, 25 April 2020

The Difference Between FFT Spectrum and Power Spectral Density

I always teach the difference between FFT Spectrum and Power Spectral Density on my DSP courses and many students find it confusing.

This application note from Audio Precision summarizes the subject very neatly : The Difference Between FFT Spectrum and Power Spectral Density

Functions for calculating both the FFT Spectrum and Power Spectral Density are included in the SigLib DSP Library.

Saturday, 8 February 2020

VMWare Virtual Machines On Windows 10 - Disabling Device/Credential Guard - Solution

Not a DSP related post but something that caused me no end of unnecessary pain.

I use Virtual Machines a lot but they stopped working under Windows a while back with the following message :

"VMware Workstation and Device/Credential Guard are not compatible"

The VMWare URL pointed to a Microsoft webpage that was out of date with the newer version of Windows 10 I am using [Version 10.0.19041.21].

This helped greatly but, unfortunately, it is still out of date :
https://www.tenforums.com/tutorials/68913-enable-disable-device-guard-windows-10-a.html

Here is what I had to do but note, VMWare only started working after doing all three so the first two might not be necessary but it works now so I'm not going to make any changes ;-)

Control Panel | Programs And Features | Turn Windows Features On or Off | Untick the following :
Hyper-V
Virtual Machine Platform
Windows Hypervisor Platform


Search Windows for "Group Policy", open the "Edit Group Policy" app and go to :
Computer Configuration\Administrative Templates\System\Device Guard
Disable : Turn On Virtualization Based Security

Download dgreadiness from here : https://www.microsoft.com/en-us/download/details.aspx?id=53337 and do the following in an Administrator PowerShell :
.\DG_Readiness_Tool_v3.6.ps1 -Disable
Reboot.

Unfortunately, this breaks Windows Subsystem For Linux :-(.
My current solution, to run WSL2, is to do the following in an Administrator PowerShell :
.\DG_Readiness_Tool_v3.6.ps1 -Enable
Reboot.

This is so bloody stupid, that I can't run a VM and WSL side-by-side.

I'll endeavour to keep this page updated when Microsoft change things, again.

PS I'm sure this is also necessary for VirtualBox but I haven't got a current Windows 10 hosted VirtualBox to test.

Wednesday, 4 December 2019

The Next Round Of The University Of Oxford, UK Digital Signal Processing Courses Take Place In July 2020

As part of the University Of Oxford Summer Engineering Program for Industry, the Digital Signal Processing courses are returning in July 2020, for the 28th year.

The courses are presented by experts from industry for Engineers in industry and, over the last 27 years, have trained many hundreds of Engineers from all areas of Science and Engineering.

Here is a summary of the two courses.

Digital Signal Processing (Theory and Application) - Monday 6th to Wednesday 8th July 2020

https://www.conted.ox.ac.uk/courses/digital-signal-processing-theory-and-application
This course provides a good understanding of DSP principles and their implementation and equips the delegate to put the ideas into practice and/or to tackle more advanced aspects of DSP. 'Hands-on' laboratory sessions are interspersed with the lectures to illustrate the taught material and allow you to pursue your own areas of interest in DSP. The hands-on sessions use specially written software running on PCs.

Subjects include :

Theoretical Foundations
Digital Filtering
Fourier Transforms And Frequency Domain Processing
DSP Hardware And Programming
ASIC Implementation
Typical DSP Applications

Digital Signal Processing Implementation (algorithms to optimization) - Thursday 9th July 2020

A one-day supplement to the Digital Signal Processing course that takes the theory and translates it into practice.
https://www.conted.ox.ac.uk/courses/digital-signal-processing-implementation-algorithms-to-optimisation
The course will include a mixed lecture and demonstration format and has been written to be independent of target processor architecture.
The course will show how to take common DSP algorithms and map them onto common processor architectures. It will also give a guideline for how to choose a DSP device, in particular how to choose and use the correct data word length for any application.

Attendee Feedback From Previous Courses :

It was informative, enjoyable and stimulating
Excellent content, very lively thanks to the 2 excellent presenters - Anonymous
A very good introduction to DSP theory
Excellent lecturers! Really useful information and very understandable
Great mix of theory and practice
The lecturers gave a detailed and excellent explanation of the fundamental topics of DSP with real world engineering practice.
This session closes the gap and clears up much confusion between classroom DSP theories and actual DSP implementation.
Very good session, with in-depth discussion on the math and background.


These courses will be held at the University of Oxford, UK

Real-time C Programming Guidelines

Introduction

There are many practical issues to consider when implementing DSP applications, including the choice between fixed or floating-point devices or coding in a high level language or with assembly code.

One of the most popular techniques for developing DSP systems is to simulate the system in C on a general purpose micro-processor and then port the C code onto a DSP device. For many applications, C provides perfectly acceptable performance but to achieve this, even the most modern compilers require the assistance of the programmer. The following is a list of suggestions that can make C coded real-time routines as efficient as possible.

Local Variables

  • C places local variables on the stack and hence they are accessed indirectly and therefore slowly. It is often more efficient to place variables in statically allocated memory and there are two primary techniques for doing this. The first is to declare them as globals (outside the scope of any function) and the second is to declare the variable as static, within the function.
  • Most compilers allow a level of optimization that will place local variables in registers however the compiler can often be assisted by explicitly declaring frequently used variables as 'register' types.
  • Re-use local variables declared as 'register' within a function, for multiple non-conflicting variables. On processors with a small number of registers or in complex functions, the benefit will be that fewer registers need to be pushed onto the stack but the down side is that the code can become less readable.
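
The points above can be sketched as follows. This is only an illustration under stated assumptions: the names running_mean, total_samples and accumulated are hypothetical, and whether 'register' or 'static' actually helps depends on the compiler and target, so it is worth checking the generated code.

```c
#include <stddef.h>

/* Global variable: placed in the global variable space at a fixed address */
static long total_samples = 0;

/* Returns the mean of all samples seen so far (hypothetical example function) */
float running_mean(const float *samples, size_t length)
{
    static float accumulated = 0.0f; /* static local: fixed address, persists between calls */
    register float sum = 0.0f;       /* frequently used: hint to keep it in a register */
    register size_t i;

    for (i = 0; i < length; i++) {
        sum += samples[i];
    }

    accumulated += sum;
    total_samples += (long)length;

    if (total_samples == 0) {
        return 0.0f;                 /* nothing processed yet */
    }
    return accumulated / (float)total_samples;
}
```

Note that the static variable makes the function non-reentrant, which is the usual price paid for keeping data off the stack.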


Function Calling

  • As standard, all C function parameters are placed on the stack however many compilers allow the optimization of this task by the use of an optional register based parameter model.
  • Declaring functions as 'inline' can completely remove the function call overhead but does increase the size of the object code. Compilers often incorporate a command line switch to enable the automatic in-lining of functions that are smaller than a given size.
  • Always prototype functions because most compilers can use the information to optimize the code.
  • Some compilers can perform further levels of optimizations if the parameters are placed in certain orders (Usually separating the order of pointers, floating point and fixed point variables etc.).
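
As a minimal sketch of the in-lining and prototyping suggestions, assuming a C99 compiler: the function names clip_sample and clip_block are hypothetical, and 'inline' is only a request that the compiler is free to ignore.

```c
#include <stddef.h>

/* Small helper declared 'static inline' so the compiler may expand it at each
   call site instead of generating a function call */
static inline float clip_sample(float x, float limit)
{
    if (x > limit) {
        return limit;
    }
    if (x < -limit) {
        return -limit;
    }
    return x;
}

/* Prototype given before the definition so the compiler knows the parameter types */
void clip_block(float *data, size_t length, float limit);

/* Clip every sample in the block to the range -limit to +limit */
void clip_block(float *data, size_t length, float limit)
{
    size_t i;

    for (i = 0; i < length; i++) {
        data[i] = clip_sample(data[i], limit);
    }
}
```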


Interrupt Service Routines

  • When implementing interrupt service routines (ISRs), all registers that are used within the ISR must be pushed onto the stack to prevent side effects. In some cases using higher levels of optimization and hence extra registers for interrupt service routines may actually slow them down, due to the overhead of the extra stack manipulation that is required at both the start and end of the ISR. Experiment with different levels of optimization on different sections of code by splitting them into separate source files.
  • Variables shared between ISRs and other functions should be declared as 'volatile' to prevent the optimizer from removing or caching accesses to them.
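
The 'volatile' point can be illustrated with the sketch below. It is a simulation rather than real interrupt code: adc_isr and poll_sample are hypothetical names, a real ISR would take no parameters and would be attached to an interrupt vector in a device specific way, and the sketch ignores the question of making the read-and-clear sequence atomic.

```c
#include <stdint.h>

/* Shared between the ISR and the main loop, hence 'volatile' */
static volatile uint32_t sample_ready = 0;
static volatile int16_t latest_sample = 0;

/* Hypothetical ISR: stores the new sample and sets the flag.
   The parameter stands in for reading a hardware data register. */
void adc_isr(int16_t adc_value)
{
    latest_sample = adc_value;
    sample_ready = 1;
}

/* Called from the main loop: returns 1 and copies the sample if one is waiting,
   otherwise returns 0 */
int poll_sample(int16_t *out)
{
    if (!sample_ready) {
        return 0;
    }
    *out = latest_sample;
    sample_ready = 0;
    return 1;
}
```

Without 'volatile', an optimizer would be entitled to read sample_ready once and assume it never changes within the main loop.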


Data Types

  • When any DSP functions are implemented on fixed-point DSPs it is imperative that careful attention is paid to such issues as overflow and wrap-around due to the hardware numerical bounds.
  • Never use data words longer than necessary and try to ensure they can be loaded into the CPU core in a single cycle.
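
To illustrate the overflow issue, here is a minimal sketch of a saturating 16 bit addition written in portable C. The name sat_add16 is hypothetical, and many fixed-point DSPs provide this behaviour in hardware or through an intrinsic, which would normally be the better choice.

```c
#include <stdint.h>

/* Add two 16 bit values, clamping the result to the 16 bit range instead of
   letting it wrap around */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b; /* the wider type cannot overflow here */

    if (sum > INT16_MAX) {
        return INT16_MAX;
    }
    if (sum < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)sum;
}
```

With plain wrap-around arithmetic, adding 30000 and 10000 in 16 bits would typically give a large negative number, which is usually far more damaging to a signal than clipping.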


Loop Optimization

  • Most DSP algorithms, by their nature, consist of tight looped code and there are many steps that can be taken to optimize loop execution, including :
  • Move constant expressions outside the loop and pre-calculate the result. Modern compilers are usually able to do this automatically however it is often better, especially with long loops, to assist the compiler by implementing this at the source code level.
  • Replace division operations with multiplication by the reciprocal and if the divisor is a constant take the reciprocal operation outside the loop. Care should be taken with this route because the numerical errors will be different for each technique.
  • Unrolling loops : The loop code should be repeated several times at the source code level, within the loop construct. On some processors this can benefit the performance by allowing parallel operations from separate iterations however it is important to ensure that the code section is not larger than the on-chip cache. Some DSP compilers generate single instruction looped code that is un-interruptable, unrolling an inner loop will require more program memory but the code will often be just as efficient but it will also be interruptable.
  • Reduce data dependencies : By separating the data used in one operation from the data used in another parallel operation, the compiler can often utilize the on-chip resources more efficiently.
  • Always try to avoid calling functions within loops or if absolutely necessary use function pointers, especially if the function that is called is data dependent.
  • Analyze the performance of the compiler with respect to 'do', 'do while' and 'for' loop efficiencies. For a given algorithm, the efficiency of each technique can be both compiler and algorithm dependent.
  • Try to avoid 'test and branch' operations because they can be time consuming and can call code that is not currently resident in the cache. This can often be achieved by splitting the loop into multiple instances, each handling separate conditions.
  • Many compilers will perform better if the data is read from memory at the beginning of the loop and written back at the end.
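
Two of the suggestions above, moving constant expressions out of the loop and replacing a division with a multiplication, can be sketched as follows. The function names normalise_div and normalise_recip are hypothetical. As noted in the list, the two versions are not numerically identical in general, so the results should be compared for the data ranges of interest.

```c
#include <stddef.h>

/* Straightforward version: one multiply and one divide per sample */
void normalise_div(float *data, size_t length, float gain, float full_scale)
{
    size_t i;

    for (i = 0; i < length; i++) {
        data[i] = (data[i] * gain) / full_scale;
    }
}

/* Optimized version: the loop-invariant scale factor is computed once,
   leaving a single multiply per sample inside the loop */
void normalise_recip(float *data, size_t length, float gain, float full_scale)
{
    const float scale = gain / full_scale; /* pre-calculated outside the loop */
    size_t i;

    for (i = 0; i < length; i++) {
        data[i] *= scale;
    }
}
```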


Numerical Operations

  • Multiplication and division of integers by numbers which are powers of 2 can usually be performed more efficiently using bit shift operations.
  • Try to avoid using trigonometric functions by using look-up tables, especially in FFT routines etc.
  • When using a floating point device, try to use floating point data formats, this will reduce the burden on fixed point processing units, which will probably also be required to perform the loop counting operations. Some general purpose devices can perform floating point data operations quicker than fixed point. Floating point data also has the advantage that scaling issues are less demanding and can usually be accounted for with less overhead.
  • Try to avoid underflows or overflows of the numerical system, unless the algorithm demands it.
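
The shift and look-up table suggestions can be sketched as below. The names mul_by_8, div_by_8 and sine_lut are hypothetical and the 8 entry Q15 table is only for illustration; a real application would use a much longer table. The shifts use unsigned values because right shifting a negative signed integer is implementation defined in C.

```c
#include <stdint.h>

/* Multiply and divide by 8 (2 to the power 3) using shifts */
uint32_t mul_by_8(uint32_t x)
{
    return x << 3;
}

uint32_t div_by_8(uint32_t x)
{
    return x >> 3;
}

/* One cycle of a sinusoid, 8 samples, in Q15 format (sin(2*pi*k/8) * 32767) */
static const int16_t sine_table_q15[8] = {
    0, 23170, 32767, 23170, 0, -23170, -32767, -23170
};

/* Look up the sine value for a phase index, wrapping every 8 samples.
   The table length is a power of 2 so the wrap is a simple bit mask. */
int16_t sine_lut(uint32_t phase_index)
{
    return sine_table_q15[phase_index & 7u];
}
```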


Memory Accessing

  • Most compilers allow the use of different memory models however it is always better to use the smallest model necessary because large models often entail an overhead for manipulating memory segment or page pointers.
  • Most modern processors include zero overhead pointer manipulation and this can mean that using pointers to access arrays in a linear fashion is often faster than using array indexing. It should be noted that this is not always true and will vary from processor to processor.
  • Most DSPs incorporate functionality for zero overhead looping and bit reversed addressing and in order to use these techniques it is often necessary to correctly align the base element in a data vector. Incorrect array alignment is one of the most common reasons for DSP code not working correctly.
  • The CPU must access memory for loading both program instructions and data and huge benefits can be gained by analyzing the data flows and locating the heaviest loaded functions or arrays in on-chip memory. It is often a good idea to experiment with different combinations of data, stack and/or program instructions within the on-chip memory.
  • Always enable the caches.
  • Some DSPs have separate program and data memory spaces and on others they are combined. Pipeline and internal bus conflicts can mean that particular arrangements for the partitioning of program and data can be more efficient than others. See pipeline conflicts section.
  • Utilize the maximum width of the external bus. Many DSPs can now load multiple parallel data words and separate them within the CPU, with no processing overhead. E.G. load two 16 bit words with one 32 bit transfer. This may require some loop unrolling.
  • For efficiently performing multiple accesses to arrays and complex structures, data should be loaded into temporary local (preferably 'register') variables.
  • For large data sets or large programs, it can be more efficient to store all the instructions and / or data in external memory and use the on-chip DMA controller to read the appropriate parts into internal memory when needed.
  • In 'paged' memory systems it is often beneficial to ensure, where possible, that individual data sets do not span page boundaries because this can cause delays to be inserted in the memory access cycle.
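
The pointer versus array indexing point can be compared on a given target with a pair of functions like the following. This is only a sketch with hypothetical names (dot_indexed and dot_pointer); which version is faster depends on the processor and compiler, so both should be benchmarked.

```c
#include <stddef.h>

/* Dot product using array indexing */
float dot_indexed(const float *a, const float *b, size_t length)
{
    float acc = 0.0f;
    size_t i;

    for (i = 0; i < length; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Dot product using linear pointer access with post-increment, which maps
   naturally onto the address generators found on many DSPs */
float dot_pointer(const float *a, const float *b, size_t length)
{
    register float acc = 0.0f;   /* accumulate in a local, write back once */
    const float *end = a + length;

    while (a < end) {
        acc += (*a++) * (*b++);
    }
    return acc;
}
```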

CPU Utilization

  • Use intrinsic functions. Intrinsic functions are C like functions that directly map to the low level instructions of the CPU. Often the use of these functions allows specific or more efficient variations of standard mathematical operations (E.G. +-/x).

Avoid Pipeline Conflicts

Most DSP CPUs are fed with instructions and data in a pipeline and before attempting to obtain the maximum performance from these devices it is important to be familiar with the pipeline. The User's Guides for the DSPs usually incorporate an important chapter on this subject.


  • Correct partitioning of program and data across the various memory segments is critical.
  • Avoid internal or external memory access conflicts.
  • It is often only possible to access both internal and external memory within a single instruction cycle if the external memory access is initiated first.

Memory Conflicts

If you find that your otherwise perfect C code crashes for no apparent reason then the problem may relate to memory management.

When learning about writing C programs many books and classes specify that all data that is local to a function should be declared within the function. This provides for a neater programming style that reduces problems caused by coding errors. Unfortunately, this is not the whole truth because local data is placed on the stack, which is a dynamic structure that grows and shrinks as required. This applications note describes how to get the most out of your compiler and linker tools to avoid unpredictable program execution.

The compiler will turn the C code into executable object code and this is passed to the linker, which will locate all of the code and data into the memory for the target processor. Locating program code is generally easy because the compiler will calculate how large the code space is and the linker will then place it correctly in memory. Data is altogether more complex because there are several different ways to access data. Although there are many different data types, they can all be split into three generic types :


  • Heap Space - This is data space that is dynamically allocated, for example using malloc () etc.
  • Global Variable Space - This is the memory space for global variables that are statically allocated by the program. I.E. this memory space is fixed in size at compile time.
  • Stack Space - This is a dynamic memory space that is used for function parameters and variables that are local to a function.


The C source declaration of global variables is shown in figure 1.

/* Start of source file */
/* Data and arrays allocated here will be placed in global variable space */

void function (void)

{

/* Data and arrays allocated here will be placed on the stack unless declared 'static' */

}

Figure 1 : Memory allocation in a C source file

While the compiler and linker can calculate the size of the global variable space, memory allocation problems usually arise because of the stack and heap spaces due to the fact that they are dynamic and hence neither the compiler nor the linker can predict how large they are going to be nor where to locate them in memory.

Although there are many different allocation schemes in use, the same basic rules apply to all. Figure 2 shows a generic scheme where the stack grows from the bottom of memory up and the heap grows from the top down, with the global variable space located in the middle. Problems will occur in this scheme if either the stack or the heap spaces expand over the global variable space.




All compilers utilise pre-defined maximum heap and stack sizes that can usually be modified by the use of a command line parameter. If the dynamic memory allocation functions (e.g. malloc ()) are used and the return values checked then a lack of heap space will become obvious and can be worked around by modifying the linker options or the program code. The stack however does not check that there is enough space to allocate the required memory and if too much is requested then the first symptom is typically an errant pointer and a program crash.

Typically linkers allocate less than 10 KBytes for the stack and while for simple programs this is not an issue, for programs that are required to process a large amount of data, this will mean that the code must be written to take this into account. For a real time DSP application, data structures may often be larger than this, especially when floating point data is used (float is of size 4 bytes and double of size 8 bytes) so these must never be placed on the stack. The rules for managing the stack are therefore simple :


  • Limit the number of parameters that are passed to a function - i.e. use pointers where large amounts of data must be passed.
  • Limit the size of the local variable space - i.e. only use this for individual variables and never for arrays.

Thus we have distilled the majority of problems related to memory management down to removing all local arrays from functions. We will now look at the steps we can take to do this in a structured manner.

We basically have three types of memory to manage :


  • Static Initialised - This is typically static look up tables that are initialised with data at compile time, for example constellation point maps for QAM transmitters.
  • Dynamic Initialised - This is typically look up tables that are allocated and initialised with data at run time, for example FFT twiddle factor tables that are initialised using the ANSI C sin () function.
  • Dynamic Uninitialised - This is typically application data that is being processed at run time.

(The fourth possible option of static uninitialised is a perfectly valid option but the use of dynamically allocated data arrays leads to a more structured program).

There are two options for statically allocating look up tables, they can either be allocated outside of any function so that the data will be located in the global memory space or they can be declared locally within a function and the static key word used to specify that the data is placed in the global memory space. As a point of note, the first option can often be of most use in a DSP application because many look up tables are required in multiple functions (for example a constellation map may be required in both modulator and demodulator functions) so declaring them to be global can reduce duplication.

For dynamic memory, the solution is to use malloc to allocate the data arrays but this is a slow function to execute and not consistent with being included in a real time function. As a result it is common for all complex DSP functions to be accompanied by an initialisation function that can be called at boot time. This initialisation function will have the task of allocating memory and, if required, initialising the contents.
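
A minimal sketch of this pattern is shown below, using a simple delay line as the DSP function. The type delay_line_t and the functions delay_line_init, delay_line_process and delay_line_free are hypothetical names chosen for illustration. The slow malloc call and the checking of its return value are confined to the initialisation function, so the real time function never allocates memory.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    float *state;   /* dynamically allocated delay memory */
    size_t length;  /* delay length in samples */
    size_t index;   /* current position in the circular buffer */
} delay_line_t;

/* Called once at boot time: allocates and clears the state array.
   Returns 0 on success or -1 if the length is zero or the heap is exhausted. */
int delay_line_init(delay_line_t *dl, size_t length)
{
    size_t i;

    if (length == 0) {
        return -1;
    }
    dl->state = (float *)malloc(length * sizeof(float));
    if (dl->state == NULL) {
        return -1;
    }
    for (i = 0; i < length; i++) {
        dl->state[i] = 0.0f;
    }
    dl->length = length;
    dl->index = 0;
    return 0;
}

/* Real time function: no allocation, no large local arrays on the stack */
float delay_line_process(delay_line_t *dl, float input)
{
    float output = dl->state[dl->index];

    dl->state[dl->index] = input;
    if (++dl->index >= dl->length) {
        dl->index = 0;
    }
    return output;
}

/* Called at shutdown: releases the state array */
void delay_line_free(delay_line_t *dl)
{
    free(dl->state);
    dl->state = NULL;
    dl->length = 0;
    dl->index = 0;
}
```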

If you have tried all of these suggestions and your code still crashes then it may be that you will need to use a larger memory model. This allows for a larger memory space but will often mean that the code will run slower because memory paging will be used, rather than direct memory accesses.
