I recently had to profile a large C code project, to identify functions to target for further optimization and I used gprof for this task.
In this blog post I'm going to take the dot-product code from one of my earlier posts and use that to demonstrate how to profile an application. The original code is available here: https://blog.numerix-dsp.com/2013/01/lets-do-some-digital-signal-processing.html.
Gprof is part of the GNU Binutils package (https://www.gnu.org/software/binutils/) and in general this is installed alongside the standard GCC (build-essential) package with:
sudo apt install build-essential
In order to generate the profile in a human friendly format (a .png image) we're going to use dot, which is part of the graphviz package. Graphviz can be installed using:
sudo apt install graphviz
We're also going to use gprof2dot (https://pypi.org/project/gprof2dot/). Depending on your Python installation you can use either of the following commands:
pip:
pip install gprof2dot
conda:
conda install -c conda-forge gprof2dot
Profiling is enabled at compile time using the '-pg' command-line option and inserts instrumentation code into the application to create a statistical approximation of the code execution by, typically, sampling the code execution every 10 milliseconds. This means that to get a good approximation of the function load then the code needs to run for a number of seconds to allow the statistics to be generated accurately.
Here is the example code, in which main() calls two functions called dotp1() and dotp2(), which in turn call the main dotp() function multiple times to allow the profiler to do its work.
#define COUNT 48 // The filter size must be modulo 8
#define USE_ARRAY_INDEX 1 // Set to '1' to use array indexing or '0' to use pointers
float a[] = {
1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0,
11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0,
21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0,
31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0,
41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0
};
float b[] = {
41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0,
51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0,
61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0,
71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0,
81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0
};
float dotp (const float *, const float *, const int count);
float dotp1 (const float *, const float *, const int count);
float dotp2 (const float *, const float *, const int count);
void main (void)
{
float sum;
sum = dotp1 (a, b, COUNT);
sum = dotp2 (a, b, COUNT);
printf ("Dot product = %f\n", sum);
}
float dotp (const float *x, const float *y, const int count)
{
float sum = 0.0;
for (int i = 0; i < count; i++) {
#if USE_ARRAY_INDEX
sum += x[i] * y[i];
#else
sum += *x++ * *y++;
#endif
}
return sum;
}
float dotp1 (const float *x, const float *y, const int count)
{
float sum = 0.0;
for (int i = 0; i < 10000000; i++) {
sum = dotp (x, y, count);
}
return sum;
}
float dotp2 (const float *x, const float *y, const int count)
{
float sum = 0.0;
for (int i = 0; i < 12000000; i++) {
sum = dotp (x, y, count);
}
return sum;
}
Here is the shell script to build and execute the app then generate the png output file:
And here is the .png output:
In the image we can see the following information:
- The call graph structure
- The number of times each function is called
- The percentage of total executation time each function uses, overall and at each level
In this simple application we observe that the majority of the time is spent in the dotp() function, which is to be expected. We can also see the relative time split betweeen the dotp1() and dotp2() functions, given the number of iterations in their 'for' loops.