Our ultimate goal at Appentra is to create a software suite that helps users achieve the peak performance of their software. One of the ways to get there is with a touch of parallelism. This post looks at the NPB CG benchmark, a popular benchmark for comparing supercomputers developed by NASA, and shows how our Parallelware Analyzer can help you distribute your program's workload across several CPU cores.
The workflow
The steps are as follows:
- We assume that you have a Linux system with the GCC or Clang toolchain.
- Download the source code of the benchmark.
- Compile it.
- Profile it to see where the hot spots are.
- Use Parallelware Analyzer to automatically distribute the workload of the hot loops across several CPU cores.
- Finally, measure the speed improvement.
Prerequisites
First, we need to download the source code of the NPB CG benchmark. When you have the code, unpack it at a convenient location. You will also need Parallelware Analyzer, available on our website, and the perf profiler, available in your Linux distribution's repository.
The benchmark we are interested in is located in the SNU_NPB-1.0.3/NPB3.3-SER-C/CG directory. Go to that directory and execute the following commands:
$ mkdir ../bin
$ make CLASS=C CC=clang CFLAGS="-O3 -g -fopenmp" CLINKFLAGS="-lm -fopenmp"
$ ../bin/cg.C.x
The make command compiles the benchmark for class C (class C means a large input size; you can look up the other input classes in the source code). The last command runs the benchmark.
Finding the hot loops
To make our program run faster, we first need to find the code that takes the most time. We call this code hot code, and we use tools called profilers to find it. On Linux, one of the most popular profilers is perf, which we will use here. To collect the profiling data and display the results, execute:
$ perf record --call-graph dwarf ../bin/cg.C.x
$ perf report
The command perf report opens perf's text user interface, where you can look around and see which functions take the most time. IBM has a full tutorial on using perf available here; we assume you are familiar with perf, but if not, check it out. We are interested in functions with a high percentage in the Self column. In our case, the function that clearly dominates the profile is conj_grad.
To see which code inside the function takes the most time, press the A key to annotate. In perf's terminology, annotation means marking each instruction in the function with a percentage: how much time the program spent executing that particular instruction.
If we look at perf's output, we see that our program spends a lot of time running instructions belonging to the loop for (k = rowstr[j]; k < rowstr[j+1]; k++). If we add up the percentages of the instructions colored red or green, it turns out that 94.48% of the time is spent there. Here is the full source code of the loop with its surrounding loops:
for (cgit = 1; cgit <= cgitmax; cgit++) {
  ...
  for (j = 0; j < lastrow - firstrow + 1; j++) {
    sum = 0.0;
    for (k = rowstr[j]; k < rowstr[j+1]; k++) {
      sum = sum + a[k]*p[colidx[k]];
    }
    q[j] = sum;
  }
  ...
}
Before making any parallelization effort, you should try to understand why your serial code is slow. Speeding up the serial code will almost always speed up the parallel code as well. In this case, the slowness comes from data cache misses. The access to p[colidx[k]] will almost always result in a data cache miss, and since this access is inside a triply nested loop, we can expect many cache misses. You can confirm the hypothesis by recording data cache misses with perf record -e cache-misses --call-graph dwarf ../bin/cg.C.x. The command will point you to the code where most cache misses occur.
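To illustrate why this access pattern hurts, here is a minimal standalone sketch (our own example, not taken from the benchmark) contrasting a contiguous traversal with an indirect, index-driven one like p[colidx[k]]. The contiguous loop is friendly to the hardware prefetcher; the indirect loop jumps around memory and tends to miss the cache on almost every iteration.

/* Minimal sketch (not from the benchmark): sequential vs. indirect access.
   The indirect version mimics p[colidx[k]]: each load depends on an index
   array, so consecutive iterations touch unpredictable cache lines. */
double sum_sequential(const double *p, int n) {
    double sum = 0.0;
    for (int k = 0; k < n; k++)
        sum += p[k];            /* contiguous: the hardware prefetcher helps */
    return sum;
}

double sum_indirect(const double *p, const int *colidx, int n) {
    double sum = 0.0;
    for (int k = 0; k < n; k++)
        sum += p[colidx[k]];    /* gather: likely a cache miss per iteration */
    return sum;
}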
Unfortunately, in this case no serial optimization does the trick: there is no way to refactor the code to avoid the access to p[colidx[k]]. We might try to cache the values, e.g. p_tmp[k] = p[colidx[k]], since we are in a nested loop, but k runs over a different range for every value of j, so nothing is ever reused. We have to look for performance elsewhere; a sketch of the failed attempt follows.
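Here is a sketch of that hypothetical refactoring (p_tmp is a scratch buffer we introduce for illustration; the function name and signature are ours, not the benchmark's). Because k covers a different range [rowstr[j], rowstr[j+1]) for every j, each gathered value is used exactly once, so the extra copy incurs the same cache misses and saves nothing.

/* Hypothetical refactoring sketch: gather the values of p used by row j into
   a temporary buffer before the dot product. nrows corresponds to
   lastrow - firstrow + 1 in the benchmark. Each gathered value is used only
   once, so this buys no cache reuse. */
void row_dot_products(const double a[], const int colidx[], const int rowstr[],
                      const double p[], double q[], int nrows, double p_tmp[]) {
    for (int j = 0; j < nrows; j++) {
        int len = rowstr[j+1] - rowstr[j];
        for (int i = 0; i < len; i++)
            p_tmp[i] = p[colidx[rowstr[j] + i]];   /* same gather, same misses */
        double sum = 0.0;
        for (int i = 0; i < len; i++)
            sum += a[rowstr[j] + i] * p_tmp[i];
        q[j] = sum;
    }
}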
The magic of parallelism
Now that we know where our program spends most of its time, we can focus on speeding it up. We use Parallelware Analyzer to do this automatically. It can detect loops with parallelization potential and annotate them with OpenMP pragmas [1]. The OpenMP runtime can then distribute the loop across multiple CPU cores.
The steps to use Parallelware Analyzer are as follows:
- Use pwloops to list all parallelization opportunities.
- Use pwdirectives to autoparallelize the loop.
- Measure the performance difference.
To list all the parallelization opportunities, we use the tool pwloops. The command looks like this:
$ pwloops cg.c -- -I../common/
We are instructing pwloops to perform the analysis on the file cg.c. On the command line, after --, we need to provide the same flags we pass to the compiler. In our case we provide -I../common/, which tells it to look for headers in the ../common directory.
Compiler flags: -I../common/

Loop                          Analyzable  Compute patterns  Opportunity  Auto-Parallelizable  Parallelized
----------------------------  ----------  ----------------  -----------  -------------------  ------------
cg.c
(part of the output omitted for brevity)
|- conj_grad:425:3            x           forall            simd, multi  x
|- conj_grad:436:3            x           scalar            simd, multi  x
|- conj_grad:445:3            x                             n/a
|  |- conj_grad:458:5         x           forall            multi        x
|  |  `- conj_grad:460:7      x           scalar            simd         x
|  |- conj_grad:508:5         x           scalar            simd, multi  x
|  |- conj_grad:527:5         x           forall            simd, multi  x
|  |- conj_grad:536:5         x           scalar            simd, multi  x
|  `- conj_grad:548:5         x           forall            simd, multi  x
|- conj_grad:559:3            x           forall            multi        x
|  `- conj_grad:561:5         x           scalar            simd         x
|- conj_grad:570:3            x           scalar            simd, multi  x
(part of the output omitted for brevity)

Loop                : loop name following the syntax <file>:<function>:<line>:<column>
Analyzable          : all C/C++/Fortran language features present in the loop are supported by Parallelware
Compute patterns    : compute patterns found in the loop ('forall', 'scalar' or 'sparse' reduction, 'recurrence', 'dep(endency)')
Opportunity         : whether the loop is a parallelization opportunity and for which paradigms ('multi' for multi-threading or 'simd' for vectorization)
Auto-Parallelizable : loop can be parallelized by Parallelware
Parallelized        : loop is already parallelized, for instance with OpenMP or OpenACC directives

SUGGESTIONS

  Get more details about the data scoping of each variable within a loop, e.g.:
      pwloops --datascoping --loop cg.c:vecset:887:3 cg.c -- -I../common/

  Print the code annotated with opportunities, e.g.:
      pwloops --code --function cg.c:vecset cg.c -- -I../common/

  Find out what prevented the analysis of a loop, e.g.:
      pwloops --non-analyzable --loop cg.c:main:140:3 cg.c -- -I../common/

  Parallelize an auto-parallelizable loop, e.g.:
      pwdirectives cg.c:conj_grad:425:3 -o <output_file> -- -I../common/

1 file successfully analyzed and 0 failures in 101 ms
The pwloops tool lists all the loops with parallelization potential. We are interested in the loop cg.c:conj_grad:458:5, since it belongs to our hot loop. For this loop, the compute pattern says forall and the opportunity says multi. A forall pattern is a loop in which each iteration modifies a single element of the output array and does its work independently of the other iterations. The multi opportunity is short for multithreading, which means that the loop can be parallelized by distributing its iterations across multiple CPU cores.
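To make the forall pattern concrete, here is a generic sketch (our own example, not from the benchmark): iteration i writes only out[i] and reads nothing written by other iterations, so all iterations can safely run in parallel.

/* Generic forall sketch (not from the benchmark): each iteration writes
   exactly one output element and is independent of all other iterations. */
void scale_add(double *out, const double *a, const double *b, double s, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + s * b[i];   /* no iteration touches out[j] for j != i */
    }
}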
Since the Auto-Parallelizable box is ticked for this loop, Parallelware Analyzer can rewrite the source code to add parallel semantics. The pwloops output above even suggests the pwdirectives command to use. Let's do it:
$ cp cg.c cg-original.c
$ pwdirectives cg.c:conj_grad:458:5 --in-place -- -I../common
The replacement is done in place thanks to the --in-place command-line switch. To show the changes made by Parallelware Analyzer, here is the output of a diff tool:
diff --git a/cg-original.c b/cg.c
index d3df699..2779b98 100644
--- a/cg-original.c
+++ b/cg.c
@@ -455,6 +455,9 @@ static void conj_grad(int colidx[],
   // The unrolled-by-8 version below is significantly faster
   // on the Cray t3d - overall speed of code is 1.5 times faster.
 
+  #pragma omp parallel default(none) shared(a, colidx, firstrow, lastrow, p, q, rowstr) private(j, k, sum)
+  {
+  #pragma omp for private(k, sum) schedule(auto)
   for (j = 0; j < lastrow - firstrow + 1; j++) {
     sum = 0.0;
     for (k = rowstr[j]; k < rowstr[j+1]; k++) {
@@ -462,6 +465,7 @@ static void conj_grad(int colidx[],
     }
     q[j] = sum;
   }
+  } // end parallel
 
   /*
   for (j = 0; j < lastrow - firstrow + 1; j++) {
Parallelware Analyzer added two OpenMP pragmas that allow the loop to be distributed across multiple CPU cores. It marked each variable as private or shared, depending on how the variable is used when the loop runs in a multithreaded setting.
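As a rough illustration of what shared and private mean (a generic sketch of ours, unrelated to the benchmark code): shared variables are visible to all threads, while each thread gets its own copy of private variables, so threads do not overwrite each other's loop counters and temporaries.

/* Generic sketch of shared vs. private (not from the benchmark): 'data',
   'result' and 'n' are shared by all threads; 'i' and 'tmp' are private,
   so every thread works on its own copy. */
void square_all(const double *data, double *result, int n) {
    int i;
    double tmp;
    #pragma omp parallel for default(none) shared(data, result, n) private(i, tmp)
    for (i = 0; i < n; i++) {
        tmp = data[i];
        result[i] = tmp * tmp;
    }
}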
The results
The pwdirectives command has created a parallelized version of our loop, so now we need to recompile it. But before doing that, let's measure how much time the original program takes. You can use the time command:
$ time ../bin/cg.C.x
Now, let’s recompile it with multithreading and measure the difference. To recompile and run the multithreaded version:
$ make CLASS=C CC=clang CFLAGS="-O3 -g -fopenmp" CLINKFLAGS="-lm -fopenmp"
$ time ../bin/cg.C.x
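By default, the OpenMP runtime typically uses as many threads as there are hardware threads; you can override this with the standard OMP_NUM_THREADS environment variable, which is useful for comparing runs with different thread counts (such as the 4-thread and 16-thread runs discussed below). If you want to check the thread count from inside a program, a tiny test like the following (our own sketch, not part of the benchmark; compile it with -fopenmp) does the job:

#include <stdio.h>
#include <omp.h>

/* Minimal sketch (not part of the benchmark): print how many threads
   the OpenMP runtime will use for the next parallel region. */
int main(void) {
    printf("OpenMP will use up to %d threads\n", omp_get_max_threads());
    return 0;
}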
We measured the speed difference on a general-purpose AMD Ryzen 7 4800H laptop with 8 cores and 16 hardware threads, 16 GB of memory, the Ubuntu 20.04 operating system and the Clang 10 compiler. Here are the results: the serial version ran in 141.4 seconds, while the multithreaded version ran in 45.7 seconds with 16 threads and 52.1 seconds with 4 threads.
The multithreaded version is about 3 times faster than the original. A nice improvement for very little effort!
A few concluding remarks
In this example, we showed how focusing on hot loops and distributing them across multiple cores can speed up a program. Parallelware Analyzer did all the dirty work for us, so we got our program running at full speed very quickly.
Although Parallelware Analyzer detected many parallelization opportunities, we decided to parallelize only one hot spot. Why? Distributing the workload across several cores is not a magic solution to every performance issue. It takes time to spawn several threads and divide the work among them, so this kind of optimization pays off only for significant workloads. Parallelizing every possible loop regardless of its trip count would cause severe performance degradation.
A question you might be asking yourself is why the speedup is not proportional to the number of threads. If the serial version took 141.4 seconds, we could expect the version with four threads to take one fourth of that and the version with sixteen threads to take one sixteenth; instead, sixteen threads gave only about a 3.1x speedup (141.4 s / 45.7 s). Proportional scaling does happen in some applications, but in this particular case the limiting factor is not computational power but memory bandwidth. All the cores share access to the same memory, and that bandwidth is limited. This is why the speedup is not proportional to the number of threads, and also why the version with 4 threads (52.1 seconds) is only slightly slower than the version with 16 threads (45.7 seconds).
Further reading
If you want to know more about Parallelware Analyzer and how it can help you speed up your programs, check out the ATMUX quickstart and the NPB BT quickstart. These quickstart guides explain in more detail how to use Parallelware Analyzer to make your program run faster with very little effort!
Related resources
- Appentra – Parallelware Tools ATMUX Quickstart
- Appentra – Parallelware Analyzer NPB Quickstart
- Appentra – Knowledge Base: Patterns
- Appentra – Parallelware Analyzer
- NASA – NAS Parallel Benchmarks (NPB)
- Seoul National University – SNU NPB Suite
- IBM Developer – Analyzing performance with perf annotate
[1] Note: OpenMP is the de facto industry standard for distributing workloads across several CPU cores. All major compilers support it. It uses compiler pragmas to tell the compiler and the runtime how to distribute the workload across several CPU cores.
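For readers who have never seen OpenMP, here is a minimal generic example of that pragma-based style (our own sketch, unrelated to the benchmark): a single directive asks the compiler and runtime to split the loop iterations among the available threads.

/* Minimal generic OpenMP example (not from the benchmark): the pragma asks
   the runtime to divide the loop iterations among the available threads. */
void add_vectors(const double *a, const double *b, double *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Compiling with -fopenmp, as in the make invocation above, enables the pragma; without that flag the compiler simply ignores it and the code stays serial.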
Join Parallelware Analyzer Early Access
Enter the program to have access to all versions of Parallelware Analyzer until the official release.