In this blog post, we show how to get started with Parallelware Analyzer using a matrix multiplication example in C. You will see how to obtain actionable insights through optimization reports that help you follow best practices for speeding up code through parallel computing on vector, multicore and accelerator processors.
A structured report displays the actionable items (defects, recommendations, remarks, opportunities for parallelization, …) detected at the function level and at the loop level, followed by a code coverage summary and a performance metrics summary. You can control the amount of detail to be displayed and you will get clear suggestions on what your next actions should be, no matter whether they correspond to code changes or further invocations of Parallelware Analyzer to dig into more information.
You can find the code in the examples/matmul subfolder inside your Parallelware Analyzer installation.
The src folder contains the following source files and an include subfolder with the corresponding header files:
- main.c: main and matmul functions. This is the file to work on, specifically the matmul function, which performs the computation.
- matrix.c: matrix utility functions. Matrices are stored in contiguous dynamic memory, using a linearized array plus a row pointer array.
- clock.c: time measurement utility function. It uses the most reliable timing function available, trying OpenMP's timer first and falling back to clock as the last option.
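The storage layout and the timer fallback described in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the code shipped in matrix.c or clock.c: the names matrix_alloc, matrix_free and seconds_now are hypothetical, and the real timer may try more alternatives than the two shown here.

```c
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>   /* only available when compiling with OpenMP enabled */
#endif

/* Hypothetical helper: allocate an n x n matrix as one contiguous data block
 * plus an array of row pointers, so that m[i][j] and data[i * n + j] name
 * the same element. Returns NULL on failure or when n is 0. */
double **matrix_alloc(size_t n) {
    if (n == 0) {
        return NULL;
    }
    double **rows = malloc(n * sizeof *rows);
    double *data = calloc(n * n, sizeof *data);   /* linearized storage */
    if (rows == NULL || data == NULL) {
        free(rows);
        free(data);
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        rows[i] = data + i * n;   /* row i starts at offset i * n */
    }
    return rows;
}

/* Hypothetical helper: release both allocations. rows[0] points at the
 * start of the contiguous data block. */
void matrix_free(double **rows) {
    if (rows != NULL) {
        free(rows[0]);
        free(rows);
    }
}

/* Hypothetical helper: current time in seconds. Prefers the OpenMP
 * wall-clock timer when the compiler defines _OPENMP (e.g. with -fopenmp). */
double seconds_now(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    /* Last resort: clock() reports processor time, not wall-clock time. */
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}
```

Keeping the data contiguous lets the code use the familiar m[i][j] syntax while still being able to pass the whole matrix around as a single block of n * n values.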
How to build and run
Both a CMake project and a Makefile are provided, so you can build using either CMake or make directly. The result is a matmul binary that requires the matrix size as an argument.
You can use any CMake generator. In the following example we use Ninja:
$ mkdir build
$ cd build
$ cmake -G Ninja ..
$ ninja
$ ./matmul 1000
Using Parallelware Analyzer
1. Analyze your hotspot
You should always start by invoking the pwreport tool for your hotspot. In this example, this corresponds to the matmul function located in the main.c source file. Therefore, invoke:
$ pwreport src/main.c:matmul
FAILURES
pwreport: error: failed to analyze 'src/main.c'
...
You can see that there is an error due to a missing header file. Parallelware Analyzer is a static code analyzer and thus it requires the code to be valid: included header files must be available and required macro symbols must be defined. You can supply that information as arguments using GCC/Clang syntax following two dashes (--):
$ pwreport src/main.c:matmul -- -I src/include
Compiler flags: -I src/include

ACTIONS REPORT
  FUNCTION BEGIN at src/main.c:matmul:6:1
    LOOP BEGIN at src/main.c:matmul:8:5
      LOOP BEGIN at src/main.c:matmul:9:9
        2 remarks
        1 opportunity for parallelism (1 SIMD)
      LOOP END
      2 opportunities for parallelism (1 multi-threading and 1 offload)
    LOOP END
    LOOP BEGIN at src/main.c:matmul:15:5
      LOOP BEGIN at src/main.c:matmul:16:9
        LOOP BEGIN at src/main.c:matmul:17:13
        LOOP END
        1 recommendation and 3 remarks
      LOOP END
      1 recommendation and 4 remarks
      2 opportunities for parallelism (1 multi-threading and 1 offload)
    LOOP END
  FUNCTION END

CODE COVERAGE
  Analyzable files: 1 / 1 (100.00 %)
  Analyzable functions: 1 / 1 (100.00 %)
  Analyzable loops: 5 / 5 (100.00 %)
  Parallelized SLOCs: 0 / 17 ( 0.00 %)

METRICS SUMMARY
  Total defects: 0
  Total recommendations: 2
  Total remarks: 9
  Total opportunities: 5

SUGGESTIONS
  Use --level 1|2|3 to get more details, e.g:
    pwreport --level 2 src/main.c:matmul -- -I src/include
  If you want to get an overview of your whole codebase, not only the hotspot, you can use:
    pwreport --summary src -- -I src/include

1 file successfully analyzed and 0 failures in 33 ms
The hotspot analysis succeeds and a report is printed with the following sections:
- ACTIONS REPORT: structured report with actionable insights per function and loop.
- CODE COVERAGE: summary of how much code could be analyzed.
- METRICS SUMMARY: aggregated summary of the actionable insights detected in the analysis.
- SUGGESTIONS: general Parallelware Analyzer usage hints.
The CODE COVERAGE report shows that all the code was successfully analyzed and the METRICS SUMMARY shows the different actionable insights detected. The ACTIONS REPORT provides a summary of the actionable insights detected for each function and loop. As hinted in the SUGGESTIONS section at the end, you can add --level to increase the level of detail of the ACTIONS REPORT.
2. Dig deeper into the actionable insights for your hotspot
Rerun pwreport with --level 3, which is the most detailed level. This is very verbose, but it even provides Parallelware Analyzer invocations that you can copy and paste. For instance, let’s focus on the following excerpt from the output:
$ pwreport --level 3 src/main.c:matmul -- -I src/include
...
[OPP001] src/main.c:15:5 is a multi-threading opportunity
  Compute patterns:
    - 'forall' over the variable 'C'
  SUGGESTION: use pwloops to get more details or pwdirectives to generate directives to parallelize it:
    pwloops src/main.c:matmul:15:5 -- -I src/include
    pwdirectives --omp multi src/main.c:matmul:15:5 --in-place -- -I src/include
...
You can see suggestions on how to use other tools of Parallelware Analyzer: use pwloops to get details on the loop which constitutes an opportunity for parallelization or pwdirectives to create a parallel version of the loop using multi-threading.
3. Parallelize your hotspot
Let’s give the latter a try to add multi-threading to your matrix computation. First, let’s build and run matmul to see how long it takes for the sequential version to execute. You can use the Makefile to do so:
$ make
rm -f *.o matmul
cc -I src/include -fopenmp src/matrix.c src/clock.c src/main.c -o matmul
./matmul 1500
- Input parameters
n = 1500
- Executing test...
time (s)= 24.913945
size = 1500
chksum = 68432918175
Now copy the command suggested by pwreport. Note that --in-place modifies the file; you can use -o matmul_omp.c instead to create a new file:
$ pwdirectives --omp multi src/main.c:matmul:15:5 --in-place -- -I src/include
Compiler flags: -I src/include
Results for file 'src/main.c':
  Successfully parallelized loop at 'src/main.c:matmul:15:5' [using multi-threading]:
    15:5: [ INFO ] Parallel forall: variable 'C'
    15:5: [ INFO ] Loop parallelized with multithreading using OpenMP directive 'for'
    15:5: [ INFO ] Parallel region defined by OpenMP directive 'parallel'
Successfully updated src/main.c
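The log names two OpenMP directives, 'parallel' and 'for', applied to the outer loop of the multiplication nest. The modified file is not reproduced in this post, so the following is only a sketch of what such a change might look like. The function signature, the variable names and the exact clauses are assumptions, and the code generated by pwdirectives may differ.

```c
#include <stddef.h>

/* Illustrative sketch only, not the generated code. Compile with -fopenmp to
 * enable the directives; without it the pragmas are ignored and the function
 * runs sequentially with the same result. */
void matmul(size_t n, double **A, double **B, double **C) {
    /* Initialization nest: clear the result matrix. */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            C[i][j] = 0.0;
        }
    }

    /* 'parallel' creates the team of threads; 'for' splits the iterations of
     * the outer loop among them. Each iteration writes a different row of C
     * (the 'forall' pattern over C), so no two threads write the same element. */
    #pragma omp parallel
    {
        #pragma omp for
        for (size_t i = 0; i < n; i++) {
            for (size_t j = 0; j < n; j++) {
                for (size_t k = 0; k < n; k++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }
}
```

The inner loop indices are declared inside the parallelized loop, so each thread gets its own copies and no extra data-sharing clauses are needed in this sketch.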
Build and run again to compare the performance:
$ make
rm -f *.o matmul
cc -I src/include -fopenmp src/matrix.c src/clock.c src/main.c -o matmul
./matmul 1500
- Input parameters
n = 1500
- Executing test...
time (s)= 2.392996
size = 1500
chksum = 68432918175
On a laptop equipped with an AMD Ryzen 7 4800HS CPU (8 cores, 16 threads), the execution time went from about 24.9 seconds to about 2.4 seconds: more than a 10x speedup!
Where to go next
Parallelware Analyzer is composed of several tools: pwreport, pwcheck, pwloops and pwdirectives. pwreport is the link between all of them and will offer usage suggestions for different use cases.
For instance, if you look back at the previous example, you can see a suggestion to invoke pwloops:
$ pwloops src/main.c:matmul:15:5 -- -I src/include
Each tool in Parallelware Analyzer has many different subanalyses available. Use --help to get a listing of them along with the other available options.
In general, you should pay attention to the suggestions in the most detailed level of pwreport, which describe what is available for each actionable insight.
Focus on your area of interest
Parallelware Analyzer can produce large reports covering areas that might not be your main concern at the moment. To help you focus on your area of interest, all the actionable insights have associated tags that you can use to include them in or exclude them from the analysis results using --include-tags and --exclude-tags, respectively. For instance, the following would only report actionable insights related to offloading to the GPU:
$ pwreport --include-tags offload,gpu src/main.c:matmul -- -I src/include
You can see the list of tags associated with each defect, recommendation, remark and opportunity by invoking
Analyzing files and directories
By default, you are required to provide a hotspot (either a function or a loop) to be analyzed. However, in many cases you need to analyze an entire file or directory. You can do so by passing --summary to pwreport. To avoid large outputs, the ACTIONS REPORT is not printed when --summary is used unless --detail is also passed, for instance:
$ pwreport src --summary --detail -- -I src/include
All tools accept a configuration file through the --config argument. It can store compiler flags (such as -I src/include in the example) to be used when analyzing different files, integrate with build tools (e.g. to obtain the compiler flags from a JSON Compilation Database) or declare file dependencies to enable inter-procedural analysis across different source files.
For more details, take a look at docs/ConfigurationFile.md and examples/config in the root folder of your Parallelware Analyzer installation.
Integration with build tools
Supplying the required compiler flags can be a hassle. To avoid it, Parallelware Analyzer can consume a JSON Compilation Database, which can be generated using CMake or with tools such as bear that intercept compilation commands from different build systems. If you build the example using CMake, you will find a compile_commands.json file in the build directory. You can use the configuration file to instruct Parallelware Analyzer to use it or, if you don’t need any other settings, pass it directly to --config:
$ mkdir build
$ cd build
$ cmake ..
$ pwreport --config compile_commands.json ../src/main.c:matmul
For more details, take a look at docs/ConfigurationFile.md and examples/config in the root folder of your Parallelware Analyzer installation as well as at our Using CMake’s compilation database with Parallelware Analyzer blog post.
Inter-procedural analysis across multiple files
Parallelware Analyzer supports interprocedural analysis across multiple source files. This is required for instance when your hotspot invokes a function defined in another source file. In these cases, you will need to declare the file dependencies using the configuration file.
For more details, take a look at docs/ConfigurationFile.md and examples/config in the root folder of your Parallelware Analyzer installation as well as at our Interprocedural analysis across source code files with Parallelware Analyzer blog post.
- Fixing defects in parallel code: an OpenMP example
- Using CMake’s compilation database with Parallelware Analyzer
- Interprocedural analysis across source code files with Parallelware Analyzer
- JSON Compilation Database
Update: this post was updated on September 7th, 2021 so that the tool outputs shown match those of the latest version of Parallelware Analyzer.