This is the seventh video of our OpenACC programming course based on Appentra’s unique, productivity-oriented approach to learn best practices for parallel programming using OpenACC. Learning outcomes include how to decompose codes into parallel patterns and a practical step-by-step process based on patterns for parallelizing any code. In this video you will learn how to decompose the code into parallel patterns and best practices on how to translate such patterns into correct OpenACC directives.
✉ Subscribe to our newsletter and get all of our latest updates.
- Course overview
- What is OpenACC?
- How OpenACC works.
- Building and running an OpenACC code.
- The OpenACC parallelization process.
- Using your first OpenACC directives.
- Using patterns to parallelize.
- Identifying a forall pattern.
- Implementing a parallel forall.
Using patterns to parallelize
Now that you have learned about the most commonly used directives in OpenACC, parallel and loop, and the data clauses copyin, copyout and copy, it’s time to understand where and when to use them.
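As a brief reminder of how these directives and data clauses fit together, here is a minimal sketch in C. The function names `vector_add` and `scale_in_place` are illustrative and do not come from the course material. Arrays that are only read on the device use copyin, arrays that are only written use copyout, and arrays that are both read and written use copy.

```c
/* c = a + b
 * a and b are only read on the device  -> copyin
 * c is only written on the device      -> copyout */
void vector_add(int n, const double *a, const double *b, double *c)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* v = factor * v
 * v is read and then overwritten       -> copy */
void scale_in_place(int n, double factor, double *v)
{
    #pragma acc parallel loop copy(v[0:n])
    for (int i = 0; i < n; i++) {
        v[i] = factor * v[i];
    }
}
```

If the code is built without OpenACC support, the compiler ignores the pragmas and the loops run serially with the same results, which is convenient for incremental testing.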
Decomposing your code into components
OpenACC is designed to allow incremental code parallelization, and the simplest way to do this is to decompose your code into components that you can categorize by possible parallelization solutions.
By using this approach you can tackle real, complex codes that may otherwise seem daunting to parallelize.
- The goal is to take your serial code, break the code down into components, then categorize these components into two types:
- Scientific components that can use library calls such as an FFT calculation or matrix multiplication.
- Code components, or patterns. For example, this might be a loop that calculates a reduction, something we will cover in more detail later. Scientific components can often be parallelized by using a parallel library.
- For the remaining scientific components and the other code components, the next step is to categorize them by pattern type.
- Once all components have an identified pattern, the patterns can be matched to the different parallel solutions, and then the parallel solutions are implemented to create parallel code.
- Using this approach allows you to follow a step-by-step recipe for parallelism, speeds up the parallelization process and improves code quality and correctness by providing opportunities for regular testing, rather than using the ‘all or nothing’ parallelization approach.
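To make the two component types concrete, here is a small serial sketch in C with each component labelled. The functions `matmul` and `sum_of_squares` are hypothetical examples, not code from the course: the first is the kind of scientific component a parallel library usually provides, and the second is a code component that would be categorized by pattern.

```c
/* Component 1: scientific component.
 * Dense matrix multiplication C = A * B for n x n row-major matrices.
 * A candidate for replacement by a call to a parallel library. */
void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int k = 0; k < n; k++) {
                acc += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

/* Component 2: code component.
 * Every iteration accumulates into the single scalar `total`,
 * so this loop would be categorized as a reduction pattern. */
double sum_of_squares(int n, const double *v)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        total += v[i] * v[i];
    }
    return total;
}
```

Keeping the serial version of each component around gives you a reference result to test against after each parallelization step.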
It is worth noting that this process sits within the broader best practice for parallelization:
- First, identify the hotspots in the serial code, so that the parallelization effort focuses on the most computationally intensive part of the code.
- Once the hotspots are identified, the serial code is analyzed and decomposed into components.
- Following the process already described, these components are categorised and, where possible, patterns are identified.
- Directives are then added to produce parallel patterns,
- which results in parallel code.
Finally, in the overall parallelization process, you should compare the performance of the serial and parallel versions of the code, and then optimize the code.
But how do you actually go about decomposing code?
- First, make use of the tools available to you. Use a profiler to break your code down into calls, routines, functions or loops. This will also tell you which parts are the most computationally intensive, so you can focus your analysis of the components and the subsequent parallelization on these areas.
- Next, look in your profiler results for the parts that are actually external library calls. Wherever possible, try to use a GPU-enabled version of the library call, as these are often highly optimized. Sometimes this is not possible and you will need to either code the routine manually and then parallelize it yourself or, very occasionally, it may be possible to parallelize the calls to the routine.
- Then, for code components that are actually calculations available in an external routine, consider whether you can replace this code with a call to a highly optimized GPU-enabled parallel version of the routine.
- Finally, for the remaining routines in the code, those that are not external library calls and do not match calculations found in external libraries, it is time to categorize them into patterns that you can then parallelize. Doing this requires understanding what is being computed and the flow of data through the region.
That all sounds easy, until that last part: how do you identify the patterns in your code?
Parallelizing by pattern
We’ll explain in more detail in later videos how to understand the patterns in your code and what to look for, and we will work through some examples. But it is worth remembering that, whatever approach you take to parallelize your code, it essentially always comes down to understanding how your code will behave in parallel. Doing this component by component, identifying the pattern, then the corresponding parallel pattern or patterns, and then implementing this as parallel code, is the simplest way to start accelerating your code with GPUs.
Patterns and parallelization strategies
In the next video we will start learning about the first parallelization pattern, a forall. However, it is worth understanding first what we mean by a parallel pattern. For each pattern that is discussed in this course, there is at least one, and sometimes more than one, parallel implementation, or strategy, available.
- For the forall pattern, there is just one parallel strategy: the parallel loop. This is actually what you implemented in the DAXPY code at the end of the last video.
- But the second pattern you will learn, a scalar reduction, actually has three possible methods for parallelization: a built-in reduction clause, atomic protection and explicit privatization.
- The same three strategies can also be applied to the third pattern, a sparse reduction.
- The final pattern, a sparse forall, can only be parallelized using atomic protection.
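As a preview, the sketch below shows one possible OpenACC form for three of these patterns: a forall as a parallel loop, a scalar reduction using the reduction clause, and a sparse reduction using atomic protection. The function names and the histogram example are assumptions chosen for illustration; later videos in the course cover each pattern properly.

```c
/* forall: each iteration writes its own element y[i] -> parallel loop. */
void daxpy(int n, double a, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* scalar reduction: all iterations accumulate into one scalar
 * -> built-in reduction clause. */
double vector_sum(int n, const double *x)
{
    double total = 0.0;
    #pragma acc parallel loop copyin(x[0:n]) reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += x[i];
    }
    return total;
}

/* sparse reduction: iterations accumulate into array elements selected
 * through an index array, so two iterations may hit the same element
 * -> atomic protection. Assumes 0 <= bin[i] < nbins for every i. */
void histogram(int n, const int *bin, int nbins, int *count)
{
    #pragma acc parallel loop copyin(bin[0:n]) copy(count[0:nbins])
    for (int i = 0; i < n; i++) {
        #pragma acc atomic update
        count[bin[i]] += 1;
    }
}
```

In each case the loop body is unchanged from the serial version; only the directive differs, which is what makes a pattern-by-pattern approach practical.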
However, it is worth noting that these are general strategies for parallelizing each pattern, and there are specific constraints based on architecture. As this course is about GPUs, it is important to note when the strategies can’t be applied because of the technology.
Specifically, you will notice in this table which summarises the patterns and the applicability of the strategies shown on the previous slide, that explicit privatization cannot be used with GPUs as there is no concept of privatizing a variable on the shared memory space within a GPU.
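For readers unfamiliar with the term, the following C sketch illustrates the general idea behind explicit privatization for a scalar reduction: each worker accumulates into its own private partial result, and the partial results are combined at the end. The function `sum_privatized` and its `nworkers` parameter are hypothetical, and the workers run one after another here purely to show the structure; on a multicore CPU each worker's chunk would run on its own thread.

```c
#include <stdlib.h>

/* Sum x[0:n] using `nworkers` private partial sums (illustrative only). */
double sum_privatized(int n, const double *x, int nworkers)
{
    if (nworkers < 1) {
        nworkers = 1;
    }

    /* One private accumulator per worker, initialized to zero. */
    double *partial = calloc((size_t)nworkers, sizeof *partial);
    if (partial == NULL) {
        /* Fall back to a plain serial sum if allocation fails. */
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            total += x[i];
        }
        return total;
    }

    /* Each worker handles a contiguous chunk and touches only partial[w]. */
    for (int w = 0; w < nworkers; w++) {
        int begin = (int)((long long)n * w / nworkers);
        int end = (int)((long long)n * (w + 1) / nworkers);
        for (int i = begin; i < end; i++) {
            partial[w] += x[i];
        }
    }

    /* Combine the private results into the final value. */
    double total = 0.0;
    for (int w = 0; w < nworkers; w++) {
        total += partial[w];
    }

    free(partial);
    return total;
}
```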
We will cover how to identify each of these patterns and the parallelization strategies in the rest of this course, starting with the forall pattern.