This is the ninth video of our OpenACC programming course, based on Appentra’s unique, productivity-oriented approach to learning best practices for parallel programming with OpenACC. Learning outcomes include how to decompose codes into parallel patterns and a practical, pattern-based step-by-step process for parallelizing any code. In this final video we will learn to combine the forall pattern with the parallel and loop OpenACC directives covered in video 7 to start parallelizing foralls.
Course Index
- Course overview
- What is OpenACC?
- How OpenACC works
- Building and running an OpenACC code
- The OpenACC parallelization process
- Using your first OpenACC directives
- Using patterns to parallelize
- Identifying a forall pattern
- Implementing a parallel forall
Video transcript
Implementing a parallel forall
Patterns and parallelization strategies
Now that we have learnt about the forall pattern and the characteristics to look out for, we can combine this with the information on the parallel and loop OpenACC directives covered in video 7 to start parallelizing foralls.
Parallelizing a forall is what we did in the example in video 7, but in this video we will look in a little more detail at what is going on and how to implement the pattern more generally.
Implementation of parallel loop
Let us revisit the core part of the DAXPY calculation, where the result, D, is the vector X multiplied by a constant, a, added to the vector Y.
As previously discussed, each iteration of the loop is fully independent of every other iteration, so we can assert this by adding the parallel directive.
for (int i = 0; i < n; i++) {
    D[i] = a * X[i] + Y[i];
}
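For context, the fragment above could sit in a minimal self-contained kernel such as the following sketch; the function name daxpy and its signature are our own illustration, not from the video.

// Sequential DAXPY kernel: computes D = a * X + Y, element by element.
void daxpy(int n, double a, const double *X, const double *Y, double *D) {
    for (int i = 0; i < n; i++) {
        D[i] = a * X[i] + Y[i];
    }
}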
Note that we must mark the extent of the region that we are asserting is suitable for the parallel directive by enclosing it in braces.
#pragma acc parallel
{
    for (int i = 0; i < n; i++) {
        D[i] = a * X[i] + Y[i];
    }
} // end parallel
If we remember how the OpenACC parallel directive works, the region enclosed by the directive, i.e. this entire for loop, will be executed redundantly by every gang operating within this parallel region on the GPU.
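To make that redundancy concrete, here is a minimal sketch; the num_gangs clause and its count of 4 are our own illustrative choice, not from the video.

#pragma acc parallel num_gangs(4)
{
    // No work-sharing directive yet: each of the 4 gangs
    // redundantly executes all n iterations of the loop.
    for (int i = 0; i < n; i++) {
        D[i] = a * X[i] + Y[i];
    }
} // end parallel

The result is unchanged, because every gang computes and writes the same values to D, but there is no speed-up; work-sharing is what turns redundant execution into parallel execution.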
By adding the loop work-sharing directive, the loop iterations within the parallel region will now be split amongst the gangs instead of being executed redundantly by each of them.
#pragma acc parallel
{
    #pragma acc loop
    for (int i = 0; i < n; i++) {
        D[i] = a * X[i] + Y[i];
    }
} // end parallel
We now have the core requirements for parallelizing a forall pattern: a parallel directive and the ability to work-share using the loop directive.
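As a convenience, OpenACC also offers the combined parallel loop construct, which applies both directives to a single loop in one line; the sketch below is equivalent to the two-directive version above.

// Combined construct: a parallel region containing
// a single work-shared loop.
#pragma acc parallel loop
for (int i = 0; i < n; i++) {
    D[i] = a * X[i] + Y[i];
}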
Note that we are not handling the data management between CPU and GPU at this point, as we are assuming that the data movement, which we discussed in video 7, is dealt with separately from parallelizing this region.
As we discuss the different parallelization methods we will focus on the parallelization requirements and handle the data management separately, so that later in the course you can learn to optimize your data movement independently of the parallelization.
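As a preview of that separate data management, here is a minimal sketch using a structured data region; the subarray bounds assume X, Y and D each hold n elements.

// Structured data region: copy the inputs to the GPU on entry,
// copy the result back to the CPU when the region ends.
#pragma acc data copyin(X[0:n], Y[0:n]) copyout(D[0:n])
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        D[i] = a * X[i] + Y[i];
    }
}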
Pros and Cons
Finally, it is worth mentioning the pros and cons of this approach.
- Pros: As the forall pattern is straightforward and simple, this is the easiest parallelization method to implement, and it needs no synchronisation within the loop, because the independence of each loop iteration means the threads never need to communicate data.
- Cons: However, although the forall pattern is relatively common, it is limited in its applicability for the very reason that it is so efficient and simple to implement: each loop iteration must be entirely independent (see the sketch below).
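To illustrate that limitation, consider this hypothetical loop (our own example, not from the video) with a loop-carried dependence; iteration i reads the value written by iteration i-1, so it is not a forall and cannot be parallelized with the approach above.

// Not a forall: D[i] depends on D[i-1], the result of the
// previous iteration, so the iterations cannot run independently.
D[0] = X[0];
for (int i = 1; i < n; i++) {
    D[i] = D[i-1] + X[i];
}

A loop like this, a prefix sum or scan, needs a different parallelization strategy.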