This is the sixth video of our OpenACC programming course, based on Appentra’s unique, productivity-oriented approach to learning best practices for parallel programming with OpenACC. Learning outcomes include how to decompose codes into parallel patterns and a practical, step-by-step, pattern-based process for parallelizing any code. In this video we will learn how to use the OpenACC directives parallel, loop and data.
✉ Subscribe to our newsletter and get all of our latest updates.
- Course overview
- What is OpenACC?
- How OpenACC works
- Building and running an OpenACC code
- The OpenACC parallelization process
- Using your first OpenACC directives
- Using patterns to parallelize
- UPCOMING: Identifying a forall pattern
- UPCOMING: Implementing a parallel forall
Using your first OpenACC directives
This course will introduce OpenACC directives incrementally, as they become useful for the porting process. In this video we will learn how to use the parallel, loop and data directives.
By the end of this video you will know how to implement your first OpenACC code using a simple example in Parallelware Trainer.
OpenACC Directive Syntax
To use OpenACC you need three key components: a compiler designation, a sentinel and a directive. The compiler designation (a pragma in C, or an exclamation mark, also known as a bang, in Fortran) tells the compiler that the line carries special instructions rather than ordinary source code.
The next component, the acc sentinel, tells the compiler that the following text will be OpenACC.
Following the sentinel is the OpenACC directive itself. A directive may be an executable statement in its own right, or it may apply to the next line or to a whole region of code.
These directives may have clauses, but all must follow the syntax as laid out in the OpenACC standard.
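In C, the pieces fit together as in this minimal sketch (the function, variable names and clause are illustrative, not taken from the course project):

```c
#include <stddef.h>

/* Anatomy of an OpenACC directive in C:
 *   #pragma       -> the compiler designation
 *   acc           -> the sentinel marking what follows as OpenACC
 *   parallel loop -> the directive itself
 *   copy(...)     -> an optional clause
 * In Fortran the same line would begin with the sentinel "!$acc".
 * Without an OpenACC-enabled compiler the pragma is simply ignored
 * and the loop runs sequentially, producing the same result. */
void scale(size_t n, double a, double *x)
{
    #pragma acc parallel loop copy(x[0:n])
    for (size_t i = 0; i < n; ++i)
        x[i] = a * x[i];
}
```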
Terminology: gangs, workers and vectors.
Before we learn about the directives, we first need a little terminology: specifically gangs, workers and vectors. These are the three levels of parallelism in the OpenACC execution model. This three-level model is designed to improve portability: it can be mapped onto any architecture that supports multithreading where the threads can execute vector instructions.
At this stage, just remember that a gang comprises a set of worker threads, and that worker threads are synchronised within a gang whereas gangs are not explicitly synchronised with each other. This will become important when we talk about the parallel directive next.
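The three levels can also be requested explicitly through clauses; the sketch below is illustrative only, and the gang, worker and vector counts are arbitrary rather than tuned for any device:

```c
#include <stddef.h>

/* Illustrative only: ask for 32 gangs of 4 workers each, with a
 * vector length of 128, then distribute the loop iterations across
 * all three levels. With no OpenACC support the pragmas are ignored
 * and the loop still computes the same values sequentially. */
void add_one(size_t n, double *x)
{
    #pragma acc parallel num_gangs(32) num_workers(4) vector_length(128)
    #pragma acc loop gang worker vector
    for (size_t i = 0; i < n; ++i)
        x[i] += 1.0;
}
```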
To understand how to use these sentinels and directives in your code, we will now discuss the most commonly used directives. The first directive, and the directive you are likely to use the most, is the parallel directive:
- By adding a parallel directive before a code block, the programmer is identifying the next code region as suitable for parallelization across OpenACC gangs.
- However, it is the responsibility of the programmer, that’s you, to ensure that the parallelization is actually safe. Some analysis of the code block the directive is applied to is therefore necessary.
- By itself, the parallel directive is of limited use: it runs the same code redundantly on every gang rather than splitting the work up amongst them.
It is therefore most commonly used with the loop directive.
- The loop directive tells the compiler that the very next loop in the source code is safe to execute in parallel.
- In combination with the parallel directive, this allows the compiler to split the work in the loop up amongst worker threads, so that different loop iterations are executed by different threads.
- However, there are many loop constructions where the loop iterations cannot simply be executed in parallel. One particular case is when the loop is calculating a reduction. In this particular instance, because it is so commonly used, OpenACC provides a reduction clause that can be used in conjunction with the loop clauses. We’ll discuss reductions more in a later video.
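As a hedged sketch of the reduction case (the function and names here are illustrative, not part of the course project): every iteration updates the same variable, so a plain parallel loop would race, and the reduction clause tells the compiler to combine per-thread partial results safely.

```c
#include <stddef.h>

/* A sum reduction: all iterations accumulate into "total".
 * reduction(+:total) gives each thread a private partial sum and
 * combines them at the end. Without OpenACC the pragma is ignored
 * and the loop runs sequentially with the same result. */
double vector_sum(size_t n, const double *x)
{
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(x[0:n])
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
}
```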
The next directive to consider is the data construct.
- This gives the programmer additional control over how and when data is created and destroyed on a GPU and when data is copied between CPU and GPU.
- Without the data directive, OpenACC will make assumptions about whether data is already on the device or not. By using the data construct you help to ensure correctness, and also improve performance by avoiding unnecessary data copies.
- The data directive may be used in conjunction with many other directives including parallel and loop.
- The data construct accepts a number of clauses, including copyin, copyout and copy.
In all three cases, if the variables listed in the copyin, copyout or copy clause are on a device where memory is shared between host and device, OpenACC takes no action, ensuring optimal performance.
In most cases, however, host and device memory are separate, and OpenACC does the following.
For copyin, on entering the code region governed by the data directive, the variables listed in the copyin clause are copied onto the device; at the end of the region, the device memory used by those variables is released. This is useful when a variable is needed to perform the calculations within the loop but is not updated by it, so the device copy of the variable is not needed on the CPU after the loop has executed.
The copyout clause can be considered to work in the opposite direction to copyin: space is created on the device for any variables listed under the copyout clause, but the values are not initialized. At the end of the region, the variable(s) are copied back to the host. Copyout is used when the data region initialises one or more variables that are needed at the end of the data region on the host.
In the situation where a variable is initialised on the host, used and updated on the device, and the updated value is then needed on the host again, the copy data clause can be used.
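The three clauses can be sketched together on one data region; this example is illustrative (the function and variable names are not from the course project), with x only read, y read and updated, and r produced from scratch:

```c
#include <stddef.h>

/* Sketch of the three data clauses on a structured data region:
 *   copyin(x)  - x is only read on the device
 *   copy(y)    - y arrives initialised and its updates are needed back
 *   copyout(r) - r is written inside the region, then copied to the host
 * Without OpenACC the pragmas are ignored and the result is unchanged. */
void axpy_into(size_t n, double a, const double *x, double *y, double *r)
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n]) copyout(r[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i) {
            y[i] += 1.0;            /* updated value needed on the host */
            r[i] = a * x[i] + y[i]; /* result only exists after the loop */
        }
    }
}
```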
There are several commonalities with copyin, copyout and copy.
Firstly, if you are using OpenACC 2.5 or newer, all three clauses first check whether the variables listed are already present on the device, which improves performance. Prior to OpenACC 2.5, this behaviour required the present_or_copy clause and its equivalents for copyin and copyout.
Secondly, in each instance, the memory on the device is released at the end of the data region.
And thirdly, to make later optimisation simpler, it is worth adding a separate data directive line for each clause, whether it is copyin, copyout or copy. You can put all three clauses on one line, but that can increase the effort at a later stage when you combine data regions. More on that later!
To understand how to use these new directives, we will now parallelize a vector multiplication and addition calculation.
First, download the DAXPY project, linked below this post, and open it in Parallelware Trainer. This project computes a scalar-vector product plus a vector, i.e. the solution is a constant times the vector X plus the vector Y (a*X+Y). The name DAXPY follows the naming convention of the BLAS scientific library.
- Set up the project by clicking the build configuration.
- Add make as the build command.
- Then, in the run command, type ./DAXPY
- Close the build configuration.
- And now update the Makefile to add the OpenACC flag for compilation.
- As we are using the GCC compiler, we add the -fopenacc flag; if you wish to use a different compiler, make sure you update the Makefile with the appropriate compiler and flag.
- Save the Makefile
- and close.
Now it is time to start accelerating the calculation.
- Open the DAXPY.c file and you will see a green circle on line 10. Parallelware Trainer has noticed a parallelization opportunity in the loop that starts on this line.
- Click on the green circle, and the parallelization dialogue will open.
- As we are attempting to parallelize for OpenACC and GPUs, choose OpenACC as the standard, GPU as the device, and offloading as the paradigm.
- Now click ‘Parallelize’. You will see that three OpenACC directives have been added: the parallel and data regions open before the loop and close at its end.
Note that the loop directive needs no explicit region of its own: it is only valid immediately before a loop and applies just to that loop.
There are a few things to note. Firstly, we are using copyin and copyout clauses on the data directive on line 10.
The vectors that are part of the calculation, X and Y, are subject to copyin. They are not updated within this loop, so the memory used on the device can be freed at the end of the data region without copying anything back to the host.
The variable a, the constant that multiplies the vector X, and the variable n, the length of the vectors and therefore the number of loop iterations, are also copyin.
The copyout clause is used only for the result vector, D, because it is the only variable initialised inside the loop whose value is needed after the loop.
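The parallelized loop looks roughly like the sketch below; the exact directives Parallelware Trainer emits may differ slightly, and wrapping the loop in a function here is purely for illustration:

```c
#include <stddef.h>

/* Sketch of the DAXPY loop after parallelization: X, Y, a and n are
 * copyin, the result vector D is copyout, and the data, parallel and
 * loop directives open before the loop as described above. Without
 * OpenACC the pragmas are ignored and the loop runs sequentially. */
void daxpy(size_t n, double a, const double *X, const double *Y, double *D)
{
    #pragma acc data copyin(a, n, X[0:n], Y[0:n]) copyout(D[0:n])
    {
        #pragma acc parallel
        {
            #pragma acc loop
            for (size_t i = 0; i < n; ++i)
                D[i] = a * X[i] + Y[i];
        }
    }
}
```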
If we now build and run, the code executes successfully, using an OpenACC parallel loop on the GPU.