This is the third video of our OpenACC programming course, based on Appentra’s unique, productivity-oriented approach to learning best practices for parallel programming with OpenACC. Learning outcomes include how to decompose codes into parallel patterns and a practical, step-by-step, pattern-based process for parallelizing any code. In this video, we explain how OpenACC works and why the OpenACC programming model allows you to parallelize code quickly.
✉ Subscribe to our newsletter to receive the new videos that will be published every week during the following months.
- Course overview
- What is OpenACC?
- How OpenACC works.
- Building and running an OpenACC code.
- UPCOMING. The OpenACC parallelization process.
- UPCOMING. Using your first OpenACC directives.
- UPCOMING. Using patterns to parallelize.
- UPCOMING. Identifying a forall pattern.
- UPCOMING. Implementing a parallel forall.
How OpenACC works
To appreciate the potential benefits to your productivity and to the performance of your code, it is worth taking a few minutes to understand how OpenACC works.
The OpenACC Execution Model
OpenACC makes certain assumptions about how your code executes; these simplify its use but also limit its applicability.
- OpenACC uses a host-driven execution model: the CPU ‘drives’ the execution of your code and controls how the GPU is used. The CPU is often known as the host, and the GPU as the device. Sequential code runs on the host CPU, which is a conventional processor.
- It is up to you, as the programmer, to make use of OpenACC by adding directives so that the controlling CPU knows to run the computationally intensive parallel pieces of code (also called hotspots) on the available GPU.
- To maximize performance, high-performance applications generally follow three rules of accelerator programming:
- Transfer the data onto the device and keep it there, i.e. don’t move things back and forth between the CPU and GPU unless you need to.
- Give the device enough work to do, so you don’t spend all the compute time transferring between host and device.
- Focus on data reuse within the device(s) to avoid memory bandwidth bottlenecks.
GPUs have a reputation for being difficult to use because low-level programming models such as CUDA require detailed knowledge of the hardware. OpenACC has been developed precisely to change that, by enabling everyone to make better use of GPUs in a simpler manner.
OpenACC’s use of directives, which are similar to OpenMP’s, improves the portability and readability of the code compared to other methods of programming accelerators.
OpenACC removes the need to explicitly address the hardware when accelerating your code. This increases portability and makes it easier to use than the alternatives, but it also intrinsically means you are unlikely to beat the performance of a highly optimised CUDA code.
But that is the thing to remember: very few CUDA codes are actually that well optimised. To beat OpenACC with something like CUDA, or one of the even lower-level programming solutions available, a lot of work and effort would need to go in every time you move from one machine to another or upgrade hardware.
It is also worth noting that at this time OpenACC, like CUDA and many other parallel programming tools, is only available for C, C++ and Fortran codes.
Note that the examples you will learn and use in this course are in C! But the same methods apply to Fortran code.
The OpenACC Accelerator Model
In order to ensure that OpenACC would be portable to all computing architectures available at the time of its inception, and into the future, OpenACC defines an abstract model for accelerated computing. This model exposes multiple levels of parallelism that may appear on a processor, as well as a hierarchy of memories, with varying degrees of speed and addressability.
The goal of this model is to ensure that OpenACC is applicable to more than just one particular architecture, or the set of architectures currently widely available; it aims to keep code compatible with future devices as well.
In the OpenACC execution model, the multicore CPU is treated like an accelerator device that shares memory with the initial host thread. With such a shared-memory device, most of the OpenACC data clauses (which we will discuss in a later video) are ignored, and the accelerator device (the parallel multicore) uses the same data as the initial host thread.
When using OpenACC with a GPU, data gets copied from the system memory to device memory (and back). The user is responsible for keeping the two copies of the data coherent, as needed. When using OpenACC on a multicore, there is only one copy of the data, so there is no coherence problem. However, a GPU OpenACC program can produce different results than a multicore OpenACC program: this occurs if the program depends on the parallel compute regions and the host updating different copies of the data, which cannot happen when there is only a single shared copy.