Functions offloaded to the GPU should be annotated with ‘declare target’.
When a loop is offloaded to the GPU, the compiler generates GPU instructions for it, which are distinct from CPU instructions. Offloaded sections are thus translated into mini-programs embedded in the main program. At run time, the runtime executes those mini-programs on the GPU and performs the required data movement between CPU and GPU memory. If an offloaded loop calls functions, a GPU version of those functions must also be created, as if it were a GPU mini-library called from the corresponding GPU mini-program. For the compiler to create it, the relevant functions must be marked with the OpenMP ‘declare target’ directive. When this is not done, the CPU version will be called instead, with a performance loss caused by moving computation from the GPU to the CPU to execute the function and then back to the GPU once it returns.
Annotate the function with ‘#pragma omp declare target’.
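As a minimal sketch of this annotation, the example below wraps a helper function between ‘#pragma omp declare target’ and ‘#pragma omp end declare target’ and calls it from an offloaded loop. The names square and square_all, and the map clauses chosen, are hypothetical and only serve the illustration; whether the loop actually runs on a GPU depends on the compiler and its offloading flags.

```c
#include <stddef.h>

/* Hypothetical helper called from the offloaded loop.
 * The declare target region asks the compiler to also
 * generate a device (GPU) version of this function. */
#pragma omp declare target
double square(double x) {
    return x * x;
}
#pragma omp end declare target

/* Hypothetical driver: offloads the loop and calls square() on the device.
 * in[] is copied to the device and out[] is copied back to the host. */
void square_all(const double *in, double *out, int n) {
    #pragma omp target teams distribute parallel for \
        map(to: in[0:n]) map(from: out[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = square(in[i]);
    }
}
```

If the code is built without OpenMP support, the pragmas are ignored and the loop runs sequentially on the CPU, so the results are the same either way.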