Do you really know the importance of parallelization? Does a parallelized code imply a reduction in your execution time? Are all parallelizations equally good? This article talks about these questions and using the calculation of the number pi tries to answer these questions.

### Calculation of PI

The code shown below is the calculation of pi by the method of numeric integration. The algorithm has a loop (lines 4-7) which executes a predetermined number of iterations (num_steps) where a reduction over the variable ‘sum’ is done. The variable ‘x’ works like a partial result which has an independent value between iterations. At the end of the loop, the operation between ‘step’ and ‘sum’ produces ‘pi’. The number pi will be more precise if the number of iterations is greater.

1 2 3 4 5 6 7 8 |
x=0; sum = 0.0; step = 1.0/(double) num_steps; for (i=0; i < num_steps; ++i) { x = (i+0.5)*step; sum = sum + 4.0/(1.0+x*x); } pi = step * sum; |

The aim of the parallelization is to reduce the variable ‘sum’. The analysis of the code shows that this is a reduction on the variable ‘sum’ and the type is addition. With these features there is more than one way to parallelize this code. We will then see two different types of how to parallelize this code and a comparison between both.

### Parallel computation of PI: (1) with CRITICAL; (2) with REDUCTION

The first parallelization is with a critical section on the reduction variable. The reduction variable is shared but it is necessary to protect the access and modification with a critical section. Good practise is to protect the minimum number of operations with the critical section, so the variable ‘aux’ is used to reduce the computation in the critical section.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
x=0; sum = 0.0; step = 1.0/(double) num_steps; #pragma omp parallel private(i,x,aux) shared(sum) { #pragma omp for schedule(static) for (i=0; i<num_steps; i=i+1){ x=(i+0.5)*step; aux=4.0/(1.0+x*x); #pragma omp critical sum = sum + aux; } } pi=step*sum; |

The second way to parallelize this code is by using the reduction clause. This is a clause provided by the OpenMP standard which abstracts the problem of a reduction. For this clause is only necessary to annotate the reduction variable and the type of reduction, in this case ‘reduction(+:sum)’.

1 2 3 4 5 6 7 8 9 10 11 12 |
x=0; sum = 0.0; step = 1.0/(double) num_steps; #pragma omp parallel private(i,x) { #pragma omp for reduction(+:sum) schedule(static) for (i=0; i<num_steps; i=i+1){ x=(i+0.5)*step; sum = sum + 4.0/(1.0+x*x); } } pi=step*sum; |

### Performance on a multicore machine.

The ideal speed-up is the number of threads, for example if the application is working with 4 threads the ideal speed-up in this case is 4. In the next graph we can see a speed-up comparison between the ideal (red line), the sequential (blue line), the critical (green line) and the parallelization using reduction (yellow line).

The parallelization with the reduction clause is near to obtaining ideal speed-ups, conversely with a parallelization with critical the speed-ups are worse, even the code obtains a slow down. A slow down in a parallelization is something common if we choose a wrong strategy of parallelization. In this case, protecting with a critical is a bad idea because the threads are waiting to enter in the critical section and if the code has a low arithmetic intensity it makes the threads spend more time waiting than executing the code.

**In conclusion:** Choosing the correct parallelization strategy is really important. A correct decision can change the execution time from ideal times to times that are even worse than the sequential time. Do not miss the next article of our **series “OpenMP guided parallelization with Parallware”** that will address the parallelization of the matrix-vector product.

If you like this post, **please share!**

**The Appentra Team**

## Leave a Reply