In this article we discuss the parallel matrix product, a simple yet efficient parallel algorithm for multiplying two matrices. Matrix multiplication is used so widely that it is a prime candidate for parallelization. Here is the sequential code:

```c
int alg_matmul2D(int m, int n, int p, float **a, float **b, float **c)
{
    int i, j, k;

    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            a[i][j] = 0;
            for (k = 0; k < p; k++)
                a[i][j] += b[i][k] * c[k][j];
        }
    }
    return 0;
}
```

### Performance considerations

This code introduces more complexity than the codes analyzed in the previous articles about the parallel computation of **PI** and the parallel computation of the **matrix-vector product.** The main issues come from memory accesses that do not exploit data locality. More specifically, the computation of PI uses only scalar variables, with no large arrays at all. The matrix-vector product uses arrays that are accessed in a way that preserves memory locality. Here in the matrix-matrix product, we address two-dimensional arrays whose accesses do not preserve data locality, and this causes the memory problems that impact performance, described next.

In the code presented above, we implement the basic matrix product by accumulating the products of rows and columns into the result matrix *a*. Multiplying rows by columns is where the major problem of this code appears: how can we perform an optimized memory access for matrices *b* and *c* at the same time? Assume the matrices are stored in memory in row-major order. We access matrix *b* by rows and matrix *c* by columns, so the two access patterns conflict: in the innermost loop, *b[i][k]* walks through consecutive memory locations while *c[k][j]* jumps a whole row of memory on every iteration. The key point is that if we optimize the memory access of one matrix, we get a bad memory access in the other. The same holds for column-major storage.

### Parallel matrix matrix multiplication with OpenMP

Basically, we can't avoid the previous memory access problem without changing the way the matrices are laid out in memory, and that is out of the scope of this article. Apart from memory and locality issues, how can we obtain better performance from this code? The answer is to make it parallel. Here is the parallel implementation using the OpenMP standard:

```c
int alg_matmul2D(int m, int n, int p, float **a, float **b, float **c)
{
    int i, j, k;

    #pragma omp parallel shared(a, b, c) private(i, j, k)
    {
        #pragma omp for schedule(static)
        for (i = 0; i < m; i++) {
            for (j = 0; j < n; j++) {
                a[i][j] = 0.;
                for (k = 0; k < p; k++) {
                    a[i][j] = a[i][j] + b[i][k] * c[k][j];
                }
            }
        }
    }
    return 0;
}
```

The OpenMP-enabled parallel code exploits coarse-grain parallelism, which makes use of the cores available in a multicore machine. Basically, we have parallelized the outermost loop, which drives the accesses to the result matrix *a* along its first dimension. Work is shared among the threads so that different threads calculate different rows of the result matrix *a*. Note that different threads write to different parts of *a*, so there are no conflicts during the parallel execution. Accesses to matrices *b* and *c* are read-only and do not introduce any problems either. In this parallelization strategy, all the matrices are shared and everything works fine; in addition, all the loop indices are privatized because every thread needs to run its own iterations of the loops.

### Experimental results

We have described a simple parallel algorithm for the matrix-matrix product using OpenMP. But what about its performance? Let's look at it now. All executions, times, and speed-ups were measured on the following machine:

Intel(R) Core(TM) i7-3632QM CPU @ 2.20 GHz (4 cores with hyper-threading, 6144 KB cache), 8 GB RAM

The experiments use 4 physical cores with hyper-threading, which makes a total of 8 threads, with 2 threads per core. The ideal speed-up is around 5, and we get a maximum speed-up near 3.5, which exposes the memory problems of the matrix-matrix product and their impact on performance. These results still give us a good reason to use parallelism: we get better execution times, increase our productivity, reduce costs, and reduce the "time-to-market".

The performance of parallel algorithms can be increased significantly with programming techniques oriented to locality and memory issues. And although parallel programming is hard and error-prone, it is the primary source of performance gains in modern computer systems.

If you like this post, **please share and subscribe!**

**Join the Parallware Trainer Early Access Program!**
