In this article, we will discuss the code to multiply a matrix by a vector. This operation is widely used in many scientific fields, so it is worth analyzing how to make it parallel. The code is a little more complex than the parallel computation of PI analyzed previously in this blog, mainly because it uses matrices and vectors, so we will introduce some memory considerations here. First of all, let's see the code:

```c
 1  int matvec(float **matrix, float *vector, float *result, int size_i, int size_j)
 2  {
 3      int i;
 4      int j;
 5      for (i = 0; i < size_i; i++) {
 6          result[i] = 0;
 7          for (j = 0; j < size_j; j++) {
 8              result[i] += matrix[i][j] * vector[j];
 9          }
10      }
11      return 0;
12  }
```

**Memory considerations**

Here we can see that this code implements the matrix-vector multiplication by accumulating, for each row of the matrix, the partial products of that row with the vector. Let's discuss the memory access pattern, focusing on lines 5 to 10. First look at line 8, where we have the main accesses to the array variables *result*, *matrix* and *vector*. The accesses are performed with the following loop indices: *result* with *i*, *matrix* with *i* and *j*, and *vector* with *j*. Thus, *matrix* and *vector* share a similar access in the last dimension, indexed by *j*. As *j* is the index of the innermost loop, this access pattern is perfect for C code: C stores matrices in row-major order, so consecutive values of *j* touch consecutive memory positions and the code exploits spatial locality. The other memory access involved is the write into *result[i]*, where *i* is the index of the outermost loop. Since this is a sequential access to a one-dimensional array, we won't have locality problems here either.

**Make it parallel**

Once we know there are no big problems with the memory access pattern, we can expect good speed-ups from this code if we parallelize it. So let's do it. Here is the parallel code:

```c
int matvec(float **matrix, float *vector, float *result, int size_i, int size_j)
{
    int i, j;
    #pragma omp parallel shared(matrix, vector, result) private(i, j)
    {
        #pragma omp for schedule(static)
        for (i = 0; i < size_i; i++) {
            result[i] = 0.0f;
            for (j = 0; j < size_j; j++) {
                result[i] += matrix[i][j] * vector[j];
            }
        }
    }
    return 0;
}
```

What we are exploiting here is coarse-grain parallelism: we parallelize the outermost loop of the algorithm. Note that the variables *matrix* and *vector* are read-only, so they are never written in this algorithm. Thus, we can share them, since no race conditions can appear on those variables during the parallel execution. Now consider the variable *result*, which holds the important computation of this algorithm. This variable is always indexed by *i*, the index of the outermost loop. So if we distribute the iterations of this loop, making *i* private and *result* shared, the threads will never write to the same positions of *result*. This is exactly what the parallel computation of the matrix-vector product shown above does.

**Experimental results**

Once we have the parallel code, we execute it and take time measurements for different square matrix sizes. As discussed in the previous sections, we should expect good performance from this code. All the measurements shown below were produced on the following machine using 8 threads:

Intel(R) Core(TM) i7-3632QM CPU @ 2.20 GHz (4 cores with Hyper-Threading, 6144 KB cache), 8 GB RAM

Here we can see this data graphically:

Note that using 8 threads it is theoretically possible to achieve a speedup of 8. On our quad-core system we are using the 4 cores with Hyper-Threading (2 threads per core) enabled. In the graph above the ideal speedup is represented with a red line, and the measured speedup is depicted with a blue line. With Hyper-Threading, each additional thread typically provides only an extra 20% of performance. As a result, the 4.8 speedup of the parallel algorithm is really good, as it is 0.8 higher than the 4.0 theoretical speedup for 4 threads.

Finally, note that the parallel computation of the matrix-vector product discussed in this article achieves a speedup of up to 4.73 with big-enough matrix sizes. Thus, this is a simple and efficient OpenMP-enabled parallel algorithm.

If you like this post, **please share!**
