The focus of this report is showing that **Parallware** succeeds in the automatic parallelization of sequential C code that uses the finite-difference time-domain (FDTD) method. As case study we will use a 3D FDTD simple code written in C developed by Dmitry Gorodetsky. For illustrative purposes, the following figure shows the main magnitudes Hx, Hy and Hz computed in the code.

The rest of this report is organized as follows. Section 1 analyzes the algorithmic structure of the sequential source code. Section 2 discusses the best parallelization strategy, which is compared with the parallelization strategy automatically built by Parallware. Section 3 presents performance measurements of the automatically parallelized 3D FDTD code. Finally, the conclusions of this article are presented.

## 1. The 3D Finite-Difference Time-Domain Sequential Code

The source code shown below is a pseudocode of the most computationally-intensive loop of the 3D FDTD sequential code. The algorithmic structure of the sequential C code consists of a main loop that computes hx, hy and hz as well as ex, ey and ez during a fixed number of iterations. In a given iteration, the values of hx are updated by adding the previous value and the value of an expression that depends on ey and ez. The computation of the remaining hy, hz, ex, ey and ez is similar. We call these loops “propagation loop”, and they are of special relevance in the scope of scientific programs because they are typically used to implement simulations over time.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 |
for(iteration = 0; iteration < MAXIMUM_ITERATION; iteration++) { stimulus = sin(omega*currentSimulatedTime); for (i=0; i<(1); i++) { for(j=0; j<(ny); j++) { { for(k=0; k<nz; k++) { ez[i][j][k] = stimulus; } } } for(i=0; i<(nx-1); i++) { for(j=0; j<(ny-1); j++) { for(k=0; k<(nz-1); k++) { hx[i][j][k] += (dtmudz*(ey[i+1][j][k+1] - ey[i+1][j][k]) - dtmudy*(ez[i+1][j+1][k] - ez[i+1][j][k])); } } } for(i=0; i<(nx-1); i++) { for(j=0; j<(ny-1); j++) { for(k=0; k<(nz-1); k++) { hy[i][j][k] += (dtmudx*(ez[i+1][j+1][k] - ez[i][j+1][k]) - dtmudz*(ex[i][j+1][k+1] - ex[i][j+1][k])); } } } for(i=0; i<(nx-1); i++) { for(j=0; j<(ny-1); j++) { for(k=0; k<(nz-1); k++) { hz[i][j][k] += (dtmudy*(ex[i][j+1][k+1] - ex[i][j][k+1]) - dtmudx*(ey[i+1][j][k+1] - ey[i][j][k+1])); } } } for(i=0; i<(nx); i++) { for(j=1; j<(ny); j++) { for(k=1; k<(nz); k++) { ex[i][j][k] += (dtepsdy*(hz[i][j][k-1] - hz[i][j-1][k-1]) - dtepsdz*(hy[i][j-1][k] - hy[i][j-1][k-1])); } } } for(i=1; i<(nx); i++) { for(j=0; j<(ny); j++) { for(k=1; k<(nz); k++) { ey[i][j][k] += (dtepsdz*(hx[i-1][j][k] - hx[i-1][j][k-1]) - dtepsdx*(hz[i][j][k-1] - hz[i-1][j][k-1])); } } } for(i=1; i<(nx); i++) { for(j=1; j<(ny); j++) { for(k=0; k<(nz); k++) { ez[i][j][k] += (dtepsdx*(hy[i][j-1][k] - hy[i-1][j-1][k]) - dtepsdy*(hx[i-1][j][k] - hx[i-1][j-1][k])); } } } } |

## 2. Automatic Parallelization of Sequential C Code based on 3D FDTD

From the point of view of parallelization, the coarser-grained parallelism is available in the propagation loop. However, in the propagation loop the variables hx, hy, hz, ex, ey and ez are mutually dependent. In fact, in order to calculate the variables *hx, hy *and *hz* the value of *ex, ey *and *ez *in the previous iteration is needed; and the value of *ex, ey *and *ez * is updated with values of * h *computed at the beginning of the same iteration. Thus, it cannot be parallelized through the parallel execution of its iterations. Instead, finer-grain parallelism has to be exploited within each iteration of the propagation loop.

In the scope of each iteration of the propagation loop, what fine-grain parallelism can we exploit? The best strategy is to parallelize the inner loop nests that compute hx, hy, hz, ex, ey and ez, respectively. Each of these loop nests is a fully parallel loop because an independent value hx[i][j][k] is computed for each grid point (i,j,k). Thus, the parallel execution is free of race conditions and the loop can be parallelized with an OpenMP directive *#pragma omp parallel for*.

Once the inner loop nests have been parallelized, it is time to optimize the parallel execution by minimizing the parallelization overhead introduced by the OpenMP directives. The first thing to do is to try to minimize the number of parallel regions, ideally reducing it to only one directive *#pragma omp parallel * for the whole code. As shown in the following pseudocode of the OpenMP-enabled parallel 3D FDTD code, it is only needed one parallel region that encloses the whole propagation loop. In order to preserve correctness, appropriate synchronization is added so that the OpenMP threads are synchronized at the beginning of each iteration of the propagation loop.

Overall, the parallelization strategy described above can be interpreted from a geometrical point of view, which will be more familiar to scientists and engineers with experience in the MPI standard for distributed-memory computers. The 3D problem is solved on a 3D grid. In distributed-memory computers and MPI, the 3D grid is usually partitioned along the three axis X, Y and Z simultaneously. Then MPI processes solve 3D subproblems on different 3D subgrids. From this geometrical point of view, the OpenMP parallelization strategy is quite different. Indeed, the 3D FDTD problem is parallelized only along the X axis. This is implemented through the *#pragma omp for* directive inserted before the loops with *nx* iterations that traverse the X axis of the 3D grid.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 |
#pragma omp parallel shared(ex,ey,ez,hx,hy,hz) private(i,iteration,j,k) { for (iteration=0; iteration<1000; iteration=iteration+1){ #pragma omp master { currentSimulatedTime=(dt)*(iteration); stimulus=sin((62831840000.)*(currentSimulatedTime)); } #pragma omp barrier #pragma omp for schedule(static) for (i=0; i<1; i=i+1){ for (j=0; j<(ny)+(1); j=j+1){ for (k=0; k<nz; k=k+1){ ez[i][j][k]=stimulus; } } } #pragma omp for schedule(static) for (i=0; i<(nx)-(1); i=i+1){ for (j=0; j<ny; j=j+1){ for (k=0; k<nz; k=k+1){ hx[i][j][k]=(hx[i][j][k])+(((dtmudz)*((ey[(i)+(1)][j][(k)+(1)])-(ey[(i)+(1)][j][k])))-((dtmudy)*((ez[(i)+(1)][(j)+(1)][k])-(ez[(i)+(1)][j][k])))); } } } #pragma omp for schedule(static) for (i=0; i<nx; i=i+1){ for (j=0; j<(ny)-(1); j=j+1){ for (k=0; k<nz; k=k+1){ hy[i][j][k]=(hy[i][j][k])+(((dtmudx)*((ez[(i)+(1)][(j)+(1)][k])-(ez[i][(j)+(1)][k])))-((dtmudz)*((ex[i][(j)+(1)][(k)+(1)])-(ex[i][(j)+(1)][k])))); } } } #pragma omp for schedule(static) for (i=0; i<nx; i=i+1){ for (j=0; j<ny; j=j+1){ for (k=0; k<(nz)-(1); k=k+1){ hz[i][j][k]=(hz[i][j][k])+(((dtmudy)*((ex[i][(j)+(1)][(k)+(1)])-(ex[i][j][(k)+(1)])))-((dtmudx)*((ey[(i)+(1)][j][(k)+(1)])-(ey[i][j][(k)+(1)])))); } } } #pragma omp for schedule(static) for (i=0; i<nx; i=i+1){ for (j=1; j<ny; j=j+1){ for (k=1; k<nz; k=k+1){ ex[i][j][k]=(ex[i][j][k])+(((dtepsdy)*((hz[i][j][(k)-(1)])-(hz[i][(j)-(1)][(k)-(1)])))-((dtepsdz)*((hy[i][(j)-(1)][k])-(hy[i][(j)-(1)][(k)-(1)])))); } } } #pragma omp for schedule(static) for (i=1; i<nx; i=i+1){ for (j=0; j<ny; j=j+1){ for (k=1; k<nz; k=k+1){ ey[i][j][k]=(ey[i][j][k])+(((dtepsdz)*((hx[(i)-(1)][j][k])-(hx[(i)-(1)][j][(k)-(1)])))-((dtepsdx)*((hz[i][j][(k)-(1)])-(hz[(i)-(1)][j][(k)-(1)])))); } } } #pragma omp for schedule(static) for (i=1; i<nx; i=i+1){ for (j=1; j<ny; j=j+1){ for (k=0; k<nz; k=k+1){ ez[i][j][k]=(ez[i][j][k])+(((dtepsdx)*((hy[i][(j)-(1)][k])-(hy[(i)-(1)][(j)-(1)][k])))-((dtepsdy)*((hx[(i)-(1)][j][k])-(hx[(i)-(1)][(j)-(1)][k])))); } } } } |

With regard to Parallware, the following figure shows the Parallware command and the output messages given to the user. The transcription of the terminal session is as follows:

Parallware reports the inner loop nests as parallel, so the propagation loop is implicitly reported as sequential. This is the best parallelization strategy, as described above! Thus, Parallware automatically produces the OpenMP-enabled parallel 3D FDTD code that maximizes performance. Next, we will evaluate such performance in terms of speedups on a multi-core computer. Let’s see the results obtained in our experiments!

## 3. Performance Results

The performance is measured in a computer with four cores Intel® Core™ i7-4700MQ CPU @ 2.40GHz with hyperthreading, and a RAM memory of 8 GB. The OpenMP-enabled parallel code generated automatically by Parallware is evaluated in terms of execution time and speedups. From a numerical point of view, we have used the following parameters: MAXIMUM_ITERATIONS=1000, nx=133, ny=21 and nz=9. We evaluate the performance of the original source code (called ToyFDTD) as well as the same code with higher arithmetic intensity (called FDTD). For 1, 2, 4 and 8 threads, the speedup graph shows the ideal speedup, the speedup of ToyFDTD and the speedup of FDTD.

We can observe that the speedups for ToyFDTD are not good (orange line) as they stall for 4 or more threads, and they even decrease for 8 threads. The speedups do not scale with the number of threads because the parallelization overhead is bigger if we raise the number of threads.

As the ToyFDTD is a simple 3D FDTD code, we have assumed that other FDTD codes that are more complex have the same algorithmic structure and, thus, the parallelization strategy is still valid. These more complex FDTD codes are assumed to have higher arithmetic intensity. We have analyzed the performance by increasing the arithmetic intensity of the ToyFDTD code by hand. For illustrative purposes, ToyFDTD executes 5 floating-point operations to calculate each element hx[i][j][k]. In the FDTD experiment, we have increased arithmetic intensity so that to calculate hx[i][j][k] the number of floating-point operations is aproximately 4*10^6. The resulting speedups of FDTD (red line) demonstrate that with higher arithmetic intensities the OpenMP parallel code scales better. This is because the impact of the parallelization overhead is attenuated by the higher computational load of the code.

## Conclusions on Automatic Parallelization of FDTD Codes

Parallware is a technology that automatically discovers the parallelism implicit in sequential codes and, just in one click, auto-generates OpenMP-enabled parallel code. Parallware always guarantees correctness and exploitation of the coarser-grained and fine-grained parallelism available in the sequential code, minimizing the parallelization overhead.

**Watch the video of a live demo that demonstrates how easy is the usage of Parallware!**

### About Dr. Dmitry Gorodetsky

For the past ten years Dr. Dmitry Gorodetsky has been studying numerical techniques for Computational Electromagnetics (CEM). His work has focused on increasing the speed and efficiency of the Finite Difference Time Domain (FDTD) method, one of the most versatile and influential in computational electromagnetics. In that regard, he has improved an existing FDTD-based algorithm and developed a novel implementation for parallel/distributed computers. More recently Dr. Gorodetsky has attempted to extend these techniques to large scale terrain-specific RF propagation modeling which has been made possible by an SBIR award from the United States National Science Foundation.

If you like this post, **please share!**

## Leave a Reply